Motion Boundary and Occlusion Reasoning for Video Analysis
Date: 2022
Authors: Kim, Hannah
Abstract
With the increasing prevalence of video cameras, video motion analysis has become an important research area in computer vision. Motion in video is often represented as dense optical flow fields, which specify the motion of each pixel from one frame to the next. While current flow predictors achieve nearly sub-pixel accuracy on standard benchmarks, they still struggle in three particular areas. The first is near motion boundaries, the curves across which the optical flow field is discontinuous. The second is in occlusion regions, sets of pixels in one frame with no corresponding pixel in the other; optical flow is undefined for these pixels. The third is in regions with large motion, which incur high computational and memory costs. This dissertation addresses these three challenges through motion boundary detection, occlusion detection, video interpolation, and occlusion-based adversarial attack detection for optical flow.
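To make the representation concrete, the sketch below forward-warps a frame by a dense flow field; the pixel collisions and holes it produces correspond to the occlusion and disocclusion regions described above. This is an illustrative example only, not code from the dissertation, and all names, shapes, and conventions are assumptions.

import torch

def warp_forward(frame0, flow):
    """Illustrative sketch: push each pixel of frame0 along its flow vector.

    frame0: (H, W) intensity image
    flow:   (H, W, 2) per-pixel displacement, flow[y, x] = (u, v)
    Returns a warped image. Target locations that receive no source pixel
    (disocclusions) stay 0; locations hit by several source pixels keep the
    last write, which is exactly where occlusion reasoning is needed.
    """
    H, W = frame0.shape
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    xt = (xs + flow[..., 0]).round().long().clamp(0, W - 1)
    yt = (ys + flow[..., 1]).round().long().clamp(0, H - 1)
    warped = torch.zeros_like(frame0)
    warped[yt, xt] = frame0[ys, xs]
    return warped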
First, we propose a convolutional neural network named MONet to jointly detect motion boundaries and occlusion regions in video, both forward and backward in time. Since both motion boundaries and occlusion regions disrupt correspondences across frames, we first use a cost map of the Euclidean distances from each feature in one frame to its closest feature in the next. To reason in both time directions simultaneously, we directly warp the estimated occlusion region and motion boundary maps between the two frames, preserving features in occlusion regions. Because motion boundaries align with the boundaries of occlusion regions, we use an attention mechanism and a gradient module to encourage each task to focus on the informative 2D spatial regions predicted by the other task. MONet achieves state-of-the-art results for both tasks on various benchmarks.
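As a rough illustration of the correspondence cue described above, the following sketch computes a cost map of Euclidean distances from each feature in one frame to its nearest feature in the other. It is not the MONet implementation; the global nearest-neighbor search, shapes, and names are assumptions, and a practical version would restrict the search to a local window to bound memory.

import torch

def nearest_feature_cost(feat1, feat2):
    """Illustrative sketch: distance from each feature in frame 1 to its
    closest feature in frame 2.

    feat1, feat2: (C, H, W) dense feature maps from the two frames.
    Returns an (H, W) cost map; large values mark pixels whose feature has
    no good match in the other frame (e.g. occlusions, motion boundaries).
    """
    C, H, W = feat1.shape
    f1 = feat1.reshape(C, -1).t()        # (H*W, C)
    f2 = feat2.reshape(C, -1).t()        # (H*W, C)
    d = torch.cdist(f1, f2)              # (H*W, H*W) pairwise Euclidean distances
    cost, _ = d.min(dim=1)               # distance to the closest feature
    return cost.reshape(H, W)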
Next, we consider the video interpolation task, which aims to synthesize an intermediate frame given the two consecutive frames around it. We first present a novel visual transformer module, named Cross Similarity (CS), to globally aggregate input image features whose appearance is similar to that of the interpolated frame. These aggregated features are then used to refine the interpolated prediction. To account for occlusions in the aggregated CS features, we propose an Image Attention (IA) module that allows the network to favor CS features from one frame over those of the other. Additionally, we augment our training dataset with an occluder patch that moves across frames to improve the network's robustness to occlusions and large motion. We supervise our IA module so that the network learns to down-weight features occluded by these patches. Because existing methods tend to produce overly smooth predictions, especially near motion boundaries, we add a training loss based on image gradients to encourage sharper predictions.
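The sketch below illustrates, under assumed shapes and names, the two ideas described above: a global, similarity-weighted aggregation of one input frame's features (in the spirit of the CS module) and a per-pixel blend of the two aggregates (in the spirit of the IA module). It is not the dissertation's implementation.

import torch
import torch.nn.functional as F

def cross_similarity_aggregate(query_feat, src_feat):
    """Illustrative sketch of similarity-based aggregation.

    query_feat: (C, H, W) features of the current interpolated estimate.
    src_feat:   (C, H, W) features of one input frame.
    Returns (C, H, W): for each query pixel, a similarity-weighted sum over
    all source-frame features (global attention over the source frame).
    """
    C, H, W = query_feat.shape
    q = query_feat.reshape(C, -1).t()               # (HW, C)
    k = src_feat.reshape(C, -1).t()                 # (HW, C)
    attn = F.softmax(q @ k.t() / C ** 0.5, dim=-1)  # (HW, HW) similarity weights
    out = attn @ k                                  # (HW, C) aggregated features
    return out.t().reshape(C, H, W)

def blend_with_image_attention(agg0, agg1, ia_logits):
    """Blend the two aggregates with a per-pixel weight, e.g. to down-weight
    features that are occluded in one of the input frames.

    agg0, agg1: (C, H, W) aggregated features from frames 0 and 1.
    ia_logits:  (1, H, W) logits favoring frame 0 over frame 1.
    """
    w = torch.sigmoid(ia_logits)
    return w * agg0 + (1 - w) * agg1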
Finally, we study patch-based adversarial attacks on flow networks, which introduce occlusions and motion boundaries into the inputs, and present the first method to detect and localize these attacks without any fine-tuning or prior knowledge of the attacks. In particular, we detect these occluding patch attacks by iteratively optimizing over the activations of the inner layers of any pre-trained optical flow network to identify a subset of anomalous activations.
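As a loose illustration of the idea, the sketch below iteratively re-estimates robust statistics of one inner layer's activations and flags spatial locations whose activation energy is anomalous. The layer choice, statistics, and thresholds are all assumptions for illustration, not the dissertation's actual procedure.

import torch

def anomalous_activation_mask(act, k=3.0, iters=3):
    """Illustrative sketch: flag spatial locations with outlier activations.

    act: (C, H, W) activations from one inner layer of a flow network.
    Iteratively re-estimates a robust center and scale on the currently
    'clean' locations, then marks locations whose activation energy deviates
    by more than k robust standard deviations. Returns a boolean (H, W) mask
    that can serve as a crude localization of an occluding adversarial patch.
    """
    energy = act.norm(dim=0)                      # (H, W) per-location magnitude
    mask = torch.zeros_like(energy, dtype=torch.bool)
    for _ in range(iters):
        clean = energy[~mask]
        if clean.numel() == 0:                    # avoid degenerate statistics
            break
        med = clean.median()
        mad = (clean - med).abs().median() + 1e-6
        mask = (energy - med).abs() > k * 1.4826 * mad
    return mask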
Citation
Kim, Hannah (2022). Motion Boundary and Occlusion Reasoning for Video Analysis. Dissertation, Duke University. Retrieved from https://hdl.handle.net/10161/26882.