Self-supervised Attention Mechanism for Dense Optical Flow Estimation

Category: IT Technology · Published: 5 years ago


The Farneback algorithm is an effective technique for estimating the motion of image content by comparing two consecutive frames from a video sequence. The algorithm first uses the polynomial expansion transform to approximate windows of each image frame with quadratic polynomials. The polynomial expansion transform is a signal transform designed exclusively in the spatial domain and can be used for signals of any dimensionality. By observing how the polynomials change under translation, the method estimates displacement fields from the polynomial expansion coefficients, and it then computes the dense optical flow after a series of iterative refinements. In the implementation code, the algorithm computes the direction and magnitude of optical flow from a two-channel array of flow vectors (dx/dt, dy/dt). The direction and magnitude are then visualized in the HSV color representation: in the usual OpenCV-style visualization, direction sets the hue, magnitude sets the value, and saturation is fixed at its maximum of 255 for optimal visibility.
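The visualization step can be sketched in a few lines. This is a minimal illustration and not the article's original code: it assumes the flow field is already available as a NumPy array of shape (H, W, 2), and the helper name `flow_to_hsv` is hypothetical. Hue follows the OpenCV convention of a 0 to 179 range.

```python
import numpy as np

def flow_to_hsv(flow):
    """Map a dense flow field of shape (H, W, 2) to a uint8 HSV image.

    Hypothetical helper: hue encodes direction, value encodes magnitude,
    and saturation is fixed at 255.
    """
    dx, dy = flow[..., 0], flow[..., 1]
    magnitude = np.hypot(dx, dy)
    angle = np.arctan2(dy, dx) % (2 * np.pi)  # radians in [0, 2*pi)

    hsv = np.zeros(flow.shape[:2] + (3,), dtype=np.uint8)
    # Half-degrees, wrapped so the hue always stays within 0..179.
    hsv[..., 0] = (np.round(angle * 90 / np.pi).astype(np.int64) % 180).astype(np.uint8)
    hsv[..., 1] = 255  # full saturation for visibility
    peak = magnitude.max()
    if peak > 0:
        # Normalize magnitude so the fastest pixel is the brightest.
        hsv[..., 2] = np.round(255 * magnitude / peak).astype(np.uint8)
    return hsv
```

If OpenCV is available, the result can be converted for display with `cv2.cvtColor(hsv, cv2.COLOR_HSV2BGR)`.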

Deep Learning for Dense Optical Flow Estimation

Historically, optical flow has been treated as an optimization problem. With the recent developments in deep learning, many researchers have applied neural networks to this problem, taking consecutive video frames as input and computing the optical flow of the objects in motion. Although these approaches process only two consecutive frames at a time, those two frames still capture the essence of a video. What mainly distinguishes videos from images is that videos possess a temporal structure in addition to the spatial structure of images. Videos also carry other modalities such as sound, but these are of no use in this case. A stream of consecutive frames can therefore be interpreted as a collection of images sampled at a specific temporal resolution (fps). This means that data in a video is encoded not only spatially but also sequentially, which makes classifying videos both interesting and challenging.


Generally, deep neural networks require a large amount of training data to learn and optimize the approximation functions. For optical flow estimation, however, training data is particularly hard to obtain. The main reason is the difficulty of accurately labeling video footage with the exact motion of every point of an image to subpixel accuracy. To address the labeling problem, computer graphics are used to simulate massive realistic worlds from instructions. Because the instructions are known, the motion of every pixel in the video frame sequence is already known. Recent research that attempts to solve the optical flow problem includes PWC-Net, ADLAB-PRFlow, and FlowNet. Optical flow is widely used in applications such as vehicle tracking and traffic analysis, where feature-based optical flow techniques support object detection and multi-object tracking from either a stationary camera or cameras attached to vehicles.

Self-Supervised Deep Learning for Tracking

As mentioned earlier, visual tracking is integral to many video analysis tasks such as recognition, interaction, and geometry. At the same time, using deep learning for these tasks becomes infeasible because of the huge amount of labeled video data it requires. High performance calls for large-scale tracking datasets, which in turn demand extensive annotation effort and make the deep learning approach even more impractical and expensive. With this in mind, recent researchers have put their faith in a promising approach: letting machines learn without human supervision (labeled data) by leveraging large amounts of unlabeled, raw video data. This quest for self-supervised learning started with a research proposal from the Google research team, which suggested building a visual tracking system by training a model on the proxy task of video colorization, a task that needs no additional labeled data (self-supervision). Instead of making the model predict the color of the input grayscale frame directly, the research suggested that it should learn to copy the colors from a set of reference frames. This gives rise to a pointing mechanism that is able to track the spatial features of a video sequence in a temporal setup. Visualizations and experiments with these self-supervised methods suggest that, although the network is trained without any human supervision, a mechanism for visual feature tracking automatically emerges inside the network. After plenty of training on unlabeled video collected from the internet, the self-supervised model was able to track any segmented region specified in the initial frame of the video frame sequence. However, these self-supervised methods are trained on the assumption that color in the frame sequence is temporally stable. Clearly, there are exceptions, such as colorful lights turning on and off in the video.

The pointer mechanism trained on a proxy task of video colorization

The objective of self-supervised learning in tracking is to learn a feature embedding that is suitable for matching correspondences along the frame sequence of a video. The correspondence flow is learned by exploiting the natural spatial-temporal coherence in the frame sequence. Correspondence flow can be understood as the flow of feature similarity between consecutive frames. Put simply, this approach learns a pointer mechanism that can reconstruct a target image by copying pixel information from a set of reference frames. There are certain precautions a researcher must keep in mind while designing the architecture of such a model. First, we must prevent the model from learning a trivial solution to this task (e.g. matching consecutive frames based on low-level color features). Second, we must make tracker drifting less severe. Tracker drifting (TD) is mainly caused by occlusion of objects, complex object deformation, and random illumination changes. TD is usually handled by training recursive models over long temporal windows with cycle consistency and scheduled sampling.

The correspondence flow matching correspondences between frames over the video

Finally, before we look under the hood of this pointer mechanism, let’s revisit the points above that one must consider while designing such models. First, it’s important to remember that correspondence matching is the fundamental building block of these models. Because frames are reconstructed by pixel-wise matching, there is a high probability that the model will learn a trivial solution. To prevent this, it is important to add color jittering and channel-wise dropout, so that the model is forced not to rely on low-level color information and becomes robust to any kind of color jittering. Lastly, to handle TD, as suggested earlier, recursive training over long temporal windows with forward-backward consistency and scheduled sampling is the best way to alleviate the tracker drifting problem. With these methods applied, the model becomes more robust, the approach can exploit the spatial-temporal coherence of the video, and color can act as a reliable supervision signal for learning correspondences.
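The two augmentations named above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the procedure of any particular paper: frames are assumed to be float arrays of shape (H, W, C), and the function names, the per-channel drop probability, and the jitter strength are all hypothetical choices.

```python
import numpy as np

def channel_dropout(frame, drop_prob, rng):
    """Zero each color channel independently with probability drop_prob.

    Assumption: dropping is decided per channel for the whole frame.
    """
    keep = rng.random(frame.shape[-1]) >= drop_prob  # one flag per channel
    return frame * keep  # broadcasts over height and width

def color_jitter(frame, strength, rng):
    """Apply a random per-channel scale and shift of at most `strength`."""
    channels = frame.shape[-1]
    scale = 1.0 + rng.uniform(-strength, strength, size=channels)
    shift = rng.uniform(-strength, strength, size=channels)
    return frame * scale + shift
```

A typical call would create a generator once with `rng = np.random.default_rng(0)` and apply both functions to the input frames only, so the reconstruction target keeps its original colors.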

Self-supervised Attention under the Hood

If you look deeper into what the learned pointer mechanism actually is, you will conclude that it is a type of attention mechanism. Yes, it’s ultimately the famous trio of QKV (Query-Key-Value, the basis of most attention mechanisms).


As we know, the goal of the self-supervised model is to learn robust correspondence matching by effectively encoding feature representations. Put simply, the ability to copy effectively is acquired by training on a proxy task, where the model learns to reconstruct a target frame by linearly combining pixel data from the reference frames, with the weights measuring the strength of correspondence between pixels. Breaking this process down, we find a triplet (Q, K, V) for every input frame we process, where Q, K, and V refer to Query, Key, and Value. To reconstruct a pixel of the target frame I¹, an attention mechanism copies pixels from a subset of previous frames in the original sequence. Here the query vector (Q) is the feature embedding of the present frame I¹ (the target frame), and the key vector (K) is the feature embedding of the previous frame I⁰ (the reference frame). If we compute the dot product between the query and the key (Q·K) and take a softmax of the result, we obtain a similarity between the present frame (I¹) and the previous reference frame (I⁰). During training, the value (V) is the pixel data of the reference frame; during inference, multiplying the computed similarity matrix with a reference instance segmentation mask (V) gives us a pointer for our target frame, thus achieving dense optical flow estimation. This pointer, which is just a combination of Q, K, and V, is the actual attention mechanism working under the hood of this self-supervised system.
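The Q, K, V computation described above can be written out in a few lines of NumPy. This is a minimal sketch, assuming the feature embeddings have already been produced by some encoder and flattened to one row per pixel; the function names are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def attention_copy(query, key, value):
    """Reconstruct target pixels by copying from reference pixels.

    query: (N_target, C) embeddings of the target frame (Q)
    key:   (N_ref, C)    embeddings of the reference frame (K)
    value: (N_ref, D)    reference colors or mask labels (V)
    """
    # Each row holds one target pixel's similarity to every reference pixel.
    affinity = softmax(query @ key.T, axis=-1)  # (N_target, N_ref)
    # Every output pixel is a weighted combination of reference values.
    return affinity @ value, affinity
```

When V holds one-hot segmentation labels, each output row is a soft label for the corresponding target pixel, which is how a mask given in the first frame can be propagated forward.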

Everyone needs attention

A key element in training the attention mechanism is to establish a proper information bottleneck. To block any learning shortcuts that the attention mechanism may resort to, the previously mentioned techniques of intentionally dropping input color information and applying channel dropout are used. The choice of color space still plays an important role in training these attention mechanisms through self-supervision. Many research works have validated the conjecture that a decorrelated color space leads to better feature representations for self-supervised dense optical flow estimation. Put simply, Lab images work better than RGB images. Every RGB channel includes a representation of brightness, so each one is highly correlated with the luminance channel of Lab, and RGB therefore acts as only a weak information bottleneck.

Restricted Attention for minimizing physical memory costs

The attention mechanism described above usually comes with a high physical memory cost. Processing high-resolution information for correspondence matching can therefore lead to large memory requirements and slower speed.


To circumvent the memory cost, ROI localization is used to estimate the candidate windows non-locally from memory banks. Intuitively, spatial-temporal coherence naturally exists between temporally close frames in the frame sequence. ROI localization leads to restricted attention, because a pixel in the target frame is now compared only to spatially neighboring pixels of the reference frame. The number of comparable pixels is determined by the size of the dilated window to which the attention is restricted. The dilation rate of the window is proportional to the temporal distance between the present frame and the past frames in the memory bank. After computing the affinity matrix of the restricted attention region, fine-grained matching scores can be computed in a non-local manner. With this memory-augmented restricted attention mechanism, the model can efficiently process high-resolution information without incurring large physical memory costs.
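The idea of restricting attention to a dilated window can be illustrated with a deliberately simple, loop-based sketch. It is not the implementation from any specific paper: the function name, the `radius` and `dilation` parameters, and the use of a single reference frame are assumptions made for illustration.

```python
import numpy as np

def restricted_attention(query, key, value, radius=1, dilation=1):
    """Attend only to reference pixels inside a dilated square window.

    query, key: (H, W, C) embeddings of the target and reference frames
    value:      (H, W, D) reference colors or mask labels
    """
    height, width, _ = query.shape
    out = np.zeros((height, width, value.shape[-1]), dtype=np.float64)
    offsets = [step * dilation for step in range(-radius, radius + 1)]
    for i in range(height):
        for j in range(width):
            # In-bounds reference positions of the window centered at (i, j).
            coords = np.array([(i + di, j + dj)
                               for di in offsets for dj in offsets
                               if 0 <= i + di < height and 0 <= j + dj < width])
            keys = key[coords[:, 0], coords[:, 1]]    # (K, C)
            vals = value[coords[:, 0], coords[:, 1]]  # (K, D)
            logits = keys @ query[i, j]               # (K,)
            weights = np.exp(logits - logits.max())
            weights /= weights.sum()                  # softmax over the window
            out[i, j] = weights @ vals
    return out
```

Each target pixel is compared with at most (2 * radius + 1)² reference pixels instead of all H * W of them, which is where the memory saving comes from. A larger `dilation` widens the search area for temporally distant frames without increasing the number of comparisons.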

Conclusion

In this blog, we started with an introduction to the concept of optical flow and studied its application in object tracking. We also studied how this concept inspired deep learning tracking systems and how self-supervision and visual attention play a key role in building them. The computed optical flow vectors open up a myriad of applications that require such an in-depth scene understanding of videos. The techniques discussed are mainly applied to pedestrian tracking, autonomous vehicle navigation, and many more novel applications. The variety of applications of optical flow is limited only by the ingenuity of its designers.

In my personal opinion, self-supervision will soon serve as a strong competitor to its supervised counterpart because of its generalizability and flexibility. Self-supervision easily outperforms most supervised methods on unseen object categories, which reflects its importance and power in the years ahead as we take steps towards solving human intelligence.
