Deep Learning Architectures for Action Recognition

Category: IT Technology · Published: 4 years ago


The first two approaches, a) and b), which use an LSTM and a 3D ConvNet respectively, share the strengths of being end-to-end trainable and real-time capable. Because they do not rely on precomputed optical flow, they must instead learn features that encode motion information, which allows the network to learn spatiotemporal features directly during end-to-end training. Approaches c) through e) are neither real-time capable nor end-to-end trainable because they require optical flow calculations over the raw data. Approaches b), d), and e) use 3D convolutions, which have many more parameters than traditional 2D ConvNets. A single 3D convolutional network trained on the UCF101 dataset can have 33M+ parameters, compared to just 5M+ parameters in the 2D case [4]. This significantly affects the training cost: 3D ConvNet models trained on Sports-1M take approximately 2 months to train. The cost makes it difficult to search for the right architecture for video data, and the large number of parameters also creates a risk of overfitting.
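To make the parameter growth concrete, here is a minimal sketch that counts the learnable parameters of a single dense convolution layer. The layer sizes (64 input and 64 output channels) and the helper name `conv_params` are hypothetical and chosen only for illustration; they are not taken from any of the cited models.

```python
def conv_params(kernel_size, in_channels, out_channels, bias=True):
    """Count learnable parameters in one dense convolution layer.

    kernel_size is a tuple: (kh, kw) for 2D, (kt, kh, kw) for 3D.
    """
    weights = in_channels * out_channels
    for k in kernel_size:
        weights *= k
    return weights + (out_channels if bias else 0)


# Hypothetical layer: 64 input channels, 64 output channels.
params_2d = conv_params((3, 3), 64, 64)
params_3d = conv_params((3, 3, 3), 64, 64)

print(params_2d)  # 36928
print(params_3d)  # 110656
```

For a kernel of width 3, adding the temporal dimension triples the weights of every layer, before accounting for any other architectural differences between the 2D and 3D networks.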

The LSTM architecture for videos was popularised in the 2014 paper Long-term Recurrent Convolutional Networks for Visual Recognition and Description by Donahue et al. [5]. The architecture is known as LRCN. It is a direct extension of the encoder-decoder architecture to video representations. The strength of the LRCN network is that it can handle sequences of various lengths. It can also be adapted to other tasks like image captioning and video description. Its weakness was that it did not beat the state of the art at the time; however, it did improve on single-frame architectures, as noted in Table 3.

Table 3 — Activity recognition: Comparing single frame models to LRCN networks for activity recognition on the UCF101 [25] dataset, with RGB and flow inputs. [5]

Temporal modelling of spatial features is difficult for a hidden recurrent layer to learn. Empirically, increasing the number of hidden units beyond 256 did not improve the accuracy of the RGB models. However, with flow input, increasing the hidden units from 256 to 1024 yielded an accuracy boost of 1.7%. This suggests that the LRCN has a hard time learning optical flow, or a similar representation of motion, natively.
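For a sense of what the extra hidden units cost, the sketch below counts the parameters of one standard LSTM layer (four gates, each with an input matrix, a recurrent matrix, and a single bias vector). The input feature size of 4096 is an assumption made for illustration, and some libraries keep two bias vectors per gate, which would change the totals slightly.

```python
def lstm_params(input_size, hidden_size):
    """Parameters of one standard LSTM layer with a single bias per gate.

    Each of the 4 gates has an input matrix (hidden x input),
    a recurrent matrix (hidden x hidden), and a bias (hidden).
    """
    per_gate = hidden_size * input_size + hidden_size * hidden_size + hidden_size
    return 4 * per_gate


# Assumed input: a 4096-dimensional per-frame feature vector (hypothetical).
small = lstm_params(4096, 256)
large = lstm_params(4096, 1024)

print(small)  # 4457472
print(large)  # 20975616
```

Under these assumptions, going from 256 to 1024 hidden units makes the recurrent layer more than four times larger, which is a sizeable cost for a 1.7% gain on flow input.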

Table 4 — Action recognition results on UCF101. C3D compared with baselines and state-of-the-art methods in 2015. [6]

2015

3D ConvNets were established as the new state of the art in the 2015 research paper Learning Spatiotemporal Features with 3D Convolutional Networks by Du Tran et al. [6]. In this paper, the authors establish that a 3D convolutional network (C3D) with 3x3x3 kernels is the most effective at learning spatiotemporal features. Interestingly, deconvolution visualisations reveal that the network attends to spatial appearance in the first few frames of a clip and to salient motion in the later frames. This architecture is powerful in that many videos can be processed in real time, as C3D runs at up to 313 fps. The video descriptors generated by this network are also compact and discriminative: projecting the features generated by the convolutions down to 10 dimensions via PCA still achieves 52.8% accuracy on the UCF101 dataset.
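To illustrate why a 3x3x3 kernel can respond to motion as well as appearance, here is a toy, pure-Python 3D convolution over a single-channel clip. The kernel below is hand-made (it subtracts the first frame from the last), whereas C3D learns its filters from data; the clip contents and the helper names are hypothetical.

```python
def conv3d_valid(clip, kernel):
    """Naive single-channel 3D convolution (cross-correlation), no padding, stride 1.

    clip and kernel are nested lists indexed as [t][y][x].
    """
    T, H, W = len(clip), len(clip[0]), len(clip[0][0])
    kt, kh, kw = len(kernel), len(kernel[0]), len(kernel[0][0])
    out = []
    for t in range(T - kt + 1):
        plane = []
        for y in range(H - kh + 1):
            row = []
            for x in range(W - kw + 1):
                acc = 0
                for dt in range(kt):
                    for dy in range(kh):
                        for dx in range(kw):
                            acc += clip[t + dt][y + dy][x + dx] * kernel[dt][dy][dx]
                row.append(acc)
            plane.append(row)
        out.append(plane)
    return out


def make_frame(size, bright_col):
    """A size x size frame that is 0 everywhere except one bright column."""
    return [[1 if x == bright_col else 0 for x in range(size)] for _ in range(size)]


# Hand-made 3x3x3 temporal-difference kernel: last frame minus first frame.
kernel = [
    [[-1] * 3 for _ in range(3)],
    [[0] * 3 for _ in range(3)],
    [[1] * 3 for _ in range(3)],
]

static_clip = [make_frame(5, 2) for _ in range(3)]   # the bar does not move
moving_clip = [make_frame(5, c) for c in (0, 2, 4)]  # the bar moves right

static_out = conv3d_valid(static_clip, kernel)
moving_out = conv3d_valid(moving_clip, kernel)

print(static_out[0][0])  # [0, 0, 0]
print(moving_out[0][0])  # [-3, 0, 3]
```

The static clip produces no response, while the moving bar produces a signed response whose pattern encodes the direction of motion. A 2D kernel applied frame by frame cannot make this distinction.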

Figure 5 — Fusion architecture at two layers (after conv5 and after fc8) where both network towers are kept, one as a hybrid spatiotemporal net and one as a purely spatial network. [7]

2016

In 2016, the focus shifted back to two-stream networks. In Convolutional Two-Stream Network Fusion for Video Action Recognition by Feichtenhofer et al. [7], the authors tackled how to effectively fuse spatial and temporal data across streams and how to create a multi-level loss that could handle long-term temporal dependencies. The motivating idea was that, in order to discriminate between similar motions in different parts of the image, like brushing hair and brushing teeth, the network needs to combine spatial features and motion features at the same pixel location. Theoretically, methods that fuse the streams before the densely connected layers could achieve this. In the proposed architecture, the authors fuse the two streams at two locations, as shown in Figure 5. This network was able to better capture motion and spatial features in distinct subnetworks and beat the state-of-the-art IDT and C3D approaches. The multi-level loss is formed by a spatiotemporal loss at the last fusion layer and a separate temporal loss computed from the output of the temporal net. This allowed the researchers to create spatiotemporal features and model long-term temporal dependencies. This method still suffers from the weaknesses of the original two-stream network, but performs better due to an enhanced architecture that better encodes our real-world biases.
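As a rough illustration of what fusing at a pixel location means, the sketch below combines two small feature maps in two simple ways: channel concatenation and element-wise sum. The feature values and function names are hypothetical, and the fusion layer actually used in [7] is more involved than either of these.

```python
def concat_fusion(spatial, temporal):
    """Stack the channels of two feature maps that share the same height and width.

    Each map is a nested list indexed as [channel][y][x].
    """
    if len(spatial[0]) != len(temporal[0]) or len(spatial[0][0]) != len(temporal[0][0]):
        raise ValueError("feature maps must have the same spatial size")
    return spatial + temporal


def sum_fusion(spatial, temporal):
    """Add two feature maps element-wise; they must have identical shapes."""
    return [
        [[a + b for a, b in zip(row_s, row_t)] for row_s, row_t in zip(ch_s, ch_t)]
        for ch_s, ch_t in zip(spatial, temporal)
    ]


def features_at(fmap, y, x):
    """Return the feature vector (one value per channel) at pixel (y, x)."""
    return [channel[y][x] for channel in fmap]


# Hypothetical 2-channel, 2x2 feature maps from each stream.
spatial = [[[1, 0], [0, 0]], [[0, 2], [0, 0]]]
temporal = [[[5, 0], [0, 0]], [[0, 0], [0, 7]]]

fused = concat_fusion(spatial, temporal)
summed = sum_fusion(spatial, temporal)

print(features_at(fused, 0, 0))   # [1, 0, 5, 0]
print(features_at(summed, 0, 0))  # [6, 0]
```

After fusion, every layer that follows sees appearance and motion evidence for the same pixel side by side, which is not possible when the streams are only merged at the classifier scores.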

2017

In 2017, Zhu et al. took two-stream networks a step forward by introducing MotionNet, a hidden stream that learns optical flow [8]. This end-to-end approach allowed the researchers to skip explicitly computing optical flow. This means that two-stream approaches could now run in real time, and errors from misprediction could be propagated back into MotionNet to produce optical flow features better suited to the task.

Figure 6 — MotionNet takes consecutive video frames as input and estimates motion. Then the temporal stream CNN learns to project the motion information to action labels. [8]

The researchers find that hidden two-stream CNNs perform at a similar accuracy to non-hidden approaches but can process up to 10x more frames per second, as seen in Table 6. This enables real-time capabilities for the two-stream method.

Table 6 — Two-stream approaches and their accuracy on UCF101. [8]

The MotionNet subnetwork is extensible and can be applied to other deep learning methods where calculating optical flow is necessary. This is important because it allows us to make other approaches real-time.

In 2017, Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset by Carreira and Zisserman took C3D another step forward by merging it with learnings from two-stream networks [4]. The researchers propose a novel two-stream inflated 3D ConvNet (I3D). Filters and pooling kernels from 2D ConvNets are expanded into 3D, endowing them with an extra temporal dimension. This enables the researchers to take successful architectures for 2D classification and apply them to 3D. The researchers also bootstrap these 3D filters with parameters from 2D ConvNet models trained on massive image datasets like ImageNet.
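The bootstrapping step can be sketched in a few lines. In [4], a 2D filter is repeated along the time dimension and rescaled by the number of repetitions, so that a clip made of one image repeated over time produces the same response as the original 2D filter on that image. The filter values, the image patch, and the helper names below are hypothetical.

```python
def inflate_2d_filter(filter_2d, time_steps):
    """Inflate a 2D filter [y][x] to a 3D filter [t][y][x].

    The 2D weights are repeated along time and divided by time_steps, so the
    response to a clip of identical frames matches the 2D response.
    """
    return [
        [[w / time_steps for w in row] for row in filter_2d]
        for _ in range(time_steps)
    ]


def response_2d(image, filter_2d):
    """Dot product of one image patch with a 2D filter of the same size."""
    return sum(
        p * w
        for img_row, f_row in zip(image, filter_2d)
        for p, w in zip(img_row, f_row)
    )


def response_3d(clip, filter_3d):
    """Dot product of a clip [t][y][x] with a 3D filter of the same size."""
    return sum(response_2d(frame, plane) for frame, plane in zip(clip, filter_3d))


# Hypothetical pretrained 2D filter (an edge detector) and image patch.
filter_2d = [[1, 0, -1], [2, 0, -2], [1, 0, -1]]
patch = [[3, 1, 0], [4, 1, 0], [5, 1, 0]]

time_steps = 4
filter_3d = inflate_2d_filter(filter_2d, time_steps)
boring_clip = [patch for _ in range(time_steps)]  # the same frame repeated

print(response_2d(patch, filter_2d))        # 16
print(response_3d(boring_clip, filter_3d))  # 16.0
```

Because the inflated filter starts out behaving like the 2D one on static content, training on video only has to learn the temporal structure on top of the image features that were already learned.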

Table 7 — Performance on the UCF-101 and HMDB-51 test sets (split 1 of both) for architectures starting with / without ImageNet pretrained weights. Original: train on UCF-101 or HMDB-51; Fixed: features from Kinetics, with the last layer trained on UCF-101 or HMDB-51; Full-FT: Kinetics pre-training with end-to-end fine-tuning on UCF-101 or HMDB-51.[4]

Using 3D ConvNets on sequential RGB frames and sequential optical flow frames in a two stream architecture enabled the researchers to beat the state of the art on UCF101. The researchers established the clear importance of transfer learning with the use of the Kinetics dataset. Unfortunately, the model architecture they used is not end-to-end trainable and does not have real-time capabilities.

2018

From 2017 to 2018, many advances in deep residual learning led to novel architectures like 3D ResNet and pseudo-3D residual networks (P3D) [9]. Unfortunately, I will not cover these papers in this literature review, but I do respectfully acknowledge their impact on the state of the art.

2019

Most recently, in June 2019, Du Tran et al. proposed channel-separated convolutional networks (CSN) for the task of action recognition in Video Classification with Channel-Separated Convolutional Networks [10]. The researchers build on the ideas of group convolution and depth-wise convolution that achieved great success in the Xception and MobileNet models.

Figure 7 — (a) A conventional convolution, which has only one group. (b) A group convolution with 2 groups. (c) A depth-wise convolution where the number of groups matches the number of input/output filters.

Fundamentally, group convolutions introduce regularisation and require fewer computations because they are not fully connected across channels. Depth-wise convolutions are the extreme case of group convolutions, where the number of groups equals the number of input and output channels, as seen in Figure 7. Conventional convolutional networks model channel interactions and local interactions (spatial or spatiotemporal) jointly in their 3D convolutions.
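The connectivity pattern is easy to see in code. This sketch lists which input channels feed a given output channel under the usual grouping convention, where channels are split into contiguous blocks. The channel counts and the function name are hypothetical.

```python
def input_channels_seen(out_channel, in_channels, out_channels, groups):
    """Return the input channel indices that feed one output channel.

    Assumes the common convention of splitting channels into contiguous groups.
    """
    if in_channels % groups or out_channels % groups:
        raise ValueError("channel counts must be divisible by groups")
    in_per_group = in_channels // groups
    out_per_group = out_channels // groups
    group = out_channel // out_per_group
    return list(range(group * in_per_group, (group + 1) * in_per_group))


# Hypothetical layer with 4 input and 4 output channels; inspect output channel 3.
print(input_channels_seen(3, 4, 4, groups=1))  # [0, 1, 2, 3]  conventional
print(input_channels_seen(3, 4, 4, groups=2))  # [2, 3]        group convolution
print(input_channels_seen(3, 4, 4, groups=4))  # [3]           depth-wise
```

With one group every output channel sees every input channel. With as many groups as channels, each output channel sees exactly one input channel, so no channel interaction happens in that layer at all.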

Figure 8 — (a) A standard ResNet bottleneck block. (b) An interaction preserved bottleneck block.

The researchers propose decomposing 3x3x3 convolution kernels into two distinct layers, where the first layer is a 1x1x1 convolution for local channel interaction and the second layer is a 3x3x3 depth-wise convolution for local spatiotemporal interactions. By using these blocks, the researchers significantly decrease the number of parameters in the network and introduce a strong form of regularisation. The channel-separated blocks allow the network to locally learn spatial and spatiotemporal features in distinct layers.
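A quick weight count shows how large the saving is. This is a minimal sketch for a single block with a hypothetical width of 256 channels; biases and normalisation layers are ignored, and the helper name is made up for illustration.

```python
def conv3d_weights(kernel, in_channels, out_channels, groups=1):
    """Weight count of a 3D convolution (no bias) with the given number of groups."""
    kt, kh, kw = kernel
    return kt * kh * kw * (in_channels // groups) * out_channels


channels = 256  # hypothetical block width

# Conventional 3x3x3 convolution: channel and spatiotemporal mixing done jointly.
full = conv3d_weights((3, 3, 3), channels, channels)

# Channel-separated version: 1x1x1 for channels, then depth-wise 3x3x3.
pointwise = conv3d_weights((1, 1, 1), channels, channels)
depthwise = conv3d_weights((3, 3, 3), channels, channels, groups=channels)
separated = pointwise + depthwise

print(full)       # 1769472
print(separated)  # 72448
```

At this width the separated pair uses roughly 24 times fewer weights than the joint 3x3x3 convolution, which is where both the speed-up and the regularising effect come from.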

Table 8 — Comparisons with state-of-the-art architectures on Sports-1M

As shown in Table 8, the CSN improves on state-of-the-art RGB methods like R(2+1)D, C3D, and P3D on the Sports-1M dataset. The network is also 2–4x faster during inference. The model is trained from scratch, whereas the rest of the models in the table are pretrained on the ImageNet or Kinetics datasets. This novel architecture improves on previous factorized networks while reducing overfitting, being exceptionally fast, and producing state-of-the-art accuracy on benchmark datasets.

Conclusion

State of the art

The current state of the art for action recognition (August 2019) is the channel-separated network. This network effectively captures spatial and spatiotemporal features in their own distinct layers. The channel-separated convolution blocks learn these features distinctly but combine them locally at all stages of convolution. This alleviates the need to perform slow fusion of temporal and spatial two-stream networks. The network also does not need to decide between learning spatial or temporal features, unlike C3D, where the network can learn features that are mixed between the two dimensions. This network effectively captures the bias that 2D spatial slices should form a natural image, whereas a 2D slice in the temporal direction has different properties and does not fall on the natural image manifold. The researchers enforce this bias by creating two distinct layers to process each direction. Channel separation is an important step forward in action recognition and has beaten state-of-the-art results even when trained from scratch. It is also capable of real-time inference. For these reasons, I believe CSNs are the current state of the art.

Summary

We have learned that deep learning has revolutionized the way we process videos for action recognition. Deep learning literature has come a long way from using improved Dense Trajectories. Many learnings from the sister problem of image classification have been used in advancing deep networks for action recognition. Specifically, convolution layers, pooling layers, batch normalization, and residual connections have been borrowed from the 2D space and applied in 3D with substantial success. Many models that use a spatial stream are pretrained on extensive image datasets. Optical flow has also had an important role in representing temporal features in early deep video architectures like the two-stream networks and fusion networks. Optical flow is our mathematical definition of how we believe movement between subsequent frames can be described: as densely calculated flow vectors for all pixels. Originally, networks bolstered performance by using optical flow. However, this prevented networks from being trained end to end and limited their real-time capabilities. In modern deep learning, we have moved beyond optical flow, and we instead architect networks that are able to natively learn temporal embeddings and are end-to-end trainable.

We have also learned that action recognition is a truly unique problem with its own set of complications. The first source of friction is the high computation and memory cost associated with 3D convolutions. Some models take over 2 months to train on Sports-1M on modern GPUs. The second source of friction is that there is no standard benchmark for video architecture search [11]. Sports-1M and UCF101 are highly correlated, and false label assignment is common: the portion of a video selected for training may not contain the labelled action, which may occur in another part of the video. The last source of friction is that designing a video deep neural network is nontrivial. The choice of layers, how to preprocess the input, and how to model the temporal dimension are open problems. The authors of the papers above attempt to tackle these issues in an empirical fashion and propose novel architectures that address temporal modelling in videos.

Future Research

For future research, I recommend looking into how to include more of the biases we have about the real world in deep video network architectures. An interesting direction to study is how depth modelling could lead to better video classification. Current approaches to video classification have to learn that the videos are taken in a 3D environment. Depth forms an important part of our spatial perception. It could be that current approaches have to learn how to express depth in their spatiotemporal modelling of 2D features. Perhaps monocular depth estimation networks can help current video networks build a better understanding of the environment itself. An important observation is that any spatial change in a video comes from one of two sources: a transformation of an external object we are observing, or the observer itself changing viewing angle or position. Both of these sources of movement have to be learned by the current networks. It would be interesting to investigate how depth fields could be used to model either source of change.

Citations

[1] Wang, Heng, et al. “Action recognition by dense trajectories.” IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2011.

[2] Karpathy, Andrej, et al. “Large-scale video classification with convolutional neural networks.” Proceedings of the IEEE conference on Computer Vision and Pattern Recognition. 2014.

[3] Simonyan, Karen, and Andrew Zisserman. “Two-stream convolutional networks for action recognition in videos.” Advances in neural information processing systems. 2014.

[4] Carreira, Joao, and Andrew Zisserman. “Quo vadis, action recognition? a new model and the kinetics dataset.” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017.

[5] Donahue, Jeffrey, et al. “Long-term recurrent convolutional networks for visual recognition and description.” Proceedings of the IEEE conference on computer vision and pattern recognition. 2015.

[6] Tran, Du, et al. “Learning spatiotemporal features with 3d convolutional networks.” Proceedings of the IEEE international conference on computer vision. 2015.

[7] Feichtenhofer, Christoph, Axel Pinz, and Andrew Zisserman. “Convolutional two-stream network fusion for video action recognition.” Proceedings of the IEEE conference on computer vision and pattern recognition. 2016.

[8] Zhu, Yi, et al. “Hidden two-stream convolutional networks for action recognition.” Lecture Notes in Computer Science (2019): 363–378.

[9] Qiu, Zhaofan, Ting Yao, and Tao Mei. “Learning spatio-temporal representation with pseudo-3d residual networks.” Proceedings of the IEEE International Conference on Computer Vision. 2017.

[10] Tran, Du, et al. “Video Classification with Channel-Separated Convolutional Networks.” arXiv preprint arXiv:1904.02811 (2019).

[11] Tran, Du, et al. “Convnet architecture search for spatiotemporal feature learning.” arXiv preprint arXiv:1708.05038 (2017).

