PREDICT and CLUSTER: Unsupervised Skeleton Based Action Recognition
Jul 18 · 4 min read
Identifying the various actions that people make with their bodies just from watching a video is a natural, simple task for humans. For example, most people would easily identify a subject as, say, “jumping back and forth” or “hitting a ball with their foot,” and the action remains easy to recognize even if the subject in the footage changes or the video was recorded from a different viewpoint. What if we wanted a computer system, or a gaming console such as an Xbox or PlayStation, to be able to do the same? Would that be possible?
For an artificial system, this seemingly basic task is not as natural as it is for humans. It requires several layers of Artificial Intelligence capabilities: (i) knowing which specific ‘features’ to track when making decisions, along with (ii) the ability to name, or label, a particular action.
With regard to (i), research in visual perception and computer vision has shown that, at least for the human body, 3D coordinates of the joints, i.e., skeleton features, are sufficient for identifying different actions. Moreover, robust algorithms such as OpenPose [1] can track these features in real time from nearly any video source.
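As a concrete illustration, skeleton features for a clip can be stored as a simple array of per-frame joint coordinates. The sketch below assumes OpenPose’s per-frame JSON output format (the directory name and the single-person assumption are illustrative) and stacks the detected keypoints into a frames × joints × channels array:

```python
import glob
import json

import numpy as np

def load_skeleton_sequence(json_dir):
    """Stack per-frame OpenPose keypoints into a (T, J, 3) array:
    T frames, J joints, and (x, y, confidence) per joint."""
    frames = []
    for path in sorted(glob.glob(f"{json_dir}/*_keypoints.json")):
        with open(path) as f:
            data = json.load(f)
        if not data["people"]:
            continue  # no person detected in this frame
        keypoints = np.array(data["people"][0]["pose_keypoints_2d"])
        frames.append(keypoints.reshape(-1, 3))  # J joints x (x, y, conf)
    return np.stack(frames)

seq = load_skeleton_sequence("output_json")  # illustrative directory
print(seq.shape)  # e.g. (120, 25, 3) with the BODY_25 model
```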
Teaching a computer system to make predictive associations between these collections of points and actions turns out to be much more challenging than selecting the features themselves. The system is expected to group sequences of features into “classes” and then associate each class with the name of the corresponding action.
Existing deep learning systems try to learn this type of association through a process called ‘supervised learning’, where the system learns from many given examples, each annotated with the action it represents. This technique also requires camera and depth inputs (RGB+D) at each step. While supervised action recognition has shown promising advances, it relies on annotating a large number of sequences, and the annotation must be redone whenever a new subject, viewpoint, or action is considered.
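To make the contrast concrete, here is a minimal sketch of that supervised paradigm, assuming a simple recurrent classifier over skeleton sequences (the layer sizes and class count are illustrative, not from any particular system). Note that every training sequence must come paired with a ground-truth action label:

```python
import torch
import torch.nn as nn

class SupervisedActionClassifier(nn.Module):
    """A toy recurrent classifier: needs a labeled action per sequence."""
    def __init__(self, joint_dim=75, hidden=256, num_actions=60):
        super().__init__()
        self.rnn = nn.GRU(joint_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, num_actions)

    def forward(self, x):        # x: (batch, frames, joint_dim)
        _, h = self.rnn(x)       # final hidden state summarizes the sequence
        return self.head(h[-1])  # logits over action classes

model = SupervisedActionClassifier()
x = torch.randn(8, 50, 75)           # a toy batch of skeleton sequences
labels = torch.randint(0, 60, (8,))  # the annotation burden lives here
loss = nn.CrossEntropyLoss()(model(x), labels)
```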
It is of particular interest, then, to instead create systems that imitate the perceptual ability of humans, who learn to make these associations in an unsupervised way.
In our recent research, “Predict & Cluster: Unsupervised skeleton based action recognition” [2], we developed such an unsupervised system. We proposed that, rather than teaching the computer to catalog sequences with their actions, the system should instead learn to predict the sequences through ‘encoder-decoder’ learning. Such a system is fully unsupervised: it operates on inputs alone and requires no labeling of actions at any stage.
In particular, the encoder-decoder neural network learns to encode each sequence into a code from which the decoder regenerates exactly the same sequence. It turns out that in the process of learning to encode and then to decode, the Seq2Seq deep neural network self-organizes the sequences into distinct clusters. We developed a way to ensure that learning produces such an organization (by fixing the weights or the states of the decoder), along with tools to read out this organization and associate each cluster with an action.
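Below is a minimal sketch of this idea, assuming a GRU-based Seq2Seq autoencoder with the fixed-weight decoder strategy; the layer sizes, the zero decoder input, and the nearest-neighbor readout are illustrative choices, not the exact configuration from the paper [2]:

```python
import torch
import torch.nn as nn
from sklearn.neighbors import KNeighborsClassifier

class PredictAndCluster(nn.Module):
    """Encoder-decoder that learns to regenerate its input sequence."""
    def __init__(self, joint_dim=75, hidden=256):
        super().__init__()
        self.encoder = nn.GRU(joint_dim, hidden, batch_first=True)
        self.decoder = nn.GRU(joint_dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, joint_dim)
        # Fixed-weight strategy: freezing the decoder pushes all learning
        # into the encoder, whose codes then self-organize into clusters.
        for p in self.decoder.parameters():
            p.requires_grad = False

    def forward(self, x):           # x: (batch, frames, joint_dim)
        _, h = self.encoder(x)      # h: the learned code for the sequence
        y, _ = self.decoder(torch.zeros_like(x), h)  # regenerate from the code
        return self.out(y), h.squeeze(0)

model = PredictAndCluster()
opt = torch.optim.Adam(p for p in model.parameters() if p.requires_grad)

x = torch.randn(8, 50, 75)      # a toy batch of skeleton sequences
recon, codes = model(x)
loss = nn.MSELoss()(recon, x)   # reconstruction error; no labels used
opt.zero_grad()
loss.backward()
opt.step()

# Readout: name each cluster via nearest-neighbor lookup on the codes,
# given a handful of labeled examples (toy labels used here).
with torch.no_grad():
    _, codes = model(x)
toy_labels = [0, 0, 1, 1, 2, 2, 3, 3]
knn = KNeighborsClassifier(n_neighbors=1).fit(codes.numpy(), toy_labels)
```

The design choice to freeze the decoder may seem counterintuitive, but it is what forces the encoder to carry all the information needed for reconstruction, so that sequences of the same action end up with nearby codes that a simple K-nearest-neighbor lookup can then label.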
We obtain action recognition results that outperform both previous unsupervised and supervised approaches. Our findings pave the way toward learning any type of action from any input features: anything from recognizing the flight patterns of flying insects to identifying malicious actions in internet activity.
For more info, see the overview video below and the paper [2].
With Kun Su and Xiulong Liu.
References
[1] OpenPose: https://github.com/CMU-Perceptual-Computing-Lab/openpose
[2] Su, Kun, Xiulong Liu, and Eli Shlizerman. “Predict & Cluster: Unsupervised Skeleton Based Action Recognition.” Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020.