Using Hourglass Networks To Understand Human Poses
A simple and digestible deep dive into the theory behind Hourglass Networks for human pose estimation
A man is running at you with a knife. What do you do? Most people would have only one thought in mind: RUN. Why? Because after observing this man's aggressive posture, you conclude that he wants to harm you. And since you want to live to see tomorrow, you decide to run as fast as you possibly can.
So how are you able to do all this complex analysis in mere seconds? Your brain just performed something called human pose estimation. Fortunately, since human pose estimation is done by the eyes and the brain working together, it's something we can replicate in computer vision.
To perform human pose estimation, we use a special type of Fully Convolutional Network called an Hourglass Network. The network's encoder-decoder structure makes it look like an hourglass, hence the name "hourglass network".
But before we dive deeper into the nitty-gritty components of the network, let’s take a look at some other deep neural nets which this network is based on.
Taking A Step Back
Here are some other network architectures which you should be familiar with before looking into hourglass networks:
Convolutional Neural Networks (CNNs)
- Significance: Automatically learns the features that best describe a specific object, leading to higher classification accuracy.
Resources for further learning: Video, Course, Article
Residual Networks (ResNets)
- Significance: Allows for deeper networks by letting gradients flow through skip connections, mitigating the vanishing-gradient problem during backpropagation.
Resources for further learning: Article, Article, Article, Article
Fully Convolutional Networks (FCN’s)
- Significance: Replaces dense layers with 1x1 convolutions, allowing the network to accept inputs of various dimensions.
Resources for further learning: Article, Video
Encoder-Decoder Networks
- Significance: Allows us to manipulate an input by extracting its features and attempting to recreate it (e.g. image segmentation, text translation)
We'll talk about encoder-decoders more since that's basically what hourglass networks are, but if you want some other cool resources, here are some: video, quora thread, article, GitHub.
The Network At A High Level
So, I hope you had some fun learning about all those network architectures, but now it's time to combine them all.
Hourglass networks are a type of convolutional encoder-decoder network (meaning it uses convolutional layers to break down and reconstruct inputs). They take an input (in our case, an image), and they extract features from this input by deconstructing the image into a feature matrix.
It then takes this feature matrix and combines it with earlier layers, which have a higher spatial understanding than the feature matrix (that is, a better sense of where objects are in the image).
- NOTE: The feature matrix has low spatial understanding, meaning it doesn't really know where objects are in the image. This is because, to extract the object's features, we have to discard all pixels which are not features of the object, i.e. all the background pixels, and doing so removes the knowledge of the object's location in the image.
- By combining the feature matrix with early layers of the network, which have a higher spatial understanding, we get to understand a lot about the input (what it is + where it is in the image).
Doesn't carrying early layers of the network forward into later layers ring a bell? ResNets. Residuals are used heavily throughout the network; they are used to combine the spatial info with the feature info. And not only that: each block in the network diagram represents something we call a bottleneck block.
Bottlenecks are a new type of residual block. Instead of having two 3x3 convolutions, we have one 1x1 convolution, one 3x3 convolution and another 1x1 convolution. This makes the calculations a lot easier on the computer (3x3 convolutions are much more expensive at scale than 1x1 convolutions), which means we get to save lots of memory.
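Here's a minimal Keras sketch of such a bottleneck block. The helper name `bottleneck` and the halve-then-restore channel counts are my own illustrative choices, not something fixed by the architecture:

```python
import tensorflow as tf
from tensorflow.keras import layers

def bottleneck(x, channels=256):
    # Bottleneck residual block: 1x1 -> 3x3 -> 1x1, plus a skip connection.
    skip = x
    if x.shape[-1] != channels:
        # Match channel counts on the skip path with a 1x1 conv if needed.
        skip = layers.Conv2D(channels, 1, padding="same")(x)
    y = layers.Conv2D(channels // 2, 1, padding="same", activation="relu")(x)
    y = layers.Conv2D(channels // 2, 3, padding="same", activation="relu")(y)
    y = layers.Conv2D(channels, 1, padding="same")(y)
    return layers.Add()([skip, y])  # element-wise addition, as in ResNets
```

The 1x1 convolutions squeeze the channels down before the 3x3 does its work at half the width, which is where the savings come from.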
So in summary,
- Input : Image of person
- Encoding : Extract Features through breaking down input into a feature matrix
- Decoding : Combine Feature Info + Spatial Info to Understand the Image in Depth
- Output: Depends on the application; in our case, a heatmap of where the joints are.
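Before the step-by-step walk-through, here's a rough recursive Keras sketch of that whole encode-decode flow, reusing the `bottleneck` helper from above. The depth of 4 matches the four cubes described below; everything else (names, channel counts) is an assumption:

```python
from tensorflow.keras import layers

def hourglass(x, depth=4, channels=256):
    # Encode: keep a skip branch at this resolution, pool the main branch.
    skip = bottleneck(x, channels)
    down = layers.MaxPooling2D(2)(x)
    down = bottleneck(down, channels)
    # Recurse until the deepest level, then come back up.
    if depth > 1:
        inner = hourglass(down, depth - 1, channels)
    else:
        inner = bottleneck(down, channels)  # the small cubes at the bottom
    # Decode: upsample and merge feature info with spatial info.
    up = layers.UpSampling2D(2, interpolation="nearest")(inner)
    return layers.Add()([skip, up])
```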
Understanding the Process Step-By-Step
If we actually want to be able to code this, we need to understand what is happening in every single layer, and why. So here, we’re going to break down the whole process and walk through it step-by-step so that we have a deep understanding of the network (we’re just going to be reviewing the hourglass network’s architecture, not the whole training process).
In this network, we'll be using the following layers (rough Keras equivalents are sketched right after this list):
- Convolutional Layers: Extract features from the image
- MaxPooling Layers: Eliminate parts of the image which aren’t necessary for feature extraction
- Residual Layers: Carry earlier layers forward into deeper parts of the network
- Bottleneck Layers: Free up memory by swapping expensive convolutions for cheaper 1x1 convolutions
- Upsampling Layers: Increase the size of the input (in our case, using the nearest-neighbour technique)
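For reference, here's roughly what each of these maps to in Keras (hyperparameters are illustrative):

```python
from tensorflow.keras import layers

conv = layers.Conv2D(256, 3, padding="same", activation="relu")  # feature extraction
pool = layers.MaxPooling2D(pool_size=2)                          # discard unneeded pixels
upsample = layers.UpSampling2D(size=2, interpolation="nearest")  # nearest-neighbour upsampling
merge = layers.Add()                                             # element-wise addition for residuals
# Residual and bottleneck layers are compositions of the above; see the
# bottleneck() sketch earlier in the article.
```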
Okay, so before diving in, let’s look at yet another diagram of the hourglass network.
So here we can see a couple of things:
- There are 2 sections: encoding and decoding
- Each section has 4 cubes.
- The cubes from the left get passed to the right side to form the cubes on the right
So if we expand each cube, it looks like this:
So in the diagram of the whole network, each cube is a bottleneck layer (like the one shown above). After each pooling layer, we'd add one of these bottleneck layers.
However, the first layer is a bit different, since it has a 7x7 convolution (it's the only convolution larger than 3x3 in the architecture). Here's how the first layer would look:
This is how the first cube looks. First of all, the input is passed into a 7x7 convolution followed by BatchNormalization and ReLU layers. Next, it's passed into a bottleneck layer, and the output duplicates: one copy goes through the MaxPool and continues feature extraction, and the other only attaches back to the network later on, in the upsampling (decoding) part.
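As a hedged Keras sketch, the first cube might look like this. The stride-2 convolution and the channel counts are assumptions borrowed from the original Stacked Hourglass paper rather than details spelled out above:

```python
from tensorflow.keras import layers

def first_cube(inputs):
    # 7x7 conv + BatchNorm + ReLU, then a bottleneck that splits in two.
    x = layers.Conv2D(64, 7, strides=2, padding="same")(inputs)
    x = layers.BatchNormalization()(x)
    x = layers.ReLU()(x)
    x = bottleneck(x, 256)              # reuses the bottleneck() sketch above
    skip = bottleneck(x, 256)           # branch reattached during decoding
    down = layers.MaxPooling2D(2)(x)    # branch that continues feature extraction
    return down, skip
```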
The coming cubes (cubes 2, 3 and 4) have a similar structure to each other, though one that differs from cube 1. Here's how the other cubes (in the encoding section) look:
These layers are much simpler. The previous output gets passed into a bottleneck layer, then duplicates into a residual branch and a branch for further feature extraction.
We’re going to repeat this process 3 times (in cubes 2, 3 and 4), and then we’re going to produce the feature maps.
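Here's a sketch of one such encoding cube, following the description above (the helper name and channel count are mine):

```python
from tensorflow.keras import layers

def encoder_cube(x, channels=256):
    x = bottleneck(x, channels)          # shared bottleneck
    skip = bottleneck(x, channels)       # residual branch, held for decoding
    down = layers.MaxPooling2D(2)(x)     # branch that continues down the encoder
    return down, skip
```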
Here are the layers involved in creating the feature maps (this section is the three really small cubes you’d see in the diagram of the whole network):
This is the deepest level of the network. It is the part with the richest feature info and the lowest spatial info. Here, our image has been condensed into a matrix (actually, a tensor) of values which represent the features of our image.
To get here, the input passed through all 4 encoding cubes and the 3 bottleneck layers at the bottom. We're now ready to upsample. Here's how the upsampling layers look:
So here, the incoming residual branch passes through a bottleneck layer, and then we perform element-wise addition between it (the residual branch) and the upsampled feature layer (from the main network).
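In Keras, one decoding step might look like this (a sketch; `bottleneck` is the helper from earlier):

```python
from tensorflow.keras import layers

def decoder_cube(x, skip, channels=256):
    # Upsample the main branch, refine the skip branch, then merge.
    up = layers.UpSampling2D(2, interpolation="nearest")(x)
    skip = bottleneck(skip, channels)
    return layers.Add()([up, skip])  # element-wise addition
```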
We’re going to repeat this process 4 times, and then pass the final layer (4th cube in the decoding part) into the final section where we determine how accurate each prediction is.
- NOTE: This is called intermediate supervision. It is when you calculate the loss at the end of each stage instead of only at the end of the whole network. In our case, we calculate the loss at the end of each hourglass network instead of at the end of all the networks combined (since for human pose estimation, we use multiple hourglass networks stacked together).
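A sketch of what that loss could look like, assuming mean squared error on the heatmaps (the loss used in the original paper):

```python
import tensorflow as tf

def intermediate_loss(true_heatmaps, heatmaps_per_stack):
    # Score every stack's heatmaps against the ground truth and sum the losses.
    mse = tf.keras.losses.MeanSquaredError()
    return tf.add_n([mse(true_heatmaps, pred) for pred in heatmaps_per_stack])
```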
Here's how the final layer looks:
So here's the end of the network. We pass the final network's outputs through a convolutional layer, then duplicate that layer to produce a set of heatmaps. Finally, we perform an element-wise addition between the input of the network, the intermediate features and the heatmaps (remapped back to feature space). That gives us the two outputs: the heatmaps are the predictions, and the sum is passed along as the input to the next network.
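Here's a hedged Keras sketch of this output section. The helper name, the 1x1 remapping convolution, and num_joints=16 (the MPII keypoint set) are assumptions:

```python
from tensorflow.keras import layers

def supervision_head(x, hg_input, num_joints=16):
    feats = layers.Conv2D(256, 1, padding="same", activation="relu")(x)
    heatmaps = layers.Conv2D(num_joints, 1, padding="same")(feats)  # the predictions
    remapped = layers.Conv2D(256, 1, padding="same")(heatmaps)      # back to feature space
    next_input = layers.Add()([hg_input, feats, remapped])          # fed to the next stack
    return heatmaps, next_input
```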
And then, Repeat!
Yep, that's it. You just walked through the whole hourglass network. In practice, we're going to use many of these networks stacked together, which is why the heading above says "Repeat!". Hopefully, this seemingly intimidating topic is now digestible. In my next article, we'll code the network.
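To see how the pieces above could chain together, here's a final sketch of a stacked model; the stack count, input size and single-conv stem are all assumptions for illustration:

```python
import tensorflow as tf
from tensorflow.keras import layers

inputs = tf.keras.Input(shape=(256, 256, 3))
x = layers.Conv2D(256, 7, strides=2, padding="same", activation="relu")(inputs)
stack_heatmaps = []
for _ in range(4):                      # 4 stacks here; the paper goes up to 8
    hg = hourglass(x, depth=4)          # the recursive sketch from earlier
    heatmaps, x = supervision_head(hg, x)
    stack_heatmaps.append(heatmaps)
model = tf.keras.Model(inputs, stack_heatmaps)  # one heatmap set per stack
```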
Like I mentioned before, we're going to apply this to human pose estimation. However, hourglass networks can be used for many other tasks, like semantic segmentation, 3D reconstruction and more. I was reading some really cool papers on 3D reconstruction with hourglass nets, and I'll link them below so you can read them too.
Overall, I hope you enjoyed reading this article, and if you're having any trouble understanding this concept, feel free to reach out to me on email, LinkedIn or even Instagram (insta: @nushaine), and I will do my best to help you understand. Other than that, have a great day and happy coding :)
Resources
Really Cool Papers
Awesome GitHub Repos