Introduction
As described by Zhe Cao in his 2017 paper, realtime multi-person 2D pose estimation is crucial in enabling machines to understand people in images and videos.
But what is Pose Estimation?
As the name suggests, it is a technique used to estimate how a person is physically positioned, such as standing, sitting, or lying down. One way to obtain this estimate is to find the 18 "joints of the body," or, as they are named in the Artificial Intelligence field, "Key Points." The images below show our goal, which is to find these points in an image:
The keypoints run from 0 (top of the neck), going down through the body joints and returning to the head, ending with point 17 (right ear).
The first significant work using an Artificial Intelligence-based approach was DeepPose, a 2014 paper by Toshev and Szegedy from Google. The paper proposed a human pose estimation method based on Deep Neural Networks (DNNs), where pose estimation was formulated as a DNN-based regression problem towards body joints. The model consisted of an AlexNet backend (7 layers) with an extra final layer that outputs 2k joint coordinates. The significant problem with this approach is that, first, a single person must be detected (classic object detection), followed by the application of the model; each human body found in an image must therefore be treated separately, which considerably increases the time needed to process the image. This type of approach is known as "top-down" because it first finds the bodies and, from them, the joints associated with each one.
There are several problems related to Pose Estimation, such as:
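To make this "top-down regression" idea more concrete, here is a minimal, hypothetical TensorFlow 2 sketch. It is not the DeepPose implementation (which used an AlexNet backbone); the backbone, input size, and joint count below are assumptions for illustration only:

import tensorflow as tf
from tensorflow.keras import layers, models

K = 18  # number of joints to regress (assumed for illustration)

# A CNN backbone followed by a regression head that outputs 2*K values,
# interpreted as (x1, y1, ..., xK, yK) normalized joint coordinates for ONE person crop.
backbone = tf.keras.applications.MobileNetV2(
    input_shape=(224, 224, 3), include_top=False, pooling='avg', weights=None)
deeppose_like = models.Sequential([backbone, layers.Dense(2 * K)])

deeppose_like.compile(optimizer='adam', loss='mse')  # trained as a plain regression problem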
- Each image may contain an unknown number of people that can appear at any position or scale.
- Interactions between people induce complex spatial interference, due to contact, occlusion, or limb articulations, making association of parts difficult.
- Runtime complexity tends to grow with the number of people in the image, making realtime performance a challenge.
To address those problems, a more exciting approach (and the one used in this project) is OpenPose, introduced in 2016 by Zhe Cao and his colleagues from the Robotics Institute at Carnegie Mellon University. OpenPose uses a nonparametric representation, referred to as Part Affinity Fields (PAFs), to "connect" each body joint found in an image, associating them with individual people. In other words, OpenPose does the opposite of DeepPose: it first finds all the joints in an image and then goes "up," looking for the most probable body that contains each joint, without using any person detector (a "bottom-up" approach). OpenPose finds the key points in an image regardless of the number of people in it. The image below, retrieved from the OpenPose presentation at the ILSVRC and COCO workshop 2016, gives an idea of the process.
The image below shows the architecture of the two-branch multi-stage CNN model used for training. First, a feed-forward network simultaneously predicts a set of 2D confidence maps (S) of body part locations (keypoint annotations from dataset/COCO/annotations/) and a set of 2D vector fields of part affinities (L), which encode the degree of association between parts. After each stage, the predictions of the two branches, along with the image features, are concatenated for the next stage. Finally, the confidence maps and the affinity fields are parsed by greedy inference to output the 2D keypoints for all people in the image.
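To give a feel for this two-branch idea, below is a purely conceptual tf.keras sketch of one stage. Layer counts, filter sizes, and feature-map shapes are assumptions; the real OpenPose/CMU network differs:

import tensorflow as tf
from tensorflow.keras import layers

def two_branch_stage(stage_input, n_parts=19, n_pafs=38):
    """One refinement stage: predicts confidence maps S and part affinity fields L."""
    x = stage_input
    for _ in range(3):                                  # assumed depth per branch
        x = layers.Conv2D(128, 3, padding='same', activation='relu')(x)
    S = layers.Conv2D(n_parts, 1, padding='same')(x)    # branch 1: confidence maps

    y = stage_input
    for _ in range(3):
        y = layers.Conv2D(128, 3, padding='same', activation='relu')(y)
    L = layers.Conv2D(n_pafs, 1, padding='same')(y)     # branch 2: part affinity fields
    return S, L

# Feature maps F come from a backbone (VGG in the paper); each new stage receives
# the previous S, L and F concatenated, exactly as described above.
features = layers.Input(shape=(46, 46, 128))            # assumed feature-map size
S, L = two_branch_stage(features)
next_input = layers.Concatenate()([features, S, L])
S, L = two_branch_stage(next_input)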
During the execution of the project, we will return to some of those concepts for clarification. However, it is highly recommended to follow the OpenPose ILSVRC and COCO workshop 2016 presentation and the video recording at CVPR 2017 for a better understanding.
TensorFlow 2 OpenPose installation (tf-pose-estimation)
The original OpenPose was developed using a VGG pre-trained network as its model backbone and the Caffe framework. However, for this installation, we will follow Ildoo Kim's TensorFlow approach, as detailed in his tf-pose-estimation GitHub.
What is tf-pose-estimation?
tf-pose-estimation is the 'OpenPose' human pose estimation algorithm implemented using TensorFlow. It also provides several variants with changes to the network structure for realtime processing on the CPU or on low-power embedded devices.
The tf-pose-estimation GitHub shows several experiments with different models, such as:
- cmu: the model based on the VGG pretrained network described in the original paper, with weights in Caffe format converted for use in TensorFlow.
- dsconv: the same architecture as the cmu version, except for the depthwise separable convolutions of mobilenet.
- mobilenet: based on the mobilenet V1 paper; 12 convolutional layers are used as feature-extraction layers.
- mobilenet v2: similar to mobilenet, but using an improved version of it.
The studies in this article were done with mobilenet V1 ("mobilenet_thin"), which has an intermediate performance regarding computation budget and latency:
Part 1 — Installing tf-pose-estimation
Here we follow the excellent Gunjan Seth article Pose Estimation with TensorFlow 2.0.
- Go to the terminal, create a working directory (for example, "Pose_Estimation"), and move into it:
mkdir Pose_Estimation
cd Pose_Estimation
- Create a virtual environment (for example, Tf2_Py37):
conda create --name Tf2_Py37 python=3.7.6 -y
conda activate Tf2_Py37
- Install TF2
pip install --upgrade pip
pip install tensorflow
- Install basic packages to be used during development:
conda install -c anaconda numpy
conda install -c conda-forge matplotlib
conda install -c conda-forge opencv
- Clone tf-pose-estimation repository:
git clone https://github.com/gsethi2409/tf-pose-estimation.git
- Go to the tf-pose-estimation folder and install the requirements:
cd tf-pose-estimation/
pip install -r requirements.txt
In the next step, install SWIG , an interface compiler that connects programs written in C and C++ with scripting languages such as Python. It works by taking the declarations found in C/C++ header files and using them to generate the wrapper code that scripting languages need to access the underlying C/C++ code.
conda install swig
- Using SWIG, build the C++ library for post-processing:
cd tf_pose/pafprocess
swig -python -c++ pafprocess.i && python3 setup.py build_ext --inplace
Now, install the tf-slim library, a lightweight library used for defining, training, and evaluating complex models in TensorFlow:
pip install git+https://github.com/adrianc-a/tf-slim.git@remove_contrib
That is it! Now it is essential to run a quick test. For that, return to the main tf-pose-estimation directory.
If you followed the sequence, you must be inside tf_pose/pafprocess. Otherwise, use the appropriate command to change directory.
cd ../..
Inside the tf-pose-estimation directory there is a Python script, run.py. Let's run it with the following parameters:
- model=mobilenet_thin
- resize=432x368 (size of the image at pre-processing)
- image=./images/ski.jpg (sample image inside images directory)
python run.py --model=mobilenet_thin --resize=432x368 --image=./images/ski.jpg
Note that nothing will happen for several seconds, but after a minute or so the terminal should present something similar to the image below:
However, more importantly, an image will appear in an independent OpenCV window:
Great! The images are proof that everything was properly installed and is working fine! We will go into more detail in the next section. However, as a quick explanation of what the four images mean: the top-left one ("Result") is the pose detection skeleton drawn with the original image (in this case, ski.jpg) as background. The top-right image is a "heat map," where the "parts detected" (S's) are shown, and both bottom images show the part associations (L's). The "Result" is the S's and L's connected to individual persons.
The next test is a live video:
If the computer has only one camera installed, use: camera=0
python run_webcam.py --model=mobilenet_thin --resize=432x368 --camera=1
If everything goes well, a window will appear with a real live video, like this screenshot:
Part 2 — Going Deeper with Pose Estimation in Images
In this section, we will go deeper into our TensorFlow pose estimation implementation. It is advised to follow the article, trying to reproduce the Jupyter Notebook 10_Pose_Estimation_Images, which can be downloaded from the project GitHub.
As a reference, this project was developed entirely on a MacPro (2.9GHz Quad-Core i7, 16GB 2133MHz RAM).
Import Libraries
import sys
import time
import logging
import numpy as np
import matplotlib.pyplot as plt
import cv2

from tf_pose import common
from tf_pose.estimator import TfPoseEstimator
from tf_pose.networks import get_graph_path, model_wh
Model definition and TfPose Estimator creation
It is possible to use the models located in the model/graph sub-directory, such as mobilenet_v2_large or cmu (the VGG pretrained model).
For cmu, the *.pb files were not downloaded during installation because they are large. To use them, run the bash script download.sh located in the /cmu sub-directory.
This project uses mobilenet_thin (MobilenetV1), considering that all images used should be resized to 432x368.
Parameters:
model = 'mobilenet_thin'
resize = '432x368'
w, h = model_wh(resize)
Create estimator:
e = TfPoseEstimator(get_graph_path(model), target_size=(w, h))
Let us load a simple human image for easier analysis. OpenCV is used to read images. Images are usually stored as RGB, but internally OpenCV works with BGR. Displaying an image with OpenCV poses no problem, because OpenCV handles its own BGR ordering consistently when presenting the image in a window (as seen with ski.jpg in the previous section).
However, since the image will be plotted in a Jupyter cell, Matplotlib will be used instead of OpenCV. Because of that, the image should be converted to RGB before display, as shown below:
image_path = './images/human.png'
image = cv2.imread(image_path)
image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
plt.imshow(image)
plt.grid();
Observe that this image has a shape of 567x567. When reading an image, OpenCV automatically converts it to an array, where each value ranges from 0 to 255 per color channel (0 = black, 255 = white).
Once the image is an array, it is simple to verify its size, using shape:
image.shape
The result will be (567, 567, 3), where the shape is (height, width, color channels).
Although the image can be read using OpenCV, we will use the function read_imgfile(image_path) from the library tf_pose.common to prevent any trouble with color channels.
image = common.read_imgfile(image_path, None, None)
Once we have the image as an array, we can apply the inference method of the estimator (e), with the image array as input (the image will be resized using the parameters w and h defined at the beginning).
humans = e.inference(image, resize_to_default=(w > 0 and h > 0), upsample_size=4.0)
After running the above command, let us inspect the array e.heatMat. This array has a shape of (184, 216, 19), where 184 is h/2, 216 is w/2, and 19 relates to the probability of a specific pixel belonging to one of the 18 joints (0 to 17) plus one "none" channel (18). For example, inspecting the top-left pixel, a "none" should be expected:
It is possible to verify the last value of this array, which is the highest of all; this means that, with a 99.6% chance, this pixel does not belong to any of the 18 joints.
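A quick, hedged way to reproduce this inspection in the Notebook (the exact numbers will vary with the image used):

e.heatMat.shape          # (184, 216, 19)
e.heatMat[0][0]          # the 19 probabilities for the top-left pixel
e.heatMat[0][0][-1]      # the "none" channel; close to 1.0 (~0.996 here)
e.heatMat[0][0].argmax() # 18, i.e., the "none" channel wins for this pixel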
Let us try to find the base of the neck (the midpoint between the shoulders). It is located in the original picture around mid-width (0.5 * w = 108) and at about 20% of the height, measured from the top (0.2 * h = 37). So, let us inspect this specific pixel:
It is easy to see that position 1 has the maximum value, 0.7059… (which can also be obtained with e.heatMat[37][108].max()), meaning that this specific pixel has a 70% probability of being the "base of the neck." The figure below shows all 18 COCO keypoints (or "body joints"), confirming that "1" corresponds to the "base of the neck."
It is possible to plot, for every pixel, a color representing its maximum value. Doing that, a heat map showing the key points will magically appear:
max_prob = np.amax(e.heatMat[:, :, :-1], axis=2)
plt.imshow(max_prob)
plt.grid();
Let us now plot the key points over the reshaped original image:
plt.figure(figsize=(15,8))
bgimg = cv2.cvtColor(image.astype(np.uint8), cv2.COLOR_BGR2RGB)
bgimg = cv2.resize(bgimg, (e.heatMat.shape[1], e.heatMat.shape[0]), interpolation=cv2.INTER_AREA)
plt.imshow(bgimg, alpha=0.5)
plt.imshow(max_prob, alpha=0.5)
plt.colorbar()
plt.grid();
So, it is possible to see the keypoints (S's) over the image; as the colorbar indicates, more yellow means a higher probability.
To get the L's, the most probable connections (or "bones") between the key points (or "joints"), we can use the resulting array e.pafMat. This array has a shape of (184, 216, 38), where 38 (2 x 19) relates to the probability of a pixel being part of a horizontal (x) or vertical (y) connection associated with one of the 18 specific joints plus the "none" channel.
The functions to plot the above figures are in the Notebook.
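As an illustration only (not the Notebook's exact functions), one possible way to visualize the overall PAF strength is to combine the x and y channels into a single magnitude map, assuming the 38 channels alternate horizontal and vertical components:

paf_x = e.pafMat[:, :, ::2]                              # assumed: even channels = x components
paf_y = e.pafMat[:, :, 1::2]                             # assumed: odd channels = y components
paf_mag = np.sqrt(paf_x ** 2 + paf_y ** 2).max(axis=2)   # strongest "bone" at each pixel
plt.imshow(paf_mag)
plt.colorbar()
plt.grid();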
Draw the skeleton using the method draw_humans
With the list humans, returned by the e.inference() method, it is possible to draw the skeleton using the method draw_humans:
image = TfPoseEstimator.draw_humans(image, humans, imgcopy=False)
The result will be the image below:
If desired, it is possible to plot only the skeleton, as shown here (let us rerun all the code as a recap):
image = common.read_imgfile(image_path, None, None)
humans = e.inference(image, resize_to_default=(w > 0 and h > 0), upsample_size=4.0)
black_background = np.zeros(image.shape)
skeleton = TfPoseEstimator.draw_humans(black_background, humans, imgcopy=False)
plt.figure(figsize=(15,8))
plt.imshow(skeleton);
plt.grid();
plt.axis('off');
Getting the Key points (Joints) coordinates
Pose estimation can be used in a series of applications such as robotics, gaming, or medicine. For that, it can be interesting to get the physical keypoint coordinates from the image so they can be used by other applications.
Looking at the humans list returned by e.inference(), it can be verified that it is a list with a single element which, converted to a string, shows every key point with its relative coordinates and associated probability. For the human image used so far, we have, for example:
BodyPart:0-(0.49, 0.09) score=0.79 BodyPart:1-(0.49, 0.20) score=0.75 ... BodyPart:17-(0.53, 0.09) score=0.73
We can extract an array (of size 18) from this list with the actual coordinates scaled to the original image shape:
keypoints = str(str(str(humans[0]).split('BodyPart:')[1:]).split('-')).split(' score=')
# parse the "(x.xx, y.yy)" substrings into float pairs (relative coordinates)
keypoints_list = [tuple(map(float, k[-11:-1].split(', '))) for k in keypoints[:-1]]
keypts_array = np.array(keypoints_list)
keypts_array = keypts_array * (image.shape[1], image.shape[0])  # scale to image width/height
keypts_array = keypts_array.astype(int)
Let us plot this array (where the array's index is the key point number) over the original image. Here is the result:
plt.figure(figsize=(10,10))
plt.axis([0, image.shape[1], 0, image.shape[0]])
plt.scatter(*zip(*keypts_array), s=200, color='orange', alpha=0.6)
img = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
plt.imshow(img)
ax = plt.gca()
ax.set_ylim(ax.get_ylim()[::-1])
ax.xaxis.tick_top()
plt.grid()
for i, txt in enumerate(keypts_array):
    ax.annotate(i, (keypts_array[i][0]-5, keypts_array[i][1]+5))
Creating Functions to reproduce the studies on generic images quickly:
The Notebook shows all the code developed so far, “encapsulated” as functions. For example, let us see another image:
image_path = '../images/einstein_oxford.jpg'
img, hum = get_human_pose(image_path)
keypoints = show_keypoints(img, hum, color='orange')
img, hum = get_human_pose(image_path, showBG=False)
keypoints = show_keypoints(img, hum, color='white', showBG=False)
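The Notebook defines these helpers. As a reference, here is a rough sketch of what get_human_pose could look like, based only on the calls above (the actual Notebook function may differ in its details):

def get_human_pose(image_path, showBG=True):
    """Run inference on an image file and return the annotated image and the humans list."""
    image = common.read_imgfile(image_path, None, None)
    humans = e.inference(image, resize_to_default=(w > 0 and h > 0), upsample_size=4.0)
    if not showBG:
        image = np.zeros(image.shape)      # draw the skeleton over a black background
    image = TfPoseEstimator.draw_humans(image, humans, imgcopy=False)
    return image, humans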
Studying images with multiple persons
So far, we have only explored images containing a single person. Since the algorithm was developed to capture all joints (S's) and PAFs (L's) from the image at once, analyzing single-person images was only for simplicity. The code to get the result is the same; only the resulting list ("humans") will have a size compatible with the number of people in the image.
For example, let us use a "busy image" with five people in it:
image_path = './images/ski.jpg'
img, hum = get_human_pose(image_path)
plot_img(img, axis=False)
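Since hum is simply the list of detected people, a quick sanity check (assumed, not part of the original Notebook) is its length:

print(len(hum))   # expected: 5 for this image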
The algorithm found all the S's and L's, associating them with the five people. The result is excellent!
From reading the image path to plotting the result, the whole process took less than 0.5 s, independent of the number of people found in the image.
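A simple way to verify this timing yourself (helper names as used above; the actual numbers depend on your hardware):

import time

start = time.time()
img, hum = get_human_pose(image_path)   # read + inference + skeleton drawing
plot_img(img, axis=False)
print('Elapsed: {:.2f} s'.format(time.time() - start))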
Let us complicate things and look at an image where the people are more "entangled," such as a couple dancing:
image_path = '../images/figure-836178_1920.jpg'
img, hum = get_human_pose(image_path)
plot_img(img, axis=False)
The result also looks very good. Let us plot only the keypoints, with a different color for each person:
plt.figure(figsize=(10,10))
plt.axis([0, img.shape[1], 0, img.shape[0]])
plt.scatter(*zip(*keypoints_1), s=200, color='green', alpha=0.6)
plt.scatter(*zip(*keypoints_2), s=200, color='yellow', alpha=0.6)
ax = plt.gca()
ax.set_ylim(ax.get_ylim()[::-1])
ax.xaxis.tick_top()
plt.title('Keypoints of all humans detected\n')
plt.grid();
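In the snippet above, keypoints_1 and keypoints_2 hold the scaled coordinates of each detected person. A hedged sketch of how they could be obtained, assuming the body_parts attribute exposed by the tf_pose Human objects (the Notebook's own helper may differ):

def get_keypoints(human, img_shape):
    # each BodyPart stores relative (x, y); scale them to the image width/height
    pts = [(bp.x * img_shape[1], bp.y * img_shape[0]) for bp in human.body_parts.values()]
    return np.array(pts).astype(int)

keypoints_1 = get_keypoints(hum[0], img.shape)
keypoints_2 = get_keypoints(hum[1], img.shape)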
Part 3: Pose Estimation in Videos and live camera
The process of getting pose estimations in videos is the same as with images, because a video can be treated as a succession of images (frames). It is advised to follow the section, trying to reproduce the Jupyter Notebook 20_Pose_Estimation_Video, which can be downloaded from the project GitHub.
OpenCV does a fantastic job of handling videos.
So, let us get an .mp4 video and tell OpenCV that we will capture its frames:
video_path = '../videos/dance.mp4'
cap = cv2.VideoCapture(video_path)
Now let us create a loop that captures each frame. For each frame, we apply e.inference() and, from the result, draw the skeleton the same way we did with images. Code is included at the end to stop the video when a key ('q', for example) is pressed.
Below is the necessary code:
showBG = True  # set to False to draw the skeleton over a black background
fps_time = 0

while True:
    ret_val, image = cap.read()
    humans = e.inference(image, resize_to_default=(w > 0 and h > 0), upsample_size=4.0)
    if not showBG:
        image = np.zeros(image.shape)
    image = TfPoseEstimator.draw_humans(image, humans, imgcopy=False)
    cv2.putText(image, "FPS: %f" % (1.0 / (time.time() - fps_time)),
                (10, 10), cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0, 255, 0), 2)
    cv2.imshow('tf-pose-estimation result', image)
    fps_time = time.time()
    if cv2.waitKey(1) & 0xFF == ord('q'):
        break

cap.release()
cv2.destroyAllWindows()
The result is fantastic, but a little slow. The movie, which originally had around 30 FPS (frames per second), will run here in "slow motion," at around 3 FPS.
Here is another experiment, where the movie was run twice, recording the pose-estimated skeleton with and without the background video. The videos were manually synchronized and, even if the result is not perfect, it is fascinating. I cut the last scene of the 1928 Chaplin movie "The Circus," where the way the Tramp walks is classic.
Testing with a live camera
It is advised to follow the section, trying to reproduce the Jupyter Notebook 30_Pose_Estimation_Camera, which can be downloaded from the project GitHub.
The code needed to run on a live camera is almost the same as the one used with video, except that the OpenCV VideoCapture() method receives, as an input parameter, an integer referring to which physical camera is used. For example, an internal camera uses "0" and an external one "1". Also, the camera should be set to capture frames at '432x368', as used by the model.
Parameters initialization:
camera = 1
resize = '432x368'        # resize images before they are processed
resize_out_ratio = 4.0    # resize heatmaps before they are post-processed
model = 'mobilenet_thin'
show_process = False
tensorrt = False          # for tensorrt process

cam = cv2.VideoCapture(camera)
cam.set(3, w)
cam.set(4, h)
The loop part of the code should be very similar to the one used with video:
fps_time = 0

while True:
    ret_val, image = cam.read()
    humans = e.inference(image, resize_to_default=(w > 0 and h > 0), upsample_size=resize_out_ratio)
    image = TfPoseEstimator.draw_humans(image, humans, imgcopy=False)
    cv2.putText(image, "FPS: %f" % (1.0 / (time.time() - fps_time)),
                (10, 10), cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0, 255, 0), 2)
    cv2.imshow('tf-pose-estimation result', image)
    fps_time = time.time()
    if cv2.waitKey(1) & 0xFF == ord('q'):
        break

cam.release()
cv2.destroyAllWindows()
Again, the standard 30 FPS video capture is reduced to around 10% of that when the algorithm is used.
Here is a full video where the delay can be better observed. However, the result is excellent!
Conclusion
As always, I hope this article can inspire others to find their way in the fantastic world of AI!
All the code used in this article is available for download on the project GitHub: TF2_Pose_Estimation
Regards from the South of the World!
See you in my next article!
Thank you
Marcelo