This article is a detailed explanation of a new object detection technique proposed in the paper FCOS: Fully Convolutional One-Stage Object Detection published at ICCV’19. I decided to summarize this paper because it proposes a really intuitive and simple technique that solves the object detection problem. Stick around to know how it works.
Contents
- Anchor-Based Detectors
- FCOS: Proposed Idea
- Multi-level detection
- Center-ness for FCOS
- Experiments and comparison with Anchor based detectors
- Conclusion
Anchor-Based Detectors
Every popular object detection method in use today (Faster R-CNN, YOLOv3, SSD, RetinaNet, etc.) uses anchors. Anchors are essentially pre-defined candidate boxes that serve as training samples. They come in different scales and aspect ratios to cover objects of various shapes and sizes. However, as is clear from their very definition, using anchors introduces a lot of hyperparameters: the number of anchors per location, the aspect ratios of the boxes, the number of scales the image is divided into, and so on. Most importantly, even slight changes to these hyperparameters affect the end result. Furthermore, whether a box is treated as a positive or negative sample is decided by yet another hyperparameter, a threshold on the Intersection over Union (IoU) with the ground truth, and this threshold greatly changes which boxes are considered. Following is a simple image describing the use of anchor boxes in YOLOv3:
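To make the anchor-matching step concrete, here is a minimal NumPy sketch (not taken from any detector's codebase) of how a single anchor is labelled against the ground truth with an IoU threshold. The 0.5/0.4 thresholds follow RetinaNet's convention and are exactly the kind of sensitive hyperparameter the paper argues against:

```python
import numpy as np

def iou(box_a, box_b):
    """Intersection over Union of two boxes in (x0, y0, x1, y1) format."""
    ix0, iy0 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix1, iy1 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def label_anchor(anchor, gt_boxes, pos_thresh=0.5, neg_thresh=0.4):
    """Assign an anchor as positive (1), negative (0), or ignored (-1)."""
    best = max(iou(anchor, gt) for gt in gt_boxes)
    if best >= pos_thresh:
        return 1
    if best < neg_thresh:
        return 0
    return -1  # in-between IoU: the anchor is ignored during training
```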
We have kept using this approach for one reason and one reason only: it carries forward the idea used by earlier approaches. The first deep object detectors borrowed the sliding-window concept from classical computer-vision detection models. But with the computing power of modern GPUs at our disposal, sliding windows are no longer necessary.
FCOS: Proposed Idea
This leads to the question: why use anchors at all, and why not perform object detection the way segmentation does, i.e. pixel-wise? This is exactly what the paper proposes. Until now, with the sliding-window approach, there was no direct connection between the pixel-by-pixel values of the image and the objects detected. Let us now formally see how this approach works.
Let Fᵢ be the feature map at layer i of a backbone CNN, and let s be the total stride up to that layer. Also, we define the ground-truth bounding boxes of the image as Bᵢ = (x⁰ᵢ, y⁰ᵢ, x¹ᵢ, y¹ᵢ, cᵢ) ∈ ℝ⁴ × {1, 2, …, C}. Here C is the number of classes, and (x⁰ᵢ, y⁰ᵢ) and (x¹ᵢ, y¹ᵢ) denote the top-left and bottom-right corners respectively. Each location (x, y) on the feature map can be pointed back to a pixel in the original image. This is similar (although not identical) to what we do in semantic segmentation. We map (x, y) on the feature map to the image point (floor(s/2) + x·s, floor(s/2) + y·s), which lies near the center of the location's receptive field. I would encourage the reader to work through an example image of size (8, 8) with a feature map of size (4, 4) to actually understand this mapping. With the help of this mapping, we are able to treat every feature-map location as a training sample. What this means is that every location (x, y) is either a positive or a negative sample: it is positive if it falls inside a ground-truth (GT from now on) bounding box, in which case its class label c* is the class label of that GT box; otherwise it is negative (background).
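As an illustration, here is a small NumPy sketch (not the paper's code) of this location-to-pixel mapping. Running it on the (8, 8) image / (4, 4) feature map example above, where the stride is s = 2, gives the pixel centers 1, 3, 5, 7:

```python
import numpy as np

def feature_map_locations(h, w, stride):
    """Map each (x, y) on an h-by-w feature map to image coordinates.

    Location (x, y) maps to (floor(stride/2) + x*stride,
    floor(stride/2) + y*stride), near the center of its receptive field.
    """
    xs = stride // 2 + np.arange(w) * stride
    ys = stride // 2 + np.arange(h) * stride
    grid_x, grid_y = np.meshgrid(xs, ys)        # each of shape (h, w)
    return np.stack([grid_x, grid_y], axis=-1)  # shape (h, w, 2)

# A 4x4 feature map with stride 2 over an 8x8 image:
print(feature_map_locations(4, 4, 2)[..., 0])   # x coordinates: 1, 3, 5, 7
```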
Once we know that a point resides inside a GT bounding box, we need to determine the dimensions of the box. This is done by regressing four values (l*, t*, r*, b*): the distances from the location to the left, top, right, and bottom sides of the box. They are defined as:
l* = x − x⁰ᵢ ; t* = y − y⁰ᵢ ; r* = x¹ᵢ − x ; b* = y¹ᵢ − y
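A minimal sketch of these regression targets, using a hypothetical location and GT box. Note that a location lies inside the box exactly when all four distances are positive:

```python
def regression_targets(px, py, gt_box):
    """Distances (l*, t*, r*, b*) from an image location to the box sides."""
    x0, y0, x1, y1 = gt_box
    return px - x0, py - y0, x1 - px, y1 - py

# Location (50, 60) inside the GT box (20, 30, 120, 140):
l, t, r, b = regression_targets(50, 60, (20, 30, 120, 140))
assert min(l, t, r, b) > 0   # (30, 30, 70, 80): a positive sample
```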
Eventually, as you will see, a regression-based calculation of these values is part of the loss function for the overall detection algorithm.
Now, because there are no anchors, there is no need to compute IoU between anchors and GT bounding boxes in order to select the positive samples on which the regressor is trained. Instead, every positive location (one falling inside a GT box) directly takes part in regressing the bounding-box dimensions, so FCOS trains on many more foreground samples per image. This is one possible reason FCOS works better than anchor-based detectors even though it produces far fewer network outputs per location and needs far fewer hyperparameters. The whole assignment step is sketched below.
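Putting the two previous sketches together, the whole target-assignment step can be written in a few lines of NumPy, with no anchors and no IoU matching anywhere. This is only an illustration: where a location falls into several GT boxes, the loop below simply lets the last box win, whereas the paper resolves such ambiguity with multi-level prediction and by preferring the box with minimal area:

```python
import numpy as np

def assign_targets(h, w, stride, gt_boxes, gt_classes):
    """Label every feature-map location directly from the GT boxes.

    Returns per-location class labels (0 = background) and
    (l*, t*, r*, b*) regression targets.
    """
    xs = stride // 2 + np.arange(w) * stride
    ys = stride // 2 + np.arange(h) * stride
    gx, gy = np.meshgrid(xs, ys)
    px, py = gx.ravel(), gy.ravel()          # image coords of all locations
    labels = np.zeros(px.size, dtype=np.int64)
    targets = np.zeros((px.size, 4))
    for (x0, y0, x1, y1), cls in zip(gt_boxes, gt_classes):
        ltrb = np.stack([px - x0, py - y0, x1 - px, y1 - py], axis=1)
        inside = ltrb.min(axis=1) > 0        # positive samples for this box
        labels[inside] = cls
        targets[inside] = ltrb[inside]
    return labels, targets
```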
For every location on a feature map we compute a classification score, and for every location that is a positive sample we also perform the regression. Thus, the overall loss function becomes:

L({p(x,y)}, {t(x,y)}) = (1/N_pos) · Σ L_cls(p(x,y), c*(x,y)) + (λ/N_pos) · Σ 𝟙[c*(x,y) > 0] · L_reg(t(x,y), t*(x,y))

where both sums run over all feature-map locations (x, y), N_pos is the number of positive samples, and the indicator 𝟙[c*(x,y) > 0] is 1 for positive samples and 0 otherwise.
For this paper, the value of λ is taken as 1.
The first term on the RHS is the classification loss for location (x, y); the standard focal loss used in RetinaNet is used here as well. The second term regresses the bounding box; it is equal to zero for locations that are not positive samples.
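Below is a rough NumPy sketch of this loss, assuming sigmoid classification outputs. The focal-loss parameters α = 0.25 and γ = 2 are the RetinaNet defaults, and the regression term is an IoU loss computed from the (l, t, r, b) distances (the paper adopts the IoU loss of UnitBox); treat this as an illustration rather than the paper's implementation:

```python
import numpy as np

def focal_loss(p, target, alpha=0.25, gamma=2.0):
    """Focal loss for one location's class probabilities p (after sigmoid).

    `target` is the one-hot GT class vector (all zeros for background).
    """
    pt = np.where(target == 1, p, 1 - p)
    a = np.where(target == 1, alpha, 1 - alpha)
    return np.sum(-a * (1 - pt) ** gamma * np.log(np.clip(pt, 1e-8, 1.0)))

def iou_loss(pred, gt):
    """IoU loss between predicted and GT (l, t, r, b) distances."""
    pl, pt_, pr, pb = pred
    gl, gt_, gr, gb = gt
    pred_area = (pl + pr) * (pt_ + pb)
    gt_area = (gl + gr) * (gt_ + gb)
    inter = (min(pl, gl) + min(pr, gr)) * (min(pt_, gt_) + min(pb, gb))
    iou = inter / (pred_area + gt_area - inter)
    return -np.log(np.clip(iou, 1e-8, 1.0))

def fcos_loss(cls_probs, cls_targets, reg_preds, reg_targets, lam=1.0):
    """Focal loss over all locations plus IoU loss over positive
    locations, both normalised by the number of positives (lambda = 1)."""
    pos = [t.any() for t in cls_targets]      # positive = non-background
    n_pos = max(1, sum(pos))
    cls = sum(focal_loss(p, t) for p, t in zip(cls_probs, cls_targets))
    reg = sum(iou_loss(rp, rt)
              for rp, rt, is_pos in zip(reg_preds, reg_targets, pos) if is_pos)
    return cls / n_pos + lam * reg / n_pos
```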