Forget the hassles of Anchor boxes with FCOS: Fully Convolutional One-Stage Object Detection


This article is a detailed explanation of a new object detection technique proposed in the paper FCOS: Fully Convolutional One-Stage Object Detection published at ICCV’19. I decided to summarize this paper because it proposes a really intuitive and simple technique that solves the object detection problem. Stick around to know how it works.

Contents

  1. Anchor-Based Detectors
  2. FCOS proposed idea
  3. Multi-level detection
  4. Centre-Ness for FCOS
  5. Experiments and comparison with Anchor based detectors
  6. Conclusion

Anchor-Based Detectors

Every popular object detection method in use today (Faster R-CNN, YOLOv3, SSD, RetinaNet, etc.) uses anchors. Anchors are essentially pre-defined candidate boxes, placed at different scales and aspect ratios to cover objects of various shapes and proportions. As their very definition suggests, using anchors involves a lot of hyperparameters: the number of anchors per section of the image, the aspect ratios of the boxes, the number of sections the image is divided into, and so on. Crucially, even slight changes to these hyperparameters can affect the end result. Furthermore, whether a candidate box is treated as a positive or a negative sample is decided by yet another hyperparameter, the Intersection over Union (IoU) threshold, and small changes to this threshold greatly alter which boxes are considered. The following image shows the use of anchor boxes in YOLOv3:

[Figure: anchor boxes of different scales and aspect ratios overlaid on an image, as used in YOLOv3]
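To make the IoU threshold mentioned above concrete, here is a minimal sketch of the IoU computation that anchor-based detectors threshold against; boxes are assumed to be (x0, y0, x1, y1) corner tuples:

```python
def iou(a, b):
    """Intersection over Union of two axis-aligned boxes (x0, y0, x1, y1).
    Anchor-based detectors compare this value against a threshold to label
    each anchor as a positive or negative training sample."""
    # Corners of the intersection rectangle (may be empty).
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(ix1 - ix0, 0) * max(iy1 - iy0, 0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

print(iou((0, 0, 10, 10), (0, 0, 10, 10)))   # 1.0 (identical boxes)
print(iou((0, 0, 10, 10), (5, 0, 15, 10)))   # ~0.333 (half overlap)
```

A typical scheme labels an anchor positive above one IoU threshold and negative below another, which is exactly the kind of sensitive hyperparameter FCOS removes.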

We have kept using this approach for one reason and one reason only: it carries over the idea from the approaches that came before. The first deep-learning object detectors borrowed the sliding-window concept from earlier detection models in classical computer vision. But with the computing power of multiple GPUs at our disposal, sliding windows are no longer necessary.

FCOS: Proposed Idea

This leads us to the question: why use anchors at all, and why not perform object detection the way segmentation does, i.e. pixel-wise? That is exactly what this paper proposes. Until now, with the sliding-window approach, there was no direct connection between the pixel-wise values of the image and the objects detected. Let us now see formally how this approach works.

Let Fᵢ be the feature maps at layer i of a backbone CNN, with total stride s up to that layer. Also, we define the ground-truth bounding boxes of the image as Bᵢ = (x⁰ᵢ, y⁰ᵢ, x¹ᵢ, y¹ᵢ, cᵢ) ∈ R⁴ × {1, 2, …, C}, where C is the number of classes. Here (x⁰ᵢ, y⁰ᵢ) and (x¹ᵢ, y¹ᵢ) denote the top-left and bottom-right corners, respectively. Each location (x, y) on the feature map can be mapped back to a pixel in the original image. This is similar (although not identical) to what we do in semantic segmentation. We map (x, y) on the feature map to the image point (floor(s/2) + x·s, floor(s/2) + y·s), which lies near the centre of the location's receptive field. I would encourage the reader to work through an example with an image of size (8, 8) and a feature map of size (4, 4) to really understand this mapping. With the help of this mapping, we can treat every location on the feature map as a training sample: a location (x, y) is a positive sample if it falls inside a ground-truth (GT from now on) bounding box, and its class label is then the class label of that GT box; otherwise it is a negative sample.
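The mapping above can be sketched in a few lines; the helper name is mine, and the (8, 8) image with a (4, 4) feature map is the worked example suggested in the text, which corresponds to a stride of s = 2:

```python
import math

def feature_map_to_image_coords(x, y, stride):
    """Map a feature-map location (x, y) to the image pixel near the centre
    of its receptive field: (floor(s/2) + x*s, floor(s/2) + y*s)."""
    offset = math.floor(stride / 2)
    return (offset + x * stride, offset + y * stride)

# An (8, 8) image with a (4, 4) feature map gives stride s = 2.
stride = 2
for fy in range(4):
    for fx in range(4):
        px, py = feature_map_to_image_coords(fx, fy, stride)
        print((fx, fy), "->", (px, py))
# Feature location (0, 0) maps to image pixel (1, 1),
# and (3, 3) maps to (7, 7) -- each mapped point sits at the centre
# of the 2x2 image patch that the feature location covers.
```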

Now that we know a point resides inside the GT bounding box, we need to estimate the dimensions of the box. This is done by regressing four values (l*, t*, r*, b*), defined as:

l* = x − x⁰ᵢ ; t* = y − y⁰ᵢ ; r* = x¹ᵢ − x ; b* = y¹ᵢ − y
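In other words, (l*, t*, r*, b*) are the distances from the location to the four sides of the GT box. A small sketch (function names and the example box are mine) shows that the mapping is invertible, which is what lets the detector recover a box from any single interior location:

```python
def regression_targets(x, y, box):
    """Distances from a location (x, y) inside a ground-truth box
    (x0, y0, x1, y1) to its four sides: (l*, t*, r*, b*)."""
    x0, y0, x1, y1 = box
    l = x - x0   # distance to the left edge
    t = y - y0   # distance to the top edge
    r = x1 - x   # distance to the right edge
    b = y1 - y   # distance to the bottom edge
    return (l, t, r, b)

def box_from_targets(x, y, targets):
    """Invert the mapping: recover the box from a location and (l, t, r, b)."""
    l, t, r, b = targets
    return (x - l, y - t, x + r, y + b)

box = (10, 20, 50, 80)            # hypothetical GT box
point = (30, 40)                  # a location inside it
targets = regression_targets(*point, box)
print(targets)                    # (20, 20, 20, 40)
print(box_from_targets(*point, targets))  # (10, 20, 50, 80)
```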

Eventually, as you will see, a regression-based calculation of these values is part of the loss function for the overall detection algorithm.

Now, because there are no anchors, there is no need to calculate IoU between anchors and GT bounding boxes to obtain the positive samples on which the regressor is trained. Instead, every positive-sample location (one that lies inside a GT box with the matching class label) takes part in the regression of the bounding box dimensions. This is one of the possible reasons FCOS works better than anchor-based detectors despite using far fewer parameters.
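A minimal sketch of this assignment rule (the function name is mine, and this simplified version ignores the ambiguous case where a location falls inside several overlapping GT boxes):

```python
def is_positive(px, py, box):
    """A mapped image location (px, py) is a positive sample if it falls
    inside the ground-truth box (x0, y0, x1, y1) -- no anchors and no IoU
    matching. Its class label is then the class of that box."""
    x0, y0, x1, y1 = box
    return x0 <= px <= x1 and y0 <= py <= y1

gt_box = (10, 20, 50, 80)            # hypothetical GT box
print(is_positive(30, 40, gt_box))   # True: inside the box -> positive sample
print(is_positive(5, 5, gt_box))     # False: outside -> negative sample
```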

For every location in a feature map, we compute a classification score, and for every location that is a positive sample, we also regress the box. Thus, the overall loss function becomes:

L({pₓ,ᵧ}, {tₓ,ᵧ}) = (1/N_pos) Σₓ,ᵧ L_cls(pₓ,ᵧ, c*ₓ,ᵧ) + (λ/N_pos) Σₓ,ᵧ 𝟙{c*ₓ,ᵧ > 0} L_reg(tₓ,ᵧ, t*ₓ,ᵧ)

For this paper, the value of λ is taken as 1.

The first term on the RHS is the classification loss at location (x, y); the standard focal loss used in RetinaNet is used here as well. The second term on the RHS regresses the bounding box; it is zero for locations that are not positive samples.
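Putting the pieces together, here is a minimal sketch of this loss under a few simplifying assumptions: each location carries a single binary class score instead of C per-class scores, the focal loss uses the RetinaNet defaults α = 0.25 and γ = 2, and the regression loss is an IoU loss on the (l, t, r, b) targets. All function names are mine:

```python
import math

def focal_loss(p, target, alpha=0.25, gamma=2.0, eps=1e-9):
    """Binary focal loss (as in RetinaNet) for a predicted probability p."""
    if target == 1:
        return -alpha * (1 - p) ** gamma * math.log(p + eps)
    return -(1 - alpha) * p ** gamma * math.log(1 - p + eps)

def iou_loss(pred, gt, eps=1e-9):
    """-log(IoU) between two (l, t, r, b) tuples at the same location."""
    pl, pt, pr, pb = pred
    gl, gtop, gr, gb = gt
    # The boxes share the location, so overlap is computed per side.
    inter = (min(pl, gl) + min(pr, gr)) * (min(pt, gtop) + min(pb, gb))
    union = (pl + pr) * (pt + pb) + (gl + gr) * (gtop + gb) - inter
    return -math.log(inter / (union + eps) + eps)

def fcos_loss(locations, lam=1.0):
    """locations: dicts with 'p' (class score), 'pos' (is positive sample),
    and for positives 'pred'/'gt' as (l, t, r, b) tuples. Both terms are
    normalised by the number of positive samples, and the regression term
    applies only to positive locations."""
    n_pos = max(sum(loc["pos"] for loc in locations), 1)
    cls = sum(focal_loss(loc["p"], int(loc["pos"])) for loc in locations)
    reg = sum(iou_loss(loc["pred"], loc["gt"])
              for loc in locations if loc["pos"])
    return cls / n_pos + lam * reg / n_pos
```

With λ = 1, a perfectly regressed positive location contributes only its (small) focal-loss term, since −log(IoU) vanishes when the predicted and GT boxes coincide.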

