Forget the hassles of Anchor boxes with FCOS: Fully Convolutional One-Stage Object Detection



This article is a detailed explanation of a new object detection technique proposed in the paper FCOS: Fully Convolutional One-Stage Object Detection published at ICCV’19. I decided to summarize this paper because it proposes a really intuitive and simple technique that solves the object detection problem. Stick around to know how it works.

Contents

  1. Anchor-Based Detectors
  2. FCOS: Proposed Idea
  3. Multi-level detection
  4. Centre-Ness for FCOS
  5. Experiments and comparison with Anchor based detectors
  6. Conclusion

Anchor-Based Detectors

Every famous object detection method in use nowadays (Faster R-CNN, YOLOv3, SSD, RetinaNet, etc.) uses anchors. Anchors are basically pre-defined candidate boxes that serve as training samples. They come in different scales and aspect ratios to accommodate various kinds of objects and their proportions. However, as is clear just from their definition, using anchors involves a lot of hyper-parameters: for example, the number of anchors per section of the image, the aspect ratios of the boxes, and the number of sections an image is divided into. Most importantly, even the slightest change to these hyper-parameters impacts the end result. Further, whether a bounding box is considered a negative or a positive sample is decided by a threshold on yet another quantity, the Intersection over Union (IoU), and the choice of IoU threshold greatly changes which boxes are considered. Following is a simple image describing the use of anchor boxes in YOLOv3:
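To make the IoU-based assignment concrete, here is a minimal sketch of the IoU computation that anchor-based detectors use to label anchors as positive or negative samples. Boxes are given as (x0, y0, x1, y1) with (x0, y0) the top-left corner; the function names are my own for illustration.

```python
def iou(box_a, box_b):
    # Intersection rectangle between the two boxes.
    ix0 = max(box_a[0], box_b[0])
    iy0 = max(box_a[1], box_b[1])
    ix1 = min(box_a[2], box_b[2])
    iy1 = min(box_a[3], box_b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

anchor = (0, 0, 10, 10)
gt_box = (5, 5, 15, 15)
print(iou(anchor, gt_box))  # 25 / 175 ≈ 0.1429
```

With a typical positive threshold of 0.5, this anchor would be labelled negative even though it overlaps the object; this sensitivity to the threshold is exactly the hyper-parameter burden described above.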

[Figure: anchor boxes of different scales and aspect ratios overlaid on an image, as used in YOLOv3]

We have been using this approach for one reason and one reason only: it continues the idea that the previous approaches used. The first object detectors borrowed the concept of the sliding window from earlier detection models in classical computer vision. But there is no need for sliding windows now that we have the computing power of multiple GPUs at our disposal.

FCOS: Proposed Idea

This leads us to the point: why even use anchors? Why not perform object detection just like segmentation, i.e. pixel-wise? This is exactly what this paper proposes. Until now, with the sliding-window approach, there was no direct connection between the pixel-by-pixel values of the image and the objects detected. Let us now formally see how this approach works.

Let Fᵢ be the feature map at layer i of a backbone CNN, with total stride s up to that layer. We also define the ground-truth bounding boxes of the image as Bᵢ = (x⁰ᵢ, y⁰ᵢ, x¹ᵢ, y¹ᵢ, cᵢ) ∈ ℝ⁴ × {1, 2, …, C}, where C is the number of classes, and (x⁰ᵢ, y⁰ᵢ) and (x¹ᵢ, y¹ᵢ) denote the top-left and bottom-right corners respectively. Each location (x, y) on the feature map can be mapped back to a pixel in the original image. This is similar (although not identical) to what we do in semantic segmentation. We map (x, y) on the feature map to the image point (floor(s/2) + x·s, floor(s/2) + y·s), which lies near the centre of the location's receptive field. I would encourage the reader to work through an example with an image of size (8, 8) and a feature map of size (4, 4) to actually understand this mapping. With the help of this mapping, we are able to treat every location on the feature map as a training sample: a location (x, y) is a positive sample if it falls inside a ground-truth (GT from now on) bounding box, in which case its class label is the class label of that GT box; otherwise it is a negative sample (background).
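The mapping above can be sketched in a few lines, using the suggested example of an (8, 8) image with a (4, 4) feature map, i.e. stride s = 2 (the function name is mine, not the paper's):

```python
import math

def map_to_image(x, y, s):
    # Map a feature-map location (x, y) to a pixel near the centre
    # of its receptive field in the input image, given total stride s.
    return (math.floor(s / 2) + x * s, math.floor(s / 2) + y * s)

s = 2  # stride for an 8x8 image with a 4x4 feature map
for y in range(4):
    print([map_to_image(x, y, s) for x in range(4)])
# Feature-map location (0, 0) maps to pixel (1, 1), (1, 0) to (3, 1),
# and so on: each cell lands in the middle of its 2x2 image patch.
```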

Once we know that a point resides inside a GT bounding box, we still need to evaluate the dimensions of the box. This is done through a regression on four values (l*, t*, r*, b*), defined as:

l* = x − x⁰ᵢ ; t* = y − y⁰ᵢ ; r* = x¹ᵢ − x ; b* = y¹ᵢ − y
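In other words, the four targets are simply the distances from the location to the left, top, right, and bottom sides of the box. A minimal sketch (with an illustrative function name of my own):

```python
def regression_targets(x, y, box):
    # box is a GT bounding box (x0, y0, x1, y1) containing (x, y).
    x0, y0, x1, y1 = box
    l = x - x0   # distance to the left edge
    t = y - y0   # distance to the top edge
    r = x1 - x   # distance to the right edge
    b = y1 - y   # distance to the bottom edge
    return l, t, r, b

# Location (4, 3) inside the box (1, 1, 7, 6):
print(regression_targets(4, 3, (1, 1, 7, 6)))  # (3, 2, 3, 3)
```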

Eventually, as you will see, a regression-based calculation of these values is part of the loss function for the overall detection algorithm.

Now, because there are no anchors, there is no need to calculate IoU between anchors and GT bounding boxes in order to obtain the positive samples on which the regressor is trained. Instead, every location that is a positive sample (by being inside a GT box) takes part in the regression of the bounding box dimensions. This is one possible reason why FCOS works better than anchor-based detectors even while using far fewer parameters.
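The anchor-free labelling rule can be sketched as follows. Note this simplified version ignores the ambiguity of a location falling inside multiple overlapping GT boxes; the names here are illustrative, and class 0 is used for background:

```python
def label_location(x, y, gt_boxes):
    # gt_boxes pairs each box (x0, y0, x1, y1) with its class label.
    # A location is positive iff it falls inside some GT box:
    # no anchors, no IoU thresholds.
    for (x0, y0, x1, y1), cls in gt_boxes:
        if x0 <= x <= x1 and y0 <= y <= y1:
            return cls  # positive sample: regress this box
    return 0  # background / negative sample

gt_boxes = [((2, 2, 6, 6), 1), ((8, 8, 12, 12), 3)]
print(label_location(4, 4, gt_boxes))   # 1 (inside the first box)
print(label_location(7, 7, gt_boxes))   # 0 (background)
```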

For every location in a feature map, we calculate the classification score and for every location that is a positive sample, we regress. Thus, the overall loss function becomes:

L({pₓ,ᵧ}, {tₓ,ᵧ}) = (1/N_pos) Σₓ,ᵧ L_cls(pₓ,ᵧ, c*ₓ,ᵧ) + (λ/N_pos) Σₓ,ᵧ 𝟙{c*ₓ,ᵧ > 0} · L_reg(tₓ,ᵧ, t*ₓ,ᵧ)

For this paper, the value of λ is taken as 1.

The first part of the RHS is the classification loss for location (x, y); the standard focal loss used in RetinaNet is used here as well. The second part of the RHS is for regressing the bounding box, and it is zero for locations that are not positive samples (the indicator 𝟙{c* > 0} is zero for background).

