Forget the hassles of Anchor boxes with FCOS: Fully Convolutional One-Stage Object Detection


This article is a detailed explanation of a new object detection technique proposed in the paper FCOS: Fully Convolutional One-Stage Object Detection published at ICCV’19. I decided to summarize this paper because it proposes a really intuitive and simple technique that solves the object detection problem. Stick around to know how it works.

Contents

  1. Anchor-Based Detectors
  2. FCOS: Proposed Idea
  3. Multi-level detection
  4. Centre-ness for FCOS
  5. Experiments and comparison with Anchor based detectors
  6. Conclusion

Anchor-Based Detectors

Every popular object detection method in use today (Faster R-CNN, YOLOv3, SSD, RetinaNet, etc.) uses anchors. Anchors are essentially pre-defined training samples: candidate boxes that come in different scales and aspect ratios to cover objects of varying shapes and sizes. However, as is clear from their very definition, using anchors introduces a lot of hyperparameters: the number of anchors per location on the image, the aspect ratios of the boxes, the number of scales the image is divided into, and so on. Crucially, even slight changes to these hyperparameters affect the end result. Furthermore, whether a box is treated as a positive or a negative training sample is decided by yet another hyperparameter, the Intersection over Union (IoU) threshold, and small changes to its value greatly alter which boxes are considered. The following image illustrates the use of anchor boxes in YOLOv3; a sketch of the IoU-based assignment follows the figure:

[Figure: anchor boxes in YOLOv3]
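To make the hyperparameter burden concrete, here is a minimal sketch (my own illustration, not code from any particular detector) of the IoU-based assignment an anchor-based pipeline uses to label anchors as positive or negative. The thresholds 0.5 and 0.4 are assumed values in the spirit of RetinaNet:

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x0, y0, x1, y1)."""
    x0 = max(box_a[0], box_b[0]); y0 = max(box_a[1], box_b[1])
    x1 = min(box_a[2], box_b[2]); y1 = min(box_a[3], box_b[3])
    inter = max(0.0, x1 - x0) * max(0.0, y1 - y0)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def label_anchors(anchors, gt_boxes, pos_thresh=0.5, neg_thresh=0.4):
    """Label each anchor 1 (positive), 0 (negative), or -1 (ignored).
    The two thresholds are yet more hyperparameters the result is sensitive to."""
    labels = []
    for a in anchors:
        best = max(iou(a, g) for g in gt_boxes)
        if best >= pos_thresh:
            labels.append(1)
        elif best < neg_thresh:
            labels.append(0)
        else:
            labels.append(-1)  # in-between IoUs are typically ignored in training
    return labels
```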

We have been using this approach for one reason and one reason only: continuity with earlier approaches. The first deep object detectors borrowed the sliding-window concept from classical computer-vision detection models. But with the computing power of multiple GPUs at our disposal, there is no longer any need for sliding windows.

FCOS: Proposed Idea

This raises the question: why use anchors at all, and why not perform object detection the way segmentation does, i.e. pixel-wise? That is exactly what this paper proposes. Until now, with the sliding-window approach, there was no direct connection between the individual pixels of the image and the objects detected. Let us now see formally how this approach works.

Let Fᵢ be the feature map at layer i of a backbone CNN, and let s be the total stride up to that layer. We define the ground-truth bounding boxes of the image as Bᵢ = (x⁰ᵢ, y⁰ᵢ, x¹ᵢ, y¹ᵢ, cᵢ) ∈ ℝ⁴ × {1, 2, …, C}, where C is the number of classes, and (x⁰ᵢ, y⁰ᵢ) and (x¹ᵢ, y¹ᵢ) denote the top-left and bottom-right corners respectively. Each location (x, y) on the feature map can be traced back to a pixel in the original image, similar (although not identical) to what we do in semantic segmentation. We map (x, y) on the feature map to the image point (floor(s/2) + x·s, floor(s/2) + y·s), which lies near the centre of the location's receptive field. I would encourage the reader to work through an example with an image of size (8, 8) and a feature map of size (4, 4) to really understand this mapping; a small sketch follows below. With this mapping, we can treat every feature-map location directly as a training sample: a location (x, y) is a positive sample if it falls inside a ground-truth (GT from now on) bounding box, in which case its class label is the class label of that GT box; otherwise it is a negative sample.
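As a quick sanity check of this mapping, here is a tiny sketch for the suggested example of an (8, 8) image with a (4, 4) feature map, which implies a total stride of s = 2 (the function is hypothetical, written only for illustration):

```python
def feature_to_image_coords(h, w, s):
    """Map each feature-map location (x, y) to the image point
    (floor(s/2) + x*s, floor(s/2) + y*s), near the centre of its receptive field."""
    return {(x, y): (s // 2 + x * s, s // 2 + y * s)
            for y in range(h) for x in range(w)}

# A 4x4 feature map from an 8x8 image implies a total stride of s = 2.
coords = feature_to_image_coords(4, 4, 2)
print(coords[(0, 0)])  # (1, 1)
print(coords[(3, 3)])  # (7, 7)
```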

Now that we know a point resides inside a GT bounding box, we need to recover the extent of the box. This is done by regressing four values (l*, t*, r*, b*), defined as:

l* = x − x⁰ᵢ ; t* = y − y⁰ᵢ ; r* = x¹ᵢ − x ; b* = y¹ᵢ − y
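Here is a minimal sketch of these targets in code (the helper name is hypothetical and exists only to illustrate the definitions above):

```python
def regression_targets(x, y, box):
    """Distances from location (x, y) to the four sides of a
    ground-truth box given as (x0, y0, x1, y1)."""
    x0, y0, x1, y1 = box
    l = x - x0   # distance to the left edge
    t = y - y0   # distance to the top edge
    r = x1 - x   # distance to the right edge
    b = y1 - y   # distance to the bottom edge
    return l, t, r, b

# A location inside the box yields four positive distances:
print(regression_targets(5, 4, (2, 1, 9, 6)))  # (3, 3, 4, 2)
```

Note that all four distances are positive exactly when (x, y) lies inside the box, which is the same condition that makes the location a positive sample.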

Eventually, as you will see, the regression of these four values forms part of the loss function for the overall detection algorithm.

Now, because there are no anchors, there is no need to compute IoU between anchors and GT bounding boxes to decide which samples are positive and can train the regressor. Instead, every positive location (one lying inside a GT box with the matching class label) takes part in the regression of the bounding-box dimensions. This is one possible reason why FCOS works better than anchor-based detectors despite using far fewer parameters.

For every location on the feature map we compute a classification score, and for every location that is a positive sample we additionally regress the box. Thus, the overall loss function becomes:

L({p(x,y)}, {t(x,y)}) = (1/N_pos) · Σ(x,y) L_cls(p(x,y), c*(x,y)) + (λ/N_pos) · Σ(x,y) 1{c*(x,y) > 0} · L_reg(t(x,y), t*(x,y))

For this paper, the value of λ is taken as 1.

The first term on the RHS is the classification loss at location (x, y); the standard focal loss from RetinaNet is used here as well. The second term on the RHS regresses the bounding box using the IoU loss; as the indicator function shows, it is zero for locations that are not positive samples.
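To tie the two terms together, here is a hedged NumPy sketch of this loss. The focal-loss defaults (α = 0.25, γ = 2) come from RetinaNet and the regression term uses the IoU loss over the (l, t, r, b) distances as in the paper, but the tensor layout and function names are my own simplification:

```python
import numpy as np

def focal_loss(p, target, alpha=0.25, gamma=2.0, eps=1e-9):
    """Binary focal loss over per-class probabilities p, shape (N, C),
    with one-hot targets of the same shape."""
    pt = np.where(target == 1, p, 1 - p)
    at = np.where(target == 1, alpha, 1 - alpha)
    return -(at * (1 - pt) ** gamma * np.log(pt + eps)).sum()

def iou_loss(pred, target, eps=1e-9):
    """IoU loss on (l, t, r, b) distances, shape (N, 4); since both boxes
    share the same anchor point, the overlap is computed edge-wise."""
    p_l, p_t, p_r, p_b = pred.T
    t_l, t_t, t_r, t_b = target.T
    inter = (np.minimum(p_l, t_l) + np.minimum(p_r, t_r)) * \
            (np.minimum(p_t, t_t) + np.minimum(p_b, t_b))
    union = (p_l + p_r) * (p_t + p_b) + (t_l + t_r) * (t_t + t_b) - inter
    return -np.log(inter / (union + eps) + eps).sum()

def fcos_loss(cls_probs, cls_targets, reg_preds, reg_targets, pos_mask, lam=1.0):
    """Total loss: focal loss over all locations plus IoU loss over the
    positive locations only, both normalised by the number of positives."""
    n_pos = max(pos_mask.sum(), 1)
    loss_cls = focal_loss(cls_probs, cls_targets) / n_pos
    loss_reg = lam * iou_loss(reg_preds[pos_mask], reg_targets[pos_mask]) / n_pos
    return loss_cls + loss_reg
```

With λ = 1, as used in the paper, the two terms are weighted equally.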

