Forget the hassles of Anchor boxes with FCOS: Fully Convolutional One-Stage Object Detection


This article is a detailed explanation of a new object detection technique proposed in the paper FCOS: Fully Convolutional One-Stage Object Detection published at ICCV’19. I decided to summarize this paper because it proposes a really intuitive and simple technique that solves the object detection problem. Stick around to know how it works.

Contents

  1. Anchor-Based Detectors
  2. FCOS: Proposed Idea
  3. Multi-level detection
  4. Centre-ness for FCOS
  5. Experiments and comparison with Anchor based detectors
  6. Conclusion

Anchor-Based Detectors

Every popular object detection method in use today (Faster R-CNN, YOLOv3, SSD, RetinaNet, etc.) uses anchors. Anchors are essentially pre-defined training samples: candidate boxes that come in different scales and aspect ratios to cover objects of varying shapes and sizes. However, as is clear from their very definition, using anchors introduces a lot of hyperparameters: the number of anchors per location on the image, the aspect ratios of the boxes, the number of scales the image is divided into, and so on. Crucially, even slight changes to these hyperparameters affect the end result. Furthermore, whether a box is treated as a positive or a negative training sample is decided by yet another hyperparameter, the Intersection over Union (IoU) threshold, and small changes to its value greatly alter which boxes are considered. The following image illustrates the use of anchor boxes in YOLOv3; a sketch of the IoU-based assignment follows the figure:

[Figure: anchor boxes in YOLOv3]
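To make the hyperparameter burden concrete, here is a minimal sketch (my own illustration, not code from any particular detector) of the IoU-based assignment an anchor-based pipeline uses to label anchors as positive or negative. The thresholds 0.5 and 0.4 are assumed values in the spirit of RetinaNet:

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x0, y0, x1, y1)."""
    x0 = max(box_a[0], box_b[0]); y0 = max(box_a[1], box_b[1])
    x1 = min(box_a[2], box_b[2]); y1 = min(box_a[3], box_b[3])
    inter = max(0.0, x1 - x0) * max(0.0, y1 - y0)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def label_anchors(anchors, gt_boxes, pos_thresh=0.5, neg_thresh=0.4):
    """Label each anchor 1 (positive), 0 (negative), or -1 (ignored).
    The two thresholds are yet more hyperparameters the result is sensitive to."""
    labels = []
    for a in anchors:
        best = max(iou(a, g) for g in gt_boxes)
        if best >= pos_thresh:
            labels.append(1)
        elif best < neg_thresh:
            labels.append(0)
        else:
            labels.append(-1)  # in-between IoUs are typically ignored in training
    return labels
```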

We have been using this approach for one reason and one reason only: continuity with earlier approaches. The first deep object detectors borrowed the sliding-window concept from classical computer-vision detection models. But with the computing power of multiple GPUs at our disposal, there is no longer any need for sliding windows.

FCOS: Proposed Idea

This raises the question: why use anchors at all, and why not perform object detection the way segmentation does, i.e. pixel-wise? That is exactly what this paper proposes. Until now, with the sliding-window approach, there was no direct connection between the individual pixels of the image and the objects detected. Let us now see formally how this approach works.

Let Fᵢ be the feature map at layer i of a backbone CNN, and let s be the total stride up to that layer. We define the ground-truth bounding boxes of the image as Bᵢ = (x⁰ᵢ, y⁰ᵢ, x¹ᵢ, y¹ᵢ, cᵢ) ∈ ℝ⁴ × {1, 2, …, C}, where C is the number of classes, and (x⁰ᵢ, y⁰ᵢ) and (x¹ᵢ, y¹ᵢ) denote the top-left and bottom-right corners respectively. Each location (x, y) on the feature map can be traced back to a pixel in the original image, similar (although not identical) to what we do in semantic segmentation. We map (x, y) on the feature map to the image point (floor(s/2) + x·s, floor(s/2) + y·s), which lies near the centre of the location's receptive field. I would encourage the reader to work through an example with an image of size (8, 8) and a feature map of size (4, 4) to really understand this mapping; a small sketch follows below. With this mapping, we can treat every feature-map location directly as a training sample: a location (x, y) is a positive sample if it falls inside a ground-truth (GT from now on) bounding box, in which case its class label is the class label of that GT box; otherwise it is a negative sample.
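As a quick sanity check of this mapping, here is a tiny sketch for the suggested example of an (8, 8) image with a (4, 4) feature map, which implies a total stride of s = 2 (the function is hypothetical, written only for illustration):

```python
def feature_to_image_coords(h, w, s):
    """Map each feature-map location (x, y) to the image point
    (floor(s/2) + x*s, floor(s/2) + y*s), near the centre of its receptive field."""
    return {(x, y): (s // 2 + x * s, s // 2 + y * s)
            for y in range(h) for x in range(w)}

# A 4x4 feature map from an 8x8 image implies a total stride of s = 2.
coords = feature_to_image_coords(4, 4, 2)
print(coords[(0, 0)])  # (1, 1)
print(coords[(3, 3)])  # (7, 7)
```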

Now that we know a point resides inside a GT bounding box, we need to recover the extent of the box. This is done by regressing four values (l*, t*, r*, b*), defined as:

l* = x − x⁰ᵢ ; t* = y − y⁰ᵢ ; r* = x¹ᵢ − x ; b* = y¹ᵢ − y
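Here is a minimal sketch of these targets in code (the helper name is hypothetical and exists only to illustrate the definitions above):

```python
def regression_targets(x, y, box):
    """Distances from location (x, y) to the four sides of a
    ground-truth box given as (x0, y0, x1, y1)."""
    x0, y0, x1, y1 = box
    l = x - x0   # distance to the left edge
    t = y - y0   # distance to the top edge
    r = x1 - x   # distance to the right edge
    b = y1 - y   # distance to the bottom edge
    return l, t, r, b

# A location inside the box yields four positive distances:
print(regression_targets(5, 4, (2, 1, 9, 6)))  # (3, 3, 4, 2)
```

Note that all four distances are positive exactly when (x, y) lies inside the box, which is the same condition that makes the location a positive sample.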

Eventually, as you will see, the regression of these four values forms part of the loss function for the overall detection algorithm.

Now, because there are no anchors, there is no need to compute IoU between anchors and GT bounding boxes to decide which samples are positive and can train the regressor. Instead, every positive location (one lying inside a GT box with the matching class label) takes part in the regression of the bounding-box dimensions. This is one possible reason why FCOS works better than anchor-based detectors despite using far fewer parameters.

For every location on the feature map we compute a classification score, and for every location that is a positive sample we additionally regress the box. Thus, the overall loss function becomes:

L({p(x,y)}, {t(x,y)}) = (1/N_pos) · Σ(x,y) L_cls(p(x,y), c*(x,y)) + (λ/N_pos) · Σ(x,y) 1{c*(x,y) > 0} · L_reg(t(x,y), t*(x,y))

For this paper, the value of λ is taken as 1.

The first term on the RHS is the classification loss at location (x, y); the standard focal loss from RetinaNet is used here as well. The second term on the RHS regresses the bounding box using the IoU loss; as the indicator function shows, it is zero for locations that are not positive samples.
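To tie the two terms together, here is a hedged NumPy sketch of this loss. The focal-loss defaults (α = 0.25, γ = 2) come from RetinaNet and the regression term uses the IoU loss over the (l, t, r, b) distances as in the paper, but the tensor layout and function names are my own simplification:

```python
import numpy as np

def focal_loss(p, target, alpha=0.25, gamma=2.0, eps=1e-9):
    """Binary focal loss over per-class probabilities p, shape (N, C),
    with one-hot targets of the same shape."""
    pt = np.where(target == 1, p, 1 - p)
    at = np.where(target == 1, alpha, 1 - alpha)
    return -(at * (1 - pt) ** gamma * np.log(pt + eps)).sum()

def iou_loss(pred, target, eps=1e-9):
    """IoU loss on (l, t, r, b) distances, shape (N, 4); since both boxes
    share the same anchor point, the overlap is computed edge-wise."""
    p_l, p_t, p_r, p_b = pred.T
    t_l, t_t, t_r, t_b = target.T
    inter = (np.minimum(p_l, t_l) + np.minimum(p_r, t_r)) * \
            (np.minimum(p_t, t_t) + np.minimum(p_b, t_b))
    union = (p_l + p_r) * (p_t + p_b) + (t_l + t_r) * (t_t + t_b) - inter
    return -np.log(inter / (union + eps) + eps).sum()

def fcos_loss(cls_probs, cls_targets, reg_preds, reg_targets, pos_mask, lam=1.0):
    """Total loss: focal loss over all locations plus IoU loss over the
    positive locations only, both normalised by the number of positives."""
    n_pos = max(pos_mask.sum(), 1)
    loss_cls = focal_loss(cls_probs, cls_targets) / n_pos
    loss_reg = lam * iou_loss(reg_preds[pos_mask], reg_targets[pos_mask]) / n_pos
    return loss_cls + loss_reg
```

With λ = 1, as used in the paper, the two terms are weighted equally.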

