Neural Networks Intuitions: 8. Translation Invariance in Object Detectors
May 3 · 4 min read
Hello everyone!
This article is going to be a short one. It focuses on a subtle but often overlooked concept in object detectors, especially in single-shot detectors: Translation Invariance.
Let’s understand what translation invariance is and what makes an image classifier/object detector translation invariant.
*Note: This article assumes you have background knowledge of how single and two-stage detectors work :-)
Translation Invariance:
Translation in computer vision means displacement in space, and invariance means the property of being unchanged.
Therefore, when we say an image classifier or an object detector is translation invariant, it means:
An image classifier can predict a class accurately regardless of where the class (more specifically, the pattern) is located along the image's spatial dimensions. Similarly, a detector can detect an object irrespective of where it is present in the image.
Let us look at an example for each of these problems to make things clear.
In this article, we will consider only Convolutional Neural Nets, whether classifiers or detectors, and see whether they are translation invariant or not!
Translation Invariance in Convolutional Classifiers:
Are CNNs translation invariant? If so, what makes them invariant to translation?
First, CNNs are not completely translation invariant, but only to some extent. Second, it is 'pooling' that makes them translation invariant, not the convolution operation (applying filters).
The above statement applies only to classifiers and not to object detectors.
If we read Hinton's paper on translation invariance in CNNs, he clearly states that the pooling layer was introduced to reduce computational complexity and that translation invariance was only a by-product of it.
One can make CNNs completely translation invariant by feeding them the right kind of data (for example, training images augmented with translated copies), although this may not be 100% feasible.
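To see the distinction concretely, here is a minimal sketch (assuming PyTorch is available): convolution alone only shifts its response along with the input pattern, while pooling over locations discards the position entirely.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# A 1x1x8x8 "image" containing a single bright pixel, plus a copy
# shifted one pixel to the right.
x = torch.zeros(1, 1, 8, 8)
x[0, 0, 3, 3] = 1.0
x_shift = torch.roll(x, shifts=1, dims=3)

# One random 3x3 filter, shared by both inputs.
w = torch.randn(1, 1, 3, 3)
y = F.conv2d(x, w, padding=1)
y_shift = F.conv2d(x_shift, w, padding=1)

# Convolution alone is equivariant, not invariant: the response map
# simply shifts along with the input pattern.
print(torch.allclose(y, y_shift))                                 # False
print(torch.allclose(y, torch.roll(y_shift, shifts=-1, dims=3)))  # True

# Global max pooling over locations discards the position, so the pooled
# feature (what a classifier head would see) is identical for both inputs.
print(torch.allclose(y.amax(dim=(2, 3)), y_shift.amax(dim=(2, 3))))  # True
```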
Note: I won’t be addressing the question of how pooling makes CNNs translation invariant. You can check it out in the links below :-)
Translation Invariance in Two-stage Detectors:
Two-stage object detectors have the following components:
- Region proposal stage
- Classification stage
The first stage predicts the locations of objects of interest (i.e., region proposals) and the second stage classifies those region proposals.
We can see that the first stage predicts foreground object locations, which means the problem is now reduced to image classification, performed by the second stage. This reduction makes a two-stage detector translation invariant without introducing any explicit changes to the neural network architecture.
This decoupling of the object's class prediction from its bounding box prediction is what makes a two-stage detector translation invariant!
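As a rough illustration of this two-stage flow, here is a schematic Python sketch; `backbone`, `propose_regions`, `crop_and_resize`, and `classify` are hypothetical placeholders, not a real library API.

```python
# A schematic sketch of the two-stage pipeline described above.
# All four callables are hypothetical placeholders.
def two_stage_detect(image, backbone, propose_regions, crop_and_resize, classify):
    features = backbone(image)           # shared conv features
    boxes = propose_regions(features)    # stage 1: WHERE are the objects?
    detections = []
    for box in boxes:
        crop = crop_and_resize(features, box)  # e.g., RoI pooling/align
        label, score = classify(crop)          # stage 2: WHAT is in the box?
        detections.append((box, label, score))
    return detections
```

Because stage 2 only ever sees a cropped, resized region, its classification job no longer depends on where in the image that region came from.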
Translation Invariance in Single-stage Detectors:
Now that we have looked into two-stage detectors, we know that a single-stage detector needs to couple box and class predictions. One way of doing that is to make dense predictions (anchors) on a feature map, i.e., at every grid cell or group of cells on the feature map.
Read the following article, where I explain Anchors in depth: Neural Networks Intuitions: 5. Anchors and Object Detection.
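To make "dense predictions at every grid cell" concrete, here is a minimal sketch (assuming PyTorch) that generates one anchor box per feature-map cell; `make_anchors` and its parameter values are illustrative, and real detectors use several scales and aspect ratios per cell.

```python
import torch

# Minimal dense anchor generation: one anchor of a single scale per
# feature-map cell. Illustrative only; real single-stage detectors
# attach several anchors of varying scale/aspect ratio to each cell.
def make_anchors(feat_h, feat_w, stride, size):
    ys = torch.arange(feat_h) * stride + stride / 2  # cell centers (image coords)
    xs = torch.arange(feat_w) * stride + stride / 2
    cy, cx = torch.meshgrid(ys, xs, indexing="ij")
    half = size / 2
    # (feat_h * feat_w, 4) boxes as (x1, y1, x2, y2)
    return torch.stack([cx - half, cy - half, cx + half, cy + half],
                       dim=-1).reshape(-1, 4)

anchors = make_anchors(feat_h=7, feat_w=7, stride=32, size=64)
print(anchors.shape)  # torch.Size([49, 4]): one box per grid cell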
Since these dense predictions are made by convolving filters over feature maps, the network can detect the same pattern when it occurs at a different location on the feature map.
For example, let us consider a neural network trained to detect dogs in an image, where the filters in the final conv layer are responsible for recognizing dog patterns.
Suppose we feed the network training data in which the dog always appears on the left side of the image, and then test it with an image where the dog appears on the right side.
One of the filters in the last layer learns the dog pattern, and since that same filter is convolved across the entire feature map and a prediction is made at every location, it recognizes the same dog pattern in the new location!
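Here is a toy sketch of that argument (assuming PyTorch, and using a hand-made 3x3 "dog" template in place of a learned filter): the same filter is slid over the whole image, so the strongest response simply follows the pattern.

```python
import torch
import torch.nn.functional as F

# A fixed 3x3 "dog" template used as a conv filter (a stand-in for a
# learned filter). It fires wherever the pattern appears.
template = torch.tensor([[0., 1., 0.],
                         [1., 1., 1.],
                         [0., 1., 0.]])
w = template.view(1, 1, 3, 3)

def paste(img, pattern, row, col):
    img[0, 0, row:row + 3, col:col + 3] = pattern
    return img

left = paste(torch.zeros(1, 1, 12, 12), template, 4, 1)   # "dog" on the left
right = paste(torch.zeros(1, 1, 12, 12), template, 4, 8)  # "dog" on the right

for img in (left, right):
    resp = F.conv2d(img, w)          # slide the same filter everywhere
    idx = resp.flatten().argmax()
    r, c = divmod(idx.item(), resp.shape[-1])
    print(f"strongest response at feature cell ({r}, {c})")
# The peak moves from column 1 to column 8: same filter, new location.
```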
Finally, to answer the question "Why do filters make detectors translation invariant but not classifiers?":
Filters in conv nets learn local features in an image rather than taking in the global context. Since object detection is about detecting local features (objects) in the image, rather than making a single prediction from the entire feature map (which is what happens in an image classifier), filters make detectors invariant to translation.
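As one last hedged sketch (again assuming PyTorch), compare a classifier-style fully connected head, which mixes all locations into a single global prediction, with a detector-style 1x1 conv head, which predicts at every location independently.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# The same feature map with the "object" features on the left,
# plus a copy shifted three cells to the right.
feat = torch.zeros(1, 8, 6, 6)
feat[0, :, 2, 1] = 1.0
feat_shift = torch.roll(feat, shifts=3, dims=3)

fc_head = nn.Linear(8 * 6 * 6, 1)           # classifier-style global head
conv_head = nn.Conv2d(8, 1, kernel_size=1)  # detector-style dense head

# The FC head flattens away spatial structure: its output generally
# differs when the same features appear at a different position.
print(fc_head(feat.flatten(1)).item(), fc_head(feat_shift.flatten(1)).item())

# The conv head produces the same score, just at the shifted cell.
out, out_shift = conv_head(feat), conv_head(feat_shift)
print(torch.allclose(out, torch.roll(out_shift, shifts=-3, dims=3)))  # True
```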
That's all in this eighth instalment of my series. I hope you folks were able to get a good grasp of translation invariance in general and of what makes a detector invariant to object translation in images. Please feel free to correct me if I am wrong :-)
Cheers!