A reading guide about Deep Learning with CNNs
Part II: Image segmentation
Jun 8 · 5 min read
Welcome back to Part II of this series. If you missed the first part, have a look here: Part I: Image recognition and convolutional backbones.
In this part, you will find a guide through the literature on image segmentation with convolutional neural networks (CNNs) up to 2019. It adds non-scientific sources to this open access review paper [1] to further build an intuitive understanding of the evolution of CNNs.
As in Part I, you can find the tables of the sources in this GitHub repository:
Now, let’s dive into the next chapter of our adventure of deep learning with CNNs.
A rough overview about image segmentation with CNNs
During image segmentation, a single class is predicted for each pixel, like this:
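To make that prediction target concrete, here is a minimal NumPy sketch (with made-up scores and class names) of turning per-pixel class scores into a segmentation mask:

```python
import numpy as np

# Hypothetical network output: per-pixel scores for 3 classes
# (e.g. background, road, building) on a tiny 2x2 image.
scores = np.array([
    [[0.1, 0.7, 0.2], [0.8, 0.1, 0.1]],
    [[0.2, 0.2, 0.6], [0.9, 0.05, 0.05]],
])  # shape (H, W, num_classes)

# Image segmentation: one discrete class label per pixel.
mask = scores.argmax(axis=-1)
print(mask)  # [[1 0]
             #  [2 0]]
```

The result has the same spatial resolution as the input, but each entry is a class index instead of a color value.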
When CNNs, which we discussed in Part I, became more popular, they were first used for so-called patch-based image segmentation. Here, a CNN moves over the input image in a sliding-window fashion and predicts the class of the patch's center pixel or of the complete patch (a patch being a small part of the whole image).
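The sliding-window idea can be sketched in a few lines of NumPy. The `classify_patch` function below is only a stand-in for a trained CNN (here a simple threshold on the patch mean), so the loop structure, not the classifier, is the point:

```python
import numpy as np

def classify_patch(patch):
    # Stand-in for a trained CNN: threshold the patch mean
    # to produce a binary class for the center pixel.
    return int(patch.mean() > 0.5)

def patch_based_segmentation(image, patch_size=3):
    # Pad so every pixel can sit at the center of a patch.
    pad = patch_size // 2
    padded = np.pad(image, pad, mode="reflect")
    mask = np.zeros(image.shape, dtype=int)
    # Moving window: classify the center pixel of each patch.
    for i in range(image.shape[0]):
        for j in range(image.shape[1]):
            patch = padded[i:i + patch_size, j:j + patch_size]
            mask[i, j] = classify_patch(patch)
    return mask

image = np.zeros((4, 4))
image[:, 2:] = 1.0  # right half "foreground"
print(patch_based_segmentation(image))  # each row: [0 0 1 1]
```

Note the inefficiency this approach implies: the network runs once per pixel, with heavily overlapping patches, which is exactly what fully convolutional networks later avoided.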
With the work of Long et al. 2014 [2], so-called fully convolutional networks (FCNs) were introduced, and image segmentation with CNNs became much more sophisticated. Overall, the processing in FCNs looks like this: first, features are extracted from the input image by a convolutional backbone (the encoder, see Part I). During encoding, the spatial resolution shrinks while the feature depth grows. The extracted feature maps carry high semantic meaning but no precise localization. Since image segmentation requires pixel-wise predictions, these feature maps are then upsampled back to input resolution (the decoder). In contrast to the input image, each pixel now holds a discrete class label, so the image is segmented into semantically meaningful classes.
Two major concepts exist for how the upsampling in the decoder can be done:
- Naive decoder (this term is used e.g. in Chen et al. 2018 [3]): the upsampling is done by a fixed, non-trainable operation such as bilinear interpolation
- Encoder-decoder: upsampling is done by trainable deconvolution operations and/or by merging features with higher localization information from the encoder during upsampling, see these examples:
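The two decoder concepts can be contrasted in a toy NumPy sketch. Assumptions to note: a fixed 2x nearest-neighbor repeat stands in for bilinear interpolation, and a plain channel concatenation stands in for the learned merge of encoder features:

```python
import numpy as np

def upsample_2x(feat):
    # Fixed, non-trainable upsampling (nearest neighbor here;
    # real naive decoders typically use bilinear interpolation).
    return feat.repeat(2, axis=0).repeat(2, axis=1)

# Toy feature maps: coarse decoder features (semantically rich)
# and encoder features at the target resolution (well localized).
decoder_feat = np.arange(4.0).reshape(2, 2, 1)  # (2, 2, C=1)
encoder_feat = np.ones((4, 4, 1))               # (4, 4, C=1)

# Naive decoder: interpolation only.
naive_out = upsample_2x(decoder_feat)           # (4, 4, 1)

# Encoder-decoder: upsample, then merge encoder features
# of the same resolution (here by channel concatenation).
merged = np.concatenate([upsample_2x(decoder_feat), encoder_feat],
                        axis=-1)                # (4, 4, 2)
print(naive_out.shape, merged.shape)
```

The naive decoder recovers the resolution but not the fine localization; the encoder-decoder variant carries the well-localized encoder features along for the final prediction.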
In order to dive into image segmentation with deep learning, the sources in the table below are good starting points. Be aware that besides CNNs there are other deep learning model types which perform image segmentation, like generative adversarial networks (GANs) or long short-term memory (LSTM) approaches, but this guide focuses on CNNs. Also, models of the R-CNN family are sometimes discussed from an image segmentation perspective. This guide will discuss them when we reach object detection in the next part, so do not be confused when you read about them elsewhere (like in review papers) and they are not mentioned here yet.
The evolution of FCNs for image segmentation
The evolution of the DeepLab family is characteristic of the evolution of FCN-inspired models for image segmentation. DeepLab variants can be found among both naive-decoder and encoder-decoder models. Hence, this guide follows the family, first looking at naive-decoder models and then turning towards encoder-decoder models.
Naive-decoder models
The most important contributions of naive-decoder models are the establishment of so-called atrous convolutions and the exploitation of long-range image context for prediction on pixel level. Atrous convolutions are a variant of normal convolutions which increase the receptive field without losing image resolution. The famous Atrous Spatial Pyramid Pooling (ASPP) module in DeepLab-V2 [4] and later versions combines both: atrous convolutions and long-range image context exploitation. When reading the following literature, focus on the development of these features: atrous convolutions, the ASPP module, and long-range image context exploitation/parsing.
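How atrous convolutions grow the receptive field can be shown in one dimension. This is a plain NumPy sketch (zero padding, no learned weights): the kernel taps are spaced `rate` samples apart, so the same 3 weights cover `(3 - 1) * rate + 1` input samples while the output keeps the input resolution:

```python
import numpy as np

def atrous_conv1d(signal, kernel, rate):
    # Atrous (dilated) convolution: taps spaced `rate` apart.
    # Receptive field: (len(kernel) - 1) * rate + 1 samples,
    # while zero padding preserves the input resolution.
    span = (len(kernel) - 1) * rate
    padded = np.pad(signal, span // 2)
    out = np.zeros_like(signal, dtype=float)
    for i in range(len(signal)):
        taps = padded[i:i + span + 1:rate]
        out[i] = np.dot(taps, kernel)
    return out

x = np.arange(8, dtype=float)
kernel = np.array([1.0, 1.0, 1.0])
# rate 1 = ordinary convolution (receptive field 3);
# rate 2 sees 5 input samples with the same 3 weights.
print(atrous_conv1d(x, kernel, rate=1))
print(atrous_conv1d(x, kernel, rate=2))
```

Both outputs have the same length as the input: unlike pooling or strided convolution, the resolution is never reduced, which is exactly why DeepLab uses atrous convolutions in its backbone.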
Encoder-decoder models
Probably the most famous encoder-decoder today is the U-Net [5], a CNN which was developed for analyzing medical images. Its clear structure invited many researchers to experiment with and adapt it, and it is famous for its skip connections, which allow sharing features between the encoder and decoder paths. Encoder-decoder models focus on enriching the semantically rich feature maps during upsampling in the decoder with more locally precise feature maps from the encoder.
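The shape bookkeeping of a U-Net-style skip connection can be traced with a toy two-level example. Everything below is a stand-in (max pooling for the encoder convolutions, a fixed 2x repeat for the trainable deconvolutions); only the symmetric resolutions and the concatenations follow the U-Net pattern:

```python
import numpy as np

def downsample(feat):
    # Encoder step: 2x2 max pooling halves the resolution.
    h, w, c = feat.shape
    return feat.reshape(h // 2, 2, w // 2, 2, c).max(axis=(1, 3))

def upsample(feat):
    # Decoder step: 2x upsampling (stand-in for a trainable
    # deconvolution) restores the resolution.
    return feat.repeat(2, axis=0).repeat(2, axis=1)

x = np.arange(8 * 8 * 4, dtype=float).reshape(8, 8, 4)
e1 = downsample(x)        # (4, 4, 4) encoder level 1
e2 = downsample(e1)       # (2, 2, 4) bottleneck
d1 = upsample(e2)         # (4, 4, 4)
# Skip connection: concatenate encoder features of the
# same resolution onto the upsampled decoder features.
d1 = np.concatenate([d1, e1], axis=-1)  # (4, 4, 8)
d0 = upsample(d1)         # (8, 8, 8)
d0 = np.concatenate([d0, x], axis=-1)   # (8, 8, 12)
print(d0.shape)
```

Because encoder and decoder resolutions mirror each other, every decoder level has an encoder counterpart to borrow well-localized features from; that symmetry is what gives the U-Net its "U" shape.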
With the literature at hand, you will be able to reflect on modern image segmentation papers and implementations with CNNs. Let’s meet again in Part III, where we will discuss object detection.
References
[1] Hoeser, T; Kuenzer, C. Object Detection and Image Segmentation with Deep Learning on Earth Observation Data: A Review-Part I: Evolution and Recent Trends. Remote Sensing 2020, 12(10), 1667. DOI: 10.3390/rs12101667.
[2] Long, J.; Shelhamer, E.; Darrell, T. Fully Convolutional Networks for Semantic Segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2014, 39, 640–651.
[3] Chen, L.C.; Zhu, Y.; Papandreou, G.; Schroff, F.; Adam, H. Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation. In Computer Vision–ECCV 2018; Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y., Eds.; Springer International Publishing: Cham, Switzerland, 2018; pp. 833–851.
[4] Chen, L.C.; Papandreou, G.; Kokkinos, I.; Murphy, K.; Yuille, A.L. DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs. IEEE Trans. Pattern Anal. Mach. Intell. 2016, 40, 834–848.
[5] Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional Networks for Biomedical Image Segmentation. In Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015; Navab, N., Hornegger, J., Wells, W.M., Frangi, A.F., Eds.; Springer International Publishing: Cham, Switzerland, 2015; pp. 234–241.