Recent progress in computer hardware, together with broader access to intensive computation, has enabled researchers to work with models that have millions of free parameters. Convolutional neural networks (CNNs) have already demonstrated their success in image classification, object detection, scene understanding, etc. For almost any computer vision problem, CNN-based approaches outperform other techniques and, in many cases, even human experts in the corresponding field. Almost all computer vision applications now try to incorporate deep learning techniques to improve on traditional approaches. They influence our everyday lives, and the potential uses of these technologies look truly impressive.
Reliable image segmentation is one of the important tasks in computer vision. This problem is especially important in medical imaging, where it can potentially improve our diagnostic abilities, and in scene understanding, where it is needed to make self-driving vehicles safe. Dense image segmentation essentially involves dividing images into meaningful regions, which can be viewed as a pixel-level classification task. The most straightforward (and slowest) approach to this problem is manual segmentation of the images. However, this is a time-consuming process that is prone to the mistakes and inconsistencies that are unavoidable when human data curators are involved. Automating the procedure provides a systematic way of segmenting an image on the fly as soon as the image is acquired. To be useful in a production environment, this process must deliver the necessary accuracy.
In recent years, different methods have been proposed to tackle the problem of creating CNNs that can produce a segmentation map for an entire input image in a single forward pass. One of the most successful state-of-the-art deep learning methods is based on Fully Convolutional Networks (FCN). The main idea of this approach is to use a CNN as a powerful feature extractor by replacing the fully connected layers with convolutional ones, so that the network outputs spatial feature maps instead of classification scores. Those maps are then upsampled to produce a dense pixel-wise output. This method allows a CNN to be trained end to end for segmentation with input images of arbitrary size. Moreover, this approach achieved an improvement in segmentation accuracy over common methods on standard datasets like PASCAL VOC. The method has been further improved and is now known as the U-Net neural network. The U-Net architecture uses skip connections to combine low-level feature maps with higher-level ones, which enables precise pixel-level localization. A large number of feature channels in the upsampling part allows context information to propagate to the higher-resolution layers. This type of network architecture has proven itself in binary image segmentation competitions such as satellite image analysis and medical image analysis [6, 7], among others.
In this paper, we show how the performance of U-Net can be easily improved by using pre-trained weights. As an example, we apply this approach to the Aerial Image Labeling Dataset, which contains high-resolution aerial images of several cities. Each pixel of the images is labeled as belonging to either the "building" or the "not-building" class. Another example of the successful application of such an architecture and initialization scheme is the Kaggle Carvana image segmentation competition, where one of the authors used it as a part of the winning (1st out of 735 teams) solution.
II Network Architecture
In general, a U-Net architecture consists of a contracting path that captures context and a symmetric expanding path that enables precise localization (see for example Fig. 1). The contracting path follows the typical architecture of a convolutional network, with alternating convolution and pooling operations, and progressively downsamples the feature maps while increasing the number of feature maps per layer. Every step in the expansive path consists of an upsampling of the feature map followed by a convolution. Hence, the expansive branch increases the resolution of the output. To localize the upsampled features, the expansive path combines them with high-resolution features from the contracting path via skip connections. The output of the model is a pixel-by-pixel mask that shows the class of each pixel. This architecture has proved very useful for segmentation problems with limited amounts of data, e.g. see .
U-Net is capable of learning from a relatively small training set. In most cases, data sets for image segmentation consist of at most thousands of images, since manual preparation of the masks is a very costly procedure. Typically, U-Net is trained from scratch, starting with randomly initialized weights. It is well known that, to train a network without over-fitting, the data set should be relatively large, on the order of millions of images. Networks that are trained on the ImageNet data set are widely used as a source of initial weights for networks in other tasks. In this way, training can be restricted to the few layers that are not pre-trained (sometimes only the last layer), so that the network adapts to the features of the target data set.
In the VGG11 encoder, the first convolutional layer produces 64 channels and then, as the network deepens, the number of channels doubles after each max-pooling operation until it reaches 512. In the following layers, the number of channels does not change.
To construct the encoder, we remove the fully connected layers and replace them with a single convolutional layer of 512 channels that serves as the bottleneck central part of the network, separating the encoder from the decoder. To construct the decoder, we use transposed convolution layers that double the size of a feature map while halving the number of channels. The output of each transposed convolution is then concatenated with the output of the corresponding part of the encoder. The resulting feature map is passed through a convolution to keep the number of channels the same as in the symmetric encoder stage. This upsampling procedure is repeated 5 times to pair up with the 5 max-pooling operations, as shown in Fig. 1. Technically, a fully convolutional network can take an input of any size, but because we have 5 max-pooling layers, each halving the side of the image, only images with sides divisible by 32 ($2^5$) can be used as input to the current network implementation.
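The size bookkeeping described above can be traced with a small sketch. This is only an illustration under stated assumptions: the function names `trace_unet_shapes` and `padded_side` are hypothetical, the channel schedule (64 doubling up to 512) follows the text, and the decoder is simplified so that each stage ends with the same number of channels as its mirrored encoder stage.

```python
def trace_unet_shapes(side, base_channels=64, max_channels=512, num_pools=5):
    """Trace (side, channels) through the encoder, bottleneck, and decoder.

    Illustrative only: the decoder is assumed to mirror the encoder exactly.
    """
    factor = 2 ** num_pools
    if side % factor != 0:
        raise ValueError(f"side must be divisible by {factor}, got {side}")

    encoder = []
    channels = base_channels
    for _ in range(num_pools):
        encoder.append((side, channels))  # feature map just before max pooling
        side //= 2                        # each max pooling halves the side
        channels = min(channels * 2, max_channels)

    bottleneck = (side, max_channels)

    decoder = []
    for _, skip_channels in reversed(encoder):
        side *= 2                         # transposed convolution doubles the side
        decoder.append((side, skip_channels))
    return encoder, bottleneck, decoder


def padded_side(side, multiple=32):
    """Smallest multiple of `multiple` that is >= side (ceiling division)."""
    return -(-side // multiple) * multiple


if __name__ == "__main__":
    enc, mid, dec = trace_unet_shapes(256)
    print("encoder:   ", enc)
    print("bottleneck:", mid)
    print("decoder:   ", dec)
    print("padded side for 1000:", padded_side(1000))
```

An input whose side is not a multiple of 32 would have to be padded or cropped first; `padded_side` shows the padding target.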
We applied our model to the Inria Aerial Image Labeling Dataset. This dataset consists of 180 aerial images of urban settlements in Europe and the United States, with each pixel labeled as belonging to either the building or the not-building class. Every image in the data set is RGB and has pixels resolution where each pixel corresponds to a cm of Earth surface. We used 30 images (5 from each of the 6 cities in the train set) for validation, as suggested in  (valid. IoU 0.647) and  (best valid. IoU 0.73), and trained the network on the remaining 150 images for 100 epochs. Random crops of were used for training and central crops for validation. Adam with learning rate was used as the optimization algorithm.
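The two cropping modes can be sketched as follows. The helper names are hypothetical, and the crop size of 512 in the usage lines is an illustrative value chosen only because it is divisible by 32; it is not taken from the experiments.

```python
import random


def random_crop_box(height, width, crop, rng=random):
    """Return (top, left, bottom, right) of a random crop x crop window."""
    if crop > height or crop > width:
        raise ValueError("crop must not exceed the image size")
    top = rng.randint(0, height - crop)   # randint is inclusive on both ends
    left = rng.randint(0, width - crop)
    return top, left, top + crop, left + crop


def central_crop_box(height, width, crop):
    """Return (top, left, bottom, right) of the centered crop x crop window."""
    if crop > height or crop > width:
        raise ValueError("crop must not exceed the image size")
    top = (height - crop) // 2
    left = (width - crop) // 2
    return top, left, top + crop, left + crop


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the training crop is reproducible
    print("train crop:", random_crop_box(1000, 1000, 512, rng))
    print("valid crop:", central_crop_box(1000, 1000, 512))
```

Random crops act as a simple form of augmentation during training, while the fixed central crop keeps the validation score comparable between epochs.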
We choose the Jaccard index (intersection over union, IoU) as the evaluation metric. It can be interpreted as a similarity measure between finite sets. For two sets $A$ and $B$, it can be defined as follows:
$$J(A, B) = \frac{|A \cap B|}{|A \cup B|} = \frac{|A \cap B|}{|A| + |B| - |A \cap B|},$$
where the normalization condition holds:
$$0 \le J(A, B) \le 1.$$
Every image consists of $n$ pixels. To adapt the last expression to discrete objects, we can write it in the following way:
$$J = \frac{1}{n} \sum_{i=1}^{n} \frac{y_i \hat{y}_i}{y_i + \hat{y}_i - y_i \hat{y}_i},$$
where $y_i$ is the binary value (label) of pixel $i$ and $\hat{y}_i$ is the predicted probability for that pixel.
Since we can consider image segmentation as a pixel classification problem, we also use the common loss function for binary classification tasks, binary cross entropy, which is defined as:
$$H = -\frac{1}{n} \sum_{i=1}^{n} \left( y_i \log \hat{y}_i + (1 - y_i) \log (1 - \hat{y}_i) \right).$$
Joining these expressions, we can generalize the loss function, namely,
$$L = H - \log J.$$
Therefore, by minimizing this loss function, we simultaneously maximize the predicted probabilities of the correct pixel labels and maximize the intersection between the masks and the corresponding predictions. For more details, see .
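The combined loss can be illustrated with a plain-Python sketch. It assumes that the soft Jaccard term is the per-pixel mean of $y_i \hat{y}_i / (y_i + \hat{y}_i - y_i \hat{y}_i)$ and that the two terms are combined as $L = H - \log J$. The function names and the small constant `eps` (added for numerical safety) are illustrative and are not taken from the released code.

```python
import math


def soft_jaccard(y_true, y_pred, eps=1e-7):
    """Per-pixel soft Jaccard, averaged over all pixels."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        # eps keeps the denominator non-zero when y == 0 and p == 0
        total += (y * p) / (y + p - y * p + eps)
    return total / len(y_true)


def binary_cross_entropy(y_true, y_pred, eps=1e-7):
    """Mean binary cross entropy; probabilities are clipped away from 0 and 1."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(y_true)


def combined_loss(y_true, y_pred, eps=1e-7):
    """Assumed combination: L = H - log(J)."""
    h = binary_cross_entropy(y_true, y_pred, eps)
    j = soft_jaccard(y_true, y_pred, eps)
    return h - math.log(j + eps)


if __name__ == "__main__":
    labels = [1, 1, 0, 0]
    print("good prediction:", combined_loss(labels, [0.9, 0.8, 0.1, 0.2]))
    print("poor prediction:", combined_loss(labels, [0.4, 0.3, 0.6, 0.7]))
```

A prediction that agrees with the labels lowers both terms, so its combined loss is smaller than that of a poor prediction. A real training loop would express the same arithmetic on tensors so that gradients can flow through it.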
At the output of the neural network, we obtain an image in which the value of each pixel is the probability that the pixel belongs to the area of interest. The size of the output image coincides with that of the input image. In order to have only binary pixel values, we choose a threshold of 0.3. This value can be found using the validation data set, and it is fairly universal for our generalized loss function across many different image data sets. For a different loss function, the value is different and should be found independently. All pixel values below the specified threshold are set to 0, while all values above the threshold are set to 1. Then, multiplying every pixel in the output image by 255, we get a black-and-white predicted mask.
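A minimal sketch of this post-processing step, using nested lists in place of image arrays. The function names are hypothetical, and treating a value exactly equal to the threshold as background is an assumption, since the text only specifies values below and above it. The `best_threshold` helper illustrates one simple way a threshold could be picked on validation data: by maximizing IoU over a list of candidates.

```python
def binarize_mask(probabilities, threshold=0.3):
    """Map each probability to 255 (above threshold) or 0 (otherwise)."""
    return [[255 if p > threshold else 0 for p in row] for row in probabilities]


def hard_iou(mask_true, mask_pred):
    """Intersection over union of two binary masks (non-zero means foreground)."""
    intersection = 0
    union = 0
    for row_true, row_pred in zip(mask_true, mask_pred):
        for t, p in zip(row_true, row_pred):
            t, p = bool(t), bool(p)
            intersection += t and p
            union += t or p
    return intersection / union if union else 1.0  # two empty masks agree


def best_threshold(probabilities, mask_true, candidates):
    """Pick the candidate threshold with the highest IoU on the given data."""
    return max(
        candidates,
        key=lambda th: hard_iou(mask_true, binarize_mask(probabilities, th)),
    )


if __name__ == "__main__":
    probs = [[0.10, 0.35], [0.20, 0.90]]
    truth = [[0, 1], [0, 1]]
    print(binarize_mask(probs))
    print(best_threshold(probs, truth, [0.3, 0.5]))
```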
In our experiment, we test 3 U-Nets with the same architecture, as shown in Fig. 1, differing only in the way the weights are initialized. For the basic model, we use a network with weights initialized by the LeCun uniform initializer. In this initializer, samples are drawn from a uniform distribution over a symmetric interval whose width is set by the number of input units of the layer. It is used in PyTorch [15] as a default method of weight initialization in convolutional layers. Next, we use the same architecture with a VGG11 encoder pre-trained on ImageNet, while all layers in the decoder are initialized by the LeCun uniform initializer. Then, as a final example, we use a network with weights pre-trained on the Carvana dataset (both encoder and decoder). After 100 epochs, we obtain the following results for the validation subset:
1) LeCun uniform initializer: IoU = 0.593
2) The Encoder is pre-trained on ImageNet: IoU = 0.686
3) Fully pre-trained U-Net on Carvana: IoU = 0.687
Validation learning curves in Fig. 3 show the benefits of our approach. First of all, the pre-trained models converge to their steady-state values much faster than the non-pre-trained network. Moreover, the steady-state value seems higher for the pre-trained models. The ground truth, as well as the three masks predicted by these three models, are superimposed on an original image in Fig. 4. One can easily notice the difference in prediction quality after 100 epochs. Our results for the Inria Aerial Image Labeling Dataset can easily be improved further by using different hyper-parameter optimization techniques or by applying standard computer vision methods during pre- and post-processing.
In this paper, we show how the performance of U-Net can be improved using the technique known as fine-tuning to initialize the weights of the network's encoder. This kind of neural network is widely used for image segmentation tasks and shows state-of-the-art results in many binary image segmentation competitions. Fine-tuning is already widely used for image classification tasks, but to our knowledge it has not been used with U-Net-type architectures. For image segmentation problems, fine-tuning should be considered even more natural, because it is difficult to collect a large training data set (in particular for medical images) and to label it well. Furthermore, pre-trained networks substantially reduce training time, which also helps to prevent over-fitting. Our approach can be further improved by considering more advanced pre-trained encoders such as VGG16 or any pre-trained network from the ResNet family. With these improved encoders, the decoder can be kept as simple as the one we use. Our code is available as an open source project under the MIT license and can be found at https://github.com/ternaus/TernausNet.
The authors would like to thank the Open Data Science community for many valuable discussions and educational help in the growing field of machine/deep learning. The authors also express their sincere gratitude to Alexander Buslaev, who originally suggested using a pre-trained VGG network as an encoder in a U-Net network.
- [1] Y. LeCun, Y. Bengio and G. Hinton, Deep learning, Nature 521, 436–444, 2015.
- [2] J. Long, E. Shelhamer and T. Darrell, Fully Convolutional Networks for Semantic Segmentation, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015.
- [3] M. Everingham, et al., The Pascal Visual Object Classes Challenge: A Retrospective, International Journal of Computer Vision 111, 98–136, 2015.
- [4] O. Ronneberger, P. Fischer and T. Brox, U-Net: Convolutional Networks for Biomedical Image Segmentation, arXiv:1505.04597, 2015.
- [5] V. Iglovikov, S. Mushinskiy and V. Osin, Satellite Imagery Feature Detection using Deep Convolutional Neural Network: A Kaggle Competition, arXiv:1706.06169, 2017.
- [6] V. Iglovikov, A. Rakhlin, A. Kalinin and A. Shvets, Pediatric Bone Age Assessment Using Deep Convolutional Neural Networks, arXiv:1712.05053, 2017.
- [7] T. Ching et al., Opportunities And Obstacles For Deep Learning In Biology And Medicine, www.biorxiv.org:142760, 2017.
- [8] project.inria.fr/aerialimagelabeling/
- [9] Kaggle: Carvana Image Masking Challenge.
- [10] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. Berg and Li Fei-Fei, ImageNet Large Scale Visual Recognition Challenge, arXiv:1409.0575, 2014.
- [11] K. Simonyan and A. Zisserman, Very Deep Convolutional Networks for Large-Scale Image Recognition, arXiv:1409.1556, 2014.
- [12] B. Bischke, P. Helber, J. Folz, D. Borth and A. Dengel, Multi-Task Learning for Segmentation of Building Footprints with Deep Neural Networks, arXiv:1709.05932, 2017.
- [13] E. Maggiori, Y. Tarabalka, G. Charpiat and P. Alliez, Can Semantic Labeling Methods Generalize to Any City? The Inria Aerial Image Labeling Benchmark, hal.inria.fr/hal-01468452, IGARSS, 2017.
- [14] D. Kingma and J. Ba, Adam: A Method for Stochastic Optimization, arXiv:1412.6980, 2014.
- [15] pytorch.org
- [16] K. He, X. Zhang, Sh. Ren and J. Sun, Deep Residual Learning for Image Recognition, arXiv:1512.03385, 2015.
- [17] ods.ai