Segmenting Ships in Satellite Imagery With Squeeze and Excitation U-Net

10/27/2019 ∙ by Venkatesh R, et al. ∙ 0

Ship detection in satellite imagery remains difficult even for state-of-the-art segmentation models, both because labelled datasets are scarce and because existing approaches often fail to generalize to unseen images. The most common methods for semantic segmentation involve complex two-stage networks or networks that use a multi-scale scene parsing module. In this paper, we propose a modified version of the popular U-Net architecture, called Squeeze and Excitation U-Net, and train it with a loss that directly optimizes the intersection over union (IoU) score. Our method gives performance comparable to other methods while being more computationally efficient.




I Introduction

Automatic detection of objects of interest has been a challenge in the computer vision community for decades. A lot of work has been done over the past 10 years [2] to automatically extract objects from satellite imagery, but with limited operational results. Often these solutions work on a limited dataset but do not generalize to unseen data. With the advent of deep learning algorithms based on convolutional neural networks (CNNs), there has been a lot of advancement in this field, often producing state-of-the-art results.

Shipping traffic has grown rapidly over the last couple of decades. To handle illegal shipping and infractions at sea, maritime bodies usually monitor shipping traffic manually by going through each image. However, this work is time-consuming and requires qualified people, so automating it is of significant importance. Advances in computer vision, along with the availability of high-resolution data at a higher frequency, will lead to the automation of these tasks.

II Related Work

Deep learning has attracted a lot of attention, especially when applied to computer vision tasks. Since AlexNet [4] was used to win the ImageNet challenge, CNNs have been applied to a variety of tasks. The Fully Convolutional Network (FCN) [5] was one of the first CNN-based architectures for segmentation. Since then, a number of architectures have evolved from a similar structure. U-Net [17] is an FCN with an encoder-decoder architecture and skip-connections, designed to perform well in the absence of a large amount of data. In this paper, we use a modified version of U-Net that re-calibrates the learned feature maps using squeeze and excitation modules [6].

Fig. 1: Example images from the dataset

When it comes to satellite images, Iglovikov et al. [3] made use of a U-Net with a pretrained encoder (WideResnet-38) and recent improvements such as activated batch normalization [15], which allows for memory savings, and the exponential linear unit (ELU) [16].

Rakhlin et al. [12] used a U-Net with an m46 encoder, which they designed to save memory. They trained their network using stochastic weight averaging (SWA) [13], which helps in finding much broader optima than gradient descent.

In our work, we propose a U-Net architecture with Resnet-34 [7] as the backbone, and add spatial and channel squeeze and excitation (SE) blocks to the network for segmenting ships in satellite images. To optimize the model, we compare its performance under various loss functions.

III Dataset and Evaluation Metric

The dataset for ship detection consists of 3-channel RGB images. Fig. 1 shows examples of images from the dataset. The training set consists of 192,556 images and the validation set consists of 15,606 images. Each image is 768x768 pixels. The dataset is highly imbalanced, with close to 60% of the images being empty with no masks. Fig. 2 shows the imbalance in the training set. The rest of the images contain masks of up to 15 ships per image, with ships of varying sizes.

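The evaluation metric used for this task is the $F_\beta$-score at different IoU thresholds. The IoU score between a proposed set of object pixels $A$ and a set of true object pixels $B$ is calculated as follows:

$$IoU(A, B) = \frac{|A \cap B|}{|A \cup B|} \tag{1}$$

The threshold $t$ for IoU ranges from 0.5 to 0.95 with a step size of 0.05. At each threshold value, the score is calculated as follows:

$$F_\beta(t) = \frac{(1 + \beta^2) \cdot TP(t)}{(1 + \beta^2) \cdot TP(t) + \beta^2 \cdot FN(t) + FP(t)} \tag{2}$$

where TP, FP and FN denote true positives, false positives and false negatives, and $\beta$ is set to 2, so the score is equivalent to the F2-score. The final score is given by the mean of the F2-score over the set of IoU thresholds $T = \{0.5, 0.55, \ldots, 0.95\}$, as shown in (3):

$$\overline{F_2} = \frac{1}{|T|} \sum_{t \in T} F_2(t) \tag{3}$$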
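The metric described above can be sketched in a few lines of NumPy. This is an illustrative sketch rather than the official scoring code: the function names are hypothetical, a true object is counted as a true positive whenever any prediction overlaps it above the threshold (a simplification of one-to-one matching), and the score for an image with no ships and no predictions is assumed to be 1.

```python
import numpy as np

# IoU thresholds 0.5, 0.55, ..., 0.95 (ten values).
THRESHOLDS = [round(0.5 + 0.05 * k, 2) for k in range(10)]


def f_beta_at_threshold(iou, t, beta=2.0):
    """F-beta for one image at IoU threshold t.

    iou[i, j] holds the IoU between true object i and predicted object j.
    """
    matches = iou > t
    tp = int(matches.any(axis=1).sum())      # true objects matched by a prediction
    fn = iou.shape[0] - tp                   # true objects left unmatched
    fp = int((~matches.any(axis=0)).sum())   # predictions matching no true object
    denom = (1 + beta**2) * tp + beta**2 * fn + fp
    # Assumption: an empty image with no predictions counts as a perfect score.
    return (1 + beta**2) * tp / denom if denom > 0 else 1.0


def mean_f2(iou):
    """Mean F2-score over all IoU thresholds."""
    return float(np.mean([f_beta_at_threshold(iou, t) for t in THRESHOLDS]))


# Two true ships and two predictions: one tight match, one loose match.
iou = np.array([[0.92, 0.00],
                [0.00, 0.62]])
print(mean_f2(iou))
```

In this toy example the loose match stops counting once the threshold exceeds 0.62, which is how the metric rewards tight masks over approximate ones.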

Fig. 2: Images in each class (ships and no ships) vs. number of images
Fig. 3: Network architecture with Resnet-34 as the encoder. The blocks E1-E5 are loaded with pretrained weights from ImageNet. The input is passed through the encoder layers followed by squeeze and excitation (SE) blocks before the pooling operation. SE blocks are added to this architecture as suggested in the original paper. This allows for re-calibration of the learned features adaptively. The input image is up-sampled by using bilinear interpolation at each step of the decoder and concatenated with the skip-connection input. The final output is of the same size as the input to the network.

IV Methods

IV-A Model architecture and loss function

We chose the U-Net architecture for our task because it is lightweight compared to more recent state-of-the-art models such as DeepLab v3+ [18] and PSPNet [19], which exploit global context at different scales. U-Net also works well where data is limited. A typical U-Net consists of a contracting path and an expanding path. The contracting path, known as the encoder, captures contextual information from the input. It consists of convolution operations followed by pooling operations, which progressively downsample the image. In the decoder, the feature map is progressively upsampled, with each upsampling step followed by a convolution layer. This gradually restores the feature map to the original input size. To localize the objects detected in the image, the network needs spatial information about the image. The U-Net architecture uses skip-connections to bring the high-resolution features from the encoder into the decoder. The output of the model, also known as the logits, is passed through a softmax layer to generate the predictions.

As a baseline for comparison, we use the vanilla U-Net architecture as proposed in the original paper. We then train our model with Resnet-34 as the encoder of the U-Net. Slight modifications are made to the Resnet-34 architecture to help it perform better at this task. Most notably, we change the kernel size of the first convolutional layer from 7x7 to 3x3. This gives a comparatively larger feature map, enabling the model to perform better at segmentation. We also use spatial and channel squeeze and excitation blocks, which have been shown to increase model performance by a clear margin with a negligible increase in computation cost. The blocks were added to the architecture as suggested in the original paper. We also make a few minor but vital changes, mainly replacing the ReLU activation function with ELU. The network architecture is shown in Fig. 3. We also use Synchronized Batch Normalization [8], which allows batch norm layers to communicate with each other in multi-GPU training. We initialize the model with pretrained ImageNet weights for the available layers, and use He Normal [9] weight initialization for the remaining layers.
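To make the re-calibration step concrete, the following is a minimal NumPy sketch of the forward pass of a spatial and channel squeeze and excitation block, in the spirit of [6]. It is not the authors' implementation: the function and parameter names are hypothetical, the reduction ratio is whatever the shapes of `w1` and `w2` imply, and combining the two branches by element-wise addition is an assumption.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def scse_block(x, w1, b1, w2, b2, w_sp, b_sp):
    """Re-calibrate a feature map x of shape (C, H, W).

    w1: (C // r, C), b1: (C // r,)  first fully connected layer (squeeze)
    w2: (C, C // r), b2: (C,)       second fully connected layer (excite)
    w_sp: (C,), b_sp: scalar        1x1 convolution producing one spatial map
    """
    # Channel SE: global average pool, bottleneck MLP, per-channel gate.
    z = x.mean(axis=(1, 2))                              # (C,)
    hidden = np.maximum(w1 @ z + b1, 0.0)                # ReLU, (C // r,)
    channel_gate = sigmoid(w2 @ hidden + b2)             # (C,)
    x_cse = x * channel_gate[:, None, None]

    # Spatial SE: 1x1 convolution across channels, per-pixel gate.
    q = np.tensordot(w_sp, x, axes=([0], [0])) + b_sp    # (H, W)
    spatial_gate = sigmoid(q)
    x_sse = x * spatial_gate[None, :, :]

    # Assumption: the two re-calibrated maps are combined by addition.
    return x_cse + x_sse


rng = np.random.default_rng(0)
C, H, W, r = 8, 4, 4, 2
x = rng.normal(size=(C, H, W))
out = scse_block(
    x,
    w1=rng.normal(size=(C // r, C)), b1=np.zeros(C // r),
    w2=rng.normal(size=(C, C // r)), b2=np.zeros(C),
    w_sp=rng.normal(size=C), b_sp=0.0,
)
print(out.shape)
```

The block adds only two small fully connected layers and one 1x1 convolution per stage, which is why its computational overhead is small relative to the encoder itself.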

The cross-entropy loss has a simple gradient with respect to the logits, which makes it easy to update the weights during back-propagation. But for tasks like segmentation, where the objective is to directly optimize the IoU, it is common practice to combine the cross-entropy loss with a loss that helps in optimizing IoU [1].

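Hence we decided to use both Jaccard loss and Lovasz-Softmax [1] loss in combination with cross-entropy loss. The Jaccard index of two sets $A$ and $B$ is defined as:

$$J(A, B) = \frac{|A \cap B|}{|A \cup B|} = \frac{|A \cap B|}{|A| + |B| - |A \cap B|} \tag{4}$$

This equation is for discrete objects and can be extended for continuous objects as follows:

$$J = \frac{1}{n} \sum_{i=1}^{n} \frac{y_i \hat{y}_i}{y_i + \hat{y}_i - y_i \hat{y}_i} \tag{5}$$

where $\hat{y}_i$ and $y_i$ are the corresponding predicted and ground-truth labels for pixel $i$, and $n$ is the number of pixels.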

The final loss function for this task, (6), is a weighted combination of the Jaccard loss and the cross-entropy loss, with the weighting controlled by alpha. When using Lovasz-Softmax, the Jaccard loss term in (6) is replaced with the corresponding Lovasz loss term. The weight alpha is found via grid search and is set to 0.7.
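As an illustration of how such a weighted loss can be put together, here is a hedged NumPy sketch for the binary ship/background case. Several choices are assumptions and not taken from the text: the convex form `alpha * jaccard_loss + (1 - alpha) * cross_entropy`, the choice of which term alpha multiplies, the use of one minus a soft Jaccard index (computed as a ratio of sums with a small smoothing constant) as the Jaccard loss, and binary rather than softmax cross-entropy. All function names are hypothetical.

```python
import numpy as np


def binary_cross_entropy(y_true, y_prob, eps=1e-7):
    """Mean binary cross-entropy over pixels."""
    p = np.clip(y_prob, eps, 1.0 - eps)
    return float(-np.mean(y_true * np.log(p) + (1.0 - y_true) * np.log(1.0 - p)))


def soft_jaccard(y_true, y_prob, eps=1e-7):
    """Differentiable Jaccard index on probabilities (ratio-of-sums form)."""
    intersection = np.sum(y_true * y_prob)
    union = np.sum(y_true) + np.sum(y_prob) - intersection
    return float((intersection + eps) / (union + eps))


def combined_loss(y_true, y_prob, alpha=0.7):
    """Assumed form: alpha * (1 - J) + (1 - alpha) * cross-entropy."""
    jaccard_loss = 1.0 - soft_jaccard(y_true, y_prob)
    return alpha * jaccard_loss + (1.0 - alpha) * binary_cross_entropy(y_true, y_prob)


y_true = np.array([1.0, 1.0, 0.0, 0.0])
good = np.array([0.99, 0.99, 0.01, 0.01])
bad = np.array([0.01, 0.01, 0.99, 0.99])
print(combined_loss(y_true, good), combined_loss(y_true, bad))
```

A grid search over alpha, as described in the text, would simply evaluate validation performance for a handful of values of `alpha` and keep the best one.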

IV-B Preprocessing, training and mask generation

We preprocessed the input by scaling the 8-bit data [0-255] to floating point values [0-1], then subtracted the per-channel mean [0.485, 0.456, 0.406] and divided by the per-channel standard deviation [0.229, 0.224, 0.225]. These are the values commonly used for Resnet-based architectures pretrained on ImageNet.
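The normalization step can be written in a few lines. This is a minimal sketch assuming images are stored as height x width x 3 arrays in RGB order; the function name is hypothetical.

```python
import numpy as np

# Per-channel statistics quoted in the text (RGB order).
IMAGENET_MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
IMAGENET_STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)


def preprocess(image_uint8):
    """Scale an (H, W, 3) uint8 image to [0, 1], then standardize per channel."""
    x = image_uint8.astype(np.float32) / 255.0
    return (x - IMAGENET_MEAN) / IMAGENET_STD


image = np.full((4, 4, 3), 255, dtype=np.uint8)
print(preprocess(image)[0, 0])
```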

We downscaled the input images by a factor of 2 (to 384x384) and trained the network on them. We applied heavy augmentation to our training data, including all the dihedral group ($D_4$) transformations, colour, brightness and contrast augmentations, and gamma correction. Another technique that we used is smart scale crop. Since the dataset is heavily imbalanced, using random crops might increase the imbalance further. We implemented the scale-crop algorithm so that, if the label mask contains ships, some part of the ship class is always included in the crop. This way we can reduce the computation through cropping without hurting model performance.
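One way to realize the cropping idea is sketched below. It is an assumption-laden illustration, not the authors' algorithm: it covers only the crop (not the scaling part), picks one ship pixel uniformly at random, and then samples a window position that is guaranteed to contain it. The function name and signature are hypothetical.

```python
import numpy as np


def smart_crop(image, mask, size, rng):
    """Return a size x size crop that contains ship pixels whenever the mask has any.

    image: (H, W, C) array, mask: (H, W) array with non-zero ship pixels.
    """
    h, w = mask.shape
    ys, xs = np.nonzero(mask)
    if len(ys) > 0:
        # Pick one ship pixel and restrict the window so that it stays inside.
        k = rng.integers(len(ys))
        cy, cx = int(ys[k]), int(xs[k])
        y_lo, y_hi = max(0, cy - size + 1), min(cy, h - size)
        x_lo, x_hi = max(0, cx - size + 1), min(cx, w - size)
    else:
        # Empty image: fall back to a plain random crop.
        y_lo, y_hi = 0, h - size
        x_lo, x_hi = 0, w - size
    top = int(rng.integers(y_lo, y_hi + 1))
    left = int(rng.integers(x_lo, x_hi + 1))
    window = (slice(top, top + size), slice(left, left + size))
    return image[window], mask[window]


rng = np.random.default_rng(0)
image = np.zeros((64, 64, 3), dtype=np.float32)
mask = np.zeros((64, 64), dtype=np.uint8)
mask[5, 60] = 1  # a single ship pixel near the corner
crop_image, crop_mask = smart_crop(image, mask, size=16, rng=rng)
print(crop_image.shape, int(crop_mask.sum()))
```

With a purely random 16x16 crop, a lone ship pixel like the one above would be missed most of the time, which is the imbalance problem the text describes.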

For any deep learning system, the devil is in the detail. For training, we used two GTX 1080 GPUs. We kept the batch size constant at 8 throughout the experiments. For sampling, we divided the dataset into 10 classes stratified according to the number of ships in the images. To negate the effect of class imbalance, we sample the images such that each class is visited once every C iterations, as proposed in [1]. We used the AdamW [11] optimizer, which fixes the weight decay issue in the Adam optimizer and helps the model converge more quickly. We trained the model for 60 epochs using a step learning rate scheduler that decreased the learning rate from 1e-3 to 1e-5, multiplying it by 0.1 every 20 epochs. Then we switched to stochastic weight averaging [13]. We trained the model for a further 36 epochs with the SGDR [14] learning rate scheduler, which cycles the learning rate between 1e-5 and 1e-7. We set the cycle length to 6 epochs and average the weights every 6 epochs. This helps in increasing the validation mIoU of our model. We also tried training the model with a 768x768 input image size, which did not significantly change the end results. However, fine-tuning the model for a few epochs at 768x768 improved our final score by 0.4.
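The two-phase schedule described above can be sketched in plain Python. This is a simplified illustration under stated assumptions: learning rates are updated once per epoch, the SGDR phase uses cosine annealing with a fixed cycle length, and weights are represented as flat lists of floats. The function names are hypothetical and no training framework is involved.

```python
import math


def step_lr(epoch, base_lr=1e-3, gamma=0.1, step_size=20):
    """Phase 1: multiply the learning rate by gamma every step_size epochs."""
    return base_lr * gamma ** (epoch // step_size)


def sgdr_lr(epoch, lr_max=1e-5, lr_min=1e-7, cycle_len=6):
    """Phase 2: cosine annealing from lr_max towards lr_min, restarting each cycle."""
    t = epoch % cycle_len
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * t / cycle_len))


def update_average(avg_weights, new_weights, n_averaged):
    """Running mean of weight snapshots, as used in stochastic weight averaging."""
    return [(a * n_averaged + w) / (n_averaged + 1)
            for a, w in zip(avg_weights, new_weights)]


# Phase 1: 60 epochs of step decay (1e-3 -> 1e-4 -> 1e-5).
phase1 = [step_lr(e) for e in range(60)]

# Phase 2: 36 epochs of SGDR, taking one snapshot at the end of every 6-epoch cycle.
weights = [0.0, 0.0]          # stand-in for the real model parameters
swa_weights, n_snapshots = None, 0
for epoch in range(36):
    lr = sgdr_lr(epoch)
    weights = [w + lr for w in weights]   # placeholder for one epoch of training
    if (epoch + 1) % 6 == 0:
        if swa_weights is None:
            swa_weights = list(weights)
        else:
            swa_weights = update_average(swa_weights, weights, n_snapshots)
        n_snapshots += 1

print(phase1[0], phase1[20], phase1[40], n_snapshots)
```

Taking the snapshot at the end of each cycle means every averaged model comes from the low-learning-rate part of the cosine curve, which is the usual motivation for pairing cyclical schedules with weight averaging.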

For validation, we use test-time augmentation such as rotations in multiples of 90° or horizontal and vertical flips. We average the predictions by taking their arithmetic mean, which helps in reducing the variance of our predictions. We also tried taking the geometric mean, but it made little difference to the results.
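Test-time augmentation of this kind can be sketched as follows. The sketch assumes a hypothetical `predict_fn` that maps an (H, W, C) image to an (H, W) probability map; each augmented prediction is mapped back to the original orientation before the arithmetic mean is taken.

```python
import numpy as np


def tta_predict(predict_fn, image):
    """Average predictions over 90-degree rotations and horizontal/vertical flips."""
    predictions = []
    # Rotations by 0, 90, 180 and 270 degrees, undone on the output.
    for k in range(4):
        rotated = np.rot90(image, k, axes=(0, 1))
        predictions.append(np.rot90(predict_fn(rotated), -k, axes=(0, 1)))
    # Vertical (axis 0) and horizontal (axis 1) flips, undone on the output.
    for axis in (0, 1):
        flipped = np.flip(image, axis=axis)
        predictions.append(np.flip(predict_fn(flipped), axis=axis))
    return np.mean(predictions, axis=0)


def toy_model(img):
    """Stand-in for a segmentation network: per-pixel mean over channels."""
    return img.mean(axis=2)


rng = np.random.default_rng(0)
image = rng.random((8, 8, 3))
averaged = tta_predict(toy_model, image)
print(averaged.shape)
```

Because the toy model acts on each pixel independently, the averaged output equals a single forward pass here; with a real network the augmented predictions differ slightly, and averaging them is what reduces the variance.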

V Results

We present our results in Table I, where we also compare results across different loss functions. Some examples of our results are shown in Fig. 4.

Encoder Network    Params (millions)    Jaccard    Lovasz
VGG-19             143                  0.796      0.772
Resnet-34          23.2                 0.832      0.845
Resnet-50          25.6                 0.834      0.826
TABLE I: Mean F2-score on local validation set for different encoders and loss functions
Fig. 4: Sample segmentation results, left: tile of 4 images, right: predicted masks for those images
Network Architecture     SE-Block    F2-score
U-Net w/ Resnet-34       No          0.827
SE U-Net w/ Resnet-34    Yes         0.845
TABLE II: Mean F2-score on local validation set for U-Net with and without SE Block

We compare the results of Resnet-34 with other encoders, namely Resnet-50 and VGG-19. The validation scores of the Resnet-based models are higher than those of VGG-19. This is because of better gradient flow due to skip-connections. Resnet-50, the deeper of the two Resnet encoders, gives inferior performance compared to Resnet-34 when trained with Lovasz-Softmax loss (0.826 versus 0.845). This can be attributed to two reasons: i) the presence of spatial and channel squeeze and excitation blocks in Resnet-34, which improves the ability of the model to learn channel and spatial inter-dependencies; ii) Lovasz-Softmax requires good gradient flow, which gets harder as the size of the architecture increases. Resnet-34 outperforms all other models when trained using Lovasz-Softmax loss.

An ablation study, in which the model is trained with and without the squeeze and excitation blocks, shows that the blocks give a clear boost in segmentation performance, raising the mean F2-score from 0.827 to 0.845 (Table II).

To compare the speed of the model, we evaluate it against two popular segmentation models: PSPNet with a Resnet-50 backbone, and DeepLab v3+ with an aligned, modified Xception backbone.

Network Architecture    Inference Speed (frames/second)    F2-score
PSPNet-50               0.19                               0.862
DeepLab v3+             0.27                               0.855
SE U-Net                1.21                               0.845
TABLE III: Inference speed and mean F2-score performance on validation set.

PSPNet and DeepLab v3+ give higher performance than our method, which can be attributed to their use of multi-scale features through a dedicated module. Our method gives comparable performance (a mean F2-score of 0.845 versus 0.862) and a roughly 6x speed-up (1.21 versus 0.19 frames per second) compared to the top performing method, PSPNet.

VI Conclusion

In this paper, we presented an approach to ship segmentation in satellite imagery. We used a U-Net architecture with Resnet-34 as the backbone. The quality of segmentation is improved by three factors: i) using SE blocks in the architecture; ii) using Lovasz-Softmax to optimize the mIoU score; iii) using smart-cropping augmentation to tackle data imbalance. One area of improvement for future work is producing fine-grained masks: our model produces a single mask for ships that are very close to each other. This could be improved by applying post-processing such as the watershed transform. We chose not to apply additional post-processing in order to keep the model computationally efficient.


The authors would like to thank the department of computer science of SRM Institute of Science and Technology for providing us with useful suggestions for conducting these experiments.


  • [1] M. Berman, A. Rannen Ep Triki, and M. Blaschko. The Lovász-Softmax loss: A tractable surrogate for the optimization of the intersection-over-union measure in neural networks, 2018.
  • [2] J.K.E. Tunaley, Algorithms for ship detection and tracking using satellite imagery, 2004.
  • [3] V. Iglovikov, S. Mushinskiy, and V. Osin. Satellite imagery feature detection using deep convolutional neural network: A kaggle competition, 2017.
  • [4] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet Classification with Deep Convolutional Neural Networks, 2012.
  • [5] Jonathan Long, Evan Shelhamer and Trevor Darrell. Fully Convolutional Networks for Semantic Segmentation, 2015.
  • [6] Abhijit Guha Roy, Nassir Navab, Christian Wachinger, Recalibrating Fully Convolutional Networks with Spatial and Channel ’Squeeze & Excitation’ Blocks, 2018.
  • [7] Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun, Deep Residual Learning for Image Recognition, 2015.
  • [8]
  • [9] K. He, X. Zhang, S. Ren and J. Sun, ”Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification,” 2015 IEEE International Conference on Computer Vision (ICCV), Santiago, 2015, pp. 1026-1034.
  • [10] Sudre C.H., Li W., Vercauteren T., Ourselin S., Jorge Cardoso M. (2017) Generalised Dice Overlap as a Deep Learning Loss Function for Highly Unbalanced Segmentations. In: Cardoso M. et al. (eds) Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support. DLMIA 2017, ML-CDS 2017. Lecture Notes in Computer Science, vol 10553. Springer, Cham.
  • [11] Ilya Loshchilov, Frank Hutter, Decoupled Weight Decay Regularization, 2019.
  • [12] Alexander Rakhlin, Alex Davydow, Sergey Nikolenko, Land Cover Classification from Satellite Imagery With U-Net and Lovasz-Softmax Loss
  • [13] Pavel Izmailov, Dmitrii Podoprikhin, Timur Garipov, Dmitry Vetrov, Andrew Gordon Wilson, Averaging Weights Leads to Wider Optima and Better Generalization, 2019.
  • [14] Ilya Loshchilov, Frank Hutter, SGDR: Stochastic Gradient Descent with Warm Restarts, 2017.
  • [15] S. R. Bulò, L. Porzi and P. Kontschieder, ”In-place Activated BatchNorm for Memory-Optimized Training of DNNs,” 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, 2018, pp. 5639-5647.

  • [16] Djork-Arné Clevert, Thomas Unterthiner and Sepp Hochreiter, Fast and Accurate Deep Network Learning by Exponential Linear Units (ELUs), 2015.
  • [17] Ronneberger O., Fischer P., Brox T. (2015) U-Net: Convolutional Networks for Biomedical Image Segmentation. In: Navab N., Hornegger J., Wells W., Frangi A. (eds) Medical Image Computing and Computer-Assisted Intervention – MICCAI 2015. MICCAI 2015. Lecture Notes in Computer Science, vol 9351. Springer, Cham.
  • [18] Chen LC., Zhu Y., Papandreou G., Schroff F., Adam H. (2018) Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation. In: Ferrari V., Hebert M., Sminchisescu C., Weiss Y. (eds) Computer Vision – ECCV 2018. ECCV 2018. Lecture Notes in Computer Science, vol 11211. Springer, Cham.
  • [19] H. Zhao, J. Shi, X. Qi, X. Wang and J. Jia, ”Pyramid Scene Parsing Network,” 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, 2017, pp. 6230-6239.