Greenhouse segmentation is of pivotal importance for climate-smart agricultural land-use planning. Many developed countries have already integrated smart agricultural growth plans into their agendas; hence, monitoring greenhouse-related activities through satellite imagery plays a crucial role. Deep learning-based approaches provide state-of-the-art performance in image segmentation; however, semantic segmentation of high-resolution optical satellite imagery is a challenging task. In many images, the target objects are hardly differentiable from the background: greenhouses resemble their surroundings in color, texture, and appearance, and their construction is not standardized across regions.
In this paper, we propose a sound methodology to distinguish greenhouse pixels from their surroundings using novel U-Net-based deep learning architectures. We also compare and discuss the effects of architectural variations and different loss functions. The dataset we use consists of pan-sharpened orthorectified Azersky (SPOT-7) images (red, green, blue, and near-infrared channels) with 1.5-meter resolution from 15 regions in Azerbaijan where greenhouses are densely congested, together with annotation masks comprising a total of 47,559 polygons. The dataset covers the period from September 1 to November 30, 2019. All images were acquired at a low incidence angle and with less than 5% cloud coverage.
Raw images are split into tiles with a 32-pixel overlap in both horizontal and vertical directions. There is a significant imbalance between the majority and minority classes in the original dataset; therefore, we only take into account the tiles with a positive class rate of 0.1 or higher. Through visual inspection, we noticed that labels occasionally did not represent the ground truth, and we eliminated the majority of such labels. Among the major challenges were variations in the scale and orientation of the greenhouses. To mitigate the resulting performance degradation, we employed several augmentation techniques, including mixup, rotation, flipping, and brightness and gamma adjustment, and observed a noticeable increase in model performance. To further extract knowledge from the raw images, normalized difference vegetation index (NDVI) and texture channels are also derived and incorporated into the dataset.
State-of-the-art concepts from the recent deep learning literature (i.e., dilated convolutional neural networks with skip connections) are integrated into the U-Net architecture. To avoid producing undesirable artifacts such as the infamous checkerboard pattern, usually observed during the decoding phase of the U-Net architecture, transposed convolutional layers are either replaced by or coupled with bilinear interpolation layers. Furthermore, we employed the hard example mining (HEM) technique during training, which yielded better results in difficult cases. Additional performance gains were obtained through an adaptive learning rate strategy and a sophisticated loss function, which resulted in better AUC and scores. Early results indicate that our methodology is successful, achieving up to 95.91% AUC and a 93.29% score.
2 Related Work
Semantic segmentation is the task of assigning a categorical label to each pixel in an input image. A variety of methods have been proposed to tackle this task. Some try to detect greenhouses using spectral and textural properties, thresholding, and unsupervised learning-based approaches. One method proposed Otsu thresholding of the grayscale image with subsequent filtering using morphological operations to segment greenhouses.
Another work employed a bottom-up region-merging technique starting with one-pixel objects, together with a nearest-neighbor classifier that uses the mean values of the normalized difference vegetation index (NDVI), normalized difference water index (NDWI), and retrogressive plastic greenhouse index (RPGI) channels. While these works reported decent segmentation accuracy, they depend on high-quality images from specific satellites, consistent lighting conditions, and stark spectral differences between greenhouse and background pixels - assumptions that do not always hold. In order to adapt to the distribution of the data and benefit from large amounts of labeled images, machine learning-based methods can be used.
Other studies used machine learning-based methods such as support vector machines and random forests to classify pixels. Machine learning methods that classify each pixel independently suffer from a multitude of problems on semantic segmentation tasks. First, the i.i.d. (independent and identically distributed) assumption may not hold for pixels that are sufficiently close to each other. Objects of interest almost always occupy significantly more area than a single pixel; therefore, the label of each pixel heavily depends on the surrounding pixels and their labels. Second, these methods ignore contextual information and higher-level features (e.g., edges, corners, shapes) that can only be obtained from neighboring groups of pixels, not from a single pixel. These limitations cause such algorithms to perform suboptimally on semantic segmentation tasks. More recent deep learning-based methods do not suffer from these problems and achieve better results.
One of the most notable advances in the field is the introduction of the fully convolutional network (FCN) for dense pixel-labeling tasks. This work integrated multi-resolution feature maps from different stages of the network to obtain precise localization of objects. SegNet used an encoder-decoder architecture where the max-pooling indices computed in the encoder are memorized and reused in the decoder to recover spatial features. U-Net takes a similar approach, but instead of memorizing the max-pooling indices, corresponding feature maps from the encoder and decoder are concatenated. Other approaches are based on dilated (atrous) convolution to achieve a larger effective receptive field and thereby capture more context information. In a dilated convolution, kernel weights are placed sparsely, at intervals given by the dilation rate. As shown in several works, the choice of dilation rate has a major effect on overall performance. Monotonically increasing the dilation rate throughout the layers expands the effective receptive field without loss of resolution but fails to aggregate the local features of small objects. To mitigate this issue, exponentially increasing dilation factors (e.g., powers of 2) have been proposed. Dilated convolutions have been adopted in a number of methods for semantic segmentation [11, 12].
Our dataset consists of pan-sharpened orthorectified 4-band images (RGB and NIR) captured by the Azersky (SPOT-7) satellite. The images have a spatial resolution of 1.5 m, and raw pixel values lie within the sensor's native dynamic range. Each labeled image in the dataset has a corresponding ESRI shapefile, which contains the greenhouse polygons. The overall design of the proposed methodology is illustrated in Fig. 1.
During the preprocessing phase, the shapefiles are rasterized to obtain the binary masks with the same width and height as the image, where the greenhouse pixels are set to 1 and other pixels to 0. The images and masks are split into tiles with a 32-pixel overlap in both horizontal and vertical directions.
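The tiling step can be sketched in a few lines of NumPy; only the 32-pixel overlap comes from the text, so the tile size of 256 pixels is an assumed value:

```python
import numpy as np

def tile_image(img, tile_size=256, overlap=32):
    """Split an H x W x C raster into overlapping square tiles.

    Only the 32-pixel overlap is taken from the paper;
    tile_size=256 is an assumption.
    """
    stride = tile_size - overlap
    h, w = img.shape[:2]
    tiles = []
    for y in range(0, max(h - overlap, 1), stride):
        for x in range(0, max(w - overlap, 1), stride):
            tile = img[y:y + tile_size, x:x + tile_size]
            if tile.shape[:2] == (tile_size, tile_size):  # drop ragged border tiles
                tiles.append(tile)
    return tiles
```

With these values, adjacent tiles share exactly 32 rows or columns of pixels, which later allows seamless stitching of the per-tile predictions.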
There is a significant imbalance between the greenhouse (positive) and background (negative) classes; therefore, tiles with a positive class rate of less than 0.1 are filtered out. Satellite images exhibit significant variation in both spectral characteristics and content distribution, which may adversely affect the performance of sensitive deep learning models. We start by eliminating tiles with an anomalous spectral fingerprint. Then, we apply Non-Local Means Denoising. In order to improve image contrast, we use Contrast Limited Adaptive Histogram Equalization (CLAHE) to obtain consistent images. CLAHE was chosen over other histogram equalization methods to avoid over-amplification of the contrast in near-constant regions of the image. Following the histogram equalization, we applied contrast stretching, where the minimum and maximum pixel intensities are chosen as the 2nd and 98th percentiles respectively. Binary masks were also subject to distortions; therefore, we employed a morphological mask smoothing operation, erosion followed by dilation, which is effective in removing noise.
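The class-rate filter and the percentile contrast stretching described above can be sketched as follows (NumPy only; the denoising and CLAHE steps are omitted for brevity):

```python
import numpy as np

def keep_tile(mask_tile, min_positive_rate=0.1):
    """Filter rule from the text: keep a tile only if at least 10% of
    its mask pixels belong to the greenhouse (positive) class."""
    return mask_tile.mean() >= min_positive_rate

def contrast_stretch(channel, lo_pct=2, hi_pct=98):
    """Percentile contrast stretching: intensities are linearly rescaled
    so the 2nd and 98th percentiles map to 0 and 1, then clipped."""
    lo, hi = np.percentile(channel, [lo_pct, hi_pct])
    stretched = (channel.astype(np.float64) - lo) / max(hi - lo, 1e-8)
    return np.clip(stretched, 0.0, 1.0)
```

Clipping after the linear rescale caps the extreme 2% of intensities at each end, which is what makes the stretch robust to isolated bright or dark outlier pixels.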
In order to artificially increase the size of the training data, we employed multiple augmentation methods such as vertical and horizontal flips, rotations, and contrast, brightness, and saturation adjustments. The efficient Albumentations library is used to implement the augmentations. Flips are applied randomly, with an equal probability of being vertical or horizontal. Rotation angles are randomly chosen from a fixed set, and brightness and contrast values are adjusted by multiplicative random factors within fixed ranges.
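The geometric and photometric augmentations can be sketched in plain NumPy as below; the ±10% brightness factor range is an assumed value, since the exact ranges are not reproduced in the text:

```python
import numpy as np

def augment(image, mask, rng=None):
    """Augmentation sketch: random flips, random 90-degree rotations,
    and a multiplicative brightness factor. The +/-10% brightness range
    is an assumption; masks receive only the geometric transforms."""
    if rng is None:
        rng = np.random.default_rng()
    if rng.random() < 0.5:                       # horizontal flip
        image, mask = image[:, ::-1], mask[:, ::-1]
    if rng.random() < 0.5:                       # vertical flip
        image, mask = image[::-1, :], mask[::-1, :]
    k = int(rng.integers(0, 4))                  # rotate by 0/90/180/270 degrees
    image, mask = np.rot90(image, k), np.rot90(mask, k)
    factor = rng.uniform(0.9, 1.1)               # brightness jitter (image only)
    image = np.clip(image * factor, 0.0, 1.0)
    return image, mask
```

Note that the photometric jitter is applied to the image only: the binary mask must stay aligned with the image geometrically but is never rescaled in intensity.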
Moreover, artificial training tiles are generated through the randomized placement of greenhouse patches on the background images. The background samples are manually selected from the training dataset to cover a variety of possible environments (e.g., dryland, grassland, forest, mountainous region, bodies of water, rural and urban landscapes). The motivation behind this approach is that generated out-of-distribution samples act as regularization and allow the model to better generalize to different regions.
3.3 Feature Extraction
We appended 2 more channels (NDVI and texture) to the input tiles, increasing the total number of channels to 6. NDVI is an indicator of photosynthetically active biomass often used in remote sensing and is defined in Eq. 1. While NDVI alone is by no means sufficient for unambiguous segmentation of greenhouses, it provides useful information for distinguishing them from visually similar buildings. In order to further enrich the feature set, both edge and textural information are derived from the images. To extract the edge features, the Sobel operator is applied to the RGB channels of the input tiles. To extract the textural information and remove the pixel-to-pixel correlation, a lowpass-filtered copy of the image is subtracted from the image itself (Eq. 2), which is also known as the compression step of the Laplacian Pyramid. The results weighed strongly in favor of the textural feature over the edge feature; therefore, only the texture channel was integrated into our final models. After these steps, each tile in an input batch has 6 channels.
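The two derived channels can be computed as in the following sketch; the NDVI formula is the standard definition assumed for Eq. 1, and a simple box filter stands in for the unspecified lowpass kernel of Eq. 2:

```python
import numpy as np

def ndvi(nir, red, eps=1e-8):
    """NDVI = (NIR - Red) / (NIR + Red): the standard definition,
    assumed here for Eq. 1."""
    return (nir - red) / (nir + red + eps)

def texture(channel, kernel=5):
    """Texture channel (Eq. 2): the image minus a lowpass-filtered copy
    of itself. A naive box filter stands in for the lowpass kernel."""
    pad = kernel // 2
    padded = np.pad(channel, pad, mode="reflect")
    out = np.zeros_like(channel, dtype=np.float64)
    h, w = channel.shape
    for i in range(h):
        for j in range(w):
            # sliding-window mean = box lowpass filter
            out[i, j] = padded[i:i + kernel, j:j + kernel].mean()
    return channel - out
```

Subtracting the lowpass copy leaves only the high-frequency residual, so flat regions map to roughly zero while textured regions (e.g., greenhouse roof structures) keep their local variation.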
Standardization of datasets is a common requirement for many machine learning estimators. The entire tile dataset goes through a standardization phase to ensure faster convergence during optimization. For this purpose, robust scaling is given preference over standard (z-score) scaling, since the former handles outliers better. Standardization is carried out on each channel independently (Eq. 3).
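A per-channel robust standardization (Eq. 3) can be sketched as follows; using the median and interquartile range is the standard robust-scaling choice and is assumed here:

```python
import numpy as np

def robust_standardize(tiles):
    """Channel-wise robust scaling: subtract the median and divide by
    the interquartile range, which is less sensitive to outliers than
    mean/std (z-score) scaling. tiles has shape (N, H, W, C)."""
    out = tiles.astype(np.float64)
    for c in range(tiles.shape[-1]):
        channel = out[..., c]
        median = np.median(channel)
        q1, q3 = np.percentile(channel, [25, 75])
        out[..., c] = (channel - median) / max(q3 - q1, 1e-8)
    return out
```

Because the median and IQR ignore the tails of the distribution, a few saturated pixels (e.g., specular reflections off greenhouse roofs) cannot skew the scaling the way they would with mean/std statistics.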
4 Model Architecture
We used a modified version of U-Net, a fully convolutional neural network architecture introduced by Ronneberger et al. for medical image segmentation. In general, a U-Net-like architecture consists of an encoder path to capture context and a symmetrical decoder path that enables precise localization. The encoder consists of multiple blocks, each containing convolution and pooling layers, while the decoder consists of upsampling layers and convolutions. Transposed convolutions are often used for upsampling in similar architectures, but we opt for bilinear upsampling layers since they do not produce checkerboard artifacts in the activation maps and do not introduce additional trainable parameters. We propose and experiment with 2 network architectures, referred to as "Model A" and "Model B" in this work. These networks differ in the way downsampling is performed, the structure of the bottleneck layers, and the use of dilated (atrous) convolutions. The performance of the proposed models is compared to a baseline network, which is an implementation of the original U-Net. The models represent a tradeoff between the number of parameters (and thus the computation time) and accuracy: as the number of parameters increases from the Baseline to Model B, so does the performance. Model A has 35% fewer parameters than Model B, at the cost of a 2.5% drop in score.
We use a combination of binary cross-entropy (Eq. 5) and Dice loss (Eq. 6). We apply Laplace (additive) smoothing by adding 1 to both the numerator and the denominator of the Dice loss to avoid division by zero when one of the classes is absent from a training sample.
Binary cross-entropy loss is pixel-wise weighted using distances to the 2 nearest target objects, as presented in the original U-Net paper. The weight map has the same shape as the ground-truth mask and is computed as:

$$w(\mathbf{x}) = w_c(\mathbf{x}) + w_0 \cdot \exp\left(-\frac{\left(d_1(\mathbf{x}) + d_2(\mathbf{x})\right)^2}{2\sigma^2}\right)$$

where $w_c$ is the weight map that balances the class frequencies, $d_1$ and $d_2$ denote the distances from each pixel to the border of the nearest and second-nearest object respectively, and $w_0$ and $\sigma$ are constants. The main motivation behind this approach is to prevent separate greenhouses from being detected as a single object. The weight placed on the separation of borders enforces a more precise delineation of greenhouses, which is reflected in the output of the model on validation and test images.
Therefore, the total loss is the weighted sum of the binary cross-entropy and Dice loss, given by Eq. 7.
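The loss terms can be sketched in NumPy as below; the mixing weight `alpha` is illustrative, since the exact weighting of Eq. 7 is not reproduced here:

```python
import numpy as np

def dice_loss(pred, target, smooth=1.0):
    """Dice loss with Laplace smoothing: +1 in both the numerator and
    the denominator (Eq. 6) avoids division by zero when a class is
    absent from the sample."""
    inter = (pred * target).sum()
    return 1.0 - (2.0 * inter + smooth) / (pred.sum() + target.sum() + smooth)

def weighted_bce(pred, target, weights, eps=1e-7):
    """Pixel-wise weighted binary cross-entropy (Eq. 5); `weights` is
    the border-separation weight map described above."""
    pred = np.clip(pred, eps, 1.0 - eps)
    bce = -(target * np.log(pred) + (1.0 - target) * np.log(1.0 - pred))
    return (weights * bce).mean()

def total_loss(pred, target, weights, alpha=0.5):
    """Weighted sum of BCE and Dice (Eq. 7); alpha=0.5 is an assumed
    mixing weight, not the paper's value."""
    return alpha * weighted_bce(pred, target, weights) + \
        (1.0 - alpha) * dice_loss(pred, target)
```

The BCE term drives per-pixel calibration while the Dice term directly optimizes region overlap, which is why the combination is robust to the heavy class imbalance in this dataset.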
4.1 Model A
This model is built upon the traditional U-Net architecture with certain modifications applied to both the contracting and expansive paths.
Initially, the image tile passes through a convolutional layer with a relatively large receptive field to convey much of the contextual information to the rest of the model. At each level in the contracting path, a simple computational unit (Fig. 3), composed of a convolutional layer and a batch normalization layer followed by a Leaky ReLU activation with a slope of 0.1, is employed. A computational block (Fig. 3) refers to 2 consecutive simple computational units placed back-to-back, with a learnable plus connection fusing the input of the first unit to the output of the second. The output of this plus connection is sent to the computational block of the corresponding level in the expansive path. After the plus connection, a max-pooling layer and a dropout layer with rate 0.1 are applied, and the process repeats itself at the next level of the contracting path. The total depth of the model is set to 4, with 256 filters at the bottleneck level. Moving from one level to another is managed as in the original U-Net paper, wherein the number of filters doubles as the height and width of the tensor halve. Except for the initial filter, all other filters have the same, smaller size.
The expansive unit consists of a combination of two components. The first component is an upsampling layer, where the expansion is performed using bilinear interpolation; this layer has no learnable parameters. The second component is a deconvolutional layer, also known as a transposed convolutional layer. The deconvolution operation is often disregarded in the literature due to the unpleasant artifacts it generates. However, choosing the filter and stride sizes properly and combining the output with that of the bilinear interpolation promises a better expansion strategy. The justification behind this idea is that instead of learning the correct expansion weights from scratch, the deconvolutional layer starts with a good estimate of the weights and then fine-tunes them around the initial estimate. This also shortens the convergence time compared to the case where the deconvolutional layer is used alone. After the expansive unit, a computational block similar to the one in the contracting path is applied. The output tensor of the plus layers in the expansive path is fed to the level one above. At the end of the expansive path, two consecutive simple computational units, as described for the contracting path, are applied. The network is completed by a final convolutional layer, and thus the output log-probability mask is generated.
4.2 Model B
The model architecture consists of an encoder and a decoder, connected through a bottleneck. The encoder consists of convolutional layers followed by batch normalization and ReLU activation. As in the standard U-Net architecture, encoder blocks end with a downsampling layer. However, instead of using max-pooling, we use a custom downsampling approach. The bottleneck is modified to use dilated convolutions and contains skip connections.
During experimentation, we observed that using max-pooling may cause the network to miss small greenhouses and parts of large greenhouses. The hypothesis is that max-pooling layers in the encoder fail to capture the low-activation features that correspond to small greenhouses located near large ones, since max-pooling only preserves the activation of the most prominent feature in a region of the activation map. This is particularly the case for low-resolution feature maps, where each cell corresponds to a large region of the input image. To overcome this issue, we concatenate the outputs of a strided convolution, max-pooling, and average pooling to perform downsampling while retaining most of the information about small objects of interest. This method is illustrated in Figure 6.
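The mixed downsampling can be illustrated on a single-channel map as follows; the fixed 3x3 kernel stands in for the learnable strided-convolution branch:

```python
import numpy as np

def max_pool2(x):
    """2x2 max-pooling on an (H, W) map with even H, W."""
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

def avg_pool2(x):
    """2x2 average pooling on an (H, W) map with even H, W."""
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

def strided_conv2(x, kernel):
    """3x3 convolution with stride 2 and zero padding; in the real
    model this branch is learnable."""
    padded = np.pad(x, 1, mode="constant")
    h, w = x.shape
    out = np.zeros((h // 2, w // 2))
    for i in range(0, h, 2):
        for j in range(0, w, 2):
            out[i // 2, j // 2] = (padded[i:i + 3, j:j + 3] * kernel).sum()
    return out

def mixed_downsample(x, kernel):
    """Model B downsampling sketch: strided convolution, max-pooling,
    and average pooling are concatenated along the channel axis, so
    low-activation features of small objects survive downsampling."""
    return np.stack([strided_conv2(x, kernel), max_pool2(x), avg_pool2(x)], axis=0)
```

The average-pooling branch is the key addition: unlike max-pooling, it preserves a trace of weakly activated cells, which is the stated motivation for this design.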
Moreover, dilated (i.e., atrous) convolutions are used in the bottleneck with an exponentially increasing dilation rate, to increase the size of the effective receptive field (ERF) while preserving the resolution. It has been shown that a naive approach, such as increasing the dilation rate monotonically, fails to aggregate the local features of small objects; hence, we adopt the exponential scaling approach in this model. There are 3 sequential dilated convolutional layers in the bottleneck, with dilation factors of 1, 2, and 4 respectively. The diagram of the bottleneck is given in Figure 7.
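A minimal single-channel sketch of the dilated bottleneck, with the dilation rates 1, 2, and 4 from the text:

```python
import numpy as np

def dilated_conv(x, kernel, rate):
    """3x3 dilated (atrous) convolution: the kernel taps are spaced
    `rate` pixels apart, so the receptive field grows to
    (2*rate + 1)^2 while the number of weights stays 3x3."""
    padded = np.pad(x, rate, mode="constant")
    h, w = x.shape
    out = np.zeros_like(x, dtype=np.float64)
    for i in range(h):
        for j in range(w):
            # pick taps at offsets 0, rate, 2*rate in each direction
            window = padded[i:i + 2 * rate + 1:rate, j:j + 2 * rate + 1:rate]
            out[i, j] = (window * kernel).sum()
    return out

def bottleneck(x, kernels):
    """Model B bottleneck sketch: 3 sequential dilated convolutions
    with exponentially increasing rates 1, 2, 4."""
    for kernel, rate in zip(kernels, (1, 2, 4)):
        x = dilated_conv(x, kernel, rate)
    return x
```

Stacking rates 1, 2, 4 grows the effective receptive field roughly exponentially with depth while the rate-1 layer still aggregates the dense local context that small greenhouses require.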
The bottleneck is followed by the decoder, and the corresponding feature maps of the encoder and decoder are concatenated. We used bilinear upsampling followed by convolution instead of transposed convolution, because we observed that transposed convolutions may produce checkerboard artifacts in the intermediate activation maps. After the last decoder block, a final convolution is applied to produce the output log-probability map.
The models are trained on a total of 82,203 augmented tiles and validated on 5,176 unaugmented tiles. We employed early stopping based on the validation score; because of the significant class imbalance, the score is a more informative metric than accuracy. The best model weights are also saved based on the validation score. Training is run on a machine with 32 GB of memory and an Nvidia GTX 1080 GPU; the detailed environment configuration is provided in Table 1. All of the described networks are trained using a combination of weighted binary cross-entropy and Dice loss, as described in Section 4. The evolution of the loss and score throughout the training stage is shown in Fig. 8.
| Component | Configuration |
|---|---|
| OS | CentOS Linux 7 (Core) |
| CPU model name | Intel(R) Core(TM) i7-8700 CPU @ 3.20GHz |
| GPU model name | GeForce GTX 1080 |
| Nvidia driver version | 440.33.01 |
| GCC version | (GCC) 4.8.5 20150623 (Red Hat 4.8.5-39) |
5.1 Learning Rate Strategy

The learning rate strategy corresponds to increasing the learning rate linearly during the first warm-up steps and then decreasing it proportionally to the inverse square root of the step number. Halfway through the training, the learning rate is boosted by a factor of 1.5 to re-energize the optimization, which helps avoid local minima and increases the convergence speed. For a more detailed description of the model parameters, please refer to Table 2.
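The schedule can be sketched as below; the warm-up length, base learning rate, and total number of steps are illustrative values, not the paper's:

```python
import math

def learning_rate(step, warmup=1000, base_lr=1e-3, total_steps=20000, boost=1.5):
    """Warmup + inverse-square-root decay with a mid-training boost.
    warmup, base_lr, and total_steps are assumed values."""
    if step < warmup:
        lr = base_lr * step / warmup             # linear warm-up
    else:
        lr = base_lr * math.sqrt(warmup / step)  # inverse sqrt decay
    if step >= total_steps // 2:                 # halfway boost by 1.5x
        lr *= boost
    return lr
```

The inverse-square-root branch matches the base learning rate exactly at the end of the warm-up, so the schedule is continuous everywhere except at the deliberate halfway jump.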
5.2 Hard Example Mining
We employed HEM, a bootstrapping technique often used in the object detection literature. We train a model on augmented data, then run inference on the unaugmented and unfiltered training tiles, and select the top 20% of tiles with the highest loss. This process can be repeated multiple times; however, we observed that running it once or twice usually suffices, and more iterations do not yield further improvement. Online HEM (OHEM) methods exist, but they are not suitable for our application: we do not simply propagate the loss selectively for hard negatives in a batch, but rather find new negative tiles that are potentially absent from the training dataset due to filtering. This approach enables us to gather background tiles with hard negatives such as roads, buildings, and bodies of water. Hard tiles are appended to the training dataset before the augmentation step. We observe that hard example mining results in an improved score on the validation dataset for all models.
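The offline HEM loop can be sketched as follows; `model.fit` and `model.predict` are hypothetical interfaces standing in for the actual training and inference code:

```python
def hard_example_mining(model, train_tiles, all_tiles, loss_fn,
                        top_frac=0.2, rounds=1):
    """Offline HEM sketch: after training, score every unfiltered tile,
    keep the top 20% by loss, and append them to the training set
    (before augmentation). model.fit/model.predict are hypothetical."""
    for _ in range(rounds):
        model.fit(train_tiles)                    # train on current set
        scored = [(loss_fn(model.predict(t.image), t.mask), t)
                  for t in all_tiles]
        scored.sort(key=lambda pair: pair[0], reverse=True)  # hardest first
        k = int(top_frac * len(scored))
        train_tiles = train_tiles + [t for _, t in scored[:k]]
    return train_tiles
```

Because the candidate pool is the *unfiltered* tile set, this loop can pull in background-only tiles (roads, buildings, water) that the class-rate filter had excluded, which is exactly the behavior the text describes.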
| Parameters | Model A | Model B | Baseline |
|---|---|---|---|
| Optimizer | Adam | RMSprop | Adam |
| Loss Function | Eq. 7 | Eq. 7 | Eq. 7 |
| LR Strategy | Eq. 8 | Eq. 8 | constant |
| Reduce LR Factor | 0.5 | 0.5 | 0.5 |
| Reduce LR Patience | 75 | 75 | 40 |
Test images are preprocessed as described in the sections above and tiled, with a 32-pixel overlap between tiles in both horizontal and vertical directions. The generated tiles are fed into the models in batches of size 32. For each tile, inference is carried out by rotating the tile through a set of fixed angles and generating a probability mask for each rotation separately. The final probability mask is derived by averaging the probability masks over all rotations. An inference run-time analysis is provided in Fig. 9. Thereafter, a binary mask is obtained by thresholding the probability mask output by the network. However, a vector representation is desirable for most applications. A commonly used vector format for remote sensing data is the ESRI shapefile, which is used to represent the final result of inference. Preceding the vectorization, a morphological opening operation (i.e., erosion followed by dilation) is applied to the binary mask for smoothing purposes.
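The rotation-averaged inference and thresholding can be sketched as below; using the four 90-degree rotations is an assumption, as the exact rotation angles are not specified here:

```python
import numpy as np

def predict_tta(model, tile):
    """Test-time augmentation sketch: predict a probability mask for
    each rotation of the tile, rotate each mask back, and average.
    `model` is any callable mapping a tile to a probability mask;
    the 0/90/180/270-degree rotations are an assumed choice."""
    masks = []
    for k in range(4):
        prob = model(np.rot90(tile, k))
        masks.append(np.rot90(prob, -k))  # undo the rotation
    return np.mean(masks, axis=0)

def binarize(prob_mask, threshold=0.5):
    """Threshold the averaged probability mask into a binary mask."""
    return (prob_mask >= threshold).astype(np.uint8)
```

Averaging over rotations makes the final mask approximately invariant to the tile's orientation, which matters because greenhouse rows occur at arbitrary angles in the imagery.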
Conversion of the raster masks to a set of greenhouse polygons can be implemented through region extraction and subsequent simplification. Region extraction is performed by grouping adjacent pixels with the same value (1 in our case, since 0 corresponds to the background class) and creating polygons that cover each group.
The majority of greenhouses have a rectangular shape, but the output of the model may not consist solely of rectangles. Specifically, pixel-aliasing artifacts and imperfections of the model are perceptible in the polygons. To reduce these effects, we replaced each proposed polygon with its minimum bounding rectangle (MBR). This results in output shapefiles that contain exclusively rectangular features.
The table below provides extensive evaluation metrics to compare the performance of the proposed models (Model A and Model B) against each other and against the baseline model, which is a simple U-Net architecture. All models are tested in the same experimental environment. The validation set is free of any augmentation and consists of 5,176 tiles. The models are compared and contrasted based on their scores. According to the conducted experiments, Model B performed best, with a score of 93.29% on the validation set (Fig. 10), which is 2.3% better than Model A. The proposed models outperformed the baseline model by 4.48% and 2.18% respectively. All of the image samples presented below come from test images that never participated in the training process in any way.
| Model | Precision | Recall | Score | Kappa coef. | AUC | IoU | Best threshold | Best epoch |
|---|---|---|---|---|---|---|---|---|
Fig. 11 demonstrates model performance on an image sample where the target objects (i.e., greenhouses) are clearly distinguishable from the background. Given good atmospheric conditions, image quality, proper geometry, and a moderate size of the target objects, there are no substantial performance variations across the models. All models produced decent segmentation masks and bounding boxes.
Fig. 12 shows the result of a special case where the greenhouses are located in an environment with visual impediments such as clouds and haze, which poses a greater challenge for the models. Under these adverse circumstances, all models performed poorly. However, Model B slightly exceeded the other models by labeling more of the greenhouse pixels correctly, resulting in a bounding box that covers the entire target object more accurately.
The models are also challenged in a context where greenhouses of varying scales are exposed to shadow; the corresponding results are portrayed in Fig. 13. In this case, Model B clearly dominates the other models. The Baseline model managed to correctly identify the relatively larger objects but missed some of the small-scale targets, while Model A performed effectively on the small-scale objects but unexpectedly missed some of the larger ones. Model B handled the task much more carefully, identifying all of the greenhouse objects correctly.
Some greenhouses have an irregular shape; Fig. 14 therefore allows us to compare the models with respect to the geometry of the target objects. In this context, Model A performed significantly better than the Baseline model; however, it failed to identify the entire object, leaving gaps in the middle of the greenhouse where pixels are incorrectly recognized as background. This issue is resolved by Model B, which performed accurately in both classifying the pixels and tracing the shape of the target object.
8 Conclusion

We have proposed an end-to-end approach for greenhouse segmentation on satellite imagery. A baseline network and 2 novel models based on the U-Net architecture are proposed and evaluated. The results present strong evidence that the described models are capable of achieving high performance on the given task while remaining computationally feasible, which facilitates deployment. Another contribution of this paper is the comparative analysis of techniques such as the proposed downsampling method, bilinear interpolation, dilated convolution, and skip connections in the context of the U-Net architecture. We highlight the importance of not only model capacity and hyperparameter selection but also the aforementioned differences in model architecture.
The conducted experiments showed that Model B outperforms both Model A and the Baseline model in most cases, while Model A's performance lies between that of Model B and the Baseline. The Baseline model exhibits relatively poor performance on the majority of examples, especially in the presence of shadow, scale variance, irregularities in target size and shape, and poor atmospheric conditions. Considering the performance and computational costs, we settled on Model B as our final model.
9 Future Work
Prospects for improvement include more advanced preprocessing steps such as atmospheric correction and de-hazing as well as using ensembles of different semantic segmentation algorithms. Unsupervised pretraining on unlabeled data has shown promising results in the recent literature and can be incorporated into our methodology.
We would like to express our gratitude to Azercosmos OJSC for funding this research and providing the satellite images and annotation masks, and Azerbaijan State Oil and Industry University (ASOIU) for providing access to its High-Performance Computing (HPC) servers.
-  M. Z. Yildirima and C. Ozcan. Extraction of greenhouse areas with image processing methods in karabuk province. ISPRS - International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences, XLII-4/W6:107–109, 2017.
-  F. B. Balcik, G. Senel, and C. Goksel. Greenhouse mapping using object based classification and sentinel-2 satellite imagery. In 2019 8th International Conference on Agro-Geoinformatics (Agro-Geoinformatics), pages 1–5, 2019.
-  D. Koc-San and N.K. Sonmez. Plastic and Glass Greenhouses Detection and Delineation from WorldView-2 Satellite Imagery. ISPRS - International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences, XLI-B7:257–262, June 2016.
-  Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. CoRR, abs/1411.4038, 2014.
-  Vijay Badrinarayanan, Alex Kendall, and Roberto Cipolla. SegNet: A Deep Convolutional Encoder-Decoder Architecture for Image Segmentation. arXiv:1511.00561 [cs], October 2016. arXiv: 1511.00561.
-  Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional Networks for Biomedical Image Segmentation. CoRR, abs/1505.04597, 2015.
-  Fisher Yu and Vladlen Koltun. Multi-Scale Context Aggregation by Dilated Convolutions. arXiv:1511.07122 [cs], April 2016. arXiv: 1511.07122.
-  Fisher Yu, Vladlen Koltun, and Thomas Funkhouser. Dilated Residual Networks. arXiv:1705.09914 [cs], May 2017. arXiv: 1705.09914.
-  Panqu Wang, Pengfei Chen, Ye Yuan, Ding Liu, Zehua Huang, Xiaodi Hou, and Garrison Cottrell. Understanding Convolution for Semantic Segmentation. arXiv:1702.08502 [cs], May 2018. arXiv: 1702.08502.
-  Ryuhei Hamaguchi, Aito Fujita, Keisuke Nemoto, Tomoyuki Imaizumi, and Shuhei Hikosaka. Effective Use of Dilated Convolutions for Segmenting Small Object Instances in Remote Sensing Imagery. In 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), pages 1442–1450, Lake Tahoe, NV, March 2018. IEEE.
-  Ming Wu, Chuang Zhang, Jiaming Liu, Lichen Zhou, and Xiaoqi Li. Towards Accurate High Resolution Satellite Image Semantic Segmentation. IEEE Access, 7:55609–55619, 2019. Conference Name: IEEE Access.
-  Shengyuan Piao and Jiaming Liu. Accuracy Improvement of UNet Based on Dilated Convolution. Journal of Physics: Conference Series, 1345:052066, November 2019.
-  Antoni Buades, Bartomeu Coll, and Jean-Michel Morel. Non-Local Means Denoising. Image Processing On Line, 1:208–212, 2011.
-  Stephen M. Pizer, E. Philip Amburn, John D. Austin, Robert Cromartie, Ari Geselowitz, Trey Greer, Bart Ter Haar Romeny, and John B. Zimmerman. Adaptive histogram equalization and its variations. Comput. Vision Graph. Image Process., 39(3):355–368, September 1987.
-  A. Buslaev, A. Parinov, E. Khvedchenya, V. I. Iglovikov, and A. A. Kalinin. Albumentations: fast and flexible image augmentations. ArXiv e-prints, 2018.
-  Dirk-Jan Kroon. Numerical optimization of kernel-based image derivatives, 2009.
-  P. Burt and E. Adelson. The laplacian pyramid as a compact image code. IEEE Transactions on Communications, 31(4):532–540, 1983.
-  Augustus Odena, Vincent Dumoulin, and Chris Olah. Deconvolution and checkerboard artifacts. Distill, 2016.
-  Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. CoRR, abs/1512.03385, 2015.
-  Ruirui Li, Wenjie Liu, Lei Yang, Shihao Sun, Wei Hu, Fan Zhang, and Wei Li. Deepunet: A deep fully convolutional network for pixel-level sea-land segmentation. CoRR, abs/1709.00201, 2017.
-  Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. CoRR, abs/1502.03167, 2015.
-  Vinod Nair and Geoffrey E. Hinton. Rectified linear units improve restricted boltzmann machines. In Proceedings of the 27th International Conference on International Conference on Machine Learning, ICML’10, page 807–814, Madison, WI, USA, 2010. Omnipress.
-  Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. CoRR, abs/1706.03762, 2017.
-  Abhinav Shrivastava, Abhinav Gupta, and Ross Girshick. Training Region-based Object Detectors with Online Hard Example Mining. arXiv:1604.03540 [cs], April 2016. arXiv: 1604.03540.