3D Fully Convolutional Neural Networks with Intersection Over Union Loss for Crop Mapping from Multi-Temporal Satellite Images

02/15/2021 ∙ by Sina Mohammadi, et al.

Information on cultivated crops is relevant for a large number of food security studies. Various scientific efforts are dedicated to generating this information from remote sensing images by means of machine learning methods. Unfortunately, these methods do not account for the spatial-temporal relationships inherent in remote sensing images. In our paper, we explore the capability of a 3D Fully Convolutional Neural Network (FCN) to map crop types from multi-temporal images. In addition, we propose the Intersection Over Union (IOU) loss function for increasing the overlap between the predicted classes and ground truth data. The proposed method was applied to identify soybean and corn in a study area situated in the US corn belt using multi-temporal Landsat images. The study shows that our method outperforms related methods, obtaining a Kappa coefficient of 90.8%. We conclude that using the IOU loss function provides a superior choice to learn individual crop types.




1 Introduction

Multi-temporal remote sensing images are being generated at an unprecedented scale and rate from sources such as Sentinel-2 (5-day revisit), Landsat (16-day revisit), and PlanetScope (daily). In light of this, many scientific efforts aim to convert these huge quantities of multi-temporal remote sensing images into useful information. One of these efforts is automatic crop mapping [1, 2, 3, 4], which is an active research area in remote sensing.

A decisive factor in crop classification from multi-temporal images is developing methods that can learn the temporal relationship in image time series. Traditional approaches for temporal feature representation such as Multilayer Perceptron, Random Forest, Support Vector Machine, and Decision Tree [5, 6, 7, 8, 9] are generally suitable for single-date images and are not able to explicitly consider the sequential relationship of multi-temporal data.

Recently, with the success of deep neural networks in learning high-level task-specific features, CNN- and LSTM-based methods have drawn increasing attention and achieved promising results in crop classification from multi-temporal images [3, 2, 10, 11, 1, 12]. While most deep learning-based methods for crop mapping use a pixel-by-pixel approach, in this paper we design a Fully Convolutional Neural Network (FCN) and use it for crop mapping. FCNs have been widely used in semantic segmentation, salient object detection, and brain tumor segmentation [13, 14, 15, 16, 17]. They generate the segmentation map of the whole input image at once and are therefore computationally more efficient. In addition, FCNs take the spatial relationship of adjacent pixels into account, in contrast to pixel-by-pixel approaches, which take individual pixels as input. To fit the need of crop mapping, i.e. learning the sequential relationship of multi-temporal remote sensing data, we use 3D convolution operators as the building blocks of this FCN so that spatial and temporal features are extracted simultaneously. This is beneficial because both the spatial and the temporal relationships in multi-temporal remote sensing data are important for accurate crop mapping.

To learn the crop types, most deep learning-based crop mapping methods use the cross-entropy loss [3, 11, 12, 1, 10, 18]. Although they have achieved promising results, we believe that there is still room for improvement by using a loss function better suited to crop mapping than the cross-entropy loss. To guide the network to generate more accurate predictions for crop types, we propose to learn the crop types by directly increasing the overlap between the prediction map and the ground truth mask, rather than using the cross-entropy loss, which only focuses on per-pixel prediction. To the best of our knowledge, this is the first attempt to use this loss function in crop mapping.

In summary, the main contribution of this paper is that it is the first attempt to learn the crop types by increasing the overlap between the prediction map and ground truth mask for each crop type, which may prompt a rethink of the loss functions used to train deep neural networks for crop mapping. In conjunction with this loss function, we design a 3D FCN to simultaneously take into account the spatial and the temporal relationships in multi-temporal remote sensing data.

Figure 1: The architecture of the designed 3D FCN, which is composed of an encoder and a decoder, and is trained using the IOU loss function.

2 The Proposed Method

In this section, we explain our designed 3D FCN and Intersection Over Union (IOU) loss function, which is used to train the network. This network, which is illustrated in Figure 1, is composed of an encoder-decoder network, and it learns to generate the segmentation map of crop lands from the input images.

One important component of our proposed FCN is the 3D convolutional operator, which is more beneficial than the 2D convolutional operator for multi-temporal crop mapping since it extracts temporal features in addition to spatial features. In the 3D FCN architecture, the Encoder extracts features at four different levels, each of which carries different recognition information. At lower levels, the Encoder captures spatial and local information due to the small receptive field, whereas at higher levels it captures semantic and global information because of the large receptive field. To take advantage of both high-level global context and low-level details, features of different levels are merged in the Decoder through concatenation, as shown in Figure 1. In conjunction with the 3D FCN, we propose to use the Intersection Over Union (IOU) loss to guide the FCN to output accurate segmentation maps.
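To make the difference from a 2D operator concrete, the following is a minimal NumPy sketch of a single 3D convolution (implemented as cross-correlation, as is usual in deep learning) over a time series of image patches. It is an illustration only, not the network in Figure 1: the function name `conv3d_valid`, the 8x8 patch size, the single band, and the averaging kernel are all assumptions made for the example.

```python
import numpy as np

def conv3d_valid(volume, kernel):
    """Naive 'valid' 3D cross-correlation over (time, height, width)."""
    t, h, w = volume.shape
    kt, kh, kw = kernel.shape
    out = np.zeros((t - kt + 1, h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            for k in range(out.shape[2]):
                # Each output value mixes a spatial window AND a temporal window.
                patch = volume[i:i + kt, j:j + kh, k:k + kw]
                out[i, j, k] = np.sum(patch * kernel)
    return out

# Hypothetical single-band series: 23 time steps of an 8x8 patch.
series = np.random.default_rng(0).normal(size=(23, 8, 8))
# Hypothetical kernel: averages a 3x3 neighbourhood over 3 consecutive dates.
kernel = np.ones((3, 3, 3)) / 27.0
features = conv3d_valid(series, kernel)
print(features.shape)  # (21, 6, 6)
```

A 2D kernel applied to each date separately would never combine values from neighbouring dates, whereas here every output value depends on three consecutive time steps.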

In contrast to most of the deep learning-based crop mapping methods that use cross-entropy loss to learn the crop types, we propose a better loss function to guide the network to learn each crop type more accurately. We propose to use a loss function that tries to increase the overlap between the prediction map and ground truth mask directly. This loss function is more suited to crop mapping than the cross-entropy loss that only focuses on per-pixel prediction. Therefore, our goal is to maximize the Intersection Over Union (IOU) for each crop type by adopting the following loss function:


where C denotes the number of classes, i.e. the number of crop types, and the per-class IOU term is defined as:


where M, N, p, g, and denote total number of examples, total number of pixels in each example, prediction map, ground truth mask, and a function that maps the values between zero and one to one respectively.

In the Experimental Results section we will show that using this loss function for learning the crop types results in a boost in the performance compared to using the cross-entropy loss.
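As a rough illustration of an overlap-based loss of the kind described in this section, the sketch below uses a common "soft" IOU formulation in which the intersection is the sum of p·g and the union is the sum of p + g − p·g, averaged over examples and classes. This is an assumption made for illustration and may differ in detail from the loss defined above; the function name `soft_iou_loss`, the (M, N, C) array layout, and the `eps` stabiliser are hypothetical.

```python
import numpy as np

def soft_iou_loss(pred, truth, eps=1e-7):
    """Soft IOU loss (assumed formulation, not necessarily the authors').

    pred:  (M, N, C) softmax probabilities for M examples, N pixels, C classes.
    truth: (M, N, C) one-hot ground truth masks.
    """
    inter = np.sum(pred * truth, axis=1)                 # (M, C) soft intersection
    union = np.sum(pred + truth - pred * truth, axis=1)  # (M, C) soft union
    iou = (inter + eps) / (union + eps)
    return 1.0 - iou.mean()  # minimising the loss maximises the mean IOU

# Hypothetical example: one image with four pixels and three classes.
labels = np.array([[0, 1, 2, 1]])
truth = np.eye(3)[labels]              # one-hot masks, shape (1, 4, 3)
uniform = np.full(truth.shape, 1 / 3)  # an uninformative prediction

print(soft_iou_loss(truth, truth))     # perfect overlap: loss close to 0
print(soft_iou_loss(uniform, truth))   # poor overlap: noticeably larger loss
```

Unlike a per-pixel cross-entropy term, each class contributes through the ratio of its overlap to its union, so a class covering few pixels weighs as much as a class covering many.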

3 Experiments

Figure 2: The predicted map of the 3D FCN trained with the IOU Loss, ground truth, and difference map. In the figure, green, yellow, black, purple, and orange represent soybean, corn, other classes, zero value, and one value respectively.

4 Study Area, Preprocessing, and Evaluation Metrics

Our experiments are conducted in the U.S. corn belt. We select a 1700x1700 pixel area located in the center of the footprint of h18v07 in the ARD grid system. We use Landsat Analysis Ready Data (ARD) as the input to our method, which are publicly available on USGS’s EarthExplorer web portal. At each observation date, this data contains six optical bands, namely red, green, blue, shortwave infrared 1, shortwave infrared 2, and near-infrared. Furthermore, we used the CropScape web portal to download the Cropland Data Layer (CDL) as the labels for training and testing the network. The selected region is mostly composed of corn and soybean. In this project, corn, soybean, and “other” (i.e., the merged class of other land cover/land use types) are taken as the three classes of interest. Therefore, these three categories are assigned to the corresponding pixels of the input image. We use the Landsat multi-temporal data from April 22 to September 23, which covers the growing season of corn and soybean [3].
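The merging of CDL categories into three classes can be sketched as follows. This is a hedged illustration rather than the authors' preprocessing code: it assumes the CDL raster is already loaded as an integer array and relies on the CDL category codes 1 (corn) and 5 (soybeans); the function name `remap_cdl` and the class indices 0, 1, 2 are hypothetical choices.

```python
import numpy as np

# CDL category codes for the two crops of interest.
CDL_CORN, CDL_SOYBEAN = 1, 5

def remap_cdl(cdl):
    """Map a CDL raster to three classes: 0 = corn, 1 = soybean, 2 = other."""
    labels = np.full(cdl.shape, 2, dtype=np.uint8)  # default: merged "other" class
    labels[cdl == CDL_CORN] = 0
    labels[cdl == CDL_SOYBEAN] = 1
    return labels

# Hypothetical 2x3 CDL tile containing corn, soybeans, and other categories.
tile = np.array([[1, 5, 176],
                 [5, 1, 121]])
print(remap_cdl(tile))
```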

To preprocess the Landsat multi-temporal data and prepare them for training and testing the model, we follow the same procedure as [3]. We remove the invalid pixels from the dataset. An invalid pixel is a pixel with fewer than seven valid observations after May 15 [3], and a valid observation is one that is not filled, shadow, cloudy, or unclear. The invalid pixels are excluded from the dataset and are not used in the training process because they do not contain enough multi-temporal information. Then, to fill the resulting gaps in the valid pixels, linear interpolation is used, which yields 23 time steps at seven-day intervals from April 22 to September 23. Furthermore, we normalize the data using the mean and standard deviation values.
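The gap-filling step can be sketched for a single pixel as below. This is a minimal illustration under stated assumptions, not the procedure of [3]: the year 2018, the 16-day observation dates, the two synthetic bands, and the helper names `weekly_grid` and `interpolate_pixel` are hypothetical. Note that `np.interp` holds the first and last valid values constant outside the observed range, and that in practice the normalization statistics would come from the training data rather than from one pixel.

```python
import numpy as np
from datetime import date, timedelta

def weekly_grid(start, end, step_days=7):
    """Dates from start to end (inclusive) at a fixed interval."""
    n = (end - start).days // step_days + 1
    return [start + timedelta(days=step_days * i) for i in range(n)]

def interpolate_pixel(obs_dates, obs_values, grid):
    """Linearly interpolate valid observations (dates x bands) onto the grid."""
    x = np.array([d.toordinal() for d in obs_dates], dtype=float)
    xi = np.array([d.toordinal() for d in grid], dtype=float)
    obs_values = np.asarray(obs_values, dtype=float)
    bands = [np.interp(xi, x, obs_values[:, b]) for b in range(obs_values.shape[1])]
    return np.stack(bands, axis=1)

start, end = date(2018, 4, 22), date(2018, 9, 23)  # assumed year
grid = weekly_grid(start, end)

# Hypothetical valid observations every 16 days, with two synthetic bands.
obs_dates = [start + timedelta(days=16 * i) for i in range(10)]
offsets = np.array([(d - start).days for d in obs_dates], dtype=float)
obs_values = np.stack([0.01 * offsets, 0.5 + 0.002 * offsets], axis=1)

filled = interpolate_pixel(obs_dates, obs_values, grid)
# Per-band standardisation (statistics from this pixel, for illustration only).
normalized = (filled - filled.mean(axis=0)) / filled.std(axis=0)
print(len(grid), filled.shape)  # 23 (23, 2)
```

April 22 to September 23 spans 154 days, so a seven-day step gives exactly the 23 time steps stated in the text.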

As for performance evaluation of the proposed methods, we employ Cohen’s kappa coefficient, macro-averaged producer’s accuracy, and macro-averaged user’s accuracy. Please refer to [3] for more detail.
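For readers unfamiliar with these metrics, the sketch below computes all three from a confusion matrix using their standard definitions. The function name `crop_metrics`, the row/column convention, and the example counts are assumptions made for illustration; the numbers are not results from this paper.

```python
import numpy as np

def crop_metrics(cm):
    """Kappa, macro-averaged producer's and user's accuracy.

    cm[i, j]: number of pixels with reference class i predicted as class j.
    """
    cm = np.asarray(cm, dtype=float)
    total = cm.sum()
    po = np.trace(cm) / total                                   # observed agreement
    pe = np.sum(cm.sum(axis=0) * cm.sum(axis=1)) / total ** 2   # chance agreement
    kappa = (po - pe) / (1.0 - pe)
    producers = np.diag(cm) / cm.sum(axis=1)  # correct / reference total (recall)
    users = np.diag(cm) / cm.sum(axis=0)      # correct / predicted total (precision)
    return kappa, producers.mean(), users.mean()

# Hypothetical counts for corn, soybean, and "other".
example = [[90, 5, 5],
           [4, 88, 8],
           [6, 7, 87]]
kappa, ma_pa, ma_ua = crop_metrics(example)
print(round(kappa, 3), round(ma_pa, 3), round(ma_ua, 3))
```

Macro-averaging takes the unweighted mean over classes, so each crop type counts equally regardless of how many pixels it covers.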

4.1 Implementation details

We implement our method in Keras [19] using the Google Colaboratory environment. The designed 3D FCN takes as input a 256x256x23x6 image and outputs a 256x256x3 segmentation map. In the input image size, 256x256, 23, and 6 correspond to the spatial size, the number of time steps in the time series, and the number of optical bands respectively. In the output map size, 256x256 and 3 correspond to the spatial size of the segmentation map and the number of classes respectively. We use stochastic gradient descent with a momentum coefficient of 0.9 and a learning rate of 0.005. We split the training data into five sections and use each of them in turn for validation and the rest for training with a batch size of 2, which results in 5 models whose softmax outputs are averaged during testing. The code is publicly available at:


Method                       Kappa   MA-PA   MA-UA
Transformer                  88.6    90.4    92.1
Random Forest                88.7    91.4    91.4
Multilayer Perceptron        88.8    91.4    91.5
DeepCropMapping (DCM) [3]    89.3    91.7    91.9
Ours (3D FCN + CE Loss)      90.3    92.5    92.8
Ours (3D FCN + IOU Loss)     90.8    93.8    92.6

Table 1: The experimental results. In this table, Kappa, MA-PA, MA-UA, and CE Loss stand for Cohen’s kappa coefficient, macro-averaged producer’s accuracy, macro-averaged user’s accuracy, and cross-entropy loss respectively.

4.2 Experimental Results

We use the data from the selected study area collected in 2015, 2016, and 2017 as our training set, and we test the trained 3D FCNs on the data collected in 2018. Then, we compare our method with the baseline classification models, namely Random Forest (RF), Multilayer Perceptron (MLP), and Transformer [20], with the exact same settings introduced in [3] (please refer to [3] for more details). Moreover, we also compare our method with the deep learning-based method introduced in [3]. The results are shown in Table 1. As seen from the table, our method outperforms the other methods in terms of Kappa and macro-averaged producer’s accuracy, and the 3D FCN trained with the cross-entropy loss achieves the highest macro-averaged user’s accuracy. Moreover, the adopted IOU loss function performs better than the cross-entropy loss overall, improving Kappa and macro-averaged producer’s accuracy at the cost of a slightly lower macro-averaged user’s accuracy. In addition, to visually investigate the performance of the method, we show the predicted map of the 3D FCN trained with the IOU loss, the ground truth, and the difference map in Figure 2.


5 Conclusion

In this study, a 3D FCN with an IOU loss function has been successfully applied to map soybean and corn crops in the US corn belt. The experimental results show that the adopted IOU loss function, which maximizes the overlap between the prediction map and ground truth mask for each crop type, increases performance compared to the widely used cross-entropy loss. Therefore, the IOU loss function is a better choice for learning individual crop types. For future work, we plan to improve the results for macro-averaged user’s accuracy by adding a term to our loss function that tries to improve the precision of the predicted map.

References


  • [1] Shunping Ji, Chi Zhang, Anjian Xu, Yun Shi, and Yulin Duan, ‘‘3d convolutional neural networks for crop classification with multi-temporal remote sensing images,’’ Remote Sensing, vol. 10, no. 1, pp. 75, 2018.
  • [2] Liheng Zhong, Lina Hu, and Hang Zhou, ‘‘Deep learning based multi-temporal crop classification,’’ Remote sensing of environment, vol. 221, pp. 430--443, 2019.
  • [3] Jinfan Xu, Yue Zhu, Renhai Zhong, Zhixian Lin, Jialu Xu, Hao Jiang, Jingfeng Huang, Haifeng Li, and Tao Lin, ‘‘Deepcropmapping: A multi-temporal deep learning approach with improved spatial generalizability for dynamic corn and soybean mapping,’’ Remote Sensing of Environment, vol. 247, pp. 111946, 2020.
  • [4] Mariana Belgiu, Wietske Bijker, Ovidiu Csillik, and Alfred Stein, ‘‘Phenology-based sample generation for supervised crop type classification,’’ International Journal of Applied Earth Observation and Geoinformation, vol. 95, pp. 102264, 2021.
  • [5] Reza Khatami, Giorgos Mountrakis, and Stephen V Stehman, ‘‘A meta-analysis of remote sensing research on supervised pixel-based land-cover image classification processes: General guidelines for practitioners and future research,’’ Remote Sensing of Environment, vol. 177, pp. 89--100, 2016.
  • [6] LeeAnn King, Bernard Adusei, Stephen V Stehman, Peter V Potapov, Xiao-Peng Song, Alexander Krylov, Carlos Di Bella, Thomas R Loveland, David M Johnson, and Matthew C Hansen, ‘‘A multi-resolution approach to national-scale cultivated area estimation of soybean,’’ Remote Sensing of Environment, vol. 195, pp. 13--29, 2017.
  • [7] Fabian Löw, U Michel, Stefan Dech, and Christopher Conrad, ‘‘Impact of feature selection on the accuracy and spatial uncertainty of per-field crop classification using support vector machines,’’ ISPRS journal of photogrammetry and remote sensing, vol. 85, pp. 102--119, 2013.
  • [8] Richard Massey, Temuulen T Sankey, Russell G Congalton, Kamini Yadav, Prasad S Thenkabail, Mutlu Ozdogan, and Andrew J Sánchez Meador, ‘‘Modis phenology-derived, multi-year distribution of conterminous us crop types,’’ Remote Sensing of Environment, vol. 198, pp. 490--503, 2017.
  • [9] Di Shi and Xiaojun Yang, ‘‘An assessment of algorithmic parameters affecting image classification accuracy by random forests,’’ Photogrammetric Engineering & Remote Sensing, vol. 82, no. 6, pp. 407--417, 2016.
  • [10] Charlotte Pelletier, Geoffrey I Webb, and François Petitjean, ‘‘Temporal convolutional neural network for the classification of satellite image time series,’’ Remote Sensing, vol. 11, no. 5, pp. 523, 2019.
  • [11] Carolyne Danilla, Claudio Persello, Valentyn Tolpekin, and John Ray Bergado, ‘‘Classification of multitemporal sar images using convolutional neural networks and markov random fields,’’ in 2017 IEEE International Geoscience and Remote Sensing Symposium (IGARSS). IEEE, 2017, pp. 2231--2234.
  • [12] Nataliia Kussul, Mykola Lavreniuk, Sergii Skakun, and Andrii Shelestov, ‘‘Deep learning classification of land cover and crop types using remote sensing data,’’ IEEE Geoscience and Remote Sensing Letters, vol. 14, no. 5, pp. 778--782, 2017.
  • [13] Changqian Yu, Jingbo Wang, Chao Peng, Changxin Gao, Gang Yu, and Nong Sang, ‘‘Learning a discriminative feature network for semantic segmentation,’’ in Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 1857--1866.
  • [14] Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L Yuille, ‘‘Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs,’’ IEEE transactions on pattern analysis and machine intelligence, vol. 40, no. 4, pp. 834--848, 2017.
  • [15] Sina Mohammadi, Mehrdad Noori, Ali Bahri, Sina Ghofrani Majelan, and Mohammad Havaei, ‘‘Cagnet: Content-aware guidance for salient object detection,’’ Pattern Recognition, p. 107303, 2020.
  • [16] Youwei Pang, Xiaoqi Zhao, Lihe Zhang, and Huchuan Lu, ‘‘Multi-scale interactive network for salient object detection,’’ in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 9413--9422.
  • [17] Mehrdad Noori, Ali Bahri, and Karim Mohammadi, ‘‘Attention-guided version of 2d unet for automatic brain tumor segmentation,’’ in 2019 9th International Conference on Computer and Knowledge Engineering (ICCKE). IEEE, 2019, pp. 269--275.
  • [18] Marc Rußwurm and Marco Körner, ‘‘Multi-temporal land cover classification with sequential recurrent encoders,’’ ISPRS International Journal of Geo-Information, vol. 7, no. 4, pp. 129, 2018.
  • [19] François Chollet et al., ‘‘keras,’’ 2015.
  • [20] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin, ‘‘Attention is all you need,’’ in Advances in neural information processing systems, 2017, pp. 5998--6008.