Currently, end-to-end semantic segmentation models are mostly inspired by the idea of fully convolutional networks (FCNs), which generally consist of an encoder-decoder architecture. To achieve higher performance, CNN-based end-to-end methods normally rely on deep and wide multi-scale CNN architectures to create a large receptive field, in order to not only obtain strong local patterns but also capture long-range dependencies between objects of the scene. However, this approach to modeling global context relationships is highly inefficient and typically requires a large number of trainable parameters, considerable computational resources, and large labeled training datasets.
Recently, graph neural networks (GNNs) and Graph Convolutional Networks (GCNs) have received increasing attention and have been applied to, among others, image classification, few-shot and zero-shot classification, point cloud classification, and semantic segmentation. However, these approaches are quite sensitive to how the graph of relations between objects is built, and previous approaches commonly rely on manually built graphs based on prior knowledge. In order to address this problem and learn a latent graph structure directly from 2D feature maps for semantic segmentation, the Self-Constructing Graph module (SCG) was recently proposed and has obtained promising results.
In this work, we extend the SCG to explicitly exploit rotation invariance in airborne images by considering multiple views. More specifically, we augment the input features to obtain multiple rotated views and fuse the multi-view global contextual information before projecting the features back onto the 2-D spatial domain. We further propose a novel adaptive class weighting loss that addresses the issue of class imbalance commonly found in semantic segmentation datasets. Our experiments demonstrate that the resulting MSCG-Net achieves very robust and competitive results on the Agriculture-Vision challenge dataset, which is a subset of the Agriculture-Vision dataset.
In this section, we briefly present graph convolutional networks and the self-constructing graph (SCG) approach that are the foundation of our proposed model, before presenting our end-to-end trainable Multi-view SCG-Net (MSCG) for semantic labeling tasks with the proposed adaptive class weighting loss.
II-A Graph Convolutional Networks
Graph Convolutional Networks (GCNs) are neural networks designed to operate on and extract information from graphs and were originally proposed for the task of semi-supervised node classification. $G = (A, X)$ denotes an undirected graph with $n$ nodes, where $A \in \mathbb{R}^{n \times n}$ is the adjacency matrix and $X \in \mathbb{R}^{n \times d}$ is the feature matrix. At each layer, the GCN aggregates information in one-hop neighborhoods; more specifically, the representation at layer $l+1$ is computed as

$Z^{(l+1)} = \sigma\left(\hat{A} Z^{(l)} \theta^{(l)}\right)$

where $\theta^{(l)}$ are the weights of the GCN, $Z^{(0)} = X$, and $\hat{A}$ is the symmetric normalization of $A$ including self-loops:

$\hat{A} = \hat{D}^{-\frac{1}{2}} (A + I) \hat{D}^{-\frac{1}{2}}$

where $\hat{D}$ is the degree matrix with $\hat{D}_{ii} = \sum_j (A + I)_{ij}$, $I$ is the identity matrix, and $\sigma(\cdot)$ denotes the non-linearity function (e.g. $\mathrm{ReLU}$).
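As a concrete illustration of this propagation rule, a single GCN layer can be sketched in a few lines of NumPy. This is our own toy sketch, not the authors' implementation: the function name and the tiny 3-node path graph are ours.

```python
import numpy as np

def gcn_layer(A, Z, theta):
    """One GCN layer: symmetrically normalize A with self-loops, propagate, apply ReLU."""
    n = A.shape[0]
    A_tilde = A + np.eye(n)                     # add self-loops
    d_inv_sqrt = 1.0 / np.sqrt(A_tilde.sum(axis=1))
    A_hat = A_tilde * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]  # D^-1/2 (A+I) D^-1/2
    return np.maximum(A_hat @ Z @ theta, 0.0)   # ReLU non-linearity

# Tiny 3-node path graph, 4-dimensional node features, 2 output features
rng = np.random.default_rng(0)
A = np.array([[0., 1., 0.], [1., 0., 1.], [0., 1., 0.]])
X = rng.normal(size=(3, 4))
theta = rng.normal(size=(4, 2))
Z1 = gcn_layer(A, X, theta)   # activations after one layer, shape (3, 2)
```

Stacking two such calls (with different weight matrices) gives the two-hop aggregation of a 2-layer GCN.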
Note, in the remainder of the paper, we use $Z = \mathrm{GCN}(A, X)$ to denote the activations after a $K$-layer GCN. However, in practice the GCN could be replaced by alternative graph neural network modules that perform $K$ steps of message passing based on some adjacency matrix $A$ and input node features $X$.
II-B Self-Constructing Graph
The Self-Constructing Graph (SCG) module allows the construction of undirected graphs, capturing relations across the image, directly from feature maps, instead of relying on prior knowledge graphs. It has achieved promising performance on semantic segmentation tasks in remote sensing and is efficient with respect to the number of trainable parameters, outperforming much larger models. It is inspired by variational graph auto-encoders. A feature map $F \in \mathbb{R}^{h \times w \times d}$ consisting of high-level features, commonly produced by a CNN, is converted to a graph $G = (A, X)$. $X \in \mathbb{R}^{n \times d}$ are the node features, where $n = h' \cdot w'$ denotes the number of nodes and where $h' \leq h$ and $w' \leq w$. Parameter-free pooling operations, in our case adaptive average pooling, are employed to reduce the spatial dimensions of $F$ to $h'$ and $w'$, followed by a reshape operation to obtain $X$. $A \in \mathbb{R}^{n \times n}$ is the learned weighted adjacency matrix.
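A minimal sketch of this feature-map-to-nodes conversion (our own illustrative code, using fixed block-average pooling in place of a framework's adaptive average pooling layer, and assuming the spatial dimensions divide evenly):

```python
import numpy as np

def feature_map_to_nodes(F, h_p, w_p):
    """Average-pool an (h, w, d) feature map down to (h_p, w_p) and reshape it
    into an (n, d) node feature matrix with n = h_p * w_p."""
    h, w, d = F.shape
    blocks = F.reshape(h_p, h // h_p, w_p, w // w_p, d)   # split into pooling blocks
    return blocks.mean(axis=(1, 3)).reshape(h_p * w_p, d)

F = np.random.default_rng(0).normal(size=(32, 32, 8))     # toy high-level feature map
X = feature_map_to_nodes(F, 4, 4)                         # 16 nodes, 8 features each
```

Because the pooling is a plain average, the operation adds no trainable parameters, matching the parameter-free pooling described above.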
The SCG module learns a mean matrix $\mu \in \mathbb{R}^{n \times c}$ and a standard deviation matrix $\sigma \in \mathbb{R}^{n \times c}$ of a Gaussian using two single-layer convolutional networks. Note, following convention with variational autoencoders, the model outputs $\log(\sigma)$ for the standard deviation to ensure stable training and positive values for $\sigma$. With help of reparameterization, the latent embedding is $Z = \mu + \sigma \cdot \epsilon$, where $\epsilon \in \mathbb{R}^{n \times c}$ is an auxiliary noise variable initialized from a standard normal distribution ($\epsilon \sim \mathcal{N}(0, I)$). A centered isotropic multivariate Gaussian prior distribution is used to regularize the latent variables, by minimizing a Kullback-Leibler divergence loss

$\mathcal{L}_{kl} = -\frac{1}{2n} \sum_{i=1}^{n} \sum_{j=1}^{c} \left(1 + \log\left(\sigma_{ij}\right)^2 - \mu_{ij}^2 - \sigma_{ij}^2\right)$

Based on the learned embeddings, $A$ is computed as $A = \mathrm{ReLU}(Z Z^{T})$, where $A_{ij} > 0$ indicates the presence of an edge between node $i$ and node $j$.
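The reparameterization, the KL term, and the adjacency construction can be sketched as follows (illustrative NumPy with arbitrary toy dimensions; in the actual model $\mu$ and $\log\sigma$ come from the two convolutional networks, here they are random placeholders):

```python
import numpy as np

rng = np.random.default_rng(0)
n, c = 16, 7                                    # toy node count and embedding size

mu = rng.normal(size=(n, c))                    # learned mean matrix (placeholder)
log_sigma = rng.normal(scale=0.1, size=(n, c))  # model predicts log(sigma)
sigma = np.exp(log_sigma)                       # guarantees positive std deviations

eps = rng.standard_normal((n, c))               # auxiliary noise, eps ~ N(0, I)
Z = mu + sigma * eps                            # reparameterized latent embedding

A = np.maximum(Z @ Z.T, 0.0)                    # A = ReLU(Z Z^T), weighted adjacency
kl = -0.5 / n * np.sum(1 + 2 * log_sigma - mu**2 - sigma**2)  # KL divergence loss
```

The ReLU in the adjacency construction guarantees non-negative edge weights, and the symmetry of $Z Z^{T}$ makes the resulting graph undirected.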
Liu et al. further introduce a diagonal regularization term

$\mathcal{L}_{dl} = -\frac{\gamma}{n} \sum_{i=1}^{n} \log\left(\left|A_{ii}\right|_{[0,1]} + \epsilon\right)$

where $\gamma$ is defined as

$\gamma = \sqrt{1 + \frac{n}{\sum_{i=1}^{n} A_{ii} + \epsilon}}$

and a diagonal enhancement approach

$A = A + \gamma \cdot \mathrm{diag}(A)$

to stabilize training and preserve local information.
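Under our reading of the SCG formulation (the adaptive factor $\gamma = \sqrt{1 + n / (\sum_i A_{ii} + \epsilon)}$ is our assumption; variable names are ours), the diagonal enhancement step amounts to:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4
A = np.maximum(rng.normal(size=(n, n)), 0.0)
A = (A + A.T) / 2.0                               # toy symmetric weighted adjacency

gamma = np.sqrt(1.0 + n / (A.trace() + 1e-7))     # adaptive factor (assumed form)
A_enh = A + gamma * np.diag(np.diag(A))           # diagonal enhancement
```

Only the diagonal (each node's self-connection) is scaled up; off-diagonal edge weights are left untouched, which is what preserves the local information of each node.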
The symmetric normalized $\hat{A}$ that SCG produces and that will be the input to later graph operations is computed as

$\hat{A} = \hat{D}^{-\frac{1}{2}} (A + I) \hat{D}^{-\frac{1}{2}}$
The SCG further produces an adaptive residual prediction $\hat{y} = \gamma \cdot \mu$, which is used to refine the final prediction of the network after information has been propagated along the graph.
II-C The MSCG-Net
We propose the Multi-view SCG-Net (MSCG-Net), which extends the vanilla SCG and GCN modules by considering multiple rotated views in order to obtain and fuse more robust global contextual information in airborne images. Fig. 2 shows an illustration of the end-to-end MSCG-Net model for semantic labeling tasks, and the model architecture details are shown in Table I. We first augment the features ($F$) learned by a backbone CNN to multiple views by rotating the feature maps. The employed SCG-GCN module then outputs one prediction per view, where the index of each prediction indicates its degree of rotation. The fusion layer merges all the predictions together by reversed rotations and element-wise additions, as shown in Table I. Finally, the fused outputs are projected and up-sampled back to the original 2-D spatial domain.
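The rotate-predict-reverse-fuse pattern can be sketched independently of the actual network. This is an illustrative sketch: `predict` stands in for the SCG-GCN branch, and the choice of four 90° views is our assumption.

```python
import numpy as np

def multi_view_fuse(F, predict, n_views=4):
    """Rotate F by k*90 degrees, run the per-view predictor, undo each rotation,
    and merge the per-view predictions by element-wise addition."""
    fused = np.zeros_like(predict(F))
    for k in range(n_views):
        view = np.rot90(F, k, axes=(0, 1))            # rotated view of the features
        pred = predict(view)                          # per-view prediction
        fused += np.rot90(pred, -k, axes=(0, 1))      # reverse rotation, then add
    return fused

# Sanity check with an identity "network": fusing 4 views of F yields 4 * F,
# since each reverse rotation exactly undoes the forward rotation.
F = np.arange(16.0).reshape(4, 4, 1)
out = multi_view_fuse(F, lambda v: v)
```

Because every prediction is rotated back before fusion, all views are summed in a common spatial frame, which is what makes the fused output (approximately) invariant to rotations of the input.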
We utilize the first three bottleneck layers of a pretrained Se_ResNext50_32x4d or Se_ResNext101_32x4d as the backbone CNN to learn the high-level representations, producing a feature map with 1024 channels at 1/16 of the input resolution. Note that we duplicate the weights corresponding to the red channel of the pretrained input convolution layer in order to take 4-channel NIR-RGB input in the backbone CNN, and GCNs (Equation 1) are used in our model. We utilize ReLU activation and batch normalization only for the first-layer GCN. Note, we fix the reduced spatial size $h' \times w'$, and thus the number of nodes $n$, in this work, and $c$ is equal to the number of classes, such that $c = 7$ for the experiments performed in this paper.
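The red-channel duplication can be illustrated on a raw weight tensor. This is our own sketch, not the training code: in practice the operation is applied to the framework's conv-layer weights, and we assume RGB channel ordering so that red is index 0.

```python
import numpy as np

def expand_rgb_conv_to_nir_rgb(W):
    """Turn pretrained (out_ch, 3, k, k) RGB conv weights into (out_ch, 4, k, k)
    NIR-RGB weights by duplicating the red-channel slice for the NIR channel."""
    red = W[:, :1]                            # red-channel weights (assumed index 0)
    return np.concatenate([red, W], axis=1)   # NIR first, then R, G, B

W_rgb = np.random.default_rng(0).normal(size=(64, 3, 7, 7))
W_nir_rgb = expand_rgb_conv_to_nir_rgb(W_rgb)
```

Copying the red-channel filters is a reasonable initialization because NIR, like red, responds strongly to vegetation, so the pretrained filters are a better starting point than random weights.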
II-D Adaptive Class Weighting Loss
The distribution of the classes is highly imbalanced in the dataset (e.g. most pixels in the images belong to the background class and only a few belong to classes such as planter skip and standing water). To address this problem, most existing methods make use of weighted loss functions with pre-computed class weights, based on the pixel frequency over the entire training set, to scale the loss of each class-pixel by a fixed weight before computing gradients. In this work, we introduce a novel class weighting method based on iterative batch-wise class rectification, instead of pre-computing fixed weights over the whole dataset.
The proposed adaptive class weighting method is derived from median frequency balancing weights. We first compute the pixel frequency $f_j^t$ of class $j$ over all the past training steps as follows

$f_j^t = \frac{\hat{f}_j^t + (t - 1) \cdot f_j^{t-1}}{t}$

where $t$ is the current training iteration number, $\hat{f}_j^t$ denotes the pixel frequency of class $j$ at the current training step, computed as the number of pixels labeled $j$ divided by the total number of labeled pixels in the current batch, and $f_j^1 = \hat{f}_j^1$.
The iterative median frequency class weights can thus be computed as

$w_j^t = \frac{\mathrm{MEDIAN}\left(\left\{f_j^t \mid j \in C\right\}\right)}{f_j^t + \epsilon}$

here, $C$ denotes the set of labels ($|C| = 7$ in this paper), and $\epsilon$ is a small constant added for numerical stability.
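A toy sketch of the running-frequency update and the resulting weights (our own code; the four-class frequencies are made up for illustration):

```python
import numpy as np

def update_pixel_freq(f_prev, f_hat, t):
    """Running average of per-class pixel frequency over the first t training steps."""
    return (f_hat + (t - 1) * f_prev) / t

def median_freq_weights(f, eps=1e-5):
    """Median frequency balancing weights: rare classes get weights above 1."""
    return np.median(f) / (f + eps)

f1 = np.array([0.80, 0.05, 0.10, 0.05])       # step-1 batch frequencies
f2_hat = np.array([0.70, 0.10, 0.10, 0.10])   # step-2 batch frequencies
f2 = update_pixel_freq(f1, f2_hat, t=2)       # running frequencies after step 2
w = median_freq_weights(f2)                   # dominant class is down-weighted
```

Because the frequencies are accumulated batch by batch, the weights adapt during training instead of being pre-computed once over the whole dataset.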
Then we normalize the iterative weights and broadcast them adaptively to the pixel-wise level such that

$\tilde{w}_{ij} = \frac{w_j^t}{\sum_{k \in C} w_k^t} \cdot \left(1 + y_{ij} + \tilde{y}_{ij}\right)$

where $\tilde{y}_{ij}$ and $y_{ij}$ denote the $j$-th class prediction and the ground truth of class $j$, respectively, for pixel $i$ in the current training samples.
In addition, instead of using the traditional cross-entropy function, which focuses on positive samples, we introduce a positive and negative class balanced function (PNC), which is defined as

$p_{ij} = e_{ij} - \log\left(\frac{1 - e_{ij}}{1 + e_{ij}}\right), \quad e_{ij} = \left(y_{ij} - \tilde{y}_{ij}\right)^2$
Building on the dice coefficient with our adaptive class weighting PNC function, we develop an adaptive multi-class weighting (ACW) loss function for multi-class segmentation tasks

$\mathcal{L}_{acw} = \frac{1}{|Y|} \sum_{i \in Y} \sum_{j \in C} \tilde{w}_{ij} \cdot p_{ij} - \log\left(\mathrm{MEAN}\left(\left\{d_j \mid j \in C\right\}\right)\right)$

where $Y$ contains all the labeled pixels and $d_j$ is the dice coefficient, given as

$d_j = \frac{2 \sum_{i \in Y} y_{ij} \tilde{y}_{ij}}{\sum_{i \in Y} y_{ij} + \sum_{i \in Y} \tilde{y}_{ij}}$
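The per-class dice term entering the loss can be sketched as follows (illustrative code; `y` is a one-hot ground truth and `p` holds soft predictions for a handful of pixels):

```python
import numpy as np

def dice_per_class(y, p, eps=1e-5):
    """Soft dice coefficient per class for one-hot targets y and predictions p,
    both shaped (num_pixels, num_classes)."""
    inter = (y * p).sum(axis=0)
    return (2.0 * inter + eps) / (y.sum(axis=0) + p.sum(axis=0) + eps)

y = np.array([[1, 0], [1, 0], [0, 1]], dtype=float)             # 3 pixels, 2 classes
p = np.array([[0.9, 0.1], [0.8, 0.2], [0.2, 0.8]], dtype=float)
d = dice_per_class(y, p)               # close to 1 for well-predicted classes
loss_term = -np.log(d.mean())          # the -log(MEAN(dice)) part of the loss
```

Since the dice coefficient normalizes overlap by the total mass of each class, small classes contribute to this term on equal footing with the dominant background class.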
III Experiments and results
We first present the training details and report the results. We then conduct an ablation study to verify the effectiveness of our proposed methods.
III-A Dataset and Evaluation
We train and evaluate our proposed method on the Agriculture-Vision challenge dataset, which is a subset of the Agriculture-Vision dataset. The challenge dataset consists of aerial farmland images captured throughout 2019 across the US. Each image is of size 512x512 and contains four color channels: RGB and Near Infra-red (NIR). Each image has a boundary map that indicates the region of the farmland and a mask that indicates valid pixels in the image. Seven types of annotations are included: Background, Cloud shadow, Double plant, Planter skip, Standing water, Waterway and Weed cluster. Models are evaluated on the NIR-RGB segmentation pairs of the validation set, while the final scores are reported on the held-out test set. The mean Intersection-over-Union (mIoU) is used as the main quantitative evaluation metric. Because some annotations may overlap in the dataset, for pixels with multiple labels, a prediction of either label will be counted as a correct pixel classification for that label.
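One possible reading of this evaluation rule (our interpretation, not the official challenge code): if the predicted class is among a pixel's label set, the pixel counts as correct for each of its labels.

```python
import numpy as np

def multilabel_miou(pred, target, num_classes):
    """pred: (n_pixels,) predicted class ids; target: (n_pixels, num_classes)
    0/1 multi-label ground truth. A prediction matching ANY of a pixel's labels
    counts as correct for each of those labels."""
    hit = target[np.arange(pred.size), pred] == 1    # prediction is one of the labels
    ious = []
    for j in range(num_classes):
        has_j = target[:, j] == 1
        tp = np.sum(has_j & hit)                     # correct pixels for label j
        false_j = (pred == j) & ~hit                 # wrong predictions of class j
        union = np.sum(has_j | false_j)
        if union > 0:
            ious.append(tp / union)
    return float(np.mean(ious))

# Pixel 0 carries both labels, so predicting class 0 there counts for classes 0 and 1
pred = np.array([0, 1, 0])
target = np.array([[1, 1], [1, 0], [0, 1]])
m = multilabel_miou(pred, target, num_classes=2)
```

In this toy example, each class has one true positive (from the doubly-labeled pixel) against a union of three pixels, so both per-class IoUs are 1/3.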
III-B Training details
We use backbone models pretrained on ImageNet in this work. We randomly sample 512x512 patches as input and train using mini-batches of size 10 for the MSCG-Net-50 model and a smaller batch size for the larger MSCG-Net-101 model. According to our best practices, we first train the model using Adam combined with Lookahead as the optimizer for the first 10k iterations and then change the optimizer to SGD for the remaining iterations, with weight decay applied to all learnable parameters except biases and batch-norm parameters. We also use a larger learning rate for all bias parameters compared to weight parameters. Based on our training observations and empirical evaluations, we use separate initial learning rates for MSCG-Net-50 and MSCG-Net-101, and also apply a cosine annealing scheduler that reduces the LR over epochs. All models are trained on a single NVIDIA GeForce GTX 1080Ti. It took roughly 10 hours to train our model for 25 epochs with batch size 10 over the NIR-RGB training images.
|Models||mIoU||Background||Cloud shadow||Double plant||Planter skip||Standing water||Waterway||Weed cluster|
We evaluated and tested our trained models on the validation set and the held-out test set using just a single feed-forward inference pass, without any test-time augmentation (TTA) or model ensembles. However, we do include results for a simple two-model ensemble (MSCG-Net-50 together with MSCG-Net-101) with TTA for completeness. The test results are shown in Table II. Our MSCG-Net-50 model obtained very competitive performance with 0.547 mIoU despite a very small number of trainable parameters (9.59 million) and low computational cost (18.21 Giga FLOPs for a 512x512 NIR-RGB input), resulting in fast training and inference on both CPU and GPU, as shown in Table III. Qualitative comparisons of the segmentation results from our trained models against the ground truths on the validation data are shown in Fig. 3.
|Models||Backbones||Parameters (M)||FLOPs (G)||Inference time (ms - CPU/GPU)|
|MSCG-Net-50||Se_ResNext50||9.59||18.21||522 / 26|
|MSCG-Net-101||Se_ResNext101||30.99||37.86||752 / 45|
III-D Ablation studies
Effect of the multi-view module. To investigate how the multiple views help, we report the results of the single-view and multi-view models trained with both Dice loss and ACW loss in Table IV. Note that, for simplicity, we fixed the backbone encoder as Se_ResNext50 and kept the other training settings (e.g. learning rate, decay policy, and so on) unchanged. Also, the mIoUs are computed on the validation set without considering multiple labels. The results suggest that using multiple views improves the overall mIoU compared to the single-view baseline, both when training with Dice loss and with the proposed ACW loss.
Effect of the ACW loss. As shown in Table IV, the ACW loss improves the overall mIoU of both the single-view and the multi-view models over training with Dice loss. Compared to the single-view SCG-Net with Dice loss, which was proposed in the original SCG work and achieved state-of-the-art performance on a commonly used segmentation benchmark dataset, our multi-view MSCG-Net model with ACW loss achieves a clearly higher mIoU. We show some qualitative results in Fig. 4 that illustrate that the proposed multi-view model and the adaptive class weighting method help to produce more accurate segmentation results for both larger and smaller classes.
In this paper, we presented a multi-view self-constructing graph convolutional network (MSCG-Net) that extends the SCG module, which makes use of learnable latent variables to self-construct the underlying graphs, to explicitly capture multi-view global context representations with rotation invariance in airborne images. We further developed a novel adaptive class weighting loss that alleviates the issue of class imbalance commonly found in semantic segmentation datasets. On the Agriculture-Vision challenge dataset, our MSCG-Net model achieves very robust and competitive results, while using fewer parameters and being computationally more efficient.
This work is supported by the foundation of the Research Council of Norway under Grant 220832.
- Geometric deep learning: going beyond Euclidean data. IEEE Signal Processing Magazine 34 (4), pp. 18–42.
- (2020) Agriculture-Vision: a large aerial image database for agricultural pattern analysis.
- (2018) Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7132–7141.
- (2019) Rethinking knowledge graph propagation for zero-shot learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 11487–11496.
- Semantic segmentation of small objects and modeling of uncertainty in urban remote sensing images using deep convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 1–9.
- (2013) Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114.
- (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980.
- (2016) Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907.
- (2016) Variational graph auto-encoders. arXiv preprint arXiv:1611.07308.
- (2019) Image classification with hierarchical multigraph networks. arXiv preprint arXiv:1907.09000.
- (2018) Symbolic graph reasoning meets convolutions. In Advances in Neural Information Processing Systems, pp. 1853–1863.
- (2020) Dense dilated convolutions' merging network for land cover classification. IEEE Transactions on Geoscience and Remote Sensing, pp. 1–12.
- (2020) Self-constructing graph convolutional networks for semantic labeling. arXiv preprint arXiv:2003.06932.
- (2015) Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3431–3440.
- (2016) V-Net: fully convolutional neural networks for volumetric medical image segmentation. In 2016 Fourth International Conference on 3D Vision (3DV), pp. 565–571.
- (2019) Dynamic graph CNN for learning on point clouds. ACM Transactions on Graphics (TOG) 38 (5), p. 146.
- (2019) Lookahead optimizer: k steps forward, 1 step back. In Advances in Neural Information Processing Systems, pp. 9593–9604.