I Introduction
Currently, end-to-end semantic segmentation models are mostly inspired by the idea of fully convolutional networks (FCNs) [14] and generally consist of an encoder-decoder architecture. To achieve higher performance, CNN-based end-to-end methods normally rely on deep and wide multi-scale CNN architectures to create a large receptive field, in order to obtain strong local patterns but also capture long-range dependencies between objects of the scene. However, this approach to modeling global context relationships is highly inefficient and typically requires a large number of trainable parameters, considerable computational resources, and large labeled training datasets.
Recently, graph neural networks (GNNs) [1] and graph convolutional networks (GCNs) [8] have received increasing attention and have been applied to, among others, image classification [10], few-shot and zero-shot classification [4], point cloud classification [16], and semantic segmentation [11]. However, these approaches are quite sensitive to how the graph of relations between objects is built, and previous approaches commonly rely on manually built graphs based on prior knowledge [11]. In order to address this problem and learn a latent graph structure directly from 2D feature maps for semantic segmentation, the Self-Constructing Graph module (SCG) [13] was recently proposed and has obtained promising results.

In this work, we extend the SCG to explicitly exploit rotation invariance in airborne images by extending it to consider multiple views. More specifically, we augment the input features to obtain multiple rotated views and fuse the multi-view global contextual information before projecting the features back onto the 2D spatial domain. We further propose a novel adaptive class weighting loss that addresses the issue of class imbalance commonly found in semantic segmentation datasets. Our experiments demonstrate that the MSCG-Net achieves very robust and competitive results on the Agriculture-Vision challenge dataset, which is a subset of the Agriculture-Vision dataset [2].
II Methods
In this section, we briefly present graph convolutional networks and the self-constructing graph (SCG) approach that form the foundation of our proposed model, before presenting our end-to-end trainable Multi-view SCG-Net (MSCG-Net) for semantic labeling tasks with the proposed adaptive class weighting loss.
II-A Graph Convolutional Networks
Graph Convolutional Networks (GCNs) [8] are neural networks designed to operate on and extract information from graphs, and were originally proposed for the task of semi-supervised node classification. $G = (A, X)$ denotes an undirected graph with $n$ nodes, where $A \in \mathbb{R}^{n \times n}$ is the adjacency matrix and $X \in \mathbb{R}^{n \times d}$ is the feature matrix. At each layer, the GCN aggregates information in one-hop neighborhoods; more specifically, the representation at layer $l+1$ is computed as

$$Z^{(l+1)} = \sigma\left(\hat{A} Z^{(l)} \theta^{(l)}\right) \quad (1)$$

where $\theta^{(l)}$ are the weights of the GCN, $Z^{(0)} = X$, and $\hat{A}$ is the symmetric normalization of $A$ including self-loops [13]:

$$\hat{A} = D^{-\frac{1}{2}} \left(A + I\right) D^{-\frac{1}{2}} \quad (2)$$

where $D$ is the degree matrix, $I$ is the identity matrix, and $\sigma$ denotes the non-linearity function (e.g. $\mathrm{ReLU}$). Note, in the remainder of the paper, we use $Z = \mathrm{GCN}(A, X)$ to denote the activations after an $l$-layer GCN. However, in practice the GCN could be replaced by alternative graph neural network modules that perform $l$ steps of message passing based on some adjacency matrix $A$ and input node features $X$.
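To make the propagation rule concrete, here is a minimal PyTorch sketch of Equations 1 and 2 for a single dense graph (the names `normalize_adj` and `GCNLayer` are our own illustration, not the authors' released code):

```python
import torch
import torch.nn as nn

def normalize_adj(A: torch.Tensor) -> torch.Tensor:
    """Symmetric normalization with self-loops: D^{-1/2} (A + I) D^{-1/2} (Eq. 2)."""
    A_hat = A + torch.eye(A.size(0), device=A.device)
    d_inv_sqrt = A_hat.sum(dim=1).clamp(min=1e-10).pow(-0.5)
    return d_inv_sqrt[:, None] * A_hat * d_inv_sqrt[None, :]

class GCNLayer(nn.Module):
    """One propagation step: Z^{(l+1)} = sigma(A_hat Z^{(l)} theta^{(l)}) (Eq. 1)."""
    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.theta = nn.Linear(in_dim, out_dim, bias=False)

    def forward(self, A_hat: torch.Tensor, Z: torch.Tensor) -> torch.Tensor:
        return torch.relu(A_hat @ self.theta(Z))
```

Stacking two such layers and feeding both the normalized adjacency and the node features realizes the $\mathrm{GCN}(A, X)$ operator used in the remainder of the paper.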
II-B Self-Constructing Graph
The Self-Constructing Graph (SCG) module [13] allows the construction of undirected graphs, capturing relations across the image, directly from feature maps, instead of relying on prior knowledge graphs. It has achieved promising performance on semantic segmentation tasks in remote sensing and is efficient with respect to the number of trainable parameters, outperforming much larger models. It is inspired by variational graph auto-encoders [9]. A feature map $X \in \mathbb{R}^{h \times w \times d}$ consisting of high-level features, commonly produced by a CNN, is converted to a graph $G = (A, \hat{X})$. $\hat{X} \in \mathbb{R}^{n \times d}$ are the node features, where $n$ denotes the number of nodes and $n = h' \times w'$ with $h' \le h$ and $w' \le w$. Parameter-free pooling operations, in our case adaptive average pooling, are employed to reduce the spatial dimensions of $X$ to $h' \times w'$, followed by a reshape operation to obtain $\hat{X}$. $A \in \mathbb{R}^{n \times n}$ is the learned weighted adjacency matrix.
The SCG module learns a mean matrix $\mu \in \mathbb{R}^{n \times c}$ and a standard deviation matrix $\sigma \in \mathbb{R}^{n \times c}$ of a Gaussian using two single-layer convolutional networks. Note, following convention with variational auto-encoders [6], the output of the model for the standard deviation is $\log(\sigma)$ to ensure stable training and positive values for $\sigma$. With help of reparameterization, the latent embedding is $Z = \mu + \sigma \cdot \varepsilon$, where $\varepsilon$ is an auxiliary noise variable initialized from a standard normal distribution ($\varepsilon \sim \mathcal{N}(0, I)$). A centered isotropic multivariate Gaussian prior distribution is used to regularize the latent variables, by minimizing a Kullback-Leibler divergence loss

$$\mathcal{L}_{kl} = -\frac{1}{2n} \sum_{i=1}^{n} \left(1 + \log\left(\sigma_i^{2}\right) - \mu_i^{2} - \sigma_i^{2}\right) \quad (3)$$

Based on the learned embeddings, $A$ is computed as $A = \mathrm{ReLU}\left(Z Z^{T}\right)$, where $A_{ij} > 0$ indicates the presence of an edge between node $i$ and node $j$.
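The computation described so far can be summarized in a condensed PyTorch sketch (the module layout, kernel sizes, and names are our own assumptions; the released SCG implementation may differ in detail):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SCG(nn.Module):
    """Condensed sketch of the Self-Constructing Graph module."""
    def __init__(self, in_ch: int, latent_ch: int, node_size=(32, 32)):
        super().__init__()
        self.node_size = node_size
        # Two single-layer convolutional networks predict mu and log(sigma);
        # the kernel sizes here are illustrative.
        self.conv_mu = nn.Conv2d(in_ch, latent_ch, kernel_size=3, padding=1)
        self.conv_log_sigma = nn.Conv2d(in_ch, latent_ch, kernel_size=1)

    def forward(self, x):
        # Parameter-free pooling: reduce (h, w) to (h', w'), then reshape to n nodes.
        x = F.adaptive_avg_pool2d(x, self.node_size)
        node_feats = x.flatten(2).transpose(1, 2)              # X_hat: (B, n, d)
        mu = self.conv_mu(x).flatten(2).transpose(1, 2)        # (B, n, c)
        log_sigma = self.conv_log_sigma(x).flatten(2).transpose(1, 2)
        # Reparameterization: Z = mu + sigma * eps with eps ~ N(0, I).
        Z = mu + torch.exp(log_sigma) * torch.randn_like(mu)
        # Learned adjacency: A = ReLU(Z Z^T); A_ij > 0 means an edge between i and j.
        A = torch.relu(Z @ Z.transpose(1, 2))
        # KL divergence toward a standard normal prior (Eq. 3).
        n = mu.size(1)
        kl = -0.5 / n * torch.sum(
            1 + 2 * log_sigma - mu.pow(2) - torch.exp(2 * log_sigma))
        return A, node_feats, mu, kl
```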
Liu et al. [13] further introduce a diagonal regularization term

$$\mathcal{L}_{dl} = -\frac{\gamma}{n^{2}} \sum_{i=1}^{n} \log\left(\left|A_{ii}\right|_{[0,1]} + \epsilon\right) \quad (4)$$

where $\gamma$ is defined as $\gamma = \sqrt{1 + \frac{n}{\sum_{i} A_{ii} + \epsilon}}$, and a diagonal enhancement approach

$$A = A + \gamma \cdot \mathrm{diag}(A) \quad (5)$$

to stabilize training and preserve local information.
The symmetric normalized $\hat{A}$ that SCG produces, and that will be the input to later graph operations, is computed as

$$\hat{A} = D^{-\frac{1}{2}} \left(A + \gamma \cdot \mathrm{diag}(A) + I\right) D^{-\frac{1}{2}} \quad (6)$$
The SCG further produces an adaptive residual prediction $\hat{y} = \gamma \cdot \mu$, which is used to refine the final prediction of the network after information has been propagated along the graph.
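A sketch of the adjacency post-processing in Equations 4-6, following our reading of [13] (in particular, the exact form of $\gamma$ is our assumption):

```python
import torch

def postprocess_adjacency(A: torch.Tensor, eps: float = 1e-7):
    """Diagonal regularization, enhancement, and symmetric normalization
    (Eqs. 4-6) for a single (n, n) adjacency matrix."""
    n = A.size(0)
    gamma = torch.sqrt(1.0 + n / (A.sum() + eps))   # adaptive factor (assumed form)
    diag = torch.diagonal(A)
    # Diagonal log regularization (Eq. 4) pushes self-similarities toward 1.
    L_dl = -(gamma / n ** 2) * torch.log(diag.clamp(0, 1) + eps).sum()
    # Diagonal enhancement (Eq. 5) plus self-loops, then normalization (Eq. 6).
    A = A + gamma * torch.diag(diag) + torch.eye(n, device=A.device)
    d_inv_sqrt = A.sum(dim=1).clamp(min=eps).pow(-0.5)
    A_hat = d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]
    return A_hat, L_dl
```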
II-C The MSCG-Net
In this work, we propose the Multi-view SCG-Net (MSCG-Net), which extends the vanilla SCG and GCN modules by considering multiple rotated views in order to obtain and fuse more robust global contextual information in airborne images. Fig. 2 shows an illustration of the end-to-end MSCG-Net model for semantic labeling tasks, and the model architecture details are shown in Table I (a code sketch of the augment-and-fuse scheme follows the table). We first augment the features ($X$) learned by a backbone CNN to multiple views ($X_{90}, X_{180}, X_{270}$) by rotating the features. The employed SCG-GCN module then outputs multiple predictions $Z, Z_{90}, Z_{180}, Z_{270}$ with different rotation degrees (the index indicates the degree of rotation). The fusion layer merges all the predictions together by reversed rotations and element-wise additions as shown in Table I. Finally, the fused outputs are projected and up-sampled back to the original 2D spatial domain.
TABLE I: MSCG-Net architecture details.

| Layers | Outputs | Sizes |
| CNN | $X$ | $32 \times 32 \times 1024$ |
| Augment | $(X, X_{90}, X_{180}, X_{270})$ | $4 \times (32 \times 32 \times 1024)$ |
| SCG-GCN | $(Z, Z_{90}, Z_{180}, Z_{270})$ | $4 \times (1024 \times 7)$ |
| Fusion | $Z + r_{-90}(Z_{90}) + r_{-180}(Z_{180}) + r_{-270}(Z_{270})$ | $1024 \times 7$ |
| Projection | $\hat{y}$ | $512 \times 512 \times 7$ |

where $r_{-\theta}$ denotes the reversed rotation.
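As a concrete illustration of this scheme, the sketch below rotates the backbone features with `torch.rot90`, applies a shared SCG-GCN head to each view, and fuses the outputs by reversed rotations and element-wise addition (the `scg_gcn` callable and the function name are our own, not the released code):

```python
import torch

def multi_view_fuse(x: torch.Tensor, scg_gcn) -> torch.Tensor:
    """Rotate the backbone features to four views, run the shared SCG-GCN head
    on each, reverse the rotations, and fuse by element-wise addition."""
    fused = 0
    for k in range(4):                        # 0, 90, 180, 270 degree views
        x_k = torch.rot90(x, k, dims=(2, 3))  # rotate the spatial plane of (B, C, H, W)
        z_k = scg_gcn(x_k)                    # per-view 2D prediction map
        fused = fused + torch.rot90(z_k, -k, dims=(2, 3))  # undo rotation, then add
    return fused
```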
We utilize the first three bottleneck layers of a pretrained Se_ResNext50_32x4d or Se_ResNext101_32x4d [3] as the backbone CNN to learn high-level representations; the output size of the CNN is $32 \times 32 \times 1024$ for a $512 \times 512$ input. Note that we duplicate the weights corresponding to the red channel of the pretrained input convolution layer in order to take the 4-channel NIR-RGB input in the backbone CNN (a code sketch follows below), and 2-layer GCNs (Equation 1) are used in our model. We utilize ReLU activation and batch normalization only for the first GCN layer.
Note, we set $n = 32 \times 32$ and $d = 1024$ in this work, and $c$ here is equal to the number of classes, such that $c = 7$ for the experiments performed in this paper.

II-D Adaptive Class Weighting Loss
The distribution of the classes is highly imbalanced in the dataset (e.g. most pixels in the images belong to the background class, and only few belong to classes such as planter skip and standing water). To address this problem, most existing methods make use of weighted loss functions with pre-computed class weights based on the pixel frequency of the entire training data [5], scaling the loss for each class-pixel according to a fixed weight before computing gradients. In this work, we introduce a novel class weighting method based on iterative batch-wise class rectification, instead of pre-computing fixed weights over the whole dataset.

The proposed adaptive class weighting method is derived from median frequency balancing weights [5]. We first compute the pixel frequency of class $j$ over all the past training steps as follows:
$$\hat{f}_j^{t} = \frac{f_j^{t} + (t-1) \cdot \hat{f}_j^{t-1}}{t} \quad (7)$$

where $t$ is the current training iteration number, $f_j^{t}$ denotes the pixel frequency of class $j$ at the current training step, computed as the number of pixels of class $j$ divided by the total number of labeled pixels in the current batch, and $\hat{f}_j^{0} = 0$.
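A small sketch of this running estimate (Equation 7); the class and method names are our own:

```python
import torch

class IterativeClassFrequency:
    """Running per-class pixel-frequency estimate (Eq. 7)."""
    def __init__(self, num_classes: int):
        self.freq = torch.zeros(num_classes)   # \hat{f}^0 = 0
        self.t = 0

    def update(self, y_onehot: torch.Tensor) -> torch.Tensor:
        """y_onehot: (B, C, H, W) one-hot ground truth for the current batch."""
        self.t += 1
        f_t = y_onehot.sum(dim=(0, 2, 3)) / y_onehot.sum().clamp(min=1)
        self.freq = (f_t + (self.t - 1) * self.freq) / self.t
        return self.freq
```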
The iterative median frequency class weights can thus be computed as

$$w_j^{t} = \frac{\mathrm{median}\left(\left\{\hat{f}_j^{t} \mid j \in Y\right\}\right)}{\hat{f}_j^{t} + \epsilon} \quad (8)$$

where $|Y|$ denotes the number of labels ($7$ in this paper), and $\epsilon$ is a small constant for numerical stability.
Then we normalize the iterative weights with adaptive broadcasting to the pixel-wise level, such that

$$\tilde{w}_{ij} = \frac{w_j^{t}}{\sum_{j \in Y} w_j^{t}} \cdot \left(1 + y_{ij} + \tilde{y}_{ij}\right) \quad (9)$$

where $\tilde{y}_{ij}$ and $y_{ij}$ denote the $i$-th prediction and ground truth of class $j$, respectively, in the current training samples.
In addition, instead of using the traditional cross-entropy function, which focuses on positive samples, we introduce a positive and negative class balanced function (PNC), which is defined as

$$p_{ij} = e_{ij} - \log\left(\frac{1 - e_{ij}}{1 + e_{ij}}\right) \quad (10)$$

where $e_{ij} = \left(y_{ij} - \tilde{y}_{ij}\right)^{2}$.
Building on the dice coefficient [15] with our adaptive class weighting PNC function, we develop an adaptive class weighting (ACW) loss function for multi-class segmentation tasks:

$$\mathcal{L}_{acw} = \frac{1}{|Y|} \sum_{i \in Y} \sum_{j \in Y} \tilde{w}_{ij} \cdot p_{ij} - \log\left(\mathrm{mean}\left(\left\{d_j \mid j \in Y\right\}\right)\right) \quad (11)$$

where $Y$ contains all the labeled pixels and $d_j$ is the dice coefficient, given as

$$d_j = \frac{2 \sum_{i \in Y} y_{ij} \tilde{y}_{ij}}{\sum_{i \in Y} y_{ij} + \sum_{i \in Y} \tilde{y}_{ij}} \quad (12)$$
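Putting Equations 8-12 together, a hedged PyTorch sketch of the ACW loss could look as follows (the exact reductions and $\epsilon$ handling are our assumptions):

```python
import torch

def acw_loss(pred: torch.Tensor, y: torch.Tensor, freq: torch.Tensor,
             eps: float = 1e-5) -> torch.Tensor:
    """Sketch of the ACW loss (Eqs. 8-12). `pred` holds per-class probabilities
    and `y` one-hot ground truth, both (B, C, H, W); `freq` comes from Eq. 7."""
    w = torch.median(freq) / (freq + eps)                      # iterative MFB weights (Eq. 8)
    w_pix = (w / w.sum()).view(1, -1, 1, 1) * (1 + y + pred)   # pixel-wise broadcast (Eq. 9)
    e = (y - pred) ** 2
    pnc = e - torch.log((1 - e) / (1 + e) + eps)               # PNC term (Eq. 10)
    acw = (w_pix * pnc).sum(dim=1).mean()
    inter = (pred * y).sum(dim=(0, 2, 3))                      # dice coefficient (Eq. 12)
    union = pred.sum(dim=(0, 2, 3)) + y.sum(dim=(0, 2, 3))
    dice = (2 * inter + eps) / (union + eps)
    return acw - torch.log(dice.mean())                        # combined loss (Eq. 11)
```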
III Experiments and Results
We first present the training details and report the results. We then conduct an ablation study to verify the effectiveness of our proposed methods.
III-A Dataset and Evaluation
We train and evaluate our proposed method on the Agriculture-Vision challenge dataset, which is a subset of the Agriculture-Vision dataset [2]. The challenge dataset consists of aerial farmland images captured throughout 2019 across the US. Each image contains four color channels of size $512 \times 512$: RGB and Near Infrared (NIR). Each image has a boundary map that indicates the region of the farmland, and a mask that indicates valid pixels in the image. Seven types of annotations are included: Background, Cloud shadow, Double plant, Planter skip, Standing water, Waterway, and Weed cluster. Models are evaluated on the validation set with 4,431 NIR-RGB image and segmentation pairs, while the final scores are reported on the test set with 3,729 images. The mean Intersection-over-Union (mIoU) is used as the main quantitative evaluation metric. Since some annotations may overlap in the dataset, for pixels with multiple labels, a prediction of either label is counted as a correct pixel classification for that label.
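Our reading of this protocol can be sketched as follows (the official challenge scorer may differ in details such as the treatment of false positives):

```python
import numpy as np

def multilabel_miou(pred: np.ndarray, gt: np.ndarray, valid: np.ndarray) -> float:
    """pred: (H, W) class ids; gt: (C, H, W) binary masks that may overlap;
    valid: (H, W) boolean valid-pixel mask. A prediction equal to any of a
    pixel's labels is treated as correct for each of those labels."""
    rows, cols = np.indices(pred.shape)
    correct = gt[pred, rows, cols].astype(bool)   # pred is among the pixel's labels
    ious = []
    for c in range(gt.shape[0]):
        gt_c = gt[c].astype(bool) & valid
        pred_c = (pred == c) & valid
        inter = (gt_c & (pred_c | correct)).sum()
        union = (gt_c | pred_c).sum()
        if union:
            ious.append(inter / union)
    return float(np.mean(ious))
```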
III-B Training Details
We use backbone models pretrained on ImageNet in this work. We randomly sample patches of size $512 \times 512$ as input and train using mini-batches of size 10 for the MSCG-Net-50 model and size 7 for the MSCG-Net-101 model. The training data (containing 12,901 images) is sampled uniformly, randomly flipped (with probability 0.5) for data augmentation, and shuffled for each epoch.

According to our best practices, we first train the model using Adam [7] combined with Lookahead [17] as the optimizer for the first 10k iterations, and then change the optimizer to SGD for the remaining iterations, with weight decay applied to all learnable parameters except biases and batch-norm parameters. We also use a doubled learning rate for all bias parameters compared to weight parameters. Based on our training observations and empirical evaluations, we use different initial learning rates for MSCG-Net-50 and MSCG-Net-101, and apply a cosine annealing scheduler that reduces the learning rate over epochs. All models are trained on a single NVIDIA GeForce GTX 1080Ti. It took roughly 10 hours to train our model for 25 epochs with batch size 10 over the NIR-RGB training images.
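A sketch of the parameter grouping described above (the weight decay value and momentum are illustrative; the Lookahead wrapper [17] and `torch.optim.lr_scheduler.CosineAnnealingLR` would be added on top):

```python
import torch

def build_sgd_with_decay_groups(model: torch.nn.Module, lr: float,
                                weight_decay: float = 1e-5):
    """Weight decay on weights only, and a doubled learning rate for biases
    and batch-norm parameters, as described in the text."""
    weights, biases = [], []
    for name, p in model.named_parameters():
        if not p.requires_grad:
            continue
        # 1-D parameters cover biases and batch-norm scale/shift terms.
        (biases if p.ndim == 1 else weights).append(p)
    return torch.optim.SGD([
        {"params": weights, "lr": lr, "weight_decay": weight_decay},
        {"params": biases, "lr": 2 * lr, "weight_decay": 0.0},
    ], lr=lr, momentum=0.9)
```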
III-C Results
TABLE II: Test results on the Agriculture-Vision test set (IoU per class; the last column is the mIoU over the six foreground classes, excluding Background).

| Models | mIoU | Background | Cloud shadow | Double plant | Planter skip | Standing water | Waterway | Weed cluster | mIoU (no background) |
| MSCG-Net-50 | 0.547 | 0.780 | 0.507 | 0.466 | 0.343 | 0.688 | 0.513 | 0.530 | 0.508 |
| MSCG-Net-101 | 0.550 | 0.798 | 0.448 | 0.550 | 0.305 | 0.654 | 0.592 | 0.506 | 0.509 |
| Ensemble + TTA | 0.599 | 0.801 | 0.503 | 0.576 | 0.520 | 0.696 | 0.560 | 0.538 | 0.566 |
We evaluated our trained models on the validation set and the held-out test set using a single feed-forward inference pass, without any test-time augmentation (TTA) or model ensembling. However, we also include results for a simple two-model ensemble (MSCG-Net-50 together with MSCG-Net-101) with TTA for completeness. The test results are shown in Table II. Our MSCG-Net-50 model obtained very competitive performance (0.547 mIoU) with a very small number of training parameters (9.59 million) and low computational cost (18.21 Giga FLOPs with input size $4 \times 512 \times 512$), resulting in fast training and inference on both CPU and GPU, as shown in Table III. A qualitative comparison of the segmentation results from our trained models and the ground truths on the validation data is shown in Fig. 3.
TABLE III: Model complexity and inference speed.

| Models | Backbones | Parameters (Million) | FLOPs (Giga) | Inference time (ms, CPU/GPU) |
| MSCG-Net-50 | Se_ResNext50 | 9.59 | 18.21 | 522 / 26 |
| MSCG-Net-101 | Se_ResNext101 | 30.99 | 37.86 | 752 / 45 |
TABLE IV: Ablation results on the validation set.

| Models | Multi-view | Dice loss | ACW loss | mIoU |
| SCG-dice | | ✓ | | 0.456 |
| SCG-acw | | | ✓ | 0.472 |
| MSCG-dice | ✓ | ✓ | | 0.516 |
| MSCG-acw | ✓ | | ✓ | 0.527 |
III-D Ablation Studies
Effect of the multi-view module. To investigate how the multiple views help, we report the results of the single-view models and the multi-view models trained with both Dice loss and ACW loss in Table IV. Note that, for simplicity, we fixed the backbone encoder (Se_ResNext50) and all other training parameters (e.g. learning rate, decay policy, and so on). Also, the mIoUs are computed on the validation set without considering multiple labels. The results suggest that multiple views improve the overall performance from 0.456 to 0.516 (+0.060) mIoU when using Dice loss, and from 0.472 to 0.527 (+0.055) with the proposed ACW loss.
Effect of the ACW loss. As shown in Table IV, for the single-view models the overall performance improves from 0.456 to 0.472 (+0.016) mIoU. For the multi-view models, the performance improves by +0.011, increasing from 0.516 to 0.527. Compared to the single-view SCG-Net with Dice loss, which was proposed in [13] and achieved state-of-the-art performance on a commonly used segmentation benchmark dataset, our multi-view MSCG-Net model with ACW loss achieves roughly 0.07 higher mIoU. The qualitative results in Fig. 4 illustrate that the proposed multi-view model and the adaptive class weighting method help to produce more accurate segmentation results for both larger and smaller classes.
IV Conclusions
In this paper, we presented a multi-view self-constructing graph convolutional network (MSCG-Net) that extends the SCG module, which makes use of learnable latent variables to self-construct the underlying graphs, to explicitly capture multi-view global context representations with rotation invariance in airborne images. We further developed a novel adaptive class weighting loss that alleviates the issue of class imbalance commonly found in semantic segmentation datasets. On the Agriculture-Vision challenge dataset, our MSCG-Net model achieves very robust and competitive results, while making use of fewer parameters and being computationally more efficient.
Acknowledgments
This work is supported by the foundation of the Research Council of Norway under Grant 220832.
References

[1] M. M. Bronstein, J. Bruna, Y. LeCun, A. Szlam, and P. Vandergheynst (2017). Geometric deep learning: going beyond Euclidean data. IEEE Signal Processing Magazine, 34(4), pp. 18–42.
[2] M. T. Chiu et al. (2020). Agriculture-Vision: a large aerial image database for agricultural pattern analysis. arXiv:2001.01306.
[3] J. Hu, L. Shen, and G. Sun (2018). Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7132–7141.
[4] M. Kampffmeyer, Y. Chen, X. Liang, H. Wang, Y. Zhang, and E. P. Xing (2019). Rethinking knowledge graph propagation for zero-shot learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 11487–11496.
[5] M. Kampffmeyer, A.-B. Salberg, and R. Jenssen (2016). Semantic segmentation of small objects and modeling of uncertainty in urban remote sensing images using deep convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 1–9.
[6] D. P. Kingma and M. Welling (2013). Auto-encoding variational Bayes. arXiv:1312.6114.
[7] D. P. Kingma and J. Ba (2014). Adam: a method for stochastic optimization. arXiv:1412.6980.
[8] T. N. Kipf and M. Welling (2016). Semi-supervised classification with graph convolutional networks. arXiv:1609.02907.
[9] T. N. Kipf and M. Welling (2016). Variational graph auto-encoders. arXiv:1611.07308.
[10] B. Knyazev, X. Lin, M. R. Amer, and G. W. Taylor (2019). Image classification with hierarchical multigraph networks. arXiv:1907.09000.
[11] X. Liang, Z. Hu, H. Zhang, L. Lin, and E. P. Xing (2018). Symbolic graph reasoning meets convolutions. In Advances in Neural Information Processing Systems, pp. 1853–1863.
[12] Q. Liu, M. Kampffmeyer, R. Jenssen, and A.-B. Salberg (2020). Dense dilated convolutions' merging network for land cover classification. IEEE Transactions on Geoscience and Remote Sensing, pp. 1–12.
[13] Q. Liu, M. Kampffmeyer, R. Jenssen, and A.-B. Salberg (2020). Self-constructing graph convolutional networks for semantic labeling. arXiv:2003.06932.
[14] J. Long, E. Shelhamer, and T. Darrell (2015). Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3431–3440.
[15] F. Milletari, N. Navab, and S.-A. Ahmadi (2016). V-Net: fully convolutional neural networks for volumetric medical image segmentation. In 2016 Fourth International Conference on 3D Vision (3DV), pp. 565–571.
[16] Y. Wang, Y. Sun, Z. Liu, S. E. Sarma, M. M. Bronstein, and J. M. Solomon (2019). Dynamic graph CNN for learning on point clouds. ACM Transactions on Graphics, 38(5), p. 146.
[17] M. R. Zhang, J. Lucas, J. Ba, and G. E. Hinton (2019). Lookahead optimizer: k steps forward, 1 step back. In Advances in Neural Information Processing Systems, pp. 9593–9604.