Multi-view Self-Constructing Graph Convolutional Networks with Adaptive Class Weighting Loss for Semantic Segmentation

04/21/2020 ∙ by Qinghui Liu, et al. ∙ University of Tromsø, the Arctic University of Norway, and Norsk Regnesentral

We propose a novel architecture called the Multi-view Self-Constructing Graph Convolutional Networks (MSCG-Net) for semantic segmentation. Building on the recently proposed Self-Constructing Graph (SCG) module, which makes use of learnable latent variables to self-construct the underlying graphs directly from the input features without relying on manually built prior knowledge graphs, we leverage multiple views in order to explicitly exploit the rotational invariance in airborne images. We further develop an adaptive class weighting loss to address class imbalance. We demonstrate the effectiveness and flexibility of the proposed method on the Agriculture-Vision challenge dataset, where our model achieves very competitive results (0.547 mIoU) with far fewer parameters and at a lower computational cost than related pure-CNN based work. Code will be available at:




I Introduction

Currently, end-to-end semantic segmentation models are mostly inspired by the idea of fully convolutional networks (FCNs) [14] and generally consist of an encoder-decoder architecture. To achieve higher performance, CNN-based end-to-end methods normally rely on deep and wide multi-scale CNN architectures to create a large receptive field, in order to obtain strong local patterns but also capture long-range dependencies between objects of the scene. However, this approach to modeling global context relationships is highly inefficient and typically requires a large number of trainable parameters, considerable computational resources, and large labeled training datasets.

Fig. 1: Overview of the MSCG-Net. The Self-Constructing Graph module (SCG) learns to transform a 2D feature map into a latent graph structure and assign pixels to the vertices of the graph. Graph Convolutional Networks (GCNs) are then exploited to update the node features along the edges of the graph over K layers. The combined SCG-GCN module takes augmented multi-view input features, obtained by rotating the original features, and the updated multi-view representations are finally fused and projected back onto 2D maps.

Recently, graph neural networks (GNNs)

[1] and Graph Convolutional Networks (GCNs) [8] have received increasing attention and have been applied to, among others, image classification [10], few-shot and zero-shot classification [4], point clouds classification [16] and semantic segmentation [11]. However, these approaches are quite sensitive to how the graph of relations between objects is built and previous approaches commonly rely on manually built graphs based on prior knowledge [11]. In order to address this problem and learn a latent graph structure directly from 2D feature maps for semantic segmentation, the Self-Constructing Graph module (SCG) [13] was recently proposed and has obtained promising results.

In this work, we extend the SCG to explicitly exploit the rotation invariance in airborne images by considering multiple views. More specifically, we augment the input features to obtain multiple rotated views and fuse the multi-view global contextual information before projecting the features back onto the 2-D spatial domain. We further propose a novel adaptive class weighting loss that addresses the issue of class imbalance commonly found in semantic segmentation datasets. Our experiments demonstrate that the MSCG-Net achieves very robust and competitive results on the Agriculture-Vision challenge dataset, which is a subset of the Agriculture-Vision dataset [2].

The rest of the paper is organized as follows. Section II presents the methodology in detail. Section III describes the experimental procedure and evaluates the proposed method. Finally, Section IV draws conclusions.

II Methods

In this section, we briefly present graph convolutional networks and the self-constructing graph (SCG) approach that are the foundation of our proposed model, before presenting our end-to-end trainable Multi-view SCG-Net (MSCG) for semantic labeling tasks with the proposed adaptive class weighting loss.

II-A Graph Convolutional Networks

Graph Convolutional Networks (GCNs) [8] are neural networks designed to operate on and extract information from graphs, and were originally proposed for the task of semi-supervised node classification. Let G = (A, X) denote an undirected graph with N nodes, where A ∈ R^(N×N) is the adjacency matrix and X ∈ R^(N×d) is the feature matrix. At each layer, the GCN aggregates information in one-hop neighborhoods; more specifically, the representation at layer l+1 is computed as

Z^(l+1) = σ(Â Z^(l) θ^(l))    (1)
where θ^(l) are the weights of the GCN at layer l, Z^(0) = X, and Â is the symmetric normalization of A including self-loops [13]:

Â = D^(-1/2) (A + I) D^(-1/2)    (2)
where D is the degree matrix of A + I, I is the identity matrix, and σ(·) denotes the non-linearity function (e.g. ReLU).

Note, in the remainder of the paper we use Z = GCN(A, X) to denote the activations after a K-layer GCN. However, in practice the GCN could be replaced by alternative graph neural network modules that perform K steps of message passing based on some adjacency matrix A and input node features X.
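As a concrete illustration of the propagation rule above, the following is a minimal NumPy sketch of a single GCN layer with symmetric normalization. It is not the authors' implementation; the weight matrix theta stands in for learned parameters.

```python
import numpy as np

def normalize_adjacency(A):
    # Symmetric normalization with self-loops: D^{-1/2} (A + I) D^{-1/2}
    A_hat = A + np.eye(A.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(A_hat.sum(axis=1))
    return A_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def gcn_layer(A_norm, Z, theta):
    # One message-passing step: ReLU(A_norm @ Z @ theta)
    return np.maximum(0.0, A_norm @ Z @ theta)
```

Stacking K such layers yields the K-step message passing denoted GCN(A, X) above.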

II-B Self-Constructing Graph

The Self-Constructing Graph (SCG) module [13] allows the construction of undirected graphs, capturing relations across the image, directly from feature maps, instead of relying on prior knowledge graphs. It has achieved promising performance on semantic segmentation tasks in remote sensing and is efficient with respect to the number of trainable parameters, outperforming much larger models. It is inspired by variational graph auto-encoders [9]. A feature map X ∈ R^(h×w×d) consisting of high-level features, commonly produced by a CNN, is converted to a graph G = (A, X̂). Here X̂ ∈ R^(N×d) are the node features, where N denotes the number of nodes. Parameter-free pooling operations, in our case adaptive average pooling, are employed to reduce the spatial dimensions of X to h′ × w′, followed by a reshape operation to obtain X̂, where N = h′ · w′. A ∈ R^(N×N) is the learned weighted adjacency matrix.

The SCG module learns a mean matrix μ ∈ R^(N×C) and a standard deviation matrix σ ∈ R^(N×C) of a Gaussian using two single-layer convolutional networks. Note, following convention with variational autoencoders [6], the model outputs log σ for the standard deviation, to ensure stable training and positive values for σ. With help of the reparameterization trick, the latent embedding is Z = μ + σ · ε, where ε is an auxiliary noise variable initialized from a standard normal distribution (ε ∼ N(0, I)). A centered isotropic multivariate Gaussian prior distribution is used to regularize the latent variables, by minimizing a Kullback-Leibler divergence loss

L_KL = -1/(2N) Σ_{i=1}^{N} Σ_{j=1}^{C} (1 + log σ_ij² - μ_ij² - σ_ij²)    (3)
Based on the learned embeddings, A is computed as A = ReLU(Z Zᵀ), where A_ij > 0 indicates the presence of an edge between nodes i and j.

Liu et al. [13] further introduce a diagonal regularization term

L_dl = -(γ/N²) Σ_{i=1}^{N} log(|A_ii|_[0,1] + ε)    (4)

where γ is defined as γ = sqrt(1 + N/(Σ_i A_ii + ε)), and a diagonal enhancement approach

Ã = A + γ · diag(A)    (5)

to stabilize training and preserve local information.

The symmetric normalized Â that SCG produces, and that will be the input to later graph operations, is computed as

Â = D^(-1/2) (Ã + I) D^(-1/2)    (6)
The SCG module further produces an adaptive residual prediction, which is used to refine the final prediction of the network after information has been propagated along the graph.
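The SCG construction described above can be sketched in a few lines. This is a rough NumPy illustration under stated assumptions: random linear projections stand in for the paper's single-layer convolutional networks, and a tiny pooling grid replaces the learned configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

def scg_sketch(X, n_side=2, c_latent=4):
    # X: (h, w, d) feature map -> graph (A, X_hat).
    h, w, d = X.shape
    # Adaptive average pooling: mean over equal patches of an n_side x n_side grid
    X_pool = X.reshape(n_side, h // n_side, n_side, w // n_side, d).mean(axis=(1, 3))
    X_hat = X_pool.reshape(-1, d)            # node features, N = n_side**2
    # Stand-ins for the two single-layer convolutions predicting mu and log(sigma)
    W_mu, W_logsig = rng.normal(size=(2, d, c_latent)) * 0.1
    mu = X_hat @ W_mu
    log_sigma = X_hat @ W_logsig             # log(sigma) is predicted for stability
    Z = mu + np.exp(log_sigma) * rng.normal(size=mu.shape)  # reparameterization
    A = np.maximum(0.0, Z @ Z.T)             # ReLU(Z Z^T): non-negative weighted edges
    return A, X_hat
```

The resulting A is symmetric and non-negative, so it can be normalized and fed to the GCN layers.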

II-C The MSCG-Net

Fig. 2: Model architecture of the MSCG-Net for semantic labeling. It comprises a CNN-based feature extractor (a customized Se_ResNext50_32x4d taking 4-channel input and producing 1024-channel output in this work), the SCG module taking three views (the original features plus rotated copies) followed by K-layer GCNs (K=2 in this work), and a fusion block merging the three view outputs; the fused output is projected and upsampled back to 2D maps for the final prediction.

We propose the so-called Multi-view SCG-Net (MSCG-Net), which extends the vanilla SCG and GCN modules by considering multiple rotated views, in order to obtain and fuse more robust global contextual information in airborne images. Fig. 2 shows an illustration of the end-to-end MSCG-Net model for semantic labeling tasks, and the model architecture details are given in Table I. We first augment the features learned by a backbone CNN to multiple views by rotating the features. The employed SCG-GCN module then outputs one prediction per view, each at a different rotation angle. The fusion layer merges all predictions by reversing the rotations and applying element-wise addition, as shown in Table I. Finally, the fused outputs are projected and up-sampled back to the original 2-D spatial domain.
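The multi-view scheme above can be sketched with NumPy. Here the views are quarter-turn rotations of the feature map and the fusion reverses each rotation before element-wise addition; the per-view predictor is a placeholder function, not the SCG-GCN itself.

```python
import numpy as np

def multiview_fuse(feat, predict, quarter_turns=(0, 1, 2)):
    # feat: (h, w, c) feature map; predict: any function mapping a
    # feature map to a score map of the same spatial size.
    fused = None
    for k in quarter_turns:
        view = np.rot90(feat, k, axes=(0, 1))       # rotate by k * 90 degrees
        out = predict(view)
        out = np.rot90(out, -k, axes=(0, 1))        # reverse the rotation
        fused = out if fused is None else fused + out  # element-wise addition
    return fused
```

With an identity predictor, the fused output equals the number of views times the input, confirming that each reversed rotation realigns its view.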

TABLE I: MSCG-Net model details (layers, outputs, and sizes) for one sample input image. Note: the fusion step merges the per-view outputs by element-wise addition, with each view's rotation reversed before the addition.

We utilize the first three bottleneck layers of a pretrained Se_ResNext50_32x4d or Se_ResNext101_32x4d [3] as the backbone CNN to learn high-level representations; its output has 1024 channels. Note that we duplicate the weights corresponding to the red channel of the pretrained input convolution layer in order to accept the 4-channel NIR-RGB input, and 2-layer GCNs (Equation 1) are used in our model. We utilize ReLU activation and batch normalization only for the first GCN layer. The feature dimension of the final GCN layer is set equal to the number of classes, i.e. 7 for the experiments performed in this paper.

II-D Adaptive Class Weighting Loss

The distribution of the classes is highly imbalanced in the dataset (e.g. most pixels in the images belong to the background class, and only a few belong to classes such as planter skip and standing water). To address this problem, most existing methods make use of weighted loss functions with pre-computed class weights based on the pixel frequency of the entire training data [5], scaling the loss for each class-pixel according to a fixed weight before computing gradients. In this work, we introduce a novel class weighting method based on iterative batch-wise class rectification, instead of pre-computing fixed weights over the whole dataset.

The proposed adaptive class weighting method is derived from median frequency balancing weights [5]. We first compute the pixel-frequency of class c over all the past training steps as follows

f_c^t = (f̂_c^t + (t-1) · f_c^{t-1}) / t    (7)

where t ∈ {1, 2, ...} is the current training iteration number, f̂_c^t denotes the pixel-frequency of class c at the current training step, computed as the number of pixels labeled c divided by the total number of labeled pixels in the current batch, and f_c^0 = 0.

The iterative median frequency class weights can thus be computed as

w_c^t = median({f_c^t | c ∈ C}) / (f_c^t + ε)    (8)

where C denotes the set of labels (|C| = 7 in this paper), and ε is a small constant added for numerical stability.

Then we normalize the iterative weights with adaptive broadcasting to the pixel-wise level such that

w̃_{ic} = (w_c^t / Σ_{j∈C} w_j^t) · (1 + y_{ic} + ỹ_{ic})    (9)

where ỹ_{ic} and y_{ic} denote the i-th prediction and the ground truth of class c, respectively, in the current training samples.
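A minimal sketch of the batch-wise running class frequency and median-frequency weighting described above follows. The variable names and batch statistics are illustrative assumptions, not the authors' code.

```python
import numpy as np

def update_freq(freq_prev, onehot_batch, t):
    # Running per-class pixel frequency after t batches.
    # onehot_batch: (num_pixels, num_classes) one-hot ground truth.
    f_now = onehot_batch.sum(axis=0) / onehot_batch.sum()
    return (f_now + (t - 1) * freq_prev) / t

def median_freq_weights(freq, eps=1e-5):
    # Median-frequency balancing: rarer classes receive larger weights.
    return np.median(freq) / (freq + eps)
```

Recomputing the weights from the running frequencies each batch adapts them to the data seen so far, rather than fixing them from a full pass over the dataset.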

In addition, instead of using the traditional cross-entropy function, which focuses on positive samples, we introduce a positive and negative class balanced function (PNC), defined as

p_{ic} = e_{ic} - log(1 - e_{ic})    (10)

where e_{ic} = (y_{ic} - ỹ_{ic})².

Building on the dice coefficient [15] with our adaptive class weighting PNC function, we develop an adaptive class weighting (ACW) loss function for multi-class segmentation tasks

L_ACW = 1/|Y| Σ_{i∈Y} Σ_{c∈C} w̃_{ic} · p_{ic} - log(mean({d_c | c ∈ C}))    (11)

where Y contains all the labeled pixels and d_c is the dice coefficient, given as

d_c = 2 Σ_{i∈Y} y_{ic} ỹ_{ic} / (Σ_{i∈Y} y_{ic} + Σ_{i∈Y} ỹ_{ic})    (12)
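The per-class dice coefficient entering the loss can be sketched as a soft dice over predicted probabilities; the epsilon guard against empty classes is an implementation assumption, not a detail from the paper.

```python
import numpy as np

def dice_per_class(y_true, y_pred, eps=1e-7):
    # y_true, y_pred: (num_pixels, num_classes).
    # Soft dice per class: 2 * sum(y * y_hat) / (sum(y) + sum(y_hat)).
    inter = (y_true * y_pred).sum(axis=0)
    denom = y_true.sum(axis=0) + y_pred.sum(axis=0)
    return (2.0 * inter + eps) / (denom + eps)
```

A perfect prediction yields a dice of 1 for every class, so the negative log of the mean dice in the loss vanishes at the optimum.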
The overall cost function of our model, combining the two regularization terms L_KL and L_dl as defined in Equations 3 and 4, is therefore

L = L_ACW + L_KL + L_dl    (13)
III Experiments and results

We first present the training details and report the results. We then conduct an ablation study to verify the effectiveness of our proposed methods.

III-A Dataset and Evaluation

We train and evaluate our proposed method on the Agriculture-Vision challenge dataset, which is a subset of the Agriculture-Vision dataset [2]. The challenge dataset consists of aerial farmland images captured throughout 2019 across the US. Each image contains four 512×512 channels: RGB and Near Infra-red (NIR). Each image has a boundary map that indicates the region of the farmland, and a mask that indicates the valid pixels in the image. Seven types of annotations are included: Background, Cloud shadow, Double plant, Planter skip, Standing water, Waterway and Weed cluster. Models are evaluated on the NIR-RGB image and segmentation pairs of the validation set, while the final scores are reported on the held-out test set. The mean Intersection-over-Union (mIoU) is used as the main quantitative evaluation metric. Since some annotations may overlap in the dataset, for pixels with multiple labels, a prediction of either label is counted as a correct pixel classification for that label.
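The overlap-aware evaluation rule can be illustrated as follows: when computing the IoU of a class, a multi-labeled pixel is credited to that class whenever the prediction matches any of the pixel's labels. This is a simplified sketch, not the official evaluation code.

```python
import numpy as np

def masked_iou(label_masks, pred, num_classes):
    # label_masks: (num_classes, h, w) boolean ground truth, possibly overlapping;
    # pred: (h, w) integer class map.
    # pred_hits[i, j] is True when the predicted class is among the pixel's labels.
    pred_hits = np.take_along_axis(label_masks, pred[None], axis=0)[0]
    ious = []
    for c in range(num_classes):
        gt = label_masks[c]
        # Count the pixel as predicted-c if predicted c directly, or if it carries
        # label c and the prediction matches another of its labels.
        pred_c = (pred == c) | (gt & pred_hits)
        inter = (gt & pred_c).sum()
        union = (gt | pred_c).sum()
        ious.append(inter / union if union else 1.0)
    return float(np.mean(ious))
```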

III-B Training details

We use backbone models pretrained on ImageNet in this work. We randomly sample patches from the images as input and train the MSCG-Net-50 and MSCG-Net-101 models using mini-batches. The training data (containing 12,901 images) is sampled uniformly, randomly flipped (with probability 0.5) for data augmentation, and shuffled for each epoch.

According to our best practices, we first train the model using Adam [7] combined with Lookahead [17] as the optimizer for the first 10k iterations, and then switch the optimizer to SGD for the remaining iterations, with weight decay applied to all learnable parameters except biases and batch-norm parameters. We also use a higher learning rate for all bias parameters than for weight parameters. Based on our training observations and empirical evaluations, we use separate initial learning rates for MSCG-Net-50 and MSCG-Net-101, and apply a cosine annealing scheduler that reduces the learning rate over epochs. All models are trained on a single NVIDIA GeForce GTX 1080Ti. It took roughly 10 hours to train our model for 25 epochs with batch size 10 over the NIR-RGB training images.
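The cosine annealing schedule mentioned above follows the standard form; this is a sketch, and the minimum learning rate and annealing period are assumptions rather than values from the paper.

```python
import math

def cosine_annealing_lr(step, total_steps, lr_max, lr_min=0.0):
    # Decay from lr_max down to lr_min over total_steps along half a cosine wave.
    cos_out = 0.5 * (1.0 + math.cos(math.pi * step / total_steps))
    return lr_min + (lr_max - lr_min) * cos_out
```

Compared to step decay, the cosine shape keeps the learning rate high early on and flattens the decay near the end of training.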

III-C Results

Models | mIoU (7-class) | Background | Cloud shadow | Double plant | Planter skip | Standing water | Waterway | Weed cluster | mIoU (6-class)
MSCG-Net-50 | 0.547 | 0.780 | 0.507 | 0.466 | 0.343 | 0.688 | 0.513 | 0.530 | 0.508
MSCG-Net-101 | 0.550 | 0.798 | 0.448 | 0.550 | 0.305 | 0.654 | 0.592 | 0.506 | 0.509
Ensemble-TTA | 0.599 | 0.801 | 0.503 | 0.576 | 0.520 | 0.696 | 0.560 | 0.538 | 0.566
TABLE II: mIoUs and class IoUs of our models on the Agriculture-Vision test set. Note: mIoU (7-class) is the mean IoU over all 7 classes, while mIoU (6-class) is over the 6 classes without the background, and Ensemble-TTA denotes the two-model ensemble (MSCG-Net-50 with MSCG-Net-101) combined with TTA methods [12].

We evaluated our trained models on the validation set and the held-out test set with a single feed-forward inference pass, without any test-time augmentation (TTA) or model ensembling. However, we do include results for a simple two-model ensemble (MSCG-Net-50 together with MSCG-Net-101) with TTA for completeness. The test results are shown in Table II. Our MSCG-Net-50 model obtains very competitive performance of 0.547 mIoU with few trainable parameters (9.59 million) and a low computational cost (18.21 GFLOPs), resulting in fast training and inference on both CPU and GPU, as shown in Table III. A qualitative comparison of the segmentation results of our trained models against the ground truths on the validation data is shown in Fig. 3.

Models | Backbones | Parameters (M) | FLOPs (G) | Inference time (ms, CPU / GPU)
MSCG-Net-50 | Se_ResNext50 | 9.59 | 18.21 | 522 / 26
MSCG-Net-101 | Se_ResNext101 | 30.99 | 37.86 | 752 / 45
TABLE III: Quantitative comparison of parameter count, FLOPs, and inference time on CPU and GPU.
Fig. 3: Segmentation results on validation data. From left to right: the input images, the ground truths, and the predictions of our trained models.
Models | Loss | mIoU
SCG-Net | Dice loss | 0.456
SCG-Net | ACW loss | 0.472
MSCG-Net | Dice loss | 0.516
MSCG-Net | ACW loss | 0.527
TABLE IV: Ablation study of our proposed network. Note that, for simplicity, we fixed the learning hyper-parameters and the backbone encoder, and mIoU is evaluated on the validation set without considering overlapping annotations.
Fig. 4: Segmentation results using different models. From left to right: the input images, the ground truths, SCG-Net with Dice loss, SCG-Net with ACW loss, MSCG-Net with Dice loss, and MSCG-Net with ACW loss.

III-D Ablation studies

Effect of the multi-view. To investigate how the multiple views help, we report the results of the single-view models and the multi-view models trained with both Dice loss and ACW loss in Table IV. Note that, for simplicity, we fixed the backbone encoder as Se_ResNext50 and all other training settings (e.g. learning rate, decay policy, and so on). Also, the mIoUs are computed on the validation set without considering multiple labels. The results suggest that multiple views improve the overall performance from 0.456 to 0.516 (+0.060) mIoU when using Dice loss, and from 0.472 to 0.527 (+0.055) with the proposed ACW loss.

Effect of the ACW loss. As shown in Table IV, for the single-view models the overall performance improves from 0.456 to 0.472 (+0.016) mIoU. For the multi-view models, the performance improves by 0.011, increasing from 0.516 to 0.527. Compared to the single-view SCG-Net with Dice loss, which was proposed in [13] and achieved state-of-the-art performance on a commonly used segmentation benchmark dataset, our multi-view MSCG-Net model with ACW loss achieves roughly 0.07 higher mIoU. The qualitative results in Fig. 4 illustrate that the proposed multi-view model and the adaptive class weighting method help to produce more accurate segmentation results for both larger and smaller classes.

IV Conclusions

In this paper, we presented the multi-view self-constructing graph convolutional network (MSCG-Net), which extends the SCG module, making use of learnable latent variables to self-construct the underlying graphs, to explicitly capture multi-view global context representations with rotation invariance in airborne images. We further developed a novel adaptive class weighting loss that alleviates the issue of class imbalance commonly found in semantic segmentation datasets. On the Agriculture-Vision challenge dataset, our MSCG-Net model achieves very robust and competitive results, while using fewer parameters and being computationally more efficient than related pure-CNN based work.


Acknowledgment

This work is supported by the Research Council of Norway under Grant 220832.


  • [1] M. M. Bronstein, J. Bruna, Y. LeCun, A. Szlam, and P. Vandergheynst (2017) Geometric deep learning: going beyond Euclidean data. IEEE Signal Processing Magazine 34 (4), pp. 18–42.
  • [2] M. T. Chiu, X. Xu, Y. Wei, Z. Huang, A. Schwing, R. Brunner, H. Khachatrian, H. Karapetyan, I. Dozier, G. Rose, D. Wilson, A. Tudor, N. Hovakimyan, T. S. Huang, and H. Shi (2020) Agriculture-Vision: a large aerial image database for agricultural pattern analysis. arXiv preprint arXiv:2001.01306.
  • [3] J. Hu, L. Shen, and G. Sun (2018) Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7132–7141.
  • [4] M. Kampffmeyer, Y. Chen, X. Liang, H. Wang, Y. Zhang, and E. P. Xing (2019) Rethinking knowledge graph propagation for zero-shot learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 11487–11496.
  • [5] M. Kampffmeyer, A. Salberg, and R. Jenssen (2016) Semantic segmentation of small objects and modeling of uncertainty in urban remote sensing images using deep convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 1–9.
  • [6] D. P. Kingma and M. Welling (2013) Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114.
  • [7] D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980.
  • [8] T. N. Kipf and M. Welling (2016) Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907.
  • [9] T. N. Kipf and M. Welling (2016) Variational graph auto-encoders. arXiv preprint arXiv:1611.07308.
  • [10] B. Knyazev, X. Lin, M. R. Amer, and G. W. Taylor (2019) Image classification with hierarchical multigraph networks. arXiv preprint arXiv:1907.09000.
  • [11] X. Liang, Z. Hu, H. Zhang, L. Lin, and E. P. Xing (2018) Symbolic graph reasoning meets convolutions. In Advances in Neural Information Processing Systems, pp. 1853–1863.
  • [12] Q. Liu, M. Kampffmeyer, R. Jenssen, and A. Salberg (2020) Dense dilated convolutions' merging network for land cover classification. IEEE Transactions on Geoscience and Remote Sensing, pp. 1–12.
  • [13] Q. Liu, M. Kampffmeyer, R. Jenssen, and A. Salberg (2020) Self-constructing graph convolutional networks for semantic labeling. arXiv preprint arXiv:2003.06932.
  • [14] J. Long, E. Shelhamer, and T. Darrell (2015) Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3431–3440.
  • [15] F. Milletari, N. Navab, and S. Ahmadi (2016) V-Net: fully convolutional neural networks for volumetric medical image segmentation. In 2016 Fourth International Conference on 3D Vision (3DV), pp. 565–571.
  • [16] Y. Wang, Y. Sun, Z. Liu, S. E. Sarma, M. M. Bronstein, and J. M. Solomon (2019) Dynamic graph CNN for learning on point clouds. ACM Transactions on Graphics (TOG) 38 (5), pp. 146.
  • [17] M. Zhang, J. Lucas, J. Ba, and G. E. Hinton (2019) Lookahead optimizer: k steps forward, 1 step back. In Advances in Neural Information Processing Systems, pp. 9593–9604.