Tree-structured Kronecker Convolutional Networks for Semantic Segmentation

12/12/2018 ∙ by Tianyi Wu, et al. ∙ Institute of Computing Technology, Chinese Academy of Sciences

Most existing semantic segmentation methods employ atrous convolution to enlarge the receptive field of filters, but neglect partial information. To tackle this issue, we first propose a novel Kronecker convolution, which uses the Kronecker product to expand the standard convolutional kernel so that it takes into account the partial features neglected by atrous convolutions. It can therefore capture partial information and enlarge the receptive field of filters simultaneously, without introducing extra parameters. Second, we propose the Tree-structured Feature Aggregation (TFA) module, which expands by following a recursive rule and forms a hierarchical structure. It can thus naturally learn representations of multi-scale objects and encode hierarchical contextual information in complex scenes. Finally, we design Tree-structured Kronecker Convolutional Networks (TKCN), which employ Kronecker convolution and the TFA module. Extensive experiments on three datasets, PASCAL VOC 2012, PASCAL-Context and Cityscapes, verify the effectiveness of our proposed approach. We make the code and the trained model publicly available at https://github.com/wutianyiRosun/TKCN.

1 Introduction

Semantic segmentation is a significant challenge in computer vision. Its goal is to assign one semantic label to each pixel in an image. Current segmentation models [1, 2] based on Deep Convolutional Neural Networks (DCNNs), such as Fully Convolutional Networks (FCNs) [2], achieve good performance on several semantic segmentation benchmarks [3, 4]. These models transfer classification networks [5, 6] pre-trained on the ImageNet dataset [7] to generate segmentation predictions by removing max-pooling, altering fully connected layers and adding deconvolutional layers. More recently, employing atrous convolutions, also named dilated convolutions [8], instead of standard convolutions in some layers of FCNs has become the mainstream, since atrous convolutions can enlarge the field of view while maintaining the resolution of feature maps. Although atrous convolutions perform well in semantic segmentation, they lack the capability of capturing partial information. To illustrate this issue, we define the Valid Feature Ratio (VFR) as the ratio of the number of feature vectors involved in the computation to the number of all feature vectors in the convolutional patch. VFR measures the utilization ratio of features in convolutional patches. As shown in Fig. 1 (a), the VFR of atrous convolutions is relatively low, which means that much partial information is neglected. As shown in Fig. 1 (b), atrous convolutions lose important partial information when employing a large dilation rate. Typically, only 9 out of 81 feature vectors in the convolutional patch are involved in the computation. In particular, when the rate is extremely large and exceeds the size of the feature maps, the 3×3 filters degenerate to 1×1 filters that do not capture global contextual information, since only the center filter branch is effective.

To address the above problems, we propose Kronecker convolutions, inspired by the Kronecker product in computational and applied mathematics [9]. Kronecker convolutions not only inherit the advantages of atrous convolutions but also mitigate their limitation. They employ the Kronecker product to expand the standard convolutional kernel so that feature vectors neglected by atrous convolutions can be captured, as shown in Fig. 1 (c). Kronecker convolution has two factors: the inter-dilating factor $r_1$ and the intra-sharing factor $r_2$. On the one hand, the inter-dilating factor controls the number of holes inserted into kernels. Kronecker convolutions can therefore enlarge the field of view while maintaining the resolution of feature maps; that is, they inherit the advantages of atrous convolutions. On the other hand, the intra-sharing factor controls the size of the subregions in which feature vectors are captured and filter vectors are shared. Thus, Kronecker convolutions can take partial information into account and increase the VFR without increasing the number of parameters, as shown in Fig. 1 (a) and (c).

Furthermore, scenes in images have hierarchical structures, which can be decomposed into small-range or local scenes (such as a single object), middle-range scenes (such as multiple objects) and large-range scenes (such as the whole image). Efficiently capturing hierarchical contextual information in complex scenes is important for semantic segmentation and remains a challenge. Based on this observation, we propose the Tree-structured Feature Aggregation (TFA) module to encode hierarchical context information, which helps to better understand complex scenes and improves segmentation accuracy. The TFA module expands by following a recursive rule and forms a tree-shaped, hierarchical structure. Each layer in the TFA module has two branches: one branch preserves the features of the current region, and the other aggregates spatial dependencies within a larger range. The proposed TFA module has two main advantages: (1) being oriented toward the hierarchical structures in complex scenes, it can capture hierarchical contextual information effectively and efficiently; (2) compared with existing multi-scale feature fusion methods, which are based on preset scales and rely on inherent network structures, it can naturally learn representations of multi-scale objects through its tree-shaped structure.

Based on the above observation, we propose Tree-structured Kronecker Convolutional Network (TKCN) for semantic segmentation, which employs Kronecker convolution and TFA module to form a unified framework. We perform experiments on three popular semantic segmentation benchmarks, including PASCAL VOC 2012, Cityscapes and PASCAL-Context. Experimental results verify the effectiveness of our proposed approaches. Our main contributions can be summarized into three aspects:

  • We propose Kronecker convolutions, which can effectively capture partial detail information and enlarge the field of view simultaneously, without introducing extra parameters.

  • We develop Tree-structured Feature Aggregation module to capture hierarchical contextual information and represent multi-scale objects, which is beneficial for better understanding complex scenes.

  • Without any post-processing steps, our designed TKCN achieves impressive results on the benchmarks of PASCAL VOC 2012, Cityscapes and PASCAL-Context.

2 Related Work

In this section, we first review the use of the Kronecker product in deep learning and popular semantic segmentation approaches, and then introduce related work on two aspects of semantic segmentation: Conditional Random Fields (CRFs) and multi-scale feature fusion.

Kronecker Product KFC [10] uses the Kronecker product to exploit the local structures within convolutional and fully connected layers, replacing the large weight matrices with combinations of multiple Kronecker products of smaller matrices that approximate the original weights. In contrast, we employ the Kronecker product to expand the standard convolutional kernel, enlarging the receptive field of filters and capturing partial information neglected by atrous convolutions.

Figure 2: Architecture of the proposed TKCN. We employ Kronecker convolutions in ResNet-101 ‘Res4’ and ‘Res5’. Tree-structured Feature Aggregation module is implemented after the last layer of ‘Res5’.

Semantic Segmentation Semantic segmentation is a fundamental task in computer vision. Recently, approaches based on Deep Convolutional Neural Networks [11, 5, 6] achieve remarkable progress in semantic segmentation task, such as DeconvNets [12], DeepLab [1] and FCNs [2]. FCNs transfer the networks of image classification for pixel-level labeling. DeconvNets employ multiple deconvolution layers to enlarge feature maps and generate whole-image predictions. DeepLab methods use atrous convolutions to enlarge the receptive fields so as to capture contextual information. Following these structures, many frameworks are proposed to further improve the accuracy of semantic segmentation.

Conditional Random Fields One common approach to capturing fine-grained details and refining segmentation predictions is CRFs, which are suitable for capturing long-range dependencies and fine local details. CRFasRNN [13] reformulates DenseCRF with pairwise potential functions and unrolls the mean-field steps as recurrent neural networks, which yields a unified framework that can be learned end-to-end. In contrast, the DeepLab frameworks [1] use DenseCRF [14] as post-processing. Subsequently, many approaches combine CRFs and DCNNs in unified frameworks, for example by combining Gaussian CRFs [15] or specific pairwise potentials [16]. Some other approaches directly learn pairwise relationships. SPN [17] constructs a row/column linear propagation model to capture dense, global pairwise relationships in an image, and Spatial CNN [18] learns the spatial relationships of pixels across rows and columns in an image. While these approaches achieve remarkable improvements, they increase the overall computational complexity of the networks.

Multi-scale Feature Fusion Since objects in scene images have various sizes, multi-scale feature fusion is widely used in semantic segmentation approaches to learn features at multiple scales. Some approaches aggregate features from multiple middle layers. The original FCNs [2] utilize skip connections to perform late fusion. Hypercolumn [19] merges features from middle layers to learn dense classification layers. RefineNet [20] pools features with multiple window sizes and fuses them with residual connections and learnable weights. Some methods obtain multi-scale features from the inputs, such as utilizing a Laplacian pyramid [21], employing multi-scale inputs sequentially from coarse to fine [22], or simply resizing input images to multiple sizes [23]. Some other approaches propose feature pyramid modules. DeepLab-v2 [1] employs four parallel atrous convolutional layers with different rates to capture objects and context information at multiple scales. PSPNet [24] performs spatial pooling at four grid scales. More recently, DFN [25] proposes a Smooth Network for fusing feature maps across different stages, and CCL [26] proposes a gated sum scheme to selectively aggregate multi-scale features for each spatial position. Most multi-scale feature fusion methods are limited by preset scales or by their reliance on the inherent network structure.

In this paper, we propose Kronecker convolutions to capture partial information neglected by atrous convolutions. Different from the computationally expensive CRF-based approaches and feature fusion methods with manually preset scales, we propose the TFA module to efficiently aggregate features of multiple scales through a tree-shaped structure and the recursive rule.

3 Proposed Approaches

In this paper, we design the Tree-structured Kronecker Convolutional Network (TKCN) for semantic segmentation. As illustrated in Fig. 2, TKCN employs our proposed Kronecker convolution and TFA module. In the following, firstly we formulate the proposed Kronecker convolutions and explain why they can capture partial detail information when enlarging the receptive fields. Then we present the TFA module and analyze how it can efficiently aggregate hierarchical contextual information.

3.1 Kronecker Convolution

Inspired by Kronecker product in computational and applied mathematics, we explore a novel Kronecker convolution whose kernels are transformed by performing Kronecker product.

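First of all, we briefly review the Kronecker product. If $A$ is an $m \times n$ matrix and $B$ is an $r \times s$ matrix, the Kronecker product $A \otimes B$ is the $mr \times ns$ matrix:

$$A \otimes B = \begin{bmatrix} a_{11}B & \cdots & a_{1n}B \\ \vdots & \ddots & \vdots \\ a_{m1}B & \cdots & a_{mn}B \end{bmatrix} \tag{1}$$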
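As a small illustration of the Kronecker product reviewed above, the following sketch uses NumPy's `np.kron`. The two example matrices are arbitrary values chosen for illustration and do not come from the paper.

```python
import numpy as np

# Two small example matrices (values chosen only for illustration).
A = np.array([[1, 2],
              [3, 4]])        # m x n = 2 x 2
B = np.array([[0, 1, 0],
              [1, 1, 1]])     # r x s = 2 x 3

# np.kron replaces every entry a_ij of A with the block a_ij * B.
K = np.kron(A, B)

print(K.shape)  # (4, 6), i.e. (m*r, n*s)

# The block in block-row 1, block-column 0 equals a_10 * B.
print(np.array_equal(K[2:4, 0:3], A[1, 0] * B))  # True
```

Each entry of the first matrix scales a full copy of the second one, which is the property the Kronecker convolution relies on to spread one kernel weight over a subregion.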

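A standard convolution takes the input feature maps $A \in \mathbb{R}^{H_A \times W_A \times C_A}$ and outputs the feature maps $B \in \mathbb{R}^{H_B \times W_B \times C_B}$, where $W_A$, $W_B$, $H_A$, $H_B$, $C_A$, $C_B$ are the widths, heights and channels of $A$ and $B$ respectively. The kernel of the standard convolution is $K \in \mathbb{R}^{(2k+1) \times (2k+1) \times C_A \times C_B}$ and the bias is $b \in \mathbb{R}^{C_B}$. Any feature vector $B_t \in \mathbb{R}^{C_B}$ in $B$ at position $t$ is the multiplication of the kernel $K$ and the associated convolutional patch $X_t$ in $A$, where $X_t \in \mathbb{R}^{(2k+1) \times (2k+1) \times C_A}$. $X_t$ is a square centered at $(m_t, n_t)$, where $m_t$ and $n_t$ are coordinates in $A$. So the coordinates of the feature vectors in $X_t$ are:

$$x_{ij} = m_t + i, \qquad y_{ij} = n_t + j, \tag{2}$$

where $i, j \in [-k, k] \cap \mathbb{Z}$. Let $K_{ij} \in \mathbb{R}^{C_A \times C_B}$ be the slice of $K$ at kernel offset $(i, j)$ and $A(x_{ij}, y_{ij}) \in \mathbb{R}^{C_A}$ the feature vector of the patch at that offset. The convolutional operator can then be formulated as matrix multiplication:

$$B_t = \sum_{i=-k}^{k} \sum_{j=-k}^{k} K_{ij}^{\top} A(x_{ij}, y_{ij}) + b. \tag{3}$$

For our proposed Kronecker convolution, we introduce a transformation matrix $F$ and enlarge the kernel by computing the Kronecker product of $K$ and $F$. We set $F$ as a fixed $r_1 \times r_1$ matrix. The inter-dilating factor $r_1$ controls the dilation rate of the convolution. The kernel of the Kronecker convolution therefore expands from $(2k+1) \times (2k+1)$ for $K$ to $(2k+1)r_1 \times (2k+1)r_1$ for $K \otimes F$. To avoid introducing extra parameters in the Kronecker convolution, we simply set $F$ as the combination of an $r_2 \times r_2$ matrix $\mathbf{1}_{r_2 \times r_2}$ and zero matrices $\mathbf{0}$, where $\mathbf{1}_{r_2 \times r_2}$ is a square matrix whose elements are all 1. We denote the intra-sharing factor as $r_2$ ($1 \le r_2 \le r_1$), which controls the size of the subregions in which feature vectors are captured and filter vectors are shared. Thus, the kernel of the Kronecker convolution can be formulated as:

$$K'(c_A, c_B) = K(c_A, c_B) \otimes F, \qquad F = \begin{bmatrix} \mathbf{1}_{r_2 \times r_2} & \mathbf{0} \\ \mathbf{0} & \mathbf{0} \end{bmatrix} \in \mathbb{R}^{r_1 \times r_1}, \tag{4}$$

where $c_A \in [1, C_A] \cap \mathbb{Z}$ and $c_B \in [1, C_B] \cap \mathbb{Z}$ index the input and output channels, and $K(c_A, c_B)$ is the corresponding $(2k+1) \times (2k+1)$ spatial slice of $K$. Correspondingly, the associated convolutional patch in $A$, denoted as $Y_t$, also expands to a square of $(2k+1)r_1 \times (2k+1)r_1$. The coordinates of the feature vectors involved in the computation in $Y_t$ are:

$$x_{ij}^{uv} = m_t + i r_1 + u, \qquad y_{ij}^{uv} = n_t + j r_1 + v, \tag{5}$$

where $i, j \in [-k, k] \cap \mathbb{Z}$ and $u, v \in [0, r_2 - 1] \cap \mathbb{Z}$. Let $K_{ij}$ be defined as above and let $A(x_{ij}^{uv}, y_{ij}^{uv}) \in \mathbb{R}^{C_A}$ be the feature vector of $Y_t$ at offset $(u, v)$ of subregion $(i, j)$. Therefore, the operator of the Kronecker convolution can be formulated as:

$$B_t = \sum_{i=-k}^{k} \sum_{j=-k}^{k} K_{ij}^{\top} \left( \sum_{u=0}^{r_2-1} \sum_{v=0}^{r_2-1} A(x_{ij}^{uv}, y_{ij}^{uv}) \right) + b. \tag{6}$$

Figure 3: Left: A simple expansion rule generates a TFA architecture. Right: Tree-structured Feature Aggregation module.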

Compared with atrous convolutions, which simply insert zeros to expand kernels, Kronecker convolutions expand kernels through the Kronecker product with the transformation matrix $F$. The inter-dilating factor $r_1$ controls the dilation rate of the kernels. According to Eqn. (5), when $r_1$ becomes larger, the convolutional patches expand so that the receptive fields are enlarged correspondingly. Since $F$ contains only ones and zeros, no additional parameters are introduced in Kronecker convolutions. Moreover, because $F$ contains an $r_2 \times r_2$ all-ones submatrix, Kronecker convolutions can capture local contextual information ignored by atrous convolutions. As shown in Eqn. (6), each kernel branch aggregates the features in a subregion. The VFR of Kronecker convolutions is $r_2^2 / r_1^2$, while atrous convolutions with the same rate have a VFR of $1 / r_1^2$. The VFR of Kronecker convolutions is therefore never smaller than that of atrous convolutions, since $r_2 \ge 1$, and it is strictly larger whenever $r_2 > 1$. When $r_2 = r_1$, the VFR of Kronecker convolutions is 100%. In conclusion, our proposed Kronecker convolutions can capture partial information and enlarge the field of view simultaneously without introducing extra parameters.
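The kernel expansion and the resulting VFR can be checked numerically. The sketch below is an illustration under stated assumptions, not the authors' Caffe implementation: it treats a single-channel spatial kernel, places the all-ones block in the top-left corner of the transformation matrix, and uses hypothetical helper names (`transformation_matrix`, `kronecker_kernel`, `valid_feature_ratio`).

```python
import numpy as np

def transformation_matrix(r1: int, r2: int) -> np.ndarray:
    """r1 x r1 matrix of zeros with an r2 x r2 block of ones (top-left corner)."""
    if not 1 <= r2 <= r1:
        raise ValueError("expected 1 <= r2 <= r1")
    F = np.zeros((r1, r1))
    F[:r2, :r2] = 1.0
    return F

def kronecker_kernel(K: np.ndarray, r1: int, r2: int) -> np.ndarray:
    """Expand a single-channel spatial kernel K with the Kronecker product."""
    return np.kron(K, transformation_matrix(r1, r2))

def valid_feature_ratio(r1: int, r2: int, kernel_size: int = 3) -> float:
    """Fraction of positions in the expanded patch that take part in the sum."""
    # All taps are set to one so that only the sampling pattern matters.
    pattern = kronecker_kernel(np.ones((kernel_size, kernel_size)), r1, r2)
    return np.count_nonzero(pattern) / pattern.size

K = np.arange(1.0, 10.0).reshape(3, 3)       # nine distinct example weights
expanded = kronecker_kernel(K, r1=4, r2=3)

print(expanded.shape)                        # (12, 12)
print(np.unique(expanded[expanded != 0]).size)  # 9: no new weights are created
print(valid_feature_ratio(4, 3))             # 0.5625
print(valid_feature_ratio(3, 1))             # 9 of 81 positions
```

The expanded kernel covers a 12×12 patch while still holding only the nine original weights, each shared over a 3×3 subregion, which matches the claim that the field of view grows without adding parameters.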

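The proposed Kronecker convolutions can be treated as a generalization of standard convolutions and atrous convolutions. If $r_2 = 1$, Kronecker convolutions degenerate to atrous convolutions, since the kernel changes to:

$$K' = K \otimes F, \qquad F = \begin{bmatrix} 1 & \mathbf{0} \\ \mathbf{0} & \mathbf{0} \end{bmatrix} \in \mathbb{R}^{r_1 \times r_1}. \tag{7}$$

Therefore, the coordinates in Eqn. (5) change to:

$$x_{ij} = m_t + i r_1, \qquad y_{ij} = n_t + j r_1, \tag{8}$$

where $(m_t, n_t)$ is the position in the input feature maps $A$ at which the kernel is applied and $i, j \in [-k, k] \cap \mathbb{Z}$ index the taps of the $(2k+1) \times (2k+1)$ kernel $K$. Let $Z_t$ denote the corresponding convolutional patch in $A$, whose feature vectors are $A(x_{ij}, y_{ij})$. Then Eqn. (6) reduces to

$$B_t = \sum_{i=-k}^{k} \sum_{j=-k}^{k} K_{ij}^{\top} A(x_{ij}, y_{ij}) + b, \tag{9}$$

which is an atrous convolution with rate $r_1$. Additionally, if $r_1 = r_2 = 1$, Kronecker convolutions degenerate to standard convolutions.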
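The two degenerate cases can be verified on the kernel sampling pattern. This is a minimal sketch for a single-channel kernel; `kronecker_kernel` and `atrous_kernel` are hypothetical helper names, and the all-ones block is assumed to sit in the top-left corner of the transformation matrix.

```python
import numpy as np

def kronecker_kernel(K, r1, r2):
    """K expanded by the Kronecker product with an r1 x r1 matrix whose
    top-left r2 x r2 block is ones."""
    F = np.zeros((r1, r1))
    F[:r2, :r2] = 1.0
    return np.kron(K, F)

def atrous_kernel(K, rate):
    """Dilated kernel: taps of K placed `rate` apart, zeros in between."""
    size = (K.shape[0] - 1) * rate + 1
    D = np.zeros((size, size))
    D[::rate, ::rate] = K
    return D

K = np.arange(1.0, 10.0).reshape(3, 3)
r1 = 4

kron_r2_1 = kronecker_kernel(K, r1, 1)   # 12 x 12
dilated = atrous_kernel(K, r1)           # 9 x 9
n = dilated.shape[0]

# With an intra-sharing factor of 1, the taps and their spacing match the
# dilated kernel; the Kronecker form only carries extra trailing zeros.
same_taps = np.array_equal(kron_r2_1[:n, :n], dilated)
only_zero_padding = not kron_r2_1[n:, :].any() and not kron_r2_1[:, n:].any()

# With both factors equal to 1, the kernel is unchanged.
unchanged = np.array_equal(kronecker_kernel(K, 1, 1), K)

print(same_taps, only_zero_padding, unchanged)  # True True True
```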

3.2 Tree-structured Feature Aggregation Module

To capture hierarchical context information and represent objects of multiple scales in complex scenes, we propose the TFA module. The TFA module takes the features extracted by the backbone network as its input and follows an expansion and stacking rule to efficiently encode multi-scale features. As illustrated in the left subfigure of Fig. 3, in each expansion step the input is duplicated into two branches. One branch preserves the features of the current scale, and the other explores spatial dependencies within a larger range. Simultaneously, the output features of the current step are stacked with the previous features through concatenation. This expansion and stacking rule can be formulated as:

$$x_n = f_n(x_{n-1}), \qquad H_n = g(H_{n-1}, x_n), \tag{10}$$

where $H_0 = x_0$ is the input of the module, $x_n$ is the output of step $n$, $f_n$ denotes the operators implemented in step $n$, $H_n$ is the result of the TFA module with $n$ steps, and $g$ represents the concatenation operator. In the proposed TFA module, we employ Kronecker convolutions with different inter-dilating factors and intra-sharing factors to capture multi-scale features, followed by Batch Normalization and ReLU layers. Finally, the features of all the branches are aggregated. As shown in the right subfigure of Fig. 3, in our experiments we use a TFA module with three expansion steps, and the features from all branches are concatenated at the end. To make a trade-off between computational complexity and model capability, we reduce the number of output channels of each convolutional layer in the TFA module relative to the number of channels of its input feature maps.

Following the above expansion and stacking rule, TFA module forms a tree-shaped and hierarchical structure, which can effectively and efficiently capture hierarchical contextual information and aggregate features from multiple scales. Moreover, features learned from the previous steps can be re-explored in the subsequent steps, which is superior to the existing parallel structure with multiple individual branches.
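The expansion and stacking rule can be sketched in a few lines. The code below is only an illustration of the control flow: the step operator is a stand-in (a box filter followed by ReLU) rather than the Kronecker convolution, Batch Normalization and ReLU used in the paper, the channel reduction is omitted, and the names `box_filter_relu` and `tfa` as well as the window radii are assumptions.

```python
import numpy as np

def box_filter_relu(x, radius):
    """Stand-in step operator: mean over a (2*radius+1)^2 window, then ReLU.
    x has shape (channels, height, width)."""
    _, h, w = x.shape
    padded = np.pad(x, ((0, 0), (radius, radius), (radius, radius)), mode="edge")
    out = np.zeros_like(x)
    for dy in range(2 * radius + 1):
        for dx in range(2 * radius + 1):
            out += padded[:, dy:dy + h, dx:dx + w]
    out /= (2 * radius + 1) ** 2
    return np.maximum(out, 0.0)

def tfa(x0, radii):
    """Each step transforms the previous step's output over a larger range
    and stacks the result onto everything collected so far."""
    current, stacked = x0, x0
    for radius in radii:
        current = box_filter_relu(current, radius)             # larger-range branch
        stacked = np.concatenate([stacked, current], axis=0)   # keep earlier features
    return stacked

features = np.random.default_rng(0).random((8, 16, 16))
out = tfa(features, radii=[1, 2, 4])   # three expansion steps
print(out.shape)                       # (32, 16, 16)
```

Because each step consumes the output of the previous one, later steps re-explore features already aggregated at smaller ranges, which is what distinguishes this chain from a set of independent parallel branches.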

4 Experiments

In this section, we perform comprehensive experiments on three semantic segmentation benchmarks to show the effectiveness of our proposed approaches, including PASCAL VOC 2012 [3], Cityscapes [4] and PASCAL-Context [27].

$r_1$ (inter-dilating)   $r_2$ (intra-sharing)   mIoU (%)   Acc (%)
6                        1                       77.03      94.97
6                        3                       78.37      95.25
6                        5                       78.75      95.36
10                       1                       78.01      95.17
10                       3                       78.53      95.24
10                       5                       78.93      95.34
10                       7                       79.50      95.53
10                       9                       79.71      95.54
Table 1: Evaluation results of Kronecker convolution (KConv) with different intra-sharing factors on the PASCAL VOC 2012 validation set.

4.1 Experimental Settings

PASCAL VOC 2012 Dataset The PASCAL VOC 2012 segmentation benchmark [3] contains 20 foreground object categories and 1 background class. The original dataset contains 1,464 training images, 1,449 validation images, and 1,456 test images. Extra annotations from [28] are provided to augment the training set to 10,582 images. The performance is measured by pixel intersection-over-union (IoU) averaged across the 21 classes.

Cityscapes Dataset The Cityscapes dataset [4] contains 5,000 images collected in street scenes from 50 different cities. The dataset is divided into three subsets: 2,975 images in the training set, 500 images in the validation set and 1,525 images in the test set. High-quality pixel-level annotations of 19 semantic classes are provided. Intersection over Union (IoU) averaged over all the categories is adopted for evaluation.

PASCAL-Context Dataset The PASCAL-Context dataset [27] contains 4,998 images in the training set and 5,105 images in the validation set. It provides detailed semantic labels for the whole scene. Our proposed models are evaluated on the 59 most frequent categories and 1 background class.
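The evaluation metric used on all three datasets, IoU averaged over classes, can be computed from a confusion matrix. The sketch below is one common way to do this and is not taken from the authors' evaluation code; the ignore label 255 and the toy label maps are assumptions for illustration.

```python
import numpy as np

def confusion_matrix(pred, target, num_classes, ignore_label=255):
    """Rows are ground-truth classes, columns are predicted classes."""
    valid = target != ignore_label           # skip unlabeled pixels
    idx = num_classes * target[valid].astype(np.int64) + pred[valid].astype(np.int64)
    counts = np.bincount(idx, minlength=num_classes ** 2)
    return counts.reshape(num_classes, num_classes)

def mean_iou(conf):
    """Per-class IoU = TP / (TP + FP + FN), averaged over classes that occur."""
    tp = np.diag(conf).astype(np.float64)
    union = conf.sum(axis=0) + conf.sum(axis=1) - tp
    present = union > 0
    return float((tp[present] / union[present]).mean())

def pixel_accuracy(conf):
    return float(np.diag(conf).sum() / conf.sum())

# Toy 2 x 3 label maps with three classes; 255 marks an unlabeled pixel.
target = np.array([[0, 0, 1],
                   [1, 2, 255]])
pred = np.array([[0, 1, 1],
                 [1, 2, 2]])

conf = confusion_matrix(pred, target, num_classes=3)
miou = mean_iou(conf)
acc = pixel_accuracy(conf)
print(round(miou, 4), acc)   # 0.7222 0.8
```

For a whole dataset, the confusion matrices of all images are summed before the IoU is computed, so that rare classes are not averaged per image.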

4.1.1 Implementation Details

We take ResNet-101 [6] as our baseline model, which employs atrous convolutions with and in layers of ‘Res4’ and ‘Res5’, respectively. So the resolution of the predictions can be enlarged from to . Our loss function is the sum of cross-entropy terms for each spatial position in the output score map, ignoring the unlabeled pixels. All the experiments are performed on the Caffe platform. We employ the “poly” learning rate policy, in which we set the base learning rate to and power to . Momentum and weight decay are set to and respectively. For data augmentation, we employ random mirroring and random resizing between 0.5 and 2 for all training samples.
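The "poly" policy, as defined in Caffe's solver, scales the base learning rate by (1 - iter / max_iter) raised to a power. The sketch below illustrates that schedule; the base learning rate, power and iteration count are placeholder values chosen for illustration, since the values used in the experiments are not recoverable from this text.

```python
def poly_lr(base_lr: float, iteration: int, max_iter: int, power: float) -> float:
    """Learning rate under the 'poly' policy at a given iteration."""
    if not 0 <= iteration <= max_iter:
        raise ValueError("iteration must lie in [0, max_iter]")
    return base_lr * (1.0 - iteration / max_iter) ** power

# Placeholder hyper-parameters (assumptions, not the paper's settings).
BASE_LR, POWER, MAX_ITER = 0.01, 0.9, 1000

schedule = [poly_lr(BASE_LR, it, MAX_ITER, POWER) for it in range(0, MAX_ITER + 1, 250)]
for it, lr in zip(range(0, MAX_ITER + 1, 250), schedule):
    print(it, lr)
```

The rate starts at the base value, decays monotonically and reaches zero at the final iteration.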

4.2 Ablation Studies

We evaluate the effectiveness of the two proposed components, Kronecker convolution and TFA module. All the experiments of ablation studies are conducted on PASCAL VOC 2012 dataset.

Method             $r_1$   $r_2$   mIoU (%)   Acc (%)
AConv (Baseline)   4       1       75.98      94.80
KConv              4       3       76.70      94.98
AConv              6       1       77.03      94.97
KConv              6       5       78.75      95.36
AConv              8       1       78.14      95.19
KConv              8       5       78.81      95.30
AConv              10      1       78.01      95.17
KConv              10      7       79.50      95.53
AConv              12      1       78.18      95.21
KConv              12      9       79.79      95.53
Table 2: Comparison between Kronecker convolutions (KConv) and atrous convolutions (AConv) on the PASCAL VOC 2012 validation set. $r_1$ is the inter-dilating factor (dilation rate) and $r_2$ the intra-sharing factor.

4.2.1 Ablation Study for Kronecker Convolution

To analyze the effectiveness of Kronecker convolutions, we employ Kronecker convolutions with different inter-dilating factors $r_1$ and intra-sharing factors $r_2$ in ResNet-101 ‘Res5’. Firstly, we analyze the effect of varying the intra-sharing factor $r_2$. As shown in Tab. 1, when we fix the inter-dilating factor at $r_1 = 6$ and change $r_2$ from 1 to 5, the mean IoU improves continuously from 77.03% to 78.75%. Similar results are obtained with a fixed $r_1 = 10$, where the mean IoU increases from 78.01% to 79.71% as $r_2$ increases from 1 to 9. These results show that, as $r_2$ increases, more partial information in the convolutional patches is captured, so that the mean IoU increases steadily. In particular, we observe that with the same increment of $r_2$, the improvement grows rapidly at the beginning and then slows down. In the case of $r_1 = 10$, the mean IoU increases by only 0.21% as $r_2$ goes from 7 to 9. In order to make a trade-off between computational complexity and model accuracy, we keep ( denotes VFR) in the following experiments. Secondly, we present the results of varying the inter-dilating factor $r_1$ in Tab. 2, where are determined by the principle of . We also provide the results of atrous convolutions with the same rates for comparison. As the inter-dilating factor increases from 4 to 12, the mean IoU of Kronecker convolutions improves significantly from 76.70% to 79.79%. Similar results are observed for atrous convolutions, which improve the mean IoU from 75.98% to 78.18%. These results show that both the proposed Kronecker convolutions and atrous convolutions benefit from enlarging the field of view, which means that Kronecker convolutions inherit the advantages of atrous convolutions. Thirdly, we compare the results of Kronecker convolutions and atrous convolutions with the same dilation rates. As shown in Tab. 2, Kronecker convolutions bring improvements of 0.7%, 1.7%, 0.7%, 1.5% and 1.6% respectively for dilation rates ranging from 4 to 12. These results show that Kronecker convolutions are consistently superior to atrous convolutions, since Kronecker convolutions can aggregate the partial detail information neglected by atrous convolutions.
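The per-rate gains of Kronecker over atrous convolutions follow directly from the mIoU values in Tab. 2. The short sketch below recomputes them; the dictionary names are chosen for illustration only.

```python
# mIoU (%) on the PASCAL VOC 2012 validation set, keyed by dilation rate (Tab. 2).
aconv_miou = {4: 75.98, 6: 77.03, 8: 78.14, 10: 78.01, 12: 78.18}
kconv_miou = {4: 76.70, 6: 78.75, 8: 78.81, 10: 79.50, 12: 79.79}

# Absolute mIoU gain of KConv over AConv at each rate, in percentage points.
gains = {rate: round(kconv_miou[rate] - aconv_miou[rate], 2) for rate in aconv_miou}

for rate, gain in gains.items():
    print(rate, gain)
# 4 0.72
# 6 1.72
# 8 0.67
# 10 1.49
# 12 1.61
```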

Method                      mIoU (%)   Acc (%)
Baseline                    75.98      94.80
Baseline + TFA_S            80.18      95.56
Baseline + TFA_L            81.26      95.83
Baseline + KConv            76.70      94.98
Baseline + KConv + TFA_S    81.34      95.96
Baseline + KConv + TFA_L    82.85      96.26
Table 3: Evaluation results of the TFA module on the PASCAL VOC 2012 validation set. KConv: employing Kronecker convolution in ‘Res4’ and ‘Res5’ of the baseline model.

Figure 4: Result illustration of the proposed TKCN on PASCAL VOC 2012 validation set. From left to right: Input image, baseline, prediction and ground-truth (GT).

Method             mIoU (%)
FCN [2]            62.2
GCRF [15]          73.2
Piecewise [29]     75.3
DeepLab [1]        79.7
LC [30]            80.3
RAN-s [31]         80.5
RefineNet [20]     82.4
PSPNet_Ms [24]     82.6
DFN_Ms [25]        82.7
EncNet_Ms [32]     82.9
DeepLabv3+ [33]    89.0
TKCN               82.4
TKCN_Ms            83.2
Table 4: Per-class mean intersection-over-union (IoU) results on the PASCAL VOC 2012 segmentation challenge test set, only using VOC 2012 for training. Ms: employing multi-scale inputs with average fusion during testing.

4.2.2 Ablation Study for TFA Module

We perform experiments to evaluate our proposed TFA module, which employs Kronecker convolutions in all convolutional layers. We adjust the factors of the Kronecker convolutions in the three convolutional layers of the TFA module and compare three schemes: (1) the baseline model of dilated ResNet-101; (2) TFA_S, configured with small factors; and (3) TFA_L, configured with large factors. As shown in Tab. 3, TFA_S achieves a 4.20% improvement over the baseline, while TFA_L, with larger factors, brings a larger improvement of 5.28%. These results show the effectiveness of the TFA module, since hierarchical information can be efficiently aggregated through its tree-shaped structure. Moreover, we implement Kronecker convolutions in the ‘Res4’ and ‘Res5’ layers of the baseline model, denoted as ‘KConv’. As shown in Tab. 3, KConv+TFA_S yields a 5.36% improvement over the baseline and a 1.16% improvement over Baseline + TFA_S, while KConv+TFA_L yields a 6.87% improvement over the baseline and a 1.59% improvement over Baseline + TFA_L. Therefore, Kronecker convolutions and the TFA module can be used together to improve the segmentation accuracy cooperatively. In addition, the proposed TFA module generalizes well, since it brings clear improvements over both KConv and Baseline.

Figure 5: Result illustration of the proposed TKCN on Cityscapes validation set. From left to right: Input image, baseline, prediction with single-scale input (Ss), prediction with multi-scale (Ms) input and ground-truth (GT).

Method               mIoU (%)
CGNet [34]           64.8
FCN [2]              65.3
DeepLab [1]          70.4
LC [30]              71.1
RefineNet [20]       73.6
FoveaNet [35]        74.1
GRLRNet [36]         77.3
SAC_Ms [37]          78.1
PSPNet_Ms [24]       78.4
BiSENet_Ms [38]      78.9
DFN_Ms [25]          79.3
DenseASPP_Ms [39]    80.6
TKCN                 78.9
TKCN_Ms              79.5
Table 5: Per-class mean intersection-over-union (IoU) accuracy on Cityscapes test set, only training with the fine set. Ms: employing multi-scale inputs with average fusion during testing.

Figure 6: Result illustration of the proposed TKCN on PASCAL-Context validation set. From left to right: Input image, baseline, prediction with single-scale input (Ss), prediction with multi-scale (Ms) input and ground-truth (GT).

4.3 Comparison with State-of-the-Arts

In the following, we present the results of TKCN and compare with other state-of-the-art approaches.

4.3.1 Results on PASCAL VOC 2012

We evaluate our proposed TKCN model on the PASCAL VOC 2012 dataset without external data such as the COCO dataset [40]. Tab. 4 compares the results of TKCN with other state-of-the-art methods on the test set. Our TKCN method achieves 83.2% mean IoU (without pre-training on extra datasets). Our approach is lower only than DeepLabv3+ [33], which employs a more powerful network (Xception [41]) as the backbone and is pre-trained on COCO [40] and JFT [42], resulting in about 6% higher mean IoU. Fig. 4 displays some qualitative results of TKCN on the validation set of PASCAL VOC 2012, which show that the proposed TKCN produces more accurate predictions with finer structures than the baseline.




Method               mIoU (%)
FCN [2]              35.1
Context [23]         43.5
DeepLab [1]          45.7
RefineNet_Ms [20]    47.1
PSPNet_Ms [24]       47.8
WRNet_Ms [43]        48.1
CCL_Ms [26]          51.6
EncNet_Ms [32]       51.7
TKCN                 51.1
TKCN_Ms              51.8
Table 6: Comparison with other state-of-the-art methods on the PASCAL-Context dataset. Ms: employing multi-scale inputs with average fusion during testing.

4.3.2 Results on Cityscapes

We report the evaluation results of the proposed TKCN on the Cityscapes test set and compare them with other state-of-the-art methods in Tab. 5, which leads to a conclusion similar to that on the PASCAL VOC 2012 dataset. The proposed TKCN achieves 79.5% mean IoU (training only on finely annotated images), which is slightly lower than the very recent DenseASPP [39], which employs a more powerful network (DenseNet [44]) as the backbone. We visualize some segmentation results on the validation set of Cityscapes in Fig. 5.

4.3.3 Results on PASCAL-Context

Tab. 6 reports the evaluation results of the proposed TKCN on the PASCAL-Context validation set. Our proposed model yields 51.1% mean IoU. Similar to [1], employing multi-scale inputs with average fusion further improves the performance to 51.8%, which outperforms the current state-of-the-art performance. We visualize the prediction results of our model in Fig. 6.
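Multi-scale testing with average fusion can be sketched as follows. This is an assumption-laden illustration rather than the authors' test script: the scale set is a placeholder, nearest-neighbour resizing is used only to keep the example dependency-free, and `toy_score_fn` is a hypothetical stand-in for a trained segmentation network.

```python
import numpy as np

def resize_nearest(x, out_h, out_w):
    """Nearest-neighbour resize of an array shaped (channels, height, width)."""
    h, w = x.shape[-2:]
    rows = np.arange(out_h) * h // out_h
    cols = np.arange(out_w) * w // out_w
    return x[..., rows[:, None], cols[None, :]]

def multi_scale_predict(image, score_fn, scales):
    """Run score_fn on rescaled copies of the image, resize the class scores
    back to the original size, average them and take the arg-max label."""
    _, h, w = image.shape
    total = np.zeros((0,))
    for i, scale in enumerate(scales):
        sh, sw = max(1, round(h * scale)), max(1, round(w * scale))
        scores = score_fn(resize_nearest(image, sh, sw))   # (classes, sh, sw)
        scores = resize_nearest(scores, h, w)              # back to (classes, h, w)
        total = scores if i == 0 else total + scores
    return (total / len(scales)).argmax(axis=0)

def toy_score_fn(image):
    """Hypothetical two-class scorer based on mean brightness."""
    brightness = image.mean(axis=0)
    return np.stack([1.0 - brightness, brightness])

image = np.random.default_rng(0).random((3, 8, 8))
labels = multi_scale_predict(image, toy_score_fn, scales=[0.5, 1.0, 1.5])
print(labels.shape)   # (8, 8)
```

Averaging the scores before the arg-max, rather than voting on per-scale labels, lets confident predictions at one scale outweigh uncertain ones at another.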

5 Conclusions

In this paper, we propose a novel Kronecker convolution for capturing partial information while enlarging the receptive field of filters. Furthermore, based on Kronecker convolutions, we propose the Tree-structured Feature Aggregation module, which can effectively capture hierarchical spatial dependencies and learn representations of multi-scale objects. Ablation studies show the effectiveness of each proposed component. Finally, our Tree-structured Kronecker Convolutional Network achieves state-of-the-art results on the PASCAL VOC 2012, PASCAL-Context and Cityscapes semantic segmentation benchmarks, which demonstrates that our approaches are effective and efficient for high-quality segmentation.

References