Smoothed Dilated Convolutions for Improved Dense Prediction

08/27/2018 ∙ by Zhengyang Wang, et al.

Dilated convolutions, also known as atrous convolutions, have been widely explored in deep convolutional neural networks (DCNNs) for various tasks like semantic image segmentation, object detection, audio generation, video modeling, and machine translation. However, dilated convolutions suffer from gridding artifacts, which hamper the performance of DCNNs with dilated convolutions. In this work, we propose two simple yet effective degridding methods by studying a decomposition of dilated convolutions. Unlike existing models, which explore solutions by focusing on a block of cascaded dilated convolutional layers, our methods address the gridding artifacts by smoothing the dilated convolution itself. By analyzing them in both the original operation and the decomposition views, we further point out that the two degridding approaches are intrinsically related and define separable and shared (SS) operations, which generalize the proposed methods. We evaluate our methods thoroughly on two datasets and visualize the smoothing effect through effective receptive field analysis. Experimental results show that our methods yield significant and consistent improvements in the performance of DCNNs with dilated convolutions, while adding negligible amounts of extra training parameters.

1 Introduction

Dilated convolutions, also known as atrous convolutions, have been widely explored in deep convolutional neural networks (DCNNs) for various tasks, including semantic image segmentation [1, 2, 3, 4, 5, 6, 7, 8, 9, 10], object detection [11, 12, 13, 14], audio generation [15], video modeling [16], and machine translation [17]. The idea of dilated filters was developed in the algorithm à trous for efficient wavelet decomposition in [18] and has been used in image pixel-wise prediction tasks to allow efficient computation [1, 2, 11, 12]. Dilation upsamples convolutional filters by inserting zeros between weights, as illustrated in Figure 1. It enlarges the receptive field, or field of view [5, 6, 8], but does not require training extra parameters in DCNNs. Dilated convolutions can be used in cascade to build multi-layer networks [15, 16, 17]. Another advantage of dilated convolutions is that they do not reduce the spatial resolution of responses. This is a key difference from downsampling layers, such as pooling layers or convolutions with stride larger than one, which also expand the receptive field of subsequent layers but also reduce the spatial resolution. This allows the transfer of classification models trained on ImageNet [19, 20] to semantic image segmentation tasks by removing downsampling layers and applying dilation in convolutions of subsequent layers [21, 3, 4, 5, 6, 7, 8, 9]. Similar to standard convolutions, a layer consisting of a dilated convolution with an activation function is called a dilated convolutional layer.

Fig. 1: 2-D dilated convolutions with a kernel size of $3 \times 3$. Note that when the dilation rate is 1, dilated convolutions are the same as standard convolutions. Dilated convolutions enlarge the receptive field while keeping the spatial resolution.

While DCNNs with dilated convolutions have achieved success in a wide variety of deep learning tasks, it has been observed that dilations result in the so-called “gridding artifacts” [4, 7, 8]. For dilated convolutions with dilation rates larger than one, adjacent units in the output are computed from completely separate sets of units in the input. This results in inconsistency of local information and hampers the performance of DCNNs with dilated convolutions. As dilated convolutional layers are commonly stacked together in cascade in DCNNs, existing models focus on smoothing such gridding artifacts for a block of cascaded dilated convolutional layers. In [4, 8] the gridding problem was alleviated by adding more layers with millions of extra training parameters after the block of dilated convolutions. In [7] the hybrid dilated convolution (HDC) was proposed, which applies different dilation rates without a common factor for consecutive dilated convolutional layers.

In this work, we address the gridding artifacts by smoothing the dilated convolution itself, instead of a block of stacked dilated convolutional layers. Our methods enjoy the unique advantage of being able to replace any single dilated convolutional layer in existing networks as they do not rely on other layers to solve the gridding problem. More importantly, our methods add minimal numbers of extra parameters to the model while some other degridding approaches increase the model parameters dramatically [4, 8]. Our methods are based on an interesting view of the dilated convolutional operation [22, 5, 23], which benefits from a decomposition of the operations. Based on this novel interpretation of dilated convolutions, we propose two simple yet effective methods to smooth the gridding artifacts. By analyzing these two methods in both the original operation and the decomposition views, we further notice that they are intrinsically related and define separable and shared (SS) operations that generalize the proposed methods. Experimental results show that our methods improve current DCNNs with dilated convolutions significantly and consistently, while only adding a few hundred extra parameters. We also employ the effective receptive field (ERF) analysis [24] to visualize the smoothing effect for DCNNs with our dilated convolutions.

Compared to the original conference version of this paper [25], we perform further analysis on SS operations in view of operations on graphs. Based on this analysis, we incorporate deep learning techniques on graphs and propose the SS output layer, which smooths DCNNs with dilated convolutions by only replacing the output layer. In addition, the SS output layer shows a better ability of aggregating information from large receptive fields than original output layers based on dilated convolution. The smoothed DCNNs are able to produce significantly improved dense prediction.

2 Background and Related Work

In this section, we describe dilated convolutions and DCNNs with them. We then discuss the gridding problem and current solutions in detail.

2.1 Dilated Convolutions

In the one-dimensional case, given a 1-D input $f$, the output $o$ at location $i$ of a dilated convolution with a filter $w$ of size $S$ is defined as

$$o[i] = \sum_{s=1}^{S} f[i + r \cdot s]\, w[s], \tag{1}$$

where $r$ is known as the dilation rate. Higher dimensional cases can be easily generalized. When $r = 1$, dilated convolutions correspond to standard convolutions. An intuitive and direct way to understand dilated convolutions is that $r - 1$ zeros are inserted between every two adjacent weights in the standard convolutional filters. Dilated convolutions are also known as atrous convolutions in which “trous” means holes in French. Figure 1 contains an illustration of dilated convolution in the two-dimensional case.
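The definition above can be checked numerically. The following is a minimal NumPy sketch, not the authors' implementation: the helper names `dilated_conv1d` and `insert_zeros` are hypothetical, indexing is zero-based (so the sum runs over s = 0, ..., S-1), and no padding is used. It compares the direct definition with the "insert zeros into the filter" view.

```python
import numpy as np

def dilated_conv1d(f, w, r):
    """Unpadded 1-D dilated convolution: o[i] = sum_s f[i + r*s] * w[s], s = 0..S-1."""
    S = len(w)
    span = r * (S - 1) + 1              # number of input positions spanned by one output
    n_out = len(f) - span + 1
    return np.array([sum(f[i + r * s] * w[s] for s in range(S)) for i in range(n_out)])

def insert_zeros(w, r):
    """Upsample a filter by inserting r - 1 zeros between adjacent weights."""
    upsampled = np.zeros(r * (len(w) - 1) + 1)
    upsampled[::r] = w
    return upsampled

rng = np.random.default_rng(0)
f = rng.normal(size=20)                 # toy 1-D input (assumed length)
w = rng.normal(size=3)                  # toy filter of size S = 3

out_dilated = dilated_conv1d(f, w, r=2)
# np.correlate in "valid" mode slides the filter over f without flipping it.
out_upsampled = np.correlate(f, insert_zeros(w, 2), mode="valid")
```

Under these assumptions the two outputs agree element-wise, and setting `r=1` reduces `dilated_conv1d` to a standard unpadded convolution (cross-correlation).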

As mentioned in Section 1, in most cases, DCNNs use dilated convolutions in cascade, which means several dilated convolutional layers are stacked together. The reasons for using this cascaded pattern differ for different tasks. In the task of semantic image segmentation [21, 3, 4, 5, 6, 7, 8, 9], in order to have output feature maps of larger sizes while maintaining the size of the receptive field, dilated convolutions are employed to replace standard convolutions in layers after the removed downsampling layers. For example, if we treat standard convolutions as dilated convolutions with a dilation rate of $r = 1$, when a downsampling layer with a subsampling rate of 2 is removed, the dilation rates of all subsequent convolutional layers should be multiplied by 2. This results in dilated convolutional layers with dilation rates of $r = 2, 4, 8$, etc. In other tasks, such as audio generation [15], video modeling [16], and machine translation [17], the use of dilated convolutions aims at enlarging the receptive fields of outputs. As pointed out in [3, 15, 16], cascaded dilated convolutional layers expand the receptive field exponentially in the number of layers in DCNNs, as opposed to linearly. In these studies, the dilation rate is doubled for every forward layer, starting from 1 up to a limit before the pattern is repeated.
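As a rough numerical illustration of the exponential versus linear growth described above, the sketch below computes the receptive field of a stack of 1-D convolutional layers. The helper name `receptive_field`, the kernel size of 3, and the depth of six layers are assumptions chosen for the example, not settings taken from the cited models.

```python
def receptive_field(kernel_size, dilation_rates):
    """Receptive field (in input units) of stacked stride-1 dilated convolutions."""
    rf = 1
    for r in dilation_rates:
        rf += (kernel_size - 1) * r     # each layer widens the field by (k - 1) * r
    return rf

num_layers = 6
doubling_rates = [2 ** i for i in range(num_layers)]    # 1, 2, 4, 8, 16, 32
standard_rates = [1] * num_layers                       # standard convolutions

rf_doubling = receptive_field(3, doubling_rates)
rf_standard = receptive_field(3, standard_rates)
print(rf_doubling, rf_standard)
```

With these assumed settings the doubling schedule covers 127 input positions while six standard layers cover 13, using the same number of weights in both cases.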

Note that when using dilated convolutions in cascade, the gridding artifacts affect the models more significantly. This is because the dilation rates of consecutively stacked layers have a common factor of 2 in all of these DCNNs that use dilated convolutional layers in cascade, as discussed in [7] and Section 2.2. Dilated convolutions applied in parallel to form the output layer were explored in [5, 6].

2.2 Gridding in Dilated Convolutions

Fig. 2: An illustration of gridding artifacts. The operations between layers are both dilated convolutions with a kernel size of $3 \times 3$ and a dilation rate of $r = 2$. For four neighboring units in layer $i$ indicated by different colors, we mark their actual receptive fields in layer $i-1$ and $i-2$ using the same color, respectively. Clearly, their actual receptive fields are completely separate sets of units.

Dilated convolutions with dilation rates larger than one produce the so-called gridding artifacts; that is, adjacent units in the output are computed from completely separate sets of units in the input and thus have totally different actual receptive fields. To view the gridding problem clearly, we first look into a single dilated convolution. Considering the second case in Figure 1 as an example, a 2-D dilated convolution with a kernel size of $3 \times 3$ and a dilation rate of $r = 2$ has a $5 \times 5$ receptive field. However, the number of pixels that are actually involved in the computation is only 9 out of 25, which implies that the actual receptive field is still $3 \times 3$, but sparsely distributed. If we further consider the neighboring units in the output, the gridding problem can be seen from Figure 2. Suppose we have two consecutive dilated convolutional layers in cascade, and both dilated convolutions have a kernel size of $3 \times 3$ and a dilation rate of $r = 2$. For four adjacent units indicated by different colors in layer $i$, we show their actual receptive fields in layer $i-1$ and $i-2$ using the same color. We can see that four completely separate sets of units in layer $i-1$ contribute to the computation of the four units in layer $i$. Moreover, since the dilation rates for both layers are 2, which have a common factor of 2, the gridding problem also exists in layer $i-2$. Indeed, whenever the dilation rates of dilated convolutional layers in cascade have a common factor relationship, such as $2, 2, 2$ or $2, 4, 8$, the gridding problem is propagated to all layers, as pointed out in [7]. For a block of such layers, neighboring outputs of the block are computed from totally different sets of inputs. This results in the inconsistency of local information and hampers the performance of DCNNs with dilated convolutions.
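The separation of actual receptive fields can be made concrete with a small set-based sketch. It is an illustration under assumptions: a $3 \times 3$ kernel, a dilation rate of 2, no padding, and outputs indexed so that output position (y, x) reads inputs at (y + rate * a, x + rate * b). The function name `actual_receptive_field` is hypothetical.

```python
from itertools import product

def actual_receptive_field(position, kernel_size, rate, num_layers):
    """Set of input positions that truly contribute to one output unit
    after `num_layers` stacked 2-D dilated convolutions with the same rate."""
    offsets = [rate * s for s in range(kernel_size)]
    current = {position}
    for _ in range(num_layers):
        current = {(y + dy, x + dx)
                   for (y, x) in current
                   for dy, dx in product(offsets, repeat=2)}
    return current

# One layer: only a sparse subset of the bounding box is actually used.
single = actual_receptive_field((0, 0), kernel_size=3, rate=2, num_layers=1)
height = max(y for y, _ in single) - min(y for y, _ in single) + 1

# Two layers: receptive fields of four adjacent output units.
neighbours = list(product(range(2), repeat=2))
fields = [actual_receptive_field(p, 3, 2, 2) for p in neighbours]
disjoint = all(fields[a].isdisjoint(fields[b])
               for a in range(4) for b in range(a + 1, 4))
```

In this toy setting `single` contains 9 positions inside a box of height 5, and the four neighbouring units draw on pairwise disjoint input sets, which is the gridding effect described above. Repeating the experiment with `rate=1` yields overlapping sets.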

The gridding artifacts were observed and addressed in several recent studies for semantic image segmentation [4, 7, 8]. As described in Section 2.1, dilated convolutions are mostly employed in cascade in DCNNs. Therefore, these studies focused on solving the gridding problem in terms of a block of stacked dilated convolutional layers. Specifically, hybrid dilated convolution (HDC) was proposed in [7], which groups several dilated convolutional layers and applies dilation rates without a common factor relationship. For example, for a block of dilated convolutions with a dilation rate of $r = 2$, every three consecutive layers are grouped together and the corresponding dilation rates are changed to $1, 2, 3$ instead of $2, 2, 2$. For a similar block with a dilation rate of $r = 4$, the same grouping principle is applied and the dilation rates become $3, 4, 5$, instead of $4, 4, 4$. When used together with their proposed dense upsampling convolution (DUC), this approach improved DCNNs for semantic image segmentation. This strategy was also adopted as the “multigrid” method in recent work [6]. Prior to [7], the degridding was performed mainly by adding more layers after the block of dilated convolutional layers [4, 8]. It was proposed in [4] to add two more standard convolutional layers without residual connections, while [8] proposed to add a block of dilated convolutional layers with decreasing dilation rates. The main drawback of such methods is the requirement for learning a large number of extra parameters.

3 Smoothed Dilated Convolutions

In this section, we discuss a decomposition view of dilated convolutions. We then propose two approaches for smoothing the gridding artifacts. We also analyze the relationship between the proposed two methods and define separable and shared (SS) operations to generalize them. Based on this analysis, we further propose the SS output layer to perform degridding for the entire network.

Fig. 3: An example of the decomposition of a dilated convolution with a kernel size of $3 \times 3$ and a dilation rate of $r = 2$ on a 2-D input feature map. The decomposition has three steps, namely periodic subsampling, shared standard convolution, and reinterlacing. This example will also be used in Figures 3 to 8.

3.1 A Decomposition View of Dilated Convolutions

There are two ways to understand dilated convolutions. As introduced in Section 2.1, the first and more intuitive way is to think of dilated convolutional filters with dilation rate $r$ as upsampled standard convolutional filters, by inserting $r - 1$ zeros (holes) [12]. Another way to view dilated convolutions is based on a decomposition of the operation [22]. A dilated convolution with a dilation rate of $r$ can be decomposed into three steps. First, the input feature maps are periodically subsampled by a factor of $r$. As a result, the inputs are deinterlaced to $r^d$ groups of feature maps of reduced resolution, where $d$ is the spatial dimension of the inputs. Second, these groups of intermediate feature maps are fed into a standard convolution. This convolution has filters with the same weights as the original dilated convolution after removing all inserted zeros. More importantly, it is shared for all the groups, which means each group of reduced resolution maps goes through the same standard convolution. The third step is to reinterlace the $r^d$ groups of feature maps to the original resolution and produce the outputs of the dilated convolution.

Figure 3 gives an example of the decomposition in the 2-D case. To simplify the discussion, we assume the number of input channels and output channels is both 1. Given a $10 \times 10$ feature map, a dilated convolution with a kernel size of $3 \times 3$ and a dilation rate of $r = 2$ will output a $6 \times 6$ feature map without any padding. In the decomposition of this dilated convolution, the input feature map is periodically subsampled into $2^2 = 4$ groups of $5 \times 5$ feature maps of reduced resolution. Then a shared standard convolution, which has the same weights as the dilated convolution without padding, is applied to these 4 groups of feature maps and obtains 4 groups of $3 \times 3$ feature maps. Finally, they are reinterlaced to the original resolution and produce exactly the same $6 \times 6$ output feature map as the original dilated convolution. This decomposition reduces dilated convolutions into standard convolutions and allows more efficient implementation [1, 11, 5, 23].
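The three-step decomposition can be verified numerically on a single-channel map. The sketch below is a plain NumPy illustration rather than the efficient implementation referred to above; the function names, the $10 \times 10$ input, the $3 \times 3$ kernel, and the dilation rate of 2 are assumptions chosen for the example.

```python
import numpy as np

def conv2d_valid(x, w, rate=1):
    """Unpadded 2-D (dilated) cross-correlation on a single-channel map."""
    k = w.shape[0]
    span = rate * (k - 1) + 1
    H, W = x.shape[0] - span + 1, x.shape[1] - span + 1
    out = np.zeros((H, W))
    for a in range(k):
        for b in range(k):
            out += w[a, b] * x[a * rate:a * rate + H, b * rate:b * rate + W]
    return out

def dilated_via_decomposition(x, w, rate):
    """Dilated convolution computed as subsample -> shared conv -> reinterlace."""
    k = w.shape[0]
    span = rate * (k - 1) + 1
    out = np.zeros((x.shape[0] - span + 1, x.shape[1] - span + 1))
    for p in range(rate):
        for q in range(rate):
            group = x[p::rate, q::rate]                    # 1) periodic subsampling
            out[p::rate, q::rate] = conv2d_valid(group, w) # 2) shared conv, 3) reinterlacing
    return out

rng = np.random.default_rng(0)
x = rng.normal(size=(10, 10))
w = rng.normal(size=(3, 3))

direct = conv2d_valid(x, w, rate=2)
decomposed = dilated_via_decomposition(x, w, rate=2)
```

Here `direct` and `decomposed` both have shape (6, 6) and match element-wise. Note that each of the `rate ** 2` groups is processed in isolation, which is the independence that the degridding methods aim to remove.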

We notice that the decomposition view provides a clear explanation of the gridding artifacts; that is, the groups of intermediate feature maps, either before or after the shared standard convolution, have no dependency among each other and thus collect potentially inconsistent local information. Based on this insight, we overcome gridding by adding dependencies among the groups in different steps of the decomposition. We propose two effective approaches in the next two sections.

Fig. 4: Illustration of the degridding method in Section 3.2 for a dilated convolution with a kernel size of $3 \times 3$ and a dilation rate of $r = 2$ on a 2-D input feature map. By using a group interaction layer before reinterlacing, dependencies among intermediate groups are established. The same gray color denotes consistent local information.

3.2 Smoothed Dilated Convolutions by Group Interaction Layers

Our first degridding method attempts to build dependencies among different groups in the third step of the decomposition. We propose to add a group interaction layer before reinterlacing the intermediate feature maps to the original resolution. For a dilated convolution with a dilation rate of $r$ on $d$-dimensional input feature maps, the second step of the decomposition produces $r^d$ groups of feature maps of reduced resolution, denoted as $\{f_i\}_{i=1}^{r^d}$, after the shared convolution. Note that each $f_i$ represents a group of feature maps, rather than a single feature map. We define a group interaction layer with a weight matrix $W \in \mathbb{R}^{r^d \times r^d}$ given as

$$W = \begin{bmatrix} w_{11} & w_{12} & \cdots & w_{1 r^d} \\ w_{21} & w_{22} & \cdots & w_{2 r^d} \\ \vdots & \vdots & \ddots & \vdots \\ w_{r^d 1} & w_{r^d 2} & \cdots & w_{r^d r^d} \end{bmatrix}. \tag{2}$$

The outputs of this layer are still $r^d$ groups of feature maps, denoted as $\{\hat{f}_i\}_{i=1}^{r^d}$, computed by

$$\hat{f}_i = \sum_{j=1}^{r^d} w_{ij} \cdot f_j, \tag{3}$$

for $i = 1, 2, \ldots, r^d$. Note that the connections of this layer are between groups instead of feature maps. In fact, every $\hat{f}_i$ is a linear combination of $\{f_j\}_{j=1}^{r^d}$, weighted by the weight matrix $W$. Through this layer, each $\hat{f}_i$ collects local information from all $r^d$ groups of feature maps, which adds dependencies among different groups. After the group interaction layer, the $r^d$ groups are reinterlaced to the original resolution and form the final output of the dilated convolutions. The number of extra training parameters in such smoothed dilated convolutions is $r^{2d}$, independent of the number of input and output channels. DCNNs with dilated convolutions are commonly used in one-dimensional or two-dimensional cases, which means $d = 1, 2$. In practice, choices of $r$ are usually $2, 4, 8$. The proposed group interaction layer only requires learning thousands of extra parameters in the worst cases, while the original dilated convolutions usually have millions of training parameters.

We use the same example in Section 3.1 to illustrate the idea in Figure 4. Given the outputs of the second step in the decomposition, the four groups of intermediate feature maps build dependencies among each other through the group interaction layer, whose number of weights is only $4 \times 4 = 16$, represented by connections. We use the gray color to represent feature maps after degridding.
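A minimal sketch of the group interaction layer follows, assuming 2-D inputs, a dilation rate of 2, and groups stored as an array of shape (groups, channels, height, width). The function names, the group ordering `p * rate + q`, and the toy sizes are assumptions for illustration only; the random array merely stands in for the outputs of the shared convolution.

```python
import numpy as np

def group_interaction(groups, weight):
    """new_group[i] = sum_j weight[i, j] * groups[j].
    `weight` has shape (G, G) and is shared across channels and locations."""
    return np.einsum("ij,jchw->ichw", weight, groups)

def reinterlace(groups, rate):
    """Inverse of 2-D periodic subsampling; group index is p * rate + q."""
    _, C, h, w = groups.shape
    out = np.zeros((C, h * rate, w * rate))
    for p in range(rate):
        for q in range(rate):
            out[:, p::rate, q::rate] = groups[p * rate + q]
    return out

rate, channels = 2, 8
rng = np.random.default_rng(0)
groups = rng.normal(size=(rate ** 2, channels, 3, 3))   # stand-in for shared-conv outputs
weight = rng.normal(size=(rate ** 2, rate ** 2))        # the only extra parameters

smoothed = reinterlace(group_interaction(groups, weight), rate)
n_extra_params = weight.size
```

With these settings `n_extra_params` is 16 regardless of `channels`, and choosing `weight = np.eye(rate ** 2)` recovers the unsmoothed dilated convolution output.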

3.3 Smoothed Dilated Convolutions by Separable and Shared Convolutions

Fig. 5: Illustration of the degridding method in Section 3.3 for a dilated convolution with a kernel size of $3 \times 3$ and a dilation rate of $r = 2$ on a 2-D input feature map. By adding the separable and shared convolution, the groups created by periodic subsampling have dependencies among each other. The same gray color represents smoothed feature maps.
Fig. 6: Illustration of the differences between the separable convolution and the proposed SS convolution introduced in Section 3.3. For inputs and outputs of $C$ channels, the separable convolution has $C$ filters in total, with one filter for each channel, while the SS convolution only has one filter shared for all channels.

We further explore an approach to establish dependencies among different groups in the first step of the decomposition; that is, before deinterlacing the input feature maps. Considering a dilated convolution with a dilation rate of $r$ on $d$-dimensional input feature maps, the periodic subsampling during deinterlacing distributes each unit in a local area of size $r^d$ in the inputs to a separate group. Therefore, for units in a particular group, all the neighboring units are in the other independent groups, thereby resulting in local inconsistency. If the local information can be incorporated before periodic subsampling, it is possible to alleviate the gridding artifacts.

In order to achieve this, we propose separable and shared (SS) convolutions, based on separable convolutions [26, 27]. Given inputs of $C$ channels and corresponding outputs of $C$ channels, separable convolutions are the same as standard convolutions, except that separable convolutions handle each channel separately. Standard convolutions connect all channels in inputs to all channels in outputs, leading to $C^2$ different filters. In contrast, separable convolutions only connect the $i$th output channel to the $i$th input channel, yielding only $C$ filters. In the proposed SS convolutions, “shared” means that, based on separable convolutions, the filters are the same and shared by all pairs of input and output channels. For inputs and outputs of $C$ channels, SS convolutions only have one filter scanning all spatial locations and share this filter across all channels. Figure 6 provides a comparison between separable convolutions and SS convolutions. In terms of smoothing dilated convolutions, we apply SS convolutions to incorporate neighboring information for each unit in the input feature maps. Specifically, an SS convolution with a kernel size of $(2r - 1)^d$ is inserted before deinterlacing, thereby adding dependencies among the $r^d$ groups of feature maps produced by periodic subsampling.

The example in Figure 5 illustrates the idea of inserting SS convolutions. Here, the kernel size of the inserted SS convolution is $3 \times 3$. Note that because the inputs only have one channel, SS convolutions, separable convolutions and standard convolutions are equivalent in this example. However, they become different if the inputs have $C > 1$ channels. Importantly, for inputs with multiple channels, the number of training parameters does not change for SS convolutions, as opposed to the other two kinds of convolutions. It means the proposed degridding method has $(2r - 1)^d$ parameters, independent of the number of channels, which corresponds to only tens of extra parameters at most in practice.
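To make the parameter comparison concrete, here is a small NumPy sketch of an SS convolution on a multi-channel 2-D input, together with filter-weight counts for the three kinds of convolutions. It is illustrative only: the names `ss_conv2d` and `param_counts`, the zero "same" padding, the $3 \times 3$ kernel, and the channel count of 256 are assumptions, and bias terms are ignored.

```python
import numpy as np

def ss_conv2d(x, kernel):
    """Separable and shared convolution: one 2-D kernel applied to every
    channel independently. x has shape (C, H, W); odd kernel size, zero padding."""
    k = kernel.shape[0]
    pad = k // 2
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))
    _, H, W = x.shape
    out = np.zeros(x.shape, dtype=float)
    for a in range(k):
        for b in range(k):
            out += kernel[a, b] * xp[:, a:a + H, b:b + W]
    return out

def param_counts(channels, k):
    """Number of filter weights for C input and C output channels, k x k kernels."""
    return {
        "standard": channels * channels * k * k,   # every input to every output channel
        "separable": channels * k * k,             # one filter per channel
        "ss": k * k,                               # one filter shared by all channels
    }

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8, 8))
kernel = rng.normal(size=(3, 3))
smoothed_input = ss_conv2d(x, kernel)
counts = param_counts(channels=256, k=3)
```

After this smoothing step, a unit that periodic subsampling would assign to one group already carries information from neighbours that fall into other groups. The counts also show why the SS variant adds almost no parameters under these assumed sizes.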

3.4 Relationship between the Two Methods

Fig. 7: Another illustration of the proposed method in Section 3.2, corresponding to Figure 4. The method is equivalent to adding an SS block-wise fully-connected layer after the dilated convolution.
Fig. 8: Another illustration of the proposed method in Section 3.3, corresponding to Figure 5. The method is equivalent to inserting an SS convolution before the dilated convolution.
Fig. 9: Illustration of SS operations in view of operations on graphs. Details are provided in Section 3.5. The circle arrow inside a node represents the self-loop.

Both of the proposed approaches are derived from the decomposition view of dilated convolutions. Now we combine all steps and analyze them in view of the original operation. For the second method in Section 3.3, it is straightforward as the separable and shared (SS) convolution is inserted before the first step of decomposition and actually does not affect the original dilated convolution. Consequently, it is equivalent to adding an SS convolution before the dilated convolution, as shown in Figure 8. However, the first method in Section 3.2 performs degridding through the group-wise fully-connected layer between the second and the third steps of the decomposition. To see how to perform the combination, we refer to the example in Figure 4. Before the final step, we have four groups of feature maps and each group has only one feature map. Considering the units in the upper left corner of the four feature maps, without the group interaction layer, these four units form the upper left $2 \times 2$ block of the output feature map after reinterlacing. If we insert the group-wise fully-connected layer, the four new units in the upper left corner become linear combinations of the previous ones and form the upper left $2 \times 2$ block of the output feature map instead. As a result, the new upper left block of the output feature map is computed by a fully-connected operation on the previous one. By examining other units, we find that the fully-connected operation is shared for every non-overlapping $2 \times 2$ block, scanning the output feature map with a stride of 2. Figure 7 provides an illustration. By generalizing this example, we can see that the degridding method is equivalent to a dilated convolution followed by the following operation: use a window of size $r^d$ to scan the output feature map with stride $r$ and obtain non-overlapping blocks; for each block, perform the same fully-connected operation that outputs a block of the same spatial size. Note that if the outputs have multiple channels, the operation is shared across channels. This operation is similar to the SS convolution as they both scan spatial locations using a single kernel shared across all channels. Thus, we name it the SS block-wise fully-connected layer. Based on it as well as the SS convolution, we further define operations which scan spatial locations of inputs using a single filter shared across all channels as SS operations.
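The equivalence argued above can be checked on a single-channel toy example. The sketch below assumes 2-D maps, a dilation rate of 2, and a row-major ordering of groups and of positions inside each block; the names `reinterlace` and `blockwise_fc` are hypothetical, and random arrays stand in for the outputs of the shared convolution.

```python
import numpy as np

def reinterlace(groups, rate):
    """groups: (rate*rate, h, w) -> (h*rate, w*rate); group index is p * rate + q."""
    _, h, w = groups.shape
    out = np.zeros((h * rate, w * rate))
    for p in range(rate):
        for q in range(rate):
            out[p::rate, q::rate] = groups[p * rate + q]
    return out

def blockwise_fc(y, weight, rate):
    """Apply the same fully-connected map to every non-overlapping rate x rate block."""
    H, W = y.shape
    blocks = y.reshape(H // rate, rate, W // rate, rate).transpose(0, 2, 1, 3)
    flat = blocks.reshape(H // rate, W // rate, rate * rate)   # one vector per block
    mixed = flat @ weight.T                                    # shared weights
    mixed = mixed.reshape(H // rate, W // rate, rate, rate).transpose(0, 2, 1, 3)
    return mixed.reshape(H, W)

rate = 2
rng = np.random.default_rng(0)
groups = rng.normal(size=(rate ** 2, 3, 3))        # stand-in for shared-conv outputs
weight = rng.normal(size=(rate ** 2, rate ** 2))

# View 1: mix the groups first, then reinterlace.
view_decomposition = reinterlace(np.einsum("ij,jhw->ihw", weight, groups), rate)
# View 2: reinterlace first (plain dilated conv output), then block-wise FC.
view_original = blockwise_fc(reinterlace(groups, rate), weight, rate)
```

Under these assumptions the two views produce the same $6 \times 6$ map, which matches the statement that the first degridding method amounts to a dilated convolution followed by an SS block-wise fully-connected layer.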

As DCNNs commonly employ dilated convolutional layers in cascade, we also look into our proposed methods in this case. As explained above, the first degridding approach is equivalent to adding an SS block-wise fully-connected layer after the dilated convolution, while the second one corresponds to inserting an SS convolution before the dilated convolution. However, for a block of cascaded dilated convolutional layers with the same dilation rate, the order between the dilated convolution and the SS operation only affects the very first and last layers. As a result, the two proposed degridding methods can be generalized as combining appropriate SS operations with dilated convolutions.

3.5 Separable and Shared Operations

Fig. 10: Illustration of our graph attention mechanism. Details are provided by Equations 6 to 8 in Section 3.6. Here, so that .

With the insights above, we develop more effective SS operations to improve dense prediction models with dilated convolutions. According to the definition in Section 3.4, the key of SS operations is to apply a filter that is shared across all channels. Based on this property, we reinvestigate SS operations in view of operations on graphs. Note that the data that we focus on in this work are grid-like data, such as 1-D text sequences, 2-D images, 3-D videos, etc. For inputs of $C$ channels, each spatial location corresponds to a $C$-dimensional vector. By treating each vector as a node in a graph, the inputs are transformed into a grid-like graph. The left part of Figure 9 provides an illustration of this transformation for 2-D inputs. We first revisit the proposed SS block-wise fully-connected layer and SS convolution on this graph.

A SS block-wise fully-connected layer scans the inputs using a window with a stride of , as illustrated by the red box in Figure 9. To see the computation within the window, we denote the four nodes as as marked in the figure. The filter in this layer is given by

(4)

The outputs of the window are computed by

(5)

for . Here, means multiplying each element of by , which is consistent with sharing across all channels. In terms of operations on graphs, such computation can be interpreted as a process on a directed subgraph composed of the nodes in the scanning window. Every node interacts with each other and produces its new representation, where the interactions are modeled by directed edges. Specifically, the subgraph does not follow the original grid-like connections. Instead, each node has a directed edge to all nodes including itself, as shown by the top right part of Figure 9. Each directed edge represents a scalar weight in . For example, the edge from to corresponds to , measuring the importance of to . In other words, the SS block-wise fully-connected layer forms a fully-connected directed subgraph in each window during scanning.

The SS convolution differs from the SS block-wise fully-connected layer in that it constructs a different subgraph in its scanning window. The bottom right part of Figure 9 illustrates the directed subgraph for an SS convolution. Unlike the SS block-wise fully-connected layer, where all nodes in a window get updated, the SS convolution only updates the representation of the center node by incorporating information from all nodes in the window. Therefore, the subgraph has directed edges from all nodes to the center node. Nodes except for the center node do not have self-loops or edges between each other. Again, each directed edge refers to a scalar weight. There are 25 directed edges corresponding to the filter of the SS convolution.

To conclude, SS operations can be viewed as scanning the transformed graph using a window. Within each window, a directed subgraph is constructed, where each edge represents a scalar weight. Different ways to form the directed graph result in different SS operations. In addition, there are other ways to generate the scalar weights, instead of treating them as trainable parameters. Many studies on deep learning on graphs have explored this direction, such as mixture model networks (MoNet) [28], GraphSAGE [29], graph attention networks (GAT) [30], and learnable graph convolutional networks (LGCN) [31]. In the next section, we incorporate deep learning techniques on graphs and propose an efficient and effective SS output layer, which improves DCNNs with dilated convolutions by simply replacing the output layer.

3.6 Smoothed DCNNs with Dilated Convolutions

The two proposed methods in Sections 3.2 and 3.3 are able to smooth any single dilated convolution. Our experimental results in Section 4 show that the proposed methods improve the encoders of DCNNs with dilated convolutions. However, dilated convolutions are also used in the output layer of these DCNNs. In this section, we explore the use of SS operations to smooth the entire network.

Various output layers have been proposed for DCNNs with dilated convolutions, in order to aggregate information from large receptive fields for prediction. For example, the large field of view (LargeFOV) layer in [5] is a dilated convolution with a kernel size of and a dilation rate of followed by regular convolutions. The LargeFOV layer has been extended to the atrous spatial pyramid pooling (ASPP) layer [5, 6]. In the ASPP layer, four LargeFOV layers with different dilation rates are employed in parallel, and the outputs are summed or concatenated together as the final output. However, both output layers do not have any smoothing operation, thereby inheriting the gridding artifacts from the encoder to the final output, as illustrated by Figure 2 in Section 2.2.

To address this problem, we propose the SS output layer, which improves the performance by simply replacing dilated convolutions in the output layer by an appropriate SS operation. The proposed SS output layer is able to perform both smoothing and information aggregation for prediction. First, in Section 3.4, we conclude that the proposed degridding methods can be generalized as inserting SS operations between consecutive dilated convolutions. However, for DCNNs whose encoders use dilated convolutions in cascade, it may be more efficient to add only one SS operation after the entire encoder, making it a part of the output layer. Second, the analysis in Section 3.5 indicates that SS operations are able to aggregate information within each scanning window. The advantage of SS operations as compared with output layers based on dilated convolutions is that, given the same receptive field, information from all locations will be incorporated, instead of sampled ones. In addition, SS operations usually have much fewer parameters than dilated convolutions, as analyzed in Section 3.3. As a result, using the SS output layer is efficient and effective.

To be specific, we first transform the output feature maps of the encoder to the grid-like graph as shown in the left part of Figure 9. Then, we propose an SS operation that constructs the same subgraph within each window as the SS convolution, which is illustrated by the bottom right part of Figure 9. Differently, we adopt the graph attention mechanism in GAT [30] to generate the scalar weights. Suppose the window size of our SS operation is . There will be directed edges in the subgraph constructed by the scanning window. We denote the starting nodes of these directed edges as neighboring nodes and the center node as . Note that neighboring nodes include the center node. and are -dimensional vectors, where is the number of input channels. For each directed edge, we compute attention coefficients defined as

(6)

for . Here, are shared for each edge and is a hyperparameter. The attention coefficients are normalized across , i.e., all edges:

(7)

where is the generated scalar weight corresponding to the -th directed edge. The output of this window, which is the updated representation of the center node, is computed by

(8)

where is also shared for each edge and is a hyperparameter representing the dimension of . Figure 10 illustrates the graph attention process within a scanning window. Note that if we choose the window size to be larger than the spatial sizes of inputs, the SS operation is able to aggregate global information for prediction. In addition, the number of parameters of the SS operation does not depend on the window size, because are all shared across edges, and a different window size only changes the number of edges.
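The attention step within one scanning window can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes the softmax normalization used in GAT and a weighted sum of transformed neighbors as the output, and the `score` and `transform` callables are hypothetical stand-ins for the shared weights in Equations 6 and 8, whose exact forms are not reproduced here.

```python
import math


def window_attention(center, neighbors, score, transform):
    """Update a center node from the neighboring nodes of one window.

    center, neighbors[i]: feature vectors given as lists of floats.
    score(neighbor, center) -> float: unnormalized attention coefficient
        (hypothetical stand-in for Equation 6).
    transform(vec) -> list of floats: map shared by every edge
        (hypothetical stand-in for the shared weights in Equation 8).
    """
    coeffs = [score(n, center) for n in neighbors]
    # Softmax over all edges in the window (assumed form of Equation 7).
    peak = max(coeffs)
    exps = [math.exp(c - peak) for c in coeffs]
    total = sum(exps)
    alphas = [e / total for e in exps]
    # The weighted sum of transformed neighbors is the updated center node.
    transformed = [transform(n) for n in neighbors]
    dim = len(transformed[0])
    out = [sum(a * t[k] for a, t in zip(alphas, transformed)) for k in range(dim)]
    return out, alphas


# Usage with a dot-product score and an identity transform (both illustrative).
center = [1.0, 0.0]
neighbors = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
dot = lambda n, c: sum(a * b for a, b in zip(n, c))
updated, weights = window_attention(center, neighbors, dot, lambda v: v)
```

Because `score` and `transform` are shared by all edges, enlarging the window only lengthens `neighbors`; no new parameters are introduced.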

In our SS output layer, the proposed SS operation scans the transformed grid-like graph and updates every node. Appropriate padding is employed and is always set to for padding nodes. We also apply the multi-head attention as in GAT [30]. A regular convolution follows the SS operation to produce the final output. The proposed SS output layer is evaluated in Section 4.5.

4 Experimental Studies

In this section, we evaluate our methods on the PASCAL VOC 2012 [32] and Cityscapes [33] datasets. Our proposed approaches result in significant and consistent improvements for DCNNs with dilated convolutions. We also perform the effective receptive field (ERF) analysis [24] to visualize the smoothing effect. Finally, we analyze the effectiveness and efficiency of the proposed separable and shared output layer.

4.1 Basic Setup

To conduct our experiments, we choose the task of semantic image segmentation because the gridding artifacts were mainly observed in studies for this task [4, 7, 8]. The consistency of local information is important for such a pixel-wise prediction task on images. In addition, the smoothing effect is easy to visualize on two-dimensional data.

The baseline model in our experiments is DeepLabv2 [5] with ResNet-101 [20]. It is a fair benchmark for evaluating our smoothed dilated convolutions in three respects. First, it employs dilated convolutions to adapt a ResNet pre-trained on ImageNet [19] from image classification to semantic image segmentation. Most semantic image segmentation models adopt this transfer learning strategy [1, 2, 21, 3, 4, 5, 6, 7, 8, 9], and ResNet is one of the most accurate DCNNs for image classification with pre-trained models available. Second, models that recently achieved state-of-the-art results in segmentation tasks [6, 7, 9] were developed from DeepLabv2. In [9] the output layer was replaced with a pyramid pooling module. [7] also changed the output layer and additionally proposed changing dilation rates, as mentioned in Section 2.2. The current best model [6] followed the suggestions of [7] and also explored going deeper with more dilated convolutional blocks. Third, we intend to compare our degridding methods with existing approaches [4, 8, 7]. While [4, 8] addressed the gridding artifacts by adding more layers that considerably increased the number of training parameters, our methods only require learning hundreds of extra parameters. Thus, we perform the comparison with the idea proposed in [7], which is based on DeepLabv2.

DeepLabv2 is composed of two parts: the encoder and the output layer. The encoder is a pre-trained ResNet-101 model modified with dilated convolutions, and it extracts feature maps from raw images. As introduced in Section 2.1, the last two downsampling layers in ResNet-101 were removed and subsequent standard convolutional layers were replaced by dilated convolutional layers with dilation rates of , respectively. To be specific, after the modification, the last two blocks are a block of stacked dilated convolutional layers with a dilation rate of followed by a block of cascaded dilated convolutions with a dilation rate of . The output layer performs pixel-wise classification by aggregating information from the output feature maps of the encoder.

We re-implement DeepLabv2 in TensorFlow and perform experimental studies based on our implementation. Our code is publicly available at https://github.com/divelab/dilated/. We improve the baseline by addressing the gridding artifacts in the last two blocks of the encoder. To make the comparison independent of the output layer, we conduct experiments with different output layers. In order to eliminate the bias of different datasets, we evaluate our methods on two datasets. All the models are evaluated by pixel intersection over union (IoU) defined as

(9)
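For reference, pixel IoU for a class is conventionally computed as true positives divided by the sum of true positives, false positives, and false negatives. The sketch below follows that standard definition on flattened label lists; the function names are illustrative, and benchmark tooling typically accumulates the counts over the whole evaluation set rather than per image.

```python
def per_class_iou(pred, target, num_classes):
    """Return the IoU of each class for flat lists of predicted and true labels."""
    ious = []
    for c in range(num_classes):
        tp = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        fp = sum(1 for p, t in zip(pred, target) if p == c and t != c)
        fn = sum(1 for p, t in zip(pred, target) if p != c and t == c)
        denom = tp + fp + fn
        # A class absent from both prediction and ground truth has no defined IoU.
        ious.append(tp / denom if denom else float("nan"))
    return ious


def mean_iou(ious):
    """Average the defined (non-NaN) per-class IoU values."""
    valid = [v for v in ious if v == v]
    return sum(valid) / len(valid)


# Toy example with two classes and four pixels.
ious = per_class_iou([0, 0, 1, 1], [0, 1, 1, 1], num_classes=2)
miou = mean_iou(ious)
```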

4.2 PASCAL VOC 2012

Models 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 mIoU
DeepLabv2 93.8 85.9 38.8 84.8 64.3 79.0 93.7 85.5 91.7 34.1 83.0 57.0 86.1 83.0 81.0 85.0 58.2 83.4 48.2 87.2 74.0 75.1
Multigrid 93.6 85.4 38.9 82.2 66.9 76.6 93.2 85.3 90.7 35.7 82.5 53.7 83.1 84.2 82.2 84.6 56.9 84.3 45.6 85.5 73.1 74.5
G Interact 93.7 86.9 39.6 84.1 68.9 76.4 93.8 86.2 91.7 36.1 83.7 55.3 85.7 84.0 82.2 84.9 59.5 85.7 46.5 85.0 73.0 75.4
SS Conv 93.9 86.7 39.5 86.2 68.1 77.3 93.8 86.4 91.5 35.4 83.2 59.0 85.2 83.6 82.4 85.2 57.3 82.1 45.8 86.1 75.2 75.4
TABLE I: Experimental results of models with the ASPP output layer and MS-COCO pre-training on PASCAL VOC 2012 val set. Class 1 is the background class and Classes 2 to 21 represent “aeroplane, bicycle, bird, boat, bottle, bus, car, cat, chair, cow, diningtable, dog, horse, motorbike, person, pottedplant, sheep, sofa, train, tvmonitor”, respectively. This is the same for Tables I to IV and Table VII.
Models 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 mIoU
DeepLabv2 92.9 85.0 38.1 82.8 66.2 76.5 91.1 82.7 88.4 33.8 77.7 49.9 80.7 78.6 77.9 82.0 51.5 76.6 43.1 82.8 66.6 71.7
Multigrid 92.8 84.9 37.4 81.8 65.6 76.0 90.4 81.3 86.9 32.6 76.8 52.3 80.2 79.5 77.4 81.9 50.7 78.4 41.9 82.7 66.0 71.3
G Interact 93.0 85.1 37.4 83.4 66.9 76.6 90.7 82.0 88.1 33.8 81.1 54.3 81.6 80.2 76.7 81.9 53.7 78.7 43.1 83.9 66.4 72.3
SS Conv 93.0 85.8 38.3 82.5 66.3 77.9 91.6 83.5 88.5 32.4 77.8 52.5 81.9 78.1 79.3 82.1 49.8 78.4 44.4 83.0 67.9 72.1
TABLE II: Experimental results of models with the ASPP output layer but no MS-COCO pre-training on PASCAL VOC 2012 val set.
Models 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 mIoU
DeepLabv2 93.7 85.7 39.4 85.9 67.6 79.0 93.1 86.0 90.7 36.2 79.8 54.6 83.7 80.9 81.4 85.0 57.5 83.5 45.5 84.5 74.1 74.7
G Interact 93.8 85.5 40.0 86.5 67.5 78.1 92.9 86.2 90.4 37.2 80.6 56.5 82.6 80.3 81.0 85.0 58.1 84.8 46.6 84.4 74.8 74.9
SS Conv 93.8 85.3 39.7 86.8 68.7 77.9 94.0 86.3 90.8 35.2 83.1 55.4 84.5 83.8 79.6 85.6 59.3 83.2 46.2 86.2 75.5 75.3
TABLE III: Experimental results of models with the LargeFOV output layer and MS-COCO pre-training on PASCAL VOC 2012 val set.
Models 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 mIoU
DeepLabv2 92.8 84.1 37.9 82.9 65.2 76.5 89.9 82.7 87.9 33.2 74.9 50.2 80.6 76.6 78.6 82.1 52.2 77.4 40.8 80.1 66.6 71.1
G Interact 93.0 84.5 37.8 84.2 66.5 75.9 90.5 83.1 88.4 34.6 75.4 52.3 81.7 75.5 77.4 82.1 52.8 78.2 41.5 81.7 67.9 71.7
SS Conv 92.9 85.5 38.1 83.2 66.5 73.1 91.2 84.0 88.3 34.5 75.2 49.9 81.0 77.2 79.5 82.5 53.7 78.6 42.0 80.0 67.7 71.6
TABLE IV: Experimental results of models with the LargeFOV output layer but no MS-COCO pre-training on PASCAL VOC 2012 val set.

The PASCAL VOC 2012 semantic image segmentation dataset [32] provides pixel-wise annotated natural images. It has been split into train, val and test sets with , and images, respectively. The annotations include classes, which are foreground object classes and class for background. An augmented version with extra annotations [34] increases the size of the train set to . In our experiments, we train all the models using the augmented train set and evaluate them on the val set. When reproducing the baseline DeepLabv2, we do not employ multi-scale inputs with max fusion for testing due to our limited GPU memory. We perform no post-processing such as conditional random fields (CRF) [5], which is not related to our goals. Following DeepLabv2, we train the model with randomly cropped patches of size and a batch size of . Data augmentation by randomly scaling the inputs is applied during training. We set the initial learning rate to and adopt the “poly” learning rate policy [35] as

(10)

where , denotes current iteration number, and denotes learning rate, as in [5, 6, 7]. The model is trained for iterations with a momentum of and a weight decay of .
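The “poly” policy is commonly written as the initial learning rate multiplied by (1 - iter / max_iter) raised to a fixed power. The sketch below assumes that common form; the argument names and the numeric values in the usage lines are illustrative and are not the settings used in these experiments.

```python
def poly_lr(base_lr, iteration, max_iter, power):
    """Polynomially decay the learning rate from base_lr at step 0 to 0 at max_iter."""
    return base_lr * (1.0 - iteration / max_iter) ** power


# Illustrative schedule: sample the learning rate at a few steps.
schedule = [poly_lr(0.1, it, 100, 0.9) for it in (0, 25, 50, 75, 100)]
```

The rate decreases monotonically and reaches zero exactly at the final iteration, so the last updates are very small.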

We implement our proposed methods by inserting appropriate separable and shared (SS) operations before or after each dilated convolution as shown in Figures 7 and 8. An important step is to change the initial learning rate, as detailed in each experiment. To make the comparisons solid, we also train the baseline with different initial learning rates and observe that the original setting of yields the best performance. SS operations are initialized as identity operations. Specifically, for a group interaction layer with a dilation rate of , the initial filter is

(11)

while for an SS convolution with a dilation rate of , it is

(12)
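Identity initialization can be illustrated with a small sketch. It assumes that an identity matrix serves as the initial weight of a fully-connected style operation and that a kernel with a single 1 at its center serves as the initial weight of a convolution; the sizes used below are arbitrary and are not taken from Equations 11 and 12.

```python
def identity_matrix(n):
    """n x n identity: a fully-connected map that returns its input unchanged."""
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]


def delta_kernel(size):
    """Odd-sized square kernel with a single 1 at the center and 0 elsewhere."""
    assert size % 2 == 1, "an odd size is needed to have a center element"
    kernel = [[0.0] * size for _ in range(size)]
    kernel[size // 2][size // 2] = 1.0
    return kernel


def conv2d_same(image, kernel):
    """Zero-padded 2D cross-correlation that preserves the spatial size."""
    h, w, k = len(image), len(image[0]), len(kernel)
    pad = k // 2
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            acc = 0.0
            for a in range(k):
                for b in range(k):
                    y, x = i + a - pad, j + b - pad
                    if 0 <= y < h and 0 <= x < w:
                        acc += kernel[a][b] * image[y][x]
            out[i][j] = acc
    return out


image = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
unchanged = conv2d_same(image, delta_kernel(3))
```

Starting from an identity operation means the modified network initially computes the same function as the pre-trained baseline, and training then moves the inserted weights away from identity.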

The original DeepLabv2 used pre-training on MS-COCO [36], which provides more training data and yields higher performance. Our experiments are conducted under both settings, namely with and without MS-COCO pre-training. The results are given in Tables I and II, respectively. In the tables, “G Interact” denotes the degridding method with a group interaction layer, i.e., adding an SS block-wise fully-connected layer after the dilated convolution, and “SS Conv” represents the one with an SS convolution inserted before the dilated convolution. In the experiments with MS-COCO pre-training, the initial learning rates for “G Interact” and “SS Conv” are both . Otherwise, they are set to and , respectively. Both proposed methods improve the IoU for most classes as well as the mean IoU (mIoU) over the baseline under both settings. It is worth noting that “G Interact” only requires training extra parameters and “SS Conv” requires extra parameters, which are negligible compared to the total number of parameters in the models.

Models 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 mIoU
DeepLabv2 97.2 79.7 90.1 47.4 49.2 50.3 57.3 69.0 90.6 59.8 92.8 75.9 55.6 92.5 67.5 80.5 64.8 59.7 71.7 71.1
G Interact 97.3 79.6 90.2 50.4 49.9 50.5 58.5 69.1 90.5 58.7 92.7 75.9 55.4 92.5 70.9 80.2 65.0 60.6 71.8 71.6
SS Conv 97.2 79.7 90.3 51.1 50.5 50.2 58.1 69.3 90.5 60.0 92.7 76.1 55.9 92.7 72.7 81.9 66.0 59.7 71.8 71.9
TABLE V: Experimental results of models with the ASPP output layer and MS-COCO pre-training on Cityscapes val set. Classes 1 to 19 represent “road, sidewalk, building, wall, fence, pole, traffic light, traffic sign, vegetation, terrain, sky, person, rider, car, truck, bus, train, motorcycle, bicycle”, respectively. This is the same for Tables V and VI.
Models 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 mIoU
DeepLabv2 97.0 77.9 89.4 44.6 48.6 48.7 54.1 66.7 90.3 58.0 92.5 73.9 51.9 91.6 59.9 75.5 60.5 56.3 69.6 68.8
G Interact 97.1 78.7 89.6 44.7 49.2 48.6 54.2 67.0 90.3 57.6 92.1 74.3 52.2 91.7 59.0 77.1 60.5 56.8 70.1 69.0
SS Conv 97.0 78.3 89.6 45.2 49.4 48.9 54.6 66.5 90.2 57.1 92.0 74.1 52.1 91.8 59.5 76.8 63.5 58.8 69.7 69.2
TABLE VI: Experimental results of models with the ASPP output layer but no MS-COCO pre-training on Cityscapes val set.

We also compare our methods with the existing degridding method proposed in [7] and used in [6] as the “multigrid” method. As introduced in Section 2.2, the idea is to group several dilated convolutional layers and change their dilation factors. For the modified ResNet-101 with dilated convolutions, the last two blocks are a block of stacked dilated convolutional layers with a dilation rate of followed by a block of cascaded dilated convolutions with a dilation rate of . For the first block, we group every layers together and replace the dilation rates from to . We keep for the remaining layers. For the second block, the dilation factors are changed to . We make the modification and train the models under the same setting as the baseline. The results, denoted as “Multigrid”, are shown in the second rows of Tables I and II. Surprisingly, in our implementation the approach does not improve the performance: the mIoU drops from 75.1 to 74.5 with MS-COCO pre-training and from 71.7 to 71.3 without it. A possible explanation is that the method needs to be applied together with other modifications, as both [7] and [6] conduct experiments together with other changes over DeepLabv2, such as dense upsampling convolution (DUC) and deeper encoders.

As we address the gridding artifacts in the last two blocks of the encoder, we also run experiments with different output layers in order to make the comparisons independent of the output layer. We replace the original atrous spatial pyramid pooling (ASPP) output layer of DeepLabv2 by the large field of view (LargeFOV) layer, which was applied earlier in [5]. We train the models with the same settings as above, with and without MS-COCO pre-training, and show the results in Tables III and IV, respectively. Again, both proposed degridding methods improve over the baseline: the mIoU rises from 74.7 to 74.9 (“G Interact”) and 75.3 (“SS Conv”) with MS-COCO pre-training, and from 71.1 to 71.7 and 71.6 without it.

4.3 Cityscapes

We further compare our proposed methods on the Cityscapes dataset [33]. Cityscapes collects images of street scenes from different cities and provides high quality pixel-wise annotations of classes. The images are divided into train, val and test with , and images, respectively. Again, we train models on the train set and perform evaluation on the val set. The training batch size is , where each batch contains randomly cropped patches of size . The initial learning rates for all models are set to . All the other settings are the same as those in Section 4.2.

Experiments are again conducted under both settings, i.e., with and without MS-COCO pre-training, and the results are given in Tables V and VI, respectively. Both of the proposed methods increase the mIoU over the baseline, which indicates that the improvements are not specific to one dataset.

4.4 Effective Receptive Field Analysis

Since we are addressing the gridding artifacts, we perform the effective receptive field (ERF) analysis [24, 8] to visualize the smoothing effect of our methods. These experiments further verify that the improvements of the proposed methods come from degridding. Given a block in DCNNs, the ERF analysis characterizes how much each unit in the input of the block actually affects a particular output unit of the block, as measured through partial derivatives [24], rather than relying on the theoretical receptive field.

Fig. 11: ERF visualization for the single dilated convolution with a kernel size of and a dilation factor of . Black pixels represent zero weights.
Fig. 12: ERF visualization for the entire dilated convolutional block. Note that only the leftmost map has black pixels that represent zero weights.
Models 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 mIoU
DeepLabv2 (LargeFOV) 93.7 85.7 39.4 85.9 67.6 79.0 93.1 86.0 90.7 36.2 79.8 54.6 83.7 80.9 81.4 85.0 57.5 83.5 45.5 84.5 74.1 74.7
DeepLabv2 (ASPP) 93.8 85.9 38.8 84.8 64.3 79.0 93.7 85.5 91.7 34.1 83.0 57.0 86.1 83.0 81.0 85.0 58.2 83.4 48.2 87.2 74.0 75.1
SS Output (15) 94.1 86.5 39.0 86.2 65.9 80.3 93.8 87.4 90.7 36.0 82.1 59.2 84.2 80.8 81.2 85.5 58.3 84.1 48.1 87.4 74.1 75.5
SS Output (20) 94.1 86.7 40.2 86.9 66.2 80.5 94.5 87.9 91.7 36.1 83.6 60.0 86.2 83.9 82.5 85.6 58.9 84.0 48.9 88.7 74.3 76.3
SS Output (30) 94.2 87.9 40.9 87.3 66.1 79.7 94.8 87.9 92.9 36.5 84.9 59.6 88.3 86.8 83.9 85.7 59.9 84.2 50.0 89.5 74.9 77.0
SS Output Global 94.4 89.2 40.6 84.9 69.7 78.9 94.7 86.8 93.2 38.1 89.9 59.4 87.8 87.3 82.6 85.7 59.4 89.5 52.4 90.2 74.7 77.6
TABLE VII: Experimental results of models with the proposed SS output layer in Section 3.6 and MS-COCO pre-training on PASCAL VOC 2012 val set. “SS Output ()” denotes the model using an SS output layer with a window size of . “SS Output Global” means the window size is chosen to be larger than the spatial sizes of inputs.
Models #Parameters
LargeFOV 9,437,696
ASPP 37,750,784
SS Output 3,409,920
TABLE VIII: Comparison of the number of training parameters between different output layers. The numbers of channels in inputs and outputs are set to and , respectively. Note that the SS output layer has the same number of parameters when changing the window size.

Following the steps in [24, 8], we analyze the models on PASCAL VOC 2012, with the ASPP output layer and MS-COCO pre-training. We compute the ERF for chosen blocks of the baseline and both of the proposed methods. Specifically, suppose the input and output feature maps of a block are and , respectively. The spatial locations of the feature maps are indexed by with representing the center. The ERF is measured by the partial derivative . To compute it without an explicit loss function, we set the error gradient with respect to to while for with or , we set it to . Then the error gradient can be back-propagated to and the error gradient with respect to equals [24]. However, the results are input-dependent, so are computed for all images in the val set and their absolute values are averaged. Finally, we sum the values over all channels of to get a visualization of the ERF.
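The idea behind this measurement can be illustrated on a toy one-dimensional, purely linear block. For a linear block without bias, the derivative of the center output with respect to an input unit equals the center output produced by a unit impulse at that input, so no backpropagation is needed. The kernels, dilation rate, and signal length below are illustrative assumptions; in a real network the block is nonlinear, which is why the paper averages over images.

```python
def conv1d_same(x, kernel, dilation=1):
    """Zero-padded 1D cross-correlation with dilation, preserving the length."""
    n, k = len(x), len(kernel)
    half = (k // 2) * dilation
    out = []
    for i in range(n):
        acc = 0.0
        for a in range(k):
            idx = i + a * dilation - half
            if 0 <= idx < n:
                acc += kernel[a] * x[idx]
        out.append(acc)
    return out


def erf_of_center(block, length):
    """Absolute derivative of the center output w.r.t. each input of a linear block."""
    center = length // 2
    erf = []
    for i in range(length):
        impulse = [0.0] * length
        impulse[i] = 1.0
        erf.append(abs(block(impulse)[center]))
    return erf


def dilated(x):
    # Dilated convolution alone: samples every second input location.
    return conv1d_same(x, [1.0, 1.0, 1.0], dilation=2)


def smoothed(x):
    # A small dense convolution before the dilated one, in the spirit of "SS Conv".
    return dilated(conv1d_same(x, [1.0, 1.0, 1.0], dilation=1))


erf_dilated = erf_of_center(dilated, 11)
erf_smoothed = erf_of_center(smoothed, 11)
```

In this toy setting `erf_dilated` has zeros between its non-zero entries, which is the gridding pattern, while `erf_smoothed` is non-zero over a contiguous and slightly wider range.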

In our experiments, we choose two blocks of the DCNNs to visualize the smoothing effect and enlarge the spatial size of visualizations ten times for display. The first block is the very last layer of the encoder, which is a dilated convolution with a kernel size of and a dilation rate of . The ERF analysis results are presented in Figure 11. The ERF of the original dilated convolution in the baseline shows a clear grid pattern. It corresponds to a filter with zeros inserted between non-zero weights. Such a filter results in the gridding problem. For our proposed degridding methods, we can see that they smooth the ERF and thus perform degridding. In addition, both methods expand the rectangular size of the ERF due to the SS operations. The second chosen block is the entire block composed of dilated convolutional layers, which includes the last two blocks of the encoder. Figure 12 shows the ERF visualization. The gridding artifacts are clearly smoothed in both proposed methods. In fact, only the leftmost visualization for the baseline has black pixels that represent zero weights. We note that “SS FC”, the method with the SS block-wise fully-connected (group interaction) layer, still has a grid-like visualization. One reason is that the block-wise operation may result in larger grids in terms of blocks. Nevertheless, it alleviates the inconsistency of pixel-wise local information and improves DCNNs with dilated convolutions.

4.5 Separable and Shared Output Layer

We evaluate the proposed separable and shared (SS) output layer in Section 3.6 by only replacing the output layer of DeepLabv2. In our experiments, we set and to in Equations 6 to 8 and use heads in the graph attention mechanism. Different window sizes of the SS output layer are explored. Table VII compares the original DeepLabv2 with models using SS output layers. The SS output layer improves the mIoU over both baselines (74.7 with LargeFOV and 75.1 with ASPP) for every window size tested. Moreover, the improvement grows with the window size, from 75.5 and 76.3 to 77.0 and finally 77.6 for the global window, which indicates the importance of aggregating global information for prediction.

In order to show the efficiency of our SS output layer, we also compare the number of training parameters between different output layers in Table VIII. The number of input channels to the output layer is set to . To be fair, we only compute the number of parameters of the operations before the regular convolutions and set the number of output channels of the operations to . In this case, the LargeFOV output layer has a single dilated convolution and the ASPP output layer has four dilated convolutions, while the SS output layer contains the proposed SS operation in Section 3.6 with . Note that the SS output layer has the same number of parameters when changing the window size. According to Table VIII, the proposed SS output layer requires far fewer training parameters than output layers based on dilated convolutions.
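The LargeFOV and ASPP entries of Table VIII can be checked with simple arithmetic. The sketch below assumes 3×3 kernels, 2048 input channels, 512 output channels, and one bias per output channel. These values are an assumption chosen because they reproduce the table entries exactly; they are not stated in the surviving text. The SS output layer count is not reconstructed here.

```python
def conv_params(kernel_h, kernel_w, c_in, c_out, bias=True):
    """Trainable parameters of a 2D convolution; dilation adds no parameters."""
    return kernel_h * kernel_w * c_in * c_out + (c_out if bias else 0)


# Assumed configuration (see the note above): 3x3 kernels, 2048 -> 512 channels.
large_fov = conv_params(3, 3, 2048, 512)  # one dilated convolution
aspp = 4 * large_fov                      # four parallel dilated convolutions
```

Under these assumptions `large_fov` equals 9,437,696 and `aspp` equals 37,750,784, matching the table.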

5 Conclusions

In this work, we propose two simple yet effective degridding methods based on a decomposition of dilated convolutions. The proposed methods differ from existing degridding approaches in two aspects. First, we address the gridding artifacts in terms of a single dilated convolution operation instead of multiple layers in cascade. Second, our methods only require learning a negligible amount of extra parameters. Experimental results show that they improve DCNNs with dilated convolutions significantly and consistently. The smoothing effect is also visualized in the effective receptive field (ERF) analysis. Through further analysis, we relate both proposed methods together and define the separable and shared (SS) operations. The newly defined SS operation is a general neural network operation and may result in a general degridding strategy. We explore this direction in this updated version and propose the SS output layer, which is able to smooth the entire network by only replacing the output layer and obtain improved dense prediction. To conclude, our proposed degridding methods based on SS operations are efficient and effective.

Acknowledgments

This work was supported in part by National Science Foundation grant IIS-1633359 and Defense Advanced Research Projects Agency grant N66001-17-2-4031.

References