Dilated Convolutions with Lateral Inhibitions for Semantic Image Segmentation

06/05/2020
by   Yujiang Wang, et al.
Imperial College London
UCL

Dilated convolutions are widely used in deep semantic segmentation models as they can enlarge the filters' receptive field without adding weights or sacrificing spatial resolution. However, as dilated convolutional filters do not possess positional knowledge about the pixels on semantically meaningful contours, they can lead to ambiguous predictions on object boundaries. In addition, although dilating the filter can expand its receptive field, the total number of sampled pixels remains unchanged and usually comprises only a small fraction of the receptive field's total area. Inspired by the Lateral Inhibition (LI) mechanisms in human visual systems, we propose the dilated convolution with lateral inhibitions (LI-Convs) to overcome these limitations. Introducing LI mechanisms improves the convolutional filter's sensitivity to semantic object boundaries. Moreover, since LI-Convs also implicitly take the pixels from the laterally inhibited zones into consideration, they can also extract features at a denser scale. By integrating LI-Convs into the Deeplabv3+ architecture, we propose the Lateral Inhibited Atrous Spatial Pyramid Pooling (LI-ASPP) and the Lateral Inhibited MobileNet-V2 (LI-MNV2). Experimental results on three benchmark datasets (PASCAL VOC 2012, CelebAMask-HQ and ADE20K) show that our LI-based segmentation models outperform the baseline on all of them, thus verifying the effectiveness and generality of the proposed LI-Convs.


1 Introduction

Since the introduction of the pioneering Fully Convolutional Networks (FCN) long2015fully, deep Convolutional Neural Networks (CNNs) deeplabv3plus2018; zhao2017pyramid; liu2019auto; zhang2018context; huang2019ccnet have made impressive progress in semantic image segmentation, a task that performs per-pixel classification. In deep CNN models, a series of convolutions and spatial poolings is applied to obtain progressively more abstract and more representative feature descriptors with decreasing resolutions. As a consequence, the deepest features can have a significantly lower resolution than the original image (e.g. only a small fraction of the input size in FCN long2015fully), so it is difficult to decode these features into a segmentation map of the same size as the input image without losing details. This is a crucial challenge in the semantic segmentation task.

Dilated convolutions holschneider1990real, which were first applied to the semantic segmentation task by yu2015multi; chen2014semantic, can effectively overcome such difficulties and are thus widely employed in state-of-the-art segmentation methods liu2019auto; deeplabv3plus2018; yang2018denseaspp; chen2018searching; wang2018understanding. By inserting zeros (dilation) into the convolutional filters, dilated convolutions can observe features from larger areas without increasing the number of kernel parameters, which is important for the extraction of global semantic features. Besides, they can also produce feature maps that are invariant to input resolutions. In practice, dilated convolutions can be utilised to retain the resolution of the feature maps when encoding representations in the backbone network yu2017dilated; wang2018understanding, typically by replacing certain convolutional layers with dilated ones. They can also be employed during the decoding stage to generate more robust semantic labels, e.g. the Atrous Spatial Pyramid Pooling (ASPP) chen2017rethinking; chen2018deeplab adopts three parallel dilated convolutions with different dilation rates to aggregate multi-scale contextual information.

Despite their broad applications, dilated convolutions still have several limitations. The pixels around semantically meaningful contours separate different objects and possess stronger semantic information. In dilated convolutions, however, the importance of those pixels is not explicitly accentuated, and therefore such positional significance has to be implicitly learnt. This can lead to ambiguous and misleading boundary labels. Various approaches have been proposed to compensate for such problems and to refine the contour predictions, including Conditional Random Fields (CRF) chen2018deeplab; chandra2016fast and the decoder component in Deeplabv3+ deeplabv3plus2018. However, the sensitivity of dilated convolutions to semantically meaningful edges still leaves room for improvement.

Additionally, although the receptive field of dilated filters is enlarged, the total number of sampled pixels stays the same and comprises only a small fraction of the pixels in the area. This sparse sampling can impair the potential for dense prediction tasks like semantic segmentation. Similar concerns were addressed in shen2018gaussian; wang2018understanding; dai2017deformable; yan2014image, and the proposed improvements include a denser Gaussian sampling process shen2018gaussian, a hybrid dilated convolution module wang2018understanding and deformable convolutional filters dai2017deformable.
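To make the sparsity concrete, the following sketch (an illustration, not code from the paper) computes the fraction of the receptive field that a dilated square filter actually samples. The function name `sampled_fraction` and the listed dilation rates are assumptions chosen for illustration only.

```python
def sampled_fraction(kernel_size: int, dilation: int) -> float:
    """Fraction of receptive-field pixels that a dilated square filter samples."""
    # A k x k filter with dilation rate d spans d * (k - 1) + 1 pixels per side.
    span = dilation * (kernel_size - 1) + 1
    return (kernel_size ** 2) / (span ** 2)


# Illustrative (assumed) dilation rates for a 3 x 3 filter.
for rate in (1, 6, 12, 18):
    print(rate, round(sampled_fraction(3, rate), 4))
```

For instance, a 3 x 3 filter with dilation rate 6 covers a 13 x 13 window but samples only 9 of its 169 pixels, i.e. roughly 5%.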

In this paper, we propose to overcome the drawbacks of dilated convolutions from a biologically-inspired perspective, namely by leveraging the Lateral Inhibition (LI) mechanism in the human visual system. Lateral inhibition hartline1956inhibition; rizzolatti1975inhibition; von2017sensory is a neurobiological phenomenon in which a neuron’s excitation to a stimulus can be suppressed by the activation of its surrounding neurons. Because of the LI mechanism, our retina cells are sensitive to spatially varying stimuli such as the semantic borderlines between objects, which is crucial to the inborn segmentation abilities of our eyes. See Fig. 1 (Left) for an intuitive illustration of the LI mechanism.
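As a concrete toy illustration of this mechanism (a sketch under assumed settings, not the configuration used in Fig. 1), consider a one-dimensional row of neurons in which each response is reduced by a fixed fraction of its two neighbours' responses. The function name `lateral_inhibition_1d`, the intensity of 0.1 and the replicated borders are assumptions made for illustration.

```python
def lateral_inhibition_1d(signal, intensity):
    """Subtract `intensity` times the sum of the two neighbours from each value."""
    out = []
    for i, x in enumerate(signal):
        # Assumption: border values are replicated so every neuron has two neighbours.
        left = signal[max(i - 1, 0)]
        right = signal[min(i + 1, len(signal) - 1)]
        out.append(x - intensity * (left + right))
    return out


stimulus = [1.0, 1.0, 1.0, 3.0, 3.0, 3.0]  # a step edge between two regions
inhibited = lateral_inhibition_1d(stimulus, 0.1)

# Contrast across the edge, measured as the ratio of the two central responses.
contrast_before = stimulus[3] / stimulus[2]
contrast_after = inhibited[3] / inhibited[2]
print(inhibited, contrast_before, contrast_after)
```

In this toy setting the ratio between the two responses on either side of the edge rises from 3 to roughly 4.3, and each of the two edge neurons now deviates from its same-side neighbours, which is the contour-enhancing effect described above.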

Motivated by such observations, we propose a dilated convolution with lateral inhibitions (LI-Convs) to enhance the convolutional filter’s sensitivity to semantic contours. The LI-Convs also sample the receptive window in a denser fashion by implicitly making inferences on pixels within the laterally inhibited zones. To evaluate LI-Convs, we follow the Deeplabv3+ deeplabv3plus2018 segmentation model and present two LI-based variants: (1) the Lateral Inhibited Atrous Spatial Pyramid Pooling (LI-ASPP) for decoding semantic features, and (2) the Lateral Inhibited MobileNet-V2 (LI-MNV2) as the backbone network for encoding features. The performance of LI-ASPP and LI-MNV2 surpasses the baseline on three segmentation benchmark datasets: PASCAL VOC 2012 everingham2015pascal, CelebAMask-HQ CelebAMask-HQ and ADE20K zhou2017scene, which verifies the effectiveness and generality of the proposed LI-Convs.

2 Related Works

Semantic Image Segmentation    Fully Convolutional Networks (FCN) long2015fully is the pioneering work on using deep models for semantic segmentation. The fully connected layers in deep image classification models are replaced with convolutional ones to produce semantic heat maps for segmentation predictions. The resolution of such heat maps is typically much smaller than that of the input image, and various works have been proposed to compensate for the information loss when decoding such features, including de-convolutional layers noh2015learning; ronneberger2015u; peng2017large, skip-connections of low-level features badrinarayanan2017segnet; hariharan2015hypercolumns and dilated convolutions yu2015multi; chen2017rethinking; yang2018denseaspp; liu2019auto; chen2018searching. Yu et al. yu2015multi stack dilated convolutional layers with different dilation rates in a cascaded manner, leading to a context module for aggregating multi-scale contextual information. Deeplabv3 chen2017rethinking builds an Atrous Spatial Pyramid Pooling (ASPP) module consisting of three parallel dilated convolutions, one 1×1 convolution and one image-level pooling, and it also employs dilated convolutions in the backbone network. DenseASPP yang2018denseaspp introduces dense connections into the ASPP module, while Neural Architecture Search zoph2016neural is utilised by chen2018searching to search for an optimal decoding structure for organising dilated convolution layers. For other segmentation practices rahman2016optimizing; wang2019face; luo2020shape; wang2019dynamic, readers are referred to minaee2020image for more details.

Dilated Convolutions    Dilated convolutions, also known as atrous convolutions, were first introduced by Holschneider et al. holschneider1990real in signal analysis and have broad applications such as object detection li2017pedestrian; nguyen2019lightweight, lip-reading xu2020discriminative; martinez2020lipreading and optical flow zhu2018learning; sun2018pwc. They were first applied to semantic segmentation by the authors of yu2015multi; chen2014semantic to enlarge the filters’ receptive fields without sacrificing the spatial resolution. Conditional Random Fields (CRF) are used in chandra2016fast; chen2018deeplab as a post-processing procedure to refine ambiguous semantic contour predictions. Similar ideas can be found in Deeplabv3+ deeplabv3plus2018, which designs a decoding module that incorporates low-level backbone features to improve the quality of contour pixels. Deformable convolutions dai2017deformable introduce offsets into the sampling grids of filters to better model spatial relationships. Gaussian kernels are adopted by shen2018gaussian to sample pixels over a wider range in dilated convolutions. Wang et al. wang2018understanding observe the gridding effects caused by the fixed sampling locations in dilated kernels and propose a hybrid dilated convolution with different dilation rates. Different from those approaches, we employ the lateral inhibition (LI) mechanisms hartline1956inhibition to enhance the dilated convolutions’ sensitivity to semantically meaningful contours and to implicitly sample features in a denser fashion.

Lateral Inhibitions    The study on the eyes of the horseshoe crab (Limulus) performed by Hartline et al. hartline1956inhibition reveals the lateral inhibition (LI) effects in visual systems, where the excitation of neighbouring neurons can suppress a cell’s response to the stimuli. Although lateral inhibition is mainly studied in the field of neuroscience roska2000three; sun2004orientation; rizzolatti1975inhibition, the computer vision community has also shown interest in this mechanism. A recurrent neural network with lateral inhibitions is studied in mao2007dynamics, where it is shown that LI can improve the robustness and efficiency of the network. The authors of fernandes2013lateral introduce LI into a shallow CNN to improve image classification. Similar ideas can be found in the work on colour video segmentation fernandez2014color. Those network architectures are too shallow to be useful for recent methods using deep backbones like MobileNet-V2 (MNV2) mobilenetv22018 or ResNet he2016deep. Recently, the authors of cao2018lateral employ LI in the VGG model simonyan2014very to improve the performance on saliency detection. However, none of the previous works has evaluated LI’s potential for semantic segmentation, and their methods of integrating LI do not touch the core mechanisms in deep CNNs such as the convolutional operations. In this work, by contrast, lateral inhibitions work closely with the convolutional filters to fundamentally augment the model’s segmentation power.

3 Dilated Convolutions with Lateral Inhibitions

3.1 Definition

Define $\Omega_r = [-r, r]^2 \cap \mathbb{Z}^2$ where $r \in \mathbb{Z}^+$, and let a discrete function $k: \Omega_r \to \mathbb{R}$ represent a convolutional filter of size $(2r+1)^2$. Define another discrete function $F: \mathbb{Z}^2 \to \mathbb{R}$ representing features of arbitrary sizes. Let $l$ be the dilation rate; a dilated convolutional operator $*_l$ is written as

$$(F *_l k)(p) = \sum_{s + l t = p} F(s)\, k(t) \qquad (1)$$

where $t \in \Omega_r$. Note that $*_l$ turns into a regular convolutional operator when $l = 1$, i.e. no dilation is inserted.

With the introduction of lateral inhibitions (LI), the activation of each sampled pixel, i.e. $F(s)$ in Eq. 1, is suppressed by its neighbours within a certain range. Let the lateral inhibitions come from a square region of size $(2e+1)^2$ centred on $s$ where $e \in \mathbb{Z}^+$, and refer to this region as the lateral inhibition zone (the LI zone). Define $\Omega_e = [-e, e]^2 \cap \mathbb{Z}^2$ and let $v: \Omega_e \to \mathbb{R}$ be a discrete function describing the spatially-varying inhibition intensities in the LI zones; the total inhibition received by a sampled pixel $F(s)$ can be described as $\sum_{q} v(q)\, F(s + q)$ where $q \in \Omega_e \setminus \{0\}$. Consequently, a dilated convolutional operator with lateral inhibition (LI-Convs) can be defined as:

$$(F *_l^{\mathrm{LI}} k)(p) = \sum_{s + l t = p} \Big( F(s) - \sum_{q} v(q)\, F(s + q) \Big)\, k(t) \qquad (2)$$

Note that Eq. 2 is essentially an extension of Eq. 1 with the introduction of LI terms. Fig. 1 (Middle, Right) provides an intuitive comparison between dilated convolution and the proposed LI-Convs.

Figure 1: Left: A toy example illustrating the lateral inhibition mechanism with a fixed LI intensity. The difference between the two neurons at the centre (representing a semantic contour) becomes more significant after LI.   Middle: A dilated convolutional filter. The sampled pixels (denoted as red dots) comprise only a small fraction of all pixels in the receptive field.   Right: An illustration of the proposed LI-Convs with lateral inhibition zones. Each sampled pixel receives inhibition signals from eight neighbours to enhance sensitivity to semantic contours and to extract information at a denser scale.

We can also "dilate" the lateral inhibition zone to efficiently expand its field-of-view, in a similar way to dilated convolutions. Denote the dilation rate in LI zones as $g$; a generalised LI-Convs operator is denoted as

$$(F *_{l,g}^{\mathrm{LI}} k)(p) = \sum_{s + l t = p} \Big( F(s) - \sum_{q} v(q)\, F(s + g q) \Big)\, k(t) \qquad (3)$$

Although a wide variety of kernel forms can be taken by the LI intensity descriptor $v$, we opt for an intuitive formulation that is also easy to implement. In particular, $v$ in Eq. 3 simply takes the product of a differentiable weight $w$ and an exponentially decaying factor that is related to the distance between $s$ and $s + q$, which can be described as

$$v(q) = w \cdot \exp\!\left( -\frac{D(s, s + q)^2}{2\sigma^2} \right) \qquad (4)$$

where $\sigma$ is a parameter representing the standard deviation and $D(s, s + q)$ refers to a certain distance measurement between $s$ and $s + q$. Here we employ the Euclidean distance.
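The operator described in this subsection can be sketched directly in plain Python. This is a minimal illustration under stated assumptions rather than the authors' implementation: the decay is assumed to be Gaussian in the Euclidean distance, the centre of the LI zone is assumed not to inhibit itself, positions outside the feature map are read as zero, and the names `feature_at`, `li_intensity` and `li_conv_at` as well as all default values are hypothetical.

```python
import math


def feature_at(F, y, x):
    """Read a 2D feature map, treating out-of-range positions as zero (assumption)."""
    if 0 <= y < len(F) and 0 <= x < len(F[0]):
        return F[y][x]
    return 0.0


def li_intensity(qy, qx, w, sigma):
    """Assumed intensity: a weight times a Gaussian decay of the Euclidean distance."""
    dist = math.hypot(qy, qx)
    return w * math.exp(-dist ** 2 / (2.0 * sigma ** 2))


def li_conv_at(F, k, p, l, e=1, g=1, w=0.1, sigma=1.0):
    """Response of a dilated convolution with lateral inhibition at p = (py, px).

    l is the dilation rate of the filter k, e the half-size of the LI zone,
    g the dilation rate of the LI zone, w the LI weight and sigma the decay scale.
    """
    r = len(k) // 2
    py, px = p
    total = 0.0
    for ty in range(-r, r + 1):
        for tx in range(-r, r + 1):
            # Sampled pixel s satisfies s + l * t = p.
            sy, sx = py - l * ty, px - l * tx
            inhibition = 0.0
            for qy in range(-e, e + 1):
                for qx in range(-e, e + 1):
                    if (qy, qx) == (0, 0):
                        continue  # the sampled pixel does not inhibit itself
                    neighbour = feature_at(F, sy + g * qy, sx + g * qx)
                    inhibition += li_intensity(qy, qx, w, sigma) * neighbour
            total += (feature_at(F, sy, sx) - inhibition) * k[ty + r][tx + r]
    return total


feature = [[1.0] * 9 for _ in range(9)]
kernel = [[1.0] * 3 for _ in range(3)]
print(li_conv_at(feature, kernel, (4, 4), l=2, w=0.0))  # w = 0: plain dilated convolution
print(li_conv_at(feature, kernel, (4, 4), l=2, w=0.1))  # with inhibition from 8 neighbours
```

Setting `w` to zero recovers the plain dilated convolution, which mirrors the remark that the LI operator extends the dilated one with additional inhibition terms.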

3.2 Implementation of LI-Convs

Figure 2: The structure of LI-Convs. The lateral inhibitions are first calculated by the LI layer, and the inhibited features are fed into the dilated convolution layer. The dilated convolution part can be any kind of convolution implementation, such as the depthwise one chollet2017xception.

We take a straightforward approach to implement the LI-Convs in Eq. 3. We first design a Lateral Inhibition layer (the LI layer) to perform pixel-wise lateral inhibitions, and a dilated convolutional layer is subsequently applied to the inhibited features. The LI layer is essentially a light-weight module that can be flexibly inserted into deep models, and it can be easily implemented as a convolutional layer with specifically shaped filters. In particular, let a discrete function $m$, defined over the offsets $q$ from the centre of the LI zone, represent one such LI filter. With $w$ the learnable LI weight, $\sigma$ the standard deviation and $D$ the distance measurement of Eq. 4, $m$ can be described as:

$$m(q) = \begin{cases} 1, & q = 0 \\ -\,w \cdot \exp\!\left( -\dfrac{D(0, q)^2}{2\sigma^2} \right), & \text{otherwise} \end{cases} \qquad (5)$$

Note that the LI filter has the same size as the LI zones, and applying $m$ with a stride of 1 generates pixel-wise inhibited features. We empirically set $\sigma$ in Eq. 5 to a fixed value during training, thus there is only one weight $w$ to learn for each LI filter, which is significantly fewer than in regular convolutional filters. In practice, we learn the lateral inhibition weights in a channel-wise manner, i.e. each LI filter learns a separate $w$. Therefore, an LI layer introduces a total of $C$ learnable weights, where $C$ is the channel number of the input tensor.

A detailed illustration of the LI-Convs implementation can be found in Fig. 2. A ReLU activation is first applied to remove negative activations. Then an LI layer with the filters in Eq. 5 is employed to extract inhibited features, followed by another ReLU layer. This is then followed by a dilated convolution layer, which can take any form, such as the depthwise convolution chollet2017xception.
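The layer ordering described above (ReLU, LI layer, ReLU, dilated convolution) can be sketched for a single channel in plain Python. This is an illustrative sketch, not the authors' TensorFlow code: the LI filter is assumed to be 1 at the centre and a negative, Gaussian-decaying weight elsewhere, zero padding is assumed, and the helper names `li_filter`, `relu`, `conv2d_same` and `li_conv_block` are hypothetical.

```python
import math


def li_filter(zone_size, w, sigma):
    """Assumed LI filter: 1 at the centre, -w * Gaussian decay at every other offset."""
    e = zone_size // 2
    rows = []
    for qy in range(-e, e + 1):
        row = []
        for qx in range(-e, e + 1):
            if qy == 0 and qx == 0:
                row.append(1.0)
            else:
                dist_sq = qy * qy + qx * qx  # squared Euclidean distance to the centre
                row.append(-w * math.exp(-dist_sq / (2.0 * sigma ** 2)))
        rows.append(row)
    return rows


def relu(F):
    return [[max(v, 0.0) for v in row] for row in F]


def conv2d_same(F, kernel, dilation=1):
    """Stride-1, zero-padded 2D cross-correlation, as used in deep learning frameworks."""
    h, w = len(F), len(F[0])
    r = len(kernel) // 2
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0.0
            for ty in range(-r, r + 1):
                for tx in range(-r, r + 1):
                    yy, xx = y + dilation * ty, x + dilation * tx
                    if 0 <= yy < h and 0 <= xx < w:
                        acc += F[yy][xx] * kernel[ty + r][tx + r]
            out[y][x] = acc
    return out


def li_conv_block(F, li_kernel, conv_kernel, conv_dilation, li_dilation=1):
    """ReLU -> LI layer -> ReLU -> dilated convolution, for a single channel."""
    x = relu(F)
    x = conv2d_same(x, li_kernel, li_dilation)  # pixel-wise lateral inhibition
    x = relu(x)
    return conv2d_same(x, conv_kernel, conv_dilation)


feature = [[1.0] * 5 for _ in range(5)]
ones = [[1.0] * 3 for _ in range(3)]
plain = li_conv_block(feature, li_filter(3, 0.0, 1.0), ones, conv_dilation=2)
inhibited = li_conv_block(feature, li_filter(3, 0.1, 1.0), ones, conv_dilation=2)
print(plain[2][2], inhibited[2][2])
```

With the LI weight set to zero the LI filter reduces to an identity filter, so the block behaves like a ReLU followed by an ordinary dilated convolution; a positive weight lowers the responses that reach the dilated filter.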

3.3 LI-ASPP and LI-MNV2

We integrate the proposed LI-Convs into the state-of-the-art segmentation model Deeplabv3+ deeplabv3plus2018 to evaluate them. As shown in Fig. 3 (Left), we replace the three parallel dilated convolution operations in Atrous Spatial Pyramid Pooling (ASPP) deeplabv3plus2018 with the proposed LI-Convs, leading to the LI-ASPP model. Besides, we also investigate the potential of the LI layer in the backbone network, here MobileNet-V2 (MNV2). In particular, we insert the LI layer into the residual bottleneck (RB) of MobileNet-V2 mobilenetv22018, between the expansion convolution and the depthwise convolution, as illustrated in Fig. 3 (Right). We refer to this structure as the LI bottleneck layer. The original MNV2 architecture contains a stack of residual bottleneck layers, and we replace three of them, including the second-highest RB layer, with LI bottlenecks to obtain the LI-MNV2 network.

Figure 3: Left: The structures of ASPP and LI-ASPP. ASPP consists of five parallel branches including three dilated convolutions, which are replaced with the proposed LI-Convs in LI-ASPP.   Right: The structures of the residual bottleneck convolution in MobileNet-V2 and the LI bottleneck. The LI layer is inserted between the expansion convolution and depthwise convolution.

4 Experiments

4.1 Datasets

We conduct our experiments on three public benchmark segmentation datasets, which are PASCAL VOC 2012 everingham2015pascal, CelebAMask-HQ CelebAMask-HQ and ADE20K zhou2017scene. There are a total of 21 semantic classes in PASCAL VOC 2012 dataset everingham2015pascal which contains 1,464/1,449/1,456 pixel-wise annotated images for train/validation/test. Following hariharan2011semantic; deeplabv3plus2018, we use an augmented train set with a total of 10,582 annotated images. CelebAMask-HQ CelebAMask-HQ is a large-scale face parsing dataset with 30,000 pixel-wise labelled face images of 19 classes, and they are split into sets with 24,183/2,993/2,824 images for train, validation and test. ADE20K zhou2017scene is a benchmark dataset for scene parsing with 20,210/2,000/3,000 pixel-wise labelled images for train/validation/test. It is a quite challenging dataset, as there are a total of 151 classes in this dataset, and the huge variations of image resolutions also increase the difficulties. We utilise the validation set to evaluate performance on PASCAL VOC 2012 and ADE20K datasets, considering that their test sets are not publicly available, while we follow the standard protocol on CelebAMask-HQ dataset and use the test set for evaluation.

4.2 Experimental Setup

Evaluation metric    Mean Intersection-over-Union (mIoU) is the most widely used evaluation metric for the segmentation task, and we adopt it to evaluate the quality of model predictions. We also report the model parameters and the FLOPs to provide more comprehensive analyses.
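For reference, mIoU can be computed from per-pixel predictions and labels as sketched below. This is a generic illustration rather than the evaluation code used in the experiments: the function name `mean_iou` is hypothetical, classes that appear in neither the predictions nor the labels are skipped by assumption, and ignored ("void") labels are not handled.

```python
def mean_iou(predictions, labels, num_classes):
    """Mean Intersection-over-Union over classes present in predictions or labels."""
    intersection = [0] * num_classes
    union = [0] * num_classes
    for pred, true in zip(predictions, labels):
        if pred == true:
            # A correctly labelled pixel lies in both the intersection and the union.
            intersection[pred] += 1
            union[pred] += 1
        else:
            # A wrong pixel enlarges the union of both classes involved.
            union[pred] += 1
            union[true] += 1
    ious = [i / u for i, u in zip(intersection, union) if u > 0]
    return sum(ious) / len(ious)


# Flattened toy example with three classes.
preds = [0, 0, 1, 1, 2, 2]
truth = [0, 1, 1, 1, 2, 0]
print(mean_iou(preds, truth, num_classes=3))
```

In the toy example the per-class IoUs are 1/3, 2/3 and 1/2, so the mean is 0.5.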

Training Settings    We generally follow the training settings in Deeplabv3+ deeplabv3plus2018, with some modifications to suit our needs. In particular, we use the ImageNet russakovsky2015imagenet checkpoint provided by the MobileNet-V2 authors mobilenetv22018 to initialise LI-MNV2, while the weights of LI-ASPP are randomly initialised. Note that we do not use the MS COCO dataset lin2014microsoft to pre-train the model. During training, we use one image crop size for the PASCAL VOC 2012 and CelebAMask-HQ datasets and a different one for the ADE20K dataset. We train for 120 epochs with a batch size of 16, and the SGD method kiefer1952stochastic is applied to optimise the pixel-wise cross-entropy loss with L2-regularisation. The initial learning rate is set to 0.01 with the decaying policy described in deeplabv3plus2018. The output stride, defined in chen2017rethinking as the ratio of the original input resolution to the final feature resolution, is set to 16 for all datasets. We adopt the strategies in deeplabv3plus2018; chen2017rethinking to use BatchNorm layers ioffe2015batch and to randomly scale the training data for augmentation. Depthwise convolution chollet2017xception is used in the ASPP implementations following deeplabv3plus2018. During evaluation, we set the output stride to 16 for all datasets, and again use one crop size for ADE20K and another for PASCAL VOC 2012 and CelebAMask-HQ.

Backbone | Decoding model | Positions of LI-Convs | LI Zone Sizes | LI Rates | Init. Range | mIoU (%)
MNV2 | LI-ASPP | Three dilated convolution layers in ASPP with rates | | 1 | [0.0, 0.0] | 72.49
 | | | | {1,3,5} | [0.0, 0.0] | 72.56
 | | | | 5 | [0.05, 0.15] | 72.43
 | | | | 1 | [0.05, 0.15] | 72.13
 | | | | 1 | [0.05, 0.35] | 72.64
 | | | | 1 | [0.0, 0.15] | 72.93
LI-MNV2 | ASPP | RB | | 1 | [0.0, 0.0] | 72.07
 | | RB | | | | 72.21
 | | RB | | | | 71.78
 | | RB | | | | 72.43
 | | RB | | | | 72.5
Table 1: The performance of different LI-Conv parameters on the PASCAL VOC 2012 validation set for LI-ASPP and LI-MNV2, respectively. "RB" refers to the Residual Bottleneck in MNV2 mobilenetv22018.

LI Layer Settings    A lateral inhibition layer has several key hyper-parameters that can affect the performance. We tune those parameters on the PASCAL VOC 2012 validation set to determine the best-performing combination. In particular, for LI-ASPP, we set the size of the LI zones and the standard deviation in Eq. 5 to fixed values, set the LI rate in Eq. 3 to 1, and uniformly initialise the LI intensity in Eq. 5 between 0.0 and 0.15 during training (the best-performing LI-ASPP setting in Table 1). Almost identical LI settings are applied in LI-MNV2, except that all LI intensities are initialised as 0.0 so that training can start smoothly without disturbing the representations learnt in the ImageNet checkpoint. Moreover, we evaluate different positions for adding LI bottlenecks in the MNV2 architecture, and observe a general trend that adding LI to higher layers produces better performance than adding it to lower ones.

Implementations    We implement our method in the TensorFlow framework abadi2016tensorflow. For the baseline Deeplabv3+ deeplabv3plus2018 model, we directly use the code provided by the authors. To ensure a fair comparison, the decoder module in Deeplabv3+ deeplabv3plus2018 is disabled for all experiments. It takes around one day per GPU (2080 Ti) to train a model (LI-MNV2+LI-ASPP) on the PASCAL VOC 2012 dataset, and about 2.5/0.6 days on the CelebAMask-HQ and ADE20K datasets, respectively.

4.3 Results

In Table 1 we report the performance of different LI parameters on the PASCAL VOC 2012 validation set for LI-ASPP and LI-MNV2, respectively. In the LI-ASPP experiments, we investigate different settings of the LI hyper-parameters, namely the size of the LI zones, the LI rates and the initialisation range of the LI intensity. As shown in Table 1, the performance varies with the LI zone size and the LI rate, and some combinations generally yield better performance than others. LI-ASPP achieves the best performance (72.93% mIoU) when the LI intensity is randomly initialised within [0.0, 0.15], and therefore we opt for this setting for LI-ASPP.

In addition, we evaluate different options for adding LI-Convs to the Residual Bottleneck (RB) layers of the MNV2 architecture mobilenetv22018. Table 1 shows that adding LI mechanisms to the early RB layers (e.g. the earliest six RB layers) does not improve the accuracy. In contrast, LI-Convs integrated with the top RB layers can produce higher mIoUs. This observation is in line with expectations, since the higher-level layers generally encode more semantic representations, which can benefit more from the improved sensitivity to semantic contours introduced by LI layers.

In Table 2, we report the evaluation results of different methods on the three segmentation benchmark datasets, where each experiment is repeated three times to present the mean and standard deviation (SD). Compared with the baseline method (MNV2+ASPP, i.e. Deeplabv3 chen2017rethinking), LI-MNV2 and LI-ASPP both demonstrate superior performance when used alone, while the best mIoUs on all three datasets are achieved by using them together. In particular, our method (LI-MNV2+LI-ASPP) gains a relative improvement of 1.53%, 1.1% and 1.6% over the baseline on the PASCAL VOC 2012, CelebAMask-HQ and ADE20K datasets, respectively, with smaller SDs on PASCAL VOC 2012 and CelebAMask-HQ. The model’s parameters and FLOPs are slightly increased, by 0.10% and 0.75% respectively, which is arguably acceptable considering the accuracy gains. Based on those results, the effectiveness and generality of the proposed LI-Convs are verified.
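The relative changes quoted above can be recomputed from the mean values in Table 2. The short sketch below is only an arithmetic check; the helper name `relative_change` and the dictionary layout are illustrative.

```python
def relative_change(baseline, value):
    """Relative change of `value` with respect to `baseline`, in percent."""
    return 100.0 * (value - baseline) / baseline


# (baseline MNV2+ASPP, LI-MNV2+LI-ASPP) pairs copied from Table 2.
table2 = {
    "PASCAL VOC 2012 mIoU": (71.85, 72.95),
    "CelebAMask-HQ mIoU": (74.69, 75.55),
    "ADE20K mIoU": (30.02, 30.52),
    "Parameters (Kilo)": (2568.02, 2570.52),
    "FLOPs (Mega)": (6479, 6528),
}
for name, (base, ours) in table2.items():
    print(f"{name}: {relative_change(base, ours):+.2f}%")
```

The recomputed values agree with the figures reported in the text up to rounding or truncation of the last digit.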

Method | mIoU (%), PASCAL VOC 2012 | mIoU (%), CelebAMask-HQ | mIoU (%), ADE20K | Parameters (Kilo) | FLOPs (Mega)
MNV2 + ASPP (Deeplabv3 chen2017rethinking) | 71.85 ± 0.30 | 74.69 ± 0.31 | 30.02 ± 0.11 | 2568.02 | 6479
MNV2 + LI-ASPP | 72.63 ± 0.24 | 75.25 ± 0.05 | 30.43 ± 0.27 | 2568.98 | 6498
LI-MNV2 + ASPP | 72.47 ± 0.34 | 75.29 ± 0.15 | 30.44 ± 0.36 | 2569.94 | 6517
LI-MNV2 + LI-ASPP | 72.95 ± 0.24 | 75.55 ± 0.13 | 30.52 ± 0.18 | 2570.52 | 6528
Table 2: Performance of different methods on PASCAL VOC 2012 and ADE20K (validation set) and on CelebAMask-HQ (test set). To provide a fair comparison, we repeat each experiment three times and report the means and the standard deviations; the model parameters and FLOPs (for crop size ) are also included.

4.4 Discussion

Figure 4: Visualisations of the channel-level features before and after LI layers on CelebAMask-HQ. Although the activation is inhibited globally, the feature patterns after the LI layer are generally easier to recognise, mainly due to the clearer semantic contours.

How the LI layer works    To intuitively understand the LI mechanisms, we look into the channel-level features and visualise the patterns before and after LI layers. As demonstrated in Fig. 4, we plot several feature channels before and after the LI layers in LI-ASPP on the CelebAMask-HQ dataset. It can be observed that although the intensity of activation is suppressed globally after the LI layer, the inhibited features exhibit more recognisable patterns with clarified and emphasised contours, which is desirable in the segmentation domain.

Figure 5: Visualisations of the class-level heat maps and semantic predictions of the baseline (MNV2+ASPP) and our method (LI-MNV2+LI-ASPP) on CelebAMask-HQ. Deeper reds in heat maps represent higher positive responses or more attention from the model, and vice versa for deeper blues. Our method allocates more attention to shape the semantic boundary areas and thus can produce predictions with higher visual qualities.

What interests the model    In Fig. 5, we visualise the class-level heat maps and the segmentation predictions generated by the baseline (MNV2+ASPP) and our method (LI-MNV2+LI-ASPP) on CelebAMask-HQ. We use deeper reds to denote higher positive neuron responses (more model attention) in heat maps, and vice versa for deeper blues. Compared with the baseline, the semantically meaningful contour areas receive more attention from our model, e.g. in the "glasses" and "skin" heat maps in Fig. 5. This kind of contour sensitivity can be reasonably attributed to the proposed LI-Convs. Besides, the segmentation predictions generated by our method have better visual quality, which also supports the benefit of the LI-Convs.

5 Conclusion

We describe a dilated convolution with lateral inhibitions (LI-Convs) to enhance the model’s sensitivity to semantic contours and to extract features at denser scales. The proposed LI-ASPP and LI-MNV2 architectures are shown to outperform the baseline method on three segmentation benchmark datasets, which verifies the effectiveness and generality of the LI-Convs. We also investigate the working mechanisms behind them. The proposed LI-Convs can be seamlessly integrated into deep models for other tasks, such as lip-reading and object detection, that require explicit awareness of semantic boundaries.

References