1 Introduction
Semantic segmentation is a fundamental and challenging problem in computer vision, whose goal is to assign a semantic category to each pixel of an image. It is critical for various tasks such as autonomous driving, image editing and robot sensing. To accomplish semantic segmentation effectively, we need to distinguish confusing categories and take the varied appearance of objects into account. For example, 'grass' and 'ground' have similar colors in some cases, and a 'person' may appear with various scales, poses and clothes at different locations in the image. Meanwhile, the label space of the output is quite compact and the number of categories for a specific dataset is limited. Therefore, this task can be treated as projecting data points from a high-dimensional noisy space into a compact subspace. The essence lies in de-noising these variations and capturing the most important semantic concepts.
Recently, many state-of-the-art methods based on fully convolutional networks (FCNs) [22] have been proposed to address the above issues. Due to their fixed geometric structures, they are inherently limited to local receptive fields and short-range contextual information. To capture long-range dependencies, several works employ multi-scale context fusion [17], such as atrous convolution [4], spatial pyramid pooling [37], large kernel convolution [25] and so on. Moreover, to preserve more detailed information, encoder-decoder structures [34, 5] have been proposed to fuse mid-level and high-level semantic features. To aggregate information from all spatial locations, attention mechanisms [29, 38, 31] are used, which enable the feature of a single pixel to fuse information from all other positions. However, the original attention-based methods need to generate a large attention map, which has high computational complexity and consumes a large amount of GPU memory. The bottleneck lies in that both the generation of the attention map and its usage are computed w.r.t. all positions.
To address the above issues, in this paper we rethink the attention mechanism from the view of the expectation-maximization (EM) algorithm [7] and propose a novel attention-based method, namely Expectation-Maximization Attention (EMA). Instead of treating all pixels themselves as the reconstruction bases [38, 31], we use the EM algorithm to find a more compact basis set, which can largely reduce the computational complexity. In detail, we regard the bases for reconstruction as the parameters to learn in the EM algorithm and the attention maps as latent variables. In this setting, the EM algorithm aims to find a maximum likelihood estimate of the parameters (bases). Given the current parameters, the expectation (E) step estimates the expectation of the attention map, and the maximization (M) step updates the parameters (bases) by maximizing the complete-data likelihood. The E and M steps execute alternately. After convergence, the output is computed as the weighted sum of the bases, where the weights are the normalized final attention maps. The pipeline of EMA is shown in Fig. 1.
We further embed the proposed EMA method into a module for neural networks, named the EMA Unit. The EMA Unit can be implemented with common operators. It is also lightweight and can easily be embedded into existing neural networks. Moreover, to make full use of its capacity, we propose two methods to stabilize the training of the EMA Unit. We evaluate its performance on three challenging datasets.
The main contributions of this paper are listed as follows:

We reformulate the self-attention mechanism into an expectation-maximization iteration manner, which learns a more compact basis set and largely reduces the computational complexity. To the best of our knowledge, this paper is the first to introduce EM iterations into the attention mechanism.

We build the proposed expectation-maximization attention as a lightweight module for neural networks and set up specific mechanisms for the bases' maintenance and normalization.

Extensive experiments on three challenging semantic segmentation datasets, including PASCAL VOC, PASCAL Context and COCO Stuff, demonstrate the superiority of our approach over other state-of-the-art methods.
2 Related Works
Semantic segmentation. Fully convolutional network (FCN) [22] based methods have made great progress in image semantic segmentation by leveraging the powerful convolutional features of classification networks [14, 15, 33] pretrained on large-scale data [28]. Several model variants have been proposed to enhance multi-scale contextual aggregation. For example, DeeplabV2 [4] makes use of atrous spatial pyramid pooling (ASPP), which consists of parallel dilated convolutions with different dilation rates, to embed contextual information. DeeplabV3 [4] extends ASPP with image-level features to further capture global context. Meanwhile, PSPNet [37] proposes a pyramid pooling module to collect contextual information at different scales. GCN [25] decomposes large kernel convolutions to enlarge the receptive field of the feature map and capture long-range information.
Another type of variant mainly focuses on predicting more detailed outputs. These methods build on U-Net [27], which combines the advantages of high-level features with mid-level features. RefineNet [21] makes use of the Laplacian image pyramid to explicitly capture the information available along the downsampling process and outputs predictions from coarse to fine. DeeplabV3+ [5] adds a decoder upon DeeplabV3 to refine the segmentation results, especially along object boundaries. ExFuse [36] proposes a new framework to bridge the gap between low-level and high-level features and thus improves segmentation quality.
Attention model. Attention is widely used for various tasks such as machine translation, visual question answering and video classification. The self-attention methods [2, 29] compute the context encoding at one position as a weighted summation of the embeddings at all positions in a sentence. Non-local [31] first adopts the self-attention mechanism as a module for computer vision tasks, such as video classification, object detection and instance segmentation. PSANet [38] learns to aggregate contextual information for each position via a predicted attention map. A²-Net [6] proposes the double attention block to distribute and gather informative global features from the entire spatio-temporal space of the input. DANet [11] applies both spatial and channel attention to gather information around the feature maps, which costs even more computation and memory than the Non-local method.
Our approach is motivated by the success of attention in the above works. We rethink the attention mechanism from the view of the EM algorithm and compute the attention maps in an iterative manner, as in the EM algorithm.
3 Preliminaries
Before introducing our proposed method, we first review three highly related topics, namely the EM algorithm, the Gaussian mixture model and the Non-local module.
3.1 Expectation-Maximization Algorithm
The expectation-maximization (EM) algorithm [7] aims to find the maximum likelihood solution for models with latent variables. Denote by $X = \{x_1, x_2, \dots, x_N\}$ a data set of $N$ observed samples, where each data point $x_n$ has a corresponding latent variable $z_n$. We call $\{X, Z\}$ the complete data, and its likelihood function takes the form $p(X, Z \mid \theta)$, where $\theta$ is the set of all parameters of the model. In practice, the only knowledge of the latent variables $Z$ is given by the posterior distribution $p(Z \mid X, \theta)$. The EM algorithm is designed to maximize the likelihood by two steps, i.e., the E step and the M step.
In the E step, we use the current parameters $\theta^{old}$ to find the posterior distribution of $Z$, given by $p(Z \mid X, \theta^{old})$. We then use this posterior distribution to find the expectation of the complete-data log likelihood, which is given by:

$Q(\theta, \theta^{old}) = \sum_{Z} p(Z \mid X, \theta^{old}) \ln p(X, Z \mid \theta)$  (1)
Then in the M step, the revised parameters $\theta^{new}$ are determined by maximizing this function:

$\theta^{new} = \arg\max_{\theta} Q(\theta, \theta^{old})$  (2)
The EM algorithm executes the E step and the M step alternately until a convergence criterion is satisfied.
3.2 Gaussian Mixture Model
The Gaussian mixture model (GMM) [26] is a classical application of the EM algorithm. It models the distribution of data as a linear superposition of Gaussians:

$p(x_n) = \sum_{k=1}^{K} \pi_k \mathcal{N}(x_n \mid \mu_k, \Sigma_k)$  (3)
where the mean $\mu_k$ and the covariance $\Sigma_k$ are the parameters of the $k$-th Gaussian basis, and $\pi_k$ is its mixing coefficient. The likelihood of the complete data is formulated as:

$p(x_n) = \sum_{k=1}^{K} z_{nk} \mathcal{N}(x_n \mid \mu_k, \Sigma_k)$  (4)
where $\sum_{k=1}^{K} z_{nk} = 1$. Here $z_{nk}$ can be viewed as the responsibility that the $k$-th basis takes for the observation $x_n$. For GMM, in the E step, the expected value of $z_{nk}$ is given by:

$z_{nk} = \frac{\pi_k \mathcal{N}(x_n \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \mathcal{N}(x_n \mid \mu_j, \Sigma_j)}$  (5)
In the M step, the parameters are re-estimated as follows:

$\mu_k^{new} = \frac{1}{N_k} \sum_{n=1}^{N} z_{nk} x_n, \qquad \Sigma_k^{new} = \frac{1}{N_k} \sum_{n=1}^{N} z_{nk} (x_n - \mu_k^{new})(x_n - \mu_k^{new})^{\top}, \qquad \pi_k^{new} = \frac{N_k}{N}$  (6)

where

$N_k = \sum_{n=1}^{N} z_{nk}$  (7)
After the convergence of the GMM parameters, the re-estimated data point $\tilde{x}_n$ can be formulated as:

$\tilde{x}_n = \sum_{k=1}^{K} z_{nk} \mu_k$  (8)
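The E and M updates above fit in a few lines of NumPy. The following is an illustrative sketch of Eqs. (5)–(7) with identity covariances (the simplification mentioned next); the function name and the data-point initialization are our own choices, not prescribed by any specific implementation.

```python
import numpy as np

def gmm_em(X, K, iters=30, seed=0):
    """EM for a GMM with identity covariances (Eqs. 5-7, simplified).

    X: (N, C) data. Returns responsibilities gamma (N, K) and means mu (K, C).
    """
    rng = np.random.default_rng(seed)
    N, C = X.shape
    mu = X[rng.choice(N, K, replace=False)]     # initialize bases from the data
    pi = np.full(K, 1.0 / K)
    for _ in range(iters):
        # E step (Eq. 5): gamma_nk proportional to pi_k * N(x_n | mu_k, I)
        d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(-1)   # (N, K)
        logp = np.log(pi)[None, :] - 0.5 * d2
        logp -= logp.max(axis=1, keepdims=True)                # stability
        gamma = np.exp(logp)
        gamma /= gamma.sum(axis=1, keepdims=True)
        # M step (Eqs. 6-7): re-estimate means and mixing coefficients
        Nk = gamma.sum(axis=0)
        mu = (gamma.T @ X) / Nk[:, None]
        pi = Nk / N
    return gamma, mu
```

The re-estimation of Eq. (8) is then simply `gamma @ mu`.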
In real applications, we can simply fix $\Sigma_k$ as the identity matrix and leave out $\pi_k$ in the above equations.

3.3 Non-local
The Non-local module [31] functions the same as the self-attention mechanism. It can be formulated as:

$y_i = \frac{1}{C(x)} \sum_{\forall j} f(x_i, x_j)\, g(x_j)$  (9)

where $f(\cdot, \cdot)$ represents a general kernel function, $C(x)$ is a normalization factor and $x_i$ denotes the feature vector at location $i$. This module is applied upon the feature maps of convolutional neural networks (CNNs).
Considering that $\mathcal{N}(x_n \mid \mu_k, \Sigma_k)$ in Eq. (5) is a specific kernel function between $x_n$ and $\mu_k$, Eq. (8) is just a specific design of Eq. (9). Then, from the viewpoint of GMM, the Non-local module is just a re-estimation of $X$, without E steps and M steps. Specifically, the bases $\mu$ are simply selected as the data $X$ itself in Non-local.
In GMM, the number of Gaussian bases $K$ is selected manually and usually satisfies $K \ll N$. But in the Non-local module, the bases are selected as the data themselves, so $K = N$. This brings two obvious disadvantages. First, since the data lie on a low-dimensional manifold, the bases are over-complete. Second, the computational overhead is heavy and the memory cost is also large.
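To make the cost concrete, here is a minimal NumPy sketch of Eq. (9) with the exponential inner-dot kernel (i.e., softmax attention); the embedding transforms $\theta$, $\phi$, $g$ of the actual Non-local module are omitted for brevity.

```python
import numpy as np

def non_local(X):
    """Non-local / self-attention re-estimation (Eq. 9), simplified.

    X: (N, C) flattened feature map. Materializes a full N x N attention
    map, hence O(N^2) time and memory -- the bottleneck discussed above.
    """
    logits = X @ X.T                             # (N, N) pairwise kernel values
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    A = np.exp(logits)
    A /= A.sum(axis=1, keepdims=True)            # normalization factor C(x)
    return A @ X                                 # each pixel attends to all pixels
```

For a 64×64 feature map ($N = 4096$), the attention map alone already holds $N^2 \approx 16.8$M entries per image.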
4 Expectation-Maximization Attention
In view of the high computational complexity of the attention mechanism and the limitations of the Non-local module, we first propose the expectation-maximization attention (EMA) method, which is an augmented version of self-attention. Unlike the Non-local module, which selects all data points as bases, we use the EM iterations to find a compact basis set.
For simplicity, we consider an input feature map $X$ of size $C \times H \times W$ from a single sample, where $X$ is the intermediate activation of a CNN. To simplify the notation, we reshape $X$ into $N \times C$, where $N = H \times W$, and $x_n \in \mathbb{R}^C$ indexes the $C$-dimensional feature vector at pixel $n$. Our proposed EMA consists of three operations, namely responsibility estimation ($A_E$), likelihood maximization ($A_M$) and data re-estimation ($A_R$). Briefly, given the input $X \in \mathbb{R}^{N \times C}$ and the initial bases $\mu \in \mathbb{R}^{K \times C}$, $A_E$ estimates the latent variables (the 'responsibilities') $Z \in \mathbb{R}^{N \times K}$, so it functions as the E step of the EM algorithm. $A_M$ uses this estimation to update the bases $\mu$, which works as the M step. The $A_E$ and $A_M$ steps execute alternately for a pre-specified number of iterations. Then, with the converged $\mu$ and $Z$, $A_R$ reconstructs the original $X$ as $\tilde{X}$ and outputs it.
It has been proved that, with the iteration of EM steps, the complete-data likelihood $p(X, Z)$ increases monotonically. As $p(X)$ can be estimated by marginalizing $p(X, Z)$ over $Z$, maximizing $p(X, Z)$ is a proxy for maximizing $p(X)$. Therefore, with the iterations of $A_E$ and $A_M$, the updated $\mu$ and $Z$ have a better ability to reconstruct the original data $X$. The reconstructed $\tilde{X}$ can capture the important semantics of $X$ as much as possible.
Moreover, compared with the Non-local module, EMA finds a compact set of bases for the pixels of an input image. This compactness is non-trivial: since $K \ll N$, $\tilde{X}$ lies in a subspace of $X$. This mechanism removes much unnecessary noise and makes the final classification of each pixel more tractable. Moreover, this operation reduces the complexity (in both space and time) from $O(N^2)$ to $O(NKT)$, where $T$ is the number of iterations of $A_E$ and $A_M$. The convergence of the EM algorithm is also guaranteed. Notably, EMA takes only three iterations to obtain promising results in our experiments, so $T$ can be treated as a small constant, which means the complexity is only $O(NK)$.
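A quick back-of-the-envelope comparison of the attention-map sizes makes the saving tangible (the feature-map size here is illustrative, not prescribed by the method):

```python
# Attention-map entries: Non-local vs. EMA, for an illustrative 64x64 map.
N = 64 * 64      # number of pixels
K = 64           # number of bases (K << N)
T = 3            # number of EM iterations
non_local_entries = N * N      # one N x N attention map
ema_entries = N * K * T        # T responsibility maps of size N x K
print(non_local_entries // ema_entries)   # ~21x fewer entries
```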
4.1 Responsibility Estimation
Responsibility estimation ($A_E$) functions as the E step of the EM algorithm. This step computes the expected value of $z_{nk}$, which corresponds to the responsibility of the $k$-th basis $\mu_k$ for $x_n$, where $1 \le k \le K$ and $1 \le n \le N$. We formulate the posterior probability of $x_n$ given $\mu_k$ as follows:

$p(z_n = k \mid x_n) = \frac{p(x_n \mid \mu_k)}{\sum_{j=1}^{K} p(x_n \mid \mu_j)}$  (10)
where $\mathcal{K}(a, b)$ represents a general kernel function. Now, Eq. (5) can be reformulated into the more general form:

$z_{nk} = \frac{\mathcal{K}(x_n, \mu_k)}{\sum_{j=1}^{K} \mathcal{K}(x_n, \mu_j)}$  (11)
There are several choices for $\mathcal{K}(a, b)$, such as the inner product $a^{\top} b$, the exponential inner product $\exp(a^{\top} b)$, the Euclidean distance $\|a - b\|_2^2$, the RBF kernel $\exp(-\|a - b\|_2^2 / \sigma^2)$ and so on. As compared in the Non-local paper, the choice among these functions makes only trivial differences in the final results, so we simply take the exponential inner product in this paper. In experiments, Eq. (11) can then be implemented as a matrix multiplication followed by one softmax layer. In conclusion, the operation of $A_E$ in the $t$-th iteration is formulated as:

$Z^{(t)} = \mathrm{softmax}(\lambda X (\mu^{(t-1)})^{\top})$  (12)

where $\lambda$ is a hyper-parameter that controls the sharpness of the distribution of $Z$.
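In code, the $A_E$ step of Eq. (12) is literally one matrix multiplication followed by a row-wise softmax. A minimal NumPy sketch (function names are our own):

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # subtract max for stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def responsibility(X, mu, lam=1.0):
    """A_E (Eq. 12): Z = softmax(lambda * X mu^T).

    X: (N, C) pixels, mu: (K, C) bases -> Z: (N, K) responsibilities.
    """
    return softmax(lam * (X @ mu.T), axis=1)
```

As $\lambda$ grows, each row of $Z$ approaches a one-hot vector, which is the connection to K-means clustering noted in Sec. 4.2.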
4.2 Likelihood Maximization
Likelihood maximization ($A_M$) works as the M step of the EM algorithm. With the estimated $Z$, $A_M$ updates the bases $\mu$ by maximizing the complete-data likelihood. To keep the bases lying in the same embedding space as $X$, we update each basis with a weighted summation of $X$. So in the $t$-th iteration of $A_M$, $\mu$ is updated as:

$\mu_k^{(t)} = \frac{\sum_{n=1}^{N} z_{nk}^{(t)} x_n}{\sum_{m=1}^{N} z_{mk}^{(t)}}$  (13)
It is noteworthy that if we let $\lambda \to \infty$ in Eq. (12), then $Z$ becomes a one-hot embedding. In this situation, each pixel is assigned to exactly one basis, and each basis is updated as the average of the pixels assigned to it. This is exactly what the K-means clustering algorithm [10] does. So the iterations of $A_E$ and $A_M$ can also be viewed as a soft version of K-means clustering.

4.3 Data Re-estimation
EMA runs $A_E$ and $A_M$ alternately for $T$ times. After that, the final $\mu^{(T)}$ and $Z^{(T)}$ are used to re-estimate $X$. We adopt Eq. (8) to construct the new $\tilde{X}$, which is formulated as:

$\tilde{X} = Z^{(T)} \mu^{(T)}$  (14)

As $\tilde{X}$ is constructed from a compact basis set, it has the low-rank property compared with the input $X$. We depict an example of $\tilde{X}$ in Fig. 2. It is obvious that the $\tilde{X}$ output by $A_R$ is very compact in the feature space, and the feature variance inside an object is smaller than that of the input.
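Putting $A_E$, $A_M$ and $A_R$ together, the whole of EMA fits in a few lines. The following NumPy sketch leaves out basis initialization and the normalizations of Sec. 5:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def ema(X, mu, T=3, lam=1.0):
    """Expectation-Maximization Attention: T rounds of A_E/A_M, then A_R.

    X: (N, C) pixels, mu: (K, C) initial bases.
    Returns the re-estimation X_tilde (Eq. 14) together with mu and Z.
    """
    for _ in range(T):
        Z = softmax(lam * (X @ mu.T), axis=1)    # A_E (Eq. 12): (N, K)
        mu = (Z.T @ X) / Z.sum(axis=0)[:, None]  # A_M (Eq. 13): weighted means
    X_tilde = Z @ mu                             # A_R (Eq. 14): low-rank output
    return X_tilde, mu, Z
```

Since $\tilde{X} = Z\mu$ with $Z$ of shape $(N, K)$ and $\mu$ of shape $(K, C)$, its rank is at most $K$, which is exactly the low-rank property discussed above.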
5 EMA Unit
In order to better incorporate the proposed EMA into deep neural networks, we further propose the Expectation-Maximization Attention Unit (EMAU) and apply it to the semantic segmentation task. In this section, we describe EMAU in detail. We first introduce the overall structure of EMAU and then discuss the bases' maintenance and normalization mechanisms.
5.1 Structure of EMA Unit
The overall structure of EMAU is shown in Fig. 2. At first glance, EMAU looks like the bottleneck of ResNet, except that it replaces the heavy 3×3 convolution with the EMA operations. The first 1×1 convolution, without the ReLU activation, is prepended to transform the value range of the input from $(0, +\infty)$ to $(-\infty, +\infty)$. This transformation is important; otherwise the estimated bases $\mu$ would also lie in $(0, +\infty)$, which halves their capacity compared with general convolution parameters. The last 1×1 convolution is inserted to transform the re-estimated $\tilde{X}$ into the residual space of $X$.

For each of the $A_E$, $A_M$ and $A_R$ steps, the computational complexity is $O(NKC)$. As we set $K \ll N$, several iterations of $A_E$ and $A_M$ plus one $A_R$ are of the same magnitude as a 1×1 convolution with input and output channel numbers both being $C$. Adding the extra computation from the two 1×1 convolutions, the whole FLOPs of EMAU is around 1/3 of a module running 3×3 convolutions with the same number of input and output channels. Moreover, the parameters maintained by EMA only count to $KC$.
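The EMAU forward pass can be sketched on flattened features, where the two 1×1 convolutions become plain matrix multiplications. In this sketch, `W_in` and `W_out` are hypothetical placeholder weight matrices, and batch normalization and the backbone are omitted; the per-iteration L2 normalization of the bases is the mechanism described in Sec. 5.3:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def emau_forward(X, mu, W_in, W_out, T=3):
    """EMA Unit forward pass on flattened pixels (illustrative sketch).

    X: (N, C) input features, mu: (K, C') bases,
    W_in: (C, C') and W_out: (C', C) stand in for the two 1x1 convolutions
    (a 1x1 conv applied to flattened pixels is exactly a matmul).
    """
    H = X @ W_in                                     # first 1x1 conv (no ReLU)
    for _ in range(T):                               # EMA iterations
        Z = softmax(H @ mu.T, axis=1)                # A_E
        mu = (Z.T @ H) / Z.sum(axis=0)[:, None]      # A_M
        mu /= np.linalg.norm(mu, axis=1, keepdims=True)  # L2Norm on bases
    H_tilde = Z @ mu                                 # A_R
    return X + H_tilde @ W_out                       # map back, residual add
```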
5.2 Bases Maintenance
Another issue for the EM algorithm is the initialization of the bases. The EM algorithm is guaranteed to converge, because the likelihood of the complete data is bounded, and at each iteration both the E and M steps lift its current lower bound. However, convergence to the global maximum is not guaranteed. Thus, the initial values of the bases before the iterations are of great importance.
We have only described how EMA processes a single image above. However, for a computer vision task, there are thousands of images in a dataset. As each image has a pixel feature distribution different from others, it is not suitable to use the $\mu$ computed on one image to reconstruct the feature maps of other images. So we run EMA on each image independently.
For the first mini-batch, we initialize $\mu^{(0)}$ using Kaiming's initialization [13], where we treat the matrix multiplication as a 1×1 convolution. For the following mini-batches, one simple choice is to update $\mu^{(0)}$ by standard back-propagation. However, as the iterations of $A_E$ and $A_M$ can be unrolled like a recurrent neural network (RNN), the gradients propagating through them will encounter the vanishing or exploding problem. Therefore, the updating of $\mu^{(0)}$ is unstable, and the training of the EMA Unit may collapse.

In this paper, we instead use a moving average to update $\mu^{(0)}$ during training. After iterating over an image, the generated $\mu^{(T)}$ can be regarded as a biased update of $\mu^{(0)}$, where the bias comes from the image sampling process. To make it less biased, we first average $\mu^{(T)}$ over a mini-batch, obtaining $\bar{\mu}^{(T)}$. We then update $\mu^{(0)}$ as:
$\mu^{(0)} \leftarrow \alpha \mu^{(0)} + (1 - \alpha) \bar{\mu}^{(T)}$  (15)
where $\alpha \in [0, 1]$ is the momentum. For inference, $\mu^{(0)}$ is kept fixed. This moving-averaging mechanism is also adopted in batch normalization (BN) [16].

5.3 Bases Normalization
In the above subsection, we accomplished the maintenance of $\mu^{(0)}$ across mini-batches. However, the stable update of $\mu^{(t)}$ inside the $A_E$ and $A_M$ iterations is still not guaranteed, due to the aforementioned defect of RNN-like structures. The moving-averaging mechanism described above requires $\bar{\mu}^{(T)}$ not to differ significantly from $\mu^{(0)}$; otherwise training will also collapse, as with back-propagation. This requirement also constrains the value range of $\mu^{(t)}$.
To this end, we need to apply normalization to $\mu^{(t)}$. At first glance, BN or layer normalization (LN) [1] sound like good choices. However, these normalization methods change the direction of each basis $\mu_k$, which alters its properties and semantic meaning. To keep the direction of each basis untouched, we choose Euclidean normalization (L2Norm), which divides each $\mu_k$ by its length. With it, each $\mu_k^{(t)}$ lies on a $(C-1)$-dimensional unit hypersphere, and the sequence $\mu_k^{(1)}, \mu_k^{(2)}, \dots, \mu_k^{(T)}$ forms a trajectory on it.
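Taken together, the bases maintenance of Eq. (15) and the L2 normalization above amount to a two-liner (a sketch; the helper name is our own):

```python
import numpy as np

def update_bases(mu, mu_batch, alpha=0.9):
    """Bases maintenance (Eq. 15) followed by per-basis L2Norm.

    mu: (K, C) running bases, mu_batch: (K, C) bases averaged over a
    mini-batch. The moving average keeps the update less biased; L2Norm
    rescales each basis to unit length without changing its direction.
    """
    mu = alpha * mu + (1.0 - alpha) * mu_batch             # moving average
    return mu / np.linalg.norm(mu, axis=1, keepdims=True)  # per-basis L2Norm
```

Unlike BN or LN, dividing by the length leaves each basis parallel to its un-normalized version, preserving its semantic meaning.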
5.4 Comparison with the Double Attention Block
A²-Net [6] proposes the double attention block (A² block), in which the output $Y$ is computed as:

$Y = \left[\phi(X; W_\phi)\, \mathrm{softmax}(\theta(X; W_\theta))^{\top}\right] \mathrm{softmax}(\rho(X; W_\rho))$  (16)

where $\mathrm{softmax}(\cdot)$ represents the softmax function, and $\phi$, $\theta$ and $\rho$ represent three 1×1 convolutions with convolution kernels $W_\phi$, $W_\theta$ and $W_\rho$, respectively.
If we share the parameters between $\theta$ and $\rho$, then we can mark both $\mathrm{softmax}(\theta(X))$ and $\mathrm{softmax}(\rho(X))$ as $Z$. We can see that $Z$ is computed just as in Eq. (12), and the variables inside the brackets update the bases $\mu$. The whole process of the A² block thus equals EMA with only one iteration. The $\mu$ in the A² block is updated by back-propagation, while our EMAU updates it by moving averaging. Above all, the double attention block can be treated as a special form of EMAU.
[Figure: results for varying evaluation iteration numbers and training output strides on the PASCAL VOC dataset; the iteration number for training is fixed. Best viewed on screen.]

6 Experiments
To evaluate the proposed EMAU, we conduct extensive experiments on the PASCAL VOC dataset [9], the PASCAL Context dataset [24], and the COCO Stuff dataset [3]. In this section, we first introduce the implementation details. Then we perform ablation studies on the PASCAL VOC dataset to verify the superiority of the proposed method. Finally, we report our results on the PASCAL Context and COCO Stuff datasets.
6.1 Implementation Details
We use ResNet [14] (pretrained on ImageNet [28]) as our backbone. Following prior works [37, 4, 5], we employ a poly learning rate policy, where the initial learning rate is multiplied by $(1 - \frac{iter}{total\_iter})^{0.9}$ after each iteration. The initial learning rate is set to 0.009 for all datasets. The momentum and weight decay coefficients are set to 0.9 and 0.0001, respectively. For data augmentation, we apply common scaling (0.5 to 2.0), cropping and flipping of the image to augment the training data. The input size for all datasets is set to 513×513. Synchronized batch normalization is adopted in all experiments, together with the multi-grid strategy [4]. For evaluation, we adopt the commonly used mean IoU metric.

The output stride of the backbone is set to 16 for training on PASCAL VOC and PASCAL Context, and to 8 for training on COCO Stuff and for evaluation on all datasets. To speed up the training procedure, we carry out all ablation studies on ResNet-50 [14] with batch size 12. For all models to be compared with the state-of-the-art, we train on ResNet-101 with batch size 16. We train 30K iterations on PASCAL VOC and COCO Stuff, and 15K on PASCAL Context. We use a 3×3 convolution to reduce the channel number from 2048 to 512, and then stack the EMAU upon it. We call the whole network EMANet. We set the basis number $K = 64$ and the training iteration number $T_{train} = 3$ by default.
6.2 Results on the PASCAL VOC Dataset
6.2.1 Bases Maintenance and Normalization
In this part, we first compare different strategies for maintaining $\mu$. We fix the iteration number in training and vary it in evaluation. As shown in the left part of Fig. 3, the performance of all strategies increases with more iterations of $A_E$ and $A_M$, and the gain from additional iterations becomes marginal after the first few. Moving averaging performs the best among them: it achieves the highest performance at all iteration numbers and surpasses the others by a clear margin in mIoU. Surprisingly, updating $\mu$ by back-propagation shows no merit compared with no updating, and even performs worse as the evaluation iteration number grows.
We then compare the performance with no normalization, LN, and L2Norm as described above. From the right part of Fig. 3, it is clear that LN is better than no normalization, since it can partially relieve the gradient problems of the RNN-like structure. The performance of LN and of no normalization has little correlation with the number of iterations. By contrast, L2Norm's performance increases with more iterations, and it outperforms LN and no normalization when the iteration number is large.
Table 1: Comparisons with baselines on the PASCAL VOC val set. SS: single-scale testing; MS+Flip: multi-scale testing with left-right flipping (both mIoU, %). FLOPs, memory and parameters are reported as increments over the ResNet-101 backbone (first row).

Method | SS | MS+Flip | FLOPs | Memory | Params
ResNet-101 | - | - | 190.6G | 2.603G | 42.6M
DeeplabV3 [4] | 78.51 | 79.77 | +63.4G | +66.0M | +15.5M
DeeplabV3+ [5] | 79.35 | 80.57 | +84.1G | +99.3M | +16.3M
PSANet [38] | 78.51 | 79.77 | +56.3G | +59.4M | +18.5M
EMANet (256) | 79.73 | 80.94 | +21.1G | +12.3M | +4.87M
EMANet (512) | 80.05 | 81.32 | +43.1G | +22.1M | +10.0M
Table 2: Comparisons with state-of-the-art methods on the PASCAL VOC test set.

Method | Backbone | mIoU (%)
Wide ResNet [32] | WideResNet-38 | 84.9
PSPNet [37] | ResNet-101 | 85.4
DeeplabV3 [4] | ResNet-101 | 85.7
PSANet [38] | ResNet-101 | 85.7
EncNet [35] | ResNet-101 | 85.9
DFN [34] | ResNet-101 | 86.2
ExFuse [36] | ResNet-101 | 86.2
IDW-CNN [30] | ResNet-101 | 86.3
SDN [12] | DenseNet-161 | 86.6
DIS [23] | ResNet-101 | 86.8
EMANet | ResNet-101 | 87.7
GCN [25] | ResNet-152 | 83.6
RefineNet [21] | ResNet-152 | 84.2
DeeplabV3+ [5] | Xception-71 | 87.8
ExFuse [36] | ResNeXt-131 | 87.9
MSCI [20] | ResNet-152 | 88.0
6.2.2 Ablation Study for Iteration Number
From Fig. 3, it is obvious that the performance of EMAU gains from more iterations during evaluation, and the gain becomes marginal as the evaluation iteration number grows. In this subsection, we also study the influence of the iteration number $T_{train}$ in training. We plot the performance matrix over $T_{train}$ and $T_{eval}$ in Fig. 4.
From Fig. 4, it is clear that mIoU increases monotonically with more iterations in evaluation, no matter what $T_{train}$ is, and finally converges to a fixed value. However, this rule does not hold in training: the mIoUs peak at a small $T_{train}$ and decrease with more training iterations. This phenomenon may be caused by the RNN-like behavior of EMAU. Though moving averaging and L2Norm can relieve it to a certain degree, the problem persists.
We also carry out experiments on the A² block [6], which can be regarded as a special form of EMAU as mentioned in Sec. 5.4. Similarly, the Non-local module can also be viewed as a special form of EMAU without the $A_M$ step, with many more bases ($K = N$) and no iteration. With the same backbone and training schedule, the A² block achieves 77.41% and the Non-local module achieves 77.78% mIoU, respectively. As a comparison, EMANet achieves 77.34% when trained and evaluated with a single iteration. These three results differ only slightly, which is consistent with our analysis.
Table 3: Comparisons with state-of-the-art methods on the PASCAL Context test set.

Method | Backbone | mIoU (%)
PSPNet [37] | ResNet-101 | 47.8
DANet [11] | ResNet-50 | 50.1
MSCI [20] | ResNet-152 | 50.3
EMANet | ResNet-50 | 50.5
SGR [18] | ResNet-101 | 50.8
CCL [8] | ResNet-101 | 51.6
EncNet [35] | ResNet-101 | 51.7
SGR+ [18] | ResNet-101 | 52.5
DANet [11] | ResNet-101 | 52.6
EMANet | ResNet-101 | 53.1
6.2.3 Comparisons with State-of-the-arts
We first thoroughly compare EMANet with three baselines, namely DeeplabV3, DeeplabV3+ and PSANet, on the PASCAL VOC validation set. We report mIoU, FLOPs, memory cost and parameter numbers in Tab. 1. EMANet outperforms all three baselines by a large margin, while being much lighter in computation and memory.
We further compare our method with existing methods on the PASCAL VOC test set. Following previous methods [4, 5], we train EMANet successively on COCO, the VOC trainaug set and the VOC trainval set, with base learning rates of 0.009, 0.001 and 0.0001, respectively. We train 150K iterations on COCO, and 30K for each of the last two rounds. For inference on the test set, we make use of multi-scale testing and left-right flipping.
As shown in Tab. 2, our EMANet sets a new record on PASCAL VOC, improving over DeeplabV3 [4] with the same backbone by 2.0% mIoU. EMANet achieves the best performance among networks with a ResNet-101 backbone and outperforms the previous best by 0.9%, which is significant given how competitive this benchmark is. Moreover, it achieves performance comparable with methods based on larger backbones.
6.3 Results on the PASCAL Context Dataset
To verify the generalization of our proposed EMANet, we conduct experiments on the PASCAL Context dataset. Quantitative results are shown in Tab. 3. Notably, our EMANet based on ResNet-50 even outperforms MSCI [20] based on ResNet-152, which further shows the effectiveness of the proposed EMAU. Moreover, to the best of our knowledge, EMANet based on ResNet-101 achieves the highest performance on the PASCAL Context dataset. Even pretrained on additional data (COCO Stuff), SGR+ is still inferior to EMANet.
6.4 Results on the COCO Stuff Dataset
To further evaluate the effectiveness of our method, we also carry out experiments on the COCO Stuff dataset. Comparisons with previous state-of-the-art methods are shown in Tab. 4. Remarkably, EMANet achieves 39.9% mIoU and outperforms previous methods by a large margin.
6.5 Visualization of Bases Responsibilities
To gain a deeper understanding of our proposed EMAU, we visualize the iterated responsibility maps in Fig. 5. For each image, we randomly select four bases and show their corresponding responsibilities over all pixels in the last iteration. Obviously, each basis corresponds to an abstract concept in the image. As the iterations of $A_E$ and $A_M$ progress, the abstract concepts become more compact and clear. As we can see, these bases converge to specific semantics rather than merely foreground and background. Concretely, the bases in the first two rows focus on specific semantics such as the human, the wine glass, the cutlery and the profile. The bases in the last two rows focus on the sailboat, the mountain, the airplane and the lane.
7 Conclusion
In this paper, we propose a new type of attention mechanism, namely the expectation-maximization attention (EMA), which computes a more compact basis set by iteratively executing E and M steps as in the EM algorithm. The reconstructed output of EMA is low-rank and robust to the variance of the input. We formulate the proposed method as a lightweight module that can easily be inserted into existing CNNs with little overhead. Extensive experiments on a number of benchmark datasets demonstrate the effectiveness and efficiency of the proposed EMAU.
Acknowledgment
Zhouchen Lin is supported by National Basic Research Program of China (973 Program) (Grant no. 2015CB352502), National Natural Science Foundation (NSF) of China (Grant nos. 61625301 and 61731018), Qualcomm and Microsoft Research Asia. Hong Liu is supported by National Natural Science Foundation of China (Grant nos. U1613209 and 61673030) and funds from Shenzhen Key Laboratory for Intelligent Multimedia and Virtual Reality (ZDSYS201703031405467).
References
 [1] J. L. Ba, J. R. Kiros, and G. E. Hinton. Layer normalization. arXiv preprint arXiv:1607.06450, 2016.
 [2] D. Bahdanau, K. Cho, and Y. Bengio. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014.
 [3] H. Caesar, J. Uijlings, and V. Ferrari. COCO-Stuff: Thing and stuff classes in context. In CVPR, pages 1209–1218, 2018.
 [4] L.C. Chen, G. Papandreou, F. Schroff, and H. Adam. Rethinking atrous convolution for semantic image segmentation. arXiv preprint arXiv:1706.05587, 2017.
 [5] L.C. Chen, Y. Zhu, G. Papandreou, F. Schroff, and H. Adam. Encoderdecoder with atrous separable convolution for semantic image segmentation. In ECCV, 2018.
 [6] Y. Chen, Y. Kalantidis, J. Li, S. Yan, and J. Feng. A²-Nets: Double attention networks. In NeurIPS, pages 350–359, 2018.
 [7] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the em algorithm. Journal of the Royal Statistical Society: Series B (Methodological), 39(1):1–22, 1977.
 [8] H. Ding, X. Jiang, B. Shuai, A. Qun Liu, and G. Wang. Context contrasted feature and gated multiscale aggregation for scene segmentation. In CVPR, pages 2393–2402, 2018.
 [9] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman. The pascal visual object classes (voc) challenge. International journal of computer vision, 88(2):303–338, 2010.
 [10] E. W. Forgy. Cluster analysis of multivariate data: efficiency versus interpretability of classifications. biometrics, 21:768–769, 1965.
 [11] J. Fu, J. Liu, H. Tian, Y. Li, Y. Bao, Z. Fang, and H. Lu. Dual attention network for scene segmentation. In CVPR, pages 3146–3154, 2019.
 [12] J. Fu, J. Liu, Y. Wang, J. Zhou, C. Wang, and H. Lu. Stacked deconvolutional network for semantic segmentation. IEEE Transactions on Image Processing, 2019.
 [13] K. He, X. Zhang, S. Ren, and J. Sun. Delving deep into rectifiers: Surpassing humanlevel performance on imagenet classification. In ICCV, pages 1026–1034, 2015.
 [14] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, pages 770–778, 2016.
 [15] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger. Densely connected convolutional networks. In CVPR, pages 4700–4708, 2017.
 [16] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.
 [17] X. Li, J. Wu, Z. Lin, H. Liu, and H. Zha. Recurrent squeezeandexcitation context aggregation net for single image deraining. In ECCV, pages 254–269, 2018.
 [18] X. Liang, Z. Hu, H. Zhang, L. Lin, and E. P. Xing. Symbolic graph reasoning meets convolutions. In NeurIPS, pages 1858–1868, 2018.
 [19] X. Liang, H. Zhou, and E. Xing. Dynamicstructured semantic propagation network. In CVPR, pages 752–761, 2018.
 [20] D. Lin, Y. Ji, D. Lischinski, D. CohenOr, and H. Huang. Multiscale context intertwining for semantic segmentation. In ECCV, pages 603–619, 2018.
 [21] G. Lin, A. Milan, C. Shen, and I. Reid. Refinenet: Multipath refinement networks for highresolution semantic segmentation. In CVPR, pages 1925–1934, 2017.
 [22] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In CVPR, pages 3431–3440, 2015.
 [23] P. Luo, G. Wang, L. Lin, and X. Wang. Deep dual learning for semantic image segmentation. In ICCV, pages 2718–2726, 2017.
 [24] R. Mottaghi, X. Chen, X. Liu, N.G. Cho, S.W. Lee, S. Fidler, R. Urtasun, and A. Yuille. The role of context for object detection and semantic segmentation in the wild. In CVPR, pages 891–898, 2014.
 [25] C. Peng, X. Zhang, G. Yu, G. Luo, and J. Sun. Large kernel matters–improve semantic segmentation by global convolutional network. In CVPR, pages 4353–4361, 2017.
 [26] S. Richardson and P. J. Green. On bayesian analysis of mixtures with an unknown number of components (with discussion). Journal of the Royal Statistical Society: series B (statistical methodology), 59(4):731–792, 1997.
 [27] O. Ronneberger, P. Fischer, and T. Brox. U-Net: Convolutional networks for biomedical image segmentation. In MICCAI, pages 234–241. Springer, 2015.
 [28] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, et al. Imagenet large scale visual recognition challenge. IJCV, 115(3):211–252, 2015.
 [29] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin. Attention is all you need. In NeurIPS, pages 5998–6008, 2017.
 [30] G. Wang, P. Luo, L. Lin, and X. Wang. Learning object interactions and descriptions for semantic image segmentation. In CVPR, pages 5859–5867, 2017.
 [31] X. Wang, R. Girshick, A. Gupta, and K. He. Nonlocal neural networks. In CVPR, pages 7794–7803, 2018.
 [32] Z. Wu, C. Shen, and A. Van Den Hengel. Wider or deeper: Revisiting the resnet model for visual recognition. Pattern Recognition, 90:119–133, 2019.
 [33] Y. Yang, Z. Zhong, T. Shen, and Z. Lin. Convolutional neural networks with alternately updated clique. In CVPR, pages 2413–2422, 2018.
 [34] C. Yu, J. Wang, C. Peng, C. Gao, G. Yu, and N. Sang. Learning a discriminative feature network for semantic segmentation. In CVPR, pages 1857–1866, 2018.
 [35] H. Zhang, K. Dana, J. Shi, Z. Zhang, X. Wang, A. Tyagi, and A. Agrawal. Context encoding for semantic segmentation. In CVPR, pages 7151–7160, 2018.
 [36] Z. Zhang, X. Zhang, C. Peng, X. Xue, and J. Sun. Exfuse: Enhancing feature fusion for semantic segmentation. In ECCV, pages 269–284, 2018.
 [37] H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia. Pyramid scene parsing network. In CVPR, pages 2881–2890, 2017.
 [38] H. Zhao, Y. Zhang, S. Liu, J. Shi, C. Change Loy, D. Lin, and J. Jia. Psanet: Pointwise spatial attention network for scene parsing. In ECCV, pages 267–283, 2018.