Benchmarks for Semantic Segmentation.
The self-attention mechanism has been widely used for various tasks. It is designed to compute the representation of each position by a weighted sum of the features at all positions. Thus, it can capture long-range relations for computer vision tasks. However, it is computationally expensive, since the attention maps are computed w.r.t. all other positions. In this paper, we formulate the attention mechanism in an expectation-maximization manner and iteratively estimate a much more compact set of bases upon which the attention maps are computed. By a weighted summation upon these bases, the resulting representation is low-rank and suppresses noisy information from the input. The proposed Expectation-Maximization Attention (EMA) module is robust to the variance of input and is also friendly in memory and computation. Moreover, we set up the bases maintenance and normalization methods to stabilize its training procedure. We conduct extensive experiments on popular semantic segmentation benchmarks including PASCAL VOC, PASCAL Context and COCO Stuff, on which we set new records.
Semantic segmentation is a fundamental and challenging problem of computer vision, whose goal is to assign a semantic category to each pixel of the image. It is critical for various tasks such as autonomous driving, image editing and robot sensing. In order to accomplish the semantic segmentation task effectively, we need to distinguish some confusing categories and take the appearance of different objects into account. For example, ‘grass’ and ‘ground’ have similar colors in some cases, and ‘person’ may have various scales, figures and clothes in different locations of the image. Meanwhile, the label space of the output is quite compact and the number of categories for a specific dataset is limited. Therefore, this task can be treated as projecting data points in a high-dimensional noisy space into a compact sub-space. The essence lies in de-noising these variations and capturing the most important semantic concepts.
Recently, many state-of-the-art methods based on fully convolutional networks (FCNs) have been proposed to address the above issues. Due to their fixed geometric structures, they are inherently limited by local receptive fields and short-range contextual information. To capture long-range dependencies, several works employ multi-scale context fusion, such as atrous convolution, spatial pyramids, large kernel convolution and so on. Moreover, to keep more detailed information, the encoder-decoder structures [34, 5] are proposed to fuse mid-level and high-level semantic features. To aggregate information from all spatial locations, the attention mechanism [29, 38, 31] is used, which enables the feature of a single pixel to fuse information from all other positions. However, the original attention-based methods need to generate a large attention map, which has high computational complexity and occupies a large amount of GPU memory. The bottleneck is that both the generation of the attention map and its usage are computed w.r.t. all positions.
To address the above issues, in this paper we rethink the attention mechanism from the view of the expectation-maximization (EM) algorithm and propose a novel attention-based method, namely Expectation-Maximization Attention (EMA). Instead of treating all pixels themselves as the reconstruction bases [38, 31], we use the EM algorithm to find a more compact basis set, which can largely reduce the computational complexity. In detail, we regard the bases for reconstruction as the parameters to learn in the EM algorithm and the attention maps as latent variables. In this setting, the EM algorithm aims to find a maximum likelihood estimate of the parameters (bases). Given the current parameters, the expectation (E) step estimates the expectation of the attention map, and the maximization (M) step updates the parameters (bases) by maximizing the complete data likelihood. The E step and the M step execute alternately. After convergence, the output can be computed as the weighted sum of bases, where the weights are the normalized final attention maps. The pipeline of EMA is shown in Fig. 1.
We further embed the proposed EMA method into a module for neural networks, which is named the EMA Unit. The EMA Unit can be implemented simply with common operators. It is also lightweight and can be easily embedded into existing neural networks. Moreover, to make full use of its capacity, we propose two more methods to stabilize the training process of the EMA Unit. We also evaluate its performance on three challenging datasets.
The main contributions of this paper are listed as follows:
We reformulate the self-attention mechanism into an expectation-maximization iteration manner, which can learn a more compact basis set and largely reduce the computational complexity. To the best of our knowledge, this paper is the first to introduce EM iterations into the attention mechanism.
We build the proposed expectation-maximization attention as a lightweight module for neural networks and set up specific schemes for the maintenance and normalization of the bases.
Extensive experiments on three challenging semantic segmentation datasets, including PASCAL VOC, PASCAL Context and COCO Stuff, demonstrate the superiority of our approach over other state-of-the-art methods.
Semantic segmentation. Fully convolutional network (FCN) based methods have made great progress in image semantic segmentation by leveraging the powerful convolutional features of classification networks [14, 15, 33] pre-trained on large-scale data. Several model variants are proposed to enhance the multi-scale contextual aggregation. For example, DeeplabV2 makes use of atrous spatial pyramid pooling (ASPP) to embed contextual information, which consists of parallel dilated convolutions with different dilation rates. DeeplabV3 extends ASPP with image-level features to further capture global contexts. Meanwhile, PSPNet proposes a pyramid pooling module to collect contextual information of different scales. GCN adopts decoupling of large kernel convolution to gain a large receptive field for the feature map and capture long-range information.
The other type of variants mainly focuses on predicting more detailed output. These methods are based on U-Net, which combines the advantages of high-level features with mid-level features. RefineNet makes use of the Laplacian image pyramid to explicitly capture the information available along the down-sampling process and output predictions from coarse to fine. DeeplabV3+ adds a decoder upon DeeplabV3 to refine the segmentation results, especially along object boundaries. Exfuse proposes a new framework to bridge the gap between low-level and high-level features and thus improves the segmentation quality.
Attention model. Attention is widely used for various tasks such as machine translation, visual question answering and video classification. The self-attention methods [2, 29] calculate the context coding at one position by a weighted summation of embeddings at all positions in sentences. Non-local first adopts the self-attention mechanism as a module for computer vision tasks, such as video classification, object detection and instance segmentation. PSANet learns to aggregate contextual information for each position via a predicted attention map. $A^2$-Net proposes the double attention block to distribute and gather informative global features from the entire spatio-temporal space of the images. DANet applies both spatial and channel attention to gather information around the feature maps, which costs even more computation and memory than the Non-local method.
Our approach is motivated by the success of attention in the above works. We rethink the attention mechanism from the view of the EM algorithm and compute the attention map iteratively, as in the EM algorithm.
Before introducing our proposed method, we first review three closely related methods, namely the EM algorithm, the Gaussian mixture model and the Non-local module.
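The expectation-maximization (EM) algorithm aims to find the maximum likelihood solution for latent variable models. Denote $\mathbf{X} = \{\mathbf{x}_1, \ldots, \mathbf{x}_N\}$ as the data set which consists of $N$ observed samples, where each data point $\mathbf{x}_i$ has its corresponding latent variable $\mathbf{z}_i$. We call $\{\mathbf{X}, \mathbf{Z}\}$ the complete data, and its likelihood function takes the form $\ln p(\mathbf{X}, \mathbf{Z} \mid \boldsymbol{\theta})$, where $\boldsymbol{\theta}$ is the set of all parameters of the model. In practice, the only knowledge of the latent variables in $\mathbf{Z}$ is given by the posterior distribution $p(\mathbf{Z} \mid \mathbf{X}, \boldsymbol{\theta})$. The EM algorithm is designed to maximize the likelihood $\ln p(\mathbf{X}, \mathbf{Z} \mid \boldsymbol{\theta})$ by two steps, i.e., the E step and the M step.
In the E step, we use the current parameters $\boldsymbol{\theta}^{old}$ to find the posterior distribution of $\mathbf{Z}$ given by $p(\mathbf{Z} \mid \mathbf{X}, \boldsymbol{\theta}^{old})$. Then we use the posterior distribution to find the expectation of the complete data likelihood $\mathcal{Q}(\boldsymbol{\theta}, \boldsymbol{\theta}^{old})$, which is given by:
$$\mathcal{Q}(\boldsymbol{\theta}, \boldsymbol{\theta}^{old}) = \sum_{\mathbf{Z}} p(\mathbf{Z} \mid \mathbf{X}, \boldsymbol{\theta}^{old}) \ln p(\mathbf{X}, \mathbf{Z} \mid \boldsymbol{\theta}). \tag{1}$$
Then in the M step, the revised parameter $\boldsymbol{\theta}^{new}$ is determined by maximizing the function:
$$\boldsymbol{\theta}^{new} = \arg\max_{\boldsymbol{\theta}} \mathcal{Q}(\boldsymbol{\theta}, \boldsymbol{\theta}^{old}). \tag{2}$$
The EM algorithm executes the E step and the M step alternately until the convergence criterion is satisfied.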
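The Gaussian mixture model (GMM) is a classic use case of the EM algorithm. It takes the distribution of data $\mathbf{x}_n$ as a linear superposition of Gaussians:
$$p(\mathbf{x}_n) = \sum_{k=1}^{K} z_{nk}\,\mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k), \tag{3}$$
where the mean $\boldsymbol{\mu}_k$ and the covariance $\boldsymbol{\Sigma}_k$ are parameters for the $k$-th Gaussian basis. The likelihood of the complete data is formulated as:
$$\ln p(\mathbf{X}, \mathbf{Z} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma}) = \sum_{n=1}^{N} \ln\left[\sum_{k=1}^{K} z_{nk}\,\mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)\right], \tag{4}$$
where $\sum_{k} z_{nk} = 1$. $z_{nk}$ can be viewed as the responsibility that the $k$-th basis takes for the observation $\mathbf{x}_n$. For GMM, in the E step, the expected value of $z_{nk}$ is given by:
$$z_{nk}^{new} = \frac{\mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)}{\sum_{j=1}^{K} \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j)}. \tag{5}$$
In the M step, the parameters are re-estimated as follows:
$$\boldsymbol{\mu}_k^{new} = \frac{1}{N_k}\sum_{n=1}^{N} z_{nk}^{new}\,\mathbf{x}_n, \qquad \boldsymbol{\Sigma}_k^{new} = \frac{1}{N_k}\sum_{n=1}^{N} z_{nk}^{new}\,(\mathbf{x}_n - \boldsymbol{\mu}_k^{new})(\mathbf{x}_n - \boldsymbol{\mu}_k^{new})^{\top}, \tag{6}$$
where
$$N_k = \sum_{n=1}^{N} z_{nk}^{new}. \tag{7}$$
After the convergence of the GMM parameters, the re-estimated $\mathbf{x}_n^{new}$ can be formulated as:
$$\mathbf{x}_n^{new} = \sum_{k=1}^{K} z_{nk}^{new}\,\boldsymbol{\mu}_k^{new}. \tag{8}$$
In real applications, we can simply replace $\boldsymbol{\Sigma}_k$ with the identity matrix $\mathbf{I}$ and leave out the $\boldsymbol{\Sigma}_k$ in the above equations.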
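The E step, M step and re-estimation above can be sketched in a few lines of NumPy for the identity-covariance case. This is only an illustrative sketch: the function name `gmm_em_identity`, the toy two-cluster data, the iteration count and the small `eps` guard are assumptions made for the example, not part of the original method description.

```python
import numpy as np

def gmm_em_identity(x, mu, num_iters=10, eps=1e-12):
    """EM for a mixture of unit-covariance Gaussians (Sigma_k = I).

    x:  (N, D) data points; mu: (K, D) initial means.
    Returns responsibilities z (N, K), updated means mu (K, D),
    and the re-estimated data x_new (N, D).
    """
    for _ in range(num_iters):
        # E step: z[n, k] is proportional to N(x_n | mu_k, I),
        # i.e. to exp(-0.5 * ||x_n - mu_k||^2).
        sq_dist = ((x[:, None, :] - mu[None, :, :]) ** 2).sum(axis=-1)
        logits = -0.5 * sq_dist
        logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
        z = np.exp(logits)
        z = z / z.sum(axis=1, keepdims=True)
        # M step: each mean becomes the responsibility-weighted average.
        n_k = z.sum(axis=0)
        mu = (z.T @ x) / (n_k[:, None] + eps)
    # Re-estimation: every point is rebuilt as a mixture of the means.
    x_new = z @ mu
    return z, mu, x_new


# Toy data (hypothetical): two well-separated 2-D clusters.
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0.0, 1.0, (50, 2)), rng.normal(10.0, 1.0, (50, 2))])
z, mu, x_new = gmm_em_identity(x, mu=np.array([[1.0, 1.0], [9.0, 9.0]]))
```

With clusters this far apart the responsibilities are nearly hard assignments, so each recovered mean sits very close to the sample mean of its cluster.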
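The Non-local module functions the same as the self-attention mechanism. It can be formulated as:
$$\mathbf{y}_i = \frac{1}{C(\mathbf{x}_i)} \sum_{j} f(\mathbf{x}_i, \mathbf{x}_j)\, g(\mathbf{x}_j), \tag{9}$$
where $f(\cdot, \cdot)$ represents a general kernel function, $g(\cdot)$ is the embedding applied to the input at each position, $C(\mathbf{x}_i)$ is a normalization factor and $\mathbf{x}_i$ denotes the feature vector for the location $i$. This module is applied upon the feature maps of convolutional neural networks (CNNs).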
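As a rough illustration of the Non-local formulation, the sketch below uses an exponential inner-product kernel and an identity embedding. Both choices, and the name `non_local`, are assumptions for the example; the actual module uses learned embeddings. The point it makes is that the attention map has one row and one column per position.

```python
import numpy as np

def non_local(x):
    """Non-local re-estimation with kernel exp(x_i . x_j) and identity embedding.

    x: (N, C) feature vectors, one row per spatial position.
    Returns the output (N, C) and the (N, N) attention map.
    """
    logits = x @ x.T                                       # pairwise inner products
    logits = logits - logits.max(axis=1, keepdims=True)    # numerical stability
    attn = np.exp(logits)
    attn = attn / attn.sum(axis=1, keepdims=True)          # division by the normalization factor
    return attn @ x, attn


x = np.random.default_rng(0).normal(size=(6, 4))   # toy sizes: N = 6 positions, C = 4 channels
y, attn = non_local(x)
```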
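Considering that $\mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)$ in Eq. (5) is a specific kernel function between $\mathbf{x}_n$ and $\boldsymbol{\mu}_k$, Eq. (8) is just a specific design of Eq. (9). Then, from the viewpoint of GMM, the Non-local module is just a re-estimation of $\mathbf{X}$, without E steps and M steps. Specifically, $\boldsymbol{\mu}$ is just selected as the $\mathbf{X}$ in Non-local.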
In GMM, the number of Gaussian bases $K$ is selected manually and usually satisfies $K \ll N$. But in the Non-local module, the bases are selected as the data themselves, so it has $K = N$. There are two obvious disadvantages of the Non-local module. First, the data lie on a low-dimensional manifold, so the bases are over-complete. Second, the computation overhead is heavy and the memory cost is also large.
In view of the high computational complexity of the attention mechanism and limitations of the Non-local module, we first propose the expectation-maximization attention (EMA) method, which is an augmented version of self-attention. Unlike the Non-local module that selects all data points as bases, we use the EM iterations to find a compact basis set.
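For simplicity, we consider an input feature map $\mathbf{X}$ of size $C \times H \times W$ from a single sample. $\mathbf{X}$ is the intermediate activation of a CNN. To simplify the symbols, we reshape $\mathbf{X}$ into $N \times C$, where $N = H \times W$, and $\mathbf{x}_i \in \mathbb{R}^{C}$ indexes the $C$-dimensional feature vector at pixel $i$. Our proposed EMA consists of three operations: responsibility estimation ($\mathcal{A}_E$), likelihood maximization ($\mathcal{A}_M$) and data re-estimation ($\mathcal{A}_R$). Briefly, given the input $\mathbf{X} \in \mathbb{R}^{N \times C}$ and the initial bases $\boldsymbol{\mu} \in \mathbb{R}^{K \times C}$, $\mathcal{A}_E$ estimates the latent variables (or the ‘responsibility’) $\mathbf{Z} \in \mathbb{R}^{N \times K}$, so it functions as the E step in the EM algorithm. $\mathcal{A}_M$ uses the estimation to update the bases $\boldsymbol{\mu}$, which works as the M step. The $\mathcal{A}_E$ and $\mathcal{A}_M$ steps execute alternately for a pre-specified number of iterations. Then, with the converged $\boldsymbol{\mu}$ and $\mathbf{Z}$, $\mathcal{A}_R$ reconstructs the original $\mathbf{X}$ as $\tilde{\mathbf{X}}$ and outputs it.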
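It has been proved that, with the iteration of EM steps, the complete data likelihood $\ln p(\mathbf{X}, \mathbf{Z})$ will increase monotonically. As $\ln p(\mathbf{X})$ can be estimated by marginalizing $\ln p(\mathbf{X}, \mathbf{Z})$ with $\mathbf{Z}$, maximizing $\ln p(\mathbf{X}, \mathbf{Z})$ is a proxy of maximizing $\ln p(\mathbf{X})$. Therefore, with the iterations of $\mathcal{A}_E$ and $\mathcal{A}_M$, the updated $\mathbf{Z}$ and $\boldsymbol{\mu}$ have better ability to reconstruct the original data $\mathbf{X}$. The reconstructed $\tilde{\mathbf{X}}$ can capture important semantics from $\mathbf{X}$ as much as possible.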
Moreover, compared with the Non-local module, EMA finds a compact set of bases for pixels of an input image. The compactness is non-trivial. Since $K \ll N$, $\tilde{\mathbf{X}}$ lies in a subspace of $\mathbf{X}$. This mechanism removes much unnecessary noise and makes the final classification of each pixel more tractable. Moreover, this operation reduces the complexity (both in space and time) from $O(N^2)$ to $O(NKT)$, where $T$ is the number of iterations for $\mathcal{A}_E$ and $\mathcal{A}_M$. The convergence of the EM algorithm is also guaranteed. Notably, EMA takes only three iterations to get promising results in our experiments. So $T$ can be treated as a small constant, which means that the complexity is only $O(NK)$.
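To get a feel for the size of this reduction, the following back-of-the-envelope sketch counts attention-map entries for both designs. The feature-map size (65 x 65) and the basis count (64) are hypothetical values picked for illustration; only the three iterations come from the text above.

```python
def non_local_map_entries(n):
    """Attention-map entries when every pixel is a basis (K = N)."""
    return n * n


def ema_map_entries(n, k, t):
    """Entries computed over t rounds of E/M steps with k bases."""
    return n * k * t


n = 65 * 65          # hypothetical 65 x 65 feature map
k, t = 64, 3         # hypothetical basis count; three iterations as in the text
dense = non_local_map_entries(n)
compact = ema_map_entries(n, k, t)
print(dense, compact, dense // compact)
```

Under these assumed sizes the dense map is more than twenty times larger than all the compact maps computed across the iterations combined, and the gap widens as the feature map grows.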
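Responsibility estimation ($\mathcal{A}_E$) functions as the E step in the EM algorithm. This step computes the expected value of $z_{nk}$, which corresponds to the responsibility of the $k$-th basis $\boldsymbol{\mu}_k$ to $\mathbf{x}_n$, where $1 \le k \le K$ and $1 \le n \le N$. We formulate the posterior probability of $\mathbf{x}_n$ given $\boldsymbol{\mu}_k$ as follows:
$$p(\mathbf{x}_n \mid \boldsymbol{\mu}_k) = \mathcal{K}(\mathbf{x}_n, \boldsymbol{\mu}_k), \tag{10}$$
where $\mathcal{K}$ represents the general kernel function. And now, Eq. (5) can be reformulated into a more general form:
$$z_{nk} = \frac{\mathcal{K}(\mathbf{x}_n, \boldsymbol{\mu}_k)}{\sum_{j=1}^{K} \mathcal{K}(\mathbf{x}_n, \boldsymbol{\mu}_j)}. \tag{11}$$
There are several choices for $\mathcal{K}(\mathbf{a}, \mathbf{b})$, such as inner dot $\mathbf{a}^{\top}\mathbf{b}$, exponential inner dot $\exp(\mathbf{a}^{\top}\mathbf{b})$, Euclidean distance $\|\mathbf{a} - \mathbf{b}\|_2^2 / 2$, RBF kernel $\exp\left[-\|\mathbf{a} - \mathbf{b}\|_2^2 / 2\right]$ and so on. As compared in the Non-local module, the choice of these functions makes trivial differences in the final results. So we simply take the exponential inner dot $\exp(\mathbf{a}^{\top}\mathbf{b})$ in our paper. In experiments, Eq. (11) can be implemented as a matrix multiplication plus one softmax layer. In conclusion, the operation of $\mathcal{A}_E$ in the $t$-th iteration is formulated as:
$$\mathbf{Z}^{(t)} = \mathrm{softmax}\left(\lambda\, \mathbf{X} \left(\boldsymbol{\mu}^{(t-1)}\right)^{\top}\right), \tag{12}$$
where $\lambda$ is a hyper-parameter to control the distribution of $\mathbf{Z}$.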
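The "matrix multiplication plus one softmax" description of the E step can be sketched as follows. The function name `responsibility_estimation`, the toy shapes and the two example values of the sharpness parameter are assumptions for illustration only.

```python
import numpy as np

def responsibility_estimation(x, mu, lam=1.0):
    """E step: softmax over the K bases of lam * (X @ mu^T).

    x: (N, C) pixel features; mu: (K, C) bases. Returns z of shape (N, K).
    """
    logits = lam * (x @ mu.T)
    logits = logits - logits.max(axis=1, keepdims=True)   # stabilizes the exponential
    z = np.exp(logits)
    return z / z.sum(axis=1, keepdims=True)


rng = np.random.default_rng(0)
x = rng.normal(size=(5, 8))      # toy sizes: N = 5 pixels, C = 8 channels
mu = rng.normal(size=(3, 8))     # K = 3 bases
z_soft = responsibility_estimation(x, mu, lam=0.1)
z_sharp = responsibility_estimation(x, mu, lam=10.0)
```

A larger sharpness value concentrates each pixel's responsibility on fewer bases, which is the sense in which this hyper-parameter controls the distribution of the responsibilities.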
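Likelihood maximization ($\mathcal{A}_M$) works as the EM algorithm’s M step. With the estimated $\mathbf{Z}$, $\mathcal{A}_M$ updates $\boldsymbol{\mu}$ by maximizing the complete data likelihood. To keep the bases lying in the same embedding space as $\mathbf{X}$, we update the bases $\boldsymbol{\mu}$ using the weighted summation of $\mathbf{X}$. So $\boldsymbol{\mu}_k$ is updated as
$$\boldsymbol{\mu}_k^{(t)} = \frac{\sum_{n=1}^{N} z_{nk}^{(t)}\, \mathbf{x}_n}{\sum_{m=1}^{N} z_{mk}^{(t)}} \tag{13}$$
in the $t$-th iteration of $\mathcal{A}_M$.
It is noteworthy that if we set $\lambda \to \infty$ in Eq. (12), then $\{z_{n1}, z_{n2}, \ldots, z_{nK}\}$ will become a one-hot embedding. In this situation, each pixel is assigned to only one basis, and the basis is updated by the average of those pixels assigned to it. This is what the K-means clustering algorithm does. So the iterations of $\mathcal{A}_E$ and $\mathcal{A}_M$ can also be viewed as a soft version of K-means clustering.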
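The K-means connection can be checked numerically on a tiny example. In the sketch below, a very large sharpness value stands in for the infinite limit, and the four 2-D points, the function name `likelihood_maximization` and the `eps` guard are all illustrative assumptions.

```python
import numpy as np

def likelihood_maximization(x, z, eps=1e-12):
    """M step: each basis is the responsibility-weighted average of the pixels."""
    return (z.T @ x) / (z.sum(axis=0)[:, None] + eps)


x = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
mu = np.array([[1.0, 0.0], [0.0, 1.0]])

# A very large sharpness value drives the softmax responsibilities to one-hot rows.
lam = 1000.0
logits = lam * (x @ mu.T)
logits = logits - logits.max(axis=1, keepdims=True)
z = np.exp(logits)
z = z / z.sum(axis=1, keepdims=True)

# With one-hot responsibilities the update is the per-cluster mean, as in K-means.
mu_new = likelihood_maximization(x, z)
```

Here the first two points are assigned to the first basis and the last two to the second, so each updated basis is simply the mean of its two assigned points.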
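EMA runs $\mathcal{A}_E$ and $\mathcal{A}_M$ alternately for $T$ times. After that, the final $\boldsymbol{\mu}^{(T)}$ and $\mathbf{Z}^{(T)}$ are used to re-estimate $\mathbf{X}$. We adopt Eq. (8) to construct the new $\mathbf{X}$, namely $\tilde{\mathbf{X}}$, which is formulated as:
$$\tilde{\mathbf{X}} = \mathbf{Z}^{(T)} \boldsymbol{\mu}^{(T)}. \tag{14}$$
As $\tilde{\mathbf{X}}$ is constructed from a compact basis set, it has the low-rank property compared with the input $\mathbf{X}$. We depict an example of $\tilde{\mathbf{X}}$ in Fig. 2. It is clear that the $\tilde{\mathbf{X}}$ output by $\mathcal{A}_R$ is very compact in the feature space and that the feature variance inside the object is smaller than that of the input.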
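Putting the three operations together, a minimal NumPy sketch of the whole EMA procedure might look as follows. The names `softmax_rows` and `ema`, the random inputs, the toy sizes and the `eps` guard are assumptions; bases maintenance and normalization, described later, are deliberately left out. The final lines check the low-rank claim: the rank of the output is bounded by the number of bases.

```python
import numpy as np

def softmax_rows(a):
    a = a - a.max(axis=1, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=1, keepdims=True)


def ema(x, mu, num_iters=3, lam=1.0, eps=1e-12):
    """Alternate E and M steps num_iters times, then re-estimate x from the bases."""
    for _ in range(num_iters):
        z = softmax_rows(lam * (x @ mu.T))                  # E step, shape (N, K)
        mu = (z.T @ x) / (z.sum(axis=0)[:, None] + eps)     # M step, shape (K, C)
    x_tilde = z @ mu                                        # re-estimation, shape (N, C)
    return x_tilde, z, mu


rng = np.random.default_rng(0)
n, c, k = 100, 16, 4                 # toy sizes, chosen only for illustration
x = rng.normal(size=(n, c))
mu0 = rng.normal(size=(k, c))
x_tilde, z, mu = ema(x, mu0)
rank_in = np.linalg.matrix_rank(x)
rank_out = np.linalg.matrix_rank(x_tilde)
```

Because the output is a product of an N x K matrix and a K x C matrix, its rank cannot exceed K, whereas the random input here has full column rank.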
In order to better incorporate the proposed EMA with deep neural networks, we further propose the Expectation-Maximization Attention Unit (EMAU) and apply it to the semantic segmentation task. In this section, we describe EMAU in detail. We first introduce the overall structure of EMAU and then discuss the maintenance and normalization mechanisms for the bases.
The overall structure of EMAU is shown in Fig. 2. EMAU looks like the bottleneck of ResNet at first glance, except that it replaces the heavy $3 \times 3$ convolution with the EMA operations. The first $1 \times 1$ convolution without the ReLU activation is prepended to transform the value range of the input from $[0, +\infty)$ to $(-\infty, +\infty)$. This transformation is very important, or the estimated $\boldsymbol{\mu}^{(T)}$ will also lie in $[0, +\infty)$, which halves the capacity compared with general convolution parameters. The last $1 \times 1$ convolution is inserted to transform the re-estimated $\tilde{\mathbf{X}}$ into the residual space of $\mathbf{X}$.
For each of the $\mathcal{A}_E$, $\mathcal{A}_M$ and $\mathcal{A}_R$ steps, the computation complexity is $O(NKC)$. As we set $K \ll C$, several iterations of $\mathcal{A}_E$ and $\mathcal{A}_M$ plus one $\mathcal{A}_R$ are of just the same magnitude as a $1 \times 1$ convolution with input and output channel numbers all being $C$. Adding the extra computation from two $1 \times 1$ convolutions, the whole FLOPs of EMAU is around $1/3$ of a module running $3 \times 3$ convolutions with the same number of input and output channels. Moreover, the parameters maintained by EMA just count to $KC$.
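The cost argument can be made concrete with a small per-pixel multiply-add count. The basis count (64), channel count (512) and iteration count (3) below are hypothetical values chosen so that the number of bases is much smaller than the number of channels; the helper names are likewise illustrative.

```python
def ema_mults_per_pixel(k, c, t):
    """Multiply-adds per pixel for t rounds of E and M steps plus one re-estimation.

    Each of the three operations is one matrix product costing k * c per pixel.
    """
    return (2 * t + 1) * k * c


def conv_mults_per_pixel(c_in, c_out, kernel):
    """Multiply-adds per output pixel for a dense kernel x kernel convolution."""
    return kernel * kernel * c_in * c_out


k, c, t = 64, 512, 3                                   # hypothetical sizes
ema_cost = ema_mults_per_pixel(k, c, t)
one_by_one = conv_mults_per_pixel(c, c, 1)
emau_cost = ema_cost + 2 * one_by_one                  # EMA plus the two surrounding convolutions
ratio = emau_cost / conv_mults_per_pixel(c, c, 3)
print(ema_cost, one_by_one, round(ratio, 3))
```

Under these assumed sizes the EMA iterations cost slightly less than a single pointwise convolution, and the whole unit costs roughly a third of one dense three-by-three convolution at the same width.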
Another issue for the EM algorithm is the initialization of the bases. The EM algorithm is guaranteed to converge, because the likelihood of the complete data is bounded, and at each iteration both the E and M steps lift its current lower bound. However, convergence to the global maximum is not guaranteed. Thus, the initial values of the bases before the iterations are of great importance.
Above, we only describe how EMA is used to process one image. However, for a computer vision task, there are thousands of images in a dataset. As each image has a different pixel feature distribution from the others, it is not suitable to use the $\boldsymbol{\mu}$ computed upon one image to reconstruct the feature maps of other images. So we run EMA on each image.
For the first mini-batch, we initialize $\boldsymbol{\mu}^{(0)}$ using Kaiming’s initialization, where we treat matrix multiplication as a $1 \times 1$ convolution. For the following mini-batches, one simple choice is to update $\boldsymbol{\mu}^{(0)}$ using standard back-propagation. However, as the iterations of $\mathcal{A}_E$ and $\mathcal{A}_M$ can be unrolled as a recurrent neural network (RNN), the gradients propagating through them will encounter the vanishing or explosion problem. Therefore, the updating of $\boldsymbol{\mu}^{(0)}$ is unstable, and the training procedure of the EMA Unit may collapse.
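In this paper, we use moving averaging to update $\boldsymbol{\mu}^{(0)}$ in the training process. After iterating over an image, the generated $\boldsymbol{\mu}^{(T)}$ can be regarded as a biased update of $\boldsymbol{\mu}^{(0)}$, where the bias comes from the image sampling process. To make it less biased, we first average $\boldsymbol{\mu}^{(T)}$ over a mini-batch and get $\bar{\boldsymbol{\mu}}^{(T)}$. Then we update $\boldsymbol{\mu}^{(0)}$ as:
$$\boldsymbol{\mu}^{(0)} \leftarrow \alpha\, \boldsymbol{\mu}^{(0)} + (1 - \alpha)\, \bar{\boldsymbol{\mu}}^{(T)}, \tag{15}$$
where $\alpha \in [0, 1]$ is the momentum. For inference, $\boldsymbol{\mu}^{(0)}$ is kept fixed. This moving averaging mechanism is also adopted in batch normalization (BN).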
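The moving-average maintenance of the initial bases can be sketched as below. The function name `update_initial_bases`, the momentum value of 0.9, and the convention that the momentum weights the old bases are assumptions made for the example.

```python
import numpy as np

def update_initial_bases(mu0, mu_T_batch, momentum=0.9):
    """Blend the running initial bases with the batch-averaged converged bases.

    mu0:        (K, C) running initial bases.
    mu_T_batch: (B, K, C) bases obtained after the EM iterations, one set per image.
    """
    mu_T_mean = mu_T_batch.mean(axis=0)     # average over the mini-batch
    return momentum * mu0 + (1.0 - momentum) * mu_T_mean


mu0 = np.zeros((2, 3))
mu_T_batch = np.stack([np.ones((2, 3)), 3.0 * np.ones((2, 3))])   # toy batch of two images
mu0_next = update_initial_bases(mu0, mu_T_batch)
```

At inference time this function would simply not be called, so the initial bases stay fixed, mirroring how batch normalization freezes its running statistics.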
In the above subsection, we accomplish the maintenance of $\boldsymbol{\mu}^{(0)}$ for each mini-batch. However, the stable update of $\boldsymbol{\mu}^{(t)}$ inside the $\mathcal{A}_E$ and $\mathcal{A}_M$ iterations is still not guaranteed, due to the defect of RNNs. The moving averaging mechanism described above requires $\bar{\boldsymbol{\mu}}^{(T)}$ not to differ significantly from $\boldsymbol{\mu}^{(0)}$, otherwise it will also collapse like back-propagation. This requirement also constrains the value range of $\boldsymbol{\mu}^{(T)}$.
To this end, we need to apply normalization upon $\boldsymbol{\mu}^{(t)}$. At first glance, BN or layer normalization (LN) seem to be good choices. However, these normalization methods will change the direction of each basis $\boldsymbol{\mu}_k^{(t)}$, which changes their properties and semantic meanings. To keep the direction of each basis untouched, we choose Euclidean normalization (L2Norm), which divides each $\boldsymbol{\mu}_k^{(t)}$ by its length. By applying it, each $\boldsymbol{\mu}_k^{(t)}$ then lies on the unit hypersphere in $\mathbb{R}^{C}$, and the sequence $\{\boldsymbol{\mu}_k^{(0)}, \boldsymbol{\mu}_k^{(1)}, \ldots, \boldsymbol{\mu}_k^{(T)}\}$ forms a trajectory on it.
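The difference between the two normalizations can be seen on a small example. The sketch below compares the direction of a basis before and after each normalization using cosine similarity; the layer normalization here is a plain version without learned scale or shift, and all names and values are illustrative assumptions.

```python
import numpy as np

def l2_normalize(mu, eps=1e-6):
    """Divide each basis (row) by its Euclidean length."""
    return mu / (np.linalg.norm(mu, axis=1, keepdims=True) + eps)


def layer_normalize(mu, eps=1e-6):
    """Plain layer normalization per basis, without learned scale or shift."""
    centered = mu - mu.mean(axis=1, keepdims=True)
    return centered / np.sqrt(centered.var(axis=1, keepdims=True) + eps)


def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


mu = np.array([[3.0, 4.0, 0.0], [1.0, 2.0, 2.0]])
cos_l2 = cosine(mu[0], l2_normalize(mu)[0])     # direction is preserved
cos_ln = cosine(mu[0], layer_normalize(mu)[0])  # mean subtraction rotates the vector
```

The L2-normalized basis stays parallel to the original one, while the layer-normalized basis points in a noticeably different direction because of the mean subtraction.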
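$A^2$-Net proposes the double attention block ($A^2$ block), in which the output is computed as:
$$\tilde{\mathbf{X}} = \left[\phi(\mathbf{X}, W_{\phi})\, \mathrm{sfm}\big(\theta(\mathbf{X}, W_{\theta})\big)^{\top}\right] \mathrm{sfm}\big(\rho(\mathbf{X}, W_{\rho})\big), \tag{16}$$
where $\mathrm{sfm}$ represents the softmax function. $\phi$, $\theta$ and $\rho$ represent three $1 \times 1$ convolutions with convolution kernels $W_{\phi}$, $W_{\theta}$ and $W_{\rho}$, respectively.
If we share parameters between $\theta$ and $\rho$, then we can mark both $W_{\theta}$ and $W_{\rho}$ as $\boldsymbol{\mu}$. We can see that $\mathrm{sfm}(\theta(\mathbf{X}, W_{\theta}))$ just computes $\mathbf{Z}$ the same as Eq. (5), and those variables lying inside $[\cdot]$ update $\boldsymbol{\mu}$. The whole process of the $A^2$ block equals EMA with only one iteration. The $W_{\theta}$ in the $A^2$ block is updated by back-propagation, while our EMAU is updated by moving averaging. Above all, the double attention block can be treated as a special form of EMAU.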
[Figure caption: experiments on the PASCAL VOC dataset. Best viewed on screen.]
To evaluate the proposed EMAU, we conduct extensive experiments on the PASCAL VOC dataset, the PASCAL Context dataset, and the COCO Stuff dataset. In this section, we first introduce implementation details. Then we perform an ablation study to verify the superiority of the proposed method on the PASCAL VOC dataset. Finally, we report our results on the PASCAL Context dataset and the COCO Stuff dataset.
We use ResNet (pretrained on ImageNet) as our backbone. Following prior works [37, 4, 5], we employ a poly learning rate policy where the initial learning rate is multiplied by after each iteration. The initial learning rate is set to be for all datasets. Momentum and weight decay coefficients are set to and , respectively. For data augmentation, we apply the common scale ( to ), cropping and flipping of the image to augment the training data. Input size for all datasets is set to . The synchronized batch normalization is adopted in all experiments, together with the multi-grid. For evaluation, we adopt the commonly used Mean IoU metric.
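As a point of reference, the poly learning rate policy is commonly implemented as a polynomial decay of the base rate over the training run. In the sketch below, the exponent of 0.9 and the base rate of 0.01 are assumptions (widely used defaults for this policy, not values taken from this text), and `poly_lr` is a hypothetical helper name; the 30000-step horizon matches the 30K iterations mentioned for PASCAL VOC.

```python
def poly_lr(base_lr, step, max_steps, power=0.9):
    """Learning rate after `step` iterations under the poly policy."""
    return base_lr * (1.0 - step / max_steps) ** power


# Rates at the start, the midpoint and the last step of a 30000-step run.
schedule = [poly_lr(0.01, s, 30000) for s in (0, 15000, 29999)]
```

The rate starts at the base value and decays smoothly towards zero at the end of training.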
The output stride of the backbone is set to for training on PASCAL VOC and PASCAL Context, and for training on COCO Stuff and evaluating on all datasets. To speed up the training procedure, we carry out all ablation studies on ResNet-50 , with batch size . For all models to be compared with state-of-the-art, we train them on ResNet-101, with batch size . We train 30K iterations on PASCAL VOC and COCO Stuff, and 15K on PASCAL Context. We use a convolution to reduce the channel number from to , and then stack EMAU upon it. We call the whole network as EMANet. We set the basis number , and the number of iterations for training as default.
In this part, we first compare different strategies of maintaining $\boldsymbol{\mu}^{(0)}$. We set in training, and in evaluation. As shown in the left part of Fig. 3, the performance of all strategies increases with more iterations of $\mathcal{A}_E$ and $\mathcal{A}_M$. When , the gain from more iterations becomes marginal. Moving average performs the best among them. It achieves the highest performance in all iterations and surpasses others by at least in mIoU. Surprisingly, updating $\boldsymbol{\mu}^{(0)}$ by back-propagation shows no merit compared with no updating and even performs worse when .
We then compare the performance with no normalization, LN and L2Norm as described above. From the right part of Fig. 3, it is clear that LN is better than no normalization, since it can partially relieve the gradient problems of the RNN-like structure. The performance of LN and no normalization has little correlation with the number of iterations. By contrast, L2Norm’s performance increases as the number of iterations becomes larger and it outperforms LN and no normalization when .
| Wide ResNet | WideResNet-38 | 84.9 |
From Fig. 3, it is clear that the performance of EMAU gains from more iterations during evaluation, and the gain becomes marginal when . In this subsection, we also study the influence of $T_{train}$ in training. We plot the performance matrix upon $T_{train}$ and $T_{eval}$ as Fig. 4.
From Fig. 4, it is clear that mIoU increases monotonically with more iterations in evaluation, no matter what $T_{train}$ is. They finally converge to a fixed value. However, this rule does not hold in training. The mIoUs peak when and decrease with more iterations. This phenomenon may be caused by the RNN-like behavior of EMAU. Though Moving Average and L2Norm can relieve it to a certain degree, the problem persists.
We also carry out experiments on the $A^2$ block, which can be regarded as a special form of EMAU as mentioned in Sec. 5.4. Similarly, the Non-local module can also be viewed as a special form of EMAU without the $\mathcal{A}_M$ step, which includes more bases and $T_{train} = 1$. With the same backbone and training scheduler, the $A^2$ block achieves 77.41% and the Non-local module achieves 77.78% in mIoU, respectively. As a comparison, EMANet achieves 77.34% when $T_{train} = 1$ and $T_{eval} = 1$. These three results have small differences, which is consistent with our analysis.
We first thoroughly compare EMANet with three baselines, namely DeeplabV3, DeeplabV3+ and PSANet on the validation set. We report mIoU, FLOPs, memory cost and parameter numbers in Tab. 1. We can see that EMANet outperforms these three baselines by a large margin. Moreover, EMANet is much lighter in computation and memory.
We further compare our method with existing methods on the PASCAL VOC test set. Following previous methods [4, 5], we train EMANet successively over COCO, the VOC trainaug and the VOC trainval set. We set the base learning rate as 0.009, 0.001 and 0.0001, respectively. We train 150K iterations on COCO, and 30K for the last two rounds. When inferring over the test set, we make use of multi-scale testing and left-right flipping.
As shown in Tab. 2, our EMANet sets a new record on PASCAL VOC, and improves DeeplabV3 with the same backbone by 2.0% in mIoU. Our EMANet achieves the best performance among networks with a ResNet-101 backbone, and outperforms the previous best one by 0.9%, which is significant given that this benchmark is very competitive. Moreover, it achieves performance comparable with methods based on some larger backbones.
To verify the generalization of our proposed EMANet, we conduct experiments on the PASCAL Context dataset. Quantitative results on PASCAL Context are shown in Tab. 3. Notably, our EMANet based on ResNet-50 even outperforms MSCI based on ResNet-152, which further shows the effectiveness of our proposed EMAU. Moreover, to the best of our knowledge, EMANet based on ResNet-101 achieves the highest performance on the PASCAL Context dataset. Even though it is pretrained on additional data (COCO Stuff), SGR+ is still inferior to EMANet.
To further evaluate the effectiveness of our method, we also carry out experiments on the COCO Stuff dataset. Comparisons with previous state-of-the-art methods are shown in Tab. 4. Remarkably, EMANet achieves 39.9% in mIoU and outperforms previous methods by a large margin.
To get a deeper understanding of our proposed EMAU, we visualize the iterated responsibility maps in Fig. 5. For each image, we randomly select four bases and show their corresponding responsibilities of all pixels in the last iteration. Each basis corresponds to an abstract concept of the image. With the progress of the iterations of $\mathcal{A}_E$ and $\mathcal{A}_M$, the abstract concept becomes more compact and clear. As we can see, these bases converge to some specific semantics and do not just focus on foreground and background. Concretely, the bases of the first two rows focus on specific semantics such as human, wine glass, cutlery and profile. The bases of the last two rows focus on sailboat, mountain, airplane and lane.
In this paper, we propose a new type of attention mechanism, namely the expectation-maximization attention (EMA), which computes a more compact basis set by iterating as in the EM algorithm. The reconstructed output of EMA is low-rank and robust to the variance of input. We formulate the proposed method as a lightweight module that can be easily inserted into existing CNNs with little overhead. Extensive experiments on a number of benchmark datasets demonstrate the effectiveness and efficiency of the proposed EMAU.
Zhouchen Lin is supported by National Basic Research Program of China (973 Program) (Grant no. 2015CB352502), National Natural Science Foundation (NSF) of China (Grant nos. 61625301 and 61731018), Qualcomm and Microsoft Research Asia. Hong Liu is supported by National Natural Science Foundation of China (Grant nos. U1613209 and 61673030) and funds from Shenzhen Key Laboratory for Intelligent Multimedia and Virtual Reality (ZDSYS201703031405467).