Convolutional neural networks are rapidly driving advances in semantic image segmentation [13, 5, 9, 40], which aims to predict accurate masks for different classes of targets. To meet this challenge, previous studies focused on designing or fine-tuning different network architectures [27, 20, 16, 30, 6]. To our knowledge, all these frameworks adopt the same kind of estimator (i.e., loss), which averages the pixel-wise cross-entropy between prediction maps and ground truths over each input batch. However, this kind of estimator only measures pixel-wise distances between predictions and ground truths, neglecting the interactions between pixels of the same category within their neighborhoods. Such interactions are crucial, especially when the appearance of targets changes due to deformation, illumination variations, occlusion and so forth [14, 17].
Previous loss functions for enhancing intra-class features were designed for image classification [19, 24, 33, 36]. They usually measure costs between predicted classes and labels over batches of images, as in contrastive loss [18, 8], triplet loss and center loss. However, as noted in prior work, approaches like contrastive loss and triplet loss require image pairs or triplets at each training iteration, which causes a dramatic growth in the number of training samples and thus significantly increases the computational complexity. Center loss overcomes this problem by introducing k-nearest neighbor (k-NN) algorithms into softmax cross-entropy. At each training iteration, it computes the distances between the deep features and their class centers over a mini-batch of images, and updates the centers after each iteration. Center loss can effectively minimize intra-class variations while keeping the features of different classes separable. Even so, this kind of estimation is still computationally expensive, all the more so for image segmentation, where each pixel is considered a training sample. Moreover, most semantic segmentation datasets exhibit long-tail distributions with few object categories, which means that inter- and intra-class samples are imbalanced and network training is consequently biased towards the major classes. To address the class imbalance problem in the realm of object detection, Lin et al. modified the standard cross-entropy loss to down-weight the losses assigned to well-classified examples, and proposed focal loss.
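To make the down-weighting concrete, the following minimal sketch compares standard cross-entropy with focal loss for a single prediction. It assumes the usual form of focal loss, i.e., cross-entropy scaled by alpha * (1 - p_t)^gamma, where p_t is the probability assigned to the true class. The default values gamma = 2 and alpha = 0.25 are illustrative only and are not the values used in our experiments.

```python
import math


def cross_entropy(p_t):
    """Standard cross-entropy for the probability p_t of the true class."""
    return -math.log(p_t)


def focal_loss(p_t, gamma=2.0, alpha=0.25):
    """Cross-entropy scaled by alpha * (1 - p_t) ** gamma.

    gamma and alpha are illustrative defaults (assumption), not tuned values.
    """
    return alpha * (1.0 - p_t) ** gamma * cross_entropy(p_t)


if __name__ == "__main__":
    # The scaling factor shrinks quickly as the prediction becomes confident.
    for p_t in (0.1, 0.5, 0.9):
        print(p_t, cross_entropy(p_t), focal_loss(p_t))
```

With gamma = 0 and alpha = 1 the function reduces to plain cross-entropy, which is a convenient sanity check.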
In this paper, we introduce a novel locally adaptive loss for semantic image segmentation, which estimates selectively filtered predictions based on their categories. Figure 1 illustrates the training framework of our proposed method at a glance. The selective pooling filter slides over the output feature maps and the ground truths simultaneously. At each striding step, it selectively pools the predicted vectors into a merged one, and then computes the cost between the merged vector and the category label of the center pixel inside the filter. This operation is conducted on each valid pixel over the input batches, and finally a global loss is computed for each input batch (see Figure 2). During training, this loss layer emphasizes the interactions between pixels of the same category within a neighborhood, which intuitively indicates that the stochastic gradient descent (SGD) solver should optimize the predictions of the same category over a local scale rather than per pixel. Such a loss can effectively supervise networks to summarize the features of the same category while indirectly enlarging the differences between inter-class features. Thus, the discriminative capabilities of the learned models are improved, with higher robustness and object sensitivity. With this loss, we trained deep neural networks (DNN) and demonstrate that our learned models outperform previous state-of-the-art methods.
In summary, we make the following contributions:
We propose a novel locally adaptive loss layer for semantic image segmentation. During learning, it helps networks improve their capability to discriminate targets across both inter- and intra-class variations. In our experiments, we also verified that the models trained with our loss outperform their counterparts.
We explore a simple method for rebalancing losses on image segmentation datasets, which often exhibit long-tail distributions. Our correction mechanism can prevent networks from being biased towards the majority classes.
We implement other well-known losses (i.e., center loss and focal loss) for semantic image segmentation as an additional contribution. With these losses, the learned models can also predict decent masks, and we therefore use them as baselines.
The remainder of this paper has the following structure: Section 2 briefly summarizes related work. Section 3 constructs the locally adaptive loss. Section 4 illustrates and evaluates our locally adaptive loss via several numerical experiments using different training frameworks. Section 5 draws conclusions and proposes direction for future work.
2 Related Work
2.1 Image Segmentation
Semantic image segmentation using convolutional neural networks or deep neural networks (DNN) has achieved several breakthroughs in recent years [2, 6, 20, 27, 9, 11]. Following earlier work, researchers commonly remove the last fully connected (FC) layers of neural networks, and then utilize the in-network upsampled or deconvolved predictions of the convolutional layers as predicted feature maps. The estimating procedure for training generally computes pixel-wise losses between the maps and ground truths over each batch, and then pools them into a global value for back propagation (BP). These estimators are based on softmax and multinomial cross-entropy between the predicted vectors of neurons and the labels. However, this computation collapses the spatial dimensions of both predicted maps and labeled images into vectors. Methods like [10, 28, 29] resort to FC layers to establish the prediction masks, which requires more complex hyper-parameters. Recently, He et al. proposed a regional loss computation, using aligned Regions of Interest (ROI) to maintain each object's spatial layout. On each aligned ROI, it applies a pixel-wise sigmoid and binary loss between predictions and target labels, eliminating inter-class competition.
2.2 Weighted Ensemble Entropy Estimator
Density functions, like cross entropy, are widely used as estimators for training CNN and DNN frameworks (e.g., AlexNet, VGG net, ResNet, DenseNet, etc.). As discussed in prior work, an ensemble of weak estimators can improve the performance of learned models, similar to methods (e.g., boosting) proposed in the context of classification. Meanwhile, a weighted ensemble entropy estimator was introduced by optimally combining multiple weak entropy-like estimators (e.g., k-NN entropy functional estimators, intrinsic dimension estimators, etc.). The weighted ensemble entropy estimator is defined as:
$$\hat{E}_{w} = \sum_{l=1}^{L} w_l \hat{E}_l \qquad (1)$$
where $\hat{E}_l$ stands for an individual entropy-like estimator, $L$ is the number of estimators, and $w_l$ is the weight to be optimized, subject to $\sum_{l=1}^{L} w_l = 1$. It was also verified that such a weighted estimator can provide better prediction accuracy and stronger discrimination ability with a higher convergence rate. Note that each weighted weak estimator operates on the same set of input variables. Similarly, center loss also estimates on the same set of features.
In contrast, we explore a new training loss for image segmentation, which estimates on the merged, intra-correlated predictions of neighboring pixels of the same category. We demonstrate that our new loss helps networks better learn the interactions of neighboring pixels of the same category, and thus improves their discriminative abilities on both intra- and inter-class variations. Our learned models outperform their counterparts trained with plain pixel-wise estimators.
3 The Locally Adaptive Loss
In this section, we elaborate the estimating procedure of our proposed method, and demonstrate that our locally adaptive loss can improve the discriminative power of the learned models, followed by some discussions.
3.1 Selective Pooling Estimator in Scale
In semantic image segmentation, the main task of an effective loss function is to improve the discriminative capability of the learned model. However, in contrast to object detection and image classification, where each training batch contains independent samples (e.g., labeled images, bounded objects, etc.), each batch for image segmentation contains all the labeled pixels from different objects, which means that groups of input samples are partially correlated with each other when they belong to the same object category. Intuitively, a natural approach is to estimate the predictions within a small neighborhood of pixels of the same category and to minimize the loss over that neighborhood.
To this end, at each spatial point (i.e., pixel) with location $(x, y)$ on the feature maps, we first obtain its normalized predicted distribution vector $\hat{\mathbf{p}}_{x,y}$. We then apply our selective pooling kernel of size $(w, h, C)$ to the predicted vectors of the neighboring points $(i, j) \in \Omega_{x,y}$ (where $w$, $h$, and $C$ represent the width and height of the kernel window and the number of classes, respectively). The filtered vector is then computed as follows:
$$\tilde{\mathbf{p}}_{x,y} = \frac{1}{N_{x,y}} \sum_{(i,j) \in \Omega_{x,y}} G(i, j)\, \delta(\mathbf{l}_{i,j}, \mathbf{l}_{x,y})\, \hat{\mathbf{p}}_{i,j} \qquad (2)$$
and then our proposed local estimator is formulated as:
$$E_{x,y} = f\left(\tilde{\mathbf{p}}_{x,y}, \mathbf{l}_{x,y}\right) \qquad (3)$$
where $f$ represents a local cost function (e.g., softmax cross-entropy, etc.), and the one-hot label vectors $\mathbf{l}_{x,y}$ and $\mathbf{l}_{i,j}$ denote the categories of the center point $(x, y)$ and of its neighboring points $(i, j)$. $\hat{\mathbf{p}}_{i,j}$ is the normalized predicted vector at each neighboring point $(i, j)$. $G(i, j)$ is the Gaussian weighting function based on the spatial distance to the center $(x, y)$, and $\delta$ is an indicator function that eliminates the predictions of neighboring points whose classes differ from that of the center $(x, y)$. $N_{x,y}$ is the number of points that have the same label as $(x, y)$. Figure 2 illustrates the computation details of our selective pooling filter.
The intuition behind Eqn. 3 is that the local estimation computes the entire loss over a group of neighboring points of the same category, which indicates that the optimization should emphasize minimizing the overall loss towards a certain category over a local scale rather than per point. Note that, for each point, this operation only modifies the distribution of its predicted vector, and does not increase the number of predicted vectors.
For the normalization of the predicted vector, we use standard score normalization, which is defined as $\hat{\mathbf{p}} = (\mathbf{p} - \mu) / \sigma$, where $\mathbf{p}$ is the raw predicted vector, and $\mu$ and $\sigma$ stand for the mean and standard deviation of $\mathbf{p}$. In practice, we adopt softmax cross-entropy as our local cost function $f$. Thus, our local estimator can be rewritten as:
$$E_{x,y} = -\sum_{k=1}^{C} l_{k} \log \frac{\exp(\tilde{p}_{k})}{\sum_{m=1}^{C} \exp(\tilde{p}_{m})} \qquad (4)$$
where $l_k$ represents each element of the center point's label vector $\mathbf{l}_{x,y}$, and $\tilde{p}_k$ is each element of the merged vector $\tilde{\mathbf{p}}_{x,y}$ from the filter.
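As a minimal sketch of the local estimator described in this section, the following pure-Python code normalizes each predicted vector with the standard score, pools the vectors of same-label neighbors inside a square window, and scores the merged vector against the center label with softmax cross-entropy. Several details are assumptions made for illustration: the function names, the Gaussian weight computed from the chessboard distance with a hypothetical width `sigma`, and the division of the weighted sum by the number of same-label neighbors.

```python
import math


def zscore(v):
    """Standard-score normalization of one predicted vector."""
    mean = sum(v) / len(v)
    std = math.sqrt(sum((x - mean) ** 2 for x in v) / len(v))
    return [(x - mean) / (std + 1e-8) for x in v]


def softmax_ce(v, label):
    """Softmax cross-entropy of vector v against an integer class label."""
    m = max(v)
    log_sum = m + math.log(sum(math.exp(x - m) for x in v))
    return log_sum - v[label]


def local_estimator(preds, labels, y, x, k=3, sigma=1.0):
    """Selectively pool same-label neighbors of (y, x), then score the result.

    preds is an H x W x C nested list of raw predictions and labels is an
    H x W nested list of integer classes. k and sigma are assumed parameters.
    """
    h, w, c = len(preds), len(preds[0]), len(preds[0][0])
    r = k // 2
    merged, n_same = [0.0] * c, 0
    for j in range(max(0, y - r), min(h, y + r + 1)):
        for i in range(max(0, x - r), min(w, x + r + 1)):
            if labels[j][i] != labels[y][x]:
                continue  # indicator: skip neighbors of another class
            d = max(abs(j - y), abs(i - x))  # chessboard distance to center
            g = math.exp(-d * d / (2.0 * sigma * sigma))  # assumed weighting
            p_hat = zscore(preds[j][i])
            merged = [m + g * p for m, p in zip(merged, p_hat)]
            n_same += 1
    # The center always matches its own label, so n_same >= 1.
    merged = [m / n_same for m in merged]
    return softmax_ce(merged, labels[y][x])


if __name__ == "__main__":
    preds = [[[2.0, 0.0], [2.0, 0.0], [0.0, 2.0]] for _ in range(3)]
    labels = [[0, 0, 1] for _ in range(3)]
    print(local_estimator(preds, labels, 1, 1, k=3))
```

With `k=1` the window contains only the center pixel, so the sketch reduces to plain softmax cross-entropy on the normalized vector. Predictions of neighbors whose label differs from the center never enter the merged vector.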
3.2 Striding and Batch Pooling Strategy
The selective estimator above constructs an ensemble cost based on the category label of the center point inside the filter. We then obtain our Locally ADaptive (LAD) loss by sliding this filter (with a striding step $s$) over all the input batches (i.e., images), and averaging all the local estimator values with Minkowski pooling. Here we take one image as an input batch, and drop the location indices for brevity:
$$\mathcal{L}_{LAD} = \left(\frac{1}{N} \sum_{i=1}^{N} \left|E_i\right|^{p}\right)^{1/p} \qquad (5)$$
where $E_i$ is the $i$-th local estimator value, $N$ is the total number of valid local estimates in the input batch, and $p$ is the pooling exponent. As $p$ increases, more emphasis is allocated to the areas with high loss values. This is well suited to image segmentation datasets, which often contain imbalanced (or skewed) class distributions (e.g., the background and people categories account for the majority of the input batches). Such imbalanced datasets bias networks towards the major classes. It is therefore more reasonable to increase the losses of under-represented categories, which often come from the minority classes. Moreover, this batch pooling strategy can increase the impact of mispredicted intra-class samples, which acts as a kind of hard sample mining [7, 38]. As mentioned before, our method does not increase the number of loss terms after the local cost estimation on each point, so the computational time remains the same at each iteration.
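The effect of the pooling exponent can be illustrated with a short sketch. It assumes the common power-mean form of Minkowski pooling, (mean(|e|^p))^(1/p); the function name and the toy loss values are hypothetical.

```python
def minkowski_pool(losses, p=2.0):
    """Pool per-point losses as (mean(|e| ** p)) ** (1 / p) (assumed form)."""
    n = len(losses)
    return (sum(abs(e) ** p for e in losses) / n) ** (1.0 / p)


if __name__ == "__main__":
    # Three well-predicted points and one hard point.
    losses = [0.1, 0.1, 0.1, 2.0]
    for p in (1.0, 2.0, 4.0):
        print(p, minkowski_pool(losses, p))
```

With p = 1 the pooled value is the plain average. As p grows, the pooled value moves towards the largest local loss, so hard points contribute more to the gradient.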
In practice, we set the striding step to 1, which means the filter slides pixel by pixel over input feature maps and ground truths. We use Gaussian-like weighting as a neat allocation of to each neighbor pixel. Specifically, we allocate down-weights to prediction vectors of neighbor pixels, according to their chessboard distances  to center pixel (i.e., to center pixel, to the pixels with , to the pixels with , to the pixels with , and so forth).
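The ring-wise weighting by chessboard distance can be sketched as follows. The Gaussian form and the width `sigma` are assumptions for illustration and do not reproduce the exact weights used in our experiments.

```python
import math


def chessboard_weights(k=5, sigma=1.0):
    """Return a k x k grid of weights that depend only on chessboard distance.

    The weight is exp(-d^2 / (2 * sigma^2)) with d = max(|dy|, |dx|);
    sigma is a hypothetical parameter.
    """
    r = k // 2
    return [[math.exp(-max(abs(dy), abs(dx)) ** 2 / (2.0 * sigma ** 2))
             for dx in range(-r, r + 1)]
            for dy in range(-r, r + 1)]


if __name__ == "__main__":
    for row in chessboard_weights(k=5):
        print(" ".join(f"{v:.3f}" for v in row))
```

All pixels on the same square ring around the center share one weight, and the weight decreases from ring to ring.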
To determine the pooling exponent, we tried different values (see Table 1) and chose it accordingly. However, in comparison with the model trained with plain softmax cross-entropy (last column), whichever value is adopted, the models trained with our locally adaptive loss provide consistently higher mean IoU values on the testing dataset. This means that the local selective estimator, rather than the batch pooling strategy, is the primary contributor to the effectiveness of the locally adaptive loss.
The derivative of w.r.t. the input vector , written in an element-wise is as follows (, stand for each neighbor point’s indices inside filter):
3.3 Relationship of Locally Adaptive Loss with counterparts
Both locally adaptive loss and loss max-pooling methods are designed for image semantic segmentation. Locally adaptive loss directly focuses on connections of adjacent pixels with same category, while loss max-pooling aims to rebalance the datasets between majority and minority classes.
Essentially, locally adaptive loss is a metric approach in the feature space (i.e., the activations of the last upsampled DNN layer), using the selective pooling filter to increase the attention of networks on the joint prediction correctness of neighboring pixels. It applies the filtering operations before the local cost computations with the ground truths. Loss max-pooling, in contrast, only re-weights the losses after the local cost computations, aiming to increase the contributions of under-represented object classes.
Loss max-pooling is in some way similar to our batch cost pooling strategy, as we use simple Minkowski pooling for handling the imbalanced class datasets. Loss max-pooling can be also embedded into our loss as a replacement for batch pooling strategy.
4 Experiments
We have evaluated our novel locally adaptive training loss (LAD) on the extended Pascal VOC semantic image segmentation datasets. We adopt Intersection-over-Union (IoU), averaged over all classes of the datasets, as the evaluation metric.
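For reference, mean IoU can be computed as in the following minimal sketch. It is a simplification: it works on flat lists of integer class labels for a single image, skips classes that appear in neither the prediction nor the ground truth, and does not handle ignored (void) pixels or accumulation over a whole dataset. The function name is hypothetical.

```python
def mean_iou(pred, truth, num_classes):
    """Mean Intersection-over-Union over classes present in pred or truth."""
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, t in zip(pred, truth) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, truth) if p == c or t == c)
        if union > 0:
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0


if __name__ == "__main__":
    pred = [0, 0, 1, 1, 2, 2]
    truth = [0, 1, 1, 1, 2, 0]
    print(mean_iou(pred, truth, num_classes=3))
```

In the toy example, the per-class IoU values are 1/3, 2/3 and 1/2, so the mean over the three classes is 0.5.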
4.1 Network Architecture
For all experiments, we applied the DeepLabV2 network, implemented with TensorFlow and using cuDNN to improve performance. The GPUs used for our experiments are a GeForce GTX 1070 and a Titan Xp. Specifically, we adopted a fully convolutional ResNet-101 with atrous extensions [22, 39] for the base layers before adding atrous spatial pyramid pooling (ASPP). For our baseline method, we applied upscaling (i.e., deconvolution layers with learned weights) before the softmax cross-entropy (SoftMax CE). In our experiments, the baseline method gives results similar to previously reported ones. Besides the baseline method, we also applied center loss and focal loss as replacements for softmax cross-entropy, to study the effectiveness of losses from different research areas. We also disabled both multi-scale input to the networks and post-processing with conditional random fields (CRF), so that we can assess our proposed method without complementary techniques. However, all these complementary methods, including loss max-pooling, can be integrated into our method when better overall performance is required, given adequate devices and training time. Our primary purpose is to demonstrate the effectiveness of the locally adaptive loss in comparison with other baselines under comparable parameters. We only report results obtained from a single DNN trained using the stochastic gradient descent (SGD) solver, where we set the initial learning rate to , both decay rate and momentum to 0.9. For data augmentation, we adopted random scale perturbations in the range of , and horizontal flipping of images.
4.2 Experiment Results
We evaluated the performance of our loss on the extended Pascal VOC segmentation benchmark datasets, which consist of 20 object categories and a background category (see Figure 4). We set the batch size to 2 with crop size of , using a training set of 10,582 images and a testing set of 1,449 images. For the hyper-parameters of our loss, we used kernel sizes of (3, 3, 21), (5, 5, 21) and (7, 7, 21) for our local selective filters, and ran a total of 20,000 training iterations for each. For center loss and focal loss, since they were originally designed for object detection and classification, we manually adjusted their hyper-parameters to fit image segmentation. Thus, we set , for center loss, and , for focal loss. We report the mean IoU values after 20,000 iterations in Table 2. As shown in Table 2, the models trained with the locally adaptive loss predict consistently improved results. In particular, the kernel size (5, 5, 21) alone gives a mean IoU 1.5 points higher than plain softmax cross-entropy (76.1 vs. 74.6). In contrast, applying center loss leads to a mean IoU similar to plain softmax cross-entropy, while focal loss yields lower values.
Additionally, in Figure 3 we show several segmented examples from the testing set to visually demonstrate the improvements of our training framework over the others. The first two columns show the original images (randomly cropped) and the ground truths. The 3rd to 6th columns show the masks predicted by models trained with plain softmax cross-entropy, center loss, focal loss and our locally adaptive loss (with kernel size: ), respectively. The models trained with our framework predict more accurate masks, with higher robustness and object sensitivity, than their counterparts.
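Table 2: Mean IoU on the extended Pascal VOC testing set after 20,000 training iterations.
| Training Framework | Hyper Para. | Mean IoU |
|---|---|---|
| RN-101 + SoftMax CE | - | 74.6 |
| RN-101 + Center Loss | , | 74.5 |
| RN-101 + Focal Loss | , | 70.4 |
| RN-101 + LAD (ours) | kernel size: (3, 3, 21) | 75.7 |
| RN-101 + LAD (ours) | kernel size: (5, 5, 21) | 76.1 |
| RN-101 + LAD (ours) | kernel size: (7, 7, 21) | 75.1 |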
5 Conclusions and Future Work
In this work, we introduced a novel approach to increase the inter- and intra-class discriminative capabilities of networks for semantic image segmentation. At each pixel position, our method first applies an adaptive pooling filter over the predicted feature maps to merge the predicted distributions over a small group of neighboring pixels of the same category, and then computes the cost between the merged distribution vector and the category label. Our locally adaptive loss does not increase the number of loss terms, so the time complexity remains the same at each iteration. In the experiments on the Pascal VOC 2012 segmentation datasets, the consistently improved results show that our proposed approach yields more accurate segmentation masks than its counterparts. More extensive experiments will be conducted on the Cityscapes dataset and the COCO dataset to further verify our training framework.
References
- Abadi et al.  Martin Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S Corrado, Andy Davis, Jeffrey Dean, and Matthieu Devin. Tensorflow: Large-scale machine learning on heterogeneous distributed systems. 2016.
- Badrinarayanan et al.  Vijay Badrinarayanan, Alex Kendall, and Roberto Cipolla. Segnet: A deep convolutional encoder-decoder architecture for image segmentation. CoRR, abs/1511.00561, 2015.
- Boroujeni and Shahabadi  Mehdi Ahmadi Boroujeni and Mahmoud Shahabadi. Modern mathematical methods for physicists and engineers. Measurement Science & Technology, 12(12):2211, 2000.
- Bulo et al.  Samuel Rota Bulo, Gerhard Neuhold, and Peter Kontschieder. Loss max-pooling for semantic image segmentation. 2017.
- Caesar et al.  Holger Caesar, Jasper Uijlings, and Vittorio Ferrari. Coco-stuff: Thing and stuff classes in context. 2016.
- Chen et al.  Liang Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L. Yuille. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE Transactions on Pattern Analysis & Machine Intelligence, PP(99):1–1, 2016.
- Chen et al.  Weihua Chen, Xiaotang Chen, Jianguo Zhang, and Kaiqi Huang. Beyond triplet loss: A deep quadruplet network for person re-identification. pages 1320–1329, 2017.
- Chen et al.  Yuheng Chen, Yuheng Chen, Xiaogang Wang, and Xiaoou Tang. Deep learning face representation by joint identification-verification. In International Conference on Neural Information Processing Systems, pages 1988–1996, 2014.
- Cordts et al.  Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele. The cityscapes dataset for semantic urban scene understanding. In Computer Vision and Pattern Recognition, pages 3213–3223, 2016.
- Dai et al.  Jifeng Dai, Kaiming He, and Jian Sun. Instance-aware semantic segmentation via multi-task network cascades. pages 3150–3158, 2015.
- Dai et al.  Jifeng Dai, Yi Li, Kaiming He, and Jian Sun. R-FCN: object detection via region-based fully convolutional networks. CoRR, abs/1605.06409, 2016.
- Everingham et al.  Mark Everingham, Luc Van Gool, Christopher K. I. Williams, John Winn, and Andrew Zisserman. The pascal visual object classes (voc) challenge. International Journal of Computer Vision, 88(2):303–338, 2010.
- Everingham et al.  Mark Everingham, S. M. Ali Eslami, Luc Van Gool, Christopher K. I. Williams, John Winn, and Andrew Zisserman. The pascal visual object classes challenge: A retrospective. International Journal of Computer Vision, 111(1):98–136, 2015.
- Fan and Ling  Heng Fan and Haibin Ling. Sanet: Structure-aware network for visual tracking. 2016.
- Fukunage and Narendra  K Fukunage and P. M Narendra. A branch and bound algorithm for computing k-nearest neighbors. IEEE Transactions on Computers, C-24(7):750–753, 1975.
- Girshick  Ross Girshick. Fast r-cnn. Computer Science, 2015.
- Guo et al.  Jinjiang Guo, Vincent Vidal, Irene Cheng, Anup Basu, Atilla Baskurt, and Guillaume Lavoue. Subjective and objective visual quality assessment of textured 3d meshes. ACM Trans. Appl. Percept., 14(2), October 2016.
- Hadsell et al.  Raia Hadsell, Sumit Chopra, and Yann Lecun. Dimensionality reduction by learning an invariant mapping. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pages 1735–1742, 2006.
- He et al.  Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. pages 770–778, 2015.
- He et al.  Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask r-cnn. 2017.
- Hero et al.  Alfred O. Hero, Jose A. Costa, and Bing Ma. Asymptotic relations between minimal graphs and alpha-entropy. 2003.
- Holschneider et al.  M. Holschneider, R. Kronland-Martinet, J. Morlet, and Ph. Tchamitchian. A real-time algorithm for signal analysis with the help of the wavelet transform. pages 286–297, 1989.
- Huang et al.  Gao Huang, Zhuang Liu, Laurens van der Maaten, and Kilian Q. Weinberger. Densely connected convolutional networks. 2016.
- Krizhevsky et al.  Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. Imagenet classification with deep convolutional neural networks. In International Conference on Neural Information Processing Systems, pages 1097–1105, 2012.
- Lin et al.  Tsung Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft coco: Common objects in context. 8693:740–755, 2014.
- Lin et al.  Tsung Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. Focal loss for dense object detection. pages 2999–3007, 2017.
- Long et al.  Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(4):640–651, 2015.
- Pinheiro et al.  Pedro O. Pinheiro, Ronan Collobert, and Piotr Dollár. Learning to segment object candidates. pages 1990–1998, 2015.
- Pinheiro et al.  Pedro O. Pinheiro, Tsung Yi Lin, Ronan Collobert, and Piotr Dollár. Learning to refine object segments. pages 75–91, 2016.
- Ren et al.  Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster r-cnn: towards real-time object detection with region proposal networks. In International Conference on Neural Information Processing Systems, pages 91–99, 2015.
- Schapire  Robert E Schapire. The strength of weak learnability. Machine Learning, 5(2):197–227, 1990.
- Schroff et al.  Florian Schroff, Dmitry Kalenichenko, and James Philbin. Facenet: A unified embedding for face recognition and clustering. In IEEE Conference on Computer Vision and Pattern Recognition, pages 815–823, 2015.
- Simonyan and Zisserman  Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. Computer Science, 2014.
- Sricharan et al.  K. Sricharan, D. Wei, and Alfred O. Hero III. Ensemble estimators for multivariate entropy estimation. IEEE Transactions on Information Theory, 59(7):4374–4388, 2013.
- Sricharan et al.  Kumar Sricharan, Raviv Raich, and Alfred O. Hero III. Empirical estimation of entropy functionals with confidence. Statistics, 2010.
- Szegedy et al.  Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In Computer Vision and Pattern Recognition, pages 1–9, 2015.
- Wen et al.  Yandong Wen, Kaipeng Zhang, Zhifeng Li, and Yu Qiao. A Discriminative Feature Learning Approach for Deep Face Recognition. Springer International Publishing, 2016.
- Xiao et al.  Qiqi Xiao, Hao Luo, and Chi Zhang. Margin sample mining loss: A deep learning based method for person re-identification. 2017.
- Yu and Koltun  Fisher Yu and Vladlen Koltun. Multi-scale context aggregation by dilated convolutions. 2015.
- Zhou et al.  Bolei Zhou, Hang Zhao, Xavier Puig, Sanja Fidler, Adela Barriuso, and Antonio Torralba. Scene parsing through ade20k dataset. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.