Advances in Convolutional Neural Networks (CNNs) [1, 2, 3] have led to extraordinary success in computer vision applications over the past several years. CNNs excel at learning reliable features for a multitude of vision problems, such as image classification [3, 4, 5], object detection [6, 7, 8], and semantic segmentation [9, 10, 11].
The traditional CNN architecture consists of three main building blocks: convolutional layers, pooling layers, and fully connected layers . CNNs based on these components are highly successful in image classification tasks. For semantic segmentation, the work in  popularized omitting the fully connected layers. Unlike previous work, this approach produces a semantic segmentation at the pixel level, albeit at a coarse scale, with a single feed-forward CNN run.
Thus the principal building block of most semantic segmentation CNNs is the convolution operation, which is translation invariant by design. The input to a semantic segmentation CNN is the three color channels of an image. Consequently, the deep convolutional features learned by CNN architectures without fully connected layers are translation invariant, and location information is largely lost when learning a useful feature. However, location can be an important cue. For example, a salient object is more likely to be in the center of the image, and the sky is more likely to be in the top part of an image.
We propose to include location information in the feature learning process by augmenting the three color channels with one or more additional channels that carry location information, see Fig. 1. First we experiment with adding location information directly. In particular, we augment the input image with two additional channels, one for image row and one for image column indexes, see Fig. 1(b). To be useful, location information does not have to be precise. Thus we also experiment with adding location information less precisely, namely through the distance transform from the image center, see Fig. 1(c). Here only the location relative to the image center is known, but that might be sufficient in order to obtain improvement over the color input only. The advantage is that only one additional input channel is needed, and fewer weights need to be learned.
The first work that we are aware of that uses coordinate augmentation for computer vision applications is in . They augment input data with coordinates, but their goal is a physics simulator, not semantic segmentation. Simultaneously with our work, in  they point out the deficiencies of CNNs that are based on RGB image data only, and propose coordinate augmentation for object recognition tasks. Another related work is in . They are interested in user-assisted object segmentation. The user provides positive (inside the object) and negative (outside the object) seeds. They transform the user-provided seeds into two Euclidean distance maps, which are then concatenated with the color image channels and used to train a CNN for object segmentation. In a subsequent work , they extend similar ideas to the scenario where the user provides a loose (possibly too small or too large) box around the object of interest. A Euclidean distance map is constructed from the softened user box and added to the color channels of the image as input to a CNN. The works in [15, 16] are similar in spirit to our work but narrower in scope. Like us, the authors of [15, 16] are interested in adding a location channel to help train a deep neural network. Unlike our work, their location information comes from user input only, whereas we use general location information already contained in the data, without any user interaction. In computational biology,  use the distance transform idea to encode the distance from certain features as an additional input track to a CNN.
We evaluate our approach on the tasks of salient object segmentation, semantic segmentation, and scene parsing. We show that both direct and indirect location augmentation improve accuracy compared to using RGB channels only. In most cases, adding the distance transform channel works better than adding two coordinate channels. The extra computational time needed when augmenting RGB input with location information is small, especially for deeper networks, since location augmentation affects only the input layer. We also experimentally investigate the relationship between the depth of the network and the accuracy gained by adding location augmentation.
2 Location Augmentation
To motivate the usefulness of location, consider the following artificial machine learning problem. Suppose we have a semantic segmentation task with just two classes: the foreground and the background. The training examples consist of a random collection of distinct color images, and the ground truth is some arbitrary shape, for example a circle, of exactly the same radius and in exactly the same location for all the images in the dataset. Assume the CNN consists only of convolutional layers with filters of small size, say 3 by 3. With the location information included, learning to segment the circle shape is a trivial task even for a shallow CNN. The correct solution can be learned with just one hidden layer and by convolution filters.
Without the location information, learning to segment the circle shape is difficult with a CNN, because whether a pixel belongs to the circle object depends only on its location, not its color. Interestingly, with a deep CNN and zero padding in the convolutional layers, the circle shape can be learned to some degree, depending on the distribution of colors in the dataset. Zero padding gives a location cue for pixels at the image border. This location cue can eventually be propagated to the rest of the pixels if the network is deep enough. However, for such a trivial learning task, having to employ a deep network is neither intellectually satisfying nor computationally cheap. A cheap shallow network on location-augmented images is a more sensible solution.
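The border cue provided by zero padding can be illustrated with a small NumPy sketch. It applies a 3 by 3 filter with zero padding to a constant image, so any variation in the output can only come from the padding. The helper name `conv3x3_zero_pad`, the averaging filter, and the 5 by 5 image size are illustrative assumptions, not part of the networks used in the experiments.

```python
import numpy as np

def conv3x3_zero_pad(image, kernel):
    """Apply a 3x3 filter to a 2-D image with zero padding (same output size)."""
    h, w = image.shape
    padded = np.pad(image, 1, mode="constant", constant_values=0.0)
    out = np.zeros((h, w), dtype=float)
    # Accumulate the nine shifted, weighted copies of the padded image.
    for dr in range(3):
        for dc in range(3):
            out += kernel[dr, dc] * padded[dr:dr + h, dc:dc + w]
    return out

# A constant image carries no appearance cue about pixel position.
image = np.ones((5, 5))
kernel = np.full((3, 3), 1.0 / 9.0)  # illustrative averaging filter

# One layer: only border pixels see the zero padding and respond differently.
response = conv3x3_zero_pad(image, kernel)

# Stacking layers: count the pixels whose value differs from the constant 1.0,
# i.e. the pixels that the border cue has reached at each depth.
reached_per_depth = []
x = image
for _ in range(3):
    x = conv3x3_zero_pad(x, kernel)
    reached_per_depth.append(int(np.sum(~np.isclose(x, 1.0))))
print(reached_per_depth)
```

In this toy setting the set of pixels influenced by the padding grows by one ring per layer, which is why the cue only reaches the image interior when the network is deep enough.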
We propose to include location information in the feature learning process by augmenting the three color channels with one or more additional channels that carry location information, see Fig. 1. In Fig. 1(b), the row and column indexes are added directly as separate channels. Instead of using two separate channels, we also tried to encode the precise location information in just one channel, by linearly indexing all image pixels and providing the linear index in one additional channel. However, we found that a CNN with such an augmentation is not easy to train in practice. Since the linear index has a much larger scale than image color, we normalized all input features (color and pixel index) to be in the same range, but that did not help training. The likely reason is that useful information about pixel position, for example whether a pixel is on the left or on the right of the image, involves division and modulo arithmetic and might be difficult to learn from linear indexes. In contrast, such location information is easy to learn from the separate channels for row and column indexes as in Fig. 1(b).
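The division and modulo arithmetic mentioned above can be made concrete with a short sketch. It assumes a row-major linear index and an illustrative 4 by 6 image; the variable names are hypothetical.

```python
import numpy as np

height, width = 4, 6  # illustrative image size

# Row-major linear index of every pixel, as in the single-channel encoding.
linear_index = np.arange(height * width).reshape(height, width)

# Recovering the position from the linear index needs integer division
# and modulo by the image width.
row = linear_index // width
col = linear_index % width

# "Is this pixel in the left half?" is a simple threshold on the column
# channel, but a periodic function of the linear index.
left_half_from_col = col < width // 2
left_half_from_index = (linear_index % width) < width // 2
```

With separate row and column channels the same question reduces to a single comparison per pixel, which is the kind of function a small convolutional filter followed by a nonlinearity can represent easily.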
In Fig. 1(c) we illustrate augmenting image colors with an additional channel that carries only approximate location information. In particular, it is the distance transform from the image center. This allows learning features that are based on proximity to the image border and/or the image center. We can also combine the row/column channels in Fig. 1(b) with the distance transform channel in Fig. 1(c) to have three additional input channels.
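A minimal sketch of how the three input variants could be constructed is given below. Several details are assumptions of this sketch rather than statements about our implementation: the coordinate channels are scaled to [0, 1], the distance channel is the Euclidean distance to the image center scaled so that the farthest pixel has value 1, and the function names `location_channels` and `augment` are hypothetical.

```python
import numpy as np

def location_channels(height, width):
    """Return (rows, cols, dist) channels, each of shape (height, width).

    Assumptions: coordinates are scaled to [0, 1]; dist is the Euclidean
    distance to the image center, scaled so the farthest pixel equals 1.
    """
    rows, cols = np.meshgrid(
        np.arange(height, dtype=np.float32),
        np.arange(width, dtype=np.float32),
        indexing="ij",
    )
    center_r, center_c = (height - 1) / 2.0, (width - 1) / 2.0
    dist = np.sqrt((rows - center_r) ** 2 + (cols - center_c) ** 2)
    rows = rows / max(height - 1, 1)
    cols = cols / max(width - 1, 1)
    dist = dist / max(float(dist.max()), 1e-12)
    return rows, cols, dist

def augment(rgb, mode="dist"):
    """Stack location channels onto an (H, W, 3) image.

    mode: 'coord' adds 2 channels, 'dist' adds 1, 'dist+coord' adds 3.
    """
    h, w, _ = rgb.shape
    rows, cols, dist = location_channels(h, w)
    extra = {
        "coord": [rows, cols],
        "dist": [dist],
        "dist+coord": [dist, rows, cols],
    }[mode]
    channels = [rgb.astype(np.float32)] + [c[..., None] for c in extra]
    return np.concatenate(channels, axis=-1)

# Usage: a dummy 5x7 RGB image augmented with all three location channels.
augmented = augment(np.zeros((5, 7, 3)), mode="dist+coord")
```

Because the location channels depend only on the image size, they can be computed once and reused for every image of that size.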
3 Experimental Evaluation
In this section, we evaluate location augmentation on the applications of saliency, semantic segmentation and scene parsing. We describe implementation details of our method, introduce datasets and evaluation criteria, and compare the performance of our proposed approach on different tasks and datasets.
In the following sections, all experiments are done on a computer with 16GB main memory and GeForce GTX 1080 with 8GB GPU memory. We use the original datasets without any augmentation with additional data.
3.1 Salient Object Segmentation
Datasets. We evaluate our method on several representative datasets, including MSRA-B , ECSSD, HKU-IS , and PASCAL-S , all of which are available online. These datasets contain many images with diverse scenes and have been widely used as saliency segmentation benchmarks. MSRA-B contains 5000 images from hundreds of categories. Most images in this dataset contain only one salient object. Due to its diversity and large size, MSRA-B has been one of the most widely used datasets in the salient object segmentation literature. ECSSD contains 1000 semantically meaningful but structurally complex natural images. HKU-IS is another large-scale dataset that contains more than 4000 challenging images. Most images in this dataset have low contrast and more than one salient object. PASCAL-S contains 850 challenging images (each composed of several objects), all of which are chosen from the validation set of the PASCAL VOC 2010 segmentation dataset. All these datasets provide ground truth human annotations. All networks in this section are trained on the entire MSRA-B dataset, i.e., 5000 images, and tested on the ECSSD, HKU-IS and PASCAL-S datasets.
CNN Architecture. We test the effectiveness of location augmentation by experimenting with networks of different depth. We use the standard decoder-encoder architecture  with 2, 3, 4, and 5 pooling layers, illustrated in Fig. 2.
For each network, we test the standard RGB input (Fig. 1(a)), RGB augmented with coordinates (Fig. 1(b)), RGB augmented with distance transform from the center (Fig. 1(c)), and, finally, RGB augmented with both the coordinates and the distance transform. In all the figures and tables, we refer to these different input types as ‘RGB’, ‘RGB+coord’, ‘RGB+dist’, ‘RGB+dist+coord’, respectively.
Our networks are implemented with TensorFlow. The mini-batch size is set to be 2 and learning rate is set to be and weight decay is set to . We use Adam  to train the network from scratch without utilizing pretrained model weights.
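Performance Metrics. To evaluate the quality of a saliency map, we use the F-measure, which is defined as
$$F_\beta = \frac{(1 + \beta^2) \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\beta^2 \cdot \mathrm{Precision} + \mathrm{Recall}}.$$
Following previous research articles on saliency segmentation, we set $\beta^2$ to 0.3 to stress the importance of the precision value.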
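As an illustration, the weighted F-measure commonly used in the saliency literature can be computed as in the sketch below, with the weight set to 0.3 as stated above. The binarization threshold of 0.5 and the function name `f_measure` are assumptions of this sketch; the thresholding scheme used to binarize saliency maps is not specified here.

```python
import numpy as np

def f_measure(saliency, ground_truth, threshold=0.5, beta_sq=0.3):
    """Weighted F-measure between a saliency map and a binary ground truth.

    threshold is an assumed fixed binarization level; beta_sq is the weight
    on precision (0.3 in the text).
    """
    pred = np.asarray(saliency) >= threshold
    gt = np.asarray(ground_truth).astype(bool)
    true_pos = np.logical_and(pred, gt).sum()
    precision = true_pos / pred.sum() if pred.sum() > 0 else 0.0
    recall = true_pos / gt.sum() if gt.sum() > 0 else 0.0
    denom = beta_sq * precision + recall
    if denom == 0:
        return 0.0  # no overlap between prediction and ground truth
    return float((1 + beta_sq) * precision * recall / denom)

# Usage on a toy 2x2 example: one of two predicted pixels is correct,
# and one of two ground-truth pixels is found.
score = f_measure([[1.0, 1.0], [0.0, 0.0]], [[1, 0], [1, 0]])
```

With a weight below 1 in the denominator, a drop in precision lowers the score more than an equal drop in recall.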
Inference Time. The running time is evaluated by averaging over 1000 images, each of size 400 × 300, with 10 trials for each input, on a GeForce GTX 1080. The running time is reported in seconds.
Discussion. Tables 1, 2, 3 and 4 show the F-measure obtained by training the networks with 2, 3, 4, and 5 pooling layers, respectively. In all cases, augmenting RGB channels with location information helps. The largest F-measure gains, from around 7% to 13% depending on the dataset, are on the network with 2 pooling layers, see Table 1. For deeper networks, with more pooling layers, the gains remain significant: around 2% in F-measure across the different datasets for the deepest network with 5 pooling layers. Comparing different location augmentation methods, RGB+dist+coord is the best performer for the network with 2 pooling layers, and RGB+dist is the best performer for networks with 3, 4, and 5 pooling layers. Running times on all the networks are in Table 5. The increase in inference time for location augmented input is small, especially as the networks get deeper. Thus location augmentation can be used either to increase accuracy or to use a shallower network without losing accuracy while gaining computational efficiency.
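One way to see why the overhead is small is to count the weights that the extra channels add to the first convolutional layer, the only layer whose input changes. The layer size in the sketch below (64 filters of size 3 by 3) is an illustrative assumption and not a description of the networks in Fig. 2; the function name is hypothetical.

```python
def first_layer_weights(in_channels, out_channels=64, kernel_size=3):
    """Number of convolution weights (excluding biases) in the input layer."""
    return kernel_size * kernel_size * in_channels * out_channels

# Input channel counts for the four input types compared in the experiments.
inputs = {"RGB": 3, "RGB+dist": 4, "RGB+coord": 5, "RGB+dist+coord": 6}

base = first_layer_weights(inputs["RGB"])
# Additional weights relative to plain RGB input, per input type.
extra_weights = {name: first_layer_weights(c) - base for name, c in inputs.items()}

for name, extra in extra_weights.items():
    print(f"{name}: {extra} additional first-layer weights")
```

Under these assumptions each added channel contributes a few hundred weights, and this count does not grow with network depth, so its relative cost shrinks as the network gets deeper.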
3.2 Semantic Segmentation
In semantic segmentation, the task is to assign each image pixel an object class. Compared to saliency segmentation, it is a more challenging task due to a larger number of classes. We use the mean of class-wise intersection over union (Mean IoU), a standard metric in semantic segmentation, for evaluation.
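For reference, mean IoU can be computed from a confusion matrix as in the sketch below. Two choices are assumptions of this sketch: classes that appear in neither the prediction nor the ground truth are skipped when averaging, and no ignore label is handled. The function name `mean_iou` is hypothetical.

```python
import numpy as np

def mean_iou(pred, target, num_classes):
    """Mean of class-wise intersection over union for integer label maps."""
    pred = np.asarray(pred).astype(np.int64).ravel()
    target = np.asarray(target).astype(np.int64).ravel()
    # Confusion matrix: rows are ground-truth classes, columns are predictions.
    conf = np.bincount(num_classes * target + pred, minlength=num_classes ** 2)
    conf = conf.reshape(num_classes, num_classes)
    intersection = np.diag(conf)
    union = conf.sum(axis=0) + conf.sum(axis=1) - intersection
    present = union > 0  # skip classes absent from both maps (assumption)
    iou = intersection[present] / union[present]
    return float(iou.mean())

# Usage on a toy 4-pixel example with two classes.
toy_score = mean_iou([0, 0, 1, 1], [0, 1, 1, 1], num_classes=2)
```

In the toy example, class 0 has IoU 1/2 and class 1 has IoU 2/3, so the mean is 7/12.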
We evaluate our proposed method on the PASCAL VOC dataset. The PASCAL VOC 2011 segmentation challenge contains 1112 training images. Hariharan et al.  collected labels for a larger set of 11318 PASCAL images, the Semantic Boundaries Dataset (SBD). The images in the SBD are divided into 8498 training images and 2820 test images. The test images are from a subset of the VOC2011 validation set. There are training images from  included in the PASCAL VOC 2011 validation set, so we test the networks on the non-intersecting set of 736 images as done in . We use FCN-32s  as the base network and use Caffe to implement the network. We report the best results achieved after convergence at a fixed learning rate, training with SGD with momentum set to 0.99. Weight decay is set to 0.0005. Since pretrained weights are not available for the location augmentation connections, to make a fair comparison, we randomly initialize the parameters of the input layer (whether the input is RGB, RGB+dist, etc.).
The semantic segmentation results on the SBD dataset are shown in Tables 6, 7, and 8. The mean IoU is in the last column of Table 8. Here, again, all location augmentation CNNs perform better than CNNs with RGB input only. The best overall performer is RGB+dist, with a mean IoU 2.6% better than with RGB input.
Looking at individual object classes, the airplane class improves the most, namely 8.5% with ‘dist+coord’ augmentation over just RGB input. This makes sense, as airplanes tend to be in the top part of the image and location augmentation can help significantly. The bird class also shows a large improvement, almost 5%, for the same reason. Other classes that show a large improvement with location added are the cow, boat, and horse classes. The only class that does not show any improvement with location augmentation is the bus class. Interestingly, the car class does show a significant improvement with location, even though it is conceptually similar to the bus class. This is probably because cars appear in less random locations in this dataset than buses do.
3.3 Scene Parsing
Scene parsing is similar to semantic segmentation. The goal is to assign each pixel in the image a category label. Scene parsing provides a complete understanding of the scene and is useful for automatic driving, robot sensing, etc.
We evaluate the performance of our proposed method on the scene parsing task on the Cityscapes dataset , which is a recently released large-scale dataset that contains a diverse set of street scenes for semantic urban scene understanding. It contains high quality, finely annotated pixel-level images collected from 50 cities. We train the network on 2,975 training images, and test the network on the validation set, which contains 500 images. These images define 19 categories containing both stuff and objects. All images are resized from 1024 × 2048 to 512 × 1024 to fit into GPU memory.
We also use FCN-32s  as the base network and use Caffe  to implement the network. We report the best results achieved after convergence at a fixed learning rate, training with SGD with a high momentum of 0.99. For a fair comparison, we also randomly initialize the parameters of the first layer, since pretrained weights for the location connections are not available.
The scene parsing results on the Cityscapes dataset are shown in Tables 9, 10, and 11. The mean IoU measure over all classes is in the last column of Table 11. As in semantic segmentation, the best result is obtained with distance augmentation, that is, using RGB+dist for the input layer. The mean IoU measure is improved by 3.3%. All classes show improvement using location augmentation. The most improved classes are traffic sign, wall, sidewalk, and motorcycle. The least improved class is road.
4 Conclusions
In this paper, we proposed to include location information in the input layer of a CNN architecture, adding it to the standard RGB input. We include a direct measure of location, such as image coordinates, and also an indirect measure, such as the distance transform from the image center. We also test the combination of the two. In most cases, adding just the distance transform gives better accuracy than adding direct location information. This might be because, on the one hand, the necessary information about location is sufficiently represented by the distance from the image center and, on the other hand, the distance transform augmentation has fewer parameters to learn.
We showed that a significant improvement in accuracy can be achieved by adding location information, while the increase in computational time is negligible. The increase in computational time is small because the number of new weights introduced is small. The deeper the network, the smaller the effect of location augmentation, as expected, but the effect remains significant as the depth increases.
References
-  Fukushima, K.: Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position. Biological Cybernetics 36 (1980) 193–202
-  LeCun, Y., Boser, B., Denker, J.S., Henderson, D., Howard, R.E., Hubbard, W., Jackel, L.D.: Backpropagation applied to handwritten zip code recognition. Neural Comput. 1 (1989) 541–551
-  Krizhevsky, A., Sutskever, I., Hinton, G.E.: Imagenet classification with deep convolutional neural networks. In: Neural Information Processing Systems. (2012)
-  Sermanet, P., Eigen, D., Zhang, X., Mathieu, M., Fergus, R., Lecun, Y.: Overfeat: Integrated recognition, localization and detection using convolutional networks. In: International Conference on Learning Representations. (2014)
-  Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. CoRR abs/1409.1556 (2014)
-  Girshick, R., Donahue, J., Darrell, T., Malik, J.: Rich feature hierarchies for accurate object detection and semantic segmentation. In: Conference on Computer Vision and Pattern Recognition. CVPR (2014) 580–587
-  Girshick, R.B.: Fast R-CNN. CoRR (2015)
-  He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Conference on Computer Vision and Pattern Recognition. (2016) 770–778
-  Farabet, C., Couprie, C., Najman, L., LeCun, Y.: Learning hierarchical features for scene labeling. Transactions on Pattern Analysis and Machine Intelligence (2013)
-  Shelhamer, E., Long, J., Darrell, T.: Fully convolutional networks for semantic segmentation. Transactions on Pattern Analysis and Machine Intelligence 39 (2017) 640–651
-  Chen, L., Papandreou, G., Kokkinos, I., Murphy, K., Yuille, A.L.: Semantic image segmentation with deep convolutional nets and fully connected crfs. In: ICLR. (2015)
-  Felzenszwalb, P.F., Huttenlocher, D.P.: Efficient graph-based image segmentation. International Journal of Computer Vision 59 (2004) 167–181
-  Watters, N., Zoran, D., Weber, T., Battaglia, P., Pascanu, R., Tacchetti, A.: Visual interaction networks: Learning a physics simulator from video. In: NIPS. (2017) 4539–4547
-  Liu, R., Lehman, J., Molino, P., Such, F.P., Frank, E., Sergeev, A., Yosinski, J.: An intriguing failing of convolutional neural networks and the coordconv solution. In: NIPS. (2018)
-  Xu, N., Price, B.L., Cohen, S., Yang, J., Huang, T.S.: Deep interactive object selection. CoRR abs/1603.04042 (2016)
-  Xu, N., Price, B.L., Cohen, S., Yang, J., Huang, T.S.: Deep grabcut for object selection. CoRR abs/1707.00243 (2017)
-  Dean, V., Delong, A., Frey, B.: Deep learning for branch point selection in rna splicing. In: NIPS Workshop on Machine Learning in Computational Biology. (2016)
-  Li, G., Yu, Y.: Visual saliency based on multiscale deep features. In: Computer Vision and Pattern Recognition. (2015)
-  Yan, Q., Xu, L., Shi, J., Jia, J.: Hierarchical saliency detection. In: Computer Vision and Pattern Recognition. (2013)
-  Li, Y., Hou, X., Koch, C., Rehg, J.M., Yuille, A.L.: The secrets of salient object segmentation. In: Computer Vision and Pattern Recognition. (2014)
-  Liu, T., Yuan, Z., Sun, J., Wang, J., Zheng, N., Tang, X., Shum, H.Y.: Learning to detect a salient object. IEEE Transactions on Pattern Analysis and Machine Intelligence 33 (2011) 353–367
-  Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., Corrado, G.S., Davis, A., Dean, J., Devin, M., Ghemawat, S., Goodfellow, I., Harp, A., Irving, G., Isard, M., Jia, Y., Jozefowicz, R., Kaiser, L., Kudlur, M., Levenberg, J., Mané, D., Monga, R., Moore, S., Murray, D., Olah, C., Schuster, M., Shlens, J., Steiner, B., Sutskever, I., Talwar, K., Tucker, P., Vanhoucke, V., Vasudevan, V., Viégas, F., Vinyals, O., Warden, P., Wattenberg, M., Wicke, M., Yu, Y., Zheng, X.: TensorFlow: Large-scale machine learning on heterogeneous systems. (2015)
-  Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proc. ICLR. (2015)
-  Hariharan, B., Arbelaez, P., Bourdev, L., Maji, S., Malik, J.: Semantic contours from inverse detectors. In: International Conference on Computer Vision. (2011)
-  Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: Computer Vision and Pattern Recognition. (2015)
-  Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long, J., Girshick, R., Guadarrama, S., Darrell, T.: Caffe: Convolutional architecture for fast feature embedding. In: Proceedings of the 22nd ACM international conference on Multimedia. (2014)
-  Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: Proc. CVPR. (2016)