Robot localization is essential for navigation, planning, and autonomous operation. Several vision-based approaches address the robot localization and mapping problem in various environments garcia2015vision . Constant changes in the environment, such as varying weather conditions and seasonal changes, and the need for expert knowledge to properly lay out a set of domain-specific features make it hard to develop a robust and generic solution to the problem.
Visual localization can be divided into two main types: metric and topological localization. Metric localization consists of computing the coordinates of the location of the observer. The coordinates of the vehicle pose are usually obtained by visual odometry methods visapp15 ; melekhov2017cnnBspp ; MohantyDeepVO ; NicolaiRSSw2016 . Visual metric approaches can provide accurate localization values, but suffer from drift accumulation as the trajectory length increases. In general, they do not reach the same accuracy as LiDAR-based localization techniques. Topological localization detects the observer’s approximate position from a finite set of possible locations lowry2016visual ; arandjelovic2016netvlad ; arroyo2016fusion . This class of localization methods provides coarse localization results, for instance, whether a robot is in front of a specific building or room. Due to their limited state space, topological approaches provide reliable localization results without drift, but only rough position measurements.
Accurate, drift-free localization can be obtained by combining both approaches, which is known as topometric localization mazuran15isrr . In this paper, we present a novel topometric localization technique that formulates metric and topological mapping as learning problems, each addressed by a deep network. The first network computes the relative visual odometry between consecutive images, while the second network estimates the topological location. Both networks are trained independently and their outputs are provided to our topometric optimization technique. By fusing the output of the visual odometry network with the predicted topological location, we are able to produce an accurate estimate that is robust to the trajectory length and has minimal drift accumulation. An overview of the proposed method is shown in Figure 1. We introduce a real-world dataset collected on the Freiburg University Campus over the course of six months. On this dataset, we compare the proposed approach to state-of-the-art visual odometry methods. The experimental evaluation shows that the proposed approach yields an accuracy comparable to that of LiDAR-based localization methods. The dataset will be made publicly available to simplify comparisons and foster research in visual localization.
2 Related Work
One of the seminal deep learning approaches for visual odometry was proposed by Konda et al. visapp15 . They proposed a CNN architecture that infers odometry based on classification. A set of prior velocities and directions is classified through a softmax layer to infer the transformation between images from a stereo camera. A major drawback of this approach lies in modeling a regression problem as a classification one, which reduces the representational capabilities of the learned model. Other approaches melekhov2017cnnBspp ; MohantyDeepVO ; NicolaiRSSw2016 tackle visual odometry as a regression problem. Nicolai et al. NicolaiRSSw2016 proposed a CNN architecture for depth images based on LiDAR scans. Their simple architecture yields real-time capability. Mohanty et al. MohantyDeepVO proposed a Siamese AlexNet AlexNet based approach called DeepVO, where the translation and rotation outputs of the network are regressed through an L2-loss layer with equal weight values. Choosing weight values for the translational and rotational components of the regression output is explored by Melekhov et al. melekhov2017cnnBspp . They propose a Siamese AlexNet AlexNet network similar to DeepVO MohantyDeepVO , to which they add a weight term to balance the translational and rotational errors. They additionally use a spatial pyramid pooling (SPP) layer SPP , which allows for arbitrary input image resolutions. Our metric localization approach shares similarities with melekhov2017cnnBspp , since we use SPP and a loss function that balances translation and rotation losses. Our contribution to this problem is a new densely connected architecture along with a different angular representation.
Another part of our approach is topological localization. With increasing focus on the long-term autonomy of mobile agents in challenging environments, the need for life-long visual place recognition has become more crucial than before lowry2016visual . In arandjelovic2016netvlad , the authors present an end-to-end approach for large-scale visual place recognition. Their network aggregates mid-level convolutional features extracted from the entire image into a compact vector representation using VLAD jegou:VLAD , resulting in a compact and robust image descriptor. Chen et al. chen2014convolutional combine a CNN with spatial and sequential filtering. Spatio-temporal filtering and spatial continuity checks ensure that consecutive first-ranked hypotheses occur at indices close to the query image. In the context of this work, the problem of visual place recognition can be considered a generalized form of topological localization. Similar to the visual place recognition approaches presented above, the authors in arroyo2016fusion present a topological localization approach that is robust to seasonal changes using a CNN. They fuse information from convolutional layers at several depths, finally compressing the output into a single feature vector. Image matching is done by computing the Hamming distance between the feature vectors after binarization, which improves the speed of the whole approach. Similar to these approaches, we use a CNN architecture that aggregates information from convolutional layers to learn a compact feature representation. However, instead of using a distance heuristic, the output of our network is a probability distribution over a discretized set of locations. Moreover, whereas previous methods rely on either distance or visual similarity to create the topological set of locations, we introduce a technique that takes both factors into account.
Topometric localization was explored by Badino et al. Badino11Topometric and Mazuran et al. mazuran15isrr . Badino et al. Badino11Topometric proposed one of the first methods for topometric localization. They represent visual features using a topometric map and localize them using a discrete Bayes filter. The topometric representation is a topological map where each node is linked to a pose in the environment. Mazuran et al. mazuran15isrr extended the previous method to relative topometric localization. They introduced a topometric approach that does not assume the graph to be the result of an optimization algorithm and relaxed the assumption of a globally consistent metric map.
This paper proposes a topometric localization method that applies deep learning to image sequences from a camera. The topometric localization problem consists of estimating the robot pose and its topological node, given a map encoded as a set of globally referenced nodes equipped with sensor readings mazuran15isrr ; sprunk12icra .
We propose a deep CNN to estimate the relative motion between two consecutive images of a sequence, which is accumulated across the traversed path to provide a visual odometry solution. In order to reduce the drift often encountered by visual odometry, we propose a second deep CNN for visual place recognition. The outputs of the two networks are fused into an accurate location estimate. In the remainder of this section, we describe the two networks and the fusion approach in detail.
3.1 Metric Localization
The goal of our metric localization system is to estimate the relative camera pose from images. We design a novel architecture using dense blocks huang2016densely as its base, which, given a pair of consecutive images $(I_{t-1}, I_t)$ in a sequence, predicts a 4-dimensional relative camera pose $p_t$:
$$p_t := [\Delta x^{tr}_t, \Delta r_t]$$
where $\Delta x^{tr}_t \in \mathbb{R}^2$ holds the $x$ and $y$ relative translation values and $\Delta r_t \in \mathbb{R}^2$ is the relative rotation estimate. We represent the rotation using Euler6 notation, i.e., the rotation angle $\theta$ is represented by two components, $[\sin(\theta), \cos(\theta)]$.
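To make the pose representation concrete, the following is a minimal sketch of a sine/cosine rotation encoding and of how per-frame relative poses could be chained into a trajectory. It is an illustration only: the function names are hypothetical, the reading of Euler6 as a (sin, cos) pair is an assumption, and so is the choice to express each relative translation in the frame of the previous pose.

```python
import math


def encode_rotation(theta):
    """Represent an angle by two components (assumed here: sine and cosine)."""
    return (math.sin(theta), math.cos(theta))


def decode_rotation(sin_c, cos_c):
    """Recover the angle; atan2 also tolerates un-normalized components."""
    return math.atan2(sin_c, cos_c)


def accumulate(relative_poses, start=(0.0, 0.0, 0.0)):
    """Chain relative poses (dx, dy, sin, cos) into absolute (x, y, theta).

    Assumption: dx and dy are expressed in the frame of the previous pose.
    """
    x, y, theta = start
    trajectory = [start]
    for dx, dy, sin_c, cos_c in relative_poses:
        x += math.cos(theta) * dx - math.sin(theta) * dy
        y += math.sin(theta) * dx + math.cos(theta) * dy
        theta += decode_rotation(sin_c, cos_c)
        theta = math.atan2(math.sin(theta), math.cos(theta))  # wrap to (-pi, pi]
        trajectory.append((x, y, theta))
    return trajectory


# Four steps of "1 m forward, then turn 90 degrees" trace a closed square.
step = (1.0, 0.0) + encode_rotation(math.pi / 2)
square = accumulate([step] * 4)
```

Because every step adds its own small error to the running pose, any bias in the predicted relative motion compounds along the path, which is the drift that the topological corrections are meant to bound.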
The proposed metric localization network is designed to regress the relative translation and orientation of an input image pair $(I_{t-1}, I_t)$. We train the network based on the Euclidean loss between the estimated vectors and the ground truth. A loss function that treats translation and orientation in the same manner was found inadequate because of the difference in scale between them. Instead, we define the following loss function:
$$\mathcal{L}(I_{t-1}, I_t) := \lVert \Delta x^{tr}_t - \Delta \hat{x}^{tr}_t \rVert_2 + \beta \, \lVert \Delta r_t - \Delta \hat{r}_t \rVert_2$$
where $\Delta x^{tr}_t$ and $\Delta r_t$ are respectively the relative ground-truth translation and rotation vectors and $\Delta \hat{x}^{tr}_t$ and $\Delta \hat{r}_t$ their estimated counterparts. Similar to Kendall et al. kendall2015posenet , we use the parameter $\beta$ to balance the loss for the translation and orientation error.
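A weighted Euclidean loss of the kind described above can be sketched as follows. This is not the training code of the paper: the name `beta`, the use of unsquared norms, and the layout of a pose as `(dx, dy, sin, cos)` are assumptions made for illustration.

```python
import math


def l2_norm(a, b):
    """Euclidean distance between two equally long vectors."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def pose_loss(predicted, target, beta):
    """Translation error plus beta times rotation error.

    Both poses are laid out as (dx, dy, sin, cos); beta is the balance
    parameter (its value is an assumption of the caller).
    """
    translation_error = l2_norm(predicted[:2], target[:2])
    rotation_error = l2_norm(predicted[2:], target[2:])
    return translation_error + beta * rotation_error


# A 5 m translation error with a perfect rotation estimate.
only_translation = pose_loss((3.0, 4.0, 0.0, 1.0), (0.0, 0.0, 0.0, 1.0), beta=10.0)
# A 90 degree rotation error with a perfect translation estimate.
only_rotation = pose_loss((0.0, 0.0, 1.0, 0.0), (0.0, 0.0, 0.0, 1.0), beta=10.0)
```

The rotation components are bounded by the unit circle while translations are measured in metres, so without a weight the translation term would tend to dominate the gradient; the balance parameter compensates for that difference in scale.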
To estimate the visual odometry or relative camera pose, we propose a Siamese architecture built upon dense blocks huang2016densely and spatial pyramid pooling (SPP) SPP . The direct connections between multiple layers of a dense block yielded state-of-the-art results on other tasks and, as we show in this paper, also yield excellent performance for the task at hand. Figure 2 shows the proposed VONet architecture. The network consists of two parts: time feature representation and regression. The time representation streams are built upon dense blocks with intermediate transition blocks. Each dense block contains multiple dense layers with direct connections from each layer to all subsequent layers. Consequently, each dense layer receives as input the feature maps of all preceding layers. A dense layer is composed of four consecutive operations, namely batch normalization (Batch Norm) ioffe2015batch , a rectified linear unit (ReLU) nair2010rectified , a convolution (conv), and a drop-out. The structure of the transition layer is very similar to that of a dense layer, with the addition of a pooling operation after the drop-out and a convolution instead of the one. Furthermore, we alter the first two convolutional operations in the time representation streams by increasing their kernel sizes to and , respectively. These layers serve the purpose of observing larger image areas, thus providing better motion prediction. We also modify the dense blocks by replacing ReLUs with exponential linear units (ELUs), which proved to speed up training and provided better results clevert2015fast . Both network streams are identical and learn feature representations for each of the two input images.
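The dense connectivity pattern described above can be illustrated with a deliberately simplified sketch. It is a toy rather than VONet: each channel is reduced to a single scalar, the `make_toy_layer` helper is hypothetical and stands in for the normalization, activation, convolution, and drop-out sequence, and the growth rate of 4 is chosen arbitrarily.

```python
import math


def elu(x, alpha=1.0):
    """Exponential linear unit: identity for x > 0, saturating below zero."""
    return x if x > 0 else alpha * (math.exp(x) - 1.0)


def make_toy_layer(growth_rate, seen):
    """Return a stand-in for one dense layer.

    The layer records how many input channels it receives (in `seen`)
    and emits `growth_rate` new channels.
    """
    def layer(features):
        seen.append(len(features))
        mean = sum(features) / len(features)
        return [elu(mean - i) for i in range(growth_rate)]
    return layer


def dense_block(inputs, layers):
    """Feed every layer the concatenation of all preceding feature maps."""
    features = list(inputs)
    for layer in layers:
        features = features + layer(features)  # concatenate along channels
    return features


seen = []
block = [make_toy_layer(4, seen) for _ in range(3)]
out = dense_block([0.5, -0.2], block)
```

With 2 input channels and a growth rate of 4, the three layers receive 2, 6, and 10 channels and the block outputs 14, which shows how the input width of each layer grows linearly with depth.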
We fuse both branches from VONet through concatenation. We also tried a fully connected layer, but extracting and fusing features from convolutional layers produced better results. The fused features are passed to another dense block, which is followed by a Spatial Pyramid Pooling (SPP) layer. SPPs are another main building block of our approach. Using SPP layers has two main advantages for the specific task we are tackling. First, it allows the use of the presented architecture with arbitrary image resolutions. The second advantage is the layer’s ability to maintain part of the spatial information by pooling within local spatial bins. The final layers of our network are fully connected layers used as a regressor estimating two 2-dimensional vectors.
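The two properties attributed to SPP above, a fixed-length output for arbitrary input resolutions and pooling within local spatial bins, can be seen in the following minimal sketch. It handles a single 2-D feature map stored as a list of rows; the pyramid levels `(1, 2, 4)` and the function name are assumptions for illustration and are not taken from the paper.

```python
def spatial_pyramid_pool(feature_map, levels=(1, 2, 4)):
    """Max-pool one 2-D feature map into a fixed-length vector.

    For each level n the map is divided into an n x n grid of bins
    (bin edges use floor/ceil, so neighbouring bins may overlap).
    """
    height, width = len(feature_map), len(feature_map[0])
    pooled = []
    for n in levels:
        for i in range(n):
            row_start = (i * height) // n
            row_end = -((-(i + 1) * height) // n)  # ceil((i + 1) * height / n)
            for j in range(n):
                col_start = (j * width) // n
                col_end = -((-(j + 1) * width) // n)
                pooled.append(max(feature_map[r][c]
                                  for r in range(row_start, row_end)
                                  for c in range(col_start, col_end)))
    return pooled


small = [[r * 7 + c for c in range(7)] for r in range(5)]   # 5 x 7 map
large = [[r * 8 + c for c in range(8)] for r in range(8)]   # 8 x 8 map
small_vec = spatial_pyramid_pool(small)
large_vec = spatial_pyramid_pool(large)
```

Both maps yield a vector of 1 + 4 + 16 = 21 values per channel, so the fully connected regressor that follows can keep a fixed input size regardless of the image resolution.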
3.2 Topological Localization
Given a set of images acquired on a path during navigation, the task of topological localization can be considered as the problem of identifying the current location among a set of previously visited ones. To this end, we first pre-process the acquired images to create distinct key-frames, then we train a CNN that learns a probability distribution over the key-frames given the input image.
To extract visually distinct locations from a given path, we cluster poses by computing an image-to-image correlation score, so that similar images are grouped together in one cluster. We select clusters that are within a certain distance threshold to represent the distinct key-frames. We chose the value of this threshold so that there is little visual aliasing between the generated key-frames.
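One way to combine visual similarity and distance when selecting key-frames is sketched below. The text does not specify the clustering procedure, so this greedy sequential variant, the Pearson correlation on flattened images, and the parameters `min_corr` and `max_dist` are all assumptions made purely for illustration.

```python
import math


def pearson(a, b):
    """Pearson correlation between two flattened images (0 if either is constant)."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / math.sqrt(var_a * var_b)


def select_keyframes(images, positions, min_corr, max_dist):
    """Greedy pass along the path: a frame stays in the current cluster only
    if it is both visually similar and spatially close to the cluster's
    key-frame; otherwise it starts a new cluster."""
    if not images:
        return []
    keyframes = [0]
    for idx in range(1, len(images)):
        ref = keyframes[-1]
        similar = pearson(images[idx], images[ref]) >= min_corr
        close = math.dist(positions[idx], positions[ref]) <= max_dist
        if not (similar and close):
            keyframes.append(idx)
    return keyframes


images = [[0, 1, 2, 3], [0, 1, 2, 4], [3, 2, 1, 0]]
positions = [(0.0, 0.0), (0.5, 0.0), (1.0, 0.0)]
by_appearance = select_keyframes(images, positions, min_corr=0.9, max_dist=2.0)
by_distance = select_keyframes([[0, 1, 2, 3]] * 2, [(0.0, 0.0), (5.0, 0.0)],
                               min_corr=0.9, max_dist=2.0)
```

In the first call the third frame opens a new cluster because its appearance changes, while in the second call two identical-looking frames are separated because they are far apart.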
Similar to our VONet, we introduce a network architecture based on DenseNet huang2016densely , namely LocNet. Our core network consists of four dense blocks with intermediate transition blocks. The proposed architecture differs from the DenseNet architecture by the addition of an extra fully connected layer before the prediction, and the addition of extra connections between the dense blocks that fuse information from earlier layers into later ones. Similar to the experiments of Huang et al. huang2016densely on ImageNet, we experiment with different depths and growth rates for the proposed architecture, using the same configurations as the ones reported by Huang et al. Figure 3 illustrates the network architecture for LocNet-121. Given an input image, the network estimates the probability distribution over the discrete set of locations.
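As a small illustration of what a probability distribution over discrete locations looks like in practice, the sketch below turns raw network scores into probabilities with a softmax and keeps the most likely location only when it is confident enough. The helper names and the threshold values are assumptions; the text does not state how the scores are normalized.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of raw scores."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def confident_node(logits, threshold):
    """Index of the most likely location, or None below the threshold."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return best if probs[best] > threshold else None


probs = softmax([1.0, 2.0, 3.0])
accepted = confident_node([1.0, 2.0, 3.0], threshold=0.5)
rejected = confident_node([1.0, 2.0, 3.0], threshold=0.9)
```

Gating on the maximum probability is one simple way to ensure that only high-confidence node matches are passed on to a later fusion stage.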
3.3 Topometric Localization
Our topometric approach aims to refine metric localization given topological priors. The topological localization network provides a set of values corresponding to locations, together with the confidence of each prediction. We therefore need to fuse the metric and topological network predictions into a single representation. The proposed topometric fusion objective comprises three terms: the forward drift correction, the backward path optimization, and a smoothness term weighted by a smoothing parameter. At each time step, the current pose is paired with the matched topological node whenever that node is predicted with a probability higher than a threshold. The smoothness term relies on a quadratic polynomial and on the weight function described in locallyweightRegression .
Equation (3) presents our forward drift correction approach, which we model in the following manner: given a high-probability topological node match, we compute the translation and rotation errors and propagate them through the subsequent metric localization predictions until another topological node is detected. Whereas the forward drift correction component is responsible for mitigating future error accumulation, the backward and smoothness terms are designed to further correct the obtained trajectory. The backward path optimization is introduced in Equation (4). It works as follows: given a confident topological node, it calculates the error of the current pose with respect to that node and, using an exponential decay, corrects previous predictions in terms of translation and rotation values until it reaches a predefined time window. We treat the exponential decay functions for translation and orientation separately because of their different scales.
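The forward and backward corrections described above can be sketched for the translation part of the trajectory as follows. This is a simplified reading, not the formulation of the paper: the function name, the `window` and `decay` parameters, and the rule that the backward pass stops at the previous matched node are assumptions, and rotation is omitted for brevity.

```python
import math


def fuse(odometry, node_matches, window, decay):
    """Correct an odometry track with confident topological node matches.

    odometry:     list of (x, y) positions from accumulated visual odometry
    node_matches: {time index: (x, y)} poses of confidently matched nodes
    window:       number of earlier poses the backward pass may touch
    decay:        rate of the exponential decay of the backward correction
    """
    poses = [list(p) for p in odometry]
    previous = -1
    for k in sorted(node_matches):
        err_x = node_matches[k][0] - poses[k][0]
        err_y = node_matches[k][1] - poses[k][1]
        # Forward: shift this pose and all later predictions by the error.
        for i in range(k, len(poses)):
            poses[i][0] += err_x
            poses[i][1] += err_y
        # Backward: decayed correction, bounded by the window and by the
        # previous matched node so that earlier anchors stay in place.
        for i in range(max(previous + 1, k - window), k):
            weight = math.exp(-decay * (k - i))
            poses[i][0] += weight * err_x
            poses[i][1] += weight * err_y
        previous = k
    return [tuple(p) for p in poses]


track = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0), (4.0, 0.0)]
corrected = fuse(track, {2: (2.0, 1.0)}, window=2, decay=math.log(2.0))
```

In this example the node at index 2 reveals a 1 m lateral error: the poses from index 2 onward are shifted by the full error, while the two earlier poses receive one half and one quarter of it.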
The final term, which smooths the trajectory, is presented in Equation (5). It corresponds to a local regression approach, similar to that used by Cleveland locallyweightRegression . Using a quadratic polynomial model, we locally fit a smooth surface to our current trajectory. Unlike the other terms, this term is applied only to translation. Rotation is not optimized in this term given that the angles are normalized. We chose smoothing by local regression because of the flexibility of the technique, which does not require specifying a global function to fit, only a smoothing parameter and the degree of the local polynomial.
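Locally weighted regression with a quadratic model can be sketched in a few lines. The version below smooths one coordinate as a function of time using tricube weights in the spirit of Cleveland's method; it omits the robustness iterations, assumes distinct time stamps, and uses the hypothetical parameter `span` for the fraction of points entering each local fit.

```python
import math


def tricube(u):
    """Tricube weight: (1 - |u|^3)^3 inside the unit interval, 0 outside."""
    u = abs(u)
    return (1.0 - u ** 3) ** 3 if u < 1.0 else 0.0


def solve(matrix, rhs):
    """Gaussian elimination with partial pivoting for a small dense system."""
    n = len(rhs)
    aug = [row[:] + [b] for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    sol = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][c] * sol[c] for c in range(r + 1, n))
        sol[r] = (aug[r][n] - tail) / aug[r][r]
    return sol


def local_regression(ts, ys, span, degree=2):
    """Smooth ys over ts by fitting a weighted polynomial around each point."""
    n = len(ts)
    k = min(n, max(degree + 2, math.ceil(span * n)))  # neighbours per fit
    size = degree + 1
    smoothed = []
    for t0 in ts:
        bandwidth = sorted(abs(t - t0) for t in ts)[k - 1]
        weights = [tricube((t - t0) / bandwidth) for t in ts]
        # Weighted normal equations in coordinates centred at t0.
        normal = [[sum(w * (t - t0) ** (i + j) for w, t in zip(weights, ts))
                   for j in range(size)] for i in range(size)]
        moment = [sum(w * y * (t - t0) ** i for w, t, y in zip(weights, ts, ys))
                  for i in range(size)]
        smoothed.append(solve(normal, moment)[0])  # fitted value at t0
    return smoothed


times = [float(t) for t in range(11)]
xs = [2.0 + 3.0 * t + 0.5 * t * t for t in times]
xs_smoothed = local_regression(times, xs, span=0.5)
```

Since each local model is quadratic, a trajectory coordinate that is already quadratic in time passes through unchanged, whereas high-frequency jitter is averaged out by the weighted fit; applying the function separately to the x and y coordinates smooths the translation part of a trajectory.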
We evaluated our topometric localization approach on a dataset collected on the Freiburg campus across different seasons. The dataset is split into two parts: RGB data and RGB-D data. We perform a separate evaluation for each of the proposed visual odometry and topological localization networks, as well as for the fused approach. The implementation is based on the publicly available TensorFlow learning toolbox abadi2016tensorflow , and all experiments were carried out on a system containing an NVIDIA Titan X GPU.
4.1 Experimental setup - Dataset
In order to evaluate the performance of the suggested approach, we introduce the Freiburg Localization (FLOC) Dataset, which we collected with our robotic platform, Obelix kummerle2015autonomous . Obelix is equipped with several sensors; in this work, we relied on three laser scanners, a Velodyne HDL-32E scanner, the Bumblebee camera, and a vertically mounted SICK scanner. Additionally, we mounted a ZED stereo camera to obtain depth data. As previously mentioned, the dataset is split into two parts: RGB and RGB-D data. We used images from the Bumblebee camera for the former and from the ZED camera for the latter. The data collection procedure was as follows: we navigated the robot along a chosen path twice, starting and ending at the same location. One run is chosen for training and the other for testing. The procedure was repeated several times for different paths on campus, at different times of the day, throughout a period of six months. The collected dataset has a high degree of noise caused by pedestrians and cyclists passing by in different directions, which makes it more challenging to estimate the relative change in motion between frames. We use the output of the SLAM system of Obelix as a source of ground-truth information for the traversed trajectory. We select nodes that are at a minimum distance of away from each other, along with the corresponding camera frame. Each node provides the position of the robot and the rotation in the form of a quaternion. As mentioned in Section 3.1, we opt for representing the rotations in Euler6 notation. Accordingly, we convert the poses obtained from the SLAM output of Obelix to Euler6. Furthermore, we disregard translation along the $z$-axis and rotations about the $x$- and $y$-axes, as they are very unlikely in our setup.
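The conversion from a ground-truth quaternion to the two-component rotation can be sketched as below. The sketch assumes a unit quaternion in `(w, x, y, z)` order, a vertical axis named z, and a sine/cosine pair as the target representation; none of these conventions is stated explicitly in the text, and the function names are hypothetical.

```python
import math


def quaternion_to_yaw(w, x, y, z):
    """Rotation about the vertical axis of a unit quaternion (w, x, y, z)."""
    return math.atan2(2.0 * (w * z + x * y), 1.0 - 2.0 * (y * y + z * z))


def yaw_to_components(yaw):
    """Two-component representation of the angle (assumed: sine and cosine)."""
    return (math.sin(yaw), math.cos(yaw))


# A pure 90 degree rotation about the vertical axis.
quarter_turn = (math.cos(math.pi / 4), 0.0, 0.0, math.sin(math.pi / 4))
yaw = quaternion_to_yaw(*quarter_turn)
components = yaw_to_components(yaw)
```

Keeping only this single angle discards the other two rotational degrees of freedom, which is acceptable when the platform moves on approximately flat ground.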
For the remainder of this section, we focus our attention on two sequences of the FLOC dataset, namely Seq-1 and Seq-2. Seq-1 is an RGB-D sequence with meters of total length captured by the ZED camera, while Seq-2 is comprised of a longer trajectory of meters of RGB only data captured by the Bumblebee camera. We favored those two sequences from the dataset as they are representative of the challenges faced by vision-based localization approaches. Seq-1 represents a case where most vision-based localization systems are likely to perform well as the trajectory length is short. Moreover the presence of depth information facilitates the translation estimation from the input images. On the other hand, Seq-2 is more challenging with almost double the trajectory length and no depth information.
4.2 Network Training
The networks were trained in a single-stage manner. VONet was trained using the Adam solver AdamSolver , with a mini-batch of for epochs. The initial learning rate is set to , and is multiplied by every two epochs. The input image resolution is downscaled to due to memory limitations. We adopt the same weight initialization as in huang2016densely . The loss function balance variable is set to . Training VONet for 100 epochs took around hours on a single GPU. LocNet was trained using Nesterov Momentum sutskever2013importance , with a base learning rate of for epochs with a batch size of . The learning rate was fixed throughout the evaluation.
4.3 Metric Localization
We evaluate our VO based metric localization approach over multiple sequences of FLOC. For these experiments we compared our approach with Nicolai et al. NicolaiRSSw2016 , DeepVO MohantyDeepVO and cnnBspp melekhov2017cnnBspp . For each sequence, two metrics are provided: average translation error and average rotation error as a function of the sequence length.
Figure 4 shows the computed trajectories of the compared methods for Seq-1. Table 1 depicts the average translation and rotation error as a function of sequence length. Our approach outperforms the compared methods with almost two times smaller error for translation and rotation inference making it the closest to the ground-truth.
[Figure 4: trajectories on Seq-1 for DeepVO, Nicolai et al., cnnBspp, and ours, plotted against the ground truth.]
[Table 1: columns Method, Translation [%], Rot [deg/m].]
Seq-1 shows that our VO approach can achieve state-of-the-art performance. However, the cumulative error characteristic of the problem makes longer trajectories harder. Figure 5 presents results for Seq-2. For this experiment the trajectories have a larger error, especially for translation. Table 2 quantifies the obtained values, confirming the difficulty of this sequence for all tested techniques. The results show that our approach still largely outperforms the compared methods, with translation and rotation errors almost twice as low as those of the other methods. Despite the performance of our approach, it is still far from being competitive with LiDAR-based approaches, like the one used to generate our ground truth kummerle2015autonomous . With this goal in mind, we exploit the topological localization method to refine our metric approach, providing an even more precise topometric approach.
[Table 2: columns Method, Translation [%], Rot [deg/m].]
[Figure 5: trajectories on Seq-2 for DeepVO, Nicolai et al., cnnBspp, and ours, plotted against the ground truth.]
4.4 Topological Localization
In this section, we evaluate the performance of the proposed LocNet architecture. To get an estimate of the suitability of the proposed architecture for the problem at hand, we used the Places2 dataset zhou2016places for scene recognition. The dataset contains over ten million scene images divided into 365 classes. We use the DenseNet model pretrained on ImageNet to initialize the weights of our LocNet-121 architecture, as our architecture is quite similar to DenseNet aside from using a different activation function. Using the pretrained model, we are able to achieve performance comparable to that of the Places365-GoogLeNet architecture as reported by the authors. Additionally, we compare the performance of our LocNet architecture with that of Residual Networks (ResNet) he2016deep , given their recent performance in image recognition. We evaluate the performance of both architectures over multiple sequences of the FLOC dataset and the Cambridge Landmarks dataset kendall2015posenet . For both datasets, we report the accuracy in terms of the number of images where the predicted location is within a radius of the ground-truth pose. Table 3 illustrates the performance results on Seq-1 of the FLOC dataset. We investigate the effect of the depth of the network on the accuracy of the predicted poses, while comparing the number of parameters. The best performance was achieved using LocNet-169 with an accuracy of with approximately fewer parameters than its best performing counterpart in ResNet. Table 4 illustrates the performance on the different scenes from the Cambridge Landmarks dataset. On this dataset, LocNet-201 achieves the best performance with the exception of the King’s College scene. It is worth noting that LocNet-169 achieves the second highest accuracy in four out of the five remaining scenes, providing further evidence of the suitability of this architecture for the problem at hand. For the remainder of the experimental evaluation, we use the prediction output from LocNet-201.
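The accuracy criterion described above, counting predictions that fall within a radius of the ground-truth pose, can be written down in a few lines. The sketch assumes 2-D positions and reports the count as a fraction; the function name and the example radius are hypothetical, and the radius used in the experiments is not reproduced here.

```python
import math


def accuracy_within_radius(predicted, ground_truth, radius):
    """Fraction of images whose predicted location lies within `radius`
    of the corresponding ground-truth position."""
    hits = sum(1 for p, g in zip(predicted, ground_truth)
               if math.dist(p, g) <= radius)
    return hits / len(ground_truth)


predicted = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
ground_truth = [(0.0, 0.5), (1.0, 3.0), (5.0, 5.0)]
accuracy = accuracy_within_radius(predicted, ground_truth, radius=1.0)
```

Here two of the three predictions fall inside the radius, giving an accuracy of two thirds.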
[Table legend: RN: ResNet, LN: LocNet.]
4.5 Topometric Localization
This section presents the results of fusing both topological and metric localization techniques. Figure 6 presents both the metric and topometric results for Seq-1. The difference between the ground-truth trajectory and that of our topometric approach is barely distinguishable visually. Table 5 shows an improvement of in the translation inference and superior to for orientation. Such values are competitive even with the LiDAR system used to provide the ground truth for FLOC.
We also evaluated topometric localization using Seq-2. Figure 7 depicts the obtained results. While the results for this sequence are not as accurate as those of Seq-1, the gain in translation is more than the metric counterpart. For orientation, even though our metric approach already presents good results, the error is reduced by half using our topometric technique, as shown in Table 5.
| Method | Translation [%] | Rot [deg/m] |
|---|---|---|
| Seq. 1 Metric | 1.54 | 0.2919 |
| Seq. 1 Topometric | 0.21 | 0.0464 |
| Seq. 2 Metric | 3.82 | 0.1137 |
| Seq. 2 Topometric | 0.38 | 0.0634 |
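The size of the gain from fusion can be read off the table directly. The short calculation below divides each metric error by its topometric counterpart; the values are copied from the table, while the dictionary layout and the function name are just a convenience for this illustration.

```python
# (translation [%], rotation [deg/m]) per sequence and method, from the table.
table5 = {
    "Seq. 1": {"metric": (1.54, 0.2919), "topometric": (0.21, 0.0464)},
    "Seq. 2": {"metric": (3.82, 0.1137), "topometric": (0.38, 0.0634)},
}


def improvement_factors(rows):
    """Ratio of metric to topometric error for translation and rotation."""
    (metric_t, metric_r) = rows["metric"]
    (topo_t, topo_r) = rows["topometric"]
    return metric_t / topo_t, metric_r / topo_r


factors = {seq: improvement_factors(rows) for seq, rows in table5.items()}
for seq, (translation, rotation) in factors.items():
    print(f"{seq}: translation x{translation:.2f}, rotation x{rotation:.2f}")
```

This gives roughly 7.33 times lower translation error and 6.29 times lower rotation error on Seq-1, and roughly 10.05 times and 1.79 times on Seq-2, respectively.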
One important characteristic of our topometric approach is that the error is bounded in between consecutive key-frames and does not grow unboundedly over time like with the metric localization method. The presented results show that based on the frequency of the topological nodes we can expect a maximum cumulative error based on the corrected topometric error and not on the pure metric cumulative error.
In this paper, we have presented a novel deep learning based topometric localization approach. We have proposed a new Siamese architecture, which we refer to as VONet, to regress the translational and rotational relative motion between two consecutive camera images. The output of the proposed network provides the visual odometry information along the traversed path. Additionally, we have discretized the trajectory into a finite set of locations and have trained a convolutional neural network architecture, denoted as LocNet, to learn the probability distribution over the locations. We have proposed a topometric optimization technique that corrects the drift accumulated in the visual odometry and further corrects the traversed path. We evaluated our approach on the new Freiburg Localization (FLOC) dataset, which we collected over the course of six months in adverse weather conditions using different modalities and which we will provide to the research community. The extensive experimental evaluation shows that our proposed VONet and LocNet architectures surpass current state-of-the-art methods for their respective problem domain. Furthermore, using the proposed topometric approach we improve the localization accuracy by one order of magnitude.
Acknowledgements. This work has been partially supported by the European Commission under the grant numbers H2020-645403-ROBDREAM and ERC-StG-PE7-279401-VideoLearn, and by the Freiburg Graduate School of Robotics.
- (1) Abadi, M., Agarwal, A., Barham, P., et al.: Tensorflow: Large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467 (2016)
- (2) Arandjelovic, R., Gronat, P., Torii, A., Pajdla, T., Sivic, J.: Netvlad: Cnn architecture for weakly supervised place recognition.
- (3) Arroyo, R., Alcantarilla, P.F., Bergasa, L.M., Romera, E.: Fusion and binarization of cnn features for robust topological localization across seasons. In: Proc. of the IEEE/RSJ Int. Conf. on Intelligent Robots and Systems (2016)
- (4) Badino, H., Huber, D., Kanade, T.: Visual topometric localization. In: IEEE Intelligent Vehicles Symposium (IV) (2011)
- (5) Chen, Z., Lam, O., Jacobson, A., Milford, M.: Convolutional neural network-based place recognition. arXiv preprint arXiv:1411.1509 (2014)
- (6) Cleveland, W.S.: Robust locally weighted regression and smoothing scatterplots. Journal of the American Statistical Association 74(368), 829–836 (1979)
- (7) Clevert, D., Unterthiner, T., Hochreiter, S.: Fast and accurate deep network learning by exponential linear units (elus). arXiv preprint arXiv:1511.07289 (2015)
- (8) Garcia-Fidalgo, E., Ortiz, A.: Vision-based topological mapping and localization methods: A survey. Robotics & Autonomous Systems 64, 1–20 (2015)
- (9) He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proc. of the IEEE Conf. on Computer Vision and Pattern Recognition (2016)
- (10) Huang, G., Liu, Z., Weinberger, K.Q., van der Maaten, L.: Densely connected convolutional networks. arXiv preprint arXiv:1608.06993 (2016)
- (11) Ioffe, S., Szegedy, C.: Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167 (2015)
- (12) Jégou, H., Douze, M., Schmid, C., Pérez, P.: Aggregating local descriptors into a compact image representation. In: CVPR 2010 - 23rd IEEE Conference on Computer Vision & Pattern Recognition, pp. 3304–3311 (2010). DOI 10.1109/CVPR.2010.5540039. URL https://hal.inria.fr/inria-00548637
- (13) He, K., Zhang, X., Ren, S., Sun, J.: Spatial pyramid pooling in deep convolutional networks for visual recognition. In: European Conf. on Computer Vision (2014)
- (14) Kendall, A., Grimes, M., Cipolla, R.: Posenet: A convolutional network for real-time 6-dof camera relocalization. In: Proc. of the IEEE Int. Conf. on Computer Vision (2015)
- (15) Kingma, D., Ba, J.: Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)
- (16) Konda, K., Memisevic, R.: Learning visual odometry with a convolutional network. In: Proc. of the 10th Int. Conf. on Computer Vision Theory and Applications (2015)
- (17) Krizhevsky, A., Sutskever, I., Hinton, G.E.: Imagenet classification with deep convolutional neural networks. In: F. Pereira, C.J.C. Burges, L. Bottou, K.Q. Weinberger (eds.) Advances in Neural Information Processing Systems (2012)
- (18) Kümmerle, R., Ruhnke, M., Steder, B., Stachniss, C., Burgard, W.: Autonomous robot navigation in highly populated pedestrian zones. Journal of Field Robotics 32(4), 565–589 (2015)
- (19) Lowry, S., Sünderhauf, N., Newman, P., Leonard, J., Cox, D., Corke, P., Milford, M.J.: Visual place recognition: A survey. IEEE Transactions on Robotics 32(1), 1–19 (2016)
- (20) Mazuran, M., Boniardi, F., Burgard, W., Tipaldi, G.D.: Relative topometric localization in globally inconsistent maps. In: Proc. of the Int. Symposium on Robotics Research (ISRR) (2015)
- (21) Melekhov, I., Kannala, J., Rahtu, E.: Relative camera pose estimation using convolutional neural networks. arXiv preprint arXiv:1702.01381 (2017)
- (22) Mohanty, V., Agrawal, S., Datta, S., Ghosh, A., Sharma, V.D., Chakravarty, D.: Deepvo: A deep learning approach for monocular visual odometry. arXiv preprint arXiv:1611.06069 (2016)
- (23) Nair, V., Hinton, G.E.: Rectified linear units improve restricted boltzmann machines. In: Proc. of the 27th Int. Conf. on Machine Learning (ICML-10) (2010)
- (24) Nicolai, A., Skeele, R., Eriksen, C., Hollinger, G.A.: Deep learning for laser based odometry estimation. In: RSS workshop Limits and Potentials of Deep Learning in Robotics (2016)
- (25) Sprunk, C., Lau, B., Burgard, W.: Improved non-linear spline fitting for teaching trajectories to mobile robots. In: Proc. of the IEEE Int. Conf. on Robotics and Automation (2012)
- (26) Sutskever, I., Martens, J., Dahl, G.E., Hinton, G.E.: On the importance of initialization and momentum in deep learning. Proc. of the Int. Conf. on Machine Learning 28, 1139–1147 (2013)
- (27) Zhou, B., Khosla, A., Lapedriza, A., Torralba, A., Oliva, A.: Places: An image database for deep scene understanding. arXiv preprint arXiv:1610.02055 (2016)