End-to-end Learning Improves Static Object Geo-localization in Monocular Video

by Mohamed Chaabane, et al.

Accurately estimating the position of static objects, such as traffic lights, from the moving camera of a self-driving car is a challenging problem. In this work, we present a system that improves the localization of static objects by jointly optimizing the components of the system via learning. Our system comprises networks that perform: 1) 6DoF object pose estimation from a single image, 2) association of objects between pairs of frames, and 3) multi-object tracking to produce the final geo-localization of the static objects within the scene. We evaluate our approach using a publicly available data set, focusing on traffic lights due to data availability. For each component, we compare against contemporary alternatives and show significantly improved performance. We also show that the end-to-end system performance is further improved via joint training of the constituent models.




1 Introduction

Many self-driving vehicle systems rely on a high-definition (HD) map to ensure safety, driving-policy comfort, and legal conformance. Unlike a standard navigation map, an HD map contains detailed 3D structure such as LiDAR point clouds, as well as the precise position and semantics of traffic signs, lights, lanes, and other road markings. One challenge when using an HD map is that some of the objects in the map may be out of date with respect to changes that occur in the world. The safety of self-driving systems is improved when on-board perception systems not only detect and track dynamic actors in the scene, but also perceive the static traffic-control objects. This allows the system to combine the benefits provided by both perception and mapping for traffic-control features: the timeliness of real-time perception and the human-verified accuracy of the map.

In this work, we present a method for detecting, tracking, and localizing spatially compact static objects (such as signs and traffic lights) in 3D from a single camera of a self-driving car. We assume that each frame of video can be associated with a reasonable ego-pose of the camera, as is readily available in open-source self-driving data sets. Our method consists of neural networks that address each of the main components of the system, combined to allow joint optimization via learning to improve overall performance. Given the problem domain, we constrain the solution space to online methods.

The top-level model takes a pair of geo-located video frames as input and outputs a set of localized objects (6 Degree-of-Freedom, or “6D" poses). For each input image, a sub-network performs 6D pose regression for each detected object. Detected objects are represented with both appearance and pose information for learning how to associate them between frames. We employ an existing object detector, but propose new networks for single-image object pose regression and cross-image object matching. The system applies these networks in a multi-object tracking paradigm to produce robust 6D locations for the set of tracked objects in a video sequence.

We evaluate the performance of the proposed approach on traffic lights due to availability of data. In principle, this method could be applied to other static object types as well. In summary, our main contributions are: (i) a novel pose regression network for estimating 6D poses of static objects from geolocated RGB inputs, shown to outperform contemporary methods, (ii) a novel method for matching objects between pairs of video frames combining multi-resolution appearance features and geometric features from our pose regression network, (iii) the formulation of multi-object tracking of static objects using these models, and (iv) an evaluation comparing the performance of the individual components against contemporary alternatives, and also showing the benefit to the system-level performance of jointly-optimizing the models with a multi-task loss function.

2 Related Work

Localizing street-level objects using multi-view geometry has been the focus of important prior work. Hebbalaguppe et al. [hebbalaguppe2017telecom] propose an automatic system to update telecom inventory using stereo-vision distance estimation with a SIFT feature matching algorithm, applied to Google Street View images. Krylov et al. [krylov2018automatic] combine monocular depth estimation and triangulation to enable automatic localization of static objects. Their approach first detects objects, then uses a CNN to estimate depth, and finally employs a custom Markov Random Field (MRF) to perform object triangulation. The same authors extend their approach by adding LiDAR data for object segmentation, triangulation, and monocular depth estimation for traffic lights [krylov2018object]. Zhang et al. [zhang2018using] propose a method for mapping roadside utility poles from street view images. Their approach consists of a CNN-based object detector followed by a line-of-bearing method for object localization.

In contrast to these works, we hypothesize that an end-to-end trainable system will perform better than systems using disjoint components [ruder2017overview, chaabane2020looking]. Prior works commonly use deep learning to detect objects in imagery, but then employ distinct secondary processes to track or otherwise associate observations across images, lacking the full support of information from the object detection model. Consequently, as the number of nearby objects increases, geometric-only techniques can fail because of the inherent spatial uncertainty of the features. In our approach, we make the assumption that strong similarities can be derived from complementary visual and geometric features, and that jointly learning these features in a single end-to-end system has additional performance benefits.

The prior work most closely related to ours is by Nassar et al. [nassar2019simultaneous], who propose an end-to-end trainable object geo-localization architecture. A pair of images is fed to their architecture: objects are first detected in the image pairs, then matching projections are learned, and finally the geo-coordinates of the objects are predicted. Our work shares a commitment to an end-to-end approach, but differs significantly in implementation details. Also, our additional multi-object tracking stage is novel and improves overall performance.

Static object tracking can be considered a special case of moving object tracking, where the objects have zero velocity. Recent research on multi-object tracking primarily follows the tracking-by-detection paradigm. Several different RGB-based approaches belong to this category. One category relies on re-identification modules [bergmann2019tracking, wojke2017simple, zhang2019robust, zhu2018online] to accurately match objects between frames. Another category uses motion and continuity cues [choi2015nearr, karunasekera2019multiple, milan2017online, wang2019exploit]. Other approaches also rely on 3D properties such as shape and approximate depth [scheidegger2018mono, sharma2018beyond]. However, when considering static objects, object poses can be exploited for tracking in a stronger fashion than is possible when tracking dynamic objects. Our method uses jointly learned pose and appearance features to track static objects across video frames.

3 Proposed Approach

Our object localization method consists of two models. The first is a pose regression network (§ 3.1) used to estimate the 6D pose of objects present in an RGB image. The second is an object matching network (§ 3.2) used to track the detected objects across a sequence of frames.

Our approach is an online method, so it uses information derived only from past frames, making it suitable for use in self-driving vehicles and other streaming applications. At each frame, the network produces a set of 2D object detections in the image. For each detection, the 6D pose is estimated. The current-frame detections are associated with tracks of previously detected objects using the object matching network. For each tracked object, we aggregate the estimated 6D poses over time to compute the final location and rotation, taking the median over each dimension of the location and rotation vectors. In this section, we provide details on the two main components, the pose regression network and the object matching network.
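The per-track aggregation step described above can be illustrated with a minimal Python sketch. The function name `aggregate_track_pose` and the flat six-element pose layout are assumptions made for this illustration, not names from the paper; the sketch also assumes all poses of a track are already expressed in one common reference frame.

```python
from statistics import median

def aggregate_track_pose(poses):
    """Per-dimension median of a track's 6D pose estimates.

    poses: list of 6-element sequences [tx, ty, tz, rx, ry, rz]
    (hypothetical layout: translation first, then facing direction).
    """
    if not poses:
        raise ValueError("a track needs at least one pose estimate")
    return [median(p[d] for p in poses) for d in range(6)]

# Three per-frame estimates of one traffic light; the third has an outlier in x.
track = [
    [1.0, 2.0, 30.0, 0.0, 0.0, -1.0],
    [1.2, 2.1, 31.0, 0.1, 0.0, -1.0],
    [9.0, 2.0, 30.5, 0.0, 0.0, -1.0],
]
fused = aggregate_track_pose(track)
```

The median keeps a single bad per-frame estimate from dominating the fused pose. A per-dimension median of direction vectors is not guaranteed to have unit length; whether the result is re-normalized is not stated in the text.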

3.1 Pose Regression Network

Figure 1: Single-image Object Pose Regression. Our model first computes bounding boxes (crops) of objects of interest from geolocated images. Each image crop is then processed with an encoder-decoder CNN to generate a feature map, which is processed by an attention module to yield an attention-weighted feature map. Using average pooling, we create a fixed-size geometry embedding, which is fed to the pose regressor to output the 6D pose.

Figure 1 illustrates the architecture of our object pose regression model. Our approach is designed for online processing of a stream of geolocated images, such as those that might be produced by self-driving vehicles. The method applies to spatially compact static objects, such as traffic lights or signs. We use the term “spatially compact” to distinguish such objects from things like lane lines or road edge boundaries. As static objects of interest are tracked across frames, the per-frame pose estimates are used not only to refine the final 6D pose of the object, but also to help disambiguate matching objects across frames (see § 3.2.2). Our network outputs 6D object pose vectors with two parts: the 3D translation vector of the center of the object in the camera coordinate system, and the unit vector orthogonal to the object (the direction in which the traffic light or sign is facing) with respect to the camera coordinate frame. To estimate the pose, we train our network using the Euclidean loss between the ground-truth and estimated translation, and the log hyperbolic cosine loss between the ground-truth and estimated rotation. Instead of regressing the full translation vector, our pose regression network is trained to regress the depth component and the object’s center position in image pixel space. This formulation provides better invariance to camera parameters. We use projective geometry to recover the remaining components of the translation vector from the regressed depth and pixel center, using the camera focal lengths and the camera principal point offset.
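As an illustration of the projective-geometry step and the rotation loss, the following Python sketch assumes a standard pinhole camera model. The names `backproject_center`, `log_cosh_loss`, and the parameters `fx`, `fy`, `cx`, `cy` are hypothetical, and summing the log-cosh terms over components is an assumption, since the exact reduction is not recoverable from the text.

```python
import math

def backproject_center(u, v, depth, fx, fy, cx, cy):
    """Recover a camera-frame translation from a pixel center and a depth.

    (u, v): object center in pixels; depth: regressed depth component;
    fx, fy: focal lengths; (cx, cy): principal point offset.
    """
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return (x, y, depth)

def log_cosh_loss(pred, target):
    """Log hyperbolic cosine loss, summed over vector components (assumed)."""
    return sum(math.log(math.cosh(p - t)) for p, t in zip(pred, target))

t = backproject_center(u=900.0, v=400.0, depth=20.0,
                       fx=1000.0, fy=1000.0, cx=800.0, cy=450.0)
perfect = log_cosh_loss([0.0, 0.0, 1.0], [0.0, 0.0, 1.0])
```

Because the network predicts pixel coordinates and depth rather than metric x and y, the camera intrinsics enter only in this closed-form step, which is one way to read the claim about invariance to camera parameters.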

Our pose regression network is a two-stage network. The first stage is a typical 2D object detection network [bai2017deep, redmon2016you, xiong2019upsnet]. We pad the bounding boxes of the detected objects by a margin of pixels on each side (see § 4.2 for the values used) to include more context and to account for slight errors from the object detector model. Features from within each padded bounding box (“image crop”) are used in the second stage to estimate object pose.

3.1.1 Geometry Embedding

The image crop is fed into an encoder-decoder network that maps the input image into a feature map. Each pixel of the feature map is a vector representing the appearance information of the input image crop at that pixel location. From the feature map, we derive the embedding of the image crop as follows. We employ a spatial attention mechanism to focus the embedding on the most salient parts of the image crop. The spatial attention distribution is learned using convolutions from the extracted feature map. The spatial attention map is then normalized using a softmax over its responses.

The normalized spatial attention map is applied to weight the feature map, generating the attention-weighted feature map (we replicate the attention map across the channel dimension to match the size of the feature map). Average pooling is then applied to the attention-weighted feature map to obtain the geometry embedding.
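The attention-weighted pooling can be sketched in plain Python as follows. This is an illustrative sketch only: the attention responses are passed in directly rather than learned by convolutions, and the nested-list layout `features[h][w][c]` and the function names are assumptions.

```python
import math

def spatial_softmax(logits):
    """Normalize a 2D map of attention responses so all weights sum to 1."""
    peak = max(a for row in logits for a in row)  # for numerical stability
    exps = [[math.exp(a - peak) for a in row] for row in logits]
    total = sum(e for row in exps for e in row)
    return [[e / total for e in row] for row in exps]

def attention_pool(features, logits):
    """Weight each location's feature vector, then average-pool over locations.

    features[h][w] is a C-dimensional list; logits[h][w] is a scalar response.
    The same weight is applied to every channel of a location.
    """
    weights = spatial_softmax(logits)
    channels = len(features[0][0])
    count = sum(len(row) for row in features)
    pooled = [0.0] * channels
    for feat_row, w_row in zip(features, weights):
        for feat, w in zip(feat_row, w_row):
            for c in range(channels):
                pooled[c] += w * feat[c]
    return [p / count for p in pooled]

features = [[[2.0, 4.0], [6.0, 8.0]]]  # H=1, W=2, C=2
embedding = attention_pool(features, [[0.0, 0.0]])
```

With equal responses the weights are uniform, so the result is the plain spatial mean scaled by the number of locations; larger responses shift the embedding toward the corresponding locations.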

3.1.2 Pose Regressor

The pose regressor transforms the geometry embedding into 6D pose estimates for each object crop in the input image. The pose regressor is composed of a rotation branch and a translation branch, each composed of fully connected layers. The rotation branch estimates the rotation vector, which is normalized before computing the loss. The translation branch estimates the depth component of the translation vector and the object’s center position in the image. The network is trained by minimizing the combined translation and rotation loss.

3.2 Object Matching Network

The object matching network is responsible for associating objects between pairs of frames, allowing the system to track objects through the video sequence. We employ a deep network (Figure 2) that jointly learns object appearances, geometries, and affinities in a pair of video frames in an end-to-end fashion. We will refer to this as the “object matching network."

Figure 2: Object Matching Network. A pair of images separated by a number of frames, along with the detected 2D bounding boxes, are input to the network. The feature sub-network extracts a fixed-length vector encoding pose and appearance information for each detected object in each frame. The affinity sub-network uses these to produce affinity estimations, matching objects across the two frames.

3.2.1 Data Preparation and Encoding

A pair of images separated by a number of frames is input to the object matching network along with the two sets of bounding boxes of the detected objects, each holding at most N boxes, where N is the maximum number of allowed detected objects in any frame. To provide more robustness during inference, the matching network is trained using image pairs separated by a variable amount of time. The lower bound of the interval is a single frame of separation. The upper bound is a number of frames representing a few seconds of time, which captures the situation where the camera has moved significantly between two observations of the same object. When generating the training data, we sample the number of frames between the images of a pair uniformly from the range between the lower and upper bounds. Each image in a pair is resized to a fixed size. For each training pair, we create the ground truth matching matrix, which contains matching scores between objects of the first frame (in the rows) and objects of the second frame (in the columns). We add nonexistent objects to both sets of detections in order to obtain a fixed-size matching matrix. These additional rows and columns are filled with zeros. An element of the matching matrix encodes the association between one object observation from each frame. A value of 1 encodes an association, meaning that the observations pertain to the same physical object. Entities entering and leaving the scene are also encoded in the matrix.
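A minimal Python sketch of the training-pair preparation follows. It assumes each detection carries a unique object ID (as the data set description suggests) and that an association is encoded as 1; the function names, the ID strings, and the omission of the extra entries for entering and leaving objects are simplifications for illustration.

```python
import random

def sample_separation(max_separation, rng):
    """Uniformly sample the number of frames between the two images of a pair."""
    return rng.randint(1, max_separation)  # inclusive on both ends

def matching_matrix(ids_a, ids_b, n_max):
    """Fixed-size ground-truth matrix: rows are frame-A objects, columns frame-B.

    Rows and columns beyond the real detections stay zero (nonexistent objects).
    """
    if len(ids_a) > n_max or len(ids_b) > n_max:
        raise ValueError("more detections than the allowed maximum")
    m = [[0] * n_max for _ in range(n_max)]
    for i, a in enumerate(ids_a):
        for j, b in enumerate(ids_b):
            if a == b:
                m[i][j] = 1  # same physical object seen in both frames
    return m

rng = random.Random(0)
gap = sample_separation(35, rng)
gt = matching_matrix(["tl1", "tl2"], ["tl2", "tl3", "tl1"], n_max=4)
```

Here `tl2` and `tl1` appear in both frames, so exactly two entries are set; `tl3` is new in the second frame and its column stays zero in this simplified version.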

3.2.2 Feature Sub-network

The feature sub-network extracts the compact features used to associate objects between image pairs. The pair of frames are fed in parallel to the feature sub-network where the two branches share the same set of weights. This sub-network is composed of a geometry feature extractor (yellow box in Figure 2) and an appearance feature extractor (green box in Figure 2). The underlying idea of our feature sub-network is that we can compute the affinity scores between objects based on visual and geometric cues.

We focus mainly on autonomous driving scenes, where the video frames come from a monocular camera mounted on a car moving on the road plane, and the tracked targets are static objects near the road. Thus, geometry features that describe the location and rotation of objects can help to discriminate between objects. Benefiting from reliable pose estimation, we expect that the same physical object in the two frames will have similar estimates of location and rotation in a common reference frame. Thus, for any frame, we use our pose regression network to output the estimated location and rotation of each detected object. The estimated pose is then transformed into the camera coordinate system of a common reference frame; in our implementation we chose the reference frame to be the first frame of each video.

The geometry embedding used in the pose regression network contains information about the geometry of the objects as well. Thus, we concatenate the geometry embedding with the 6 pose values described above to construct the geometry feature descriptor for the detected object.

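| Layer | Type | Stride | Receptive field | Label |
|---|---|---|---|---|
| | conv | 1 | 7 | - |
| | Max Pool | 2 | 9 | - |
| 4 | conv | 1 | 17 | - |
| | Max Pool | 2 | 21 | used |
| 6 | conv | 1 | 45 | - |
| | Max Pool | 2 | 53 | used |
| 9 | conv | 1 | 85 | used |
| 11 | conv | 1 | 101 | - |
| | Max Pool | 2 | 117 | used |
| 12 | conv | 1 | 213 | used |
| 15 | conv | 1 | 277 | used |
| 17 | conv | 1 | 341 | - |
| | Max Pool | 2 | 373 | used |
| 19 | conv | 1 | 565 | used |
| 22 | conv | 1 | 757 | - |
| | Max Pool | 2 | 821 | used |
| 25 | conv | 1 | 1077 | - |
| | Max Pool | 2 | 1205 | used |

Table 1: Details on the architecture of the appearance sub-network used in the object matching network. The layers used in the final embedding are marked in the column “Label”; “-” marks layers that are not used.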

Given a monocular imaging system, the objects and their close surroundings are expected to maintain their visual appearance over short time spans. To extract appearance features, we employ a convnet inspired by the increased performance of CNNs with smaller filter sizes and deeper architectures such as VGG [Simonyan15]. It consists of 26 convolutional layers and 7 max-pooling layers. Each convolutional layer is followed by batch normalization and a ReLU activation function; see Table 1 for more details. Each layer represents a different abstract feature representation of the input, where deeper levels provide higher-dimensional representations of larger receptive fields (RF). We concatenate features from multiple layers, building a multi-resolution and multi-abstraction representation. For each detected object, we extract feature vectors at the object’s center location as regressed by our pose regression network. The center position in the input frame is scaled to the resolution of each feature map, and the feature vector at the resulting position is taken as the corresponding feature vector for the object. We extract appearance features from ten layers with varying receptive fields. The features are concatenated to construct the appearance feature vector for the detected object. This multi-resolution architecture helps to simultaneously capture fine geometric details as well as higher-level semantics of the surroundings. We show that our multi-resolution network produces richer features and therefore outperforms those using a single receptive field (see § 4.4).
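The multi-resolution lookup at the object center can be sketched as follows. The truncating integer scaling and the clamping at the border are assumptions of this sketch (the exact rounding rule is not recoverable from the text), and the function names and nested-list feature maps are hypothetical.

```python
def feature_map_index(cx, cy, frame_w, frame_h, map_w, map_h):
    """Scale a pixel position in the input frame to a (row, col) in a feature map."""
    col = min(int(cx * map_w / frame_w), map_w - 1)
    row = min(int(cy * map_h / frame_h), map_h - 1)
    return row, col

def multi_resolution_descriptor(feature_maps, cx, cy, frame_w, frame_h):
    """Concatenate the feature vectors found at the object center in every map.

    feature_maps: list of maps, each indexed as fmap[row][col] -> list of floats.
    """
    descriptor = []
    for fmap in feature_maps:
        row, col = feature_map_index(cx, cy, frame_w, frame_h,
                                     len(fmap[0]), len(fmap))
        descriptor.extend(fmap[row][col])
    return descriptor

fine = [[[1.0], [2.0]], [[3.0], [4.0]]]  # 2x2 map, 1 channel
coarse = [[[9.0, 9.0]]]                  # 1x1 map, 2 channels
desc = multi_resolution_descriptor([fine, coarse], cx=3, cy=1,
                                   frame_w=4, frame_h=4)
```

Each map contributes one vector, so the descriptor length is the sum of the per-layer channel counts regardless of the spatial size of each map.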

After extracting appearance and geometry features for each detected object, we concatenate both to obtain a fused feature descriptor for the detected object. For each frame, we stack the descriptors of its detected objects into a matrix and pad it with rows of zeros for nonexistent objects, so that every frame yields a fixed-size feature matrix.

3.2.3 Affinity Sub-network

Using the extracted feature matrices of the two frames, we build a tensor in which each element is the concatenation of the feature vector of one object of the first frame and the feature vector of one object of the second frame. The tensor thus contains all possible concatenations of feature vectors of objects between the two frames. This formulation allows us to compute object affinities in a single forward pass. The tensor is fed to a similarity estimator network composed of 6 layers of convolutions. The output of the similarity estimator network is a similarity matrix where each element represents the affinity between a bounding box of the first frame and a bounding box of the second. Note that we use convolutions so that each affinity is computed using only the feature vectors of the two objects concerned and is not affected by other feature vectors. To account for objects entering and leaving between the two frames, we construct two matrices from the similarity matrix by appending a column and a row, respectively. These additional rows and columns are filled with a constant basis value. Then, we apply column-wise and row-wise softmax to the two matrices, respectively, to obtain the two normalized affinity matrices that are fed to the affinity loss layer.
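The tensor construction and the normalization with an extra "no match" entry can be sketched in plain Python. The learned similarity estimator is not modeled here; the similarity matrix is supplied directly. The function names are hypothetical, and the choice to normalize along the axis that includes the appended entry is an assumption of this sketch, since the pairing of matrix and softmax axis is ambiguous in the text.

```python
import math

def pairwise_concat(feats_a, feats_b):
    """Element [i][j] is the feature vector of object i in frame A
    concatenated with that of object j in frame B."""
    return [[fa + fb for fb in feats_b] for fa in feats_a]

def softmax(values):
    peak = max(values)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def rows_with_no_match(sim, basis):
    """Append a constant 'no match' column, then normalize each row."""
    return [softmax(list(row) + [basis]) for row in sim]

def cols_with_no_match(sim, basis):
    """Append a constant 'no match' row, then normalize each column."""
    normalized_cols = [softmax(list(col) + [basis]) for col in zip(*sim)]
    return [list(row) for row in zip(*normalized_cols)]

pairs = pairwise_concat([[1, 2]], [[3], [4]])
a_rows = rows_with_no_match([[0.0, 0.0]], basis=0.0)
a_cols = cols_with_no_match([[0.0, 0.0]], basis=0.0)
```

After normalization, each object's affinities together with its "no match" entry sum to one, so the extra entry can be read as the probability that the object has no counterpart in the other frame.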

3.2.4 Joint Loss Function

To train the object matching network, we use a loss function that averages two terms: the error of matching objects detected in the first frame to the objects in the second frame, and the error of matching objects detected in the second frame to the objects in the first. Each term compares the elements of the corresponding normalized affinity matrix with the elements of the ground truth matching matrix. At inference, the similarity score between an object of the first frame and an object of the second frame is given as the average of the corresponding elements of the two normalized affinity matrices.

Training optimizes the joint affinity and pose estimation losses as defined in Eq. (4). The loss of the pose estimation task is computed as the average of the pose losses of all objects detected in both frames. Pose and affinity losses are traded off with a scalar weight.


3.2.5 Multi-Object Tracking

Our Multi-Object Tracking (MOT) approach follows the tracking-by-detection paradigm. Given a new frame with the bounding boxes of the detected objects, the tracker computes the similarity scores between the already tracked targets (each target consists of multiple instances from different frames) and the newly detected objects using the object matching network. These scores form a score matrix. The similarity between a target and a detection is computed as the maximum over the similarities between the instances of the target before the current frame and the detection at the current frame. The matrix also holds, for each target, the likelihood that the target is not matched to any of the newly detected objects in the current frame; it is computed as the average of the values in the last column of the affinity matrix for the instances of that target. Finally, the widely used Hungarian algorithm [kuhn1955hungarian] is adopted to derive the optimal assignments.
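The scoring and assignment step can be illustrated with a small Python sketch. To stay dependency-free, it replaces the Hungarian algorithm with an exhaustive search over assignments, which gives the same optimum but is only practical for a handful of objects. All function names are hypothetical, and the way unmatched targets enter the objective is an assumption.

```python
from itertools import permutations

def target_scores(instance_sims):
    """instance_sims[k][j]: similarity between the k-th stored instance of one
    target and detection j. The target's score per detection is the maximum."""
    return [max(col) for col in zip(*instance_sims)]

def unmatched_score(no_match_values):
    """Average of the 'no match' values over a target's instances."""
    return sum(no_match_values) / len(no_match_values)

def best_assignment(scores, unmatched):
    """Brute-force stand-in for the Hungarian algorithm (small inputs only).

    Returns, per target, the index of its detection or None if left unmatched.
    """
    n_targets = len(scores)
    n_dets = len(scores[0]) if scores else 0
    options = list(range(n_dets)) + [None] * n_targets
    best, best_total = (), float("-inf")
    for choice in permutations(options, n_targets):
        total = sum(unmatched[i] if j is None else scores[i][j]
                    for i, j in enumerate(choice))
        if total > best_total:
            best, best_total = choice, total
    return list(best)

scores = [target_scores([[0.9, 0.1], [0.7, 0.2]]),
          target_scores([[0.2, 0.3]])]
unmatched = [unmatched_score([0.1, 0.1]), unmatched_score([0.6])]
assignment = best_assignment(scores, unmatched)
```

In this example the first target is matched to detection 0, and the second target is left unmatched because its "no match" likelihood exceeds its similarity to the remaining detection.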

4 Experiments

4.1 Datasets

We constructed the Traffic Lights Geo-localization (TLG) data set. (Footnote 1: The code used to construct this data set will be made available upon publication.) TLG is derived from nuScenes [caesar2019nuscenes], a popular open-source data set for autonomous driving. The nuScenes data contains 1000 scenes of 20 seconds (at 12Hz video rate), filmed in two cities (Boston and Singapore), in both night and day, and with three weather conditions (rain, sun and clouds). Each scene comes with data from six cameras placed at different angles on the car.

We selected those scenes within road intersections containing traffic lights (TLs). For each scene in the nuScenes data set, and for each video clip from one of the six cameras, we iterated through key frames (2Hz), selecting TLs within 100 meters of the camera location. Each TL location was transformed from world coordinates to camera coordinates, and then into 2D homogeneous image coordinates, using the provided extrinsic and intrinsic camera calibration parameters. We filter out TL locations not visible to the camera. Finally, scenes are selected only if at least one TL is visible in 5 different key frames in one of the six cameras. With this process, we ended up with 348 scenes for training and 56 scenes for testing. On average, two traffic lights appear per image.
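The projection and visibility filter can be sketched in Python under a pinhole camera assumption. The function names, the world-to-camera convention (`rot` and `trans` map world points into the camera frame), and the exact visibility test are assumptions of this sketch rather than details from the data set construction code.

```python
import math

def project_point(p_world, rot, trans, fx, fy, cx, cy):
    """World point -> camera frame -> pixel coordinates (pinhole model).

    rot: 3x3 world-to-camera rotation (nested lists); trans: 3-vector.
    Returns None for points at or behind the image plane.
    """
    x, y, z = (sum(rot[r][c] * p_world[c] for c in range(3)) + trans[r]
               for r in range(3))
    if z <= 0:
        return None
    return (fx * x / z + cx, fy * y / z + cy)

def is_visible(p_world, cam_pos_world, rot, trans, fx, fy, cx, cy,
               width, height, max_range=100.0):
    """Keep a traffic light only if it is in range and projects into the image."""
    if math.dist(p_world, cam_pos_world) > max_range:
        return False
    uv = project_point(p_world, rot, trans, fx, fy, cx, cy)
    return uv is not None and 0 <= uv[0] < width and 0 <= uv[1] < height

identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
cam = dict(rot=identity, trans=[0.0, 0.0, 0.0],
           fx=1000.0, fy=1000.0, cx=800.0, cy=450.0)
uv = project_point([2.0, -1.0, 20.0], **cam)
```

With an identity pose the example point lands at pixel (900, 400); points behind the camera or farther than the range limit are rejected before any image-bounds check matters.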

In the TLG data, each video clip (from different cameras) in each scene contains 240 RGB images (including 40 key frames) with resolution of . Images are augmented with camera pose information and camera metadata, including information about each visible TL: unique ID, 6D pose in world coordinates, 6D pose in the camera coordinates of the first frame, and TL type (horizontal or vertical).

We created three sub-datasets for our main tasks, one each for pose, matching, and tracking. The “Traffic Lights 6D Pose" data contains around 66,000 snippets of TLs (60,000 for training and 6,000 for testing) along with their 6D poses. The “Traffic Lights Matching" data contains 200,000 pairs of images (170,000 for training and 30,000 for testing) along with bounding boxes of TLs and ground truth matching matrices between the two images. Average elapsed time between image pairs in the Traffic Lights Matching data set is 1.4 seconds (the maximum frames of separation between image pairs is set to 35) and on average, four traffic lights appear per image. The “Multi-Traffic Lights Tracking" (MTLT) data provides a detection and annotation file for each video following the format of [milan2016mot16].

We evaluated several other potential sources of data that we hoped could be used to evaluate our static object localization approach. Unfortunately, beyond nuScenes, we were unable to find other useful data sets.

4.2 Implementation details

We implement our proposed approach using PyTorch [paszke2019pytorch]. All experiments were run on an Ubuntu server with an Nvidia TitanX GPU with 12GB of memory. The performance comparisons with contemporary methods for all tasks evaluated in this work were produced using the original authors’ publicly available code. Source code for this work will be released upon publication.

In the pose regression network, our 2D object detector is the same as the one used in PoseCNN [xiang2017posecnn]. It is pre-trained on the COCO [lin2014microsoft] and Mapillary [neuhold2017mapillary] datasets. The bounding box padding is set to between 5 and 25 pixels, scaled based on the bounding box. The architecture used to extract the feature map is composed of a ResNet-18 encoder followed by 4 up-sampling layers as decoder. The geometry embedding dimension is set to 128. The weight factor is set to 0.1. Our pose regression network is trained using SGD for 40 epochs with a momentum of 0.9 and a weight decay of 0.0005.

For the object matching network, the maximum number of tracked objects, , is set to 30 and is set to 8. The frames were resized to 896 × 896. By experimental evaluation, the optimal dimensions of the appearance features vectors are set to 100, 80, 70, 60, 50, 40, 30, 30, 20 and 20 respectively, which results in a 634-dimensional (500 + 6 + 128) feature descriptor for each detected object. The object matching and pose regression networks are jointly trained for 130 epochs with a momentum of 0.9, a weight decay of 0.0008, and is 0.005. The pose network is initialized to pre-trained weights.

4.3 6D Pose Estimation

Many state-of-the-art methods for 6D object pose estimation [peng2019pvnet, li2018deepim, zakharov2019dpod, oberweger2018making] use 3D models of the objects. These methods do not work well for our application because of the presence of multiple types and sizes of TLs (and other static objects of interest) in real-world scenarios. Thus, we compared our model to those that take RGB images as input and directly regress 6D poses, such as PoseNet [kendall2015posenet] and PoseCNN [xiang2017posecnn]. To make the comparison fair, all methods use the same object detector [long2015fully] as in PoseCNN, and we fine-tune both PoseNet and PoseCNN on our training data with the same loss function used to train our pose regression network. Table 2 presents a comparison of our pose regression model against PoseNet and PoseCNN on the Traffic Lights 6D Pose data.

Our single-image pose regression network outperforms both PoseNet and PoseCNN. As expected, TLs far away from the camera can be challenging to locate accurately. All methods have considerably lower pose errors when evaluating only on TLs within 20 meters. In the full data set, TLs can be up to 100 meters away from the camera. We show in a later discussion of end-to-end performance (see Table 4) that most of the translation error is concentrated in the depth axis.

To understand the effects of the attention module and joint training strategy, we compared the performance of three variants of our pose regression network as shown in Table 2. The inclusion of the attention module reduces the rotation and translation errors. This shows how focusing on some regions in the image crop helps our model to extract a better representation for 6D pose regression. We also see that training the pose regression and object matching networks jointly improves pose regression performance.

6D pose errors are reported as mean / median.

| Model | All objects: Translation (m) | All objects: Rotation (°) | Near (within 20 m) objects: Translation (m) | Near (within 20 m) objects: Rotation (°) | Run time (sec/frame) |
|---|---|---|---|---|---|
| Ours (w/o Attention) | 4.95 / 3.93 | 17.68 / 10.51 | 3.02 / 2.24 | 16.26 / 7.64 | 0.05 |
| Ours (Baseline) | 4.67 / 3.61 | 17.00 / 9.70 | 2.64 / 1.83 | 14.74 / 6.24 | 0.05 |
| Ours (Joint Training) | 4.43 / 3.39 | 15.97 / 9.16 | 2.51 / 1.70 | 14.21 / 6.08 | 0.05 |
| PoseNet [kendall2015posenet] | 7.25 / 5.83 | 28.47 / 21.82 | 5.36 / 4.48 | 24.31 / 18.23 | 0.04 |
| PoseCNN [xiang2017posecnn] | 5.54 / 4.47 | 19.63 / 11.35 | 3.68 / 2.91 | 18.04 / 8.86 | 0.11 |

Table 2: Pose regression ablation study. In “w/o Attention” we removed the attention module of the pose regression network. In “Joint Training”, the regression model is trained jointly with the object matching model to minimize the loss function in Eq. (4). “Baseline” indicates training the model as described, as a stand-alone network.

4.4 Object Matching

To highlight the impact of the feature sub-network of the object matching network, we report matching accuracy after changing the feature extractor component in Table 4. In this ablation study, we measure the impact of using only appearance features, only geometric features, both appearance and geometric features, and joint training. Additionally, we measure variants of the appearance features when larger or smaller receptive fields are used, and we show variants of the geometric features when using only the 6D values or when combining the 6D values with the geometry embedding vector from the pose regression network.

| Object Matching Feature Extractor | mAP | Runtime |
|---|---|---|
| Resnet-50 [he2016deep] | 0.744 | 0.1 |
| VGG-16 [Simonyan15] | 0.824 | 0.08 |
| AFE (RFs ≤ 213 only) | 0.857 | 0.11 |
| AFE (RFs > 213 only) | 0.839 | 0.12 |
| AFE | 0.873 | 0.12 |
| GFE (6D only) | 0.825 | 0.08 |
| GFE (6D + geometry embedding) | 0.831 | 0.08 |
| AFE + GFE | 0.912 | 0.14 |
| AFE + GFE (Joint Training) | 0.928 | 0.14 |

Figure 3: Object matching network ablation study. AFE uses only the appearance features. GFE uses only the geometry features. For AFE, also shown is the impact of receptive field (RF) sizes. For GFE, we show results with and without the pose regression feature vector.
Figure 4: Object matching examples. Each column of the figure shows a pair of frames separated by n frames. Object matching remains robust to illumination and weather conditions and existence of multiple similar TLs in the frames.

In Table 4 and in the following text, “AFE” indicates using only appearance features and “GFE” indicates using only geometric features in the object matching network. AFE outperformed single-RF architectures (Resnet-50 and VGG-16) by at least 4.9 percentage points, which demonstrates the benefit of multi-resolution networks for our application. We found that appearance features extracted from small RFs perform better than those extracted from larger RFs, as illustrated when comparing AFE (RFs ≤ 213) and AFE (RFs > 213). This is supported by comparing Resnet-50 (RF size = 483) and VGG-16 (RF size = 212), where VGG-16 outperforms Resnet-50. Combining features from both small and large RFs (AFE) results in an mAP gain of 1.6 percentage points. This can be explained by the fact that features from small RFs focus on low-level information such as color, texture, and shape, while features from large RFs have richer contextual information that can be beneficial in some challenging cases.

By comparing the performance of AFE and GFE, we can conclude that appearance is more important than geometry for our object matching network. However, including the geometry cues increases the mAP by 3.9 percentage points over appearance alone. We argue that the advantage gained from geometry features comes when TLs look similar and are close in image space. In those cases, TLs will also have similar backgrounds and thus produce similar appearance embeddings. The joint training strategy provides the remaining improvement, increasing the object matching network’s mAP by 1.6 percentage points compared to stand-alone training.

Figure 4 shows examples of the object matching network’s output on our Traffic Lights Matching data. We observe that the association appears robust to illumination and weather conditions. Even when multiple similar-looking TLs appear at very close locations in image space, the network is able to associate them correctly. The examples in Figure 4 were chosen at random. We noted a similar level of performance by the object matching network for all the examples we tested.

4.5 Multi-Object Tracking


Method                               MOTA↑   MOTP↑   MT↑     ML↓     IDS↓   FPS↑
DMAN [zhu2018online]                 80.79   82.40   61.12   12.91   103    3.3
DeepSORT [wojke2017simple]           77.69   77.81   56.34   9.41    69     17.2
Tracktor++ [bergmann2019tracking]    83.31   86.73   66.54   9.71    82     2.64
Ours (GFE)                           74.12   75.32   51.17   20.64   162    9.7
Ours (AFE)                           81.29   82.18   62.37   12.66   96     6.1
Ours (GFE + AFE)                     85.52   85.14   69.57   10.79   61     5.3

Table 3: Comparison of our method and contemporary MOT trackers on the MTLT test sequences. We utilise the standard MOT metrics [bernardin2008evaluating]: MOTA (multi-object tracking accuracy), MOTP (multi-object tracking precision), MT (number of mostly tracked trajectories), ML (number of mostly lost trajectories), IDS (number of identity switches) and FPS (frames per second). ↑ and ↓ indicate that higher or lower values are preferred, respectively.

We evaluate the performance of our tracker using MOT metrics and compare it with state-of-the-art MOT systems that have publicly available code (Table 3). Using only appearance features (AFE), our tracker achieves a MOTA of 81.29, which is higher than that of the appearance-based trackers (DMAN and DeepSORT), demonstrating the strength of our multi-resolution appearance features. Using only geometry features (GFE), our tracker achieves a MOTA of 74.12. Using both appearance and geometry features, the tracking accuracy increases to a MOTA of 85.52, outperforming the other methods. Our tracker is twice as fast as Tracktor++ (5.3 versus 2.64 FPS), which performs similarly on most metrics but has noticeably more identity switches (82 versus 61).

Our tracker achieves the highest MT value (69.57, versus 66.54 for Tracktor++), suggesting that it generates more integrated trajectories by combining geometry and appearance cues. Similarly, our tracker’s identity switches (IDS) value of 61 is the lowest among the compared methods. Both MT and IDS are critical metrics when the output of the tracker is used to generate an aggregated pose estimate, as in our application and as presented in the following section.
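For readers less familiar with the CLEAR MOT metrics, MOTA combines misses, false positives and identity switches, summed over all frames and normalized by the total number of ground-truth objects. The following is a minimal sketch of that definition. The function name and the counts in the usage lines are hypothetical and are not taken from the MTLT evaluation.

```python
def mota(false_negatives, false_positives, id_switches, num_gt):
    """CLEAR MOT accuracy: 1 - (FN + FP + IDS) / GT, with counts summed over frames."""
    if num_gt <= 0:
        raise ValueError("num_gt must be positive")
    return 1.0 - (false_negatives + false_positives + id_switches) / num_gt


# Hypothetical counts, for illustration only.
score = mota(false_negatives=120, false_positives=60, id_switches=20, num_gt=1000)
print(f"MOTA = {100 * score:.2f}")
```

Because identity switches enter the numerator directly, a tracker with fewer switches gains MOTA even when its detection errors are unchanged.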

4.6 Object Geo-localization

Finally, we evaluate the performance of our proposed approach for static object mapping by comparing predicted and ground-truth geo-locations of traffic lights in the TLG data set. We compare our proposed approach with MRF-triangulation [krylov2018automatic] and SSD-ReID-Geo [nassar2019simultaneous]. Analyzing the errors of the different methods (Table 4), we note that errors along the Z-axis (depth) are considerably higher than errors along the X and Y axes, which is typical for monocular vision-based systems. When localizing traffic lights, errors along the Z-axis are less troubling than lateral or vertical errors. This is because the perception of whether or not a traffic light pertains to the self-driving car (i.e., to the lane the car is in) is more affected by the light’s lateral position over the road than by its depth along the roadway. A lateral error of 2m could cause confusion about which lane the light controls. On the other hand, a depth error of a few meters is unlikely to cause such confusion. Our method shows a median error along the X and Y axes of less than 20cm, and a mean error within 25cm. The median depth error (Z axis) of about 1.5m is well within the accuracy bounds of the problem domain.

                     TE along X-axis (m)      TE along Y-axis (m)      TE along Z-axis (m)
Model                Mean   Median  Std       Mean   Median  Std       Mean   Median  Std
Ours                 0.25   0.16    0.15      0.23   0.15    0.14      2.24   1.47    1.28
MRF-triangulation    0.31   0.24    0.12      0.35   0.27    0.15      4.75   3.89    1.92
SSD-ReID-Geo         0.64   0.51    0.37      0.51   0.45    0.33      3.77   2.85    1.68
Table 4: Translation Error (TE) along X, Y and Z axes
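Per-axis statistics of the kind reported in Table 4 could be computed along the following lines. This is a sketch under stated assumptions: errors are taken as absolute differences between matched predicted and ground-truth positions, the standard deviation is the population form, and the coordinates below are hypothetical values rather than TLG data.

```python
from statistics import mean, median, pstdev


def translation_error_stats(predicted, ground_truth):
    """Per-axis (mean, median, std) of absolute errors between matched 3D points."""
    stats = {}
    for i, axis in enumerate("XYZ"):
        errors = [abs(p[i] - g[i]) for p, g in zip(predicted, ground_truth)]
        stats[axis] = (mean(errors), median(errors), pstdev(errors))
    return stats


# Hypothetical matched positions in metres; Z is assumed to be depth.
pred = [(1.0, 2.0, 30.0), (4.2, 2.1, 52.0), (-3.0, 1.9, 18.5)]
gt = [(1.1, 2.2, 28.0), (4.0, 2.0, 50.0), (-3.3, 2.0, 20.0)]

for axis, (m, med, sd) in translation_error_stats(pred, gt).items():
    print(f"{axis}: mean={m:.2f} median={med:.2f} std={sd:.2f}")
```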

We computed object-based precision/recall using two distance thresholds: 2m of Euclidean distance and 3 units of Mahalanobis distance. In this case, 3 units of Mahalanobis distance corresponds to an ellipsoid with semi-axes x=0.4, y=0.39, and z=3.84 meters. The advantage of the Mahalanobis distance is that it provides much tighter thresholds along the X and Y axes while allowing more tolerance in depth, making it more suitable for our application.
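The difference between the two thresholds can be illustrated with a short sketch. It assumes an axis-aligned (diagonal) covariance, so that each standard deviation is the quoted semi-axis divided by 3; the paper does not state the covariance structure, and the test points are hypothetical.

```python
import math

# Semi-axes (metres) of the 3-unit Mahalanobis ellipsoid quoted in the text.
SEMI_AXES = (0.4, 0.39, 3.84)
UNITS = 3.0
# Assumption: diagonal covariance, so sigma = semi-axis / units on each axis.
SIGMAS = tuple(a / UNITS for a in SEMI_AXES)


def mahalanobis(pred, gt, sigmas=SIGMAS):
    """Mahalanobis distance for a diagonal covariance with the given sigmas."""
    return math.sqrt(sum(((p - g) / s) ** 2 for p, g, s in zip(pred, gt, sigmas)))


def euclidean(pred, gt):
    return math.dist(pred, gt)


gt = (0.0, 0.0, 40.0)
depth_off = (0.0, 0.0, 43.0)    # hypothetical estimate with a 3 m depth error
lateral_off = (1.0, 0.0, 40.0)  # hypothetical estimate with a 1 m lateral error

for name, pred in [("depth", depth_off), ("lateral", lateral_off)]:
    within_euclid = euclidean(pred, gt) <= 2.0
    within_mahal = mahalanobis(pred, gt) <= UNITS
    print(name, within_euclid, within_mahal)
```

Under these assumptions, the 3m depth error is rejected by the Euclidean threshold but accepted by the Mahalanobis threshold, while the 1m lateral error is accepted by the Euclidean threshold but rejected by the Mahalanobis one.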

Figure 5: Comparison of the performance of our approach for static object geo-localization against MRF-triangulation [krylov2018automatic] and SSD-ReID-Geo [nassar2019simultaneous]. An estimated geo-location is a true positive if it is within a threshold distance of a ground truth point. Methods marked with * use only key frames (2fps) for testing, methods marked with are tested with only frame pairs, and “with rot" means that true positives must also be within of the true orientation.

Figure 5 compares the precision/recall of our approach against MRF-triangulation and SSD-ReID-Geo. Our approach produces more accurate geo-localizations than the other methods. It outperforms MRF-triangulation owing to our pose regression model, which is more effective than the depth estimation in [krylov2018automatic], and to the joint learning employed by our approach. SSD-ReID-Geo uses only pairs of frames when estimating object poses. For a fair comparison, we also tested our approach using only frame pairs (the frame-pair variant in Figure 5). Our approach outperforms SSD-ReID-Geo even with this restriction. We also observe that, when using the Mahalanobis distance, the PR curve of SSD-ReID-Geo falls below that of MRF-triangulation due to the tighter restrictions along the X and Y axes.

When a rotation error component is added to the definition of a true positive (i.e., the estimate must be within both the distance threshold and the angular threshold), performance drops only slightly, indicating that our 6D regression performs well for both the translation and rotation components.
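The true-positive rule described above can be sketched as follows for a single operating point. The greedy one-to-one matching used here is an assumption made for illustration, since the matching procedure is not specified in the text; the function name and the coordinates are hypothetical.

```python
import math


def precision_recall(predictions, ground_truth, threshold=2.0):
    """Greedy one-to-one matching; each ground-truth point can be claimed once."""
    unmatched = list(ground_truth)
    true_positives = 0
    for pred in predictions:
        if not unmatched:
            break
        # Nearest ground-truth point that has not been claimed yet.
        nearest = min(unmatched, key=lambda g: math.dist(pred, g))
        if math.dist(pred, nearest) <= threshold:
            true_positives += 1
            unmatched.remove(nearest)
    precision = true_positives / len(predictions) if predictions else 0.0
    recall = true_positives / len(ground_truth) if ground_truth else 0.0
    return precision, recall


# Hypothetical positions in metres.
gt_points = [(0.0, 5.0, 30.0), (3.5, 5.0, 30.0), (-3.5, 5.0, 60.0)]
pred_points = [
    (0.2, 5.1, 31.0),   # close to the first ground-truth light
    (3.4, 4.9, 29.0),   # close to the second ground-truth light
    (10.0, 5.0, 45.0),  # spurious estimate
    (0.3, 5.0, 30.5),   # duplicate of the first light, counted as a false positive
]

precision, recall = precision_recall(pred_points, gt_points, threshold=2.0)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Replacing `math.dist` with a Mahalanobis distance, or adding an orientation check before counting a true positive, would give the other variants of the criterion discussed in this section.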

5 Conclusions

This paper proposes an end-to-end method for 6D detection, tracking, and localization of spatially compact static objects from a single camera of a self-driving car. We showed that jointly optimizing the pose regression and object matching models improves 6D pose estimation, tracking and geo-localization simultaneously. Future plans include sharing features between the 2D object detector and the object matching network, which will provide opportunities for further joint optimization and inference speed-up. We also aim to replace the only non-differentiable module of our approach (the Hungarian algorithm) with an equivalent differentiable network to allow complete end-to-end learning. In this work, we were limited to evaluating the performance of our approach on traffic lights; application to other compact static objects (signs, etc.) requires creating or identifying new data sets.