Egocentric Vision-based Future Vehicle Localization for Intelligent Driving Assistance Systems

Predicting the future location of vehicles is essential for safety-critical applications such as advanced driver assistance systems (ADAS) and autonomous driving. This paper introduces a novel approach to simultaneously predict both the location and scale of target vehicles in the first-person (egocentric) view of an ego-vehicle. We present a multi-stream recurrent neural network (RNN) encoder-decoder model that separately captures both object location and scale and pixel-level observations for future vehicle localization. We show that incorporating dense optical flow improves prediction results significantly since it captures information about motion as well as appearance change. We also find that explicitly modeling future motion of the ego-vehicle improves the prediction accuracy, which could be especially beneficial in intelligent and automated vehicles that have motion planning capability. To evaluate the performance of our approach, we present a new dataset of first-person videos collected from a variety of scenarios at road intersections, which are particularly challenging moments for prediction because vehicle trajectories are diverse and dynamic.



I Introduction

Safe driving requires not just accurately identifying and locating nearby objects, but also predicting their future locations and actions so that there is enough time to avoid collisions. Precise prediction of nearby vehicles’ future locations is thus essential for both autonomous and semi-autonomous (e.g., Advanced Driver Assistance Systems, or ADAS) driving systems. Extensive research [1, 2, 3] has been conducted on predicting vehicles’ future actions and trajectories using overhead (bird’s eye view) observations. But obtaining overhead views requires either an externally mounted camera (or LiDAR), which is not common on today’s production vehicles, or aerial imagery that must be transferred to the vehicle over a network connection.

Fig. 1: Illustration of future vehicle localization. Location and scale are represented as bounding boxes in predictions.

A much more natural approach is to use forward-facing cameras that record the driver’s “first-person” or “egocentric” perspective. In addition to being easier to collect, the first-person perspective captures rich information about the object appearance, as well as the relationships and interactions between the ego-vehicle and objects in the environment. Due to these advantages, egocentric videos have been directly used in applications such as action recognition [4, 5], navigation [6, 7, 8], and end-to-end autonomous driving [9]. For trajectory prediction, some work has simulated bird’s eye views by projecting egocentric video frames onto the ground plane [1, 2], but these projections can be incorrect due to road irregularities or other sources of distortion, which prevent accurate vehicle position prediction.

In this paper we consider a more challenging problem of predicting the relative future locations and scales (represented as bounding boxes in Figure 1) of nearby vehicles with respect to an ego-vehicle equipped with an egocentric camera. We introduce a multi-stream RNN encoder-decoder (RNN-ED) architecture to effectively encode past observations from different domains and generate future bounding boxes. Unlike other work that has addressed prediction in simple scenarios such as freeways [1, 2], we consider complicated urban driving scenarios that involve a variety of multi-vehicle behaviors and interactions.

The contributions of this paper are three-fold. First, to the best of our knowledge, our work is the first to address the problem of future vehicle localization from an egocentric view in challenging driving scenarios such as intersections. Second, we propose a multi-stream RNN-ED architecture for better temporal modeling that explicitly captures vehicles’ motion and appearance information by using dense optical flow and future ego-motion as inputs. Third, to test our approach, we introduce a new first-person video dataset, the Honda Egocentric View - Intersection (HEV-I) dataset, collected in a variety of scenarios involving road intersections, which we plan to publicly release. The dataset includes over vehicles (after filtering) in videos. We evaluate our approach on this newly proposed dataset, along with the existing KITTI dataset, and achieve state-of-the-art results compared to published baselines.

II Related Work

Egocentric Vision. An egocentric camera view is often the most natural perspective for observing an ego-vehicle environment, but it introduces additional challenges due to its narrow field of view. The literature in egocentric visual perception has typically focused on activity recognition [10, 11, 12, 4, 5], object detection [13, 14], person identification [15, 16, 17], video summarization [18], and gaze anticipation [19]. Recently, papers have also applied egocentric vision to ego-action estimation and prediction. For example, Park et al. [20] proposed a method to estimate the location of a camera wearer in future video frames. Su et al. [21] introduced a Siamese network to predict future behaviors of basketball players in multiple synchronized first-person views. Bertasius et al. [22] addressed the motion planning problem for generating an egocentric basketball motion sequence in the form of a 12-d camera configuration trajectory.

More directly related to our problem, two recent papers have considered predicting pedestrians’ future locations from egocentric views. Bhattacharyya et al. [23] model observation uncertainty using Bayesian Long Short-Term Memory (LSTM) networks to predict the distribution of possible future locations. Their technique does not try to incorporate image features such as object appearance. Yagi et al. [24] use human pose, scale, and ego-motion as cues in a convolution-deconvolution (Conv1D) framework to predict future locations. The specific pose information applies to people but not to other on-road objects like vehicles. Their Conv1D model captures important features of the activity sequences but does not explicitly model temporal updating along each trajectory. In contrast, our paper proposes a multi-stream RNN-ED architecture with both past vehicle location and image features as inputs for predicting vehicle locations from egocentric views.

Trajectory Prediction. Previous work on vehicle trajectory prediction has used motion features and probabilistic models [1, 25]. The probability of specific motions (e.g., lane change) is first estimated, and the future trajectory is predicted using Kalman filtering. Recently, computer vision and deep learning techniques have been investigated for trajectory prediction, by posing trajectory prediction as a sequence-to-sequence generation problem. Alahi et al. [26] proposed Social-LSTM to model pedestrian trajectories as well as their interactions. The proposed social pooling method was then improved by Gupta et al. [27] to capture global context for a Generative Adversarial Network (GAN). Social pooling was first applied to vehicle trajectory prediction by Deo et al. [2] with multi-modal maneuver conditions. Other work models scene context information using attention mechanisms to assist trajectory prediction [28, 29]. Lee et al. [3] incorporate RNN models with conditional variational autoencoders to generate multimodal predictions, and select the best prediction by ranking scores.

However, these methods model trajectories and context information from a bird’s eye view in a static camera setting, which significantly simplifies the challenge of measuring distance from visual features. In contrast, in monocular first-person views, physical distance can be estimated only indirectly, through scaling and observations of participant vehicles, and the environment changes dynamically due to ego-motion effects. Consequently, previous work cannot be directly applied to first-person videos. On the other hand, the first-person view provides higher-quality object appearance information than bird’s eye view images, in which objects are represented only by the coordinates of their geometric centers. This paper encodes past location, scale, and corresponding optical flow fields of target vehicles to predict their future locations, and we further improve prediction performance by incorporating future ego-motion.

III Future Vehicle Localization from First-Person Views

We now present our approach to predicting future bounding boxes of vehicles in first-person views. Our method differs from traditional trajectory prediction because the distances of object motion in perspective images do not correspond to physical distances directly, and because the motion of the camera (ego-motion) induces additional apparent motion on nearby objects.

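Consider a vehicle visible in the egocentric field of view, and let its past bounding box trajectory be $X = \{X_{t_0-\tau+1}, X_{t_0-\tau+2}, \ldots, X_{t_0}\}$, where $X_t = [c_x^t, c_y^t, w^t, h^t]$ is the bounding box of the vehicle at time $t$ (i.e., its center location and width and height in pixels, respectively). Similarly, let the future bounding box trajectory be given by $Y = \{Y_{t_0+1}, Y_{t_0+2}, \ldots, Y_{t_0+\delta}\}$. Given image evidence observed from the past $\tau$ frames, $O = \{O_{t_0-\tau+1}, O_{t_0-\tau+2}, \ldots, O_{t_0}\}$, and its corresponding past bounding box trajectory $X$, our goal is to predict $Y$.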

Fig. 2: The proposed future vehicle localization framework (better in color).

We propose a multi-stream RNN encoder-decoder model to encode temporal information of past observations and decode future bounding boxes, as shown in Figure 2. The past bounding box trajectory is encoded to provide location and scale information, while dense optical flow is encoded to provide pixel-level information about vehicle motion, scale change, and appearance. Our decoder can also consider information about future ego-motion, which could be available from the planner of an intelligent vehicle. The decoder generates hypothesized future bounding boxes by temporally updating from the encoded hidden state.

III-A Temporal Modeling

III-A1 Location-Scale Encoding

One straightforward approach to predict the future location of an object is to extrapolate a future trajectory from the past. However, in perspective images, physical object location is reflected by both its pixel location and scale. For example, a vehicle located at the center of an image could be a nearby lead vehicle or a distant vehicle across the intersection, and such a difference could cause a completely different future motion. Therefore, this paper predicts both the location and scale of participant vehicles, i.e., their bounding boxes. The scale information is also able to represent depth (distance) as well as vehicle orientation, given that distant vehicles tend to have smaller bounding boxes and crossing vehicles tend to have larger aspect ratios.

III-A2 Motion-Appearance Encoding

Another important cue for predicting a vehicle’s future location is pixel-level information about motion and appearance. Optical flow is widely used as a pattern of relative motion in a scene. For each feature point, optical flow gives an estimate of a vector that describes its relative motion from one frame to the next caused by the motion of the object and the camera. Compared to sparse optical flow obtained from traditional methods such as Lucas-Kanade [30], dense optical flow offers an estimate at every pixel, so that moving objects can be distinguished from the background. Also, dense optical flow captures object appearance changes, since different object pixels may have different flows, as shown in the left part of Fig. 2.

In this paper, object vehicle features are extracted by a region-of-interest pooling (ROIPooling) operation using bilinear interpolation from the optical flow map. The ROI region is expanded from the bounding box to contain contextual information around the object, so that its relative motion with respect to the environment is also encoded. The resulting relative motion vector is represented as $O_t = [u_1, v_1, u_2, v_2, \ldots, u_n, v_n]_t$, where $n$ is the size of the pooled region.
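As an illustration of the pooling step described above, the following NumPy sketch expands a bounding box, samples the flow map on a regular grid with bilinear interpolation, and flattens the result into a single vector. The expansion ratio, the pooled grid size, and the choice of one sample per bin are assumptions made for this sketch, and the function names are hypothetical; the original implementation may differ.

```python
import numpy as np

def bilinear_sample(flow, xs, ys):
    """Sample an (H, W, 2) flow map at float pixel coordinates (xs, ys)."""
    H, W, _ = flow.shape
    xs = np.clip(xs, 0.0, W - 1.0)
    ys = np.clip(ys, 0.0, H - 1.0)
    x0 = np.floor(xs).astype(int)
    y0 = np.floor(ys).astype(int)
    x1 = np.minimum(x0 + 1, W - 1)
    y1 = np.minimum(y0 + 1, H - 1)
    wx = (xs - x0)[..., None]  # horizontal interpolation weight
    wy = (ys - y0)[..., None]  # vertical interpolation weight
    top = (1 - wx) * flow[y0, x0] + wx * flow[y0, x1]
    bottom = (1 - wx) * flow[y1, x0] + wx * flow[y1, x1]
    return (1 - wy) * top + wy * bottom

def roi_pool_flow(flow, box, pool=5, ratio=1.5):
    """Pool flow inside an expanded (cx, cy, w, h) box into a flat vector.

    pool and ratio are assumed values, not taken from the paper.
    """
    cx, cy, w, h = box
    w, h = w * ratio, h * ratio  # expand the ROI to include context
    # One sample at the center of each of the pool x pool bins.
    xs = cx - w / 2 + (np.arange(pool) + 0.5) * w / pool
    ys = cy - h / 2 + (np.arange(pool) + 0.5) * h / pool
    grid_x, grid_y = np.meshgrid(xs, ys)
    pooled = bilinear_sample(flow, grid_x, grid_y)  # (pool, pool, 2)
    return pooled.reshape(-1)  # interleaved horizontal and vertical flow
```

With a 5 x 5 grid this yields a 50-dimensional vector per frame; a production ROI pooling layer would typically average several samples per bin rather than take one.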

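We use two encoders for temporal modeling of each input stream and apply the late fusion method:

$$h_t^X = \mathrm{GRU}_X\big(\phi_X(X_{t-1}),\, h_{t-1}^X;\, \theta_X\big) \tag{1a}$$
$$h_t^O = \mathrm{GRU}_O\big(\phi_O(O_{t-1}),\, h_{t-1}^O;\, \theta_O\big) \tag{1b}$$
$$H = \phi_H\big(\mathrm{Average}(h_{t_0}^X,\, h_{t_0}^O)\big) \tag{1c}$$

where $\mathrm{GRU}$ represents the gated recurrent units [31] with parameter $\theta$, $\phi(\cdot)$ are linear projections with ReLU activations, and $h_t^X$ and $h_t^O$ are the hidden state vectors of the GRU models for the bounding box and optical flow streams at time $t$.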
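The two-encoder design with late fusion can be sketched in plain NumPy as follows. This is a minimal illustration with randomly initialized weights, not the trained model: the layer sizes, the averaging used for fusion, the particular GRU variant, and all function names are assumptions made for this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def init_dense(n_in, n_out):
    return {"W": rng.normal(0.0, 0.1, (n_in, n_out)), "b": np.zeros(n_out)}

def dense_relu(p, x):
    # Linear projection followed by a ReLU activation.
    return np.maximum(x @ p["W"] + p["b"], 0.0)

def init_gru(n_in, n_hid):
    return {"W": rng.normal(0.0, 0.1, (n_in, 3 * n_hid)),
            "U": rng.normal(0.0, 0.1, (n_hid, 3 * n_hid)),
            "b": np.zeros(3 * n_hid)}

def gru_step(p, x, h):
    # One step of a GRU cell (one common variant of the gating equations).
    n = h.shape[-1]
    gx = x @ p["W"] + p["b"]
    gh = h @ p["U"]
    z = sigmoid(gx[:n] + gh[:n])                # update gate
    r = sigmoid(gx[n:2 * n] + gh[n:2 * n])      # reset gate
    c = np.tanh(gx[2 * n:] + r * gh[2 * n:])    # candidate state
    return z * h + (1.0 - z) * c

def init_encoder(box_dim=4, flow_dim=50, emb=32, hid=64):
    # All sizes here are assumed values for illustration.
    return {"phi_X": init_dense(box_dim, emb), "gru_X": init_gru(emb, hid),
            "phi_O": init_dense(flow_dim, emb), "gru_O": init_gru(emb, hid),
            "phi_H": init_dense(hid, hid), "hid": hid}

def encode(params, X_past, O_past):
    """Run both streams over the past frames and fuse the final states."""
    h_X = np.zeros(params["hid"])
    h_O = np.zeros(params["hid"])
    for x_t, o_t in zip(X_past, O_past):
        h_X = gru_step(params["gru_X"], dense_relu(params["phi_X"], x_t), h_X)
        h_O = gru_step(params["gru_O"], dense_relu(params["phi_O"], o_t), h_O)
    # Late fusion: average the two final hidden states, then project.
    return dense_relu(params["phi_H"], 0.5 * (h_X + h_O))
```

Because each stream keeps its own recurrent state until the final step, the box and flow inputs can have different dimensionality and are only combined once per sample.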

III-B Future Ego-Motion Cue

Awareness of future ego-motion is essential to predicting the future location of participant vehicles. For autonomous vehicles, it is reasonable to assume that motion planning (e.g., trajectory generation) is available [32], so that the future pose of the ego-vehicle can be used to aid in predicting the relative position of nearby vehicles. Planned ego-vehicle motion information may also help anticipate motion caused by interactions between vehicles: for example, the ego-vehicle turning left at an intersection may result in other vehicles stopping to yield or accelerating to pass.

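In this paper, the future ego-motion is represented by 2D rotation matrices $R_t^{t+1} \in \mathbb{R}^{2 \times 2}$ and translation vectors $T_t^{t+1} \in \mathbb{R}^{2}$ [24], which together describe the transformation of the camera coordinate frame from time $t$ to $t+1$. The relative, pairwise transformations between frames can be composed to estimate transformations across the prediction horizon from the current frame:

$$R_{t_0}^{t_0+i} = \prod_{t=t_0}^{t_0+i-1} R_t^{t+1} \tag{2a}$$
$$T_{t_0}^{t_0+i} = T_{t_0}^{t_0+i-1} + R_{t_0}^{t_0+i-1}\, T_{t_0+i-1}^{t_0+i} \tag{2b}$$

The future ego-motion feature is represented by a vector $E_t = [\psi_{t_0}^{t}, x_{t_0}^{t}, z_{t_0}^{t}]$, where $t > t_0$, $\psi_{t_0}^{t}$ is the yaw angle extracted from $R_{t_0}^{t}$, and $x_{t_0}^{t}$ and $z_{t_0}^{t}$ are translations from the coordinate frame at time $t_0$. We use a right-handed coordinate system fixed to the ego-vehicle, where the vehicle heading aligns with positive $x$. Estimated future motion is then used as input to the trajectory decoding model.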
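The composition of pairwise transformations into a per-step ego-motion feature can be illustrated with a short NumPy sketch. It assumes that each relative rotation and translation expresses the next frame's pose in the previous frame's coordinates, and that the feature is ordered as yaw followed by the two in-plane translations; the function names are hypothetical.

```python
import numpy as np

def rot2d(yaw):
    """2D rotation matrix for a yaw angle in radians."""
    c, s = np.cos(yaw), np.sin(yaw)
    return np.array([[c, -s], [s, c]])

def accumulate_ego_motion(rel_R, rel_T):
    """Compose frame-to-frame motions into motions relative to the current frame.

    rel_R[i], rel_T[i] describe the motion from future step i to step i + 1
    (assumed convention). Returns one [yaw, tx, ty] row per future step.
    """
    R_acc = np.eye(2)    # rotation from the current frame to step i
    T_acc = np.zeros(2)  # translation from the current frame to step i
    feats = []
    for R, T in zip(rel_R, rel_T):
        # Rotate the step translation into the current frame before adding it.
        T_acc = T_acc + R_acc @ T
        R_acc = R_acc @ R
        yaw = np.arctan2(R_acc[1, 0], R_acc[0, 0])
        feats.append([yaw, T_acc[0], T_acc[1]])
    return np.array(feats)

# Example: a gentle left turn at constant speed over three steps.
steps = 3
E = accumulate_ego_motion([rot2d(0.1)] * steps, [np.array([1.0, 0.0])] * steps)
```

In the example, the accumulated yaw grows by 0.1 rad per step while the translation bends away from the initial heading, which is the kind of cue the decoder can use to anticipate apparent motion of other vehicles.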

III-C Future Location-Scale Decoding

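We use another GRU for decoding future bounding boxes. The decoder hidden state is initialized from the final fused hidden state of the past bounding box encoder and the optical flow encoder:

$$h_{t+1}^Y = \mathrm{GRU}_Y\big(f(h_t^Y, E_t),\, h_t^Y;\, \theta_Y\big) \tag{3a}$$
$$Y_{t_0+i} - X_{t_0} = \phi_{out}\big(h_{t_0+i}^Y\big) \tag{3b}$$
$$f(h_t^Y, E_t) = \mathrm{Average}\big(\phi_Y(h_t^Y),\, \phi_e(E_t)\big) \tag{3c}$$

where $h_t^Y$ is the decoder’s hidden state, $h_{t_0}^Y = H$ is the initial hidden state of the decoder (the fused encoder state), $E_t$ is the future ego-motion feature, and $\phi(\cdot)$ are linear projections with ReLU activations applied for domain transfer. Instead of directly generating the future bounding boxes $Y$, our RNN decoder generates the relative location and scale of the future bounding box from the current frame as in (3b), similar to [24]. In this way, the model output is shifted to have a zero initial value, which improves performance.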
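The decoding loop described above can be sketched in NumPy as follows, using randomly initialized weights. Several details are assumptions made for illustration only: the decoder input is formed by averaging a projection of the hidden state with a projection of the ego-motion feature, the output projection is kept purely linear so that offsets can be negative, and the layer sizes and function names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def init_dense(n_in, n_out):
    return {"W": rng.normal(0.0, 0.1, (n_in, n_out)), "b": np.zeros(n_out)}

def dense(p, x, relu=True):
    y = x @ p["W"] + p["b"]
    return np.maximum(y, 0.0) if relu else y

def init_gru(n_in, n_hid):
    return {"W": rng.normal(0.0, 0.1, (n_in, 3 * n_hid)),
            "U": rng.normal(0.0, 0.1, (n_hid, 3 * n_hid)),
            "b": np.zeros(3 * n_hid)}

def gru_step(p, x, h):
    # One step of a GRU cell (one common variant of the gating equations).
    n = h.shape[-1]
    gx = x @ p["W"] + p["b"]
    gh = h @ p["U"]
    z = sigmoid(gx[:n] + gh[:n])
    r = sigmoid(gx[n:2 * n] + gh[n:2 * n])
    c = np.tanh(gx[2 * n:] + r * gh[2 * n:])
    return z * h + (1.0 - z) * c

def init_decoder(hid=64, ego_dim=3, emb=32, box_dim=4):
    # All sizes here are assumed values for illustration.
    return {"phi_Y": init_dense(hid, emb), "phi_e": init_dense(ego_dim, emb),
            "gru_Y": init_gru(emb, hid), "phi_out": init_dense(hid, box_dim)}

def decode(params, H, E_future, x_last):
    """Roll the decoder forward, one step per future ego-motion feature."""
    h = H  # initialize from the fused encoder state
    boxes = []
    for e_t in E_future:
        # Combine the projected hidden state and projected ego-motion (assumed fusion).
        inp = 0.5 * (dense(params["phi_Y"], h) + dense(params["phi_e"], e_t))
        h = gru_step(params["gru_Y"], inp, h)
        offset = dense(params["phi_out"], h, relu=False)
        boxes.append(x_last + offset)  # predict relative to the last observed box
    return np.stack(boxes)
```

Predicting offsets from the last observed box means an untrained or zero-output decoder falls back to "the box stays where it is", which is a sensible default for stationary vehicles.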

IV Experiments

IV-A Dataset

The problem of future vehicle localization in egocentric cameras is particularly challenging when multiple vehicles execute different motions (e.g., the ego-vehicle is turning left but yields to another moving car). However, to the best of our knowledge, most existing autonomous driving datasets are proposed for scene understanding tasks [33, 34] and do not contain much diverse motion. This paper introduces a new egocentric vision dataset, the Honda Egocentric View-Intersection (HEV-I) data, that focuses on intersection scenarios where vehicles exhibit diverse motions due to complex road layouts and vehicle interactions. HEV-I was collected from different intersection types in the San Francisco Bay Area, and consists of video clips each ranging between to seconds. Videos were captured by an RGB camera mounted on the windshield of the car, with resolution (reduced to in this paper) at frames per second (fps).

Fig. 3: HEV-I dataset statistics.
TABLE I: Comparison with KITTI dataset. The number of vehicles is tallied after filtering out short sequences.

| Dataset | # videos | # vehicles | Scene types |
|---|---|---|---|
| KITTI | | | residential, highway, city road |
| HEV-I | | | urban intersections |

Following prior work [24], we first detected vehicles using Mask R-CNN [35] pre-trained on the COCO dataset. We then used SORT [36] with a Kalman filter for multiple object tracking over each video. In first-person videos, a vehicle may remain visible only briefly due to high relative motion and the limited field of view. On the other hand, vehicles at stop signs or traffic lights do not move at all over a short period. In our dataset, we found a sample of seconds length is reasonable for including many vehicles while maintaining reasonable travel lengths. We use the past second of observation data as input to predict the bounding boxes of vehicles for the next second. We randomly split the training () and testing () videos, resulting in training and testing samples.
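One plausible way to turn a tracked vehicle into fixed-length samples is a sliding window over its tracklet, sketched below. The sliding-window scheme, the stride, and the frame counts in the example are assumptions for illustration; the function name is hypothetical and tracklets shorter than one full window are simply dropped.

```python
import numpy as np

def make_samples(track, n_past, n_future, stride=1):
    """Cut a (T, 4) tracklet into (past, future) bounding box pairs.

    track holds one (cx, cy, w, h) box per frame; n_past and n_future are
    the observation and prediction lengths in frames.
    """
    track = np.asarray(track, dtype=float)
    window = n_past + n_future
    samples = []
    for start in range(0, len(track) - window + 1, stride):
        past = track[start:start + n_past]
        future = track[start + n_past:start + window]
        samples.append((past, future))
    return samples

# Example with hypothetical lengths: a 25-frame tracklet, 10 past + 10 future frames.
demo_track = np.tile(np.arange(25, dtype=float)[:, None], (1, 4))
demo_samples = make_samples(demo_track, n_past=10, n_future=10)
```

Splitting videos, rather than individual samples, into training and testing sets (as the text describes) avoids overlapping windows from the same tracklet appearing on both sides of the split.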

TABLE II: Quantitative results of proposed methods and baselines on HEV-I dataset. Each cell reports FDE / ADE / FIOU.

| Models | Easy Cases | Challenging Cases | All Cases |
|---|---|---|---|
| Linear | 31.49 / 17.04 / 0.68 | 107.93 / 56.29 / 0.33 | 72.37 / 38.04 / 0.50 |
| ConstAccel | 20.82 / 13.86 / 0.74 | 90.33 / 49.06 / 0.35 | 58.00 / 28.05 / 0.53 |
| Conv1D [24] | 18.84 / 12.09 / 0.75 | 37.95 / 20.97 / 0.64 | 29.06 / 16.84 / 0.69 |
| RNN-ED-X | 23.57 / 11.96 / 0.74 | 43.15 / 22.24 / 0.60 | 34.04 / 17.46 / 0.67 |
| RNN-ED-XE | 22.28 / 11.60 / 0.74 | 42.27 / 22.39 / 0.61 | 32.97 / 17.37 / 0.67 |
| RNN-ED-XO | 17.45 / 8.68 / 0.78 | 32.61 / 16.72 / 0.66 | 25.56 / 12.98 / 0.72 |
| RNN-ED-XOE | 16.72 / 8.52 / 0.80 | 32.05 / 16.63 / 0.66 | 24.92 / 12.86 / 0.73 |
TABLE III: Quantitative results on KITTI dataset. We compare our best model with baselines for simplicity.

| Models | FDE | ADE | FIOU |
|---|---|---|---|
| Linear | 78.19 | 38.21 | 0.33 |
| ConstAccel | 55.66 | 25.78 | 0.39 |
| Conv1D [24] | 44.13 | 24.38 | 0.49 |
| Ours | 37.11 | 17.88 | 0.53 |
Fig. 4: Qualitative results on HEV-I dataset (better in color).
Fig. 5: Failure cases on HEV-I dataset (better in color).

Statistics of HEV-I are shown in Fig. 3. As Fig. 3 (a) shows, most vehicle tracklets are short because vehicles usually drive fast and thus leave the first-person field of view quickly. Fig. 3 (b) shows the distribution of ego-vehicle yaw angle (in ) across all videos, where positive indicates turning left and negative indicates turning right. It can be seen that HEV-I contains a variety of different ego-motions. Distributions of training and test sample trajectory lengths (in pixels) are presented in Fig. 3 (c) and (d). Although most lengths are shorter than pixels, the dataset also contains plenty of longer trajectories. This is important because longer trajectories are usually more difficult to predict. Compared to existing data like KITTI, the HEV-I dataset contains more videos and vehicles, as shown in Table I. Most object vehicles in KITTI are parked on the road or driving in the same direction on highways, while in HEV-I, all vehicles are at intersections and performing diverse maneuvers.

IV-B Implementation Details

We compute dense optical flow using FlowNet 2.0 [37] and use a ROIPooling operator to produce the final flattened feature vector . ORB-SLAM2 [38] is used to estimate ego-vehicle motion from first-person videos.

We use Keras with the TensorFlow backend [39] to implement our model and perform training and experiments on a system with Nvidia Tesla P100 GPUs. We use the gated recurrent unit (GRU) [40] as the basic RNN cell. Compared to long short-term memory (LSTM) [41], GRU has fewer parameters, which makes it faster without affecting performance [42]. The hidden state size of our encoder and decoder GRUs is . We use the Adam [43] optimizer to learn network parameters with fixed learning rate and batch size . Training is terminated after epochs and the best models are selected.

IV-C Baselines and Metrics

Baselines. We compare the performance of the proposed method with several baselines:

Linear regression (Linear) extrapolates future bounding boxes by assuming the location and scale change are linear.

Constant Acceleration (ConstAccel) assumes the object has constant horizontal and vertical acceleration in the camera frame, i.e., that the second-order derivatives of the bounding box coordinates are constant values.

Conv1D is adapted from [24], by replacing the location-scale and pose input streams with past bounding boxes and dense optical flow.
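The two non-learned baselines above can be realized in several ways. The sketch below shows one possibility, assuming a least-squares polynomial fit over the observed frames for each bounding box coordinate: degree 1 for Linear and degree 2 for ConstAccel (a quadratic has a constant second derivative). The fitting procedure and the function name are assumptions, since the text does not specify how the extrapolation is computed.

```python
import numpy as np

def extrapolate(X_past, horizon, degree):
    """Fit a polynomial per box coordinate over time and extend it forward.

    X_past is a (tau, 4) array of (cx, cy, w, h) boxes, one per past frame.
    Returns a (horizon, 4) array of predicted boxes.
    """
    X_past = np.asarray(X_past, dtype=float)
    tau = len(X_past)
    t_past = np.arange(tau, dtype=float)
    t_future = np.arange(tau, tau + horizon, dtype=float)
    # np.polyfit accepts a 2D y and fits each column independently;
    # coefficients come back highest power first, matching np.vander.
    coef = np.polyfit(t_past, X_past, degree)       # (degree + 1, 4)
    return np.vander(t_future, degree + 1) @ coef   # (horizon, 4)

def linear_baseline(X_past, horizon):
    return extrapolate(X_past, horizon, degree=1)

def const_accel_baseline(X_past, horizon):
    return extrapolate(X_past, horizon, degree=2)
```

Both baselines ignore ego-motion and image content entirely, which is why they degrade sharply on samples with turning or abrupt speed changes.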

To evaluate the contribution of each component of our model, we also implemented multiple simpler baselines for ablation studies:

RNN-ED-X is an RNN encoder-decoder with only past bounding boxes as inputs.

RNN-ED-XE builds on RNN-ED-X but also incorporates future ego-motion as decoder inputs.

RNN-ED-XO is a two-stream RNN encoder-decoder model with past bounding boxes and optical flow as inputs.

RNN-ED-XOE is our best model as shown in Fig. 2 with awareness of future ego-motion.

Evaluation Metrics. To evaluate location prediction, we use final displacement error (FDE) [24] and average displacement error (ADE) [26], where ADE places more emphasis on the overall prediction accuracy along the horizon. To evaluate bounding box prediction, we propose another metric called final intersection over union (FIOU) that measures overlap between the predicted bounding box and ground truth at the final frame.
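The three metrics can be written compactly for a single sample as below. The sketch assumes boxes are stored as (cx, cy, w, h) in pixels and that displacement is measured between box centers; the function names are hypothetical.

```python
import numpy as np

def displacement(pred, gt):
    """Per-step Euclidean distance between predicted and true box centers."""
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    return np.linalg.norm(pred[:, :2] - gt[:, :2], axis=1)

def ade(pred, gt):
    """Average displacement error over the whole prediction horizon."""
    return float(displacement(pred, gt).mean())

def fde(pred, gt):
    """Final displacement error at the last predicted frame."""
    return float(displacement(pred, gt)[-1])

def iou(a, b):
    """Intersection over union of two (cx, cy, w, h) boxes."""
    ax0, ay0, ax1, ay1 = a[0] - a[2] / 2, a[1] - a[3] / 2, a[0] + a[2] / 2, a[1] + a[3] / 2
    bx0, by0, bx1, by1 = b[0] - b[2] / 2, b[1] - b[3] / 2, b[0] + b[2] / 2, b[1] + b[3] / 2
    iw = max(0.0, min(ax1, bx1) - max(ax0, bx0))
    ih = max(0.0, min(ay1, by1) - max(ay0, by0))
    inter = iw * ih
    union = a[2] * a[3] + b[2] * b[3] - inter
    return float(inter / union) if union > 0 else 0.0

def fiou(pred, gt):
    """IOU between the predicted and true boxes at the final frame."""
    return iou(pred[-1], gt[-1])
```

Dataset-level numbers are then obtained by averaging each metric over all test samples. FIOU complements FDE because a prediction with a small center error can still have a poor overlap if its scale is wrong.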

IV-D Results on HEV-I Dataset

Quantitative Results. As shown in Table II, we split the testing dataset into easy and challenging cases based on the FDE performance of the ConstAccel baseline. A sample is classified as easy if ConstAccel achieves an FDE lower than the average FDE (); otherwise it is classified as challenging. Intuitively, easy cases include target vehicles that are stationary or whose future locations can be easily propagated from the past, while challenging cases usually involve diverse and intense motion, e.g., the target vehicle suddenly accelerates or brakes. In evaluation, we report the results of easy and challenging cases, as well as the overall results on all testing samples.
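The easy/challenging split amounts to thresholding the per-sample baseline error at its mean, as in the following sketch. The function name and the example values are hypothetical.

```python
import numpy as np

def split_by_baseline_fde(baseline_fde):
    """Label test samples as easy or challenging from a baseline's FDE.

    baseline_fde holds one FDE value per test sample. A sample is easy
    when its FDE is below the mean FDE over all samples.
    """
    baseline_fde = np.asarray(baseline_fde, dtype=float)
    threshold = baseline_fde.mean()
    easy = baseline_fde < threshold
    return easy, ~easy, threshold

# Example with made-up per-sample errors (in pixels).
easy_mask, hard_mask, thr = split_by_baseline_fde([1.0, 2.0, 3.0, 10.0])
```

Because the threshold is the mean rather than the median, a few very large baseline errors can push most samples into the easy group, so the two subsets are generally not the same size.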

Our best method (RNN-ED-XOE) significantly outperforms the naive Linear and ConstAccel baselines on all cases (FDE of 24.92 vs. 72.37 and 58.00, respectively). It also improves on the state-of-the-art Conv1D baseline by about 14% in FDE (24.92 vs. 29.06). The improvement on challenging cases is more significant since future trajectories are complex and temporal modeling is more difficult. To more fairly compare the capability of RNN-ED and convolution-deconvolution models, we compare RNN-ED-XO with Conv1D. These two methods use the same features as inputs to predict future vehicle bounding boxes, but rely on different temporal modeling frameworks. The results (FDE of 25.56 vs. 29.06) suggest that the RNN-ED architecture offers better temporal modeling than Conv1D, because the convolution-deconvolution model generates the future trajectory in one shot while the RNN-ED model generates each new prediction based on the previous hidden state. Ablation studies also show that dense optical flow features are essential to accurate prediction of future bounding boxes, especially for challenging cases. The FDE on all cases is reduced from 34.04 to 25.56 by adding the optical flow stream (RNN-ED-XO) to the RNN-ED-X model. By using future ego-motion, performance can be further improved, as shown in the last row of Table II.

Qualitative Results. Fig. 4 shows four sample results of our best model (in green) and the Conv1D baseline (in blue). Each row represents one test sample and each column corresponds to one time step. The past and prediction views are separated by the yellow vertical line. Example (a) shows a case where the initial bounding box is noisy because it is close to the image boundary, and our results are more accurate than those of Conv1D. Example (b) shows how our model, with awareness of future ego-motion, can predict the object's future location more accurately while the baseline model predicts the future location in the wrong direction. Examples (c) and (d) show that for a curved or long trajectory, our model provides better temporal modeling than Conv1D. These results are consistent with our quantitative evaluation.

Failure Cases. Although our proposed method generally performs well, there are still limitations. Fig. 5 (a) shows a case where the ground truth future path is curved due to an uneven road surface, which our method fails to consider. In Fig. 5 (b), the target vehicle is occluded by pedestrians moving in the opposite direction, which creates misleading optical flow that leads to an inaccurate bounding box (especially in frame). Future work could avoid this type of error by better modeling the entire traffic scene as well as relations between traffic participants.

IV-E Results on KITTI Dataset

We also evaluate our method on a 38-video subset of the KITTI raw dataset, including city, road and residential scenarios. Compared to HEV-I, the road surface of KITTI is more uneven and vehicles are mostly parked on the side of the road with occlusions. Another difference is that in HEV-I, the ego-vehicle often stops at intersections to yield to other vehicles, resulting in static samples with no motion at all. We did not remove static samples from the dataset since predicting a static object is also valuable.

To evaluate our method on KITTI, we first generate the input features following the same process as for the HEV-I dataset, resulting in training and testing samples. Performance of baselines and our best model are shown in Table III. Both learning-based models are trained for 40 epochs and the best models are selected. The results show that our method outperforms all baselines, including the state-of-the-art Conv1D (FDE of 37.11 vs. 78.19 for Linear, 55.66 for ConstAccel, and 44.13 for Conv1D). We also observe that neither learning-based method performed as well as it did on HEV-I. One possible reason is that KITTI is much smaller, so the models are not fully trained. In general, we conclude that the use of the proposed framework results in more robust future vehicle localization across different datasets.

V Conclusion

We proposed the new problem of predicting the relative location and scale of target vehicles in first-person video. We presented a new dataset collected from intersection scenarios to include as many vehicles and motion as possible. Our proposed multi-stream RNN encoder-decoder structure with awareness of future ego motion shows promising results compared to other baselines on our dataset as well as on KITTI, and we tested how each component contributed to the model through an ablation study.

Future work includes incorporating evidence from scene context, traffic signs/signals, depth data, and other vehicle-environment interactions. Social relationships such as vehicle-to-vehicle and vehicle-to-pedestrian interactions could also be considered.