Foresee: Attentive Future Projections of Chaotic Road Environments with Online Training

05/30/2018 ∙ by Anil Sharma, et al. ∙ IIIT Delhi

In this paper, we train a recurrent neural network to learn the dynamics of a chaotic road environment and to project the future of the environment as an image. Future projection can be used to anticipate an unseen environment, for example in autonomous driving. The road environment is highly dynamic and complex due to the interaction among traffic participants such as vehicles and pedestrians. Even in this complex environment, a human driver is able to drive safely on chaotic roads irrespective of the number of traffic participants. The proliferation of deep learning research has shown the efficacy of neural networks in learning this human behavior. In the same direction, we investigate recurrent neural networks to understand the chaotic road environment, which is shared by pedestrians, vehicles (cars, trucks, bicycles, etc.), and sometimes animals as well. We propose Foresee, a unidirectional gated recurrent unit (GRU) network with attention that projects the future of the environment in the form of images. We have collected several videos on Delhi roads covering various traffic participants and differences in background and infrastructure (such as 3D pedestrian crossings) at various times on various days. We train Foresee in an unsupervised way and use online training to project frames up to 0.5 seconds in advance. We show that our proposed model performs better than state-of-the-art methods (prednet and Enc. Dec. LSTM) and, finally, that our trained model generalizes to a public dataset for future projections.




I Introduction

Environment anticipation is an important task for situation awareness and decision making. There has been recent progress in the anticipation of road environments [15, 11] for safe driving and in behavioral cloning [1], where an agent tries to clone the behavior of a human driver. However, anticipation becomes difficult in the real world because it is uncertain and dynamic [3]. Consider, for example, the road environment. It is highly dynamic and stochastic due to the presence of a diverse set of human drivers and pedestrians; a few examples are shown in figure 1. The figure shows that the road space has chaotic movement of pedestrians and vehicles. We define a chaotic environment as one in which the traffic participants follow no rules and move randomly, as shown in the figure. This is commonly seen on roads in developing countries like India. In such environments, the road space is shared by pedestrians, vehicles (cars, trucks, buses, motorbikes, etc.), and sometimes animals as well. Even when the environment is complex, its behavior can be modeled [15]. Modeling such an environment requires detection, tracking, and understanding of the dynamics of the traffic participants. Given that they also interact with each other (for example, a lane change by one car affects the motion of other cars as well), the modeling is not trivial. Anticipating the behavior of the environment is essential in various applications such as autonomous driving [26], driving assistance [27], multi-target tracking [21], and autonomous landing on a moving target [10]. Humans, on the other hand, are very good at anticipating such an environment, irrespective of its complexity. For example, they drive successfully by anticipating maneuvers even in very crowded, chaotic shared spaces such as markets, streets, and highways.
We explore ways to achieve that anticipation power in machines by exploiting the predictive power of recurrent neural networks to capture this human behavior. In this paper, we propose a deep learning architecture that generates future projections in the form of camera frames a few frames in advance. The future projections will help any robot or learning agent with situation awareness for decision making and planning in an unseen environment.

Fig. 1: Example scenarios showing chaotic movement of different kinds of traffic participants in a typical road environment. A few images in the first row show pedestrians crossing randomly between vehicles and hindering vehicle movement even on the main road. The second row shows that vehicles typically do not follow rules such as lane following and move in various directions, making the anticipation task more difficult. In many images, lanes are also not visible.

In this work, we propose Foresee, a deep learning architecture for future projections of the chaotic road environment directly from raw camera images. The network is composed of two layers of GRUs (Gated Recurrent Units [6]) that encode the dynamics of the environment into a small representation in the hidden layers. Next, we reconstruct the future projections from the encoded representations using a fully connected layer. We train the network in an unsupervised way. We formulate the problem as a sequence generation task, where a sequence is a collection of images that are contiguous in time. We are interested in predicting the future from the past few frames. When we use Foresee with online training, we are able to project the future up to 0.5 seconds in advance.

Our specific research contributions are as follows:

  1. We propose a deep learning architecture, Foresee, using Gated Recurrent Units (GRUs) and attention for future projections. We show that the proposed architecture performs better than the current state of the art. We evaluate our proposal on a vast set of images collected in chaotic road environments of Indian roads. We investigate various design choices by analyzing various hyper-parameters in Foresee.

  2. We have collected a very large dataset of real road environments using a monocular camera and dashcam videos from YouTube. In total, we have 101 videos. The videos capture the interaction of many traffic participants on various kinds of roads, for example during urban driving, highway driving, merging at intersections, and in markets and streets.

  3. We investigate and compare performance of two future projection architectures with our method on a very large dataset.

  4. We explore online training of Foresee to make projections. We observed that online training improves the performance and helps Foresee to project the future up to 0.5 seconds in advance. Finally, we show that Foresee trained on our dataset generalizes to a public dataset. We also show that the projected images can be used for steering angle estimation for behavioral cloning [1] in an autonomous driving simulator.

The subsequent sections are structured as follows. Section II describes the state of the art for future predictions. In section III, we provide details of the proposed system. Section IV describes the collected dataset, experimental setup and evaluation pipeline. In section V, we demonstrate the future prediction results on the dataset collected in chaotic road environments. Section VI discusses our work and future directions, and section VII concludes the paper.

II Related Works

In this section, we describe the state of the art approaches for environment anticipation.

We are not the first to look in this direction; related works have explored future prediction from various viewpoints. One common approach is Bayesian filtering to predict the next state, as in the Kalman filter [16]. We focus on works that use neural networks as the learning architecture, since these can anticipate the environment directly from data without any explicit modeling. Works like [30, 8, 23] have looked at Bayesian filtering for state prediction using neural networks. However, these treat object detection, tracking and prediction as separate blocks. Such approaches use handcrafted features, which is an extra overhead. Redmon and Farhadi [25] and Ning et al. [22] have used CNNs and LSTMs to find the future trajectory of an object using the current camera location. They first predict the target location and then track it using LSTMs. The above approaches are supervised and require a labeled dataset to predict target locations. Supervised learning for chaotic environments is very difficult, as the environment is shared by different kinds of participants and proper labels for object locations are difficult to obtain. Labeling images is a very costly task.

Hence, an end-to-end learning approach is desirable even for such a complex problem, and one would also like the learning to be unsupervised. Neural networks can capture representative features for object patterns and motion dynamics. They can also capture the interactions among the traffic participants. For example, [4] models the interaction among pedestrians using an LSTM [14] network. There, the authors treat the interactions among pedestrians as a social force model (Helbing and Molnar [13]). They propose a deep neural network model that uses a separate LSTM for each object and captures the interactions using a pooling layer among neighbors. Ondrúška and Posner [23] go one step further and track objects in a partially observable environment. Their approach to Bayesian filtering is end-to-end trainable and unsupervised, and it predicts the fully observable state. However, they test their approach in a simulation environment where the dynamics are pre-defined, and hence it does not capture the chaotic environment.

The papers by Srivastava et al. [28] and Lotter et al. [20] are very similar to our work. They show that their frameworks can predict a future frame in advance using unsupervised learning. Srivastava et al. [28] use multilayer LSTM networks for future predictions and model the problem as a sequence prediction task, whereas Lotter et al. [20] propose a video sequence prediction architecture using LSTMs and predictive error coding. They evaluated their proposals on simple environments, with few vehicles and only vehicles in the scene. We compare the performance of these methods with our proposed network on the dataset that we have collected for chaotic road environments. We observe that the approach in [28] fails beyond the next frame and that the approach of Lotter et al. [20] fails to encode the environment dynamics properly: it produces a high-quality image that is more or less the same as the previous frame. We compare and show that Foresee performs better than these two methods for chaotic environments. A common problem with deep learning architectures is that they fail to generalize to different datasets, but we show that Foresee trained on our dataset generalizes to the Kitti dataset for future projections.

III Proposed Methodology

In this section, we explain in detail how Foresee predicts the future projections of the road environment.

Fig. 2: System architecture for future projections.

III-A System Overview

In this subsection, we describe the system architecture shown in figure 2. The future projection is carried out in several steps. The input image is first normalized to a fixed range, and gamma correction is then applied to enhance illumination. The normalized, corrected image is then re-sized to a fixed shape. An image sequence is then created by concatenating the most recent frames. The input image sequence runs from the current frame back over a fixed number of past frames, and the output image sequence is the sequence of images in the future. The prepared image sequence is passed to the recurrent network, which encodes the temporal sequence for future projections. The output sequence is reconstructed from the encoded representations using a fully connected layer whose number of hidden units equals the number of pixels in the output image. In the next subsection, we explain the recurrent network, named Foresee, which is composed of stacked GRU cells that are used recursively to predict future projections.
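The preprocessing and sequence-building steps above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: frames are flat lists of pixel values, the normalization range [0, 1] and the gamma value are assumptions, re-sizing is omitted, and the helper names `normalize`, `gamma_correct` and `make_sequences` are hypothetical.

```python
def normalize(pixels, max_value=255.0):
    """Scale raw pixel values to [0, 1] (the target range is an assumption)."""
    return [p / max_value for p in pixels]


def gamma_correct(pixels, gamma):
    """Apply gamma correction to pixel values that are already in [0, 1]."""
    return [p ** gamma for p in pixels]


def make_sequences(frames, input_len, output_len):
    """Slide a window over time-ordered frames.

    Returns (input_sequence, output_sequence) pairs in which the input holds
    `input_len` consecutive frames and the output holds the `output_len`
    frames that immediately follow them.
    """
    pairs = []
    last_start = len(frames) - input_len - output_len
    for start in range(last_start + 1):
        split = start + input_len
        pairs.append((frames[start:split], frames[split:split + output_len]))
    return pairs


# Usage: preprocess a tiny synthetic video of one-pixel frames.
raw_video = [[0], [64], [128], [255]]
video = [gamma_correct(normalize(frame), gamma=0.5) for frame in raw_video]
training_pairs = make_sequences(video, input_len=2, output_len=1)
```

Because the targets are simply the frames that follow each window, no manual labels are needed, which matches the unsupervised setting described in the text.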

III-B Foresee: Recursive Future Projections using GRU Network and Attention

Fig. 3: Foresee model: attention is applied on the GRU cells. A single block is Foresee, and it is applied recursively to predict a longer output sequence.

To predict the future projections, a stack of GRU cells [6] is employed along with attention [24, 5]. In the following text, we explain the GRU and the attention method. The Foresee network consists of GRU cells, which have a hidden state corresponding to each time step. In figure 3, one block is Foresee; the figure shows the recursive use of Foresee to project several frames in advance.

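A Gated Recurrent Unit (GRU) cell is a slight variation of the Long Short-Term Memory (LSTM) cell in which the forget and input gates of the LSTM are combined into a single update gate. It also merges the cell state and the hidden state into the hidden state itself. For more details on the LSTM, readers can refer to an online tutorial. The input at time $t$ is fed into the network, and the information needed to encode the temporal sequence up to time $t$ is stored in the hidden state of the GRU cell. For each frame in the input sequence, each layer computes the following functions (written here in the standard GRU form, with bias terms omitted):

$$r_t = \sigma(W_{ir} x_t + W_{hr} h_{t-1}) \tag{1}$$
$$z_t = \sigma(W_{iz} x_t + W_{hz} h_{t-1}) \tag{2}$$
$$n_t = \tanh\left(W_{in} x_t + r_t \odot (W_{hn} h_{t-1})\right) \tag{3}$$
$$h_t = (1 - z_t) \odot n_t + z_t \odot h_{t-1} \tag{4}$$

where $h_t$ is the hidden state at time $t$, and $x_t$ is the output of the previous layer at time $t$, or the input at time $t$ for the first layer. $r_t$ is the reset gate, $n_t$ is the new gate and $z_t$ is the update gate. The $W$'s are the weight parameters of the gates, $\sigma$ is the sigmoid function and $\odot$ is the element-wise product.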
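As a concrete illustration of one GRU time step with a reset gate, an update gate and a new gate, here is a minimal pure-Python sketch. It assumes the standard GRU update without bias terms; the parameter names (`W_ir`, `W_hr`, and so on) and the helper functions are illustrative choices, not taken from the authors' implementation.

```python
import math


def sigmoid(value):
    return 1.0 / (1.0 + math.exp(-value))


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def gru_step(x_t, h_prev, params):
    """One GRU time step (standard formulation, biases omitted).

    x_t    : input vector at time t
    h_prev : hidden state at time t-1
    params : dict of weight matrices keyed by the assumed names below
    """
    # Reset gate: how much of the previous hidden state feeds the new gate.
    r = [sigmoid(a + b) for a, b in
         zip(matvec(params["W_ir"], x_t), matvec(params["W_hr"], h_prev))]
    # Update gate: how much of the previous hidden state is carried over.
    z = [sigmoid(a + b) for a, b in
         zip(matvec(params["W_iz"], x_t), matvec(params["W_hz"], h_prev))]
    # New gate: candidate hidden state, with the recurrent part gated by r.
    recurrent = matvec(params["W_hn"], h_prev)
    n = [math.tanh(a + r_i * b) for a, r_i, b in
         zip(matvec(params["W_in"], x_t), r, recurrent)]
    # Blend the candidate and the previous hidden state.
    return [(1.0 - z_i) * n_i + z_i * h_i for z_i, n_i, h_i in zip(z, n, h_prev)]
```

Running `gru_step` over the frames of an input sequence, carrying the returned hidden state forward each time, yields the per-time-step hidden states that the attention layer later combines.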

Fig. 4: Attention mechanism to improve the quality at a larger time step. In conventional GRUs, the quality degrades for longer sequences.

In GRU networks, the reconstruction quality degrades for longer sequences because the network cannot fit all the information into its hidden layer (see [5]). Attention methods were introduced to resolve this problem; in text generation [5], for example, attention was employed to improve long-term dependency. With attention, the output is no longer a function of only the previous hidden state and the current input; it is a function of a weighted sum of all input encodings. Since the road environment is not Markov, i.e., it does not depend only on the previous frame, attention helps the network attend to several past frames rather than only the previous one. Figure 4 shows a graphical representation of the attention (a weighted sum using a context vector). The attention mechanism takes the outputs from all previous time steps and forms a context vector, which is a weighted sum of the representations at those time steps. The context vector is then multiplied with the new hidden state, and the output is reconstructed from the result. The attention layer operates on the hidden state of the GRU network; the quantities involved are the output sequence of the GRU network (the outputs for all time steps), the attention context vector, a matrix multiplication, and the weighted output at each time step. The training loss is the mean squared error between the target frame $y_t$ and the projected frame $\hat{y}_t$, given in equation 9:

$$\mathcal{L} = \frac{1}{N} \sum_{i=1}^{N} \left( y_t^{(i)} - \hat{y}_t^{(i)} \right)^2 \tag{9}$$

where $N$ is the number of pixels in a frame and $i$ indexes the pixels.
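The attention step can be sketched as below. The text specifies only that the context vector is a weighted sum of the GRU outputs and that it is multiplied with the new hidden state; the dot-product scoring, the softmax normalization and the element-wise form of the final multiplication are assumptions made here for illustration, and the function names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(outputs, h_new):
    """Combine GRU outputs from all time steps with the new hidden state.

    outputs : list of GRU output vectors, one per previous time step
    h_new   : new hidden state vector (same length as each output)
    Returns (attention weights, context vector, weighted output).
    """
    # Score each past output against the new hidden state (assumed dot product).
    scores = [sum(o * h for o, h in zip(output, h_new)) for output in outputs]
    weights = softmax(scores)
    # Context vector: weighted sum of the past outputs.
    context = [sum(w * output[k] for w, output in zip(weights, outputs))
               for k in range(len(h_new))]
    # Multiply the context with the new hidden state (assumed element-wise).
    weighted = [c * h for c, h in zip(context, h_new)]
    return weights, context, weighted
```

In a full model, the weighted output would then go through the fully connected reconstruction layer to produce the projected frame.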


The above procedure is applied to project the next frame from a fixed-length input sequence. We investigate different ways of applying attention and other hyper-parameters in section V-B. To generate projections several frames ahead, Foresee is used recursively, as detailed in figure 3.
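One way to read the recursive use of Foresee is as a rollout in which each projected frame is fed back as the newest input. The sketch below assumes exactly that sliding-window feedback scheme; `project_recursively` is a hypothetical helper, and `toy_predictor` is a stand-in for the trained network that linearly extrapolates each pixel.

```python
def project_recursively(predict_next, input_frames, num_future):
    """Roll a one-step predictor forward for `num_future` frames.

    predict_next : function mapping a list of frames to the next frame
    input_frames : the most recent observed frames, oldest first
    """
    window = list(input_frames)
    projections = []
    for _ in range(num_future):
        next_frame = predict_next(window)
        projections.append(next_frame)
        # Drop the oldest frame and append the projection so that the
        # window length seen by the predictor stays fixed.
        window = window[1:] + [next_frame]
    return projections


def toy_predictor(window):
    """Hypothetical stand-in for Foresee: per-pixel linear extrapolation."""
    last, previous = window[-1], window[-2]
    return [2 * a - b for a, b in zip(last, previous)]


# Usage: project three frames ahead from three observed one-pixel frames.
future = project_recursively(toy_predictor, [[0.0], [1.0], [2.0]], num_future=3)
```

Since every projection after the first is conditioned on earlier projections, errors can accumulate with the horizon, which is consistent with the difficulty of projecting many frames ahead.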

IV Experimental Setup

In this section, we describe the dataset, the experimental setup and the performance metrics used to evaluate Foresee and to compare it with state-of-the-art methods.

Dataset: To capture the chaotic environment, we collected data on Delhi roads, where the road space is shared by pedestrians, vehicles (cars, buses, auto-rickshaws, bicycles, etc.), and sometimes animals as well. We collected data for urban traffic and highway traffic under both chaotic and ordered traffic situations. The real road environment images were captured using a car-mounted Point Grey monocular camera at various times on various days. The frame rate varied during capture, so the videos were standardized to a common FPS. In addition to the data collected on Delhi roads, we made use of dash-cam videos available on YouTube. We selected only videos of chaotic situations (as defined in the introduction), taken from various states in India. In total, we have 101 videos. These videos are randomly split into three sets for training, validation and testing. The training set is used to train the model, the validation set is used to decide whether the network is over-fitting, and the testing set is used to evaluate the trained model and to compare performance with other methods. All the sets are sufficiently large and cover various environmental settings. The dataset description is given in table I. The dataset was collected in the wild and contains environments such as markets, heavy traffic, mild traffic, mixed objects (pedestrians, vehicles, animals, etc.), different backgrounds (buildings, trees, etc.) and different infrastructural variations (3D pedestrian crossings, etc.). The intuition behind collecting such a dataset is to develop a deep learning architecture that can generalize to various real road environments. The data frames were normalized using OpenCV and re-sized to a smaller fixed size.
We choose a smaller image size because the network is then less complex, and because deep learning can still make sense of objects and the environment in small images; for example, the CIFAR dataset [19] has 32×32 images and is widely used for classification.

Set | Number of videos | Number of images
Training | 55 | 82,265
Validation | 22 | 4,314
Testing | 24 | 14,500
Total | 101 | 101,079
TABLE I: Dataset description

Evaluation Metric: To quantitatively assess the projected image quality, we use the mean squared error (MSE) and the structural similarity index measure (SSIM) [29] as evaluation metrics. The MSE at time t is the mean squared error between the target image at time t and the projected image at time t. SSIM assesses image quality based on structural degradation by comparing an image with a reference image.
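The two metrics can be illustrated with a small sketch on flattened images. The MSE follows its usual definition. The SSIM shown here is a simplified, single-window version that uses global image statistics with the commonly used constants K1 = 0.01 and K2 = 0.03; standard SSIM implementations average the same expression over local windows, so this is an assumption-laden approximation rather than the exact measure used in the evaluation. The function names are hypothetical.

```python
def mse(target, projected):
    """Mean squared error between two flattened images of equal length."""
    return sum((t - p) ** 2 for t, p in zip(target, projected)) / len(target)


def global_ssim(x, y, data_range=1.0, k1=0.01, k2=0.03):
    """SSIM computed once over the whole image (no local windows).

    x, y       : flattened images with values in [0, data_range]
    data_range : dynamic range of the pixel values
    """
    n = len(x)
    mu_x = sum(x) / n
    mu_y = sum(y) / n
    var_x = sum((a - mu_x) ** 2 for a in x) / n
    var_y = sum((b - mu_y) ** 2 for b in y) / n
    cov = sum((a - mu_x) * (b - mu_y) for a, b in zip(x, y)) / n
    # Small constants stabilize the ratio when means or variances are near zero.
    c1 = (k1 * data_range) ** 2
    c2 = (k2 * data_range) ** 2
    numerator = (2 * mu_x * mu_y + c1) * (2 * cov + c2)
    denominator = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return numerator / denominator
```

A lower MSE and an SSIM closer to 1 both indicate a projection closer to the target. The two can disagree, for instance when a projection preserves structure but is shifted in intensity.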

Experiments: We designed the following experiments for better understanding and evaluation of the proposed architecture for future projections:

  1. Qualitative understanding of Foresee and what representations it is able to learn

  2. Quantitative understanding of Foresee and its hyper-parameters

  3. Performance comparison with state of the art methods for future projections. We compare with encoder-decoder LSTM method [28] and prednet [20].

  4. Foresee with online training to see the benefits online training can provide for the future projections

  5. Quantitative evaluation of Foresee on Kitti dataset [9] to check the generalization power of Foresee.

  6. Steering estimation on an autonomous driving simulator to check how well the projections help with behavioral cloning.

Implementation Details:

Foresee is implemented in PyTorch with gradient computation in adagrad. All computation is done on a Tesla K20m GPU.

V Results

In this section, we will show the efficacy of our proposed framework using the experiments mentioned in the previous section.

V-A Qualitative understanding of Foresee and what representations it is able to learn

First, we investigated various sequence prediction architectures as explained in Andrej Karpathy's blog. We found many-to-many sequence prediction networks to be more effective for encoding the environment representations for future projections (see section V-B). The method proposed in [28] is also a many-to-many sequence prediction network using LSTMs (Long Short-Term Memory [14]). Among the many-to-many architectures, we first employed the approach proposed by Srivastava et al. [28] and found that it cannot maintain the sequence representation even for a few frames. Because the road environment is continuously changing and the next frame does not depend completely on the current frame, we need a method that attends to previous frames as well. For this, we investigated various approaches such as reconstruction mechanisms using deconvolution layers and fully connected layers, feature representation using ResNet [12], and attention [24]. We observed that attention performs the best for chaotic environments. In the same architecture, GRUs perform better than LSTMs. For brevity, we show the future projections only for GRU with attention (which is Foresee) in figure 5. To choose the hyper-parameters of the network, we did an exhaustive search over multiple hidden state sizes, input sequence lengths, etc. (see section V-B). The hidden state size and input sequence length of the final network were chosen from this search, and the model has two layers of GRU cells. The learning rate and the weight optimization algorithm had a large impact on performance. We observed the highest performance when using the Adam algorithm [18] for weight optimization. We also try Foresee+online, in which we first apply online training on the input sequence and then decode the future projections. Out of all tested models, we show results of Foresee on our test set in figure 5 and of Foresee+online in figure 8.
Figure 6 shows the projected image sequence of a video from the test set along with the target image sequence.

Fig. 5: Next-frame projections using Foresee. All images are generated using Foresee.
Fig. 6: Image showing (a) target sequence of images from the test set, (b) corresponding next frame projections generated using Foresee.
Approaches MSE(train) MSE(val)
MM-1 (output)
MM-2 (output)
TABLE II: Training with different many-to-many sequence prediction architectures. MM-2 is the many-to-many architecture in which the input and output sequences are synced. MM-1 is the architecture in which the output length is larger than the input length. These architectures are taken from [17].
hyper-param (Input,hidden) AttnHidden AttnOutput

MSE (Last) MSE (All) MSE (Last) MSE (All)
10, 512 0.00189
20, 512
10, 1024
20, 1024
TABLE III: Performance with various hyper-parameters when attention is applied at the hidden layer (AttnHidden) and at the reconstructed output (AttnOutput). MSE(Last) is the MSE when attention is applied only at the last time step during training, and MSE(All) is the MSE when attention is applied at all steps of encoding and decoding.

V-B Quantitative understanding of Foresee and its hyper-parameters

Foresee has various hyper-parameters that require tuning to achieve better results. The investigated parameters are the input sequence length, the hidden size of the GRU cell, the attention mechanism and the training procedure. Table III shows the average mean squared error on the validation set for different combinations of hyper-parameters when attention is used at the topmost hidden layer and at the reconstructed output. The analysis is as follows:

  1. Input sequence length (named Input in table III): We expect Foresee to project future up to seconds in advance. For this we explored input length of seconds. We didn’t go beyond seconds because it is clear from the table that the model started over-fitting. This is expected as the road environment is dynamic.

  2. Hidden size of GRU cell (hidden): We investigated various hidden sizes. For the input image size used, we tried hidden sizes of 512 and 1024. Table III shows the performance of the different hidden sizes; we observe that the network overfits for the larger hidden size in all columns except the last one, where attention is applied at the reconstructed output and at all time steps. Hidden size of and input sequence length of performed the best.

  3. Attention mechanism: Attention helps the network attend to a specific part of the input sequence. Attention can be applied at several locations in the network, for example at the reconstructed output (named AttnOutput) or at the hidden state (named AttnHidden). One can also choose when to apply it: while encoding, while decoding, or both. Table III shows the average MSE when attention is applied at the topmost hidden layer (first two columns) and when it is applied at the reconstructed output (last two columns). MSE(Last) is the MSE when attention is applied only at the last time step, whereas MSE(All) is the MSE when attention is applied at all steps of the input sequence as well as in the decoder. Intuitively, one would expect attention at the hidden state to perform better, because if the hidden state is not good enough then the output will surely be worse; here, however, attention at the output performs better. The initial reconstructions are good, and attention helps make use of them. In contrast, when attention is used at the hidden state, a slight error there also propagates into the reconstruction.

  4. Training procedure: For many-to-many sequence generation, we used an encoder-decoder training procedure. The two procedures are shown in figure 7. We tried to backpropagate with the error computed only for the encoder (named MM-2) and only for the decoder (MM-1). Table II shows the two sequence prediction architectures when attention is applied at the output. MM-1 (where the output sequence is longer than the input during training) over-fits, whereas MM-2, in which the input and output sequence lengths are in sync, performs better.

Fig. 7: Many-to-many sequence generation in recurrent networks.

V-C Performance comparison with state of the art methods

In this section, we compare the performance of various future projection architectures with Foresee, quantified using MSE and SSIM. Table IV shows the MSE and SSIM values for the approaches of [28] and [20] and for our proposed approach without and with online training. The results were computed on our test set with models trained on our training set. Copy last frame is the trivial approach in which the previous frame is used as the projected frame. In this case the MSE is very high, but because the copied image has no noise and shares the background with the target, it has a high SSIM.

Prednet [20]: We trained prednet on our training set. We stopped the training process after a day of training and when the validation loss reached approx. . To produce results of prednet, we have used the scripts provided. The output of prednet is image. On the original projections, the MSE of prednet was and SSIM was . To make the comparison fair, we downsampled the prednet output and then computed the MSE and SSIM shown in table IV. Prednet makes use of the previous frame and then performs error correction. In a highly dynamic environment, prednet returns only the previous frame and does not encode the motion dynamics into its representations; hence its output is poor both visually and quantitatively.

We generated all possible fixed-length sequences from the test set, starting from the first frame. The results shown in the table are based on this exhaustive set of sequences.

Enc. Dec. LSTM [28]: We trained their architecture on our training set with the same hidden size and input sequence length that they report in their paper for image patches. As noted, a sequence prediction architecture without attention does not encode the environment representations.

Approach MSE SSIM
Enc. Dec. LSTM
Copy last frame
Foresee with online
TABLE IV: Performance comparison on test data.

V-D Foresee with online training

At any point in time, the agent/vehicle has access to previous frames that share the same background as the current frame and the next few frames, and hence these can be used for further training. In this way, Foresee learns about the background of the input images, adapts to it, and can project better representations. Since averaging with the previous frame improves results, we averaged the input at the GRU cell with the previous frame at every time step instead of averaging on the output sequence. In brief, online training is the real-time training of the network on previously seen frames.

For online training, we used the previously available frames (up to second) to train the model again using Adam as the optimizer. Next, we projected frames ( second) using the newly trained model. For each video, we start afresh from the network previously trained on the training set. Using online training, Foresee is able to predict frame seconds in advance with an average MSE of on our test set with SSIM on first frame of . Table IV shows error values for the next-frame projection only. Figure 8 shows the projected output for seconds. The network is able to preserve the representation for longer than frame.
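The online training loop can be sketched on a toy model. Everything below is illustrative rather than the authors' setup: the "network" is a two-parameter per-pixel linear predictor, plain gradient descent replaces the Adam optimizer mentioned in the text, and the names `predict`, `sequence_loss` and `online_adapt` are hypothetical. The structure is what matters: copy the pretrained parameters for each video, fine-tune on frames already seen, then project with the adapted copy.

```python
import copy


def predict(params, frame):
    """Toy next-frame model: the same affine map applied to every pixel."""
    return [params["w"] * p + params["b"] for p in frame]


def sequence_loss(params, frames):
    """Mean squared next-frame error over consecutive frame pairs."""
    errors = [(q - t) ** 2
              for prev, nxt in zip(frames[:-1], frames[1:])
              for q, t in zip(predict(params, prev), nxt)]
    return sum(errors) / len(errors)


def online_adapt(pretrained, recent_frames, steps=200, lr=0.1):
    """Fine-tune a copy of the pretrained parameters on recently seen frames."""
    params = copy.deepcopy(pretrained)  # each video restarts from the pretrained weights
    pairs = list(zip(recent_frames[:-1], recent_frames[1:]))
    for _ in range(steps):
        grad_w, grad_b, count = 0.0, 0.0, 0
        for prev, nxt in pairs:
            for p, t in zip(prev, nxt):
                err = params["w"] * p + params["b"] - t
                grad_w += 2 * err * p
                grad_b += 2 * err
                count += 1
        # Plain gradient descent step (a stand-in for Adam).
        params["w"] -= lr * grad_w / count
        params["b"] -= lr * grad_b / count
    return params


# Usage: adapt to a video in which every pixel brightens by 0.1 per frame.
pretrained = {"w": 1.0, "b": 0.0}
seen = [[0.0, 0.5], [0.1, 0.6], [0.2, 0.7]]
adapted = online_adapt(pretrained, seen)
next_frame = predict(adapted, seen[-1])
```

Copying the parameters before adapting keeps the pretrained model untouched, so each new video starts from the same initialization.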

Fig. 8: Future projection for up to frames using Foresee+online.

V-E Quantitative evaluation of Foresee on Kitti dataset

In this section, we show the generalization power of our model on a public dataset. We used the previously trained model (on our training set) and test future projections for Kitti dataset [9] which is also used in prednet [20]. With Foresee+online, we achieve MSE and SSIM. Figure 9 shows the projected images using Foresee+online. The figure clearly shows that Foresee+online is able to learn representations on another data set and it generalizes to other environments.

Fig. 9: Future projections on Kitti dataset when Foresee is trained on our training dataset.

V-F Steering Estimation on the projected images:

We use the projected images from Foresee for behavioral cloning [7, 1]. In behavioral cloning, we train a neural network to capture the behavior of a human driver for an autonomous vehicle. The trained model estimates the steering angle from images for lane keeping. Estimating steering angles from future images may help to avoid a near miss or improve the drive. We use images from an autonomous vehicle simulator [2] for steering angle estimation. The available data was divided into training and testing sets after steering angle normalization. We trained a convolutional neural network with MSE loss on the simulator training set. We did not train Foresee again on these images; future projections were generated from the previously trained model (on our training set). The projected images are then used for steering angle estimation. The average MSE in the steering angle values estimated from Foresee is and when estimated using original images. The estimated values from the network are shown in figure 10. The figure shows the estimates from Foresee-generated images and from original images, together with the ground truth values.

Fig. 10: Figure showing steering angle values estimated from the projected images (Foresee), Ground truth angle and steering angle estimated from the original images (Original in figure legend).

VI Discussion

Future projections are essential for behavioral cloning [1] in an unseen environment. We showed that a GRU with attention achieves better performance for future projections than other available methods. The task of future projection has been studied in an unsupervised way in the literature. However, no work has explored online training for this task, even though the environment repeats over time and online training can therefore help produce better projections. We showed a simple online training approach that improves the error rate and helps in the projection of a longer sequence. More online training methods can be explored for improvements in this direction. The future projections can be used for other tasks such as tracking, cooperative perception and behavioral cloning.

VII Conclusion

We proposed Foresee, a deep learning architecture for future projections using Gated Recurrent Units and attention. We showed that attention performs better when applied at the reconstructed output for all steps of the input sequence. We collected a very large dataset of chaotic road environments that captures different traffic participants and different infrastructural variations. We showed that the proposed architecture performs better than state-of-the-art methods for future projections. Finally, we showed that Foresee with online training is able to project the future up to 0.5 seconds in advance and generalizes to a public dataset of road environments. The projected images were shown to be effective for steering estimation on simulated images from an autonomous vehicle simulator.


The authors would like to thank team Swarath at IIIT-Delhi for their help in data collection. The authors would also like to thank Dr. Arun Balaji Buduru, Assistant Professor at IIIT-Delhi, for his valuable comments and discussions.