
Gaze Training by Modulated Dropout Improves Imitation Learning

Imitation learning by behavioral cloning is a prevalent method that has achieved some success in vision-based autonomous driving. The basic idea behind behavioral cloning is to have the neural network learn from observing a human expert's behavior. Typically, a convolutional neural network learns to predict the steering commands from raw driver-view images by mimicking the behaviors of human drivers. However, there are other cues, e.g. gaze behavior, available from human drivers that have yet to be exploited. Previous research has shown that novice human learners can benefit from observing experts' gaze patterns. We show here that deep neural networks can also profit from this. We propose a method, gaze-modulated dropout, for integrating this gaze information into a deep driving network implicitly rather than as an additional input. Our experimental results demonstrate that gaze-modulated dropout enhances the generalization capability of the network to unseen scenes. Prediction error in steering commands is reduced by 23.5% compared to uniform dropout. Running closed loop in the simulator, the gaze-modulated dropout net increased the average distance travelled between infractions by 58.5%. Consistent with these results, we also found the gaze-modulated dropout net to have lower model uncertainty.



I Introduction

End-to-end deep learning has captured much attention and has been widely applied in many autonomous control systems. Successful attempts have been made in end-to-end driving, including imitation learning [1, 2, 3, 4] and reinforcement learning. Unlike reinforcement learning, which requires a predefined reward function, imitation learning can be adapted to complex driving environments more effectively. Among various imitation learning methods, one typical solution is behavioral cloning through supervised learning. It has been successfully implemented for many tasks, including off-road driving [6] and lane following [1], with both simplicity and efficiency.

Behavioral cloning follows a teacher-student paradigm, where students learn to mimic the teacher’s demonstration. Previous behavioral cloning work for autonomous driving mainly focused on learning only the explicit mapping from the sensory input to the control output, with little consideration of other implicit supervision from teachers, even though human experts provide a wealth of additional cues.

Research has shown that novice human learners can benefit from observing experts’ gaze [7]. As an example, Yamini et al. [8] showed that viewing expert gaze videos can improve the hazard anticipation ability of novice drivers. Therefore, it is very promising to investigate whether deep driving networks trained by behavioral cloning could benefit from exposure to expert gaze patterns.

However, it is not yet clear how gaze information can best be integrated into deep neural networks. Some recent work [9, 10] has incorporated human attention to improve policy performance for learning-based visuomotor tasks. However, these approaches simply add the gaze map as an additional image-like input. We question whether this is the best way to incorporate gaze information. Saccadic eye movements shift the eye gaze in different directions, allocating high-resolution processing and attention towards different parts of the visual scene. This suggests that gaze behavior may be better viewed as a modulating effect, rather than as an additional source of information. In addition, treating the gaze map as an additional image input increases the complexity of the network, and this additional complexity is inefficiently utilized, as most of the gaze map is close to zero.

To better exploit the information from human eye gaze, which encodes rich information about human attention and intent, we propose gaze-modulated dropout to incorporate gaze information into deep driving networks. A conditional adversarial network was trained to estimate the human eye gaze distribution in the visual scene while driving [10, 11]. Then, we used the estimated gaze distribution to modulate the dropout probability of units at different spatial locations. Units near the estimated gaze location have a lower dropout probability than units far from the estimated gaze location. We hypothesized that this would help the network focus on task-critical objects and ignore task-irrelevant areas such as the background. This should be especially helpful when the network encounters new and unfamiliar environments. In addition, the proposed gaze-modulated dropout does not increase the complexity or change the structure of the behavioral cloning network. It can be easily inserted into many existing neural network architectures simply by replacing the normal dropout layers.

To validate the generalization ability gained from gaze-modulated dropout, we propose to evaluate the epistemic (model) uncertainty of the trained model, which is an indicator of model confidence. The epistemic uncertainty measures the similarity between newly observed samples and previous observations [12]. It can be evaluated with Monte Carlo dropout sampling [13]. In the context of autonomous driving, the epistemic (model) uncertainty can help to reveal how well the trained model generalizes to unseen environments. Thus, evaluation through the epistemic uncertainty can reflect the improvements brought by gaze-modulated dropout more meaningfully.

The contributions of our work are mainly as follows:

  • We propose gaze-modulated dropout based on the generated gaze maps and incorporate it into a deep network trained by imitation learning (PilotNet) [1]. We show that utilizing gaze-modulated dropout significantly improves the driving performance.

  • We demonstrate that modelling of auxiliary cues not directly related to the control commands improves the performance of imitation learning. This takes imitation learning to the next step, by showing the benefits of a more complete understanding of human expert behaviors.

  • We evaluate the model uncertainty of the trained models and validate the effectiveness of the imitation network with gaze-modulated dropout, showing significant performance improvements.

II Related work

In this section, we first summarize work on end-to-end autonomous driving. Then we introduce related work on eye gaze in assisted driving and autonomous driving. Finally, different kinds of uncertainty and their estimation are reviewed.

II-A End-to-end autonomous driving network

For vision-based autonomous driving systems, end-to-end methods have become more and more popular, as they avoid decomposing the process into multiple stages and optimize the whole system simultaneously [1].

Reinforcement learning algorithms learn to drive in a trial-and-error fashion, which does not require human demonstrations. Some works apply deep reinforcement learning to autonomous driving in simulators such as Carla [2], [14] and TORCS [15]. Pan et al. [15] learned a driving policy by reinforcement learning in the TORCS simulator and transferred the learned policy to real-world driving data through a virtual-to-real image translation network. Liang et al. [14] proposed an imitative reinforcement learning network. It first trained the imitation network from human demonstrations. The learned weights were then used as initialization for the policy network trained by reinforcement learning. The major limitations of reinforcement learning are the need to design a reward function, which can be hard in complex driving tasks, and its low sample efficiency.

Imitation learning trains an agent from human demonstrations. Bojarski et al. [1] trained a convolutional neural network (PilotNet) to map visual input to steering commands for a road-following task. A similar framework was also applied to an obstacle avoidance task for ground vehicles [6]. Codevilla et al. [4] proposed a deep multi-branch imitation network. Directed by high-level commands from a planning module or a human user, it trained a conditional controller to map visual input and speed measurements to action output. The trained controller allowed an autonomous vehicle to stay on the road and also follow the high-level command, which represents the expert’s intention (go left, go right or go straight). The main limitation of these systems is that they do not generalize well to unseen scenes [14]. For example, the model of [4], trained in the first town of the Carla simulator, showed obvious performance degradation in the second town even with extensive data augmentation.

As far as we know, no systems consider utilizing human gaze behavior yet in spite of the rich driver-intention-related cues it contains. Furthermore, the uncertainty of the autonomous driving network is rarely considered.

II-B Eye gaze in driving

In driving scenarios, a wealth of cues regarding human intent and decision making is contained in the human eye gaze.

For assisted driving systems, there are some works utilizing eye gaze information to monitor the mental state of drivers, such as driver fatigue monitoring [16] and driver workload classification [17].

However, in the field of autonomous driving research, eye gaze has not yet been well exploited. Whether and how human gaze can help autonomous driving is still under-explored [18].

A closely related work [19] proposed a multi-branch deep neural network to predict eye gaze in urban driving scenarios. The authors placed much emphasis on gaze data analysis over different driving scenes and driving conditions. For example, they studied how eye gaze is distributed over different semantic categories in a scene and the effect of driving speed on the driver’s attention. However, they did not apply the predicted gaze information to an autonomous driving system.

II-C Uncertainty in Deep Learning

The estimation of uncertainty can reflect the model’s confidence and what the model does not know [20]. Epistemic (model) uncertainty and aleatoric (data) uncertainty are the two main types of uncertainty we can quantify.

The aleatoric uncertainty captures noise in the observations due to motion noise or sensor noise. Kendall et al. [20] proposed a method to quantify the aleatoric uncertainty by adding an auxiliary output. Feng et al. [21] followed this method to capture the uncertainty in a 3D object detection task. Wang et al. [22] estimated aleatoric uncertainty for segmentation results in a medical image segmentation task. They analyzed the effect of different image transformations on the segmentation result through uncertainty outputs.

The epistemic uncertainty is a measure of "familiarity", which quantifies how similar a new sample is to previously seen samples [12]. Kendall et al. [20] proposed a method to capture epistemic uncertainty by sampling over the distribution of model weights using dropout at test time. This method of quantifying model uncertainty has been applied to various tasks such as semantic segmentation [13] and depth regression [23]. These works showed that larger epistemic uncertainty is observed for objects that are rare in the training datasets. The main limitation of this dropout method is that it needs expensive sampling, which is not suitable for real-time applications. However, it can be used as an offline evaluation technique for model confidence on unseen samples.

III Methods

In this section, we introduce the implementation of gaze-modulated dropout and the framework of the autonomous driving network with gaze-modulated dropout. The method for modelling epistemic (model) uncertainty is introduced as well.

III-A Framework

As shown in Fig. 1, the autonomous driving system with gaze-modulated dropout can be divided into two parts: the gaze network and the imitation network.

To provide gaze information in real time for an autonomous driving system, rather than relying on a human-guided driving system, we synthesized the gaze map with a Pix2Pix network as the gaze network, following [10]. Trained with pairs of driving scenes and ground truth gaze maps, the gaze network learned to generate gaze maps from the driver-view images.

The generated gaze map was further used as the dropout mask, which is referred to as gaze-modulated dropout. The imitation network had the same structure as the PilotNet proposed by [1], which has five convolutional layers and four fully connected layers. We used ReLU as the activation function, and dropout was applied after the first two convolutional layers.

Fig. 1: The autonomous driving system with gaze-modulated dropout. The gaze network (Pix2Pix) generates the gaze map for the gaze-modulated dropout in the imitation network, as in [10]. The gray-scale drive-view input image is forwarded to the network for path following or for overtaking, according to the command. The filters of the convolutional layers appear as blocks marked with their corresponding sizes and are flattened to feed into the fully connected (FC) layers. The four vertical bars denote the four FC layers. The final layer has a single scalar unit which encodes the steering command.

III-B Gaze-modulated dropout

To help the network focus on task-critical objects, we introduced gaze information into the network by dropping fewer units in highly attended areas of the input images or features and dropping more units in areas with low attention. We refer to this as gaze-modulated dropout. It is similar to conventional dropout; the main difference is the non-uniform drop probability.

As shown in Fig. 2, the drop probability of gaze-modulated dropout is determined by the gaze map. Along a horizontal line on the gaze map, the keep probability increases as the pixel value increases. For conventional dropout, referred to as uniform dropout, the drop probability is the same across the image. The drop probability of a unit at a given spatial index is summarized in (1) and (2), each of which has one adjustable parameter. For uniform dropout, this parameter is the drop probability of all pixels. For gaze-modulated dropout, it is the maximum drop probability, which applies to units corresponding to the zero-valued pixels in the gaze map. As most of the gaze map is zero-valued, with slight abuse of notation, we refer to this maximum simply as the drop probability of gaze-modulated dropout.
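One form consistent with this description is sketched below. The symbols are assumptions introduced here for illustration and are not necessarily the original notation: $G(i,j) \in [0,1]$ is the normalized gaze map value at unit $(i,j)$, $p_u$ is the uniform drop probability, and $p_g$ is the maximum drop probability of gaze-modulated dropout. The linear dependence on $G$ is also an assumption, chosen to match the rescaling step of Algorithm 1.

```latex
\[
\begin{aligned}
dp_{\text{uniform}}(i,j) &= p_u, \\
dp_{\text{gaze}}(i,j)    &= p_g \,\bigl(1 - G(i,j)\bigr), \\
kp_{\text{gaze}}(i,j)    &= 1 - dp_{\text{gaze}}(i,j) \;\in\; [\,1 - p_g,\; 1\,].
\end{aligned}
\]
```

Under this form, a unit at a zero-valued gaze pixel is dropped with probability $p_g$, while a unit at the gaze peak is never dropped.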


The implementation details of gaze-modulated dropout are shown in Algorithm 1. Given the input image or features and the gaze map, a random array with the same width and height as the input is generated first. The keep probability mask K, with the same size as the input, is obtained by interpolating the gaze map and then rescaling it to the range from one minus the maximum drop probability to one, so that its values represent keep probabilities. By comparing K with the random array, a binary mask M is then generated. The zero values of the binary mask set the corresponding units of the input to zero, and the result is normalized at the end.

For uniform dropout, applying dropout at test time to obtain an averaged prediction is not recommended [24], as it requires running exponentially many thinned models. A simple approximate averaging method is used instead, with no dropout at test time. For gaze-modulated dropout, we took a similar approach. It is not feasible to simply remove the gaze-modulated dropout module at the test stage, as the gaze information is essential. For a fair comparison, we kept the gaze-modulated dropout in testing, but modified it to approximate the averaging effect. We studied the effectiveness of the approximation and compared it with uniform dropout in Section IV-C2.

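Fig. 2: Keep probability settings for gaze-modulated dropout and uniform dropout.

Input: activation output of a layer, gaze map, mode, maximum drop probability
1 K = gaze map interpolated to the width and height of the activation output
2 if mode == pixel-wise dropout then
3       Randomly sample an array with the width and height of the activation output:
4       random array = random(size = size of the activation output)
5       Rescale the range of values of K to (1 - maximum drop probability, 1)
6       Binary mask M = result of comparing the random array with K
7       Apply the mask: multiply the activation output element-wise by M
8       Normalize the features
9 end if
Algorithm 1 Gaze-modulated dropout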
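
The steps of Algorithm 1 can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `gaze_modulated_dropout` is hypothetical, nearest-neighbour resizing stands in for the unspecified interpolation, the random array is assumed uniform on [0, 1), one 2D mask is shared across channels, and the final normalization is assumed to divide by the fraction of kept units.

```python
import numpy as np

def gaze_modulated_dropout(x, gaze_map, max_drop_prob, rng):
    """Training-time sketch of Algorithm 1 (hypothetical helper).

    x: activations of shape (H, W) or (C, H, W).
    gaze_map: 2D non-negative array, any resolution.
    max_drop_prob: drop probability applied where the gaze map is zero.
    """
    h, w = x.shape[-2:]
    # Line 1: resize the gaze map to the spatial size of x.
    # Assumption: nearest-neighbour lookup instead of a specific interpolation.
    rows = np.arange(h) * gaze_map.shape[0] // h
    cols = np.arange(w) * gaze_map.shape[1] // w
    g = gaze_map[np.ix_(rows, cols)].astype(np.float64)
    # Normalize the resized gaze map to [0, 1] before rescaling.
    span = g.max() - g.min()
    g = (g - g.min()) / span if span > 0 else np.zeros_like(g)
    # Line 5: keep probability K in [1 - max_drop_prob, 1].
    keep = (1.0 - max_drop_prob) + max_drop_prob * g
    # Lines 3-4 and 6: uniform random array compared with K gives the binary mask M.
    mask = (rng.random((h, w)) < keep).astype(x.dtype)
    # Line 7: apply the mask (broadcast over channels, if any).
    out = x * mask
    # Line 8: normalization. Assumption: divide by the fraction of kept units.
    kept = mask.mean()
    return out / kept if kept > 0 else out
```

For example, with `rng = np.random.default_rng(0)`, a `(2, 8, 8)` activation tensor and a `4 x 4` gaze map whose central `2 x 2` block is one, the units under the gaze peak are always kept, while border units are dropped with probability `max_drop_prob`.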

III-C Uncertainty

As mentioned above, there are two types of uncertainty: epistemic (model-dependent) and aleatoric (data-dependent). We evaluated the epistemic uncertainty of the trained models and investigated the effect of gaze-modulated dropout, based on the hypothesis that gaze-modulated dropout drops task-irrelevant units and reduces the differences among scenes, thereby improving the generalization capability of the model.

Modelling epistemic uncertainty using stochastic regularization techniques such as dropout has proved effective for different tasks [25]. By performing multiple forward passes for the same input, the mean and variance of the output can be obtained. In this way, the epistemic uncertainty is also captured. Equations (3) and (4) give the mean and variance of the outputs over multiple forward passes, in terms of the number of forward passes and the masked model weights used in each pass. In each pass, the weights are zeroed by dropout with the given drop probability.
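
The sampling procedure can be sketched as below. This is an illustrative sketch only: `mc_dropout_stats` and `toy_forward` are hypothetical names, the toy linear model stands in for the imitation network, and the variance is computed as the mean squared deviation from the sample mean, which may differ in form from the original equations.

```python
import numpy as np

def mc_dropout_stats(forward, x, num_passes, rng):
    """Mean and variance of the output over stochastic forward passes.

    forward(x, rng) must apply a fresh dropout mask on every call.
    """
    outputs = np.array([forward(x, rng) for _ in range(num_passes)])
    mean = outputs.mean(axis=0)
    variance = np.mean((outputs - mean) ** 2, axis=0)
    return mean, variance

def toy_forward(x, rng, drop_prob=0.5):
    """Hypothetical one-layer linear model with dropout on its weights."""
    weights = np.array([0.5, -0.2, 0.1])
    mask = rng.random(weights.shape) >= drop_prob
    # Inverted-dropout scaling keeps the expected output unchanged.
    return float(np.dot(weights * mask, x) / (1.0 - drop_prob))

rng = np.random.default_rng(0)
mean, variance = mc_dropout_stats(toy_forward, np.array([1.0, 2.0, 3.0]), 50, rng)
print(f"mean={mean:.3f}, variance={variance:.3f}")
```

A larger variance across passes indicates that the prediction depends strongly on which units are dropped, which is the sense in which it serves as a proxy for model uncertainty.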


IV Experiments and Results

This section gives the training and testing details of the imitation networks, evaluates the performance of the networks, and discusses how gaze-modulated dropout is implemented at the testing stage.

IV-A Network training

IV-A1 Data collection

The experiments were conducted in TORCS [26], an open-source highway driving simulator. It has been used for vision-based autonomous driving systems in recent research [27], [28]. We chose TORCS because it simulates multi-lane highway driving scenarios, which meet the needs of our task. The subject was asked to watch a screen showing the real-time drive-view scene and to control the car with a steering wheel and pedals. For most of the time, lane following was required. Overtaking was executed when needed. While the subject was changing lanes, a button was pressed to mark the overtaking maneuver. During the experiments, the gaze data and action data of the subjects were collected. The gaze data was collected by a remote eye tracker, the Tobii Pro X60. We chose five different scenes and collected data for four trials in each scene.

IV-A2 Training and testing details

For the gaze network, we followed the Pix2Pix architecture [11]. Around 3500 images from T1 and T2 are used for training. For the imitation network, three trials each of T1 and T2, six trials and about 40000 images in total, are used for training. The remaining trial in each of T1 and T2 and all the trials in T3, T4 and T5 are used for testing.

For robustness, we follow [4] and add expert demonstrations of recovery from drift. These account for 10% of the training data. Furthermore, we perform online data augmentation on each image, such as random changes in contrast, brightness and gamma.
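
A minimal sketch of such online photometric augmentation is given below. The function name `augment` and the sampling ranges are assumptions made for illustration; the actual ranges used in the experiments are not specified.

```python
import numpy as np

def augment(image, rng):
    """Randomly perturb contrast, brightness and gamma of a gray-scale image.

    image: float array with values in [0, 1].
    All sampling ranges below are illustrative assumptions.
    """
    contrast = rng.uniform(0.8, 1.2)
    brightness = rng.uniform(-0.1, 0.1)
    gamma = rng.uniform(0.8, 1.25)
    # Contrast is applied around mid-gray, then brightness is shifted.
    out = (image - 0.5) * contrast + 0.5 + brightness
    # Clip before the power so the gamma curve stays within [0, 1].
    out = np.clip(out, 0.0, 1.0)
    return out ** gamma
```

Calling `augment` on each image as it is loaded, with a shared `np.random.default_rng()` generator, gives a different perturbation every time the image is seen.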

We trained two convolutional networks with the same structure, one on the lane-following behavior data and the other on the overtaking behavior data. During testing, we manually selected the imitation network for path following or overtaking.

IV-A3 Networks to compare

We trained the imitation network with uniform dropout as the baseline. It is referred to as Uniform dropout in the results. To evaluate the effectiveness of gaze-modulated dropout, we implemented three methods.

Real gaze-modulated dropout: The imitation network was trained with dropout using the real gaze map as the mask. For simplicity, gaze-modulated dropout mentioned below without further specification refers to real gaze-modulated dropout.

Estimated gaze-modulated dropout: The imitation network was trained with dropout using the estimated gaze map as the mask. The estimated gaze map was generated by the gaze network. In this setting, the trained network can be applied in online autonomous driving tests.

Center Gaussian blob modulated dropout: The imitation network was trained with dropout using an image with a single Gaussian in the center as the mask. Placing the Gaussian at the image center is based on the observation that the subject mostly looks at the center area of the scene.

IV-B Gaze Network Evaluation

Metric  Estimate        T1    T2    T3    T4    T5
KL      Gaze map        0.75  0.63  0.94  0.88  0.82
KL      Gaussian blob   3.08  2.57  2.63  1.95  2.37
CC      Gaze map        0.86  0.87  0.83  0.81  0.85
CC      Gaussian blob   0.68  0.72  0.73  0.80  0.75

TABLE I: The similarity of gaze map estimates to the real gaze map over different tracks. KL denotes Kullback-Leibler divergence, and CC denotes Correlation Coefficient.

The function of the gaze network is to generate, from drive-view images, a gaze map whose intensity distribution is similar to that of the real gaze map. For quantitative evaluation, two standard metrics from the saliency literature [29], [30], the Kullback-Leibler divergence (KL) and the Correlation Coefficient (CC), are adopted to evaluate the similarity. A smaller KL divergence and a larger CC indicate greater similarity, and hence better performance. As the subject tends to look at the center region of the image, we also calculate the similarity between an image with a single Gaussian in the center and the real gaze map as a comparison.
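
The two metrics can be sketched as follows. This follows one common convention from saliency benchmarks (maps normalized to sum to one, KL taken from the ground truth to the prediction, CC as the Pearson correlation of the pixel values); the exact normalization used in the experiments is an assumption, and the function names are hypothetical.

```python
import numpy as np

def kl_divergence(pred, target, eps=1e-8):
    """KL divergence of the predicted map from the ground-truth map."""
    p = pred / (pred.sum() + eps)      # predicted distribution
    q = target / (target.sum() + eps)  # ground-truth distribution
    return float(np.sum(q * np.log(eps + q / (p + eps))))

def correlation_coefficient(pred, target):
    """Pearson correlation between the pixel values of two maps."""
    return float(np.corrcoef(pred.ravel(), target.ravel())[0, 1])

# Toy maps: a Gaussian blob as "real gaze" and a shifted blob as "estimate".
yy, xx = np.mgrid[0:16, 0:16]
real = np.exp(-((yy - 8) ** 2 + (xx - 8) ** 2) / 8.0)
shifted = np.exp(-((yy - 3) ** 2 + (xx - 3) ** 2) / 8.0)
print(kl_divergence(shifted, real), correlation_coefficient(shifted, real))
```

Identical maps give a KL divergence near zero and a CC of one; the further the estimate drifts from the real gaze map, the larger the KL and the smaller the CC.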

As shown in Table I, for all the tracks, the estimated gaze map has better similarity with real gaze map than the center Gaussian blob. For the two seen tracks (T1 and T2), KL for estimated gaze map is 75.6% smaller than the center Gaussian blob and CC for estimated gaze map is 23.6% larger than the center Gaussian blob. For the unseen tracks (T3-T5), KL for the estimated gaze map is 61.5% smaller than the Gaussian blob and CC for the estimated gaze map is 9.4% higher than the Gaussian blob on average.
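
The percentages above can be checked against Table I with a few lines of arithmetic. The sketch below assumes the relative change is computed per track and then averaged; this averaging convention is an assumption, and the results agree with the reported figures only to within the rounding of the table entries.

```python
import numpy as np

# Table I values for tracks T1..T5 (T1 and T2 seen, T3-T5 unseen).
kl_gaze = np.array([0.75, 0.63, 0.94, 0.88, 0.82])
kl_blob = np.array([3.08, 2.57, 2.63, 1.95, 2.37])
cc_gaze = np.array([0.86, 0.87, 0.83, 0.81, 0.85])
cc_blob = np.array([0.68, 0.72, 0.73, 0.80, 0.75])

def mean_relative_change(ours, baseline, tracks):
    """Mean over tracks of (ours - baseline) / baseline, in percent."""
    change = (ours[tracks] - baseline[tracks]) / baseline[tracks]
    return float(np.mean(change) * 100.0)

seen, unseen = slice(0, 2), slice(2, 5)
kl_drop_seen = -mean_relative_change(kl_gaze, kl_blob, seen)
kl_drop_unseen = -mean_relative_change(kl_gaze, kl_blob, unseen)
cc_gain_seen = mean_relative_change(cc_gaze, cc_blob, seen)
cc_gain_unseen = mean_relative_change(cc_gaze, cc_blob, unseen)
print(kl_drop_seen, kl_drop_unseen, cc_gain_seen, cc_gain_unseen)
```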

IV-C Gaze-Modulated Dropout Evaluation

IV-C1 Prediction error vs. drop probability

Fig. 3: Prediction errors of the model using gaze-modulated dropout and uniform dropout with the drop probability varying from 0.1 to 0.8.

To make a fair comparison, we scan the drop probability from 0.1 to 0.8 with a step of 0.1. The imitation networks are trained and tested in the same way (same settings). Each image was forwarded through the imitation network 50 times with dropout. The mean absolute error between the steering angles generated by the network and those of the human driver is calculated each time. Fig. 3 shows the average prediction errors of models with gaze-modulated dropout and uniform dropout. As shown in Fig. 3, the average prediction error on seen tracks stabilizes at around three degrees for both dropout methods. For the unseen tracks, uniform dropout shows a slight increase in prediction error as the drop probability increases, whereas for gaze-modulated dropout the prediction error decreases markedly as the drop probability varies from 0.2 to 0.8.

The results show that uniform dropout makes little difference, while gaze-modulated dropout has a significant effect in improving the model performance on unseen tracks.

We average the prediction error over all tracks and choose the drop probability corresponding to the smallest prediction error. Unless otherwise specified, the drop probability is set to 0.7 for gaze-modulated dropout and 0.1 for uniform dropout hereinafter.

IV-C2 Discussion on dropout implementation while testing

While averaging in the aforementioned experiments, we noticed that the steering output varies as feature units are stochastically dropped. It is neither reliable to apply gaze-modulated dropout directly to the network nor practical to forward the input multiple times. Inspired by dropout, we directly multiply the features by the mask generated from the gaze map. In practice, we followed Algorithm 1 but replaced the binary mask M with the keep probability mask K in line 7 while testing.
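
The reasoning behind this replacement can be illustrated numerically: if each entry of M is drawn as a Bernoulli variable with success probability given by K, the average of the masked features over many draws converges to the features multiplied by K. The sketch below uses hypothetical values for K and the feature map; the equality holds for the masked features themselves, while the nonlinear layers that follow make the single-pass output only an approximation of the averaged output.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical keep-probability mask K and feature map.
keep = np.array([[0.3, 0.3, 0.3],
                 [0.3, 1.0, 0.3],
                 [0.3, 0.3, 0.3]])
features = np.arange(1.0, 10.0).reshape(3, 3)

# Stochastic version: average of features * M over many binary masks M.
num_samples = 20000
masks = rng.random((num_samples, 3, 3)) < keep
sampled_mean = (features * masks).mean(axis=0)

# Deterministic version: a single pass with M replaced by K.
deterministic = features * keep

max_gap = np.abs(sampled_mean - deterministic).max()
print(f"largest gap between the two versions: {max_gap:.3f}")
```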

We compared the prediction errors of gaze-modulated dropout and uniform dropout under different test-time implementations. As introduced in [24], forwarding the input without dropout at the testing stage approximates the averaging method. We compared the prediction error averaged over multiple stochastic forward passes with that of a single pass in which the features are multiplied pixel-wise by the keep probability as an approximation of the average. The results can be seen in Fig. 4 by comparing the first two and the last two groups of bars. Both uniform dropout and gaze-modulated dropout show an increase in prediction error when the averaging effect is approximated. The increment for gaze-modulated dropout is 0.08 degrees on average, which indicates the feasibility of multiplying the features by the gaze-generated mask at test time.

IV-C3 Performance on dataset

Fig. 4: Quantitative performance on dataset. Prediction errors of models using real gaze, estimated gaze, center Gaussian blob modulated dropout and uniform dropout.

To evaluate the performance of the imitation network with modulated dropout, we first tested it on the testing dataset. Without loss of generality, we divided the tracks into seen tracks and unseen tracks and averaged the prediction error over the tracks in each group. The drop probabilities for estimated gaze and center Gaussian blob modulated dropout are set to 0.7 (the same as for real gaze). As shown in Fig. 4, for seen tracks, imitation networks with different dropout methods show similar results. However, for unseen tracks, the network with estimated gaze-modulated dropout outperforms the one with uniform dropout, with 23.5% lower prediction error. The network with real gaze-modulated dropout achieves a 28.3% decrease in the average estimation error. Both networks also perform better than the network with center Gaussian blob dropout. Averaged over all tracks, the imitation network with estimated gaze-modulated dropout achieves a 12.7% decrease and the network with real gaze-modulated dropout achieves a 15.5% decrease in the steering angle estimation error.

To summarize, both real gaze-modulated dropout and estimated gaze-modulated dropout improve the performance of the imitation networks in unseen environments and outperform the center Gaussian blob modulated dropout and uniform dropout.

IV-C4 Discussion on model uncertainty

To assess the effectiveness of the imitation network from the perspective of uncertainty, we evaluate the model uncertainty of the imitation networks. We first scan the drop probability from 0.1 to 0.9 with a step of 0.1 for both dropout methods. To match the drop probabilities of uniform dropout and gaze-modulated dropout, we recorded the drop rate of the gaze-modulated dropout for each sample and obtained an average drop probability across the testing dataset.
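
One way to compute such an average is sketched below. It uses the expected drop rate implied by each gaze map rather than the empirical rate of a sampled binary mask, and it assumes that the drop probability decreases linearly from its maximum at zero gaze to zero at the gaze peak; both choices, and the function name, are assumptions made for illustration.

```python
import numpy as np

def average_drop_probability(gaze_maps, max_drop_prob):
    """Expected drop rate of gaze-modulated dropout over a set of samples.

    gaze_maps: iterable of 2D arrays with values normalized to [0, 1].
    """
    per_sample = [np.mean(max_drop_prob * (1.0 - g)) for g in gaze_maps]
    return float(np.mean(per_sample))

# Toy example: a sparse gaze map with one attended pixel out of sixteen.
sparse = np.zeros((4, 4))
sparse[1, 2] = 1.0
print(average_drop_probability([sparse], max_drop_prob=0.7))
```

Because most of a gaze map is close to zero, the average drop probability ends up only slightly below the maximum drop probability.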

Fig. 5: Epistemic uncertainties of model using gaze-modulated dropout and uniform dropout with varied average drop probability.

As shown in Fig. 5, model uncertainty for seen scenes is clearly smaller than that for unseen scenes. We also observe that model uncertainty increases with the average drop probability for both gaze-modulated dropout and uniform dropout, on seen and unseen scenes alike. As model uncertainty reflects the familiarity of the model with the input images, we interpret these results as follows: unseen scenes are less familiar, and dropping more units causes a greater loss of detail, so the output features after dropout vary more across multiple forward passes. When comparing the model uncertainty of uniform dropout and gaze-modulated dropout, we observe no obvious difference for drop probabilities below 0.3 on unseen tracks, and the model uncertainties of the two dropout methods on seen tracks are very similar. Furthermore, for uniform dropout, the model uncertainty increases dramatically for drop probabilities larger than 0.7. The model uncertainty of the networks with gaze-modulated dropout is more stable, showing a linear growth trend as the drop probability increases.

To compare the model uncertainty of gaze-modulated dropout with that of other dropout methods, the drop probability parameter was chosen to ensure the same average drop probability. We trained models using uniform dropout with a drop probability of 0.66. In addition, we evaluated the model uncertainty for the center Gaussian blob modulated dropout and estimated gaze-modulated dropout with a drop probability of 0.7. We focused on the model uncertainties of unseen tracks. We found that gaze-modulated dropout and estimated gaze-modulated dropout had smaller model uncertainties, 2.23 and 2.60 respectively. The center Gaussian blob had a larger model uncertainty (3.88), and uniform dropout had the largest model uncertainty (4.36).

Recalling the prediction error evaluation in Section IV-C3, we found that networks with lower model uncertainty achieve higher action estimation accuracy. This is consistent with the finding in [31], where Kendall and Cipolla find that model uncertainty is positively correlated with positional error for a camera relocalization task. It indicates that gaze-modulated dropout helps the network capture the common features in unseen scenes and filter out the background area, which varies a lot over different scenes.

IV-D Closed-loop performance

We test the performance of the networks in the simulator on unseen tracks. In each episode, the agent car starts from a new location, drives along the path and overtakes other cars if needed, given the command from a human expert. The driving length of each episode is fixed. When an infraction happens (driving outside the lane or a collision), a human intervenes and drives the agent car until it is back on the road. We test two cases: tracks with other cars and tracks without other cars. When no other cars are on the road, the agent car simply follows the path. We evaluate the percentage of successful overtaking maneuvers, the average distance driven from the start without infractions, and the average distance travelled between two infractions.

The results are summarized in Table II. The proposed network with gaze-modulated dropout performs significantly better than the baseline network with uniform dropout. In terms of the overtaking success rate, the proposed method is 31.5% better than the baseline. In terms of the average distance travelled before the first infraction and between two infractions, the proposed method is 55.2% and 58.5% better than the baseline respectively, averaged over the two cases.
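
These relative improvements follow directly from the entries of Table II, as the short check below shows. The helper name `relative_gain` is introduced here for illustration.

```python
def relative_gain(ours, baseline):
    """Relative improvement of `ours` over `baseline`, in percent."""
    return (ours - baseline) / baseline * 100.0

# Success rate of overtaking (%), with cars on the track.
overtaking_gain = relative_gain(88.9, 67.6)

# Average distance from start without infractions (km): with cars, no cars.
before_gain = (relative_gain(0.55, 0.28) + relative_gain(0.57, 0.50)) / 2

# Average distance between two infractions (km): with cars, no cars.
between_gain = (relative_gain(0.61, 0.40) + relative_gain(0.79, 0.48)) / 2

print(f"{overtaking_gain:.1f} {before_gain:.1f} {between_gain:.1f}")
# prints "31.5 55.2 58.5"
```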

For a comparison of the imitation network with gaze-modulated dropout and the imitation network with uniform dropout running in the simulator, please see our supplementary video. The networks shown in the video were trained on T1-T4 and tested on T5 in the driving simulator.

Metric                                                        Method                                  With cars  No cars
Success rate of cars overtaking (%)                           With uniform dropout                    67.6       N/A
                                                              With estimated gaze-modulated dropout   88.9       N/A
Ave. distance travelled from start without infractions (km)   With uniform dropout                    0.28       0.50
                                                              With estimated gaze-modulated dropout   0.55       0.57
Ave. distance travelled between two infractions (km)          With uniform dropout                    0.40       0.48
                                                              With estimated gaze-modulated dropout   0.61       0.79

TABLE II: Quantitative performance running in testing track on the simulator. We compare the network with estimated gaze-modulated dropout and the network with uniform dropout. We measure the percentage of successful cars overtaking, average distance travelled without infractions (km) and average distance driven between infractions (km). Higher is better in all measurements.

V Conclusions

In this paper, we show that a conditional GAN can generate an accurate estimate of human gaze maps. The learned gaze network generalizes well to unseen tracks. Furthermore, we propose gaze-modulated dropout based on the generated gaze maps and incorporate it into the imitation network. We show that the use of gaze-modulated dropout significantly improves the human action estimation accuracy and decreases the model uncertainty. We demonstrate that a deep driving network can also benefit from an expert’s gaze, as novice drivers do. This work is an effort to take imitation learning to the next level by implicitly exploiting expert behavior.

Our work can be extended in several directions. First, the trained gaze network does not consider the spatio-temporal characteristics of eye gaze. Including a spatio-temporal module such as a recurrent neural network may improve the performance. Second, since eye gaze data contains a wealth of information related to human intent, we may utilize the estimated gaze map to help choose the driving maneuver, i.e. following or overtaking, which is currently done manually in our system. Finally, the proposed gaze-modulated dropout may also be applied to tasks other than driving, such as vision-based robot navigation and robot manipulation.