As self-driving vehicles gain popularity and become a more viable transportation solution for their low accident rates, it would not be surprising to see tens to hundreds of thousands of these vehicles on the road in the next 5 years. This huge potential market has attracted many companies to invest in the technologies involved in self-driving, such as deep learning, computer vision, data processing, and so on. However, these new technologies face many challenges, in the form of road hazards, changing conditions, etc. Therefore, it will be crucial to develop methods that allow technically unskilled users to teach the algorithm in a way that allows them to customize the driving experience to their needs.
Human demonstration has long been the standard training approach for the self-driving industry. One fairly new method is the use of Convolutional Neural Networks (CNNs) as function approximators in deep RL. CNNs have revolutionized pattern recognition and are especially powerful in image recognition tasks. Because this approach uses convolution kernels to scan road images at different driving time points, fewer parameters need to be trained compared to the total number of operations.
Deep RL algorithms, such as Deep Q-Network (DQN), suffer from poor initial performance compared with classic RL algorithms, since they start as a tabula rasa. This also contributes to increased training time, because these algorithms need to learn the unspecified features in addition to the policy, in contrast to using hand-engineered features. In addition, complex domains, like autonomous driving, demand a low error margin in order to avoid safety issues. These problems are non-trivial and consequential in real-world applications.
In order to use deep RL to solve real-world problems with low error rates, there is a need to increase its speed and accuracy. One method is to have humans provide demonstrations. Human demonstrations have been used in RL for a long time; however, this area has only recently garnered interest as a method that may speed training in deep RL.
One contribution of this work is its illustration of the results of applying human driving demonstration to a DQN algorithm by providing a pre-trained CNN model with later fine-tuning through human interaction. Using an interactive machine learning method will help individual self-driving vehicles to gain expertise by fine-tuning the pre-trained deep neural network, which allows a self-driving agent to gain experience in an unfamiliar region without learning from scratch. Interactive learning could also help avoid risks arising from unfamiliar road conditions and new layouts, since a human can easily guide the self-driving agent at an early stage by steering or by taking back the driver’s seat.
By including a human driver for demonstration we target three problems: (1) feature learning via human demonstration; (2) policy learning through DQN; and (3) interactive learning for novel environments. In this work, we address the first two problems, i.e. feature and policy learning, by speeding up the pre-trained CNN model with human demonstration to learn the underlying features in the hidden layers of the network [5, 6, 7]. We address the third problem by augmenting the DQN learning process to allow a human teacher to provide suggestions during the episodes used by the DQN to gather training data.
The learning environment was structured as a simulation that can easily mimic any combination of real road conditions and city layout using the Unreal Engine software. In addition, we tested augmenting the training process through interactive learning using the same environment. Simulated agents do not traditionally offer the same environmental accuracy as real vehicles, but the simulation we used included manual tuning of physical properties, e.g. rigid bodies, forces, and torques.
We tested our approach with the Deep Q-Network (DQN), both with and without human demonstration, and evaluated its performance using the AirSim car simulator in a neighborhood environment domain. Our results show an increase in the speed of the learning process with a large improvement in self-driving performance. The generality of this approach suggests that it is feasible and necessary for deep RL algorithms to incorporate human demonstration and interaction.
II Related Work
Although our work does not fall directly under the umbrella of transfer learning, it is similar to the transfer learning methods in deep learning. In the domain of deep neural networks for image classification, Yosinski et al. have shown that reusing features learned by existing models speeds up learning when the datasets are similar. In this work, we structured a human demonstration convolutional network, then used the pre-trained model as the source; the CNN model was then used to initialize the RL agent’s network.
Existing research on pre-training in RL [9, 10] has shown improvement when using a pre-trained model on similar datasets. The capability of these studies was limited by the small number of parameters learned and by the state input. In our work, we used the raw images of the simulated driving domain from the human driving demonstration as network input. It is worth noting that the pre-training model needs to learn the features of states as well as the policy.
Our approach of using supervised learning for pre-training follows earlier work in which pre-training involves learning to predict an action based on an input image and minimizing the loss between predicted and actual actions provided by human volunteers. We used a similar approach, with image frames from human demonstrators as input data and labels provided by the action taken corresponding to each image frame. Another approach to pre-training is to learn the latent features by using unsupervised learning through deep belief networks. Although the approach is different, the fundamental goal is the same: to improve learning by using pre-trained networks instead of random initialization.
Other recent work leveraging human input in deep RL includes the use of human feedback to learn a reward function and, similar to our system, pre-training of a network with human demonstration in DQN. However, these examples of pre-training (combining large-margin supervised loss and temporal difference loss) are focused on close imitation of the demonstrator. In our work, we use only the cross-entropy loss and focus on learning features.
Another study also used supervised learning for human demonstration and learned networks to initialize the policy network for RL. However, that study focused on a single domain and used a huge amount of data provided by human experts to train the supervised network. In contrast, our approach used a much smaller training dataset and illustrates the usability and feasibility of such an approach for improving the deep RL algorithm. Our study shows that a small amount of data gained from a non-expert is enough for a supervised neural network to learn important feature representations for driving from demonstration image frames; deep RL algorithms such as DQN can benefit from having the pre-trained model as a starting point.
III Deep Reinforcement Learning
Reinforcement learning (RL) problems are normally modeled as a Markov Decision Process, represented by a tuple of values $\langle S, A, P, R, \gamma \rangle$. The essence of RL is to let the agent explore an unknown environment by taking an action $a \in A$. After taking each action, the agent lands at a certain state $s \in S$. A reward, $r$, is given based on the action taken and the next state, $s'$, of the agent. The aim of the RL algorithm is to let the agent learn to maximize the expected reward, $\mathbb{E}\left[\sum_{k=0}^{\infty} \gamma^{k} r_{t+k}\right]$, for each state at time $t$. The importance of future and immediate rewards is determined by the discount factor, $\gamma \in [0, 1]$; a value close to 1 suggests the agent should treat a future reward as important, and vice versa for values close to 0.
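The role of the discount factor can be illustrated with a short worked example. The sketch below is not part of the original experiments; the reward sequence and the two discount values are hypothetical and chosen only to show how a delayed reward is weighted.

```python
def discounted_return(rewards, gamma):
    """Sum of rewards, each weighted by gamma raised to its delay in steps."""
    total = 0.0
    # Work backwards so every earlier step applies one more factor of gamma.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


# Hypothetical episode: the only reward arrives two steps in the future.
rewards = [0.0, 0.0, 1.0]
far_sighted = discounted_return(rewards, gamma=0.99)   # close to 1
short_sighted = discounted_return(rewards, gamma=0.1)  # close to 0
print(far_sighted, short_sighted)
```

With a discount factor near 1 the delayed reward keeps almost all of its value, while with a discount factor near 0 it is nearly ignored.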
III-A Deep Q-Network
The Deep Q-Network (DQN) algorithm is the rising star of the deep RL domain thanks to its ability to generalize and its flexibility in solving problems in different domains. The first implementation of DQN was capable of learning to solve 49 Atari games directly from the screen pixels by combining Q-learning with a deep convolutional neural network.
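A classic Q-learning algorithm learns the value of state-action pairs instead of the value of states:

$$Q^{*}(s, a) = \mathbb{E}\left[ r + \gamma \max_{a'} Q^{*}(s', a') \mid s, a \right] \qquad (1)$$

and uses the expected discounted reward from performing action $a$ in state $s$. The optimal policy is then calculated by maximizing the Q value, $\pi^{*}(s) = \arg\max_{a} Q^{*}(s, a)$.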
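As an illustration of learning state-action values, the following is a minimal tabular Q-learning sketch. It uses the standard incremental update rule; the step size `alpha`, the state names, and the reward are assumptions made for the example and do not come from our experiments.

```python
from collections import defaultdict


def q_learning_update(q, state, action, reward, next_state, actions,
                      alpha=0.1, gamma=0.9):
    """Move Q(state, action) toward reward + gamma * max_a' Q(next_state, a')."""
    best_next = max(q[(next_state, a)] for a in actions)
    td_target = reward + gamma * best_next
    q[(state, action)] += alpha * (td_target - q[(state, action)])
    return q[(state, action)]


def greedy_action(q, state, actions):
    """Greedy policy: pick the action with the largest Q value in this state."""
    return max(actions, key=lambda a: q[(state, a)])


actions = ["forward", "left", "right"]
q = defaultdict(float)  # unseen state-action pairs start at 0
# Hypothetical experience: turning left in "s0" always yields reward 1.
for _ in range(50):
    q_learning_update(q, "s0", "left", 1.0, "s1", actions)
best = greedy_action(q, "s0", actions)
print(best, q[("s0", "left")])
```

After repeated updates the value of the rewarded pair approaches 1, and the greedy policy selects that action.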
When in a domain with a state space that is fairly large or continuous (e.g. Atari games or driving), it is not feasible to directly compute the Q value. To allow the use of the Q-learning algorithm in a more general state space, regardless of size and continuity, the DQN algorithm uses a convolutional neural network as a function approximator to estimate the Q function by $Q(s, a; \theta) \approx Q^{*}(s, a)$, where $\theta$ is the network’s weight parameters. At each iteration, $i$, the DQN is trained to minimize the mean-square error (MSE) between the Q-network and the target $y = r + \gamma \max_{a'} Q(s', a'; \theta_{i}^{-})$, where $\theta_{i}^{-}$ is the network’s weight from the previous iteration. The loss function in this approach can be expressed as

$$L_{i}(\theta_{i}) = \mathbb{E}_{(s, a, r, s')}\left[ \left( y - Q(s, a; \theta_{i}) \right)^{2} \right] \qquad (2)$$

where $(s, a, r, s')$ are state-action samples drawn from experience replay memory with a mini-batch of size 32. The reward is calculated using reward clipping, which scales the scores by setting all positive rewards to 1, all negative rewards to -1, and leaving unchanged (zero) rewards at 0. The use of experience replay memory, a target network, and reward clipping helps to stabilize the learning. To ensure the agent obtains sufficient exploration of the state space, DQN also uses an $\epsilon$-greedy action policy.
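The pieces described in this subsection (reward clipping, the target computed from the previous weights, the MSE loss, and the greedy-with-exploration action choice) can be sketched in plain Python. This is an illustrative sketch and not the code used in our experiments: Q values are passed in as plain lists in place of network outputs, and the handling of terminal states is a common convention that we assume here.

```python
import random


def clip_reward(reward):
    """Map any positive reward to 1, any negative reward to -1, and zero to 0."""
    return (reward > 0) - (reward < 0)


def td_target(reward, next_q_values, gamma, terminal):
    """Target y = r + gamma * max_a' Q(s', a') computed from the target network.

    Assumption: no bootstrapping from the next state once the episode has ended.
    """
    if terminal:
        return float(reward)
    return reward + gamma * max(next_q_values)


def mse_loss(targets, predictions):
    """Mean-square error between targets and the Q-network's predictions."""
    return sum((y - q) ** 2 for y, q in zip(targets, predictions)) / len(targets)


def epsilon_greedy(q_values, epsilon, rng=random):
    """With probability epsilon pick a random action, otherwise the best one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda i: q_values[i])


# Hypothetical numbers: raw reward 7.5 and target-network Q values for s'.
y = td_target(clip_reward(7.5), [0.2, 0.5, 0.1], gamma=0.99, terminal=False)
loss = mse_loss([y], [1.0])
action = epsilon_greedy([0.2, 0.5, 0.1], epsilon=0.0)
print(y, loss, action)
```

In a full agent the targets would be computed for a whole mini-batch sampled from the replay memory, and the loss would be minimized with respect to the Q-network’s weights.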
III-B Pre-Training Networks for Deep Reinforcement Learning
Deep RL generally needs to balance two tasks at the same time: (1) feature learning and (2) policy learning. Even though deep RL has already been quite successful at performing both tasks in parallel, ensuring model convergence and performance requires a long training time and a large amount of data. To address the feature learning task, we believe a supervised CNN model trained on human demonstration data would dramatically improve the speed and quality of learning, which in turn frees more resources for policy learning. In our work, deep RL learns feature representations by pre-training its network using human demonstrations from non-experts; we refer to this approach as a pre-trained model.
The pre-trained model method is similar to Bojarski’s End-to-End approach, using a deep CNN to learn the feature space. We also applied data augmentation to increase the sample size by adding artificial shifts and rotations. Unlike Bojarski’s work, our approach relies only on the center camera, and we also changed the input dimension, convolution filter size, and network output dimension to fit our approach. We constructed the network as a multi-class classification model; we assumed that humans could provide corrective actions (labels) while driving.
The model was parametrized by using an MSE loss function and an Adam optimizer with a learning rate of 0.0001. The training library is Keras with a Tensorflow backend. We used a batch size of 128. The CNN architecture followed the same structure, with different parameters for its input dimension, filter size, and output dimension. It included a normalization layer and five convolution layers, each with a dropout layer. These were followed by five flattened layers, each with a dropout layer. The activation function was ELU and the regularization function was L2. The original network had a single output for each valid action, which was not appropriate for our work. Instead, we increased the output dimension to three: throttle, steering, and brake. The weights and biases learned from the pre-trained CNN were used to initialize the DQN network.
We were handling raw image data, so the first normalization layer was extremely important, since normalization helps a model generalize faster (due to the different lighting captured by the camera). We also applied normalization to the parameters passed down the network in all fully connected layers. This normalization prevented learned parameters from either vanishing or exploding. The network had roughly 27 million connections and 250 parameters.
III-C Interactive Deep Q-Network
As part of our work, we introduce the concept of human suggestion to the original DQN paradigm, which we call Interactive Deep Q-Network (IDQN). Recall that DQNs learn the optimal policy by first exploring the state space to learn rewards associated with visual input into the CNN used by the DQN. The visual input most often takes the form of a camera view, either of the state space or some small portion thereof. The Replay Memory stores tuples of these events, which take the form $(s, a, r, s')$.
In order for an agent to discover the desired policy, the reward function must be properly set up such that the DQN can converge with discovered rewards. This means the reward policy can be tedious to build and must be reconfigured if some additional actions are desired. To solve this, we propose giving a human the ability to add suggestions to the agent in the form of adding extra tuples to the Replay Memory with elevated reward values for future training.
To accomplish this, a visual input system was designed to allow a trainer to suggest, either through a keyboard or GUI buttons, more appropriate actions the agent should take at a given point in time. When the trainer signals a suggestion, the last trained frame is re-added to the Replay Memory (but not to the History, which is used for inference only), where it will later be sampled to continue training the agent.
Over time, the trainer’s input is sampled during training and, because of its elevated reward, shapes the agent’s policy to incorporate the preferences the trainer is attempting to convey. The benefit is twofold: the agent learns the policy that maximizes the reward function at a faster rate, and it can learn more complex policies that may be known only to the trainer providing the suggestions.
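The suggestion mechanism can be sketched as a small extension of a replay memory. The code below is an illustrative sketch rather than our implementation: the class and function names, the tuple layout (state, action, reward, next state), and the value of the elevated reward are all assumptions made for the example.

```python
import random
from collections import deque

SUGGESTION_REWARD = 1.0  # hypothetical elevated reward for trainer suggestions


class ReplayMemory:
    """Fixed-capacity store of event tuples; the oldest events are dropped first."""

    def __init__(self, capacity):
        self.events = deque(maxlen=capacity)

    def add(self, state, action, reward, next_state):
        self.events.append((state, action, reward, next_state))

    def sample(self, batch_size, rng=random):
        """Draw a mini-batch uniformly at random, without replacement."""
        return rng.sample(list(self.events), min(batch_size, len(self.events)))


def add_suggestion(memory, last_state, suggested_action, next_state,
                   reward=SUGGESTION_REWARD):
    """Re-add the last frame, paired with the trainer's action and elevated reward."""
    memory.add(last_state, suggested_action, reward, next_state)


memory = ReplayMemory(capacity=3)
memory.add("frame0", "forward", 0.1, "frame1")         # the agent's own step
add_suggestion(memory, "frame0", "left", "frame1")     # the trainer's suggestion
batch = memory.sample(batch_size=2)
print(batch)
```

Because suggested events live in the same memory as ordinary events, no change to the sampling or training code is needed; the suggestions simply appear in later mini-batches.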
IV Experiment Design
We use AirSim (https://github.com/Microsoft/AirSim), an open source simulator based on Unreal Engine, as the autonomous vehicle agent (Figure 1). The deep reinforcement learning DQN and supervised learning CNN are both implemented using Tensorflow; the rest of the platform consisted of:
- Windows 10 Pro x64
- AirSim Neighborhood Binary
- tensorflow or tensorflow-gpu
- Microsoft Cognitive Toolkit
- Python Packages: (Install using pip3 install)
- CUDA 8.0 (GPU only) and cuDNN 6 (GPU only)
IV-A Supervised Convolutional Neural Network
Due to limited computational resources and time constraints, we used only four datasets from human driving demonstrations. Regardless, we still achieved loss values of 0.1 and 0.3 for the training and validation data, respectively. The images used are from the center scene camera.
After collecting more than 1500 image frames from human demonstrations, we first constructed augmentations of the images (Figure 2). Since we assumed the CNN model would only focus on the lower part of the image, the road, the images were cropped accordingly. To mimic real road conditions, we also included artificial shifts and rotations to help the network learn from poor position or orientation data. The magnitude of these perturbations was randomly drawn from a normal distribution with a mean of zero and a standard deviation twice the standard deviation reported in Bojarski’s End-to-End approach paper.
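The cropping and shifting steps can be sketched with NumPy as follows. This is a minimal sketch under stated assumptions: the fraction of the frame that is kept, the frame size, and the standard deviation in pixels are hypothetical values, rotations are omitted, and any adjustment of the action label after a shift is not shown.

```python
import numpy as np


def crop_road(image, keep_fraction=0.5):
    """Keep only the lower part of the frame, where the road is expected."""
    height = image.shape[0]
    top = int(round(height * (1.0 - keep_fraction)))
    return image[top:, ...]


def shift_horizontal(image, shift):
    """Shift the frame sideways by `shift` pixels, filling the gap with zeros."""
    shifted = np.zeros_like(image)
    if shift > 0:
        shifted[:, shift:] = image[:, :-shift]
    elif shift < 0:
        shifted[:, :shift] = image[:, -shift:]
    else:
        shifted[...] = image
    return shifted


def sample_shift(sigma_px, rng):
    """Draw a pixel offset from a zero-mean normal distribution."""
    return int(round(rng.normal(loc=0.0, scale=sigma_px)))


rng = np.random.default_rng(0)
# Hypothetical 144 x 256 RGB frame standing in for a center-camera image.
frame = rng.integers(0, 256, size=(144, 256, 3), dtype=np.uint8)
road = crop_road(frame)
augmented = shift_horizontal(road, sample_shift(sigma_px=4.0, rng=rng))
print(road.shape, augmented.shape)
```

Each demonstration frame can be passed through these functions several times with different random offsets to enlarge the training set.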
As mentioned earlier, the CNN model has five convolution layers and four fully connected layers. We applied a dropout layer after each layer to randomly remove a certain percentage of the learned parameters. For convolutional layers, the dropout rates were all 0.5, and for the fully connected layers the rates were 0.5, 0.4, 0.25, and 0. In this work, we also used an exponential linear unit as the activation function to introduce non-linearity. The batch sizes tested were 128 and 64; the batch size did not appear to have a large effect on model performance. Another difference from the original paper is that instead of using three cameras (left, right, and center) we used only a center camera, because it is the only option AirSim offers.
IV-B Deep Q-Network
For the Deep Q-Network (DQN) portion of our experiment, the original network used in Mnih’s 2015 paper was used, sourced from an existing Python implementation included as an example in the AirSim GitHub repository. This code implemented the following DQN components:
- Action Model - CNN model used in action inference which is trained frequently (after 200 steps, then every 4 steps)
- Target Model - CNN model used in loss calculations which is cloned from the Action Model occasionally (every 1000 steps)
- Replay Memory - Holds up to 500,000 event tuples previously described which can provide mini-batches for training the Action Model
- History - Holds N recent visual inputs for historical sequence inputs
- Linear Epsilon Annealing Explorer - Scales the exploration rate of the agent based on a maximum random chance (100%) that is phased down to a minimum random chance (5%) over a given number of steps (5,000)
- DQN Agent - Wrapper class that combines the other components with functions to pick an action based on the CNN policy approximator and exploration policy, save observations based on actions taken, and retrain the model based on sampled mini-batches from the Replay Memory.
The model is trained with a learning rate of 0.001, momentum of 0.95, and mini-batch size of 32 events.
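The exploration and update schedules listed above can be sketched as follows. The numbers are those stated in the component list; the class and function names are hypothetical, and the exact step at which training and cloning first occur is our reading of the description rather than a statement about the AirSim example code.

```python
class LinearEpsilonAnnealingExplorer:
    """Linearly reduce the random-action chance from `start` to `end`."""

    def __init__(self, start=1.0, end=0.05, steps=5000):
        self.start = start
        self.end = end
        self.steps = steps

    def epsilon(self, step):
        # Clamp so the rate stays at `end` once annealing has finished.
        fraction = min(max(step, 0), self.steps) / self.steps
        return self.start + fraction * (self.end - self.start)


def should_train(step, warmup=200, interval=4):
    """Train the Action Model after the warm-up, then every `interval` steps."""
    return step >= warmup and (step - warmup) % interval == 0


def should_sync_target(step, interval=1000):
    """Clone the Action Model into the Target Model every `interval` steps."""
    return step > 0 and step % interval == 0


explorer = LinearEpsilonAnnealingExplorer()
for step in (0, 2500, 5000, 10000):
    print(step, round(explorer.epsilon(step), 3),
          should_train(step), should_sync_target(step))
```

An agent loop would call `explorer.epsilon(step)` before choosing each action and consult the two schedule functions after saving each observation.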
For this experiment, the input consisted of 84 x 84 images from the simulated front camera, converted to grayscale. The action space consisted of: forward (no turning), left (-0.25 steering), and right (0.25 steering) all at a constant acceleration of 0.35. The reward function, which can be seen in Equation 3, measures the distance from the center of the street, angle from the centerline following the street, speed traveling, whether the car has left the boundaries of the street, and whether a collision has occurred.
The last two measurements are considered catastrophic and result in an end to the episode and a large negative reward.
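The discrete action space and the treatment of catastrophic events can be sketched as below. The steering and throttle values are those given above; the action index order, the function names, and the size of the penalty are hypothetical, and the shaped reward itself is passed in as a plain number rather than computed.

```python
THROTTLE = 0.35  # constant acceleration applied with every action

# Assumption: the index order is illustrative; the text only lists the actions.
ACTION_NAMES = {0: "forward", 1: "left", 2: "right"}
STEERING_BY_ACTION = {0: 0.0, 1: -0.25, 2: 0.25}


def action_to_controls(action_index):
    """Translate a discrete action index into throttle and steering values."""
    return {"throttle": THROTTLE, "steering": STEERING_BY_ACTION[action_index]}


def finish_step(shaped_reward, left_street, collided, penalty=-10.0):
    """Return (reward, episode_done); catastrophic events end the episode.

    `penalty` is a hypothetical stand-in for the large negative reward.
    """
    if left_street or collided:
        return penalty, True
    return shaped_reward, False


print(ACTION_NAMES[1], action_to_controls(1))
print(finish_step(0.7, left_street=False, collided=True))
```

Keeping the catastrophic check separate from the shaped reward makes it easy to change the penalty without touching the rest of the reward computation.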
IV-C Human Suggestion for Deep Q-Network
To let trainers add suggestions to the agent during training, we developed a GUI. A representation of the pipeline that combines this GUI with the existing DQN agent and simulation system is shown in Figure 3. The training workflow follows the sequence:
- Run AirSim simulator (choose Car mode)
- Run the app.py Python script
- Start the DQN Agent by pressing the space bar
- The agent continues to explore the simulation space, making actions and learning the policy
- (Suggestion Only) The trainer can use either the UP/LEFT/RIGHT arrow keys or the GUI buttons to suggest FORWARD, LEFT, or RIGHT actions respectively
IV-D Evaluation Criteria
MSE loss was used as one criterion for model learning from human demonstration. As with most supervised learning approaches, the measurements of performance are a function of the difference between the predicted action and the human-demonstrated action. Another measurement we included is the accident rate: fewer accidents indicated better learning.
For IDQN, we looked at how the reward improves over the progression through episodes. Specifically, we measured the mean and standard deviation of the reward per episode, total reward gathered per episode, and the total number of steps taken per episode. We deem success as an improvement in the mean and standard deviation, which can indicate discovery of a better policy, as well as an increase in total reward per episode, which, if seen in concert with an increase in steps taken per episode, can indicate the episode lasted longer, meaning the car successfully traveled further down the street.
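The per-episode measurements described above can be computed with a few lines of Python. This is an illustrative sketch: the function name and the reward values are hypothetical, and the use of the population standard deviation is an assumption, since the text does not say which estimator was used.

```python
from statistics import mean, pstdev


def summarize_episode(step_rewards):
    """Summarize one episode from its list of per-step rewards."""
    return {
        "steps": len(step_rewards),
        "total_reward": sum(step_rewards),
        "mean_reward": mean(step_rewards),
        "std_reward": pstdev(step_rewards),  # assumption: population std
    }


# Hypothetical episodes: a short erratic one and a longer, steadier one.
early = summarize_episode([0.0, 1.0, 0.0, 1.0])
late = summarize_episode([0.8, 0.9, 0.8, 0.9, 0.8, 0.9])
print(early)
print(late)
```

Comparing such summaries across episodes shows whether the standard deviation tightens while total reward and step count grow.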
In the human demonstration part of the work, we achieved an average training loss of 0.1 and validation loss of 0.3, quite an improvement considering only 4 demonstration datasets were collected. For the CNN, we used an exponential linear unit (ELU) as the activation function. This not only helped avoid a vanishing gradient via the identity for positive values, it also improved the learning characteristic by including negative values, which allowed it to push the mean unit activation closer to zero. In essence, the ELU improved the network and sped up training.
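The two properties of the ELU mentioned above can be seen directly from its definition. The sketch below uses the standard form of the function with scale parameter `alpha` equal to 1; that value is an assumption, since the setting used in our network is not stated.

```python
import math


def elu(x, alpha=1.0):
    """Identity for positive inputs, smooth saturation toward -alpha otherwise."""
    return x if x > 0 else alpha * (math.exp(x) - 1.0)


def elu_grad(x, alpha=1.0):
    """Derivative of the ELU: exactly 1 for positive inputs."""
    return 1.0 if x > 0 else alpha * math.exp(x)


for x in (-3.0, -1.0, 0.0, 1.0, 3.0):
    print(x, round(elu(x), 4), round(elu_grad(x), 4))
```

The gradient of 1 on the positive side avoids shrinking the error signal, and the negative outputs let the mean activation move toward zero.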
One reason we did not achieve a loss of less than 0.001, as reported in the original Nvidia End-to-End learning paper, is the extra predictions we included. Instead of a single output, our CNN model returned three results: throttle, brake, and steering angle. We believe a better metric might be the loss divided by three, since the three outputs all contributed to the loss (presumably not equally).
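To make the suggested metric concrete, the sketch below computes one MSE per output and then the combined loss divided by three. It assumes the combined loss is the sum of the per-output errors, which the text does not state; the predictions and targets are hypothetical values, ordered as throttle, steering, and brake.

```python
def per_output_mse(predictions, targets):
    """Return one MSE per output column, averaged over the samples."""
    n_samples = len(predictions)
    n_outputs = len(predictions[0])
    return [
        sum((p[k] - t[k]) ** 2 for p, t in zip(predictions, targets)) / n_samples
        for k in range(n_outputs)
    ]


# Hypothetical batch of two samples: (throttle, steering, brake).
predictions = [[0.5, 0.0, 0.1], [0.3, 0.1, 0.0]]
targets = [[0.4, 0.0, 0.0], [0.3, 0.0, 0.0]]

mse_each = per_output_mse(predictions, targets)
summed_loss = sum(mse_each)              # assumption: outputs are summed
loss_per_output = summed_loss / len(mse_each)
print(mse_each, summed_loss, loss_per_output)
```

Reporting the per-output values alongside the average also shows which of the three predictions contributes most to the loss.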
Results from the IDQN experiment, shown in Figure 4, show some improvement in the form of a tightening of the standard deviation in mean reward per episode. We take this to mean that the policy has converged on a more optimal approximation of the reward policy. In addition, we see that the total reward and total steps have both seen a measured increase, which, as noted previously, we take to mean the agent can travel further in the episode and thus gather more reward.
More generally, we saw that the agent was able to learn more complex policy approximations. This was shown by the agent learning, through suggestions given during training, to make a left-hand turn at an intersection. In comparison, the original DQN agent failed to learn how to navigate the intersection and simply ran into the fence on the opposite side of the street. This resulted in the much larger total rewards and step counts seen in three of the last four episodes of the IDQN agent in Figure 4.
We were not able to include the pre-trained model in the Deep Q-Network, due to the complexity of the simulator environment and time constraints. However, from the results for the individual performance of the two models, we believe our approach is feasible.
Human demonstrations are partially responsible for the success of our approach. In future work, it will be important to investigate how the demonstrator’s performance and the amount of demonstration data affect the benefits of pre-training the network. Human demonstration End-to-End learning could also be used as a comparison candidate for our approach. Although the suggested amount of demonstration data is more than 100 driving hours, our work leads us to believe that a much smaller sample could achieve similar results. In the pre-trained CNN model, we ignore the information collected from the depth and segmentation cameras. Several studies have shown that this information could further improve self-driving agents.
We also show that our IDQN approach is successful in increasing the speed and accuracy of training over the original DQN implementation. Additionally, the IDQN approach allows more complex policy approximations to be learned without rebuilding a more complicated reward function to instruct the agent. In the future, we hope to take some direction from the work by Hausknecht on Deep Recurrent Q-Learning, which adds an LSTM layer to extract knowledge from the sequential images used for input. This has the added benefit of learning policy in partially observable MDPs, like the forward-facing camera used for input in the experiments conducted as part of our work. We also plan to investigate methods to limit the ability of the trainer to “over-train” by providing too many suggestions without seeing the appropriate feedback in the form of better results. This could come from either better instructions or a better ordering of training from user suggestions, for better transparency.
-  M. Bojarski, “End to end learning for self-driving cars,” arXiv:1604.07316v1, 2016.
-  R. S. Sutton and A. G. Barto, “Reinforcement learning: An introduction,” volume 1. MIT press Cambridge, 1998.
-  B. D. Argall, S. Chernova, M. Veloso, and B. Browning, “A survey of robot learning from demonstration,” Robotics and Autonomous Systems 57(5):469–483, 2009.
-  V. Kurin, S. Nowozin, K. Hofmann, L. Beyer, and B. Leibe, “The Atari grand challenge dataset,” arXiv preprint arXiv:1705.10998, 2017.
-  D. Erhan, P.-A. Manzagol, Y. Bengio, S. Bengio, and P. Vincent, “The difficulty of training deep architectures and the effect of unsupervised pre-training,” Twelfth International Conference on Artificial Intelligence and Statistics (AISTATS), 153–160, 2009.
-  D. Erhan, Y. Bengio, A. Courville, P.-A. Manzagol, P. Vincent, and S. Bengio, “Why does unsupervised pre-training help deep learning?” J. Mach. Learn. Res. 11:625–660, 2010.
-  J. Yosinski, J. Clune, Y. Bengio, and H. Lipson, “How transferable are features in deep neural networks?” Advances in Neural Information Processing Systems, 3320–3328, 2014.
-  S. Shah, D. Dey, C. Lovett, and A. Kapoor, “Airsim: High-fidelity visual and physical simulation for autonomous vehicles,” in Field and Service Robotics. Springer, 2018, pp. 621–635.
-  F. Abtahi and I. Fasel, “Deep belief nets as function approximators for reinforcement learning,” RBM 2:h3, 2011.
-  C. W. Anderson, M. Lee, and D. L. Elliott, “Faster reinforcement learning after pretraining deep networks to predict state dynamics,” International Joint Conference on Neural Networks (IJCNN), 1–7. IEEE, 2015.
-  P. Christiano, J. Leike, T. B. Brown, M. Martic, S. Legg, and D. Amodei, “Deep reinforcement learning from human preferences,” arXiv preprint arXiv:1706.03741, 2017.
-  T. Hester, M. Vecerik, O. Pietquin, M. Lanctot, T. Schaul, B. Piot, A. Sendonaris, G. Dulac-Arnold, I. Osband, J. Agapiou, et al., “Learning from demonstrations for real world reinforcement learning,” arXiv preprint arXiv:1704.03732, 2017.
-  D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. van den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, et al., “Mastering the game of Go with deep neural networks and tree search,” Nature 529(7587):484–489, 2016.
-  V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski et al., “Human-level control through deep reinforcement learning,” Nature, vol. 518, no. 7540, pp. 529–533, 2015.
-  C. J. Watkins and P. Dayan, “Q-learning,” Machine Learning 8(3-4):279–292, 1992.
-  G. V. de la Cruz Jr., Y. Du, and M. E. Taylor, “Pre-training neural networks with human demonstrations for deep reinforcement learning,” arXiv:1709.04083v1, 2017.
-  D. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv:1412.6980, 2014.
-  M. Hausknecht and P. Stone, “Deep recurrent q-learning for partially observable mdps,” CoRR, abs/1507.06527, 2015.