Recently, research in brain science has attracted growing public attention. Given the rapid progress in brain imaging technologies and in molecular and cell biology, much progress has been made in understanding the brain at both the macroscopic and microscopic levels. Currently, the human brain is the only truly general intelligent system that can cope with diverse cognitive functions at extremely low energy consumption. Learning from the information processing mechanisms of the brain is therefore key to building stronger and more efficient machine intelligence. In recent years, a number of bio-inspired intelligent methods have emerged that differ fundamentally from classical mathematical programming principles. Bio-inspired intelligence offers strong robustness, high efficiency, and a well-suited distributed computing mechanism; moreover, it is easy to combine with other methods.
The mammalian brain has multiple learning subsystems. Niv et al. categorized the major learning components into four classes: the neocortex, the hippocampal formation (explicit memory storage system), the cerebellum (adaptive control system), and the basal ganglia (reinforcement learning). Among these components, reinforcement learning is particularly attractive to researchers. Converging evidence now links reinforcement learning to specific neural substrates, assigning them precise computational roles. Most notably, much evidence suggests that the neuromodulator dopamine provides basal ganglia target structures with phasic signals that convey a reward prediction error, which can influence learning and action selection, particularly in stimulus-driven habitual instrumental behaviors. Hence, many efforts have been made to investigate the capability of bio-inspired reinforcement learning by applying it to artificial-intelligence-related tasks [4, 5, 6, 7, 8, 9].
In recent years, deep reinforcement learning has contributed to many of the spectacular successes of artificial intelligence [10, 11]. After the initial success of the deep Q network (DQN), a variety of improved models were published in succession. Building on these discoveries, Mnih et al. proposed the Nature DQN in 2015, introducing the replay memory mechanism to break the strong correlations between samples. Mnih et al. later proposed a deep reinforcement learning approach in which the parameters of the deep network are updated by multiple asynchronous copies of the agent acting in the environment. Van Hasselt et al. suggested the Double DQN to eliminate overestimation by adding a target Q network independent from the current Q network; it was shown to scale to large function approximators. Newer techniques include deep deterministic policy gradients, which map an observation directly to an action and can operate over continuous action spaces. Schaul et al. suggested prioritized replay, adding priorities to the replay memory to relieve the problems of sparse rewards and slow convergence. Hausknecht and Stone replaced the last fully connected layer in the network with an LSTM layer. Foerster et al. proposed using multiple agents to describe a state distributively. However, most previous studies, such as those on Atari games, focused on simple environments and action spaces. Because of this limitation, there is an urgent need to further improve the capability of deep reinforcement learning in more challenging and complex scenarios such as the simulated 3D driving control problem.
One deep-learning-based study examined the simulated self-driving game. However, three problems existed in its implementation. First, the authors created a handcrafted dataset. Obviously, no ideal benchmark dataset can include all the bad situations encountered by the driving agent during training; at best, it can include the best behavior that the driving agent should exhibit at each step. The driving agent was reported to perform well when it held a good position in the driveway, but its behavior deteriorated rapidly once it deviated from the driveway. Such behavior indicates a mismatched distribution and instability, even though corrective measures were taken on the dataset. Second, they trained their network in a supervised way. As there are many possible scenarios, manually tackling all possible cases with supervised learning will likely yield an overly simplistic policy. Third, their experiments were built on ideal conditions; for example, they assumed that the brakes could be ignored, whereas our experiments take braking into consideration. Moreover, to support autonomous capabilities, a robotic driving agent should adopt human driving negotiation skills when overtaking, halting, braking, taking left and right turns, and pushing ahead on unstructured roadways. It follows naturally that a trial-and-error style of learning is better suited to this simulated self-driving game. Hence, the bio-inspired reinforcement learning method in this study is a more suitable way for the driving agent to learn how to make decisions.
In our study, we propose a deep recurrent reinforcement learning network to solve simulated self-driving problems. Rather than creating a handcrafted dataset and training in a supervised way, we adopted a bio-inspired trial-and-error technique for the driving agent to learn how to make decisions. Furthermore, this paper provides three innovations. First, intermediate game parameters are completely abandoned: the driving agent relies only on raw screen outputs. Second, a weighting layer is introduced into the network architecture to strengthen the intermediate effect. Third, a simple but effective experience recall mechanism is applied to deal with sparse rewards in lengthy episodes.
Deep Q-learning is used to help AI agents operate in environments with discrete action spaces. Building on Deep Q-learning, we propose a modified DRQN model to infer the full state in partially observable environments.
II-A Deep Q-learning
Reinforcement learning concerns learning policies for an agent interacting with an unknown environment. In each step, the agent observes the current state $s$ of the environment, selects an action according to a policy $\pi$, and observes a reward signal $r$. Given the current state and a set of available actions, the main aim of the DQN is to approximate the maximum sum of discounted rewards. The value of a given state-action pair, or 'Q-value', is given by:

$$Q^{\pi}(s, a) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^{t} r_{t} \,\middle|\, s_{0}=s,\ a_{0}=a,\ \pi\right]$$
According to the Bellman equation, the optimal Q-value can be approximated by combining the reward obtained with the current state-action pair and the highest Q-value at the next state $s'$ under the best action $a'$:

$$Q^{*}(s, a) = \mathbb{E}_{s'}\left[r + \gamma \max_{a'} Q^{*}(s', a') \,\middle|\, s, a\right]$$
In practice, we often use the form involving an iterative process:

$$Q_{i+1}(s, a) = \mathbb{E}_{s'}\left[r + \gamma \max_{a'} Q_{i}(s', a')\right]$$
Actions are typically chosen following an $\epsilon$-greedy exploration policy, where $\epsilon$ takes a value between 0.0 and 1.0. To encourage the agent to explore the environment, $\epsilon$ was set to 1.0 at the beginning. During training, the value decayed gradually as experience accumulated, after which the agent could exploit that experience to achieve the task.
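The annealed ε-greedy policy above can be sketched as follows; the decay horizon of 10000 steps is an illustrative assumption, not a value reported in this paper.

```python
import random

def linear_epsilon(step, start=1.0, end=0.1, decay_steps=10000):
    """Linearly anneal epsilon from `start` to `end` over `decay_steps`
    steps, then hold it at `end` (the horizon here is an assumption)."""
    if step >= decay_steps:
        return end
    return start + (end - start) * step / decay_steps

def epsilon_greedy(q_values, epsilon, rng=random):
    """With probability epsilon pick a random action; otherwise the argmax."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])
```

At evaluation time, setting `epsilon=0.0` makes the policy purely greedy, matching the evaluation protocol described later.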
The network with parameters $\theta$ approximates the expected Q-value, which leads to the loss function:

$$L_{i}(\theta_{i}) = \mathbb{E}\left[\left(y_{i} - Q(s, a; \theta_{i})\right)^{2}\right], \quad y_{i} = r + \gamma \max_{a'} Q(s', a'; \theta^{-})$$

where $\theta^{-}$ denotes the parameters of the target network.
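For a single transition, the target and loss above amount to a short computation; this sketch uses the paper's discount factor of 0.9 as the default.

```python
def td_target(reward, next_q_values, gamma=0.9, terminal=False):
    """Bellman target y = r + gamma * max_a' Q(s', a'); terminal states
    contribute only the immediate reward."""
    if terminal:
        return reward
    return reward + gamma * max(next_q_values)

def td_loss(q_sa, target):
    """Squared TD error (y - Q(s, a))^2 for a single transition."""
    return (target - q_sa) ** 2
```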
II-B Recurrent Reinforcement Learning
For certain games that are three-dimensional and partially observable, the DQN lacks the capacity to solve the problem. In partially observable environments, the agent only receives an observation of the current environment, which is usually insufficient to infer the full state of the system; the real state is a combination of the current observation and an unbounded history of past states. Hence, we adopted the DRQN model on top of the DQN to deal with such conditions (see Figure 1). In the DRQN model, the last fully connected layer is replaced by an LSTM in order to retain past information. Figure 2 shows the sequential updates in the recurrent network. An additional input $h_{t-1}$, standing for the previous information, is added to the recurrent model. The output of the LSTM, which combines the current observation $o_{t}$ and the history information $h_{t-1}$, is used to approximate the Q-value. The history information is updated and passed through the hidden state to the network at the next time step:

$$Q(o_{t}, h_{t-1}, a) \approx Q(s_{t}, a), \quad h_{t} = \mathrm{LSTM}(o_{t}, h_{t-1})$$
II-C Network Architecture
In the beginning, we used the baseline DRQN model developed by CMU for VizDoom 2016. However, we obtained unsatisfactory results with that model, so we made three improvements in our modified version. The whole architecture is shown in Figure 3. First, the network was built on top of NVIDIA's autopilot model. To reduce overfitting, the original model was modified by adding several batch normalization layers, and we used a stronger four-layer CNN for feature extraction. The input size was set to 320*240. The first convolutional layer contained 32 kernels of size 8*8 with a stride of 4; the second contained 64 kernels of size 4*4 with a stride of 4; the third contained 128 kernels of size 3*3 with a stride of 1; and the last contained 256 kernels of size 3*3 with a stride of 1. The activation functions were all ReLU, and the pooling layers were all of size 2*2. Second, we abandoned the fully connected layer before the LSTM layer in the original DRQN model and fed the LSTM directly with the high-level features. The number of units in the LSTM was set to 900. Third, the subsequent structure was divided into two branches serving different purposes: one was mapped to the set of possible actions, and the other produced a set of scalar values. The final action value function was computed from both. We introduce their functions respectively below.
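The layer sizes above can be assembled into a minimal PyTorch sketch. The kernel counts, kernel sizes, strides, and LSTM width come from the text; the placement of batch normalization, the omission of pooling/padding, and the elementwise combination of the two heads are assumptions, not the authors' released code.

```python
import torch
import torch.nn as nn

class ModifiedDRQN(nn.Module):
    """Sketch of the modified DRQN: four convolutional layers with batch
    normalization, an LSTM fed directly with the flattened high-level
    features, and two collateral heads (raw Q-values and significance
    weights). Pooling/padding placement and head wiring are assumptions."""

    def __init__(self, n_actions=5, lstm_units=900, input_shape=(1, 240, 320)):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(input_shape[0], 32, kernel_size=8, stride=4),
            nn.BatchNorm2d(32), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=4),
            nn.BatchNorm2d(64), nn.ReLU(),
            nn.Conv2d(64, 128, kernel_size=3, stride=1),
            nn.BatchNorm2d(128), nn.ReLU(),
            nn.Conv2d(128, 256, kernel_size=3, stride=1),
            nn.BatchNorm2d(256), nn.ReLU(),
        )
        with torch.no_grad():  # probe the flattened feature size
            feat_dim = self.features(torch.zeros(1, *input_shape)).numel()
        # No fully connected layer before the LSTM: features go in directly.
        self.lstm = nn.LSTM(feat_dim, lstm_units, batch_first=True)
        self.q_head = nn.Linear(lstm_units, n_actions)    # raw DRQN output q
        self.aux_head = nn.Linear(lstm_units, n_actions)  # significance w

    def forward(self, frames, hidden=None):
        # frames: (batch, seq_len, channels, height, width) grayscale clips
        b, t = frames.shape[:2]
        feats = self.features(frames.flatten(0, 1)).flatten(1).view(b, t, -1)
        out, hidden = self.lstm(feats, hidden)
        q = self.q_head(out)
        w = torch.softmax(self.aux_head(out), dim=-1)
        return q * w, hidden  # strengthened Q-values (combination rule assumed)
```

With the full 320*240 input and 900 LSTM units this model is large; the usage below shrinks both purely to keep the demonstration light.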
We used two collateral layers rather than a single DRQN output to approximate the value function. The auxiliary layer had the same dimension as the action space. The original intention was to balance the impact of history against the current frame: the DRQN enables the driving agent to make a reasonable action from a fully observed perspective, but we also wanted to focus attention on the precise instantaneous change of the current view. The auxiliary layer was mapped to [0, 1.0] using the softmax function, thus suggesting the direction of correction for the raw approximations produced by the DRQN. Because of this intervention, the network learns not only the best action for a state but also the importance of taking that action. Imagine an agent driving in a straight line with an obstacle far ahead: an original DRQN model can learn that it is time to move aside to avoid hitting the obstacle, while the modified model also gains an insight into when it is the best time to move, knowing that the danger increases as the obstacle gets closer. Let $q$ represent the original output of the DRQN and $w$ the significance provided by the auxiliary layer; the final strengthened Q-value combines the two elementwise (see Figure 3):

$$Q = q \odot w$$
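The softmax mapping and the weighting of the raw Q-values can be illustrated in a few lines of plain Python; the elementwise-product combination rule here is an assumed reading of the description above.

```python
import math

def softmax(xs):
    """Numerically stable softmax; maps auxiliary-layer logits to [0, 1]."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def strengthened_q(raw_q, aux_logits):
    """Weight the raw DRQN output q by the significance weights w.
    The elementwise product is an assumed combination rule."""
    w = softmax(aux_logits)
    return [qi * wi for qi, wi in zip(raw_q, w)]
```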
The result was stored in the prioritized experience memory unit. During training, sequences were sampled from the memory unit, and Eq. 5 was used to calculate the loss.
II-D Implementation Details
Reinforcement learning rests on two basic concepts: action and reward. An action is what an agent can do in each state; given the screen as input, the robot can, for example, move a certain distance per step, and the set of available actions is finite. When the agent takes an action in a state, it receives a reward. Here, the term reward is an abstract concept describing the feedback from the environment; it can be positive or negative. A positive value corresponds to the usual meaning of reward, while a negative value corresponds to what we usually call punishment. Below we also describe training details such as the hyperparameters, input size selection, frame skip, and prioritized replay.
II-D1 Action Space
The game had five actions: Left, Right, Straight, Brake, and Backwards. The rocking range of the joystick reflects the numerical value of the speed control, which is mapped into a region of -80 to 80. To keep the model simple, we considered that only the turning control had a high requirement for precision. The speed section was discretized into the set [0, 20, 40, 80] for speed control, and the other three actions were represented by 1/0 flags. Thus, each action was represented by a 5-dimensional vector.
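One possible encoding of such a 5-dimensional action vector is sketched below. The paper does not spell out the component ordering, so the layout [left, right, speed, brake, backwards] used here is a hypothetical illustration.

```python
SPEED_LEVELS = [0, 20, 40, 80]  # discretized speed options from the text

def encode_action(steer=0, speed=0, brake=0, backwards=0):
    """Pack one control decision into a 5-dimensional vector.
    The ordering [left, right, speed, brake, backwards] is a hypothetical
    layout chosen for illustration only."""
    assert speed in SPEED_LEVELS
    left = 1 if steer < 0 else 0
    right = 1 if steer > 0 else 0
    return [left, right, speed, brake, backwards]
```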
Under most circumstances, the driving agent cannot initially explore a path with large rewards. It often gets stuck somewhere halfway and waits for the time to elapse before resetting. To make these experiences meaningful, the rewards of such cases must still vary. Hence, we set a series of landmarks along the track, and a periodic reward was given to the driving agent whenever a landmark was reached: the closer the landmark to the destination, the bigger the phased reward. We considered formulating a denser and more precise reward system, such as giving the driving agent a slight punishment when it deviates from the driveway; however, this would require hidden game parameters, which runs against the original aim of this research.
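The landmark scheme above can be written as a simple monotone reward schedule; the linear ramp and the `max_reward` scale in this sketch are illustrative assumptions, since the paper only states that rewards grow toward the destination.

```python
def landmark_reward(landmark_index, n_landmarks, max_reward=10.0):
    """Phased reward for reaching landmark `landmark_index` (0-based) out of
    `n_landmarks` placed along the track: landmarks closer to the destination
    yield larger rewards. The linear ramp and `max_reward` are assumptions."""
    return max_reward * (landmark_index + 1) / n_landmarks
```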
The network was trained using the RMSProp algorithm with a minibatch size of 40. The replay memory was set to contain the 10000 most recent frames. The learning rate was set to follow a linear degradation from 1.0 to 0.1 and then fixed at 0.1, and the discount factor was set to 0.9. The exploration rate for the $\epsilon$-greedy policy likewise followed a linear degradation from 1.0 to 0.1 and was then fixed at 0.1; it was set to 0 when we evaluated the model. The original screen outputs were three-channel RGB images; they were first transformed into grayscale images and then fed to the network.
II-D4 Input Size
In the beginning, the original 640*480 screen resolution was resized to 160*120 to accelerate training. After several hours of trials, the driving agent still got stuck in most cases and could not complete one lap. The resulting rewards oscillated around the baseline reward for not finishing the game within the limited number of steps, indicating that the resolution was too low for recognition. The input was then resized to 320*240, and we also reduced the punishments to encourage positive rewards. After observing for the same length of time, the distribution of the resulting rewards cast off its random oscillation and started to rise above the baseline. Hence, we kept this resolution as the input size of the network despite the memory consumption, as the system started to learn within an acceptable amount of time.
Obviously, it is not necessary to feed every single frame into the network, because there is only a weak change between adjacent frames. We therefore used the frame-skip technique, as in most approaches: one frame is taken as the network input out of every k + 1 frames, and the same action is repeated over the skipped frames. The higher k is set, the faster the training; however, the control also becomes less precise. Balancing low computing resource consumption against smooth control, we empirically set a frame skip of k = 3.
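The frame-skip loop can be sketched as a thin wrapper around a Gym-style step function; accumulating the skipped-frame rewards, as done here, is a common convention and an assumption about this implementation.

```python
def step_with_frameskip(env_step, action, k=3):
    """Repeat `action` for the k skipped frames plus the kept frame,
    accumulating rewards; only the last observation is fed to the network.
    `env_step` is any callable with the Gym-style return signature
    (obs, reward, done, info)."""
    total_reward = 0.0
    obs, done, info = None, False, {}
    for _ in range(k + 1):
        obs, reward, done, info = env_step(action)
        total_reward += reward
        if done:
            break
    return obs, total_reward, done, info
```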
II-D6 Prioritized Replay
In most cases, only a small fraction of the replay memory carried high rewards, which made learning time-consuming given a huge replay table and sparse rewards. As prioritized replay is a method that makes learning from experience replay more efficient, we simply duplicated important experiences in proportion to their rewards when storing them into the replay memory, so that they would have a greater opportunity to be recalled.
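This reward-proportional storage can be sketched as follows; the copy-count formula and `scale` factor are assumptions, since the paper only says that experiences were duplicated in proportion to their rewards.

```python
from collections import deque

def store_prioritized(memory, transition, reward, scale=1.0):
    """Store a transition with extra copies in proportion to its reward, so
    that high-reward experiences are sampled more often under uniform
    replay. The copy-count formula and `scale` are assumptions."""
    copies = 1 + max(0, int(reward * scale))
    for _ in range(copies):
        memory.append(transition)

# Replay size from the text: the 10000 most recent entries are kept.
replay_memory = deque(maxlen=10000)
```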
The model was trained on three different tracks, covering all the tracks that Stanford used for comparison. An individual set of weights was trained separately for each track because each track has different terrain textures. The rewards were low initially, because the beginning of training is equivalent to random exploration and the driving agent would get stuck somewhere without making significant progress. After about 1400 episodes of training, the driving agent finished the race. Under most circumstances, the driving agent did not finish the race within the given number of steps, so the reward was positive but not as high as when receiving the final big reward. We set this step limit because of the lack of a reset mechanism for dead situations, which was very useful in the early stage of training. In Stanford's report, a DNF flag was created to represent the driving agent getting stuck. In our experiments, the driving agent learned a better policy and displayed better behaviors, demonstrating better robustness of the system. We also visualized the CNN to validate the capability of the model.
III-A Experimental Environment
We chose the car racing game to carry out our simulated self-driving experiment. To play the car racing game autonomously, we used a third-party OpenAI Gym environment wrapper for the game developed by Bzier (https://github.com/bzier/gym-mupen64plus). The API provides direct access to the game engine and allows us to run our scripts while playing the game frame by frame. Through the API, we can easily obtain game information, either the screen output or high-level intermediate game parameters such as the car's location on the minimap. Our model was verified to deal with this partially observable game environment efficiently.
III-B Rewards Analysis
For comparison, the model was trained and tested on the same tracks as those used by Stanford in their supervised learning method. Each track has different terrain textures and routes of varying difficulty; therefore, an individual set of weights was trained separately for each track. The experiments were carried out on an ordinarily configured portable laptop, and each model converged after over 80 hours of training.
III-C CNN Visualization
The quality of the high-level features was regarded as the key measure of the learning process. In traditional supervised learning, a network can be evaluated from many angles, such as the loss function and validation accuracy; in our setting, however, no such standard provides a quantitative assessment of the network. Hence, we visualized the output of the last layer (see Figure 5) to ascertain the kind of features captured by the network. The high-level layer outputs seemed quite abstract under direct observation, so we visualized the high-level features through deconvolution; deconvolution and unpooling are common methods for reconstructing the input entity.
III-D Result and Discussion
The test results are shown in Table I. The driving agent was tested on several tracks in line with Stanford's experiment, such as Farm, Raceway, and Mountain, and we evaluated our model on the achieved race time. For each track, we ran ten races in real time and took the mean race time as the final result, excluding DNF cases. The Stanford results and the Human results were borrowed from the Stanford report; the human results were obtained by a real human participant playing each track twice, the first time to get used to the track and the second time to record the time.
Table I. Mean race times per track for our model, the Stanford model, and a human player.
In sum, our agent required a longer time to finish the race. Through observation, we found that our agent's driving path was not as smooth as that produced by the supervised learning results. We ascribe this gap to two causes. First, we reduced the epsilon too quickly. As shown in Figure 4, the epsilon value drops rapidly to 0.1 as the rewards increase, indicating that the experiences had become more reliable; during the latter phase of training, we therefore depended mainly on experience even though many better state-action pairs remained to be explored. Second, we discretized the actions coarsely, using the set [0, 20, 40, 80] as the options for speed via the control variable of the joystick. We observed that the numerical value of the joystick parameter did not maintain a linear relationship with the real effect on the driving agent as shown on screen: speeds of 20 m/s and 40 m/s both produced a tiny effect, while a speed of 80 m/s caused a radical change. Accurate modeling of the relationship between the joystick parameter and the real effect would make the segmentation more reasonable. Nevertheless, compared with the experiment done by the Stanford group, our agent performed satisfactorily even when it deviated from the driveway. We took the brake into consideration and trained the driving agent by trial and error, which is more in line with real driving. Hence, the bio-inspired reinforcement learning method in this study was a more suitable approach for the driving agent to make decisions.
It is worth mentioning that the driving agent was more sensitive to the green grass than to the yellow sand texture. We believe the fixed set of weights used to transform the color image into a grayscale image before input is responsible for this phenomenon. In the future, we will run the experiments on the original three RGB channels to test this conjecture and improve the results.
Brain-inspired learning has recently gained increasing interest for solving decision-making tasks. In this paper, we proposed an effective brain-inspired end-to-end learning method for controlling a simulated self-driving agent. Our modified DRQN model proved able to recover from a wide range of error states, indicating that our trial-and-error method based on deep recurrent reinforcement learning can achieve better performance and stability. By using the screen pixels as the only input to the system, our method closely resembles the way human beings solve a navigation task from a first-person perspective. This resemblance makes the research relevant to real-world robotics applications, and we hope the proposed brain-inspired learning system will inspire real-world self-driving control solutions.
-  M.-m. Poo, J.-l. Du, N. Y. Ip, Z.-Q. Xiong, B. Xu, and T. Tan, “China brain project: basic neuroscience, brain diseases, and brain-inspired computing,” Neuron, vol. 92, no. 3, pp. 591–596, 2016.
-  Y. Niv, “Reinforcement learning in the brain,” Journal of Mathematical Psychology, vol. 53, no. 3, pp. 139–154, 2009.
-  F. Rivest, Y. Bengio, and J. Kalaska, “Brain inspired reinforcement learning,” in Advances in neural information processing systems, 2005, pp. 1129–1136.
-  M. McPartland and M. Gallagher, “Learning to be a bot: Reinforcement learning in shooter games.” in AIIDE, 2008.
-  J. Peters and S. Schaal, “Reinforcement learning of motor skills with policy gradients,” Neural networks, vol. 21, no. 4, pp. 682–697, 2008.
-  H. Duan, S. Shao, B. Su, and L. Zhang, “New development thoughts on the bio-inspired intelligence based control for unmanned combat aerial vehicle,” Science China Technological Sciences, vol. 53, no. 8, pp. 2025–2031, 2010.
-  V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski et al., “Human-level control through deep reinforcement learning,” Nature, vol. 518, no. 7540, p. 529, 2015.
-  Y. Zhu, R. Mottaghi, E. Kolve, J. J. Lim, A. Gupta, L. Fei-Fei, and A. Farhadi, “Target-driven visual navigation in indoor scenes using deep reinforcement learning,” in Robotics and Automation (ICRA), 2017 IEEE International Conference on. IEEE, 2017, pp. 3357–3364.
-  S. Gu, E. Holly, T. Lillicrap, and S. Levine, “Deep reinforcement learning for robotic manipulation with asynchronous off-policy updates,” in Robotics and Automation (ICRA), 2017 IEEE International Conference on. IEEE, 2017, pp. 3389–3396.
-  J. Kober, J. A. Bagnell, and J. Peters, “Reinforcement learning in robotics: A survey,” The International Journal of Robotics Research, vol. 32, no. 11, pp. 1238–1274, 2013.
-  P. Henderson, R. Islam, P. Bachman, J. Pineau, D. Precup, and D. Meger, “Deep reinforcement learning that matters,” in Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
-  V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller, “Playing atari with deep reinforcement learning,” arXiv preprint arXiv:1312.5602, 2013.
-  V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu, “Asynchronous methods for deep reinforcement learning,” in International Conference on Machine Learning, 2016, pp. 1928–1937.
-  H. Van Hasselt, A. Guez, and D. Silver, “Deep reinforcement learning with double q-learning.” in AAAI, vol. 2. Phoenix, AZ, 2016, p. 5.
-  T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra, “Continuous control with deep reinforcement learning,” in International Conference on Learning Representations (ICLR), 2016.
-  T. Schaul, J. Quan, I. Antonoglou, and D. Silver, “Prioritized experience replay,” in International Conference on Learning Representations (ICLR), 2016.
-  S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural computation, vol. 9, no. 8, pp. 1735–1780, 1997.
-  M. Hausknecht and P. Stone, “Deep recurrent q-learning for partially observable mdps,” in 2015 AAAI Fall Symposium Series, 2015.
-  J. N. Foerster, Y. M. Assael, N. de Freitas, and S. Whiteson, “Learning to communicate to solve riddles with deep distributed recurrent q-networks,” in Advances in Neural Information Processing Systems, 2016, pp. 2137–2145.
-  H. Ho, V. Ramesh, and E. T. Montano, “NeuralKart: A real-time Mario Kart 64 AI,” 2017.
-  S. Shalev-Shwartz, S. Shammah, and A. Shashua, “Safe, multi-agent, reinforcement learning for autonomous driving,” arXiv preprint arXiv:1610.03295, 2016.
-  G. Lample and D. S. Chaplot, “Playing fps games with deep reinforcement learning.” in AAAI, 2017, pp. 2140–2146.
-  M. G. Bellemare, Y. Naddaf, J. Veness, and M. Bowling, “The arcade learning environment: An evaluation platform for general agents,” Journal of Artificial Intelligence Research, vol. 47, pp. 253–279, 2013.