Declaration
All sentences or passages quoted in this document from other people’s work have been specifically acknowledged by clear cross-referencing to author, work and page(s). Any illustrations that are not the work of the author of this report have been used with the explicit permission of the originator and are specifically acknowledged. I understand that failure to do this amounts to plagiarism and will be considered grounds for failure.
Name: Yue Zheng
1.1 Background
Applications of Artificial Intelligence have become widespread in recent years. Among them, Reinforcement Learning has achieved remarkable results in game playing, where an intelligent agent is created and trained with reinforcement learning algorithms to complete a task. At the Future of Go Summit 2017, AlphaGo, an AI player trained with deep reinforcement learning algorithms, won three games against the world’s best human Go player. The success of reinforcement learning in this area shocked the world and prompted research in many other areas, such as driverless cars. Deep learning methods such as convolutional neural networks contributed a lot to this success because they solve the problems of handling high-dimensional input data and extracting features.
Trex Runner is the dinosaur game shown in Google Chrome’s offline mode. The aim of the player is to avoid all obstacles and keep scoring until the score limit is reached. The obstacles move faster as time goes by, which makes the highest score difficult to reach.
The code of this project, written in Python, can be found at this link.
1.2 Aim of the project
The aim of this project is to create an agent that plays Trex Runner using different algorithms and to compare their performance. Internal covariate shift is the change in the distribution of each layer’s inputs during training, which may result in longer training time, especially in deep neural networks. To cope with this problem, batch normalization applies a linear transformation to each feature so that the data are normalized to the same mean and variance. The same problem may also occur in deep reinforcement learning because the decisions are based on a neural network. Beyond comparing different reinforcement learning algorithms, this project will also investigate the effect of batch normalization. The overall objectives of this project are listed below.

Create an agent to play Trex Runner

Compare the difference among different reinforcement learning algorithms

Investigate the effect of batch normalization in reinforcement learning
1.3 Overview
This study opens with a literature review on deep learning and reinforcement learning. Each section includes the history of the field and the techniques related to this study. Chapter 3 describes the game and the choice of algorithms according to the literature review. The entire processing pipeline is shown, as well as the architecture of the model. The design of the experiments and the evaluation methods are presented in this chapter too. Chapter 4 shows the results of all the experiments and discusses each of them. Chapter 5 presents the conclusion of this study and the proposed future work.
2.1 Deep Learning
Deep learning is a class of Machine Learning models based on Artificial Neural Networks (ANNs). Two kinds of deep learning models have been widely used in recent years. One is the Recurrent Neural Network, which has shown its power in Natural Language Processing. The other, the Convolutional Neural Network (CNN), plays an important role in deep reinforcement learning. It is one of the most effective models for computer vision problems such as object detection and image classification. This section gives a brief introduction to deep learning and detailed information about convolutional neural networks.
2.1.1 History of Deep Learning
An artificial neural network is a computation system inspired by biological neural networks, first proposed by McCulloch, a neurophysiologist [24]. In 1957, the Perceptron was invented by Frank Rosenblatt [31]. Three years later, his experiments showed that this algorithm can recognize some letters of the alphabet [32]. However, Marvin Minsky proved that a single-layer perceptron cannot deal with the XOR problem [25]. This stalled the development of ANNs until 1988, when Rumelhart et al. showed that useful representations can be learned with a multilayer perceptron, also called a neural network, and the backpropagation algorithm [33]. One year later, LeCun et al. first used a five-layer neural network and backpropagation to solve a digit classification problem and achieved great results [21]. Their innovative model is known as LeNet, which marks the beginning of the convolutional neural network.
The origin of the CNN is the Neocognitron proposed by Fukushima, a self-organized neural network model with multiple layers [10]. This model achieved good results in object detection tasks because it is not position-sensitive. As mentioned before, LeCun et al. invented LeNet and reported a low error rate on the MNIST handwritten digits dataset in 1998 [21]. The model used convolutions and subsampling, today called convolution layers and pooling layers, to convert the original images into feature vectors, and performed classification with fully connected layers. At the same time, some neural network models showed acceptable results in face recognition [20], speech recognition [51] and object detection [49]. However, the lack of reliable theory caused research on CNNs to stagnate for many years.
In the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) 2012 [9], Alex Krizhevsky and his team won with an eight-layer deep neural network (AlexNet) [19], a significant result compared with that of the second-ranked participant. Beyond LeNet, AlexNet trained its eight-layer classifier with data augmentation, dropout and ReLU, which mitigated the overfitting problem. Another significant discovery was that parallel computing with multiple GPUs can largely decrease the training time.
Two years later, Simonyan and Zisserman introduced a sixteen-layer neural network (VGGNet) and won the first prize in ILSVRC 2014 classification and localization tasks [41]. This model achieved the state-of-the-art error rate at that time. VGGNet also showed that using smaller filters and a deeper network can improve the performance of a CNN. No filter in the model was larger than $3 \times 3$, while the filters in the first two layers of AlexNet were $11 \times 11$ and $5 \times 5$. In the same year, GoogLeNet [46], the best model in the ILSVRC 2014 classification and detection tasks, first used the inception module, building on an idea proposed by Lin [23], to address the vanishing gradient problem. Inception replaced a single node with a small network consisting of several convolutional layers and pooling layers, whose outputs are concatenated before being passed to the next layer. This change made the feature selection between two layers more flexible. In other words, it can be updated by the backpropagation algorithm.
Another problem of deep neural networks was degradation, that is, high training error caused by optimization difficulty. To solve this problem, He et al. proposed a deep residual learning framework (ResNet) that uses a residual mapping instead of stacking layers directly [12]. This model won the championship in ILSVRC 2015 with a very low error rate. Their experiments showed that this new framework not only solves the degradation problem but also improves computing efficiency. Many variants were based on ResNet, such as Inception-ResNet [45] and DenseNet [14]. The former combined improved inception techniques with ResNet. In the latter, every pair of convolutional layers was connected. This change mitigated the vanishing gradient problem and improved the propagation of features.
2.1.2 Deep Neural Network and Activation Function
A neural network or multilayer perceptron consists of three main components: the input layer, the hidden layer, and the output layer. Each unit in a layer is called a neuron. The input data are fed into the input layer and undergo a linear transformation through the weights of the hidden layer. Finally, the result is made nonlinear by an activation function and fed into the output layer.
An activation function enables the network to learn more complicated relationships between inputs and outputs. Three widely used activation functions are shown in Figure 2.1: sigmoid, tanh and ReLU. ReLU is the most commonly used of the three because it has a low computational cost and handles the vanishing gradient problem better than the other two.
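(2.1) $\mathrm{sigmoid}(x) = \dfrac{1}{1 + e^{-x}}$
(2.2) $\tanh(x) = \dfrac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$
(2.3) $\mathrm{ReLU}(x) = \max(0, x)$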
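The three activation functions named above can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the project's code; the function names and the sample input are assumptions made for the example.

```python
import numpy as np

def sigmoid(x):
    """Logistic sigmoid: squashes inputs into the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    """Hyperbolic tangent: squashes inputs into the interval (-1, 1)."""
    return np.tanh(x)

def relu(x):
    """Rectified linear unit: zero for negative inputs, identity otherwise."""
    return np.maximum(0.0, x)

# Hypothetical sample input used only to exercise the functions.
x = np.array([-2.0, 0.0, 2.0])
print(sigmoid(x), tanh(x), relu(x))
```

ReLU needs only a comparison per element, which is one reason for its low computational cost.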
To illustrate the entire process in a neural network, consider the example in Figure 2.2. Given input data $x$ and a randomly generated weight $w$ in the hidden layer, the output of the neural network is $y$ and the activation of the hidden layer is $f$. Therefore, the estimated value $\hat{y}$ can be calculated by
(2.4) $\hat{y} = f\left(\sum_{i=1}^{n} w_i x_i + b\right)$
where $b$ is the bias of the hidden layer and $n$ is the number of features. If the number of hidden layers in the neural network is greater than two, it is also called a Deep Neural Network (DNN). Consider a simple DNN with three hidden layers, shown in Figure 2.3. Given the same input in matrix form $X$, and letting $W^{(l)}$ be the weight between the $l$th and $(l+1)$th layer, the output $A^{(l+1)}$ of the $(l+1)$th layer can be calculated by
(2.5) $A^{(l+1)} = f^{(l)}\left(A^{(l)} W^{(l)}\right)$
where $f^{(l)}$ is the activation function between the $l$th and $(l+1)$th layer and $l = 0, 1, 2, 3$, especially $A^{(0)} = X$. The dimensions of those variables are shown in Table 2.1.
Parameter  Description  Dimension 

$X$  Input data  
$W^{(0)}$  Weight between input layer and hidden layer 1  
$W^{(1)}$  Weight between hidden layer 1 and hidden layer 2  
$W^{(2)}$  Weight between hidden layer 2 and hidden layer 3  
$W^{(3)}$  Weight between hidden layer 3 and output layer 
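The layer-by-layer computation of a DNN with three hidden layers can be sketched as follows. The layer sizes, the choice of ReLU for the hidden layers and the identity on the output layer are assumptions made for illustration; the source does not state them.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def forward(X, weights, activations):
    """Propagate X through the network: each layer computes f(A @ W)."""
    A = X
    for W, f in zip(weights, activations):
        A = f(A @ W)
    return A

rng = np.random.default_rng(0)
# Hypothetical sizes: 4 input features, three hidden layers of 5 units, 1 output.
layer_sizes = [4, 5, 5, 5, 1]
weights = [rng.normal(size=(m, n))
           for m, n in zip(layer_sizes[:-1], layer_sizes[1:])]
activations = [relu, relu, relu, lambda z: z]  # identity on the output layer

X = rng.normal(size=(10, 4))  # 10 samples, one per row
Y_hat = forward(X, weights, activations)
print(Y_hat.shape)
```

With samples stored as rows, each weight matrix has shape (units in the current layer, units in the next layer), so four weight matrices connect the five layers.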
2.1.3 Backpropagation Algorithm
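Section 2.1.2 introduces how to estimate the label using a deep neural network. In order to optimize this estimation, a cost function is used to quantify the difference between the estimated value $\hat{y}$ and the true value $y$. To give a simple example, the Mean Square Error [1], which is often applied to regression problems, is used in this section to illustrate how the backpropagation algorithm works. Equation 2.6 shows the form of the mean square error over $m$ samples.
(2.6) $L(X, W) = \frac{1}{m} \sum_{i=1}^{m} \left(\hat{y}_i - y_i\right)^2$
where $X$ is the input data and $W$ represents all weights used in the model. Thus, the optimization problem can be described as follows
(2.7) $\min_{W} L(X, W)$
Stochastic Gradient Descent (SGD) is an effective way to solve this optimization problem if $L$ is convex. However, it still shows acceptable results in deep neural networks even though there is no guarantee of reaching a global optimum in nonconvex optimization [11]. Instead of finding the optimal point directly, SGD optimizes the objective function 2.7 iteratively with the following equation
(2.8) $W \leftarrow W - \eta \nabla_{W} L(X, W)$
where $\eta$ is the learning rate, which controls the update speed of the weights. Using gradient methods such as SGD to optimize the cost function of a neural network is called the backpropagation algorithm. Consider the deep neural network shown in Figure 2.3, with $A^{(3)}$ the output of hidden layer 3, $W^{(3)}$ the weight between hidden layer 3 and the output layer, $f^{(3)}$ the output activation and $A^{(4)} = f^{(3)}(A^{(3)} W^{(3)})$ the network output. With the techniques of matrix calculus [13], the gradient of $L$ with respect to $W^{(3)}$ is
(2.9) $\frac{\partial L}{\partial W^{(3)}} = \left(A^{(3)}\right)^{\top} \left( \frac{\partial L}{\partial A^{(4)}} \odot f^{(3)\prime}\left(A^{(3)} W^{(3)}\right) \right)$
where $\odot$ is elementwise matrix multiplication and $f^{(3)\prime}$ is the gradient of $f^{(3)}$ with respect to its input $A^{(3)} W^{(3)}$. Results for the other weights can be calculated in a similar way. With equation 2.8, $W^{(3)}$ can be updated during each iteration by
(2.10) $W^{(3)} \leftarrow W^{(3)} - \eta \frac{\partial L}{\partial W^{(3)}}$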
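The iterative update can be illustrated on the simplest possible model, a single linear layer trained with mini-batch SGD on the mean square error. The synthetic data, the batch size of 32, the learning rate and the number of steps are assumptions chosen for this sketch, and a deep network would apply the same update to every weight matrix.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical regression data: noiseless targets from a known weight vector.
X = rng.normal(size=(200, 3))
true_w = np.array([1.5, -2.0, 0.5])
y = X @ true_w

w = np.zeros(3)   # weights to be learned
lr = 0.1          # learning rate
for step in range(500):
    idx = rng.integers(0, len(X), size=32)   # sample a mini-batch
    Xb, yb = X[idx], y[idx]
    err = Xb @ w - yb                        # estimated value minus true value
    grad = 2.0 * Xb.T @ err / len(idx)       # gradient of the mean square error
    w -= lr * grad                           # gradient step on the weights

print(np.round(w, 3))
```

Because the targets are noiseless, the learned weights approach `true_w`; with noisy data they would only fluctuate around the least-squares solution.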
2.1.4 Convolutional Neural Network
Compared with a common deep neural network, a convolutional neural network has two extra components: the convolutional layer and the pooling layer. The convolutional layers use several trainable filters to select different features. The pooling layer reduces the dimension of the data by subsampling.
In the convolution layer, the output of the previous layer is convolved with trainable filters using elementwise matrix multiplication. The size and number of filters are defined by the user and their initial values are randomly generated. The step by which a filter moves in each convolution layer is set by the stride. In order to keep the information at the border during forward propagation, a series of zeros, called padding, is attached to the border of the image. Figure 2.4 shows how the result of one neuron in a convolution layer is computed and how the filter is updated in every iteration of backpropagation.
The pooling layer is used not only for dimension reduction but also for detecting features of the input that are invariant to translation, rotation and scale [37]. There are two types of pooling operations: max pooling and average pooling. In max pooling, only the maximum value in a user-defined window is chosen, while in average pooling all values in the window contribute to the output. The choice of operation depends on the task, and Boureau has made a theoretical comparison between the two [6]. Both the max and average operations are shown in Figure 2.5.
A complete convolutional neural network consists of several convolutional layers, pooling layers, and fully connected layers. The fully connected layers are the same concept as a DNN and take as input the flattened output of the last convolutional or pooling layer. Considering the classification problem shown in Figure 2.6, an image is fed into the CNN, which outputs a scalar representing its class. Table 2.2 lists the filter information used in the CNN.
Layer  Number of filters  Filter size  Stride  Padding  Output dimension 

Convolution 1  8  8×8  2  0  8×28×28 
Max Pooling  8  2×2  2  /  8×14×14 
Convolution 2  16  6×6  2  0  16×4×4 
The backpropagation algorithm in a convolutional neural network differs slightly from the one described in Section 2.1.3 because of the two extra layer types. In an average pooling layer, the error is divided by the size of the pooling window and propagated to the previous layer. In a max pooling layer, the position of the maximum value is stored during forward propagation and the error is passed directly through that position. In a convolutional layer, backpropagation can be calculated through basic differentiation. Consider the convolution operation in Figure 2.4, if the error from the output layer is , then we have
(2.11) 
where
(2.12) 
is the output matrix in Figure 2.4. The differentiation of with respect to , , can be computed in a similar way.
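The two pooling operations described in this section can be sketched with a small NumPy function. The 4×4 input, the 2×2 window and the stride of 2 are assumptions chosen for the example and are not taken from the project.

```python
import numpy as np

def pool2d(x, size=2, stride=2, mode="max"):
    """Slide a size x size window over a 2D array and reduce each window."""
    h = (x.shape[0] - size) // stride + 1
    w = (x.shape[1] - size) // stride + 1
    out = np.empty((h, w))
    for i in range(h):
        for j in range(w):
            window = x[i * stride:i * stride + size,
                       j * stride:j * stride + size]
            # Max pooling keeps one value; average pooling uses all of them.
            out[i, j] = window.max() if mode == "max" else window.mean()
    return out

x = np.arange(16, dtype=float).reshape(4, 4)  # hypothetical 4x4 feature map
print(pool2d(x, mode="max"))   # [[ 5.  7.] [13. 15.]]
print(pool2d(x, mode="avg"))   # [[ 2.5  4.5] [10.5 12.5]]
```

In backpropagation, max pooling would route the error only to the position that produced each maximum, while average pooling would spread it evenly over the window.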
2.1.5 Batch Normalization
With the increasing depth of the neural network, the training time becomes longer. One of the reasons is that the distribution of the input to each layer changes when the weights are updated, which is called Internal Covariate Shift. In 2015, Ioffe and Szegedy proposed Batch Normalization (BN), which makes the distribution in each layer more stable and shortens training time [15]. In each neuron, the input $x$ can be normalized over a mini-batch by Equation 2.13
(2.13) $\hat{x} = \dfrac{x - \mu_{B}}{\sqrt{\sigma_{B}^{2} + \epsilon}}$
where $\mu_{B}$ and $\sigma_{B}^{2}$ are the mean and variance of $x$ over the mini-batch and $\epsilon$ is used to avoid zero variance. Now the data in each neuron follow a distribution with mean $0$ and variance $1$. However, this changes the representation ability of the network, which may lead to the loss of information from the earlier layer. Therefore, Ioffe and Szegedy used another linear transformation to restore that representation
(2.14) $y = \gamma \hat{x} + \beta$
where $\gamma$ and $\beta$ are learnable parameters, especially, the result is the same as the original when $\gamma = \sqrt{\sigma_{B}^{2} + \epsilon}$ and $\beta = \mu_{B}$. The mean and variance observed during training are stored and used as the mean and variance for test data. In their experiments, BN not only dealt with the Internal Covariate Shift problem but also mitigated the vanishing gradient problem.
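The training-time forward pass of batch normalization can be sketched as below. The batch shape and the initial values of the scale and shift parameters are assumptions for the example, and the running statistics used at test time are left out for brevity.

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Normalize each feature over the batch, then apply the learnable
    scale (gamma) and shift (beta)."""
    mu = x.mean(axis=0)                     # per-feature batch mean
    var = x.var(axis=0)                     # per-feature batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)   # zero mean, unit variance
    return gamma * x_hat + beta

rng = np.random.default_rng(0)
# Hypothetical batch: 64 samples with 4 features, mean 3 and standard deviation 2.
x = rng.normal(loc=3.0, scale=2.0, size=(64, 4))
out = batch_norm(x, gamma=np.ones(4), beta=np.zeros(4))
print(out.mean(axis=0), out.std(axis=0))
```

With the scale set to one and the shift set to zero, the output of every feature has approximately zero mean and unit standard deviation, whatever the statistics of the input.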
2.2 Reinforcement Learning
Reinforcement Learning (RL) is a class of machine learning that aims to maximize the reward signal when making decisions. The basic components of reinforcement learning are the agent and the environment. As shown in Figure 2.7, the agent receives feedback, including an observation and a reward, from the environment after each action. To generate a better policy, it keeps interacting with the environment and improves its decision-making ability step by step until the policy converges.
2.2.1 History of Reinforcement Learning
In recent years, reinforcement learning has become popular because of AlphaGo, a program that can beat human experts in Go [39]. At the Future of Go Summit 2017, AlphaGo Master shocked the world by winning all three games against Ke Jie, the world’s best Go player. But research on reinforcement learning started much earlier. According to Sutton, the early history of RL can be divided into two main threads [42].
One of them was optimal control. To cope with the optimal control problems arising at the time, which were called “multistage decision processes” in 1954, the theory of Dynamic Programming (DP) was introduced by Bellman [4]. In this theory, he proposed the concept of the “functional equation”, which is often called the Bellman equation today. Although DP was one of the most effective approaches to solving optimal control problems at that time, its high computational requirements, which Bellman called “the curse of dimensionality”, were not easy to overcome [5]. Three years later, he built a model called Markov Decision Processes (MDPs) to describe a kind of discrete stochastic process [3]. This model and the concept of the value function described in the Bellman equation form the basic theory of modern reinforcement learning.
In the optimal control thread, solving problems required full knowledge of the environment, which was not feasible for most real-world problems. The trial-and-error thread focused more on the feedback than on the environment itself. The first expression of the key ideas of trial-and-error learning, “selectional” and “associative”, was the “Law of Effect” in Edward Thorndike’s book “Animal Intelligence” [47]. Although supervised learning was not “selectional”, some researchers still mistook it for reinforcement learning and concentrated on pattern recognition [8, 55]. This led to little research on actual trial-and-error learning until Klopf recognized the difference between supervised learning and RL: the motivation to gain more rewards from the environment [16, 17]. However, there were still some remarkable works, such as the reinforcement learning rule called “selective bootstrap adaptation” by Widrow in 1973 [54].
The two threads came together in modern reinforcement learning. Temporal Difference (TD) learning, which originated from animal learning psychology, is a method that predicts future values from the current signal. This idea was first proposed and implemented by Samuel [35]. In 1972, Klopf developed the idea of “generalized reinforcement” and linked trial-and-error learning with animal learning psychology [16]. In 1983, Sutton developed and implemented the actor-critic architecture in trial-and-error learning based on Klopf’s idea [2]. Five years later, he proposed the TD($\lambda$) algorithms, which used information from additional steps to update the policy and made TD learning a general prediction method for deterministic problems [44]. One year later, Chris Watkins used optimal control methods to solve temporal-difference problems and developed the Q-learning algorithm, which estimates delayed reward with an action-value function [53]. In 1994, an online Q-learning algorithm, known as SARSA, was proposed by Rummery and Niranjan [34]. The difference between Q-learning and SARSA is that in SARSA the agent uses the same policy throughout the learning process, while in Q-learning it always chooses the best action based on the value function.
With the development of deep neural networks, DeepMind proposed the Deep Q-learning Network (DQN) algorithm, which uses a convolutional neural network to handle the high dimensionality of the state in reinforcement learning problems [27]. Two years later, they modified DQN by adding a target network to improve its stability [28]. The highlight of DQN was not only the combination of deep learning and RL but also the experience replay mechanism. To solve the dependency problem when optimizing the CNN, Mnih et al., building on the idea from Lin [22], stored the experience of each step in memory and randomly sampled a mini-batch to optimize the neural network. In 2015, this mechanism was improved by measuring the importance of each experience with its temporal difference error [36]. Meanwhile, Wang proposed Dueling DQN, which uses an advantage function to learn how valuable a state is without estimating each action value for each state [52]. This new neural network architecture is helpful when there is no strong relationship between actions and the environment. In 2016, DeepMind proposed Double DQN, which shows higher policy stability by reducing overestimated action values [50].
Although a series of algorithms based on DQN showed human-level performance on Atari games, they still failed on some specific games. DQN is a value-based method, which means the choice of action depends on the action values. However, choosing actions randomly may be the best policy in some games, such as rock-paper-scissors. To deal with this problem, Sutton proposed the policy gradient, which enables the agent to optimize the policy directly [43]. Based on this, OpenAI proposed a new family of algorithms such as Proximal Policy Optimization (PPO) [38]. PPO uses a statistical method called importance sampling, which estimates a distribution by sampling data from another distribution, and this simple modification showed better performance in RoboschoolHumanoidFlagrun.
Since the basic policy gradient method samples data from completed episodes, the variance of the estimate is high because of the high-dimensional action space. Borrowing from value-based methods, the actor-critic method was proposed to solve this problem [18]. Compared with the policy gradient, this method uses a critic to evaluate the chosen action. This allows the policy to be updated after each decision, which not only reduces the variance but also accelerates convergence. A well-known improved actor-critic algorithm is asynchronous advantage actor-critic (A3C) [26]. Similar to Dueling DQN, this method uses an advantage function to estimate the value function and performs computation in parallel, which can largely increase the learning speed.
2.2.2 Markov Decision Processes
As mentioned in Section 2.2.1, the interaction between the agent and the environment can be modeled as a Markov Decision Process, which is based on the Markov property. The Markov property describes a kind of stochastic process in which the probability of the next event depends only on the current event.
Definition 1 (Markov property [40])
Given a state $S_t$ at time $t$ in a finite sequence $S_1, \dots, S_t$. This sequence has the Markov property if and only if
(2.15) $\mathbb{P}[S_{t+1} \mid S_t] = \mathbb{P}[S_{t+1} \mid S_1, \dots, S_t]$
A Markov Decision Process (MDP) is a random process with the Markov property, values and decisions.
Definition 2 (Markov Decision Process [40])
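A Markov Decision Process can be described as a tuple $\langle \mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R}, \gamma \rangle$

$\mathcal{S}$ is a finite set of states

$\mathcal{A}$ is a finite set of actions

$\mathcal{P}$ is a state transition probability matrix
(2.16) $\mathcal{P}_{ss'}^{a} = \mathbb{P}[S_{t+1} = s' \mid S_t = s, A_t = a]$
$\mathcal{R}$ is a reward function
(2.17) $\mathcal{R}_{s}^{a} = \mathbb{E}[R_{t+1} \mid S_t = s, A_t = a]$
$\gamma$ is a discount factor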
To describe how the decision is made, a policy is required to define the behaviour of the agent.
Definition 3 (Policy [40])
A policy $\pi$ is a distribution over actions given states
(2.18) $\pi(a \mid s) = \mathbb{P}[A_t = a \mid S_t = s]$
In an MDP, the agent is expected to collect as much reward as it can from the environment. However, maximizing only the immediate reward makes the agent shortsighted, meaning that it only considers the reward of the next action rather than the total reward of one episode. Therefore, the return is defined as the quantity that the agent is expected to maximize.
Definition 4 (Return [40])
The return $G_t$ is the total discounted reward from timestep $t$.
(2.19) $G_t = R_{t+1} + \gamma R_{t+2} + \dots = \sum_{k=0}^{\infty} \gamma^{k} R_{t+k+1}$
where $\gamma \in [0, 1]$ is the discount factor.
The value of the discount factor represents how farsighted the agent is. If this value is $1$, the agent treats every future reward the same. But this also makes it hard for the agent to tell which decision was inappropriate. At this point, the behavior of the agent can be described as in Figure 2.8.
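The effect of the discount factor on the return can be checked numerically. The reward sequence below is a made-up example, and the helper name `discounted_return` is an assumption of this sketch.

```python
def discounted_return(rewards, gamma):
    """Total discounted reward, accumulated backwards from the last step."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

rewards = [1.0, 1.0, 1.0]  # hypothetical rewards of a three-step episode
print(discounted_return(rewards, 1.0))  # 3.0: every reward counts equally
print(discounted_return(rewards, 0.5))  # 1.75: later rewards count less
print(discounted_return(rewards, 0.0))  # 1.0: only the immediate reward counts
```

A discount factor of zero reproduces the shortsighted agent, while a factor of one weighs all future rewards equally.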
Since the return is defined on a random process, its expectation can be defined in the same way as the reward function. This expectation is called the value function.
Definition 5 (State Value Function [40])
The state-value function $v_\pi(s)$ of an MDP is the expected return starting from state $s$, and then following policy $\pi$
(2.20) $v_\pi(s) = \mathbb{E}_\pi[G_t \mid S_t = s]$
Definition 6 (Action Value Function [40])
The action-value function $q_\pi(s, a)$ is the expected return starting from state $s$, taking action $a$, and then following policy $\pi$
(2.21) $q_\pi(s, a) = \mathbb{E}_\pi[G_t \mid S_t = s, A_t = a]$
2.2.3 Bellman Equation
Now that the Markov decision process has been defined in Section 2.2.2, the behavior of the agent can be described mathematically. As mentioned in Section 2.2.1, this problem can be solved with the Bellman equation.
Theorem 1 (Bellman Expectation Equation [40])
The state-value function can be decomposed into immediate reward plus discounted value of successor state
(2.23) $v_\pi(s) = \mathbb{E}_\pi[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s]$
The action-value function can similarly be decomposed
(2.24) $q_\pi(s, a) = \mathbb{E}_\pi[R_{t+1} + \gamma q_\pi(S_{t+1}, A_{t+1}) \mid S_t = s, A_t = a]$
Here is a simple proof of Equation 2.24. According to Definition 4, the return at time $t$ can be decomposed into two parts: the immediate reward $R_{t+1}$ and the discounted return at time $t+1$
(2.25) $q_\pi(s, a) = \mathbb{E}_\pi[G_t \mid S_t = s, A_t = a]$
(2.26) $q_\pi(s, a) = \mathbb{E}_\pi[R_{t+1} + \gamma G_{t+1} \mid S_t = s, A_t = a]$
Due to the linearity of expectation, $G_{t+1}$ can be replaced by $q_\pi(S_{t+1}, A_{t+1})$ and then we obtain the Bellman equation for the action-value function. The state-value function can be proved in the same way. With the definition of the optimal value function
Definition 7 (Optimal Value Function [40])
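The optimal state-value function $v_*(s)$ is the maximum value function over all policies
(2.27) $v_*(s) = \max_{\pi} v_\pi(s)$
The optimal action-value function $q_*(s, a)$ is the maximum action-value function over all policies
(2.28) $q_*(s, a) = \max_{\pi} q_\pi(s, a)$
Theorem 1 can be extended to the Bellman optimality equation
Theorem 2 (Bellman Optimality Equation [40])
The optimal state-value function can be decomposed into maximum immediate reward plus discounted optimal value of successor state
(2.29) $v_*(s) = \max_{a} \left( \mathcal{R}_{s}^{a} + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}_{ss'}^{a} v_*(s') \right)$
The optimal action-value function can similarly be decomposed
(2.30) $q_*(s, a) = \mathcal{R}_{s}^{a} + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}_{ss'}^{a} \max_{a'} q_*(s', a')$
where $s \in \mathcal{S}$, $a \in \mathcal{A}$, $s' \in \mathcal{S}$, $a' \in \mathcal{A}$, and $\mathcal{R}_{s}^{a}$ and $\mathcal{P}_{ss'}^{a}$ are the reward function and transition probability of Definition 2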
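Here is a simple proof of Equation 2.30. Due to the linearity of expectation, Equation 2.26 can be decomposed into the expectation of the immediate reward
(2.31) $\mathbb{E}_\pi[R_{t+1} \mid S_t = s, A_t = a]$
and the expectation of the discounted return at time $t+1$
(2.32) $\gamma \, \mathbb{E}_\pi[G_{t+1} \mid S_t = s, A_t = a]$
According to the definition of the reward function in Definition 2, Equation 2.31 is equal to $\mathcal{R}_{s}^{a}$. If the next state is $s'$, Equation 2.32 can be written as follows with the transition probability matrix
(2.33) $\gamma \sum_{s' \in \mathcal{S}} \mathcal{P}_{ss'}^{a} \, \mathbb{E}_\pi[G_{t+1} \mid S_{t+1} = s', S_t = s, A_t = a]$
With the Markov property, we know that the expectation of the return in Equation 2.33 does not depend on the current state and action and is equal to the state-value function. Therefore, Equation 2.33 can be written as follows
(2.34) $\gamma \sum_{s' \in \mathcal{S}} \mathcal{P}_{ss'}^{a} \, v_\pi(s')$
(2.35) $q_\pi(s, a) = \mathcal{R}_{s}^{a} + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}_{ss'}^{a} \, v_\pi(s')$
Taking the maximum over policies and using $v_*(s') = \max_{a'} q_*(s', a')$ gives Equation 2.30. It is easy to prove that there is always an optimal policy for any Markov decision process and it can be found by maximizing the action-value function.
(2.36) $\pi_*(a \mid s) = \begin{cases} 1 & \text{if } a = \arg\max_{a \in \mathcal{A}} q_*(s, a) \\ 0 & \text{otherwise} \end{cases}$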
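When the transition probabilities and rewards are fully known, the Bellman optimality equation for the action-value function can be solved by repeated substitution. The two-state, two-action MDP below is entirely hypothetical and serves only to make this sketch runnable.

```python
import numpy as np

# Hypothetical MDP with 2 states and 2 actions.
# P[a, s, s2] is the probability of moving from state s to s2 under action a.
P = np.array([
    [[1.0, 0.0], [0.0, 1.0]],   # action 0: stay in the current state
    [[0.0, 1.0], [1.0, 0.0]],   # action 1: switch to the other state
])
# R[s, a] is the expected immediate reward for taking action a in state s.
R = np.array([[0.0, 1.0],
              [2.0, 0.0]])
gamma = 0.9

Q = np.zeros((2, 2))
for _ in range(1000):
    V = Q.max(axis=1)                             # value of the best action per state
    Q = R + gamma * np.einsum("ast,t->sa", P, V)  # Bellman optimality backup

policy = Q.argmax(axis=1)  # greedy policy with respect to the converged Q
print(np.round(Q, 2), policy)
```

Here the greedy policy switches out of state 0 and stays in state 1, where the reward of 2 can be collected at every step.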
2.2.4 Exploitation vs Exploration
If the agent has complete knowledge of the environment, in other words, if the transition probability $\mathcal{P}_{ss'}^{a}$ can be calculated given state $s$ and action $a$, Equation 2.30 can be solved by an iterative method with an appropriate discount factor $\gamma$. However, this method is unable to deal with an unknown environment because a large amount of information has to be collected to estimate $\mathcal{P}_{ss'}^{a}$ before the action-value function converges. If the function stabilizes before the environment has been fully explored, the performance of the model will be far from satisfactory, especially when the action space is large.
To deal with this problem, $\epsilon$-greedy selection [42] is introduced to ensure that the agent explores enough before the action-value function converges. Instead of always choosing the best action estimated by the $Q$ function, the agent selects randomly from all actions with probability $\epsilon$. The mathematical expression of this method is as follows
(2.37) $\pi(a \mid s) = \begin{cases} \epsilon / m + 1 - \epsilon & \text{if } a = \arg\max_{a'} Q(s, a') \\ \epsilon / m & \text{otherwise} \end{cases}$
where $m$ is the number of actions. This method may hurt the performance of the agent during the first several episodes of training, but it widens the horizon of the agent in the long term.
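A minimal sketch of this selection rule is given below. The action values, the exploration rate of 0.3 and the function name `epsilon_greedy` are assumptions made for the example.

```python
import numpy as np

def epsilon_greedy(q_values, epsilon, rng):
    """With probability epsilon pick a uniformly random action,
    otherwise pick the action with the highest estimated value."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))

rng = np.random.default_rng(0)
q_values = np.array([0.1, 0.5, 0.2])   # hypothetical action values for one state
actions = [epsilon_greedy(q_values, 0.3, rng) for _ in range(10000)]
freq = np.bincount(actions, minlength=3) / len(actions)
print(freq)
```

With three actions and an exploration rate of 0.3, the greedy action should be chosen about 80% of the time (0.7 from exploitation plus 0.1 from random draws) and each other action about 10% of the time.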
2.2.5 Temporal Difference Learning
As mentioned in Section 2.2.4, most environments in the real world are unknown. To solve this problem, the Monte Carlo (MC) method is used to sample data for estimating the value function. The agent can learn about the environment from the experience of one episode, and the value function can be approximated by the mean of the return instead of the expectation. The mathematical expression is as follows
(2.38) $V(S_t) = \dfrac{S(S_t)}{N(S_t)}$
where $S_t$ is the state at time $t$, $S(S_t)$ is the sum of returns and $N(S_t)$ is the counter that records the number of visits to state $S_t$. There are two kinds of visit counting: first visit and every visit. In the former, the model only records the first visit to state $S_t$ in an episode, while in the latter all visits to $S_t$ in an episode are taken into consideration. Simplifying equation 2.38, we get the recurrence equation for $V(S_t)$
(2.39) $V(S_t) \leftarrow V(S_t) + \alpha \left( G_t - V(S_t) \right)$
where $\alpha$ is the learning rate, which controls the update speed of the value function, and $S_t$ is the state at time $t$. The problem of the Monte Carlo method is that all rewards in one episode have to be collected to get $G_t$. The value function can only be updated at the end of the episode, which may lead to low training efficiency. To update the value function with an incomplete episode, the return can be replaced by an estimate from the value function using bootstrapping. With the Bellman equation 2.23 and equation 2.39, we can write
(2.40) $V(S_t) \leftarrow V(S_t) + \alpha \left( R_{t+1} + \gamma V(S_{t+1}) - V(S_t) \right)$
This idea is called Temporal Difference (TD) Learning. In TD learning, the value function is updated immediately after each new observation. Compared with MC methods, TD learning has lower variance because the return used by the Monte Carlo method depends on many random actions, which leads to high variance. Similarly, the recurrence equation for the action-value function can be written as follows
(2.41) $Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left( R_{t+1} + \gamma Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t) \right)$
where $S_t$ and $A_t$ are the state and action at time $t$, and $S_{t+1}$ and $A_{t+1}$ are the state and action at time $t+1$. Equation 2.41 shows an iterative method to get the optimal action-value function $q_*$. With this equation and the $\epsilon$-greedy policy, the RL problem can be solved by Sarsa [34].
The name Sarsa comes from the sequence $S_t, A_t, R_{t+1}, S_{t+1}, A_{t+1}$. Besides Sarsa, there is another similar algorithm called Q-learning [53].
In Algorithm 2, there are two policies during the iteration. When choosing the action $A_{t+1}$ from the given $S_{t+1}$, Sarsa uses the $\epsilon$-greedy policy while Q-learning uses the greedy policy. But both of them choose $A_t$ with the $\epsilon$-greedy policy. Considering the example of Cliff Walking shown in Figure 2.9 from Sutton’s book [42], every transition in the environment yields a reward of $-1$, except when the next state is the cliff, in which case the agent gets a reward of $-100$. Sarsa is more likely to choose the safe path, while Q-learning tends to choose the optimal path under the $\epsilon$-greedy policy. But both of them can reach the optimal policy if the value of $\epsilon$ is gradually reduced.
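The difference between the two update rules can be made concrete with a tabular sketch. The table values, the transition and the hyperparameters below are hypothetical, and the handling of terminal states is omitted for brevity.

```python
import numpy as np

def sarsa_update(Q, s, a, r, s_next, a_next, alpha, gamma):
    """On-policy update: bootstrap from the action actually chosen next."""
    Q[s, a] += alpha * (r + gamma * Q[s_next, a_next] - Q[s, a])

def q_learning_update(Q, s, a, r, s_next, alpha, gamma):
    """Off-policy update: bootstrap from the greedy action in the next state."""
    Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])

# Hypothetical table with 2 states and 2 actions.
Q_sarsa = np.array([[0.0, 0.0], [1.0, 3.0]])
Q_qlearn = Q_sarsa.copy()

# Same transition for both; the behaviour policy happened to pick the
# non-greedy action 0 in the next state.
sarsa_update(Q_sarsa, s=0, a=0, r=-1.0, s_next=1, a_next=0, alpha=0.5, gamma=1.0)
q_learning_update(Q_qlearn, s=0, a=0, r=-1.0, s_next=1, alpha=0.5, gamma=1.0)
print(Q_sarsa[0, 0], Q_qlearn[0, 0])  # 0.0 1.0
```

Sarsa's estimate reflects the exploratory action that was actually taken, which is why it tends to prefer safer paths, while Q-learning always assumes the best action will follow.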
2.2.6 Deep Q Network
Q-learning is a powerful algorithm for solving simple reinforcement learning problems. However, it is unable to deal with continuous states or continuous actions. To solve the former problem, deep learning methods can be used to approximate the action-value function.
Generally, states are image data observed by the agent, and a convolutional neural network is an effective way to extract features from this kind of data in its convolution layers and feed them into the fully connected layers to approximate the $Q$ function. Several consecutive still images are stacked into one input so that the model can perceive that the agent is moving. But the input data are highly dependent, and the performance of the model is largely affected by this dependency.
As mentioned in Section 2.2.1, in 2013 DeepMind introduced the experience replay pool, which stores experiences in memory and samples some of them to optimize the neural network model [27]. Using Q-learning, deep learning and the experience replay pool, the improved algorithm, named Deep Q Network (DQN), shows remarkable performance on Atari games according to their paper. Two years later, they found that the agent became more stable when two networks were used [28]. This algorithm can be described as below
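The experience replay pool on its own can be sketched with a fixed-size buffer. This is not the algorithm listing of the original work; the class name `ReplayBuffer`, the tuple layout and the capacity are assumptions of this illustration.

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size memory of transitions; the oldest entries are dropped first."""

    def __init__(self, capacity):
        self.memory = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.memory.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform sampling breaks the correlation between consecutive steps.
        return random.sample(self.memory, batch_size)

    def __len__(self):
        return len(self.memory)

buffer = ReplayBuffer(capacity=3)
for t in range(5):  # hypothetical transitions with integer states
    buffer.push(state=t, action=0, reward=1.0, next_state=t + 1, done=False)
batch = buffer.sample(2)
print(len(buffer), batch)
```

Because the capacity is 3, only the three most recent transitions remain after five insertions.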
All states in Algorithm 3 have to be preprocessed before being fed into the neural network model. Based on Deep Q Network, there are three kinds of improved algorithms, which address the stability of the training process, the importance of each experience and the neural network architecture respectively. Double DQN [50] takes advantage of the two networks. Instead of taking the maximum value from the target network directly, this method chooses the optimal action with the policy network and looks up the corresponding value in the target network. Using the terms of Algorithm 3, with $\theta$ denoting the parameters of the policy network and $\theta^{-}$ those of the target network, the change to the target can be written as follows
$y_j = r_j + \gamma \, Q\big(s_{j+1}, \arg\max_{a'} Q(s_{j+1}, a'; \theta); \theta^{-}\big)$ (2.42)
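The following sketch contrasts the standard DQN target with the Double DQN target for a single transition. It works on plain lists of action values; the function and argument names are hypothetical and chosen only for this illustration.

```python
def dqn_target(reward, next_q_target, gamma, done):
    """Standard DQN: take the maximum value of the target network."""
    if done:
        return reward
    return reward + gamma * max(next_q_target)


def double_dqn_target(reward, next_q_policy, next_q_target, gamma, done):
    """Double DQN: select the action with the policy network,
    evaluate it with the target network."""
    if done:
        return reward
    best_action = max(range(len(next_q_policy)), key=lambda a: next_q_policy[a])
    return reward + gamma * next_q_target[best_action]
```

When the two networks disagree about the best action, the Double DQN target is never larger than the DQN target, which is how the overestimation is reduced.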
Prioritized Experience Replay (PER) introduced a way to sample transitions from the experience replay pool more efficiently [36]. Instead of uniform random sampling, each transition $i$ is sampled with a probability based on its priority
$P(i) = \frac{p_i^{\alpha}}{\sum_k p_k^{\alpha}}$ (2.43)
where $p_i > 0$ is the priority of transition $i$ and $\alpha$ determines how much prioritization is used; in particular, $\alpha = 0$ corresponds to uniform random sampling. The priority can be measured by the TD error $\delta_i$, which is the following term
$\delta_i = r_i + \gamma \max_{a'} Q(s_{i+1}, a'; \theta^{-}) - Q(s_i, a_i; \theta)$ (2.44)
Based on the TD error, $p_i$ can be calculated in two ways. The first is proportional prioritization, which uses the absolute value of the TD error
$p_i = |\delta_i| + \epsilon$ (2.45)
where $\epsilon$ is a small positive constant that avoids zero priority. The other one is rank-based prioritization
$p_i = \frac{1}{\mathrm{rank}(i)}$ (2.46)
where $\mathrm{rank}(i)$ is the rank of transition $i$ when the transitions are sorted by $|\delta_i|$. According to Schaul et al. [36], both proportional and rank-based prioritization can speed up the training, but the latter is more robust and performs better in the presence of outliers.
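The two prioritization schemes and the resulting sampling probabilities can be sketched as below. This is an illustrative plain-Python version; the function names and the default value of `eps` are assumptions, and a real implementation would not recompute everything for the whole pool at each step.

```python
def proportional_priorities(td_errors, eps=1e-2):
    """Priority is the absolute TD error plus a small constant."""
    return [abs(d) + eps for d in td_errors]


def rank_based_priorities(td_errors):
    """Priority is 1 / rank, where rank 1 has the largest absolute TD error."""
    order = sorted(range(len(td_errors)),
                   key=lambda i: abs(td_errors[i]), reverse=True)
    priorities = [0.0] * len(td_errors)
    for rank, i in enumerate(order, start=1):
        priorities[i] = 1.0 / rank
    return priorities


def sampling_probabilities(priorities, alpha):
    """Turn priorities into a distribution; alpha = 0 gives uniform sampling."""
    scaled = [p ** alpha for p in priorities]
    total = sum(scaled)
    return [s / total for s in scaled]
```

Because the rank-based priority depends only on the ordering of the errors, a single very large TD error cannot dominate the distribution, which is one way to read the robustness claim.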
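However, uniform random sampling is abandoned once the priority mechanism is added, which introduces a high bias. In other words, transitions with a small TD error are unlikely to be sampled, so the distribution of the training data is changed. Therefore, the final model may be far from the optimal policy, and the performance of the agent may even be lower than with DQN. Importance sampling (IS) [29] is an effective technique to estimate properties of one distribution using samples drawn from a different distribution. Given a probability density function $p(x)$ over the random variable $X$, the expectation is defined as
$E_p[f(x)] = \int f(x)\, p(x)\, dx$ (2.47)
where $E_p$ denotes the expectation for $x \sim p$ and $f$ is the integrand. Given another probability density function $q(x)$, the expectation can be written as follows
$E_p[f(x)] = \int f(x)\, \frac{p(x)}{q(x)}\, q(x)\, dx = E_q\left[f(x)\, \frac{p(x)}{q(x)}\right]$ (2.48)
where $E_q$ denotes the expectation for $x \sim q$. With Monte Carlo integration, the expectation can be estimated by
$E_p[f(x)] \approx \frac{1}{n} \sum_{k=1}^{n} f(x_k)\, \frac{p(x_k)}{q(x_k)}, \quad x_k \sim q$ (2.49)
In prioritized replay the target distribution is uniform over the $N$ stored transitions and the sampling distribution is $P(i)$, so the ratio for transition $i$ is
$\frac{p(x_i)}{q(x_i)} = \frac{1}{N} \cdot \frac{1}{P(i)}$ (2.50)
Adding a tunable exponent $\beta$, we obtain the importance-sampling weights
$w_i = \left(\frac{1}{N} \cdot \frac{1}{P(i)}\right)^{\beta}$ (2.51)
where $\beta$ is annealed from a user-defined initial value to $1$, and the bias completely disappears when $\beta = 1$. The factor $1 / \max_i w_i$ is used to normalize the weights to increase stability. Using the terms of Algorithm 3, the update of the $Q$ function can be modified as follows
$\theta \leftarrow \theta + \eta \, w_j \, \delta_j \, \nabla_{\theta} Q(s_j, a_j; \theta)$ (2.52)
where $\delta_j$ is the TD error and $\eta$ is the learning rate.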
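A minimal sketch of the importance-sampling correction used with prioritized replay is given below. It assumes that `probabilities` holds the sampling probability of every stored transition, so that its length plays the role of the pool size; the function name is illustrative.

```python
def importance_weights(probabilities, beta):
    """Importance-sampling weights (1 / (N * P(i))) ** beta,
    normalized by the largest weight so that they never exceed 1."""
    n = len(probabilities)
    weights = [(1.0 / (n * p)) ** beta for p in probabilities]
    max_weight = max(weights)
    return [w / max_weight for w in weights]
```

Frequently sampled transitions receive a smaller weight, so their contribution to the gradient is scaled down; with `beta = 0` every weight is 1 and no correction is applied.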
The dueling DQN architecture uses a new concept called the advantage function, which is the difference between the action value function and the state value function [52].
$A(s, a) = Q(s, a) - V(s)$ (2.53)
As shown in Figure 2.10, the dueling network architecture sums two streams, the advantage function and the state value function, to obtain the $Q$ function. The state values can be updated more accurately with this method.
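The aggregation of the two streams can be illustrated on plain numbers. The first function is the simple summation described in this report; the second is the mean-subtracted variant described in the dueling network paper, included only as an alternative since the report does not state which one was implemented. Both function names are hypothetical.

```python
def dueling_q_values(state_value, advantages):
    """Q(s, a) = V(s) + A(s, a) for every action."""
    return [state_value + a for a in advantages]


def dueling_q_values_mean_centered(state_value, advantages):
    """Variant that subtracts the mean advantage, so that V and A
    cannot shift against each other by a constant."""
    mean_advantage = sum(advantages) / len(advantages)
    return [state_value + a - mean_advantage for a in advantages]
```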
3.1 Requirements
3.1.1 Software Requirement
Considering the readability of the code and the availability of widely used frameworks such as Torch, Python is a suitable choice for this project. OpenCV is used to preprocess the images obtained from the environment. Numpy is a Python library which accelerates matrix operations with C. This enables the user to write efficient scientific computing code in Python. There are plenty of deep learning frameworks, such as Tensorflow, which has an extensive API and is widely used in industrial products. However, it takes a relatively long time for a beginner to fully understand the usage of Tensorflow. Pytorch is a recently developed framework which is described as "Numpy with GPU". The simplicity of Pytorch leads more and more academic researchers to use it to implement their new ideas in a much easier way. Because Trex Runner runs on Chrome, the latest Chrome is used here. Gym is a game library developed by OpenAI [7]. This framework provides built-in environments for some famous games such as those of the Atari 2600, and it is easy for users to customize their own environment. Table 3.1 shows all software requirements of this project.
Software  Description

OS  Windows 10
Programming language  Python 3.7.4
Framework  OpenCV, Pytorch, Numpy, Gym
Browser  Chrome 76
3.1.2 Hardware Requirement
As the game runs on Chrome, it is hard to use a Linux server to perform the experiments. Although headless Chrome is a plausible choice, some environment issues were encountered during the investigation. Therefore, all experiments will be run on the author's laptop. This brings some limitations; for example, the 6 GB of GPU memory limits the size of the experience replay pool. Therefore, parameters related to hardware limitations will be set to suitable values without tuning in this project. Table 3.2 lists all hardware used in this project.
Hardware  Description 

CPU  Intel Core i5-8300H
RAM  16 GB
GPU  Nvidia GTX 1060 6 GB
3.2 Game Description
Trex Runner is a dinosaur game from the offline mode of Google Chrome. Anyone can access this link in Chrome to play the game. The goal for the player is to control the dinosaur to overcome as many obstacles as possible. The current score increases with time as long as the dinosaur stays alive; it is shown at the top right corner of Figure 3.1 together with the highest score. As shown in Figure 3.2, the dinosaur has three actions to choose from in every state: do nothing, jump or duck.
The environment plays an important role in reinforcement learning because the agent improves its policy based on the feedback from it. However, it is difficult to quantify the reward for each action as well as the return for an entire episode. Most research on RL algorithms does not consider modifying the reward, although it significantly impacts the performance of the model because it determines the behavior of the agent. For example, reward shaping shows a better performance in the experiments of Ng et al. [30]. It adds a new term to modify the original reward based on the goal
$R'(s, a, s') = R(s, a, s') + F(s, a, s')$ (3.1)
The closer the agent is to the goal, the larger the shaping term $F$ is. However, the aim of this project is to train the agent to play the game and to compare the performance of different algorithms. Therefore, the effect of the reward function will not be investigated, and a fixed reward function will be used across all experiments.
Since there is no previous study on Trex Runner with reinforcement learning, the design of the reward function is a hard part of this project. Intuitively, the best design is to reward the agent for jumping over obstacles and to penalize it for hitting them, with the jumping reward gradually increasing over time. However, this requires object detection in moving pictures. As this task is outside the requirements of this project, we propose a naive reward design as shown in Algorithm 4.
The basic idea of Algorithm 4 is to give a relatively small reward to the agent while it is alive and to penalize it when it hits an obstacle. A zero reward is given for jumping so that the dinosaur only jumps when it is very close to an obstacle, because an unnecessary jump limits the movement in the next few states.
Although there are three kinds of action in this game as introduced in Section 3.2, ducking is optional because the agent can overcome the same obstacles by jumping. Considering that most obstacles in the game are cacti, which can only be overcome by jumping, only two actions (do nothing and jump) will be used in this investigation.
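The idea of this reward design can be sketched as follows. The numeric values and the action encoding are placeholders chosen for illustration only; the actual values of Algorithm 4 are not reproduced here.

```python
DO_NOTHING = 0          # hypothetical action encoding
JUMP = 1

ALIVE_REWARD = 0.1      # assumed small positive reward, not the project's value
CRASH_PENALTY = -1.0    # assumed penalty, not the project's value


def reward(action, crashed):
    """Naive reward: penalize a crash, give nothing for a jump,
    and give a small reward for staying alive without jumping."""
    if crashed:
        return CRASH_PENALTY
    if action == JUMP:
        return 0.0
    return ALIVE_REWARD
```

Because a jump earns less than doing nothing, the agent is only pushed towards jumping when not jumping would lead to the crash penalty.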
3.3 Model Selection
Since there are only two actions in Trex Runner, and according to the literature review on deep reinforcement learning in Section 2.2.1, value-based methods have proved to be powerful enough to handle this kind of game. Although policy-based methods such as proximal policy optimization are a good choice too, only DQN, double DQN, DQN with prioritized experience replay and dueling DQN will be investigated in this project due to the time limitation.
Deep Q Network, which is shown in Algorithm 3, is a basic reinforcement learning algorithm using deep learning. According to the results from DeepMind, it is expected to achieve at least human-level results with DQN alone.
Double DQN mitigates the value overestimation problem by using two networks as shown in Equation 2.42, but it is not expected to achieve a higher performance in this experiment because there are only two actions. The negative effect of overestimation is not obvious under this circumstance.
Dueling DQN adds an advantage function, which is the difference between the action value function and the state value function, before the output layer of the convolutional neural network as shown in Equation 2.53. Since the game evaluated in [52] is a racing game with obstacles similar to Trex Runner, this algorithm is expected to perform better than DQN.
Prioritized experience replay improves training efficiency by changing the sampling distribution of the stored transitions. It assigns a weight to each experience according to its TD error. There are two ways to calculate the priority: the proportional method and the rank-based method. According to [36], the former has a relatively better performance, so only this method will be implemented in this investigation due to the time limitation. The final performance is expected to be the same as DQN because the algorithm itself is unchanged, but it may reach that performance faster.
3.4 Image Preprocessing
Following the preprocessing step in [27, 28], the raw observed image, which is in RGB representation, is converted to grayscale. To make it easier for the network to recognize the dinosaur and the obstacles, unnecessary objects such as clouds and scores are removed. In this step, the colors of the background and the objects are inverted in order to perform erosion and dilation. These two basic morphological operations help to remove small bright regions, which are often noise. Finally, the image is resized to following the recipe from DeepMind. Since the movement should be recognized by the neural network, the same preprocessing is applied to the last four frames in the history, and those four frames are stacked into one data point, which is the input of the CNN. The entire process is shown in Figure 3.3.
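The frame stacking step can be sketched independently of the image operations. In the sketch below a frame can be any object (in practice a preprocessed grayscale array); the class name and the choice to repeat the first frame at the start of an episode are assumptions made for this example.

```python
from collections import deque


class FrameStack:
    """Keeps the most recent frames and returns them as one observation."""

    def __init__(self, num_frames=4):
        self.frames = deque(maxlen=num_frames)

    def reset(self, first_frame):
        # Assumption: at the start of an episode the first frame is repeated.
        for _ in range(self.frames.maxlen):
            self.frames.append(first_frame)
        return self.observation()

    def step(self, frame):
        # Appending to a full deque drops the oldest frame automatically.
        self.frames.append(frame)
        return self.observation()

    def observation(self):
        return list(self.frames)
```

The stacked observation is ordered from oldest to newest, so the network can infer the direction and speed of movement from the differences between frames.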
3.5 Convolutional Neural Network Architecture
There are two kinds of convolutional neural network used in this project. The basic DQN network proposed in [27, 28] uses three convolutional layers and two fully connected layers. No pooling layer is used, so that the movement of the agent can be detected: both max pooling and average pooling may make the neural network ignore a very small change in the image. Therefore, there are only convolutional layers in the feature extraction part of this architecture. The architecture for training the agent using DQN is shown in Figure 3.4.
The dueling architecture proposed in [52] divides the network into two parts. One of them is only related to the state value function $V$; the other is the advantage function $A$, which is affected by both state and action. The final action value function is the sum of the two.
$Q(s, a; \theta, \alpha, \beta) = V(s; \theta, \beta) + A(s, a; \theta, \alpha)$ (3.2)
where $\theta$ denotes the shared parameters of the CNN, $\beta$ the parameters used only by the value function and $\alpha$ the parameters used only by the advantage function. Both DQN and Dueling DQN use Algorithm 3; the only difference is the neural network architecture. RMSprop [48], an adaptive gradient method based on stochastic gradient descent, will be used as the optimization algorithm in this project. This is the same optimization method used by DeepMind [27, 28]. Figure 3.5 shows the process of dueling DQN.
3.6 Experiments
3.6.1 Hyperparameter Tuning
Before the comparison of algorithms, hyperparameter tuning is required to obtain high-performance models. As mentioned before, the memory size is fixed to due to the hardware limitation. Because there is no previous study on this game and the hyperparameters listed in [28] give poor results on it, all other hyperparameters have to be set to suitable values. Grid search is performed to find a workable combination of those parameters.
Due to the time limitation, all parameters will only be slightly modified and only one hyperparameter will vary during each tuning experiment. The choice of each parameter will consider both score and stability. Each parameter will be tuned with episodes.
3.6.2 Comparison of different Deep Q Network Algorithms
There are three improved reinforcement learning algorithms based on DQN mentioned in Section 2.2.6. Double DQN makes the performance of the agent more stable by addressing the value overestimation problem. Prioritized experience replay improves the training efficiency by sampling more valuable transitions. Dueling DQN modifies the neural network architecture to get a better estimate of state values.
In this experiment, DQN will first be used to train the agent with the hyperparameters tuned in Section 3.6.1, and this result will be treated as the baseline for all experiments. Double DQN, DQN with prioritized experience replay and Dueling DQN will then be applied to the agent separately. According to the related papers, the performance of those three is expected to be better than DQN. Due to the time limitation, no combination of those three algorithms will be tested in this project. This section only compares the performance of each algorithm.
3.6.3 Effect of Batch Normalization
As mentioned in Section 2.1.5, batch normalization has been shown to reduce training time and mitigate the vanishing gradient problem in convolutional neural networks. However, there is no evidence that this method has the same effect in reinforcement learning. This section describes the experiments on this point. Based on the experiments in Section 3.6.2, batch normalization is added to each convolutional layer and the results are compared with the outcomes of the previous experiments.
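For reference, the normalization step itself can be sketched in a few lines. This is a plain-Python illustration for a single feature across one mini-batch, not the PyTorch layer used in the experiments; the names `scale`, `shift` and `eps` are illustrative, and the running statistics used at test time are omitted.

```python
def batch_norm(values, scale=1.0, shift=0.0, eps=1e-5):
    """Normalize one feature over a mini-batch to zero mean and unit
    variance, then apply a learnable linear transformation."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    std = (variance + eps) ** 0.5
    return [scale * (v - mean) / std + shift for v in values]
```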
3.7 Evaluation
To evaluate the performance of the agent, DeepMind had the trained agent play each game $30$ times for up to $5$ min with an $\epsilon$-greedy policy with $\epsilon = 0.05$ [28]. Since only one game is investigated in this project, the average score will be used instead of the average reward, because the number of jumps in each episode affects the total reward under the designed reward function. The greedy policy will be used in the evaluation stage instead of the $\epsilon$-greedy policy because the latter brings randomness to the decisions, which would affect the measured performance of the trained model. Therefore, the trained agent will play the game for times without time limitation using the greedy policy. All outcomes will be compared with the results from a human expert.
The average scores during the training stage will be shown graphically. This is a clear way to show the learning efficiency of each algorithm. Both graphical and statistical results such as mean, variance and median will be analyzed. However, only statistical results will be analyzed in the testing stage, because the trained model of each algorithm is fixed and there is no increasing trend to show as in the training stage. These results will be visualized with a boxplot.
4.1 Hyper Parameter Tuning
The values of the hyperparameters may affect the performance of the model. However, there are many parameters in reinforcement learning, including optimization parameters such as the learning rate, so finding the optimal combination with a grid search may take a long time. Since there is no metric in RL, like accuracy, which easily reflects the performance of the model, we assume each parameter is independent of the others. Therefore, the parameters can be tuned one after another. Because the objective of this project is to compare the performance of different algorithms and the effect of batch normalization, the parameters tuned with DQN will be used across all experiments. The starting hyperparameters of DQN are shown in Table 4.1.
Hyper parameter  Value  Description 

Memory Size  Size of experience replay pool  
Batch Size  128  Size of minibatch to optimize model 
Gamma  0.99  Discount factor 
Initial  Explore probability at the start of the training  
Final  End point of explore probability in decay  
Explore steps  Number of steps for decay from to  
Learning Rate  Learning speed of the model 
4.1.1 Learning Rate
The learning rate controls the learning speed of the model: too large a value will result in divergence, and too small a value may double the training time.
4.1.2 Batch Size
The batch size defines how many transitions are used in each update of the neural network, which may affect the training speed. However, as mentioned in Section 2.2.1, too large a batch causes dependency problems which may considerably affect the performance of the model.
As shown in Figure 4.2, the average score of three curves at epoch are all around . Among those three, the most stable one is batch size 128.
4.1.3 Epsilon
The $\epsilon$-greedy policy determines the probability of exploration. In some games, especially those with large action spaces, this value can affect how well the model converges. However, there are only two actions in Trex Runner, so it is unnecessary to choose actions completely at random at the beginning. Instead of initializing $\epsilon$ to $1$ as DeepMind did in their paper [28], the start value is set to in this model.
All experiments achieve acceptable results in Figure 4.3 except the one with fixed $\epsilon$. In this case, we select from to but any of those three could be chosen according to this graph. This experiment also demonstrates the positive effect of linear annealing for $\epsilon$.
4.1.4 Explore Step
The explore step is the number of steps required to anneal $\epsilon$ from its initial value to its final value. As mentioned above, hyperparameters related to exploration do not have much effect in this game. The most stable one will be selected from Figure 4.4 which is .
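Linear annealing of the exploration probability can be sketched as a small function. The argument names are illustrative, and the numbers used in the usage note are examples rather than the tuned values of this project.

```python
def linear_epsilon(step, eps_start, eps_end, explore_steps):
    """Linearly anneal epsilon from eps_start to eps_end over
    explore_steps steps, then keep it at eps_end."""
    if step >= explore_steps:
        return eps_end
    fraction = step / explore_steps
    return eps_start + fraction * (eps_end - eps_start)
```

For example, with a start value of 1.0, a final value of 0.0 and 100 explore steps, the function returns 0.5 at step 50 and stays at 0.0 from step 100 onwards.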
4.1.5 Gamma
The discount factor decides how far-sighted the agent will be. Too small a value makes the agent focus on the current reward, while too large a value makes the agent pay almost the same attention to all rewards after the current time point. This may confuse the agent about which action leads to a high or low return.
Figure 4.5 shows the average score for four different gamma. Obviously, make the agent shortsighted and there is no significant change during epochs. When , the average score fluctuates widely after 50th epoch. Since $\gamma = 0.99$ shows a gradually increasing trend, it will be used as the final discount factor.
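The effect of the discount factor on the return can be checked with a short sketch. The function name is illustrative and the rewards in the usage note are made up for the example.

```python
def discounted_return(rewards, gamma):
    """Return r_0 + gamma * r_1 + gamma^2 * r_2 + ... for one episode."""
    total = 0.0
    # Accumulate from the last reward backwards: G_t = r_t + gamma * G_{t+1}.
    for r in reversed(rewards):
        total = r + gamma * total
    return total
```

With three rewards of 1, a discount factor of 0.5 gives a return of 1.75, while a discount factor of 0 gives 1, so only the immediate reward counts.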
4.2 Training Results
The tuned hyperparameters from the previous experiment are listed in Table 4.2. Although these parameters were tuned with the DQN algorithm, they are expected to fit the three improved algorithms, which are Double DQN, Dueling DQN and DQN with prioritized experience replay, because there is no big difference among them. All algorithms will only be trained for 200 epochs because of the time limitation. The total training time for each algorithm is shown in the last column of Table 4.4.
Hyper parameter  Value before tune  Value after tune 

Memory Size  
Batch Size  128  128 
Gamma  0.99  0.99 
Initial  
Final  
Explore steps  
Learning Rate 
4.2.1 DQN
Figure 4.6 shows the result of the DQN algorithm for 200 epochs with tuned parameters. A gradually increasing average score can be seen in this graph. This not only shows that the agent can learn to play the game with DQN but also suggests that the design of the reward function is relatively reasonable. This result will be treated as the baseline and compared with the other algorithms.
4.2.2 Double DQN
Double DQN has a similar performance in training compared with DQN. As mentioned before, the effect of overestimation is not so significant in Trex Runner because there are only two actions. As shown in Figure 4.7, there are four data points with average scores below while all average scores are above this value in DQN.
4.2.3 Dueling DQN
Surprisingly, dueling DQN shows an incredible training performance after the 150th epoch while the curve before that time seems similar. In Figure 4.8, the average score is above which is ten times higher than the maximum average score in DQN. However, these scores have a high variance which fluctuates widely between and . From the graph, the training process of dueling DQN is stable before the 150th epoch and ends with an increasing trend. Since we tuned all hyperparameters based on DQN, these values may not be the best for the dueling network, which may explain the stable and relatively low average scores before the 150th epoch.
4.2.4 DQN with Prioritized Experience Replay
Another important finding in this section is the performance of prioritized experience replay. This is expected to have a shorter training time and a higher performance compared with DQN. However, the result shown in Figure 4.9 suggests that the agent failed to learn to play the game with this method. There are two reasons for that.
One problem comes from the algorithm. Compared with DQN, two extra steps are applied in PER: weight calculation and priority update. Following the implementation in [36], a sum tree, which is a data structure with $O(\log N)$ time complexity for sampling and updating, is used to store transitions instead of a linear list to accelerate memory-related operations. The training time of PER is more than twice that of DQN because of the batch size. Since all sampled transitions are traversed when updating the priorities, the larger the batch size, the longer this operation takes. Table 4.3 shows that this process is very time-consuming even with a batch size of $32$. These data are extracted from the training results at the same score of 43. The step size is the average value over ten records.
Algorithm  Score  Batch Size  Step Size 

DQN  43  128  180 
DQN with PER  43  128  7 
DQN with PER  43  32  22 
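The sum tree mentioned above can be sketched as an array-based binary tree in which every internal node stores the sum of the priorities below it. This is a minimal illustration with hypothetical method names, not the implementation used in the experiments.

```python
class SumTree:
    """Binary tree stored in a list: internal nodes first, then the leaves.
    Each internal node holds the sum of its two children."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.nodes = [0.0] * (2 * capacity - 1)
        self.data = [None] * capacity
        self.write = 0  # position of the next leaf to overwrite

    def total(self):
        return self.nodes[0]

    def add(self, priority, item):
        leaf = self.write + self.capacity - 1
        self.data[self.write] = item
        self.update(leaf, priority)
        self.write = (self.write + 1) % self.capacity

    def update(self, leaf, priority):
        # Propagate the change up to the root: O(log N) steps.
        change = priority - self.nodes[leaf]
        self.nodes[leaf] = priority
        while leaf > 0:
            leaf = (leaf - 1) // 2
            self.nodes[leaf] += change

    def get(self, value):
        # Walk down from the root: O(log N) steps.
        idx = 0
        while 2 * idx + 1 < len(self.nodes):
            left = 2 * idx + 1
            if value <= self.nodes[left]:
                idx = left
            else:
                value -= self.nodes[left]
                idx = left + 1
        return idx, self.nodes[idx], self.data[idx - self.capacity + 1]
```

Drawing `value` uniformly from the interval between 0 and `total()` and calling `get(value)` returns each transition with probability proportional to its priority. Every sampled transition also needs one `update` call after its new TD error is known, which is the per-batch cost discussed above.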
The other problem comes from the game. Because this game is based on Chrome, it keeps running while the optimization is performed, whereas the games in the official OpenAI Gym are paused during this operation. Therefore, there is a delay before the action is sent to Chrome. This influence is enlarged in prioritized experience replay since the time for update operation with batch size takes approximately times longer than normal DQN.
Changing the choice of hyperparameters can mitigate the first problem, but the result is still not as good as with the other algorithms. One thing we can expect is that PER is unable to help the agent get a higher score under this circumstance, because the game speed increases as time goes by. Since the time for updating the priorities does not change much, the effective interval between two consecutive decisions becomes longer relative to the game speed. This may limit the performance of the model. To eliminate the high computational cost of updating the priorities, the best way is to redevelop the game. However, due to the time limitation and the primary objective of this study, this result will be used as it is, since we can still compare the effect of batch normalization on this algorithm.
4.2.5 Batch Normalization
Since the aim of this experiment is to find out how batch normalization affects the DQN algorithms, each result is compared with the one without batch normalization, as shown in Figure 4.10.
From Figure 4.10, we can see that batch normalization increases the mean of the average scores in all experiments. However, it also brings a high variance, which makes the average score diverge. According to the top-left graph, the first time for DQN agent to reach the average is approximately th epoch while the agent using DQN with batch normalization reach the same average score at th epoch, and it is easy for it to get a higher score after that time. The Double DQN curve has a similar trend, but batch normalization results in wide fluctuation in both of them. It is hard to say whether the dueling network benefits from batch normalization because there is a significant increasing trend in the bottom-left graph. However, it can still be seen that BN enable the agent to reach the same performance much earlier from th epoch to th epoch. For DQN with prioritized experience replay, even though the performance is limited by the game itself, the one with batch normalization still gets a relatively higher score.
4.2.6 Further Discussion
As the graphical results and some explanation of them are shown above, this part discusses the numerical results of the experiments. Table 4.4 shows some statistical data from the training process. The maximum score is pointless in most games, but considering that Trex Runner is a racing game, we still include it in the table. The 25%, 50% and 75% columns are percentile data, which are calculated by sorting the scores in ascending order and finding the observation at the given percentage. So the 50% value is the same as the median. The last column shows the training time for each algorithm.
Algorithm  Mean  Std  Max  25%  50%  75%  Time (h) 

DQN  537.50  393.61  1915  195.75  481  820  25.87 
Double DQN  443.31  394.01  2366  97.75  337  662.25  21.36 
Dueling DQN  839.04  1521.40  25706  155  457  956.5  35.78 
DQN with PER  43.50  2.791  71  43  43  43  3.31 
DQN (BN)  777.54  917.26  8978  97.75  462.5  1139.25  32.59 
Double DQN (BN)  696.43  758.81  5521  79  430.5  1104.25  29.40 
Dueling DQN (BN)  1050.26  1477.00  14154  84  541.5  1520  40.12 
DQN with PER (BN)  46.14  7.54  98  43  43  43  3.44 
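The percentile columns can be computed as in the sketch below. The linear interpolation between neighbouring observations is an assumption: the fractional values in the table suggest that some interpolation was used, but the exact rule is not stated in this report.

```python
def percentile(values, q):
    """q-th percentile (0 to 100) of a list of scores, with linear
    interpolation between the two nearest sorted observations."""
    ordered = sorted(values)
    position = (len(ordered) - 1) * q / 100.0
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = position - lower
    return ordered[lower] + fraction * (ordered[upper] - ordered[lower])
```

For the scores 4, 1, 3 and 2, the 50th percentile is 2.5, which equals the median of the four values.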
Ignoring the results from prioritized experience replay because of the unsuitable game environment, all algorithms achieve great results according to Table 4.4. The two algorithms with the dueling network stand out from the rest. The one with batch normalization has the higher mean (1050.26 against 839.04 without BN). However, the latter reached the maximum score of 25706, which means the agent can keep running for around half an hour in one episode. Both of them have a standard deviation which exceeds the mean.
Double DQN performs worse than DQN both with and without BN. This indicates that double DQN may reduce the performance in a low-dimensional action space. However, batch normalization narrows the gap between those two algorithms, which can be seen from the median and the percentiles.
Although most of the statistical metrics are improved by batch normalization, the variance is much higher than before. As shown in the table, the standard deviation of DQN with BN is more than twice that of DQN without BN. Only the dueling network has a lower value after BN, which is reasonable because there is an incredible increase in the very late stage of the training shown in Figure 4.10.
4.3 Testing Results
After training the agent for episodes, we use the latest model with greedy policy and play Trex Runner for times with each algorithm. Figure 4.11 shows the boxplot of those results as well as the data collected from the human expert. It is obvious that the agent trained by DQN with prioritized experience replay fails to learn to play the game because of the game environment issue discussed in the last section. It is surprising that the performance of double DQN is far from satisfactory even though its training results are similar to those of DQN. Table 4.5 shows that the mean of the DQN results is about three times higher than that of double DQN. The dueling DQN algorithm achieves the highest score, although it also has the highest standard deviation, which is about three times that of DQN.
According to Table 4.5, batch normalization improves the performance of the model for DQN and double DQN, and even the mean of DQN with PER is increased. However, it is not easy to say whether the effect of BN on dueling DQN is positive or not. In Figure 4.11, the one without BN has more outliers, which results in a high variance even though its mean is higher. Considering the median, which is not sensitive to outliers, the one with BN is better, and its minimum score is more than 200, which stands out from the other algorithms. Since a score of 43 indicates the first time the agent meets an obstacle, it is easy to infer that all trained models fail to jump over the first cactus at least once, except dueling DQN with BN. However, dueling DQN is not fully trained, which can be seen from the training result in Figure 4.8. That is also one reason for the high variance seen in the boxplot. The agent trained with dueling DQN achieved a score of over 8000 at least three times.
Algorithm  Mean  Std  Min  Max  25%  50%  75% 

Human  1121.9  499.91  268  2384  758  992.5  1508.5 
DQN  1161.30  814.36  45  3142  321.5  1277  1729.5 
Double DQN  340.93  251.40  43  942  178.75  259.5  400.75 
Dueling DQN  2383.03  2703.64  44  8943  534.75  1499.5  2961 
DQN with PER  43.30  1.64  43  52  43  43  43 
DQN (BN)  2119.47  1595.49  44  5823  1218.75  1909.5  2979.75 
Double DQN (BN)  382.17  188.74  43  738  283.75  356  525.5 
Dueling DQN (BN)  2083.37  1441.50  213  5389  1142.5  1912.5  2659.75 
DQN with PER (BN)  45.43  7.384  43  78  43  43  43 
Bibliography
 [1] (1971) Mean square error of prediction as a criterion for selecting variables. Technometrics 13 (3), pp. 469–475. Cited by: §2.1.3.
 [2] (1983) Neuronlike adaptive elements that can solve difficult learning control problems. IEEE transactions on systems, man, and cybernetics (5), pp. 834–846. Cited by: §2.2.1.
 [3] (1957) A markov decision process. journal of mathematical mechanics. Cited by: §2.2.1.
 [4] (1954) The theory of dynamic programming. Bulletin of the American Mathematical Society 60 (6), pp. 503–515. Cited by: §2.2.1.
 [5] (1958) Combinatorial processes and dynamic programming. Technical report RAND CORP SANTA MONICA CA. Cited by: §2.2.1.
 [6] (2010) Learning midlevel features for recognition. In 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 2559–2566. Cited by: §2.1.4.
 [7] (2016) Openai gym. arXiv preprint arXiv:1606.01540. Cited by: §3.1.1.
 [8] (1955) Generalization of pattern recognition in a selforganizing system. In Proceedings of the March 13, 1955, western joint computer conference, pp. 86–91. Cited by: §2.2.1.
 [9] (2009) Imagenet: a largescale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pp. 248–255. Cited by: §2.1.1.
 [10] (1980) Neocognitron: a selforganizing neural network model for a mechanism of pattern recognition unaffected by shift in position. Biological cybernetics 36 (4), pp. 193–202. Cited by: §2.1.1.

[11]
(2015)
Escaping from saddle points—online stochastic gradient for tensor decomposition
. In Conference on Learning Theory, pp. 797–842. Cited by: §2.1.3.  [12] (2015) Resnetdeep residual learning for image recognition. ResNet: Deep Residual Learning for Image Recognition. Cited by: §2.1.1.
 [13] (2012) Matrix calculus: derivation and simple application. Technical report Cited by: §2.1.3.
 [14] (2017) Densely connected convolutional networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4700–4708. Cited by: §2.1.1.
 [15] (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167. Cited by: §2.1.5.
 [16] (1972) Brain function and adaptive systems: a heterostatic theory. Technical report AIR FORCE CAMBRIDGE RESEARCH LABS HANSCOM AFB MA. Cited by: §2.2.1, §2.2.1.
 [17] (1982) The hedonistic neuron: a theory of memory, learning, and intelligence. ToxicologySci. Cited by: §2.2.1.
 [18] (2000) Actorcritic algorithms. In Advances in neural information processing systems, pp. 1008–1014. Cited by: §2.2.1.
 [19] (2012) Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pp. 1097–1105. Cited by: §2.1.1.
 [20] (1997) Face recognition: a convolutional neuralnetwork approach. IEEE transactions on neural networks 8 (1), pp. 98–113. Cited by: §2.1.1.
 [21] (1998) Gradientbased learning applied to document recognition. Proceedings of the IEEE 86 (11), pp. 2278–2324. Cited by: §2.1.1, §2.1.1.
 [22] (1992) Selfimproving reactive agents based on reinforcement learning, planning and teaching. Machine learning 8 (34), pp. 293–321. Cited by: §2.2.1.
 [23] (2013) Network in network. arXiv preprint arXiv:1312.4400. Cited by: §2.1.1.
 [24] (1943) A logical calculus of the ideas immanent in nervous activity. The bulletin of mathematical biophysics 5 (4), pp. 115–133. Cited by: §2.1.1.
 [25] (1969) Perceptron: an introduction to computational geometry. The MIT Press, Cambridge, expanded edition 19 (88), pp. 2. Cited by: §2.1.1.
 [26] (2016) Asynchronous methods for deep reinforcement learning. In International conference on machine learning, pp. 1928–1937. Cited by: §2.2.1.
 [27] (2013) Playing atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602. Cited by: §2.2.1, §2.2.6, §3.4, §3.5, §3.5.
 [28] (2015) Human-level control through deep reinforcement learning. Nature 518 (7540), pp. 529. Cited by: §2.2.1, §2.2.6, §3.4, §3.5, §3.5, §3.6.1, §3.7, §4.1.3.
 [29] (2001) Annealed importance sampling. Statistics and computing 11 (2), pp. 125–139. Cited by: §2.2.6.
 [30] (1999) Policy invariance under reward transformations: theory and application to reward shaping. In ICML, Vol. 99, pp. 278–287. Cited by: §3.2.
 [31] (1957) The perceptron, a perceiving and recognizing automaton project para. Cornell Aeronautical Laboratory. Cited by: §2.1.1.
 [32] (1960) Perceptron simulation experiments. Proceedings of the IRE 48 (3), pp. 301–309. Cited by: §2.1.1.
 [33] (1988) Learning representations by back-propagating errors. Cognitive modeling 5 (3), pp. 1. Cited by: §2.1.1.
 [34] (1994) On-line q-learning using connectionist systems. Vol. 37, University of Cambridge, Department of Engineering Cambridge, England. Cited by: §2.2.1, §2.2.5.
 [35] (1959, September 15) Aerosol dispensers and like pressurized packages. Google Patents. Note: US Patent 2,904,229. Cited by: §2.2.1.
 [36] (2015) Prioritized experience replay. arXiv preprint arXiv:1511.05952. Cited by: §2.2.1, §2.2.6, §3.3, §4.2.4.
 [37] (2010) Evaluation of pooling operations in convolutional architectures for object recognition. In International conference on artificial neural networks, pp. 92–101. Cited by: §2.1.4.
 [38] (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347. Cited by: §2.2.1.
 [39] (2017) Mastering chess and shogi by self-play with a general reinforcement learning algorithm. arXiv preprint arXiv:1712.01815. Cited by: §2.2.1.
 [40] (2015) University college london course on reinforcement learning. University College London. Cited by: Definition 1, Definition 2, Definition 3, Definition 4, Definition 5, Definition 6, Definition 7, Theorem 1, Theorem 2.
 [41] (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556. Cited by: §2.1.1.
 [42] (2018) Reinforcement learning: an introduction. MIT press. Cited by: Figure 2.9, §2.2.1, §2.2.4, §2.2.5.
 [43] (2000) Policy gradient methods for reinforcement learning with function approximation. In Advances in neural information processing systems, pp. 1057–1063. Cited by: §2.2.1.
 [44] (1988) Learning to predict by the methods of temporal differences. Machine learning 3 (1), pp. 9–44. Cited by: §2.2.1.

 [45] (2017) Inception-v4, inception-resnet and the impact of residual connections on learning. In Thirty-First AAAI Conference on Artificial Intelligence. Cited by: §2.1.1.
 [46] (2015) Going deeper with convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1–9. Cited by: §2.1.1.
 [47] (1911) Animal intelligence; experimental studies, by edward l. thorndike. The Macmillan company, New York. Cited by: §2.2.1.
 [48] (2012) Lecture 6.5-rmsprop, coursera: neural networks for machine learning. University of Toronto, Technical Report. Cited by: §3.5.
 [49] (1994) Original approach for the localisation of objects in images. IEE Proceedings-Vision, Image and Signal Processing 141 (4), pp. 245–250. Cited by: §2.1.1.
 [50] (2016) Deep reinforcement learning with double q-learning. In Thirtieth AAAI conference on artificial intelligence. Cited by: §2.2.1, §2.2.6.
 [51] (1995) Phoneme recognition using time-delay neural networks. Backpropagation: Theory, Architectures and Applications, pp. 35–61. Cited by: §2.1.1.
 [52] (2015) Dueling network architectures for deep reinforcement learning. arXiv preprint arXiv:1511.06581. Cited by: Figure 2.10, §2.2.1, §2.2.6, §3.3, §3.5.
 [53] (1989) Learning from delayed rewards. Cited by: §2.2.1, §2.2.5.
 [54] (1973) Punish/reward: learning with a critic in adaptive threshold systems. IEEE Transactions on Systems, Man, and Cybernetics (5), pp. 455–465. Cited by: §2.2.1.
 [55] (1960) Adaptive switching circuits. Technical report Stanford Univ Ca Stanford Electronics Labs. Cited by: §2.2.1.