Reinforcement Learning and Video Games

09/10/2019 ∙ by Yue Zheng, et al. ∙ 0

Reinforcement learning has exceeded human-level performance in game-playing AI with deep learning methods, according to DeepMind's experiments on Go and Atari games. Deep learning solves the high-dimensional input problem that held back the development of reinforcement learning for many years. This study uses both techniques to create several agents with different algorithms that successfully learn to play T-rex Runner. The Deep Q-Network algorithm and three types of improvements are implemented to train the agent. The results from some of them are far from satisfactory, but others are better than human experts. Batch normalization is a method for solving the internal covariate shift problem in deep neural networks. Its positive influence on reinforcement learning is also demonstrated in this study.

Declaration

All sentences or passages quoted in this document from other people’s work have been specifically acknowledged by clear cross-referencing to author, work and page(s). Any illustrations that are not the work of the author of this report have been used with the explicit permission of the originator and are specifically acknowledged. I understand that failure to do this amounts to plagiarism and will be considered grounds for failure.

Name: Yue Zheng
 

1.1 Background

Applications of Artificial Intelligence have become widespread in recent years. As one part of the field, Reinforcement Learning has achieved incredible results in game playing. An intelligent agent is created and trained with reinforcement learning algorithms to accomplish this task. At the Future of Go Summit 2017, AlphaGo, an AI player trained with deep reinforcement learning algorithms, won three games against the world's best human Go player. The success of reinforcement learning in this area shocked the world and prompted research in many other areas, such as driverless cars. Deep learning methods such as convolutional neural networks contribute a lot to this because they solve the problems of handling high-dimensional input data and extracting features.

T-rex Runner is a dinosaur game built into the offline mode of Google Chrome. The aim of the player is to avoid all obstacles and score as high as possible, up to the game's score limit. The obstacles move faster as time goes by, which makes it difficult to reach the highest score.

The code of this project, which is written in Python, can be found in this link.

1.2 Aim of the project

The aim of this project is to create an agent using different algorithms to play T-rex Runner and to compare their performance. Internal covariate shift is the change in the distribution of each layer's inputs during training, which may result in longer training time, especially in deep neural networks. To cope with this problem, batch normalization applies a linear transformation to each feature so that the data are normalized to the same mean and variance. The same problem may also occur in deep reinforcement learning because the decision is based on a neural network. Beyond the comparison of different reinforcement learning algorithms, this project will also investigate the effect of batch normalization. The overall objectives of this project are listed below.

  • Create an agent to play T-rex Runner

  • Compare the difference among different reinforcement learning algorithms

  • Investigate the effect of batch normalization in reinforcement learning

1.3 Overview

This study opens with a literature review on deep learning and reinforcement learning. Each section covers the history of the field and the techniques related to this study. Chapter 3 describes the game and the choice of algorithms according to the literature review. The entire processing pipeline is shown, as well as the architecture of the model. The design of the experiments and the evaluation methods are also presented in that chapter. Chapter 4 shows the results of all the experiments and discusses each of them. Chapter 5 presents the conclusion of this study and the proposed future work.

2.1 Deep Learning

Deep learning is a class of Machine Learning models based on the Artificial Neural Network (ANN). Two kinds of deep learning model have been widely used in recent years. One is the Recurrent Neural Network, which has shown its power in Natural Language Processing. The other, which plays an important role in deep reinforcement learning, is the Convolutional Neural Network (CNN). It is one of the most effective models for computer vision problems such as object detection and image classification. This section gives a brief introduction to deep learning and detailed information about convolutional neural networks.

2.1.1 History of Deep Learning

An artificial neural network is a computation system inspired by biological neural networks, first proposed by McCulloch, a neurophysiologist [24]. In 1957, the Perceptron was invented by Frank Rosenblatt [31]. Three years later, his experiments showed that this algorithm could recognize some letters of the alphabet [32]. However, Marvin Minsky proved that a single-layer perceptron cannot solve the XOR problem [25]. This stopped the development of ANNs until Rumelhart et al. showed in 1988 that useful representations can be learned with a multi-layer perceptron, also called a neural network, and the backpropagation algorithm [33]. One year later, LeCun et al. first used a five-layer neural network and backpropagation to solve a digit classification problem and achieved great results [21]. Their innovative model is known as LeNet, which marks the beginning of the convolutional neural network.

The origin of the CNN is the Neocognitron proposed by Fukushima, a self-organized neural network model with multiple layers [10]. This model achieved good results in object detection tasks because it is not position-sensitive. As mentioned before, LeCun et al. invented LeNet and achieved an error rate of less than 1% on the MNIST handwritten digits dataset in 1998 [21]. The model used convolutions and sub-sampling, which are called the convolution layer and the pooling layer today, to convert the original images into feature vectors, and performed classification with fully connected layers. At the same time, some neural network models showed acceptable results in face recognition [20], speech recognition [51] and object detection [49]. But the lack of reliable theory caused research on CNNs to stagnate for many years.

In the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) 2012 [9], Alex Krizhevsky and his team achieved a top-5 error rate of 15.3% with an eight-layer deep neural network (AlexNet) [19]. This was a significant result compared with that of the second-ranked participant, which was 26.2%. Going beyond LeNet, AlexNet used eight layers to train the classifier, together with data augmentation, dropout and ReLU, which mitigated the overfitting problem. Another significant discovery was that parallel computing with multiple GPUs can greatly decrease the training time.

Two years later, Simonyan and Zisserman introduced a sixteen-layer neural network (VGGNet), which took first place in the localization task and second place in the classification task of ILSVRC 2014 [41]. This model achieved state-of-the-art results at that time. VGGNet also showed that using smaller filters and a deeper network can improve the performance of a CNN. No filter in the model was larger than $3 \times 3$, while the filters in the first two layers of AlexNet were $11 \times 11$ and $5 \times 5$. In the same year, GoogLeNet [46], the best model in the ILSVRC 2014 classification and detection tasks, first used inception, which was proposed by Lin [23], to solve the vanishing gradient problem. Inception replaced one node with a network consisting of several convolutional layers and pooling layers, whose outputs are concatenated before being passed to the next layer. This change made the feature selection between two layers more flexible. In other words, it can be updated by the backpropagation algorithm.

Another problem of deep neural networks was degradation, in which optimization difficulty results in high training error. To solve that problem, He et al. proposed a deep residual learning framework (ResNet) that uses a residual mapping instead of stacking layers directly [12]. This model won ILSVRC 2015 with a top-5 error rate of only 3.57%. Their experiments showed that this new framework can not only solve the degradation problem but also improve computing efficiency. There are many variants based on ResNet, such as Inception-ResNet [45] and DenseNet [14]. The former combines improved inception techniques with ResNet. In the latter, every pair of convolutional layers is connected. This change mitigates the vanishing gradient problem and improves the propagation of features.

2.1.2 Deep Neural Network and Activation Function

A neural network, or multi-layer perceptron, consists of three main components: the input layer, the hidden layer, and the output layer. Each unit in a layer is called a neuron. The input data are fed into the input layer and undergo a linear transformation through the weights of the hidden layer. Finally, the result is made non-linear by an activation function and fed into the output layer.

The activation function enables the network to learn more complicated relationships between inputs and outputs. There are three widely used activation functions, shown in Figure 2.1: sigmoid, tanh and ReLU. ReLU is the most commonly used of the three because it has a low computational requirement and handles the vanishing gradient problem better than the other two.

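$$\mathrm{sigmoid}(x) = \frac{1}{1 + e^{-x}} \tag{2.1}$$
$$\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}} \tag{2.2}$$
$$\mathrm{ReLU}(x) = \max(0, x) \tag{2.3}$$
Figure 2.1: Three types of activation functions.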
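The three activation functions can be written in a few lines of Python. This is only an illustrative sketch using the standard library; the function names are chosen here and are not taken from the project code.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic sigmoid: maps any real input into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def tanh(x: float) -> float:
    """Hyperbolic tangent: maps any real input into (-1, 1)."""
    return math.tanh(x)


def relu(x: float) -> float:
    """Rectified linear unit: keeps positive inputs and zeroes the rest."""
    return max(0.0, x)


# Evaluate the three functions on a few sample inputs.
for x in (-2.0, 0.0, 2.0):
    print(x, sigmoid(x), tanh(x), relu(x))
```

ReLU needs only a comparison, whereas sigmoid and tanh need exponentials, which is one way to see its lower computational requirement.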

To illustrate the entire process in a neural network, consider the example in Figure 2.2. Given input data $x$ and a randomly generated weight $w$ in the hidden layer, the output of the neural network is $\hat{y}$ and the activation of the hidden layer is $\sigma$. Therefore, the estimated value $\hat{y}$ can be calculated by

$$\hat{y} = \sigma\left(\sum_{i=1}^{n} w_i x_i + b\right) \tag{2.4}$$

Figure 2.2: A simple neural network.

where $b$ is the bias of the hidden layer and $n$ is the number of features. If the number of hidden layers in the neural network is greater than two, it is also called a Deep Neural Network (DNN). Consider a simple DNN with three hidden layers, shown in Figure 2.3. Given the same input $X$ in matrix form, and with $W^{(l)}$ the weight between the $l$-th and $(l+1)$-th layer, the output of the $(l+1)$-th layer can be calculated by

$$a^{(l+1)} = \sigma^{(l)}\left(a^{(l)} W^{(l)}\right) \tag{2.5}$$

Figure 2.3: A simple deep neural network with three hidden layers.

where $\sigma^{(l)}$ is the activation function between the $l$-th and $(l+1)$-th layer and $l \in \{0, 1, 2, 3\}$; in particular, $a^{(0)} = X$. The dimensions of those variables are shown in Table 2.1.

Parameter Description Dimension
Input data
Weight between input layer and hidden layer 1
Weight between hidden layer 1 and hidden layer 2
Weight between hidden layer 2 and hidden layer 3
Weight between hidden layer 3 and output layer
Table 2.1: Dimension description of the deep neural network
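The layer-by-layer computation of a deep neural network can be sketched in plain Python. The layer sizes (4 inputs, three hidden layers of 5 units, 1 output), the uniform weight initialization and the choice of ReLU for hidden layers are assumptions made for illustration only; they are not the dimensions of Table 2.1.

```python
import random


def relu(values):
    """Element-wise ReLU on a list of numbers."""
    return [max(0.0, v) for v in values]


def identity(values):
    """Linear output: returns a copy of the input."""
    return list(values)


def dense(x, weights, bias, activation):
    """One layer: activation(x W + b), where weights[i][j] links input i to unit j."""
    z = [
        sum(x[i] * weights[i][j] for i in range(len(x))) + bias[j]
        for j in range(len(bias))
    ]
    return activation(z)


def init_layer(n_in, n_out, rng):
    """Randomly generated weights and zero biases (hypothetical initialization)."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_out)] for _ in range(n_in)]
    return weights, [0.0] * n_out


def forward(x, layers):
    """Feed x through every (weights, bias, activation) triple in order."""
    a = x
    for weights, bias, activation in layers:
        a = dense(a, weights, bias, activation)
    return a


rng = random.Random(0)
sizes = [4, 5, 5, 5, 1]                      # input, three hidden layers, output
activations = [relu, relu, relu, identity]
layers = [
    init_layer(n_in, n_out, rng) + (act,)
    for n_in, n_out, act in zip(sizes[:-1], sizes[1:], activations)
]
print(forward([0.1, 0.2, 0.3, 0.4], layers))
```

Each weight matrix has one row per unit of the preceding layer and one column per unit of the following layer, so the shapes chain together from input to output.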

2.1.3 Backpropagation Algorithm

Section 2.1.2 introduces how to estimate a label using a deep neural network. In order to optimize this estimation, a cost function $L$ is used to quantify the difference between the estimated value $\hat{y}$ and the true value $y$. To give a simple example, the Mean Square Error [1], which is often applied to regression problems, is used in this section to illustrate how the backpropagation algorithm works. Equation 2.6 shows the form of the mean square error.

$$L(x, W) = \frac{1}{N}\sum_{i=1}^{N}\left(\hat{y}_i - y_i\right)^2 \tag{2.6}$$

where $x$ is the input data, $W$ represents all weights used in the model and $N$ is the number of samples. Thus, the optimization problem can be described as follows

$$\min_{W} L(x, W) \tag{2.7}$$

Stochastic Gradient Descent (SGD) is an effective way to solve this optimization problem if $L$ is convex. However, it still shows acceptable results in deep neural networks even though there is no guarantee of reaching the global optimum in non-convex optimization [11]. Instead of finding the optimal point directly, SGD optimizes the objective function 2.7 iteratively by the following equation

$$W \leftarrow W - \alpha \frac{\partial L(x, W)}{\partial W} \tag{2.8}$$

where $\alpha$ is the learning rate, which controls the update speed of the weights. Using gradient methods such as SGD to optimize the cost function of a neural network is called the backpropagation algorithm. Considering the deep neural network shown in Figure 2.3 and the techniques of matrix calculus [13], the gradient of the cost function with respect to the weights is

(2.9)

where is element-wise matrix multiplication and is the gradient with respect to . Results for other can be calculated in a similar way. With equation 2.8, can be updated during each iteration by

(2.10)
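The iterative update of Equation 2.8 can be illustrated on a model far simpler than a deep network. The sketch below fits a one-feature linear model with SGD on a squared-error loss; the data, the learning rate of 0.05 and the number of epochs are assumptions chosen for the example.

```python
import random


def sgd_linear(data, learning_rate=0.05, epochs=200, seed=0):
    """Fit y_hat = w * x + b by stochastic gradient descent on (y_hat - y)**2."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    samples = list(data)
    for _ in range(epochs):
        rng.shuffle(samples)                 # visit the samples in random order
        for x, y in samples:
            error = (w * x + b) - y
            # Gradients of the per-sample loss (y_hat - y)**2.
            grad_w = 2.0 * error * x
            grad_b = 2.0 * error
            # Move against the gradient, scaled by the learning rate.
            w -= learning_rate * grad_w
            b -= learning_rate * grad_b
    return w, b


# Noise-free samples from the hypothetical target y = 2x + 1.
data = [(x, 2.0 * x + 1.0) for x in (-1.0, -0.5, 0.0, 0.5, 1.0)]
w, b = sgd_linear(data)
print(w, b)
```

In a deep network the same update is applied to every weight matrix, with the gradients obtained by propagating the error backwards layer by layer.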

2.1.4 Convolutional Neural Network

Compared with a common deep neural network, a convolutional neural network has two extra components: the convolutional layer and the pooling layer. The convolutional layers make use of several trainable filters to select different features. The pooling layer reduces the dimension of the data by subsampling.

In the convolution layer, the output of the previous layer is convolved with trainable filters using element-wise matrix multiplication. The size and the number of filters are defined by the user and the initial values are randomly generated. The moving step of a filter in each convolution layer is determined by the stride. In order to keep the information at the border during forward propagation, a series of zeros, called padding, is attached to the border of the image. Figure 2.4 shows how the result of one neuron in a convolution layer is computed and how the filter is updated in every iteration of backpropagation.

Figure 2.4: Operations in convolution layer.
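The forward pass of a convolution layer, including stride and zero padding, can be sketched for a single channel and a single filter as follows. The code implements cross-correlation (no kernel flip), which is the convention assumed here; the image and kernel values are made up for the example.

```python
def conv2d(image, kernel, stride=1, padding=0):
    """Slide the kernel over the image and sum the element-wise products."""
    if padding:
        width = len(image[0]) + 2 * padding
        zero_rows = [[0.0] * width for _ in range(padding)]
        image = (
            zero_rows
            + [[0.0] * padding + list(row) + [0.0] * padding for row in image]
            + [[0.0] * width for _ in range(padding)]
        )
    k_h, k_w = len(kernel), len(kernel[0])
    out_h = (len(image) - k_h) // stride + 1
    out_w = (len(image[0]) - k_w) // stride + 1
    output = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            top, left = i * stride, j * stride   # window position set by the stride
            row.append(sum(
                image[top + u][left + v] * kernel[u][v]
                for u in range(k_h) for v in range(k_w)
            ))
        output.append(row)
    return output


image = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
kernel = [[1, 0], [0, 1]]
print(conv2d(image, kernel, stride=2))              # 2 x 2 output
print(len(conv2d(image, kernel, padding=1)))        # padding enlarges the output
```

The output side length follows from (input + 2 × padding − filter) // stride + 1, which is why a larger stride shrinks the feature map and padding preserves border information.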

A pooling layer is used not only for dimension reduction but also for detecting features that are invariant to translation, rotation and scale of the input [37]. There are two types of operation in the pooling layer: max pooling and average pooling. In max pooling, only the maximum value in a user-defined window is chosen, while in average pooling all values in the window contribute to the output. The choice of operation depends on the task, and Boureau has made a theoretical comparison between the two [6]. Both the max and the average operation are shown in Figure 2.5.

Figure 2.5: Operations in pooling layer.
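Both pooling operations can be expressed with one small function. This is a minimal single-channel sketch; the 2 × 2 window with stride 2 is an assumed default, and the input values are made up.

```python
def pool2d(feature_map, size=2, stride=2, mode="max"):
    """Apply max or average pooling over size x size windows."""
    if mode == "max":
        reduce = max
    else:
        def reduce(window):
            return sum(window) / len(window)
    out_h = (len(feature_map) - size) // stride + 1
    out_w = (len(feature_map[0]) - size) // stride + 1
    return [
        [
            reduce([
                feature_map[i * stride + u][j * stride + v]
                for u in range(size) for v in range(size)
            ])
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


feature_map = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
print(pool2d(feature_map, mode="max"))
print(pool2d(feature_map, mode="average"))
```

Max pooling keeps one value per window, while average pooling mixes all of them, which is also why the two layers propagate errors differently during backpropagation.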

A complete convolutional neural network consists of several convolutional layers, pooling layers, and fully connected layers. The fully connected layers follow the same concept as a DNN and take the flattened output of the last convolutional or pooling layer as input. Consider the classification problem shown in Figure 2.6: an image is fed into the CNN, which outputs a scalar representing its class. Table 2.2 lists the filter information used in the CNN.

Figure 2.6: A simple convolutional neural network.
Layer            Numbers   Size    Stride   Padding   Output dimension
Convolution 1    8         8×8     2        0         8×28×28
Max Pooling      8         2×2     2        /         8×14×14
Convolution 2    16        6×6     2        0         16×4×4
Table 2.2: Properties of the convolutional and pooling layers

The backpropagation algorithm in a convolutional neural network is a little different from the one described in Section 2.1.3 because of the two extra layer types. In an average pooling layer, the error is divided by the size of the filter and propagated to the previous layer. In a max pooling layer, the position of the maximum value is stored during forward propagation and the error is passed directly through that position. In a convolutional layer, backpropagation can be calculated through basic differentiation. Consider the convolution operation in Figure 2.4. Given the error from the output layer, we have

(2.11)

where

(2.12)

is the output matrix in Figure 2.4. The differentiation of with respect to , , can be computed in a similar way.

2.1.5 Batch Normalization

With the increasing depth of a neural network, the training time becomes longer. One reason is that the distribution of the input to each layer changes whenever the weights are updated, which is called Internal Covariate Shift. In 2015, Ioffe and Szegedy proposed Batch Normalization (BN), which makes the distribution in each layer more stable and achieves shorter training time [15]. In each neuron, the input $x$ can be normalized by Equation 2.13

$$\hat{x} = \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} \tag{2.13}$$

where $\mu$ and $\sigma^2$ are the mean and variance of $x$ over the mini-batch and $\epsilon$ is used to avoid zero variance. Now the data in each neuron follow a distribution with mean $0$ and standard deviation $1$. However, this changes the representation ability of the network, which may lead to the loss of information from earlier layers. Therefore, Ioffe and Szegedy used another linear transformation to restore that representation

$$y = \gamma \hat{x} + \beta \tag{2.14}$$

where $\gamma$ and $\beta$ are learnable parameters; in particular, the result is the same as the original input when $\gamma = \sqrt{\sigma^2 + \epsilon}$ and $\beta = \mu$. The mean and variance computed during training are stored and used as the mean and variance for the test data. In their experiments, BN not only deals with the Internal Covariate Shift problem but also mitigates the vanishing gradient problem.
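The normalization and the learnable rescaling described in Equations 2.13 and 2.14 can be sketched for one neuron and one mini-batch as follows. The value of `eps` and the sample batch are assumptions for the example; running averages for test time are omitted.

```python
import math


def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one neuron's inputs over a mini-batch, then scale and shift."""
    n = len(batch)
    mean = sum(batch) / n
    var = sum((x - mean) ** 2 for x in batch) / n
    # Normalize to zero mean and (approximately) unit variance.
    normalized = [(x - mean) / math.sqrt(var + eps) for x in batch]
    # Learnable linear transformation that can restore the representation.
    output = [gamma * x_hat + beta for x_hat in normalized]
    return output, mean, var


batch = [1.0, 2.0, 3.0, 4.0]
normalized, mean, var = batch_norm(batch)
print(normalized)

# With gamma = sqrt(var + eps) and beta = mean, the original inputs come back.
restored, _, _ = batch_norm(batch, gamma=math.sqrt(var + 1e-5), beta=mean)
print(restored)
```

The second call illustrates why the extra parameters do not reduce what the layer can represent: the identity mapping remains reachable.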

2.2 Reinforcement Learning

Reinforcement Learning (RL) is a class of machine learning that aims to maximize the reward signal when making decisions. The basic components of reinforcement learning are the agent and the environment. As shown in Figure 2.7, the agent receives feedback, including an observation and a reward, from the environment after each action. To generate a better policy, it keeps interacting with the environment and improves its decision-making ability step by step until the policy converges.

Figure 2.7: Interaction between the agent and the environment.

2.2.1 History of Reinforcement Learning

In recent years, reinforcement learning has become popular because of AlphaGo, a program that can beat human experts at Go [39]. At the Future of Go Summit 2017, AlphaGo Master shocked the world by winning all three games against Ke Jie, the world's best Go player. But research on reinforcement learning started much earlier. According to Sutton, the early history of RL can be divided into two main threads [42].

One of them was optimal control. To cope with the optimal control problems arising at the time, which were called "multi-stage decision processes" in 1954, the theory of Dynamic Programming (DP) was introduced by Bellman [4]. In this theory, he proposed the concept of the "functional equation", which is often called the Bellman equation today. Although DP was one of the most effective approaches to solving optimal control problems at that time, its high computational requirements, which Bellman called "the curse of dimensionality", were not easy to overcome [5]. Three years later, he built a model called Markov Decision Processes (MDPs) to describe a kind of discrete stochastic process [3]. This model and the concept of the value function described by the Bellman equation form the basic theory of modern reinforcement learning.

In the optimal control thread, solving problems required full knowledge of the environment, which was not feasible for most real-world problems. The trial-and-error thread focused more on the feedback than on the environment itself. The first expression of the key idea of trial-and-error learning, including being "selectional" and "associative", was the "Law of Effect" written in Edward Thorndike's book "Animal Intelligence" [47]. Although supervised learning was not "selectional", some researchers still mistook it for reinforcement learning and concentrated on pattern recognition [8, 55]. As a result, little research was done on actual trial-and-error learning until Klopf recognized the difference between supervised learning and RL: the motivation to gain more rewards from the environment [16, 17]. However, there were still some remarkable works, such as the reinforcement learning rule called "selective bootstrap adaptation" by Widrow in 1973 [54].

The two threads came together in modern reinforcement learning. Temporal Difference (TD) learning, which originated from animal learning psychology, is a method that predicts future values based on the current signal. This idea was first proposed and implemented by Samuel [35]. In 1972, Klopf developed the idea of "generalized reinforcement" and linked trial-and-error learning with animal learning psychology [16]. In 1983, Sutton developed and implemented the actor-critic architecture in trial-and-error learning based on the idea from Klopf [2]. Five years later, he proposed the TD($\lambda$) algorithms, which used information from additional steps to update the policy and made TD learning a general prediction method for deterministic problems [44]. One year later, Chris Watkins used optimal control methods to solve temporal-difference problems and developed the Q-learning algorithm, which estimates delayed reward with an action-value function [53]. In 1994, an online Q-learning was proposed by Rummery and Niranjan, which is known as SARSA [34]. The difference between Q-learning and SARSA is that in SARSA the agent uses the same policy throughout the learning process, while in Q-learning the update always assumes the best action according to the value function.

With the development of deep neural networks, DeepMind proposed the Deep Q-Network (DQN) algorithm, which used a convolutional neural network to handle the high dimensionality of the state in reinforcement learning problems [27]. Two years later, they modified DQN by adding a target network to improve its stability [28]. The highlight of DQN was not only the combination of deep learning and RL but also the experience replay mechanism. To solve the dependency problem when optimizing the CNN, Mnih et al. stored the experience of each step in memory and randomly sampled a mini-batch to optimize the neural network, based on the idea from Lin [22]. In 2015, this mechanism was improved by measuring the importance of each experience with its temporal difference error [36]. Meanwhile, Wang proposed Dueling DQN, which used an advantage function to learn how valuable a state is without estimating each action value for each state [52]. This new neural network architecture was helpful when there was no strong relationship between actions and the environment. In 2016, DeepMind proposed Double DQN, which showed higher policy stability by reducing overestimated action values [50].

Although a series of algorithms based on DQN showed human-level performance on Atari games, they still failed on some specific games. DQN is a value-based method, which means the choice of action depends on the action values. However, choosing an action randomly may be the best policy in some games, such as rock-paper-scissors. To deal with this problem, Sutton proposed the policy gradient, which enables the agent to optimize the policy directly [43]. Based on this, OpenAI proposed a new family of algorithms such as Proximal Policy Optimization (PPO) [38]. PPO used a statistical method called importance sampling, which estimates a distribution by sampling data from another distribution, and this simple modification showed better performance in RoboschoolHumanoidFlagrun.

Since the basic policy gradient method samples data from completed episodes, the variance of the estimate is high because of the high-dimensional action space. Drawing on value-based methods, the actor-critic method was proposed to solve this problem [18]. Compared with the policy gradient, this method uses a critic to evaluate the chosen action. This allows the policy to be updated after each decision, which not only reduces the variance but also accelerates convergence. A famous improved algorithm based on actor-critic is asynchronous advantage actor-critic (A3C) [26]. Similar to Dueling DQN, this method uses an advantage function to estimate the value function and performs the computation in parallel, which can greatly increase the learning speed.

2.2.2 Markov Decision Processes

As mentioned in Section 2.2.1, the interaction between the agent and the environment can be modeled as a Markov Decision Process, which is based on the Markov property. The Markov property describes a kind of stochastic process in which the probability of the next event depends only on the current event.

Definition 1 (Markov property [40])

Given a state $S_t$ at time $t$ in a finite sequence $S_1, S_2, \dots, S_t$. This sequence has the Markov property if and only if

$$\mathbb{P}[S_{t+1} \mid S_t] = \mathbb{P}[S_{t+1} \mid S_1, \dots, S_t] \tag{2.15}$$

A Markov Decision Process (MDP) is a random process with the Markov property, values and decisions.

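Definition 2 (Markov Decision Process [40])

A Markov Decision Process can be described as a tuple $\langle \mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R}, \gamma \rangle$

  • $\mathcal{S}$ is a finite set of states

  • $\mathcal{A}$ is a finite set of actions

  • $\mathcal{P}$ is a state transition probability matrix

    $$\mathcal{P}^{a}_{ss'} = \mathbb{P}[S_{t+1} = s' \mid S_t = s, A_t = a] \tag{2.16}$$
  • $\mathcal{R}$ is a reward function

    $$\mathcal{R}^{a}_{s} = \mathbb{E}[R_{t+1} \mid S_t = s, A_t = a] \tag{2.17}$$
  • $\gamma \in [0, 1]$ is a discount factor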

To describe how the decision is made, a policy is required to define the behaviour of the agent.

Definition 3 (Policy [40])

A policy $\pi$ is a distribution over actions given states

$$\pi(a \mid s) = \mathbb{P}[A_t = a \mid S_t = s] \tag{2.18}$$

In an MDP, the agent is expected to get as much reward as it can from the environment. However, maximizing only the reward at time-step $t$ makes the agent short-sighted, which means it only considers the reward from the next action rather than the total reward of one episode. Therefore, the return is defined as the quantity that the agent is expected to maximize.

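Definition 4 (Return [40])

The return $G_t$ is the total discounted reward from time-step $t$.

$$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^{k} R_{t+k+1} \tag{2.19}$$

where $\gamma \in [0, 1]$ is the discount factor.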
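The return of a finite episode can be computed by accumulating rewards backwards, which applies the discount once per step. This is a small sketch; the reward sequence is made up.

```python
def discounted_return(rewards, gamma):
    """Total discounted reward of a reward sequence R_{t+1}, R_{t+2}, ..."""
    g = 0.0
    for reward in reversed(rewards):
        # Each step back in time discounts everything that follows once more.
        g = reward + gamma * g
    return g


rewards = [1.0, 1.0, 1.0]
print(discounted_return(rewards, gamma=0.5))   # later rewards count less
print(discounted_return(rewards, gamma=1.0))   # all rewards count equally
```

The backward loop also makes the recursive structure of the return visible: the return at one step is the immediate reward plus the discounted return of the next step.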

The value of the discount factor represents how far-sighted the agent will be. If this value is $1$, the agent treats every future reward as equally important. But this also makes it harder for the agent to tell which decision was inappropriate. At this point, the behavior of the agent can be described as in Figure 2.8.

Figure 2.8: Markov decision process in reinforcement learning.

Since the return is defined on a random process, its expectation can be defined, similarly to the reward function, as follows. This expectation is also called the value function.

Definition 5 (State Value Function [40])

The state-value function $v_\pi(s)$ of an MDP is the expected return starting from state $s$, and then following policy $\pi$

$$v_\pi(s) = \mathbb{E}_\pi[G_t \mid S_t = s] \tag{2.20}$$

Definition 6 (Action Value Function [40])

The action-value function $q_\pi(s, a)$ is the expected return starting from state $s$, taking action $a$, and then following policy $\pi$

$$q_\pi(s, a) = \mathbb{E}_\pi[G_t \mid S_t = s, A_t = a] \tag{2.21}$$

With Definitions 5 and 6 and the definition of the expectation, we can simply write

$$v_\pi(s) = \sum_{a \in \mathcal{A}} \pi(a \mid s)\, q_\pi(s, a) \tag{2.22}$$

where $\mathcal{A}$ is the set of actions the agent can choose.

2.2.3 Bellman Equation

Since we define the Markov decision process in Section 2.2.2, the behavior of the agent can be described mathematically. As mentioned in Section 2.2.1, this problem can be solved by the Bellman equation.

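Theorem 1 (Bellman Expectation Equation [40])

The state-value function can be decomposed into immediate reward plus discounted value of successor state

$$v_\pi(s) = \mathbb{E}_\pi[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s] \tag{2.23}$$

The action-value function can similarly be decomposed

$$q_\pi(s, a) = \mathbb{E}_\pi[R_{t+1} + \gamma q_\pi(S_{t+1}, A_{t+1}) \mid S_t = s, A_t = a] \tag{2.24}$$

Here is a simple proof for Equation 2.24. According to Definition 4, the return at time $t$ can be decomposed into two parts: the immediate reward $R_{t+1}$ and the discounted return at time $t+1$

$$G_t = R_{t+1} + \gamma\left(R_{t+2} + \gamma R_{t+3} + \dots\right) = R_{t+1} + \gamma G_{t+1} \tag{2.25}$$

Substituting $G_t$ with Equation 2.25 in Definition 6 gives

$$q_\pi(s, a) = \mathbb{E}_\pi[R_{t+1} + \gamma G_{t+1} \mid S_t = s, A_t = a] \tag{2.26}$$

Due to the linearity of expectation, $G_{t+1}$ can be replaced by $q_\pi(S_{t+1}, A_{t+1})$ and then we obtain the Bellman equation for the action-value function. The equation for the state-value function can be proved in the same way. With the definition of the optimal value function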

Definition 7 (Optimal Value Function [40])

The optimal state-value function $v_*(s)$ is the maximum value function over all policies

$$v_*(s) = \max_{\pi} v_\pi(s) \tag{2.27}$$

The optimal action-value function $q_*(s, a)$ is the maximum action-value function over all policies

$$q_*(s, a) = \max_{\pi} q_\pi(s, a) \tag{2.28}$$

Theorem 1 can be extended to Bellman optimality equation

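Theorem 2 (Bellman Optimality Equation [40])

The optimal state-value function can be decomposed into maximum immediate reward plus discounted optimal value of successor state

$$v_*(s) = \max_{a} \left( \mathcal{R}^{a}_{s} + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}^{a}_{ss'}\, v_*(s') \right) \tag{2.29}$$

The optimal action-value function can similarly be decomposed

$$q_*(s, a) = \mathcal{R}^{a}_{s} + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}^{a}_{ss'} \max_{a'} q_*(s', a') \tag{2.30}$$

where $s \in \mathcal{S}$, $a \in \mathcal{A}$, $s' \in \mathcal{S}$, $a' \in \mathcal{A}$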

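Here is a simple proof for Equation 2.30. Due to the linearity of expectation, Equation 2.26 can be decomposed into the expectation of the immediate reward

$$\mathbb{E}_\pi[R_{t+1} \mid S_t = s, A_t = a] \tag{2.31}$$

and the expectation of the discounted return at time $t+1$

$$\mathbb{E}_\pi[\gamma G_{t+1} \mid S_t = s, A_t = a] \tag{2.32}$$

According to the definition of the reward function in Definition 2, Equation 2.31 is equal to $\mathcal{R}^{a}_{s}$. If the next state is $s'$, Equation 2.32 can be written as follows with the transition probability matrix

$$\gamma \sum_{s' \in \mathcal{S}} \mathcal{P}^{a}_{ss'}\, \mathbb{E}_\pi[G_{t+1} \mid S_t = s, A_t = a, S_{t+1} = s'] \tag{2.33}$$

By the Markov property, the expectation of the return in Equation 2.33 does not depend on the current state and action, so it is equal to the state-value function of $s'$. Therefore, Equation 2.33 can be written as follows

$$\gamma \sum_{s' \in \mathcal{S}} \mathcal{P}^{a}_{ss'}\, v_\pi(s') \tag{2.34}$$

Considering 2.31, 2.34 and 2.22, the action-value function can be written as follows

$$q_\pi(s, a) = \mathcal{R}^{a}_{s} + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}^{a}_{ss'} \sum_{a' \in \mathcal{A}} \pi(a' \mid s')\, q_\pi(s', a') \tag{2.35}$$

It is easy to prove that there is always an optimal policy for any Markov decision process and that it can be found by maximizing the action-value function.

$$\pi_*(a \mid s) = \begin{cases} 1 & \text{if } a = \arg\max_{a' \in \mathcal{A}} q_*(s, a') \\ 0 & \text{otherwise} \end{cases} \tag{2.36}$$

Considering 2.36, the Bellman optimality equation for the action-value function can be obtained by replacing the policy in Equation 2.35 with the optimal policy. There are many ways to solve this equation, such as Sarsa and Q-learning. These will be discussed in Section 2.2.5.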
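When the transition probabilities and rewards are fully known, the Bellman optimality equation for the action-value function can also be solved by applying its right-hand side repeatedly until the values stop changing. The sketch below does this for a hypothetical two-state MDP; the states, actions, probabilities and rewards are invented for illustration, and every state is assumed to have at least one action.

```python
def q_value_iteration(transitions, rewards, gamma=0.9, tolerance=1e-10):
    """Iterate the Bellman optimality backup for q until it converges.

    transitions[s][a] is a list of (probability, next_state) pairs and
    rewards[s][a] is the expected immediate reward.
    """
    q = {s: {a: 0.0 for a in transitions[s]} for s in transitions}
    while True:
        delta = 0.0
        new_q = {}
        for s in transitions:
            new_q[s] = {}
            for a, outcomes in transitions[s].items():
                # Immediate reward plus discounted best value of each successor.
                backup = rewards[s][a] + gamma * sum(
                    p * max(q[s_next].values()) for p, s_next in outcomes
                )
                delta = max(delta, abs(backup - q[s][a]))
                new_q[s][a] = backup
        q = new_q
        if delta < tolerance:
            return q


# Hypothetical MDP: moving from A to B succeeds with probability 0.8.
transitions = {
    "A": {"stay": [(1.0, "A")], "go": [(0.8, "B"), (0.2, "A")]},
    "B": {"stay": [(1.0, "B")], "go": [(1.0, "A")]},
}
rewards = {
    "A": {"stay": 0.0, "go": 1.0},
    "B": {"stay": 2.0, "go": 0.0},
}
q_star = q_value_iteration(transitions, rewards, gamma=0.5)
print(q_star)
```

The optimal policy is then read off by taking the action with the largest value in each state. This approach relies on knowing the transition model, which is exactly what Sarsa and Q-learning avoid.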

2.2.4 Exploitation vs Exploration

If the agent has complete knowledge of the environment, in other words, if the transition probability $\mathcal{P}^{a}_{ss'}$ can be calculated given state $s$ and action $a$, Equation 2.30 can be solved by an iterative method with an appropriate $\gamma$. However, this method is unable to deal with an unknown environment because a large amount of information has to be collected to estimate $\mathcal{P}^{a}_{ss'}$ before the action-value function converges. If the $Q$ function becomes stable before the environment has been fully explored, the performance of the model will be far from satisfactory, especially when the action space is large.

To deal with this problem, $\epsilon$-greedy selection [42] is introduced to ensure that the agent explores enough before the action-value function converges. Instead of always choosing the best action estimated by the $Q$ function, there is a probability of $\epsilon$ of selecting randomly from all actions. The mathematical expression of this method is as follows

$$\pi(a \mid s) = \begin{cases} 1 - \epsilon + \epsilon / m & \text{if } a = \arg\max_{a' \in \mathcal{A}} Q(s, a') \\ \epsilon / m & \text{otherwise} \end{cases} \tag{2.37}$$

where $m$ is the number of actions. This method may hurt the performance of the agent in the first several episodes of training, but it widens the horizon of the agent in the long term.
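The selection rule can be sketched in a few lines: with probability epsilon an action is drawn uniformly from all actions, otherwise the action with the highest estimated value is taken. The function name and the example Q-values are assumptions for illustration.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick an action index from a list of estimated action values."""
    if rng.random() < epsilon:
        # Explore: every action, including the greedy one, is equally likely.
        return rng.randrange(len(q_values))
    # Exploit: take the action with the largest estimated value.
    return max(range(len(q_values)), key=q_values.__getitem__)


rng = random.Random(0)
q_values = [0.1, 0.9, 0.3, 0.2]
picks = [epsilon_greedy(q_values, 0.2, rng) for _ in range(10000)]
greedy_share = picks.count(1) / len(picks)
print(greedy_share)
```

Because the random draw can also land on the greedy action, the greedy action is chosen with probability 1 − ε + ε/m overall, where m is the number of actions, and each other action with probability ε/m.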

2.2.5 Temporal Difference Learning

As mentioned in Section 2.2.4, most environments in the real world are unknown. To solve this problem, a method called Monte Carlo (MC) is used to sample data for estimating the value function. The agent can learn about the environment from the experience of one episode, and the value function can be approximated by the mean of the return instead of the expectation. The mathematical expression can be described as follows

$$V(S_t) = \frac{S(S_t)}{N(S_t)} \tag{2.38}$$

where $S_t$ is the state at time $t$, $S(S_t)$ is the sum of returns and $N(S_t)$ is the counter that records the number of visits to state $S_t$. There are two kinds of visit: first visit and every visit. The former means the model only needs to record the first visit to state $S_t$ in one episode, while in the latter all visits to $S_t$ in one episode are taken into consideration. Simplifying Equation 2.38, we get the recurrence equation for $V(S_t)$

$$V(S_t) \leftarrow V(S_t) + \alpha\left(G_t - V(S_t)\right) \tag{2.39}$$

where $\alpha$ is the learning rate, which controls the update speed of the value function, and $S_t$ is the state at time $t$. The problem with the Monte Carlo method is that all rewards in one episode have to be collected to get $G_t$. The value function can only be updated at the end of the episode, which may lead to low training efficiency. To update the value function with an incomplete episode, the return can be replaced by the estimated value function using bootstrapping. With the Bellman equation 2.23 and Equation 2.39, we can write

$$V(S_t) \leftarrow V(S_t) + \alpha\left(R_{t+1} + \gamma V(S_{t+1}) - V(S_t)\right) \tag{2.40}$$

This idea is called Temporal Difference (TD) Learning. In TD learning, the value function is updated immediately after each new observation. Compared with MC methods, TD learning has lower variance because the Monte Carlo return depends on many random actions over the whole episode, which leads to high variance. Similarly, the recurrence equation for the action-value function can be written as follows

$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha\left(R_{t+1} + \gamma Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t)\right) \tag{2.41}$$

where $S_t$ and $A_t$ are the state and action at time $t$, and $S_{t+1}$ and $A_{t+1}$ are the state and action at time $t+1$. Equation 2.41 gives an iterative method to obtain the optimal action-value function $q_*$. With this equation and the $\epsilon$-greedy policy, the RL problem can be solved by Sarsa [34].

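1: set learning rate $\eta$, number of episodes $M$, explore rate $\varepsilon$, discount factor $\gamma$
2: set $Q(s, a) \leftarrow 0$, $\forall s \in \mathcal{S}, a \in \mathcal{A}$
3: for episode $\leftarrow 1$ to $M$ do
4:     initialize time $t \leftarrow 0$
5:     get state $s_t$ from the environment
6:     choose action $a_t$ following $\varepsilon$-greedy policy from $Q(s_t, \cdot)$
7:     while episode is incomplete do
8:         take action $a_t$ and get next state $s_{t+1}$, reward $r_{t+1}$ from the environment
9:         choose action $a_{t+1}$ following $\varepsilon$-greedy policy from $Q(s_{t+1}, \cdot)$
10:         update $Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \eta \left( r_{t+1} + \gamma Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t) \right)$
11:         $t \leftarrow t + 1$, $s_t \leftarrow s_{t+1}$, $a_t \leftarrow a_{t+1}$
12:     end while
13: end for
Algorithm 1 Sarsa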

The name Sarsa comes from the sequence $(s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1})$. Besides Sarsa, there is another similar algorithm called Q learning [53].

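1: set learning rate $\eta$, number of episodes $M$, explore rate $\varepsilon$, discount factor $\gamma$
2: set $Q(s, a) \leftarrow 0$, $\forall s \in \mathcal{S}, a \in \mathcal{A}$
3: for episode $\leftarrow 1$ to $M$ do
4:     initialize time $t \leftarrow 0$
5:     get state $s_t$ from the environment
6:     while episode is incomplete do
7:         choose action $a_t$ following $\varepsilon$-greedy policy from $Q(s_t, \cdot)$
8:         take action $a_t$ and get next state $s_{t+1}$, reward $r_{t+1}$ from the environment
9:         update $Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \eta \left( r_{t+1} + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t) \right)$
10:         $t \leftarrow t + 1$, $s_t \leftarrow s_{t+1}$
11:     end while
12: end for
Algorithm 2 Q learning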

Algorithm 2 involves two policies during the iteration. When choosing the action $a_{t+1}$ used in the update for a given $s_{t+1}$, Sarsa uses the $\varepsilon$-greedy policy while Q learning uses the greedy policy. However, both of them choose the executed action $a_t$ with the $\varepsilon$-greedy policy. Consider the Cliff Walking example from Sutton's book [42] shown in Figure 2.9: every transition yields a reward of $-1$, except transitions into the cliff, which yield a reward of $-100$. With the $\varepsilon$-greedy policy, Sarsa is more likely to choose the safe path while Q learning tends to choose the optimal path. Both of them can reach the optimal policy if the value of $\varepsilon$ is gradually reduced.

Figure 2.9: Example of cliff walking from [42].
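The two tabular algorithms differ only in how the bootstrap value is chosen, which a short sketch on a cliff-walking grid can make concrete. The grid layout (4 x 12, start bottom left, goal bottom right) follows the description in [42]; the function names and the default values for the learning rate, discount factor and explore rate are illustrative assumptions.

```python
import random

ROWS, COLS = 4, 12
START, GOAL = (3, 0), (3, 11)
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right


def step(state, action):
    """Return (next_state, reward, done); moves off the grid are clipped."""
    r = min(max(state[0] + ACTIONS[action][0], 0), ROWS - 1)
    c = min(max(state[1] + ACTIONS[action][1], 0), COLS - 1)
    if r == 3 and 0 < c < 11:  # fell into the cliff: back to start
        return START, -100, False
    return (r, c), -1, (r, c) == GOAL


def epsilon_greedy(q, state, eps, rng):
    """Random action with probability eps, otherwise the greedy action."""
    if rng.random() < eps:
        return rng.randrange(len(ACTIONS))
    values = q[state]
    return values.index(max(values))


def train(method, episodes=300, lr=0.5, gamma=1.0, eps=0.1, seed=0):
    """Tabular Sarsa (method='sarsa') or Q learning (any other value)."""
    rng = random.Random(seed)
    q = {(r, c): [0.0] * len(ACTIONS)
         for r in range(ROWS) for c in range(COLS)}
    for _ in range(episodes):
        state, done = START, False
        action = epsilon_greedy(q, state, eps, rng)
        while not done:
            next_state, reward, done = step(state, action)
            if method == "sarsa":
                # On-policy: bootstrap from the action that will be executed.
                next_action = epsilon_greedy(q, next_state, eps, rng)
                bootstrap = q[next_state][next_action]
            else:
                # Q learning: bootstrap from the greedy action.
                bootstrap = max(q[next_state])
            target = reward + (0.0 if done else gamma * bootstrap)
            q[state][action] += lr * (target - q[state][action])
            if method != "sarsa":
                # The behaviour policy is still epsilon-greedy.
                next_action = epsilon_greedy(q, next_state, eps, rng)
            state, action = next_state, next_action
    return q
```

Calling `train("sarsa")` and `train("q_learning")` returns two Q tables whose greedy paths can then be compared against the safe and optimal paths in Figure 2.9.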

2.2.6 Deep Q Network

Q learning is a powerful algorithm for solving simple reinforcement learning problems. However, it is unable to deal with continuous states or continuous actions. To solve the former problem, deep learning can be used to approximate the action value function.

Generally, states are images observed by the agent, and a convolutional neural network is an effective way to extract features from this kind of data in its convolutional layers and feed them into fully connected layers to approximate the function. Several consecutive still frames are stacked into one input so that the model can perceive that the agent is moving. However, consecutive inputs are highly correlated, and this dependency strongly affects the performance of the model.

As mentioned in Section 2.2.1, in 2013 DeepMind introduced the experience replay pool, which stores experience in memory and samples some of it to optimize the neural network model [27]. Combining Q learning, deep learning and the experience replay pool, the resulting algorithm, named Deep Q Network (DQN), shows impressive performance on Atari games according to their paper. Two years later, they found that the agent became more stable when two networks were used [28]. The algorithm is described below

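1: initialize policy network $Q$ with random weights $\theta$
2: set learning rate $\eta$, number of episodes $M$, explore rate $\varepsilon$, discount factor $\gamma$
3: set batch size $B$, update step $C$
4: set target network $\hat{Q}$ with weights $\theta^- \leftarrow \theta$
5: for episode $\leftarrow 1$ to $M$ do
6:     initialize time $t \leftarrow 0$
7:     get state $s_t$ from the environment
8:     while episode is incomplete do
9:         choose action $a_t$ following $\varepsilon$-greedy policy from policy network $Q$
10:         take action $a_t$ and get next state $s_{t+1}$, reward $r_{t+1}$ from the environment
11:         store transition $(s_t, a_t, r_{t+1}, s_{t+1})$ in experience replay pool
12:         random sample batch experience $(s_j, a_j, r_{j+1}, s_{j+1})$ from the pool
13:         calculate corresponding $Q(s_j, a_j; \theta)$ from policy network
14:         calculate $y_j = r_{j+1} + \gamma \max_{a} \hat{Q}(s_{j+1}, a; \theta^-)$ using target network
15:         optimize the policy model with gradient $\nabla_\theta \left( y_j - Q(s_j, a_j; \theta) \right)^2$
16:         replace target network with policy network, $\theta^- \leftarrow \theta$, when reach the update step
17:         $t \leftarrow t + 1$, $s_t \leftarrow s_{t+1}$
18:     end while
19: end for
Algorithm 3 Deep Q Network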
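The experience replay pool used in the store and sample steps of Algorithm 3 can be sketched with a bounded queue. The class name `ReplayPool`, the `done` flag and the seeding argument are illustrative assumptions, not the implementation used in this project.

```python
import random
from collections import deque


class ReplayPool:
    """Fixed-size pool: the oldest transition is dropped when it is full."""

    def __init__(self, capacity, seed=None):
        self.memory = deque(maxlen=capacity)
        self.rng = random.Random(seed)

    def push(self, state, action, reward, next_state, done):
        """Store one transition."""
        self.memory.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        """Uniformly sample a batch without replacement."""
        return self.rng.sample(list(self.memory), batch_size)

    def __len__(self):
        return len(self.memory)


pool = ReplayPool(capacity=3, seed=0)
for i in range(5):
    pool.push(state=i, action=0, reward=1.0, next_state=i + 1, done=False)
batch = pool.sample(2)  # two of the three most recent transitions
```

Sampling uniformly from the pool breaks the correlation between consecutive frames, which is the dependency problem described before Algorithm 3.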

All states in Algorithm 3 have to be pre-processed before being fed into the neural network model. Based on Deep Q Network, there are three kinds of improved algorithms, which address the stability of the training process, the importance of each experience and the neural network architecture respectively. Double DQN [50] takes advantage of the two networks. Instead of taking the maximum value from the target network directly, this method chooses the optimal action with the policy network and looks up the corresponding value in the target network. Using the target $y_j$ from Algorithm 3, the change can be written as follows

$$y_j = r_{j+1} + \gamma\, \hat{Q}\left( s_{j+1}, \arg\max_{a} Q(s_{j+1}, a; \theta); \theta^- \right) \tag{2.42}$$

where $Q(\cdot\,; \theta)$ is the policy network and $\hat{Q}(\cdot\,; \theta^-)$ is the target network.
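The difference between the DQN target and the Double DQN target can be shown on a single transition. This is a sketch with plain Python lists standing in for the network outputs; the function names and the terminal-state handling through a `done` flag are assumptions.

```python
def dqn_target(reward, next_q_target, gamma, done):
    """DQN: take the maximum value of the target network directly."""
    if done:
        return reward
    return reward + gamma * max(next_q_target)


def double_dqn_target(reward, next_q_policy, next_q_target, gamma, done):
    """Double DQN: the policy network selects the action and the target
    network evaluates it."""
    if done:
        return reward
    best_action = max(range(len(next_q_policy)), key=next_q_policy.__getitem__)
    return reward + gamma * next_q_target[best_action]


# Hypothetical Q values of the next state for the two actions.
next_q_policy = [1.0, 2.0]   # the policy network prefers action 1
next_q_target = [3.0, 0.5]   # the target network overrates action 0

y_dqn = dqn_target(0.1, next_q_target, gamma=0.9, done=False)
y_double = double_dqn_target(0.1, next_q_policy, next_q_target,
                             gamma=0.9, done=False)
```

Here the DQN target uses the value 3.0 while the Double DQN target uses 0.5, which illustrates how decoupling selection from evaluation reduces overestimation.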

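Prioritized Experience Replay (PER) introduces a way to sample transitions from the experience replay pool more efficiently [36]. Instead of uniform random sampling, each transition $i$ is sampled with a probability determined by its priority

$$P(i) = \frac{p_i^{\alpha}}{\sum_k p_k^{\alpha}} \tag{2.43}$$

where $p_i > 0$ is the priority of transition $i$ and the exponent $\alpha$ determines how much prioritization is used; in particular, $\alpha = 0$ corresponds to uniform random sampling. The priority can be measured by the TD error $\delta_i$, which is the following term

$$\delta_i = r_{i+1} + \gamma \max_{a} \hat{Q}(s_{i+1}, a; \theta^-) - Q(s_i, a_i; \theta) \tag{2.44}$$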

Based on the TD error, $p_i$ can be calculated in two ways. The first is proportional prioritization, which uses the absolute value of the TD error

$$p_i = |\delta_i| + \epsilon \tag{2.45}$$

where $\epsilon$ is a small positive constant that prevents a priority of zero. The other one is rank-based

$$p_i = \frac{1}{\mathrm{rank}(i)} \tag{2.46}$$

where $\mathrm{rank}(i)$ is the rank of transition $i$ when the pool is sorted by $|\delta_i|$. According to Schaul et al. [36], both proportional and rank-based prioritization speed up training, but the latter is more robust and performs better in the presence of outliers.
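The two prioritization schemes and the resulting sampling probabilities can be sketched as follows. The function names and the default value of the small constant are illustrative assumptions.

```python
def proportional_priorities(td_errors, eps=1e-6):
    """Priority is the absolute TD error plus a small positive constant."""
    return [abs(d) + eps for d in td_errors]


def rank_based_priorities(td_errors):
    """Priority is 1 / rank, where rank 1 has the largest absolute TD error."""
    order = sorted(range(len(td_errors)),
                   key=lambda i: abs(td_errors[i]), reverse=True)
    priorities = [0.0] * len(td_errors)
    for rank, index in enumerate(order, start=1):
        priorities[index] = 1.0 / rank
    return priorities


def sampling_probabilities(priorities, alpha):
    """Normalize priorities raised to the power alpha (alpha = 0 is uniform)."""
    scaled = [p ** alpha for p in priorities]
    total = sum(scaled)
    return [s / total for s in scaled]


td_errors = [0.5, -2.0, 0.1]
ranked = rank_based_priorities(td_errors)          # [1/2, 1, 1/3]
uniform = sampling_probabilities(ranked, alpha=0)  # every entry is 1/3
```

The rank-based variant depends only on the ordering of the TD errors, which is why a single very large error has less influence on it than on the proportional variant.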

However, uniform random sampling is abandoned once the priority mechanism is added, which introduces bias. In other words, transitions with a small TD error are unlikely to be sampled, so the distribution of the training data is changed. Therefore, the final model may be far from the optimal policy and the performance of the agent may even be lower than with DQN. Importance sampling (IS) [29] is an effective technique for estimating an expectation under one distribution using data sampled from a different distribution. Given a probability density function $p(x)$ of a random variable $x$, the definition of the expectation gives

$$\mathbb{E}_p[f(x)] = \int f(x)\, p(x)\, dx \tag{2.47}$$

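where $\mathbb{E}_p$ denotes the expectation for $x \sim p$ and $f$ is the integrand. Given another probability density function $q(x)$, the expectation can be written as follows

$$\mathbb{E}_p[f(x)] = \int f(x)\, \frac{p(x)}{q(x)}\, q(x)\, dx = \mathbb{E}_q\left[ f(x)\, \frac{p(x)}{q(x)} \right] \tag{2.48}$$

where $\mathbb{E}_q$ denotes the expectation for $x \sim q$. With Monte Carlo integration, the expectation can be estimated by

$$\mathbb{E}_p[f(x)] \approx \frac{1}{n} \sum_{i=1}^{n} \frac{p(x_i)}{q(x_i)}\, f(x_i) \tag{2.49}$$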

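where $x_i$ is sampled from $q$. If $p$ is the uniform distribution over the $N$ transitions in the pool and $q$ is the sampling distribution in Equation 2.43, we have

$$\frac{p(x_i)}{q(x_i)} = \frac{1}{N} \cdot \frac{1}{P(i)} \tag{2.50}$$

Adding a tunable parameter $\beta$, we obtain the importance-sampling weights

$$w_i = \frac{\left( \frac{1}{N} \cdot \frac{1}{P(i)} \right)^{\beta}}{\max_j \left( \frac{1}{N} \cdot \frac{1}{P(j)} \right)^{\beta}} \tag{2.51}$$

where $\beta$ is annealed from a user-defined initial value to $1$, and the bias disappears completely when $\beta = 1$. The maximum in the denominator normalizes the weights, which increases stability. With the terms of Algorithm 3, the update of the $Q$ function can be modified as follows

$$\theta \leftarrow \theta + \eta\, w_j\, \delta_j\, \nabla_\theta Q(s_j, a_j; \theta) \tag{2.52}$$

where $\delta_j$ is the TD error of transition $j$ and $\eta$ is the learning rate.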
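The importance-sampling weights can be computed directly from the sampling probabilities. The sketch below assumes that the probabilities of all stored transitions are passed in, so that the pool size equals the length of the list; the function name is illustrative.

```python
def importance_weights(probs, beta):
    """Importance-sampling weights normalized by their maximum.

    probs: sampling probability of every transition in the pool.
    beta:  compensation exponent; beta = 1 fully corrects the bias.
    """
    n = len(probs)
    raw = [(1.0 / (n * p)) ** beta for p in probs]
    max_weight = max(raw)
    return [w / max_weight for w in raw]


probs = [0.5, 0.25, 0.25]
full = importance_weights(probs, beta=1.0)   # frequent transition is down-weighted
none = importance_weights(probs, beta=0.0)   # no correction, all weights equal 1
```

With full correction, the transition that is sampled twice as often receives half the weight, so its total contribution to the update matches uniform sampling.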

The dueling DQN architecture uses a new concept called the advantage function, which is the difference between the action value function and the state value function [52].

$$A(s, a) = Q(s, a) - V(s) \tag{2.53}$$

As shown in Figure 2.10, the dueling network architecture sums two streams, the advantage function and the state value function, to obtain the $Q$ function. The state values can be updated more accurately with this method.

Figure 2.10: Architecture comparison between Dueling DQN and DQN [52]
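The way the two streams are combined can be sketched with plain numbers. The first function is the plain summation described above; the second subtracts the mean advantage, which is the aggregation proposed in [52] and is shown here only for comparison, since the text does not state which variant was implemented. The function names are illustrative.

```python
def dueling_q_sum(state_value, advantages):
    """Plain summation of the value stream and the advantage stream."""
    return [state_value + a for a in advantages]


def dueling_q_mean(state_value, advantages):
    """Aggregation with the mean advantage subtracted, as proposed in [52]."""
    mean_advantage = sum(advantages) / len(advantages)
    return [state_value + a - mean_advantage for a in advantages]


# Two actions (do nothing, jump) with hypothetical stream outputs.
q_sum = dueling_q_sum(1.0, [1.0, 3.0])    # [2.0, 4.0]
q_mean = dueling_q_mean(1.0, [1.0, 3.0])  # [0.0, 2.0]
```

Both variants rank the actions identically; the mean subtraction only fixes the otherwise arbitrary offset between the value stream and the advantage stream.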

3.1 Requirements

3.1.1 Software Requirement

Considering the readability of the code and the availability of widely used frameworks such as PyTorch, Python is a suitable choice for this project. OpenCV is used to preprocess the images obtained from the environment. NumPy is a Python library that accelerates matrix operations with C, which enables the user to write efficient scientific computing code in Python. There are plenty of deep learning frameworks, such as TensorFlow, which has an extensive API and is widely used in industrial products. However, it takes a relatively long time for a beginner to fully understand TensorFlow. PyTorch is a recently developed framework which is described as "NumPy with GPU". Its simplicity has led more and more academic researchers to use it to implement new ideas more easily. Because T-rex Runner runs on Chrome, the latest version of Chrome is used here. Gym is a game library developed by OpenAI [7]. This framework provides built-in environments for some famous games such as those of the Atari 2600, and it is easy for users to customize their own environment. Table 3.1 shows all software requirements of this project.

Software               Description
OS                     Windows 10
Programming language   Python 3.7.4
Framework              OpenCV, PyTorch, NumPy, Gym
Browser                Chrome 76
Table 3.1: Software requirement

3.1.2 Hardware Requirement

As the game runs on Chrome, it is hard to use a Linux server to perform the experiments. Although headless Chrome is a plausible choice, some environmental issues were found during the investigation. Therefore, all experiments are run on the author's laptop. This brings some limitations; for example, the 6GB of GPU memory limits the size of the experience replay pool. Therefore, parameters related to hardware limitations are set to suitable values without tuning in this project. Table 3.2 lists the hardware used in this project.

Hardware   Description
CPU        Intel Core i5-8300H
RAM        16G
GPU        Nvidia GTX 1060 6G
Table 3.2: Hardware requirement

3.2 Game Description

T-rex Runner is the dinosaur game from the offline mode of Google Chrome. Everyone can access this link on Chrome to play the game. The goal of the player is to control the dinosaur so that it overcomes as many obstacles as possible. The current score, shown at the top right corner of Figure 3.1 together with the highest score, increases with time as long as the dinosaur stays alive. As shown in Figure 3.2, the dinosaur has three actions to choose from in every state: do nothing, jump or duck.

Figure 3.1: A screenshot of T-rex Runner.
(a) Do nothing
(b) Jump
(c) Duck
Figure 3.2: Three types of actions in T-rex Runner

The environment plays an important role in reinforcement learning because the agent improves its policy based on the feedback from it. However, it is difficult to quantify the reward for each action as well as the return for an entire episode. Most research on RL algorithms does not consider modifying the reward, although it significantly affects the performance of the model because it decides the behavior of the agent. For example, reward shaping shows better performance in Andrew's experiment [30]. It adds a new term to modify the original reward based on the goal

$$R' = R + F \tag{3.1}$$

where $R$ is the original reward and $F$ is the shaping term. The closer the agent is to the goal, the larger $F$ is. However, the aim of this project is to train the agent to play the game and to compare the performance of different algorithms. Therefore, the effect of the reward function is not investigated and a fixed reward function is used across all experiments.

Since there is no previous study on T-rex Runner with reinforcement learning, the design of reward function is a hard part of this project. Intuitively, the best design is awarding the agent for jumping over the obstacles and penalizing it for hitting the obstacles. The jumping reward will gradually increase as time goes by. However, object detection in moving pictures is required to fulfill this goal. As this task is out of the requirements of this project, we proposed a naive reward design as shown in Algorithm 4.

1:if episode is completed then
2:     return reward as
3:else
4:     if agent choose jump then
5:         return reward as
6:     else
7:         return reward as
8:     end if
9:end if
Algorithm 4 Reward Design in T-rex Runner

The basic idea of Algorithm 4 is to give the agent a relatively small reward while it is alive and to penalize it when it hits an obstacle. The reward for jumping is set to zero so that the dinosaur only jumps when it is very close to an obstacle, since an unnecessary jump limits its movement in the next few states.

Although there are three kinds of action in this game, as introduced in Section 3.2, duck is optional because the agent can overcome the same obstacles by jumping. Considering that most obstacles in the game are cacti, which can only be overcome by jumping, only two actions (do nothing and jump) are used in this investigation.
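The structure of Algorithm 4 can be written as a small function. The zero reward for jumping follows the description above; the default values of `alive_reward` and `crash_penalty` are hypothetical placeholders and are not the values used in the experiments.

```python
def trex_reward(done, jumped, alive_reward=0.1, crash_penalty=-1.0):
    """Reward for one step of T-rex Runner.

    done:   True if the episode ended, i.e. the dinosaur hit an obstacle.
    jumped: True if the agent chose the jump action in this step.
    The two default values are hypothetical placeholders.
    """
    if done:
        return crash_penalty
    if jumped:
        return 0.0  # no reward for jumping, to discourage needless jumps
    return alive_reward
```

Because a jump earns nothing while staying on the ground earns a small positive reward, the return is maximized by jumping only when it avoids the crash penalty.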

3.3 Model Selection

Since there are only two actions in T-rex Runner, value-based methods are expected to be powerful enough to handle this game according to the literature review on deep reinforcement learning in Section 2.2.1. Although policy-based methods such as proximal policy optimization are a good choice too, only DQN, double DQN, DQN with prioritized experience replay and dueling DQN are investigated in this project due to the time limitation.

Deep Q network which is shown in Algorithm 3 is a basic reinforcement learning algorithm using deep learning. According to the result from DeepMind, it is expected to achieve at least human-level results with only DQN.

Double DQN mitigates the value overestimation problem by making use of two networks, as shown in Equation 2.42, but it is not expected to achieve higher performance in this experiment because there are only two actions. The negative effect of overestimation is not obvious under this circumstance.

Dueling DQN adds an advantage function, which is the difference between the action value function and the state value function, before the output layer of the convolutional neural network, as shown in Equation 2.53. Since the game evaluated in [52] is a racing game with obstacles similar to T-rex Runner, this algorithm is expected to perform better than DQN.

Prioritized experience replay improves training efficiency by changing the sampling distribution of the stored transitions. It assigns a weight to each experience according to its TD error. There are two ways to calculate the priority: the proportional method and the rank-based method. According to [36], the former has relatively better performance, so only this method is implemented in this investigation due to the time limitation. The final performance is expected to be the same as DQN because the learning algorithm itself is unchanged, but that performance may be reached faster.

3.4 Image Preprocessing

Following the preprocessing steps in [27, 28], the raw observed image, which is in RGB representation, is converted to a gray-scale representation. To make it easier for the network to recognize the dinosaur and the obstacles, unnecessary objects such as clouds and scores are removed. In this step, the colors of the background and the objects are inverted in order to perform erosion and dilation. These two basic morphological operations help to remove small bright regions, which are often noise. Finally, the image is resized following the recipe from DeepMind. Since the neural network has to recognize movement, the same preprocessing is applied to the last four frames in the history, and those four frames are stacked into one data point, which is the input of the CNN. The entire process is shown in Figure 3.3.

Figure 3.3: Preprocessing steps for T-rex Runner.
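The last step of the pipeline, stacking the four most recent frames, can be sketched with a bounded queue. The class name `FrameStack` and the choice to fill the history with copies of the first frame at the start of an episode are assumptions; the frames themselves are assumed to be already preprocessed.

```python
from collections import deque


class FrameStack:
    """Keep the most recent frames and return them as one observation."""

    def __init__(self, num_frames=4):
        self.frames = deque(maxlen=num_frames)

    def reset(self, first_frame):
        """Start an episode by repeating the first frame (an assumption)."""
        for _ in range(self.frames.maxlen):
            self.frames.append(first_frame)
        return list(self.frames)

    def push(self, frame):
        """Add the newest frame; the oldest one is dropped automatically."""
        self.frames.append(frame)
        return list(self.frames)


stack = FrameStack(num_frames=4)
observation = stack.reset("frame0")
observation = stack.push("frame1")  # ["frame0", "frame0", "frame0", "frame1"]
```

In practice each frame would be a two-dimensional array, and the returned list would be converted into a single tensor with four channels before being passed to the CNN.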

3.5 Convolutional Neural Network Architecture

Two kinds of convolutional neural network are used in this project. The basic DQN architecture proposed in [27, 28] uses three convolutional layers and two fully connected layers. Pooling layers are not used so that the movement of the agent can be detected: both max pooling and average pooling may make the network ignore very small changes in the image. Therefore, the feature extractor consists of convolutional layers only. The architecture for training the agent with DQN is shown in Figure 3.4.

Figure 3.4: Convolutional Neural Network architecture for Deep Q Network.

The dueling architecture proposed in [52] divides the network into two streams. One of them represents only the state value function $V$, and the other represents the advantage function $A$, which depends on both state and action. The final action value function is the sum of the two.

$$Q(s, a; \theta, \alpha, \beta) = V(s; \theta, \beta) + A(s, a; \theta, \alpha) \tag{3.2}$$

where $\theta$ denotes the shared parameters of the CNN, $\beta$ the parameters used only by the value stream and $\alpha$ the parameters used only by the advantage stream. Both DQN and Dueling DQN use Algorithm 3; the only difference is the neural network architecture. RMSprop [48], an adaptive gradient method based on stochastic gradient descent, is used as the optimization algorithm in this project. This is the same optimization method used by DeepMind [27, 28]. Figure 3.5 shows the architecture of dueling DQN.

Figure 3.5: Convolutional Neural Network architecture for Dueling Deep Q Network.

3.6 Experiments

3.6.1 Hyperparameter Tuning

Before the algorithms are compared, hyperparameter tuning is required to obtain high-performance models. As mentioned before, the memory size is fixed because of the hardware limitation. Because there is no previous study on this game and the hyperparameters listed in [28] give poor results on it, all other hyperparameters have to be set to suitable values. Grid search is performed to find a workable combination of those parameters.

Due to the time limitation, all parameters will only be slightly modified and only one hyperparameter will vary during each tuning experiment. The choice of the parameter will consider both score and stability. Each parameter will be tuned with episodes.

3.6.2 Comparison of different Deep Q Network Algorithms

Three improved reinforcement learning algorithms based on DQN are mentioned in Section 2.2.6. Double DQN makes the performance of the agent more stable by addressing the value overestimation problem. Prioritized experience replay improves training efficiency by sampling more valuable transitions. Dueling DQN modifies the neural network architecture to obtain a better estimate of the state values.

In this experiment, DQN will be first used to train the agent based on the hyperparameters tuned in Section 3.6.1 and this result will be treated as a baseline across all the experiments. Double DQN, DQN with prioritized experience replay and Dueling DQN will be applied to the agent separately. The performance of those three is expected to be better than DQN according to the related papers. Due to time limitation, no combination of those three algorithms will be performed in this project. This section only compares the performance of each algorithm.

3.6.3 Effect of Batch Normalization

As mentioned in Section 2.1.5, batch normalization has been shown to reduce training time and mitigate the vanishing gradient problem in convolutional neural networks. However, there is no evidence that this method has the same effect in reinforcement learning. This section describes experiments on this point. Based on the experiment in Section 3.6.2, batch normalization is added to each convolutional layer and the results are compared with the outcomes of the previous experiments.

3.7 Evaluation

To evaluate the performance of the agent, DeepMind let the trained agent play each game 30 times for up to 5 min each time using an $\varepsilon$-greedy policy with $\varepsilon = 0.05$ [28]. Since only one game is investigated in this project, the average score is used instead of the average reward, because the number of jumps in each episode affects the total reward under the designed reward function. The greedy policy is used in the evaluation stage instead of the $\varepsilon$-greedy policy, because the latter introduces randomness into the decisions, which would distort the measured performance of the trained model. Therefore, the trained agent plays the game repeatedly with the greedy policy and without a time limit. All outcomes are compared with the results of a human expert.

The average scores during the training stage are shown graphically, which is a clear way to show the learning efficiency of each algorithm. Both graphical and statistical results such as the mean, variance and median are analyzed. However, only statistical results are analyzed in the testing stage, because the trained model of each algorithm is fixed and there is no learning trend to show as in the training stage. These results are visualized with a boxplot.

4.1 Hyper Parameter Tuning

The values of the hyperparameters may affect the performance of the model. However, there are many parameters in reinforcement learning, including optimization parameters such as the learning rate, so finding the optimal combination with a grid search may take a long time. Since there is no metric like accuracy in RL that easily reflects the performance of the model, we assume that each parameter is independent of the others. Therefore, the parameters can be tuned one after another. Because the objective of this project is to compare the performance of different algorithms and the effect of batch normalization, the parameters tuned with DQN are used across all the experiments. The initial hyperparameters of DQN are shown in Table 4.1.

Hyper parameter Value Description
Memory Size Size of experience replay pool
Batch Size 128 Size of minibatch to optimize model
Gamma 0.99 Discount factor
Initial Explore probability at the start of the training
Final End point of explore probability in decay
Explore steps Number of steps for decay from to
Learning Rate Learning speed of the model
Table 4.1: Hyper parameters used in all experiments

4.1.1 Learning Rate

The learning rate controls the learning speed of the model. A value that is too large results in divergence, while a value that is too small considerably lengthens the training time.

Figure 4.1: Hyper parameter tuning for learning rate.

Figure 4.1 shows four different values of learning rate. Obviously, is too small and there is no increase trend during the entire process. Both and

make the score unstable after 50th epoch. Considering the stability and

epochs will be trained in formal experiment, will be chosen as learning rate.

4.1.2 Batch Size

The batch size defines how many transitions are used in each update of the neural network, which may affect the training speed. However, as mentioned in Section 2.2.1, a batch size that is too large causes dependency problems, which may strongly affect the performance of the model.

Figure 4.2: Hyper parameter tuning for batch size.

As shown in Figure 4.2, the average score of three curves at epoch are all around . Among those three, the most stable one is batch size .

4.1.3 Epsilon

The $\varepsilon$-greedy policy determines the probability of exploration. In some games, especially those with large action spaces, this value affects how well the model converges. However, there are only two actions in T-rex Runner, so it is unnecessary to choose actions at random at the beginning. Instead of initializing $\varepsilon$ to $1$ as DeepMind did in their paper [28], a smaller start value is used in this model.

Figure 4.3: Hyper parameter tuning for explore probability .

All experiments achieve acceptable results in Figure 4.3 except the one with fixed . In this case, we select from to but either of those three can be chosen according to this graph. This experiment also demonstrates the positive effect of linear annealing for .

4.1.4 Explore Step

The explore step is the number of steps required to anneal $\varepsilon$ from its initial value to its final value. As mentioned above, hyperparameters related to exploration do not have much effect in this game. The most stable setting in Figure 4.4 is selected.

Figure 4.4: Hyper parameter tuning for explore steps.
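Linear annealing of the explore probability over a fixed number of steps can be sketched as follows. The function name and the numeric values in the example are hypothetical and are not the tuned values of this project.

```python
def linear_epsilon(step, eps_start, eps_end, explore_steps):
    """Anneal epsilon linearly from eps_start to eps_end over explore_steps
    steps and keep it at eps_end afterwards."""
    fraction = min(step / explore_steps, 1.0)
    return eps_start + fraction * (eps_end - eps_start)


# Hypothetical schedule: from 0.1 down to 0.0001 within 1000 steps.
eps_first = linear_epsilon(0, 0.1, 0.0001, 1000)
eps_middle = linear_epsilon(500, 0.1, 0.0001, 1000)
eps_late = linear_epsilon(5000, 0.1, 0.0001, 1000)
```

A larger number of explore steps keeps the agent exploring for longer, while a fixed explore probability corresponds to setting the start and end values equal.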

4.1.5 Gamma

The discount factor $\gamma$ decides how far-sighted the agent is. A value that is too small makes the agent focus on the immediate reward, while a value that is too large makes the agent give almost equal weight to all future rewards. The latter may confuse the agent about which action leads to a high or low return.

Figure 4.5: Hyper parameter tuning for discount factor .

Figure 4.5 shows the average score for four different gamma. Obviously, make the agent short-sighted and there is no significant change during epochs. When , the average score fluctuates widely after 50th epoch. Since has a gradually increasing trend, this will be used as the final discount factor.

4.2 Training Results

The tuned hyperparameters from the previous experiment are listed in Table 4.2. Although these parameters are tuned with the DQN algorithm, they are expected to fit the three improved algorithms, Double DQN, Dueling DQN and DQN with prioritized experience replay, because there is no big difference among them. All algorithms are trained for only 200 epochs because of the time limitation. The total training time of each algorithm is shown in the last column of Table 4.4.

Hyper parameter Value before tune Value after tune
Memory Size
Batch Size 128 128
Gamma 0.99 0.99
Initial
Final
Explore steps
Learning Rate
Table 4.2: Hyperparameters used in all experiments

4.2.1 DQN

Figure 4.6 shows the result of the DQN algorithm trained for 200 epochs with the tuned parameters. The average score gradually increases in this graph. This not only shows that the agent can learn to play the game with DQN but also that the design of the reward function is reasonable. This result is treated as the baseline and is compared with the other algorithms.

Figure 4.6: Training result for DQN.

4.2.2 Double DQN

Double DQN has a similar performance in training compared with DQN. As mentioned before, the effect of overestimation is not so significant in T-rex Runner because there are only two actions. As shown in Figure 4.7, there are four data points with average scores below while all average scores are above this value in DQN.

Figure 4.7: Training result for Double DQN compared with DQN.

4.2.3 Dueling DQN

Surprisingly, dueling DQN shows an incredible training performance after th epoch while the curve before that time seems similar. In Figure 4.8, the average score is above which is ten times higher than the maximum average score in DQN. However, these scores have a high variance which fluctuates widely between and . From the graph, the training process of dueling DQN is stable before th epoch and end up with an increasing trend. Since we tuned all hyperparameters based on DQN, these values may not be the best for dueling network which results in the stable and relatively low average scores before 150th epoch.

Figure 4.8: Training result for Dueling DQN compared with DQN.

4.2.4 DQN with Prioritized Experience Replay

Another important finding in this section is the performance of prioritized experience replay. This is expected to have a shorter training time and a higher performance compared with DQN. But the result shown in Figure 4.9 suggests that the agent failed to learn to play the game with this method. There are two reasons for that.

Figure 4.9: Training result for DQN with prioritized experience replay compared with DQN.

One problem comes from the algorithm. Compared with DQN, two extra steps are applied in PER: weight calculation and priority update. Following the implementation in [36], a sum tree, a data structure with $O(\log n)$ time complexity for sampling and updating, is used to store transitions instead of a linear list to accelerate memory-related operations. The training time of PER is more than twice that of DQN because of the batch size. Since every sampled transition is traversed when the priorities are updated, the larger the batch size, the longer this operation takes. Table 4.3 shows that this process is very time-consuming even with a batch size of 32. These data are extracted from the training results at the same score of 43. The step size is the average of ten records.
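A sum tree stores the priorities in the leaves of a binary tree whose internal nodes hold the sum of their children, so that both updating one priority and locating a sampled transition only walk one path between root and leaf. The following is a minimal array-based sketch, not the implementation used in this project; the class and method names are illustrative.

```python
class SumTree:
    """Binary tree in an array: node i has children 2i and 2i + 1,
    and the leaves occupy positions capacity .. 2 * capacity - 1."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.tree = [0.0] * (2 * capacity)

    def update(self, index, priority):
        """Set the priority of one leaf and refresh the sums above it."""
        pos = index + self.capacity
        self.tree[pos] = priority
        pos //= 2
        while pos >= 1:
            self.tree[pos] = self.tree[2 * pos] + self.tree[2 * pos + 1]
            pos //= 2

    def total(self):
        """Sum of all priorities, stored at the root."""
        return self.tree[1]

    def find(self, mass):
        """Return the leaf index whose priority interval contains mass,
        where 0 <= mass < total()."""
        pos = 1
        while pos < self.capacity:
            left = 2 * pos
            if mass < self.tree[left]:
                pos = left
            else:
                mass -= self.tree[left]
                pos = left + 1
        return pos - self.capacity


tree = SumTree(capacity=4)
for i, priority in enumerate([1.0, 2.0, 3.0, 4.0]):
    tree.update(i, priority)
leaf = tree.find(3.5)  # 3.5 falls into the interval [3, 6) of leaf 2
```

To sample a transition, a number is drawn uniformly from the interval between zero and `tree.total()` and passed to `find`, which returns each leaf with probability proportional to its priority.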

Algorithm      Score   Batch Size   Step Size
DQN            43      128          180
DQN with PER   43      128          7
DQN with PER   43      32           22
Table 4.3: Step size difference between DQN and DQN with PER

The other problem is from the game. Because this game is based on Chrome, it continues running when performing optimization while the game from official OpenAI Gym is paused during this operation. Therefore, there is a delayed time before sending the action to Chrome. This influence is enlarged in prioritized experience replay since the time for update operation with batch size takes approximately times longer than normal DQN.

Changing the hyperparameters can mitigate the first problem, but the result is still not as good as with the other algorithms. PER cannot be expected to help the agent reach a higher score under this circumstance because the game speeds up as time goes by. Since the time needed to update the priorities stays roughly constant, the agent makes fewer decisions per unit of game distance as the game accelerates. This may limit the performance of the model. The best way to eliminate the high computational cost of the priority update would be to redevelop the game, but due to the time limitation and the primary objective of this study, this result is kept, as the effect of batch normalization on this algorithm can still be compared.

4.2.5 Batch Normalization

Since the aim of this experiment is to find how batch normalization affects DQN algorithms, each result will be compared with the one without batch normalization which is shown in Figure 4.10.

Figure 4.10: Batch normalization on DQN, Double DQN, Dueling DQN and DQN with prioritized experience replay

From Figure 4.10, we can see that batch normalization can increase the mean of average scores in all experiments. But this also brings high variance which makes the average score diverge. According to the top-left graph, the first time for DQN agent to reach the average is approximately th epoch while the agent using DQN with batch normalization reach the same average score at th epoch and it is easy for it to get the higher score after that time. Double DQN curve has a similar trend but batch normalization in both of them also result in wide fluctuation. It is hard to say whether dueling network benefits from the batch normalization because there is a significant increase trend on the bottom left graph. However, it is still can be seen that BN enable the agent to reach the same performance much earlier from th epoch to th epoch. For DQN with prioritized experience replay, even the performance is limited by the game itself, the one with batch normalization still can get a relatively higher score.

4.2.6 Further Discussion

As the graphical results and their explanations are given above, this part discusses the numerical results of the experiments. Table 4.4 shows statistics of the training process. The maximum score is of little interest in most games, but since T-rex Runner is a racing game, it is included in the table. The 25%, 50% and 75% columns are percentiles, which are calculated by sorting the scores in ascending order and finding the observation at the corresponding position, so the 50% percentile is the same as the median. The last column shows the training time of each algorithm.

Algorithm           Mean      Std       Max     25%      50%     75%       Time (h)
DQN                 537.50    393.61    1915    195.75   481     820       25.87
Double DQN          443.31    394.01    2366    97.75    337     662.25    21.36
Dueling DQN         839.04    1521.40   25706   155      457     956.5     35.78
DQN with PER        43.50     2.791     71      43       43      43        3.31
DQN (BN)            777.54    917.26    8978    97.75    462.5   1139.25   32.59
Double DQN (BN)     696.43    758.81    5521    79       430.5   1104.25   29.40
Dueling DQN (BN)    1050.26   1477.00   14154   84       541.5   1520      40.12
DQN with PER (BN)   46.14     7.54      98      43       43      43        3.44
Table 4.4: Training results
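Statistics of the kind listed in Table 4.4 can be computed from a list of episode scores with the Python standard library. This is a sketch under assumptions: the sample standard deviation and linear interpolation between observations are used here, and the text does not state which conventions produced the table. The function name and the example scores are hypothetical.

```python
import statistics


def summarize(scores):
    """Mean, standard deviation, maximum and quartiles of a list of scores."""
    q25, q50, q75 = statistics.quantiles(scores, n=4, method="inclusive")
    return {
        "mean": statistics.mean(scores),
        "std": statistics.stdev(scores),  # sample standard deviation
        "max": max(scores),
        "25%": q25,
        "50%": q50,
        "75%": q75,
    }


summary = summarize([1, 2, 3, 4, 5])  # hypothetical scores
```

The 50% entry of the summary coincides with `statistics.median`, which matches the remark that the 50% percentile is the median.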

Ignoring the result of prioritized experience replay because of the unsuitable game environment, all algorithms achieve good results according to Table 4.4. The two algorithms with the dueling network stand out. The one with batch normalization has a mean over 1000, higher than the mean of 839.04 without BN. However, the latter reached the maximum score of 25706, which means the agent can keep running for around half an hour in one episode. Both of them have a standard deviation that exceeds the mean.

Double DQN, both with and without BN, performs worse than DQN. This indicates that double DQN may reduce performance in a low-dimensional action space. However, batch normalization narrows the gap between the two algorithms, as can be seen from the median and the percentiles.

Although most statistical metrics are improved by batch normalization, the variance is much higher than before. As shown in the table, the standard deviation of DQN with BN is more than twice that of DQN without BN. Only the dueling network has a lower standard deviation after BN. This is reasonable because there is a sharp increase at the very late stage of training, as shown in Figure 4.10.

4.3 Testing Results

After training, we use the latest model with a greedy policy and play T-rex Runner repeatedly with each algorithm. Figure 4.11 shows the boxplot of those results as well as the data collected from the human expert. It is obvious that the agent trained by DQN with prioritized experience replay fails to learn to play the game because of the game environment issue discussed in the last section. It is surprising that the performance of double DQN is far from satisfactory even though its training results are similar to those of DQN. Table 4.5 shows that the mean of the DQN results is more than three times the mean of double DQN. The dueling DQN algorithm achieves the highest score, although it also has the highest standard deviation, more than three times that of DQN.

Figure 4.11: Boxplot for test result with eight different algorithms
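
The test procedure, playing with a greedy policy and recording the final score of each episode, might be sketched as follows. This is an illustration under stated assumptions and not the evaluation code of this study: `ToyEnv` is a hypothetical stand-in for the game, `q_fn` stands for any function returning Q-value estimates, and the `reset`/`step` interface is assumed.

```python
import numpy as np

def greedy_action(q_values):
    """Pick the action with the largest estimated Q-value (no exploration)."""
    return int(np.argmax(q_values))

def evaluate(env, q_fn, n_episodes):
    """Play n_episodes with the greedy policy and return the episode scores."""
    scores = []
    for _ in range(n_episodes):
        state, done, score = env.reset(), False, 0
        while not done:
            state, reward, done = env.step(greedy_action(q_fn(state)))
            score += reward
        scores.append(score)
    return scores

class ToyEnv:
    """Hypothetical environment: one obstacle that must be jumped (action 1)."""

    def __init__(self, max_steps=5, obstacle_at=3):
        self.max_steps = max_steps
        self.obstacle_at = obstacle_at
        self.t = 0

    def reset(self):
        self.t = 0
        return self.t

    def step(self, action):
        self.t += 1
        crashed = self.t == self.obstacle_at and action != 1
        done = crashed or self.t >= self.max_steps
        return self.t, 1, done  # reward of 1 per surviving step

always_jump = evaluate(ToyEnv(), lambda s: [0.1, 0.9], n_episodes=3)
never_jump = evaluate(ToyEnv(), lambda s: [0.9, 0.1], n_episodes=3)
```

In the toy setting, a policy that never jumps ends every episode at the obstacle, which mirrors how a fixed low score in the test results marks the first obstacle the agent meets.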

According to Table 4.5, batch normalization improves the mean score of DQN and double DQN, and even the mean of DQN with PER is increased. However, it is not easy to say whether the effect of BN on dueling DQN is positive or not. From Figure 4.11, the one without BN has more outliers, which results in a high variance even though its mean is higher. Considering the median, which is not sensitive to outliers, the one with BN is better, and its minimum score is more than 200, which stands out from the other algorithms. Since a score of 43 indicates the first time the agent meets an obstacle, it is easy to infer that every trained model except dueling DQN with BN fails to jump over the first cactus at least once. But dueling DQN is not fully trained, which can be seen from the training result in Figure 4.8. That is also one reason for the high variance seen in the boxplot. The agent trained with dueling DQN achieved over 8000 at least three times.

| Algorithm | Mean | Std | Min | Max | 25% | 50% | 75% |
|---|---|---|---|---|---|---|---|
| Human | 1121.9 | 499.91 | 268 | 2384 | 758 | 992.5 | 1508.5 |
| DQN | 1161.30 | 814.36 | 45 | 3142 | 321.5 | 1277 | 1729.5 |
| Double DQN | 340.93 | 251.40 | 43 | 942 | 178.75 | 259.5 | 400.75 |
| Dueling DQN | 2383.03 | 2703.64 | 44 | 8943 | 534.75 | 1499.5 | 2961 |
| DQN with PER | 43.30 | 1.64 | 43 | 52 | 43 | 43 | 43 |
| DQN (BN) | 2119.47 | 1595.49 | 44 | 5823 | 1218.75 | 1909.5 | 2979.75 |
| Double DQN (BN) | 382.17 | 188.74 | 43 | 738 | 283.75 | 356 | 525.5 |
| Dueling DQN (BN) | 2083.37 | 1441.50 | 213 | 5389 | 1142.5 | 1912.5 | 2659.75 |
| DQN with PER (BN) | 45.43 | 7.384 | 43 | 78 | 43 | 43 | 43 |

Table 4.5: Test results
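
The outlier discussion for dueling DQN can be related to the quartiles in Table 4.5. The sketch below assumes the common boxplot convention in which points above Q3 + 1.5 × IQR are drawn as outliers; the text does not state which convention Figure 4.11 uses, and the `upper_fence` helper is hypothetical.

```python
# (25%, 75%, Max) copied from Table 4.5
test_stats = {
    "Dueling DQN": (534.75, 2961.0, 8943),
    "Dueling DQN (BN)": (1142.5, 2659.75, 5389),
}

def upper_fence(q1, q3, k=1.5):
    """Upper whisker limit under the assumed k * IQR rule."""
    return q3 + k * (q3 - q1)

fences = {name: upper_fence(q1, q3) for name, (q1, q3, _) in test_stats.items()}

for name, (q1, q3, max_score) in test_stats.items():
    has_outlier = max_score > fences[name]
    print(f"{name}: fence {fences[name]:.3f}, max above fence: {has_outlier}")
```

Under this assumption the fence for dueling DQN without BN is about 6600, so the runs that scored over 8000 would appear as outliers in the boxplot.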

Bibliography

  • [1] D. M. Allen (1971) Mean square error of prediction as a criterion for selecting variables. Technometrics 13 (3), pp. 469–475. Cited by: §2.1.3.
  • [2] A. G. Barto, R. S. Sutton, and C. W. Anderson (1983) Neuronlike adaptive elements that can solve difficult learning control problems. IEEE transactions on systems, man, and cybernetics (5), pp. 834–846. Cited by: §2.2.1.
  • [3] R. Bellman (1957) A markov decision process. journal of mathematical mechanics. Cited by: §2.2.1.
  • [4] R. Bellman et al. (1954) The theory of dynamic programming. Bulletin of the American Mathematical Society 60 (6), pp. 503–515. Cited by: §2.2.1.
  • [5] R. Bellman (1958) Combinatorial processes and dynamic programming. Technical report RAND CORP SANTA MONICA CA. Cited by: §2.2.1.
  • [6] Y. Boureau, F. Bach, Y. LeCun, and J. Ponce (2010) Learning mid-level features for recognition. In 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 2559–2566. Cited by: §2.1.4.
  • [7] G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba (2016) Openai gym. arXiv preprint arXiv:1606.01540. Cited by: §3.1.1.
  • [8] W. A. Clark and B. G. Farley (1955) Generalization of pattern recognition in a self-organizing system. In Proceedings of the March 1-3, 1955, western joint computer conference, pp. 86–91. Cited by: §2.2.1.
  • [9] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei (2009) Imagenet: a large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pp. 248–255. Cited by: §2.1.1.
  • [10] K. Fukushima (1980) Neocognitron: a self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position. Biological cybernetics 36 (4), pp. 193–202. Cited by: §2.1.1.
  • [11] R. Ge, F. Huang, C. Jin, and Y. Yuan (2015)

    Escaping from saddle points—online stochastic gradient for tensor decomposition

    .
    In Conference on Learning Theory, pp. 797–842. Cited by: §2.1.3.
  • [12] K. He, X. Zhang, S. Ren, and J. Sun (2015) Resnet-deep residual learning for image recognition. ResNet: Deep Residual Learning for Image Recognition. Cited by: §2.1.1.
  • [13] P. Hu (2012) Matrix calculus: derivation and simple application. Technical report Cited by: §2.1.3.
  • [14] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger (2017) Densely connected convolutional networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4700–4708. Cited by: §2.1.1.
  • [15] S. Ioffe and C. Szegedy (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167. Cited by: §2.1.5.
  • [16] A. H. Klopf (1972) Brain function and adaptive systems: a heterostatic theory. Technical report AIR FORCE CAMBRIDGE RESEARCH LABS HANSCOM AFB MA. Cited by: §2.2.1, §2.2.1.
  • [17] A. H. Klopf (1982) The hedonistic neuron: a theory of memory, learning, and intelligence. Toxicology-Sci. Cited by: §2.2.1.
  • [18] V. R. Konda and J. N. Tsitsiklis (2000) Actor-critic algorithms. In Advances in neural information processing systems, pp. 1008–1014. Cited by: §2.2.1.
  • [19] A. Krizhevsky, I. Sutskever, and G. E. Hinton (2012) Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pp. 1097–1105. Cited by: §2.1.1.
  • [20] S. Lawrence, C. L. Giles, A. C. Tsoi, and A. D. Back (1997) Face recognition: a convolutional neural-network approach. IEEE transactions on neural networks 8 (1), pp. 98–113. Cited by: §2.1.1.
  • [21] Y. LeCun, L. Bottou, Y. Bengio, P. Haffner, et al. (1998) Gradient-based learning applied to document recognition. Proceedings of the IEEE 86 (11), pp. 2278–2324. Cited by: §2.1.1, §2.1.1.
  • [22] L. Lin (1992) Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine learning 8 (3-4), pp. 293–321. Cited by: §2.2.1.
  • [23] M. Lin, Q. Chen, and S. Yan (2013) Network in network. arXiv preprint arXiv:1312.4400. Cited by: §2.1.1.
  • [24] W. S. McCulloch and W. Pitts (1943) A logical calculus of the ideas immanent in nervous activity. The bulletin of mathematical biophysics 5 (4), pp. 115–133. Cited by: §2.1.1.
  • [25] M. Minsky and S. Papert (1969) Perceptron: an introduction to computational geometry. The MIT Press, Cambridge, expanded edition 19 (88), pp. 2. Cited by: §2.1.1.
  • [26] V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu (2016) Asynchronous methods for deep reinforcement learning. In International conference on machine learning, pp. 1928–1937. Cited by: §2.2.1.
  • [27] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller (2013) Playing atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602. Cited by: §2.2.1, §2.2.6, §3.4, §3.5, §3.5.
  • [28] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al. (2015) Human-level control through deep reinforcement learning. Nature 518 (7540), pp. 529. Cited by: §2.2.1, §2.2.6, §3.4, §3.5, §3.5, §3.6.1, §3.7, §4.1.3.
  • [29] R. M. Neal (2001) Annealed importance sampling. Statistics and computing 11 (2), pp. 125–139. Cited by: §2.2.6.
  • [30] A. Y. Ng, D. Harada, and S. Russell (1999) Policy invariance under reward transformations: theory and application to reward shaping. In ICML, Vol. 99, pp. 278–287. Cited by: §3.2.
  • [31] F. Rosenblatt (1957) The perceptron, a perceiving and recognizing automaton project para. Cornell Aeronautical Laboratory. Cited by: §2.1.1.
  • [32] F. Rosenblatt (1960) Perceptron simulation experiments. Proceedings of the IRE 48 (3), pp. 301–309. Cited by: §2.1.1.
  • [33] D. E. Rumelhart, G. E. Hinton, R. J. Williams, et al. (1988) Learning representations by back-propagating errors. Cognitive modeling 5 (3), pp. 1. Cited by: §2.1.1.
  • [34] G. A. Rummery and M. Niranjan (1994) On-line q-learning using connectionist systems. Vol. 37, University of Cambridge, Department of Engineering Cambridge, England. Cited by: §2.2.1, §2.2.5.
  • [35] A. J. Samuel (1959-September 15) Aerosol dispensers and like pressurized packages. Google Patents. Note: US Patent 2,904,229 Cited by: §2.2.1.
  • [36] T. Schaul, J. Quan, I. Antonoglou, and D. Silver (2015) Prioritized experience replay. arXiv preprint arXiv:1511.05952. Cited by: §2.2.1, §2.2.6, §3.3, §4.2.4.
  • [37] D. Scherer, A. Müller, and S. Behnke (2010) Evaluation of pooling operations in convolutional architectures for object recognition. In International conference on artificial neural networks, pp. 92–101. Cited by: §2.1.4.
  • [38] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347. Cited by: §2.2.1.
  • [39] D. Silver, T. Hubert, J. Schrittwieser, I. Antonoglou, M. Lai, A. Guez, M. Lanctot, L. Sifre, D. Kumaran, T. Graepel, et al. (2017) Mastering chess and shogi by self-play with a general reinforcement learning algorithm. arXiv preprint arXiv:1712.01815. Cited by: §2.2.1.
  • [40] D. Silver (2015) University college london course on reinforcement learning. University College London. Cited by: Definition 1, Definition 2, Definition 3, Definition 4, Definition 5, Definition 6, Definition 7, Theorem 1, Theorem 2.
  • [41] K. Simonyan and A. Zisserman (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556. Cited by: §2.1.1.
  • [42] R. S. Sutton and A. G. Barto (2018) Reinforcement learning: an introduction. MIT press. Cited by: Figure 2.9, §2.2.1, §2.2.4, §2.2.5.
  • [43] R. S. Sutton, D. A. McAllester, S. P. Singh, and Y. Mansour (2000) Policy gradient methods for reinforcement learning with function approximation. In Advances in neural information processing systems, pp. 1057–1063. Cited by: §2.2.1.
  • [44] R. S. Sutton (1988) Learning to predict by the methods of temporal differences. Machine learning 3 (1), pp. 9–44. Cited by: §2.2.1.
  • [45] C. Szegedy, S. Ioffe, V. Vanhoucke, and A. A. Alemi (2017)

    Inception-v4, inception-resnet and the impact of residual connections on learning

    .
    In Thirty-First AAAI Conference on Artificial Intelligence, Cited by: §2.1.1.
  • [46] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich (2015) Going deeper with convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1–9. Cited by: §2.1.1.
  • [47] E. Thorndike (1911) Animal intelligence; experimental studies, by edward l. thorndike. The Macmillan company, New York. Cited by: §2.2.1.
  • [48] T. Tieleman and G. Hinton (2012) Lecture 6.5-rmsprop, coursera: neural networks for machine learning. University of Toronto, Technical Report. Cited by: §3.5.
  • [49] R. Vaillant, C. Monrocq, and Y. Le Cun (1994) Original approach for the localisation of objects in images. IEE Proceedings-Vision, Image and Signal Processing 141 (4), pp. 245–250. Cited by: §2.1.1.
  • [50] H. Van Hasselt, A. Guez, and D. Silver (2016) Deep reinforcement learning with double q-learning. In Thirtieth AAAI conference on artificial intelligence, Cited by: §2.2.1, §2.2.6.
  • [51] A. Waibel, T. Hanazawa, G. Hinton, K. Shikano, and K. J. Lang (1995) Phoneme recognition using time-delay neural networks. Backpropagation: Theory, Architectures and Applications, pp. 35–61. Cited by: §2.1.1.
  • [52] Z. Wang, T. Schaul, M. Hessel, H. Van Hasselt, M. Lanctot, and N. De Freitas (2015) Dueling network architectures for deep reinforcement learning. arXiv preprint arXiv:1511.06581. Cited by: Figure 2.10, §2.2.1, §2.2.6, §3.3, §3.5.
  • [53] C. J. C. H. Watkins (1989) Learning from delayed rewards. Cited by: §2.2.1, §2.2.5.
  • [54] B. Widrow, N. K. Gupta, and S. Maitra (1973) Punish/reward: learning with a critic in adaptive threshold systems. IEEE Transactions on Systems, Man, and Cybernetics (5), pp. 455–465. Cited by: §2.2.1.
  • [55] B. Widrow and M. E. Hoff (1960) Adaptive switching circuits. Technical report Stanford Univ Ca Stanford Electronics Labs. Cited by: §2.2.1.