FlapAI Bird: Training an Agent to Play Flappy Bird Using Reinforcement Learning Techniques

03/21/2020 · Tai Vu, et al. · Stanford University

Reinforcement learning is one of the most popular approaches to automated game playing. It allows an agent to estimate the expected utility of its state in order to choose optimal actions in an unknown environment. We apply reinforcement learning algorithms to the game Flappy Bird. We implement SARSA and Q-Learning with several modifications, including an ϵ-greedy policy, discretization, and backward updates. We find that SARSA and Q-Learning outperform the baseline, regularly achieving scores of 1400+, with a highest in-game score of 2069.

Code repository: FlapAI-Bird, an AI program that plays Flappy Bird using reinforcement learning.

1 Introduction

In the past decade, the advent of artificial intelligence (AI) has driven advances in speech recognition, machine translation, and computer vision, among other fields. One particular application has been to teach AI agents to play games. AI agents have surpassed human ability in classic arcade games and Go [5][7].

The most common way these agents are trained to play games is with reinforcement learning methods. Reinforcement learning is the process whereby an agent in a certain state $s$ chooses to take some action $a$ in a predefined environment and receives some reward $r$. As the agent takes more actions, it is able to determine, for each state, which action is the best to take if it wants to maximize its score.

We specifically focus on how reinforcement learning can be applied to newer games like Flappy Bird. In the game, a bird moves at a constant horizontal rate and falls according to in-game gravity. There are pipes with gaps that are initialized at random points. The player must navigate the bird through the pipes by tapping on the bird to make the bird jump vertically. After overcoming each obstacle, the player gains 1 in-game point. The game ends when the bird collides with a pipe or hits the ground. The player’s goal is to maximize the total score.

In this project, we apply various AI algorithms to train an agent to play Flappy Bird. Specifically, the approaches we use are SARSA and Q-Learning. We experiment with SARSA and different variants of Q-Learning: a tabular approach, Q-value approximation via linear regression, and approximation via a neural network. In addition, we implement several modifications, including an ϵ-greedy policy, discretization, and backward updates. After comparing the Q-Learning and SARSA agents, we find that Q-Learning agents perform the best, with regular scores of 1400+ and a maximum score of 2069. SARSA agents also decisively outperform the baseline, with a maximum score of 832.

2 Related Work

We have seen successful attempts at training agents to play Atari games using reinforcement learning. Mnih et al. tried this approach on seven Atari 2600 games and found that agents trained using Q-Learning outperformed human experts on three of them [4]. Instead of engineering features from scratch to represent state and using those features to estimate $Q(s, a)$, Mnih et al. used a convolutional neural network to learn $Q(s, a)$ instead.

There have also been attempts to play Flappy Bird using Q-Learning with convolutional neural networks as function approximators. One team used an approach similar to Mnih et al., passing an image of the screen into a convolutional neural network [5]. We were inspired by this project and by Mnih et al. to experiment with convolutional neural networks to estimate $Q$-values. However, our approach differs from these two resources in network architecture and hyperparameter choices.

Flappy Bird was a natural choice of game because of the work done by Mnih et al. and the novelty of teaching AI agents to play newer games. While we thought that using Q-Learning with convolutional neural networks to play Flappy Bird was an interesting application, we also wanted to see how it compared to other variants of Q-Learning. Therefore, we took the base Q-Learning algorithm and modified it in different ways, such as adjusting our exploration probability ϵ, discretizing our screen into grids of varying sizes, and using different function approximators to estimate $Q(s, a)$. We find that simpler methods like Q-Learning and SARSA outperform more complex deep Q-Learning approaches.

3 Infrastructure

We decided to use OpenAI Gym, an environment toolkit for training reinforcement learning agents [1][2]. OpenAI Gym provided an emulator for Flappy Bird and a useful representation of game states, namely the positions of the nearest two pipes and the y-position of the bird. Images of the screen were also accessible through the emulator. We slightly modified the state in the ways noted in Section 5 and implemented a reward function.
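
As a rough illustration of this infrastructure, the sketch below steps an agent through episodes of a Gym environment. The gym_ple import and the 'FlappyBird-v0' environment id are assumptions about the bindings, and the random action stands in for an agent's policy.

```python
# Minimal environment loop (sketch). Assumes gym_ple registers a
# "FlappyBird-v0" environment with OpenAI Gym; the agent is a placeholder.
import gym
import gym_ple  # registration of PLE environments (assumed)

env = gym.make('FlappyBird-v0')

for episode in range(10):
    observation = env.reset()
    done = False
    total_reward = 0.0
    while not done:
        action = env.action_space.sample()  # stand-in for agent.act(observation)
        observation, reward, done, info = env.step(action)
        total_reward += reward
    print('episode', episode, 'total reward', total_reward)
env.close()
```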

4 Implementation

Our implementation was structured into the following modular files.

  • main.py: The main program that processes command line arguments and runs different agents.

  • FlappyBirdGame.py: The game environment for Flappy Bird.

  • TemplateAgent.py: A template for different agents.

  • BaselineAgent.py: Baseline Agent.

  • SARSAAgent.py: SARSA Agent.

  • QLearningAgent.py: Q-Learning Agent.

  • FuncApproxLRAgent.py: Function Approximation Agent with Linear Regression.

  • FuncApproxDNNAgent.py: Function Approximation Agent with a Feed-Forward Neural Network.

  • FuncApproxCNNAgent.py: Function Approximation Agent with a Convolutional Neural Network.

  • utils.py: A file containing some helper functions.

5 Model

5.1 States

We came up with two formulations of state. The second formulation in Table 2 is only used in the Q-Learning agents that use a convolutional neural network as a function approximator. The first formulation of state in Table 1 is used in the remaining SARSA and Q-Learning agents.

Note that the input image is convolved into an output volume and flattened. The remaining feature, the bird's in-game velocity, is then appended to this flattened vector. This whole vector is then passed into the fully connected layers of the network.

Table 1: State representation for SARSA, Q-Learning, and function approximation using a linear model and FFNN.

  • The horizontal distance between the bird and the next pipe.
  • The vertical distance between the bird and the bottom pipe.
  • The in-game velocity of the bird.

Table 2: State representation for function approximation using a CNN.

  • An 80 × 80 grayscale image of the screen.
  • The in-game velocity of the bird.

Figure 1: A graphic illustration of the first state representation
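
To make the first representation concrete, a small helper along these lines could assemble the three features; the argument names are hypothetical stand-ins for the quantities the emulator exposes, not the exact interface of our code.

```python
def make_state(bird_x, bird_y, pipe_x, pipe_bottom_y, velocity):
    """Build the 3-feature state used by the SARSA, tabular, linear, and FFNN agents.

    All argument names are hypothetical stand-ins for values read from the emulator.
    """
    horizontal_distance = pipe_x - bird_x        # distance to the next pipe
    vertical_distance = pipe_bottom_y - bird_y   # distance to the bottom pipe
    return (horizontal_distance, vertical_distance, velocity)
```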

5.2 Actions

We represent an action as an element of {0, 1}, where 0 represents "flap" and 1 represents "do not flap", as predefined by the game environment we use.

5.3 Rewards

OpenAI Gym steps through frames of the game and sets a terminal flag if the agent hits a pipe in the current frame. We formulate rewards by giving the agent a positive reward of +5 every time it passes a pipe and -1000 for hitting a pipe. This heavily discourages the bird from making bad decisions and dying. In addition, we gave the bird a smaller reward of +0.5 for surviving between frames, to incentivize the bird to stay alive even if it has not passed a pipe yet. We found that without this +0.5 reward, the bird essentially moves randomly between pipes. This makes sense: the bird does not prefer one state to another between pipes if it is not rewarded for staying alive. Only when the bird passes a pipe does it learn the optimal actions to get to the pipe in the first place.
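
A minimal sketch of this reward shaping is shown below; the boolean flags passed_pipe and hit_pipe are hypothetical signals derived from the emulator, not the exact interface our code uses.

```python
def shaped_reward(passed_pipe, hit_pipe):
    """Reward shaping: +5 for clearing a pipe, -1000 for crashing,
    +0.5 for surviving another frame (hypothetical boolean inputs)."""
    if hit_pipe:
        return -1000.0
    if passed_pipe:
        return 5.0
    return 0.5
```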

6 Methods

6.1 Baseline

For our baseline algorithm, we simply applied a random policy. In particular, for each state, the agent selects action 0 with some probability $p$ and selects action 1 with probability $1 - p$; in our experiments we used $p = 0.1$ (see Table 4). We believed that this was a good baseline because it was relatively simple to implement and defined a clear strategy that a better-trained agent could outperform.
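
A sketch of this baseline policy (the default value of p here is illustrative):

```python
import random

def baseline_action(p=0.1):
    """Random baseline: flap (action 0) with probability p, otherwise do not flap."""
    return 0 if random.random() < p else 1
```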

6.2 Oracle

Since Flappy Bird has no maximum score, we decided that the oracle should be an agent that achieves an infinite score. Therefore, any of our agents will be outperformed by the oracle.

6.3 Reinforcement Learning Algorithms

We decided to use both on-policy and off-policy, model-free reinforcement learning algorithms to teach our agent to play Flappy Bird. We chose to do so because the rewards and transition probabilities are unknown to the agent. Therefore, we used Q-Learning and SARSA. We also modified Q-Learning by adjusting ϵ, discretizing the state space, using function approximation, and using backward Q-Learning. We also modified SARSA by discretizing our screen into grids.

6.4 Sarsa

SARSA [8] is an on-policy reinforcement learning algorithm that estimates $\hat{Q}_{\pi}(s, a)$ for a given state-action pair $(s, a)$, where the agent is following some policy $\pi$. SARSA takes in tuples $(s, a, r, s', a')$ and updates our estimate of $\hat{Q}_{\pi}(s, a)$ as follows:

$$\hat{Q}_{\pi}(s, a) \leftarrow (1 - \eta)\,\hat{Q}_{\pi}(s, a) + \eta\left[r + \gamma\,\hat{Q}_{\pi}(s', a')\right]$$

where $r$ is the reward for taking action $a$ in state $s$, $\gamma$ is a discount factor, and $\eta$ is a learning rate. Intuitively, SARSA creates a bootstrapped estimate of $\hat{Q}_{\pi}(s, a)$ by minimizing the difference between that estimate and the "true value" of our policy over many iterations.

6.5 Q-Learning

Q-Learning [9] is an off-policy reinforcement learning algorithm. Q-Learning creates bootstrapped estimates of $\hat{Q}(s, a)$ by minimizing the difference between each estimate and the "true value" of our policy. It is similar to SARSA, except that its update assumes the agent takes the optimal action (according to our current Q-value estimates) rather than an action defined by the policy $\pi$. At the end of each iteration, our Q-value estimates are updated according to the following rule:

$$\hat{Q}(s, a) \leftarrow (1 - \eta)\,\hat{Q}(s, a) + \eta\left[r + \gamma \max_{a'} \hat{Q}(s', a')\right]$$

where $r$ is the reward for taking action $a$ in state $s$, $\gamma$ is a discount factor, and $\eta$ is a learning rate.

6.6 Modifications

We modified Q-Learning and SARSA in the following ways to speed up learning and improve performance, since our implementations of the base algorithms performed poorly on this task.

6.6.1 Discretization

Initially, we treated our screen as a matrix of pixels, with each individual pixel a possible location for our agent. We quickly found that the number of locations was too large, so we chose to treat our screen as discrete grids of varying sizes to reduce the number of states our bird could occupy.

Specifically, for a pixel coordinate (a positive integer) and a level of discretization $k$, we replace the coordinate with the $k \times k$ grid cell that contains it, so that all pixels within the same cell map to the same state.
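
A sketch of this kind of discretization; rounding down to the start of the grid cell (rather than to the nearest multiple of k) is an assumption of this illustration.

```python
def discretize(x, k):
    """Map a pixel coordinate x to its k-pixel grid cell (floor version; assumption)."""
    return int(x // k) * k

# Example: with k = 10, coordinates 0-9 map to 0, 10-19 map to 10, and so on.
assert discretize(17, 10) == 10
```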

6.6.2 Epsilon-Greedy

We implemented an epsilon-greedy approach because we needed some way to explore our state space in early iterations. In our implementations of epsilon-greedy Q-Learning and SARSA, we experiment with $\epsilon \in \{0, 0.1\}$. This means that with probability $\epsilon$ our agent moves randomly, and with probability $1 - \epsilon$ our agent takes the greedy action $a = \arg\max_{a'} \hat{Q}(s, a')$. We have the following algorithms for Q-Learning and SARSA in this case:

1: Initialize all Q-values to 0
2: for iteration = 1, N do
3:     Initialize memory D
4:     Set initial state s_0
5:     Set t ← 0
6:     while s_t is not terminal do
7:         With probability ϵ select a random action a_t from {0, 1}
8:         otherwise select a_t = argmax_a Q(s_t, a)
9:         Perform action a_t and observe next state s_{t+1} and reward r_t
10:        Store observation (s_t, a_t, r_t, s_{t+1}) in D
11:        Increment t
12:    end while
13:    for observation (s_t, a_t, r_t, s_{t+1}) in D do
14:        Set Q(s_t, a_t) ← (1 − η) Q(s_t, a_t) + η [r_t + γ max_a Q(s_{t+1}, a)]
15:    end for
16: end for
Algorithm 1 Q-Learning with epsilon-greedy policy
1: Initialize all Q-values to 0
2: for iteration = 1, N do
3:     Initialize memory D
4:     Set initial state s_0
5:     Set t ← 0
6:     With probability ϵ select a random action a_0 from {0, 1}
7:     otherwise select a_0 = argmax_a Q(s_0, a)
8:     while s_t is not terminal do
9:         Perform action a_t and observe next state s_{t+1} and reward r_t
10:        With probability ϵ select a random action a_{t+1} from {0, 1}
11:        otherwise select a_{t+1} = argmax_a Q(s_{t+1}, a)
12:        Store observation (s_t, a_t, r_t, s_{t+1}, a_{t+1}) in D
13:        Increment t
14:    end while
15:    for observation (s_t, a_t, r_t, s_{t+1}, a_{t+1}) in D do
16:        Set Q(s_t, a_t) ← (1 − η) Q(s_t, a_t) + η [r_t + γ Q(s_{t+1}, a_{t+1})]
17:    end for
18: end for
Algorithm 2 SARSA with epsilon-greedy policy
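
The action-selection step shared by both algorithms can be written compactly as follows, with Q a (state, action)-keyed table as in the earlier sketch.

```python
import random

def epsilon_greedy_action(Q, state, epsilon, actions=(0, 1)):
    """With probability epsilon pick a random action, otherwise the greedy one."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])
```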

6.6.3 Forward versus Backward Updates

We also experimented with backward updates. In forward updates, Q-value updates are performed from the first frame of an episode to the most recent frame. Backward updates perform the Q-value updates in the opposite direction, which allows the most important information, namely the frame in which the bird hits a pipe, to be considered first.
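
In terms of the episode memory used in Algorithms 1 and 2, the only difference between the two orderings is the direction in which stored observations are replayed; a sketch:

```python
def replay_updates(memory, update_fn, backward=True):
    """Apply Q-value updates over an episode's stored observations.

    memory   : list of (s, a, r, s_next, terminal) tuples in the order they occurred
    update_fn: an update rule, e.g. the q_learning_update sketch above
    backward : if True, start from the terminal transition (the crash) and
               propagate its information toward earlier frames
    """
    ordered = reversed(memory) if backward else memory
    for s, a, r, s_next, terminal in ordered:
        update_fn(s, a, r, s_next, terminal)
```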

6.7 Function Approximation

Similar to the original version of Q-Learning, function approximation [10] attempts to minimize the difference between the predicted value $\hat{Q}(s, a; \mathbf{w})$ and the target value $r + \gamma \max_{a'} \hat{Q}(s', a'; \mathbf{w})$. In other words, the implied loss function is

$$L(\mathbf{w}) = \left(\hat{Q}(s, a; \mathbf{w}) - \left[r + \gamma \max_{a'} \hat{Q}(s', a'; \mathbf{w})\right]\right)^{2}.$$

However, instead of using the tabular update rule above, function approximation estimates the Q-values with a function $f$ and a set of weights $\mathbf{w}$, i.e. $\hat{Q}(s, a) = f(s, a; \mathbf{w})$.

6.7.1 Linear Regression

We chose to estimate our Q-values using a linear model. Specifically, we used the first formulation of state described in Section 5.1, with the action appended at the end as an extra feature. We took a weight vector $\mathbf{w}$ and a bias term $b$ and estimated Q-values as

$$\hat{Q}(s, a) = \mathbf{w}^{\top} \phi(s, a) + b,$$

where $\phi(s, a)$ is the feature vector consisting of the three state features followed by the action. We only used linear function approximation for Q-Learning.
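
A sketch of this linear approximator with a semi-gradient Q-Learning update; the learning rate and the absence of feature scaling are illustrative choices, not our exact hyperparameters.

```python
import numpy as np

class LinearQ:
    """Q(s, a) ~ w . phi(s, a) + b, where phi is the 3 state features plus the action."""

    def __init__(self, n_features=4, lr=0.01, gamma=1.0):
        self.w = np.zeros(n_features)
        self.b = 0.0
        self.lr = lr
        self.gamma = gamma

    def features(self, state, action):
        return np.append(np.asarray(state, dtype=float), float(action))

    def value(self, state, action):
        return float(self.w @ self.features(state, action) + self.b)

    def update(self, s, a, r, s_next, terminal=False):
        # Semi-gradient step toward the bootstrapped target.
        bootstrap = 0.0 if terminal else max(self.value(s_next, b) for b in (0, 1))
        target = r + self.gamma * bootstrap
        phi = self.features(s, a)
        error = target - self.value(s, a)
        self.w += self.lr * error * phi
        self.b += self.lr * error
```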

6.7.2 Feed-Forward Neural Network (FFNN)

We also chose to use a more complex estimator in the form of a feed-forward neural network. Our network consisted of a 3-neuron input layer, a 50-neuron hidden layer, a 20-neuron hidden layer, and a 2-neuron output layer. If we call the size of the current layer $n$ and the size of the previous layer $m$, then each layer is modeled as

$$\mathbf{z} = g(W\mathbf{x} + \mathbf{b}),$$

where $g$ is some nonlinearity (e.g. tanh, ReLU, sigmoid); in this case we use ReLU. Here $\mathbf{x} \in \mathbb{R}^{m}$ is the output of the previous layer, $W \in \mathbb{R}^{n \times m}$ is the weight matrix, $\mathbf{b} \in \mathbb{R}^{n}$ is the bias vector, and $\mathbf{z} \in \mathbb{R}^{n}$ is the output of the current layer.

Our 2-neuron output layer was set up so that we obtain the Q-values of the agent being in its current state and taking action 0 or 1. We chose the action associated with the neuron with the highest output.
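
A sketch of such a network, using PyTorch purely for illustration and matching the 3-50-20-2 layer sizes described above:

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Feed-forward Q-network: 3 input features -> 50 -> 20 -> 2 Q-values."""

    def __init__(self):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(3, 50), nn.ReLU(),
            nn.Linear(50, 20), nn.ReLU(),
            nn.Linear(20, 2),  # one Q-value per action (flap / do not flap)
        )

    def forward(self, state):
        return self.layers(state)

# Greedy action = index of the larger of the two outputs, e.g.
#   q_values = QNetwork()(torch.tensor([[dx, dy, v]], dtype=torch.float32))
#   action = int(q_values.argmax(dim=1))
# where dx, dy, v are the three state features (hypothetical names).
```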

6.7.3 Convolutional Neural Network (CNN)

We believed that the game images were also a good way to encode state, so we used our second formulation of state from Section 5.1. Since we were working with image data, a convolutional neural network was a natural choice. We preprocessed the input images by removing the background, converting them to grayscale, and resizing them to 80 × 80. This helped limit the number of parameters and eliminated unnecessary features such as the background and colors.
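
A sketch of this preprocessing step using OpenCV; here background removal is approximated by grayscale conversion and resizing alone, which is a simplification of the full pipeline.

```python
import cv2
import numpy as np

def preprocess_frame(frame_rgb):
    """Convert an RGB game frame to a normalized 80 x 80 grayscale image.

    Background removal is approximated here by grayscaling and resizing;
    the exact steps in our pipeline may differ (assumption).
    """
    gray = cv2.cvtColor(frame_rgb, cv2.COLOR_RGB2GRAY)
    resized = cv2.resize(gray, (80, 80), interpolation=cv2.INTER_AREA)
    return resized.astype(np.float32) / 255.0
```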

We used two convolutional layers, one with sixteen kernels and another with thirty-two kernels. Finally, we flattened the output of the last convolutional layer and appended the bird's velocity to construct our state vector, as described in Section 5.1.

We passed this vector through fully connected layers, again using the ReLU nonlinearity. The fully connected layers consisted of a 9248-neuron input layer and a 2-neuron output layer. The output of the 2-neuron layer represented the Q-values of the agent being in its current state and taking actions 0 and 1.

7 Results and Discussion

7.1 Model Performance: SARSA and Q-Learning

We decided to evaluate each model based on its maximum score and average score. We chose to do so because both are important in determining which algorithm is "best" for playing Flappy Bird. The maximum score represents the peak performance of an algorithm, while the average score better describes each agent's consistency when trained for a large number of iterations.

Figures 2 and 3 show some examples of the curves that we see when training our agent for 10,000 iterations using Q-Learning. Across all variants of Q-Learning that we implement, both the maximum score and the average score increase. Since the average score increases steadily, we know that the agent gradually learns to play the game better as the number of iterations grows.

Figure 2: Training curve for Q-Learning, backward updates
Figure 3: Training curve for Q-Learning, forward updates
Algorithm   | Discretization | Update Order | Epsilon | Mean    | Standard Deviation | Max
Baseline    | None           | -            | -       | 0.13    | 0.357              | 2
Q-Learning  | 10             | Backward     | 0       | 209.298 | 216.967            | 1491
Q-Learning  | 10             | Backward     | 0.1     | 159.398 | 162.553            | 1224
Q-Learning  | 10             | Forward      | 0       | 67.21   | 69.067             | 582
Q-Learning  | 10             | Forward      | 0.1     | 63.48   | 63.346             | 448
Q-Learning  | None           | Backward     | 0       | 0.507   | 0.796              | 4
Q-Learning  | None           | Backward     | 0.1     | 0.772   | 0.942              | 5
Q-Learning  | None           | Forward      | 0       | 0.386   | 0.644              | 4
Q-Learning  | None           | Forward      | 0.1     | 0.316   | 0.593              | 3
Q-Learning  | 5              | Backward     | 0       | 36.236  | 41.438             | 438
Q-Learning  | 50             | Backward     | 0       | 80.486  | 77.624             | 622
SARSA       | 10             | Forward      | 0       | 87.553  | 87.31              | 530
SARSA       | 10             | Forward      | 0.1     | 117.317 | 112.998            | 811
Table 3: Comparison across all levels of hyperparameters for SARSA and Q-Learning, 8000 iterations

7.2 SARSA versus Q-Learning

We first seek to understand the difference in performance between Q-Learning and SARSA. From the plots in Figure 4, with discretization held constant, we see that Q-Learning reaches a higher maximum than SARSA earlier on. This is likely because Q-Learning is an off-policy algorithm and is thus able to estimate the optimal policy faster than SARSA, an on-policy algorithm, can. Especially when trained for only a few thousand iterations, the agent trained by SARSA must try out both actions at different states to find the one that yields the greatest value, rather than directly following an estimate of the optimal policy as Q-Learning can. Once we trained our SARSA agent for more iterations, though, it performed comparably to our Q-Learning agent in terms of maximum and average score.

Figure 4: Training curves for Q-Learning and SARSA, forward updates

7.3 Comparison of Discretization Levels

We also chose to try out different discretizations of our screen, ranging from no discretization to coarse grids (see Table 3). The main trade-off that we considered was the precision of our Q-value estimates versus the reduction in the size of the state space.

Implementing Q-Learning without discretization caused our agent to converge extremely slowly. Since every pixel on the screen is a distinct location the bird can occupy, and the bird can take on many different velocities, there were far too many states to explore. This translated to the bird only sparsely exploring the state space: for most state-action pairs $(s, a)$, $\hat{Q}(s, a) = 0$ because most states are never visited. As a result, the agent moved randomly most of the time when we ran it for 8000 iterations, making it no better than the baseline. It would take an unreasonably large number of iterations to explore all states.

On the other hand, discretizing to an extreme extent, as with our coarsest grids, converges quickly but leads to a suboptimal solution. Quick convergence is the result of the agent being able to explore most of the state space. However, we make the strong, and often false, assumption that every state within a grid cell has the same Q-value. Thus, we are not able to converge to the optimal solution, simply because our initial modeling assumption was false.

Out of the different discretizations, the 10 × 10 level performs the best, with a max score of 1896. This means that slicing the screen into 10 × 10 grids best trades off Q-value estimate precision against state space size. After a certain number of iterations, each line in Figure 5 plateaus. This corresponds to convergence: once the state space is fully explored, the agent has found the optimal policy at its level of discretization. Note that as the discretization level becomes smaller (finer grids), the line plateaus more slowly but attains a higher score. This follows our intuition that no discretization, which corresponds to treating every pixel as its own state, will perform best as the number of iterations grows, but will take the longest to converge.

Figure 5: Maximum score achieved for Q-Learning, backward updates

7.4 Comparison of greedy policies

Our next batch of experiments explored the effect of an epsilon-greedy policy. One of the most interesting differences came from evaluating the performance of Q-Learning with discretization with and without an epsilon-greedy approach, i.e. with $\epsilon = 0.1$ or $\epsilon = 0$. Surprisingly, we see that the model with $\epsilon = 0.1$ performed worse than the one with $\epsilon = 0$. This is consistent with the cliff-walking effect detailed by Sutton and Barto [7], where agents trained by Q-Learning tend to choose paths that run along the "edge" of the cliff. Our pipe represents the cliff, because it is associated with a large negative reward. We see that the Q-Learning agents stay very close to pipes; with an epsilon-greedy approach, the bird is then almost guaranteed to eventually choose a random action while it is passing through a pipe, and acting randomly inside a pipe is likely to end the game. This is reflected in our results: the average scores of the two are quite similar, 225 for $\epsilon = 0$ and 154 for $\epsilon = 0.1$, but the maximum scores are 1896 for $\epsilon = 0$ and 1125 for $\epsilon = 0.1$, a significant difference.

In the emulator, SARSA agents try to maximize their distance from the pipes when passing between them, in direct contrast to Q-Learning agents, which remain close to one pipe. Therefore, the impact of an epsilon-greedy policy is less pronounced for SARSA. This is evident in Figures 6 and 7, where the difference between maximum scores for SARSA is 448, versus 987 for Q-Learning. A random move is less likely to end the game if a "safe" middle path is taken than if the path runs close to a pipe.

In addition, with the screen discretized into grids, both the Q-Learning and SARSA agents are able to fully explore the state space. Thus, there is no advantage to having $\epsilon > 0$; a nonzero $\epsilon$ is only useful when the state space is too large to explore in the first place.

Figure 6: Epsilon comparison for Q-Learning, forward updates, with discretization
Figure 7: Epsilon comparison for SARSA, forward updates, with discretization

7.5 Comparison of Forward versus Backward Q-Learning

We observe in Figure 8 that backward Q-Learning increases the scores significantly, to about 2000, after the first 3000 iterations. The scores then stay around 1500 for the rest of training. Meanwhile, the scores produced by forward updates rise more gradually and stay below those for backward updates most of the time, although they eventually reach 2000 after 10000 iterations. Hence, backward Q-Learning converges to the optimal policy much more quickly than forward Q-Learning does. Here, we define the optimal policy as the policy determined by the best set of Q-values that an agent converges to after a sufficient number of iterations, given a certain level of discretization. For example, in Figure 8, backward Q-Learning with 10 × 10 discretization converges to a policy that produces scores of around 1500 after 3500 iterations.

The reason backward Q-Learning converges faster than forward Q-Learning is that backward updates allow the agent to learn important information earlier. Specifically, backward Q-Learning updates Q-values starting from the most recent, and also the most important, experience: the frame in which the bird hits a pipe. This information is then propagated through all earlier states within a single iteration. Thus, the bird learns which states lead to hitting the pipes more quickly and effectively, so it is able to avoid the pipes after a small number of iterations.

For example, consider a simple episode with the sequence $(s_1, a_1, r_1, s_2, a_2, r_2, s_3)$, where $s_3$ is a terminal state in which the bird hits a pipe. If we perform forward Q-Learning, $\hat{Q}(s_1, a_1)$ gets updated first, based on $\hat{Q}(s_2, a_2)$, and it receives no information about the pipe at $s_3$. However, if we perform backward Q-Learning, $\hat{Q}(s_2, a_2)$ gets updated first and changes significantly, because $r_2 = -1000$. Then $\hat{Q}(s_1, a_1)$ gets updated based on $\hat{Q}(s_2, a_2)$, so $s_1$ is now "notified" about the pipe at $s_3$ via $\hat{Q}(s_2, a_2)$.

Figure 8: Comparison of forward and backward updates for Q-Learning, 10 × 10 discretization

7.6 Model Performance: Function Approximation

All of our attempts at function approximation performed comparably to our baseline, as summarized in Table 4. We believe that this is due to various shortcomings in our formulation of features and neural networks, as described below.

Algorithm          | Discretization | Update Order | Epsilon | Mean  | Standard Deviation | Max
Baseline           | None           | -            | 0.1     | 0.13  | 0.357              | 2
Linear Regression  | None           | Backward     | 0.1     | 0.21  | 0.415              | 2
FFNN               | None           | Backward     | 0.1     | 0.326 | 0.437              | 3
CNN                | None           | Backward     | 0.1     | 0.28  | 0.372              | 3
Table 4: Comparison across all Q-Learning agents with function approximation, backward updates

7.6.1 Linear Regression

For linear regression, a linear model was likely too simple to capture the underlying complexities of the value of a given state. Intuitively, this makes sense; for example, for the distance features described in Section 5.1, there is no clear linear relationship between the reward an agent receives and its distance from the pipe. In other words, moving closer to a pipe does not change the expected reward of an agent by a fixed amount. Therefore, we needed a more complicated model to estimate our Q-values.

7.6.2 Feed-Forward Neural Network (FFNN)

After we saw that our linear regression failed to outperform the baseline, we decided to switch to a model that could capture a highly nonlinear relationship between the state and the resulting Q-values. The FFNN had a similar issue to linear regression: while the model itself has higher variance, our inputs were still too simple. Essentially, we believe that the three state features alone were not enough to predict the value of a state.

7.6.3 Convolutional Neural Network (CNN)

We wanted to add more relevant features than our initial three from Section 5.1, so we created a second formulation of state, also described in Section 5.1. However, this formulation did not seem to make a difference, most likely because we trained for too few iterations. Looking at related resources, we found that a deep learning approach to Flappy Bird requires learning over hundreds of thousands of iterations [6]. We trained our network for only 8000 iterations due to limited computing resources. This result was expected: no discretization was applied beforehand, so the CNN must learn the Q-values for all states in a large state space, and it will take many iterations for the policy learned by the CNN to become optimal.

8 Conclusion

In this paper, we implement and compare the performance of agents trained by modified variants of the SARSA and Q-Learning algorithms. We observe that discretization helped our agent converge more quickly, within 8000 iterations, and achieve a reasonably high score of 2069. We also find that, with discretization, epsilon-greedy policies generally perform worse for Q-Learning on average. Finally, we find that updating our Q-values backward in time helps them converge to their true values faster. For function approximation, we go through a natural progression of models, starting with a simple linear model and progressing to a CNN; however, we find that none of them outperformed the baseline.

Altogether, all of our Q-Learning and SARSA agents outperform the baseline; only the Q-Learning agents with function approximation performed at or just slightly above it, largely due to poor feature engineering or lack of training. In the end, our Q-Learning agent with 10 × 10 discretization, backward updates, and $\epsilon = 0$ performed the best.

In the future, training the CNN for hundreds of thousands of iterations is likely to be more successful. We would also like to adjust hyperparameters and experiment with more levels of discretization, to see whether an agent trained with a finer discretization can outperform our current best model. In addition, we would like to try adding polynomial features to our linear regression and FFNN models so they can capture more complex relationships in our data.

References