The last few years have been marked by exceptional progress in the field of Artificial Intelligence (AI). Much of this progress has come from recent advances in deep learning. Models employing deep neural networks have achieved remarkable performance in many areas, including speech recognition, object recognition and reinforcement learning [lecun2015deep, mnih2016asynchronous, mnih2015human, gu2016continuous]. In reinforcement learning, Mnih et al. proposed a deep learning approach that estimates the Q-function over state-action pairs in Atari games and achieves human-level performance [mnih2015human, mnih2013playing, guo2014deep]. Silver et al. applied a similar deep learning approach to learn policy and value functions over states and actions in the complex game of Go [silver2016mastering, silver2017mastering].
A second front on which AI is progressing rapidly draws inspiration from human cognitive processes [lake2016building, AAAI1714840, battaglia2016interaction, nguyen2020learning, rempe2020predicting, lerer2016learning, baradel2019cophy]. Lake et al. developed a model for human-level concept learning through probabilistic induction [lake2015human]. The approach uses methods of probabilistic programming [ghahramani2015probabilistic] to construct computational frameworks that capture human abilities in forming concepts. Such computational frameworks have also been used to draw insights into human cognition in other areas: Battaglia et al. proposed a model based on an intuitive physics engine as a cognitive mechanism that humans use to make robust inferences in complex natural scenes [battaglia2013simulation, battaglia2016interaction].
In human development, infants have primitive concepts of how objects move in their environments; this is observed through their ability to track moving objects around them. It is through these primitive concepts that infants grow to learn faster and make more accurate predictions [mccloskey1983intuitive, lake2016building]. Experiments on human cognitive processes of intuitive physics inference show that as inference tasks get harder, human response time increases [hamrick2015think]. This is attributed to a trade-off between response time and the number of physics simulations performed: the harder the task, the more simulation runs humans seem to perform.
Several works have proposed models for agents to develop the sense of intuitive physics that humans possess [watters2017visual, NIPS2016_6113, wu2016physics, wu2015galileo, bakhtin2019phyre]. Agrawal et al. approached the problem by reverse engineering intuitive physics [NIPS2016_6113]. Using robot arms, the agent performed an enormous number of actions (i.e. pokes) on objects placed on a table to understand how objects move. The approach is inspired by how infants develop their physics intuition. The model is then tested by requiring an agent to move objects on a table to match a given final state of object positions. Wu et al. proposed a model capable of predicting the movement of objects placed on an inclined surface given image pixel data [wu2015galileo]. The model incorporates deep learning methods to learn physical features of objects and a 3D physics simulation to predict their trajectories and where an object will most likely stop. The approach achieved an accuracy comparable to that of human subjects.
In this paper, we propose a framework for bots to deploy tools for interacting with the physics of their environments. The bots couple a probabilistic program with a physics simulation engine to perform inference about moving objects in a setting governed by Newtonian laws of motion. However, probabilistic programs can be slow in such settings because they need to generate many samples. Hence, we complement our approach with a model-free component that makes the sampling procedure more efficient by learning from experience during game play. We present a case where a combination of model-free approaches (a CNN in our model) and model-based approaches (probabilistic programming and physics simulation) achieves what neither could alone [kansky2017schema, schulman2015trust, henderson2017deep]. Existing research has proposed combining model-free and model-based approaches in other contexts [chebotar2017combining, nagabandi2018neural]. The combined model outperforms purely model-free and purely model-based approaches [evans2018learning]. Such approaches have been argued to combine the best of both worlds [battaglia2018relational]. The case study presents empirical results for the model on the game of Flappy Bird, in which a bird in free fall must avoid obstacles by jumping through openings. Our model exhibits patterns similar to human behavior in the trade-off between sampling time and decision accuracy [hamrick2015think], as well as the ability to learn through game-play experience.
In Sections 2, 3 and 4 of the paper, we propose the framework and discuss the process of inference and the learning of parameters. Section 5 presents empirical results of the model's performance while playing Flappy Bird, followed by a discussion.
Given the state of an agent in an environment governed by Newtonian laws of physics, the goal is for the agent to predict the desired behavior for that state. It does so by probabilistically sampling the actions most suitable for achieving the desired behavior. To illustrate, given a bot in a state very close to hitting the floor, the agent should first infer that it needs to increase altitude and then bias the sampling process toward actions consistent with increasing altitude to avoid a collision.
The decision-making pipeline of the agent includes two main subparts: the first is a convolutional neural network (CNN) and the second is a probabilistic framework for sampling actions in an intuitive physics setting. The general architecture of the pipeline is shown in Figure 1. The CNN takes pixel data as input and produces the parameters corresponding to prior probabilities for the probabilistic program. The probabilistic program is parametrized by the prior probabilities for each action, the probabilities of each action, the sampled action, the velocity of the action and the collision state at a given time step.
Convolutional Neural Network (CNN)
The objective of the CNN is to make the sampling procedure of the probabilistic program more efficient. It does this by estimating the parameters of the prior probability distribution over the actions to be taken given a state of the agent. This helps the probabilistic model sample more effectively by skewing a Dirichlet distribution so that the sampled actions are more likely to match the desired behavior. The CNN has an architecture similar to that developed by Mnih et al. [mnih2015human]. It takes the most recent frames as input and outputs the parameters of the prior over the action probabilities. The input of the neural network consists of the frames of pixel data for the past 4 time steps after preprocessing. Preprocessing consists of converting the raw pixel data to grayscale and then rescaling each frame to a fixed resolution. The first hidden layer convolves 32 filters with stride 4 over the input frames and applies a rectifier nonlinearity. The second hidden layer convolves 64 filters with stride 2 and then applies a rectifier nonlinearity. The third hidden layer convolves 64 filters with stride 1 and then applies a rectifier nonlinearity. The fourth layer is fully connected with 512 nodes and a rectifier nonlinearity. The output layer is fully connected and has as many nodes as there are actions in the game. The output of the CNN parametrizes a Dirichlet distribution in the probabilistic framework.
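The layer sizes above can be checked with a small shape calculation. The sketch below is illustrative only: the filter counts and strides come from the text, while the 84x84 input resolution and the 8, 4 and 3 pixel kernel sizes are assumptions borrowed from Mnih et al. [mnih2015human], and unpadded convolutions are assumed as well.

```python
def conv_output_size(input_size: int, kernel: int, stride: int) -> int:
    """Spatial size after an unpadded convolution."""
    return (input_size - kernel) // stride + 1

# (filters, kernel, stride). Filter counts and strides follow the text;
# kernel sizes 8, 4, 3 are ASSUMED from Mnih et al. (2015).
CONV_LAYERS = [(32, 8, 4), (64, 4, 2), (64, 3, 1)]

def feature_map_shapes(input_size: int = 84):
    """Return (channels, height, width) after each conv layer.

    The default 84x84 input resolution is an assumption, not a value
    stated in the text.
    """
    shapes = []
    size = input_size
    for filters, kernel, stride in CONV_LAYERS:
        size = conv_output_size(size, kernel, stride)
        shapes.append((filters, size, size))
    return shapes

shapes = feature_map_shapes()
channels, height, width = shapes[-1]
flattened = channels * height * width  # inputs to the 512-node layer
```

Under these assumptions the three convolutional layers produce 20x20, 9x9 and 7x7 feature maps, so the fully connected layer with 512 nodes receives 3136 inputs.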
The model learns from experience through the data generated during game play. Sources of data include the position information of objects in the environment, the pixel data of the state, the actions taken by the agent and the rewards received at every time step.
The CNN learns from the pixel data of past states and actions: the inputs to the CNN are the pixel data and the targets are the frequencies of actions in the subsequent time steps of a predefined interval, that is, how often each decision is made in the time steps following a state. Tuples after which the model was negatively rewarded are excluded from the training sample of the CNN, since the goal is to learn the parameter values that result in the bot being positively rewarded. The CNN is fit by applying stochastic gradient descent on the following cost function:
Where the parameters for the neural network are denoted by .
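As an illustration of how such training targets might be assembled from logged game play, here is a minimal sketch. The function name, the `horizon` parameter and the exact exclusion rule (skip a state if any reward in its look-ahead window is negative) are assumptions made for this example, not details confirmed by the text.

```python
from collections import Counter

def build_training_targets(actions, rewards, num_actions, horizon):
    """Return (time_index, action_frequencies) pairs for training the CNN.

    actions[t] is the action index taken at step t and rewards[t] is the
    reward received at step t. `horizon` is a hypothetical name for the
    predefined look-ahead interval.
    """
    samples = []
    for t in range(len(actions) - horizon + 1):
        window_actions = actions[t:t + horizon]
        window_rewards = rewards[t:t + horizon]
        # ASSUMED exclusion rule: drop the state if the bot is negatively
        # rewarded anywhere inside the look-ahead window.
        if any(r < 0 for r in window_rewards):
            continue
        counts = Counter(window_actions)
        freqs = [counts.get(a, 0) / horizon for a in range(num_actions)]
        samples.append((t, freqs))
    return samples

# Toy log: action 1 = flap, action 0 = do nothing; the bot collides at step 4.
targets = build_training_targets(
    actions=[1, 0, 0, 1, 1, 0],
    rewards=[1, 1, 1, 1, -1, 1],
    num_actions=2,
    horizon=2,
)
```

In this toy log, the states at steps 3 and 4 are dropped because their windows contain the negative reward, and the remaining states are paired with the action frequencies observed after them.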
We use probabilistic programming to perform the inference tasks in a fashion similar to the discussion by Ghahramani [ghahramani2015probabilistic]. This section discusses the process by which the data for the world of the agent and its actions are generated. We employ control flow to sample actions that result in no collisions.
At every iteration, the agent uses a probabilistic program coupled with the physics engine, as illustrated in Algorithm 1, to perform inference about the physics of its future. The algorithm takes as inputs the state of the world, defined by the past 4 frames of pixel data, and the past position data of objects in the environment. The CNN estimates the direction in which the agent should be moving by estimating the Dirichlet parameters. An alternative approach is to provide the parameters of a uniform Dirichlet over the action probabilities, which can be computationally cumbersome when a decision must be made under tight time constraints.
The agent proceeds with an iterative process that alternates between a probabilistic program and a physics engine simulation. The process starts by sampling the probabilities for each action from a Dirichlet distribution. The agent then samples actions from a categorical distribution parametrized by these probabilities. To account for the possible stochasticity of actions, a Gaussian distribution is fitted over the observed changes in velocity for every action type; the model learns these distributions from the history of its position data and the actions taken in the past. For every move the agent samples, the change in velocity is sampled from the Gaussian corresponding to that action type.
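The three sampling steps (Dirichlet, categorical, Gaussian) can be sketched with the Python standard library. This is a hedged illustration rather than the authors' implementation: the function names, the `velocity_stats` structure and the numeric values are hypothetical, and the Dirichlet draw uses the standard construction of normalizing independent gamma variates.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw action probabilities from Dirichlet(alpha) via normalized gammas."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def sample_plan(alpha, velocity_stats, horizon, rng):
    """Sample action probabilities and one plan of (action, velocity_change).

    velocity_stats[a] = (mean, std) of the Gaussian fitted to the velocity
    changes observed for action a. All names here are hypothetical.
    """
    probs = sample_dirichlet(alpha, rng)
    plan = []
    for _ in range(horizon):
        # Categorical draw over action indices using the sampled probabilities.
        action = rng.choices(range(len(alpha)), weights=probs, k=1)[0]
        mean, std = velocity_stats[action]
        plan.append((action, rng.gauss(mean, std)))
    return probs, plan

rng = random.Random(0)
probs, plan = sample_plan(
    alpha=[4.0, 1.0],                                # illustrative prior
    velocity_stats={0: (9.0, 0.5), 1: (-1.0, 0.1)},  # illustrative values
    horizon=5,
    rng=rng,
)
```

Here the action probabilities are drawn once per plan and reused for every step of that plan, which is one reading of the description above.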
After an action plan is sampled from the probabilistic program, the simulation engine simulates the plan. The physics simulation returns the status of the agent at the end of the simulated time steps. The process continues to sample and then simulate iteratively for a fixed budget of milliseconds. The physics engine estimates physical characteristics of the environment (for example, the gravitational acceleration) by applying Newtonian laws of motion to the historical position data it has observed.
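One simple way to recover a constant acceleration such as gravity from historical positions is to average second finite differences. The sketch below shows that idea only; it is not necessarily how the physics engine described here does it, and it assumes evenly spaced samples taken while no action changes the velocity.

```python
def estimate_acceleration(positions, dt):
    """Estimate a constant acceleration from evenly spaced positions.

    Under constant acceleration a, the second difference
    y[t+1] - 2*y[t] + y[t-1] equals a * dt**2.
    """
    if len(positions) < 3:
        raise ValueError("need at least three positions")
    second_diffs = [
        positions[i + 1] - 2 * positions[i] + positions[i - 1]
        for i in range(1, len(positions) - 1)
    ]
    return sum(second_diffs) / len(second_diffs) / dt ** 2

# Synthetic free-fall track: y = y0 + v0*t + 0.5*g*t^2 (illustrative values).
g_true, dt = -9.8, 0.1
track = [10.0 + 2.0 * (i * dt) + 0.5 * g_true * (i * dt) ** 2 for i in range(6)]
g_estimate = estimate_acceleration(track, dt)
```

On noise-free data the estimate matches the true acceleration up to floating-point error; with noisy positions, averaging over a longer history reduces the variance.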
At the end of the iterative sampling and simulation process, the bot estimates the conditional distribution from the set of simulated action plans it has generated.
The action to take at the current time step is then the one with the highest estimated probability of avoiding a collision.
Before starting a new iteration of sampling and simulating, the parameters of the action prior are updated with the sampled actions if the simulation results in no collision. In that case, the process also extends the simulation horizon by additional time steps. The strategy is to keep expanding the horizon while no collisions are observed, so that the bot can detect obstacles further along its path. The sampling and simulation loop runs until the time budget in milliseconds has elapsed. The algorithm returns the action with the highest probability of the bot surviving unwanted collisions over the simulated time steps.
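The sample-then-simulate loop can be sketched end to end on a toy one-dimensional world. Everything below is a simplified illustration under stated assumptions: the physics constants, the function names, the `max_samples` option (added so the loop can be run deterministically) and the fallback when no plan survives are all hypothetical. The decision rule counts surviving plans by their first action, which is one reading of the estimator described above rather than the paper's exact formula.

```python
import random
import time

FLAP, NOOP = 0, 1  # hypothetical action indices

def survives(y, vy, plan, gravity=-2.0, flap_velocity=3.0,
             floor=0.0, ceiling=100.0):
    """Toy 1-D physics: True if the plan never hits the floor or ceiling."""
    for action in plan:
        vy = flap_velocity if action == FLAP else vy + gravity
        y += vy
        if y <= floor or y >= ceiling:
            return False
    return True

def choose_action(y, vy, alpha, budget_ms=20.0, max_samples=None,
                  horizon=1, horizon_step=1, rng=None):
    """Pick the first action that most often starts a collision-free plan."""
    rng = rng or random.Random()
    alpha = list(alpha)      # copy so the caller's prior is untouched
    safe_first = [0, 0]      # surviving plans, counted by first action
    deadline = time.monotonic() + budget_ms / 1000.0
    samples = 0
    while True:
        if max_samples is not None:
            if samples >= max_samples:
                break
        elif time.monotonic() >= deadline:
            break
        samples += 1
        # Dirichlet draw over two actions via normalized gamma variates.
        gammas = [rng.gammavariate(a, 1.0) for a in alpha]
        p_flap = gammas[FLAP] / sum(gammas)
        plan = [FLAP if rng.random() < p_flap else NOOP
                for _ in range(horizon)]
        if survives(y, vy, plan):
            safe_first[plan[0]] += 1
            for action in plan:       # update the prior with safe actions
                alpha[action] += 1.0
            horizon += horizon_step   # look further ahead next time
    # ASSUMED fallback: if nothing survived, follow the prior.
    scores = safe_first if sum(safe_first) > 0 else alpha
    return max((FLAP, NOOP), key=lambda a: scores[a])

# Bird one unit above the floor: doing nothing collides on the next step.
decision = choose_action(y=1.0, vy=0.0, alpha=[1.0, 1.0],
                         max_samples=200, rng=random.Random(0))
```

In the example call the bird sits just above the floor, so every plan that starts with doing nothing collides immediately and only plans that start with a flap can be counted as safe. Passing `budget_ms` instead of `max_samples` makes the loop run against a wall-clock limit.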
Experiments and Results
The approach was tested on the game of Flappy Bird, a game that requires inference about the physics of the bird and its environment in order to avoid collisions with obstacles. Several replica implementations of the game are available on GitHub, along with implementations of DQN and A3C as proposed by Mnih et al. [mnih2015human, mnih2016asynchronous]. The game is available for play at http://flappybird.io. During the game, the bird must choose between flapping and doing nothing in order to avoid collisions and pass through the openings.
Figure 2 demonstrates how the CNN helps the probabilistic program sample more efficiently. The task for the agent is to infer the actions with the highest probability of passing through the openings facing Flappy Bird. The agent samples actions from a prior whose parameters are given by the CNN, which reduces the number of samples needed to pass through the opening. Given the actions flap and do nothing, when the prior favors flapping the agent is more likely to sample flaps in its plan so as to maintain its current height while moving towards an opening. Figure 2(a) illustrates the frames of pixel data used as inputs to estimate the Dirichlet (i.e. a Beta, since the bird has two actions to pick from); the probability of flapping is larger, resulting in more simulated samples with ascending altitude. In contrast, in (b) the sampling distribution is skewed towards low flap probabilities, so most samples refrain from flapping. Hence, the CNN skews the Dirichlet distribution so that the sampled actions match the desired change in altitude.
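The effect of the skew can be made concrete with the mean of the Beta distribution. The symbols $\alpha_{\text{flap}}$ and $\alpha_{\text{noop}}$ are notation introduced here for illustration, and the numeric values are hypothetical rather than outputs of the trained CNN.

```latex
p_{\text{flap}} \sim \mathrm{Beta}(\alpha_{\text{flap}}, \alpha_{\text{noop}}),
\qquad
\mathbb{E}[p_{\text{flap}}] = \frac{\alpha_{\text{flap}}}{\alpha_{\text{flap}} + \alpha_{\text{noop}}}

% Illustrative values:
% (\alpha_{\text{flap}}, \alpha_{\text{noop}}) = (4, 1) \Rightarrow \mathbb{E}[p_{\text{flap}}] = 0.8
% (\alpha_{\text{flap}}, \alpha_{\text{noop}}) = (1, 4) \Rightarrow \mathbb{E}[p_{\text{flap}}] = 0.2
% (\alpha_{\text{flap}}, \alpha_{\text{noop}}) = (1, 1) \Rightarrow \mathbb{E}[p_{\text{flap}}] = 0.5 \quad \text{(uniform prior)}
```

With the uniform prior, half of the sampled actions are flaps on average regardless of the state, whereas a skewed prior concentrates the samples on the action that matches the desired change in altitude.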
The results demonstrate the performance of the model against alternative approaches and state-of-the-art techniques. Average scores are calculated by running each trained model 10 times and observing the final score. PB-CNN is the proposed methodology. PB-Uniform is an alternative approach without a CNN, in which the Dirichlet distribution is parametrized with all parameters set to one, resulting in a uniform distribution over the action probabilities. A3C is an implementation of Mnih et al. [mnih2016asynchronous] on Flappy Bird. Human data were gathered from players on the web page flappybird.io.
Figure 3(a) shows the average accuracy of PB-CNN and PB-Uniform for different time budgets, in milliseconds, allowed for the iterative sampling and simulation process. The advantage the CNN brings to the model is significant. This is because the CNN narrows down the sampling space considerably, allowing the model to explore the conditional distribution of samples given no collisions much faster (i.e. a higher frequency of samples result in no collision). In Figure 3(b) we show the average score per training epoch of the A3C approach. In Figure 3(c), the average score of human play is close to that of PB-Uniform at two of the time budgets shown. After training the CNN part of PB-CNN on 10000 frames of game play, the model's performance improves significantly. At one of the time budgets shown, the performance of PB-CNN is similar to that of A3C in average score. The average score is calculated over 10 games.
The average score for humans was 11.27 over 47 million games played, and 95% of those games ended with a score of 6 points or lower, according to flappybird.io. One possible reason humans under-perform is their significantly longer response time compared with the methods discussed in this paper. For tasks involving inference about the physics of the world, research suggests that human response times range between 500 and 2000 milliseconds, depending on the difficulty of the inference task [hamrick2015think]. The urgency of each decision leaves humans too little time to sample and simulate the physics of the game well enough to perform as well as bots.
Conclusion and Future Work
We propose a framework for bots to navigate games with intuitive physics, inspired by the cognitive processes of humans. The approach draws inspiration from recent approaches to modeling concept learning: the framework couples a probabilistic program with a physics simulation engine. The model was tested on the game of Flappy Bird and compared to state-of-the-art model-free techniques. The advantage of our model over model-free approaches, namely A3C, is its ability to learn from very few examples relative to the number of examples A3C requires. The advantage of our model over model-based approaches is its ability to learn from experience.
Potential future work includes investigating approaches to learning reward structures in games of physics intuition. This would enable bots to perform more complex moves, beyond simpler tasks such as the ones in the illustrated game of Flappy Bird, where the objective is to avoid unwanted collisions. Other games such as Space Invaders involve learning strategies of shooting and hiding that are beyond the capabilities of the model in its existing state. Another potential future direction is to deploy a neural network to detect objects in the frames of the game, rather than relying on explicit access to the position data of the objects in the game. This would potentially help the framework generalize better to other games.