Reinforcement Learning

What is Reinforcement Learning?

One prominent notion in behaviorist psychology is the idea that agents seek pleasure and avoid pain, thus learning habit patterns of survival. Reinforcement learning seeks to incentivize computational agents to naturally learn correct decisions by trial and error and to pursue a long term reward. 
The classic example is that of a mouse in a maze (see figure below). The mouse has a problem space (the maze) short term rewards (crumbs), a long term reward (cheese), and punishments (electric shocks). 

Exploration, Exploitation, and MDPs

When the mouse is first introduced to its problem space, it tends to do a lot of exploration. For example the mouse may find an action which gives a small amount of reward and instead of exploiting this action it will choose exploration in order to find a more advantageous solution. This is one type of epsilon-greedy strategy and as the model ages it will become less exploratory and its ‘desire’ to explore will diminish.

Markov decision processes (MDPs) allow us to understand how the problem space is navigated. The problem is broken down into states (in our example, positions on the grid), a set of actions (movement directions), state transitions, and rewards. Furthermore a discount factor increases the attractiveness of short term rewards versus long term rewards. One useful fact about MDPs is that the current state encodes enough information for future decisions to erase all past data.

Policy Learning and Neural Networks in Reinforcement Learning

In order to effectively learn to navigate the problem space a policy function is instated. A policy function P outputs one action for every state. This complex function which may have many potential states and actions.  Deep reinforcement learning uses (deep) neural networks to attempt to learn and model this function. The neural networks are trained using supervised learning with a ‘correct’ score being the training target and over many training epochs the neural network becomes able to recognize the ideal action to take in any given state.