Roughly 35 years ago, biologists discovered a cluster of cells in the hippocampal area of rodents, which encode the spatial position of the animal. Within minutes of seemingly random movement in an unknown environment, the cells specialize to fire only around one position each. Moreover, the cell population covers the whole accessible area, which lead many scientists to believe that the so called place cells are a preliminary stage of the rodents navigational control .
How these cells adapt, on the other hand, is still a question open to debate. Until recently, the only explanation was the so called path integration
, a technique in which the rodent integrates its movement up to estimate the current position. Since the sense of movement can be flawed, small mistakes will sum up over time and have to be corrected by external stimuli, which have to be identified in the environment, first. The computational approach to this problem, used in robotics, is calledsimultaneous localization and mapping (SLAM) and will be discussed as an alternative to our approach in section 5.1.7.
Last year, Franzius et al. 
were able to show that a memoryless feed forward network is able to learn place cell behaviour. The network adapts to videos of a head mounted camera (substitutional for the rodents eyes) by an unsupervised learning technique calledslow feature analysis (SFA).
Reinforcement learning (or neuro-dynamic programming) is a method to learn a control based on reward and punishment. A set of rewarded/punished example movements is generalized to estimate the expected sum of future rewards (value) at every position and for every possible action. Given a current position, the so called state of the system, the control chooses the action that promises the highest value. Obviously, the efficiency of this approach depends on how well the value can be estimated, which in return depends on the coding of the state.
Place cells provide an intuitive coding for linear architectures. Weighting every cells output with the mean value of its active region can summed up approximate any value function up to a quality depending on the number of cells. Therefore, the place cells of Franzius et al. should be a natural (and biological plausible) preprocessing for linear value estimators.
As it turns out, properly trained SFA produces a mapping of video images into corresponding trigonometric polynomials in the space of robot positions. Franzius’ place cells were products of an additional step independent component analysis (ICA) which does not influence the linear approximation quality and can therefore be omitted.
The goal of this thesis is to formulate this basic idea into a working procedure and to demonstrate its soundness in a real world robot navigation experiment.
We want to learn a preprocessing using slow feature analysis (SFA) out of the video images of a robots head mounted camera. This preprocessing should represent the robots position and orientation (its state) in a fashion suitable for linear function approximation. With this state at hand, we want to use the reinforcement learning method policy iteration (PI) to learn a control for the robot.
Note that the preprocessing is adapted to one environment, e.g. one room only, and will not work anywhere else. However, the same is true for any control learned by reinforcement learning, so SFA fits well within this framework.
Both SFA and PI need an initial random walk, crossing the whole environment. Therefore the robot has to drive around by choosing random actions, first. The only needed sensor information of this phase is the video of a head mounted camera. However, for the reinforcement method, we also need to record the reward and punishment at every step. As simple task, the robot receives reward for entering a goal area and punishment for getting too close to walls. The resulting control should drive into the goal area as quickly as possible, while keeping its distance to walls.
Secondly, SFA will learn a series of filters based on the recorded video. The output of these filters, applied on the initial video, is used as input of policy iteration. PI estimates the expected sum of future rewards (value) for every action and position.
With the value estimator at hand, or to be precise its parameter vectors (one for every action), the control works as depicted in figure 1.1:
The robot starts in an unknown position and wishes to navigate into a goal area, marked with a G for demonstration purposes.
A head mounted video camera shoots a picture ) out of the current perspective.
The series of filters , learned by slow feature analysis, produce one real valued output ), each. For every possible action, the output is multiplied by a weight vector , found by reinforcement learning. The sum of the weighted outputs is the value estimation of this action.
The robot executes the action with the highest value and repeats the procedure until it reaches the goal.
The reinforcement method employed in this thesis, least squares policy iteration (LSPI), chooses the most promising in a finite number of actions , based on the current state . Unfortunately, its complexity is in time and in space. An efficient kernel SFA algorithm itself has a complexity of at least in time and in space, where is the number of support vectors (comparable to ). So why not use a kernel version of LSPI when there is no significant advantage to the proposed method in computational complexity?
For once, LSPIs complexity also depends on the number of actions . An SFA preprocessing can use the available memory up to to produce a number of filters . This way one can consider a much larger number of actions. More importantly, the state space extracted by SFA can be used in more than one reinforcement problem. For example, when the robot should learn two tasks, one can learn both given the same initial video but different rewards. Therefore, determining the preprocessing once allows to learn multiple tasks quickly afterwards.
In chapter 3 the theoretical background of reinforcement learning is presented as well as the complete derivation of LSPI with all necessary algorithms.
Chapter 4 describes slow feature analysis theoretically. It also contains an overview of recent applications and algorithms, including a novel derivation of a kernelized algorithm.
Both chapters are based on common concepts of machine learning and kernel techniques, which are introduced for the sake of completeness in chapter 2.
Within the NeuRoBot project, the author was able to perform experiments with a real robot. To check the theoretical predictions, we constructed a rectangular area with tilted tables, in which the robot should navigate towards some virtual goal area. We tested two goal areas, which were not marked or discriminable otherwise besides the reward given in training.
To evaluate the proposed method less time consuming, the author also implemented a simulated version of the above experiment. At last, theoretical predictions exist only for rectangular environments. To test the behaviour outside this limitation, we used the simulator to create an environment consisting of two connected rooms.
We live in a world where things follow causal relationships, which should be expressible as functions of observations (i.e. products of the real ”causes”). The exact nature and design of these relationships are unknown but might be inferred from past observations. To complicate things even further, these observations can also be afflicted by noise, i.e. random distortions independent of the real ”cause”. Regression deals with inference of functional relationships from past observations, based on some simplifying assumptions.
The random variable depends on the observation by with being another zero mean random variable.
We assume that some kind of observation yields an input sample , which can be assigned a target value by by an expert or process of nature. Due to errors in this process, is distorted by a random variable , independent of .
Estimating the volume inside a balloon based on its diameter would be an example. Clearly there is a relationship, but since real balloons do not follow ideal mathematical shapes, it is hard to find. Physicists would take measurements of balloon diameters (the input samples) and air volumes (the target values) to find a function which explains most measurements with a minimum amount of noise. Since measurements are prone to be inexact and the rubber of individual balloons differs slightly, we have to expect some noise in the target values. An exact reproduction of the observed target values will therefore reproduce measure errors and mispredict unseen future inputs. The physicists approach to this problem is formulated as Ockhams razor, which in short states that theories have to be simple as well as explanatory.
In the following, we will denote as the training set. Regression aims to estimate the function based only on this data. The balance of Ockhams razor is difficult to maintain and led to a variety of regression algorithms.
2.1.1 Optimization problem
We wish to find a function which estimates the real (but unknown) relationship . The only knowledge available is the training set , so we construct a cost function which defines the optimization goal indirectly by comparing the target values with the predictions made by some given function .
Definition 2.2 (Cost function)
A cost function has the form .
The intuition is that the cost function is small if is applied, and hopefully also small for functions similar to . Minimizing with respect to should therefore lead to a similar function.
The approach suggesting itself is to evaluate at every sample and consider the distance under some norm to target as the empirical cost for this sample. Because we have no reason to favor any sample, we sum up the individual costs and the result is called the empirical cost function :
Note that, due to the norm, the noise term in assumption 2.1 will not cancel out. The only way for to reach zero is to reproduce the noise of in , which is not desirable.
If we wish to avoid this problem, we can invoke Ockhams razor and penalize complex functions with a regularization term . A less complex function will be less likely to reproduce the observed noise and therefore predict unseen data better. Examples for
would be the length of the parameter vector (linear functions), the number of support vectors (kernel machines) or the number of hidden neurons (neuronal networks). Together with the empirical cost we obtain theregularized cost function :
Note that the regularization term is independent of the sample size and with growing will loose relevance. Section 2.1.4 will discuss the effects of sample size in more detail.
Definition 2.3 (Optimal function)
The global minimum of a cost function with respect to is called optimal function of :
The optimal function of is the function we were looking for and should resemble at least on the training set. Arbitrary functions can not be represented in a computer, so a common approach chooses a parameterized function class to pick from.
One approach to find the optimal function is to calculate the gradient of with respect to the function class parameters at an arbitrary start position . One can follow the negative gradient in small steps, leading to smaller costs if the step size is well chosen. This method is called gradient descent .
The disadvantage is that the method will stop at every that fulfills , called an extrema111 If also we speak of a minimum, but not necessarily a global minimum. of . Depending on the considered function class the cost function can have multiple extrema with respect to the parameters , and not all have to be global minima after definition 2.3. In other words, gradient descent will converge to whatever extrema the gradient lead to, starting at . This means we can never be sure that we reached a global minimum, if no closed solution of is possible.
Keeping this in mind, it is wise to choose such that every extrema of the cost function is a global minimum, ideally the only one. In this thesis we will focus on convex functions.
There are other functions without local minima, but convex functions provide other desirable mathematic properties (see  for details). In some special cases it is even possible to derive a closed analytical solution (section 2.1.5).
Definition 2.4 (Convex function )
A function is convex if is a convex set and if for all , and with , we have
We can tighten the definition further. When the strict inequality in equation 2.4 holds, we speak of a strictly convex function. These have the additional advantage that the minimum is unique, i.e. no other extrema exists.
Anyway, the main property we are interested in both cases is that every extrema is a global minimum:
Proposition 2.1 (First order convexity condition )
Suppose is differentiable (i.e. its gradient exists at each point in , which is open). Then is convex if and only if is convex and
holds for all .
Proof: See .
Every extrema of satisfies . According to proposition 2.1, if is convex then , which identifies as a global minimum of .
However, we do not necessarily wish the functions but the discussed cost functions to be convex:
Proposition 2.2 (Convex cost function)
Lemma 2.3 (Pointwise sum )
The pointwise sum of two convex functions and , with , is a convex function.
Proof: Insert into and apply (eq. 2.4) two times.
Lemma 2.4 (Norms )
Every norm on is convex.
Proof: If is a norm, and , then
The inequality follows from the triangle inequality, and the equality follows from homogeneity of a norm.
2.1.3 Function classes
Convexity of the cost function depends on its formulation and the considered function class . The family of convex functions is large (see  for examples). However, we want to introduce two classes that will play a mayor role in this thesis because they allow a closed analytical solution under the squared norm.
Definition 2.5 (Linear function )
A function is linear if for all and it satisfies the condition
Definition 2.6 (Affine function )
An affine function is the sum of a linear function and a constant: .
The bias violates the linear condition, but affine functions share many properties with linear functions. They can express more relationships, however. For example, a line that does not cross the origin can be expressed as an affine, but not as a linear function.
By extending the input vector by a constant (i.e. ) one can express an affine function as a linear function. This trick can be applied if the algorithm at hand requires linearity but the data can only be properly predicted by an affine function.
Affine functions are convex.
Proof: We show convexity for every component .
In regression, we do not know the real target function of assumption 2.1, but aim to find a ”similar” function . Since we can not construct a similarity measure between and directly, we define a cost function instead. Though one can obtain an optimal to the cost function, the approach is susceptible to many sources of error:
The cost function can be chosen differently, leading to different optimal functions. Here the choice of regularized vs. empirical arises as well as the choice of the regularization term. Anyway, the two presented equations are not the only imaginable cost functions (see  for more).
The considered function class might not be suited for the data at hand. On the one hand, can be too weak, e.g. a parabola can not be represented by a linear function. If, on the other hand, the sample size is too small, can reduce the individual costs of all training samples to zero, but take unreasonable values between them. This effect is called over-fitting .
The training sets sampling might introduce errors. If a region of is left out, can not be estimated there. More general, unbalanced sampling leads to a good estimation of where many samples are available, but a poor estimation where this is not the case. The sum of the individual costs tolerates big errors in rarely sampled regions, if it means to shrink the error in highly sampled regions even slightly. This effect is also influenced by the choice of . Especially linear and affine functions are known to react badly to unbalanced sampling .
The last point can be circumnavigated if one can ensure identical and independent distributed (iid) sampling. This way, given enough samples, the training set will be sampled homogeneously from . However, in most practical cases one can not give such a guaranty.
The common procedure to validate is to withhold a test set from the training. A comparison of the normalized empirical costs of training and test set (called training and test error) demonstrates how well generalizes. If the errors are approximately the same, the optimal function predicts unseen inputs obviously as well as the training samples. However, if the test error is significantly higher than the training error,
is probably over-fitting. Counter measures would include more training samples, a less complex function class or stronger regularization.
Another useful test is the computation of for the same cost function and training data, but a different function class . The comparison of the test and training errors can tell us which function class is better suited. Especially if one can choose the complexity in a family of function classes (e.g. number of support vectors in kernel machines), it is interesting at which complexity the error saturates. Finding this point (i.e. the simplest model which estimates well) is sometimes referred to as model selection or model comparison .
2.1.5 Linear regression algorithms
The least squares regression problem applies the squared norm and the class of linear functions on the empirical cost function (eq.2.1):
where is the squared Frobenius norm and the matrices and are introduced for simplicity.
Extending the input samples by a constant (i.e. ) allows us to use the same formulation for the class of affine functions.
To find the optimal function we set the derivation of equation 2.7 with respect to the parameter matrix to zero:
which holds, providing the rows of are linearly independent and the covariance matrix therefore of full rank.
We can also use the regularized cost function (eq. 2.2) with the regularization term which penalizes the squared norm of the column vectors of . regulates how strong the penalty for complex functions is.
Regression with this regularization term is known as ridge regression 
or in the context of neural networks asweight decay .
Both algorithms have a complexity of in space and max) in time. The term originates in the matrix inversion.
2.2 Kernel techniques
Realistic processes can seldom be approximated sufficiently with linear function classes. Classes of nonlinear functions (e.g. neural networks) could provide satisfactory results, but their cost functions are rarely convex and the optimization therefore complicated.
Another approach is the nonlinear expansion of the input samples . This way, projected into a high dimensional feature space, the problem can be treated as linear, while the solution in original space is nonlinear. The classic example to demonstrate this is the XOR problem.
In the XOR problem, the input and the target are connected by a simple rule: If both entries of are the same then , otherwise . It is easy to see that there is no linear function that solves this problem, but if we expand the input vector , the parameter vector explains the relationship perfectly. Thus, with the suggested expansion, the XOR problem is linearly solvable.
Expansion almost always increases the dimensionality of the input. The fact that one normally does not know the perfect expansion beforehand rarely lead to feasible expansions that solve the problem. In the case that a suitable expansion size would outnumber the number of training samples, the kernel trick can be employed.
2.2.1 The kernel trick
The kernel trick defines the nonlinear expansion indirectly. One exploits the representer theorem, which states that the optimal function of a cost function can be represented as scalar products of the training set.
Kernels are a broad class of functions, which are equivalent to a scalar product in a corresponding vector space. The intuition of the kernel trick is to exchange the euclidian scalar product by one in a high dimensional nonlinear feature space, i.e. another kernel.
Choosing a nonlinear kernel is equivalent to a nonlinear expansion of the input vector. This way one can handle huge feature spaces. The drawback is that the optimal function depends on scalar products to all training samples, which can be infeasible for large sample sizes.
In the following we introduce the elements of kernel methods used in this thesis. To follow the derivation path in full length, the reader is referred to .
Definition 2.7 ((Positive definite) kernel )
Let be an nonempty set. A function on which for all and all gives rise to a positive definite Gram matrix is called a positive definite kernel.
For a definition of Gram matrix and positive definite matrix, see .
One can prove that every positive definite kernel function uniquely implies a scalar product with being the projection of into another space.
To be more exact, one can show that a kernel function implies a reproducing kernel Hilbert space (RKHS) of functions with a scalar product. With a detour over Mercer kernels (which are equivalent to positive definite kernels) one can show that it is possible to construct a mapping for which acts as a dot product.
In the following we will refer to an arbitrary kernel function in the corresponding RKHS .
Theorem 2.6 (Representer Theorem )
Denote by a strictly monotonic
increasing function, by a set and by
arbitrary loss function. Then each minimizer
an arbitrary loss function. Then each minimizerof the regularized risk
admits a representation of the form
Proof: See . Note that a loss function is a slightly different defined cost function and regularized risk a cost function with a regularization term.
Multivariate functions can be represented component wise , giving rise to a new parameter matrix :
where is called a kernel expansion of .
Remark 2.1 (”Kernel trick” )
Given an algorithm which is formulated in terms of a positive definite kernel , one can construct an alternative algorithm by replacing by another positive definite kernel.
The kernel trick allows the replacement of scalar products (which are positive definite kernels) by complex, nonlinear kernels. One can take this replacement as a projection into a high dimensional feature space. In this space, we can solve our optimization problem linear. The only restriction is that the original algorithm must be formulated entirely with scalar products.
2.2.2 Kernelized regression
To demonstrate the kernel trick, we will derive a kernelized version of the linear ridge regression algorithm.
First we have to notice that there are no scalar products in algorithm 2. Therefore, it is necessary to reformulate the algorithm . Again, we start by setting the derivation of with respect to to zero:
with . Next we will clear of its dependency on . Substituting (eq. 2.12) in and the derivation with respect to yields:
where the Gram matrix of scalar products can be replaced by any other kernel matrix . Prediction follows (eq. 2.11):
The complexity of this algorithm is in space and in time.
Since the kernel implies the feature space, choosing a nonlinear expansion is equivalent to choosing a kernel. Therefore it is necessary to know common kernel functions and their properties.
As we have seen in the XOR example, a polynomial expansion can be useful. The direct approach would be to collect all multivariate monomials up to degree in a vector. For dimensional input vectors, the expanded vector would have dimensionality .
The corresponding kernel function
projects into the same space of polynomials with degree . To demonstrate this let us consider the following example : The input vectors shall be . The polynomial kernel function of degree 2 for those two is:
The projection defined this way spans the space of polynomials of degree 2 and has therefore dimensionality 6.
Polynomial kernels are all about the euclidian angle between two inputs. Additional, the euclidian length of both vectors play a role. If one likes to get rid of the last effect, a normalized polynomial kernel returns only the cosine between and to the power of : . The induced feature space is of lower dimensionality, but the kernel reacts less extreme to large input vectors.
A radial basis functions (RBF) kernel has the general form of
Different norms and parameters and generate different kernels, which all imply an infinite dimensional RKHS . The two most popular both use the norm and are called Laplacian () and Gaussian (). The latter is by far the most common kernel and has become synonymous with the name RBF kernel.
RBF kernels are based on the distance of two input vectors. The kernel parameter determines the radius of influence for training samples. In well sampled regions (and well adjusted ) the kernel shows good approximation properties but where no training data is available the kernels will produce only small output and therefore perform poorly.
2.2.4 Kernelized covariance eigenvalue problem
Many unsupervised learning algorithms (e.g. PCA ) demand an eigenvalue decomposition of the covariance matrix:
In most cases one aims to project samples
onto eigenvectors: . To catch more complex, nonlinear relationships, one can expand into a nonlinear feature space . The kernel trick can help to describe the expansion more efficient (e.g. Kernel PCA ).
One starts with ensuring zero mean by subtracting the sample mean from all columns . Subsequently, one can perform the eigenvalue decomposition on the centered covariance matrix . Standard implementations of eigenvalue decompositions have a complexity of with being the dimensionality of samples .
Inner and outer product 
If we want to reformulate the covariance matrix eigenvalue decomposition, we can exploit a relationship between the inner product and outer product of .
The singular value decomposition with refers to eigenvalues and eigenvectors of both outer and inner product. Therefore, an eigenvalue decomposition of the inner product produces the same non zero eigenvalues as the outer product and the corresponding eigenvectors and are related by:
We proceed as before. With the Gram matrix of scalar products at hand, we first have to center the data represented by it. Subtracting the sample mean , it is easy to proof that the centered kernel matrix is:
Secondly, we perform the eigenvalue decomposition . Using (eq. 2.17) we can now express the term with the nonzero eigenvalues and corresponding eigenvectors of :
Replacing by another kernel matrix and the term in (eq. 2.19) by , we implicitly project the data into a feature space defined by the kernel:
2.2.5 Kernel matrix approximation
The application of the kernel trick is restricted to a small amount of samples (), because the kernel matrix stores entries. If this number is exceeded, one would like to sacrifice approximation quality for manageable kernel matrix sizes. Different approaches have been made to overcome this problem by means of an approximation of the kernel matrix , where (see for example ). They all have in common that they assume a given subset of the training samples (support vectors) to be is representative for the whole set.
An algorithm to select support vectors is presented in section 2.2.6.
Subset of Data 
The simplest approach to kernel matrix approximation is called subset of data (SD). In this approach only the subset of support vectors is used and the approximation therefore has size . Because, only the support vectors influence all information of the remaining samples is lost.
If the support vectors are chosen randomly out of the training set, subset of data is also known as the Nyström method .
Projected Process 
A better suited approach is called projected process (PP). Here, the non support vector rows are removed but all columns are kept and the resulting matrix has size . The matrix is used to approximate . While preserving much more information, the method can only be applied to algorithms which use .
This restriction can be circumnavigated, if an eigenvalue decomposition of has to be performed anyway. Because for , it is sufficient to perform the eigenvalue decomposition on . Taking the square root of the eigenvalues , together with the unchanged eigenvectors , yields the projected process approximation of .
2.2.6 Support vector selection
Whatever approximation method one chooses, the choice of support vectors is crucial to preserve as much information as possible. Selecting an optimal set of support vectors means to minimize a specific cost function with respect to the set of support vectors, which is a hard combinatorial problem .
A number of heuristics have been proposed to find a suitable set. Beside the purely random Nyström method , the sequential sparse Bayesian learning algorithm  (related to the relevance vector machine ) estimates the contribution of training samples to the optimization problem. The procedure works iteratively, but not greedy. For big training sets, the convergence is very slow. However, the method is specified for a Bayesian setting (like Gaussian processes , for example) and therefore not suitable for our purposes.
We assume the data to be mapped into a feature space , defined implicitly by a kernel . Theorem 2.6 (representer theorem) assures that the optimal function is a linear combination of scalar products with training samples. If two expanded training samples would be linear dependent, obmitting one would not change the optimal function. Likewise, those samples which can be approximated badly using linear combinations of the remaining set are likely to be important.
where is an accuracy parameter determining the level of sparsity. The term describes the minimal distance to the affine hull of set . It serves as an importance measure, which is independent of the problem we want to optimize afterwards.
When we denote the kernel matrix and the kernel expansion , we can express the left side of (eq. 2.21) as . The derivation with respect to yields:
which allows us to reformulate the ALD condition (eq. 2.21):
More precise, algorithm 4 runs sequentially through the training samples and tests if the current sample fulfills the ALD condition for the current set of support vectors . If the condition fails, the sample becomes a support vector and the inverse kernel matrix (which is guaranteed to be invertible for ) is updated using the Woodbury matrix identity .
The overall complexity of the algorithm is in space and in time.
The support vector set obtained by this method should be well-conditioned for most optimization problems. It also has the nice property that for every there is a maximum size for , which is independent of .
Assume that (i) k is a Lipschitz continuous Mercer kernel on and (ii) is a compact subset of . Then there exists a constant that depends on and on the kernel function such that for any training sequence , and for any , the number of selected support vectors N, satisfies .
Proof: See .
As a downside, the number of selected support vectors depends largely on input set and kernel , There is no way to estimate from the parameter , which is responsible for the sparsity of . In practice one is interested in a support vector set of an appropriate (large but not too large) size. The experiments conducted in chapter 5 used RBF kernels and kept a fixed while varying the kernel width . This procedure was repeated until a set of suitable size was found.
Learning behaviour for an autonomous agent means learning to choose the right actions at the right time. In machine learning, this is called a policy: A function that chooses an action based on some input representing the state of the world. In human terms, to define a policy we have to define two things first: The perception and the goal of the agent. The first defines what we see as a ”state”, the second what we see as ”right”.
While perception of the world is an open field which is mostly circumnavigated by defining some ”essential variables” like position and orientation, behavioural experiments (classical conditioning, for example Pavlov’s dog bell) lead to a simple approach to the second question: Reward and punishment. In short, an agent should try to maximize his future reward while minimizing future punishment. Thus, training the agent becomes a simple choice what it shall and shall not do, eliminating the need to show it how.
Technically, we assume to have access to some function that filters meaningful (discrete or continuous) states out of the sensor data available to the agent. The exact nature of this function is of little importance to this chapter, besides that it provides the full state of the world. In this thesis, we want to show that Slow Feature Analysis (SFA, chapter 4) can learn such a function under the right conditions.
The evolution of state and reward over time is modelled as a stochastic Markov decision process (MDP). Since a critical choice leading to reward or punishment might be necessary long before its fruits will be received, we can not use standard regression approaches. Instead we try to estimate future reward (punishment is simply negative reward) for all actions in a state and take the action which is most promising. The sum of the expected future reward is called value (sec. 3.2).
Unfortunately, the future (and with it the value) we are trying to predict for the policy, depends on the policy itself. Since this is a ”hen or egg” problem, policy iteration methods repeat the steps value estimation and policy determination until they converge to the optimal policy which maximizes the value for all states (section 3.3).
3.1.1 Markov processes
After learning, we wish for the agent to move towards the reward. To do this, it has to learn which transition exist between a given set of states and where the reward is given. Once this knowledge is somehow established, it needs to find a decision function, telling it where to choose which action.
As we will see, the central element is the estimation of the sum of expected future rewards, called value. Value estimation is easiest in Markov reward processes (MRP), so we will start by introducing them to the reader. The MRP itself is unable to model control tasks, but there exists an extension named markov decision processes (MDP). MDP models the element of choice by introducing a set of actions between which a policy can choose. We will finish this section by defining the optimal policy, a function which chooses the action that will maximize the future reward.
Definition 3.1 (Markov reward process MRP)
The discounted MRP consists of a set of states , the transition probabilities with , a function that assigns a real valued reward to every transition , and a discount factor .
MRPs are used to describe the statistical properties of the random walk of a variable in discrete time steps and the reward it collects in the meantime. In opposition to iid drawing, the random walk is subject to the Markov property
in other words the probability of being in state at time depends exclusively on the predecessor state . In most situations reward depends solely on the outcome, so we simplify in the following as only dependent on the target state of a transition.
Definition 3.2 (Value )
The value is the expected sum of future rewards, discounted by the factor :
Due to the infinite sum, one commonly uses a recursive definition which is known as the Bellman function :
Intuitively, the value function tells us which states are promising (figure 3.1). However, it does not say how to get there. For control purposes, we would need a function that returns the value of all possible actions in the current state. Within the framework of Markov decision processes, we can define exactly such a function, called Q-value.
Definition 3.3 (Markov decision process MDP )
The discounted MDP is an extension of the MRP to control problems. It defines an additional set of actions and transition probabilities and reward function are extended to depend on the executed action as well.
As in the MRP case, we simplify the reward function as exclusively dependent on the target state: .
Definition 3.4 ((Stationary) Policy )
A function with is called a stationary policy.
A policy represents a control or decision automat. is the probability that the automat chooses action in state . Since the value in a MDP depends on future decisions, it is necessary that the policy is stationary (i.e. does not change) throughout value evaluation.
Definition 3.5 (Q-value )
The Q-value of an MDP is the value of an action executed in a state, if all following actions are selected with respect to the stationary policy :
The Q-value function defines implicitly a value function for every action (see figure 3.2). This way one can navigate by always choosing the action with the highest Q-value.
Of course, we can also reformulate the classical value function (eq. 3.3), which is now dependent on policy , in terms of Q-values:
For a given policy , the value of a MDP is equal to the the value of the MRP with .
Definition 3.6 (Optimal policy)
The optimal stationary policy maximizes the value for every state :
This is obviously the control function we were looking for. A control algorithm simply has to draw the actions according to and is guaranteed to find the best way to the reward. Lemma 3.2 shows how to define such a policy with the help of Q-values.
(i) There can exist several but (ii) at least one is equivalent to a deterministic selection function with :
Of course, the Q-values them self depend on the policy and the only way to solve this dilemma is to improve both quantities in an alternating fashion (see policy iteration, section 3.3).
3.1.2 State representation
The MDP formalism provides us with a (possibly infinite) set of states . At this point we are interested in value estimation, that means we want to find a function that at least approximates the value for all states. If we restrict ourself to linear functions, i.e. , we need a linear representation to project the states into a subset of .
Definition 3.7 (State representation)
Let be an arbitrary set. An injective function will be called a representation of the set of states . If for , will be called a linear representation.
We aim here to find a representation that works with linear functions, i.e.