## 1 Introduction

Reinforcement learning (RL) has made significant progress in recent years, allowing policy- and value-based algorithms to operate directly on images and perform tasks in fully observable contexts [8, 23, 29]. However, when the state space is more complex or partially observable, learning remains harder. Solutions have been proposed that add learning modules such as Long Short-Term Memory (LSTM) [12] or enlarge the networks with additional convolutional layers [14]. Another strategy to facilitate the learning of an agent in a complex visual environment is to reduce the dimension of the state space [20]. This can be accomplished by extracting relevant features from the images with learning-based methods such as auto-encoders [18]. The dimension of the state space has a large impact on how these solutions are implemented: an auto-encoder needs many parameters to encode large states, which makes its training time- and energy-consuming.

Furthermore, sparse representations have proven efficient in RL [22]. They can be obtained with Radial Basis Function Networks (RBFN) [3, 11, 24, 26], a well-known function approximator used for classification and regression tasks [32]. Following its success in deep learning, the idea of using RBFN in RL has emerged and attracted great interest, especially in non-visual tasks [2, 5, 6]. In all these works, however, RBFN are not applied directly on raw pixels, but rather on features extracted from the environment, which involves pre-training steps.

##### Contributions.

In this paper, we aim to train an agent in a partially observable environment, with only raw pixel inputs and without prior knowledge, i.e., without pre-training or additional information. We extend the RBFN to extract random sparse features from visual images. Our major contribution is the design and analysis of a generic feature extraction method based on RBFN for visual inputs. We evaluate the performance of the extracted features in a Q-learning setting and compare the results with state-of-the-art methods. We use Vizdoom [15] as the virtual environment, where visual tasks are hard to solve with classic RL algorithms [16].

##### Paper Organisation.

In section 2, we present a brief review of related approaches, then we introduce the theoretical background in section 3. In section 4 we present our RBFN method, and we analyse the extracted features in section 5. Finally, section 6 evaluates the method on two different visual RL tasks and compares its performance against state-of-the-art approaches.

## 2 Related Works

State representation learning methods are popular approaches to reduce the state dimension used in visual RL. In their survey [20], Lesort et al. distinguish several categories of approaches for state representation learning. Auto-encoders can be used to extract features by reconstructing input images [18, 25, 13], or by forward model learning to predict the next state given the current state and the corresponding action [19]. Learning an inverse model to predict the action given two states can also be used to extract features that are unaffected by uncontrollable aspects of the environment [33]. Prior knowledge can also be used to constrain the state space and extract conditional features [10]. These representation learning methods typically require training a network upstream of an RL algorithm, which can be time-consuming due to the large number of parameters or the environment dimension. Even when the state representation is trained in parallel with the agent as in [27], or when unsupervised learning is used as in [1, 30], the training is still time-consuming since multiple convolutional networks are trained. Unlike the current state-of-the-art, our method relies neither on pre-training nor on convolutional networks; instead, the features are extracted using a combination of Radial Basis Functions (RBF) without training or prior knowledge.

RBFNs have recently been used as feature extractors for 3-D point cloud object recognition [7], achieving similar or even better performance than the state-of-the-art with a faster training speed. Earlier, an RBF-based auto-encoder [9] showed the efficiency of RBFN as feature extractors prior to classification tasks. One key property of RBFN activations is their sparsity: for each input, only a few neurons are activated. This is advantageous for RL, as it imposes locality of activation, avoiding catastrophic interference and keeping values stable, which allows bootstrapping [22]. Applying RBFN directly on high-dimensional images is challenging and has not yet been much explored in the literature. Other sparse representations, such as extreme learning machines or sparse coding, have been investigated, but none have been extended to visual RL as their computing cost tends to grow exponentially with the input dimension. One solution in the literature is to reduce the input dimension with convolutions [21, 28].

## 3 Background

### 3.1 Reinforcement Learning

In RL we consider an agent interacting with an environment through actions $a \in \mathcal{A}$, states $s \in \mathcal{S}$ and rewards $r$. The environment is modelled as a Markov decision process (MDP), defined by the tuple $(\mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R})$, where $\mathcal{S}$ and $\mathcal{A}$ are the state and action spaces, $\mathcal{P}(s' \mid s, a)$ is the transition probability to go from a state $s$ to another state $s'$ by performing an action $a$, and $\mathcal{R}$ is the reward function. At each time step, the agent faces a state $s_t$ and chooses an action $a_t$ according to its policy $\pi$. The environment gives back a reward $r_t$, and the probability of reaching $s_{t+1}$ as the next state is given by $\mathcal{P}(s_{t+1} \mid s_t, a_t)$. The goal of the agent is to find a policy $\pi$ which maximizes its expected return, defined as the cumulative discounted reward along time steps, i.e., $G_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k}$, where $\gamma \in [0, 1)$ is the discount rate determining the present value of future rewards.

We chose to evaluate our state representation method in a Q-learning [31]

setting. To optimize the policy, Q-learning estimates the state-action value function (Q-value), which represents the quality of taking an action $a$ in a state $s$ while following a policy $\pi$. $Q^{\pi}$ is defined by

$$Q^{\pi}(s, a) = \mathbb{E}_{\pi}\left[ G_t \mid s_t = s,\, a_t = a \right] \quad (1)$$
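As a small aside, the return $G_t$ inside this expectation is easy to compute for a finite reward sequence. The sketch below (our illustration, not part of the paper) accumulates rewards backwards so each step applies the discount once:

```python
def discounted_return(rewards, gamma=0.9):
    """Compute G_t = sum_k gamma^k * r_{t+k} for a finite reward list."""
    g = 0.0
    for r in reversed(rewards):  # fold from the last reward backwards
        g = r + gamma * g
    return g

# Three rewards of 1.0 with gamma = 0.9: 1 + 0.9 + 0.81 = 2.71
g = discounted_return([1.0, 1.0, 1.0])
```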

The optimal policy can then be estimated by approximating the optimal Q-value $Q^{*}$ given by the Bellman equation:

$$Q^{*}(s, a) = \mathbb{E}\left[ r + \gamma \max_{a'} Q^{*}(s', a') \mid s, a \right] \quad (2)$$
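In the tabular case, the Bellman equation above gives the classic Q-learning update rule. A minimal sketch (sizes, learning rate and discount are illustrative choices of ours):

```python
import numpy as np

n_states, n_actions = 5, 2
Q = np.zeros((n_states, n_actions))  # tabular Q-value estimates
gamma, alpha = 0.99, 0.1             # discount rate, learning rate

def q_learning_update(Q, s, a, r, s_next, done):
    """Move Q(s, a) one step toward the Bellman target r + gamma * max_a' Q(s', a')."""
    target = r if done else r + gamma * Q[s_next].max()
    Q[s, a] += alpha * (target - Q[s, a])
    return Q

# One transition: state 0, action 1, reward 1.0, next state 2.
Q = q_learning_update(Q, s=0, a=1, r=1.0, s_next=2, done=False)
```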

In Deep Q-learning [23], the optimal Q-value is approximated by a neural network parametrized by $\theta$. At each time step, Q-values are predicted by the network, then the agent chooses the best action according to an exploration strategy. The agent's experiences $(s_t, a_t, r_t, s_{t+1}, d)$ are stored in a replay memory buffer $\mathcal{D}$, where $d$ is a Boolean indicating whether the state is final. The parameters of the Q-network are optimized at each iteration to minimize the loss, for a batch of transitions uniformly sampled from the replay buffer $\mathcal{D}$, defined by

$$L(\theta) = \mathbb{E}_{(s, a, r, s') \sim \mathcal{D}}\left[ \left( r + \gamma \max_{a'} Q(s', a'; \theta^{-}) - Q(s, a; \theta) \right)^{2} \right] \quad (3)$$

To stabilize the training, a target network is used, whose parameters $\theta^{-}$ are periodically copied from the reference network with parameters $\theta$.
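The loss in equation (3) can be sketched in NumPy as follows; here `q_net` and `target_net` are stand-in callables of our own (plain functions returning per-action Q-values), not the paper's networks:

```python
import numpy as np

gamma = 0.99

def dqn_loss(q_net, target_net, batch):
    """Mean squared TD error of eq. (3) over a sampled batch.
    q_net / target_net map a batch of states to per-action Q-values."""
    s, a, r, s_next, done = batch
    q_sa = q_net(s)[np.arange(len(a)), a]  # Q(s, a; theta) for the chosen actions
    # Bootstrapped target, computed with the frozen parameters theta^-
    target = r + gamma * (1.0 - done) * target_net(s_next).max(axis=1)
    return np.mean((target - q_sa) ** 2)

# Tiny illustration: a "network" that always predicts Q-values [1.0, 2.0].
q_net = lambda s: np.tile([1.0, 2.0], (len(s), 1))
batch = (np.zeros((1, 4)), np.array([0]), np.array([1.0]),
         np.zeros((1, 4)), np.array([0.0]))
loss = dqn_loss(q_net, q_net, batch)  # (1 + 0.99*2 - 1)^2 = 3.9204
```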

### 3.2 Radial Basis Function Network (RBFN)

The network is composed of three layers: the inputs $x$, a hidden layer composed of RBF neurons, and the output layer $y$. An RBF $\phi$ is a function defined by its center $c$ and its width $\sigma$ as

$$\phi(x) = \phi(\lVert x - c \rVert) \quad (4)$$

In this paper we consider Gaussian RBFs with the Euclidean norm, defined as

$$\phi(x) = \exp\left( -\frac{\lVert x - c \rVert^{2}}{2\sigma^{2}} \right) \quad (5)$$

An RBFN computes a linear combination of $N$ RBFs:

$$y_k(x) = \sum_{i=1}^{N} w_{ki}\, \phi_i(x) \quad (6)$$

where $w_{ki}$ is a learnable weight between output $y_k$ and hidden RBF neuron $i$.

RBFN are fast to train thanks to their well-localized receptive fields, which activate only a few neurons for a given input. The farther the input is from a neuron's receptive field, the closer that neuron's output is to zero.
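To make equations (5) and (6) concrete, here is a minimal NumPy sketch of a Gaussian RBFN forward pass (the neuron count, widths and weight shapes are illustrative choices of ours). With narrow widths, most activations are near zero for any given input, which is the sparsity just described:

```python
import numpy as np

rng = np.random.default_rng(0)

def rbf_activations(x, centers, widths):
    """Eq. (5): Gaussian RBF activation of each hidden neuron for input x."""
    d2 = ((x[None, :] - centers) ** 2).sum(axis=1)  # squared distances ||x - c||^2
    return np.exp(-d2 / (2.0 * widths ** 2))

def rbfn_forward(x, centers, widths, W):
    """Eq. (6): linear read-out over the RBF layer."""
    return W @ rbf_activations(x, centers, widths)

# 50 neurons with narrow widths on a 2-D input in [0, 1]^2.
centers = rng.uniform(0, 1, size=(50, 2))
widths = np.full(50, 0.05)
W = rng.normal(size=(3, 50))  # 3 outputs

phi = rbf_activations(np.array([0.5, 0.5]), centers, widths)
y = rbfn_forward(np.array([0.5, 0.5]), centers, widths, W)
```

Because each width is small, only centers very close to the input contribute; the rest of `phi` is effectively zero.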

## 4 Method

Our method projects a high-dimensional input state space into a lower-dimensional space by combining Gaussian receptive filters with Gaussian RBF activations. Our network architecture is shown in Fig 1. Each receptive field can be seen as the attention area of the corresponding neuron. The attention area of a neuron $i$, for a pixel of a state $s$ (of shape $H \times W$) with coordinates $(x, y)$, is defined as follows:

$$G_i(x, y) = \exp\left( -\left( \frac{(x - \mu_{x,i})^{2}}{2\sigma_{x,i}^{2}} + \frac{(y - \mu_{y,i})^{2}}{2\sigma_{y,i}^{2}} \right) \right) \quad (7)$$

where $(\mu_{x,i}, \mu_{y,i})$ defines the center of the Gaussian function along the spatial dimensions and $(\sigma_{x,i}, \sigma_{y,i})$ are the standard deviations. $G_i$ is the full matrix that defines the spatial attention of neuron $i$. Given the attention area, the activation $z_i$ of hidden neuron $i$ is computed using a Gaussian RBF activation function weighted by $G_i$:

$$z_i(s) = \exp\left( -\frac{\lVert G_i \odot (s - \mu_i) \rVert^{2}}{2\sigma_i^{2}} \right) \quad (8)$$

where $\mu_i$ is the center and $\sigma_i$ the standard deviation of the RBF intensity Gaussian activation function. The symbol $\odot$ is the Hadamard product, i.e., the element-wise product. Parameters $\mu_i$ and $\sigma_i$ have the same size as the input channel.
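As an illustration of equations (7) and (8), the following NumPy sketch computes one activation per neuron from a grayscale frame. The frame size, neuron count and parameter ranges are our own illustrative choices, not the paper's values:

```python
import numpy as np

rng = np.random.default_rng(1)
H, W = 24, 32     # toy grayscale frame size (illustrative)
n_neurons = 16

# Spatial attention parameters (eq. 7) and intensity parameters (eq. 8),
# drawn once at random in normalized coordinates.
mu_x = rng.uniform(0, 1, n_neurons)
mu_y = rng.uniform(0, 1, n_neurons)
sig_x = rng.uniform(0.05, 0.3, n_neurons)
sig_y = rng.uniform(0.05, 0.3, n_neurons)
mu_int = rng.uniform(0, 1, n_neurons)    # intensity centers
sig_int = np.full(n_neurons, 0.3)        # intensity widths

ys, xs = np.mgrid[0:H, 0:W]
xs = xs / (W - 1)  # normalize pixel coordinates to [0, 1]
ys = ys / (H - 1)

def extract_features(frame):
    """One sparse activation per neuron: the Gaussian attention area (eq. 7)
    weights a Gaussian RBF over pixel intensities (eq. 8)."""
    z = np.empty(n_neurons)
    for i in range(n_neurons):
        attn = np.exp(-((xs - mu_x[i]) ** 2 / (2 * sig_x[i] ** 2)
                        + (ys - mu_y[i]) ** 2 / (2 * sig_y[i] ** 2)))
        d2 = ((attn * (frame - mu_int[i])) ** 2).sum()  # ||G_i ⊙ (s - μ_i)||^2
        z[i] = np.exp(-d2 / (2 * sig_int[i] ** 2))
    return z

features = extract_features(rng.uniform(0, 1, (H, W)))
```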

To test the efficiency of our extraction, we use the extracted features as the state in a Q-learning algorithm where the Q-value is approximated by a linear combination of the Gaussian neuron activations:

$$Q(s, a) = \sum_{i} w_{ai}\, z_i(s) \quad (9)$$

where $w_{ai}$ is the weight between action $a$ and neuron $i$. At each step, the input image is passed to the RBF layer and the computed features are saved in the replay buffer. Then, during each training iteration, a batch of random features is chosen from the replay memory buffer and the weights are updated using a gradient descent step (Adam [17]) to minimize equation (3).
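This training loop can be sketched as follows. For simplicity the sketch uses plain SGD rather than Adam and updates one transition at a time; the feature/action sizes are illustrative, and no target network is shown:

```python
import numpy as np
import random
from collections import deque

rng = np.random.default_rng(2)
n_features, n_actions = 32, 8
W = np.zeros((n_actions, n_features))  # eq. (9): Q(s, a) = W[a] . z(s)
gamma, lr = 0.99, 1e-2
replay = deque(maxlen=10_000)          # stores extracted features, not raw frames

def q_values(z):
    return W @ z

def train_step(batch):
    """Plain SGD on the squared TD error (the paper uses Adam)."""
    for z, a, r, z_next, done in batch:
        target = r + gamma * (0.0 if done else q_values(z_next).max())
        td = target - q_values(z)[a]
        W[a] += lr * td * z  # gradient step for the linear Q-value of action a

# Fill the buffer with random transitions, then do one training iteration.
for _ in range(64):
    replay.append((rng.random(n_features), int(rng.integers(n_actions)),
                   1.0, rng.random(n_features), False))
train_step(random.sample(list(replay), 16))
```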

### 4.1 Selection of RBF Parameters

There are six hyper-parameters for each Gaussian unit. The centers of the Gaussian filters and the centers of the Gaussian activations are chosen uniformly at random so as to cover the whole state space. The standard deviations influence the precision of a neuron's activation, i.e., the proximity between the pixel intensities weighted by the attention area and the intensity center of the neuron. In this way, the RBF layer activates only a few neurons for a particular image.

The hyper-parameters are chosen at the beginning and never changed during training.

Based on empirical experiments and on the work of [4], which also used Gaussian attention areas, we fix for all the study the number of neurons for grayscale inputs and for RGB inputs, each RGB neuron having three intensity centers, one per channel. The standard deviations are chosen uniformly within fixed ranges.
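A sketch of this one-time sampling step is below. The ranges passed as defaults are our own illustrative assumptions, not the paper's values; only the structure (six parameters per neuron, drawn once and frozen) follows the text:

```python
import numpy as np

rng = np.random.default_rng(3)

def sample_rbf_parameters(n_neurons, n_channels=1,
                          sigma_spatial=(0.05, 0.3), sigma_intensity=(0.1, 0.5)):
    """Draw the six per-neuron hyper-parameters once, before training.
    Centers cover the normalized state space; widths come from fixed ranges
    (the ranges here are illustrative, not the paper's)."""
    return {
        "mu_x": rng.uniform(0, 1, n_neurons),                       # spatial centers
        "mu_y": rng.uniform(0, 1, n_neurons),
        "sig_x": rng.uniform(*sigma_spatial, n_neurons),            # spatial widths
        "sig_y": rng.uniform(*sigma_spatial, n_neurons),
        "mu_int": rng.uniform(0, 1, (n_neurons, n_channels)),       # one per channel
        "sig_int": rng.uniform(*sigma_intensity, (n_neurons, n_channels)),
    }

params = sample_rbf_parameters(n_neurons=100, n_channels=3)  # RGB input
```

The parameters are sampled once at initialization and never updated, so the extractor itself requires no training.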

### 4.2 Vizdoom Scenarios

We evaluate our method on two Vizdoom scenarios to show the robustness of our network to different state dimensions and partially observable tasks. Scenarios are defined as follows:

#### 4.2.1 Basic Scenario.

In this task the agent is in a square room, and its goal is to shoot a monster placed on a line in front of it. At each episode the monster spawns at a different position, while the agent always spawns in the same place. The agent can choose among 8 possible actions: move right, move left, shoot, turn left, turn right, move forward, move backward and do nothing. In this configuration the agent can be in a state where the monster is not visible. The agent gets a reward of 101 when it hits the monster, -5 when it misses, and -1 for each iteration. The episode finishes when the agent shoots the monster or when it reaches the timeout of 300 steps.

#### 4.2.2 Health Gathering Scenario.

In this task the agent is again in a square room, but the goal is different: the agent has to collect health packs to survive. It starts the episode with a fixed number of life points; at each iteration it loses some life points, and it gains some back whenever it reaches a health pack. The episode ends when the agent dies or when it reaches 2100 steps. The reward is designed to encourage survival. Possible actions are move forward, move backward, turn left, turn right and do nothing. All the health packs spawn randomly, and the agent spawns in the middle of the room.

## 5 Analysis of Pattern Activations

In this section we demonstrate the sparsity of our network, analyse the activation patterns of the neurons and their differences, and present an example of a feature targeted by an RBF neuron.