Learning intuitive physics and one-shot imitation using state-action-prediction self-organizing maps

07/03/2020 ∙ by Martin Stetter, et al. ∙ 0

Human learning and intelligence work differently from the supervised pattern recognition approach adopted in most deep learning architectures. Humans seem to learn rich representations by exploration and imitation, build causal models of the world, and use both to flexibly solve new tasks. We suggest a simple but effective unsupervised model which develops such characteristics. The agent learns to represent the dynamical physical properties of its environment by intrinsically motivated exploration, and performs inference on this representation to reach goals. For this, a set of self-organizing maps which represent state-action pairs is combined with a causal model for sequence prediction. The proposed system is evaluated in the cartpole environment. After an initial phase of playful exploration, the agent can execute kinematic simulations of the environment's future, and use those for action planning. We demonstrate its performance on a set of several related, but different one-shot imitation tasks, which the agent flexibly solves in an active inference style.






1 Introduction

During the last decade, rapid progress in the field of deep learning has led to a number of remarkable achievements in many fields of artificial intelligence (AI). However, human learning and intelligence seem to work radically differently from the supervised pattern recognition approach adopted in most deep learning architectures. Among many other things, humans, for example playing infants, are able to learn from exploration and imitation, learn from far fewer examples, and create richer representations [28]. They can flexibly reason over these representations and creatively elicit novel state configurations never seen before.

Here we suggest a simple neural network architecture which learns to represent the dynamic physical characteristics of its environment in an unsupervised, exploratory way. By inference on the basis of this representation the system can plan actions to reach externally given or intrinsically generated goals. In the following, we summarize related work and outline the proposed model.

1.1 Related work

Impressive performance of deep learning approaches has been demonstrated in some classical supervised tasks such as object [26] or speech recognition [18, 19], but also in unsupervised domains including representation learning and learning generative models [17, 24, 42]. Recently, "deep reinforcement learning" [30] has been demonstrated to achieve human or super-human level performance on playing Atari video games from raw pixel frames.

Despite all these achievements, studying deep learning might be less useful when it comes to understanding human-like intelligence with the goal of creating artificial general intelligence [28]. Major issues raised include:

  • Deep learning approaches are in essence model-free and require massive amounts of labelled examples - orders of magnitude more than humans - in order to learn a complex task [28, 40].

  • In most deep learning approaches, the original problem is being re-formulated in a clever way as a related, supervised task which can be tackled by deep multilayer perceptrons. Hence the intelligent, creative part is done by the human designer, not by the algorithm.

  • Deep learning approaches work on associations and cannot grasp cause and effect [38].

The first issue has been addressed in many ways with approaches summarized as "few-shot learning" (for a recent review see [50]). Generally, a model is trained on a large body of related tasks to learn an inductive bias or prior, which is then exploited to solve the task at hand based on only one or few examples. Impressive recent achievements include one-shot imitation learning in robots [9] and meta-reinforcement learning [49], the latter showing close relationships to biological reinforcement learning [48].

Predictive processing (PP) summarizes a different class of approaches which during the last years have gained much popularity [21, 7]. After seminal work by Rao and Ballard on predictive coding in the visual cortex [39], predictive processing has been put on a sound statistical formulation known as free energy principle by Friston and coworkers [11, 12, 13]. PP models view the cerebral cortex as a hierarchically organized prediction engine which constantly tries to explain, i.e., predict, incoming sensory data as effect of hidden causes. Successful prediction is suggested to cause the subjective percept. Hence, sensor signals act as supervisory signals, and the brain’s internal states (hypothesized causes) play the role of generative signals or inputs, which are adjusted such as to minimize prediction error. Appealingly, prediction error minimization can also be achieved differently: by acting on the environment such as to make the own prediction come true. This second mechanism is referred to as "active inference" and represents the action generation mechanism suggested by the free energy principle.

It is worth noting that in these models the term prediction refers to predicting the present state of the environment, not its future [6]. The dynamic evolution of signals is addressed implicitly by considering generalized variables (variables plus all their time derivatives) and predicting the instantaneous present state of those. In such a setup, to achieve highly performant long-term temporal prediction, in particular higher temporal derivatives would have to be known up to very high, maybe unrealistic accuracy.

Recently, Lake, Ullman, Tenenbaum and Gershman [28] formulated cornerstones believed to be crucial ingredients of human-like learning and cognition, which is suggested to be more model-building like than pattern-recognition like: (i) "building causal models" of the world, (ii) "ground models on intuitive theories of physics and psychology", and (iii) "harness compositionality and learning-to-learn".

The authors state that in the approach of learning as model-building:

"Cognition is about using these models to understand the world, to explain what we see, to imagine what could have happened and didn’t, or what could be true and isn’t, and then planning actions to make it so." ([28], p.2)

Lake et al. [27] have developed "probabilistic program induction" as a model for few-shot visual concept learning, which focusses on compositionality and learning-to-learn. Our model, which we outline in the next subsection, follows a similar philosophy but concentrates on intuitive physics and causality.

1.2 Outline of model

In this work we suggest a very simple neural architecture, which learns in completely unsupervised fashion and incorporates several of the mentioned principles: it learns a model of the dynamics of its environment by playful exploration ("intuitive physics"), can play virtual, predicted episodes ("what could be true and isn’t") and can plan action sequences to bring the environment closer to a target state that has either been intrinsically chosen as a goal or has been given extrinsically by a one-shot demonstration ("planning actions to make it so").

The proposed system consists of sparse unsupervised networks, which in this work are implemented as Kohonen-type self-organizing maps (SOM) [25]: a perceptual or state network, which learns a representation of the environment’s states, an action module which represents possible motor commands, and a sensory-motor integrating state-action network which learns and represents associations between both (cf. Figure 1a). During "playful exploration", the agent executes (e.g., randomly sampled) actions and observes resulting state changes. Observing means that each active state-action unit learns to predict the environment’s next state given the currently active state-action pair it represents. This mechanism discovers the effects caused by the agent’s own actions [footnote 1]. For convenience, we refer to this architecture as "state-action-prediction self-organizing maps" (SapSom).

Footnote 1: The authors are well aware that mere temporal order does not necessarily reflect a true cause-effect relationship: the rooster crows before sunrise, but it does not cause sunrise. Nevertheless, the rooster’s crow can be used to predict sunrise with a decent hit-rate.

In the proposed setup, playing virtual episodes or "kinematic mental simulation", for which evidence has been found in human reasoning [23], corresponds to repeated prediction of state transitions starting from a virtual start state under a virtual action sequence. Here, the term "virtual" denotes intrinsically generated states, which are represented by active units without sensory stimulation and without action to be executed. This process of playing virtual episodes serves to transform a virtual action sequence into a predicted sequence of states.

A goal is defined as a target state or group of states in latent state space. Action planning corresponds to search in action sequence space, similar to active inference, such that the predicted resulting states approximate a desired region in latent space as closely as possible (i.e., reach that goal).

We provide a proof of concept using OpenAI’s gym cartpole environment. First it is demonstrated that - in the interventional case - SapSom correctly learns to represent the phase space structure of the cartpole system. This representation is identified as "intuitive physics" here [footnote 2]. We further demonstrate that SapSom can virtually play action-induced state sequences which closely resemble the actual temporal state sequence provided by the environment. Action planning is then addressed by a one-shot imitation task: after presenting a single episode with upright, stationary pole to the system (which is imprinted as target state set), the system immediately balances the pole for a decent (though usually not infinite) number of time steps. One-shot learning arises from the fact that, in the present architecture, imitation is an inference rather than a learning process (for a similar philosophy, see [49]). Moreover, the system is not fixed to optimally perform one single task, as in classical reinforcement learning. For example, when primed with the goal to let the pole tilt to one side slowly, or as fast as possible, SapSom successfully plays these games after only one demonstration. As discussed by Lake et al. [28], this ability might come closer to human-level flexibility than classical one-task reinforcement learning.

Footnote 2: We emphasize that this is not to state that intuitive physics is actually acquired like this in biological systems. Most likely, evolutionary and prenatal self-organizational mechanisms [46] play an important role here.

The cartpole environment is very simple and the proposed system - in its present form - is subject to several limitations. At the end of the paper we discuss how these might be relaxed to result in more powerful architectures, and relate our model to biological findings.

Figure 1: (a) Proposed network architecture. A sensorimotor self-organizing map learns to represent state-action combinations, whose components are represented by a state SOM and an action SOM, respectively. An activated state-action unit learns to predict the most likely next state (brown), conditioned on the current state and action it represents. (b) Reduced architecture actually implemented for the demonstrations in the results section (for details see text). 1D state and action representations are drawn for simplicity.

2 Model

In this work we actually implement a reduced version of the model outlined in figure 1a. The resulting simplified architecture is shown in figure 1b. Simplifications are: (i) no action map is explicitly learned, which would maintain codebook vectors of motor commands. This restricts the implementation to discrete finite action spaces, which is, however, sufficient for the cases studied here. (ii) instead of a full sensorimotor map, a set of action-conditioned state-only maps is trained, one for each action. All maps share the same perceptual (state) representation, but each map learns an individual, action-conditioned state transition matrix. This prohibits dimensional control of the sensorimotor space, but makes sure that all state-action pairs are readily represented.

2.1 Representation learning

We wish to learn a sparse representation of the input space, because it is believed that this will facilitate the learning of unimodal, sharply peaked state transition distributions. As discussed above, many powerful techniques for representation learning and even several possibilities for obtaining sparse representations exist (e.g., [10]). Here we use self-organizing maps (SOMs) [25], because in addition to sparsity they maintain topographic order in map space, which may be useful in terms of predictive processing.

SOMs have been successfully used both in bio-inspired hierarchical representation learning [29] and reinforcement learning [41]. To briefly summarize, in a SOM, neurons are geometrically arranged in a regular grid (here 2D rectangular). Each unit at map location $\mathbf{r}$ maintains a codebook vector $\mathbf{w}_{\mathbf{r}}$ with the same dimension as the input space. On presentation of an input vector or "environmental state" $\mathbf{x}$, the winning unit is found, defined by its codebook vector being closest to the input,

$$\mathbf{r}^*(\mathbf{x}) = \arg\min_{\mathbf{r}} \|\mathbf{x} - \mathbf{w}_{\mathbf{r}}\| \qquad (1)$$

where $\|\cdot\|$ denotes the Euclidean norm, and the notation on the left-hand side of eq. (1) makes explicit the dependence of the winner on $\mathbf{x}$. The winning unit and its neighbours in map space learn with learning rate $\epsilon$ according to

$$\Delta \mathbf{w}_{\mathbf{r}} = \epsilon \, h_{\mathbf{r}}(\mathbf{x}) \, (\mathbf{x} - \mathbf{w}_{\mathbf{r}}) \qquad (2)$$

where $h_{\mathbf{r}}(\mathbf{x})$ can be interpreted as a localized neural activation pattern centered around the winning unit and is defined in eq. (3) below. The strong localization of activation entails that only a small fraction of units is active at every time step, resulting in a sparse code. The learning rule eq. (2) brings codebook vectors of map neighbours both closer to regions of high data density and to each other.
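As a minimal sketch of the winner search and neighbourhood learning step described above, the following pure-Python snippet stores a tiny map as a dictionary from grid locations to codebook vectors. The function names, the Gaussian neighbourhood shape, and the values of `eps` and `sigma` are illustrative assumptions and are not taken from the paper, whose implementation uses PyTorch tensors.

```python
import math

def find_winner(codebook, x):
    """Return the grid location whose codebook vector is closest to x."""
    return min(codebook, key=lambda r: math.dist(codebook[r], x))

def som_update(codebook, x, eps=0.5, sigma=1.0):
    """One Kohonen-style learning step.

    Assumption: a Gaussian neighbourhood of fixed width sigma stands in
    for the activation pattern; the paper uses a data-driven width.
    """
    winner = find_winner(codebook, x)
    for r, w in codebook.items():
        # Activation decays with grid distance from the winning unit.
        h = math.exp(-math.dist(r, winner) ** 2 / (2.0 * sigma ** 2))
        # Move the codebook vector towards the input, scaled by activation.
        codebook[r] = [wi + eps * h * (xi - wi) for wi, xi in zip(w, x)]
    return winner

# Hypothetical 2x2 map with 2D codebook vectors placed at the grid corners.
codebook = {(i, j): [float(i), float(j)] for i in range(2) for j in range(2)}
winner = som_update(codebook, [0.9, 0.2])
```

In this toy run the unit at grid location (1, 0) wins, and its codebook vector moves halfway towards the input, while more distant units move less.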

SOMs and predictive processing

There is an interpretation of SOMs in terms of predictive processing: when considering the codebook vector of an active unit as prediction of the input, the distance between the input and this codebook vector is just the prediction error in input space (prediction in the sense of "predicting the presence"), often referred to as quantization error. Finding the winning unit according to eq. (1) thus minimizes the input prediction error by inference, and the learning step eq. (2) further minimizes it by learning.

Similarly to [29], we define a data-driven variable width of the activation pattern (eq. 3), whose scale parameter has been empirically set to 25 percent of the map diameter throughout this work. By this, for inputs that are well-represented with small prediction error, only a small neighbourhood learns, whereas a large prediction error (maybe due to nonstationary statistics of the input) causes a large portion of the map to rearrange to better represent this novel input.

The normalized activation pattern in map space (eq. 4) can then be interpreted as recognition density. According to eq. (3), the recognition density is sharply peaked for small prediction errors and more distributed for larger representational uncertainty. In SOMs, the generative distribution, which denotes the likelihood of inputs given a map state, simply becomes a Kronecker delta that places all probability on the codebook vector of that map state. For a comprehensive treatment of generative and recognition models, see, e.g., [11, 12, 13].
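The behaviour of the recognition density can be illustrated with a small sketch. The exact functional form of the data-driven width is not reproduced here, so the code below assumes a Gaussian activation whose width is proportional to the winner's prediction error; `sigma0` and `min_width` are hypothetical parameters introduced only for this illustration.

```python
import math

def recognition_density(codebook, x, sigma0=1.0, min_width=1e-3):
    """Normalized activation pattern over map units for input x.

    Assumption: the width grows linearly with the winner's prediction
    error; min_width only guards against division by zero.
    """
    winner = min(codebook, key=lambda r: math.dist(codebook[r], x))
    error = math.dist(codebook[winner], x)
    sigma = max(sigma0 * error, min_width)
    h = {r: math.exp(-math.dist(r, winner) ** 2 / (2.0 * sigma ** 2))
         for r in codebook}
    total = sum(h.values())
    return {r: value / total for r, value in h.items()}

# Hypothetical 1D map with five units whose codebook vectors are 0, 1, ..., 4.
codebook = {(i,): [float(i)] for i in range(5)}
p_sharp = recognition_density(codebook, [2.0])  # input matches a unit exactly
p_broad = recognition_density(codebook, [2.4])  # input lies between two units
```

Under these assumptions, a perfectly represented input concentrates all probability on the winning unit, whereas an input with larger quantization error spreads probability over the winner's neighbours.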

Finally, due to topographic order, prediction error in map space can be defined in geometrical terms, simply as geometric distance between most likely true and predicted states, respectively.

2.2 Prediction learning

During learning, the system will exploit the possibility to act on the environment and to directly observe the consequences of these own actions on environmental state changes. Hence, learning is situated at the intervention level of Pearl’s Causal Hierarchy and can be formulated in the framework of Do-calculus [37]. Due to its similarity to how infants explore their environment by testing the effects of their actions, we metaphorically refer to this style of learning as "playful exploration".

During playful exploration, the model learns to approximate the action-conditioned Markov transition distribution $P(\mathbf{r}' \mid \mathbf{r}, a)$, where $\mathbf{r}, \mathbf{r}'$ run over all state units and $a$ runs over all actions. For this, each action-conditioned state network, labelled by $a$, updates an individual state transition matrix $\mathbf{T}^a$ with components $T^a_{\mathbf{r}'\mathbf{r}}$ (for matrix operations, indices being appropriately re-ordered as scalars), whenever $a$ is executed. Let $\mathbf{p}_t$ be the column vector of probabilities, eq. (4), assigned to map states at time $t$ in response to input $\mathbf{x}_t$. On execution of action $a$, the environment changes to state $\mathbf{x}_{t+1}$, leading to a new distribution $\mathbf{p}_{t+1}$. The state distribution predicted by network $a$, in contrast, is given by $\mathbf{T}^a \mathbf{p}_t$. We adopt a simple least squares scheme and minimize $\tfrac{1}{2}\|\mathbf{p}_{t+1} - \mathbf{T}^a \mathbf{p}_t\|^2$ with respect to $\mathbf{T}^a$. Gradient descent leads to the learning rule

$$\Delta \mathbf{T}^a = \eta \, (\mathbf{p}_{t+1} - \mathbf{T}^a \mathbf{p}_t) \, \mathbf{p}_t^{\top} \qquad (5)$$

where $\eta$ is a learning step size variable and the superscript $\top$ denotes the transpose of a vector or matrix [footnote 4].

Footnote 4: Usually, the Kullback-Leibler divergence between both distributions is minimized, but this is typically tractable only under severe simplifications, which often lead to treatment of modes only.
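The least squares learning step for one action-conditioned transition matrix can be sketched in a few lines. The snippet assumes the matrix is stored as nested lists with rows indexing the next map state and columns the current map state; the names `update_transition`, `p_now`, `p_next`, and `eta` are illustrative and not taken from the paper.

```python
def update_transition(T, p_now, p_next, eta=0.1):
    """Gradient step on the squared error between p_next and T @ p_now.

    Implements T <- T + eta * (p_next - T p_now) p_now^T in place.
    """
    n = len(p_now)
    predicted = [sum(T[i][j] * p_now[j] for j in range(n)) for i in range(n)]
    for i in range(n):
        err = p_next[i] - predicted[i]
        for j in range(n):
            T[i][j] += eta * err * p_now[j]
    return T

# Hypothetical toy case: three map states, and the executed action
# always moves the system from state 0 to state 1.
T = [[0.0] * 3 for _ in range(3)]
p_now, p_next = [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]
for _ in range(200):
    update_transition(T, p_now, p_next)
```

After repeated presentation of the same transition, the column belonging to the current state converges to the observed next-state distribution, while columns of states that were never visited remain untouched.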


2.3 Inference

After exploring the environment to a sufficient extent, inference can be done on the so-far learnt representation using the state transition matrices. For example, given a certain start map state, the system can generate virtual action sequences by activating action nodes according to some schedule, without actually executing the corresponding motor commands, and predict the sequence of states that would result from executing that sequence. Correlates of this in human cognition might be kinematic mental simulation with the goal to plan actions. Moreover, the start state might be virtually generated as well, instead of perceptually caused, giving the possibility to elaborate on virtual scenarios never seen before, which might be considered a kind of artificial creativity. We do not formulate an explicit neuronal model of how state and action representations might be spontaneously generated, but there are mechanisms and models of how this might occur spontaneously or in response to stimulation [45]. In the present context, sequence prediction is referred to as "playing virtual episodes", as this procedure accepts a start state and an action sequence and returns a sequence of predicted environmental states.

Sequence prediction

We explore two possibilities to perform one-step prediction given the environmental state and the action. When "predicting by expectation", the recognition density is determined according to eqs. (1, 3, 4), and the predicted next-state distribution is computed by applying the transition matrix of the given action. The expected next map state and the expected next environmental state are then obtained as expectations under this predicted distribution. Note that this in general yields non-integer "winner locations", which should be interpreted in a population coding way [15] as the estimated maximum of the activation pattern eq. (2) rather than the identity of a map unit. "Prediction by mode" simply considers the winning unit, and finds the most likely next map state as the unit with the highest transition probability from the winner, resulting in the codebook vector of that unit as predicted environmental state. In order to avoid deadlocks, prediction of the currently winning state is usually suppressed.

When the probability densities are sharply peaked, results for both methods become increasingly similar to each other. Actually, we found that prediction by mode yields slightly better results than prediction by expectation, therefore the former method is used throughout the results presented in section 3.

A sequence of states given a start state and an action sequence is predicted by consecutively applying one-step predictions on the basis of estimated environmental states, i.e., the first state is predicted from the start state and the first action, and each subsequent environmental state from the previously predicted state and the next action. For prediction by expectation this means that instead of applying multistep prediction of the density, the result of each one-step prediction is collapsed to the expected next state, which is fed back to the system for each time step. For prediction by mode, both procedures are identical.
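Playing a virtual episode with prediction by mode can be sketched as repeated lookup in the action-conditioned transition matrices. In this sketch, map states are plain integer indices, `T[a][s_next][s]` holds the learned transition weight, and the hand-built toy matrices stand in for matrices learned during exploration; all names are hypothetical.

```python
def predict_by_mode(T_a, state):
    """Most likely next map state under one action.

    Predicting the current state itself is suppressed to avoid deadlocks.
    """
    candidates = [s for s in range(len(T_a)) if s != state]
    return max(candidates, key=lambda s: T_a[s][state])

def play_virtual_episode(T, start, actions):
    """Feed each one-step prediction back in as the next start state."""
    states, state = [], start
    for a in actions:
        state = predict_by_mode(T[a], state)
        states.append(state)
    return states

# Hypothetical toy model: four states on a line, action 0 moves left,
# action 1 moves right, each with weight 0.8 (0.2 for staying put).
n = 4
T = [[[0.0] * n for _ in range(n)] for _ in range(2)]
for s in range(n):
    T[0][max(s - 1, 0)][s] += 0.8
    T[0][s][s] += 0.2
    T[1][min(s + 1, n - 1)][s] += 0.8
    T[1][s][s] += 0.2

episode = play_virtual_episode(T, start=1, actions=[1, 1, 0])
```

Starting from state 1, the action sequence right, right, left yields the predicted state sequence 2, 3, 2 without touching any environment.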

Goals and action planning

Sequence predictions can be used to plan action sequences in order to reach a goal. This requires the definition of what a goal is in the present context. We suggest defining a goal as a target state or a subset of target states in map space. Target states might be provided by stimulation, e.g., by demonstrating a target environmental state to the system, or might be intrinsically generated as described in the previous paragraph. These target states are then imprinted or memorized, while the system tries to reach and maintain them by executing a suitable sequence of actions. A biological correlate of target state memorization might be persistent non-distractible neural activity found in prefrontal cortex [14]. The described procedure is closely related to the following concepts: (i) one-shot imitation: imprinting the target state corresponds to the single demonstration of the goal; reaching the goal is then done by inference over the learned intuitive physical model. (ii) active inference: a system predicts a target state and minimizes prediction error by driving the environment towards the predicted (i.e., desired) state.

A large body of reinforcement learning literature exists on how to find a policy, which specifies how actions should be planned in order to maximize reward. Here we suggest an action planning strategy which does not rely on external reward signals but operates entirely on the distances between target states and the current state. The drive to actually try and reach an imprinted goal representation by active inference must in some respect be generated by an intrinsic reward mechanism, which is, however, not explicitly modelled here. Possible distance measures include the (here Euclidean) distance either between environmental states, which is the input prediction error in active inference terminology, or - because of topographic order - between map states, or both. We found action plans on the basis of environmental state distances to work better than map distance for the tasks considered here, hence the results in section 3 are generated using this distance measure.

Many distance-based action planning schemes can be imagined. Here we use simple $n$-step greedy forward search for action planning: on the basis of the true present state, find the action sequence of length $n$, the execution of which minimizes the distance between the predicted state resulting from that action sequence and the target state.
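The forward search just described can be sketched as an exhaustive enumeration of all action sequences of a given length. The function `plan` and the one-dimensional `toy_predict` model are hypothetical stand-ins; in the paper's setting, the one-step predictor would be the learned SapSom prediction rather than this hand-written rule.

```python
import itertools
import math

def plan(predict, state, target, actions, n):
    """Return the length-n action sequence whose predicted end state
    is closest (Euclidean distance) to the target state."""
    best_seq, best_dist = None, math.inf
    for seq in itertools.product(actions, repeat=n):
        s = state
        for a in seq:
            s = predict(s, a)  # virtual step, nothing is executed
        d = math.dist(s, target)
        if d < best_dist:
            best_seq, best_dist = seq, d
    return best_seq, best_dist

def toy_predict(state, action):
    """Hypothetical model: action 1 moves a 1D point by +1, action 0 by -1."""
    return [state[0] + (1.0 if action == 1 else -1.0)]

seq, dist = plan(toy_predict, [0.0], [3.0], actions=(0, 1), n=3)
```

The search cost grows exponentially with the sequence length, which is unproblematic for two actions and short horizons but would call for pruning or sampling in larger action spaces.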

3 Results

The system was implemented using PyTorch’s tensor library (https://pytorch.org/). For the experiments shown, a SOM was trained and analyzed. SapSom was tested on the OpenAI gym cartpole environment (v0, https://gym.openai.com/envs/CartPole-v0/; for a screenshot of the rendered cartpole see figure 1b, inset). The environment accepts two actions, push the cart to the left (action 0) or to the right (action 1) with a fixed force, and yields four sensory signals, namely the cart location, the pole’s angle with the vertical, and their time derivatives, i.e., cart velocity and angular velocity. The system was originally designed as a testbed for reinforcement learning systems with the goal to keep the pole vertical by balancing. Therefore the environment also returns a reward for each step and triggers a "done" signal as soon as the pole’s tilt exceeds the environment’s angle limit, the cart hits the screen border, or 200 steps of balancing are successfully executed. Throughout this work, the reward signal was ignored, because SapSom operates in a completely unsupervised way. A sequence of steps between cartpole initialization and trigger of the done signal is referred to as an episode.

Figure 2: Directions of motion in the angle phase plane (arrows) for five complete random episodes. Arrows are located at the states to which they apply. (a) Blue: Real directions of motion in the next step as provided by the environment. Red: Directions of motion predicted by the network when provided with the same state and action. Predictions approximate real movements very well. (b) Prediction of motions in phase space under virtual left push (blue) and virtual right push (red), respectively (for discussion see text).

3.1 Intuitive physics

Here we tested whether SapSom could learn a representation of the environment’s Newtonian dynamics, metaphorically referred to as "intuitive physics" [28]. In technical terms, we tested whether, after training, the system could approximate the real phase portrait of the environment by its own predicted phase portrait. Results are shown for the phase plane of the pole angle and its angular velocity, because the pole’s behaviour rather than the cart’s behaviour is usually considered in the cartpole environment.

The system was trained as follows: in order to assure correct unfolding of the map, the SOM representing the input space was pretrained over episodes using a standard learning scheme for self-organizing maps with exponential decays for and between start and end values and , respectively. Subsequently, both representation and prediction parts were trained simultaneously over episodes with , as described in section 2. Randomly selected actions were used during "playful exploration".

After training, the real and predicted dynamics of the environment were analyzed by playing real episodes and determining the real and predicted directions of motion from one step to the next. For a given phase point and next action, the real direction of motion was determined by executing the action on the environment and calculating the resulting change of state. The predicted direction of motion was calculated by presenting the same state to the map and determining the winning unit, then predicting the next state and corresponding predicted input, and finally computing the predicted change of state. Prediction by mode was used throughout the results section, because it operated slightly more robustly.

The real (blue) and predicted (red) directions of motion in the angle phase plane are shown in figure 2a for five complete random episodes. Note that this phase portrait is not uniquely defined, because at each point directions are conditioned on quantities not displayed in the plane. Real and predicted directions of motion agree very well with each other. However, there are small deviations, although the cartpole physics is deterministic and should in principle be learnable to arbitrary accuracy. The existence of small deviations is due to the state representation’s quantization error: similar, but different environmental state trajectories will be mapped to the same map unit, but will have slightly different time evolutions. These differences cannot be resolved by the system. Where quantization errors become large (e.g. for novel states), prediction errors can become large as well.

Figure 2b displays SapSom’s predicted directions of motion when planning to execute a left push (blue) or a right push (red), respectively, for the same five episodes. The configuration reflects correct "comprehension" of the situation: a left push generally accelerates the pole to the right, i.e., angular velocity increases, which is correctly mirrored by the blue arrows pointing upwards towards increasing angular velocity. With opposite sign, the same is true for a right push.

We conclude that SapSom can learn a reasonably accurate representation of the cartpole dynamics from interventional exploration alone. Because this is achieved in a completely unsupervised way and without making explicit use of the equations of motion, this can be understood as a way of capturing an intuition about the physics of the environment.

In the next two subsections, we present results about inference on this model. In order to separate slow learning and inference effects, learning was switched off in the following by setting the learning step sizes to zero.


Figure 3: (a) Screenshots of cartpole with identical start state followed by eight left pushes. Top: real time evolution. Bottom: 8-step prediction of time evolution. (b, c) Real time evolution (dashed) and predicted time evolution with same start state (solid) of cart location (b) and pole angle (c). Blue: 8 left pushes; red: 8 right pushes; green: random action sequence of length 8. (d) Real (dashed) and predicted (solid) time evolution of a longer sequence of length 39: oscillatory actions (3 left followed by alternating (6 x right) (6 x left) pushes). Major features of motion are correctly captured in all cases.

3.2 Playing virtual episodes

Next we examined whether the trained system could virtually play episodes on the basis of its intuitive physics, i.e., whether given an action sequence and a start state, the corresponding future state sequence could be predicted reasonably well.

Results for a number of different scenarios are summarized in figure 3. The top figure 3a illustrates by a number of screenshots how the cartpole evolves under its true dynamics (top row) and under predicted dynamics for the same start state and action sequence (bottom row). Actions were eight left pushes. The bottom row images were generated by manually setting the cartpole’s state to the predicted state, rendering the environment, and then capturing the screen. From visual inspection one may conclude that both sequences agree very well.

A slightly more quantitative analysis of prediction quality is given by plotting the time evolution of the cart location (in arbitrary units as provided by the gym environment, figure 3b) and the pole angle (figure 3c) under true (dashed) and predicted (solid) dynamics. Blue traces correspond to eight left pushes, red traces to eight right pushes and green traces to a random sequence of actions (0, 1, 0, 0, 0, 1, 0, 0; 0=left, 1=right). Although the coincidence is not perfect, the general features of the resulting motion (shifting and tilting to the correct direction) are captured quite well. Finally, in order to test the prediction performance on a longer and more complicated sequence, the comparison was run under an action sequence of length 39, which elicits an oscillation (Figure 3d). The gym environment initializes the cartpole’s start state with small random numbers, resulting in a small positive initial value for the pole angle in this case. As a consequence, in the true dynamics (dashed line) a slowly accelerating tilt to the right under gravity (drift of the angle) is superimposed with the faster oscillation evoked by the action sequence.

The general behaviour (upward drift under oscillation) is correctly predicted by the agent (solid line), even though the difference between the absolute values of real and predicted angles increases over time. This increasing prediction error can be understood by keeping in mind that, besides the action sequence, only the start state is available to the system (i.e., no intermediate sync).

It may be concluded that, for the environment considered, SapSom can perform qualitatively and semi-quantitatively correct multistep predictions of the environment’s future under a given virtual action sequence (usually generated by the agent itself). This encourages us to test the system’s action planning performance when performing a task. Since in the present system controlling the environment in order to achieve a goal is an inferential rather than a slow learning process, we test SapSom on a set of one-shot imitation tasks.


Figure 4: Time evolution of cart location (a, c) and pole angle (b, d) under four different tasks. (a, b) blue: balancing; red: slow tilt to the right; green: fast tilt to the left. (c, d) keep-pole-tilted-stationary task.

3.3 One-shot imitation

A task requires a goal to be formulated, either implicitly (by reward structure) or explicitly, by demonstrating one or a few success stories, i.e., examples where the goal has been reached. Here we adopt the latter approach, which has the advantage that no potentially complex reward structure needs to be formulated. For SapSom, we define a goal as a target state or set of target states in its state representation, which it has to reach and maintain. This target state can be spontaneously generated by the system (intrinsic or curiosity-driven goal) or can be imprinted from outside (extrinsic goal).

Here we formulate an extrinsic goal by presenting to the system a single sequence of target states, which are imprinted into its map. Imprinting means that the corresponding winning units are memorized as part of the goal. For example, if the goal is to balance the pole, a sequence of states with upright stationary pole under various cart states is presented to SapSom. Technically, instead of the true sequence of states, we only present the vector of expected values, , and the vector of inverse variances or precisions, , to the system (the superscript "g" stands for "goal"). For the pole balancing example, when and are allowed to vary strongly with zero mean, we obtain and , where small precisions are omitted and large precisions are capped to a maximum of 1. Reaching the target state then means searching for a sequence of actions that drives the environment’s actual state towards the target state(s) and keeps it there. For simplicity, we avoid explicitly determining the distance between the actual state and all target states, but instead use the precision-weighted distance between and : . For action planning, one-step greedy forward search is applied, i.e., .
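The planning step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used for the experiments: the state is taken to be the four gym cartpole observations, the distance is written in squared form (which leaves the minimizing action unchanged), the goal values for balancing are assumed to be zero means with unit precision on the two pole variables only, and `toy_predict` is a hypothetical stand-in for the learned one-step predictor.

```python
from typing import Callable, Sequence, Tuple

State = Tuple[float, float, float, float]  # (x, x_dot, theta, theta_dot)


def precision_weighted_distance(state: Sequence[float],
                                goal_mean: Sequence[float],
                                goal_precision: Sequence[float]) -> float:
    """Squared distance to the goal, each dimension weighted by its precision."""
    return sum(p * (s - g) ** 2
               for s, g, p in zip(state, goal_mean, goal_precision))


def greedy_action(state: State, actions: Sequence[int],
                  predict: Callable[[State, int], State],
                  goal_mean: Sequence[float],
                  goal_precision: Sequence[float]) -> int:
    """One-step greedy forward search: choose the action whose predicted
    successor state lies closest to the goal."""
    return min(actions, key=lambda a: precision_weighted_distance(
        predict(state, a), goal_mean, goal_precision))


def toy_predict(state: State, action: int) -> State:
    """Hypothetical one-step predictor (toy dynamics, not the gym physics)."""
    x, x_dot, theta, theta_dot = state
    force = 1.0 if action == 1 else -1.0      # 0 = push left, 1 = push right
    x_dot += 0.1 * force
    theta_dot += 0.1 * theta - 0.15 * force   # pushing right tips the pole left
    return (x + 0.02 * x_dot, x_dot, theta + 0.02 * theta_dot, theta_dot)


GOAL_MEAN = (0.0, 0.0, 0.0, 0.0)          # assumed balancing goal
GOAL_PRECISION = (0.0, 0.0, 1.0, 1.0)     # assumed: cart terms omitted

falling_right = (0.0, 0.0, 0.05, 0.5)
print(greedy_action(falling_right, (0, 1), toy_predict,
                    GOAL_MEAN, GOAL_PRECISION))
```

With the toy dynamics, a pole falling to the right leads to the choice "push right", since moving the cart under the pole reduces the predicted angular velocity. Dimensions with zero precision do not influence the choice at all, which is how the cart state is left unconstrained in the balancing example.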

Figure 4 shows the time evolution of cartpole location (4a, 4c) and angle (4b, 4d) for four different goals given to the agent. The first goal (4a, 4b, traces in blue) was to balance the pole, imprinted as . The second goal (4a, 4b, red traces) was to let the pole tilt to the right to achieve moderate final angular velocity . The third goal (4a, 4b, green traces) was to let it tilt to the left with very high final angular velocity .

The final angular velocities were for the moderate tilt case, and for the rapid tilt case. The latter angular velocities are about the maximum which can be achieved when pushing to one side all the time, which was correctly predicted by the agent: All but one action over the five rapid tilt trials were "push right".

Figures 4c and 4d illustrate how the system acts in a slightly more challenging task, namely to keep the pole tilted to the right and stationary (). To achieve this, the agent had to constantly accelerate the cart in the direction of the tilt in a controlled way, such that the cart leaves the regime previously explored and learned. The traces indicate that the agent manages to keep the pole steadily tilted over a considerable number of time steps, although it does not quite reach the goal of rad but instead stabilizes at values around rad. Also, when approaching the screen boundaries, the agent fails to stabilize any longer, because this configuration is far from what it experienced during playful exploration with random actions only.

In summary, we find that the system is quite flexible in solving different related tasks and generalizes satisfactorily to previously unseen regimes.

OpenAI defines "solving" the cartpole problem as being able to balance the pole over an average of 195 time steps per episode, taken over the last 100 episodes, where each episode ends in case of failure or after 200 time steps otherwise. In order to characterize the performance of the agent, we analyzed 100 episodes under the balancing task (Figure 4b, blue). We found that 61 episodes reached the limit of 200 steps of balancing, and that the average number of time steps was 188. Hence, SapSom does not quite meet the criterion for solving the task, but comes quite close.
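The criterion quoted above amounts to a simple check on the episode lengths. The following sketch uses a hypothetical helper `is_solved`, and the episode lists are made-up illustrations rather than the measured data.

```python
from statistics import mean
from typing import Sequence


def is_solved(episode_lengths: Sequence[int], window: int = 100,
              threshold: float = 195.0) -> bool:
    """True if the mean length of the last `window` episodes reaches `threshold`."""
    if len(episode_lengths) < window:
        return False                      # not enough episodes to judge
    return mean(episode_lengths[-window:]) >= threshold


# Illustrative inputs only (not the results reported in the text).
print(is_solved([200] * 100))             # every episode hits the 200-step cap
print(is_solved([200] * 90 + [100] * 10)) # mean of 190 stays below 195
```

An average of 188 steps over 100 episodes, as reported above, falls below the threshold of 195 under this check.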

The results demonstrate that SapSom can perform each of these tasks very well (although not perfectly) after only one presentation of the goal state. This is possible because task solution is done via inference over the intuitive physics learned, rather than via slow modification of weights. Hence, in comparison with reinforcement learning, the presented system provides two major advantages: (i) it does not need any extrinsic reward structure (which often has to be engineered in a tedious process), and (ii) it can flexibly solve various tasks, which would require both a new reward structure and retraining in reinforcement learning. The latter flexibility has also been specified as an important feature of human-like learning and performance [28].

4 Discussion

The goal of this work was to suggest a minimal model which shows important properties of artificial intelligence: learning from experience, comprehension of its surroundings, reasoning, planning, and flexible solution of different tasks. It does so by learning a representation of the dynamical physical properties of its environment by exploration, and by performing inference on this representation to reach goals. The philosophy behind this approach is related to the KISS principle and Occam’s razor at the model level: we identified minimal ingredients which seem both plausible and important for achieving the mentioned properties (there may be completely different ways, though, to generate the same behaviour). The key ingredients of SapSom are:

  • An adaptive and sparse sensorimotor representation, in particular the possibility to act on an environment rather than just observing, in order to learn by exploration.

  • A temporal sequence learning and prediction mechanism at the sensorimotor level.

  • A short term memory mechanism to store target states in state representation space.

  • A mechanism to intrinsically generate action and state representations and to reason on their temporal evolution on the basis of temporal sequence prediction.

  • A distance measure between states. If reasoning is to be done at the representation level, a distance measure between sparse state representations, such as a topographic map, is required.

In the following, we discuss possible extensions of our model, and relate its mechanisms to biological findings.

4.1 Possible extensions

A simple set of one-layer network architectures was used (in the full model, Figure 1a, a three-layer architecture) to learn an embedding of the input space. Both the brain and state-of-the-art representation techniques, in contrast, maintain hierarchical representations of modalities. In our model, single-layer state representation learning can easily be replaced by hierarchical self-organizing map structures [29], or by contemporary high-performance approaches such as (vector-quantized) variational autoencoders [24, 42], generative adversarial networks [17], or deep convolutional architectures in general (e.g., [26]), which might be trained on raw frame sequences. A sparse representation at the embedding level, which is considered crucial for temporal sequence learning, can be either directly provided by such systems [42], or can be achieved by using a SOM or other competitive learning mechanism [10] on top of their output or encoding layer.

SapSom’s temporal sequence learning mechanism is simply first-order Markov, rendering it somewhat similar to Hidden Markov Models [3]. Clearly, reasoning on environments of natural complexity requires variable (and sometimes very high) order Markov representations. Hawkins et al. [20] developed a biologically plausible model for variable-order sequence memory which is embedded in their hierarchical temporal memory framework. It is based on lateral cortical connections operating on a sparse representation, where each state is represented by multiple model neurons. Their approach could be most naturally incorporated in our framework, but other approaches for variable-order sequence prediction can also be considered without changing the fundamental way the model works [22, 40]. A different obvious possibility to include sensitivity to variable-length history is to use recurrent connectivity. Recurrent SOMs [47, 29] can be trained to represent short sequences of input patterns instead of individual states.

Besides being one-step forward only, action sequence planning is done simply by minimizing the precision-weighted Euclidean distances between current and goal environmental states. For high-dimensional state spaces, however, it is known that the Euclidean distance can lose much of its meaning [1]. Consequently, for high-dimensional spaces it might be beneficial to either resort to other distance measures, such as the cosine distance, or to put more weight on distance measurement in the topographically ordered latent space.
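As a minimal sketch of the alternative measure mentioned above, the cosine distance can be written next to the Euclidean distance as follows. The function names are hypothetical, and how such a measure would be combined with the goal precisions is left open here.

```python
import math
from typing import Sequence


def euclidean_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """Straight-line distance between two state vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def cosine_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """One minus the cosine of the angle between u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine distance is undefined for zero vectors")
    return 1.0 - dot / (norm_u * norm_v)


u, v = (1.0, 2.0), (2.0, 4.0)             # same direction, different length
print(euclidean_distance(u, v))           # sensitive to the length difference
print(cosine_distance(u, v))              # close to zero: direction only
```

The cosine distance depends only on the direction of the vectors, so it is undefined for an all-zero state and ignores magnitude; whether this is appropriate depends on how the state representation is scaled.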

Another unresolved issue is how to stably establish a hierarchy of timescales, which is clearly present in human reasoning. When related to biology, assuming a "timestep" within the brain of a few tens of milliseconds, all reasoning modeled here would operate in the sub-second range. Humans, in contrast, make hierarchical action plans over a broad range from below seconds to years, and seem to compose longer-term plans by abstracting from shorter ones [23]. Hierarchical arrangements of RSOMs have the potential to model a temporal hierarchy. For longer histories, however, the algorithm requires some parameter fine-tuning and becomes subject to numerical instability. Temporal hierarchy might instead be more stably generated when interlacing RSOM layers with a novelty-driven gating mechanism (Stetter, unpublished results).

In its minimal version, SapSom operates in completely unsupervised mode. It does not evaluate any supervisory or extrinsic reward signals. The feedback used is sensory information evoked by its own actions. Because there needs to be some intrinsic motivation mechanism which drives the system to explore its environment, to try and reach goals, and eventually to intrinsically set goals, the present approach shows similarities with curiosity-driven learning, where intrinsic reward signals are generated based on prediction errors [36]. Humans, in contrast, do use both extrinsic and intrinsically generated reward signals (e.g., [44]).

Reinforcement learning mechanisms can be incorporated in a natural way on top of the sensorimotor representation, as SapSom (up to the reward signal) considers its environment as a Markov decision process which it learns to approximate. The present approach therefore shows a natural link with model-based RL [5]. Alternatively, Q-learning could use its update rule to optimize a Q-value that is assigned to each unit in the sensorimotor map (cf. Figure 1). The resulting policy would then replace the greedy forward search algorithm established here. Generally, however, finding efficient search strategies in action sequence space that approach human performance in being creative and solving problems is an important ongoing research issue.
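The Q-learning variant mentioned above could look roughly as follows. This is a sketch under assumptions and not part of the present system: each sensorimotor unit is identified by a hypothetical integer index, the units reachable from the successor state (one per action) are assumed to be known, and the update is the standard tabular Q-learning rule with learning rate `alpha` and discount factor `gamma`.

```python
from typing import Dict, Iterable


def q_update(q: Dict[int, float], unit: int, reward: float,
             next_units: Iterable[int], alpha: float = 0.1,
             gamma: float = 0.99) -> float:
    """Standard tabular Q-learning update, with one Q-value per
    sensorimotor (state-action) unit. Unseen units default to 0."""
    best_next = max((q.get(u, 0.0) for u in next_units), default=0.0)
    target = reward + gamma * best_next
    old = q.get(unit, 0.0)
    q[unit] = old + alpha * (target - old)
    return q[unit]


q_values: Dict[int, float] = {}
# Unit 3 was active, a reward of 1 was received, and units 4 and 5 are the
# state-action units available in the successor state (hypothetical indices).
print(q_update(q_values, unit=3, reward=1.0, next_units=[4, 5]))
```

A policy would then select, among the units available in the current state, the one with the highest Q-value, in place of the greedy distance-based search.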

4.2 Relation to biology

Conceptually, our model lies in the middle of the spectrum ranging from approaches that deliberately abstract from the way the brain works [43] to computational neuroscience models which put their main focus on how cognitive mechanisms are specifically implemented in biological neural networks [8]. Our approach is to formulate computational models which are inspired by fundamental mechanisms of brain function and are in principle compliant with what is known about biological information processing. Accordingly, a few parallels can be drawn between SapSom’s present implementation and the brain’s biological neural networks.

First of all, Kohonen’s self-organizing maps are biologically motivated by retinotopy and smooth feature maps found in the early visual pathway, by Mexican-hat-like lateral cortical connectivity, and by Hebbian learning. These aspects seem more closely related to biology than an error backpropagation mechanism, for which no biological correlate could be found so far. Moreover, SOMs have been used very successfully as models of self-organization in the early visual pathway [32]. Given the observation that all areas in the neocortex are laid out very uniformly and seem to perform similar operations [31], hierarchical systems of SOMs appear to be good candidates for neurally inspired models, at least of posterior neocortex.

The state transition matrices optimized during prediction learning can be interpreted as lateral (figure 1b) or top-down (figure 1a) cortico-cortical connections. When doing so, the learning rule eq. (5) just corresponds to spike-timing-dependent plasticity (e.g., [4]). For this we recall that map state probabilities are derived from normalized activations of state neurons, eq. (4). When is interpreted as connection from state-action neuron with index to target neuron , the learning rule reads . The first term says that will be increased, when becomes active after (stronger than expected), and is decreased, if is expected active but remains silent.
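The verbal description of this first term can be illustrated by a prediction-error-gated Hebbian update. The sketch below is one rule consistent with that description and is not a transcription of eq. (5): the function name, the learning rate, and the linear prediction are assumptions, and any further terms or normalization of the original rule are omitted.

```python
from typing import List


def update_transitions(weights: List[List[float]], pre: List[float],
                       post_observed: List[float],
                       lr: float = 0.1) -> List[List[float]]:
    """weights[i][j] connects source (state-action) unit j to target unit i.
    A weight grows if the target is more active than predicted after the
    source was active, and shrinks if the target was expected but silent."""
    predicted = [sum(w * a for w, a in zip(row, pre)) for row in weights]
    for i, row in enumerate(weights):
        error = post_observed[i] - predicted[i]   # observed minus expected
        for j in range(len(row)):
            row[j] += lr * pre[j] * error         # gated by source activity
    return weights


w = [[0.5], [0.5]]                 # one source unit, two target units
update_transitions(w, pre=[1.0], post_observed=[1.0, 0.0])
print(w)                           # first weight increased, second decreased
```

Because the update is multiplied by the source activity at the earlier time step and by the target's deviation at the later one, it depends on the temporal order of the two activations, which is the property that motivates the comparison with spike-timing-dependent plasticity.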

The basic structural and functional aspects of the present model can be mapped to neocortical and subcortical structures and their putative functions. The sensory and motor representations should be partly related to the corresponding representational systems of posterior cortex. However, the most natural biological equivalents of the sensory, motor, and sensorimotor model components (cf. figure 1a) might be situated in the prefrontal cortex and basal ganglia complex. For example, in a spatial working memory task, neurons which represent the cue ("sensory"), the planned response ("motor"), and the combination of both ("sensorimotor") have all been found in lateral prefrontal cortex [2].

Moreover, there is evidence that prefrontal cortex is capable of actively maintaining information by robust, persistent neural activity, which, according to prominent models, might be rapidly updatable by a striatum-driven gating mechanism [33]. The system comprising dorsolateral, anterior cingulate and orbitofrontal cortices, which interact with the striatum and thalamus, is considered crucial for representing goals and context, generating action plans and evaluating expected rewards thereof (for a review of data and detailed computational models, see [34]). Hence, non-distractible sustained activation might be a neural correlate of the model’s goal states, whereas their distractible counterparts might underlie working memory required during mental execution of the search through action plans.

The intrinsic activation of action and state sequences (i.e., activating units without actually sensing or acting) by repeated temporal sequence prediction is suggested here as a mechanism for reasoning over the implicit physical model of the environment. This is in agreement with the current view in the cognitive sciences that cognitive processes arise from co-ordination of cortical states already present in spontaneous activity ([35] and references therein). Also, psychological studies have corroborated that humans use the mental simulation of short temporal sequences to create and test informal algorithms when mentally solving abduction and deduction tasks [23]. No explicit neural mechanism of such autonomous intrinsic activation is suggested in our model (instead, activation and search are executed from outside the core networks), but variations in the level of global spontaneous activity have been shown earlier to have the potential to elicit and dynamically stabilize intrinsic cortical activity patterns [45].

Learning more about the principles of this orchestration process of cortical states, or, in SapSom terminology, learning about how search and prediction should be ideally designed given an implicit world model, might lead to a better understanding of the principles of human thinking in general – an exciting field for future research.


The authors wish to thank Monika Stetter for numerous valuable discussions on the subject and Livia Scheunemann for helpful comments on the manuscript. MS was on sabbatical leave joining the CIML Lab at University of Regensburg.


  • [1] Charu C. Aggarwal, Alexander Hinneburg, and Daniel A. Keim. On the surprising behavior of distance metrics in high dimensional space. In Jan Van den Bussche and Victor Vianu, editors, Database Theory — ICDT 2001, pages 420–434, Berlin, Heidelberg, 2001. Springer Berlin Heidelberg.
  • [2] Wael F Asaad, Gregor Rainer, and Earl K Miller. Neural activity in the primate prefrontal cortex during associative learning. Neuron, 21(6):1399–1407, 1998.
  • [3] Leonard E. Baum and Ted Petrie. Statistical inference for probabilistic functions of finite state Markov chains. Ann. Math. Statist., 37(6):1554–1563, 12 1966.
  • [4] Guoqiang Bi and Mu-ming Poo. Synaptic modifications in cultured hippocampal neurons: Dependence on spike timing, synaptic strength, and postsynaptic cell type. Journal of Neuroscience, 18:10464–72, 01 1999.
  • [5] Matthew M Botvinick and Ari Weinstein. Model-based hierarchical reinforcement learning and human action control. Philosophical Transactions of the Royal Society B: Biological Sciences, 369, 2014.
  • [6] Andy Clark. Radical predictive processing. The Southern Journal of Philosophy, 53(S1):3–27, 2015.
  • [7] Andy Clark. Surfing Uncertainty: Prediction, Action, and the Embodied Mind. Oxford University Press, 2016.
  • [8] Gustavo Deco and Edmund Rolls. Attention, short-term memory, and action selection: A unifying theory. Progress in neurobiology, 76:236–56, 08 2005.
  • [9] Yan Duan, Marcin Andrychowicz, Bradly Stadie, OpenAI Jonathan Ho, Jonas Schneider, Ilya Sutskever, Pieter Abbeel, and Wojciech Zaremba. One-shot imitation learning. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 1087–1098. Curran Associates, Inc., 2017.
  • [10] Peter Földiák. Forming sparse representations by local anti-hebbian learning. Biol. Cybern., 64(2):165 – 170, 1990.
  • [11] Karl J. Friston. Learning and inference in the brain. Neural Netw., 16(9):1325 – 1352, November 2003.
  • [12] Karl J. Friston. A theory of cortical responses. Philosophical Transactions of the Royal Society B: Biological Sciences, 360:815 – 836, 2005.
  • [13] Karl J. Friston. The free-energy principle: a unified brain theory? Nature reviews. Neuroscience, 11:127–38, 2010.
  • [14] Joaquin M. Fuster and Garrett E. Alexander. Neuron activity related to short-term memory. Science, 173(3997):652–654, 1971.
  • [15] Apostolos P. Georgopoulos, Andrew B. Schwartz, and Ronald E. Kettner. Neuronal population coding of movement direction. Science, 233(4771):1416–1419, 1986.
  • [16] Ian J. Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, Cambridge, MA, USA, 2016.
  • [17] Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 2, NIPS14, pages 2672 – 2680, Cambridge, MA, USA, 2014. MIT Press.
  • [18] Alex Graves, Douglas Eck, Nicole Beringer, and Juergen Schmidhuber. Biologically plausible speech recognition with lstm neural nets. In Auke Jan Ijspeert, Masayuki Murata, and Naoki Wakamiya, editors, Biologically Inspired Approaches to Advanced Information Technology, pages 127–136, Berlin, Heidelberg, 2004. Springer Berlin Heidelberg.
  • [19] Alex Graves, Abdel-rahman Mohamed, and Geoffrey E. Hinton. Speech recognition with deep recurrent neural networks. CoRR, abs/1303.5778, 2013.
  • [20] Jeff Hawkins, Dileep George, and Jamie Niemasik. Sequence memory for prediction, inference and behaviour. Philosophical Transactions of the Royal Society B: Biological Sciences, 364(1521):1203–1209, 2009.
  • [21] Jakob Hohwy. The Predictive Mind. OUP Oxford, 2013.
  • [22] Steven Jensen, Daniel Boley, Maria Gini, and Paul Schrater. Rapid on-line temporal sequence prediction by an adaptive agent. Proceedings of the International Conference on Autonomous Agents, pages 67–73, 2005.
  • [23] Sangeet Suresh Khemlani, Robert Mackiewicz, Monica Bucciarelli, and Philip N. Johnson-Laird. Kinematic mental simulations in abduction and deduction. Proceedings of the National Academy of Sciences, 110(42):16766–16771, 2013.
  • [24] Diederik P Kingma and Max Welling. Auto-Encoding Variational Bayes. arXiv e-prints, page arXiv:1312.6114, 2013.
  • [25] Teuvo Kohonen. Self-organized formation of topologically correct feature maps. Biological Cybernetics, 43(1):59–69, January 1982.
  • [26] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 25, pages 1097–1105. Curran Associates, Inc., 2012.
  • [27] Brenden M. Lake, Ruslan Salakhutdinov, and Joshua B. Tenenbaum. Human-level concept learning through probabilistic program induction. Science, 350(6266):1332–1338, 2015.
  • [28] Brenden M. Lake, Tomer D. Ullman, Joshua B. Tenenbaum, and Samuel J. Gershman. Building machines that learn and think like people. Behavioral and Brain Sciences, 40:e253, 2017.
  • [29] Jeffrey W. Miller and Peter H. Lommel. Biomimetic sensory abstraction using hierarchical quilted self-organizing maps, volume 6384 of Society of Photo-Optical Instrumentation Engineers (SPIE) Conference Series, page 63840A. 2006.
  • [30] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, February 2015.
  • [31] Vernon B. Mountcastle. An organizing principle for cerebral function: The unit module and the distributed system. In F. O. Schmitt, editor, Neuroscience, Fourth Study Program, pages 21–42. MIT Press, Cambridge, MA, 1979.
  • [32] Klaus Obermayer, Helge Ritter, and Klaus Schulten. A principle for the formation of the spatial structure of cortical feature maps. Proceedings of the National Academy of Sciences, 87(21):8345–8349, 1990.
  • [33] Randall C O’Reilly. Biologically based computational models of high-level cognition. Science, 314(5796):91–94, October 2006.
  • [34] Randall C. O’Reilly, Jacob Russin, and Seth A. Herd. Computational models of motivated frontal function. In Mark D’Esposito and Jordan H. Grafman, editors, The Frontal Lobes, volume 163 of Handbook of Clinical Neurology, pages 317 – 332. Elsevier, 2019.
  • [35] David Papo. How can we study reasoning in the brain? Frontiers in Human Neuroscience, 9:222, 2015.
  • [36] Deepak Pathak, Pulkit Agrawal, Alexei A. Efros, and Trevor Darrell. Curiosity-driven exploration by self-supervised prediction. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, ICML 2017, pages 2778–2787. JMLR.org, 2017.
  • [37] Judea Pearl. Causality: Models, Reasoning, and Inference. Cambridge University Press, USA, 2000.
  • [38] Judea Pearl. Theoretical impediments to machine learning with seven sparks from the causal revolution. CoRR, abs/1801.04016, 2018.
  • [39] Rajesh Rao and Dana Ballard. Predictive coding in the visual cortex: a functional interpretation of some extra-classical receptive-field effects. Nature neuroscience, 2:79–87, 02 1999.
  • [40] David Rawlinson, Abdelrahman Ahmed, and Gideon Kowadlo. Learning distant cause and effect using only local and immediate credit assignment. stat.ML, abs/1905.11589, 2019.
  • [41] David Rawlinson and Gideon Kowadlo. Generating adaptive behaviour within a memory-prediction framework. PLoS ONE, 7:e29264, 2012.
  • [42] Ali Razavi, Aaron van den Oord, and Oriol Vinyals. Generating diverse high-fidelity images with VQ-VAE-2. cs.LG, abs/1906.00446, 2019.
  • [43] Stuart Russell and Peter Norvig. Artificial Intelligence: A Modern Approach. Prentice Hall Press, USA, 3rd edition, 2009.
  • [44] Wolfram Schultz, Peter Dayan, and P Read Montague. A neural substrate of prediction and reward. Science, 275(5306):1593–1599, 1997.
  • [45] Martin Stetter. Dynamic functional tuning of nonlinear cortical networks. Phys. Rev. E, 73:031903, Mar 2006.
  • [46] Martin Stetter, Elmar W. Lang, and Adolf Müller. Emergence of orientation selective simple cells simulated in deterministic and stochastic neural networks. Biological cybernetics, 68(5):465–476, 1993.
  • [47] Markus Varsta, Jukka Heikkonen, and Jose del R. Millan. Context learning with the self-organizing map. Proceedings of the Workshop on Self-Organizing Maps ’97, pages 197–202, 1997.
  • [48] Jane Wang, Zeb Kurth-Nelson, Dharshan Kumaran, Dhruva Tirumala, Hubert Soyer, Joel Leibo, Demis Hassabis, and Matthew Botvinick. Prefrontal cortex as a meta-reinforcement learning system. Nature Neuroscience, 21, 06 2018.
  • [49] Jane X Wang, Zeb Kurth-Nelson, Dhruva Tirumala, Hubert Soyer, Joel Z Leibo, Remi Munos, Charles Blundell, Dharshan Kumaran, and Matt Botvinick. Learning to reinforcement learn. cs.LG, abs/1611.05763, 2016.
  • [50] Yaqing Wang, Quanming Yao, James Kwok, and Lionel M. Ni. Generalizing from a few examples: A survey on few-shot learning. cs.LG, abs/1904.05046, 2019.