ToyArchitecture: Unsupervised Learning of Interpretable Models of the World

03/20/2019, by Jaroslav Vítků, et al.

Research in Artificial Intelligence (AI) has focused mostly on two extremes: either on small improvements in narrow AI domains, or on universal theoretical frameworks which are usually uncomputable, incompatible with theories of biological intelligence, or lack practical implementations. The goal of this work is to combine the main advantages of the two: to follow a big picture view, while providing a particular theory and its implementation. In contrast with purely theoretical approaches, the resulting architecture should be usable in realistic settings, but also form the core of a framework containing all the basic mechanisms, into which it should be easier to integrate additional required functionality. In this paper, we present a novel, purposely simple, and interpretable hierarchical architecture which combines multiple different mechanisms into one system: unsupervised learning of a model of the world, learning the influence of one's own actions on the world, model-based reinforcement learning, hierarchical planning and plan execution, and symbolic/sub-symbolic integration in general. The learned model is stored in the form of hierarchical representations with the following properties: 1) they are increasingly more abstract, but can retain details when needed, and 2) they are easy to manipulate in their local and symbolic-like form, thus also allowing one to observe the learning process at each level of abstraction. On all levels of the system, the representation of the data can be interpreted in both a symbolic and a sub-symbolic manner. This enables the architecture to learn efficiently using sub-symbolic methods and to employ symbolic inference.






Code repository: a simulation environment for the creation and observation of ML models, based on PyTorch.

I Motivation

Despite the fact that strong AI capable of handling a diverse set of human-level tasks was envisioned decades ago, and there has been significant progress in developing AI for narrow tasks, we are still far away from having a single system which would be able to learn with efficiency and generality comparable to human beings or animals. While practical research has focused mostly on small improvements in narrow AI domains, research in the area of Artificial General Intelligence (AGI) has tended to focus on frameworks of truly general theories, like AIXI [44], Causal Entropic Forces [100], or PowerPlay [86]. These are usually uncomputable, incompatible with theories of biological intelligence, and/or lack practical implementations.

Another class of algorithms comprises systems that lie somewhere on the edge between cognitive architectures and adaptive general problem-solving systems. Examples of such systems are: the Non-Axiomatic Reasoning System [96], Growing Recursive Self-Improvers [2], recursive data compression architecture [24], OpenCog [32], Never-Ending Language Learning [14], Ikon Flux [68], MicroPsi [3], Lida [23] and many others [49]. These systems usually have a fixed structure with adaptive parts and are in some cases able to learn from real-world data. There is often a trade-off between scalability and domain specificity; as a result, they are usually outperformed by deep learning systems, which are general and highly scalable given enough data, and therefore increasingly applicable to real-world problems.

Finally, at the end of this spectrum there are theoretical roadmaps that envision promising future directions of research. These usually suggest combining deep learning with additional structures enabling, for example, more sample-efficient learning, more human-like reasoning, and other attributes [63, 52].

Our approach could be framed as something between the ones described above. It is an attempt to propose a reasonably unified AI architecture (the term "architecture" is to be taken to mean an autonomous learning and decision system which controls an agent in a virtual/real environment) which takes into account the big picture, states the required properties right from the beginning as design constraints (as in [8]), is interpretable, and yet has a simple mapping to deep learning systems if necessary.

In this paper, we present an initial version of the theory (and its proof-of-concept implementation) defining a unified architecture which should fill the aforementioned gap. Namely, the goals are to:

  • Provide a hierarchical and decentralized architecture capable of robust learning and inference across a variety of tasks with noisy and partially-observable data.

  • Produce one simple architecture which either solves, or has the potential to solve, as many of the requirements for general intelligence as possible (according to the holistic design principles of [63, 67]).

  • Emphasize simplicity and interpretability and avoid premature optimization, so that problems and their solutions become easier to identify. Thus the name “ToyArchitecture”.

This paper is structured as follows: first, we state the basic premises for a situated intelligent agent and review the important areas in which current Deep Learning (DL) methods do not perform well (Section II). Next, in Section III, we describe the properties of the class of environments in which the agent should be able to act. We try to place restrictions on those environments such that the problem becomes practically solvable without ruling out the realistic, real-world environments we are interested in. Section IV then transforms the expected properties of the environments into design requirements on the architecture. In Section V, the functionality of the prototype architecture is explained with reference to the required properties and the formal definition in the Appendix. Section VI presents some basic experiments which illustrate the theoretical properties of the architecture. Finally, Section VII compares the ToyArchitecture to existing models of AI, discusses its current limitations, and proposes avenues for future research.

II Required Properties of the Agent

This section describes the basic requirements of an autonomous agent situated in a realistic environment, and discusses how they are addressed by current Deep Learning frameworks.

  1. Learning: Most of the information received by an agent during its lifetime comes without any supervision or reward signal. Therefore, the architecture should learn in a primarily unsupervised way, but should support other learning types for the occasions when feedback is supplied.

  2. Situated cognition: The architecture should be usable as a learning and decision making system by an agent which is situated in a realistic environment, so it should have abilities such as learning from non-i.i.d. and partially observable data, active learning [37], etc.

  3. Reasoning: It should also be capable of higher-level cognitive reasoning (such as goal-directed, decentralized planning, zero-shot learning, etc.). However, instead of needing to decide when to switch between symbolic/sub-symbolic reasoning, the entire system should hierarchically learn to compress high-dimensional inputs to lower-dimensional (a similar concept to the semantic pointer [10]), slower changing [99], and more structured [61] representations. At each level of the hierarchy, the same inference mechanisms should be compatible with both (simple) symbolic and sub-symbolic terms. This refers to one of the most fundamental problems in AI—chunking: how to efficiently convert raw sensory data into a structured and separate format [88, 62]. The system should be able to learn and store representations of both simple and complex concepts so that they can be efficiently reused.

  4. Biological inspiration: The architecture should be loosely biologically plausible [62, 33, 36]. This means that principles that are believed to be employed in biological networks are preferred (for example in [57]) but not required (as in [55]). The entire system should be as uniform as possible and employ decentralized reasoning and control [19].

Recent progress in DL has greatly advanced the state of AI. It has demonstrated that even extremely complex mappings can be learned by propagating errors through multiple network layers. However, deep networks do not sufficiently address all the requirements stated above. The problems are in particular:

  1. Networks composed of unstructured layers of neurons may be too general; therefore, gradient-based methods have to "reinvent the wheel" from the data for each task, which is very data-inefficient. Furthermore, these gradient-based methods are susceptible to problems such as vanishing gradients when training very deep networks. These drawbacks are partially addressed by transfer learning [74] and specialized differentiable modules [43, 84, 85, 38].

  2. The inability to perform explaining-away efficiently, especially in feedforward networks. This starts to be partially addressed by [83, 60].

  3. Deep networks might form quite different internal representations than humans do. The question is whether (and if so: how?) DL systems form conceptual representations of input data or rather learn surface statistical regularities [47]. This could be one of the reasons why it is possible to do various kinds of adversarial attacks [93, 91] on these systems.

  4. The previous two points suggest that deep networks are not interpretable enough, which may be a hurdle to future progress in their development as well as pose various security risks.

  5. The inability to build a model of the world based on a suitable conceptual/localist representation [82, 4, 20] in an unsupervised way leads to a limited ability to reuse learned knowledge in other tasks. This occurs especially in model-based Reinforcement Learning which, for the purposes of this paper, is more desirable than emulating model-free RL [16] owing to its sample efficiency. Solving this problem in general can lead to systems which are capable of gradual (transfer/zero-shot [41, 30]) learning.

  6. Many learning algorithms require the data to be i.i.d., a requirement which is almost never satisfied in realistic environments. The learning algorithm should ideally exploit the temporal dependencies in the data. This has been partially addressed e.g. in [64, 11].

  7. One of the unsolved problems of AI lies in sub-symbolic/symbolic integration [9, 49]. Most successful architectures employ either purely symbolic or purely sub-symbolic representations. This naturally leads to the situation that sub-symbolic deep networks which operate on raw data are usually not designed with higher-level cognitive processing in mind (although there are some exceptions [15]).

Some of the mentioned problems are addressed in a promising “class” of cortex-inspired networks [13]. But these usually aim just for sensory processing [76, 72, 79, 69, 75], their ability to do sensory-motoric inference is limited [53], or they focus only on sub-parts of the whole model [34].

III Environment Description and Implications for the Learned Model

In order to create a reasonably efficient agent, it is necessary to encode as much knowledge about the environment as possible into its prior structure—without loss of universality over the class of desired problems. In other words, we are not aiming for an artificial intelligence which is universally general in all possible hypothetical universes (which might not even be possible [101]), but rather for an efficient and multi-purpose machine tailored to a chosen class of environments.

We consider realistic environments with core properties (such as space and time) following from physical laws. The purpose of this section is to describe the assumed properties of the environment and their implications for the properties of the world model. In the following, the process which determines the environment behavior will be called the Generator, while the model of this process learned by the agent will be called the (Learned) Model.

For simplicity, we first consider a passive agent which is unable to follow goals or interact with the environment using actions. In Section V, we extend both the Model and the Generator by considering actions and reinforcement signals as well. There are multiple properties which we desire of the Generator.

III-A Stationarity

The dynamics of the environment are generated by a stationary process (or a non-stationary one which changes slowly enough for the agent to adapt its learned model to the changes).

III-B Non-linearity, Continuity and Partial Observability

Real environments are typically continuous and partially observable. Their Generators can be modeled as general non-linear dynamical systems:

\dot{x}(t) = f(x(t), u(t)) + w(t),
y(t) = g(x(t), u(t)) + z(t),    (1)

where the state transition function f and the observation function g are nonlinear functions taking the state variable x and inputs u as parameters, and \dot{x} is the derivative of x. The function f changes the state variable, while the function g produces observations y which can be perceived by the agent. The terms w and z denote noise [26, 89]. This means that hidden states are not observed directly; rather, they have to be estimated indirectly from the observations.


III-C Non-determinism and Noise

Even though the internal evolution of realistic environments may be deterministic, they are often complex and typically have non-observable hidden states. An observation function for these environments will thereby impart incomplete information. Additionally, the sensors of the agent are imprecise, so there is inherent noise (the noise terms in Eq. 1) and the readings (even in a fully observable world) may be flawed. We can model this uncertainty (be it from faulty sensors or non-observability) by expressing the Generator as a stochastic process.
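Eq. 1, discretized in time, can be sketched in a few lines. The particular transition function, observation function, and noise scales below are illustrative assumptions chosen only to show the structure of such a Generator, not anything taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x, u):
    # Assumed state-transition function: a slightly damped rotation plus input.
    c, s = np.cos(0.1), np.sin(0.1)
    return 0.99 * np.array([[c, -s], [s, c]]) @ x + u

def g(x):
    # Assumed observation function: only the first state coordinate is visible,
    # so the full hidden state must be estimated indirectly from observations.
    return x[0]

def step(x, u, noise=0.01):
    x_next = f(x, u) + noise * rng.standard_normal(2)   # process noise (w)
    y = g(x_next) + noise * rng.standard_normal()       # observation noise (z)
    return x_next, y

x = np.array([1.0, 0.0])
observations = []
for _ in range(100):
    x, y = step(x, u=np.zeros(2))
    observations.append(y)
```

An agent only ever sees the `observations` list; the two-dimensional state trajectory remains hidden, which is exactly the partial-observability setting described above.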

III-D Hierarchical Structure and Spatial and Temporal Locality

It is reasonable to expect that the agent will interact with an environment that has many hidden state variables and very complex state-transition and observation functions (Eq. 1). Learning in this setting is not a tractable task in general. Therefore, we will include additional assumptions based on properties of the real world.

Figure 1: The Hierarchical Generator (left), which generates spatially and temporally localized observable patterns. The Learned Model in the agent (right) should ideally correspond to the structure of the Generator. The agent’s sensors and actuators are localized in the environment as in [18, 26]. Note that in many cases a single observation is a mix of effects of multiple sub-generators running in parallel.

We assume that the Generator has a predominantly hierarchical structure [25, 59], both in space and time; therefore, it can be modeled as a Hierarchical Dynamic Model (HDM) [26]. We expect that the observations generated by such a system are local both in space (one event influences mostly events which share similar spatial locations) and in time (subsequent observations share more information than distant ones), as described by the following power-law relations:

I(o_i, o_j) \propto d(o_i, o_j)^{-k},
I(o_t, o_{t+\Delta t}) \propto \Delta t^{-k},    (2)

where I(\cdot, \cdot) is a measure of mutual information between two variables, d(\cdot, \cdot) is a spatial distance function appropriate for the particular environment (e.g., Euclidean distance between pixels in an image), \Delta t is temporal distance, and k is a positive constant.

Note that both requirements are not strict and allow sporadic non-hierarchical interactions, i.e. interactions between small details in spatially/temporally distant events.

These relations reflect a common property of real-world systems: that they have structure on all scales [25, 58]. This property can serve as an inductive bias enabling the agent to learn models of environments much more efficiently by extracting information at all levels of abstraction. These assumptions also reveal an important property: the data perceived and the actions performed by the agent are highly non-i.i.d., which has to be taken into consideration when designing the agent.
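As a qualitative illustration of the temporal-locality assumption (not an experiment from this paper), the snippet below smooths white noise so that nearby samples share information, then checks that the correlation between samples, used here as a crude proxy for mutual information, decays with temporal distance. It demonstrates the qualitative decay, not an exact power law:

```python
import numpy as np

rng = np.random.default_rng(0)

# Smooth white noise with a moving average so that nearby samples share
# information (temporal locality); window width 50 is an arbitrary choice.
noise = rng.standard_normal(10_000)
kernel = np.ones(50) / 50.0
signal = np.convolve(noise, kernel, mode="valid")

def autocorr(x, lag):
    # Correlation between the signal and a lagged copy of itself.
    return np.corrcoef(x[:-lag], x[lag:])[0, 1]

# Shared information decays as the temporal distance between samples grows.
corr = [autocorr(signal, lag) for lag in (1, 10, 40)]
```

For an agent, this decay is what justifies modelling each localized patch of observations mostly on its own, as the Experts of Section IV do.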

Another important property of such a hierarchy is that at the lowest levels, most of the information (objects, or their properties) should be “place-coded” (e.g. by the fact that a sub-generator on a particular position is active/inactive), but as we ascend the hierarchy towards more abstract levels, the information should be more “rate-coded” in that we keep track of the state of particular sub-generators (e.g. their hidden states or outputs) through time [83]. This means that in higher levels, the representation should become more structured and local.

III-E Decentralization and High Parallelism

The spatial locality of the environment implies that on the bottom of the Generator hierarchy, each of the sub-generators influences a spatially localized part of the environment. In realistic environments it is usually true that multiple things happen at the same time. This implies that a single observation should be a mix of results of multiple sub-generators (relatively independent sub-processes/causes) running in parallel, similar to Layered HMMs [70].

IV Design Requirements on the Architecture

The assumptions about the Generator described in the previous section were derived from the physical properties of the real world. They serve as a set of constraints that can be taken into account when designing the architecture to model these realistic environments. Such constraints should make the learning tractable while retaining the universality of the agent.

The goal is to place emphasis on the big picture and high-level interactions within parts of the architecture while still providing some functional prototype. Therefore, individual parts of the presented architecture are as simple and as interpretable as possible. Many of the implemented abilities share the same mechanisms, which results in a universal yet relatively simple system.

The sensors of any agent situated in a realistic environment have a limited spatial and temporal resolution, so the agent is in reality observing a discrete sequence of observations, each drawn from an intractably large but finite vocabulary. Thus, it could be possible to approximate the Generator by a Hidden Markov Model (HMM) with enough states.

An approximation more suitable for the hierarchical structure of the Generator is the Hierarchical Hidden Markov Model (HHMM) [22]. It is a generalization of the HMM where each state is either a production state (a leaf node that emits an observation) or a hidden state which itself represents an HMM. The HHMM generates sequences by recursive activation of one of the states in the model (a vertical transition) until a production state is encountered. After this, control is handed back to the parent HMM, where a horizontal transition is made. The HHMM can be converted into an HMM by concatenating the observation-emitting states and recomputing the transition probabilities. Note that this is a relatively general approach which is similar to Linear Time-Invariant (LTI) dynamical systems [81]. However, the HHMM has two substantial limitations, namely the inability to efficiently reflect (rare) non-hierarchical relationships between subparts (two neighboring sub-processes cannot directly share any information about their states) and its serial nature.
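The recursive generation procedure of an HHMM can be sketched as follows. For readability this toy model is fully deterministic (each vertical transition activates a fixed first child, and each horizontal successor is fixed); a real HHMM draws both kinds of transition from learned probability tables:

```python
# A toy HHMM: each node is either a production state (emits a symbol) or an
# internal HMM whose children are reached by vertical transitions.  "trans"
# maps each child to its horizontal successor; None marks the end state.
HHMM = {
    "root": {"children": ["A", "B"], "trans": {"A": "B", "B": None}},
    "A":    {"children": ["a1", "a2"], "trans": {"a1": "a2", "a2": None}},
    "B":    {"children": ["b1"], "trans": {"b1": None}},
    "a1":   {"emit": "x"},
    "a2":   {"emit": "y"},
    "b1":   {"emit": "z"},
}

def sample(node, out):
    spec = HHMM[node]
    if "emit" in spec:                 # production state: emit an observation
        out.append(spec["emit"])
        return
    child = spec["children"][0]        # vertical transition into the sub-HMM
    while child is not None:
        sample(child, out)
        child = spec["trans"][child]   # horizontal transition in this level

seq = []
sample("root", seq)
# seq == ["x", "y", "z"]
```

Flattening this model into an ordinary HMM amounts to enumerating the production states ("x", "y", "z") and recomputing the transition probabilities between them, as described above.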

In order to efficiently address the fact that the Generator is parallel (and therefore, each observation can contain results of multiple sub-processes mixed together), the architecture has to be able to learn how to disentangle [40, 8] independent events from each other, and continue to do so on each level of the learned hierarchy.

We will show that the architecture presented in this paper overcomes both aforementioned limitations of HHMM and can efficiently approximate the Generator described in the previous sections. Namely, it can operate in continuous environments (similar to semi-HMMs [7]), but it can also automatically chunk the continuous input into semi-discrete pieces. It can process multiple concurrently independent sub-processes (an example of this is a multimodal sensor data fusion as in Layered HMMs [70]), and can handle non-linear dynamics of the environment. Finally, the architecture presented here can handle non-hierarchical interactions via top-down or lateral modulatory connections, which are often called the context [77, 35, 1, 13].

IV-A Hierarchical Partitioning and Consequences

Due to the fact that the interactions are largely constrained by space and time, the generating process can be seen as mostly decentralized, and it is reasonable to also create the Learned Model as a hierarchical decentralized system consisting of (almost) independent units, which we call Experts. In the first layer, each Expert has a spatially limited field of view—it receives sequences of local subparts of the observations from the Generator (see Fig. 1). The locality assumptions in Eq. 2 suggest that such a localized Expert should be able to model a substantial part of the information contained in its inputs without the need for information from distant parts of the hierarchy.

The outputs of Experts in one layer serve as observations for the Experts in subsequent layers, which have also only localized receptive fields but generally cover larger spatial areas, and their models span longer time scales. They try to capture the parts of the information not modelled by the lower layer Experts, in a generally more abstract and high-level form.

Each Expert models a part of the Generator observed through its receptive field using discrete states with linear and serial (as opposed to parallel) dynamics. In an ideal case, the Expert’s receptive field would correspond exactly to one of the local HMMs:

P(x_{t+1} = j \mid x_t = i) = A_{ij},    P(o_t = k \mid x_t = i) = B_{ik},    (3)

where A is a transition matrix and B is an observation emission matrix.

But in reality, one Expert can see observations from multiple neighboring Generator HMMs, might not see all of the observations, and does not know about the sporadic non-hierarchical connections; therefore, the optimal partitioning of the observations and the exact number of states for each Expert is not known a priori and in general cannot be determined. The architecture therefore starts as a universal hierarchical topology of Experts and adapts based on the particular data it is observing. Although all the parameters of the topology and the Experts could be made learnable from data (e.g. the number of Experts, their topology, the parameters of each Expert), we decided to fix some of them (e.g. the topology) or set them as hyperparameters (e.g. the parameters of each Expert). The current version of the architecture thus uses the following two assumptions:

  • The local receptive field of each Expert is defined a priori and fixed.

  • The number of hidden states of the model in each Expert is chosen a priori and fixed as well.

These assumptions (see Fig. 3) have the following implications:

  • An Expert might not perceive all the observations that are necessary to determine the underlying sub-process of the Generator responsible for the observations.

  • An Expert might not have sufficient resources (e.g. number of hidden states/sequences) to capture the underlying sub-process.

Figure 2: Example of the hierarchical structure of the world which fulfills the locality-in-space assumption and has a fixed number of hidden states. The hierarchy has two levels: the first layer contains 3 parallel Markov models, with one more on top in the second layer. States are labeled by their layer and index; the numbers on the edges are illustrative transition probabilities.
Figure 3: Approximation of one Markov model from the Generator shown in Fig. 2. Here, one part of the Generator (green box) is approximated by two Experts (yellow boxes). While both Experts have an insufficient number of states, the bottom one mitigates this problem by increasing the order of its Markov chain (the history length in Eq. (4)). The top Expert (which models the process with Markov order 1) shows that the original process cannot be learned well with an insufficient number of states: its red state corresponds to two distinct red states of the Generator, so given that state, the top Expert is unable to predict the probabilities of the next states correctly. In contrast, the bottom Expert models the process with Markov order 2, so the probabilities of the next states depend on both the current and the previous state (indicated by arrows across 3 states in the image). In this case, despite the same ambiguity, the bottom Expert can correctly predict the next state of the original process (for simplicity, transition probabilities are illustrative and not all are depicted).
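The effect illustrated in Fig. 3 can be reproduced with a toy sequence (the symbols below are illustrative, not taken from the figure): the successor of "C" is ambiguous for an order-1 model, but becomes deterministic once the model also conditions on the preceding state:

```python
from collections import Counter, defaultdict

# Toy process in which the successor of "C" depends on the state *before* it:
# after "A C" the next symbol is always "X", after "B C" always "Y".
seq = list("ACXBCY" * 100)

# Order-1 statistics: the successor distribution of "C" is a 50/50 mix.
order1 = Counter(seq[i + 1] for i in range(len(seq) - 1) if seq[i] == "C")

# Order-2 statistics: conditioning on two states removes the ambiguity.
order2 = defaultdict(Counter)
for i in range(len(seq) - 2):
    order2[(seq[i], seq[i + 1])][seq[i + 2]] += 1

assert order1["X"] == order1["Y"]          # order 1 cannot decide
assert set(order2[("A", "C")]) == {"X"}    # order 2 is deterministic
assert set(order2[("B", "C")]) == {"Y"}
```

This is exactly the trade-off the bottom Expert of Fig. 3 exploits: a longer history compensates for having too few hidden states.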

Note that even without the aforementioned assumptions, with the ideal structure and topology of the Experts, their models would not correspond exactly to the Generator until fully learned, which can be impossible to achieve due to limited time and limited information being conveyed via the observations. Therefore, the architecture has to be robust enough so that multiple independent sub-processes of the Generator can be modeled by one Expert, and conversely, multiple Experts might be needed to model one subprocess. Such Experts can then be linked via the context channel (see Appendix A-B). It is a topic of further research whether, and how much, fixing each parameter limits the expressivity and efficiency of the model.

So instead of modelling the input as one HMM as described in Eq. (3), each Expert tries to model the perceived sequences of observations using a predefined number of hidden states and some history of length h:

P(x_{t+1} \mid x_t, x_{t-1}, \ldots, x_{t-h+1}).    (4)

Additionally, we define an output projection function computing the output y_t of the Expert:

y_t = g(f(x_t), o_t),    (5)

where f and g are some general functions, x_t is the hidden state of the Expert at time t, and o_t is the vector of observations at time t. The output projection function provides a compressed representation of the Expert’s hidden state to its parents, which is then processed as their observations.

We expect that there will be many Experts with highly overlapping (or nearly identical) receptive fields on each layer, which is motivated by the following two points:

  • Typically there will be multiple independent processes generating every localized part of the observation vector. So it might be beneficial to model them independently in multiple Experts.

  • Since the Experts will learn in an unsupervised way, it is useful to have multiple alternative representations of the same observation in multiple Experts. This might even be necessary in practice, since there is no single representation that is good for all purposes. Experts in higher layers can then either pick a lower-level Expert with the right representation for them, or use the outputs of multiple Experts below as a distributed representation of the problem (which has a higher capacity than a localized one).


IV-B Resulting Requirements on the Expert

As discussed in the previous section, the local model in each Expert might need to violate the Markov property and will never exactly correspond to a Generator sub-process. Thus, the goal of the Expert is not to model the input observations perfectly by itself, but to process them so that its output data is more informative about the environment than its inputs, and the Experts following in the hierarchy can make their own models more precise.

In order to successfully stack layers of multiple Experts on top of each other, the output of an Expert has to use a suitable representation. This representation has to fulfill two seemingly contradictory requirements:

  • It preserves spatial similarity of the input (see e.g. the Similar Input Similar Code (SISC) requirement in [79] or Locality Sensitive Hashing (LSH) [46]). In this case, the architecture should be able to hierarchically process the spatial inputs even if there is no temporal structure that could be learned. (Note that in the case where the output of the Spatial Pooler is a one-hot vector, the spatial similarity can be preserved only at the level of multiple Experts, which together produce a locality-sensitive binary sparse code representing the input observation(s).)

  • It should disambiguate two identical inputs based on their temporal (or top-down/lateral) context. The amount of context information added into the output should be weighted by the certainty about this context.

In the current implementation, we address this by converting the continuous observations into a discrete hidden state (based on the spatial similarity), which is then converted again into a (more informative) continuous representation on the output where the continuity captures the information obtained from the context inputs. It does so by working in four steps:

  1. Separation (disentanglement) of observations produced by different causes (sub-generators). The Expert has to partition the input observations in a way that is suitable for further processing. Based on the assumption that values in each part of the observation space are a result of multiple sub-generators/causes (see Section III-E), the Expert should prefer to recognize parts of the input generated by only one source. This can be achieved, for example, via spatial pattern recognition (parts of the observation space which correlate with each other are probably generated by the same source) or by using multiple Experts looking at the same data (see Appendix A-G). Alternative ways to obtain well-disentangled representations of observations generated by independent sources are discussed in [40, 95, 90].

  2. Compression (abstraction, associative learning). Ideally, each Expert should be able to parse high-dimensional continuous observations into discrete and localist (i.e. semi-symbolic) representations that are suitable for further processing. This can be done by associating parts of the observation together, which itself is a form of abstraction, and by omitting unimportant details. It is performed based on suitable criteria (e.g. associating inputs from different sources that are frequently seen together) and under given resource constraints (e.g. a fixed number of discrete hidden states). This way, the Expert efficiently partitions continuous observations into a set of discrete states based on their similarity in the input space.

  3. Expansion (augmentation). Since the input observations (and consequently the hidden states) can be ambiguous, each Expert should be able to augment information about the observed state so that the output of the Expert is less ambiguous and consists of Markov chains of lower order. This can be resolved e.g. by adding temporal, top-down or lateral context [72].

  4. Encoding. The observed state augmented with the additional information has to be encoded. This encoding should be in a format which converts spatial and temporal similarity observed in the inputs, and similarity obtained from other sources (context), into a SISC/LSH space of outputs, thus enabling Experts higher in the hierarchy to do efficient separation and compression.
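The four steps above can be wired into a minimal Expert skeleton. Everything here is a simplified stand-in (fixed cluster centers instead of learned ones, a trivial pass-through for separation, an order-1 temporal context for expansion), so it should be read as a sketch of the data flow rather than the prototype's actual implementation:

```python
import numpy as np

class Expert:
    def __init__(self, centers):
        self.centers = np.asarray(centers)  # fixed cluster centers (step 2)
        self.prev_state = None

    def separate(self, observation):
        # Step 1: a no-op here; in the architecture, separation is handled by
        # multiple Experts looking at the same data (Appendix A-G).
        return observation

    def compress(self, observation):
        # Step 2: nearest cluster center becomes the discrete hidden state.
        d = np.linalg.norm(self.centers - observation, axis=1)
        return int(np.argmin(d))

    def expand(self, state):
        # Step 3: augment the state with context (here: the previous state).
        context = self.prev_state
        self.prev_state = state
        return state, context

    def encode(self, augmented, n_states):
        # Step 4: one-hot code for the state, with the context appended as a
        # scalar (-1.0 means "no context yet").
        code = np.zeros(n_states + 1)
        code[augmented[0]] = 1.0
        code[-1] = -1.0 if augmented[1] is None else float(augmented[1])
        return code

    def step(self, observation):
        x = self.separate(observation)
        s = self.compress(x)
        return self.encode(self.expand(s), len(self.centers))

expert = Expert(centers=[[0.0, 0.0], [1.0, 1.0]])
out = expert.step(np.array([0.9, 1.1]))   # observation near the second center
```

The output vector `out` is what a parent Expert would receive as its own observation; iterating this pipeline layer by layer is what gradually turns an ambiguous first-layer model into a better one.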

By iterating these four steps, a hierarchy of Experts is gradually converting a suboptimal or ambiguous model learned in the first layer of Experts into a model better corresponding to the true underlying HMM at a higher level. These mechanisms allow the architecture to partially compensate for the frequent inconsistencies between the hidden Generator and Learned Model topologies. Further improvements could potentially be based on distributed representations and forward/backward credit assignment (similar to [79, 66, 60]).

V Description of the Prototype Architecture

At a high level, the passive architecture consists of a hierarchy of Experts E_i^l (where E_i^l denotes the i-th Expert in the l-th layer), whose purpose is to match the hierarchical structure of the world (depicted in Fig. 2) as closely as possible, as described in Section IV-B.

Separation (step 1) is solved on the level of multiple Experts looking at the same data and is described in more detail in Appendix A-G. Unless specifically stated, this version of disentanglement is not used in the experiments described in Section VI.

Compression (step 2) is implemented by clustering input observations from a lower level (either another Expert or some sensory input) using k-means (in the prototype, we cluster the data via k-means for simplicity and better interpretability, but there are no restrictions on how the compression is performed in general). The Euclidean distance from an input observation to the known cluster centers is computed, and the winning cluster is then regarded as the hidden state (see Eq. 4). This part is called the “Spatial Pooler” (SP) (terminology borrowed from [36]).
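The winner computation of the Spatial Pooler can be sketched as follows. This is a minimal NumPy illustration; the names `sp_forward` and `cluster_centers` are ours, not identifiers from the prototype:

```python
import numpy as np

def sp_forward(observation, cluster_centers):
    """Spatial Pooler step: return the index of the nearest cluster
    center (the hidden state) and its one-hot encoding."""
    # Euclidean distance from the observation to every known center.
    distances = np.linalg.norm(cluster_centers - observation, axis=1)
    winner = int(np.argmin(distances))
    one_hot = np.zeros(len(cluster_centers))
    one_hot[winner] = 1.0
    return winner, one_hot

# Toy usage: 3 cluster centers in a 2-D input space.
centers = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]])
winner, code = sp_forward(np.array([0.9, 1.1]), centers)
# winner == 1: the observation is closest to the second center.
```

In the prototype the centers are additionally trained on-line; here they are fixed for clarity.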

The hidden state for the current time step is then passed to the next module called the “Temporal Pooler” (TP), which performs Expansion (step 3). It partitions the continuous stream of hidden states into sequences (as in Layered Hidden Markov Models), i.e. Markov chains of some small order, and publishes identifiers of the current sequences and their probabilities. It does so by maintaining a list of sequences and how often they occur. As it receives cluster representations from the SP (in the prototype, in the form of a 1-hot vector representing the index of the cluster), the TP learns to predict to which cluster the next input will belong. This prediction is calculated from how well the current history matches the known sequences, the frequency with which each sequence has occurred in the past, and any contextual information from other sources, such as neighboring Experts in the same layer, parent Experts in upper layers, or some external source from the environment.
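The sequence bookkeeping of the TP can be sketched as a frequency count over fixed-length sequences (a simplified stand-in: the class name is ours, and the real TP also weighs in context and sequence probabilities rather than raw counts):

```python
from collections import Counter

class TemporalPoolerSketch:
    """Minimal sketch of the Temporal Pooler: count fixed-length
    sequences of hidden states and predict the next state from how
    well the recent history matches the known sequences."""

    def __init__(self, seq_len=3):
        self.seq_len = seq_len
        self.counts = Counter()   # sequence -> observed frequency
        self.history = []

    def step(self, state):
        """Record a new hidden state and update sequence counts."""
        self.history.append(state)
        if len(self.history) >= self.seq_len:
            self.counts[tuple(self.history[-self.seq_len:])] += 1

    def predict_next(self):
        """Distribution over the next state given the recent history."""
        prefix = tuple(self.history[-(self.seq_len - 1):])
        matches = Counter()
        for seq, n in self.counts.items():
            if seq[:-1] == prefix:
                matches[seq[-1]] += n
        total = sum(matches.values())
        return {s: n / total for s, n in matches.items()} if total else {}

tp = TemporalPoolerSketch(seq_len=3)
for s in [0, 1, 2, 0, 1, 2, 0, 1]:
    tp.step(s)
# Having repeatedly seen (0, 1) followed by 2, the sketch predicts 2.
```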

Encoding (step 4) is implemented via Output Projection. The idea is to enrich the winning cluster (what the Expert has observed) with temporal context (past and predicted events). This way, the Expert is able to decrease the order of the Markov chain of recognized states. It is done by calculating the probability distribution over which sequences the TP is currently in, and subsequently, by calculating a distribution over the predicted clusters for the next input. This prediction is combined with the current and past cluster representations to create a projection over the probable past, present, and future states of the sequence. This projection is passed to the SPs of the next Experts in the hierarchy. See Fig. 4 for a diagram illustrating the dataflow.
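The Output Projection step can be sketched as a simple concatenation of the past, present, and predicted representations (the exact composition in the prototype may differ; the function name is ours):

```python
import numpy as np

def output_projection(past_onehot, current_onehot, predicted_dist):
    """Sketch of the Encoding step: combine the past and current
    cluster representations with the predicted next-cluster
    distribution into one output vector for the parent Expert."""
    return np.concatenate([past_onehot, current_onehot, predicted_dist])

past = np.array([1.0, 0.0, 0.0])        # previous winning cluster
current = np.array([0.0, 1.0, 0.0])     # current winning cluster
predicted = np.array([0.0, 0.2, 0.8])   # TP's next-cluster distribution
projection = output_projection(past, current, predicted)
# A 9-dimensional vector encoding past, present, and probable future.
```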

The TP runs only if the winning cluster in the SP changes, which results in an event-driven architecture. The SP thus serves as a quantization of the input: if the input does not change enough, the information will not be propagated further.
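This event-driven coupling can be sketched as a small gate between the two modules (the class name is ours; in the prototype the gating is part of the Expert's update loop):

```python
class EventGate:
    """Sketch of the event-driven coupling between SP and TP: the TP
    is stepped only when the SP's winning cluster changes."""

    def __init__(self):
        self.last_winner = None
        self.events = []

    def on_sp_output(self, winner):
        if winner != self.last_winner:
            self.events.append(winner)   # here the real TP would run
            self.last_winner = winner

gate = EventGate()
for w in [0, 0, 0, 1, 1, 2, 2, 2, 0]:
    gate.on_sp_output(w)
# Only the changes are propagated: events == [0, 1, 2, 0]
```

Repeated winners are absorbed, which is exactly the temporal compression discussed in the experiments below.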

The context is a one-way communication channel between the TP of an Expert and the TP(s) of the Expert(s) below it in the hierarchy. This context serves two purposes: First, as another source of information for a TP when determining in which sequence it is. And second, as a way for parent Experts to communicate their goals to their children (as the parents are not connected directly to the actuators, they have to express their desired high-level (abstract) actions as goals to their children, which then incorporate these goals into their own goals and propagate them lower; Experts at the lowest levels of the hierarchy are connected directly to the actuators and can influence the environment), as depicted in Fig. 4 and explained in Appendix A-D.

The context consists of three parts: 1) the output of the SP (i.e. the cluster representation), 2) the next cluster probabilities from the TP, and 3) the expected value of any rewards that the architecture will receive if, in the next step, the input falls into a particular cluster (interpreted as goals; in Fig. 4, the goals are shown separately from the context for clarity).

In order to influence the environment, the Expert first needs to choose an action to perform, which is the role of the active architecture. The goal is a vector of rewards that the parent expects the architecture will receive if the child can produce a projection which will cause the parent SP to produce a hidden state corresponding to the index of the goal value.

An Expert receiving a goal context computes the likelihood of the parent getting to each hidden state using its knowledge of where it presently is, which sequences will bring about the desired change in the parent, and how much it can influence its observation in the next step. It rescales the promised rewards using these factors, and adds knowledge about the rewards it can reach on its own. Then it calculates which of its hidden states lead to these combined rewards. From here, it publishes its own goal (the next step maximizing the expected reward), and, if it interacts directly with the environment, picks an action to follow (for bottom-level Experts, the action is part of the observation provided by the environment, so picking an action is equivalent to taking the cluster center of the desired state and sampling the actions from the remembered observation(s); see Appendix A-C for more details).
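The reward combination described above can be sketched as follows. The function name, the linear combination rule, and the toy numbers are our assumptions, not the prototype's exact computation:

```python
import numpy as np

def choose_goal(parent_rewards, reach_prob, own_rewards):
    """Sketch of goal-directed decision making in a child Expert:
    rescale the rewards promised by the parent by the estimated
    probability of actually reaching each state, add the Expert's
    own expected rewards, and pick the state with the highest
    combined value as the next-step goal."""
    combined = parent_rewards * reach_prob + own_rewards
    return int(np.argmax(combined)), combined

parent_rewards = np.array([0.0, 10.0, 2.0])  # promised by the parent
reach_prob = np.array([0.9, 0.1, 0.8])       # estimated influence
own_rewards = np.array([0.0, 0.0, 1.0])      # the Expert's own rewards
goal, value = choose_goal(parent_rewards, reach_prob, own_rewards)
# combined value = [0.0, 1.0, 2.6] -> the Expert targets state 2:
# a modest reward it can reliably reach beats a large, unlikely one.
```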

A much more detailed description of the architecture, its mechanics, and principles can be found in the Appendix.

Figure 4: A high-level description of the Expert and its inputs/outputs. The observations are converted by the Spatial Pooler into a one-hot vector representing cluster center probabilities. The Temporal Pooler computes probabilities of the known sequences, which are then projected to the output. The external context is received from top-down and lateral connections from other Experts. The corresponding goal vector is used to define a high-level description of the goal state. The Context output of the Expert typically conveys the current cluster center probabilities, while the Goal output represents a (potentially actively chosen) preference for the next state expressed as expected rewards. This can be interpreted as a goal in the lower levels or used directly by the environment (see Appendix A-D).

Vi Experiments

This relatively simple architecture combines a number of mechanisms. The general principles of the ToyArchitecture have broad applicability to many domains, as can be seen in the variety of experiments which can be devised for it, from making use of local receptive fields for each Expert to processing short natural language statements.

Rather than going through all of them, this section will instead show some selected experiments which focus on demonstrating and validating the functionality of the mechanisms described in this paper. The experiments were performed in either BrainSimulator or TorchSim; the source code of the ToyArchitecture implementation in TorchSim is freely available alongside TorchSim itself.

Vi-a Video Compression - Passive Model without Context

We demonstrate the performance of a single Expert by replicating an experiment from [54]. The input to the Expert is the video from the paper, composed of 434 frames.

The experiment demonstrates a basic setting, where the architecture just passively observes and the data has a linear structure with local dependencies. Therefore, a single Expert is able to learn a model of this data with only the passive model and without needing context, as detailed in Appendix A-A.

The Expert has 60 cluster centers and was allowed to learn 3000 sequences of length 3, where the lookbehind is the number of past steps the TP uses to calculate which sequence it is currently in, and the lookahead is the number of future steps the TP predicts. Both the SP and TP were learning in an on-line fashion. The video is played for 3000 simulation steps during training (going through the video almost 9 times). The cluster centers are initialized to random small values with zero mean.

Fig. 5 shows the course of the reconstruction error, the prediction error in the pixel space, and the prediction error in the hidden representation (in all cases, the error is computed as the sum of squared differences between the reconstruction/prediction and the ground truth).

Figure 5: Top graph: reconstruction and prediction errors during the course of on-line training of the Expert on the video. Bottom graph: cluster usage (moving window averaged), where each line represents the percentage of time each cluster is active. Both parts of the Expert (the Spatial Pooler and Temporal Pooler) learn on-line (internal training batches are sampled from the recent history). The reconstruction error (in the observation/pixel space: ’recoError’) decreases first, because the Spatial Pooler learns to cluster the video frames. This causes an overall decrease of the prediction error in the observation/pixel space: ’predPixel’. Note that around step 1700, the prediction error in the observation space decreases, despite the fact that the internal prediction error increases. This is because the changes in the SP representation degrade the sequences learned by the TP. Around step 2000, the learned spatial representation (clustering) is stable (cluster usage shows that all clusters have data) and therefore the inner representation of temporal dependencies starts to improve. Around step 3000, the Temporal Pooler predicts perfectly. The ’predError’ is measured as a prediction error in the hidden space (on the clusters).

First, the SP learns cluster centers so that it can produce a hidden state given a video frame at time t. In the beginning, only a small number of cluster centers are trained, therefore the winning cluster changes very sporadically (all of the data is chunked into just a small number of clusters). Since the TP runs only if the winning cluster of the SP changes, this results in a situation where the data for the TP changes very infrequently, which means that the TP learns very slowly. This can be seen around step 1000 in Fig. 5, where the reconstruction error converges. At this point, boosting (a mechanism which moves unused cluster centers towards the populated ones with the largest variation in their data points, see Appendix A-A) starts to have an effect, which results in all clusters being used.

(a) Sim. step 1000
(b) Sim. step 1500
(c) Sim. step 3000
Figure 6: Convergence of the Temporal Pooler’s transition probabilities. Each point is a cluster center. The Expert learns sequences of 3 consecutive cluster centers. At the beginning of the simulation, only several cluster centers are used, and the Temporal Pooler learns transitions between these. Finally, once all the cluster centers have been in use for some time, the Temporal Pooler converges and learns the linear structure of the video.

The larger the number of clusters in use, the more often the TP sees an event (a change of the winning cluster), and the more frequently it learns. In the last stage of the experiment, the prediction errors start to converge towards zero.

Figure 7: Resulting cluster centers after learning. Each of the 60 clusters corresponds to approximately 7 frames in the video. The cluster is active when the nearest 7 frames in the video are encountered. This results in spatial (but also temporal) compression. Note that one Expert is not designed to learn such a large input. Rather, multiple Experts with local receptive fields should typically process the input collaboratively.

This results in a trained Expert, which can recognize the current observation (compute its hidden state) and reconstruct it back. The learned spatial representation is shown in Fig. 7 and the convergence of the temporal structure is shown in Fig. 6.

As a result, given two consecutive hidden states (just the current state would be enough in this experiment, since the learned temporal structure has the Markov property), the Expert can predict the next hidden state and reconstruct it in the input space. This process can be seen in a supplementary video generated by the Expert. The first part of the video shows how the Expert can recognize and reconstruct the current observation and predict the next observation. The second part of the video (48 seconds in) shows the case where the prediction of the next frame is fed back to the input of the Expert. This shows that the Expert can be ‘prompted’ to play the video from memory. The spatio-temporal compression caused by the clustering and the event-driven nature of the TP results in a faster replay of the video, as only significant changes in the position of the bird are clustered differently and thus remembered as events by the TP.

Discussion: This experiment demonstrates the capability of on-line learning on linear data by one Expert using the passive model without context. The Expert first learns to chunk/compress the input data into discrete pieces (modelling the hidden space) and then to predict the next state in this hidden space. The prediction can then be fed back to the input of the Expert, which results in replaying meaningful input sequences (where time is distorted by the event-driven nature of the algorithm).
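The closed-loop replay described above can be sketched as follows; the `transitions` table is a hypothetical stand-in for the trained TP's next-state predictions:

```python
def replay(transitions, start, steps):
    """Sketch of closed-loop generation: feed the Expert's prediction
    of the next hidden state back as its next input, replaying a
    remembered sequence from memory."""
    state, out = start, [start]
    for _ in range(steps):
        state = transitions[state]   # prediction becomes the next input
        out.append(state)
    return out

# A linear hidden-state structure like the one learned from the video:
transitions = {0: 1, 1: 2, 2: 3, 3: 0}
sequence = replay(transitions, start=0, steps=5)
# -> [0, 1, 2, 3, 0, 1]
```

Because each hidden state stands for several input frames, replaying states in this way reproduces the input faster than real time, matching the sped-up replay observed in the video.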

Two important remarks can be made here.

  1. The reconstruction error converges to small values fast, but only a small fraction of clusters is used at that moment. After this first stage, all the clusters start to be used. This change improves the reconstruction error slightly and allows the Temporal Pooler to start learning. This is relevant to [87], where it is argued that the internal structure of the network changes, even if it might not be apparent from the output.

  2. The prediction error in the pixel space decreases before the prediction error in the hidden space starts to decrease. The reason for this is that even if the Temporal Pooler predicts a uniform distribution over the hidden states (i.e. the TP is not trained yet), all the cluster centers are moving closer towards the real data, and thus the average prediction improves no matter which cluster is predicted.

This experiment shows the performance of an Expert on video, but the same algorithm should process other modalities as well without any changes. It shows a trivial case, where the hidden space has just a linear structure (a Markov process). The following experiment extends this to a non-Markovian case, where the use of context is beneficial.

Vi-B Audio Compression – Passive Model with Context

This experiment demonstrates a simple layered interaction of three Experts connected in three layers, as depicted in Fig. 8. Its purpose is to demonstrate that top-down context provided by parent Experts helps improve the prediction accuracy at the lower levels.

The setup is the following: Expert E^1 in layer 1 processes the observations and computes outputs, and the parent Expert E^2 in layer 2 processes the output vector of E^1 as its own observation and produces the context vector. This context vector is used by E^1 to improve the learning of its own Temporal Pooler, as described in Appendix A-B. The same is done for the third layer. (Normally, E_i^l denotes the i-th Expert in the l-th layer; since this experiment uses just one Expert per layer, the subscript is omitted for clarity.)
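One way to picture how top-down context disambiguates a low-level prediction is as a product of distributions. The prototype integrates context inside the TP; this simple product rule and the function name are our stand-ins, not the exact mechanism:

```python
import numpy as np

def contextual_prediction(bottom_up, context):
    """Sketch of context-assisted prediction: combine the Expert's
    own next-state distribution with the parent's context
    distribution and renormalize."""
    combined = bottom_up * context
    return combined / combined.sum()

# The bottom Expert alone cannot tell two continuations apart...
ambiguous = np.array([0.5, 0.5, 0.0])
# ...but the parent, which sees a longer history, favors the second.
context = np.array([0.1, 0.8, 0.1])
resolved = contextual_prediction(ambiguous, context)
# resolved now strongly prefers state 1 (probability 8/9).
```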

The input data to the architecture is an audio file with a sequence of spoken numbers 1-9. The speech is converted by a Discrete Fourier Transform into the frequency domain with 2048 samples. Each time step, one sample with 2048 values is passed as an observation to E^1. The original audio file (with labels) is available on-line.

Figure 8: Setup of the experiment with context. E^1 receives the feature vectors on its input and receives the context vector from its parent E^2, which helps it to resolve uncertainty in the Temporal Pooler. The same holds for the higher level(s).

All the Experts in the hierarchy share the same number of available sequences (2000) and the same lookahead. The Expert which processes the raw observations has 80 cluster centers, its parent Expert has 50 cluster centers, and the most abstract Expert has 30 clusters; each layer has its own sequence length.

The results of a baseline experiment with just the bottom Expert are shown in Fig. 9. After training both the Spatial and Temporal Poolers, it can be seen that the sequences of hidden states are highly non-Markovian (Fig. 13(a)): the order of the Markov chains is higher than the supported maximum. After connecting the prediction to the Expert’s input as a new observation (in this simulation, the GreedyWTA function was applied to the prediction), the Expert is almost able to reconstruct two words, but gets stuck in a loop of these two, and after some time the prediction starts failing (the audio generated by one Expert without context is available on-line). The reason is that many sequences pass through several clusters which correspond to relative silence. In these states, the Expert does not have enough temporal context to determine in which direction to continue.

Figure 9: Convergence of the Spatial Pooler’s reconstruction error (recoErrorObs) and Temporal Pooler’s prediction error both in the observation (predErrorObs) and hidden (predError) space. The graph below shows cluster usage in time: one of the clusters is used much more often, probably representing a silent part. The prediction error converges to a relatively high value, since the Expert is unable to learn the model correctly by itself.

However, if we connect several Experts in multiple layers above each other, the parent Experts provide temporal context to the Experts below. Since the Experts higher in the hierarchy represent the process as a Markov chain of lower order (see Fig. 13), the context vector provided by them serves as extra information with which the low-level Expert(s) can learn to predict correctly. Due to the event-driven nature of each Expert, the hierarchy naturally starts to learn from the low level towards the higher ones. Once learned, the average prediction error at the bottom of the 3-layer hierarchy is lower compared to the baseline 1-Expert setup.

After connecting the bottom Expert in a closed loop, as in the previous experiment, the entire hierarchy is able to replay the audio correctly. The resulting audio can be found on-line, and the representation is shown in Fig. 13.

Figure 10: Error and cluster usage charts tracking the convergence of the bottom Expert in the presence of the context signal. Compare with the baseline in Fig. 9 which does not use the context. See Fig. 9 for a description of the plotted lines. Since the processing of the SP is not influenced by the context, the SP works identically as in the baseline case (e.g. the cluster usage and SP outputs are the same in both experiments).
Figure 11: Error and cluster usage charts tracking the convergence of . See Fig. 9 for a description of the plotted lines.
Figure 12: Error and cluster usage charts tracking the convergence of . See Fig. 9 for a description of the plotted lines.

Figures 10, 11, and 12 show the convergence of the Spatial and Temporal Poolers for each Expert in the hierarchical setting. The Spatial Pooler in the bottom layer behaves identically to the baseline in Fig. 9, but here, the Temporal Pooler can use the top-down context to decrease its prediction errors significantly. The cluster usage graphs show the effect of increasingly abstract representations. In layers 2 and 3, there is no explicit cluster for silence as in the first layer, because those silences cannot be used to predict the next number, and so are disregarded.

Note that due to the event-driven processing in the Experts, the architecture implements adaptive compression in the spatial and temporal domains on all levels. This is exhibited as either the speeding up of the video in the preceding experiment, or the speeding up of the resulting generated audio in this experiment.

(a) Transitions in E^1
(b) Transitions in E^2
(c) Transitions in E^3
Figure 13: Resulting learned sequences in all the Experts. It can be seen how the output projections help to adaptively compress predictable parts of the input. The higher in the hierarchy, the lower the order of the Markov chain the Experts process. At the top of the hierarchy, the order is 1 and the sequence of hidden states has a linear structure.

Discussion: The experiment has shown how the context can be used to extend the ability of a single Expert to learn longer-term dependencies. It has also shown that the hierarchy works as expected: higher layers form representations that are more compressed and have lower orders of Markov chains. The activity of higher layers can provide useful top-down context to lower layers, and these lower layers can leverage it to decrease their own prediction error.

Vi-C Learning Disentangled Representations

This experiment illustrates the ability of the architecture to learn disentangled representations of the input space. In other words, this is the ability to recover hidden independent generative factors of the observations in an unsupervised way. Such an ability may be vital for learning grounded symbolic representations of the environment [40, 8]. In the prototype implementation, the ability to disentangle the generative factors is implemented via a predictive-coding-inspired mechanism (described in Appendix A-G), and is limited only to the input being created by an additive combination of the factors.

The experiment shows how a group of two Experts can automatically decompose the visual input into independently generated parts, and naturally learn about each of them separately, without any domain-specific modifications.

The input is a sequence of observations of a simple gray-scale version of the game Pong (shown in the top left of Fig. 14). The ball moves on realistic trajectories and the paddle is moved by an external mechanism so that it collides with the ball some of the time.

Figure 14: Top left: the current visual input (pong, with ball and paddle). The image shows the representation of two objects learned in an unsupervised way by two Experts competing for the same visual input. The learning automatically decomposes the observations into two independent parts. The independent parts in this case correspond to the paddle (left) and the ball (right). By representing each object in a separate Expert, each is able to learn the simple temporal structures governing the behavior of its object independently of the other, leaving the learning of structures resulting from the interaction of the objects to higher and more abstract layers. From the representation it can be easily seen that the paddle moves just in one axis (linear structure discovered by the TP), while the ball moves through the entire 2D space (grid). The current position of the ball and the paddle are shown in yellow, each cluster center is overlaid with the visual input it represents.

The experiment shows how a simple competition of two Experts for the visual input can lead to the unsupervised decomposition of observations into independent parts. Here, there are two mostly independent parts in the input, therefore the Spatial Pooler of one Expert represents one part (the paddle) and the other Expert represents the other part (the ball). The resulting representations are shown in Fig. 14. The rest of the architecture works without any modification, therefore each of the Temporal Poolers learns the behavior of just a single object (an illustrative video of the inference is available on-line). Representing the states of each of the objects independently is much more efficient than representing each state of the whole scene at once.
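The competition for the input can be sketched as a one-shot, predictive-coding-style hand-off, a simplified stand-in for the mechanism in Appendix A-G (the function name and the toy scene are ours):

```python
import numpy as np

def competing_reconstruction(observation, centers_a, centers_b):
    """Sketch of two Experts competing for one visual input:
    Expert B clusters only the residual of the observation that
    Expert A failed to explain with its reconstruction."""
    def nearest(x, centers):
        return centers[np.argmin(np.linalg.norm(centers - x, axis=1))]

    recon_a = nearest(observation, centers_a)
    residual = observation - recon_a     # the part A cannot explain
    recon_b = nearest(residual, centers_b)
    return recon_a, recon_b

# Toy scene: the observation is an additive mix of a "paddle" part
# (first two pixels) and a "ball" part (last two pixels).
obs = np.array([1.0, 0.0, 0.0, 1.0])
paddle_centers = np.array([[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]])
ball_centers = np.array([[0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.0, 1.0]])
recon_paddle, recon_ball = competing_reconstruction(obs, paddle_centers, ball_centers)
# Together the two reconstructions explain the full observation.
```

Note the restriction stated above: this only works when the observation is an additive combination of the factors.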

Discussion: Although this simple mechanism is not as powerful as DL-based approaches [40], it is interpretable and considerably simpler. It was experimentally tested that such a configuration is able to disentangle up to roughly 6 independent sources of input. In case the number of latent generative factors of the environment is smaller than the number of competing Experts, the group of Experts forms a sparse distributed representation of the input. It is a topic for further research whether applying this simple mechanism on each layer of the hierarchy (as with each mechanism in the ToyArchitecture, we expect the workload to be distributed among all Experts, closely interacting with other mechanisms, and performed using simple algorithms, rather than being localized in one part of the architecture and solving the problem all at once) could overcome its limitations and achieve results comparable to deep neural networks.

Vi-D Simple Demo of Actions

The purpose of this experiment is to show the interplay of most of the mechanisms in the architecture. A small hierarchy of two Experts has to learn the correct representation of the environment on two levels of abstraction, then use this representation to explore, discover a source of reward, learn how its own actions influence the environment, and finally collect the rewards. The passive model works identically to the previous experiments; in addition, the active parts of the model are enabled. Moreover, all the active parts of the model should be backwards compatible, which means that this configuration of the network should also work on the previous experiments, even though no actions are available there.

This experiment uses a hierarchy of two Experts to find and continuously exploit reward in a simple gridworld. Each time the agent obtains the reward, its position is reset to a random position where there is no reward. The reward location is fixed, but visually indicated. The agent must therefore explore tiles to find it and remember the position. Fig. 15 pictures the initial state of the world.

Figure 15: The initial state of the world, with the agent represented as the green circle and the reward tile highlighted by the authors.

The agent itself consists of two Experts connected in a narrow hierarchy similar to the one depicted in Fig. 8. E^1 has 44 cluster centers, a sequence length of 5, and a lookahead of 3, while E^2 has 5 clusters, with 7 and 5 for sequence length and lookahead respectively. As stated in Appendices A-C and A-D, the agent sees the action on its input (the one-pixel-tall 1-hot vector in the bottom left of Fig. 15), and all levels receive a reward (100, in this case) when the agent steps onto the reward tile.

With a lookahead of 3, E^1 can ‘see’ the reward only 2 actions into the future (the reward is given when the agent is reset, so it is effectively delayed by 1 step). Expert E^2 meanwhile clusters sequences from E^1, so that it has a longer ‘horizon’ over which to see. Expert E^2 therefore has to guide E^1 to the vicinity of the reward tile by means of the context and goal vectors.

The results of 10 independent runs, measured by the average reward per step, are presented in Fig. 16. As one would expect, the average reward increases as time goes on, indicating that the agent has learned where the reward is and is actively following its learned path to that reward.

Figure 16: The average (minimum and maximum) collected reward per step across 10 runs. Learning and exploration were disabled after step 250,000.
Figure 17: An interpretation of the clusters of E^2, projected through the clusters of E^1 into the input space. Expert E^2 clusters spatial and temporal information from E^1, so its clusters represent a superposition of states of E^1.

A particularly good example of clustering is in Fig. 17. It shows that E^2 had created clusters in which temporally contiguous projections from E^1 are spatially clustered together, so that if we were to overlap these 5 images, there would be a contiguous ‘line’ of agent positions from anywhere in the environment to right beside the reward tile.

Discussion: This experiment demonstrates that the hierarchical exploration and goal-directed mechanisms are functional and, when trained appropriately, allow an Expert hierarchy to find rewards and follow goals. However, when the clustering is done poorly (as was the case for at least one run of the experiment), the model encounters a lot of difficulty. Since the model is constantly learning, the cluster centers might settle in a global (or local) optimum or continuously drift in time. Therefore, incentivising a ‘good’ clustering without domain-specific knowledge is currently an open question and will be discussed further in Section VII.

Vi-E Actions in Continuous Environments

The current design of the architecture supports not only discrete environments: it was also tested in continuous environments with continuous actions. The last experiment serves as a simple illustration of this and is similar to another experiment by the authors of [54] (a video of the original experiment is available on-line).

Figure 18: An example of first-person visual input to the Expert. Right: current visual input; top left: reconstruction of the current cluster (the part which corresponds to the visual input); bottom left: reconstruction of selected next cluster center (the part which corresponds to the visual input, the other part is taken as an action to be executed). The Expert is predicting that it will turn left in the next step, and therefore the track will correspondingly be more in the center of the visual field.
Figure 19: An example of cluster centers learned in the Expert after puppet-learning (showing the ground-truth action). The task can be solved fairly well by a reactive agent (a stimulus-response policy). As a consequence of this, each cluster center represents some visual input and its corresponding learned action. Training in an RL setting, where the reward is given for staying on the road, leads to very similar cluster centers.

The environment is a simple first-person view of a race track. The goal is to stay on the road and therefore to drive as fast as possible.

The topology is composed of just one Expert, which receives a visual image and a continuous action (the top bit is forward, and below it there are barely visible slight turning actions) stacked together.

Discussion: The single Expert was able to learn to drive on the road in a so-called puppet-learning setting, where the correct (optimal) actions are shown (a human drove through the track manually several times). But it was also able to learn the correct behavior in an RL setting, where just the visual input and a reward signal (for staying on the road) were provided. Despite the fact that the learned representation is simple and seems to be on the edge of memorization, the agent was able to generalize well and could also navigate previously unseen tracks (with the same colors). An example of the agent autonomously navigating the race track is available on-line.

These five experiments suggest that hierarchical extraction of spatial and temporal patterns is a relatively domain-independent inductive bias that can create useful models of the world in an unsupervised manner, forming a basis for sample efficient supervised learning. The same basic architecture has been tested on a variety of tasks, exhibiting non-trivial behaviour without requiring domain specific information, nor huge volumes of data on which to train.

Vii Discussion and Conclusions

This paper has suggested a path for the development of general-purpose learning algorithms through their interpretability. First, several assumptions about the environments were made; then, based on these assumptions, a decentralized architecture was proposed, and a prototype was implemented and tested. This architecture attempts to solve many problems using several simple and interpretable mechanisms working in conjunction. The focus was not on performance on a particular task, but rather on generality and the potential to provide a platform for sustainable further development.

We presented one relatively simple and homogeneous system which is able to model and interact with the environment. It does this using the following mechanisms:

  • extraction of spatio-temporal patterns in an unsupervised way,

  • formation of increasingly more abstract and more informative representations,

  • improvement of predictions on the lower levels by means of the context provided by these abstract representations,

  • learning of simple disentangled representations,

  • production of actions and exploration of the environment in a decentralized fashion,

  • and hierarchical, decentralized goal-directed decision making in general.

VII-A Similar Architectures

There are many architectures/algorithms which share some aspects with the work presented here. The similarities can be found in the focus on unsupervised learning, hierarchical representations, recurrence in all layers, and the distributed nature of inference.

The original inspiration for this work was the PhD thesis “How the Brain Might Work” [18]. The hierarchical processing with feedback loops in ToyArchitecture is similar to CortexNet [13], a class of networks inspired by the human cortex. There are also many architectures that are more or less inspired by predictive coding [6, 90], but they focus on passively learning from the data.

Many of these architectures are implemented in ANNs, using the most common neuron model. They are often similar in their hierarchical nature, such as the Predictive Vision Model [78]; a hierarchy of auto-encoders predicting the next input from the current input and top-down/lateral context. More recently, the Neurally-Inspired Hierarchical Prediction Network [75] uses convolution and LSTMs connected in a predictive coding setting. Several publications try to gate the LSTM cells in a manner inspired by cortical micro-circuits [73].

There are more networks that are loosely inspired by these concepts. The main idea is usually to have some objective in all layers, enabling the network to produce intermediate gradients, which improves convergence and robustness. Examples of these are Ladder Networks [76] and the Depth-gated LSTM [102].

There are also networks that use their own custom model of neurons. These include the Hierarchical Temporal Memory (HTM) [34], the Feynman Machine [53] or Sparsey [79].

A model inspired by similar principles, the Recursive Cortical Network (RCN) [60], was also able to solve CAPTCHAs. It works on visual inputs that are manually factorised into shape and texture. Compared to the other architectures mentioned here, it is based on probabilistic inference and is therefore closer to the hypothesis that the brain implements a form of Bayesian inference.

There are fewer architectures that also focus on learning actions. An example of a system implemented using deep learning techniques is the Predictive Coding-based Deep Dynamic Neural Network for Visuomotor Learning [45]. It learns to associate visual and proprioceptive spatio-temporal patterns, and is then able to repeat the motor pattern given the visual input. The Feynman Machine was also shown to learn and later execute policies taught via demonstration [54]. Despite the fact that both of these architectures are able to learn and execute sequences of actions, neither of them currently supports autonomous active learning: in contrast to the ToyArchitecture, the mechanisms for exploration and learning from rewards are missing. An architecture emphasizing the role of actions and active learning in shaping the representations is [37]. Similarly to the ToyArchitecture, actions there are part of the concept representation and not just the output of the architecture.

A more loosely bio-inspired architecture is World Models [30]. It combines a VAE for spatial compression of the visual scene, RNNs for modeling the transitions between world states, and a part which learns policies. Compared to the ToyArchitecture, this architecture has only a single layer (just one latent representation) and learns its policies using an evolutionary approach. Here, the interesting aspect is that after learning the model of the environment, the architecture does not need the original environment to improve itself. It instead ‘dreams’ new environments on which to refine its policies.

Another deep learning approach focused on a universal agent in a partially observable environment is the MERLIN architecture [97]. Based on predictive modelling, it tries to learn how to store and retrieve representations in an unsupervised manner, which are then used in RL tasks. Unlike the ToyArchitecture, it is a flat system where the memory is stored in one place rather than in a distributed manner.

VII-B Limitations and Future Work

Despite promising initial results, the theory is far from complete and there are many challenges ahead. The performance of the model is partially sacrificed for interpretability, and in the current (purely unsupervised or semi-supervised) setup it is far behind its DL-based counterparts. The current biggest practical limitation of the model seems to be that the Experts do not have efficient mechanisms for making the representations in other Experts more suitable for their own purposes (i.e. a mechanism which implements credit assignment through multiple layers of the hierarchy). There are some potentially promising ways to improve this (based on an alternative basis [79], a DL framework [75], or a probabilistic one [60]).

Another way to scale up the architecture would be to use multiple Experts with small, overlapping receptive fields (as discussed in Section IV-A), ideally in combination with a mechanism efficiently distributing the representations among them (see Appendix A-G). Our preliminary results (not presented in this paper) show that such redundant representations can not only increase the capacity of the architecture [42], but also provide a population for evolutionary based algorithms of credit assignment.

During development, empirical evidence suggested that a better form of lateral coordination (lateral context between Experts) is missing in the model, especially in the case of wide hierarchies with multiple experts on each layer processing information from local receptive fields. Examples of this can be seen in [72] and [60].

Some mechanisms to obtain a grounded symbolic representation of the environment were tested in the form of disentanglement. It is not clear now whether these mechanisms would be scalable all the way towards very abstract conceptual representations of the world, or if there is something missing in the current design which would support abstract reasoning.

One of the big challenges in designing complex adaptive systems is life-long or gradual learning, i.e. the ability to accumulate new non-i.i.d. knowledge in an increasingly efficient way [80]. The system has to be able to integrate new knowledge into its current knowledge base without disrupting it too much. It should also be able to use the current knowledge base to improve the efficiency of gathering new experiences. So although some of these topics are partially covered by the architecture (a decentralized system, natural reuse of sub-systems in the hierarchy, the event-driven nature of the computation mitigating forgetting), there are still many open questions that need to be addressed.


Appendix A A Detailed Description of the Architecture

This appendix describes the various mechanisms of the ToyArchitecture Experts and how hierarchies of them interact. We will first focus on describing the passive Expert which does not actively influence its environment, and is without context. Then, we will show how it can be extended with context (A-B) and actions (A-C). Afterwards, we will extend the definition of the context to allow experts in higher levels to send goals to the experts in lower levels (A-D). We will define the exploration mechanisms (A-F), and describe how a Reinforcement Learning (RL) signal can interface with the architecture so that it can learn from its actions. Together, these mechanisms implement distributed hierarchical goal-directed behavior.

  • Length of the past
  • Length of the lookbehind part (past + current step)
  • Length of the future
  • Length of the whole sequence
  • Index of a layer
  • Index of an Expert in the layer
  • Set of cluster centers (states) of an Expert
  • Dimension of the set of cluster centers, i.e. the number of cluster centers
  • Sequence of complete observations
  • Sequence of observations of an Expert in a layer
  • Hidden state of the Expert in a layer
  • Output vector of the Expert in a layer
  • Number of sequences considered in a TP
  • Set of all providers of context to an Expert
  • Likelihoods of seeing each context element from each provider in each position of each sequence

Table I: Selected notation.

During inference, the task of an Expert in a given layer is to convert a sequence of observations perceived in its own receptive field into a sequence of output values. For simplicity, when discussing a single Expert, we will omit the layer and Expert indices from the notation of the hidden states, observations, outputs, etc.

A-A The Passive Model without Context

As discussed in Section IV-B, the process is split into the Spatial Pooler, the Temporal Pooler (terminology adopted from [34]), and the Output Projection, which can be expressed by the following three equations:


where the parameters of these functions are learned by the model.

A-A1 Spatial Pooler

The non-linear observation function from Eq. 5 is implemented by k-means clustering (chosen for simplicity and interpretability, as described in Section I) and produces a one-hot vector over the hidden states:


where the distance between the observation and each of the Expert's learned cluster centers is Euclidean, and a winner-takes-all (WTA) function returns a one-hot representation of the winning (closest) center. The observation function considers only the current observation and covers step number two (compression) as described in Section IV-B. Separation is performed at the level of multiple Experts and is described in Appendix A-G.
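The Spatial Pooler's inference step can be sketched as a nearest-cluster lookup followed by a winner-takes-all projection. The following is a minimal illustrative sketch; the function names and the plain-list vector representation are our own, not the reference implementation:

```python
import math

def winner_take_all(observation, cluster_centers):
    """Return a one-hot vector over hidden states: 1.0 at the cluster
    center closest to the observation (Euclidean distance), 0.0 elsewhere."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    winner = min(range(len(cluster_centers)),
                 key=lambda k: dist(observation, cluster_centers[k]))
    return [1.0 if k == winner else 0.0 for k in range(len(cluster_centers))]

# Example: 3 cluster centers in 2-D; the observation is closest to the second.
centers = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]
print(winner_take_all([0.9, 1.2], centers))  # -> [0.0, 1.0, 0.0]
```

The hard WTA output is what makes the hidden state interpretable as a symbol (the index of the winning cluster) while remaining a vector for downstream sub-symbolic processing.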

Because we are learning from a stream of data, it might happen that some cluster centers in the Spatial Pooler do not receive any data points and thus would never be adapted. There can be two underlying reasons for this: 1) the cluster center was initialized far from any meaningful input, or 2) the agent has not seen some types of inputs for a long period (e.g. it stays inside a building for some time and does not see any trees). In situation 1, we would like to move the cluster center to an area where it would be more useful, but in situation 2, we typically want to keep the cluster center at its current position in order not to forget what was learned, so that it can be useful again in the future. We solve this dilemma by implementing a boosting algorithm similar to [36]. We define a hyper-parameter (the boosting threshold), and every cluster center which has not received any data for that many steps starts to be boosted: it is moved towards the cluster center with the highest variance among its data points. Using this parameter, we can adjust the trade-off between adapting to new knowledge and not forgetting old knowledge.
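The boosting rule above can be sketched as follows. The parameter names, the variance bookkeeping, and the linear interpolation rate are illustrative assumptions; the text only specifies the threshold and the direction of the move:

```python
def boost_dead_centers(centers, steps_without_data, variances,
                       boost_threshold, rate=0.1):
    """Move every cluster center that has been starved of data for more than
    `boost_threshold` steps a small step towards the center whose data points
    have the highest variance. `rate` controls the step size (assumption)."""
    # Index of the center with the highest within-cluster variance.
    target = max(range(len(centers)), key=lambda k: variances[k])
    boosted = []
    for k, c in enumerate(centers):
        if steps_without_data[k] > boost_threshold:
            # Linear interpolation towards the high-variance center.
            c = [x + rate * (t - x) for x, t in zip(c, centers[target])]
        boosted.append(c)
    return boosted
```

A larger `boost_threshold` favours remembering old knowledge (case 2); a smaller one favours rapid reuse of dead centers (case 1).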

A-A2 Temporal Pooler

The goal of the Temporal Pooler is to take into account a past sequence of hidden states and to predict a sequence of future states. Since the sequence of observations might not have the Markovian property, and it might have been further compromised by the Spatial Pooler, the problem is not solvable in general. We therefore limit the learning and inference in one Expert to Markov chains of low order and learn the corresponding transition probabilities, which we express in the form of sequences. Each sequence can be divided into three parts of fixed size: a history, the current state, and a lookahead part. We call the history together with the current step the lookbehind. See the bottom of Fig. 21 for an illustration.

The theoretical number of possible sequences of hidden states grows very quickly with the number of states and the required order of the Markov chains. But in practice, the observed sub-generator usually generates only a very small subset of these sequences. Using a reasonable number of states and length of sequences, it is possible to learn the transition model by storing all encountered sequences in each Expert and computing their prior probabilities based on how often they were encountered. The probability of each sequence is then computed by normalizing, over all sequences, the product of two terms: the prior probability of the sequence (i.e. how often it was observed relative to other sequences), and the match of the beginning of the sequence with the recent history of states. The match is computed using an indicator function which produces a value close to 1 if the hidden state corresponds to the cluster at the given position in the sequence, and a small constant otherwise (this ensures that each sequence has a nonzero probability, and that sequences corresponding at least partially to the data have higher probabilities than those which do not correspond at all). A match-length parameter defines the fixed length of the required match, so the sequence probabilities are computed based on a fixed number of initial clusters in each sequence; the remaining positions can then be used for predicting several steps into the future. The result is a probability distribution over all sequences at the current time step.


These are the main principles behind learning and inference of the Temporal Pooler.
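The sequence scoring just described can be sketched as follows, with states represented as labels and the indicator implemented as the 1-vs-small-constant rule from the text. The constant's value and all names are illustrative assumptions:

```python
EPSILON = 1e-2  # small constant: partial matches keep a nonzero probability

def sequence_probabilities(history, sequences, priors, match_length):
    """Score each remembered sequence by its prior probability and by how
    well its first `match_length` states agree with the recent state
    history, then normalise the scores into a distribution."""
    scores = []
    for seq, prior in zip(sequences, priors):
        match = 1.0
        for pos in range(match_length):
            # Indicator: 1 on agreement, a small constant otherwise.
            match *= 1.0 if seq[pos] == history[pos] else EPSILON
        scores.append(prior * match)
    total = sum(scores)
    return [s / total for s in scores]

# Two sequences fully match the history 'A', 'B'; one does not.
probs = sequence_probabilities(
    history=['A', 'B'],
    sequences=[['A', 'B', 'C'], ['A', 'B', 'D'], ['B', 'A', 'C']],
    priors=[0.5, 0.25, 0.25],
    match_length=2)
```

The positions of the matched sequences beyond `match_length` then serve as the lookahead predictions.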

A-A3 Output Projection

Finally, the Expert has to apply the output function described in Eq. 7. In each time step, the output function takes the current sequence probabilities and produces the output of the Expert:


When defining the output function, the following facts need to be taken into account: the outputs of Experts in one layer are processed by the Spatial Poolers of Experts in the layer above, where the observations are clustered based on some distance metric. There are two extreme situations:

  • In the case where the sequences of states of the child Experts are not predictable, the parent Experts should form their clusters mostly based on the spatial similarities of the hidden states of the child Experts. (The one-hot output of a single Spatial Pooler does not fulfill this requirement by itself, but spatial similarity is preserved over the outputs of multiple Experts which receive similar inputs, i.e. over a distributed representation. Since the parent Experts receive the outputs of multiple Experts from the layer below, they perceive a code which preserves spatial similarity.) This way, the details of the unpredictable processes are preserved as much as possible and passed into higher layers of abstraction where these uncertainties can be resolved.

  • On the other hand, in the case where the state sequences are perfectly predictable, the spatial properties of the observations are relatively less important than their behavior in time, and the clustering in the parent layer should be performed based on the similarities between sequences (i.e. temporal similarity).

Based on these properties, the output function should be defined so that the resulting hierarchy implements implicit efficient data-driven allocation of resources. The parts of the process that are easily predictable by separate Experts low in the hierarchy will be compressed early. The unpredictable parts of the process will propagate higher into the hierarchy where the Experts try to predict them with more abstract spatial and temporal contexts. This is a compromise between sending what the architecture knows well vs just sending residual errors [90].

In the current version, we use the following output projection function. The output dimension is fixed to the number of hidden states in the Expert. The output function is defined as follows:


where the indicator function is the one from Eq. 11, the WTA function is from Eq. 8, the normalization function is from Eq. 10, and the probability of currently being in a given sequence is from Eq. 11. This definition of the output function has the following properties:

  • In the case that the observation sequence is not predictable, the predictions from sequences with high probability will have high dispersion over future clusters. Therefore the positions corresponding to the current hidden state and the recent history will be dominant in the output vector, and the parent Expert(s) will tend to cluster these outputs mostly based on the recent history, as opposed to the predictions.

  • In the case that the observation sequence is perfectly predictable, only one sequence will have high probability in each time step, so both the past and the predicted states will have high probability. Therefore the parent Expert(s) will tend to cluster based on the predicted future more than in the previous case. The sequence of observations perceived by the parent will be more linear (similar to a sliding window over the recent history and future), so it will be possible to chunk the observations more efficiently. More importantly, the output of these parent Experts will correspond more to the future (since the lower-level Experts are predicting better). As a result, the higher levels in the hierarchy should compute with data which correspond to the increasingly more distant future. This way the hierarchy does not think about what happened, but rather about what is going to happen.

This means that the temporal resolution in higher layers is determined automatically based on the predictability of the observations in the current layer, and this resolution can dynamically change in time. Since the clustering applies a strict winner-takes-all (WTA) function, and the Temporal Pooler does not accept repeating inputs, the entire mechanism naturally results in a completely event-driven architecture.
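The two cases above can be illustrated by a simplified output projection: the lookbehind states contribute with full weight, while each predicted future state receives the summed probability of the sequences that predict it, so uncertain futures disperse and the history dominates. This is a sketch of the behaviour described in the text, not the exact output function; the dictionary representation and weights are illustrative:

```python
def output_projection(history, sequence_probs, sequences, lookahead):
    """Project sequence probabilities back onto states: lookbehind states get
    weight 1.0; each predicted state accumulates the probability mass of the
    sequences predicting it."""
    out = {}
    for state in history:                       # lookbehind part: full weight
        out[state] = 1.0
    for p, seq in zip(sequence_probs, sequences):
        for state in seq[len(history):len(history) + lookahead]:
            out[state] = out.get(state, 0.0) + p
    return out

# One near-certain sequence: the predicted state 'C' stands out in the output.
out = output_projection(history=['A', 'B'],
                        sequence_probs=[0.9, 0.1],
                        sequences=[['A', 'B', 'C'], ['A', 'B', 'D']],
                        lookahead=1)
```

With `sequence_probs=[0.5, 0.5]` the future mass would split evenly between 'C' and 'D', leaving the history positions dominant, which is exactly the first bullet's behaviour.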

A-B The Passive Model with External Context

Until now, the goal of each Expert has been to learn a model of its part of the environment based solely upon its own observations. This can lead to highly suboptimal results in practice. Often it is necessary to use some longer-term spatial or temporal dependencies, as described in Section IV-B. A context input (see Fig. 4) is used to provide this information.

The meaning and use of the bottom-up and context connections should be asymmetrical: the bottom-up (excitatory) connections decode “visual appearance” of the concept, while top-down (modulatory) connections [1] help to resolve the interpretation of the perceived input using the context. This asymmetry should prevent positive feedback loops in which the bottom-up input might be completely ignored during both learning and inference. As a result, the hidden state of the architecture should still track actual sensory inputs.

The context input can then be seen as a high-level description of the current situation from the Expert's surroundings (both from the higher level and possibly from neighboring Experts in the same layer).

It is possible to use various sources of information as a context vector, such as:

  • Past activity of other Experts: this extends the Expert's ability to take dependencies further in the past into account.

  • Recent activity of other Experts: this increases the spatial and temporal range of the model.

  • Predicted activity of other Experts: this extends the Expert's ability to distinguish recent observation histories according to their future. This process could be likened to Epsilon Machines, where the idea is to differentiate histories according to their future impact [94, 12].

The context output of an Expert is a concatenation of the Spatial Pooler output (i.e. the winning cluster for the current input) and the Temporal Pooler prediction of the next cluster. The goal is also attached to the context (see Fig. 20), but we will treat it separately for clarity. The ensemble can be thought of colloquially as communicating: “Where I am”, “Where I expect to be in the future”, and “What reward I expect for each possible future cluster”.

Figure 20: Context and Goal input vectors to an Expert. Both are collections of top-down and lateral inputs from other Experts from the previous time step. The Goal input has some parts masked out (blue parts). The resulting two input vectors can be interpreted as a high-level description of the current state and a passive prediction of what will happen next, and the goal as a preference (measured in the expected value of reward) for the next state. Note that in this figure, time is counted separately for the Expert receiving the context and for the Experts sending it: because all Experts are event-driven, the time between two changes in an Expert's state is different for different Experts.

The context input is a collection of context outputs (refer to the red lines in Fig. 4) from multiple other Experts. Each Expert supplying context is known as a provider, and there is no distinction between parent providers and lateral providers (note that, in general, context connections that skip multiple layers are allowed as well). The context input to an Expert is therefore defined as:


where the context input is the concatenation, over the Expert's set of providers, of each provider's current and predicted clusters from the previous step. (Time is counted separately for the receiving and sending Experts: because all Experts are event-driven, the time between two changes in an Expert's state is different for different Experts.)

Context is incorporated into the Temporal Pooler prediction process by having the TP learn the likelihood of each context element from each provider being 1, for each lookbehind cluster in each sequence.

When using the context during inference, we augment the calculation of the unnormalised sequence probabilities (Eq. 11) by also matching the current history of contexts against the remembered sequence contexts.

We start by extending the definition given in Eq. 6:


We consider each context provider separately. For each sequence, we calculate the likelihood of that sequence based on the context history from each individual provider:


Considering the role of the context, we wish that in a world where multiple sequences are equally probable, the context will disambiguate the situation. Given that the context likelihoods are learned alongside the sequence priors, in a situation where each Expert receives the same data, the contexts should correlate highly with the cluster history, and predictions based solely on the context history would approximate the predictions based on the cluster history and priors:


But in reality, each Expert might be looking at a different receptive field and have generally different information. On the other hand, the context from most of the Experts may be of no use to the recipient, and it is probable that it will be highly correlated among the providers. Thus, averaging the predictions based on the individual contexts might obscure the valuable information. So rather than using every context equally for disambiguation, we would like to use only the most informative one. We choose the most unexpected context, as the context which is most disruptive to the otherwise anticipated predictions is likely to contain the most information about the current state of the agent and environment. As a metric of unexpectedness, we use the Kullback-Leibler divergence [51] between the predictions based on the history of cluster centers combined with one “informative” context, and the predictions based on the history of cluster centers alone.

We therefore update Eq. 10 to include this selection and use of the most informative context:


As a result, using the context as a high level description of the current situation, each Expert can also consider longer spatial and temporal dependencies which defy the strict hierarchical structure (see Section III-D) in order to learn the model of its own observations more accurately.
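The selection of the most informative context can be sketched directly from the description above. The function names are illustrative, and the KL computation assumes both distributions have full support where compared:

```python
import math

def kl(p, q):
    """Kullback-Leibler divergence D(p || q) between discrete distributions.
    Assumes q is nonzero wherever p is nonzero."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def most_informative_context(base_prediction, context_predictions):
    """Return the index of the context whose prediction diverges most from
    the prediction made from the cluster history alone; the most
    'unexpected' context is assumed to carry the most information."""
    return max(range(len(context_predictions)),
               key=lambda i: kl(context_predictions[i], base_prediction))

# The second context disagrees with the base prediction, so it is chosen.
idx = most_informative_context(base_prediction=[0.5, 0.5],
                               context_predictions=[[0.5, 0.5], [0.9, 0.1]])
```

A context identical to the base prediction has zero divergence and is never selected, which matches the intuition that redundant contexts add nothing to disambiguation.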

A-C Actions as Predictions

Until now, the architecture has been only able to passively observe the environment and learn its model. Now, the mechanisms necessary to actively interact with the environment (i.e. to produce actions) will be introduced with as small a change to the architecture as possible.

From the theoretical perspective, the HMM can be extended to a Partially Observable Markov Decision Process (POMDP) [92]. Remembering that each Expert processes Markov chains of low order, the decision process corresponds to the following setting:


where an action taken by the Expert at each time step is added to the transition model. Note that this setting could be treated as a task for active inference, where the agent proactively tries to discover the true state of the environment if necessary [98, 27]. But for now, we will consider a similar approximation of the problem as in the previous sections and leave an explicit active inference implementation to future work.

Since we want the hierarchy to be as general as possible, it is desirable to define the actions in such a way that they can be used when the Expert has the ability to control the actuators (either directly or indirectly through other Experts), but will not harm the performance when the Expert cannot perform actions and can only passively observe.

Figure 21: An example of a recent sequence of states showing: 1) the sequence of context inputs helping to resolve uncertainty during the computation of the sequence probabilities; 2) a goal vector defining the expected rewards of the target states. Bottom: the library of learned sequences, where each sequence is defined by an ordered list of states, each potentially in a different context. Top: a visualization of the current state and several possible futures. These futures are estimated based on the content of the Model. First, the probability distribution over sequences is computed based on the sequence of recent states and contexts. Then, the sequence probabilities are increased proportionally to the probability that the reward can be obtained by following each sequence (the updated sequence probabilities based on the reward are depicted on the right). As a result, this increases the probability of choosing the corresponding next state as an action, setting it as the goal output. In this example, the first three sequences are equally matched by the recent history and therefore have equal probabilities. But after the rewards are applied, one sequence has the smallest probability, another has a higher probability since it sets one goal element to zero in the future, and a third has the highest probability because it sets one goal element to zero and another to one, as required by the goal input.

For this reason, actions are not explicitly modeled in this architecture. Instead, an action is defined as an actively selected prediction of the next desired state (this actively chosen state should be reachable from the current state with a high probability, i.e. be in coherence with what is possible). The selected action (the desired state in the next step) is indicated on the Goal output of the Expert (see Fig. 4).

Given the library of sequences, the recent history of hidden states, and the context inputs, the Expert computes the sequence probabilities using Eq. 19. Then, those sequence probabilities are altered based on preferences over the states to which they lead (see Appendices A-D and A-E). This results in a new probability distribution


where the alteration can be seen as a sequence selection function; see Fig. 21 for an illustration. Finally, the Goal output of the Expert for the next simulation step is computed (see Fig. 4). This can be seen as actively predicting the next position in a sequence:


where a projection converts the probabilities of the sequences into a probability distribution over the clusters predicted in the next step:


where the indicator function is from Eq. 11, the position corresponds to the next immediate step, and the WTA function is from Eq. 8. The selection function in Eq. 23 can be one of several action-selection functions, namely identity, ε-greedy selection, sampling, or ε-greedy sampling.

In the example in Fig. 21, without considering any preferences over the sequences (the selection function in Eq. 22 collapses to identity), the probabilities of the first three sequences are equal, so the action-selection function would choose among their corresponding next states with equal probability.
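An ε-greedy variant of the action-selection function named above can be sketched as follows; this is a standard implementation, not taken from the paper's code:

```python
import random

def select_action(next_state_probs, epsilon=0.1, rng=random):
    """Epsilon-greedy selection over the distribution of desired next
    states: with probability 1 - epsilon pick the most probable next state,
    otherwise pick a state uniformly at random (exploration)."""
    if rng.random() < epsilon:
        return rng.randrange(len(next_state_probs))
    return max(range(len(next_state_probs)),
               key=lambda k: next_state_probs[k])

# With epsilon = 0 the choice is purely greedy: state 1 wins here.
chosen = select_action([0.1, 0.7, 0.2], epsilon=0.0)
```

The identity and sampling variants mentioned in the text differ only in the final step: identity returns the whole distribution, while sampling draws the state proportionally to its probability.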

The whole process can be seen as follows: each Expert throughout the hierarchy calculates a plan over a short time horizon, chooses the desired imminent actions (states one step in the future which are desired and probably reachable), and encodes this information in its Goal output. This signal is then either received by other Experts and interpreted as the goal they should reach, or used directly by the motor system in the case that the Expert is able to control something.

In the presented prototype implementation, the desirability of the goal states is encoded as a vector of rewards that the parent expects that the architecture will receive if the child can produce a projection which will cause the parent SP to produce the hidden state corresponding to the index of the goal value.

An Expert receiving a goal context computes the likelihood of the parent getting to each hidden state using its knowledge of where it presently is (Eq. 19), which sequences will bring about the desired change in the parent (Eq. 30), and how much it can influence its observation in the next step by its own actions (see Appendix A-F). It rescales the promised rewards using these factors, combines them with knowledge about its own rewards (see Appendix A-E), and then calculates which hidden states in the next step correspond to sequences leading towards these combined rewards. From here, it either publishes its own goal (the expected reward for getting into each cluster) or, if it interacts directly with the environment, picks an action to follow. (For bottom-level Experts, the action is part of the observation provided by the environment, so picking an action is equivalent to taking the cluster center of the desired state and sampling the actions from the remembered observations.) This mechanism is described in more detail in the following section.

A-D Goal-directed Inference

This section will describe the mechanisms which enable the Expert:

  • To decode the goal state received from an external source (usually other Experts).

  • To determine to what extent the goal state can be reached, or at least if the distance between the current state and the goal can be decreased.

  • To make a first step (“action”) leading towards this goal, if possible, by setting the Goal output to an appropriate value.

As a result, these mechanisms should allow the hierarchy of Experts to act deliberately. The architecture will hierarchically decompose a decision—potentially a complex plan, represented as one or several steps on an abstract level, into a hierarchy of short trajectories. This corresponds to the ability to do decentralized goal-directed inference, which is similar to hierarchical planning (e.g. state-based Hierarchical Task Network (HTN) planning [28]). Note that such a hierarchical decomposition of a plan has many benefits, such as the ability to follow a complex abstract plan for longer periods of time, but still be able to reactively adapt to unexpected situations at the lower levels. There are also theories that such mechanisms are implemented in the cortex [71].

In this section, we will show a simple mechanism which approximates the behavior of a symbolic planner. This demonstrates one important aspect: the hierarchy of Experts converts the input data into more structured representations. On each level of the hierarchy the representation can be interpreted either sub-symbolically or symbolically. This gives us the ability to define symbolic inference mechanisms on all levels of the hierarchy (e.g. planning), which then use grounded representations.

Furthermore, in Appendix A-E, we will show how a reinforcement signal can be used for setting preferences over the states in each Expert. This will in fact equip the architecture with model-based RL [65]. It also means that locally reachable goal states can emerge across the entire hierarchy with them appearing on different time scales and levels of abstraction, which leads to completely decentralized decision making.

The main idea of goal-directed inference is loosely inspired by the principles of predictive coding in the cortex [90], where it is assumed that each region tries to minimize the difference between predicted and actual activity of neurons. In ToyArchitecture, a more explicit approach for determining the desired state is used. The approach can be likened to a simplified, propositional logic-based version [48] of the symbolic planner called Stanford Research Institute Problem Solver (STRIPS) [21]. In this architecture, each Expert will be able to implement forward state-space planning with a limited horizon [29].

STRIPS definition:  Let L be a propositional language with finitely many predicate symbols, finitely many constant symbols, and no function symbols. A restricted state-transition system is a triple Σ = (S, O, γ), which is described in Table II.

variable                     meaning
s                            state; a set of ground atoms of L
S                            set of states
O                            set of operators (actions)
γ                            state transition function
o = (precond(o), effects(o)) operator; transforms one state
                             to another, if applicable
precond(o)                   precondition; set of literals which
                             determines if the operator
                             is applicable
effects(o)                   effect; set of literals which
                             determines how the operator
                             changes the state if applied
Table II: STRIPS language definition.

State s satisfies a set of ground literals g (denoted s ⊨ g) iff: every positive literal in g is in s and every negative literal in g is not in s. It is possible to represent states as binary vectors (where each ground literal corresponds to one position in the vector) and operators/actions as operations over these vectors.
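The binary-vector representation just mentioned can be sketched as follows. This is a minimal illustration only: the literal names and the tiny domain are invented, not taken from the paper.

```python
# Hypothetical sketch: STRIPS states as binary vectors over ground literals.
# A state satisfies a set of ground literals iff every positive literal is
# present and every negative literal is absent.
import numpy as np

literals = ["door_open", "robot_at_door", "holding_key"]  # ground literals of L

def satisfies(state, positive, negative):
    """Check s |= g: positive indices must be set, negative indices unset."""
    return all(state[i] for i in positive) and not any(state[i] for i in negative)

state = np.array([0, 1, 1], dtype=bool)   # at the door, holding the key, door closed
print(satisfies(state, positive=[1, 2], negative=[0]))  # -> True
```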

The operator o is applicable to the state s under the following conditions:

precond⁺(o) ⊆ s  and  precond⁻(o) ∩ s = ∅.

Then, the state transition function γ for an applicable operator o in state s is defined as:

γ(s, o) = (s − effects⁻(o)) ∪ effects⁺(o).
The STRIPS planning problem instance is a triple P = (Σ, s₀, g), where: Σ is the restricted state-transition system described above, s₀ is the current state, and g is a set of ground literals describing the goal state (which means that g describes only the required properties, which are a subset of the propositional language L).

Given the planning instance P, the task is to find a sequence of operators (actions) which consecutively transforms the initial state s₀ into a form which fulfills the conditions of the goal g.
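A minimal forward state-space search over such a planning instance can be sketched as follows. The operator definitions and the two-operator domain are invented for illustration; the search itself is plain breadth-first search, not the heuristic strategy quoted below.

```python
# Sketch of forward state-space STRIPS planning: find an operator sequence
# transforming the initial state into one satisfying the goal literals.
from collections import deque

def applicable(state, op):
    """precond+(o) must hold in s, precond-(o) must be absent from s."""
    return op["pre_pos"] <= state and not (op["pre_neg"] & state)

def apply_op(state, op):
    """gamma(s, o) = (s - effects-(o)) | effects+(o)."""
    return (state - op["eff_neg"]) | op["eff_pos"]

def plan(s0, goal_pos, ops):
    """Breadth-first search through the state space for a satisfying plan."""
    queue = deque([(frozenset(s0), [])])
    seen = {frozenset(s0)}
    while queue:
        state, path = queue.popleft()
        if goal_pos <= state:          # state satisfies all goal literals
            return path
        for name, op in ops.items():
            if applicable(state, op):
                nxt = frozenset(apply_op(state, op))
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, path + [name]))
    return None                        # no plan exists

ops = {
    "unlock": {"pre_pos": {"has_key"}, "pre_neg": {"open"},
               "eff_pos": {"open"}, "eff_neg": set()},
    "enter":  {"pre_pos": {"open"}, "pre_neg": set(),
               "eff_pos": {"inside"}, "eff_neg": set()},
}
print(plan({"has_key"}, {"inside"}, ops))  # -> ['unlock', 'enter']
```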

One possible method to find such a sequence is to search through the state-space representation. Since the decision tree has a potentially high branching factor, it is useful to apply some heuristic while choosing the operators to be applied. To quote from the original paper [21]: “We have adopted the General Problem Solver strategy of extracting differences between the present world model [state] and the goal, and of identifying operators that are relevant to reducing these differences”.

Now we will describe how an approximation of this mechanism is implemented in ToyArchitecture.

Similar mechanisms in the ToyArchitecture:  The architecture learns sequences of a fixed length, and each step in a sequence corresponds to an action. Each sequence is a trajectory in the state-space of the Expert (see the states marked with capital letters and the transitions between them in Fig. 21). But, more crucially, from the point of view of the parent Expert (footnote 32: for simplification, we can consider one parent Expert, but the approach generalizes to top-down connections from multiple parents as well as multiple lateral connections from other Experts in the same layer), each sequence can be seen as an operator o.

For the purposes of planning, besides the context vector input, the Expert is equipped with a goal vector input, which specifies the goal description g.

From the point of view of the parent, the context describes the current state (corresponding to s₀ in STRIPS), while the goal input describes a superposition of desirable goal states, with each position marked by a real number indicating how preferable the state is for the parent (footnote 33: we can think of this as the expected value of the state for the parent).

Note that (as explained in Appendix A-B) each Expert learns the probabilities of sequences conditioned on the context and the position in the sequence (Eq. 19), and stores them in the form of a table of frequencies of observations of each combination. This allows us to define the operator (corresponding to the learned sequence) in a stochastic form, where the probability of its effects is defined as the probability that the ending clusters of the sequence will be observed in the given context:


The precondition determining the applicability of the operator can also be defined in a stochastic manner, as the probability that the Expert is currently in the sequence:


where the probability of the sequence is similar to Eq. 19, but is computed for the situation when the Expert actively tries to influence it (see Appendix A-F for more details). Note that Eqs. 28 and 29 imply that the meaning of the operators is different in each Expert and at each time step.

Finally, the sequence selection function from Eq. 22 can be defined as follows:


where the normalization to probabilities is described in Eq. 10.

This means that each Expert can implement deliberate decision making, looking ahead several steps into the future. At each step, it looks for currently probable sequences which maximise the expected value when moving the parent from the current context to the state dictated by the goal input.
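The weighting step just described can be sketched with toy numbers as follows. The array names and values are invented; the equation numbers in the comments refer to the text's Eqs. 29-30.

```python
# Illustrative sketch of the sequence-selection step: the probability of each
# currently applicable sequence (Eq. 29) is weighted by the expected value the
# parent promises for the cluster the sequence ends in (Eq. 30), then normalized.
import numpy as np

def select_sequence(seq_probs, seq_endpoints, goal_values):
    """Return a distribution over sequences, weighted by end-state value."""
    scores = seq_probs * goal_values[seq_endpoints]
    total = scores.sum()
    if total > 0:
        return scores / total
    return np.full_like(scores, 1.0 / len(scores))  # no preference -> uniform

seq_probs = np.array([0.6, 0.3, 0.1])     # P(sequence | context, position)
seq_endpoints = np.array([2, 0, 1])       # cluster each sequence ends in
goal_values = np.array([0.0, 1.0, 0.2])   # promised expected value per cluster
probs = select_sequence(seq_probs, seq_endpoints, goal_values)
```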

Data: Observation history,
Context history,
Goal description
Result: Goal output
1 Compute the applicability of the operators: compute sequence probabilities (Eq. 29);
2 Select operators that are applicable and have a high chance of achieving one of the goal states: weight the sequence probabilities by these (Eq. 30);
3 Compute the probabilities of preferred states in the next step (Eq. 24);
4 if Expert is to produce an action then
5      Apply an action selection function (Eq. 23);
6      Set the selected action on the output;
7 else
8      Set the goal output to the received expected values weighted by the computed next-step probabilities;
9 end if
Algorithm 1 Goal-directed inference: an approximation of a stochastic version of STRIPS state-space planning with a limited horizon. Describes how the Expert decides which action to apply in order to maximise the expected value of the rewards communicated in the goal input from the current context. If an Expert is directly connected to the actuators, then an action is selected directly; otherwise the Expert propagates the expected values of the states to its children.

The entire process is summarized in Algorithm 1 and illustrated with an example in Fig. 21. Compared to STRIPS, each Expert can plan with only a limited lookahead, but its decision is decomposed into sub-goals for the Experts in the lower layer. This leads to an efficient hierarchical decomposition of tasks. Moreover, compared to classical symbolic planners, the representations in the hierarchy are completely learned from data, and since the Experts still compute with probabilities, the inference is stochastic and can be interpreted as continuous and sub-symbolic.
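Algorithm 1 can be rendered roughly in Python as below. All names are invented, and the score/preference computations stand in for the text's Eqs. 23-24 and 29-30; this is a sketch of the control flow (action at the bottom level vs. goal propagation elsewhere), not the paper's implementation.

```python
# Rough rendering of Algorithm 1: a bottom-level Expert emits an action;
# any other Expert propagates expected next-step state values to its
# children as its goal output.
import numpy as np

def goal_directed_step(seq_probs, seq_endpoints, goal_values, actions=None):
    # Steps 1-2: weight applicable sequences by the promised values of the
    # clusters they end in (stand-ins for Eqs. 29-30).
    scores = seq_probs * goal_values[seq_endpoints]
    if actions is not None:
        # Steps 4-6 (bottom level): greedy action selection stand-in (Eq. 23).
        return ("action", actions[int(scores.argmax())])
    # Steps 3 and 8: preferred next-step states (Eq. 24 stand-in), used to
    # weight the received expected values before passing them down.
    next_pref = np.zeros_like(goal_values)
    np.add.at(next_pref, seq_endpoints, scores)
    total = next_pref.sum()
    if total > 0:
        next_pref /= total
    return ("goal", goal_values * next_pref)

kind, out = goal_directed_step(
    np.array([0.6, 0.3, 0.1]), np.array([2, 0, 1]),
    np.array([0.0, 1.0, 0.2]), actions=["left", "right", "stay"])
```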

A-E Reinforcement Learning

In the previous section we described how the Expert can actively follow an externally given goal. The same mechanism can be used for reinforcement learning with a reward signal.

When a reward is reached, every Expert in the architecture gets the full reward or punishment value. During learning, each Expert assumes that it was at least partially responsible for gaining the reward and therefore associates the reward gained in the current step with the preceding state/action pair, so that each such pair has a corresponding estimate of the reward gained when in that state and taking that action. Because the Experts are event-driven, they sum up all the rewards received during the steps in which they did not run (their cluster did not change).
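The event-driven bookkeeping described above can be sketched as follows. The class name, learning rate, and running-average update rule are invented for illustration; the text only specifies that rewards received during skipped steps are summed and credited to the previous state/action pair.

```python
# Sketch: an Expert accumulates rewards while its cluster is unchanged and,
# on the next event, credits the accumulated sum to the (state, action) pair
# that was active before the event.
class RewardBuffer:
    def __init__(self):
        self.pending = 0.0
        self.table = {}   # (state, action) -> running reward estimate

    def observe(self, reward):
        """Cluster did not change this step: just accumulate the reward."""
        self.pending += reward

    def event(self, state, action, lr=0.1):
        """Cluster changed: credit the accumulated reward to (state, action)."""
        old = self.table.get((state, action), 0.0)
        self.table[(state, action)] = old + lr * (self.pending - old)
        self.pending = 0.0

buf = RewardBuffer()
buf.observe(1.0)       # two steps pass without this Expert running...
buf.observe(0.5)
buf.event(state=3, action=0)   # ...then its cluster changes
```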

The initial expert reward calculation is:


This is the expected value of the promised reward from each provider, for each future state in each sequence. Any rewards that the Expert can ‘see’ from this point are also included as the last term.

The action which the Expert should perform is related to the sequence that it wants to move through. As it is trying to maximise its rewards, the Expert should pick an action which positions it in a sequence with the highest likelihood of obtaining the most rewards. Since this is an expected value, and assuming that rewards are sparse so that an Expert can expect a reward at most once within the current lookahead of a sequence, the maximum of the rewards from the sequence is used (footnote 34: taking the maximum as a lower bound on the expected reward works only when the rewards are all non-negative or all non-positive; if we want the agent to accept both rewards and punishments at once, they need to be processed separately and combined only in the lowest Experts, which send the actual actions to the environment):