## 1 Introduction

Humans and animals alike have evolved to complete a wide variety of tasks in a world where resources are scarce. In particular, since learning and planning have an associated energy cost, the brain has probably evolved to solve multiple tasks while spending the least possible amount of energy [niven2016neuronal; sengupta2013information]. It is only natural, then, that the brain has an innate ability to generalize what it learns in one task to succeed in future ones; otherwise, it would be too costly to learn the appropriate solution for each problem from scratch. Since any artificial agent faces exactly the same burdens, it is highly desirable for it to possess the same generalization capacities. Standard deep reinforcement learning techniques have shown outstanding progress in solving complex tasks using a single architecture [mnih2015human; silver2017mastering; oriol2019alphastar; schrittwieser2019mastering], but there is still much progress to be made in transferring knowledge among multiple tasks [flesch2018comparing; cobbe2019quantifying; zhao2019investigating; yu2019metaworld] and under constraints like time [harb2018waiting], memory capacity, and energy.

One of the common traits of standard deep reinforcement learning methods is that sensory inputs and their successive encodings are represented at each processing stage as continuous real-valued vectors. This type of representation is very flexible and allows the use of efficient gradient-based optimization techniques, both properties that greatly enhance learning performance in single tasks. However, this flexibility encourages learning very complex models, which usually take advantage of spurious statistical patterns [duan2016benchmarking] that are not essential to solve the task and that are not present in similar tasks. Thus, the excess of flexibility directly inhibits the transfer of knowledge between tasks. In contrast, both animals and humans use discrete representations to encode sensory inputs and internal states [sergent2004consciousness; zhang2008discrete; linderman2019hierarchical]. These representations work as information bottlenecks that make learning harder and favor low-complexity models that capture only the most relevant patterns [zaslavsky2018efficient]. For example, the use of discrete groups of colors to identify edible food [briscoe2001evolution] is a particularly useful trait for survival: it separates sensory inputs into discrete categories while ignoring information that is essentially useless for the specific task. Aside from promoting simplicity, these representations tend to be modular and transferable between different contexts. For example, while color might be a useful category to recognize edible food, it is certainly not limited to this task and can be used in conjunction with other discrete representations to recognize arbitrary physical objects. Human language is perhaps the epitome of the representational power of discrete categories: humans seem to possess an innate ability to represent arbitrarily complex phenomena from a finite set of discrete expressions, which constitute what we call language [hauser2002faculty]. Just as with colors, a subset of these expressions can be used to categorize sensory inputs at the level of abstraction necessary to carry out a task, while being readily transferable to completely different purposes.
Discovering methods that identify rich discrete representations, like colors or language expressions, thus seems like a promising path to endow artificial agents with the ability to generalize.

Past artificial intelligence techniques that relied on symbolic representations were later shown to be sub-optimal in comparison with fully learning-based methods that made no assumptions about the problem being solved [tesauro2002programming; silver2017mastering]. However, the representations used were usually the result of hand-picked features or expert-based rules. In contrast, modern techniques that use discrete categories successfully employ them as labels of more complex elements, such as object representations with continuous features [veerapaneni2019entity; kulkarni2019unsupervised], vector codes with continuous entries [razavi2019generating], or distributions of actions given states, commonly called options or skills [bacon2017opctioncritic; kulkarni2016hierarchical; tessler2017deep; frans2018meta; eysenbach2019diversity; marino2019hierarchical; osa2019hierarchical; goyal2020reinforcement]. Here, we show that purely discrete representations of sensory inputs can be discovered by self-supervision and that, contrary to what is commonly assumed, they can improve learning efficiency in a task by leveraging the experience acquired in previous tasks. To achieve this, we take an information bottleneck approach to learn the representations in an off-policy fashion, and then use them to provide priors in new tasks.

## 2 Multi-task and transfer learning

Before further elaborating on the problem of finding useful discrete representations, a brief description of the multi-task and transfer problems in reinforcement learning is needed. In the multi-task problem, an agent finds itself in an environment and their joint state is specified by a vector $s$. The agent can change this state to a new state $s'$ by taking an action $a$, and, as a result of this change, it receives a reward $r$. During its lifetime, the agent faces a sequence of tasks $T_1, T_2, \dots, T_N$, where each task determines how the states evolve and what rewards the agent will receive. The agent thus has to find a policy, a behavior rule that selects actions based on states, for each task, such that the accumulated reward during its whole lifetime is maximized (Figs. 1A, 1B) [yu2019metaworld]. The transfer learning problem is similar, but it distinguishes between past and future tasks. In this case, the agent faces a sequence of tasks as well, but the objective now is to perform well in a future unseen task. The agent has to leverage the available information to learn a policy that is optimal in an unknown task after minimal interventions.
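This multi-task setting can be sketched minimally in code. The sketch below assumes a toy one-dimensional environment; the `Task` interface, its `goal` parameter, and the trivial dynamics are our own illustrative choices, not the paper's environments.

```python
class Task:
    """Minimal multi-task interface (hypothetical): tasks share the state and
    action spaces but define their own reward (and, in general, dynamics)."""
    def __init__(self, goal):
        self.goal = goal  # task-specific parameter

    def step(self, state, action):
        next_state = state + action             # shared toy dynamics
        reward = -abs(next_state - self.goal)   # task-specific reward
        return next_state, reward

def lifetime_return(policy, tasks, horizon=10):
    """Accumulated reward over a sequence of tasks (the multi-task objective)."""
    total, state = 0.0, 0.0
    for task in tasks:
        for _ in range(horizon):
            state, r = task.step(state, policy(state, task))
            total += r
    return total

# A policy that moves one unit (at most) toward the current task's goal:
policy = lambda s, task: max(-1.0, min(1.0, task.goal - s))
print(lifetime_return(policy, [Task(goal=3.0), Task(goal=-2.0)]))  # -13.0
```

In the transfer variant, `lifetime_return` would instead be evaluated on a held-out task that the policy never trained on.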

A skill-based approach to these problems consists of finding general policies from states to actions, called skills, that can be used across different tasks. Under this approach, the agent has to learn a set of skills along with a high-level policy from states to skills for each task (Fig. 1C). Ideally, these high-level policies are easier to learn than the original policies from states to actions. In practice, this separation might not present significant advantages in a multi-task setting [nachum2019hierarchy]. Nonetheless, if a future task is close in nature to the previous ones, it is reasonable to think that the skills can be reused without major modifications and only a new policy from states to skills is necessary.
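The two-level decomposition can be illustrated with a toy sketch. The skill names and the two tasks below are hypothetical; the point is only that the shared skills stay fixed while each task swaps in its own state-to-skill policy.

```python
# Shared skills: general state->action policies reused across tasks.
skills = {
    "forward": lambda s: +1,
    "backward": lambda s: -1,
}

def make_agent(high_level):
    """Compose a state->action policy from a state->skill policy and shared skills."""
    return lambda s: skills[high_level(s)](s)

# Task A: move right when left of the origin; Task B: the opposite.
# Only these high-level rules differ between the two tasks.
policy_a = make_agent(lambda s: "forward" if s < 0 else "backward")
policy_b = make_agent(lambda s: "backward" if s < 0 else "forward")
print(policy_a(-3), policy_b(-3))  # 1 -1
```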

## 3 Concept learning

Irrespective of what is understood as a concept, a defining characteristic is that it allows separating the sensory inputs that correspond to it from those that do not. Considering only this criterion, a set of skills can be regarded as a set of concepts in a specific task $T_1$. This is the case since, once an agent learns an optimal deterministic policy $\pi_1$ from states to skills, it can distinguish whether two states correspond to the same skill or not; that is, the skills determine a partition of the state space by separating from the rest those states where the agent behaves similarly (Fig. 2A). Such a partition contains information that might be useful when facing a second task: consider two states $s_1$ and $s_2$ that correspond to the same skill in a task $T_1$. If the agent learns that it should select the skill $z$ in the state $s_1$ during a second task $T_2$, then the agent has a reason to believe that it should select $z$ in the state $s_2$ as well. In this manner, the partition in the task $T_1$ provides a prior that can accelerate learning in a future task $T_2$, especially if $T_2$ is similar in nature to $T_1$. Let us suppose now that the agent learns the optimal policy in the second task and that the skill $z$ was indeed the right choice in both states $s_1$ and $s_2$. Naturally, the agent's confidence about the similarity of these states should increase, so that if the agent faces a new task $T_3$ it will have an even greater reason to believe that if the optimal skill in the state $s_1$ is $z'$, then it should be the same for $s_2$. In this case, the concepts no longer correspond to the skills themselves, but to the sets that result from the intersection of the partitions determined by the skills. These concepts can be understood as task-invariant sets of states for which the selected skill depends only on the specific task (Fig. 2B). Given their potential to enable the transfer of skills, we pose the problem of concept learning as one of learning these invariant sets.
More concretely, consider an agent in a transfer learning setting that can train on a set of tasks $\mathcal{T}$ and will be tested on a future unknown task of similar nature. The agent is endowed with a set of skills $Z$ and can classify states into concepts $C$ by means of a classifier $f_\theta$. We want the agent to learn a classifier such that its selection of skills in the future task depends as little as possible on the states once the concepts are known; that is, we want the concepts to provide most of the information necessary to decide when to use each skill (Fig. 1D). Given that the number of tasks can be potentially infinite (and so can the number of potential concepts), in practice the existence of such a classifier relies on the assumption that there exist large clusters of states whose interaction with skills is similar across tasks. This is not unlike humans, who seem to present innate priors related to the existence and persistence of physical objects [baillargeon2008innate; fields2017eigenforms].

As a solution, we propose two steps: first, train the agent on the available set of tasks following standard methods, which implies learning the skills and the policies from states to skills; and second, use the learned policies to generate triples of examples $(T, s, z)$, which are used to train the classifier following an information bottleneck approach, as explained below.
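The example-generation step can be sketched as follows. Everything here is illustrative: the two tasks, the skill names, and the representation of a policy as a state-to-distribution function are our own toy choices.

```python
import random
random.seed(0)

def generate_triples(policies, states, n):
    """Sample (task, state, skill) triples from trained state->skill policies.
    `policies[t]` is assumed to map a state to a dict {skill: probability}."""
    triples = []
    for _ in range(n):
        t = random.choice(list(policies))
        s = random.choice(states)
        dist = policies[t](s)
        z = random.choices(list(dist), weights=list(dist.values()))[0]
        triples.append((t, s, z))
    return triples

# Two toy tasks sharing a state space and two skills:
policies = {
    "seek":  lambda s: {"advance": 0.9, "retreat": 0.1} if s >= 0 else {"advance": 0.1, "retreat": 0.9},
    "avoid": lambda s: {"advance": 0.1, "retreat": 0.9} if s >= 0 else {"advance": 0.9, "retreat": 0.1},
}
data = generate_triples(policies, states=[-1, 0, 1], n=5)
print(len(data))  # 5
```

These triples are exactly the supervision signal the classifier is trained on; no further interaction with the environment is needed, which is what makes the approach off-policy.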

If we think of the task $T$, the state $S$, the skill $Z$, and the concept $C$ as random variables, they should follow the graphical model in Fig. 3A: in accessible tasks, the skill is selected based on the state and the task, and the concept is chosen based only on the state. We want the concepts to be as informative as possible about the skills for each task, so we would like to maximize a quantity like the mutual information of $Z$ and $C$ given $T$, $I(Z; C \mid T)$. This is an appropriate metric since it is large when there is little uncertainty about the value of $Z$ given the value of $C$, and of $C$ given the value of $Z$. Maximizing it thus captures the idea that two states which correspond to similar behaviors in the same task, across different tasks, should be categorized with the same concept, but it does not restrict the policies or the classifier to be deterministic. Now, since we also want to use as little information about the states as possible, the function that should be maximized is

$$\mathcal{L}(\theta) = I_\theta(Z; C \mid T) - \beta\, I_\theta(S; C), \qquad (1)$$

in agreement with the information bottleneck principle [tishby2000information] (Fig. 3B). Here, $\theta$ denotes the vector of parameters that determines the classifier $f_\theta$, and $\beta$ is a parameter that specifies how relevant it is to guess the skill based on the concept, at the expense of using too much information about the state. This parameter should be large enough to avoid using more concepts than necessary, or overfitting to very complex clusters of states that do not generalize to future tasks. This penalization is especially desirable if the policies from states to skills are not perfectly trained, in which case some of the examples will be erroneous and will affect the quality of the concepts.
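The objective in Eq. [1] can be estimated directly from sampled triples. The sketch below assumes a deterministic toy classifier and toy data (all names and numbers are illustrative); it shows that a classifier which compresses states while tracking skills scores higher than the identity map.

```python
from collections import Counter
from math import log

def mutual_information(pairs):
    """I(X;Y) in nats, estimated from a list of (x, y) samples."""
    n = len(pairs)
    pxy = Counter(pairs)
    px = Counter(x for x, _ in pairs)
    py = Counter(y for _, y in pairs)
    return sum(c / n * log(c * n / (px[x] * py[y])) for (x, y), c in pxy.items())

def ib_objective(triples, classifier, beta):
    """Estimate I(Z;C|T) - beta * I(S;C) from (task, state, skill) samples,
    with a deterministic classifier c = f(s) (a simplifying assumption)."""
    i_zc_t = 0.0
    for t in {t for t, _, _ in triples}:
        sub = [(z, classifier(s)) for t2, s, z in triples if t2 == t]
        i_zc_t += len(sub) / len(triples) * mutual_information(sub)
    i_sc = mutual_information([(s, classifier(s)) for _, s, _ in triples])
    return i_zc_t - beta * i_sc

# Toy data: in both tasks the skill depends only on the state's sign, so a
# sign classifier is informative about skills while compressing states.
triples = [("t1", -2, "a"), ("t1", -1, "a"), ("t1", 1, "b"), ("t1", 2, "b"),
           ("t2", -2, "b"), ("t2", -1, "b"), ("t2", 1, "a"), ("t2", 2, "a")]
sign = lambda s: s >= 0
identity = lambda s: s
print(ib_objective(triples, sign, beta=0.1) > ib_objective(triples, identity, beta=0.1))  # True
```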

In addition, if we consider that after selecting a skill $z$ in a state $s$ there is a transition to a state $s'$, we can see that the classifier will implicitly determine a transition from the concept $c$ to the concept $c'$. Since the concepts should provide an abstraction of the system for decision making, one would hope that the dynamics of the concepts are as deterministic as possible, so that the effect of selecting a skill when a concept is perceived can be easily inferred. Accordingly, a third term should be considered in Eq. [1], namely, the mutual information between the concept and the next concept, $I(C; C')$ (Fig. 3C).
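The effect of this extra term can be seen with empirical counts over concept transitions (the two toy transition datasets below are our own): deterministic concept dynamics carry maximal mutual information between successive concepts, while fully noisy dynamics carry none.

```python
from collections import Counter
from math import log

def mutual_information(pairs):
    """I(X;Y) in nats, estimated from samples of (x, y)."""
    n = len(pairs)
    pxy = Counter(pairs)
    px = Counter(x for x, _ in pairs)
    py = Counter(y for _, y in pairs)
    return sum(c / n * log(c * n / (px[x] * py[y])) for (x, y), c in pxy.items())

# (concept, next concept) samples: in the first set each concept always maps
# to the same successor; in the second set the successor is uniformly random.
deterministic = [(0, 1), (0, 1), (1, 0), (1, 0)] * 2
noisy = [(0, 1), (0, 0), (1, 0), (1, 1)] * 2
print(mutual_information(deterministic), mutual_information(noisy))
# the first value is log(2) nats, the second is 0
```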

## 4 Concept priors

Once concepts are learned, the question arises of how to use them to accelerate learning in a future task. We propose to obtain a Monte Carlo estimate of the Q-values, the expected cumulative rewards that the agent wants to maximize. These estimates allow us to determine a high-level policy $\pi(z \mid c)$ from concepts to skills. Since the states corresponding to the same concept tend to behave similarly, one would expect the state-policy $\pi_\phi(z \mid s)$ (parametrized by a vector $\phi$) to be close to the concept-policy (see Fig. 1D). So, a natural way to introduce this conceptual inductive bias is to penalize the state-policy for being distant from the concept-policy by adding the term

$$D_{\mathrm{KL}}\big(\pi(\cdot \mid c(s)) \,\big\|\, \pi_\phi(\cdot \mid s)\big) \qquad (2)$$

to the optimization objective of the standard method being used. This term is the Kullback-Leibler divergence between the policies, which provides a measure of how dissimilar they are. Here, $c(s)$ is used to denote the concept most closely related to the state $s$.

## 5 Results

We use two sets of locomotion tasks [brockman2016openai; frans2018meta] to test our algorithm. With both sets we follow a three-stage process for learning concepts, and for the second one we add a fourth stage for the transfer of the concepts and the skills. In the first stage, the agent (Figs. 4A, 4B) learns different skills related to pure locomotion, like jumping, walking in a straight line, or rotating. In the second stage (which might be simultaneous with the first one), the agent learns to solve different tasks from the available set. These tasks are characterized by the presence of common objects, like a floor, walls, or spherical targets, that the agent can perceive (Fig. 4C). In the third stage, the agent learns concepts by maximizing how informative they are about the behavior of the agent and the dynamics of the system (Eq. [1]). Finally, in the fourth stage, the agent uses the learned concepts as priors that guide its learning in a new task that displays the same objects as the previous tasks, but whose reward function and placement and distribution of objects are different (Fig. 4D).

The first set of tasks consists of a one-legged agent (Fig. 4A) which has to perform 4 different tasks: to run as fast as possible, to run at a desired speed, to run while achieving the largest possible height, and to run while staying at a specific low height (SI Appendix). To solve the tasks, the agent automatically learns 8 skills that have no information about the velocity or the height, forcing them to be transferable. Based on these skills and the 4 task policies, our method learns a classifier that automatically identifies 20 different concepts related to shapes and motions (Fig. 5). The second set of locomotion tasks consists of an ant-like agent (Fig. 4B) whose ultimate purpose is to learn to escape from a maze. Initially, the ant learns to walk in a straight line, to rotate clockwise, and to rotate anti-clockwise. Then, the ant learns to avoid walls and to get close to or far from spherical targets, depending on whether their associated reward is positive or negative (Fig. 4C). In this case, the classifier is able to learn concepts like having a free path; being in front, to the right, or to the left of a wall; or being close to or inside a target (Fig. 6). Finally, by using these concepts, the agent is able to significantly accelerate its learning in the new tasks, as seen in Fig. 7. Given that the agent can identify concepts like targets or walls, it only needs to perceive these concepts a few times in order to decide how to act with respect to them, and in turn with respect to all the states related to them, even if the agent has not visited some, or most, of these states.

## 6 Discussion and conclusions

Our method can be understood as a self-supervised learning approach, since the skills and their associated policies automatically provide input-output pairs to train the classifier. With this approach, the structure of the control problems is exploited to learn discrete representations, just as other methods leverage the geometric structure of the system [kulkarni2019unsupervised]. Our positive results showing that conceptual representations can be learned by solving control tasks are consistent with the neurological evidence indicating that a good portion of the concepts that humans learn, concrete concepts, are grounded in the sensory-motor brain systems [kiefer2012Conceptual]. This type of concept is closely related to the notions of object and action, like wall, walking, or distant, so it was expected that precisely the concepts of this type would be the ones learned. It remains to be seen whether more abstract concepts can be learned in a similar fashion. A positive outcome would strengthen the position that, while abstract concepts can be detached from any sensorimotor representation, they could in fact be learned by the same means as concrete concepts. Most likely, learning this type of concept requires considering further inductive biases concerning aspects like modularity or causal relations.

Another particularity of our method is that it can be readily combined with any skill-based hierarchical method, or essentially with any standard method where the action space is discrete, since it is an off-policy method that does not require interactions between the agent and the environment. This property makes our method a useful tool to interpret what is being learned by current deep reinforcement learning methods, thus allowing the extraction of conceptual knowledge from them.

While we constrained ourselves to a self-supervised off-policy approach, we believe that the concepts might be learned simultaneously with the skills, in a similar fashion to how interest functions are learned in the options framework [khetarpal2020options]. Similarly, we assumed throughout the document that the concepts are task-agnostic. While this supposition motivated our approach, in general, concepts do depend on the context in which they are used. So, it remains to understand how concepts can be task-dependent without losing their transferability across different tasks.

In conclusion, we have introduced a deep reinforcement learning method that is able to learn conceptual representations of sensory inputs in a multi-task control setting. We showed that these representations can capture intuitive notions and allow the transfer of skills from a set of known tasks to a set of unknown ones. We made three big assumptions to simplify the problem of learning transferable concepts: forcing the concepts to be exclusive and exhaustive (exactly one concept is active at any time); learning in a self-supervised off-policy fashion; and considering only inflexible, task-independent concepts. Taking into account all these limitations, we consider that our main contribution is not the proposed method itself but laying out the problem of concept learning. We see our method as a proof of concept that illustrates the potential of learning conceptual representations that disentangle the skills from the tasks. Ultimately, learning concepts could contribute notably to artificial agents generalizing as well as living beings do.

## Acknowledgments

The authors would like to thank Daniel Ochoa and Juan Pablo Martínez for helpful discussions.

## References

## Supplementary Information

### Environments

#### Libraries

All the environments were simulated with the MuJoCo simulator [todorov2012mujoco], by means of the OpenAI Gym wrapper [brockman2016openai]. The Hopper agent belongs to the OpenAI Gym library and its related code was modified to create different tasks. The Ant agent is the same one used in [frans2018meta] and the related tasks were defined by us.

#### Hopper

The standard Hopper environment consists of a one-legged agent with three rotational actuators that should learn to walk as fast as possible. When the agent exceeds certain height and angle thresholds it is restarted. We modified these thresholds to allow the agent to perform more diverse movements, e.g., crouching. Also, we defined 4 different environments with the rewards in Table 1. The resulting behaviors can be observed in Fig. 8.

#### Ant

The Ant agent consists of a semi-spherical body with 4 legs attached and a head. Its actuators correspond to 8 joints, 2 for each leg. Given that only two types of objects were used (walls and targets), the visual system of the Ant agent consisted of 3 simple gray-scale arrays. If a ray from the head of the agent to a wall was shorter than a predefined threshold, the value of the respective pixel in the first array was proportional to the length of the ray. Similarly, the second array corresponded to the distance from the head of the agent to a target, and the third array corresponded to the intensity of a target in the line of sight.
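The ray-based arrays can be sketched as follows. This is our reading of the description above (pixel value proportional to the ray length below the threshold, zero otherwise); the ray distances and threshold are hypothetical.

```python
def depth_array(distances, threshold):
    """One gray-scale ray array: a pixel responds proportionally to the ray
    length when the ray is shorter than the threshold, and reads zero
    otherwise (a sketch of the described sensor; names are ours)."""
    return [d / threshold if d < threshold else 0.0 for d in distances]

wall_rays = [0.5, 2.0, 10.0]   # hypothetical distances for a 3-ray array
print(depth_array(wall_rays, threshold=4.0))  # [0.125, 0.5, 0.0]
```

The same function would be applied to target distances for the second array; the third (intensity) array would need a separate per-ray intensity reading.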

### Reinforcement Learning Techniques

#### Hopper training

We trained skills and high-level policies in parallel, following a scheme similar to the one explained in [frans2018meta]. We used the soft actor-critic algorithm (SAC) [haarnoja2018soft] to train both skills and high-level policies. Since the skill space is discrete, we used a non-conventional version of SAC, which is detailed in Alg. 1. After a fixed number of steps, the task was randomly changed during an episode to ensure a diversity of states in the initiation sets for all tasks. This forced the agent to make marked transitions of behavior, which resulted in disentangled skills and concepts (see Movie S1).
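The discrete-skill case admits a simple closed-form soft policy. The sketch below is a generic discrete soft-actor-critic-style target, not a reproduction of Alg. 1: given Q-values over skills, the policy is a Boltzmann distribution with temperature $\alpha$ (the skill names and values are illustrative).

```python
from math import exp

def soft_policy(q_values, alpha):
    """Discrete SAC-style target: pi(z|s) proportional to exp(Q(s,z)/alpha).
    The max is subtracted for numerical stability."""
    m = max(q_values.values())
    unnorm = {z: exp((q - m) / alpha) for z, q in q_values.items()}
    total = sum(unnorm.values())
    return {z: u / total for z, u in unnorm.items()}

q = {"walk": 1.0, "rotate_left": 0.5, "rotate_right": 0.5}
greedy = soft_policy(q, alpha=0.01)      # low temperature: nearly deterministic
uniformish = soft_policy(q, alpha=100.0)  # high temperature: close to uniform
print(round(greedy["walk"], 3), round(uniformish["walk"], 3))
```

The entropy temperature thus interpolates between greedy skill selection and uniform exploration, which is the role it plays in SAC-style training.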

#### Ant training

The ant training consisted of 3 stages. In the first stage, 3 different skills were trained in parallel using standard SAC [haarnoja2018soft]: walking along a goal direction, rotating to the left, and rotating to the right. To achieve this, an appropriate reward signal related to linear or angular velocities was defined (see Movie S2). In the second stage, 4 different high-level policies were learned by means of the SAC algorithm and noisy dueling deep Q-networks [wang2016dueling; fortunato2018noisy]. Each policy corresponded to a different task where the agent was presented with two types of objects: walls and targets. In all of these tasks the agent received a reward signal that encouraged a large linear velocity, and, in the two tasks where targets appeared, it encouraged the agent to either seek or avoid the targets (see Movie S3). In the third and final stage, the Ant was trained to escape from a maze with two different methods: SAC with random network distillation (RND) [burda2019exploration], and SAC with RND and concept priors. In SAC, the policy is optimized by reducing its "distance" to a target distribution that is computed from the action-value networks. The concept priors provide a second target distribution, which at the start of training might be a better target than the original one. For this reason, in the second method we express the actor loss as a convex combination of the "distances" to the two targets. The weight in this combination depends on the novelty of each state: if a state has not been visited many times, the original target should not be trusted, and the second "distance" is given more relevance. To calculate this novelty, we take advantage of the RND method, which naturally provides such a quantity. The target provided by the concepts is the high-level policy from concepts to skills.
To determine this policy we use every-visit Monte Carlo updates [sutton2018reinforcement] to estimate a matrix of Q-values. We take this approach since concepts correspond to observations in a partially observable MDP, so simple bootstrapping methods are not applicable. Alg. 2 provides a detailed description of the second method (the first one is identical if all steps related to the concept priors are ignored) (see Movie S4).
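The two ingredients above can be sketched compactly: an every-visit Monte Carlo update of a concept-skill Q table, and a novelty-dependent weight for the concept-prior target. Both are illustrative sketches with names of our own, not Alg. 2 itself.

```python
def mc_update(q, episode, gamma=0.99, lr=0.1):
    """Every-visit Monte Carlo update of a concept-skill Q table from one
    episode of (concept, skill, reward) tuples (a sketch; names are ours)."""
    g = 0.0
    for concept, skill, reward in reversed(episode):
        g = reward + gamma * g                       # return from this step on
        key = (concept, skill)
        q[key] = q.get(key, 0.0) + lr * (g - q.get(key, 0.0))
    return q

def prior_weight(novelty, scale=1.0):
    """Weight of the concept-prior target in the convex combination: the more
    novel the state, the more the prior is trusted over the critic's target."""
    return novelty / (novelty + scale)

q = mc_update({}, [("free_path", "walk", 1.0), ("wall_front", "rotate_left", 0.0)])
print(q[("free_path", "walk")], prior_weight(3.0))  # 0.1 0.75
```

In the full method, `prior_weight` would be fed by the RND prediction error rather than a raw novelty count.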

### Information Bottleneck Optimization

#### Information Bottleneck

Consider two random variables $X$ and $Y$ determined by a probability distribution $p(x, y)$. We can sample pairs $(x, y)$ from this distribution, but we have no access to the distribution itself. A typical machine learning task is to find a parametric function of the mean value of $Y$ given $X$, or to find a parametric approximation of the generative model $p(y \mid x)$. The general idea of the information bottleneck approach [tishby2000information] is to encode $X$ into an auxiliary random variable $Z$ such that $Z$ is maximally informative about $Y$ while using the least possible information about $X$. So, an ideal $Z$ captures the relationship of $X$ and $Y$ by throwing away the components of $X$ that are not related to $Y$. To achieve this, it is typically assumed that the three variables satisfy a fork probabilistic structure where $Z$ and $Y$ are conditionally independent given $X$. This graph structure is commonly called a Markov chain and is denoted as $Z \leftrightarrow X \leftrightarrow Y$. The problem is then to find the generative model $p_\theta(z \mid x)$ with parameters $\theta$ that maximizes the information bottleneck function

$$\mathcal{L}(\theta) = I_\theta(Z; Y) - \beta\, I_\theta(Z; X),$$

where $I(\cdot\,;\cdot)$ is the mutual information between two variables. The parameter $\beta$ controls the trade-off between preserving the structure of $Y$ through $Z$, and maximally compressing $X$ into $Z$.
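The trade-off can be checked numerically on a toy example (all data here are illustrative): when $Y$ depends only on the parity of $X$, the parity encoder preserves all the information about $Y$ while using strictly less information about $X$ than the identity encoder.

```python
from collections import Counter
from math import log

def mi(pairs):
    """I(X;Y) in nats, estimated from samples of (x, y)."""
    n = len(pairs)
    pxy = Counter(pairs)
    px = Counter(x for x, _ in pairs)
    py = Counter(y for _, y in pairs)
    return sum(c / n * log(c * n / (px[x] * py[y])) for (x, y), c in pxy.items())

# X takes 4 values but Y depends only on X's parity; the parity encoder Z
# keeps everything relevant to Y while compressing X by one bit.
samples = [(x, x % 2) for x in (0, 1, 2, 3) * 25]
encode = lambda x: x % 2
i_zy = mi([(encode(x), y) for x, y in samples])   # equals I(X;Y): nothing lost
i_zx = mi([(encode(x), x) for x, y in samples])   # below H(X): compression
print(round(i_zy, 3), round(i_zx, 3), round(mi([(x, x) for x, _ in samples]), 3))
# 0.693 0.693 1.386
```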

#### Concept Learning with the Deep Information Bottleneck

The objective function consists of two information bottlenecks: one that encapsulates the idea of concepts as invariant sets, and one that works as a regularization term encouraging deterministic transitions between concepts. Specifically, the objective function is

$$\mathcal{L}(\theta) = \big[ I_\theta(Z; C \mid T) - \beta_1\, I_\theta(S; C) \big] + \eta\, \big[ I_\theta(C; C') - \beta_2\, I_\theta(S'; C') \big].$$

Calculating this function for the parameters $\theta$ requires access to quantities like $p_\theta(c \mid z, T)$ or $p_\theta(c' \mid c)$. Since the random variables $Z$ and $C$ are both discrete, using neural networks to estimate these quantities might not be the best approach, since learning might be too slow or too unstable. Instead, we only use a neural network to estimate the classifier $f_\theta(c \mid s)$, and we use the Markov chain assumption of the information bottleneck, together with the trained policies $\pi_T(z \mid s)$, to estimate the required quantities. For example:

$$p_\theta(z \mid c, T) \approx \frac{\sum_{s \in \mathcal{B}_T} \pi_T(z \mid s)\, f_\theta(c \mid s)}{\sum_{s \in \mathcal{B}_T} f_\theta(c \mid s)}, \qquad (3)$$

where $\mathcal{B}_T$ is a batch of states collected in task $T$.
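One way to realize this batch estimate in code is sketched below. The classifier is a toy soft classifier and the batch is illustrative; a small constant in the denominator guards against concepts with no mass in the batch.

```python
def estimate_p_z_given_c(batch, classifier, skills, concepts):
    """Estimate p(z | c) for one task by marginalizing a soft classifier over
    a batch of (state, skill) pairs, in the spirit of the batch estimate in
    Eq. [3] (a sketch with hypothetical names)."""
    num = {(z, c): 0.0 for z in skills for c in concepts}
    den = {c: 1e-12 for c in concepts}   # guard against empty concepts
    for s, z in batch:
        probs = classifier(s)            # {concept: probability}
        for c, p in probs.items():
            num[(z, c)] += p
            den[c] += p
    return {(z, c): num[(z, c)] / den[c] for z in skills for c in concepts}

# Toy batch: negative states use skill "a", positive use "b"; the classifier
# softly assigns negatives to concept 0 and positives to concept 1.
classifier = lambda s: {0: 0.9, 1: 0.1} if s < 0 else {0: 0.1, 1: 0.9}
batch = [(-2, "a"), (-1, "a"), (1, "b"), (2, "b")]
p = estimate_p_z_given_c(batch, classifier, skills=("a", "b"), concepts=(0, 1))
print(round(p[("a", 0)], 2), round(p[("b", 1)], 2))  # 0.9 0.9
```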

Furthermore, since $\theta$ changes slowly, we opt to estimate quantities like $p_\theta(z \mid c, T)$ with visit-count tensors that accumulate sums like those in Eq. [3] over different batches, where each new batch is given more weight than the previous ones. This is achieved by multiplying the tensors by a forgetting factor. See Alg. 3 for a detailed description of the procedure used to learn concepts.

| Task | Reward |
|---|---|
| Fast hopper | |
| Slow hopper | |
| High hopper | |
| Low hopper | |

| Concept label | Concept | Color |
|---|---|---|
| | Free-fall | Blue |
| | Crouched | Orange |
| | Preparing for landing / N-shape | Dark green |
| | Inclined / Ready to jump / Finished landing | Red |
| | Free-fall | Dark blue |
| | Suspended in air | Brown |
| | Controlled falling | Dark gray |
| | Just landed / Flexed | Cyan |
| | Jumping | Purple |
| | Ascending while close to the ground | Yellow |
| | Fully crouched | Dark yellow |
| | Knee flexed | Green |
| | Ascending while close to the ground | Pink |
| | Jumping | Light blue |
| | Extreme inclination (followed by termination) | Light brown |
| | Leg extension / Jumping | Persian green |
| | Leg extension while jumping or falling | Bright purple |
| | Vertical and short jump | Dark purple |
| | Knee and foot slightly flexed | Lavender |
| | Still with L shape (typical initial state) | Deep sky blue |

| Concept label | Concept | Color |
|---|---|---|
| | Not assigned | |
| | Wall in top left direction | Cyan |
| | Wall to the right | Yellow |
| | Inside/in front of target | Light sky blue |
| | Free path | Blue |
| | Not assigned | |
| | Wall in top left direction (close) | Pink |
| | Target to the right | Dark blue |
| | Wall in the "corner of the eye" | Black |
| | Not assigned | |
| | Distant target in sight | Purple |
| | Inside/in front of target | Red |
| | Not assigned | |
| | Wall to the left | Brown |
| | Wall in front (close) | Green |
| | Wall in top right direction | Dark purple |
| | Wall in front | Orange |
| | Wall in top left direction (close) | Light orange |
| | Target to the left | Turquoise |
| | Free path | Light green |

| Hyperparameter | Value |
|---|---|
| | True |

| Hyperparameter | Value |
|---|---|
| | False |

| Hyperparameter | Value |
|---|---|

| Hyperparameter | Value |
|---|---|
