
Concept Learning in Deep Reinforcement Learning

by   Diego Gomez, et al.

Deep reinforcement learning techniques have proven to be a promising path toward solving very complex tasks that were once thought to be out of the realm of machines. However, while humans and animals learn incrementally during their lifetimes and exploit their experience to solve new tasks, standard deep learning methods specialize in solving only one task at a time, and whatever information they acquire is hardly reusable in new situations. Given that any artificial agent would need such a generalization ability to deal with the complexities of the world, it is critical to understand what mechanisms give rise to this ability. We argue that one of the mechanisms humans rely on is the use of discrete conceptual representations to encode their sensory inputs. These representations group similar inputs in such a way that, combined, they provide a level of abstraction that cuts across a wide variety of tasks, filtering out information that is irrelevant to their solution. Here, we show that it is possible to learn such concept-like representations by self-supervision, following an information-bottleneck approach, and that these representations accelerate the transference of skills by providing a prior that guides the policy optimization process. Our method is able to learn useful concepts in locomotive tasks that significantly reduce the number of optimization steps required, opening a new path to endow artificial agents with generalization abilities.



1 Introduction

Humans and animals alike have evolved to complete a wide variety of tasks in a world where resources are scarce. In particular, since learning and planning have an associated energy cost, the brain has probably evolved to solve multiple tasks while spending the least possible amount of energy niven2016neuronal; sengupta2013information. It is only natural then that the brain has an innate ability to generalize what it learns in one task to succeed in future ones. Otherwise, it would be too costly to learn from scratch the appropriate solution for each problem encountered. Given that any artificial agent would face exactly the same burdens, it is highly desirable for it to possess the same generalization capacities. Standard deep reinforcement learning techniques have shown outstanding progress in solving complex tasks using a single architecture mnih2015human; silver2017mastering; oriol2019alphastar; schrittwieser2019mastering, but there is still much progress to be made in terms of transferring knowledge among multiple tasks flesch2018comparing; cobbe2019quantifying; zhao2019investigating; yu2019metaworld and under constraints like time harb2018waiting, memory capacity, and energy.

One of the common traits of standard deep reinforcement learning methods is that sensory inputs and their successive encodings are represented in each processing stage as continuous real-valued vectors. This type of representation is very flexible and allows the use of efficient gradient-based optimization techniques, both properties that greatly enhance the learning performance in single tasks. However, this flexibility encourages learning very complex models, which usually take advantage of spurious statistical patterns duan2016benchmarking that are not essential to solve the task and that are not present in similar tasks. Thus, the excess in flexibility directly inhibits the transference of knowledge between tasks. In contrast, both animals and humans exhibit the use of discrete representations to encode sensory inputs and internal states sergent2004consciousness; zhang2008discrete; linderman2019hierarchical. Unlike continuous vectors, these representations work as information bottlenecks that make learning harder and favor low-complexity models that only capture the most relevant patterns zaslavsky2018efficient. For example, the use of discrete groups of colors to identify edible food briscoe2001evolution is a particularly useful trait for survival that separates sensory inputs into discrete categories while ignoring information that is essentially useless for the specific task. Aside from promoting simplicity, these representations tend to be modular and transferable between different contexts. For example, while color might be a useful category to recognize edible food, it is certainly not limited to this task and can be used in conjunction with other discrete representations to recognize arbitrary physical objects. Human language is perhaps the epitome of the representational power of discrete categories: humans seem to possess an innate ability to represent arbitrarily complex phenomena from a finite number of discrete expressions, which constitute what we call language hauser2002faculty. Just as with colors, a subset of these expressions can be used to categorize sensory inputs at the level of abstraction necessary to carry out a task, while being readily transferable to completely different purposes. Discovering methods that identify rich discrete representations, like colors or language expressions, therefore seems like a promising path to endow artificial agents with the ability to generalize.

Past artificial intelligence techniques that relied on the use of symbolic representations were later demonstrated to be sub-optimal in comparison with fully learning-based methods that made no assumptions about the problem being solved tesauro2002programming; silver2017mastering. However, the representations used were usually the result of hand-picked features or expert-based rules. In contrast, modern techniques that have used discrete categories successfully use them as labels of more complex elements, such as object representations with continuous features veerapaneni2019entity; kulkarni2019unsupervised, vector codes with continuous entries razavi2019generating, or distributions of actions given states, commonly called options or skills bacon2017opctioncritic; kulkarni2016hierarchical; tessler2017deep; frans2018meta; eysenbach2019diversity; marino2019hierarchical; osa2019hierarchical; goyal2020reinforcement. Here, we show that purely discrete representations of sensory inputs can be discovered by self-supervision and that, contrary to what is commonly assumed, they can improve the learning efficiency in a task by leveraging the experience acquired in previous tasks. To achieve this, we take an information bottleneck approach to learn the representations in an off-policy fashion, and then use them to provide priors in new tasks.

Figure 1: Deep reinforcement learning methods in the context of multi-task learning. (A) Standard methods learn, for each task, the parameters of a prototypical neural network that determines the policy. (B) Goal-based methods parametrize the task as a goal vector, which allows a single neural network to perform multiple tasks but makes learning much harder. (C) Skill-based hierarchical methods learn two sets of parameters: one task-dependent, related to the policy that selects skills based on states, and one task-independent, related to the skills themselves, which are task-agnostic policies. (D) Concept-based methods learn task-agnostic parameters related to the skills and to a classifier of states (the categories of this classifier are the concepts). A task-dependent tabular policy from concepts to skills provides a prior that speeds up learning of the state-based policies.

2 Multi-task and transfer learning

Before further elaborating on the problem of finding useful discrete representations, a brief description of the multi-task and transfer problems in reinforcement learning is needed. In the multi-task problem, an agent finds itself in an environment, and their joint state is specified by a vector. The agent can change this state to a new state by taking an action and, as a result of this change, it receives a reward. During its lifetime, the agent faces a sequence of tasks, each of which determines how the states evolve and what rewards the agent receives. So, the agent has to find a policy, a behavior rule that selects actions based on states, for each task, such that the reward accumulated during its whole lifetime is maximized (Figs. 1A, 1B) yu2019metaworld. The transfer learning problem is similar, but it distinguishes between past and future tasks. In this case, the agent faces a sequence of tasks as well, but the objective now is to perform well in a future unseen task. The agent has to leverage the available information to learn a policy that is optimal in an unknown task after minimal interventions.

A skill-based approach to these problems consists in finding general policies from states to actions, called skills, that can be used across different tasks. Under this approach, the agent has to learn a set of skills along with a high-level policy from states to skills for each task (Fig. 1C). Ideally, these high-level policies are easier to learn than the original policies from states to actions. In practice, this separation might not present significant advantages in a multi-task setting nachum2019hierarchy. Nonetheless, if a future task is close in nature to the previous ones, it is reasonable to think that the skills can be used without major modifications and that only a new policy from states to skills is necessary.

3 Concept learning

Figure 2: Concepts as invariant sets. (A) The policy from states to skills determines a partition of the states in a single task. (B) The concepts correspond to the intersection of partitions, which are sets of states that are mapped to the same skills across different tasks.

Irrespective of what is understood as a concept, a defining characteristic is that it allows separating the sensory inputs that correspond to it from those that do not. Considering only this criterion, a set of skills can be regarded as a set of concepts in a specific task. This is the case because, once an agent learns an optimal deterministic policy from states to skills, it can tell whether two states correspond to the same skill or not. That is, the skills determine a partition of the state space that groups together the states where the agent behaves similarly (Fig. 2A). Such a partition contains information that might be useful when facing a second task. Consider two states that correspond to the same skill in a first task. If the agent learns that it should select a certain skill in the first of these states during a second task, then it has a reason to believe that it should select the same skill in the second state as well. In this manner, the partition found in the first task provides a prior that can accelerate learning in a future task, especially if the two tasks are similar in nature. Suppose now that the agent learns the optimal policy in the second task and that the same skill was indeed the right choice in both states. Naturally, the confidence of the agent in the similarity of these states should increase, so that if the agent faces a third task it will have an even greater reason to believe that the optimal skill in one of the states is also optimal in the other. In this case, the concepts no longer correspond to the skills themselves, but to the sets that result from intersecting the partitions determined by the skills. These concepts can be understood as task-invariant sets of states for which the skill selected depends only on the specific task (Fig. 2B). Given their potential to allow the transference of skills, we pose the problem of concept learning as one of learning these invariant sets.
More concretely, consider an agent in a transfer learning setting that can train on a set of tasks and will be tested on a future unknown task of similar nature. The agent is endowed with a set of skills and can classify states into concepts by means of a classifier. We want the agent to learn a classifier such that its selection of skills in the future task depends as little as possible on the states once the concepts are known. That is, we want the concepts to provide most of the information necessary to decide when to use each skill (Fig. 1D). Given that the number of tasks can be potentially infinite (and so can the number of potential concepts), in practice the existence of such a classifier relies on the assumption that there exist large clusters of states whose interaction with skills is similar across tasks. This is not unlike humans, who seem to have innate priors related to the existence and persistence of physical objects baillargeon2008innate; fields2017eigenforms.
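The idea of concepts as intersections of partitions can be illustrated with a small sketch. It assumes deterministic, tabular policies over a handful of named states, which is far simpler than the learned classifier described in this paper; the function name `concepts_from_policies` and the toy states and skills are hypothetical.

```python
from collections import defaultdict

def concepts_from_policies(policies):
    """Group states that receive the same skill in every task.

    policies: list of dicts, one per task, mapping state -> skill.
    Returns a dict mapping each skill signature (one skill per task)
    to the sorted list of states that share that signature.
    """
    # Only states that appear in every task can be compared across tasks.
    states = sorted(set.intersection(*(set(p) for p in policies)))
    groups = defaultdict(list)
    for s in states:
        signature = tuple(p[s] for p in policies)
        groups[signature].append(s)
    return dict(groups)

# Toy example (hypothetical): four states, two tasks.
task_1 = {"s0": "walk", "s1": "walk", "s2": "turn", "s3": "turn"}
task_2 = {"s0": "walk", "s1": "jump", "s2": "turn", "s3": "turn"}

concepts = concepts_from_policies([task_1, task_2])
for signature, members in concepts.items():
    print(signature, members)
```

In this toy case the first task alone groups `s0` with `s1`, but the second task separates them, so the intersection yields three invariant sets. Only `s2` and `s3` remain together, because they are mapped to the same skill in both tasks.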

As a solution, we propose to take two steps: first, to train the agent on the available set of tasks following standard methods, which implies learning the skills and the policies from states to skills; and second, to use the learned policies to generate triples of examples (a task, a state, and the skill selected), which are then used to train the classifier following an information bottleneck approach, as explained below.

If we think of the task, the state, the skill, and the concept as random variables, they should follow the graphical model in Fig. 3A: in accessible tasks, the skill is selected based on the state and the task, and the concept is chosen based only on the state. We want the concepts to be as informative as possible about the skills for each task, so we would like to maximize a quantity like the mutual information between the skill and the concept given the task. This is an appropriate metric since it is large when there is little uncertainty about the skill given the concept, and about the concept given the skill. So, maximizing it captures the idea that two states which correspond to similar behaviors in the same task, across different tasks, should be categorized with the same concept, but it does not restrict the policies or the classifier to be deterministic. Now, since we also want to use as little information about the states as possible, the function that should be maximized (Eq. [1]) rewards the mutual information between the skill and the concept given the task and penalizes the mutual information between the state and the concept, in agreement with the information bottleneck principle tishby2000information (Fig. 3B). This function is maximized over the vector of parameters that determine the classifier, and a trade-off parameter specifies how relevant it is to guess the skill based on the concept, at the expense of using too much information about the state. The penalization should be strong enough to avoid using more concepts than necessary, or overfitting to very complex clusters of states that do not generalize to future tasks. This penalization is especially desirable if the policies from states to skills are not perfectly trained, a case in which some of the examples will be erroneous and will affect the quality of the concepts.
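For discrete variables, an objective of this kind can be evaluated exactly from probability tables. The following is a minimal sketch under stated assumptions: the notation (T for task, S for state, A for skill, Z for concept), the function names, and the placement of the weight `beta` on the compression term are all choices made for this illustration, not the exact form of Eq. [1].

```python
import numpy as np

def mutual_information(joint):
    """Mutual information (in nats) of a 2-D joint probability table."""
    joint = joint / joint.sum()
    px = joint.sum(axis=1, keepdims=True)
    py = joint.sum(axis=0, keepdims=True)
    mask = joint > 0
    return float((joint[mask] * np.log(joint[mask] / (px @ py)[mask])).sum())

def concept_objective(p_t, p_s_given_t, policy, classifier, beta):
    """Assumed form: I(A; Z | T) - beta * I(S; Z), all variables discrete.

    p_t:          (T,)       task probabilities
    p_s_given_t:  (T, S)     state distribution in each task
    policy:       (T, S, A)  P(skill | state, task)
    classifier:   (S, Z)     P(concept | state)
    """
    info_skill = 0.0
    for t, weight in enumerate(p_t):
        # Joint over (skill, concept) in task t: sum_s p(s|t) pi(a|s,t) q(z|s).
        joint_az = np.einsum("s,sa,sz->az", p_s_given_t[t], policy[t], classifier)
        info_skill += weight * mutual_information(joint_az)
    p_s = p_t @ p_s_given_t                    # marginal over states
    joint_sz = p_s[:, None] * classifier       # joint over (state, concept)
    return info_skill - beta * mutual_information(joint_sz)

# Toy check: one task, four equally likely states, two skills, two concepts.
p_t = np.array([1.0])
p_s_given_t = np.full((1, 4), 0.25)
policy = np.zeros((1, 4, 2))
policy[0, :2, 0] = 1.0      # states 0 and 1 use skill 0
policy[0, 2:, 1] = 1.0      # states 2 and 3 use skill 1
aligned = policy[0].copy()  # concepts that coincide with the skills
uniform = np.full((4, 2), 0.5)

score_aligned = concept_objective(p_t, p_s_given_t, policy, aligned, beta=0.5)
score_uniform = concept_objective(p_t, p_s_given_t, policy, uniform, beta=0.5)
print(score_aligned, score_uniform)
```

A classifier that ignores the state scores zero on both terms, while one aligned with the skills gains more on the predictive term than it pays on the compression term whenever `beta` is below one in this toy setting.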

Figure 3: Concept learning as a problem of preserving a probabilistic structure. (A) The learned policy, the training dynamics, and the classifier together determine a probabilistic graphical model for the random variables involved. The ideal classifier encapsulates this information in a high-level probabilistic graphical model where the concepts replace the states. (B,C) Encapsulating the behavior of the agent and the dynamics of the system can be done by optimizing the information bottlenecks (B) and (C).

In addition, if we consider that after selecting a skill in a state there is a transition to a next state, we can see that the classifier implicitly determines a transition from the current concept to the next concept. Since the concepts should provide an abstraction of the system for making decisions, one would hope that the dynamics of the concepts are as deterministic as possible, so that the effect of selecting a skill when a concept is perceived can be easily inferred. For this reason, a third term should be considered in Eq. [1], namely the mutual information between the concept and the next concept (Fig. 3C).
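To see why mutual information between consecutive concepts rewards deterministic dynamics, the sketch below computes it from a table of transition counts. The function name and the count tables are hypothetical, and conditioning on the selected skill (which could be handled by computing one table per skill and averaging) is left out for brevity.

```python
import numpy as np

def transition_information(counts):
    """Mutual information (nats) between a concept and the next concept.

    counts[i, j] = number of observed transitions from concept i to concept j.
    """
    joint = counts / counts.sum()
    p_now = joint.sum(axis=1, keepdims=True)
    p_next = joint.sum(axis=0, keepdims=True)
    mask = joint > 0
    ratio = joint[mask] / (p_now @ p_next)[mask]
    return float((joint[mask] * np.log(ratio)).sum())

# Three concepts that follow each other in a fixed cycle.
deterministic = np.array([[0, 10, 0],
                          [0, 0, 10],
                          [10, 0, 0]], dtype=float)
# Three concepts whose successor is completely unpredictable.
noisy = np.full((3, 3), 10.0 / 3.0)

info_deterministic = transition_information(deterministic)
info_noisy = transition_information(noisy)
print(info_deterministic, info_noisy)
```

The cyclic table reaches the maximum possible value for three equally likely concepts, the logarithm of 3, whereas the unpredictable table gives zero.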

4 Concept priors

Once concepts are learned, the question arises of how to use them to accelerate learning in a future task. We propose to obtain a Monte Carlo estimate of the Q-values of each concept-skill pair, which are the expected cumulative rewards that the agent wants to maximize. These estimates allow us to determine a high-level policy from concepts to skills. Since the states corresponding to the same concept tend to behave similarly, one would expect the state-policy (parametrized by a vector of parameters) to be close to the concept-policy (see Fig. 1D). So, a natural way to introduce this concept inductive bias is to penalize the state-policy for being distant from the concept-policy by adding a penalty term to the optimization objective of the standard method being used. This term is the Kullback-Leibler divergence between the two policies, which provides a measure of how dissimilar they are. In this term, the concept-policy is evaluated at the concept most closely related to each state.
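A penalty of this kind could be computed as in the sketch below, assuming discrete skills, a tabular concept-policy, and a hard assignment of each state to its most closely related concept. The function name, the array shapes, and the direction of the divergence (state-policy first) are assumptions made for this illustration.

```python
import numpy as np

def concept_prior_penalty(state_policy, concept_policy, concept_of_state, eps=1e-12):
    """Mean KL(state_policy(.|s) || concept_policy(.|concept(s))) over a batch.

    state_policy:     (B, A) skill probabilities for each state in the batch
    concept_policy:   (Z, A) tabular skill probabilities for each concept
    concept_of_state: (B,)   index of the concept assigned to each state
    """
    prior = concept_policy[concept_of_state]          # (B, A) prior per state
    log_ratio = np.log(state_policy + eps) - np.log(prior + eps)
    return float((state_policy * log_ratio).sum(axis=1).mean())

# Two concepts, two skills: concept 0 prefers skill 0, concept 1 prefers skill 1.
concept_policy = np.array([[0.9, 0.1],
                           [0.2, 0.8]])
concept_of_state = np.array([0, 1, 1])

agreeing = concept_policy[concept_of_state]           # state-policy equal to the prior
disagreeing = np.array([[0.1, 0.9],
                        [0.8, 0.2],
                        [0.8, 0.2]])

penalty_agreeing = concept_prior_penalty(agreeing, concept_policy, concept_of_state)
penalty_disagreeing = concept_prior_penalty(disagreeing, concept_policy, concept_of_state)
print(penalty_agreeing, penalty_disagreeing)
```

The penalty vanishes when the state-policy matches the concept-policy and grows as the two diverge, so adding it to the actor objective pulls the state-policy toward the prior.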

5 Results

We use two sets of locomotive tasks brockman2016openai; frans2018meta to test our algorithm. With both sets we follow a process of three stages for learning concepts, and for the second set we add a fourth stage for the transference of the concepts and the skills. In the first stage, the agent (Figs. 4A, 4B) learns different skills that are related to purely locomotive tasks, like jumping, walking in a straight line, or rotating. In the second stage (which might be simultaneous with the first one), the agent learns to solve different tasks from the available set. These tasks are characterized by the presence of common objects like a floor, walls, or spherical targets that the agent can perceive (Fig. 4C). In the third stage, the agent learns concepts by maximizing how informative they are about the behavior of the agent and the dynamics of the system (Eq. [1]). Finally, in the fourth stage, the agent uses the learned concepts as priors that guide its learning on a new task that displays the same objects as the previous tasks, but with a different reward function and a different placement and distribution of the objects (Fig. 4D).

Figure 4: Agents and tasks used to learn concepts. (A) Hopper: a one-legged agent that can jump by using its three rotational actuators. (B) Ant: an agent with a head and four legs. (C) Training tasks for the ant agent. In all of them the ant is rewarded for moving as fast as possible. In the tasks with green spherical targets, the ant receives a positive or a negative reward when it is inside a target, and these targets last for a finite number of steps. (D) Task where the ant agent tests the learned concepts. It is a cross-shaped maze where the ant receives a positive reward when it reaches the exit, which is a green spherical target.

The first set of tasks consists of a one-legged agent (Fig. 4A) which has to perform 4 different tasks: to run as fast as possible, to run at a desired speed, to run while achieving the largest possible height, and to run while staying at a specific low height (SI Appendix). To solve the tasks, the agent automatically learns 8 skills that do not have information about the velocity or the height, which forces them to be transferable. Based on these skills and the 4 high-level policies, our method allows learning a classifier that automatically identifies 20 different concepts related to shapes and motions (Fig. 5). The second set of locomotive tasks consists of an ant-like agent (Fig. 4B) whose ultimate purpose is to learn to escape from a maze. Initially, the ant learns to walk in a straight line, to rotate clockwise, and to rotate anti-clockwise. Then, the ant learns to avoid walls and to approach or move away from spherical targets, depending on whether their associated reward is positive or negative (Fig. 4C). In this case, the classifier is able to learn concepts like having a free path, being in front of, to the right of, or to the left of a wall, or being close to or inside a target (Fig. 6). Finally, by using these concepts, the agent is able to significantly accelerate its learning on the new task, as seen in Fig. 7. Given that the agent can identify concepts like targets or walls, it only needs to perceive these concepts a few times in order to decide how to act with respect to them, and in turn with respect to all the states that are related to them, even if the agent has not been in some, or most, of these states.

Figure 5: Concepts learned in the first set of locomotive tasks after optimizing Eq. [2]. The agent is able to abstract concepts corresponding to shapes (e.g., leg extended, flexed, and inclined) and motions (e.g., jumping, landing, and falling). (A) Progression of concepts activated during two trajectories. The number indicates the concept most closely related to the state shown. (B) t-SNE projection of several trajectories. The color of each point corresponds to its associated concept, and the colors are the same as those used in (A). Note that the concepts provide a disentangled partition of the states, even though optimizing Eq. [1] requires minimizing the mutual information between the states and the concepts. This indicates that the policies do provide a natural partition of the state space. A more detailed exposition of the concepts can be found in the SI Appendix.

Figure 6: Concepts learned in the second set of locomotive tasks. In all cases the points represent the position of the ant and their color the concept identified. The arrows indicate the orientation of the ant. The tasks follow the same order as those presented in Fig. 4C. Several common concepts can be identified across the tasks: red - inside/in front of target, purple - target in the line of sight, orange - wall in front (far), green - wall in front (close), blue - free path to walk, yellow - wall to the right, brown - wall to the left, etc. In (A, B) the points correspond to samples stored in memory, while in (C, D) they are trajectories followed by the ant in a single episode. Additionally, in (C) the ant enters the targets since they correspond to a positive reward, while in (D) the opposite happens. A more detailed exposition of the concepts can be found in the SI Appendix.

Figure 7: Comparison of performances in transferring skills to the maze task. In both cases the SAC algorithm haarnoja2018soft is used to train the policy from states to skills. Each curve corresponds to the average of 6 different seeds and 18 test episodes per data point. The shaded regions represent one standard deviation. (A) Mean number of steps required to find the exit of the maze after performing a specific number of policy gradient steps. When concept priors are used, learning converges faster and so the number of steps decreases faster. (B) Similarly, the mean returns increase faster when using concept priors, since the agent succeeds more often in finding the exit.

6 Discussion and conclusions

Our method can be understood as a self-supervised learning approach, since the skills and their associated policies automatically provide input-output pairs to train the classifier. So, with this approach the structure of the control problems is exploited to learn discrete representations, just as other methods leverage the geometric structure of the system kulkarni2019unsupervised. Our positive results, which show that conceptual representations can be learned by solving control tasks, are consistent with the neurological evidence indicating that a good portion of the concepts that humans learn, the concrete concepts, are grounded in the sensory-motor brain systems kiefer2012Conceptual. This type of concept is closely related to the notions of object and action, like wall, walking, or distant, so it was expected that precisely concepts of this type would be the ones learned. It remains to be seen whether more abstract concepts can be learned in a similar fashion. A positive outcome would strengthen the position that, while abstract concepts can be detached from any sensorimotor representation, they could in fact be learned by the same means as concrete concepts. Most likely, learning this type of concept requires considering further inductive biases concerning aspects like modularity or causal relations.

Another particularity of our method is that it can be readily combined with any skill-based hierarchical method, or essentially with any standard method where the action space is discrete, since it is an off-policy method that does not require interactions between the agent and the environment. This property makes our method a useful tool to interpret what is being learned by current deep reinforcement learning methods, thus making it possible to extract conceptual knowledge from them.

While we constrained ourselves to a self-supervised off-policy approach, we believe that the concepts might be learned simultaneously with the skills, in a similar fashion to how interest functions are learned in the options framework khetarpal2020options. Similarly, we assumed throughout the document that the concepts are task-agnostic. While this supposition motivated our approach, in general, concepts do depend on the context in which they are used. So, it also remains to be understood how concepts can be task-dependent without losing their transferability across different tasks.

In conclusion, we have introduced a deep reinforcement learning method that is able to learn conceptual representations of sensory inputs in a multi-task control setting. We showed that these representations can capture intuitive notions and allow the transference of skills from a set of known tasks to a set of unknown ones. We made three big assumptions to simplify the problem of learning transferable concepts: forcing the concepts to be exclusive and exhaustive (there is always one concept active, and if one is active the rest are not); learning in a self-supervised off-policy fashion; and considering only inflexible task-independent concepts. Taking into account all these limitations, we consider that our main contribution is not the proposed method itself but laying out the problem of concept learning. We see our method as a proof of concept that illustrates the potential of learning conceptual representations that disentangle the skills from the tasks. Ultimately, learning concepts could contribute notably to artificial agents being able to generalize as well as living beings do.


The authors would like to thank Daniel Ochoa and Juan Pablo Martínez for helpful discussions.


Supplementary Information



All the environments were simulated with the MuJoCo simulator todorov2012mujoco, by means of the OpenAI gym wrapper brockman2016openai. The Hopper agent belongs to the OpenAI gym library and its related code was modified to create different tasks. The Ant agent is the same one used in frans2018meta and the related tasks were defined by us.


The standard Hopper environment consists of a one-legged agent with three rotational actuators that should learn to walk as fast as possible. When the agent reaches certain height and angle thresholds it is restarted. We modified these thresholds to allow the agent to perform more diverse movements, e.g., crouching. Also, we defined 4 different environments with the rewards in Table 1. The resulting behaviors can be observed in Fig. 8.


The Ant agent consists of a semi-spherical body with 4 legs attached and a head. Its actuators correspond to 8 joints, 2 for each leg. Given that only two types of objects were used, walls and targets, the visual system of the Ant agent consisted of 3 simple gray-scale arrays. If a ray from the head of the agent to a wall was shorter than a predefined threshold, the value of the respective pixel in the first array was proportional to the length of the ray. Similarly, the second array corresponded to the distance from the head of the agent to a target, and the third array corresponded to how intense a target in the line of sight was.

Reinforcement Learning Techniques

Hopper training

We trained skills and high-level policies in parallel, following a scheme similar to the one explained in frans2018meta. We used the soft actor-critic algorithm (SAC) haarnoja2018soft to train both skills and high-level policies. Since the skill space is discrete, we used a non-conventional version of SAC, which is detailed in Alg. 1. After a fixed number of steps, the task was randomly changed during an episode to ensure a diversity of states in the initiation sets for all tasks. This forced the agent to make marked transitions of behavior, which resulted in disentangled skills and concepts (see Movie S1).

Ant training

The ant training consisted of 3 stages. In the first stage, 3 different skills were trained in parallel using the standard SAC haarnoja2018soft: walking along a goal direction, rotating to the left, and rotating to the right. To achieve this, an appropriate reward signal related to linear or angular velocities was defined (see Movie S2). In the second stage, 4 different high-level policies were learned by means of the SAC algorithm and noisy dueling deep Q networks wang2016dueling; fortunato2018noisy. Each policy corresponded to a different task where the agent was presented with two types of objects: walls and targets. In all of these tasks the agent received a reward signal that encouraged having a large linear velocity and, in the two tasks where targets appeared, encouraged the agent to either seek or avoid the targets (see Movie S3). In the third and final stage, the Ant was trained to escape from a maze with two different methods: SAC with random network distillation (RND) burda2019exploration, and SAC with RND and concept priors. In SAC, the policy is optimized by reducing its “distance” to a target distribution that is computed from the action-value networks. The concept priors provide a second target distribution, which at the start of training might be a better target than the original one. For this reason, in the second method we express the actor loss as a convex combination of the “distances” to the two targets. The weight in this combination depends on the novelty of each state: if a state has not been visited many times, the original target should not be trusted, and so the second “distance” is given more relevance. In order to calculate this novelty, we take advantage of the RND method, which naturally provides such a quantity. The target provided by the concepts is the high-level policy from concepts to skills. In order to determine this policy we use every-visit Monte Carlo updates sutton2018reinforcement to estimate a matrix of Q-values. We take this approach since concepts correspond to observations in a partially observable MDP, which rules out any simple bootstrapping method. Alg. 2 provides a detailed explanation of the second method (the first one is identical if all steps related to the concept priors are ignored) (see Movie S4).
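The two ingredients described above, the every-visit Monte Carlo estimate of the concept-skill Q-matrix and the novelty-weighted combination of the two actor losses, can be sketched as follows. This is an illustration under assumptions rather than a transcription of Alg. 2: the function names, the softmax used to turn Q-values into a concept-policy, and the assumption that the novelty signal has been rescaled to the interval [0, 1] are all hypothetical.

```python
import numpy as np

def monte_carlo_concept_q(episodes, n_concepts, n_skills, gamma=0.99):
    """Every-visit Monte Carlo estimate of Q(concept, skill).

    episodes: list of episodes, each a list of (concept, skill, reward) steps.
    """
    returns_sum = np.zeros((n_concepts, n_skills))
    visits = np.zeros((n_concepts, n_skills))
    for episode in episodes:
        g = 0.0
        for concept, skill, reward in reversed(episode):
            g = reward + gamma * g              # discounted return from this step
            returns_sum[concept, skill] += g    # every visit contributes
            visits[concept, skill] += 1
    # Unvisited pairs keep a Q-value of zero.
    return np.divide(returns_sum, visits,
                     out=np.zeros_like(returns_sum), where=visits > 0)

def concept_policy_from_q(q, temperature=1.0):
    """Softmax over skills for each concept (an assumed way to obtain the prior)."""
    logits = (q - q.max(axis=1, keepdims=True)) / temperature
    weights = np.exp(logits)
    return weights / weights.sum(axis=1, keepdims=True)

def mixed_actor_loss(loss_q_target, loss_concept_target, novelty):
    """Convex combination of the two 'distances'; novel states trust the concept prior more."""
    w = np.clip(novelty, 0.0, 1.0)
    return (1.0 - w) * loss_q_target + w * loss_concept_target

# One short episode: concept 0 with skill 1 (reward 0), then concept 1 with skill 0 (reward 1).
episodes = [[(0, 1, 0.0), (1, 0, 1.0)]]
q = monte_carlo_concept_q(episodes, n_concepts=2, n_skills=2, gamma=0.5)
prior = concept_policy_from_q(q)
print(q)
print(prior)
```

No bootstrapping is used: each Q-value is an average of complete observed returns, which matches the motivation given above for treating concepts as observations of a partially observable problem.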

Information Bottleneck Optimization

Information Bottleneck

Consider two random variables $X$ and $Y$ determined by a probability distribution $p(x, y)$. We can sample pairs $(x, y)$ from this distribution, but we have no access to the distribution itself. A typical machine learning task is to find a parametric function of the mean value of $Y$ given $X$, or to find a parametric approximation of the generative model $p(y|x)$. The general idea of the information bottleneck approach tishby2000information is to encode $X$ into an auxiliary random variable $Z$ such that $Z$ is maximally informative about $Y$, while using the least possible information about $X$. So, an ideal $Z$ captures the relationship of $X$ and $Y$ by throwing away the components of $X$ that are not related to $Y$. To achieve this, it is typically assumed that the three variables satisfy a fork probabilistic structure where $Y$ and $Z$ are conditionally independent given $X$. This graph structure is commonly called a Markov chain and is denoted as $Y \leftrightarrow X \leftrightarrow Z$. The problem is then to find the generative model $p_\theta(z|x)$ with parameters $\theta$ that maximizes the information bottleneck function:

$$\mathcal{L}_{IB}(\theta) = I(Z; Y) - \beta \, I(Z; X),$$

where $I(Z; Y)$ is the mutual information between $Z$ and $Y$. The parameter $\beta$ controls the trade-off between preserving the structure of $Y$ through $Z$, and maximally encoding $X$ with $Z$.
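For discrete variables the information bottleneck function can be evaluated exactly, which gives a feel for the trade-off. The sketch below is only an illustration under assumed notation: X is the input, Y the variable to be predicted, Z the auxiliary encoding, and `beta` the trade-off parameter; the function names and the toy distributions are hypothetical. It uses the Markov chain assumption to obtain the joint distribution of Z and Y from the encoder.

```python
import math

def mutual_information(joint):
    """I(A;B) in nats from a joint probability table joint[a][b]."""
    pa = [sum(row) for row in joint]
    pb = [sum(col) for col in zip(*joint)]
    mi = 0.0
    for a, row in enumerate(joint):
        for b, p in enumerate(row):
            if p > 0:
                mi += p * math.log(p / (pa[a] * pb[b]))
    return mi

def ib_objective(p_xy, p_z_given_x, beta):
    """Evaluate I(Z;Y) - beta * I(Z;X) for a discrete encoder p(z|x)."""
    n_x, n_y, n_z = len(p_xy), len(p_xy[0]), len(p_z_given_x[0])
    p_x = [sum(row) for row in p_xy]
    # Joint of Z and X: p(z, x) = p(x) p(z|x).
    p_zx = [[p_x[x] * p_z_given_x[x][z] for x in range(n_x)]
            for z in range(n_z)]
    # Markov chain Y <-> X <-> Z: p(z, y) = sum_x p(x, y) p(z|x).
    p_zy = [[sum(p_xy[x][y] * p_z_given_x[x][z] for x in range(n_x))
             for y in range(n_y)] for z in range(n_z)]
    return mutual_information(p_zy) - beta * mutual_information(p_zx)

# Toy example: four inputs, and Y only depends on which half X falls in.
p_xy = [[0.25, 0.0], [0.25, 0.0], [0.0, 0.25], [0.0, 0.25]]
grouped = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]  # merges inputs with equal Y
identity = [[1.0 if z == x else 0.0 for z in range(4)] for x in range(4)]

score_grouped = ib_objective(p_xy, grouped, beta=0.5)
score_identity = ib_objective(p_xy, identity, beta=0.5)
```

Both encoders keep all the information about Y, but the grouped encoder uses less information about X, so it obtains the higher score.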

Concept Learning with the Deep Information Bottleneck

The objective function consists of two information bottlenecks: one that encapsulates the idea of concepts as invariant sets, and one that works as a regularization term that encourages deterministic transitions between concepts. Specifically, the objective function is:

Calculating this function for the parameters requires access to quantities like or . Since both random variables and are discrete, using neural networks to estimate these quantities might not be the best approach, because learning might be too slow or too unstable. Instead, we only use a neural network to estimate the classifier , and we use the Markov chain assumption of the information bottleneck, together with and the trained policies , to estimate the required quantities. For example:


Furthermore, since the classifier changes slowly, we opt to estimate these quantities with visit count tensors that accumulate such sums over different batches, with each batch given more relevance than the previous ones. This is achieved by multiplying the tensors by a forgetting factor. See Alg.3 for a detailed description of the procedure used to learn concepts.
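A minimal sketch of a visit count tensor with a forgetting factor follows. The update rule (decay the old counts, then add the soft counts of the new batch), the function names, and the toy batches are assumptions for illustration; the exact rule used by the authors is the one given in Alg.3.

```python
def soft_counts(concept_probs, skill_probs):
    """Sum over a batch of the outer products p(concept|s) * p(skill|s)."""
    n_c, n_a = len(concept_probs[0]), len(skill_probs[0])
    counts = [[0.0] * n_a for _ in range(n_c)]
    for pc, pa in zip(concept_probs, skill_probs):
        for c in range(n_c):
            for a in range(n_a):
                counts[c][a] += pc[c] * pa[a]
    return counts

def update_counts(counts, batch_counts, forgetting=0.9):
    """Assumed rule: N <- forgetting * N + N_batch, so recent batches weigh more."""
    return [[forgetting * n + b for n, b in zip(row, batch_row)]
            for row, batch_row in zip(counts, batch_counts)]

def normalize_rows(counts, eps=1e-12):
    """Turn concept-skill counts into an estimate of p(skill | concept)."""
    return [[n / (sum(row) + eps) for n in row] for row in counts]

# Two hypothetical batches with 2 concepts and 2 skills.
counts = [[0.0, 0.0], [0.0, 0.0]]
batch_1 = soft_counts([[1.0, 0.0], [1.0, 0.0]], [[1.0, 0.0], [0.0, 1.0]])
counts = update_counts(counts, batch_1, forgetting=0.5)
batch_2 = soft_counts([[1.0, 0.0]], [[1.0, 0.0]])
counts = update_counts(counts, batch_2, forgetting=0.5)
policy_estimate = normalize_rows(counts)
```

After the second update the older batch contributes with half of its original weight, so the estimate for the first concept leans towards the skill seen most recently.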

Figure 8: Hopper tasks. (A) Task 0: the agent should run as fast as possible without falling. To achieve this, the agent keeps its knee slightly flexed during both jumps and landings. This posture provides stability without reducing the speed too much. (B) Task 1: the agent should run at a specific velocity. To maintain the desired speed, the agent lands in such a way that most of the momentum is lost in the impact. (C) Task 2: the agent should reach the largest possible height. To achieve this, the agent propels itself for a long time and jumps in a mostly vertical direction, without flexing the knee. (D) Task 3: the agent should run while staying at the lowest possible height without falling. The agent has to flex both its knee and its ankle in order to stay low, and it has to keep its upper body completely reclined to maintain balance.
Figure 9: t-SNE projection of different trajectories performed by the Hopper agent. The colors indicate the concepts associated with the states, as detailed in Tab.2.
Figure 10: Concepts learned by the Ant agent in task 0 of the second stage. Each frame corresponds to a subset of states where the agent was oriented in the same direction as the imaginary arrow that connects the center of the figure with the center of the frame. The colors indicate the concepts associated with the states, as detailed in Tab.3. Note how the agent learns to classify the walls, or lack thereof, depending on its position relative to them.
Figure 11: Concepts learned by the Ant agent in task 1 of the second stage. Each frame corresponds to a subset of states where the agent was oriented in the same direction as the imaginary arrow that connects the center of the figure with the center of the frame. The colors indicate the concepts associated with the states, as detailed in Tab.3. Note how the agent learns to classify the walls, or lack thereof, depending on its position relative to them.
Figure 12: Concepts learned by the Ant agent in task 2 of the second stage. The colors indicate the concepts associated with the states, as detailed in Tab.3, and the orientation of the arrows indicates the orientation of the agent. Note how the agent learns different concepts related to the targets (green circles), like being far from them (purple) or close to them (red and sky blue), or being to their left (dark blue) or right (turquoise). Also, the Ant enters all of the targets, which indicates that the agent effectively learns to collect them when they are assigned a positive reward.
Figure 13: Concepts learned by the Ant agent in task 3 of the second stage. The colors indicate the concepts associated with the states, as detailed in Tab.3, and the orientation of the arrows indicates the orientation of the agent. Similarly to Fig.12, note how the agent learns different concepts related to the targets and also how it learns to avoid them when they are associated with a negative reward.
Figure 14: Distribution of the concepts learned by the Ant agent in the four tasks of the second stage. There are 5 concepts that are only active during tasks 2 and 3 (specifically, concepts 3, 7, 10, 11, and 18). These are precisely the concepts related to the targets, so the agent learns to semantically separate the walls from the targets. Also, while 20 concepts were available for the agent to learn, it only used 16 of them. This shows the effectiveness of the information bottleneck in avoiding overfitting.
Task Reward
Fast hopper
Slow hopper
High hopper
Low hopper
Table 1: Rewards corresponding to the four hopper tasks used. makes reference to the linear velocity approximation , where is the horizontal position of the Hopper agent and is the time interval between the steps and . is the vertical position of the agent.
| Concept label | Concept | Color |
|---|---|---|
| 0 | Free-fall | Blue |
| 1 | Crouched | Orange |
| 2 | Preparing for landing / N-shape | Dark green |
| 3 | Inclined / Ready to jump / Finished landing | Red |
| 4 | Free-fall | Dark blue |
| 5 | Suspended in air | Brown |
| 6 | Controlled falling | Dark gray |
| 7 | Just landed / Flexed | Cyan |
| 8 | Jumping | Purple |
| 9 | Ascending while close to the ground | Yellow |
| 10 | Fully crouched | Dark yellow |
| 11 | Knee flexed | Green |
| 12 | Ascending while close to the ground | Pink |
| 13 | Jumping | Light blue |
| 14 | Extreme inclination (followed by termination) | Light brown |
| 15 | Leg extension / Jumping | Persian green |
| 16 | Leg extension while jumping or falling | Bright purple |
| 17 | Vertical and short jump | Dark purple |
| 18 | Knee and foot slightly flexed | Lavender |
| 19 | Still with L shape (typical initial state) | Deep sky blue |

Table 2: Concepts learned by the Hopper agent, ordered by their frequency. In general, each discrete representation effectively captures a single concept, but in some cases it is difficult to distinguish one from the others, in particular the pairs 8 and 13, and 9 and 12, and the concept 15.
| Concept label | Concept | Color |
|---|---|---|
| 0 | Not assigned | |
| 1 | Wall in top left direction | Cyan |
| 2 | Wall to the right | Yellow |
| 3 | Inside/in front of target | Light sky blue |
| 4 | Free path | Blue |
| 5 | Not assigned | |
| 6 | Wall in top left direction (close) | Pink |
| 7 | Target to the right | Dark blue |
| 8 | Wall in the "corner of the eye" | Black |
| 9 | Not assigned | |
| 10 | Distant target in sight | Purple |
| 11 | Inside/in front of target | Red |
| 12 | Not assigned | |
| 13 | Wall to the left | Brown |
| 14 | Wall in front (close) | Green |
| 15 | Wall in top right direction | Dark purple |
| 16 | Wall in front | Orange |
| 17 | Wall in top left direction (close) | Light orange |
| 18 | Target to the left | Turquoise |
| 19 | Free path | Light green |

Table 3: Concepts learned by the Ant agent. In general, the partitions determined by the classifier capture a semantic representation that is easily distinguishable from the rest. The only exceptions are the pairs of concepts 3 and 11, and 4 and 19. These pairs probably originate from errors in the policies used to learn them.

Algorithm 1 Multi-task soft actor-critic (SAC) with discrete skill space
Input:
    High-level decision period , limits and , exploration parameters and , memory capacity ,
    learning rates , , and , discount rate , target update rate , parallel training boolean
    if is False then
        Trained skills
    else
        Number of skills , number of high-level warmup episodes , number of joint training episodes
Initialize replay buffer with capacity
Initialize critics and with random parameters and
Initialize target critics and with parameters and
Initialize actor with random parameters
Initialize entropy weights
Initialize exploration factor
if is True then
    Initialize atomic replay buffer , actor-critic networks for the skills, and episode counter
for episode to do
    Choose task randomly
    for step to do
        Observe current state
        Sample skill
        Initialize reward
        for atomic step to do
            Sample action from skill :
            Execute action in simulator and observe transition , atomic reward , and termination signal
            Accumulate reward:
            if is True and then
                Store transition tuple in
                Perform standard SAC optimization step to improve the skills
            if is True then
                Store intermediate values: ,
                break
        Store transition tuple in D
        Sample batch of transitions from
        Calculate state-value targets:
        Calculate action-value targets:
        Calculate critic losses: , for
        Calculate target skill-distributions:
        Calculate actor loss:
        Calculate policy entropy:
        Calculate entropy weight loss:
        Optimize critics: , for
        Optimize actor:
        Optimize entropy weight:
        Update exploration factor:
        Update target critic parameters: , for
    if is True then
        if then
            Forget and initialize actor-critic networks for the skills
        Count episode:
Output:
    Critics and , actor , and skills

Table 4: Algorithm used to train the high-level policies for the Hopper and Ant agents.
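The inner loop of Alg.1, where a selected skill is executed for several atomic steps while its reward is accumulated, can be sketched as follows. The names `run_skill`, `env_step`, and `toy_env_step` are hypothetical, the toy environment only stands in for the simulator, and the plain undiscounted sum of atomic rewards is an assumption, since the accumulation formula is not recoverable from the listing.

```python
def run_skill(env_step, skill, state, n_atomic_steps):
    """Execute one skill for up to n_atomic_steps and accumulate its reward.

    env_step(state, action) -> (next_state, reward, done)
    skill(state) -> action
    """
    total_reward, done = 0.0, False
    for _ in range(n_atomic_steps):
        action = skill(state)                      # sample action from the skill
        state, reward, done = env_step(state, action)
        total_reward += reward                     # assumed: undiscounted sum
        if done:                                   # stop early on termination
            break
    return state, total_reward, done

def toy_env_step(state, action):
    """Toy 1-D environment: move along a line, episode ends at position 3."""
    next_state = state + action
    return next_state, float(action), next_state >= 3

def forward_skill(state):
    """Toy skill that always moves one unit forward."""
    return 1

full_run = run_skill(toy_env_step, forward_skill, state=0, n_atomic_steps=5)
short_run = run_skill(toy_env_step, forward_skill, state=0, n_atomic_steps=2)
```

The high-level policy only sees the state before and after the call, together with the accumulated reward and the termination signal, which is what gets stored as one high-level transition.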
Hyperparameter Value
Table 5: Hyperparameters used to train the Hopper agent with the Alg.1.
Hyperparameter Value
Table 6: Hyperparameters used to train the Ant agent with the Alg.1.

Algorithm 2 Soft actor-critic (SAC) with discrete skill space, random network distillation (RND), and concept priors
Input:
    High-level decision period , limits , , and , learning rates , , , , and , target rate ,
    Memory capacity , concept classifier , discount rates and , forgetting factor , intrinsic reward weight
Initialize replay buffer with capacity
Initialize extrinsic critic and actor with random parameters and
Initialize intrinsic critic
Initialize target critics and with parameters and
Initialize random network distillation nets , , and , with random parameters , target, and
Initialize visit count and action-value matrices: ,
Initialize entropy weights: ,
for episode to do
    Initialize empty trajectory lists: ,
    for step to do
        Observe current state
        Sample skill
        Execute for atomic steps (Alg.1) and observe transition , reward , and termination signal
        Store transition tuple in D and append it to
        Select concept as
        Append tuple to
        Sample batch of transitions from
        Calculate extrinsic critic loss as in Alg.1
        Calculate total values as a linear combination of extrinsic and intrinsic ones:
        Calculate target skill-distributions:
        Calculate actor divergence from target:
        Calculate policy from concepts to skills:
        Select batch concepts:
        Calculate concept prior divergence:
        Estimate states’ novelty by comparing the random nets error:
            novelty ratios
        Calculate actor loss:
        Calculate policy entropies: ,
        Calculate exploration factors and based on
        Calculate entropy weight losses:
            ,
        Optimize critic and actor: ,
        Optimize entropy weights: , for
        Update target critic parameters:
        if is True then
            MC update:
                Initialize , ,
                for to do
        if then
            Perform standard RND with to update and the critics and
    Perform Monte-Carlo update of the visit count and action-value matrices with
    Perform standard RND with
Output:
    Critic and actor

Table 7: Algorithm used to learn the policy in the maze environment. The algorithm is shown in simplified form. Just as in Alg.1, two critics are used to estimate the action-values.
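The Monte-Carlo update of the visit count and action-value matrices can be sketched as an every-visit update over one finished trajectory of (concept, skill, reward) tuples. This is a textbook-style version under stated assumptions: it uses a plain running average and a softmax to turn Q-values into a policy from concepts to skills, whereas Alg.2 also lists a forgetting factor among its inputs; the function names and the temperature parameter are hypothetical.

```python
import math

def every_visit_mc_update(q, counts, trajectory, gamma=0.99):
    """Update q[concept][skill] in place with the returns of one trajectory.

    trajectory: list of (concept, skill, reward) tuples in time order.
    """
    g = 0.0
    for concept, skill, reward in reversed(trajectory):
        g = reward + gamma * g                     # return from this step onward
        counts[concept][skill] += 1
        # Running average over every visit of the (concept, skill) pair.
        q[concept][skill] += (g - q[concept][skill]) / counts[concept][skill]
    return q, counts

def concept_policy(q_row, temperature=1.0):
    """Softmax over the Q-values of one concept (assumed form of the prior)."""
    m = max(q_row)
    exps = [math.exp((v - m) / temperature) for v in q_row]
    total = sum(exps)
    return [e / total for e in exps]

# Toy trajectory with 2 concepts and 2 skills.
q = [[0.0, 0.0], [0.0, 0.0]]
counts = [[0, 0], [0, 0]]
trajectory = [(0, 0, 1.0), (1, 1, 0.0), (0, 0, 1.0)]
every_visit_mc_update(q, counts, trajectory, gamma=0.5)
prior_for_concept_0 = concept_policy(q[0])
```

Because full returns are used instead of bootstrapped targets, the estimate does not rely on the concept being a Markov state, which matches the motivation given for choosing Monte-Carlo updates.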
Hyperparameter Value
Table 8: Hyperparameters used to train the Ant agent in the maze task with the Alg.2.

Algorithm 3 Information bottleneck optimization for concept learning
Input:
    Replay memory with transitions
    Trained high-level policies
    Forgetting parameter
    Number of training epochs
    Bottleneck weights , ,
Initialize visit count tensors and : ,
Initialize concept classifier with random parameters
for epoch to do
    Sample batch of transitions from
    Calculate skill distributions using the high-level policies
    Determine concept and next-concept distributions: ,
    Count skill-concept visits:
    Count concept-transition visits:
    Update visit count tensors: ,
    Estimate the concept distribution:
    Estimate the high-level policies and : ,
    Estimate the transition distributions: ,
    Estimate concept entropies:
        ,
    Estimate skill entropies:
        ,
    Estimate transition entropies:
        ,
    Calculate the mutual informations:
        , ,
    Calculate the classifier loss:
    Optimize the classifier parameters:
Output:
    Concept classifier

Table 9: Algorithm used to learn the concepts for both the Hopper and the Ant agents.
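One way to quantify how deterministic the transitions between concepts are is the conditional entropy of the next concept given the current one, estimated from a concept-transition count matrix. The sketch below is an illustration only: the function name is hypothetical, and whether this exact quantity is the transition entropy used in Alg.3 is an assumption.

```python
import math

def transition_conditional_entropy(transition_counts):
    """H(C' | C) in nats from counts[c][c_next] of concept transitions."""
    total = sum(sum(row) for row in transition_counts)
    entropy = 0.0
    for row in transition_counts:
        row_sum = sum(row)
        if row_sum <= 0:
            continue                      # concept never visited
        p_c = row_sum / total             # estimate of p(c)
        for n in row:
            if n > 0:
                p_next = n / row_sum      # estimate of p(c' | c)
                entropy -= p_c * p_next * math.log(p_next)
    return entropy

# Deterministic transitions: each concept always leads to the other one.
deterministic = transition_conditional_entropy([[0, 4], [4, 0]])
# Maximally uncertain transitions between two concepts.
uniform = transition_conditional_entropy([[2, 2], [2, 2]])
```

A value of zero means that the current concept fully determines the next one, while larger values indicate noisier transitions, so a regularizer of this kind would push the classifier towards the first case.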
Hyperparameter Value
Table 10: Hyperparameters used to learn concepts with Alg.3.