Perspective Taking in Deep Reinforcement Learning Agents

Perspective taking is the ability to take the point of view of another agent. This skill is not unique to humans as it is also displayed by other animals like chimpanzees. It is an essential ability for efficient social interactions, including cooperation, competition, and communication. In this work, we present our progress toward building artificial agents with such abilities. To this end we implemented a perspective taking task that was inspired by experiments done with chimpanzees. We show that agents controlled by artificial neural networks can learn via reinforcement learning to pass simple tests that require perspective taking capabilities. In particular, this ability is more readily learned when the agent has allocentric information about the objects in the environment. Building artificial agents with perspective taking ability will help to reverse engineer how computations underlying theory of mind might be accomplished in our brains.


1 Introduction

Many decisions we take depend on others, what they think, what they believe, and what we know about what they know. This ability to understand and infer the mental states of others is called theory of mind [1] or mindreading [2]. Not only humans have the ability to take into consideration what others think and believe. In controlled experiments it has been shown that chimpanzees can know what other conspecifics see and know [3]. Here we ask whether artificial intelligence (AI) agents controlled by neural networks [4] could also learn to infer what other agents perceive and know. In particular, we test here the ability of an agent trained via reinforcement learning to acquire one essential part of theory of mind: perspective taking.

Figure 1: (a) Overview of the simulation environment and vision modes. The artificial monkey with green circle is the subordinate agent and the one with red circle is the dominant agent. The hands of both agents point to their orientation. Below in the same diagram it is illustrated how egocentric and allocentric visual representations differ. In the egocentric mode objects are perceived relative to the perceiving agent’s position and orientation. In the allocentric mode the agent perceives the objects in terms of their absolute location with respect to a fixed reference. (b) Two examples of the subordinate agent goal-oriented behavior driven by a neural network controller. In the top panel the agent should avoid the food as it is observed by the dominant. In the bottom panel the agent should acquire the food as it is not observed by the dominant. The path the agent followed is marked with the red triangles. The green arrow represents the final position and orientation of the subordinate (monkey with green circle) when the episode terminated.

Perspective taking is looking at things from a perspective that differs from our own [5]. It could be defined as "the cognitive capacity to consider the world from another individual’s viewpoint" [6]. It is one of the social competencies that underlies social understanding in many contexts [7]. The perspective taking ability is not unique to humans and has been observed in other animals like chimpanzees [3]. Chimpanzee social status is organized hierarchically (dominant, subordinate) [8], which is at full display during food gathering: when there is food available that both can reach, the dominant animal almost always obtains it. But what happens if the dominant could potentially reach the food placed behind an obstacle, but does not know that food is there? Can the subordinate take advantage of this? In a series of experiments [3] two chimpanzees were set into two separate cages facing each other with food positioned between them. The researchers manipulated what the dominant and the subordinate apes could see. For example in one condition, one piece of food could not be seen by the dominant chimpanzee. The results demonstrated that the subordinate animal exploited this favourable situation and indeed obtained more food in this condition. Hence, it was able to consider what the dominant chimpanzee could and could not see, i.e. take the perspective of the dominant chimpanzee into account [3, 9, 10]. This work done with chimpanzees was the inspiration for our study.

The aim of the present work is to study whether an AI agent controlled by a neural network can learn to solve a similar perspective taking task using reinforcement learning (RL). RL is a branch of AI that allows an agent to learn by trial and error while interacting with the environment. In particular, the agent must learn to select the best action in each specific state to maximize its cumulative future reward [11]. The agent could be, for example, an autonomous robot [12] or a character in a video game [13]. The idea behind learning by interacting with an environment is inspired by how human and animal infants learn from the rich cause-effect or action-consequence structure of the world [11, 14, 15]. Therefore, RL is a biologically plausible mechanism for learning certain associations and behaviors.

Recently, Rabinowitz and colleagues have proposed a neural network to model agents’ behavior in a grid world [16]. The proposed network was trained using meta-learning; in a first stage the network was trained to learn some priors for the behavior of different types of agents to subsequently speed up the learning of a model of a specific agent from a small number of behavioral observations. Their approach was a first step to induce theory of mind faculties in AI agents that were indeed able to pass some relevant tests for theory of mind skills. However, as the authors themselves note in their paper, their current approach is limited in several important aspects that require future work. To name a few, the observer agent learning to model the behavior of others is trained in a supervised manner, it has full observability of the environment and of the other agents, and it is not itself a behaving agent.

In this paper we are interested in the emergence of perspective-taking faculties in behaving agents trained via RL with partial observability. We believe that these are more plausible conditions to model how humans and other animals might develop perspective taking abilities. Furthermore, we are interested in a specific question about perspective taking: is it simpler to learn perspective taking with allocentric or egocentric representations of the environment? With allocentric input the position of other objects and agents is presented in relation to each other, independently of the position of the perceiving agent. With egocentric input the position of all objects and other agents is given with respect to the position of the perceiving agent. This means that, for example, when the agent changes its orientation the whole world will rotate. See Fig 1 for an illustration of the two modes of visual input. From neuroscience and behavioral experiments it is known that although animals perceive the world from the egocentric viewpoint, this information is transformed to an allocentric code in structures like the hippocampus [17, 18, 19]. Presumably the fact that this transformation is computed in the brain hints that the allocentric view enables some functions that cannot be achieved through egocentric representation alone [17, 19]. It is possible that perspective taking is one of these functions. Intuitively it seems that taking the perspective of the other agent demands ignoring one's own sensory input and taking into account the relations between the other agent and the objects in the environment, just as provided by the allocentric representation. Hence, we hypothesize that RL agents can learn to solve perspective taking tasks more easily from allocentric input representations than from egocentric ones.

Here we do not claim that RL captures all aspects of perspective taking or is the exact model of how perspective taking is learned in biological organisms [20]. However, we hope that understanding the capabilities and limitations of RL in acquiring perspective taking skills will lead to a better algorithmic understanding of the computational steps required for perspective taking in biological organisms.

2 Methods

2.1 Environment and Task

Experiments were conducted using environments created with the Python toolbox Artificial Primate Environment Simulator (APES) [21]. The toolbox simulates a 2D grid world in which multiple agents can move and interact with different items. Agents obtain information from their environment according to a visual perception model. Importantly, APES includes different visual perception algorithms that calculate visual input based on an agent's location, orientation, visual range and angle, and obstacles. In particular, for this work, we simulate two modes for the agents' vision: egocentric and allocentric (for detailed descriptions see the subsection on visual encoding). For further specifics on the toolbox the reader can access the associated GitHub repository.

For the perspective taking task, we generated a grid world environment where each element can spawn randomly within specific regions. The elements considered included two agents (a dominant and a subordinate) and a single food item (reward). In the present experiments only the subordinate agent is controlled by an RL algorithm and can execute actions in the environment by moving around and changing its orientation. The dominant agent is not controlled by any learning algorithm, but its role is critical. The value of the reward obtained by the subordinate on reaching the food depends on whether the food is visible from the dominant's point of view. If the food is retrieved by the subordinate while observed by the dominant, the value of the food item becomes negative (to mimic the punishment received from the dominant in nature). If the food is obtained while not observed by the dominant, the value of the reward is positive. Table 1 lists the rewarded events and their corresponding values.

Event                                   Reward value
Eating food observed by dominant        -1000
Eating food not observed by dominant    +1000
Every time step                         -0.1
Table 1: List of the events rewarded in the experiments and their respective values.
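
The reward scheme in Table 1 can be written as a small function. This is only a sketch: the reward values come from Table 1, while the function name, argument names, and the choice to also charge the time-step penalty on the final step are assumptions made for illustration.

```python
# Reward values from Table 1; all names below are illustrative.
REWARD_EAT_OBSERVED = -1000.0
REWARD_EAT_UNOBSERVED = 1000.0
REWARD_TIME_STEP = -0.1


def step_reward(ate_food: bool, food_observed_by_dominant: bool) -> float:
    """Reward received by the subordinate for a single time step."""
    # Assumption: the per-step penalty applies on every step, including the last.
    reward = REWARD_TIME_STEP
    if ate_food:
        if food_observed_by_dominant:
            reward += REWARD_EAT_OBSERVED
        else:
            reward += REWARD_EAT_UNOBSERVED
    return reward
```

Under these assumptions, an agent that never eats accumulates only the small time penalty, which is why avoiding observed food is far better than eating it.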

In the experiments we considered 3 increasingly complex levels for the perspective taking task depending on the coverage of the food and dominant’s locations (see Fig 2).

2.1.1 Level 1

Level 1 represents the simplest setting where the dominant agent and the food item have a fixed position in the environment. Fig 2A shows the set of allowed initial positions for the dominant agent (red cells), food item (blue cells), and subordinate agent (green cells). To successfully solve the task the subordinate agent must learn whether to navigate to the food's location based on the dominant's orientation only. The dominant can change its orientation (4 possible orientations) between episodes. The subordinate agent can spawn in any of the 11 positions at the leftmost side of the arena and is initially oriented towards the right. Both agents have a 180-degree field of vision. Thus, there are a total of 44 possible initial configurations in this level (11 subordinate positions × 4 dominant orientations). In addition, the movement and rotation of the subordinate agent across the grid makes the total number of configurations .

Figure 2: Possible starting positions for each element in the environment. The subordinate agent (green) always looks right at the start of the episode. The dominant agent (red) and food (blue) can spawn in larger areas with higher levels of the task. a) In Level 1 the dominant has a fixed position but it has different orientations between episodes. The food has a fixed position. b) In Level 2 the food can spawn in a area. c) In Level 3 the dominant and the food can both spawn in a area. Note that overlapping between elements is not allowed.

2.1.2 Level 2

In Level 2, the complexity is increased by allowing the food item to have more places to spawn inside a area (see Fig 2B). In this case, an optimal subordinate agent should decide whether to navigate to the food item depending on both its perception of the dominant agent's orientation and the food position. Together, the dominant's orientation and the food item's location determine whether the food item is within the dominant's field of vision, and hence whether the optimal strategy is to navigate towards the food or not. In Level 2, the number of different initial configurations is , and the total number of configurations during the episode is .

2.1.3 Level 3

In Level 3, both the dominant and food item can spawn inside a area (see Fig 2C). This implies that the subordinate needs to integrate three pieces of information to successfully determine whether the food item is observed by the dominant: 1) the orientation of the dominant, 2) the position of the dominant, 3) the position of the food. Fig 2C shows the possible locations of each element. In Level 3 there are possible initial configurations. Upon movement of the subordinate the number of possible states becomes .

For all three levels, we note that since the subordinate agent moves and rotates around the environment, the dominant agent and the food are not always directly perceived. However, the agent is equipped with a short-term memory in the form of an LSTM layer that allows it to integrate temporal information [22]. In all experiments the vision angle of the subordinate and dominant was 180 degrees.
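
The ground-truth question behind all three levels (is the food inside the dominant's 180-degree field of vision?) can be sketched as a half-plane test. This is a simplified illustration, not the APES perception algorithm: the (row, col) coordinate convention, the unlimited visual range, the absence of obstacles, and the treatment of cells exactly beside the observer as visible are all assumptions.

```python
# Facing directions as (row, col) steps; row 0 is assumed to be the top of the arena.
ORIENTATION_VECTORS = {
    "up": (-1, 0),
    "down": (1, 0),
    "left": (0, -1),
    "right": (0, 1),
}


def in_field_of_vision(observer_pos, orientation, target_pos):
    """True if the target lies in the observer's 180-degree half-plane."""
    d_row = target_pos[0] - observer_pos[0]
    d_col = target_pos[1] - observer_pos[1]
    f_row, f_col = ORIENTATION_VECTORS[orientation]
    # A non-negative projection onto the facing direction means the target is
    # in front of or exactly beside the observer (boundary rule is assumed).
    return d_row * f_row + d_col * f_col >= 0


def should_eat(dominant_pos, dominant_orientation, food_pos):
    """The subordinate should take the food only if the dominant cannot see it."""
    return not in_field_of_vision(dominant_pos, dominant_orientation, food_pos)
```

For example, with the dominant at (5, 5) and the food at (5, 3), the food is visible when the dominant faces left and hidden when it faces right.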

2.2 Model

2.2.1 Input

The input to the network controlling the subordinate's actions is a set of binary maps. They encode the properties of the agents and other elements in the environment. The list of inputs to the network includes:

  • Spatial location of elements: or binary one-hot map for each element with 1 at the corresponding element position.

  • Observability mask: or binary mask which indicates the field of vision of the subordinate. It helps to distinguish whether a cell in the grid world is empty or out of the field of vision.

  • Orientations: binary one-hot vector of size 4 for each agent with 1 at the corresponding agent orientation.
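
The three kinds of input listed above can be illustrated with plain Python lists. This is a minimal sketch: the grid size, the ordering of the orientation vector, and all function names are hypothetical and are not taken from APES.

```python
GRID_SIZE = 5  # hypothetical arena size, chosen only for this example
ORIENTATIONS = ("up", "right", "down", "left")  # assumed ordering of the one-hot vector


def one_hot_map(position, observable, size=GRID_SIZE):
    """Binary map with a single 1 at `position`, or all zeros if that cell is unobserved."""
    grid = [[0] * size for _ in range(size)]
    row, col = position
    if observable[row][col]:
        grid[row][col] = 1
    return grid


def one_hot_orientation(orientation):
    """Binary vector of size 4 with a 1 at the index of `orientation`."""
    return [1 if name == orientation else 0 for name in ORIENTATIONS]


# Example: the subordinate only observes the two rightmost columns.
observable = [[1 if col >= 3 else 0 for col in range(GRID_SIZE)] for _ in range(GRID_SIZE)]
food_map = one_hot_map((2, 4), observable)      # food inside the field of vision
dominant_map = one_hot_map((0, 0), observable)  # dominant outside it: map stays empty
dominant_orientation = one_hot_orientation("left")
```

The empty `dominant_map` together with a 0 in `observable[0][0]` is what lets the network tell "not seen" apart from "seen and empty".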

2.2.2 Visual encoding: allocentric vs egocentric

In this work we compare two types of visual perception. With allocentric input the locations and orientations of items in the environment are encoded in reference to a fixed system of coordinates (as if the vision is provided by a fixed camera with a top-down view). In egocentric input, the items are perceived from the eyes of the subordinate agent, and hence they change in relation to the agent movements and rotations.

In allocentric encoding, we feed to the network arrays that represent positional information of items in the environment, in addition to arrays encoding the agents' orientations. In this mode, when the subordinate changes its orientation and moves, only four bits will change, corresponding to its previous location, current location, previous orientation, and current orientation.

In egocentric encoding, the subordinate agent's own position and orientation with respect to the input remain fixed despite the agent's movement or rotation. Likewise, when we humans (and other animals) turn left or right, from our own perspective we still look forward and remain in the same position. Hence, in egocentric encoding the network is not fed the subordinate agent's orientation, but it is still fed the relative orientation of the dominant. Thus, the dominant orientation input is based on where the dominant looks relative to the subordinate (toward the subordinate, in the same direction as the subordinate, or to its left or right). In the egocentric condition the input arrays that represent the environment are since the agent can look at opposite orientations from opposite corners of the arena.
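
One way to picture the relative dominant orientation is as a rotation of headings into the subordinate's frame. The sketch below is an assumption-laden illustration: the clockwise ordering of headings, the label strings, and the reading of "toward the subordinate" as "opposite heading" are chosen here and are not taken from the toolbox.

```python
ORIENTATIONS = ("up", "right", "down", "left")  # clockwise order, an assumed convention
RELATIVE_LABELS = (
    "same direction as the subordinate",
    "to the subordinate's right",
    "opposite to the subordinate",
    "to the subordinate's left",
)


def relative_orientation(dominant, subordinate):
    """Index (0-3) of the dominant's heading in the subordinate's rotated frame."""
    return (ORIENTATIONS.index(dominant) - ORIENTATIONS.index(subordinate)) % 4


# The same absolute heading of the dominant maps to different egocentric inputs
# depending on where the subordinate itself is facing.
seen_while_facing_up = RELATIVE_LABELS[relative_orientation("right", "up")]
seen_while_facing_right = RELATIVE_LABELS[relative_orientation("right", "right")]
```

This makes the cost of egocentric input concrete: every rotation of the subordinate changes the encoding of an otherwise unchanged dominant.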

2.2.3 Architecture

In our model we used a neural network to control the actions of the subordinate agent. The architecture and hyperparameters are the same as in [23] with two important exceptions. First, additional inputs are fed to the network: the orientations of both agents are input after the convolutional layer, as shown in Fig 3. Second, we used a dueling Q network [24] instead of an advantage actor-critic model.

In a dueling Q network the state-action value (Q-value) calculation is based on two separate estimates: the state value (how good the current state is) and the advantages (which benefits are obtained from each action), as shown in Equation 1:

$$Q(s, a; \theta) = V(s; \theta_v) + \Big( A(s, a; \theta_a) - \max_{a'} A(s, a'; \theta_a) \Big) \tag{1}$$

where $s$ is a state, $a$ is an action, $V$ is the state value, $A$ is the advantage, and $a'$ is the next action. $\theta$ represents the network parameters, while $\theta_v$ is the subset of parameters used in the value network and $\theta_a$ is the subset of parameters used in the advantage network.
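
How a state value and a set of advantages combine into Q-values can be sketched in a few lines. The dueling network paper [24] describes both a max-subtracted and a mean-subtracted aggregation; the sketch takes the aggregator as an explicit argument and makes no claim about which one the model here uses. The function and argument names are illustrative.

```python
def dueling_q_values(state_value, advantages, aggregator):
    """Combine V(s) and A(s, a) for all actions into Q(s, a).

    `aggregator` is "max" or "mean"; both variants appear in the dueling
    network literature, and which one applies here is left open.
    """
    if aggregator == "max":
        baseline = max(advantages)
    elif aggregator == "mean":
        baseline = sum(advantages) / len(advantages)
    else:
        raise ValueError("aggregator must be 'max' or 'mean'")
    # Subtracting a baseline makes the value/advantage split identifiable.
    return [state_value + (adv - baseline) for adv in advantages]


q_mean = dueling_q_values(2.0, [1.0, 3.0, 2.0], "mean")
q_max = dueling_q_values(2.0, [1.0, 3.0, 2.0], "max")
```

Both variants preserve the ranking of actions, so the greedy action is the same; they differ only in how the Q-values are offset relative to the state value.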

Using a dueling network architecture involves updating two network models: a training model (parametrized by $\theta$), whose weights are updated using gradient descent, and a target model (parametrized by $\theta^{-}$), whose weights are periodically $\tau$-averaged with the training model's weights, as shown in Equation 2:

$$\theta^{-} \leftarrow \tau\,\theta + (1 - \tau)\,\theta^{-} \tag{2}$$
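
The averaging of target and training weights can be sketched as a soft update over flat lists of weights. The weighting convention (the averaging coefficient multiplies the training weights), the coefficient value, and the names are assumptions for illustration.

```python
def soft_update(target_weights, training_weights, tau):
    """Move each target weight a fraction `tau` of the way toward the training weight."""
    return [
        tau * w_train + (1.0 - tau) * w_target
        for w_target, w_train in zip(target_weights, training_weights)
    ]


# Hypothetical weights and coefficient, chosen only to show the effect.
target = [0.0, 10.0]
training = [1.0, 0.0]
target = soft_update(target, training, tau=0.1)
```

With a small coefficient the target model changes slowly, which keeps the bootstrapped Q-value targets stable while the training model is updated by gradient descent.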


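The $\epsilon$-greedy policy chooses a random action with probability $\epsilon$ and the action with maximum Q-value otherwise, as in Equation 3:

$$\pi(s) = \begin{cases} \text{random action} & \text{with probability } \epsilon \\ \arg\max_{a} Q(s, a; \theta) & \text{otherwise} \end{cases} \tag{3}$$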


The action space includes moving up, down, left, right, and "not move". Every moving action is accompanied by a rotation so that the agent is always looking at the same direction it is heading. For example, if the agent moves up, it will also rotate to face north. This conforms to the fact that humans and most animals advance in the same direction they are facing.
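
Action selection over the five actions described above can be sketched as follows. The ordering of the action tuple and the function signature are assumptions; only the rule itself (random action with the exploration probability, greedy action otherwise) comes from the text.

```python
import random

ACTIONS = ("up", "down", "left", "right", "stay")  # assumed ordering


def epsilon_greedy(q_values, epsilon, rng=random):
    """Return the index of the chosen action."""
    if rng.random() < epsilon:
        # Explore: uniform random action.
        return rng.randrange(len(q_values))
    # Exploit: action with the highest Q-value (first one on ties).
    return max(range(len(q_values)), key=lambda i: q_values[i])


greedy_action = ACTIONS[epsilon_greedy([0.1, 0.9, 0.3, 0.2, 0.0], epsilon=0.0)]
```

Setting the exploration probability to zero, as is done during test episodes, makes the policy fully deterministic given the Q-values.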

To summarize, Fig 3 illustrates the network architecture, its input layers and the output to control the subordinate agent.

Figure 3: The architecture used in the model has 1 convolutional layer with 6 filters and a kernel of size 3, followed by 2 fully connected layers with 32 hidden nodes each and 1 LSTM layer with 128 cells. The output of the LSTM layer is used to learn the advantages $A(s, a)$ for each possible action and the state value $V(s)$. Together the state value and advantage heads are used to compute the Q-values using Equation 1, which determines the policy of the agent. The input to the network includes 5 binary matrices that represent: the dominant agent, the subordinate agent, the food item, the observable area (a map that contains 1 at positions of the grid within the field of vision of the subordinate agent and 0 otherwise; on all other maps only observable positions are populated), and obstacles (the obstacle matrix did not hold any information in the current experiments). Each binary matrix indicates the position of an observed element by 1 and contains zeros otherwise. The orientation of the dominant and subordinate agents is represented by a one-hot vector of size four for each agent. These vectors are concatenated with the flattened output of the convolutional layer.

2.3 Training

All models for Level 1 and Level 2 were trained for 6 million steps. Models for Level 3, with the highest complexity, were trained for 20 million steps. An episode is terminated when the food is eaten by the subordinate or when the food has not been eaten after 100 time steps.

We used replay memory to remove sequential correlations and smooth distribution changes during training, as is usually done in other studies [25]. The replay buffer size used is trajectories. The maximum length of each trajectory is 100 time steps. Shorter trajectories were padded with zeros.
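
A trajectory-level replay buffer with zero padding can be sketched with the standard library. The maximum trajectory length of 100 comes from the text; the buffer capacity, the per-step feature layout, and the class and function names are hypothetical.

```python
import random
from collections import deque

MAX_TRAJECTORY_LENGTH = 100  # from the text
BUFFER_CAPACITY = 1000       # hypothetical; not the capacity used in the experiments


def pad_trajectory(steps, feature_size, max_len=MAX_TRAJECTORY_LENGTH):
    """Pad a list of per-step feature vectors with zero vectors up to `max_len`."""
    steps = list(steps)[:max_len]
    padding = [[0.0] * feature_size for _ in range(max_len - len(steps))]
    return steps + padding


class TrajectoryReplayBuffer:
    """Fixed-capacity store of whole trajectories; the oldest are dropped first."""

    def __init__(self, capacity=BUFFER_CAPACITY):
        self._buffer = deque(maxlen=capacity)

    def add(self, trajectory):
        self._buffer.append(trajectory)

    def sample(self, batch_size, rng=random):
        # Uniform sampling breaks the temporal correlation between episodes.
        return rng.sample(list(self._buffer), batch_size)

    def __len__(self):
        return len(self._buffer)
```

Storing whole trajectories rather than single transitions keeps the step order intact, which a recurrent layer such as an LSTM needs in order to use its memory.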

For the neural network implementation we used the Keras library [26]. We used the Adam optimizer with a fixed learning rate at and annealed the exploration probability from to reach after using of the total number of steps. We clipped the gradient at 2 to prevent exploding gradients.
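
An annealing schedule for the exploration probability can be sketched as below. The text only states that the probability is annealed over a fraction of the total steps; the linear shape and every numeric value in the example are assumptions.

```python
def annealed_epsilon(step, total_steps, eps_start, eps_end, anneal_fraction):
    """Linearly interpolate from `eps_start` to `eps_end`, then hold `eps_end`."""
    anneal_steps = anneal_fraction * total_steps
    if step >= anneal_steps:
        return eps_end
    progress = step / anneal_steps
    return eps_start + progress * (eps_end - eps_start)


# Hypothetical schedule: from 1.0 down to 0.1 over the first half of 1000 steps.
schedule = [annealed_epsilon(s, 1000, 1.0, 0.1, 0.5) for s in (0, 250, 500, 900)]
```

Early steps are dominated by random actions, which helps the agent discover both outcomes of eating the food before it commits to a policy.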

3 Results

Next we present simulations to determine the performance of a subordinate agent trained by reinforcement learning to pass a simple perspective taking task. In particular, we study three levels of complexity in the perspective taking task according to the amount of information that the agent needs to take into account to successfully pass the task (see Methods section for a detailed explanation of the three levels considered). For all the levels we compare the performance of agents using allocentric or egocentric encodings for visual perception.

In all cases described in Fig 4, we evaluated the performance of the agents by testing the model every 10 training episodes. At those times we froze the model parameters, launched 100 test episodes, and recorded the obtained reward. During test episodes the exploration rate is set to zero so that all actions are selected by the neural controller and no random actions are executed.
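
The evaluation protocol can be sketched as a loop that interleaves training and frozen testing. The two callables stand in for the actual training and test routines, which are not specified here; their names and signatures are hypothetical.

```python
def evaluate(run_test_episode, n_test_episodes=100):
    """Average reward over test episodes run with exploration switched off."""
    rewards = [run_test_episode(epsilon=0.0) for _ in range(n_test_episodes)]
    return sum(rewards) / len(rewards)


def train_with_periodic_evaluation(
    run_training_episode, run_test_episode, n_training_episodes, eval_every=10
):
    """Return a learning curve as (training episode, mean test reward) pairs."""
    learning_curve = []
    for episode in range(1, n_training_episodes + 1):
        run_training_episode()
        if episode % eval_every == 0:
            # No weight updates happen inside `evaluate`, so the model is
            # effectively frozen while the test episodes run.
            learning_curve.append((episode, evaluate(run_test_episode)))
    return learning_curve
```

Each point of the resulting curve corresponds to one data point in the learning curves of Fig 4.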

3.1 Level 1

Level 1 is the simplest version of the perspective taking task. In this case the food item and the dominant agent have a fixed position in the environment, although the dominant agent changes its orientation between episodes. Hence the optimal strategy to be learned by the subordinate agent would be to reach the food based on the orientation of the dominant (when the orientation is directed upwards or leftwards, see also Fig 2A). Other information is either irrelevant and should be filtered out or should be used only for the navigation from the current position of the agent towards the food item. Note also that the LSTM layer allows the subordinate agent to have a short term context for its decision making. Thus, even when items are out of the field of vision of the agent but in the recent history, the subordinate can use this information to guide its actions.

In Fig 4, panels A and B show the average reward obtained by allocentric and egocentric agents during test episodes as a function of the number of training episodes. The learning curves (in blue) are compared to the optimal reward or ground truth (in red) obtained by an agent that would perform the best strategy for each episode. Note that the variance of the optimal reward is due to the difference in reward between the optimal food-gathering episodes (about 1000 points) and food-avoidance episodes (about 0 points), and that these two types of episodes do not occur in an exact 50/50 proportion in every batch of 100 random testing episodes.

In general, agents with either allocentric or egocentric visual input can solve the task and achieve near-optimal performance. Usually, at the end of training the agent displays a direct trajectory towards the food item when it is to be consumed (that is, when the dominant agent does not observe the food).

We note, however, that a temporary drop in performance occurs around 2000 training episodes for the agent using allocentric input. Visual inspection of the agent trajectories at this stage shows that the agent often displays a repetitive selection of actions leading to loops in its movements across the environment. Such behavior is possible due to the zero exploration rate used during test episodes and the greedy selection of actions based on Q-values. In comparison to the allocentric case, the egocentric agent maintains a more stable performance across training episodes once it has learned to solve the task.

Figure 4: Average subordinate reward (blue) versus average optimal reward (red) over 100 episodes per data point. Results were averaged over 7 different seeds. Shading indicates variability as estimated by the standard error of the mean. Panels in different rows correspond to distinct levels of the perspective taking task. Level 1: food and dominant position are fixed; Level 2: food spawns in area; Level 3: food and dominant spawn in area. Panels in different columns correspond to different types of visual spatial processing. Allocentric: spatial information is given in relation to a fixed reference point; Egocentric: spatial information is given relative to the position and orientation of the perceiving agent. The panels represent the average reward during 100 test episodes for a) Level 1 allocentric, b) Level 1 egocentric, c) Level 2 allocentric, d) Level 2 egocentric, e) Level 3 allocentric, and f) Level 3 egocentric.

3.2 Level 2

Level 2 experiments require a more complex policy than Level 1 since there is more variation in the places that the food item can spawn in. This implies that the event of the food item being observed by the dominant agent depends on the particular combination of dominant orientation and food location. Hence, the subordinate needs to learn to integrate information about the dominant orientation and food position and decide based on that when to go for the food and when to avoid it.

In Fig 4C one can observe how the testing reward in the allocentric condition improved over time. The network achieves a good performance after 2500 episodes. In this case, the good performance is sustained for the entire training period.

In comparison, in the egocentric configuration the network did not perform equally well, as shown in Fig 4D. Presumably this is due to the additional complexity that egocentric vision introduces in estimating relative positions and orientations between other objects.

3.3 Level 3

This level is the most complex for the perspective taking since both the food item and the dominant agent can spawn in different locations. This means that the subordinate agent would need to combine at least three pieces of information (dominant position, dominant orientation, and food position) to discover whether the food is positively or negatively rewarded. In this case, there are starting configurations, and more than 1 million potential states, to which the subordinate agent needs to assign an appropriate action.

In Fig 4E we report the average reward for testing episodes with allocentric vision. The network reaches a substantial level of performance, although learning is slower than in the previous levels, and it reaches a plateau after 15000 training episodes. Note that Level 1 and Level 2 agents were trained for 6 million time steps, while Level 3 agents were trained for 20 million. Different levels and seeds can all result in a different number of episodes, since episodes are terminated when the agent retrieves the food item or 100 time steps have elapsed.

As expected, the egocentric configuration did not work on Level 3, since it had already failed on the simpler Level 2. Fig 4F shows the testing reward and the large gap between the model reward and the maximal reward obtained by an agent following a perfect strategy (an agent taking the shortest path towards the food item only when it is not observed by the dominant).

We also launched 1000 additional test cases after the training finished to evaluate the behavior of the subordinate agent with allocentric input. In Fig 5A one can observe the typical behavior of the subordinate agent when it is expected to eat or expected to avoid the food. The subordinate managed to eat the food of the cases when it should eat the food (non-observed by the dominant) and avoided it of the cases when it should avoid it (observed by the dominant). The goal-directed behavior of the subordinate agent during different conditions is visualized as representative trajectories in panels B, C, D and E in Fig 5.

Figure 5: Quantification of the subordinate behavior and examples of model trajectories for a trained Level 3 allocentric agent. a) Bar plot with the percentage when the model performed the expected optimal behavior. Two expected behaviors are distinguished: 1) agent is expected to eat when the food is not observed by the dominant, and 2) agent is expected to avoid when the food is observed by the dominant. b) An example of the subordinate agent (green circle) avoiding the food although it should approach it. c) An example of the model reaching the food when it should not reach it. d) An example of the model performing the expected behavior of navigating and obtaining the food. e) An example of model behavior of avoiding the food when this is observed by the dominant agent (red circle).

Solving the perspective taking task at any of the mentioned levels presumably involves both an estimation of whether the food item is being observed by the dominant and navigational aspects to reach the food item (or avoid it). So far, the allocentric and egocentric types of visuo-spatial processing have been compared in their overall performance in the task without any distinction between these potential components. Moreover, it has been pointed out that allocentric or map-like representations enhance certain navigational skills [27]. To compare both types of processing independently of navigational aspects, we trained the model to output a binary decision of whether the food item is visible to the dominant. The models for allocentric and egocentric input are the same as described in Fig 3, with the exception that the LSTM layer is replaced by a fully connected layer and there is a single output. The input to the model is the initial state observed by the subordinate, and the training is supervised. The dataset contained around 26400 samples and was split 80-20 into training and validation sets. Fig 6 shows the accuracy on validation samples as a function of the number of training epochs. To ensure that the results are representative, all accuracies were averaged over 20 random initializations of the model weights. Overfitting was not observed, as training and testing errors were almost identical. As observed in Fig 6, the allocentric mode exhibited faster learning of the decision of whether the food item is observed by the dominant agent.
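
The 80-20 split of the supervised dataset can be sketched with the standard library. The shuffling, the seed, and the function name are assumptions; only the split ratio and the approximate dataset size come from the text.

```python
import random


def train_validation_split(samples, train_fraction=0.8, seed=0):
    """Shuffle the samples and split them into training and validation lists."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(train_fraction * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]


# Placeholder sample ids standing in for the roughly 26400 initial states.
train_ids, validation_ids = train_validation_split(range(26400))
```

With 26400 samples this gives 21120 training and 5280 validation samples.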

Figure 6: Validation accuracy when using allocentric vs egocentric inputs to predict whether the food item is observed by the dominant agent. In this case the navigational aspects of the task are eliminated and simple supervised learning is used. Line and shading indicate the mean validation accuracy and standard error of the mean (SEM) over 20 different initializations of the network weights.
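
The mean and SEM shown as line and shading can be computed as in the sketch below. It assumes the sample standard deviation (n - 1 in the denominator), which the text does not specify, and the accuracy values are made up for illustration.

```python
import math
import statistics


def mean_and_sem(values):
    """Mean and standard error of the mean across runs (sample standard deviation)."""
    mean = statistics.mean(values)
    sem = statistics.stdev(values) / math.sqrt(len(values))
    return mean, sem


# Hypothetical validation accuracies from five runs at one epoch.
accuracy_mean, accuracy_sem = mean_and_sem([0.90, 0.92, 0.94, 0.96, 0.98])
```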

4 Discussion

In this work we aimed to develop RL agents that could solve a basic perspective taking task. For that we devised a perspective taking task inspired by work done with chimpanzees. The behavior of the agents showed evidence for basic perspective taking skills, which demonstrates that at least a part of perspective taking skills can indeed be learned through reinforcement learning. Furthermore, we observed that the agents learn much more efficiently when endowed with allocentric vision for spatial processing.

4.1 Allocentric vs egocentric input

The condition with egocentric vision corresponds to the natural way in which animals interact with the world: they perceive objects and other agents from their own viewpoint. However, the brains of animals have also developed specific systems where objects are represented in an allocentric fashion, that is, in relation to each other as in a map [17, 19]. This allocentric representation allows animals to compute certain aspects of the world more easily. One such function might be perspective taking. Indeed, in our work we found that RL agents can much more readily learn perspective taking skills from allocentric representations. In fact, our RL agents did not learn the more complex perspective taking tasks (complexity levels 2 and 3) at all from egocentric input. This result suggests that in order to take the perspective of another agent, the agent (or animal) needs to transform the sensory representation to allocentric coordinates. Our work hints that one reason why allocentric coordinates are preferable is that they reduce the state space and hence simplify computations that are harder to perform with egocentric representations.

An important question in neuroscience is how this transformation from egocentric to allocentric coordinates is computed in the brain [19, 28]. It is also clear that in animals these two systems interact [18, 28]. In the present work we did not study the interactions of these two systems but rather observed agents that had one or the other type of input. In future work we seek to study how the allocentric representation is computed from the egocentric input and how these two systems interact during online decision making.
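For intuition, in a purely geometric setting the egocentric-to-allocentric transformation is a rotation by the agent's heading followed by a translation by its position. The sketch below illustrates this for 2D points. The frame conventions (egocentric x axis pointing forward, y axis pointing left, heading measured counter-clockwise from the world x axis) and the function names are assumptions made for this illustration; they do not describe how the agents in this work, or brains, perform the computation.

```python
import math


def ego_to_allo(agent_xy, agent_heading, ego_xy):
    """Map a point from the agent's egocentric frame to world coordinates.

    Assumed conventions (hypothetical): ego_xy = (forward, left),
    agent_heading in radians, counter-clockwise from the world x axis.
    """
    forward, left = ego_xy
    c, s = math.cos(agent_heading), math.sin(agent_heading)
    # Rotate by the heading, then translate by the agent's position.
    world_x = agent_xy[0] + c * forward - s * left
    world_y = agent_xy[1] + s * forward + c * left
    return (world_x, world_y)


def allo_to_ego(agent_xy, agent_heading, world_xy):
    """Inverse mapping: world coordinates to the agent's egocentric frame."""
    dx = world_xy[0] - agent_xy[0]
    dy = world_xy[1] - agent_xy[1]
    c, s = math.cos(agent_heading), math.sin(agent_heading)
    # Translate to the agent's position, then rotate by minus the heading.
    forward = c * dx + s * dy
    left = -s * dx + c * dy
    return (forward, left)


# An object two units straight ahead of an agent at (1, 1) facing "up".
food_world = ego_to_allo((1.0, 1.0), math.pi / 2, (2.0, 0.0))
```

Note that the same world position yields a different egocentric vector for every agent position and heading, which is one way to see why the egocentric input spans a larger set of states for the same underlying scene.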

4.2 Different levels of perspective taking

Although some agents in this study could solve the perspective taking tasks presented here, we would not claim that the complexity of the task is anywhere near what humans encounter in their everyday perspective taking tasks [2]. Also, although the current task was inspired by work done with chimpanzees, the chimpanzees were not trained on the task but only tested on it; they had acquired perspective taking from various encounters with different chimpanzees under natural conditions [3]. Furthermore, our RL agents could compute their decision to approach or avoid the food based on the orientation of the dominant and the food position, whereas in the original experiments with chimpanzees [3] the different conditions also involved obstacles (to hide the food). Hence, our current setup is a very simplified version of perspective taking, but hopefully it lays the groundwork for more elaborate experiments.
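To make concrete how little information such a decision requires, the following sketch implements a hand-written rule that depends only on the dominant's position and orientation and on the food position. It is not the learned policy and not the environment code used in this work: the continuous coordinates, the symmetric field of view with half-angle `half_fov`, and the function names are all assumptions introduced for illustration.

```python
import math


def dominant_sees_food(dominant_xy, dominant_heading, food_xy,
                       half_fov=math.pi / 4):
    """Return True if the food lies within the dominant's field of view.

    Hypothetical model: the dominant sees everything within +/- half_fov
    radians of its heading, with no obstacles and unlimited range.
    """
    dx = food_xy[0] - dominant_xy[0]
    dy = food_xy[1] - dominant_xy[1]
    if dx == 0 and dy == 0:
        return True  # food at the dominant's own location
    angle_to_food = math.atan2(dy, dx)
    # Wrap the angular difference into the interval [-pi, pi).
    diff = (angle_to_food - dominant_heading + math.pi) % (2 * math.pi) - math.pi
    return abs(diff) <= half_fov


def subordinate_action(dominant_xy, dominant_heading, food_xy):
    """Avoid food the dominant can see, approach food it cannot see."""
    if dominant_sees_food(dominant_xy, dominant_heading, food_xy):
        return "avoid"
    return "approach"


# Dominant at the origin facing along the positive x axis.
in_front = subordinate_action((0.0, 0.0), 0.0, (3.0, 1.0))
behind = subordinate_action((0.0, 0.0), 0.0, (-3.0, 0.0))
```

Adding obstacles, as in the chimpanzee experiments, would require an additional occlusion test along the line of sight, which this rule does not attempt.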

Based on human studies, perspective taking has been divided into two levels [2], namely level 1 and level 2 perspective taking (it is important to note that there is no correspondence with the 3 "levels of environment complexity" studied in this work). Level 1 perspective taking concerns the question of what the other agent sees, whereas level 2 is more complex because it also takes into account how the object is seen [2]. Our agents mastered level 1 perspective taking, but even here the caveat is that our RL agents were trained for thousands of episodes on the very same task. When children and chimpanzees are given level 1 perspective taking tasks, they solve them without training [2]. In this sense, we propose that the current RL agents solve level 0 perspective taking tasks, which we define as "achieving perspective taking level 1 behavior after extensive training". From this viewpoint it is also clear which tasks our RL agents should try to solve next: level 1 and level 2 perspective taking without extensive training on the same task.

Rabinowitz and colleagues [16] used a meta-learning strategy to close the gap between level 0 and level 2 perspective taking. However, their study relied on several assumptions (training by supervision, full observability and non-behaving agents) that were addressed in our study. In general, future work should address more complex environments and perspective taking tasks involving more agents and more stringent generalization tests.

Another avenue for future work is to open up the networks that successfully implement perspective taking capabilities. In particular, it will be interesting to search for and study the receptive fields of specific neurons in the network whose activity correlates with a decision requiring perspective taking skills.

5 Conclusion

Perspective taking, like any other cognitive ability, has multiple facets, and a scientific understanding of such abilities requires studying all of these facets [2]. Here we studied the simplest possible case, in which agents driven by neural controllers learned a simple task through reinforcement learning.

By understanding the capabilities and limitations of RL agents in acquiring perspective taking, one can better understand the computational demands of perspective taking and, more generally, of mindreading. In short, RL might help us to better understand perspective taking just as deep neural networks have led to a better understanding of biological vision [20].

6 Acknowledgments

We would like to thank Daniel Majoral for fruitful discussions and comments.