Agents and Devices: A Relative Definition of Agency

05/31/2018 · by Laurent Orseau, et al.

According to Dennett, the same system may be described using a `physical' (mechanical) explanatory stance, or using an `intentional' (belief- and goal-based) explanatory stance. Humans tend to find the physical stance more helpful for certain systems, such as planets orbiting a star, and the intentional stance for others, such as living animals. We define a formal counterpart of physical and intentional stances within computational theory: a description of a system as either a device, or an agent, with the key difference being that `devices' are directly described in terms of an input-output mapping, while `agents' are described in terms of the function they optimise. Bayes' rule can then be applied to calculate the subjective probability of a system being a device or an agent, based only on its behaviour. We illustrate this using the trajectories of an object in a toy grid-world domain.







1 Introduction

Humans categorise physical systems into two important classes: agents, and non-agents (which we here call ‘devices’). Since both are mechanically described by physics, what is the difference? Dennett has proposed that the distinction lies in how we subjectively explain these systems, and identifies two ‘explanatory strategies’ (we ignore a third strategy, the design stance, in this article): the physical stance, which dennett2009intentional describes as “the standard laborious method of the physical sciences, in which we use whatever we know about the laws of physics and the physical constitution of the things in question to devise our prediction”, and the intentional stance, which he describes as “the strategy of interpreting the behavior of an entity (person, animal, artifact, whatever) by treating it as if it were a rational agent who governed its ‘choice’ of ‘action’ by a ‘consideration’ of its ‘beliefs’ and ‘desires.”’


Previous work shows that, by formalising agents as rational planners in an environment, it is possible to automatically infer the intentions of a human agent from its actions using inverse reinforcement learning (russell1998learning; ng2000irl; choi2015hbirl). However, this does not tell us whether to categorise a system as an agent or a device in the first place; this question is observer-relative, since it depends on the observer’s prior knowledge (chambon2011what) and on how efficiently they can apply each explanatory stance.

Instead of modelling human cognition, we consider an artificial reasoner. We propose a formalization of these ideas so as to compute, from the point of view of a mechanical observer, the subjective probability that a given system is an agent. To simplify matters, we assume a clearly identified system that takes a sequence of inputs and returns a sequence of outputs at discrete time steps.

First, we discuss a few informal examples in Section 2. We introduce notation and formalise the main idea in Section 3, with more details on devices and agents in Sections 3.1 and 3.2. We validate our proposal on a set of simple experiments in Section 4, using more specific algorithms tailored to this domain, and show that some behaviours are better described as devices than as agents, and vice versa. We also demonstrate how our model can explain how agents can change their mind and switch goals (and still be considered agents, as long as the switches are rare), thus implementing the hypothesis of baker2009action.

2 Examples

We informally consider three examples from dennett2009intentional: a stone, a thermostat and a game-playing computer.

A stone follows a parabolic trajectory when falling. If we interpret this as “wanting to reach the ground”, we need to explain why the trajectory is parabolic rather than some other shape; it is easier to predict the trajectory directly by using Newtonian physics.

dennett2009intentional describes the thermostat as the simplest artifact that can sustain an intentional stance. The reason it is on the knife edge is that it can be described either as a reactive device (“if temperature is below the command, start heating”), or as an agent (“make sure the temperature is close to the command”), using descriptions of comparable simplicity.

A system may strongly invite the intentional stance even if it is entirely reactive. For example, the policy network in AlphaGo (silver2016alphago) can play go at a high level, even without using Monte-Carlo tree search. A mechanical description would be fairly complex, consisting mostly of a large list of apparently arbitrary weights, but it is very simple to express the goal “it wants to win at the game of go”.

3 Notation and formalism

At each time step $t$, the system under consideration receives an input or observation $o_t \in \mathcal{O}$ and returns an output or action $a_t \in \mathcal{A}$. We denote the history pair at step $t$ by $h_t := (o_t, a_t)$. These produce the sequences $o_{1:t}$ and $a_{1:t}$ of inputs and outputs from step 1 to $t$ included, and we call the sequence $h_{1:t}$ an interaction history or trajectory. We will also use the notation $h_{<t} := h_{1:t-1}$, and similarly for $o_{<t}$ and $a_{<t}$. The sets $\mathcal{O}$ and $\mathcal{A}$ are considered finite for simplicity. The probability simplex over a set $\mathcal{X}$ is denoted $\Delta(\mathcal{X})$: if $p \in \Delta(\mathcal{X})$, then $p(x) \ge 0$ for all $x \in \mathcal{X}$ and $\sum_{x \in \mathcal{X}} p(x) = 1$. The indicator function $[[\text{test}]]$ has value 1 if test is true, and 0 otherwise.

In order to output a probability that a system is an agent, we must give probabilistic definitions of both devices and agents, and then apply Bayes' theorem to invert the likelihood of a trajectory into posterior probabilities of both views of the system. We take a Bayesian point of view: a system belongs to a set of possible systems, so we build a mixture over all such systems, for both agents and devices.

Describing devices: Mixture $\xi_d$.

Let $\mathcal{D}$ be a set of physical processes that can be described as input-output devices, that is, as functions $\rho$ that assign a probability distribution $\rho(\cdot \mid h_{<t}, o_t) \in \Delta(\mathcal{A})$ over outputs given an interaction history of inputs and outputs $h_{<t}$ and the current input $o_t$. The set $\mathcal{D}$ can be finite, countable, or uncountable, but we consider it countable here. Then the likelihood of the sequence of outputs $a_{1:t}$ for a given sequence of inputs $o_{1:t}$, supposing that the system is a device, is
$$\xi_d(a_{1:t} \mid o_{1:t}) := \sum_{\rho \in \mathcal{D}} w_\rho\, \rho(a_{1:t} \mid o_{1:t}), \qquad \rho(a_{1:t} \mid o_{1:t}) := \prod_{k=1}^{t} \rho(a_k \mid h_{<k}, o_k).$$
$\xi_d$ is thus a mixture of all these probability distribution functions, where each function $\rho$ is assigned a prior weight $w_\rho > 0$ so that $\sum_{\rho \in \mathcal{D}} w_\rho = 1$.

Among all device descriptions in $\mathcal{D}$, at step $t$ the posterior probability of a particular device description $\rho$ is found using Bayes' rule in sequence:
$$w_\rho(h_{1:t}) := \frac{w_\rho\, \rho(a_{1:t} \mid o_{1:t})}{\xi_d(a_{1:t} \mid o_{1:t})},$$
and the conditional probability of the next output can now be written:
$$\xi_d(a_{t+1} \mid h_{1:t}, o_{t+1}) = \sum_{\rho \in \mathcal{D}} w_\rho(h_{1:t})\, \rho(a_{t+1} \mid h_{1:t}, o_{t+1}).$$
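As a concrete sketch, this sequential Bayes update over a countable set of candidate devices takes only a few lines; the two toy devices, their priors, and the trajectory below are illustrative stand-ins, not the paper's models:

```python
import math

def device_posteriors(devices, priors, trajectory):
    """Sequentially apply Bayes' rule over a set of candidate devices.

    devices: list of functions rho(action, history, obs) -> probability
    priors:  list of prior weights w_rho summing to 1
    trajectory: list of (obs, action) pairs
    Returns the posterior weights w_rho(h_{1:t}) and the mixture
    log-likelihood log xi_d(a_{1:t} | o_{1:t}).
    """
    weights = list(priors)
    log_xi = 0.0
    history = []
    for obs, action in trajectory:
        # Per-device likelihood of the action actually taken.
        liks = [rho(action, history, obs) for rho in devices]
        mix = sum(w * l for w, l in zip(weights, liks))
        log_xi += math.log(mix)
        # Bayes' rule: posterior proportional to prior times likelihood.
        weights = [w * l / mix for w, l in zip(weights, liks)]
        history.append((obs, action))
    return weights, log_xi

# Toy devices over actions {0, 1}: one always picks 1, one is uniform.
always_one = lambda a, h, o: 1.0 if a == 1 else 0.0
uniform = lambda a, h, o: 0.5

traj = [(None, 1)] * 5
post, log_xi = device_posteriors([always_one, uniform], [0.5, 0.5], traj)
```

After five steps of always observing action 1, the posterior concentrates on the deterministic device, while the mixture likelihood remains lower-bounded by each component's prior-weighted likelihood.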

Describing agents: Mixture $\xi_a$.

Similarly to devices, we define a mixture $\xi_a$ over the set of all possible agents. We describe how to define this mixture and the agent models in Section 3.2.

Putting it all together: Mixture $\xi$.

Now we can put both descriptions together in a single mixture $\xi$. In effect, within $\xi$ we assume that any trajectory can be explained either by the mixture of agents or by the mixture of devices, and nothing else. We take a uniform prior over the two mixtures:
$$\xi(a_{1:t} \mid o_{1:t}) := \tfrac{1}{2}\,\xi_d(a_{1:t} \mid o_{1:t}) + \tfrac{1}{2}\,\xi_a(a_{1:t} \mid o_{1:t}).$$

Using Bayes' rule, we can now compute the probability that a sequence of outputs is generated by an agent rather than by a device: the (subjective) probability that the system is an agent given a trajectory is the likelihood of the trajectory under the agent mixture (in the environment) times the prior probability of being an agent, normalised by $\xi$:
$$P(\text{agent} \mid h_{1:t}) = \frac{\tfrac{1}{2}\,\xi_a(a_{1:t} \mid o_{1:t})}{\xi(a_{1:t} \mid o_{1:t})}.$$

Furthermore, the posterior probability of a particular device $\rho$, that is, how well this device explains the trajectory compared to all other devices and agents, is
$$P(\rho \mid h_{1:t}) = \frac{\tfrac{1}{2}\, w_\rho\, \rho(a_{1:t} \mid o_{1:t})}{\xi(a_{1:t} \mid o_{1:t})},$$
and similarly for an agent.
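In practice the two mixture likelihoods are vanishingly small numbers, so this posterior is best computed from log-likelihoods. A minimal sketch (the function name and the example log-likelihood values are ours, purely illustrative):

```python
import math

def agent_posterior(log_lik_agent, log_lik_device, prior_agent=0.5):
    """P(agent | h_{1:t}) from the two mixture log-likelihoods
    via Bayes' rule, working in log space to avoid underflow on
    long trajectories."""
    la = math.log(prior_agent) + log_lik_agent
    ld = math.log(1.0 - prior_agent) + log_lik_device
    m = max(la, ld)  # log-sum-exp trick for the normaliser log xi
    log_xi = m + math.log(math.exp(la - m) + math.exp(ld - m))
    return math.exp(la - log_xi)

# Hypothetical log-likelihoods (in nats) of a trajectory under each mixture:
p = agent_posterior(log_lik_agent=-30.0, log_lik_device=-40.0)
```

With a uniform prior, only the difference of the two log-likelihoods matters, which is why the experiments later report negative log values as "relative losses".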

3.1 Devices

In principle, the device mixture $\xi_d$ can be any probabilistic model that can be used to compute a likelihood of the output history. A more Bayesian view is to consider the set of all possible models (decision trees, neural networks, etc.) within some class and assign some prior to them. In Section 4 we use a mixture of simple contextual predictive models.

To produce a complete inference algorithm, we also consider the choice of universal prior measures over the set of all computable devices.

Information theoretic choice: Algorithmic probability.

Ignoring computational limitations, an optimal choice for the device mixture $\xi_d$ is to use (a straightforward variant of) Solomonoff's mixture (solomonoff1964formal; legg2008machine) for some particular Turing-complete reference machine. If an observed input-output trajectory can be described by any computable function, Solomonoff's inference will quickly learn to predict its behaviour correctly. In the programming language of our reference machine, all (semi-)computable devices can be expressed: consider a program $\rho$ that, given a sequence of inputs $o_{1:t}$ and past outputs $a_{<t}$, outputs a probability distribution over the next output $a_t$. Each device $\rho$ is assigned a prior weight $w_\rho := 2^{-\ell(\rho)}$, where $\ell(\rho)$ is the length in bits of the description of the device on the reference machine. Hence, if there is a computable device $\rho$ that correctly describes the system's behaviour (that is, if the system's behaviour is computable), then Solomonoff's mixture prediction will be almost as good as $\rho$, since at all steps $t$
$$\xi_d(a_{1:t} \mid o_{1:t}) \ge 2^{-\ell(\rho)}\, \rho(a_{1:t} \mid o_{1:t}),$$
or, in logarithmic-loss (code-redundancy) terms,
$$-\log_2 \xi_d(a_{1:t} \mid o_{1:t}) \le -\log_2 \rho(a_{1:t} \mid o_{1:t}) + \ell(\rho).$$
Thanks to this very strong learning property, the subjective prior bias quickly vanishes with evidence, that is, with the length of the trajectory.
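This redundancy bound can be illustrated numerically with a tiny two-device "universe" rather than a true Solomonoff mixture; the description lengths below are made up for the sake of the example:

```python
import math
import random

# Two candidate "devices" predicting a binary output stream:
# rho_good matches the true source (Bernoulli 0.9), rho_bad is uniform.
def rho_good(bit): return 0.9 if bit == 1 else 0.1
def rho_bad(bit):  return 0.5

# Assumed description lengths (in bits), giving prior weights 2^-l(rho).
l_good, l_bad = 8, 2

random.seed(0)
bits = [1 if random.random() < 0.9 else 0 for _ in range(200)]

lik_good = math.prod(rho_good(b) for b in bits)
lik_bad = math.prod(rho_bad(b) for b in bits)
xi = 2.0 ** -l_good * lik_good + 2.0 ** -l_bad * lik_bad

nll_mix = -math.log2(xi)
nll_good = -math.log2(lik_good)

# Redundancy bound: -log2 xi_d(...) <= -log2 rho(...) + l(rho).
assert nll_good <= nll_mix <= nll_good + l_good
```

Even though the good device starts with the smaller prior weight, the mixture's total loss exceeds the good device's loss by at most its 8-bit code length, regardless of the trajectory.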

A (somewhat) more computable choice.

Under a Solomonoff prior (which does not consider computation time), the invariance theorem (li2008introduction) says the prior also contains an “interpreter” for all agents. The cost to describe an agent as a device is then always bounded by the cost of the interpreter. The speed prior (schmidhuber2002speed; filan2016loss) is a computable variant of the Solomonoff prior that takes into account the computation time required to output the sequence $a_{1:t}$, hence greatly weakening the invariance theorem.

A more observer-dependent prior could also be considered, for example one that depends on the computational limitations of the observer and its background knowledge about the world.

3.2 Agents

To assess whether a given trajectory is agent-like, we apply Bayesian inverse reinforcement learning (ramachandran2007birl; choi2015hbirl) except that we want to output a probability rather than a reward function.

Since the problem is inherently harder than “forward” RL, most previous work on IRL focuses on MDPs. Here, since the purpose of this paper is to provide a unified and general framework, we propose a more general formulation using Bayesian model-based and history-based environments (Hutter2004uaibook). The model of the environment may be imperfect and allows the agent to learn about it through interaction (updating its beliefs with Bayes' theorem). For agents, inputs are usually called observations and outputs actions.

After describing this general reinforcement learning framework, we “invert” it to find the probability that an agent is acting according to some reward function.

An environment $\mu$ is a probability distribution over observations given the past observations and actions, with $\mu(o_{1:t} \mid a_{<t}) := \prod_{k=1}^{t} \mu(o_k \mid h_{<k})$. The environment can either be the known environment or an uncertain environment, as in a mixture of potential environments, with their posteriors updated using Bayes' theorem.

A utility function (or reward function) $u$ assigns an instantaneous value $u(h_{1:t})$ to the current trajectory. The cumulated utility of an interaction sequence is the sum of the instantaneous utilities along that sequence.

A policy $\pi$ is a probability distribution over actions given the past: $\pi(a_t \mid h_{<t}, o_t)$ is how likely the agent is to take action $a_t$ at time $t$. Similarly to environments, we extend the definition of a policy to sequences: $\pi(a_{1:t} \mid o_{1:t}) := \prod_{k=1}^{t} \pi(a_k \mid h_{<k}, o_k)$.

Now, given a particular utility function $u$, the value of a given policy $\pi$ in an environment $\mu$ is given by:
$$V^{\pi}_{\mu,u}(h_{1:t}) := \sum_{o_{t+1}} \mu(o_{t+1} \mid h_{1:t}) \sum_{a_{t+1}} \pi(a_{t+1} \mid h_{1:t}, o_{t+1}) \left[ u(h_{1:t+1}) + \gamma\, V^{\pi}_{\mu,u}(h_{1:t+1}) \right],$$
where $\gamma \in [0,1)$ is the discount factor. This last form also allows us to consider the value $V^{*}_{\mu,u}(h_{1:t}, o_{t+1}, a)$ of taking action $a$ after some history, which is useful to define the policies. In particular, we may want the agent to follow the best policy $\pi^*$ that always chooses one of the actions of optimal value for a given underlying utility function $u$ in an environment $\mu$:
$$\pi^*(a \mid h_{1:t}, o_{t+1}) > 0 \iff a \in A^* := \operatorname*{arg\,max}_{a'}\, V^{*}_{\mu,u}(h_{1:t}, o_{t+1}, a').$$

But it is more realistic to consider that agents are only approximately rational. For simplicity, in the remainder of this paper we consider $\varepsilon$-greedy policies instead, still one of the favourite choices in RL research (mnih2015dqn). The policy $\pi_\varepsilon$ of the $\varepsilon$-greedy agent chooses an optimal action with total probability $1-\varepsilon$:
$$\pi_\varepsilon(a \mid h_{1:t}, o_{t+1}) := \begin{cases} (1-\varepsilon)/|A^*| & \text{if } a \in A^*, \\ \varepsilon/(|\mathcal{A}| - |A^*|) & \text{otherwise.} \end{cases}$$


With $\varepsilon = 0$, the agent always selects one of the best actions, that is, it acts rationally. (This definition slightly departs from the standard one in order to allow for integrating over $\varepsilon$.)
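One concrete reading of such an $\varepsilon$-greedy policy, under our assumption that mass $1-\varepsilon$ is shared among the optimal actions and mass $\varepsilon$ among the rest, can be sketched as:

```python
def epsilon_greedy_probs(q_values, eps):
    """Action probabilities for an epsilon-greedy policy: total mass
    1-eps spread uniformly over the optimal actions, mass eps spread
    uniformly over the remaining ones (a sketch; tie-breaking by
    uniform sharing is our assumption)."""
    best = max(q_values)
    opt = [i for i, q in enumerate(q_values) if q == best]
    k, n = len(opt), len(q_values)
    if k == n:  # all actions optimal: no non-optimal mass to assign
        return [1.0 / n] * n
    return [(1.0 - eps) / k if i in opt else eps / (n - k)
            for i in range(n)]

# Four actions, two of them tied for optimal value:
probs = epsilon_greedy_probs([0.0, 1.0, 1.0, 0.5], eps=0.1)
```

With `eps=0.1` the two optimal actions get probability 0.45 each and the two suboptimal ones 0.05 each, so the likelihood of an action sequence factorises into $(1-\varepsilon)$ and $\varepsilon$ powers times constants, which is what makes the integration over $\varepsilon$ below tractable.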


In an environment $\mu$, given a utility function $u$ and an exploration parameter $\varepsilon$, we can compute the likelihood of the sequence of actions conditioned on the observations simply with $\pi_\varepsilon(a_{1:t} \mid o_{1:t}) = \prod_{k=1}^{t} \pi_\varepsilon(a_k \mid h_{<k}, o_k)$.

Thanks to the nice form of Eq. 2, we can actually make a mixture over all values of $\varepsilon$ in closed form:
$$\xi_u(a_{1:t} \mid o_{1:t}) := \int_0^1 w(\varepsilon)\,(1-\varepsilon)^{m}\,\varepsilon^{\,t-m}\, c(h_{1:t})\, d\varepsilon,$$
where $w(\varepsilon)$ is some prior over $\varepsilon$, $m$ is the number of times a best action is chosen w.r.t. $u$, and $c(h_{1:t})$ collects the normalisation factors of Eq. 2, which do not depend on $\varepsilon$. The integral is the definition of the Beta function, and thus taking the uniform prior $w(\varepsilon) = 1$ we obtain:
$$\int_0^1 (1-\varepsilon)^{m}\,\varepsilon^{\,t-m}\, d\varepsilon = B(m+1,\, t-m+1) = \frac{1}{(t+1)\binom{t}{m}},$$
where $\binom{t}{m}$ is the binomial coefficient.
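This closed form is easy to check numerically; `eps_marginal` is our name for the Beta factor, and the comparison integral is approximated with a midpoint rule:

```python
import math

def eps_marginal(t, m):
    """Closed-form marginal of an epsilon-greedy agent under a uniform
    prior on epsilon: Beta(m+1, t-m+1) = 1 / ((t+1) * C(t, m))."""
    return 1.0 / ((t + 1) * math.comb(t, m))

def eps_marginal_numeric(t, m, n=200000):
    """Midpoint-rule approximation of the same integral."""
    h = 1.0 / n
    return h * sum(((1 - e) ** m) * (e ** (t - m))
                   for e in (h * (i + 0.5) for i in range(n)))

t, m = 20, 17  # 17 optimal actions out of 20 steps
closed = eps_marginal(t, m)
numeric = eps_marginal_numeric(t, m)
```

For $t = 20$, $m = 17$ the factor is $1/(21 \cdot \binom{20}{17}) = 1/23940$, and the two computations agree to high precision.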

Finally, we can now build the mixture over all goals:
$$\xi_a(a_{1:t} \mid o_{1:t}) := \sum_{u \in \mathcal{U}} w_u\, \xi_u(a_{1:t} \mid o_{1:t}).$$
A simple choice for the weights is $w_u := 1/|\mathcal{U}|$ if $\mathcal{U}$ is finite.

Universal IRL.

Similarly to devices in Section 3.1, we can also use Solomonoff’s prior over the set of reward functions, which would lead to “inverting” AIXI, where AIXI is the optimal Bayesian RL agent for the class of all computable environments and reward functions (Hutter2004uaibook).

With the speed prior for devices.

If we use the speed prior for the devices, one problem arises: since the agent can use the Bellman equation for free, if any device can be represented as an agent then everything may look like an agent, because the penalty for devices is too large. To compensate for this, we take something away from the agents, for example by penalising their prior weights $w_u$ accordingly.

4 Experiments

To test our hypothesis, we built a gridworld simulator (see for example Fig. 2). The system under consideration (the yellow triangle) can move in the four directions (up, down, left, right), except if there is a wall. The red, green, blue and magenta balloons have fixed positions. Does the system act rationally according to one of the goals, or is its behaviour better described as that of a moving device that simply reacts to its environment? The experimenter can make the triangle follow a given sequence of actions $a_{1:t}$.

4.1 Device descriptions

For a device, we define the observation at step $t$ to be the kind of cell (wall, empty, red, green, blue, magenta) the system is facing in the world, in the direction of its last action.

A device’s behaviour is defined by a set of associations between a context and an action, for all possible contexts; a context is made of the current observation and the last action the agent took. An example of a device’s deterministic function can be found in Table 1.

Table 1: An example of a device that moves along the walls. Each context, given by the kind of cell in front of the system (wall, empty, red, green, blue, or magenta) together with its last action, is mapped to an action.

There are $4^{24}$ different deterministic functions describing devices (one of the 4 actions for each of the $6 \times 4$ contexts). As for agents below, we allow for $\varepsilon$-deterministic devices: at each step there is a probability of $1-\varepsilon$ that the device takes the action given by its deterministic function, and an $\varepsilon$ chance that it takes a different action.

Each context is associated with a multinomial predictor. Let $A := |\mathcal{A}|$ be the number of actions. Let $C$ be the set of all mutually-exclusive contexts (only one context is active at any step), and let $C_t \subseteq C$ be the set of contexts that have been visited during the trajectory $h_{1:t}$. Let $n_{c,a}$ be the number of times action $a$ has been taken in context $c$, and let $n_c := \sum_a n_{c,a}$ be the number of visits of context $c$. An $\varepsilon$-deterministic context model $\rho_\theta$ puts a categorical distribution $\theta_c$ over the set of actions for each context $c$, where $\theta$ is a $|C|$-dimensional vector of probability distributions over $\mathcal{A}$, hence $\theta \in \Delta(\mathcal{A})^{|C|}$:
$$\rho_\theta(a_{1:t} \mid o_{1:t}) = \prod_{c \in C}\prod_{a \in \mathcal{A}} \theta_c(a)^{n_{c,a}},$$
which in the current experiments is essentially a Markov model of order 2. We can now build a continuous mixture of all such $\varepsilon$-deterministic context models:
$$\xi_d(a_{1:t} \mid o_{1:t}) = \int_\Theta w(\theta)\, \rho_\theta(a_{1:t} \mid o_{1:t})\, d\theta,$$
where $\Theta := \Delta(\mathcal{A})^{|C|}$. Taking a uniform prior over $\Theta$ leads to a multinomial estimator:
$$\xi_d(a_{1:t} \mid o_{1:t}) = \prod_{c \in C_t} \frac{(A-1)!\,\prod_{a} n_{c,a}!}{(n_c + A - 1)!}.$$
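Such a per-context multinomial estimator with a uniform prior can be computed sequentially via the chain rule, where each prediction is a Laplace-smoothed frequency; the contexts below are toy stand-ins, not the exact experimental code:

```python
import math

def context_model_nll(trajectory, num_actions):
    """NLL (in bits) of an action sequence under independent per-context
    multinomial estimators with a uniform (Dirichlet-1) prior, computed
    sequentially: P(a | c) = (n_{c,a} + 1) / (n_c + A)."""
    counts = {}  # context -> per-action counts
    nll = 0.0
    for context, action in trajectory:
        n = counts.setdefault(context, [0] * num_actions)
        p = (n[action] + 1) / (sum(n) + num_actions)
        nll -= math.log2(p)
        n[action] += 1
    return nll

# A device-like trajectory: context ("wall", "up") always triggers action 0.
traj = [(("wall", "up"), 0)] * 10
nll = context_model_nll(traj, num_actions=4)
```

The sequential product of Laplace-smoothed predictions equals the closed-form marginal $(A-1)!\,\prod_a n_{c,a}! / (n_c+A-1)!$ per context; here, with 10 repeats of one action among 4, it is $3!\,10!/13! = 1/286$, so deterministic behaviour quickly becomes cheap for the device mixture.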

4.2 Agent descriptions

We consider a very small set of goals $\mathcal{U}$: the red, green, blue, and magenta balloons in Fig. 2.

To be able to assign a probability to the actions of the trajectory, we first need to solve the Markov Decision Process (MDP) (sutton1998reinforcement) for each goal, using states instead of histories, where a state is simply a (row, column) position in the environment. The value in Eq. (1) is then computed for each state-action pair, with a reward of 1 for reaching the goal and 0 everywhere else. The resulting mixture is computed with Eqs. 3 and 4.
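Solving each goal's MDP can be done with tabular value iteration; the sketch below is our own toy implementation with assumed deterministic dynamics (bumping a wall leaves the agent in place), not the paper's exact code:

```python
def value_iteration(walls, goal, gamma=0.95, tol=1e-6):
    """Tabular value iteration on a deterministic gridworld MDP:
    reward 1 on reaching the goal cell, 0 elsewhere."""
    rows, cols = len(walls), len(walls[0])
    moves = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right
    V = [[0.0] * cols for _ in range(rows)]
    while True:
        delta = 0.0
        for r in range(rows):
            for c in range(cols):
                if walls[r][c] or (r, c) == goal:
                    continue
                best = -1.0
                for dr, dc in moves:
                    nr, nc = r + dr, c + dc
                    # Bumping into a wall leaves the position unchanged.
                    if not (0 <= nr < rows and 0 <= nc < cols) or walls[nr][nc]:
                        nr, nc = r, c
                    reward = 1.0 if (nr, nc) == goal else 0.0
                    best = max(best, reward + gamma * V[nr][nc])
                delta = max(delta, abs(best - V[r][c]))
                V[r][c] = best
        if delta < tol:
            return V

# A 1 x 4 corridor with the goal at the right end.
V = value_iteration(walls=[[False] * 4], goal=(0, 3))
```

In the corridor the optimal values decay geometrically with distance to the goal (1, then $\gamma$, then $\gamma^2$), and the resulting state-action values feed directly into the $\varepsilon$-greedy likelihood above.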

4.2.1 The switching prior

An interesting point made by baker2009action is that people often switch from one goal to another in the middle of a trajectory. In order to take such behaviours into account, we will also use veness2012context's switching prior technique (volf1998switching), which is an efficient mixture over all sequences of models (here, all possible sequences of goals) that keeps a probability $\alpha_t$ of switching at time $t$ from the current goal to a different one, and thus a probability $1-\alpha_t$ of keeping the current goal.

Unfortunately, the switching prior does not seem to cooperate well with the integration over $\varepsilon$ in Eq. 3. Therefore, instead of using Eq. 3, we use a mixture over a fixed number of values of $\varepsilon$, which is sufficient for the purposes of this demonstration. (With such a fixed set of values, the performance of the mixture may start to degrade after a few hundred steps, but the trajectories considered in this demonstrator are usually shorter.)

With $\Pi$ being the set of all policies (one per goal), the switching mixture maintains a weight $w_t(\pi)$ for each policy and predicts
$$\xi^{\text{sw}}_a(a_t \mid h_{<t}, o_t) = \sum_{\pi \in \Pi} w_t(\pi)\, \pi(a_t \mid h_{<t}, o_t),$$
$$w_{t+1}(\pi) \propto (1-\alpha_t)\, w_t(\pi)\, \pi(a_t \mid h_{<t}, o_t) + \frac{\alpha_t}{|\Pi|-1} \sum_{\pi' \ne \pi} w_t(\pi')\, \pi'(a_t \mid h_{<t}, o_t),$$
where the last line implements the switching update rule (a slight simplification over veness2012context for readability, at the price of a slightly larger logarithmic loss at each switch) with $\alpha_t := 1/(t+1)$. If no switching is necessary, the cost (in the logarithmic loss) is bounded by $\log(t+1)$ at time $t$, which is a rather small cost to pay.
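The switching update can be sketched as a fixed-share style mixture over goal policies; this is a simplification in the spirit of, but not identical to, veness2012context's rule, and `alpha_fn` and the toy expert streams are ours:

```python
import math

def switching_mixture_nll(pred_seqs, alpha_fn=lambda t: 1.0 / (t + 1)):
    """Fixed-share style switching mixture over experts (one per goal).

    pred_seqs[i][t] is expert i's probability of the action actually
    taken at step t. Returns the NLL of the action sequence in bits.
    """
    k = len(pred_seqs)
    steps = len(pred_seqs[0])
    w = [1.0 / k] * k
    nll = 0.0
    for t in range(steps):
        liks = [pred_seqs[i][t] for i in range(k)]
        mix = sum(wi * li for wi, li in zip(w, liks))
        nll -= math.log2(mix)
        posterior = [wi * li / mix for wi, li in zip(w, liks)]
        a = alpha_fn(t + 1)
        # With prob. 1-alpha keep the current expert; with prob. alpha
        # switch uniformly to one of the k-1 others.
        w = [(1 - a) * posterior[i] + a * (1.0 - posterior[i]) / (k - 1)
             for i in range(k)]
    return nll

# Expert 0 explains the first half of the trajectory, expert 1 the second.
seq = [[0.9] * 10 + [0.1] * 10,
       [0.1] * 10 + [0.9] * 10]
nll_switch = switching_mixture_nll(seq)
```

Setting `alpha_fn` to always return 0 recovers the non-switching Bayes mixture, which pays a much larger loss on this goal-switching stream because no single expert explains the whole trajectory.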

Apart from the inversion of the MDP, the computation time taken by the switching mixture for a sequence of length $t$ remains linear in $t$, comparable to that of the non-switching mixture of Eq. 4.

4.3 Some trajectories

Some sample trajectories and associated results are given in Figs. 1–7. We report the negative log likelihood (NLL), $-\log \xi(h_{1:t})$, for both the device and agent mixtures, where we use $\xi(h_{1:t})$ as an abbreviation of $\xi(a_{1:t} \mid o_{1:t})$. We also report the posteriors of the device and agent mixtures in the global mixture, along with their negative log values, as the latter are usually more informative: they can be interpreted as complexities or relative losses. The switching prior is used only for the trajectory of Fig. 5; for the other trajectories, switching performs similarly to not switching.

Running in circles.

(See Fig. 1.) This behaviour is a prototypical example of a system behaving more like a device than like an agent: the behaviour is very simple to explain in terms of instantaneous reactions without referring to some goal.

(a) Trajectory.

                 Device   Agent
NLL               18.01   37.48
Posterior          1.00    0.00
-log posterior     0.00   19.40

(b) Posteriors of the device and agent mixtures.
Figure 1: The system is running in circles for 25 steps.
Rational behaviour.

(See Fig. 2.) This behaviour is strongly described as that of an agent. Indeed, it appears that it is going as fast as possible to the magenta balloon. A device description is however still relatively simple, as witnessed by the low relative complexity of the device mixture’s posterior.

(a) Trajectory.

                 Device   Agent
NLL               18.16   11.31
Posterior          0.00    1.00
-log posterior     6.85    0.00

(b) Posteriors of the device and agent mixtures.
Figure 2: The system goes straight to the magenta balloon.
Suboptimal trajectory toward the blue balloon.

(See Fig. 3.) The system reaches the blue balloon after 66 steps, whereas the fastest path requires only 36 steps. The system is still considered an agent because of the difficulty of reaching the blue balloon, which compensates for the suboptimality of the trajectory.

(a) Trajectory.

                 Device   Agent
NLL               83.53   72.04
Posterior          0.00    1.00
-log posterior    11.49    0.00

(b) Posteriors of the device and agent mixtures.
Figure 3: The system is going toward the blue balloon in a suboptimal way.
Following walls.

(See Fig. 4.) This is another example of a behaviour that is typical of a reactive system acting without purpose. This trajectory seems more agent-like than a random one or than running in circles, and one may be tempted to describe the behaviour of the system as “it wants to avoid walls”. However, once described as a simple deterministic reactive system without intentions (“when there is a wall in front, turn right”), it seems to lose its agency aspect.

(a) Trajectory.

                 Device   Agent
NLL               26.09   43.33
Posterior          1.00    0.00
-log posterior     0.00   17.24

(b) Posteriors of the device and agent mixtures.
Figure 4: The system turns when facing a wall.
Switching goals.

(See Fig. 5.) The system looks like it is going first toward the magenta balloon, but before reaching it switches to going to the green balloon. This time, for the agent mixture we use the switching prior model described in Section 4.2.1. We also report the log likelihood of the trajectory for the non-switching model for information: without the switching prior, the behaviour toward either the blue or the green balloon is very suboptimal, and thus it is easier to consider the trajectory as generated by a device rather than an agent. The posteriors of each goal along the trajectory are shown in Fig. 6. Between steps 3 and 19, the system seems to be going to any goal other than the magenta one, and this becomes clearer starting at step 10 when the system enters the corridor. However, the mixture cannot yet tell which goal is more likely. Similarly, when going away from the blue balloon, the system is uncertain as to which is the actual target, and becomes certain it is the green balloon only after the middle corridor's entrance.

(a) Trajectory.

                 Switching           Non-switching
                 Device    Agent     Agent
NLL              64.4851   30.3043   92.1446
Posterior         0.0000    1.0000   NA
-log posterior   34.1808    0.0000   NA

(b) Posteriors of the device and agent mixtures.
Figure 5: Switching goals, using the switching agent model.
Figure 6: Sequence of the posteriors of the different goals for the trajectory of Fig. 5, using a switching prior.
Random behaviour.

(See Fig. 7.) A random behaviour is difficult to explain both in terms of a device and in terms of an agent, and thus leads to a high NLL in both cases: the context hits (see Fig. 8) have high entropy, and the best value of $\varepsilon$ for an $\varepsilon$-greedy agent policy is high too (around 0.6).

(a) Trajectory.

                 Device   Agent
NLL              141.64   144.11
Posterior          0.92    0.08
-log posterior     0.08    2.55

(b) Posteriors of the device and agent mixtures.
Figure 7: The system chooses its actions uniformly at random for 100 steps.
Context                     Action
(in_front, last_action)   up   down   left   right
empty, down                2      2      3       8
empty, left                4      5      2       5
empty, right               3      8      4       2
empty, up                  5      5      4       2
wall, down                 4      6      2       3
wall, left                 2      2      2       -
wall, right                -      2      5       3
wall, up                   -      -      -       4
Figure 8: Context hits for the random experiment (“-” = 0).

5 Conclusion

Every physical system can be described either as an agent (which pursues goals) or as a device (which responds mechanically to its inputs). We therefore ask how much sense it subjectively makes to call a given system an agent or a device, and we quantify the answer in the form of a posterior probability. This subjective probability takes into account the observer's intrinsic biases and background knowledge.

We formalize the idea using inverse reinforcement learning techniques for agents (roughly: given a sequence of actions and observations, find the best goal and $\varepsilon$-greedy policy for this goal) and sequence prediction techniques for devices (roughly: find the best $\varepsilon$-deterministic policy that fits the observed behaviour), and compare the two resulting likelihoods.

The approach was validated on a simple and clear test domain with a varied set of trajectories. While the purpose of this work is to provide a mostly non-anthropocentric formalization of a definition of agency, it would be informative to investigate the extent to which it matches human judgements.

From a reinforcement learning perspective, the proposed approach may also be useful to design environments that can help maximize “agenthood”, that is, to build agents that can thrive as agents rather than performing device-like tasks.


This paper has emerged from the discussions that took place at the 2016 SAB workshop on “Mathematical and philosophical conceptions of agency”, organized by Simon McGregor. Thanks also to Peter Dayan, Tom Erez, Chrisantha Fernando, Nando de Freitas, Thore Graepel, Hado van Hasselt, Andrew Lefrancq, Sean Legassick, Joel Z. Leibo, Jan Leike, Rémi Munos, Toby Ord, Pedro Ortega and Olivier Pietquin.