A fundamental part of intelligent reasoning is being able to make decisions under uncertain conditions (Danks, 2014; Lake et al., 2017; Pearl, 2018b). In some cases, a decision maker who faces an uncertain environment has enough information to make choices by maximizing expected utility, which is the classic formal criterion for making decisions if rational preferences are assumed (Bernardo, 2000; Gilboa, 2009). On the other hand, if enough information is not available, the decision maker could attempt to learn from the environment by interacting with it.
Learning by interaction has been extensively studied by computer scientists in the Reinforcement Learning (RL) setting (Sutton & Barto, 1998), but the most commonly used techniques in this field are purely associative and do not consider any high-level structure of the environment beyond what is expressible in a Markov Decision Process (Garnelo et al., 2016).
A particular case of higher-level structure, i.e., structure beyond associative patterns, is causal structure. A causal structure encodes a series of cause-effect relations between events. Knowing such relations gives a decision maker extra knowledge about the uncertainty in her environment and also allows her to plan ahead, since she can predict what a certain action will cause (Spirtes et al., 2000; Pearl, 2018a).
Human beings are known to learn causal models in sequential decision-making processes (Sloman & Hagmayer, 2006; Nichols & Danks, 2007; Meder et al., 2010; Hagmayer & Meder, 2013; Danks, 2014), even though this learning is not perfect (Rottman & Hastie, 2014). Motivated by this, we propose that an autonomous agent can learn and use causal information while interacting with an uncertain environment governed by a fixed causal mechanism that is unknown to the agent.
While the standard setting in RL models the agent-environment interaction as an agent that moves from one state to another inside a model of the environment and observes a reward as these transitions occur, we propose to model it as a game between the decision maker and a player called Nature, which selects its actions from the causal model in response to what the decision maker has chosen.
We propose that the agent learn from repeated interactions by holding beliefs about the structure of the environment together with a way to update them after an outcome has been observed. Using her current beliefs, the agent generates a local causal model and chooses an action from it as if that model were the true one. Then, after she observes the consequences of her action, her beliefs are updated according to the observed information in order to make a better choice the next time. Besides learning a policy for choosing actions, the agent also learns a causal model of the environment, since the causal model she forms will approximate the true model.
Learning a causal model of the environment makes it possible to extract high-level insights about a phenomenon beyond associative descriptions of what is observed. A causal model is able to explain why a particular decision was made, since it exposes the causes and effects of an agent’s actions. Once a causal model is acquired, an external user is able to reason about “what if” statements that associative methods cannot answer (Pearl & Mackenzie, 2018). When a decision maker chooses one action out of many, a causal model makes it possible to ask what would have happened if another action had been taken, without actually performing the alternative action.
2 Related Work
Decision problems in which the actions available to a decision maker are interventions over a known causal model are analyzed by (Lattimore et al., 2016) as a bandit problem where the optimal action must be learned over rounds of action-observation in which only one action can be chosen. In a classic bandit problem an agent chooses an arm from a slot machine, observes a reward and then moves on to the next machine which is of the same kind and whose initial settings are independent of the previous machine and action (Sutton & Barto, 1998).
Several algorithms exist for finding the best arm in a multi-armed bandit, such as those described in (Bubeck et al., 2009), (Audibert & Bubeck, 2010), (Gabillon et al., 2012), (Agarwal et al., 2014), (Jamieson et al., 2014), (Jamieson & Nowak, 2014), (Chen & Li, 2015), (Carpentier & Locatelli, 2016), (Russo, 2016), (Kaufmann et al., 2016), but none of these works consider causally governed environments. (Ortega & Braun, 2014) give results on sequential decision making using Generalized Thompson Sampling that could be extended to causal inference problems.
As far as we know, (Lattimore et al., 2016) is the first paper to consider causal relations between the effects of actions. They consider a decision maker who must choose the best among several possible interventions on a given causal model. The optimality of an action in this context is measured in terms of minimal regret. The case where the causal model is not known is left as future work.
(Sen et al., 2017) extend the work of (Lattimore et al., 2016) by considering a causal model that is only partially known, intervening on variables from the unknown part of the model, and avoiding sampling arms that are considered sub-optimal.
The aforementioned papers assume that the causal model is known, at least partially, to the decision maker, so their work focuses on using causal information to make good choices, while the problem of acquiring this causal knowledge is left unaddressed.
In this work we propose to acquire causal information about the environment by repeated interaction, while using the current causal knowledge inside each round to make better decisions. By modelling the environment as a player in a game, we allow it to have objectives to pursue, which makes it possible to model a rich family of situations where several agents compete against each other and a causal entity controls the outcomes.
3 Problem setup
By a causal relation we mean a stochastic binary relation between events of a probability space that is transitive, irreflexive and antisymmetric (Spirtes et al., 2000).
A directed acyclic graph (DAG) can be used to represent all of the causal relations that occur in that space by taking a node for every variable that is related to another and a directed edge for each causal relation. Together with this DAG we consider a probability measure that expresses the conditional statements encoded by the DAG.
We require that this measure satisfies the Causal Markov Condition, Causal Minimality and Causal Faithfulness as stated in (Spirtes et al., 2000). The relation between the DAG and this measure under interventions is given by the Manipulation Theorem of (Spirtes et al., 2000) and the do-calculus rules of (Pearl, 2009). We also require that the model satisfies the condition known as Causal Sufficiency, which means that there are no common causes of the modelled variables lying outside of the model.
Consider a decision problem under uncertainty in which an agent has to choose one among several available actions, which are causally related to a set of possible outcomes. The events that mediate between actions and outcomes are uncertain and are governed by a causal mechanism. This means that when the decision maker chooses an action, an uncertain event occurs in such a way that an outcome, causally related to the chosen action, is produced. The decision maker seeks to maximize her utility, and we assume that she has rational preferences, so her preferences can be represented by the expected value of a utility function (Gilboa, 2009). If the decision maker knows neither the probabilities nor the structure of the underlying causal model, then she cannot calculate the expected utility of any action. Instead, she holds a subjective probability distribution that represents her knowledge and uncertainty, which is updated by interacting with the environment through successive rounds of decision making.
Inside each round, any response from the environment will be independent from the previous rounds, but the actions of the decision maker will be based upon previously acquired causal information and are expected to improve the utility for the agent.
We define a game between the decision maker and a new abstract player called Nature. The structure of the base game is the same as that of the original decision problem. Nature is indifferent among the possible outcomes of the game and selects its actions from the causal model. This is interpreted as the causal response of the environment to the actions of the decision maker. The case in which Nature has objectives to pursue (non-constant payoffs) is left as future work.
For an agent to reason about and modify her causal knowledge, we endow her with a probability distribution over a suitable space. These beliefs must make it possible to form a local model at any given moment, to be used for decision making. We will later exploit the fact that causal graphical models can be expressed in terms of conditional distributions, so having beliefs about a causal model is equivalent to having beliefs about these conditional distributions. After each round of the game, the beliefs are updated in a Bayesian way in order to achieve convergence towards the true model (Shoham & Leyton-Brown, 2008).
4 Proposed Method
In this section we describe our approach for studying decision making in causal environments as described in Section 3. For the sake of explanation we consider three separate cases:
1. The decision maker fully knows the causal model.
2. The decision maker knows only the structure of the causal model.
3. The decision maker does not know the causal model.
4.1 The causal model is completely known
Consider a decision problem under uncertainty where a decision maker has to choose one out of many elements of a set of actions, and where the consequences, or effects, of her actions are expressed as the outcome of a random variable, which we will call the target variable. The relation between the values of the target variable and the actions is expressed by a causal graphical model, which is known by the decision maker. The decision maker wishes to choose an action such that the observed value of the target variable maximizes her utility. It is assumed that the decision maker knows which variable she can intervene upon.
This is the simplest of the three cases: if the decision maker fully knows the causal model, then she can proceed as in classic decision problems by directly obtaining the probabilities of the different values of the target variable given that an action is taken, and choosing the action that achieves the highest probability for the desired value of the target variable. The action selected is a best response for the decision maker as well as the maximum expected utility choice.
Pearl’s do-calculus (Pearl, 2009) says that the effect of setting some variable to a value can be expressed in terms of observational distributions as follows:
The decision maker can use this expression to find the probabilities for her desired value of the target variable given the possible interventions available to her.
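As a minimal sketch of this computation, assume a hypothetical model in which Disease and Treatment are the only parents of the target variable Lives. Under Causal Sufficiency, the truncated factorization then gives $P(\text{Lives} \mid do(\text{Treatment} = t)) = \sum_{d} P(\text{Lives} \mid t, d)\, P(d)$. The graph, the value names and every number below are illustrative assumptions rather than values taken from this work.

```python
# Hypothetical CPTs; the graph and all numbers are illustrative assumptions.
# Assumed structure: Disease -> Lives <- Treatment.
P_DISEASE = {"d1": 0.6, "d2": 0.4}
P_LIVES = {  # P(Lives = 1 | Treatment, Disease)
    ("pill", "d1"): 0.9,
    ("pill", "d2"): 0.3,
    ("surgery", "d1"): 0.5,
    ("surgery", "d2"): 0.8,
}
ACTIONS = ("pill", "surgery")


def p_lives_do(treatment):
    """P(Lives = 1 | do(Treatment = treatment)).

    Intervening removes the factor for Treatment from the joint
    distribution, so only P(Disease) and P(Lives | Treatment, Disease)
    remain and Disease is summed out.
    """
    return sum(P_DISEASE[d] * P_LIVES[(treatment, d)] for d in P_DISEASE)


def best_action(actions=ACTIONS):
    """Action with the highest interventional probability of survival."""
    return max(actions, key=p_lives_do)


if __name__ == "__main__":
    for action in ACTIONS:
        print(action, round(p_lives_do(action), 3))
    print("best:", best_action())
```

With a utility that is 1 when the patient lives and 0 otherwise (also an assumption here), maximizing this probability coincides with maximizing expected utility.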
4.2 Only the structure is known
If only the graphical structure of the causal model is known, then it is not obvious how to find the best action to take, since the information required for calculating expected utilities is not available.
In this case, the decision maker attempts to learn from her uncertain environment by forming beliefs over the unknown parameters of the environment and updating them according to the observed outcomes. In order to make a Bayesian update over the parameters, these must be defined in such a way that in each round the decision maker can build a causal model from them and make a decision using this model as if it were the true one, as described in Section 4.1.
To model the interaction between the decision maker and the environment, we consider a game with the following characteristics:
Players: The players of this game are the original decision maker and a player called Nature.
Actions: The actions of the decision maker are the options available to her in the decision problem. The actions of Nature are the possible realizations of the variables of the causal model.
Preferences: The decision maker satisfies the von Neumann-Morgenstern axioms of rationality and is therefore assumed to maximize expected utility. Nature is indifferent over outcomes.
Beliefs: Since the decision maker has uncertainty about her environment, she will encode it in a probability distribution over a suitable space.
In this game we will assume that Nature moves first and assigns some state to the environment which is unknown to the decision maker. For this reason, the base game is an extensive game with imperfect information since the decision maker makes a choice without knowing the play made by Nature. We choose extensive games since Nature’s moves are interpreted as a response from the causal model to the actions of the decision maker, so a sequential interpretation had to be considered.
Since the decision maker knows the graph structure, she can explicitly find a non-interventional expression for the interventional distribution and update her beliefs about the unknown quantities in it from observed data. If the decision maker were not allowed to know the play of Nature at the end of each round, then it would have to be estimated as a hidden variable using, for example, the EM algorithm (Dempster et al., 1977); for now we assume that this information is available at the end of each round.
Given the structure of the model, i.e., the variables in it and the directed edges, the joint distribution of those variables can be expressed as a product of the form
$$P(X_1, \dots, X_n) = \prod_{i=1}^{n} P(X_i \mid \mathrm{Pa}(X_i)),$$
where $\mathrm{Pa}(X_i)$ are the parents of $X_i$ in the underlying DAG of the model. Since these conditional distributions fully characterize the model, the decision maker holds beliefs over each one of them. Notice that, for each configuration of the parents, each of these parameters is itself a probability vector whose length equals the number of possible values of the variable being conditioned; call the maximum number of possible values $k$.
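The factorization can be illustrated with a short sketch that multiplies one conditional probability per variable. The two-variable DAG `A -> B`, the dictionary layout and all numbers are hypothetical and chosen only for illustration.

```python
from itertools import product

# Hypothetical DAG A -> B with binary variables; numbers are illustrative.
PARENTS = {"A": (), "B": ("A",)}
CPTS = {
    # CPTS[variable][tuple of parent values][value] = conditional probability
    "A": {(): {0: 0.7, 1: 0.3}},
    "B": {(0,): {0: 0.9, 1: 0.1}, (1,): {0: 0.2, 1: 0.8}},
}


def joint_probability(assignment, parents=PARENTS, cpts=CPTS):
    """Product over variables of P(variable | its parents) for one assignment."""
    prob = 1.0
    for var, value in assignment.items():
        parent_values = tuple(assignment[p] for p in parents[var])
        prob *= cpts[var][parent_values][value]
    return prob


if __name__ == "__main__":
    total = 0.0
    for a, b in product((0, 1), repeat=2):
        p = joint_probability({"A": a, "B": b})
        total += p
        print(f"P(A={a}, B={b}) = {p:.2f}")
    print("sum over all assignments:", round(total, 6))
```

Each row of `CPTS` is one of the probability vectors over which the decision maker holds beliefs.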
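A distribution suitable for modelling discrete probability vectors is the $k$-dimensional Dirichlet distribution, whose support is the set of probability vectors of length $k$ (Hjort et al., 2010). The $k$-dimensional Dirichlet distribution with parameters $\alpha_1, \dots, \alpha_k > 0$ has a density with respect to the Lebesgue measure given by
$$f(x_1, \dots, x_k; \alpha_1, \dots, \alpha_k) = \frac{\Gamma\left(\sum_{i=1}^{k} \alpha_i\right)}{\prod_{i=1}^{k} \Gamma(\alpha_i)} \prod_{i=1}^{k} x_i^{\alpha_i - 1},$$
where $x_1, \dots, x_k$ are such that $x_i > 0$ and $\sum_{i=1}^{k} x_i = 1$. The Dirichlet distribution is useful since it is conjugate to the multinomial likelihood, so the posterior is again a Dirichlet distribution (Bernardo, 2000).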
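For concreteness, the following sketch evaluates the logarithm of the standard Dirichlet density and draws samples by normalizing independent Gamma variates, which is a well-known construction. The function names are ours and only the Python standard library is used.

```python
import math
import random


def dirichlet_log_density(x, alpha):
    """Log of the Dirichlet density at the probability vector x."""
    if len(x) != len(alpha):
        raise ValueError("x and alpha must have the same length")
    log_norm = math.lgamma(sum(alpha)) - sum(math.lgamma(a) for a in alpha)
    return log_norm + sum((a - 1.0) * math.log(xi) for a, xi in zip(alpha, x))


def sample_dirichlet(alpha, rng):
    """Draw one probability vector by normalizing Gamma(alpha_i, 1) variates."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [g / total for g in draws]


if __name__ == "__main__":
    rng = random.Random(0)
    vector = sample_dirichlet([2.0, 3.0, 5.0], rng)
    print("sample:", [round(v, 3) for v in vector], "sum:", round(sum(vector), 6))
    print("log density:", dirichlet_log_density(vector, [2.0, 3.0, 5.0]))
```

Every sample is a probability vector, so it can be used directly as one row of a conditional probability table.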
In this way, the decision maker holds beliefs about the conditional probability tables (CPTs) in the form of parameters of several Dirichlet distributions. Using the agent’s current beliefs, a causal graphical model can be specified. Using this fully specified model (structure plus parameters) as if it were the true model, the decision maker makes her choice as in Case 1. When the decision maker observes the value of the target variable, she updates the parameters that specify her beliefs.
Previously we said that the agent’s beliefs are distributions over a suitable space; what is actually updated are the parameters of those distributions, namely the parameter vector of the Dirichlet random variable assigned to each CPT.
For the belief updating, given a new data point, two cases must be considered:
The variable to update has no parents.
The variable to update has parents.
In the first case, if a Dirichlet($\alpha$) prior is used, then the posterior is given by
$$\mathrm{Dirichlet}(\alpha + c),$$
where $c$ is the vector containing the number of occurrences of each value in the observed data.
For the second case, we must consider both the occurrences of that data point and the values of the parents of each variable. Following (Barber, 2012), we denote by $\sharp(X = x, \mathrm{Pa}(X) = j)$ the number of times the event $\{X = x, \mathrm{Pa}(X) = j\}$ is observed. In this case, if the prior of $X$ conditioned on its parents having the value $j$ is a Dirichlet($\alpha_j$), then the posterior for the variable given the observed data is
$$\mathrm{Dirichlet}(\alpha_j + c_j),$$
where $c_j$ is the vector whose component for the value $x$ is $\sharp(X = x, \mathrm{Pa}(X) = j)$.
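Both update cases amount to adding observed counts to the Dirichlet parameters. The sketch below keeps one parameter vector per parent configuration, with the empty tuple standing for a variable without parents. The class name, the uniform prior of 1.0 and the value labels are assumptions made for illustration.

```python
class DirichletCPT:
    """Beliefs about P(child | parents): one Dirichlet parameter vector
    per observed parent configuration (hypothetical helper)."""

    def __init__(self, child_values, prior=1.0):
        self.child_values = list(child_values)
        self.prior = prior
        self.alpha = {}  # parent configuration (tuple) -> list of parameters

    def _row(self, parents):
        # Create the prior row lazily the first time a configuration is seen.
        return self.alpha.setdefault(
            tuple(parents), [self.prior] * len(self.child_values)
        )

    def update(self, child_value, parents=()):
        """Conjugate update: add one count for the observed value."""
        self._row(parents)[self.child_values.index(child_value)] += 1.0

    def mean(self, parents=()):
        """Posterior mean of P(child | parents) under the current beliefs."""
        row = self._row(parents)
        total = sum(row)
        return {v: a / total for v, a in zip(self.child_values, row)}


if __name__ == "__main__":
    lives = DirichletCPT(["dies", "lives"])
    for outcome in ("lives", "lives", "dies"):
        lives.update(outcome, parents=("pill", "d1"))
    print(lives.mean(parents=("pill", "d1")))

    disease = DirichletCPT(["d1", "d2"])  # variable without parents
    disease.update("d1")
    print(disease.mean())
```

Only the row that matches the observed parent values changes, which is why the play of Nature has to be observed, or estimated, at the end of each round.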
4.3 The model is not known
If the causal model is fully unknown, the decision maker has to deal with the problem using only her previous knowledge and her own intuitions. Again, any previous knowledge and considerations are expressed as beliefs about the uncertainties in the environment, which take the form of probability distributions over a suitable space.
As in the previous case, we consider a repeated game where the base game consists of Nature assigning a random state to the environment and responding to the agent’s choices with the effects caused by her decisions. In this game, as in the previous one, the decision maker attempts to learn by updating, and using, her beliefs in a suitable way.
The most notable difference with the previous case is that the structure of the model must also be learned, in such a way that both the structure and the parameters converge to the true model in the limit. In the previous case the decision maker knew the form of the Conditional Probability Tables (CPTs) involved in any calculation. In this case, she does not know the structure of the DAG, so it is unknown which CPTs are involved.
If the decision maker knew which variables appear in the true model that governs the environment, even without knowing how they are connected, she could use a Dirichlet Process to generate Dirichlet distributions, generate causal graphical models in the same way as in Case 2, and update the parameters of the process using the observed information. The Dirichlet Process, whose parameters are a base measure and a concentration parameter and which was introduced by (Ferguson, 1973), is a random measure defined over a space such that, for each finite measurable partition of that space, the vector of measures assigned to the sets of the partition follows a Dirichlet distribution (Hjort et al., 2010; Müller et al., 2016; Ghosal & van der Vaart, 2017).
Belief updating using causal information when the decision maker knows neither the structure of the model nor its parameters is yet to be studied and is left as future work.
5 Test scenario
We consider the following hypothetical example where the proposed method will be tested.
Consider a patient who arrives at a hospital with one of two possible diseases. The doctor can either give the patient a pill or send the patient into surgery. Both treatments entail risks, and whether the treatment cures the patient or not depends on which disease the patient had originally. The doctor could be facing a mutation of a known disease, so she has some knowledge about what could happen if a treatment is given to the patient. Using her previous knowledge as if it were the true model, she can choose a treatment and observe the outcome, from which she learns about this disease, so that she can make a better decision the next time a similar patient arrives.
The causal model that governs this situation is shown in Figure 1. The parameters for this model were fixed intuitively in such a way that each treatment is effective for only one disease, but the most effective treatment is riskier.
The variables in the model are:
Disease: Either one of the two possible diseases.
Treatment: Either pill or surgery.
Reaction: Either dying or surviving.
Lives: Either living or dying.
The variables are causally related as shown in Figure 1.
The variable Lives is the target variable and, in this example, the only variable that can be intervened upon is the variable Treatment. The decision maker prefers an outcome in which the patient lives.
In this scenario, Nature’s first move consists of randomly assigning a disease to the patient. Then, the medic assigns a treatment using her current beliefs about the disease and the possible outcomes. The decision nodes for this play of the medic form an information set, because the medic does not know which disease Nature assigned and therefore does not know at which node she is. Finally, Nature samples the consequence of the treatment from the causal model and the medic observes the outcome.
For this test scenario, whose causal graphical model is shown in Figure 1, applying Pearl’s do-calculus shows that the interventional distribution is given by
In fact, from the structure of the model, which is shown in Figure 1, we see that the involved probabilities in any calculations are:
We can also see that the joint distribution for all of the variables can be expressed as
This expression will be useful when specifying beliefs about the model as Dirichlet distributions.
As a proof of concept, we implemented Case 1 and Case 2 for the test scenario and compared them with an agent performing Q-learning (Watkins & Dayan, 1992) and an agent choosing her actions at random.
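The configuration of the Q-learning baseline is not detailed here, so the following is only a hedged sketch of how such a baseline could look for a single-state, one-step problem. In that setting the bootstrap term of the Q-learning update vanishes and the rule reduces to moving each action value towards the observed reward. The learning rate, the exploration rate and the reward function are all assumptions.

```python
import random


def q_learning_single_state(reward_fn, actions, rounds,
                            alpha=0.1, epsilon=0.1, seed=0):
    """Tabular Q-learning with epsilon-greedy exploration for one-step episodes.

    alpha (learning rate) and epsilon (exploration rate) are assumed values.
    """
    rng = random.Random(seed)
    q = {a: 0.0 for a in actions}
    for _ in range(rounds):
        if rng.random() < epsilon:
            action = rng.choice(list(actions))  # explore
        else:
            action = max(actions, key=q.get)    # exploit current estimates
        reward = reward_fn(action, rng)
        # Episodes end after one step, so there is no next-state value to add.
        q[action] += alpha * (reward - q[action])
    return q


def toy_reward(action, rng):
    """Hypothetical reward: the pill succeeds more often than surgery."""
    p_success = 0.66 if action == "pill" else 0.62
    return 1.0 if rng.random() < p_success else 0.0


if __name__ == "__main__":
    print(q_learning_single_state(toy_reward, ("pill", "surgery"), rounds=200))
```

An agent that chooses at random corresponds to setting `epsilon=1.0` in this sketch.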
For the implementation, we defined a true causal model as the one shown in Figure 1 using the library pgmpy (Ankan & Panda, 2015). Then, we defined an agent that holds beliefs for each of the CPTs that appear in the factorization of the model, and randomly assigned values to the parameters of each one of the Dirichlet distributions, as mentioned before.
This agent finds the action that maximizes the probability of her desired value for the target variable using do-calculus. The action thus selected is used as evidence in the true causal model, and MAP inference is used to simulate the most likely outcome given this action. The target variable will output a value if the patient lives. The value of the target variable is the reward of each round.
6.1 Case 1: The causal model is completely known
If the causal model is completely known to the decision maker, then in one step she can obtain the probability of her desired value of the target variable, which in this case is the value corresponding to the outcome in which the patient lives at the end. Using this probability, she can choose which treatment to assign. Since this action maximizes the probability of the desired value occurring, it maximizes the expected utility, and it is also a best response to the player Nature.
6.2 Case 2: Only the structure is known
From the expression of the joint probability we see that we need to specify a distribution over each one of the conditional distributions in it; each of these will be a Dirichlet distribution.
We begin with a random assignment of the parameters of each of the distributions considered. We use a Dirichlet distribution for each of the conditional probability tables that appear in the factorization of the joint probability for the graph of the causal model. Since each of the variables in the model is binary, each of these Dirichlet distributions is a Beta distribution, and with fully observed rounds the posterior over all the tables is again a product of Dirichlet distributions.
With these parameters, the decision maker forms a causal model and chooses the action that maximizes the probability of the desired value of the target variable, as in Case 1. With this action chosen, we simulate an outcome from the causal graphical model using the chosen action as an intervention. This evidence is used to update the parameters, which are then used to generate a new causal model, and so on.
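The loop described above can be sketched end to end under several assumptions: a hypothetical graph in which Disease and Treatment are the parents of Lives, a local model obtained by sampling each table from its current Dirichlet (here Beta) belief, a reward of 1 when the patient lives and 0 otherwise, and a play of Nature that is revealed at the end of every round. None of the numbers come from the model in Figure 1.

```python
import random

ACTIONS = ("pill", "surgery")
DISEASES = ("d1", "d2")

# Hypothetical "true" model, hidden from the agent (numbers are assumptions).
TRUE_P_DISEASE = {"d1": 0.6, "d2": 0.4}
TRUE_P_LIVES = {("pill", "d1"): 0.9, ("pill", "d2"): 0.3,
                ("surgery", "d1"): 0.5, ("surgery", "d2"): 0.8}


def sample_beta(a, b, rng):
    """Two-dimensional Dirichlet (Beta) draw via normalized Gamma variates."""
    x = rng.gammavariate(a, 1.0)
    y = rng.gammavariate(b, 1.0)
    return x / (x + y)


def run(rounds, seed=0):
    rng = random.Random(seed)
    # Dirichlet parameters: [d1, d2] for Disease, [lives, dies] per parent row.
    disease_counts = [1.0, 1.0]
    lives_counts = {(a, d): [1.0, 1.0] for a in ACTIONS for d in DISEASES}
    rewards = []
    for _ in range(rounds):
        # 1. Form a local causal model from the current beliefs.
        p_d1 = sample_beta(disease_counts[0], disease_counts[1], rng)
        p_disease = {"d1": p_d1, "d2": 1.0 - p_d1}
        p_lives = {k: sample_beta(c[0], c[1], rng)
                   for k, c in lives_counts.items()}
        # 2. Choose the action maximizing P(Lives | do(Treatment)) in it.
        action = max(ACTIONS, key=lambda a: sum(
            p_disease[d] * p_lives[(a, d)] for d in DISEASES))
        # 3. Nature plays: a disease, then the outcome of the intervention.
        disease = "d1" if rng.random() < TRUE_P_DISEASE["d1"] else "d2"
        lives = rng.random() < TRUE_P_LIVES[(action, disease)]
        # 4. Conjugate update of the beliefs with the observed round.
        disease_counts[DISEASES.index(disease)] += 1.0
        lives_counts[(action, disease)][0 if lives else 1] += 1.0
        rewards.append(1 if lives else 0)
    return rewards, disease_counts, lives_counts


if __name__ == "__main__":
    rewards, disease_counts, _ = run(200)
    print("average reward:", sum(rewards) / len(rewards))
    print("Disease parameters:", disease_counts)
```

Sampling the local model is one possible reading of "forming a model from beliefs"; using the posterior mean of each table instead is an equally plausible alternative, at the cost of less exploration.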
We show the results of the experiment, where we compare the performance obtained by the causal agent, an agent who selects her actions at random, and an agent performing Q-learning. We show the average reward obtained by each agent over 20, 50, 100 and 200 rounds.
In Figure 2 we observe the average rewards for each agent in 20 rounds of decision making. Here we notice that Q-learning outperforms our algorithm, whose performance is similar to that of the random agent until round 11.
In Figure 3 we observe the average rewards for each agent in 50 rounds of decision making. Our algorithm closely follows the Q-learning agent and outperforms the random agent.
In Figure 4 we observe the average reward obtained by the three agents in 100 rounds, where our algorithm slightly outperforms Q-Learning.
In Figure 5 we observe the average reward obtained by the three agents in 200 rounds. The average reward obtained is very similar for Q-learning and our algorithm.
We see that our method obtains a reward very similar to that of the classic Q-learning algorithm for a larger number of rounds, where the random agent is outperformed. Even though our model performs similarly to a classic learning algorithm, it additionally learns a causal model of the environment, which makes it possible to explain why an agent chose her actions and to reason counterfactually.
7 Conclusion and Future Work
In this work we have proposed a way to make decisions in uncertain environments which are known to be governed by causal mechanisms. The proposed decision making procedure attempts to resemble how human beings act when causal relations are present. Human beings are known to use, and modify, causal knowledge when making decisions.
These ideas motivated us to study how an autonomous agent could learn from her environment when it has a particular structure, in this case the structure given by causal relations.
We assumed here that the decision maker is aware of the causal nature of the environment, but lacks information about its specific parameters, which are to be learned by interaction. It is reasonable to assume that the variables involved are known, since in many situations we are aware of what we are intervening upon and what we expect it to affect.
The experimental results show that considering causal structure in a decision-making process yields performance comparable to that of classic non-causal algorithms, while our model has the extra feature of learning a causal model of the environment, which could be exported to problems involving similar scenarios.
The problem of discovering the variables themselves and the connections between them is far more general and is left as future work.
We are grateful to the National Institute of Astrophysics Optics and Electronics (INAOE) and to Mrs. Graciela Soto for their generous funding in order to attend ICML 2018.
- Agarwal et al. (2014) Agarwal, Alekh, Hsu, Daniel, Kale, Satyen, Langford, John, Li, Lihong, and Schapire, Robert. Taming the monster: A fast and simple algorithm for contextual bandits. In International Conference on Machine Learning, pp. 1638–1646, 2014.
- Ankan & Panda (2015) Ankan, Ankur and Panda, Abinash. pgmpy: Probabilistic graphical models using python. In Proceedings of the 14th Python in Science Conference (SCIPY 2015). Citeseer, 2015.
- Audibert & Bubeck (2010) Audibert, Jean-Yves and Bubeck, Sébastien. Best arm identification in multi-armed bandits. In COLT-23th Conference on Learning Theory-2010, pp. 13–p, 2010.
- Barber (2012) Barber, David. Bayesian Reasoning and Machine Learning. Cambridge University Press, 2012.
- Bernardo (2000) Bernardo, JM. Bayesian theory. Wiley Series in Probability and Statistics., 2000.
- Bubeck et al. (2009) Bubeck, Sébastien, Munos, Rémi, and Stoltz, Gilles. Pure exploration in multi-armed bandits problems. In International conference on Algorithmic learning theory, pp. 23–37. Springer, 2009.
- Carpentier & Locatelli (2016) Carpentier, Alexandra and Locatelli, Andrea. Tight (lower) bounds for the fixed budget best arm identification bandit problem. In Conference on Learning Theory, pp. 590–604, 2016.
- Chen & Li (2015) Chen, Lijie and Li, Jian. On the optimal sample complexity for best arm identification. arXiv preprint arXiv:1511.03774, 2015.
- Danks (2014) Danks, David. Unifying the mind: Cognitive representations as graphical models. MIT Press, 2014.
- Dempster et al. (1977) Dempster, Arthur P., Laird, Nan M., and Rubin, Donald B. Maximum likelihood from incomplete data via the em algorithm. Journal of the royal statistical society. Series B (methodological), pp. 1–38, 1977.
- Ferguson (1973) Ferguson, Thomas S. A bayesian analysis of some nonparametric problems. The Annals of Statistics, 1(2):209–230, 1973.
- Gabillon et al. (2012) Gabillon, Victor, Ghavamzadeh, Mohammad, and Lazaric, Alessandro. Best arm identification: A unified approach to fixed budget and fixed confidence. In Advances in Neural Information Processing Systems, pp. 3212–3220, 2012.
- Garnelo et al. (2016) Garnelo, Marta, Arulkumaran, Kai, and Shanahan, Murray. Towards deep symbolic reinforcement learning. arXiv preprint arXiv:1609.05518, 2016.
- Ghosal & van der Vaart (2017) Ghosal, Subhashis and van der Vaart, Aad. Fundamentals of nonparametric Bayesian inference, volume 44. Cambridge University Press, 2017.
- Gilboa (2009) Gilboa, Itzhak. Theory of Decision under Uncertainty. Cambridge University Press, 2009.
- Hagmayer & Meder (2013) Hagmayer, York and Meder, Björn. Repeated causal decision making. Journal of Experimental Psychology: Learning, Memory, and Cognition, 39(1):33, 2013.
- Hjort et al. (2010) Hjort, Nils Lid, Holmes, Chris, Müller, Peter, and Walker, Stephen G. Bayesian nonparametrics, volume 28. Cambridge University Press, 2010.
- Jamieson & Nowak (2014) Jamieson, Kevin and Nowak, Robert. Best-arm identification algorithms for multi-armed bandits in the fixed confidence setting. In Information Sciences and Systems (CISS), 2014 48th Annual Conference on, pp. 1–6. IEEE, 2014.
- Jamieson et al. (2014) Jamieson, Kevin, Malloy, Matthew, Nowak, Robert, and Bubeck, Sébastien. lil’ucb: An optimal exploration algorithm for multi-armed bandits. In Conference on Learning Theory, pp. 423–439, 2014.
- Kaufmann et al. (2016) Kaufmann, Emilie, Cappé, Olivier, and Garivier, Aurélien. On the complexity of best-arm identification in multi-armed bandit models. The Journal of Machine Learning Research, 17(1):1–42, 2016.
- Lake et al. (2017) Lake, Brenden M., Ullman, Tomer D., Tenenbaum, Joshua B., and Gershman, Samuel J. Building machines that learn and think like people. Behavioral and Brain Sciences, 40, 2017.
- Lattimore et al. (2016) Lattimore, Finnian, Lattimore, Tor, and Reid, Mark D. Causal bandits: Learning good interventions via causal inference. In Lee, D. D., Sugiyama, M., Luxburg, U. V., Guyon, I., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 29, pp. 1181–1189. Curran Associates, Inc., 2016.
- Meder et al. (2010) Meder, Björn, Gerstenberg, Tobias, Hagmayer, York, and Waldmann, Michael R. Observing and intervening: Rational and heuristic models of causal decision making. The Open Psychology Journal, 3:119–135, 2010.
- Müller et al. (2016) Müller, Peter, Quintana, Fernando Andrés, Jara, Alejandro, and Hanson, Tim. Bayesian nonparametric data analysis. Springer series in Statistics, 2016.
- Nichols & Danks (2007) Nichols, William and Danks, David. Decision making using learned causal structures. In Proceedings of the Annual Meeting of the Cognitive Science Society, volume 29, 2007.
- Ortega & Braun (2014) Ortega, Pedro A. and Braun, Daniel A. Generalized thompson sampling for sequential decision-making and causal inference. Complex Adaptive Systems Modeling, 2(1):2, 2014.
- Pearl (2009) Pearl, Judea. Causality. Cambridge university press, 2009.
- Pearl (2018a) Pearl, Judea. Theoretical impediments to machine learning with seven sparks from the causal revolution. arXiv preprint arXiv:1801.04016, 2018a.
- Pearl & Mackenzie (2018) Pearl, Judea and Mackenzie, Dana. The Book of Why: The New Science of Cause and Effect. Basic Books, 2018.
- Pearl (2018b) Pearl, J., Mackenzie D. The Book of Why: The New Science of Cause and Effect. Basic Books, 2018b.
- Rottman & Hastie (2014) Rottman, Benjamin Margolin and Hastie, Reid. Reasoning about causal relationships: Inferences on causal networks. Psychological bulletin, 140(1):109, 2014.
- Russo (2016) Russo, Daniel. Simple bayesian algorithms for best arm identification. In Conference on Learning Theory, pp. 1417–1418, 2016.
- Sen et al. (2017) Sen, Rajat, Shanmugam, Karthikeyan, Dimakis, Alexandros G, and Shakkottai, Sanjay. Identifying best interventions through online importance sampling. In International Conference on Machine Learning, pp. 3057–3066, 2017.
- Shoham & Leyton-Brown (2008) Shoham, Yoav and Leyton-Brown, Kevin. Multiagent systems: Algorithmic, game-theoretic, and logical foundations. Cambridge University Press, 2008.
- Sloman & Hagmayer (2006) Sloman, Steven A. and Hagmayer, York. The causal psycho-logic of choice. Trends in Cognitive Sciences, 10(9):407–412, 2006.
- Spirtes et al. (2000) Spirtes, Peter, Glymour, Clark N., and Scheines, Richard. Causation, prediction, and search. MIT press, 2000.
- Sutton & Barto (1998) Sutton, Richard S. and Barto, Andrew G. Reinforcement learning: An introduction. MIT Press, 1998.
- Watkins & Dayan (1992) Watkins, Christopher JCH and Dayan, Peter. Q-learning. Machine learning, 8(3-4):279–292, 1992.