1 The challenges of experimental design with big data
The value of experimental design in the physical and socio-medical fields is increasingly recognised, but the systems under consideration are becoming more complex. In many areas it may not be possible to carry out a carefully controlled experiment, yet huge quantities of data are being produced, for example from social media and web-based transactions. An added problem is that the traditions of experimental design differ. For example, in engineering design it will be possible to do a controlled experiment on a test bench, whereas in the socio-medical sciences the local counterfactual will be missing: we do not know how a particular patient would have fared had they not been given the drug. Foundational work on these issues is by b10. Roughly, the causal effect can only be measured on average, with great care taken about the background population and with more reluctance than in the physical sciences to extend the conclusions outside the population under study. An old issue, which goes back into the history of science, is the distinction between active and passive observation. Is placing a sensor on a driverless car to collect data (for control) an intervention in the sense of the declaration that to prove causation you have to intervene? Despite these different historical traditions there seems to be general agreement (i) that deriving causal models is a kind of gold standard and (ii) that to produce a causal model we need to guard against bias from different sources: hidden confounders, sampling bias, incomplete models, feedback and so on.
We cover a few of the ideas from the theory of causation (Section 2) and then suggest that the double activity of building causal models while at the same time guarding against bias has features of a cooperative game (Section 3.1). At its simplest, a randomized clinical trial is a minimax solution to a game against the sources of bias. With this in mind we make the natural but speculative suggestion that we can import theories of Nash equilibrium, and we supply a simple example motivated by the theory of optimum experimental design under the heading of optimal bias design. We could have taken a Bayesian optimal design, for example from b6; b12. But for this short paper we felt it was enough to allow our randomness to come from the error distribution or the randomization itself.
2 Causal Models
A major critique of passive analysis of the machine-learning type is the lack of attention to the building of causal models. We discuss briefly the main ingredients of causal graphical models and then the implications for experimental design b11.
A causal model is often described via a directed acyclic graph (DAG), in which each vertex represents a variable. Care has to be taken with the edges. The natural intuition that an edge means that one variable causes another is not correct, at least not without much qualification. The DAG is a vehicle for describing all conditional independence structures.
We can define a variable which is never observed as latent, or hidden. There is a slight difference: hidden may mean that we do not know it is there but it might be. Latent may also be taken as expressing prior information. Thus a latent layer in a machine-learning context may be included to allow a more complex model, such as a mixture model.
The conundrum with causal models stems from the distinction between passive observation and active experimental design. Experimental design is an intervention and there are essentially two types. First, we can simply apply some kind of treatment at node to obtain a special , for example give a patient a drug. Second, and even more active, one can set variable to say high and low levels.
Passive observation means that a joint sampling distribution covers all observed . The act of setting should be thought of as advantageous in the sense that we are in some kind of classical or optimal design framework, but disadvantageous in that it is destructive. Roughly, setting destroys our ability to learn about the population from which comes.
Consider a simple DAG: and for ease of explanation we write down a univariate linear version with obvious interpretation
where are error variables. Suppose we are interested in the last causal parameter . Ideal would be to carry out a controlled experiment, setting the levels of and observing . The first assumption to make is governed by the following:
Principle 1. The distribution of conditional on a set value of is the same as when the same value of was passively observed.
There are arguments to justify this but it remains a most important assumption. We can also passively observe . Note that the model is nonlinear in the parameters as and also that is Gaussian if the are Gaussian. One may not have to choose between a controlled experiment and passive observation. This leads to another principle; see b5.
Principle 2. A mixture of passive observation and active experimentation may be optimal.
There is considerable discussion in trying to understand how to learn for DAG models with interventions, and controlled experiments are a form of intervention. Most effort has been put into identifiability; see b3 for a review. In our example suppose there is an extra arrow . Such an arrow is referred to as a backdoor. If the index is time we can say that there is another path from into the future in addition to .
Now if we fix
we cannot so simply estimate because the distribution of is corrupted by the new path. In the observational case, we have another parameter and the changed equation
There are now too many parameters for the observations (even with replication).
The celebrated backdoor theorem due to b9 tells us how to obtain identifiability. Suppose you want to see whether causes , then we need two conditions for a good conditioning set of variables :
No node (variable) in is a descendant of
blocks every (backdoor) path from to that has an arrow into .
This theorem tells us: (i) whether there is confounding given this DAG, (ii) whether it is possible to remove the confounding and (iii) which variables to condition on to eliminate the confounding. For example, if we are trying to establish the effect of on then we must observe, or set and condition on, any which is not a descendant of and blocks all paths from, in our case, ancestors of . In addition, if there are any other downstream (future) variables such as an extra with , then will not interfere with our causal analysis; we can forget it. In summary:
Principle 3. Guard against effects from nuisance confounders by suitable additional conditioning.
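The effect of conditioning on a backdoor variable can be illustrated with a small simulation. The sketch below is only an illustration under stated assumptions: the three-variable DAG (`z -> x`, `z -> y`, `x -> y`), the coefficient values, the Gaussian errors and the use of ordinary least squares are all hypothetical choices and are not taken from the model discussed above.

```python
import numpy as np

def ols(columns, y):
    """Least squares coefficients of y on an intercept plus the given columns."""
    design = np.column_stack([np.ones(len(y))] + list(columns))
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef

def simulate(n=20000, effect=1.5, seed=0):
    """Hypothetical DAG: z -> x, z -> y and x -> y, with Gaussian errors."""
    rng = np.random.default_rng(seed)
    z = rng.normal(size=n)                         # confounder (backdoor variable)
    x = 0.8 * z + rng.normal(size=n)               # treatment depends on z
    y = effect * x + 2.0 * z + rng.normal(size=n)  # response depends on x and z
    return z, x, y

z, x, y = simulate()
naive = ols([x], y)[1]        # ignores z: the backdoor path x <- z -> y is open
adjusted = ols([x, z], y)[1]  # conditions on z: the backdoor path is blocked
print(f"naive estimate:    {naive:.3f}")
print(f"adjusted estimate: {adjusted:.3f}")
```

In this assumed set-up the naive regression absorbs part of the effect of `z` into the coefficient of `x`, whereas the adjusted regression recovers a value close to the assumed causal coefficient of 1.5.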
3 Bias Models
Before presenting our contribution, we briefly review relevant literature. For the model without bias
with , for , and the usual assumptions on the random error, Drovandi and Stufken propose information-based and sequential algorithms (also response adaptive in Drovandi ) for the selection of a subsample from a large, or possibly big, dataset. They provide an optimal subsample with respect to a chosen utility function.
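To give a feel for what an informative subsample can achieve, the following sketch compares a simple extreme-value selection rule with uniform subsampling, using the log-determinant of the information matrix of a linear model as the utility. It is a simplified illustration in the spirit of information-based subsampling and should not be read as the algorithms of the cited authors; the function names, the data-generating distribution and the choice of a determinant utility are assumptions.

```python
import numpy as np

def extreme_value_subsample(X, k):
    """Pick about k rows by taking, for each column in turn, the rows with the
    smallest and largest values among those not selected yet."""
    n, p = X.shape
    r = k // (2 * p)                      # rows taken from each tail of each column
    available = np.ones(n, dtype=bool)
    chosen = []
    for j in range(p):
        idx = np.flatnonzero(available)
        order = idx[np.argsort(X[idx, j])]
        picked = np.concatenate([order[:r], order[-r:]])
        chosen.extend(picked.tolist())
        available[picked] = False
    return np.array(chosen)

def log_det_information(X):
    """log det of F'F for a linear model with intercept and main effects."""
    F = np.column_stack([np.ones(len(X)), X])
    _, logdet = np.linalg.slogdet(F.T @ F)
    return logdet

rng = np.random.default_rng(1)
X = rng.normal(size=(10000, 2))           # hypothetical large dataset
k = 40                                    # subsample size
informative = extreme_value_subsample(X, k)
uniform = rng.choice(len(X), size=k, replace=False)

print("extreme-value subsample:", log_det_information(X[informative]))
print("uniform subsample:      ", log_det_information(X[uniform]))
```

Under these assumptions the extreme-value subsample spreads the selected points over the covariate space and therefore yields a larger information determinant than a uniform subsample of the same size.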
Bias models and optimal design of experiments were considered by b8 and, recently, in the context of big data by Wiens . Those authors add a bias term to the model (2) and thus study . They search for a design which minimises the mean squared error of the least squares estimator of the parameters, guarding against the bias term. In particular, Wiens proposes a theory of minimax I- and D-robust designs as subsets of a large finite set of points, while b8 prove results for a design to be optimal when the effect of the bias term is bounded above by a given constant and below by zero.
The conditioning argument of the backdoor theorem is a way of avoiding biases. In the above example in Equation (1) gives a bias. Enough conditioning creates a kind of laboratory inside which we can conduct our experiment by setting the level of . Sometimes this is referred to as creating a Markov blanket. But there are sources of bias which either we do not know at all or have some ideas about but are too costly to control. Biases range from those we really know about but simply do not observe to those which are introduced to model additional variability. This will affect the overall distribution of the observed variables, in a way similar to classical factor analysis.
Principle 4. Special models are needed to ascertain and guard against hidden sources of bias, for example using randomization or latent variable methods.
We build on the ideas in b8 and discuss in detail how optimal experimental design can guard against hidden sources of bias, indicated below with the letter . Thus consider a two-part model in which the first part is the causal model of main interest with parameters and the second part is the bias term with parameters . This separation is familiar from traditional experimental design where and might be treatment and block parameters, respectively, b1; b8. The model is:
are independent and have equal variance.
We want to protect the usual least squares estimator, , obtained from the reduced model in Equation (2) ignoring the bias term. Define the full moment matrix by
where is the experimental design measure over -space. Then the mean squared error (MSE) matrix can be written as
with the standardised bias parameter and the sample size (see b8 ).
Well-known optimality criteria ask us to minimise, over the choice of experimental design, the quantity (the trace criterion, or A-optimality) or (the -optimality criterion).
The design problem is easier when the design space and design are direct products and thus can be written as
with and , and are finite subsets of, respectively, and . Then, includes a term which depends only on , likewise a factor in depends only on .
The most familiar example is from clinical trials where one compares a treatment against a control. Consider the simple case
where the are unwanted confounders which may be a source of bias, the is the grand mean and points are allocated to each group. Adapting the above analysis we obtain
where the , , terms are the group means and . The bias term is which is zero when . This is the simplest case of balance and extends easily to multivariate . A number of methods of achieving balance have been studied, each of which can be cast in the above framework
Stratification: balancing in each stratum and then aggregating the difference.
Distance methods: pairing up treatment and control with which are close in -space with respect to some distance such as Mahalanobis distance b7 .
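The role of balance can be checked numerically. The sketch below assumes a two-arm trial with one scalar confounder `z`, a noise-free response and hypothetical values `tau` (treatment effect) and `theta` (confounder coefficient); none of these names or values come from the text. It shows that the error of the difference-in-means estimator equals `theta` times the imbalance in the group means of `z`, and that pairing units that are close in `z` (a one-dimensional stand-in for a distance method) reduces that imbalance.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 40
z = rng.normal(size=n)        # hypothetical confounder
tau, theta = 1.0, 3.0         # assumed treatment effect and confounder coefficient

def estimate_and_imbalance(treated):
    """Difference-in-means estimate and confounder imbalance for an allocation."""
    y = tau * treated + theta * z     # noise-free response, so the identity is exact
    estimate = y[treated].mean() - y[~treated].mean()
    imbalance = z[treated].mean() - z[~treated].mean()
    return estimate, imbalance

order = np.argsort(z)

# Unbalanced allocation: treat the half of the units with the largest z.
unbalanced = np.zeros(n, dtype=bool)
unbalanced[order[n // 2:]] = True

# Matched allocation: pair neighbours in z and alternate which member is treated.
matched = np.zeros(n, dtype=bool)
for pair_index, start in enumerate(range(0, n, 2)):
    first, second = order[start], order[start + 1]
    matched[first if pair_index % 2 == 0 else second] = True

results = {}
for name, allocation in [("unbalanced", unbalanced), ("matched", matched)]:
    estimate, imbalance = estimate_and_imbalance(allocation)
    results[name] = (estimate, imbalance)
    print(f"{name:10s} bias = {estimate - tau:+.3f}, theta * imbalance = {theta * imbalance:+.3f}")
```

With random errors added to the response the identity would hold in expectation rather than exactly; the noise-free version is used here only to make the bias term visible.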
3.1 A Game Theoretic Approach
For ease of explanation we introduce two players: Alice (A) and Bob (B). Alice selects a causal model design using and Bob selects design using . In the product case (4), Alice and Bob can operate separately. In other cases they may cooperate fully to find the best design over the design space for the pair . However there is another possibility, namely to use a Nash equilibrium approach b2 ; b4 ; b13 .
For two players A and B with composite cost functions and solutions at equilibrium it holds
We illustrate the presence of Nash equilibrium in causation-bias set up by a simple example. We take a distorted design space, but still a product-type design measure. Thus, let the model be
and let the design have four support points (we put the design measure in the second line):
Since, in this case, is a column vector:
The equilibrium takes the form:
There are two Nash equilibria given by solving This gives two solutions and with and computed numerically. Note that neither solution depends on , and in fact scale invariance of this kind is a well-known feature of Nash equilibrium.
We can compare the solution with an overall optimisation by setting and minimizing . The minimum is , it is achieved at with . Whereas at the value of is approximated to with .
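The equilibrium notion used in this comparison can be sketched for a finite two-player game in which each player minimises a cost. The cost matrices below are hypothetical and unrelated to the design example; the sketch only enumerates pure-strategy equilibria and compares them with the joint minimiser of the summed cost.

```python
import numpy as np

def pure_nash_equilibria(cost_a, cost_b):
    """Return all (i, j) at which neither player can lower their own cost by a
    unilateral deviation. Player A chooses the row i, player B the column j."""
    cost_a = np.asarray(cost_a, dtype=float)
    cost_b = np.asarray(cost_b, dtype=float)
    equilibria = []
    for i in range(cost_a.shape[0]):
        for j in range(cost_a.shape[1]):
            a_cannot_improve = cost_a[i, j] <= cost_a[:, j].min()
            b_cannot_improve = cost_b[i, j] <= cost_b[i, :].min()
            if a_cannot_improve and b_cannot_improve:
                equilibria.append((i, j))
    return equilibria

# Hypothetical cost matrices for players A and B.
cost_a = [[1, 4], [3, 2]]
cost_b = [[1, 3], [4, 2]]

equilibria = pure_nash_equilibria(cost_a, cost_b)
total = np.asarray(cost_a) + np.asarray(cost_b)
i_opt, j_opt = np.unravel_index(np.argmin(total), total.shape)
joint_optimum = (int(i_opt), int(j_opt))

print("Nash equilibria:", equilibria)
print("joint optimum of the summed cost:", joint_optimum)
```

In this toy game there are two equilibria, only one of which coincides with the joint optimum, which mirrors the qualitative point that an equilibrium need not minimise an overall criterion.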
Let us return to the role of Bob in our narrative. His experimental design decision will depend on his knowledge about the bias. For ease of explanation we reduce the argument to two canonical cases.
Approach 1. Unknown
Under a restriction
this achieves a maximum at the maximum eigenvalue. We can take this as our criterion, which is close to the E-optimality of optimum design theory.
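A criterion of this kind rests on the standard fact that a quadratic form, maximised over vectors of unit norm, equals the largest eigenvalue of its matrix. The sketch below checks this numerically, assuming that the restriction is a unit Euclidean norm and that the quantity being maximised is a quadratic form in a symmetric positive semidefinite matrix; the matrix `A` and the sampled directions are hypothetical stand-ins.

```python
import numpy as np

rng = np.random.default_rng(3)
B = rng.normal(size=(4, 4))
A = B @ B.T                               # hypothetical symmetric PSD matrix

# eigh returns eigenvalues in ascending order for a symmetric matrix.
eigenvalues, eigenvectors = np.linalg.eigh(A)
lambda_max = eigenvalues[-1]
top_vector = eigenvectors[:, -1]

# Evaluate the quadratic form on many random unit vectors.
directions = rng.normal(size=(5000, 4))
directions /= np.linalg.norm(directions, axis=1, keepdims=True)
values = np.einsum("ij,jk,ik->i", directions, A, directions)

print("largest sampled value:      ", values.max())
print("maximum eigenvalue:         ", lambda_max)
print("value at leading eigenvector:", top_vector @ A @ top_vector)
```

No sampled direction exceeds the maximum eigenvalue, and the bound is attained at the leading eigenvector, which is the worst-case direction under the assumed norm constraint.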
Approach 2. In equation (3), for unknown in some function class, we have
where . We cannot optimise over because, in our narrative, Alice needs it for the causal parameter . A solution is then
is the randomization distribution. In the language of game theory this is a mixed strategy to achieve a minimax solution.
Randomization has been heralded as one of the most important contributions of statistics to scientific discovery. There are several arguments put forward for using randomization: (i) it helps support assumptions of exchangeability in a Bayesian analysis, (ii) it supports classical zero mean and equal variance arguments and (iii) it produces roughly balanced samples.
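The zero-mean and balance arguments can be checked on a small example by enumerating the randomization distribution exactly. The sketch assumes complete randomization of eight units into two equal groups and uses hypothetical covariate values `z`; it compares the average imbalance over all allocations with the worst single allocation.

```python
from itertools import combinations

import numpy as np

z = np.array([0.3, -1.2, 2.5, 0.7, -0.4, 1.9, -2.1, 0.1])  # hypothetical covariate
n = len(z)
half = n // 2

# Enumerate every allocation of exactly half of the units to treatment.
imbalances = []
for treated in combinations(range(n), half):
    mask = np.zeros(n, dtype=bool)
    mask[list(treated)] = True
    imbalances.append(z[mask].mean() - z[~mask].mean())
imbalances = np.array(imbalances)

expected = imbalances.mean()              # average over the randomization distribution
worst_case = np.abs(imbalances).max()     # worst fixed allocation

print("number of allocations:", len(imbalances))
print("expected imbalance:   ", expected)
print("worst-case imbalance: ", worst_case)
```

Each allocation appears together with its complement, so the imbalances cancel in expectation even though a single fixed allocation can be badly unbalanced; this is the sense in which a mixed strategy protects against an unknown bias.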
After a discussion of some issues related to the use of experimental design to help establish causation in complex models, we study in a little more detail the use of optimal design methods to remove bias. In the standard case the causal part of a model can be estimated orthogonally from the bias. In more complex cases the problem can be set up as a cooperative game. We demonstrate the existence of Nash equilibria for a small example and point to a formulation which would include randomization. This is preliminary work; establishing model classes (for example special ’s, ’s, ’s) and conditions on for which Approaches 1 and 2 can be turned into efficient algorithms is the object of current work. The general proposition is that such methods will help protect causal models against bias.
Acknowledgements. We thank the anonymous reviewers for thorough reading of the manuscript.
- (1) Box, G.E., Draper, N.R.: A basis for the selection of a response surface design. Journal of the American Statistical Association, 54(287), 622–654 (1959)
- (2) Cheng, C.S., Li, K.C.: A minimax approach to sample surveys. The Annals of Statistics, 552–563 (1983)
- (3) Drovandi, C.C., Holmes, C., McGree, J.M., Mengersen, K., Richardson, S., Ryan, E.G.: Principles of Experimental Design for Big Data Analysis. Statistical Science, 32(3), 385–404 (2017)
- (4) Drton, M., Weihs, L.: Generic identifiability of linear structural equation models by ancestor decomposition. Scandinavian Journal of Statistics, 43(4), 1035–1045 (2016)
- (5) Grant, W.C., Anstrom, K.J.: Minimizing selection bias in randomized trials: A Nash equilibrium approach to optimal randomization. Journal of Economic Behavior & Organization, 66(3), 606–624 (2008)
- (6) Hainy, M., Müller, W.G., Wynn, H.P.: Approximate Bayesian computation design (ABCD), an introduction. In: mODa 10–Advances in Model-Oriented Design and Analysis, 135–143. Springer, Heidelberg (2013)
- (7) Hainy, M., Müller, W.G., Wynn, H.P.: Learning functions and approximate Bayesian computation design: ABCD. Entropy, 16(8), 4353–4374 (2014)
- (8) LaLonde, R.J.: Evaluating the econometric evaluations of training programs with experimental data. The American Economic Review, 604–620 (1986)
- (9) Montepiedra, G., Fedorov, V.V.: Minimum bias designs with constraints. Journal of Statistical Planning and Inference, 63(1), 97–111 (1997)
- (10) Pearl, J.: Causality. Cambridge University Press (2009)
- (11) Rosenbaum, P.R., Rubin, D.B.: The central role of the propensity score in observational studies for causal effects. Biometrika, 70(1), 41–55 (1983)
- (12) Rubin, D.B.: Bayesian inference for causal effects: The role of randomization. The Annals of Statistics, 34–58 (1978)
- (13) Sebastiani, P., Wynn, H.P.: Maximum entropy sampling and optimal Bayesian experimental design. Journal of the Royal Statistical Society: Series B, 62(1), 145–157 (2000)
- (14) Stenger, H.: A minimax approach to randomization and estimation in survey sampling. The Annals of Statistics, 395–399 (1979)
- (15) Wang, H., Yang, M., Stufken, J.: Information-Based Optimal Subdata Selection for Big Data Linear Regression. Journal of the American Statistical Association (in press)
- (16) Wiens, D.P.: I-robust and D-robust designs on a finite design space. Statistics and Computing, 28(2), 241–258 (2018)