Imagine an image of a bridge over a river. On top of the bridge, a car speeds through the right lane. Consider the question
“Is there a car in this image?”
This is a question about the observable properties of the scene under consideration, and modern computer vision algorithms excel at answering these kinds of questions. Excelling at this task is fundamentally about leveraging correlations between pixels and image features across large datasets of images.¹ However, a more nuanced understanding of images arguably requires the ability to reason about how the scene depicted in the image would change in response to interventions. The list of possible interventions is long and complex but, as a first step, we can reason about the intervention of removing an object.
¹ Here and below, the term correlation is meant to include the more general concept of statistical dependence. The term feature denotes, for instance, a numerical value from the image representation of a convolutional neural network.
To this end, consider the two counterfactual questions “What would the scene look like if we were to remove the car?” and “What would the scene look like if we were to remove the bridge?” On the one hand, the first intervention seems rather benign. We could argue that the rest of the scene depicted in the image (the river, the bridge) would remain the same if the car were removed. On the other hand, the second intervention seems more severe. If the bridge were removed from the scene, it would make little sense for us to observe the car floating weightless over the river. Thus, we understand that removing the bridge would have an effect on the cars located on top of it. Reasoning about these and similar counterfactuals allows us to begin asking questions of the form
“Why is there a car in this image?”
This question is of course poorly defined, but the answer is linked to the causal relationship between the bridge and the car. In our example, the presence of the bridge causes the presence of the car, in the sense that if the bridge were not there, then the car would not be either. Such interventional semantics of what is meant by causation aligns with current approaches in the literature.
In light of this exposition, it seems plausible that the objects in a scene share asymmetric causal relationships. These causal relationships, in turn, may differ significantly from the correlation structures that modern computer vision algorithms exploit. For instance, most of the images of cars in a given dataset may also contain roads. Therefore, features of cars and features of roads will be highly correlated, and therefore features of roads may be good car predictors in an iid setting irrespective of the underlying causal structure . However, should a car sinking in the ocean be given a low “car score” by our object recognition algorithm because of its unusual context? The answer depends on the application. If the goal is to maximize the average object recognition score over a test set that has the same distribution as the training set, then we should use the context to make our decision. However, if the goal is to reason about non-iid situations, or cases that may require intervention, such as saving the driver from drowning in the ocean, we should be robust and not refuse to believe that a car is a car just because of its context.
While the correlation structure of image features may shift dramatically between different data sets or between training data and test data, we expect the causal structure of image features to be more stable. Therefore, object recognition algorithms capable of leveraging knowledge of the cause-effect relations between image features may exhibit better generalization to novel test distributions. For these reasons, the detection of causal signals in images is of great interest. However, this is a very challenging task: in static image datasets we lack the arrow of time, face strong selection biases (pictures are often taken to show particular objects), and randomized experiments (the gold standard to infer causation) are infeasible. Given these constraints, our present interest is in detecting causal signals in observational data.
In the absence of any assumptions, the determination of causal relations between random variables given samples from their joint distribution is impossible in principle [12, 13]. In particular, any joint distribution over two random variables $X$ and $Y$ is consistent with any of the following three underlying causal structures: (i) $X$ causes $Y$, (ii) $Y$ causes $X$, and (iii) $X$ and $Y$ are both caused by an unobserved confounder $Z$. However, while the causal structure may not be identifiable in principle, it may be possible to determine the structure in practice. For joint distributions that occur in the real world, the different causal interpretations may not be equally likely. That is, the causal direction between typical variables of interest may leave a detectable signature in their joint distribution. In this work, we will exploit this insight to build a classifier for determining the cause-effect relation between two random variables from samples of their joint distribution.
Our experiments will show that the higher-order statistics of image datasets can inform us about causal relations. To our knowledge, no prior work has established, or even considered, the existence of such a signal.
In particular, we make a first step towards the discovery of causation in visual features by examining large collections of images of different objects of interest such as cats, dogs, trains, buses, cars, and people. The locations of these objects in the images are given to us in the form of bounding boxes. For each object of interest, we can distinguish between object features and context features. By definition, object features are those mostly activated inside the bounding box of the object of interest. On the other hand, context features are those mostly found outside the bounding box of the object of interest. Independently and in parallel, we will distinguish between causal features and anticausal features. Causal features are those that cause the presence of the object of interest in the image (that is, those features that cause the object’s class label), while anticausal features are those caused by the presence of the object in the image (that is, those features caused by the class label). Our hypothesis, to be validated empirically, is the following.
Hypothesis 1. Object features and anticausal features are closely related. Context features and causal features are not necessarily related.
We expect Hypothesis 1 to be true because many of the features caused by the presence of an object should be features of subparts of the object and hence likely to be contained inside its bounding box (the presence of a car causes the presence of the car’s wheels). However, the context of an object may cause or be caused by its presence (road-like features cause the presence of a car, but the presence of a car causes its shadow on a sunny day). Providing empirical evidence supporting Hypothesis 1 would imply that (1) there exists a relation between causation and the difference between objects and their contexts, and (2) there exist observable causal signals within sets of static images.
Our exposition is organized as follows. Section 2 introduces the basics of causal inference from observational data. Section 3 proposes a new algorithm, the Neural Causation Coefficient (NCC), for learning to infer causation from a corpus of labeled data. NCC is shown to outperform the previous state-of-the-art in cause-effect inference. Section 4 makes use of NCC to distinguish between causal and anticausal features. As hypothesized, we show a consistent relationship between anticausal features and object features. Finally, Section 5 closes our exposition by offering some conclusions and directions for future research. Our code is available at http://github.com/lopezpaz/visual_causation.
2 Observational causal inference
Randomized experiments are the gold standard for causal inference. Just as a child may drop a toy to probe the nature of gravity, experiments rely on interventions to reveal the causal relations between variables of interest. However, experiments are often expensive, unethical, or impossible to conduct. In these situations, we must discern cause from effect purely through observational data and without the ability to intervene. This is the domain of observational causal inference.
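Observational causal inference considers an observational sample
$$S = \{(x_j, y_j)\}_{j=1}^{m} \sim P^m(X, Y), \quad (1)$$
and aims to infer whether $X \to Y$ or $Y \to X$. In particular, $S$ is assumed to be drawn from one of two models: from a causal model where $X \to Y$, or from an anticausal model where $X \leftarrow Y$. Figure 1 exemplifies a family of such models, the Additive Noise Model (ANM), where the effect variable is the noisy realization of a nonlinear function of the cause variable.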
If we make no assumptions about the distributions of the cause, the noise, and the mechanism appearing in Figure 1, the problem of observational causal inference is nonidentifiable. To address this issue, we assume that whenever $X \to Y$, the causes, noises, and mechanisms are jointly independent. This should be interpreted as an informal statement that includes two types of independences. One is the independence between the cause and the mechanism (ICM) [8, 16], which is formalized not as an independence between the input variable $x$ and the mechanism $f$, but as an independence between the data source (that is, the distribution $P(X)$) and the mechanism mapping cause to effect. This can be formalized either probabilistically or in terms of algorithmic complexity. The second independence is between the cause and the noise. This is a standard assumption in structural equation modeling, and it can be related to causal sufficiency. Essentially, if this assumption is violated, our causal model is too small and we should include additional variables. ICM is one incarnation of uniformitarianism: processes in nature are fixed and agnostic to the distributions of their causal inputs. In lay terms, believing these assumptions amounts to not believing in spurious correlations.
For most choices of cause distribution, noise distribution, and mechanism, ICM will be violated in the anticausal direction. This violation will often leave an observable statistical footprint, rendering cause and effect distinguishable from observational data alone. But what exactly are these causal footprints, and how can we develop statistical tests to find them?
2.1 Examples of observable causal footprints
Here we illustrate two types of observable causal footprints.
First, consider a linear additive noise model $Y = f(X) + N$, where the cause $X$ and the noise $N$ are two independent uniform random variables with bounded range, and the mechanism $f$ is a linear function (Figure 1(a)). Crucially, it is impossible to construct a linear additive noise model $X = g(Y) + N'$ where the new cause $Y$ and the new noise $N'$ are two independent random variables (except in degenerate cases). This is illustrated in Figure 1(b), where the variance of the new noise variable $N'$ varies across different locations of the new cause variable $Y$, and is depicted by red bars. Therefore, the ICM assumption is satisfied for the correct causal direction $X \to Y$ but violated for the wrong causal direction $Y \to X$. This asymmetry makes the cause distinguishable from the effect. Here, the relevant footprint is the independence between $X$ and $N$.²
² A simple exercise reveals that whenever the cause variable and the noise variable are two independent Gaussian random variables, the symmetry of the distribution renders cause and effect indistinguishable under the linear additive noise model.
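The footprint described above can be checked numerically. The following is a minimal sketch, assuming a unit-slope mechanism and uniform cause and noise on [-1, 1] (both choices, and all function names, are illustrative and not taken from the paper). It fits a line in each direction and compares the residual spread across bins of the regressor.

```python
import random
import statistics

def fit_line(xs, ys):
    """Ordinary least squares fit ys ~ a * xs + b; returns (a, b)."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    a = sxy / sxx
    return a, my - a * mx

def residual_spread_by_bin(inputs, outputs, n_bins=5):
    """Fit outputs ~ inputs linearly, then return the residual standard
    deviation inside each of n_bins equal-width bins of the input."""
    a, b = fit_line(inputs, outputs)
    lo, hi = min(inputs), max(inputs)
    bins = [[] for _ in range(n_bins)]
    for u, v in zip(inputs, outputs):
        idx = min(int((u - lo) / (hi - lo) * n_bins), n_bins - 1)
        bins[idx].append(v - (a * u + b))
    return [statistics.pstdev(r) for r in bins if len(r) > 1]

def spread_ratio(spreads):
    """Largest over smallest per-bin residual spread (near 1 means homoscedastic)."""
    return max(spreads) / min(spreads)

rng = random.Random(0)
x = [rng.uniform(-1.0, 1.0) for _ in range(20000)]  # cause (assumed range)
n = [rng.uniform(-1.0, 1.0) for _ in range(20000)]  # noise, independent of x
y = [xi + ni for xi, ni in zip(x, n)]               # effect: y = f(x) + n with f(x) = x

causal = spread_ratio(residual_spread_by_bin(x, y))      # regress y on x
anticausal = spread_ratio(residual_spread_by_bin(y, x))  # regress x on y
print(f"causal ratio: {causal:.2f}, anticausal ratio: {anticausal:.2f}")
```

In the causal direction the residual spread should be roughly constant across bins, whereas in the anticausal direction it shrinks towards the extremes of the regressor, which is the dependence between new cause and new noise that the paragraph describes.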
Second, consider a new observational sample where $X \to Y$, where $f$ is a monotonic function, and where $Y = f(X)$ deterministically. The causal relationship is deterministic, and so the noise-based footprints from the previous paragraphs are rendered useless. Let us assume that the density $p(Y)$ increases whenever the slope of $f$ decreases, as depicted by Figure 1(c). Loosely speaking, the shape of the effect distribution $P(Y)$ is thus not independent of the mechanism $f$. In this example, ICM is satisfied under the correct causal direction $X \to Y$, but violated under the wrong causal direction $Y \to X$. Again, this asymmetry renders the cause distinguishable from the effect. Here, the relevant footprint is a form of independence between the density of $X$ and the slope of $f$.
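One way to turn this deterministic footprint into a test is a slope-based score in the spirit of Daniušis et al. The sketch below rescales both variables to [0, 1], averages the log absolute slope between neighbouring points, and prefers the direction with the lower score. The uniform cause, the cubic mechanism, and all names are illustrative assumptions.

```python
import math
import random

def rescale(values):
    """Affinely rescale values to the unit interval [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def slope_score(cause, effect):
    """Slope-based estimate of E[log |f'(cause)|] after rescaling both
    variables to [0, 1]; lower values favour the direction cause -> effect."""
    pairs = sorted(zip(rescale(cause), rescale(effect)))
    total, count = 0.0, 0
    for (c0, e0), (c1, e1) in zip(pairs, pairs[1:]):
        dc, de = c1 - c0, e1 - e0
        if dc != 0.0 and de != 0.0:  # skip ties to avoid log(0) or division by zero
            total += math.log(abs(de / dc))
            count += 1
    return total / count

rng = random.Random(0)
x = [rng.random() for _ in range(2000)]  # cause, uniform on [0, 1] (assumed)
y = [xi ** 3 for xi in x]                # deterministic effect y = f(x) = x^3 (assumed)

score_xy = slope_score(x, y)  # candidate direction x -> y
score_yx = slope_score(y, x)  # candidate direction y -> x
inferred = "x -> y" if score_xy < score_yx else "y -> x"
```

For an invertible deterministic mechanism the two scores are negatives of each other, so the sign of `score_xy` alone decides the direction in this toy setting.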
It may be possible to continue in this manner, considering more classes of models and adding new footprints to detect causation in each case. However, engineering and maintaining a catalog of causal footprints is a tedious task, and any such catalog will most likely be incomplete. To amend this issue, the next section proposes to use neural networks to learn causal footprints directly from data.
3 The neural causation coefficient
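To learn causal footprints from data, we follow prior work and pose cause-effect inference as a binary classification task. Our input patterns $S_i$ are effectively scatterplots similar to those shown in Figures 1(a) and 1(b). That is, each data point is a bag of samples $(x_{ij}, y_{ij}) \in \mathbb{R}^2$ drawn iid from a distribution $P(X_i, Y_i)$. The class label $l_i$ indicates the causal direction between $X_i$ and $Y_i$:
$$D = \{(S_i, l_i)\}_{i=1}^{n}, \quad S_i = \{(x_{ij}, y_{ij})\}_{j=1}^{m_i} \sim P^{m_i}(X_i, Y_i), \quad l_i = \begin{cases} 0 & \text{if } X_i \to Y_i, \\ 1 & \text{if } X_i \leftarrow Y_i. \end{cases} \quad (2)$$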
Using data of this form, we will train a neural network to classify samples from probability distributions as causal or anticausal. Since the input patterns $S_i$ are not fixed-dimensional vectors, but bags of points, we borrow inspiration from the literature on kernel mean embedding classifiers and construct a feedforward neural network of the form
$$\mathrm{NCC}(\{(x_{ij}, y_{ij})\}_{j=1}^{m_i}) = \psi\left(\frac{1}{m_i} \sum_{j=1}^{m_i} \phi(x_{ij}, y_{ij})\right).$$
In the previous equation, $\phi$ is a feature map, and the average over all $\phi(x_{ij}, y_{ij})$ is the mean embedding of the empirical distribution of the sample $S_i$. The function $\psi$ is a binary classifier that takes a fixed-length mean embedding as input.
In kernel methods, $\phi$ is fixed a priori and defined with respect to a nonlinear kernel. In contrast, our feature map $\phi$ and our classifier $\psi$ are both multilayer perceptrons, which are learned jointly from data. Figure 3 illustrates the proposed architecture, which we term the Neural Causation Coefficient (NCC). In short, to classify a sample $S_i$ as causal or anticausal, NCC maps each point $(x_{ij}, y_{ij})$ in the sample $S_i$ to the representation $\phi(x_{ij}, y_{ij})$, computes the embedding vector $\frac{1}{m_i} \sum_{j=1}^{m_i} \phi(x_{ij}, y_{ij})$ across all points, and classifies the embedding vector as causal or anticausal using the neural network classifier $\psi$. Importantly, the proposed neural architecture is not restricted to cause-effect inference, and can be used to represent and learn from general distributions.
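The embed, average, classify pipeline can be sketched in a few lines. This is an illustration only: the weights are random and untrained, the layer sizes are arbitrary, and a sigmoid stands in for the softmax output layer mentioned later in the text. Batch normalization and dropout are omitted, and all names are hypothetical.

```python
import math
import random

def make_layer(rng, n_in, n_out):
    """Random (untrained) weights and zero biases for one dense layer."""
    weights = [[rng.gauss(0.0, 1.0 / math.sqrt(n_in)) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out

def dense(layer, inputs, relu=True):
    """Apply one dense layer, optionally followed by a ReLU."""
    weights, biases = layer
    out = [sum(w * v for w, v in zip(row, inputs)) + b
           for row, b in zip(weights, biases)]
    return [max(0.0, o) for o in out] if relu else out

def ncc_forward(sample, phi_layers, psi_layers):
    """Embed each (x, y) point with phi, average, then classify with psi."""
    embedded = []
    for point in sample:
        h = list(point)
        for layer in phi_layers:
            h = dense(layer, h)
        embedded.append(h)
    # Mean embedding: a fixed-length vector regardless of the sample size.
    mean = [sum(col) / len(embedded) for col in zip(*embedded)]
    h = mean
    for layer in psi_layers[:-1]:
        h = dense(layer, h)
    logit = dense(psi_layers[-1], h, relu=False)[0]
    return 1.0 / (1.0 + math.exp(-logit))  # score in (0, 1)

rng = random.Random(0)
phi = [make_layer(rng, 2, 16), make_layer(rng, 16, 16)]  # two embedding layers
psi = [make_layer(rng, 16, 16), make_layer(rng, 16, 1)]  # classifier on the mean embedding
sample = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(50)]
score = ncc_forward(sample, phi, psi)
```

Because the points are averaged before classification, the score does not depend on the order of the points and the same network accepts samples of any size.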
NCC has some attractive properties. First, predicting the cause-effect relation for a new set of samples at test time can be done efficiently with a single forward pass through the aggregate network. The complexity of this operation is linear in the number of samples. In contrast, the computational complexity of kernel-based additive noise model inference algorithms is cubic in the number of samples. Second, NCC can be trained using mixtures of different causal and anticausal generative models, such as linear, non-linear, noisy, and deterministic mechanisms linking causes to their effects. This rich training allows NCC to learn a diversity of causal footprints simultaneously. Third, for differentiable activation functions, NCC is a differentiable function. This allows us to embed NCC into larger neural architectures or to use it as a regularization term to encourage the learning of causal or anticausal patterns.
The flexibility of NCC comes at a cost. In practice, labeled cause-effect data as in Equation (2) is scarce and laborious to collect. Because of this, we follow  and train NCC on artificially generated data. This turns out to be advantageous as it gives us easy access to unlimited data. In the following, we describe the process to generate synthetic cause-effect data along with the training procedure for NCC, and demonstrate the performance of NCC on real-world cause-effect data.
3.1 Synthesis of training data
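We will construct $n$ synthetic observational samples, where the $i$th observational sample contains $m_i$ points. The points comprising the observational sample $S_i = \{(x_{ij}, y_{ij})\}_{j=1}^{m_i}$ are drawn from an additive noise model $y_{ij} = f_i(x_{ij}) + v_{ij} e_{ij}$, for all $j = 1, \ldots, m_i$.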
The cause terms $x_{ij}$ are drawn from a mixture of Gaussian distributions. We construct each Gaussian by sampling its mean from , its standard deviation from followed by an absolute value, and its unnormalized mixture weight from followed by an absolute value. We sample and . We normalize the mixture weights to sum to one. We normalize the cause terms to zero mean and unit variance.
The mechanism $f_i$ is a cubic Hermite spline with support and knots drawn from , where . The noiseless effect terms are normalized to have zero mean and unit variance.
The noise terms $e_{ij}$ are sampled from , where . To generalize the ICM, we allow for heteroscedastic noise: we multiply each $e_{ij}$ by $v_{ij}$, where $v_{ij}$ is the value of a smoothing spline with support defined in Equation (3) and random knots drawn from . The noisy effect terms are normalized to have zero mean and unit variance.
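This sampling process produces a training set of labeled observational samples
$$D = \{(\{(x_{ij}, y_{ij})\}_{j=1}^{m_i}, 0)\}_{i=1}^{n} \cup \{(\{(y_{ij}, x_{ij})\}_{j=1}^{m_i}, 1)\}_{i=1}^{n}. \quad (4)$$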
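The generation recipe can be sketched as follows. The sampling distributions and hyperparameters are not recoverable from the text here, so every numeric value below (`k`, `r`, `s`, `n_knots`, `noise_scale`, `m`) is an assumption. A piecewise-linear interpolant stands in for the cubic Hermite spline, and the noise is kept homoscedastic for brevity.

```python
import bisect
import random
import statistics

def standardize(values):
    """Shift and scale values to zero mean and unit variance."""
    mu, sd = statistics.fmean(values), statistics.pstdev(values)
    return [(v - mu) / sd for v in values]

def sample_cause(rng, m, k=3, r=2.0, s=2.0):
    """Draw m points from a random mixture of k Gaussians (k, r, s are assumed)."""
    means = [rng.gauss(0.0, r) for _ in range(k)]
    stds = [abs(rng.gauss(0.0, s)) for _ in range(k)]
    weights = [abs(rng.gauss(0.0, 1.0)) for _ in range(k)]
    comps = rng.choices(range(k), weights=weights, k=m)  # choices() normalizes the weights
    return standardize([rng.gauss(means[c], stds[c]) for c in comps])

def random_mechanism(rng, xs, n_knots=5):
    """Piecewise-linear stand-in for the spline mechanism, with random knot values."""
    lo, hi = min(xs), max(xs)
    knots_x = [lo + (hi - lo) * i / (n_knots - 1) for i in range(n_knots)]
    knots_y = [rng.gauss(0.0, 1.0) for _ in range(n_knots)]
    def f(x):
        i = min(max(bisect.bisect_right(knots_x, x) - 1, 0), n_knots - 2)
        t = (x - knots_x[i]) / (knots_x[i + 1] - knots_x[i])
        return knots_y[i] + t * (knots_y[i + 1] - knots_y[i])
    return f

def synthesize_pair(rng, m=200, noise_scale=0.3):
    """Return [(sample, 0), (swapped sample, 1)] for one additive noise model."""
    x = sample_cause(rng, m)
    f = random_mechanism(rng, x)
    y = standardize([f(xi) for xi in x])  # noiseless effect
    y = standardize([yi + noise_scale * rng.gauss(0.0, 1.0) for yi in y])  # noisy effect
    causal = list(zip(x, y))                  # x -> y, label 0
    anticausal = [(b, a) for a, b in causal]  # roles swapped, label 1
    return [(causal, 0), (anticausal, 1)]
```

Calling `synthesize_pair(random.Random(0))` yields one scatterplot in both orientations, which is the unit from which a labeled training set is assembled.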
3.2 Training NCC
We train NCC with two embedding layers and two classification layers followed by a softmax output layer. Each hidden layer is a composition of batch normalization and dropout. We train for iterations using RMSProp with the default parameters, where each minibatch is of the form given in Equation (4) and has size . Lastly, we further enforce the symmetry $P(X \to Y) = 1 - P(Y \to X)$ by training the composite classifier
$$\frac{1}{2}\left(1 - \mathrm{NCC}(\{(x_{ij}, y_{ij})\}_{j=1}^{m_i}) + \mathrm{NCC}(\{(y_{ij}, x_{ij})\}_{j=1}^{m_i})\right), \quad (5)$$
where $\mathrm{NCC}(\{(x_{ij}, y_{ij})\}_{j=1}^{m_i})$ tends to zero if the classifier believes in $X_i \to Y_i$, and tends to one if the classifier believes in $X_i \leftarrow Y_i$. We chose our parameters by monitoring the validation error of NCC on a held-out set of synthetic observational samples. Using this held-out validation set, we cross-validated the percentage of dropout over , the number of hidden layers over , and the number of hidden units in each of the layers over .
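The symmetry constraint can be illustrated independently of any trained network. The sketch below assumes the composite takes the form ½(1 − NCC(S) + NCC(S̃)), where S̃ is the sample with the roles of the two variables exchanged. `toy_ncc` is a hypothetical stand-in for a trained scorer.

```python
import math

def swap(sample):
    """Exchange the roles of the two variables in a bag of (x, y) points."""
    return [(y, x) for x, y in sample]

def symmetrize(ncc):
    """Wrap a scorer ncc(sample) -> [0, 1] into a composite scorer g with
    g(sample) + g(swap(sample)) == 1 for every sample."""
    def composite(sample):
        return 0.5 * (1.0 - ncc(sample) + ncc(swap(sample)))
    return composite

def toy_ncc(sample):
    """Hypothetical stand-in for a trained network: an arbitrary asymmetric
    statistic of the sample squashed into (0, 1)."""
    asym = sum(x * x * y for x, y in sample) / len(sample)
    return 1.0 / (1.0 + math.exp(-asym))

composite = symmetrize(toy_ncc)
```

Under this assumed form, the composite approaches one when the wrapped scorer outputs values near zero for the sample and near one for its swapped version, so the two orientations of a scatterplot always receive complementary scores.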
3.3 Testing NCC
We test the performance of NCC on the Tübingen dataset, version 1.0. This is a collection of one hundred heterogeneous, hand-collected, real-world cause-effect observational samples that are widely used as a benchmark in the causal inference literature. The NCC model with the highest synthetic held-out validation accuracy correctly classifies the cause-effect direction of of the Tübingen dataset observational samples. This result outperforms the previous state-of-the-art on observational cause-effect inference, which achieves accuracy on this dataset.³
³ The accuracies reported in Lopez-Paz et al. are for version 0.8 of the dataset, so we reran the algorithm from Lopez-Paz et al. on version 1.0 of the dataset.
4 Causal signals in sets of static images
We have all the necessary tools to explore the existence of causal signals in sets of static images at our disposal. In the following, we describe the datasets that we use, the process of extracting features from these datasets, and the measurement of object scores, context scores, causal scores, and anticausal scores for the extracted features. Finally, we validate Hypothesis 1 empirically.
4.1 Datasets
We conduct our experiments with the two datasets PASCAL VOC 2012 and Microsoft COCO. These datasets contain heterogeneous images collected “in the wild.” Each image may contain multiple objects from different categories. The objects may appear at different scales and angles and may be partially visible or occluded. In the PASCAL dataset, we study all the twenty classes aeroplane, bicycle, bird, boat, bottle, bus, car, cat, chair, cow, dining table, dog, horse, motorbike, person, potted plant, sheep, sofa, train, and television. This dataset contains 11,541 images. In the COCO dataset, we study the same classes. This selection amounts to 99,309 images. We preprocess the images to have a shortest side of pixels, and then take the central crop.
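The preprocessing geometry can be sketched without touching pixel data. The target sizes are not stated in the text here, so `shortest_side` and `crop` are left as parameters, and the function name is hypothetical.

```python
def resize_then_center_crop(width, height, shortest_side, crop):
    """Return the resized (width, height) and the (left, top, right, bottom)
    box of a central crop-by-crop window, preserving the aspect ratio."""
    scale = shortest_side / min(width, height)
    new_w, new_h = round(width * scale), round(height * scale)
    left, top = (new_w - crop) // 2, (new_h - crop) // 2
    return (new_w, new_h), (left, top, left + crop, top + crop)

# Illustrative values only: a 640x480 image, shortest side 240, crop 200.
size, box = resize_then_center_crop(640, 480, shortest_side=240, crop=200)
```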
4.2 Feature extraction
We use the last hidden representation (before its nonlinearity) of a residual deep convolutional neural network of 18 layers as a feature extractor. This network was trained on the entire ImageNet dataset. In particular, we denote by $f_j = f(w_j)$ the vector of real-valued features obtained from the image $w_j$ using this network.
Building on top of these features and using the images from the PASCAL dataset, we train a neural network classifier formed by two hidden layers of units each to distinguish between the classes under study. In particular, we denote by $c_j = c(w_j)$ the vector of continuous log odds (activations before the classifier nonlinearity) obtained from the image $w_j$ using this classifier. We use features before their nonlinearity and log odds instead of the class probabilities or class labels because NCC has been trained on continuous data with full support on $\mathbb{R}$.
In the following we describe how to compute, for each feature, four different scores: its object score, context score, causal score, and anticausal score. Importantly, the object/context scores are computed independently from the causal/anticausal scores. For simplicity, the following sections describe how to compute scores for a particular object of interest. However, our experiments will repeat this process for all the twenty objects of interest.
4.2.1 Computing “object” and “context” feature scores
We featurize each image $w_j$ in the COCO dataset in three different ways. First, we featurize the original image $w_j$ as $f_j = f(w_j)$. Second, we black out the context of the objects of interest in $w_j$ by placing zero-valued pixels outside their bounding boxes. This produces the object image $w_j^o$, as illustrated in Figure 3(b). We featurize $w_j^o$ as $f_j^o = f(w_j^o)$. Third, we black out the objects of interest in $w_j$ by placing zero-valued pixels inside their bounding boxes. This produces the context image $w_j^c$, as illustrated in Figure 3(c). We featurize $w_j^c$ as $f_j^c = f(w_j^c)$.
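The two masking operations can be sketched on a toy single-channel image stored as nested lists. The bounding-box convention (left, top, right, bottom, with right and bottom exclusive) and the function name are assumptions made for illustration.

```python
def black_out(image, boxes, keep_inside):
    """Zero pixels outside (keep_inside=True) or inside (keep_inside=False)
    the union of bounding boxes given as (left, top, right, bottom),
    with right and bottom exclusive."""
    def covered(row, col):
        return any(l <= col < r and t <= row < b for l, t, r, b in boxes)
    return [[pix if covered(i, j) == keep_inside else 0
             for j, pix in enumerate(row)]
            for i, row in enumerate(image)]

image = [[1, 2, 3, 4],
         [5, 6, 7, 8],
         [9, 10, 11, 12]]
boxes = [(1, 0, 3, 2)]  # columns 1-2, rows 0-1
object_image = black_out(image, boxes, keep_inside=True)    # context removed
context_image = black_out(image, boxes, keep_inside=False)  # object removed
```

Taking the union of boxes means that an image with several instances of the object of interest keeps (or loses) all of them at once.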
Using the previous three featurizations we compute, for each feature, its object score and its context score. Intuitively, features with high object scores are those features that react violently when the object of interest is removed from the image.
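A possible scoring rule consistent with this intuition is sketched below. The exact definitions are not given in the text here, so the normalized absolute differences are an assumption: the object score measures how much a feature changes when the object is blacked out, and the context score measures how much it changes when the context is blacked out.

```python
def feature_scores(original, object_only, context_only):
    """Each argument is a list over images of equal-length feature vectors.
    Returns (object_scores, context_scores), one value per feature.
    Assumes every feature is nonzero on at least one original image."""
    n_features = len(original[0])
    object_scores, context_scores = [], []
    for l in range(n_features):
        norm = sum(abs(f[l]) for f in original)
        # Object score: change in feature l once the object is blacked out.
        removed_object = sum(abs(c[l] - f[l]) for c, f in zip(context_only, original))
        # Context score: change in feature l once the context is blacked out.
        removed_context = sum(abs(o[l] - f[l]) for o, f in zip(object_only, original))
        object_scores.append(removed_object / norm)
        context_scores.append(removed_context / norm)
    return object_scores, context_scores

# Toy example: feature 0 lives on the object, feature 1 on the context.
original = [[2.0, 1.0], [4.0, 1.0]]
object_only = [[2.0, 0.0], [4.0, 0.0]]
context_only = [[0.0, 1.0], [0.0, 1.0]]
obj, ctx = feature_scores(original, object_only, context_only)
```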
Furthermore, we compute the log odds for the presence of the object of interest in the original image.
4.2.2 Computing “causal” and “anticausal” feature scores
For each feature, we compute its causal score and its anticausal score. Because we will be examining one feature at a time, the values taken by all other features will be an additional source of noise to our analysis, and the observed dependencies will be much weaker than in the synthetic NCC training data. To avoid detecting causation between independent random variables, we train NCC with an augmented training set: in addition to presenting each scatterplot in both causal directions as in (4), we pick a random permutation to generate an additional uncorrelated example with label . We use our best model of this kind which, for validation purposes, obtains accuracy in the Tübingen dataset.
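The augmentation can be sketched as follows. The label assigned to the permuted example is not stated in the text here, so it is left as a required parameter, `independent_label`, and the function name is hypothetical.

```python
import random

def augment(sample, rng, independent_label):
    """Turn one causal scatterplot (x -> y) into three training examples."""
    xs = [x for x, _ in sample]
    ys = [y for _, y in sample]
    shuffled_xs = xs[:]
    rng.shuffle(shuffled_xs)  # a random permutation breaks the x-y pairing
    return [
        (list(zip(xs, ys)), 0),                           # x -> y
        (list(zip(ys, xs)), 1),                           # roles swapped
        (list(zip(shuffled_xs, ys)), independent_label),  # dependence removed
    ]
```

Permuting only one coordinate keeps both marginal distributions intact while destroying the dependence between the variables, which is what makes the third example a useful negative case.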
Figure 5 shows the mean and standard deviation of the object scores and the context scores of the features with the top 1% anticausal scores and the top 1% causal scores. As predicted by Hypothesis 1, object features are related to anticausal features. In particular, the features with the highest anticausal score exhibit a higher object score than the features with the highest causal score. This effect is consistent across all classes of interest when selecting the top 1% causal/anticausal features, and remains consistent across out of classes of interest when selecting the top 20% causal/anticausal features. These results indicate that anticausal features may be useful for detecting objects in a robust manner, regardless of their context. As stated in Hypothesis 1, we could not find a consistent relationship between context features and causal features. Remarkably, we remind the reader that NCC was trained to detect the arrow of causation independently and from synthetic data. As a sanity check, we did not obtain any similar results when replacing the NCC with the correlation coefficient or the absolute value of the correlation coefficient.
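The comparison behind this analysis amounts to ranking features by one score and summarizing another score over the top fraction. A minimal sketch, with a hypothetical function name and toy numbers:

```python
import statistics

def object_score_of_top(rank_scores, object_scores, fraction):
    """Mean and standard deviation of object_scores over the features whose
    rank_scores fall in the top `fraction` (at least one feature is kept)."""
    k = max(1, round(fraction * len(rank_scores)))
    order = sorted(range(len(rank_scores)), key=lambda l: rank_scores[l], reverse=True)
    chosen = [object_scores[l] for l in order[:k]]
    return statistics.fmean(chosen), statistics.pstdev(chosen)

# Toy numbers: four features, keep the top half by anticausal score.
anticausal_scores = [0.1, 0.9, 0.5, 0.8]
object_scores = [1.0, 2.0, 3.0, 4.0]
mean, std = object_score_of_top(anticausal_scores, object_scores, fraction=0.5)
```

Running the same function with causal scores as `rank_scores` gives the second group to compare against.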
Although outside the scope of this paper, we ran some preliminary experiments to find causal relationships between objects of interest, by computing the NCC scores between the log odds of different objects of interest. The strongest causal relationships that we found were “bus causes car,” “chair causes plant,” “chair causes sofa,” “dining table causes bottle,” “dining table causes chair,” “dining table causes plant,” “television causes chair,” and “television causes sofa.”
5 Conclusion
Our experiments indicate the existence of statistically observable causal signals within sets of static images. However, further research is needed to best capture and exploit causal signals for applications in image understanding and robust object detection. In particular, we stress the importance of (1) building large, real-world datasets to aid research in causal inference, (2) extending data-driven techniques like NCC to causal inference of more than two variables, and (3) exploring data with explicit causal signals, such as the arrow of time in videos.
The authors thank Florent Perronin and Matthijs Douze for fruitful discussions.
- Daniušis et al.  P. Daniušis, D. Janzing, J. Mooij, J. Zscheischler, B. Steudel, K. Zhang, and B. Schölkopf. Inferring deterministic causal relations. In UAI, 2010.
- Everingham et al.  M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL Visual Object Classes Challenge 2012 (VOC2012), 2012. URL http://www.pascal-network.org/challenges/VOC/voc2012/workshop/index.html.
ResNet training in Torch, 2016. URL https://github.com/facebook/fb.resnet.torch.
- Hinton et al.  G. Hinton, N. Srivastava, and K. Swersky. Lecture 6a: Overview of mini-batch gradient descent, 2014. URL http://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf.
- Hoyer et al.  P. O. Hoyer, D. Janzing, J. M. Mooij, J. Peters, and B. Schölkopf. Nonlinear causal discovery with additive noise models. In NIPS, 2009.
- Ioffe and Szegedy  S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. ICML, 2015.
- Janzing and Schölkopf  D. Janzing and B. Schölkopf. Causal inference using the algorithmic Markov condition. IEEE Transactions on Information Theory, 56(10):5168–5194, 2010.
- Lemeire and Dirkx  J. Lemeire and E. Dirkx. Causal models as minimal descriptions of multivariate systems, 2006.
- Lin et al.  T. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft COCO: Common objects in context. In ECCV, pages 740–755. 2014.
- Lopez-Paz et al.  D. Lopez-Paz, K. Muandet, B. Schölkopf, and I. Tolstikhin. Towards a learning theory of cause-effect inference. In ICML, pages 1452–1461, 2015.
- Mooij et al.  J. Mooij, J. Peters, D. Janzing, J. Zscheischler, and B. Schölkopf. Distinguishing cause from effect using observational data: methods and benchmarks. JMLR, 17(32):1–102, 2016.
- Pearl  J. Pearl. Causality: Models, Reasoning, and Inference. Cambridge University Press, 2000.
- Peters et al.  J. Peters, J. Mooij, D. Janzing, and B. Schölkopf. Causal discovery with continuous additive noise models. JMLR, 15:2009–2053, 2014.
- Pickup et al.  L. C. Pickup, Z. Pan, D. Wei, Y. Shih, C. Zhang, A. Zisserman, B. Schölkopf, and W. T. Freeman. Seeing the arrow of time. In CVPR, 2014.
- Reichenbach  H. Reichenbach. The direction of time. University of California Press, Berkeley, CA, 1956.
- Schölkopf et al.  B. Schölkopf, D. Janzing, J. Peters, E. Sgouritsa, K. Zhang, and J. Mooij. On causal and anticausal learning. In ICML, 2012.
- Smola et al.  A. Smola, A. Gretton, L. Song, and B. Schölkopf. A Hilbert space embedding for distributions. In ALT, pages 13–31, 2007.
- Srivastava et al.  N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. JMLR, 15(1):1929–1958, 2014.
- Steyvers et al.  M. Steyvers, J. B. Tenenbaum, E. J. Wagenmakers, and B. Blum. Inferring causal networks from observations and interventions. Cognitive science, 27(3):453–489, 2003.