Learning Causality: Synthesis of Large-Scale Causal Networks from High-Dimensional Time Series Data

05/06/2019
by   Mark-Oliver Stehr, et al.

There is an abundance of complex dynamic systems that are critical to our daily lives and our society but that are hardly understood, and even with today's possibilities to sense and collect large amounts of experimental data, they are so complex and continuously evolving that it is unlikely that their dynamics will ever be understood in full detail. Nevertheless, through computational tools we can try to make the best possible use of current technologies and available data. We believe that the most useful models will have to take into account the imbalance between system complexity and available data in the context of limited knowledge or multiple hypotheses. The complex system of biological cells is a prime example of such a system that is studied in systems biology and has motivated the methods presented in this paper. They were developed as part of the DARPA Rapid Threat Assessment (RTA) program, which is concerned with understanding the mechanism of action (MoA) of toxins or drugs affecting human cells. Using a combination of Gaussian processes and abstract network modeling, we present three fundamentally different machine-learning-based approaches to learn causal relations and synthesize causal networks from high-dimensional time series data. While other types of data are available and have been analyzed and integrated in our RTA work, in this paper we focus on transcriptomics (that is, gene expression) data obtained from high-throughput microarray experiments to illustrate the capabilities and limitations of our algorithms. Our algorithms make different but overall relatively few biological assumptions, so that they are applicable to other types of biological data and potentially even to other complex systems that exhibit high dimensionality but are not of a biological nature.


1 Introduction and Motivation

Systems biology strives for an understanding of all the interactions between components of a biological system over time. The excellent survey [37] describes some of the challenges and opportunities in the context of human cells, using the innate immune system as a prime example. It is argued that systems biology has become a highly interdisciplinary endeavor that involves an iterative cycle driven by the interplay between emerging biological problems and the development of new technologies and computational tools. One particular approach to abstract from the complexity of biological systems arises from network science [15]. In this context, a wide range of graph-based formalisms, ranging from broad classes such as scale-free graphs [2] to very concrete formalisms such as variations of Petri nets [7], have been explored. While the comprehensive modeling of, say, a human cell as a network, and hence its response to a wide range of conditions, in other words, addressing the system identification problem, is an important long-term objective of systems biology, we are concerned with a more modest goal in this paper, namely the modeling of key processes in the human cell under very specific conditions and at a level of abstraction that is consistent with the large but in many ways still limited amount of data that is available with today's high-throughput measurement technologies. As part of an iterative cycle such as the one discussed in [37], these models can lead to hypothesis generation, inform new experiments, and can be subsequently confirmed and possibly refined or rejected based on new results.

More specifically, we present some of the algorithms developed in the DARPA Rapid Threat Assessment (RTA) program, where we have been exploring data analysis, machine-learning, and logic-based techniques to support biologists in understanding the so-called mechanism of action (MoA) that is triggered when an (unknown) drug or toxic substance hits a human cell. From relatively short windows of time after the event in question (e.g. 48 hours), our algorithms generate graphs representing potential causality between compounds. In spite of the use of perturbations, it should be noted that this abstract notion of causality is based on observational data with its known limitations (e.g. confounding effects), and might be better called causality modulo observational equivalence. This is in contrast to, for example, knock-out studies for individual genes, which, however, due to their higher cost cannot compete with the sheer data volume and coverage typical for observational studies. The basis for our algorithms is time series of typically high-dimensional data, e.g. transcriptomics (gene expression), proteomics, and metabolomics data. As exemplified in [36], we have also developed algorithms for anomaly detection that highlight certain nodes in such graphs as potentially impacted and allow the biologist to narrow down the mechanism of action. The algorithms developed use a variety of models including Gaussian processes (on non-linear time scales) and other linear and non-linear models, ranging from principal component analysis and various types of clustering to a broad range of neural network models. For more details about RTA and some initial results we refer to [36].

More recent algorithms that we developed include anomaly detection using different types of autoencoders, causality detection and network graph synthesis based on convolutional autoencoders, deep and wide neural networks predicting temporal evolution and their visualization as graphs, generative adversarial networks for synthetic modeling and detection of typical vs. unusual behavior, and Siamese (twin) neural networks for probabilistic causality detection.

In this paper, the primary focus is on Siamese neural networks for probabilistic causality detection and their training and validation using a synthetic dynamic gene expression model that takes advantage of our original Gaussian process model. They are the basis for two models that support the synthesis of probabilistic causal networks, meaning that causality can be probabilistically quantified. The price to pay for probabilistic results lies in some underlying modeling assumptions about the structure and dynamics of these biological networks. Our other algorithms for network synthesis offer a non-probabilistic notion of potential causality that comes with its own advantages and disadvantages. Correlation and cross-correlation graphs such as those in [36] fall into this heuristic category, as do our autoencoder-based network synthesis and our network synthesis based on predictive deep and wide neural networks. A discussion of these heuristic approaches is included in this paper as well.

While our work is directly motivated by the systems biology domain, we conjecture that our methods are applicable in other domains where high-dimensional observational time series data is available and network modeling is essential. For concreteness, however, we continue to use the biological domain in our explanations, specifically focusing on transcriptomics (gene expression) data generated by high-throughput microarray experiments, as this is the richest data source available in the RTA project and as it allows us to relate our results to those of other projects to assess biological plausibility.

As we will see in the following section on related work, there are important distinctions between the problem of interest, namely identifying the mechanism of action (MoA), and other related problems such as the reconstruction of gene regulatory networks. First of all, our goal is to use observational data from a single experiment in the best possible way to narrow down the MoA to a small set of candidates, which may be used to inform further experimentation. Reconstruction of the entire network from such limited experimental conditions is clearly not feasible, because only a small number of network interactions may have been exercised and hence observed under those conditions. Since we are only concerned with networks in this paper, we also adopt an abstract and greatly simplified notion of MoA as a network of potential causality. We assume that for a given experiment such an MoA (process) network should be consistent with and hence a subnetwork (another simplifying assumption) of the unknown biological (system) network, for example the gene regulatory network if we just focus on transcriptomics. Note how we distinguish the system-level view — a system-level biological model would capture all possible behavior at a certain level of abstraction — from the process-level view, which captures the (abstract) biological behavior under specific experimental (including initial) conditions and is the focus of our analysis. Another noteworthy feature of our approach is that it minimizes the amount of prior biological knowledge used in the analysis; for example, unlike [29] we do not make assumptions about which genes act as transcription factors, and we do not use drug signature databases such as [21] or other databases curated from the literature [17]. This is motivated in the context of the RTA project by the initial lack of knowledge about new threats/mechanisms that are unknown or just being discovered. Of course, this does not preclude the combination of our methods with prior knowledge in other applications.

2 Related Work

The Connectivity Map [35] is a database of differential gene expression signatures, that is transcriptional responses to disease, genetic perturbation (knockdown or overexpression of a gene) or treatment with a small molecule. In its current version the library contains more than 1 million signatures resulting from perturbations of multiple human cell types. Connectivity between signatures refers to their similarity or dissimilarity. The latest version of the Connectivity Map is based on the L1000 assay, which is a Luminex bead-based method that measures the mRNA expression of 978 so-called landmark genes, and uses a machine-learning-based inference algorithm to infer the expression of 11350 additional genes. An earlier version of the Connectivity Map [21] was based on the direct measurement of the full transcriptome using microarrays, but was limited to combinations of perturbations and cell types. Our RTA project also uses the microarray-based method to measure the full transcriptome. Different from the above, we apply it to collect time series data, but due to the higher cost focus only on a small number of perturbation experiments (seven experiments have been conducted so far with six different drugs). We also collect corresponding time series of proteomics and metabolomics data for a more comprehensive picture.

The DREAM project (Dialogue on Reverse Engineering Assessment and Methods) conducted an assessment of over thirty gene regulatory network inference methods on E. coli, S. aureus, S. cerevisiae, and in silico microarray data [24, 28]. The evaluation indicates that there is no single best method for all data sets. As a possible solution, an integration of this community of network inference experts is proposed, leading to an integrated community network for each data set. The in silico data set [28] is particularly noteworthy, since it is based on a well-defined synthetic network combining subnetworks from E. coli and S. cerevisiae with a continuous differential equation model and Gaussian noise. Note, however, that none of the networks are representative of the large size and interaction complexity of the human gene regulatory network, which is furthermore unknown to a large extent. Moreover, each data set contains knock-out and knock-down experiments, which facilitates the detection of dependencies. As an example, the inferred community network for E. coli contains 1688 edges. About of the edges were known and previously experimentally validated interactions, while the remaining were conjectured to be false positives or newly discovered true interactions (which were confirmed by additional experiments). The algorithms developed in several DREAM challenges quite clearly demonstrate the challenges of the system identification problem even under very favorable conditions, that is, small overall network size, relatively low dynamic complexity due to non-human cells, and a reasonably good understanding of ground truth. While we use human cells, we focus on the identification of the process (MoA) under very specific conditions, and the synthetic dynamic model we use is very generic and kept as simple as possible to reduce the assumptions and number of parameters that need to be stipulated.

Our probabilistic approach to causal network synthesis is quite different from the traditional approach of [33, 26] and hence the large body of research conducted in the context of Bayesian networks and their many variations. While these approaches are primarily based on exploiting certain asymmetries in the structure of conditional independence relations (involving at least three variables) that partly reveal the direction of causality, our approach is focused on directly detecting binary causality based on learned patterns occurring in a high-dimensional time series. Our focus on relations between only two variables enables better scalability and more directly targets the relation of interest and its physical nature. Graphs with a million causal edges are not unusual in our targeted applications, even if the user might only focus on a small subset at a time. While the traditional approach is also applicable to time series (see e.g. [8]), the time series need to be sufficiently long to reliably perform the independence tests; in other words, they ideally need to exercise the system in a large number of conditions, which is not the case in our targeted applications. While in Bayesian network inference the consistency of edge orientation is a primary concern, we intentionally avoid imposing consistency constraints in the algorithms presented in this paper. In a separate paper [34], we show how domain-dependent consistency requirements can be implemented in an approximate and scalable fashion on top of the probabilistic causal graph synthesis presented here.

The particular artificial neural networks we use for causality detection in our primary method fall into the general class of Siamese neural networks, which are composed of two subnetworks with shared weights, each responsible for processing one of the inputs. While originally proposed for signature verification (a type of similarity detection) in [4, 5], they have recently attracted renewed interest [20, 31, 3, 32, 23, 38, 10, 22], with many applications ranging from image recognition to audio search. The key abstract feature of Siamese networks that is important for us is their capability to directly encode structural properties of a logical relation (in our case undirected causality) and hence to enable efficient learning tailored to the domain. Although we use relatively low-complexity 1D convolutional Siamese networks suitable for time series processing, we had to resort to curriculum training to obtain a stable learning algorithm. While conventional Siamese networks are suitable for similarity relations that exhibit a structural symmetry, we also use a variation of Siamese networks to detect directed causality, which directly encodes the structural antisymmetry of this relation (when viewed as a function) in terms of the network structure.

3 Probabilistic Causal Network Synthesis

In this section, we informally summarize our probabilistic approach to network synthesis. First, we briefly review our use of Gaussian processes, which are at the core of our broader RTA workflow and also the foundation for our synthetic dynamic model. This model will be used to train and validate our undirected and directed causality detectors (both different types of Siamese neural networks in the broad sense). Undirected and directed causal network synthesis will then be performed based on these detectors. For illustration purposes, we primarily use results from our RTA workflow for the experiment [36], which was used for a DARPA challenge; the drug was later revealed to be a statin (specifically atorvastatin, a drug affecting cholesterol biosynthesis). Finally, we conclude this section with a biological plausibility check that is based on entirely independent experimental data from two other large projects.

3.1 Gaussian Processes

A Gaussian process [30] represents a probability distribution over a parameterized class of continuous functions. The projection at each time point is Gaussian, i.e., normally distributed, with a mean and variance that are explicitly represented in the Gaussian process model. A time series of observations can then be obtained by sampling at a finite number of arbitrary time points.

Figure 1: Sample Gaussian processes for control (black) and treated (green) with 2 SD bands and the log-ratio shown in blue (right scale).

Gaussian process modeling allows us to address several challenges when faced with high-dimensional and very noisy biological time series data. First, observations are only available at a small number of time points with irregular spacing between them. Also, the number of observations at each time point may vary. By exploiting the continuity of the underlying functions, Gaussian processes allow us to benefit from the local structure to estimate the mean and to provide an explicit estimate of variance at each time point. The second challenge arises in the integration of multiple data sources: in our case we have transcriptomic, proteomic, and metabolomic data, each with different sampling time points and different amounts of measurement noise. Again, the Gaussian processes allow us to integrate all data into a uniform continuous time scale and allow us to build specialized models for each data source to reflect the biological variation (over time) and the measurement error (at each time point).

Specifically, we use GPflow [9], a library based on TensorFlow, to estimate Gaussian processes separately for each type of data, for each compound (transcript, protein, or metabolite), and for treated and control conditions. All observed quantities are uniformly represented in log2 scale. Furthermore, we use a log-based time scale to reflect the approximate spacing of observations, with more observations closer to the beginning of the experiment (defined as Time 0). The class of functions is generated by a kernel that combines squared exponentials (reflecting biological variation) in the log-based time scale with additive white noise (to capture measurement error). The squared exponential time scale parameter is fixed at to reflect the density of available observations. The squared exponential variance is fixed for each type of data by using an average of the optimal values over all compounds. The white noise variance is computed by Maximum Likelihood Estimation (MLE) for each specific compound.
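To make the workflow concrete, the following minimal sketch (assuming the GPflow 2.x API, with purely illustrative data and hyperparameter values, since the exact settings are not reproduced here) fits such a model for a single compound under one condition and resamples it on a uniform grid:

```python
import numpy as np
import gpflow

# Illustrative data for one compound under one condition: log2 expression
# values observed at irregular time points (hours).
t_hours = np.array([0.5, 1.0, 2.0, 4.0, 8.0, 24.0, 48.0])
y_log2 = np.array([7.1, 7.3, 7.0, 7.6, 8.2, 8.0, 7.9])

# Log-based time scale: more resolution near the start of the experiment.
X = np.log2(1.0 + t_hours).reshape(-1, 1)
Y = y_log2.reshape(-1, 1)

# Squared-exponential kernel for smooth biological variation; lengthscale and
# kernel variance are fixed (placeholder values), while the Gaussian
# likelihood variance plays the role of the additive white measurement noise
# and is estimated by maximum likelihood for this compound.
kernel = gpflow.kernels.SquaredExponential(variance=0.5, lengthscales=1.0)
model = gpflow.models.GPR(data=(X, Y), kernel=kernel, noise_variance=0.1)
gpflow.set_trainable(kernel.lengthscales, False)
gpflow.set_trainable(kernel.variance, False)

gpflow.optimizers.Scipy().minimize(model.training_loss, model.trainable_variables)

# Resample the continuous model on a uniform grid of the log-time scale.
X_grid = np.linspace(X.min(), X.max(), 20).reshape(-1, 1)
mean, var = model.predict_y(X_grid)   # predictive mean and variance
sd = np.sqrt(var.numpy())             # per-time-point SD band
```

The same construction would be repeated per compound and per condition (control and treated), with the log-ratio obtained by subtracting the two resampled means.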

The output of the Gaussian process workflow is a uniform time series for each compound with time points ( time points without the initial state) for control and treated sample means and the corresponding standard deviations (SDs). Formally, the learned Gaussian process model is continuous. The time points sample the time dimension at sufficiently many discrete points to preserve the information. The log-ratio time series for each compound, which is the primary basis for subsequent stages of our workflow, is then computed by point-wise difference between treated and control.

In Fig. 1 we show Gaussian processes for four selected genes and their log-ratio in the experiment. Do the similar shapes indicate the presence of a causal relationship, and if so, in which direction? The answer is not obvious due to the large amount of noise, and clearly has to be probabilistic in nature. These are the kinds of questions we try to address in the following subsections.

The confidence in the change (as measured by the log-ratio) in response to the perturbation is scored by computing the area between the SD bands for control and treated samples. We perform the computation for 1 and 2 SDs, the former to account for the higher degree of noise in the proteomics and metabolomics data. This provides a ranking of the measured compounds that generalizes the biologist's criterion that, for a significant change, the SDs of the control and treated samples should not overlap. This ranking can be used as a first filter to focus attention on compounds in which we have some confidence in the measured change.
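A minimal sketch of such a score, assuming uniform resampled mean/SD arrays from the Gaussian processes (the exact scoring used in the RTA workflow may differ in detail):

```python
import numpy as np

def change_confidence(mean_ctrl, sd_ctrl, mean_trt, sd_trt, n_sd=2.0):
    """Score confidence in a perturbation-induced change for one compound.

    The score accumulates, over the uniform time grid, the gap by which the
    n_sd bands of the control and treated processes fail to overlap; it is
    zero wherever the bands overlap. Larger scores mean more confident changes.
    """
    lo_ctrl, hi_ctrl = mean_ctrl - n_sd * sd_ctrl, mean_ctrl + n_sd * sd_ctrl
    lo_trt, hi_trt = mean_trt - n_sd * sd_trt, mean_trt + n_sd * sd_trt
    # Pointwise gap between the bands (positive only where they do not overlap).
    gap = np.maximum(lo_trt - hi_ctrl, lo_ctrl - hi_trt)
    # Discrete approximation of the area between non-overlapping bands.
    return float(np.sum(np.clip(gap, 0.0, None)))
```

Compounds would then be ranked by this score, using n_sd=2 for transcriptomics and n_sd=1 for the noisier proteomics and metabolomics data.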

3.2 Synthetic Dynamic Model

To train our probabilistic causality detector, we use a simple synthetic model determined by a small number of parameters. The model needs to provide positive and negative examples of causally related genes in the context of a given experiment. To this end we assume that the previously estimated Gaussian processes for each gene represent a distribution of biologically plausible gene expression profiles that are consistent with the observed data. Our first step is to generate a set of noisy pairs of control and treated time series, that is a set of pairs , where and are sampled from the Gaussian process model for control and treated (e.g. with an unknown drug or toxin in the context of our RTA challenges) conditions of a random gene . For each pair, we compute the difference , giving rise to a set of log-ratio time series samples consistent with the observed data. In another step, each sampled ratio time series is normalized, so that they all have mean and variance . We use to denote a sampled normalized series.

Our first goal is to construct synthetic gene expression time series for causally related pairs (positive examples). To obtain a uniform size time series representation for the synthetic genes, we fix a window size smaller than the size of the full time series (e.g. of out ). (The window size is a fixed hyperparameter that reflects a tradeoff between the number of time lags we can model and the data available for causality detection.) To simulate a possible time lag associated with causality, we sample and from the previous normalized log-ratio time series set, and compute the set of lagged pairs where is obtained from by a left shift of zero or more positions (effectively renumbering the time series indices). From the lagged pair set, we then compute the set of synthetic pairs where and are obtained by restricting and , respectively, to the same maximally overlapping window (that is, the same time series indices) of size . Note that in this set is a possible observation of a causally related pair (modeled after gene ), and both and are time series of size . With our sample hyperparameters this construction will result in time shifts in the index range . (It should be noted that there is an intentional bias in this construction towards synthetic pairs with shorter lag. A more refined model could be developed by using biological assumptions about the lag distribution.) The full lagged pair set will be computed correspondingly by random sampling over all genes .
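The lagged-window construction can be sketched as follows; the window size, maximum lag, and series length are placeholder values, since the paper's settings are tied to its Gaussian-process time grid:

```python
import numpy as np

rng = np.random.default_rng(0)

def normalize(series):
    """Normalize a sampled log-ratio series to zero mean and unit variance."""
    return (series - series.mean()) / (series.std() + 1e-12)

def lagged_positive_pair(z1, z2, window=20, max_lag=5, rng=rng):
    """Build one synthetic causally related pair for a gene.

    z1 and z2 are two independent normalized log-ratio samples of the same
    gene (drawn from its control/treated Gaussian processes), so the pair
    already carries realistic noise. The 'effect' sample is shifted left by a
    random lag, and both are restricted to the same overlapping window.
    """
    lag = int(rng.integers(0, max_lag + 1))
    cause = z1[:window]                # earlier window
    effect = z2[lag:lag + window]      # same indices shifted left by `lag`
    return cause, effect, lag

# Example: two noisy samples of one gene's log-ratio series (illustrative data).
base = np.cumsum(rng.normal(size=30))
z1 = normalize(base + 0.3 * rng.normal(size=30))
z2 = normalize(base + 0.3 * rng.normal(size=30))
cause, effect, lag = lagged_positive_pair(z1, z2)
```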

The lagged pairs computed so far account for noise and delay, but do not account for interactions between genes. To capture interaction, we use a simple equal-weight linear superposition model defined by a single complexity parameter , denoting the number of genes participating in an interaction in addition to the gene of interest . The motivation for using such a simple interaction model is that the Gaussian processes are typically sufficiently rich (we exploit the high-dimensionality of the time series here) and already exhibit common non-linear behavior, so that additional complexity in the composition would not make a noticeable difference (in addition to the impossibility of estimating detailed interaction parameters from the short time series). Hence, the synthetic pairs with interactions are generated by sampling from the full lagged pair set and computing . Note how the are not used and can also be regarded as additional noise affecting the original causal interaction . We will see that this additional noisy influence will make causality detection quite challenging depending on the complexity parameter , which we also refer to as a mixin parameter.

With the method above it is possible to generate positive samples for causal dependency that are quite realistic and tailored to the specific experimental conditions (e.g. taking into account the variance and noise estimated and represented by the Gaussian processes). Hence, we also refer to the synthetic pair set as the positive synthetic pair set. Pure negative samples, on the other hand, are difficult to synthesize from existing expression profiles, because we do not know which pairs of genes are truly independent. Here we make another biologically plausible assumption, namely that the large majority of gene pairs can be considered to be independent for practical purposes, in other words that the underlying causal network is sparse. With this approximation, we define a negative synthetic pair set by sampling and independently from the synthetic pair set ( stands for variables that are not used) and computing . A causal connection between and can occur by chance if the underlying genes are dependent, but will be rare under our sparsity assumption, again exploiting the high-dimensionality of our time series.

Finally, positive and negative synthetic pair sets of equal size (e.g. pairs each) are combined into a single balanced labeled synthetic set by labeling positive and negative pairs with and , respectively. Training and validation sets are then generated from this set (e.g. by random split of vs. of the labeled samples). For comparison purposes, we also generate an ideal version of this noisy synthetic data set, by replacing the initial sampling of time series pairs from Gaussian processes by sampling just a single time series, in other words, by enforcing .
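Building on the sketch above, positive pairs with mixin interactions, negative pairs, and the balanced labeled set might be assembled roughly as follows (the helper names and parameter values are hypothetical):

```python
import numpy as np

def mixin_positive(cause, effect, others, k=2):
    """Superpose k unrelated normalized series onto the effect (equal weights).

    `others` is a list of windows sampled from other, presumably independent
    genes; they act as additional noise masking the causal signal. k is the
    mixin/complexity parameter.
    """
    mix = effect + sum(others[:k])
    return cause, (mix - mix.mean()) / (mix.std() + 1e-12)

def negative_pair(cause_a, effect_b):
    """Negative example: cause and effect taken from two independently sampled
    positive pairs; under the sparsity assumption they are almost always
    causally unrelated."""
    return cause_a, effect_b

def labeled_set(positive_pairs, negative_pairs):
    """Balanced labeled set: label 1 for causally related, 0 for unrelated."""
    pairs = positive_pairs + negative_pairs
    labels = np.concatenate([np.ones(len(positive_pairs)),
                             np.zeros(len(negative_pairs))])
    return pairs, labels

# A train/validation split of the labeled pairs would then be drawn at random,
# e.g. with sklearn.model_selection.train_test_split.
```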

3.3 Learning Causality using Siamese Networks

Causality is often associated with a temporal direction, also referred to as the arrow of time, and we would ultimately expect that a causality detector does not only identify the existence of a potential causal connection but also its direction. In our approach, we decompose the problem by utilizing two models. (Part of the motivation will become clear later and is related to the practical difficulty of predicting direction even if a causal relationship has already been established.) The first model is designed to detect and probabilistically quantify the existence of causality, while a second model is used to probabilistically determine the direction under the condition that a causal dependency is present. One advantage of this two-stage approach is that the two probabilities can be clearly separated and the relational properties can be structurally represented in the neural network. Our first model will be concerned with a symmetric relation of undirected causality (which is the dual of independence, also a symmetric relation), while the second model will be concerned with an asymmetric relation of directed causality that can be considered as a temporal refinement.

The undirected causality detector is a Siamese neural network, that is, a neural network with two identical subnetworks that are each responsible for processing one of the arguments of the binary symmetric causality relation. Each argument is a time series (e.g. of size consistent with our sample data). The replicated subnetwork is a 1D convolutional network (we use a window size and stride ; the convolutional window size is a hyperparameter that reflects a tradeoff between the maximum time lag and the maximum size of the patterns that can be taken into account) with bias and a relu-activation function yielding a tensor (e.g. dimensional). This is followed by an average pooling layer yielding a vector (e.g. dimensional), followed by a dense layer with the same output dimension and again with bias and relu-activation. Hence, the output of each subnetwork is a vector in feature space (e.g. dimensional). The two outputs are combined by a dot-product layer (which captures the symmetry of the problem as part of the architecture). As usual for Siamese networks, the symmetry is also exploited by weight-sharing between the two subnetworks. Finally, to obtain a probabilistic output, a sigmoid function is applied to a linear transformation of the scalar result from the dot-product, which can also be viewed as a trivial dense layer with bias and sigmoid-activation.
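A minimal Keras sketch of this architecture is shown below; the window size, convolution width, stride, and feature dimension are illustrative placeholders, and global average pooling stands in for the average pooling layer described above:

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

# Illustrative hyperparameters (the paper's exact values are not reproduced).
WINDOW = 20      # length of each input time series
CONV_W = 5       # 1D convolution window
STRIDE = 1
FEATURES = 16    # feature-space dimension

def make_subnetwork():
    """Shared branch: 1D convolution -> average pooling -> dense, all relu."""
    inp = layers.Input(shape=(WINDOW, 1))
    x = layers.Conv1D(FEATURES, CONV_W, strides=STRIDE, use_bias=True,
                      activation="relu")(inp)
    x = layers.GlobalAveragePooling1D()(x)          # pool over time
    x = layers.Dense(FEATURES, use_bias=True, activation="relu")(x)
    return Model(inp, x)

branch = make_subnetwork()                           # weight sharing: one branch

in_a = layers.Input(shape=(WINDOW, 1))
in_b = layers.Input(shape=(WINDOW, 1))
feat_a, feat_b = branch(in_a), branch(in_b)

# The dot product makes the output symmetric in the two arguments.
score = layers.Dot(axes=1)([feat_a, feat_b])
prob = layers.Dense(1, use_bias=True, activation="sigmoid")(score)

detector = Model([in_a, in_b], prob)
detector.compile(optimizer=tf.keras.optimizers.Adam(),
                 loss="binary_crossentropy", metrics=["accuracy"])
```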

As a loss function we use binary cross-entropy and as an optimizer we use TensorFlow's implementation of Adam (with default parameters). The key hyperparameters we vary are the type of model (ideal or noisy) and the complexity of the model, using as settings for the mixin-parameter. We found that conventional training often gets stuck in low-accuracy solutions (around or lower), and while random restarts can help, we opted for a more stable solution based on curriculum learning [16]. To this end, we train the models with increasing complexity of the training data set (we use the mixin-parameter to vary the complexity of the synthetic model). We start with training data generated by , which always yields high-quality models, and then proceed incrementally to train a sequence of models for , a process that usually converges to models with similar metrics. Each stage is trained for epochs with a batch size of .
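The curriculum stage can be sketched as a loop over synthetic sets of increasing mixin complexity; here `make_synthetic_set` is a hypothetical stand-in for the pair-generation sketch above, the mixin values, epochs, and batch size are placeholders, and `detector` is the model from the previous sketch:

```python
import numpy as np

def make_synthetic_set(mixin, n=1000, window=20):
    # Placeholder generator: random data of the right shape so the loop runs;
    # in the actual workflow this would call the synthetic-pair construction.
    a = np.random.randn(n, window, 1)
    b = np.random.randn(n, window, 1)
    y = np.random.randint(0, 2, size=(n, 1)).astype("float32")
    return a, b, y

# Curriculum learning: keep training the same model on increasingly complex data.
for k in [0, 1, 2, 3]:
    a, b, y = make_synthetic_set(mixin=k)
    detector.fit([a, b], y, epochs=50, batch_size=128, validation_split=0.2)
```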

Figure 2: This figure illustrates some of the challenges of causality detection in the presence of noise even with a mixin parameter . From left to right, the title of each plot contains a synthetic gene pair identifier, the actual probability of causality, and the predicted probability of causality. On the top we see two samples from the test set that are correctly classified as causally dependent (with reasonably high probability). On the bottom we see two incorrectly classified pairs: a false positive on the left-hand side and a false negative on the right-hand side.

Figure 3: For fixed architectural hyperparameters (convolutional window size and feature dimension), we show how the accuracy decreases with increasing mixin parameter that determines the complexity of the synthetic model. For comparison we also show the results for an ideal model without noise.
Figure 4: The Receiver Operating Characteristic (ROC) curve shows the tradeoff between true and false positives (here for the noisy model with ). For graph synthesis, we restrict our attention to the left part of the curve, keeping the false positive rate below . The Area under the Curve (AUC) is , which is another indication of the difficulty of the system identification problem (the quest for completeness) even at the abstract level of causal dependencies.

All models are subsequently validated on the synthetic test sets, which gives results similar to the training set. Some typical test cases are shown in Fig. 2 and clearly illustrate the difficulty of the probabilistic causality detection problem, which cannot simply be reduced to pattern similarity, but needs to be aware of the overall time series distribution and the amount of noise. For one particular experiment () we show a summary of the test and validation results in terms of accuracy in Fig. 3. We can clearly see the increasing difficulty of the causality detection problem with the increasing complexity parameter. While we do not show the results for other RTA experiments, it is worth pointing out that the summary results are quite similar.

In Fig. 4, we also show the ROC curve for the model with . In practice, we are interested in higher probability results, so we use a cutoff of around (corresponding to a false positive rate of ) at the expense of completeness. Across all RTA experiments, we use probability cutoffs in the range to , which still results in detailed causal networks with about a million dependencies. For example, the graphs in Fig. 7 and Fig. 8 that we will discuss subsequently in more detail show only dependencies with a probability of at least ; in other words, they focus the biologist's attention on the subnetwork for which higher probability relations can be derived, without making any statements about other parts of the network. A more detailed analysis of our validation results shows that higher probabilities are slightly underestimated relative to the synthetic model, which is acceptable for most practical purposes, but needs to be kept in mind if we approximate an independence relation as the complement of undirected causality. (This is, for example, relevant in our synthesis of independence graphs, which are interesting in their own right but not included in this paper.)

3.4 Extension to Directed Causality

Determining the direction of causality is the task of another network that is typically applied to pairs that exhibit a high probability of causal dependency. By estimating the time lag and its direction we embed this decision problem into a continuous problem, for which we can generate training and validation data using a refinement of our synthetic model. As an architecture, we use another albeit somewhat unconventional type of Siamese network. Its input is again a pair of time series. Its output is not a probability, but simply a real score (trained to be) in the interval and proportional to the temporal lag between and . The score is interpreted as a directed causal dependency with the highest lag that has been modeled, and is interpreted as causality with maximum lag in the opposite direction, that is . A score around denotes no or a minor lag and hence indecision about the direction (if a causality has already been established). Probabilities will be later associated with the time lag through the validation process.

This lag detector is again a Siamese neural network, that is, a neural network with two identical subnetworks that are each responsible for processing one of the argument time series. As before, the replicated subnetwork is a 1D convolutional network with bias and a relu-activation function followed by a dense layer, again with bias and relu-activation. Unlike the undirected causality detector, there is no pooling layer involved. Hence, the output of each subnetwork is a tensor (e.g. dimensional). Weight-sharing is still used between the two subnetworks. After flattening, the two outputs are combined by a subtraction layer (which captures the antisymmetry of the problem), yielding a vector (e.g. of dimension ). The next layer is a dense layer without bias and with tanh-activation (reducing the dimension to ), and finally a linear dense layer without bias is used to obtain a scalar output. As a loss function we use mean square error and as an optimizer we again use TensorFlow's implementation of Adam (with default parameters). For training we use curriculum learning with the same stages and parameters as for the causality detector.
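A corresponding Keras sketch of the lag detector follows; dimensions are again illustrative. Because the subtraction layer is followed only by bias-free layers with odd or linear activations, swapping the two inputs negates the output, which encodes the antisymmetry structurally:

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

WINDOW, CONV_W, STRIDE, FEATURES, HIDDEN = 20, 5, 1, 16, 32  # illustrative

def make_branch():
    """Shared branch: 1D convolution and a dense layer, both with relu; no pooling."""
    inp = layers.Input(shape=(WINDOW, 1))
    x = layers.Conv1D(FEATURES, CONV_W, strides=STRIDE, use_bias=True,
                      activation="relu")(inp)
    x = layers.Dense(FEATURES, use_bias=True, activation="relu")(x)
    return Model(inp, x)

branch = make_branch()                              # weight sharing
in_a = layers.Input(shape=(WINDOW, 1))
in_b = layers.Input(shape=(WINDOW, 1))

# Subtraction of the flattened branch outputs captures the antisymmetry.
diff = layers.Subtract()([layers.Flatten()(branch(in_a)),
                          layers.Flatten()(branch(in_b))])
h = layers.Dense(HIDDEN, use_bias=False, activation="tanh")(diff)
lag_score = layers.Dense(1, use_bias=False)(h)      # real-valued lag score

lag_detector = Model([in_a, in_b], lag_score)
lag_detector.compile(optimizer=tf.keras.optimizers.Adam(), loss="mse")
```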

The training and validation data set for the lag detector will be generated in the same way as the (positive) synthetic pair set for the causality detector (e.g. again pairs), but we allow and track positive, negative, and zero time lags in the construction. Furthermore, the labels will be replaced by the , where is the lag used in the construction of the pair and is the maximum lag (for our sample hyperparameters we have with ). Training and validation sets are then generated from this set of labeled pairs (e.g. again by random split of vs. ).

Figure 5: Like causality detection, the quantification of the time lag is a difficult problem due to the presence of noise even with a mixin parameter . From left to right the title of each plot contains a synthetic gene pair identifier, the actual lag (of the blue relative to the black series), and the predicted lag. On the top we see two samples from the test set where the direction of causality is correctly predicted (but not the magnitude). On the bottom, we see how a delay of zero can lead to a positive predicted lag (left-hand side), and a high positive lag leads to a prediction of near zero lag (right-hand side).
Figure 6: Again we see how the accuracy decreases with increasing mixin parameter that determines the complexity of the synthetic model. The results are based on a fixed threshold . The accuracy can be improved by using a higher threshold at the expense of a larger number of pairs for which the direction cannot be determined (precision/recall tradeoff).

Sample test cases for the experiment are again shown in Fig. 5 and sample validation results in terms of accuracy for a fixed lag threshold can be found in Fig. 6. For the given threshold, the test accuracy of (using again the noisy model with ) is intended for comparison with the other models. It may not be sufficient for the biologist, in which case it can be increased by using a larger threshold at the expense of a larger number of pairs for which the direction cannot be decided. Even for a fixed threshold, the test accuracy varies quite a bit across experiments (it ranges between and for the experiments conducted in the RTA project). Another noteworthy observation is the difference between test and train accuracy, which might point to some degree of overfitting in spite of the large amount of training data. However, we found that for other experiments this difference is less pronounced and more likely related to a selected threshold which is too low for the specific experiment. We think that more experimentation is needed to decide if additional forms of regularization would be beneficial for the model.

3.5 Probabilistic Causal Network Synthesis

Once our causality and lag detection models are trained and validated they can be used to synthesize causal networks. In this step, the models are applied directly to the Gaussian processes, more precisely the mean, which is the most likely biological process consistent with the observations, as opposed to samples generated by the synthetic model. Since these graphs can be huge, it is often useful to restrict the genes to a subset, for example using the ranking defined by the 1 or 2 SD filter on Gaussian processes that focuses on genes that are most impacted by the experiment.

Figure 7: Using a causality detector trained on the synthetic model for experiment with Gaussian process noise and , we show a graph that depicts, for the top 1000 genes passing the 2 SD filter, the undirected causality relation predicted by the causality detector (with a cutoff probability of approx. ). Genes that are on average up- or down-regulated are colored green and red, respectively. The darkness of the edges indicates the probability of a causal relationship. Note that the graph consists of many connected components, including many smaller ones collected in the lower right quadrant. Isolated nodes were removed, because they do not exhibit any high probability causal connections.
Figure 8: Based on a synthetic model for with Gaussian process noise and , we show a graph that combines the undirected causality relation predicted by the causality detector (with a cutoff probability of approx. ) and the direction predicted by the lag detector (with a threshold corresponding to a probability of approx. ). Undirected edges show that the direction cannot be decided with sufficient probability. For clarity, the graph is restricted to the top 100 genes passing the 2 SD filter. The large upregulated component on the top contains many genes related to cholesterol regulation. Four down-regulated genes corresponding to our sample Gaussian processes from Fig. 1 are labeled and involve the strongly down-regulated UBD. It is interesting to note that no sufficiently high probability direction can be established between ASB4 and SEM3C, even if the direction might seem obvious from visual inspection.

An undirected causal network is defined by nodes corresponding to all genes in a given subset and edges between pairs of genes for which the causality detector detects a dependency with at least the cutoff probability, which is another parameter in the graph synthesis process. In our automated workflow, we vary the size of the gene subset between 100 and 1000 to generate graphs that are interpretable by biologists. An example of such a graph is shown in Fig. 7.

A directed causal network is a refinement of an undirected causal network. Each edge is directed according to the prediction of the lag detector based on a positive threshold in the interval that was associated with a probability during validation. Each undirected edge becomes a directed edge if the lag prediction reaches at least the positive threshold and remains undirected otherwise. An example of such a graph is shown in Fig. 8.
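A sketch of the synthesis step, combining both detectors into a networkx graph (the cutoff and threshold values are placeholders, and `detector`/`lag_detector` refer to the trained models sketched earlier):

```python
import networkx as nx

def synthesize_causal_graph(series, detector, lag_detector,
                            prob_cutoff=0.9, lag_threshold=0.2):
    """Sketch of probabilistic causal network synthesis.

    `series` maps gene names to their Gaussian-process mean log-ratio windows
    (1D numpy arrays). Edges are added only above the probability cutoff;
    direction is assigned only if the lag score clears the threshold.
    """
    genes = list(series)
    g = nx.DiGraph()
    for i, a in enumerate(genes):
        for b in genes[i + 1:]:
            xa = series[a][None, :, None]
            xb = series[b][None, :, None]
            p = float(detector.predict([xa, xb], verbose=0)[0, 0])
            if p < prob_cutoff:
                continue                          # no sufficiently probable edge
            s = float(lag_detector.predict([xa, xb], verbose=0)[0, 0])
            if s >= lag_threshold:
                g.add_edge(a, b, prob=p, directed=True)
            elif s <= -lag_threshold:
                g.add_edge(b, a, prob=p, directed=True)
            else:
                g.add_edge(a, b, prob=p, directed=False)  # direction undecided
    return g
```

Isolated nodes never enter the graph because only pairs above the cutoff are added, which matches the presentation in Fig. 7 and Fig. 8.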

3.6 Biological Plausibility Check

In this section, we argue why our simple equal-weight Gaussian process mixture model, and especially the use of a mixin complexity parameter between and , is biologically plausible and a reasonable tradeoff given the limited amount of data available. The limitation comes mainly from the short duration of a biological experiment, which is unavoidable because cells have a very limited lifetime outside of the human body, leading to an overlay with a multitude of other processes not related to the experiment; furthermore, the impact of a system perturbation naturally exhausts itself as time progresses.

Figure 9: Statistics of degree distribution for various interactions in the integrated human gene regulatory network of the ENCODE project [12].

To understand the complexity of a typical individual MoA in the context of transcriptomics, we recall that we view the MoA as a process and hence a subnetwork of the gene regulatory network, which should give us an estimate for the upper bound complexity. The richest source currently available for the human gene regulatory network is the ENCODE project [12], which identified interactions between transcription factors (TF), micro-RNA (miRNA), and all remaining messenger RNA (mRNA). In the RTA project, we are not concerned with miRNA, although it is a factor contributing to the biological variation. The degree distributions of the proposed integrated network are shown in Fig. 9. To use a robust statistic, we are mostly interested in the median in-degree of genes (mRNA and TF, which we do not distinguish a priori), which is between (for the TF/mRNA interactions) and (for TF/TF interactions, which are much rarer with less than ). Knowing, first, that not all transcription factors are equally weighted, and, second, that it is a complex combination of transcription factors that typically controls transcription, the impact of each single transcription factor is greatly diminished, especially in the context of a very specific experiment. We conservatively estimate that this should lead to a reduction by at least a factor of , which should bring us to an in-degree in the range of and hence a complexity parameter in the range of .

Previously we have used for causal graph synthesis. To understand if this choice is at least biologically plausible, we use the Connectivity Map [21] data to establish an approximate ground truth for undirected causality based on correlation (with all its known limitations). To this end, we use the early version of the raw Connectivity Map data based on complete transcriptomics via microarrays (albeit an older chip generation compared to the one we use in the RTA project). The data contains signatures arising from the perturbations of 5 different cell lines with more than small molecules in 6100 experiments. Using a conservative assumption that this set is diverse enough to contain at least independent experiments, we can compute the Pearson correlation between each pair of genes across all 6100 perturbations, and use a threshold of to establish a set of known (positively) dependent gene pairs as a ground truth with a p-value below . We also establish a set of common (protein-coding) genes that were observed in both projects.
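The correlation-based ground truth can be sketched as follows, with `expr` standing for a genes-by-perturbations signature matrix and the threshold as a placeholder:

```python
import numpy as np

def correlation_ground_truth(expr, threshold=0.5):
    """Approximate ground truth for undirected dependency.

    `expr` is a (genes x perturbations) matrix of signatures; gene pairs whose
    Pearson correlation across perturbations exceeds `threshold` are treated
    as (positively) dependent.
    """
    corr = np.corrcoef(expr)                     # pairwise Pearson correlations
    dependent = np.triu(corr > threshold, k=1)   # upper triangle, no self-pairs
    return {(int(i), int(j)) for i, j in zip(*np.nonzero(dependent))}
```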

Figure 10: Predicted accuracy of large undirected causal networks for different RTA experiments is compared with actual accuracy relative to approximate ground truth based on Connectivity Map correlations, which is naturally incomplete. All results are for a fixed mixin parameter and synthesized undirected causal network graphs with close to edges.

The number of undirected pairs in the correlation-based dependency set is , which reduces to pairs of common genes (about of all possible pairs) that we also observe. The corresponding graph will serve as approximate ground truth. We then use our undirected causal graph synthesis to generate a large graph with edges (no filtering applied), which is still possible with a probabilistic lower bound of for causal dependency in case of . The graph is then also reduced by restricting the nodes to the set of common genes, yielding edges. Both of these graphs are clearly not for human consumption, but they enable us to do some meaningful statistical analysis of the accuracy of the latter relative to the former: of the edges, that is about , are also in the approximate ground truth dependency graph. While these results for experiment are approximately consistent, it turns out that other RTA experiments show larger differences, as summarized in Fig. 10. In some cases, outliers might indicate experimental problems (e.g. in case of ). The variation across experiments, however, might also point to limitations in the ground truth (e.g. the bias towards certain types of perturbations) and, in addition, to the natural incompleteness of correlation-based causality. Another possible but in our view less likely explanation is that the typical behavior is more complex than captured by our mixin parameter , potentially suggesting the use of larger complexity parameters, which have also been explored in the RTA project.

4 Heuristic Causal Network Synthesis

In this section, we briefly present two other causal network synthesis approaches that we developed in the RTA program. The first one is based on convolutional autoencoders and implements a novel evidence-based approach to detect and quantify inconsistencies in the model using a quite general form of domain knowledge. It is an instance of a more general approach that uses domain knowledge for validation and model selection, which should be contrasted with our more recent work [34], where domain knowledge is directly incorporated into the neural network model. With the second network synthesis approach presented below, we explore the feasibility of applying deep neural networks in this area, which is not obvious given the short duration of observation that limits the amount of data that can be collected. We therefore explore an adaptive approach parameterized by complexity that includes a range of deep and wide neural networks as part of a broader model family. Both of these approaches to network synthesis are very different from our Siamese neural network approach and hence can serve as a complementary source for biological hypothesis generation. The price for their higher complexity is that these methods do not provide a probabilistic interpretation of the resulting networks; on the other hand, the results do not depend on a specific synthetic model and hence are equally valid for all types of data collected in the RTA project.

4.1 Synthesis based on Convolutional Autoencoders

Different types of autoencoders [18] are already used in the RTA workflow for the detection of anomalies [36]. Here we apply them to detect potential causality. Like our Siamese networks, one motivation for this approach is the need to detect more complex causal relationships that go beyond the type that can be detected by cross-correlation, which is at the core of the baseline algorithms in our RTA workflow [36]. Convolutional autoencoders [25] have been successfully used in image processing due to their capability to learn reoccurring patterns and perform a dimensionality reduction in an unsupervised setting. In logical terms, they structurally encode the homogeneity of the input space, a type of symmetry that can be equationally defined by translation invariance.

In a nutshell, we train a family of 1D convolutional autoencoders (with average pooling and elu-activation) to learn common patterns that are distributed over the time series (not occurring only at a particular time). Hyperparameters are used for the feature dimension (e.g. various settings between and ) and window size (e.g. , , , out of time points). The training set consists of of the log-ratio time series set estimated using our Gaussian process model. The remaining are used for validation of autoencoder performance. The autoencoders are trained with a mean square error loss using TensorFlow's Adam optimizer with default parameters and an early stopping rule to reduce overfitting. The resulting encoders transform each window into a lower dimensional feature space (between and dimensions). In the next phase, the convolutional subnetwork of the encoder will be used as a standalone network to quantify the expression of the features over time (e.g. for different windows of the full time series if the window size is ) for all estimated Gaussian processes (the mean time series) of a given experiment.
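A minimal Keras sketch of one member of this autoencoder family (series length, window size, and feature dimension are placeholders, and the decoder is kept deliberately simple):

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

SERIES_LEN, WINDOW, FEATURES = 24, 8, 8   # illustrative hyperparameters

# Encoder: a 1D convolution with elu activation followed by average pooling
# maps the log-ratio series into a low-dimensional feature space over time;
# the decoder mirrors it to reconstruct the input.
inp = layers.Input(shape=(SERIES_LEN, 1))
x = layers.Conv1D(FEATURES, WINDOW, padding="same", activation="elu")(inp)
encoded = layers.AveragePooling1D(pool_size=2)(x)
x = layers.UpSampling1D(size=2)(encoded)
decoded = layers.Conv1D(1, WINDOW, padding="same")(x)

autoencoder = Model(inp, decoded)
autoencoder.compile(optimizer=tf.keras.optimizers.Adam(), loss="mse")
# Training would use an early-stopping callback, e.g.
# tf.keras.callbacks.EarlyStopping(patience=5), on a held-out validation split.

encoder = Model(inp, encoded)   # used standalone to extract feature occurrences
```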

We now define a feature occurrence as a triple consisting of the feature vector and the gene and time point (the center of the feature/pattern) where it was detected. To identify similar features, we compute the pairwise Euclidean distances between the features of all such occurrences. Using a variable distance threshold hyperparameter, we obtain a reduced set of pairs of such occurrences representing approximate matches (across genes and time points). We also say that each such match is a witness for potential causality. A pair of two occurrences is directed, written , if the time associated with is strictly lower than the time of . It is undirected and written as if the associated times are equal (that is not distinguishable within our finite time resolution).

In this evidence-based (rather than probabilistic) approach, we introduce another hyperparameter , the number of witnesses (we use settings , , , , and ), to lift the causal relation from feature occurrences to genes: A pair of genes is in the (symmetric) undirected causality relation, written , iff there are at least witnesses of undirected causality involving and . A pair of genes is in the directed causality relation, written , iff there are at least witnesses of directed causality involving and in this order. Immediate inconsistencies are removed from this directed relation by requiring that it does not contain pairs of undirected causality or pairs of directed causality in the opposite direction.
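The witness-based lifting from feature occurrences to gene-level relations can be sketched as follows (the final consistency filtering between the directed and undirected relations is omitted):

```python
import numpy as np
from itertools import combinations

def witness_relations(occurrences, dist_threshold, min_witnesses):
    """Lift approximate feature matches (witnesses) to causal relations on genes.

    `occurrences` is a list of (feature_vector, gene, time) triples produced by
    the standalone encoder, with numpy feature vectors. Two occurrences whose
    features are within `dist_threshold` form a witness; a gene pair enters the
    directed relation g1 -> g2 if at least `min_witnesses` witnesses have the
    g1 occurrence strictly earlier, and the undirected relation if the times
    are equal.
    """
    directed, undirected = {}, {}
    for (f1, g1, t1), (f2, g2, t2) in combinations(occurrences, 2):
        if g1 == g2 or np.linalg.norm(f1 - f2) > dist_threshold:
            continue
        if t1 == t2:
            key = tuple(sorted((g1, g2)))
            undirected[key] = undirected.get(key, 0) + 1
        else:
            key = (g1, g2) if t1 < t2 else (g2, g1)
            directed[key] = directed.get(key, 0) + 1
    undirected_rel = {p for p, n in undirected.items() if n >= min_witnesses}
    directed_rel = {p for p, n in directed.items() if n >= min_witnesses}
    return undirected_rel, directed_rel
```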

Figure 11: Autoencoder-Based Network Synthesis. We show the synthesized causal network corresponding to the top model for for window size (which uses a modest feature dimension and witnesses) limited to the top genes passing the 2 SD filter. Directed edges denote potential causality. Undirected edges denote a potential causal connection for which the direction cannot be determined. As usual, isolated nodes are eliminated from the graph. Note the large fraction of undirected edges in the large green component, again related to the Cholesterol pathway. Often switching to a model with smaller window size can resolve the direction but at the cost of higher uncertainty.
Figure 12: Integrated Autoencoder-Based Network Synthesis. An integrated causal network corresponding to the top model for for window size (which uses a modest feature dimension and only witnesses) limited to the top compounds passing the 1 SD filter. Based on a generalized algorithm, it includes positive (green) and negative (red) causality. Most edges are directed from left to right. Dark edges are undirected. The graph contains genes and proteins, and location information for some of the proteins that is not visible here (nucleus vs. cytoplasm). Many smaller disconnected components are shown on the side.

The distance threshold hyperparameter is computed to target a specific causal network size that is suitable for exploration and interpretation by the biologist. We usually generate multiple graphs with different sizes in the range of to edges and let the user select the appropriate level of detail and switch between different views. Overall we obtain a family of causal networks (representing both directed and undirected causality between genes) parameterized by convolutional window size, feature dimension, number of witnesses, and network size. Finally, generic domain knowledge about the nature of causality (not specific to the biological domain) is used to heuristically quantify the consistency of the witnesses (utilizing probabilistic lower bounds for inconsistencies) and hence the plausibility of these models. The results are then combined with validated model performance to establish a ranking of all models. For each window size, the three top models in this ranking are actually generated in the form of graphs and can be further explored by the biologist. A sample model is shown in Fig. 11.

The advantage of our autoencoder-based graph synthesis algorithm is that very few assumptions are needed for this evidence-based approach. In the context of the RTA workflow, they are not only applied to transcriptomics data sets across all experiments but also to proteomics and metabolomics data, resulting in integrated causal networks that can include nodes and potential causal relations for all three types of biological compounds. The RTA workflow also includes a generalized algorithm that captures positive and negative (that is inhibitory) causal relationships. A sample result is shown in Fig. 12.

4.2 Synthesis with Deep and Wide Neural Networks

With the rapidly increasing computational capabilities and the related progress in deep learning applications, it is natural to explore whether more complex models that might reveal additional information about the nature of causal interactions can be synthesized and meaningfully validated in spite of the relatively short duration of our time series. Since it is not a priori clear what the appropriate model complexity is given the amount and nature of the data, we use a parameterized family including a range of deep and wide neural networks and perform a comparative model validation.

To enable insight into the nature of causal interactions, we consider a high-dimensional predictive setting, where based on the changes exhibited by the log-ratio time series between two time points and of our Gaussian processes (using again time points for the period) the neural networks will be trained to predict the changes for all genes from to . Each change is represented as an approx. -dimensional vector with a component for each protein-coding gene. For a fixed experiment, the sample set is the data set of all pairs of a change vector and the label, which is the change vector for the next time point. We use the first of the time scale for training and the remaining for validation, with the idea that predicting the long-term effect will be most challenging and biologically relevant (in spite of some bias introduced by this choice).

We use depth and width as architectural hyperparameters, each ranging over a fixed set of natural numbers. A deep and wide neural network is then defined by stacking composite layers, each consisting of a dense layer of the given width without bias but with weight regularization (using a fixed coefficient), followed by a dropout layer (using a low dropout rate), and followed by an ELU activation layer. The input and output dimension is the number of protein-coding genes. Dropout and regularization parameters were experimentally determined to reduce overfitting, which is easily possible given the huge number of network parameters, especially for networks of large depth and width. As a loss function we use mean squared error (MSE), and we again use TensorFlow's implementation of Adam (with default parameters). We train each network for a fixed number of epochs with a fixed batch size.
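For concreteness, a minimal TensorFlow/Keras sketch of one member of this family is shown below. The choice of L2 as the regularizer, the coefficient, the dropout rate, and the final linear projection back to the gene dimension are assumptions made for illustration where the exact details are not spelled out above.

import tensorflow as tf

def build_deep_wide_net(n_genes, depth, width,
                        l2_coeff=1e-4, dropout_rate=0.05):
    # One member of the parameterized family: `depth` composite layers, each a
    # bias-free dense layer of size `width` with (assumed L2) regularization,
    # followed by dropout and an ELU activation.  The regularization
    # coefficient and dropout rate are placeholders, not the values used in
    # the RTA experiments.
    inputs = tf.keras.Input(shape=(n_genes,))
    x = inputs
    for _ in range(depth):
        x = tf.keras.layers.Dense(
            width, use_bias=False,
            kernel_regularizer=tf.keras.regularizers.l2(l2_coeff))(x)
        x = tf.keras.layers.Dropout(dropout_rate)(x)
        x = tf.keras.layers.Activation('elu')(x)
    # Assumed final linear projection back to the gene dimension.
    outputs = tf.keras.layers.Dense(n_genes)(x)
    model = tf.keras.Model(inputs, outputs)
    model.compile(optimizer='adam', loss='mse')   # Adam with default parameters
    return model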

Figure 13: Comparative model validation of deep and wide neural networks for one RTA experiment. We only show the top 10 models ranked by relative test MSE (last column). It is noteworthy that the top model is a deep and wide network of considerable depth, while the next two models have much smaller depth but worse predictive performance, which at first seems to indicate that the dynamics of this experiment requires a fairly complex explanation.
Figure 14: Comparative model validation of deep and wide neural networks for the same RTA experiment with pretraining on all transcriptomics data sets of the RTA project. Again we only show the top 10 models ranked by relative test MSE. It is noteworthy that the top models are of low depth, indicating that in the context of all data relatively low-complexity models provide the best explanation, but a higher-complexity (deeper) model is still among the top models and can be used if more detail is needed. It also has a better relative MSE and lower complexity than the top model without pretraining.

Since we are concerned with populations of biological cells, changes in the observables do not occur abruptly (a property that is already reflected in our Gaussian process model). Hence we measure the improvement in MSE relative to the default prediction that the change at the next time point is the same as the current one (persistence property). A summary of validation results for one experiment is shown in Fig. 13. While the difference between test and training MSE points to some degree of overfitting, it is not severe in terms of relative MSE. Generally, we observe a high degree of predictability in the data set, which is consistent across all experiments. For example, the top model in Fig. 13 has a lower MSE than the default prediction.
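A small sketch of this criterion, under the assumption that relative MSE is the ratio of the model's MSE to the MSE of the persistence baseline:

import numpy as np

def relative_test_mse(y_pred, y_true, x_input):
    # Persistence baseline: the change at the next time point is predicted to
    # equal the current change, i.e. the input vector itself.
    model_mse = np.mean((y_pred - y_true) ** 2)
    baseline_mse = np.mean((x_input - y_true) ** 2)
    return model_mse / baseline_mse   # < 1 means the model beats the baseline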

A natural question is whether the models can be further improved by using more data, even if it is not directly related to the experiment in question. To test this idea, we added a pretraining stage to our workflow, in which all neural networks are trained on all transcriptomics data sets generated in the RTA project (but only for a small number of epochs). In the second stage, we then continue the training using only the data of the experiment for which the causal network needs to be constructed (again with a reduced number of epochs). Sample results can be found in Fig. 14 and indicate further improvements. In fact, improvements in relative test MSE were obtained for most RTA experiments, except for two cases where the numbers remained approximately the same.
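The two-stage procedure can be sketched as follows (continuing the earlier sketches); epoch counts and batch size are placeholders rather than the settings reported in the figures.

def pretrain_then_finetune(model, pooled_data, experiment_data,
                           pretrain_epochs=50, finetune_epochs=200,
                           batch_size=32):
    # Stage 1: pretrain on the pooled change vectors of all transcriptomics
    # experiments in the project.
    x_all, y_all = pooled_data
    model.fit(x_all, y_all, epochs=pretrain_epochs, batch_size=batch_size)
    # Stage 2: continue training on the experiment for which the causal
    # network is to be constructed.
    x_exp, y_exp = experiment_data
    model.fit(x_exp, y_exp, epochs=finetune_epochs, batch_size=batch_size)
    return model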

Figure 15: The network corresponding to the top model with pretraining, limited to the top genes passing the 2 SD filter and to a bounded maximum degree for a concise presentation.
Figure 16: The network corresponding to the best depth-4 model with pretraining, limited to the top genes passing the 2 SD filter and to a bounded maximum degree for a concise presentation.

When evaluating the validation results across all experiments, we observe that with pretraining the top models are consistently of low depth (with a single exception), which seems to confirm that pretraining has an effect similar to regularization or the incorporation of background knowledge in reducing model complexity. In most cases, however, deeper networks are still included among the top 10 ranked models (for which we generate graphs), and a biologist might prefer them because they provide an alternative or more detailed view, albeit at the cost of slightly lower predictive accuracy.

Compared with the earlier approaches, our causal network synthesis is more heuristic for deep and wide neural networks that have been trained and validated as explained above, and it is primarily driven by the need to obtain graphs of manageable complexity. In essence, we use the neural network structure as the template for our causal network; in other words, the causal network is simply a (very) abstract visualization of the structure of a deep and wide neural network. The key is to focus on the most important subnetwork, which may depend on the level of detail the user is interested in. Hence we generate a parameterized family of directed causality graphs with two types of nodes. The parameters are the common maximum in- and out-degree and the maximum number of genes to consider, each ranging over a few fixed settings. For a given subset of genes (typically a subset defined by our SD ranking that satisfies the maximum bound), we then construct the graph by the following multistep process. Initially, the nodes of the graph are simply all units of the neural network. Input and output units are identified with their corresponding genes. Each positive-weight connection in the neural network can potentially give rise to an edge, but we determine a suitable weight threshold for each layer so that edges can be defined as the subset of neural connections with a weight above the threshold. We choose the lowest threshold that still leads to a number of edges bounded by the square root of twice the number of possible connections (a sparsity constraint). In the next step we perform layer-wise edge pruning (starting with edges derived from the lowest-weight connections within each layer) until the in- and out-degree of each node satisfies the required bound; a simplified sketch of the thresholding and pruning steps is given below. Examples of synthesized networks can be found in Figs. 15 and 16.
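The sketch below (plain NumPy, hypothetical helper names) approximates the per-layer thresholding and the degree-bounded pruning; it uses a greedy keep-highest-weight variant of the pruning described above, and the constant in the sparsity bound reflects our reading of the text rather than a confirmed implementation detail.

import numpy as np

def layer_edges(weights, sparsity_factor=2.0):
    # weights: dense-layer weight matrix (n_in x n_out).  Only positive
    # weights are considered; the threshold is the lowest value that keeps
    # the edge count within sqrt(sparsity_factor * number of possible
    # connections).
    positive = [(i, j, w) for (i, j), w in np.ndenumerate(weights) if w > 0]
    positive.sort(key=lambda e: e[2], reverse=True)
    bound = int(np.sqrt(sparsity_factor * weights.size))
    return positive[:bound]

def prune_to_degree(edges, max_degree):
    # Keep the highest-weight edges subject to the common in-/out-degree
    # bound (a greedy approximation of iteratively removing the lowest-weight
    # edges until the bound is met).
    kept, out_deg, in_deg = [], {}, {}
    for i, j, w in sorted(edges, key=lambda e: e[2], reverse=True):
        if out_deg.get(i, 0) < max_degree and in_deg.get(j, 0) < max_degree:
            kept.append((i, j, w))
            out_deg[i] = out_deg.get(i, 0) + 1
            in_deg[j] = in_deg.get(j, 0) + 1
    return kept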

Variations of this algorithm are clearly possible, e.g. by varying the sparsity constraint. Also, if we take into account positive and negative weights, we obtain a generalized synthesis algorithm covering positive and negative causal relations that has likewise been implemented in the RTA workflow. For the reader familiar with the foundations of Petri nets, it is also worth pointing out that, structurally, shallow causal networks are a subclass of Petri nets, and it might be possible to establish a formal connection between the generalized Petri nets called piles in [27] and deeper causal networks.

5 Implementation in JupyterFlow

Motivated by the need to interact with subject matter experts, the need for reproducibility, and the need to efficiently execute and validate machine learning algorithms on large data sets, we have developed at SRI what is, to our knowledge, the first integrated approach that combines the benefits of interactive computing and distributed workflows. This framework, named JupyterFlow, is based on Petri nets (generalized dataflow graphs) [13] and integrates features of interactive computing (Jupyter/Python notebooks) [19] with data-driven distributed and parallel execution on heterogeneous clusters, typically with GPUs (Graphical Processing Units) on a subset of the nodes. JupyterFlow integrates with TensorFlow [1], which in the RTA workflow is used for estimating Gaussian process models [9], for training a wide range of neural networks, and most recently for approximate probabilistic inference [34].

Figure 17: Small Excerpt of the RTA Workflow in JupyterFlow. Each transition in the Petri net is defined by a Python/Jupyter notebook. The excerpt shows part of our graph synthesis algorithm (in fact, only the pretraining) utilizing deep and wide neural networks and includes multiple stages: preprocessing, Gaussian process estimation, neural network training, comparative validation, and (intermediate) graph synthesis. The high degree of parallelism admitted by the workflow is quite typical for all our machine learning workflows and important to obtain reasonable execution times. Here it is mainly due to the need for architectural hyperparameter exploration (in the middle), but the preprocessing and Gaussian process estimation are also done in parallel (on the left-hand side).

JupyterFlow supports multiple modes of operation: the parallel execution of workflows on powerful GPU servers, the distributed execution on heterogeneous clusters and in the cloud (where GPUs may be available only on some machines), and a mode of disconnected operation, where laptops may occasionally synchronize with the main workflow but otherwise can work independently (without continuous network connectivity). Generally, the workflow executes automatically as a collection of computational notebooks with automatically inferred dataflow dependencies and resource constraints (e.g., memory, number of GPUs), which are essential to optimize the placement of computations in the cluster. In addition, each notebook can also be executed interactively (e.g., for diagnosis and experimentation). A conceptual sketch of this data-driven execution model is given below.
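The following is a purely conceptual sketch of data-driven notebook execution, not the actual JupyterFlow API; all task and file names are hypothetical. Each notebook is modeled as a task with declared input and output artifacts, and a task becomes runnable as soon as all of its inputs have been produced, which yields the kind of parallelism visible in Fig. 17.

from concurrent.futures import ThreadPoolExecutor

# Hypothetical workflow fragment: each task lists the artifacts it consumes
# and produces, from which the dataflow dependencies are inferred.
tasks = {
    "preprocess":  {"inputs": set(),                "outputs": {"clean.h5"}},
    "gp_estimate": {"inputs": {"clean.h5"},         "outputs": {"gp.h5"}},
    "train_d1_w1": {"inputs": {"gp.h5"},            "outputs": {"m11.h5"}},
    "train_d2_w1": {"inputs": {"gp.h5"},            "outputs": {"m21.h5"}},
    "validate":    {"inputs": {"m11.h5", "m21.h5"}, "outputs": {"rank.csv"}},
}

def run(name):
    print("running notebook", name)   # placeholder for executing the notebook
    return tasks[name]["outputs"]

available, done = set(), set()
with ThreadPoolExecutor() as pool:
    while len(done) < len(tasks):
        ready = [n for n, t in tasks.items()
                 if n not in done and t["inputs"] <= available]
        if not ready:
            break                      # guards against cyclic dependencies
        for produced in pool.map(run, ready):
            available |= produced
        done |= set(ready)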

Just as TensorFlow utilizes a form of dataflow graphs in the small, JupyterFlow utilizes dataflow graphs in the large. In the case of RTA, our current workflow (see Fig. 17 for a small excerpt) consists of a large collection of notebooks utilizing Python and TensorFlow that give rise to metalevel dataflow dependencies (not counting dependencies at the TensorFlow level). We are currently executing this workflow on SRI's private cluster (with up to 4 GPUs per node) and in the Google Cloud (using large instances with up to 96 CPUs and 8 GPUs). Our latest version of JupyterFlow also runs in Docker containers and supports virtual heterogeneous Kubernetes clusters (we have tested up to 100 nodes) that can be scaled up and down dynamically. Efficient sharing of large amounts of data becomes possible by combining an overlay file system with the shared storage devices offered by the Google Cloud.

6 Conclusion

Viewing biological systems as complex continuous systems that can be modeled as networks of interacting components at a very abstract level, we have presented three causal network synthesis algorithms that support the biologist in understanding the mechanism of action (MoA) of a system perturbation, which in our experiments was caused by a toxin or a drug. All our algorithms take into account the reality that only a very limited period after the perturbation is relevant to identifying the MoA. Interestingly, this limitation is not unique to perturbations in the biological domain. The lack of data due to the limited time period is partly compensated by the possibility of performing a large number of measurements in parallel with today's high-throughput technologies, leading to a relatively short but high-dimensional time series for each experiment. Nevertheless, our modeling has to be sufficiently abstract to account for the limited amount of data, which in the absence of specific biological knowledge precludes detailed stochastic and differential equation models.

All our algorithms combine Gaussian processes with abstract network modeling. Our probabilistic causal network synthesis algorithm is based on an intermediate synthetic model that tries to capture key aspects of causality with very few parameters. Our autoencoder-based network synthesis does not give approximate probabilistic results but can use generic knowledge about causality to identify models that are more likely (based on rough probabilistic lower bounds) to be consistent with the data. Finally, our deep neural network prediction-based algorithm is purely heuristic, but allows networks that can express more detail about the dynamic nature of interactions.

In this paper we have discussed the first probabilistic synthesis algorithm in more detail. This is because of its simplicity, the low complexity of the underlying model, and its capability to generate at least approximate probabilities, which were confirmed to be roughly biologically plausible. We believe that the other two algorithms, while potentially more powerful in the detection of causalities (and hence still interesting for hypothesis generation), may be more prone to false positives, due to their relatively higher complexity, but we think that more experiments (possibly beyond the biological domain) are needed to understand their relative advantages and limitations.

Here, we briefly summarize a number of extensions of our probabilistic synthesis algorithm that might be worthwhile to explore. First, a natural extension is to take into account inhibitor edges (which we also refer to as negative causal edges, in contrast to the positive ones). Such negative causal edges are already implemented in our other, more heuristic algorithms. Based on the Connectivity Map data, we know that such negative dependencies are much rarer, but they can be useful to complement the picture and also lead to a more precise notion of independence, if defined as the absence of causality.

There are at least two places where, with minor modifications, our synthetic model can take into account additional biological knowledge about the network. First, by making additional assumptions about the dynamics of causality, time lags could be modeled using a parameterized distribution as in [29]. Although worthwhile to investigate in our view, the price to pay for this approach is that it will lead to different models (with varying degrees of uncertainty) depending on the type of data, e.g. transcriptomics, proteomics, and metabolomics, complicating an integrated network-based treatment. The other possible improvement concerns the balanced nature of the synthetic data set. A prior reflecting a lower probability of causality vs. independence may slightly improve the accuracy, but in light of the typical use of the causality detector, e.g. when applied after some filtering to significantly affected pairs of genes, the prior may no longer be accurate, leading to another potential source of error. Again, different types of data would also have to be taken into account, leading to increased complexity of the model. Another interesting avenue for future work is to consider Generative Adversarial Networks (GANs) [14] as an alternative to sampling based on Gaussian processes. We have already used GANs for anomaly detection in the RTA workflow, but have not directly exploited their generative capabilities. It would be interesting to understand whether, in particular, the convolutional GANs that we are using can lead to biologically more realistic modeling (especially regarding the overall distribution) and hence to potential accuracy improvements for our Siamese networks.

Finally, we would like to mention the opportunity to use our probabilistic causality detector as part of another algorithm, which can lead to further improvements (e.g. a reduction of false positive rates). In the probabilistic algorithm presented in this paper, causality detection (both the existence of causality and its direction) is performed locally (and hence without taking into account network context), that is, only on the basis of two potentially related time series. Global knowledge about network properties, e.g. relative consistency between edges, absence of cycles, or degree/density knowledge, cannot be exploited in this way. A possible systematic solution is an integrated treatment of data and knowledge in machine learning using PALO, our Probabilistic Approximate Logic, which is the subject of a separate paper [34] that also includes the application to biological network synthesis but has broader applicability.

Due to the small number of assumptions underlying our network synthesis algorithms, we expect that they may also be applicable in other domains where a complex system is perturbed and high-dimensional time series data for a relatively small number of time points after the perturbation needs to be analyzed to understand the mechanism of action, in other words, how the perturbation propagates through the system. One sample domain, for which we obtained some promising preliminary results, is that of global financial markets, where the high dimensionality of the time series results from the broad range of tradable financial products that can be seen as observables of the global financial system. This is another quite natural test case for our algorithms, since financial time series data is readily available and there are many types of perturbation events that are recurring and well-defined. Previous work indicates that network modeling can have important applications in this area [6]. Other domains of potential interest are related to social media and news, where the high dimensionality of the observation space is a natural result of the vast number of interrelated topics and information sources. For example, we have done some preliminary semantic analysis and time series modeling using the very rich data source of global events and news provided by the GDELT project [11]. Understanding the network-scale effects of events could lead to a more accurate assessment of news content, its sources, and their inherent biases. The vast amount of available data is reflected in the larger amount of computing and storage resources needed in spite of our use of lightweight natural language processing techniques. A restriction to subsets of the full media universe (e.g. to particular domains of interest) turned out to be more tractable for our initial experiments, which already point to numerous predictable temporal patterns and invariants in the global news cycle.

Acknowledgements Research was sponsored by the U.S. Army Research Office and the Defense Advanced Research Projects Agency and was accomplished under Cooperative Agreement Number W911NF-14-2-0020. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the Army Research Office, DARPA, or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation hereon. We gratefully acknowledge the inspiration and feedback from all members of the RTA project team, especially for their help in interpreting the results and suggesting to explore new directions such as the application of deep neural networks to RTA data sets.

Author Contributions Akos Vertes, Andrew Poggio, and Carolyn L. Talcott conceived the research. Mark-Oliver Stehr conceived and implemented the JupyterFlow framework, developed the network synthesis algorithms within this framework, and wrote the manuscript with input from Carolyn L. Talcott and Merrill Knapp. To test the algorithms, Ziad J. Sahab, Peter Avar, Andrew R. Korte, Lida Parvin, and Akos Vertes carried out the metabolomics and proteomics analysis. Deborah I. Bunin and Denise Nishita carried out the transcriptomics analysis. Akos Vertes, Ziad J. Sahab, Peter Avar, Andrew R. Korte, Mark-Oliver Stehr, Carolyn L. Talcott, Merrill Knapp, Andrew Poggio, Deborah I. Bunin, Brian M. Davis, Christine A. Morton, Christopher J. Sevinsky, and Maria I. Zavodszky contributed to the discussions of the results.

References

  • [1] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, M. Kudlur, J. Levenberg, R. Monga, S. Moore, D. G. Murray, B. Steiner, P. Tucker, V. Vasudevan, P. Warden, M. Wicke, Y. Yu, and X. Zheng. Tensorflow: A system for large-scale machine learning. In Proceedings of the 12th USENIX Conference on Operating Systems Design and Implementation, OSDI’16, pages 265–283, Berkeley, CA, USA, 2016. USENIX Association.
  • [2] R. Albert. Scale-free networks in cell biology. Journal of Cell Science, 118(21):4947–4957, 2005.
  • [3] S. Berlemont, G. Lefebvre, S. Duffner, and C. Garcia. Class-balanced Siamese neural networks. Neurocomput., 273(C):47–56, Jan. 2018.
  • [4] J. Bromley, J. W. Bentz, L. Bottou, I. Guyon, Y. LeCun, C. Moore, E. Säckinger, and R. Shah. Signature verification using A ”Siamese” time delay neural network. IJPRAI, 7(4):669–688, 1993.
  • [5] J. Bromley, I. Guyon, Y. LeCun, E. Säckinger, and R. Shah. Signature verification using a ”Siamese” time delay neural network. In J. D. Cowan, G. Tesauro, and J. Alspector, editors, Advances in Neural Information Processing Systems 6, pages 737–744. Morgan-Kaufmann, 1994.
  • [6] N. J. C. Calkin and M. López de Prado. The topology of macro financial flows: An application of stochastic flow diagrams. Algorithmic Finance, 3(1-2):43–85, 2014.
  • [7] C. Chaouiya. Petri net modelling of biological networks. Briefings in Bioinformatics, 8(4):210–219, 07 2007.
  • [8] T. Chu and C. Glymour. Search for additive nonlinear time series causal models. J. Mach. Learn. Res., 9:967–991, June 2008.
  • [9] A. G. de G. Matthews, M. van der Wilk, T. Nickson, K. Fujii, A. Boukouvalas, P. León-Villagrá, Z. Ghahramani, and J. Hensman. GPflow: A Gaussian process library using TensorFlow. Journal of Machine Learning Research, 18:40:1–40:6, 2017.
  • [10] D. Droghini, F. Vesperini, E. Principi, S. Squartini, and F. Piazza. Few-shot Siamese neural networks employing audio features for human-fall detection. In Proceedings of the International Conference on Pattern Recognition and Artificial Intelligence, PRAI 2018, pages 63–69, New York, NY, USA, 2018. ACM.
  • [11] GDELT: Global Database of Events, Language, and Tone. https://www.gdeltproject.org/about.html.
  • [12] M. B. Gerstein, A. Kundaje, M. Hariharan, S. G. Landt, K. K. Yan, C. Cheng, X. J. Mu, E. Khurana, J. Rozowsky, R. Alexander, R. Min, P. Alves, A. Abyzov, N. Addleman, N. Bhardwaj, A. P. Boyle, P. Cayting, A. Charos, D. Z. Chen, Y. Cheng, D. Clarke, C. Eastman, G. Euskirchen, S. Frietze, Y. Fu, J. Gertz, F. Grubert, A. Harmanci, P. Jain, M. Kasowski, P. Lacroute, J. J. Leng, J. Lian, H. Monahan, H. O’Geen, Z. Ouyang, E. C. Partridge, D. Patacsil, F. Pauli, D. Raha, L. Ramirez, T. E. Reddy, B. Reed, M. Shi, T. Slifer, J. Wang, L. Wu, X. Yang, K. Y. Yip, G. Zilberman-Schapira, S. Batzoglou, A. Sidow, P. J. Farnham, R. M. Myers, S. M. Weissman, and M. Snyder. Architecture of the human regulatory network derived from ENCODE data. Nature, 489(7414):91–100, 2012.
  • [13] C. Girault and R. Valk. Petri Nets for Systems Engineering: A Guide to Modeling, Verification, and Applications. Springer Publishing Company, Incorporated, 1st edition, 2010.
  • [14] I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 2, NIPS’14, pages 2672–2680, Cambridge, MA, USA, 2014. MIT Press.
  • [15] M. Gosak, R. Markovič, J. Dolenšek, M. Slak Rupnik, M. Marhl, A. Stožer, and M. Perc. Network science of biological systems at different scales: A review. Physics of Life Reviews, 24:118–135, Mar. 2018.
  • [16] A. Graves, M. G. Bellemare, J. Menick, R. Munos, and K. Kavukcuoglu. Automated curriculum learning for neural networks. In Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017, pages 1311–1320, 2017.
  • [17] H. Han, H. Shim, D. Shin, J. E. Shim, Y. Ko, J. Shin, H. Kim, A. Cho, E. Kim, T. Lee, H. Kim, K. Kim, S. Yang, D. Bae, A. Yun, S. Kim, C. Y. Kim, H. J. Cho, B. Kang, S. Shin, and I. Lee. TRRUST: a reference database of human transcriptional regulatory interactions. Scientific Reports, 5, 2015.
  • [18] G. E. Hinton and R. R. Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507, July 2006.
  • [19] T. Kluyver, B. Ragan-Kelley, F. Pérez, B. Granger, M. Bussonnier, J. Frederic, K. Kelley, J. Hamrick, J. Grout, S. Corlay, P. Ivanov, D. Avila, S. Abdalla, and C. Willing. Jupyter notebooks – a publishing format for reproducible computational workflows. In F. Loizides and B. Schmidt, editors, Positioning and Power in Academic Publishing: Players, Agents and Agendas, pages 87 – 90. IOS Press, 2016.
  • [20] G. Koch, R. Zemel, and R. Salakhutdinov. Siamese neural networks for one-shot image recognition. In ICML 2015 Deep Learning Workshop, 2015.
  • [21] J. Lamb, E. Crawford, D. Peck, J. Modell, I. Blat, M. Wrobel, J. Lerner, J. Brunet, A. Subramanian, K. Ross, M. Reich, H. Hieronymus, G. Wei, S. Armstrong, S. Haggarty, P. Clemons, R. Wei, S. Carr, E. Lander, and T. Golub. The Connectivity Map: using gene-expression signatures to connect small molecules, genes, and disease. Science, 313(5795):1929–35, 2006.
  • [22] Q. Li, J. Zhu, R. Cao, K. Sun, J. M. Garibaldi, Q. Li, B. Liu, and G. Qiu. Relative geometry-aware Siamese neural network for 6DOF camera relocalization. CoRR, abs/1901.01049, 2019.
  • [23] P. Manocha, R. Badlani, A. Kumar, A. Shah, B. Elizalde, and B. Raj. Content-based representations of audio using Siamese neural networks. In ICASSP, pages 3136–3140. IEEE, 2018.
  • [24] D. Marbach, J. C. Costello, R. Küffner, N. M. Vega, R. J. Prill, D. M. Camacho, K. R. Allison, the DREAM5 Consortium, M. Kellis, J. J. Collins, and G. Stolovitzky. Wisdom of crowds for robust gene network inference. Nature Methods, 9(8):796–804, 2012.
  • [25] J. Masci, U. Meier, D. Cireşan, and J. Schmidhuber. Stacked convolutional auto-encoders for hierarchical feature extraction. In Proceedings of the 21st International Conference on Artificial Neural Networks - Volume Part I, ICANN’11, pages 52–59, Berlin, Heidelberg, 2011. Springer-Verlag.
  • [26] J. Pearl. Causality: Models, Reasoning, and Inference. Cambridge University Press, New York, NY, USA, 2000.
  • [27] C. A. Petri. Nets, time and space. Theor. Comput. Sci., 153(1-2):3–48, Jan. 1996.
  • [28] R. J. Prill, D. Marbach, J. Saez-Rodriguez, P. K. Sorger, L. G. Alexopoulos, X. Xue, N. D. Clarke, G. Altan-Bonnet, and G. Stolovitzky. Towards a rigorous assessment of systems biology models: the DREAM3 challenges. PloS one, 5(2), 2010.
  • [29] S. A. Ramsey, S. L. Klemm, D. E. Zak, K. A. Kennedy, V. Thorsson, B. Li, M. Gilchrist, E. S. Gold, C. D. Johnson, V. Litvak, G. Navarro, J. C. Roach, C. M. Rosenberger, A. G. Rust, N. Yudkovsky, A. Aderem, and I. Shmulevich. Uncovering a macrophage transcriptional program by integrating evidence from motif scanning and expression dynamics. PLoS Computational Biology, 4(3), 2008.
  • [30] C. E. Rasmussen and C. K. I. Williams. Gaussian Processes for Machine Learning (Adaptive Computation and Machine Learning). The MIT Press, 2005.
  • [31] U. Shaham and R. R. Lederman. Common variable learning and invariant representation learning using Siamese neural networks. CoRR, abs/1512.08806, 2015.
  • [32] U. Shaham and R. R. Lederman. Learning by coincidence: Siamese networks and common variable learning. Pattern Recognition, 74:52–63, 2018.
  • [33] P. Spirtes, C. Glymour, and R. Scheines. Causation, Prediction, and Search. Springer, 1993.
  • [34] M.-O. Stehr. Probabilistic Approximate Logic and its Implementation in the Logical Imagination Engine, 2019. SRI International Technical Report (in preparation).
  • [35] A. Subramanian, R. Narayan, S. M. Corsello, D. D. Peck, T. E. Natoli, X. Lu, J. Gould, J. F. Davis, A. A. Tubelli, J. K. Asiedu, D. L. Lahr, J. E. Hirschman, Z. Liu, M. Donahue, B. Julian, M. Khan, D. Wadden, I. C. Smith, D. Lam, A. Liberzon, C. Toder, M. Bagul, M. Orzechowski, O. M. Enache, F. Piccioni, S. A. Johnson, N. J. Lyons, A. H. Berger, A. F. Shamji, A. N. Brooks, A. Vrcic, C. Flynn, J. Rosains, D. Y. Takeda, R. Hu, D. Davison, J. Lamb, K. Ardlie, L. Hogstrom, P. Greenside, N. S. Gray, P. A. Clemons, S. Silver, X. Wu, W. N. Zhao, W. Read-Button, X. Wu, S. J. Haggarty, L. V. Ronco, J. S. Boehm, S. L. Schreiber, J. G. Doench, J. A. Bittker, D. E. Root, B. Wong, and T. R. Golub. A Next Generation Connectivity Map: L1000 Platform and the First 1,000,000 Profiles. Cell, 171(6):1437–1452, 2017.
  • [36] A. Vertes, A. Arul, P. Avar, A. R. Korte, H. Li, P. Nemes, L. Parvin, S. Stopka, S. Hwang, Z. J. Sahab, L. Zhang, D. I. Bunin, M. Knapp, A. Poggio, M.-O. Stehr, C. L. Talcott, B. M. Davis, S. R. Dinn, C. A. Morton, C. J. Sevinsky, and M. I. Zavodszky. Inferring mechanism of action of an unknown compound from time series omics data. In M. Ceska and D. Safránek, editors, Computational Methods in Systems Biology - 16th International Conference, CMSB 2018, Brno, Czech Republic, September 12-14, 2018, Proceedings, volume 11095 of Lecture Notes in Computer Science, pages 238–255. Springer, 2018.
  • [37] D. E. Zak and A. Aderem. Systems biology of innate immunity. Immunological reviews, 227(1):264–282, 2009.
  • [38] Y. Zhang and Z. Duan. Iminet: Convolutional semi-siamese networks for sound search by vocal imitation. In IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pages 304–308. IEEE, 2018.