Finding relationships between different variables in large datasets [15, 12, 1] is an important problem with ramifications in fields ranging from environmental science to economics and genetic networks. Understanding which variables affect a certain quantity becomes increasingly challenging when these relationships are highly non-linear, like those occurring in dynamical systems with several variables. Quite often in a large dataset with many variables, only a few may significantly affect the target variable, and identifying these variables is the first vital step in exploring these dependencies in more detail.
Several methods exist which can help find dependencies and correlations between variables. However, most of these methods are good at detecting a certain class of functions while failing for others. Some methods are quite good at detecting functional dependencies between two variables [15, 2]; they have, however, not been demonstrated in a multi-variable scenario where a target variable depends on several input variables. Finding functional dependencies has been explored extensively in the context of relational databases [10, 6]. However, these methods rely on finding exact functional relationships by finding all attributes which have a one-to-one or one-to-many relationship with a certain column Y. This approach does not work well for small databases which are just a sample of the true distribution, as in these cases one-to-one relations are more likely to occur by chance. Also, in such cases it is difficult to reliably find the smallest subset of variables which are sufficient to describe Y. These methods do not offer any control over what kind of functional relationships may intuitively be considered good or interesting candidates, nor do they provide any kind of score to evaluate functional dependencies.
In this paper, we use neural networks as devices to model nonlinear behavior and find complex non-linear relationships. In particular, deep neural networks (DNNs), which consist of more than one hidden layer, are excellent candidates for efficiently modelling multi-variable non-linear polynomial functions with a small number of neurons [9, 16]. Additionally, a regularization mechanism allows us to control the complexity of the model we wish to consider. Neural networks have recently been used to discover physical concepts, identify phase transitions and design quantum experiments [8, 14, 13].
To help find dependencies, we use a DNN-based autoencoder architecture which consists of an encoder-decoder pair. The encoder maps the input space to a latent space, while the decoder maps the latent space to the output space. This architecture has been used, amongst other applications, for non-linear Principal Component Analysis (PCA), where the goal is to find a compressed representation of data. As such, the input and the output of the autoencoder are conventionally the same. In our method the input will be X, the set of input features, and Y is the target feature or set of features. We then use compression of mutual information in the latent space to derive a loss function which can be minimized to find the smallest set of features in X which can be used to reliably reconstruct Y. The loss function can be used to assign a score to compare the functional dependencies on different sets of input parameters. We then demonstrate this method by finding dependencies in chaotic dynamical systems. We also show that this method can be used to find non-linear causal connections in the Granger sense for chaotic systems [3, 17, 11], even for a small dataset of 50 samples.
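The method is built on Shannon mutual information. As a minimal standalone illustration (a toy numpy sketch with made-up distributions, not part of the proposed method itself), the discrete mutual information vanishes for independent variables but detects a dependence that has zero linear correlation:

```python
import numpy as np

def mutual_information(joint):
    """I(X;Y) in nats for a discrete joint distribution (2-D array)."""
    px = joint.sum(axis=1, keepdims=True)   # marginal p(x)
    py = joint.sum(axis=0, keepdims=True)   # marginal p(y)
    mask = joint > 0                        # avoid log(0) on empty cells
    return np.sum(joint[mask] * np.log(joint[mask] / (px * py)[mask]))

# Independent variables: the joint factorizes, so I(X;Y) = 0.
independent = np.outer([0.5, 0.5], [0.25, 0.75])
assert np.isclose(mutual_information(independent), 0.0)

# A deterministic dependence (y = 1 iff x = 0, with x in {-1, 0, 1}):
# I(X;Y) > 0 even though the linear correlation between x and y is zero.
dependent = np.array([[0.25, 0.0],
                     [0.0,  0.5],
                     [0.25, 0.0]])
assert mutual_information(dependent) > 0.0
```

Here the dependent example gives I(X;Y) = log 2, while any correlation-based test would report zero association.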
We now derive a loss function using the information bottleneck method [19], based on the fact that the latent intermediate layer can be used to extract only the relevant information from X needed to reconstruct Y. We represent this latent representation by Z. We also assume a Markov chain Y ↔ X ↔ Z, which means p(z|x, y) = p(z|x); this holds because X and Y correspond to observed ground-truth data. We now use the fact that we want to extract only the relevant information from X which can reconstruct Y, and we use the Shannon mutual information to quantify this information [19, 4]. Therefore we want to maximize the quantity I(Z, Y) − β I(Z, X), where I(Z, X) and I(Z, Y) describe the capacity of the encoder and the decoder respectively, with β determining the relative weight between the two terms. We can write I(Z, Y) as:

I(Z, Y) = H(Y) − H(Y|Z),
where H denotes the Shannon entropy. We neglect H(Y) since it is fixed by the data. Since it is very difficult to calculate p(y|z), we approximate it by another analytic function q(y|z). Using the fact that the KL divergence, which measures the ‘distance’ between two probability distributions, is always non-negative,

D_KL(p ‖ q) = ∫ p(y|z) log [ p(y|z) / q(y|z) ] dy ≥ 0,

we can write

−H(Y|Z) ≥ ∫ p(y, z) log q(y|z) dy dz.
We can now choose an appropriate function for q(y|z) which allows us to derive a suitable loss function and lets us tune the complexity of the decoder. The output of the decoder is given by g(z), the composite function of the decoder neural network acting on the latent variable z. To also include an additional L1 regularization parameter λ, which helps restrict the magnitude of the weights in the decoder neural network, we use the following function for q(y|z):

q(y|z) ∝ exp( −‖y − g(z)‖²/2 − λ Σ_i |w_i| ),

where w_1, w_2, etc. are the weights of the different neurons in the decoder network. Therefore we can write

−H(Y|Z) ≥ −∫ p(y, z) [ ‖y − g(z)‖²/2 + λ Σ_i |w_i| ] dy dz + const.
Now we use the fact that p(y, z) = ∫ p(y, z|x) p(x) dx. Using the Markov chain condition, this can be written as p(y, z) = ∫ p(y|x) p(z|x) p(x) dx. Approximating p(x) ≈ (1/K) Σ_k δ(x − x_k), where K is the number of distinct data points, we can write

∫ p(y, z) log q(y|z) dy dz ≈ (1/K) Σ_k ∫ p(z|x_k) log q(y_k|z) dz.
Similarly, we can define I(Z, X) as:

I(Z, X) = ∫ p(x, z) log [ p(z|x) / p(z) ] dx dz.

We now again use another analytic function r(z) in place of p(z), and using the result on the positivity of the KL divergence we get:

I(Z, X) ≤ ∫ p(x, z) log [ p(z|x) / r(z) ] dx dz.
For convenience we use for r(z) a Gaussian function centred at 0,

r(z) ∝ exp( −Σ_i z_i² / (2σ_r²) ),

where the z_i are the different components of z and σ_r is an adjustable parameter. For p(z|x) we can use:

p(z|x) = N(z; Ax, σ² I).

This means we use a linear transformation A from x to z and add independent Gaussian noise with variance σ² and mean 0 to each component. We now plug these definitions into the bounds above and obtain:
Writing the Gaussian integrals out explicitly, dropping additive constants, and using the approximation that the data is standardized, the encoder term reduces to a sum of squared encoder responses, Σ_k Σ_i ( Σ_j A_ij x_kj )², so we can write
Similarly, substituting the Gaussian form of p(z|x) into the bound for I(Z, Y), and assuming σ to be small enough that g(Ax + ε) ≈ g(Ax), we obtain:
Therefore we can define a loss function to be minimized as

L = (1/K) Σ_k ‖y_k − ŷ_k‖² + λ Σ_i |w_i| + β Σ_k Σ_i ( Σ_j A_ij x_kj )²,

where ŷ_k = g(A x_k) is the output of the decoder.
We observe that the first term tries to minimize the least-squares difference between y and ŷ, and the second term controls the size of the weights of the decoder, which in turn controls the maximum degree of polynomial the decoder NN can approximate. For the third term, we see that as we increase β, the NN will try to keep the encoder weights A_ij small to keep the total loss function small. Assuming now that we standardize our data so that the inputs x_j on average have similar magnitudes, we absorb their scale into β. The third term will now be smallest when only those A_ij are non-zero which correspond to the x_j required to reproduce Y. Using this intuition, and the fact that each term inside the summation over k is non-negative, we can further simplify the loss function as

L = (1/K) Σ_k ‖y_k − ŷ_k‖² + λ Σ_i |w_i| + β Σ_i,j |A_ij|,
where we have merged the remaining constant factors into β. This way we treat both the encoder and decoder weights on equal terms using L1 regularization. From a practical standpoint, L1 is advantageous since it can shrink weights faster.
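As a concrete illustration, the simplified loss can be evaluated for a toy model in a few lines. In the following sketch, all shapes, weight values, and the values of λ and β are hypothetical, and the decoder is an arbitrary one-hidden-layer tanh network standing in for the multilayer decoder:

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_latent, n_samples = 5, 2, 100

# Toy data: the target depends only on the first two input features.
X = rng.standard_normal((n_samples, n_in))
Y = X[:, :1] * X[:, 1:2]                    # y = x1 * x2

# Encoder: linear map A plus independent Gaussian noise on the latent code.
A = rng.standard_normal((n_in, n_latent)) * 0.1
sigma = 0.05
Z = X @ A + sigma * rng.standard_normal((n_samples, n_latent))

# Decoder g: one hidden tanh layer (weights W1, W2 chosen arbitrarily).
W1 = rng.standard_normal((n_latent, 8)) * 0.1
W2 = rng.standard_normal((8, 1)) * 0.1
Y_hat = np.tanh(Z @ W1) @ W2

# Simplified loss: MSE + L1 on decoder weights (lambda) + L1 on encoder (beta).
lam, beta = 0.1, 0.1                        # hypothetical regularization strengths
mse = np.mean(np.sum((Y - Y_hat) ** 2, axis=1))
loss = mse + lam * (np.abs(W1).sum() + np.abs(W2).sum()) + beta * np.abs(A).sum()
assert np.isfinite(loss) and loss > 0
```

In a real run the weights A, W1 and W2 would be trained by gradient descent on this loss; the L1 terms then drive the encoder rows for irrelevant inputs toward zero.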
For further study we use a NN in which the encoder has two linear layers. This gives us a linear mapping z = Ax. We then add Gaussian noise to the latent variables z. The latent code is then sent through a multilayer decoder network with non-linear activation functions to give the output ŷ = g(z). We perform batch normalization between intermediate neural network layers [7]; this prevents changes in the data distributions between adjacent layers and allows the neural network to learn at a higher learning rate. We then minimize the loss function in equation 16 using stochastic gradient descent with different batch sizes. We can tune the values of λ and β (the regularization parameters) to obtain as low a value of the loss function as possible. This choice of regularization parameters may also depend on our prior knowledge about the complexity of the system. The data is split into a training set and a validation set: the training data is used to build the model, and the validation set checks how well the model generalizes. The basic heuristic for tuning these parameters is as follows: after fixing the learning rate for the gradient descent, we first increase the value of λ, which fixes the complexity of the functions the decoder can simulate. We then increase the value of β and look at the mean square error, stopping when the mean square error is as small as possible for both the training and the validation set. We now use this method to infer relationships in well-known non-linear systems. We first consider the Lorenz96 non-linear system, which is defined as:
dx_i/dt = ( x_{i+1} − x_{i−2} ) x_{i−1} − x_i + F,

where the index i goes from 1 to N, N is the number of oscillators, the indices are cyclic (x_{−1} = x_{N−1}, x_0 = x_N, x_{N+1} = x_1), and F is the driving term; we choose F such that the system behaves in the chaotic regime. Figure 1 shows the results for N = 5. We run the method N = 5 times, each time with the target variable x_i for i from 1 to 5. We see that the latent representation is basically just the added Gaussian noise when the corresponding input has no dependency on the target. The number of data points was 3000, the learning rate was 0.0001, and the values of λ and β were 0 and 0.1 respectively. The training was run for 1000 epochs with a batch size of 300.
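The Lorenz96 dynamics are easy to simulate. The following sketch integrates the N = 5 system with a fourth-order Runge–Kutta stepper; the step size and the driving term F = 8 (a standard choice for chaotic behaviour) are illustrative assumptions, not values taken from the text:

```python
import numpy as np

def lorenz96_rhs(x, F):
    # dx_i/dt = (x_{i+1} - x_{i-2}) * x_{i-1} - x_i + F, with cyclic indices.
    return (np.roll(x, -1) - np.roll(x, 2)) * np.roll(x, 1) - x + F

def integrate(x0, F=8.0, dt=0.01, steps=3000):
    """Classic RK4 integration; returns the trajectory as (steps, N)."""
    x = np.array(x0, dtype=float)
    traj = np.empty((steps, x.size))
    for t in range(steps):
        k1 = lorenz96_rhs(x, F)
        k2 = lorenz96_rhs(x + 0.5 * dt * k1, F)
        k3 = lorenz96_rhs(x + 0.5 * dt * k2, F)
        k4 = lorenz96_rhs(x + dt * k3, F)
        x = x + dt / 6.0 * (k1 + 2 * k2 + 2 * k3 + k4)
        traj[t] = x
    return traj

# Small perturbation of the fixed point x_i = F seeds the chaotic dynamics.
traj = integrate([8.0, 8.01, 8.0, 8.0, 8.0])
assert traj.shape == (3000, 5) and np.all(np.isfinite(traj))
```

Each column of `traj` can then serve as one of the variables x_1, ..., x_5 fed to the autoencoder.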
Next, we apply the NN to infer causal relationships in a set of non-linear delay equations. For this we look at the following set of equations:
for i = 1, 2, 3. We choose parameters which correspond to the fan-in pattern shown in Figure 2; these parameters correspond to a chaotic regime. In this case both x_2 and x_3 are causally driven by x_1. A fan-in pattern is a good test because correlation-based tests would falsely infer a causal relationship between x_2 and x_3. To infer the causal relationships, we run the NN with each variable in turn as the target and the remaining variables as the input. From Figure 3 we can see that we are able to correctly infer the dependencies, even for a very small dataset of 50 points. The plots were obtained for a learning rate of 0.001 and values of λ and β of 0.1 and 0.005 respectively. The number of epochs was 1500 with a batch size of 32.
We also summarize the performance of this method using two metrics, the false discovery rate (FD) and the miss rate (MR), which are defined as:

FD = FP / (FP + TP), MR = FN / (FN + TP),

where FN, FP and TP are false negatives, false positives and true positives respectively. Here a positive means a certain variable has been discovered to be independent of the output, and a negative means a variable has been discovered to be related to the output. This data is obtained over 20 independent runs of the model. For the Lorenz96 model, the best result is obtained with λ = 0 and β = 0.1, while for the set of delay equations, the best results are obtained for λ = 0.1 and β = 0.005.
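A minimal sketch of these two metrics, assuming the standard definitions FD = FP/(FP + TP) and MR = FN/(FN + TP):

```python
def false_discovery(fp, tp):
    # Fraction of variables declared independent that are actually related.
    return fp / (fp + tp)

def miss_rate(fn, tp):
    # Fraction of truly independent variables that were declared related.
    return fn / (fn + tp)

assert false_discovery(fp=1, tp=9) == 0.1
assert miss_rate(fn=2, tp=8) == 0.2
```

The counts fp, fn and tp would be accumulated over the 20 independent runs described above.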
The proposed NN-based approach is a versatile platform for inferring relationships, especially in complex non-linear systems, because NNs are a powerful tool for modelling such non-linear functions. Even though it is difficult to infer the exact functional form using a NN, this method can help locate functional dependencies between variables in a multivariable system. These variables can then be probed more extensively to find the functional (or approximately functional) form of the relationships. Methods based on sparse regression have been used in the past to find functional relationships; however, they rely on prior knowledge of the set of basis functions to use for the regression. The proposed method has no such requirement and, with a large enough NN, can simulate any complex non-linear function. Besides locating functional relationships, it can also help infer causal relationships in non-linear data, as seen in the discussed example, where it correctly inferred the causal relationships even for a small dataset of 50 samples.
The author would like to thank Akshatha Mohan for helpful comments and critical assessment of the manuscript.
- (2016-04) Discovering governing equations from data by sparse identification of nonlinear dynamical systems. Proceedings of the National Academy of Sciences 113 (15), pp. 3932–3937.
- (2001-04) Remarks on the Maximum Correlation Coefficient. Bernoulli 7 (2), pp. 343.
- (2012-04) Causality and Persistence in Ecological Systems: A Nonparametric Spectral Granger Causality Approach. The American Naturalist 179 (4), pp. 524–535.
- (2004-09) On approximation measures for functional dependencies. Information Systems 29 (6), pp. 483–507.
- (2006-07) Reducing the Dimensionality of Data with Neural Networks. Science 313 (5786), pp. 504–507.
- (1999-02) Tane: An Efficient Algorithm for Discovering Functional and Approximate Dependencies. The Computer Journal 42 (2), pp. 100–111.
- (2015-03) Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. arXiv:1502.03167 [cs].
- (2020-01) Discovering Physical Concepts with Neural Networks. Physical Review Letters 124 (1), pp. 010508.
- (2017-09) Why Does Deep and Cheap Learning Work So Well?. Journal of Statistical Physics 168 (6), pp. 1223–1247.
- (2012-02) Discover Dependencies from Data—A Review. IEEE Transactions on Knowledge and Data Engineering 24 (2), pp. 251–264.
- (2015-05) Detecting Causality from Nonlinear Dynamics with Short-term Time Series. Scientific Reports 4 (1), pp. 7464.
- (2010-04) Revealing strengths and weaknesses of methods for gene network inference. Proceedings of the National Academy of Sciences 107 (14), pp. 6286–6291.
- (2018-02) Active learning machine learns to create new quantum experiments. Proceedings of the National Academy of Sciences 115 (6), pp. 1221–1226.
- (2019-09) Identifying quantum phase transitions using artificial neural networks on experimental data. Nature Physics 15 (9), pp. 917–920.
- (2011-12) Detecting Novel Associations in Large Data Sets. Science 334 (6062), pp. 1518–1524.
- (2018-04) The power of deeper networks for expressing natural functions. arXiv:1705.05502 [cs, stat].
- (2012-06) Physical Review Letters 108 (25), pp. 258701.
- (1996-01) Regression Shrinkage and Selection Via the Lasso. Journal of the Royal Statistical Society: Series B (Methodological) 58 (1), pp. 267–288.
- (2000-04) The information bottleneck method. arXiv:physics/0004057.