Discovering dependencies in complex physical systems using Neural Networks

01/27/2021
by Sachin Kasture, et al.

In today's age of data, discovering relationships between different variables is an interesting and challenging problem. This problem becomes even more critical for complex dynamical systems such as weather forecasting and econometric models, which can show highly non-linear behavior. A method based on mutual information and deep neural networks is proposed as a versatile framework for discovering non-linear relationships ranging from functional dependencies to causality. We demonstrate the application of this method to actual multivariable non-linear dynamical systems. We also show that this method can find relationships even for datasets with a small number of data points, as is often the case with empirical data.


I Introduction

Finding relationships between different variables in large datasets [15, 12, 1] is an important problem that has ramifications in fields ranging from environmental science to economics and genetic networks. Understanding which variables affect a certain quantity becomes increasingly challenging when these relationships are highly non-linear, like those occurring in dynamical systems with several variables. Quite often, in a large dataset with many variables, only a few variables may significantly affect the target variable, and identifying these variables is the first vital step in exploring these dependencies in more detail.

Several methods exist which can help find dependencies and correlations between variables. However, most of these methods are good at detecting a certain class of functions while they fail for others. Some methods are quite good at detecting functional dependencies between two variables [15, 2]; they have, however, not been demonstrated in a multi-variable scenario where a target variable depends on several input variables. Finding functional dependencies has been explored extensively in the context of relational databases [10, 6]. However, these methods rely on finding exact functional relationships by identifying all attributes which have a one-to-one or one-to-many relationship with a certain column Y. This approach does not work well for small databases which are just a sample of the true distribution, since in these cases one-to-one relations are more likely to occur by chance. Also, in such cases it is difficult to reliably find the smallest subset of variables which are sufficient to describe Y. These methods do not offer any control over what kind of functional relationships may be considered intuitively as good or interesting candidates, nor do they provide any kind of score to evaluate functional dependencies.

In this paper, we use neural networks as devices to model nonlinear behavior and find complex non-linear relationships. In particular, deep neural networks (DNN), which consist of more than one hidden layer, are excellent candidates for efficiently modelling multi-variable non-linear polynomial functions with a small number of neurons [9, 16]. Additionally, a regularization mechanism allows us to control the complexity of the model we wish to consider [18]. Neural networks have recently been used to discover physical concepts, identify phase transitions and design quantum experiments [8, 14, 13]. To help find dependencies, we use a DNN-based autoencoder architecture which consists of an encoder-decoder pair. The encoder maps the input space to a latent space, while the decoder maps the latent space to the output space. This architecture has been used, amongst other applications, for non-linear Principal Component Analysis (PCA), where the goal is to find a compressed representation of the data [5]. As such, the input and the output of the autoencoder are conventionally the same. In our method the input will be X, the set of input features, and the target will be Y, a feature or a set of features. We then use compression of mutual information in the latent space to derive a loss function which can be minimized to find the smallest set of features in X which can be used to reliably reconstruct Y. The loss function can also be used to assign a score to compare the functional dependencies on different sets of input parameters. We then demonstrate this method by finding dependencies in chaotic dynamical systems. We also show that this method can be used to find non-linear causal connections in the Granger sense for chaotic systems [3, 17, 11], even for a small dataset of 100 samples.
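In symbols (our own shorthand, since the paper's notation is not reproduced in this extract, and assuming the standard form of the information-bottleneck objective), the setup can be summarized as

\[
Z = f_{\mathrm{enc}}(X), \qquad \hat{Y} = g_{\mathrm{dec}}(Z), \qquad \max \; I(Z;Y) - \beta\, I(X;Z),
\]

where the encoder maps the input features X to a latent code Z, the decoder reconstructs the target Y from Z, and the trade-off parameter balances compressing X against preserving the information needed to reconstruct Y.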

Figure 1: Comparison between the latent components and the corresponding scaled inputs for (a)-(d) different target variables of equation 17. In the plots where a latent component is essentially noise, information from the corresponding input is not used by the decoder to reconstruct the target. The scaling factor is chosen so that the two curves are comparable.

II Theory

We now derive a loss function using the information bottleneck method [19], based on the fact that the latent intermediate layer can be used to extract only the information in X that is relevant for reconstructing Y. We denote this latent representation by Z. We also assume a Markov chain Y → X → Z, which means that Z depends on Y only through X; this is justified because X and Y correspond to observed ground-truth data. We now use the fact that we want to extract from X only the relevant information which can reconstruct Y, and we use Shannon mutual information to quantify this information [19, 4]. Therefore we want to maximize a quantity combining two mutual-information terms, one describing the capacity of the encoder and the other that of the decoder, with a tunable parameter determining the relative weight between the two terms. We can write the mutual information between Z and Y as:

(1)

where H denotes the Shannon entropy. We neglect the entropy of Y since it is fixed by the data. Since it is very difficult to calculate the true conditional distribution of Y given Z, we approximate it by another analytic function q(y|z). Using the fact that the KL divergence, which measures the ‘distance’ between two probability distributions, is always non-negative:

(2)

we can write

(3)

We can now choose an appropriate form for q(y|z), one that allows us to derive a suitable loss function and also lets us tune the complexity of the decoder. The output of the decoder is given by the composite function of the decoder neural network acting on the latent variable Z. To also include an additional L1 [18] regularization parameter, which helps restrict the magnitude of the weights in the decoder neural network, we use the following function for q(y|z):

(4)

where the w's are the weights of the different neurons in the decoder network. Therefore we can write

(5)

Now we use the fact that the joint distribution of Y and Z is obtained by marginalizing over X. Using the Markov chain condition, this marginalization involves only the encoder distribution of Z given X. Approximating the data distribution by the empirical distribution, where N is the number of distinct data points, we can write

(6)

Similarly, we can define the mutual information between X and Z as:

(7)

We now again use another analytical function, this time in place of the true marginal distribution of Z, and use the non-negativity of the KL divergence to get:

(8)

For convenience we use a Gaussian function centred at 0.

(9)

where the z's are the different components of Z and the width of the Gaussian is an adjustable parameter. For the conditional distribution of Z given X we can use:

(10)

where the mean of the Gaussian is given by a linear transformation of X. This means that we apply a linear transformation to X and add independent Gaussian noise with mean 0 and a fixed variance to each component. We now plug definitions 9 and 10 into equation 8 and obtain:

(11)

With a change of notation, we can write the above equation as

(12)

Using a further approximation, we can write

(13)

Similarly, substituting equation 10 into equation 6 and assuming the noise variance to be small enough, we obtain:

(14)
Figure 2: Fan-in causality pattern for the set of delay equations in equation 18, for the parameter values used to obtain the results in Figure 3
Figure 3: Comparison between the latent components and the corresponding scaled inputs for (a)-(c) different target variables of the set of delay equations 18. In the plots where a latent component is noise, information from the corresponding input is not used by the decoder to reconstruct the target

Therefore we can define a loss function to be minimized as

(15)

We observe that the first term tries to minimize the least-squares difference between the decoder output and Y, while the second term controls the size of the weights of the decoder, which in turn controls the maximum degree of the polynomials the decoder NN can approximate. For the third term, we see that as we increase its coefficient, the NN will try to keep the latent magnitudes small in order to keep the total loss function small. Assuming now that we standardize our data so that, on average, the input features have similar magnitudes, we can absorb them into the encoder weights. The third term will then be smallest when only the encoder weights corresponding to those inputs required to reproduce Y are non-zero. Using this intuition, and the fact that the term inside the summation in the loss above is always non-negative, we can further simplify the loss function as

(16)

where we have merged the encoder weights with the decoder weights in a single penalty. This way we treat both the encoder and decoder weights on equal terms using L1 regularization. From a practical standpoint, L1 is advantageous since it can shrink weights towards zero faster.
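As a concrete illustration, a loss of this form (reconstruction error plus a single L1 penalty over all encoder and decoder weights) could be written as follows; the function and parameter names are our own, and the single coefficient stands in for the paper's regularization parameters:

    import torch

    def ib_loss(y_pred, y_true, model, l1_coeff):
        # First term: least-squares difference between the decoder output and Y
        mse = torch.mean((y_pred - y_true) ** 2)
        # Second term: L1 penalty over all encoder and decoder weights,
        # which limits decoder complexity and drives irrelevant weights to zero
        l1 = sum(p.abs().sum() for p in model.parameters())
        return mse + l1_coeff * l1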

III Application

For further study we use a NN in which the encoder consists of 2 linear layers, giving a linear mapping from the inputs to the latent variables. We then add Gaussian noise to the latent variables. The latent code is then sent through a multilayer decoder network with non-linear activation functions to give the output. We perform batch normalization between intermediate neural network layers [7]; these layers prevent changes in the data distributions between adjacent layers and allow the neural network to learn at a higher learning rate. We then minimize the loss function in equation 16 using stochastic gradient descent with different batch sizes. We can tune the regularization parameters to obtain as low a value of the loss function as possible; this choice may also depend on our prior knowledge about the complexity of the system. The data is split into a training and a validation set: the training data is used to build the model and the validation set checks how well the model generalizes. The basic heuristic for tuning these parameters is as follows: after fixing the learning rate for the gradient descent, we first increase the regularization parameter that fixes the complexity of the functions the decoder can simulate. We then increase the other regularization parameter while monitoring the mean square error, and stop when the mean square error is as small as possible for both the training and the validation set. A minimal sketch of this architecture and training loop is given below.
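The sketch is illustrative only: the layer widths, activation function, noise level, and hyperparameter names are our own assumptions, not values taken from the paper.

    import torch
    import torch.nn as nn

    class DependencyAE(nn.Module):
        # Encoder: 2 linear layers mapping the inputs to a latent code (no nonlinearity).
        # Decoder: multilayer network with nonlinear activations and batch normalization.
        def __init__(self, n_in, n_latent, n_out, n_hidden=32, noise_std=0.1):
            super().__init__()
            self.noise_std = noise_std
            self.encoder = nn.Sequential(
                nn.Linear(n_in, n_latent),
                nn.Linear(n_latent, n_latent),
            )
            self.decoder = nn.Sequential(
                nn.Linear(n_latent, n_hidden), nn.BatchNorm1d(n_hidden), nn.Tanh(),
                nn.Linear(n_hidden, n_hidden), nn.BatchNorm1d(n_hidden), nn.Tanh(),
                nn.Linear(n_hidden, n_out),
            )

        def forward(self, x):
            z = self.encoder(x)
            # Independent zero-mean Gaussian noise added to each latent component
            z_noisy = z + self.noise_std * torch.randn_like(z)
            return self.decoder(z_noisy), z_noisy

    def train(model, x, y, l1_coeff=0.1, lr=1e-4, epochs=1000, batch_size=300):
        # Stochastic gradient descent on a loss of the form in equation 16:
        # mean squared error plus an L1 penalty over all weights.
        opt = torch.optim.SGD(model.parameters(), lr=lr)
        for _ in range(epochs):
            perm = torch.randperm(x.shape[0])
            for i in range(0, x.shape[0], batch_size):
                idx = perm[i:i + batch_size]
                y_pred, _ = model(x[idx])
                l1 = sum(p.abs().sum() for p in model.parameters())
                loss = torch.mean((y_pred - y[idx]) ** 2) + l1_coeff * l1
                opt.zero_grad()
                loss.backward()
                opt.step()
        return model

After training, inspecting the latent components relative to the injected noise level indicates which inputs the decoder actually uses, as in Figures 1 and 3.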

We now use this method to infer relationships in well-known non-linear systems. We first consider the Lorenz96 non-linear system, which is defined as:

(17)

where the index runs from 1 to N, N is the number of oscillators, the boundary indices are taken cyclically, and F is the driving term, chosen so that the system behaves in the chaotic regime. Figure 1 shows the results for N=5. We run the method N=5 times, each time taking one of the variables (i from 1 to 5) as the target. We see that the latent representation is basically just the added Gaussian noise when the corresponding input has no dependency on the target. The number of data points was 3000, the learning rate was 0.0001, and the two regularization parameters were 0 and 0.1 respectively. The training was run for 1000 epochs with a batch size of 300.
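A dataset of this kind can be generated as follows; the forcing value, integration scheme, and step size below are our own illustrative choices, not values stated in the paper.

    import numpy as np

    def lorenz96_deriv(x, F):
        # Standard Lorenz96 right-hand side with cyclic boundary conditions:
        # dx_i/dt = (x_{i+1} - x_{i-2}) * x_{i-1} - x_i + F
        return (np.roll(x, -1) - np.roll(x, 2)) * np.roll(x, 1) - x + F

    def generate_lorenz96(n_points=3000, n_vars=5, F=8.0, dt=0.01):
        # Simple Euler integration; F=8 is a commonly used chaotic forcing (assumption)
        x = F * np.ones(n_vars)
        x[0] += 0.01  # small perturbation to move off the fixed point
        data = np.empty((n_points, n_vars))
        for t in range(n_points):
            x = x + dt * lorenz96_deriv(x, F)
            data[t] = x
        return data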


Next we apply the NN to infer causal relationships in a set of non-linear delay equations. For this we look at the following set of equations:

(18)

for i=1,2,3. We choose parameters which correspond to the fan-in pattern shown in Figure 2; these parameters correspond to a chaotic regime. In this case two of the variables are causally driven by the third. A fan-in pattern is a good test because correlation-based tests would falsely infer a causal relationship between the two driven variables [12]. To infer the causal relationships, we run the NN with one variable as the target and the time-delayed variables as the input (a sketch of this setup follows). From Figure 3 we can see that we are able to correctly infer the dependencies, even for a very small dataset of 50 points. The plots were obtained for a learning rate of 0.001, regularization parameters of 0.1 and 0.005 respectively, 1500 epochs, and a batch size of 32.
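One way of arranging the data for such a Granger-style test is sketched below; the single-lag structure and the helper's name are our own illustrative assumptions, since the exact delays of equation 18 are not reproduced here.

    import numpy as np

    def build_lagged_dataset(series, target_idx, lag=1):
        # series: array of shape (T, n_vars) sampled from the delay system.
        # Inputs are all variables at time t - lag; the target is variable
        # target_idx at time t, so a discovered dependency indicates a
        # directed (Granger-style) influence on that variable.
        X = series[:-lag, :]
        Y = series[lag:, target_idx:target_idx + 1]
        return X, Y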

Figure 4: FD vs MR for different values of the regularization parameters. The legend also indicates the non-linear system for the plotted data; ‘dde’ stands for the delay difference equations in equation 18

We also summarize the performance of this method using two metrics, the false discovery rate (FD) and the miss rate (MR), which are defined as:

FD = FP / (FP + TP),    MR = FN / (FN + TP)    (19)

where FN, FP, and TP are false negatives, false positives, and true positives respectively. Here a positive means a certain variable has been discovered to be independent of the output, and a negative means a variable has been discovered to be related to the output. This data is obtained from results over 20 independent runs of the model. For the Lorenz96 model and for the set of equations 18, the best results are obtained for different values of the regularization parameters (see Figure 4).
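A small helper for computing these metrics from the counts accumulated over the independent runs might look as follows (the function name and the guard against empty denominators are our own choices):

    def fd_mr(tp, fp, fn):
        # False discovery rate: fraction of "independent" calls that are wrong
        fd = fp / (fp + tp) if (fp + tp) > 0 else 0.0
        # Miss rate: fraction of truly positive cases that were missed
        mr = fn / (fn + tp) if (fn + tp) > 0 else 0.0
        return fd, mr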

IV Conclusion

The proposed approach using NNs is a versatile platform for inferring relationships, especially in complex non-linear systems, because NNs are a powerful tool for modelling such non-linear functions. Even though it is difficult to infer the exact functional form using a NN, this method can help locate functional dependencies between variables in a multivariable system. These variables can then be probed more extensively to find the functional (or approximately functional) form of the relationships. Methods based on sparse regression have been used in the past to find functional relationships; however, they rely on prior knowledge of the set of basis functions to use for the regression. The proposed method has no such requirement and, with a large enough NN, can simulate any complex non-linear function. Besides locating functional relationships, it can also help infer causal relationships in non-linear data, as seen in the discussed example, where it correctly inferred the causal relationships even for a small dataset of 50 samples.

V Acknowledgements

The author would like to thank Akshatha Mohan for helpful comments and critical assessment of the manuscript.

References

  • [1] S. L. Brunton, J. L. Proctor, and J. N. Kutz (2016-04) Discovering governing equations from data by sparse identification of nonlinear dynamical systems. Proceedings of the National Academy of Sciences 113 (15), pp. 3932–3937 (en). External Links: ISSN 0027-8424, 1091-6490, Link, Document Cited by: §I.
  • [2] A. Dembo, A. Kagan, and L. A. Shepp (2001-04) Remarks on the Maximum Correlation Coefficient. Bernoulli 7 (2), pp. 343. External Links: ISSN 13507265, Link, Document Cited by: §I.
  • [3] M. Detto, A. Molini, G. Katul, P. Stoy, S. Palmroth, and D. Baldocchi (2012-04) Causality and Persistence in Ecological Systems: A Nonparametric Spectral Granger Causality Approach. The American Naturalist 179 (4), pp. 524–535 (en). External Links: ISSN 0003-0147, 1537-5323, Link, Document Cited by: §I.
  • [4] C. Giannella and E. Robertson (2004-09) On approximation measures for functional dependencies. Information Systems 29 (6), pp. 483–507 (en). External Links: ISSN 03064379, Link, Document Cited by: §II.
  • [5] G. E. Hinton (2006-07) Reducing the Dimensionality of Data with Neural Networks. Science 313 (5786), pp. 504–507 (en). External Links: ISSN 0036-8075, 1095-9203, Link, Document Cited by: §I.
  • [6] Y. Huhtala (1999-02) Tane: An Efficient Algorithm for Discovering Functional and Approximate Dependencies. The Computer Journal 42 (2), pp. 100–111 (en). External Links: ISSN 0010-4620, 1460-2067, Link, Document Cited by: §I.
  • [7] S. Ioffe and C. Szegedy (2015-03) Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. arXiv:1502.03167 [cs]. Note: arXiv: 1502.03167 External Links: Link Cited by: §III.
  • [8] R. Iten, T. Metger, H. Wilming, L. del Rio, and R. Renner (2020-01) Discovering Physical Concepts with Neural Networks. Physical Review Letters 124 (1), pp. 010508 (en). External Links: ISSN 0031-9007, 1079-7114, Link, Document Cited by: §I.
  • [9] H. W. Lin, M. Tegmark, and D. Rolnick (2017-09) Why Does Deep and Cheap Learning Work So Well?. Journal of Statistical Physics 168 (6), pp. 1223–1247 (en). External Links: ISSN 0022-4715, 1572-9613, Link, Document Cited by: §I.
  • [10] J. Liu, J. Li, C. Liu, and Y. Chen (2012-02) Discover Dependencies from Data—A Review. IEEE Transactions on Knowledge and Data Engineering 24 (2), pp. 251–264. External Links: ISSN 1041-4347, Link, Document Cited by: §I.
  • [11] H. Ma, K. Aihara, and L. Chen (2015-05) Detecting Causality from Nonlinear Dynamics with Short-term Time Series. Scientific Reports 4 (1), pp. 7464 (en). External Links: ISSN 2045-2322, Link, Document Cited by: §I.
  • [12] D. Marbach, R. J. Prill, T. Schaffter, C. Mattiussi, D. Floreano, and G. Stolovitzky (2010-04) Revealing strengths and weaknesses of methods for gene network inference. Proceedings of the National Academy of Sciences 107 (14), pp. 6286–6291 (en). External Links: ISSN 0027-8424, 1091-6490, Link, Document Cited by: §I, §III.
  • [13] A. A. Melnikov, H. Poulsen Nautrup, M. Krenn, V. Dunjko, M. Tiersch, A. Zeilinger, and H. J. Briegel (2018-02) Active learning machine learns to create new quantum experiments. Proceedings of the National Academy of Sciences 115 (6), pp. 1221–1226 (en). External Links: ISSN 0027-8424, 1091-6490, Link, Document Cited by: §I.
  • [14] B. S. Rem, N. Käming, M. Tarnowski, L. Asteria, N. Fläschner, C. Becker, K. Sengstock, and C. Weitenberg (2019-09) Identifying quantum phase transitions using artificial neural networks on experimental data. Nature Physics 15 (9), pp. 917–920 (en). External Links: ISSN 1745-2473, 1745-2481, Link, Document Cited by: §I.
  • [15] D. N. Reshef, Y. A. Reshef, H. K. Finucane, S. R. Grossman, G. McVean, P. J. Turnbaugh, E. S. Lander, M. Mitzenmacher, and P. C. Sabeti (2011-12) Detecting Novel Associations in Large Data Sets. Science 334 (6062), pp. 1518–1524 (en). External Links: ISSN 0036-8075, 1095-9203, Link, Document Cited by: §I, §I.
  • [16] D. Rolnick and M. Tegmark (2018-04) The power of deeper networks for expressing natural functions. arXiv:1705.05502 [cs, stat]. Note: arXiv: 1705.05502 External Links: Link Cited by: §I.
  • [17] J. Runge, J. Heitzig, V. Petoukhov, and J. Kurths (2012-06) Escaping the Curse of Dimensionality in Estimating Multivariate Transfer Entropy. Physical Review Letters 108 (25), pp. 258701 (en). External Links: ISSN 0031-9007, 1079-7114, Link, Document Cited by: §I.
  • [18] R. Tibshirani (1996-01) Regression Shrinkage and Selection Via the Lasso. Journal of the Royal Statistical Society: Series B (Methodological) 58 (1), pp. 267–288 (en). External Links: ISSN 00359246, Link, Document Cited by: §I, §II.
  • [19] N. Tishby, F. C. Pereira, and W. Bialek (2000-04) The information bottleneck method. arXiv:physics/0004057. Note: arXiv: physics/0004057 External Links: Link Cited by: §II.