A Local Information Criterion for Dynamical Systems

May 27, 2018 · Arash Mehrjou et al. · Max Planck Society

Encoding a sequence of observations is an essential task with many applications. The encoding can become highly efficient when the observations are generated by a dynamical system. A dynamical system imposes regularities on the observations that can be leveraged to achieve a more efficient code. We propose a method to encode a given or learned dynamical system. Apart from its application for encoding a sequence of observations, we propose to use the compression achieved by this encoding as a criterion for model selection. Given a dataset, different learning algorithms result in different models. But not all learned models are equally good. We show that the proposed encoding approach can be used to choose the learned model which is closer to the true underlying dynamics. We provide experiments for both encoding and model selection, and theoretical results that shed light on why the approach works.


1 Introduction

Objects in nature vary in complexity. A round stone looks simpler than a convoluted, rough piece of rock; a constant beep-like sound is simpler than an orchestra. We humans have internal notions of how complex objects are. Complexity can also be defined for abstract objects such as mathematical constructs. The focus of this paper is the complexity of dynamical systems that model the laws of nature (newton1833philosophiae). To our eyes, a dynamical system is nothing more than a temporal sequence of observations. We might use the data sequence to infer a model. But which is the better representation of the dynamical system – the data or the model – and which model should we use? In this paper, we take a closer look at the efficient encoding of dynamical systems and, based on that, propose a model selection criterion with practical use in empirical inference.

For illustration, consider the following scenario: Alice and Bob are friends talking over the phone. Alice is watching a dynamical system and wants to share her experience with Bob. Alice knows what Bob knows about nature, math, etc. She is watching a temporal sequence of observations x(t) generated by an underlying dynamics function f. Unfortunately, the transmission cord from Alice to Bob charges her for every voltage pulse, so Alice would like to transmit her experience to Bob at the least phone cost. Due to physical constraints, Alice can observe samples from the system with sampling frequency 1/Δt, where Δt is the time interval between two consecutive observations. Assume the phone call starts at time t = 0 and Alice can describe each of her observations with b bits. One trivial solution is for Alice to talk constantly with Bob, telling him every observation at each time instant for an indeterminate amount of time. Despite its simplicity, this approach costs Alice an amount that increases without bound as time goes on. More cleverly, Alice can use her prior assumptions about nature and her belief that her observations are not totally random. She can then infer the underlying dynamics by a nonparametric model f̂ fitted to her observations. Assume this model is chosen from a hypothesis set H on whose members Alice and Bob have agreed. Thus, Alice only needs to inform Bob about the initial state x(0) of the system and the model f̂ she has learned about the dynamics. Bob can reconstruct the sequence on his side by running f̂ starting from the initial state x(0). Notice that the state dynamics may cover only a small subset of the state space, which removes the need to model f on its whole domain. We use this property of dynamical systems to compress their information and obtain an optimal local trade-off between model complexity and prediction accuracy.
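Alice's saving can be made concrete with a back-of-the-envelope sketch. The numbers below (observation count, bit width, hypothesis-set size) are purely illustrative, not taken from the paper:

```python
import math

def naive_cost_bits(num_samples: int, bits_per_obs: int) -> int:
    """Cost of transmitting every observation verbatim: grows without bound."""
    return num_samples * bits_per_obs

def model_cost_bits(hypothesis_set_size: int, bits_per_obs: int) -> int:
    """Cost of transmitting one model index from the agreed hypothesis set
    plus the initial state; independent of how long the system is observed."""
    return math.ceil(math.log2(hypothesis_set_size)) + bits_per_obs

# Hypothetical numbers: one million 32-bit observations, 1024 candidate models.
naive = naive_cost_bits(10**6, 32)   # 32,000,000 bits
smart = model_cost_bits(1024, 32)    # 10 + 32 = 42 bits
assert smart < naive
```

The model-based cost is constant in the observation horizon, which is the point of transmitting a model and an initial state instead of the raw stream.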

The underlying questions of this example are also highly relevant for artificial intelligent systems. Imagine autonomous vehicles flying or driving in a formation (alam2015heavy), or multiple robots coordinating their actions (rubenstein2014programmable). These systems need to know about each other's behavior; that is, agents need to transmit dynamics information among themselves. An intelligent agent, however, will use its resources wisely and thus communicate only when, and what, is necessary. In this scenario, better encoding of dynamical systems means reduced communication, lower bandwidth requirements, and thus reduced cost. Likewise, intelligent agents may store various internal models for the purposes of simulation, prediction, or control (camacho2013model). Better representations here may mean improved performance, reduced memory requirements, and faster computation.

Contribution — In this paper, we propose to encode dynamical systems through local representations, which are computed to yield (locally) an optimal trade-off between model complexity and predictive performance. The criterion automatically chooses the ‘right’ complexity – locally simple dynamics are represented by low-order models, while higher-order representations are automatically chosen in areas with more complex dynamics. Because the representation thus adapts to the local information content of the dynamics, the proposed encoding scheme represents a novel information criterion for dynamical systems, which we call the Local Information Criterion for Dynamical Systems (L-ICDS).

L-ICDS is motivated by compressing information through local representations. Since theory and empirical evidence in machine learning confirm the relation between generalization and compression (vapnik2013nature; luxburg2004compression), we hypothesized that L-ICDS is also useful for model selection. Indeed, we show that the information criterion can be used (in addition to efficient encoding) to choose among different models learned from a given dataset. In particular, we show that it is possible to choose between different architectures of neural networks (NNs) and to compare different types of learned models (e.g., NNs versus GPs) solely with the aid of the compression score, without test data. We extend our empirical findings with theoretical results, which confirm the correctness of our method for certain function classes and provide insight into why L-ICDS is a useful criterion for model selection. Fig. 1 illustrates the two proposed applications of L-ICDS: encoding and model selection.

(a) Encoding
(b) Model selection
Figure 1: Two cartoons illustrating the two ideas presented in this paper. (a) Encoding a time series by first learning a model from the observations (bottom) and then locally encoding the model with L-ICDS. (b) Using the L-ICDS score as an information criterion for model selection: two different neural network architectures are trained on a time series, and the L-ICDS score is computed for each.

Related Work — The subject of obtaining a representation of a dynamical system from its input-output data is known as system identification (ljung1999system; nelles2013nonlinear) or model learning (nguyen2011model). Two major approaches in system identification are gray-box and black-box approaches (ljung1999system; nelles2013nonlinear). Gray-box methods learn the parameters of a known model (tulleken1993grey), where the parameters typically have a physical interpretation. Black-box methods, in contrast, need to identify both the structure and the parameters of the model (sjoberg1995nonlinear). In black-box system identification, or machine learning in general, choosing the appropriate structure is usually done by investigating model performance on a left-out validation set. Information criteria are a different approach to model selection, taking model complexity and data explanation into account at the same time (yamaoka1978application). Many information criteria have been proposed and used for supervised (fogel1991information) and unsupervised learning (mehrjou2016improved). Despite some recent work (darmon2018information; mangan2017model) on information criteria for dynamical systems, the field is not yet well explored. This work proposes a compression method for dynamical systems that can be used both as an information criterion and for model selection.

Models of dynamical systems take very different representations. At one end of the spectrum are classical parametric models such as linear transfer functions and state-space models (ljung1999system), as well as nonlinear gray-box models with known structure (e.g., from first principles) and some free parameters. In these, the model structure is relatively rigid, and information is encoded in a small number of parameters, often with a physical interpretation. Neural networks (NNs) (wang2016data; narendra1990identification) can also learn the model structure, encoding information in a large number of weight parameters without direct physical interpretation. Fuzzy models such as Takagi–Sugeno (takagi1993fuzzy) encode dynamical systems as a set of fuzzy rules and associated local models. Nonparametric methods such as Gaussian process (GP) models (frigola2013bayesian; doerr2017optimizing; eleftheriadis2017identification) and classical time- or frequency-domain methods (wellstead1981non) represent dynamical-system information essentially in a dataset (in the time or frequency domain). Herein, we propose to encode dynamical systems in local models whose complexity is adapted to the data stream. Our encoding thus provides a middle ground between encoding in a dataset and in a single (global) parametric model.

The benefits of local modeling approaches for dynamical systems have long been recognized (atkeson1997locally; nelles1996basis; TiMeViSc16). These include, in particular, the ability to learn fast and incrementally from a continuous stream of (possibly large) data, while allowing for non-stationary distributions (TiMeViSc16). This is critical in real-time learning settings such as robot control (ScAt10). While in these works the complexity of the local model must be chosen a priori (most often, locally linear models are used), our method determines the optimal model complexity, adapted to the data.

The proposed encoding scheme for dynamical systems was first considered in (solowjowCDCsubm_arxiv), but in a different context from the one herein. While in (solowjowCDCsubm_arxiv) the true dynamics are assumed known and L-ICDS is used to efficiently communicate state information between agents in a networked system, we consider encoding of dynamics models learned from data. Moreover, the proposal of L-ICDS for model selection (Sec. 4) and all theoretical (Theorems 1–3) and empirical results (Secs. 3 and 5) are novel contributions.

2 Proposed Local Encoding

In this section, we explain our proposed encoding scheme for dynamical systems of the form

(1)  ẋ(t) = f(x(t)).

We present our idea based on the concepts of algorithmic complexity (wallace2005statistical), Universal Turing Machines (UTMs) (turing1937computable), and minimum message length (MML) (wallace1999minimum). A UTM is a programmable machine that receives a message as input and produces the desired output. The minimum length of the input message can be seen as the complexity of the output and is called algorithmic complexity (AC). The MML principle chooses the model for the observed data for which the joint AC of the tuple (model, data) is minimized. The detailed prerequisite definitions are deferred to the supplementary material.

General notion — Our aim is to construct a brief and efficient explanation of the observed data from the model. The explanation is a message consisting of two parts. The first part encodes some general assertion (theory) about the source of the data, and the second part explains the data under the assumption that the assertion is correct (wallace1999minimum). Throughout this paper, we assume the data take finite discrete values with a certain precision. Hence, each data sample can be encoded as a finite sequence of ‘0’s and ‘1’s. This is a reasonable assumption because, in practice, values are usually stored in quantized form on digital computers, and we consider a finite horizon hereafter.
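The two-part trade-off can be illustrated with a toy calculation; the candidate names and bit counts below are hypothetical:

```python
def two_part_length(model_bits: float, residual_bits: float) -> float:
    """MML code length: bits for the assertion (theory) plus bits for the
    data left unexplained under that assertion."""
    return model_bits + residual_bits

# Hypothetical candidates: (bits to state the theory, bits for the residual).
candidates = {
    "constant": (2.0, 500.0),   # too simple a theory: long residual
    "linear":   (10.0, 40.0),   # balanced theory and residual
    "lookup":   (480.0, 0.0),   # memorizes the data: long theory
}
best = min(candidates, key=lambda name: two_part_length(*candidates[name]))
assert best == "linear"
```

MML prefers the candidate minimizing the joint code length, not the one with the smallest residual alone.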

Alice’s encoding of a dynamical system — Assume Alice is given a long sequence of observations to be transmitted to Bob over the phone. Alice composes a message consisting of two parts. The first part encodes her belief about the dynamical system


Input: dynamical function f, initial state, global time horizon, maximum number of partitions, and maximum number of expansion terms for each local model.
Output: approximated states, optimal total cost, optimal total complexity, and optimal number of partitions.
For each candidate number of partitions up to the maximum, run LMS (Alg. 2) on every local window, accumulate the total cost and complexity, and keep the partitioning with the smallest total cost.
Algorithm 1 L-ICDS: Computing the efficient code


Input: dynamical function f, index of the local window, local time horizon, relative weight, initial state of the local time horizon, and the maximum allowed complexity.
Output: local approximation of the state trajectory, optimal local cost, and optimal local complexity.
For each candidate local complexity up to the maximum, fit the local expansion, evaluate the cost of Eq. 3, and keep the complexity with the smallest cost.
Algorithm 2 LMS: Local Model Selection

that has generated the sequence, and the second part encodes the portion of the data left unexplained by the assumed dynamical system. In this setting, Bob has a UTM that decodes the message and reconstructs the original sequence. The first part of the message teaches Bob Alice’s belief about the source dynamical system, and the second part teaches Bob how to recover the observations given this dynamical system. Assume that Alice and Bob have agreed on a finite set H of functions from which the dynamics f is chosen. Therefore, the first part of the message takes log2 |H| bits to pick one member of this set. The second part of the message encodes the initial point x(0), from which the dynamical system starts evolving.

Again, we assume that state values are discrete, finite, and chosen from an alphabet set A. This assumption is valid under bounded values and finite precision for the states. Encoding the initial point thus requires log2 |A| bits. In total, the number of bits required to encode the sequence of observations, ICDS = log2 |H| + log2 |A|, can be seen as an Information Criterion for Dynamical Systems (ICDS). For a deterministic dynamical system, having f and x(0) suffices to recover x(t) for all t (within the assumed precision). Therefore, ICDS bits of information are sufficient to recover the sequence.
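Under the assumption that the two message parts cost the base-2 logarithms of the respective set sizes, the global count is a one-liner; the set sizes below are hypothetical:

```python
import math

def icds_bits(num_hypotheses: int, alphabet_size: int) -> int:
    """Global code length: name one dynamics function in the agreed hypothesis
    set H, plus name the initial state x(0) in the alphabet A."""
    return math.ceil(math.log2(num_hypotheses)) + math.ceil(math.log2(alphabet_size))

# Hypothetical sizes: 1024 candidate dynamics and 16-bit quantized states.
assert icds_bits(1024, 2**16) == 10 + 16
```

Because the system is deterministic, these bits determine the entire trajectory, however long it is observed.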

Can Alice do better? — The states of a dynamical system move along a certain trajectory in the state space, depending on f and x(0). Therefore, we do not need to encode f on its entire input space. If f takes a simple shape around the working point, we can save many bits by encoding f locally rather than globally. This idea results in the Local Information Criterion for Dynamical Systems (L-ICDS). Assume the state space is adaptively partitioned into K pieces along the state trajectory, and the complexity of the system within each partition is also adaptively chosen. The input tape of the UTM is formed as a concatenation of several messages (instead of two as before), where the i-th tuple corresponds to the i-th local partition in the state space of the dynamical system of Eq. 1. In each tuple, the first entry reprograms the UTM into a simulator of the local approximation to f, and the second entry decodes to the corresponding initial point from the observations; that is, the local trajectory of each local model starts from a point on the correct trajectory, which prevents error from propagating from one local model to the next. Formally, we look for a local representation of the function based on a finite set of basis functions,

(2)  f̂_i(x) = Σ_{j=1}^{m_i} a_{ij} φ_j(x),

where f̂_i is the local approximation to f around the working point x_i. In other words, f̂_i approximates the function in the local partition of the state space to which x_i belongs. The basis functions φ_j are chosen from a hypothesis space of finite cardinality, which is rich enough to contain basis functions that approximate f arbitrarily well as m_i → ∞. Different classes of basis functions can be used, e.g., Taylor expansions, Fourier series, Legendre polynomials, etc. (andrews1992special). In this paper, we use the Taylor expansion to showcase our points, but the concepts apply to other expansions as well. Let us next assume the coefficients a_{ij} are chosen from a finite discrete set C. The coefficients are bounded because we approximate the dynamics function by a smooth function (e.g., an NN with tanh nonlinearity or a GP) whose derivatives are bounded. In addition, the coefficients are continuous quantities, but we again assume they are represented with finite precision (as on a computer). Therefore, each local message requires m_i log2 |C| + log2 |A| bits. On the other hand, if f is encoded globally with a single expansion of order m, that expansion must cover the whole input domain of f. The idea of this section is that in many practical dynamical systems, m needs to be much larger than the local orders m_i to approximate f well on its whole domain, which may result in L-ICDS < ICDS (see Fig. 2).
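The local versus global bit counts can be compared with a small sketch, assuming the per-message cost described above (coefficients plus a local initial point); the coefficient-set size, alphabet size, and expansion orders are hypothetical:

```python
import math

def local_message_bits(order_m: int, coeff_set_size: int, alphabet_size: int) -> int:
    """One local tuple: order_m coefficients from the set C plus a local
    initial point from the alphabet A."""
    return order_m * math.ceil(math.log2(coeff_set_size)) + math.ceil(math.log2(alphabet_size))

def l_icds_bits(local_orders, coeff_set_size: int, alphabet_size: int) -> int:
    """Total code length of the concatenated local messages."""
    return sum(local_message_bits(m, coeff_set_size, alphabet_size) for m in local_orders)

# Hypothetical: four partitions with low local orders vs. one global order-20
# model, with 16-bit coefficients and 16-bit states.
local_total = l_icds_bits([1, 2, 1, 3], 2**16, 2**16)   # 176 bits
global_total = local_message_bits(20, 2**16, 2**16)     # 336 bits
assert local_total < global_total
```

When the trajectory visits only simple regions of f, many cheap low-order messages beat one expensive high-order one.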

2.1 Practical Algorithm

In this section, we present a practical algorithm implementing the above encoding idea (the exposition of this subsection follows (solowjowCDCsubm_arxiv)). The Taylor expansion is used as the local approximation to the dynamics function, as in Eq. 2. L-ICDS does not differentiate between a known model f and a learned one f̂; it treats both as the function to be locally approximated. In this section, we simply write f to refer to either one. The difference will, however, matter for model selection in Sec. 4.

Local time horizon — Local approximation relies on partitioning the input space of the dynamics function f. Because f governs a dynamical system, partitioning the state space amounts to partitioning the time axis. This means we divide the global time horizon T into K local time horizons of length T/K, where K is a hyper-parameter. The detailed cost function is then written as

(3)  J_i(m_i) = Σ_{t ∈ window i} ‖x(t) − x̂_i(t)‖² + λ m_i,

for each local time horizon delimited by t_i and t_{i+1}, and for all K partitions; here m_i is the complexity (expansion order) of the local model, and λ is a relative weight. Finding the optimal local complexity is implemented by Alg. 2, which is used as a module of L-ICDS in Alg. 1. The total cost function is then J(K) = Σ_{i=1}^{K} J_i. The optimal number of partitions is found by

(4)  K* = argmin_{1 ≤ K ≤ K_max} J(K),

where K_max is the maximum allowed number of partitions. The concise message of this section is that the minimum of J usually occurs for K > 1, which implies that the proposed method gives a better encoding than global encoding, where K = 1. Notice that K_max and the maximum expansion order are hyper-parameters, chosen according to our prior idea of the complexity of the dynamics function (larger values for more complicated functions). We observed that reasonably high values of these hyper-parameters worked well for the variety of systems and benchmarks considered in this paper and in the supplementary document.
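The nested search of Algs. 1 and 2 can be sketched as follows, assuming the local fit errors for each candidate order have already been computed; the error table in the example is hypothetical:

```python
def local_model_selection(errors_by_order, lam):
    """LMS sketch (Alg. 2): pick the local order m minimizing error + lam * m,
    where errors_by_order[m] is the prediction error of an order-m local fit."""
    costs = {m: err + lam * m for m, err in errors_by_order.items()}
    m_star = min(costs, key=costs.get)
    return m_star, costs[m_star]

def l_icds(errors, lam, max_partitions):
    """L-ICDS sketch (Alg. 1): pick the number of partitions K minimizing the
    sum of optimal local costs; errors[K][i] maps order -> error in window i."""
    best_K, best_total = None, float("inf")
    for K in range(1, max_partitions + 1):
        total = sum(local_model_selection(errors[K][i], lam)[1] for i in range(K))
        if total < best_total:
            best_K, best_total = K, total
    return best_K, best_total

# Hypothetical error table: splitting into two windows beats a single window.
errors = {1: [{1: 10, 2: 1}], 2: [{1: 0}, {1: 0}]}
assert l_icds(errors, lam=1, max_partitions=2) == (2, 2)
```

The inner routine trades error against order within one window; the outer loop trades the number of windows against their summed costs.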

How to choose λ? — The hyper-parameter λ acts as a balancing weight between the complexity of the local Taylor approximation and the error in the prediction of the states. It can also be interpreted from an information-theoretic perspective: assume the values of the coefficients of the Taylor expansion come from a Gaussian distribution with variance σ². In the optimal coding scheme, the number of bits required to encode the coefficients equals the Shannon entropy of the normal distribution, ½ log2(2πeσ²). Thus, λ is log-proportional to the variance of the coefficients, which is caused by the fluctuations of the dynamics function. In the current version, we manually choose λ such that the two terms of Eq. 3 are of the same order.
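This interpretation is easy to check numerically: the entropy of a Gaussian in bits is ½ log2(2πeσ²), so doubling σ adds exactly one bit.

```python
import math

def gaussian_entropy_bits(sigma: float) -> float:
    """Differential entropy of N(0, sigma^2) in bits: 0.5 * log2(2*pi*e*sigma^2)."""
    return 0.5 * math.log2(2 * math.pi * math.e * sigma**2)

# Doubling sigma adds exactly one bit: the cost is log-proportional to the variance.
assert abs(gaussian_entropy_bits(2.0) - gaussian_entropy_bits(1.0) - 1.0) < 1e-12
```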

The proposed method is summarized in Alg. 1 and the schematic of Fig. 1(a). The general message is that a sequence of local approximations over the sequence of partitions leads to a better encoding, i.e., L-ICDS < ICDS (see Sec. 4.1 of the supplementary document for an illustrative example).

2.2 Theoretical Results

In this section, we prove that the error introduced by local approximations can be controlled. We distinguish between two objects, the states x and the dynamics f, both as functions of time t. We rely heavily on the identity in Eq. 1, which adds a lot of regularity to this problem. Therefore, we can derive statements of the type: if f and f̂ are close in some sense, then the state trajectories x and x̂ are close as well. Even better, the opposite is also true – close states imply close dynamics functions. This guarantees sufficiently accurate state prediction while reducing model complexity. Furthermore, we will elaborate on the opposite direction later in order to deploy L-ICDS as a model selection criterion.

First, we show that accurate local approximations imply precise state estimates. The proofs of this and all following theorems are given in the supplementary material.

Theorem 1.

Consider Eq. 1 with f Lipschitz-continuous with constant L on the horizon [0, T]. Furthermore, assume a Lipschitz-continuous approximation f̂ is used to obtain state approximations x̂. Then, for t ∈ [0, T],

(5)  sup_{0 ≤ t ≤ T} ‖x(t) − x̂(t)‖ ≤ (e^{LT} − 1)/L · sup_{0 ≤ t ≤ T} ‖f(x(t)) − f̂(x(t))‖.

In particular, this implies: if ‖f − f̂‖_∞ → 0, then ‖x − x̂‖_∞ → 0.
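Theorem 1's direction can be illustrated numerically with a toy pair of dynamics (the systems and step sizes below are illustrative, not from the paper's experiments):

```python
def euler(f, x0, dt, n):
    """Forward-Euler integration of dx/dt = f(x)."""
    xs = [x0]
    for _ in range(n):
        xs.append(xs[-1] + dt * f(xs[-1]))
    return xs

f_true = lambda x: -x          # true dynamics
f_hat = lambda x: -1.05 * x    # nearby approximation: small dynamics error
xs = euler(f_true, 1.0, 0.01, 200)
xhs = euler(f_hat, 1.0, 0.01, 200)
max_state_err = max(abs(a - b) for a, b in zip(xs, xhs))
# Close dynamics yield close trajectories, as the theorem predicts.
assert max_state_err < 0.05
```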

Next, we show the opposite direction: close state trajectories imply close dynamical systems.

Theorem 2.

Let the assumptions of Theorem 1 and suitable additional regularity assumptions hold. Then, we have

(6)

This implies: if ‖x − x̂‖ → 0, then ‖f − f̂‖ → 0.

3 Experiments: Encoding Dynamical Systems

Figure 2: Accuracy of the states for different numbers of local partitions (panels (a)–(c) and (e)–(g)), with the L-ICDS score in the last column (panels (d) and (h)). (Top row) the one-dimensional system. (Bottom row) the second state of the pendulum.

In this section, we illustrate how the encoding scheme of Sec. 2 behaves in practice. We elaborate on the algorithm with the aid of two examples; more descriptive examples are in the supplementary material. We consider: 1) a one-dimensional system; and 2) a pendulum with two states (angle and angular velocity), both driven by a standard white-noise process. We use the Euler–Maruyama method (schuss1988stochastic) to sample multiple trajectories from the systems and use those to learn the dynamics. In this example, we train a shallow NN as the model f̂ of the dynamics (details of the learning method are given in the supplementary material). The function f̂ is then the input to the L-ICDS algorithm, which computes a local approximation. Fig. 2 depicts the behavior of the proposed method and highlights the local approximations as a function of the number of partitions and the complexity order. In particular, the last column of Fig. 2 shows that L-ICDS prefers non-trivial solutions with K > 1, yielding the simplest model that still gives accurate states. Similar experiments for a more sophisticated system (a quadrotor) are presented in the supplementary material, with similar conclusions.

(a) NN = [1]; (b) NN = [5]; (c) NN = [10]; (d) NN = [10, 5]; (e) NN = [10, 10]; (f) NN = [15, 5].
Figure 3: Each subfigure shows the true function (solid) and the function learned by the neural network (dashed). The architecture of the neural network is given in each subcaption as [#neurons of layer 1, #neurons of layer 2, …], together with the L-ICDS score of that model.

4 Local Information Criterion for Model Selection

In this section, we extend our idea in order to use L-ICDS as a model selection criterion. From an abstract point of view, we can motivate our approach in terms of information compression and argue that simpler functions should be preferred when they explain the data equally well. In addition to empirical findings, we support our claim with theoretical results, which give insight into the applicability of the proposed method. The schematic in Fig. 1(b) summarizes the key ideas of this section.

Again, we consider the three objects f, f̂, and the local encoding, which are, respectively, the true dynamics function, the output of an arbitrary learning algorithm, and the local encoding of the latter. Assume we are given a collection of observed sequences, all generated by the underlying dynamical system (1). Based on this dataset, we can deploy several different learning algorithms to obtain approximations f̂₁, …, f̂ₙ. These approximations will most likely differ in quality, which raises the question of which of the learned functions should be selected.

Frequently used methods to learn dynamics functions are, for example, NNs and GPs (cf. ‘Related Work’). However, determining the depth of the NN and finding a suitable kernel function for the GP are non-trivial tasks. Ill-considered choices can lead to overfitting and poor performance and should hence be discarded as soon as possible. For example, an over-parameterized NN may overfit the training data and achieve zero training error while being far from the correct dynamics function f.

We propose a new way to compare learned functions, based on L-ICDS, which facilitates choosing among them. In particular, we claim that, for a certain class of functions, the function with the smallest L-ICDS score is closest to the true dynamics. We quantify this statement in the following theorem.
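Operationally, the selection rule is just an argmin over the candidates' scores; the model names and score values below are hypothetical:

```python
def select_model(scores):
    """Return the candidate whose L-ICDS score is smallest."""
    return min(scores, key=scores.get)

# Hypothetical scores for three models learned from the same dataset.
scores = {"NN=[1]": 320.0, "NN=[10,5]": 145.0, "GP": 180.0}
assert select_model(scores) == "NN=[10,5]"
```

Note that no test data enters this rule; the scores are computed from the learned models alone.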

Theorem 3 (Model Selection).

Let the assumptions of Theorems 1 and 2 hold, and

(7)

then

(8)

where the constants depend on certain properties of the dynamical systems.

Remark 1.

The proposed algorithm works well for certain types of systems, which we confirm with empirical results. However, the theorem does not guarantee that it works for all systems; clearly, when the constants are large, the above theorem is not very meaningful. The condition in Eq. 7 is an interesting starting point for investigating the class of functions for which the theorem yields a meaningful bound. Finally, since Theorem 1 can also be stated in other norms under slightly different assumptions (acosta2004optimal), it is also possible to derive a result similar to Theorem 3 purely in such a norm.

The proposed method quantifies the model accuracy along a trajectory, which depends on the initial point x(0). Since we are interested in results representative of the whole domain of the training data, we propose to randomize the initial point within that domain and average over the obtained scores to make them meaningful for the whole domain.
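The randomization over initial points can be sketched as follows, with a toy score function standing in for the trajectory-dependent L-ICDS score:

```python
import random

def averaged_score(score_from_x0, lo, hi, num_samples, rng):
    """Average a trajectory-dependent score over initial points drawn
    uniformly from the training domain [lo, hi]."""
    return sum(score_from_x0(rng.uniform(lo, hi)) for _ in range(num_samples)) / num_samples

rng = random.Random(0)
avg = averaged_score(lambda x0: x0**2, -1.0, 1.0, 1000, rng)  # toy score; mean ~ 1/3
assert 0.28 < avg < 0.39
```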

5 Experiments: Model Selection for Dynamical Systems

After discussing the capabilities of L-ICDS as a model selection criterion, we present empirical results to provide further evidence for our claims. First, consider a dynamical system as in Eq. 1 with additive white noise. This system is used to generate 10 noisy trajectories starting from randomly chosen initial points, from each of which 100 data points are sampled with the aid of the Euler–Maruyama method (schuss1988stochastic). This results in a training set with a total of 1000 samples. In Fig. 3, the learned functions are depicted together with the true function. Figure 3 clearly shows the connection between the L-ICDS score and the respective fit of the different NN architectures: the lowest score corresponds to the best fit.

Next, we consider dynamical systems of the form of Eq. 1 for some benchmark problems. Similar benchmark problems are considered, for example, in (doerr2017optimizing) and (kroll2014benchmark), which we used to shape the nonlinearities for the problems considered herein, summarized in Table 1. After generating noisy data from each dynamical system, we deploy several NNs of different depth and width, as well as a GP, to capture the behaviour of the system (details of the learning procedures are in the supplementary material). The results in terms of the L-ICDS score and the actual distance to the underlying function (in the norm sense) are shown in Table 1. We emphasize that the best learned function achieves the lowest score.

The proposed method does not aim at improving any of the model learning methods. Instead, we provide a structured way to post-process learned models and select the best among several candidates. However, the presented ideas might be incorporated into improving also the training process.

NN=[1] NN=[10] NN=[40] GP
NN=[1] NN=[2] NN=[5] GP
NN=[1] NN=[2] NN=[30] GP
NN=[1] NN=[2] NN=[5] GP
Table 1: Results for the model selection experiment. The proposed criterion L-ICDS is applied to the dynamics functions listed in the first column. Columns 2–4 show results for learned NNs (NN = [#hidden units of each layer]), and column 5 for the GP. Per model, the two numbers are the L-ICDS score (left) and the ground-truth distance, i.e., the norm between the learned and the true function (right). Best scores are highlighted in bold.

6 Discussion

In this paper, we proposed L-ICDS as a method to efficiently encode the information of a dynamical system, which is either known or learned from a sequence of observations. We built the encoding scheme on top of the minimum message length principle and derived a practical method to approximate the algorithmic complexity of dynamical systems by means of local approximations. In addition to efficient encoding, we showed through experiments and theorems that the proposed encoding criterion can likewise be used for model selection. By comparing L-ICDS scores of different learned models (e.g., NN and GP), the model closer to the underlying dynamics can be selected. For future work, we aim to apply L-ICDS to efficient communication in networked multi-agent systems. We also seek to characterize more precisely the class of dynamical systems for which L-ICDS is effective (cf. Remark 1) and to investigate extensions to stochastic systems.

Acknowledgments

This work was supported in part by the Max Planck Society, the Cyber Valley Initiative, and the German Research Foundation (DFG) grant TR 1433/1-1.

References

  • [1] Isaac Newton. Philosophiae naturalis principia mathematica, volume 1. G. Brookman, 1833.
  • [2] Assad Alam, Bart Besselink, Valerio Turri, Jonas Martensson, and Karl H Johansson. Heavy-duty vehicle platooning for sustainable freight transportation: A cooperative method to enhance safety and efficiency. IEEE Control Systems, 35(6):34–56, 2015.
  • [3] Michael Rubenstein, Alejandro Cornejo, and Radhika Nagpal. Programmable self-assembly in a thousand-robot swarm. Science, 345(6198):795–799, 2014.
  • [4] Eduardo F Camacho and Carlos Bordons Alba. Model predictive control. Springer, 2013.
  • [5] Vladimir Vapnik. The nature of statistical learning theory. Springer Science & Business Media, 2013.
  • [6] Ulrike von Luxburg, Olivier Bousquet, and Bernhard Schölkopf. A compression approach to support vector model selection. Journal of Machine Learning Research, 5(Apr):293–323, 2004.
  • [7] Lennart Ljung. System Identification: Theory for the User. Prentice Hall PTR, 1999.
  • [8] Oliver Nelles. Nonlinear system identification. Springer, 2013.
  • [9] Duy Nguyen-Tuong and Jan Peters. Model learning for robot control: a survey. Cognitive processing, 12(4):319–340, 2011.
  • [10] Herbert JAF Tulleken. Grey-box modelling and identification using physical knowledge and bayesian techniques. Automatica, 29(2):285–308, 1993.
  • [11] Jonas Sjöberg, Qinghua Zhang, Lennart Ljung, Albert Benveniste, Bernard Delyon, Pierre-Yves Glorennec, Håkan Hjalmarsson, and Anatoli Juditsky. Nonlinear black-box modeling in system identification: a unified overview. Automatica, 31(12):1691–1724, 1995.
  • [12] Kiyoshi Yamaoka, Terumichi Nakagawa, and Toyozo Uno. Application of akaike’s information criterion (aic) in the evaluation of linear pharmacokinetic equations. Journal of pharmacokinetics and biopharmaceutics, 6(2):165–175, 1978.
  • [13] David B Fogel. An information criterion for optimal neural network selection. IEEE Transactions on Neural Networks, 2(5):490–497, 1991.
  • [14] Arash Mehrjou, Reshad Hosseini, and Babak Nadjar Araabi. Improved bayesian information criterion for mixture model selection. Pattern Recognition Letters, 69:22–27, 2016.
  • [15] David Darmon. Information-theoretic model selection for optimal prediction of stochastic dynamical systems from data. Physical Review E, 97(3):032206, 2018.
  • [16] Niall M Mangan, J Nathan Kutz, Steven L Brunton, and Joshua L Proctor. Model selection for dynamical systems via sparse regression and information criteria. Proc. R. Soc. A, 473(2204):20170009, 2017.
  • [17] Wen-Xu Wang, Ying-Cheng Lai, and Celso Grebogi. Data based identification and prediction of nonlinear and complex dynamical systems. Physics Reports, 644:1–76, 2016.
  • [18] Kumpati S Narendra and Kannan Parthasarathy. Identification and control of dynamical systems using neural networks. IEEE Transactions on neural networks, 1(1):4–27, 1990.
  • [19] Tomohiro Takagi and Michio Sugeno. Fuzzy identification of systems and its applications to modeling and control. In Readings in Fuzzy Sets for Intelligent Systems, pages 387–403. 1993.
  • [20] Roger Frigola, Fredrik Lindsten, Thomas B Schön, and Carl Edward Rasmussen. Bayesian inference and learning in Gaussian process state-space models with particle MCMC. In Advances in Neural Information Processing Systems, pages 3156–3164, 2013.
  • [21] Andreas Doerr, Christian Daniel, Duy Nguyen-Tuong, Alonso Marco, Stefan Schaal, Marc Toussaint, and Sebastian Trimpe. Optimizing long-term predictions for model-based policy search. In Proceedings of Machine Learning Research, volume 78, pages 227–238, November 2017.
  • [22] Stefanos Eleftheriadis, Tom Nicholson, Marc Deisenroth, and James Hensman. Identification of Gaussian process state space models. In Advances in Neural Information Processing Systems, pages 5315–5325, 2017.
  • [23] Peter E Wellstead. Non-parametric methods of system identification. Automatica, 17(1):55–69, 1981.
  • [24] Christopher G Atkeson, Andrew W Moore, and Stefan Schaal. Locally weighted learning for control. In Lazy learning, pages 75–113. Springer, 1997.
  • [25] Oliver Nelles and Rolf Isermann. Basis function networks for interpolation of local linear models. In Proceedings of the 35th IEEE Conference on Decision and Control, volume 1, pages 470–475, 1996.
  • [26] Jo-Anne Ting, Franziska Meier, Sethu Vijayakumar, and Stefan Schaal. Locally Weighted Regression for Control, pages 1–14. Springer US, Boston, MA, 2016.
  • [27] S. Schaal and C. Atkeson. Learning control in robotics. IEEE Robotics Automation Magazine, 17(2):20–29, June 2010.
  • [28] Friedrich Solowjow, Arash Mehrjou, Bernhard Schölkopf, and Sebastian Trimpe. Minimum information exchange in distributed systems. arXiv preprint arXiv:1805.09714, 2018.
  • [29] Christopher S Wallace. Statistical and inductive inference by minimum message length. Springer Science & Business Media, 2005.
  • [30] Alan Mathison Turing. On computable numbers, with an application to the Entscheidungsproblem. Proceedings of the London Mathematical Society, 2(1):230–265, 1937.
  • [31] Chris S. Wallace and David L. Dowe. Minimum message length and Kolmogorov complexity. The Computer Journal, 42(4):270–283, 1999.
  • [32] Larry C Andrews. Special functions of mathematics for engineers. McGraw-Hill, New York, 1992.
  • [33] Zeev Schuss. Stochastic differential equations. Wiley Online Library, 1988.
  • [34] Gabriel Acosta and Ricardo G Durán. An optimal Poincaré inequality in L¹ for convex domains. Proceedings of the American Mathematical Society, pages 195–202, 2004.
  • [35] Andreas Kroll and Horst Schulte. Benchmark problems for nonlinear system identification and control using soft computing methods: need and overview. Applied Soft Computing, 25:496–513, 2014.
  • [36] Claude Elwood Shannon. A mathematical theory of communication. Bell system technical journal, 27(3):379–423, 1948.
  • [37] Andrei N Kolmogorov. Three approaches to the quantitative definition of information. Problems of Information Transmission, 1(1):1–7, 1965.
  • [38] Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. Tensorflow: A system for large-scale machine learning. In OSDI, volume 16, pages 265–283, 2016.
  • [39] Gabriel Hoffmann, Haomiao Huang, Steven Waslander, and Claire Tomlin. Quadrotor helicopter flight dynamics and control: Theory and experiment. In AIAA Guidance, Navigation and Control Conference and Exhibit, page 6461, 2007.

Appendices

Appendix A Algorithmic Complexity and Universal Turing Machine

The concepts of algorithmic complexity and the universal Turing machine were briefly introduced in the main paper and are presented in more detail here. Using the Alice–Bob scenario, we briefly review some necessary terms. A message is a sequence of symbols chosen from an alphabet. Each symbol is encoded by a binary subsequence called a word; pieced together, the words form the entire message. Shannon's information theory considers the message as a sequence of outcomes of a random process [36]. Assume the message draws its words from a set $\{w_1, \dots, w_N\}$, and let $p_i$ be the probability of word $w_i$ for all $i \in \{1, \dots, N\}$. The goal is to find a code that maps each word $w_i$ to a binary string of length $l_i$ such that the expected length of the message, defined as

$\bar{L} = \sum_{i=1}^{N} p_i l_i$, (9)

is minimized. It can be proved that the optimal code in this sense is obtained for $l_i = -\log_2 p_i$; the resulting expected length $\bar{L} = -\sum_i p_i \log_2 p_i$ is known as Shannon's entropy. We take the base of the logarithm to be 2 throughout this paper. The length $l_i$ of the code of a word can be taken as a measure of the information content of the word $w_i$. Nonetheless, the major limitation of Shannon's approach to information is its explicit dependence on a probabilistic source of the message. Algorithmic complexity (AC) is a different approach that removes this assumption and gives a more generic notion of information. To present the core idea of AC, some preliminary definitions are required, which we review briefly in the following.
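The optimal code lengths and the resulting entropy can be computed directly from Eq. (9). The following sketch (our own illustration, with a hypothetical function name) derives the empirical word probabilities of a message and the idealized lengths $l_i = -\log_2 p_i$:

```python
import math
from collections import Counter

def optimal_code_lengths(message):
    # Empirical word probabilities p_i and the idealized optimal code
    # lengths l_i = -log2(p_i); the expected length sum_i p_i * l_i
    # is then exactly the Shannon entropy of the empirical distribution.
    counts = Counter(message)
    n = len(message)
    probs = {w: c / n for w, c in counts.items()}
    lengths = {w: -math.log2(p) for w, p in probs.items()}
    entropy = sum(probs[w] * lengths[w] for w in probs)
    return lengths, entropy

lengths, H = optimal_code_lengths("aabbbbcc")
print(lengths["b"], H)  # b has p = 1/2, so l = 1 bit; H = 1.5 bits
```

Note that the $l_i$ are idealized (non-integer in general); a practical prefix code such as Huffman coding attains the entropy only up to one bit per word.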

Universal Turing Machine— A Turing machine (TM) is a machine with

  1. A clock that synchronizes all activities of the machine.

  2. A finite set of internal states. The machine may change its state at each clock tick.

  3. A binary work tape which can be moved to the right or left and be updated by the machine.

  4. A one-way binary input tape which forms the input to the machine. The input tape cannot be moved backward.

  5. A one-way binary output tape that carries the machine’s output.

  6. An instruction list that determines the action of the machine at each clock tick depending on the current value of the input tape, work tape and the internal state of the machine. The action may include moving the input tape, updating and moving the work tape, updating and moving the output tape, or moving to a new internal state.

Given a binary string $s$ representing some data or information, the amount of information, a.k.a. the algorithmic complexity (AC), of $s$ given a particular Turing machine (TM) is the length of the shortest input tape that causes the TM to output $s$, i.e., $\mathrm{AC}(s \mid \mathrm{TM}) = \min\{|p| : \mathrm{TM}(p) = s\}$. It is obvious from this definition that the information content of a message depends on the chosen TM. The concept of the universal Turing machine (UTM) comes to our assistance here. Apart from its detailed definition, which can be looked up in [37], a UTM has the interesting property of being programmable: the input tape may consist of two concatenated parts $p = p_1 p_2$ such that $p_1$ pushes the initial Turing machine TM0 into a state from which, from that state on, the UTM behaves as another Turing machine TM1. The second part of the input tape is then decoded by TM1 rather than TM0. This capacity of the UTM enables us to achieve a universal measure of complexity or information content. In the next section, we discuss how the information content of dynamical systems can be described in the framework of a UTM.
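AC itself is uncomputable, but any lossless compressor yields a computable upper bound on a string's description length, namely the compressed size. The following sketch (our own illustration, not part of the paper's method) shows that a structured string admits a much shorter description than a pseudo-random one:

```python
import random
import zlib

# The compressed size is a computable upper bound on description length.
random.seed(0)
regular = b"01" * 500                                          # highly structured
irregular = bytes(random.getrandbits(8) for _ in range(1000))  # pseudo-random

len_regular = len(zlib.compress(regular, 9))
len_irregular = len(zlib.compress(irregular, 9))
print(len_regular < len_irregular)  # True: structure permits a shorter code
```

This mirrors the intuition of the main text: regularities imposed by a dynamical system are exactly what a good code exploits.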

Appendix B Proofs of Theoretical Results

We provide here the proofs to our theoretical results:

Theorem 4.

Consider Eq. (1) with $f$ Lipschitz-continuous. Furthermore, assume a Lipschitz-continuous approximation $\hat{f}$ is used to obtain state approximations $\hat{x}$. Then,

(10)

In particular, this implies: if $\hat{f} \to f$, then $\hat{x} \to x$.

Proof of Theorem 1.

We start the proof by showing that there exists a well-defined solution to the considered ODE, which is due to the Picard–Lindelöf theorem.

Next, we show how to bound a function against its derivative, as is frequently done via Poincaré inequalities. Depending on the given assumptions, these results all look slightly different; here we state the variant we use and prove it.

We start with the fundamental theorem of calculus and obtain

(11)

Hence, we obtain for the absolute value

(12)

Now we insert a multiplicative one and apply the Cauchy–Schwarz inequality

(13)

Since everything is nonnegative, we obtain

(14)

Taking the square and integrating does not change the inequality, since the right-hand side no longer depends on the integration variable. This yields the final result

(15)

Now we substitute and obtain

(16)
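For reference, the chain of estimates above takes the following generic form (with symbols of our own choosing, since the original quantities are defined in the main text): for a function $g$ on $[0, T]$ with $g(0) = 0$,

```latex
g(t) = \int_0^t g'(s)\,\mathrm{d}s
\;\Longrightarrow\;
|g(t)| \le \int_0^T 1 \cdot |g'(s)|\,\mathrm{d}s
\le \sqrt{T} \left( \int_0^T g'(s)^2\,\mathrm{d}s \right)^{1/2},
```

and squaring and integrating over $[0, T]$ (the right-hand side no longer depends on $t$) gives $\int_0^T g(t)^2\,\mathrm{d}t \le T^2 \int_0^T g'(t)^2\,\mathrm{d}t$, a Poincaré-type bound of the $L^2$ norm of a function by that of its derivative.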

Lemma 1.

Assume and on the domain . We can show

(17)
Proof.

Since we consider a bounded domain and the derivative is bounded, we conclude that the function is bounded as well. We prove the statement by considering the worst-case scenario, which here is a triangle. What can essentially happen is that the support of the function shrinks while its maximum remains constant. However, by bounding the derivative, we keep control over the growth of the area beneath the function. Therefore, the extreme case is a triangle with the maximal slope and the peak as its apex. This yields

(18)

and

(19)

Hence, we obtain our claim. ∎

Lemma 2.

Assume the function is monotonically increasing in . Then the variation of is given by

(20)
Proof.

The proof is straightforward and follows immediately from a monotonicity and telescoping-sum argument. ∎
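In generic form (symbols our own, as the lemma's quantities are defined in the main text), the argument reads: for monotonically increasing $g$ and any grid $a = t_0 < t_1 < \dots < t_n = b$,

```latex
\sum_{i=0}^{n-1} \bigl| g(t_{i+1}) - g(t_i) \bigr|
  = \sum_{i=0}^{n-1} \bigl( g(t_{i+1}) - g(t_i) \bigr)
  = g(b) - g(a),
```

so the sum telescopes, is independent of the grid, and the variation equals $g(b) - g(a)$.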

Theorem 5.

Let the assumptions of Theorem 4 hold.
Additionally, assume and . Then, we have

(21)

This implies: if then .

Proof of Theorem 2.

We use bounded variation type arguments here. In particular, we start again with

(22)

It is a well-known fact in analysis that this quantity can be used to compute the total variation of a smooth function. We will use an equivalent approach to quantify the total variation and use it to bound the derivative by the states. We use

(23)

where we take the supremum over all possible grids, which are not necessarily equidistant. The number of grid points can in general go to infinity. This is possible even for functions of bounded variation, as long as the function values decay fast enough. The assumption ensures a finite number of oscillations on a bounded domain, which, combined with Lemma 2, yields a finite total variation. Hence, there exists an optimal grid with a finite number of points and we can use the bound

(24)

We can split this apart with the triangle inequality and enlarge the quantity by dropping the grid restriction. Hence, we obtain

(25)

With the aid of Lemma 1 and the same argument as at the end of the proof of Theorem 1, we conclude the proof. ∎
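As a numerical sanity check of the grid-supremum definition of total variation used above (our own sketch, with an assumed test function): for smooth $f$, the grid sums converge to $\int |f'|$, and on monotone pieces they are exact by the telescoping argument of Lemma 2.

```python
import numpy as np

def variation_on_grid(f, a, b, n):
    # Sum of absolute increments of f over an equidistant grid with n
    # intervals; the total variation is the supremum of this quantity
    # over all grids (equidistant or not).
    t = np.linspace(a, b, n + 1)
    return np.sum(np.abs(np.diff(f(t))))

# For smooth f the grid sums approach the integral of |f'|:
# sin on [0, 2*pi] has total variation 4.
for n in (10, 100, 10_000):
    print(n, variation_on_grid(np.sin, 0.0, 2.0 * np.pi, n))
```

The only loss on a finite grid occurs near the extrema of $f$, which is why refining the grid increases the sum toward the supremum.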

Theorem 6 (Model Selection).

If the previous assumptions hold, and

(26)

then

(27)

where the constants depend on certain properties of the dynamical systems.

Proof of Theorem 3.

We consider three objects in this proof: the true dynamics function, an approximation of it, most likely obtained from a learning algorithm, and a local approximation of the latter, obtained through a local expansion, e.g., a Taylor expansion.

We start by inserting an intermediate term and obtain, with the triangle inequality,

(28)

Now, we use Theorem 2 and obtain

(29)

The assumption expands to

(30)

Hence, for we obtain

(31)

Now we apply the Cauchy–Schwarz inequality to transform the $L^1$ norm into the $L^2$ norm and obtain

(32)

With the aid of Theorem 1 it follows that

(33)

With the aid of the triangle inequality we can again show

(34)

Appendix C Learning dynamical systems

A more detailed description of the method we used to learn the dynamics function from observational data is presented here. We use a simple black-box approach to learn the dynamical system from a set of trajectories.

Learning by neural network— Assume we are given a collection of sequences of observations. Each sequence covers a trajectory in the state space starting from some initial point. We use each sequence as a mini-batch of observations and train the network using the following simple relationship between its input/output pairs:

(35)
(36)
(37)

The reason for using a collection instead of a single sequence is clear: a single sequence starting from one initial point is unlikely to be representative enough for the learned function to be a good approximation of the true dynamics over the whole domain. Once the function is learned, we can use automatic differentiation to compute its derivative with respect to the input [38].

Learning by Gaussian process— The discretization of the nonlinear dynamics function is done just as above. We used a vanilla GP without any sparse approximations and a squared-exponential kernel function.
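A minimal sketch of the GP variant follows. It is our own illustration under assumptions: a toy system $\dot{x} = -x^3$, forward-Euler data generation (so the finite-difference targets equal $f(x_k)$ exactly), and hypothetical hyperparameters (lengthscale, signal variance, noise jitter), not the paper's actual configuration.

```python
import numpy as np

def sq_exp_kernel(A, B, ell=0.5, sf=1.0):
    # Squared-exponential (RBF) kernel matrix between row-wise inputs A, B.
    d2 = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2.0 * A @ B.T
    return sf**2 * np.exp(-0.5 * d2 / ell**2)

def fit_gp_dynamics(X, Y, noise=1e-4):
    # Vanilla (exact) GP regression; returns the predictive-mean function.
    K = sq_exp_kernel(X, X) + noise * np.eye(len(X))
    alpha = np.linalg.solve(K, Y)
    return lambda Xq: sq_exp_kernel(Xq, X) @ alpha

# Simulate the toy system and build one-step finite-difference targets
# (x_{k+1} - x_k) / dt as regression targets for the dynamics function.
dt, n = 0.01, 200
x = np.empty(n)
x[0] = 1.0
for k in range(n - 1):
    x[k + 1] = x[k] + dt * (-x[k] ** 3)
X = x[:-1, None]
Y = (x[1:] - x[:-1]) / dt
f_hat = fit_gp_dynamics(X, Y)
print(f_hat(np.array([[0.8]]))[0])  # close to the true value -0.8**3 = -0.512
```

The predictive mean recovers the dynamics well inside the visited region of the state space; outside it, the mean reverts to the prior, which is exactly the situation illustrated in Fig. 4 of Appendix D.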

Appendix D More experiments

Some parts of the experiments section are deferred here from the main text, including more sophisticated experiments with higher-dimensional and physical dynamical systems.

D.1 Enlarged version of the illustrative example

(a) State trajectory starting from
(b) True dynamics vs learned dynamics
Figure 4: (a) shows that the negative region of the state space is not visited when the initial state is positive. (b) shows the learned dynamics function when only some trajectories of the state dynamics are available.

As an illustrative example, let us assume the dynamical system depicted in Fig. 4(a). This dynamical system is stable, and the evolution of its state is depicted in Fig. 4(a). If the initial point resides in the positive region of the state space, the state never leaves the non-negative side of the state space. Hence, encoding the dynamics on the negative domain is not necessary, and we can safely encode it on the positive domain only. This can be formalized in terms of algorithmic complexity as

(38)

meaning that knowing the sign of the initial state allows us to design a better code for the dynamics.

D.2 Illustrative example

Here are more figures related to Sec. 3 (Fig. 2) of the main text. The assumed dynamical system is the same as in that example. The state evolution for different numbers of partitions is depicted in Fig. 5. Fig. 5(f) shows that the L-ICDS score finds a non-trivial local encoding of the space.

D.3 Local approximations to the learned function

In this section, the same system is used to generate samples based on the method explained for the experiment in the main text. The dynamics function is then learned and depicted in Fig. 6(a). Once the function is learned, multiple local Taylor approximations are computed and shown in Fig. 6(b–e), corresponding to different numbers of partitions. Fig. 6(f) shows that the score is optimal for a non-trivial case.

(a)–(e) State evolutions for different numbers of partitions
(f) L-ICDS score
Figure 5: The effect of learning the dynamics and then approximating the learned function locally on the state evolution of the dynamical system. The optimal trade-off is found by the L-ICDS score in (f), and the corresponding state evolution is shown in (b). The number of basis functions employed in each local region is written over that region.
(a) True and learned dynamics
(b)
(c)