Log In Sign Up

Neural Physicist: Learning Physical Dynamics from Image Sequences

by   Baocheng Zhu, et al.
Ant Financial
Alibaba Group

We present a novel architecture named Neural Physicist (NeurPhy) to learn physical dynamics directly from image sequences using deep neural networks. For any physical system, given the global system parameters, the time evolution of states is governed by the underlying physical laws. How to learn meaningful system representations in an end-to-end way and estimate accurate state transition dynamics facilitating long-term prediction have been long-standing challenges. In this paper, by leveraging recent progresses in representation learning and state space models (SSMs), we propose NeurPhy, which uses variational auto-encoder (VAE) to extract underlying Markovian dynamic state at each time step, neural process (NP) to extract the global system parameters, and a non-linear non-recurrent stochastic state space model to learn the physical dynamic transition. We apply NeurPhy to two physical experimental environments, i.e., damped pendulum and planetary orbits motion, and achieve promising results. Our model can not only extract the physically meaningful state representations, but also learn the state transition dynamics enabling long-term predictions for unseen image sequences. Furthermore, from the manifold dimension of the latent state space, we can easily identify the degree of freedom (DoF) of the underlying physical systems.


page 6

page 8


Active Learning in Gaussian Process State Space Model

We investigate active learning in Gaussian Process state-space models (G...

Neural Implicit Representations for Physical Parameter Inference from a Single Video

Neural networks have recently been used to analyze diverse physical syst...

Learning A Physical Long-term Predictor

Evolution has resulted in highly developed abilities in many natural int...

Unsupervised Separation of Dynamics from Pixels

We present an approach to learn the dynamics of multiple objects from im...

LagNetViP: A Lagrangian Neural Network for Video Prediction

The dominant paradigms for video prediction rely on opaque transition mo...

Learning to Simulate Human Movement

Modeling how human moves on the space is useful for policy-making in tra...

Physics-Informed Neural State Space Models via Learning and Evolution

Recent works exploring deep learning application to dynamical systems mo...

1 Introduction

Discovering physical laws by doing experiments has always been only human expertise. Through quantitative measurements, using inductive reasoning, researchers propose hypotheses to explain the observed data and to predict future. In recent years, deep learning (DL)


has shown its extraordinary power in information extraction and pattern recognition, fuelled the major breakthroughs in various areas such as image recognition


, natural language processing


and reinforcement learning

silver_general_2018 . It would be of great help to the basic science if we can use DL to extract the underlying physical laws directly from experimental data or observations. However, applying DL to facilitate physical law discoveries is still rarely explored de_deep_2018 ; iten_discovering_2018 ; breen_newton_2019 .

In this paper, we try to make one step towards this goal. In our setting, for a physical system, we only have the image sequences of object movement, each with a different global parameter setting. Our goal is to build a deep learning model to infer the underlying physical state transition dynamics which can enable long-term movement predictions, especially for unseen global parameter settings.

To solve the problem above, there are two key tasks to be done: state identification and state transition learning. Taking advantage of recent progresses in representation learning, we propose a novel neural architecture named Neural Physicist (NeurPhy), which uses variational auto-encoder (VAE) kingma_auto-encoding_2013 to extract underlying dynamic state at each time step, and neural process (NP) garnelo_neural_2018 ; eslami_neural_2018 to extract the global system representations. For state transition learning, NeurPhy uses a stochastic state space model (SSM) krishnan_deep_2015 ; karl_deep_2016 ; doerr_probabilistic_2018 ; ha_world_2018 ; hafner_learning_2019 to learn the physical dynamic process, which can naturally incorporate uncertainty estimation. To the best of our knowledge, this is the first work that learns physical dynamics in an end-to-end way from raw image sequences by performing system identification and state extraction. The main contributions of this paper are: 1) We split state learning into two parts: global and local (dynamic) representations, and use NP to extract the global representations of the image sequences. The learned representations are physically meaningful and match the underlying system parameters. 2) We use a stochastic SSM to learn systems’ physical transition dynamics together with learned global state representations. The dynamic states extracted at each time step are Markovian and match the ground-truth states. Specifically, our proposed architecture does not require a recurrent structure to infer the dynamic states, and the state transition dynamics are not limited to linear models used in many previous works karl_deep_2016 ; rangapuram_deep_2018 ; fraccaro_disentangled_2017 . It is thus more computationally efficient and has better applicability. 3) The NeurPhy can extrapolate to image sequences whose global system parameters are not seen in the training phase. Regarding each image sequence as a task, this means our model naturally has the ability for meta-learning andrychowicz_learning_2016 ; chen_learning_2017 ; wang_learning_2016 .

2 The Model

Figure 1:

(a) Model architecture of NeurPhy. Circles are random variables and filled circles represent variables observed during training. Solid lines denote the generative process; Dashed lines denote the inference model; Dotted lines denote multi-step transition dynamics. (b) Model architecture for extracting global representation from context samples. A context sample consists of two consecutive image frames encoded through CNN, then all samples are aggregated to a global representation


For any physical system, though governed by the same physical laws, different system parameters usually result in different observed behaviours. Taking dampened pendulum for example, if we change its length or mass, the motion period and decay time will change too. In our setting, for each physical environment, given one set of system parameters, we have a time-discrete sequence of length denoted by . Note that each frame is a high-dimensional raw image and each sequence defines a task. Then, for different system parameters setting, we have different tasks denoted by . Given an arbitrary image sequence , our problem is to infer the underlying system parameters and state transition dynamics, and to make long-term predictions of the system’s evolution. Note that, we are dealing with a meta-learning problem here, i.e., we have to make correct predictions with global parameters not seen in the training phase.

We tackle the problem as follows, shown graphically in Figure 1(a). For any image sequence, we first split the whole sequence into two parts: contexts and targets. The context samples are used to extract the global representation of the sequence (see next subsection for details). The intrinsic Markovian state is inferred from historical images using a recognition network, then transits to step-by-step, aided with global information , using a stochastic state space model. Note that is the steps dynamic state transited and is called the overshooting length hafner_learning_2019 ; gregor_shaping_2019 . Finally, an observation model is used to generate image from the dynamic state .

Note that our architecture is quite general. If the input sequences are low-dimensional variables instead of images, all convolutional layers in NeurPhy can be replaced by multilayer perceptrons with all other parts intact, which makes the learning tasks simpler. We show the corresponding experimental results later in supplementary materials.

2.1 Global Representation

In order to infer the global representation of the image sequence, we leverage the recent progresses in NP garnelo_neural_2018 ; eslami_neural_2018 and conditional-VAE sohn_learning_2015 . Detailed structure is shown in Figure 1(b).

We randomly select context samples from the whole image sequence, and each context sample contains two images from consecutive time steps and

. We use a convolutional neural network (CNN) to extract the information

from each context sample, and aggregate all of them into a global representation . This procedure resembles NP proposed by Garnelo et al. garnelo_neural_2018 . NP is a kind of stochastic process which uses neural networks to learn distributions over functions. A stochastic process needs to satisfy two conditions: exchangeability and consistency. Reflected in NP, the parameters of inference network should be invariant to permutations of observations and time , so the aggregation operator in the procedure should be invariant to the exchange of . Here, for simplicity, we use the mean operator. Please note that we use two consecutive frames of images as a context sample and aggregate all context samples into , will force the network to extract the global information of state transition across different time steps.

2.2 Stochastic State Space Model

As shown in Figure 1(a), our state space model differs from previous works krishnan_deep_2015 ; karl_deep_2016 ; fraccaro_disentangled_2017 ; deisenroth_pilco:_2011 ; eleftheriadis_identification_2017 ; frigola_variational_2014 ; doerr_probabilistic_2018

mainly in the following aspects. First, we only have access to image sequences, since individual image observations generally do not reveal the full state of the system, we consider a partially observable Markov decision process (POMDP). Here, we use a convolutional recognition network which stacks two consecutive image frames as inputs to extract the underlying Markovian dynamic states

, i.e., we assume . In the experimental section, we will show that our recognition model can extract the correct position and velocity. The use of only two consecutive frames but not the whole history also enables our model to be non-recurrent, which can significantly simplify the model structure and speed up inferencing. Second, our dynamic transition function takes the form , which uses the global information of the sequence extracted from context samples. Splitting the state information into global and local ones ( and ) is quite intuitive, as the global information can represent the time invariant features of the system while the local ones represent only the information which changes from time to time. Third, we use multi-layer neural networks to model the state transition function, which makes our transition model non-linear and more general. Moreover, the inferred dynamic states are stochastic, which means our transition model can naturally obtain uncertainty estimation with more robustness. Last but not least, we use the overshooting technique hafner_learning_2019 ; gregor_shaping_2019 to train the model with multi-step prediction loss in the latent state space. In the experimental section, we will see that using overshooting technique can greatly reduce the compounding error of multi-step predictions.

It is well known that a powerful generative model such as VAE usually ignores the conditioning information alemi_fixing_2018 ; he_lagging_2019 ; gregor_shaping_2019 , which makes learning latent Markovian states impossible. We tackle this problem mainly with two techniques: First we split the state information into global and local parts, which makes the learning of the Markovian dynamic states easier. Second, we employ overshooting technique, which enforces the consistency in the multi-step state transition process.

2.3 The Lower Bound Objective Function

Combining discussions of previous sections, our model consists of following components:


In order to overcome the intractability of posterior distributions of the latent state , we use standard variational inference technique kingma_auto-encoding_2013 , and derive the evidence lower bound on the data log-likelihood conditioned on context points:


Here, the subscript denotes the overshooting length. For more detailed derivation, see supplementary materials.

The first term is the VAE reconstruction loss term. For images, we calculate it using cross-entropy of each pixel. The recognition model infers the approximate state posterior from the past observations (In our setting, from past two consecutive observations ). Then the latent dynamic states is decoded to by the observation model, which we want to make it and ground-truth image as similar as possible. The second term is the dynamic state transition loss term. Here, the discrepancy between the transition distribution and the approximate posteriors distribution is calculated by Kullback–Leibler (KL) divergence. The expectation of KL term is taken on and , which can be written explicitly as and for overshooting length . The state transition process is as follows: First a recognition model is used to infer the approximate posterior state from the past observations . Then conducts multi-step transition to in the state space. Then is compared against the one extracted from the recognition model using . Note that all the comparison is done in the latent state space, but not the observation space, which significantly reduces the computation expenditure.

The previous derivation assumes a fixed overshooting length . To better model both short and long term dynamics, we can combine loss terms from different overshooting length up to some dynamic transition horizon , i.e., we combine , to arrive at our final objective function:


Here, we include weighting factors analogously to the higgins_beta-vae:_2017 , which can tune the relative strength between the competitive reconstruction and dynamic transition loss terms.

3 Experiments

We apply NeurPhy to two experimental physical systems, i.e., damped pendulum and planetary orbits motion, and verify whether our model can correctly discover the underlying global physical parameters and the associated dynamics. Detailed parameter settings of network structures can be found in supplementary materials.

In all experiments, each image sequence forms a task which is associated with one set of global system parameters. The image sequence contains a total of 101 time steps in , and each image consists of binary pixels. We take 90% of tasks as meta-training tasks, the other 10% as meta-test tasks. For each meta-training task, we use 20 context samples that are randomly selected from the image sequence to extract the global representation. We take 90% frames as training (target) samples, and the other 10% as test ones. For the meta-test tasks, we also use 20 context samples, but they come from the first 21 frames of the image sequence to avoid using any future information. The rest samples are for meta-testing. We set maximum overshooting length , batch size , and run epochs using Adam optimizer kingma_adam:_2014 with fixed learning rate of .

3.1 Damped Pendulum

Figure 2: (a) The damped pendulum system with initial angular velocity and gravity . (b) The motion of pendulum angle for and . (c) The motion of pendulum angle for and . (d) The state space of vs. for all sequences.

The first problem we consider is a pendulum swinging back and forth, subject to gravity and air friction with an initial angular velocity of , as shown in Figure 2(a). From the second law of motion, the pendulum’s angle follows equation:


where is the pendulum mass, is the gravitational constant, is the damping coefficient generated by air friction and is the pendulum length. Here denotes time derivative, denotes second derivative, and we rewrite angular velocity as .

If the angle is small enough, the equation of motion has an analytical solution, but in general, it can only be solved numerically. From the state at time step , we can get the state at by following dynamic transition equations:


Here, we assume time interval is small enough.

In our experiments, we fix the gravitational constant and damping coefficient to and set the initial angle and angular velocity to respectively. We vary the pendulum length and mass between the interval and respectively, and generate a total of different combinations of and . For each combination, we generate states of time-steps (with ) and render them into images. From Equation (5), we can see that the global parameters of the pendulum system is and , and we plot the typical evolution of for different and in Figure 2(b)(c). We assume that the different masses are caused by different pendulum densities and cannot be directly detected from single images. In this case, in order to discover the correct dynamics and to make good predictions, the model has to learn the global system parameter from the changes in the image sequences. The state space for all sequences and time steps of each sequence are shown in Figure 2(d). The state space curve of each sequence is a spiral caused by the damping. If we wait long enough, they will all sink into the rest point , and the central hole will be fully filled.

Figure 3: (a-b) The latent space of global representations with coloring according to the ground-truth global parameters and . (c-h) The learned dynamic state space of pendulum. The first row denotes the state space of all samples with coloring according to ground-truth parameters , and . The second row denotes the state space of two slices with and .

Figure 4: Two samples of image sequences of pendulum’s motions in meta-test set for 50-step predictions. The first row shows the ground-truth images, the second row shows the predicted ones.

In the experiments, we set both the dimension of global representations and dynamic states to 3. As shown in Figure 3(a)(b), The learned lies in a two-dimensional manifold. From the colormap plot, we see a one-to-one correspondence with the underlying ground-truth parameters and , which means NeurPhy has extracted the correct global representations. To show it more quantitatively, we fit the learned representations with global parameters and using a linear and a quadratic regression respectively. The fitted are shown in Table 1. We can see that the fitting is quite well up to quadratic terms (especially for parameter ), which means the representation we learned is indeed a simple mapping from the de-facto physical system variables. Note that learning a manifold of using one dimension to encode is non-trivial, as the pendulum’s mass can only be extracted from the time evolution of the image sequence.

In Figure 3(c)(d), we plot the learned latent Markovian dynamic state space together with the ground-truth state parameters , and (shown by colormap). Note that the state parameter is unchanged during the motion, but it is necessary for the image reconstruction, so the state space encodes the parameters using all three dimensions. More clearly, we plot the state space for two particular parameter setting with and in Figure 3(d)(f)(h), and we can see that each lies in a two-dimensional Mobius strip sub-manifold that encodes and respectively.

In Figure 4, started from the leftmost state images, we plot two sets of predicted images for the next 50 time steps on the meta-test set. The images in the first row are the ground-truth, while images in the second row are the predicted ones. We can see that the learned dynamic state transition model can make good long-term predictions even though the maximum overshooting length is only 5.

More quantitatively, we show the model performance for both training, test and meta-test samples in Table 2. Recalling that in Equation 3, we list the value of cross-entropy terms together with the Kullback–Leibler (KL) divergence terms for different overshooting length (KL1KL5 for ). Lower cross-entropy means better image reconstruction, and smaller KL term means better long-term prediction. For ablation study, we also experiment with maximum overshooting length to study whether larger overshooting length can really help learning dynamic transitions. As indicated in Table 2, while a larger maximum overshooting length can slightly hurt the reconstruction fidelity, it can greatly reduce the KL terms, facilitating better long-term predictions.

Experiment (Linear) (Quadratic) (Linear) (Quadratic)
Damped pendulum 0.999 1.000 0.795 0.949
(Linear) (Quadratic) (Linear) (Quadratic)
Planetary orbits motion 0.959 0.977 0.681 0.935
Table 1: Fitting scores for global parameters
Experiment Stage Cross Entropy KL1 KL2 KL3 KL4 KL5
Damped pendulum 1 Training 3.9 21.7 52.6 105.9 188.4 300.4
5 Training 20.9 6.4 10.2 16.1 24.5 35.8
1 Test 4.3 25.2 64.3 135.5 234.8 369.6
5 Test 21.5 7.3 12.0 19.3 29.4 41.7
1 Meta-test 4.3 25.2 64.3 135.5 234.8 369.6
5 Meta-test 21.2 7.8 12.0 19.5 29.6 43.0
Planetary orbits motion 1 Training 0.6 6.3 10.0 12.8 15.9 18.9
5 Training 2.3 2.6 3.8 4.8 5.8 6.9
1 Test 0.8 6.9 10.4 12.9 16.0 19.4
5 Test 2.4 2.6 4.0 4.9 5.9 7.0
1 Meta-test 0.6 47.2 104.7 156.2 208.3 265.8
5 Meta-test 2.7 18.4 37.9 54.5 70.3 86.6
Table 2: Model performance

3.2 Planetary Orbits Motion

Figure 5: Schematic diagram of the planetary orbits motion. The effects of different global parameters on orbits are shown in (a) , (b) and (c) . (d) The state space of vs. for all sequences.

Next, we apply NeurPhy to the prediction problem of planetary orbits motion. Planetary orbits generally form ellipses as shown in Figure 5. Specifically, the radius has following expression:


where denotes the perihelion distance, denotes the eccentricity and denotes the angle of the major axis. The effects of different parameters on planetary orbits are shown in Figure 5(a)-(c). We also plot the dynamic state space in Figure 5(d). More detailed derivation of the orbit dynamics can be found in supplementary materials.

Without loss of generality, we set initial angle , and vary initial radius and velocity to generate different motion sequences. Here, and denote the velocity along radius and angular direction respectively. A set of has one-to-one correspondence to a set of , which defines a unique orbit. In another word, given the initial state, the system parameters

characterizing the planetary orbits are also determined. At each moment, the dynamic state can be represented solely by two parameters

, and the corresponding velocity can be calculated by together with global parameters as shown in supplementary materials.

The time evolution of is governed by dynamic equations:


Note that is proportional to the angular momentum of the orbit ( and denote the gravitational constant and the mass of the center sun, respectively)(see supplementary materials for the derivation). From Equation 7, we can see that there are three global parameters defining the orbit motion, but with , we can first obtain from Equation 6 and then substitute it to the second formula to obtain . This means that in the dynamic evolution of planet orbits, is a pseudo-degree of freedom, and the dynamics can be defined with just two global parameters , which also reflects that there is one hidden symmetry in the orbit motion.

Figure 6: (a-b) The global representation manifold for planetary orbits with coloring according to and . (c-d) The state space of planetary orbits motion, with coloring according to and .

We generate image sequences by varying , and in the range , and respectively and for each sequence, we generate image frames. If our model can learn correct dynamics, it is expected that the dynamic latent parameters lie in a two dimensional manifold correlated to the ground-truth states , while the global representation of each sequence should also lie within a two dimensional manifold corresponding to the ground-truth system parameters .

In Figure 6(a)(b), we can clearly see that the learned global latent representation indeed lies in a two dimensional manifold. This is an amazing result, as we take a long analytical effect in previous paragraphs to identify that the independent number of ground-truth system parameters which affect the dynamic evolution is two (i.e. and ) rather than three (i.e. , and ). We again fit the latent representations with global parameters and using linear and quadratic regressions shown in Table 1. The fitted scores are very high which means a good match with the ground-truth global parameters. The learned dynamic state space in Figure 6(c)(d) also shows our model can infer the physically meaningful state representations. In Figure 7 we also show two sets of multi-step predictions (50 time-steps) on the meta-test set, with both ground-truth (the first row) and predicted (the second row) images. It can be seen that the dynamic model we learned can make correct long-term prediction of the orbit motion. We also show the model performance together with in Table 2, the lower KL term values for also indicates that the overshooting technique is of great help to long-term predictions.

Figure 7: Two samples of image sequences of orbit motions in meta-test set for 50-step predictions. The first row shows the ground-truth images, while the second row shows the predicted ones.

4 Related Work

Our work was inspired by the work of Garnelo et al. garnelo_neural_2018 applying NP to function regression and optimization. Kim et al. kim_attentive_2019 addressed the underfitting issue by incorporating attention mechanism. Eslami et al. eslami_neural_2018 and Kumar et al. kumar_consistent_2018

applied it to the scene representation and Singh et al.

NIPS2019_9214 extended it to sequential cases.

For the state transition modelling, Karl et al. karl_deep_2016 proposed Deep Variational Bayes Filters (DVBF), a linear Gaussian state space model (LGSSM) using linear Gaussian models to learn and to identify Markovian state space. Deisenroth et al. deisenroth_pilco:_2011 introduced PILCO, a data-efficient policy search method in model-based reinforcement learning, which uses GPs to model state transition dynamics explicitly incorporating model uncertainty into long-term planning, achieving unprecedented learning efficiency on high-dimensional control tasks. Watter et al. WatterSBR15 proposed Embed to Control (E2C) that learns a local linear dynamical transition model from raw pixel images which achieves promising results on a variety of complex control problems. Recently, Hafner et al. hafner_learning_2019 proposed Deep Planning Network (PlaNet), a model-based reinforcement learning method that learns the state dynamics from images and plans in latent space and achieves performance comparable to model-free algorithms in continuous control tasks.

For physical dynamics learning, Iten et al. iten_discovering_2018 used NP to discover physical concepts, but they did not learn the underlying Markovian state and dynamics, while the inputs to the model are low-dimensional states. Breen et al. breen_newton_2019 applied deep neural networks to solve the motion of chaotic three-body problem and achieved accurate solutions several orders faster than a state-of-the-art solver.

5 Conclusions

In this work, we propose a novel network architecture NeurPhy that can learn physical dynamics directly from image sequences. NeurPhy can not only correctly extract the global system parameters from the sequence and dynamic state from each frame, but also succeeds in predicting dynamic evolution in unseen cases. We apply our model for the characterization and prediction of damped pendulum and planetary orbits motion, and achieve promising results. Our architecture is quite general, which can be easily extended to model-based reinforcement learning, which we will explore in the future.

6 Supplementary Material

6.1 Derivation of the lower bound objective function

Now we derive the lower bound objective function of NeurPhy. For overshooting length , Conditioned on context points , the log-likelihood of data can be written as:


Here, we use the Markovian state space asumptiondoerr_probabilistic_2018 , i.e., the dynamic state contains all the information needed for image reconstruction. Also, we assume the approximate posterior does not depend on context explicitly given . From Jensen’s inequality, we can get:


which is eq.(2) in the main paper.

6.2 Derivation of the planetary orbits motion using the law of universal gravitation

Figure 8: Schematic diagram of the planetary orbits motion with r denoting the radius from planet to the central sun, denoting the angle between r and reference axis. and

denote unit vectors along radial and angular direction respectively.

6.2.1 The elliptical orbits

In this subsection, we derive the elliptical orbits motion obeyed by planets which circle around a central sun. We follow the derivation in reference roy_orbital_2004 ; cite_s1 . Assuming the mass of central sun is several orders larger than the mass of planet , then the position of the central sun is approximately still and we only need to consider the motion of the planet. Using polar coordinates , the position vector can be written as (see Figure 8):


Here, symbols in bold font denote vectors, symbols with hat denotes unit vectors along the corresponding coordinate direction. Applying time derivative to , we can get:


Here, we rewrite angular velocity as and have used identities in the derivation.

Now, From the second law of motion and the law of universal gravitation:


we have:


Here, denotes the gravitational constant and denotes the second derivative of time. Eq. (15) can be rewritten to a total differential form:


where is a constant, which is proportional to angular momentum . Eq. (16) is also equivalent to the Kepler’s second law of planetary motion.

Let , we have:


Together with eq.(14), we can get:


By solving above equation, we have following general solution:


So the planetary orbit is:


Here, denotes the angle of the major axis and is a coefficient.

The nearest radius called the perihelion distance can be calculated by setting , i.e.,


Define the eccentricity , we have:


Then we can rewrite using as:


This is the eq. (6) used in the main paper.

6.2.2 Derive system global parameters from initial condition

Now we derive the global parameters from the initial condition . Note that is a constant, we have:


From the law of energy conservation, we have:


Here, the left-hand side is the energy at initial moment, while the right-hand side represents the energy at perihelion . Together with and eq. (22), we have:


From the above equation, we can get the eccentricity , and from eq. (23), we have:


from which we can get . Here, we assume the initial angle .

So, from eq. (22)(24)(25)(27), we can see that the set has one-to-one correspondence with , as claimed in the main paper.

6.2.3 Derive the dynamic transition equations

From Eq.(16), we have:


Integrate above equation with time , we can derive the time evolution of . And from the relation of with (eq. (23)), we can also get the time evolution of .

The differential equation eq. (28) is a transcendental equation, which has no analytical solution. In our setting, we only have to derive the state transition equations from to . From eq. (28), we have:


and from eq. (23), we have:


These transition equations correspond to eq. (7) in the main paper.

6.3 Experimental results for low-dimensional inputs using Neurphy

Our model is quite general, if inputs are not raw images but low-dimensional variables (such as positions of pendulum end-point), our model can also be applied to the systems by changing all convolutional layers in NeurPhy to multilayer perceptrons. As we will show below, the learning tasks actually become easier.

6.3.1 Damped pendulum

Figure 9: The global representation manifold for damped pendulum with colormaps denoting underlying (a) and (b) . The hole marked by dashed circles represent samples in meta-test.

Figure 10: The dynamic state space for damped pendulum with colormaps denoting underlying (a) , (b) and (c) .

We use the position coordinates of pendulum end-points as inputs (i.e., the images of dimension in the main paper are replaced by positions of dimension ), following the same routine as in the main paper, we run experiments with various global parameters ( and ).

The learned global representations are shown in Figure 9. In Figure 9(a) and (b), colormaps denote the ground-truth and respectively. We can see a clear correspondence between the learned manifold and the true global parameters. Furthermore, there are some holes between data points in the figure (such as the places marked by dashed circles), corresponding to the meta-test samples. In Figure 10, we plot the dynamic state space of damped pendulum with colormaps denoting the ground-truth dynamic state . The learned state space lies in a 3-dimensional manifold, which shows good correspondences with underlying dynamic states.

Figure 11: The true and predicted x coordinates of pendulum end-points with different overshooting lengths d: (a) , (b) , (c) , (d) , (e) and (f) .

To show the performance of our model on low-dimensional inputs, we plot the x coordinates of pendulum end-points for training set in Figure 11, with x axes denoting the true values and y axes denoting predicted ones. We can see a perfect match between the true and the predicted pendulum’s position. When the overshooting length d is longer, the prediction get worse. In Table 3, we calculate the mean square errors (MSEs) of the true and the predicted pendulum’s position for training, test and meta-test (20 and 2 context samples) sets. represents the result with overshooting length , in the table, d changes from to . We can see that on both training and test sets, MSEs are quite small, which get a bit larger on the meta-test set.

MSE T+0 T+1 T+2 T+3 T+4 T+5
Training 0.00082 0.0033 0.010 0.025 0.062 0.14
Test 0.00095 0.0040 0.014 0.035 0.086 0.19
Meta-test (20 contexts) 0.00081 0.011 0.046 0.14 0.28 0.56
Meta-test (2 contexts) 0.00082 0.016 0.064 0.19 0.36 0.72
Table 3: MSE for damped pendulum with different overshooting lengths.

In Figure 12, we plot the x coordinates of one training sequence with different predicting length (overshooting length d). The solid lines are ground-truth coordinates of the sequence and points in symbols are predicted ones for both context, target and test samples. We can clearly see that our model fits well. In Figure 13 and 14, we apply our model to one meta-test sample whose underlying global parameters is unseen in the training set. In Figure 13, as the same as the training setting, we give the model 20 context samples, but in Figure 14, we only give the model 2 context samples to see if our meta-learning algorithm can deal with fewer context samples. We can see that our model can still make quite good prediction even only 2 context samples are given, this can also be seen in the last row of Table 3, the MSEs for different overshooting lengths is only slightly larger than those given 20 context samples.

Figure 12: The true and predicted x coordinates of damped pendulum for a training sequence. The solid lines denote true x coordinates, the blue stars are context points, the red circles are target points and the green triangles are test points. The x-axis denotes time step and we plot results for different overshooting lengths in (a) , (b) , (c) , (d) , (e) and (f) .

Figure 13: The true and predicted x coordinates of damped pendulum for a meta-test sequence with 20 context samples. The solid lines denote true x coordinates, the blue stars are context points, the red circles are test points for prediction. The x-axis denotes time step and we plot results for different overshooting lengths in (a) , (b) , (c) , (d) , (e) and (f) .

Figure 14: The true and predicted x coordinates of damped pendulum for a meta-test sequence with 2 context samples. The solid lines denote true x coordinates, the blue stars are context points, the red circles are test points for prediction. The x-axis denotes time step and we plot results for different overshooting lengths in (a) , (b) , (c) , (d) , (e) and (f) .

6.3.2 Planetary orbits motion

We also apply the low-dimensional inputs model for the prediction of planetary orbits motion. All the settings are the same as in the main paper except the inputs are changed to planet’s position coordinates . In Figure 15 and 16, we plot the learned global representation and dynamic state space. In Figure 17, we plot the true and predicted x coordinates of planet’s positions. In Table 4, we give the MSEs of predictions. In Figure 18, 19 and 20, we plot the performance for a training and meta-test (both 20 and 2 context samples) sequence. We can see that our model indeed can learn the underlying global and local representations together with the dynamic transition process. Also note that it is easier for our model to learn the planetary orbits motion than the damped pendulum.

Figure 15: The global representation manifold for planetary orbits with colormaps denoting (a) and (b) .

Figure 16: The dynamic state space for planetary orbits motion with colormaps denoting underlying (a) , (b) .

Figure 17: The true and predicted x coordinates of planet’s positions with different overshooting lengths d: (a) , (b) , (c) , (d) , (e) and (f) .
MSE T+0 T+1 T+2 T+3 T+4 T+5
Training 0.000067 0.00023 0.00053 0.00094 0.0016 0.0034
Test 0.000068 0.00023 0.00055 0.00097 0.0017 0.0034
Meta-test (20 contexts) 0.000068 0.0019 0.013 0.050 0.11 0.16
Meta-test (2 contexts) 0.000067 0.0041 0.030 0.20 0.11 0.30
Table 4: MSE for planetary orbits motion with different overshooting lengths.

Figure 18: The true and predicted x coordinates of planet’s positions for a training sequence. The solid lines denote true x coordinates, the blue stars are context points, the red circles are target points and the green triangles are test points. The x-axis denotes time step and we plot results for different overshooting lengths in (a) , (b) , (c) , (d) , (e) and (f) .

Figure 19: The true and predicted x coordinates of planet’s positions for a meta-test sequence with 20 context samples. The solid lines denote true x coordinates, the blue stars are context points, the red circles are test points for prediction. The x-axis denotes time step and we plot results for different overshooting lengths in (a) , (b) , (c) , (d) , (e) and (f) .

Figure 20: The true and predicted x coordinates of planet’s positions for a meta-test sequence with 2 context samples. The solid lines denote true x coordinates, the blue stars are context points, the red circles are test points for prediction. The x-axis denotes time step and we plot results for different overshooting lengths in (a) , (b) , (c) , (d) , (e) and (f) .

6.4 Network structure of NeurPhy

6.4.1 For images inputs

For Global representation model and Recognition model, we use convolutional neural networks, each with four layers with filter sizes of [32,64,64,128], kernel sizes of

, strides of 2 and ReLU activations. For

Observation model, we use deconvolutional neural networks with filter sizes of [64,64,32,1], kernel sizes of [5,5,6,6], strides of 2 and three layers of ReLU activations, and the activation of last layer is Sigmoid. For State space model, we just use a 4-layer forward neural network with cell units [128,128,64,16] and ReLU activations. The last layer is further connected to a dense layer with units size , where is the dimension of the latent dynamic state, which is set it to 3 in our experiments. First units generates the mean value of and the second units produces the standard derivation.

6.4.2 For low-dimensional variables inputs

The network architecture is almost the same as the case for the images, except that the convolutional layers are placed by multi-layers forward networks. For Global representation model, it is replaced by a 4-layer forward neural network with cell units of [128,128,64,16] and ReLU activations. For Recognition model, we only use a 2-layer forward neural network with cell units of [32,16] and ReLU activations. For damped pendulum system, we set for all , batch size and run 1000 epochs. For planetary orbits motion, we set for all , batch size and run 500 epochs.