1 Introduction
Discovering physical laws through experiment has long been the domain of human expertise. Through quantitative measurements and inductive reasoning, researchers propose hypotheses to explain observed data and to predict future observations. In recent years, deep learning (DL)
goodfellow_deep_2016 has shown extraordinary power in information extraction and pattern recognition, fuelling major breakthroughs in areas such as image recognition
krizhevsky_imagenet_2017 , natural language processing devlin_bert:_2019 and game playing silver_general_2018 . It would be of great help to basic science if we could use DL to extract the underlying physical laws directly from experimental data or observations. However, applying DL to facilitate the discovery of physical laws is still rarely explored de_deep_2018 ; iten_discovering_2018 ; breen_newton_2019 . In this paper, we take one step towards this goal. In our setting, for a physical system, we only have image sequences of object movement, each with a different global parameter setting. Our goal is to build a deep learning model that infers the underlying physical state transition dynamics and enables long-term movement predictions, especially for unseen global parameter settings.
To solve the above problem, two key tasks must be accomplished: state identification and state transition learning. Taking advantage of recent progress in representation learning, we propose a novel neural architecture named Neural Physicist (NeurPhy), which uses a variational autoencoder (VAE) kingma_autoencoding_2013 to extract the underlying dynamic state at each time step, and a neural process (NP) garnelo_neural_2018 ; eslami_neural_2018 to extract the global system representations. For state transition learning, NeurPhy uses a stochastic state space model (SSM) krishnan_deep_2015 ; karl_deep_2016 ; doerr_probabilistic_2018 ; ha_world_2018 ; hafner_learning_2019 to learn the physical dynamic process, which naturally incorporates uncertainty estimation. To the best of our knowledge, this is the first work that learns physical dynamics in an end-to-end way from raw image sequences by performing both system identification and state extraction. The main contributions of this paper are: 1) We split state learning into two parts, global and local (dynamic) representations, and use NP to extract the global representations of the image sequences. The learned representations are physically meaningful and match the underlying system parameters. 2) We use a stochastic SSM to learn the systems' physical transition dynamics together with the learned global state representations. The dynamic states extracted at each time step are Markovian and match the ground-truth states. Specifically, our proposed architecture does not require a recurrent structure to infer the dynamic states, and the state transition dynamics are not limited to the linear models used in many previous works karl_deep_2016 ; rangapuram_deep_2018 ; fraccaro_disentangled_2017 . It is thus more computationally efficient and more widely applicable. 3) NeurPhy can extrapolate to image sequences whose global system parameters are not seen in the training phase. Regarding each image sequence as a task, this means our model naturally has the ability for meta-learning andrychowicz_learning_2016 ; chen_learning_2017 ; wang_learning_2016 .
2 The Model
For any physical system, though governed by the same physical laws, different system parameters usually result in different observed behaviours. Taking the damped pendulum as an example, if we change its length or mass, the motion period and decay time change too. In our setting, for each physical environment, given one set of system parameters, we have a time-discrete sequence of image frames. Each frame is a high-dimensional raw image, and each sequence defines a task; different system parameter settings thus give different tasks. Given an arbitrary image sequence, our problem is to infer the underlying system parameters and state transition dynamics, and to make long-term predictions of the system's evolution. Note that we are dealing with a meta-learning problem here, i.e., we have to make correct predictions for global parameters not seen in the training phase.
We tackle the problem as follows, shown graphically in Figure 1(a). For any image sequence, we first split the whole sequence into two parts: contexts and targets. The context samples are used to extract the global representation of the sequence (see the next subsection for details). The intrinsic Markovian state is inferred from historical images using a recognition network, and then evolves step by step, aided by the global information, using a stochastic state space model. The number of steps d over which the dynamic state is transited is called the overshooting length hafner_learning_2019 ; gregor_shaping_2019 . Finally, an observation model is used to generate the image from the dynamic state.
Note that our architecture is quite general. If the input sequences are low-dimensional variables instead of images, all convolutional layers in NeurPhy can be replaced by multilayer perceptrons with all other parts intact, which makes the learning tasks simpler. We show the corresponding experimental results in the supplementary materials.
2.1 Global Representation
To infer the global representation of the image sequence, we leverage recent progress in NP garnelo_neural_2018 ; eslami_neural_2018 and conditional VAE sohn_learning_2015 . The detailed structure is shown in Figure 1(b).
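The permutation-invariant aggregation described in the remainder of this subsection can be sketched in a few lines. The toy dimensions and the single linear "encoder" below are illustrative stand-ins for the paper's CNN, not the actual architecture:

```python
import numpy as np

def encode_context(frame_pairs, weights):
    """Toy per-pair encoder: one linear layer standing in for the CNN."""
    # frame_pairs: (num_context, feature_dim) flattened consecutive-frame pairs
    return np.tanh(frame_pairs @ weights)

def aggregate(per_pair_codes):
    """Mean aggregation: invariant to permutations of the context samples."""
    return per_pair_codes.mean(axis=0)

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 3))          # hypothetical encoder weights
pairs = rng.normal(size=(20, 8))     # 20 context samples
r = aggregate(encode_context(pairs, W))

# Permuting the context samples leaves the global representation unchanged.
r_perm = aggregate(encode_context(pairs[::-1], W))
assert np.allclose(r, r_perm)
```

Any symmetric operator (sum, max) would also satisfy the exchangeability requirement; the mean is simply the choice made here.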
We randomly select context samples from the whole image sequence; each context sample contains two images from consecutive time steps t and t+1. We use a convolutional neural network (CNN) to extract the information from each context sample, and aggregate all of them into a global representation r. This procedure resembles the NP proposed by Garnelo et al. garnelo_neural_2018 . NP is a kind of stochastic process which uses neural networks to learn distributions over functions. A stochastic process needs to satisfy two conditions: exchangeability and consistency. Reflected in NP, the parameters of the inference network should be invariant to permutations of the observations, so the aggregation operator in the procedure should be invariant to the exchange of context samples. Here, for simplicity, we use the mean operator. Note that using two consecutive frames as a context sample and aggregating all context samples into r forces the network to extract global information about the state transition across different time steps.
2.2 Stochastic State Space Model
As shown in Figure 1(a), our state space model differs from previous works krishnan_deep_2015 ; karl_deep_2016 ; fraccaro_disentangled_2017 ; deisenroth_pilco:_2011 ; eleftheriadis_identification_2017 ; frigola_variational_2014 ; doerr_probabilistic_2018
mainly in the following aspects. First, we only have access to image sequences; since individual image observations generally do not reveal the full state of the system, we consider a partially observable Markov decision process (POMDP). We use a convolutional recognition network which stacks two consecutive image frames as inputs to extract the underlying Markovian dynamic state. In the experimental section, we show that this recognition model can extract the correct position and velocity. Using only two consecutive frames rather than the whole history also makes our model non-recurrent, which significantly simplifies the model structure and speeds up inference. Second, our dynamic transition function uses the global information of the sequence extracted from the context samples. Splitting the state information into global and local (dynamic) parts is quite intuitive, as the global information represents the time-invariant features of the system while the local information represents only what changes from step to step. Third, we use multilayer neural networks to model the state transition function, which makes our transition model nonlinear and more general. Moreover, the inferred dynamic states are stochastic, so our transition model naturally provides uncertainty estimates, making it more robust. Last but not least, we use the overshooting technique hafner_learning_2019 ; gregor_shaping_2019 to train the model with a multi-step prediction loss in the latent state space. In the experimental section, we will see that the overshooting technique greatly reduces the compounding error of multi-step predictions.
It is well known that a powerful generative model such as a VAE usually ignores the conditioning information alemi_fixing_2018 ; he_lagging_2019 ; gregor_shaping_2019 , which makes learning latent Markovian states impossible. We tackle this problem mainly with two techniques. First, we split the state information into global and local parts, which makes learning the Markovian dynamic states easier. Second, we employ the overshooting technique, which enforces consistency in the multi-step state transition process.
2.3 The Lower Bound Objective Function
Combining the discussions of the previous sections, our model consists of the following components:
Global representation model: $q(r \mid C)$; \quad Recognition model: $q(z_t \mid o_{t-1}, o_t)$; \quad Transition model: $p(z_{t+1} \mid z_t, r)$; \quad Observation model: $p(o_t \mid z_t)$ \qquad (1)
To overcome the intractability of the posterior distribution over the latent states, we use the standard variational inference technique kingma_autoencoding_2013 and derive an evidence lower bound on the data log-likelihood conditioned on the context points:
\log p(o_t \mid C) \;\ge\; \mathbb{E}_{q(z_t \mid o_{t-1}, o_t)}\!\left[\log p(o_t \mid z_t)\right] \;-\; \mathbb{E}_{q(z_{t-d} \mid o_{t-d-1}, o_{t-d})\, p(z_{t-1} \mid z_{t-d}, r)}\!\left[\mathrm{KL}\!\left(q(z_t \mid o_{t-1}, o_t)\,\|\,p(z_t \mid z_{t-1}, r)\right)\right] \qquad (2)
Here, the subscript d denotes the overshooting length. For a more detailed derivation, see the supplementary materials.
The first term is the VAE reconstruction loss. For images, we calculate it using the cross-entropy of each pixel. The recognition model infers the approximate state posterior from past observations (in our setting, from the past two consecutive observations). The latent dynamic state is then decoded by the observation model, and we want the decoded image to be as similar as possible to the ground-truth image. The second term is the dynamic state transition loss. Here, the discrepancy between the transition distribution and the approximate posterior distribution is measured by the Kullback-Leibler (KL) divergence. The state transition process is as follows: first, the recognition model infers the approximate posterior state from past observations; the state is then transited for d steps in the latent space; finally, the transited state is compared against the one extracted by the recognition model at the target time step. Note that all comparisons are done in the latent state space rather than the observation space, which significantly reduces the computational cost.
The previous derivation assumes a fixed overshooting length d. To better model both short-term and long-term dynamics, we can combine loss terms from different overshooting lengths up to some dynamic transition horizon D, arriving at our final objective function:
\mathcal{L} = \sum_{d=1}^{D}\left(\mathbb{E}_{q(z_t \mid o_{t-1}, o_t)}\!\left[\log p(o_t \mid z_t)\right] - \beta_d\,\mathbb{E}\!\left[\mathrm{KL}\!\left(q(z_t \mid o_{t-1}, o_t)\,\|\,p(z_t \mid z_{t-1}, r)\right)\right]\right) \qquad (3)
Here, we include weighting factors β_d analogously to β-VAE higgins_betavae:_2017 , which tune the relative strength between the competing reconstruction and dynamic transition loss terms.
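The structure of this objective can be illustrated with a closed-form diagonal-Gaussian KL and a weighted sum over overshooting lengths. This is a sketch; the beta values and the numbers below are illustrative, not the paper's training settings:

```python
import numpy as np

def kl_diag_gauss(mu_q, std_q, mu_p, std_p):
    """Closed-form KL( N(mu_q, diag std_q^2) || N(mu_p, diag std_p^2) )."""
    return float(np.sum(np.log(std_p / std_q)
                        + (std_q**2 + (mu_q - mu_p)**2) / (2 * std_p**2)
                        - 0.5))

def overshooting_objective(recon_ce, kl_per_length, betas):
    """Reconstruction cross-entropy plus beta-weighted KL terms for
    overshooting lengths d = 1..D."""
    return recon_ce + sum(b * kl for b, kl in zip(betas, kl_per_length))

# KL between identical Gaussians vanishes.
mu = np.zeros(3)
assert abs(kl_diag_gauss(mu, np.ones(3), mu, np.ones(3))) < 1e-12
# Illustrative combination with D = 2 and unit beta weights.
loss = overshooting_objective(3.9, kl_per_length=[21.7, 52.6], betas=[1.0, 1.0])
```

Raising the beta weights penalizes transition mismatch more strongly at the cost of reconstruction fidelity, which is the trade-off discussed above.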
3 Experiments
We apply NeurPhy to two physical systems, the damped pendulum and planetary orbit motion, and verify whether our model can correctly discover the underlying global physical parameters and the associated dynamics. Detailed parameter settings of the network structures can be found in the supplementary materials.
In all experiments, each image sequence forms a task associated with one set of global system parameters. Each image sequence contains a total of 101 time steps, and each image consists of binary pixels. We take 90% of the tasks as meta-training tasks and the other 10% as meta-test tasks. For each meta-training task, we use 20 context samples randomly selected from the image sequence to extract the global representation; we take 90% of the frames as training (target) samples and the other 10% as test ones. For the meta-test tasks, we also use 20 context samples, but they come from the first 21 frames of the image sequence to avoid using any future information; the remaining samples are used for meta-testing. We set the maximum overshooting length to 5 and train with the Adam optimizer kingma_adam:_2014 using a fixed learning rate.
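The meta-training split above can be sketched as follows. This is a simplified sketch: how exactly the context indices interact with the train/test split is an assumption, not specified in the text:

```python
import numpy as np

rng = np.random.default_rng(0)

def split_task(sequence, num_context=20, train_frac=0.9):
    """Split one task (an image sequence) into consecutive-frame context
    pairs and training/test target indices."""
    T = len(sequence)
    ctx_idx = rng.choice(T - 1, size=num_context, replace=False)
    contexts = [(sequence[i], sequence[i + 1]) for i in ctx_idx]
    perm = rng.permutation(T)
    n_train = int(train_frac * T)
    return contexts, perm[:n_train], perm[n_train:]

seq = np.zeros((101, 4))           # a stand-in for 101 image frames
ctx, train_idx, test_idx = split_task(seq)
assert len(ctx) == 20 and len(train_idx) + len(test_idx) == 101
```

For meta-test tasks, the context indices would instead be drawn only from the first 21 frames, as described above.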
3.1 Damped Pendulum
The first problem we consider is a pendulum swinging back and forth, subject to gravity and air friction, released with a given initial angular velocity, as shown in Figure 2(a). From Newton's second law of motion, the pendulum's angle θ obeys the equation:
\ddot{\theta} = -\frac{g}{l}\sin\theta - \frac{\gamma}{m}\,\dot{\theta} \qquad (4)
where m is the pendulum mass, g is the gravitational constant, γ is the damping coefficient generated by air friction and l is the pendulum length. Here $\dot{}$ denotes the time derivative, $\ddot{}$ the second time derivative, and we rewrite the angular velocity $\dot{\theta}$ as ω.
If the angle is small enough, the equation of motion has an analytical solution, but in general it can only be solved numerically. From the state (θ_t, ω_t) at time step t, we can get the state at time step t+1 by the following dynamic transition equations:
\theta_{t+1} = \theta_t + \omega_t\,\Delta t, \qquad \omega_{t+1} = \omega_t + \left(-\frac{g}{l}\sin\theta_t - \frac{\gamma}{m}\,\omega_t\right)\Delta t \qquad (5)
Here, we assume the time interval Δt is small enough.
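These transition equations can be checked with a short Euler simulation. The parameter values below are illustrative, not the paper's settings:

```python
import numpy as np

def pendulum_step(theta, omega, g_over_l, gamma_over_m, dt):
    """One Euler step of the damped pendulum transition equations."""
    theta_next = theta + omega * dt
    omega_next = omega + (-g_over_l * np.sin(theta) - gamma_over_m * omega) * dt
    return theta_next, omega_next

def simulate(theta0, omega0, g_over_l, gamma_over_m, dt=0.05, steps=101):
    traj = [(theta0, omega0)]
    for _ in range(steps - 1):
        traj.append(pendulum_step(*traj[-1], g_over_l, gamma_over_m, dt))
    return np.array(traj)

traj = simulate(np.pi / 4, 0.0, g_over_l=9.8, gamma_over_m=0.5)
# Damping shrinks the oscillation amplitude over time.
assert abs(traj[-1, 0]) < abs(traj[0, 0])
```

The resulting (θ, ω) trajectory spirals toward the rest point, which is exactly the state-space structure described for Figure 2(d) below.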
In our experiments, we fix the gravitational constant g and the damping coefficient γ, and fix the initial angle and angular velocity. We vary the pendulum length l and mass m over fixed intervals, generating a grid of different combinations of l and m. For each combination, we generate states for 101 time steps (with a fixed Δt) and render them into images. From Equation (5), we can see that the global parameters of the pendulum system are g/l and γ/m, and we plot the typical evolution of θ for different l and m in Figure 2(b)(c). We assume that the different masses are caused by different pendulum densities and cannot be directly detected from single images. In this case, in order to discover the correct dynamics and to make good predictions, the model has to learn the global mass parameter from the changes in the image sequences. The state space (θ, ω) for all sequences and all time steps of each sequence is shown in Figure 2(d). The state-space curve of each sequence is a spiral caused by the damping. If we wait long enough, they will all sink into the rest point (θ, ω) = (0, 0), and the central hole will be fully filled.
In the experiments, we set both the dimension of the global representation r and the dimension of the dynamic state to 3. As shown in Figure 3(a)(b), the learned r lies in a two-dimensional manifold. From the colormap plot, we see a one-to-one correspondence with the underlying ground-truth parameters l and m, which means NeurPhy has extracted the correct global representations. To show this more quantitatively, we fit the learned representations to the global parameters l and m using a linear and a quadratic regression respectively. The fitted R² scores are shown in Table 1. We can see that the fit is quite good up to quadratic terms (especially for the first parameter), which means the representation we learned is indeed a simple mapping from the de facto physical system variables. Note that learning a manifold that uses one dimension to encode m is nontrivial, as the pendulum's mass can only be extracted from the time evolution of the image sequence.
In Figure 3(c)(d), we plot the learned latent Markovian dynamic state space together with the ground-truth state parameters θ, ω and l (shown by colormap). Note that the pendulum length l is unchanged during the motion, but it is necessary for image reconstruction, so the state space encodes the parameters using all three dimensions. More clearly, we plot the state space for two particular parameter settings in Figure 3(d)(f)(h), and we can see that each lies in a two-dimensional Möbius-strip-like submanifold encoding θ and ω.
In Figure 4, starting from the leftmost state images, we plot two sets of predicted images for the next 50 time steps on the meta-test set. The images in the first row are the ground-truth, while those in the second row are the predictions. We can see that the learned dynamic state transition model makes good long-term predictions even though the maximum overshooting length is only 5.
More quantitatively, we show the model performance for training, test and meta-test samples in Table 2. Recalling Equation 3, we list the values of the cross-entropy term together with the Kullback-Leibler (KL) divergence terms for different overshooting lengths d (KL1-KL5 for d = 1, ..., 5). Lower cross-entropy means better image reconstruction, and a smaller KL term means better long-term prediction. For an ablation study, we also experiment with a maximum overshooting length of 1 to study whether a larger overshooting length really helps in learning dynamic transitions. As indicated in Table 2, while a larger maximum overshooting length slightly hurts the reconstruction fidelity, it greatly reduces the KL terms, facilitating better long-term predictions.
R² Score
Experiment  Parameter 1 (Linear)  Parameter 1 (Quadratic)  Parameter 2 (Linear)  Parameter 2 (Quadratic)
Damped pendulum  0.999  1.000  0.795  0.949
Planetary orbits motion  0.959  0.977  0.681  0.935
Experiment  D  Stage  Cross Entropy  KL1  KL2  KL3  KL4  KL5
Damped pendulum  1  Training  3.9  21.7  52.6  105.9  188.4  300.4 
5  Training  20.9  6.4  10.2  16.1  24.5  35.8  
1  Test  4.3  25.2  64.3  135.5  234.8  369.6  
5  Test  21.5  7.3  12.0  19.3  29.4  41.7  
1  Metatest  4.3  25.2  64.3  135.5  234.8  369.6  
5  Metatest  21.2  7.8  12.0  19.5  29.6  43.0  
Planetary orbits motion  1  Training  0.6  6.3  10.0  12.8  15.9  18.9 
5  Training  2.3  2.6  3.8  4.8  5.8  6.9  
1  Test  0.8  6.9  10.4  12.9  16.0  19.4  
5  Test  2.4  2.6  4.0  4.9  5.9  7.0  
1  Metatest  0.6  47.2  104.7  156.2  208.3  265.8  
5  Metatest  2.7  18.4  37.9  54.5  70.3  86.6 
3.2 Planetary Orbits Motion
Next, we apply NeurPhy to the prediction of planetary orbit motion. Planetary orbits are generally ellipses, as shown in Figure 5. Specifically, the radius r has the following expression:
r = \frac{r_0(1+e)}{1 + e\cos(\theta - \theta_0)} \qquad (6)
where r_0 denotes the perihelion distance, e the eccentricity and θ_0 the angle of the major axis. The effects of different parameters on planetary orbits are shown in Figure 5(a)(c). We also plot the dynamic state space in Figure 5(d). A more detailed derivation of the orbit dynamics can be found in the supplementary materials.
Without loss of generality, we set the initial angle to zero, and vary the initial radius and velocity to generate different motion sequences. Here, v_r and v_θ denote the velocity along the radial and angular directions respectively. A set of initial conditions has a one-to-one correspondence with a set of orbit parameters (r_0, e, θ_0), which defines a unique orbit. In other words, given the initial state, the system parameters characterizing the planetary orbit are also determined. At each moment, the dynamic state can be represented solely by the two parameters (r, θ), and the corresponding velocity can be calculated from them together with the global parameters, as shown in the supplementary materials. The time evolution of (r, θ) is governed by the dynamic equations:
r_t = \frac{r_0(1+e)}{1 + e\cos(\theta_t - \theta_0)}, \qquad \theta_{t+1} = \theta_t + \frac{h}{r_t^2}\,\Delta t \qquad (7)
Note that h = r²ω is proportional to the angular momentum of the orbit and satisfies h² = GM r_0 (1+e), where G and M denote the gravitational constant and the mass of the central sun, respectively (see supplementary materials for the derivation). From Equation 7, there appear to be three global parameters (r_0, e, θ_0) defining the orbit motion, but by measuring angles relative to the major axis, we can first obtain r from Equation 6 and then substitute it into the second formula to obtain the next angle. This means that in the dynamic evolution of planetary orbits, θ_0 is a pseudo-degree of freedom, and the dynamics can be defined with just two global parameters (r_0, e), which also reflects a hidden symmetry in the orbit motion.
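Equations (6) and (7) can be simulated directly. The orbit parameters below are illustrative choices, not values from the paper's experiments:

```python
import numpy as np

def orbit_radius(theta, r0, e, theta0):
    """Elliptical orbit: r = r0 (1 + e) / (1 + e cos(theta - theta0))."""
    return r0 * (1 + e) / (1 + e * np.cos(theta - theta0))

def orbit_step(theta, h, r0, e, theta0, dt):
    """One transition step: r follows from the orbit equation, and theta
    advances with angular velocity h / r^2 (Kepler's second law)."""
    r = orbit_radius(theta, r0, e, theta0)
    return theta + h / r**2 * dt

r0, e, theta0, h = 1.0, 0.3, 0.0, 1.0   # hypothetical orbit parameters
theta = 0.0
for _ in range(200):
    theta = orbit_step(theta, h, r0, e, theta0, dt=0.05)
r = orbit_radius(theta, r0, e, theta0)
# r always stays between perihelion r0 and aphelion r0 (1 + e) / (1 - e).
assert r0 - 1e-9 <= r <= r0 * (1 + e) / (1 - e) + 1e-9
```

Note how θ_0 only shifts where the angle is measured from, which is the pseudo-degree-of-freedom observation above.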
We generate image sequences by varying the initial radius, v_r and v_θ over fixed ranges, and for each sequence we generate 101 image frames. If our model can learn the correct dynamics, it is expected that the dynamic latent states lie in a two-dimensional manifold correlated with the ground-truth states (r, θ), while the global representation of each sequence should also lie within a two-dimensional manifold corresponding to the ground-truth system parameters.
In Figure 6(a)(b), we can clearly see that the learned global latent representation indeed lies in a two-dimensional manifold. This is a remarkable result, as it took considerable analytical effort in the previous paragraphs to identify that the number of independent ground-truth system parameters affecting the dynamic evolution is two rather than three. We again fit the latent representations to the global parameters using linear and quadratic regressions, shown in Table 1. The fitted R² scores are very high, which means a good match with the ground-truth global parameters. The learned dynamic state space in Figure 6(c)(d) also shows that our model infers physically meaningful state representations. In Figure 7, we show two sets of multi-step predictions (50 time steps) on the meta-test set, with both ground-truth (first row) and predicted (second row) images. It can be seen that the learned dynamic model makes correct long-term predictions of the orbit motion. We also show the model performance in Table 2; the lower KL values for the larger maximum overshooting length again indicate that the overshooting technique is of great help to long-term predictions.
4 Related Work
Our work was inspired by the work of Garnelo et al. garnelo_neural_2018 applying NP to function regression and optimization. Kim et al. kim_attentive_2019 addressed its underfitting issue by incorporating an attention mechanism. Eslami et al. eslami_neural_2018 and Kumar et al. kumar_consistent_2018 applied it to scene representation, and Singh et al. NIPS2019_9214 extended it to sequential settings.
For state transition modelling, Karl et al. karl_deep_2016 proposed Deep Variational Bayes Filters (DVBF), which use linear Gaussian models to learn and identify a Markovian state space. Deisenroth et al. deisenroth_pilco:_2011 introduced PILCO, a data-efficient policy search method in model-based reinforcement learning, which uses Gaussian processes (GPs) to model state transition dynamics, explicitly incorporating model uncertainty into long-term planning and achieving unprecedented learning efficiency on high-dimensional control tasks. Watter et al. WatterSBR15 proposed Embed to Control (E2C), which learns a locally linear dynamical transition model from raw pixel images and achieves promising results on a variety of complex control problems. Recently, Hafner et al. hafner_learning_2019 proposed the Deep Planning Network (PlaNet), a model-based reinforcement learning method that learns the state dynamics from images, plans in latent space, and achieves performance comparable to model-free algorithms on continuous control tasks.
For physical dynamics learning, Iten et al. iten_discovering_2018 used NP to discover physical concepts, but they did not learn the underlying Markovian states and dynamics, and the inputs to their model are low-dimensional states. Breen et al. breen_newton_2019 applied deep neural networks to the motion of the chaotic three-body problem and achieved accurate solutions several orders of magnitude faster than a state-of-the-art solver.
5 Conclusions
In this work, we propose a novel network architecture, NeurPhy, that can learn physical dynamics directly from image sequences. NeurPhy not only correctly extracts the global system parameters of a sequence and the dynamic state of each frame, but also succeeds in predicting the dynamic evolution in unseen settings. We apply our model to the characterization and prediction of damped pendulum and planetary orbit motion, and achieve promising results. Our architecture is quite general and can be easily extended to model-based reinforcement learning, which we will explore in future work.
6 Supplementary Material
6.1 Derivation of the lower bound objective function
Now we derive the lower bound objective function of NeurPhy. For overshooting length d, conditioned on context points C, the log-likelihood of the data can be written as:
(8) 
Here, we use the Markovian state space assumption doerr_probabilistic_2018 , i.e., the dynamic state contains all the information needed for image reconstruction. We also assume the approximate posterior does not depend on the context explicitly given the past observations. From Jensen's inequality, we can get:
(9) 
which is eq.(2) in the main paper.
6.2 Derivation of the planetary orbits motion using the law of universal gravitation
6.2.1 The elliptical orbits
In this subsection, we derive the elliptical orbit motion obeyed by a planet circling a central sun. We follow the derivation in roy_orbital_2004 ; cite_s1 . Assuming the mass M of the central sun is several orders of magnitude larger than the mass m of the planet, the position of the central sun is approximately fixed and we only need to consider the motion of the planet. Using polar coordinates (r, θ), the position vector can be written as (see Figure 8):
\mathbf{r} = r\,\hat{\mathbf{r}} \qquad (10)
Here, symbols in bold font denote vectors, and symbols with a hat denote unit vectors along the corresponding coordinate directions. Taking time derivatives of the position vector, we get:
\dot{\mathbf{r}} = \dot{r}\,\hat{\mathbf{r}} + r\omega\,\hat{\boldsymbol{\theta}} \qquad (11)
\ddot{\mathbf{r}} = \left(\ddot{r} - r\omega^2\right)\hat{\mathbf{r}} + \left(r\dot{\omega} + 2\dot{r}\omega\right)\hat{\boldsymbol{\theta}} \qquad (12)
Here, we rewrite the angular velocity $\dot{\theta}$ as ω and have used the identities $\dot{\hat{\mathbf{r}}} = \omega\,\hat{\boldsymbol{\theta}}$ and $\dot{\hat{\boldsymbol{\theta}}} = -\omega\,\hat{\mathbf{r}}$ in the derivation.
Now, from Newton's second law of motion and the law of universal gravitation:
m\ddot{\mathbf{r}} = -\frac{GMm}{r^2}\,\hat{\mathbf{r}} \qquad (13)
we have:
\ddot{r} - r\omega^2 = -\frac{GM}{r^2} \qquad (14)
r\dot{\omega} + 2\dot{r}\omega = 0 \qquad (15)
Here, G denotes the gravitational constant and $\ddot{}$ denotes the second derivative with respect to time. Eq. (15) can be rewritten in a total differential form:
\frac{d}{dt}\left(r^2\omega\right) = 0, \qquad \text{i.e.,}\quad r^2\omega = h \qquad (16)
where h is a constant, proportional to the angular momentum L = m r²ω. Eq. (16) is also equivalent to Kepler's second law of planetary motion.
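The total-differential step can be checked directly:

```latex
\frac{d}{dt}\left(r^2\omega\right) = r^2\dot{\omega} + 2r\dot{r}\omega
= r\left(r\dot{\omega} + 2\dot{r}\omega\right) = 0 ,
```

which is Eq. (15) multiplied by r.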
Letting u = 1/r, we have:
\dot{r} = -h\,\frac{du}{d\theta}, \qquad \ddot{r} = -h^2 u^2\,\frac{d^2u}{d\theta^2} \qquad (17)
Together with eq. (14), we get:
\frac{d^2u}{d\theta^2} + u = \frac{GM}{h^2} \qquad (18)
Solving the above equation, we obtain the following general solution:
u = \frac{GM}{h^2} + A\cos(\theta - \theta_0) \qquad (19)
So the planetary orbit is:
r = \frac{1}{GM/h^2 + A\cos(\theta - \theta_0)} \qquad (20)
Here, θ_0 denotes the angle of the major axis and A is a coefficient.
The nearest radius, called the perihelion distance r_0, can be calculated by setting θ = θ_0, i.e.,
r_0 = \frac{1}{GM/h^2 + A} \qquad (21)
Defining the eccentricity $e = A h^2/(GM)$, we have:
r_0 = \frac{h^2/(GM)}{1 + e} \qquad (22)
Then we can rewrite r in terms of (r_0, e, θ_0) as:
r = \frac{r_0(1+e)}{1 + e\cos(\theta - \theta_0)} \qquad (23)
This is the eq. (6) used in the main paper.
6.2.2 Derive system global parameters from initial condition
Now we derive the global parameters (r_0, e, θ_0) from the initial condition (r_init, v_r, v_θ). Noting that h = r²ω is a constant, we have:
h = r_{\mathrm{init}}\, v_\theta \qquad (24)
From the law of energy conservation, we have:
\frac{1}{2}\left(v_r^2 + v_\theta^2\right) - \frac{GM}{r_{\mathrm{init}}} = \frac{1}{2}\,\frac{h^2}{r_0^2} - \frac{GM}{r_0} \qquad (25)
Here, the left-hand side is the energy at the initial moment, while the right-hand side represents the energy at the perihelion r_0 (where v_r = 0 and v_θ = h/r_0). Together with eq. (24) and eq. (22), we have:
\frac{1}{2}\left(v_r^2 + v_\theta^2\right) - \frac{GM}{r_{\mathrm{init}}} = \frac{(GM)^2\left(e^2 - 1\right)}{2h^2} \qquad (26)
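The algebra behind this simplification, using $r_0 = h^2/\big(GM(1+e)\big)$ from Eq. (22):

```latex
\frac{1}{2}\frac{h^2}{r_0^2} - \frac{GM}{r_0}
= \frac{(GM)^2(1+e)^2}{2h^2} - \frac{(GM)^2(1+e)}{h^2}
= \frac{(GM)^2(1+e)\big((1+e) - 2\big)}{2h^2}
= \frac{(GM)^2\left(e^2 - 1\right)}{2h^2} .
```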
From the above equation, we can get the eccentricity e, and from eq. (23), we have:
r_{\mathrm{init}} = \frac{r_0(1+e)}{1 + e\cos\theta_0} \qquad (27)
from which we can get θ_0. Here, we assume the initial angle is zero.
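Equivalently, the orbit parameters can be computed from the initial state via the standard eccentricity vector. This is a sketch consistent with, but not identical to, the energy-conservation route above; GM = 1 is an illustrative choice:

```python
import numpy as np

def orbit_elements(r_vec, v_vec, mu=1.0):
    """Eccentricity e and major-axis angle theta0 from an initial 2-D state,
    via the standard eccentricity vector (mu = G M)."""
    r = np.linalg.norm(r_vec)
    e_vec = ((np.dot(v_vec, v_vec) - mu / r) * r_vec
             - np.dot(r_vec, v_vec) * v_vec) / mu
    e = np.linalg.norm(e_vec)
    theta0 = np.arctan2(e_vec[1], e_vec[0]) if e > 1e-12 else 0.0
    return e, theta0

# A circular orbit (tangential speed sqrt(mu / r)) has zero eccentricity,
# and a faster tangential launch from the same point is at perihelion.
e_circ, _ = orbit_elements(np.array([1.0, 0.0]), np.array([0.0, 1.0]))
e_fast, theta0 = orbit_elements(np.array([1.0, 0.0]), np.array([0.0, 1.2]))
assert e_circ < 1e-12 and abs(e_fast - 0.44) < 1e-9 and theta0 == 0.0
```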
6.2.3 Derive the dynamic transition equations
From Eq.(16), we have:
\frac{d\theta}{dt} = \omega = \frac{h}{r^2} \qquad (28)
Integrating the above equation over time, we can derive the time evolution of θ, and from the relation between r and θ (eq. (23)) we can also get the time evolution of r.
The differential equation (28) is transcendental and has no analytical solution. In our setting, we only need the state transition from time step t to t+1. From eq. (28), we have:
\theta_{t+1} = \theta_t + \frac{h}{r_t^2}\,\Delta t \qquad (29)
and from eq. (23), we have:
r_{t+1} = \frac{r_0(1+e)}{1 + e\cos(\theta_{t+1} - \theta_0)} \qquad (30)
v_{r,t+1} = \dot{r}_{t+1} = \frac{h\,e\sin(\theta_{t+1} - \theta_0)}{r_0(1+e)} \qquad (31)
These transition equations correspond to eq. (7) in the main paper.
6.3 Experimental results for low-dimensional inputs using NeurPhy
Our model is quite general: if the inputs are not raw images but low-dimensional variables (such as the positions of the pendulum endpoint), it can still be applied by changing all convolutional layers in NeurPhy to multilayer perceptrons. As we show below, the learning tasks actually become easier.
6.3.1 Damped pendulum
We use the position coordinates of the pendulum endpoint as inputs (i.e., the raw images in the main paper are replaced by two-dimensional positions). Following the same routine as in the main paper, we run experiments with various global parameters (l and m).
The learned global representations are shown in Figure 9. In Figure 9(a) and (b), the colormaps denote the ground-truth l and m respectively. We can see a clear correspondence between the learned manifold and the true global parameters. Furthermore, there are some holes between data points in the figure (such as the places marked by dashed circles), corresponding to the meta-test samples. In Figure 10, we plot the dynamic state space of the damped pendulum with colormaps denoting the ground-truth dynamic state. The learned state space lies in a 3-dimensional manifold, showing good correspondence with the underlying dynamic states.
To show the performance of our model on low-dimensional inputs, we plot the x coordinates of the pendulum endpoints for the training set in Figure 11, with x axes denoting the true values and y axes denoting the predicted ones. We can see an excellent match between the true and the predicted pendulum positions; as the overshooting length d grows, the predictions degrade. In Table 3, we calculate the mean squared errors (MSEs) between the true and the predicted pendulum positions for the training, test and meta-test (20 and 2 context samples) sets. T+d represents the result with overshooting length d; in the table, d ranges from 0 to 5. We can see that on both the training and test sets the MSEs are quite small, and they become somewhat larger on the meta-test set.
MSE  T+0  T+1  T+2  T+3  T+4  T+5 
Training  0.00082  0.0033  0.010  0.025  0.062  0.14 
Test  0.00095  0.0040  0.014  0.035  0.086  0.19 
Metatest (20 contexts)  0.00081  0.011  0.046  0.14  0.28  0.56 
Metatest (2 contexts)  0.00082  0.016  0.064  0.19  0.36  0.72 
In Figure 12, we plot the x coordinates of one training sequence with different prediction lengths (overshooting length d). The solid lines are the ground-truth coordinates of the sequence, and the points in symbols are the predicted ones for context, target and test samples. We can clearly see that our model fits well. In Figures 13 and 14, we apply our model to one meta-test sample whose underlying global parameters are unseen in the training set. In Figure 13, following the training setting, we give the model 20 context samples, but in Figure 14, we only give the model 2 context samples to see whether our meta-learning algorithm can deal with fewer context samples. We can see that our model still makes quite good predictions even when only 2 context samples are given; this can also be seen in the last row of Table 3, where the MSEs for different overshooting lengths are only slightly larger than those with 20 context samples.
6.3.2 Planetary orbits motion
We also apply the low-dimensional-input model to the prediction of planetary orbit motion. All the settings are the same as in the main paper except that the inputs are changed to the planet's position coordinates. In Figures 15 and 16, we plot the learned global representation and dynamic state space. In Figure 17, we plot the true and predicted x coordinates of the planet's positions. In Table 4, we give the MSEs of the predictions. In Figures 18, 19 and 20, we plot the performance for a training and a meta-test (both 20 and 2 context samples) sequence. We can see that our model can indeed learn the underlying global and local representations together with the dynamic transition process. Also note that it is easier for our model to learn planetary orbit motion than the damped pendulum.
MSE  T+0  T+1  T+2  T+3  T+4  T+5 
Training  0.000067  0.00023  0.00053  0.00094  0.0016  0.0034 
Test  0.000068  0.00023  0.00055  0.00097  0.0017  0.0034 
Metatest (20 contexts)  0.000068  0.0019  0.013  0.050  0.11  0.16 
Metatest (2 contexts)  0.000067  0.0041  0.030  0.20  0.11  0.30 
6.4 Network structure of NeurPhy
6.4.1 For images inputs
For the Global representation model and the Recognition model, we use convolutional neural networks, each with four layers with filter sizes of [32, 64, 64, 128], strides of 2 and ReLU activations. For the Observation model, we use a deconvolutional neural network with filter sizes of [64, 64, 32, 1], kernel sizes of [5, 5, 6, 6] and strides of 2; the first three layers use ReLU activations and the last layer uses a Sigmoid activation. For the State space model, we use a 4-layer feed-forward neural network with cell units [128, 128, 64, 16] and ReLU activations. The last layer is further connected to a dense layer of size 2n, where n is the dimension of the latent dynamic state, which is set to 3 in our experiments. The first n units generate the mean of the state and the remaining n units produce the standard deviation.
6.4.2 For low-dimensional variable inputs
The network architecture is almost the same as in the image case, except that the convolutional layers are replaced by multi-layer feed-forward networks. The Global representation model is replaced by a 4-layer feed-forward neural network with cell units of [128, 128, 64, 16] and ReLU activations. For the Recognition model, we only use a 2-layer feed-forward neural network with cell units of [32, 16] and ReLU activations. For the damped pendulum system, we set for all , use a batch size of and run 1000 epochs. For planetary orbital motion, we set for all , use a batch size of and run 500 epochs.
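As a concrete illustration, the State space model head (shared by the image and low-dimensional variants) can be sketched as below. The hidden sizes [128, 128, 64, 16] with ReLU activations and the dense output split into mean and standard-deviation halves follow the text; the NumPy implementation, the softplus parameterisation of the standard deviation, the function names, and the input dimension of 8 are our assumptions, so this is a sketch rather than the paper's exact implementation.

```python
import numpy as np

def init_mlp(sizes, rng):
    """Random parameters for a feed-forward network with the given layer sizes."""
    return [(rng.standard_normal((a, b)) * 0.1, np.zeros(b))
            for a, b in zip(sizes[:-1], sizes[1:])]

def state_space_head(x, params, z_dim=3):
    """ReLU MLP followed by a dense layer of size 2 * z_dim: the first
    z_dim outputs are the mean of the latent dynamic state, the second
    z_dim its standard deviation (softplus keeps it positive)."""
    h = x
    for W, b in params[:-1]:
        h = np.maximum(h @ W + b, 0.0)   # ReLU hidden layers
    W, b = params[-1]
    out = h @ W + b                      # final dense layer, 2 * z_dim units
    mean, raw_std = out[..., :z_dim], out[..., z_dim:]
    std = np.log1p(np.exp(raw_std))      # softplus (assumed parameterisation)
    return mean, std

# Hidden sizes [128, 128, 64, 16] from the text; input dimension 8 is arbitrary.
rng = np.random.default_rng(0)
params = init_mlp([8, 128, 128, 64, 16, 2 * 3], rng)
```

The mean and standard deviation returned here parameterise the Gaussian over the latent dynamic state, from which the stochastic SSM samples via the usual reparameterisation trick.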
References
 (1) http://farside.ph.utexas.edu/teaching/301/lectures/node155.html.

 (2) Alexander Alemi, Ben Poole, Ian Fischer, Joshua Dillon, Rif A. Saurous, and Kevin Murphy. Fixing a Broken ELBO. In International Conference on Machine Learning, pages 159–168, 2018.
 (3) Marcin Andrychowicz, Misha Denil, Sergio Gomez, Matthew W. Hoffman, David Pfau, Tom Schaul, Brendan Shillingford, and Nando De Freitas. Learning to learn by gradient descent by gradient descent. In Advances in neural information processing systems, pages 3981–3989, 2016.
 (4) Philip G. Breen, Christopher N. Foley, Tjarda Boekholt, and Simon Portegies Zwart. Newton vs the machine: solving the chaotic three-body problem using deep neural networks. arXiv:1910.07291 [astro-ph, physics:physics], Oct. 2019.
 (5) Yutian Chen, Matthew W. Hoffman, Sergio Gómez Colmenarejo, Misha Denil, Timothy P. Lillicrap, Matt Botvinick, and Nando de Freitas. Learning to learn without gradient descent by gradient descent. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pages 748–756. JMLR.org, 2017.
 (6) Emmanuel de Bézenac, Arthur Pajot, and Patrick Gallinari. Deep learning for physical processes: Incorporating prior scientific knowledge. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings, 2018.
 (7) Marc Peter Deisenroth and Carl Edward Rasmussen. PILCO: a model-based and data-efficient approach to policy search. In Proceedings of the 28th International Conference on International Conference on Machine Learning, pages 465–472. Omnipress, 2011.
 (8) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, 2019.
 (9) Andreas Doerr, Christian Daniel, Martin Schiegg, Duy Nguyen-Tuong, Stefan Schaal, Marc Toussaint, and Sebastian Trimpe. Probabilistic Recurrent State-Space Models. In International Conference on Machine Learning, pages 1279–1288, 2018.
 (10) Stefanos Eleftheriadis, Tom Nicholson, Marc Deisenroth, and James Hensman. Identification of Gaussian process state space models. In Advances in neural information processing systems, pages 5309–5319, 2017.
 (11) S. M. Ali Eslami, Danilo Jimenez Rezende, Frederic Besse, Fabio Viola, Ari S. Morcos, Marta Garnelo, Avraham Ruderman, Andrei A. Rusu, Ivo Danihelka, Karol Gregor, David P. Reichert, Lars Buesing, Theophane Weber, Oriol Vinyals, Dan Rosenbaum, Neil Rabinowitz, Helen King, Chloe Hillier, Matt Botvinick, Daan Wierstra, Koray Kavukcuoglu, and Demis Hassabis. Neural scene representation and rendering. Science, 360(6394):1204–1210, June 2018.

 (12) Marco Fraccaro, Simon Kamronn, Ulrich Paquet, and Ole Winther. A disentangled recognition and nonlinear dynamics model for unsupervised learning. In Advances in Neural Information Processing Systems, pages 3601–3610, 2017.
 (13) Roger Frigola, Yutian Chen, and Carl Edward Rasmussen. Variational Gaussian process state-space models. In Advances in neural information processing systems, pages 3680–3688, 2014.
 (14) Marta Garnelo, Jonathan Schwarz, Dan Rosenbaum, Fabio Viola, Danilo J. Rezende, S. M. Ali Eslami, and Yee Whye Teh. Neural Processes. arXiv:1807.01622 [cs, stat], July 2018. arXiv: 1807.01622.
 (15) Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep learning. Adaptive computation and machine learning. The MIT Press, Cambridge, Massachusetts, 2016.
 (16) Karol Gregor, Danilo Jimenez Rezende, Frederic Besse, Yan Wu, Hamza Merzic, and Aaron van den Oord. Shaping Belief States with Generative Environment Models for RL. arXiv:1906.09237 [cs, stat], June 2019. arXiv: 1906.09237.
 (17) David Ha and Jürgen Schmidhuber. World Models. arXiv:1803.10122 [cs, stat], Mar. 2018. arXiv: 1803.10122.
 (18) Danijar Hafner, Timothy Lillicrap, Ian Fischer, Ruben Villegas, David Ha, Honglak Lee, and James Davidson. Learning Latent Dynamics for Planning from Pixels. In International Conference on Machine Learning, pages 2555–2565, 2019.
 (19) Junxian He, Daniel Spokoyny, Graham Neubig, and Taylor Berg-Kirkpatrick. Lagging inference networks and posterior collapse in variational autoencoders. arXiv preprint arXiv:1901.05534, 2019.
 (20) Irina Higgins, Loic Matthey, Arka Pal, Christopher Burgess, Xavier Glorot, Matthew Botvinick, Shakir Mohamed, and Alexander Lerchner. beta-VAE: Learning Basic Visual Concepts with a Constrained Variational Framework. ICLR, 2(5):6, 2017.
 (21) Raban Iten, Tony Metger, Henrik Wilming, Lidia del Rio, and Renato Renner. Discovering physical concepts with neural networks. arXiv:1807.10300 [physics, physics:quant-ph], July 2018.
 (22) Maximilian Karl, Maximilian Soelch, Justin Bayer, and Patrick van der Smagt. Deep Variational Bayes Filters: Unsupervised Learning of State Space Models from Raw Data. arXiv:1605.06432 [cs, stat], May 2016. arXiv: 1605.06432.
 (23) Hyunjik Kim, Andriy Mnih, Jonathan Schwarz, Marta Garnelo, Ali Eslami, Dan Rosenbaum, Oriol Vinyals, and Yee Whye Teh. Attentive Neural Processes. arXiv:1901.05761 [cs, stat], Jan. 2019. arXiv: 1901.05761.
 (24) Diederik P. Kingma and Jimmy Ba. Adam: A Method for Stochastic Optimization. arXiv:1412.6980 [cs], Dec. 2014. arXiv: 1412.6980.
 (25) Diederik P. Kingma and Max Welling. Auto-Encoding Variational Bayes. arXiv:1312.6114 [cs, stat], Dec. 2013.
 (26) Rahul G. Krishnan, Uri Shalit, and David Sontag. Deep Kalman Filters. arXiv:1511.05121 [cs, stat], Nov. 2015. arXiv: 1511.05121.
 (27) Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet classification with deep convolutional neural networks. Commun. ACM, 60(6):84–90, May 2017.
 (28) Ananya Kumar, S. M. Ali Eslami, Danilo J. Rezende, Marta Garnelo, Fabio Viola, Edward Lockhart, and Murray Shanahan. Consistent Generative Query Networks. arXiv:1807.02033 [cs, stat], July 2018. arXiv: 1807.02033.
 (29) Syama Sundar Rangapuram, Matthias W. Seeger, Jan Gasthaus, Lorenzo Stella, Yuyang Wang, and Tim Januschowski. Deep state space models for time series forecasting. In Advances in Neural Information Processing Systems, pages 7785–7794, 2018.
 (30) Archie E. Roy. Orbital motion. CRC Press, 2004.
 (31) David Silver, Thomas Hubert, Julian Schrittwieser, Ioannis Antonoglou, Matthew Lai, Arthur Guez, Marc Lanctot, Laurent Sifre, Dharshan Kumaran, Thore Graepel, Timothy Lillicrap, Karen Simonyan, and Demis Hassabis. A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play. Science, 362(6419):1140–1144, Dec. 2018.
 (32) Gautam Singh, Jaesik Yoon, Youngsung Son, and Sungjin Ahn. Sequential neural processes. In Advances in Neural Information Processing Systems, pages 10254–10264, 2019.
 (33) Kihyuk Sohn, Honglak Lee, and Xinchen Yan. Learning structured output representation using deep conditional generative models. In Advances in neural information processing systems, pages 3483–3491, 2015.
 (34) Jane X. Wang, Zeb Kurth-Nelson, Dhruva Tirumala, Hubert Soyer, Joel Z. Leibo, Remi Munos, Charles Blundell, Dharshan Kumaran, and Matt Botvinick. Learning to reinforcement learn. arXiv:1611.05763 [cs, stat], Nov. 2016.
 (35) Manuel Watter, Jost Tobias Springenberg, Joschka Boedecker, and Martin A. Riedmiller. Embed to control: A locally linear latent dynamics model for control from raw images. In Corinna Cortes, Neil D. Lawrence, Daniel D. Lee, Masashi Sugiyama, and Roman Garnett, editors, NIPS, pages 2746–2754, 2015.