Self-Supervised Decomposition, Disentanglement and Prediction of Video Sequences while Interpreting Dynamics: A Koopman Perspective

by Armand Comas, et al.

Human interpretation of the world encompasses the use of symbols to categorize sensory inputs and compose them in a hierarchical manner. One of the long-term objectives of Computer Vision and Artificial Intelligence is to endow machines with the capacity of structuring and interpreting the world as we do. Towards this goal, recent methods have successfully decomposed and disentangled video sequences into their composing objects and dynamics, in a self-supervised fashion. However, little effort has gone into interpreting the dynamics of the scene. We propose a method to decompose a video into moving objects and their attributes, and model each object's dynamics with linear system identification tools, by means of a Koopman embedding. This allows interpretation, manipulation and extrapolation of the dynamics of the different objects by employing the Koopman operator K. We test our method on various synthetic datasets and successfully forecast challenging trajectories while interpreting them.




1 Introduction

Unsupervised learning of symbolic representations from high dimensional data poses a great challenge to current machine intelligence. As humans, our intuitive modeling of the world is based on abstract categories, or symbols. The universe of those categories is unbounded and continuous. Fortunately, we can approach symbolic reasoning by discretizing and simplifying those categories. One interesting direction for categorization is based on compositionality, specialization and hierarchy. Different concepts will be in charge of different tasks, and the hierarchical composition of their outputs will generate the complex behaviors we wish to model.

More specifically, we look at the task of visual perception. For visual scenes, a simplification of the symbols encompasses entities, their attributes and their interactions with the environment. Numerous efforts have focused on decomposing a static scene into its composing objects and background in an unsupervised fashion [8, 10, 20, 4, 19]. In this case the objective is to learn the category of “object” and its related attributes such as “location”, “appearance” or “depth”, without labeling any of those categories. The supervision signal is often a surrogate task of coherence, such as a hierarchically rendered reconstruction of the scene, or rules imposed on the intermediate representations (e.g. contrastive learning [14]).

Similarly, adding one dimension to the problem, many approaches have tackled unsupervised video decomposition [16, 7, 6, 12, 17, 11, 13]. An added issue in this case is finding the correspondences of the decomposed objects across time (i.e. tracking) while defining what an object is. And as there is time, there is a future. Therefore, the question of how the scene will look arises. Recurrent Neural Networks (RNNs) provide a useful class of models for forecasting tasks. Once a scene is decomposed, they are often used to model the underlying dynamics. However, RNNs suffer from exploding and vanishing gradients, and it is hard to incorporate high-level constraints into the model. Related to the latter, another issue of such methods is their lack of high-level interpretability.

In this work, we advocate that a data-driven physics-based approach can bring a principled and interpretable perspective to the dynamics modelling, while preserving the model’s predictive power. We make use of Koopman theory, which is based on the insight that a finite-dimensional nonlinear system can be transformed to an infinite-dimensional linear dynamical system, and then propagated in time using a linear operator (the Koopman operator). Therefore, we can apply tools of linear algebra and spectral theory, and the dynamical system can be understood as a composition of first-order impulse responses. Koopman theory has been successfully applied to model time series in many applications [29, 31, 21]. We develop this further in Section 2.

Our model, which we call Koopman-based Interpretable Decomposition and Disentanglement (KIDD), (i) uses an attention-based tracking method to learn representations from video factorized into moving objects and their attributes: appearance, confidence and pose; (ii) finds a non-linear mapping to the Koopman space for the dynamic latent representations; (iii) learns the Koopman operators that characterize the underlying dynamics of the training data; and (iv) performs unsupervised video prediction using the latent representations. In our experiments, we also propose simple decomposition techniques to interpret the objects' dynamics.

Figure 1: Overall architecture of KIDD. On the left, an attention-based recurrent tracker decomposes the scene into its composing objects. In the center, the object representations are disentangled into Confidence, Appearance and Pose. On the right, the dynamic features of Pose are modelled and forecasted by using a Koopman embedding. These latent representations are later used to reconstruct or predict frames.

2 Background

We consider a time-invariant autonomous dynamical system of a single object, of the form:

$$x_{t+1} = F(x_t) \qquad (1)$$

where $x_t$ is the state of the system at time $t$ and $F$ is a potentially non-linear function that defines the temporal transition of the states.

The fundamental insight of Koopman operator theory is that the finite-dimensional nonlinear dynamics of Equation 1 can be transformed to an infinite-dimensional linear dynamical system by considering an appropriately chosen Hilbert space of scalar observables [15, 23]. The eigenfunctions of the Koopman operator are difficult to find, and several algorithms have been proposed to tackle the challenge. The most widely used are the dynamic mode decomposition (DMD) [29] and its extension to nonlinear observables, the extended DMD (EDMD) algorithm [31].

Previous research used hand-crafted eigenfunctions to model the observable space. Those were chosen from function families or directly from previous knowledge of the physics of the problem. Currently, some approaches use deep neural networks to represent the observable space [21, 25, 18, 32, 1]. Neural networks have the advantage of being universal approximators, and are effective in finding the Koopman invariant subspace given a downstream task. With this latter perspective, Koopman methods have been applied successfully to fluid dynamics [24, 1], atomic and molecular scale dynamics [33, 22], chaotic systems [3] and traffic dynamics [33], among others.

Koopman methodology is data-driven, model-free and can discover the underlying dynamics and control of a given system from data alone [28]. The system identification problem is then reduced to finding the Koopman operator. This is usually done by linear regression given historical data (e.g. by means of ordinary least squares) or by end-to-end gradient-descent-based optimization. For the latter, the Koopman operator is often learned jointly with the mapping and inverse mapping.
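The linear-regression route can be illustrated with a minimal sketch. It assumes the observables are already available as rows of a matrix, and the function name `fit_koopman_least_squares` and the toy rotation data are illustrative choices, not part of the method described here.

```python
import numpy as np

def fit_koopman_least_squares(observables):
    """Fit K such that y[t+1] ~= K @ y[t] by ordinary least squares.

    observables: array of shape (T, d), one row per time step.
    """
    past, future = observables[:-1], observables[1:]
    # lstsq solves past @ X ~= future, so the operator acting on
    # column vectors is the transpose of the solution.
    solution, *_ = np.linalg.lstsq(past, future, rcond=None)
    return solution.T

# Toy data (assumption): a planar rotation of 0.1 rad per step.
theta = 0.1
true_K = np.array([[np.cos(theta), -np.sin(theta)],
                   [np.sin(theta),  np.cos(theta)]])
states = [np.array([1.0, 0.0])]
for _ in range(49):
    states.append(true_K @ states[-1])
Y = np.stack(states)

K_hat = fit_koopman_least_squares(Y)
```

On noise-free linear data such as this, the fitted matrix should coincide with the generating one up to numerical precision; with learned observables the fit is only as good as the embedding.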

The Koopman operator $\mathcal{K}$ is defined by:

$$\mathcal{K} g = g \circ F \qquad (2)$$

where $g$ is the mapping from the state space to the observable space, $F$ is the transition function of Equation 1, and $\circ$ denotes the composition operator. Therefore:

$$g(x_{t+1}) = \mathcal{K} g(x_t) \qquad (3)$$

$$\mathcal{K} \varphi(x_t) = \lambda \varphi(x_t) \qquad (4)$$

where $\lambda$ is the eigenvalue of $\mathcal{K}$ corresponding to the eigenfunction $\varphi$. In some cases, Koopman is employed in the presence of control inputs. There are different approaches to introducing inputs to Koopman (e.g. [28, 18]). In the studied cases, inputs model forces external to an object, originating from object-environment interactions. Therefore, inputs will depend both on the state of the object and the environment’s geometry. The latter is learned implicitly from data. Consequently, Equations 3 and 4 in the presence of inputs are modified as follows:


Here, depends on the current state (Closed-loop control). It is expected to be sparse, and low-dimensional. is the input Koopman operator. We usually define dimension such that . A challenge will be to discover and the correct mapping to the Koopman manifold simultaneously.
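One common way to write linear observable dynamics with state-dependent inputs is sketched below. The symbols are notation assumed for this sketch only: $g$ maps a state $x_t$ to its observables, $\mathcal{K}$ is the Koopman operator, $u_t$ is the input, $h$ is a hypothetical non-linear map from states to inputs, and $\mathcal{L}$ stands for the input Koopman operator.

```latex
g(x_{t+1}) = \mathcal{K}\, g(x_t) + \mathcal{L}\, u_t,
\qquad
u_t = h(x_t)
```

Under this assumed form, setting $u_t = 0$ recovers the input-free linear propagation, and sparsity of $u_t$ means that the operator $\mathcal{K}$ alone explains the dynamics at most time steps.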

3 Related Work

Object decomposition. Numerous recent publications have focused on the unsupervised decomposition of a scene into the different objects that compose it. Attend, Infer, Repeat (AIR) [8] presents an unsupervised way to count and locate objects and to reconstruct a scene by providing structure in the inference module. AIR also infers the number of objects present in the scene. Following the path of structured design, iodine [10] also presents an unsupervised way to decompose a scene into multiple objects, using a Gaussian mixture model as the generative model together with amortized iterative inference. By leveraging compositional structure, it learns disentangled, interpretable and generalizable representations. Similarly, slot attention introduces a general-purpose plug-in network based on attention that discovers objects in an image. The authors draw connections to transformers and soft K-means clustering. One interesting feature of this model is that it can generalize to previously unseen compositions and to more objects, since slots are not tied to specific objects; rather, each slot can capture any of the objects.

Video decomposition. drnet [7] is an early work that decomposes video into a static component (content) and a dynamic component (pose). This is a key idea that is embraced by many of the works discussed here. To achieve this decomposition, it uses an adversarial loss to enforce that the dynamic component does not carry identity information. sqair [16] extends the idea of air to video sequences by considering their temporal progression. Since different objects are discovered sequentially, sqair is not scalable. scalor [13] overcomes this limitation by massive parallelization and is therefore able to handle hundreds of objects in a scene and predict their future trajectories simultaneously. Similar to drnet, ddpae [12] decomposes a video into pose and content, where the content is static and the pose is dynamic and modeled using RNNs. stove [17] again decomposes and disentangles the objects in the latent space, and uses a graph to model the interactions between them. By using a Markov model in the latent space, [17] applies inference in the series. The authors show that their model generates realistic frames and conserves kinetic energy even when predicting over long horizons. Similarly, C-SWM [14] uses a graph neural network and a structured model to learn relations between objects in a self-supervised way directly from raw videos. However, in this case the supervision is done by means of contrastive learning and a bipartite loss. Tracking by Animation [11] shows that decomposition, disentanglement and deterministic generation of objects is a good self-supervision signal for learning tracking. We use their ideas on attention-based tracking in our method.

Koopman Operator. Koopman operator theory studies the transformation of nonlinear dynamics into a space where the dynamics are linear and given by a linear operator. It has been successfully used to disentangle the dynamic modes in complex dynamical systems using techniques like DMD [29] and Extended DMD (EDMD) [31]. Because Koopman-operator-based methods are completely data-driven and require a rich family of functions to generate a mapping, many modern works have used deep neural networks to approximate the eigenfunctions associated with the Koopman operator. This has resulted in several deep-network-based Koopman methods [21, 1, 25, 18, 32, 26]. This idea has recently found successful application in fluid dynamics [24, 1], atomic and molecular scale dynamics [33, 22], chaotic systems [3] and traffic dynamics [33]. The original Koopman operator theory was developed without any external inputs to the dynamical system. Later, it was generalized to consider inputs [28]. This gave rise to other methods, like Compositional Koopman [18], that use Koopman theory to model object dynamics and interactions with other objects and the environment in a compositional way. The introduction of inputs to the dynamics modelling allows for applications in control and reinforcement learning. Our work uses ideas from the recent developments in Koopman-based modelling and self-supervised video decomposition and prediction to propose a joint decomposition of the static and dynamic components of a scene. This allows for compositional generation and interpretability of the dynamics. To the best of our knowledge, this is the first work that tackles this set of problems jointly with an end-to-end model.

4 Method

Video often presents multiple objects in motion that generate a complex dynamical scene. In the pixel space, dynamics are strongly non-linear. But by decomposing the scene into abstract categories, dynamics are simpler to model and more interpretable. For our approach, we decompose the scene into its moving objects. For simplicity, we assume that there is no background. We track and identify objects across input frames, and assign sets of variables to each one of them. Each object representation will be disentangled into the following categories:

  • Pose: The parameters of an affine spatial transformation of an object: the two centroid coordinates, the scale and the ratio.

  • Appearance: A vector containing information about an object's appearance, modeled either as dynamic across frames or as static.

  • Confidence: A probability scalar indicating the certainty of an object being correctly modeled.

Our method uses concepts of soft-attention for feature-based tracking. We build on top of [11] for our tracking mechanism, and modify the architecture to allow stochasticity and forecasting. Below, we describe the main modules that form our model:


We encode each frame of a video with a convolutional encoder, and obtain a feature map as , where pe is a positional encoding. We then track objects across features by using an array of trackers. trackers are initialized, where is an upper-bound on the expected number of objects in every scene. We define the tracker recurrent updates as:


where is an attention-based tracking function described in Equations 10 and 11. We implement the appearance latent vector to be the only stochastic variable in this setup, given its inherent complexity. It is sampled from a Gaussian distribution and trained as in a VAE framework.

Tracker updates

We use a content-based soft-attention mechanism [2], relying on the Query, Key, Value triad. We attend to the information of the feature map that describes the object:


A GRU [5] cell is used to update the hidden state of each tracker . Such state queries the features . Note that the Value and Key are both the raw convolutional features, modified iteratively following Equations 12 and 13. is the Query strength. An attention map is applied to the values resulting in the current input to the GRU.
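A minimal sketch of such a content-based read is given below. It assumes cosine similarity between the query and the keys and a softmax over spatial locations, which are common choices but are not confirmed by the text; the function name `soft_attention_read` and the toy feature map are illustrative.

```python
import numpy as np

def soft_attention_read(features, query, strength):
    """Content-based soft attention over a flattened feature map.

    features: (N, C) array, one row per spatial location (Key and Value).
    query:    (C,) array, e.g. derived from a tracker hidden state.
    strength: positive scalar that sharpens the attention map.
    """
    # Cosine similarity between the query and every key (assumption).
    keys = features / (np.linalg.norm(features, axis=1, keepdims=True) + 1e-8)
    q = query / (np.linalg.norm(query) + 1e-8)
    scores = strength * (keys @ q)
    # Numerically stable softmax over the spatial locations.
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    # Weighted sum of the values: the vector that would feed the GRU.
    return weights, weights @ features

rng = np.random.default_rng(0)
feature_map = rng.normal(size=(16, 8))   # 4x4 locations, 8 channels
attn, read = soft_attention_read(feature_map, feature_map[5], strength=10.0)
```

Querying with the features of one location concentrates the attention map on that location, and a larger strength makes the map sharper.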


We need a mechanism to provide information across trackers, while preserving the identity of each object through time. Based on [11, 9], trackers interact through external memory by using interface variables. We use the frame convolutional features as external memory, and implement read and write operations. When iterating through trackers, the input will be updated as follows:


Here, and are the erasing and writing vectors, respectively. The feature map is updated in the spatial locations indicated by the attention matrix of the previous iteration. The operator denotes an element-wise product in the channel dimension of .
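The read and write interface can be sketched in the style of memory-augmented networks such as [9]. The update rule below (erase, then add, gated by the attention map) is an assumed form rather than the exact equations of the model, and `erase_and_write` is a hypothetical helper name.

```python
import numpy as np

def erase_and_write(features, attention, erase, write):
    """Update a flattened feature map at the attended locations.

    features:  (N, C) array used as external memory.
    attention: (N,) weights in [0, 1], one per spatial location.
    erase:     (C,) erasing vector with entries in [0, 1].
    write:     (C,) writing vector.
    """
    a = attention[:, None]
    # Element-wise products along the channel dimension: attended
    # locations are partially erased, then the new content is added.
    return features * (1.0 - a * erase[None, :]) + a * write[None, :]

memory = np.ones((4, 3))
one_hot = np.array([0.0, 0.0, 1.0, 0.0])
new_content = np.array([0.5, -1.0, 2.0])
updated = erase_and_write(memory, one_hot, np.ones(3), new_content)
```

With a one-hot attention map and a full erase vector, only the attended location is overwritten, so the next tracker sees features from which the previous object has been removed.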

Koopman Embedding

We employ Koopman theory as an alternative to modelling dynamics. We argue that this will introduce benefits to our model, as we discuss in Section 5.
Assumptions are made with respect to the objects’ dynamics in the scene. We define the state as a concatenation of delayed instances (from now on referred to as delayed coordinates) of the pose vector for the object:


Therefore . indicates our prior belief on the number of time-steps needed to model dynamics with an Auto-Regressive (AR) approach. We assume that the effects of the environment on each object will remain unchanged through training and testing cases. Hence, these effects are learned implicitly by the model. The observables are obtained through the mapping (equivalent to Eq. 3). and make reference to the spaces of observables and inputs respectively. We recover the original state by approximating the inverse function with a deterministic Auto-Encoder (AE) architecture. In the presence of external forces, we will also model inputs as a non-linear mapping from the state space. Note that we set if we assume that the objects’ dynamics are not affected by the environment:


We refer to the estimated states as . We define the Koopman operators and as parameter matrices. The pose vector is recovered by keeping the first stacked coordinate of the estimated state , and limited to the range . The eigendecomposition of the Koopman operator provides us with insight into the dynamics in the scene.
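The construction of delayed coordinates and the recovery of the pose can be sketched as follows. The ordering of the stacked coordinates (most recent pose first) is an assumption made for illustration, and `delay_embed` and `recover_pose` are hypothetical helper names.

```python
import numpy as np

def delay_embed(poses, n_delays):
    """Stack n_delays consecutive pose vectors into one state.

    poses: (T, p) array of pose vectors.
    Returns an array of shape (T - n_delays + 1, n_delays * p), where
    each row holds the most recent pose first (assumed ordering).
    """
    rows = []
    for t in range(n_delays - 1, poses.shape[0]):
        window = poses[t - n_delays + 1 : t + 1][::-1]  # newest first
        rows.append(window.reshape(-1))
    return np.stack(rows)

def recover_pose(state, pose_dim):
    """Keep the first stacked coordinate of a (possibly estimated) state."""
    return state[..., :pose_dim]

poses = np.arange(20, dtype=float).reshape(5, 4)   # 5 frames, pose of size 4
states = delay_embed(poses, n_delays=3)
latest = recover_pose(states[0], pose_dim=4)
```

Here each state has dimension 12, and the first state is built from frames 0 to 2, so its leading four entries are the pose of frame 2.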

Training and Objective

Given the video sequence , its generative distribution is given by:


where . Note that we both reconstruct the whole sequence with length and predict it from initial frames . For our experiments, will coincide with the number of delayed coordinates .
Given the inferred latent variables, we reconstruct and predict for each object sequentially. In particular, we first generate the object in the center with resolution , given the appearance . The decoder is a deconvolutional layer. We then apply a spatial transformer to rescale and place the object according to the pose . For each object, the generative model is:


Future prediction is similar to reconstruction, except in this case is extrapolated using the Koopman operator in the observable space.
The generated frame is the summation over for all objects. Similarly to the VAE framework, we train the model by maximizing the evidence lower bound (ELBO).


Here, we use self-supervision for reconstructing the input and predicting that same input from few initial conditions (). We also add regularizers for a better learning of the Koopman embedding, so that the final expression of our objective with respect to the trainable weights is:


Here, and indicate the and losses respectively. is the nuclear norm of a matrix, which is a convex surrogate of the rank function. is used to enforce sparsity in . Finally, the s are the weights applied to each one of the loss terms. In terms of implementation, we use linear annealing to increase the weights of the regularizers as training advances. For more details see the Appendix.
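The two kinds of regularizer and the annealing schedule can be sketched as below. Which quantity each regularizer is applied to, and the schedule's end point, are assumptions for illustration; the helper names are hypothetical.

```python
import numpy as np

def nuclear_norm(matrix):
    """Sum of singular values, a convex surrogate of the matrix rank."""
    return np.linalg.svd(matrix, compute_uv=False).sum()

def l1_norm(values):
    """Sum of absolute values, which encourages sparsity."""
    return np.abs(values).sum()

def annealed_weight(step, total_steps, max_weight):
    """Linearly increase a regularizer weight from 0 to max_weight."""
    return max_weight * min(1.0, step / total_steps)

K = np.diag([3.0, -4.0])
rank_penalty = nuclear_norm(K)                 # singular values 4 and 3
sparsity_penalty = l1_norm(np.array([0.0, -2.0, 0.5]))
weight_midway = annealed_weight(50, 100, 0.2)
weight_after = annealed_weight(200, 100, 0.2)
```

Starting the regularizer weights at zero lets the reconstruction and prediction terms shape the embedding first, with the low-rank and sparsity pressure growing as training advances.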

5 Experiments

We evaluate KIDD in variations of the Moving MNIST dataset. Our goal is to show that our method can capture and model the implicit dynamics of objects in video sequences with a Koopman embedding. Therefore, we preserve the same distribution of appearances through the experiments (MNIST digits), and vary the nature of their trajectories. Our baselines are the established state-of-the-art methods for decomposed self-supervised video generation: ddpae [12], drnet [7] and scalor [13]. All of them base their dynamics modelling on RNNs.

Evaluation Metrics

Our quantitative results are measured in terms of pixel-level Binary Cross Entropy (BCE) per frame, Mean Square Error (MSE) per frame, Mean Absolute Error (MAE), Structural Similarity (SSIM) and the Learned Perceptual Image Patch Similarity metric (LPIPS) [34].

5.1 Implementation

Our model is trained with different configurations for every experiment. More details are provided in the Appendix. When it comes to the dimensionality of the observable space , it ranges from 15 to 30. We use an appearance vector of size 50, a pose vector of size 4 and the number of delayed coordinates per state , , corresponds to the number of input frames in our experiments. The tracker hidden state corresponding to Equation 8 has dimension 288. The tracker attends to a convolutional map. The objects are decoded to a size of and located in the frame.

5.2 Moving MNIST Experiments

Moving MNIST [30] is a synthetic dataset consisting of two digits with size moving independently in a frame. Each sequence is generated on-the-fly by sampling MNIST digits and synthesizing trajectories according to a definition of the motion. Our model is trained for 200 epochs. We randomly generate 9k sequences for training, 1k for validation and 2k for testing. In our experiments, we simulate 4 scenarios as follows:

  • Circular motion: For the first experiment, we generate a fairly simple dataset, for which we know the expected results. We sample randomly initial coordinates , radius and the angular step length. We generate the motion with equation: , , where increases linearly with a slope given by the angular step length. Finally, we constrain the motion to the dimensionality of the frame. We generate frames as our input. From the first 3, we will generate 17 with supervision.

  • Cropped circular motion: We mask the 29 top rows of the circular motion case, simulating a partially cropped frame. Note that an object of size can be completely occluded.

  • Inelastic/Superelastic collisions: With a fixed velocity, we sample initial coordinates and angle and let the object collide against the frame limits. We increase the complexity of the case by simulating an inelastic response from the left and top limits and a superelastic response from the right and bottom limits . We generate chunks of frames as input. From the first 3, we will generate 10 with supervision.

  • 3D to 2D motion projection: This motion is created parametrically following , , , , with coordinates and angular velocities . This parameterization constrains it to lay within the cube . We then rotate the trajectory with an angle and project a random portion of the full trajectory with size (6 in 10 out) to the axis. The offset phase and are also randomly generated. Using perspective projection, the objects are resized according to their depth in the axis, after being projected. Figure 2 gives an example of the generated sequences.

    Figure 2: Example of a projected trajectory for the 3D to 2D motion projection experiment. The selected chunk is shown in orange.
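As an illustration of the first scenario, a circular trajectory can be generated as in the sketch below. The parameterization around a centre point, the clipping used to keep the digit inside the frame, and the frame and digit sizes (64 and 28) are assumptions made for this example only.

```python
import numpy as np

def circular_trajectory(x0, y0, radius, step, n_frames,
                        frame_size=64, digit_size=28):
    """Top-left digit coordinates along a circle, one row per frame."""
    # The angle increases linearly with a slope given by the angular step.
    angles = step * np.arange(n_frames)
    x = x0 + radius * np.cos(angles)
    y = y0 + radius * np.sin(angles)
    # Keep the digit inside the frame (clipping is an assumption).
    limit = frame_size - digit_size
    return np.stack([np.clip(x, 0, limit), np.clip(y, 0, limit)], axis=1)

# 20 frames in total: 3 input frames plus 17 frames to generate.
trajectory = circular_trajectory(x0=18.0, y0=18.0, radius=10.0,
                                 step=0.3, n_frames=20)
```

Both coordinates are sinusoids of the same frequency, which is why a linear operator with one complex-conjugate eigenvalue pair is expected to suffice for this scenario.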

The number of input frames will contain enough information to predict auto-regressively in most sequences of each scenario. Therefore, we use delayed coordinates for our state corresponding to the input frames. More details on the dataset can be found in the Appendix.

A general quantitative overview of the experiments is given in Table 1. KIDD outperforms the baselines in most cases, the exceptions being scalor in reconstruction and ddpae. In terms of Perceptual Similarity (LPIPS), our prediction matches or outperforms all the baselines. ddpae is the closest to ours in terms of architecture; the key difference is the dynamics modeling. ddpae uses a concatenation of LSTMs for reconstructing and predicting the pose. Each of them has a hidden state of size . Our objective is not to outperform RNN-based methods for short sequences such as video. We aim to perform similarly while gaining interpretability and manipulability of the dynamics. In the same table, we can see the results of KIDD after keeping only the top 6 eigenvalues of the Koopman matrix , a procedure known as model reduction. The actual dimensionality of ranges from to in the experiments. As we see, the results are very similar to the prediction with the full matrix. This means that our model was able to capture the motion of the digits with only 3 conjugate pairs of eigenvalues in all studied cases. Figure 3 illustrates how single eigenvalues or conjugate pairs of eigenvalues affect the dynamics. The produced dynamics are a weighted combination of the impulse responses produced by each eigenvalue (real or conjugate pair). In the same figure, we see a manipulation of the learned model for the 3D to 2D motion projection case. Qualitatively, the generated frames are sharp and accurate, and the learned dynamics are correct. Note that even after modifying the eigenvalues of matrix , the motion of the digits is physically plausible and smooth.
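Model reduction of this kind can be sketched through the eigendecomposition of the operator. The ranking criterion (largest magnitude first) is an assumption, since the selection rule for the "top" eigenvalues is not spelled out here, and `reduce_operator` is a hypothetical helper name.

```python
import numpy as np

def reduce_operator(K, n_keep):
    """Zero out all but the n_keep largest-magnitude eigenvalues of K.

    n_keep should not split a complex-conjugate pair; otherwise the
    result is no longer real and taking the real part loses information.
    """
    eigvals, eigvecs = np.linalg.eig(K)
    order = np.argsort(-np.abs(eigvals))       # largest magnitude first
    kept = np.zeros_like(eigvals)
    kept[order[:n_keep]] = eigvals[order[:n_keep]]
    reduced = eigvecs @ np.diag(kept) @ np.linalg.inv(eigvecs)
    return reduced.real

# Toy operator (assumption): one oscillating pair plus two weak real modes.
theta = 0.3
K = np.zeros((4, 4))
K[:2, :2] = [[np.cos(theta), -np.sin(theta)],
             [np.sin(theta),  np.cos(theta)]]
K[2, 2], K[3, 3] = 0.2, 0.1
K_reduced = reduce_operator(K, n_keep=2)
```

In this toy case the reduced operator keeps the rotation block and discards the two fast-decaying modes, mirroring the observation that a few conjugate pairs carry most of the predicted motion.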

Figure 3: Top right: Visualization of the atomic dynamics given by the eigenvalues of matrix [27]. Top left: Learned Koopman matrix eigenvalues for 3D to 2D projection experiment. In green we highlight the eigenvalue pair that we will modify to visualize the effect of manipulations. Bottom, predictions by rows: Ground Truth; our main baseline ddpae; KIDD with the object decomposition bounding boxes; 2 variations to the radius of the highlighted eigenvalue; 3 variations to the angle of the eigenvalue. We can see in blue, how increasing the radius increases the effect of that particular eigenvalue, both in size and position. In red, a modification in the angle has a direct effect on the frequency of the motion.

Circular Motion Experiments

Table 1 shows the results for this scenario. We perform similarly to ddpae in both the cropped and the complete version of the experiment. Here, we know that the motion is expected to be sinusoidal in both spatial coordinates. Therefore, a correct Koopman mapping would not need to be non-linear to capture the dynamics in this case. A sinusoid can be modelled by a linear operator with a complex-conjugate pair of eigenvalues on the unit circle. In Figure 4, we can see that pair, together with real eigenvalues that present no oscillation. We also observe that the learned model is stable, as no eigenvalue has a radius greater than 1. The model has learned a very similar operator for both the cropped and the complete version. This indicates that KIDD is learning consistent dynamics across datasets when they share the same motion. It also suggests that the model is able to impute a trajectory when data is missing. Qualitative results can be seen in the Appendix.
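The link between sinusoidal motion and eigenvalues on the unit circle can be checked with a small example. It uses a planar rotation matrix as the linear operator, which is an illustrative choice and not the operator learned by the model.

```python
import numpy as np

omega = 0.2   # angular step per frame (illustrative value)
A = np.array([[np.cos(omega), -np.sin(omega)],
              [np.sin(omega),  np.cos(omega)]])

# The eigenvalues form a conjugate pair exp(+i*omega), exp(-i*omega).
eigvals = np.linalg.eigvals(A)

# Iterating the linear map produces a sinusoid in each coordinate.
z = np.array([1.0, 0.0])
xs = []
for _ in range(50):
    xs.append(z[0])
    z = A @ z
xs = np.array(xs)
```

The first coordinate follows cos(omega * t), the eigenvalue angle sets the frequency, and a radius of exactly 1 means the oscillation neither decays nor grows.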

Collision Experiments

For this scenario, the challenge is the use of inputs. Every collision against the frame limits applies a force to the object that modifies its dynamics. Therefore, we model the effect of the environment as in Equation 16, allowing the inputs to be non-zero and forcing them to be sparse and low-dimensional to avoid overfitting. This generates sharp objects and correctly captures the dynamics. See the Appendix for more details.

3D to 2D Projection Experiments

Quantitative results for this experiment are fairly close to ddpae, especially when it comes to prediction. This dataset is challenging because it entangles linear motion across dimensions by means of projection. It also encompasses digit size variations. However, KIDD is able to estimate the dynamics correctly in most cases, and generates sharp-looking objects. Figure 3 shows the qualitative performance of KIDD in terms of prediction. We can see that the predictions are accurate and sharp. The model correctly disentangles the two objects that appear in the scene, and models their dynamics independently. To understand and interpret the dynamics learned in the Koopman space, we modify the eigenvalues of the operator. As shown in Fig. 3 (top), we modify the eigenvalues highlighted in green by changing their radius or their angle. We can interpret the following from the bottom rows. Shadowed in red, we can see that this particular eigenvalue pair has an effect on the latter part of the trajectory. If we increase its modulus above 1, we observe an increase in the intensity of the variations, which seem to oscillate strongly at the end of the sequence (see the size of the object). This happens because the system is now unstable. If we vary the angle of the eigenvalue pair with respect to the real axis, we see variations in terms of frequency. When we subtract 20 degrees from the angle, it becomes almost 0. Therefore, we observe a constant trajectory for the objects. When we increase that angle, we see the frequency of the digits' oscillation increasing with it.

This is a clear example of how Koopman allows us to interpret and manipulate the modelled dynamics.
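Such a manipulation can be sketched by editing one conjugate pair in the eigendecomposition and rebuilding a real matrix. This is an illustrative procedure under the assumption that the operator is diagonalizable; `modify_conjugate_pair` is a hypothetical helper name.

```python
import numpy as np

def modify_conjugate_pair(K, index, radius_scale=1.0, angle_shift=0.0):
    """Rescale the radius and shift the angle of one eigenvalue pair.

    index selects one eigenvalue; its complex conjugate is updated
    consistently so that the rebuilt operator stays real.
    """
    eigvals, eigvecs = np.linalg.eig(K)
    lam = eigvals[index]
    # Shift the angle away from (or towards) the real axis symmetrically.
    angle = np.angle(lam) + np.sign(lam.imag) * angle_shift
    new_lam = radius_scale * np.abs(lam) * np.exp(1j * angle)
    conj_index = np.argmin(np.abs(eigvals - np.conj(lam)))
    new_vals = eigvals.copy()
    new_vals[index] = new_lam
    new_vals[conj_index] = np.conj(new_lam)
    return (eigvecs @ np.diag(new_vals) @ np.linalg.inv(eigvecs)).real

def rotation(theta):
    return np.array([[np.cos(theta), -np.sin(theta)],
                     [np.sin(theta),  np.cos(theta)]])

K = rotation(0.3)
K_faster = modify_conjugate_pair(K, index=0, angle_shift=0.2)
K_unstable = modify_conjugate_pair(K, index=0, radius_scale=1.1)
```

In this toy case, shifting the angle turns a rotation of 0.3 rad per step into one of 0.5 rad per step (a higher frequency), while scaling the radius above 1 yields growing oscillations, matching the unstable behaviour described above.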

Circular          | Rec      Pred     Pred(6e) | Rec      Pred     Pred(6e) | Rec      Pred     Pred(6e) | Rec      Pred     Pred(6e) | Rec      Pred     Pred(6e)
ddpae [12]        | 40.13    71.96    /        | 123.94   162.64   /        | 225.67   331.80   /        | 0.87     0.82     /        | 0.19     0.21     /
KIDD              | 59.97    84.23    84.64    | 139.92   168.83   169.96   | 283.26   371.46   317.80   | 0.86     0.82     0.82     | 0.15     0.17     0.17

Cropped circular  | Rec      Pred     Pred(6e) | Rec      Pred     Pred(6e) | Rec      Pred     Pred(6e) | Rec      Pred     Pred(6e) | Rec      Pred     Pred(6e)
ddpae [12]        | 35.04    54.95    /        | 98.48    118.90   /        | 176.09   248.25   /        | 0.88     0.85     /        | 0.21     0.23     /
KIDD              | 47.80    59.26    59.32    | 108.28   120.04   120.15   | 212.84   254.09   254.24   | 0.88     0.86     0.86     | 0.17     0.19     0.19

Collision         | Rec      Pred     Pred(6e) | Rec      Pred     Pred(6e) | Rec      Pred     Pred(6e) | Rec      Pred     Pred(6e) | Rec      Pred     Pred(6e)
drnet [7]         | 109.01   214.49   /        | 218.14   339.58   /        | 478.39   1350.20  /        | 0.75     0.60     /        | 0.30     0.40     /
scalor [13]       | 13.24    329.63   /        | 50.02    494.41   /        | 148.02   1584.48  /        | 0.95     0.28     /        | 0.02     0.48     /
ddpae [12]        | 49.53    93.52    /        | 146.65   199.57   /        | 263.39   423.88   /        | 0.84     0.76     /        | 0.23     0.25     /
KIDD              | 59.53    103.63   110      | 155.07   205.31   212.38   | 291.18   449.73   473.68   | 0.83     0.77     0.76     | 0.19     0.21     0.22

3D to 2D proj     | Rec      Pred     Pred(6e) | Rec      Pred     Pred(6e) | Rec      Pred     Pred(6e) | Rec      Pred     Pred(6e) | Rec      Pred     Pred(6e)
drnet [7]         | 80.31    136.06   /        | 172.43   248.11   /        | 331.71   732.60   /        | 0.77     0.66     /        | 0.32     0.41     /
scalor [13]       | 7.58     233.18   /        | 32.05    377.34   /        | 87.95    1018.30  /        | 0.96     0.32     /        | 0.02     0.45     /
ddpae [12]        | 20.99    43.72    /        | 57.08    86.87    /        | 124.45   223.50   /        | 0.94     0.89     /        | 0.11     0.14     /
KIDD              | 38.15    45.22    48.25    | 85.37    92.67    96.27    | 188.46   216      228.97   | 0.89     0.88     0.87     | 0.15     0.14     0.15

Table 1: Quantitative comparison of all methods for the four scenarios. We evaluate reconstruction (Rec), prediction (Pred), and prediction with model reduction (Pred(6e), where 6e denotes that we keep only the top 6 eigenvalues). From top to bottom: Circular motion, Cropped circular motion, Inelastic/Superelastic collisions, 3D to 2D motion projection. Our method performs similarly to the best RNN-based baseline and outperforms the rest of the baselines in prediction.

Figure 4: Visualization of the eigenvalues of the learned matrix for both circular experiments. On the left, for the complete frame experiment; on the right, for the cropped frame experiment. We can see that the learned eigenvalues are very close in both cases, even though one of the models has been trained on a corrupted dataset. In red, we highlight the eigenvalues of that have been empirically shown to have almost no effect on the dynamics modeling.

6 Conclusion

We propose a self-supervised method for decomposing a video sequence into interpretable components. In the spirit of compositionality, we decompose a scene into its composing objects and the objects into their attributes of pose, appearance and confidence. We leverage the dynamic components to learn a Koopman-based model for their dynamics. We embed the dynamic attributes of each object into a space where dynamics are linear, and therefore prediction is performed by a linear operator and the sparse influence of inputs from the environment. This design enables us to completely decompose a video into meaningful and interpretable components, including its evolution in time. The key advantage of our model is that we can utilize tools from control theory to understand the underlying dynamical system in a high-dimensional and highly nonlinear sequence such as video; for instance, we provided insights into the dynamics of objects through the analysis of the eigenvalues of the learned Koopman operator. Through carefully designed experiments, we showed that our method does not sacrifice accuracy or predictive power in exchange for interpretability. To the best of our knowledge, our work is the first to introduce Koopman analysis in the interpretation of video sequences. We are excited that this opens the door to exchanging ideas between computer vision, control systems and interpretability, thereby allowing the theoretical development and analysis in the area of control systems to positively impact the interpretation of video sequences. In the future, we wish to extend our method to handle more complex videos involving interactions between objects and between objects and the environment simultaneously. We also intend to extend this implementation to handle the background in parallel to the foreground. This would allow KIDD to model more complex environments.


  • [1] Omri Azencot, N. Benjamin Erichson, Vanessa Lin, and Michael Mahoney. Forecasting sequential data using consistent koopman autoencoders. International Conference on Machine Learning (ICML), 2020.
  • [2] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. Computing Research Repository (CoRR), 2015.
  • [3] S. Brunton, B. W. Brunton, J. Proctor, E. Kaiser, and J. N. Kutz. Chaos as an intermittently forced linear system. Nature Communications, 8, 2017.
  • [4] C. Burgess, Loïc Matthey, Nicholas Watters, Rishabh Kabra, I. Higgins, M. Botvinick, and Alexander Lerchner. Monet: Unsupervised scene decomposition and representation. Computing Research Repository (CoRR), 2019.
  • [5] Kyunghyun Cho, B. V. Merrienboer, Çaglar Gülçehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using rnn encoder-decoder for statistical machine translation. Conference on Empirical Methods in Natural Language Processing (EMNLP), 2014.
  • [6] Armand Comas, Chi Zhang, Zlatan Feric, Octavia Camps, and Rose Yu. Learning disentangled representations of videos with missing data. Advances in Neural Information Processing Systems (NeurIPS), 2020.
  • [7] Emily L Denton and Vighnesh Birodkar. Unsupervised learning of disentangled representations from video. In Advances in Neural Information Processing Systems (NeurIPS), 2017.
  • [8] S. M. Ali Eslami, Nicolas Heess, Theophane Weber, Yuval Tassa, David Szepesvari, Koray Kavukcuoglu, and Geoffrey E Hinton. Attend, infer, repeat: Fast scene understanding with generative models. In Advances in Neural Information Processing Systems (NeurIPS), 2016.
  • [9] A. Graves, Greg Wayne, M. Reynolds, T. Harley, Ivo Danihelka, Agnieszka Grabska-Barwinska, Sergio Gomez Colmenarejo, Edward Grefenstette, Tiago Ramalho, J. Agapiou, Adrià Puigdomènech Badia, K. Hermann, Yori Zwols, Georg Ostrovski, A. Cain, H. King, C. Summerfield, P. Blunsom, K. Kavukcuoglu, and Demis Hassabis. Hybrid computing using a neural network with dynamic external memory. Nature, 538:471–476, 2016.
  • [10] Klaus Greff, Raphaël Lopez Kaufman, Rishabh Kabra, Nick Watters, Christopher Burgess, Daniel Zoran, Loic Matthey, Matthew Botvinick, and Alexander Lerchner. Multi-object representation learning with iterative variational inference. In International Conference on Machine Learning (ICML), 2019.
  • [11] Z. He, J. Li, Daxue Liu, Hangen He, and D. Barber. Tracking by animation: Unsupervised learning of multi-object attentive trackers. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 1318–1327, 2019.
  • [12] Jun-Ting Hsieh, Bingbin Liu, De-An Huang, Li F Fei-Fei, and Juan Carlos Niebles. Learning to decompose and disentangle representations for video prediction. In Advances in Neural Information Processing Systems (NeurIPS), 2018.
  • [13] Jindong Jiang, Sepehr Janghorbani, Gerard De Melo, and Sungjin Ahn. Scalor: Generative world models with scalable object representations. In International Conference on Learning Representations (ICLR), 2020.
  • [14] Thomas Kipf, Elise van der Pol, and Max Welling. Contrastive learning of structured world models. In International Conference on Learning Representations (ICLR), 2020.
  • [15] B. O. Koopman. Hamiltonian systems and transformation in hilbert space. Proceedings of the National Academy of Sciences of the United States of America, 17 5:315–8, 1931.
  • [16] Adam Kosiorek, Hyunjik Kim, Yee Whye Teh, and Ingmar Posner. Sequential attend, infer, repeat: Generative modelling of moving objects. In Advances in Neural Information Processing Systems (NeurIPS), 2018.
  • [17] Jannik Kossen, Karl Stelzner, Marcel Hussing, Claas Voelcker, and Kristian Kersting. Structured object-aware physics prediction for video modeling and planning. In International Conference on Learning Representations (ICLR), 2020.
  • [18] Yunzhu Li, Hao He, Jiajun Wu, Dina Katabi, and Antonio Torralba. Learning compositional koopman operators for model-based control. In International Conference on Learning Representations (ICLR), 2020.
  • [19] Zhixuan Lin, Yi-Fu Wu, Skand Vishwanath Peri, Weihao Sun, Gautam Singh, Fei Deng, Jindong Jiang, and Sungjin Ahn. Space: Unsupervised object-oriented scene representation via spatial attention and decomposition. In International Conference on Learning Representations (ICLR), 2020.
  • [20] Francesco Locatello, Dirk Weissenborn, Thomas Unterthiner, Aravindh Mahendran, Georg Heigold, Jakob Uszkoreit, Alexey Dosovitskiy, and Thomas Kipf. Object-centric learning with slot attention. In Advances in Neural Information Processing Systems (NeurIPS), 2020.
  • [21] Bethany Lusch, J. N. Kutz, and S. Brunton. Deep learning for universal linear embeddings of nonlinear dynamics. Nature Communications, 9, 2018.
  • [22] Andreas Mardt, L. Pasquali, Hao Wu, and F. Noé. Vampnets for deep learning of molecular kinetics. Nature Communications, 9, 2017.
  • [23] I. Mezic. Spectral properties of dynamical systems, model reduction and decompositions. Nonlinear Dynamics, 41:309–325, 2005.
  • [24] Jeremy Morton, Antony Jameson, Mykel J Kochenderfer, and Freddie Witherden. Deep dynamical modeling and control of unsteady fluid flows. In Advances in Neural Information Processing Systems (NeurIPS), 2018.
  • [25] Jeremy Morton, F. Witherden, and Mykel J. Kochenderfer. Deep variational koopman models: Inferring koopman observations for uncertainty-aware dynamics modeling and control. In International Joint Conferences on Artificial Intelligence (IJCAI), 2019.
  • [26] Samuel E. Otto and Clarence W. Rowley. Linearly recurrent autoencoder networks for learning dynamics. SIAM Journal on Applied Dynamical Systems, 18(1):558–593, 2019.
  • [27] Charles L Phillips and H Troy Nagle. Digital control system analysis and design. Prentice Hall Press, 2007.
  • [28] J. Proctor, S. Brunton, and J. N. Kutz. Generalizing koopman theory to allow for inputs and control. SIAM J. Appl. Dyn. Syst., 17:909–930, 2018.
  • [29] P. Schmid. Dynamic mode decomposition of numerical and experimental data. Journal of Fluid Mechanics, 656:5–28, 2008.
  • [30] Nitish Srivastava, Elman Mansimov, and R. Salakhutdinov. Unsupervised learning of video representations using lstms. In ICML, 2015.
  • [31] M. Williams, I. Kevrekidis, and C. Rowley. A data–driven approximation of the koopman operator: Extending dynamic mode decomposition. Journal of Nonlinear Science, 25:1307–1346, 2015.
  • [32] Yongqian Xiao, Xin Xu, and Qianli Lin. Cknet: A convolutional neural network based on koopman operator for modeling latent dynamics from pixels. ArXiv, 2021.
  • [33] Tian Xie, A. France-Lanord, Yanming Wang, Y. Shao-Horn, and J. Grossman. Graph dynamical networks for unsupervised learning of atomic scale dynamics in materials. Nature Communications, 10, 2019.
  • [34] Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2018.

Appendix A Model Implementation Details

Figure A.1: Eigenvalues in the unit circle for the Koopman operator for the three ablated scenarios. The axes are the imaginary (I) and real (R) components of the eigenvalues. Shadowed in red is the area where the eigenvalues have a negligible effect on the trajectories, according to the results shown in Table 1. The following indicators are linked to the results shown in Figures A.2, A.3, and A.4. Highlighted in green is the eigenvalue pair chosen for the ablation study. The blue and red arrows indicate the kind of manipulation that the chosen eigenvalue pair undergoes. The annotated magnitude and phase are those of the uppermost eigenvalue, with the phase measured with respect to the real axis.

A.1 Baseline Models


The scalor [13] model is an unsupervised model for learning scalable object-oriented representations. It can track a large number of objects against a dynamic background. Although scalor can predict future frames from previous ones, and prediction results are reported, the model is not specifically designed for prediction. We chose this baseline as an alternative for parallel object decomposition with attention-based tracking. Our experiments show that scalor is very good at reconstruction but predicts very poorly. Thus, we were not able to reach reasonable prediction performance for the studied dynamical scenes.

For our experiments we used a publicly available implementation. To obtain the best performance with scalor on our dataset, we trained the model for 625 epochs. However, we found that the prediction loss reached a plateau in the early stages of training, at which point scalor is unable to generate predictions. To tackle this, we kept the MNIST digits stationary and trained scalor until it could generate predictions. From that point, after 625 epochs of training, we obtained some good prediction frames. Nevertheless, predictions are often blurry and inaccurate, and obvious artifacts appear in the scene. Training scalor takes around 30 minutes per 100 epochs on four RTX 2080 Ti GPUs with 12.8GB of memory each.


The original version of the drnet [7] model uses only the first four frames as input for training. For our experiments, we need three or six input frames, so we changed the scene discriminator in drnet to train on all frames in the sequences. The rest of the model was kept exactly the same as the authors’ implementation for better reproduction of results. As mentioned in their GitHub repository, the main network and the LSTM in drnet are trained separately. First, we trained the main base network, then we trained it again with skip connections, and finally we trained the LSTM part. Therefore, the performance of the LSTM part is determined by the main network. The scene discriminator was trained with a BCE loss, and the main network and the LSTM were trained with an MSE loss on four RTX 2080 Ti GPUs with 12.8GB of memory each. For more details about drnet, please refer to their GitHub repository.


For ddpae [12], we used the code provided by the authors. The hyperparameters of the public version were kept unchanged, and we followed the instructions in their GitHub repository for the Moving MNIST experiment. The model was trained for 250 epochs, similarly to KIDD.

A.2 Our Model


The main latent variables have the following dimensions: , and . The latter’s four dimensions correspond to , where and are the coordinates of the centroid of an object; is the increment of the size of the object with respect to the decoded appearance, which would be of size ; and is the increment of the ratio , which by default is . The increments are weighted by a scalar that regulates their effect. All components of the pose are bounded between and .

Following the VAE framework, we implemented to be a sample of a learned posterior distribution , with the reparametrization trick. As usual, we regularized our training by adding a KL divergence term between our posterior and a Gaussian prior with and .
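The sampling and regularization step described above can be sketched in plain Python. This is only an illustration under stated assumptions: the prior mean and standard deviation are missing from the extracted text, so `mu_p` and `sigma_p` are hypothetical arguments, and the function names are ours rather than the authors'.

```python
import math
import random


def reparameterize(mu, log_var, rng):
    """Sample z = mu + sigma * eps with eps ~ N(0, 1), element-wise."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]


def kl_diag_gaussians(mu_q, log_var_q, mu_p, sigma_p):
    """KL( N(mu_q, sigma_q^2) || N(mu_p, sigma_p^2) ), summed over dimensions.

    mu_p and sigma_p are scalars shared by all dimensions (an assumption).
    """
    total = 0.0
    for m, lv in zip(mu_q, log_var_q):
        var_q = math.exp(lv)
        total += (math.log(sigma_p) - 0.5 * lv
                  + (var_q + (m - mu_p) ** 2) / (2.0 * sigma_p ** 2)
                  - 0.5)
    return total


rng = random.Random(0)
z = reparameterize([0.0, 1.0], [0.0, -2.0], rng)        # one posterior sample
kl = kl_diag_gaussians([0.0, 1.0], [0.0, -2.0], 0.0, 1.0)  # regularization term
```

Writing the sample as a deterministic function of the posterior parameters plus external noise is what lets gradients flow through the sampling step during training.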

We implemented the convolutional features with dimensionality and the tracker hidden states as . The dimensions were chosen after a manual sweep over a range of hyperparameters. Particularly, the dimensionality of was chosen from the range ; from ; and from . The Koopman mapping and its inverse are each parametrized by a multi-layer perceptron with 4 layers and a hidden-state dimensionality of 40. The Koopman operator is initialized as a matrix of 0s, and the input operator with Xavier initialization (same as the other layers of the Koopman embedding). was obtained by leveraging the input frames and a positional encoding PE. The latter has dimensionality and values that indicate the distance from the frame edges in terms, normalized in . With the exception of these details, the implementation of the attention and memory follows the major guidelines of [11].
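As a rough illustration of the initialization and of the linear prediction step in the observable space, the sketch below uses plain Python lists. The symbols `B` (input operator), `g` (observables) and `u` (inputs), the update `g_next = K g + B u`, and the sizes `n_obs` and `n_in` are assumptions made for this example; the actual dimensions are not recoverable from the text.

```python
import math
import random


def xavier_uniform(rows, cols, rng):
    """Xavier/Glorot uniform initialization for a rows x cols matrix."""
    bound = math.sqrt(6.0 / (rows + cols))
    return [[rng.uniform(-bound, bound) for _ in range(cols)]
            for _ in range(rows)]


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def koopman_step(K, B, g, u):
    """One linear prediction step: g_next = K g + B u."""
    return [a + b for a, b in zip(matvec(K, g), matvec(B, u))]


def rollout(K, B, g0, inputs):
    """Apply koopman_step once per input vector; return all states."""
    states = [g0]
    for u in inputs:
        states.append(koopman_step(K, B, states[-1], u))
    return states


rng = random.Random(0)
n_obs, n_in = 4, 2                           # hypothetical sizes
K = [[0.0] * n_obs for _ in range(n_obs)]    # zero initialization, as described
B = xavier_uniform(n_obs, n_in, rng)         # Xavier initialization, as described
states = rollout(K, B, [1.0] * n_obs, [[0.0] * n_in] * 3)
```

With a zero-initialized operator and no inputs, every predicted state is the zero vector, so any dynamics the model expresses after training must have been learned rather than inherited from the initialization.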

We trained the model in all scenarios for 250 epochs and 9k iterations per epoch. We used a batch size of , and a prior number of objects . Similarly to [11] or [12], our model can set redundant components to be empty by reducing the confidence value to 0. The learning rate was set to and reduced by a factor of 0.7 on plateau of the validation loss. We used Adam as our optimizer with parameters and weight decay regularization .
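The plateau-based schedule can be sketched as follows. The decay factor 0.7 comes from the text; the initial learning rate and the `patience` value are not given in the extracted text, so the ones used here are placeholders. In PyTorch, `torch.optim.lr_scheduler.ReduceLROnPlateau` provides this behaviour, although the text does not say which utility the authors used.

```python
class PlateauDecay:
    """Multiply the learning rate by `factor` when validation loss stalls."""

    def __init__(self, lr, factor=0.7, patience=3):
        self.lr = lr
        self.factor = factor        # 0.7 is the value stated in the text
        self.patience = patience    # hypothetical: stalled epochs tolerated
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Record one epoch's validation loss; return the current rate."""
        if val_loss < self.best:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
            if self.bad_epochs > self.patience:
                self.lr *= self.factor
                self.bad_epochs = 0
        return self.lr


scheduler = PlateauDecay(lr=1.0, factor=0.7, patience=2)  # placeholder lr
for loss in [1.0, 0.9, 0.9, 0.9, 0.9]:
    current_lr = scheduler.step(loss)
```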

Next, we describe variations for each experiment. For the circular motion experiment, we have an observable space of dimension and inputs are set to 0. The same is done for the cropped circular motion experiment, but in this case the loss is evaluated only in the visible part of the frame. Note that this is also done for the baselines. The loss weights are increased linearly, according to the values that will be provided in the codebase.
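For the cropped setting, evaluating the loss only on the visible part of the frame amounts to masking the per-pixel error, and a linearly increasing loss weight is a simple ramp. The sketch below is illustrative only: the mean-squared-error form, the function names and the ramp endpoints are assumptions, since the actual weights are deferred to the codebase.

```python
def masked_mse(pred, target, visible):
    """Mean squared error over pixels where visible[i][j] is True."""
    total, count = 0.0, 0
    for p_row, t_row, v_row in zip(pred, target, visible):
        for p, t, v in zip(p_row, t_row, v_row):
            if v:
                total += (p - t) ** 2
                count += 1
    return total / count if count else 0.0


def linear_weight(step, total_steps, w_start, w_end):
    """Ramp a loss weight linearly from w_start to w_end, then hold it."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return w_start + frac * (w_end - w_start)


pred = [[1.0, 5.0], [0.0, 0.0]]
target = [[0.0, 0.0], [0.0, 2.0]]
visible = [[True, False], [True, False]]   # right column is occluded
loss = masked_mse(pred, target, visible)   # errors in occluded pixels are ignored
```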

We also use for the Inelastic/Superelastic Collision experiment. In this case, the input dimensionality is . We keep it low dimensional so it does not absorb the free dynamics of the object.

Finally, for the 3D to 2D motion projection experiment, we expect higher order dynamics in the Koopman manifold, given the apparent complexity of the dataset. Therefore, we set an observable space of .

Further details can be found in the codebase that will be provided together with the final version of the work.


We implemented this method using Ubuntu 18.04, Python 3.6, PyTorch 1.2.0 and CUDA 10.0.

For each of our experiments we used two RTX 2080 Ti GPUs (Blower Edition) with 12.8GB of memory each.

Figure A.2: Ablation study and qualitative results of the 3D to 2D motion projection scenario. The labels on the Y axis indicate the variations applied to a selected eigenvalue of the learned Koopman operator. Written in red, we indicate the behaviors we perceive. We show the predictions of our model and of the strongest baseline, ddpae. The same setup is used in the rest of the ablation studies (Figures A.3 and A.4). The model decomposes the scene into its composing objects and accurately predicts their trajectories. We vary one of the eigenvalues (Figure A.1), chosen so that the visualization is clear. Variations in angle (red dotted line) affect the frequency of oscillation of the trajectory. Variations in the radius (blue dotted line) show unstable behaviours when the eigenvalue magnitude is greater than 1, and smoothing effects when it is smaller than 1.

Figure A.3: Ablation study and qualitative results of the Circular motion scenario. The labels on the Y axis indicate the variations applied to a selected eigenvalue of the learned Koopman operator. Written in red, we indicate the behaviors we perceive.

Figure A.4: Ablation study and qualitative results of the Cropped circular motion scenario. In transparent blue, we show the occluded area of the frame. The labels on the Y axis indicate the variations applied to a selected eigenvalue of the learned Koopman operator. Written in red, we indicate the behaviors we perceive. In this particular case, it is interesting to highlight the imputation of the trajectory that KIDD discovers in an unsupervised fashion. This indicates that the model learns the correct dynamics. ddpae fails to reconstruct the objects properly.

Figure A.5: Qualitative comparison (with failure cases) of KIDD against the baselines for the Inelastic/Superelastic collision case. At the top, a success case where the inelastic collision is properly modelled by KIDD. The best baseline, ddpae, also reconstructs the trajectory successfully. In the center, a failure case where the decomposition and reconstruction of the scene are correct, but the dynamics are slightly off for KIDD after the points of collision. At the bottom, a failure case in which KIDD is unable to decompose the scene. Note that scalor provides a very poor prediction although it has the best reconstruction.

Figure A.6: Failure case for the cropped circular motion scenario. In this case, one of the numbers is never seen in the initial conditions, and therefore KIDD can’t reconstruct its appearance. Note that ddpae is also unable to reproduce it, and the results are qualitatively worse.

Appendix B More examples and failure cases

In this section, we provide examples for the four studied scenarios, including failure cases for KIDD. We also show more examples of our ablation studies. For the latter, we modify the eigenvalues of the learned Koopman operator to study and visualize the variations in the dynamics of the objects in the scene. Figure A.1 gives an overview of the learned eigenvalues for three of the studied scenarios: Circular motion, Cropped circular motion and 3D to 2D motion projection. For these cases, there are no inputs, and therefore the Koopman operator alone generates the dynamics.

Figure A.1 shows shadowed in red the eigenvalues that have been empirically proven to have negligible effect on the performance in Table 1. For all cases, 6 eigenvalues (usually 3 complex conjugate pairs) are enough to generate the behaviour seen in the dynamics of the Koopman manifold. We chose an eigenvalue pair (highlighted in green) and changed its radius and angle to study the effects on the scene dynamics.

In Figure A.2, we see the case of 3D to 2D motion projection. Here, we display the two best-performing RNN-based baselines (ddpae and drnet) together with KIDD. Qualitatively, they perform similarly to or worse than KIDD in this case. The model identifies and predicts each of the digits independently. The blue dotted line shows behaviors due to changes in the magnitude of the eigenvalue pair. When the eigenvalues are outside the unit circle, the system they model is unstable. This can be observed in the figure by looking at the behaviour for magnitudes greater than 1: the digits become unstable, progressively changing their size and the amplitude of their oscillation. At a certain point, the digit gets stuck at the bottom frame limit, reaching the constraint of the pose vector. Variations in the phase angle have a direct link to the frequency of the digit’s oscillation. When the phase angle is zero, the oscillation frequency is close to 0, which has a clear impact on the vertical component of the object’s trajectory. When we increase the phase angle, the vertical oscillation frequency increases with it. These behaviors are as expected given the illustration in Figure 3 (top-left).
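The link between an eigenvalue pair and the resulting motion can be reproduced with a toy two-dimensional linear system. This is a minimal sketch and not the learned operator: the names `rho` (magnitude) and `theta` (phase angle) and all numeric values are chosen here only for illustration. A real 2x2 rotation-scaling block has the complex conjugate eigenvalue pair `rho * exp(+/- i * theta)`, and iterating it gives a first coordinate equal to `rho**t * cos(t * theta)`.

```python
import math


def rotation_block(rho, theta):
    """Real 2x2 block whose eigenvalues are rho * exp(+/- i * theta)."""
    c, s = rho * math.cos(theta), rho * math.sin(theta)
    return [[c, -s], [s, c]]


def simulate(rho, theta, steps, x0=(1.0, 0.0)):
    """Iterate x_{t+1} = A x_t; return the first coordinate at every step."""
    A = rotation_block(rho, theta)
    x = list(x0)
    trace = [x[0]]
    for _ in range(steps):
        x = [A[0][0] * x[0] + A[0][1] * x[1],
             A[1][0] * x[0] + A[1][1] * x[1]]
        trace.append(x[0])
    return trace


def sign_changes(trace):
    """Count sign flips, a rough proxy for oscillation frequency."""
    return sum(1 for a, b in zip(trace, trace[1:]) if a * b < 0)


stable = simulate(rho=0.95, theta=0.3, steps=60)    # magnitude < 1: decays
unstable = simulate(rho=1.05, theta=0.3, steps=60)  # magnitude > 1: grows
slower = simulate(rho=1.0, theta=0.3, steps=60)     # small angle: slow oscillation
faster = simulate(rho=1.0, theta=0.6, steps=60)     # larger angle: faster oscillation
```

Changing only `rho` moves the trajectory between decaying and growing oscillations, while changing only `theta` alters how often the signal crosses zero, which mirrors the two manipulations applied in the ablation.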

We see very similar behavior in Figures A.3 and A.4. For magnitudes greater than 1, Figure A.3 shows an oscillating behavior of the digit sizes that increases with the magnitude. Again, the behavior seems unstable. Figure A.4 illustrates a familiar saturation behavior for magnitudes greater than 1. It also shows how KIDD is able to find the unseen dynamics of a digit in a partially visible frame. In the third row of Figure A.4, the digit “2” has the expected oscillation, even though it was seen neither in the input data nor in the self-supervision. This is an indicator of the ability of KIDD to discover the true dynamics of a system. KIDD also exhibits better reconstruction and prediction than the strongest baseline, ddpae.

Figure A.5 shows three particular cases of the Inelastic/Superelastic collision. The first example is a success, the second one is a partial success and the third one is a failure. In all cases we see a similar behaviour for the baselines. ddpae is the closest to our model in capturing the dynamics for prediction, with the difference that it leverages RNNs. scalor has a very good reconstruction, but its prediction is not comparable to the other tested architectures. drnet seems to decompose the scene and capture partially the dynamics and appearance. However, its performance is poor. For the top case, KIDD correctly decomposes the scene and predicts the inelastic and superelastic collisions with the result of an accurate and sharp prediction. For the center case, the performance is similar. However, we can observe that the final digits are a bit off from their actual trajectory. This is likely due to a bad modelling of the collisions. Finally, in the bottom example all methods fail to decompose the scene into its composing objects, as they appear overlapped.

Finally, Figure A.6 illustrates a failure case for the Cropped circular motion scenario. In this case, the input frames are heavily occluded: one of the digits (6) is fully visible but the other is not. As expected, none of the methods can model the occluded digit. However, given the uncertainty, KIDD has a sharper and more accurate prediction than ddpae.