Future Frame Prediction for Robot-assisted Surgery

by   Xiaojie Gao, et al.
The Chinese University of Hong Kong

Predicting future frames of robotic surgical video is an interesting and important yet extremely challenging problem, given that operative tasks may have complex dynamics. Existing approaches to future prediction of natural videos are based on either deterministic or stochastic models, including deep recurrent neural networks, optical flow, and latent space modeling. However, their potential for predicting meaningful movements of dual arm robots in surgical scenarios has not been tapped so far; this setting is typically more challenging than forecasting the independent motions of one-arm robots in natural scenarios. In this paper, we propose a ternary prior guided variational autoencoder (TPG-VAE) model for future frame prediction in robotic surgical video sequences. Besides the content distribution, our model learns a motion distribution, which is novel in handling the small movements of surgical tools. Furthermore, we add invariant prior information from the gesture class to the generation process to constrain the latent space of our model. To the best of our knowledge, this is the first time that the future frames of dual arm robots have been predicted with their unique characteristics, relative to general robotic videos, taken into account. Experiments on the suturing task of the public JIGSAWS dataset demonstrate that our model produces more stable and realistic future frame predictions.



1 Introduction

With advancements in robot-assisted surgery, visual data are crucial for surgical context awareness, which promotes operation reliability and patient safety. Owing to their rich representations and direct perceptibility, surgical videos have played an essential role in driving automation for tasks with clear clinical significance, such as surgical process monitoring [3, 4], gesture and workflow recognition [10, 11, 18], and surgical instrument detection and segmentation [25, 5, 14, 16]. However, these methods either focus only on providing semantic descriptions of the current situation or directly assume that future information is available; future scene prediction models for surgery therefore deserve investigation. With only past events available, future prediction serves as a crucial prerequisite for advanced tasks including generating alerts [3], boosting online recognition with additional information [33], and supporting the decision making of reinforcement learning agents [24, 11]. Besides supplying an extra source of reference, predicted frames could also be converted into a demonstration in an imitation style [31], which would accelerate the training of surgeons.

Recently, deep learning techniques have been applied to natural video prediction. To model temporal dependencies, early methods used the Long Short-Term Memory (LSTM) [13] or convolutional LSTM [28] to capture the temporal dynamics in videos [30, 35]. Villegas et al. utilized predicted high-level structures in videos to help generate long-term future frames [36]. Methods based on explicit decomposition or multi-frequency analysis were also explored [35, 6, 15], and action-conditional video prediction methods were developed for reinforcement learning agents [27, 8]. However, these deterministic methods may yield blurry generations because they do not account for multiple possible moving tendencies [7]. Stochastic video prediction models were therefore proposed to capture the full distributions of uncertain future frames; they mainly fall into four types: autoregressive models [19], flow-based generative models [23], generative adversarial networks [32], and VAE-based approaches [2, 7]. As a well-known VAE method, SVG-LP by Denton and Fergus uses a prior learned from content information rather than a standard Gaussian prior [7]. Diverse future frames can also be generated from stochastic high-level keypoints [26, 20]. Although these approaches achieve favorable results on general robotic videos [8, 2, 7, 23], the domain knowledge of dual arm robotic surgical videos, such as the limited inter-class variance and class-dependent movements, is not considered.

In contrast to general robotic videos, a complete surgical video consists of several sub-phases with limited inter-class variance, meaningful movements tied to specific sub-tasks, and more than one moving instrument, which makes this task extremely challenging. Another challenge in surgical robotic videos is that some actions follow intricate motion trajectories rather than the repeated or random patterns of general videos. Although predicting one-arm robotic videos was investigated in [8, 2], robots with more than one arm lead to more diverse frames and make the task even harder. Targeting the intricate movements of robotic arms, we utilize content and motion information jointly to assist in predicting the future frames of dual arm robots. In addition, experienced surgeons constantly refer to the class of the current gesture as prior knowledge to forecast future operations. How to incorporate content, motion, and class information into the generation model is therefore of great importance for boosting the prediction quality on robotic surgery videos.

In this paper, we propose a novel approach named Ternary Prior Guided Variational Autoencoder (TPG-VAE) for generating the future frames of dual arm robot-assisted surgical videos. Our method combines the learned content and motion priors with the constant class label prior to constrain the latent space of the generation model, which is consistent with how humans forecast by referring to various priors. Notably, while the diversity of future tendencies is represented as a distribution, the class label prior, a specific high-level target, remains invariant until the end of the current phase, which is highly different from general robotic videos. Our main contributions are summarized as follows. 1) A ternary prior guided variational autoencoder model is tailored for robot-assisted surgical videos; to the best of our knowledge, this is the first time that future scene prediction has been devised for dual arm medical robots. 2) Given the tangled gestures of the two arms, the changeable priors from content and motion are combined with the constant prior from the class of the current action to constrain the latent space of our model. 3) We extensively evaluate our approach on the suturing task of the public JIGSAWS dataset; our model outperforms baselines from general video prediction in both quantitative and qualitative evaluations, especially over the long term.

2 Method

2.1 Problem Formulation

Given a video clip x_{1:c}, we aim to generate a sequence x̂_{c+1:T} to represent the future frames, where x_t ∈ R^{w×h×n} is the t-th frame, with w, h, and n denoting the width, height, and number of channels, respectively. As shown in Fig. 1, we assume that x_t is generated by a random process involving the previous frames and a ternary latent variable z_t. This process is denoted by a conditional distribution p(x_t | x_{1:t−1}, z_t), which can be realized via an LSTM. The posterior distribution of z_t, q(z_t | x_{1:t}), can also be encoded using an LSTM. After obtaining z_t by sampling from the posterior, the modeled generation process is realized and the output of the neural network is adjusted to fit x_t. Meanwhile, the posterior distribution is fitted by a prior network so as to learn the diversity of the future frames.
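To make this autoregressive formulation concrete, the sketch below rolls future frames out one step at a time from the previous frame and a latent code. The small linear maps standing in for the learned encoder and decoder are hypothetical placeholders, not the paper's actual networks, and the frame dimensions are toy values.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for the learned components (hypothetical placeholders).
D_FRAME, D_LATENT = 8, 4
W_enc = rng.standard_normal((D_FRAME, D_FRAME)) * 0.1            # "encoder"
W_dec = rng.standard_normal((D_FRAME + D_LATENT, D_FRAME)) * 0.1  # "decoder"

def predict_next_frame(prev_frame, z):
    """One step of the conditional generation p(x_t | x_{t-1}, z_t)."""
    features = np.tanh(prev_frame @ W_enc)            # content features
    return np.tanh(np.concatenate([features, z]) @ W_dec)

def rollout(observed_frames, latents):
    """Autoregressively predict len(latents) future frames."""
    frames = list(observed_frames)
    for z in latents:
        frames.append(predict_next_frame(frames[-1], z))
    return frames[len(observed_frames):]

observed = [rng.standard_normal(D_FRAME) for _ in range(10)]
future = rollout(observed, [rng.standard_normal(D_LATENT) for _ in range(10)])
print(len(future))  # 10
```

Each predicted frame is fed back as the condition for the next step, which is why errors can accumulate over a long horizon.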

Figure 1: Illustration of the training process of the proposed model. The posterior latent variables of content and motion at time step t, together with the class label prior, try to reconstruct the original frame conditioned on previous frames. The prior distributions of content and motion, computed from information up to time step t−1, are optimized to fit the posterior distributions at time step t.

2.2 Decomposed Video Encoding Network

Extracting spatial and temporal features from videos is essential for modeling the dynamics of future trends. In this regard, we design encoders using a Convolutional Neural Network (CNN) for spatial information, followed by an LSTM to model the temporal dependency. The content encoder Enc_c, realized by a CNN based on VGG net [29], takes an observed frame as input and extracts content features as a one-dimensional hidden vector h_t = Enc_c(x_t). We use z_c to denote the latent code sampled from the content distribution, which is one part of z_t. To preserve temporal dependency, an LSTM is used to model the conditional distribution of the video content, and the posterior distribution of z_c is estimated as a Gaussian distribution N(μ_c(t), σ_c(t)) with its expectation and variance given by

μ_c(t), σ_c(t) = LSTM_c^post(Enc_c(x_t)).
To further obtain an overall distribution of the videos, a motion encoder Enc_m is adopted to capture the changeable movements of surgical tools. With a structure similar to the content encoder, the motion encoder observes the frame difference d_t between consecutive frames and outputs motion features as a one-dimensional hidden vector. Note that d_t is calculated directly by element-wise subtraction of the two frames after converting them to gray images. As the second part of z_t, we denote the random latent code from motion as z_m, and the motion features are used to estimate the posterior distribution of z_m as a Gaussian N(μ_m(t), σ_m(t)) with its expectation and variance given by

μ_m(t), σ_m(t) = LSTM_m^post(Enc_m(d_t)),  where d_t = gray(x_t) − gray(x_{t−1}).
It is worth mentioning that our motion encoder also plays the role of an attention mechanism, since the minor movements of instruments are captured by the difference between frames. The motion encoder helps alleviate the problem of limited inter-class variance, which is caused by the huge proportion of unchanged content in surgical frames. Although our method, like [35], explicitly separates the content and motion information in a video, we model their full distributions rather than deterministic features.
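The frame difference fed to the motion encoder can be computed as below; this is a minimal sketch that assumes standard BT.601 luminance weights for the gray conversion (the paper does not specify the conversion).

```python
import numpy as np

def to_gray(rgb):
    """Luminance conversion with ITU-R BT.601 weights (an assumed choice)."""
    return rgb @ np.array([0.299, 0.587, 0.114])

def frame_difference(x_prev, x_curr):
    """Element-wise difference of grayscale frames, as fed to the motion encoder."""
    return to_gray(x_curr) - to_gray(x_prev)

# A static background cancels out, so only moving pixels survive:
prev = np.zeros((4, 4, 3))
curr = prev.copy()
curr[1, 2] = [1.0, 1.0, 1.0]          # one "moved" pixel
d = frame_difference(prev, curr)
print(np.count_nonzero(d))  # 1
```

Because the unchanged background cancels exactly, the difference image acts like an attention map over the moving instruments, which is the effect described above.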

For the prior distributions of z_c and z_m, two LSTMs with the same structure are applied to model them as normal distributions N(μ̂_c(t), σ̂_c(t)) and N(μ̂_m(t), σ̂_m(t)), respectively:

μ̂_c(t), σ̂_c(t) = LSTM_c^prior(Enc_c(x_{t−1})),   μ̂_m(t), σ̂_m(t) = LSTM_m^prior(Enc_m(d_{t−1})).

The objective of the prior networks is to estimate the posterior distribution at time step t using information only up to t−1.

2.3 Ternary Latent Variable

Anticipating future developments based on inherent information can reduce uncertainty, which also holds in the robotic video scenario. Here, we apply the class label of the video to be predicted as the non-learned part of the latent variable z_t, using the available ground truth label. Even when ground truth labels are unavailable, they can be predicted, since gesture recognition has been solved with relatively high accuracy [17], e.g., 84.3% [10] on the robotic video dataset JIGSAWS [12, 1]. Given the surgical gesture label, we encode it as a one-hot vector z_l ∈ {0, 1}^K, where K is the number of gesture classes; the entry of the current gesture class is set to 1 and all others to 0. For each video clip, z_l is applied directly as part of the ternary latent variable in the generation process. Thus, the complete ternary latent variable of our method is written as

z_t = [z_c, z_m, z_l],

where z_c and z_m are sampled from their posterior distributions during training. Note that z_l remains unchanged for frames sharing the same class label.
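Assembling the ternary latent variable then amounts to a concatenation. The sketch below assumes 16-dimensional content and motion codes and four gesture classes, matching the latent dimensionality and the gesture subset used in the experiments; the function names are illustrative.

```python
import numpy as np

def one_hot(gesture_class, num_classes):
    """Constant class-label prior: 1 at the current gesture, 0 elsewhere."""
    z_l = np.zeros(num_classes)
    z_l[gesture_class] = 1.0
    return z_l

def ternary_latent(z_content, z_motion, gesture_class, num_classes=4):
    """Concatenate the two sampled codes with the fixed label prior."""
    return np.concatenate([z_content, z_motion,
                           one_hot(gesture_class, num_classes)])

z = ternary_latent(np.zeros(16), np.zeros(16), gesture_class=2)
print(z.shape, z.sum())  # (36,) 1.0
```

While the content and motion parts are re-sampled at every time step, the label part is built once per clip and reused unchanged, mirroring the constant prior described above.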

After acquiring z_t, we perform future frame prediction using an LSTM to keep the temporal dependencies and a decoder Dec to generate images. The features go through the two networks to produce the next frame as

x̂_t = Dec(LSTM_θ(Enc_c(x_{t−1}), z_t)).

To provide features of the static background, skip connections are employed from the content encoder at the last ground truth frame to the decoder, as in [7].

During inference, the posterior of the ternary latent variable is not available, and a prior estimate is needed to generate the latent codes at time step t. A common way to define the prior is to push the posterior distribution toward a standard normal distribution, from which the prior latent code is then sampled. However, this sampling strategy tends to lose the temporal dependencies between video frames. With a recurrent structure, the conditional relationship can instead be learned by a prior network [7]. We directly use the expectation of the prior distribution rather than a sampled latent code to produce the most likely prediction under the Gaussian latent assumption; moreover, choosing the best generation after sampling several times is not practical for online prediction. Hence, we directly use the outputs of the prior networks, and the ternary prior becomes

z_t = [μ̂_c(t), μ̂_m(t), z_l].

It is worth noting that our method considers stochasticity during training for comprehensive learning, while producing the most likely generation during testing.
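This train/test asymmetry can be summarized in one helper: sample with the re-parametrization trick during training, and return the distribution mean at test time. The function name and interface are illustrative, not the paper's code.

```python
import numpy as np

def latent_code(mu, logvar, training, rng=None):
    """Reparameterized sample during training; the mean at test time.

    Using the expectation gives the most likely code under a Gaussian,
    avoiding best-of-N sampling, which is impractical online.
    """
    if not training:
        return mu
    rng = rng or np.random.default_rng()
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * logvar) * eps

mu, logvar = np.array([0.5, -1.0]), np.array([0.0, 0.0])
print(latent_code(mu, logvar, training=False))  # the mean itself
```

Parameterizing the variance as log σ² keeps the standard deviation positive without constraints, matching the two-head (expectation, log-variance) output described in the implementation details.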

2.4 Learning Process

In order to deal with the intractable distribution of the latent variables, we train our neural networks by maximizing the following variational lower bound using the re-parametrization trick [22]:

L(θ, φ) = Σ_t [ E_q log p(x_t | z_t, x_{1:t−1}) − β KL( q(z_c | x_{1:t}) ‖ p(z_c | x_{1:t−1}) ) − β KL( q(z_m | x_{1:t}) ‖ p(z_m | x_{1:t−1}) ) ],

where β is used to balance the frame prediction error and the prior fitting error. As shown in Fig. 1, we use a reconstruction loss in place of the likelihood term [34], and the loss function to minimize is

L = Σ_t [ ‖x̂_t − x_t‖_1 + β KL( q(z_c | x_{1:t}) ‖ p(z_c | x_{1:t−1}) ) + β KL( q(z_m | x_{1:t}) ‖ p(z_m | x_{1:t−1}) ) ],

where ‖·‖_1 represents the ℓ1 loss.
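Under the Gaussian latent assumption, the prior-fitting term of the lower bound is a KL divergence between diagonal Gaussians, which has a closed form. A per-time-step loss can therefore be sketched as follows; the β value shown is illustrative, not the paper's setting.

```python
import numpy as np

def kl_diag_gauss(mu_q, logvar_q, mu_p, logvar_p):
    """KL( N(mu_q, var_q) || N(mu_p, var_p) ) for diagonal Gaussians."""
    var_q, var_p = np.exp(logvar_q), np.exp(logvar_p)
    return 0.5 * np.sum(logvar_p - logvar_q
                        + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0)

def step_loss(x_pred, x_true, post, prior, beta=1e-4):
    """l1 reconstruction plus beta-weighted prior-fitting terms.

    post/prior: lists of (mu, logvar) pairs for the content and motion codes.
    beta is an illustrative placeholder value.
    """
    recon = np.abs(x_pred - x_true).sum()
    kl = sum(kl_diag_gauss(*q, *p) for q, p in zip(post, prior))
    return recon + beta * kl

# KL vanishes when the prior matches the posterior exactly:
mu, lv = np.zeros(16), np.zeros(16)
print(kl_diag_gauss(mu, lv, mu, lv))  # 0.0
```

Minimizing the KL terms trains the prior networks to anticipate the posterior one step ahead, which is exactly what is needed at inference, when only the prior is available.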

3 Experimental Results

We evaluate our model for predicting surgical robotic motions on the dual arm da Vinci robot system [9]. We design experiments to investigate: 1) whether the complicated movements of the two robotic arms could be well predicted, 2) the effectiveness of the content and motion latent variables in our proposed video prediction model, and 3) the usefulness of the constant label prior in producing the future motions of the dual arm robots.

3.1 Dataset and Evaluation Metrics

We validate our method on the suturing task of the JIGSAWS dataset [12, 1], a public dataset recorded with the da Vinci surgical system. The gesture class labels comprise a set of 11 sub-tasks annotated by experts. We only choose gestures with a sufficient number of video clips (100), i.e., positioning needle (G2), pushing needle through tissue (G3), transferring needle from left to right (G4), and pulling suture with left hand (G6). The records of the first 6 users are used for training (470 sequences) and those of the remaining 2 users for testing (142 sequences), consistent with the leave-one-user-out (LOUO) setting in [12]. Every other frame in the dataset is chosen as input.

We report quantitative comparisons using VGG Cosine Similarity, Peak Signal-to-Noise Ratio (PSNR), and Structural Similarity (SSIM) scores [34] between ground truth and generated frames. VGG Cosine Similarity compares the output vectors of the last fully connected layer of a pre-trained VGG network. PSNR generally indicates reconstruction quality, while SSIM measures perceived quality.
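PSNR and the cosine-similarity comparison reduce to a few lines of numpy, shown below as a sketch; SSIM is more involved and omitted here. Frames are assumed to be scaled to [0, 1], and the feature vectors would come from a pre-trained VGG network in practice.

```python
import numpy as np

def psnr(x, y, max_val=1.0):
    """Peak Signal-to-Noise Ratio in dB for images scaled to [0, max_val]."""
    mse = np.mean((x - y) ** 2)
    return 10.0 * np.log10(max_val ** 2 / mse)

def cosine_similarity(a, b):
    """Cosine similarity between two feature vectors (e.g. VGG features)."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

x = np.full((64, 64), 0.5)
y = x + 0.1                       # uniform error of 0.1 -> MSE = 0.01
print(round(psnr(x, y), 1))       # 20.0
print(cosine_similarity(np.ones(8), np.ones(8)))  # 1.0
```

Note that PSNR is driven purely by pixel-wise error, whereas the feature-space cosine similarity is closer to a perceptual comparison, which is why the two can rank methods differently.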

3.2 Implementation Details

The content encoder and the decoder use the same architecture as the VGG-based encoder in [7], while the motion encoder uses only one convolutional layer after each pooling layer, with one-eighth the number of channels. The outputs of the two encoders are both 128-dimensional. All LSTMs have 256 cells and a single layer, except the frame-predicting LSTM, which has two layers. A linear embedding layer is employed for each LSTM. The hidden output of the frame-predicting LSTM is followed by a fully connected layer with tanh activation before entering the decoder, while the hidden outputs of the remaining LSTMs are followed by two separate fully connected layers producing the expectation and the logarithm of the variance.

Following the previous study [7], we downsample the resolution of the videos to save time and storage. The dimensionalities of the encoder features and the Gaussian latent variables are 128 and 16, respectively. We train all components of our method end-to-end using the Adam optimizer [21]. For all experiments, we train each model to predict 10 time steps into the future conditioned on 10 observed frames, and all models are trained for 200 epochs.

Based on their released code, we re-implement MCnet [35] and SVG-LP [7] on the robotic dataset as representative deterministic and stochastic prediction methods. We also report two ablation settings: 1) SVG-LP*: SVG-LP trained with the ℓ1 loss and tested without sampling; 2) ML-VAE: our full model without the content latent variable. We randomly choose 100 video clips from the testing set with equal numbers of each gesture. We then test each model by predicting 20 subsequent frames conditioned on 10 observed frames; the testing horizon being longer than the training one probes the generalization capability of each model. For SVG-LP, we draw 10 samples from the model for each test sequence and choose the best one under each metric. The other VAE-based methods directly use the expectation of the latent distribution without sampling at inference.
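The best-of-10 evaluation protocol used for SVG-LP can be sketched as a generic helper; the metric shown (MSE) is just for illustration, and in the experiments it would be PSNR, SSIM, or VGG cosine similarity.

```python
import numpy as np

def best_of_n(samples, ground_truth, metric, higher_better=True):
    """Pick the sampled prediction that scores best against the ground truth.

    Mirrors the protocol where several samples are drawn from a stochastic
    model and the best one is kept per metric.
    """
    scores = [metric(s, ground_truth) for s in samples]
    idx = int(np.argmax(scores) if higher_better else np.argmin(scores))
    return samples[idx], scores[idx]

mse = lambda a, b: float(np.mean((a - b) ** 2))
gt = np.zeros(4)
samples = [np.full(4, v) for v in (0.3, 0.1, 0.5)]
best, score = best_of_n(samples, gt, mse, higher_better=False)
print(best[0], round(score, 2))  # 0.1 0.01
```

Because best-of-N selection requires the ground truth, it is only usable for offline evaluation; the expectation-based inference described in Sec. 2.3 avoids it entirely.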

Figure 2: Qualitative results showing the gesture of G2 among different models. Compared to other VAE-based methods, our model captures the moving tendency of the left hand while other methods only copy the last ground truth frame. Frames with blue edging indicate the ground truth while the rest are generated by each model.
Figure 3: Quantitative evaluation on the average of the three metrics towards the 100 testing clips. The dotted line indicates the frame number the models are trained to predict up to; further results beyond this line display their generalization ability. For the reported metrics, higher is better.

3.3 Results

3.3.1 Qualitative Evaluation

Fig. 2 shows outcomes for gesture G2 from each model. Generations from MCnet are sharp for the initial time steps, but rapidly distort later on. Although MCnet also utilizes skip connections, it cannot produce a crisp background over a longer time span, which implies that the back-and-forth moving patterns of the two arms can hardly be learned by the deterministic model. SVG-LP and SVG-LP* tend to predict static images of the indicated gesture, implying that both misunderstand the purpose of the current gesture. ML-VAE also tends to lose the movement of the left hand, which confirms the importance of the content encoder. Capturing the movements of both arms, our TPG-VAE model gives the predictions closest to the ground truth.

3.3.2 Quantitative Evaluation

We compute VGG cosine similarity, PSNR, and SSIM for the aforementioned models on the 100 unseen test sequences. Fig. 3 plots the averages of all three metrics on the testing set. For VGG Cosine Similarity, MCnet shows the worst curve because of its low generation quality, while the other methods behave similarly; the reason might be that frames of good quality are perceptually similar to each other. For PSNR and SSIM, all methods maintain a relatively high level at the beginning but deteriorate further into the future. SVG-LP* outperforms SVG-LP, which indicates the ℓ1 loss is more appropriate than MSE for this task. Both ML-VAE and TPG-VAE outperform the two published methods, MCnet and SVG-LP, particularly at later time steps. Interestingly, MCnet demonstrates poor generalization capacity: its curves on all three metrics deteriorate faster after crossing the dotted lines in Fig. 3. Our full model exhibits a stronger capability to retain image quality over a longer time span. Note that the methods that do not sample random variables at test time also achieve favorable results, suggesting that the movements in the JIGSAWS dataset have relatively clear objectives.

                          PSNR (dB)                                          SSIM
               t=15        t=20        t=25        t=30         t=15         t=20         t=25         t=30
MCnet [35]     25.34±2.58  23.92±2.46  20.53±2.06  18.79±1.83   0.874±0.053  0.836±0.058  0.712±0.074  0.614±0.073
SVG-LP [7]     27.47±3.82  24.62±4.21  23.06±4.20  22.14±4.09   0.927±0.054  0.883±0.080  0.852±0.089  0.832±0.088
SVG-LP*        27.85±3.57  25.09±4.13  23.30±4.30  22.30±4.21   0.933±0.046  0.893±0.072  0.857±0.087  0.836±0.088
M-VAE          27.74±3.67  25.14±4.09  23.24±4.30  22.15±4.10   0.932±0.050  0.894±0.072  0.857±0.088  0.834±0.087
CM-VAE         27.44±3.83  25.09±4.07  23.02±4.19  22.16±4.16   0.927±0.056  0.893±0.075  0.853±0.087  0.834±0.088
CL-VAE         28.00±3.73  25.32±4.15  23.49±4.34  22.24±4.28   0.935±0.042  0.897±0.073  0.862±0.087  0.835±0.088
ML-VAE         28.24±3.51  25.77±4.02  23.95±4.26  22.28±4.26   0.936±0.046  0.903±0.071  0.870±0.084  0.836±0.088
TPG-VAE (ours) 26.26±3.17  26.13±3.85  24.88±3.68  23.67±3.50   0.917±0.048  0.911±0.060  0.892±0.067  0.871±0.071
Table 1: Comparison of predicted results at different time steps (mean±std); the left four columns report PSNR and the right four SSIM.

Table 1 lists the mean and standard deviation of each method's performance on the 100 testing clips. VGG Cosine Similarity is omitted because it cannot distinguish the results of the VAE-based models. All methods degrade as the time step increases. ML-VAE again exhibits outcomes superior to the other compared methods, which verifies the effectiveness of the proposed motion encoder and class prior. With the ternary prior as high-level guidance, our TPG-VAE maintains high generation quality while showing stable performance with the smallest standard deviation.

3.3.3 Further Ablations

Table 1 also shows additional ablations of our TPG-VAE. Each of the following settings is evaluated to justify the necessity of the corresponding component: 1) M-VAE: our full model without the content and class-label latent variables; 2) CM-VAE: our full model without the class-label latent variable; 3) CL-VAE: our full model without the motion latent variable. All three ablated models outperform SVG-LP. Comparing CM-VAE with CL-VAE, we find that class labels contribute more than motion latent variables, since class information removes more uncertainty; this suggests that recognizing before predicting is a recommended strategy. Note that we do not consider an ablation with only the label prior, since it degenerates into a deterministic model that cannot capture the diversity of the videos.

Figure 4: Qualitative comparison indicating the gesture of G4. Our method generates images with high quality and predicts the actual location of the left arm, while other methods tend to lose the left arm. Frames with blue edging indicate the ground truth while the rest are generated by each model.

3.3.4 Discussion on Dual Arm Cases

The two arms of the da Vinci robot cooperate to achieve a given task, so their movements are highly entangled, which makes prediction very challenging. Without enough prior information, the predicted frames may become unreasonable due to the loss of temporal consistency. Fig. 4 shows the results for gesture G4, where the left arm gradually enters the visual field. For the first 10 predicted frames, all models capture that the left hand is moving toward the center. For the remaining 10 predictions, SVG-LP tends to lose the left hand because it misunderstands the current phase. Owing to more complete guidance, i.e., the ternary prior, our TPG-VAE successfully predicts the movements of both arms while producing crisp outcomes, which supports our assumption that additional references help in predicting dual arm movements.

4 Conclusion and Future Work

In this work, we present a novel VAE-based method for conditional robotic video prediction, the first such work for dual arm robots. The proposed model employs learned and intrinsic prior information as guidance to generate future scenes conditioned on the observed frames. The stochastic VAE-based method is adapted into a deterministic approach at test time by directly using the expectation of the distribution without sampling. Our method outperforms the baseline methods on the challenging dual arm robotic surgical video dataset. Future work will explore higher-resolution generation and applying the predicted frames to other advanced tasks.

4.0.1 Acknowledgements.

This work was supported by Key-Area Research and Development Program of Guangdong Province, China (2020B010165004), Hong Kong RGC TRS Project No.T42-409/18-R, National Natural Science Foundation of China with Project No. U1813204, and CUHK Shun Hing Institute of Advanced Engineering (project MMT-p5-20).


  • [1] N. Ahmidi, L. Tao, S. Sefati, Y. Gao, C. Lea, B. B. Haro, L. Zappella, S. Khudanpur, R. Vidal, and G. D. Hager (2017) A dataset and benchmarks for segmentation and recognition of gestures in robotic surgery. IEEE. Trans. Biomed. Eng.. Cited by: §2.3, §3.1.
  • [2] M. Babaeizadeh, C. Finn, D. Erhan, R. H. Campbell, and S. Levine (2018) Stochastic variational video prediction. In ICLR, Cited by: §1, §1.
  • [3] B. Bhatia, T. Oates, Y. Xiao, and P. Hu (2007) Real-time identification of operating room state from video. In AAAI, Cited by: §1.
  • [4] N. Bricon-Souf and C. R. Newman (2007) Context awareness in health care: A review. Int. J. Med. Inform.. Cited by: §1.
  • [5] E. Colleoni, S. Moccia, X. Du, E. De Momi, and D. Stoyanov (2019) Deep learning based robotic tool detection and articulation estimation with spatio-temporal layers. RA-L. Cited by: §1.
  • [6] E. Denton and V. Birodkar (2017) Unsupervised learning of disentangled representations from video. In NeurIPS, Cited by: §1.
  • [7] E. Denton and R. Fergus (2018) Stochastic video generation with a learned prior. In ICML, Cited by: §1, §2.3, §2.3, §3.2, §3.2, §3.2, Table 1.
  • [8] C. Finn, I. Goodfellow, and S. Levine (2016) Unsupervised learning for physical interaction through video prediction. In NeurIPS, Cited by: §1, §1.
  • [9] C. Freschi, V. Ferrari, F. Melfi, M. Ferrari, F. Mosca, and A. Cuschieri (2013) Technical review of the da Vinci surgical telemanipulator. Int. J. Med. Robot. Cited by: §3.
  • [10] I. Funke, S. Bodenstedt, F. Oehme, F. von Bechtolsheim, J. Weitz, and S. Speidel (2019) Using 3D convolutional neural networks to learn spatiotemporal features for automatic surgical gesture recognition in video. In MICCAI, Cited by: §1, §2.3.
  • [11] X. Gao, Y. Jin, Q. Dou, and P. Heng (2020) Automatic gesture recognition in robot-assisted surgery with reinforcement learning and tree search. In ICRA, Cited by: §1.
  • [12] Y. Gao, S. S. Vedula, C. E. Reiley, N. Ahmidi, B. Varadarajan, H. C. Lin, L. Tao, L. Zappella, B. Béjar, D. D. Yuh, et al. (2014) JHU-ISI gesture and skill assessment working set (JIGSAWS): A surgical activity dataset for human motion modeling. In MICCAI Workshop: M2CAI, Cited by: §2.3, §3.1.
  • [13] S. Hochreiter and J. Schmidhuber (1997) Long short-term memory. Neural Computation. Cited by: §1.
  • [14] M. Islam, D. A. Atputharuban, R. Ramesh, and H. Ren (2019) Real-time instrument segmentation in robotic surgery using auxiliary supervised deep adversarial learning. RA-L. Cited by: §1.
  • [15] B. Jin, Y. Hu, Q. Tang, J. Niu, Z. Shi, Y. Han, and X. Li (2020) Exploring spatial-temporal multi-frequency analysis for high-fidelity and temporal-consistency video prediction. In CVPR, Cited by: §1.
  • [16] Y. Jin, K. Cheng, Q. Dou, and P. Heng (2019) Incorporating temporal prior from motion flow for instrument segmentation in minimally invasive surgery video. In MICCAI, Cited by: §1.
  • [17] Y. Jin, Q. Dou, H. Chen, L. Yu, J. Qin, C. Fu, and P. Heng (2017) SV-RCNet: workflow recognition from surgical videos using recurrent convolutional network. IEEE Trans. Med. Imaging. Cited by: §2.3.
  • [18] Y. Jin, H. Li, Q. Dou, H. Chen, J. Qin, C. Fu, and P. Heng (2020) Multi-task recurrent convolutional network with correlation loss for surgical video analysis. Med. Image Anal.. Cited by: §1.
  • [19] N. Kalchbrenner, A. Oord, K. Simonyan, I. Danihelka, O. Vinyals, A. Graves, and K. Kavukcuoglu (2017) Video pixel networks. In ICML, Cited by: §1.
  • [20] Y. Kim, S. Nam, I. Cho, and S. J. Kim (2019) Unsupervised keypoint learning for guiding class-conditional video prediction. In NeurIPS, Cited by: §1.
  • [21] D. P. Kingma and J. Ba (2014) Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §3.2.
  • [22] D. P. Kingma and M. Welling (2014) Auto-encoding variational bayes. In ICLR, Cited by: §2.4.
  • [23] M. Kumar, M. Babaeizadeh, D. Erhan, C. Finn, S. Levine, L. Dinh, and D. Kingma (2020) VideoFlow: A Conditional Flow-Based Model for Stochastic Video Generation. In ICLR, Cited by: §1.
  • [24] D. Liu and T. Jiang (2018) Deep reinforcement learning for surgical gesture segmentation and classification. In MICCAI, Cited by: §1.
  • [25] F. Milletari, N. Rieke, M. Baust, M. Esposito, and N. Navab (2018) CFCM: Segmentation via coarse to fine context memory. In MICCAI, Cited by: §1.
  • [26] M. Minderer, C. Sun, R. Villegas, F. Cole, K. P. Murphy, and H. Lee (2019) Unsupervised learning of object structure and dynamics from videos. In NeurIPS, Cited by: §1.
  • [27] J. Oh, X. Guo, H. Lee, R. L. Lewis, and S. Singh (2015) Action-conditional video prediction using deep networks in Atari games. In NeurIPS, Cited by: §1.
  • [28] X. Shi, Z. Chen, H. Wang, D. Yeung, W. Wong, and W. Woo (2015) Convolutional LSTM network: a machine learning approach for precipitation nowcasting. In NeurIPS, Cited by: §1.
  • [29] K. Simonyan and A. Zisserman (2015) Very deep convolutional networks for large-scale image recognition. In ICLR, Cited by: §2.2.
  • [30] N. Srivastava, E. Mansimov, and R. Salakhudinov (2015) Unsupervised learning of video representations using LSTMs. In ICML, Cited by: §1.
  • [31] A. K. Tanwani, P. Sermanet, A. Yan, R. Anand, M. Phielipp, and K. Goldberg (2020) Motion2Vec: Semi-Supervised Representation Learning from Surgical Videos. In ICRA, Cited by: §1.
  • [32] S. Tulyakov, M. Liu, X. Yang, and J. Kautz (2018) MoCoGAN: Decomposing motion and content for video generation. In CVPR, Cited by: §1.
  • [33] A. P. Twinanda, S. Shehata, D. Mutter, J. Marescaux, M. De Mathelin, and N. Padoy (2016) EndoNet: A deep architecture for recognition tasks on laparoscopic videos. IEEE Trans. Med. Imaging. Cited by: §1.
  • [34] R. Villegas, A. Pathak, H. Kannan, D. Erhan, Q. V. Le, and H. Lee (2019) High fidelity video prediction with large stochastic recurrent neural networks. In NeurIPS, Cited by: §2.4, §3.1.
  • [35] R. Villegas, J. Yang, S. Hong, X. Lin, and H. Lee (2017) Decomposing motion and content for natural video sequence prediction. In ICLR, Cited by: §1, §2.2, §3.2, Table 1.
  • [36] R. Villegas, J. Yang, Y. Zou, S. Sohn, X. Lin, and H. Lee (2017) Learning to generate long-term future via hierarchical prediction. In ICML, Cited by: §1.