1 Introduction
With advancements in robot-assisted surgery, visual data are crucial for surgical context awareness, which promotes operation reliability and patient safety. Owing to their rich representations and direct perceptibility, surgical videos have played an essential role in driving automation on tasks of obvious clinical significance, such as surgical process monitoring [3, 4], gesture and workflow recognition [10, 11, 18], and surgical instrument detection and segmentation [25, 5, 14, 16]. However, these methods either only provide semantic descriptions of current situations or directly assume that future information is available; hence, future scene prediction models for surgery deserve investigation. With only past events available, future prediction serves as a crucial prerequisite for advanced tasks including generating alerts [3], boosting online recognition with additional information [33], and supporting the decision making of reinforcement learning agents [24, 11]. Besides supplying an extra source of reference, predicted frames could also be converted into an entity demonstration in an imitation style [31], which would indeed accelerate the training of surgeons.
Recently, deep learning techniques have been applied to natural video prediction problems. To model the temporal dependencies, early methods used the Long Short-Term Memory (LSTM)
[13] or convolutional LSTM [28] to capture the temporal dynamics in videos [30, 35]. Villegas et al. utilized predicted high-level structures in videos to help generate long-term future frames [36]. Methods based on explicit decomposition or multi-frequency analysis were also explored [35, 6, 15]. Action-conditional video prediction methods were developed for reinforcement learning agents [27, 8]. However, these deterministic methods may yield blurry generations because they fail to consider multiple moving tendencies [7]. Thus, stochastic video prediction models have been proposed to capture the full distributions of uncertain future frames, and they are mainly divided into four types: autoregressive models [19], flow-based generative models [23], generative adversarial networks [32], and VAE-based approaches [2, 7]. As a well-known VAE-based method, the SVG-LP model of Denton and Fergus uses a prior learned from content information rather than a standard Gaussian prior [7]. Diverse future frames can also be generated based on stochastic high-level keypoints [26, 20]. Although these approaches obtain favorable results on general robotic videos [8, 2, 7, 23], the domain knowledge in dual arm robotic surgical videos, such as the limited inter-class variance and class-dependent movements, is not considered.
In contrast to general robotic videos, a complete surgical video consists of several sub-phases with limited inter-class variances, meaningful movements of specific sub-tasks, and more than one moving instrument, which makes this task extremely challenging. Another challenge in surgical robotic videos is that some actions have more complicated motion trajectories than the repeated or random patterns in general videos. Although predicting one-arm robotic videos was investigated in [8, 2], robots with more than one arm lead to more diverse frames and make the task even more difficult. Targeting the intricate movements of robotic arms, we utilize the content and motion information jointly to assist in predicting the future frames of dual arm robots. In addition, experienced surgeons constantly refer to the classes of current gestures as prior knowledge to forecast future operations. How to incorporate content, motion, and class information into the generation model is thus of great importance to boosting the prediction outcomes for robotic surgery videos.
In this paper, we propose a novel approach named Ternary Prior Guided Variational Autoencoder (TPG-VAE) for generating the future frames of dual arm robot-assisted surgical videos. Our method combines the learned content and motion priors with the constant class label prior to constrain the latent space of the generation model, which is consistent with the forecasting procedure of humans, who refer to various priors. Notably, while the diversity of future tendencies is represented as a distribution, the class label prior, a specific high-level target, remains invariant until the end of the current phase, which is highly different from general robotic videos. Our main contributions are summarized as follows. 1) A ternary prior guided variational autoencoder model is tailored for robot-assisted surgical videos. To the best of our knowledge, this is the first time that future scene prediction has been devised for dual arm medical robots. 2) Given the tangled gestures of the two arms, the changeable priors from content and motion are combined with the constant prior from the class of the current action to constrain the latent space of our model. 3) We extensively evaluate our approach on the suturing task of the public JIGSAWS dataset. Our model outperforms baseline methods from general video prediction in both quantitative and qualitative evaluations, especially for the long-term future.
2 Method
2.1 Problem Formulation
Given a video clip $x_{1:c}$, we aim to generate a sequence $\hat{x}_{c+1:T}$ to represent the future frames, where $x_t \in \mathbb{R}^{W \times H \times C}$ is the $t$-th frame, with $W$, $H$, and $C$ denoting the width, height, and number of channels, respectively. As shown in Fig. 1, we assume that $x_t$ is generated by some random process involving its previous frame $x_{t-1}$ and a ternary latent variable $z_t$. This process is denoted by a conditional distribution $p(x_t \mid x_{t-1}, z_t)$, which can be realized via an LSTM. The posterior distribution of $z_t$ can also be encoded using an LSTM. Then, after obtaining $z_t$ by sampling from the posterior distribution, the modeled generation process is realized and the output $\hat{x}_t$ of the neural network is adjusted to fit $x_t$. Meanwhile, the posterior distribution is fitted by a prior network to learn the diversity of the future frames.
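To make the autoregressive structure of this process concrete, the following is a minimal toy sketch in numpy. The linear map `W_gen` is a hypothetical stand-in for the LSTM and decoder, and the dimensions `D` and `Z` are placeholder values, not the paper's; the point is only the rollout pattern: each frame is produced from the previous frame and a freshly sampled latent.

```python
import numpy as np

rng = np.random.default_rng(0)

D, Z = 8, 4  # toy feature and latent dimensions (placeholders)
W_gen = rng.normal(size=(D, D + Z)) * 0.1  # stand-in for the LSTM/decoder

def generate(x_prev, z_t):
    """One step of the modeled process p(x_t | x_{t-1}, z_t), here a linear map."""
    return W_gen @ np.concatenate([x_prev, z_t])

# Roll out 5 future "frames" from an initial observation, sampling z_t each step.
x = rng.normal(size=D)
frames = []
for _ in range(5):
    z = rng.normal(size=Z)  # stand-in for sampling the ternary latent variable
    x = generate(x, z)
    frames.append(x)
```

In the actual model, `generate` is a recurrent network conditioned on encoded features, and `z` is drawn from the learned posterior (training) or prior (inference) rather than a standard normal.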
2.2 Decomposed Video Encoding Network
Extracting spatial and temporal features from videos is essential for modeling the dynamics of future trends. In this regard, we design encoders using a Convolutional Neural Network (CNN) for spatial information, followed by an LSTM to model the temporal dependency. The content encoder $Enc_c$, realized by a CNN based on the VGG net [29], takes a frame $x_t$ as input and extracts content features as a one-dimensional hidden vector $h_t^c = Enc_c(x_t)$. We use $z_t^c$ to denote the latent code sampled from the content distribution, which is part of $z_t$. To preserve the temporal dependency, we use an $\mathrm{LSTM}_{\phi_c}$ to model the conditional distribution of the video content. The posterior distribution of $z_t^c$ is estimated as a Gaussian distribution $\mathcal{N}(\mu_{\phi_c}(t), \sigma_{\phi_c}(t))$ with its expectation and variance given by

$\mu_{\phi_c}(t), \sigma_{\phi_c}(t) = \mathrm{LSTM}_{\phi_c}(h_t^c).$   (1)
To further obtain an overall distribution of videos, a motion encoder is adopted to capture the changeable movements of surgical tools. With a similar structure to the content encoder, the motion encoder $Enc_m$ observes the frame difference $\Delta x_t = x_t - x_{t-1}$ and outputs motion features as a one-dimensional hidden vector $h_t^m = Enc_m(\Delta x_t)$. Note that $\Delta x_t$ is calculated directly as the element-wise subtraction of the two frames, which are converted to gray images in advance. As another part of $z_t$, we denote the random latent code from motion as $z_t^m$, and an $\mathrm{LSTM}_{\phi_m}$ is utilized to calculate the posterior distribution of $z_t^m$ as a Gaussian distribution $\mathcal{N}(\mu_{\phi_m}(t), \sigma_{\phi_m}(t))$ with its expectation and variance given by

$\mu_{\phi_m}(t), \sigma_{\phi_m}(t) = \mathrm{LSTM}_{\phi_m}(h_t^m).$   (2)
It is worth mentioning that our motion encoder also plays the role of an attention mechanism, since the minor movements of instruments are captured by the difference between frames. The motion encoder helps alleviate the problem of limited inter-class variance, which is caused by the huge proportion of unchanged content in surgical frames. Although our method also explicitly separates the content and motion information in a video, as in [35], we consider the comprehensive distribution rather than deterministic features.
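The frame-difference input described above is simple to sketch. The snippet below is an illustrative numpy version under the assumption of RGB frames in [0, 1]; the luma weights are a common grayscale convention, not specified by the paper.

```python
import numpy as np

def to_gray(frame):
    """Luma-weighted grayscale conversion of an HxWx3 RGB frame in [0, 1]."""
    return frame @ np.array([0.299, 0.587, 0.114])

def motion_input(x_t, x_prev):
    """Element-wise difference of consecutive grayscale frames, as described."""
    return to_gray(x_t) - to_gray(x_prev)

rng = np.random.default_rng(0)
a, b = rng.random((64, 64, 3)), rng.random((64, 64, 3))
d = motion_input(b, a)
assert d.shape == (64, 64)
# Static regions cancel out: identical frames give an all-zero difference map,
# which is why the difference acts like an attention map over moving tools.
assert np.allclose(motion_input(a, a), 0.0)
```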
For the prior distributions of $z_t^c$ and $z_t^m$, two LSTMs with the same structure are applied to model them as normal distributions $\mathcal{N}(\mu_{\psi_c}(t), \sigma_{\psi_c}(t))$ and $\mathcal{N}(\mu_{\psi_m}(t), \sigma_{\psi_m}(t))$, respectively, as

$\mu_{\psi_c}(t), \sigma_{\psi_c}(t) = \mathrm{LSTM}_{\psi_c}(h_{t-1}^c), \quad \mu_{\psi_m}(t), \sigma_{\psi_m}(t) = \mathrm{LSTM}_{\psi_m}(h_{t-1}^m).$   (3)

The objective of the prior networks is to estimate the posterior distribution of time step $t$ with information only up to $t-1$.
2.3 Ternary Latent Variable
Anticipating future developments based on some inherent information can reduce uncertainty, which also holds for the robotic video scenario. Here, we apply the class label information of the video to be predicted, i.e., the available ground truth label, as the non-learned part of the latent variable $z_t$. Even if ground truth labels are not available, they could be predicted, since the gesture recognition problem has been solved with relatively high accuracy [17], e.g., 84.3% [10] on the robotic video dataset JIGSAWS [12, 1]. With the obtained surgical gesture label, we encode it as a one-hot vector $z^l \in \{0, 1\}^K$, where $K$ is the number of gesture classes. For each video clip, $z^l$ is directly applied as part of the ternary latent variable in the generation process by setting the entry of the current gesture class to 1 and all others to 0. Thus, the complete ternary latent variable of our method is written as

$z_t = [z_t^c, z_t^m, z^l],$   (4)

where $z_t^c$ and $z_t^m$ are sampled from $\mathcal{N}(\mu_{\phi_c}(t), \sigma_{\phi_c}(t))$ and $\mathcal{N}(\mu_{\phi_m}(t), \sigma_{\phi_m}(t))$ during training, respectively. Note that $z^l$ remains unchanged for frames with the same class label.
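Assembling the ternary latent variable can be illustrated in a few lines of numpy. This is a hedged sketch: the latent dimension 16 matches the paper's setup, but `K = 10` is a placeholder class count, and the posterior parameters here are dummies standing in for the LSTM outputs of Eqs. (1) and (2).

```python
import numpy as np

rng = np.random.default_rng(0)
K = 10  # number of gesture classes (placeholder value)

def one_hot(label, k=K):
    """Constant class-label prior: 1 at the current gesture class, 0 elsewhere."""
    v = np.zeros(k)
    v[label] = 1.0
    return v

def sample_gaussian(mu, log_var):
    """Reparameterised sample z = mu + sigma * eps from a diagonal Gaussian."""
    return mu + np.exp(0.5 * log_var) * rng.normal(size=mu.shape)

mu_c, lv_c = np.zeros(16), np.zeros(16)  # dummy content posterior parameters
mu_m, lv_m = np.zeros(16), np.zeros(16)  # dummy motion posterior parameters

z_c = sample_gaussian(mu_c, lv_c)  # changeable content prior (sampled)
z_m = sample_gaussian(mu_m, lv_m)  # changeable motion prior (sampled)
z_l = one_hot(3)                   # constant within a gesture phase

z_t = np.concatenate([z_c, z_m, z_l])  # the ternary latent variable, Eq. (4)
```

At inference time, `z_c` and `z_m` would be replaced by the prior means rather than samples, as described in Section 2.3.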
After acquiring $z_t$, we can perform future frame prediction using an $\mathrm{LSTM}_\theta$ to keep the temporal dependencies and a decoder $Dec$ to generate images. The features go through the two neural networks to produce the next frame as

$g_t = \mathrm{LSTM}_\theta(h_{t-1}^c, z_t), \quad \hat{x}_t = Dec(g_t).$   (5)
To provide features of the static background, skip connections are employed from the content encoder at the last ground truth frame to the decoder, as in [7].
During inference, the posterior information of the ternary latent variable is not available, and a prior estimation is needed to generate latent codes at time step $t$. A general way to define the prior is to push the posterior distribution close to a standard normal distribution, from which the prior latent code is then sampled. However, this sampling strategy tends to lose the temporal dependencies between video frames. Employing a recurrent structure, the conditional relationship can instead be learned by a prior neural network [7]. We directly use the expectation of the prior distribution rather than a sampled latent code to produce the most likely prediction under the Gaussian latent distribution assumption. Another reason is that choosing the best generation after sampling several times is not practical for online prediction scenarios. Hence, we directly use the outputs of Eq. (3), and the ternary prior is replaced by

$z_t = [\mu_{\psi_c}(t), \mu_{\psi_m}(t), z^l].$   (6)
It is worth noting that our method considers the stochasticity during training for an overall learning, while producing the most likely generation during testing.
2.4 Learning Process
In order to deal with the intractable distribution of the latent variables, we train our neural networks by maximizing the following variational lower bound using the reparametrization trick [22]:

$\mathcal{L}_{\theta,\phi} = \sum_{t} \big[ \mathbb{E}_{q_\phi(z_t \mid x_{1:t})} \log p_\theta(x_t \mid x_{t-1}, z_t) - \beta \, D_{KL}\big(q_\phi(z_t \mid x_{1:t}) \,\|\, p_\psi(z_t \mid x_{1:t-1})\big) \big],$   (7)
where $\beta$ is used to balance the frame prediction error and the prior fitting error. As shown in Fig. 1, we use an $\ell_1$ reconstruction loss to replace the likelihood term [34], and the loss function to minimize is

$\mathcal{L} = \sum_t \big[ \|x_t - \hat{x}_t\|_1 + \beta \, D_{KL}\big(q_\phi(z_t \mid x_{1:t}) \,\|\, p_\psi(z_t \mid x_{1:t-1})\big) \big],$   (8)

where $\|\cdot\|_1$ represents the $\ell_1$ loss.
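Because both the posterior and the prior are diagonal Gaussians, the KL term has a closed form, and the per-step training loss can be sketched directly in numpy. This is an illustrative sketch, not the paper's implementation; the value of `beta` is a placeholder, and log-variances are used as is common in VAE code.

```python
import numpy as np

def kl_diag_gaussians(mu_q, lv_q, mu_p, lv_p):
    """Closed-form KL(q || p) for diagonal Gaussians given means and log-variances."""
    return 0.5 * np.sum(
        lv_p - lv_q + (np.exp(lv_q) + (mu_q - mu_p) ** 2) / np.exp(lv_p) - 1.0
    )

def step_loss(x, x_hat, mu_q, lv_q, mu_p, lv_p, beta=1e-4):
    """l1 reconstruction error plus the beta-weighted prior-fitting term of Eq. (8).

    beta is a placeholder value here, not the paper's setting.
    """
    recon = np.abs(x - x_hat).sum()
    return recon + beta * kl_diag_gaussians(mu_q, lv_q, mu_p, lv_p)

mu = np.zeros(16)
# When posterior and prior coincide, the KL term vanishes and only the
# reconstruction term remains.
assert np.isclose(kl_diag_gaussians(mu, mu, mu, mu), 0.0)
assert np.isclose(step_loss(np.ones(4), np.zeros(4), mu, mu, mu, mu), 4.0)
```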
3 Experimental Results
We evaluate our model for predicting surgical robotic motions on the dual arm da Vinci robot system [9]. We design experiments to investigate: 1) whether the complicated movements of the two robotic arms could be well predicted, 2) the effectiveness of the content and motion latent variables in our proposed video prediction model, and 3) the usefulness of the constant label prior in producing the future motions of the dual arm robots.
3.1 Dataset and Evaluation Metrics
We validate our method on the suturing task of the JIGSAWS dataset [12, 1], a public dataset recorded with the da Vinci surgical system. The gesture class labels comprise a set of 11 sub-tasks annotated by experts. We only choose gestures with a sufficient number of video clips (≥100), i.e., positioning needle (G2), pushing needle through tissue (G3), transferring needle from left to right (G4), and pulling suture with left hand (G6). The records of the first 6 users are used as the training set (470 sequences) and those of the remaining 2 users for testing (142 sequences), which is consistent with the leave-one-user-out (LOUO) setting in [12]. Every other frame in the dataset is chosen as the input.
We show quantitative comparisons by calculating VGG Cosine Similarity, Peak Signal-to-Noise Ratio (PSNR), and Structural Similarity (SSIM) scores [34] between ground truth and generated frames. VGG Cosine Similarity uses the output vector of the last fully connected layer of a pretrained VGG network. PSNR generally indicates the quality of reconstruction, while SSIM measures the perceived quality.
3.2 Implementation Details
The content encoder and the decoder use the same architecture as the VGG-based encoder in [7], while the motion encoder utilizes only one convolutional layer after each pooling layer, with one eighth of the channel numbers. The dimensions of the outputs from the two encoders are both 128. All LSTMs have 256 cells and a single layer, except the frame prediction LSTM, which has two layers. A linear embedding layer is employed for each LSTM. The hidden output of the frame prediction LSTM is followed by a fully connected layer with a tanh activation before going into the decoder, while the hidden outputs of the remaining LSTMs are followed by two separate fully connected layers producing the expectation and the logarithm of the variance.
Following the previous study [7], we downsample the resolution of the videos to save time and storage. The dimensionalities of the hidden vectors and of the Gaussian distributions are 128 and 16, respectively. We train all components of our method using the Adam optimizer [21] in an end-to-end fashion. We also set $\beta$ and the maximum length of video frames used for training. For all the experiments, we train each model to predict 10 time steps into the future conditioned on 10 observed frames, i.e., $c = 10$ and $T = 20$. All models are trained for 200 epochs.
Based on their released code, we re-implement MCnet [35] and SVG-LP [7] on the robotic dataset, which are typical methods of deterministic and stochastic prediction, respectively. We also show the results of two ablation settings: 1) SVG-LP*: SVG-LP trained using the $\ell_1$ loss and tested without sampling; 2) ML-VAE: our full model without the latent variable of content. We randomly choose 100 video clips from the testing set with equal numbers of the different gestures. Then, we test each model by predicting 20 subsequent frames conditioned on 10 observed frames. The testing period being longer than that of training demonstrates the generalization capability of each model. For SVG-LP, we draw 10 samples from the model for each test sequence and choose the best one for each metric. The other VAE-based methods directly use the expectation of the latent distribution without sampling for inference.
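The best-of-N evaluation used for SVG-LP can be sketched generically. The snippet below is an illustrative numpy version: `sample_fn` stands in for one stochastic forward pass of the model, and the noisy-copy sampler is a hypothetical stand-in used only to exercise the selection logic.

```python
import numpy as np

rng = np.random.default_rng(0)

def psnr(x, y, max_val=1.0):
    """Peak signal-to-noise ratio between two images with values in [0, max_val]."""
    mse = np.mean((x - y) ** 2)
    return 10.0 * np.log10(max_val ** 2 / mse)

def best_of_n(ground_truth, sample_fn, n=10, metric=psnr):
    """Draw n stochastic predictions and keep the one scoring best on the metric."""
    samples = [sample_fn() for _ in range(n)]
    scores = [metric(ground_truth, s) for s in samples]
    return samples[int(np.argmax(scores))], max(scores)

gt = np.full((8, 8), 0.5)
noisy = lambda: gt + 0.05 * rng.normal(size=gt.shape)  # stand-in sampler
best, score = best_of_n(gt, noisy, n=10)
```

Deterministic-at-inference models such as ours skip this loop entirely, which is what makes them practical for online use.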
3.3 Results
3.3.1 Qualitative Evaluation
Fig. 2 shows some outcomes for gesture G2 from each model. Generations from MCnet are sharp at initial time steps, but the results rapidly distort at later time steps. Although MCnet also utilizes skip connections, it cannot produce a crisp background over a longer time span, which implies that the back-and-forth moving patterns of the two arms can hardly be learned by the deterministic model. SVG-LP and SVG-LP* tend to predict static images of the indicated gesture, which implies that both misunderstand the purpose of the current gesture. ML-VAE also tends to lose the movement of the left hand, which confirms the importance of the content encoder. Capturing the movements of both arms, our TPG-VAE model gives the closest predictions to the ground truth.
3.3.2 Quantitative Evaluation
We compute VGG Cosine Similarity, PSNR, and SSIM for the aforementioned models on the 100 unseen test sequences. Fig. 3 plots the averages of all three metrics on the testing set. Concerning VGG Cosine Similarity, MCnet shows the worst curve because of its lowest generation quality, while the other methods behave similarly. The reason might be that frames of good quality are similar to each other at a perceptual level. For PSNR and SSIM, all methods maintain a relatively high level at the beginning and deteriorate when going further into the future. SVG-LP* performs better than SVG-LP, which indicates that the $\ell_1$ loss is more appropriate than MSE for this task. Both ML-VAE and TPG-VAE demonstrate better outcomes than the two published methods, i.e., MCnet and SVG-LP, particularly at later time steps. Interestingly, MCnet demonstrates a poor generalization capacity in that its performance curves on the three metrics deteriorate faster after passing the dotted lines in Fig. 3. Our full model exhibits a stronger capability to retain image quality over a longer time span. Note that the methods without sampling random variables at test time also gain favorable results, which suggests that the movements in the JIGSAWS dataset have relatively clear objectives.
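For reference, two of the reported metrics are straightforward to compute once frames and features are available. The sketch below gives illustrative numpy versions of PSNR and the VGG cosine similarity (the latter operating on precomputed feature vectors, since the VGG network itself is outside this snippet); SSIM is omitted here as it involves windowed statistics.

```python
import numpy as np

def psnr(x, y, max_val=1.0):
    """Peak signal-to-noise ratio between two images with values in [0, max_val]."""
    mse = np.mean((x - y) ** 2)
    return 10.0 * np.log10(max_val ** 2 / mse)

def vgg_cosine_similarity(f_x, f_y):
    """Cosine similarity between (precomputed) VGG feature vectors."""
    return float(f_x @ f_y / (np.linalg.norm(f_x) * np.linalg.norm(f_y)))

x = np.zeros((8, 8))
y = np.full((8, 8), 0.1)
# MSE = 0.01, so PSNR = 10 * log10(1 / 0.01) = 20 dB.
assert np.isclose(psnr(x, y), 20.0)

f = np.ones(128)
# Cosine similarity is scale-invariant, which is one reason it saturates
# and struggles to separate the VAE-based models in Table 1.
assert np.isclose(vgg_cosine_similarity(f, 2 * f), 1.0)
```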
Methods          PSNR                                                  SSIM
                 t=15         t=20         t=25         t=30           t=15          t=20          t=25          t=30
MCnet [35]       25.34±2.58   23.92±2.46   20.53±2.06   18.79±1.83     0.874±0.053   0.836±0.058   0.712±0.074   0.614±0.073
SVG-LP [7]       27.47±3.82   24.62±4.21   23.06±4.20   22.14±4.09     0.927±0.054   0.883±0.080   0.852±0.089   0.832±0.088
SVG-LP*          27.85±3.57   25.09±4.13   23.30±4.30   22.30±4.21     0.933±0.046   0.893±0.072   0.857±0.087   0.836±0.088
M-VAE            27.74±3.67   25.14±4.09   23.24±4.30   22.15±4.10     0.932±0.050   0.894±0.072   0.857±0.088   0.834±0.087
CM-VAE           27.44±3.83   25.09±4.07   23.02±4.19   22.16±4.16     0.927±0.056   0.893±0.075   0.853±0.087   0.834±0.088
CL-VAE           28.00±3.73   25.32±4.15   23.49±4.34   22.24±4.28     0.935±0.042   0.897±0.073   0.862±0.087   0.835±0.088
ML-VAE           28.24±3.51   25.77±4.02   23.95±4.26   22.28±4.26     0.936±0.046   0.903±0.071   0.870±0.084   0.836±0.088
TPG-VAE (ours)   26.26±3.17   26.13±3.85   24.88±3.68   23.67±3.50     0.917±0.048   0.911±0.060   0.892±0.067   0.871±0.071
Table 1 lists the average and standard deviation of the performance of each method on the 100 testing clips. VGG Cosine Similarity is not shown because it cannot distinguish the results of the VAE-based models. All methods tend to degrade as the time step increases. ML-VAE also exhibits superior outcomes compared with the other baselines, which verifies the effectiveness of the proposed motion encoder and class prior. With the ternary prior as high-level guidance, our TPG-VAE maintains high generation quality while showing stable performance with the smallest standard deviations.
3.3.3 Further Ablations
Table 1 also shows additional ablations of our TPG-VAE. Each of the following settings is applied to justify the necessity of the corresponding component: 1) M-VAE: our full model without the latent variables of content and class labels; 2) CM-VAE: our full model without the latent variable of class labels; 3) CL-VAE: our full model without the latent variable of motion. All three ablation models give better outcomes than SVG-LP. Comparing CM-VAE with CL-VAE, we find that class labels contribute more than motion latent variables, since class information helps the model remove more uncertainty, which suggests that recognizing before predicting is a recommended choice. Note that we do not consider the ablation setting with only the label prior, since it degenerates into a deterministic model that cannot interpret the diversity of videos.
3.3.4 Discussion on Dual Arm Cases
The two arms of the da Vinci robot cooperate mutually to achieve a certain task; thus the movements are highly entangled, which makes prediction very challenging. Without enough prior information, the predicted frames might show unreasonable outcomes due to the loss of temporal consistency. Fig. 4 shows the results for gesture G4, where the left arm gradually enters the visual field. For the first 10 predicted frames, all models capture the temporal pattern that the left hand is moving toward the center. For the remaining 10 predictions, SVG-LP tends to lose the left hand due to misunderstanding the current phase. Owing to more complete guidance, i.e., the ternary prior, our TPG-VAE successfully predicts the movements of the two arms while showing crisp outcomes, which verifies our assumption that additional references help the prediction of dual arm movements.
4 Conclusion and Future Work
In this work, we present a novel VAE-based method for conditional robotic video prediction, which is the first work for dual arm robots. The proposed model employs learned and intrinsic prior information as guidance to help generate future scenes conditioned on the observed frames. The stochastic VAE-based method is adapted into a deterministic approach at inference by directly using the expectation of the distribution without sampling. Our method outperforms the baseline methods on the challenging dual arm robotic surgical video dataset. Future work will explore higher-resolution generation and applying the predicted future frames to other advanced tasks.
4.0.1 Acknowledgements.
This work was supported by the Key-Area Research and Development Program of Guangdong Province, China (2020B010165004), Hong Kong RGC TRS Project No. T42-409/18-R, National Natural Science Foundation of China Project No. U1813204, and the CUHK Shun Hing Institute of Advanced Engineering (project MMT-p5-20).
References
[1] (2017) A dataset and benchmarks for segmentation and recognition of gestures in robotic surgery. IEEE Trans. Biomed. Eng.
[2] (2018) Stochastic variational video prediction. In ICLR.
[3] (2007) Real-time identification of operating room state from video. In AAAI.
[4] (2007) Context awareness in health care: a review. Int. J. Med. Inform.
[5] (2019) Deep learning based robotic tool detection and articulation estimation with spatio-temporal layers. RA-L.
[6] (2017) Unsupervised learning of disentangled representations from video. In NeurIPS.
[7] (2018) Stochastic video generation with a learned prior. In ICML.
[8] (2016) Unsupervised learning for physical interaction through video prediction. In NeurIPS.
[9] (2013) Technical review of the da Vinci surgical telemanipulator. Int. J. Med. Robot.
[10] (2019) Using 3D convolutional neural networks to learn spatiotemporal features for automatic surgical gesture recognition in video. In MICCAI.
[11] (2020) Automatic gesture recognition in robot-assisted surgery with reinforcement learning and tree search. In ICRA.
[12] (2014) JHU-ISI gesture and skill assessment working set (JIGSAWS): a surgical activity dataset for human motion modeling. In MICCAI Workshop: M2CAI.
[13] (1997) Long short-term memory. Neural Computation.
[14] (2019) Real-time instrument segmentation in robotic surgery using auxiliary supervised deep adversarial learning. RA-L.
[15] (2020) Exploring spatial-temporal multi-frequency analysis for high-fidelity and temporal-consistency video prediction. In CVPR.
[16] (2019) Incorporating temporal prior from motion flow for instrument segmentation in minimally invasive surgery video. In MICCAI.
[17] (2017) SV-RCNet: workflow recognition from surgical videos using recurrent convolutional network. IEEE Trans. Med. Imaging.
[18] (2020) Multi-task recurrent convolutional network with correlation loss for surgical video analysis. Med. Image Anal.
[19] (2017) Video pixel networks. In ICML.
[20] (2019) Unsupervised keypoint learning for guiding class-conditional video prediction. In NeurIPS.
[21] (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980.
[22] (2014) Auto-encoding variational Bayes. In ICLR.
[23] (2020) VideoFlow: a conditional flow-based model for stochastic video generation. In ICLR.
[24] (2018) Deep reinforcement learning for surgical gesture segmentation and classification. In MICCAI.
[25] (2018) CFCM: segmentation via coarse to fine context memory. In MICCAI.
[26] (2019) Unsupervised learning of object structure and dynamics from videos. In NeurIPS.
[27] (2015) Action-conditional video prediction using deep networks in Atari games. In NeurIPS.
[28] (2015) Convolutional LSTM network: a machine learning approach for precipitation nowcasting. In NeurIPS.
[29] (2015) Very deep convolutional networks for large-scale image recognition. In ICLR.
[30] (2015) Unsupervised learning of video representations using LSTMs. In ICML.
[31] (2020) Motion2Vec: semi-supervised representation learning from surgical videos. In ICRA.
[32] (2018) MoCoGAN: decomposing motion and content for video generation. In CVPR.
[33] (2016) EndoNet: a deep architecture for recognition tasks on laparoscopic videos. IEEE Trans. Med. Imaging.
[34] (2019) High fidelity video prediction with large stochastic recurrent neural networks. In NeurIPS.
[35] (2017) Decomposing motion and content for natural video sequence prediction. In ICLR.
[36] (2017) Learning to generate long-term future via hierarchical prediction. In ICML.