Automatic alignment of surgical videos using kinematic data

04/03/2019 ∙ by H. Ismail Fawaz, et al. ∙ 0

Over the past one hundred years, the classic teaching methodology of "see one, do one, teach one" has governed surgical education systems worldwide. With the advent of Operation Room 2.0, recording video, kinematic and many other types of data during surgery became an easy task, thus allowing artificial intelligence systems to be deployed and used in surgical and medical practice. Recently, surgical videos have been shown to provide a structure for peer coaching, enabling novice trainees to learn from experienced surgeons by replaying those videos. However, the high inter-operator variability in surgical gesture duration and execution makes learning from comparing novice to expert surgical videos a very difficult task. In this paper, we propose a novel technique to align multiple videos based on the alignment of their corresponding kinematic multivariate time series data. By leveraging the Dynamic Time Warping measure, our algorithm synchronizes a set of videos in order to show the same gesture being performed at different speeds. We believe that the proposed approach is a valuable addition to the existing learning tools for surgery.




1 Introduction

(a) Video without alignment
(b) Video with alignment
Figure 1: Example of how a time series alignment is used to synchronize the videos by duplicating the gray-scale frames. Best viewed in color.
(a) Original time series without alignment
(b) Warped time series with alignment
Figure 2: Example of aligning coordinate X’s time series for subject F, when performing three trials of the suturing surgical task.

Educators have always searched for innovative ways of improving apprentices’ learning rate. While classical lectures are still most commonly used, multimedia resources are increasingly adopted [22], especially in Massive Open Online Courses (MOOC) [13]. In this context, videos have been considered especially interesting as they can combine images, text, graphics, audio and animation. The medical field is no exception, and video-based resources are intensively adopted in medical curricula [11], especially in the context of surgical training [9]. The advent of robotic surgery also stimulates this trend as surgical robots, like the Da Vinci [7], generally record video feeds during the intervention. Consequently, a large amount of video data has been recorded in the last ten years [19]. This new source of data represents an unprecedented opportunity for young surgeons to improve their knowledge and skills [5]. Furthermore, video can also be a tool for senior surgeons during teaching periods to assess the skills of the trainees. In fact, a recent study [14] showed that residents spend more time viewing videos than specialists, highlighting the need for young surgeons to fully benefit from the procedure. In [6], the authors showed that knot-tying scores and times for task completion improved significantly for the subjects that watched the videos of their own performance.

However, when trainees want to assess their progress over several trials of the same surgical task by re-watching their recorded surgical videos simultaneously, the videos being out of sync makes the comparison between different trials very difficult if not impossible. This problem is encountered in many real-life case studies, since experts on average complete the surgical tasks in less time than novice surgeons [12]. Thus, when trainees do enhance their skills, providing them with feedback that pinpoints the reason behind the surgical skill improvement becomes problematic since the recorded videos have different durations and are not perfectly aligned.

Although synchronizing videos has been the center of interest for several computer vision research venues, contributions are generally focused on a special case where multiple simultaneously recorded videos (with different characteristics such as viewing angles and zoom factors) are being processed [26, 25, 15]. Another type of multiple video synchronization uses hand-engineered features (such as points of interest trajectories) from the videos [24, 2], making the approach highly sensitive to the quality of the extracted features. Such techniques were highly effective since the raw videos were the only source of information available, whereas in our case, the use of robotic surgical systems enables capturing an additional type of data: the kinematic variables such as the Cartesian coordinates of the Da Vinci’s end effectors [5].

In this paper, we propose to leverage the sequential aspect of the recorded kinematic data from the Da Vinci surgical system, in order to synchronize their corresponding video frames by aligning the time series data (see Figure 1 for an example). When aligning two time series, the off-the-shelf algorithm is Dynamic Time Warping (DTW) [20], which we indeed used to align two videos. However, when aligning multiple sequences, the latter technique does not generalize in a straightforward and computationally feasible manner [16]. Hence, for multiple video synchronization, we propose to align their corresponding time series to the average time series, computed using the DTW Barycenter Averaging (DBA) algorithm [16]. This process is called Non-Linear Temporal Scaling (NLTS) and has been proposed to find the multiple alignment of a set of discretized surgical gestures [3], which we extend in this work to continuous numerical kinematic data. Figure 2 depicts an example of stretching three different time series using the NLTS algorithm. Examples of the synchronized videos and the associated code can be found on our GitHub repository, where we used the JHU-ISI Gesture and Skill Assessment Working Set (JIGSAWS) [5] to validate our work.

The rest of the paper is organized as follows: in Section 2, we explain in detail the algorithms we have used in order to synchronize the kinematic data and eventually their corresponding video frames. In Section 3, we present our experiments, and in Section 4 we conclude the paper and discuss our future work.

2 Methods

In this section, we detail each step of our video synchronization approach. We start by describing the Dynamic Time Warping (DTW) algorithm which allows us to align two videos. Then, we describe how Non-Linear Temporal Scaling (NLTS) enables us to perform multiple video synchronization with respect to the reference average time series computed using the DTW Barycenter Averaging (DBA) algorithm.

2.1 Dynamic Time Warping

Dynamic Time Warping (DTW) was first proposed for speech recognition when aligning two audio signals [20]. Suppose we want to compute the dissimilarity between two time series, for example two different trials of the same surgical task, $A = (a_1, a_2, \dots, a_m)$ and $B = (b_1, b_2, \dots, b_n)$. The lengths of $A$ and $B$ are denoted respectively by $m$ and $n$, which in our case correspond to the surgical trial’s duration. Here, $a_i$ is a vector that contains six real values, therefore $A$ and $B$ can be seen as two distinct Multivariate Time Series (MTS).

To compute the DTW dissimilarity between two MTS, several approaches were proposed by the time series data mining community [21], however in order to apply the subsequent algorithm NLTS, we adopted the “dependent” variant of DTW where the Euclidean distance is used to compute the difference between two instants $i$ and $j$. Let $M(A, B)$ be the $m \times n$ point-wise dissimilarity matrix between $A$ and $B$, where $M_{i,j} = \lVert a_i - b_j \rVert_2$. A warping path $P = ((c_1, d_1), (c_2, d_2), \dots, (c_s, d_s))$ is a series of points that define a crossing of $M$. The warping path must satisfy three conditions: (1) $(c_1, d_1) = (1, 1)$; (2) $(c_s, d_s) = (m, n)$; (3) $0 \le c_{k+1} - c_k \le 1$ and $0 \le d_{k+1} - d_k \le 1$ for all $k < s$. The DTW measure between two series corresponds to the path through $M$ that minimizes the total distance. In fact, the distance for any path $P$ is equal to $D_P(A, B) = \sum_{k=1}^{s} M_{c_k, d_k}$. Hence if $\mathcal{P}$ is the space of all possible paths, the optimal one - whose cost is equal to $DTW(A, B)$ - is denoted by $P^*$ and can be computed using: $P^* = \arg\min_{P \in \mathcal{P}} D_P(A, B)$.

The optimal warping path can be obtained efficiently by applying a dynamic programming technique to fill the cost matrix $M$. Once we find this optimal warping path between $A$ and $B$, we can deduce how each time series element in $A$ is linked to the elements in $B$. We propose to exploit this link in order to identify which time stamp should be duplicated in order to align both time series, and by duplicating a time stamp, we are also duplicating its corresponding video frame. Concretely, if elements $a_i$, $a_{i+1}$ and $a_{i+2}$ are aligned with the element $b_j$ when computing $P^*$, then by duplicating twice the video frame in $B$ for the time stamp $j$, we are dilating the video of $B$ to have a length that is equal to $A$’s. The video frames are thus re-aligned based on the aligned Cartesian coordinates: if subject $S_1$ completed the “inserting the needle” gesture in 5 seconds, whereas subject $S_2$ performed the same gesture within 10 seconds, our algorithm finds the optimal warping path and duplicates the frames for subject $S_1$ in order to synchronize the corresponding gesture with subject $S_2$. Figure 1 illustrates how the alignment computed by DTW for two time series can be used in order to duplicate the corresponding frames and eventually synchronize the two videos.
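The pairwise procedure above can be sketched in a few lines of Python. This is a minimal illustration rather than the code from our repository: the function names `dtw_path` and `synchronized_frame_indices` are hypothetical, the point-wise cost is assumed to be the unsquared Euclidean distance, and each series is assumed to be a NumPy array of shape (length, dimensions). The warping path is read as two lists of frame indices of equal length, where a repeated index means that the corresponding video frame is duplicated.

```python
import numpy as np


def dtw_path(a, b):
    """Dependent DTW between two series of shape (length, dims).

    Returns the accumulated cost and the optimal warping path as a list of
    0-based (i, j) index pairs.
    """
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    m, n = len(a), len(b)
    # Point-wise Euclidean distances between every pair of time stamps.
    dist = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2)
    # Accumulated cost matrix, padded with a border of infinities.
    cost = np.full((m + 1, n + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost[i, j] = dist[i - 1, j - 1] + min(
                cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1]
            )
    # Backtrack from the last cell to the first one.
    i, j = m, n
    path = [(i - 1, j - 1)]
    while (i, j) != (1, 1):
        steps = [(i - 1, j - 1), (i - 1, j), (i, j - 1)]
        i, j = min(steps, key=lambda s: cost[s])
        path.append((i - 1, j - 1))
    path.reverse()
    return cost[m, n], path


def synchronized_frame_indices(path):
    """Frame indices to play for each video; repeats are duplicated frames."""
    idx_a = [i for i, _ in path]
    idx_b = [j for _, j in path]
    return idx_a, idx_b
```

Playing video A with the frames listed in `idx_a` next to video B with the frames listed in `idx_b` yields two clips of the same length. Note that, in general, the path may repeat indices on both sides, so both videos can be dilated, not only the shorter one.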

2.2 Non-Linear Temporal Scaling

The previous DTW-based algorithm works well when synchronizing only two surgical videos. The problem arises when aligning three or more surgical trials simultaneously, which requires a multiple series alignment. The latter problem has been shown to be NP-Complete [23], with the exact solution requiring $O(L^N)$ operations for $N$ sequences of length $L$. This is clearly not feasible for the number and length of the surgical trials in our case, which is why we leverage an approximation of the multiple sequence alignment solution provided by the DTW Barycenter Averaging (DBA) algorithm, which we detail in the following paragraph.

DBA was originally proposed in [18] as a technique that averages a set of time series by leveraging an approximated multiple sequence alignment algorithm called Compact Multiple Alignment (CMA) [17]. DBA iteratively refines an average time series $\bar{T}$ and follows an expectation-maximization scheme by first considering $\bar{T}$ to be fixed and finding the best CMA between the set of sequences $D$ (to be averaged) and the refined average sequence $\bar{T}$. After computing the CMA, the alignment is now fixed and the average sequence is updated in a way that minimizes the sum of DTW distances between $\bar{T}$ and $D$ [16].

DBA requires an initial value for $\bar{T}$. There exist many possible initializations for the average sequence [17], however, since our ultimate goal is to synchronize a set of sequences by duplicating their elements (dilating the sequences), we initialize the average to be equal to the longest instance in $D$. We then find the number of time series elements - and their associated video frames - to be duplicated in order to synchronize multiple videos, using the NLTS technique which we describe in detail in the following paragraph.

Non-Linear Temporal Scaling (NLTS) was originally proposed for aligning discrete sequences of surgical gestures [3]. In this paper, we extend the technique to numerical continuous sequences (time series). The goal of this final step is to compute the approximated multiple alignment of a set of sequences, which will eventually contain the precise information on how much a certain frame from a certain series should be duplicated. We first start by computing the average sequence $\bar{T}$ (using DBA) for a set of time series $D$ that we want to align simultaneously. Then, by recomputing the Compact Multiple Alignment (CMA) between the refined average $\bar{T}$ and the set of time series $D$, we can extract an alignment between $\bar{T}$ and each sequence in $D$. Thus, for each time series in $D$ we will have the necessary information (extracted from CMA) in order to dilate the time series appropriately to have a length that is equal to $\bar{T}$’s, which also corresponds to the length of the longest time series in $D$. Figure 2 depicts an example of aligning three different time series using the NLTS algorithm.
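The multiple alignment step can be sketched as follows. This is a simplified illustration under stated assumptions, not the reference implementation: the names `dtw_path`, `dba` and `nlts_indices` are hypothetical, the number of DBA iterations is an arbitrary choice, and each series is assumed to be a NumPy array of shape (length, dimensions). In particular, when several elements of a series are aligned with the same element of the average, this sketch keeps only the first one, which is one possible way of obtaining exactly one frame per time stamp of the average.

```python
import numpy as np


def dtw_path(a, b):
    """Optimal dependent-DTW warping path as 0-based (i, j) pairs."""
    m, n = len(a), len(b)
    dist = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2)
    cost = np.full((m + 1, n + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost[i, j] = dist[i - 1, j - 1] + min(
                cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1]
            )
    i, j = m, n
    path = [(i - 1, j - 1)]
    while (i, j) != (1, 1):
        steps = [(i - 1, j - 1), (i - 1, j), (i, j - 1)]
        i, j = min(steps, key=lambda s: cost[s])
        path.append((i - 1, j - 1))
    return path[::-1]


def dba(series, n_iter=10):
    """Approximate DTW barycenter, initialized with the longest series."""
    avg = np.array(max(series, key=len), dtype=float)
    for _ in range(n_iter):
        sums = np.zeros_like(avg)
        counts = np.zeros(len(avg))
        # Alignment step: link every element of every series to the average.
        for s in series:
            for i, j in dtw_path(avg, s):
                sums[i] += s[j]
                counts[i] += 1
        # Update step: each average element becomes the mean of its links.
        avg = sums / counts[:, None]
    return avg


def nlts_indices(avg, s):
    """For each time stamp of the average, one aligned frame index of `s`."""
    idx = np.zeros(len(avg), dtype=int)
    seen = np.zeros(len(avg), dtype=bool)
    for i, j in dtw_path(avg, s):
        if not seen[i]:  # keep the first element aligned with average step i
            idx[i] = j
            seen[i] = True
    return idx
```

Calling `nlts_indices(avg, s)` for every series `s` in the set returns index arrays that all have the length of the average, so the corresponding videos can be replayed side by side with one frame per index.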

3 Experiments

We start by describing the JIGSAWS dataset we have used for evaluation, before presenting our experimental study.

3.1 Dataset

Figure 3: Snapshots of the three surgical tasks in the JIGSAWS dataset (from left to right): suturing, knot-tying, needle-passing [5].

The JIGSAWS dataset [5] includes data for three basic surgical tasks performed by study subjects (surgeons). The three tasks (or their variants) are usually part of the surgical skills training program. Figure 3 shows a snapshot example for each one of the three surgical tasks (Suturing, Knot Tying and Needle Passing). The JIGSAWS dataset contains kinematic and video data from eight different subjects with varying surgical experience: two experts (E), two intermediates (I) and four novices (N) with each group having reported respectively more than 100 hours, between 10 and 100 hours and less than 10 hours of training on the Da Vinci. All subjects were reportedly right-handed.

The subjects repeated each surgical task five times and for each trial the kinematic and video data were recorded. When performing the alignment, we used the kinematic data, which are numeric variables of four manipulators: left and right masters (controlled directly by the subject) and left and right slaves (controlled indirectly by the subject via the master manipulators). These kinematic variables (76 in total) are captured at a frequency of 30 frames per second for each trial. Out of these 76 variables, we only consider the Cartesian coordinates $(x, y, z)$ of the left and right slave manipulators, thus each trial consists of an MTS with 6 temporal variables. We chose to work only with this subset of kinematic variables to make the alignment coherent with what is visible in the recorded scene: the robots’ end-effectors, which can be seen in Figure 3. However, other choices of kinematic variables are applicable, and we leave their exploration for our future work. Finally, we should mention that in addition to the three self-proclaimed skill levels (N, I, E), JIGSAWS contains the modified Objective Structured Assessment of Technical Skill (OSATS) score [5], which corresponds to an expert surgeon observing the surgical trial and annotating the performance of the trainee.
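Selecting the six temporal variables from the 76 kinematic variables amounts to a simple column slice. The sketch below is illustrative: the function name `cartesian_mts` is hypothetical, and the column offsets are an assumption about the ordering of the kinematic variables that should be checked against the documentation shipped with the dataset before use.

```python
import numpy as np

# Assumed 0-based column ranges of the slave tool-tip (x, y, z) positions
# within the 76 kinematic variables; verify against the dataset's README.
SLAVE_LEFT_XYZ = slice(38, 41)
SLAVE_RIGHT_XYZ = slice(57, 60)


def cartesian_mts(kinematics):
    """Reduce a (n_frames, 76) kinematic array to a (n_frames, 6) MTS."""
    kinematics = np.asarray(kinematics, dtype=float)
    if kinematics.ndim != 2 or kinematics.shape[1] != 76:
        raise ValueError("expected an array of shape (n_frames, 76)")
    return np.hstack(
        [kinematics[:, SLAVE_LEFT_XYZ], kinematics[:, SLAVE_RIGHT_XYZ]]
    )
```

Since the kinematic data are sampled at 30 frames per second, row `t` of the resulting array can be paired with video frame `t`, assuming the two streams start at the same instant.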

3.2 Results

(a) Videos synchronization process
(b) Perfectly aligned videos
Figure 4: Video alignment procedure with duplicated (gray-scale) frames.

We have created a companion web page to our paper where several examples of synchronized videos can be found. Figure 4 illustrates the multiple video alignment procedure using our NLTS algorithm, where gray-scale images indicate duplicated frames (paused video) and colored images indicate a surgical motion (unpaused video). In Figure 4(a) we can clearly see how the gray-scale surgical trials are perfectly aligned. Indeed, the frozen videos show the surgeon ready to perform the “pulling the needle” gesture [5]. On the other hand, the colored trial (bottom right of Figure 4(a)) shows a video that is being played, where the surgeon is performing the “inserting the needle” gesture in order to catch up with the other paused trials in gray-scale. Finally, the result of simultaneously aligning these four surgical trials is depicted in Figure 4(b). By observing the four trials, one can clearly see that the surgeon is now performing the same surgical gesture, “pulling the needle”, simultaneously in the four trials. We believe that this type of observation will enable a novice surgeon to locate which surgical gestures still need some improvement in order to eventually become an expert surgeon.

Figure 5: A polynomial fit (degree 3) of DTW dissimilarity score (y-axis) as a function of the OSATS score difference between two surgeons (x-axis).

Furthermore, in order to validate our intuition that DTW is able to capture characteristics that are related to the motor skill of a surgeon, we plotted the DTW distance as a function of the OSATS [5] score difference. For example, if two surgeons have OSATS scores of 10 and 16 respectively, the corresponding difference is equal to $|10 - 16| = 6$. In Figure 5, we can clearly see how the DTW score increases whenever the OSATS score difference increases. This observation suggests that the DTW score is low when both surgeons exhibit similar dexterity, and high whenever the trainees show different skill levels. Therefore, we conclude that the DTW score can serve as a heuristic for estimating the quality of the alignment (whenever annotated skill level is not available) - especially since we observed low quality alignments for surgeons with very distinct surgical skill levels.
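The curve in Figure 5 can be reproduced in spirit with a least-squares polynomial fit. The sketch below is an illustration only: the function name `fit_dtw_vs_osats` and its argument names are hypothetical, and it assumes that the OSATS scores of both trials and the DTW score of each pair have already been collected into three arrays of equal length.

```python
import numpy as np


def fit_dtw_vs_osats(osats_a, osats_b, dtw_scores, degree=3):
    """Fit DTW scores as a polynomial of the absolute OSATS difference.

    Returns the OSATS differences and a callable polynomial (np.poly1d).
    """
    diff = np.abs(
        np.asarray(osats_a, dtype=float) - np.asarray(osats_b, dtype=float)
    )
    coeffs = np.polyfit(diff, np.asarray(dtw_scores, dtype=float), deg=degree)
    return diff, np.poly1d(coeffs)
```

The returned polynomial can be evaluated on a grid of OSATS differences to draw the fitted curve. A degree-3 fit needs at least four distinct difference values to be well determined.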

Finally, we should note that this work is suitable for many research fields involving motion kinematic data with their corresponding video frames. One such medical application is assessing mental health from videos [27], where wearable sensor data can be seen as time series kinematic variables and leveraged in order to synchronize a patient’s videos and compare how well the patient is responding to a certain treatment. Following the same line of thinking, this idea can be further applied to kinematic data from wearable sensors coupled with the corresponding video frames when evaluating the evolution of Parkinson’s disease [1] as well as infant grasp skills [10].

4 Conclusion

In this paper, we showed how kinematic time series data recorded from the Da Vinci’s end effectors can be leveraged in order to synchronize the videos of trainees performing a surgical task. With personalized feedback during surgical training becoming a necessity [8, 4], we believe that replaying synchronized and well aligned videos would help trainees understand which surgical gestures did or did not improve after hours of training, thus enabling them to reach higher skill levels and eventually become experts. We acknowledge that this work needs an experimental study to quantify how beneficial replaying synchronized videos is for the trainees compared to observing non-synchronized trials. Therefore, we leave such exploration and clinical trials to our future work.


  • [1] K. Criss and J. McNames (2011) Video assessment of finger tapping for Parkinson’s disease and other movement disorders. In IEEE International Conference on Engineering in Medicine and Biology Society, pp. 7123–7126. Cited by: §3.2.
  • [2] G. D. Evangelidis and C. Bauckhage (2011) Efficient and robust alignment of unsynchronized video sequences. In Joint Pattern Recognition Symposium, pp. 286–295. Cited by: §1.
  • [3] G. Forestier, F. Petitjean, L. Riffaud, and P. Jannin (2014) Non-linear temporal scaling of surgical processes. Artificial Intelligence in Medicine 62 (3), pp. 143 – 152. Cited by: §1, §2.2.
  • [4] G. Forestier, F. Petitjean, P. Senin, F. Despinoy, A. Huaulmé, H. Ismail Fawaz, J. Weber, L. Idoumghar, P. Muller, and P. Jannin (2018) Surgical motion analysis using discriminative interpretable patterns. Artificial Intelligence in Medicine 91, pp. 3 – 11. Cited by: §4.
  • [5] Y. Gao, S. S. Vedula, C. E. Reiley, N. Ahmidi, B. Varadarajan, H. C. Lin, L. Tao, L. Zappella, B. Béjar, D. D. Yuh, C. C. G. Chen, R. Vidal, S. Khudanpur, and G. D. Hager (2014) The JHU-ISI Gesture and Skill Assessment Working Set (JIGSAWS): a surgical activity dataset for human motion modeling. In Modeling and Monitoring of Computer Assisted Interventions – MICCAI Workshop, Cited by: §1, §1, §1, Figure 3, §3.1, §3.1, §3.2, §3.2.
  • [6] G. E. Herrera-Almario, K. Kirk, V. T. Guerrero, K. Jeong, S. Kim, and G. G. Hamad (2016) The effect of video review of resident laparoscopic surgical skills measured by self- and external assessment. The American Journal of Surgery 211 (2), pp. 315 – 320. Cited by: §1.
  • [7] Intuitive Surgical, Sunnyvale, CA. The Da Vinci Surgical System. Cited by: §1.
  • [8] H. Ismail Fawaz, G. Forestier, J. Weber, L. Idoumghar, and P-A. Muller (2018) Evaluating surgical skills from kinematic data using convolutional neural networks. In Medical Image Computing and Computer Assisted Intervention, Vol. 11073, pp. 214–221. Cited by: §4.
  • [9] R. Kneebone, J. Kidd, D. Nestel, S. Asvall, P. Paraskeva, and A. Darzi (2002) An innovative model for teaching and learning clinical procedures. Medical education 36 (7), pp. 628–634. Cited by: §1.
  • [10] Z. Li, Y. Huang, M. Cai, and Y. Sato (2019) Manipulation-skill assessment from videos with spatial attention network. ArXiv. Cited by: §3.2.
  • [11] I. Masic (2008) E-learning as new method of medical education. Acta informatica medica 16 (2), pp. 102. Cited by: §1.
  • [12] S.S. McNatt and C.D. Smith (2001) A computer-based laparoscopic skills assessment device differentiates experienced from novice laparoscopic surgeons. Surgical Endoscopy 15 (10), pp. 1085–1089. Cited by: §1.
  • [13] B. Means, Y. Toyama, R. Murphy, M. Bakia, and K. Jones (2009) Evaluation of evidence-based practices in online learning: a meta-analysis and review of online learning studies. Cited by: §1.
  • [14] P. Mota, N. Carvalho, E. Carvalho-Dias, M. J. Costa, J. Correia-Pinto, and E. Lima (2018) Video-based surgical learning: improving trainee education and preparation for surgery. Journal of surgical education 75 (3), pp. 828–835. Cited by: §1.
  • [15] F. Padua, R. Carceroni, G. Santos, and K. Kutulakos (2010) Linear sequence-to-sequence alignment. IEEE Transactions on Pattern Analysis and Machine Intelligence 32 (2), pp. 304–320. Cited by: §1.
  • [16] F. Petitjean, G. Forestier, G. I. Webb, A. E. Nicholson, Y. Chen, and E. Keogh (2014) Dynamic time warping averaging of time series allows faster and more accurate classification. In 2014 IEEE International Conference on Data Mining, pp. 470–479. Cited by: §1, §2.2.
  • [17] F. Petitjean and P. Gançarski (2012) Summarizing a set of time series by averaging: from steiner sequence to compact multiple alignment. Theoretical Computer Science 414 (1), pp. 76 – 91. Cited by: §2.2, §2.2.
  • [18] F. Petitjean, A. Ketterlin, and P. Gançarski (2011) A global averaging method for dynamic time warping, with applications to clustering. Pattern Recognition 44 (3), pp. 678 – 693. Cited by: §2.2.
  • [19] A. K. Rapp, M. G. Healy, M. E. Charlton, J. N. Keith, M. E. Rosenbaum, and M. R. Kapadia (2016) YouTube is the most frequently used educational video source for surgical preparation. Journal of Surgical Education 73 (6), pp. 1072–1076. Cited by: §1.
  • [20] H. Sakoe and S. Chiba (1978) Dynamic programming algorithm optimization for spoken word recognition. IEEE Transactions on Acoustics, Speech, and Signal Processing 26 (1), pp. 43–49. Cited by: §1, §2.1.
  • [21] M. Shokoohi-Yekta, B. Hu, H. Jin, J. Wang, and E. Keogh (2017) Generalizing dtw to the multi-dimensional case requires an adaptive approach. Data Mining and Knowledge Discovery 31 (1), pp. 1–31. Cited by: §2.1.
  • [22] T. L. Smith and S. Ransbottom (2000) Digital video in education. In Distance learning technologies: Issues, trends and opportunities, pp. 124–142. Cited by: §1.
  • [23] L. Wang and T. Jiang (1994) On the complexity of multiple sequence alignment. Journal of Computational Biology 1 (4), pp. 337–348. Cited by: §2.2.
  • [24] O. Wang, C. Schroers, H. Zimmer, M. Gross, and A. Sorkine-Hornung (2014) Videosnapping: interactive synchronization of multiple videos. ACM Transactions on Graphics 33 (4), pp. 77. Cited by: §1.
  • [25] D. Wedge, P. Kovesi, and D. Huynh (2005) Trajectory based video sequence synchronization. In Digital Image Computing: Techniques and Applications, pp. 13–13. Cited by: §1.
  • [26] L. Wolf and A. Zomet (2002) Sequence-to-sequence self calibration. In European Conference on Computer Vision, pp. 370–382. Cited by: §1.
  • [27] Y. Yamada and M. Kobayashi (2017) Detecting mental fatigue from eye-tracking data gathered while watching video. In Artificial Intelligence in Medicine, pp. 295–304. Cited by: §3.2.