Real-time analysis of cataract surgery videos using statistical models

10/18/2016 ∙ by Katia Charrière, et al.

The automatic analysis of the surgical process, from videos recorded during surgeries, could be very useful to surgeons, both for training and for acquiring new techniques. The training process could be optimized by automatically providing targeted recommendations or warnings, similar to an expert surgeon's guidance. In this paper, we propose to reuse videos recorded and stored during cataract surgeries to perform the analysis. The proposed system automatically recognizes, in real time, what the surgeon is doing: which surgical phase or, more precisely, which surgical step he or she is performing. This recognition relies on the inference of a multilevel statistical model which uses 1) the conditional relations between levels of description (steps and phases) and 2) the temporal relations among steps and among phases. The model accepts two types of inputs: 1) the presence of surgical tools, manually provided by the surgeons, or 2) motion in videos, automatically analyzed through the content-based video retrieval (CBVR) paradigm. Different data-driven statistical models are evaluated in this paper. For this project, a dataset of 30 cataract surgery videos was collected at Brest University Hospital. The system was evaluated in terms of area under the ROC curve. Promising results were obtained using either the presence of surgical tools (A_z = 0.983) or motion analysis (A_z = 0.759). The method is general enough to be adapted to any kind of surgery. The proposed solution could be used in a computer-assisted surgery tool to support surgeons during the surgery.







1 Introduction

Training and the acquisition of new techniques are a very important part of a surgeon’s career, and they require a major investment from expert surgeons in terms of supervision. The work presented in this paper aims to support the training process by automatically analysing the surgical process in surgery monitoring videos. This analysis could be used in the future for providing targeted recommendations or warnings, similar to the guidance of expert surgeons.

Our first goal is to support surgeons during their first surgeries and reduce the time of expert supervision. Assisting the surgeon during the surgery requires analyzing the surgical process in real time. This could be done through the automated analysis of the recorded video. Also, information recorded during previous surgeries could be reused to recognize similar situations during the analysis of a new surgery video. The methods we developed could also find their place alongside emerging surgery simulators singh2014high to support the initial training of surgeons, even before their first (supervised) surgery. In this scenario, rather than the video stream, we could use the presence (or trajectories) of surgical tools to perform the analysis.

Although this problem is common to any surgery, we focus in this paper on eye surgeries and cataract surgery in particular. Cataract surgery aims to replace the eye’s natural lens with a synthetic lens when its transparency has been lost. It is one of the most frequently performed surgeries. In this surgery, the surgeon watches the patient’s eye through a binocular microscope, the output of which can be video recorded. An accurate analysis of the surgery is necessary to be able to provide relevant information. Ideally, we would like to know, at each instant of the surgery, which surgical gesture or step is being performed by the surgeon. But an accurate analysis of the surgical process under the real-time constraint is a challenging task: the algorithm has to be fast and has to recognize a fairly large number of surgical steps.

In this paper, we propose to work with two levels of description, in order to perform a robust and precise analysis of the surgical process. As presented in Fig. 1, at the coarsest level, the surgery is divided into "phases" and at the finest level, it is divided into "steps". By learning a multilevel statistical model of the surgical process, the proposed methods use both the knowledge of the temporal process of the surgery and the knowledge of the relationships between the surgical steps and phases. The system provides the most probable sequences of surgical steps and phases, given present and past information only, by assigning a label for steps and phases to each image of the video.

In the field of medical data analysis, an ongoing problem is the small amount of available data. Building an annotated medical dataset is a challenging task, first because of the sensitivity of the data. Also, the annotation of data by experts is time consuming and surgeons’ time is precious. Our system needs to deal with this limitation of data, especially to train a model of the surgical process. Our system also needs to deal with the specificity of the medical data recorded from video-monitored surgery. In the case of cataract surgery, the system has to deal with the motion of the eye during the surgery, and with zoom or level variations.

The remainder of this paper is organized as follows. Related work is presented in Section 2. The proposed methods are described in Section 3. In Section 4, we discuss our experimental setup and results. Finally, Section 5 presents some concluding remarks.

Figure 1: Description of the cataract surgery into surgical phases and surgical steps

2 Related work

To analyze the surgical process, a surgery can be conveniently regarded as a succession of surgical gestures (finest representation), activities, steps, tasks or phases (coarsest representation). Every surgical process analysis method describes the surgery at a given level of abstraction. A high level of description, into surgical phases lalys2012framework or tasks Quellec2014b, provides a global description of the surgery with a simple sequencing. Indeed, surgeries generally have the same phase sequencing padoy2012statistical; lalys2013automatic. Automatic recognition at a low level of description, on the other hand, is a challenging task because of the large number of possible temporal sequences, sometimes marginally represented in the dataset. So, a low level of description, into surgical gestures zappella2013surgical, activities lalys2013automatic or steps, allows a more precise analysis of the surgery, but implies a more complex surgical process.

Some authors propose to work with several levels of description. For instance, Padoy et al. padoy2012statistical use the presence of the surgical tools in the field of view of the camera to detect actions. This action detection supports the recognition of (high-level) phases. Forestier et al. forestier2015automatic use low-level recordings of the activities that are performed by a surgeon to automatically predict the current phase of the surgery. Those two methods allow an on-line recognition of (high-level) phases, but low level information (presence of surgical tools or activities) is manually provided by surgeons. Lalys et al. propose a method based on an automated extraction of the visual content of the video and use the sequential nature of the surgical phases as a temporal constraint for (low-level) activity detection lalys2013automatic.

In terms of methodology, several methods reuse data recorded during a video-monitored surgery for the automated analysis of surgical processes lalys2013automatic; zappella2013surgical. In particular, some of these methods rely on content-based video retrieval (CBVR), whose goal is to find similar videos or sub-videos inside a dataset chattopadhyay2008application; Quellec2014c; quellec2015real; loukas2016shot. But those methods do not model the temporal sequencing of the surgical process. Different kinds of models were used to model this process, like Dynamic Time Warping (DTW) averaging padoy2012statistical; lalys2013automatic, which builds an average surgery. But this method does not allow on-line computations because it requires the entire video to be known (past, present, but also future information). On the other hand, Hidden Markov Models (HMMs) padoy2012statistical or their derivatives, like Conditional Random Fields (CRFs), do allow on-line computations. CRFs seem to provide better results than HMMs in the context of automatic surgical video analysis Quellec2014b; tao2013surgical. A Hierarchical Hidden Markov Model (HHMM), a hierarchical generalization of HMMs, was also used by twinanda2016endonet to perform phase recognition taking into account inter-phase and intra-phase dependencies.

A majority of methods for the analysis of video data are developed without the real-time constraint: they are applied to automatic documentation and report generation stanek2012automatic; lalys2013automatic, fast search of similar cases in a database andre2012learning or educative video construction cao2008medical. A few of them allow on-line analysis padoy2012statistical; forestier2015automatic, but at a high level of description (into surgical phases) and do not allow accurate analysis of the surgery. Lalys et al. lalys2013automatic propose a finer analysis of the surgery, but this method is not able to perform on-line analysis of the surgical process.

The method presented in this paper extends a previous solution from our group, presented at a conference charriere2016real. That system performs an on-line analysis of a cataract surgery video at two different levels of description. It uses high-level phase recognition to help low-level step recognition, but it also uses information from step recognition to refine the recognition of phases.

3 Methodology

In this paper, we present a comprehensive study comparing several surgical process models, working at multiple description levels, and using various kinds of observations from the video stream. Those models aim to model the temporal process at each level of description as well as the relationships between steps and phases.

First, a Hierarchical Hidden Markov Model (HHMM) is evaluated. The advantage of the HHMM is that it can jointly model the temporal process at multiple description levels, using a simple relationship model between levels. We then evaluate two novel models which handle the relationships between steps and phases separately from the temporal process. The first model is composed of a Bayesian network (BN) and Hidden Markov Models (HMMs). The BN is used to model the conditional relationships between steps and phases, while the temporal relationships at each level of description are modeled by HMMs. We also evaluate a variation on this model where HMMs are replaced by Conditional Random Fields (CRFs). These models can take as inputs different kinds of observations, computed throughout the surgery. These observations are presented in the following section. Next, the models are presented.

3.1 Observations

To generate observations, the video is divided into overlapping fixed-size sub-sequences. Each sub-sequence contains the same number of frames, as presented in Fig. 2. The best sub-sequence length and the best temporal shift between sub-sequences are chosen after a learning step (§ 4.2). For each sub-sequence, observations are computed. In this paper, observations derive either from the presence of surgical tools or from the analysis of motion in videos through the content-based video retrieval (CBVR) paradigm. In the case of motion analysis, we evaluate the model using Motion Histograms (MH) Quellec2014c as feature vectors, but also using Bags of Visual Words (BoVW) wang:inria-00439769; zappella2013surgical. As motion extraction can be disturbed by scale variation or motion of the eye, we also evaluate the system after a spatial normalization of images from the video stream.
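The windowing described above can be illustrated with a minimal sketch. The function name `split_into_subsequences` and the 25 frames-per-second rate are assumptions made for illustration only; the two-second length and one-second shift are the values reported later in § 4.3.

```python
def split_into_subsequences(n_frames, length, shift):
    """Return (start, end) frame indices of overlapping fixed-size windows.

    `length` and `shift` are expressed in frames and `end` is exclusive.
    Windows that would run past the end of the video are dropped.
    """
    if length <= 0 or shift <= 0:
        raise ValueError("length and shift must be positive")
    return [(start, start + length)
            for start in range(0, n_frames - length + 1, shift)]


# Hypothetical example: a 10 s video at an assumed 25 frames per second,
# cut into 2 s windows shifted by 1 s.
fps = 25
windows = split_into_subsequences(n_frames=10 * fps, length=2 * fps, shift=1 * fps)
print(len(windows), windows[:3])
```

Because the shift is smaller than the length, consecutive windows share frames, so each frame contributes to more than one observation.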

Figure 2: The video divided into overlapping fixed-size sub-sequences

3.1.1 Presence of surgical tools

We can assume that the use of a specific tool is closely related to the surgical steps and phases being performed. In a scenario where the system is used jointly with a surgery simulator, we can easily assume that this information will be provided by the simulator. In a scenario where the system is used during a real surgery, this information may also be obtained using barcodes or RFID chips roberts2006radio; yao2010rfid. Another solution consists in automatically tracking surgical tools in the surgical scene with computer vision methods. In this paper, this information is manually provided by surgeons, which allows us to validate the models.

3.1.2 Video motion comparison

Because the presence of tools cannot always be obtained easily, we also evaluate the model using general motion features extracted from the video content. In this paper, we compare Motion Histograms (MH), based on the extraction of the optical flow, with the widely used Bag-of-Visual-Words (BoVW) model. Disruptive motion can appear, induced for example by camera or eye motion during cataract surgeries. We therefore evaluate the influence on system performance of a spatial normalization of the images that compose the video. This normalization step aims to refine feature extraction based on pupil center and scale tracking.


  • Normalization: Pupil center and scale tracking are obtained without explicitly segmenting these landmarks, as presented in quellec2014normalizing. First, a robust solution to track the pupil center is used: it uses the fact that the pupil boundaries, the limbus and the sclera / lid interface are concentric. Then, the zoom level is estimated from the illumination pattern reflected on the cornea. Knowing this information for each frame of the video, we can pre-process the frames to compensate for eye motion and zoom or level variations before the feature extraction step. First, eye motion is compensated by registering all frames on the same iris center. A simple coordinate system change is applied, which places the iris center at the image center. This should eliminate motion induced by eye or camera motion and make tool motion more relevant. Then, all frames are rescaled to the same scale level to compensate for zoom or level variations. After this last preprocessing step, all irises should have the same radius. Finally, a circular mask centered on the iris center is applied to select a region of interest, because all relevant actions should appear in a region close to the iris location.

  • Motion Histograms (MH): To compute the optical flow, strong corners are first detected and selected. The OpenCV 2 library is used to select strong corners, and the optical flow between each pair of consecutive frames is computed at each strong corner by the Lucas-Kanade iterative method lucas1981iterative. Finally, the motion contained in the sub-sequence as a whole is characterized by one 8-bin amplitude histogram, two 8-bin amplitude-weighted spatial histograms (one for the x-coordinates and one for the y-coordinates) and one 8-bin amplitude-weighted directional histogram.

  • Bags of Visual Words (BoVW): The BoVW features are based on Space-Time Interest Points (STIP), which were proposed by Laptev et al. laptev2005space. STIP points are first detected locally within each sub-sequence. Histograms of oriented gradient (HOG) and histograms of optical flows (HOF) are then extracted inside a cube centered around each STIP point and concatenated. During a learning step, those local feature vectors are used to build a dictionary of visual words. Once the dictionary is learnt, a histogram of visual words is extracted from each video sub-sequence and used as a feature vector for the sub-sequence as a whole.
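As an illustration of the Motion Histogram features listed above, the following sketch builds the 8-bin amplitude histogram and the 8-bin amplitude-weighted directional histogram from a list of flow vectors. It assumes the optical flow has already been computed at the strong corners; the function name, the `max_amplitude` clipping value and the binning scheme are hypothetical choices, and the two spatial histograms are omitted for brevity.

```python
import math


def motion_histograms(flows, n_bins=8, max_amplitude=20.0):
    """Summarize flow vectors (dx, dy) of one sub-sequence.

    Returns (amplitude_histogram, direction_histogram). The direction
    histogram is weighted by flow amplitude. `max_amplitude` (in pixels)
    is an assumed upper bound; larger amplitudes fall in the last bin.
    """
    amp_hist = [0.0] * n_bins
    dir_hist = [0.0] * n_bins
    for dx, dy in flows:
        amplitude = math.hypot(dx, dy)
        a_bin = min(int(amplitude / max_amplitude * n_bins), n_bins - 1)
        amp_hist[a_bin] += 1.0
        # Map the flow direction to [0, 2*pi) before binning.
        angle = math.atan2(dy, dx) % (2 * math.pi)
        d_bin = min(int(angle / (2 * math.pi) * n_bins), n_bins - 1)
        dir_hist[d_bin] += amplitude
    return amp_hist, dir_hist


amp_hist, dir_hist = motion_histograms([(1.0, 0.5), (-1.0, 3.0), (1.0, -2.0)])
print(amp_hist)
print(dir_hist)
```

Concatenating such histograms gives one fixed-length feature vector per sub-sequence, which is what the nearest-neighbor search operates on.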

Finally, given a feature vector (MH or BoVW), the nearest neighbors of each sub-sequence are found in the training set by comparing sub-sequences with a Euclidean distance. The best number of nearest neighbors is also chosen after a learning step. For each sub-sequence, the labels of the nearest sub-sequences at the finest granularity level (steps) provide the probability of belonging to each step.
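The nearest-neighbor estimate can be sketched as follows. This is a minimal brute-force version, assuming that the step probability is simply the fraction of the k nearest training sub-sequences carrying that step label; the function name and the toy feature vectors are hypothetical.

```python
import math
from collections import Counter


def knn_step_probabilities(query, training, k):
    """Estimate step probabilities for one sub-sequence.

    `training` is a list of (feature_vector, step_label) pairs. The result
    maps each label found among the k nearest neighbors (Euclidean
    distance) to its relative frequency.
    """
    nearest = sorted(training, key=lambda item: math.dist(query, item[0]))[:k]
    counts = Counter(label for _, label in nearest)
    return {label: count / len(nearest) for label, count in counts.items()}


# Toy 2-D feature vectors; real MH or BoVW vectors are much longer.
training = [((0.0, 0.0), "Incision"), ((0.0, 1.0), "Incision"),
            ((5.0, 5.0), "Phacoemulsification"), ((0.5, 0.0), "Phacoemulsification")]
print(knn_step_probabilities((0.0, 0.0), training, k=3))
```

A brute-force search is linear in the size of the training set; an indexed search would be needed for larger datasets.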

This concludes our presentation of observations. The statistical models are presented next.

3.2 Statistical Models of the surgical process

Let $O = (o_1, \dots, o_T)$ be a sequence of observations, where $o_t$ represents the observation generated at time $t$. Each observation is associated with a label for steps and a label for phases. Our goal is, given a model, to find the sequences of labels for steps and for phases that are most likely to generate the observation sequence.

The evaluated multilevel statistical models try to represent both the relationships between steps and phases and the temporal relationships at each level of description. For instance, if an "Incision" step is being performed, the probability of an "Opening" phase is high. Conversely, if an "Opening" phase is being performed, the probability of a "Stitching up" step is very low. Also, if an "Incision" step is being performed by the surgeon, we can refine the probabilities by knowing that a "Stitching up" step has a very low probability of occurrence in the next sub-sequence. The different surgical steps and phases identified by the surgeons of the Brest hospital are presented in Fig. 1.

3.2.1 HHMM

First, a Hierarchical Hidden Markov Model (HHMM) is used to model the relationships between steps and phases and the temporal relationships. The HHMM derives from the HMM, and each state of the HHMM is an HHMM as well. In our HHMM, each state represents a step or phase label. Each state produces a sequence of symbols (instead of a unique symbol), through a process of recursive activations which ends when a production state is reached. The production states (as opposed to the internal states) are the only states which emit observable symbols. In our case, labels for phases are represented by internal states, and each phase state is itself a HHMM composed of production states which represent the labels for steps (Fig. 3).


Figure 3: Example of HHMM; green nodes represent labels for phases and blue nodes represent labels for steps

Following the notations introduced by fine1998hierarchical, a state of a HHMM is denoted by $q_i^d$, where $i$ and $d$ represent the state index and the hierarchical index, and $D$ denotes the deepest level, which holds the production states. A HHMM is defined by the following set of parameters: $\lambda = \{\{A^{q^d}\}, \{\Pi^{q^d}\}, \{B^{q^D}\}\}$. For each internal state $q^d$, a transition probability matrix is associated, denoted by $A^{q^d} = (a_{ij}^{q^d})$. The transition probabilities $a_{ij}^{q^d}$ represent the probability of making a horizontal transition from the $i$th to the $j$th substate of $q^d$. The matrix $\Pi^{q^d}$ represents the initial probability of the substates of $q^d$. This can also be interpreted as the probability of making a vertical transition, which is the probability of entering substate $q_i^{d+1}$ from $q^d$. Then, an output probability vector $B^{q^D}$ is also associated with each production state $q^D$.

To link the presence of the surgical tools with the model, we compute the output probability vector of each production state, that is, the probability of being in a step given a pair of surgical tools. In the case of motion analysis as observation (§ 3.1.2), the probabilities provided by the KNN search directly provide the conditional probability required for the inference. All probabilities are learned by frequency counting in the training set.

In the HHMM, the relationships between steps and phases are modeled very simply: we only indicate which steps can happen in each phase. In particular, we do not indicate the probability of cooccurrence between a step and a phase. In this paper, we propose to model these cooccurrence relationships using a Bayesian network (BN). Once these relationships are modeled, temporal relationships can be analyzed independently for steps and for phases. With this relaxation, multiple temporal models can be used, such as hidden Markov models (HMMs) and conditional random fields (CRFs).

3.2.2 Bayesian networks to model the cooccurrence of steps and phases

Bayesian networks are suitable models pearl1998bayesian, as they can model the influence of step occurrence on phases but also the influence of phase occurrence on steps. Each label for steps and phases is represented by a node in the network. Each conditional relationship between a step and a phase is represented by an edge. As we need to link the observations, obtained from the visual content of the video, with the model, some observation nodes are added. If the presence of surgical tools in the field of view of the camera is available, each surgical tool is associated with an observation node: the observation is true if and only if the tool is present. If we consider motion analysis, the probabilities defined in § 3.1.2 need to be converted to Boolean evidence (true or false), for compatibility with Bayesian networks: each observation node is associated with a range of probabilities (for instance: "the KNN search indicates a probability between 10% and 20% that 'Incision' is being performed in the sub-sequence"). Thus, we define the structure of the Bayesian network by a graph whose nodes are the label nodes and the observation nodes, and whose edges are the conditional relationships. Conditional probabilities associated with each edge are learned by frequency counting in the training set.
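The conversion of a KNN probability into Boolean evidence can be sketched as a one-hot encoding over probability ranges. The sketch below assumes ten equal-width ranges, which matches the "between 10% and 20%" example above; the paper learns the number of ranges, so `n_bins=10` and the function name are illustrative assumptions.

```python
def probability_to_evidence(probability, n_bins=10):
    """Convert a probability into one Boolean value per probability range.

    Range i covers [i / n_bins, (i + 1) / n_bins); a probability of exactly
    1.0 is assigned to the last range. Exactly one entry is True.
    """
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must lie in [0, 1]")
    index = min(int(probability * n_bins), n_bins - 1)
    return [i == index for i in range(n_bins)]


# A KNN probability of 0.15 for "Incision" activates the 10%-20% node.
evidence = probability_to_evidence(0.15)
print(evidence.index(True))
```

With this encoding, each (label, range) pair corresponds to one observation node that is either true or false.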

3.2.3 BN + HMMs

One HMM is defined for each granularity level: one for the step level and one for the phase level. A HMM is defined by a quadruplet: a set of states, a transition matrix, an observation probability matrix and an initial state distribution. The sets of states are defined respectively by the labels for steps and the labels for phases. The transition probabilities which compose the transition matrices are obtained by counting transitions in the training set: $a_{ij} = N_{ij} / \sum_k N_{ik}$, where $N_{ij}$ represents the number of cases in the training set where we observe a transition from state $i$ to state $j$. In our case, it is not necessary to define the observation probability matrices, because the conditional probability is given by the Bayesian network inference (Fig. 4).

Figure 4: Example of a BN (on the left) with two HMMs (on the right); green nodes represent labels for phases and blue nodes represent labels for steps
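Estimating transition probabilities by counting can be sketched as follows. The sketch assumes each training video is available as a list of labels, one per sub-sequence, and normalizes each row of counts so that it sums to one; the function name and the toy sequences are hypothetical.

```python
from collections import defaultdict


def estimate_transitions(label_sequences):
    """Estimate transition probabilities by frequency counting.

    `label_sequences` holds one label sequence per training video.
    Returns {previous_label: {next_label: probability}}. Transitions never
    seen in the training set are absent from the result.
    """
    counts = defaultdict(lambda: defaultdict(int))
    for sequence in label_sequences:
        for previous, current in zip(sequence, sequence[1:]):
            counts[previous][current] += 1
    transitions = {}
    for previous, row in counts.items():
        total = sum(row.values())
        transitions[previous] = {label: n / total for label, n in row.items()}
    return transitions


sequences = [["Opening", "Opening", "Closure"], ["Opening", "Closure", "Closure"]]
print(estimate_transitions(sequences))
```

The absence of unseen transitions in the result illustrates why a small training set limits what a counting-based model can recognize.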

3.2.4 Extension

We also evaluate an extension of this model, in which we reuse the output probabilities for phases as additional observations in the BN. This aims to increase the influence of phase recognition (which should be easier) on step recognition. The results of the HMM for phases for the previous sub-sequence are added as complementary observations. In this case, some observation nodes are added to the model and linked with the nodes representing phase labels. Each observation node is associated with a range of probabilities.

3.2.5 BN + CRFs

A limitation of HMMs is that, if the training set does not represent all possible transitions, some transitions will likely not be recognized during the labelling session. CRFs are trained without explicitly counting cooccurrence frequencies in a reference dataset. Also, they do not consider consecutive steps only. Therefore, they are less limited by dataset size lafferty2001conditional.

Following the previous model, a Bayesian network is used to model the conditional relationships between the occurrence of steps and phases. However, the two HMMs are replaced by two CRFs: one CRF for each granularity level. In a CRF model, the conditional probability of the label sequence $Y = (y_1, \dots, y_T)$, given the observation sequence $O = (o_1, \dots, o_T)$, can be modelled by the equation:

$$P(Y \mid O) = \frac{1}{Z(O)} \exp\left( \sum_{t=1}^{T} w_u^\top \psi_u(y_t, o_t) + \sum_{t=2}^{T} w_p^\top \psi_p(y_{t-1}, y_t) \right)$$

where $\psi_u$ and $\psi_p$ are the CRF unary and pairwise potentials and $Z(O)$ is the normalization term. $w_u$ represents a vector of weights over the unary potentials and $w_p$ a vector of weights over the pairwise potentials, learnt from training data by maximum log-likelihood estimation.

The unary potentials represent the score of assigning a label to an observation. In our case, the definition of potentials can only rely on present and past information. Those potentials depend on observations and are obtained with the Bayesian network, which computes, at each time $t$, the probabilities of occurrence for each step and phase.

The pairwise potential represents the probability of switching from one label to another when moving from one observation to the next. The relationship between two adjacent observations is given by the transition probability computed from the training dataset.
The Wapiti library is used for training the CRFs. The L-BFGS (limited-memory Broyden-Fletcher-Goldfarb-Shanno) quasi-Newton algorithm is used to learn the two weight vectors.

Figure 5: Methodology of the system

3.3 On-line recognition

Each constructed model is used to determine which labels for steps and phases are associated with each subsequence of the video given the observation extracted from that subsequence. Inference algorithms are described below.

3.3.1 HHMM inference

Inference of the HHMM is based on the generalized Viterbi algorithm proposed by fine1998hierarchical, adapted to an on-line constraint. When the presence of surgical tools is used as evidence, the production states use this Boolean evidence. For motion analysis, the probabilities are provided directly by the KNN search to the production states. At the production state level, the inference is similar to the Viterbi algorithm forney1973viterbi. When a final state at the step level (production states) is reached, this implies a transition at the phase level (internal states), as follows:

  • Initialization:

  • At each time step:

The transition at the phase level activates the HHMM (at the step level) associated with the new phase. This inference finds the most likely multilevel state sequence.

3.3.2 BN + HMMs inference

The Bayesian network determines a probability of occurrence for each phase and step label, given the observations. Those probabilities are then used during the HMMs inferences to determine the most likely labels for steps and phases.

First, Bayesian network inference is performed with the D-lib library, using a method from the Markov chain Monte Carlo (MCMC) family: the Gibbs sampler. This inference algorithm computes posterior probabilities given the new observations, which are provided to the Bayesian network through observation nodes. At each temporal step, it computes the probability of each step label and of each phase label. Then, those probabilities are used by the two HMMs. HMM inference is performed with the Viterbi algorithm forney1973viterbi. This inference finds the most likely sequence of hidden states at each granularity level.
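One way to picture this inference at a single granularity level is a causal Viterbi update that combines the previous path scores, the transition probabilities and the Bayesian network posteriors for the current sub-sequence. The sketch below is an illustration under assumptions, not the authors' implementation: the function name, the uniform start, the `floor` value for unseen transitions and the per-step normalization are all hypothetical choices.

```python
def online_viterbi_step(prev_scores, transitions, posteriors, floor=1e-6):
    """Perform one causal Viterbi update for one granularity level.

    prev_scores: {label: best path score so far}, or None for the first
        sub-sequence (a uniform start is assumed).
    transitions: {previous_label: {next_label: probability}}.
    posteriors: {label: probability given by the Bayesian network}.
    Returns (normalized scores, most likely current label).
    """
    new_scores = {}
    for label, posterior in posteriors.items():
        if prev_scores is None:
            best_previous = 1.0
        else:
            best_previous = max(
                score * transitions.get(previous, {}).get(label, floor)
                for previous, score in prev_scores.items())
        new_scores[label] = best_previous * max(posterior, floor)
    total = sum(new_scores.values())
    new_scores = {label: score / total for label, score in new_scores.items()}
    return new_scores, max(new_scores, key=new_scores.get)


transitions = {"Opening": {"Opening": 0.9, "Closure": 0.1}, "Closure": {"Closure": 1.0}}
scores, label = online_viterbi_step(None, transitions, {"Opening": 0.8, "Closure": 0.2})
scores, label = online_viterbi_step(scores, transitions, {"Opening": 0.4, "Closure": 0.6})
print(label, scores)
```

In this toy run the second posterior slightly favors "Closure", but the temporal model keeps "Opening" because leaving it is unlikely; only past and present observations are used.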

3.3.3 BN + CRF inference

Similar to the previous model, the BN provides the probabilities of each label for steps and phases to the CRFs. CRF inference is usually performed with the forward-backward algorithm sha2003shallow, but in our case we would like to perform an on-line analysis, that is, during the execution of the surgery. So we can only use information from the past and present. The forward algorithm determines the probability of obtaining each label for the sub-sequence recorded at time t, for both steps and phases Quellec2014b.


The Wapiti library was used for CRFs inference, with a modification to allow online inference.
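The forward-only inference can be sketched for a linear-chain model as a recursion that is renormalized at every sub-sequence. This is a generic illustration, not the Wapiti modification used by the authors: the function name is hypothetical, and scalar weights `w_unary` and `w_pairwise` stand in for the learnt weight vectors.

```python
import math


def forward_step(prev_alpha, unary, pairwise, w_unary=1.0, w_pairwise=1.0):
    """Perform one step of the forward algorithm for a linear-chain CRF.

    prev_alpha: {label: normalized forward value at the previous time},
        or None for the first sub-sequence.
    unary: {label: unary potential for the current observation}.
    pairwise: {previous_label: {label: pairwise potential}}; missing
        entries are treated as 0.
    Returns the normalized forward values, which depend only on past and
    present observations.
    """
    alpha = {}
    for label, potential in unary.items():
        if prev_alpha is None:
            incoming = 1.0
        else:
            incoming = sum(
                value * math.exp(w_pairwise * pairwise.get(previous, {}).get(label, 0.0))
                for previous, value in prev_alpha.items())
        alpha[label] = math.exp(w_unary * potential) * incoming
    total = sum(alpha.values())
    return {label: value / total for label, value in alpha.items()}


pairwise = {"A": {"A": 2.0, "B": 0.0}, "B": {"A": 0.0, "B": 2.0}}
alpha = forward_step(None, {"A": 1.0, "B": 0.0}, pairwise)
alpha = forward_step(alpha, {"A": 0.0, "B": 0.0}, pairwise)
print(alpha)
```

Dropping the backward pass is what makes the computation usable during the surgery, at the cost of ignoring future observations.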

| | MH | BoVW | MH + Norm. | MH + feedback HMM P | MH + Norm. + feedback HMM P | Tools | Tools + feedback HMM P |
|---|---|---|---|---|---|---|---|
| Input type | Motion analysis | Motion analysis | Motion analysis | Motion analysis | Motion analysis | Tools | Tools |
| Steps | 0.674 | 0.686 | 0.721 | 0.676 | 0.733 | 0.903 | 0.946 |
| Phases | 0.812 | 0.828 | 0.832 | 0.811 | 0.819 | 0.922 | 0.910 |
| Means | 0.743 | 0.757 | 0.777 | 0.743 | 0.779 | 0.913 | 0.928 |
| Frames / s | 13.2 | 0.92 | 11.22 | 14.2 | 9.88 | 21.5 | 16.9 |

Table 1: Evaluation of the model consisting of a Bayesian network and two HMMs with the different sources of observations as input. A_z is the area under the ROC curve and the last row represents the number of frames we are able to process in a second.

The statistical models provide the probability of occurrence of each surgical step and phase. The labels with the maximum occurrence probability (for steps and for phases) are associated with the images of the interval.

4 Experiments

After a presentation of the dataset collected for this study, the influence of the different sources of observations as input of the models and the influence of the statistical models are evaluated.

4.1 Cataract Surgery Database

A database of 30 cataract surgeries was used in this study. Surgeries were performed by different surgeons. Videos were recorded in DV format. This database was manually annotated by surgeons using a description in phases and steps. Surgeons also manually indicated timing for the presence of surgical tools in the camera’s field of view. Five phases and 20 steps were identified by surgeons, as presented in Fig. 1. Surgeries in the database were not always performed in the same way: they are surgeon- and patient-dependent. The incision step, for instance, depends strongly on the surgeon. The "Phacoemulsification" step has a variable execution time depending on the stage of development of the cataract. Finally, the "Closure" phase can be performed through several "Wound Hydration" steps or through a "Point Suturing" step.

4.2 Training and evaluation procedure

As we need a sufficient number of examples to build the statistical model, and as our database is quite small, the evaluation of the system was performed through a 6-fold cross validation. The database was partitioned into six subsets of five videos. Each subset was used as test set, while the other subsets were used as training set. The training set was used to learn the model (structures and probabilities), and to optimize the parameters through a grid search. The parameters , , the HMMs’ time steps , or the number of potentials for the CRFs are only evaluated for the model using the presence of surgical tools as observation, which is the most reliable: the same parameters were used in the model using motion analysis.
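The partition used for the cross-validation can be sketched as follows. The sketch assumes contiguous, equal-size folds over a list of video identifiers; the paper does not state how videos were assigned to subsets, so the ordering and the function name are hypothetical.

```python
def cross_validation_folds(video_ids, n_folds):
    """Yield (training_ids, test_ids) pairs for equal-size folds.

    Each video appears in exactly one test set. The number of videos must
    be divisible by `n_folds`.
    """
    if len(video_ids) % n_folds != 0:
        raise ValueError("number of videos must be divisible by n_folds")
    size = len(video_ids) // n_folds
    for fold in range(n_folds):
        test = video_ids[fold * size:(fold + 1) * size]
        train = video_ids[:fold * size] + video_ids[(fold + 1) * size:]
        yield train, test


# 30 videos split into six subsets of five, as in this study.
for train, test in cross_validation_folds(list(range(30)), n_folds=6):
    print(len(train), test)
```

Any grid search over the parameters has to be run on the training part of each fold only, so that the test videos stay unseen.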

We evaluated the performance of our system by measuring the area under the Receiver Operating Characteristic (ROC) curve for each step and phase. We built one ROC curve for each step and phase and for each fold. We computed the mean area for steps, phases and a mean of the two levels. We also evaluated the number of frames that the system is able to process in a second using one core of a « quad core » Intel(R) Core(TM) i7-3770 (3.40GHz) processor.
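The area under the ROC curve for one label can be computed without tracing the curve, using its rank interpretation: the probability that a randomly chosen positive sub-sequence receives a higher score than a randomly chosen negative one. The sketch below is a minimal quadratic-time version under that interpretation; the function name and the toy scores are hypothetical.

```python
def roc_auc(scores, labels):
    """Compute the area under the ROC curve for one step or phase.

    `scores` are the predicted probabilities of the label and `labels`
    are truthy where the label is actually present. Ties between a
    positive and a negative score count as one half.
    """
    positives = [s for s, y in zip(scores, labels) if y]
    negatives = [s for s, y in zip(scores, labels) if not y]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


print(roc_auc([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0]))
```

Averaging this value over all steps, over all phases, and over both levels gives summary figures of the kind reported in the tables.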

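| | MH: BN + HMMs | MH: BN + CRFs | MH: HHMM | Tools: BN + HMMs | Tools: BN + CRFs | Tools: HHMM |
|---|---|---|---|---|---|---|
| Steps | 0.674 | 0.691 | 0.521 | 0.903 | 0.980 | 0.908 |
| Phases | 0.812 | 0.828 | 0.517 | 0.922 | 0.986 | 0.844 |
| Means | 0.743 | 0.759 | 0.520 | 0.913 | 0.983 | 0.863 |
| Frames / s | 13.2 | 13.2 | 3507 | 21.5 | 21.7 | 4791 |

Table 2: Evaluation of the three models with Motion Histograms (MH) and the presence of the surgical tools in the field of view of the camera (tools) as observations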

4.3 First experiment: influence of observations

First, the influence of the different observations on the performance of the system was evaluated on the model composed of a BN and two HMMs. For the model using motion analysis, the number of nearest neighbors and the number of classes used to represent the output KNN probabilities by observation nodes were also learned. After the learning step, the sub-sequence length and the temporal shift were set, respectively, to two seconds and one second for each validation set. The three kinds of input were evaluated and compared. For motion analysis, the two features (MH and BoVW) were first compared, then the influence of the spatial normalization of images was evaluated with MH as features. Then, the presence of the surgical tools was evaluated as input of the model. We also evaluated a combination of the presence of the tools and the HMM output for the previous sub-sequence as observations. The results are presented in Table 1.

The good results obtained using the presence of tools as observations confirm that the use of surgical tools is strongly correlated with step occurrence. Because the system processed about 21 frames per second using one core, it is compatible with the real-time constraint. Recognition performance was somewhat lower for surgical steps, because at this level the surgical process is more complex, with a larger number of possible transitions. With motion analysis as observations, the results were also lower. That was expected because, in this case, the observations are automatically extracted from the visual content. BoVW features provided better performance in terms of A_z than MH. But BoVW features are not compatible with the real-time constraint (0.92 frames per second). Results with MH were satisfactory anyway, especially given the number of frames we were able to process in one second for an entirely automated system. The spatial normalization of images improves performance in terms of ROC curve, with a mean area of 0.777 instead of 0.743, but the number of processed frames per second is slightly reduced. However, the system remains compatible with the real-time constraint. When the presence of tools is used as observations, feeding back the results of the HMM for phases as a complementary observation improves performance, especially for step recognition. This shows that the recognition of the surgical phases also has an influence on the recognition of the surgical steps.

4.4 Second experiment: model comparison

We then compared the different statistical models with two kinds of observations: the motion analysis with Motion Histograms (MH) as features and the presence of the surgical tools in the field of view of the camera (tools). The results are presented in Table 2.

The model composed of a BN and two CRFs achieved the best performance, with a mean area under the ROC curve of up to 0.98 for the two levels of description when using the presence of the surgical tools in the field of view of the camera. The improvement obtained by replacing HMMs with CRFs can be explained by the fact that our training set does not represent all the possible transitions, and CRFs better handle the small number of examples. The HHMM is very fast, with more than 4500 images processed per second. However, its results are inferior to those obtained with BN+CRFs or BN+HMMs, and are very low with motion histograms as features for the motion analysis. The HHMM inference is not able to detect transitions well, especially with a noisy input.

5 Discussion and Conclusion

In this paper, we have proposed several systems, based on statistical models, that are able to analyze a cataract surgery in real time (during the execution of the surgery). The recommended system consists of a Bayesian network and two CRFs because of its ability to model the relationships between steps and phases as well as temporal relationships, even with a low number of examples in the training set. The system is easily adaptable to other video-monitored surgeries.
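Analysis during the execution of the surgery implies that, at each instant, only past observations are available. As a generic illustration of this causal setting (not the authors' multilevel inference), the sketch below runs the standard forward filtering recursion of a discrete HMM, updating the state belief after each new observation. All parameter values are hypothetical.

```python
def forward_filter(initial, transition, emission, observations):
    """Yield P(state_t | obs_1..t) after each observation of a discrete HMM.

    initial[j]       : prior probability of state j
    transition[i][j] : probability of moving from state i to state j
    emission[j][o]   : probability of observing symbol o in state j
    """
    n = len(initial)
    belief = None
    for obs in observations:
        if belief is None:
            predicted = list(initial)
        else:
            # Prediction step: propagate the previous belief through the transitions.
            predicted = [sum(belief[i] * transition[i][j] for i in range(n))
                         for j in range(n)]
        # Update step: weight by the likelihood of the new observation.
        unnormalized = [predicted[j] * emission[j][obs] for j in range(n)]
        total = sum(unnormalized)
        if total == 0.0:
            raise ValueError("observation has zero probability under the model")
        belief = [u / total for u in unnormalized]
        yield belief


if __name__ == "__main__":
    initial = [0.5, 0.5]
    transition = [[0.9, 0.1], [0.1, 0.9]]
    emission = [[0.8, 0.2], [0.3, 0.7]]
    for belief in forward_filter(initial, transition, emission, [0, 0, 1]):
        print(belief)
```

Because each update costs a fixed number of operations per frame, this kind of recursion is compatible with processing a video stream as it is recorded.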

The proposed system works at two levels of description, which allows an accurate and complete analysis of the surgery. This is suited to the generation of specific warnings and recommendations to supplement supervision by expert surgeons. Given a surgery, the system provides the most likely sequences of surgical steps and phases. The system has been evaluated with two levels of description, but it could easily be adapted to any number of levels.
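To make the notion of a "most likely sequence" concrete, the following sketch implements the classical Viterbi algorithm for a single-level discrete HMM. It is a simplified stand-in for illustration only: the system described here combines a Bayesian network with two temporal models, which this example does not reproduce. State and observation indices are hypothetical.

```python
def viterbi(initial, transition, emission, observations):
    """Return the most likely state sequence of a discrete HMM.

    Plain probabilities are used for readability; log-probabilities
    would be preferable for long sequences to avoid underflow.
    """
    n = len(initial)
    # score[j]: probability of the best path ending in state j so far.
    score = [initial[j] * emission[j][observations[0]] for j in range(n)]
    backpointers = []
    for obs in observations[1:]:
        pointers, new_score = [], []
        for j in range(n):
            # Best predecessor of state j at this time step.
            best_i = max(range(n), key=lambda i: score[i] * transition[i][j])
            pointers.append(best_i)
            new_score.append(score[best_i] * transition[best_i][j] * emission[j][obs])
        backpointers.append(pointers)
        score = new_score
    # Backtrack from the best final state.
    state = max(range(n), key=lambda j: score[j])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


if __name__ == "__main__":
    initial = [0.6, 0.4]
    transition = [[0.7, 0.3], [0.4, 0.6]]
    emission = [[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]]
    print(viterbi(initial, transition, emission, [0, 1, 2]))
```

Extending the idea to several levels of description amounts to decoding one such sequence per level while accounting for the dependencies between levels.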

The system also makes it easy to evaluate any kind of observation (or combination of observations). Good results were obtained using the presence of surgical tools in the field of view of the camera as observations at the input of the statistical model, especially when the model consisted of a Bayesian network and two CRFs (one for each level of description). In this configuration, the labeling step was almost perfect and a mean area under the ROC curve of 0.982 was achieved. The presence of tools could be obtained easily in the case of a surgical simulator. For use during a real surgery rather than on a simulator, an automated system could be developed to recognize the surgical tools and enhance the CBVR tools. Indeed, more mixed results were obtained with the automated generation of observations from the visual content of the video through motion analysis. Good results were obtained for the recognition of surgical phases, with a mean area under the ROC curve of 0.828, but the results were lower for the recognition of surgical steps, with a mean area of 0.691. The results were improved by a spatial normalization of the images composing the videos.

The low number of examples in our dataset impacts the recognition performance of models based on HMMs (HHMM or BN+HMMs) because of the independences learned during training: if a transition is not represented in the training set, it cannot be recognized during the analysis of a surgery. Replacing HMMs with CRFs handles this problem well.
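The effect of unseen transitions can be illustrated with a minimal sketch that estimates a transition matrix by counting. With plain maximum-likelihood counts, a transition absent from the training sequences receives probability zero and can never be decoded. The optional `pseudo_count` parameter shows additive (Laplace) smoothing as one generic workaround; it is an assumption introduced here for illustration and is not the remedy adopted in the paper, which replaces HMMs with CRFs. The labels `s1`, `s2`, `s3` are hypothetical.

```python
from collections import Counter


def estimate_transitions(sequences, states, pseudo_count=0.0):
    """Estimate P(next | previous) from labelled sequences by counting.

    pseudo_count = 0 gives the maximum-likelihood estimate;
    pseudo_count > 0 applies additive smoothing (illustrative assumption).
    """
    counts = Counter()
    for seq in sequences:
        for prev, nxt in zip(seq, seq[1:]):
            counts[(prev, nxt)] += 1
    matrix = {}
    for prev in states:
        total = sum(counts[(prev, nxt)] for nxt in states) + pseudo_count * len(states)
        matrix[prev] = {
            # A state never seen as a predecessor gets an all-zero row.
            nxt: (counts[(prev, nxt)] + pseudo_count) / total if total else 0.0
            for nxt in states
        }
    return matrix


if __name__ == "__main__":
    training = [["s1", "s1", "s2"], ["s1", "s2", "s2"]]
    states = ["s1", "s2", "s3"]
    print(estimate_transitions(training, states)["s1"])
    print(estimate_transitions(training, states, pseudo_count=1.0)["s1"])
```

With only a few dozen videos, many legitimate transitions may never appear in training, which is why the zero entries matter in practice.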

In conclusion, a general framework was proposed for the automatic sequencing of surgeries, and encouraging results were obtained on a dataset of cataract surgery videos.

The authors would like to thank the Urban Community of Brest (Brest Métropole Océane) and the "Institut Mines-Télecom" for funding this project.