## I Introduction

Minimally invasive surgeries (MIS) are beneficial for patients due to less trauma, lower blood loss and faster recovery. MIS usually uses the laparoscope as an intraoperative imaging modality to provide surgical guidance. However, due to the limited field of view (FOV) and complex 6-DoF motion of the laparoscope, it is difficult for the surgeons to relate between the laparoscopic images and the in vivo anatomical structures [bergen2014stitching]. For example, for hernia repairs, it is imperative for general surgeons to scan the entire region to identify the size of the mesh that needs to be placed. A small FOV provided by the laparoscope makes it challenging to identify the correct size of the mesh. For lung segmentectomy surgery, the surgeon needs to identify different segments of the lung to isolate the segment with the tumor. A larger FOV obtained by image mosaicking technologies will help plan the sublobar resection to ensure complete tumor removal while preserving as much uninvolved lung parenchyma as possible.

Image mosaicking technologies have their roots in the computer vision field. Most early image mosaicking methods align the images according to the homography-based transformation

[szeliski1997creating], which uses a matrix to convert the two-dimensional (2D) pixel coordinates from one image to another. For example, Brown et al. used RANSAC [fischler1981random] and SIFT matches [lowe1999object] to compute the homography matrix between images [brown2007automatic], and this work has also been used to mosaic laryngoscopic images [iakovidis2013efficient]. Bano et al. proposed to estimates the homography using deep learning

[bano2020deep]. Homography assumes that the camera motion mainly comprises of rotational motion and/or the environment is planar, which makes it impracticle to handle image parallax caused by translational camera motion in the three-dimensional (3D) environment. To solve this problem, many image mosaicking works have been proposed to integrate local warp functions [zaragoza2013projective][li2017quasi][li2015dual]. For example, Zhang et al. proposed a parallax-tolerant image stitching method that first obtains homography-based alignment and then performs content preserving warping for refinement [zhang2014parallax]. Lee et al. developed a local warping method using multiple homographies and warping residuals [lee2020warping]. Chen et al. proposed to use grid mesh to guide the local warp models [chen2016natural]. Fan et al. proposed to use stereo videos for refining the warps [fan2019stereoscopic]. Although these works have shown promising results, they are mostly designed for static environments and are difficult to handle the deforming in vivo environments. In addition, the heavy computational burden to compute the local warps made them too slow for real-time surgical navigation.Image mosaicking methods have found many clinical applications, such as microscopic and fetoscopic images mosaicking. For microscopic images, the main difficulty is to handle the tissue deformation [kose2017automated]. For example, a recent work by Guo et al. proposed to mosaic laser endomicroscopic images by first estimating the initial rigid transformation from feature matches, and then estimating the local non-rigid warps using an intensity-based similarity metric [gong2020robust]. Vercauteren et al. proposed to compensate for the motion distortion of fibered microscopy images by modeling the relationship between motion and motion distortions [vercauteren2006robust]. These methods are sufficient for mosaicking the microscopic images because the objects at the microscopic scale are almost planar, but are difficult to handle laparoscopy images due to the complex 6-DoF camera motion and significant 3D shapes of the tissues. For fetoscopic images, most existing methods consider the tissues as rigid, and the deformation problem has not been not fully addressed [bano2020deepMiccai][gaisser2018stable].

In this paper, we propose a novel 2D non-rigid simultaneous localization and mapping (SLAM) method for laparoscopy images mosaicking in real-time. The concept of non-rigid SLAM was proposed in the DynamicFusion work [newcombe2015dynamicfusion], and is now an emerging topic in the computer vision field. Unlike the traditional rigid SLAM methods that estimate the 6-DoF rigid motion of the camera [totz2011dense]

, non-rigid SLAM estimates the deformation and motion of the environment with respect to the camera, which usually has high degrees of freedom. Our 2D non-rigid SLAM method considers the 2D image mosaic as the environment map, which is similar to the 3D point cloud built by traditional 3D SLAM methods. Non-rigid SLAM has also been introduced in the medical field. For example, Song et al. proposed the MISSLAM method to compensate for tissue deformation and track the camera motion in real-time

[song2018mis]. However, most existing non-rigid SLAM methods were designed for 3D reconstruction and used the iterative-closet-points (ICP) algorithm for non-rigid alignment, which cannot align 2D images for mosaicking. To solve this problem, one natural idea is to use the feature matches to build non-rigid correspondences between images. However, mismatches are unavoidable and the non-rigid deformation makes it difficult to remove the mismatches. In addition, dense image mosaicking requires the ability to track every pixel, but feature matches are sparse. To overcome these difficulties, we introduce our previous work called as expectation maximization and dual quaternion (EMDQ) [zhou2020smooth], which can generate dense and smooth deformation field from the sparse and noisy feature matches in real-time. Using EMDQ as the key algorithm to track the deformation of image pixels, we have also proposed an uncertainty-based loop closing method to reduce the accumulative errors. To the best of our knowledge, this is the first non-rigid SLAM method developed for image mosaicking.The paper is organized as follows: Section II will give a brief introduction of our EMDQ algorithm. Section III will describe the design of the 2D non-rigid SLAM system, including tracking, loop closing, uncertainty management, smoothing and dense mosaicking. Evaluation results on in vivo data are presented in Section IV. A discussion is presented in Section V.

### I-a Related Works

The field of view of the laparoscope is small and the ability to generate a larger field of view from input laparoscopic videos can provide better guidance for diagnosis and surgical planning, which is clinically useful for several surgical procedures including those on the bladder [soper2012surface], retina [richa2012hybrid] and colon [karargyris2010three]

. This problem has been studied for decades, and the methods can be roughly classified into 3D and 2D. The 3D mosaicking methods are mostly based on structure from motion (SfM) or simultaneous localization and mapping (SLAM). SfM methods require batch processing, thus making it difficult for real-time applications. The SLAM methods, which originated from the robotics navigation field, have been introduced for surgical navigation to extend the field of view of the laparoscope in real-time. For example, our previous work generated dense 3D model of the tissue surface from stereo laparoscopy videos by using SLAM

[zhou2019real]. Mahmoud et al. proposed a monocular dense reconstruction method for tissue surfaces [mahmoud2018live]. Mountney et al. proposed to use EKF-SLAM to build a dense 3D textured model of the MIS environment [mountney2009dynamic]. However, because most SLAM systems are designed for static environment, they cannot handle the deformation of soft tissues. To solve this problem, Mountney et al. proposed to learn a periodic organ motion model for deformation compensation [mountney2010motion], which cannot handle more general tissue deformation. A recent work by Lamarca et al. was able to track the camera motion in deforming environments, but did not perform mosaicking [lamarca2020defslam]. The 3D mosaic can provide richer information to the surgeons than 2D, but often requires additional camera calibration steps in the clinical workflow. Further for stereo laparoscopes, which are not the standard of care imaging modality, the calibration must be highly precise for the stereo matching algorithms [poggi2019guided]. For some clinical applications, 3D mosaicking is less demanding and 2D mosaicking of the laparoscopic images is sufficient for the clinicians to better understand the in vivo environment. For example, Erden et al. proposed to quantify the behavior of soft tissues by mosaicking the microlaparoscopic images [erden2012understanding].For both 3D and 2D mosaicking, the majority of methods are designed for rigid environments. For 2D cases, the pixel deformation problem is even more serious, because the translational motion of the camera on 3D structures can also contribute to the non-rigid motion of the pixels. An intuitive example is that when the camera scans the tissue surface, distant areas will have slower motion than closer areas in the image, which is also referred to as parallax and has attracted much attention in the computer vision field. For example, Zaragoza et al. proposed an as-projective-as-possible warp that allows local non-projective deviations [zaragoza2013projective]. In the medical field, many works describing algorithms to compensate for pixel deformation have been reported for the mosaicking of microscopy images[hughes2015high][rosa2012building][loewke2010vivo]. We refer to the recent review paper [perperidis2020image] by Perperidis et al. for more details. However, since the tissue surface at the microscopic scale is nearly planar, these methods are designed for planar objects with small deformation, which cannot be applied to the laparoscopy images robustly [bergen2014stitching].

Non-rigid image mosaicking is closely related to a new concept called non-rigid SLAM [newcombe2015dynamicfusion][innmann2016volumedeform], which was developed in recent years with the purpose of simultaneously estimating the deformation and performing mosaicking in real-time. Most non-rigid SLAM methods were designed for 3D reconstruction, which converted the deformation estimation problem to an optimization problem that minimized a cost function consisting of ICP, smoothing and other terms, and ran on powerful GPUs to achieve real-time. However, since ICP cannot be applied to RGB images directly without proper segmentation or edge detection, existing non-rigid SLAM methods cannot be used for 2D image mosaicking. To the best of our knowledge, this paper proposes the first non-rigid SLAM system that is purely based on feature matches for non-rigid image mosaicking. In addition, our method can work in real-time on both CPU and GPU.

## Ii Brief Introduction of EMDQ

First, we briefly describe the EMDQ algorithm, which is our recent work and the key algorithm in the 2D non-rigid SLAM system proposed in this paper. Details of EMDQ are given in Ref. [zhou2020smooth]. Generally speaking, the EMDQ algorithm is used as a black box in the proposed non-rigid SLAM system. The input of EMDQ is the coordinates of image feature matches and the output is the dense deformation field between images.

### Ii-a Emdq

The basic idea of the proposed 2D non-rigid SLAM system is to estimate the deformation of image pixels from feature matching results, and then perform mosaicking accordingly. The EMDQ algorithm can remove mismatches from noisy feature matches, and generate smooth and dense deformation field by interpolating among the feature matches in real-time. The deformation field is represented using dual quaternion (DQ)

[kavan2008geometric], which is a useful mathematical tool to generate smooth interpolation among multiple rigid transformations. As an example, Fig. 1 shows the motion of pixels from one frame to another that can be obtained by the deformation field generated by EMDQ, which is essential for our non-rigid SLAM system. However, due to the fact that the deformation field is generated by interpolating among sparse feature inliers, two problems need to be further addressed: (1) the feature inliers may not distribute at all areas on the image, hence the deformation field at areas that are distant from feature inliers may be inaccurate, which is often referred to as uncertainty in the registration problem, (2) accumulative errors may exist if the deformation field is tracked from pairs of adjacent video frames. Our 2D non-rigid SLAM consists of multiple algorithms and data management methods to address the above problems.### Ii-B Image Feature Matching Method Selection

Although EMDQ can generate the deformation field from the results of any image feature matching methods, such as ORB [rublee2011orb] and SURF [bay2006surf], it is important to select an appropriate feature matching method for the accuracy, speed and robustness of the non-rigid SLAM system. Since EMDQ generates the deformation field by interpolating among the feature matches, the accuracy is higher if all pixels have close feature inliers. Due to this reason, ORB is not appropriate since the ORB feature points mainly distribute at the rich texture areas, which makes it difficult to track low texture areas accurately. Although there exist improved ORB methods [mur2015orb] that are able to detect feature points uniformly on the images and have been widely used in rigid SLAM systems, in practice we found that its percentage of inliers is significantly lower than that of the standard ORB. This is acceptable for rigid SLAM because the rigid motion model can be estimated with a few matches. However for non-rigid SLAM, it may result in low robustness because some image areas may not have feature inliers.

Compared with ORB, SURF is more accurate but much slower, which makes it unpopular for the rigid SLAM systems. A major computational burden of SURF is to build the image pyramid for handling the change of image scale. However, the change of image scale in the laparoscopic image mosaicking task is usually small. In addition, our non-rigid SLAM system tracks the deformation from pairs of adjacent video frames, which have small change of image scale as long as the camera motion is not too fast. Hence, to achieve faster computational speed, we propose to reduce the number of SURF octave layers to one, which avoids the use of image pyramid and significantly reduces the computational burden. In practice we found that SURF with one octave layer works very well. However, given the development of learning-based features [schonberger2017comparative], we are not implying that SURF is the best choice but it is easy to replace SURF with other features since EMDQ only needs the coordinates of feature matches as the input.

## Iii 2D Non-rigid SLAM System Design

The prerequisite of dense and nonrigid image mosaicking is to recover the deformation of all image pixels. Without loss of generality, this 2D non-rigid SLAM system considers the video frame at time as the reference frame, and estimate the deformation field of frames at with respect to frame 0 for mosaicking. Inspired by DynamicFusion [newcombe2015dynamicfusion], we use sparse control points, or deformation nodes, to represent the deformation field to reduce the computational burden. The nodes are assigned with warp functions and motion of other image pixels are obtained by interpolating among the warp functions of the neighboring deformation nodes. Specifically, the warp function of a node is represented by a scale factor and a dual quaternion, which is

(1) |

where and are image coordinates of node at time and respectively (in pixels). represent the warp function of node at time , and and ^{1}^{1}1Standard DQ is a 8-dimensional vector. For the image mosaicking task, we consider the image is on the plane and 4 dimensions are always zero, which are removed to reduce the computational burden. are the related scale factor and dual quaternion respectively. is the dual quaternion-based transform. and

are the related rotation matrix and translational vector respectively, which are not explicitly used in our method.

Similarly, we denote the warp function on a pixel at time as . In our system, we use the linear combination of the neighboring nodes to obtain , that is

(2) |

and

(3) |

where is the weight between node and pixel , which is determined by the distance between node and pixel at time 0, that is

(4) |

where is a coefficient.

Hence, the deformation recovery problem is equivalent to the estimation of the warp functions , . We propose a novel 2D non-rigid SLAM framework based on EMDQ, as shown in Fig. 2. This non-rigid SLAM system tracks the deformation from adjacent video frames using the EMDQ algorithm, which uses the SURF matching results as the input. However, this tracking strategy may result in accumulative errors and to solve this problem, a loop closing method is integrated into the system. The tracking and loop closing results are merged according to the uncertainties. This system also include methods to determine, propagate and merge the uncertainties. Finally, according to the deformation estimation results, we warp the coordinates of image pixels from time to according to Eq. (2) and (3), and generate a larger mosaic in real-time by using CPU or GPU parallel computing technologies.

### Iii-a Tracking

For each video frame at time step , we extract the SURF feature points and match them with the frame. The changes of for node can be obtained from the EMDQ algorithm, which are denoted as . Then, we update the warp functions of all node with

(5) |

where is the updating function of the warp functions, which involves dual quaternion-based computation and the details are given in Appendix A. We use the subscript label ”track” to denote the results of the tracking module.

### Iii-B Loop Closing and Key Frames

The results of the tracking module are accurate if the video sequence is short but may have accumulative errors for long video sequences. Hence, we integrate the loop closing module. As the example shown in Fig. 3, the basic idea of the loop closing method is to match the current video frame at time with the previous key frames, obtain the EMDQ deformation field with respect to the key frames, and then merge the tracking and loop closing results according to the uncertainties.

Add new key frames: The non-rigid SLAM system maintains a list of key frames in the computer memory for loop closing. We will add frame as a new key frame if the nodes have large displacements compared with those of the existing key frames, that is

(6) |

where is the index of previous key frames and is the set of key frame indexes, and are the coordinates of node at frame and key frame respectively, is a threshold. The information stored in the computer memory mainly include the SURF features and the warp functions of the deformation nodes.

Select previous key frames for loop closing: From the list of key frames, we select key frames that are close to the current frame for loop closing, and the distance metric is the same as in Eq. (6). Using brute force, we compare all existing key frames with the tracking results , which are obtained by Eq. (5) and (1). This brute force search is effective because the number of key frames is often small for the following reasons: (1) the surgical scene is usually much smaller than that of rigid SLAM methods, hence it does not need a long video sequence for mosaicking, and (2) the method to add key frames, given by Eq. (6), guarantees that the key frames have small overlap. For situations when long video sequences are required, such as fetoscopy or cystoscopy images, one may use the kd-tree structure for searching close frames. With a selected key frame , we perform the SURF matching, the EMDQ computation and update the warp function (5) to obtain the estimation results of the warp functions at time , that is

(7) |

where is the change in warp function of node from key frame to frame obtained by the EMDQ algorithm.

Tracking Failure: Too fast laparoscope motion and/or large glossy components of the tissue surface may cause a failure in the feature tracking process. In traditional rigid SLAM systems, this problem is usually solved in the loop closing module by matching the previous key frames. For short video sequences, the number of previous key frames is limited and it is feasible use brute force search. For long video sequences, one may refer to the bag-of-words (BoW) method that is widely used in large-scale rigid SLAM systems for obtaining candidate key frames to match the current frame [mur2015orb]. In addition to the technology solution, it is also possible to include the surgeon in the image mosaicking process by warning too fast laparoscope motion to retrieve the tracking, since our method works in real-time.

### Iii-C Uncertainty

The estimated warp functions from the tracking and loop closing modules are and

respectively, which may be different and we merge them according to the uncertainties. The basic idea is borrowed from the extended Kalman filter (EKF) that uses Gaussian distributions to assign soft weights for merging.

Uncertainty of the EMDQ results: According to Eq. (5) and Eq. (7), the warp functions are updated according to results of EMDQ. Hence, the uncertainty of the EMDQ results is essential for estimating the uncertainty of the tracking and loop closing results. Because the EMDQ algorithm generates the deformation field by interpolating among the inliers of the feature matches, it is intuitive that if node is distant from the inliers of feature matches, the uncertainty of is high. Under this analysis, the uncertainty of is

(8) |

where is a coefficient, is the distance between node and the feature inlier . For each node , we search all feature inliers and use the ones that provide the minimum uncertainty. We use the exponential function to make the uncertainty small at areas that are close to the features, and increase significantly at distant areas. This design is consistent with the observation that the estimation of areas that are close to the control points are much more certain and accurate.

Uncertainty propagation: According to Eq. (5), the uncertainty of node at time , , should be updated from and . Because is independent with , the uncertainty of node is propagated by

(9) |

The uncertainties of the loop closing results, , are obtained similarly.

However, because the tracking module updates the warp functions from each two adjacent frames, that is , the uncertainties of nodes may increase too fast according to the prorogation method (9), since is determined by the distance between node and the image features (see (8)). To solve this problem, we propose to take into account the feature matching relationships among multiple frames. For example, for a feature point that can be tracked continuously at multiple frames, the uncertainties of its neighboring nodes should not increase too fast. Hence, we introduce the concepts of feature uncertainty. For a feature at time , if it is an inlier (can find the correct correspondence at by EMDQ), then its uncertainty is propagated by

(10) |

where is the squared error of feature when performing the EMDQ algorithm between and . Because only matches with small errors are considered as inliers by the EMDQ algorithm, is small and the increasing rate of the uncertainties of feature inliers is small. If feature is an outlier at time , then is determined in the same way as in Eq. (9).

Then, we introduce a spatial restriction between the uncertainties of nodes and features in the tracking module. For a node , its uncertainty should satisfy

(11) |

for any image feature . Because the increase in the feature uncertainties are limited by (10), the increase of node uncertainties can also be limited.

Uncertainty-based Merging: For each node , we consider the tracking and loop closing modules as two sensors, and merge their results and by using the extended Kalman filter (EKF) according to the uncertainties and . The following merging algorithm is adapted from Ref.[sun2004multi]. For node , the covariance matrix is

(12) |

where is the correlation coefficient, which is used in EKF to suggest the correlation relationship between sensors. In this system we determine by

(13) |

where and are the coordinates of node at time and key frame respectively, is a coefficient. Then, the merged uncertainty is

(14) |

and the weights of the two sensors are

(15) |

According to the weights (15), we take the weighted average of and and obtain a new warp function for node at time . However in practice we found that the equations (12) to (15) may result in negative weights if is too large, in that case we will simply take the values related to the smaller uncertainty as the merged values.

The uncertainties increase as in the tracking module (Eq. (9)), and decrease after merging with the loop closing results (Eq. (14)). In this way our system maintains the uncertainties of nodes at a low level.

The above EKF-based merging is equivalent to linear merging if . EKF-based merging is more appropriate to handle slow motion when the results of tracking and loop closing modules are obtained from very close frames, hence one of them should be omitted without decreasing the uncertainty by Eq. (14). When the motion is slow, according to (13), which will result in the omission.

### Iii-D ARAP-based Smoothing

After merging the results of tracking and loop closing modules, we add an as-rigid-as-possible (ARAP) smoothing [sorkine2007rigid] step to obtain the final estimation results of the warp functions at time . ARAP smoothing is widely used in the non-rigid SLAM systems [newcombe2015dynamicfusion], which is usually integrated into the cost function and minimized by a Gauss-Newton-like optimization method. Because these real-time optimization methods often run on a powerful GPU, we propose a novel iterative method that uses closed form ARAP results to update the warp functions, which is also effective on the CPU. Specifically, the ARAP warp function of node at time , , is computed from the change of the coordinates of its neighboring nodes between time 0 and . Specifically,

(16) |

(17) |

where is the weight between node and , which is similar to Eq. (4). In practice we will remove the related columns of and if is too small for faster computation. It is worth noting that there exists a trade-off between speed and robustness, because a large will reduce the number of neighboring nodes for ARAP smoothing, which will result in faster speed but lower robustness.

Then, following Ref.[arun1987least], the rotation matrix , translation vector and scale can be obtained by

(18) |

(19) |

(20) |

Then we generate the quaternion from and .

Denoting the ARAP warp functions of node at time as , our goal is obtain the warp function that minimizes

(21) |

where is a coefficient suggesting the weight of the ARAP term, is a fixed value. In this cost function, is the data term, which is determined by the merging method and is fixed in this step. We update by

(22) |

Our method to minimize cost (21) is to iteratively estimate by using Eq. (18)-(20), and then estimate the new by using Eq.(22) to update according to Eq. (1). We check the cost (21) after each iteration, and will terminate the process if the cost increases. In practice we found that a few iterations can obtain good results hence we set the maximum number of iterations to 5.

As the example shown in Fig. 4, the initial deformation nodes can be tracked robustly by using the methods introduced in sections III.A-III.D. In addition, all pixels can also be tracked according to the interpolation among the nodes, which is a prerequisite for dense and non-rigid image mosaicking.

### Iii-E Adding New Nodes

At time 0, we insert the first node at the center of the image, and then insert new nodes by checking the 6-neighboring locations of existing nodes iteratively until all image areas are covered. The same node inserting strategy is also performed when new observed areas are distant from the existing nodes. The warp functions of new nodes are equal to the weighted average value of the neighboring existing nodes, where the weights are the same as the ARAP weights (4).

### Iii-F Dense and Non-rigid Image Mosaicking in Real-time

As the reference frame, frame is inserted directly to the mosaicking image. At time , we compute the warp functions of nodes , and then compute the warp effects on each pixel by interpolating among the neighboring nodes following Eqs. (2) and (3). The coordinate of pixel at time can be obtained by , and the related RGB value, , will be merged to the mosaicking image at . To make the mosaicking image more smooth, we have developed the truncated signed distance function (TSDF)-like [curless1996volumetric] method to merge frame with the large mosaicking image, that is

(23) |

(24) |

The above computations, including both estimation and RGB values merging, need to be performed for all pixels, which is computationally expensive because the laparoscopic images may have millions of pixels. Note that the computations are independent for each pixel, hence parallel computational technologies can be used for acceleration. We have developed both CPU and GPU-based parallel computation methods. The CPU parallel computation is based on the OpenMP library to make full use of all CPU cores, and the GPU parallel computation is developed using CUDA C++, which launches GPU threads for each pixel in parallel. In practice we found that a powerful CPU can achieve the real-time requirement, while the GPU-based computation can make the system performance 2x faster than the CPU version.

### Iii-G Parameters Setting

Key parameters used in our method are as follows: 2e-4 is used to compute the weight between nodes and pixels (see Eq. (4)), which is also used for computing the ARAP weights between nodes. 3e-3 is used to compute the uncertainty of nodes according to the distance to feature inliers (see Eq. (8)). 5e-3 is used to correlation coefficient for uncertainty-based merging (see Eq. (13)). Every frames, we perform the loop closing step and every 2 frames are used for mosaicking. Because , and are related to squared distances in pixels, to make the algorithms self-adaptive to different image resolutions, we propose to adjust , and by multiplying . By considering as the reference resolution, suggests the scale change of the input video frames, where and are the width and height of the video frames.

## Iv Experiments

The source code was implemented in C++ and ran on a desktop with an Intel Core i9 3.0 GHz CPU (16 cores) and NIVIDA Titan RTX GPU.

To evaluate the performance of the proposed mosaicking method on laparoscopic images in real-world surgical scenarios, we obtained intraoperative videos during surgeries performed in our hospital and online videos ^{2}^{2}2http://hamlyn.doc.ic.ac.uk/vision/. The videos were recorded under an Institution Review Board approved protocol. We compared the results of our algorithm to the the as-natural-as-possible (ANAP) [lin2015adaptive] and Parallax [zhang2014parallax] image mosaicking methods for comparison. Both ANAP and Parallax include elastic warping mechanisms to handle image deformation. Due to the heavy computational burden, both ANAP and Parallax are off-line methods and their codes were available as Matlab scripts. In our experiments, we down-sampled the video frames to reduce the computational time of ANAP and Parallax since they are very slow for long image sequences. For the runtime, we report the average per frame computational time for our method since it is an online real-time method, and report the total computational time for ANAP and Parallax since they are off-line methods.

As shown in Fig. 5, the first experiments were conducted on laparoscopy videos captured during robot-assisted liver surgery. We asked the surgeons to move the laparoscope within the patients’ abdomens. The livers had small deformation caused by respiration. The translational motion of the laparoscope is large and the tissue surfaces had significant 3D shapes, which caused large pixel deformation due to parallax. For these cases, ANAP and Parallax are not as robust as our method.

For the experiment shown in Fig. 6, the videos were captured during a minimally invasive sublobar lung surgery. Due to heartbeat and respiratory motion caused by the adjacent lung, the deflated lung had significant and fast deformation. For this data, the camera motion was mainly along the tissue surface without significant changes in magnification, which made it easier to match adjacent video frames. Hence, all three methods were able to obtain satisfying results.

In the experiments shown in Fig. 7 and 8, the laparoscopy videos were obtained from the Hamlyn public dataset, which includes videos showing the porcine abdomen deformed due to respiration. Our method and the Parallax method can obtain good mosaicking results.

In these above experiments on in vivo laparoscopy videos, our online real-time mosaicking method obtained excellent results, which were comparable with the results of the off-line methods. Since the tissue deformation in most of the collected in vivo videos were relatively small, the ability to handle large deformation was not fully evaluated. Hence, we introduce the Mandala dataset from Ref. [lamarca2020defslam], which includes four gray-level image sequences of a soft blanket. As shown in Fig. 10, the deformation of the blanket increased from case 1 to case 4, and the camera scanned the deforming blanket. The deformation in the Mandala data was significant and fast. Our method demonstrated better ability to handle large deformation to reserve more texture details, as illustrated in Fig. 9. In this experiments, the images blending method was the multi-band blending (MBB) method [brown2007automatic], which can handle larger registration errors caused by the large deformation. However, MBB may not be appropriate for blending laparoscopy images due to the complex illumination condition. Hence, for laparoscopy images mosaicking, we found the TSDF-based methods (Eq. (23) and (24)) can obtain better results.

## V Conclusion

We have developed a novel 2D non-rigid SLAM system for laparoscopy images mosaicking. To the best of our knowledge, this is the first image mosaicking method that can handle large translational camera motion, complex anatomical surfaces and tissue deformation in real-time, which makes it possible to mosaic in vivo laparoscopic images. Experiments with real-world in vivo data on different types of organs have shown the feasibility of our method. This would be particularly useful for minimally invasive surgeries wherein a larger field of view can provide the surgeon with a precise anatomical map to plan the surgery (for example segmentectomy) or placement of a device (meshes for hernias).

Limitations and future works: (1) 2D image mosaicking has an intrinsic requirement that different images of 3D objects can be reprojected to the same mosaic, hence it may not be able to mosaic images obtained at arbitrary positions. Hence, our method is mainly designed for situations when the laparoscope moves along the tissue surface, and cannot handle some types of camera motions, such as going through the tubular airways/vessels. (2) Our method works under a reasonable assumption that during the scan, the instruments are removed from the field of view. Occlusion by surgical instruments may interfere with the SURF feature matching results and further affect the deformation field generated by EMDQ. Future works will include the integration of instrument segmentation to improve the robustness. (3) Because EMDQ assumes the deformation is smooth, our method cannot handle sharp non-rigid motions of the tissue, such as cutting of tissue surfaces.

## Appendix A Dual Quaternion-based Warp Function Updating

The updating function of the warp function used in (5) is

(25) |

which suggest that if

(26) |

then and should have

(27) |

Under this requirements, the details of the computation are as follows:

(28) |

where is the function that convert a translational vector to the related dual quaternion. Then,

(29) |

(30) |

where is the multiply function between two dual quaternions.