Long-Term Temporally Consistent Unpaired Video Translation from Simulated Surgical 3D Data

by Dominik Rivoir, et al.

Research in unpaired video translation has mainly focused on short-term temporal consistency by conditioning on neighboring frames. However, for transfer from simulated to photorealistic sequences, available information on the underlying geometry offers potential for achieving global consistency across views. We propose a novel approach which combines unpaired image translation with neural rendering to transfer simulated to photorealistic surgical abdominal scenes. By introducing global learnable textures and a lighting-invariant view-consistency loss, our method produces consistent translations of arbitrary views and thus enables long-term consistent video synthesis. We design and test our model to generate video sequences from minimally-invasive surgical abdominal scenes. Because labeled data is often limited in this domain, photorealistic data where ground truth information from the simulated domain is preserved is especially relevant. By extending existing image-based methods to view-consistent videos, we aim to improve the applicability of simulated training and evaluation environments for surgical applications. Code and data will be made publicly available soon.





1 Introduction

One of the most promising applications of GAN-based image-to-image translation [12, 44] is the transfer from the simulated domain to realistic images, as it presents great potential for applications in computer graphics. More importantly, unpaired translation [48] (where no image correspondences between domains are required during training) enables the generation of realistic data while preserving ground truth information from the simulated domain which would otherwise be difficult to obtain (depth maps, optical flow or semantic segmentation). The acquired synthetic data can then serve as training or evaluation data in settings where labeled data is limited.

The availability of realistic, synthetic data is especially crucial in the field of computer-assisted surgery (CAS) [24, 5]. CAS aims at providing assistance to the surgical team (e.g. visualizing target structures or predicting complications) by analyzing available sensor data. In minimally-invasive surgery, where instruments and a camera are inserted into the patient’s body through small ports, video is the predominant data source. Intelligent assistance systems are especially relevant here, since performing surgery through small ports and with a limited view is extremely challenging. However, two major factors which currently limit the impact of deep learning in CAS are the lack of (a) labeled training data and (b) realistic environments for evaluation [23]. For instance, evaluating a SLAM (Simultaneous Localization and Mapping) algorithm [30, 40] on laparoscopic video data poses several problems, since the patient’s ground truth geometry is typically not accessible in the operating room (OR) and recreating artificial testing environments with realistic and diverse patient phantoms is extremely challenging. Other CAS applications which could benefit from temporally consistent synthetic training data include action recognition, warning systems, surgical navigation and robot-assisted interventions [24, 23].

Previous research has shown the effectiveness of synthetic, surgical images as training data for downstream tasks such as liver segmentation [34, 38]. However, their applications are still limited since many challenges in CAS include a temporal component. Using the previous example of evaluating a SLAM algorithm, realistic as well as temporally consistent video sequences would have to be generated in order to provide a useful evaluation environment.

Unpaired video translation has recently garnered interest in various non-surgical specialties [3, 10, 7, 9, 31, 47]. Most approaches condition the generator on previously translated frames to achieve smooth transitions, i.e. short-term temporal consistency. However, they are fundamentally not designed for long-term consistency. Intuitively, when an object entirely leaves the field of view, consistent rendering cannot be ensured when it returns, since the previous frame contains no information regarding the object’s appearance. Even when the model is conditioned on multiple frames, the problem persists in longer sequences.

In the special case of translating from a simulated environment, however, the underlying geometry and camera trajectories are often available. Point correspondences between views are thus known and can be used to ensure globally consistent translations. The relatively new research area of neural rendering [42] aims at using the knowledge of the underlying 3D scene for image synthesis but has mainly been studied in supervised settings to date [42, 22, 39, 43].

We propose a novel approach for unpaired video translation which utilizes the available information of the simulated domain’s geometry to achieve long-term temporal consistency. A state-of-the-art image translation model is extended with a neural renderer which learns global texture representations. This way, information can be stored in 3D texture space and can be used by the translation module from different viewpoints. That is, the model can learn the position of details such as vessels and render them consistently (Fig. 1). To ensure texture consistency, we introduce a lighting-invariant view-consistency loss. Furthermore, we employ methods to ensure that labels created in the simulated domain remain consistent when translating them to realistic images. We show experimentally that our final generated video sequences retain detailed visual features over long time distances and preserve label consistency as well as optical flow between frames.

2 Related Work

2.1 Unpaired Image and Video Translation

Image-based GANs [12, 35] have gathered much attention showing impressive results as unconditioned generative models [35, 6, 15, 16, 17] or in conditional settings such as image-to-image translation [44, 8, 33, 21]. However, their real-world applications are often limited, since the content of purely generative models is difficult to control and supervised image-to-image translation requires corresponding image pairs, which are often not available. The introduction of unpaired image translation through cycle consistency [48] hence widened their applicability and impact. Since then, several extensions have been proposed using shared content spaces [19], multi-modality [18, 14], few-shot translation [20] or removing the need for cycle consistency through contrastive learning [32]. From an application standpoint, several works [34, 28, 38, 36, 27, 46] have shown the effectiveness of leveraging synthetic training data for surgical applications.

There have also been several attempts at extending unpaired translation to videos, where generated sequences have to be temporally smooth in addition to being realistic in individual frames [3, 10, 7, 9, 31, 47]. Bansal et al. [3] tackle this problem by introducing a temporal cycle consistency loss and Engelhardt et al. [10] use a temporal discriminator to model realistic transitions between frames. Several recent approaches estimate optical flow to ensure temporal consistency in consecutive frames [7, 9, 31, 47]. While there have been steady improvements in generating smooth transitions between frames, these models fail to capture long-term consistency. We aim to overcome this by adding a neural rendering component to our model. To the best of our knowledge, no successful solutions for long-term consistent video translation in the unpaired setting have been published to date.

Figure 2: We combine unpaired image translation with neural rendering for view-consistent translation from simulated to photorealistic surgical videos. The model’s key concept is a learnable, implicit representation of the scene’s global texture. During training, texture features are projected into image space as a feature map which, combined with a simple rendering, serves as input to the unpaired image translation module. To encourage long-term temporally consistent translation, we warp two translated views into a common pixel space and employ our lighting-invariant consistency loss.

2.2 Physically-grounded Neural Rendering

While unpaired visual translation methods are also sometimes categorized as neural rendering, the term most commonly refers to image synthesis approaches which incorporate knowledge of the underlying physical world [42]. By introducing differentiable components to rendering pipelines, neural representations of 3D shapes [39, 49, 22], lighting [41, 29, 1, 2] or textures [43] can be learned from image data for applications like novel view synthesis, facial re-enactment or relighting. Most closely related to our work, Thies et al. [43] introduce a deferred neural renderer with neural textures, where implicit texture representations are learned from image sequences with a ground truth 3D model and camera poses. In contrast to their work, however, our model is built in an unsupervised setting where no correspondence between the simulated 3D data and real images is available. Finally, Alhaija et al. [2] propose a deferred neural renderer for unpaired translation from fixed albedo, normal and reflection maps to realistic output. However, since the texture representation is not learned, this work is more closely related to image-to-image translation.

Mallya et al. [25] propose a model for long-term consistent, paired video translation. They estimate information of the underlying physical world (depth, optical flow, semantic segmentation) to render globally consistent videos. This is currently the closest attempt at combining neural rendering with GAN-based translation. However, the paired data setting is a major hurdle for real-world applications.

We rather aim at bringing unpaired translation and neural rendering closer together. We believe that requiring knowledge of the simulated 3D geometry in an unpaired setting is less restrictive for many applications than paired video translation, which requires rich ground truth information in the real domain.

3 Methods

We incorporate a neural rendering component into state-of-the-art unpaired image translation to facilitate view-consistent video synthesis (Fig. 2). Our two domains consist of simulated surgical abdominal scenes with a simulated camera in domain A and a set of real surgical images with similar views in domain B. For translation from A to B, an implicit representation of the simulated scene’s texture is learned end-to-end along with the translation networks. For a given viewpoint of the simulated scene, the learnable texture features are projected into image space as a feature map. Additionally, a reference image is rendered with predefined colors under a simple light model and is used as a prior for translation. Combined, the feature map and the reference image serve as input to the encoder of the unpaired image translation module. Errors can be backpropagated into the texture map through the differentiable projection operator, which enables the model to learn a global representation of the texture. During training, two translated frames of arbitrary views are warped into a common pixel space and constrained using our lighting-invariant view-consistency loss. Also note that the projected texture maps are part of the translation cycle: transfer from B to A includes the prediction of a reference image as well as a texture map. We jointly learn textures for 7 different simulated abdominal scenes with common translation networks used for all scenes.

3.1 Neural Textures

For true long-term consistency, we require a method which can store information about the scene across different time points. To allow our network to store scene-related information independent of the current view, we choose to store data in texture space. When rendering a new frame, this data can then be looked up and used as input to the image translation pipeline. In our work, we use the term neural texture [43] to refer to a tensor with learnable values. We first set up a mapping function to associate 3D surface points with their corresponding pixels in the neural texture. When rendering the scene from a new perspective, we trace a ray from each pixel in the image plane into the scene to determine which 3D surface point should be rendered, and then use the mapping function to look up the texture information for this point.

In practice, we choose simple projection as the mapping function: For each mesh (liver, gallbladder, etc.), we split the texture into six squares of equal size, let each represent one side of the axis-aligned bounding box of the mesh and project the surface point onto each of the six sides to look up the texture information. The six lookup results are combined via a weighted average. The weight of each side is the dot product between the surface normal and the normal of that bounding box side, set to zero if the dot product is smaller than zero (at any time, a maximum of three sides will contribute to the final combined texture lookup value). The projection onto the bounding box sides yields a continuous position on the texture, so we perform bilinear interpolation of the four closest texture pixels. The main advantage of this method is that it requires no additional manual setup when a new 3D scene is introduced into the pipeline. However, this mapping function could easily be replaced with a different one.
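As a rough illustration of the lookup described above, the following pure-Python sketch computes the per-side weights and a bilinear texture lookup. The function names, the ordering of the box sides, the scalar-valued texture and the normalization of the weights (implied by "weighted average") are assumptions made for this example and are not taken from the authors' implementation.

```python
import math

# Outward normals of the six sides of an axis-aligned bounding box
# (the ordering is an arbitrary choice for this sketch).
BOX_NORMALS = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]


def side_weights(surface_normal):
    """Weight per box side: dot product with the surface normal, clamped at
    zero, then normalized so that the weights sum to one."""
    raw = [max(0.0, sum(b * s for b, s in zip(box_n, surface_normal)))
           for box_n in BOX_NORMALS]
    total = sum(raw)
    return [w / total for w in raw] if total > 0 else raw


def bilinear_lookup(texture, u, v):
    """Interpolate a 2D grid of scalars at the continuous position (u, v),
    given in pixel units (u = column, v = row)."""
    h, w = len(texture), len(texture[0])
    u = min(max(u, 0.0), w - 1.0)
    v = min(max(v, 0.0), h - 1.0)
    x0, y0 = int(math.floor(u)), int(math.floor(v))
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    fx, fy = u - x0, v - y0
    top = (1 - fx) * texture[y0][x0] + fx * texture[y0][x1]
    bottom = (1 - fx) * texture[y1][x0] + fx * texture[y1][x1]
    return (1 - fy) * top + fy * bottom


# A surface normal pointing diagonally receives contributions from three sides.
diagonal = (1 / math.sqrt(3),) * 3
weights = side_weights(diagonal)
center_value = bilinear_lookup([[0.0, 1.0], [2.0, 3.0]], 0.5, 0.5)
```

Because opposite box sides have opposite normals, at most one side per axis can receive a positive weight, which is why no more than three sides contribute to a lookup.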

Projecting texture features into image space results in a feature image with one channel per texture feature. The important aspect of this rendering method is that it consists entirely of texture lookup and interpolation functions and is thus fully differentiable, so errors in the projected feature image can be propagated back into the global neural texture.

Additionally, we render the scene again from the same camera view, this time using a traditional rendering algorithm with predefined fixed colors for each tissue type and a single light source. This results in an additional reference image with 3 channels (red, green, blue) which, together with the projected feature image, makes up the input tensor to the translation network (see Fig. 2). To speed up rendering during training, we precompute mappings and interpolation weights for all frames.

3.2 Unpaired Image Translation Module

Our translation module builds on state-of-the-art methods MUNIT [14] and UNIT [19]. The model enforces cycle consistency as well as a shared content space through interchangeable encoders and decoders for each domain [19] (Fig. 2).

Given a projected texture map and reference image, the encoder extracts their content. To encourage a shared content space of domains A and B, this content is used both for reconstructing the texture features and reference image and for translating to domain B. For cycle consistency, the content and subsequently the original input are reconstructed again from the translated image. Analogously, the same cycles are enforced for translation from B to A, and a reconstruction loss is applied to all reconstructions. We use Multi-Scale Discriminators [44] on translated images and real ones with the LS-GAN loss [26]. Finally, we use a Multi-Scale Structural-Similarity loss [45, 34] between input images and their translations in both directions to encourage label-preserving translation. We repurpose MUNIT’s [14] network architectures, but remove styles from the encoders and decoders.

3.3 View Consistency Loss

To enforce view consistency, two random views of the same simulated scene are sampled and translated during each training iteration. Using the knowledge about the scene’s geometry, the second view is warped into the pixel space of the first view and consistent rendering is enforced through a pixel-wise view-consistency loss. In minimally-invasive surgery, however, the only source of light is a lamp mounted on the camera. This results in changing light conditions whenever the field of view is adjusted, with the image center typically being brighter than its surroundings. This poses an additional challenge for view consistency. Therefore, we propose to minimize the angle between RGB vectors instead of a channel-wise loss. For a pair of translated views, the loss is computed over the pixel locations in the first view that have a matching pixel in the second view. At each such location, it penalizes the angle between the RGB vector of the first view and the RGB vector of the second view after warping it into the first view’s pixel space. Note that the angle between two vectors $u$ and $v$ can be computed by $\arccos\left(\frac{u \cdot v}{\lVert u \rVert \, \lVert v \rVert}\right)$. This enforces consistent hue in corresponding locations while allowing varying brightness.
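The effect of an angle-based loss can be illustrated with a small pure-Python sketch. It assumes that the loss is the mean angle over all matching pixels and that images are given as flat lists of RGB tuples with a validity mask. The exact normalization used by the authors is not recoverable from this text, so the function names and the averaging are assumptions.

```python
import math


def angle_between(u, v, eps=1e-8):
    """Angle in radians between two RGB vectors, via the normalized dot product."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    cos = dot / max(norm_u * norm_v, eps)
    # Clamp to the valid domain of acos to guard against rounding errors.
    return math.acos(min(1.0, max(-1.0, cos)))


def view_consistency_loss(view1, warped_view2, valid):
    """Mean angle over the pixels that have a match in both views.

    view1, warped_view2: lists of RGB tuples in view 1's pixel space.
    valid: list of booleans marking pixels with a matching pixel.
    """
    angles = [angle_between(p, q)
              for p, q, ok in zip(view1, warped_view2, valid) if ok]
    return sum(angles) / len(angles) if angles else 0.0


# Same hue at half the brightness: the angle, and hence the loss, is ~0.
same_hue = view_consistency_loss([(0.8, 0.2, 0.1)], [(0.4, 0.1, 0.05)], [True])
# Pure red against pure green: the angle is 90 degrees.
different_hue = view_consistency_loss([(1.0, 0.0, 0.0)], [(0.0, 1.0, 0.0)], [True])
```

Scaling an RGB vector by a positive factor changes its length but not its direction, which is why a uniformly brighter or darker rendering of the same surface point is not penalized.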

Errors are only backpropagated through the non-warped view. Adversarial, MS-SSIM and reconstruction losses are also computed only on this view to avoid an imbalance between domains A and B.

3.4 Training Details

We train our model for 500,000 iterations and use the Adam optimizer with separate initial learning rates for the networks and for the neural textures. Learning rates are halved every 100,000 iterations. The total loss is a weighted sum of the adversarial, cycle, image reconstruction, content reconstruction, MS-SSIM and view-consistency losses, each with its own weight. The weight of the view-consistency loss is initialized at one value and switched to another after 10,000 iterations to avoid forcing consistency on unrefined translations in early stages of training.
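The two schedules described above can be sketched as simple step functions. The concrete learning rates and loss weights are not recoverable from this text, so the values used below are placeholders chosen only for illustration. Whether the switch happens exactly at iteration 10,000 or one step later is also an assumption.

```python
def learning_rate(initial_lr, iteration, halve_every=100_000):
    """Step schedule: the learning rate is halved every `halve_every` iterations."""
    return initial_lr * 0.5 ** (iteration // halve_every)


def view_consistency_weight(iteration, initial_weight, final_weight,
                            switch_at=10_000):
    """Weight of the view-consistency loss: one value during the first
    `switch_at` iterations, another value afterwards."""
    return initial_weight if iteration < switch_at else final_weight


# Placeholder values (hypothetical, not taken from the paper).
lr_start = learning_rate(1.0, 0)
lr_end = learning_rate(1.0, 499_999)
w_early = view_consistency_weight(5_000, initial_weight=0.0, final_weight=1.0)
w_late = view_consistency_weight(10_000, initial_weight=0.0, final_weight=1.0)
```

With 500,000 training iterations and a halving interval of 100,000, the learning rate is halved four times, so the final rate is one sixteenth of the initial one.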

For each of the 7 scenes, we learn square textures with 3 feature channels for each of the 6 axis-aligned bounding box sides of each of the 5 meshes (liver, ligament, abdominal wall, gallbladder and fat/stomach). Hence, the neural texture of a scene consists of 30 texture squares.

3.5 Data

For the domain of real images B, we collected 28 recordings of robotic, abdominal surgeries from the University Hospital Carl Gustav Carus Dresden and manually selected sequences which contain views of the liver. The institutional review board approved the usage of this data. Frames were extracted at 5fps, resulting in a total of 13,334 training images. During training, images are randomly resized and cropped to size 256x512.

For the simulated domain A, we built seven artificial abdominal 3D scenes in Blender containing liver, liver ligament, gallbladder, abdominal wall and fat/stomach. The liver meshes were taken from a public dataset (3D-IRCADb 01 data set, IRCAD, France) while all other structures were designed manually. For each scene, we generated 3,000 random views of size 256x512, resulting in a total of 21,000 training views. To evaluate temporal consistency, we manually created seven 20-second sequences at 5fps which pan over each scene with varying viewpoints and distances.

4 Experiments

To establish that our method produces both realistic and long-term consistent outputs, we need to evaluate the quality of individual images as well as consistency between consecutive or non-consecutive frames. Thus, we establish several baselines and evaluate them using various metrics. We place a special focus on both detailed and temporally consistent translation, since correct re-rendering of details such as vessels is crucial for obtaining realistic videos.

Figure 3: Qualitative comparison of view-consistency results. Consistent areas are marked green, inconsistent ones red. Our model can render fine-grained details consistently across views. SSIM-ReCycle often produces consistent outputs which, however, lack detail and realism. SSIM-MUNIT produces realistic but flickering results. Quality and consistency can best be judged in the supplementary video.

4.1 Baselines

SSIM-MUNIT: This is Pfeiffer et al.’s [34] model for surgical image translation, trained on our dataset of real and synthetic images. It is trained for 370k iterations as in the original paper. This corresponds to our image translation module but with added styles and noise injected into the generator input. We remove these components in our model since they are disadvantageous for view consistency. To evaluate realism, we translate each training image once with a randomly drawn style and once with a style extracted from a randomly drawn image. Test video sequences are translated with a randomly drawn but fixed style.

ReCycle and SSIM-ReCycle: We compare to Bansal et al.’s unpaired video translation approach ReCycle-GAN [3], which is trained on triplets of consecutive video frames to maintain temporal consistency. We use the variant with additional non-temporal cycles (https://github.com/aayushbansal/Recycle-GAN). Triplets of simulated images are obtained through random, linear trajectories comparable to real sequences. The model is trained for the suggested 40 epochs, since longer training times lead to degrading performance. Additionally, we implement a variant with MS-SSIM loss for label preservation.

OF-UNIT: State-of-the-art unpaired video translation models condition the generator on translations from previous time steps to ensure short-term temporal consistency. Many methods warp the previous image by estimating optical flow (OF) and achieve incremental improvement through better OF estimation [7, 9, 31]. We argue, however, that even perfect OF is not enough for long-term consistency and can even have detrimental effects, as we show later. To demonstrate this, we build a variant of our model which uses ground-truth OF to warp the previous translation, so that it can potentially produce perfect transitions between frames. The encoder input is replaced with one built from the generated frame of the previous time step, warped into the current view by a perfect warping operator using ground-truth optical flow. We replace the view-consistency loss with a different loss at an increased weight, since this gave better results, and train for 1 million iterations due to the stronger correlation between subsequent samples. Synthetic sequences are obtained from the triplets used for ReCycle. OF-UNIT serves as an upper bound for the state of the art in unpaired video translation.

Ours w/o vc and Ours w/o tex, vc: Finally, we ablate our model by removing first the view-consistency loss and then also neural textures. The second model corresponds to SSIM-MUNIT without styles and noise in the generator.

4.2 Metrics

Realism: We compare the realism of models through the commonly used metrics Fréchet Inception Distance (FID) [13] and Kernel Inception Distance (KID) [4], for which we sample 10,000 random images each from the sets of real and generated training images. Further, we train a U-Net variant for liver segmentation on a dataset of 405 laparoscopic images from 5 patients and report the Dice score when evaluated on all 21,000 generated images. This metric measures both realism and label preservation.

Temporal Consistency: We introduce two metrics to evaluate the temporal consistency of the sequences generated from each scene. Firstly, the metric OF measures the mean absolute error between the optical flow estimated by the Gunnar-Farneback method [11] on consecutive translated frames and the ground truth optical flow of the corresponding simulated reference images. As argued by Chu et al. [9], this is better than the more common pixel error on warped images, since the latter favors blurry sequences. Secondly, the metrics ORB-1, ORB-5 and ORB-10 measure how consistently image features are rendered. For ORB-1, we compute all ORB feature [37] matches in consecutive frames and determine whether the matched feature points correspond to the same 3D location. We report the accuracy of matches as well as the average number of correct matches per image pair. A blurry but consistent sequence might yield a high accuracy, so the number of matches gives additional information on how detailed the results are. A match is considered correct if its distance is smaller than 1mm in the underlying 3D scene. To investigate consistency beyond consecutive frames, we do the same with pairs that are 5 and 10 frames apart (1 and 2 sec.) as ORB-5 and ORB-10.
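The scoring step of both metrics can be sketched in a few lines of pure Python. The sketch assumes that feature matching and the lookup of 3D locations have already been done, so each match arrives as a pair of 3D points in millimetres. It also assumes that the flow error is averaged over all vector components. The function names and input formats are hypothetical.

```python
import math


def orb_match_stats(matches_3d, threshold_mm=1.0):
    """Accuracy and number of correct matches for one image pair.

    matches_3d: list of (point_a, point_b) pairs, the 3D scene locations
    (in mm) of two matched feature points. A match counts as correct if the
    two locations are less than `threshold_mm` apart.
    """
    correct = sum(1 for p, q in matches_3d if math.dist(p, q) < threshold_mm)
    accuracy = correct / len(matches_3d) if matches_3d else 0.0
    return accuracy, correct


def flow_mae(estimated, ground_truth):
    """Mean absolute error between two flow fields, each given as a list of
    (dx, dy) vectors, averaged over all components."""
    errors = [abs(e - g)
              for ev, gv in zip(estimated, ground_truth)
              for e, g in zip(ev, gv)]
    return sum(errors) / len(errors)


# One correct match (0.5 mm apart) and one incorrect match (2 mm apart).
accuracy, num_correct = orb_match_stats([
    ((0.0, 0.0, 0.0), (0.5, 0.0, 0.0)),
    ((0.0, 0.0, 0.0), (2.0, 0.0, 0.0)),
])
error = flow_mae([(1.0, 0.0)], [(0.0, 0.0)])
```

Reporting the count alongside the accuracy matters because a blurry sequence can produce few matches that are nearly all correct, which would look good on accuracy alone.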

Method            | Data | FID  | KID   | Dice (%) | OF   | ORB-1: % (# per pair) | ORB-5: % (# per pair) | ORB-10: % (# per pair)
SSIM-MUNIT [34]   | img  | 28.3 | .0132 | 59.2     | 8.64 | 60.5% (32.5)          | 36.1% (15.7)          | 19.1% (7.0)
ReCycle [3]       | vid  | 61.5 | .0454 | 40.7     | 8.89 | 69.6% (16.3)          | 43.9% (7.0)           | 23.5% (2.7)
SSIM-ReCycle      | vid  | 80.6 | .0622 | 50.9     | 8.75 | 88.9% (13.3)          | 67.4% (6.2)           | 43.2% (2.8)
OF-UNIT           | vid  | 26.8 | .0125 | 57.7     | 8.53 | 93.5% (32.4)          | 59.0% (11.1)          | 30.7% (4.4)
OF-UNIT (revisit) | vid  | -    | -     | -        | 8.91 | 69.9% (15.7)          | 43.8% (7.3)           | 24.7% (3.4)
Ours w/o tex,vc   | img  | 27.3 | .0114 | 56.8     | 8.43 | 81.7% (31.2)          | 51.3% (13.3)          | 29.5% (6.2)
Ours w/o vc       | vid  | 27.0 | .0134 | 55.2     | 8.35 | 88.3% (27.9)          | 66.8% (14.6)          | 44.5% (7.5)
Ours              | vid  | 26.8 | .0124 | 57.1     | 7.62 | 91.8% (49.7)          | 73.0% (27.2)          | 49.6% (13.9)

Table 1: Quantitative results. FID, KID and Dice measure realism. OF, ORB-1, ORB-5 and ORB-10 measure temporal consistency. For metrics ORB-1, ORB-5 and ORB-10, we report the accuracy of feature matches and the total number of correct matches per image pair, indicating both consistency as well as level of detail.

Time-independence: Finally, we show the pitfalls of previous approaches which condition on previous time steps. We extend each test sequence by running it first forward and then backward such that each view is visited twice with varying temporal distance. That is, given a sequence of views, we append the same views in reverse order, similar to Mallya et al.’s [25] evaluation. We then compute the same metrics OF and ORB-1. But instead of comparing a frame to its direct successor, we compare it to the frame in the backward pass of the extended sequence which shows the successor’s view. For all methods except OF-UNIT, this is equivalent to the original metric since they depend only on the current view. Analogously for ORB-5 and ORB-10, we compare to the revisited views which lie 5 and 10 frames ahead, respectively. We denote these experiments as OF-UNIT (revisit).
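The index bookkeeping of the revisit experiment can be sketched as follows. The sketch assumes zero-based frame indices and that the backward pass repeats the last frame, so the extended sequence is exactly twice as long. The exact indexing used in the paper is not recoverable from this text, and both function names are hypothetical.

```python
def extend_forward_backward(frames):
    """Run the sequence forward, then backward, so every view is seen twice."""
    return list(frames) + list(reversed(frames))


def revisit_index(t, offset, num_frames):
    """Index in the extended sequence at which the view of frame `t + offset`
    is revisited during the backward pass."""
    target = t + offset
    if not 0 <= target < num_frames:
        raise IndexError("target view lies outside the original sequence")
    return 2 * num_frames - 1 - target


views = ["v0", "v1", "v2"]
extended = extend_forward_backward(views)
# ORB-1 style comparison for frame 0: instead of extended[1], use the
# revisited copy of view 1 from the backward pass.
partner = extended[revisit_index(0, 1, len(views))]
```

For a model that depends only on the current view, the revisited frame is identical to the original one, so the metric is unchanged. A model conditioned on previous frames sees a different history on the way back and may render the same view differently.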

5 Results

Figure 4: Details are stored in our neural textures. We found the 3rd feature channel often to correspond to vessels.

5.1 Realism

Table 1 shows that our model achieves similar FID and KID scores to image-based approaches (SSIM-MUNIT and Ours w/o tex,vc) while strongly outperforming the video-based methods ReCycle and SSIM-ReCycle. We hypothesize that their temporal cycle loss favors blurry images since they are easier to predict for the temporal prediction model. Fig. 3 supports this hypothesis, as our model and the image-based models show more detailed and realistic translations than SSIM-ReCycle. For OF-UNIT, similar realism scores to ours are expected, since it uses the same translation module.

Figure 5: Estimated optical flow in a scene with camera motion where the hue indicates the direction of the flow. In our results, consistent motion is detected on textured surfaces while blurriness or flickering lead to poor flow estimates in other models.

Further, we evaluate a pretrained liver segmentation network on the generated data. Again, our model yields comparable results to image-based methods while outperforming ReCycle and SSIM-ReCycle. This indicates that our results are not only realistic but that content of the simulated domain is also translated correctly. The gap between ReCycle and SSIM-ReCycle additionally shows the importance of the MS-SSIM loss for label preservation. Example 2 in Fig. 3 shows a failure case of our model where a stomach-like texture with vessels is rendered on the liver. Introducing neural textures presumably improves the sharpness and level of detail in translations but increases the model’s freedom to change content in the scene. The quantitative results, however, suggest that this is only a minor effect.

5.2 Temporal Consistency

Using the established ORB feature detector, we evaluate how consistently visual features are re-rendered in subsequent frames of generated video sequences. We report how often detected feature matches are correct as well as the number of correct matches per pair of frames. For neighboring frames, our model achieves an accuracy of 91.8% (Table 1), outperforming all baselines except OF-UNIT. This is not surprising, since the latter uses the perfectly warped previous frame as input. For larger frame distances, however, our model outperforms OF-UNIT, showing its superiority w.r.t. long-term consistency. Additionally, the absolute number of correct matches per image pair is drastically higher than in OF-UNIT and other models, even for neighboring frames. This indicates that our neural textures not only enable consistent translation but also encourage more detailed rendering. Fig. 3 shows several translated views with detailed as well as consistent textures. In Fig. 4, we show how the location of vessels is stored in the neural texture.

We observe that other methods fail to generate detailed as well as temporally consistent sequences. While SSIM-MUNIT produces detailed translations (indicated by the high number of matches), it achieves the lowest accuracies. Conversely, video-based ReCycle and SSIM-ReCycle produce more consistent but less detailed renderings, indicated by their high accuracy but low number of correct matches.

Figure 6: When revisiting a previous view, time-dependent models such as OF-UNIT fail to render textures consistently. Our model maintains consistency independently of the duration between visits by storing information in texture-space.

Note that SSIM-MUNIT induces flickering since noise is injected into the generator. Temporal consistency can already be strongly improved by removing this component (Ours w/o tex,vc). Adding neural textures without enforcing view-consistency (Ours w/o vc) further improves results.

Evaluating temporal consistency through optical flow (OF) supports our previous findings. This metric measures both temporal consistency as well as level of detail, since Gunnar-Farneback flow often fails on smooth surfaces. Image- and other video-based methods yield high errors, since the former tend to produce detailed but flickering sequences, while the latter often generate blurry but consistent views (Fig. 5). By learning textures in 3D space, our model achieves both detailed and consistent renderings.

5.3 Time-independence

We have seen that the time-dependent baseline OF-UNIT achieves very consistent transitions between frames and still achieves respectable results for larger frame distances. However, if the second frame is replaced with the same view revisited at a later point of the sequence, performance degrades drastically. This is because the model does not have the capacity to remember the appearance of areas which have left the field of view. It even underperforms compared to its unconditioned variant Ours w/o tex,vc. We hypothesize that dependence on the previous trajectory actually encourages appearance changes over time (Fig. 6). We believe time-independence is therefore an important feature for achieving long-term consistency, even in non-static scenes. With our approach, moving objects as well as deformations can potentially be handled by moving or deforming the neural texture accordingly.

Figure 7: Our angle loss allows the translation module to adjust brightness of areas according to the current view. In real images, the center is often brightest since the light source is mounted on the camera.

5.4 Lighting-invariant View Consistency

We proposed an angle-based loss for view consistency which only keeps the hue of corresponding areas consistent. Fig. 7 shows that our angle loss allows for more realistic lighting since the translation module can change brightness according to the current view. An L1 loss, on the other hand, enforces static brightness from arbitrary viewpoints. This results in incorrect lighting, as in the left image, where the light appears to come from the bottom right. More examples can be found in the supplementary material.
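The difference between the two losses can be made concrete for a single pair of corresponding pixels. The sketch below assumes the angle loss measures the angle between the two RGB vectors; the helper names and pixel values are hypothetical, and the full loss over all correspondences is defined in the method section.

```python
import math

def angle_between(rgb_a, rgb_b):
    """Angle in radians between two RGB vectors (assumes neither is pure black)."""
    dot = sum(a * b for a, b in zip(rgb_a, rgb_b))
    norm_a = math.sqrt(sum(a * a for a in rgb_a))
    norm_b = math.sqrt(sum(b * b for b in rgb_b))
    # Clamp to guard against rounding slightly outside [-1, 1].
    cos = max(-1.0, min(1.0, dot / (norm_a * norm_b)))
    return math.acos(cos)

def l1_distance(rgb_a, rgb_b):
    """Sum of absolute per-channel differences."""
    return sum(abs(a - b) for a, b in zip(rgb_a, rgb_b))

# The same surface point seen from two views: same colour direction, half the brightness.
pixel_view_a = (0.8, 0.4, 0.2)
pixel_view_b = (0.4, 0.2, 0.1)

angle_penalty = angle_between(pixel_view_a, pixel_view_b)
l1_penalty = l1_distance(pixel_view_a, pixel_view_b)
```

Here `angle_penalty` is zero up to floating-point rounding, so the brightness change between views is not penalized, whereas `l1_penalty` is about 0.7 and would push both views towards the same brightness.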

6 Conclusion

We combine neural rendering with unpaired image translation from simulated to photorealistic videos. We target surgical applications where labeled data is often limited and realistic but simulated evaluation environments are especially relevant. Through extensive evaluation and comparison to related approaches, we show that our results maintain the realism of image-based approaches while outperforming video-based methods w.r.t. temporal consistency. We show that optical flow is consistent with the underlying simulated scene and that our model can render fine-grained details such as vessels consistently from different views. Also, data generation can easily be scaled up by adding more simulated scenes. A crucial observation about the model is that it leverages the rich information contained in the simulated domain while requiring only an unlabeled set of images on the real domain. This way, consistent and label-preserving data can be generated without limiting its relevance for real-world applications. Specifically, ground truth which would be unobtainable in surgical settings can be generated (depth, optical flow, point correspondences). This work is a step towards more expressive simulated environments for surgical assistance systems, robotic applications or training aspiring surgeons. While we focus on surgical applications (where access to labeled data is especially restricted), the model can potentially be used for any setting with a simulated base for translation.


Funded by the German Research Foundation (DFG, Deutsche Forschungsgemeinschaft) as part of Germany’s Excellence Strategy – EXC 2050/1 – Project ID 390696704 – Cluster of Excellence “Centre for Tactile Internet with Human-in-the-Loop” (CeTI) of Technische Universität Dresden.


  • [1] H. A. Alhaija, S. K. Mustikovela, A. Geiger, and C. Rother (2018) Geometric image synthesis. In Asian Conference on Computer Vision, pp. 85–100. Cited by: §2.2.
  • [2] H. A. Alhaija, S. K. Mustikovela, J. Thies, M. Nießner, A. Geiger, and C. Rother (2020) Intrinsic autoencoders for joint neural rendering and intrinsic image decomposition. In International Conference on 3D Vision, Cited by: §2.2.
  • [3] A. Bansal, S. Ma, D. Ramanan, and Y. Sheikh (2018) Recycle-gan: unsupervised video retargeting. In Proceedings of the European conference on computer vision (ECCV), pp. 119–135. Cited by: §1, §2.1, §4.1, Table 1.
  • [4] M. Bińkowski, D. J. Sutherland, M. Arbel, and A. Gretton (2018) Demystifying mmd gans. In International Conference on Learning Representations, Cited by: §4.2.
  • [5] S. Bodenstedt, M. Wagner, B. P. Müller-Stich, J. Weitz, and S. Speidel (2020) Artificial intelligence-assisted surgery: potential and challenges. Visceral Medicine 36 (6), pp. 450–455. Cited by: §1.
  • [6] A. Brock, J. Donahue, and K. Simonyan (2019) Large scale gan training for high fidelity natural image synthesis. In International Conference on Learning Representations, Cited by: §2.1.
  • [7] Y. Chen, Y. Pan, T. Yao, X. Tian, and T. Mei (2019) Mocycle-gan: unpaired video-to-video translation. In Proceedings of the 27th ACM International Conference on Multimedia, pp. 647–655. Cited by: §1, §2.1, §4.1.
  • [8] Y. Choi, M. Choi, M. Kim, J. Ha, S. Kim, and J. Choo (2018) Stargan: unified generative adversarial networks for multi-domain image-to-image translation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 8789–8797. Cited by: §2.1.
  • [9] M. Chu, Y. Xie, J. Mayer, L. Leal-Taixé, and N. Thuerey (2020) Learning temporal coherence via self-supervision for gan-based video generation. ACM Transactions on Graphics (TOG) 39 (4), pp. 75–1. Cited by: §1, §2.1, §4.1, §4.2.
  • [10] S. Engelhardt, R. De Simone, P. M. Full, M. Karck, and I. Wolf (2018) Improving surgical training phantoms by hyperrealism: deep unpaired image-to-image translation from real surgeries. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 747–755. Cited by: §1, §2.1.
  • [11] G. Farnebäck (2003) Two-frame motion estimation based on polynomial expansion. In Scandinavian conference on Image analysis, pp. 363–370. Cited by: §4.2.
  • [12] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio (2014) Generative adversarial nets. In Advances in Neural Information Processing Systems, Z. Ghahramani, M. Welling, C. Cortes, N. Lawrence, and K. Q. Weinberger (Eds.), Vol. 27. Cited by: §1, §2.1.
  • [13] M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter (2017) Gans trained by a two time-scale update rule converge to a local nash equilibrium. In Advances in Neural Information Processing Systems, Cited by: §4.2.
  • [14] X. Huang, M. Liu, S. Belongie, and J. Kautz (2018) Multimodal unsupervised image-to-image translation. In Proceedings of the European conference on computer vision (ECCV), pp. 172–189. Cited by: §2.1, §3.2, §3.2.
  • [15] T. Karras, T. Aila, S. Laine, and J. Lehtinen (2018) Progressive growing of gans for improved quality, stability, and variation. In International Conference on Learning Representations, Cited by: §2.1.
  • [16] T. Karras, S. Laine, and T. Aila (2019) A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4401–4410. Cited by: §2.1.
  • [17] T. Karras, S. Laine, M. Aittala, J. Hellsten, J. Lehtinen, and T. Aila (2020) Analyzing and improving the image quality of stylegan. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8110–8119. Cited by: §2.1.
  • [18] H. Lee, H. Tseng, J. Huang, M. Singh, and M. Yang (2018) Diverse image-to-image translation via disentangled representations. In Proceedings of the European conference on computer vision (ECCV), pp. 35–51. Cited by: §2.1.
  • [19] M. Liu, T. Breuel, and J. Kautz (2017) Unsupervised image-to-image translation networks. In Advances in Neural Information Processing Systems, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (Eds.), Vol. 30. Cited by: §2.1, §3.2.
  • [20] M. Liu, X. Huang, A. Mallya, T. Karras, T. Aila, J. Lehtinen, and J. Kautz (2019) Few-shot unsupervised image-to-image translation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10551–10560. Cited by: §2.1.
  • [21] X. Liu, G. Yin, J. Shao, X. Wang, and H. Li (2019) Learning to predict layout-to-image conditional convolutions for semantic image synthesis. In Advances in Neural Information Processing Systems, H. Wallach, H. Larochelle, A. Beygelzimer, F. d’Alché-Buc, E. Fox, and R. Garnett (Eds.), Vol. 32. Cited by: §2.1.
  • [22] S. Lombardi, T. Simon, J. Saragih, G. Schwartz, A. Lehrmann, and Y. Sheikh (2019) Neural volumes: learning dynamic renderable volumes from images. ACM Transactions on Graphics (TOG) 38 (4), pp. 1–14. Cited by: §1, §2.2.
  • [23] L. Maier-Hein, M. Eisenmann, D. Sarikaya, K. März, T. Collins, A. Malpani, J. Fallert, H. Feussner, S. Giannarou, P. Mascagni, et al. (2020) Surgical data science–from concepts to clinical translation. arXiv preprint arXiv:2011.02284. Cited by: §1.
  • [24] L. Maier-Hein, S. S. Vedula, S. Speidel, N. Navab, R. Kikinis, A. Park, M. Eisenmann, H. Feussner, G. Forestier, S. Giannarou, et al. (2017) Surgical data science for next-generation interventions. Nature Biomedical Engineering 1 (9), pp. 691–696. Cited by: §1.
  • [25] A. Mallya, T. Wang, K. Sapra, and M. Liu (2020) World-consistent video-to-video synthesis. In Computer Vision – ECCV 2020, A. Vedaldi, H. Bischof, T. Brox, and J. Frahm (Eds.), Cham, pp. 359–378. Cited by: §2.2, §4.2.
  • [26] X. Mao, Q. Li, H. Xie, R. Y. Lau, Z. Wang, and S. Paul Smolley (2017) Least squares generative adversarial networks. In Proceedings of the IEEE international conference on computer vision, pp. 2794–2802. Cited by: §3.2.
  • [27] A. Marzullo, S. Moccia, M. Catellani, F. Calimeri, and E. De Momi (2021) Towards realistic laparoscopic image generation using image-domain translation. Computer Methods and Programs in Biomedicine 200, pp. 105834. Cited by: §2.1.
  • [28] S. Mathew, S. Nadeem, S. Kumari, and A. Kaufman (2020) Augmenting colonoscopy using extended and directional cyclegan for lossy image translation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4696–4705. Cited by: §2.1.
  • [29] A. Meka, C. Haene, R. Pandey, M. Zollhöfer, S. Fanello, G. Fyffe, A. Kowdle, X. Yu, J. Busch, J. Dourgarian, et al. (2019) Deep reflectance fields: high-quality facial reflectance field inference from color gradient illumination. ACM Transactions on Graphics (TOG) 38 (4), pp. 1–12. Cited by: §2.2.
  • [30] R. Mur-Artal, J. M. M. Montiel, and J. D. Tardos (2015) ORB-slam: a versatile and accurate monocular slam system. IEEE transactions on robotics 31 (5), pp. 1147–1163. Cited by: §1.
  • [31] K. Park, S. Woo, D. Kim, D. Cho, and I. S. Kweon (2019) Preserving semantic and temporal consistency for unpaired video-to-video translation. In Proceedings of the 27th ACM International Conference on Multimedia, pp. 1248–1257. Cited by: §1, §2.1, §4.1.
  • [32] T. Park, A. A. Efros, R. Zhang, and J. Zhu (2020) Contrastive learning for unpaired image-to-image translation. In European Conference on Computer Vision, pp. 319–345. Cited by: §2.1.
  • [33] T. Park, M. Liu, T. Wang, and J. Zhu (2019) Semantic image synthesis with spatially-adaptive normalization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2337–2346. Cited by: §2.1.
  • [34] M. Pfeiffer, I. Funke, M. R. Robu, S. Bodenstedt, L. Strenger, S. Engelhardt, T. Roß, M. J. Clarkson, K. Gurusamy, B. R. Davidson, et al. (2019) Generating large labeled data sets for laparoscopic image processing tasks using unpaired image-to-image translation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 119–127. Cited by: §1, §2.1, §3.2, §4.1, Table 1.
  • [35] A. Radford, L. Metz, and S. Chintala (2016) Unsupervised representation learning with deep convolutional generative adversarial networks. In International Conference on Learning Representations, Cited by: §2.1.
  • [36] A. Rau, P. E. Edwards, O. F. Ahmad, P. Riordan, M. Janatka, L. B. Lovat, and D. Stoyanov (2019) Implicit domain adaptation with conditional generative adversarial networks for depth prediction in endoscopy. International journal of computer assisted radiology and surgery 14 (7), pp. 1167–1176. Cited by: §2.1.
  • [37] E. Rublee, V. Rabaud, K. Konolige, and G. Bradski (2011) ORB: an efficient alternative to sift or surf. In 2011 International conference on computer vision, pp. 2564–2571. Cited by: §4.2.
  • [38] M. Sahu, R. Strömsdörfer, A. Mukhopadhyay, and S. Zachow (2020) Endo-sim2real: consistency learning-based domain adaptation for instrument segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 784–794. Cited by: §1, §2.1.
  • [39] V. Sitzmann, J. Thies, F. Heide, M. Nießner, G. Wetzstein, and M. Zollhofer (2019) Deepvoxels: learning persistent 3d feature embeddings. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2437–2446. Cited by: §1, §2.2.
  • [40] J. Song, J. Wang, L. Zhao, S. Huang, and G. Dissanayake (2018) Mis-slam: real-time large-scale dense deformable slam system in minimal invasive surgery based on heterogeneous computing. IEEE Robotics and Automation Letters 3 (4), pp. 4068–4075. Cited by: §1.
  • [41] T. Sun, J. T. Barron, Y. Tsai, Z. Xu, X. Yu, G. Fyffe, C. Rhemann, J. Busch, P. E. Debevec, and R. Ramamoorthi (2019) Single image portrait relighting. ACM Trans. Graph. 38 (4), pp. 79–1. Cited by: §2.2.
  • [42] A. Tewari, O. Fried, J. Thies, V. Sitzmann, S. Lombardi, K. Sunkavalli, R. Martin-Brualla, T. Simon, J. Saragih, M. Nießner, et al. (2020) State of the art on neural rendering. In Computer Graphics Forum, Vol. 39, pp. 701–727. Cited by: §1, §2.2.
  • [43] J. Thies, M. Zollhöfer, and M. Nießner (2019) Deferred neural rendering: image synthesis using neural textures. ACM Transactions on Graphics (TOG) 38 (4), pp. 1–12. Cited by: §1, §2.2, §3.1.
  • [44] T. Wang, M. Liu, J. Zhu, A. Tao, J. Kautz, and B. Catanzaro (2018) High-resolution image synthesis and semantic manipulation with conditional gans. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 8798–8807. Cited by: §1, §2.1, §3.2.
  • [45] Z. Wang, E. P. Simoncelli, and A. C. Bovik (2003) Multiscale structural similarity for image quality assessment. In The Thrity-Seventh Asilomar Conference on Signals, Systems & Computers, 2003, Vol. 2, pp. 1398–1402. Cited by: §3.2.
  • [46] A. R. Widya, Y. Monno, M. Okutomi, S. Suzuki, T. Gotoda, and K. Miki (2021) Self-supervised monocular depth estimation in gastroendoscopy using gan-augmented images. In Medical Imaging 2021: Image Processing, Vol. 11596, pp. 1159616. Cited by: §2.1.
  • [47] J. Xu, S. Anwar, N. Barnes, F. Grimpen, O. Salvado, S. Anderson, and M. A. Armin (2020) OfGAN: realistic rendition of synthetic colonoscopy videos. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 732–741. Cited by: §1, §2.1.
  • [48] J. Zhu, T. Park, P. Isola, and A. A. Efros (2017) Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE international conference on computer vision, pp. 2223–2232. Cited by: §1, §2.1.
  • [49] J. Zhu, Z. Zhang, C. Zhang, J. Wu, A. Torralba, J. Tenenbaum, and B. Freeman (2018) Visual object networks: image generation with disentangled 3d representations. In Advances in Neural Information Processing Systems, S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett (Eds.), Vol. 31. Cited by: §2.2.