The ability to detect known objects and their 3D position relative to the viewer is crucial for many Augmented Reality applications and robotic tasks. Recent advances with deep convolutional models such as SSD-6D [kehl2017ssd], PoseCNN [Xiang-RSS-18] and YOLO6D [tekin2018real] allow solving this problem using only RGB images in real-time. However, to achieve state-of-the-art performance these models require a large amount of labeled training data. The assembly of such a training-set is an expensive, error-prone and time-consuming process [hinterstoisser2019annotation], making it cumbersome and often times inapplicable for use with custom applications.
When 3D geometry is available, one can resort to synthetically generating training data by rendering, which allows creating a virtually infinite training set in an automated fashion. However, it was shown that deep CNN models, even when applying cross-validation, tend to over-fit to the specific data-set [torralba2011unbiased] and show significantly degraded performance when presented with data from a different domain [ganin2016domain]. In particular, there is a strong domain gap between real and synthesized images, which typically prevents the use of synthetic images for training.
To overcome this limitation, existing approaches apply domain randomization (DR) to enforce domain invariance by overwhelming the model with variation [tremblay2018training, tobin2017domain], requiring the network to learn deeper, more abstract features that are invariant across domains. An alternative direction is to reduce the gap by employing photo-realistic rendering [tsirikoglou2017procedural] and structurally correct context generation [prakash2019structured]. Notably, [tremblay2018deep] apply both for the task of object pose estimation. However, designing a randomization method for a specific domain requires a domain expert to define which parts must stay invariant. Conversely, increasing photo-realism requires an artist to carefully model the specific environments in detail. This in turn increases the cost of generating the data, thus negating the primary selling point of using synthetic images in the first place.
If real images are available, transfer learning can be exploited, e.g. by fine-tuning a synthetically trained model with real images [oquab2014learning]. Alternatively, [hinterstoisser2018pre] propose to pre-train a network on real data and fine-tune on synthetic data. Furthermore, it is possible to use the real data to enforce domain invariance during training [ganin2016domain]. Recent work [zakharov2019deceptionnet] extends this approach to guide domain randomization, thus introducing some benefits of learning-based domain adaptation. However, it is still required to correctly select and design randomization modules.
Recent advances in generative adversarial networks (GANs) [goodfellow2014generative, karras2018progressive, brock2018large] have shown great improvements regarding image quality and plausibility, training stability and variation of output. Of particular interest for the task of closing the domain gap are conditional GANs [isola2017image]. These are networks that, unlike traditional GANs, take additional inputs to condition the generated output. Here, the image-conditional GANs form a general-purpose framework for image-to-image translation problems, like semantic segmentation, colorization and other style transfer tasks. Existing solutions can be split into paired models [isola2017image, wang2018high] and unpaired models [CycleGAN2017]. The former are trained to adapt source to target images, paired in a supervised fashion, while the latter do not require supervision and instead directly learn to transfer the distribution of image features found in two unstructured data-sets.
This work focuses on employing such models to formulate the domain gap between real and synthetic images as a style-transfer problem. To this end, we introduce training pipelines for both paired and unpaired image translation and evaluate the results on the task of object pose estimation. This is a particularly challenging scenario that requires a high fidelity of the object contours and which, to the best of our knowledge, was not previously addressed with GAN-based image translation. In the context of paired image translation, we propose the use of the intermediate edge domain to do away with the need for real images in supervised training. Here, we evaluate different mapping strategies for transferring CAD geometry with unknown surface properties into the edge domain.
Most closely related to our work is [antoniou2017data], which employs GANs for data augmentation. In contrast, we specifically address the domain gap and use synthetic data generation instead of data augmentation. [rambach2018learning] introduce the "pencil filter" as a domain with reduced expressiveness to tackle the domain gap and train a pose estimation network. However, the pose estimation network takes a strong performance hit as the "pencil domain" does not retain enough relevant features. We avoid this hit by learning a reversed mapping from the reduced domain to real images to reconstruct appropriate features. In the medical domain, [mahmood2018unsupervised] employ a GAN architecture for domain adaptation. However, they only consider the use of an unsupervised GAN model for reverse domain adaptation by making real images more synthetic, while we consider both paired and unpaired architectures and also consider the forward domain adaptation in the unsupervised case.
Based on the above, our key contributions are:
formulating the domain gap as a learning problem using off-the-shelf image-conditional GANs,
introducing the intermediate edge domain for training paired translation networks purely from synthetic data, and
evaluating paired and unpaired models regarding pose estimation performance.
This paper is structured as follows: in Section II the general approach is introduced and the choice of suitable GAN models is discussed. In Section III the method is evaluated in terms of pose estimation performance of the YOLO6D [tekin2018real] model on the LineMod [hinterstoisser2012model] data-set.
We conclude with Section IV giving a summary of our results and discussing the limitations and future work.
The core idea of our approach is to formulate the domain gap as a learning problem that is addressed with generative CNN models. Here, we use the generative adversarial framework to train a conditional generator that is able to augment images such that the pose estimation network becomes invariant to the source domain. For this, the statistical distribution of image features found in both the real world and the synthetic domain must be matched, allowing the alignment of one domain to the other.
In this section we first discuss applicable GAN models and show qualitative results on the LineMod data-set to motivate the choice of specific image-conditional GAN models. Then, we present our pipeline for fully synthetic training based on supervised image translation, leveraging Pix2PixHD [wang2018high]. Next, we turn to unsupervised image translation and introduce an alternative pipeline that replaces the GAN model with CycleGAN [CycleGAN2017], which simplifies data acquisition by lifting the requirement of pairing images from both domains.
II-A Baseline methods
We define two baseline methods for synthetic training. Both follow the data augmentation scheme of [tekin2018real]: they use random backgrounds from a set of real images, apply random image scaling and randomly adjust exposure and saturation. However, instead of using crops from real images, we render the object on top of the background (see Figure 2) with one of the following methods:
a) realistic texturing, by applying the true texture extracted from the data-set [rojtberg2019real], and
b) random texturing of the object during rendering.
The first option depends on having some real data available, but allows generating an arbitrary amount of realistic data by rendering. The second scheme applies blind DR by randomizing both the object and the background appearance. Note that this scheme neither requires plausible object placement nor plausible object appearance and thus is far easier to set up, compared to other DR solutions.
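The compositing and color augmentation described above can be illustrated with a minimal sketch. It is not the implementation used in this work: the function names, the jitter ranges and the simple RGB saturation model are assumptions made for illustration only, and random image scaling is omitted. Images are represented as nested lists of RGB tuples with values in [0, 1] to keep the example self-contained.

```python
import random


def composite(render, alpha, background):
    """Alpha-blend a rendered object over a background of the same size."""
    return [[tuple(a * r + (1.0 - a) * b for r, b in zip(rgb, bg))
             for rgb, a, bg in zip(render_row, alpha_row, bg_row)]
            for render_row, alpha_row, bg_row in zip(render, alpha, background)]


def jitter(image, rng, max_exposure=1.5, max_saturation=1.5):
    """Randomly scale exposure and saturation (hypothetical ranges)."""
    exposure = rng.uniform(1.0 / max_exposure, max_exposure)
    saturation = rng.uniform(1.0 / max_saturation, max_saturation)
    result = []
    for row in image:
        new_row = []
        for rgb in row:
            gray = sum(rgb) / 3.0
            # Move each channel away from (or towards) the gray value,
            # scale the brightness, then clamp to the valid range.
            new_row.append(tuple(
                min(max(exposure * (gray + saturation * (c - gray)), 0.0), 1.0)
                for c in rgb))
        result.append(new_row)
    return result


# Toy 2x2 example: a red "object" on a blue "background".
render = [[(1.0, 0.0, 0.0)] * 2 for _ in range(2)]
alpha = [[1.0, 0.0], [0.0, 1.0]]  # 1.0 where the object covers the pixel
background = [[(0.0, 0.0, 1.0)] * 2 for _ in range(2)]

composed = composite(render, alpha, background)
augmented = jitter(composed, random.Random(0))
```

In practice the alpha mask would come from the renderer and the background from the set of real images. For baseline b) the render would additionally use a randomly chosen texture.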
II-B Suitable GAN models
The main requirement on the generator is that the resulting images have a high correlation with a given pose, such that the pose estimation model can be trained in a supervised fashion. In theory, any GAN model can be used for this if images can be mapped into the latent space of the generator. Such a mapping allows conditioning the generator to create images that resemble the input inside the latent space.
If the latent space is constructed in a way that allows for interpolation, e.g. by employing a Kullback-Leibler loss [diederik2014auto], this approach would also allow synthesizing novel samples not seen during training. In the case of pose estimation, this is required to generate images for views not seen by the adaptation network during training. However, the resulting image must retain enough fidelity for the pose estimation network to predict the correct pose. Not all GAN architectures are practical for this use case. For a preliminary experiment, we use the StyleGAN [karras2019style] model, which offers state-of-the-art generation performance and allows interpolation in the latent space of the generator.
We train StyleGAN at a reduced resolution of px for three days using approximately renderings composed on top of real backgrounds. The model is initialized using the weights for the LSUN Cat data-set [yu2015lsun] of the original publication. This allows operating at the reduced training period, compared to training the model from scratch which required 13 days according to the original publication.
We then map a real image into the latent space and use it as the mean for the random vector to sample new images, which would ideally retain most of the original image content, particularly the pose of the target object. Figure 3 shows an exemplary result; while the images seem plausible (considering the reduced training time), it is obvious that the model is not able to sufficiently capture the locality of the object. This results in a significantly different pose in the sampled image compared to the input image, which precludes this approach from being used for pose estimation. Therefore, we focus on conditional GAN models in the following.
II-C Paired intermediate domain translation
In this section we introduce a training pipeline based on the Pix2PixHD [wang2018high] paired image translation model. This is a supervised approach, which requires aligned image pairs to learn the domain translation. Pairing synthetic images with real images requires a corresponding training set of real images, which defeats the goal of synthetic training. Therefore, we introduce a deterministic transform into an intermediate domain with reduced expressiveness, making real-world and rendered images less distinguishable. Here, we use the Laplace filter, which approximates the second-order image derivative. Its response is continuous and directly translates to edge strength without requiring a thinning step. The Pix2PixHD model is then employed to reconstruct real images from their Laplace-filtered variants.
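To make the intermediate domain more concrete, the following sketch applies a discrete Laplace filter to a grayscale image. The 4-neighbour kernel and the border replication are assumptions of this example; the text does not specify which discretization or border handling was used.

```python
def laplace_filter(image):
    """Apply the 4-neighbour discrete Laplacian to a grayscale image.

    `image` is a list of rows of floats. Border pixels are handled by
    replicating the nearest edge pixel (an assumption of this sketch).
    """
    height, width = len(image), len(image[0])

    def pixel(y, x):
        # Clamp the coordinates so that the border is replicated.
        y = min(max(y, 0), height - 1)
        x = min(max(x, 0), width - 1)
        return image[y][x]

    result = []
    for y in range(height):
        row = []
        for x in range(width):
            # Equivalent to the kernel [[0, 1, 0], [1, -4, 1], [0, 1, 0]].
            value = (pixel(y - 1, x) + pixel(y + 1, x)
                     + pixel(y, x - 1) + pixel(y, x + 1)
                     - 4.0 * pixel(y, x))
            row.append(value)
        result.append(row)
    return result


# A flat region gives no response, a vertical step edge a signed response.
flat = [[5.0] * 4 for _ in range(4)]
step = [[0.0, 0.0, 1.0, 1.0] for _ in range(4)]
flat_response = laplace_filter(flat)
step_response = laplace_filter(step)
```

The response is zero in uniform regions and non-zero only next to intensity changes, which is why absolute colors are discarded while contours are kept.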
The pipeline now consists of two trainable models: the Pix2PixHD model for domain adaptation followed by the pose estimation task network (see Figure 4). As the performance of the second model depends on the first model, the pipeline has to be trained in two stages. First, the domain adaptation network is trained until convergence on rendered images with random backgrounds and their Laplace-filtered variants. Here, the network must simultaneously reconstruct the rendering as well as the real background, which forces the network towards realistic reconstructions. Next, the pose estimation model is trained on the reconstructed images. Assuming the adaptation network was able to generalize from the training data, we can now create a virtually infinite amount of realistic views.
The remaining question is how to model the object surface before converting it to the Laplace domain, given that we do not want to impose any restrictions on possible object appearances. Here, one needs to balance the learning problem between the domain adaptation and the pose estimation network — e.g. enforcing discriminative features makes the problem for the adaptation network more challenging, yet it reduces the difficulty for the pose estimation network. Specifically, we opted for the following methods covering different work distributions of the involved networks:
1) Use the real-world texture (see Figure 4). This should result in the best model performance as it is the easiest task. However, real images are needed to obtain the texture, which conflicts with our goal of fully synthetic training. This option was therefore mainly included to assess the performance loss of mapping into the Laplace domain as well as the loss induced by the following options. It corresponds to baseline variant a) with domain adaptation.
2) Use a random texture (see Figure 5a). This prevents the networks from learning any surface-related information of the object to be detected. While this should not affect the domain adaptation network, which is already faced with the task of reconstructing arbitrary background images, it makes the task of pose estimation significantly more challenging. The important shading and contour cues can be arbitrarily degraded by the texture used. This corresponds to baseline variant b) with domain adaptation.
3) Use a uniform color for the object (see Figure 5b). Instead of using a random texture, we assume the object to be of one uniform, yet arbitrary color. Given the Laplace intermediate domain, the adaptation network cannot learn a correct object colorization. Therefore, we set the target color to gray, which is the most likely guess the adaptation network can take, given this task. Keeping the surface properties fixed allows applying a consistent shading to the object, which in turn can be exploited by the pose estimation network. The reconstruction of a plausible surface shading makes the task of the domain adaptation network more challenging, but this is a deliberate choice for balancing the work. To prevent the adaptation network from over-fitting to a specific lighting position, we place several point lights randomly around the object.
4) Use a fixed checkerboard pattern for the object surface (see Figure 6). Here, the Laplace images are generated from a uniformly colored object as above, while the reconstruction target is rendered with a fixed checkerboard texture. This makes the pose estimation task easier as the network can rely on stable cues on the object surface. However, this is particularly challenging for the adaptation network as it must encode the object geometry to reconstruct a correct appearance, while being confronted with merely an edge image generated from a uniformly colored object.
Note that translating the object appearance to a different representation, as done in the last two methods, requires the translation network to be executed at inference time as well.
II-D Direct image domain translation
In this section we introduce a CycleGAN [CycleGAN2017] based training pipeline for unsupervised domain adaptation. As this model does not require matching image pairs, there is no need for an explicit intermediate representation, and the model can directly learn the mapping between the domains of synthetic and real images. For the real domain we use the same generation scheme as in baseline b), rendering randomly textured objects on random backgrounds. However, for the synthetic domain we cannot use real backgrounds to generate samples, as the different statistics would give away the synthetic object and bias the GAN towards object segmentation. Instead, we collect a separate data-set of synthetic background images from 3D-game footage on YouTube. Here, we select random crops of randomly selected frames from a total of about 5 hours of video footage.
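The sampling of background crops can be sketched as follows. This only illustrates the selection logic under stated assumptions: the frame count, frame size and crop size are hypothetical values, and decoding the video frames is left out.

```python
import random


def sample_crop_boxes(num_frames, frame_size, crop_size, count, rng):
    """Pick `count` random (frame_index, left, top) crop positions."""
    frame_w, frame_h = frame_size
    crop_w, crop_h = crop_size
    if crop_w > frame_w or crop_h > frame_h:
        raise ValueError("crop must fit inside the frame")
    boxes = []
    for _ in range(count):
        # Choose a frame uniformly, then a crop position inside it.
        frame_index = rng.randrange(num_frames)
        left = rng.randint(0, frame_w - crop_w)
        top = rng.randint(0, frame_h - crop_h)
        boxes.append((frame_index, left, top))
    return boxes


# Hypothetical numbers: 1000 decoded frames of 1280x720, crops of 256x256.
rng = random.Random(42)
boxes = sample_crop_boxes(1000, (1280, 720), (256, 256), count=8, rng=rng)
```

Each returned tuple identifies one background image; the crop itself would be cut from the decoded frame at the given position.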
Furthermore, the specific architecture is capable of translating images in both directions. This allows reversing the pipeline: instead of adapting synthetic images to the real domain at training time, it is also possible to adapt real images to the synthetic domain at run-time. While this eases training, it comes at the cost of having to execute the translation network for inference.
As a limitation, the original CycleGAN model is tuned to produce images at a resolution of px, which is only a fraction of the YOLO6D receptive field of px. Scaling the GAN up would require tuning its hyper-parameters, such as the size of the hidden layers, which is out of the scope of this work. Therefore, we perform our experiments at the limited resolution, which still allows judging the feasibility of the method. However, one should keep in mind that pose precision can be improved by scaling the network output to match YOLO6D.
In this section we quantitatively and qualitatively evaluate whether our pipeline allows the training of the demanding pose estimation task network from synthetic, randomized and non-photorealistic renderings only. For comparability with related work, we use the LineMod data-set for evaluation. Here, we focus on the "Driller" object as it exhibits a characteristic, non-uniform surface and is the most challenging object for the pose estimation model.
In the following, we first present the results of the baseline approaches as introduced in section II. We then turn to the paired image translation via an intermediate, reduced domain and finally present the results for direct image domain translation.
III-A Implementation details
For training the pipeline, we follow the procedure outlined by [tekin2018real] and initially drop the confidence loss when training YOLO6D on a domain different from real images. This proved essential to allow the pose estimator to adapt, as samples from the GAN exhibit significantly different colors and details than the ImageNet data-set, on which the model was initialized. After samples the estimator was able to reach 85% recall which improved to 95% after samples. Only after such initialization did we proceed with training with the complete loss function.
If not specified otherwise, each benchmark used the following training parameters:
stochastic gradient descent with momentum ()
weight decay of and learning rate of
batch size of 12.
When not explicitly aiming for convergence, the learning rate was kept constant, otherwise it was reduced gradually with advancing training.
III-B Quantitative results
We measure the domain translation performance of the presented methods in terms of pose estimation error of the task network [tekin2018real]. Here, we employ two different metrics; the 3D translation and 3D rotation as well as the 2D corner re-projection error. The latter measures the error in screen-space and therefore is well suited for augmented-reality, while the former is more meaningful for robotic applications.
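As an illustration of how such metrics can be computed, the sketch below evaluates a translation error and a mean 2D corner re-projection error for a pinhole camera. It is a simplified example, not the evaluation code of [tekin2018real]: the function names, the camera intrinsics and the use of bounding-box corners as projected points are assumptions, and the rotation error is omitted.

```python
import itertools
import math


def project(point, rotation, translation, focal, center):
    """Project a 3D model point into the image with a pinhole camera."""
    # Transform the point into the camera frame: p_cam = R * p + t.
    cam = [sum(rotation[i][j] * point[j] for j in range(3)) + translation[i]
           for i in range(3)]
    # Perspective division followed by the intrinsic mapping.
    u = focal[0] * cam[0] / cam[2] + center[0]
    v = focal[1] * cam[1] / cam[2] + center[1]
    return (u, v)


def reprojection_error(corners, pose_gt, pose_est, focal, center):
    """Mean 2D distance in pixels between corners projected with both poses."""
    total = 0.0
    for corner in corners:
        u_gt, v_gt = project(corner, *pose_gt, focal, center)
        u_est, v_est = project(corner, *pose_est, focal, center)
        total += math.hypot(u_gt - u_est, v_gt - v_est)
    return total / len(corners)


def translation_error(pose_gt, pose_est):
    """Euclidean distance between the two translation vectors."""
    return math.dist(pose_gt[1], pose_est[1])


# Hypothetical setup: a 10 cm cube, 1 m in front of the camera, with the
# estimate shifted by 10 cm along the x axis (all lengths in meters).
corners = list(itertools.product((-0.05, 0.05), repeat=3))
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
pose_gt = (identity, (0.0, 0.0, 1.0))
pose_est = (identity, (0.1, 0.0, 1.0))
focal, center = (500.0, 500.0), (320.0, 240.0)

pixel_error = reprojection_error(corners, pose_gt, pose_est, focal, center)
metric_error = translation_error(pose_gt, pose_est)
```

The same metric offset leads to a larger pixel error for corners closer to the camera, which illustrates why the screen-space measure suits augmented reality while the metric measure suits robotic applications.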
| Object texture / domain adaptation | Translation error | Re-projection error |
| real / none | 9 cm | 15 px |
| random / none | 45 cm | 111 px |
Comparing the baseline methods as introduced in Section II-A (see Table I), we see that the performance of baseline method b) is significantly reduced, although we ensured model convergence in both cases. This shows that our task network for pose estimation, YOLO6D, is not sufficiently conditioned to overcome the domain gap on its own, even when presented with a virtually unlimited amount of images from the synthetic domain.
| Object texture / domain adaptation | Translation error | Re-projection error |
| real / laplace | 8 cm | 11 px |
| random / laplace | 36 cm | 88 px |
| uniform / laplace | 49 cm | 81 px |
| pattern / laplace | 68 cm | 135 px |
Looking at the domain translation results using the paired model as introduced in Section II-C in Table II, we see an improvement in pose estimation by about 15% in both cases. This indicates that the edge-based intermediate representation leads to a more robust representation with YOLO6D, likely because we are able to reduce the texture bias [geirhos2018imagenettrained] of the model. Notably, rendering the object using a uniform color as in 3) does not improve results compared to using a random texture. Probably the randomized lighting causes enough variation such that the pose estimator cannot benefit from the consistent shading. Training the pose estimation network was not possible when applying rendering method 4).
While the results improve over baseline, the margin is only moderate. Likely, the reason for this is that the Laplace filtering does not sufficiently reduce the expressiveness of the image and the domain gap is still present in the intermediate representation.
| Object texture / domain adaptation | Translation error | Re-projection error |
| random / direct | 40 cm | 83 px |
| random / reversed | 10 cm | 25 px |
Turning to the direct domain translation using CycleGAN as introduced in Section II-D in Table III, the results of synthetic to real translation are consistent with paired translation, even though the network produces images at only a quarter of the resolution. This leads us to believe that unsupervised models are generally sufficient for domain adaptation.
When reversing the translation from real to synthetic, we see results within reach of our baseline method a). This is remarkable as the baseline uses realistic texturing of the object that is captured from real images. The domain translation method, on the other hand, relies only on the CAD geometry, with both the background and the object surface being randomized.
III-C Qualitative results
As shown in Figure 6, Pix2PixHD cannot predict a consistent checkerboard texture. While the results are plausible, they do not exhibit enough detail to help the pose estimation model. Still, this is a remarkable result as the translation network is confronted with a significantly more demanding challenge: to correctly apply the texture pattern it not only has to infer the object pose, but also the object geometry.
Figure 1 shows some predictions of the pipeline trained with CycleGAN for domain adaptation. While the model is able to reliably detect the object, the accuracy of the object pose is lacking. This is likely due to the limited resolution of the adaptation network and can be further improved by scaling it up to the pose estimation input size.
Figure 7 shows the effect on image quality after applying CycleGAN on the objects "ape" and "benchvise". The model is able to improve color reproduction and to add details like specular highlights.
We have shown that employing a paired translation GAN for domain adaptation during training generally improves model robustness and hence the performance of the target network. Turning to unpaired translation GANs, we have shown that training solely on CAD geometry with neither knowing the surface properties nor the environment is possible and reaches the performance of training on realistic data, when using reverse translation. These results indicate that image-conditional GANs are indeed an effective measure to close the domain gap between real and synthetic images.
Furthermore, the introduced training is much simpler compared to existing solutions relying on domain randomization. The latter require a faithful setup of randomization modules, even when employing guided domain randomization. On the other hand, the presented style transfer pipelines only require collecting unstructured images from the target domain, which are fed into the adaptation network in an unsupervised fashion. This allows focusing on training the task network. In this regard, we have shown that the method is precise enough to train a pose estimation model with satisfactory precision.
To further improve the performance of our approach, the obvious measure is to eliminate the resolution loss by scaling up the CycleGAN output to match the YOLO6D input. Furthermore, the method currently relies on blind randomization during rendering to account for the multitude of possible surface materials. Instead, it seems beneficial to employ a generator that is aware of the multi-modal data and is therefore capable of producing different surface materials based on a noise vector. To this end, one could replace the image-conditional CycleGAN by a more recent variant like MUNIT [huang2018multimodal]. Alternatively, further regularization of the unconditional StyleGAN [karras2019style] could enforce that its samples closely resemble the input image. In this case one could reconsider the architecture, as it is also capable of generating multiple styles by internally disentangling content and style.