ST-GAN: Spatial Transformer Generative Adversarial Networks for Image Compositing

03/05/2018 ∙ by Chen-Hsuan Lin, et al. ∙ Adobe ∙ Carnegie Mellon University

We address the problem of finding realistic geometric corrections to a foreground object such that it appears natural when composited into a background image. To achieve this, we propose a novel Generative Adversarial Network (GAN) architecture that utilizes Spatial Transformer Networks (STNs) as the generator, which we call Spatial Transformer GANs (ST-GANs). ST-GANs seek image realism by operating in the geometric warp parameter space. In particular, we exploit an iterative STN warping scheme and propose a sequential training strategy that achieves better results compared to naive training of a single generator. One of the key advantages of ST-GAN is its applicability to high-resolution images indirectly since the predicted warp parameters are transferable between reference frames. We demonstrate our approach in two applications: (1) visualizing how indoor furniture (e.g. from product images) might be perceived in a room, (2) hallucinating how accessories like glasses would look when matched with real portraits.




Code Repositories


ST-GAN: Spatial Transformer Generative Adversarial Networks for Image Compositing (CVPR 2018)


1 Introduction

Figure 1: Composite images easily fall outside the natural image manifold due to appearance and geometric discrepancies. We seek to learn geometric corrections that sequentially warp composite images towards the intersection of the geometric and natural image manifolds.

Generative image modeling has progressed remarkably with the advent of convolutional neural networks (CNNs). Most approaches constrain the possible appearance variations within an image by learning a low-dimensional embedding as an encoding of the natural image subspace and making predictions from this at the pixel level. We refer to these approaches here as direct image generation. Generative Adversarial Networks (GANs) [7], in particular, have proven to be an especially powerful tool for realistic image generation. They consist of a generator network ($\mathcal{G}$) that produces images from codes, and a discriminator network ($\mathcal{D}$) that distinguishes real images from fake ones. These two networks play a minimax game that results in $\mathcal{G}$ generating realistic-looking images and $\mathcal{D}$ being unable to distinguish between the two when equilibrium is reached.

Direct image generation, however, has its limitations. As the space of all images is very high-dimensional and image generation methods are limited by finite network capacity, direct image generation methods currently work well only on restricted domains (e.g. faces) or at low resolutions.

In this work, we leverage Spatial Transformer Networks (STNs) [11], a special type of CNNs capable of performing geometric transformations on images, to provide a simpler way to generate realistic looking images – by restricting the space of possible outputs to a well-defined low-dimensional geometric transformation of real images. We propose Spatial Transformer Generative Adversarial Networks (ST-GANs), which learn Spatial Transformer generators within a GAN framework. The adversarial loss enables us to learn geometric corrections resulting in a warped image that lies at the intersection of the natural image manifold and the geometric manifold – the space of geometric manipulations specific to the target image (Fig. 1). To achieve this, we advocate a sequential adversarial training strategy to learn iterative spatial transformations that serve to break large transformations down into smaller ones.

We evaluate ST-GANs in the context of image compositing, where a source foreground image and its mask are warped by the Spatial Transformer generator $\mathcal{G}$, and the resulting composite is assessed by the discriminator $\mathcal{D}$. In this setup, $\mathcal{D}$ tries to distinguish warped composites from real images, while $\mathcal{G}$ tries to fool $\mathcal{D}$ by generating composites that look as realistic as possible. To the best of our knowledge, we are the first to address the problem of realistic image generation through geometric transformations in a GAN framework. We demonstrate this method on the application of compositing furniture into indoor scenes, which gives a preview of, for example, how purchased items would look in a house. To evaluate in this domain, we created a synthetic dataset of indoor scene images as the background with masked objects as the foreground. We also demonstrate ST-GANs in a fully unpaired setting for the task of compositing glasses on portrait images. A large-scale user study shows that our approach improves the realism of image composites.

Our main contributions are as follows:

  • We integrate the STN and GAN frameworks and introduce ST-GAN, a novel GAN framework for finding realistic-looking geometric warps.

  • We design a multi-stage architecture and training strategy that improves warping convergence of ST-GANs.

  • We demonstrate compelling results in image compositing tasks in both paired and unpaired settings, as well as the applicability of ST-GAN to high-resolution images.

2 Related Work

Image compositing refers to the process of overlaying a masked foreground image on top of a background image. One of the main challenges of image compositing is that the foreground object usually comes from a different scene than the background, and therefore it is unlikely to match the background scene in a number of ways, which negatively affects the realism of the composite. These can be both appearance differences (due to lighting, white balance, and shading differences) and geometric differences (due to changes in camera viewpoint and object positioning).

Existing photo-editing software features various image appearance adjustment operations that allow users to create realistic composites. Prior work has attempted to automate appearance corrections (e.g. contrast, saturation) through Poisson blending [26] or more recent deep learning approaches [42, 30]. In this work, we focus on the second challenge: correcting for geometric inconsistencies between source and target images.

Spatial Transformer Networks (STNs) [11] are one way to incorporate learnable image warping within a deep learning framework. A Spatial Transformer module consists of a subnetwork predicting a set of warp parameters followed by a (differentiable) warp function.

STNs have been shown effective in resolving geometric variations for discriminative tasks as well as a wide range of extended applications such as robust filter learning [4, 13], image/view synthesis [41, 6, 24, 37], and 3D representation learning [14, 35, 40]. More recently, Inverse Compositional STNs (IC-STNs) [17] advocated an iterative alignment framework. In this work, we borrow the concept of iterative warping but do not enforce recurrence in the geometric prediction network; instead, we add different generators at each warping step with a sequential training scheme.

Generative Adversarial Networks (GANs) [7] are a class of generative models that are learned by playing a minimax optimization game between a generator network $\mathcal{G}$ and a discriminator network $\mathcal{D}$. Through this adversarial process, GANs are shown to be capable of learning a generative distribution that matches the empirical distribution of a given data collection. One advantage of GANs is that the loss function is essentially learned by the discriminator network, which allows for training in cases where ground truth data with strong supervision is not available.

GANs are utilized for data generation in various domains, including images [27], videos [31], and 3D voxelized data [33]. For images in particular, they have been shown to generate compelling results in a wide variety of conditional image generation problems such as super-resolution [16], inpainting [25], image-to-image translation [10, 44, 19], and image editing/manipulation [43].

Recently, STNs have also been trained adversarially for object detection [32], where adversarial examples with feature deformations are generated to make object detectors more robust. LR-GAN [36] approached direct image generation by applying additional STNs to the (directly) generated images to factorize shape variations. We explore the combination of STNs and GANs in the space of conditional image generation from given inputs, which is a more direct integration of the two frameworks.

3 Approach

Figure 2: Background. (a) Given an initial composite transformation $p_0$, the foreground image and mask are composited onto the background image using the compositing operation of Sec. 3. (b) Using Spatial Transformer Networks (STNs), a geometric prediction network $\mathcal{G}_1$ predicts an update $\Delta p_1$ conditioned on the foreground and background images, resulting in the new parameters $p_1$. The update is performed with warp composition (Sec. 3.1). (c) Our final form is an iterative STN to predict a series of accumulative warp updates on the foreground such that the resulting composite image falls closer to the natural image manifold.

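Our goal is realistic geometric correction for image compositing given a background image $I_{\text{BG}}$ and foreground object $I_{\text{FG}}$ with a corresponding mask $M_{\text{FG}}$. We aim to correct the camera perspective, position and orientation of the foreground object such that the resulting composite looks natural. The compositing process can be expressed as:

$$I_{\text{comp}} = I_{\text{FG}} \odot M_{\text{FG}} + I_{\text{BG}} \odot (1 - M_{\text{FG}}) = I_{\text{FG}} \oplus I_{\text{BG}}.$$

For simplicity, we further introduce the notation $\oplus$ to represent compositing (with $M_{\text{FG}}$ implied within $I_{\text{FG}}$). Given the composite parameters $p_0$ (defining an initial warp state) of $I_{\text{FG}}$, we can rewrite the compositing equation as

$$I_{\text{comp}}(p_0) = I_{\text{FG}}(p_0) \oplus I_{\text{BG}},$$

where images are written as functions of the warp parameters. This operator is shown in Fig. 2(a).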
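As an illustration of the compositing operation described above, the following is a minimal NumPy sketch of mask-based compositing (foreground weighted by the mask, background weighted by one minus the mask). The function name `composite`, the array shapes, and the value range are assumptions made for this example and are not taken from the original implementation.

```python
import numpy as np

def composite(fg_rgb, fg_mask, bg_rgb):
    """Composite a masked foreground onto a background.

    fg_rgb, bg_rgb: float arrays of shape (H, W, 3) with values in [0, 1].
    fg_mask: float array of shape (H, W) in [0, 1]; 1 marks foreground pixels.
    """
    m = fg_mask[..., None]  # add a channel axis so the mask broadcasts over RGB
    return fg_rgb * m + bg_rgb * (1.0 - m)

# Toy example: a 2x2 image whose left column is covered by the foreground.
fg = np.ones((2, 2, 3))    # white foreground
bg = np.zeros((2, 2, 3))   # black background
mask = np.array([[1.0, 0.0],
                 [1.0, 0.0]])
comp = composite(fg, mask, bg)
```

In this toy case the left column of `comp` takes the foreground colour and the right column keeps the background. A soft mask with values between 0 and 1 would blend the two.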

In this work, we restrict our geometric warp function to homography transformations, which can represent approximate 3D geometric rectifications for objects that are mostly planar or with small perturbations. As a result, we are making an assumption that the perspective of the foreground object is close to the correct perspective; this is often the case when people are choosing similar, but not identical, images from which to composite the foreground object.

The core module of our network design is an STN (Fig. 2(b)), where the geometric prediction network $\mathcal{G}$ predicts a correcting update $\Delta p_1$. We condition $\mathcal{G}$ on both the background and foreground images, since knowing how an object should be transformed to fit a background scene requires knowledge of the complex interaction between the two. This includes the geometry of the object and the background scene, the relative camera position, and a semantic understanding of realistic object layouts (e.g. having a window in the middle of the room would not make sense).

3.1 Iterative Geometric Corrections

Predicting large displacement warp parameters from image pixels is extremely challenging, so most prior work on image alignment predicts local geometric transformations in an iterative fashion [9, 21, 2, 34, 18]. Similarly, we propose to use iterative STNs to predict a series of warp updates, shown in Fig. 2(c). At the $i$th iteration, given the input image $I$ and the previous warp state $p_{i-1}$, the correcting warp update $\Delta p_i$ and the new warp state $p_i$ can be written as

$$\Delta p_i = \mathcal{G}_i\big(I_{\text{FG}}(p_{i-1}), I_{\text{BG}}\big), \qquad p_i = p_{i-1} \circ \Delta p_i,$$

where $\mathcal{G}_i(\cdot)$ is the geometric prediction network and $\circ$ denotes composition of warp parameters. This family of iterative STNs preserves the original images from loss of information due to multiple warping operations [17].
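The iterative scheme above can be sketched as a short loop. This is only an illustrative sketch: `iterative_warp`, `warp_fn`, and the toy predictor are hypothetical names introduced here, and composition is implemented as parameter addition, which matches the parameterization described in Section 4 but is otherwise an assumption. The point it shows is that each step re-warps the original foreground with the accumulated parameters rather than warping an already warped image.

```python
import numpy as np

def iterative_warp(p0, predictors, fg, bg, warp_fn):
    """Apply a sequence of warp updates, always warping the ORIGINAL foreground.

    p0:         initial warp parameters (1-D array).
    predictors: list of callables G_i(warped_fg, bg) -> delta_p.
    warp_fn:    callable (fg, p) -> foreground warped by parameters p.
    """
    p = np.asarray(p0, dtype=float)
    history = [p.copy()]
    for G_i in predictors:
        warped = warp_fn(fg, p)      # warp the source once, using the accumulated state
        delta_p = G_i(warped, bg)    # predict a small correcting update
        p = p + delta_p              # composition as parameter addition (assumption)
        history.append(p.copy())
    return p, history

# Toy stand-ins: the "image" is just the current state, and each predictor
# moves halfway towards a target that plays the role of the background.
target = np.array([4.0, -2.0])
toy_warp = lambda fg, p: p
toy_G = lambda warped, bg: 0.5 * (bg - warped)
p_final, hist = iterative_warp(np.zeros(2), [toy_G] * 4, fg=None, bg=target, warp_fn=toy_warp)
```

With four toy updates the state covers 15/16 of the distance to the target, which mirrors the idea of breaking a large transformation into smaller steps.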

3.2 Sequential Adversarial Training

Figure 3: Sequential adversarial training of ST-GAN. When learning a new warp state $p_i$, only the new generator $\mathcal{G}_i$ is updated while the previous ones are kept fixed. A single discriminator $\mathcal{D}$ (learned from all stages) is continuously improved during the sequential learning process.

In order for STNs to learn geometric warps that map images closer to the natural image manifold, we integrate them into a GAN framework, which we refer to as ST-GANs. The motivation for this is two-fold. First, learning a realistic geometric correction is a multi-modal problem (e.g. a bed can reasonably exist in multiple places in a room); second, supervision for these warp parameters is typically not available. The main differences between ST-GANs and conventional GANs are that (1) $\mathcal{G}$ generates a set of low-dimensional warp parameter updates instead of images (the whole set of pixel values); and (2) $\mathcal{D}$ gets as input the warped foreground image composited with the background.

To learn gradual geometric improvements toward the natural image manifold, we adopt a sequential adversarial training strategy for iterative STNs (Fig. 3), where the geometric predictor $\mathcal{G}$ corresponds to the stack of generators $\mathcal{G}_i$. We start by training a single $\mathcal{G}_1$, and each subsequent new generator $\mathcal{G}_i$ is added and trained by fixing the weights of all previous generators $\{\mathcal{G}_j\}_{j=1 \cdots i-1}$. As a result, we train only $\mathcal{G}_i$ and $\mathcal{D}$ by feeding the resulting composite image at warp state $p_i$ into the discriminator $\mathcal{D}$ and matching it against the real data distribution. This learning philosophy shares commonalities with the Supervised Descent Method [34], where a series of linear regressors are solved greedily, and we found it makes the overall training faster and more robust. Finally, we fine-tune the entire network end-to-end to achieve our final result. Note that we use the same discriminator $\mathcal{D}$ for all stages of the generator $\mathcal{G}_i$, as the fundamental measure of “geometric fakeness” does not change over iterations.
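The sequential strategy can be summarized as a schedule of which networks are frozen and which are trained at each stage. The sketch below only builds that schedule as plain data. The function name `sequential_schedule` and the labels `G1`, `G2`, ..., `D` are hypothetical, and it assumes the discriminator keeps being updated during end-to-end fine-tuning.

```python
def sequential_schedule(num_warps):
    """Return, for each training stage, which networks are frozen and which are trained.

    Stage i (1-based) trains generator G_i and the shared discriminator D while
    G_1..G_{i-1} stay fixed; a final stage fine-tunes all networks end-to-end.
    """
    stages = []
    for i in range(1, num_warps + 1):
        stages.append({
            "stage": f"warp {i}",
            "frozen": [f"G{j}" for j in range(1, i)],
            "trainable": [f"G{i}", "D"],
        })
    stages.append({
        "stage": "end-to-end",
        "frozen": [],
        "trainable": [f"G{j}" for j in range(1, num_warps + 1)] + ["D"],
    })
    return stages

schedule = sequential_schedule(4)
```

For example, the third entry of `schedule` freezes `G1` and `G2` and trains `G3` together with `D`.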

3.3 Adversarial Objective

We optimize the Wasserstein GAN (WGAN) [1] objective for our adversarial game. We note that ST-GAN is amenable to any other GAN variants [22, 39, 3], and that the choice of GAN architecture is orthogonal to this work.

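The WGAN minimax objective at the $i$th stage is

$$\min_{\mathcal{G}_i} \max_{\mathcal{D} \in \mathbb{D}} \; \mathbb{E}_{y \sim \mathbb{P}_{\text{real}}} \big[\mathcal{D}(y)\big] - \mathbb{E}_{x \sim \mathbb{P}_{\text{fake}},\, p_i \sim \mathbb{P}_{p_i | p_{i-1}}} \big[\mathcal{D}(x(p_i))\big],$$

where $y = I_{\text{real}}$ and $x = I_{\text{comp}}$ are drawn from the real data and fake composite distributions, and $\mathbb{D}$ is the set of 1-Lipschitz functions enforced by adding a gradient penalty term $\mathcal{L}_{\text{grad}}$ [8]. Here, $p_i$ (where $\mathcal{G}_i$ is implied, defined in Sec. 3.1) is drawn from the posterior distribution conditioned on $p_{i-1}$ (recursively implied). When $i = 1$, the initial warp $p_0$ is drawn from $\mathbb{P}_{\text{pert}}$, a predefined distribution for geometric data augmentation.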

We also constrain the warp update $\Delta p_i$ to lie within a trust region by introducing an additional penalty $\mathcal{L}_{\text{update}}$. This is essential since ST-GAN may otherwise learn trivial solutions that remove the foreground (e.g. by translating it outside the image or shrinking it into nothing), leaving behind only the background image and in turn making the composite image realistic already.

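When training ST-GAN sequentially, we update $\mathcal{D}$ and $\mathcal{G}_i$ by alternating between the respective loss functions:

$$\mathcal{L}_{\mathcal{D}} = \mathbb{E}_{x, p_i}\big[\mathcal{D}(x(p_i))\big] - \mathbb{E}_{y}\big[\mathcal{D}(y)\big] + \lambda_{\text{grad}} \cdot \mathcal{L}_{\text{grad}},$$

$$\mathcal{L}_{\mathcal{G}_i} = -\mathbb{E}_{x, p_i}\big[\mathcal{D}(x(p_i))\big] + \lambda_{\text{update}} \cdot \mathcal{L}_{\text{update}},$$

where $\lambda_{\text{grad}}$ and $\lambda_{\text{update}}$ are the penalty weights for the $\mathcal{D}$ gradient and the warp update respectively, and $\mathcal{G}_i$ and $\Delta p_i$ are again implied through the update rule of Sec. 3.1. When fine-tuning ST-GAN with $N$ learned updates end-to-end, the generator objective is the sum of that from each $\mathcal{G}_i$, i.e. $\mathcal{L}_{\mathcal{G}} = \sum_{i=1}^{N} \mathcal{L}_{\mathcal{G}_i}$.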
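To make the alternating losses concrete, the following NumPy sketch evaluates a critic loss with a gradient penalty and a generator loss with a warp-update penalty from precomputed critic scores. It is an illustrative sketch under stated assumptions: the function names are hypothetical, the gradient penalty uses the common WGAN-GP form (squared deviation of the gradient norm from 1), the update penalty is assumed to be a squared L2 norm of the warp update, and the penalty weights in the example are arbitrary illustrative values.

```python
import numpy as np

def critic_loss(d_fake, d_real, grad_norms, lambda_grad):
    """WGAN critic loss with gradient penalty, to be minimized w.r.t. the discriminator.

    d_fake, d_real: critic scores for composites and real images.
    grad_norms:     norms of the critic's input gradient at interpolated samples.
    """
    grad_penalty = np.mean((grad_norms - 1.0) ** 2)  # WGAN-GP style penalty (assumed form)
    return np.mean(d_fake) - np.mean(d_real) + lambda_grad * grad_penalty

def generator_loss(d_fake, delta_p, lambda_update):
    """Generator loss: raise the critic score of composites while keeping updates small."""
    update_penalty = np.mean(np.sum(delta_p ** 2, axis=1))  # squared L2 norm (assumed form)
    return -np.mean(d_fake) + lambda_update * update_penalty

# Toy batch of two samples with illustrative (not published) penalty weights.
d_fake = np.array([0.0, 2.0])
d_real = np.array([3.0, 5.0])
grad_norms = np.array([1.0, 2.0])
delta_p = np.array([[3.0, 4.0],
                    [0.0, 0.0]])
loss_d = critic_loss(d_fake, d_real, grad_norms, lambda_grad=10.0)
loss_g = generator_loss(d_fake, delta_p, lambda_update=0.1)
```

The update penalty is what discourages the degenerate solution of pushing the foreground out of the frame: a large warp update raises the generator loss even if the critic score improves.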

4 Experiments

We begin by describing the basic experimental settings.

Warp parameterizations.

We parameterize a homography with the $\mathfrak{sl}(3)$ Lie algebra [23], i.e. the warp parameters $p \in \mathfrak{sl}(3)$ and homography matrices $H \in SL(3)$ are related through the exponential map. Under this parameterization, warp composition can be expressed as the addition of parameters, i.e. $p_a \circ p_b \equiv p_a + p_b$.
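The relation between warp parameters and homography matrices can be sketched numerically. In the example below, 8 parameters are arranged into a traceless 3×3 matrix and passed through a matrix exponential, which yields a homography with unit determinant. The particular arrangement of parameters in `sl3_matrix` is one possible basis chosen for illustration and is an assumption, as is the simple scaling-and-squaring exponential used to keep the example dependency-free.

```python
import numpy as np

def sl3_matrix(p):
    """Arrange 8 parameters into a traceless 3x3 matrix (one possible layout, assumed)."""
    p1, p2, p3, p4, p5, p6, p7, p8 = p
    return np.array([[p1, p2, p3],
                     [p4, -p1 - p5, p6],
                     [p7, p8, p5]], dtype=float)

def expm(A, terms=20, squarings=8):
    """Matrix exponential via scaling-and-squaring with a truncated Taylor series."""
    A = A / (2.0 ** squarings)        # scale down so the series converges quickly
    result = np.eye(A.shape[0])
    term = np.eye(A.shape[0])
    for k in range(1, terms + 1):
        term = term @ A / k           # A^k / k!
        result = result + term
    for _ in range(squarings):        # undo the scaling by repeated squaring
        result = result @ result
    return result

def homography_from_params(p):
    """Map an 8-vector of warp parameters to a 3x3 homography with determinant 1."""
    return expm(sl3_matrix(p))

H_identity = homography_from_params(np.zeros(8))
H = homography_from_params(np.array([0.1, -0.05, 0.2, 0.03, -0.1, 0.1, 0.01, -0.02]))
```

Zero parameters give the identity warp, and because the exponent is traceless the determinant of `H` stays at 1, so no separate normalization of the homography is needed.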

Model architecture.

We denote the following: C($k$) is a 2D convolutional layer with $k$ filters of size and stride 2 (halving the feature map resolution) and L($k$) is a fully-connected layer with $k$ output nodes. The input of the generators $\mathcal{G}_i$ has 7 channels: RGBA for foreground and RGB for background, and the input to the discriminator $\mathcal{D}$ is the composite image with 3 channels (RGB). All images are rescaled to , but we note that the parameterized warp can be applied to full-resolution images at test time.

The architecture of $\mathcal{G}$ is C(32)-C(64)-C(128)-C(256)-C(512)-L(256)-L(8), where the output is the 8-dimensional (in the case of a homography) warp parameter update $\Delta p$. For each convolutional layer in $\mathcal{G}$, we concatenate a downsampled version of the original image (using average pooling) with the input feature map. For $\mathcal{D}$, we use a PatchGAN architecture [10], with layout C(32)-C(64)-C(128)-C(256)-C(512)-C(1). Nonlinear activations are inserted between all layers: ReLU for $\mathcal{G}$ and LeakyReLU with slope 0.2 for $\mathcal{D}$. We omit all normalization layers as we found them to deteriorate training performance.
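To see how the stride-2 layout shrinks the feature maps, the following sketch tracks feature-map shapes through a stack of C($k$) layers. It is a rough illustration only: the helper name `conv_output_sizes` is hypothetical, the padding is assumed to be 'same'-style (each layer maps a side of length n to ceil(n / 2)), and the 128×128 input is a made-up example size, not the resolution used in the experiments.

```python
import math

def conv_output_sizes(height, width, channels, stride=2):
    """Track feature-map shapes through a stack of strided convolutions.

    Assumes 'same'-style padding, so a side of length n becomes ceil(n / stride).
    Returns a list of (height, width, channels) tuples, one per layer.
    """
    shapes = []
    h, w = height, width
    for c in channels:
        h, w = math.ceil(h / stride), math.ceil(w / stride)
        shapes.append((h, w, c))
    return shapes

# Generator convolutional stack C(32)-C(64)-C(128)-C(256)-C(512) on a hypothetical input.
gen_shapes = conv_output_sizes(128, 128, [32, 64, 128, 256, 512])
last_h, last_w, last_c = gen_shapes[-1]
flat_features = last_h * last_w * last_c   # size of the vector fed to L(256)
```

Under these assumptions five stride-2 layers reduce each side by a factor of 32, so the fully-connected layers operate on a small spatial grid regardless of the channel widths.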

4.1 3D Cubes

Figure 4: (a) We create a synthetic dataset of 3D cube renderings and validate the efficacy of ST-GAN by attempting to correct randomly generated geometric perturbations. (b) ST-GAN is able to correct the cubes to the right perspective, albeit with a possible translational offset from the ground truth.

To begin with, we validate whether ST-GANs can make geometric corrections in a simple, artificial setting. We create a synthetic dataset consisting of a 3D rectangular room, an axis-aligned cube inside the room, and a perspective camera (Fig. 4(a)). We apply random 3-DoF translations to the cube and 6-DoF perturbations to the camera, and render the cube/room pair separately as the foreground/background (of resolution ). We color all sides of the cube and the room randomly.

We perturb the rendered foreground cubes with random homography transformations as the initial warp $p_0$ and train ST-GAN by pairing the original cube as the ground-truth counterpart for $\mathcal{D}$. As shown in Fig. 4(b), ST-GAN is able to correct the perturbed cubes' scale and perspective distortion w.r.t. the underlying scene geometry. In addition, ST-GAN is sometimes able to discover other realistic solutions (e.g. not necessarily aligning back to the ground-truth location), indicating ST-GAN’s ability to learn the multi-modal distribution of correct cube placements in this dataset.

4.2 Indoor Objects

Next, we show how ST-GANs can be applied to practical image compositing domains. We choose the application of compositing furniture in indoor scenes and demonstrate its efficacy on both simulated and real-world images. To collect training data, we create a synthetic dataset consisting of rendered background scenes and foreground objects with masks. We evaluate on the synthetic test set as well as high-resolution real world photographs to validate whether ST-GAN also generalizes to real images.

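| Category | Training set: # 3D inst. | Training set: # pert. | Test set: # 3D inst. | Test set: # pert. |
|---|---|---|---|---|
| Bed | 3924 | 11829 | 414 | 1281 |
| Bookshelf | 508 | 1280 | 58 | 137 |
| Cabinet | 9335 | 31174 | 1067 | 3518 |
| Chair | 196 | 609 | 22 | 60 |
| Desk | 64 | 1674 | 73 | 214 |
| Dresser | 285 | 808 | 31 | 84 |
| Refrigerator | 3802 | 15407 | 415 | 1692 |
| Sofa | 3604 | 11165 | 397 | 1144 |
| Total | 22303 | 73946 | 2477 | 8130 |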

Table 1: Dataset statistics for the indoor object experiment, reporting the number of object instances chosen for perturbation, and the final number of rendered perturbed samples.

Data preparation.

We render synthetic indoor scene images from the SUNCG dataset [29], consisting of 45,622 indoor scenes with over 5M 3D object instances from 37 categories [28]. We use the selected 41,499 scene models and the 568,749 camera viewpoints from Zhang et al. [38] and utilize Mitsuba [12] to render photo-realistic images with global illumination. We keep a list of candidate 3D objects consisting of all instances visible from the camera viewpoints and belonging to the categories listed in Table 1.

Figure 5: Rendering pipeline. Given an indoor scene and a candidate object, we remove occluding objects to create an occlusion-free scenario, and do the same at another perturbed camera pose. We further remove the object to create a training sample pair with mismatched perspectives.
| Category | Initial warp | SDM [34] | HomographyNet [5] | ST-GAN (non-seq.) | ST-GAN (warp 1) | ST-GAN (warp 2) | ST-GAN (warp 4) | ST-GAN (end-to-end) | Ground truth |
|---|---|---|---|---|---|---|---|---|---|
| Bed | 35.5 % | 30.5 % | 30.2 % | 32.8 % | 32.8 % | 46.8 % | 32.8 % | 32.2 % | 75.0 % |
| Bookshelf | 21.1 % | 33.9 % | 35.1 % | 16.7 % | 26.4 % | 26.2 % | 39.5 % | 42.6 % | 68.9 % |
| Cabinet | 20.9 % | 19.8 % | 35.0 % | 36.6 % | 14.3 % | 31.2 % | 44.4 % | 50.0 % | 74.3 % |
| Chair | 32.8 % | 36.8 % | 47.6 % | 50.9 % | 62.3 % | 42.7 % | 50.0 % | 58.6 % | 68.7 % |
| Desk | 18.9 % | 13.1 % | 36.1 % | 35.4 % | 29.2 % | 29.0 % | 39.4 % | 40.7 % | 65.1 % |
| Dresser | 14.9 % | 18.6 % | 20.7 % | 16.7 % | 24.6 % | 27.4 % | 29.7 % | 48.4 % | 66.1 % |
| Refrigerator | 37.1 % | 21.4 % | 50.0 % | 37.7 % | 28.6 % | 47.1 % | 39.7 % | 51.7 % | 81.6 % |
| Sofa | 15.9 % | 31.0 % | 42.4 % | 28.9 % | 37.0 % | 54.9 % | 56.1 % | 51.8 % | 78.2 % |
| Average | 24.6 % | 25.6 % | 37.1 % | 31.9 % | 31.9 % | 38.2 % | 41.5 % | 47.0 % | 72.6 % |

Table 2: AMT user studies for the indoor objects experiment. Percentages represent how often the images in each category were classified as “real” by Turkers. We can see that our final model, ST-GAN (end-to-end), substantially improves geometric realism when averaged across all classes. Our realism performance improves with the number of warps trained as well as after the end-to-end fine-tuning. The ground truth numbers serve as a theoretical upper bound for all methods.

The rendering pipeline is shown in Fig. 5. During the process, we randomly sample an object from the candidate list, with an associated camera viewpoint. To emulate an occlusion-free compositing scenario, occlusions are automatically removed by detecting overlapping object masks. We render one image with the candidate object present (as the “real” sample) and one with it removed (as the background image). In addition, we perturb the 6-DoF camera pose and render the object with its mask (as the foreground image) for compositing. We thus obtain a rendered object as viewed from a different camera perspective; this simulates the image compositing task where the foreground and background perspectives mismatch. We note that a homography correction can only approximate these 3D perturbations, so there is no planar ground-truth warp to use for supervision. We report the statistics of our rendered dataset in Table 1. All images are rendered at resolution.


Similar to the prior work by Lin & Lucey [17], we train ST-GAN for $N = 4$ sequential warps. During adversarial training, we rescale the foreground object randomly from and augment the initial warp $p_0$ with a translation sampled from scaled by the image dimensions. We set for all methods.


One major advantage of ST-GAN is that it can learn from “realism” comparisons without ground-truth warp parameters for supervision. However, prior approaches require supervision directly on the warp parameters. Therefore, we compare against self-supervised approaches trained with random homography perturbations on foreground objects as input, yielding warp parameters as self-supervision. We reemphasize that such direct supervision is insufficient in this application as we aim to find the closest point on a manifold of realistic looking composites rather than fitting a specific paired model. Our baselines are (1) HomographyNet [5], a CNN-based approach that learns direct regression on the warp parameters, and (2) Supervised Descent Method (SDM) [34], which greedily learns the parameters through cascaded linear regression. We train the SDM baseline for 4 sequential warps as well.

Quantitative evaluation.

As with most image generation tasks where the goal is realism, there is no natural quantitative evaluation possible. Therefore, we carry out a perceptual study on Amazon Mechanical Turk (AMT) to assess geometric realism of the warped composites. We randomly chose 50 test images from each category and gathered data from 225 participants. Each participant was shown a composite image from a randomly selected algorithm (Table 2), and was asked whether they saw any objects whose shape does not look natural in the presented image.

We report the AMT assessment results in Table 2. On average, ST-GAN shows a large improvement of geometric realism, and quality improves over the sequential warps. When considering that the warp is restricted to homography transformations, these results are promising, as we are not correcting for more complicated view synthesis effects for out-of-plane rotations such as occlusions. Additionally, ST-GAN, which does not require ground truth warp parameters during training, greatly outperforms other baselines, while SDM yields no improvement and HomographyNet increases realism, but to a lesser degree.

Ablation studies.

We found that learning iterative warps is advantageous: compared with a non-iterative version with the same training iterations (non-seq. in Table 2), ST-GAN (with multiple generators) approaches geometric realism more effectively with iterative warp updates. In addition, we trained an iterative HomographyNet [5] using the same sequential training strategy as ST-GAN but found little visual improvement over the non-iterative version; we thus focus our comparison against the original [5].

Figure 6: Qualitative evaluation on the indoor rendering test set. Compared to the baselines trained with direct homography supervision, ST-GAN creates more realistic composites. We find that ST-GAN is able to learn common object-room relationships in the dataset, such as beds being against walls. Note that ST-GAN corrects the perspectives but not necessarily the scale, as objects often exist at multiple scales in the real data. We observe that ST-GAN occasionally performs worse for unusual objects (e.g. with peculiar colors, last column).
Figure 7: Visualization of iterative updates in ST-GAN, where objects make gradual improvements that reach closer to realism in an incremental fashion.
Figure 8: Dragging and snapping. (a) When an object is dragged across the scene, the perspective changes with the composite location to match that of the camera. (b) ST-GAN “snaps” objects to where they would frequently be composited (e.g. a bookshelf is usually placed against the wall).

Qualitative evaluation.

We present qualitative results in Fig. 6. ST-GAN visually outperforms both baselines trained with direct homography parameter supervision, which is also reflected in the AMT assessment results. Fig. 7 shows how ST-GAN updates the homography warp with each of its generators; we see that it learns gradual updates that make a realism improvement at each step. In addition, we illustrate in Fig. 8 the effects ST-GAN learns, including gradual changes of the object perspective at different composite locations inside the room, as well as a “snapping” effect that predicts a most likely composite location given a neighborhood of initial locations. These features are automatically learned from the data, and they can be useful when implemented in interactive settings.

Figure 9: Real world high-resolution test results. Here we show our method applied to real images. The inputs are scaled down and fed to the network and then the warp parameters are applied at full resolution.

Finally, to test whether ST-GAN extends to real images, we provide a qualitative evaluation on photographic, high-resolution test images gathered from the Internet and manually masked (Fig. 9). This is feasible since the warp parameters predicted from the low-resolution network input are transferable to high-resolution images. As a consequence, ST-GAN is indirectly applicable to various image resolutions and not strictly limited in resolution as conventional GAN frameworks are. Our results demonstrate the use of ST-GAN for high-quality image generation and editing.
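One way to picture why predicted warps transfer across resolutions is to re-express a homography estimated in low-resolution pixel coordinates in high-resolution pixel coordinates by conjugating it with a scaling matrix. The sketch below is an illustration under assumptions and not the authors' code: the helper names, the example matrix, and the image sizes are all hypothetical, and pixel coordinates are assumed to scale linearly with resolution (half-pixel offsets are ignored).

```python
import numpy as np

def rescale_homography(H_low, low_size, high_size):
    """Re-express a low-resolution homography in high-resolution pixel coordinates.

    low_size, high_size: (width, height) tuples.
    """
    sx = high_size[0] / low_size[0]
    sy = high_size[1] / low_size[1]
    S = np.diag([sx, sy, 1.0])                 # maps low-res coordinates to high-res
    return S @ H_low @ np.linalg.inv(S)

def apply_h(H, x, y):
    """Apply a homography to a 2D point and dehomogenize."""
    v = H @ np.array([x, y, 1.0])
    return v[:2] / v[2]

# Hypothetical homography and image sizes, for illustration only.
H_low = np.array([[1.0, 0.1, 5.0],
                  [0.0, 1.0, -3.0],
                  [0.001, 0.0, 1.0]])
H_high = rescale_homography(H_low, low_size=(200, 150), high_size=(2000, 1500))

low_point = apply_h(H_low, 20.0, 30.0)
high_point = apply_h(H_high, 200.0, 300.0)     # the same point at 10x resolution
```

Here `high_point` equals `low_point` scaled by the resolution ratio, so the same geometric correction is applied to the full-resolution image without running the network at that resolution.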

4.3 Glasses

Finally, we demonstrate results in an entirely unpaired setting where we learn warping corrections for compositing glasses on human faces. The lack of paired data means that we do not necessarily have pictures of the same people both with and without glasses (ground truth).

Figure 10: The split of CelebA for the background and the real images, as well as the crafted glasses as the foreground.

Data preparation.

We use the CelebA dataset [20] and follow the provided training/test split. We then use the “eyeglasses” annotation to separate the training set into two groups. The first group of people with glasses serves as the real data to be matched against in our adversarial settings, and the group of people without glasses serves as the background. This results in 152249 training and 18673 test images without glasses, and 10521 training images with glasses. We hand-crafted 10 pairs of frontal-facing glasses as the foreground source (Fig. 10). We note that there are no annotations about where or how the faces are placed, and we do not have any information about where the different parts of the glasses are in the foreground images.

In this experiment, we train ST-GAN with sequential warps. We crop the aligned faces into images and resize the glasses to widths of pixels initialized at the center. During training, we add geometric data augmentation by randomly perturbing the faces with random similarity transformations and the glasses with random homographies.

Figure 11: Glasses compositing results. (a) The glasses progressively move into a more realistic position. (b) ST-GAN learns to warp various kinds of glasses such that the resulting positions are usually realistic. The top rows indicate the initial composite, and the bottom rows indicate the ST-GAN output. The last 4 examples show failure cases, where glasses fail to converge onto the faces.


The results are shown in Fig. 11. As with the previous experiments, ST-GAN learns to warp the foreground glasses in a gradual fashion that improves upon realism at each step. We find that our method can correctly align glasses onto the people’s faces, even with a certain amount of in-plane rotations. However, ST-GAN does a poorer job on faces with too much out-of-plane rotation.

While such an effect is possible to achieve by taking advantage of facial landmarks, our results are encouraging as no information was given about the structure of either domain, and we only had access to unpaired images of people with and without glasses. Nonetheless, ST-GAN was able to learn a realism manifold that drove the Spatial Transformer generators. We believe this demonstrates great potential to extend ST-GANs to other image alignment tasks where acquiring paired data is very challenging.

5 Conclusion

We have introduced ST-GANs as a class of methods to model geometric realism. We have demonstrated the potential of ST-GANs on the task of image compositing, showing improved realism in a large-scale rendered dataset, and results on fully unpaired real-world image data. It is our hope that this work will open up new avenues for the research community to continue exploring in this direction.

Despite the encouraging results ST-GAN achieves, there are still some limitations. We find that ST-GAN suffers more when presented with imbalanced data, particularly rare examples (e.g. white, thick-framed glasses in the glasses experiment). In addition, we find that ST-GAN fails to converge under more extreme translation or in-plane rotation of objects. We believe a future analysis of the convergence properties of classical image alignment methods with GAN frameworks is worthy of investigation in improving the robustness of ST-GANs.



A.1. Indoor Object Experiment: Rendering Details

We describe additional details regarding the rendering of the SUNCG dataset [29] for our experiment. In addition to Mitsuba [12] for rendering photo-realistic textures, we also utilize the OpenGL toolbox provided by Song et al. [29], which supports rendering of instance segmentation.

Candidate object selection.

For each of the provided camera viewpoints from Zhang et al. [38], we render an instance segmentation of all objects visible in the camera viewpoint. For each of these objects, we also separately render a binary object mask by removing all other existing objects (including the floor/ceiling/walls).

We use this information to exclude objects that are not ideal for our compositing experiment, including those that are too tiny or only partially visible in the camera view. We therefore include in the candidate selection list only objects that match the following criteria:

  • The entire object mask is visible within the camera.

  • The object mask occupies at least 10% of all pixels.

  • At least 50% of the object mask is visible within the instance segmentation mask.

  • The object belongs to one of the NYUv2 [28] categories of refrigerators, desks, bookshelves, cabinets, beds, dressers, sofas, or chairs.
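The selection criteria above can be sketched as a simple mask-based filter. This is only an illustrative sketch, not the authors' code: the function name `is_candidate`, the category strings, and the use of "mask touches the image border" as a proxy for "not entirely visible within the camera" are all assumptions made for this example.

```python
import numpy as np

# Hypothetical category strings standing in for the NYUv2 labels listed above.
ALLOWED_CATEGORIES = {"refrigerator", "desk", "bookshelf", "cabinet",
                      "bed", "dresser", "sofa", "chair"}

def is_candidate(object_mask, instance_seg, object_id, category,
                 min_area_frac=0.10, min_visible_frac=0.50):
    """Return True if an object passes all four selection criteria.

    object_mask:  (H, W) binary mask of the object rendered alone.
    instance_seg: (H, W) integer instance labels of the full scene.
    """
    mask = object_mask.astype(bool)
    area = int(mask.sum())
    if area == 0:
        return False
    # Assumption: a mask touching the image border is cut off by the camera frame.
    if mask[0].any() or mask[-1].any() or mask[:, 0].any() or mask[:, -1].any():
        return False
    # The object mask must cover at least 10% of all pixels.
    if area < min_area_frac * mask.size:
        return False
    # At least 50% of the object mask must carry the object's own label
    # in the instance segmentation, i.e. must not be occluded.
    visible = int(np.logical_and(mask, instance_seg == object_id).sum())
    if visible < min_visible_frac * area:
        return False
    return category in ALLOWED_CATEGORIES
```

The thresholds are exposed as keyword arguments so that the 10% and 50% values from the list above remain easy to vary.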

Occlusion removal.

For every object in the candidate list, we remove the occluding objects (from the associated camera viewpoint) by overlaying the object mask onto the instance segmentation mask. Any overlapped pixel with a different instance label is attributed to an occluding object. Since there may be “hidden” occluders that were themselves occluded in the first pass, we repeat the same process after removing the initially detected occluders to reveal the remaining ones. This is repeated until no occluding objects remain w.r.t. the candidate object.

In order to create a cleaner space for compositing objects, we also use a “thicker” object mask for the above removal procedure. To achieve this, we dilate the object mask with an all-ones kernel 10 times (i.e. “thicken” the object mask by 10 pixels).
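The iterative occlusion removal with a thickened mask might look roughly like the sketch below. Several parts are assumptions rather than details from the paper: the kernel is taken to be 3×3 (consistent with growing the mask by one pixel per pass), `render_instance_seg` is a hypothetical callback that re-renders the instance segmentation with a given set of objects removed, and `ignore_ids` is a hypothetical way to keep labels such as the floor or walls from being treated as occluders.

```python
import numpy as np

def dilate(mask, iterations=10):
    """Binary dilation with an all-ones 3x3 kernel (assumed size), repeated."""
    out = mask.astype(bool)
    h, w = out.shape
    for _ in range(iterations):
        padded = np.pad(out, 1, mode="constant", constant_values=False)
        grown = np.zeros_like(out)
        # OR together the 9 shifted copies: a pixel is set if any neighbor is set.
        for dy in range(3):
            for dx in range(3):
                grown |= padded[dy:dy + h, dx:dx + w]
        out = grown
    return out

def remove_occlusions(object_mask, object_id, render_instance_seg,
                      ignore_ids=(0,), dilate_iters=10):
    """Return the set of instance ids that must be removed from the scene.

    render_instance_seg(removed) is a hypothetical callback returning the
    (H, W) instance segmentation with the ids in `removed` taken out.
    """
    thick = dilate(object_mask, dilate_iters)
    removed = set()
    while True:
        seg = render_instance_seg(removed)
        labels = {int(l) for l in np.unique(seg[thick])}
        occluders = labels - {object_id} - set(ignore_ids) - removed
        if not occluders:
            return removed  # no occluders left w.r.t. the candidate object
        removed |= occluders  # removing these may reveal hidden occluders
```

Each pass removes the occluders that are currently visible, so occluders hidden behind other occluders are picked up on later passes.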

Camera perturbation.

For each of the provided camera viewpoints, we generate a camera perturbation by adding a random 3D-translation sampled from in the forward-backward direction, one sampled from in the left-right direction (both scaled in meters as defined in the dataset), and a random azimuth rotation sampled from (degrees).

After generating a camera perturbation, we perform the same occlusion removal process described above to ensure that the object is whole from the perturbed perspective. The candidate object rendered from the perturbed view serves as the foreground source for our experiment. However, if the object becomes only partially visible or not visible at all, the rendering is discarded.


We use Mitsuba to render realistic textures and the OpenGL toolbox to render object masks at followed by downscaling for anti-aliasing.

A.2. Warp Parameterization Details

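We follow Mei et al. [23] to parameterize homography with the $\mathfrak{sl}(3)$ Lie algebra. Given a warp parameter vector $\mathbf{p} = [p_1, \dots, p_8]^\top$, the transformation matrix $\mathbf{H}$ can be written as

$$\mathbf{H}(\mathbf{p}) = \exp\left(\mathbf{X}(\mathbf{p})\right), \qquad \mathbf{X}(\mathbf{p}) = \begin{bmatrix} p_1 & p_2 & p_3 \\ p_4 & -p_1 - p_8 & p_5 \\ p_6 & p_7 & p_8 \end{bmatrix},$$

where $\exp$ is the exponential map (i.e. matrix exponential). $\mathbf{H}$ is the identity transformation when $\mathbf{p}$ is an all-zeros vector. Warp composition can thus be expressed as the addition of parameters, i.e. $\mathbf{p}_a \circ \mathbf{p}_b = \mathbf{p}_a + \mathbf{p}_b$; furthermore, $\det(\mathbf{H}) = 1$.

The exponential map is also Taylor-expandable as

$$\mathbf{H}(\mathbf{p}) = \exp\left(\mathbf{X}(\mathbf{p})\right) = \lim_{K \to \infty} \sum_{k=0}^{K} \frac{\mathbf{X}^k(\mathbf{p})}{k!}.$$

We implement the parameterization using the Taylor approximation expression with a finite number of terms $K$.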
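A truncated Taylor series for the matrix exponential can be sketched in a few lines of NumPy. The arrangement of the eight parameters inside the trace-free 3×3 matrix and the default truncation order `K=20` are assumptions chosen for illustration; the function name `sl3_to_homography` is likewise hypothetical.

```python
import numpy as np

def sl3_to_homography(p, K=20):
    """Map 8 warp parameters to a 3x3 homography via a truncated Taylor series.

    The entry layout below is an assumed convention; any trace-free
    arrangement of the 8 parameters gives a matrix with determinant 1.
    K is an assumed truncation order.
    """
    p1, p2, p3, p4, p5, p6, p7, p8 = p
    X = np.array([[p1, p2, p3],
                  [p4, -p1 - p8, p5],
                  [p6, p7, p8]], dtype=float)
    H = np.eye(3)
    term = np.eye(3)
    for k in range(1, K + 1):
        term = term @ X / k  # term now holds X^k / k!
        H = H + term
    return H
```

Because the generator matrix has zero trace, the determinant of its exponential is 1, which the truncated series reproduces closely for small parameter values.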

A.3. Training Details

We set the batch size to 20 for all experiments. Unless otherwise specified, we initialize all learnable weights in the networks from and all biases to be 0. All deep learning approaches are trained with Adam optimization [15]. We set following Gulrajani et al. [8].

We describe settings for specific experiments as follows.

3D cubes.

We create 4000 samples of 3D cube/room pairs with random colors, as described in the paper. For the initial warp , we generate random homography perturbations by sampling each element of from , i.e. . This is applied to a canonical frame with and coordinates normalized to and subsequently transformed back to the image frame. We train ST-GAN with 4 sequential warps, each for 50K iterations (with perturbations generated on the fly) with the learning rates for both and to be . We set in this experiment.
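Applying a warp in a canonical frame and transforming it back to the image frame amounts to conjugating the homography with a coordinate normalization matrix. The sketch below assumes pixel coordinates are normalized to [-1, 1] in both directions; that range and all function names are assumptions made for this example.

```python
import numpy as np

def pixel_to_canonical(width, height):
    """Matrix mapping pixel coordinates in [0, W] x [0, H] to [-1, 1] x [-1, 1]
    (assumed normalization)."""
    return np.array([[2.0 / width, 0.0, -1.0],
                     [0.0, 2.0 / height, -1.0],
                     [0.0, 0.0, 1.0]])

def warp_in_image_frame(H_canonical, width, height):
    """Express a canonical-frame homography in pixel coordinates."""
    T = pixel_to_canonical(width, height)
    # Normalize, warp in the canonical frame, then map back to pixels.
    return np.linalg.inv(T) @ H_canonical @ T

def apply_homography(H, xy):
    """Apply a 3x3 homography to a single 2D point."""
    x, y, w = H @ np.array([xy[0], xy[1], 1.0])
    return np.array([x / w, y / w])
```

Under this convention the same canonical warp can be reused at any resolution: a canonical x-translation of 0.5 shifts a point by a quarter of the image width whether the image is 100 or 200 pixels wide.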

Indoor objects.

For the self-supervised baselines (HomographyNet [5] and SDM [34]), we generate random homography perturbations using the same noise model as that from the 3D cubes experiment.

We train HomographyNet for 200K iterations (with perturbations generated on the fly) with a learning rate of . For SDM, we vectorize the grayscale images to be the feature as was practiced for image alignment [18]; in our case, we concatenate those of the background and masked foreground as the final extracted feature. We generate 750K perturbed examples (more than 10 perturbed examples per training sample) to train each linear regressor. Also as was practiced [34, 18], we add an regularization term to the SDM least-squares objective function and search for the penalty factor by evaluating on a separate validation set.
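Training one regularized linear regressor and selecting its penalty factor on a validation set could look like the following sketch. It assumes an L2 (ridge) penalty with a closed-form solution and a plain grid of candidate penalty values; the function names and the mean-squared-error selection criterion are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def fit_ridge(X, Y, lam):
    """Closed-form ridge regression: argmin_W ||XW - Y||^2 + lam * ||W||^2.

    X: (N, D) features, Y: (N, P) target warp updates. L2 penalty is assumed.
    """
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ Y)

def select_penalty(X_tr, Y_tr, X_val, Y_val, candidates):
    """Pick the penalty factor with the lowest validation error.

    Returns (best_lam, best_val_error, best_W).
    """
    best = None
    for lam in candidates:
        W = fit_ridge(X_tr, Y_tr, lam)
        err = float(np.mean((X_val @ W - Y_val) ** 2))
        if best is None or err < best[1]:
            best = (lam, err, W)
    return best
```

In a cascade such as SDM, this fit would be repeated for each regressor in turn, with the training warps updated by the previous regressor's predictions before the next fit.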

We initialize each of the ST-GAN generators with the pretrained HomographyNet as we find it to be better-conditioned. During adversarial training, we train each for 40K iterations with the learning rate for to be and that of to be . In the final end-to-end fine-tuning stage, we train all for 40K iterations using the same learning rates ( for all and for ). The non-sequential ST-GAN baseline is trained for 160K iterations with the same learning rates. We set in this experiment.


Glasses.

For data augmentation, we perturb the faces with random similarity transformations from for rotation (in radians) and for translation (scaled by the image dimensions, in both $x$ and $y$ directions). The glasses are perturbed using the same random homography noise model as used in the 3D cubes experiment.

We train ST-GAN with 5 sequential warps, each for 50K iterations with the learning rates for both and to be . As a preconditioning step, we also pretrain the discriminator using only the initial fake samples and real samples for 50K iterations with the same learning rate. We set in this experiment.

A.4. Additional Indoor Object Results

Figure 12: Additional qualitative results from the indoor object experiment (test set). The yellow arrows in the second row point to the composited foreground objects.

We include additional qualitative results from the indoor object experiment in Fig. 12. Compared to the baselines, ST-GAN predicts more realistic geometric corrections in most cases.

A.5. Additional Glasses Results

Figure 13: Additional qualitative results from the glasses experiment (test set). The top row indicates the initial composite, and the bottom row indicates the ST-GAN output.

We also include additional qualitative results from the glasses experiment in Fig. 13. We re-emphasize that the training data here is unpaired and that the dataset contains no information about where the glasses are placed. Despite this, ST-GAN is able to consistently match the initial glasses foreground to the background faces.