## 1 Introduction

Humans imagine far more than they see. In Figure 1 (b), we imagine the hidden arms and legs of the sitting woman. In Figure 1 (c), we imagine forward motion of the camera, as opposed to the road drifting backwards underneath the camera. We arrive at these interpretations – that is, predictions of latent factors – by referring to priors on how the world works.

This work proposes Adversarial Inverse Graphics Networks (AIGNs), a model that learns to map images to latent factors using feedback from the rendering of its predictions, as well as distribution matching between the predictions and a stored collection of ground truth latent factors. The ground-truth collection does not need to be directly related to the current input – it can be a disordered set of labels. We call this unpaired supervision. The renderers employed are differentiable, parameter-free, and task-specific, e.g., camera projection, camera motion, downsampling, and masking. Figure 2 depicts AIGN architectures for the tasks of structure from motion, 3D human pose estimation, super-resolution, and in-painting.

When we (deliberately) bias the ground truth collection of an AIGN to reflect a distorted reality, surprising renders arise. For super-resolution, AIGNs can make people look older, younger, more feminine, more masculine or more like Tom Cruise, by simply curating the discriminators’ ground truth to include images of old, young, female, male, or Tom Cruise pictures, respectively. For inpainting, AIGNs can make people appear with bigger lips or bigger noses, again by curating the discriminators’ ground truth to include faces with big lips or big noses. These observations inspire a compelling analogy: the AIGN’s ground truth collection is its memory, and the renders are its imagination. When the memories are biased, the imaginations reflect a distorted reality.

Our model is related to, and builds upon, many recent works in the literature. Inverse-graphics networks [DBLP:journals/corr/KulkarniWKT15] use parametric deconvolutional networks and strong supervision (i.e., annotation of images with their ground-truth imaginations) for decomposing an image into interpretable factors, e.g., albedo and shading. Our model instead employs parameter-free renderers, and makes use of unpaired weak supervision. Similar to our approach, 3D interpreter networks [Wu2016] use a reconstruction loss in the form of a 2D re-projection error of predicted 3D keypoints, along with paired supervised training for 3D human pose estimation. Our model complements the reconstruction loss with adversarial losses on the predicted 3D human poses, and shows performance superior to the 3D interpreter. Conditional Generative Adversarial Networks (GANs) have used a combination of adversarial and L2 losses, e.g., for inpainting [pathakCVPR16context], 3D voxel generation [3dgan], super-resolution [DBLP:journals/corr/DongLHT15], and image-to-image translation [DBLP:journals/corr/IsolaZZE16]. In these models, the adversarial loss is used to avert the regression-to-mean problem of standard L2 regression (blurring). Such models are feed-forward; they do not have a renderer and reconstruction feedback. As a result, they can only exploit supervision in the form of annotated pairs, e.g., of an RGB image and its corresponding 3D voxel grid [3dgan]. Our model extends such supervised conditional GAN formulations through self-consistency via inverse rendering, and matching between rendered predictions and the visual input, rather than between predictions and the ground truth. This feedback loop allows weakly supervised distribution matching to work (using unpaired annotations for the discriminators), removing the need for direct label matching.

As unsupervised models, AIGNs do not discriminate between training and test phases. Extensive experiments in a variety of tasks show that their performance consistently improves over their supervised alternatives, by adapting in a self-supervised manner to the statistics of the test set.

## 2 Adversarial Inverse Graphics Networks

AIGN architectures for various tasks are shown in Figure 2. Given an image or image pair $X$, generator networks $G$ map $X$ to a set of predictions $\hat{y} = G(X)$. A task-specific differentiable renderer $P$ then renders the predictions back to the original input space. Discriminator networks $D_k$ are trained to discriminate between predictions and true memories $\bar{y}$ of the appropriate form: discriminators must assign high probability to true memories and low probability to the generators' predictions. Given a set of images (or image pairs, depending on the task) $\{X_i\}$, generators are trained to minimize the L2 distance between rendered imaginations and the input, and simultaneously maximize the discriminators' confusion. Our loss then reads:

$$\min_{G}\max_{D}\; \mathcal{L}(G, D) = \sum_{i} \big\| P(G(X_i)) - X_i \big\|_2^2 + \beta \sum_{k} \Big( \log D_k(\bar{y}) + \log\big(1 - D_k(G(X_i))\big) \Big) \qquad (1)$$

where $D$ denotes the set of discriminator networks, $G$ the set of generator networks, and $\beta$ the relative weight of the reconstruction and adversarial losses. Since paired ground truth is not used anywhere, both the reconstruction and adversarial losses can be used at both training and test time. Our model can also benefit from strong supervision through paired annotations – pairs of visual input with desired predictions – at training time, for training the generator networks. We use the term adversarial priors to denote adversarial losses over the latent factors predicted by our model.
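The combined objective can be sketched per training sample. The following Python is an illustrative sketch only: the function and argument names are our own, and `render` and `disc` stand in for the task-specific parameter-free renderer and one discriminator.

```python
import numpy as np

def aign_losses(x, y_hat, y_real, render, disc, beta=0.01):
    """One-sample sketch of the AIGN objective (Eq. 1).

    x      : input image (np.ndarray)
    y_hat  : generator's predicted latent factors
    y_real : an unpaired ground-truth sample from the memory collection
    render : parameter-free differentiable renderer, latents -> input space
    disc   : discriminator, latents -> probability of being "real"
    beta   : relative weight of the adversarial term
    """
    # Reconstruction: the rendered prediction should match the input.
    rec = np.sum((render(y_hat) - x) ** 2)
    # Generator adversarial term: fool the discriminator on the prediction.
    gen_adv = -np.log(disc(y_hat) + 1e-8)
    # Discriminator term: score memories high, predictions low.
    disc_loss = -np.log(disc(y_real) + 1e-8) - np.log(1.0 - disc(y_hat) + 1e-8)
    return rec + beta * gen_adv, disc_loss
```

In practice the generator and discriminator losses are minimized in alternation, as in standard GAN training; here both are simply returned for one sample.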

In the next sections, we present applications of AIGNs to the tasks of (i) 3D human pose estimation, (ii) extraction of 3D depth and egomotion from a pair of frames, which we call Structure from Motion (SfM), (iii) super-resolution, and (iv) inpainting. For application of AIGNs to intrinsic image decomposition, please see our earlier unpublished manuscript [inversionearlier].

### 2.1 3D Human Pose Estimation

Recent works use deep neural networks and large supervised training sets [h36m_pami] and learn to regress to 3D poses directly, given an RGB image [DBLP:journals/corr/PavlakosZDD16]. Many have explored 2D body pose as an intermediate representation [DBLP:journals/corr/ChenR16a, Wu2016, DBLP:journals/corr/TomeRA17], or as an auxiliary task in a multi-task setting [DBLP:journals/corr/TomeRA17], where the abundance of labelled 2D pose training examples helps feature learning and complements limited 3D supervision. Yasin et al. [Yasin_Iqbal_CVPR2016] explored unpaired supervision between 2D and 3D keypoint annotations by pretraining a low-rank Gaussian model from 3D annotations as a prior for plausible 3D poses. Instead of fitting 3D poses to a predefined probabilistic model and having a separate pretraining stage, our model learns such priors directly from data, and co-trains the networks responsible for the priors and the predictions.

The AIGN architecture for 3D human pose estimation is depicted in Figure 2 (a). Given an image crop centered around a person detection, the task is to predict a 3D skeleton for the depicted person. We decompose the 3D human shape into a view-aligned 3D linear basis model and a rotation matrix: $X^{3D} = R \cdot (B \hat{w})$. Our shape basis $B$ is obtained using PCA on orientation-aligned 3D poses in our training set, where orientation is measured by the direction of the normal to the hip vector. We keep 60 components out of a total of 96 (three coordinates for each of 32 keypoints). The dimensionality reduction is deliberately small: we use basis weights merely for ease of prediction, relying on our adversarial priors (rather than PCA) to regularize the 3D shape prediction.
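As an illustration of the shape-basis construction, the snippet below builds a 60-component PCA basis from 96-D pose vectors via SVD. The data here are random placeholders standing in for orientation-aligned H3.6M poses; all names are ours.

```python
import numpy as np

# Hypothetical stand-in for aligned training poses: 500 samples of
# 32 keypoints x 3 coordinates = 96-D vectors (random, for illustration).
rng = np.random.default_rng(0)
poses = rng.standard_normal((500, 96))

mean = poses.mean(axis=0)
# SVD of the centered data gives the principal directions in vt's rows.
_, _, vt = np.linalg.svd(poses - mean, full_matrices=False)
basis = vt[:60]                        # (60, 96) shape basis B

def decode(weights):
    """Reconstruct a 96-D pose vector from 60 basis weights."""
    return mean + weights @ basis

# Project one pose onto the basis and reconstruct it.
weights = (poses[0] - mean) @ basis.T
recon = decode(weights)
```

Because only 60 of 96 components are kept, the reconstruction is approximate; the residual is what the adversarial priors, rather than PCA, are asked to regularize.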

Our generator network first predicts a set of 2D body-joint heatmaps, and then maps those to basis weights $\hat{w}$, focal length $f$, and Euler rotation angles $(\theta_x, \theta_y, \theta_z)$, so that the 3D rotation of the shape reads $R = R_x(\theta_x) R_y(\theta_y) R_z(\theta_z)$, where $R_x(\theta_x)$ denotes rotation around the x-axis by angle $\theta_x$. Our renderer then projects the 3D keypoints $X^{3D} = R \cdot (B \hat{w})$ to 2D keypoints $\hat{x}^{2D}$, all in homogeneous coordinates:

$$\hat{x}^{2D} = \Pi \, X^{3D} \qquad (2)$$

where $\Pi$ is the camera projection matrix determined by the predicted focal length $f$. The reconstruction loss is the L2 norm between the reprojected coordinates $\hat{x}^{2D}$ and the 2D coordinates obtained by the argmax of the predicted 2D heatmaps. A discriminator network discriminates between our generated 3D keypoints and a database of 3D human skeletons, which does not contain the ground-truth (paired) 3D skeleton for the input image, but rather previously seen poses (i.e., "memories").
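A minimal sketch of this projection renderer is shown below, assuming a simplified pinhole camera in which a fixed depth offset stands in for the predicted translation; the offset value and function names are our own illustrative choices, not the paper's exact parameterization.

```python
import numpy as np

def rot_x(a):
    """Rotation about the x-axis by angle a (radians)."""
    c, s = np.cos(a), np.sin(a)
    return np.array([[1, 0, 0], [0, c, -s], [0, s, c]])

def rot_y(a):
    c, s = np.cos(a), np.sin(a)
    return np.array([[c, 0, s], [0, 1, 0], [-s, 0, c]])

def rot_z(a):
    c, s = np.cos(a), np.sin(a)
    return np.array([[c, -s, 0], [s, c, 0], [0, 0, 1]])

def project(points3d, angles, f, depth_offset=5.0):
    """Render (N, 3) 3D keypoints to (N, 2) pixel coordinates via an
    Euler rotation followed by perspective projection."""
    R = rot_x(angles[0]) @ rot_y(angles[1]) @ rot_z(angles[2])
    X = points3d @ R.T
    z = X[:, 2] + depth_offset        # push the shape in front of the camera
    return f * X[:, :2] / z[:, None]
```

Every operation here is differentiable in the predictions (angles, focal length, basis weights), which is what lets the reprojection error train the generator.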

### 2.2 Structure from Motion

Simultaneous Localization And Mapping (SLAM) methods have shown impressive results on estimating camera pose and 3D point clouds from monocular, stereo, or RGB-D video sequences [schoeps14ismar, kerl13iros], using alternating optimization over camera poses and the 3D scene point cloud. There has been recent interest in integrating learning to help geometric approaches handle moving objects and low-texture image regions. The work of Handa et al. [handa2016gvnn] contributed a deep learning library with many geometric operations, including a differentiable camera projection layer, similar to those used in SfM networks [sfmnet, tinghuisfm], 3D image interpreters [interpreter], and depth-from-stereo CNNs [garg2016unsupervised]. Our SfM AIGN architecture, depicted in Figure 2 (b), builds upon those works. Given a pair of consecutive frames, the task is to predict the relative camera motion and a depth map for the first frame.

We have two generator networks: an egomotion network and a depth network. The egomotion network takes the following inputs: a concatenation of the two consecutive frames $(I_1, I_2)$, their 2D optical flow estimated using our implementation of FlowNet [flownet], the angles of the optical flow, and a static angle field. The angle field gives each pixel's angle from the camera's principal point (in radians), while the flow angles denote the angular component of the optical flow. The egomotion network produces the camera's 3D relative rotation and translation $(R, t)$ as output. We represent the relative 3D rotation using an Euler angle representation, $R = R_x(\theta_x) R_y(\theta_y) R_z(\theta_z)$, where $R_x(\theta_x)$ denotes rotation around the x-axis by the angle $\theta_x$. The egomotion network is a standard convolutional architecture (similar to VGG16 [simonyan2014very]), but following the last convolutional layer there is a single fully-connected layer producing a 6D vector: the Euler angles of the rotation matrix and the translation vector.

The depth network takes as input the first frame $I_1$ and estimates the 3D scene log depth at every pixel, $d_p$. The architecture of this network is the same as that of the egomotion network, except that deconvolution layers (with skip connections) replace the fully-connected layer, producing a depth estimate at every pixel. Given the generated depth for the first frame and known camera intrinsics – focal length $f$ and principal point $(c_x, c_y)$ – we obtain the corresponding 3D point cloud for $I_1$: $X_p = d_p \left( \frac{u_p - c_x}{f}, \frac{v_p - c_y}{f}, 1 \right)^{\top}$, where $(u_p, v_p)$ are the column and row coordinates and $d_p$ the predicted depth of pixel $p$ in frame $I_1$. Our renderer in this task transforms the point cloud according to the estimated camera motion, $X'_p = R X_p + t$ (similar to SE3-Nets [SE3-Nets] and the recent SfM-Nets [sfmnet, tinghuisfm]), and projects the 3D points back to image pixels using camera projection.

The optical flow vector of pixel $p$ is then $w_p = \pi(X'_p) - p$, where $\pi$ denotes camera projection. Our reconstruction loss corresponds to the well-known brightness constancy principle, that pixel appearance does not change under small pixel motion:

$$\mathcal{L}_{\text{rec}} = \sum_{p} \big\| I_2(p + w_p) - I_1(p) \big\|_2^2 \qquad (3)$$
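The unproject–transform–reproject pipeline that produces the flow field can be sketched as follows, assuming a simplified pinhole model; the variable and function names are ours.

```python
import numpy as np

def sfm_flow(depth, R, t, f, cx, cy):
    """Sketch of the SfM renderer: unproject per-pixel depth to a point
    cloud, apply the predicted camera motion (R, t), reproject, and
    return the induced 2D flow field as an (h, w, 2) array."""
    h, w = depth.shape
    v, u = np.mgrid[0:h, 0:w].astype(float)
    # Pinhole unprojection of every pixel to a 3D point.
    X = np.stack([(u - cx) / f, (v - cy) / f, np.ones_like(u)],
                 axis=-1) * depth[..., None]
    # Rigid transform by the predicted egomotion.
    Xt = X @ R.T + t
    # Perspective reprojection back to pixel coordinates.
    u2 = f * Xt[..., 0] / Xt[..., 2] + cx
    v2 = f * Xt[..., 1] / Xt[..., 2] + cy
    return np.stack([u2 - u, v2 - v], axis=-1)
```

With identity rotation and zero translation the flow is exactly zero, which is a useful sanity check on any such renderer implementation.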

We have two discriminator networks. The first, $D_{\text{cam}}$, manages the realism of predicted camera motions by enforcing physical constraints in a statistical manner: when the camera is fixed to the top of a car, the roll and pitch angles are close to zero due to the nature of the car's motion. This network has three fully-connected layers (of size 128, 128, and 64), and discriminates between real and fake relative transformation matrices. The second discriminator, $D_{\text{depth}}$, manages the realism of 3D depth predictions, using a collection of ground-truth depth maps from unrelated sequences. This discriminator is fully convolutional; that is, it has one probabilistic output for each image sub-region, since we are interested in the realism of local depth texture. We found the depth discriminator to effectively handle the scale ambiguity of monocular 3D reconstruction: in a monocular setup without priors, the re-projection error cannot tell whether an object is far and big, or near and small. The depth discriminator effectively imposes a prior on world scale, as well as depth realism on fine-grained details.

### 2.3 Super-resolution

Deep neural network architectures have recently shown excellent performance on the tasks of super-resolution [DBLP:journals/corr/DongLHT15, DBLP:journals/corr/BrunaSL15] and non-blind inpainting [pathakCVPR16context, DBLP:journals/corr/YehCLHD16] (inpainting where the mask is always at the same part of the image). Adversarial losses have been used to resolve the regression-to-the-mean problem of the standard L2 loss [DBLP:journals/corr/LedigTHCATTWS16, pathakCVPR16context]. Our model unifies the recent works of [sonderby2014apparent] and [DBLP:journals/corr/YehCLHD16], which combine adversarial and reconstruction losses for super-resolution and inpainting, respectively, both without paired supervision, as in our model. However, weak (unpaired) supervision might be unnecessary for super-resolution or inpainting, as an unlimited amount of ground-truth pairs can easily be collected by downsampling or masking images. Our AIGN focuses instead on biased super-resolution and inpainting, as a tool for creative and controllable image generation.

The AIGN architecture for super-resolution is depicted in Figure 2 (c). Our generator network takes as input a low-resolution image $X$ and produces a high-resolution image $\hat{Y}$ after a number of residual blocks [he2016deep]. Our renderer is a downsampler that reduces the size of the output image by four times; we implement it using average pooling with the appropriate stride. Our reconstruction loss is the L2 distance between the input image $X$ and the rendered imagination $P(\hat{Y})$. Our discriminator takes as input high-resolution predicted images, as well as memories of high-resolution images unrelated to the current input. Thus far, our model is similar to the concurrent work of Sønderby et al. [sonderby2014apparent].

Unsupervised super-resolution networks [sonderby2014apparent] may not be necessary, since large supervised training datasets of low- and high-resolution image pairs can be collected by Gaussian blurring and downsampling. We instead focus on what we call biased super-resolution for face images. We bias our discriminator's ground-truth memories to contain high-resolution images of a particular image category, e.g., female, male, young, or old faces, or faces of a particular individual. AIGNs then mix the low frequencies of the input image (preserved via our reconstruction loss) with high frequencies from the memories (imposed by the adversarial losses); the relative weight between the reconstruction and adversarial losses in Eq. 1 controls this mixing. The result is realistic-looking faces whose age, gender, or identity has been transformed, as shown in Figure 5. Such facial transformations are completely unsupervised; the model has never seen a pair of the same person old and young (or male and female).
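The downsampling renderer amounts to stride-4 average pooling, which can be written in a few lines of numpy (the function name is ours, and a single-channel image is assumed for brevity):

```python
import numpy as np

def downsample4(img):
    """The super-resolution 'renderer': 4x average-pool downsampling.
    img is an (H, W) array with H and W divisible by 4."""
    h, w = img.shape
    # Split into non-overlapping 4x4 blocks and average each block.
    return img.reshape(h // 4, 4, w // 4, 4).mean(axis=(1, 3))
```

The reconstruction loss then compares `downsample4(y_hat)` against the low-resolution input, so only the low frequencies of the generator's output are pinned down; the high frequencies are free to be shaped by the adversarial term.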

| | Direct | Discuss | Eat | Greet | Phone | Photo | Pose | Purchase | Sit | SitDown | Smoke | Wait | Walk | Average |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Forward2Dto3D | 75.2 | 118.4 | 165.7 | 95.9 | 149.1 | 154.1 | 77.7 | 176.9 | 186.5 | 193.7 | 142.7 | 99.8 | 74.7 | 128.9 |
| 3Dinterpr [Wu2016] | 56.3 | 77.5 | 96.2 | 71.6 | 96.3 | 106.7 | 59.1 | 109.2 | 111.9 | 111.9 | 124.2 | 93.3 | 58.0 | 88.6 |
| Monocap [DBLP:journals/corr/ZhouZPLDD17] | 78.0 | 78.9 | 88.1 | 93.9 | 102.1 | 115.7 | 71.0 | 90.6 | 121.0 | 118.2 | 102.5 | 82.6 | 75.62 | 92.3 |
| AIGN (ours) | 53.7 | 71.5 | 82.3 | 58.6 | 86.9 | 98.4 | 57.6 | 104.2 | 100.0 | 112.5 | 83.3 | 68.9 | 57.0 | 79.0 |

| | Direct | Discuss | Eat | Greet | Phone | Photo | Pose | Purchase | Sit | SitDown | Smoke | Wait | Walk | Average |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Forward2Dto3D | 80.2 | 92.4 | 102.8 | 92.5 | 115.5 | 79.9 | 119.5 | 136.7 | 136.7 | 144.4 | 109.3 | 94.2 | 80.2 | 104.6 |
| 3Dinterpr [Wu2016] | 78.6 | 90.8 | 92.5 | 89.4 | 108.9 | 112.4 | 77.1 | 106.7 | 127.4 | 139.0 | 103.4 | 91.4 | 79.1 | 98.4 |
| AIGN (ours) | 77.6 | 91.4 | 89.9 | 88 | 107.3 | 110.1 | 75.9 | 107.5 | 124.2 | 137.8 | 102.2 | 90.3 | 78.6 | 97.2 |

### 2.4 Inpainting

The AIGN architecture for inpainting is depicted in Figure 2 (d). The input is a "masked" image $X = M \odot Y$, that is, an image $Y$ whose content is covered by a black contiguous mask $M$. Our generator produces a complete (inpainted) image $\hat{Y}$. The rendering function in this case is a masking operation, $P(\hat{Y}) = M \odot \hat{Y}$, where $\odot$ denotes pointwise multiplication. Our discriminator receives inpainted imaginations $\hat{Y}$ and memories of complete face images, unrelated to our current input images. Our model is then trained to predict complete, inpainted images $\hat{Y}$ that, when masked, match the input image $X$. Thus far, our model is similar to [DBLP:journals/corr/YehCLHD16].
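The masking renderer and its reconstruction loss are simple enough to state directly; in this sketch (names ours) `mask` is 1 on observed pixels and 0 on the hole, so the loss is blind to whatever the generator paints inside the hole:

```python
import numpy as np

def mask_render(y_hat, mask):
    """Inpainting 'renderer': re-apply the input mask to the generated image."""
    return y_hat * mask            # pointwise multiplication

def reconstruction_loss(y_hat, x, mask):
    """L2 distance between the masked imagination and the masked input x."""
    return np.sum((mask_render(y_hat, mask) - x) ** 2)
```

Because the loss vanishes whenever the generator reproduces the observed pixels, the content inside the hole is constrained only by the discriminator, which is exactly what makes biased inpainting possible.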

Unsupervised inpainting networks [DBLP:journals/corr/YehCLHD16] may not be necessary, since large supervised training datasets of paired masked and complete images can be collected via image masking. We instead focus on biased inpainting of face images. We bias the discriminator's ground-truth memories to contain complete images with a particular desired characteristic in the location of the mask $M$. For example, if the mask covers the mouth of a person, we can bias the discriminator's ground-truth memories to contain only people with big lips. In this case, our generator will produce inpainted images $\hat{Y}$ that have this localized characteristic in a smooth, photorealistic blend, in order to confuse the discriminator.

For further details on the proposed architectures please see the supplementary material.

## 3 Experiments

### 3.1 3D human pose estimation from a static image

We use the Human3.6M (H3.6M) dataset of Ionescu et al. [h36m_pami], the largest available dataset with annotated 3D human poses. This dataset contains videos of actors performing activities and provides annotations of body joint locations in 2D and 3D at every frame, recorded with a Vicon system. We split the videos by human subject, with five subjects (S1, S5, S6, S7, S8) for training and two subjects (S9, S11) for testing, following the split of previous works [DBLP:journals/corr/ZhouZPLDD17]. For both sets, we use one third of the original frame rate.

We consider a variety of supervision setups and baselines for our 3D human pose predictor, detailed below. We first train our network using synthetic data augmentation (Sup1), following the protocol of Wu et al. [Wu2016]: a 3D example skeleton is first perturbed, a 3D rotation and focal length are sampled, and the resulting rotated shape is projected to 2D. We further train our network using real paired 2D-to-3D training data from H3.6M (Sup2). We call the generator network trained with Sup1+Sup2 the Forward2Dto3D net, as it resembles a standard supervised model for 3D human pose estimation. We further finetune using a reconstruction loss (2D reprojection error) on the test data (Sup3). We call the generator network trained with Sup1+Sup2+Sup3 the 3D interpreter, due to its clear similarity to Wu et al. [Wu2016]. Since the original source code is not available, we re-implemented it and use it as one of our baselines. Our AIGN model, along with Sup1+Sup2+Sup3, uses unsupervised adversarial losses on the test data with randomly selected 3D training poses (Sup4). We compare AIGN to the 3D interpreter and Forward2Dto3D baselines in two setups for 3D lifting: (a) using ground-truth 2D body joints provided by H3.6M as input, and (b) using 2D body joints provided by the state-of-the-art Convolutional Pose Machine detector [wei2016cpm]. We used an SVM regressor to map the keypoint definitions of Wei et al. [wei2016cpm] to those of the H3.6M dataset. When using ground-truth 2D body joints as input, we also compare against the publicly available 3D pose code for MonoCap [DBLP:journals/corr/ZhouZPLDD17], an optimization method that uses a sparsity prior across an over-complete dictionary of 3D poses and minimizes the reprojection error via Expectation-Maximization. We use one image as input for all models for a fair comparison (MonoCap was originally proposed assuming a video sequence as input).

Evaluation metrics. Given a set of estimated 3D joint locations $\hat{X}_j$ and corresponding ground-truth 3D joint locations $X_j$, $j = 1, \dots, J$, the reconstruction error is defined as the 3D per-joint error after the torsos are aligned to face the front with a transformation $T$: $\mathrm{Err} = \frac{1}{J} \sum_{j=1}^{J} \| T(\hat{X}_j) - X_j \|_2$. We show the 3D reconstruction error (in millimeters) of our model and baselines in Tables 1 and 2, organized by activity, following the presentation format of Zhou et al. [DBLP:journals/corr/ZhouZPLDD17], though only one model was used across all activities for our method and baselines (for MonoCap this means using the same dictionary for optimization in all images).
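After alignment, the metric reduces to a mean of Euclidean distances over joints; a minimal sketch (assuming the alignment transform has already been applied, function name ours):

```python
import numpy as np

def mean_perjoint_error(pred, gt):
    """Mean 3D per-joint error between aligned predicted and ground-truth
    joint locations, both (J, 3) arrays; returns a scalar in the input units
    (millimeters for H3.6M)."""
    return np.mean(np.linalg.norm(pred - gt, axis=1))
```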

The AIGN outperforms the baselines, especially for ground-truth keypoints. This suggests it would be valuable to finetune 2D keypoint detector features as well, instead of keeping them fixed. Adversarial priors allow the model not to diverge when finetuned on new (unlabelled) data, as they ensure anthropomorphism and plausibility of the detected poses. For additional 3D pose results, please see the supplementary material.

### 3.2 Structure from Motion

We use Virtual KITTI (VKITTI) [Gaidon:Virtual:CVPR2016], a synthetic dataset that depicts videos taken from a camera mounted on a car driving in an urban center. This dataset contains scenes and camera viewpoints similar to the real KITTI dataset [Geiger2012CVPR]. We use the VKITTI dataset rather than KITTI, because the real-world dataset has large errors in the ground truth camera motion for short sub-sequences, whereas the synthetic ground truth is precise. We use the first VKITTI sequence (in all weather conditions) as the validation set, and the remaining four sequences as the training set. We consider the tasks of (i) single-frame depth estimation, and (ii) egomotion estimation from pairs of frames.

Evaluation metrics. We evaluate the error between our camera motion predictions and the ground truth using four metrics, as defined in prior work [sturm12iros, DBLP:conf/icra/JaeglePD16]: (a) distance error: the camera end-point error distance in meters; (b) rotation angle error: the camera rotational error in radians; (c) angular translation error: the error in the angular direction of the camera translation; and (d) magnitude translation error: the error in the magnitude of the camera translation.

We evaluate the error between our depth prediction and the true depth with an L1 distance in log-depth space. The use of log depth is inspired by Eigen et al. [eigen2014depth], but we do not use the scale-invariant error from that work, because it assumes the presence of an oracle indicating the true mean depth value at test time.
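The depth metric is then a mean absolute difference of log depths; a sketch (function name ours, depths assumed strictly positive):

```python
import numpy as np

def log_depth_l1(pred_depth, gt_depth):
    """L1 error in log-depth space between predicted and ground-truth
    depth maps of the same shape."""
    return np.mean(np.abs(np.log(pred_depth) - np.log(gt_depth)))
```

Working in log space makes the metric penalize relative rather than absolute depth deviations, so near and far regions contribute comparably.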

We consider two supervision setups: (a) unsupervised learning of SfM, where the AIGN is trained from scratch using (self-supervised) reconstruction and adversarial losses, and (b) supervised pretraining, where depth and camera motion in the training set are used to pretrain our generators, before applying unsupervised learning in the test set.

Self-supervised learning of SfM. We compare the following models: (a) our full model, AIGN, with adversarial losses on depth and camera motion, (b) our model with adversarial losses only on depth, but not on camera motion (depth-AIGN), (c) our model without adversarial losses but instead with a smoothness loss on depth (smooth-IGN), and (d) our model without any adversarial priors (IGN).

We show depth, camera, and photometric errors on the test set against the number of training iterations for all models in Figure 3. Models without depth adversarial priors diverge after a number of iterations (the depth map takes very large values). This is due to the well-known scale ambiguity of monocular reconstruction: objects can be further away or simply smaller in size, and the 2D re-projection will be the same. Adversarial priors enforce constraints regarding the general scale of the world scene, as well as depth-like appearance and camera-motion-like plausibility of the predictions, and prevent depth divergence. While the adversarial model has higher photometric error than the one without priors (naturally, since it is more constrained), the intermediate variables, namely depth and camera motion, are much more accurate when adversarial priors are present. The model that uses depth smoothness (smooth-IGN) falls in between the AIGN model and the model without any priors, and diverges as well. In Figure 4, we show the estimated depth and geometric flow predicted by models with and without adversarial priors.

Supervised pretraining. In this setup, we pretrain our camera pose estimation sub-network and depth sub-network on the VKITTI training set using supervised training against the ground-truth depth maps and camera motions supplied with the dataset (Pretrain baseline). We then finetune our model on the test set using self-supervision, i.e., reprojection error and adversarial constraints (Pretrain+SelfSup). We evaluate the camera pose estimation performance of our models on the test set, and compare against the geometric model of Jaegle et al. [DBLP:conf/icra/JaeglePD16], a monocular camera motion estimation method that takes optical flow as input and solves for the camera motion with an optimization-based method that attenuates the influence of outlier flows. We show our translation and rotation camera motion errors in Table 3. The pretrained model performs worse than the geometric baseline. When pretraining is combined with self-supervision, we obtain a much lower error than the geometric model. Monocular geometric methods, such as [DBLP:conf/icra/JaeglePD16], do not have a good way to resolve the scale ambiguity in reconstruction, and thus have large translation errors. The AIGN for SfM, while being monocular, learns depth and does not suffer from such ambiguities. Further, we outperform [DBLP:conf/icra/JaeglePD16] with respect to the angle of translation, which a geometric method can in principle estimate (no ambiguity). These results suggest that our adversarial SfM model improves by simply watching unlabeled videos, without diverging.

| | trl err | rot err | trl mag. | trl ang. |
|---|---|---|---|---|
| Geometric [DBLP:conf/icra/JaeglePD16] | 0.4588 | 0.0014 | 0.4579 | 0.3423 |
| Pretrain | 0.4876 | 0.0017 | 0.4865 | 0.3306 |
| Pret.+SelfSup | 0.1294 | 0.0014 | 0.1247 | 0.3333 |

### 3.3 Image-to-image translation

We use the CelebA dataset [liu2015faceattributes] which contains 202,599 face images, with 10,177 unique identities, and is annotated with 40 binary attributes. We preprocess the data by cropping each face image to the largest bounding box that includes the whole face using the OpenFace library [amos2016openface].

Biased super-resolution. We train female-to-male and male-to-female gender transformations by applying adversarial super-resolution to new face input images while the discriminator memories contain only male or only female faces, respectively. We train old-to-young and young-to-old age transformations by applying adversarial super-resolution to new face images while the discriminator memories contain only young or only old faces, respectively – as indicated by the attributes in the CelebA dataset. We train identity-mixing transformations by applying adversarial super-resolution to new face images while the discriminator memories contain only a particular person's identity; for demonstration we choose Tom Cruise. We show results in Figure 5 (a-c). We further compare our model against the recent work of Attribute2Image [DBLP:journals/corr/YanYSL15] in Figure 5 (d), using code made available by the authors. The AIGN better preserves the fidelity of the transformation and is more visually detailed. Though we demonstrate age and gender transformation and identity mixing, AIGN could be used for any creative image generation task, with appropriate curation of the discriminator's ground-truth memories.

Figure 5: (a) Female-to-male transformation. (b) Male-to-female transformation. (c) Anybody-to-Tom Cruise transformation. (d) Comparison with Attribute2Image [DBLP:journals/corr/YanYSL15] for male-to-female (left) and old-to-young transformation (right).

Biased Inpainting. We train “bigger lips” and “bigger nose” transformations by applying adversarial inpainting to new input face images where the mouth or nose region has been masked, respectively, and discriminator’s memories contain only face images with big lips or with big noses, respectively. Note that “big lips” and “big nose” are attributes annotated in the CelebA dataset. We show results in Figure 6. From top to bottom, we show the original image, the masked image input to adversarial inpainting, the output of our generator, and in the last row, the output of our generator superimposed over the original image.

Renderers versus parametric decoders for image-to-image translation. We compare domain-specific, non-parametric renderers against parametric decoders in image-to-image translation tasks. For our model with a parametric decoder, we use the full-resolution image as input and replace the downsampling module of the proposed super-resolution renderer with convolutional/deconvolutional layers, so that the decoder can freely adjust its weights through training. This is similar to the one-way transformation proposed in the concurrent works of [pmlr-v70-kim17a, CycleGAN2017]. We trained both models on gender transformation; results are shown in Figure 7. While both models produce photorealistic results, the model with the renderer produces females "more paired" to their male counterparts, while the parametric decoder may alter other properties of the face considerably; e.g., in the last row of Figure 7, the age of the produced females does not match their male origins. Parameter-free rendering is an important element of unsupervised learning with AIGNs: we have observed that parametric decoders (instead of parameter-free renderers) can cause the reconstruction loss to drop without learning meaningful predictions, by instead exploiting the capacity of the decoder. We provide a comprehensive experiment in the supplementary material.

## 4 Conclusion

We have presented Adversarial Inverse Graphics Networks, weakly supervised neural networks for 2D-to-3D lifting and image-to-image translation that combine feedback from renderings of predictions with data-driven priors on latent semantic factors, imposed using adversarial networks. We showed that AIGNs outperform previous supervised models that do not use adversarial priors on the tasks of 3D human pose estimation and 3D structure and egomotion extraction. We further showed how biasing the discriminators' priors for inpainting and super-resolution results in creative image editing, and outperforms the supervised variational autoencoder models of previous works in terms of the fidelity of the transformation and the amount of visual detail. Deep neural networks have shown that we do not need to engineer our features; AIGNs show that we do not need to engineer our priors either.


## Appendix A Parametric vs. non-parametric decoders

Here we discuss the benefits of using non-parametric, domain-specific renderers over learned decoders. Both the proposed model and CycleGAN [CycleGAN2017] can be viewed as autoencoders: the input is first transformed into a target domain, and then transformed back to its original space. A parametric decoder could seem more desirable, because we would not need to hand-engineer a mapping function from the target domain back to the inputs. However, simply using a reconstruction loss and an adversarial loss does not guarantee that the predictions look spatially similar to the inputs. In tasks such as image-to-image translation, spatial precision can be of critical importance. With a parametric decoder, the transformed input can be viewed as an information bottleneck, and as long as the decoder can correctly "guess" the final output from the transformed input (i.e., the code), the code is valid and the solution is optimal.

To support this point, we conduct an experiment on image inpainting using the MNIST dataset. Similar to the parametric encoder-decoder described in the main text, the network has two main parts: (1) an encoder that transforms the input (a partially obscured image of a digit) into a prediction (a hallucinated, complete digit), and (2) a decoder that transforms the prediction back into the input. Instead of using convolutional layers, which have an architectural bias toward preserving spatial relationships, we use fully-connected layers in both the encoder and the decoder. This is important, because such architectural conveniences are unavailable in less-structured tasks, such as 3D pose prediction and SfM. We train the model with a reconstruction loss on the decoder and an adversarial loss on the encoder.

The results are shown in Figure 8. When inpainting, the encoder (incorrectly) transforms many of the digits into other digits. For instance, several obscured "1" images are inpainted as "4". In the parametric decoding process, however, these errors are undone, and the original input is recovered successfully. In other words, the decoder bears the burden of the reconstruction loss, allowing the encoder to learn an inaccurate latent space. Parameter-free rendering avoids this problem.
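The failure mode this experiment probes can be seen in toy form: if the decoder is expressive enough to invert whatever the encoder does, the reconstruction loss reaches zero even when the code itself is wrong. A minimal sketch (our own illustration, not the experiment's actual networks), where "encoding" is a fixed permutation of the signal and the "decoder" simply inverts it:

```python
import numpy as np

x = np.arange(10, dtype=float)      # stand-in for an occluded digit image
perm = np.arange(10)[::-1]          # encoder garbles the signal (here: reverses it)
code = x[perm]                      # the "prediction" at the bottleneck is wrong
inv = np.argsort(perm)
recon = code[inv]                   # an expressive decoder simply undoes the damage

assert not np.array_equal(code, x)  # the latent prediction is meaningless...
assert np.allclose(recon, x)        # ...yet reconstruction is perfect
```

A parameter-free renderer removes this escape hatch: with no learnable inverse mapping, low reconstruction error requires a meaningful code.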

## Appendix B Additional experiments and details

In the sections to follow, we provide implementation details, including architecture descriptions for the generator and discriminator in each task, and training details. Additionally, we provide more experimental results.

### b.1 3D human pose estimation from static images

Figure 9 shows the architecture of our generator network for 3D human pose estimation from a single RGB image. The generator predicts weights over the shape bases, the rotation, the translation, and the focal length, as described in the paper. It takes as input a set of 2D body-joint heatmaps. We use convolutional pose machines [wei2016cpm] to estimate 2D keypoints, and convert them into heatmaps by placing a Gaussian distribution centered on each 2D keypoint. The network consists of 8 convolutional layers with leaky ReLU activations and batch normalization, followed by two fully connected layers that map to the desired outputs. The width, height, and number of channels of each layer are specified in Figure 13. The discriminator for this task consists of five fully connected layers with feature depths 512, 512, 256, 256, and 1, with a leaky ReLU and batch normalization after each layer. The discriminator takes all values output by the generator (shape-basis weights, rotation, translation, and focal length) as input.

In all experiments, we set the variance of the Gaussian heatmaps to 0.25, and the dimensionality of our PCA shape basis to 60 (out of 96 total bases). The dimensionality reduction is small; indeed, we use basis weights only for ease of prediction, relying on our adversarial priors (rather than PCA) to regularize the 3D shape prediction. We use gradient descent for both generator and discriminator training. The learning rate is set to 0.00001 for the reconstruction loss and to 0.0001 for the adversarial loss. All parameters are initialized by sampling from zero-mean normal distributions with variance 0.02.
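The keypoint-to-heatmap conversion described above can be sketched as follows (NumPy; the grid size is our own choice for illustration, while the variance of 0.25 matches the experiments):

```python
import numpy as np

def keypoint_heatmap(kp, size=64, var=0.25):
    """Render one 2D keypoint (x, y) as a Gaussian heatmap centered on it."""
    ys, xs = np.mgrid[0:size, 0:size].astype(float)
    d2 = (xs - kp[0]) ** 2 + (ys - kp[1]) ** 2
    return np.exp(-d2 / (2.0 * var))
```

One such heatmap per body joint is stacked channel-wise to form the generator's input.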

In Figure 10, we show predicted 3D human poses on images from the MPII dataset [andriluka14cvpr], using the available ground-truth 2D keypoints. Our model generalizes well to unseen images without any further self-supervised finetuning, though we would expect additional self-supervised finetuning to improve performance further.

### b.2 Structure from Motion

Our generator network for the task of structure from motion is illustrated in Figure 11. It consists of three encoder-decoder convolutional networks with skip connections, which solve for optical flow, depth, and camera motion, respectively. The egomotion network takes RGB, optical flow, and an angle field as input, and estimates the camera motion in SE(3). The depth network takes an RGB image as input and predicts log-depth. The depth discriminator consists of four convolutional layers, with batch normalization on the second and third layers and a leaky ReLU activation after each layer. The depth discriminator is fully convolutional, since we are interested in the realism of every depth patch, as opposed to the depth map as a whole.

The egomotion discriminator is a 3-layer fully-connected network that takes camera-motion matrices as input. The hidden layers have 128, 128, and 64 neurons, respectively, with batch normalization and a leaky ReLU after each layer.
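The fully-convolutional depth discriminator can be pictured as scoring every patch of the predicted depth map independently, rather than emitting one score for the whole map. A shape-level sketch (our own illustration; `score_fn` stands in for the learned convolutional stack, and the patch size is a hypothetical choice):

```python
import numpy as np

def patch_scores(depth_map, score_fn, patch=8):
    """Score the realism of every non-overlapping patch of a predicted
    depth map, mimicking a fully-convolutional discriminator's output grid."""
    h, w = depth_map.shape
    return np.array([[score_fn(depth_map[i:i + patch, j:j + patch])
                      for j in range(0, w - patch + 1, patch)]
                     for i in range(0, h - patch + 1, patch)])
```

The output is a grid of per-patch scores, so gradients flag locally implausible depth regions instead of averaging realism over the whole map.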

Stabilizing training.

In order to make sure that the generators and discriminators progress together during training, we update the generator only when the discriminator's loss is low enough: if the likelihood loss of the discriminator is above a threshold, we do not update the generator. While the discriminator is strong enough (loss below this threshold) and the generator is relatively weak (below a different threshold), we update the generator twice in the iteration. In the experiments, we set the two thresholds to 0.695 and 0.75, respectively.

Figure 13: (a) Generator and discriminator architectures for image super-resolution. (b) Generator and discriminator architectures for image inpainting.
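The update heuristic for stabilizing training can be written out as follows (a sketch; we read "generator relatively weak" as its adversarial loss lying below the second threshold, per the text, and the threshold values are those reported in the experiments):

```python
def num_generator_updates(d_loss, g_loss, t1=0.695, t2=0.75):
    """Per-iteration schedule: skip the generator while the discriminator
    is weak; update it twice while the discriminator is strong and the
    generator lags; otherwise do the usual single update."""
    if d_loss > t1:   # discriminator not yet reliable: train it alone
        return 0
    if g_loss < t2:   # discriminator strong, generator relatively weak
        return 2
    return 1
```

This keeps the adversarial game balanced: the generator never chases gradients from an unreliable discriminator, and gets extra steps when it falls behind.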

### b.3 Image Super-Resolution

In Figure 13, we show the architecture of the generator and discriminator for image super-resolution. The input image is first passed through a convolutional layer with 64 channels, then through a stack of "residual blocks". Each residual block consists of two convolutional layers, with batch normalization after each convolutional layer and a ReLU activation after the first batch normalization. The output of the last block is passed through two deconvolutional layers to generate the final image. The discriminator for this task consists of five convolutional layers with leaky ReLU activations and batch normalization, and one fully-connected layer that outputs a single value. In all experiments, we use the Adam optimizer with learning rate 0.0001. All parameters are initialized from a truncated normal distribution with variance 0.02.
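The residual-block structure described above (conv, batch norm, ReLU, conv, batch norm, plus the identity skip) can be sketched at the level of tensor operations as follows (NumPy; `conv1`/`conv2` are callables standing in for the learned convolutions, and the batch norm is an inference-style sketch without learned scale/shift):

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def batch_norm(x, eps=1e-5):
    """Per-channel normalization over spatial dimensions (H, W, C input)."""
    mu = x.mean(axis=(0, 1), keepdims=True)
    var = x.var(axis=(0, 1), keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def residual_block(x, conv1, conv2):
    """conv -> BN -> ReLU -> conv -> BN, added back onto the identity skip."""
    h = relu(batch_norm(conv1(x)))
    h = batch_norm(conv2(h))
    return x + h
```

The skip connection means each block only has to learn a residual correction on top of its input, which eases optimization of the deeper stack.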

In Figures 15, 16 and 17, we compare our model with Attribute2Image [DBLP:journals/corr/YanYSL15] and with Unsupervised Image Translation [DBLP:journals/corr/DongNWG17] for gender and age transformations. We use the code provided by the authors for our comparisons. In Figures 12 and 14, we show additional results of our model on gender and age transformation.

### b.4 Inpainting

Figure 13 illustrates the architecture of our generator and discriminator for image inpainting. The occluded input image and the corresponding mask are separately passed through two convolutional layers each, and then concatenated. The concatenated outputs are passed through three deconvolutional layers to generate the inpainted image. The discriminator consists of four convolutional layers with leaky ReLU and batch normalization layers, and one fully-connected layer that outputs a single value. In all experiments, we use the Adam optimizer. All parameters are initialized from a truncated normal distribution with variance 0.02.
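The two-stream inpainting generator above can be sketched at the level of tensor shapes (NumPy; the stream/decoder callables stand in for the conv and deconv stacks, which we do not reproduce here):

```python
import numpy as np

def inpaint_forward(image, mask, img_stream, mask_stream, decoder):
    """Encode the occluded image and its mask in separate streams,
    concatenate their features along the channel axis, then decode
    the merged features into the inpainted image."""
    feats = np.concatenate([img_stream(image), mask_stream(mask)], axis=-1)
    return decoder(feats)
```

Keeping the mask in its own stream lets the network condition explicitly on which pixels are missing, rather than having to infer the occlusion from the image alone.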

In Figure 18, we show additional results on biased inpainting for making bigger lips.
