Self-Supervised Viewpoint Learning From Image Collections

04/03/2020 · Siva Karthik Mustikovela et al. · University of Heidelberg, NVIDIA

Training deep neural networks to estimate the viewpoint of objects requires large labeled training datasets. However, manually labeling viewpoints is notoriously hard, error-prone, and time-consuming. On the other hand, it is relatively easy to mine many unlabelled images of an object category from the internet, e.g., of cars or faces. We seek to answer the research question of whether such unlabeled collections of in-the-wild images can be successfully utilized to train viewpoint estimation networks for general object categories purely via self-supervision. Self-supervision here refers to the fact that the only true supervisory signal that the network has is the input image itself. We propose a novel learning framework which incorporates an analysis-by-synthesis paradigm to reconstruct images in a viewpoint aware manner with a generative network, along with symmetry and adversarial constraints to successfully supervise our viewpoint estimation network. We show that our approach performs competitively to fully-supervised approaches for several object categories like human faces, cars, buses, and trains. Our work opens up further research in self-supervised viewpoint learning and serves as a robust baseline for it. We open-source our code at https://github.com/NVlabs/SSV.


1 Introduction

3D understanding of objects from 2D images is a fundamental computer vision problem. Object viewpoint (azimuth, elevation and tilt angles) estimation provides a pivotal link between 2D imagery and the corresponding 3D geometric understanding. In this work, we tackle the problem of object viewpoint estimation from a single image. Given its central role in 3D geometric understanding, viewpoint estimation is useful in several vision tasks such as object manipulation [xiang2018rss:posecnn], 3D reconstruction [kundu20183d] and image synthesis [chen2019mono], to name a few. Estimating viewpoint from a single image is highly challenging due to the inherent ambiguity of 3D understanding from a 2D image. Learning-based approaches, e.g., [LiaoCVPR19, grabner20183d, zhou2018starmap, mahendran20173d, su2015render, tulsiani2015viewpoints, gu2017dynamic, yang2019fsa], using neural networks that leverage a large amount of annotated training data, have demonstrated impressive viewpoint estimation accuracy. A key requirement for such approaches is the availability of large-scale human-annotated datasets, which are very difficult to obtain. A standard way to annotate viewpoints is by manually finding and aligning a rough morphable 3D or CAD model to images [fanelli2013random, zhu2017face, xiang2014beyond], which is a tedious and slow process. This makes it challenging to create large-scale datasets with viewpoint annotations. Most existing works [grabner20183d, massaBMVC2016, su2015render, LiaoCVPR19, zhu2017face, gu2017dynamic] either rely on human-annotated viewpoints or augment real-world data with synthetic data. Some works [grabner20183d] also leverage CAD models during viewpoint inference.

Figure 1: Self-supervised viewpoint learning. We learn a single-image object viewpoint estimation network for each category (face or car) using only a collection of images without ground truth.

In this work, we propose a self-supervised learning technique for viewpoint estimation of general objects that learns from an object image collection without the need for any viewpoint annotations (Figure 1). By image collection, we mean a set of images containing objects of a category of interest (say, faces or cars). Since viewpoint estimation assumes known object bounding boxes, we also assume that the image collection consists of tightly bounded object images. Being self-supervised in nature, our approach provides an important advancement in viewpoint estimation as it alleviates the need for costly viewpoint annotations. It also enables viewpoint learning on object categories that do not have any existing ground-truth annotations.

Following the analysis-by-synthesis paradigm, we leverage a viewpoint-aware image synthesis network as a form of self-supervision to train our viewpoint estimation network. We couple the viewpoint network with the synthesis network to form a complete cycle and train both together. To self-supervise viewpoint estimation, we leverage cycle-consistency losses between the viewpoint estimation (analysis) network and a viewpoint-aware generative (synthesis) network, along with losses for viewpoint and appearance disentanglement, and object-specific symmetry priors. During inference, we only need the viewpoint estimation network, without the synthesis network, making viewpoint inference simple and fast for practical purposes. To the best of our knowledge, ours is the first self-supervised viewpoint learning framework that learns the 3D viewpoint of general objects from image collections in the wild. We empirically validate our approach on the human head pose estimation task, which on its own has attracted considerable attention [zhu2017face, bulat2017far, sun2013deep, yang2018ssr, kumar2017kepler, chang2017faceposenet, gu2017dynamic, yang2019fsa] in computer vision research. We demonstrate that the results obtained by our self-supervised technique are comparable to those of fully-supervised approaches. In addition, we also demonstrate significant performance improvements over viewpoints estimated from keypoint predictors that are learned via self-supervision. To showcase the generalization of our technique, we analyze our approach on object classes such as cars, buses, and trains from the challenging Pascal3D+ [xiang2014beyond] dataset. We believe this work opens up further research in self-supervised viewpoint learning and will also serve as a robust baseline for future work.

To summarize, our main contributions are:

  • We propose a novel analysis-by-synthesis framework for learning viewpoint estimation in a purely self-supervised manner by leveraging cycle-consistency losses between a viewpoint estimation network and a viewpoint-aware synthesis network. To the best of our knowledge, this is one of the first works to explore the problem of self-supervised viewpoint learning for general objects.

  • We introduce generative, symmetry and adversarial constraints that self-supervise viewpoint learning purely from object image collections.

  • We perform experiments for head pose estimation on the BIWI dataset [fanelli2013random] and for viewpoint estimation of cars, buses and trains on the challenging Pascal3D+ [xiang2014beyond] dataset and demonstrate competitive accuracy in comparison to fully-supervised approaches.

2 Related Work

Viewpoint estimation

Several successful learning-based viewpoint estimation techniques have been developed for general object categories that either regress orientation directly [mousavian20173d, mahendran20173d, su2015render, tulsiani2015viewpoints, LiaoCVPR19, prokudin2018deep]; locate 2D keypoints and fit them to 3D keypoints [grabner20183d, pavlakos20176, zhou2018starmap]; or predict 3D shape and viewpoint parameters [kundu20183d]. These techniques require object viewpoint annotations during training, either in the form of angular values or of 2D and 3D keypoints, and use large annotated datasets, e.g., Pascal3D+ [xiang2014beyond] and ObjectNet3D [xiang2016objectnet3d] with 12 and 100 categories, respectively. These datasets were annotated via a tedious manual process of aligning best-matched 3D models to images – a procedure that does not scale easily to larger numbers of images or categories. To circumvent this problem, existing viewpoint algorithms augment real-world data with synthetic images [grabner20183d, massaBMVC2016, su2015render, LiaoCVPR19]; assume auxiliary supervision and learn related aspects (e.g., 3D keypoints) along with viewpoint [zhou2018starmap, suwajanakorn2018key-pointnet]; or try to learn from very few labeled examples of novel categories [tseng2019].

Head pose estimation

Separate from the above-mentioned works, learning-based head pose estimation techniques have also been studied extensively [zhu2017face, bulat2017far, sun2013deep, yang2018ssr, kumar2017kepler, chang2017faceposenet, gu2017dynamic, yang2019fsa]. These works learn either to predict facial landmarks from data with varying levels of supervision, ranging from full [zhu2017face, bulat2017far, sun2013deep, yang2018ssr, kumar2017kepler] to partial [honari2018improving] to no supervision [hung2019scops, zhang2018unsupervised]; or to regress head orientation directly in a fully-supervised manner [chang2017faceposenet, ruiz2018fine, gu2017dynamic, yang2019fsa]. The latter methods perform better than those that predict facial points [yang2019fsa]. To avoid manual annotation of head pose, prior works also use synthetic datasets [zhu2017face, gu2017dynamic]. On the other hand, several works [tewari2018self, feng2018joint, tran2018nonlinear, sahasrabudhe2019lifting] propose learning-based approaches for dense 3D reconstruction of faces from in-the-wild image collections, and some use analysis-by-synthesis [tewari2018self, tran2018nonlinear]. However, they are not purely self-supervised and use either facial landmarks [tewari2018self], dense 3D surfaces [feng2018joint] or both [tran2018nonlinear] as supervision.

Self-supervised object attribute discovery

Several recent works try to discover 2D object attributes like landmarks [zhang2018unsupervised, thewlisICCVT2017, tomas2018neurips] and part segmentation [hung2019scops, collins2018deep] in a self-supervised manner. These works are orthogonal to ours as we estimate 3D viewpoint. Some other works such as [l2019differ, insafutdinov2018unsupervised, Henderson2018LearningTG] make use of differentiable rendering frameworks to learn 3D shape and/or camera viewpoint from single-view or multi-view image collections. Because of their heavy reliance on differentiable rendering, these works mainly operate on synthetic images. In contrast, our approach can learn viewpoints from image collections in the wild. Some works learn 3D reconstruction from in-the-wild image collections, but use annotated object silhouettes along with other annotations such as 2D semantic keypoints [cmrKanazawa18], category-level 3D templates [kulkarni2019csm], or multiple views of each object instance [kato2018renderer, Wiles17a, novotny2017learning]. In contrast, we use no additional supervision other than the image collections, which comprise independent object images. To the best of our knowledge, no prior work proposes to learn the viewpoint of general objects in a purely self-supervised manner from in-the-wild image collections.

3 Self-Supervised Viewpoint Learning

Problem setup We learn a viewpoint estimation network V using an in-the-wild image collection of a specific object category without annotations. Since viewpoint estimation assumes tightly cropped object images, we also assume that our image collection is composed of cropped object images. Figure 1 shows some samples from the face and car image collections. During inference, the viewpoint network V takes a single object image I as input and predicts the object’s 3D viewpoint v.

Viewpoint representation To represent an object viewpoint v, we use three Euler angles, namely azimuth (a), elevation (e) and in-plane rotation (t), describing rotations around fixed 3D axes. For the ease of viewpoint regression, we represent each Euler angle, e.g., φ, as a point on a unit circle with 2D coordinates (cos φ, sin φ). Following [LiaoCVPR19], instead of predicting coordinates on the full 360° circle, we predict a positive unit vector in the first quadrant, |v^φ| = (|cos φ|, |sin φ|), and also the category of the combination of signs of cos φ and sin φ, indicated by sign(v^φ) ∈ {(+,+), (+,−), (−,+), (−,−)}. Given the predicted |v^φ| and sign(v^φ) from the viewpoint network, we can construct cos φ = sign(cos φ)|cos φ| and sin φ = sign(sin φ)|sin φ|. The predicted Euler angle can finally be computed as φ = arctan2(sin φ, cos φ). In short, the viewpoint network performs both regression, to predict the positive unit vector |v^φ|, and classification, to predict the probability of sign(v^φ).
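As a concrete illustration, the following is a minimal sketch (our own code, not the released implementation) of decoding a predicted positive unit vector and sign-combination class back into an Euler angle; the ordering of the four sign classes is an assumption.

```python
import torch

# Assumed ordering of sign combinations for (cos, sin): (+,+), (+,-), (-,+), (-,-)
SIGN_COMBOS = torch.tensor([[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, -1.0]])

def decode_euler_angle(abs_vec, sign_logits):
    """abs_vec: (B, 2) positive unit vectors (|cos phi|, |sin phi|);
    sign_logits: (B, 4) scores over the four sign combinations."""
    signs = SIGN_COMBOS[sign_logits.argmax(dim=1)]   # (B, 2) signs of (cos, sin)
    cos_phi = signs[:, 0] * abs_vec[:, 0]
    sin_phi = signs[:, 1] * abs_vec[:, 1]
    return torch.atan2(sin_phi, cos_phi)             # predicted angle in radians
```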

Figure 2: Approach overview. We use generative consistency, symmetry and discriminator losses to supervise the viewpoint network with a collection of images without annotations.

Approach overview and motivation We learn the viewpoint network V using a set of self-supervised losses as illustrated in Figure 2. To formulate these losses, we use three different constraints, namely generative consistency, a symmetry constraint and a discriminator loss. Generative consistency forms the core of the self-supervised constraints to train our viewpoint network and is inspired by the popular analysis-by-synthesis learning paradigm [kundu20183d]. This framework tackles inverse problems (such as viewpoint estimation) by modelling the forward process of image or feature synthesis. A synthesis function models the process of generating an image of an object from a basic representation and a set of parameters. The goal of the analysis function is to infer the underlying parameters which can best explain the formation of an observed input image. Bayesian frameworks such as [yuille2006vision] and inverse graphics [kundu20183d, kato2018renderer, yao20183d, liu2019softras, jampani2015informed] form some of the popular techniques based on the analysis-by-synthesis paradigm. In our setup, we consider the viewpoint network V as the analysis function.

We model the synthesis function with a viewpoint-aware image generation network S. Recent advances in Generative Adversarial Networks (GANs) [chen2016infogan, karras2019style, nguyenhologan] have shown that it is possible to generate high-quality images with fine-grained control over parameters like appearance, style, viewpoint, etc. Inspired by these works, our synthesis network S generates an image given an input viewpoint v, which controls the viewpoint of the object, and an input vector z, which controls the style of the object in the synthesized image. By coupling both the analysis (V) and synthesis (S) networks in a cycle, we learn both networks in a self-supervised manner using the cyclic consistency constraints described in Section 3.1 and shown in Figure 3. Since the synthesis network can generate high-quality images based on the controllable inputs v and z, these synthesized images can in turn be used as input to the analysis network V, with (v, z) serving as the pseudo ground-truth. On the other hand, for a real-world image, if V predicts the correct viewpoint and style, these can be utilized by S to produce a similar-looking image. This effectively functions as image reconstruction-based supervision. In addition to this, similar to [chen2016infogan, nguyenhologan], the analysis network also functions as a discriminator, evaluating whether the synthesized images are real or fake. Using the widely prevalent observation that several real-world objects are symmetric, we also enforce a prior constraint via a symmetry loss function to train the viewpoint network. Object symmetry has been used in previous supervised techniques such as [mahendran20173d] for data augmentation, but not as a loss function. In the following, we first describe the various loss constraints used to train the viewpoint network V while assuming that we already have a trained synthesis network S. In Section 4, we describe the loss constraints used to train the synthesis network S.

Figure 3: Generative consistency. The two cyclic losses, (a) image consistency (L_ic) and (b) style and viewpoint consistency (L_sv), make up generative consistency. The input to each cycle is highlighted in yellow. Image consistency enforces that an input real image, after viewpoint estimation and synthesis, matches its reconstructed synthetic version. Style and viewpoint consistency enforces that the input style and viewpoint provided for synthesis are correctly reproduced by the viewpoint network.

3.1 Generative Consistency

As Figure 3 illustrates, we couple the viewpoint network with the synthesis network to create a circular flow of information resulting in two consistency losses: (a) image consistency and (b) style and viewpoint consistency.

Image consistency Given a real image I sampled from a given image collection, we first predict its viewpoint v̂ and style code ẑ via the viewpoint network V. Then, we pass the predicted v̂ and ẑ into the synthesis network S to create the synthetic image Î = S(v̂, ẑ). To train the viewpoint network, we use the image consistency between the input image I and the corresponding synthetic image Î with a perceptual loss:

L_ic = d(Φ(I), Φ(Î)),   (1)

where Φ(·) denotes the conv5 features of an ImageNet-trained [deng2009imagenet] VGG16 classifier [Simonyan15_vgg] and d(·,·) denotes the cosine distance. Figure 3(a) illustrates the image consistency cycle.
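For concreteness, a minimal sketch of this image-consistency term is given below; the exact layer cut-off of the VGG16 feature extractor and the batching details are our assumptions, not the released code.

```python
import torch
import torch.nn.functional as F
import torchvision

# VGG16 feature extractor (convolutional trunk, roughly up to conv5); cut-off is our choice.
vgg_features = torchvision.models.vgg16(pretrained=True).features.eval()
for p in vgg_features.parameters():
    p.requires_grad_(False)

def image_consistency_loss(real_img, synth_img):
    """Cosine distance between VGG features of the real image I and
    the re-synthesized image I_hat = S(V(I))."""
    f_real = vgg_features(real_img).flatten(1)
    f_synth = vgg_features(synth_img).flatten(1)
    return (1.0 - F.cosine_similarity(f_real, f_synth, dim=1)).mean()
```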

Style and viewpoint consistency As illustrated in Figure 3(b), we create another circular flow of information with the viewpoint and synthesis networks, but this time starting with a random viewpoint v_s and a style code z_s, both sampled from uniform distributions, and input them to the synthesis network to create an image Î = S(v_s, z_s). We then pass the synthetic image Î to the viewpoint network V, which predicts its viewpoint v̂_s and style code ẑ_s. We use the sampled viewpoint and style codes of the synthetic image as pseudo GT to train the viewpoint network. Following [LiaoCVPR19], the viewpoint consistency loss L_v(v_1, v_2) between two viewpoints v_1 and v_2 has two components for each Euler angle: (i) cosine proximity between the positive unit vectors and (ii) the cross-entropy loss between the classification probabilities of their sign categories. The viewpoint consistency loss is a sum of the cross-entropy and cosine proximity losses over all three Euler angles:

L_v(v_1, v_2) = Σ_{φ ∈ {a,e,t}} [ (1 − ⟨|v_1^φ|, |v_2^φ|⟩) + CE(sign(v_1^φ), sign(v_2^φ)) ].   (2)

The overall style and viewpoint loss between the sampled (v_s, z_s) and the predicted (v̂_s, ẑ_s) is hence:

L_sv = L_v(v_s, v̂_s) + ‖z_s − ẑ_s‖.   (3)

While the viewpoint consistency enforces that V learns correct viewpoints for synthetic images, the image consistency helps to ensure that V generalizes well to real images as well, and hence avoids over-fitting to images synthesized by S.
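The following sketch illustrates one possible implementation of the per-angle viewpoint consistency term of Eqn. 2; the dictionary-based data layout and angle names are ours, not the authors’.

```python
import torch.nn.functional as F

def viewpoint_consistency_loss(pred, target):
    """pred/target: dicts keyed by angle name; each entry holds
    'abs'  -- (B, 2) positive unit vectors (|cos|, |sin|),
    'sign' -- (B, 4) logits for pred, (B,) class indices for target."""
    loss = 0.0
    for angle in ("azimuth", "elevation", "tilt"):
        cos_prox = F.cosine_similarity(pred[angle]["abs"], target[angle]["abs"], dim=1)
        loss = loss + (1.0 - cos_prox).mean()                                      # cosine proximity term
        loss = loss + F.cross_entropy(pred[angle]["sign"], target[angle]["sign"])  # sign classification term
    return loss
```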

3.2 Discriminator Loss

The viewpoint network V also predicts a score indicating whether an input image is real or synthetic. It thus acts as a discriminator in a typical GAN [goodfellow2014generative] setting, helping the synthesis network create more realistic images. We use the discriminator loss from Wasserstein-GAN [arjovsky2017wasserstein] to update the viewpoint network:

L_dis = c_s − c_r,   (4)

where c_r = V(I) and c_s = V(Î) are the real/fake scores predicted by the viewpoint network for the real and the synthesized images, respectively.
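A one-line sketch of this critic term, as we interpret Eqn. 4:

```python
def wgan_critic_loss(score_real, score_fake):
    """Wasserstein critic objective for the real/fake head: push scores of real
    images up and scores of synthesized images down."""
    return score_fake.mean() - score_real.mean()
```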

3.3 Symmetry Constraint

Symmetry is a strong prior observed in many commonplace object categories, e.g., faces, boats, cars, airplanes, etc. For categories with symmetry, we propose to leverage an additional symmetry constraint. Given an image I of an object with viewpoint v = (a, e, t), the GT viewpoint of the object in a horizontally flipped image is given by (−a, e, −t). We enforce a symmetry constraint on the viewpoint network’s outputs (v, z) and (v*, z*) for a given image I and its horizontally flipped version I*, respectively. Let (v, z) = V(I) and (v*, z*) = V(I*), and denote the sign-flipped viewpoint of the flipped image as v̄* = (−a*, e*, −t*). The symmetry loss is given as

L_sym = L_v(v, v̄*) + ‖z − z*‖.   (5)

Effectively, for a given horizontally flipped image pair, we regularize the network to predict similar magnitudes for all the angles and opposite directions for azimuth and tilt. Additionally, the above loss enforces that the styles of the flipped image pair are consistent.
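A simplified sketch of the symmetry constraint follows; for brevity it compares raw angles with an L1 penalty instead of the unit-vector loss L_v, which is an assumption on our part.

```python
import torch

def symmetry_loss(viewpoint_net, images, style_weight=1.0):
    """Sketch with viewpoints represented directly as (azimuth, elevation, tilt);
    the paper's unit-vector/sign representation would replace the angle term."""
    v, z = viewpoint_net(images)                            # v: (B, 3), z: (B, style_dim)
    v_f, z_f = viewpoint_net(torch.flip(images, dims=[3]))  # horizontal flip (NCHW width axis)
    v_f_bar = torch.stack([-v_f[:, 0], v_f[:, 1], -v_f[:, 2]], dim=1)  # negate azimuth and tilt
    angle_term = (v - v_f_bar).abs().mean()                 # stand-in for L_v(v, v_bar*)
    style_term = (z - z_f).abs().mean()
    return angle_term + style_weight * style_term
```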

Our overall loss to train the viewpoint network V is a linear combination of the aforementioned loss functions:

L_V = λ1 L_sv + λ2 L_ic + λ3 L_dis + λ4 L_sym,   (6)

where the parameters {λi} determine the relative importance of the different losses, which we empirically determine using a grid search.

Figure 4: Synthesis network overview. The network takes a viewpoint v and a style code z to produce a viewpoint-aware image.

4 Viewpoint-Aware Synthesis Network

Recent advances in GANs such as InfoGAN [chen2016infogan], StyleGAN [karras2019style] and HoloGAN [nguyenhologan] demonstrate the possibility of conditional image synthesis where we can control the synthesized object’s attributes such as object class, viewpoint, style, geometry, etc. A key insight that we make use of in our synthesis network, which is also used in recent GANs such as HoloGAN [nguyenhologan] and other works [zhu2018visual, kulkarni2015deep, sitzmann2019deepvoxels], is that one can instill 3D geometric meaning into the network’s latent representations by performing explicit geometric transformations such as rotation on them. A similar idea has also been used successfully with other generative models such as auto-encoders [Hinton2011ICANN, Rhodin2018ECCV, Park2019ICCV]. Our viewpoint-aware synthesis network has a similar architecture to HoloGAN [nguyenhologan], but is tailored for the needs of viewpoint estimation. HoloGAN is a pure generative model with GAN losses to ensure realism and an identity loss to reproduce the input style code, but lacks a corresponding viewpoint prediction network. In this work, since we focus on viewpoint estimation, we introduce tight coupling of HoloGAN with a viewpoint prediction network and several novel loss functions to train it in a manner that is conducive to accurate viewpoint prediction.

Synthesis network overview Figure 4 illustrates the design of the synthesis network. The network S takes a style code z and a viewpoint v to produce a corresponding object image Î. The goal of S is to learn a disentangled 3D representation of an object, which can be used to synthesize objects in various viewpoints and styles, hence aiding the supervision of the viewpoint network V. We first pass a learnable canonical 3D latent code through a 3D network, which applies 3D convolutions to it. Then, we rotate the resulting 3D representation with v and pass it through an additional 3D network. We project this viewpoint-aware learned 3D code onto 2D using a simple orthographic projection unit. Finally, we pass the resulting 2D representation through a StyleGAN [karras2019style]-like 2D network to produce the synthesized image. The style and appearance of the image are controlled by the sampled style code z. Following StyleGAN [karras2019style], the style code affects the style of the resulting image via adaptive instance normalization [huang2017arbitrary] in both the 3D and 2D representations. For stable training, we freeze V while training S, and vice versa.
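The two geometric operations, rigid rotation of the 3D latent code and orthographic projection, can be sketched as follows assuming a HoloGAN-style feature layout (this is not the authors’ released code):

```python
import torch
import torch.nn.functional as F

def rotate_volume(feat3d, rot):
    """feat3d: (B, C, D, H, W) latent volume; rot: (B, 3, 3) rotation matrices
    built from the input Euler angles. Rigidly resamples the volume."""
    B = feat3d.shape[0]
    theta = torch.cat([rot, torch.zeros(B, 3, 1, device=feat3d.device)], dim=2)  # (B, 3, 4), no translation
    grid = F.affine_grid(theta, feat3d.shape, align_corners=False)
    return F.grid_sample(feat3d, grid, align_corners=False)

def orthographic_project(feat3d):
    """Collapse the depth axis into channels (cf. the 'Project' block in Table 6)."""
    B, C, D, H, W = feat3d.shape
    return feat3d.reshape(B, C * D, H, W)
```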

Loss functions Like the viewpoint network, we use several constraints to train the synthesis network, which are designed to improve viewpoint estimation. The first is the standard adversarial loss used in training the Wasserstein-GAN [arjovsky2017wasserstein]:

L_adv = −c_s,   (7)

where c_s = V(Î) is the real/fake score predicted by V for a synthesized image. The second is a paired version of the style and viewpoint consistency loss (Eqn. 3) described in Section 3.1, where we propose to use multiple paired samples to enforce style and viewpoint consistency and to better disentangle the latent representations of S. The third is a flip image consistency loss. Note that, in contrast to our work, InfoGAN [chen2016infogan] and HoloGAN [nguyenhologan] only use adversarial and style consistency losses.

Figure 5: Synthesis results. Example synthetic images of (a) faces and (b) cars generated by the viewpoint-aware generator . For each row the style vector is constant, whereas the viewpoint is varied monotonically along the azimuth (first row), elevation (second row) and tilt (third row) dimensions.

Style and viewpoint consistency with paired samples Since we train the viewpoint network with images synthesized by S, it is very important for S to be sensitive and responsive to its input style and viewpoint parameters. An ideal S would perfectly disentangle v and z. That is, if we fix z and vary v, the resulting object images should have the same style but varying viewpoints. On the other hand, if we fix v and vary z, the resulting object images should have different styles but a fixed viewpoint. We enforce this constraint with a paired version of the style and viewpoint consistency loss (Eqn. 3), where we sample three different (v, z) pairs by varying one parameter at a time. We refer to this paired style and viewpoint loss as L_pair-sv. The ablation study in Section 5 suggests that this paired style and viewpoint loss helps to train a better synthesis network for our intended task of viewpoint estimation. We also observe qualitatively that the synthesis network successfully disentangles the viewpoints and styles of the generated images. Some example images synthesized by S for faces and cars are shown in Figure 5. Each row uses a fixed style code z, and we monotonically vary the input viewpoint by changing one of its azimuth, elevation or tilt values across the columns.

Flip image consistency This is similar to the symmetry constraint used to train the viewpoint network, but applied to synthesized images. Flip image consistency forces S to synthesize horizontally flipped images when we input appropriately flipped viewpoints. For the pairs (v, z) and (v̄, z), where v̄ has opposite signs for the a and t values of v, the flip consistency loss is defined as:

L_flip = ‖Î* − S(v̄, z)‖,   (8)

where Î* is the horizontally flipped version of Î = S(v, z).
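A sketch of this flip consistency term, assuming the synthesis network is callable as S(v, z) with v = (a, e, t); the L1 penalty is our choice.

```python
import torch

def flip_consistency_loss(synth_net, v, z):
    """Synthesizing with azimuth/tilt negated should give the horizontal mirror
    of the original synthesis for the same style code z."""
    img = synth_net(v, z)                                        # I_hat = S(v, z)
    v_bar = torch.stack([-v[:, 0], v[:, 1], -v[:, 2]], dim=1)    # opposite signs for a and t
    img_bar = synth_net(v_bar, z)                                # S(v_bar, z)
    return (torch.flip(img, dims=[3]) - img_bar).abs().mean()    # distance between I_hat* and S(v_bar, z)
```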

The overall loss for the synthesis network is given by:

L_S = λ5 L_adv + λ6 L_pair-sv + λ7 L_flip,   (9)

where the parameters {λi} are the relative weights of the losses, which we determine empirically using a grid search.

5 Experiments

We empirically validate our approach with extensive experiments on head pose estimation and viewpoint estimation on other object categories of buses, cars and trains. We refer to our approach as ‘SSV’.

Implementation and training details We implement our framework in PyTorch [paszke2017automatic]. We provide all network architecture details, and run-time and memory analyses, in the supplementary material.

Viewpoint calibration The output of SSV for a given image is a viewpoint (a, e, t). However, since SSV is self-supervised, the coordinate system of its predictions need not correspond to the actual canonical coordinate system of the GT annotations. For quantitative evaluation, following the standard practice in self-supervised learning of features [donahue2016adversarial, zhang2017split, caron2018deep] and landmarks [hung2019scops, zhang2018unsupervised, thewlisICCVT2017], we fit a linear regressor that maps the predictions of SSV to GT viewpoints using 100 randomly chosen images from the target test dataset. Note that this calibration with a linear regressor only rotates the predicted viewpoints to the GT canonical frame of reference. We do not update or learn our SSV network during this step.
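A sketch of this calibration step; whether the original regressor includes a bias term or operates on angles rather than unit vectors is our assumption.

```python
import numpy as np

def fit_calibration(pred, gt):
    """pred, gt: (N, 3) arrays of (azimuth, elevation, tilt), N ~ 100 held-out images."""
    X = np.hstack([pred, np.ones((pred.shape[0], 1))])   # affine term
    W, _, _, _ = np.linalg.lstsq(X, gt, rcond=None)      # (4, 3) linear map
    return W

def apply_calibration(pred, W):
    return np.hstack([pred, np.ones((pred.shape[0], 1))]) @ W
```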

5.1 Head Pose Estimation

Human faces have a special place among objects for viewpoint estimation, and head pose estimation has attracted considerable research attention [zhu2017face, bulat2017far, sun2013deep, yang2018ssr, kumar2017kepler, chang2017faceposenet, gu2017dynamic, yang2019fsa]. The availability of large-scale datasets [300wlp, fanelli2013random] and the existence of ample research provide a unique opportunity to perform extensive experimental analysis of our technique on head pose estimation.

Datasets and evaluation metric

For training, we use the 300W-LP [300wlp] dataset, which combines several in-the-wild face datasets. It contains 122,450 face images with diverse viewpoints, created by fitting a 3D face morphable model [blanz1999morphable] to face images and rendering them from various viewpoints. Note that we only use the images from this dataset to train SSV and not their GT viewpoint annotations. We evaluate our framework on the BIWI [fanelli2013random] dataset which contains 15,677 images across 24 sets of video sequences of 20 subjects in a wide variety of viewpoints. We use the MTCNN face detector to detect all faces [zhang2016joint]. We compute average absolute errors (AE) for azimuth, elevation and tilt between the predictions and GT. We also report the mean absolute error (MAE) of these three errors.

Method Azimuth Elevation Tilt MAE

Self-Supervised

LMDIS [zhang2018unsupervised] + PnP 16.8 26.1 5.6 16.1
IMM [tomas2018neurips] + PnP 14.8 22.4 5.5 14.2
SCOPS [hung2019scops] + PnP 15.7 13.8 7.3 12.3
HoloGAN [nguyenhologan] 8.9 15.5 5.0 9.8
HoloGAN [nguyenhologan] with viewpoint loss 7.0 15.1 5.1 9.0
SSV w/o L_ic + L_sym 6.8 13.0 5.2 8.3
SSV w/o L_ic 6.9 10.3 4.4 7.2
SSV-Full 6.0 9.8 4.4 6.7

Supervised

3DDFA [zhu2017face] 36.2 12.3 8.7 19.1
KEPLER [kumar2017kepler] 8.8 17.3 16.2 13.9
DLib [kazemi2014one] 16.8 13.8 6.1 12.2
FAN [bulat2017far] 8.5 7.4 7.6 7.8
Hopenet [ruiz2018fine] 5.1 6.9 3.3 5.1
FSA [yang2019fsa] 4.2 4.9 2.7 4.0
Table 1: Head pose estimation ablation studies and SOTA comparisons. Average absolute angular error for azimuth, elevation and tilt Euler angles in degrees together with the mean absolute error (MAE) for the BIWI [fanelli2013random] dataset.

Ablation study We empirically evaluate the different self-supervised constraints used to train the viewpoint network. Table 1 shows that for head pose estimation, using all the proposed constraints (SSV-Full) results in our best MAE of 6.7°. Removing the image consistency constraint L_ic increases the MAE to 7.2°, and further removing the symmetry constraint L_sym results in an MAE of 8.3°. These results demonstrate the usefulness of the generative image consistency and symmetry constraints in our framework.

Method Azimuth Elevation Tilt MAE

Self-Sup

SSV non-refined 6.9 9.4 4.2 6.8
SSV refined on BIWI 4.9 8.5 4.2 5.8

Supervised

FSA [yang2019fsa] 2.8 4.2 3.6 3.6
DeepHP [mukherjee2015deep] 5.6 5.1 - -
RNNFace [gu2017dynamic] 3.9 4.0 3.0 3.6
Table 2: Improved head pose estimation with fine-tuning. Average angular error for each of the Euler angles together with the mean absolute error (MAE) on data of 30% held-out sequences of the BIWI [fanelli2013random] dataset, after fine-tuning on the remaining 70% without using their annotations. All values are in degrees.

Additionally, we evaluate the effect of using the paired style and viewpoint loss L_pair-sv to train the viewpoint-aware synthesis network S. We observe that when we train S without L_pair-sv, our viewpoint network (SSV-Full model) yields noticeably higher angular errors for azimuth, elevation and tilt, and a correspondingly higher MAE than the 6.7° obtained when S is trained with L_pair-sv (Table 1, SSV-Full). This shows that our paired style and viewpoint loss helps to better train the image synthesis network for the task of viewpoint estimation.

Comparison with self-supervised methods Since SSV is a self-supervised viewpoint estimation work, there is no existing work that we can directly compare against. One could, however, obtain head pose from predicted face landmarks, so we compare against recent state-of-the-art self-supervised landmark estimation (LMDIS [zhang2018unsupervised], IMM [tomas2018neurips]) and part discovery techniques (SCOPS [hung2019scops]). We fit a linear regressor that maps the semantic face part centers learned by SCOPS and the landmarks from LMDIS and IMM to five canonical facial landmarks (left-eye center, right-eye center, nose tip and mouth corners). Then we fit an average 3D face model to these facial landmarks with the Perspective-n-Point (PnP) algorithm [lepetit2009epnp] to estimate head pose. We also quantify HoloGAN’s [nguyenhologan] performance at viewpoint estimation by training a viewpoint network with images synthesized by it under different input viewpoints (as pseudo GT). Alternatively, we train HoloGAN with an additional viewpoint output and a corresponding additional loss for it. For both these latter approaches, we additionally use viewpoint calibration, similar to SSV. We consider these works our closest baselines because of their self-supervised training. The MAE results in Table 1 indicate that SSV performs considerably better than all the competing self-supervised methods.

Comparison with supervised methods As a reference, we also report the metrics for recent state-of-the-art fully-supervised methods. Table 1 shows the results for both the keypoint-based [zhu2017face, kumar2017kepler, kazemi2014one, bulat2017far] and keypoint-free [ruiz2018fine, yang2019fsa] methods. The latter methods learn to directly regress head orientation values. The results indicate that ‘SSV-Full’, despite being purely self-supervised, obtains results comparable to fully-supervised techniques. In addition, we notice that SSV-Full (with an MAE of 6.7°) outperforms all the keypoint-based supervised methods [zhu2017face, kumar2017kepler, kazemi2014one, bulat2017far], among which FAN [bulat2017far] has the best MAE of 7.8°.

Refinement on BIWI dataset The results reported thus far are with training on the 300W-LP [300wlp] dataset. Following some recent works [yang2019fsa, mukherjee2015deep, gu2017dynamic], we use 70% (16) of the image sequences in the BIWI dataset to fine-tune our model. Since our method is self-supervised, we just use the images from BIWI without their annotations. We use the remaining 30% (8) image sequences for evaluation. The results of our model along with those of the state-of-the-art supervised models are reported in Table 2. After refinement on the BIWI dataset’s images, the MAE of SSV significantly reduces to 5.8°. This demonstrates that SSV can improve its performance with the availability of images that match the target domain, even without GT annotations. We also show qualitative results of head pose estimation for this refined SSV-Full model in Figure 6(a). It is robust to large variations in head pose, identity and expression.

5.2 Generalization to Other Object Categories

SSV is not specific to faces and can be used to learn viewpoints of other object categories. To demonstrate its generalization ability, we additionally train and evaluate SSV on the categories of cars, buses and trains.

Datasets and evaluation metric Since SSV is completely self-supervised, the training image collection has to be reasonably large to cover all possible object viewpoints while covering diversity in other image aspects such as appearance, lighting, etc. For this reason, we leverage large-scale image collections from both existing datasets and the internet to train our network. For the car category, we use the CompCars [yang2015large] dataset, a fine-grained car model classification dataset containing 137,000 car images from various viewpoints. For the ‘train’ and ‘bus’ categories, we use the OpenImages [papadopoulos2016we, papadopoulos2017extreme, benenson2019large] dataset, which contains about 12,000 images of each of these categories. Additionally, we mine about 30,000 images from Google image search for each category. None of the aforementioned datasets have viewpoint annotations. This also demonstrates the ability of SSV to consume large-scale internet image collections that come without any viewpoint annotations.

We evaluate the performance of the trained SSV model on the test sets of the challenging Pascal3D+ [xiang2014beyond] dataset. The images in this dataset have extreme shape, appearance and viewpoint variations. Following [mahendran20173d, prokudin2018deep, tulsiani2015viewpoints, LiaoCVPR19], we estimate the azimuth, elevation and tilt values, given the GT object location. To compute the error between the predicted and GT viewpoints, we use the standard geodesic distance between the rotation matrix constructed from the predicted viewpoint and that constructed from the GT viewpoint [LiaoCVPR19]. Using this distance metric, we report the median geodesic error (Med. Error) for the test set. Additionally, we also compute the percentage of inlier predictions whose error is less than π/6.
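For reference, the geodesic error between two rotation matrices can be computed as follows (a standard formula, not code from the paper):

```python
import numpy as np

def geodesic_error_deg(R_pred, R_gt):
    """Rotation angle (in degrees) of R_pred^T * R_gt, the standard geodesic
    distance on SO(3) between predicted and GT rotation matrices."""
    R_rel = R_pred.T @ R_gt
    cos_theta = np.clip((np.trace(R_rel) - 1.0) / 2.0, -1.0, 1.0)
    return np.degrees(np.arccos(cos_theta))
```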

Figure 6: Viewpoint estimation results. We visually show the results of (a) head pose estimation on the BIWI [fanelli2013random] dataset and of viewpoint estimation on the test sets of the (b) car, (c) bus and (d) train categories from the PASCAL3D+ [xiang2014beyond] dataset. Solid arrows indicate predicted viewpoints, while the dashed arrows indicate their GT values. Our self-supervised method performs well for a wide range of head poses, identities and facial expressions. It also successfully handles different object appearances and lighting conditions from the car, bus and train categories. We show additional results in the supplementary material.

Baselines For head pose estimation, we compared with self-supervised landmark discovery techniques [zhang2018unsupervised, hung2019scops, tomas2018neurips] coupled with the PnP algorithm, which fits an average 3D face to the discovered landmarks. For objects like cars with full azimuth rotation, we notice that the landmarks produced by SCOPS [hung2019scops] and LMDIS [zhang2018unsupervised] cannot be used for reasonable viewpoint estimates. This is because SCOPS is primarily a self-supervised part segmentation framework that does not distinguish between the front and rear parts of the car; since the keypoints we compute are the centers of part segments, the resulting keypoints cannot distinguish such parts. LMDIS, on the other hand, produces keypoints only for the side profiles of cars. Hence, we use another baseline technique for comparisons on cars, trains and buses. Following the insights from [hung2019scops, thewlisICCVT2017] that features learned by image classification networks are equivariant to object rotation, we learn a linear regressor that maps the conv5 features of a pre-trained VGG network [Simonyan15_vgg] to the viewpoint of an object. To train this baseline, we use the VGG image features and the GT viewpoint annotations in the Pascal3D+ training dataset [xiang2014beyond]. We use the same Pascal3D+ annotations that are used to calibrate SSV’s predicted viewpoints to the GT canonical viewpoint axes. We consider this a self-supervised baseline since we do not use GT annotations for feature learning but only to map the features to viewpoint predictions. We refer to this baseline as VGG-View. As an additional baseline, we train HoloGAN [nguyenhologan] with an additional viewpoint output and a corresponding loss for it. Its viewpoint predictions are calibrated in the same way as SSV’s.

Comparisons We compare SSV to our baselines and also to several state-of-the-art supervised viewpoint estimation methods on the Pascal3D+ test dataset. Table 3 indicates that SSV significantly outperforms the baselines. With respect to supervised methods, SSV performs comparably to Tulsiani et al. [tulsiani2015viewpoints] and Mahendran et al. [mahendran20173d] in terms of median error. Interestingly, for the ‘train’ category, SSV performs even better than the supervised methods. These results demonstrate the general applicability of SSV for viewpoint learning on different object categories. We show some qualitative results for these categories in Figure 6(b)-(d).

Method Car Bus Train

Self-Sup

VGG-View 34.2 19.0 9.4
HoloGAN [nguyenhologan] with viewpoint loss 16.3 14.2 9.7
SSV-Full 10.1 9.0 5.3

Supervised

Tulsiani et al. [tulsiani2015viewpoints] 9.1 5.8 8.7
Mahendran et al. [mahendran20173d] 8.1 4.3 7.3
Liao et al. [LiaoCVPR19] 5.2 3.4 6.1
Grabner et al. [grabner20183d] 5.1 3.3 6.7
Table 3: Generalization to other object categories, median error. We show the median geodesic errors (in degrees) for the car, bus and train categories.
Method Car Bus Train

Self-sup

VGG-View 0.43 0.69 0.82
HoloGAN [nguyenhologan] with viewpoint loss 0.52 0.73 0.81
SSV-Full 0.67 0.82 0.96

Supervised

Tulsiani et al. [tulsiani2015viewpoints] 0.89 0.98 0.80
Mahendran et al. [mahendran20173d] - - -
Liao et al. [LiaoCVPR19] 0.93 0.97 0.84
Grabner et al. [grabner20183d] 0.93 0.97 0.80
Table 4: Generalization to other object categories, inlier count. We show the percentage of images with geodesic error less than π/6 for the car, bus and train categories.

6 Conclusions

In this work, we investigate the largely unexplored problem of learning viewpoint estimation in a self-supervised manner from collections of un-annotated object images. We design a viewpoint learning framework that receives supervision from a viewpoint-aware synthesis network, and from additional symmetry and adversarial constraints. We further supervise our synthesis network with additional losses to better control its image synthesis process. We show that our technique outperforms existing self-supervised techniques and performs competitively with fully-supervised ones on several object categories like faces, cars, buses and trains.

References

Appendix

In this supplement, we provide the architectural and training details of our SSV framework. In Section A we describe the architectures of both the viewpoint (V) and synthesis (S) networks. In Section B we present the various training hyperparameters and the training schedule. In Section C we examine the memory requirements and runtime of SSV. In Section D we provide additional visual viewpoint estimation results for all object categories (i.e., face, car, bus and train).

A Network Architecture

The network architectures of the viewpoint and synthesis networks are detailed in Tables 5 and 6, respectively. Both V and S operate at an image resolution of 128x128 pixels: V has an input size of 128x128, and S synthesizes images at the same resolution. We use Instance Normalization [ulyanovcvpr17instancenorm] in the viewpoint network. For the synthesis network, the size of the style code z is 128 for faces and 200 for the other objects (car, bus and train). The style code z is mapped to affine transformation parameters, which are in turn used by adaptive instance normalization (AdaIN) [huang2017arbitrary] to control the style of the synthesized images.

B Training Details

SSV is implemented in PyTorch [paszke2017automatic]. We open-source the code required to reproduce our results at https://github.com/NVlabs/SSV. We train both our viewpoint and synthesis networks from scratch, initializing all weights from a normal distribution and all biases to zero. The learning rate is 0.0001 for both V and S. We use the ADAM [kingma2015adam] optimizer with betas (0.9, 0.99) and no weight decay. We train the networks for 20 epochs.

Training Cycle  In each training iteration, we optimize V and S alternately. In the V optimization step, we compute the generative consistency, the discriminator loss and the symmetry constraint (Sections 3.1, 3.2 and 3.3 in the main paper). We freeze the parameters of S, compute the gradients of the losses with respect to the parameters of V, and perform an update step. In the alternate step, while optimizing S, we compute the paired style and viewpoint consistency, the flip image consistency and the adversarial loss (Section 4 in the paper). We freeze the parameters of V, compute the gradients of the losses with respect to the parameters of S, and perform an update step. We train separate networks for each object category.
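The alternating schedule can be sketched as follows, with tiny placeholder modules standing in for V and S and stand-in losses replacing the actual constraints of Sections 3 and 4; the optimizer settings mirror the hyperparameters stated above.

```python
import torch
import torch.nn as nn

def set_requires_grad(module, flag):
    for p in module.parameters():
        p.requires_grad_(flag)

# Tiny placeholder modules standing in for the viewpoint (V) and synthesis (S) networks.
V = nn.Linear(8, 8)
S = nn.Linear(8, 8)
opt_v = torch.optim.Adam(V.parameters(), lr=1e-4, betas=(0.9, 0.99), weight_decay=0)
opt_s = torch.optim.Adam(S.parameters(), lr=1e-4, betas=(0.9, 0.99), weight_decay=0)

x = torch.randn(4, 8)
for _ in range(2):                                # one iteration = one V step + one S step
    # --- update V with S frozen (image consistency, discriminator, symmetry losses) ---
    set_requires_grad(S, False); set_requires_grad(V, True)
    loss_v = S(V(x)).pow(2).mean()                # stand-in for the Section 3 losses
    opt_v.zero_grad(); loss_v.backward(); opt_v.step()

    # --- update S with V frozen (paired style/viewpoint, flip, adversarial losses) ---
    set_requires_grad(V, False); set_requires_grad(S, True)
    loss_s = V(S(x)).pow(2).mean()                # stand-in for the Section 4 losses
    opt_s.zero_grad(); loss_s.backward(); opt_s.step()
```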

C Runtime and Memory

Our viewpoint network runs in real time at 76 FPS; inference takes 13 milliseconds for a single image on an NVIDIA Titan X Pascal GPU. The memory consumed is 900 MB. We use a small viewpoint estimation network to achieve real-time performance and low memory consumption.

D Visual Results

In Figures 7, 8, 9 and 10, we present some additional visual results for the various object categories (faces, cars, buses and trains). It can be seen that the viewpoint estimation network reliably predicts viewpoints. For cars, it generalizes to car models like race cars and formula-1 cars, which are not seen by SSV during training. In each figure, we also show some failure cases in the last row. For faces, we observe that failures occur in cases where the viewpoint contains an extreme elevation or the face detection is noisy. For cars, viewpoint estimation is noisy when there is extreme blur in the image or when the car is so heavily occluded that it is difficult to identify it as a car. For buses, viewpoint estimation is erroneous when there is ambiguity between the rear and front parts of the object.

Layer Kernel Size stride Activation Normalization Output Dimension
Conv 1x1 1 LReLU - 128x128x128

Backbone Layers

Conv2D 3x3 1 LReLU Instance Norm 128x128x256
Conv2D 3x3 1 LReLU Instance Norm 128x128x256
Interpolate (scale = 0.5)
Conv2D 3x3 1 LReLU Instance Norm 64x64x512
Conv2D 3x3 1 LReLU Instance Norm 64x64x512
Interpolate (scale = 0.5)
Conv2D 3x3 1 LReLU Instance Norm 32x32x512
Conv2D 3x3 1 LReLU Instance Norm 32x32x512
Interpolate (scale = 0.5)
Conv2D 3x3 1 LReLU Instance Norm 16x16x512
Conv2D 3x3 1 LReLU Instance Norm 16x16x512
Interpolate (scale = 0.5)
Conv2D 3x3 1 LReLU Instance Norm 8x8x512
Conv2D 3x3 1 LReLU Instance Norm 8x8x512
Interpolate (scale = 0.5)
Conv2D 3x3 1 LReLU Instance Norm 4x4x512
Conv2D 4x4 1 LReLU - 1x1x512
Backbone output
FC-real/fake - - - - 1
FC-style - - - - code_dim

Azimuth

FC - - LReLU - 256
FC (|v|) - - - - 2
FC (sign(v)) - - - - 4

Elevation

FC - - LReLU - 256
FC (|v|) - - - - 2
FC (sign(v)) - - - - 4

Tilt

FC - - LReLU - 256
FC (|v|) - - - - 2
FC (sign(v)) - - - - 4
Table 5: Viewpoint Network Architecture. The network contains a backbone whose resultant fully-connected features are shared by the heads that predict (a) real/fake scores, (b) style codes, and (c) heads that predict azimuth, elevation and tilt values. All LReLU units have a slope of 0.2. FC indicates a fully connected layer.
Layer Kernel Size stride Activation Normalization Output Dimension
Input - 3D Code - - - - 4x4x4x512

Styled 3D Convs

Conv 3D 3x3 1 LReLU AdaIN 4x4x4x512
Conv 3D 3x3 1 LReLU AdaIN 4x4x4x512
Interpolate (scale = 2)
Conv 3D 3x3 1 LReLU AdaIN 8x8x8x512
Conv 3D 3x3 1 LReLU AdaIN 8x8x8x512
Interpolate (scale = 2)
Conv 3D 3x3 1 LReLU AdaIN 16x16x16x256
Conv 3D 3x3 1 LReLU AdaIN 16x16x16x256
3D Rotation
Conv 3D 3x3 1 LReLU - 16x16x16x128
Conv 3D 3x3 1 LReLU - 16x16x16x128
Conv 3D 3x3 1 LReLU - 16x16x16x64
Conv 3D 3x3 1 LReLU - 16x16x16x64

Project

Collapse - - - - 16x16x(16·64)
Conv 3x3 1 LReLU - 16x16x1024

Styled 2D Convs

Conv 2D 3x3 1 LReLU AdaIN 16x16x512
Conv 2D 3x3 1 LReLU AdaIN 16x16x512
Interpolate (scale = 2)
Conv 2D 3x3 1 LReLU AdaIN 32x32x256
Conv 2D 3x3 1 LReLU AdaIN 32x32x256
Interpolate (scale = 2)
Conv 2D 3x3 1 LReLU AdaIN 64x64x128
Conv 2D 3x3 1 LReLU AdaIN 64x64x128
Interpolate (scale = 2)
Conv 2D 3x3 1 LReLU AdaIN 128x128x64
Conv 2D 3x3 1 LReLU AdaIN 128x128x64
Out Conv 2D 3x3 1 - - 128x128x3
Table 6: Synthesis Network Architecture. This network contains a set of 3D and 2D convolutional blocks. A learnable 3D latent code is passed through stylized 3D convolution blocks, which also use style codes as inputs to their adaptive instance normalization (AdaIN [huang2017arbitrary]) layers. The resulting 3D features are then rotated using a rigid rotation via the input viewpoint. Following this, the 3D features are orthographically projected to become 2D features. These are then passed through a stylized 2D convolution network which has adaptive instance normalization layers to control the style of the synthesized image.
Figure 7: Viewpoint estimation results for the face category. SSV predicts reliable viewpoints for a variety of face poses with large variations in azimuth, elevation and tilt. The last row (below the black line) shows some erroneous cases where the faces are partially detected by the face detector or there are extreme elevation angles.
Figure 8: Viewpoint estimation results for the car category. SSV predicts reliable viewpoints for a variety of objects with large variations in azimuth, elevation and tilt. It generalizes to car models like race cars and formula-1 cars, which are not seen by SSV during training. The last row (below the black line) shows some erroneous cases where the objects have extreme motion blur or are so heavily occluded that it is difficult to identify them as cars.
Figure 9: Viewpoint estimation results for the bus category. SSV predicts reliable viewpoints for a variety of buses with large variations in azimuth, elevation and tilt. The last row (below the black line) shows erroneous viewpoints when there is ambiguity between the rear and front parts of the object.
Figure 10: Viewpoint estimation results for the train category. SSV predicts reliable viewpoints for a variety of objects with large variations in azimuth, elevation and tilt. The last row (below the black line) shows the erroneous viewpoints predicted by SSV.