Few-Shot Viewpoint Estimation

05/13/2019 · Hung-Yu Tseng et al.

Viewpoint estimation for known categories of objects has been improved significantly thanks to deep networks and large datasets, but generalization to unknown categories is still very challenging. With an aim towards improving performance on unknown categories, we introduce the problem of category-level few-shot viewpoint estimation. We design a novel framework to successfully train viewpoint networks for new categories with few examples (10 or less). We formulate the problem as one of learning to estimate category-specific 3D canonical shapes, their associated depth estimates, and semantic 2D keypoints. We apply meta-learning to learn weights for our network that are amenable to category-specific few-shot fine-tuning. Furthermore, we design a flexible meta-Siamese network that maximizes information sharing during meta-learning. Through extensive experimentation on the ObjectNet3D and Pascal3D+ benchmark datasets, we demonstrate that our framework, which we call MetaView, significantly outperforms fine-tuning the state-of-the-art models with few examples, and that the specific architectural innovations of our method are crucial to achieving good performance.


1 Introduction

Estimating the viewpoint (azimuth, elevation, and cyclorotation) of rigid objects, relative to the camera, is a fundamental problem in three-dimensional (3D) computer vision. It is vital to applications such as robotics [Tremblay et al.(2018a)Tremblay, To, Molchanov, Tyree, Kautz, and Birchfield], 3D model retrieval [Grabner et al.(2018)Grabner, Roth, and Lepetit], and reconstruction [Kundu et al.(2018)Kundu, Li, and Rehg]. With convolutional neural networks (CNNs) and the availability of many labeled examples [Chang et al.(2015)Chang, Funkhouser, Guibas, Hanrahan, Huang, Li, Savarese, Savva, Song, Su, et al., Xiang et al.(2016)Xiang, Kim, Chen, Ji, Choy, Su, Mottaghi, Guibas, and Savarese, Xiang et al.(2014)Xiang, Mottaghi, and Savarese], much progress has been made in estimating the viewpoint of known categories of objects [Grabner et al.(2018)Grabner, Roth, and Lepetit, Mousavian et al.(2017)Mousavian, Anguelov, Flynn, and Košecká, Pavlakos et al.(2017)Pavlakos, Zhou, Chan, Derpanis, and Daniilidis]. However, it remains challenging for even the best methods [Zhou et al.(2018)Zhou, Karpur, Luo, and Huang] to generalize well to unknown categories that the system did not encounter during training [Kuznetsova et al.(2016)Kuznetsova, Hwang, Rosenhahn, and Sigal, Tulsiani et al.(2015)Tulsiani, Carreira, and Malik, Zhou et al.(2018)Zhou, Karpur, Luo, and Huang]. In such a case, re-training the viewpoint estimation network on an unknown category would require annotating thousands of new examples, which is labor-intensive.

To improve the performance of viewpoint estimation on unknown categories with little annotation effort, we introduce the problem of few-shot viewpoint estimation, in which a few (10 or fewer) labeled training examples are used to train a viewpoint estimation network for each novel category. We are inspired by the facts that (a) humans are able to perform mental rotations of objects [Shepard and Metzler(1971)] and can successfully learn novel views from a few examples [Palmer(1999)]; and (b) recently, successful few-shot learning methods for several other vision tasks have been proposed [Finn et al.(2017)Finn, Abbeel, and Levine, Gui et al.(2018)Gui, Wang, Ramanan, and Moura, Park and Berg(2018)].

However, merely fine-tuning a viewpoint estimation network with a few examples of a new category can easily lead to over-fitting. To overcome this problem, we formulate the viewpoint estimation problem as one of learning to estimate category-specific 3D canonical keypoints, their 2D projections, and associated depth values from which viewpoint can be estimated. We use meta-learning [Andrychowicz et al.(2016)Andrychowicz, Denil, Gomez, Hoffman, Pfau, Schaul, Shillingford, and De Freitas, Finn et al.(2017)Finn, Abbeel, and Levine] to learn weights for our viewpoint network that are optimal for category-specific few-shot fine-tuning. Furthermore, we propose meta-Siamese, a flexible network design that maximizes information sharing during meta-learning and adapts to an arbitrary number of keypoints.

Through extensive evaluation on the ObjectNet3D [Xiang et al.(2016)Xiang, Kim, Chen, Ji, Choy, Su, Mottaghi, Guibas, and Savarese] and Pascal3D+ [Xiang et al.(2014)Xiang, Mottaghi, and Savarese] benchmark datasets, we show that our proposed method helps to significantly improve performance on unknown categories and outperforms fine-tuning the state-of-the-art models with a few examples.

To summarize, the main scientific contributions of our work are:

  • We introduce the problem of category-level few-shot viewpoint estimation, thus bridging viewpoint estimation and few-shot learning.

  • We design a novel meta-Siamese architecture and adapt meta-learning to learn weights for it that are optimal for category-level few-shot fine-tuning.

2 Related work

Viewpoint estimation. Many viewpoint estimation networks have been proposed for single [Kundu et al.(2018)Kundu, Li, and Rehg, Su et al.(2015)Su, Qi, Li, and Guibas, Tulsiani and Malik(2015)] or multiple [Grabner et al.(2018)Grabner, Roth, and Lepetit, Zhou et al.(2018)Zhou, Karpur, Luo, and Huang] categories; or individual instances [Rad et al.(2018)Rad, Oberweger, and Lepetit, Sundermeyer et al.(2018)Sundermeyer, Marton, Durner, Brucker, and Triebel] of objects. They use different network architectures, including those that estimate angular values directly [Kehl et al.(2017)Kehl, Manhardt, Tombari, Ilic, and Navab, Mousavian et al.(2017)Mousavian, Anguelov, Flynn, and Košecká, Su et al.(2015)Su, Qi, Li, and Guibas, Tulsiani and Malik(2015), Xiang et al.(2018)Xiang, Schmidt, Narayanan, and Fox]; encode images in latent spaces to match them against a dictionary of ground truth viewpoints [Massa et al.(2016)Massa, Russell, and Aubry, Sundermeyer et al.(2018)Sundermeyer, Marton, Durner, Brucker, and Triebel]; or detect projections of 3D bounding boxes [Grabner et al.(2018)Grabner, Roth, and Lepetit, Rad and Lepetit(2017), Tekin et al.(2017)Tekin, Sinha, and Fua, Tremblay et al.(2018a)Tremblay, To, Molchanov, Tyree, Kautz, and Birchfield] or of semantic keypoints [Pavlakos et al.(2017)Pavlakos, Zhou, Chan, Derpanis, and Daniilidis, Zhou et al.(2018)Zhou, Karpur, Luo, and Huang], which along with known [Pavlakos et al.(2017)Pavlakos, Zhou, Chan, Derpanis, and Daniilidis] or estimated [Zhou et al.(2018)Zhou, Karpur, Luo, and Huang, Grabner et al.(2018)Grabner, Roth, and Lepetit] 3D object structures are used to compute viewpoint. Zhou et al. propose the state-of-the-art StarMap method that detects multiple visible general keypoints [Zhou et al.(2018)Zhou, Karpur, Luo, and Huang] similar to SIFT [Lowe(2004)] or SURF [Bay et al.(2008)Bay, Ess, Tuytelaars, and Van Gool] via a learned CNN, and estimates category-level canonical 3D shapes. The existing viewpoint estimation methods are designed for known object categories and hence very few works report performance on unknown ones [Kuznetsova et al.(2016)Kuznetsova, Hwang, Rosenhahn, and Sigal, Tulsiani et al.(2015)Tulsiani, Carreira, and Malik, Zhou et al.(2018)Zhou, Karpur, Luo, and Huang]. Even highly successful techniques such as [Zhou et al.(2018)Zhou, Karpur, Luo, and Huang] perform significantly worse on unknown categories versus known ones. To our knowledge, no prior work has explored few-shot learning as a means of improving performance on novel categories and our work is the first to do so.

The existing viewpoint estimation networks also require large training datasets, and two of them, Pascal3D+ [Xiang et al.(2014)Xiang, Mottaghi, and Savarese] and ObjectNet3D [Xiang et al.(2016)Xiang, Kim, Chen, Ji, Choy, Su, Mottaghi, Guibas, and Savarese], with 12 and 100 categories, respectively, have helped to move the field forward. At the instance level, the LineMOD [Hinterstoisser et al.(2012)Hinterstoisser, Lepetit, Ilic, Holzer, Bradski, Konolige, and Navab], T-LESS [Hodaň et al.(2017)Hodaň, Haluza, Obdržálek, Matas, Lourakis, and Zabulis], OPT [Wu et al.(2017)Wu, Lee, Tseng, Ho, Yang, and Chien], and YCB-Video [Xiang et al.(2018)Xiang, Schmidt, Narayanan, and Fox] datasets, which contain images of no more than 30 known 3D objects, are widely used. Manual annotation of object viewpoint, by aligning 3D CAD models to images (e.g., Figure 1), or of 2D keypoints is a significant undertaking. To overcome this limitation, viewpoint estimation methods based on unsupervised learning [Suwajanakorn et al.(2018)Suwajanakorn, Snavely, Tompson, and Norouzi], general keypoints [Zhou et al.(2018)Zhou, Karpur, Luo, and Huang], and synthetic images [Rad et al.(2018)Rad, Oberweger, and Lepetit, Su et al.(2015)Su, Qi, Li, and Guibas, Sundermeyer et al.(2018)Sundermeyer, Marton, Durner, Brucker, and Triebel, Tremblay et al.(2018b)Tremblay, To, Sundaralingam, Xiang, Fox, and Birchfield, Xiang et al.(2018)Xiang, Schmidt, Narayanan, and Fox] have been proposed.

Few-shot learning. Successful few-shot learning algorithms for several vision tasks, besides viewpoint estimation, have been proposed recently. These include object recognition [Finn et al.(2017)Finn, Abbeel, and Levine, Ravi and Larochelle(2017), Rezende et al.(2016)Rezende, Mohamed, Danihelka, Gregor, and Wierstra, Santoro et al.(2016)Santoro, Bartunov, Botvinick, Wierstra, and Lillicrap, Snell et al.(2017)Snell, Swersky, and Zemel, Vinyals et al.(2016)Vinyals, Blundell, Lillicrap, Wierstra, et al.], segmentation [Rakelly et al.(2018)Rakelly, Shelhamer, Darrell, Efros, and Levine, Shaban et al.(2017)Shaban, Bansal, Liu, Essa, and Boots], online adaptation of trackers [Park and Berg(2018)], and human motion prediction [Gui et al.(2018)Gui, Wang, Ramanan, and Moura]. Several of these methods use meta-learning [Andrychowicz et al.(2016)Andrychowicz, Denil, Gomez, Hoffman, Pfau, Schaul, Shillingford, and De Freitas] to learn a “learner” that is amenable to few-shot learning. The learner may take the form of (a) a training algorithm [Finn et al.(2017)Finn, Abbeel, and Levine, Nichol et al.(2018)Nichol, Achiam, and Schulman, Ravi and Larochelle(2017)]; (b) a metric-space for representing tasks [Snell et al.(2017)Snell, Swersky, and Zemel, Vinyals et al.(2016)Vinyals, Blundell, Lillicrap, Wierstra, et al.]; or (c) a meta-recurrent network [Rezende et al.(2016)Rezende, Mohamed, Danihelka, Gregor, and Wierstra, Santoro et al.(2016)Santoro, Bartunov, Botvinick, Wierstra, and Lillicrap]. The MAML meta-learning algorithm [Finn et al.(2017)Finn, Abbeel, and Levine], which learns a set of network initialization weights that are optimal for few-shot fine-tuning, has been shown to be useful for many vision tasks.

In our work, we train networks for category-level viewpoint estimation. We further assume that we do not have access to 3D CAD models of any object or category to compute the viewpoint. Lastly, we endeavor to train viewpoint networks for new categories with very few examples—a task that has not been attempted previously.

Figure 2: Method overview. Our MetaView framework is composed of a category-agnostic feature extraction module (top left) and a category-specific viewpoint estimation module (top right). The bottom components show the different steps for training our viewpoint estimation module via meta-learning, or for adapting it to a new category (bottom left only), which are described in detail in Section 3.

3 Few-shot Viewpoint Estimation

Our proposed MetaView framework for category-level few-shot viewpoint estimation is shown in the top row of Figure 2. It consists of two main components: a category-agnostic feature extraction module designed to extract general features from images that help to improve the accuracy of the downstream viewpoint estimation task; and a category-specific viewpoint estimation module designed to compute the viewpoint of all objects of a specific category. The latter module, in turn, computes viewpoint by detecting a unique set of semantic keypoints (containing 3D, 2D, and depth values) via a category-specific feature extractor $\mathcal{F}$ and a set of keypoint detectors $\mathcal{K}_k$.

Our system operates in the following manner. We first train our feature extraction and viewpoint estimation modules using a training set $\mathcal{T}_{\text{train}}$ containing a finite set of object categories. We use standard supervised learning to train the feature extraction module and fix its weights for the following stages. We then use meta-learning to train our viewpoint estimation module, using an alternative training procedure designed to make it an effective few-shot “learner”. This means that when our trained viewpoint estimation module is further fine-tuned with a few examples of an unknown category, it generalizes well to other examples of that category.

At inference time, we assume that our system encounters a new category (not present during training) along with a few of its labeled examples from another set $\mathcal{T}_{\text{test}}$ (e.g., the category “monitor” shown in the lower part of Figure 2). We construct a unique viewpoint estimation network for it, initialize it with the optimal weights learned via meta-learning, and fine-tune it with the new category’s few labeled examples (lower left of Figure 2). This results in a viewpoint estimator that generalizes well to other examples of this new category (lower right of Figure 2). In the following sections, we describe the architecture and the training procedure for each component in more detail.

3.1 Feature Extraction

The first stage of our pipeline is a feature extraction module (top left of Figure 2), which is trained and used to extract features without regard to an object’s category. It consists of two ResNet-18-style [He et al.(2016)He, Zhang, Ren, and Sun] networks: one trained as described in [Zhou et al.(2018)Zhou, Karpur, Luo, and Huang] to extract a multi-peak heatmap for the locations of many visible general keypoints; and another whose first four convolutional blocks compute an identically-sized set of high-level convolutional features and is trained to detect 8 semantic keypoints for all categories by optimizing the loss in Eq. (6) described later in Section 3.2.2. We concatenate the multi-peak heatmap and high-level features and input them to the viewpoint estimation module. We train this module via standard supervised SGD learning and once trained, we fix its weights for all subsequent steps.
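To make the module layout concrete, the following PyTorch sketch shows one plausible way to wire the two branches together. The backbone truncation point, channel counts, heatmap head, and input resolution are our assumptions for illustration, not the paper's released implementation.

```python
import torch
import torch.nn as nn
from torchvision import models

def resnet18_trunk():
    # First convolutional blocks of a ResNet-18 (through layer3): 256-channel map, stride 16.
    net = models.resnet18(weights=None)
    return nn.Sequential(*list(net.children())[:-3])

class CategoryAgnosticFeatures(nn.Module):
    """Sketch of the two-branch, category-agnostic feature extractor of Section 3.1."""
    def __init__(self):
        super().__init__()
        self.general_kp_branch = resnet18_trunk()   # trained a la StarMap for general keypoints
        self.semantic_branch = resnet18_trunk()     # trained on semantic keypoints of training categories
        self.heatmap_head = nn.Conv2d(256, 1, kernel_size=1)  # multi-peak general-keypoint heatmap

    def forward(self, image):
        heatmap = torch.sigmoid(self.heatmap_head(self.general_kp_branch(image)))
        features = self.semantic_branch(image)
        # Concatenate the heatmap with the high-level features as input to the viewpoint module.
        return torch.cat([heatmap, features], dim=1)

feats = CategoryAgnosticFeatures()(torch.randn(2, 3, 256, 256))
print(feats.shape)  # torch.Size([2, 257, 16, 16])
```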

3.2 Viewpoint Estimation

Our viewpoint estimation module (top right in Figure 2) is specific to each category. It computes a 3D canonical shape for each category, along with its 2D image projections and depth estimates; and relates these quantities to compute an object’s viewpoint. Furthermore, it is trained via meta-learning to be an optimal few-shot “learner”. We describe its architecture and training procedure in the following sections.

3.2.1 Architecture

Viewpoint estimation via semantic keypoints. We assume that we have no knowledge of the 3D shape of any object in a category. To compute viewpoint, inspired by [Zhou et al.(2018)Zhou, Karpur, Luo, and Huang], we therefore train our viewpoint estimation module to estimate a set of 3D points $\{P_k\}_{k=1}^{N_c}$, which together represent a canonical shape for the entire category in an object-centric coordinate system (e.g., for the category “chair” it may comprise the corners of a stick-figure representation of a prototypical chair with a back, a seat, and four legs). Additionally, for each 3D point $P_k$, our network detects its 2D image projection $p_k$ and estimates its associated depth $d_k$. We refer collectively to the values $P_k$, $p_k$, and $d_k$ of a point as a “semantic keypoint”. Finally, we obtain the viewpoint (rotation) of an object by solving, via orthogonal Procrustes, the set of equations that relate each rotated 3D canonical point $P_k$ to its 2D image location $p_k$ and depth estimate $d_k$. Note that our viewpoint estimation module is different from that of Zhou et al. [Zhou et al.(2018)Zhou, Karpur, Luo, and Huang]: they detect the 2D projections of only the visible 3D canonical points, whereas we detect projections of all of them, thus providing more data for estimating viewpoint.
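As an illustration of this last step, here is a minimal NumPy sketch of recovering the rotation from predicted keypoints via orthogonal Procrustes. It assumes a weak-perspective camera so that the predicted 2D locations and depths can be stacked directly into camera-frame 3D points, which is a simplification of the actual solver.

```python
import numpy as np

def viewpoint_from_keypoints(canonical_3d, image_2d, depth):
    """Rotation aligning canonical 3D keypoints with predicted 2D locations and depths.

    canonical_3d: (N, 3), image_2d: (N, 2), depth: (N,).
    Assumes weak perspective: [u, v, d] are treated as camera-frame 3D coordinates.
    """
    cam_3d = np.concatenate([image_2d, depth[:, None]], axis=1)   # (N, 3)
    # Center both point sets (translation does not affect viewpoint).
    A = canonical_3d - canonical_3d.mean(axis=0)
    B = cam_3d - cam_3d.mean(axis=0)
    # Best rotation R minimizing sum ||R a_i - b_i||^2 via SVD (Kabsch / Procrustes).
    U, _, Vt = np.linalg.svd(B.T @ A)
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(U @ Vt))])       # avoid reflections
    return U @ D @ Vt                                             # maps canonical -> camera frame

# Toy check: recover a known rotation about the vertical axis.
rng = np.random.default_rng(0)
C = rng.normal(size=(8, 3))
a = np.deg2rad(30)
R_true = np.array([[np.cos(a), -np.sin(a), 0], [np.sin(a), np.cos(a), 0], [0, 0, 1]])
X = C @ R_true.T
print(np.allclose(viewpoint_from_keypoints(C, X[:, :2], X[:, 2]), R_true, atol=1e-6))  # True
```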

Semantic keypoint estimation. To locate the 2D image projection of the $k$-th semantic keypoint, the output of our network is a 2D heatmap $h_k$, produced by a spatial softmax layer, which predicts the probability of the keypoint being located at each image position $(u, v)$. We obtain the final image coordinates $\hat{p}_k$ via a weighted sum of the row and column values as:

$$\hat{p}_k = \sum_{(u,v)} h_k(u,v)\,[u, v]^T. \tag{1}$$

Our network similarly computes a 2D map of depth values $D_k$ that is the same size as $h_k$, along with three more maps $X_k$, one for each dimension of its 3D canonical point. We compute the final depth estimate $\hat{d}_k$ and 3D keypoint $\hat{P}_k$ as:

$$\hat{d}_k = \sum_{(u,v)} h_k(u,v)\,D_k(u,v), \qquad \hat{P}_k = \sum_{(u,v)} h_k(u,v)\,X_k(u,v). \tag{2}$$
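A small PyTorch sketch of this soft-argmax readout (Eqs. (1)-(2)) is shown below; the tensor shapes and the way the per-pixel maps are passed in are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def soft_argmax_keypoint(logits, depth_map, coord_maps):
    """Differentiable keypoint readout: spatial softmax, then heatmap-weighted averages.

    logits:     (B, H, W)    raw detector scores for one semantic keypoint
    depth_map:  (B, H, W)    per-pixel depth predictions D_k
    coord_maps: (B, 3, H, W) per-pixel predictions X_k of the 3D canonical point
    """
    B, H, W = logits.shape
    h = F.softmax(logits.view(B, -1), dim=-1).view(B, H, W)          # spatial softmax heatmap

    vs = torch.arange(H, dtype=h.dtype).view(1, H, 1)                # row index v
    us = torch.arange(W, dtype=h.dtype).view(1, 1, W)                # column index u
    p = torch.stack([(h * us).sum(dim=(1, 2)),                       # Eq. (1): expected (u, v)
                     (h * vs).sum(dim=(1, 2))], dim=-1)              # (B, 2)

    d = (h * depth_map).sum(dim=(1, 2))                              # Eq. (2): expected depth
    P = (h.unsqueeze(1) * coord_maps).sum(dim=(2, 3))                # (B, 3) canonical 3D point
    return h, p, d, P

h, p, d, P = soft_argmax_keypoint(torch.randn(2, 16, 16),
                                  torch.randn(2, 16, 16),
                                  torch.randn(2, 3, 16, 16))
print(p.shape, d.shape, P.shape)  # torch.Size([2, 2]) torch.Size([2]) torch.Size([2, 3])
```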

Category-specific keypoint estimation. Given a category $c$ with $N_c$ semantic keypoints, our viewpoint estimation module must detect its unique semantic keypoints via a category-specific feature extractor $\mathcal{F}$ followed by a set of semantic keypoint detectors $\{\mathcal{K}_k\}_{k=1}^{N_c}$ (lower left of Figure 2). Each keypoint detector $\mathcal{K}_k$ detects one unique category-specific semantic keypoint, while the feature extractor computes the common features required by all of them. Since our viewpoint estimation module must adapt to multiple categories with different numbers of semantic keypoints, it cannot have a fixed number of pre-defined keypoint detectors. To flexibly change the number of keypoint detectors for each novel category, we propose a meta-Siamese architecture (lower left of Figure 2), which we operate as follows. For each new category $c$, we replicate a generic pre-trained keypoint detector $\mathcal{K}$ $N_c$ times and fine-tune each of its copies to detect one specific keypoint of the new category, thus creating a unique set of specialized keypoint detectors for each new category, as sketched below.
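In code, the meta-Siamese construction amounts to deep-copying the generic detector head once per keypoint. The sketch below assumes a hypothetical one-layer detector head with five output maps (heatmap logits, depth, and three canonical-coordinate maps), which is not the paper's actual head architecture.

```python
import copy
import torch.nn as nn

def build_meta_siamese(generic_detector: nn.Module, num_keypoints: int) -> nn.ModuleList:
    """Replicate the generic keypoint detector (parameters theta_K) N_c times; each copy
    is subsequently fine-tuned to detect one specific keypoint of the new category."""
    return nn.ModuleList([copy.deepcopy(generic_detector) for _ in range(num_keypoints)])

# e.g., a category with 7 keypoints gets 7 detector heads sharing the same initialization.
generic = nn.Conv2d(257, 5, kernel_size=1)   # hypothetical head: score, depth, x, y, z maps
detectors = build_meta_siamese(generic, 7)
print(len(detectors))  # 7
```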

3.2.2 Training

Our goal is to train the viewpoint estimation module to be an effective few-shot learner. In other words, its learned feature extractor $\mathcal{F}$ and semantic keypoint detector $\mathcal{K}$, after being fine-tuned with a few examples of a new category (lower left in Figure 2), should learn to extract features for it and to detect each of its unique keypoints, respectively. To learn the optimal weights that make our viewpoint estimation module amenable to few-shot fine-tuning without catastrophically over-fitting, we adopt the MAML algorithm [Finn et al.(2017)Finn, Abbeel, and Levine].

MAML optimizes a special meta-objective using a standard optimization algorithm, e.g., SGD. In standard supervised learning the objective is to minimize only the training loss for a task during each iteration of optimization. However, the meta-objective in MAML is to explicitly minimize, during each iteration, the generalization loss for a task after a network has been fine-tuned with a few of its labeled examples. We describe our meta-training procedure to learn the optimal weights for our viewpoint estimation module as follows.

During each iteration of meta-training, we sample a random task from $\mathcal{T}_{\text{train}}$. A task comprises a support set $S_c$ and a query set $Q_c$, containing 10 and 3 labeled examples, respectively, of a category $c$. The term “shot” refers to the number of examples in the support set $S_c$. For this category, containing $N_c$ semantic keypoints, we replicate our generic keypoint detector $\mathcal{K}$ $N_c$ times to construct its unique meta-Siamese keypoint detector with the parameters $\theta = \{\theta_{\mathcal{F}}, \theta_{\mathcal{K}_1}, \ldots, \theta_{\mathcal{K}_{N_c}}\}$ (lower left in Figure 2), and initialize each $\theta_{\mathcal{K}_k}$ with $\theta_{\mathcal{K}}$. We use it to estimate the support set’s semantic keypoints and, given their ground-truth values, compute the following loss:

$$\mathcal{L}_{S_c}(\theta) = \lambda_{2D}\,\mathcal{L}_{2D} + \lambda_{3D}\,\mathcal{L}_{3D} + \lambda_{d}\,\mathcal{L}_{d}, \tag{3}$$

where $\mathcal{L}_{2D}$, $\mathcal{L}_{3D}$, and $\mathcal{L}_{d}$ are the average regression losses for correctly estimating the semantic keypoints’ 2D positions, 3D positions, and depths, respectively. The parameters $\lambda_{2D}$, $\lambda_{3D}$, and $\lambda_{d}$ control the relative importance of each loss term. We compute the gradient of this loss w.r.t. the network’s parameters $\theta$ and use a single step of SGD to update $\theta$ to $\theta'$ with a learning rate of $\alpha$:

$$\theta' = \theta - \alpha\,\nabla_{\theta}\,\mathcal{L}_{S_c}(\theta). \tag{4}$$

Next, with the updated model parameters $\theta'$, we compute the loss for the query set $Q_c$ (lower right in Figure 2). Note that this generalization loss of our few-shot fine-tuned network serves as the final meta-objective minimized in each iteration of meta-training. To compute the query loss, in addition to the loss terms described in (3), we also use a weighted concentration loss term:

$$\mathcal{L}_{conc} = \sum_{k=1}^{N_c} \sum_{(u,v)} h_k(u,v)\,\big\| [u, v]^T - \hat{p}_k \big\|_2, \tag{5}$$

which forces the distribution of each 2D keypoint’s heatmap $h_k$ to be peaky around the predicted position $\hat{p}_k$. We find that this concentration loss term helps to improve the accuracy of 2D keypoint detection. Our final query loss is:

$$\mathcal{L}_{Q_c}(\theta') = \lambda_{2D}\,\mathcal{L}_{2D} + \lambda_{3D}\,\mathcal{L}_{3D} + \lambda_{d}\,\mathcal{L}_{d} + \lambda_{conc}\,\mathcal{L}_{conc}. \tag{6}$$
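The sketch below shows one plausible implementation of the concentration term and the weighted query loss; the exact form of the penalty and the values of the loss weights are assumptions based on the description above.

```python
import torch

def concentration_loss(heatmaps, p_hat):
    """Possible form of the concentration term of Eq. (5): a heatmap-weighted distance
    between every pixel location and the predicted keypoint position, which is small
    only when the spatial-softmax heatmap is peaky around that position.

    heatmaps: (K, H, W) spatial-softmax heatmaps; p_hat: (K, 2) predicted (u, v) locations.
    """
    K, H, W = heatmaps.shape
    vs = torch.arange(H, dtype=heatmaps.dtype).view(1, H, 1)
    us = torch.arange(W, dtype=heatmaps.dtype).view(1, 1, W)
    dist = ((us - p_hat[:, 0].view(K, 1, 1)) ** 2 +
            (vs - p_hat[:, 1].view(K, 1, 1)) ** 2).sqrt()
    return (heatmaps * dist).sum(dim=(1, 2)).mean()

def query_loss(l_2d, l_3d, l_depth, l_conc, weights=(1.0, 1.0, 1.0, 1.0)):
    """Eq. (6) as a weighted sum; the weights lambda are left unspecified in this sketch."""
    w = weights
    return w[0] * l_2d + w[1] * l_3d + w[2] * l_depth + w[3] * l_conc

h = torch.softmax(torch.randn(4, 16 * 16), dim=-1).view(4, 16, 16)
print(concentration_loss(h, torch.full((4, 2), 8.0)))
```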

Finally, we optimize the network’s initial parameters $\theta_{\mathcal{F}}$ and $\theta_{\mathcal{K}}$ w.r.t. its query loss using:

$$\theta_{\mathcal{F}} \leftarrow \theta_{\mathcal{F}} - \beta\,\nabla_{\theta_{\mathcal{F}}}\,\mathcal{L}_{Q_c}(\theta'), \tag{7}$$
$$\theta_{\mathcal{K}} \leftarrow \theta_{\mathcal{K}} - \beta\,\frac{1}{N_c}\sum_{k=1}^{N_c} \nabla_{\theta_{\mathcal{K}_k}}\,\mathcal{L}_{Q_c}(\theta'). \tag{8}$$

We repeat the meta-training iterations until our viewpoint estimation module converges to the optimal parameters $\theta_{\mathcal{F}}^*$ and $\theta_{\mathcal{K}}^*$. Notice that in Eq. (8) we compute the optimal weights for the generic keypoint detector $\mathcal{K}$ by averaging the gradients of all the duplicated keypoint detectors $\mathcal{K}_k$. We find that this novel design feature of our network, along with its shared category-level feature extractor with parameters $\theta_{\mathcal{F}}$, helps to improve accuracy. They enable efficient use of all the available keypoints to learn the optimal values for $\theta_{\mathcal{F}}$ and $\theta_{\mathcal{K}}$ during meta-training, which is especially important when training data is scarce.
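The following toy PyTorch sketch walks through one meta-training iteration (Eqs. (3)-(8)) with linear stand-ins for the feature extractor and keypoint detectors; shapes, loss functions, and learning rates are illustrative assumptions, and the real method operates on the heatmap-based losses described above.

```python
import torch

torch.manual_seed(0)
alpha, beta = 0.01, 0.001                          # inner / outer learning rates
theta_F = torch.randn(8, 4, requires_grad=True)    # shared feature-extractor parameters
theta_K = torch.randn(4, 1, requires_grad=True)    # generic (single-keypoint) detector parameters

def keypoint_loss(x, y_k, params_F, params_K):
    # Stand-in for the per-keypoint regression terms inside Eqs. (3) and (6).
    return ((x @ params_F @ params_K - y_k) ** 2).mean()

# Sample one task: a category with N_c keypoints, a 10-shot support set, a 3-image query set.
N_c = 6
xs, ys = torch.randn(10, 8), torch.randn(10, N_c)
xq, yq = torch.randn(3, 8), torch.randn(3, N_c)

# Inner adaptation (Eqs. (3)-(4)): replicate the generic detector N_c times and take one SGD
# step; create_graph=True keeps the graph so the meta-gradient can flow through this step.
support_losses = [keypoint_loss(xs, ys[:, k:k + 1], theta_F, theta_K) for k in range(N_c)]
gF = torch.autograd.grad(sum(support_losses), theta_F, create_graph=True)[0]
theta_F_prime = theta_F - alpha * gF
adapted_K = []
for loss_k in support_losses:                      # each copy adapts to its own keypoint
    gK_k = torch.autograd.grad(loss_k, theta_K, create_graph=True)[0]
    adapted_K.append(theta_K - alpha * gK_k)

# Meta-objective (Eq. (6)): generalization loss of the adapted network on the query set.
query_loss = sum(keypoint_loss(xq, yq[:, k:k + 1], theta_F_prime, adapted_K[k])
                 for k in range(N_c)) / N_c

# Outer update (Eqs. (7)-(8)): gradients w.r.t. the initial parameters. Because every adapted
# copy was built from the same theta_K, autograd sums the per-copy contributions, and the
# 1/N_c factor above turns that sum into the average over the duplicated detectors.
grad_F, grad_K = torch.autograd.grad(query_loss, [theta_F, theta_K])
with torch.no_grad():
    theta_F -= beta * grad_F
    theta_K -= beta * grad_K
print(f"query loss this iteration: {query_loss.item():.3f}")
```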

3.2.3 Inference

We evaluate how well our viewpoint estimation module, with the parameters $\theta_{\mathcal{F}}^*$ and $\theta_{\mathcal{K}}^*$ learned via meta-learning, adapts to unseen categories. Similar to meta-training, we sample tasks from $\mathcal{T}_{\text{test}}$ with the same shot size. For each task, we construct its unique viewpoint estimation network and fine-tune it with its support set $S_c$ by minimizing the loss in Eq. (3). This results in a few-shot learned network for that category. We then evaluate its generalization performance with all testing images of that category.
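At adaptation time the procedure reduces to ordinary fine-tuning on the support set. The toy sketch below, continuing the hypothetical stand-in modules from the previous sketch, illustrates this step.

```python
import torch

torch.manual_seed(1)
theta_F_star, theta_K_star = torch.randn(8, 4), torch.randn(4, 1)   # meta-learned weights (toy)
N_c = 5
xs, ys = torch.randn(10, 8), torch.randn(10, N_c)                   # 10-shot support set of the new category

# Replicate the generic detector per keypoint and fine-tune everything on Eq. (3).
params = [theta_F_star.clone().requires_grad_(True)] + \
         [theta_K_star.clone().requires_grad_(True) for _ in range(N_c)]
opt = torch.optim.SGD(params, lr=0.01)
for _ in range(50):                                                  # a few SGD steps on the support set
    opt.zero_grad()
    loss = sum(((xs @ params[0] @ params[1 + k] - ys[:, k:k + 1]) ** 2).mean()
               for k in range(N_c))
    loss.backward()
    opt.step()
print(f"support loss after adaptation: {loss.item():.3f}")
```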

Method bed bookshelf calculator cellphone computer fcabinet guitar iron knife microwave
StarMap (zero) 0.37 / 45.1 0.69 / 18.5 0.19 / 61.8 0.51 / 29.8 0.74 / 15.6 0.78 / 14.1 0.64 / 20.4 0.02 / 142 0.08 / 136 0.89 / 12.2
StarMap* (zero) 0.31 / 45.0 0.63 / 22.2 0.27 / 52.2 0.51 / 29.8 0.64 / 24.2 0.78 / 15.8 0.52 / 28.0 0.00 / 134 0.06 / 124 0.82 / 16.9
Baseline (zero) 0.26 / 49.1 0.57 / 25.0 0.78 / 53.3 0.38 / 45.5 0.66 / 20.3 0.73 / 18.7 0.39 / 44.6 0.06 / 135 0.08 / 127 0.82 / 16.8
StarMap* + fine-tune 0.32 / 47.2 0.61 / 21.0 0.26 / 50.6 0.56 / 26.8 0.59 / 24.4 0.76 / 17.1 0.54 / 27.9 0.00 / 128 0.05 / 120 0.82 / 19.0
Baseline + fine-tune 0.28 / 43.7 0.67 / 22.0 0.77 / 18.4 0.45 / 34.6 0.67 / 22.7 0.67 / 21.5 0.27 / 52.1 0.02 / 127 0.06 / 108 0.85 / 16.6
StarMap* + MAML 0.32 / 42.2 0.76 / 15.7 0.58 / 26.8 0.59 / 22.2 0.69 / 19.2 0.76 / 15.5 0.59 / 21.5 0.00 / 136 0.08 / 117 0.82 / 17.3
Ours 0.36 / 37.5 0.76 / 17.2 0.92 / 12.3 0.58 / 25.1 0.70 / 22.2 0.66 / 22.9 0.63 / 24.0 0.20 / 76.9 0.05 / 97.9 0.77 / 17.9
Method pot rifle slipper stove toilet tub wheelchair TOTAL
StarMap (zero) 0.50 / 30.0 0.00 / 104 0.11 / 146 0.82 / 12.0 0.43 / 35.8 0.49 / 31.8 0.14 / 93.8 0.44 / 39.3
StarMap* (zero) 0.51 / 29.2 0.02 / 97.4 0.10 / 130 0.81 / 13.9 0.44 / 34.4 0.37 / 37.0 0.17 / 74.4 0.43 / 39.4
Baseline (zero) 0.46 / 38.8 0.00 / 98.6 0.09 / 123 0.82 / 14.8 0.32 / 39.5 0.29 / 50.4 0.14 / 71.6 0.38 / 44.6
StarMap* + fine-tune 0.51 / 29.9 0.02 / 100 0.08 / 128 0.80 / 16.1 0.38 / 36.8 0.35 / 39.8 0.18 / 80.4 0.41±0.00 / 41.0±0.22
Baseline + fine-tune 0.38 / 39.1 0.01 / 107 0.03 / 123 0.72 / 21.6 0.31 / 39.9 0.28 / 48.5 0.15 / 70.8 0.40±0.02 / 39.1±1.79
StarMap* + MAML 0.51 / 28.2 0.01 / 100 0.15 / 128 0.83 / 15.6 0.39 / 35.5 0.41 / 38.5 0.24 / 71.5 0.46±0.01 / 33.9±0.16
Ours 0.49 / 31.6 0.21 / 80.9 0.07 / 115 0.74 / 21.7 0.50 / 32.0 0.29 / 46.5 0.27 / 55.8 0.48±0.01 / 31.5±0.72
Table 1: Intra-dataset experiment. We report Acc30 (higher is better) / MedErr in degrees (lower is better). All models are trained and evaluated on 76 and 17 categories from ObjectNet3D, respectively. The “zero” methods do not use images of unknown categories for training; all others involve few-shot learning and report mean ± standard deviation over ten runs in the TOTAL column.
Method aero bike boat bottle bus car chair
StarMap (zero) 0.04 / 97.7 0.10 / 90.42 0.14 / 78.42 0.81 / 16.7 0.54 / 29.4 0.25 / 67.8 0.19 / 97.3
StarMap* (zero) 0.02 / 112 0.02 / 102 0.06 / 110 0.44 / 34.3 0.48 / 32.7 0.18 / 87.0 0.29 / 70.0
Baseline (zero) 0.03 / 114 0.06 / 101 0.10 / 95 0.41 / 36.6 0.36 / 42.0 0.14 / 93.7 0.26 / 71.5
StarMap* + fine-tune 0.03 / 102 0.05 / 98.8 0.07 / 98.9 0.48 / 31.9 0.46 / 33.0 0.18 / 80.8 0.22 / 74.6
Baseline + fine-tune 0.02 / 113 0.04 / 112 0.11 / 93.4 0.39 / 37.1 0.35 / 39.9 0.11 / 99.0 0.21 / 75.0
StarMap* + MAML 0.03 / 99.2 0.08 / 88.4 0.11 / 92.2 0.55 / 28.0 0.49 / 31.0 0.21 / 81.4 0.21 / 80.2
Ours 0.12 / 104 0.08 / 91.3 0.09 / 108 0.71 / 24.0 0.64 / 22.8 0.22 / 73.3 0.20 / 89.1
Method table mbike sofa train tv TOTAL
StarMap (zero) 0.62 / 23.3 0.15 / 70.0 0.23 / 49.0 0.63 / 25.7 0.46 / 31.3 0.32 / 50.1
StarMap* (zero) 0.43 / 31.7 0.09 / 86.7 0.26 / 42.5 0.30 / 46.8 0.59 / 24.7 0.25 / 71.2
Baseline (zero) 0.38 / 39.0 0.11 / 82.3 0.39 / 57.5 0.29 / 50.0 0.63 / 24.3 0.24 / 70.0
StarMap* + fine-tune 0.46 / 31.4 0.09 / 91.6 0.32 / 44.7 0.36 / 41.7 0.52 / 29.1 0.25±0.01 / 64.7±1.07
Baseline + fine-tune 0.41 / 35.1 0.09 / 79.1 0.32 / 58.1 0.29 / 51.3 0.59 / 29.9 0.22±0.02 / 69.2±1.48
StarMap* + MAML 0.29 / 36.8 0.11 / 83.5 0.44 / 42.9 0.42 / 33.9 0.64 / 25.3 0.28±0.00 / 60.5±0.10
Ours 0.39 / 36.0 0.14 / 74.7 0.29 / 46.2 0.61 / 23.8 0.58 / 26.3 0.33±0.02 / 51.3±4.28
Table 2: Inter-dataset experiment. We report Acc30 (higher is better) / MedErr in degrees (lower is better). All models are trained on ObjectNet3D and evaluated on Pascal3D+. The “zero” methods do not use images of unknown categories for training; all others involve few-shot learning and report mean ± standard deviation over ten runs in the TOTAL column.

4 Results

Experimental settings. We evaluate our method in two different experimental settings. First, we follow the intra-dataset experiment of [Zhou et al.(2018)Zhou, Karpur, Luo, and Huang] and split the categories in ObjectNet3D [Xiang et al.(2016)Xiang, Kim, Chen, Ji, Choy, Su, Mottaghi, Guibas, and Savarese] into 76 for training and 17 for testing. Second, we conduct an inter-dataset experiment: from ObjectNet3D, we exclude the 12 categories that are also present in Pascal3D+ [Xiang et al.(2014)Xiang, Mottaghi, and Savarese], use the remaining categories in ObjectNet3D for training, and evaluate performance on Pascal3D+. Following [Tulsiani and Malik(2015)], we discard the images with occluded or truncated objects from the test set in both experiments. We use two metrics for evaluation: 1) Acc30, the percentage of views with rotational error less than 30°, and 2) MedErr, the median rotational error across a dataset, measured in degrees. We compute the rotational error as $\Delta(R_{gt}, R_{pred}) = \| \log(R_{gt}^T R_{pred}) \|_F / \sqrt{2}$, where $\|\cdot\|_F$ is the Frobenius norm, and $R_{gt}$ and $R_{pred}$ are the ground-truth and predicted rotation matrices, respectively.
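For concreteness, the metrics can be computed as below; this sketch uses the equivalent angle form of the geodesic distance, and the helper names are ours.

```python
import numpy as np

def rotation_error_deg(R_gt, R_pred):
    """Geodesic rotation error ||log(R_gt^T R_pred)||_F / sqrt(2), computed via the
    equivalent angle formula arccos((trace(R_gt^T R_pred) - 1) / 2), in degrees."""
    R = R_gt.T @ R_pred
    cos_theta = np.clip((np.trace(R) - 1.0) / 2.0, -1.0, 1.0)
    return np.degrees(np.arccos(cos_theta))

def acc30_and_mederr(errors_deg):
    """Acc30: fraction of test views with error below 30 degrees; MedErr: median error (deg)."""
    errors_deg = np.asarray(errors_deg)
    return float((errors_deg < 30.0).mean()), float(np.median(errors_deg))

# Example: a prediction 20 degrees off about the vertical axis counts toward Acc30.
a = np.deg2rad(20)
R_gt = np.eye(3)
R_pred = np.array([[np.cos(a), -np.sin(a), 0], [np.sin(a), np.cos(a), 0], [0, 0, 1]])
print(rotation_error_deg(R_gt, R_pred))  # ~20.0
```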

Method bed bookshelf calculator cellphone computer fcabinet guitar iron knife microwave
Ours 0.28 / 42.3 0.68 / 23.1 0.87 / 15.3 0.47 / 32.1 0.63 / 24.9 0.71 / 22.1 0.03 / 100 0.15 / 76.0 0.01 / 121 0.69 / 23.2
Ours (MS) 0.27 / 42.4 0.77 / 22.2 0.74 / 24.0 0.54 / 28.3 0.64 / 24.9 0.63 / 25.3 0.61 / 25.3 0.13 / 76.9 0.05 / 103 0.65 / 26.2
Ours (MS, $\mathcal{L}_{conc}$) 0.31 / 41.3 0.79 / 19.0 0.84 / 17.4 0.53 / 28.0 0.62 / 25.9 0.66 / 23.6 0.35 / 35.8 0.16 / 86.5 0.05 / 101 0.81 / 17.7
Ours (MS, $\mathcal{L}_{conc}$, KP) 0.36 / 37.5 0.76 / 17.2 0.92 / 12.3 0.58 / 25.1 0.70 / 22.2 0.66 / 22.9 0.63 / 24.0 0.20 / 76.9 0.05 / 97.9 0.77 / 17.9
Method pot rifle slipper stove toilet tub wheelchair TOTAL
Ours 0.46 / 32.1 0.04 / 119 0.02 / 125 0.81 / 19.5 0.15 / 51.2 0.26 / 45.9 0.02 / 109 0.35±0.01 / 42.5±1.15
Ours (MS) 0.34 / 37.4 0.18 / 78.8 0.05 / 111 0.71 / 21.5 0.37 / 35.8 0.24 / 44.8 0.10 / 76.1 0.41±0.01 / 36.0±0.78
Ours (MS, $\mathcal{L}_{conc}$) 0.49 / 31.2 0.16 / 90.5 0.05 / 111 0.75 / 21.7 0.41 / 34.4 0.31 / 42.4 0.22 / 60.8 0.45±0.01 / 33.6±0.94
Ours (MS, $\mathcal{L}_{conc}$, KP) 0.49 / 31.6 0.21 / 80.9 0.07 / 115 0.74 / 21.7 0.50 / 32.0 0.29 / 46.5 0.27 / 55.8 0.48±0.01 / 31.5±0.72
Table 3: Ablation study. The table shows the individual contributions of our meta-Siamese design (MS), the concentration loss ($\mathcal{L}_{conc}$), and the general-keypoint heatmap input (KP) to the performance of MetaView in the inter-dataset experiment. We report Acc30 (higher is better) / MedErr in degrees (lower is better).

Comparisons. We compare several viewpoint estimation networks to ours (see Tables 1 and 2): the original StarMap method [Zhou et al.(2018)Zhou, Karpur, Luo, and Huang] evaluated zero-shot; StarMap*, our ResNet-18 re-implementation of it, and a Baseline network, each evaluated either zero-shot or with few-shot fine-tuning; StarMap* trained with MAML; and our full MetaView model.

For methods that involve few-shot fine-tuning on unknown categories (i.e., StarMap* or Baseline with fine-tuning, StarMap* + MAML, and Ours), we use a shot size of 10. We repeat each experiment ten times with random initial seeds and report the average performance. Lastly, note that we also attempted to train, with MAML, viewpoint estimation networks that estimate angular values directly (e.g., [Xiang et al.(2018)Xiang, Schmidt, Narayanan, and Fox]) or that detect projections of 3D bounding boxes (e.g., [Grabner et al.(2018)Grabner, Roth, and Lepetit]), but they either failed to converge or performed very poorly, so we do not report results for them. The results of the intra-dataset and inter-dataset experiments are presented in Table 1 and Table 2, respectively.

Zero-shot performance. In both experiments, methods trained using standard supervised learning solely on the training categories (i.e., StarMap, StarMap*, and Baseline, denoted by “zero”) are limited in their ability to generalize to unknown categories. For the original StarMap method [Zhou et al.(2018)Zhou, Karpur, Luo, and Huang] in the intra-dataset experiment, the overall Acc30 and MedErr are considerably worse when the test categories are unknown to the system (0.44 and 39.3°, respectively, in Table 1) than when they are known. This indicates that the existing state-of-the-art viewpoint estimation networks require information that is unique to each category to infer its viewpoint. Note that since the original StarMap [Zhou et al.(2018)Zhou, Karpur, Luo, and Huang] method uses a larger backbone network than ResNet-18 [He et al.(2016)He, Zhang, Ren, and Sun], it performs better than our implementation (StarMap*) of it.

Few-shot performance. Among the methods that involve few-shot fine-tuning for unknown categories, those trained via meta-learning (StarMap* + MAML and our MetaView) perform significantly better than those that are not (StarMap* or Baseline with fine-tuning) in both the intra- and inter-dataset experiments. These results are the first demonstration of the effectiveness of meta-learning at the task of category-level few-shot viewpoint learning. Furthermore, in both experiments, our MetaView framework achieves the best overall performance of all the zero- and few-shot learning methods. It outperforms StarMap* + MAML, which shows the effectiveness of the novel design components that differentiate it from merely training StarMap* with MAML. These include our network’s ability to (a) detect the 2D locations and depth values of all 3D canonical points and not just the visible ones; (b) share information during meta-learning via the meta-Siamese design; and (c) flexibly construct networks with a different number of keypoints for each category. Lastly, observe that even with a smaller backbone network, our method surpasses the previous best result for viewpoint estimation of unknown categories, i.e., zero-shot StarMap [Zhou et al.(2018)Zhou, Karpur, Luo, and Huang], and thus improves performance on unknown categories with very little additional labeling effort.

The effectiveness of MetaView is also evident in Figure 3, which shows examples of the 2D keypoint heatmaps (Section 3.2.1) produced by it before and after few-shot fine-tuning with examples of new categories. The keypoint detector, prior to few-shot fine-tuning, is not specific to any keypoint and generates heatmaps that tend to have high responses on corners, edges, or regions of the foreground object. After fine-tuning, however, it successfully learns to detect keypoints of various new categories and produces heatmaps with more concentrated peaks.

Figure 3: Qualitative results of the intra-dataset experiment. We show the keypoint detection results of MetaView on unknown categories, before and after few-shot fine-tuning. The images from left to right are: the input image, the 2D keypoint heatmap before fine-tuning with Eq. (3), three example heatmaps for specific keypoints after fine-tuning, all the predicted keypoints, and their ground truth values.

Ablation study. To validate the effectiveness of our novel design components, including the meta-Siamese design, the concentration loss term, and the use of the general-keypoint multi-peak heatmap as input, we show the results of an ablation study for the inter-dataset experiment in Table 3. While each component individually contributes to the overall performance, the concentration loss and the meta-Siamese design contribute the most.

5 Conclusion

To improve performance on unknown categories, we introduce the problem of category-level few-shot viewpoint estimation. We propose the novel MetaView framework that successfully adapts to unknown categories with few labeled examples and helps to improve performance on them with little additional annotation effort. Our meta-Siamese keypoint detector is general and can be explored in the future for other few-shot tasks requiring keypoint detection.

References

  • [Andrychowicz et al.(2016)Andrychowicz, Denil, Gomez, Hoffman, Pfau, Schaul, Shillingford, and De Freitas] Marcin Andrychowicz, Misha Denil, Sergio Gomez, Matthew W Hoffman, David Pfau, Tom Schaul, Brendan Shillingford, and Nando De Freitas. Learning to learn by gradient descent by gradient descent. In NIPS, 2016.
  • [Bay et al.(2008)Bay, Ess, Tuytelaars, and Van Gool] Herbert Bay, Andreas Ess, Tinne Tuytelaars, and Luc Van Gool. Speeded-up robust features (surf). CVIU, 110(3):346–359, 2008.
  • [Chang et al.(2015)Chang, Funkhouser, Guibas, Hanrahan, Huang, Li, Savarese, Savva, Song, Su, et al.] Angel X Chang, Thomas Funkhouser, Leonidas Guibas, Pat Hanrahan, Qixing Huang, Zimo Li, Silvio Savarese, Manolis Savva, Shuran Song, Hao Su, et al. Shapenet: An information-rich 3D model repository. arXiv preprint arXiv:1512.03012, 2015.
  • [Finn et al.(2017)Finn, Abbeel, and Levine] Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In ICML, 2017.
  • [Grabner et al.(2018)Grabner, Roth, and Lepetit] Alexander Grabner, Peter M Roth, and Vincent Lepetit. 3D pose estimation and 3D model retrieval for objects in the wild. In CVPR, 2018.
  • [Gui et al.(2018)Gui, Wang, Ramanan, and Moura] Liang-Yan Gui, Yu-Xiong Wang, Deva Ramanan, and José MF Moura. Few-shot human motion prediction via meta-learning. In ECCV, 2018.
  • [He et al.(2016)He, Zhang, Ren, and Sun] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.
  • [Hinterstoisser et al.(2012)Hinterstoisser, Lepetit, Ilic, Holzer, Bradski, Konolige, and Navab] S. Hinterstoisser, V. Lepetit, S. Ilic, S. Holzer, G. Bradski, K. Konolige, and N. Navab. Model based training, detection and pose estimation of texture-less 3D objects in heavily cluttered scenes. In ACCV, 2012.
  • [Hodaň et al.(2017)Hodaň, Haluza, Obdržálek, Matas, Lourakis, and Zabulis] Tomáš Hodaň, Pavel Haluza, Štěpán Obdržálek, Jiří Matas, Manolis Lourakis, and Xenophon Zabulis. T-LESS: An RGB-D dataset for 6D pose estimation of texture-less objects. In WACV, 2017.
  • [Kehl et al.(2017)Kehl, Manhardt, Tombari, Ilic, and Navab] Wadim Kehl, Fabian Manhardt, Federico Tombari, Slobodan Ilic, and Nassir Navab. SSD-6D: Making RGB-based 3D detection and 6D pose estimation great again. In ICCV, 2017.
  • [Kundu et al.(2018)Kundu, Li, and Rehg] Abhijit Kundu, Yin Li, and James M Rehg. 3D-RCNN: Instance-level 3D object reconstruction via render-and-compare. In CVPR, 2018.
  • [Kuznetsova et al.(2016)Kuznetsova, Hwang, Rosenhahn, and Sigal] Alina Kuznetsova, Sung Ju Hwang, Bodo Rosenhahn, and Leonid Sigal. Exploiting view-specific appearance similarities across classes for zero-shot pose prediction: A metric learning approach. In AAAI, 2016.
  • [Lowe(2004)] David G Lowe. Distinctive image features from scale-invariant key-points. IJCV, 60(2):91–110, 2004.
  • [Massa et al.(2016)Massa, Russell, and Aubry] Francisco Massa, Bryan C Russell, and Mathieu Aubry. Deep exemplar 2D-3D detection by adapting from real to rendered views. In CVPR, 2016.
  • [Mousavian et al.(2017)Mousavian, Anguelov, Flynn, and Košecká] Arsalan Mousavian, Dragomir Anguelov, John Flynn, and Jana Košecká.

    3D bounding box estimation using deep learning and geometry.

    In CVPR, 2017.
  • [Newell et al.(2016)Newell, Yang, and Deng] Alejandro Newell, Kaiyu Yang, and Jia Deng. Stacked hourglass networks for human pose estimation. In ECCV, 2016.
  • [Nichol et al.(2018)Nichol, Achiam, and Schulman] Alex Nichol, Joshua Achiam, and John Schulman. On first-order meta-learning algorithms. In arXiv:1803.02999, 2018.
  • [Palmer(1999)] Stephen E Palmer. Vision science: Photons to phenomenology. MIT press, 1999.
  • [Park and Berg(2018)] Eunbyung Park and Alexander C Berg. Meta-tracker: Fast and robust online adaptation for visual object trackers. In ECCV, 2018.
  • [Pavlakos et al.(2017)Pavlakos, Zhou, Chan, Derpanis, and Daniilidis] Georgios Pavlakos, Xiaowei Zhou, Aaron Chan, Konstantinos G Derpanis, and Kostas Daniilidis. 6-dof object pose from semantic key-points. In ICRA, 2017.
  • [Rad and Lepetit(2017)] Mahdi Rad and Vincent Lepetit. BB8: A scalable, accurate, robust to partial occlusion method for predicting the 3D poses of challenging objects without using depth. In ICCV, 2017.
  • [Rad et al.(2018)Rad, Oberweger, and Lepetit] Mahdi Rad, Markus Oberweger, and Vincent Lepetit. Feature mapping for learning fast and accurate 3D pose inference from synthetic images. In CVPR, 2018.
  • [Rakelly et al.(2018)Rakelly, Shelhamer, Darrell, Efros, and Levine] Kate Rakelly, Evan Shelhamer, Trevor Darrell, Alexei A Efros, and Sergey Levine. Few-shot segmentation propagation with guided networks. arXiv preprint arXiv:1806.07373, 2018.
  • [Ravi and Larochelle(2017)] Sachin Ravi and Hugo Larochelle. Optimization as a model for few-shot learning. In ICLR, 2017.
  • [Rezende et al.(2016)Rezende, Mohamed, Danihelka, Gregor, and Wierstra] Danilo Jimenez Rezende, Shakir Mohamed, Ivo Danihelka, Karol Gregor, and Daan Wierstra. One-shot generalization in deep generative models. JMLR, 48, 2016.
  • [Santoro et al.(2016)Santoro, Bartunov, Botvinick, Wierstra, and Lillicrap] Adam Santoro, Sergey Bartunov, Matthew Botvinick, Daan Wierstra, and Timothy Lillicrap. Meta-learning with memory-augmented neural networks. In ICML, 2016.
  • [Shaban et al.(2017)Shaban, Bansal, Liu, Essa, and Boots] Amirreza Shaban, Shray Bansal, Zhen Liu, Irfan Essa, and Byron Boots. One-shot learning for semantic segmentation. arXiv preprint arXiv:1709.03410, 2017.
  • [Shepard and Metzler(1971)] Roger N Shepard and Jacqueline Metzler. Mental rotation of three-dimensional objects. Science, 171(3972):701–703, 1971.
  • [Snell et al.(2017)Snell, Swersky, and Zemel] Jake Snell, Kevin Swersky, and Richard Zemel. Prototypical networks for few-shot learning. In NIPS, 2017.
  • [Su et al.(2015)Su, Qi, Li, and Guibas] Hao Su, Charles R. Qi, Yangyan Li, and Leonidas J. Guibas. Render for CNN: Viewpoint estimation in images using CNNs trained with rendered 3D model views. In ICCV, 2015.
  • [Sundermeyer et al.(2018)Sundermeyer, Marton, Durner, Brucker, and Triebel] Martin Sundermeyer, Zoltan-Csaba Marton, Maximilian Durner, Manuel Brucker, and Rudolph Triebel. Implicit 3D orientation learning for 6D object detection from RGB images. In ECCV, 2018.
  • [Suwajanakorn et al.(2018)Suwajanakorn, Snavely, Tompson, and Norouzi] Supasorn Suwajanakorn, Noah Snavely, Jonathan Tompson, and Mohammad Norouzi. Discovery of latent 3D key-points via end-to-end geometric reasoning. In NIPS, 2018.
  • [Tekin et al.(2017)Tekin, Sinha, and Fua] Bugra Tekin, Sudipta N Sinha, and Pascal Fua. Real-time seamless single shot 6D object pose prediction. In CVPR, 2017.
  • [Tremblay et al.(2018a)Tremblay, To, Molchanov, Tyree, Kautz, and Birchfield] Jonathan Tremblay, Thang To, Artem Molchanov, Stephen Tyree, Jan Kautz, and Stan Birchfield. Synthetically trained neural networks for learning human-readable plans from real-world demonstrations. In ICRA, 2018a.
  • [Tremblay et al.(2018b)Tremblay, To, Sundaralingam, Xiang, Fox, and Birchfield] Jonathan Tremblay, Thang To, Balakumar Sundaralingam, Yu Xiang, Dieter Fox, and Stan Birchfield. Deep object pose estimation for semantic robotic grasping of household objects. In CoRL, 2018b.
  • [Tulsiani and Malik(2015)] Shubham Tulsiani and Jitendra Malik. Viewpoints and key-points. In CVPR, 2015.
  • [Tulsiani et al.(2015)Tulsiani, Carreira, and Malik] Shubham Tulsiani, Joao Carreira, and Jitendra Malik. Pose induction for novel object categories. In ICCV, 2015.
  • [Vinyals et al.(2016)Vinyals, Blundell, Lillicrap, Wierstra, et al.] Oriol Vinyals, Charles Blundell, Tim Lillicrap, Daan Wierstra, et al. Matching networks for one shot learning. In NIPS, 2016.
  • [Wu et al.(2017)Wu, Lee, Tseng, Ho, Yang, and Chien] Po-Chen Wu, Yueh-Ying Lee, Hung-Yu Tseng, Hsuan-I Ho, Ming-Hsuan Yang, and Shao-Yi Chien. A benchmark dataset for 6dof object pose tracking. In ISMAR Adjunct, 2017.
  • [Xiang et al.(2018)Xiang, Schmidt, Narayanan, and Fox] Y. Xiang, T. Schmidt, V. Narayanan, and D. Fox. PoseCNN: A convolutional neural network for 6D object pose estimation in cluttered scenes. In RSS, 2018.
  • [Xiang et al.(2014)Xiang, Mottaghi, and Savarese] Yu Xiang, Roozbeh Mottaghi, and Silvio Savarese. Beyond PASCAL: A benchmark for 3D object detection in the wild. In WACV, 2014.
  • [Xiang et al.(2016)Xiang, Kim, Chen, Ji, Choy, Su, Mottaghi, Guibas, and Savarese] Yu Xiang, Wonhui Kim, Wei Chen, Jingwei Ji, Christopher Choy, Hao Su, Roozbeh Mottaghi, Leonidas Guibas, and Silvio Savarese. ObjectNet3D: A large scale database for 3D object recognition. In ECCV, 2016.
  • [Zhou et al.(2018)Zhou, Karpur, Luo, and Huang] Xingyi Zhou, Arjun Karpur, Linjie Luo, and Qixing Huang. Starmap for category-agnostic key-point and viewpoint estimation. In ECCV, 2018.