Keypoint-based shape and pose representation is attractive because of its simplicity and ease of handling. Example applications include 3D reconstruction [novotny2019c3dpo, Dai2012, Snavely2007], registration [yew20183dfeat, Kneip2014, Luong1995, Loper2015], human body pose analysis [shotton2011real, moreno20173d, cao2017, bogo2016smpl], recognition [he2017mask, sattler2011fast], and generation [tang2019cycle, zafeiriou20173d]. Keypoints are often detected as 2D image coordinates because such annotations are easy to obtain. Yet, 2D coordinates often do not suffice for subsequent geometric reasoning: in many applications (e.g., augmented reality), both 3D shape and pose are required [tulsiani2015viewpoints].
[Table 1: comparison of prior NrSfM methods — Park et al. [DBLP:conf/eccv/ParkLK20], Kong et al. [DBLP:conf/iccv/KongL19], Wang et al. [wang2021paul], Zeng et al. [zeng2021pr], and Novotny et al. [novotny2019c3dpo] — on properties including the absence of test-time optimization.]
It therefore stands to reason that keypoints should be estimated in 3D, similar to [zhao2020learning, suwajanakorn2018discovery, tulsiani2015viewpoints, sundermeyer2020augmented]. Such solutions, however, come with one or both of two pitfalls: (i) the need for 3D keypoints, pose, or multiple views for supervision; (ii) the lack of direct pose reasoning with respect to a canonical frame. In this regard, learning-based methods can provide both 3D keypoints and pose merely from a single image, making them suitable for a very wide range of applications, from scene understanding [fernandez2020indoor] to augmented reality [marchand2015pose]. Alternatively, template-based single-view methods [yang2020perfect, shi2021optimal] may also be used to obtain 3D keypoints and pose from 2D keypoints. However, template-based methods, besides requiring templates, are known to be sensitive to self-occlusions [dang20203d]. We therefore adopt a learning-based method for single-view inference of both 3D keypoints and pose of objects.
In this paper, we consider that only one image per object is available, both during training and inference. Such an assumption allows us to learn from diverse datasets, such as internet image collections, potentially offering high generalization ability. For better scalability, we also assume that only minimalistic supervision in the form of 2D keypoints and object categories is available. In essence, we wish to learn the 3D keypoints and pose of objects from an image collection spanning multiple categories, where not even multi-view images of the same object are available.
Existing methods that learn 3D shape and pose from an image collection of object categories are also known as deep non-rigid structure-from-motion (NrSfM) methods, due to their underlying assumption. The method proposed in this paper belongs to the same class, which can be divided into single-category [DBLP:conf/eccv/ParkLK20, DBLP:conf/iccv/KongL19, wang2021paul, zeng2021pr] and multi-category [novotny2019c3dpo] methods. Multi-category methods are strikingly interesting for two main reasons: (i) computational: a single neural network can infer shapes and poses for objects from different categories; (ii) relational: the possibility of establishing and exploiting relationships across categories. The latter not only lets us measure cross-category similarity but may also provide better generalizability.
In the context of multi-category methods, we introduce three major improvements in the form of:
– an end-to-end learning framework directly from images;
– a sparse non-negative combination of the shape basis;
– the explicit 3D pose and shape disentanglement via an augmentation-based cyclic consistency for supervision.
End-to-end learning: Most existing methods [DBLP:conf/eccv/ParkLK20, DBLP:conf/iccv/KongL19, wang2021paul, zeng2021pr, novotny2019c3dpo] operate in two stages: 2D keypoint extraction followed by 3D shape and pose estimation. These two stages are often performed independently. We argue that the two stages are dependent and can therefore mutually benefit from each other; in particular, 2D keypoints can be extracted so as to suit the downstream task of 3D reasoning. To this end, we extract visual context information along with the 2D keypoints from the keypoint extraction network. Both the visual context and the 2D keypoints are then provided to the network that lifts the 2D keypoints to 3D. Our experiments clearly demonstrate the benefit of visual context information during 3D pose and shape recovery.
Non-negative shape coefficients: We model 3D shape using a dictionary learning approach, similar to [novotny2019c3dpo], where a shape basis for the union of categories is learned. The instance-wise shape is then recovered with the help of the shape basis coefficients. In this process, only non-negative coefficients are considered. Intuitively, non-negative shape coefficients allow us to perform interpolation and scaling of the learned shape basis. Subtraction between basis shapes, although algebraically unproblematic, is geometrically less meaningful. Non-negative shape coefficients encourage the shape basis to be geometrically more meaningful, leading to better performance.
Augmentation-based pose and shape disentanglement: When recovering the shape basis by factoring 2D measurements into shape and pose, the pose aspect may largely influence the shape basis, which is not desired [lee2013procrustean]. In this context, [novotny2019c3dpo] refers to shape disentanglement as "shape transversality", which is enforced implicitly by undoing random rotations of the estimated shape with an auxiliary neural network. Similarly, a rigidity-based pairwise contrastive loss is proposed in [zeng2021pr] to handle the disentanglement problem implicitly. On the other hand, the methods in [park2017procrustean, DBLP:conf/eccv/ParkLK20, wang2021paul] use a Procrustean approach: they do not predict the motion, and instead compute the pose between the input and the 3D prediction in the canonical frame using orthographic-n-point (OnP) solvers [steger2018algorithms]. Instead of using OnP solvers, we simulate multiple instances of the pose by random rotations and perform self-supervision from the viewpoint of data augmentation. In particular, we rotate the generated shape and then project it to 2D. The projected 2D keypoints are fed back to the network to recover pose-agnostic shape parameters. In other words, our disentanglement explicitly exploits the fact that the shape coefficients must be the same for keypoints that differ only by pose. This allows us to explore a space of poses that may not be available in the data, making our method more robust to pose variations.
The major contributions of our work can be summarized as:
– End-to-end reconstruction of 3D shape and pose in a multi-category setup, using a single neural network.
– The use of non-negative shape coefficients and an augmentation-based, self-supervised pose and shape disentanglement method.
– State-of-the-art results in the multi-category setting, with significant improvements.
2 Related Work
The task of lifting 2D keypoints of deformable objects to 3D from a single image has mostly been studied in the context of NrSfM. In NrSfM, the task is to recover the shapes and viewpoints of an object from multiple observations over time [DBLP:journals/pami/AkhterSKK11]. Significant research has been carried out in NrSfM to improve performance through sparse dictionary learning [DBLP:conf/cvpr/KongL16, DBLP:conf/cvpr/ZhouZLDD16], low-rank constraints [daubechies2004iterative], unions of local subspaces [DBLP:conf/cvpr/ZhuHTL14], diffeomorphism [DBLP:conf/cvpr/ParasharSF20], and coarse-to-fine low-rank reconstruction [DBLP:conf/cvpr/BartoliGCPOS08]. NrSfM frameworks can be used to build category-specific models that learn to estimate shape and viewpoint from a single image by treating images of the same category as observations of a single object deforming at different time steps [DBLP:conf/3dim/KongZKL16, DBLP:conf/iccv/KongL19, DBLP:conf/iccv/ChaLO19].
Obtaining the 3D structure of an object from only a single image has been studied sparsely. For instance, in [DBLP:conf/eccv/KanazawaTEM18], segmentation datasets were utilized to train a model that outputs 3D mesh reconstructions given an image. Correspondences between 2D and 3D keypoints have also been used to improve results [DBLP:journals/corr/abs-2106-05662]. While some recent methods can estimate the viewpoint and non-rigid meshes, they work on objects with limited diversity, such as faces [DBLP:conf/ijcai/Wu0V21, DBLP:conf/accv/JenniF20, DBLP:conf/iccvw/SahasrabudheSBG19].
The closest line of work to ours involves building a single model for a diverse set of input classes. C3DPO [DBLP:conf/iccv/NovotnyRGNV19] learns to factorize object deformation and viewpoint change, and enforces the transversal property through a separate canonicalization network that undoes rotations applied to a canonical shape. Park et al. proposed using Procrustean regression [DBLP:journals/tip/ParkLK18] to determine unique motions and shapes [DBLP:conf/eccv/ParkLK20]. However, their method cannot handle multiple object categories or occluded keypoints; moreover, it requires temporal information in the form of sequences. Recently, [wang2021paul] extended the Procrustean formulation with autoencoders and proposed a method that can infer 3D shapes without the need for sequences. However, their method requires a substantially more complex network as well as optimization at test time. All these methods accept 2D keypoints as input rather than images, and tackle the problem of obtaining 3D keypoint locations from a single image using a separate keypoint detector, such as a stacked hourglass network [DBLP:conf/cvpr/ToshevS14].
3 Multi-category from Single View
We extract 3D structures in the form of 3D keypoints, given only an image of an object from some category. During training, we only have access to the 2D locations of keypoints and the category label. For simplicity, we separate our solution into two parts: extracting the category and 2D keypoints from the image, and lifting the 2D keypoints to 3D. In the following, we first explain our approach for lifting the given 2D keypoints, which we formulate in the context of NrSfM.
3.1 Preliminaries - NrSfM
Let $W_n \in \mathbb{R}^{2 \times K}$ be a stacked matrix representation of the 2D keypoints of the $n$-th view. We represent the structure of the $n$-th view as $S_n = \mathrm{vec}^{-1}(\alpha_n B)$, using the shape basis $B \in \mathbb{R}^{D \times 3K}$ and coefficients $\alpha_n \in \mathbb{R}^{D}$. For simplicity, we assume that the keypoints are centered and normalized, and that the camera follows an orthographic projection model, represented by $\Pi = [I_{2\times2}\,|\,0]$. Given the camera rotation matrix $R_n \in SO(3)$ and the centered, normalized keypoints, we can write $W_n = \Pi R_n S_n$, where the operation $\mathrm{vec}^{-1}(\cdot)$ reshapes the row vector $\alpha_n B$ to a matrix of the form $3 \times K$. The recovery of shape and pose by NrSfM given $N$ views can be written as\footnote{We will omit $\Pi$ and the transposition throughout the rest of the paper for ease of notation.}
$$\min_{B,\{\alpha_n, R_n\}} \sum_{n=1}^{N} \ell\big(W_n,\; \Pi R_n \mathrm{vec}^{-1}(\alpha_n B)\big), \qquad (1)$$
where $\ell(\cdot,\cdot)$ is a norm-based loss of the form $\ell(x, y) = \|x - y\|$.
The above problem is generally ill-posed; thus, different assumptions regarding the basis and coefficients are made in the literature. The most common constraints are low rank [Dai2012], a finite basis [wolf2002projection], and sparse combinations [zhu2014complex]. In this work, we are interested in solving problem (1) using a learning-based approach, and in particular in the multi-category setting of single-view inference.
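As a concrete reference, the following is a minimal NumPy sketch of the orthographic reprojection model behind (1). The array names and dimensions are illustrative, not the paper's implementation.

```python
import numpy as np

def reproject(R, alpha, B):
    """Orthographic reprojection W = Pi R S with S = vec^{-1}(alpha B).

    R:     (3, 3) camera rotation matrix
    alpha: (D,)   shape basis coefficients
    B:     (D, 3K) shape basis (each row is a vectorized 3xK basis shape)
    """
    K = B.shape[1] // 3
    S = (alpha @ B).reshape(3, K)   # instance shape in the canonical frame
    Pi = np.eye(2, 3)               # orthographic projection: keep x, y rows
    return Pi @ R @ S               # (2, K) projected 2D keypoints

def nrsfm_loss(W, R, alpha, B):
    """Norm-based reprojection loss for one view, as in (1)."""
    return np.linalg.norm(W - reproject(R, alpha, B))
```

NrSfM jointly optimizes the basis, coefficients, and rotations so that this loss is small over all views.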
3.2 Multi-category Formulation
In the context of multi-category NrSfM, our method extracts 3D structures of objects from a wide variety of classes. Thus, the basis and coefficients should be able to express the 3D structure of objects with different numbers of keypoints. Let $\mathcal{Z}$ represent the set of object categories and $z_n \in \mathcal{Z}$ be the category of sample $n$. Let each category $z$ be represented by $K_z$ keypoints. Then we set the total number of keypoints to $K = \sum_{z \in \mathcal{Z}} K_z$. We use a subset selection vector $v_z \in \{0, 1\}^{K}$ that indicates which of the $K$ dimensions relate to category $z$. Then, (1) can be rewritten as
$$\min_{B,\{\alpha_n, R_n\}} \sum_{n=1}^{N} \ell\big(v_{z_n} \odot W_n,\; v_{z_n} \odot \Pi R_n \mathrm{vec}^{-1}(\alpha_n B)\big), \qquad (2)$$
where $\odot$ is the broadcasted elementwise multiplication.
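The per-category masking above can be sketched in NumPy, where broadcasting implements the elementwise product over the (x, y) rows; the array names and layout (categories occupying contiguous slots of the union) are illustrative assumptions.

```python
import numpy as np

def category_mask(keypoints_per_category, category):
    """Binary selection vector over the union of all K keypoints.

    keypoints_per_category: list of per-category keypoint counts [K_1, ..., K_Z]
    category:               index z of the sample's category
    Assumes each category occupies a contiguous slot of the union.
    """
    K = sum(keypoints_per_category)
    start = sum(keypoints_per_category[:category])
    v = np.zeros(K)
    v[start:start + keypoints_per_category[category]] = 1.0
    return v

def masked_residual(W, W_hat, v):
    """Reprojection residual where only the category's dimensions contribute.

    W, W_hat: (2, K) keypoint matrices; v broadcasts over the two rows.
    """
    return np.linalg.norm(v * (W - W_hat))
```

With this mask, keypoints belonging to other categories contribute nothing to the loss, so a single shared basis can serve all categories.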
In the above formulation, $W_n$ and $v_{z_n}$ are inputs, hence category dependent, while $B$ is shared among all categories. To formulate the problem as a learning-based approach, let us separate the coefficient estimation into two composite functions $\alpha_n = \psi(\phi(W_n))$, with $\psi$ being an affine function, $\psi(x) = xE + t$. The coefficients $\alpha_n$ are thereby a function of the input $W_n$. Representing all the learnable parameters with $\theta$, the problem definition becomes
$$\min_{\theta, \{R_n\}} \sum_{n=1}^{N} \ell\big(v_{z_n} \odot W_n,\; v_{z_n} \odot \Pi R_n \mathrm{vec}^{-1}(\psi(\phi(W_n)) B)\big). \qquad (3)$$
In the above formulation, the shape basis coefficients are generated from latent codes $\phi(W_n)$ via the latent space basis vectors $E$ and a translation term $t$, which are shared across all categories. Projecting features of objects from different categories into a shared latent space lets the method extract cross-categorical geometric relationships, as shown in Fig. 2. Moreover, it substantially simplifies computation, since we do not require a separate network for each class.
3.3 Non-negative Shape Coefficients
Formula (3) is under-determined unless additional constraints are imposed on the system. The most common constraint is restricting the dimension of the shape basis coefficients [DBLP:conf/cvpr/BreglerHB00, DBLP:journals/pami/AkhterSKK11]. However, selecting the optimal cardinality requires careful hyperparameter tuning [novotny2019c3dpo, DBLP:conf/eccv/ParkLK20].
Since our method extracts 3D structures of objects from a wide variety of classes, the latent space has to accommodate latent codes from a wide range of inputs. Considering that most objects share some common characteristics, using a different manifold for each class fails to utilize cross-class information and increases the complexity of the method. On the other hand, the dimensionality of the optimal manifold differs per class. Ideally, we would therefore like to automatically select a manifold for each input in a way that maximizes performance. Note that the optimal manifold selection depends not only on the object class; we would like the method to discover cross-class rules for manifold assignment.
Given a sample $n$, let the selected basis vectors among $\{e_i\}$ be denoted by a binary vector $b$ with $b_i \in \{0, 1\}$, with basis coefficients $c_i$. Then, the representation of the shape coefficients is $\alpha_n = \sum_i b_i c_i e_i$.
This representation is difficult to use with neural networks, since the basis selection vector $b$ is binary and non-differentiable. Let us instead consider the cone defined by the basis vectors, $C = \{\sum_i c_i e_i \mid c_i \geq 0\}$; this is the space covered by $\{e_i\}$ when the basis coefficients are non-negative. Let us assume that the latent codes are bounded, i.e. $\|\sum_i b_i c_i e_i\| \leq r$; then they lie within a sphere of radius $r$. Let $a$ be the axis vector of the cone formed by $\{e_i\}$. Then there exists a scalar $\lambda \geq 0$ such that the sphere of radius $r$ centered at $\lambda a$ lies entirely inside $C$. This means we can translate the sphere that encloses the latent codes such that it is completely covered by the cone of the basis vectors. Rearranging terms, we arrive at $\alpha_n = \sum_i c_i e_i + t$ with $c_i \geq 0$ and $t = -\lambda a$.
This result implies that we can restrict the basis coefficients to be non-negative by adding a translation term $t$. With these adjustments, we rewrite $\alpha_n = \mathrm{ReLU}(\phi(W_n)) E + t$. Plugging it into (3), we get
$$\min_{\theta, \{R_n\}} \sum_{n=1}^{N} \ell\big(v_{z_n} \odot W_n,\; v_{z_n} \odot \Pi R_n \mathrm{vec}^{-1}\big((\mathrm{ReLU}(\phi(W_n)) E + t) B\big)\big). \qquad (4)$$
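The cone-covering argument above can be checked numerically in a toy 2D case with the standard basis, where the cone is the non-negative quadrant: a ball of radius $r$, translated far enough along the cone axis, contains only points with non-negative coordinates. All values below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
r = 1.0  # bound on the latent codes

# sample points inside the ball of radius r
points = rng.standard_normal((1000, 2))
norms = np.linalg.norm(points, axis=1, keepdims=True)
points = r * points / np.maximum(norms, 1.0)  # clip onto the unit ball

# axis of the quadrant cone is a = (1, 1)/sqrt(2); translating by
# lambda * a with lambda = r * sqrt(2) gives the offset t = (r, r)
t = np.array([r, r])
translated = points + t

# every translated point is a non-negative combination of e1, e2
assert np.all(translated >= 0.0)
```

The same reasoning in higher dimensions is what licenses replacing the binary selection with a ReLU plus a learned translation.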
With Eq. 4, the method learns to adaptively pick which basis vectors to activate, thus selecting the dimensionality of the manifold. The resulting structure is very simple, and all the parameters, including the latent basis vectors $E$, the coefficient generating function $\phi$, and the bias $t$, are learned from the data. The manifold selection rule is embedded in the thresholding operation applied by the ReLU. Note that $S_n$ is the network's output, i.e. the 3D locations of the keypoints; thus, in-distribution inputs are expected to produce bounded estimates.
The proposed formulation has the property of being sparse, which encourages the representations of objects in shape space, i.e. the coefficients, to be disentangled. Intuitively, as the number of active (non-zero) coefficients increases, more combinations of the shape basis vectors can arrive at the same solution. By decreasing the number of active coefficients adaptively, we force a small number of shape coefficients to represent the changes from one object to another, so that each coefficient represents a different major mode of variation. Moreover, the cutoff imposed by the ReLU implies that a small change in an inactive coefficient will likely not result in any change in the output. This improves the robustness of the overall method.
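The coefficient construction of Eq. 4 can be sketched as follows; the raw code `x`, latent basis `E`, and translation `t` are illustrative stand-ins for the learned network output and parameters.

```python
import numpy as np

def shape_coefficients(x, E, t):
    """alpha = ReLU(x) E + t: non-negative combination of latent basis vectors.

    x: (D,)   raw latent code produced by the network for one sample
    E: (D, M) latent space basis vectors (rows e_i), shared across categories
    t: (M,)   translation term, shared across categories
    Returns the coefficients and the (sparse) non-negative weights.
    """
    c = np.maximum(x, 0.0)  # ReLU: negative entries deactivate their basis vector
    return c @ E + t, c

# toy example: two of the four raw entries are negative or zero, so only
# two basis vectors stay active — the ReLU performs the manifold selection
x = np.array([0.7, -0.2, 0.0, 1.3])
E = np.eye(4)
t = np.zeros(4)
alpha, c = shape_coefficients(x, E, t)
active = int(np.count_nonzero(c))  # number of active coefficients
```

Counting the non-zero entries of `c` gives the effective dimensionality selected for this sample.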
3.4 Augmented Self-Supervision
Previous works acknowledged the issue of shape and pose entanglement and addressed it using a canonicalization functionality. While [novotny2019c3dpo] learned a function that undoes any rotation of the estimated canonical shape, [park2017procrustean, wang2021paul] used a Procrustean framework. However, in both settings, the method can produce transversal estimates only for in-distribution inputs, since canonicalization is applied separately on the output of the 2D-3D lifter network. Thus, the input 2D keypoints can only be augmented by in-plane rotations, which hinders the generalizability of the overall method.
We propose a 3D augmentation approach that only uses the lifter network $h$. The lifter network accepts 2D keypoints $W = \Pi R S$ as input, with $R \in SO(3)$ and $S \in \mathcal{S}$, and outputs the shape estimate $\hat{S}$ and pose $\hat{R}$. Here, $\mathcal{S}$ is the set of shapes in the true distribution. In this setting, the input covers the projections of all rotations of the true-distribution shapes. However, the training set does not include all projections of the true-distribution shape samples. Since we do not have access to 3D keypoint locations, we cannot directly augment the input samples, but we can perform augmentations on the estimated 3D keypoints produced by the lifter network. Thus, let $\hat{S}_n$ be the 3D output of the lifter network; then we propose the following cyclic augmented self-supervision: feed the re-projected, randomly rotated estimate back through the lifter,
$$(\hat{S}'_n, \hat{R}'_n) = h(\Pi \bar{R} \hat{S}_n),$$
where $\bar{R}$ is any rotation matrix, and require the canonical shape to be reproduced. This encourages the lifter network to be transversal in its outputs, since any rotation of its shape estimate is mapped back to $\hat{S}_n$. Different from [novotny2019c3dpo], our formulation lets the lifter network explore all rotations of the true-distribution shapes. Moreover, all the modifications made on the coefficients, such as restricting the latent space dimensionality and the non-negative coefficients, naturally help the cycle supervision. In training, we break the cyclic loss down into two losses:
$$\mathcal{L}_{shape} = \ell(\hat{S}'_n, \hat{S}_n), \qquad \mathcal{L}_{rot} = \ell(\hat{R}'_n, \bar{R}).$$
These losses ensure that the method can estimate the random rotation while maintaining the same canonical shape. One important benefit of our formulation is that we do not require an additional network to validate the estimated 3D shapes, such as the discriminators in GAN-based approaches [DBLP:journals/corr/abs-1803-08244] or canonicalization networks [novotny2019c3dpo, wang2021paul]. Furthermore, through our augmentation-based approach, the network learns to deal with "imperfect" keypoints, which is fundamental when working with estimated 2D keypoints. Note that, by simply rewriting the shape estimate with the non-negative coefficient formulation of Eq. 4, the cyclic supervision extends naturally to non-negative coefficients.
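The augmented cycle described above can be sketched as follows. The `lifter` argument stands in for the actual neural network and is purely illustrative; only the structure of the cycle is the point.

```python
import numpy as np

def random_rotation(rng):
    """Random 3D rotation via QR decomposition of a Gaussian matrix."""
    Q, _ = np.linalg.qr(rng.standard_normal((3, 3)))
    return Q * np.sign(np.linalg.det(Q))  # force det = +1 (3x3 case)

def cycle_losses(lifter, W, rng):
    """Shape- and rotation-consistency terms of the cyclic supervision.

    lifter: callable mapping (2, K) keypoints -> (S_hat (3, K), R_hat (3, 3))
    """
    S_hat, R_hat = lifter(W)        # first pass: canonical shape + pose
    R_bar = random_rotation(rng)    # augmentation: a random pose
    Pi = np.eye(2, 3)
    W_aug = Pi @ R_bar @ S_hat      # re-project the rotated estimate
    S_cyc, R_cyc = lifter(W_aug)    # second pass on augmented keypoints
    shape_loss = np.linalg.norm(S_cyc - S_hat)  # canonical shape must not change
    rot_loss = np.linalg.norm(R_cyc - R_bar)    # random rotation must be recovered
    return shape_loss, rot_loss
```

A lifter that is perfectly transversal yields zero shape loss for every sampled rotation, which is exactly what the training objective pushes toward.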
4 End-to-End Learning from Images
The image of an object can be used for more than 2D keypoint extraction alone. We propose to detect the 2D keypoints from the image and additionally extract a context vector that is used in conjunction with the 2D keypoints to obtain a better 3D estimate. The detected 2D keypoints and the context vector are consumed by the lifter network in our end-to-end trainable pipeline.
4.1 Keypoints from Images
We require a method that outputs the locations of a pre-defined, object-category-dependent set of keypoints. The problem at hand therefore naturally extends to object classification. Moreover, in order to fully utilize the image, the keypoint network should produce a context feature representation that can guide the lifter network. Thus, the desired function is $(P, z, c) = g(I)$, where $I$ is the image, $P$ the category-dependent keypoint locations, $z$ the category of the object, and $c$ the context vector.
We propose to use a DETR-based [DBLP:conf/eccv/CarionMSUKZ20] architecture at the core of the function $g$. The input image is processed by the backbone (ResNet50 [DBLP:conf/cvpr/HeZRS16]), and the resulting feature map is fed to a transformer. The transformer uses two sets of learned query vectors. The first set is related to keypoints, where each query vector represents a keypoint; let the maximum number of keypoints among all categories be $K_{max}$. We extract the 2D location and the semantics of each keypoint by processing the corresponding query vector with two MLPs. The semantics of a keypoint are category dependent and encoded as a one-hot vector; for example, a given entry can correspond to the front right of a car or the left rear leg of a chair. Entries with indices larger than the number of keypoints of the category are zero.
To help the lifter network estimate the 3D keypoints, the visual context in the image is important; this is especially true in multi-category settings with large in-category variations. Thus, we use a second set of learned query vectors, which is processed by the transformer together with the keypoint queries. The output of the transformer for the context query is then processed by two MLPs: the first outputs the context vector, and the second the one-hot encoded category probability. The category probability is used in conjunction with the keypoint type estimates to obtain the correct 2D keypoint representation, while the context vector is used by the lifter network together with the 2D keypoint representation; see Fig. 6.
The training of the network uses two types of supervision. First, we directly supervise the 2D keypoints and the category-specific outputs, using Hungarian matching to select the supervision targets. Second, by training end-to-end, the keypoint extraction network also receives supervision through the lifter network, which helps to learn lifting and keypoint regression jointly. This is also the indirect supervision signal that guides the learning of the context vector. This end-to-end connection of the lifter and keypoint extraction networks is in sharp contrast to existing works, which focus on either of the two parts. We show in the results that their combination yields substantial performance benefits.
4.2 End-to-end Pipeline
Given the end-to-end joint 2D-3D model, the first step in the training loop is Hungarian matching between the keypoint queries and the GT keypoints, minimizing a matching cost that combines the keypoint location error and the keypoint type classification error. The Hungarian matching output provides the set of query vectors that are one-to-one matched to true keypoints. We reformat the selected location estimates, using the matched semantics and the category estimate, into the form given in Eq. 2. Let this extracted 2D keypoint representation be $\hat{W}_n$ and the true keypoints be $W_n$. Adding the category loss of the keypoint network, we arrive at the following set of losses:
a 2D keypoint location loss, a keypoint type loss, and a category loss, where we use the Huber loss for the keypoint locations. For the total loss, the different terms are combined using hyperparameters.
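The query-to-keypoint assignment can be sketched with SciPy's Hungarian solver. The particular cost form below (Euclidean location distance minus the probability mass on the true type, weighted by `w_sem`) is an illustrative assumption, not the paper's exact matching cost.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_queries(pred_locs, pred_sem, gt_locs, gt_sem, w_sem=1.0):
    """One-to-one matching of keypoint query predictions to GT keypoints.

    pred_locs: (Q, 2) predicted 2D locations per query
    pred_sem:  (Q, T) predicted keypoint-type probabilities per query
    gt_locs:   (G, 2) ground-truth 2D locations
    gt_sem:    (G,)   ground-truth keypoint-type indices
    """
    # pairwise Euclidean distance between every query and every GT keypoint
    loc_cost = np.linalg.norm(pred_locs[:, None, :] - gt_locs[None, :, :], axis=-1)
    # reward probability mass placed on the true keypoint type
    sem_cost = -pred_sem[:, gt_sem]
    cost = loc_cost + w_sem * sem_cost      # (Q, G) total assignment cost
    rows, cols = linear_sum_assignment(cost)
    return rows, cols                       # matched query / GT index pairs
```

The matched pairs then receive the location, type, and (indirectly) lifting supervision described above.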
During evaluation, where we cannot use Hungarian matching, we first obtain the object category estimate. Then, for each keypoint type defined for that category, we take the location of the most likely proposal and convert it into the form given in Eq. 2 to obtain the 2D keypoint representation.
5 Experiments
We experiment on the Synthetic Up3D (S-Up3D), PASCAL3D+ and Human3.6M datasets. For all datasets, we use the pre-processed versions of [novotny2019c3dpo].
Only a few NrSfM methods can handle settings as diverse as ours. We compare against C3DPO [DBLP:conf/iccv/NovotnyRGNV19] and PAUL [wang2021paul] on all datasets, since they produce accurate estimates in a wide range of datasets and settings. We also report results of EMSfM [DBLP:journals/pami/TorresaniHB08] and GbNrSfM [DBLP:conf/nips/FragkiadakiSAM14] on the S-Up3D and PASCAL3D datasets. We compare against [DBLP:conf/eccv/ParkLK20] and [DBLP:journals/corr/abs-1803-08244] only on Human3.6M, since they cannot handle occlusions or multiple object categories. On PASCAL3D, we additionally compare against CMR [DBLP:conf/eccv/KanazawaTEM18]. We report results of our lifter network without (Ours-base) and with cycle supervision (Ours).
To validate the end-to-end pipeline, we report results in three settings: 1) lifter and transformer trained separately, without the context vector (Ours/TR); 2) lifter and transformer trained end-to-end, without the context vector (Ours w/o Context); 3) the proposed end-to-end training with the context vector (Ours). Moreover, we also experiment with using a stacked hourglass network [DBLP:conf/cvpr/ToshevS14] to extract 2D keypoints on the Pascal3D and Human3.6M datasets.
5.3 Evaluation protocol
Following [DBLP:conf/iccv/NovotnyRGNV19], we report the absolute mean per joint position error (MPJPE) as well as the stress metric, where we center both the estimates and the ground truth at zero mean. To calculate MPJPE, we also flip the depth dimension, calculate the MPJPE for both cases, and keep the better of the two. This resolves the depth ambiguity problem, where the network may recover the depth direction with either sign.
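The flip-aware evaluation can be sketched as follows; function and variable names are illustrative.

```python
import numpy as np

def mpjpe(pred, gt):
    """Mean per joint position error between (3, K) keypoint sets."""
    return np.mean(np.linalg.norm(pred - gt, axis=0))

def flip_aware_mpjpe(pred, gt):
    """Evaluate both depth directions and keep the better one.

    Resolves the orthographic depth ambiguity: the network may recover
    the z axis with either sign.
    """
    pred_c = pred - pred.mean(axis=1, keepdims=True)  # zero-center estimate
    gt_c = gt - gt.mean(axis=1, keepdims=True)        # zero-center ground truth
    flipped = pred_c * np.array([[1.0], [1.0], [-1.0]])
    return min(mpjpe(pred_c, gt_c), mpjpe(flipped, gt_c))
```

An estimate that matches the ground truth up to a global depth flip thus scores an error of zero, as intended.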
For the experiments with GT keypoints, where only our lifter network is used, we only use visible keypoints as inputs: we mask out the invisible points and concatenate the visibility mask with the input keypoints. Our keypoint extraction network estimates all keypoints, so the jointly trained lifter network also uses all extracted keypoints. To provide a fair comparison, we use the exact same architecture and data pre-processing as [DBLP:conf/iccv/NovotnyRGNV19] for our lifter.
In our experiments with GT keypoints, we deploy two training schemes: in the first, we calculate the training loss only over the visible keypoints; in the second, over all keypoints. Note that in both settings the network's input consists only of the visible points, while its output covers all keypoints. We refer to the setting with loss calculation over all keypoints with the suffix -A.
6 Results
To show the performance of individual components, we separate the results into two parts: using GT 2D keypoints, and estimating the keypoints from the image.
6.1 Results with GT Keypoints
We present the results for our lifter network with GT keypoints in Table 2. Our method outperforms all methods that do not perform test-time optimization, while using only visible keypoints in training. Moreover, Ours-A outperforms all methods on the Pascal3D dataset, where our method's multi-class focus shows best, and we also outperform all methods on the S-Up3D dataset. Comparing C3DPO-base and Ours-base reveals the clear boost provided by the non-negative coefficients, and the proposed cyclic self-supervision makes a further significant contribution. Our method is only slightly worse than the Procrustean network [park2017procrustean] on Human3.6M, although the latter uses sequences for training as well as test-time optimization.
6.2 Results with Estimated Keypoints
The results with estimated keypoints are given in Table 3. On both datasets, our method significantly outperforms the competitors. The performance boost mainly comes from the proposed joint training and the context vector. Test-time optimization methods in particular, even when competitive with GT keypoints, suffer considerably with estimated keypoints, as visible in the Human3.6M results. For the Pascal dataset, neither PAUL [wang2021paul] nor the Procrustean network [park2017procrustean] report numbers at all.
[Table 3 excerpt:]
| Proc-SH †* [DBLP:conf/eccv/ParkLK20] | - | - | 124.5 | - |
| PAUL-SH † [wang2021paul] | - | - | 132.5 | - |
| Ours w/o Cont | 57.6 | 42.9 | 113.8 | 56.7 |
6.3 Multi-Class Ablation
To show the benefits of the multi-category setting, we truncate Pascal3D to a single-class dataset and train our architecture, repeating this for three different classes. The results, given in Table 4, show that the multi-class trained network significantly outperforms its single-category counterparts. The visual examination in Fig. 2 already demonstrated that the latent space of the multi-category network expresses cross-categorical geometric variations.
To evaluate the effect of the proposed modifications on the latent space, we measure the mutual coherence of the latent space basis vectors $E$. Linear combinations of these vectors create the latent code that is then decoded into the 3D keypoints via the shape basis. Thus, the mutual coherence of the latent basis vectors provides a measure of disentanglement. Table 5 shows that sparse non-negative coefficients encourage the latent basis vectors to be less correlated; moreover, cyclic self-supervision preserves this latent structure.
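Mutual coherence, as used in this ablation, can be computed as the maximum absolute cosine similarity between distinct basis vectors; a minimal sketch, assuming the basis vectors are stored as the rows of a matrix:

```python
import numpy as np

def mutual_coherence(E):
    """Maximum absolute cosine similarity between distinct rows of E.

    E: (D, M) matrix whose rows are the latent space basis vectors e_i.
    Lower coherence means the basis vectors are less correlated, i.e.
    the latent directions are better disentangled.
    """
    En = E / np.linalg.norm(E, axis=1, keepdims=True)  # unit-normalize rows
    G = En @ En.T                                      # Gram matrix of cosines
    np.fill_diagonal(G, 0.0)                           # ignore self-similarity
    return float(np.max(np.abs(G)))
```

An orthogonal basis attains coherence 0, while duplicated (fully correlated) directions attain 1.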
7 Conclusion
We study the problem of estimating 3D pose and shape from a single image for objects of multiple categories, in an end-to-end manner. Our learning framework relies only on 2D keypoint annotations for supervision, and exploits the relationships between keypoints within and across categories. From our experiments, two major conclusions can be drawn: (a) multi-category learning is not only computationally beneficial but also offers a performance boost; (b) end-to-end learning improves the performance of the downstream task. Our method is the first of its kind, and it outperforms all compared methods in estimating 3D shape and pose directly from images on three benchmark datasets.
Limitations. We observed unstable training with the cyclic self-supervision due to batch normalization, which is known to be problematic under recursive use. We therefore first pre-train the network without the cycle, and then fine-tune with the cyclic loss while freezing the batch norm layers to reduce the instability.