Latent Space Representation for Shape Analysis and Learning

06/11/2018 ∙ by Ruqi Huang, et al. ∙ Stanford University

We propose a novel shape representation useful for analyzing and processing shape collections, as well as for a variety of learning and inference tasks. Unlike most approaches that capture variability in a collection by using a template model or a base shape, we show that it is possible to construct a full shape representation by using the latent space induced by a functional map network, allowing us to represent shapes in the context of a collection without the bias induced by selecting a template shape. Key to our construction is a novel analysis of latent functional spaces, which shows that after proper regularization they can be endowed with a natural geometric structure, giving rise to a well-defined, stable and fully informative shape representation. We demonstrate the utility of our representation in shape analysis tasks, such as highlighting the most distorted shape parts in a collection or separating variability modes between shape classes. We further exploit our representation in learning applications by showing how it can naturally be used within deep learning and convolutional neural networks for shape classification or reconstruction, significantly outperforming existing point-based techniques.

1. Introduction

Detecting, quantifying and analyzing variability in shape collections is a fundamental task in computer graphics and geometry processing, with applications across multiple domains, including in statistical shape analysis [Anguelov et al., 2005; Bogo et al., 2014; Hasler et al., 2009], shape exploration [Kim et al., 2012; Rustamov et al., 2013; Kleiman et al., 2015], shape correspondence [Huang et al., 2014] and co-segmentation [Wang et al., 2012]. A key question that arises in all techniques for extracting variability is the choice of the right shape representation, which can reveal the structure of each shape in the context of the collection while also being compact and easy to manipulate, enabling efficient shape analysis and processing.

The majority of existing techniques dedicated to extracting variability in a collection are based on first selecting a template (or base) shape and considering the changes on all other shapes with respect to this template — this is the standard practice in medical domains where the reference shape is often referred to as an “atlas” (e.g., in brain anatomy) [Grenander and Miller, 1998]. In computer graphics this approach is common both in shape reconstruction and in statistical shape analysis [Anguelov et al., 2005; Bogo et al., 2014; Hasler et al., 2009], but also in shape exploration (e.g., [Kim et al., 2012; Ovsjanikov et al., 2011; Kim et al., 2013; Rustamov et al., 2013] among many others) where the template is often constructed by either simplifying some fixed base shape or by using shape abstractions derived from collections of parts and their relations.

Although easy and intuitive, template-based shape exploration and analysis has obvious limitations when shape variability is large and no single prototype adequately models all given shapes. But even in settings of more modest variation, there are significant limitations. First, the choice of the template can significantly affect the results in terms of the types of variability that are detected and highlighted. Second, considering the variability with respect to a fixed base shape can make it difficult to reveal cross-class variability that becomes apparent only when comparing all pairs of shapes in the collection. Third, even when the base shape is given, the exact choice of encoding for the variability remains crucial. For example, while simple techniques based on the displacement of each template vertex might be relevant for reconstruction and statistical shape analysis, their use is very limited in the context of learning, since they are not invariant to even basic rigid motions.

This question of shape representation has also become particularly important with the advent of powerful techniques based on deep learning and convolutional neural networks. Although very successful in image analysis, their adoption for shape processing has so far been relatively limited due to representational differences. Common 3D representations such as meshes or point clouds are irregular, unlike the regular grids defining 2D images, making it challenging to define notions such as convolution or to encode basic 3D invariances. While some significant progress has been made in this direction in the past few years (see e.g., [Maron et al., 2017] and [Bronstein et al., 2017] for an overview), the question of defining a representation that is at once invariant, compact and well-suited for learning remains open.

In this paper we present a novel approach to encoding shapes in the context of a collection that helps overcome many of the above limitations. Specifically, starting from a collection of shapes with some soft (functional) maps between them, we show how consistent latent spaces that have previously been used for improving map quality can also be exploited to reveal the geometric variability in the collection, without relying on a base or template shape (e.g., in Figure 13, our approach highlights the regions that are distinctive between the cats and lions), or assuming a particular (e.g., star-shaped) topology of the functional map network. Our approach is based on a novel analysis of latent spaces, which demonstrates that after proper regularization they can be endowed with natural, unbiased geometric structure. We then show that, although our latent shape is a dual object that need not correspond to a real shape in 3D, it can be used together with the notion of shape differences introduced in [Rustamov et al., 2013] to construct a representation for each shape in the collection, but without relying on a fixed base shape as done in that work. Moreover, we show how the algebraic nature of this representation can be exploited to detect detailed information about differences between shape classes, perform (possibly partial) shape analogies, and analyze shapes across different modalities.

Contributions. To summarize, our main contributions are:

  • We describe how latent functional spaces can be endowed with natural geometric (metric and measure) structure, giving rise, for the first time, to a well-defined notion of a “latent shape” that characterizes a shape collection.

  • We define shape differences between real and latent shapes and show how such differences lead to a shape representation that can be used for detailed shape analysis without assuming a particular topology of the map network.

  • We provide tools for a nuanced understanding of shape variability, including the separation of the different types of variability present within and across shape sub-collections.

  • We demonstrate that our new representation supports deep learning techniques, including CNNs, for both analysis and synthesis, leading to improved results over baseline methods.

Figure 1. We start with a collection of shapes and a set of functional maps among them, from which we extract a latent shape. We then construct a canonical latent basis for each shape and represent each shape with a pair of operators (matrices) based on that basis. Finally, we demonstrate various applications of the latent representation in shape analysis and deep learning on 3D shapes.

2. Related Work

Template-based shape analysis and exploration. Analyzing shape collections by variability around a template shape has a rich and vast history going back to D’Arcy Thompson’s classic “On growth and form” [Thompson et al., 1942], which has inspired Kendall’s shape space theory [Kendall, 1989] and the pattern theory formalized by Grenander and commonly used in computational anatomy [Grenander and Miller, 1998], where templates are often referred to as atlases.

In Computer Graphics, shape spaces based on template variation are ubiquitous in statistical shape analysis, e.g. for defining 3D morphable models [Blanz and Vetter, 1999; Allen et al., 2003], especially for capturing variability in human body and pose, e.g. [Anguelov et al., 2005; Hasler et al., 2009; Bogo et al., 2014] among many others.

Shape templates are also commonly used for exploring shape collections [Ovsjanikov et al., 2011; Kim et al., 2012]. Although in most cases the presence of a shape template is assumed to be given a priori, simultaneous template construction and fitting techniques have been used for both reconstruction [Wand et al., 2007, 2009; Tong et al., 2012] and exploration [Kim et al., 2013], among many others.

While pervasive, template-based methods also have a well-known limitation in that the choice of the template model can introduce bias in the kinds of variability that are revealed. Common selection techniques include using a particular (median) shape in a collection that is as close as possible to a centroid, or constructing a new template shape by pointwise averaging (e.g., [Joshi et al., 2004]).

Our approach avoids the construction of an explicit template shape and replaces it with an implicit template obtained via the analysis of the latent functional space, which both removes the bias in the template shape selection and avoids the expensive construction of a geometric (embedded 3D shape) template.

Shape Analysis with functional maps. Our approach takes as input a collection of shapes with soft (functional) maps between them. In this, we follow the recent line of work on shape analysis with soft maps, similar to [Solomon et al., 2012; Kim et al., 2012; Rustamov et al., 2013]. Namely, we use the formalism of functional maps introduced originally in [Ovsjanikov et al., 2012] and extended significantly in follow-up works, including [Kovnatsky et al., 2013; Huang et al., 2014] among others (see [Ovsjanikov et al., 2017] for a recent overview).

Although originally proposed as a computational tool for shape matching, follow-up works have also shown its utility in shape analysis and exploration, starting with map visualization [Ovsjanikov et al., 2013], detection and encoding of shape differences [Rustamov et al., 2013], and co-segmentation and co-analysis [Huang et al., 2014] among others. The advantage of these techniques is that they only require approximate functional maps, which are much easier to compute than precise (point-to-point) correspondences. Nevertheless, existing methods such as [Ovsjanikov et al., 2013; Rustamov et al., 2013] also follow the spirit of template-based techniques and assume the presence of a single base shape with respect to which variability is captured. A recent method introduced in [Huang and Ovsjanikov, 2017] has tried to lift this assumption but is still restricted to revealing global variability within a single collection. We extend these techniques first by proposing a template-free analysis and exploration framework using functional maps and second by proposing techniques for detecting and highlighting cross-collection variability, and finally by defining a compact shape representation that is suitable for learning.

Latent functional spaces. A key building block in our approach is the use of so-called latent functional spaces, which are closely related to map synchronization [Wang and Singer, 2013] and which have been used for computing consistent functional maps in shape and image collections [Huang et al., 2014; Wang et al., 2013, 2014]. One of our key contributions is to show that, in addition to providing a powerful computational method for map inference, latent functional spaces also make it possible to reveal variability in shape collections and to define a compact and informative shape representation.

Shape representations for learning. One of our key applications is to show how the shape representation obtained via the latent functional spaces can be naturally used in the context of supervised learning applications, and especially enable the use of convolutional neural networks for shape regression and classification.

In this, our work is related to the recent techniques aimed at applying deep learning methods to shape analysis. One of the main challenges is defining a meaningful notion of convolution while ensuring invariance to basic transformations, such as rigid motions. Several techniques have recently been proposed based on, e.g., geometry images [Sinha et al., 2016], volumetric [Maturana and Scherer, 2015; Wang et al., 2017], point-based [Qi et al., 2016] and multi-view approaches [Su et al., 2015], as well as, more recently, intrinsic techniques that adapt convolution to curved surfaces [Masci et al., 2015; Boscaini et al., 2016] (see also [Bronstein et al., 2017] for an overview), and even toric covers [Maron et al., 2017], among many others.

Despite this tremendous progress in the last few years, defining a shape representation that can naturally support convolution operations, is compact, invariant to the desired class of transformations (e.g., rigid motions) and not limited to a particular topology, remains a challenge. As we show below, our representation is well-suited for learning applications, and especially for revealing subtle geometric information regarding the shape structure.

Shape processing in latent representations. Finally, our work is also related to recent techniques that construct latent spaces for representing 3D shapes, especially those based on learning. For instance, [Wu et al., 2016] combine a 3D-CNN with a Generative Adversarial Network (GAN) to first learn the latent space of 3D shapes. Given the latent space, they regress an image feature learned via a 2D-CNN to the latent space to recover the underlying geometry. [Girdhar et al., 2016] follow a similar strategy but use a voxel-based AutoEncoder (AE) instead of a GAN for learning the latent representation.

[Achlioptas et al., 2018] introduced an AE operating on 3D point-clouds to produce a latent space which is further exploited by a GAN for point-cloud synthesis. In a similar manner, [Li et al., 2017] developed a recursive neural net to map 3D part-layouts to a latent space at which a GAN operates to create novel shapes with various part-hierarchies.

In contrast to our representation via latent space analysis, these learned embeddings represent shapes as points in some high-dimensional space and rarely give access to the regions or parts of the 3D shapes associated with or responsible for the shape variability. We instead represent shapes in a collection as linear operators, stored as matrices, which not only enables a meaningful notion of convolution but also allows us to recover explanations for differences and variability in terms of highlighted shape parts.

3. Overview

The rest of the paper is organized as follows: in Section 4 we describe the problem setting, the main goals and notations used below. Section 5 provides the theoretical foundation for our method.

In particular, we characterize the geometric structure of latent shapes in Section 5.1 and define our shape representation based on shape differences with respect to latent shapes in Section 5.2. We then describe the two key applications: extracting variability in shape collections (Section 6) and using our representation for 3D deep learning (Section 7). Finally, we show qualitative and quantitative results obtained using our methods in Section 8.

4. Preliminaries, Notation and Problem Setup

Throughout our work, we assume that we are given a collection of related 3D shapes and a set of functional maps [Ovsjanikov et al., 2012] among some shape pairs. Our main goal is to develop a theoretical foundation for a novel representation for the shapes in the collection, and to show how this representation can be effectively used in practical applications.

Specifically, we assume as input a set of shapes $\mathcal{S} = \{S_1, \dots, S_n\}$ and functional maps $F_{ij}$, which map real-valued functions between some pairs of shapes $S_i, S_j$. The functional maps can either be induced by point-wise correspondences or be obtained via an optimization procedure, as described, e.g., in [Ovsjanikov et al., 2017]. Let $W_i$ be the stiffness matrix and $A_i$ the area matrix of shape $S_i$, which encode respectively the metric and the measure information. The Laplace-Beltrami operator (LBO) is classically discretized as $L_i = A_i^{-1} W_i$ [Meyer et al., 2003]. We let $\Lambda_i$ be the diagonal matrix storing the $k_i$ smallest eigenvalues of the LBO of shape $S_i$, and $\Phi_i$ the matrix storing the corresponding eigenvectors. Following previous works, we assume that functional maps are given in the reduced eigenbasis, so that $F_{ij}$ can be thought of as a matrix of size $k_j \times k_i$.
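To make the discretization concrete, the following is a minimal NumPy sketch of the generalized eigenproblem $W \phi = \lambda A \phi$ behind the reduced eigenbasis. It assumes a lumped (diagonal) area matrix stored as a vector `a`, and it uses the graph Laplacian of a path as a toy stand-in for a real cotangent stiffness matrix; the function names `lbo_eigs` and `path_stiffness` are illustrative and do not come from the paper.

```python
import numpy as np

def lbo_eigs(W, a, k):
    """Return the k smallest eigenpairs of W phi = lambda * diag(a) phi.

    W : (n, n) symmetric positive semi-definite stiffness matrix.
    a : (n,) positive lumped vertex areas (assumed diagonal area matrix).
    """
    s = 1.0 / np.sqrt(a)
    # A^{-1/2} W A^{-1/2} is symmetric and shares its eigenvalues with A^{-1} W.
    M = (s[:, None] * W) * s[None, :]
    lam, U = np.linalg.eigh(M)      # eigenvalues in ascending order
    Phi = s[:, None] * U            # satisfies Phi.T @ diag(a) @ Phi = I
    return lam[:k], Phi[:, :k]

def path_stiffness(n):
    """Graph Laplacian of a path with n vertices (toy stand-in for W)."""
    W = 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
    W[0, 0] = W[-1, -1] = 1.0
    return W

W = path_stiffness(50)
a = np.full(50, 0.02)               # uniform vertex areas
lam, Phi = lbo_eigs(W, a, k=10)
```

The eigenvectors returned this way are orthonormal with respect to the area-weighted inner product, which is the property the shape difference formulas rely on.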

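The functional map network (FMN) on $\mathcal{S}$ is a graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$, where the $i$-th vertex in $\mathcal{V}$ corresponds to the functional space on $S_i$, and the edge $(i, j) \in \mathcal{E}$ if we are given a functional map $F_{ij}$. We assume that this network is symmetric ($(i, j) \in \mathcal{E}$ if and only if $(j, i) \in \mathcal{E}$) and connected, so that there exists at least one path consisting of edges in $\mathcal{E}$ between any pair of vertices in $\mathcal{V}$.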

Shape Differences

Our shape representation is based on the shape differences introduced in [Rustamov et al., 2013], which characterize shape deformations by encoding the changes in inner products of functions. Namely, given shapes $S_i, S_j$ and a functional map $F_{ij}$ in the reduced basis, the authors introduce the area-based and the conformal shape differences $V_{ij}$ and $R_{ij}$:

$$V_{ij} = F_{ij}^\top F_{ij}, \tag{1}$$
$$R_{ij} = \Lambda_i^{+} F_{ij}^\top \Lambda_j F_{ij}, \tag{2}$$

where $\Lambda_i, \Lambda_j$ are the diagonal matrices of LBO eigenvalues of $S_i, S_j$ and $^{+}$ is the Moore-Penrose pseudo-inverse. Intuitively, each shape difference is a linear operator, which once again can be represented as a matrix of size $k_i \times k_i$, and which encodes the difference or distortion induced by a map (see Figure 2 and Eq.(4) in [Rustamov et al., 2013]).
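As a small illustration of the two operators, the sketch below evaluates the area-based difference as $F^\top F$ and the conformal difference as $\Lambda_{src}^{+} F^\top \Lambda_{dst} F$ for a functional map `F` expressed in area-orthonormal eigenbases. The names `diag_pinv` and `shape_differences` are assumptions made for this example. The toy input models a uniform scaling by a factor of 2, for which one expects the area-based difference to be $4I$ and the conformal difference to be the identity away from the constant function.

```python
import numpy as np

def diag_pinv(values, tol=1e-10):
    """Pseudo-inverse of a diagonal matrix stored as a 1-D array."""
    values = np.asarray(values, dtype=float)
    out = np.zeros_like(values)
    nonzero = np.abs(values) > tol
    out[nonzero] = 1.0 / values[nonzero]
    return out

def shape_differences(F, lam_src, lam_dst):
    """Area-based (V) and conformal (R) shape differences of a map F.

    F       : (k_dst, k_src) functional map, source -> target.
    lam_src : (k_src,) LBO eigenvalues of the source shape.
    lam_dst : (k_dst,) LBO eigenvalues of the target shape.
    """
    F = np.asarray(F, dtype=float)
    lam_dst = np.asarray(lam_dst, dtype=float)
    V = F.T @ F
    R = diag_pinv(lam_src)[:, None] * (F.T @ (lam_dst[:, None] * F))
    return V, R

# Uniform scaling by s = 2: areas grow by s^2, eigenvalues shrink by s^2,
# and the map between area-orthonormal eigenbases is s * I.
lam = np.array([0.0, 1.0, 2.5])
V_sc, R_sc = shape_differences(2.0 * np.eye(3), lam, lam / 4.0)
```

Here `V_sc` picks up the area change while `R_sc` stays at the identity on the non-constant eigenfunctions, matching the intuition that a scaling is conformal but not area-preserving.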

The key limitation of shape difference operators for shape collection analysis is that they require a choice of a base shape and consider only directional changes, from the base shape to the other shapes, making it impossible to use them given an arbitrary (non star-shaped) FMN. Thus, one of our goals is to extend this construction to the case of shape collections without assuming a fixed base shape. We achieve this by exploiting the formalism of latent functional bases [Wang et al., 2013], which has been proposed for improving the consistency of functional maps.

Latent Spaces

Given an FMN, the authors of [Wang et al., 2013] propose to extract a set of consistent latent bases $Y_i$, one on each shape $S_i$, such that $F_{ij} Y_i \approx Y_j$ for every given functional map $F_{ij}$, and use them to refine the quality (consistency) of functional maps. The latent bases can be thought of as functions on $S_i$, or as functional maps from some latent shape $S_0$ to each shape $S_i$. Then, a map from $S_i$ to $S_j$ can be factored into a map from $S_i$ to the latent shape and then to $S_j$ via $F_{ij} = Y_j Y_i^{+}$. While useful as a tool for improving functional maps, the exact structure of latent shapes is still not fully understood, and they have so far not been used for representing shapes in a collection.

In our work we first show how latent shapes can be endowed with geometric structure, and be made more stable, through an extra regularization, and then define a latent space shape representation.
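One common way to obtain such consistent latent bases is to minimize the total residual $\sum \|F_{ij} Y_i - Y_j\|_F^2$ under the normalization $\sum_i Y_i^\top Y_i = I$, which reduces to an eigen-decomposition of a block matrix. The sketch below is an illustration under simplifying assumptions (all maps are square of the same size `k`, dense matrices, no map weighting); it is not the reference implementation of [Wang et al., 2013], and the function name is hypothetical.

```python
import numpy as np

def consistent_latent_basis(maps, n_shapes, k, m):
    """Latent bases Y_i (k x m) minimizing sum ||F_ij Y_i - Y_j||_F^2
    subject to sum_i Y_i^T Y_i = I.

    maps : dict {(i, j): F_ij}, each F_ij of shape (k, k), mapping i -> j.
    """
    M = np.zeros((n_shapes * k, n_shapes * k))
    for (i, j), F in maps.items():
        # C @ Y_stacked equals the residual F_ij Y_i - Y_j of this edge.
        C = np.zeros((k, n_shapes * k))
        C[:, i * k:(i + 1) * k] = F
        C[:, j * k:(j + 1) * k] = -np.eye(k)
        M += C.T @ C
    _, vecs = np.linalg.eigh(M)
    Y = vecs[:, :m]                  # eigenvectors of the m smallest eigenvalues
    return [Y[i * k:(i + 1) * k] for i in range(n_shapes)]

# Toy chain network with perfectly consistent maps F_ij = G_j G_i^{-1}.
rng = np.random.default_rng(0)
n_shapes, k = 4, 5
G = [np.eye(k) + 0.1 * rng.normal(size=(k, k)) for _ in range(n_shapes)]
maps = {}
for i in range(n_shapes - 1):
    maps[(i, i + 1)] = G[i + 1] @ np.linalg.inv(G[i])
    maps[(i + 1, i)] = G[i] @ np.linalg.inv(G[i + 1])

Y = consistent_latent_basis(maps, n_shapes, k, m=k)
residual = max(np.linalg.norm(F @ Y[i] - Y[j]) for (i, j), F in maps.items())
gram = sum(Yi.T @ Yi for Yi in Y)
```

On this consistent toy network the residual vanishes up to round-off even though only a chain of maps is given, which is the property that lets the latent basis be extracted from sparse networks.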

5. Latent Representation

5.1. Canonical Latent Basis and Latent Shape

Our first key observation is that the latent shape plays the role of an “average shape” in analyzing shape collections – a shape-like object that represents the entire collection, and which can be endowed with a natural geometric structure. Crucially, unlike existing approaches, for example in computational anatomy [Younes, 2010] that consider building templates or average shapes, we characterize the latent shape directly in the functional domain, without attempting to embed it in the ambient space.

ALGORITHM 1. Computing a Canonical Consistent Latent Basis
input: A set of consistent latent bases $\{Y_i\}$ learned from a shape collection $\mathcal{S}$ and the associated FMN $\mathcal{G}$. The eigenbasis $\Phi_i$ and eigenvalues $\Lambda_i$ on each $S_i$.
output: A set of canonical consistent latent bases $\{\tilde{Y}_i\}$, and the eigenbasis $\Phi_0$ and the spectrum $\Lambda_0$ of the latent shape.
  1. Compute the eigen-decomposition of $E = \sum_i Y_i^\top \Lambda_i Y_i$ so that $E U = U \Lambda_0$, and let $\tilde{Y}_i = Y_i U$.
  2. Let $\Phi_0 = \Phi_i \tilde{Y}_i$ for an arbitrary $i$, and take $\Lambda_0$ from the previous step.

The following theorem establishes the connection between the consistent latent basis and the geometry of the latent shape, while at the same time highlighting the limitations of the previously used approaches for constructing latent bases:

Theorem 5.1.

Given a collection of discrete 3D shapes $\{S_i\}$ in one-to-one vertex correspondence and sharing the same mesh connectivity, and a consistent FMN $\mathcal{G}$ in which the functional maps are represented in the eigenbasis $\Phi_i$ on each $S_i$, let $Y_i$ be the consistent latent bases satisfying the conditions $\sum_i Y_i^\top Y_i = I$ and $\sum_i Y_i^\top \Lambda_i Y_i = \Lambda_0$, where $\Lambda_0$ is a diagonal matrix. Then the eigenbasis $\Phi_0$ of the latent shape whose metric and measure are given by $W_0 = \sum_i W_i$ and $A_0 = \sum_i A_i$, i.e., $W_0 \Phi_0 = A_0 \Phi_0 \Lambda_0$, can be recovered as $\Phi_0 = \Phi_i Y_i$ for any $i$.

This theorem suggests that the consistent latent basis carries information about the “average” geometry in the collection, given, in the full basis, by the average metric and measure matrices.
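The statement can be checked numerically on a toy collection. The sketch below is a sanity check under stated assumptions rather than the paper's experiment: shapes are replaced by random symmetric positive semi-definite "stiffness" matrices and random positive lumped areas on a shared vertex set, the full eigenbasis is used, and the latent shape is built from the summed matrices (dividing both sums by the number of shapes would leave the generalized eigen-equation unchanged). All variable names are illustrative.

```python
import numpy as np

def gen_eigs(W, a):
    """Full solution of W Phi = diag(a) Phi Lambda with Phi^T diag(a) Phi = I."""
    s = 1.0 / np.sqrt(a)
    lam, U = np.linalg.eigh((s[:, None] * W) * s[None, :])
    return lam, s[:, None] * U

rng = np.random.default_rng(0)
n_vertices, n_shapes = 6, 3
areas, stiffs, bases, spectra = [], [], [], []
for _ in range(n_shapes):
    a = rng.uniform(0.5, 1.5, n_vertices)        # lumped vertex areas
    B = rng.normal(size=(n_vertices, n_vertices))
    W = B @ B.T                                  # PSD stand-in for stiffness
    lam, Phi = gen_eigs(W, a)
    areas.append(a); stiffs.append(W); bases.append(Phi); spectra.append(lam)

# Latent shape: summed measure and metric, and its eigenbasis.
a0, W0 = sum(areas), sum(stiffs)
lam0, Phi0 = gen_eigs(W0, a0)

# Express Phi0 in each shape's eigenbasis: Y_i = Phi_i^{-1} Phi0 = Phi_i^T A_i Phi0.
Y = [Phi.T @ (a[:, None] * Phi0) for Phi, a in zip(bases, areas)]

measure_sum = sum(Yi.T @ Yi for Yi in Y)
metric_sum = sum(Yi.T @ (lam[:, None] * Yi) for Yi, lam in zip(Y, spectra))
# Identity vertex map from shape 0 to shape 1, written in the eigenbases.
F01 = bases[1].T @ (areas[1][:, None] * bases[0])
```

In this setting `measure_sum` equals the identity, `metric_sum` equals the diagonal matrix of `lam0`, and `F01 @ Y[0]` reproduces `Y[1]`, so the bases `Y` satisfy exactly the conditions listed in the theorem.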

Role of Proper Regularization

Note that previous approaches, such as [Wang et al., 2013], proposed to compute the latent basis by solving the optimization problem $\min_{Y} \sum_{(i,j)} \| F_{ij} Y_i - Y_j \|_F^2$ over all given maps $F_{ij}$, subject to $\sum_i Y_i^\top Y_i = I$. Geometrically, and in light of Theorem 5.1, this corresponds to only averaging the measure of the shapes, which leads to metric ambiguity. This can result in significant instabilities in the extraction of the latent basis. We demonstrate this effect in Figure 2. Namely, given a shape collection and an additional shape, we compared the consistent latent basis (CLB) on a given shape computed with respect to the original collection with the one recomputed using all shapes. Figure 2(b) depicts the change of basis matrix between these two settings, which has noisy off-diagonal entries, suggesting that the latent shape is significantly perturbed.

To overcome this instability, we propose to construct a canonical latent basis by introducing an extra normalization which forces $\sum_i Y_i^\top \Lambda_i Y_i$ to be a diagonal matrix, and which corresponds in Theorem 5.1 to averaging the metric on the latent shape. With this additional normalization, the change of basis matrix between the latent bases computed with and without the additional shape, shown in Figure 2(c), is much closer to a diagonal one than in Figure 2(b). The details of this construction are given in Algorithm 1.
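The extra normalization amounts to one small symmetric eigen-decomposition. The following sketch assumes the latent bases already satisfy $\sum_i Y_i^\top Y_i = I$ and that eigenvalues are stored as 1-D arrays; the function name and the random test data are illustrative only.

```python
import numpy as np

def canonical_latent_basis(Y_list, lam_list):
    """Rotate latent bases so that sum_i Y_i^T Lambda_i Y_i becomes diagonal.

    Y_list   : list of (k_i, m) latent bases with sum_i Y_i^T Y_i = I.
    lam_list : list of (k_i,) LBO eigenvalues, one array per shape.
    Returns the rotated bases and the latent spectrum (ascending).
    """
    E = sum(Y.T @ (lam[:, None] * Y) for Y, lam in zip(Y_list, lam_list))
    E = 0.5 * (E + E.T)                  # remove round-off asymmetry
    lam0, U = np.linalg.eigh(E)          # E U = U diag(lam0), U orthogonal
    return [Y @ U for Y in Y_list], lam0

# Random bases whose stacked matrix has orthonormal columns.
rng = np.random.default_rng(1)
n_shapes, k, m = 3, 6, 4
Q, _ = np.linalg.qr(rng.normal(size=(n_shapes * k, m)))
Y_list = [Q[i * k:(i + 1) * k] for i in range(n_shapes)]
lam_list = [np.sort(rng.uniform(0.0, 10.0, k)) for _ in range(n_shapes)]

Y_can, lam0 = canonical_latent_basis(Y_list, lam_list)
```

Because the rotation `U` is orthogonal, the measure normalization is preserved while the metric term becomes diagonal, which fixes the basis up to sign whenever the latent eigenvalues are distinct.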

Figure 2. (a) Input shapes, including one additional shape; (b) the transformation matrix between the standard latent bases computed with and without the additional shape; (c) the same transformation matrix but between canonical latent bases; (d) the transformation matrix between the computed latent basis and the theoretical ground-truth stated in Theorem 5.1 when expressing functional maps in a reduced basis; (e) the computed spectrum and the theoretical ground-truth (see the text for details).

Now, the extra normalization incorporates the metric information, so the latent shape can be thought of as a well-defined shape. In general, a shape with the average metric and measure does not admit an embedding in $\mathbb{R}^3$, but as we will soon show, this construction carries rich geometric information useful for shape processing.

Let us stress that Theorem 5.1 is of purely theoretical interest. However, Algorithm 1 can be implemented in practice, without assuming access to the full basis or exact consistent maps. In Figure 2(d), we also show the proximity between the eigenbasis and spectrum of the latent shape recovered from functional maps in the reduced basis and the theoretical ground truth. Namely, Figure 2(d) shows the transformation matrix between the leading part of the computed eigenbasis, when functional maps are represented in a reduced basis, and the theoretical ground-truth, given by the exact averaging of the metric and measure. At the same time, Figure 2(e) shows the eigenvalues in the two cases. Hereafter, we always use the canonical latent basis in all the formulations and applications, and denote it by $Y_i$ to simplify notation.

Computing the canonical latent basis in practice. Computing the consistent latent basis with the framework of [Wang et al., 2013] involves an eigen-decomposition of a possibly large, block-wise sparse matrix, whose size depends on the number of shapes and the dimensionality of the functional maps. To gain scalability, in practice we first sample a subset of shapes $\mathcal{S}' \subset \mathcal{S}$, with which we compute the canonical latent basis. Then, for each shape $S_j$ outside the subset, we search for its nearest neighbor $S_i$ in $\mathcal{S}'$ using the Shape-DNA descriptor. Finally, we push the latent basis from $S_i$ to $S_j$ via the functional map $F_{ij}$, namely $Y_j = F_{ij} Y_i$. This scheme not only improves the scalability of the computation of our latent shape representation, but also avoids recomputing the latent basis for each new shape.

5.2. Shapes as Latent Shape Differences

Although the canonical CLB reduces the instability present in the previous basis construction, the latent bases unfortunately still cannot be used to represent each shape in the collection. The main reason is that $Y_i$ is expressed in the eigenbasis of shape $S_i$, and therefore one cannot directly compare, for example, $Y_i$ with $Y_j$, which is fundamental in both shape analysis and learning applications.

Instead, we build our shape representation by defining the latent shape differences, which are linear operators acting on the function space of the latent shape, and which, as such, are independent of the basis on each shape.

Namely, we assume that the spectrum of the latent shape, arising from the eigen-decomposition in Algorithm 1, is denoted by $\Lambda_0$. Then, following the formulation of [Rustamov et al., 2013], we define the area-based and conformal latent shape differences of shape $S_i$, with canonical latent basis $Y_i$ and LBO eigenvalues $\Lambda_i$, as:

$$V_i = Y_i^\top Y_i, \tag{3}$$
$$R_i = \Lambda_0^{+} Y_i^\top \Lambda_i Y_i. \tag{4}$$

The final procedure for extracting these operators from a given collection is summarized in Algorithm 2.

ALGORITHM 2. Construction of Latent Shape Difference Operators
input: Shape collection $\mathcal{S}$ and associated FMN $\mathcal{G}$. The eigenbasis $\Phi_i$ and spectrum $\Lambda_i$ on each $S_i$.
output: A pair of latent shape differences for each shape $S_i$: area-based $V_i$ and conformal $R_i$.
  1. Compute the CLB with respect to $\mathcal{S}$ and $\mathcal{G}$ via the framework of [Wang et al., 2013].
  2. Compute the canonical CLB $Y_i$ and the diagonal matrix $\Lambda_0$ with the spectrum of the latent shape, using Algorithm 1.
  3. Construct $V_i = Y_i^\top Y_i$ and, respectively, $R_i = \Lambda_0^{+} Y_i^\top \Lambda_i Y_i$.
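The last step of the procedure can be sketched for a whole collection at once, stacking the operators into arrays that are convenient as input to a learning pipeline. The code assumes the canonical latent bases and the latent spectrum have already been computed, takes the area-based operator as $Y^\top Y$ and the conformal one as $\Lambda_0^{+} Y^\top \Lambda Y$, and uses hypothetical names throughout. The two-shape example is hand-built so that both normalizations hold exactly.

```python
import numpy as np

def latent_shape_differences(Y_list, lam_list, lam0, tol=1e-10):
    """Stack the area-based and conformal latent shape differences.

    Y_list   : list of (k_i, m) canonical latent bases.
    lam_list : list of (k_i,) LBO eigenvalues per shape.
    lam0     : (m,) spectrum of the latent shape.
    Returns V, R of shape (n_shapes, m, m).
    """
    lam0 = np.asarray(lam0, dtype=float)
    inv0 = np.zeros_like(lam0)
    nonzero = np.abs(lam0) > tol
    inv0[nonzero] = 1.0 / lam0[nonzero]          # pseudo-inverse of diag(lam0)
    V = np.stack([Y.T @ Y for Y in Y_list])
    R = np.stack([inv0[:, None] * (Y.T @ (lam[:, None] * Y))
                  for Y, lam in zip(Y_list, lam_list)])
    return V, R

# Two shapes sharing the same latent functions but with different spectra.
Y_list = [np.eye(3) / np.sqrt(2.0), np.eye(3) / np.sqrt(2.0)]
lam_list = [np.array([0.0, 1.0, 2.0]), np.array([0.0, 3.0, 6.0])]
lam0 = np.array([0.0, 2.0, 4.0])   # diagonal of sum_i Y_i^T Lambda_i Y_i
V, R = latent_shape_differences(Y_list, lam_list, lam0)
```

Under the two normalizations, the area-based operators sum to the identity and the conformal ones sum to $\Lambda_0^{+} \Lambda_0$, so each shape is described by how it deviates from the collection as a whole rather than from a chosen base shape.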

The main insight of our work is that the latent shape differences provide a compact and extremely versatile representation for each shape in a collection as a pair of small-sized matrices, which enjoy several nice theoretical properties and enable a number of novel applications in analysis and learning.

5.3. Properties of the Latent Shape Differences

Given a shape collection with the associated functional map network, the latent space shape differences (LSSDs) provide a representation of each shape as a pair of matrices whose size is controlled by the size of the latent basis. In this work, we argue that this representation enables a number of novel applications and lifts fundamental restrictions of previous approaches. In particular, LSSDs inherit some of the most attractive properties of shape differences, such as their compactness and informativeness, while avoiding their shortcomings. Below we summarize the main properties of this representation.

Invariance: LSSDs provide a representation that is invariant to rigid (and more generally isometric) shape transformations. In the context of learning, this is especially important as it will allow us to do inference in a pose-invariant way.

Flexibility: computing LSSDs only requires the knowledge of functional maps and places no restriction on the shape discretization. For example, they can accommodate collections of shapes with different numbers of vertices, or even with different modalities such as point clouds and meshes.

Informativeness: LSSDs fully encode the intrinsic geometry of each shape in the collection in a compact way. Indeed, it follows from Theorem 5.1 that in the presence of full information, given the FMN of a collection of shapes $\mathcal{S}$, the spectrum $\Lambda_0$ of the latent shape, and the LSSDs $V_i, R_i$ for each shape $S_i$ in $\mathcal{S}$, one can recover the intrinsic geometry of each $S_i$, i.e., the area and stiffness matrices $A_i, W_i$, which, in turn, fully determine the edge lengths [Zeng et al., 2012].

Functoriality: if we interpret each $Y_i$ as the functional map associating the latent shape to $S_i$, it follows from the functoriality property in [Rustamov et al., 2013] that

$$D_j = D_i \, Y_i^{+} D_{ij} Y_i,$$

where $D_i, D_j$ are the latent shape differences (area-based or conformal) of $S_i, S_j$ and $D_{ij}$ is the shape difference between $S_i$ and $S_j$. Thus, LSSDs not only encode the difference of each shape to the latent shape but also make it possible to factor the difference between each pair of shapes via the canonical latent basis.

Algebraic nature: LSSDs are linear functional operators on the latent shape. As such, they can be represented as small matrices and manipulated in practice using standard numerical linear algebra tools. Moreover, they provide detailed (localized) information about the shape geometry. As we show below, this allows us to extract partial information to compare and reconstruct shape parts, in contrast to purely global shape descriptors.

Base-shape independence: Crucially, unlike the original shape differences, which rely on the choice of a specific base shape (which can lead to biased results) and require a star-shaped map network, LSSDs are extracted from the entire input functional map network, regardless of its topology. This is especially true due to our novel regularization, which leads to a latent shape endowed with canonical geometric structure. Let us note that theoretically, in the presence of full information and an a priori consistent map network, the choice of the base shape should not affect the results. In practice, however, functional maps are represented in a reduced basis and are not perfectly consistent, which can introduce strong bias in the subsequent analysis.

To illustrate this effect, we aligned the collections of cats and dogs shown in Figure 3, with no maps given across the two classes, using the original and the latent space shape differences. For the former, we assume that a pair of shapes, e.g., the boxed animals in Figure 3, are used as bases in each cluster, and compute the eigenvalues of the respective shape differences as descriptors. For the latter, we use the eigenvalues of the latent shape differences as the descriptor for each shape in the collection, without any a priori information. The alignment results based on the above descriptors are shown in the bottom two rows of Figure 3. Note that when using the approach of [Rustamov et al., 2013], even after fixing the corresponding base shapes, none of the base shape choices led to the correct result. We demonstrate one such result, obtained by fixing the base shapes to be the ones shown in the blue boxes. Meanwhile, as shown in the middle row, using the latent shape differences results in the ground-truth alignment. Note that the same experiment has been conducted in [Rustamov et al., 2013] (see Figure 13 therein); however, to obtain the exact alignment, the authors used all pairwise shape differences.

Figure 3. Simultaneous analogies between a collection of cats and dogs without maps across them. The ground-truth correspondences are indicated by the color-coding. In the middle row, the latent shape differences recover the ground-truth alignment. The original shape differences, on the other hand, fail to recover it; one failure example, using the shape differences with the boxed base shapes, is shown at the bottom.

    Compatibility with sparse map networks: Another key advantage of our latent representation is its ability to extract information from sparse map networks. As observed in previous works [Huang et al., 2014], functional maps between similar shapes are typically much easier to compute. On the other hand, establishing functional maps from a fixed base shape to all other shapes in the collection can lead to significant errors. To illustrate this, we consider a sequence of frames of galloping horses shown in Figure 4(a), and assume that only functional maps between consecutive frames are given, resulting in a sparse FMN with chain topology. Figure 4(b) demonstrates that, even when extracted from the sparse FMN, the LSSDs recover the cyclical structure of the collection, while using the shape differences from the base shape, , computed by composing the given functional maps, leads to an erroneous embedding, as shown in Figure 4(c).

    Figure 4. (a) frames of a galloping horse. With a given FMN of chain topology, we computed the latent and original shape differences as signatures of the shapes: (b) PCA layout of the latent shape differences; (c) PCA layout of the original shape differences.
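
The PCA layouts in Figure 4 treat each shape difference matrix as a signature vector. A minimal sketch of such a layout, assuming the operators are already available as equally sized NumPy arrays (the function name `pca_layout` and the flattening step are illustrative choices, not taken from the paper):

```python
import numpy as np

def pca_layout(operators, n_components=2):
    """Embed a list of equally sized matrices into a low-dimensional PCA space."""
    # Flatten every operator into one row vector (one row per shape).
    X = np.stack([np.asarray(D, dtype=float).ravel() for D in operators])
    # Center the data so that PCA captures variation around the mean operator.
    X = X - X.mean(axis=0)
    # The left singular vectors scaled by singular values are the PCA coordinates.
    U, S, _ = np.linalg.svd(X, full_matrices=False)
    return U[:, :n_components] * S[:n_components]

# Toy usage: five hypothetical operators that vary along a single direction.
coords = pca_layout([t * np.eye(3) for t in range(5)])
```

For a collection with cyclic structure, such as a gallop sequence, one would expect the 2D coordinates to trace out a loop when the signatures are informative.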

    5.4. Projected Latent Shape Differences

    Besides encoding each shape in the collection, the latent shape differences also give access to detailed information about the deformation, including the local changes in different shape regions in a purely algebraic way.

    A link between actual deformations across shapes and functional distortions induced by the respective shape differences has been established in [Rustamov et al., 2013]: namely, a function, such as an indicator function of a region, will be modified by the shape difference if it is supported on a region undergoing a deformation. It has been further shown in [Rustamov et al., 2013] (see Section 6) that the area-based (resp. conformal) shape difference is an identity operator if and only if the underlying map is area-preserving (resp. conformal).

    In this section, we propose a novel projection operation on the latent shape differences. The key observation is that we can suppress a functional deformation by modifying the shape difference so that it acts like an identity operator on a certain functional subspace expressing the deformation of interest. In the following, this allows us to perform partial shape analogies.

    Suppose that we are given a set of shapes and a FMN , and let be the LSSDs computed using Algorithm 2. Now we consider a set of functions , where are orthonormal basis functions on the latent shape, i.e., . We construct a projected latent shape difference using and as follows:

    (5)

    It is easy to verify that if is orthogonal to the subspace spanned by functions in , and if is spanned by the functions in . Intuitively, if contains the full basis on the latent shape, then , which forces latent shape difference to correspond to an area-preserving or conformal map, depending on the type of .
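
The two properties stated above (identity on the chosen subspace, unchanged action on its orthogonal complement) can be checked numerically. The sketch below builds one operator with exactly these properties; it is an assumption made for illustration and is not claimed to coincide with the paper's Eq. (5). The names `project_difference`, `D` and `F` are hypothetical.

```python
import numpy as np

def project_difference(D, F):
    """Operator acting as the identity on span(F) and as D on its orthogonal complement.

    D : (k, k) matrix of a latent shape difference, expressed in the latent basis.
    F : (k, m) matrix whose columns are orthonormal functions (F.T @ F = I).
    """
    P = F @ F.T                      # orthogonal projector onto span(F)
    I = np.eye(D.shape[0])
    # Apply D only to the component outside span(F); keep the rest untouched.
    return D @ (I - P) + P

# Toy usage with a random operator and a random 2-dimensional subspace.
rng = np.random.default_rng(0)
D = rng.standard_normal((5, 5))
F, _ = np.linalg.qr(rng.standard_normal((5, 2)))
D_proj = project_difference(D, F)
```

When `F` spans the whole latent basis, the projector is the identity matrix and the result reduces to the identity operator, matching the intuition given above.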

    6. Shape Collection Comparison

    Several approaches have been proposed for detecting geometric variability that exists within a given collection of shapes connected by functional maps, e.g., [Rustamov et al., 2013; Huang and Ovsjanikov, 2017]. In this section, we show how our latent-based shape representation can be used for detecting and analyzing differences across different shape collections, or two subsets of a larger collection. Namely, given a set of shapes , a FMN and a partition , we aim to capture the difference between and , while not being sensitive to the global variability that exists within . This problem arises especially when trying to detect the detailed geometric properties that are responsible for the differences between shape classes (e.g., healthy vs unhealthy organs), while factoring out the “normal” or “common” variability within the collection.

    Global variability

    Before approaching this problem, we first propose an algebraic approach for detecting global variability within a collection. Our observation is that, in light of Section 5.4, suppressing global variability should lead to projected LSSDs that are indistinguishable from each other. Namely, we would like to find a basis such that the latent shape differences projected onto are as close as possible. For this, we first introduce a term that measures the difference of the norms between the original and projected latent shape differences:

    (6)

    According to the following lemma, the change is always non-negative and can be written in a quadratic form.

    Lemma 6.1 ().

    If , then

    It is natural to optimize for a function , which maximizes the global change of distances within the collection, i.e.,

    (7)

    In other words, after suppressing the functional deformation related to , the shapes are maximally brought together. According to Lemma 6.1, is given by the eigenfunction associated with the largest eigenvalue of .
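
The optimization above relies on a standard fact: for a symmetric matrix, the unit vector that maximizes the associated quadratic form is the eigenvector of the largest eigenvalue. A minimal sketch, where `M` is a hypothetical stand-in for the symmetric matrix arising from Lemma 6.1 (its exact form is not reproduced here):

```python
import numpy as np

def top_eigenfunction(M):
    """Return the largest eigenvalue of a symmetric matrix and a unit eigenvector for it."""
    M = 0.5 * (M + M.T)              # symmetrize to guard against round-off
    eigvals, eigvecs = np.linalg.eigh(M)   # eigenvalues come back in ascending order
    return eigvals[-1], eigvecs[:, -1]

# Toy usage: the maximizer of f^T M f over unit vectors f.
M = np.diag([1.0, 5.0, 2.0])
value, f = top_eigenfunction(M)
```

The returned coefficients are expressed in whatever basis `M` is written in; in the setting above that would be the latent basis. The sign of the eigenvector is arbitrary.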

    Cross-collection variability

    Following the same idea above, we formulate the cross-collection variability to be such that after suppressing it, the clusters and should become closer to each other, while maintaining their inner structure. In other words, we aim to simultaneously maximize the changes of distances across shapes in different clusters, and minimize those within the same cluster.

    Putting these two goals together, we construct:

    (8)
    Figure 5. The global variability of four deformed spheres (in the blue box) and the cross-collection variability regarding the partition and (in the red boxes) detected by our algorithms. Note that the horizontal bump (global variability) is of twice the magnitude of the vertical one (cross-collection variability).

    As an illustration, in Figure 5, we demonstrate the optimizers respectively. Since the horizontal bump is of twice the size of the vertical one, to maximally reduce the intra-variability, one should suppress the horizontal deformation. Meanwhile, it is intuitive that cluster and are distinguished by magnitudes of the vertical bumps, which should be detected as cross-collection variability.

    Finally, we point out that, although they are not equivalent, there is a connection between the formulations above and the one for detecting global variability proposed in [Huang and Ovsjanikov, 2017]. In fact, we can use results from both approaches for cross-validation. We refer interested readers to the Appendix for the statement and proof of this connection.

    7. Applications in Learning

    On the (deep) learning side of our exposition, we study how our representation, the latent shape difference operators, can be used as the input that neural networks rely upon to reason about 3D data. A key property of this input representation is that it is encoded as a small matrix, i.e., it provides a regular structure amenable to convolutions. CNNs rely on and take advantage of the spatial proximity found in regular-grid data such as images. Analogously, according to our formulation for LSSDs, we have, e.g. in Eq. 3, , where is the -th latent basis function. Thus, instead of spatial proximity, the neighboring entries in our matrix representation encode and provide interactions of function pairs that are close in the “spectral” domain.

    The first task at which we test the effectiveness of our approach is shape regression: here we compare neural networks that learn to estimate hidden parameters that control the body variation of human-form meshes. Assuming a set of training shapes with known parameters, represented as real-valued vectors, our networks learn to regress the underlying parameters for new, unseen shapes.

    The second task we explore is that of 3D point-cloud reconstruction. Quite differently from the regression task, here we test how a novel network that takes a latent difference matrix as input can learn to reconstruct a 3D point-cloud version of the underlying mesh. This problem is closely related to shape reconstruction from intrinsic operators, which was recently considered in [Boscaini et al., 2015; Corman et al., 2017]. There, several advanced, purely geometric optimization techniques were proposed that give satisfactory results in the presence of full information [Boscaini et al., 2015] or under strong (extrinsic) regularization [Corman et al., 2017], but they also demonstrate the many challenges posed by this type of reconstruction. In contrast, we show that by using the context of a collection and learning machinery, real shapes can be recovered rather well from their latent difference operators, and moreover that entirely new shapes can be synthesized using the algebraic structure of difference operators.

    One possible concern with our approach is that it requires an initial functional map network, which can potentially restrict the amount of training data available. However, as we show in Section 8, even for collections of moderate size, consisting of a hundred to two hundred shapes, our networks are sufficiently regularized and allow for very powerful and effective learning.

    7.1. Localized Latent Shape Difference

    Localized shape deformation is a useful tool for shape analysis and synthesis in geometry processing. The algebraic form of our latent representation makes it easy to manipulate, while the geometric information encoded in it allows us to access local geometric features.

    Given shapes and with their respective LSSDs and , and a set of basis functions , expressed in the basis of the latent shape and supported on a localized region on the shapes, we can construct an operator that acts as on and as on the complement of as follows: Note that this expression resembles Eq. (5) above, but where we consider , so that one of the interpolated shapes is the latent shape itself. Using allows us to construct shapes by mixing different parts or regions of existing shapes, leading to localized interpolation/shape analogy, as we will show in Section 8.4.
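
As a hedged illustration of the construction described above, the sketch below mixes two operators so that one governs a chosen subspace and the other governs its orthogonal complement. This is one operator with the stated behavior, written under the assumption that the localized functions are orthonormal in the latent basis; it is not claimed to be the paper's exact expression. The names `mix_differences`, `D_on`, `D_off` and `F` are hypothetical.

```python
import numpy as np

def mix_differences(D_on, D_off, F):
    """Operator acting as D_on on span(F) and as D_off on the orthogonal complement.

    D_on, D_off : (k, k) latent shape difference matrices of two shapes.
    F           : (k, m) orthonormal functions supported on the region of interest.
    """
    P = F @ F.T                      # orthogonal projector onto span(F)
    I = np.eye(P.shape[0])
    return D_on @ P + D_off @ (I - P)

# Toy usage with random operators and a random 2-dimensional subspace.
rng = np.random.default_rng(0)
D_a, D_b = rng.standard_normal((2, 6, 6))
F, _ = np.linalg.qr(rng.standard_normal((6, 2)))
D_mix = mix_differences(D_a, D_b, F)
```

Setting `D_on` to the identity matrix yields an operator that is the identity on `span(F)`, which is consistent with the remark that one of the two shapes may be the latent shape itself.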

    8. Main Experimental Results

    8.1. Applications in Learning

    In this set of experiments, we explored how latent shape differences can be used within the context of a 3D deep learning pipeline. As mentioned in Section 7, our latent differences provide a new representation of geometry with unique characteristics, suggesting their use in 3D-ML applications.

    8.2. Data-generation

    For the experiments of Sections 8.3 and 8.4 we generated human shape bodies in eight different poses using the open-source implementation [Chen et al., 2015] of the SCAPE method [Anguelov et al., 2005]. In [Chen et al., 2015], body variations are controlled with latent parameters , which informally encode shape attributes such as height, leg-girth, belly protrusion, etc. To generate our shapes we sampled uniformly i.i.d. each of the aforementioned parameters and considered eight modifications of the standard T-pose. See Figure 6 for a sample of the resulting meshes. It is worth noting that the produced meshes share the same combinatorial tessellation on 6,449 vertices, which facilitated the construction of pairwise functional maps in this collection. For the following experiments of regression and reconstruction, we used a train-test-val split with of this dataset respectively.

    Figure 6. Examples of synthetically generated meshes used within the learning-based pipelines of Section 8.4, displaying a randomly selected mesh of each pose-class.

    8.3. Regression

    In this experiment, we assess the efficacy of a neural network in regressing the body-generating parameters under different types of input representations. Concretely, we compare the responses between two types of input: point-clouds with points sampled uniformly area-wise from each mesh, and area-based latent differences. We explore the effect of several design choices in the construction of our differences. First, we consider different topologies of the underlying Functional Map Network (FMN). These include the complete graph but also much sparser versions based on the -nearest-neighbors () of each shape. Second, we vary the dimensions of the latent bases, which crucially affects the size of the difference matrices. We use the LBO eigenvectors with the smallest eigenvalues to express all functional maps and the Euclidean norm of these spectra to define a distance for the construction of the -nearest-neighbors. Last, we train our neural networks to minimize the Mean-Square-Error (MSE) between their predicted and the ground-truth shape-generating parameters. Note that since these parameters are independent of a shape’s pose, pose variations of this dataset act as “nuisance” variables that the networks have to explain away.
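
The sparse FMN topologies and the training objective described above can be sketched as follows. This is a minimal illustration that assumes each shape comes with a precomputed vector of LBO eigenvalues; the helper names `knn_edges` and `mse` are hypothetical, and the symmetrization of the neighbor relation into undirected edges is an assumption.

```python
import numpy as np

def knn_edges(spectra, k):
    """Undirected edges linking each shape to its k nearest neighbours.

    spectra : (n, d) array with one row of LBO eigenvalues per shape.
    """
    spectra = np.asarray(spectra, dtype=float)
    # Pairwise Euclidean distances between the spectra.
    diff = spectra[:, None, :] - spectra[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)
    np.fill_diagonal(dist, np.inf)   # a shape is not its own neighbour
    edges = set()
    for i, row in enumerate(dist):
        for j in np.argsort(row)[:k]:
            edges.add((min(i, int(j)), max(i, int(j))))
    return sorted(edges)

def mse(pred, target):
    """Mean squared error between predicted and ground-truth parameter vectors."""
    return float(np.mean((np.asarray(pred) - np.asarray(target)) ** 2))

# Toy usage: four shapes with one-dimensional "spectra".
edges = knn_edges([[0.0], [1.0], [3.0], [7.0]], k=1)
```

Each edge of the resulting graph would then carry a functional map between the two shapes it connects; a clique topology corresponds to keeping all pairs.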

    8.3.1. Comparing architectures: protocol

    To select a good point-cloud (PC) based architecture, we evaluate three PointNet-like networks [Qi et al., 2016] that use encoding/decoding schemes like those of [Achlioptas et al., 2018]. These architectures have shown excellent results in tasks involving 3D point clouds, including classification, part-segmentation and generation, and provide a strong baseline. Concretely, our point-based architectures have three layers of convolutional encoders, followed by a feature-wise max-pool and either two or three layers of FC-ReLUs that act as decoders. To strengthen our comparisons, we calibrate each PC architecture to have a distinct number of training parameters and train it with several learning rates to obtain, from a pool of models, the one with the best performance (see Appendix Sec. B.1.1 for more details).

    At the same time, we consider two types of architectures when the input is a latent difference: Multi-Layer Perceptrons (MLPs) and Convolutional Neural Networks (CNNs). Across all experiments, these MLPs are four layers deep and the CNNs have two layers of convolutions leading to a third FC layer (see Appendix Sec. B.2 for more details).

    Figure 7 shows the MSE between the predicted vectors and the ground-truth for the test shapes in a variety of conditions. The reported MSE is the average over five random data-splits and weight initializations of the neural nets. The networks are trained for at most 500 epochs, and the displayed MSE corresponds to the model (epoch) that performed best on the validation split. The dashed line shows the performance of the best overall point-based architecture.

    Discussion.

    Figure 7 reveals several trends. First, shape difference CNNs perform better than MLPs, and both perform significantly better than point-based nets for a wide variety of configurations. Second, there seems to exist a sweet spot between 35 and 45 latent bases, which consistently produces better results across different network topologies. Third, denser topologies give rise to better results, with the clique FMN achieving the best performance. In Table 1 we include some complementary information to that of Figure 7. Its first row contains the MSE measurements for the point-based network (PC column) and some MLP/CNN configurations (clique or 20-nearest-neighbors topology with basis functions). The second row reports the generalization error (difference between test and training MSE) of each architecture. The architectures seem to over-fit in a similar fashion percentage-wise, but crucially the difference-based ones do so at significantly lower values. The last two rows report the MSE and the average distance between the predictions and ground-truth when we train these networks for 1,000 instead of 500 epochs.

    Figure 7. Regression-based comparison of different input modalities and neural nets. The y-axis depicts the Mean-Square-Error (MSE) on the test split. The x-axis corresponds to the number of consistent functions each shape was associated with. The dashed line is the point-based baseline (best of such models). Models starting with M (solid lines) are MLPs and those starting with C (scattered points) are CNNs. K10/K20: sparse 10/20-nearest-neighbor FMN topologies. Clique stands for the clique FMN topology. The results are averages of 5 random seeds.
    Metric     PC      MLP-Clique   CNN-20   CNN-Clique
    MSE@500    0.057   0.033        0.027    0.009
    GE@500     0.020   0.009        0.010    0.003
    MSE@1K     0.061   0.032        0.013    0.005
    @1K        0.192   0.134        0.086    0.050
    Table 1. Complementary statistics of Fig. 7, see 8.3.1 for details. GE stands for Generalization Error. For reference, if our output prediction was the average of the training examples, then the average would be 0.245 and the MSE 0.08.

    8.4. Reconstruction

    In the second set of deep learning experiments we demonstrate how we can reconstruct a point-cloud derived from a 3D mesh based on the corresponding latent area-based difference operators. To achieve this we use a wider and deeper version of our previous CNN with the regression-optimal input: difference matrices of dimensions , based on a clique FMN. The new network is comprised of 5 layers, with the first two layers being convolutional and the remaining three FCs (see Appendix Sec. B.2 for more details). The output of this network is real-numbers which are trained to have minimal Chamfer-(pseudo)-distance, from the corresponding ground-truth point-clouds that are comprised also of points (similar to [Fan et al., 2016; Achlioptas et al., 2018]).
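
For reference, a Chamfer (pseudo-)distance between two point clouds can be sketched as below. Several variants exist; this sketch assumes the sum of the mean squared nearest-neighbor distances in both directions, which may differ from the exact variant used for training here. The function name `chamfer_distance` is illustrative.

```python
import numpy as np

def chamfer_distance(A, B):
    """Symmetric Chamfer pseudo-distance between point clouds A (n, 3) and B (m, 3)."""
    A = np.asarray(A, dtype=float)
    B = np.asarray(B, dtype=float)
    # Squared Euclidean distance between every point of A and every point of B.
    d2 = np.sum((A[:, None, :] - B[None, :, :]) ** 2, axis=-1)   # shape (n, m)
    # Average nearest-neighbour cost from A to B, plus the same from B to A.
    return float(d2.min(axis=1).mean() + d2.min(axis=0).mean())

# Toy usage: one point against two points.
A = np.array([[0.0, 0.0, 0.0]])
B = np.array([[1.0, 0.0, 0.0], [0.0, 2.0, 0.0]])
loss = chamfer_distance(A, B)
```

The quantity is zero when the two clouds coincide as sets, but it does not satisfy the triangle inequality, hence the term pseudo-distance. The dense pairwise matrix makes this version practical only for clouds of moderate size.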

    Figures 8 and 10 demonstrate the quality of the learned reconstructions along with the capacity of our representation for doing semantically rich shape synthesis operations, such as constructing new shapes (not present in the original shape collection) based on shape analogies. First, to visually inspect the reconstruction quality, compare the ground-truth point-clouds: with their corresponding reconstructions . These ground-truth point-clouds belong to the test split, and their reconstructions have successfully captured both the underlying pose and the body structure. While this is true, it is also evident that some high-frequency geometric information (mostly around the hands) has not been recovered. Despite these artifacts, these results are remarkable, given previous attempts at shape reconstruction from difference operators [Boscaini et al., 2015; Corman et al., 2017], which only work by combining both area and conformal differences and work in very restricted settings under strong regularization.

    We also test the generalization power of the network by synthesizing shapes and that try to have a similar pose-wise and body-wise relation to the point-cloud , as the relation that point-cloud has to (to form an analogy). To construct we decode (i.e. reconstruct) the neural-network’s latent code corresponding to the additive formula: , where is the output activations of the first FC layer when the input is shape . This is the traditional practice in performing analogies with the latent-codes of a deep-net [Mikolov et al., 2013; Wu et al., 2016; Achlioptas et al., 2018], based on latent vector arithmetic. In a different way that better reflects the nature of difference operators, we also reconstruct the result of the multiplicative formula with being the difference operator of shape . Here we directly exploit the matrix nature of our representation which enables this type of algebra. The result of this approach is . It is interesting to observe that this reconstruction, (), results not only in less noisy point-clouds compared to ; but also in semantically more appropriate structures, e.g. in Fig 8, reflects less prominently the expected sitting pose and in Fig 10, has more muscular arms than expected.

    Figure 8. Reconstructed point-clouds from area-based latent differences. Point-clouds belong to the ground-truth test split and are their corresponding reconstructions. and complete the analogy of a shape that is to C what B is to A. Reconstruction is based on traditional vector code arithmetic, while is the one based on shape difference operator algebra.

    Partial shape analogies

    Moreover, we propose to construct partial shape analogies. We follow the formulation described in Section 7.1: in parallel to , we construct for localized deformation transfer between and . We first show in Figure 9 a partial body transfer. Given the LSSDs regarding and , we restricted the region of interest to their upper body, and synthesized the LSSD. The reconstructed point clouds are shown in Figure 9. Note that is similar to in the lower body, while being similar to in the upper body.

    In Figure 10, we show both the global shape analogies and the partial ones. are the reconstruction result of and , respectively.

    Figure 9. Synthesis of a shape that is similar to in the lower body, while being similar to in the upper body.
    Figure 10. Global and partial shape analogies. Note that has a mixed body type of and , and so has but in a different pose following and .

    Generalization with computed functional maps

    Though the input LSSDs of our network are precomputed with respect to all the shapes in consideration, as mentioned at the end of Section 5.1, we can assign the latent basis to new unseen shapes without recomputing the latent basis. In particular, we generated a set of new human shape bodies, and for each shape, we searched for its nearest neighbor in the existing collection. We then used the kernel matching algorithm [Lähner et al., 2017] to compute an initial map between the new shape and its neighbor in the collection, allowing us to compute its latent representations, as described at the end of Section 5.1.

    We show some of the reconstruction results in Figure 11. In particular, we show in Figure 11(b) a failure case. It is worth noting that although the result has a wrong pose compared to the ground truth, the body type is recovered. This is due to the fact that in this dataset, the variability across different body types is more prominent, while the number of poses is limited. Thus the network is expected to put more weight on the features regarding the former, resulting in some mismatched poses. We also emphasize that removing the need for a base shape is crucial in this case, since estimating functional maps across distant shapes is error-prone. In contrast, our formulation simplifies such a matching procedure.

    Figure 11. Reconstruction result using computed functional maps on unseen shapes. Top row: ground truth; Bottom row: reconstructed point clouds; (a) successful cases; (b) a failure case.

    Latent shape interpolation

    We also considered another dataset for the reconstruction task, the Dynamic FAUST dataset [Bogo et al., 2017], from which we sampled shapes for training, validation and test. For computational efficiency, we computed the canonical latent basis among a subset of shapes, and pushed the basis to the remaining shapes in the same way as above, but using the ground-truth functional maps. It is worth noting that the shapes in this dataset manifest high extrinsic variability while being nearly isometric within the poses corresponding to the same character. On the other hand, our representation is purely intrinsic, making it challenging to learn features that differentiate the extrinsic change.

    Here, we demonstrate the advantage of the algebraic form of our representation. We selected a pair of shapes and from the test set, and constructed a sequence of linear interpolations between their latent representations , i.e., . The output of the network, given , presents a continuous change regarding the knees and upper body. In Figure 12, we selected output point clouds with respect to increasing (from left to right), which show a process of raising the right knee and leaning to the left from to .

    Figure 12. Latent space interpolation: given two shapes , we synthesized new LSSDs by constructing . The reconstructed point clouds of the present a continuous deformation from to .
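
Because each shape is represented by a matrix, interpolation reduces to a convex combination of operators. A minimal sketch, assuming the standard linear blend (1 - t) * D_a + t * D_b with t in [0, 1]; the function name and the uniform spacing of t are illustrative assumptions, and the decoding network that turns each blended matrix into a point cloud is not shown.

```python
import numpy as np

def interpolate_operators(D_a, D_b, num_steps):
    """Linearly interpolate between two operator matrices, endpoints included."""
    ts = np.linspace(0.0, 1.0, num_steps)
    return [(1.0 - t) * D_a + t * D_b for t in ts]

# Toy usage: three steps between two hypothetical 2x2 operators.
D_a = np.eye(2)
D_b = np.array([[2.0, 0.0], [0.0, 4.0]])
sequence = interpolate_operators(D_a, D_b, num_steps=3)
```

Each matrix in `sequence` would then be fed to the trained reconstruction network to obtain one frame of the interpolated deformation.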

    8.5. Geometric Exploration of Shape Collections

    In the following experiments, we demonstrate the utility of our method for capturing cross-collection variability in shape collections, as suggested in Section 6. In particular, we demonstrate that our method can be applied to real-world data (Figure 16) beyond the synthesized shapes, and can be used to compare point clouds as well as triangle meshes. We also demonstrate that our method can extract informative signals in a semi-supervised classification task (Figure 14). Finally, we demonstrate that our method is stable with respect to the input functional maps, as it produces comparable results when using computed and ground-truth functional maps (Figures 13, 16, 17 and 18).

    Throughout the results below, unless stated otherwise, we used area-based LSSDs, which are represented as matrices of size in the reduced basis. To construct the FMN, we first compute distances among shapes (using the shape-DNA descriptors [Reuter et al., 2006]), and form a minimum spanning tree network using these distances. When considering two clusters, we first form a spanning tree on each, and connect shapes across clusters using nearest neighbor search. The methods described in Section 6 optimize for functions on the latent shape, which we map to functions on the actual shapes in the collection, resulting in a consistent and informative visualization.
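
The spanning-tree construction of the FMN topology can be sketched with Prim's algorithm on a dense distance matrix. The sketch assumes the pairwise distances between shape descriptors (for instance, between shape-DNA vectors) have already been computed; the function name `minimum_spanning_tree_edges` is hypothetical, and the paper does not state which algorithm was used.

```python
import numpy as np

def minimum_spanning_tree_edges(dist):
    """Edges of a minimum spanning tree of a complete graph, via Prim's algorithm.

    dist : (n, n) symmetric matrix of pairwise distances between shapes.
    """
    dist = np.asarray(dist, dtype=float)
    n = dist.shape[0]
    in_tree = np.zeros(n, dtype=bool)
    in_tree[0] = True                        # grow the tree from shape 0
    best_cost = dist[0].copy()               # cheapest known link from the tree to each shape
    best_from = np.zeros(n, dtype=int)       # tree endpoint of that cheapest link
    edges = []
    for _ in range(n - 1):
        # Pick the closest shape that is not yet in the tree.
        cost = np.where(in_tree, np.inf, best_cost)
        j = int(np.argmin(cost))
        edges.append((int(best_from[j]), j))
        in_tree[j] = True
        # Shape j may offer cheaper links to the remaining shapes.
        closer = dist[j] < best_cost
        best_cost = np.where(closer, dist[j], best_cost)
        best_from = np.where(closer, j, best_from)
    return edges

# Toy usage: four shapes whose descriptors lie on a line.
x = np.array([0.0, 1.0, 3.0, 7.0])
edges = minimum_spanning_tree_edges(np.abs(x[:, None] - x[None, :]))
```

A functional map would be associated with each returned edge. For two clusters, the same routine can be run on each cluster separately before adding nearest-neighbor links across them, as described above.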

    We applied our method with computed functional maps as input as well. Unless stated otherwise, in the following we used the kernel matching algorithm [Lähner et al., 2017] for an initial point-wise map, and then converted and refined the maps using functional map techniques (see, e.g., [Ovsjanikov et al., 2012]).

    Heterogeneous Shape Collection Comparison

    We first demonstrate that our method can capture variability across heterogeneous shape collections, without relying on point-wise correspondences, through its use of the functional maps framework. For this, in Figure 13, we show the computed distinctive functions highlighting the difference between a set of cats (each consists of vertices) and a set of lions (each consists of vertices), where the cross-collection maps were estimated using the original functional maps approach [Ovsjanikov et al., 2012] given a sparse set of landmarks. Note that our method correctly highlights the snouts, the four paws and the tips of the tails, distinctive to each class, despite the presence of global pose variability in the collection. Moreover, we consider a quantitative validation of the cross-collection variability detected by our method, by comparing the PCA embeddings of the LSSDs before and after projection with respect to the highlighted functions shown in Figure 13. As shown in Figure 13(b), after projection the relative distances within the same cluster remain similar while the two clusters become closer to each other.

    Figure 13. Cross-collection variability between a set of cats and lions, as detected by our algorithm. Note the four paws and the tips of the tails are highlighted, distinctive to each class, despite the presence of the global variability in each collection due to the various poses.
    Clustering with Visual Evidence

    In Figure 14, we analyze two clusters of shapes displayed in the two top rows that represent different characters in two distinct poses. As shown in the first five columns, the highlighted functions capture the bending knees, which intuitively distinguish the two poses. In contrast, both the global variability across the whole collection (the second column from right) and the ones within each cluster (the right-most column) concentrate on the torso. We also used computed functional maps for detecting the cross-collection variability, and obtained comparable highlighted functions, which are shown on a subset of the shapes in Figure 15.

    We also plot the PCA of the latent shape differences in the bottom of Figure 14, where the blue and red points are mixed, suggesting the dominance of variability in body type, captured in area-based shape differences.

    However, with the highlighted function detected by our approach, expressed through coefficients in the latent basis, we can separate the shapes by computing and plotting the PCA of the resulting vectors. As can be seen in Figure 14 (bottom, middle) these vectors separate the two clusters much better.

    Furthermore, the same procedure can be performed with partial clustering. Thus, given the cluster ids of only the shapes in the red box, we first computed the optimal distinctive function with respect to this subset, and plotted the PCA of on the latent shape of the whole collection. The PCA plot of suggests that this approach reveals the correct clusters in a semi-supervised way.

    Figure 14. Comparison between two sets of humans, each corresponding to a distinct pose (top vs. bottom). The distinctive functions obtained with our approach capture the change in pose, even with partial information, whereas the global variability primarily highlights changes in body type. Moreover, by using the action of each latent shape difference on the distinctive function, we can separate the two clusters (shown via PCA in bottom row). See text for details.
    Figure 15. The distinctive functions obtained with computed functional maps on the same data set as the one in Figure 14. Note that the highlighted regions are comparable.
    Practical Application in Anatomy

    The problem of analyzing variability across different classes of 3D objects is well-studied in computational anatomy, where the classical approach is to first manually establish dense landmark correspondences and to compute the difference from each object to some pre-computed deformable template shape. As mentioned above, our approach does not require an actual embedding of a template, which allows it to handle complex heterogeneous data.

    To illustrate this, we compared two sets of bones of two sub-species of wild boars acquired using 3D scanning techniques. In particular, as input we considered the bone scans with consistent handcrafted landmarks and sliding landmarks [Gunz and Mitteroecker, 2013] on each shape. We then estimated the FMN starting with a different number, , of the handcrafted landmarks, using the functional map estimation approach proposed in [Huang and Ovsjanikov, 2017]. These functional maps were then used to compute the distinctive regions, as described in Section 6. The corresponding shapes and highlighted functions are shown in Figure 16. Remarkably, our results are stable even for a small number of landmarks, and furthermore correspond to anatomically meaningful shape parts. These parts in general coincide with the ones detected by the extrinsic template-based approach, which uses all the landmarks, and agree with the functional explanation for this cross-species variability identified by the domain experts.

    Figure 16. Comparing two sets of bones corresponding to different sub-species of wild boars using computed functional maps. The highlighted regions are stable and reveal distinctive yet subtle sub-parts.
    Cross-collection Variability across Point Clouds

    Our framework can also be applied across different modalities. In Figure 17, we compared the shapes in the top row with the ones in the bottom row; the two clusters correspond to two characters in different poses. Therefore, the cross-collection variability should capture the difference in body shapes. We used the discretization from [Huang et al., 2017] for computing eigenbases on point clouds, and, given a sparse set of ten landmarks across the 8 point clouds, we used the adjoint regularization from [Huang and Ovsjanikov, 2017] for estimating the functional maps between the shapes. For comparison, we also computed the cross-collection variability detected with the ground-truth FMN among the same shapes represented as meshes. Clearly, both sets of highlighted regions are on the torso, reflecting the significant area change associated with the change in body type.

    Figure 17. (a) Comparison between two characters in various poses, represented by point clouds. We plot on them the highlighted function obtained with a computed FMN among the point clouds. (b) As a baseline, we plot the highlighted function obtained with the ground-truth FMN on the corresponding meshes.
    Facial Comparison

    We end our demonstration with a comparison between two sets of human faces, which correspond to happy and sad expressions (due to the lack of space, we plot only faces out of ). As shown in Figure 18, both highlighted functions computed with ground-truth functional maps and the computed ones detect the intuitive cross-collection difference – the chin and the cheeks.

    Figure 18. Distinctive regions across two sets of expressions (happy on the top, sad on the bottom). Note that the highlighted regions with respect to the computed functional maps are comparable with the ground truth ones.

    9. Conclusions

    We have presented a novel approach for representing and analyzing 3D shapes in the context of one or multiple collections. Our construction is based on a functional map network connecting the shapes and a novel analysis demonstrating that previously used latent functional spaces can both be endowed with a natural geometric structure and provide a basis for representing and comparing shapes in an unbiased way. This leads to Latent Space Shape Differences, which represent each shape in the collection as a pair of functional operators, stored as small-sized matrices in practice. This representation has many appealing properties, including invariance to rigid motions as well as full intrinsic informativeness that permits reconstruction. We have demonstrated its use in extracting and highlighting variability of interest in a set of shapes, while also suppressing other variability that we regard as nuisance (and which may in fact manifest in larger geometric deformations). We believe that this highly nuanced understanding of shape distortions and variability is important for many applications in engineering, biology, and medicine. Moreover, we showed that the matrix form of our representation makes it suitable for learning algorithms, and the use of CNNs in particular, for both regression and reconstruction.

    We also note that the matrix nature of our representation makes it a different mathematical object from the usual latent codes used in machine learning, which are invariably points in high-dimensional Euclidean spaces. While point-based representations usually support only a quite limited set of operations (typically, interpolations and vector-based analogies), our difference matrices reflect the internal structure of the shapes and enable, for example, localized shape analogies.

    References

    • Achlioptas et al. [2018] Panos Achlioptas, Olga Diamanti, Ioannis Mitliagkas, and Leonidas J Guibas. 2018. Learning Representations and Generative Models For 3D Point Clouds. Proceedings of the 35th International Conference on Machine Learning (2018).
    • Allen et al. [2003] Brett Allen, Brian Curless, and Zoran Popović. 2003. The space of human body shapes: reconstruction and parameterization from range scans. In ACM transactions on graphics (TOG), Vol. 22. ACM, 587–594.
    • Anguelov et al. [2005] Dragomir Anguelov, Praveen Srinivasan, Daphne Koller, Sebastian Thrun, Jim Rodgers, and James Davis. 2005. SCAPE: Shape Completion and Animation of People. In ACM Transactions on Graphics (TOG), Vol. 24. ACM, 408–416.
    • Blanz and Vetter [1999] Volker Blanz and Thomas Vetter. 1999. A morphable model for the synthesis of 3D faces. In Proceedings of the 26th annual conference on Computer graphics and interactive techniques. ACM Press/Addison-Wesley Publishing Co., 187–194.
    • Bogo et al. [2014] Federica Bogo, Javier Romero, Matthew Loper, and Michael J Black. 2014. FAUST: Dataset and Evaluation for 3D Mesh Registration. In Proc. CVPR. 3794–3801.
    • Bogo et al. [2017] Federica Bogo, Javier Romero, Gerard Pons-Moll, and Michael J. Black. 2017. Dynamic FAUST: Registering Human Bodies in Motion. In IEEE Conf. on Computer Vision and Pattern Recognition (CVPR).
    • Boscaini et al. [2015] Davide Boscaini, Davide Eynard, Drosos Kourounis, and Michael M Bronstein. 2015. Shape-from-Operator: Recovering Shapes from Intrinsic Operators. In Computer Graphics Forum, Vol. 34. Wiley Online Library, 265–274.
    • Boscaini et al. [2016] Davide Boscaini, Jonathan Masci, Emanuele Rodolà, and Michael Bronstein. 2016. Learning shape correspondence with anisotropic convolutional neural networks. In Advances in Neural Information Processing Systems. 3189–3197.
    • Bronstein et al. [2017] Michael M Bronstein, Joan Bruna, Yann LeCun, Arthur Szlam, and Pierre Vandergheynst. 2017. Geometric deep learning: going beyond euclidean data. IEEE Signal Processing Magazine 34, 4 (2017), 18–42.
    • Chen et al. [2015] Wenzheng Chen, Huan Wang, Yangyan Li, Hao Su, Zhenhua Wang, Changhe Tu, Dani Lischinski, Daniel Cohen-Or, and Baoquan Chen. 2015. Synthesizing Training Images for Boosting Human 3D Pose Estimation. In 3D Vision (3DV). https://doi.org/chen1474147/Deep3DPose
    • Corman et al. [2017] Etienne Corman, Justin Solomon, Mirela Ben-Chen, Leonidas Guibas, and Maks Ovsjanikov. 2017. Functional Characterization of Intrinsic and Extrinsic Geometry. ACM Trans. Graph. 36, 2, Article 14 (March 2017), 17 pages.
    • Fan et al. [2016] Haoqiang Fan, Hao Su, and Leonidas J. Guibas. 2016. A Point Set Generation Network for 3D Object Reconstruction from a Single Image. CoRR abs/1612.00603 (2016).
    • Girdhar et al. [2016] Rohit Girdhar, David F. Fouhey, Mikel Rodriguez, and Abhinav Gupta. 2016. Learning a Predictable and Generative Vector Representation for Objects. Springer International Publishing, Cham, 484–499.
    • Grenander and Miller [1998] Ulf Grenander and Michael I Miller. 1998. Computational anatomy: An emerging discipline. Quarterly of applied mathematics 56, 4 (1998), 617–694.
    • Gunz and Mitteroecker [2013] Philipp Gunz and Philipp Mitteroecker. 2013. Semilandmarks: a method for quantifying curves and surfaces. Hystrix, the Italian Journal of Mammalogy 24, 1 (2013), 103–109. https://doi.org/10.4404/hystrix-24.1-6292
    • Hasler et al. [2009] Nils Hasler, Carsten Stoll, Martin Sunkel, Bodo Rosenhahn, and H-P Seidel. 2009. A Statistical Model of Human Pose and Body Shape. In Computer Graphics Forum, Vol. 28. 337–346.
    • Huang et al. [2014] Qixing Huang, Fan Wang, and Leonidas Guibas. 2014. Functional map networks for analyzing and exploring large shape collections. ACM Transactions on Graphics (TOG) 33, 4 (2014), 36.
    • Huang et al. [2017] Ruqi Huang, Frederic Chazal, and Maks Ovsjanikov. 2017. On the Stability of Functional Maps and Shape Difference Operators. 37, 1 (2017).
    • Huang and Ovsjanikov [2017] Ruqi Huang and Maks Ovsjanikov. 2017. Adjoint Map Representation for Shape Analysis and Matching. In Proc. SGP, Vol. 36.
    • Joshi et al. [2004] Sarang Joshi, Brad Davis, Matthieu Jomier, and Guido Gerig. 2004. Unbiased diffeomorphic atlas construction for computational anatomy. NeuroImage 23 (2004), S151–S160.
    • Kendall [1989] David G Kendall. 1989. A survey of the statistical theory of shape. Statist. Sci. (1989), 87–99.
    • Kim et al. [2013] Vladimir G Kim, Wilmot Li, Niloy J Mitra, Siddhartha Chaudhuri, Stephen DiVerdi, and Thomas Funkhouser. 2013. Learning part-based templates from large collections of 3D shapes. ACM Transactions on Graphics (TOG) 32, 4 (2013), 70.
    • Kim et al. [2012] Vladimir G Kim, Wilmot Li, Niloy J Mitra, Stephen DiVerdi, and Thomas Funkhouser. 2012. Exploring collections of 3D models using fuzzy correspondences. ACM Transactions on Graphics (TOG) 31, 4 (2012), 54.
    • Kingma and Ba [2014] Diederik P. Kingma and Jimmy Ba. 2014. Adam: A Method for Stochastic Optimization. CoRR abs/1412.6980 (2014). arXiv:1412.6980
    • Kleiman et al. [2015] Yanir Kleiman, Oliver van Kaick, Olga Sorkine-Hornung, and Daniel Cohen-Or. 2015. SHED: shape edit distance for fine-grained shape similarity. ACM Transactions on Graphics (TOG) 34, 6 (2015), 235.
    • Kovnatsky et al. [2013] Artiom Kovnatsky, Michael M Bronstein, Alexander M Bronstein, Klaus Glashoff, and Ron Kimmel. 2013. Coupled quasi-harmonic bases. In Computer Graphics Forum, Vol. 32. Wiley Online Library, 439–448.
    • Lähner et al. [2017] Z. Lähner, M. Vestner, A. Boyarski, O. Litany, R. Slossberg, T. Remez, E. Rodolà, A. M. Bronstein, M. M. Bronstein, R. Kimmel, and D. Cremers. 2017. Efficient Deformable Shape Correspondence via Kernel Matching. arXiv preprint 1707.08991 (2017).
    • Li et al. [2017] Jun Li, Kai Xu, Siddhartha Chaudhuri, Ersin Yumer, Hao Zhang, and Leonidas J. Guibas. 2017. GRASS: Generative Recursive Autoencoders for Shape Structures. CoRR abs/1705.02090 (2017).
    • Maron et al. [2017] Haggai Maron, Meirav Galun, Noam Aigerman, Miri Trope, Nadav Dym, Ersin Yumer, Vladimir G Kim, and Yaron Lipman. 2017. Convolutional Neural Networks on Surfaces via Seamless Toric Covers.
    • Masci et al. [2015] Jonathan Masci, Davide Boscaini, Michael Bronstein, and Pierre Vandergheynst. 2015. Geodesic convolutional neural networks on riemannian manifolds. In Proc. ICCV workshops. 37–45.
    • Maturana and Scherer [2015] Daniel Maturana and Sebastian Scherer. 2015. Voxnet: A 3d convolutional neural network for real-time object recognition. In Intelligent Robots and Systems (IROS), 2015 IEEE/RSJ International Conference on. IEEE, 922–928.
    • Meyer et al. [2003] Mark Meyer, Mathieu Desbrun, Peter Schröder, and Alan H Barr. 2003. Discrete differential-geometry operators for triangulated 2-manifolds. In Visualization and mathematics III. Springer, 35–57.
    • Mikolov et al. [2013] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Distributed Representations of Words and Phrases and their Compositionality. CoRR abs/1310.4546 (2013). arXiv:1310.4546
    • Ovsjanikov et al. [2013] M. Ovsjanikov, M. Ben-Chen, F. Chazal, and L. Guibas. 2013. Analysis and visualization of maps between shapes. Computer Graphics Forum 32, 6 (2013), 135–145.
    • Ovsjanikov et al. [2012] Maks Ovsjanikov, Mirela Ben-Chen, Justin Solomon, Adrian Butscher, and Leonidas Guibas. 2012. Functional Maps: A Flexible Representation of Maps Between Shapes. ACM Transactions on Graphics (TOG) 31, 4 (2012), 30.
    • Ovsjanikov et al. [2017] Maks Ovsjanikov, Etienne Corman, Michael Bronstein, Emanuele Rodolà, Mirela Ben-Chen, Leonidas Guibas, Frederic Chazal, and Alex Bronstein. 2017. Computing and processing correspondences with functional maps. In ACM SIGGRAPH 2017 Courses. ACM, 5.
    • Ovsjanikov et al. [2011] Maks Ovsjanikov, Wilmot Li, Leonidas Guibas, and Niloy J Mitra. 2011. Exploration of continuous variability in collections of 3D shapes. ACM Transactions on Graphics (TOG) 30, 4 (2011), 33.
    • Qi et al. [2016] Charles Ruizhongtai Qi, Hao Su, Kaichun Mo, and Leonidas J. Guibas. 2016. PointNet: deep learning on point sets for 3D classification and segmentation. CoRR abs/1612.00593 (2016).
    • Reuter et al. [2006] Martin Reuter, Franz-Erich Wolter, and Niklas Peinecke. 2006. Laplace-Beltrami Spectra As ’Shape-DNA’ of Surfaces and Solids. Comput. Aided Des. 38, 4 (April 2006), 342–366.
    • Rustamov et al. [2013] Raif M. Rustamov, Maks Ovsjanikov, Omri Azencot, Mirela Ben-Chen, Frédéric Chazal, and Leonidas Guibas. 2013. Map-based exploration of intrinsic shape differences and variability. ACM Transactions on Graphics 32, 4 (2013), 1.
    • Sinha et al. [2016] Ayan Sinha, Jing Bai, and Karthik Ramani. 2016. Deep learning 3d shape surfaces using geometry images. In European Conference on Computer Vision. Springer, 223–240.
    • Solomon et al. [2012] Justin Solomon, Andy Nguyen, Adrian Butscher, Mirela Ben-Chen, and Leonidas Guibas. 2012. Soft maps between surfaces. In Computer Graphics Forum, Vol. 31. Wiley Online Library, 1617–1626.
    • Su et al. [2015] Hang Su, Subhransu Maji, Evangelos Kalogerakis, and Erik Learned-Miller. 2015. Multi-view Convolutional Neural Networks for 3D Shape Recognition. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV) (ICCV ’15).
    • Thompson et al. [1942] Darcy Wentworth Thompson et al. 1942. On growth and form. (1942).
    • Tong et al. [2012] Jing Tong, Jin Zhou, Ligang Liu, Zhigeng Pan, and Hao Yan. 2012. Scanning 3d full human bodies using kinects. IEEE transactions on visualization and computer graphics 18, 4 (2012), 643–650.
    • Wand et al. [2009] Michael Wand, Bart Adams, Maksim Ovsjanikov, Alexander Berner, Martin Bokeloh, Philipp Jenke, Leonidas Guibas, Hans-Peter Seidel, and Andreas Schilling. 2009. Efficient reconstruction of nonrigid shape and motion from real-time 3D scanner data. ACM Transactions on Graphics (TOG) 28, 2 (2009), 15.
    • Wand et al. [2007] Michael Wand, Philipp Jenke, Qi-Xing Huang, Martin Bokeloh, Leonidas Guibas, and Andreas Schilling. 2007. Reconstruction of Deforming Geometry from Time-Varying Point Clouds. In Proc. SGP. 49–58.
    • Wang et al. [2013] Fan Wang, Qixing Huang, and Leonidas J. Guibas. 2013. Image co-segmentation via consistent functional maps. In Proceedings of the IEEE International Conference on Computer Vision. 849–856.
    • Wang et al. [2014] Fan Wang, Qixing Huang, Maks Ovsjanikov, and Leonidas J Guibas. 2014. Unsupervised multi-class joint image segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 3142–3149.
    • Wang and Singer [2013] Lanhui Wang and Amit Singer. 2013. Exact and stable recovery of rotations for robust synchronization. Information and Inference: A Journal of the IMA 2, 2 (2013), 145–193.
    • Wang et al. [2017] Peng-Shuai Wang, Yang Liu, Yu-Xiao Guo, Chun-Yu Sun, and Xin Tong. 2017. O-cnn: Octree-based convolutional neural networks for 3d shape analysis. ACM Transactions on Graphics (TOG) 36, 4 (2017), 72.
    • Wang et al. [2012] Yunhai Wang, Shmulik Asafi, Oliver van Kaick, Hao Zhang, Daniel Cohen-Or, and Baoquan Chen. 2012. Active co-analysis of a set of shapes. ACM Transactions on Graphics (TOG) 31, 6 (2012), 165.
    • Wu et al. [2016] Jiajun Wu, Chengkai Zhang, Tianfan Xue, Bill Freeman, and Josh Tenenbaum. 2016. Learning a Probabilistic Latent Space of Object Shapes via 3D Generative-Adversarial Modeling. In Proc. NIPS. 82–90.
    • Younes [2010] Laurent Younes. 2010. Shapes and diffeomorphisms. Vol. 171. Springer Science & Business Media.
    • Zeng et al. [2012] Wei Zeng, Ren Guo, Feng Luo, and Xianfeng Gu. 2012. Discrete heat kernel determines discrete Riemannian metric. Graphical Models 74, 4 (2012), 121–129.

    Appendix A Technical Details

    Proof of Theorem 5.1
    Proof.

    First note that is well-defined since by consistency, . The regularization constraint therefore implies .

    Now let be a diagonal matrix (implicitly corresponding to the eigenvalues of the latent shape). Note that is a non-negative diagonal matrix, thus admits an eigen-decomposition and we let . Direct computation yields that , and . Thus it follows from that . On the other hand, it is easy to verify that the eigenfunctions of satisfy the consistency constraint and the normalization, therefore they are equivalent. ∎

    Proof of Lemma 6.1
    Proof.

    We first prove that:

    It is easy to verify that , since . In other words, is a projection operator, and so is . For the sake of simplicity, we denote in the following and by respectively. Obviously are both symmetric matrices, and . Then the above equivalence can be rewritten as

    which amounts to .

    Finally, the equivalence follows from

    Finally, the difference is equal to

    Connection between our Method and the framework of [Huang and Ovsjanikov, 2017]

    Our formulation constructs a linear combination of terms ,, where , and then computes the eigenvectors associated with its largest eigenvalues. In essence, the distortion energy constructed in [Huang and Ovsjanikov, 2017] is similarly composed of a set of terms in the form , where is the adjoint functional map from to , and is the latent basis on .

    Our main observation is that, in the case of area-based operators, under the same conditions as Theorem 5.1, . A consequence of the above argument is that when the spectra of both and have no repeating eigenvalues, their eigenvectors are identical.

    We provide a proof sketch of this claim. Following the proof of Theorem 5.1, we have , where is the full eigenbasis on , and is the eigenbasis of the average/latent shape. Then we have , which implies , where is the measure of the average shape. Regarding the adjoint case, we similarly have . Therefore it is easy to verify the commutativity between and .

    Appendix B Neural Network Details

    B.1. Regression

    B.1.1. Point-Cloud Architectures

    We used three configurations of point-based architectures. In a spirit similar to [Qi et al., 2016], we implemented all (3-layer deep) encoders as 1-D convolutions with filter size 1, i.e., treating each point independently. The output of the last encoding layer was passed through a feature-wise max-pool, whose output was then processed by an FC-ReLU decoder. Table 2 shows the layer sizes (columns) in each consecutive layer for the three configurations (rows).

    Version Encoder (# filters) Decoder (# neurons)
    A {32, 64, 64} {64, 12}
    B {64, 128, 128} {64, 12}
    C {64, 128, 128} {64, 128, 12}
    Table 2. Layer sizes of the point-based architectures that formed the baseline of the regression experiments. Within each set, values further to the right correspond to deeper layers.
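
    To make the data flow of these baselines concrete, the following is a minimal NumPy sketch of a forward pass for Version A. It is an illustration under stated assumptions rather than the implementation used in the experiments: the weights are random, the input is assumed to be an array of 3D point coordinates, the final output layer is assumed to have no ReLU, and the helper names `init_layers` and `point_net_forward` are hypothetical.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def init_layers(sizes, rng):
    """Random (weight, bias) pairs for consecutive layer sizes (illustrative only)."""
    return [(rng.normal(scale=0.1, size=(m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def point_net_forward(points, enc, dec):
    """Map a (num_points, 3) point cloud to a 12-dimensional output."""
    h = points
    for w, b in enc:
        # A convolution that treats each point independently is the same
        # linear map applied to every row (point) of the array.
        h = relu(h @ w + b)
    g = h.max(axis=0)  # feature-wise max-pool over all points
    for i, (w, b) in enumerate(dec):
        g = g @ w + b
        if i < len(dec) - 1:  # assumption: no ReLU on the final output layer
            g = relu(g)
    return g

rng = np.random.default_rng(0)
enc = init_layers([3, 32, 64, 64], rng)  # Version A encoder: {32, 64, 64}
dec = init_layers([64, 64, 12], rng)     # Version A decoder: {64, 12}

cloud = rng.normal(size=(1000, 3))
out = point_net_forward(cloud, enc, dec)
out_shuffled = point_net_forward(cloud[rng.permutation(1000)], enc, dec)
```

    Because every point passes through the same weights and the pooling is a maximum over points, the output does not depend on the order of the points, which is why `out` and `out_shuffled` agree up to floating-point error.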

    We trained each of these architectures with learning rates of {0.001, 0.002, 0.005, 0.007, 0.01}. The learning rate of gave the best performance in the regression experiments.

    B.1.2. MLPs

    We used FC-ReLU MLPs for which the last 3 layers had {50, 100, 12} neurons respectively. The number of neurons of the first layer was calibrated according to the size of the input difference matrix. Table 3 shows their correspondence.

    # Latent-Bases 5 10 20 30 40 50
    # Neurons 369 185 62 29 17 11
    Table 3. Number of neurons in first layer of MLP-architectures based on the size corresponding to the Latent-Bases.
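
    The text does not state the calibration criterion, but a short calculation is suggestive. The sketch below assumes that the input is a single flattened k x k difference matrix and ignores biases (both are assumptions, as is the helper name); under these assumptions, the number of weights in the first two layers stays between roughly 27.5k and 28.1k for every row of Table 3, which is consistent with keeping the overall model size approximately fixed.

```python
# Table 3: number of latent basis functions k -> neurons in the first MLP layer.
first_layer = {5: 369, 10: 185, 20: 62, 30: 29, 40: 17, 50: 11}

def weights_in_first_two_layers(k, n1, n2=50):
    """Weights (biases ignored) of the layers k*k -> n1 -> n2.

    Assumption: the input is one k x k difference matrix, flattened.
    n2 = 50 is the first of the three fixed layers {50, 100, 12}.
    """
    return k * k * n1 + n1 * n2

counts = {k: weights_in_first_two_layers(k, n1) for k, n1 in first_layer.items()}
for k, c in counts.items():
    print(k, c)
# 5 27675
# 10 27750
# 20 27900
# 30 27550
# 40 28050
# 50 28050
```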

    B.1.3. CNNs

    The encoding part of our CNNs consisted of two convolutional layers followed by a single FC-ReLU layer with neurons. See Table 4 and Table 5 for the parameters of the convolutional layers when the input was difference matrices and , respectively.

    Layer # Filters Kernel-size Stride
    First 10 (2, 2) 1
    Second 10 (4, 4) 2
    Table 4. CNN parameters with input.
    Layer # Filters Kernel-size Stride
    First 10 (3, 3) 2
    Second 10 (4, 4) 2
    Table 5. CNN parameters with input.
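
    As a rough guide to how the kernel sizes and strides in Tables 4 and 5 shrink the input, the following sketch applies the standard output-size formula for a convolution without padding. The input sizes used here (30 and 50) and the absence of padding are assumptions made purely for illustration, and the helper names are hypothetical.

```python
def conv_out_size(n, kernel, stride):
    """Side length after a square convolution with no padding."""
    return (n - kernel) // stride + 1

def encoder_feature_count(n, layers, num_filters):
    """Apply (kernel, stride) layers in order; return final side and feature count."""
    for kernel, stride in layers:
        n = conv_out_size(n, kernel, stride)
    return n, num_filters * n * n

table4 = [(2, 1), (4, 2)]  # (kernel, stride) for the first and second layer
table5 = [(3, 2), (4, 2)]

# Hypothetical input sizes, chosen only to make the arithmetic concrete.
side4, feats4 = encoder_feature_count(30, table4, num_filters=10)
side5, feats5 = encoder_feature_count(50, table5, num_filters=10)
print(side4, feats4)  # 13 1690
print(side5, feats5)  # 11 1210
```

    Under these assumptions, the larger stride in the first layer of Table 5 brings the larger input down to a feature map of similar size to the one obtained with Table 4.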

    B.2. Reconstruction Architecture

    The architecture we used here is inspired by the CNN used for regression. Again, the convolutional part comes with two encoding layers (see Table 6 for parameters). The decoder is an MLP implemented with FC-ReLU layers of size .

    Layer # Filters Kernel-size Stride
    First 20 (3, 3) 2
    Second 20 (6, 6) 2
    Table 6. CNN encoding parameters with input for the purpose of reconstructing a point cloud from a difference matrix.

    B.3. Training details

    For training we used stochastic gradient descent with Adam [Kingma and Ba, 2014] () and batch-size of throughout all experiments. Moreover, we normalized the difference matrices by subtracting their average with respect to the training split. For the regression task, the networks operating with difference matrices were trained with a learning rate of . In the reconstruction experiments we trained the CNN architecture for 850 epochs with a learning rate of .
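
    The normalization step can be sketched as follows. This is one plausible reading, assuming that the average is the element-wise mean matrix over the training split and that the same mean is subtracted from held-out data; the array shapes and the function name `center_on_training_mean` are hypothetical.

```python
import numpy as np

def center_on_training_mean(train, test):
    """Subtract the element-wise mean of the training matrices from both splits.

    train, test: arrays of shape (num_shapes, k, k) holding difference matrices.
    Assumption: the average is taken per matrix entry over the training split only.
    """
    mean = train.mean(axis=0)
    return train - mean, test - mean, mean

rng = np.random.default_rng(0)
train = rng.normal(loc=1.0, size=(80, 30, 30))  # synthetic stand-in data
test = rng.normal(loc=1.0, size=(20, 30, 30))
train_c, test_c, mean = center_on_training_mean(train, test)
```

    Computing the mean from the training split alone keeps statistics of the test data out of the preprocessing.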

    Figure 19. Training trends: PC-based net vs. MLP (top) or CNN (bottom) architectures on difference maps. The PC-based architecture overfits heavily, while the CNN one generalizes very well. (Plots based on training with a single seed.)