1 Introduction
Clustering algorithms aim at discovering patterns in the data that enable their characterization, identification, and separation. The development of such a framework without any prior information regarding the data remains one of the milestones of machine learning; such a tool would give clinicians, physicists, and data scientists, among others, a better means of pattern discovery (bertsimas2020interpretable; greene2005producing).
While supervised learning has been converging toward the almost exclusive use of Deep Neural Networks (DNNs), which avoid handcrafted features by learning the desired linearly separable embedding map, unsupervised clustering algorithms take various forms depending on the application at hand (ma2019learning; wagstaff2001constrained; estivill2002so). Examples include the use of SIFT features combined with a clustering algorithm for medical imaging (nam2009high), the extraction of DNN embeddings used as the input of the K-means algorithm for computer vision tasks (xie2016unsupervised), and the combination of signal-processing feature extractors with a Gaussian mixture model to understand the nature of various seismic activities
(seydoux2020clustering). The important role of clustering algorithms in assisting medical diagnoses as well as scientific discoveries highlights the importance of developing an interpretable and theoretically guaranteed tool (dolnicar2003using; xu2010clustering).

In this work, we focus our attention on the K-means clustering algorithm (macqueen1967some) and its application to 2-dimensional signals, such as images or time-frequency representations. Well-known for its simplicity, efficiency, and interpretability, the K-means algorithm partitions the data space into disjoint regions. Each region is represented by a centroid, and each datum is assigned to the closest centroid's region. An integral part of the design of a clustering algorithm is the choice of an appropriate distance and of the number of clusters (he2013k; frey2002fast; 990920). While the Euclidean distance makes the design of the algorithm straightforward, this measure of similarity might omit the geometrical relationships between data points (steinbach2004challenges). In fact, a small rigid perturbation of an image, such as a rotation or translation, is enough to change the cluster assignment.
There are two major difficulties in constructing a distance for a clustering algorithm; on the one hand, the metric should take into account the geometry of the data, e.g., be invariant to rigid transformations for images, and on the other hand, the metric should be interpretable as it is tied to the interpretability of the algorithm (steinbach2004challenges).
In this work, we tackle these two difficulties by introducing into our similarity measure the spatial transformations inherent to the geometry of the data at hand. In particular, we (i) formulate an interpretable and theoretically guaranteed K-means framework capable of exploiting the symmetry within the data; (ii) extend prior work on metrics invariant to rigid transformations to non-rigid transformations, thus taking into account a more realistic set of nuisances; and (iii) allow the symmetry underlying the data at hand to be learned, therefore enabling the exploration of data where the equivalence classes are yet to be determined.
To learn the symmetry in the data and apply the corresponding transformations, we use the spatial transformer framework, which was successfully introduced in jaderberg2015spatial. This allows us to provide a learnable metric invariant to non-rigid transformations that is used as the K-means distortion error.
While many approaches to learn and estimate non-rigid transformations have been proposed, we follow one of today's mainstream approaches, developed in jaderberg2015spatial, where the Thin-Plate-Spline is used as a differentiable deformation model. Our aim is, in fact, not to compare deformation models but to consider a way to approach the learnability of invariances in an unsupervised setting that is effective, tractable, and interpretable.

Our contributions can be summarized as follows:


We propose a novel approach to clustering: an adaptive similarity measure within the K-means framework that accounts for non-rigid transformations, Sec. 3.1.

We derive an appropriate update rule for the centroids that drastically improves both the interpretability of the centroids and their quality, Sec. 3.2.

Finally, we show numerically that our unsupervised algorithm competes with state-of-the-art methods on various datasets while benefiting from interpretable results, Sec. 5.
2 Background
2.1 Invariant Metrics
The development of measures invariant to specific deformations has been under investigation in the computer vision community for decades (fitzgibbon2002affine; simard2012transformation; lim2004image). By considering affine transformations of the data, such as shearing, translation, and rotation, as nuisances, these approaches propose a distance that reduces the variability intrinsic to high-dimensional images. These works are considered appearance-manifold-based frameworks; that is, distances are quantified by taking geometric proximity into account (262951; basri1998clustering; su2001modified; 1211332).
While the development of affine invariant metrics is well established, their extension to more general non-rigid transformations requires more attention. Recently, various deep learning methods have proposed ways to learn diffeomorphic transformations (detlefsen2018deep; balakrishnan2018unsupervised; lohit2019temporal; dalca2019learning; NEURIPS2019_db98dc0d). Others adopt a more theoretically grounded approach based on group theory, as in zhang2015finite; freifeld2015highly; durrleman2013sparse; allassonniere2015bayesian, as well as the statistical "pattern theory" approach developed in grenander1993general; dupuis1998variational.

2.2 Spatial Transformer
The spatial transformer operator allows for non-rigid image transformations. It is based on the composition of two mappings: a deformation map and a sampling function. The deformation map takes a uniform grid of 2-dimensional coordinates and produces its transformed version. The sampling function samples the signal with respect to a given grid of 2-dimensional coordinates.
The mapping we select to enable the learnability of the transformation in the coordinate space of the 2-dimensional signal is the Thin-Plate-Spline (TPS) interpolation technique (duchon1976interpolation; bookstein1989principal; nejati2010fast), which produces smooth interpolating surfaces (morse2005interpolating). We refer the reader to Appendix E for details regarding this method. We consider as learnable parameters of the TPS a set of 2-dimensional coordinates, called landmarks. Given a set of landmarks, the TPS provides the transformation map of a 2-dimensional grid. That is, the Euclidean plane is bent according to the learned landmarks. In Fig. 1, we show on the bottom right the grid associated with the landmarks. Each grid corresponds to the spatial transformation applied to the handwritten digit shown at the top left. The transformation of the signal based on these new coordinates is produced by performing bilinear interpolation using the original signal (top left) and the new coordinates; the details are provided in Appendix E.
The spatial transformer is the composition of these two maps and is defined as
(1) 
where the inputs are the original 2-dimensional signal and the set of 2-dimensional transformed coordinates to be learned. Note that the grid of coordinates can be smaller than the dimension of the image, as the TPS interpolation rescales the transformation to any size.
Such a framework, composing the TPS and bilinear interpolation, was defined as the spatial transformer in jaderberg2015spatial. However, in their work, the inference of the non-rigid transformations is performed using each datum as the input of a "localisation network"; instead, we directly learn the transformation parameters.
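As a concrete illustration of the sampling half of this composition, the sketch below implements plain bilinear interpolation of an image at fractional coordinates. It is our own minimal NumPy example, not the paper's implementation; the function name is ours.

```python
import numpy as np

def bilinear_sample(img, coords):
    """Sample a 2-D image at fractional (row, col) coordinates.

    img:    (H, W) array.
    coords: (N, 2) array of fractional coordinates.
    Returns an (N,) array of bilinearly interpolated values.
    """
    H, W = img.shape
    r, c = coords[:, 0], coords[:, 1]
    # integer corners of the cell containing each coordinate
    r0 = np.clip(np.floor(r).astype(int), 0, H - 2)
    c0 = np.clip(np.floor(c).astype(int), 0, W - 2)
    dr, dc = r - r0, c - c0
    # interpolate along columns, then along rows
    top = (1 - dc) * img[r0, c0] + dc * img[r0, c0 + 1]
    bot = (1 - dc) * img[r0 + 1, c0] + dc * img[r0 + 1, c0 + 1]
    return (1 - dr) * top + dr * bot
```

Evaluating such a sampler at every point of a TPS-deformed grid produces the warped image; since both steps are differentiable, gradients flow back to the landmarks.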
3 Spatial Transformer K-means
We now introduce the spatial transformer K-means, ST K-means, our proposed solution that composes the spatial transformer and the K-means algorithm.
3.1 Formalism
We recall that in this work we consider 2-dimensional signals defined by their width and height, such as images and time-frequency representations of time-series. Given a set of 2-dimensional signals, the K-means algorithm aims at grouping the data into distinct clusters defining a partition of the dataset. Each cluster of the partition is represented by a centroid.
As for the K-means algorithm, the goal of the ST K-means is to find the centroids minimizing the following distortion error
(2) 
The assignment of a signal to a cluster is achieved through the evaluation of the similarity measure between the signal and each centroid. A signal belongs to a cluster if and only if that cluster's centroid is the closest under this measure. While the standard K-means algorithm makes use of the Euclidean distance, we instead propose to use the following deformation invariant similarity measure
(3) 
which is a quasi-pseudo-semimetric; see Appendix A for details and proof.
This similarity measure represents the least-squares distance between a centroid and the datum after the datum has been fit to that centroid via the spatial transformer operator. Once this fitting is done for each centroid, the cluster assignment is given by the argmin of those distances, i.e., the datum is assigned to the cluster with the smallest fitted distance. Therefore, the underlying assumption of our approach is that the distance between the optimal transformation of a signal and a centroid belonging to the same "class" should be smaller than the distance between its optimal transformation and a centroid that does not. That is, if a datum is geometrically near a centroid, its optimally transformed version is closer to that centroid than to any other.
Evaluating this measure requires solving a non-convex optimization problem. In practice, this is achieved by exploiting the spatial transformer's differentiability with respect to the landmarks. As a result, we can learn the transformation by performing gradient-descent based optimization (kingma2014adam); further details regarding this optimization are given in Appendix B, as well as solutions that facilitate the optimization of the non-convex objective by exploiting the manifold geometry.
The crucial property of the measure we propose is its invariance to deformations that are spanned by the spatial transformer; formal proofs and definitions are provided in Appendix A.3. This means that evaluating Eq. 3 with any datum transformed by the spatial transformer will produce the same value, as long as no information is lost.
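To make the invariance concrete, here is a toy instance of the measure in Eq. 3 in which the family of transformations is reduced to integer circular translations; the actual method searches over TPS deformations by gradient descent instead. The helper is our own illustrative code, not the paper's.

```python
import numpy as np

def invariant_distance(x, c, max_shift=2):
    """Toy version of the deformation-invariant measure of Eq. 3:
    minimize the squared error over a small family of circular
    integer translations of the datum x toward the centroid c."""
    best = np.inf
    for dy in range(-max_shift, max_shift + 1):
        for dx in range(-max_shift, max_shift + 1):
            shifted = np.roll(x, (dy, dx), axis=(0, 1))
            best = min(best, float(np.sum((shifted - c) ** 2)))
    return best
```

A translated copy of a centroid has zero invariant distance to it, while its plain Euclidean distance is strictly positive, which is exactly the behavior the ST K-means relies on for assignment.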
3.2 Learning the Spatial Transformer K-means
Solving the optimization problem in Eq. 2 is, as for K-means, NP-hard. A popular tractable solution nonetheless exists and is known as the two-step Lloyd algorithm (lloyd1982least).
In the ST K-means, the first step of the Lloyd algorithm consists of assigning the data to the clusters using the newly defined measure of similarity in Eq. 3. The second step is the update of the centroids using the previously determined cluster assignment. It corresponds to the solution of an optimization problem, provided in the following Proposition 1.
Proposition 1.
The centroid updates of the ST K-means algorithm are given by
(4) 
where the average is taken over the cluster, normalized by its cardinality, and where the transformation parameters are those of the TPS that best transforms each signal into the centroid, i.e., the minimizers in Eq. 3 (proof in Appendix A.2).
The averaging in Eq. 4 is performed on the transformed versions of the signals. The ST K-means thus takes the topology of the signals' space into account. A pseudocode of the centroid update of Eq. 4 is presented in Algo. 1.
The ST K-means, which aims to minimize the distortion error of Eq. 2, proceeds by alternating between the two steps detailed above until convergence, as summarized in Algo. 2.
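Under the same toy translation family used above, one Lloyd iteration of the ST K-means (assignment with the invariant measure, then averaging of the optimally aligned data, as in Eq. 4) can be sketched as follows; all names are ours and the transformation family is a deliberate simplification of the TPS.

```python
import numpy as np

def best_fit(x, c, max_shift=1):
    """Return (distance, optimal shift) over integer circular translations."""
    best = (np.inf, (0, 0))
    for dy in range(-max_shift, max_shift + 1):
        for dx in range(-max_shift, max_shift + 1):
            d = float(np.sum((np.roll(x, (dy, dx), axis=(0, 1)) - c) ** 2))
            if d < best[0]:
                best = (d, (dy, dx))
    return best

def st_kmeans_step(data, centroids):
    """One Lloyd iteration: invariant assignment, then centroid update
    as the average of the optimally aligned cluster members (Eq. 4)."""
    assign, aligned = [], []
    for x in data:
        fits = [best_fit(x, c) for c in centroids]
        k = int(np.argmin([f[0] for f in fits]))
        assign.append(k)
        aligned.append(np.roll(x, fits[k][1], axis=(0, 1)))
    new_centroids = []
    for k in range(len(centroids)):
        members = [a for a, j in zip(aligned, assign) if j == k]
        new_centroids.append(np.mean(members, axis=0) if members else centroids[k])
    return assign, new_centroids
```

Because the averaging is done on aligned data, a cluster of shifted copies of the same pattern yields a crisp centroid rather than a blurred pixel-space mean.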
The update in Eq. 4, induced by our similarity measure, alleviates a fundamental limitation of the standard K-means. In fact, in the standard K-means, the centroid of a cluster is the average of the data belonging to it, computed without deforming the signals, which, as a result, does not account for the non-Euclidean geometry of the signals (klassen2004analysis; srivastava2005statistical).
3.3 Convergence of the Spatial Transformer K-means
As mentioned, our development is motivated by the interest in proposing a novel way to think about invariance in an unsupervised fashion while conserving the interpretability and theoretical guarantees of the K-means algorithm. We propose here to prove the convergence of the ST K-means algorithm following the generalization of clustering algorithms via the Bregman divergence developed in banerjee2005clustering. In their work, they characterize the class of distortion functions that admit an iterative relocation scheme in which a global objective function, such as the one in Eq. 2, is progressively decreased. We therefore prove that Algo. 2 monotonically decreases the distortion error of the ST K-means in Eq. 2, which in turn implies that Algo. 2 converges to a local optimum.
3.4 Geometrical Interpretation of the Similarity Measure
One of the great benefits of the K-means algorithm is the interpretability of the regions composing its partitioning. In particular, they are related to Voronoi diagrams, which are well-studied partitioning techniques (aurenhammer1991voronoi; aurenhammer2013voronoi). Following this framework, we now highlight the regions defined by the ST K-means algorithm. This is achieved by analyzing the following sets
(5) 
Such a partitioning falls within the framework of a special type of Voronoi diagram.
Proposition 3.
The partitioning induced by the ST K-means corresponds to a weighted Voronoi diagram where each region's size depends on the per-datum spatial transformations (proof and details in Appendix A.5).
While the Euclidean K-means induces a Voronoi diagram where each region is a polytope, the ST K-means does not impose such a constraint on its geometry. The similarity measure we propose adapts the geometry of each datum to each centroid and thus induces a specific metric space for each datum-centroid pair. In particular, for each datum-centroid pair, the ST K-means has a particular metric that induces the boundary of the regions. In a more general setting, each region is defined as the orbit of the centroid with respect to the transformations induced by the spatial transformer, thus defining regions that depend on the orbit's shape instead of polytopal ones.
This geometric observation can lead to efficient initializations for the ST K-means (arthur2006k), as well as to the evaluation of its optimality (bhattacharya2016tight). Besides, one can perform an in-depth study of the shape of the regions spanned by our approach to understand the failure cases of the algorithm for a particular application (har2014complexity; xia2018unified). One can also compare the partitioning achieved by our approach with that of DNNs, as in NEURIPS2019_0801b20e, to gain more insights into both models.
3.5 Computational Complexity & Parameters
The ST K-means computes a TPS for each of the samples and each of the centroids, as in Eq. 3; its time complexity is therefore driven by the cost of the TPS, whose dominating term is a matrix inversion that is cubic in the number of landmarks, multiplied by the number of samples and centroids. The number of parameters of the model depends on the number of samples, clusters, and landmarks.
To speed up the computation, we precompute the matrix inverse responsible for the dominating cubic term (see Appendix E for implementation details regarding the TPS) and implement the ST K-means on GPU with SymJAX (balestriero2020symjax), where high parallelization renders the practical computation time near constant with respect to the number of landmarks, as depicted in Fig. 2.
4 Experimental Setup
In this section, we detail the experimental settings followed to evaluate the performance of our model. For all the experiments and all clustering algorithms, the number of clusters is set to the number of classes the dataset contains. The various datasets and their train-test splits, used to optimize the models' parameters and update their centroids, are described in Appendix F.
4.1 Evaluation Metrics
For all the experiments, the accuracy is calculated using the metric proposed in yang2010image and defined as
(6) 
where the maximum is taken over all possible one-to-one mappings between clusters and ground-truth labels, and the score counts the proportion of cluster assignments matching the labels under the best such mapping. The results in Table 1 are taken as the best score on the test set based on the ground-truth labels among multiple runs, as in xie2016unsupervised. We also provide, on the same run, the normalized mutual information (NMI) (romano2014standardized) and adjusted Rand index (ARI) (hubert1985comparing).
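When the number of clusters is small, the metric of Eq. 6 can be computed directly by enumerating the one-to-one mappings; this is a sketch with illustrative names (in practice, the Hungarian algorithm is preferred for larger cluster counts, since enumeration is factorial in the number of clusters).

```python
from itertools import permutations

def clustering_accuracy(y_true, y_pred):
    """Unsupervised accuracy of Eq. 6: best agreement over all
    one-to-one mappings between cluster indices and labels."""
    n_ids = max(max(y_true), max(y_pred)) + 1
    best = 0
    for mapping in permutations(range(n_ids)):
        # count samples whose remapped cluster index matches the label
        hits = sum(1 for t, p in zip(y_true, y_pred) if mapping[p] == t)
        best = max(best, hits)
    return best / len(y_true)
```

For instance, a prediction that is a pure relabeling of the ground truth scores 1.0, since some permutation aligns every cluster with its label.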
4.2 Cross Validation Settings
We provide in Appendix C the details regarding the benchmark models and their cross-validation settings.
Our model requires the cross-validation of two hyperparameters: the number of landmarks and the learning rate used to learn the similarity measure in Eq. 3. However, the clustering framework does not allow the use of label information to perform the cross-validation of these parameters. We thus need a proxy to determine the optimal model parameters. Interestingly, the distortion error used in the ST K-means, Eq. 2, appears to be negatively correlated with the accuracy, as displayed in Fig. 5. Note that the distortion error is commonly used as a fitness measure in K-means, for example when cross-validating the number of clusters.
We cross-validate the number of landmarks, which defines the resolution of the transformation, over a small grid of values. The landmarks are then learned via the Adam optimizer, with a cross-validated learning rate. We train our method for the same number of epochs on all the datasets, with mini-batches of fixed size. As for K-means and AI K-means, the centroids of the ST K-means are initialized by the K-means algorithm. Importantly, the same procedure is applied to all datasets.
5 Results and Interpretations
In this section, we report and interpret the results obtained by our ST K-means and competing models.
5.1 Clustering Accuracy
We report in Table 1 the accuracy of the different models on the different datasets. Our approach outperforms existing models on most datasets. Our model equals the performance of AI K-means on Affine MNIST and is outperformed only by VaDE (MLP) on MNIST.
Table 1 reports, from left to right, the ACC, NMI, and ARI scores on the following datasets (number of classes in parentheses): Aff. MNIST (#10), Diffeo. MNIST (#10), MNIST (#10), Audio MNIST (#10), EMNIST (#26), RockPaperSci. (#3), Face10 (#13), Arabic Char. (#28). The "DL" column indicates whether the method is a deep learning approach.

Method         | DL | ACC                          | NMI                  | ARI
K-means        | ✗  | 68 61 53 10 39 40 20 19      | 50 1 39 5 18 27      | 39 0 21 4 0 1
AI K-means     | ✗  | 100 91 75 29 48 72 31 30     | 62 18 45 30 30 37    | 54 10 26 24 3 17
ST K-means     | ✗  | 100 99 92 41 65 86 45 51     | 82 26 63 63 53 61    | 83 15 46 63 20 38
AE + K-means   | ✓  | 72 60 66 13 41 48 37 23      | 64 1 40 9 27 33      | 59 0 28 6 26 15
DEC (MLP)      | ✓  | 84 77 84 10 55 46 33 24      | 83 1 51 12 20 32     | 80 0 31 8 3 13
DEC (Conv)     | ✓  | 70 68 78 15 60 54 38 29      | 74 3 56 18 31 39     | 69 1 37 13 17 16
VaDE (MLP)     | ✓  | 68 65 94 11 20 50 36 26      | 89 1 12 16 27 30     | 85 0 8 11 14 10
VaDE (Conv)    | ✓  | 65 59 81 14 58 55 40 46      | 78 2 55 20 35 53     | 80 0 38 15 18 29
Whereas the various deep learning approaches perform well on datasets for which their architectures were developed, e.g., MNIST and its derivatives (EMNIST, Arabic Characters), they show limited performance on higher-resolution datasets with a small number of samples, such as Rock-Paper-Scissors and Face10, as well as on the two toy examples, which are composed of only a small amount of training and testing data. In the following sections, we interpret various visualizations of the K-means variants used in this work.
5.2 Interpretability: Centroids Visualization
We propose in Fig. 3 to visualize the centroids obtained via K-means, AI K-means, and our ST K-means. Supplementary visualizations are provided in Appendix G.2. For each dataset, the first row shows the centroids at initialization. The three following rows show the centroids obtained via the K-means, AI K-means, and ST K-means algorithms, respectively.
We observe that, for all datasets, the K-means centroids do not lie on the data manifold: they are unrealistic images that could not occur naturally in the dataset. Besides, they appear blurry and hardly interpretable. These drawbacks are due to the update rule, which averages the data belonging to each cluster in pixel space. The AI K-means algorithm drastically reduces the centroids' blurriness induced by such an averaging, as it averages affinely transformed data. However, our ST K-means produces the crispest centroids and does not introduce any ambiguity between the different clusters. In fact, the update of our method, Eq. 4, takes the nonlinear structure of the manifold into account by averaging over data transformed using a non-rigid transformation.
Interestingly, Fig. 3 shows that even if, at initialization, multiple centroids assigned to the same class are attributed to different clusters, the ST K-means is able to recover from this poor initialization thanks to its explicit manifold modeling and centroid averaging technique. For instance, in the Rock-Paper-Scissors dataset, although at initialization two centroids correspond to the class paper, the ST K-means learns centroids for each of the three classes within this dataset. In the Face10 dataset, some learned centroids correspond to rotations of the initialization; even under such extreme changes of pose, the centroids remain crisp in most cases.
5.3 Interpretability: Embedding Visualization
To get further insights into the disentangling capability of the ST K-means, we compare the 2-dimensional projections of the data obtained using t-SNE (maaten2008visualizing) for the K-means, AI K-means, and ST K-means. Supplementary visualizations are provided in Appendix G.1.
The t-SNE visualizations, for both the AI and ST K-means, are obtained by extracting the optimal transformation that led to the assignment. Precisely, for each image, we compute and extract the optimal transformation parameters, which are then used to obtain the transformed image fed as the input of the t-SNE.
We can observe in Fig. 4 that the affine transformations ease the data separation in this 2-dimensional space. The ST K-means further drastically enhances the separability of the different clusters. When using the ST K-means, the data are clustered based on macroscopically meaningful and interpretable parameters, making the model's performance understandable. For instance, for the Face10 dataset, the t-SNE representation of the ST K-means clusters shows that faces are grouped according to three significant orientations: left, right, and front. These three clusters are more easily observed with our ST K-means than with the affine invariant model. However, the finer orientation differences present in the dataset remain too subtle to be captured by the ST K-means.
For the MNIST dataset (last row and column of Fig. 4), we observe that most of the incorrectly clustered images are almost indistinguishable from samples of the cluster to which they have been attributed. In particular, we highlight this by zooming into one of the handwritten-digit clusters in Fig. 4. We can see that the yellow instances are samples from another class that have been transformed such that they resemble that cluster's centroid in Fig. 3. We also provide a zoom-in on one of the clusters obtained on the Rock-Paper-Scissors dataset (first row and last column of Fig. 4). The incorrectly clustered data are the ones that, when transformed, easily fit the scissors shape.
6 Conclusion
Designing an unsupervised algorithm that is robust to non-rigid transformations remains challenging, despite the tremendous breakthroughs in machine learning. The problem lies in appropriately limiting the size of the transformations. We showed that the spatial transformer can achieve this, as the number of landmarks allows the learnability of a coarse-to-fine grid of transformations. However, such a parameter controlling the size of the transformations should be carefully designed, or even learned per cluster or per sample. Beyond this difficulty, we showed that we can conserve the interpretability of the K-means algorithm applied in the input data space while drastically improving its performance. Such a framework should be favored in clustering applications where the explainability of the decision is critical.
References
Appendix A Properties of ST K-means and Proofs
A.1 ST K-means Similarity Measure: a Quasi-pseudo-semimetric
Proposition 4.
The similarity measure defined in Eq. 3 is a quasi-pseudo-semimetric.
Proof.
Let us first define the orbit of an image with respect to the TPS transformations. Note that the TPS does not form a group, as it is a piecewise mapping. However, we know that it approximates any diffeomorphism on the plane. Therefore, for the sake of simplicity, we make a slight abuse of notation by considering the orbit, equivariance, and other group-specific properties as being induced by the spatial transformer.
Definition 1.
We define the orbit of an image under the action of the spatial transformer by
(7) 
Let us now consider each metric statement: 1) Non-negative: the measure is non-negative as per the use of a norm.
2) Pseudo: the measure can vanish for two distinct signals, namely when they are equivalent with respect to the transformations induced by the spatial transformer. Such signals are possibly distinct as points in signal space, but they are not distinct when we consider the data as points on their orbits with respect to the group of diffeomorphisms. In fact, the distance is zero if and only if the two signals belong to the same orbit.
3) Quasi: the asymmetry of the distance is due to the non-volume-preserving deformations considered. In fact, we do not consider the Haar measure of the associated diffeomorphism group and instead compute the distance with respect to the Lebesgue measure. Although the asymmetry does not affect our algorithm or results, a symmetric metric can be built by normalizing the distance by the determinant of the Jacobian of the transformation. Such a normalization would make the metric volume-preserving and, as a result, make the distance symmetric.
4) Semi: if two signals belong to the same orbit, there exists a TPS mapping each onto the other by definition of the orbit; thus, the triangle inequality holds. If exactly two of the three signals involved share an orbit, the corresponding distance term vanishes and the inequality is respected. However, if the three signals belong to three different orbits, then we have no guarantee that the triangle inequality holds; in fact, it depends on the distances between the orbits, which are specific to each dataset. ∎
A.2 ST K-means Updates: Proof of Proposition 1
We consider the Fréchet mean of each cluster as the solution of an optimization problem over the centroid. Using our similarity measure, we obtain the following.
Proof.
The Fréchet mean for a cluster is defined as the minimizer of the sum of the similarity measures over the cluster members. Since the optimization problem is convex in the centroid (being the composition of the identity map and a norm, which are both convex), setting the gradient to zero yields the closed-form update, with
(8) 
∎
A.3 ST K-means Similarity Measure: Invariance Property
Motivated by the fact that small non-rigid transformations usually do not change the nature of an image, we propose to exploit the invariance property of the similarity measure we proposed.
In this section, for the sake of simplicity, we assume that the transformations belong to the group of diffeomorphisms. In practice, the TPS can only approximate elements of such a group, and the constraints we impose on the transformation, e.g., the number of landmarks, also limit the type of diffeomorphisms that can be approximated; therefore, we could instead consider that we approximate a subgroup of the diffeomorphism group.
Let us define an invariant similarity measure under the action of such a group. That is, the similarity between two 2-dimensional signals remains the same under any diffeomorphic transformation. We propose to define invariance in the framework of centroid-based clustering algorithms as follows.
Definition 2.
An invariant similarity measure with respect to a group is one such that, for all images, all centroids, and all group elements, we have
(9) 
where the group element acts on the image.
Proposition 5.
The similarity is invariant.
Proof.
Let us consider a transformed image; the minimum of the loss is unchanged, since every group element admits an inverse element.
That is, by definition of the group, there is always another element that minimizes the loss function, obtained by composing the inverse of the group element that has just been applied with the optimal element. ∎

A.4 ST K-means Convergence: Proof of Proposition 2
Proof.
Following the notation of Sec. E, we can define the spatial transformer operator as the composition of the TPS and the bilinear interpolation map. The aim is now to prove that the distortion defines a Bregman divergence measure as in (banerjee2005clustering). In such a case, Algo. 2 is a special case of the Bregman divergence hard-clustering algorithm, also defined in (banerjee2005clustering), which is proven to converge.
Let us first make an assumption on the data: without loss of generality, we assume they are non-negative, as we deal either with images or with time-frequency representations where a modulus is applied to obtain the 2-dimensional real representation. We also assume that the minimum over the transformation parameters is attained at a unique global minimum.
It is then clear that the identity transform of the centroid is a feasible solution; denoting the optimally transformed datum accordingly, we obtain the following.
Now, we know that the squared norm is non-decreasing w.r.t. each dimension, since the image or time-frequency representation is positive real-valued, and the inner product defines a strictly convex map. We also know that the spatial transformer is the composition of the TPS for the coordinates and the bilinear map for the image, which can be formulated as a linear transformation of the data by a structured sparse matrix whose blocks encode the dependency on nearby pixels. Therefore this mapping is convex. As the composition of a function that is non-decreasing w.r.t. each dimension and strictly convex with a convex function, the distortion is strictly convex, which completes the proof. ∎
A.5 ST K-means: Weighted Voronoi Diagram
Proof.
Let us start by rewriting the similarity measure so as to analytically express the metric tensor that acts as the weight in the weighted Voronoi diagram the ST K-means defines. Using App. E, we can rewrite the spatial transformer as an operator that is bilinear in the coordinates induced by the TPS. In this formulation, the learned landmarks define a displacement vector w.r.t. the original uniform grid of landmarks; if this displacement is the null vector, the transformation is the identity. Now, we assume that this linear operator is invertible, i.e., the TPS transformation is invertible (note that this is not always the case (johnson2001landmark)). Then, we can rewrite the similarity measure in quadratic form, where the quadratic matrix defines the metric tensor and where the displacement vector depends on both the centroid and the datum. Also, note that while the forward operator maps the datum onto the centroid, the inverse operator maps the centroid to the datum.
The tuple of cells so defined constitutes a weighted Voronoi diagram (letscher2007vector; inaba1994applications), where we observe that the metric tensor depends on all the spatial transformations. ∎
Appendix B Implementation Details
Note that the deformation invariant similarity measure we introduced in Eq. 3 differs from the affine invariant distances developed in (fitzgibbon2002affine; simard2012transformation; lim2004image). All previously defined measures of error rely on the assumption that the manifold can be locally linearized; as a result, the tangent space is used as a proxy to learn the optimal affine transformation. However, the work of (wakin2005multiscale) suggests that tangent planes fitted to an image manifold continually twist off into new dimensions as the parameters of the affine transformations vary, due to the possibly intrinsic multiscale structure of the manifold. As such, the alignment of two images cannot simply be done by linearizing the manifold. To address this, (wakin2005multiscale) propose to consider the multiscale structure of the manifold; we simplify their approach by applying a low-pass filter on the images and the centroids prior to learning the affine transformation best aligning them. Then, we optimize the remaining part of the TPS to account for diffeomorphic transformations. These two steps are similar to the ones used in (jaderberg2015spatial).
Appendix C Alternative Methods
We compare our model with well-known clustering techniques using deep neural networks. We performed experiments for VaDE (jiang2016variational) and DEC (xie2016unsupervised) using the code made publicly available by the authors. We use the annotation (MLP) as a reference to the MLP architecture used in their experiments (see Appendix D for details). To fairly compare our model to the DEC and VaDE models, we propose a convolutional architecture for the DEC and VaDE networks, denoted by DEC (Conv) and VaDE (Conv) (see Appendix D for details). Finally, we evaluate the performance of an augmented K-means algorithm trained using the features extracted by an autoencoder, denoted by AE K-means in the following.
The parameters of the different models mentioned above are learned by stochastic gradient descent (Adam optimizer (kingma2014adam)). In all the experiments, the learning rates are cross-validated following the approach in (xie2016unsupervised) according to . The internal parameters that are model dependent, e.g., the number of pre-training epochs and the update intervals, are also cross-validated.
We also compare our ST K-means to the closely related K-means and affine-invariant K-means, denoted by AI K-means. For each run, all three K-means algorithms start from the same initial centroids, obtained using the K-means++ seeding developed by (arthur2006k) to speed up the convergence of the K-means algorithm.
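The shared initialization above relies on the K-means++ seeding of (arthur2006k), which picks each new centroid with probability proportional to its squared distance to the closest centroid already chosen. A minimal NumPy sketch of that seeding, usable to feed identical initial centroids to all K-means variants (the data and seed below are illustrative):

```python
import numpy as np

def kmeans_pp_init(X, k, rng):
    """K-means++ seeding (arthur2006k): each new centroid is drawn with
    probability proportional to its squared distance to the nearest
    already-chosen centroid, spreading the initialization across the data."""
    n = X.shape[0]
    centroids = [X[rng.integers(n)]]
    for _ in range(k - 1):
        d2 = np.min([np.sum((X - c) ** 2, axis=1) for c in centroids], axis=0)
        centroids.append(X[rng.choice(n, p=d2 / d2.sum())])
    return np.stack(centroids)

# Three well-separated synthetic clusters; the same `C` can then be passed
# to every K-means variant so they all start from identical centroids.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(m, 0.1, size=(50, 2)) for m in (0.0, 5.0, 10.0)])
C = kmeans_pp_init(X, 3, rng)
```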
Appendix D Neural Network Architectures
For both architectures, the decoder architecture is symmetric to the encoder, and the batch size is set to .
MLP: The MLP architecture, from input data to bottleneck hidden layer, is composed of fully connected ReLU layers with dimensions .
Conv: The Conv architecture is composed of convolutional ReLU layers with filters of size , and fully connected ReLU layers with dimension . For each layer, batch normalization is applied.
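Since the exact filter counts and layer dimensions are specified above, the following pure-Python sketch only illustrates how the spatial resolution of the feature maps evolves through stacked convolutions; the 3x3 kernels, stride 2, padding 1, three-layer depth, and 28x28 input are hypothetical values chosen for illustration.

```python
def conv_out(size, kernel, stride=2, pad=1):
    """Spatial output size of one conv layer: floor((size + 2*pad - kernel) / stride) + 1."""
    return (size + 2 * pad - kernel) // stride + 1

# Trace the spatial size of a hypothetical 28x28 input through three
# assumed conv layers (3x3 kernels, stride 2, padding 1).
shapes = [28]
for _ in range(3):
    shapes.append(conv_out(shapes[-1], kernel=3))
```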
Appendix E Thin-Plate-Spline Interpolation
Let us consider two sets of landmarks: the source landmarks $\{\lambda_i\}_{i=1}^{N}$ and the transformed ones $\{\mu_i\}_{i=1}^{N}$, where $N$ denotes the number of landmarks. The TPS aims at finding a mapping $\Phi$ such that $\Phi(\lambda_i) = \mu_i$, that is, a mapping between the two sets of landmarks. The particularity of the TPS is that it learns such a mapping by minimizing an interpolation term together with a regularization term that penalizes the bending energy.
The TPS optimization problem is defined by
$$\min_{\Phi} \; \sum_{i=1}^{N} \left\| \mu_i - \Phi(\lambda_i) \right\|_2^{2} \; + \; \beta \iint \left[ \left( \frac{\partial^{2}\Phi}{\partial x^{2}} \right)^{2} + 2\left( \frac{\partial^{2}\Phi}{\partial x \partial y} \right)^{2} + \left( \frac{\partial^{2}\Phi}{\partial y^{2}} \right)^{2} \right] dx \, dy, \qquad (10)$$
where $\{\lambda_i\}_{i=1}^{N}$ are the source landmarks, $\{\mu_i\}_{i=1}^{N}$ the transformed ones, and $\beta$ controls the strength of the bending-energy penalty.
In our model, the source landmarks are considered to be the coordinates of a uniform grid. Also note that both the source landmarks and the transformed ones are usually a subset of the set of coordinates of the image. For instance, for the MNIST dataset of size , the landmarks would form a grid of size , where . While the mapping is based on the landmarks, it is then applied to the entire set of image coordinates. In fact, the mapping decomposes as $\Phi = (\Phi^{1}, \Phi^{2})$, where $\Phi^{1}$ (resp. $\Phi^{2}$) corresponds to the mapping onto the first dimension (resp. the second dimension).
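As a concrete illustration of a uniform grid of source landmarks that is a subset of the image coordinates, the NumPy sketch below builds one for a 28x28 image; the grid spacing of 7 pixels is an assumption chosen for illustration, not the value used in our experiments.

```python
import numpy as np

# Uniform grid of source landmarks over a 28x28 image (spacing is illustrative).
H = W = 28
spacing = 7
ys, xs = np.meshgrid(np.arange(0, H, spacing),
                     np.arange(0, W, spacing), indexing="ij")
landmarks = np.stack([ys.ravel(), xs.ravel()], axis=1)  # shape (N, 2), N = 16
```

Every row of `landmarks` is a valid pixel coordinate of the image, which is what allows the TPS, fitted on the landmarks, to be applied afterwards to all image coordinates.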
The solution of the TPS optimization problem, Eq. 10, provides the following analytical formula for the two coordinates $\Phi = (\Phi^{1}, \Phi^{2})$ of the mapping: for any position $p = (x, y)$,
$$\Phi^{1}(p) = a^{1}_{0} + a^{1}_{1} x + a^{1}_{2} y + \sum_{i=1}^{N} w^{1}_{i} \, U\!\left( \left\| p - \lambda_i \right\|_2 \right), \qquad (11)$$
$$\Phi^{2}(p) = a^{2}_{0} + a^{2}_{1} x + a^{2}_{2} y + \sum_{i=1}^{N} w^{2}_{i} \, U\!\left( \left\| p - \lambda_i \right\|_2 \right), \qquad (12)$$
where $\|\cdot\|_2$ is the $\ell_2$ norm, $a^{1}, a^{2} \in \mathbb{R}^{3}$ are the parameters governing the affine transformation, and $w^{1}, w^{2} \in \mathbb{R}^{N}$ are the parameters responsible for non-rigid transformations, as they stand as weights of the nonlinear kernel $U$. The nonlinear kernel is expressed by $U(r) = r^{2} \log r$.
Based on the landmarks $\{\lambda_i\}$ and $\{\mu_i\}$, we can obtain these parameters by solving a simple system of equations defined by the following operations
$$\begin{pmatrix} W \\ A \end{pmatrix} = L^{-1} \begin{pmatrix} M \\ 0_{3 \times 2} \end{pmatrix}, \qquad (13)$$
where the matrix $L \in \mathbb{R}^{(N+3) \times (N+3)}$ is defined as
$$L = \begin{pmatrix} K & P \\ P^{\top} & 0_{3 \times 3} \end{pmatrix},$$
where $K_{ij} = U(\|\lambda_i - \lambda_j\|_2)$, the $i$-th row of $P$ is $(1, x_i, y_i)$ with $\lambda_i = (x_i, y_i)$, and $M \in \mathbb{R}^{N \times 2}$ stacks the transformed landmarks $\mu_i$. The columns of $W \in \mathbb{R}^{N \times 2}$ and $A \in \mathbb{R}^{3 \times 2}$ collect the kernel weights $w^{1}, w^{2}$ and the affine parameters $a^{1}, a^{2}$ of Eqs. 11 and 12.
Note that, since the matrix depends only on the source landmarks, and since in our case these are unchanged, its inverse can be computed only once. The only operation required for each datum and each centroid is the matrix multiplication providing the parameters of the TPS transformation, as per Eqs. 11 and 12. Given these parameters, the mapping can be applied to each coordinate of the image.
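The solve-once-then-multiply procedure above can be sketched in NumPy as follows. This is a minimal sketch of the standard TPS linear system, assuming a small grid of source landmarks; the helper names and the numerical jitter inside the logarithm are our own implementation choices.

```python
import numpy as np

def tps_fit(src, dst):
    """Build the TPS system matrix from the source landmarks and solve it.
    Since the matrix depends only on `src`, its factorization can be cached
    and reused for every datum/centroid pair."""
    n = src.shape[0]
    r2 = np.sum((src[:, None, :] - src[None, :, :]) ** 2, axis=-1)
    # U(r) = r^2 log r = 0.5 * r^2 * log(r^2); jitter avoids log(0).
    K = np.where(r2 > 0, 0.5 * r2 * np.log(r2 + 1e-12), 0.0)
    P = np.hstack([np.ones((n, 1)), src])
    L = np.zeros((n + 3, n + 3))
    L[:n, :n], L[:n, n:], L[n:, :n] = K, P, P.T
    rhs = np.vstack([dst, np.zeros((3, 2))])
    return np.linalg.solve(L, rhs)          # rows: n kernel weights, 3 affine params

def tps_map(params, src, pts):
    """Apply the fitted TPS mapping to arbitrary coordinates `pts`."""
    r2 = np.sum((pts[:, None, :] - src[None, :, :]) ** 2, axis=-1)
    U = np.where(r2 > 0, 0.5 * r2 * np.log(r2 + 1e-12), 0.0)
    affine = np.hstack([np.ones((len(pts), 1)), pts]) @ params[-3:]
    return affine + U @ params[:-3]

# Toy check: a pure translation of a 3x3 landmark grid is interpolated exactly.
src = np.array([[i, j] for i in range(3) for j in range(3)], dtype=float)
dst = src + np.array([1.0, 2.0])
params = tps_fit(src, dst)
mapped = tps_map(params, src, src)
```

By construction, the first block of equations in the system enforces the interpolation property, so `mapped` coincides with `dst` at the landmarks.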
Now, in order to render the image, one can perform bilinear interpolation, as achieved in (jaderberg2015spatial). Besides, the bilinear interpolation allows the propagation of the gradient through any differentiable loss function.
Given an image , where and denote the width and the height of the image, and two sets of landmarks, the uniform grid coordinates of (a subset of the image coordinates) and their transformation , we are able to learn a mapping such that, for each original pixel coordinate, we obtain its transformed coordinate. In fact, given any position on the original image, the mapping provides the new position as per Eq. 11 and Eq. 12.
Now, from this transformed coordinate space, we can render an image using, as in (jaderberg2015spatial), the bilinear interpolation function, which takes as input the original image and the transformed pixel coordinates, and outputs the pixel value of the transformed image at a given pixel coordinate
where is the Kronecker delta function and is the floor function rounding the real-valued coordinate down to the nearest pixel coordinate.
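A minimal NumPy sketch of the bilinear rendering step: each output pixel is a convex combination of the four integer neighbours of its (real-valued) transformed coordinate. Function and variable names are illustrative.

```python
import numpy as np

def bilinear_sample(image, coords):
    """Evaluate `image` at real-valued (y, x) coordinates by bilinear
    interpolation: weight the four surrounding pixels by their proximity."""
    H, W = image.shape
    y, x = coords[:, 0], coords[:, 1]
    y0 = np.clip(np.floor(y).astype(int), 0, H - 2)   # lower integer neighbour
    x0 = np.clip(np.floor(x).astype(int), 0, W - 2)
    dy, dx = y - y0, x - x0                            # fractional offsets
    return (image[y0, x0] * (1 - dy) * (1 - dx)
            + image[y0 + 1, x0] * dy * (1 - dx)
            + image[y0, x0 + 1] * (1 - dy) * dx
            + image[y0 + 1, x0 + 1] * dy * dx)

img = np.arange(16, dtype=float).reshape(4, 4)
vals = bilinear_sample(img, np.array([[0.0, 0.0], [1.5, 1.5]]))
```

Because the weights are piecewise-linear in the coordinates, this sampler is (sub)differentiable, which is what allows gradients to flow through the rendering step.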
Appendix F Datasets
MNIST (deng2012mnist): is a handwritten digit dataset containing 60,000 training and 10,000 test images of dimension 28x28, representing 10 classes.
Rigid MNIST: we randomly sample one instance of each MNIST class and generate random affine transformations for each sample. A third of the data are used for testing.
Nonrigid MNIST: we randomly sample one instance of each MNIST class and generate random affine transformations for each sample as well as random transformations using the TPS method. A third of the data are used for testing.
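The rigid perturbations used to build the two datasets above can be sketched as follows; the parameter ranges (rotation, isotropic scale, shift) are hypothetical values for illustration, not the settings used to generate Rigid and Non-rigid MNIST.

```python
import numpy as np

def random_affine(rng, max_angle=25.0, max_shift=3.0, max_log_scale=0.2):
    """Draw a random 2x3 affine matrix combining rotation, isotropic
    scaling, and translation (ranges are illustrative assumptions)."""
    a = np.deg2rad(rng.uniform(-max_angle, max_angle))
    s = np.exp(rng.uniform(-max_log_scale, max_log_scale))
    t = rng.uniform(-max_shift, max_shift, size=2)
    R = s * np.array([[np.cos(a), -np.sin(a)],
                      [np.sin(a),  np.cos(a)]])
    return np.hstack([R, t[:, None]])

rng = np.random.default_rng(0)
A = random_affine(rng)   # apply to pixel coordinates, then resample the image
```

For the non-rigid variant, such an affine map would be composed with a random TPS deformation of the landmark grid.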
Audio MNIST (becker2018interpreting): is composed of recordings of spoken digits by different speakers, sampled at and sec long. It consists of 10 classes. We use data for training and for testing. This dataset is transformed into a time-frequency representation. These representations, which can be considered as images, are commonly used as the representation of audio recordings (cosentino2020).
EMNIST (cohen2017emnist): is a handwritten letters dataset merging a balanced set of the uppercase and lowercase letters into a single 26-class dataset of dimension 28x28.
RockPaperScissors (rps): images of hands playing the rock, paper, scissors game, that is, a 3-class dataset. The dimension of each image is . The training set is composed of images and the testing set of images.
Face10 (gourier2004estimating): images of the faces of 15 people, wearing glasses or not and having various skin colors. For each individual, different samples are obtained with pose orientations varying from degrees to degrees, both horizontally and vertically. The dimension of each image is . The training set is composed of images and the testing set of images.
Arabic Char (altwaijry2020arabic): handwritten Arabic characters written by participants. The dataset is composed of images in the training set and in the test set. The dimension of each image is .