
Spatial Transformer K-Means

by Romain Cosentino, et al.

K-means is one of the most widely employed centroid-based clustering algorithms, with performance tied to the data's embedding. Intricate data embeddings have been designed to push K-means performance, at the cost of reduced theoretical guarantees and interpretability of the results. Instead, we propose preserving the intrinsic data space and augmenting K-means with a similarity measure invariant to non-rigid transformations. This enables (i) the reduction of intrinsic nuisances associated with the data, reducing the complexity of the clustering task, increasing performance, and producing state-of-the-art results; (ii) clustering in the input space of the data, leading to a fully interpretable clustering algorithm; and (iii) convergence guarantees.





1 Introduction

Clustering algorithms aim at discovering patterns in the data that enable their characterization, identification, and separation. The development of such a framework without any prior information regarding the data remains one of the milestones of machine learning that would assist clinicians, physicists, and data scientists, among others, with a better pattern discovery tool (bertsimas2020interpretable; greene2005producing).

While supervised learning has been converging toward the almost exclusive use of Deep Neural Networks (DNNs), avoiding the development of handcrafted features to provide the desired linearly separable embedding map, unsupervised clustering algorithms take various forms depending on the application at hand (ma2019learning; wagstaff2001constrained; estivill2002so). For instance, SIFT features are combined with a clustering algorithm for medical imaging (nam2009high), DNN embeddings are extracted and used as the input of the K-means algorithm for computer vision tasks, and signal-processing feature extractors are combined with a Gaussian mixture model to understand the nature of various seismic activities (seydoux2020clustering). The important role of clustering algorithms in assisting medical diagnoses as well as scientific discoveries highlights the importance of developing an interpretable and theoretically guaranteed tool (dolnicar2003using; xu2010clustering).

In this work, we focus our attention on the K-means clustering algorithm (macqueen1967some) and its application to 2-dimensional signals, such as images or time-frequency representations. Well-known for its simplicity, efficiency, and interpretability, the K-means algorithm partitions the data space into disjoint regions. Each region is represented by a centroid, and each datum is assigned to the closest centroid’s region. An integral part of the design of a clustering algorithm is the choice of an appropriate distance and of the number of clusters (he2013k; frey2002fast; 990920). While the Euclidean distance makes the design of the algorithm straightforward, this measure of similarity may omit the geometrical relationships between data points (steinbach2004challenges). In fact, a small rigid perturbation of an image, such as a rotation or translation, is enough to change the cluster assignment.
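The assignment-then-update loop described above can be sketched in a few lines of NumPy. This is an illustrative toy of the standard (Euclidean) K-means, not the paper's implementation; all names are ours:

```python
import numpy as np

def lloyd_kmeans(X, K, n_iter=50, seed=0):
    """Minimal Lloyd K-means: X is (N, D); returns centroids and labels."""
    rng = np.random.default_rng(seed)
    # Initialize centroids with K distinct data points.
    centroids = X[rng.choice(len(X), size=K, replace=False)].astype(float)
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        # Assignment step: each datum goes to the nearest centroid (Euclidean).
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
        labels = dists.argmin(axis=1)
        # Update step: each centroid becomes the mean of its assigned data.
        for k in range(K):
            if np.any(labels == k):
                centroids[k] = X[labels == k].mean(axis=0)
    return centroids, labels
```

As the paper notes, this Euclidean assignment is brittle: a small translation of an image changes its squared distance to every centroid, and hence possibly its cluster.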

There are two major difficulties in constructing a distance for a clustering algorithm; on the one hand, the metric should take into account the geometry of the data, e.g., be invariant to rigid transformations for images, and on the other hand, the metric should be interpretable as it is tied to the interpretability of the algorithm (steinbach2004challenges).

In this work, we tackle these two difficulties by introducing into our similarity measure the spatial transformations inherent to the geometry of the data at hand. In particular, we: formulate an interpretable and theoretically guaranteed K-means framework capable of exploiting the symmetry within the data; extend prior work on metrics invariant to rigid transformations to non-rigid transformations, thus taking into account a more realistic set of nuisances; and allow the symmetry underlying the data to be learned, thereby enabling the exploration of data whose equivalence classes are yet to be determined.

To learn the symmetry in the data and perform their transformations, we use the spatial transformer framework introduced in jaderberg2015spatial. This provides a learnable metric, invariant to non-rigid transformations, that is used as the K-means distortion error.

While many approaches to learn and estimate non-rigid transformations have been proposed, we follow one of today's mainstream approaches, developed in jaderberg2015spatial, where the Thin Plate Spline (TPS) is used as a differentiable deformation model. Our aim is not to compare deformation models, but to approach the learnability of invariances in an unsupervised setting in a way that is effective, tractable, and interpretable.

Our contributions can be summarized as follows:


  • We propose a novel adaptive similarity measure within the K-means framework that accounts for non-rigid transformations, Sec. 3.1.

  • We derive an appropriate update rule for the centroids that drastically improves both the interpretability of the centroids and their quality, Sec. 3.2.

  • We provide convergence guarantees and geometrical interpretations of our approach, Secs. 3.3 and 3.4.

  • Finally, we show numerically that our unsupervised algorithm competes with state-of-the-art methods on various datasets while benefiting from interpretable results, Sec. 5.

2 Background

2.1 Invariant Metrics

The development of measures invariant to specific deformations has been under investigation in the computer vision community for decades (fitzgibbon2002affine; simard2012transformation; lim2004image). By considering affine transformations of the data, such as shearing, translation, and rotation, as nuisances, these approaches propose a distance that reduces the variability intrinsic to high-dimensional images. These works are considered appearance manifold-based frameworks; that is, the distances are quantified by taking into account geometric proximity (262951; basri1998clustering; su2001modified; 1211332).

While the development of affine invariant metrics is well established, their extension to more general non-rigid transformations requires more attention. Recently, various deep learning methods have proposed ways to learn diffeomorphic transformations (detlefsen2018deep; balakrishnan2018unsupervised; lohit2019temporal; dalca2019learning; NEURIPS2019_db98dc0d). Others adopt a more theoretically grounded approach based on group theory, as in zhang2015finite; freifeld2015highly; durrleman2013sparse; allassonniere2015bayesian, as well as the statistical “pattern theory” approach developed in grenander1993general; dupuis1998variational.

2.2 Spatial Transformer

Figure 1: Spatial Transformations - Visualizations of a sample taken from the MNIST dataset and its transformed versions. Each image results from the application of the spatial transformer, which takes as input the original signal (top left) and the grid displayed below each transformed version. (Left) the original image and its associated transformation grid, which corresponds to the identity transform. (Middle) the image transformed by the affine transformation induced by the associated grid. (Right) the image transformed by the non-rigid transformation using the TPS induced by the grid below it.

The transformer operator, denoted by T, allows for non-rigid image transformations. It is based on the composition of two mappings: a deformation map and a sampling function. The deformation map transforms a uniform grid of 2-dimensional coordinates to provide its deformed version. The sampling function samples the signal with respect to a given grid of 2-dimensional coordinates.

The mapping we select to enable the learnability of the transformation in the coordinate space of the 2-dimensional signal is the Thin-Plate-Spline (TPS) interpolation technique (duchon1976interpolation; bookstein1989principal; nejati2010fast), which produces smooth mappings of the plane (morse2005interpolating). We refer the reader to Appendix E for details regarding this method. We consider as learnable parameters of the TPS a set of 2-dimensional coordinates, called landmarks, denoted by θ. Given a set of landmarks, the TPS provides the transformation map of a 2-dimensional grid. That is, the Euclidean plane is bent according to the learned landmarks.

In Fig. 1, we show at the bottom of each panel the grid associated with the landmarks. Each grid corresponds to the spatial transformation applied to the hand-written digit shown. The transformation of the signal based on these new coordinates is produced by performing bilinear interpolation using the original signal (top left) and the new coordinates; the details are provided in Appendix E.
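To make the TPS concrete, the sketch below fits a 2-D thin-plate spline mapping source landmarks to target landmarks by solving the standard TPS linear system, and then applies the fitted warp to query coordinates. This is a minimal NumPy illustration under our own naming; the paper's actual implementation is described in its Appendix E:

```python
import numpy as np

def tps_kernel(r2):
    # Radial basis U(r) = r^2 log(r^2), with the limit U(0) = 0.
    return np.where(r2 == 0, 0.0, r2 * np.log(np.maximum(r2, 1e-12)))

def tps_fit(src, dst):
    """Fit a 2-D thin-plate spline mapping src landmarks (n,2) to dst (n,2)."""
    n = len(src)
    d2 = ((src[:, None, :] - src[None, :, :]) ** 2).sum(-1)
    K = tps_kernel(d2)
    P = np.hstack([np.ones((n, 1)), src])      # affine part [1, x, y]
    # Standard TPS system: [[K, P], [P^T, 0]] [w; a] = [dst; 0].
    L = np.zeros((n + 3, n + 3))
    L[:n, :n] = K
    L[:n, n:] = P
    L[n:, :n] = P.T
    rhs = np.vstack([dst, np.zeros((3, 2))])
    return np.linalg.solve(L, rhs)             # (n+3, 2) coefficients

def tps_transform(W, src, pts):
    """Apply a fitted TPS to query points pts (m,2)."""
    d2 = ((pts[:, None, :] - src[None, :, :]) ** 2).sum(-1)
    U = tps_kernel(d2)
    P = np.hstack([np.ones((len(pts), 1)), pts])
    return U @ W[:len(src)] + P @ W[len(src):]
```

By construction the warp interpolates the landmarks exactly, and when the target landmarks coincide with the source ones it reduces to the identity, matching the identity grid of Fig. 1 (left).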

The spatial transformer is the composition of these two maps and is defined as

T(x, θ) = bilinear(x, TPS_θ(g)),   (1)

where x is the original 2-dimensional signal, g a uniform coordinate grid, and θ the set of 2-dimensional transformed coordinates (landmarks) to be learned. Note that the number of landmarks can be smaller than the dimension of the image, as the TPS interpolates to re-scale the transformation to any size.

Such a framework composing the TPS and bilinear interpolation has been defined as the spatial transformer in jaderberg2015spatial. However, in their work, the inference of the non-rigid transformations is performed by feeding each datum to a “localisation network”; instead, we directly learn the transformation parameters.
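The bilinear sampling step can be sketched as follows. This is a toy NumPy version with names of our own choosing; the paper relies on the differentiable sampler of jaderberg2015spatial:

```python
import numpy as np

def bilinear_sample(img, coords):
    """Sample a 2-D image at fractional (row, col) coordinates of shape (m, 2)."""
    H, W = img.shape
    r, c = coords[:, 0], coords[:, 1]
    # Clamp so the 2x2 neighborhood stays inside the image.
    r0 = np.clip(np.floor(r).astype(int), 0, H - 2)
    c0 = np.clip(np.floor(c).astype(int), 0, W - 2)
    dr, dc = r - r0, c - c0
    # Weighted average of the four surrounding pixels.
    top = img[r0, c0] * (1 - dc) + img[r0, c0 + 1] * dc
    bot = img[r0 + 1, c0] * (1 - dc) + img[r0 + 1, c0 + 1] * dc
    return top * (1 - dr) + bot * dr
```

Because the output is piecewise-linear in the coordinates, the composed operator of Eq. 1 is differentiable with respect to the landmarks, which is what makes gradient-based learning of θ possible.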

3 Spatial Transformer K-means

We now introduce the spatial transformer K-means (ST K-means), our proposed solution that composes the spatial transformer and the K-means algorithm.

3.1 Formalism

We recall that in this work we consider 2-dimensional signals defined by their width and height, such as images and time-frequency representations of time-series. Given a set of N 2-dimensional signals x_1, …, x_N, the K-means algorithm aims at grouping the data into K distinct clusters defining the partition P_1, …, P_K, with P_1 ∪ … ∪ P_K = {x_1, …, x_N} and P_k ∩ P_j = ∅ for k ≠ j. Each cluster P_k of the partition is represented by a centroid c_k.

As for the K-means algorithm, the goal of the ST K-means is to find centroids minimizing the following distortion error

L(c_1, …, c_K) = Σ_{k=1}^{K} Σ_{x ∈ P_k} d(x, c_k).   (2)
The assignment of a signal to a cluster is achieved through the evaluation of the similarity measure, d, between the signal and each centroid. A signal x belongs to the cluster P_k if and only if k = argmin_j d(x, c_j). While the standard K-means algorithm makes use of the Euclidean distance, i.e., d(x, c) = ‖x − c‖², we instead propose to use the following deformation-invariant similarity measure

d(x, c) = min_θ ‖T(x, θ) − c‖²,   (3)

which is a quasipseudosemimetric; see Appendix A for details and proof.

This similarity measure represents the least-squares distance between a centroid and the datum that has been fit to that centroid via the spatial transformer operator. Once this fitting is done for each centroid, the cluster assignment is based on the argmin of those distances, i.e., the datum is assigned to the cluster of the closest centroid. Therefore, the underlying assumption of our approach is that the distance between the optimal transformation of a signal and a centroid belonging to the same “class” should be smaller than the distance between its optimal transformation and a centroid that does not. That is, if x is geometrically near c_k, then d(x, c_k) < d(x, c_j) for all j ≠ k.

This measure requires solving a non-convex optimization problem. This can be achieved in practice by exploiting the spatial transformer’s differentiability with respect to the landmarks θ. As a result, we can learn the transformation by performing gradient-descent based optimization (kingma2014adam); further details regarding this optimization, as well as ways to facilitate the optimization of the non-convex objective by exploiting the manifold geometry, are given in Appendix B.
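To illustrate the idea of Eq. 3 without the full TPS machinery, the toy sketch below minimizes the reconstruction error over a small family of integer translations by exhaustive search; the paper instead optimizes the TPS landmarks θ by gradient descent (Adam). Function and variable names are ours:

```python
import numpy as np

def invariant_distance(x, c, max_shift=3):
    """Toy version of the invariant measure of Eq. 3: minimize
    ||T(x, theta) - c||^2 over a family of transformations.
    Here the family is integer (cyclic) translations; the paper uses
    TPS-parameterized non-rigid warps optimized by gradient descent."""
    best = np.inf
    for dr in range(-max_shift, max_shift + 1):
        for dc in range(-max_shift, max_shift + 1):
            shifted = np.roll(np.roll(x, dr, axis=0), dc, axis=1)
            best = min(best, ((shifted - c) ** 2).sum())
    return best
```

A translated copy of an image is at distance zero from the original under this measure, while its Euclidean distance is strictly positive, which is exactly the behavior the ST K-means assignment step relies on.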

The crucial property of the measure we propose is its invariance to deformations that are spanned by the spatial transformer; formal proofs and definitions are provided in Appendix A.3. This means that evaluating Eq. 3 with any datum transformed by the spatial transformer produces the same value, as long as no information is lost.

3.2 Learning the Spatial Transformer K-means

Solving the optimization problem in Eq. 2 is, as for K-means, NP-hard. A popular tractable solution nonetheless exists and is known as the two-step Lloyd algorithm (lloyd1982least).

In the ST K-means, the first step of the Lloyd algorithm consists of assigning the data to the clusters using the newly defined measure of similarity in Eq. 3. The second step is the update of the centroids using the previously determined cluster assignment. It corresponds to the solution of the optimization problem min_{c_k} Σ_{x ∈ P_k} d(x, c_k), provided in the following Proposition 1.

Proposition 1.

The centroid update of the ST K-means algorithm is given by

c_k = (1/|P_k|) Σ_{x ∈ P_k} T(x, θ*_x),   (4)

where |P_k| denotes the cardinality of the set P_k, and θ*_x is the set of TPS parameters that best transforms the signal x into the centroid c_k, that is, θ*_x = argmin_θ ‖T(x, θ) − c_k‖² (proof in Appendix A.2).

The averaging in Eq. 4 is performed on the transformed versions of the signals; the ST K-means thus respects the topology of the signal space. A pseudo-code of the centroid update of Eq. 4 is presented in Algo. 1.

Require: Cluster P_k, TPS parameters {θ*_x : x ∈ P_k}
Ensure: Updated centroid c_k
1:  Initialize c_k ← 0
2:  for x ∈ P_k do
3:     c_k ← c_k + T(x, θ*_x)/|P_k|,   Eq. 4
Algorithm 1 Centroid Update of ST K-means

The ST K-means, which aims to minimize the distortion error of Eq. 2, alternates between the two steps detailed above until convergence, as summarized in Algo. 2.

Require: Initial centroids c_1, …, c_K, dataset {x_1, …, x_N}
Ensure: Cluster partition P_1, …, P_K
1:  repeat
2:     for i = 1 to N do
3:        for k = 1 to K do
4:           Compute and store d(x_i, c_k) and θ*_{i,k} by solving Eq. 3
5:        Assign x_i to P_k where k = argmin_j d(x_i, c_j)
6:     Update the centroids using Algo. 1
7:  until convergence
Algorithm 2 Spatial Transformer K-means
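Putting the two steps together, here is a toy end-to-end version of the procedure in which integer translations stand in for the TPS. This is our own simplification for illustration (the paper initializes the centroids with K-means and optimizes TPS landmarks by gradient descent; here we naively use the first K data and an exhaustive shift search):

```python
import numpy as np

def best_shift(x, c, max_shift=3):
    """Return (error, shift) of the integer translation of x closest to c."""
    best, arg = np.inf, (0, 0)
    for dr in range(-max_shift, max_shift + 1):
        for dc in range(-max_shift, max_shift + 1):
            err = ((np.roll(np.roll(x, dr, 0), dc, 1) - c) ** 2).sum()
            if err < best:
                best, arg = err, (dr, dc)
    return best, arg

def st_kmeans_toy(X, K, n_iter=10):
    """Toy ST K-means: X is (N, H, W). Translations stand in for the TPS."""
    centroids = X[:K].astype(float).copy()   # naive init with the first K data
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        shifts = [None] * len(X)
        # Assignment step: invariant distance of each datum to each centroid.
        for i, x in enumerate(X):
            errs = [best_shift(x, c) for c in centroids]
            labels[i] = int(np.argmin([e for e, _ in errs]))
            shifts[i] = errs[labels[i]][1]
        # Update step (in the spirit of Eq. 4): average the *aligned* data.
        for k in range(K):
            idx = np.where(labels == k)[0]
            if len(idx):
                aligned = [np.roll(np.roll(X[i], shifts[i][0], 0),
                                   shifts[i][1], 1) for i in idx]
                centroids[k] = np.mean(aligned, axis=0)
    return centroids, labels
```

Even in this reduced setting, the update averages transformed data rather than raw data, so the centroids stay sharp instead of blurring under misaligned superpositions.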

The update in Eq. 4, induced by our similarity measure, alleviates a fundamental limitation of the standard K-means. In fact, in the standard K-means, the update c_k = (1/|P_k|) Σ_{x ∈ P_k} x averages the signals of a cluster without deforming them, which, as a result, does not account for the non-Euclidean geometry of the signals (klassen2004analysis; srivastava2005statistical).

3.3 Convergence of the Spatial Transformer K-means

As we mentioned, our development is motivated by the interest in proposing a novel way to think about invariance in an unsupervised fashion while conserving the interpretability and theoretical guarantees of the K-means algorithm. We propose here to prove the convergence of the ST K-means algorithm following the generalization of clustering algorithms via the Bregman divergence developed in banerjee2005clustering. In their work, they provide the class of distortion functions that admits an iterative relocation scheme where a global objective function, such as the one in Eq. 2, is progressively decreased. We therefore prove that Algo. 2 monotonically decreases the distortion error of the ST K-means in Eq. 2, which in turn implies that Algo. 2 converges to a local optimum.

Proposition 2.

Under the assumption that the spatial transformation optimization problem in Eq. 3 reaches a unique global minimum, the ST K-means algorithm described in Algo. 2 terminates in a finite number of steps at a partition that is locally optimal (proof in Appendix A).

3.4 Geometrical Interpretation of the Similarity Measure

One of the great benefits of the K-means algorithm is the interpretability of the regions composing its partitioning. In particular, they are related to Voronoi diagrams, which are well-studied partitioning techniques (aurenhammer1991voronoi; aurenhammer2013voronoi). Following this framework, we now highlight the regions defined by the ST K-means algorithm. This is achieved by analysing the following sets

R_k = { x : d(x, c_k) ≤ d(x, c_j), ∀ j ≠ k },   (5)

where we recall d(x, c) = min_θ ‖T(x, θ) − c‖². Such a partitioning falls in the framework of a special type of Voronoi diagram.

Proposition 3.

The partitioning induced by the ST K-means corresponds to a weighted Voronoi diagram where each region’s size depends on the per-datum spatial transformations (proof and details in Appendix A.5).

While the Euclidean K-means induces a Voronoi diagram where each region is a polytope, the ST K-means does not impose such a constraint on the regions’ geometry. The similarity measure we propose adapts the geometry of each datum to each centroid and thus induces a specific metric space for each data-centroid pair. In particular, for each data-centroid pair, the ST K-means has a particular metric that induces the boundary of the regions. In a more general setting, each region is defined as the orbit of the centroid with respect to the transformations induced by the spatial transformer, thus defining regions that depend on the orbit’s shape instead of polytopal ones.
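A small numerical illustration of this non-polytopal geometry (our own toy, using the rotation group about the origin instead of the TPS): for 2-D points, minimizing ‖R(α)x − c‖² over rotation angles α aligns x with c and yields the closed form (‖x‖ − ‖c‖)², so the induced region boundaries are circles rather than the hyperplanes of Euclidean K-means:

```python
import numpy as np

def rotation_invariant_d(x, c):
    """min over angles alpha of ||R(alpha) x - c||^2.
    The optimum rotates x onto the ray of c, giving (||x|| - ||c||)^2."""
    return (np.linalg.norm(x) - np.linalg.norm(c)) ** 2

# Two centroids at different radii: the induced regions are annuli
# centered at the origin, not the half-planes of Euclidean K-means.
c1, c2 = np.array([1.0, 0.0]), np.array([3.0, 0.0])
grid = [np.array([gx, gy]) for gx in np.linspace(-4, 4, 9)
        for gy in np.linspace(-4, 4, 9)]
labels = [int(rotation_invariant_d(p, c1) > rotation_invariant_d(p, c2))
          for p in grid]
```

For instance, the point (0, 2.5) is Euclidean-closer to c1 but is assigned to c2 under the invariant measure, since its radius 2.5 is closer to ‖c2‖ = 3 than to ‖c1‖ = 1: the assignment follows the orbit (here, the circle) rather than straight-line proximity.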

This geometric observation can lead to efficient initializations for the ST K-means (arthur2006k), as well as to the evaluation of its optimality (bhattacharya2016tight). Besides, one can perform an in-depth study of the shape of the regions spanned by our approach to understand the failure cases of the algorithm for a particular application (har2014complexity; xia2018unified). One can also compare the partitioning achieved by our approach with that of DNNs, as in NEURIPS2019_0801b20e, to gain more insights into both models.

3.5 Computational Complexity & Parameters

The time complexity of the ST K-means is dominated by the TPS: as per Eq. 3, a TPS must be fitted for each of the N samples and each of the K centroids, and its cost is cubic in the number of landmarks due to a matrix inversion. The number of parameters of the model depends on the number of samples, clusters, and landmarks.

To speed up the computation, we pre-compute the matrix inverse responsible for the dominating cubic term (see Appendix E for implementation details regarding the TPS), and implement the ST K-means on GPU with SymJAX (balestriero2020symjax), where high parallelization renders the practical computation time nearly constant with respect to the number of landmarks, as depicted in Fig. 2.

Figure 2: Computational Training Time - Comparison between our ST K-means and the Affine Invariant (AI) K-means computational times on the Arabic Characters dataset. (Left) computational time for varying training set sizes. (Right) computational time as a function of the number of landmarks. Since the AI K-means does not use the TPS algorithm, its computational time is constant as a function of the number of landmarks. We can observe that our procedure to speed up the computation renders the ST K-means tractable.
Figure 3: Centroids - We depict some centroids for the different K-means algorithms on Face-10, Rock-Paper-Scissors, and E-MNIST. The centroids at initialization are displayed in the first row. The centroids learned by K-means are shown in the second row, by the Affine Invariant K-means in the third row, and by our ST K-means in the fourth row. By comparing the results of the AI K-means (third row) with the standard K-means (second row), we can see that using only affine transformations slightly improves the K-means centroids and reduces the superposition issue that K-means suffers from. By comparing the results of our ST K-means (fourth row) with the other methods, it is clear that using non-rigid transformations significantly improves the quality of the centroids, making them sharper and removing the issue related to the non-additiveness of images. Note that K-means iteratively updates the centroids and cluster assignments; as such, the class associated with a specific centroid usually changes during training (additional centroid visualizations are proposed in Appendix G.2).
Figure 4: 2-dimensional t-SNE (# denotes the number of clusters; columns: Raw Data, Affine Invariant, Spatial Transformer) - We suggest the reader zoom into the plots to best appreciate the visualizations. The raw data (left column); the affinely transformed data using the AI distance, i.e., we extract the best affine transformation of each datum toward its assigned centroid and perform the t-SNE on these affinely transformed data (middle column); and the data transformed with respect to the TPS as per Eq. 3, i.e., the same process but with the spatial transformer, followed by the dimension reduction on these transformed data (right column). Each row corresponds to a different dataset: Rock-Paper-Scissors #3, Face-10 #13, Audio-MNIST #10, and MNIST #10 are depicted from top to bottom. For all the figures, the colors of the data represent their ground-truth labels. We observe that, across datasets, both the affine and the non-rigid transformations learned on the data help to define more localized clusters. For the Face-10 dataset, while the dataset contains 13 clusters, the ST K-means induced transformations lead to a 2-dimensional space where the faces are clustered along three major orientations: the top-left cluster corresponds to faces pointing left, the bottom one to faces pointing right, and the bottom-right one to faces pointing front. We also zoom into two locations where the ambiguity in the transformation induced by the spatial transformer is noticeable. In particular, we show two cases where the non-rigid transformations are too large for certain samples, leading to an erroneous clustering assignment: e.g., in the MNIST dataset, the yellow samples in the lens are instances of one class that have been transformed into digits geometrically resembling the centroid of another cluster, and are thus assigned to that cluster. The same phenomenon is shown in the Rock-Paper-Scissors lens, where some instances of the classes rock and paper are assigned to the class scissors (additional t-SNE visualizations are proposed in Appendix G.1).

4 Experimental Setup

In this section, we detail the experimental settings followed to evaluate the performance of our model. For all the experiments and all clustering algorithms, the number of clusters is set to the number of classes the dataset contains. The various datasets, and the train-test splits used to optimize the models’ parameters and update the centroids, are described in Appendix F.

4.1 Evaluation Metrics

For all the experiments, the accuracy is calculated using the metric proposed in yang2010image and defined as

ACC = max_{π ∈ Π} (1/N) Σ_{i=1}^{N} 1{y_i = π(ŷ_i)},   (6)

where y_i is the ground-truth label, ŷ_i the cluster assignment, and Π the set of all possible one-to-one mappings between clusters and labels. The results in Table 1 are taken as the best score on the test set based on the ground-truth labels among the runs, as in xie2016unsupervised. We also provide, for the same run, the normalized mutual information (NMI) (romano2014standardized) and the adjusted rand index (ARI) (hubert1985comparing).
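The accuracy metric above can be computed by brute force over the label permutations when the number of clusters is small (our own illustrative sketch; in practice the Hungarian algorithm solves this one-to-one matching efficiently for large numbers of clusters):

```python
import numpy as np
from itertools import permutations

def clustering_accuracy(y_true, y_pred):
    """Best accuracy over all one-to-one mappings between cluster
    indices and labels, enumerated exhaustively (feasible for small K)."""
    K = int(max(y_true.max(), y_pred.max())) + 1
    best = 0.0
    for perm in permutations(range(K)):
        # perm[p] is the label assigned to cluster index p.
        acc = np.mean([perm[p] == t for t, p in zip(y_true, y_pred)])
        best = max(best, acc)
    return best
```

A perfect clustering that merely relabels the classes scores 1.0, which is the point of maximizing over the mappings: cluster indices carry no inherent class identity.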

4.2 Cross Validation Settings

We provide in Appendix C the details regarding the benchmark models and their cross-validation settings.

Our model requires the cross-validation of two hyper-parameters: the number of landmarks and the learning rate used to learn the similarity measure in Eq. 3. However, the clustering framework does not allow the use of label information to perform this cross-validation. We thus need a proxy to determine the optimal model parameters. Interestingly, the distortion error used in the ST K-means, Eq. 2, appears to be negatively correlated with the accuracy, as displayed in Fig. 5. Note that the distortion error is commonly used as a fitness measure in K-means, for example when cross-validating the number of clusters.

Figure 5: Accuracy vs Distortion Error - Clustering accuracy, Eq. 6, of the ST K-means algorithm on the MNIST dataset as a function of the distortion error, Eq. 2, using the similarity measure, Eq. 3. Each gray dot is associated with a specific set of hyper-parameters, e.g., the learning rate and the number of landmarks of the spatial transformer. The accuracy is negatively correlated with the distortion error (see the blue line corresponding to the ordinary least-squares fit), indicating that the distortion error is an appropriate metric to cross-validate the hyper-parameters of the ST K-means algorithm, which is crucial in an unsupervised setting as the labels are not available.

We cross-validate the number of landmarks, which defines the resolution of the transformation, over a grid of candidate values. The learning of the landmarks is then performed via the Adam optimizer, with the learning rate also picked by grid search. We train our method for a fixed number of epochs and a fixed batch size for all the datasets. As for K-means and AI K-means, the centroids of the ST K-means are initialized by the K-means algorithm. Importantly, the same procedure is applied to all datasets.

Note that during training, both the similarity measure in Eq. 3 and the centroid update, Eq. 4, are performed. During the algorithm’s testing phase, the centroids remain fixed, and only the similarity measure is evaluated to assign each testing datum to a cluster.

5 Results and Interpretations

In this section, we report and interpret the results obtained by our ST K-means and competing models.

5.1 Clustering Accuracy

We report in Table 1 the accuracy of the different models on the different datasets. Our approach outperforms existing models on most datasets. Our model equals the performance of AI K-means on Affine MNIST and is only outperformed by VaDE (MLP) on MNIST.

Columns: Aff. MNIST #10 | Diffeo. MNIST #10 | MNIST #10 | Audio MNIST #10 | E-MNIST | Rock-Paper-Sci. #3 | Face-10 #13 | Arabic Char. #28

ACC
K-means           68   61   53   10   39   40   20   19
AI K-means       100   91   75   29   48   72   31   30
ST K-means       100   99   92   41   65   86   45   51
AE + K-means      72   60   66   13   41   48   37   23
DEC (MLP) (*)     84   77   84   10   55   46   33   24
DEC (Conv)        70   68   78   15   60   54   38   29
VaDE (MLP) (†)    68   65   94   11   20   50   36   26
VaDE (Conv)       65   59   81   14   58   55   40   46

NMI
K-means            -    -   50    1   39    5   18   27
AI K-means         -    -   62   18   45   30   30   37
ST K-means         -    -   82   26   63   63   53   61
AE + K-means       -    -   64    1   40    9   27   33
DEC (MLP) (*)      -    -   83    1   51   12   20   32
DEC (Conv)         -    -   74    3   56   18   31   39
VaDE (MLP) (†)     -    -   89    1   12   16   27   30
VaDE (Conv)        -    -   78    2   55   20   35   53

ARI
K-means            -    -   39    0   21    4    0    1
AI K-means         -    -   54   10   26   24    3   17
ST K-means         -    -   83   15   46   63   20   38
AE + K-means       -    -   59    0   28    6   26   15
DEC (MLP) (*)      -    -   80    0   31    8    3   13
DEC (Conv)         -    -   69    1   37   13   17   16
VaDE (MLP) (†)     -    -   85    0    8   11   14   10
VaDE (Conv)        -    -   78    0   38   15   18   29

Table 1: Clustering results (%) on the test set, accuracy as in Eq. 6 - Following the benchmark evaluation method, the best accuracy (ACC) over the runs is displayed - We also provide the associated normalized mutual information (NMI) and adjusted rand index (ARI) - the number of clusters is denoted by # next to the dataset name; the last five rows are deep-learning baselines, where (*): xie2016unsupervised and (†): jiang2016variational.

Whereas the various deep learning approaches perform well on datasets for which their architectures were developed, e.g., MNIST and its derivatives E-MNIST and Arabic Characters, they show limited performance on higher-resolution datasets with a small number of samples, such as Rock-Paper-Scissors and Face-10, as well as on the two toy examples. In fact, these datasets contain only a small number of training and testing data. In the following sections, we interpret various visualizations of the K-means variants used in this work.

5.2 Interpretability: Centroids Visualization

We propose in Fig. 3 to visualize the centroids obtained via K-means, AI K-means, and our ST K-means. Supplementary visualizations are provided in Appendix G.2. For each dataset, the first row shows the centroids at initialization from K-means. The three following rows show the centroids obtained via the K-means, AI K-means, and ST K-means algorithms, respectively.

We observe that, for all datasets, the K-means centroids do not lie on the data manifold: they are unrealistic images that could not occur naturally in the dataset. Besides, they appear blurry and hardly interpretable. These drawbacks are due to the update rule, which averages the data belonging to each cluster in pixel space. The AI K-means algorithm drastically reduces the centroids’ blurriness induced by such an averaging, as it averages affinely transformed data. However, our ST K-means produces the crispest centroids and does not introduce any ambiguity between the different clusters. In fact, the update of our method, Eq. 4, takes into account the non-linear structure of the manifold by averaging over data transformed using a non-rigid transformation.

Interestingly, Fig. 3 shows that even if, at initialization, multiple centroids assigned to the same class are attributed to different clusters, the ST K-means is able to recover from this poor initialization thanks to its explicit manifold modeling and centroid averaging technique. For instance, in the Rock-Paper-Scissors dataset, although at initialization two centroids correspond to the class paper, the ST K-means learns centroids for each of the three classes within this dataset. In the Face-10 dataset, some learned centroids correspond to rotations of the initialization; even under such extreme changes of pose, the centroids remain crisp in most cases.

5.3 Interpretability: Embedding Visualization

To get further insights into the disentangling capability of the ST K-means, we compare the 2-dimensional projections of the data using t-SNE (maaten2008visualizing) for the K-means, AI K-means, and ST K-means. Supplementary visualizations are provided in Appendix G.1.

The t-SNE visualizations, for both the AI and ST K-means, are obtained by extracting the optimal transformation that led to the assignment. Precisely, for each image, we compute Eq. 3 with respect to its assigned centroid and extract the optimal parameter θ*, which is then used to obtain the transformed image fed as the input of the t-SNE.

We can observe in Fig. 4 that the affine transformations ease the data separation in this 2-dimensional space. The ST K-means further drastically enhances the separability of the different clusters. When using the ST K-means, the data are clustered based on macroscopically meaningful and interpretable parameters, making the model’s performance possible to understand. For instance, for the Face-10 dataset, the t-SNE representation of the ST K-means clusters shows that faces are grouped according to three significant orientations: left, right, and front. These three clusters are more easily observed with our ST K-means than with the affine invariant model. However, the remaining orientation differences present in the dataset are too subtle to be captured by the ST K-means.

For the MNIST dataset (last row and column of Fig. 4), we observe that most of the incorrectly clustered images are almost indistinguishable from samples of the cluster they have been attributed to. In particular, we highlight this by zooming into one of the hand-written digit clusters in Fig. 4. We can see that the yellow instances are samples from another class that have been transformed such that they resemble that cluster’s centroid in Fig. 3. We also provide a zoom-in on one of the clusters obtained on the Rock-Paper-Scissors dataset (first row and last column of Fig. 4). The incorrectly clustered data are the ones that, when transformed, easily fit the scissors shape.

6 Conclusion

Designing an unsupervised algorithm that is robust to non-rigid transformations remains challenging, despite tremendous breakthroughs in machine learning. The problem lies in appropriately limiting the size of the transformations. We showed that the spatial transformer can achieve this, as the number of landmarks enables the learning of a coarse-to-fine grid of transformations. However, the parameter controlling the size of the transformation should be carefully designed, or learned per cluster or per sample. Despite this difficulty, we showed that we can preserve the interpretability of the -means algorithm applied in the input data space while drastically improving its performance. Such a framework should be favored in clustering applications where the explainability of the decision is critical.


Appendix A Properties of ST K-means and Proofs

a.1 ST K-means Similarity Measure: a Quasipseudosemimetric

Proposition 4.

The similarity measure defined by is a Quasipseudosemimetric.


Let’s first define the orbit of an image with respect to the TPS transformations. Note that the TPS does not form a group, as it is a piecewise mapping. However, we know that it approximates any diffeomorphism on . Therefore, for the sake of simplicity, we will make a slight abuse of notation by considering the orbit, equivariance, and other group-specific properties as being induced by the spatial transformer .

Definition 1.

We define the orbit of an image under the action of the by


Let’s now consider each metric property: 1) Non-negativity holds as per the use of a norm.

2) Pseudo: , that is, and are equivalent up to the transformations induced by . Thus for possibly distinct and ; however, these are not distinct when we consider the data as any point on their orbit with respect to the group of diffeomorphisms. In fact, the distance is equal to if and only if and lie on the same orbit.

3) Quasi: The asymmetry of the distance is due to the non-volume-preserving deformations considered. In fact, we do not consider the Haar measure of the associated diffeomorphism group, and instead consider the distance with respect to the Lebesgue measure. Although the asymmetry of does not affect our algorithm or results, a symmetric metric can be built by normalizing the distance by the determinant of the Jacobian of the transformation. Such a normalization would make the metric volume-preserving and, as a result, make the distance symmetric.

4) Semi: If , then , as there exists a such that the TPS maps each datum onto the other, as per the definition of the orbit; thus the triangle inequality holds. If and , we have . If and , we have , and since , the inequality is respected. However, if belong to three different orbits, then we do not have the guarantee that the triangle inequality holds. In fact, it will depend on the distance between the orbits, which is specific to each dataset. ∎
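As a summary, the four properties above can be collected in one display. The notation is illustrative, not verbatim from the paper: we write $d(x, c)$ for the similarity measure and $\mathcal{O}(x)$ for the orbit of Definition 1.

```latex
% Assumed notation: d(x, c) := \min_{\theta} \| \mathcal{T}_{\theta}[x] - c \|,
% \mathcal{O}(x) := \{ \mathcal{T}_{\theta}[x] : \theta \} the orbit of x.
\begin{align*}
&\text{non-negativity:} && d(x, c) \ge 0 \quad \text{(norm of a residual)},\\
&\text{pseudo:}         && d(x, c) = 0 \iff c \in \mathcal{O}(x)
                           \quad \text{(not necessarily } x = c\text{)},\\
&\text{quasi:}          && d(x, c) \ne d(c, x) \ \text{in general
                           (non-volume-preserving maps)},\\
&\text{semi:}           && d(x, z) \le d(x, y) + d(y, z) \ \text{may fail when }
                           x, y, z \text{ lie on three distinct orbits.}
\end{align*}
```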

a.2 ST K-means Updates: Proof of Proposition 1

We consider the Fréchet mean of the centroid to be the solution of the following optimization problem, . Using our similarity measure, we obtain the following.


The Fréchet mean for the cluster is defined as . Since the optimization problem is convex in (being the composition of the identity map and a norm, which are both convex), we have , with
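As a minimal sketch of this closed-form update, assuming each sample's optimal transformation has already been found, the centroid of a cluster is the average of its optimally transformed members (the minimizer under the squared Euclidean norm). Here `transform_fn` and `params` are hypothetical placeholders for the learned spatial transformer and its per-sample parameters.

```python
import numpy as np

def update_centroids(images, assignments, params, transform_fn, n_clusters):
    """Centroid update sketch: each centroid is the mean of its cluster's
    optimally transformed samples, the closed-form Frechet mean under the
    squared Euclidean norm."""
    centroids = []
    for k in range(n_clusters):
        members = [transform_fn(images[i], params[i])
                   for i in range(len(images)) if assignments[i] == k]
        centroids.append(np.mean(members, axis=0))
    return np.stack(centroids)

imgs = np.arange(8.0).reshape(4, 2)          # four 2-pixel "images"
cents = update_centroids(imgs, [0, 0, 1, 1], [None] * 4, lambda x, t: x, 2)
print(cents)  # [[1. 2.] [5. 6.]]
```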


a.3 ST K-means Similarity Measure: Invariance Property

Motivated by the fact that small non-rigid transformations usually do not change the nature of an image, we propose to exploit the invariance property of our similarity measure.

In this section, for the sake of simplicity, we will assume that the transformations belong to the group of diffeomorphisms. In practice, the TPS can only approximate elements of such a group, and the constraints we impose on the transformation, e.g., the number of landmarks, also limit the type of diffeomorphisms that can be approximated; therefore, we could instead consider that we approximate a subgroup of the diffeomorphism group.

Let’s define an invariant similarity measure under the action of such a group. That is, the similarity between two -dimensional signals remains the same under any diffeomorphic transformation. We propose to define the invariance in the framework of centroid-based clustering algorithms as follows.

Definition 2.

An invariant similarity measure with respect to is defined such that for all images , all centroids , and all group elements , we have


where denotes the action of the group element onto the image .

The similarity used in Eq. 3 of the optimization problem is -invariant as per Definition 2.

Proposition 5.

The similarity is -invariant.


Let us consider ; we have , where is the inverse group element of . In fact, since for all there exists an inverse element , we have that , .

That is, by definition of the group, there is always another element that minimizes the loss function, obtained by composing the inverse of the group element that has just been applied, , with the optimal element . ∎

a.4 ST -means Convergence: Proof of Proposition 2


Following the notation of Sec. E, we can define the spatial transformer operator as the composition of the TPS and the bilinear interpolation map. That is, . The aim is now to prove that defines a Bregman divergence as in (banerjee2005clustering). In that case, Algo. 2 is a special case of the Bregman divergence hard-clustering algorithm defined in (banerjee2005clustering), which is proven to converge.

Let’s first make an assumption on the data : we can, without loss of generality, assume that they are non-negative, as we are dealing either with images or with time-frequency representations where a modulus is applied to obtain the -dimensional real representation. We also assume that the minimum over the transformation parameters reaches a unique global minimum, denoted by . Now,

Now it is clear that , which corresponds to the identity transform of the centroid . Then we denote by , where , and obtain that,

Now, we know that is non-decreasing w.r.t. each dimension, since the image or time-frequency representation is positive real-valued, and the inner product defines a strictly convex map. We also know that is defined as the composition of the TPS for the coordinates and the bilinear map for the image, which can be formulated as a linear transformation with respect to the data , where is a structured sparse matrix in which each block denotes the dependency on nearby pixels. Therefore this mapping is convex. As the composition of a function that is non-decreasing w.r.t. each dimension and strictly convex with a convex function, is strictly convex, which completes the proof.

a.5 ST -means: Weighted Voronoi Diagram


Let’s start by re-writing the similarity measure so as to analytically express the metric tensor that acts as the weight in the weighted Voronoi diagram the ST -means defines. Using App. E, we can re-write , where is bilinear in the coordinates induced by the TPS. In this formulation we can observe that defines the displacement vector w.r.t. the original uniform grid of landmarks. That is, if is the null vector, then . Now, we assume that this linear operator is invertible, i.e., the TPS transformation is invertible (note that this is not always the case (johnson2001landmark)). Then, we can re-write as , where , and defines the metric tensor; the notation indicates that the displacement vector depends on the centroid and the datum . Also, note that while defines the transformation operator mapping onto , is the inverse operator mapping the centroid to the datum .

Now the tuple of cells , defined such that , defines a weighted Voronoi diagram (letscher2007vector; inaba1994applications), where we observe that the metric tensor is dependent on all the spatial transformations. ∎

Appendix B Implementation Details

Note that the deformation invariant similarity measure we introduced in Eq. 3 differs from the affine invariant distances developed in (fitzgibbon2002affine; simard2012transformation; lim2004image). All previously defined error measures rely on the assumption that the manifold can be locally linearized, so that the tangent space is used as a proxy to learn the optimal affine transformation. However, the work of (wakin2005multiscale) suggests that tangent planes fitted to image manifolds continually twist off into new dimensions as the parameters of the affine transformations vary, due to a possible intrinsic multiscale structure of the manifold. As such, aligning two images by simply linearizing the manifold can fail. To address this, (wakin2005multiscale) propose to consider the multiscale structure of the manifold; we simplify their approach by applying a low-pass filter to the images and the centroid prior to learning the affine transformation that best aligns them. Then, we optimize the remaining part of the TPS to account for diffeomorphic transformations. These two steps are similar to the ones used in (jaderberg2015spatial).
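A minimal sketch of the first of these two steps, smoothing both images before evaluating the affine-alignment loss. The box filter is an assumption (the paper does not specify the low-pass filter), and `warp_fn` is a hypothetical placeholder for the affine warp.

```python
import numpy as np

def low_pass(image, k=3):
    """Simple box low-pass filter; an illustrative choice of smoothing
    kernel, not necessarily the one used by the authors."""
    pad = k // 2
    padded = np.pad(image, pad, mode="edge")
    out = np.zeros_like(image, dtype=float)
    for i in range(image.shape[0]):
        for j in range(image.shape[1]):
            out[i, j] = padded[i:i + k, j:j + k].mean()
    return out

def coarse_align_loss(image, centroid, affine_params, warp_fn):
    """Coarse alignment sketch: low-pass both images, then measure the
    squared error between the warped image and the centroid."""
    return float(np.sum((warp_fn(low_pass(image), affine_params)
                         - low_pass(centroid)) ** 2))

rng = np.random.default_rng(0)
x, c = rng.random((8, 8)), rng.random((8, 8))
loss = coarse_align_loss(x, c, None, lambda im, p: im)  # identity warp
print(loss >= 0.0)
```

The affine parameters would then be optimized against this smoothed loss, and the remaining TPS parameters against the unsmoothed one.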

Appendix C Alternative Methods

We compare our model with well-known clustering techniques using deep neural networks. We performed experiments for VaDE (jiang2016variational) and DEC (xie2016unsupervised) using the code made publicly available by the authors. We use the annotation (MLP) as a reference to the MLP architecture used in their experiments (see Appendix D for details). To fairly compare our model to the DEC and VaDE models, we propose a convolutional architecture for the DEC and VaDE networks, denoted by DEC (Conv) and VaDE (Conv) (see Appendix D for details). Finally, we evaluate the performance of an augmented -means algorithm trained using the features extracted by an autoencoder, denoted by AE -means in the following.

The parameters of the different models mentioned above are learned by stochastic gradient descent (Adam optimizer (kingma2014adam)). In all the experiments, the learning rates are cross-validated following the approach in (xie2016unsupervised) according to . The internal parameters that are model-dependent, e.g., the number of pre-training epochs and the update intervals, are also cross-validated.

We also compare our ST -means to the closely related -means and affine invariant -means, denoted by AI -means. For each run, all three -means algorithms start from the same initial centroids, obtained using the -means algorithm developed by arthur2006k to speed up the convergence of the -means algorithm.
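For reference, the seeding procedure of arthur2006k can be sketched in a few lines of NumPy (a minimal version, not the authors' implementation): the first centroid is drawn uniformly, and each subsequent one with probability proportional to its squared distance to the nearest centroid chosen so far.

```python
import numpy as np

def kmeans_pp_init(data, k, rng):
    """k-means++ seeding: D^2-weighted sampling of initial centroids."""
    centroids = [data[rng.integers(len(data))]]
    for _ in range(k - 1):
        # Squared distance of every point to its nearest chosen centroid.
        d2 = np.min([np.sum((data - c) ** 2, axis=1) for c in centroids], axis=0)
        probs = d2 / d2.sum()
        centroids.append(data[rng.choice(len(data), p=probs)])
    return np.stack(centroids)

rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(0, 0.1, (50, 2)), rng.normal(5, 0.1, (50, 2))])
init = kmeans_pp_init(data, 2, rng)
print(init.shape)  # (2, 2)
```

Sharing these seeds across the three -means variants removes initialization as a confounding factor in the comparison.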

Appendix D Neural Network Architectures

For both architectures, the decoder architecture is symmetric to the encoder and the batch size is set to .


The MLP architecture from the input data to the bottleneck hidden layer is composed of fully connected ReLU layers with dimensions .

The CONV architecture is composed of -convolutional ReLU layers with filters of size , and fully connected ReLU layers with dimensions . For each layer, batch normalization is applied.

Appendix E Thin-Plate-Spline Interpolation

Let’s consider two sets of landmarks: the source ones and the transformed ones , where denotes the number of landmarks. The TPS aims at finding a mapping such that , that is, a mapping between the two sets of landmarks. The particularity of the TPS is that it learns such a mapping by minimizing an interpolation term together with a regularization term that penalizes the bending energy.

The TPS optimization problem is defined by


In our model, the source landmarks are considered to be the coordinates of a uniform grid. Also note that both the source landmarks and the transformed ones are usually a subset of the set of coordinates of the images. For instance, for the MNIST dataset of size , the landmarks would be a grid of size , where . While the mapping is based on the landmarks, it is then applied to the entire set of image coordinates. In fact, is the mapping , where (resp. ) corresponds to the mapping from to the first dimension (resp. the second dimension ).

The solution of the TPS optimization problem, Eq. 10, provides the following analytical formula for


where is the -norm, are the parameters governing the affine transformation, and are the parameters responsible for the non-rigid transformations, as they act as weights of the non-linear kernel . The non-linear kernel is expressed by .

Based on the landmarks and , we can obtain these parameters by solving a simple system of equations defined by the following operations


where the matrix is defined as

where , , and

Note that, since the matrix depends only on the source landmarks, and in our case these are unchanged, its inverse can be computed only once. The only operation required for each datum and each centroid is the matrix multiplication providing the parameters of the TPS transformation, as per Eqs. 11, 12. Given these parameters, the mapping can be applied to each coordinate of the image.
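A minimal NumPy sketch of this parameter estimation, assuming the standard TPS kernel U(r) = r^2 log r and the affine basis (1, x, y); for an identity mapping on a uniform grid, the non-rigid weights vanish and only the affine part remains. Names and shapes here are illustrative, not the paper's notation.

```python
import numpy as np

def tps_params(source, target):
    """Solve for TPS parameters mapping `source` landmarks (n, 2) to one
    target coordinate vector `target` (n,). Builds the kernel matrix from
    the source landmarks and solves the linear system for the non-rigid
    weights and the affine part."""
    n = len(source)
    d = np.linalg.norm(source[:, None, :] - source[None, :, :], axis=-1)
    with np.errstate(divide="ignore", invalid="ignore"):
        K = np.where(d > 0, d ** 2 * np.log(d), 0.0)   # U(r) = r^2 log r
    P = np.hstack([np.ones((n, 1)), source])           # affine basis (1, x, y)
    L = np.block([[K, P], [P.T, np.zeros((3, 3))]])
    rhs = np.concatenate([target, np.zeros(3)])
    w = np.linalg.solve(L, rhs)
    return w[:n], w[n:]                                # non-rigid weights, affine

# Identity mapping on a 3x3 uniform grid: the non-rigid weights vanish.
g = np.stack(np.meshgrid(np.arange(3.0), np.arange(3.0)), -1).reshape(-1, 2)
weights, affine = tps_params(g, g[:, 0])
print(np.allclose(weights, 0.0, atol=1e-6), np.round(affine, 6))
```

Since the left-hand matrix depends only on the source grid, in practice one factorizes it once and reuses it for every datum-centroid pair, as noted above.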

Now, in order to render the image, one can perform bilinear interpolation. Besides, the bilinear interpolation allows the propagation of the gradient through any differentiable loss function.

Given an image , where and denote the width and the height of the image, and two sets of landmarks, the uniform grid coordinates of and , the transformation of the uniform grid (which are a subset of the image coordinates), we are able to learn a mapping such that, for each original pixel coordinate, we have its transformed coordinates. In fact, given any position on the original image, the mapping provides the new position as per Eq. 11 and Eq. 12.

Now, from this transformed coordinate space, we can render an image using, as in (jaderberg2015spatial), the bilinear interpolation function, which takes as input the original image and the transformed pixel coordinates , and outputs the pixel value of the transformed image at a given pixel coordinate

where is the Kronecker delta function and is the floor function rounding the real coordinate to the closest pixel coordinate.
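A minimal sketch of such a bilinear renderer (the standard formulation, not the authors' code): for each transformed coordinate, the output pixel is a convex combination of its four neighbors in the original image.

```python
import numpy as np

def bilinear_sample(image, coords):
    """Render a warped image: for each transformed coordinate (row, col),
    interpolate the four neighboring pixels of `image` (H, W). `coords`
    has shape (N, 2) in continuous pixel units."""
    H, W = image.shape
    r, c = coords[:, 0], coords[:, 1]
    r0 = np.clip(np.floor(r).astype(int), 0, H - 2)   # top-left corner
    c0 = np.clip(np.floor(c).astype(int), 0, W - 2)
    dr, dc = r - r0, c - c0                           # fractional offsets
    top = (1 - dc) * image[r0, c0] + dc * image[r0, c0 + 1]
    bot = (1 - dc) * image[r0 + 1, c0] + dc * image[r0 + 1, c0 + 1]
    return (1 - dr) * top + dr * bot

img = np.array([[0.0, 1.0], [2.0, 3.0]])
val = bilinear_sample(img, np.array([[0.5, 0.5]]))
print(val)  # [1.5]
```

Because the interpolation weights are (piecewise) differentiable in the coordinates, gradients can flow from the loss back to the TPS parameters, which is what makes the end-to-end optimization possible.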

Appendix F Datasets

MNIST (deng2012mnist): is a handwritten digit dataset containing training and test images of dimension representing classes.
Rigid MNIST: we randomly sample one instance of each MNIST class and generate random affine transformations for each sample. A third of the data are used for testing.
Non-rigid MNIST: we randomly sample one instance of each MNIST class and generate random affine transformations for each sample as well as random transformations using the TPS method. A third of the data are used for testing.
Audio MNIST (becker2018interpreting): is composed of recordings of spoken digits by different speakers, sampled at , of sec long. It consists of classes. We use data for training and for testing. This dataset is transformed into a time-frequency representation. These representations, which can be considered as images, are commonly used as the representation of audio recordings (cosentino2020).
E-MNIST (cohen2017emnist): is a handwritten letters dataset merging a balanced set of the uppercase and lowercase letters into a single dataset of classes, of dimension .
Rock-Paper-Scissors (rps): images of hands playing the rock-paper-scissors game, that is, a dataset of classes. The dimension of each image is . The training set is composed of data and the testing set of .
Face-10 (gourier2004estimating): images of the faces of 15 people, wearing glasses or not and having various skin colors. For each individual, different samples are obtained with pose orientations varying from degrees to degrees. The dimension of each image is . The training set is composed of data and the testing set of data.
Arabic Char (altwaijry2020arabic): Handwritten Arabic characters written by participants. The dataset is composed of images in the training set and in the test set. The dimension of each image is .

Appendix G Supplementary Visualization

g.1 Additional t-SNE Visualizations

Figure 6: -dimensional t-SNE Visualization - The raw data (left column); the affinely transformed data, i.e., we extract the transformation of each datum corresponding to the centroid it was assigned to and perform the t-SNE on these affinely transformed data (middle column); and the data transformed with respect to non-rigid transformations as per Eq. 3, i.e., the same process as previously mentioned but considering the transformation induced by the TPS, followed by the dimension reduction on these transformed data (right column). Each row corresponds to a different dataset: E-MNIST and Arabic Characters, depicted from top to bottom. For all the figures, the colors of the data represent their ground-truth labels.

g.2 Additional Centroid Visualizations

Figure 7: Additional - MNIST Centroids Visualization (dim 28x28) - Depiction of the initialization of the per-cluster centroids top row and the final per-cluster centroids of the K-means, Affine invariant K-means, and ST K-means (proposed) methods.
Figure 8: Additional - out of E-MNIST Centroids Visualization (dim 28x28) - Depiction of the initialization of the per-cluster centroids top row and the final per-cluster centroids of the K-means, AI K-means, and ST K-means (proposed) methods.
Figure 9: Additional - Audio MNIST Centroids Visualization (dim 64x24) - Depiction of the initialization of the per-cluster centroids top row and the final per-cluster centroids of the K-means, AI K-means, and ST K-means (proposed) methods.
Figure 10: Additional - out of Arabic Characters Centroids Visualization (dim 32x32) - Depiction of the initialization of the per-cluster centroids top row and the final per-cluster centroids of the K-means, AI K-means, and ST K-means (proposed) methods.
Figure 11: Additional - out of Face Position Centroids Visualization (dim 288x384) - Depiction of the initialization of the per-cluster centroids top row and the final per-cluster centroids of the K-means, AI K-means, and ST K-means (proposed) methods.