Hierarchical Optimal Transport for Multimodal Distribution Alignment

06/27/2019 ∙ by John Lee et al. ∙ Georgia Institute of Technology

In many machine learning applications, it is necessary to meaningfully aggregate, through alignment, different but related datasets. Optimal transport (OT)-based approaches pose alignment as a divergence minimization problem: the aim is to transform a source dataset to match a target dataset using the Wasserstein distance as a divergence measure. We introduce a hierarchical formulation of OT that leverages clustered structure in data to improve alignment in noisy, ambiguous, or multimodal settings. To solve this numerically, we propose a distributed ADMM algorithm that also exploits the Sinkhorn distance, and thus has an efficient computational complexity that scales quadratically with the size of the largest cluster. When the transformation between two datasets is unitary, we provide performance guarantees that describe when and how well cluster correspondences can be recovered with our formulation, and characterize the worst-case dataset geometry for such a strategy. We apply this method to synthetic datasets that model data as mixtures of low-rank Gaussians and study the impact that different geometric properties of the data have on alignment. Next, we apply our approach to a neural decoding application where the goal is to predict movement directions and instantaneous velocities from populations of neurons in the macaque primary motor cortex. Our results demonstrate that when clustered structure exists in datasets, and is consistent across trials or time points, a hierarchical alignment strategy that leverages such structure can provide significant improvements in cross-domain alignment.







1 Introduction

In many machine learning applications, it is necessary to meaningfully aggregate, through alignment, different but related datasets (e.g., data across time points or under different conditions or contexts). Alignment is an important problem at the heart of transfer learning pan2009survey ; weiss2016survey , point set registration chui2003new ; myronenko2010point ; tam2013registration , and shape analysis bronstein2006generalized ; bronstein2011shape ; ovsjanikov2012functional , but is generally NP-hard. In recent years, distribution alignment methods that use optimal transport (OT) have been shown to provide state-of-the-art transfer in domain adaptation tasks courty2017optimal ; courty2017joint . Distribution alignment-based approaches cast alignment as an optimization problem that aims to match two distributions. However, when the source and target do not align exactly (e.g., due to noise or undersampling) or have complicated multimodal structure, algorithms suffer from poor local minima. Thus, leveraging additional structure in the problem is necessary to regularize OT and constrain the solution space.

Here, we leverage the fact that heterogeneous datasets often admit clustered or multi-subspace structure to improve distribution alignment. Our solution to this problem is to simultaneously estimate the cluster alignment across two datasets using their local geometry, while also solving a global alignment problem to meld these local estimates. We introduce a hierarchical formulation of OT for clustered and multi-subspace datasets called Hierarchical Wasserstein Alignment (HiWA). We empirically show that when data can be well approximated with Gaussian mixture models (GMMs) or lie on a union of subspaces, we may leverage existing clustering pipelines (e.g., sparse subspace clustering elhamifar2013sparse ) to improve alignment. When the transformation between datasets is unitary, we provide analyses that reveal key geometric and sampling insights. To solve the problem numerically, we propose a distributed ADMM algorithm that exploits the Sinkhorn distance, and thus has computational complexity that scales quadratically with the size of the largest cluster.

To test and benchmark our approach, we applied it to synthetic data generated from mixtures of low-rank Gaussians and studied the impact of different geometric properties of the data on alignment, confirming the predictions of our theoretical analysis. Next, we applied our approach to a neural decoding application where the goal is to predict movement directions from populations of neurons in the macaque primary motor cortex. Our results demonstrate that when clustered structure exists in neural datasets and is consistent across trials or time points, a hierarchical alignment strategy that leverages such structure can provide significant improvements in unsupervised decoding from ambiguous (symmetric) movement patterns. This suggests the application of OT to a wider range of neural datasets and shows that a hierarchical strategy can be used to avoid local minima encountered in a global alignment strategy that does not use cluster structure in the data.

2 Background and related work

Transfer learning and distribution alignment. A fundamental goal in transfer learning is to aggregate related datasets by learning an alignment between them. We wish to learn a transformation $T \in \mathcal{T}$, where $\mathcal{T}$ refers to some class of transformations, that aligns distributions under a notion of probability divergence $D$ between a target distribution $\mu$ and a reference (source) distribution $\nu$:

$$\min_{T \in \mathcal{T}} \; D(T_{\#}\mu, \nu).$$

Various probability divergences have been proposed in the literature, such as Euclidean least-squares (when data ordering is known) shi2010transfer ; shekhar2013generalized ; han2012sparse , Kullback-Leibler (KL) divergence sugiyama2008direct , maximum mean discrepancy (MMD) pan2011domain ; baktashmotlagh2013unsupervised ; long2014transfer ; gong2016domain , and the Wasserstein distance courty2017optimal , where trade-offs are often statistical (e.g., consistency, sample complexity) versus computational. Alignment problems are ill-posed since the space of transformations $\mathcal{T}$ is large, so a priori structure is often necessary to constrain $\mathcal{T}$ based on geometric assumptions. Compact manifolds like the Grassmann or Stiefel manifolds gopalan2011domain ; gong2012geodesic are primary choices when little information is present, as they preserve isometry. Non-isometric transformations, though richer, demand much more structure (e.g., manifold or graph structure) wang2008manifold ; wang2009general ; ferradans2014regularized ; cui2014generalized ; courty2017optimal .

Low-rank and union of subspaces models. Principal components analysis (PCA), one of the most popular methods in data science, assumes a low-rank model where the top-$k$ principal components of a dataset provide the optimal rank-$k$ approximation under a Euclidean loss. This has been extended to robust (sparse errors) settings elhamifar2013sparse , and multi- (union of) subspaces settings where data can be partitioned into disjoint subsets such that each subset of data is locally low-rank eldar2009robust . Transfer learning methods based on subspace alignment fernando2013unsupervised ; sun2015subspace ; sun2016return work well with zero-mean unimodal datasets, but struggle on more complicated modalities (e.g., Gaussian mixtures or unions of subspaces) due to a mixing of covariances. Related to our work, thopalli2018multiple performs multi-subspace alignment by greedily assigning correspondences between subspaces using chordal distances; this, however, neglects sign ambiguities in principal directions, since subspaces inadequately describe a distribution’s shape.

Optimal transport. Optimal transport (OT) kantorovich2006problem is a natural type of divergence for registration problems because it accounts for the underlying geometry of the space. In Euclidean settings, OT is a metric known as the Wasserstein distance, which measures the minimum effort required to “displace” points across measures $\mu$ and $\nu$ (understood here as empirical point clouds). Therefore, OT by design relieves the need for kernel estimation to create an overlapping support of the measures. Despite this attractive property, it has both a poor numerical complexity of $O(n^3 \log n)$ (where $n$ is the sample size) and a dimension-dependent sample complexity of $O(n^{-1/d})$, where $d$ is the data dimension dudley1969speed ; weed2017sharp . Recently, an entropically regularized version of OT known as the Sinkhorn distance cuturi2013sinkhorn has emerged as a compelling divergence measure; it not only inherits OT’s geometric properties but also has superior computational and sample complexities of $O(n^2)$ and $O(n^{-1/2})$ (dependent on a regularization parameter genevay2018sample ), respectively. It has also become a versatile building block in domain adaptation courty2017optimal ; courty2017joint . Prior art courty2017optimal has largely exploited OT’s push-forward as the alignment map, since this map minimizes the OT cost between the source and target distributions while allowing a priori structure to be easily incorporated (e.g., to preserve label/graphical integrity). Such an approach, however, is fundamentally expensive when $n$ is large, since the primary optimization variable is a large transport coupling (i.e., an $n \times n$ matrix), while in reality the alignment mapping is of much smaller dimension. Moreover, it assumes that the source and target distributions are close in terms of their squared Euclidean distance, but this does not generally hold in the alignment of arbitrary latent spaces.

3 Hierarchical Wasserstein alignment

Preliminaries and notation. Consider clustered datasets $X$ and $Y$ whose clusters are denoted with the indices $i$ and $j$, and whose columns are treated as embedding coordinates. Let $n_i$ ($m_j$) denote the number of samples in the $i$-th ($j$-th) cluster of dataset $X$ (dataset $Y$). We respectively express the empirical measures of clusters $X_i$ and $Y_j$ as $\mu_i = \frac{1}{n_i}\sum_k \delta_{x_k}$ and $\nu_j = \frac{1}{m_j}\sum_l \delta_{y_l}$, where $\delta_x$ refers to a point mass located at coordinate $x$. The squared 2-Wasserstein distance between $\mu_i$ and $\nu_j$ is defined as

$$W_2^2(\mu_i, \nu_j) = \min_{P \in \Pi(\frac{1}{n_i}\mathbf{1}_{n_i}, \frac{1}{m_j}\mathbf{1}_{m_j})} \sum_{k,l} P_{kl} \, \|x_k - y_l\|_2^2, \tag{1}$$

where $P$ is a (scaled) doubly stochastic matrix that encodes point-wise correspondences (i.e., the $(k,l)$-th entry describes the flow of mass between $x_k$ and $y_l$), $x_k$ is the $k$-th column of matrix $X_i$, the constraint set $\Pi(\frac{1}{n_i}\mathbf{1}_{n_i}, \frac{1}{m_j}\mathbf{1}_{m_j})$ refers to the uniform transport polytope, and $\mathbf{1}_n$ is a length-$n$ vector containing ones.
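For equal-size clusters with uniform weights, the optimal coupling in (1) is a permutation, so the squared 2-Wasserstein distance can be computed exactly with a linear assignment solver. A minimal sketch (illustrative, not the paper's code), using `scipy`:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def w2_squared(X, Y):
    """Squared 2-Wasserstein distance between two equal-size, uniformly
    weighted point clouds X, Y (columns are points). With n = m and
    uniform weights, the optimal coupling is a permutation, so optimal
    transport reduces to linear assignment."""
    # pairwise squared Euclidean costs C[k, l] = ||x_k - y_l||^2
    C = ((X[:, :, None] - Y[:, None, :]) ** 2).sum(axis=0)
    rows, cols = linear_sum_assignment(C)
    return C[rows, cols].sum() / X.shape[1]
```

Matching a cloud to a permuted copy of itself yields zero distance, while translating every point by a vector s yields exactly ||s||², since the squared 2-Wasserstein distance between a measure and its translate is the squared shift.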

Overview. Although unsupervised alignment is challenging due to the presence of local minima, the imposition of additional structure helps to prune them away. Our key insight is that hierarchical structure decomposes a complicated optimization surface into simpler ones that are less prone to local minima. We formulate a hierarchical Wasserstein approach to align datasets with known (or estimated) clusters but whose correspondences are unknown. The task therefore is to jointly learn the alignment $T$ and the cluster-correspondences $S$:

$$\min_{T \in \mathcal{T}} \; \min_{S \in \Pi_k} \; \sum_{i,j} S_{ij} \, W_2^2(T_{\#}\mu_i, \nu_j), \tag{2}$$

where the matrix $S$ encodes the strength of correspondences between clusters, with a large value $S_{ij}$ indicating a correspondence between clusters $(i,j)$, and a small value indicating a lack thereof. We note that $\Pi_k$ is a special type of transport polytope known as the $k$-th Birkhoff polytope. Interestingly, this becomes a nested (or block) OT formulation, where correspondences are resolved at two levels: the outer level resolves cluster-correspondences (via $S$) while the inner level resolves point-wise correspondences between cluster points (via the Wasserstein distance).
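To make the nested structure concrete, here is a hypothetical sketch in which both levels are solved at a vertex of their respective polytopes (i.e., as permutations): the outer level matches clusters using a cost matrix of pairwise Wasserstein distances, each of which is itself an inner assignment. This is a simplification of the relaxed problem the paper actually solves.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def w2_squared(X, Y):
    # inner level: exact squared 2-Wasserstein between equal-size,
    # uniformly weighted point clouds (columns are points)
    C = ((X[:, :, None] - Y[:, None, :]) ** 2).sum(axis=0)
    r, c = linear_sum_assignment(C)
    return C[r, c].sum() / X.shape[1]

def cluster_correspondences(X_clusters, Y_clusters):
    """Outer level: assignment over the cluster-level cost matrix of
    pairwise W2^2 distances, i.e., a vertex of the Birkhoff polytope."""
    D = np.array([[w2_squared(Xi, Yj) for Yj in Y_clusters]
                  for Xi in X_clusters])
    _, cols = linear_sum_assignment(D)
    return cols  # cols[i] = index of the Y-cluster matched to X-cluster i
```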

Alignment over the Stiefel manifold. Assuming clusters lie on subspaces and principal angles between subspaces are “well preserved” across $X$ and $Y$ (we make this precise in Theorem 4.2), an isometric transformation suffices. Hence, we solve (2) with $\mathcal{T}$ taken to be the Stiefel manifold, defined as $\{T \in \mathbb{R}^{d \times d} : T^\top T = I_d\}$, where $I_d$ refers to the identity matrix. Explicitly, we have

$$\min_{T^\top T = I_d} \; \min_{S \in \Pi_k} \; \sum_{i,j} S_{ij} \, \mathcal{W}_{ij}(T). \tag{3}$$

Here, $\mathcal{W}_{ij}(T)$ measures pairwise cluster divergences using the squared 2-Wasserstein distance under a Stiefel transformation acting on cluster $X_i$, i.e.,

$$\mathcal{W}_{ij}(T) = \min_{P_{ij} \in \Pi} \sum_{k,l} (P_{ij})_{kl} \, \|T x_k - y_l\|_2^2. \tag{4}$$
Finally, we include entropic regularization over the transportation couplings $S$ and all $P_{ij}$’s to modify the Wasserstein distances to Sinkhorn distances, so as to take advantage of their superior computational and sample complexities. Omitting constraints for brevity, our final problem is given as

$$\min_{T, S, \{P_{ij}\}} \; \sum_{i,j} S_{ij} \Big( \sum_{k,l} (P_{ij})_{kl} \, \|T x_k - y_l\|_2^2 + \tfrac{1}{\lambda_2} H(P_{ij}) \Big) + \tfrac{1}{\lambda_1} H(S), \tag{5}$$

where $\lambda_1, \lambda_2 > 0$ are the entropic regularization parameters and the negative entropy function is defined as $H(P) = \sum_{k,l} P_{kl} (\log P_{kl} - 1)$. Parameters $\lambda_1, \lambda_2$ control the correspondence entropy; (5) therefore approximates (3) for finite regularization, but reverts to the original problem (3) as $\lambda_1, \lambda_2 \to \infty$.
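A minimal numpy sketch of the Sinkhorn iterations used for such entropically regularized inner problems, assuming uniform marginals and writing the regularization strength as eps (one common convention is eps = 1/lambda); illustrative, not the paper's implementation:

```python
import numpy as np

def sinkhorn_plan(C, eps=0.2, n_iter=1000):
    """Entropically regularized OT plan for cost matrix C with uniform
    marginals, computed via Sinkhorn's matrix-scaling iterations."""
    n, m = C.shape
    a, b = np.full(n, 1.0 / n), np.full(m, 1.0 / m)  # uniform marginals
    K = np.exp(-C / eps)            # Gibbs kernel
    u, v = np.ones(n), np.ones(m)
    for _ in range(n_iter):
        u = a / (K @ v)             # rescale to match row marginals
        v = b / (K.T @ u)           # rescale to match column marginals
    return u[:, None] * K * v[None, :]
```

At convergence the returned plan has the prescribed row and column marginals; smaller eps sharpens the plan toward the unregularized optimum at the cost of slower, less stable iterations.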

Distributed ADMM approach. Problem (5) is non-convex due to multilinearity in the objective and its Stiefel manifold domain. It has recently been shown that the alternating direction method of multipliers (ADMM) eckstein1992douglas ; boyd2011distributed can be globally convergent even in non-convex settings wang2019global . Furthermore, since (5) readily admits a splitting structure that separates the individual blocks, we develop a distributed ADMM solver. We proceed to split (5) by introducing local copies $T_{ij}$ of the transformation for each cluster pair, subject to consensus constraints $T_{ij} = T$, noting that the set constraints are omitted for brevity. Its augmented Lagrangian involves the ADMM parameter $\rho$ and Lagrange multipliers for the consensus constraints. Full details of the update steps are included in the supplementary material. The algorithm may be summarized in two steps: (i) a distributed step that asks all cluster pairs to individually find their optimal transformations $T_{ij}$ in parallel, and (ii) a consensus step that aggregates all the found transformations according to a weighting that is proportional to the correspondence strengths $S_{ij}$. Algorithm 1 summarizes our method.

1:procedure HierarchicalWassersteinAlignment(X, Y)
2:     T ← random initialization
3:     while not converged do
4:         for all cluster pairs (i, j) in parallel do
5:             ▷ local step (full update equations in the supplementary material)
6:              while not converged do
7:                  P_ij ← Sinkhorn(T_ij X_i, Y_j)
8:                  T_ij ← StiefelAlignment(X_i P_ij^T, Y_j)
9:              end while
10:         end for
11:        ▷ consensus step
12:        T ← aggregation of the T_ij, weighted by correspondence strengths S_ij
13:        S ← Sinkhorn over the cluster-level costs
14:     end while
15:end procedure
1:procedure Sinkhorn(A, B)
2:     K ← exp(−λ C(A, B)); u ← 1; v ← 1
3:     while not converged do
4:         u ← a ⊘ (K v)
5:         v ← b ⊘ (Kᵀ u)
6:     end while
7:     P ← diag(u) K diag(v)
8:end procedure
1:procedure StiefelAlignment(A, B)
2:     U Σ Vᵀ ← svd(B Aᵀ)
3:     T ← U Vᵀ
4:end procedure

⊘ : elementwise division
exp(·) : elementwise exponential
diag(·) : diagonal matrix of argument

Algorithm 1 Hierarchical Wasserstein Alignment (HiWA) Algorithm
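Given fixed correspondences, the Stiefel-restricted alignment step is a classical orthogonal Procrustes problem with a closed-form SVD solution. A plausible numpy sketch (the paper's exact update, which also involves ADMM terms, is in its supplementary material):

```python
import numpy as np

def stiefel_alignment(A, B):
    """Orthogonal T minimizing ||T A - B||_F (orthogonal Procrustes).
    With U S V^T the SVD of B A^T, the minimizer is T = U V^T."""
    U, _, Vt = np.linalg.svd(B @ A.T)
    return U @ Vt
```

When B is an exact rotation of A (full-rank data), the true rotation is recovered exactly.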

Parameters. The entropic parameters $\lambda_1, \lambda_2$ relax the one-to-one cluster correspondence assumption, balancing a trade-off between alignment precision (weak regularization) and sample complexity (strong regularization). Numerically, negative entropy adds strong convexity to the program, reducing sensitivity towards perturbations at the cost of a slower convergence rate. The ADMM parameter $\rho$ controls the ‘strength’ of the consensus, or from an algorithmic viewpoint, the gradient step size.

Distributed consensus. Update steps for $P_{ij}$ and $T_{ij}$ can be performed in parallel over all cluster pairs (i.e., $k^2$ in total), making the algorithm amenable to a distributed implementation. When fully parallelized, the algorithm has a per-iteration computational complexity of $O(\bar{n}\bar{m})$, where $\bar{n}$ and $\bar{m}$ refer to the number of points in the largest clusters of $X$ and $Y$ respectively (compared to vanilla Sinkhorn’s $O(nm)$ complexity, where $n$ and $m$ refer to the total number of points in the respective datasets).

Stopping criteria. In lines 3 and 6 of Algorithm 1, possible stopping criteria are (i) $\|T^{(t)} - T^{(t-1)}\|_F \le \epsilon$, where the difference is between the current and previous iteration’s transformation and $\epsilon$ is the tolerance, and (ii) $t \ge t_{\max}$, where $t_{\max}$ is the maximum number of iterations.

Robustness against initial conditions. We build in robustness against initial conditions by ordering the updates for $P_{ij}$ and $T_{ij}$ before the consensus update of $T$, such that when the ADMM parameter $\rho$ is sufficiently small, the ADMM sequence is influenced more by the data than by the initial conditions.

4 Theoretical guarantees for cluster-based alignment

While the previous section explains how to align clustered datasets, in this section we aim to answer the question of when and how well they can be aligned. We provide necessary conditions for cluster-based alignability as well as alignment perturbation bounds for problem (3)’s formulation. To simplify our analysis, we make the following assumptions: (i) each of the $k$ clusters contains the same number of datapoints $n$, and (ii) the ground-truth cluster correspondences are the identity (i.e., $S^\star$ is diagonal containing ones). Detailed proofs are given in the supplementary material.

The following result is a criterion that, if met, ensures the existence of the cluster-correspondence global minimum $S^\star$. This criterion requires that matched clusters must be closer in Wasserstein distance than mismatched clusters, by a threshold determined by the Wasserstein distance’s sample complexity (i.e., an asymptotic rate dependent on the clusters’ sample sizes and intrinsic dimensions). Since these sample complexity results are based on the Wasserstein distance, we expect a less stringent criterion when using the Sinkhorn distance in (5) (due to its superior sample complexity genevay2018sample ).

Theorem 4.1 (Correspondence disambiguity criterion).

Let all clusters be strictly low-rank. Problem (3) yields the solution $S^\star$ with high probability if, for every cluster pair, the following criterion is satisfied: matched clusters are closer in squared 2-Wasserstein distance than mismatched clusters by a margin that accounts for the clusters’ sample sizes and intrinsic dimensions.

Proof sketch.

The proof contains two parts. In the first part, we consider perturbation conditions of the cost matrix in a (non-variational) optimal transport program over the Birkhoff polytope; to remain unperturbed from $S^\star$, the costs of matched cluster pairs must stay below those of mismatched pairs. In the second part, we extend this condition to the finite-sample regime by utilizing recently developed concentration bounds weed2017sharp for the 2-Wasserstein distance, which essentially raises the disambiguity lower bound due to finite-sample uncertainty. ∎

Now, even if we have the global correspondence solution $S^\star$, we still do not have the full picture of the alignment’s quality. For example, all matching clusters may have very similar covariances, yet the principal angles between clusters may be “distorted” across the datasets. Our next theorem gives an upper bound on the alignment error (for unitary transformations) and makes precise this notion of global structure distortion.

Theorem 4.2 (Cluster-based alignment perturbation bounds).

Consider data matrices with known point-wise correspondence matrices . Define matrices

Set . If the criterion stated in Theorem 4.1 is satisfied, is full row rank, and where is the operator norm and is the pseudo-inverse of , then

where is a data-dependent constant.

Proof sketch.

We utilize a recent perturbation result on the Procrustes problem (on a Frobenius norm objective) by Arias-Castro et al. arias2018perturbation and adapt it to our squared 2-Wasserstein objective. ∎

We point out that this quantity plays a major role in the alignment error bound and quantifies the notion of global structure distortion. It therefore allows us to understand how phenomena like covariate shift or misclustering impact alignment. To shed some light in this regard, we consider a simple analysis of a cluster-pair’s error contribution. Consider the decomposition of the $(i,j)$-th block of the Gram matrices related to clusters $X_i$ and $Y_j$, together with their respective singular value decompositions. Defining the blockwise error between clusters accordingly, two components stand out: (i) angular shift, which is characterized by differences in principal angles between $X_i$ and $Y_j$, and (ii) spectral shift, which is characterized by differences in spectra.

Finally, we show that the subspace configuration of a dataset’s clusters can also affect alignment. Suppose for a moment that external alignment information were available to aid in the disambiguation between two clusters. The following lemma tells us when such information is useless.

Lemma 4.3 (Uninformative alignment).

Consider clusters and known point-wise correspondences . Denote the left and right singular vectors of associated with the non-zero singular values as with . Define the set of orthogonal transformations that are constrained to agree with known angular directions as

where with . Given with , we have


with equality holding when .

Direct consequences of this lemma are the following: when a dataset has equally-spaced subspaces, it has a maximally uninformative geometric configuration, since angular information from other clusters can never increase the inter-cluster distance (i.e., equality in (6) always holds); it is hence a worst-case scenario for alignment. This also explains why alignment in very high-dimensional spaces is harder: all subspaces may be orthogonal to each other, and hence offer no “geometric” advantage in the joint alignment effort.

5 Numerical experiments

5.1 Synthetic low-rank Gaussian mixture dataset

In this section, we validate our method and demonstrate its limiting characteristics under symmetric-subspace and finite-sample regimes. To generate our synthetic data, we repeat the following procedure for each of the $k$ clusters. We first randomly generate Gaussian distribution parameters (mean and positive semi-definite covariance), then randomly sample data-points from these parameters, and finally project them into a random subspace in a $d$-dimensional embedding. We assume that the respective clusters are known, but the cluster-correspondences between datasets are not. We measure performance with two metrics: (i) alignment error, defined as the relative difference between the recovered versus true rotation acting on the data, and (ii) correspondence error, defined as the sum of absolute differences between the recovered and the true correspondences.
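As a sketch of this synthetic setup (the sizes and symbol names below are illustrative assumptions, not the paper's exact parameters):

```python
import numpy as np

rng = np.random.default_rng(0)
d, p, n, k = 10, 2, 200, 4   # embedding dim, cluster dim, samples/cluster, clusters

def low_rank_cluster(rng, d, p, n):
    """One cluster: n samples from a p-dimensional Gaussian, embedded
    into d dimensions via a random orthonormal subspace basis."""
    basis, _ = np.linalg.qr(rng.normal(size=(d, p)))     # random subspace
    cov_root = rng.normal(size=(p, p))                   # random PSD covariance root
    return basis @ (cov_root @ rng.normal(size=(p, n)))

X = np.hstack([low_rank_cluster(rng, d, p, n) for _ in range(k)])
R, _ = np.linalg.qr(rng.normal(size=(d, d)))             # ground-truth rotation
Y = R @ X

def alignment_error(T, R, X):
    """Relative difference between recovered and true rotations acting
    on the data (one plausible reading of the metric described above)."""
    return np.linalg.norm(T @ X - R @ X) / np.linalg.norm(R @ X)
```

Perfect recovery (T equal to R) gives zero error, while a mismatched transformation gives an error on the order of one.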

Figure 1: Synthetic experiments. HiWA was tested in two subspace configurations (a,b): randomly-spaced (average-case, solid) versus equally-spaced (worst-case, dashed), while varying the number of clusters, the dimension of each cluster, the embedding dimension, and the sample size. As expected, performance in terms of the (a) alignment and (b) correspondence error (from 20 random trials) is better in the average (vs. worst) case. In (c,d), we report (c) alignment and (d) correspondence errors as the dimensions and sample size vary, and report the error’s 25th/50th/75th percentiles from 20 trials. In (e,f), we compare HiWA when clusters are known (HiWA), HiWA when clusters are unknown (HiWA-SSC), non-hierarchical Wasserstein alignment (WA), subspace alignment methods (SA fernando2013unsupervised , CORAL sun2016return ), and iterative closest point (ICP) besl1992method under (e) an easier and (f) a harder parameter setting.

In Figure 1a-b, we empirically validate the fact that equally spaced subspaces are indeed the worst-case scenario in alignment, as exposed by Lemma 4.3. We run our proposed algorithm on two identical datasets generated with the same parameters; the key difference being that one dataset has equally-spaced subspaces, while the other contains subspaces that are randomly selected on the Grassmann manifold. We observe that equally-spaced subspaces yield significantly inferior performance compared to randomly-spaced subspaces across various settings. Interestingly, correspondence error is more tolerant than alignment error towards the subspace spacing configuration. In Figure 1c-d, we empirically study the effect of dimensions and sample size on the accuracy of alignment. We run our proposed algorithm on various dataset conditions by varying parameters while approximately maintaining the average subspace correlations (to control for subspace spacing biases) and fixing the cluster size. Both errors demonstrate sample complexities that are better than the theoretical rate, with correspondence error exhibiting greater robustness. We hypothesize this is due to the Sinkhorn distance’s superior sample complexity. In Figure 1e-f, we evaluate our algorithm against benchmark methods in transfer learning and point set registration under two settings (50 trials, no random restarts permitted): a simple one in low dimensions (e) and a harder one in higher dimensions (f). Specifically, we compare HiWA when clusters are known (but correspondences are not), HiWA with clustering via sparse subspace clustering elhamifar2013sparse (HiWA-SSC), a Wasserstein alignment variant with no cluster structure (WA), subspace alignment fernando2013unsupervised , correlation alignment sun2016return , and iterative closest point (ICP) besl1992method .
In both settings, HiWA exhibits the strongest performance, with HiWA-SSC trailing closely behind (since clusters are independently resolved with SSC), followed by WA, then the other algorithms. Subspace alignment methods have remarkably poor performance in higher dimensions due to their inability to resolve subspace sign ambiguities, while ICP demonstrates its notorious dependence on good initial conditions. These results indicate HiWA’s strong robustness against initial conditions.

Figure 2: Results on brain decoding dataset:  How distribution alignment is used to translate neural activity into movement – low-dimensional embeddings of neural data are aligned with target movement patterns (a). In (b), we compare the performance (cluster correspondence) of HiWA, WA, and DAD as the number of points in the source dataset decreases. Next, we compared the performance of HiWA with known and estimated clusters (via GMM). Movement patterns in which cluster separability is high and the geometry is preserved across datasets, can be aligned in both cases (green stars). Patterns where separability is low but geometry is useful can be aligned when the cluster arrangements are known (yellow stars), and when the geometry is not unique, it is not possible to find the correct alignment (red X).

5.2 Neural population decoding example

Decoding intent (e.g., where you want to move your arm) or evoked responses (e.g., what you are looking at or listening to) directly from neural activity is a widely studied problem in neuroscience, and the first step in the design of a brain machine interface (BMI). A critical challenge in BMI is that neural decoders need to be recalibrated (or re-trained) due to drift in neural responses or electrophysiology measurements/readouts pandarinath2018latent . Recently, a method for semi-supervised brain decoding was proposed which finds a transformation between projected neural responses and movements by solving a KL-divergence minimization problem dyer2017cryptography . Using this approach, one could build robust decoders that work across days and shifts in neural responses through alignment.

To test the utility of hierarchical alignment for neural decoding, we utilized datasets collected from the primary motor cortex while a non-human primate (macaque monkey) was making arm movements during a center out reaching task dyer2017cryptography . After spike sorting and binning the data, we applied factor analysis to reduce the data dimensionality to 3D (source distribution) and then applied HiWA to align the neural data to a 3D movement distribution (target distribution) (Figure 2). We compared the performance of HiWA to a standard Wasserstein alignment (WA) that doesn’t use a nested structure, and a baseline brute force search method called distribution alignment decoding (DAD) dyer2017cryptography . In all cases, we examined the accuracy in predicting the target reach direction for the motor decoding task at hand. This is akin to asking whether the algorithm predicted the correct cluster correspondences.

To examine the sensitivity of our method to quantities studied in our theory, we first examined the impact of the sampling density (Figure 2b) on performance. Surprisingly, HiWA continues to produce consistent cluster correspondences (> 70% accuracy), even as the number of samples per cluster drops to around 8 samples. In comparison, DAD is competitive for larger sample sizes but its performance rapidly drops off as sampling density decreases because it requires estimating a distribution. WA suffers from the presence of many local minima and fails to find the correct cluster correspondences. Our results suggest that HiWA consistently provides stable solutions and outperforms other competitor methods (see Supp. Materials) in this neural decoding application.

To study the impact of local and global geometry on whether an unlabeled source and target can be aligned, we applied HiWA to eight different subsets of reach directions (movement patterns). When just two reach directions are considered (Figure 2c, Columns 1-4), global geometry becomes useless in determining the correct rotation. In this case, we observe that HiWA is only capable of consistently doing so when cluster asymmetries are sufficiently extreme in both the source and target to allow discernment. When three reach directions are considered (Figure 2c, Columns 5-8), the global geometry can be used, yet there still exist symmetrical cases where recovering the correct rotation is unlikely without adequate local asymmetries or some supervised (labeled) data to match clusters. These results suggest that hierarchical structure can be critical in resolving ambiguities in alignment of globally symmetric movement distributions.

6 Conclusion

This paper introduces a new method for hierarchical alignment with Wasserstein distances and provides an efficient numerical solution with analytic guarantees. We tested the method and compared its performance with other alignment methods on both synthetic mixture model datasets and in a neural decoding example. Our results on real neural datasets suggest that when either global or local cluster structure is preserved across datasets, a hierarchical approach can dramatically improve performance over traditional OT approaches. While our approach demonstrates strong performance with unitary transformations, this could be restrictive when the data live on more interesting topologies (e.g., structured manifolds culpepper2009learning ). Our hierarchical formulation could in principle provide the necessary structure to perform alignment over richer classes of transformations. Our results on neural data are also compelling and suggest that HiWA can be applied to higher-dimensional alignment problems, such as aligning neural datasets across days without needing to match kinematics or another measured behavioral covariate.


  • [1] Sinno Jialin Pan and Qiang Yang. A survey on transfer learning. IEEE Transactions on knowledge and data engineering, 22(10):1345–1359, 2009.
  • [2] Karl Weiss, Taghi M Khoshgoftaar, and DingDing Wang. A survey of transfer learning. Journal of Big data, 3(1):9, 2016.
  • [3] Haili Chui and Anand Rangarajan. A new point matching algorithm for non-rigid registration. Computer Vision and Image Understanding, 89(2-3):114–141, 2003.
  • [4] Andriy Myronenko and Xubo Song. Point set registration: Coherent point drift. IEEE transactions on pattern analysis and machine intelligence, 32(12):2262–2275, 2010.
  • [5] Gary KL Tam, Zhi-Quan Cheng, Yu-Kun Lai, Frank C Langbein, Yonghuai Liu, David Marshall, Ralph R Martin, Xian-Fang Sun, and Paul L Rosin. Registration of 3d point clouds and meshes: a survey from rigid to nonrigid. IEEE transactions on visualization and computer graphics, 19(7):1199–1217, 2013.
  • [6] Alexander M Bronstein, Michael M Bronstein, and Ron Kimmel. Generalized multidimensional scaling: a framework for isometry-invariant partial surface matching. Proceedings of the National Academy of Sciences, 103(5):1168–1172, 2006.
  • [7] Alexander M Bronstein, Michael M Bronstein, Leonidas J Guibas, and Maks Ovsjanikov. Shape google: Geometric words and expressions for invariant shape retrieval. ACM Transactions on Graphics (TOG), 30(1):1, 2011.
  • [8] Maks Ovsjanikov, Mirela Ben-Chen, Justin Solomon, Adrian Butscher, and Leonidas Guibas. Functional maps: a flexible representation of maps between shapes. ACM Transactions on Graphics (TOG), 31(4):30, 2012.
  • [9] Nicolas Courty, Rémi Flamary, Devis Tuia, and Alain Rakotomamonjy. Optimal transport for domain adaptation. IEEE transactions on pattern analysis and machine intelligence, 39(9):1853–1865, 2017.
  • [10] Nicolas Courty, Rémi Flamary, Amaury Habrard, and Alain Rakotomamonjy. Joint distribution optimal transportation for domain adaptation. In Advances in Neural Information Processing Systems, pages 3730–3739, 2017.
  • [11] Ehsan Elhamifar and Rene Vidal. Sparse subspace clustering: Algorithm, theory, and applications. IEEE transactions on pattern analysis and machine intelligence, 35(11):2765–2781, 2013.
  • [12] Xiaoxiao Shi, Qi Liu, Wei Fan, S Yu Philip, and Ruixin Zhu. Transfer learning on heterogenous feature spaces via spectral transformation. In Data Mining (ICDM), 2010 IEEE 10th International Conference on, pages 1049–1054. IEEE, 2010.
  • [13] Sumit Shekhar, Vishal M Patel, Hien V Nguyen, and Rama Chellappa. Generalized domain-adaptive dictionaries. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 361–368, 2013.
  • [14] Yahong Han, Fei Wu, Dacheng Tao, Jian Shao, Yueting Zhuang, and Jianmin Jiang. Sparse unsupervised dimensionality reduction for multiple view data. IEEE Transactions on Circuits and Systems for Video Technology, 22(10):1485, 2012.
  • [15] Masashi Sugiyama, Shinichi Nakajima, Hisashi Kashima, Paul V Buenau, and Motoaki Kawanabe. Direct importance estimation with model selection and its application to covariate shift adaptation. In Advances in Neural Information Processing Systems, pages 1433–1440, 2008.
  • [16] Sinno Jialin Pan, Ivor W Tsang, James T Kwok, and Qiang Yang. Domain adaptation via transfer component analysis. IEEE Transactions on Neural Networks, 22(2):199–210, 2011.
  • [17] Mahsa Baktashmotlagh, Mehrtash T Harandi, Brian C Lovell, and Mathieu Salzmann. Unsupervised domain adaptation by domain invariant projection. In Proceedings of the IEEE International Conference on Computer Vision, pages 769–776, 2013.
  • [18] Mingsheng Long, Jianmin Wang, Guiguang Ding, Jiaguang Sun, and Philip S Yu. Transfer joint matching for unsupervised domain adaptation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1410–1417, 2014.
  • [19] Mingming Gong, Kun Zhang, Tongliang Liu, Dacheng Tao, Clark Glymour, and Bernhard Schölkopf. Domain adaptation with conditional transferable components. In International Conference on Machine Learning, pages 2839–2848, 2016.
  • [20] Raghuraman Gopalan, Ruonan Li, and Rama Chellappa. Domain adaptation for object recognition: An unsupervised approach. In Computer Vision (ICCV), 2011 IEEE International Conference on, pages 999–1006. IEEE, 2011.
  • [21] Boqing Gong, Yuan Shi, Fei Sha, and Kristen Grauman. Geodesic flow kernel for unsupervised domain adaptation. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pages 2066–2073. IEEE, 2012.
  • [22] Chang Wang and Sridhar Mahadevan. Manifold alignment using procrustes analysis. In Proceedings of the 25th International Conference on Machine Learning, pages 1120–1127. ACM, 2008.
  • [23] Chang Wang and Sridhar Mahadevan. A general framework for manifold alignment. In 2009 AAAI Fall Symposium Series, 2009.
  • [24] Sira Ferradans, Nicolas Papadakis, Gabriel Peyré, and Jean-François Aujol. Regularized discrete optimal transport. SIAM Journal on Imaging Sciences, 7(3):1853–1882, 2014.
  • [25] Zhen Cui, Hong Chang, Shiguang Shan, and Xilin Chen. Generalized unsupervised manifold alignment. In Advances in Neural Information Processing Systems, pages 2429–2437, 2014.
  • [26] Yonina C Eldar and Moshe Mishali. Robust recovery of signals from a structured union of subspaces. IEEE Transactions on Information Theory, 55(11):5302–5316, 2009.
  • [27] Basura Fernando, Amaury Habrard, Marc Sebban, and Tinne Tuytelaars. Unsupervised visual domain adaptation using subspace alignment. In Proceedings of the IEEE International Conference on Computer Vision, pages 2960–2967, 2013.
  • [28] Baochen Sun and Kate Saenko. Subspace distribution alignment for unsupervised domain adaptation. In BMVC, volume 4, pages 24–1, 2015.
  • [29] Baochen Sun, Jiashi Feng, and Kate Saenko. Return of frustratingly easy domain adaptation. In Thirtieth AAAI Conference on Artificial Intelligence, 2016.
  • [30] Kowshik Thopalli, Rushil Anirudh, Jayaraman J Thiagarajan, and Pavan Turaga. Multiple subspace alignment improves domain adaptation. arXiv preprint arXiv:1811.04491, 2018.
  • [31] Leonid Vitalevich Kantorovich. On a problem of Monge. Journal of Mathematical Sciences, 133(4):1383–1383, 2006.
  • [32] RM Dudley. The speed of mean Glivenko–Cantelli convergence. The Annals of Mathematical Statistics, 40(1):40–50, 1969.
  • [33] Jonathan Weed and Francis Bach. Sharp asymptotic and finite-sample rates of convergence of empirical measures in Wasserstein distance. arXiv preprint arXiv:1707.00087, 2017.
  • [34] Marco Cuturi. Sinkhorn distances: Lightspeed computation of optimal transport. In Advances in Neural Information Processing Systems, pages 2292–2300, 2013.
  • [35] Aude Genevay, Lénaic Chizat, Francis Bach, Marco Cuturi, and Gabriel Peyré. Sample complexity of Sinkhorn divergences. arXiv preprint arXiv:1810.02733, 2018.
  • [36] Jonathan Eckstein and Dimitri P Bertsekas. On the Douglas–Rachford splitting method and the proximal point algorithm for maximal monotone operators. Mathematical Programming, 55(1-3):293–318, 1992.
  • [37] Stephen Boyd, Neal Parikh, Eric Chu, Borja Peleato, Jonathan Eckstein, et al. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends® in Machine Learning, 3(1):1–122, 2011.
  • [38] Yu Wang, Wotao Yin, and Jinshan Zeng. Global convergence of ADMM in nonconvex nonsmooth optimization. Journal of Scientific Computing, 78(1):29–63, 2019.
  • [39] Ery Arias-Castro, Adel Javanmard, and Bruno Pelletier. Perturbation bounds for Procrustes, classical scaling, and trilateration, with applications to manifold learning. arXiv preprint arXiv:1810.09569, 2018.
  • [40] Paul J Besl and Neil D McKay. Method for registration of 3-d shapes. In Sensor Fusion IV: Control Paradigms and Data Structures, volume 1611, pages 586–607. International Society for Optics and Photonics, 1992.
  • [41] Chethan Pandarinath, K Cora Ames, Abigail A Russo, Ali Farshchian, Lee E Miller, Eva L Dyer, and Jonathan C Kao. Latent factors and dynamics in motor cortex and their application to brain–machine interfaces. Journal of Neuroscience, 38(44):9390–9401, 2018.
  • [42] Eva L Dyer, Mohammad Gheshlaghi Azar, Matthew G Perich, Hugo L Fernandes, Stephanie Naufel, Lee E Miller, and Konrad P Körding. A cryptography-based approach for movement decoding. Nature Biomedical Engineering, 1(12):967, 2017.
  • [43] Benjamin Culpepper and Bruno A Olshausen. Learning transport operators for image manifolds. In Advances in Neural Information Processing Systems, pages 423–431, 2009.