1 Introduction
In many machine learning applications, it is necessary to meaningfully aggregate, through alignment, different but related datasets (e.g., data across time points or under different conditions or contexts). Alignment is an important problem at the heart of transfer learning [1, 2], point set registration [3, 4, 5], and shape analysis [6, 7, 8], but is generally NP-hard. In recent years, distribution alignment methods that use optimal transport (OT) have been shown to provide state-of-the-art transfer in domain adaptation tasks [9, 10]. Distribution alignment-based approaches cast alignment as an optimization problem that aims to match two distributions. However, when the source and target do not align exactly (e.g., noisy, undersampled) or have complicated multimodal structure, algorithms suffer from poor local minima. Thus, leveraging additional structure in the problem is necessary to regularize OT and constrain the solution space.
Here, we leverage the fact that heterogeneous datasets often admit clustered or multi-subspace structure to improve distribution alignment. Our solution is to simultaneously estimate the cluster alignment across two datasets using their local geometry, while also solving a global alignment problem to meld these local estimates. We introduce a hierarchical formulation of OT for clustered and multi-subspace datasets called Hierarchical Wasserstein Alignment (HiWA). We empirically show that when data can be well approximated with Gaussian mixture models (GMMs) or lie on a union of subspaces, we may leverage existing clustering pipelines (e.g., sparse subspace clustering [11]) to improve alignment. When the transformation between datasets is unitary, we provide analyses that reveal key geometric and sampling insights. To solve the problem numerically, we propose a distributed ADMM algorithm that exploits the Sinkhorn distance and thus has computational complexity that scales quadratically with the size of the largest cluster.
To test and benchmark our approach, we first applied it to synthetic data generated from mixtures of low-rank Gaussians and studied the impact of different geometric properties of the data on alignment, confirming the predictions of our theoretical analysis. Next, we applied our approach to a neural decoding application where the goal is to predict movement directions from populations of neurons in the macaque primary motor cortex. Our results demonstrate that when clustered structure exists in neural datasets and is consistent across trials or time points, a hierarchical alignment strategy that leverages this structure can provide significant improvements in unsupervised decoding from ambiguous (symmetric) movement patterns. This suggests the application of OT to a wider range of neural datasets and shows that a hierarchical strategy can be used to avoid local minima encountered by a global alignment strategy that ignores cluster structure.
2 Background and related work
Transfer learning and distribution alignment. A fundamental goal in transfer learning is to aggregate related datasets by learning an alignment between them. We wish to learn a transformation, drawn from some class of admissible transformations, that aligns a target distribution to a reference (source) distribution under a notion of probability divergence: (1)
Various probability divergences have been proposed in the literature, such as Euclidean least-squares (when data ordering is known) [12, 13, 14], Kullback-Leibler (KL) [15], maximum mean discrepancy (MMD) [16, 17, 18, 19], and the Wasserstein distance [9], where the trade-offs are often statistical (e.g., consistency, sample complexity) versus computational. Alignment problems are ill-posed since the space of candidate transformations is large, so a priori structure is often necessary to constrain it based on geometric assumptions. Compact manifolds like the Grassmann or Stiefel [20, 21] are primary choices when little information is present, as they preserve isometry. Non-isometric transformations, though richer, demand much more structure (e.g., manifold or graph structure) [22, 23, 24, 25, 9].
Low-rank and union-of-subspaces models. Principal components analysis (PCA), one of the most popular methods in data science, assumes a low-rank model where the top principal components of a dataset provide the optimal low-rank approximation under a Euclidean loss. This has been extended to robust (sparse errors) settings [11], and to multi-subspace (union-of-subspaces) settings where data can be partitioned into disjoint subsets, each of which is locally low-rank [26]. Transfer learning methods based on subspace alignment [27, 28, 29] work well with zero-mean unimodal datasets, but struggle on more complicated modalities (e.g., Gaussian mixtures or unions of subspaces) due to a mixing of covariances. Related to our work, [30] performs multi-subspace alignment by greedily assigning correspondences between subspaces using chordal distances; this, however, neglects sign ambiguities in principal directions, since subspaces inadequately describe a distribution's shape.
Optimal transport. Optimal transport (OT) [31] is a natural type of divergence for registration problems because it accounts for the underlying geometry of the space. In Euclidean settings, OT is a metric known as the Wasserstein distance, which measures the minimum effort required to "displace" points across two measures (understood here as empirical point clouds). OT thus relieves, by design, the need for kernel estimation to create an overlapping support of the measures. Despite this attractive property, it suffers from both a poor numerical complexity in the sample size and a dimension-dependent sample complexity [32, 33]. Recently, an entropically regularized version of OT known as the Sinkhorn distance [34] has emerged as a compelling divergence measure; it not only inherits OT's geometric properties but also has superior computational and sample complexities (the latter dependent on a regularization parameter [35]). It has also become a versatile building block in domain adaptation [9, 10]. Prior art [9] has largely exploited OT's pushforward as the alignment map, since this map minimizes the OT cost between the source and target distributions while allowing a priori structure to be easily incorporated (e.g., to preserve label/graphical integrity). Such an approach, however, is fundamentally expensive at scale, since the primary optimization variable is a large transport coupling whose size grows with the number of samples, while in reality the alignment mapping itself is low-dimensional. Moreover, it assumes that the source and target distributions are close in terms of their squared Euclidean distance, but this does not generally hold in the alignment of arbitrary latent spaces.
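To make the Sinkhorn distance concrete, the following is a minimal sketch of the entropy-regularized OT iterations for two uniform empirical measures; the function name, regularization value, and iteration count are illustrative choices, not part of the paper's algorithm.

```python
import numpy as np

def sinkhorn(C, eps=0.5, n_iter=200):
    """Entropy-regularized OT between two uniform empirical measures.

    C is an (n, m) matrix of ground costs (e.g., squared Euclidean
    distances between point clouds). Returns the transport coupling P
    and the regularized transport cost <P, C>.
    """
    n, m = C.shape
    a, b = np.ones(n) / n, np.ones(m) / m   # uniform marginals
    K = np.exp(-C / eps)                    # Gibbs kernel
    u, v = np.ones(n), np.ones(m)
    for _ in range(n_iter):                 # alternating marginal scalings
        u = a / (K @ v)
        v = b / (K.T @ u)
    P = u[:, None] * K * v[None, :]
    return P, float((P * C).sum())
```

Because the Gibbs kernel entries decay exponentially in the cost, very small regularization values require a log-domain implementation in practice; the plain iterations above suffice for illustration.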
3 Hierarchical Wasserstein alignment
Preliminaries and notation. Consider two clustered datasets whose columns are treated as embedding coordinates and whose clusters are indexed within each dataset; each cluster has its own sample count. We express the empirical measure of a cluster as a uniform sum of point masses located at its coordinates. The squared 2-Wasserstein distance between two such empirical measures is the minimum total squared-distance transport cost over couplings drawn from the uniform transport polytope, i.e., over doubly stochastic matrices that encode pointwise correspondences (each entry describes the flow of mass between a pair of points) and whose row and column sums are uniform.
Overview. Although unsupervised alignment is challenging due to the presence of local minima, imposing additional structure helps to prune them away. Our key insight is that hierarchical structure decomposes a complicated optimization surface into simpler ones that are less prone to local minima. We formulate a hierarchical Wasserstein approach to align datasets with known (or estimated) clusters whose correspondences are unknown. The task is therefore to jointly learn the alignment and the cluster correspondences:
(2) 
where the correspondence matrix encodes the strength of correspondences between clusters, with a large value indicating a correspondence between a pair of clusters and a small value indicating a lack thereof. We note that its constraint set is a special type of transport polytope known as the Birkhoff polytope. Interestingly, this becomes a nested (or block) OT formulation, where correspondences are resolved at two levels: the outer level resolves cluster correspondences (via the correspondence matrix), while the inner level resolves pointwise correspondences between cluster points (via the Wasserstein distance).
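For intuition about the inner Wasserstein terms in (2): when two clusters have the same number of points and uniform weights, an optimal coupling lies at a vertex of the transport polytope, i.e., it is a permutation, so the squared 2-Wasserstein distance reduces to a linear assignment problem. A minimal sketch (function and variable names are ours):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def w2_squared(X, Y):
    """Squared 2-Wasserstein distance between two uniform empirical
    measures with the same number of points (rows are samples).

    For equal-size uniform measures the OT problem is solved exactly
    by the Hungarian algorithm over the pairwise cost matrix.
    """
    C = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)  # pairwise sq. distances
    rows, cols = linear_sum_assignment(C)               # optimal permutation
    return C[rows, cols].sum() / X.shape[0]
```

Note that a pure translation of a point cloud is matched by the identity permutation, so its cost is exactly the squared length of the shift.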
Alignment over the Stiefel manifold. Assuming clusters lie on subspaces and principal angles between subspaces are "well preserved" across the two datasets (we make this precise in Theorem 4.2), an isometric transformation suffices. Hence, we solve (2) over the Stiefel manifold, i.e., the set of transformations with orthonormal columns. Explicitly, we have
(3) 
Here, each pairwise cluster divergence is measured using the squared 2-Wasserstein distance under a Stiefel transformation acting on one of the clusters, i.e.,
(4) 
Finally, we include entropic regularization over the correspondence matrix and all transportation couplings, modifying the Wasserstein distances to Sinkhorn distances so as to take advantage of their superior computational and sample complexities. Omitting constraints for brevity, our final problem is given as
(5) 
where the entropic regularization parameters weight negative-entropy terms over the couplings and the correspondence matrix. These parameters control the correspondence entropy; (5) therefore approximates (3) when they are small, and reverts to the original problem (3) as they tend to zero.
Distributed ADMM approach. Problem (5) is nonconvex due to multilinearity in the objective and its Stiefel manifold domain. It has recently been shown that the alternating direction method of multipliers (ADMM) [36, 37] can be globally convergent even in nonconvex settings [38]. Furthermore, since (5) readily admits a splitting structure that separates the individual blocks, we develop a distributed ADMM solver. We proceed to split (5) as follows:
noting that the set constraints are omitted for brevity. Its augmented Lagrangian introduces an ADMM penalty parameter and Lagrange multipliers; full details of the update steps are included in the supplementary material. The algorithm may be summarized in two steps: (i) a distributed step that asks all cluster pairs to individually find their optimal transformations in parallel, and (ii) a consensus step that aggregates all the found transformations according to a weighting proportional to the correspondence strengths. Algorithm 1 summarizes our method.
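The consensus step can be pictured as follows. This is a hypothetical sketch, not the paper's exact update: per-cluster-pair transformations are averaged with weights given by the correspondence strengths, and the weighted mean is projected back onto the Stiefel manifold via its polar factor (the nearest matrix with orthonormal columns in Frobenius norm).

```python
import numpy as np

def stiefel_consensus(R_pairs, T):
    """Hypothetical consensus step: average the per-cluster-pair
    transformations R_pairs[i][j] weighted by correspondence strengths
    T[i, j], then project the weighted mean onto the Stiefel manifold
    via its polar factor. Names and shapes are illustrative assumptions.
    """
    k = T.shape[0]
    M = sum(T[i, j] * R_pairs[i][j] for i in range(k) for j in range(k))
    U, _, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ Vt  # polar factor = projection onto the Stiefel manifold
```

If all pairs agree on the same orthogonal transformation, the consensus returns that transformation unchanged, since the polar factor of a positively scaled orthogonal matrix is the matrix itself.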
Parameters. The entropic parameters relax the one-to-one cluster correspondence assumption, balancing a trade-off between alignment precision (small values) and sample complexity (large values). Numerically, negative entropy adds strong convexity to the program, reducing sensitivity to perturbations at the cost of a slower convergence rate. The ADMM parameter controls the 'strength' of the consensus, or, from an algorithmic viewpoint, the gradient step size.
Distributed consensus. Update steps can be performed in parallel over all cluster pairs, making the algorithm amenable to a distributed implementation. When fully parallelized, its per-iteration computational complexity is quadratic in the number of points in the largest clusters of the two datasets (compared to vanilla Sinkhorn's complexity, which is quadratic in the total number of points in the respective datasets).
Stopping criteria. In lines 3 and 6 of Algorithm 1, possible stopping criteria are (i) the change in the transformation between the current and previous iterations falling below a tolerance, and (ii) reaching a maximum number of iterations.
Robustness against initial conditions. We build in robustness against initial conditions by ordering the updates for the couplings and correspondences before the transformation, such that when the ADMM parameter is sufficiently small, the ADMM sequence is influenced more by the data than by the initial conditions.
4 Theoretical guarantees for clusterbased alignment
While the previous section explains how to align clustered datasets, in this section we aim to answer the question of when and how well they can be aligned. We provide necessary conditions for cluster-based alignability as well as alignment perturbation bounds for problem (3)'s formulation. To simplify our analysis, we make the following assumptions: (i) each cluster contains the same number of datapoints, and (ii) the ground-truth cluster correspondence matrix is diagonal. Detailed proofs are given in the supplementary material.
The following result is a criterion that, if met, ensures the existence of the cluster-correspondence global minimum. The criterion requires that matched clusters be closer in Wasserstein distance than mismatched clusters, according to a threshold determined by the Wasserstein distance's sample complexity (i.e., an asymptotic rate dependent on the clusters' sample sizes and intrinsic dimensions). Since these sample complexity results are based on the Wasserstein distance, we expect a less stringent criterion when using the Sinkhorn distance in (5) (due to its superior sample complexity [35]).
Theorem 4.1 (Correspondence disambiguity criterion).
Let all clusters be strictly low-rank, with each cluster in each dataset having its own intrinsic dimension. Problem (3) yields the ground-truth correspondence solution with probability at least a prescribed level if, for every mismatched cluster pair, the following criterion is satisfied:
Proof sketch.
The proof contains two parts. In the first part, we consider perturbation conditions of the cost matrix in a (non-variational) optimal transport program over the Birkhoff polytope; to remain unperturbed from the ground-truth correspondence, a margin condition on the cost matrix is required. In the second part, we extend this condition to the finite-sample regime by utilizing recently developed concentration bounds for the Wasserstein distance [33], which essentially raise the disambiguity lower bound due to finite-sample uncertainty. ∎
Now, even if we have the global correspondence solution, we still do not have the full picture of the alignment's quality. For example, all matching clusters may have very similar covariances, yet the principal angles between the clusters may be "distorted" across the datasets. Our next theorem gives an upper bound on the alignment error (for unitary transformations) and makes precise this notion of global structure distortion.
Theorem 4.2 (Clusterbased alignment perturbation bounds).
Consider data matrices with known pointwise correspondence matrices, and define the correspondence-weighted cross-matrices built from them. If the criterion stated in Theorem 4.1 is satisfied, the source-side matrix is full row rank, and a perturbation condition stated in terms of the operator norm and the pseudoinverse of that matrix holds, then the alignment error is bounded above by the perturbation magnitude scaled by a data-dependent constant.
Proof sketch.
We utilize a recent perturbation result on the Procrustes problem (with a Frobenius norm objective) by Arias-Castro et al. [39] and adapt it to our squared 2-Wasserstein objective. ∎
We point out that the perturbation term plays a major role in the alignment error bound and quantifies the notion of global structure distortion. It therefore allows us to understand how phenomena like covariate shift or misclustering impact alignment. To shed some light in this regard, we consider a simple analysis of a cluster pair's error contribution. Consider the decomposition of the block of the Gramians related to a pair of corresponding clusters, together with their respective singular value decompositions. Defining the blockwise error between the clusters, two components stand out: (i) angular shift, which is characterized by differences in principal angles between the two datasets' clusters, and (ii) spectral shift, which is characterized by differences in their spectra.
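The angular-shift component can be probed numerically via principal angles; the snippet below is illustrative only, using SciPy's subspace_angles on hypothetical cluster bases.

```python
import numpy as np
from scipy.linalg import subspace_angles

rng = np.random.default_rng(0)
A = rng.normal(size=(10, 2))             # basis spanning a 2-d subspace of R^10
B = A + 0.05 * rng.normal(size=(10, 2))  # mildly perturbed copy (small angular shift)
C = rng.normal(size=(10, 2))             # unrelated random subspace

# Principal angles (in radians) quantify the angular shift between
# corresponding cluster subspaces across two datasets.
angles_close = subspace_angles(A, B)
angles_far = subspace_angles(A, C)
```

A perturbed copy of a subspace yields small principal angles, while an unrelated random subspace in a higher-dimensional ambient space yields large ones.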
Finally, we show that the subspace configuration of a dataset’s clusters can also affect alignment. Pretend for a moment that external alignment information were present to aid in the disambiguation between two clusters. The following lemma tells us when such information is useless.
Lemma 4.3 (Uninformative alignment).
Consider a pair of clusters with known pointwise correspondences, and denote the left and right singular vectors associated with the nonzero singular values of their cross-Gramian. Define the set of orthogonal transformations that are constrained to agree with these known angular directions. Then, for any transformation in this constrained set, the resulting inter-cluster distance satisfies the bound
(6)
with equality holding when the relevant subspaces are mutually orthogonal.
Direct consequences of this lemma are the following: when a dataset has equally spaced subspaces, it has a maximally uninformative geometric configuration, since angular information from other clusters can never increase the inter-cluster distance (i.e., equality in (6) always holds); it is hence a worst-case scenario for alignment. This also explains why alignment in very high-dimensional spaces is harder: all subspaces may be orthogonal to each other, and hence offer no "geometric" advantage in the joint alignment effort.
5 Numerical experiments
5.1 Synthetic lowrank Gaussian mixture dataset
In this section, we validate our method and demonstrate its limiting characteristics under symmetric-subspace and finite-sample regimes. To generate our synthetic data, we repeat the following procedure for each cluster: we first randomly generate Gaussian distribution parameters (a mean and a positive semidefinite covariance), then randomly sample datapoints from these parameters, and finally project them into a random low-dimensional subspace of a higher-dimensional embedding. We assume that the respective clusters are known, but the cluster correspondences between datasets are not. We measure performance with two metrics: (i) alignment error, defined as the relative difference between the recovered and true rotations acting on the data, and (ii) correspondence error, defined as the sum of absolute differences between the recovered and true correspondences.
In Figure 1a-b, we empirically validate the fact that equally spaced subspaces are indeed the worst-case scenario for alignment, as exposed by Lemma 4.3. We run our proposed algorithm on two datasets generated with identical parameters; the key difference is that one dataset has equally spaced subspaces with a fixed subspace similarity, while the other contains subspaces that are randomly selected on the Grassmann manifold. We observe that equally spaced subspaces yield significantly inferior performance compared to randomly spaced subspaces across various settings. Interestingly, correspondence error is more tolerant than alignment error towards the subspace spacing configuration. In Figure 1c-d, we empirically study the effect of dimension and sample size on the accuracy of alignment. We run our proposed algorithm on various dataset conditions by varying these parameters while approximately maintaining the average subspace correlations (tuned to control for subspace-spacing biases) and fixing the cluster size. Both errors demonstrate sample complexities that are better than the theoretical rate, with correspondence error exhibiting greater robustness. We hypothesize this is due to the Sinkhorn distance's superior sample complexity.
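A sketch of this generation procedure (parameter names and defaults are our assumptions, not the paper's exact settings):

```python
import numpy as np

def make_lowrank_gmm(k=4, d=2, D=20, n=100, seed=0):
    """Sketch of the synthetic generator: for each of k clusters, sample
    n points from a random d-dimensional Gaussian, then embed them into
    a random d-dimensional subspace of R^D. Returns a (D, k*n) data
    matrix (columns are samples) and cluster labels.
    """
    rng = np.random.default_rng(seed)
    data, labels = [], []
    for c in range(k):
        A = rng.normal(size=(d, d))
        cov = A @ A.T                                  # random PSD covariance
        mu = rng.normal(size=d)
        Z = rng.multivariate_normal(mu, cov, size=n)   # (n, d) Gaussian samples
        U, _ = np.linalg.qr(rng.normal(size=(D, d)))   # random subspace basis
        data.append(U @ Z.T)                           # embed into R^D
        labels.append(np.full(n, c))
    return np.hstack(data), np.concatenate(labels)
```

By construction, each cluster lies exactly in a d-dimensional subspace of the D-dimensional embedding, so each cluster's data matrix has rank d.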
In Figure 1e-f, we evaluate our algorithm against benchmark methods in transfer learning and point set registration under two settings (50 trials, no random restarts permitted): a simple one in low dimension (e) and a harder one in higher dimension (f). Specifically, we compare HiWA when clusters are known (but correspondences are not), HiWA with clustering via sparse subspace clustering [11] (HiWA-SSC), a Wasserstein alignment variant with no cluster structure (WA), subspace alignment [27], correlation alignment [29], and iterative closest point (ICP) [40]. In both settings, HiWA exhibits the strongest performance, with HiWA-SSC trailing closely behind (since clusters are independently resolved with SSC), followed by WA and then the other algorithms. Subspace alignment methods perform remarkably poorly in higher dimensions due to their inability to resolve subspace sign ambiguities, while ICP demonstrates its notorious dependence on good initial conditions. These results indicate HiWA's strong robustness against initial conditions.
5.2 Neural population decoding example
Decoding intent (e.g., where you want to move your arm) or evoked responses (e.g., what you are looking at or listening to) directly from neural activity is a widely studied problem in neuroscience, and the first step in the design of a brain-machine interface (BMI). A critical challenge in BMI is that neural decoders need to be recalibrated (or retrained) due to drift in neural responses or electrophysiology measurements/readouts [41]. Recently, a method for semi-supervised brain decoding was proposed that finds a transformation between projected neural responses and movements by solving a KL-divergence minimization problem [42]. Using this approach, one could build robust decoders that work across days and shifts in neural responses through alignment.
To test the utility of hierarchical alignment for neural decoding, we utilized datasets collected from the primary motor cortex while a non-human primate (macaque monkey) made arm movements during a center-out reaching task [42]. After spike sorting and binning the data, we applied factor analysis to reduce the data dimensionality to 3D (source distribution) and then applied HiWA to align the neural data to a 3D movement distribution (target distribution) (Figure 2). We compared the performance of HiWA to a standard Wasserstein alignment (WA) that does not use a nested structure, and to a baseline brute-force search method called distribution alignment decoding (DAD) [42]. In all cases, we examined the accuracy in predicting the target reach direction for the motor decoding task at hand. This is akin to asking whether the algorithm predicted the correct cluster correspondences.
To examine the sensitivity of our method to quantities studied in our theory, we first examined the impact of the sampling density (Figure 2b) on performance. Surprisingly, HiWA continues to produce consistent cluster correspondences (> 70% accuracy), even as the number of samples per cluster drops to around 8 samples. In comparison, DAD is competitive for larger sample sizes but its performance rapidly drops off as sampling density decreases because it requires estimating a distribution. WA suffers from the presence of many local minima and fails to find the correct cluster correspondences. Our results suggest that HiWA consistently provides stable solutions and outperforms other competitor methods (see Supp. Materials) in this neural decoding application.
To study the impact of local and global geometry on whether an unlabeled source and target can be aligned, we applied HiWA to eight different subsets of reach directions (movement patterns). When just two reach directions are considered (Figure 2c, Columns 1-4), global geometry becomes useless in determining the correct rotation. In this case, we observe that HiWA is only capable of consistently recovering the correct rotation when cluster asymmetries are sufficiently extreme in both the source and target to allow discernment. When three reach directions are considered (Figure 2c, Columns 5-8), the global geometry can be used, yet there still exist symmetric cases where recovering the correct rotation is unlikely without adequate local asymmetries or some supervised (labeled) data to match clusters. These results suggest that hierarchical structure can be critical in resolving ambiguities in the alignment of globally symmetric movement distributions.
6 Conclusion
This paper introduces a new method for hierarchical alignment with Wasserstein distances and provides an efficient numerical solution with analytic guarantees. We tested the method and compared its performance with other alignment methods on both synthetic mixture model datasets and a neural decoding example. Our results on real neural datasets suggest that when either global or local cluster structure is preserved across datasets, a hierarchical approach can dramatically improve performance over traditional OT approaches. While our approach demonstrates strong performance with unitary transformations, this could be restrictive when the data live on more interesting topologies (e.g., structured manifolds [43]). Our hierarchical formulation could in principle provide the necessary structure to perform alignment over richer classes of transformations. Our results on neural data are also compelling and suggest that HiWA can be applied to higher-dimensional alignment problems, such as aligning neural datasets across days without needing to match kinematics or another measured behavioral covariate.
References
 [1] Sinno Jialin Pan and Qiang Yang. A survey on transfer learning. IEEE Transactions on knowledge and data engineering, 22(10):1345–1359, 2009.
 [2] Karl Weiss, Taghi M Khoshgoftaar, and Dingding Wang. A survey of transfer learning. Journal of Big Data, 3(1):9, 2016.
 [3] Haili Chui and Anand Rangarajan. A new point matching algorithm for non-rigid registration. Computer Vision and Image Understanding, 89(2-3):114–141, 2003.
 [4] Andriy Myronenko and Xubo Song. Point set registration: Coherent point drift. IEEE transactions on pattern analysis and machine intelligence, 32(12):2262–2275, 2010.
 [5] Gary KL Tam, Zhi-Quan Cheng, Yu-Kun Lai, Frank C Langbein, Yonghuai Liu, David Marshall, Ralph R Martin, Xian-Fang Sun, and Paul L Rosin. Registration of 3D point clouds and meshes: a survey from rigid to nonrigid. IEEE Transactions on Visualization and Computer Graphics, 19(7):1199–1217, 2013.
 [6] Alexander M Bronstein, Michael M Bronstein, and Ron Kimmel. Generalized multidimensional scaling: a framework for isometry-invariant partial surface matching. Proceedings of the National Academy of Sciences, 103(5):1168–1172, 2006.
 [7] Alexander M Bronstein, Michael M Bronstein, Leonidas J Guibas, and Maks Ovsjanikov. Shape google: Geometric words and expressions for invariant shape retrieval. ACM Transactions on Graphics (TOG), 30(1):1, 2011.
 [8] Maks Ovsjanikov, Mirela Ben-Chen, Justin Solomon, Adrian Butscher, and Leonidas Guibas. Functional maps: a flexible representation of maps between shapes. ACM Transactions on Graphics (TOG), 31(4):30, 2012.
 [9] Nicolas Courty, Rémi Flamary, Devis Tuia, and Alain Rakotomamonjy. Optimal transport for domain adaptation. IEEE transactions on pattern analysis and machine intelligence, 39(9):1853–1865, 2017.
 [10] Nicolas Courty, Rémi Flamary, Amaury Habrard, and Alain Rakotomamonjy. Joint distribution optimal transportation for domain adaptation. In Advances in Neural Information Processing Systems, pages 3730–3739, 2017.
 [11] Ehsan Elhamifar and Rene Vidal. Sparse subspace clustering: Algorithm, theory, and applications. IEEE transactions on pattern analysis and machine intelligence, 35(11):2765–2781, 2013.
 [12] Xiaoxiao Shi, Qi Liu, Wei Fan, S Yu Philip, and Ruixin Zhu. Transfer learning on heterogenous feature spaces via spectral transformation. In Data Mining (ICDM), 2010 IEEE 10th International Conference on, pages 1049–1054. IEEE, 2010.

 [13] Sumit Shekhar, Vishal M Patel, Hien V Nguyen, and Rama Chellappa. Generalized domain-adaptive dictionaries. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 361–368, 2013.
 [14] Yahong Han, Fei Wu, Dacheng Tao, Jian Shao, Yueting Zhuang, and Jianmin Jiang. Sparse unsupervised dimensionality reduction for multiple view data. IEEE Transactions on Circuits and Systems for Video Technology, 22(10):1485, 2012.
 [15] Masashi Sugiyama, Shinichi Nakajima, Hisashi Kashima, Paul V Buenau, and Motoaki Kawanabe. Direct importance estimation with model selection and its application to covariate shift adaptation. In Advances in neural information processing systems, pages 1433–1440, 2008.

 [16] Sinno Jialin Pan, Ivor W Tsang, James T Kwok, and Qiang Yang. Domain adaptation via transfer component analysis. IEEE Transactions on Neural Networks, 22(2):199–210, 2011.
 [17] Mahsa Baktashmotlagh, Mehrtash T Harandi, Brian C Lovell, and Mathieu Salzmann. Unsupervised domain adaptation by domain invariant projection. In Proceedings of the IEEE International Conference on Computer Vision, pages 769–776, 2013.
 [18] Mingsheng Long, Jianmin Wang, Guiguang Ding, Jiaguang Sun, and Philip S Yu. Transfer joint matching for unsupervised domain adaptation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1410–1417, 2014.
 [19] Mingming Gong, Kun Zhang, Tongliang Liu, Dacheng Tao, Clark Glymour, and Bernhard Schölkopf. Domain adaptation with conditional transferable components. In International conference on machine learning, pages 2839–2848, 2016.
 [20] Raghuraman Gopalan, Ruonan Li, and Rama Chellappa. Domain adaptation for object recognition: An unsupervised approach. In Computer Vision (ICCV), 2011 IEEE International Conference on, pages 999–1006. IEEE, 2011.
 [21] Boqing Gong, Yuan Shi, Fei Sha, and Kristen Grauman. Geodesic flow kernel for unsupervised domain adaptation. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pages 2066–2073. IEEE, 2012.
 [22] Chang Wang and Sridhar Mahadevan. Manifold alignment using procrustes analysis. In Proceedings of the 25th international conference on Machine learning, pages 1120–1127. ACM, 2008.
 [23] Chang Wang and Sridhar Mahadevan. A general framework for manifold alignment. In 2009 AAAI Fall Symposium Series, 2009.
 [24] Sira Ferradans, Nicolas Papadakis, Gabriel Peyré, and Jean-François Aujol. Regularized discrete optimal transport. SIAM Journal on Imaging Sciences, 7(3):1853–1882, 2014.
 [25] Zhen Cui, Hong Chang, Shiguang Shan, and Xilin Chen. Generalized unsupervised manifold alignment. In Advances in Neural Information Processing Systems, pages 2429–2437, 2014.
 [26] Yonina C Eldar and Moshe Mishali. Robust recovery of signals from a structured union of subspaces. IEEE Transactions on Information Theory, 55(11):5302–5316, 2009.
 [27] Basura Fernando, Amaury Habrard, Marc Sebban, and Tinne Tuytelaars. Unsupervised visual domain adaptation using subspace alignment. In Proceedings of the IEEE international conference on computer vision, pages 2960–2967, 2013.
 [28] Baochen Sun and Kate Saenko. Subspace distribution alignment for unsupervised domain adaptation. In BMVC, volume 4, pages 24–1, 2015.

 [29] Baochen Sun, Jiashi Feng, and Kate Saenko. Return of frustratingly easy domain adaptation. In Thirtieth AAAI Conference on Artificial Intelligence, 2016.
 [30] Kowshik Thopalli, Rushil Anirudh, Jayaraman J Thiagarajan, and Pavan Turaga. Multiple subspace alignment improves domain adaptation. arXiv preprint arXiv:1811.04491, 2018.
 [31] Leonid Vitalevich Kantorovich. On a problem of Monge. Journal of Mathematical Sciences, 133(4):1383–1383, 2006.
 [32] RM Dudley. The speed of mean Glivenko-Cantelli convergence. The Annals of Mathematical Statistics, 40(1):40–50, 1969.
 [33] Jonathan Weed and Francis Bach. Sharp asymptotic and finite-sample rates of convergence of empirical measures in Wasserstein distance. arXiv preprint arXiv:1707.00087, 2017.
 [34] Marco Cuturi. Sinkhorn distances: Lightspeed computation of optimal transport. In Advances in neural information processing systems, pages 2292–2300, 2013.
 [35] Aude Genevay, Lénaic Chizat, Francis Bach, Marco Cuturi, and Gabriel Peyré. Sample complexity of sinkhorn divergences. arXiv preprint arXiv:1810.02733, 2018.
 [36] Jonathan Eckstein and Dimitri P Bertsekas. On the Douglas–Rachford splitting method and the proximal point algorithm for maximal monotone operators. Mathematical Programming, 55(1-3):293–318, 1992.
 [37] Stephen Boyd, Neal Parikh, Eric Chu, Borja Peleato, Jonathan Eckstein, et al. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends® in Machine learning, 3(1):1–122, 2011.
 [38] Yu Wang, Wotao Yin, and Jinshan Zeng. Global convergence of admm in nonconvex nonsmooth optimization. Journal of Scientific Computing, 78(1):29–63, 2019.
 [39] Ery Arias-Castro, Adel Javanmard, and Bruno Pelletier. Perturbation bounds for Procrustes, classical scaling, and trilateration, with applications to manifold learning. arXiv preprint arXiv:1810.09569, 2018.
 [40] Paul J Besl and Neil D McKay. Method for registration of 3-D shapes. In Sensor Fusion IV: Control Paradigms and Data Structures, volume 1611, pages 586–607. International Society for Optics and Photonics, 1992.
 [41] Chethan Pandarinath, K Cora Ames, Abigail A Russo, Ali Farshchian, Lee E Miller, Eva L Dyer, and Jonathan C Kao. Latent factors and dynamics in motor cortex and their application to brain–machine interfaces. Journal of Neuroscience, 38(44):9390–9401, 2018.
 [42] Eva L Dyer, Mohammad Gheshlaghi Azar, Matthew G Perich, Hugo L Fernandes, Stephanie Naufel, Lee E Miller, and Konrad P Körding. A cryptographybased approach for movement decoding. Nature Biomedical Engineering, 1(12):967, 2017.
 [43] Benjamin Culpepper and Bruno A Olshausen. Learning transport operators for image manifolds. In Advances in neural information processing systems, pages 423–431, 2009.