# Matching Component Analysis for Transfer Learning

We introduce a new Procrustes-type method called matching component analysis to isolate components in data for transfer learning. Our theoretical results describe the sample complexity of this method, and we demonstrate through numerical experiments that our approach is indeed well suited for transfer learning.


## 1 Introduction

Many state-of-the-art classification algorithms require a large training set that is statistically similar to the test set. For example, deep learning–based approaches require a large number of representative samples in order to find near-optimal network weights and biases [4, 9]. Similarly, template-based approaches require large dictionaries of training images so that each test image can be represented by an element of the dictionary [21, 26, 17, 6]. For each technique, if test images cannot be represented in a feature space that has been determined from the training set, then classification accuracy is poor.

In applications such as synthetic aperture radar (SAR) automatic target recognition (ATR), it is infeasible to collect the volume of data necessary to naively train high-accuracy classification networks. Additionally, due to varying operating conditions, the features measured in SAR imagery are different from those extracted from electro-optical (EO) imagery [13]. As such, off-the-shelf networks that have been pre-trained on the popular EO-based ImageNet [3] or CIFAR-10 [8] datasets are insufficient for performing accurate ATR tasks in different imaging domains. In fact, recent work has demonstrated that pre-trained networks fail to effectively generalize to random perturbations on test sets [19, 18]. To build more representative training sets, additional data are often generated using modeling and simulation software. However, due to various model errors, simulated data often misrepresent the real-world scattering observed in measured imagery. Thus, even though it is possible to augment training sets with a large amount of simulated data, the inherent differences in sensor modalities and data representations make modifying classification networks a non-trivial task [20].

In this paper, we introduce matching component analysis (MCA) to help remedy this situation. Given a small number of images from the training domain and matching images from the testing domain, MCA identifies a low-dimensional feature space that both domains have in common. With the help of MCA, one can map augmented training sets into a common domain, thereby making the classification task more robust to mismatch between the training and testing domains. We note that other transfer learning methods, image-to-image domain regression techniques, and generative adversarial networks have all been developed with a similar task in mind [12, 7, 23, 16, 11], but little theory has been developed to explain the performance of these machine learning–based adaptation techniques. By contrast, in this paper, we estimate the number of matched samples needed for MCA to identify a common domain.

The rest of the paper is organized as follows. Section 2 introduces the MCA algorithm and our main theoretical results. In Section 3, we use a sequence of numerical experiments involving MNIST [10] and SAR [13] data to demonstrate that classifying data in the common domain allows for more accurate classification. We discuss limitations of MCA in Section 4. Sections 5 and 6 contain the proofs of our main theoretical results.

## 2 Matching component analysis

Let and denote the training and testing domains, respectively. Traditionally, our training set would consist of labeled points in , whereas our test set would consist of labeled points in . In order to bridge the disparity between the training and testing domains, we will augment our training set with a matching set of labeled pairs in . Then our full training set, whose size we denote by , consists of a conventional training set of labeled points in and a matching set of labeled points in . The matching set will enable us to identify maps and from the training and testing domains to a common domain , where we can train a classifier on the full training set:

We model our setting in terms of unknown random variables $X_1$, $X_2$, and $Y$ over a common probability space. In particular, suppose points $\omega_1,\ldots,\omega_N$ are drawn independently at random from this space, and we are given

$$\{X_1(\omega_j)\}_{j\in[N]},\qquad \{X_2(\omega_j)\}_{j\in[n]},\qquad \{Y(\omega_j)\}_{j\in[N]}$$

for some with the task of finding such that . Our approach is summarized by the following:

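(i) Select and a class of functions that map to for each .

(ii) Use and to (approximately) solve

$$\text{minimize}\quad \mathbb{E}\|g_1(X_1)-g_2(X_2)\|^2 \quad\text{subject to}\quad g_i\in\mathcal{F}_i,\quad \mathbb{E}g_i(X_i)=0,\quad \mathbb{E}g_i(X_i)g_i(X_i)^\top=I_k,\quad i\in\{1,2\}. \tag{1}$$

(iii) Train on and , and return .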

For (i), we are principally interested in the case where is the set of affine linear transformations from to . This choice of function class is nice because it locally approximates arbitrary differentiable functions while being amenable to theoretical analysis. Considering the ubiquity of principal component analysis in modern data science, this choice promises to be useful in practice. The constraints in program (1) ensure that the training set in (iii) is normalized, while simultaneously preventing useless choices for , such as those for which almost surely. Intuitively, (ii) selects and so as to transform and into a common domain, and then (iii) leverages the large number of realizations of to predict in this domain, thereby enabling us to predict from . We expect this approach to work well in settings for which

• each captures sufficient information about to predict ,

• is robust to slight perturbations so that ,

• is too complicated to be learned from a training set of size , and

• can be learned from a training set of size .

To solve program (1) in the case of affine linear transformations, must have the form for some and . Let and denote the mean and covariance of . The constraint in program (1) forces , and so , i.e., . The constraint also forces . Overall, program (1) is equivalent to

$$\text{minimize}\quad \mathbb{E}\|A_1(X_1-\mu_1)-A_2(X_2-\mu_2)\|^2 \quad\text{subject to}\quad A_i\Sigma_iA_i^\top=I_k,\quad i\in\{1,2\}. \tag{2}$$

Notice that this program is not infeasible when . Of course, we do not have access to and , but rather realizations of each, and so we are forced to approximate. To this end, we estimate the means and covariances as

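$$\hat\mu_i:=\frac{1}{n}\sum_{j\in[n]}X_i(\omega_j),\qquad \hat\Sigma_i:=\frac{1}{n}\sum_{j\in[n]}\big(X_i(\omega_j)-\hat\mu_i\big)\big(X_i(\omega_j)-\hat\mu_i\big)^\top, \tag{3}$$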

and then consider the following approximation to program (2):

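$$\text{minimize}\quad \frac{1}{n}\sum_{j\in[n]}\big\|A_1(X_1(\omega_j)-\hat\mu_1)-A_2(X_2(\omega_j)-\hat\mu_2)\big\|^2 \quad\text{subject to}\quad A_i\hat\Sigma_iA_i^\top=I_k,\quad i\in\{1,2\}. \tag{4}$$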

Observe that program (4) is equivalent to

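$$\text{minimize}\quad \frac{1}{n}\sum_{j\in[n]}\big\|A_1(X_1(\omega_j)-\hat\mu_1)-A_2(X_2(\omega_j)-\hat\mu_2)\big\|^2 \quad\text{subject to}\quad A_i\hat\Sigma_iA_i^\top=I_k,\quad \operatorname{im}A_i^\top\subseteq\operatorname{im}\hat\Sigma_i,\quad i\in\{1,2\}. \tag{5}$$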

Indeed, if is feasible in (4), then we can project the rows of onto without changing the objective value. Next, define , take to be any matrix whose columns form an orthonormal basis for , and define to be the matrix whose th column is . Then every solution of

$$\text{minimize}\quad \frac{1}{n}\|B_1Z_1-B_2Z_2\|_F^2 \quad\text{subject to}\quad B_iB_i^\top=I_k,\quad i\in\{1,2\} \tag{6}$$

can be transformed to a solution to program (5) by the change of variables , where . In fact, by this change of variables, programs (5) and (6) are equivalent. In the special case where , one may take without loss of generality, and then program (6) amounts to the well-known orthogonal Procrustes problem [5]. In general, we refer to (6) as the projection Procrustes problem; see Figure 1 for an illustration. Considering orthogonal Procrustes enjoys a spectral solution, there is little surprise that projection Procrustes also enjoys a spectral solution:

###### Lemma 1.

Suppose for both . If , then the projection Procrustes problem (6) is infeasible. Otherwise, select any -truncated singular value decomposition of . Then is a solution to (6).

###### Proof.

Since is a matrix, the constraint requires . Suppose , and consider any feasible point in program (6). Then

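$$\|B_iZ_i\|_F^2=\operatorname{tr}(Z_i^\top B_i^\top B_iZ_i)=\operatorname{tr}(B_i^\top B_iZ_iZ_i^\top)=n\operatorname{tr}(B_i^\top B_i)=n\operatorname{tr}(B_iB_i^\top)=nk,$$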

and so the objective is proportional to

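$$\begin{aligned}\|B_1Z_1-B_2Z_2\|_F^2 &=\|B_1Z_1\|_F^2-2\operatorname{tr}(Z_1^\top B_1^\top B_2Z_2)+\|B_2Z_2\|_F^2\\ &=2nk-2\operatorname{tr}\big((Z_2Z_1^\top)(B_1^\top B_2)\big)\ \ge\ 2nk-2\sum_{l\in[k]}\sigma_l(Z_2Z_1^\top),\end{aligned}$$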

where the last step applies the von Neumann trace inequality (see Section 7.4.1 in [5]). This inequality is saturated when the columns of and are leading left and right singular vectors of . ∎
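The lower bound in the proof can be checked numerically. The following is a minimal NumPy sketch, assuming the normalization $Z_iZ_i^\top=nI$ used in the first display of the proof; the helper names `whitened`, `projection_procrustes`, and `objective`, as well as the problem sizes, are hypothetical. It takes the rows of $B_1$ and $B_2$ to be the leading left and right singular vectors of $Z_1Z_2^\top$ and compares the resulting objective with $2nk-2\sum_{l\in[k]}\sigma_l$.

```python
import numpy as np

def whitened(r, n, rng):
    """Random r x n matrix Z with Z @ Z.T = n * I (normalization used in the proof)."""
    Q, _ = np.linalg.qr(rng.standard_normal((n, r)))  # n x r with orthonormal columns
    return np.sqrt(n) * Q.T

def projection_procrustes(Z1, Z2, k):
    """Spectral candidate: rows of B1 and B2 are the leading left and right
    singular vectors of Z1 @ Z2.T. Also returns the singular values."""
    U, s, Vt = np.linalg.svd(Z1 @ Z2.T, full_matrices=False)
    if k > len(s):
        raise ValueError("k exceeds min(r1, r2), so the problem is infeasible")
    return U[:, :k].T, Vt[:k, :], s

def objective(B1, B2, Z1, Z2):
    """Unnormalized objective ||B1 Z1 - B2 Z2||_F^2."""
    return np.linalg.norm(B1 @ Z1 - B2 @ Z2, "fro") ** 2

rng = np.random.default_rng(0)
n, r1, r2, k = 40, 6, 5, 3          # hypothetical sizes
Z1, Z2 = whitened(r1, n, rng), whitened(r2, n, rng)
B1, B2, s = projection_procrustes(Z1, Z2, k)
lower_bound = 2 * n * k - 2 * s[:k].sum()
value = objective(B1, B2, Z1, Z2)   # attains the lower bound up to rounding
```

Any other feasible pair, for example matrices with random orthonormal rows, should give an objective no smaller than `lower_bound`.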

As a consequence of Lemma 1, we now have a fast method to solve program (4), which we summarize in Algorithm 1; we refer to this algorithm as matching component analysis (MCA). (To be clear, given a matrix of rank , the thin singular value decomposition consists of and , both with orthonormal columns, and a diagonal matrix of the positive singular values of .) Recalling our application, we note that matching data is an expensive enterprise, and so we wish to solve program (4) using as few samples as possible. For this reason, we are interested in determining how many samples it takes for (4) to well approximate the original program (2). We summarize our study of MCA sample complexity in the remainder of this section.
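The steps leading from program (4) to program (6) can be sketched in NumPy as follows. This is a sketch under stated assumptions rather than a transcription of Algorithm 1: the function name `mca`, the relative tolerance `tol` used to decide the numerical rank of each empirical covariance, and the use of an eigendecomposition to obtain an orthonormal basis for $\operatorname{im}\hat\Sigma_i$ are all choices made here for illustration.

```python
import numpy as np

def mca(X1, X2, k, tol=1e-10):
    """Sketch of matching component analysis on matched samples.

    X1 is (n, d1) and X2 is (n, d2); row j of X1 is matched with row j of X2.
    Returns (A1, b1, A2, b2) so that g_i(x) = A_i @ x + b_i maps into R^k.
    """
    n = X1.shape[0]
    whiteners, means, Zs = [], [], []
    for X in (X1, X2):
        mu = X.mean(axis=0)                        # empirical mean, as in (3)
        Xc = X - mu
        cov = Xc.T @ Xc / n                        # empirical covariance, as in (3)
        evals, evecs = np.linalg.eigh(cov)
        keep = evals > tol * evals.max()           # numerical image of the covariance
        W = (evecs[:, keep] / np.sqrt(evals[keep])).T  # whitening map on that image
        whiteners.append(W)
        means.append(mu)
        Zs.append(W @ Xc.T)                        # satisfies Z @ Z.T = n * I
    if k > min(Z.shape[0] for Z in Zs):
        raise ValueError("k exceeds the rank of an empirical covariance")
    # Projection Procrustes step: leading singular vectors of Z1 @ Z2.T.
    U, _, Vt = np.linalg.svd(Zs[0] @ Zs[1].T, full_matrices=False)
    A1 = U[:, :k].T @ whiteners[0]
    A2 = Vt[:k, :] @ whiteners[1]
    return A1, -A1 @ means[0], A2, -A2 @ means[1]

# Toy check with two affine images of a shared latent variable (hypothetical sizes).
rng = np.random.default_rng(0)
n, m, d1, d2, k = 200, 4, 10, 7, 3
omega = rng.standard_normal((n, m))
X1 = omega @ rng.standard_normal((m, d1)) + rng.standard_normal(d1)
X2 = omega @ rng.standard_normal((m, d2)) + rng.standard_normal(d2)
A1, b1, A2, b2 = mca(X1, X2, k)
G1 = X1 @ A1.T + b1   # matched components from the first domain
G2 = X2 @ A2.T + b2   # matched components from the second domain
```

In this toy setting both domains are injective affine images of the same latent points, so the two sets of matched components should agree up to rounding, and each should have zero empirical mean and identity empirical covariance.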

### 2.1 Sample complexity of MCA approximation

Given a random , consider the covariances

$$\Sigma_{X_i}:=\mathbb{E}(X_i-\mathbb{E}X_i)(X_i-\mathbb{E}X_i)^\top$$

for . We are interested in minimizing

$$f_X(A)=f_X(A_1,A_2):=\mathbb{E}\|A_1(X_1-\mathbb{E}X_1)-A_2(X_2-\mathbb{E}X_2)\|_2^2$$

over the subset of defined by

$$S_X:=\{(A_1,A_2)\in V: A_i\Sigma_{X_i}A_i^\top=I,\ i\in\{1,2\}\}.$$

Given independent instances of , we may approximate the distribution of with the uniform distribution over these independent instances, producing the random vector . Notice that has mean and covariance , as defined in (3). We therefore have the following convenient expressions for (2) and (4):

$$(2)=\min_{A\in S_X}f_X(A),\qquad (4)=\min_{A\in S_{\hat X}}f_{\hat X}(A).$$

The following is our first result on MCA sample complexity:

###### Theorem 2.

Fix . There exists such that the following holds: Suppose almost surely and . Then for every , it holds that

$$\Big|\min_{A\in S_{\hat X}}f_{\hat X}(A)-\min_{A\in S_X}f_X(A)\Big|\le\epsilon\cdot\frac{\beta^2}{\sigma^2}$$

in an event of probability , provided

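$$n\ge C\left((d_1+d_2)\cdot\frac{k}{\epsilon^2}\log\Big(\frac{k}{\epsilon^2}\Big)+\Big(\frac{\beta}{\epsilon\sigma}\Big)^4\cdot\log(d_1+d_2)\right).$$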

Note that the boundedness assumption is reasonable in practice since, for example, black-and-white images have pixel values that range from 0 to 255. Also, we may assume without loss of generality by restricting to the image of if necessary. We prove this theorem in Section 5 using ideas from matrix analysis and high dimensional probability.

### 2.2 Conditions for exact matching

Next, we consider a family of random vectors that are particularly well suited for matching component analysis. Suppose our probability space takes the form for some unknown . We say is an affine linear random vector if there exists and such that for every . While every random vector can be viewed as an affine linear random vector over the appropriate probability space, we will be interested in relating two affine linear random vectors over a common probability space. Since and are both unknown, we may assume without loss of generality that has zero mean and identity covariance in , and so has mean and covariance .

Let and be affine linear random vectors, and suppose we encounter affine linear functions and such that . Then determines up to a coset of some subspace , and the smaller this subspace is, the better we can predict . As one might expect, there is a limit to how small can be:

###### Lemma 3.

Suppose for each . Then implies , which in turn implies .

###### Proof.

Suppose . Since

$$(A_iX_i(\omega)+b_i)-(A_iX_i(0)+b_i)=A_iS_i\omega,$$

it follows that . For each , we have , and so . Since is closed under addition, the result follows. ∎

###### Definition 4.

Given , the corresponding affine linear model receives a distribution over some real vector space and returns the random function

$$E_P:(S_1,\mu_1,S_2,\mu_2)\mapsto\{S_i\omega_j+\mu_i\}_{i\in\{1,2\},\,j\in[n]}$$

defined over all and , and with drawn independently with distribution . We say is exactly matchable if there exists a measurable function

$$D:\{x_{ij}\}_{i\in\{1,2\},\,j\in[n]}\mapsto(A_1,b_1,A_2,b_2)$$

such that for every , every continuous probability distribution over , and every input , the random tuple

$$(A_1,b_1,A_2,b_2):=(D\circ E_P)(S_1,\mu_1,S_2,\mu_2)$$

almost surely satisfies both

• for all , and

• .

Our second result on MCA sample complexity provides a sharp phase transition for the affine linear model to be exactly matchable:

###### Theorem 5.
(a) If , then is exactly matchable.

(b) If , then is not exactly matchable.

In particular, we use MCA to define a witness for Theorem 5(a). We prove this theorem in Section 6 using ideas from matrix analysis and algebraic geometry.

## 3 Experiments

In this section, we perform several experiments to evaluate the efficacy of matching component analysis for transfer learning (see Table 1 for a summary). For each experiment, in order to produce a matching set, we take an example set of labeled points from the testing domain and match them with members of the conventional training set. (While the example set resides in the testing domain, it is disjoint from the test set in all of our experiments.) Each experiment is described by the following features; see Figure 2 for an illustration.

• training domain. Space where the conventional training set resides.

• testing domain. Space where the example and test sets reside.

• match. Method used to identify a matching set, which is comprised of pairs of points from the conventional training and example sets.

• n. Size of example set.

• r. Number of points from the conventional training set that are matched to each member of the example set, producing a matching set of size . (While our theory assumes , we find that taking is sometimes helpful in practice.)

• k. Parameter selected for matching component analysis.

For each experiment, we run MCA to find affine linear mappings to the common domain , and then we train a k-nearest neighbor (k-NN) classifier in this domain on the conventional training set, and we test by first mapping the test set into the common domain. For comparison, we consider two different baselines, which we denote by BL1 and BL2. For BL1, we train a k-NN classifier on the example set (whose size is only ) and test on the test set. For BL2, we train a k-NN classifier on the conventional training set (which resides in the training domain ) and test on the test set (which resides in the testing domain ). This latter baseline is possible whenever , which occurs in all of our experiments. In order to isolate the performance of MCA in our experiments, we set the number of neighbors to be 10 for all of our k-NN classifiers.
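The evaluation pipeline described above (map both sets into the common domain, then classify with 10 nearest neighbors) can be sketched as follows. This is an illustrative sketch only: the helper names `to_common_domain` and `knn_predict` are hypothetical, the data are synthetic, and the affine maps below are simple stand-ins for the maps that MCA would return.

```python
import numpy as np

def to_common_domain(X, A, b):
    """Apply the affine map x -> A @ x + b to every row of X."""
    return X @ A.T + b

def knn_predict(train_feats, train_labels, test_feats, num_neighbors=10):
    """Majority vote among the num_neighbors nearest training points (Euclidean)."""
    sq_dists = ((test_feats[:, None, :] - train_feats[None, :, :]) ** 2).sum(axis=2)
    nearest = np.argsort(sq_dists, axis=1)[:, :num_neighbors]
    return np.array([np.bincount(votes).argmax() for votes in train_labels[nearest]])

# Synthetic data: two well-separated classes in a 2-D latent space, observed
# through two different linear maps (hypothetical stand-ins for the two domains).
rng = np.random.default_rng(0)
centers = np.array([[-5.0, 0.0], [5.0, 0.0]])
y_train = rng.integers(0, 2, size=200)
y_test = rng.integers(0, 2, size=50)
latent_train = centers[y_train] + 0.5 * rng.standard_normal((200, 2))
latent_test = centers[y_test] + 0.5 * rng.standard_normal((50, 2))
S1, S2 = rng.standard_normal((6, 2)), rng.standard_normal((4, 2))
mu2 = rng.standard_normal(4)
X_train = latent_train @ S1.T          # conventional training set (training domain)
X_test = latent_test @ S2.T + mu2      # test set (testing domain)

# Stand-ins for the maps MCA would return; here they simply invert S1 and S2.
A1, b1 = np.linalg.pinv(S1), np.zeros(2)
A2 = np.linalg.pinv(S2)
b2 = -A2 @ mu2

pred = knn_predict(to_common_domain(X_train, A1, b1), y_train,
                   to_common_domain(X_test, A2, b2))
accuracy = (pred == y_test).mean()
```

The classifier is trained only on points from the training domain and evaluated only on points from the testing domain; the two meet in the common domain.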

In half of the experiments we consider, we are given a matching set with , and in the other experiments, we are only given an example set. In this latter case, we have the luxury of selecting , and in both cases, we have the additional luxury of selecting . We currently do not have a rule of thumb for selecting these parameters, although we observe that overall performance is sensitive to the choice of parameters. See Section 4 for more discussion along these lines.

### 3.1 Transfer learning from MNIST to MNIST

For our first experiment, we tested the performance of the MCA algorithm in a seemingly trivial case: when the training and testing domains are identical. Of course, the MCA algorithm should not outperform the baseline BL2 in this simple case. However, this setup allows us to isolate the impact of using different matching procedures.

We partitioned the training set of 60,000 MNIST digits into two subsets of equal size. We arbitrarily chose the first 30,000 to represent the training domain, and interpreted the remaining 30,000 points as members of the testing domain. We then matched of the points from the testing domain with of their nearest neighbors (in the Euclidean sense) in the training domain with the same label. For a cheaper alternative, we also tried matching with randomly selected members of the training domain that have the same label.
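The two matching procedures described above can be sketched as follows. This is a minimal sketch with hypothetical function names and toy data; it returns index pairs (training index, example index) and is not the code used in the experiments.

```python
import numpy as np

def match_nearest_same_label(train_X, train_y, example_X, example_y, r=1):
    """Pair each example point with its r nearest training points (Euclidean)
    among those sharing its label. Returns an array of (train_idx, example_idx)."""
    pairs = []
    for j, (x, y) in enumerate(zip(example_X, example_y)):
        candidates = np.flatnonzero(train_y == y)
        dists = np.linalg.norm(train_X[candidates] - x, axis=1)
        for i in candidates[np.argsort(dists)[:r]]:
            pairs.append((int(i), j))
    return np.array(pairs)

def match_random_same_label(train_y, example_y, r=1, rng=None):
    """Cheaper alternative: pair each example point with r randomly selected
    training points that share its label."""
    rng = np.random.default_rng() if rng is None else rng
    pairs = []
    for j, y in enumerate(example_y):
        candidates = np.flatnonzero(train_y == y)
        for i in rng.choice(candidates, size=r, replace=False):
            pairs.append((int(i), j))
    return np.array(pairs)

# Toy data (hypothetical): five training points and two example points.
train_X = np.array([[0.0], [1.0], [2.0], [10.0], [11.0]])
train_y = np.array([0, 0, 0, 1, 1])
example_X = np.array([[1.9], [10.2]])
example_y = np.array([0, 1])
nearest_pairs = match_nearest_same_label(train_X, train_y, example_X, example_y, r=2)
random_pairs = match_random_same_label(train_y, example_y, r=2,
                                       rng=np.random.default_rng(0))
```

Both procedures respect the labels, but only the first uses the geometry of the data.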

As expected, MCA does not outperform the classifier trained on the entire training set (BL2). However, with sufficiently many matches, MCA is able to find a low-dimensional embedding of that still allows for accurate classification of digits. Judging by the poor performance of the label-based matching, these experiments further illustrate the importance of a thoughtful matching procedure. In general, when label classes exhibit large variance and yet the matching is determined by label information alone, we observe that MCA often fails to identify a common domain that allows for transfer learning.

### 3.2 Transfer learning from cropped MNIST to pixelated MNIST

Our second experiment replicates the affine linear setup from Subsection 2.2. Here, we view the MNIST dataset as a subset of a probability space with distributed uniformly over . Next, we linearly transform the MNIST dataset by applying two different maps . In particular, crops a given image to the middle portion, while forms a pixelated version of the original image by averaging over each block; see Figure 3 for an illustration. We interpret the cropped images as belonging to the training domain and the pixelated images to the testing domain. Notice that this setup delivers a natural matching between members of both domains, i.e., is matched with for every ; as such, . We evaluate the performance of MCA against the baselines with both and . These experiments are noteworthy because MCA beats both baselines for both small and large values of . We credit this behavior to the affine linear setup, since in general, we find that MCA beats BL1 only when is small. See Figure 3 for a visualization of the information captured in the common domain.
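Cropping and block averaging are both linear maps on vectorized images, which is what places this experiment in the affine linear setup. The sketch below builds explicit matrices for the two maps. The crop size (14) and block size (2) are assumptions chosen only for illustration, and the function names are hypothetical.

```python
import numpy as np

def crop_matrix(side, crop):
    """Matrix S with S @ img.ravel() equal to the flattened central crop x crop window."""
    start = (side - crop) // 2
    idx = np.arange(side * side).reshape(side, side)
    keep = idx[start:start + crop, start:start + crop].ravel()
    S = np.zeros((crop * crop, side * side))
    S[np.arange(crop * crop), keep] = 1.0
    return S

def pixelate_matrix(side, block):
    """Matrix S with S @ img.ravel() equal to the flattened block-averaged image."""
    out = side // block
    idx = np.arange(side * side).reshape(side, side)
    S = np.zeros((out * out, side * side))
    for a in range(out):
        for b in range(out):
            cells = idx[a * block:(a + 1) * block, b * block:(b + 1) * block].ravel()
            S[a * out + b, cells] = 1.0 / block**2
    return S

rng = np.random.default_rng(0)
img = rng.random((28, 28))        # stand-in for a 28 x 28 digit image
S1 = crop_matrix(28, 14)          # assumed crop size
S2 = pixelate_matrix(28, 2)       # assumed block size
x1 = S1 @ img.ravel()             # cropped image, training domain
x2 = S2 @ img.ravel()             # pixelated image, testing domain
```

Applying both matrices to the same image yields a naturally matched pair, one point in each domain.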

### 3.3 Transfer learning from computer fonts to MNIST

For this experiment, we attempted transfer learning from the computer font (CF) digits provided in [1] to MNIST digits. While the MNIST digits are , the CF digits are . In order to put both into a common domain, we resized both datasets to be ; see Figure 4 for an illustration. Interestingly, resizing MNIST in this way makes BL1 succeed with even modest values of . In order to make MCA competitive, we decided to focus on binary classification tasks, specifically, classifying 2 vs. 5, 0 vs. 1, and 4 vs. 9. To identify a matching between CF and MNIST digits, we looked for CF digits that were closest to each of the MNIST digits in the Euclidean distance. (For runtime considerations, we first selected 5,000 out of the 56,443 computer fonts that tended to be close to MNIST digits, and then limited our search to digits in these fonts.) Since we used the Euclidean distance for matching, it comes as no surprise that BL2 outperforms MCA. While Table 1 details the case, Figure 4 illustrates performance for each . Surprisingly, the performance of MCA drops for larger values of . We discuss this further in Section 4.

### 3.4 Transfer learning with the SAMPLE dataset

Finally, we consider transfer learning with the Synthetic and Measured Paired and Labeled (SAMPLE) database of computer-simulated and real-world SAR images [13]. The publicly-available SAMPLE database consists of 1366 paired images of 10 different vehicles, each pair consisting of a real-world SAR image and a corresponding computer-simulated SAR image; see Figure 5 for an illustration.

In this experiment, the training domain corresponds to simulated data, and the testing domain corresponds to real-world data. The training set consists of 80% of the simulated set of SAMPLE images, of which are matched with corresponding real-world data. The test set consists of the real-world data corresponding to the withheld 20% of the simulated training set. In this case, MCA substantially outperforms both BL1 and BL2; see Figure 5 for a depiction of the normalized confusion matrices in these cases. We note that BL2 is similar to the SAR classification challenge problem outlined in [13] and [20], where a small convolutional neural network (CNN) achieved 24% accuracy, and a densely connected CNN achieved 55% accuracy. Impressively, by mapping to the common domain identified by MCA, we can simply use a k-NN classifier and increase performance to 87%.

## 4 Discussion

This paper introduced matching component analysis (MCA, Algorithm 1) as a method for identifying features in data that are appropriate for transfer learning. In this section, we reflect on our observations and identify various opportunities for future work.

The theory developed in this paper concerned the sample complexity of MCA. The fundamental question to answer is

How large of a matching set is required to perform high-accuracy transfer learning?

In order to isolate the performance of MCA, our theory does not rely on the choice of the classifier, and because of this, our sample complexity results rely on different proxies for success. Overall, a different approach is needed to answer the above question.

Like many algorithms in machine learning, MCA requires the user to select a parameter, namely, . We currently do not have a rule of thumb for selecting this parameter. Also, one should expect that a larger matching set will only help with transfer learning, but some of our experiments seem to suggest that MCA behaves worse given more matches (see Figure 4, for example). While we do not understand this behavior, one can get around this by partitioning the matching set into batches, training a weak classifier on each batch, and then boosting. The drop in performance might reflect the fact that MCA is oblivious to the data labels, suggesting a label-aware alternative (cf. PCA vs. SqueezeFit [15]). The performance drop might also reflect our choice of affine linear maps and Euclidean distances, suggesting alternatives involving non-linear maps and other distances.

As one would expect, transfer learning is more difficult when the matching set is poorly matched. Indeed, we observed this when transfer learning from MNIST to MNIST using two different matching techniques. In practice, it is expensive to find a good matching set. For example, for the SAMPLE dataset [13], it took two years of technical expertise to generate accurate computer-simulated matches. In general, one might attempt to automate the matching process with an algorithm such as GHMatch [25], but we find that runtimes are slow for even moderately large datasets; e.g., it takes several minutes to match datasets with more than 50 points. Overall, finding a matching set appears to be a bottleneck, akin to finding labels for a training set. As an alternative, it would be interesting to instead develop theory that allows for transfer learning given non-matched data in both domains without having to first match the data.

## 5 Proof of Theorem 2

It is convenient to define the diagonal operator

$$D:=\begin{bmatrix}I_{d_1}&0\\0&-I_{d_2}\end{bmatrix}$$

so that our objective function takes the form

In what follows, we let denote the norm on defined by

$$\|(A_1,A_2)\|_V:=\max\{\|A_1\|_{2\to2},\|A_2\|_{2\to2}\}.$$

This determines a Hausdorff distance between nonempty subsets of . Throughout, we denote . Our approach is summarized in the following lemma:

###### Lemma 6.

Let be random vectors such that

• ,

• are both -Lipschitz, and

• for every .

Then .

###### Proof.

Without loss of generality, it holds that . Let denote an optimizer for . By (i), there exists such that , and then by (ii), it holds that . As such,

$$\Big|\min_{A\in S_X}f_X(A)-\min_{A\in S_Y}f_Y(A)\Big|\le f_X(B)-f_Y(A^\star)\le L\epsilon_1+f_X(A^\star)-f_Y(A^\star)\le L\epsilon_1+\epsilon_2,$$

where the last step applies (iii). ∎

As such, it suffices to show that and satisfy Lemma 6(i)–(iii). In order to verify Lemma 6(i), it is helpful to have a bound on the members of :

###### Lemma 7.

Suppose . If , then .

###### Proof.

First, we observe that

$$1=\|I\|_{2\to2}=\|A\Sigma A^\top\|_{2\to2}=\|\Sigma^{1/2}A^\top\|_{2\to2}^2.$$

Next, select a unit vector such that . Then

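$$\|\Sigma^{1/2}A^\top\|_{2\to2}\ge\|\Sigma^{1/2}A^\top x\|_2\ge\lambda_{\min}(\Sigma^{1/2})\cdot\|A^\top x\|_2=\lambda_{\min}(\Sigma^{1/2})\cdot\|A\|_{2\to2}.$$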

The result then follows by combining and rearranging the above estimates. ∎
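Combining the two displays gives $1\ge\lambda_{\min}(\Sigma)\cdot\|A\|_{2\to2}^2$, that is, $\|A\|_{2\to2}\le\lambda_{\min}(\Sigma)^{-1/2}$ whenever $A\Sigma A^\top=I$. The following is a minimal numerical sketch of this bound; the dimensions and the construction of $\Sigma$ and $A$ are hypothetical choices made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 8, 3                                     # hypothetical sizes
M = rng.standard_normal((d, d))
Sigma = M @ M.T + 0.1 * np.eye(d)               # positive definite covariance
evals, evecs = np.linalg.eigh(Sigma)
inv_sqrt = (evecs / np.sqrt(evals)) @ evecs.T   # Sigma^{-1/2}

# Any A = Q^T Sigma^{-1/2} with orthonormal columns in Q satisfies A Sigma A^T = I_k.
Q, _ = np.linalg.qr(rng.standard_normal((d, k)))
A = Q.T @ inv_sqrt

spectral_norm = np.linalg.norm(A, 2)            # operator norm of A
bound = 1.0 / np.sqrt(evals.min())              # lambda_min(Sigma)^{-1/2}
```

Here `spectral_norm` should not exceed `bound`, with equality only when the row space of $A$ contains an eigenvector of $\Sigma$ for its smallest eigenvalue.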

###### Lemma 8.

Suppose for both . Then

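$$\operatorname{dist}(S_X,S_Y)^2\le\max_{i\in\{1,2\}}\frac{\|\Sigma_{X_i}-\Sigma_{Y_i}\|_{2\to2}}{\lambda_{\min}(\Sigma_{X_i})\cdot\lambda_{\min}(\Sigma_{Y_i})}.$$
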
###### Proof.

Define the function by

$$g_{XY}(A_1,A_2):=\big(A_1\Sigma_{X_1}^{1/2}\Sigma_{Y_1}^{-1/2},\ A_2\Sigma_{X_2}^{1/2}\Sigma_{Y_2}^{-1/2}\big).$$

Observe that maps every point to a point in :

$$\big(A_i\Sigma_{X_i}^{1/2}\Sigma_{Y_i}^{-1/2}\big)\Sigma_{Y_i}\big(A_i\Sigma_{X_i}^{1/2}\Sigma_{Y_i}^{-1/2}\big)^\top=A_i\Sigma_{X_i}A_i^\top=I.$$

Furthermore, for every , we may apply sub-multiplicativity, Lemma 7, and then Theorem X.1.1 in [2] to obtain

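$$\begin{aligned}\big\|A_i\Sigma_{X_i}^{1/2}\Sigma_{Y_i}^{-1/2}-A_i\big\|_{2\to2}^2 &=\big\|A_i\big(\Sigma_{X_i}^{1/2}-\Sigma_{Y_i}^{1/2}\big)\Sigma_{Y_i}^{-1/2}\big\|_{2\to2}^2\\ &\le\|A_i\|_{2\to2}^2\cdot\big\|\Sigma_{X_i}^{1/2}-\Sigma_{Y_i}^{1/2}\big\|_{2\to2}^2\cdot\big\|\Sigma_{Y_i}^{-1/2}\big\|_{2\to2}^2\\ &\le\frac{\big\|\Sigma_{X_i}^{1/2}-\Sigma_{Y_i}^{1/2}\big\|_{2\to2}^2}{\lambda_{\min}(\Sigma_{X_i})\cdot\lambda_{\min}(\Sigma_{Y_i})}\le\frac{\|\Sigma_{X_i}-\Sigma_{Y_i}\|_{2\to2}}{\lambda_{\min}(\Sigma_{X_i})\cdot\lambda_{\min}(\Sigma_{Y_i})}.\end{aligned}$$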

Maximizing over produces an upper bound on . By symmetry, the same bound holds for , implying the result. ∎
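The two facts used in this proof can be checked numerically for a single index $i$: the map $A\mapsto A\Sigma_X^{1/2}\Sigma_Y^{-1/2}$ sends a matrix satisfying $A\Sigma_XA^\top=I$ to one satisfying the same constraint with $\Sigma_Y$, and it moves $A$ by at most the stated amount. The sketch below uses hypothetical dimensions and randomly generated covariances; the helper name `sqrt_and_inv_sqrt` is also an assumption.

```python
import numpy as np

def sqrt_and_inv_sqrt(S):
    """Return S^{1/2}, S^{-1/2}, and lambda_min(S) for a symmetric positive definite S."""
    evals, evecs = np.linalg.eigh(S)
    root = (evecs * np.sqrt(evals)) @ evecs.T
    inv_root = (evecs / np.sqrt(evals)) @ evecs.T
    return root, inv_root, evals.min()

rng = np.random.default_rng(1)
d, k = 6, 2                                    # hypothetical sizes
M = rng.standard_normal((d, d))
N = rng.standard_normal((d, d))
Sigma_X = M @ M.T + 0.5 * np.eye(d)
Sigma_Y = Sigma_X + 0.05 * (N @ N.T)           # a nearby positive definite covariance

root_X, inv_root_X, lam_X = sqrt_and_inv_sqrt(Sigma_X)
root_Y, inv_root_Y, lam_Y = sqrt_and_inv_sqrt(Sigma_Y)

Q, _ = np.linalg.qr(rng.standard_normal((d, k)))
A = Q.T @ inv_root_X                           # satisfies A Sigma_X A^T = I
B = A @ root_X @ inv_root_Y                    # image of A under the map above

lhs = np.linalg.norm(B - A, 2) ** 2
rhs = np.linalg.norm(Sigma_X - Sigma_Y, 2) / (lam_X * lam_Y)
```

Here `B` should satisfy the constraint for `Sigma_Y`, and `lhs` should not exceed `rhs`.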

Overall, for Lemma 6(i), it suffices to have spectral control over the covariance. In the special case where , we will accomplish this with the help of Matrix Hoeffding [14]. Before doing so, we consider Lemma 6(ii):

###### Lemma 9.

For every , it holds that .

###### Proof.

Select a unit vector such that . Then the triangle and Cauchy–Schwarz inequalities together give

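$$\begin{aligned}\|A\|_{2\to2}=\|A_1x_1+A_2x_2\|_2 &\le\|A_1\|_{2\to2}\|x_1\|_2+\|A_2\|_{2\to2}\|x_2\|_2\\ &\le\big(\|A_1\|_{2\to2}^2+\|A_2\|_{2\to2}^2\big)^{1/2}\big(\|x_1\|_2^2+\|x_2\|_2^2\big)^{1/2}\\ &\le\sqrt{2}\cdot\max_{i\in\{1,2\}}\|A_i\|_{2\to2}=\sqrt{2}\cdot\|A\|_V.\end{aligned}$$

∎
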
###### Lemma 10.

Suppose almost surely. Then is -Lipschitz.

###### Proof.

Put so that , and select any . Then