## 1. Introduction

How to adequately quantify a system of interest by assembling the available information from multiple datasets collected simultaneously from different sensors is a long-standing and commonly encountered problem in data science, commonly referred to as the *sensor fusion* problem [20, 30, 19, 49]. While the simplest way to “fuse” the information is a simple concatenation of the available information from each sensor, it is rarely the best or most efficient approach. To design a better and more efficient fusion algorithm, researchers usually face several challenges: the sensors might be heterogeneous, datasets from different sensors might not be properly aligned, and the datasets might be high dimensional and noisy, to name but a few. Roughly speaking, researchers are interested in extracting the common components (information), if any, shared by the different sensors.

A lot of effort has been invested in finding a satisfactory algorithm based on various models. Historically, when we can safely assume a linear structure in the common information shared by different sensors, the most typical algorithms for handling this problem are canonical correlation analysis (CCA) [24] and its descendants [23, 20, 25], which is a far from complete list. In the modern data analysis era, due to advances in sensor development and the growing complexity of problems, researchers may need to take the nonlinear structure of the datasets into account to better understand them. To handle this nonlinear structure, several nonlinear sensor fusion algorithms have been developed, for example, nonparametric canonical correlation analysis (NCCA) [39], alternating diffusion (AD) [31, 45] and its generalization [42], time coupled diffusion maps [38], multiview diffusion maps [34], etc. See [42] for a recent and more thorough list of available nonlinear tools and [50] for a recent review. The main idea behind these developments is that the nonlinear structure is modeled by various nonlinear geometric structures, and the algorithms are designed to preserve and capture this structure. Such ideas and algorithms have been successfully applied to many real world problems, like audio-visual voice activity detection [10], the study of sequential audio-visual correspondence [9], automatic sleep stage annotation from two electroencephalogram signals [35], seismic event modeling [33], fetal electrocardiogram analysis [42] and IQ prediction from two fMRI paradigms [48], which is a far from complete list.

While these kernel-based sensor fusion algorithms have been developed and applied for a while, there are still several gaps on the way toward a solid practical application and a sound theoretical understanding of these tools. One important gap is understanding how the inevitable noise, particularly when the data dimension is high, impacts the kernel-based sensor fusion algorithms. For example, can we be sure that the obtained fused information is really informative, particularly when the datasets are noisy or when one sensor is broken? If the signal-to-noise ratios of the two sensors are different, how does the noise impact the information captured by these kernel-based algorithms? To our knowledge, the developed kernel-based sensor fusion algorithms do not account for how the noise interacts with the algorithm, and most theoretical understanding is based on the nonlinear data structure without considering the impact of high dimensional noise, except for a recent effort in the null case [7]. In this paper, we focus on one specific challenge among many; that is, we study how high dimensional noise impacts the spectrum of two kernel-based sensor fusion algorithms, NCCA and AD, in the non-null setup when there are two sensors.

We briefly recall the NCCA and AD algorithms. Consider two noisy point clouds, and .
For some bandwidths and some fixed constant chosen by the user, we consider two *affinity matrices*, and , defined as

(1) |

where . Here, and are related to the point clouds and respectively.
Denote the associated *degree matrices* and which are diagonal matrices such that

(2) |

Moreover, denote the transition matrices as

The NCCA and AD matrices are defined as

(3) |

respectively. Note that in the current paper, for simplicity, we focus our study on Gaussian kernels; more general kernel functions are left for future work. Usually, the top few eigenpairs of and are used as features of the extracted common information shared by the two sensors. We emphasize that in general, while and are diagonalizable, and are not. Theoretically, however, we can obtain the top few eigenpairs without trouble under the common manifold model [45, 42], since asymptotically and both converge to self-adjoint operators. To avoid this issue, researchers also consider the singular value decomposition (SVD) of and . Another important fact is that usually we are interested in the case when and are aligned; that is, and are sampled from the same system at the same time. However, the algorithm can be applied to any two datasets of the same size, although this is not our concern in this paper.

### 1.1. Some related works

In this subsection, we summarize some related results. Since the NCCA and AD matrices (3) are essentially products of transition matrices, we start by summarizing the results for the affinity and transition matrices when there is only one sensor. On the one hand, in the noiseless setting, the spectral properties have been widely studied, for example, in [2, 21, 22, 44, 18, 11], to name but a few. In summary, under the manifold model, researchers show that the graph Laplacian (GL) converges to the Laplace–Beltrami operator in various settings with a properly chosen bandwidth. On the other hand, the spectral properties have been investigated in [4, 3, 8, 14, 13, 17, 28] under the null setup. These works essentially show that when contains pure high-dimensional noise, the affinity and transition matrices are governed by a low-rank perturbed Gram matrix when the bandwidth . Despite the rich literature on these two extreme setups, limited results are available in the intermediate, or non-null, setup [12, 6, 15]. For example, when the signal-to-noise ratio (SNR), which will be defined precisely later, is sufficiently large, the spectral properties of the GL constructed from the noisy observations are close to those constructed from the clean signal. Moreover, the bandwidth plays an important role in the non-null setup. For a more comprehensive review and a sophisticated study of the spectral properties of the affinity and transition matrices for an individual point cloud, we refer the readers to [6, Sections 1.2 and 1.3].

For the NCCA and AD matrices, on the one hand, in the noiseless setting, there have been several results under the common manifold model [32, 46]. On the other hand, under the null setup where both sensors capture only high dimensional white noise, the spectral properties have been studied recently [7]. Specifically, except for a few large outliers, when and the edge eigenvalues of or converge to some deterministic limit depending on the free convolution (c.f. Definition 2.3) of two Marchenko–Pastur (MP) laws [37]. However, in the non-null setting when both sensors are contaminated by noise, to our knowledge, there does not exist any theoretical study, particularly under the high dimensional setup.

### 1.2. An overview of our results

We now provide an overview of our results. The main contribution of this paper is a comprehensive study of NCCA and AD in the non-null case in the high dimensional setup. This result can be viewed as a continuation of the study of the null case [7]. We focus on the setup where the signal is modeled by a low dimensional manifold. It turns out that this problem can be recast as studying the algorithms under the commonly applied spiked model, which will be made clear later. In addition to providing a theoretical justification based on kernel random matrix theory, we propose a method to choose the bandwidth adaptively. Moreover, peculiar and counterintuitive results will be presented when the two sensors behave differently, which emphasizes the importance of applying these algorithms carefully in practice. In Section 3, we investigate the eigenvalues of the NCCA and AD matrices when and , which is a common choice in the literature. The behavior of the eigenvalues varies according to the SNRs of both point clouds. When both SNRs are small, the spectral behavior of and is like that in the null case, while both the number of outliers and the convergence rates rely on the SNRs; see Theorem 3.1 for details. Furthermore, if one of the sensors has a large SNR and the other one has a small SNR, the eigenvalues of and provide limited information about the signal; see Theorem 3.2 for details. We emphasize that this result warns us that directly applying NCCA and AD without any sanity check may result in a misleading conclusion. When both SNRs are larger, the eigenvalues are close to those of the clean NCCA and AD matrices; see Theorem 3.3 for more details. It is clear that the classic bandwidth choices for are inappropriate when the SNR is large, since the bandwidth is too small compared with the signal strength. In this case , and we obtain limited information about the signal; see (42) for details. To handle this issue, in Section 4, we consider bandwidths that are adaptively chosen according to the dataset. With this choice, when the SNRs are large, NCCA and AD become non-trivial and informative; that is, NCCA and AD are robust against high dimensional noise. See Theorem 4.1 for details.

**Conventions.** The fundamental large parameter is and we always assume that and are comparable to and depend on . We use to denote a generic positive constant, whose value may change from one line to the next. Similarly, we use , , , etc., to denote generic small positive constants. If a constant depends on a quantity , we use or to indicate this dependence. For two quantities and depending on , the notation means that for some constant , and means that for some positive sequence as . We also use the notations if , and if and . For a matrix , indicates the operator norm of , and means for some constant . Finally, for a random vector we say it is sub-Gaussian if for any deterministic vector , we have .

The paper is organized as follows. In Section 2, we introduce the mathematical setup and some background in random matrix theory. In Section 3, we state our main results for the classic choice of bandwidth. In Section 4, we state the main results for the adaptively chosen bandwidth. In Section 5, we offer the technical proofs of the main results. In Appendix 5.1, we provide and prove some preliminary results which will be used in the technical proofs.

## 2. Mathematical framework and background

### 2.1. Mathematical framework

We focus on the following model for the datasets and . Assume that the first sensor samples i.i.d. clean signals from a sub-Gaussian random vector , denoted as , where is a probability space. Similarly, assume that the second sensor also samples i.i.d. clean signals from a sub-Gaussian random vector , denoted as . Since we focus on the pairwise distances between samples, without loss of generality, we assume that

(4) |

Denote and , and to simplify the discussion, we assume that and admit the following spectral decomposition

(5) |

where and are fixed integers. We model the common information by assuming that there exists a bijection so that

(6) |

that is, we have for any . In practice, the clean signals and are contaminated by two sequences of i.i.d. sub-Gaussian noise and , respectively, so that the data generating process follows

(7) |

where

(8) |

We further assume that and are independent of each other and also independent of and . We are mainly interested in the high dimensional setting; that is, and are comparably as large as . More specifically, we assume that there exists some small constant such that

(9) |

The SNRs in our setting are defined as and respectively, so that for all and ,

(10) |

for some constants . To avoid repetitions, we summarize the assumptions as follows.

In view of (5), the model (7) for each sensor is related to the spiked covariance matrix models [27]. We comment that this seemingly simple model, particularly (5), includes the commonly considered nonlinear common manifold model. In the literature, the common manifold model means that the two sensors sample simultaneously from one low dimensional manifold; that is, and is an identity map, where is a low dimensional smooth and compact manifold embedded in the high dimensional Euclidean space. Since we are interested in kernel matrices depending on pairwise distances, which are invariant to rotation, when combined with Nash’s embedding theorem, the common manifold can be assumed to be supported on the first few axes of the high dimensional space, as in (5). As a result, the common manifold model becomes a special case of the model (7). We refer readers to [6] for a detailed discussion of this relationship. A special example of the common manifold model is the widely considered linear subspace as the common component; that is, when embedded in for . In this case, we could simply apply CCA to estimate the common component, and its behavior in the high dimensional setup has been studied in [1, 36].

We should emphasize that through the analysis of NCCA and AD under the common component model satisfying Assumption 2.1, we do not claim that we could understand the underlying manifold structure. The problem we are asking about here is the nontrivial relationship between the noisy and clean affinity and transition matrices, while the problem of exploring the manifold structure from clean datasets [18, 11] is a different one, usually understood as the manifold learning problem. To study the nontrivial relationship between the noisy and clean affinity and transition matrices, it is the spiked covariance structure that we focus on, not the possibly non-trivial . By establishing this nontrivial relationship in this paper, when combined with the knowledge of manifold learning via kernel-based manifold learning algorithms on clean datasets [31, 46, 43], we know how to explore the common manifold structure, which depends on , from the noisy datasets.
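To fix ideas, here is a minimal sketch of a data generating process of the form (7) under the common manifold model: a low dimensional common signal (here a circle, shared through the identity bijection) supported on the first coordinates as in (5), contaminated by independent high dimensional noise. The particular manifold, dimensions, and signal strengths are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 300          # sample size
p, q = 400, 350  # ambient dimensions of the two sensors, comparable to n as in (9)

# Common signal: points on a circle, shared by both sensors through the
# bijection (here the identity map, as in the common manifold model).
theta = rng.uniform(0.0, 2.0 * np.pi, n)
signal = np.column_stack([np.cos(theta), np.sin(theta)])

snr1, snr2 = 4.0, 0.25   # per-sensor signal strengths (illustrative)

X = np.zeros((n, p))
Y = np.zeros((n, q))
X[:, :2] = np.sqrt(snr1) * signal   # clean signal supported on the first axes, cf. (5)
Y[:, :2] = np.sqrt(snr2) * signal
X += rng.standard_normal((n, p))    # i.i.d. noise, independent across sensors
Y += rng.standard_normal((n, q))
```

Here the rows of `X` and `Y` are aligned, as required by the sensor fusion setup, while the two sensors observe the same circle with different signal strengths.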

###### Remark 2.2.

While it is not our focus in this paper, we should mention that our model includes the case where the datasets captured by the two sensors are not exactly on one manifold , but on two manifolds that are diffeomorphic to [43]. Specifically, the first sensor samples points from , while the second sensor simultaneously samples points from , where and are both diffeomorphisms and ; that is, in (6). Note that in this case, might be different from . Moreover, the samples from the two sensors can be more general. For example, in [46], the principal bundle structure is considered to model the “nuisance”, which can be understood as “deterministic noise”, and in [31] a metric space is considered as the common component. While it is possible to consider a more complicated model, since we are interested in studying how noise impacts NCCA and AD, in this paper we simply focus on the above model and do not further elaborate on this possible extension.

### 2.2. Some random matrix theory background

In this subsection, we introduce some random matrix theory background and necessary notations. Let be the data matrix associated with ; that is, the -th column is , and consider the scaled noise , where stands for the standard deviation of the scaled noise. Denote the empirical spectral distribution (ESD) of as

It is well-known that in the high dimensional regime (9), has the same asymptotic behavior [29] as the so-called MP law [37], denoted as , satisfying

(11) |

where is a measurable set, is the indicator function and when and when ,

(12) |

and . Denote

(13) |

and for ,

(14) |
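The MP law (11) is easy to check numerically: with aspect ratio c = p/n, the eigenvalues of a pure-noise sample Gram matrix concentrate on the MP support, whose edges are (1 ± √c)². A quick sketch (the dimensions are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 2000, 1000
c = p / n  # aspect ratio, bounded as in (9)

Z = rng.standard_normal((p, n))           # pure-noise data matrix
evals = np.linalg.eigvalsh(Z @ Z.T / n)   # spectrum of the sample Gram matrix

lam_minus = (1.0 - np.sqrt(c)) ** 2       # left edge of the MP support
lam_plus = (1.0 + np.sqrt(c)) ** 2        # right edge of the MP support

# With overwhelming probability all eigenvalues fall near [lam_minus, lam_plus];
# the edge fluctuations are of order n^(-2/3).
```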

For any constant , let be the shifting operator that shifts a probability measure defined on by ; that is,

(15) |

where means the shifted set. Using the notation (11), for denote

(16) |

Next, we introduce a -dependent quantity of some probability measure. For a given probability measure and define as

(17) |

Finally, we recall the following notion of *stochastic domination* [16, Chapter 6.3] that we will frequently use.
Let and be two families of nonnegative random variables, where is a possibly -dependent parameter set. We say that is stochastically dominated by , uniformly in the parameter , if for any small and large , there exists so that we have , for a sufficiently large . We interchangeably use the notation or if is stochastically dominated by , uniformly in , when there is no danger of confusion. In addition, we say that an -dependent event holds with high probability if for a , there exists so that , when .

### 2.3. A brief summary of free multiplication of random matrices

In this subsection, we summarize some preliminary results about the free multiplication of random matrices from [5, 26]. Given some probability measure , its Stieltjes transform and -transform are defined as

where , respectively. We next introduce the subordination functions utilizing the -transform [26, 47]. For any two probability measures and , there exist analytic functions and satisfying

(18) |

Armed with the subordination functions, we now introduce the free multiplicative convolution of and , denoted as , when and are compactly supported on but are not both delta measures supported at ; see Definition 2.7 of [5].

###### Definition 2.3.

Denote the analytic function by

(19) |

Then the free multiplicative convolution is defined as the unique probability measure such that (19) holds for all ; i.e., is the -transform of . Moreover, and are referred to as the subordination functions.

For and defined in (16), we have two sequences and , where . Note that we have

(20) |

where are the right edges of and respectively. Denote two positive definite matrices and as follows

(21) |

Let be a Haar distributed random matrix in and denote

The following lemma summarizes the rigidity of eigenvalues of

###### Lemma 2.4.

Suppose (21) holds. Then we have that
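The free multiplicative convolution of Definition 2.3 admits a simple Monte Carlo approximation through the matrix model above: conjugate one positive definite matrix by a Haar orthogonal matrix and multiply, as in the construction around (21). A sketch with illustrative spectra:

```python
import numpy as np
from scipy.stats import ortho_group

rng = np.random.default_rng(3)
N = 500

# Spectra of two positive definite (diagonal) matrices A and B, as in (21)
a = rng.uniform(1.0, 2.0, N)
b = rng.uniform(0.5, 1.5, N)

# Haar-distributed orthogonal matrix
U = ortho_group.rvs(N, random_state=4)

# Eigenvalues of A^{1/2} U B U^T A^{1/2}; their empirical distribution
# approximates the free multiplicative convolution of the ESDs of A and B.
M = (np.sqrt(a)[:, None] * U * b[None, :]) @ (U.T * np.sqrt(a)[None, :])
evals = np.linalg.eigvalsh(M)
```

Two sanity checks: the mean eigenvalue of the product model is close to the product of the mean eigenvalues of A and B, and the spectrum is contained in [min(a)·min(b), max(a)·max(b)].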

## 3. Main results (I)–classic bandwidth:

In this section, we state our main results regarding the eigenvalues of and when , where . For definiteness, we assume that and . In what follows, for ease of statement, we focus on reporting the results for and hence omit the subscripts of the indices in (10). For the general setting with or , we refer the readers to Remark 3.4 below for more details. Finally, we focus on reporting the results for the NCCA matrix ; the results for the AD matrix are similar. For the details of the AD matrix , we refer the readers to Remark 3.5 below. Moreover, by symmetry, without loss of generality, we always assume that ; that is, the first sensor always has a larger SNR.

### 3.1. Noninformative region:

In this subsection, we state the results when at least one sensor contains strong noise, or equivalently has a small SNR; that is, . In this case, NCCA and AD will not be able to provide useful information, or can only provide limited information, about the underlying common manifold.

#### 3.1.1. When both sensors have small SNRs,

In this case, both sensors have small SNRs such that the noise dominates the signal. For some fixed integers satisfying

(22) |

where is some constant, denote as

(23) |

Moreover, define as

(24) |

###### Theorem 3.1.

Intuitively, in this region we cannot obtain any information about the signal, since asymptotically the noise dominates the signal. In practice, the datasets might fall in this region when both sensors are corrupted or the environmental noise is too strong. This intuition is confirmed by Theorem 3.1. As discussed in [6, 13], when the noise dominates the signal, the outlier eigenvalues come mainly from the kernel function expansion or the Gram matrix and hence are not useful for studying the underlying manifold structure. The number of these outlier eigenvalues depends on the SNRs, as can be seen in (23), and can be worked out from the kernel function expansion.

We should point out that (25) and (26) are mainly technical assumptions and are commonly used in the random matrix theory literature. They guarantee that the individual bulk eigenvalues of can be characterized by the quantiles of the free multiplicative convolution. Specifically, (26) ensures that the Gram matrices are bounded from below, and (25) has been used in [7] to ensure that the eigenvectors of the Gram matrix are Haar distributed. As discussed in [7], while it is widely accepted that the eigenvectors of the Gram matrix from i.i.d. sub-Gaussian random vectors are Haar distributed, we cannot find a proper proof. Since the proof of this theorem depends on the results in [7], we impose the same condition. The assumption (25) can be removed once one can show that the eigenvectors of the Gram matrix from i.i.d. sub-Gaussian random vectors are Haar distributed. Since this is not the focus of the current paper, we will pursue this direction in future works.

#### 3.1.2. When one sensor has a small SNR,

In Theorem 3.2 below, we consider the case ; i.e., one of the sensors has a large SNR whereas the other is dominated by the noise. We first prepare some notation. Let and be the affinity matrices associated with and respectively, where the subscript is a short-hand notation for the signal. In other words, and are constructed from the clean signal. In general, since may be different from , and might be different. Denote

(28) |

Analogously, we denote the associated degree matrix and transition matrix as and respectively, that is,

(29) |

Define and similarly. Note that from the random walk perspective, (and as well) describe a lazy random walk on the clean dataset. We further introduce some other matrices,

We then define the associate degree matrix and transition matrix as and respectively; that is,

(30) |

and will be used when is too large () so that the bandwidth is insufficient to capture the relationship between two different samples.

###### Theorem 3.2.

Suppose Assumption 2.1 holds with , , and . Then we have that for ,

(33) |

where is defined in (24) and is defined in (31). Furthermore, when is larger in the sense that for any given small constant

(34) |

then with probability at least for some sufficiently small constant and some constant and all we have

(35) |

This is a potentially confusing region. In practice, it captures the situation when one sensor is corrupted so that its signal part becomes weak. Since we still have one sensor available with a strong SNR, it is expected that we could still obtain something useful. However, Theorem 3.2 shows that the corrupted sensor unfortunately contaminates the overall performance of the sensor fusion algorithm. Note that since the first sensor has a large SNR, the noisy transition matrix is close to the transition matrix , which only depends on the signal part, when , and to the transition matrix , which is a mixture of the signal and noise, when . This fact has been shown in [6]. However, for the second sensor, due to the strong noise, will be close to a perturbed Gram matrix that mainly comes from the high dimensional noise. Consequently, as illustrated in (33), the NCCA matrix will be close to , which is the product of the clean transition matrix and the shifted Gram matrix. Clearly, the clean transition matrix is contaminated by the shifted Gram matrix, which does not contain any information about the signal. This limits the information we can obtain.

In the extreme case when is larger in the sense of (34), the chosen bandwidth is too small compared with the signal, so that the transition matrix will be close to the identity matrix. Consequently, as in (35), the NCCA matrix will be mainly characterized by the perturbed Gram matrix, whose limiting ESD follows the MP law with proper scaling. We should however emphasize that, as elaborated in [6], when the SNR is large, particularly when , we should consider a different bandwidth, particularly the bandwidth determined by a percentile of the pairwise distances that is commonly used in practice. It is thus natural to ask whether, if the bandwidth is chosen “properly”, we would eventually obtain useful information. We will answer this question in a later section.

### 3.2. Informative region:

In this subsection, we state the results when both sensors have large SNRs (). Recall (29) and (30), and denote analogously for the point cloud . For some constant , denote

(36) |

###### Theorem 3.3.

Theorem 3.3 shows that when , where , and both SNRs are large, the NCCA matrix from the noisy dataset can be well approximated by that from the clean dataset of the common manifold. The main reason has been elaborated in [6] for the case of only one sensor. In the two-sensor case, combining (37) and (38), we see that except for the first eigenvalues, the remaining eigenvalues are negligible and not informative. Moreover, (2) and (3) reveal important information about the bandwidth; that is, if the bandwidth choice is improper, like and , the result could be misleading in general. For instance, when and are large, ideally we should have a “very clean” dataset and we would expect to obtain useful information about the signal. However, this result says that we cannot obtain any useful information from NCCA; in particular, see (42). This, however, is intuitively true, since when the bandwidth is too small, the relationship between two distinct points cannot be captured by the kernel; that is, when , with high probability (see the proof below for a precise statement of this argument, or [6]). This problem can be fixed if we choose a proper bandwidth. In Section 4, we state the corresponding results when the bandwidths are selected properly, in which case this counterintuitive result is eliminated.

###### Remark 3.4.

In the above theorems, we focus on reporting the results for the case in (5). In this remark, we discuss how to generalize the results to the setting when or . First, when , Theorem 3.1 still holds after minor modifications; for example, in (22) should be replaced by , and the error bound in (27) should be replaced by

where ’s are defined similarly to those in (24). Similar arguments apply to Theorem 3.2. Second, when , where , Theorem 3.3 holds by setting . Finally, suppose that there exist some integers such that

Then Theorem 3.3 still holds by setting , and the affinity and transition matrices in (29) should be defined using the signal parts with large SNRs. For example, should be defined via

The detailed statements and proofs are similar to those for the setting, except for extra notational complexity. Since this is not the main focus of the current paper, we omit the details here.

###### Remark 3.5.

Throughout the paper, we focus on reporting the results for the NCCA matrix. However, our results can also be applied to the AD matrix with minor modifications based on the definitions in (3). Specifically, Theorem 3.1 holds for , Theorem 3.2 holds for , and Theorem 3.3 holds for by replacing with , with , and with . Since the proofs are similar, we omit the details.

## 4. Main results (II)–adaptive choice of bandwidth

As discussed after Theorem 3.3, when the SNRs are large, the bandwidth choice for is improper. One solution to this issue has been discussed in [6] for the one-sensor case; that is, the bandwidth is determined by a percentile of all pairwise distances. It is thus natural to hypothesize that the same solution works for the kernel sensor fusion approach. As in Section 3, we focus on the case , and the discussion for the general setting is similar to that of Remark 3.4. As before, we also assume that . Also, we focus on reporting the results for the NCCA matrix; the discussion for the AD matrix is similar to that of Remark 3.5.

We first recall the adaptive bandwidth selection approach [6] motivated by the empirical approach commonly used in daily practice. Let and be the empirical distributions of pairwise distances and respectively. Then we choose the bandwidths and by

(43) |

where and are fixed constants chosen by the user. Define in the same way as that in (45), as that in (48), and as that in (46) using (43). Similarly, we can define the counterparts for the point cloud
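The percentile rule (43) is simple to implement: collect all pairwise distances within each point cloud and take a user-chosen quantile as the bandwidth. A minimal sketch (the quantile level `omega` and the toy data are illustrative):

```python
import numpy as np
from scipy.spatial.distance import pdist

def percentile_bandwidth(X, omega=0.5):
    """Bandwidth chosen as the omega-quantile of all pairwise distances,
    mimicking the adaptive rule (43); omega is a user-chosen constant."""
    return np.quantile(pdist(X), omega)

rng = np.random.default_rng(5)
X = rng.standard_normal((100, 300))   # toy high dimensional point cloud
h = percentile_bandwidth(X, omega=0.5)
# For standard normal data in dimension p = 300, squared pairwise distances
# concentrate around 2p, so h is close to sqrt(600), roughly 24.5.
```

Unlike the classic fixed choice, this bandwidth scales with the typical pairwise distance of the data, which is what makes NCCA and AD informative in the large-SNR regime (Theorem 4.1).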

Recall that and are the affinity matrices associated with and . With a slight abuse of notation, for we denote

(44) |

where are constructed using the adaptively selected bandwidth . Clearly, and differ by an isotropic spectral shift, and when , asymptotically and are the same. Note that compared to (28), the difference is that we use the modified bandwidth in (44). This difference is significant, particularly when is large. Indeed, when is large, defined in (28) is close to an identity matrix, while defined in (44) encodes information about the signal. Specifically, we can show that asymptotically defined in (44) converges to an integral operator defined on the manifold, whose spectral structure is commonly used in the manifold learning community to study the signal. See [6] for more discussion. We then define

(45) |

and

(46) |

where . Compared to (45), (46) does not contain the scaling and shift of the signal parts. Moreover, denote

(47) |

###### Theorem 4.1.

Suppose Assumption
