
# Doubly-Stochastic Normalization of the Gaussian Kernel is Robust to Heteroskedastic Noise

A fundamental step in many data-analysis techniques is the construction of an affinity matrix describing similarities between data points. When the data points reside in Euclidean space, a widespread approach is to form an affinity matrix by applying the Gaussian kernel to the pairwise distances, and to follow with a certain normalization (e.g. the row-stochastic normalization or its symmetric variant). We demonstrate that the doubly-stochastic normalization of the Gaussian kernel with zero main diagonal (i.e. no self loops) is robust to heteroskedastic noise. That is, the doubly-stochastic normalization is advantageous in that it automatically accounts for observations with different noise variances. Specifically, we prove that in a suitable high-dimensional setting where heteroskedastic noise does not concentrate too much in any particular direction in space, the resulting (doubly-stochastic) noisy affinity matrix converges to its clean counterpart with rate m^-1/2, where m is the ambient dimension. We demonstrate this result numerically, and show that in contrast, the popular row-stochastic and symmetric normalizations behave unfavorably under heteroskedastic noise. Furthermore, we provide a prototypical example of simulated single-cell RNA sequence data with strong intrinsic heteroskedasticity, where the advantage of the doubly-stochastic normalization for exploratory analysis is evident.


## 1 Introduction

### 1.1 Affinity matrix constructions

Given a dataset of points in Euclidean space, a useful approach for encoding the intrinsic geometry of the data is by a weighted graph, where the vertices represent data points, and the edge-weights describe similarities between them. Such a graph can be described by an affinity (or adjacency/similarity) matrix, namely a nonnegative matrix whose $(i,j)$'th entry holds the edge-weight between vertices $i$ and $j$. To measure the similarity between pairs of data points, one can employ the Gaussian kernel with pairwise (Euclidean) distance. In particular, given data points $x_1,\ldots,x_n \in \mathbb{R}^m$, we consider the matrix $K \in \mathbb{R}^{n\times n}$ given by

$$ K_{i,j} = \begin{cases} \exp\left\{-\lVert x_i - x_j \rVert^2 / \varepsilon\right\}, & i \neq j, \\ 0, & i = j, \end{cases} \tag{1} $$

for $i,j = 1,\ldots,n$, where $\varepsilon > 0$ is the kernel “width” parameter. For many applications it is common practice to normalize $K$, so as to equip the resulting affinity matrix with a useful interpretation and favourable properties. Two such normalizations, which are closely related to each other, are the row-stochastic and the symmetric normalizations:

$$ \text{(Row-stochastic normalization)} \qquad W^{(r)} \stackrel{\text{def}}{=} \operatorname{diag}(r)\, K, \qquad r_i = \frac{1}{\sum_{j=1}^{n} K_{i,j}}, \tag{2} $$

$$ \text{(Symmetric normalization)} \qquad W^{(s)} \stackrel{\text{def}}{=} \sqrt{\operatorname{diag}(r)}\; K\, \sqrt{\operatorname{diag}(r)}, \tag{3} $$

where $r = [r_1,\ldots,r_n]$, and $\operatorname{diag}(r)$ is a diagonal matrix with $r$ on its main diagonal.
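To make these constructions concrete, here is a minimal NumPy sketch (ours, not from the paper) of the kernel (1) and the normalizations (2)–(3):

```python
import numpy as np

def gaussian_kernel(X, eps):
    """Gaussian kernel of Eq. (1): K[i, j] = exp(-||x_i - x_j||^2 / eps)
    for i != j, with the main diagonal set to zero (no self-loops)."""
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    K = np.exp(-sq_dists / eps)
    np.fill_diagonal(K, 0.0)
    return K

def normalize(K):
    """Row-stochastic (2) and symmetric (3) normalizations of K."""
    r = 1.0 / K.sum(axis=1)            # r_i = 1 / sum_j K[i, j]
    W_r = np.diag(r) @ K               # Eq. (2)
    s = np.sqrt(r)
    W_s = np.diag(s) @ K @ np.diag(s)  # Eq. (3)
    return W_r, W_s

X = np.random.default_rng(0).normal(size=(20, 5))
K = gaussian_kernel(X, eps=1.0)
W_r, W_s = normalize(K)
```

As a sanity check, `W_r` has unit row sums, and `W_s` is a similarity transform of `W_r`, so the two share the same (real) spectrum.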

Notably, the matrix $W^{(r)}$ is row-stochastic, i.e. the sum of every row of $W^{(r)}$ is $1$, which allows for a useful interpretation of $W^{(r)}$ as a transition-probability matrix (in the sense of a Markov chain). An important characteristic of the row-stochastic affinity matrix $W^{(r)}$ is its relation to the heat kernel and the Laplace-Beltrami operator on a manifold [belkin2003laplacian, coifman2006diffusionMaps, hein2005graphs, singer2006graph, trillos2019error]. Specifically, under the “manifold assumption” – where the points are uniformly sampled from a smooth low-dimensional Riemannian manifold embedded in the Euclidean space – $W^{(r)}$ approximates the heat kernel on the manifold, and the matrix $I - W^{(r)}$ (known as the random-walk graph Laplacian) approximates the Laplace-Beltrami operator. This property of the row-stochastic normalization establishes the relation between $W^{(r)}$ and the intrinsic local geometry of the data, thereby justifying the use of $W^{(r)}$ as an affinity matrix.

The affinity matrix $W^{(s)}$ (obtained by the symmetric normalization) is closely related to $W^{(r)}$. Since $W^{(s)} = [\operatorname{diag}(r)]^{-1/2}\, W^{(r)}\, [\operatorname{diag}(r)]^{1/2}$, $W^{(s)}$ shares the spectrum of $W^{(r)}$, and their eigenvectors are related through the vector $[\sqrt{r_1},\ldots,\sqrt{r_n}]$. Even though $W^{(s)}$ is not a proper transition-probability matrix, it enjoys symmetry, which is advantageous in various applications.

We also mention that the row-stochastic and symmetric normalizations can be used in conjunction with a kernel with variable width, i.e. when a different value of $\varepsilon$ is taken for each row or column of $K$ (see for instance [berry2016variable] and references therein). We further discuss one such variant in the example in Section 3.2.

The matrices $W^{(r)}$ and $W^{(s)}$ (or equivalently, their corresponding graph Laplacians $I - W^{(r)}$ and $I - W^{(s)}$) are used extensively in data processing and machine learning, notably in non-linear dimensionality reduction (or manifold learning), community detection and spectral clustering [shi2000normalized, ng2002spectral, sarkar2015role, von2007tutorial, fortunato2010community, shaham2018spectralnet, kluger2003spectral], image denoising [buades2005non, pang2017graph, meyer2014perturbation, landa2018steerable, singer2009diffusion], and in signal processing and supervised learning over graph domains [shuman2013emerging, coifman2006diffusionWavelets, hammond2011wavelets, defferrard2016convolutional, bronstein2017geometric].

### 1.2 The doubly-stochastic normalization

In this work, we focus on the doubly-stochastic normalization of $K$:

$$ \text{(Doubly-stochastic normalization)} \qquad W^{(d)} \stackrel{\text{def}}{=} \operatorname{diag}(d)\, K \operatorname{diag}(d), \tag{4} $$

where $d = [d_1,\ldots,d_n]$ is a vector of positive scalars chosen such that $W^{(d)}$ is doubly-stochastic, i.e., such that the sum of every row and every column of $W^{(d)}$ is $1$. The problem of finding $d$ such that $\operatorname{diag}(d)\, K \operatorname{diag}(d)$ has prescribed row and column sums is known as a matrix scaling problem, and the entries of $d$ are often referred to as scaling factors. Matrix scaling problems have a rich history, with a long list of applications and generalizations [bapat1997nonnegative, idel2016review]. Since the scaling factors are defined implicitly, their existence and uniqueness are not obvious, and depend on the zero-pattern of the matrix to be scaled. For the particular zero-pattern of $K$, existence and uniqueness are established by the following proposition.

###### Proposition 1 (Existence and uniqueness).

Suppose that $K \in \mathbb{R}^{n\times n}$, $n \ge 3$, is symmetric with zero main diagonal and strictly positive off-diagonal entries. Then, there exist positive scaling factors $d_1,\ldots,d_n$ such that $\sum_{j=1}^{n} d_i K_{i,j} d_j = 1$ for all $i = 1,\ldots,n$. Moreover, $d_1,\ldots,d_n$ are unique.

The proof can be found in Appendix A, and is based on the simple zero-pattern of $K$ and on a lemma by Knight [knight2008sinkhorn]. On the computational side, the scaling factors can be obtained by the classical Sinkhorn-Knopp algorithm [sinkhorn1967concerning] (known also as the RAS algorithm), or by more recent techniques based on optimization (see [allen2017much] and references therein). We detail a lean variant of the Sinkhorn-Knopp algorithm adapted to symmetric matrices in Algorithm 1.
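A minimal sketch of such a symmetric scaling iteration (our simplified stand-in for Algorithm 1, which is detailed in the paper): the geometric-mean damped update `d ← sqrt(d / (K d))` is one standard way to symmetrize the Sinkhorn-Knopp iteration.

```python
import numpy as np

def sym_sinkhorn(K, tol=1e-12, max_iter=100_000):
    """Find positive scaling factors d with diag(d) @ K @ diag(d)
    doubly-stochastic (Eq. (4)). The geometric-mean damping keeps the
    iterates symmetric and stabilizes the plain Sinkhorn-Knopp update."""
    d = np.ones(K.shape[0])
    for _ in range(max_iter):
        d_new = np.sqrt(d / (K @ d))  # fixed point: d_i * (K d)_i = 1
        if np.max(np.abs(d_new - d)) < tol:
            return d_new
        d = d_new
    return d

# Symmetric matrix with zero main diagonal and positive off-diagonal entries,
# as in Proposition 1:
rng = np.random.default_rng(1)
A = rng.uniform(0.1, 1.0, size=(8, 8))
K = (A + A.T) / 2
np.fill_diagonal(K, 0.0)

d = sym_sinkhorn(K)
W_d = np.diag(d) @ K @ np.diag(d)   # doubly-stochastic, by construction
```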

By definition, $W^{(d)}$ is a symmetric transition-probability matrix. Hence, it naturally combines the two favorable properties that $W^{(r)}$ and $W^{(s)}$ hold separately. It is worthwhile to point out that $W^{(d)}$ is in fact the closest symmetric and row-stochastic matrix to $K$ in KL-divergence [brown1993order, zass2007doubly], and interestingly, it can also be obtained by iteratively re-applying the symmetric normalization (3) indefinitely (see [zass2005unifying]). Another appealing interpretation of the doubly-stochastic normalization is through the lens of optimal transport with entropy regularization [cuturi2013sinkhorn], summarized by the following proposition.

###### Proposition 2 (Optimal transport interpretation).

$W^{(d)}$ from (4) is the unique solution to

$$ \underset{W \in \mathbb{R}^{n\times n}_{+}}{\text{Minimize}} \;\; \sum_{i,j=1}^{n} \lVert x_i - x_j \rVert^2\, W_{i,j} + \varepsilon H(W), \qquad \text{subject to: } W\mathbf{1} = \mathbf{1},\;\; W^T \mathbf{1} = \mathbf{1},\;\; W_{i,i} = 0,\; i = 1,\ldots,n, \tag{5} $$

where $\mathbf{1} \in \mathbb{R}^n$ is a column vector of ones, and $H(W) = \sum_{i,j=1}^{n} W_{i,j} \log W_{i,j}$ is the negative entropy of $W$.

The proof of Proposition 2 follows very closely the proof of Lemma 2 in [cuturi2013sinkhorn], with the additional use of Proposition 1 (to account for the constraint $W_{i,i} = 0$), and is omitted for the sake of brevity. In the optimal transport interpretation of the problem (5), each point $x_i$ holds a unit mass that should be distributed between all the other points, while minimizing the transportation cost between the points (measured by the pairwise squared distances $\lVert x_i - x_j \rVert^2$). The outcome of this process is constrained so that each point ends up holding a unit mass. In this context, the matrix $W$ describes the distribution of the masses from all points to all other points, and is therefore required to be doubly-stochastic. The negative entropy regularization term controls the “fairness” of the mass allocation, such that each mass is distributed more evenly between the points for large values of $\varepsilon$.

The optimization problem (5) can also be interpreted as an optimal graph construction. In this context, the term $\sum_{i,j} \lVert x_i - x_j \rVert^2 W_{i,j}$ can be considered as accounting for the regularity of the data (as a multivariate signal) with respect to the weighted graph represented by $W$, while the negative entropy term controls the approximate sparseness of $W$. Since the solution to (5) is a symmetric matrix, $W^{(d)}$ can be thought of as describing the undirected weighted graph that optimizes the “smoothness” of the dataset, under the constraints of prescribed entropy (or approximate sparseness), no self-loops, and stochasticity (i.e. a transition-probability matrix).

In the context of manifold learning, the relation between the doubly-stochastic normalization and the heat kernel (or the Laplace-Beltrami operator) on a Riemannian manifold has been established recently in [marshall2019manifold]. That is, under the manifold assumption (and under certain conditions), $W^{(d)}$ is expected to approximate the heat kernel on the manifold, and therefore to encode the local geometry of the data much like $W^{(r)}$. The doubly-stochastic normalization was also demonstrated to be useful for spectral clustering in [beauchemin2015affinity], where it was shown to achieve the best clustering performance on several datasets. Last, we note that several other constructions of doubly-stochastic affinity matrices have appeared in the literature [wang2012improving, zass2007doubly], typically involving a notion of closeness to $K$ other than KL-divergence (e.g. the Frobenius norm).

### 1.3 Robustness to noise

When considering real-world datasets, it is desirable to construct affinity matrices that are robust to noise. Specifically, suppose that we do not have access to the points $x_1,\ldots,x_n$, but rather to their noisy observations $\tilde{x}_1,\ldots,\tilde{x}_n$, given by

$$ \tilde{x}_i = x_i + \eta_i, \tag{6} $$

where $\eta_1,\ldots,\eta_n \in \mathbb{R}^m$ are pairwise independent noise vectors satisfying

$$ \mathbb{E}[\eta_i] = 0, \qquad \mathbb{E}[\eta_i^T \eta_i] = \Sigma_i^2, \tag{7} $$

for all $i = 1,\ldots,n$, where $0$ is the zero (row) vector in $\mathbb{R}^m$, and $\Sigma_i^2$ is the covariance matrix of $\eta_i$. We then define $\tilde{K}$, $\tilde{W}^{(r)}$, $\tilde{W}^{(s)}$, $\tilde{W}^{(d)}$, and $\tilde{d}$ analogously to $K$, $W^{(r)}$, $W^{(s)}$, $W^{(d)}$, and $d$, respectively, when replacing $x_i$ in (1) with $\tilde{x}_i$. For the noise model described above, we say that the noise is homoskedastic if $\Sigma_1^2 = \ldots = \Sigma_n^2$, and heteroskedastic otherwise.

The influence of homoskedastic noise on kernel matrices (such as $K$) was investigated in [el2010information], and the results therein imply that $\tilde{W}^{(r)}$ and $\tilde{W}^{(s)}$ are robust to high-dimensional homoskedastic noise. Specifically, in the high-dimensional setting considered in [el2010information], $\tilde{K}$ converges to a biased version of $K$ in which all the off-diagonal entries admit the same multiplicative bias. Such bias can therefore be corrected by applying either the row-stochastic or the symmetric normalization (see [el2016graph]). However, this is not the case in the more general setting of heteroskedastic noise.

Heteroskedastic noise is a natural assumption for many real-world applications. For example, heteroskedastic noise arises in certain biological, photon-imaging, and Magnetic Resonance Imaging (MRI) applications [cao2017multi, salmon2014poisson, hafemeister2019normalization, foi2011noise], where observations are modeled as samples from random variables whose variances depend on their means, such as in binomial, negative-binomial, multinomial, Poisson, or Rice distributions. In natural image processing, heteroskedastic noise occurs due to the spatial clipping of values in an image [foi2009clipped]. Additionally, heteroskedastic noise is encountered when the experimental setup varies during the data collection process, such as in spectrophotometry and atmospheric data acquisition [cochran1977statistically, tamuz2005correcting]. Generally, many modern datasets are inherently heteroskedastic as they are formed by aggregating observations collected at different times and from different sources. Last, we mention that heteroskedastic noise can be considered as a natural relaxation of the popular manifold assumption. In particular, heteroskedastic noise arises whenever data points are sampled from the high-dimensional surroundings of a low-dimensional manifold embedded in the ambient space, where the size of the sampling neighborhood (in the ambient space) around the manifold is determined locally by the manifold itself. See Figure 4 and the corresponding example in Section 3.1.2.

### 1.4 Contributions

Our main contribution is to establish the robustness of the doubly-stochastic normalization of the Gaussian kernel (with zero main diagonal) to high-dimensional heteroskedastic noise. In particular, we prove that in the high-dimensional setting where the number of points $n$ is fixed, the dimension $m$ is increasing, and the noise does not concentrate too much in any specific direction in space, $\tilde{W}^{(d)}$ converges to $W^{(d)}$ with rate $m^{-1/2}$. See Theorem 3 in Section 2. An intuitive justification of the robustness of the doubly-stochastic normalization to heteroskedastic noise, and also of why zeroing-out the main diagonal of $K$ is important, can be found in Section 2, equations (9)–(10). The proof of Theorem 3, see Appendix B, relies on a perturbation analysis of the doubly-stochastic normalization.

We demonstrate the robustness of $\tilde{W}^{(d)}$ to heteroskedastic noise in several simulations (see Section 3). In Section 3.1.1 we corroborate Theorem 3 numerically, and exemplify that $\tilde{W}^{(r)}$ and $\tilde{W}^{(s)}$ suffer from inherent point-wise bias due to heteroskedastic noise (see Figures 1–3). In Section 3.1.2 we demonstrate the robustness of the leading eigenvectors of $\tilde{W}^{(d)}$ to heteroskedastic noise whose characteristics depend locally on the manifold of the clean data (see Figures 4–6). In Section 3.2 we apply the doubly-stochastic normalization to the analysis of simulated single-cell RNA sequence data with significant heteroskedasticity, showcasing its ability to accurately recover the underlying structure of the data despite the noise (see Figures 7, 8).

## 2 Main result

###### Theorem 3 (Convergence of $\tilde{W}^{(d)}$ to $W^{(d)}$).

Consider the setting where the number of points $n$ is fixed, and the dimension $m$ is increasing. Suppose that $\lVert x_i \rVert \le 1$ and $\lVert \Sigma_i \rVert \le C m^{-1/2}$, for all $i = 1,\ldots,n$ and all (sufficiently large) dimensions $m$, where $C$ is a universal constant (independent of $m$). Then,

$$ \lVert \tilde{W}^{(d)} - W^{(d)} \rVert_F = \mathcal{O}(m^{-1/2}), \tag{8} $$

with high probability.¹

¹The precise statement of the theorem is that for any probability $p < 1$, there exist a constant $C' > 0$ and an integer $M$, such that for every $m \ge M$ we have $\lVert \tilde{W}^{(d)} - W^{(d)} \rVert_F \le C' m^{-1/2}$ with probability at least $p$.

The proof is detailed in Appendix B. We mention that the constant $1$ in the boundedness condition $\lVert x_i \rVert \le 1$ is arbitrary and can be replaced with any other constant. Additionally, note that even though the quantities $\lVert \Sigma_i \rVert$ are required to decrease with $m$, the expected noise magnitudes $\mathbb{E}\lVert \eta_i \rVert^2$ (which are equal to $\operatorname{Tr}(\Sigma_i^2)$) can remain constant, and can possibly be large compared to the magnitudes of the clean data points $x_i$. For example, if we have $\Sigma_i^2 = I_m/m$ for all $i$, then it follows that $\mathbb{E}\lVert \eta_i \rVert^2 = 1$, asserting that the magnitude of the noise is greater or equal to that of the clean data points (assuming $\lVert x_i \rVert \le 1$). In this regime of non-vanishing high-dimensional noise, the condition $\lVert \Sigma_i \rVert \le C m^{-1/2}$ guarantees that the noise spreads out in Euclidean space, and does not concentrate too much in any particular direction (note that $\lVert \Sigma_i \rVert$ is the largest singular value of $\Sigma_i$, and is therefore the standard deviation of the noise in the direction with largest variance). Clearly, the setup of Theorem 3 can also accommodate heteroskedastic noise, as the ratios between the noise magnitudes of different data points can be arbitrary. See Remark 1 for further discussion of the setting considered in Theorem 3 and a closely-related high-dimensional setting.

The main reason behind the robustness of the doubly-stochastic normalization to high-dimensional heteroskedastic noise is that it is invariant to the type of bias introduced by such noise. Specifically, under the conditions of Theorem 3, for all $i \neq j$,

$$ \lVert \tilde{x}_i - \tilde{x}_j \rVert^2 \underset{m\to\infty}{\sim} \mathbb{E}\lVert \tilde{x}_i - \tilde{x}_j \rVert^2 = \mathbb{E}\lVert \eta_i \rVert^2 + \lVert x_i - x_j \rVert^2 + \mathbb{E}\lVert \eta_j \rVert^2, \tag{9} $$

and therefore

$$ \tilde{K}_{i,j} \underset{m\to\infty}{\sim} \exp\{-\mathbb{E}\lVert \eta_i \rVert^2 / \varepsilon\} \cdot K_{i,j} \cdot \exp\{-\mathbb{E}\lVert \eta_j \rVert^2 / \varepsilon\}, \tag{10} $$

for all $i \neq j$ (since $\tilde{K}_{i,j} = \exp\{-\lVert \tilde{x}_i - \tilde{x}_j \rVert^2 / \varepsilon\}$ for $i \neq j$). Crucially, $\tilde{K}$ in (10) is biased by a symmetric diagonal scaling of $K$, which is precisely the type of bias corrected automatically by the doubly-stochastic normalization (4). See Remark 2 for an alternative justification and further discussion of the robustness of the doubly-stochastic normalization to heteroskedastic noise.

Equations (9) and (10) also highlight why zeroing-out the main diagonal of the Gaussian kernel (see Eq. (1)) is important. Without it, the entries on the main diagonal of $\tilde{K}$ would be $1$, while the off-diagonal entries of $\tilde{K}$ would be small due to the bias in the noisy pairwise distances (9). Thus, $\tilde{K}$ would be close to the identity matrix, which would render any normalization (row-stochastic, symmetric, or doubly-stochastic) ineffective.
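This invariance can be verified numerically: scaling $K$ by any positive diagonal matrix on both sides leaves its doubly-stochastic normalization unchanged, by the uniqueness in Proposition 1. A small self-contained sketch (ours; `sym_sinkhorn` is a basic symmetric Sinkhorn iteration written for this illustration):

```python
import numpy as np

def sym_sinkhorn(K, tol=1e-13, max_iter=100_000):
    """Basic symmetric Sinkhorn iteration: find d > 0 such that
    diag(d) @ K @ diag(d) is doubly-stochastic."""
    d = np.ones(K.shape[0])
    for _ in range(max_iter):
        d_new = np.sqrt(d / (K @ d))
        if np.max(np.abs(d_new - d)) < tol:
            return d_new
        d = d_new
    return d

def doubly_stochastic(K):
    d = sym_sinkhorn(K)
    return np.diag(d) @ K @ np.diag(d)

rng = np.random.default_rng(2)
A = rng.uniform(0.1, 1.0, size=(6, 6))
K = (A + A.T) / 2
np.fill_diagonal(K, 0.0)

# Symmetric diagonal scaling of K, mimicking the bias in Eq. (10):
a = rng.uniform(0.5, 2.0, size=6)
K_biased = np.diag(a) @ K @ np.diag(a)

W = doubly_stochastic(K)
W_biased = doubly_stochastic(K_biased)
# W and W_biased coincide: the bias is absorbed into the scaling factors.
```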

###### Remark 1.

Consider the following setting for high-dimensionality, where data coordinates (dimensions) are sampled from some underlying distribution, and the noise is only required to have bounded variance, i.e. $\lVert \Sigma_i \rVert \le C$ for some universal constant $C$. Specifically, suppose that each clean observation is given by

$$ x_i = [F_i(y_1), \ldots, F_i(y_m)], \tag{11} $$

where $F_i$ is a bounded function, and $y_1,\ldots,y_m$ are i.i.d. samples from some latent “coordinate” variable $Y$ (possibly multivariate). In this case, one has

$$ \lVert x_i - x_j \rVert^2 = \sum_{k=1}^{m} \left( F_i(y_k) - F_j(y_k) \right)^2 = m \left( \mathbb{E}_{y\sim Y}\!\left[ (F_i(y) - F_j(y))^2 \right] + \mathcal{O}(m^{-1/2}) \right), \tag{12} $$

with high probability, where the last equality is due to Hoeffding’s inequality [hoeffding1994probability] (for sums of independent and bounded random variables). Therefore, the clean pairwise distances in this setting grow linearly with $m$, which suggests that the parameter $\varepsilon$ of the Gaussian kernel (1) should also grow linearly with $m$. Ultimately, taking $\varepsilon$ proportional to $m$ when computing $\tilde{K}$ is equivalent to normalizing the noisy observations by $\sqrt{m}$ (while keeping $\varepsilon$ fixed), which places us in the setting of Theorem 3.
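A quick Monte Carlo illustration of (12) (ours, with hypothetical bounded coordinate functions $F_1 = \cos$, $F_2 = \sin$ and $y \sim U[0, 2\pi)$, for which $\mathbb{E}[(F_1(y) - F_2(y))^2] = 1$):

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical bounded coordinate functions for Eq. (11).
F1, F2 = np.cos, np.sin

ratios = {}
for m in (100, 10_000, 1_000_000):
    y = rng.uniform(0.0, 2 * np.pi, size=m)       # i.i.d. latent coordinates
    ratios[m] = np.sum((F1(y) - F2(y)) ** 2) / m  # ||x_1 - x_2||^2 / m
# ratios[m] approaches E[(cos y - sin y)^2] = 1, with O(m^{-1/2}) fluctuations.
```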

###### Remark 2.

According to (4), for $i \neq j$ we can write

$$ W^{(d)}_{i,j} = d_i \exp\{-\lVert x_i \rVert^2/\varepsilon\} \cdot \exp\{2\langle x_i, x_j \rangle/\varepsilon\} \cdot \exp\{-\lVert x_j \rVert^2/\varepsilon\}\, d_j = u_i \exp\{2\langle x_i, x_j \rangle/\varepsilon\}\, u_j, \tag{13} $$

where we defined $u_i = d_i \exp\{-\lVert x_i \rVert^2/\varepsilon\}$. Thus, $W^{(d)}$ can be viewed alternatively as obtained by the doubly-stochastic normalization of the nonnegative matrix with entries $\exp\{2\langle x_i, x_j \rangle/\varepsilon\}$ (and zero main diagonal) instead of $K$, where the scaling factors are $u_1,\ldots,u_n$. This implies that $W^{(d)}$ depends only on the scalar products $\langle x_i, x_j \rangle$, which are not biased by heteroskedastic noise in the sense that $\mathbb{E}\langle \tilde{x}_i, \tilde{x}_j \rangle = \langle x_i, x_j \rangle$ for $i \neq j$. Indeed, this fact is a key ingredient in the proof of Theorem 3.
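The unbiasedness of the scalar products under independent noise can be checked by simulation (a sketch with arbitrary made-up vectors and heteroskedastic noise scales):

```python
import numpy as np

rng = np.random.default_rng(4)
m, trials = 50, 50_000

x_i = rng.normal(size=m)
x_j = rng.normal(size=m)
clean = x_i @ x_j

# Independent zero-mean noise with very different (heteroskedastic) scales.
eta_i = rng.normal(scale=0.5, size=(trials, m))
eta_j = rng.normal(scale=2.0, size=(trials, m))

# Empirical mean of <x_i + eta_i, x_j + eta_j> over many noise draws:
empirical = ((x_i + eta_i) * (x_j + eta_j)).sum(axis=1).mean()

# In contrast, the squared distance is biased upward by
# E||eta_i||^2 + E||eta_j||^2, as in Eq. (9):
biased_dist = (((x_i + eta_i) - (x_j + eta_j)) ** 2).sum(axis=1).mean()
clean_dist = ((x_i - x_j) ** 2).sum()
```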

However, it is important to note that without the doubly-stochastic normalization, the matrix with entries $\exp\{2\langle x_i, x_j \rangle/\varepsilon\}$ does not correspond to a local kernel (see [berry2016local]). This is because the quantity $\langle x_i, x_j \rangle$ is sensitive to the magnitudes $\lVert x_i \rVert$ and $\lVert x_j \rVert$, such that it may be large even if $x_i$ and $x_j$ are not close (but rather $\lVert x_i \rVert$ or $\lVert x_j \rVert$ are large). Therefore, the row-stochastic and symmetric normalizations of this matrix (with zero main diagonal) are not expected to encode the local geometry of the data (even though they are robust to heteroskedastic noise by virtue of using only the scalar products $\langle x_i, x_j \rangle$). To conclude, the doubly-stochastic normalization of $K$ is special in the following way. On the one hand, it is obtained from a local kernel (the Gaussian kernel), and therefore describes the local neighborhoods of the data points (as can be seen from (4) or Proposition 2, and is established formally in [marshall2019manifold]). On the other hand, it depends only on the scalar products $\langle x_i, x_j \rangle$, which is advantageous for coping with heteroskedastic noise.

## 3 Numerical examples

### 3.1 Example 1: The unit circle embedded in high-dimensional space

In our first example, we sampled $n$ points uniformly from the unit circle in $\mathbb{R}^2$, and embedded them in $\mathbb{R}^m$, for a range of dimensions $m$, using randomly-generated orthogonal transformations. In more detail, we first sampled angles $\theta_1,\ldots,\theta_n$ independently and uniformly from $[0, 2\pi)$. Then, for each embedding dimension $m$, we generated a random matrix $R_m \in \mathbb{R}^{2\times m}$ with orthonormal rows (i.e. such that $R_m R_m^T = I_2$), and computed the data points as

$$ x_i = [\cos(\theta_i), \sin(\theta_i)] \cdot R_m, \qquad 1 \le i \le n. \tag{14} $$

Note that as a result, the magnitude of all points is constant, with $\lVert x_i \rVert = 1$ for all $i$ and every embedding dimension $m$.
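This construction can be sketched as follows (our sketch; the orthonormal-row matrix is obtained from the QR decomposition of a Gaussian matrix):

```python
import numpy as np

def embed_circle(n, m, rng):
    """Sample n points uniformly from the unit circle and embed them in
    R^m via a random map with orthonormal rows (Eq. (14))."""
    theta = rng.uniform(0.0, 2 * np.pi, size=n)
    circle = np.column_stack([np.cos(theta), np.sin(theta)])  # n x 2
    # Reduced QR of an m x 2 Gaussian matrix gives orthonormal columns;
    # transposing yields R_m in R^{2 x m} with R_m @ R_m.T = I_2.
    Q, _ = np.linalg.qr(rng.normal(size=(m, 2)))
    R_m = Q.T
    return circle @ R_m  # n x m; norms are preserved

rng = np.random.default_rng(5)
X = embed_circle(n=100, m=500, rng=rng)
```

Since the rows of `R_m` are orthonormal, each embedded point keeps unit norm, matching the remark above.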

#### 3.1.1 Gaussian noise with arbitrary variances

We begin by demonstrating Theorem 3 numerically. Towards that end, we created the noise as follows. For every embedding dimension $m$, we set $\Sigma_i^2 = \operatorname{diag}(\sigma_{i,1}^2, \ldots, \sigma_{i,m}^2)$ (so that the noise is uncorrelated between coordinates), and generated the noise standard-deviations according to

$$ \sigma_{i,j} = \sqrt{\frac{\alpha_i \beta_j}{m}}, \tag{15} $$

where $\alpha_1,\ldots,\alpha_n$ and $\beta_1,\ldots,\beta_m$ were sampled (independently) from the uniform distribution over $[0.05, 0.5]$. Therefore, the noise magnitudes satisfy

$$ \frac{1}{400} \le \mathbb{E}\lVert \eta_i \rVert^2 \le \frac{1}{4}, \tag{16} $$

for all $i$, and can take any values in that range. Importantly, the noise magnitudes can vary substantially between data points, which is key in our setting. Then, the noise coordinates $\eta_{i,j}$ were sampled (independently) according to

$$ \eta_{i,j} \sim \mathcal{N}(0, \sigma_{i,j}^2). \tag{17} $$
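In code, this noise model can be sketched as follows (ours; the uniform range $[0.05, 0.5]$ for $\alpha_i, \beta_j$ is our reading of the bounds in (16), since $[0.05^2, 0.5^2] = [1/400, 1/4]$):

```python
import numpy as np

rng = np.random.default_rng(6)
n, m = 100, 500

# alpha_i, beta_j ~ U[0.05, 0.5]; with this range,
# E||eta_i||^2 = alpha_i * mean(beta) lies in [1/400, 1/4] as in Eq. (16).
alpha = rng.uniform(0.05, 0.5, size=n)
beta = rng.uniform(0.05, 0.5, size=m)

sigma = np.sqrt(np.outer(alpha, beta) / m)   # Eq. (15)
eta = rng.normal(size=(n, m)) * sigma        # Eq. (17)

expected_sq_norms = (sigma ** 2).sum(axis=1)  # E||eta_i||^2, per point
```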

Once we generated the noisy data points according to (6), we formed the clean and noisy kernel matrices $K$ and $\tilde{K}$, and computed $W^{(d)}$ and $\tilde{W}^{(d)}$ using Algorithm 1. Last, we also evaluated $W^{(r)}$, $\tilde{W}^{(r)}$, $W^{(s)}$, and $\tilde{W}^{(s)}$, using $K$ and $\tilde{K}$, respectively, according to (2) and (3).

The behavior of the errors $\lVert \tilde{W}^{(d)} - W^{(d)} \rVert_F$, $\lVert \tilde{W}^{(r)} - W^{(r)} \rVert_F$, and $\lVert \tilde{W}^{(s)} - W^{(s)} \rVert_F$ as a function of $m$ can be seen in Figure 1. It is evident that for large $m$ the error for the doubly-stochastic normalization is substantially smaller than that for the row-stochastic normalization or for the symmetric normalization. Additionally, the error for the doubly-stochastic normalization decreases linearly in logarithmic scale, while the errors for the row-stochastic and the symmetric normalizations reach saturation and never fall below a certain value. In this experiment, the slope of $\log \lVert \tilde{W}^{(d)} - W^{(d)} \rVert_F^2$ versus $\log m$ closely matched the slope of $-1$ suggested by the upper bound in Theorem 3 (which implies a slope of $-1$ for the squared Frobenius norm).

In Figure 2 we depict the noisy affinities $\tilde{W}^{(d)}_{i,j}$, $\tilde{W}^{(r)}_{i,j}$, and $\tilde{W}^{(s)}_{i,j}$ versus their corresponding clean affinities $W^{(d)}_{i,j}$, $W^{(r)}_{i,j}$, and $W^{(s)}_{i,j}$. It can be observed that the noisy affinities from the doubly-stochastic normalization concentrate near their corresponding clean affinities, while the noisy affinities from the row-stochastic and symmetric normalizations deviate substantially from their clean counterparts, particularly for larger affinity values.

Last, in Figure 3 we visually demonstrate the first row of the clean and noisy affinity matrices $W^{(d)}$, $\tilde{W}^{(d)}$, $W^{(r)}$, $\tilde{W}^{(r)}$, $W^{(s)}$, and $\tilde{W}^{(s)}$. Note that we only display about a quarter of all the entries, since all the other entries are vanishingly small. It can be seen that the clean row-stochastic, clean symmetric, and clean doubly-stochastic affinities are all very similar, and resemble a Gaussian. This is explained by the fact that both $W^{(r)}$ and $W^{(d)}$ are expected to approximate the heat kernel on the unit circle (see [coifman2006diffusionMaps, marshall2019manifold] and other related references given in the introduction), which is close to the Gaussian kernel with geodesic distance (for sufficiently small $\varepsilon$). Additionally, since the sampling density on the circle is uniform, $\operatorname{diag}(r)$ (from (2)) is close to a multiple of the identity, and hence $W^{(s)}$ is expected to be close to $W^{(r)}$. Indeed, this is what we observed in practice.

Importantly, the doubly stochastic normalization recovers the true affinities with high accuracy, with an almost perfect match between the corresponding clean and noisy affinities. On the other hand, there is an evident discrepancy between the corresponding clean and noisy affinities from the row-stochastic normalization and from the symmetric normalization.

#### 3.1.2 Noise sampled uniformly from a ball with smoothly varying radius

Next, we proceed by demonstrating the robustness of the leading eigenvectors from the doubly-stochastic normalization under heteroskedastic noise, and in particular, in the presence of noise whose magnitude depends on the local geometry of the clean data. Specifically, we simulated heteroskedastic noise whose magnitude varies smoothly with the angle of each point on the circle (see (14)), according to

$$ \eta_i \sim U\!\left( B_{\rho(\theta_i)} \right), \qquad \rho(\theta) = 0.01 + 0.99\, \frac{1 + \cos(2\theta)}{2}, \tag{18} $$

where $U(B_\rho)$ stands for the uniform distribution over $B_\rho$, a ball of radius $\rho$ in $\mathbb{R}^m$ (centered at the origin). That is, every noisy observation $\tilde{x}_i$ is sampled uniformly from a ball whose center is $x_i$ and whose radius is $\rho(\theta_i)$ from (18). Consequently, the maximal noise magnitude varies smoothly between $0.01$ (for $\theta = \pi/2,\, 3\pi/2$) and $1$ (for $\theta = 0,\, \pi$). A typical array of clean and noisy points arising from the noise model (18) can be seen in Figure 4.
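Sampling uniformly from a ball in $\mathbb{R}^m$, as required by (18), can be sketched as a uniform direction scaled by $\rho\, U^{1/m}$ with $U \sim U[0,1]$ (our sketch):

```python
import numpy as np

def uniform_ball(rho, m, rng):
    """Draw one point uniformly from the ball of radius rho in R^m:
    a uniformly random direction times rho * U**(1/m), U ~ U[0, 1]."""
    v = rng.normal(size=m)
    v /= np.linalg.norm(v)
    return rho * rng.uniform() ** (1.0 / m) * v

def rho_of(theta):
    # Radius profile of Eq. (18): between 0.01 and 1, smooth in theta.
    return 0.01 + 0.99 * (1 + np.cos(2 * theta)) / 2

rng = np.random.default_rng(7)
m = 1000
theta = np.pi / 4
eta = uniform_ball(rho_of(theta), m, rng)  # noise for a point at angle theta
```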

We generated the noisy data points according to (6), formed the noisy kernel matrix $\tilde{K}$, computed $\tilde{W}^{(d)}$ using Algorithm 1, and evaluated $\tilde{W}^{(r)}$ and $\tilde{W}^{(s)}$ using $\tilde{K}$ according to (2) and (3).

Figure 5 displays the five leading (right) eigenvectors of $\tilde{W}^{(d)}$, $\tilde{W}^{(r)}$, and $\tilde{W}^{(s)}$. It can be seen that the leading eigenvectors from the doubly-stochastic normalization are almost unaffected by the noise, which is evident from the fact that they approximate sines and cosines – the eigenfunctions of the Laplace-Beltrami operator on the circle. As sines and cosines are advantageous for expanding periodic functions, it is natural to employ the eigenvectors of $\tilde{W}^{(d)}$ for the purposes of regression, interpolation, and classification over the dataset. It is important to mention that other useful bases and frames can potentially be constructed from $\tilde{W}^{(d)}$ (see [coifman2006diffusionWavelets, hammond2011wavelets]). On the other hand, the eigenvectors obtained from $\tilde{W}^{(r)}$ and $\tilde{W}^{(s)}$ are strongly biased due to the heteroskedastic noise, and exhibit undesired effects such as discontinuities and localization, as evident from Figure 5. Clearly, this behaviour of the leading eigenvectors of $\tilde{W}^{(r)}$ and $\tilde{W}^{(s)}$ does not reflect the geometry of the data, but rather the characteristics of the noise (since, by (18), the noise variance is smallest at $\theta = \pi/2,\, 3\pi/2$ and largest at $\theta = 0,\, \pi$).

In Figure 6 we illustrate the two-dimensional embedding of the noisy data points using the second and third eigenvectors of $\tilde{W}^{(d)}$, $\tilde{W}^{(r)}$, and $\tilde{W}^{(s)}$ (corresponding to their second- and third-largest eigenvalues). That is, for each normalization, the $x$-axis and $y$-axis values of the embedding are given by the entries of the corresponding second and third eigenvectors (see also [belkin2003laplacian, coifman2006diffusionMaps]). It is clear that the embedding due to the doubly-stochastic normalization reliably represents the intrinsic structure of the clean dataset – a unit circle with uniform density – whereas the embeddings due to the row-stochastic and the symmetric normalizations are incoherent with the geometry and density of the clean points.

### 3.2 Example 2: Simulated single-cell RNA sequence data

Single-cell RNA sequencing (scRNA-seq) is a revolutionary technique for measuring target gene expressions of individual cells in large and heterogeneous samples [tang2009mrna, macosko2015highly]. Due to the method’s high resolution (single-cell level), it allows for the discovery of rare cell populations, which is of paramount importance in immunology and developmental biology. A typical scRNA-seq dataset is an $n \times m$ nonnegative matrix corresponding to $n$ cells and $m$ genes, where the $(i,j)$'th entry is an integer called the read count, describing the expression level of the $j$'th gene in the $i$'th cell. Importantly, the total number of read counts (or in short, total reads) per cell (i.e. the row sums) may vary substantially within a sample [kim2020demystifying].

We now exemplify the advantage of using the doubly stochastic normalization for exploratory analysis of scRNA-seq data. Specifically, we provide a simple prototypical example where the gene expression levels of cells are measured in two different batches, such that the number of total reads (per cell) within each batch is constant, but is substantially different between the batches. Therefore, the noise variance (modeled by the variance of the multinomial distribution, to be described shortly) differs between the observations in the two batches, giving rise to heteroskedastic noise. Such a scenario can arise naturally in scRNA-seq, either from the intrinsic read count variability common to such datasets, or when two datasets from two independent experiments are merged for unified analysis.

We consider a simulated dataset which includes only two cell types, described by probability vectors $p_1, p_2 \in \mathbb{R}^m$ over $m$ genes. The prototypes $p_1$ and $p_2$ were created by first sampling their entries uniformly (and independently) from $[0, 1]$, and then normalizing them so that they sum to $1$. That is,

$$ p_{1,j} = \frac{z_{1,j}}{\sum_{k=1}^{m} z_{1,k}}, \qquad p_{2,j} = \frac{z_{2,j}}{\sum_{k=1}^{m} z_{2,k}}, \qquad z_{1,j}, z_{2,j} \sim U[0,1]. \tag{19} $$

Next, each noisy observation $\tilde{x}_i$ was drawn from a multinomial distribution using either $p_1$ or $p_2$ as the probability vector, and normalized to sum to $1$, as described next. First, we generated a batch containing $n_1$ observations of $p_1$ and $n_2$ observations of $p_2$, each with $B_1$ multinomial trials. Second, we added a batch containing $n_3$ observations of $p_2$ only, each with $B_2 \gg B_1$ multinomial trials. To summarize, the total number of observations is $n = n_1 + n_2 + n_3$, given explicitly by

$$ \tilde{x}_i \sim \begin{cases} \tfrac{1}{B_1} \operatorname{Multinomial}(B_1, p_1), & 1 \le i \le n_1, \\ \tfrac{1}{B_1} \operatorname{Multinomial}(B_1, p_2), & n_1 < i \le n_1 + n_2, \\ \tfrac{1}{B_2} \operatorname{Multinomial}(B_2, p_2), & n_1 + n_2 < i \le n. \end{cases} \tag{20} $$

Therefore, the dataset consists of $n_1$ (normalized) multinomial observations of $p_1$, followed by $n_2 + n_3$ (normalized) multinomial observations of $p_2$. While all observations of $p_1$ are with $B_1$ multinomial trials, the observations of $p_2$ are split between $n_2$ observations with $B_1$ trials and $n_3$ observations with $B_2$ trials. Evidently, we can write

$$ \tilde{x}_i = p_{c_i} + \eta_i, \qquad c_i \in \{1, 2\}, \tag{21} $$

where $c_i$ denotes the cell type of the $i$'th observation, and $\eta_i$ is a zero-mean noise vector (arising from the multinomial sampling) whose magnitude $\mathbb{E}\lVert \eta_i \rVert^2$ is significantly smaller (by a factor of roughly $B_2/B_1$) for the observations in the second batch compared to those in the first.
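A simulation in this spirit can be sketched as follows (our sketch; the batch sizes and trial counts below are illustrative placeholders, not the paper's values):

```python
import numpy as np

rng = np.random.default_rng(8)
m = 500

# Two cell-type prototypes as in Eq. (19).
z1, z2 = rng.uniform(size=m), rng.uniform(size=m)
p1, p2 = z1 / z1.sum(), z2 / z2.sum()

def observe(p, trials, count, rng):
    """Normalized multinomial observations: the mean is p, and the noise
    variance shrinks as the number of trials grows."""
    return rng.multinomial(trials, p, size=count) / trials

# Illustrative batch sizes / read depths (placeholders):
batch1 = np.vstack([observe(p1, 1_000, 100, rng),    # shallow reads, p1
                    observe(p2, 1_000, 50, rng)])    # shallow reads, p2
batch2 = observe(p2, 100_000, 50, rng)               # much deeper reads, p2
X_tilde = np.vstack([batch1, batch2])
```

The second batch has far more trials per observation, so its observations of `p2` carry much smaller multinomial noise, reproducing the heteroskedasticity described above.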

Using the noisy observations $\tilde{x}_1,\ldots,\tilde{x}_n$, we formed the noisy kernel matrix $\tilde{K}$ of (1), computed the corresponding matrix $\tilde{W}^{(d)}$ using Algorithm 1, and evaluated the matrices $\tilde{W}^{(r)}$ and $\tilde{W}^{(s)}$ according to (2) and (3). Our methodology for choosing $\varepsilon$ was to take the smallest value for which Algorithm 1 converges within the desired tolerance (in this experiment we set a maximum number of iterations for the algorithm). We note that if $\varepsilon$ is too small, then $\tilde{K}$ becomes numerically too sparse, and the doubly-stochastic normalization may become numerically ill-posed.

Figure 7 illustrates the values (in logarithmic scale) of the obtained affinity matrices $\tilde{W}^{(d)}$, $\tilde{W}^{(r)}$, and $\tilde{W}^{(s)}$. It is evident that the affinity matrix from the doubly-stochastic normalization accurately describes the relationships between the data points. That is, $\tilde{W}^{(d)}$ indicates the similarities within the two groups of cell types (i.e. the observations of $p_1$ and of $p_2$), but also the dissimilarities between them, regardless of batch association. On the other hand, the affinity matrices from the row-stochastic and the symmetric normalizations are not loyal to the grouping according to cell types, but rather to batch association. In particular, $\tilde{W}^{(r)}$ and $\tilde{W}^{(s)}$ highlight the observations from the second batch as being most similar to all other observations. Clearly, the fundamental issue here is the heteroskedasticity of the noise, and specifically, the fact that the noise in the second-batch observations is considerably smaller than the noise in all the other observations.

One of the main goals of exploratory analysis of scRNA-seq data is to identify different cell types. Towards that end, non-linear dimensionality reduction techniques are often employed, among which t-distributed stochastic neighbor embedding (t-SNE) [maaten2008visualizing] is perhaps the most prominent [linderman2019fast, tirosh2016dissecting, villani2017single, habib2016div]. For its operation, t-SNE employs an affinity matrix that is a close variant of the row-stochastic normalization (2), where the kernel width parameter in (1) is allowed to vary between the different rows of the kernel matrix, and the resulting row-stochastic matrix is symmetrized by averaging it with its transpose. The different kernel widths are determined by a parameter called the perplexity, which is related to the entropy of each row of the resulting affinity matrix.
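The perplexity calibration described above can be sketched as a per-row bisection over the kernel precision. This is a simplified rendition of the standard t-SNE construction, not the MATLAB implementation used in the experiment:

```python
import numpy as np

def row_probs_and_entropy(d2_row, beta):
    """Gaussian affinities for one row with precision beta = 1/(2 sigma^2),
    together with the Shannon entropy (in nats) of the distribution."""
    p = np.exp(-beta * (d2_row - d2_row.min()))  # shift for numerical stability
    p /= p.sum()
    H = -(p * np.log(p + 1e-300)).sum()
    return p, H

def beta_for_perplexity(d2_row, perplexity, n_steps=200):
    """Bisect the precision so that exp(entropy) matches the target perplexity."""
    target = np.log(perplexity)
    lo, hi = 1e-12, 1e12
    for _ in range(n_steps):
        beta = np.sqrt(lo * hi)  # geometric bisection
        _, H = row_probs_and_entropy(d2_row, beta)
        if H > target:
            lo = beta  # entropy too large -> narrow the kernel
        else:
            hi = beta
    return np.sqrt(lo * hi)

rng = np.random.default_rng(2)
X = rng.normal(size=(50, 5))
d2 = ((X[:, None] - X[None]) ** 2).sum(-1)
row = np.delete(d2[0], 0)  # squared distances from point 0 to all others
beta = beta_for_perplexity(row, perplexity=15.0)
_, H = row_probs_and_entropy(row, beta)
print(np.exp(H))  # close to the target perplexity of 15
```

Since the entropy is monotone decreasing in the precision, the bisection converges to the unique width whose effective neighborhood size equals the requested perplexity.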

Even though the affinity matrix employed by t-SNE is a modification of the standard row-stochastic normalization, and uses a different kernel width for each row, it is still expected to suffer from the inherent bias observed in Figure (b). Specifically, note that the order of the entries in each row of the kernel matrix (when sorted by their values) does not depend on the kernel width, but only on the noisy pairwise distances, which are strongly biased by the magnitudes of the noise, as evident from Figure (b).
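This invariance is easy to verify numerically: the within-row ranking produced by the row-stochastic normalization is identical for any kernel width (synthetic data and arbitrary widths below):

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(30, 4))
d2 = ((X[:, None] - X[None]) ** 2).sum(-1)

def row_stochastic(d2, eps):
    K = np.exp(-d2 / eps)
    np.fill_diagonal(K, 0.0)
    return K / K.sum(axis=1, keepdims=True)

# exp(-d2/eps) is monotone decreasing in d2 for any eps, and dividing a row by
# its sum preserves the order, so the ranking within each row is eps-independent.
orders = [np.argsort(row_stochastic(d2, eps), axis=1) for eps in (0.5, 1.0, 5.0)]
consistent = all(np.array_equal(orders[0], o) for o in orders[1:])
print(consistent)  # True
```

Hence, any bias in the noisy pairwise distances propagates into the neighbor rankings regardless of how the width (or perplexity) is tuned.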

In Figures (a), (b), and (c) we demonstrate the two-dimensional visualizations obtained from t-SNE for the dataset, using several typical perplexity values. We used MATLAB’s standard implementation of t-SNE, activating the option that forces the algorithm to be exact (i.e., without approximating the affinity matrix). All other parameters of t-SNE were set to the default values suggested by the code (the code also suggests a default perplexity value).

In Figure (d) we display the two-dimensional visualization obtained from t-SNE when replacing its default affinity-matrix construction with the doubly-stochastic matrix described above, while leaving all other aspects of t-SNE unchanged. Since the optimization procedure in t-SNE is affected by randomness, we ran the experiment several times to verify that the results we exhibit are consistent.

While there are only two cell types in the data, no clear evidence of this fact can be found in the visualizations by t-SNE (Figures (a), (b), and (c)). Furthermore, these visualizations do not provide any noticeable separation between the cell types. On the other hand, the visualization obtained by modifying t-SNE to employ the doubly-stochastic affinity matrix (Figure (d)) allows one to easily identify and distinguish between the two cell types.

## 4 Summary and discussion

In this work, we investigated the robustness of the doubly-stochastic normalization to heteroskedastic noise, both from a theoretical perspective and a numerical one. Our results imply that the doubly-stochastic normalization is advantageous over the popular row-stochastic and symmetric normalizations, particularly when the data at hand is high-dimensional and suffers from inherent heteroskedasticity. Moreover, our experiments suggest that incorporating the doubly-stochastic normalization into various data analysis, visualization, and processing techniques for real-world datasets can be worthwhile. The doubly-stochastic normalization is particularly appealing due to its simplicity, solid theoretical foundation, and resemblance to the row-stochastic/symmetric normalizations, which have proved useful in countless applications.

The results reported in this work naturally give rise to several possible future research directions. On the theoretical side, it is of interest to characterize the convergence rate of the noisy doubly-stochastic affinity matrix to its clean counterpart also in terms of the number of points $n$ and the covariance matrices of the noise explicitly. As a particular simpler case, one may consider the high-dimensional setting where both $m$ and $n$ tend to infinity while their ratio is fixed (or tends to a fixed constant). On the practical side, it is of interest to investigate how to best incorporate the affinity matrix from the doubly-stochastic normalization into data analysis and visualization techniques. To that end, it is desirable to derive a method for picking the kernel parameter $\varepsilon$ automatically, or more generally, to determine how to make use of a variable kernel width (similarly to [zelnik2005self]) while retaining the robustness to heteroskedastic noise.

## 5 Acknowledgements

We would like to thank Boaz Nadler for his useful comments and suggestions. B.L., R.R.C., and Y.K. acknowledge support by NIH grant R01GM131642. R.R.C. and Y.K. acknowledge support by NIH grant R01HG008383. Y.K. acknowledges support by NIH grant 2P50CA121974. R.R.C. acknowledges support by NIH grant 5R01NS10004903.

## Appendix A Proof of Proposition 1

We first recall the definition of a fully indecomposable matrix [bapat1997nonnegative]. A matrix $B$ is called fully indecomposable if there are no permutation matrices $P$ and $Q$ such that

 $PBQ = \begin{bmatrix} B_1 & 0 \\ B_2 & B_3 \end{bmatrix},$ (22)

with $B_1$ and $B_3$ square. We now proceed to show that the matrix $W$ from Proposition 1 is fully indecomposable. Since the only zeros in $W$ are on its main diagonal, there is exactly one zero in every row and every column of $W$. Consequently, any permutation of the rows and columns of $W$ retains this property, namely a single zero in every row and every column. Therefore, if $n > 2$, it is impossible to find $P$ and $Q$ such that (22) holds for $B = W$, since there cannot be a block of zeros in $W$ whose number of rows or columns is greater than $1$. Hence, $W$ is fully indecomposable, and the existence and uniqueness of the scaling follow from Lemma 4.1 in [knight2008sinkhorn].
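The combinatorial argument above can be checked by brute force for small matrices. The following sketch tests the zero-block characterization of full indecomposability directly:

```python
import numpy as np
from itertools import combinations

def is_fully_indecomposable(B):
    """Brute-force check: B is fully indecomposable iff no choice of k rows
    (1 <= k <= n - 1) leaves at least n - k columns that are all-zero on
    those rows, i.e., no permutations expose a k x (n - k) zero block."""
    n = B.shape[0]
    for k in range(1, n):
        for R in combinations(range(n), k):
            zero_cols = np.all(B[list(R), :] == 0, axis=0).sum()
            if zero_cols >= n - k:  # enough all-zero columns for the block
                return False
    return True

# A positive matrix with zeros only on the diagonal, as in Proposition 1.
W3 = np.ones((5, 5)) - np.eye(5)   # n > 2: fully indecomposable
W2 = np.ones((2, 2)) - np.eye(2)   # n = 2: the argument fails
print(is_fully_indecomposable(W3), is_fully_indecomposable(W2))  # True False
```

The $n = 2$ case confirms why the argument requires more than two points: there, a single $1 \times 1$ zero block already suffices for the decomposition (22).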

## Appendix B Proof of Theorem 3

Throughout the proof, all quantities should be considered as dependent on the dimension $m$ (unless stated otherwise), while the number of points $n$ is fixed.

Let us define

 $u_i = d_i\, e^{-\|x_i\|^2/\varepsilon}, \qquad H_{i,j} = e^{2\langle x_i, x_j\rangle/\varepsilon},$ (23)

for $i, j = 1, \ldots, n$. By the definition of $W^{(d)}$ in (4), for $i \neq j$ we can write

 $W^{(d)}_{i,j} = d_i \exp(-\|x_i - x_j\|^2/\varepsilon)\, d_j = d_i\, e^{-\|x_i\|^2/\varepsilon}\, e^{2\langle x_i, x_j\rangle/\varepsilon}\, e^{-\|x_j\|^2/\varepsilon}\, d_j = u_i H_{i,j} u_j.$ (24)

Analogously, we define $\tilde{u}_i$ and $\tilde{H}_{i,j}$ by replacing $x_i$ and $d_i$ in (23) with $\tilde{x}_i$ and $\tilde{d}_i$, respectively, and we have that $\tilde{W}^{(d)}_{i,j} = \tilde{u}_i \tilde{H}_{i,j} \tilde{u}_j$.

Let $\odot$ denote the Hadamard (element-wise) product, $u = (u_1, \ldots, u_n)^T$, and $\tilde{u} = (\tilde{u}_1, \ldots, \tilde{u}_n)^T$. We can write

 $\begin{aligned} \|\tilde{W}^{(d)} - W^{(d)}\|_F &= \|\operatorname{diag}(\tilde{u})\, \tilde{H}\, \operatorname{diag}(\tilde{u}) - \operatorname{diag}(u)\, H\, \operatorname{diag}(u)\|_F \\ &= \|(u u^T) \odot (\tilde{H} - H) + \tilde{H} \odot (\tilde{u}\tilde{u}^T - u u^T)\|_F \\ &\le \max_{i,j}\{u_i u_j\} \cdot \|\tilde{H} - H\|_F + \max_{i,j}\{\tilde{H}_{i,j}\} \cdot \|\tilde{u}\tilde{u}^T - u u^T\|_F. \end{aligned}$ (25)
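As a quick numerical sanity check of the decomposition and bound in (25), one can verify the Hadamard-product identity on random positive vectors and matrices (the values below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 8
u = rng.uniform(0.5, 1.5, n)         # stand-ins for u and u-tilde
ut = rng.uniform(0.5, 1.5, n)
H = rng.uniform(0.5, 2.0, (n, n))    # stand-ins for H and H-tilde
Ht = rng.uniform(0.5, 2.0, (n, n))

# diag(ut) Ht diag(ut) - diag(u) H u diag(u) equals the Hadamard decomposition.
lhs = np.diag(ut) @ Ht @ np.diag(ut) - np.diag(u) @ H @ np.diag(u)
rhs = np.outer(u, u) * (Ht - H) + Ht * (np.outer(ut, ut) - np.outer(u, u))
print(np.allclose(lhs, rhs))  # True

# The Frobenius norm obeys the max-times-norm bound of (25).
fro = np.linalg.norm(lhs)
bound = (np.outer(u, u).max() * np.linalg.norm(Ht - H)
         + Ht.max() * np.linalg.norm(np.outer(ut, ut) - np.outer(u, u)))
print(fro <= bound + 1e-12)  # True
```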

We begin by bounding the entries of $\tilde{H} - H$, which is the subject of the following lemma.

###### Lemma 4.

For all $1 \le i, j \le n$,

 $|\tilde{H}_{i,j} - H_{i,j}| = O(m^{-1/2}),$ (26)

with high probability.

###### Proof.

Let us write

 $\langle \tilde{x}_i, \tilde{x}_j \rangle = \langle x_i, x_j \rangle + \langle x_i, \eta_j \rangle + \langle \eta_i, x_j \rangle + \langle \eta_i, \eta_j \rangle.$ (27)

According to (7) and the conditions in Theorem 3, for $i \neq j$ we have

 $\mathbb{E}\{\langle x_i, \eta_j \rangle + \langle \eta_i, x_j \rangle + \langle \eta_i, \eta_j \rangle\} = 0,$ (28)
 $\mathrm{Var}\{\langle x_i, \eta_j \rangle\} = \mathbb{E}[x_i^T \eta_j \eta_j^T x_i] = x_i^T \Sigma_j^2 x_i \le \|x_i\|^2 \|\Sigma_j^2\|_2 \le C_\eta^2 m^{-1},$ (29)
 $\mathrm{Var}\{\langle x_j, \eta_i \rangle\} = \mathbb{E}[x_j^T \eta_i \eta_i^T x_j] = x_j^T \Sigma_i^2 x_j \le \|x_j\|^2 \|\Sigma_i^2\|_2 \le C_\eta^2 m^{-1},$ (30)
 $\mathrm{Var}\{\langle \eta_i, \eta_j \rangle\} = \mathbb{E}[\eta_i^T \eta_j \eta_j^T \eta_i] = \sum_{k=1}^m \sum_{\ell=1}^m \mathbb{E}[\eta_{i,k}\eta_{i,\ell}]\, \mathbb{E}[\eta_{j,k}\eta_{j,\ell}] = \operatorname{Tr}\{\Sigma_i^2 \Sigma_j^2\} \le m \|\Sigma_i^2 \Sigma_j^2\|_2 \le m \|\Sigma_i\|_2^2 \|\Sigma_j\|_2^2 \le C_\eta^4 m^{-1}.$ (31)
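The variance identity in (31) can be verified by a Monte Carlo simulation for independent zero-mean noise; the diagonal covariances below are arbitrary placeholders:

```python
import numpy as np

rng = np.random.default_rng(5)
m = 6
# Hypothetical heteroskedastic diagonal covariances (the paper's Sigma_i^2).
A = np.diag(rng.uniform(0.1, 1.0, m))  # covariance of eta_i
B = np.diag(rng.uniform(0.1, 2.0, m))  # covariance of eta_j

N = 400_000
eta_i = rng.normal(size=(N, m)) * np.sqrt(np.diag(A))
eta_j = rng.normal(size=(N, m)) * np.sqrt(np.diag(B))
inner = np.einsum('nk,nk->n', eta_i, eta_j)  # N samples of <eta_i, eta_j>

empirical = inner.var()
analytic = np.trace(A @ B)  # Tr(Sigma_i^2 Sigma_j^2), as in (31)
print(empirical, analytic)  # the two agree up to Monte Carlo error
```

Note that the identity only uses independence and zero means, so it holds for the multinomial-type noise of the experiments as well as for the Gaussian noise sampled here.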

Therefore,

 $\mathrm{Var}\{\langle x_i, \eta_j \rangle + \langle \eta_i, x_j \rangle + \langle \eta_i, \eta_j \rangle\} \le \frac{6 C_\eta^2 + 3 C_\eta^4}{m},$ (32)

where we used the inequality $\mathrm{Var}\{A + B + C\} \le 3\,(\mathrm{Var}\{A\} + \mathrm{Var}\{B\} + \mathrm{Var}\{C\})$. Consequently, for any $p \in (0,1)$, Chebyshev’s inequality yields that with probability at least $p$

 $\left| \langle x_i, \eta_j \rangle + \langle \eta_i, x_j \rangle + \langle \eta_i, \eta_j \rangle \right| \le \sqrt{\frac{6 C_\eta^2 + 3 C_\eta^4}{m (1 - p)}} = O(m^{-1/2}).$ (33)

Using the above for a fixed $p$, a first-order Taylor expansion of the exponential function around $0$ gives

 $\exp\{2(\langle x_i, \eta_j \rangle + \langle \eta_i, x_j \rangle + \langle \eta_i, \eta_j \rangle)/\varepsilon\} = 1 + O(m^{-1/2}),$ (34)

and by (27) we have