# Optimal denoising of rotationally invariant rectangular matrices

In this manuscript we consider denoising of large rectangular matrices: given a noisy observation of a signal matrix, what is the best way of recovering the signal matrix itself? For Gaussian noise and rotationally-invariant signal priors, we completely characterize the optimal denoiser and its performance in the high-dimensional limit, in which the size of the signal matrix goes to infinity with fixed aspects ratio, and under the Bayes optimal setting, that is when the statistician knows how the signal and the observations were generated. Our results generalise previous works that considered only symmetric matrices to the more general case of non-symmetric and rectangular ones. We explore analytically and numerically a particular choice of factorized signal prior that models cross-covariance matrices and the matrix factorization problem. As a byproduct of our analysis, we provide an explicit asymptotic evaluation of the rectangular Harish-Chandra-Itzykson-Zuber integral in a special case.

• 1 publication
• 3 publications
• 74 publications
• 11 publications
• 67 publications
10/17/2021

### Perturbative construction of mean-field equations in extensive-rank matrix factorization and denoising

Factorization of matrices where the rank of the two factors diverges lin...
04/15/2020

### High-dimensional rank-one nonsymmetric matrix decomposition: the spherical case

We consider the problem of estimating a rank-one nonsymmetric matrix und...
06/09/2021

### Shrinkage Estimation of Functions of Large Noisy Symmetric Matrices

We study the problem of estimating functions of a large symmetric matrix...
07/19/2021

### Mismatched Estimation of rank-one symmetric matrices under Gaussian noise

We consider the estimation of an n-dimensional vector s from the noisy e...
12/08/2020

### Construction of optimal spectral methods in phase retrieval

We consider the phase retrieval problem, in which the observer wishes to...
08/02/2021

### A Random Matrix Perspective on Random Tensors

Tensor models play an increasingly prominent role in many fields, notabl...
12/08/2021

### Custom Orthogonal Weight functions (COWs) for Event Classification

A common problem in data analysis is the separation of signal and backgr...

## 1 Introduction

In this paper we consider the problem of denoising large rectangular matrices, i.e. the problem of reconstructing a matrix from a noisy observation . Our aim is to characterize theoretically this problem in the Bayes optimal setting, in which the statistician knows the details of the prior distribution of the signal and the noisy channel . In particular, we will consider the case of additive Gaussian noise where controls the strength of the noise, and

is a matrix of i.i.d. Gaussian random variables. On the side of the signal we will consider rotationally invariant priors, i.e. those where

for any pair of rotation matrices .

We are interested in particular in the case of factorized priors, i.e.  with and , and both and having i.i.d. Gaussian entries. From the point of view of applications, this is equivalent to the problem of denoising cross-correlation matrices, which is, for example, extremely relevant in modern financial portfolio theory where it allows to compute better performing investment portfolios by reducing overfitting cleaning2017.

From a theoretical point of view, the denoising problem is a simpler variant of the matrix factorization problem, where one would like to reconstruct the two factors and from the noisy observation . Matrix factorization is ubiquitous in computer science, modeling many applications, from sparse PCA sparsePCA to dictionary learning online_dictionary. While well studied in the low-rank regime and

, where optimal estimators and guarantees on their performances are available

2017Thibault, miolane2018fundamental, much less is known in the extensive-rank regime, where is of the same order as and .

This setting was studied, for generic priors on elements of and , in kabashima, where an analysis of the information-theoretic thresholds and the performance of approximate message passing algorithms was given. A more recent series of works pointed out however that a key assumption of kabashima does not hold in practice, and proposed alternative analysis techniques, ranging from spectral characterizations schmidt:tel-03227132, barbier2021statistical to high-temperature expansions maillard2021perturbative. In all these cases, the analytical results obtained for rectangular matrices are not explicit and of limited practical applicability even in the very simple case of Gaussian priors on the factors and additive Gaussian noise. Studying the denoising problem in the Gaussian regime is thus a first step towards a better understanding of matrix factorization in this challenging regime.

Our main results concern rotationally invariant

priors, i.e. signal matrices whose information lies exclusively in the distribution of their singular values, corrupted by additive Gaussian noise. For this large class of priors, we compute the optimal denoiser (in the sense of the mean square error on the matrix

) and its performance in the Bayes optimal setting, and we discuss in detail the case of factorized priors where both the factors and have i.i.d. Gaussian components. As a byproduct of our analysis, we also provide an explicit formula for the high-dimensional asymptotics of the rectangular HCIZ integral HC, IZ, guionnet2021large in a special case.

The code for the numerical simulations and for reproducing the figures is available here: https://github.com/PenombraET/rectangular_RIE

### 1.1 Definition of the model

In this paper we focus our attention to rotationally-invariant priors, i.e. for any pair of rotation matrices , where is the orthogonal group in dimensions, and additive white Gaussian noise

being a matrix of i.i.d. Gaussian random variables with zero mean and variance

111 We decided to use a symmetric normalization, that is to scale all quantities by instead of using or , highlighting the symmetry under transposition of the problem. In matrix factorization applications, would typically denote the number of samples, the dimensionality of the samples, and a more common normalization convention would require to normalize all quantities using . Our results can be adapted accordingly, as this normalization change amounts to an overall rescaling by . . The rotational invariance of the prior, together with the rotational invariance of Gaussian noise, implies that the observation will have a rotationally-invariant distribution, and that all relevant observables of the problem will depend only on the singular value distributions of and .

In order for the signal-to-noise ratio (SNR)

to be comparable over different choices of priors, we fix the ratio between the averaged norm of the signal matrix and that of the noise matrix to 1. Explicitly, one requires that

 Eprior[||\bS||22]=\EEnoise[||\bZ||22]=√mp, (1)

where the norm is defined as . We will consider the high-dimensional regime, i.e. the limit with fixed ratio Without loss of generality, we will consider , i.e. . We require that in this limit the singular values of the signal are of order

, and that the corresponding empirical singular value density of the prior converges to a deterministic probability density function.

As a particular choice of rotationally-invariant prior, we focus on on factorized signals, i.e.

 \bS∗=\bF\bX√r4√mp, (2)

where , are matrices with i.i.d. standard Gaussian entries. In the high dimensional limit, we will consider the extensive-rank regime, where is kept constant. We will also study the low-rank limit of this prior, i.e. the limit , or equivalently . This form of the prior models both Gaussian-factors matrix factorization and cross-covariance matrices of two datasets of samples in dimensions respectively and .

### 1.2 Main Results

Our main result is the analytical characterization of the optimal denoiser and its predicted performance in the high-dimensional Bayes-optimal setting, i.e. when the statistician knows the details of the prior distribution of the signal and the noisy channel, for rotationally-invariant priors and additive Gaussian noise. The optimal denoiser here is defined as the function of the observation that minimizes the average mean-square error (MSE) with the ground truth,

 ^S(⋅)=\argmindenoisersf1√mp\EE\bS∗,\bY||\bS∗−f(\bY)||22, (3)

where is the joint average over the ground truth and the noisy observation. Similarly we define the minimal mean-square error (MMSE) as the averaged MSE of the optimal estimator:

 MMSE=\EE\bS∗,\bY[MSE(\bS∗,^S(\bY))]. (4)

In order to state our results, let us denote by the symmetrized asymptotic singular value density of a -distributed matrix, i.e.

 ^σ\bY(x)=limm→∞1mm∑i=1[12δ(x−yi)+12δ(x+yi)], (5)

where are the singular values of a -distributed matrix of size . Let us also denote by

the singular value decomposition (SVD) of the actual instance of the observation

, where is orthogonal by rows, is orthogonal by columns and is the diagonal matrix of singular values of 222By concentration of spectral densities of large matrices, the empirical spectral density of an actual instance of converges to the deterministic asymptotic spectral density in the high-dimensional limit..

[Optimal estimator] The optimal estimator is rotationally-invariant, i.e. diagonal in the basis of singular vectors of the observation

, and it is given by

 ^S(\bY)=\bU\diag(ξ(λ1),…,ξ(λm))\bV, (6)

where the spectral denoising function is given, in the limit with fixed, by

 ξ(λ)=λ−2Δ√R1[R1−12λ+\dashintdζ^σ\bY(ζ)λ−ζ], (7)

and the integral is intended as a Cauchy principal value integral. We obtain Section 1.2 in Section 2 by computing the average of the posterior distribution

, i.e. the probability that a candidate signal

was used to generate the observation ,

 P(\bS∣\bY)=1\caZ\bYPsignal(\bS)Δ−mp2exp[−√mp2Δ\Tr((\bY−\bS)(\bY−\bS)T)], (8)

where is the partition function, i.e. the correct normalization factor

 \caZ\bY=∫d\bSPsignal(\bS)Δ−mp2exp[−√mp2Δ\Tr((\bY−\bS)(\bY−\bS)T)]. (9)

Indeed, one can prove that the posterior average is always the optimal estimator with respect to the MSE metric cover1999elements.

[Analytical MMSE] In the limit with fixed, the MMSE is given by

 MMSE=Δ−2Δ2∂∂Δ[1R1\dashintdλdζ^σ\bY(λ)^σ\bY(ζ)log|λ−ζ|+R1−1R1\dashintdλ^σ\bY(λ)log|λ|], (10)

where the dashed integral signs denote the symmetric regularization of the integrals (á la Cauchy principal value) around the singularities of the logarithms — see Appendix A. We obtain Section 1.2 in Section 2 by using the I-MMSE theorem guo2004mutual, which links the performance of optimal estimators in problems with Gaussian noise with the derivative of the partition function with respect to the SNR.

Notice that to implement numerically the denoising function Eq. 6 and the MMSE Eq. 10 one needs to compute the symmetrized asymptotic singular value density of the observation , . We will provide details on how to compute it for generic rotationally-invariant priors in Section 3.1 and in the special case of the Gaussian factorized prior in Section 3.3.

On a more technical note, the computation of the partition function Eq. 9, from which Section 1.2 and Section 1.2 are derived, involves the computation of the asymptotics of a rectangular Harish-Chandra-Itzykson-Zuber (HCIZ) integral, defined as

 \caIm(\bA,\bB;τ)=∫μm(d\bU)μp(d\bV)exp[τm\Tr(\bA\bV\bBT\bU)] (11)

for any pair of rectangular matrices and , where is the uniform measure over the -dimensional orthogonal group . Notice that the HCIZ integral depends only on the singular values of its arguments, as the singular vectors can always be reabsorbed in the integration over orthogonal groups.

We will justify in detail why the HCIZ integral appears in the computation of Eq. 9 in Section 2. The idea is that, after a change of variables to the SVD decomposition of in Eq. 9, the coupling term will be the only term depending on the singular vectors of . This term, together with integration over the singular vectors of , will give rise to a rectangular HCIZ integral . The integration over the spectrum of will be performed explicitly thanks to a combination of a concentration argument and Nishimori identities Nishimori_1980, so that the actual HCIZ integral we will be interested in is .

In general, the asymptotics of the HCIZ integral is difficult to characterize in closed form, and it is linked to the solution of a 1d hydrodynamical problem Matytsin, guionnet2021large. In the special case in which the two matrices over which the HCIZ integral is evaluated differ only by a matrix of i.i.d. Gaussian variables — which happens to be the case in our computation, as — this non-trivial problem simplifies, allowing for a closed from computation of the asymptotics of the HCIZ integral.

[Asymptotics of the rectangular HCIZ integral] Under the hypotheses of [guionnet2021large, Theorem 1.1], and in the case in which with a matrix of i.i.d. Gaussian variables with zero mean and variance , one has

 IR1[^σ\bA,^σ\bB;τ]=limm→∞p=R1m2m2log\caIm(\bA,\bB;τ)=C(R1,κ√τ)+R1logΔ+√R1Δ∫dλ^σ\bA(λ)λ2+√R1Δ∫dλ^σ\bB(λ)λ2−2(R1−1)\dashintdλ^σ\bB(λ)log|λ|−2\dashintdλdζ^σ\bB(λ)^σ\bB(ζ)log|λ−ζ|, (12)

where is the symmetrized asymptotic singular value density of, respectively, or and is an undetermined constant depending only on and on the product . Again, dashed integrals need to be regularized as detailed in Appendix A.

We justify Section 1.2 in Section 4 by specifying the general asymptotic form given in guionnet2021large to the case. This result generalizes Result 3.2 of maillard2021perturbative for the case of symmetric matrices to the case of rectangular ones.

Finally, let us note that we stated the main findings of the paper as Results instead of Theorems to highlight that we shall not provide a complete rigorous justification for each steps of computation, and leave some technical details to the reader. Nonetheless, we believe that each of our derivation could be made entirely rigorous through a slightly more careful control.

### 1.3 Numerical results and comparisons

Before presenting the technical details that justify Section 1.2, Section 1.2 and Section 1.2, we provide numerical simulations in the case of the Gaussian factorized prior Eq. 2. We have two aims: (i) corroborating our analytical result by showing that on actual instances of the denoising problem the performance of our estimator Eq. 6 (empirical MMSE) equals that predicted by the MMSE formula Eq. 10 (analytical MMSE); (ii) studying the phenomenology of the MMSE as a function of the noise level , the aspect-ratio and the rank-related parameter .

To implement numerically the denoising function Eq. 6 and the MMSE Eq. 10 one needs to compute the symmetrized singular value density of the observation , . We will provide details on how to compute it in the special case of the Gaussian factorized prior in Section 3.3. Given , one can compute (i) the empirical MMSE by considering a random instance of the observation , by denoising it with the optimal denoiser , and by computing the MSE between and the cleaned matrix ; (ii) the analytical MMSE Eq. 10 by numerical integration of . Details on numerical integration are given in Appendix B. At fixed , we distinguish two regimes for : over-complete and under-complete .

In the under-complete regime , the rank of the signal matrix is non-maximal. Fig. 1 shows a strong dependence of the MMSE on for , with better MMSE for larger . In particular, we observe that the MMSE at a given value of decreases as grows larger, and the cross-over between low and high denoising error shifts to larger values of (lower SNR). This is in accordance with intuition: larger correspond to matrices with higher aspect-ratio, and thus with a larger number of components (recall that is an matrix), while the (low) rank — here measuring the inverse degree of correlations between components of — remains fixed.

In the over-complete regime , and in particular for , the prior trivializes: each of the components of

is a sum of a large enough number of independent variables for the central limit theorem to hold. Thus,

becomes a matrix with i.i.d. coordinates, and our problem factorizes into simpler scalar denoising problems on each of the matrix components. Fig. 1 portrays the over-complete regime for . We see that the dependence on is extremely weak — see inset in Fig. 1 — signaling a very fast convergence towards scalar denoising. In the limit — see Fig. 2 — the MMSE converges to that of scalar denoising, , independently on .

In the extreme low-rank limit , we expect to recover the results of low-rank matrix denoising baik2004phase, 2017Thibault as already shown for the symmetric-matrix denoising in maillard2021perturbative

. In particular, we expect to recover the MMSE phase transition at

. Fig. 3 shows convergence towards the phase transition for a fixed value of , while Fig. 3 confirms that the dependence of on is the expected one. We give details on how to compute the theoretical MMSE in the low-rank limit in Appendix C.

### 1.4 Related works

Spectral denoisers for covariance matrices have been used extensively in finance, as in 2012Ledoit, RIEoriginal, cleaning2017. Very recently benaychgeorges2021optimal proposed an algorithm to denoise cross correlation matrices that is in spirit similar to the one we propose, with a crucial difference. In our setting, the signal matrix is corrupted by a noisy channel, and the objective is to get rid of the noise. In their setting, they are given two correlated datasets and , and the objective is to estimate their theoretical cross-covariance, which is a non-trivial task when the number of observations is comparable with the dimensions . Qualitatively, in their setting the noise is a finite-size effect due to a finite sample-to-dimension ratio. The algorithm of benaychgeorges2021optimal

is also a rotationally invariant estimator, i.e. one where every eigenvalue is rescaled by a function, though their function is different from our Eq.

7.

Our proof strategy is also different from the one used in the works cited above: we derive our estimator from a free entropy approach rooted in statistical physics, which allows us to predict a priori the MMSE between the signal and the denoised sample. This approach was presented for extensive-rank matrix denoising in schmidt:tel-03227132, maillard2021perturbative, barbier2021statistical. In maillard2021perturbative the denoising problem was solved for square symmetric priors, obtaining the already known denoiser of cleaning2017 and predicting its theoretical MMSE. The authors there study denoising in the context of a high-temperature expansion approach to matrix factorization. We extend their analysis to non-symmetric rectangular priors, and we find the symmetric-case denoiser as a special case of our optimal denoiser. In barbier2021statistical, the computation of the denoising free entropy is presented as a stepping stone towards the computation of minimum mean-squared error of extensive rank matrix factorization aiming to go beyond the rotationally invariant priors on . The mathematical challenge in barbier2021statistical

is linked to finding closed forms for the corresponding HCIZ integral in the limit of high dimension. The main point of their work is a closed form expression that presents a conjecture for the free entropy for matrix denoising and factorization in terms of some spectral quantities and the HCIZ integral. Evaluating this formula would require to compute HCIZ in a generic setting, which as far as we know is for the moment out of reach. If the conjecture of

barbier2021statistical is correct then our results can be seen as an explicit evaluation of their formula for the special case of rotationally invariant priors on . Note, however, that even checking that indeed for rotationally invariant priors the conjecture of barbier2021statistical recovers our results is so far open.

The asymptotic behaviour of HCIZ integrals — symmetric and rectangular — has been studied extensively in the literature, both in the context of free probability (GUIONNET2002461, guionnet2021large) and by the statistical physics community (Matytsin, 2003HCIZ, collins2020tensor), yet explicit forms in special cases were considered only sporadically instanton. To the best of our knowledge, the explicit form given in Eq. 12 has not been presented elsewhere.

## 2 Analytical derivation of the optimal estimator and its performance

As we briefly discussed in the introduction, both the optimal estimator and its MSE are related to properties of the posterior distribution Eq. 8 and its partition function Eq. 9. Moreover, both can be derived from the knowledge of the free entropy . Indeed, one can check that the posterior average, i.e. the optimal estimator, is given by

 ^S(\bY)iμ=∫d\bS\bSiμP(\bS∣\bY)=\bYiμ+Δ√mp1\caZ\bY∂\caZ\bY∂\bYiμ=\bYiμ+Δ√mp∂Φ\bY∂\bYiμ, (13)

by computing explicitly the derivative starting from Eq. 9. For the MMSE Eq. 4, by the I-MMSE theorem guo2004mutual, we have

 MMSE=Δ+2Δ2∂Δ\EE\bY[Φ\bY]. (14)

A sketch of the derivation of Eq. 14 with our specific normalizations and notations is given in Appendix D. The main focus of the section will be to compute the free entropy in the high-dimensional limit.

### 2.1 Asymptotics of the free entropy Φ\bY

To compute the free entropy, we start by performing a change of variable from to its singular value decomposition in Eq. 9, with , and diagonal (recall that without loss of generality we took ). We will denote the singular values as for . As usual — see anderson2010introduction for example — the Jacobian of the change of variable involves the Vandermonde determinant of the squared singular values , , as well as an additional term involving the product of singular values333The change of variable involves also a constant term depending only on the ratio that will not contribute to the computation of any relevant observable. For this reason we can safely avoid computing it explicitly, and the resulting free entropy will be computed up to -dependent terms.

 \caZ\bY=Δ−mp2∫μm(d\bU)μp(d\bV)d\bTPsignal(\bT)|Δ(\bT2)|m∏l=1Tp−ml×exp[−√mp2Δ(m∑l=1T2l+\Tr(\bY\bYT))]exp[√mpΔ\Tr(\bU\bT\bV\bYT)]=Δ−mp2exp[−√mp2Δ\Tr(\bY\bYT)]∫d\bTPsignal(\bT)|Δ(\bT2)|m∏l=1T(R1−1)ml×exp[−√mp2Δm∑l=1T2l]\caIm(\bT,\bY;√R1Δ). (15)

where is the uniform measure on the orthogonal group in dimension and where we recognized that the integral over rotation matrices reduces to the rectangular HCIZ integral defined in Eq. 11. Notice that we used the rotational invariance of the prior to argue that . In the non-rotational invariant case would also depend on the singular vectors of .

To move forward, we notice that all quantities appearing in the integral depend on the singular values matrix through its symmetrized empirical singular value density , and that they have finite asymptotic limit in the exponential scale 444 For the HCIZ, see Section 4. For the Vandermonde and the product of singular values, we notice that each product over (which converts into a sum when exponentiating) contributes to a power in the exponential scale.. Then, we change integration variables from the ”-dimensional” empirical density to a generic probability density : we argue in Appendix E that in the large limit this change introduces only a term in the exponential scale , which is thus subleading with respect to the original integrand and can be discarded. Finally, we use Laplace’s method to observe that the integral concentrates in the scale onto the value of the integrand at a deterministic density ; Nishimori identities Nishimori_1980 guarantee then that (see also barbier2021statistical, maillard2021perturbative for more discussions of this phenomenon). Thus, by taking the leading asymptotic order of all terms and disregarding all finite-size corrections and all constant terms depending only on

 Φ\bY=−12logΔ+1mplogPsignal[^σ\bS∗]−12Δ√R1\Var[^σ\bS∗]+1R1Σ[^σ\bS∗]+R1−1R1Λ[^σ\bS∗]−12Δ√R1\Var[^σ(\bY)]+12R1IR1[^σ\bS∗,^σ(\bY);√R1Δ] (16)

where , and , and is the asymptotic symmetrized singular value densities of . instead is the symmetrized empirical density of singular values of the fixed instance of the observation . All dashed integrals are regularized as detailed in Appendix A.

In Eq. 16 we took the high-dimensional limit of the prior term in a rather uncontrolled way. Treating the limit rigorously is delicate, see for example the discussions in [barbier2021statistical, Section II.C], but brings no surprises for non-pathological priors — for example, the factorized priors in the symmetric version of our problem maillard2021perturbative. As mentioned at the start of the section, all quantities we are interested in depend on derivatives of the free entropy with respect to or to , and the prior term brings no contribution at all in these cases. For this reason, we do not treat it in detail.

We will show in Section 4 that at leading order

 IR1[^σ\bS∗,^σ(\bY);√R1Δ]=CR1+R1logΔ+√R1Δ\Var[^σ\bS∗]+√R1Δ\Var[^σ(\bY)]−2(R1−1)Λ[^σ(\bY)]−2Σ[^σ(\bY)] (17)

for some constant depending only on . The asymptotic free entropy thus equals

 Φ\bY=const(R1)+1mplogPsignal[^σ\bS∗]+1R1Σ[^σ\bS∗]+R1−1R1Λ[^σ\bS∗]−1R1Σ[^σ(\bY)]−R1−1R1Λ[^σ(\bY)], (18)

where gathers all constants depending on that we did not trat explicitly, namely the constant in the HCIZ asymptotics and the constant in the intial SVD change of variable. Notice that only the last two terms depend either on or on (through the spectral properties of ).

### 2.2 Explicit form of the MMSE and the optimal estimator

The MMSE can be computed directly by combining Eq. 14 with Eq. 18, obtaining

 MMSE=Δ−2Δ2∂∂Δ\EE\bY[1R1Σ[^σ(\bY)]+R1−1R1Λ[^σ(\bY)]]. (19)

Thanks to the concentration of spectral densities in the high-dimensional limit, the average over can be performed directly by substituting the empirical symmetrized singular value density of the specific instance of the observation with the asymptotic singular value density of the observation, which is a deterministic quantity: it depends only on the statistical properties of , and not on its specific value. Thus, we obtain Section 1.2.

To compute the explicit form of the optimal estimator, we start from Eq. 13 and notice that the free entropy depends only on spectral properties of the actual instance of the observation . Thus, the derivative of the free entropy w.r.t. one component of can be decomposed on the eigenvalues of as follows

 ^S(\bY)iμ=\bYiμ+Δ√mp∂Φ\bY∂\bYiμ=\bYiμ+Δ√mpm∑l=1∂Φ\bY∂yl∂yl∂\bYiμ, (20)

where in the last passage we called the singular values of .

To compute the derivative, we use a variant of the Hellman-Feynman theorem Cohen-Tannoudji:101367. The Hellman-Feynman theorem considers a symmetric matrix depending on a parameter, and relates the derivative of an eigenvalue of the matrix with respect to that parameter with the derivative of the matrix itself. In Appendix F we show that the Hellman-Feynman theorem can be adapted to the case of rectangular matrices and singular values, and in particular we show that , if is a matrix, one of its singular values, and the corresponding left and right singular vectors, and all these quantities depend on a parameter .

To compute we need to perturb the original matrix as so that (here and are the left and right singular values of corresponding to the -th singular vector)

 ∂yl∂\bYi0μ0=∂yl∂λ=(ul)T⋅∂λ\bY(λ)⋅vl=m,p∑i,μ=1uliδii0δμμ0vlμ=uli0vlμ0. (21)

For clarity, is the i-th component of the left singular vector corresponding to the -th singular value, and similarly for . Thus

 ⟨\bSiμ⟩=\bYiμ+Δ√mpm∑l=1∂Φ\bY∂ylulivlμ=m∑l=1[yl+Δ√mp∂Φ\bY∂yl]ulivlμ=m∑l=1ξ(yl)ulivlμ, (22)

where we used that by definition of SVD and we introduced the spectral denoising function . This proves the first claim of Section 1.2, i.e. that the optimal denoiser is diagonal in the bases of left and right singular vectors of the observation .

To obtain an explicit form for the spectral denoising function , we need to compute . For this, we consider the part of the free entropy depending on , call it , in the discrete setting (all following equalities are intended at leading order for large ), i.e.

 Ψ\bY=−R1−1R11m∑l:yl≠0log|yl|−1R11m2∑l≠l′log|yl−yl′|, (23)

so that

 √mp∂Ψ\bY∂yl=−R1−1yl√R1−2√R11m∑k1yl−yk=−1√R1R1−1yl−2√R1\dashintdx^σ(\bY)(x)yl−x, (24)

and the spectral denoising function is finally given by

 ξ(y)=y−2Δ√R1[R1−12y+\dashintdζ^σ(\bY)(ζ)y−ζ]. (25)

We thus recover the second part of Section 1.2, i.e. the expression for the denoising function , by invoking once again concentration of spectral densities, so that can be safely substituted by its asymptotic deterministic counterpart in the high-dimensional limit.

## 3 Specific priors

In this section we provide all the ingredients needed to specialize Section 1.2 and Section 1.2 to specific choices of rotationally-invariant priors. We then analyze in detail the case of the Gaussian factorized matrix prior defined in Eq. 2.

### 3.1 General remarks

The two main ingredients needed to make our results explicit for a specific form of the prior are the symmetrized singular value density of the observation and its Hilbert transform . Now we show how to compute them analytically.

The first ingredient needed is the so-called Stieltjes transform of the symmetrized singular value density , i.e.

 g\bA(z)=∫dx^σ\bA(x)z−xforz∈\bbC∖\supp^σ\bA, (26)

where denotes the support . Indeed, one can show that (see for example potters_bouchaud_2020)

 ^σ\bA(x)=1πlimϵ→0+Ig\bA(x−iϵ)and\dashintdx^σ\bA(x)y−x=limϵ→0+Rg\bA(x−iϵ) (27)

where the second equality follows from Kramers–Kronig relations (Jackson:100964). Thus, from the knowledge of one can recover both the symmetrized singular value density and the corresponding Hilbert transform, obtaining all the ingredients needed to make the optimal estimator Eq. (6) and the MMSE Eq. (10) explicit for a given prior. Notice that, while in the following we will focus on cases in which can be computed analytically, our algorithm can be in principle applied to any rotationally-invariant ensemble of random matrices by sampling a large matrix and computing its singular values to estimate as in ledoit2009eigenvectors.

Thus, the questions shifts to how to compute the Stieltjes transform of , which in turn entails two sub-problems: determining the Stieltjes transform of the prior , and then adding the effect of the noise.

In order to compute the Stieltjes transform of the prior555 in the following as the reasoning holds for generic matrices . , we consider the matrix . Indeed, if has SVD with diagonal , then , from which we see that there is a one-to-one correspondence between singular values of and eigenvalues of . Namely the former are the positive square roots of the latter (notice that is symmetric and positive semi-definite)666One needs to be careful here: we used that and , so that has as much eigenvalues as the number of singular values of . In general, one needs to choose between and the one with lower dimensionality in order not to add spurious null singular values. This eigenvalues-singular values relations has a consequence on the corresponding Stieltjes transforms. Indeed

 g\bA(z)=∫dx^σ\bA(x)z−x=∫∞0dx2zσ\bA(x)z2−x2=∫∞0dλzσ\bA\bAT(λ)z2−λ=zg\bA\bAT(z2) (28)

where is the usual spectral density of . Thus, Eq. 28 links the Stieltjes transform of the singular value density of a rectangular matrix, which is in principle a very generic and non-trivial quantity to compute, to the Stieltjes transform of the eigenvalue density of a symmetric square matrix, for which many explicit analytical and numerical results already exist in the literature potters_bouchaud_2020.

Notice that the procedure just described is of no help in order to add the noise: in fact, is a sum of products of matrices that are not independent (free) between each other. Thus, usual free probability techniques based on the or transforms and free convolution of symmetric matrices would fail to treat this problem.

In order to add the effect of the noise, one needs rectangular free probability techniques. We do not enter into the details here: we refer the interested reader to Benaych-Georges2009. The underlying idea is not different from what happens in the more conventional symmetric case: there exists a functional transform of the Stieltjes transform, the rectangular transform, that linearizes the sum of random matrices, i.e. whenever the random matrices and are mutually free. Thus, one can access the Stieltjes transform of the sum by computing two direct transforms, and one inverse transform. In practice, the computation of the rectangular transform entails the numerical estimation of functional inverses as soon as one departs from the simplest case of matrices with i.i.d. Gaussian entries. For this reason, in the special case of factorized priors Eq. 2, we will not pursue this very general strategy. Instead, in Section 3.3 we will adapt the results of NIPS2017_0f3d014e, Louart, which are easier to implement numerically.

As a final remark, notice that the Stieltjes transform scales under a rescaling of as . This will be useful to deal with the many constant terms appearing in our computations.

### 3.2 Symmetric priors

The case of symmetric priors has already been treated originally in RIEoriginal, and more recently in maillard2021perturbative with techniques akin to ours. In particular, with a symmetric prior one can repeat our analysis using eigenvalue densities instead of singular value densities. For positive semi-definite symmetric priors, it is immediate to see that our results reduce to those of RIEoriginal, maillard2021perturbative, as singular values coincide with eigenvalues in this case. For generic symmetric priors, singular values are the absolute values of eigenvalues, and thus the singular value density is simply given by the symmetrized spectral density.

### 3.3 The case of Gaussian factorized prior

The prior we would like to focus our attention on is given by the model of Gaussian extensive-rank factorized signals defined in Eq. 2. As argued in Section 3.1, we need a way to compute the Stieltjes transform of the observation for this specific prior in order to have access to — to numerically compute the integrals in the MMSE formula Eq. 10 — and its Hilbert transform — to be able to perform actual denoising on given instances of using Eq. 6.

To this end, we generalize the results given in NIPS2017_0f3d014e. There, the authors study the spectrum of , where is a component-wise non-linearity and matrices with i.i.d. Gaussian entries. They show that the Stieltjes transform of the spectral density of satisfies a degree-four algebraic equation depending only on the aspect-ratio , the rank parameter and two Gaussian moments of , and , where denotes integration over a standard Gaussian random variable.

Recall that, thanks to Eq. 28, knowing the Stieltjes transform of the spectral density of is equivalent to knowing that of the singular value density of . Thus, we could compute by choosing a noisy function , where is a Gaussian random variable (actually, one for each component of the matrix, all i.i.d.). In order to do that, we just need to show that the results of NIPS2017_0f3d014e extend to non-deterministic non-linearities — only the deterministic case is considered therein. This happens to be the case: we provide the details of the extension in Appendix G. The only effect of the noise of the non-linearity is a redefinition of the Gaussian moments of , and in particular and , where denotes averaging over the noise induced by .

Thus, we can safely use the results from NIPS2017_0f3d014e to compute the Stieltjes transform of the singular value density of for Gaussian factorized priors. There is a non-trivial mismatch between our normalizations and those of NIPS2017_0f3d014e: to fall-back directly onto their results, we need to choose , and set their parameters to , and .

With these choices, the Gaussian moments of are given by and . The algebraic equation for is given by [NIPS2017_0f3d014e, SM, Eq. S42-S43] and is reported in Appendix G.

## 4 Asymptotics of the HCIZ integral

In this section we consider the asymptotics of the rectangular HCIZ presented in guionnet2021large and adapt them to the problem of denoising.

We follow the notations of the paper. In particular, their parameters and in our case equal and . The HCIZ integral was defined in Eq. 11, and its asymptotic value is defined as

 Iα[^σ\bA,^σ\bB;λ]=limm→∞,p=(1+α)m2m2log\caIm(\bA,\bB;λ), (29)

where are the asymptotic symmetrized singular value densities of and . Following [guionnet2021large, Theorem 1.1], the asymptotic of the HCIZ integral with parameter equals

 Iα[^σ\bA,^σ\bB;τ=1]=Cα+\Var[^σ\bA]−αΛ[^σ\bA]−Σ[^σ\bA]+\Var[^σ\bB]−αΛ[^σ\bB]−Σ[^σ\bB]−inf{^ρt}0≤t≤1{∫10ds∫dx^ρs(x)[v2s(x)+π23^ρ2s(x)+α24x2]} (30)

where , , , is an unspecified constant depending only on , the infimum is taken over continuous symmetric density valued processes such that and , and where is a solution to the continuity equation . We refer the reader to the original work for precise definitions and assumptions over all quantities.

The non-trivial part of Eq. (30

), i.e. the optimization problem, is a mass transport problem whose solution interpolates between the singular value densities of the two matrices

and . This transport problem can be understood as the evolution of a hydrodynamical system, and it can be shown [menon2017complex, Eq. 4.13] that the resulting density profile (i.e. the process at which the infimum is reached) is the symmetrized singular value density of the Dyson Brownian bridge between and , i.e. the symmetrized singular value density of the matrix

 \bX(t)=(1−t)\bA+t\bB+√t(1−t)\bW (31)

where is a matrix of i.i.d. Gaussian variables with mean zero and variance . It can be shown [guionnet2021large, section 4] that the velocity field satisfies

 vt(x)=πH[^ρt](x)+α2x+D(x,t),H[^ρ](x)=1π\dashintdy^ρ(y)x−y, (32)

where is the Hilbert transform, the Cauchy principal value integral, and is a drift term that ensures that the Brownian motion ends precisely at when .

Our aim is to make Eq. 30 explicit in the