# AIR-Net: Adaptive and Implicit Regularization Neural Network for Matrix Completion

Conventionally, the matrix completion (MC) model aims to recover a matrix from partially observed elements. Accurate recovery necessarily requires a regularization that properly encodes priors of the unknown matrix/signal. However, encoding the priors accurately for complex natural signals is difficult, and even then, the model might not generalize well outside the particular matrix type. This work combines adaptive and implicit low-rank regularization, which captures the prior dynamically according to the currently recovered matrix. Furthermore, we aim to answer the question: how does adaptive regularization affect implicit regularization? We utilize neural networks to represent the adaptive and implicit regularization and name the proposed model AIR-Net. Theoretical analyses show that the adaptive part of AIR-Net enhances implicit regularization. In addition, the adaptive regularizer vanishes at the end of training and thus avoids saturation issues. Numerical experiments on various data demonstrate the effectiveness of AIR-Net, especially when the locations of missing elements are not chosen at random. With complete flexibility in selecting neural networks for matrix representation, AIR-Net can be extended to solve more general inverse problems.



## 1 Introduction

The matrix completion (MC) problem, which aims to recover a matrix $X^* \in \mathbb{R}^{m \times n}$ from its partially observed elements, has arisen in numerous domains, ranging from computer vision (Wen et al., 2012) and recommender systems (Netflix, 2009) to drug–target interaction (DTI) prediction (Mongia and Majumdar, 2020). This fundamental problem is ill-posed without assumptions on $X^*$, since infinitely many completions agree with the observations. It is therefore essential to impose additional information or priors on the unknown matrix/signal.

Describing the prior for a natural signal, i.e., restricting the solution to the corresponding space, is difficult. Classical methods for MC are mainly based on low-rank, sparsity, or piece-wise smoothness assumptions (Rudin et al., 1992; Buades et al., 2005; Romano et al., 2014; Dabov et al., 2007). These priors describe simple structural signals well, but may lead to a poor approximation of matrices with complex structures (Radhakrishnan et al., 2021), especially when the observed entries are not sampled uniformly at random. Recently, deep neural networks (DNNs) have shown a strong ability to extract complex structures from large datasets (Li et al., 2018; Mukherjee et al., 2021). However, such large datasets cannot be obtained in many scenarios. Fortunately, DNNs also work in solving some inverse problems without any extra training set (Ulyanov et al., 2018). That an over-parameterized DNN performs well on a single matrix is a mysterious phenomenon; one explanation is that implicit regularization arises during training (Arora et al., 2019; Xu et al., 2019; Rahaman et al., 2019; Chakrabarty and Maji, 2019). Although DNNs with implicit regularization outperform some classical methods, implicit regularization alone is insufficient to describe the space of complex signals, and extra explicit regularization can improve performance in signal recovery (Metzler et al., 2018; Boyarski et al., 2019a; Liu et al., 2019; Li et al., 2020). However, such explicit priors are often valid only for specific data or sampling patterns. A more flexible regularization is required to meet practical MC problems.

We introduce this flexibility by first representing the explicit regularization using a DNN, without any extra training set. The explicit regularization we begin with is the Dirichlet Energy (DE), formulated as $\mathrm{tr}(X^\top L X)$, with a Laplacian matrix $L$ describing the similarity between columns. Note that $L$ in DE is fixed during iteration, and building an exact $L$ from the incomplete observations is very challenging. Therefore, we parameterize $L$ with a DNN and revise it iteratively during training. Furthermore, we combine the learned DE, which is an adaptive regularizer, with implicit regularization to form a new regularization method for MC named AIR-Net. The interaction between explicit and implicit regularization in solving MC problems is further studied. The results show that combining the two yields a new, more flexible regularization model and enhances the low-rank preference of implicit regularization. In many examples, AIR-Net has a stronger feature-representation ability and a wider application range, and shows state-of-the-art performance.

## 2 Adaptive and implicit regularization neural network

Our model is proposed as follows:

$$\min_{X, W_i} \mathcal{L}_{all} = \mathcal{L}_Y\big(\mathcal{A}(X^*), \mathcal{A}(X)\big) + \sum_{i=1}^{N} \lambda_i \cdot \mathcal{R}_{W_i}\big(\mathcal{T}_i(X)\big), \qquad (1)$$

where $X^*$ is the underlying matrix and $\mathcal{A}$ is the sampling operator associated with the set of observed coordinates; the other entries are missing. Different from other regularization models for MC, here $X$ is represented by a neural network that tends to be low-rank implicitly (Section 2.1), and $\mathcal{R}_{W_i}$ is an adaptive regularization with a Laplacian matrix represented by a forward neural network (Section 2.2). The detailed notations will be introduced in the corresponding sections. A specific case of Equation 1 for matrix completion is given in Section 2.3.

### 2.1 DMF as an implicit regularization

In order to make the model suitable for more matrix types, we need a more general data prior. Low rank is a very general prior across matrix types. There are two main ways to encode the low-rank prior into a model: (a) adding an explicit regularization term such as the rank or the nuclear norm (Candès and Recht, 2009; Lin et al., 2010); (b) using a low-dimensional latent variable model to represent $X$, including matrix factorization (MF) and its varieties (Koren et al., 2009; Fan and Cheng, 2018). The first case suffers from the saturation issue induced by explicit regularization. The second one faces the problem of estimating a proper latent-variable dimension.

Unlike the existing MF model, which constrains the size of the shared dimension of the factor matrices, deep matrix factorization (DMF) can take a large shared dimension and still preserve the low-rank property without explicit regularization. This is the so-called implicit low-rank regularization of DMF:

$$X(t) = W^{[L-1]}(t)\, W^{[L-2]}(t) \cdots W^{[1]}(t)\, W^{[0]}(t) \in \mathbb{R}^{m \times n},$$

where $L$ is the depth of the factorization and $W^{[l]}(t)$ is the $l$-th factor matrix at time $t$ during training. The results are given under a mild assumption (Assumption 1 in Section A.2). This property helps avoid both dimension estimation and saturation issues. The details of the implicit low-rank regularization are discussed in Section A.2.
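To make the factorization concrete, the following is a minimal numpy sketch of the DMF parameterization (the names are illustrative, not from the authors' released code). Note that the shared dimension is kept at full size, so any low rank in the product must come from the training dynamics rather than from an explicit rank cap:

```python
import numpy as np

def dmf_product(factors):
    """X = W[L-1] @ ... @ W[0] for factors listed as [W[0], ..., W[L-1]]."""
    X = factors[0]
    for W in factors[1:]:
        X = W @ X
    return X

rng = np.random.default_rng(0)
m = n = hidden = 20   # shared dimension kept full: no explicit rank constraint
depth = 3             # L = 3 factor matrices
# near-zero initialization is commonly associated with the implicit low-rank bias
factors = [rng.normal(scale=1e-2, size=(hidden, hidden)) for _ in range(depth)]
X = dmf_product(factors)
print(X.shape)  # (20, 20)
```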

### 2.2 Adaptive regularizer

Apart from the low-rank prior, self-similarity is also a typical prior: the patches in an image and the rating behavior of users are all examples of self-similarity. For example, there is always a certain degree of self-similarity between the blocks in an image. A classical way to encode the similarity prior is the Dirichlet Energy (DE), formulated as $\mathrm{tr}(X^\top L X)$. But DE faces two problems in applications: (a) $X^*$ is unknown in the MC problem, so constructing $L$ based on the incomplete observation may induce a worse prior; (b) the formulation of DE only encodes the similarity of the columns of $X$, and other similarities such as block similarity cannot be captured. To address both issues, we parameterize $L$ with a DNN and replace $X$ by a transformed $\mathcal{T}(X)$ to capture self-similarity flexibly.
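As a sanity check on the DE formula, $\mathrm{tr}(X^\top L X)$ with a graph Laplacian $L$ equals a weighted sum of squared differences between the rows of its argument (apply it to $X^\top$ to couple columns instead). A small numpy verification of this standard identity, using an arbitrary symmetric adjacency (all names here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 5, 7
X = rng.normal(size=(m, n))

A = rng.random((m, m))
A = (A + A.T) / 2                 # symmetric, non-negative adjacency weights
np.fill_diagonal(A, 0.0)
L = np.diag(A.sum(axis=1)) - A    # graph Laplacian: degree matrix minus adjacency

dirichlet = np.trace(X.T @ L @ X)
pairwise = 0.5 * sum(A[k, l] * np.sum((X[k] - X[l]) ** 2)
                     for k in range(m) for l in range(m))
print(np.isclose(dirichlet, pairwise))  # True: DE penalizes rows that A marks similar
```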

The adaptive regularization is defined as

$$\mathcal{R}_{W}\big(\mathcal{T}(X)\big) = \mathrm{tr}\big(\mathcal{T}(X)^\top L_W\, \mathcal{T}(X)\big),$$

where the Laplacian $L_W$ is parameterized by a neural network with weights $W$. To keep the Laplacian properties of $L_W$, a special design of the parameterized structure is important; we design a forward neural network that encodes the properties of a Laplacian matrix in its structure. The details are discussed in A.4. The transformation $\mathcal{T}$ maps $X$ into a special domain, which makes AIR-Net able to capture various relationships embedded in the data. A common choice is $\mathcal{T}(X) = X$, which captures the relationship between columns; the regularization captures the relationship between rows when $\mathcal{T}(X) = X^\top$. In particular, if the columns of $\mathcal{T}(X)$ are the row-by-row vectorizations of the blocks of $X$, then the similarity among blocks can be obtained. A natural question that arises is what $L_W$ looks like during training.

Obviously, the regularizer reaches its minimum when $L_W = 0$, and this is called a trivial solution. The most exciting thing is that when we minimize Equation 1 with the gradient descent algorithm, $L_W$ converges to a non-trivial solution. Another expected phenomenon is that the regularizer vanishes at the end of training and will not cause the so-called saturation issue, i.e., a bias term induced by explicit regularization that dominates the overall estimation error. We illustrate these phenomena both by theoretical analysis (Theorem 2 in Section 3.2) and numerical experiments (Section 4).
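To make Section 2.2 concrete, here is a small numpy sketch of the adaptive regularizer. The `exp(W + W^T)` normalization is an assumption based on the construction in A.4, and the function names are illustrative, not the authors' released code. It shows a structurally valid Laplacian produced from unconstrained weights, the three choices of $\mathcal{T}$ mentioned above, and the penalty $\mathrm{tr}(\mathcal{T}(X)^\top L_W \mathcal{T}(X))$:

```python
import numpy as np

def laplacian_from_weights(W):
    """Map unconstrained weights to a valid Laplacian (symmetric, non-negative
    off-diagonal adjacency, zero row sums), following the exp(W + W^T)
    parameterization sketched in A.4."""
    A = np.exp(W + W.T)
    A = A / A.sum()                          # normalization keeps the scale bounded
    A_off = A * (1.0 - np.eye(A.shape[0]))   # drop self-loops
    return np.diag(A_off.sum(axis=1)) - A_off

def block_transform(X, bh, bw):
    """T(X) whose columns are row-by-row vectorizations of the non-overlapping
    bh-by-bw blocks of X."""
    m, n = X.shape
    cols = [X[i:i + bh, j:j + bw].reshape(-1)
            for i in range(0, m, bh) for j in range(0, n, bw)]
    return np.stack(cols, axis=1)

def adaptive_regularizer(TX, L):
    """Dirichlet-energy style penalty tr(T(X)^T L T(X))."""
    return np.trace(TX.T @ L @ TX)

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 6))
for TX in (X, X.T, block_transform(X, 2, 2)):   # columns / rows / blocks
    L = laplacian_from_weights(rng.normal(size=(TX.shape[0], TX.shape[0])))
    assert np.allclose(L, L.T) and np.allclose(L.sum(axis=1), 0.0)
    print(adaptive_regularizer(TX, L) >= 0.0)  # True: such Laplacians are PSD
```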

### 2.3 AIR-Net for MC

In this subsection, we specialize our model to the MC problem. We select $\mathcal{T}_r(X) = X$ and $\mathcal{T}_c(X) = X^\top$ to capture the relationships in both the rows and the columns of $X$. Overall, the theoretical analyses for the general inverse problem are based on Equation 2.

$$\min_{W^{[l]}, W_r, W_c} \mathcal{L}_{all} = \mathcal{L}_Y\big(\mathcal{A}(X^*), \mathcal{A}(X)\big) + \lambda_r \cdot \mathcal{R}_{W_r}(X) + \lambda_c \cdot \mathcal{R}_{W_c}(X^\top), \qquad (2)$$

where $X = W^{[L-1]} \cdots W^{[0]}$, $\mathcal{R}_{W_r}(X) = \mathrm{tr}(X^\top L_r X)$, and $\mathcal{R}_{W_c}(X^\top) = \mathrm{tr}(X L_c X^\top)$. Specially, our experiments focus on the MC problem, which reformulates Equation 2 as follows:

$$\min_{W^{[l]}, W_r, W_c} \mathcal{L}_{all} = \sum_{(i,j)\in\Omega} \big| X_{ij} - X^*_{ij} \big| + \lambda_r \cdot \mathcal{R}_{W_r}(X) + \lambda_c \cdot \mathcal{R}_{W_c}(X^\top), \qquad (3)$$

with $X = W^{[L-1]} \cdots W^{[0]}$. The parameters in Equation 3 are updated by the gradient descent algorithm or its variants. We stop the iteration once the stopping criteria fall below the threshold, and the recovered matrix is the final product $\hat{X}$.

Some works that combine implicit and explicit regularization can also be regarded as special cases of Equation 1. Both Total Variation (TV) and DE can be regarded as fixed regularizers, so the framework of Equation 1 also contains DMF+TV (Li et al., 2020) and DMF+DE (Boyarski et al., 2019a). So far, no essential difference between Equation 3 and these models is visible; we will illustrate the distinctive properties of the model in the next section.
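A minimal training loop in the spirit of Equation 3 can be sketched as follows. This is an assumption-laden illustration, not the authors' implementation: the depth is fixed at two, the learned Laplacians are replaced by fixed path-graph Laplacians, and the absolute-value fidelity of Equation 3 is swapped for a squared loss so that the gradients are smooth.

```python
import numpy as np

rng = np.random.default_rng(0)
m = n = h = 10
X_star = np.outer(rng.normal(size=m), rng.normal(size=n))  # rank-1 ground truth
mask = rng.random((m, n)) < 0.5                            # observed entries (Omega)

def path_laplacian(k):
    """Laplacian of a path graph, standing in for the learned L_r / L_c."""
    A = np.diag(np.ones(k - 1), 1) + np.diag(np.ones(k - 1), -1)
    return np.diag(A.sum(axis=1)) - A

L_r, L_c = path_laplacian(m), path_laplacian(n)
lam_r = lam_c = 1e-3
W1 = rng.normal(scale=0.1, size=(m, h))   # depth-2 factorization X = W1 @ W0
W0 = rng.normal(scale=0.1, size=(h, n))
lr = 0.01

def loss(W1, W0):
    X = W1 @ W0
    fit = 0.5 * np.sum((mask * (X - X_star)) ** 2)  # squared loss for smoothness
    return fit + lam_r * np.trace(X.T @ L_r @ X) + lam_c * np.trace(X @ L_c @ X.T)

l0 = loss(W1, W0)
for _ in range(1000):
    X = W1 @ W0
    # gradient of the full objective with respect to X
    G = mask * (X - X_star) + 2 * lam_r * L_r @ X + 2 * lam_c * X @ L_c
    # chain rule through the two factors; simultaneous assignment uses the old W1
    W1, W0 = W1 - lr * G @ W0.T, W0 - lr * W1.T @ G

print(loss(W1, W0) < l0)  # True: gradient descent reduces the objective
```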

## 3 Theoretical analysis

In this section, we analyze the properties of AIR-Net based on the dynamics of Equation 3. Theorem 1 shows that our proposed regularization enhances the implicit low-rank regularization of DMF. Theorem 2 shows that the adaptive regularization converges to a minimum while capturing the inner structure of the data flexibly. Although this paper focuses on the MC problem, the following theoretical analyses hold for general inverse problems. As $\mathcal{T}_r$ and $\mathcal{T}_c$ are fixed during optimization, we simplify the notation below: $X_{ij}$ is the $(i,j)$-th entry of $X$, and $X_{:,j}$ and $X_{i,:}$ are the $j$-th column and the $i$-th row of $X$, respectively.

### 3.1 AIR-Net enhances the implicit low-rank regularization

To simplify the analysis, we keep $W_r$ and $W_c$ fixed; then only $X$ varies with $t$. We will demonstrate what the adaptive regularizer brings to the implicit low-rank regularization.

###### Theorem 1.

Consider the following dynamics with initial data satisfying the balance initialization Assumption 1 (see A.2):

$$\dot{W}^{[l]}(t) = -\frac{\partial}{\partial W^{[l]}}\, \mathcal{L}_{all}(X(t)), \quad t \ge 0, \; l = 0, \ldots, L-1,$$

where $X(t) = W^{[L-1]}(t) \cdots W^{[0]}(t)$. Then for $k = 1, \ldots, \min\{m, n\}$, we have

$$\dot{\sigma}_k(t) = -L\big(\sigma_k^2(t)\big)^{1-\frac{1}{L}} \big\langle \nabla_X \mathcal{L}_Y(X(t)),\, U_{:,k}(t) V_{:,k}^\top(t) \big\rangle - 2L\big(\sigma_k^2(t)\big)^{\frac{3}{2}-\frac{1}{L}} \gamma_k(t), \qquad (4)$$

where $X(t) = U(t)S(t)V^\top(t)$ is the SVD of $X(t)$, $\sigma_k(t)$ is the $k$-th signed singular value, and $\gamma_k(t) = \lambda_r \cdot U_{:,k}^\top(t) L_r U_{:,k}(t) + \lambda_c \cdot V_{:,k}^\top(t) L_c V_{:,k}(t)$.

###### Proof.

Directly calculating the gradient of the regularizers at $X(t)$ and utilizing Equation 6 yields the result. The details of the proof can be found in A.3. ∎

Compared with vanilla DMF, whose dynamics contain the order $\big(\sigma_k^2(t)\big)^{1-\frac{1}{L}}$, this theorem demonstrates that AIR-Net's dynamics contain an additional higher-order term of order $\big(\sigma_k^2(t)\big)^{\frac{3}{2}-\frac{1}{L}}$. Notice that the adaptive regularizer keeps $\gamma_k(t) \ge 0$. In this way, a bigger convergence-speed gap appears between different singular values than in vanilla DMF. Therefore, AIR-Net enhances the implicit tendency of DMF toward low rank.

### 3.2 The dynamics of adaptive regularizer

Now suppose $X$ is given and fixed. We focus on the convergence property of $L_W$ based on the evolution of the dynamics of $W$. The regularizer vanishes in the end, which prevents AIR-Net from suffering saturation issues.

###### Theorem 2.

Consider the gradient flow model for $W_i$ with $X$ fixed. If we initialize $W_i(0)$ to be symmetric, then $W_i(t)$ will keep symmetric during optimization, and we get the following element-wise convergence of $L_i(t)$ to the limit

$$L_i^*(k,l) = \begin{cases} 0, & (k,l) \in C_1, \\ \gamma, & (k,l) \in C_2, \\ -\sum_{l'=1,\, l'\neq l}^{m_i} L_i^*(k,l'), & k = l, \end{cases}$$

where $L_i^*(k,l)$ is the $(k,l)$-th element of $L_i^*$, $C_2$ collects the off-diagonal pairs $(k,l)$ whose corresponding columns of $\mathcal{T}_i(X)$ coincide, $C_1$ collects the remaining off-diagonal pairs, and $\gamma$ is a constant defined in A.4.

###### Proof.

We prove this theorem in A.4. ∎

This theorem gives the limit point and convergence rate of $L_i(t)$. The limit satisfies $L_i^*(k,l) = 0$ unless $\mathcal{T}_i(X)_{:,k} = \mathcal{T}_i(X)_{:,l}$ or $k = l$; that is to say, in the end, the regularizer will only treat exactly identical columns of $\mathcal{T}_i(X)$ as related. Moreover, $L_i(k,l)$ converges faster for more similar column pairs; in other words, the adaptive regularizer captures the strongest similarity first. This convergence-rate gap produces a multi-scale similarity, which will be discussed in Section 4.1. Additionally, it is not difficult to find that the regularization term itself also converges; the convergence rate is given as follows:

###### Corollary 1.

In the setting of Theorem 2, we further have $\lim_{t\to\infty} \mathcal{R}_{W_i}\big(\mathcal{T}_i(X)\big) = 0$.

###### Proof.

We prove this corollary in Appendix A.5. ∎

According to Corollary 1, the adaptive regularizer converges to zero. Therefore, the regularization vanishes at the end of training and does not induce the saturation issue.

###### Remark 1.

Notice that we impose no restriction on a specific $\mathcal{T}_i$, loss $\mathcal{L}_Y$, or representation of $X$ in the above proof. Therefore, the conclusion in this subsection is a general result for inverse problems.

In this subsection, we demonstrated AIR-Net's appealing theoretical properties: it both enhances the implicit low-rank regularization and avoids saturation issues. We verify these properties and the effectiveness of AIR-Net in applications experimentally below.

## 4 Experimental Analysis

Now we demonstrate the adaptive properties of AIR-Net by numerical experiments: (a) $L_r$ and $L_c$ capture the structural similarity in data from large scale to small scale (Section 4.1); (b) the comprehensive similarity at all scales contributes to successful MC, so the adaptive regularizer is necessary (Section 4.2); (c) because AIR-Net is adaptive to the data, it avoids over-fitting and achieves good performance (Section 4.3).

Data type and sampling pattern. Three types of matrices are considered: gray-scale images, a user–movie rating matrix, and drug–target interaction (DTI) data. Three standard gray-scale test images (Baboon, Barbara, and Cameraman) are included in the image type (Monti et al., 2017). The user–movie rating matrix is Syn-Netflix, and the DTI data comprise the Ion channel (IC) and G-protein-coupled receptor (GPCR) datasets (Boyarski et al., 2019b; Mongia and Majumdar, 2020). The sampling patterns include random missing, patch missing, and textural missing, which are listed in Figure 4. The random missing rate varies in different experiments.

Parameter settings. We set the regularization weights to ensure that the fidelity term and the regularization are of the same order of magnitude, where $X^*_{\max}$ and $X^*_{\min}$ are the maximum and minimum of $X^*$. The stopping threshold is set to a small default value. All the parameters in AIR-Net are initialized from a zero-mean Gaussian distribution with small variance. Adam is chosen as the optimization algorithm by default (Kingma and Ba, 2015).

### 4.1 AIR-Net captures relationships adaptively in both spatial and temporal domains

In this section, we verify the previously proposed theorems by showing a few snapshots of $L_r(t)$ and $L_c(t)$ during training, which demonstrate what AIR-Net can learn. The heatmaps of $L_r$ and $L_c$ for Baboon at several training steps are shown in Figure 1, and the corresponding results for Syn-Netflix are shown in Figure 5 in A.1. The first row shows the heatmaps of $L_r$ and the second those of $L_c$.

As Figure 1 shows, both $L_r$ and $L_c$ first exhibit many blocks. Specially, we highlight two of them: these blocks indicate that the corresponding columns are highly related, and they indeed correspond to the columns in which the eyes of the Baboon are located, which are highly similar. However, because the columns of Baboon are not exactly identical, the slight differences between them later drive the adaptive regularizer to focus on closely related neighboring columns, which is similar to TV (Rudin et al., 1992). The regularization then gradually vanishes, which matches the results of Theorem 2 (Figure 1). Beyond the gray-scale images, the results on Syn-Netflix lead to a similar conclusion.

These results illustrate that AIR-Net captures similarity from large scale to small scale. Meanwhile, a natural question is raised: does there exist a moment at which both $L_r$ and $L_c$ are captured accurately? If yes, we could train AIR-Net with these fixed $L_r$ and $L_c$ to obtain better recovery performance. The experiments below show that the adaptivity of the $L_r$ and $L_c$ captured by AIR-Net is necessary for MC.

### 4.2 The necessity of utilizing an adaptive regularizer

In this section, the necessity of adaptively updating $L_r$ and $L_c$ is explored. We let AIR-Net use a fixed regularizer, namely the adaptive regularizer learned at a specific step. The Normalized Mean Absolute Error (NMAE) is adopted to measure the distance between the recovered matrix $\hat{X}$ and the actual matrix $X^*$:

$$\mathrm{NMAE} = \frac{1}{\big(X^*_{\max} - X^*_{\min}\big)\,|\bar{\Omega}|} \sum_{(i,j)\in\bar{\Omega}} \big| \hat{X}_{ij} - X^*_{ij} \big|,$$

where $\bar{\Omega}$ is the complement of $\Omega$. We utilize the regularizers captured by AIR-Net at three different steps. All training hyper-parameters are kept the same as for AIR-Net. Baboon under all three missing patterns is tested.
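The NMAE above can be computed in a few lines of numpy; the sketch below evaluates the error only on the unobserved entries $\bar{\Omega}$ and normalizes by the range of the true matrix (function and variable names are illustrative):

```python
import numpy as np

def nmae(X_hat, X_star, observed_mask):
    """NMAE over the unobserved entries, normalized by the range of X_star."""
    missing = ~observed_mask
    err = np.abs(X_hat - X_star)[missing].sum()
    return err / ((X_star.max() - X_star.min()) * missing.sum())

X_star = np.array([[0.0, 1.0], [2.0, 4.0]])
X_hat = np.array([[0.0, 1.0], [2.0, 3.0]])
mask = np.array([[True, True], [True, False]])  # only entry (1, 1) is missing
print(nmae(X_hat, X_star, mask))  # 0.25 = |3 - 4| / ((4 - 0) * 1)
```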

Figure 2 shows how the NMAE changes with the training epoch. AIR-Net, which updates the regularization during training, achieves the best performance under all missing patterns. A fixed regularization can accelerate the convergence of the algorithm, and the best fixed time step differs across missing patterns. Fixed-regularizer methods face two problems: (a) how to determine the best step, and (b) how to estimate the regularization from the partially observed matrix before training. These problems are not easy to solve. AIR-Net solves them from another perspective by updating the regularization during training; this adaptivity is essential to its effectiveness.

### 4.3 AIR-Net adapts to both varied data and missing patterns

Now we apply AIR-Net for matrix completion on three data types under different missing patterns.

Compared methods. The compared methods include KNN (Goldberger et al., 2004), SVD (Troyanskaya et al., 2001), PNMC (Yang and Xu, 2020), DMF (Arora et al., 2019), and RDMF (Li et al., 2020) for image data. In the Syn-Netflix experiment, RDMF is replaced by DMF+DE (Boyarski et al., 2019a) because it is more suitable there.

Avoid over-fitting. Figure 3 shows how the NMAE of DMF and AIR-Net changes with the training step. Compared with vanilla DMF, AIR-Net avoids over-fitting and achieves better performance on all three data types and missing patterns. The Syn-Netflix and DTI results can be found in Figure 6 in A.1.

Adaptive to data. Our proposed method achieves the best recovery performance in most tasks. Table 1 shows the efficacy of AIR-Net on the various data types. More surprisingly, our method performs better than methods that are specifically designed for a particular data type. The recovered results are shown in Figure 4. In this figure, the existing methods perform well only on specific missing patterns: RDMF achieved good performance in the random missing case but performed poorly on the remaining patterns, and PNMC completed the patch missing well while obtaining worse results on texture missing. Thanks to the proposed model's adaptive properties, our method achieves promising results both visually and by numerical measures.

## 5 Conclusion

We have proposed AIR-Net, which aims to solve the MC problem without knowing the prior in advance. We showed that AIR-Net can adaptively learn the regularization according to different data at different training steps. In addition, we demonstrated that AIR-Net avoids the saturation issue and the over-fitting issue simultaneously. In fact, AIR-Net is a general framework for solving inverse problems. In future work, we will combine other implicit regularizations, such as the F-Principle (Xu et al., 2019), with more flexible transformations $\mathcal{T}_i$ for other inverse problems.

## References

• S. Arora, N. Cohen, W. Hu, and Y. Luo (2019) Implicit regularization in deep matrix factorization. In NeurIPS, Cited by: §1, Figure 4, §4.3, Table 1, Proposition 1.
• A. Boyarski, S. Vedula, and A. Bronstein (2019a) Deep matrix factorization with spectral geometric regularization. arXiv: Learning. Cited by: §1, §2.3, §4.3, Table 1.
• A. Boyarski, S. Vedula, and A. Bronstein (2019b) Spectral geometric matrix completion. Cited by: §4.
• A. Buades, B. Coll, and J. Morel (2005) A non-local algorithm for image denoising. 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05) 2, pp. 60–65. Cited by: §1.
• E. J. Candès and B. Recht (2009) Exact matrix completion via convex optimization. Foundations of Computational Mathematics 9, pp. 717–772. Cited by: §2.1.
• P. Chakrabarty and S. Maji (2019) The spectral bias of the deep image prior. ArXiv abs/2107.01125. Cited by: §1.
• K. Dabov, A. Foi, V. Katkovnik, and K. Egiazarian (2007) Image denoising by sparse 3-d transform-domain collaborative filtering. IEEE Transactions on Image Processing 16, pp. 2080–2095. Cited by: §1.
• J. Fan and J. Cheng (2018) Matrix completion by deep matrix factorization. Neural networks : the official journal of the International Neural Network Society 98, pp. 34–41. Cited by: §2.1.
• J. Goldberger, S. T. Roweis, G. E. Hinton, and R. Salakhutdinov (2004) Neighbourhood components analysis. In NIPS, Cited by: Figure 4, §4.3, Table 1.
• D. P. Kingma and J. Ba (2015) Adam: a method for stochastic optimization. CoRR abs/1412.6980. Cited by: §4.
• Y. Koren, R. M. Bell, and C. Volinsky (2009) Matrix factorization techniques for recommender systems. Computer 42. Cited by: §2.1.
• H. Li, J. Schwab, S. Antholzer, and M. Haltmeier (2018) NETT: solving inverse problems with deep neural networks. ArXiv abs/1803.00092. Cited by: §1.
• Z. Li, Z. J. Xu, T. Luo, and H. Wang (2020) A regularized deep matrix factorized model of matrix completion for image restoration. ArXiv abs/2007.14581. Cited by: §1, §2.3, Figure 4, §4.3, Table 1.
• Z. Lin, M. Chen, and Y. Ma (2010) The augmented lagrange multiplier method for exact recovery of corrupted low-rank matrices. ArXiv abs/1009.5055. Cited by: §2.1.
• J. Liu, Y. Sun, X. Xu, and U. Kamilov (2019) Image restoration using total variation regularized deep image prior. ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 7715–7719. Cited by: §1.
• C. A. Metzler, A. Mousavi, R. Heckel, and R. Baraniuk (2018) Unsupervised learning with stein’s unbiased risk estimator. ArXiv abs/1805.10531. Cited by: §1.
• A. Mongia and A. Majumdar (2020) Drug-target interaction prediction using multi graph regularized nuclear norm minimization. PLoS ONE 15. Cited by: §1, §4.
• F. Monti, M. M. Bronstein, and X. Bresson (2017) Geometric matrix completion with recurrent multi-graph neural networks. In NIPS, Cited by: §4.
• S. Mukherjee, M. Carioni, O. Öktem, and C. Schönlieb (2021) End-to-end reconstruction meets data-driven regularization for inverse problems. ArXiv abs/2106.03538. Cited by: §1.
• Netflix (2009) Netflix prize rules. Cited by: §1.
• A. Radhakrishnan, G. Stefanakis, M. Belkin, and C. Uhler (2021) Simple, fast, and flexible framework for matrix completion with infinite width neural networks. ArXiv abs/2108.00131. Cited by: §1.
• N. Rahaman, A. Baratin, D. Arpit, F. Dräxler, M. Lin, F. Hamprecht, Y. Bengio, and A. C. Courville (2019) On the spectral bias of neural networks. In ICML, Cited by: §1.
• Y. Romano, M. Protter, and M. Elad (2014) Single image interpolation via adaptive nonlocal sparsity-based modeling. IEEE Transactions on Image Processing 23, pp. 3085–3098. Cited by: §1.
• L. Rudin, S. Osher, and E. Fatemi (1992) Nonlinear total variation based noise removal algorithms. Physica D: Nonlinear Phenomena 60, pp. 259–268. Cited by: §1, §4.1.
• O. Troyanskaya, M. Cantor, G. Sherlock, P. Brown, T. Hastie, R. Tibshirani, D. Botstein, and R. Altman (2001) Missing value estimation methods for dna microarrays. Bioinformatics 17 6, pp. 520–5. Cited by: Figure 4, §4.3, Table 1.
• D. Ulyanov, A. Vedaldi, and V. Lempitsky (2018) Deep image prior. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9446–9454. Cited by: §1.
• Z. Wen, W. Yin, and Y. Zhang (2012) Solving a low-rank factorization model for matrix completion by a nonlinear successive over-relaxation algorithm. Mathematical Programming Computation 4, pp. 333–361. Cited by: §1.
• Z. J. Xu, Y. Zhang, T. Luo, Y. Xiao, and Z. Ma (2019) Frequency principle: fourier analysis sheds light on deep neural networks. ArXiv abs/1901.06523. Cited by: §1, §5.
• M. Yang and S. Xu (2020) A novel patch-based nonlinear matrix completion algorithm for image analysis through convolutional neural network. Neurocomputing 389, pp. 56–82. Cited by: Figure 4, §4.3, Table 1.

## Appendix A Appendix

### A.1 Experimental Results

In this section, we present the experiments mentioned above. Figure 5 shows the heatmaps of $L_r$ and $L_c$ learned by the adaptive regularizer. Eventually, the adaptive regularizer obtains $L_r$ and $L_c$ that are highly similar to the real ones shown in the first column.

Figure 6 shows the NMAE on Syn-Netflix, IC, and GPCR during training, respectively. These results also demonstrate AIR-Net's ability to avoid over-fitting.

### A.2 Introduction of DMF

###### Assumption 1.

Factor matrices are balanced at initialization, i.e.,

$$W^{[l+1]\top}(0)\, W^{[l+1]}(0) = W^{[l]}(0)\, W^{[l]\top}(0), \quad l = 0, \ldots, L-2.$$
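One simple way to satisfy Assumption 1 numerically (an illustration, not the initialization used in the paper) is to initialize every square factor with the same symmetric matrix $S$, since then $W^\top W = S^2 = W W^\top$ at every layer:

```python
import numpy as np

rng = np.random.default_rng(0)
d, depth = 8, 4
# shared symmetric initialization: S^T S = S^2 = S S^T, so Assumption 1 holds
B = rng.normal(scale=1e-2, size=(d, d))
S = (B + B.T) / 2
factors = [S.copy() for _ in range(depth)]

for l in range(depth - 1):
    lhs = factors[l + 1].T @ factors[l + 1]
    rhs = factors[l] @ factors[l].T
    assert np.allclose(lhs, rhs)  # balance condition at every adjacent pair
print("balanced")
```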

Under this assumption, Arora et al. studied the gradient flow of the non-regularized risk $\mathcal{L}_Y$, i.e.,

$$\dot{W}^{[l]}(t) = -\frac{\partial}{\partial W^{[l]}}\, \mathcal{L}_Y(X(t)), \quad t \ge 0, \; l = 0, \ldots, L-1, \qquad (5)$$

where the empirical risk $\mathcal{L}_Y$ can be any analytic function of $X$. By the analyticity of $\mathcal{L}_Y$, $X(t)$ admits a singular value decomposition in which each matrix is an analytic function of $t$:

$$X(t) = U(t)\, S(t)\, V^\top(t),$$

where $U(t)$, $S(t)$, and $V(t)$ are analytic functions of $t$; for every $t$, the matrices $U(t)$ and $V(t)$ have orthonormal columns, while $S(t)$ is diagonal (its diagonal entries may be negative and may appear in any order). The diagonal entries of $S(t)$, denoted by $\sigma_k(t)$, are the signed singular values of $X(t)$. The columns of $U(t)$ and $V(t)$, denoted by $U_{:,k}(t)$ and $V_{:,k}(t)$, are the corresponding left and right singular vectors, respectively. Based on this notation, Arora et al. derive the following evolutionary dynamics of the singular values.

###### Proposition 1 ((Arora et al., 2019, Theorem 3)).

Consider the dynamics Equation 5 with initial data satisfying Assumption 1. Then the signed singular values of the product matrix evolve by:

$$\dot{\sigma}_k(t) = -L\big(\sigma_k^2(t)\big)^{1-\frac{1}{L}} \big\langle \nabla_X \mathcal{L}_Y(X(t)),\, U_{:,k}(t) V_{:,k}^\top(t) \big\rangle, \quad k = 1, \ldots, \min\{m, n\}. \qquad (6)$$

If the matrix factorization is non-degenerate, i.e., has depth $L \ge 2$, the singular values need not be signed (we may assume $\sigma_k(t) \ge 0$ for all $t$).

Arora et al. claimed that the terms $\big(\sigma_k^2(t)\big)^{1-\frac{1}{L}}$ enhance the movement of large singular values and, on the other hand, attenuate that of small ones. The enhancement/attenuation becomes more significant as $L$ grows.
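The depth-dependence of this enhancement/attenuation is easy to see numerically. The sketch below (illustrative values, not from the paper) evaluates the velocity factor $L(\sigma_k^2)^{1-1/L}$ from Equation 6 for a large and a small singular value at several depths:

```python
import numpy as np

def dmf_multiplier(sigma, L):
    """The factor L * (sigma^2)^(1 - 1/L) from Equation 6 that scales sigma's velocity."""
    return L * (sigma ** 2) ** (1.0 - 1.0 / L)

ratios = []
for L in (2, 3, 5):
    big, small = dmf_multiplier(2.0, L), dmf_multiplier(0.1, L)
    ratios.append(big / small)   # speed gap between a large and a small singular value
    print(L, big / small)

# the gap widens with depth: deeper factorizations separate singular values faster
assert ratios[0] < ratios[1] < ratios[2]
```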

### A.3 Proof of Theorem 1

We first restate the theorem and the dynamics it concerns:

###### Theorem 1.

Consider the following dynamics with initial parameters satisfying Assumption 1:

$$\dot{W}^{[l]}(t) = -\frac{\partial}{\partial W^{[l]}}\, \mathcal{L}_{all}(X(t)), \quad t \ge 0, \; l = 0, \ldots, L-1,$$

where $X(t) = W^{[L-1]}(t) \cdots W^{[0]}(t)$. Then for any $k$ we have

$$\dot{\sigma}_k(t) = -L\big(\sigma_k^2(t)\big)^{1-\frac{1}{L}} \big\langle \nabla_X \mathcal{L}_Y(X(t)),\, U_{:,k}(t) V_{:,k}^\top(t) \big\rangle - 2L\big(\sigma_k^2(t)\big)^{\frac{3}{2}-\frac{1}{L}} \gamma_k(t),$$

where $X(t) = U(t)S(t)V^\top(t)$ is the SVD of $X(t)$ and $\gamma_k(t) = \lambda_r \cdot U_{:,k}^\top(t) L_r U_{:,k}(t) + \lambda_c \cdot V_{:,k}^\top(t) L_c V_{:,k}(t)$.

###### Proof.

This is proved by direct calculation:

$$\begin{aligned} \nabla_X(\lambda_r \cdot \mathcal{R}_r + \lambda_c \cdot \mathcal{R}_c) &= \frac{\partial\, \mathrm{tr}\big(\lambda_r \cdot X^\top L_r X + \lambda_c \cdot X L_c X^\top\big)}{\partial X} \\ &= 2\lambda_r \cdot L_r X + 2\lambda_c \cdot X L_c \\ &= 2\lambda_r \cdot L_r \sum_s \sigma_s U_{:,s} V_{:,s}^\top + 2\lambda_c \cdot \sum_s \sigma_s U_{:,s} V_{:,s}^\top L_c. \end{aligned}$$

Note that

$$\langle V_{:,s}, V_{:,s'} \rangle = \langle U_{:,s}, U_{:,s'} \rangle = \delta_{ss'} = \begin{cases} 1, & s = s', \\ 0, & s \neq s'. \end{cases}$$

Therefore

$$U_{:,k}^\top \big(\nabla_X(\lambda_r \cdot \mathcal{R}_r + \lambda_c \cdot \mathcal{R}_c)\big) V_{:,k} = 2\sigma_k \big(\lambda_r \cdot U_{:,k}^\top L_r U_{:,k} + \lambda_c \cdot V_{:,k}^\top L_c V_{:,k}\big) = 2\sigma_k \gamma_k(t),$$

where $\gamma_k(t) = \lambda_r \cdot U_{:,k}^\top L_r U_{:,k} + \lambda_c \cdot V_{:,k}^\top L_c V_{:,k}$. Furthermore, substituting this additional gradient term into the dynamics of Equation 6 contributes $-L(\sigma_k^2(t))^{1-\frac{1}{L}} \cdot 2\sigma_k \gamma_k(t) = -2L(\sigma_k^2(t))^{\frac{3}{2}-\frac{1}{L}} \gamma_k(t)$, which yields the claimed expression for $\dot{\sigma}_k(t)$. ∎

### A.4 Proof of Theorem 2

We first compute the gradient of the regularizer with respect to $W_i$:

###### Proposition 2.

$\nabla_{W_i}\, \mathrm{tr}(X^\top L_i X) = 2C \odot A_i - 2\,\mathrm{tr}(C A_i')\, A_i'$, where $C = (XX^\top \odot I_{m_i})\, 1_{m_i \times m_i} - XX^\top$ and $A_i$ is the adjacency matrix parameterized by $W_i$.

###### Proof.

We denote by $A_i$ the parameterized adjacency matrix and by $L_i$ the corresponding Laplacian; then we consider

$$\begin{aligned} d\big[\mathrm{tr}(X^\top L_i X)\big] &= \mathrm{tr}\big[\big((dA_i \odot (1_{m_i \times m_i} - I_{m_i})) \cdot 1_{m_i \times m_i}\big) \odot I_{m_i} \cdot XX^\top - dA_i \odot (1_{m_i \times m_i} - I_{m_i})\, XX^\top\big] \\ &= \mathrm{tr}\big[\big((XX^\top \odot I_{m_i})\, 1_{m_i \times m_i}\big)^\top \big(dA_i \odot (1_{m_i \times m_i} - I_{m_i})\big) - \big(XX^\top \odot (1_{m_i \times m_i} - I_{m_i})\big)^\top dA_i\big] \\ &= \mathrm{tr}\big[\Big(\big((XX^\top \odot I_{m_i}) \cdot 1_{m_i \times m_i}\big) \odot (1_{m_i \times m_i} - I_{m_i}) - \big(XX^\top \odot (1_{m_i \times m_i} - I_{m_i})\big)^\top\Big)\, dA_i\big] \\ &= \mathrm{tr}\big[\big((XX^\top \odot I_{m_i})\, 1_{m_i \times m_i} - XX^\top\big)\, dA_i\big]. \end{aligned}$$

We denote $C = (XX^\top \odot I_{m_i})\, 1_{m_i \times m_i} - XX^\top$; then

$$\begin{aligned} d\big[\mathrm{tr}(X^\top L_i X)\big] &= \mathrm{tr}(C\, dA_i) \\ &= \frac{1}{S_{W_i}^2}\, \mathrm{tr}\big[C\big(S_{W_i} \cdot \exp(W_i + W_i^\top) \odot d(W_i + W_i^\top)\big) - C\big(1_{m_i}^\top (\exp(W_i) \odot dW_i) 1_{m_i}\big) \exp(W_i + W_i^\top)\big] \\ &= \mathrm{tr}\big[C \cdot \big(A_i \odot d(W_i + W_i^\top)\big)\big] - \mathrm{tr}\big[\mathrm{tr}(C \cdot A_i) \cdot A_i'\, dW_i\big] \\ &= \mathrm{tr}\big[\big((C^\top \odot A_i)^\top + C^\top \odot A_i - \mathrm{tr}(C \cdot A_i)\, A_i'\big)\, dW_i\big]. \end{aligned}$$

Therefore,

$$\nabla_{W_i}\, \mathrm{tr}(X^\top L_i X) = (C^\top \odot A_i)^\top + C^\top \odot A_i - \mathrm{tr}(C \cdot A_i)\, A_i' = 2C \odot A_i - 2\,\mathrm{tr}(C A_i')\, A_i'.$$

Noticing the symmetry of $A_i$, the proposition is proved. ∎

###### Theorem 2.

Consider the gradient flow model for $W_i$ with $X$ fixed. If we initialize $W_i(0)$ to be symmetric, then $W_i(t)$ will keep symmetric during optimization, and we get the element-wise convergence of $L_i(t)$ to the limit

$$L_i^*(k,l) = \begin{cases} 0, & (k,l) \in C_1, \\ \gamma, & (k,l) \in C_2, \\ -\sum_{l'=1,\, l'\neq l}^{m_i} L_i^*(k,l'), & k = l, \end{cases}$$

where $L_i^*(k,l)$ is the element of $L_i^*$ at the $k$-th row and $l$-th column, $C_2$ collects the off-diagonal pairs $(k,l)$ whose corresponding columns of $\mathcal{T}_i(X)$ coincide, $C_1$ collects the remaining off-diagonal pairs, and $\gamma$ is a constant defined in this section.

###### Proof.

We rewrite the gradient in Proposition 2 in element-wise form:

$$\dot{W}_i(k,l)(t) = \big(2C_a(t) - 4C_{k,l}\big) \cdot A_i'(k,l)(t),$$

where the subscript $(k,l)$ denotes the corresponding element of the matrix.

With the assumption that and , we have