. They are popularly used in areas such as computer vision(Vasilescu & Terzopoulos, 2002), recommender systems (Symeonidis & Zioupos, 2016), and signal processing (Cichocki et al., 2015). Many of these are -order tensors, which are the focus in this paper. Examples include color images (Liu et al., 2013) and hyperspectral images (Signoretto et al., 2011). In Youtube, users can follow each other and can belong to the same subscribed channels. By treating channels as the third dimension, the users’ co-subscription network can again be represented as a 3-order tensor (Lei et al., 2009).
In many applications, only a few tensor entries are observed. For example, each Youtube user often only interacts with a few other users (Lei et al., 2009; Davis et al., 2011). Tensor completion, which aims at filling in this partially observed tensor, has attracted a lot of interest (Rendle & Schmidt-Thieme, 2010; Signoretto et al., 2011; Bahadori et al., 2014; Cichocki et al., 2015; Symeonidis & Zioupos, 2016; Trouillon et al., 2017; Lacroix et al., 2018; Wimalawarne & Mamitsuka, 2018). In the related task of matrix completion, different rows/columns of the underlying full matrix often share similar characteristics, and the matrix is thus low-rank (Candès & Recht, 2009). The nuclear norm, which is the tightest convex envelope of the rank (Boyd & Vandenberghe, 2009), is popularly used as a surrogate for the matrix rank in low-rank matrix completion (Cai et al., 2010; Mazumder et al., 2010).
In tensor completion, the low-rank assumption can also capture relatedness in the different tensor dimensions (Tomioka et al., 2010; Acar et al., 2011; Song et al., 2017). However, tensors are more complicated than matrices. Indeed, even computation of the tensor rank is NP-hard (Hillar & Lim, 2013). In recent years, many convex relaxations based on the matrix nuclear norm have been proposed for tensors. Examples include the tensor trace norm (Chandrasekaran et al., 2012), overlapped nuclear norm (Tomioka et al., 2010; Gandy et al., 2011), and latent nuclear norm (Tomioka et al., 2010). In particular, the overlapped nuclear norm is the most popular, as it (i) can be computed exactly (Cheng et al., 2016), (ii) has better low-rank approximation (Tomioka et al., 2010), and (iii) can lead to exact recovery (Tomioka et al., 2011; Tomioka & Suzuki, 2013; Mu et al., 2014).
However, the (overlapped) nuclear norm equally penalizes all singular values. Intuitively, larger singular values are more informative and should be less penalized (Mazumder et al., 2010; Lu et al., 2016; Yao et al., 2018). In matrix completion, various adaptive nonconvex regularizers have been recently introduced to alleviate this problem. Examples include the capped- norm (Zhang, 2010b), log-sum-penalty (LSP) (Candès et al., 2008), truncated nuclear norm (TNN) (Hu et al., 2013), smoothed-capped-absolute-deviation (SCAD) (Fan & Li, 2001) and minimax concave penalty (MCP) (Zhang, 2010a). All these assign smaller penalties to the larger singular values. This leads to better empirical performance (Lu et al., 2016; Gu et al., 2017; Yao et al., 2018) and statistical guarantee (Gui et al., 2016).
Motivating by the success of adaptive nonconvex regularizers in matrix completion, we propose to develop a nonconvex variant of the overlapped nuclear norm regularizer for tensor completion. Unlike the standard convex tensor completion problem, the resulting optimization problem is nonconvex and more difficult to solve.
Based on the proximal average algorithm (Bauschke et al., 2008), we develop in this paper an efficient solver with much better time and space complexities. The keys to its success are on (i) avoiding expensive tensor folding and unfolding, (ii) maintaining a “sparse plus low-rank” structure on the iterates, and (iii) incorporating the adaptive momentum (Li et al., 2017). Convergence guarantees to critical points are provided under smoothness and Kurdyka-Lojasiewicz (Attouch et al., 2013) conditions. Experiments on a number of synthetic and real-world data sets show that the proposed algorithm is efficient and has much better empirical performance than other low-rank tensor regularization and decomposition methods.
Notation.Vectors are denoted by lowercase boldface, matrices by uppercase boldface, and tensors by boldface Euler. For a matrix (assume ) with singular values ’s, its nuclear norm is . For tensors, we follow the notation in (Kolda & Bader, 2009). For a -order tensor (without loss of generality, we assume ), its th entry is . One can unfold along its th-mode to obtain the matrix , whose entry (where ) is equal to . One can also fold a matrix back to a tensor , such that , and is as defined above. The inner product of two tensors and is . The Frobenius norm of is . For a proper and lower-semi-continuous function , denotes its Frechet subdifferential (Attouch et al., 2013).
2 Related Works
2.1 Low-Rank Matrix Learning
Low-rank matrix learning can be formulated as:
where is a low-rank regularizer, is a hyper-parameter, is a -Lipschitz smooth111In other words, for any and . loss. A common choice of is the nuclear norm. Using the proximal algorithm (Parikh & Boyd, 2013), the iterate is given by , where
is the stepsize, and
is the proximal step. The following Lemma shows that it can be obtained from the SVD of . Note that shrinking of the singular values encourages to be low-rank.
(Cai et al., 2010) , where is the SVD of , and .
2.1.1 Matrix Completion
A special class of low-rank matrix learning problems is matrix completion, which attempts to find a low-rank matrix that agrees with the observations in data :
Note that has a “sparse plus low-rank” structure, with a low-rank and sparse . This allows the SVD computation in Lemma 2.1 to be much more efficient (Mazumder et al., 2010). Specifically, on using the power method to compute the SVD of , most effort is spent on multiplications of the form (where ) and (where ). Let in (5) be low-rank factorized as with rank . Computing
takes time. Usually, and . This is much faster than direct multiplying and , which takes time. The same holds for . Thus, the proximal step in (3) take time, while a direct computation without utilizing the “sparse plus low-rank” structure takes time. Besides, as only and the factorized form of need to be kept, the space complexity is reduced from to .
2.1.2 Nonconvex Low-Rank Regularizer
where is nonconvex and possibly nonsmooth. We assume the following on .
is a concave, nondecreasing and -Lipschitz continuous function on with .
Examples of include the capped- penalty: (Zhang, 2010b), and log-sum-penalty (LSP): (where is a constant) (Candès et al., 2008). More can be found in Appendix LABEL:app:egreg. They have similar statistical guarantees (Gui et al., 2016), and empirically perform better than the convex nuclear norm (Lu et al., 2016; Yao et al., 2018).
can be obtained as follows.
(Lu et al., 2016) , where is the SVD of , and .
2.2 Low-Rank Tensor Learning
Recently, the nuclear norm regularizer has been extended to tensors. Here, we focus on the overlapped nuclear norm (Tomioka et al., 2010; Gandy et al., 2011), and its nonconvex extension that will be introduced in Section 3.
-order tensor ,
the overlapped nuclear norm is , where are hyperparameters.
Factorization methods, such as the Tucker/CP (Kolda & Bader, 2009) and tensor-train decompositions (Oseledets, 2011), have also been used for low-rank tensor learning. Compared to nuclear norm regularization, they usually offer worse approximations and inferior performance (Tomioka et al., 2011; Liu et al., 2013; Guo et al., 2017).
3 Nonconvex Low-Rank Tensor Completion
By integrating a nonconvex in (1) with the overlapped nuclear norm, the tensor completion problem becomes
Note that we only sum over modes. This is useful as the third mode is sometimes already small (e.g., the number of bands in images), and so does not need to be low-rank regularized. When , (9) reduces to the matrix completion problem , which can be solved by the proximal algorithm as in (Lu et al., 2016; Yao et al., 2018). In the sequel, we only consider .
When , (9) reduces to (convex) overlapped nuclear norm regularization. While may not be equal to , it can be easily shown that optimization solvers such as alternating direction of multiple multipliers (ADMM) and fast low-rank tensor completion (FaLRTC) can still be used as in (Tomioka et al., 2010; Liu et al., 2013). However, when is nonconvex, ADMM no longer guarantees convergence, and FaLRTC’s dual cannot be derived.
3.1 Structure-aware Proximal Iterations
As in Section 2.1, we solve (9) with the proximal algorithm. However, the proximal step for is not simple. To address this problem, we use the proximal average (PA) algorithm (Bauschke et al., 2008; Yu, 2013).
Let be a Hilbert space. Consider the following problem with possibly nonsmooth regularizers, whose individual proximal steps are assumed to be easily computable.
where is convex and Lipschitz-smooth, while each is convex but possibly nonsmooth. The PA algorithm generates the iterates ’s as
Recently, the PA algorithm is also extended to nonconvex and ’s, where each admits a difference-of-convex decomposition222In other words, each can be decomposed as where and are two convex functions. (Yu et al., 2015; Zhong & Kwok, 2014).
However, as the regularizer is imposed on the unfolded matrix , not on directly, the proximal steps need to be performed as follows.
The individual proximal steps in (16) can be computed using Lemma 2.2 based on SVD. However, tensor folding and unfolding are required in (16). A direct implementation takes space and time per iteration, where and , and is expensive. In the following, we show how the PA iterations can be computed efficiently by utilizing the “sparse plus low-rank” structures.
3.1.1 Keeping the Low-Rank Factorizations
3.1.2 Maintaining “Sparse plus Low-Rank”
The sparse tensor can be constructed efficiently by using the coordinate format333For a sparse 3-order tensor, its th nonzero element is represented in the coordinate format as , where are indices on each mode and is the value. Using (17), of can be computed by finding the corresponding rows in and , which takes time. (Bader & Kolda, 2007). As is a sum of tensor (folded from low-rank matrices) and is sparse, is also “sparse plus low-rank”.
The first term in both (19) and (20) can be easily computed in space and time. and are sparse. Using sparse tensor packages such as the Tensor Toolbox (Bader & Kolda, 2007), and can be computed in space and time.
Computing and in (19), (20) involves folding/unfolding and is expensive. By examining how elements are ordered by folding and unfolding, the following shows that and can be reformulated without explicit folding / unfolding, and thus be computed more efficiently.
Let , , and (resp.) be the th column of (resp.). For any and , we have
where is the Kronecker product, and reshapes vector into a matrix.
|per-iteration time complexity||space||convergence|
Computation of takes a total of time and space. The same holds for computation of . This is much less expensive than direct evaluation, which takes time and space.
Combining the above, and noting that we have to keep the factorized form of , computing proximal steps in (16) takes space and time.
In each PA iteration, proximal steps are performed in (16). The whole PA algorithm thus takes a total of space and time for each iteration. As , , these are much lower than those of a direct implementation (Table 1). Moreover, the PA algorithm has a convergence rate of , where is the number of iterations (Zhong & Kwok, 2014; Yu et al., 2015).
3.2 Use of Adaptive Momentum
The PA algorithm uses only first-order information, and empirically can be slow to converge (Parikh & Boyd, 2013)
. To address this problem, we adopt adaptive momentum, which has been popularly used for stochastic gradient descent(Duchi et al., 2011; Kingma & Ba, 2014) and proximal algorithms (Li & Lin, 2015; Yao et al., 2017; Li et al., 2017). The idea is to use historical iterates to speed up convergence. Here, we adopt the adaptive scheme in (Li et al., 2017). The resultant procedure, shown in Algorithm 1, will be called NOncvx Regularized Tensor (NORT). Note that even when step 6 is performed, in step 10 still has the “sparse plus low-rank” structure on , since . The resultant time and space complexities are the same as in Section 3.1.3.
3.3 Convergence Analysis
Adaptive momentum has not been used with the PA algorithm. Besides, previous proofs of the PA algorithm does not involve folding/unfolding operations. Thus, previous proofs cannot be directly used. In the following, first note that the proximal step in (16) implicitly corresponds to a new regularizer.
There exists a function such that for any .
Let the objective with the new regularizer be . The following bounds the difference between the optimal values ( and , respectively) of the objectives in (9) and .
3.3.1 With Smooth Assumption
As in Section 2.1, we assume that is -Lipschitz smooth. the following shows that Algorithm 1 converges to a critical point (Theorem 3.5) at the rate of (Corollary 3.6). Note that this is the best possible rate for first-order methods on general nonconvex problems (Nesterov, 2013; Ghadimi & Lan, 2016).
The sequence generated from Algorithm 1 has at least one limit point, and all limits points are critical points of .
(i) If , then is a critical point of ; (ii) let . then ;
3.3.2 With Kurdyka-Lojasiewicz Condition
The Kurdyka-Lojasiewicz (KL) condition (Attouch et al., 2013; Bolte et al., 2014) has been popularly used in nonconvex optimization, particularly in gradient (Attouch et al., 2013) and proximal gradient descent algorithms (Bolte et al., 2014; Li & Lin, 2015; Li et al., 2017).
A function : has the uniformized KL property if for every compact set on which is a constant, there exist , such that for all and all , one has , where for some , and .
The following extends this to Algorithm 1.
Let . If has the uniformized KL property, for a sufficiently large , we have
If , reduces to zero in finite steps;
If , where ;
If , where .
Though the convergence rates in Corollary 3.6 and Theorem 3.7 are the same as when momentum is not used, the proposed algorithm does have faster convergence empirically, as will be shown in Section 4. This also agrees with previous studies showing that adaptive momentum can significantly accelerate empirical convergence (Duchi et al., 2011; Kingma & Ba, 2014; Li & Lin, 2015; Li et al., 2017; Yao et al., 2017).
|small : , sparsity:||large : , sparsity:|
|RMSE||space (MB)||time (sec)||RMSE||space (MB)||time (sec)|
4.1 Synthetic Data
The setup is as in (Song et al., 2017). We first generate , where , and , and denotes the outer product (i.e., ). All elements in ’s, ’s, ’s and
’s are sampled independently from the standard normal distribution. This is then corrupted by Gaussian noise fromto form . A total of random elements are observed from . We use of them for training, and the remaining for validation. Testing is evaluated on the unobserved elements in .
Three nonconvex penalties as used: capped- (Zhang, 2010a), LSP (Candès et al., 2008) and TNN (Hu et al., 2013). The proposed NORT algorithm is compared with (i) its slower variant without adaptive momentum (denoted sNORT); (ii) GDPAN (Zhong & Kwok, 2014), which directly applies PA algorithm to (10) as described in (14)-(16); and (iii) PA-APG (Yu, 2013), which solves the convex overlapped nuclear norm minimization problem. For NORT, has to be larger than (Corollary 3.6). However, a large leads to slow convergence (Remark 3.3). Hence, we set . Moreover, we set and as in (Li et al., 2017). Besides, in step 5 of Algorithm 1 is hard to evaluate, and we use instead as in (Zhong & Kwok, 2014). All algorithms are implemented in Matlab, with sparse tensor and matrix operations in C. Experiments are performed on a PC with Intel-i8 CPU and 32GB memory.
Following (Lu et al., 2016; Yao et al., 2017, 2018), performance is evaluated by (i) root-mean-square-error on the unobserved elements: , where is the low-rank tensor recovered, and contains the unobserved elements in ; and (ii) CPU time. To reduce statistical variation, results are averaged over five repetitions.
Recall that for the 3-order tensor considered here, we assume that . In this experiment, we first study the case where is small. We set , where , , and .
Table 2 shows
the results444 In all tables, the best and comparable performances (according to the pairwise t-test with 95% confidence)
In all tables, the best and comparable performances (according to the pairwise t-test with 95% confidence) are highlighted.. As can be seen, PA-APG, which is based on the convex overlapped nuclear norm, has much higher testing RMSEs than those with nonconvex regularization. Besides, the various nonconvex penalties (capped-, LSP and TNN) have similar empirical testing RMSEs, as is also observed in (Lu et al., 2016; Yao et al., 2017, 2018). As for space, NORT and its variant sNORT need much less memory than PA-APG and GDPAN, as they do not explicitly construct dense tensors during the iterations. As for time, NORT is the fastest. Figure 1 shows convergence of the objective value.555Plots for LSP and TNN are similar, and so are not shown because of the lack of space. As can be seen, sNORT and GDPAN have similar speeds w.r.t. the number of iterations. However, sNORT is much faster when measured against time, as it utilizes the “sparse plus low-rank” structure. NORT is even faster due to usage of adaptive momentum.
In this experiment, we set , where ; and is used here. Results are shown in Table 2. Again, the capped-, LSP and TNN regularizers yield the same RMSE. GDPAN, sNORT and NORT all have much lower RMSEs than PA-APG. Convergence of the objective value is shown in Figure 2. Again, NORT is the fastest, and GDPAN is the slowest. NORT and sNORT need much less memory than GDPAN.
3). vs . Recall that in (9) can be either and . We expect to be better when is small, and vice versa. This will be verified in this section. The setup is as in Sections 4.1.1 and 4.1.2. We only experiment with NORT, as the other baselines are less efficient.
Results are shown in Table 3. As can be seen, when is small, and yield similar RMSEs. However, = is much slower than . When is small, the third mode is not low-rank. Thus, the proximal step for the third mode is much more expensive than those for the first two modes as we cannot have . Moreover, = requires much larger space than .
|small ()||large ()|
|RMSE||space (MB)||time (sec)||RMSE||space (MB)||time (sec)|
When is large, it is slightly more expensive on CPU time and space. However, has much worse testing RMSE than , as it cannot capture the low-rank property on the third mode.
4.2 Real-World Data Sets
In this section, we perform evaluation on color images, remote sensing data, and social network data. As different nonconvex regularizers have similar performance, we will only use LSP. Moreover, based on the observations in Section 4.1.3, we use if is , and otherwise. Besides comparing with GDPAN, the proposed NORT algorithm is also compared with:666sNORT is not compared as it has been shown to be slower than NORT. Neither do we compare with (Bahadori et al., 2014), which is inferior to FFW above (Guo et al., 2017); and (Rauhut et al., 2017), whose its code is not publicly available and the more recent TMac-TT solves the same problem. (i) Algorithms for various convex regularizers including ADMM (Tomioka et al., 2010), FaLRTC (Liu et al., 2013), PA-APG (Yu, 2013), FFW (Guo et al., 2017), TenNN (Zhang & Aeron, 2017), and TR-MM (Nimishakavi et al., 2018); (ii) Factorization-based algorithms including RP (Kasai & Mishra, 2016), TMac (Xu et al., 2013), CP-WOPT (Acar et al., 2011), and TMac-TT (Bengua et al., 2017). More details about these methods can be found in Table LABEL:tab:comparedalgs of the Appendix.
4.2.1 Color Images
We use the images windows, tree and rice from (Hu et al., 2013), which are resized to . Sample images are shown in Appendix LABEL:app:cimg. Each pixel is normalized to . We randomly sample 10% of the pixels for training, which are then corrupted by Gaussian noise . Half of the training pixels are used for validation. The remaining unseen clean pixels are used for testing. Performance is measured by the testing RMSE and CPU time. To reduce statistical variation, results are averaged over five repetitions.
Table 4 shows the results. As can be seen, the best convex methods (PA-APG and FaLRTC) are based on the overlapping nuclear norm. This agrees with our motivation to build a nonconvex regularizer based on the overlapping nuclear norm. GDPAN and NORT have similar RMSEs, that are lower than those by convex regularization and factorization approach. Convergence of the testing RMSE is shown in Figure 3. As can be seen, while ADMM solves the same convex model as PA-APG and FaLRTC, it has slower convergence. FFW, RP and TR-MM are very fast but their testing RMSEs are higher than that of NORT. By utilizing the “sparse plus low-rank” structure and adaptive momentum, NORT is more efficient than GDPAN.
4.2.2 Remote Sensing Data
Experiments are performed on three hyper-spectral data: Cabbage (131243249), Scene (131295149) and Female (592409148). Details are in Appendix LABEL:app:himg. The third dimension is for the bands of images. We use the same setup as in Section 4.2.1. ADMM, TenNN, GDPAN, and TMac-TT are slow and so not compared. Results are shown in Table 5. Again, NORT achieves much lower testing RMSE than convex regularization and factorization approach. Figure 4 shows convergence of the testing RMSE. As can be seen, NORT is fast.