1 Introduction
Tensors, which can be seen as higher-order generalizations of matrices, can be used to describe the multilinear relationships inside data (Kolda & Bader, 2009; Song et al., 2017; Papalexakis et al., 2017). They are popularly used in areas such as computer vision (Vasilescu & Terzopoulos, 2002), recommender systems (Symeonidis & Zioupos, 2016), and signal processing (Cichocki et al., 2015). Many of these are 3rd-order tensors, which are the focus of this paper. Examples include color images (Liu et al., 2013) and hyperspectral images (Signoretto et al., 2011). On Youtube, users can follow each other and can belong to the same subscribed channels. By treating channels as the third dimension, the users' co-subscription network can again be represented as a 3rd-order tensor (Lei et al., 2009).

In many applications, only a few tensor entries are observed. For example, each Youtube user often only interacts with a few other users (Lei et al., 2009; Davis et al., 2011). Tensor completion, which aims at filling in this partially observed tensor, has attracted a lot of interest (Rendle & Schmidt-Thieme, 2010; Signoretto et al., 2011; Bahadori et al., 2014; Cichocki et al., 2015; Symeonidis & Zioupos, 2016; Trouillon et al., 2017; Lacroix et al., 2018; Wimalawarne & Mamitsuka, 2018). In the related task of matrix completion, different rows/columns of the underlying full matrix often share similar characteristics, and the matrix is thus low-rank (Candès & Recht, 2009). The nuclear norm, which is the tightest convex envelope of the rank (Boyd & Vandenberghe, 2009), is popularly used as a surrogate for the matrix rank in low-rank matrix completion (Cai et al., 2010; Mazumder et al., 2010).
In tensor completion, the low-rank assumption can also capture relatedness in the different tensor dimensions (Tomioka et al., 2010; Acar et al., 2011; Song et al., 2017). However, tensors are more complicated than matrices. Indeed, even computation of the tensor rank is NP-hard (Hillar & Lim, 2013). In recent years, many convex relaxations based on the matrix nuclear norm have been proposed for tensors. Examples include the tensor trace norm (Chandrasekaran et al., 2012), overlapped nuclear norm (Tomioka et al., 2010; Gandy et al., 2011), and latent nuclear norm (Tomioka et al., 2010). In particular, the overlapped nuclear norm is the most popular, as it (i) can be computed exactly (Cheng et al., 2016), (ii) gives better low-rank approximations (Tomioka et al., 2010), and (iii) can lead to exact recovery (Tomioka et al., 2011; Tomioka & Suzuki, 2013; Mu et al., 2014).
However, the (overlapped) nuclear norm penalizes all singular values equally. Intuitively, larger singular values are more informative and should be penalized less (Mazumder et al., 2010; Lu et al., 2016; Yao et al., 2018). In matrix completion, various adaptive nonconvex regularizers have recently been introduced to alleviate this problem. Examples include the capped-$\ell_1$ norm (Zhang, 2010b), log-sum penalty (LSP) (Candès et al., 2008), truncated nuclear norm (TNN) (Hu et al., 2013), smoothly clipped absolute deviation (SCAD) (Fan & Li, 2001) and minimax concave penalty (MCP) (Zhang, 2010a). All these assign smaller penalties to the larger singular values. This leads to better empirical performance (Lu et al., 2016; Gu et al., 2017; Yao et al., 2018) and statistical guarantees (Gui et al., 2016).
Motivated by the success of adaptive nonconvex regularizers in matrix completion, we develop a nonconvex variant of the overlapped nuclear norm regularizer for tensor completion. Unlike the standard convex tensor completion problem, the resulting optimization problem is nonconvex and more difficult to solve.
Based on the proximal average algorithm (Bauschke et al., 2008), we develop in this paper an efficient solver with much better time and space complexities. The keys to its success are (i) avoiding expensive tensor folding and unfolding, (ii) maintaining a "sparse plus low-rank" structure on the iterates, and (iii) incorporating adaptive momentum (Li et al., 2017). Convergence guarantees to critical points are provided under smoothness and Kurdyka-Łojasiewicz (Attouch et al., 2013) conditions. Experiments on a number of synthetic and real-world data sets show that the proposed algorithm is efficient and has much better empirical performance than other low-rank tensor regularization and decomposition methods.
Notation. Vectors are denoted by lowercase boldface, matrices by uppercase boldface, and tensors by boldface Euler script letters. For a matrix $\mathbf{X} \in \mathbb{R}^{m \times n}$ (assume $m \ge n$) with singular values $\sigma_i$'s, its nuclear norm is $\|\mathbf{X}\|_* = \sum_i \sigma_i$. For tensors, we follow the notation in (Kolda & Bader, 2009). For an $M$-order tensor $\mathcal{X} \in \mathbb{R}^{I_1 \times \cdots \times I_M}$ (without loss of generality, we assume $I_1 \ge \cdots \ge I_M$), its $(i_1, \dots, i_M)$th entry is $\mathcal{X}_{i_1 \dots i_M}$. One can unfold $\mathcal{X}$ along its $d$th mode to obtain the matrix $\mathcal{X}_{\langle d \rangle} \in \mathbb{R}^{I_d \times \prod_{k \ne d} I_k}$, whose $(i_d, j)$ entry (where $j = 1 + \sum_{k \ne d}(i_k - 1)J_k$ with $J_k = \prod_{l < k, l \ne d} I_l$) is equal to $\mathcal{X}_{i_1 \dots i_M}$. One can also fold a matrix $\mathbf{A}$ back to a tensor $\mathcal{X} = \text{fold}_d(\mathbf{A})$, such that $\mathcal{X}_{i_1 \dots i_M} = A_{i_d j}$, where $j$ is as defined above. The inner product of two tensors $\mathcal{X}$ and $\mathcal{Y}$ is $\langle \mathcal{X}, \mathcal{Y} \rangle = \sum_{i_1, \dots, i_M} \mathcal{X}_{i_1 \dots i_M}\mathcal{Y}_{i_1 \dots i_M}$. The Frobenius norm of $\mathcal{X}$ is $\|\mathcal{X}\|_F = \sqrt{\langle \mathcal{X}, \mathcal{X} \rangle}$. For a proper and lower-semicontinuous function $f$, $\partial f$ denotes its Fréchet subdifferential (Attouch et al., 2013).
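For concreteness, mode-$d$ unfolding and folding can be sketched in a few lines of NumPy. The Fortran-order flattening below follows the Kolda & Bader column ordering (earlier modes varying fastest); the function names are ours:

```python
import numpy as np

def unfold(X, d):
    """Mode-d unfolding: mode d becomes the rows; the remaining modes
    are flattened into the columns (earlier modes varying fastest)."""
    return np.moveaxis(X, d, 0).reshape(X.shape[d], -1, order='F')

def fold(A, d, shape):
    """Inverse of unfold: reshape the matrix back and restore mode order."""
    rest = [s for i, s in enumerate(shape) if i != d]
    return np.moveaxis(A.reshape([shape[d]] + rest, order='F'), 0, d)
```

Folding is the exact inverse of unfolding along the same mode, e.g. `fold(unfold(X, 1), 1, X.shape)` recovers `X`.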
2 Related Works
2.1 Low-Rank Matrix Learning
Low-rank matrix learning can be formulated as:

$\min_{\mathbf{X}} f(\mathbf{X}) + \lambda r(\mathbf{X}),$   (1)

where $r$ is a low-rank regularizer, $\lambda \ge 0$ is a hyperparameter, and $f$ is a $\rho$-Lipschitz-smooth loss (i.e., $\|\nabla f(\mathbf{X}_1) - \nabla f(\mathbf{X}_2)\|_F \le \rho\|\mathbf{X}_1 - \mathbf{X}_2\|_F$ for any $\mathbf{X}_1, \mathbf{X}_2$). A common choice of $r$ is the nuclear norm. Using the proximal algorithm (Parikh & Boyd, 2013), the iterate is given by $\mathbf{X}_{t+1} = \text{prox}_{\tau\lambda r}(\tilde{\mathbf{X}}_t)$, where

$\tilde{\mathbf{X}}_t = \mathbf{X}_t - \tau\nabla f(\mathbf{X}_t),$   (2)

$\tau$ is the stepsize, and

$\text{prox}_{\lambda r}(\mathbf{Z}) \equiv \arg\min_{\mathbf{X}} \tfrac{1}{2}\|\mathbf{X} - \mathbf{Z}\|_F^2 + \lambda r(\mathbf{X})$   (3)

is the proximal step. The following lemma shows that it can be obtained from the SVD of $\mathbf{Z}$. Note that shrinking of the singular values encourages the iterate to be low-rank.
Lemma 2.1.
(Cai et al., 2010) $\text{prox}_{\lambda\|\cdot\|_*}(\mathbf{Z}) = \mathbf{U}(\mathbf{\Sigma} - \lambda\mathbf{I})_+\mathbf{V}^\top$, where $\mathbf{U}\mathbf{\Sigma}\mathbf{V}^\top$ is the SVD of $\mathbf{Z}$, and $[(\mathbf{A})_+]_{ij} = \max(A_{ij}, 0)$.
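Lemma 2.1 is the well-known singular value thresholding (SVT) operator; a minimal NumPy sketch:

```python
import numpy as np

def svt(Z, lam):
    """Proximal step of lam * (nuclear norm), per Lemma 2.1: shrink each
    singular value by lam and truncate at zero."""
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)
    return U @ np.diag(np.maximum(s - lam, 0.0)) @ Vt
```

Singular values not exceeding `lam` are zeroed out, so the result has lower rank than `Z`.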
2.1.1 Matrix Completion
A special class of low-rank matrix learning problems is matrix completion, which attempts to find a low-rank matrix that agrees with the observations in data $\mathbf{O}$:

$\min_{\mathbf{X}} \tfrac{1}{2}\|P_{\Omega}(\mathbf{X} - \mathbf{O})\|_F^2 + \lambda r(\mathbf{X}).$   (4)

The positions of observed elements in $\mathbf{O}$ are indicated by 1's in the binary matrix $\mathbf{\Omega} \in \{0, 1\}^{m \times n}$: $[P_{\Omega}(\mathbf{A})]_{ij} = A_{ij}$ if $\Omega_{ij} = 1$, and 0 otherwise. Setting $f(\mathbf{X}) = \tfrac{1}{2}\|P_{\Omega}(\mathbf{X} - \mathbf{O})\|_F^2$ in (1), $\tilde{\mathbf{X}}_t$ in (2) becomes:

$\tilde{\mathbf{X}}_t = \mathbf{X}_t - \tau P_{\Omega}(\mathbf{X}_t - \mathbf{O}).$   (5)

Note that $\tilde{\mathbf{X}}_t$ has a "sparse plus low-rank" structure, with a low-rank $\mathbf{X}_t$ and sparse $P_{\Omega}(\mathbf{X}_t - \mathbf{O})$. This allows the SVD computation in Lemma 2.1 to be much more efficient (Mazumder et al., 2010). Specifically, on using the power method to compute the SVD of $\tilde{\mathbf{X}}_t$, most effort is spent on multiplications of the form $\tilde{\mathbf{X}}_t\mathbf{v}$ (where $\mathbf{v} \in \mathbb{R}^n$) and $\mathbf{u}^\top\tilde{\mathbf{X}}_t$ (where $\mathbf{u} \in \mathbb{R}^m$). Let $\mathbf{X}_t$ in (5) be low-rank factorized as $\mathbf{U}_t\mathbf{V}_t^\top$ with rank $k$. Computing

$\tilde{\mathbf{X}}_t\mathbf{v} = \mathbf{U}_t(\mathbf{V}_t^\top\mathbf{v}) - \tau P_{\Omega}(\mathbf{X}_t - \mathbf{O})\mathbf{v}$   (6)

takes $O((m + n)k + \|\mathbf{\Omega}\|_1)$ time, where $\|\mathbf{\Omega}\|_1$ is the number of observed entries. Usually, $k \ll n$ and $\|\mathbf{\Omega}\|_1 \ll mn$. This is much faster than directly multiplying $\tilde{\mathbf{X}}_t$ and $\mathbf{v}$, which takes $O(mn)$ time. The same holds for $\mathbf{u}^\top\tilde{\mathbf{X}}_t$. Thus, the proximal step in (3) takes $O(((m + n)k + \|\mathbf{\Omega}\|_1)k)$ time, while a direct computation without utilizing the "sparse plus low-rank" structure takes $O(mnk)$ time. Besides, as only $P_{\Omega}(\mathbf{X}_t - \mathbf{O})$ and the factorized form of $\mathbf{X}_t$ need to be kept, the space complexity is reduced from $O(mn)$ to $O((m + n)k + \|\mathbf{\Omega}\|_1)$.
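The speedup from the "sparse plus low-rank" structure can be sketched as follows (the sizes, rank, and sparsity here are illustrative):

```python
import numpy as np
from scipy.sparse import random as sprandom

# X_tilde = U V^T - tau * S with S sparse, so X_tilde @ v can be
# computed without ever forming the dense m x n matrix.
m, n, k, tau = 500, 400, 5, 0.1
rng = np.random.default_rng(0)
U = rng.standard_normal((m, k))
V = rng.standard_normal((n, k))
S = sprandom(m, n, density=0.01, random_state=0, format='csr')

v = rng.standard_normal(n)
fast = U @ (V.T @ v) - tau * (S @ v)          # O((m+n)k + nnz(S))
dense = (U @ V.T - tau * S.toarray()) @ v     # O(mn), for checking only
```

Both paths give the same result, but the first never materializes an $m \times n$ matrix.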
2.1.2 Nonconvex Low-Rank Regularizer

Instead of using a convex $r$ in (1), the following nonconvex regularizer is commonly used (Gui et al., 2016; Lu et al., 2016; Gu et al., 2017; Yao et al., 2018):

$r(\mathbf{X}) = \sum_{i=1}^{n} \kappa\big(\sigma_i(\mathbf{X})\big),$   (7)

where $\kappa$ is nonconvex and possibly nonsmooth, and $\sigma_i(\mathbf{X})$ is the $i$th singular value of $\mathbf{X}$. We assume the following on $\kappa$.
Assumption 1.
$\kappa$ is a concave, nondecreasing and Lipschitz-continuous function on $[0, \infty)$, with $\kappa(0) = 0$.
Examples of $\kappa$ include the capped-$\ell_1$ penalty $\kappa(\sigma) = \min(\sigma, \theta)$ (Zhang, 2010b), and the log-sum penalty (LSP) $\kappa(\sigma) = \log(1 + \sigma/\theta)$ (where $\theta > 0$ is a constant) (Candès et al., 2008). More can be found in Appendix LABEL:app:egreg. They have similar statistical guarantees (Gui et al., 2016), and empirically perform better than the convex nuclear norm (Lu et al., 2016; Yao et al., 2018).
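As a quick illustration (with a hypothetical $\theta = 1$), both penalties grow no faster than the absolute value and flatten out for large inputs, so large singular values are penalized relatively less than under the convex nuclear norm:

```python
import numpy as np

theta = 1.0

def capped_l1(s):
    """Capped-l1 penalty: min(s, theta)."""
    return np.minimum(s, theta)

def lsp(s):
    """Log-sum penalty: log(1 + s / theta)."""
    return np.log1p(s / theta)

s = np.array([0.5, 1.0, 10.0, 100.0])
# capped_l1(s) saturates at theta; lsp's marginal penalty shrinks as s grows
```

In contrast, the nuclear norm's per-singular-value penalty $\kappa(\sigma) = \sigma$ keeps growing at the same rate.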
The proximal algorithm can again be used, and converges to a critical point (Attouch et al., 2013). Analogous to Lemma 2.1, the underlying proximal step

$\text{prox}_{\lambda r}(\mathbf{Z}) = \arg\min_{\mathbf{X}} \tfrac{1}{2}\|\mathbf{X} - \mathbf{Z}\|_F^2 + \lambda\sum_{i=1}^{n}\kappa\big(\sigma_i(\mathbf{X})\big)$   (8)

can be obtained as follows.
Lemma 2.2.
(Lu et al., 2016) $\text{prox}_{\lambda r}(\mathbf{Z}) = \mathbf{U}\,\text{Diag}(y_1, \dots, y_n)\,\mathbf{V}^\top$, where $\mathbf{U}\mathbf{\Sigma}\mathbf{V}^\top$ is the SVD of $\mathbf{Z}$, and $y_i = \arg\min_{y \ge 0} \tfrac{1}{2}\big(y - \sigma_i(\mathbf{Z})\big)^2 + \lambda\kappa(y)$.
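A sketch of Lemma 2.2: reuse the SVD, but shrink each singular value by solving the 1-D proximal problem. Closed forms exist for the common $\kappa$'s; for generality, this hypothetical helper solves the 1-D problem numerically:

```python
import numpy as np
from scipy.optimize import minimize_scalar

def scalar_prox(sigma, lam, kappa):
    """1-D proximal step of Lemma 2.2, solved numerically.

    Since kappa is nondecreasing, the minimizer lies in [0, sigma];
    comparing against y = 0 handles any nonconvex jump."""
    obj = lambda y: 0.5 * (y - sigma) ** 2 + lam * kappa(y)
    res = minimize_scalar(obj, bounds=(0.0, max(sigma, 1e-12)), method='bounded')
    return res.x if obj(res.x) <= obj(0.0) else 0.0

def generalized_svt(Z, lam, kappa):
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)
    y = np.array([scalar_prox(si, lam, kappa) for si in s])
    return U @ np.diag(y) @ Vt
```

With the convex choice $\kappa(y) = y$, this reduces to the soft-thresholding of Lemma 2.1.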
2.2 Low-Rank Tensor Learning
Recently, the nuclear norm regularizer has been extended to tensors. Here, we focus on the overlapped nuclear norm (Tomioka et al., 2010; Gandy et al., 2011), and its nonconvex extension that will be introduced in Section 3.
Definition 1.
For an $M$-order tensor $\mathcal{X}$, the overlapped nuclear norm is $\sum_{d=1}^{M}\lambda_d\|\mathcal{X}_{\langle d\rangle}\|_*$, where the $\lambda_d$'s are hyperparameters.
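Definition 1 can be computed directly from the mode-$d$ unfoldings; a short sketch:

```python
import numpy as np

def overlapped_nuclear_norm(X, lams):
    """Weighted sum of the nuclear norms of all mode-d unfoldings."""
    total = 0.0
    for d, lam in enumerate(lams):
        Xd = np.moveaxis(X, d, 0).reshape(X.shape[d], -1)  # mode-d unfolding
        total += lam * np.linalg.svd(Xd, compute_uv=False).sum()
    return total
```

Each mode contributes the nuclear norm of its unfolding, so the regularizer encourages every unfolding to be low-rank simultaneously.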
Factorization methods, such as the Tucker/CP (Kolda & Bader, 2009) and tensor-train decompositions (Oseledets, 2011), have also been used for low-rank tensor learning. Compared to nuclear norm regularization, they usually offer worse approximations and inferior performance (Tomioka et al., 2011; Liu et al., 2013; Guo et al., 2017).
3 Nonconvex Low-Rank Tensor Completion

By integrating a nonconvex $\kappa$ as in (7) with the overlapped nuclear norm, the tensor completion problem becomes

$\min_{\mathcal{X}} \tfrac{1}{2}\|P_{\Omega}(\mathcal{X} - \mathcal{O})\|_F^2 + \sum_{d=1}^{D}\lambda_d\sum_{i}\kappa\big(\sigma_i(\mathcal{X}_{\langle d\rangle})\big).$   (9)

Note that we only sum over $D$ modes. This is useful as the third mode is sometimes already small (e.g., the number of bands in images), and so does not need to be low-rank regularized. When $D = 1$, (9) reduces to a matrix completion problem, which can be solved by the proximal algorithm as in (Lu et al., 2016; Yao et al., 2018). In the sequel, we only consider $D \in \{2, 3\}$.

When $\kappa(\sigma) = \sigma$, (9) reduces to (convex) overlapped nuclear norm regularization. While $D$ may not be equal to $M$, it can be easily shown that optimization solvers such as the alternating direction method of multipliers (ADMM) and fast low-rank tensor completion (FaLRTC) can still be used as in (Tomioka et al., 2010; Liu et al., 2013). However, when $\kappa$ is nonconvex, ADMM no longer guarantees convergence, and FaLRTC's dual cannot be derived.
3.1 Structure-aware Proximal Iterations

As in Section 2.1, we solve (9) with the proximal algorithm. However, the proximal step of the sum of regularizers in (9) has no simple solution. To address this problem, we use the proximal average (PA) algorithm (Bauschke et al., 2008; Yu, 2013).
Let $\mathcal{H}$ be a Hilbert space. Consider the following problem with possibly nonsmooth regularizers, whose individual proximal steps are assumed to be easily computable:

$\min_{x \in \mathcal{H}} f(x) + \sum_{i=1}^{K}\lambda_i g_i(x),$   (10)

where $f$ is convex and Lipschitz-smooth, while each $g_i$ is convex but possibly nonsmooth. With $\bar{\lambda} = \sum_{i=1}^{K}\lambda_i$, the PA algorithm generates the iterates $x_t$'s as

$\bar{x}_t = x_t - \eta\nabla f(x_t),$   (11)

$y_t^i = \text{prox}_{\eta\bar{\lambda}g_i}(\bar{x}_t), \quad i = 1, \dots, K,$   (12)

$x_{t+1} = \textstyle\sum_{i=1}^{K}(\lambda_i/\bar{\lambda})\,y_t^i.$   (13)

Recently, the PA algorithm has also been extended to nonconvex $f$ and $g_i$'s, where each $g_i$ admits a difference-of-convex decomposition, i.e., $g_i$ can be written as the difference of two convex functions (Yu et al., 2015; Zhong & Kwok, 2014).
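A toy sketch of iterations (11)-(13) on a two-regularizer problem. All choices here are illustrative: $f(x) = \frac{1}{2}\|x - b\|^2$, $g_1 = \|\cdot\|_1$ and $g_2 = \frac{1}{2}\|\cdot\|^2$, picked because both proximal steps are closed-form:

```python
import numpy as np

b = np.array([3.0, -0.5, 0.0])
lam1, lam2 = 0.5, 0.5
lam_bar = lam1 + lam2
eta = 0.5

def prox_l1(z, mu):          # prox of mu * ||x||_1 (soft-thresholding)
    return np.sign(z) * np.maximum(np.abs(z) - mu, 0.0)

def prox_sq(z, mu):          # prox of mu * 0.5 * ||x||^2
    return z / (1.0 + mu)

x = np.zeros_like(b)
for _ in range(200):
    x_bar = x - eta * (x - b)                           # (11) gradient step
    y1 = prox_l1(x_bar, eta * lam_bar)                  # (12) individual proxes
    y2 = prox_sq(x_bar, eta * lam_bar)
    x = (lam1 / lam_bar) * y1 + (lam2 / lam_bar) * y2   # (13) weighted average
```

Each iteration costs only one gradient step plus the cheap individual proxes; the averaged iterate approximately minimizes the composite objective.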
As (9) is of the form in (10), and $\kappa$ in (7) admits a difference-of-convex decomposition (Yao et al., 2018), the PA algorithm can be used to generate the iterates as:

$\bar{\mathcal{X}}_t = \mathcal{X}_t - \eta P_{\Omega}(\mathcal{X}_t - \mathcal{O}),$   (14)

$\mathcal{X}_{t+1} = \textstyle\sum_{d=1}^{D}(\lambda_d/\bar{\lambda})\,\mathcal{Y}_t^d.$   (15)

However, as the regularizer is imposed on the unfolded matrix $\mathcal{X}_{\langle d\rangle}$, not on $\mathcal{X}$ directly, the proximal steps need to be performed as follows:

$\mathcal{Y}_t^d = \text{fold}_d\big(\text{prox}_{\eta\bar{\lambda}r}\big((\bar{\mathcal{X}}_t)_{\langle d\rangle}\big)\big), \quad d = 1, \dots, D,$   (16)

with $r$ as in (7). The individual proximal steps in (16) can be computed using Lemma 2.2 based on SVD. However, tensor folding and unfolding are required in (16). A direct implementation needs space and time proportional to the full tensor size in each iteration, which is expensive. In the following, we show how the PA iterations can be computed efficiently by utilizing the "sparse plus low-rank" structures.
3.1.1 Keeping the Low-Rank Factorizations

3.1.2 Maintaining "Sparse plus Low-Rank"
Using (17), $\bar{\mathcal{X}}_t$ in (15) can be rewritten as

$\bar{\mathcal{X}}_t = \textstyle\sum_{d=1}^{D}\text{fold}_d\big(\mathbf{U}_t^d(\mathbf{V}_t^d)^\top\big) + \mathcal{S}_t,$   (18)

where each $\mathbf{U}_t^d(\mathbf{V}_t^d)^\top$ is a low-rank matrix and $\mathcal{S}_t$ is sparse. The sparse tensor $\mathcal{S}_t$ can be constructed efficiently by using the coordinate format (Bader & Kolda, 2007), in which each nonzero element of a sparse 3rd-order tensor is represented as $(i_1, i_2, i_3, v)$, where $i_1, i_2, i_3$ are its indices on each mode and $v$ is its value; using (17), each nonzero element of $\mathcal{S}_t$ can be computed from the corresponding rows of the low-rank factors. As $\bar{\mathcal{X}}_t$ is a sum of tensors folded from low-rank matrices plus a sparse tensor, it is also "sparse plus low-rank".
Recall that the proximal step in (16) requires SVD, which involves matrix multiplications of the form $(\bar{\mathcal{X}}_t)_{\langle d\rangle}\mathbf{v}$ and $\mathbf{u}^\top(\bar{\mathcal{X}}_t)_{\langle d\rangle}$. Using the "sparse plus low-rank" structure in (18),

$(\bar{\mathcal{X}}_t)_{\langle d\rangle}\mathbf{v} = \mathbf{U}_t^d\big((\mathbf{V}_t^d)^\top\mathbf{v}\big) + \sum_{d' \ne d}\big(\text{fold}_{d'}(\mathbf{U}_t^{d'}(\mathbf{V}_t^{d'})^\top)\big)_{\langle d\rangle}\mathbf{v} + (\mathcal{S}_t)_{\langle d\rangle}\mathbf{v},$   (19)

$\mathbf{u}^\top(\bar{\mathcal{X}}_t)_{\langle d\rangle} = \big(\mathbf{u}^\top\mathbf{U}_t^d\big)(\mathbf{V}_t^d)^\top + \sum_{d' \ne d}\mathbf{u}^\top\big(\text{fold}_{d'}(\mathbf{U}_t^{d'}(\mathbf{V}_t^{d'})^\top)\big)_{\langle d\rangle} + \mathbf{u}^\top(\mathcal{S}_t)_{\langle d\rangle}.$   (20)

The first term in both (19) and (20) can be easily computed from the low-rank factors. $\mathcal{S}_t$ and $\mathbf{\Omega}$ are sparse; using sparse tensor packages such as the Tensor Toolbox (Bader & Kolda, 2007), the last terms can be computed in time and space proportional to the number of nonzero elements.

Computing the middle terms in (19) and (20), which involve $\text{fold}_{d'}(\mathbf{U}_t^{d'}(\mathbf{V}_t^{d'})^\top)$, requires folding/unfolding and is expensive. By examining how elements are ordered by folding and unfolding, the following shows that these terms can be reformulated without explicit folding/unfolding, and thus be computed more efficiently.
Proposition 3.2.
Let $\mathbf{u}_j$ and $\mathbf{v}_j$ (resp.) be the $j$th columns of $\mathbf{U}_t^{d'}$ and $\mathbf{V}_t^{d'}$ (resp.). For any $d$ and $d' \ne d$, the middle terms in (19) and (20) can be rewritten in terms of Kronecker products and reshapings of the $\mathbf{u}_j$'s and $\mathbf{v}_j$'s, where $\otimes$ is the Kronecker product, and $\text{reshape}$ reshapes a vector into a matrix.
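The exact identity in Proposition 3.2 involves paper-specific indexing, but its essence can be illustrated on a rank-1 tensor: the mode-1 unfolding of $\mathbf{a} \circ \mathbf{b} \circ \mathbf{c}$ equals $\mathbf{a}(\mathbf{c} \otimes \mathbf{b})^\top$ (with columns ordered so that the second mode varies fastest), so multiplying the unfolding with a vector never requires materializing or unfolding the tensor. This is a sketch of the underlying fact, not the proposition itself:

```python
import numpy as np

rng = np.random.default_rng(1)
a, b, c = rng.standard_normal(4), rng.standard_normal(3), rng.standard_normal(2)
X = np.einsum('i,j,k->ijk', a, b, c)   # rank-1 tensor a ∘ b ∘ c

# mode-1 unfolding: Fortran-order flatten, 2nd mode varying fastest
X1 = X.reshape(4, -1, order='F')

v = rng.standard_normal(6)
# Kronecker identity: X1 = a (c ⊗ b)^T, hence X1 @ v = a * ((c ⊗ b) @ v)
fast = a * (np.kron(c, b) @ v)
```

The product costs $O(I_2 I_3 + I_1)$ instead of first building the dense unfolding.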
Table 1. Per-iteration time complexity, space requirement, and empirical convergence speed of a direct implementation of the PA iterations versus the proposed NORT. The direct implementation operates on dense tensors and is slow; NORT exploits the "sparse plus low-rank" structure and is fast.
Remark 3.1.
Computation of $(\bar{\mathcal{X}}_t)_{\langle d\rangle}\mathbf{v}$ takes time and space proportional only to the factorization ranks, the tensor side lengths, and the number of observed entries. The same holds for computation of $\mathbf{u}^\top(\bar{\mathcal{X}}_t)_{\langle d\rangle}$. This is much less expensive than direct evaluation, which takes time and space proportional to the full tensor size.

Combining the above, and noting that we have to keep the factorized form of each $\mathcal{Y}_t^d$, the proximal steps in (16) can be computed with space and time that scale with the ranks and the number of observed entries, rather than with the full tensor size.
3.1.3 Complexities
In each PA iteration, $D$ proximal steps are performed in (16). As the ranks and the number of observed entries are much smaller than the tensor size, the per-iteration space and time of the whole PA algorithm are much lower than those of a direct implementation (Table 1). Moreover, the PA algorithm has a $O(1/T)$ convergence rate, where $T$ is the number of iterations (Zhong & Kwok, 2014; Yu et al., 2015).
3.2 Use of Adaptive Momentum
The PA algorithm uses only firstorder information, and empirically can be slow to converge (Parikh & Boyd, 2013)
. To address this problem, we adopt adaptive momentum, which has been popularly used for stochastic gradient descent
(Duchi et al., 2011; Kingma & Ba, 2014) and proximal algorithms (Li & Lin, 2015; Yao et al., 2017; Li et al., 2017). The idea is to use historical iterates to speed up convergence. Here, we adopt the adaptive scheme in (Li et al., 2017). The resultant procedure, shown in Algorithm 1, will be called NOncvx Regularized Tensor (NORT). Note that even when step 6 is performed, in step 10 still has the “sparse plus lowrank” structure on , since . The resultant time and space complexities are the same as in Section 3.1.3.3.3 Convergence Analysis
Adaptive momentum has not previously been used with the PA algorithm. Besides, previous proofs for the PA algorithm do not involve folding/unfolding operations. Thus, they cannot be directly applied here. In the following, we first note that the proximal step in (16) implicitly corresponds to a new regularizer.
Proposition 3.3.
There exists a function $\breve{r}$ such that the update in (15)-(16) satisfies $\mathcal{X}_{t+1} = \text{prox}_{\eta\breve{r}}(\bar{\mathcal{X}}_t)$ for any $\bar{\mathcal{X}}_t$.
Let the objective with the new regularizer be $\breve{F}$. The following bounds the difference between the optimal values ($F^*$ and $\breve{F}^*$, respectively) of the objective $F$ in (9) and $\breve{F}$.
Proposition 3.4.
$|F^* - \breve{F}^*|$ is bounded, and goes to zero as the stepsize $\eta$ decreases.
3.3.1 With Smoothness Assumption

As in Section 2.1, we assume that $f$ is Lipschitz-smooth. The following shows that Algorithm 1 converges to a critical point (Theorem 3.5) at a rate of $O(1/T)$ (Corollary 3.6). Note that this is the best possible rate for first-order methods on general nonconvex problems (Nesterov, 2013; Ghadimi & Lan, 2016).
Theorem 3.5.
The sequence $\{\mathcal{X}_t\}$ generated by Algorithm 1 has at least one limit point, and all limit points are critical points of $\breve{F}$.
Corollary 3.6.
(i) If the difference between consecutive iterates vanishes, the corresponding limit point is a critical point of $\breve{F}$; (ii) the minimum squared difference between consecutive iterates over $T$ iterations decreases as $O(1/T)$.
3.3.2 With Kurdyka-Łojasiewicz Condition

The Kurdyka-Łojasiewicz (KL) condition (Attouch et al., 2013; Bolte et al., 2014) has been popularly used in nonconvex optimization, particularly for gradient descent (Attouch et al., 2013) and proximal gradient descent algorithms (Bolte et al., 2014; Li & Lin, 2015; Li et al., 2017).
Definition 2.
A function $f$ has the uniformized KL property if for every compact set $C$ on which $f$ is a constant, there exist $\epsilon, \delta > 0$ such that for all $\bar{x} \in C$ and all $x \in \{x : \text{dist}(x, C) < \epsilon,\, f(\bar{x}) < f(x) < f(\bar{x}) + \delta\}$, one has $\phi'\big(f(x) - f(\bar{x})\big)\,\text{dist}\big(0, \partial f(x)\big) \ge 1$, where $\phi(s) = \frac{c}{\theta}s^{\theta}$ for some $c > 0$, and $\theta \in (0, 1]$.
The following extends this to Algorithm 1.
Theorem 3.7.

Let $r_t = \breve{F}(\mathcal{X}_t) - \inf\breve{F}$. If $\breve{F}$ has the uniformized KL property, then for a sufficiently large $t_0$, we have:

- if $\theta = 1$, $r_t$ reduces to zero in a finite number of steps;

- if $\theta \in [\tfrac{1}{2}, 1)$, $r_t \le O(\gamma^{t - t_0})$, where $\gamma \in (0, 1)$, i.e., the rate is linear;

- if $\theta \in (0, \tfrac{1}{2})$, $r_t \le O\big((t - t_0)^{-1/(1 - 2\theta)}\big)$, i.e., the rate is sublinear.
Though the convergence rates in Corollary 3.6 and Theorem 3.7 are the same as when momentum is not used, the proposed algorithm does converge faster empirically, as will be shown in Section 4. This agrees with previous studies showing that adaptive momentum can significantly accelerate empirical convergence (Duchi et al., 2011; Kingma & Ba, 2014; Li & Lin, 2015; Li et al., 2017; Yao et al., 2017).
4 Experiments
Table 2. Testing RMSE, space (in MB) and CPU time (in seconds) on the small and large synthetic tensors.

                          |                  small                   |                  large
                          | RMSE            space (MB)   time (s)    | RMSE            space (MB)    time (s)
convex        PA-APG      | 0.0149±0.0011   302.4±0.1    2131.7±419.9| 0.0098±0.0001   4804.5±598.2  6196.4±2033.4
capped        GDPAN       | 0.0103±0.0001   171.5±2.2    665.4±99.8  | 0.0006±0.0001   3243.3±489.6  3670.4±225.8
(nonconvex)   sNORT       | 0.0103±0.0001   14.0±0.8     27.9±5.1    | 0.0006±0.0001   44.6±0.3      575.9±70.9
              NORT        | 0.0103±0.0001   14.9±0.9     5.9±1.6     | 0.0006±0.0001   66.3±0.6      89.4±13.4
LSP           GDPAN       | 0.0104±0.0001   172.2±1.5    654.1±214.7 | 0.0006±0.0001   3009.3±376.2  3794.0±419.5
(nonconvex)   sNORT       | 0.0104±0.0001   14.4±0.1     27.9±5.7    | 0.0006±0.0001   44.6±0.2      544.2±75.5
              NORT        | 0.0104±0.0001   15.1±0.1     5.8±2.8     | 0.0006±0.0001   62.1±0.5      81.3±24.9
TNN           GDPAN       | 0.0104±0.0001   172.1±1.6    615.0±140.9 | 0.0006±0.0001   3009.2±412.2  3922.9±280.1
(nonconvex)   sNORT       | 0.0104±0.0001   14.4±0.1     26.2±4.0    | 0.0006±0.0001   44.7±0.2      554.7±44.1
              NORT        | 0.0103±0.0001   15.1±0.1     5.3±1.5     | 0.0006±0.0001   63.1±0.6      78.0±9.4
In this section, experiments are performed on both synthetic (Section 4.1) and realworld data sets (Section 4.2).
4.1 Synthetic Data
The setup is as in (Song et al., 2017). We first generate a low-rank tensor $\mathcal{T} = \sum_r \mathbf{a}_r \circ \mathbf{b}_r \circ \mathbf{c}_r$, where $\circ$ denotes the outer product (i.e., $[\mathbf{a} \circ \mathbf{b} \circ \mathbf{c}]_{ijk} = a_i b_j c_k$). All elements of the factors are sampled independently from the standard normal distribution. $\mathcal{T}$ is then corrupted by Gaussian noise to form $\mathcal{O}$. A number of random elements are observed from $\mathcal{O}$. We use a subset of them for training, and the remaining ones for validation. Testing is evaluated on the unobserved elements in $\mathcal{O}$.

Three nonconvex penalties are used: capped-$\ell_1$ (Zhang, 2010b), LSP (Candès et al., 2008) and TNN (Hu et al., 2013). The proposed NORT algorithm is compared with (i) its slower variant without adaptive momentum (denoted sNORT); (ii) GDPAN (Zhong & Kwok, 2014), which directly applies the PA algorithm to (10) as described in (14)-(16); and (iii) PA-APG (Yu, 2013), which solves the convex overlapped nuclear norm minimization problem. For NORT, the stepsize-related constant has to be large enough to guarantee convergence (Corollary 3.6). However, an overly large value leads to slow convergence (Remark 3.3). Hence, we set it just above the required threshold. Moreover, we set the momentum hyperparameters as in (Li et al., 2017). Besides, the objective value needed in step 5 of Algorithm 1 is hard to evaluate exactly, and we use a surrogate instead as in (Zhong & Kwok, 2014). All algorithms are implemented in Matlab, with sparse tensor and matrix operations in C. Experiments are performed on a PC with an Intel i8 CPU and 32GB memory.
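The data generation above can be sketched as follows (the sizes, rank, noise level and observation ratio here are illustrative, as the exact values are not recoverable):

```python
import numpy as np

rng = np.random.default_rng(0)
I1, I2, I3, R = 30, 30, 20, 3
A = rng.standard_normal((I1, R))
B = rng.standard_normal((I2, R))
C = rng.standard_normal((I3, R))

T = np.einsum('ir,jr,kr->ijk', A, B, C)        # sum_r a_r ∘ b_r ∘ c_r
O = T + 0.01 * rng.standard_normal(T.shape)    # corrupt with Gaussian noise
mask = rng.random(T.shape) < 0.1               # ~10% of entries observed
```

By construction, every mode-$d$ unfolding of `T` has rank at most `R`.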
Following (Lu et al., 2016; Yao et al., 2017; 2018), performance is evaluated by (i) the root-mean-square error on the unobserved elements, $\text{RMSE} = \|P_{\Omega^{\perp}}(\mathcal{X} - \mathcal{O})\|_F / \sqrt{\|\mathbf{\Omega}^{\perp}\|_1}$, where $\mathcal{X}$ is the recovered low-rank tensor and $\mathbf{\Omega}^{\perp}$ indicates the unobserved elements in $\mathcal{O}$; and (ii) CPU time. To reduce statistical variation, results are averaged over five repetitions.
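The testing metric can be sketched as follows (the boolean-mask convention is ours):

```python
import numpy as np

def rmse_unobserved(X_hat, O, observed):
    """Root-mean-square error over the unobserved entries only."""
    diff = (X_hat - O)[~observed]
    return np.sqrt(np.mean(diff ** 2))
```

Only entries outside the training/validation mask contribute, so the metric measures genuine completion quality rather than fitting of observed values.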
4.1.1 Small $I_3$

Recall that for the 3rd-order tensors considered here, we assume $I_1 \ge I_2 \ge I_3$. In this experiment, we first study the case where the third dimension $I_3$ is small relative to $I_1$ and $I_2$.
Table 2 shows the results. (In all tables, the best and comparable performances, as determined by a pairwise t-test at 95% confidence, are highlighted.) As can be seen, PA-APG, which is based on the convex overlapped nuclear norm, has much higher testing RMSEs than the methods with nonconvex regularization. Besides, the various nonconvex penalties (capped-$\ell_1$, LSP and TNN) have similar empirical testing RMSEs, as is also observed in (Lu et al., 2016; Yao et al., 2017; 2018). As for space, NORT and its variant sNORT need much less memory than PA-APG and GDPAN, as they do not explicitly construct dense tensors during the iterations. As for time, NORT is the fastest. Figure 1 shows convergence of the objective value. (Plots for LSP and TNN are similar, and are not shown due to lack of space.) As can be seen, sNORT and GDPAN have similar speeds w.r.t. the number of iterations. However, sNORT is much faster when measured against time, as it utilizes the "sparse plus low-rank" structure. NORT is even faster due to the use of adaptive momentum.

4.1.2 Large $I_3$
In this experiment, we use tensors whose third dimension $I_3$ is large; $D = 3$ is used here. Results are shown in Table 2. Again, the capped-$\ell_1$, LSP and TNN regularizers yield the same RMSE. GDPAN, sNORT and NORT all have much lower RMSEs than PA-APG. Convergence of the objective value is shown in Figure 2. Again, NORT is the fastest, and GDPAN is the slowest. NORT and sNORT need much less memory than GDPAN.
4.1.3 $D = 2$ vs $D = 3$

Recall that $D$ in (9) can be either 2 or 3. We expect $D = 2$ to be better when $I_3$ is small, and vice versa. This will be verified in this section. The setup is as in Sections 4.1.1 and 4.1.2. We only experiment with NORT, as the other baselines are less efficient.

Results are shown in Table 3. As can be seen, when $I_3$ is small, $D = 2$ and $D = 3$ yield similar RMSEs. However, $D = 3$ is much slower than $D = 2$. When $I_3$ is small, the third mode is not low-rank. Thus, the proximal step for the third mode is much more expensive than those for the first two modes, as its unfolding cannot be kept in low-rank factorized form. Moreover, $D = 3$ requires much larger space than $D = 2$.
Table 3. NORT with $D = 2$ vs $D = 3$ (RMSE, space in MB, and CPU time in seconds).

                  |      small $I_3$            |      large $I_3$
           $D$    | RMSE     space    time      | RMSE     space    time
capped      2     | 0.0103   14.0     5.9       | 0.0009   46.7     40.0
            3     | 0.0103   78.7     918.7     | 0.0006   66.3     89.4
LSP         2     | 0.0104   14.1     5.8       | 0.0010   45.2     50.8
            3     | 0.0103   78.7     899.7     | 0.0006   62.1     81.3
TNN         2     | 0.0103   14.4     5.3       | 0.0009   46.8     39.3
            3     | 0.0104   77.8     615.5     | 0.0006   63.1     78.0
When $I_3$ is large, $D = 3$ is slightly more expensive in CPU time and space. However, $D = 2$ has much worse testing RMSE than $D = 3$, as it cannot capture the low-rank property of the third mode.
4.2 Real-World Data Sets

In this section, we perform evaluation on color images, remote sensing data, and social network data. As the different nonconvex regularizers have similar performance, we will only use LSP. Moreover, based on the observations in Section 4.1.3, we use $D = 2$ if $I_3$ is small, and $D = 3$ otherwise. Besides GDPAN, the proposed NORT algorithm is also compared with: (i) algorithms for various convex regularizers, including ADMM (Tomioka et al., 2010), FaLRTC (Liu et al., 2013), PA-APG (Yu, 2013), FFW (Guo et al., 2017), TenNN (Zhang & Aeron, 2017), and TR-MM (Nimishakavi et al., 2018); and (ii) factorization-based algorithms, including RP (Kasai & Mishra, 2016), TMac (Xu et al., 2013), CP-WOPT (Acar et al., 2011), and TMac-TT (Bengua et al., 2017). (sNORT is not compared, as it has been shown to be slower than NORT. Neither do we compare with (Bahadori et al., 2014), which is inferior to FFW (Guo et al., 2017); nor with (Rauhut et al., 2017), whose code is not publicly available and whose problem is also solved by the more recent TMac-TT.) More details about these methods can be found in Table LABEL:tab:comparedalgs of the Appendix.
4.2.1 Color Images

We use the images windows, tree and rice from (Hu et al., 2013), resized to a common size. Sample images are shown in Appendix LABEL:app:cimg. Each pixel is normalized to $[0, 1]$. We randomly sample 10% of the pixels for training, which are then corrupted by Gaussian noise. Half of the training pixels are used for validation. The remaining unseen clean pixels are used for testing. Performance is measured by the testing RMSE and CPU time. To reduce statistical variation, results are averaged over five repetitions.

Table 4 shows the results. As can be seen, the best convex methods (PA-APG and FaLRTC) are based on the overlapped nuclear norm. This agrees with our motivation of building a nonconvex regularizer on top of the overlapped nuclear norm. GDPAN and NORT have similar RMSEs, which are lower than those of the convex regularization and factorization approaches. Convergence of the testing RMSE is shown in Figure 3. As can be seen, while ADMM solves the same convex model as PA-APG and FaLRTC, it has slower convergence. FFW, RP and TR-MM are very fast, but their testing RMSEs are higher than that of NORT. By utilizing the "sparse plus low-rank" structure and adaptive momentum, NORT is more efficient than GDPAN.
Table 4. Testing RMSEs on the color images.

                            rice          tree          windows
convex          ADMM        0.680±0.003   0.915±0.005   0.709±0.004
                PA-APG      0.583±0.016   0.488±0.007   0.585±0.002
                FaLRTC      0.576±0.004   0.494±0.011   0.567±0.005
                FFW         0.634±0.003   0.599±0.005   0.772±0.004
                TR-MM       0.596±0.005   0.515±0.011   0.634±0.002
                TenNN       0.647±0.004   0.562±0.004   0.586±0.003
factorization   RP          0.541±0.011   0.524±0.010   0.388±0.026
                TMac        1.923±0.005   1.750±0.006   1.313±0.005
                CP-WOPT     0.912±0.086   0.733±0.060   0.964±0.102
                TMac-TT     0.729±0.022   0.697±0.147   1.045±0.107
nonconvex       GDPAN       0.467±0.002   0.388±0.012   0.296±0.007
                NORT        0.468±0.001   0.386±0.009   0.297±0.007
4.2.2 Remote Sensing Data

Experiments are performed on three hyperspectral data sets: Cabbage (1312×432×49), Scene (1312×951×49) and Female (592×409×148). Details are in Appendix LABEL:app:himg. The third dimension corresponds to the bands of the images. We use the same setup as in Section 4.2.1. ADMM, TenNN, GDPAN, and TMac-TT are slow and so are not compared. Results are shown in Table 5. Again, NORT achieves much lower testing RMSE than the convex regularization and factorization approaches. Figure 4 shows convergence of the testing RMSE. As can be seen, NORT is fast.
Table 5. Testing RMSEs on the hyperspectral data (the NORT values were lost in this copy).

                            Cabbage       Scene         Female
convex          PA-APG      9.13±0.06     19.65±0.02    11.57±0.03
                FaLRTC      9.09±0.02     19.20±0.01    11.33±0.04
                FFW         9.62±0.04     20.37±0.02    20.96±0.06
                TR-MM       9.59±0.01     19.65±0.02    13.97±0.06
factorization   RP          4.91±0.11     18.04±0.05    6.47±0.03
                TMac        49.19±0.59    59.70±0.29    198.97±0.06
                CP-WOPT     18.46±5.14    48.11±0.82    18.68±0.13
nonconvex       NORT        —             —             —