1 Introduction
Learning from a matrix or a tensor has long been an important problem in machine learning. In particular, matrix and tensor factorization by low rank inducing norms has been studied extensively, with applications such as missing value imputation (Signoretto et al., 2013; Liu et al., 2009), multitask learning (Argyriou et al., 2006; Romera-Paredes et al., 2013; Wimalawarne et al., 2014), subspace clustering (Liu et al., 2010), and inductive learning (Signoretto et al., 2013; Wimalawarne et al., 2016). Though useful in many applications, factorization based on an individual matrix or tensor tends to perform poorly under the cold start setup (Singh and Gordon, 2008). To address this issue, matrix or tensor factorization with side information is useful (Narita et al., 2011) and has been applied to recommendation systems (Singh and Gordon, 2008; Gunasekar et al., 2015) and personalized medicine (Khan and Kaski, 2014).
Matrix or tensor factorization with side information can be regarded as a joint factorization of coupled matrices and tensors (hereafter coupled tensors; see Figure 1). Acar et al. (2011) introduced a coupled factorization method based on the CP decomposition, which simultaneously factorizes matrices and tensors that share low rank structures. The coupled factorization approach has been applied to the joint analysis of fluorescence and NMR measurements (Acar et al., 2014a) and of NMR and liquid chromatography-mass spectrometry (LC–MS) data (Acar et al., 2015). More recently, Ermis et al. (2015) proposed a Bayesian approach and applied it to link prediction problems. However, existing coupled factorization methods are nonconvex and may reach only a poor local optimum. Moreover, the ranks of the coupled tensors need to be determined beforehand, and in practice it is hard to specify the true rank of the tensor and the matrix without prior knowledge. Furthermore, the existing algorithms come with no theoretical guarantees.
In this paper, to handle the nonconvexity problem, we propose new convex norms for coupled tensors. The norms are mixtures of existing tensor norms, namely the overlapped trace norm (Tomioka et al., 2011), the latent trace norm (Tomioka and Suzuki, 2013), and the scaled latent trace norm (Wimalawarne et al., 2014), together with the matrix trace norm (Argyriou et al., 2006). A key advantage of the proposed norms is that they are convex, so a globally optimal solution of the resulting completion model can be found, whereas existing coupled factorization approaches are nonconvex. Furthermore, we analyze the excess risk bounds of the completion model regularized by our proposed norms. Through synthetic and real-world data experiments, we show that the proposed completion algorithm compares favorably with existing completion algorithms.
The contributions are summarized below:

We propose a set of convex coupled norms for matrices and tensors by extending low rank tensor and matrix norms.

We propose mixed norms that combine features of both the overlapped norm and the latent norms.

We propose a new convex completion model regularized by the proposed coupled norms.

We derive excess risk bounds for the proposed completion model with respect to the proposed norms, which show that our model leads to lower excess risk.

Through synthetic and real-world data experiments, we show that our norms are more robust than those for individual matrix or tensor completion.
The remainder of the paper is organized as follows. In Section 2, we discuss related methods of coupled tensor completion. In Section 3, we put forward our proposed method: we first introduce a coupled completion model, then propose a new set of norms called the coupled norms, and derive their dual norms in Section 3.3. In Section 4, we give optimization methods to solve the coupled completion model. In Section 5, we provide a theoretical analysis of the coupled completion models using excess risk bounds for the proposed coupled norms. In Section 6, we give experimental results based on simulations and real-world experiments. Finally, we give our conclusions and future work.
2 Related Work
Several models have been proposed for learning with multiple matrices or tensors, but most of them rely on joint factorization of matrices and tensors. Acar et al. (2011) proposed a regularization based model to perform completion of coupled tensors, which was further studied in (Acar et al., 2014a; Acar et al., 2014b; Acar et al., 2015). Their method uses the CP decomposition (Kolda and Bader, 2009) to factorize the tensor and assumes that the factorized components of its coupled mode are shared with the factorized components of the matrix on the same mode. Bayesian models that use similar factorization models have also been proposed for missing value imputation, with applications in link prediction (Ermis et al., 2015) and nonnegative factorization (Takeuchi et al., 2013). Other applications that have employed collective factorization of tensors are multiview factorization (Khan and Kaski, 2014) and multiway clustering (Banerjee et al., 2007). Because they are based on factorization, all of the above methods are nonconvex.
Recently, the use of common adjacency graphs to incorporate the similarity among heterogeneous tensor data has been proposed (Li et al., 2015). Though this method does not require assumptions on ranks for explicit factorization of tensors, it depends on the modeling of the common adjacency graph and does not incorporate the low rankness created by coupling of tensors.
3 Proposed Method
We investigate the coupling of a matrix and a tensor, which is formed when they share a common mode (Acar et al., 2014a; Acar et al., 2014b; Acar et al., 2015). The most basic example of such coupling is shown in Figure 1, where a 3-way (third-order) tensor is attached to a matrix on a specific mode. As depicted, we may want to predict recommendations for customers based on their preferences for restaurants in different locations, and we may also have side information about the features of each customer. In order to utilize the side information, we can couple the customer-feature matrix with the sparse customer-restaurant-location tensor on the customer mode and then impute the missing values in the tensor.
Let us consider a partially observed matrix and a partially observed 3-way tensor with mappings to observed elements indexed by and respectively, and assume that they are coupled on the first mode. The final goal of this paper is to design convex coupled norms to solve the following problem:
(1) 
where is the regularization parameter. Moreover, we investigate the theoretical properties of (1).
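To make the setup concrete, the following minimal sketch builds a partially observed matrix and a 3-way tensor that share their first mode and evaluates a data-fit term on the observed entries only. The sizes, the names (`T_hat`, `M_hat`, `mask_T`, `mask_M`, `observed_squared_loss`, `data_fit`) and the choice of a squared loss are assumptions made for illustration; they are not taken from the formulation above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: n1 customers, n2 restaurants, n3 locations, m features.
n1, n2, n3, m = 6, 5, 4, 3
T_hat = rng.standard_normal((n1, n2, n3))   # customer x restaurant x location
M_hat = rng.standard_normal((n1, m))        # customer x feature (side information)

# Boolean masks play the role of the index sets of observed elements.
mask_T = rng.random(T_hat.shape) < 0.3
mask_M = rng.random(M_hat.shape) < 0.8


def observed_squared_loss(estimate, target, mask):
    """Half the squared error restricted to observed entries (assumed loss)."""
    diff = (estimate - target)[mask]
    return 0.5 * float(np.sum(diff ** 2))


def data_fit(T, M):
    """Tensor plus matrix data-fit; a coupled-norm regularizer would be added to this."""
    return (observed_squared_loss(T, T_hat, mask_T)
            + observed_squared_loss(M, M_hat, mask_M))
```

The only structural requirement illustrated here is that the tensor and the matrix have the same size on the coupled (first) mode; the unobserved entries never enter the data-fit term.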
Notations: The mode unfolding of tensor is represented as which is obtained by concatenating all the vectors with dimensions of obtained by fixing all except the th index on the mode along its columns. We use the notation to indicate the conversion of a matrix or a tensor into a vector and to represent its reverse operation. The spectral norm (operator norm) of a matrix is that is the largest singular value of and the Frobenius norm of a tensor is defined as . We use the notation as the concatenation of matrices and on their mode.
3.1 Existing Matrix and Tensor Norms
Before we introduce our new norms, let us first briefly review the existing low rank inducing matrix and tensor norms. In the case of matrices, the matrix trace norm (Argyriou et al., 2006) is a commonly used convex relaxation to the minimization of the rank of a matrix. For a given matrix with rank , we can define its trace norm as follows:
where is the th nonzero singular value of the matrix .
Low rank inducing norms for tensors have also attracted renewed attention in recent years. One of the earliest low rank inducing tensor norms is the tensor nuclear norm (Liu et al., 2009), also known as the overlapped trace norm (Tomioka and Suzuki, 2013), which can be expressed for a tensor as follows:
(2) 
Tomioka and Suzuki (2013) proposed the latent trace norm:
(3) 
Recently, Wimalawarne et al. (2014) have proposed the scaled latent trace norm as an extension of the latent trace norm:
(4) 
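For reference, the norms reviewed in this subsection are commonly written as follows in the cited literature. The notation is an assumption chosen for this sketch: $X$ is a matrix of rank $J$ with singular values $\sigma_i(X)$, $\mathcal{T}$ is a $K$-way tensor of size $n_1 \times \cdots \times n_K$, $T_{(k)}$ is its mode-$k$ unfolding, and $\mathcal{T}^{(1)}, \ldots, \mathcal{T}^{(K)}$ are latent tensors that sum to $\mathcal{T}$.

```latex
% Matrix trace norm: sum of the nonzero singular values
\|X\|_{\mathrm{tr}} = \sum_{i=1}^{J} \sigma_i(X)

% Overlapped trace norm: trace norms of all unfoldings of the same tensor
\|\mathcal{T}\|_{\mathrm{overlap}} = \sum_{k=1}^{K} \|T_{(k)}\|_{\mathrm{tr}}

% Latent trace norm: each latent tensor is regularized on one mode only
\|\mathcal{T}\|_{\mathrm{latent}}
  = \inf_{\mathcal{T}^{(1)} + \cdots + \mathcal{T}^{(K)} = \mathcal{T}}
    \sum_{k=1}^{K} \|T^{(k)}_{(k)}\|_{\mathrm{tr}}

% Scaled latent trace norm: the mode-k term is scaled by 1/sqrt(n_k)
\|\mathcal{T}\|_{\mathrm{scaled}}
  = \inf_{\mathcal{T}^{(1)} + \cdots + \mathcal{T}^{(K)} = \mathcal{T}}
    \sum_{k=1}^{K} \frac{1}{\sqrt{n_k}} \|T^{(k)}_{(k)}\|_{\mathrm{tr}}
```

The two latent norms are defined through an infimum over decompositions, so unlike the overlapped trace norm they have no closed form in terms of the unfoldings of $\mathcal{T}$ itself.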
The behaviors of the above tensor norms have been studied in the multitask learning setting (Wimalawarne et al., 2014) and the inductive learning setting (Wimalawarne et al., 2016). These studies have shown that for a tensor with multilinear rank , the excess risk with respect to regularization by the overlapped trace norm is bounded above by , with the latent trace norm it is bounded above by and with the scaled latent trace norm it is bounded above by .
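The matrix trace norm and the overlapped trace norm can be evaluated directly from singular values, as the short NumPy sketch below illustrates. The function names are hypothetical, and modes are indexed from 0 in the code. The column ordering produced by `unfold` may differ from the convention of Kolda and Bader (2009), which does not change the singular values.

```python
import numpy as np


def unfold(tensor, mode):
    """Mode unfolding (0-based): rows index `mode`, columns index the other modes."""
    return np.moveaxis(tensor, mode, 0).reshape(tensor.shape[mode], -1)


def trace_norm(matrix):
    """Sum of singular values (also called the nuclear norm)."""
    return float(np.linalg.svd(matrix, compute_uv=False).sum())


def overlapped_trace_norm(tensor):
    """Sum of the trace norms of all mode unfoldings of the tensor."""
    return sum(trace_norm(unfold(tensor, k)) for k in range(tensor.ndim))


# A rank-one tensor built from unit vectors: every unfolding has a single
# singular value equal to 1, so the overlapped trace norm equals the order, 3.
a = np.array([1.0, 0.0])
b = np.array([0.0, 1.0, 0.0])
c = np.array([1.0, 0.0, 0.0, 0.0])
rank_one = np.einsum("i,j,k->ijk", a, b, c)
```

The latent and scaled latent trace norms are not covered here because evaluating them requires solving an optimization problem over the latent tensors.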
3.2 Coupled Tensor Norms
As with individual matrices and tensors, having convex and low rank inducing norms for coupled tensors would be useful in achieving global solutions for coupled tensor completion with theoretical guarantees. In order to achieve this we propose a set of new norms for coupled tensors that are coupled on specific modes using the existing matrix and tensor trace norms. Let us first define a new coupled norm with the format of where the superscript specifies the mode in which the tensor and the matrix are coupled and the subscripts specified by indicate how the modes are regularized. Let us now introduce the notations for :

Notation : The mode is regularized with the trace norm and the same tensor is regularized on some other modes similarly to the overlapped trace norm.

Notation : The mode is considered as a latent tensor which is regularized by the trace norm only with respect to that mode.

Notation : The mode is regularized as a latent tensor but it is also scaled similarly to the scaled latent trace norm.

Notation : The mode is not regularized.
Given a matrix and a tensor , we introduce three norms which are a coupled extension of the overlapped trace norm, the latent trace norm, and the scaled latent trace norm, respectively.
Coupled overlapped trace norm:
(5) 
Coupled latent trace norm:
(6) 
Coupled scaled latent trace norm:
(7) 
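As an illustration of how coupling can enter a norm, the sketch below evaluates one plausible reading of the coupled overlapped trace norm. It assumes that the coupled (first) mode is regularized through the trace norm of the concatenation of the mode-1 unfolding of the tensor with the matrix, while the remaining modes contribute the trace norms of their own unfoldings. This form, and the function names, are assumptions made for this example rather than a restatement of the definition above.

```python
import numpy as np


def unfold(tensor, mode):
    """Mode unfolding (0-based): rows index `mode`, columns index the other modes."""
    return np.moveaxis(tensor, mode, 0).reshape(tensor.shape[mode], -1)


def trace_norm(matrix):
    """Sum of singular values."""
    return float(np.linalg.svd(matrix, compute_uv=False).sum())


def coupled_overlapped_trace_norm(tensor, matrix):
    """Assumed form: ||[T_(1), M]||_tr + sum over the other modes of ||T_(k)||_tr."""
    if tensor.shape[0] != matrix.shape[0]:
        raise ValueError("tensor and matrix must have the same size on the coupled mode")
    coupled_unfolding = np.hstack([unfold(tensor, 0), matrix])
    rest = sum(trace_norm(unfold(tensor, k)) for k in range(1, tensor.ndim))
    return trace_norm(coupled_unfolding) + rest


# Rank-one tensor and a rank-one matrix sharing the same left singular vector.
a = np.array([1.0, 0.0])
tensor = np.einsum("i,j,k->ijk", a, np.array([0.0, 1.0, 0.0]),
                   np.array([1.0, 0.0, 0.0, 0.0]))
matrix = np.outer(a, np.ones(3))
coupled_value = coupled_overlapped_trace_norm(tensor, matrix)
separate_value = (sum(trace_norm(unfold(tensor, k)) for k in range(3))
                  + trace_norm(matrix))
```

In this small example the shared component makes the coupled unfolding rank one, so `coupled_value` is smaller than `separate_value`, the sum of the norms computed independently. This is the kind of shared low rank structure that a coupled norm is meant to reward.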
In addition to these norms, we can also build norms as mixtures of overlapped and latent/scaled latent norms. For example, if we want to build a norm which is regularized with the scaled latent trace norm on the second mode but other modes are regularized using the overlapped trace norm, then we may define the following norm,
(8) 
This norm has two latent tensors and , one regularized with the overlapped method and the other regularized only as a scaled latent tensor. Because it mixes regularization methods, we call such heterogeneously regularized norms mixed norms.
Similarly to the previous mixed norm (8), we can create mixed norms as indicated by subscripts , , , , and . The main advantage gained by using these mixed norms is the additional freedom to regularize low rank constraints among coupled tensors. Other combinations of norms where two modes consist of latent tensors, such as , will make the third mode also a latent tensor, since overlapped regularization requires more than one mode of the same tensor to be regularized. Though we have taken the latent trace norm into consideration, in practice it has been shown to perform worse than the scaled latent trace norm (Wimalawarne et al., 2014; Wimalawarne et al., 2016), so in our experiments we only consider the scaled latent version of mixed norms.
3.2.1 Extensions for Multiple Matrices and Tensors
Our newly defined norms can be easily extended to multiple matrices and tensors that are coupled on different modes. For instance, we can formulate a coupling between two matrices and to a 3-way tensor on its first and third modes, where mode and mode are regularized by an overlapped trace norm and the mode is regularized by a scaled latent trace norm. The mixed norm for this coupling can be written as follows,
3.3 Dual Norms
Let us now briefly look at dual norms for the above defined coupled norms. The dual norms are useful in deriving the excess risk bounds in Section 5. Due to space limitations, we do not derive the dual norms of all the coupled norms but give two examples to illustrate their nature. In order to derive the dual norms, we first need to know the Schatten norms (Tomioka and Suzuki, 2013) for the coupled tensor norms. Let us first define the Schatten norm, denoted with an additional subscript notation , for the coupled norm as follows,
(9) 
where and are constants, , and are the ranks on each mode and , and are the singular values on each unfolding.
The next theorem shows the dual norm of (see the appendix A for the proof).
Theorem 1.
Let a matrix and a tensor be coupled on their first mode. The dual norm of with and is
where , and are the ranks on each mode and , and are the singular values on each unfolding of the coupled tensor.
In the special case of and we see that and its dual is the spectral norm for this coupled norm as given next.
Corollary 1.
Let a matrix and a tensor be coupled on their first mode. The dual norm of is
The next theorem gives a dual norm for the mixed norm and the dual norms for the other mixed norms follow similarly. Let us first look at its Schatten norm, which is as follows,
Theorem 2.
Let a matrix and a tensor be coupled on their first mode. The dual norm of the mixed coupled norm with and is
where , and are the ranks of , and respectively and , and are their singular values.
The proof of the above theorem can also be found in Appendix A.
4 Optimization
In this section we discuss optimization methods for the completion model proposed in (1). The completion model in (1) with each coupled norm can be solved using a state-of-the-art optimization method such as the alternating direction method of multipliers (ADMM) (Boyd et al., 2011). Next, we give details on the derivation of the optimization steps for the coupled norm based on ADMM; the optimization procedures for the other norms follow similarly.
Let us express (1) with the norm as follows:
(10) 
By introducing auxiliary variables and the objective function of ADMM for (10) can be formulated as follows:
(11) 
We introduce Lagrangian multipliers and and formulate the Lagrangian as follows:
(12) 
where is a proximity parameter. Using the above Lagrangian formulation we can obtain solutions for unknown variables , , , , , and iteratively. We use the superscript and to represent variables at iterations and respectively.
The solutions for at each iteration can be acquired by solving the following subproblem
The solution for and at iteration given can be obtained from the following subproblem
(13) 
where and are unit diagonal matrices with dimensions of and respectively.
The updates for and at iteration are given as
(14) 
and
(15) 
where for .
The updating rules for the dual variables are as follows,
We can modify the above optimization procedures by replacing the variables in (10) according to the norm that is used for regularization and adjusting the operations in (11), (13), (14) and (15). For example, if we consider the norm then we have only a single and the updating rules derived from (13) change to,
then (14) becomes,
and
also
Similarly, the optimization procedures for all the other norms can be derived easily.
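The structure of such ADMM iterations can be illustrated on a deliberately simplified problem: completion of a single matrix regularized by the ordinary trace norm, with one auxiliary variable and one multiplier. This is not the coupled algorithm described above. The function names, the squared loss, and the default values of `lam`, `beta` and `n_iter` are assumptions made for this sketch. The step that typically carries over to trace-norm regularized models is the proximal update, which soft-thresholds singular values.

```python
import numpy as np


def svt(matrix, threshold):
    """Proximal operator of threshold * trace norm: soft-threshold the singular values."""
    u, s, vt = np.linalg.svd(matrix, full_matrices=False)
    return (u * np.maximum(s - threshold, 0.0)) @ vt


def admm_trace_norm_completion(target, mask, lam=0.1, beta=1.0, n_iter=300):
    """ADMM for min_W 0.5*||mask*(W - target)||_F^2 + lam*||Z||_tr  s.t.  W = Z."""
    w = np.zeros_like(target)
    z = np.zeros_like(target)
    a = np.zeros_like(target)            # Lagrange multiplier for W = Z
    obs = mask.astype(target.dtype)      # 1.0 on observed entries, 0.0 elsewhere
    for _ in range(n_iter):
        # W-step: entrywise quadratic problem with a closed-form solution.
        w = (obs * target - a + beta * z) / (obs + beta)
        # Z-step: proximal step on the trace norm.
        z = svt(w + a / beta, lam / beta)
        # Multiplier update for the constraint W = Z.
        a = a + beta * (w - z)
    return w, z


# Usage on a small synthetic rank-2 matrix with about 60% of entries observed.
rng = np.random.default_rng(0)
truth = rng.standard_normal((20, 2)) @ rng.standard_normal((2, 15))
mask = rng.random(truth.shape) < 0.6
w, z = admm_trace_norm_completion(truth, mask)
```

In a coupled model, one would expect one auxiliary variable and one multiplier per regularized unfolding, with the same three kinds of steps repeated for each of them.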
5 Theoretical Analysis
In this section we analyze the excess risk bounds of the completion problem introduced in (1) for the different coupled norms defined in Section 3, using the transductive Rademacher complexity (El-Yaniv and Pechyony, 2007; Shamir and Shalev-Shwartz, 2014). Let us again consider the matrix and the tensor and use it as a single structure with training samples indexed by and testing samples indexed by with total number of observed samples . Let us rewrite (1) with our new notations as an equivalent model as follows,
(16) 
where , is the learned coupled structure consisting of components and of the tensor and the matrix respectively, is a constant, and could be any of the norms defined in Section 3.2.
Given that is a Lipschitz continuous loss function bounded as and with the assumption that we can have the following bound based on transductive Rademacher complexity theory (El-Yaniv and Pechyony, 2007; Shamir and Shalev-Shwartz, 2014) with probability ,
(17) 
where the is the transductive Rademacher complexity defined as follows,
(18) 
where with probability if or (derivation details can be found in appendix B).
Next we give bounds for (18) with respect to different coupled norms. We have made the assumption of as in (Shamir and Shalev-Shwartz, 2014), but our theorems can be extended to more general cases. The detailed proofs of all the theorems in this section can be found in Appendix B.
The following two theorems give the excess risk bounds for the completion with the norm (5) and the norm (7).
Theorem 3.
Let then with probability ,
where is the multilinear rank of , is the rank of the coupled unfolding on mode and , , and are constants.
Theorem 4.
Let then with probability ,
where is the multilinear rank of , is the rank of the coupled unfolding on mode and , , and are constants.
We can see that in Theorems 3-4, the total risk of the coupled tensor is divided by the total number of observed samples of both the matrix and the tensor. If the tensor or the matrix is completed separately, then its loss is bounded only in terms of its individual samples (see Theorems 7-9 in Appendix B and (Shamir and Shalev-Shwartz, 2014)). This indicates that coupled tensor learning can lead to better performance than learning a matrix or a tensor independently. Additionally, in the minimum term on the right side of the bound of , the first term is due to the coupled unfolding of the tensors, but it may not be selected since it could become a larger term due to coupling. This may lead to a better bound for due to coupling, since the larger structure due to coupling may not affect the excess risk even though it is bounded with the rank of the coupled mode. In the case of the norm, we can see that the maximum term on the right-hand side may select a larger term due to coupling, but the excess risk is bounded by the minimum of the scaled ranks of the tensor and the coupled unfolding of tensors.
Finally, we look at the excess risk of the mixed norm in the next theorem.
Theorem 5.
Let then with probability ,
where is the multilinear rank of , is the rank of the coupled unfolding on mode and , , and are constants.
We see that for the mixed norm , the excess risk is bounded by the scaled ranks of the coupled unfolding along the first mode, since it will always be selected in the minimization among rank combinations. Also, this bound can be better than the bound in Theorem 4, since the maximum among the right-hand side components can be smaller with this norm. This indicates that mixed norms can lead to better bounds compared to other coupled norms. The bounds of the other two mixed norms can also be derived and explained as in Theorem 5.
6 Experiments
In this section we describe experiments with synthetic data and real-world data.
6.1 Synthetic Data
Our main objectives in these experiments were to find out how different norms perform under various ranks and dimensions of coupled tensors.
In order to create coupled tensors, we first generate a tensor using the Tucker decomposition (Kolda and Bader, 2009) as where is the core tensor generated from a normal distribution specifying multilinear rank and the component matrices , and are orthogonal matrices. Next, we generate a matrix that is to be coupled with the mode of the tensor using singular value decomposition where we specify its rank using the diagonal matrix and generate the matrices and as orthogonal matrices. In the case of sharing between the matrix and the tensor, we compute and we replace the first singular values of and the basis vectors of with the first singular values of and basis vectors of to construct the matrix such that the coupled structure shares common components. We also add noise sampled from a Gaussian distribution of mean and variance to the elements of the matrix and the tensor.
Throughout our synthetic experiments, we considered a coupled structure with tensors with dimension of and matrices with dimension of coupled on the mode. We considered different multilinear ranks of tensors, ranks of matrices, and degrees of sharing among them. To simulate completion experiments, we randomly selected observed samples with percentages of , and of the total number of elements from the matrix and the tensor, selected a validation set with percentage of , and took the rest as test samples. As benchmark methods we used the overlapped trace norm (abbreviated as OTN) and the scaled latent trace norm (abbreviated as SLTN) for individual tensors, the matrix trace norm (abbreviated as MTN) for individual matrices, and the method proposed by Acar et al. (2014b). We cross validated using regularization parameters from to with intervals of . We ran our experiments with random selection and plotted the mean squared error (MSE) of the test samples.
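The data generation procedure described above can be sketched as follows. All concrete values (dimensions, ranks, number of shared components, noise level, unit singular values for the unshared part of the matrix) and all names are hypothetical choices for illustration, since the actual settings are not reproduced here. Orthonormal factors are drawn via QR factorization of Gaussian matrices, which is one common way to obtain them.

```python
import numpy as np


def random_orthonormal(rows, cols, rng):
    """Matrix with orthonormal columns from the QR factorization of a Gaussian matrix."""
    q, _ = np.linalg.qr(rng.standard_normal((rows, cols)))
    return q


def make_coupled_data(dims=(20, 20, 20), ranks=(3, 4, 5), m=30, matrix_rank=5,
                      shared=2, noise_std=0.01, seed=0):
    """Tucker tensor and SVD-style matrix sharing `shared` components on the first mode."""
    rng = np.random.default_rng(seed)
    core = rng.standard_normal(ranks)
    factors = [random_orthonormal(n, r, rng) for n, r in zip(dims, ranks)]
    tensor = np.einsum("abc,ia,jb,kc->ijk", core, *factors)

    u = random_orthonormal(dims[0], matrix_rank, rng)
    v = random_orthonormal(m, matrix_rank, rng)
    s = np.ones(matrix_rank)             # assumed singular values of the matrix
    if shared > 0:
        # SVD of the mode-1 unfolding of the tensor; reuse its leading components.
        ut, st, _ = np.linalg.svd(tensor.reshape(dims[0], -1), full_matrices=False)
        u[:, :shared] = ut[:, :shared]
        s[:shared] = st[:shared]
    matrix = (u * s) @ v.T

    # Additive Gaussian noise on every element of both structures.
    tensor = tensor + noise_std * rng.standard_normal(tensor.shape)
    matrix = matrix + noise_std * rng.standard_normal(matrix.shape)
    return tensor, matrix


# Noise-free instance, convenient for checking ranks.
clean_tensor, clean_matrix = make_coupled_data(noise_std=0.0)
```

With sharing enabled, the rank of the concatenation of the mode-1 unfolding with the matrix is smaller than the sum of the two individual ranks, which is the shared structure that the coupled norms are designed to exploit.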
First, we considered a coupled structure as described above, with a matrix having rank of and a tensor having multilinear rank of but with nothing shared between them. Figure 2 shows the performances. The best performance for the tensor was with the norm, outperforming both the overlapped trace norm and the scaled latent trace norm, indicating that the is better than learning the tensor independently. However, the performance of matrix completion with the coupled norms did not show clear dominance over its independent completion, though the norm gave improved performance when the number of training samples was small.
Next, we considered coupling of tensors and matrices with some degree of sharing among them. In this case we made a matrix having rank of and a tensor having multilinear rank of and let them share singular vectors on the mode. Figure 3 shows that individual matrix completion was outperformed by all the coupled norms except the norms and . This indicates that the coupled structure has much lower rank on the first mode compared to other modes. Overall the norm gave the best performance for both the matrix and the tensor.
Next, we considered a matrix having rank of and a tensor having multilinear rank of and they shared singular vectors on the mode. We see in Figure 4 that the best performances for both the matrix and the tensor were given by the mixed norm . This indicates that the coupling has made the rank of mode smaller and that mixed regularization is important, since it outperformed the scaled latent trace norm and the norm.
Finally, we created a coupled matrix with rank and a tensor with multilinear rank of sharing singular vectors on mode. Figure 5 shows that the best performances for the tensor are given by mixed norms of and where a scaled latent tensor was chosen either for the low ranked mode or mode and the rest of the modes were regularized as overlapping norms. We see that the performance of matrix completion with the same mixed norms was slightly worse, indicating that coupling in this case was not useful for matrix completion.