1 Introduction
Given multiple related learning tasks, multitask learning (Caruana, 1997; Zhang and Yang, 2017)
aims to exploit useful information contained in them to improve the performance of all the tasks. Multitask learning has been applied to many areas, including computer vision, natural language processing, and speech recognition. Over the past decades, many multitask learning models have been devised to learn such useful information shared by all the tasks. As reviewed in
(Zhang and Yang, 2017), multitask learning models can be categorized into six classes, including the feature transformation approach (Argyriou et al., 2006; Misra et al., 2016), feature selection approach
(Obozinski et al., 2006; Liu et al., 2009; Lozano and Swirszcz, 2012), low-rank approach (Pong et al., 2010; Han and Zhang, 2016; Yang and Hospedales, 2017b), task clustering approach (Xue et al., 2007; Jacob et al., 2008; Kumar and Daumé III, 2012; Han and Zhang, 2015a), task relation learning approach (Bonilla et al., 2007; Zhang and Yeung, 2010; Long et al., 2017; Zhang et al., 2018), and decomposition approach (Jalali et al., 2010; Chen et al., 2010; Zweig and Weinshall, 2013; Han and Zhang, 2015b).

Among those approaches, the low-rank approach is effective at identifying low-rank model parameters. When the model parameters of a task can be organized in a vector, corresponding to, for example, binary classification or regression tasks on vectorized data, the matrix trace norm or its variants can be used as a regularizer on the parameter matrix, each of whose columns stores the parameters of one task, to identify the low-rank structure among tasks. Nowadays, with the collection of complex data and the popularity of deep learning techniques, each data point can be represented as a tensor (e.g., images) and each learning task becomes more complex, e.g., a multiclass classification task. In this case, the parameters of all the tasks are stored in a tensor, making the matrix trace norm inapplicable, and instead tensor trace norms
(Romera-Paredes et al., 2013; Wimalawarne et al., 2014; Yang and Hospedales, 2017b) are used to learn low-rank parameters in the parameter tensor for multitask learning.

Different from the matrix trace norm, which has a unique definition, the tensor trace norm has many variants, as the tensor rank has multiple definitions. Here we focus on overlapped tensor trace norms, each of which equals the sum of the matrix trace norms of several tensor flattenings of the tensor. An overlapped tensor trace norm relies on the way the tensor flattening is done. For example, the Tucker trace norm (Tucker, 1966) conducts the tensor flattening along each axis of the tensor, and the Tensor-Train (TT) trace norm (Oseledets, 2011) does it along successive axes starting from the first one. There are two limitations in existing tensor trace norms. Firstly, for an $N$-way tensor, there are $2^{N-1}-1$ possible tensor flattenings with distinct trace norms, but existing overlapped tensor trace norms only utilize a subset of them, making them fail to capture all the low-rank structures in the parameter tensor. Another limitation is that all the tensor flattenings used in a tensor trace norm are assumed to be equally important, which leads to suboptimal performance.
In this paper, to overcome the two aforementioned limitations of existing overlapped tensor trace norms, we propose a Generalized Tensor Trace Norm (GTTN). The GTTN exploits all possible tensor flattenings and is defined as the convex combination of the matrix trace norms of all possible tensor flattenings. In this way, the GTTN can capture all the low-rank structures in the parameter tensor and hence overcome the first limitation. Moreover, to alleviate the second limitation, we treat the combination coefficients in the GTTN as variables and propose an objective function to learn them from data. Another advantage of learning the combination coefficients is that they can reveal the importance of some axes, which improves the interpretability of the learning model and gives us insights into the problem under investigation. To obtain a full understanding of the GTTN, we study its properties. For example, the number of tensor flattenings with distinct matrix trace norms is proved to be $2^{N-1}-1$, and since $N$ is small in most problems we encounter, this number is not so large, making the computational complexity comparable to that of existing tensor trace norms. We also analyze the dual norm of the GTTN and give a generalization bound. Extensive experiments on real-world datasets demonstrate the effectiveness of the proposed GTTN.
2 Existing Tensor Trace Norms
In multitask learning, trace norms are widely used as regularizers to learn a low-rank structure among the model parameters of all the tasks, as minimizing a trace norm enforces some singular values to approach zero. When both a data point and the model parameters of a task are represented in vectorized forms, as in regression tasks or binary classification tasks, the matrix trace norm can be used, and it is defined as $\|W\|_* = \sum_i \sigma_i(W)$,
with each column of the parameter matrix $W$ storing the parameter vector of the corresponding task and $\sigma_i(W)$ denoting the $i$-th largest singular value of $W$. Regularizing with $\|W\|_*$ will make $W$ tend to be low-rank, which leads to linear dependency among the parameter vectors of different tasks and reflects the relatedness among tasks in terms of model parameters.

Nowadays, data such as images can be represented in a matrix or tensor form, either in the raw representation (e.g., a pixel-based representation) or in a transformed representation after, for example, convolutional operations. Moreover, each task becomes more complex, for example, a multiclass classification task. In those cases, the parameters of all the tasks can be organized as an $N$-way tensor $\mathcal{W}$. That is, for multiclass classification tasks, when $N$ equals 3, the first dimension can denote the number of hidden units in the last hidden layer, the second can represent the number of classes, and the third can be the number of tasks. In such cases, the matrix trace norm is no longer applicable, and instead tensor trace norms are investigated.
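As a concrete illustration of the matrix trace norm described above, the following NumPy sketch computes it both via the built-in nuclear norm and as the sum of singular values. The matrix size is hypothetical.

```python
import numpy as np

# Hypothetical parameter matrix: 10 features, 4 tasks (one column per task).
rng = np.random.default_rng(0)
W = rng.standard_normal((10, 4))

# The matrix trace norm (nuclear norm) is the sum of the singular values of W.
trace_norm = np.linalg.norm(W, ord='nuc')
singular_values = np.linalg.svd(W, compute_uv=False)
assert np.isclose(trace_norm, singular_values.sum())
```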
According to (Tomioka and Suzuki, 2013), tensor trace norms can be classified into two categories: overlapped tensor trace norms and latent tensor trace norms. An overlapped tensor trace norm transforms a tensor into matrices in different ways and computes the sum of the matrix trace norms of the transformed matrices. A latent tensor trace norm decomposes the tensor into multiple latent tensors and then computes the sum of the matrix trace norms of matrices transformed from the latent tensors. Deep multitask learning mainly uses overlapped tensor trace norms, which are the focus of our study.
As reviewed in (Yang and Hospedales, 2017b), three tensor trace norms belonging to the overlapped tensor trace norm are used in deep multitask learning, including the Tucker trace norm, TT trace norm, and Last Axis Flattening (LAF) trace norm. In the following, we will review those three tensor trace norms.
2.1 Tucker Trace Norm
Based on the Tucker decomposition (Tucker, 1966), the Tucker trace norm for an $N$-way tensor $\mathcal{W}$ can be defined as
$\|\mathcal{W}\|_{\text{Tucker}} = \sum_{i=1}^{N} \lambda_i \|\mathcal{W}_{(i)}\|_*, \qquad (1)$
where $\mathcal{W}_{(i)}$ is the mode-$i$ tensor flattening that transforms $\mathcal{W}$ to a matrix along the $i$-th axis, and $\lambda_i$ denotes the weight for the mode-$i$ flattening. To control the scale of the norm, the $\lambda_i$'s are required to satisfy $\lambda_i \ge 0$ and $\sum_{i=1}^{N} \lambda_i = 1$. Based on Eq. (1), we can see that the Tucker trace norm is a convex combination of the matrix trace norms of the tensor flattenings along each axis, where $\lambda_i$ controls the importance of the mode-$i$ tensor flattening. Without a priori information, different tensor flattenings are usually assumed to have equal importance by setting each $\lambda_i$ to $\frac{1}{N}$.
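A minimal NumPy sketch of the Tucker trace norm described above, assuming a hypothetical 3-way parameter tensor and equal weights. The mode-$i$ flattening is implemented as "move axis $i$ to the front, then reshape".

```python
import numpy as np

def mode_flatten(W, i):
    """Mode-i flattening: bring axis i to the front and reshape into a matrix."""
    return np.moveaxis(W, i, 0).reshape(W.shape[i], -1)

def tucker_trace_norm(W, lambdas):
    """Convex combination of nuclear norms of the N mode flattenings."""
    return sum(lam * np.linalg.norm(mode_flatten(W, i), ord='nuc')
               for i, lam in enumerate(lambdas))

rng = np.random.default_rng(0)
W = rng.standard_normal((5, 6, 4))        # hypothetical 3-way parameter tensor
lambdas = np.full(W.ndim, 1.0 / W.ndim)   # equal weights summing to one
norm_value = tucker_trace_norm(W, lambdas)
```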
2.2 TT Trace Norm
Based on the tensor-train decomposition (Oseledets, 2011), the TT trace norm for an $N$-way tensor $\mathcal{W}$ can be defined as
$\|\mathcal{W}\|_{\text{TT}} = \sum_{i=1}^{N-1} \lambda_i \|\mathcal{W}_{[i]}\|_*, \qquad (2)$
where $\lambda_i$ denotes a nonnegative weight and, different from the mode-$i$ tensor flattening $\mathcal{W}_{(i)}$, $\mathcal{W}_{[i]}$ unfolds the tensor to a matrix along the first $i$ axes. Similar to the Tucker trace norm, the $\lambda_i$'s are assumed to satisfy $\lambda_i \ge 0$ and $\sum_{i=1}^{N-1} \lambda_i = 1$. Usually each $\lambda_i$ is set by users to $\frac{1}{N-1}$ if there is no additional information about the importance of each term in Eq. (2).
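The sequential unfolding used by the TT trace norm can be sketched in the same style (a hypothetical 3-way tensor again; rows are indexed by the first $i$ axes):

```python
import numpy as np

def tt_flatten(W, i):
    """Unfold W into a matrix whose rows are indexed by the first i axes."""
    rows = int(np.prod(W.shape[:i]))
    return W.reshape(rows, -1)

def tt_trace_norm(W, lambdas):
    """Weighted sum of nuclear norms of the N-1 sequential unfoldings."""
    return sum(lam * np.linalg.norm(tt_flatten(W, i), ord='nuc')
               for i, lam in zip(range(1, W.ndim), lambdas))

rng = np.random.default_rng(0)
W = rng.standard_normal((5, 6, 4))
lambdas = np.full(W.ndim - 1, 1.0 / (W.ndim - 1))
norm_value = tt_trace_norm(W, lambdas)
```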
2.3 LAF Trace Norm
The LAF trace norm for an $N$-way tensor $\mathcal{W}$ can be defined as
$\|\mathcal{W}\|_{\text{LAF}} = \|\mathcal{W}_{(N)}\|_*. \qquad (3)$
The last axis in $\mathcal{W}$ is the task axis, and hence the LAF trace norm is equivalent to placing the matrix trace norm on $\mathcal{W}_{(N)}$, each of whose rows stores the model parameters of one task. Compared with the Tucker trace norm in Eq. (1), the LAF trace norm can be viewed as a special case of the Tucker trace norm where $\lambda_N$ equals 1 and the other $\lambda_i$'s ($i \ne N$) equal 0.
Given a tensor trace norm, the objective function of a deep multitask model can be formulated as follows. (Here, for simplicity, we assume the tensor trace norm regularization is placed on only one parameter tensor $\mathcal{W}$; this formulation can easily be extended to multiple tensors with tensor trace norm regularization.)
$\min_{\Theta} \ \sum_{i=1}^{m} \frac{1}{n_i} \sum_{j=1}^{n_i} \ell\big(f_i(\mathbf{x}_j^i), y_j^i\big) + \gamma \|\mathcal{W}\| \qquad (4)$
where $m$ denotes the number of tasks, $n_i$ denotes the number of training data points in the $i$-th task, $\mathbf{x}_j^i$ denotes the $j$-th data point in the $i$-th task, $y_j^i$ denotes the label of $\mathbf{x}_j^i$, $f_i(\cdot)$ denotes the learning function for the $i$-th task given a deep multitask neural network parameterized by $\Theta$, $\ell(\cdot, \cdot)$ denotes a loss function such as the cross-entropy loss for classification tasks and the square loss for regression tasks, $\mathcal{W}$ denotes the part of $\Theta$ that is regularized by a tensor trace norm, and $\gamma$ is a regularization parameter. In problem (4), the tensor trace norm can be the Tucker trace norm, the TT trace norm, or the LAF trace norm.

3 Generalized Tensor Trace Norm
In this section, we first analyze existing tensor trace norms and then present the proposed generalized tensor trace norm as well as the optimization and generalization bound.
3.1 Analysis on Existing Tensor Trace Norms
As introduced in the previous section, overlapped tensor trace norms rely on different ways of tensor flattening. For example, the Tucker trace norm reshapes the tensor along each axis and the LAF trace norm focuses on the last axis, while the TT trace norm reshapes the tensor by combining the first several axes. Given the physical meaning of each axis, the LAF trace norm only considers the inter-task low-rank structure among tasks, whereas both the Tucker and TT trace norms consider not only the inter-task low-rank structure among tasks but also the intra-task low-rank structure among, for example, features. In this sense, the Tucker and TT trace norms seem to be superior to the LAF trace norm.
For overlapped tensor trace norms like the Tucker and TT trace norms, there are two important issues.
- How to choose the ways of tensor flattening?
- How to determine the importance of different ways of tensor flattening?
For the first issue, the Tucker trace norm chooses to reshape along each axis, while the TT trace norm combines the first several axes together to do the tensor flattening. Different ways of tensor flattening encode beliefs about the existence of low-rank structures in $\mathcal{W}$. So the Tucker trace norm assumes that a low-rank structure exists along each axis, while the TT trace norm assumes that combinations of the first several axes have low-rank structures. However, those models may fail when such assumptions do not hold.
For the second issue, current models usually assume equal importance for the different ways of tensor flattening, which is reflected in the equal values of the $\lambda_i$'s. Intuitively, different ways of tensor flattening should exhibit different degrees of low-rank structure, and hence their weights should differ from each other. In this sense, $\lambda_i$'s with equal values incur suboptimal performance.
3.2 GTTN
To solve the above two issues together, we propose the generalized tensor trace norm.
For the first issue, since for most problems we do not know which ways of tensor flattening are helpful to learn the low-rank structure, we can try all possible ways of tensor flattening. To mathematically define this, we define the tensor flattening $\mathcal{W}_{(S)}$, where $S$ is a nonempty proper subset of $[N]$ (i.e., $\emptyset \ne S \subsetneq [N]$) and $S^c$ denotes the complement of $S$ with respect to $[N]$ (i.e., $S^c = [N] \setminus S$). So $\mathcal{W}_{(S)}$ is a tensor flattening of $\mathcal{W}$ into a matrix with one dimension corresponding to the axis indices in $S$ and the other to the axis indices in $S^c$. When $S$ contains only one element $i$, $\mathcal{W}_{(S)}$ becomes $\mathcal{W}_{(i)}$, the mode-$i$ tensor flattening used in the Tucker trace norm. When $S = \{1, \ldots, i\}$, $\mathcal{W}_{(S)}$ becomes $\mathcal{W}_{[i]}$, which is used in the TT trace norm. Moreover, this new tensor flattening can be viewed as a generalization of $\mathcal{W}_{(i)}$ and $\mathcal{W}_{[i]}$: $S$ can contain more than one element, which is more general than $\mathcal{W}_{(i)}$, and the elements of $S$ need not be successive integers starting from 1, which is more general than $\mathcal{W}_{[i]}$.
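The subset flattening described above can be sketched as follows (a hypothetical 3-way tensor; note that NumPy axes are 0-indexed, while the paper indexes axes from 1): axes in $S$ index the rows, the remaining axes index the columns.

```python
import numpy as np
from itertools import combinations

def subset_flatten(W, S):
    """Flatten W into a matrix: axes in S index the rows, the rest the columns."""
    S = sorted(S)
    rest = [a for a in range(W.ndim) if a not in S]
    rows = int(np.prod([W.shape[a] for a in S]))
    return np.transpose(W, S + rest).reshape(rows, -1)

rng = np.random.default_rng(0)
W = rng.standard_normal((5, 6, 4))

# Enumerate every nonempty proper subset of the axes: 2^3 - 2 = 6 flattenings.
subsets = [S for r in range(1, W.ndim)
           for S in combinations(range(W.ndim), r)]
```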
As we aim to consider all possible ways of tensor flattening, similar to the Tucker and TT trace norms, we define the GTTN as
$\|\mathcal{W}\|_{\text{GTTN}} = \sum_{S} \lambda_S \|\mathcal{W}_{(S)}\|_*, \quad \boldsymbol{\lambda} \in \mathcal{C}, \qquad (5)$
where $S$ is also used as a subscript to index the corresponding weight $\lambda_S$ for $\mathcal{W}_{(S)}$, $\boldsymbol{\lambda}$ denotes the set of $\lambda_S$'s, and $\mathcal{C}$ defines a constraint set for $\boldsymbol{\lambda}$. Then, based on the GTTN, we can solve the first issue to some extent, as it can discover all the low-rank structures by considering all possible ways of tensor flattening with appropriate settings of $\boldsymbol{\lambda}$.
In Figure 1, we show the difference among the Tucker trace norm, TT trace norm, LAF trace norm, and GTTN for a four-way tensor at the top. In the bottom of Figure 1, we can see that there are seven possible tensor flattenings with distinct trace norms. The Tucker trace norm uses the four mode flattenings $\mathcal{W}_{(1)}$, $\mathcal{W}_{(2)}$, $\mathcal{W}_{(3)}$, and $\mathcal{W}_{(4)}$. The TT trace norm relies on $\mathcal{W}_{[1]}$, $\mathcal{W}_{[2]}$, and $\mathcal{W}_{[3]}$. The LAF trace norm only contains $\mathcal{W}_{(4)}$. The calculation of the GTTN is based on all seven tensor flattenings. From this example, we can see that the union of the tensor flattenings used in the Tucker, TT, and LAF trace norms cannot cover all the possible ones, and the GTTN utilizes additional tensor flattenings that combine non-successive axes (e.g., $\mathcal{W}_{(\{1,3\})}$ and $\mathcal{W}_{(\{1,4\})}$). In this sense, the GTTN can discover more low-rank structures than existing tensor trace norms.
For the number of distinct summands on the right-hand side of Eq. (5), we have the following theorem. (All the proofs are given in the appendix.)
Theorem 1
The right-hand side of Eq. (5) has $2^{N-1}-1$ distinct summands.
As shown in the proof of Theorem 1, $\mathcal{W}_{(S)}$ and $\mathcal{W}_{(S^c)}$ are transposes of each other with equal matrix trace norms, so we can eliminate one of them to reduce the computational cost. For notational simplicity, we do not explicitly do the elimination in the formulation, but we do so in computation. In the problems we encounter, $N$ is at most 5, and so the GTTN has at most 15 distinct summands. Hence the number of distinct summands is not so large, making the optimization efficient.
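The counting argument above can be checked numerically: enumerating all nonempty proper subsets and merging each subset with its complement (since the two flattenings are transposes with equal trace norms) leaves $2^{N-1}-1$ distinct flattenings.

```python
from itertools import combinations

def num_distinct_flattenings(N):
    """Count subset flattenings after pairing each S with its complement,
    since W_(S) and W_(S^c) are transposes with equal trace norms."""
    axes = frozenset(range(N))
    pairs = set()
    for r in range(1, N):
        for S in combinations(range(N), r):
            S = frozenset(S)
            pairs.add(frozenset({S, axes - S}))
    return len(pairs)

counts = {N: num_distinct_flattenings(N) for N in range(2, 7)}
```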
Similar to the Tucker and TT trace norms, the GTTN defined in Eq. (5) still faces the second issue. To solve it, we view the $\lambda_S$'s as variables to be optimized, and based on Eq. (5), the objective function of a deep multitask model based on the GTTN is formulated as
$\min_{\Theta, \boldsymbol{\lambda}} \ \sum_{i=1}^{m} \frac{1}{n_i} \sum_{j=1}^{n_i} \ell\big(f_i(\mathbf{x}_j^i), y_j^i\big) + \gamma \sum_{S} \lambda_S \|\mathcal{W}_{(S)}\|_* \quad \text{s.t. } \lambda_S \ge 0, \ \sum_{S} \lambda_S = 1. \qquad (6)$
Compared with problem (4), we can see two differences. Firstly, the regularization terms in the two problems are different. Secondly, problem (6) treats the $\lambda_S$'s as variables to be optimized, while the corresponding entities in problem (4) are constants set by users.
In the following theorem, we show that problem (6) can be simplified by eliminating $\boldsymbol{\lambda}$.
Theorem 2
Problem (6) is equivalent to
$\min_{\Theta} \ \sum_{i=1}^{m} \frac{1}{n_i} \sum_{j=1}^{n_i} \ell\big(f_i(\mathbf{x}_j^i), y_j^i\big) + \gamma \min_{S} \|\mathcal{W}_{(S)}\|_*. \qquad (7)$
According to problem (7), learning $\boldsymbol{\lambda}$ tends to choose the tensor flattening with the minimal matrix trace norm.
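A quick numerical illustration of this consequence of Theorem 2, with hypothetical trace-norm values: any convex combination over the simplex is lower-bounded by the smallest norm, and placing all weight on the minimal-norm flattening attains it.

```python
import numpy as np

# Hypothetical trace norms of three candidate flattenings.
norms = np.array([3.2, 1.7, 2.5])

# Sample random points on the simplex; every convex combination
# is at least the minimum of the norms.
rng = np.random.default_rng(0)
random_lams = rng.dirichlet(np.ones(3), size=1000)
assert np.all(random_lams @ norms >= norms.min() - 1e-12)

# Putting all weight on the minimal-norm flattening attains the minimum.
best = np.zeros_like(norms)
best[norms.argmin()] = 1.0
assert np.isclose(best @ norms, norms.min())
```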
3.3 Optimization
Even though problem (7) is equivalent to problem (6), in numerical optimization we choose problem (6) as the objective function. One reason is that problem (7), which involves the minimum of matrix trace norms, is harder to optimize than problem (6). Another reason is that the $\boldsymbol{\lambda}$ learned in problem (6) can visualize the importance of each tensor flattening, which improves the interpretability of the learning model.
Since problem (6) is designed for deep neural networks, Stochastic Gradient Descent (SGD) techniques are the first choice for optimization. However, problem (6) is a constrained optimization problem, making SGD techniques not directly applicable: the constraints in problem (6) force $\boldsymbol{\lambda}$ to lie on a simplex. To convert problem (6) into an unconstrained problem that can be optimized by SGD, we reparameterize each $\lambda_S$ as $\lambda_S = \frac{\exp\{\beta_S\}}{\sum_{S'} \exp\{\beta_{S'}\}}$. With such a reparameterization, problem (6) can be reformulated as
$\min_{\Theta, \boldsymbol{\beta}} \ \sum_{i=1}^{m} \frac{1}{n_i} \sum_{j=1}^{n_i} \ell\big(f_i(\mathbf{x}_j^i), y_j^i\big) + \gamma \sum_{S} \frac{\exp\{\beta_S\}}{\sum_{S'} \exp\{\beta_{S'}\}} \|\mathcal{W}_{(S)}\|_*. \qquad (8)$
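The softmax-style reparameterization used to remove the simplex constraint can be sketched as follows (a minimal NumPy illustration; the $\beta$ values are hypothetical):

```python
import numpy as np

def softmax(beta):
    """Map unconstrained parameters to positive weights that sum to one."""
    z = np.exp(beta - beta.max())   # subtract the max for numerical stability
    return z / z.sum()

beta = np.array([0.5, -1.2, 2.0])   # hypothetical unconstrained parameters
lam = softmax(beta)
```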
For each parameter in $\Theta$ other than $\mathcal{W}$, the gradient can be computed based on the first term in the objective function of problem (8). For each $\beta_S$, the gradient can be computed as $\gamma \lambda_S \big( \|\mathcal{W}_{(S)}\|_* - \sum_{S'} \lambda_{S'} \|\mathcal{W}_{(S')}\|_* \big)$.
For $\mathcal{W}$, the computation of its gradient comes from both terms in the objective function of problem (8). The first term is the conventional training loss, while the second term involves the matrix trace norm, which is non-differentiable. According to (Watson, 1992), we can compute a subgradient instead: a subgradient of $\|\mathbf{M}\|_*$ is $\mathbf{U}\mathbf{V}^T$, where $\mathbf{M} = \mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^T$ denotes the singular value decomposition of the matrix $\mathbf{M}$.
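The subgradient step above can be sketched in NumPy (a minimal illustration with a hypothetical matrix size, not the authors' implementation). A sanity check: the inner product of the subgradient with the matrix equals the trace norm.

```python
import numpy as np

def trace_norm_subgradient(M):
    """A subgradient of ||M||_* at M is U @ V^T from a thin SVD (Watson, 1992)."""
    U, _, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ Vt

rng = np.random.default_rng(0)
M = rng.standard_normal((6, 4))
G = trace_norm_subgradient(M)

# <G, M> = trace(V U^T U S V^T) = sum of singular values = ||M||_*.
assert np.isclose(np.sum(G * M), np.linalg.norm(M, ord='nuc'))
```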
3.4 Generalization Bound
For the GTTN defined in Eq. (5), we can derive its dual norm in the following theorem.
Theorem 3
The dual norm of the GTTN defined in Eq. (5) is defined as
where $\mathcal{X}^S$ is a variable indexed by $S$ and $\|\cdot\|_{\mathrm{sp}}$ denotes the spectral norm of a matrix, which is equal to its maximum singular value.
Without loss of generality, here we make assumptions that simplify the analysis. We rewrite problem (6) into an equivalent formulation as
(9) 
where $\boldsymbol{\lambda}$ is assumed to be fixed to show its impact on the bound. Here each data point is a tensor and binary classification tasks are considered. (The analysis is easy to extend to regression tasks and multiclass classification tasks.) The learning function for each task is a linear function defined as the inner product between the data tensor and the corresponding slice of $\mathcal{W}$ along the last axis, which is the task axis, where the inner product is taken between two tensors of equal size. For simplicity, different tasks are assumed to have the same number of data points; it is easy to extend our analysis to general settings. The generalization loss for all the tasks is defined as the expected loss under the underlying data distribution of each task, and the empirical loss for all the tasks is the corresponding average over the training data. We assume the loss function takes values in a bounded interval and is Lipschitz with respect to its first input argument. Each training data point is assumed to be bounded in norm. To characterize correlations between features, we assume that the second-moment matrix of the features is dominated, in the positive semidefinite order, by a scaled identity matrix of appropriate size.
For problem (9), we can derive a generalization bound in the following theorem.
Theorem 4
According to Theorem 4, we can see that each $\lambda_S$ weighs the second term, which is related to the model complexity.
4 Experiments
In this section, we conduct empirical studies for the proposed GTTN.
4.1 Experimental Settings
4.1.1 Datasets
ImageCLEF dataset. This dataset contains 12 common categories shared by 4 tasks: Caltech-256, ImageNet ILSVRC 2012, Pascal VOC 2012, and Bing. In total, there are about 2,400 images across all the tasks.
Office-Caltech dataset. This dataset consists of 4 tasks and 2,533 images in total. One task consists of data from the 10 common categories shared with the Caltech-256 dataset, and the other three tasks consist of data from the Office dataset, whose images are collected from 3 distinct domains/tasks: Amazon, Webcam, and DSLR.
Office-31 dataset. This dataset contains 31 categories from Amazon, Webcam, and DSLR. In total, there are 4,110 images across all the tasks.
Office-Home dataset. This dataset contains images from 4 domains/tasks: artistic images, clip art, product images, and real-world images. Each task contains images of 65 object categories collected in office and home settings. There are about 15,500 images in all the tasks.
4.1.2 Baselines
We compare the GTTN method with various competitors, including the deep multitask learning (DMTL) method, where different tasks share the first several layers as a common feature representation; the Tucker trace norm method (denoted by Tucker); the TT trace norm method (denoted by TT); the LAF trace norm method (denoted by LAF); and the LAF Tensor Factorisation method (denoted by LAF-TF) (Yang and Hospedales, 2017a).
4.1.3 Implementation details
We employ the VGG-19 network (Simonyan and Zisserman, 2015) to extract features for image data by using the output of the pool5 layer and the fc7 layer, respectively, for all the models in comparison. If the pool5 layer is used, the extracted feature representation is a 3-way tensor, and all the multitask learning models adopt a five-layer architecture in which three hidden layers transform the input along each mode with the ReLU activation function; they have 6, 6, and 256 hidden units, respectively. Otherwise, if the fc7 layer is used, all the multitask learning models adopt a two-layer fully connected architecture with the ReLU activation function and 1024 hidden units, where the first layer is shared by all the tasks. The architecture used is illustrated in Figure 2.

To see the effect of the training size on the performance, we vary the training proportion from 50% to 70% at an interval of 10%. The performance measure is the classification accuracy. Each experimental setting is repeated 5 times, and we report the average performance as well as the standard deviation. For all the baseline methods, we follow their original model selection procedures. The regularization parameter $\gamma$, which controls the trade-off between the training cross-entropy loss and the regularization term, is set to 0.25 and 0.65, respectively, for all the 6 methods to test the sensitivity of the performance with respect to $\gamma$. In addition, we use Adam with a learning rate that varies with the number of iterations, and we adopt mini-batch training.

4.2 Experimental Results
The experimental results are reported in Figures 3-10 based on different feature extractors (i.e., pool5 or fc7) and different regularization parameters (i.e., 0.25 or 0.65).
Since the output of the fc7 layer is a vectorized representation, the model parameter is a 3-way tensor. In this case, the Tucker trace norm possesses three tensor flattenings, the TT trace norm utilizes two tensor flattenings, and the GTTN also has three tensor flattenings. So in this case, both the GTTN and the Tucker trace norm utilize all the possible tensor flattenings, with the only difference that the GTTN learns the combination coefficients while the Tucker trace norm manually sets them to be identical. According to the results, the GTTN outperforms the Tucker trace norm in most cases, which verifies that learning $\boldsymbol{\lambda}$ is better than fixing it.
When using the pool5 layer as the feature extractor, the feature representation is a 3-way tensor, making the parameter a 5-way tensor. In this case, the GTTN method performs significantly better than the other baseline methods. This is mainly because the GTTN utilizes more tensor flattenings than the other baseline models and hence may discover more low-rank structures.
4.3 Analysis on the Learned $\boldsymbol{\lambda}$
Tables 1 and 2 show the learned $\boldsymbol{\lambda}$ of the GTTN based on the pool5 layer when $\gamma$ takes the value of 0.25 and 0.65, respectively. In this case, the parameter is a 5-way tensor, and hence the GTTN contains 15 distinct flattenings, which correspond to the components of $\boldsymbol{\lambda}$ in Tables 1 and 2. According to the results, we can see that different tensor flattenings have varying weights.
Similarly, Tables 3 and 4 show the learned $\boldsymbol{\lambda}$ of the GTTN based on the fc7 layer when $\gamma$ = 0.25 and $\gamma$ = 0.65, respectively. In this case, the parameter is a 3-way tensor, for which the GTTN contains 3 distinct flattenings. We can notice that the weight of the flattening that combines the first two axes is among the maximum in most settings, which may imply that the combination of the first two axes is very important.
Dataset  ( =0.5 )  ( =0.6 )  ( =0.7 ) 

ImageCLEF  0.0736, 0.0799 , 0.0789, 0.0548,  0.0674, 0.0724, 0.0688, 0.0620,  0.0757, 0.0668, 0.0699, 0.0610, 
0.0724, 0.0780, 0.0592, 0.0741,  0.0691, 0.0799, 0.0630, 0.0823,  0.0683, 0.0819, 0.0608, 0.0718,  
0.0529, 0.0526, 0.0470, 0.0613,  0.0603, 0.0541, 0.0531, 0.0661,  0.0629, 0.0502, 0.0542, 0.0792,  
0.0699, 0.0745, 0.0709  0.0727, 0.0657, 0.0632  0.0610, 0.0678, 0.0686  
OfficeCaltech10  0.0627, 0.0739, 0.0709, 0.0604,  0.0722, 0.0676, 0.0783, 0.0482,  0.0697, 0.0762, 0.0883, 0.0497, 
0.0707, 0.0667, 0.0564, 0.0705,  0.0690, 0.0725, 0.0597, 0.0705,  0.0901, 0.0837, 0.0536, 0.0685,  
0.0610, 0.0564, 0.0476, 0.0876 ,  0.0583, 0.0503, 0.0584, 0.0761,  0.0491, 0.0446, 0.0482, 0.0552,  
0.0723, 0.0767, 0.0663  0.0662, 0.0842, 0.0686  0.0616, 0.0768, 0.0850  
Office31  0.0796, 0.0841, 0.0782, 0.0587,  0.0786, 0.0676, 0.0678, 0.0480,  0.0778, 0.0771, 0.0805, 0.0551, 
0.0771, 0.0617, 0.0577, 0.0725,  0.0702, 0.0843, 0.0544, 0.0815,  0.0746, 0.0761, 0.0554, 0.0794,  
0.0640, 0.0557, 0.0602, 0.0571,  0.0578, 0.0529, 0.0651, 0.0566,  0.0571, 0.0510, 0.0597, 0.0489,  
0.0505, 0.0657, 0.0771  0.0510, 0.0814, 0.0827  0.0628, 0.0705, 0.0737  
OfficeHome  0.0867, 0.0752, 0.0815, 0.0542,  0.0818, 0.0781, 0.0901, 0.0479,  0.0907, 0.0708, 0.0784, 0.0525, 
0.0727, 0.0831, 0.0470, 0.0798,  0.0872, 0.0781, 0.0522, 0.0867,  0.0710, 0.0795, 0.0545, 0.0848,  
0.0550, 0.0538, 0.0810, 0.0467,  0.0446, 0.0451, 0.0818, 0.0439,  0.0517, 0.0508, 0.0744, 0.0564,  
0.0604, 0.0480, 0.0749  0.0438, 0.0548, 0.0838  0.0617, 0.0427, 0.0802 
Dataset  ( =0.5 )  ( =0.6 )  ( =0.7 ) 

ImageCLEF  0.0672, 0.0666, 0.0695, 0.0523,  0.0688, 0.0739, 0.0808, 0.0602,  0.0821, 0.0795, 0.0705, 0.0549, 
0.0712, 0.0690, 0.0670, 0.0791,  0.0687, 0.0680, 0.0563, 0.0726,  0.0741, 0.0787, 0.0528, 0.0682,  
0.0563, 0.0675, 0.0521, 0.0664,  0.0515, 0.0507, 0.0590, 0.0754,  0.0595, 0.0494, 0.0463, 0.0579,  
0.0809, 0.0713, 0.0637  0.0678, 0.0763, 0.0698  0.0704, 0.0743, 0.0814  
OfficeCaltech10  0.0662, 0.0746, 0.0760, 0.0545,  0.0681, 0.0648, 0.0863, 0.0500,  0.0665, 0.0730, 0.0682, 0.0613, 
0.0596, 0.0737, 0.0566, 0.0792,  0.0711, 0.0731, 0.0495, 0.0667,  0.0749, 0.0866, 0.0453, 0.0857,  
0.0600, 0.0618, 0.0564, 0.0646,  0.0518, 0.0528, 0.0604, 0.0722,  0.0566, 0.0492, 0.0505, 0.0750,  
0.0740, 0.0715, 0.0713  0.0721, 0.0768, 0.0841  0.0612, 0.0686, 0.0773  
Office31  0.0874, 0.0772, 0.0910, 0.0562,  0.0833, 0.0806, 0.0811, 0.0571,  0.0680, 0.0736, 0.0788, 0.0574, 
0.0684, 0.0806, 0.0509, 0.0726,  0.0767, 0.0694, 0.0602, 0.0617,  0.0720, 0.0700, 0.0547, 0.0732,  
0.0518, 0.0514, 0.0621, 0.0539,  0.0651, 0.0575, 0.0686, 0.0553,  0.0535, 0.0548, 0.0622, 0.0663,  
0.0557, 0.0642, 0.0767  0.0541, 0.0700, 0.0593  0.0588, 0.0763, 0.0804  
OfficeHome  0.0687, 0.0672, 0.0780, 0.0619,  0.0673, 0.0810, 0.0668, 0.0497,  0.0907, 0.0834, 0.0835, 0.0492, 
0.0731, 0.0786, 0.0480, 0.0798,  0.0820, 0.0791, 0.0492, 0.0892 ,  0.0773, 0.0819, 0.0466, 0.0852,  
0.0523, 0.0572, 0.0749, 0.0633,  0.0589, 0.0517, 0.0819, 0.056,  0.0515, 0.0432, 0.0751, 0.0522,  
0.0591, 0.0651, 0.0730  0.0524, 0.0484, 0.0865  0.0523, 0.0523, 0.0755 
Dataset  ( =0.5 )  ( =0.6 )  ( =0.7 ) 

ImageCLEF  0.3861, 0.2246, 0.3893  0.3825, 0.2336, 0.3839  0.3718, 0.2154, 0.4128 
OfficeCaltech10  0.3911, 0.2246, 0.3843  0.3953, 0.2152, 0.3895  0.3984, 0.2302,0.3714 
Office31  0.3186, 0.2507, 0.4307  0.3041, 0.2787, 0.4170  0.2662, 0.2864, 0.4474 
OfficeHome  0.3162, 0.2750, 0.4088  0.2901, 0.2724, 0.4374  0.3057, 0.2630, 0.4313 
Dataset  ( =0.5 )  ( =0.6 )  ( =0.7 ) 

ImageCLEF  0.2992, 0.2834, 0.4173  0.3216, 0.2753, 0.4029  0.3229, 0.2908, 0.3861 
OfficeCaltech10  0.4052, 0.2244, 0.3704  0.3759, 0.2462, 0.3779  0.3871, 0.2106, 0.4023 
Office31  0.3609, 0.2184, 0.4207  0.3926, 0.2415, 0.3658  0.3279, 0.2399, 0.4322 
OfficeHome  0.2789, 0.2944, 0.4267  0.3113, 0.2618, 0.4269  0.2746, 0.2672, 0.4582 
5 Conclusion
In this paper, we devise a generalized tensor trace norm to capture all the low-rank structures in a parameter tensor used in deep multitask learning and to identify the importance of each structure. We analyze properties of the proposed GTTN, including its dual norm and generalization bound. Empirical studies show that it outperforms state-of-the-art counterparts and that the learned combination coefficients can give us more understanding of the problem studied. As future work, we are interested in extending the idea of the GTTN to study tensor Schatten norms.
References
 Multitask feature learning. In Advances in Neural Information Processing Systems 19, pp. 41–48. Cited by: §1.
 Rademacher and Gaussian complexities: risk bounds and structural results. Journal of Machine Learning Research 3, pp. 463–482. Cited by: Proof for Theorem 4.
 Multitask Gaussian process prediction. In Advances in Neural Information Processing Systems 20, Vancouver, British Columbia, Canada, pp. 153–160. Cited by: §1.
 Multitask learning. Machine Learning 28 (1), pp. 41–75. Cited by: §1.
 Learning incoherent sparse and low-rank patterns from multiple tasks. In Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Washington, DC, USA, pp. 1179–1188. Cited by: §1.
 Learning multilevel task groups in multitask learning. In Proceedings of the 29th AAAI Conference on Artificial Intelligence, Cited by: §1.
 Learning tree structure in multitask learning. In Proceedings of the 21st ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Cited by: §1.
 Multistage multitask learning with reduced rank. In Proceedings of the 30th AAAI Conference on Artificial Intelligence, Cited by: §1.
 Clustered multitask learning: a convex formulation. In Advances in Neural Information Processing Systems 21, pp. 745–752. Cited by: §1.
 A dirty model for multitask learning. In Advances in Neural Information Processing Systems 23, Vancouver, British Columbia, Canada, pp. 964–972. Cited by: §1.
 Learning task grouping and overlap in multitask learning. In Proceedings of the 29 th International Conference on Machine Learning, Edinburgh, Scotland, UK. Cited by: §1.
 Blockwise coordinate descent procedures for the multitask lasso, with applications to neural semantic basis discovery. In Proceedings of the 26th Annual International Conference on Machine Learning, Cited by: §1.
 Learning multiple tasks with multilinear relationship networks. In Advances in Neural Information Processing Systems 30, pp. 1593–1602. Cited by: §1.
 Multilevel lasso for sparse multitask regression. In Proceedings of the 29th International Conference on Machine Learning, Edinburgh, Scotland, UK. Cited by: §1.
 Cross-stitch networks for multitask learning. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, pp. 3994–4003. Cited by: §1.
 Multitask feature selection. Technical report, Department of Statistics, University of California, Berkeley. Cited by: §1.
 Tensor-train decomposition. SIAM Journal on Scientific Computing 33 (5), pp. 2295–2317. Cited by: §1, §2.2.
 Trace norm regularization: reformulations, algorithms, and multitask learning. SIAM Journal on Optimization 20 (6), pp. 3465–3489. Cited by: §1.
 Multilinear multitask learning. In Proceedings of the 30th International Conference on Machine Learning, pp. 1444–1452. Cited by: §1, §2.1.
 Very deep convolutional networks for largescale image recognition. In Proceedings of the 3rd International Conference on Learning Representations, Cited by: §4.1.3.
 Convex tensor decomposition via structured schatten norm regularization. In Advances in Neural Information Processing Systems 26, pp. 1331–1339. Cited by: §2.
 Userfriendly tail bounds for sums of random matrices. Foundations of Computational Mathematics 12 (4), pp. 389–434. Cited by: Proof for Theorem 4.
 Some mathematical notes on threemode factor analysis. Psychometrika 31 (3), pp. 279–311. Cited by: §1, §2.1.
 Characterization of the subdifferential of some matrix norms. Linear Algebra and its Applications 170, pp. 33–45. Cited by: §3.3.
 Multitask learning meets tensor factorization: task imputation via convex optimization. In Advances in Neural Information Processing Systems 27, pp. 2825–2833. Cited by: §1, §2.1.
 Multitask learning for classification with Dirichlet process priors. Journal of Machine Learning Research 8, pp. 35–63. Cited by: §1.
 Deep multitask representation learning: A tensor factorisation approach. In Proceedings of the 6th International Conference on Learning Representations, Cited by: §4.1.2.
 Trace norm regularised deep multitask learning. In Workshop Track Proceedings of the 5th International Conference on Learning Representations, Cited by: §1, §1, §2.
 A convex formulation for learning task relationships in multitask learning. In Proceedings of the 26th Conference on Uncertainty in Artificial Intelligence, pp. 733–742. Cited by: §1.
 Learning to multitask. In Advances in Neural Information Processing Systems 31, pp. 5776–5787. Cited by: §1.
 A survey on multitask learning. CoRR abs/1707.08114. Cited by: §1.
 Hierarchical regularization cascade for joint learning. In Proceedings of the 30th International Conference on Machine Learning, Atlanta, GA, USA, pp. 37–45. Cited by: §1.
Appendix
Proof for Theorem 1
Proof. For a valid , it is required that and should not be empty, implying that and . So the total number of valid summands on the right-hand side of Eq. (5) is . Based on the definition of , we can see that is equal to the transpose , making . So for , there will always be an equivalent , leading to distinct summands on the right-hand side of Eq. (5).
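Although the inline mathematics was lost in extraction, the counting argument here is combinatorial: mode subsets are paired with their complements, which index transposed (hence identical) terms. The following is a minimal sketch of this style of count, under the hypothetical assumption (Eq. (5) is not reproduced here) that the summands are indexed by nonempty proper subsets of the N tensor modes:

```python
from itertools import combinations

def count_summands(n_modes):
    """Count nonempty proper subsets of the modes, then collapse each
    subset and its complement (a transposed matricization of the same
    parameter tensor) into one distinct summand."""
    modes = frozenset(range(n_modes))
    subsets = [frozenset(s)
               for r in range(1, n_modes)
               for s in combinations(modes, r)]
    # A subset and its complement form one unordered pair.
    pairs = {frozenset((s, modes - s)) for s in subsets}
    return len(subsets), len(pairs)

# 2^N - 2 valid subsets collapse to 2^(N-1) - 1 distinct summands.
for n in (2, 3, 4, 5):
    total, distinct = count_summands(n)
    assert total == 2**n - 2
    assert distinct == 2**(n - 1) - 1
```

The halving comes from the transpose pairing: each unordered pair {S, complement of S} contributes a single distinct term.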
Proof for Theorem 2
Proof. Based on Eq. (5), we rewrite problem (6) as
which is equivalent to
So we just need to prove that
The optimization problem on the left-hand side of the above equation is a linear programming problem with respect to . It is easy to show that for , where the equality holds when the corresponding coefficient for equals 1 and the other coefficients equal 0. Then we reach the conclusion.
Proof for Theorem 3
Proof. We define a linear operator , where denotes the column-wise concatenation of a matrix and denotes the set of successive integers from to . We define the norm as
where denotes the inverse vectorization of a subvector of into a matrix where and transforms an index into a subset of . Based on the definition of the dual norm, we have
where denotes the inner product between two tensors of equal size. Since this maximization problem satisfies the Slater condition, strong duality holds. Thus, by the Fenchel duality theorem, we have
where is an indicator function of condition and it outputs 0 when is true and otherwise . Since the dual norm of the trace norm is the spectral norm, we reach the conclusion.
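Two standard facts underlie this last step and can be stated explicitly (in generic notation, since the paper's own symbols were lost in extraction): the dual norm is defined by an inner-product supremum, and the dual of the matrix trace norm is the spectral norm,

```latex
\|Z\|^{\dagger} \;=\; \sup_{\|X\| \le 1} \langle X, Z \rangle,
\qquad
\Big(\textstyle\sum_i \sigma_i(\cdot)\Big)^{\dagger} \;=\; \sigma_{\max}(\cdot),
```

which follows from von Neumann's trace inequality, $\langle A, B \rangle \le \sum_i \sigma_i(A)\,\sigma_i(B) \le \|A\|_{\mathrm{tr}}\,\|B\|_{\mathrm{sp}}$.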
Proof for Theorem 4
Before presenting the proof for Theorem 4, we first prove the following theorem.
Theorem 5
, a Rademacher variable, is a uniform valued random variable, and is a tensor with , where equals . Then we have
where , is an absolute constant.
Proof. We define . According to Theorem 3, we have
Since for each we can make equal to , we have
which implies that
So we can get
Based on Theorem 6.1 in (Tropp, 2012), we can upper-bound each expectation as
where is a zero tensor with only the th slice along the last axis equal to , needs to satisfy , and
As the Frobenius norm of a matrix is at least as large as its spectral norm, we simply set . For , we have
implying that
Similarly, we have
where denotes the trace of a matrix and converts a vector or scalar to a diagonal matrix. This inequality implies
By combining the above inequalities, we reach the conclusion.
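The concentration step invoked above can be sanity-checked numerically. The following is a minimal sketch, not the paper's construction: it assumes the expectation form of the bounded matrix Bernstein inequality from (Tropp, 2012), E‖Σₖ Xₖ‖ ≤ sqrt(2σ² log(d₁+d₂)) + (R/3) log(d₁+d₂), applied to a Rademacher series Xₖ = εₖ Aₖ with fixed matrices Aₖ:

```python
import numpy as np

rng = np.random.default_rng(0)
d1, d2, n = 6, 8, 50

# Fixed matrices A_k; the random summands X_k = eps_k * A_k are
# zero-mean with spectral norm ||X_k|| <= R.
A = rng.normal(size=(n, d1, d2)) / np.sqrt(n)
R = max(np.linalg.norm(A[k], 2) for k in range(n))

# Variance parameter: sigma^2 = max(||sum A A^T||, ||sum A^T A||).
sigma2 = max(
    np.linalg.norm(sum(A[k] @ A[k].T for k in range(n)), 2),
    np.linalg.norm(sum(A[k].T @ A[k] for k in range(n)), 2),
)
d = d1 + d2
bernstein = np.sqrt(2 * sigma2 * np.log(d)) + R * np.log(d) / 3

# Monte Carlo estimate of E || sum_k eps_k A_k ||.
trials = 2000
est = np.mean([
    np.linalg.norm(
        (rng.choice([-1.0, 1.0], size=n)[:, None, None] * A).sum(0), 2)
    for _ in range(trials)
])
assert est <= bernstein  # the expectation bound holds, with room to spare
```

The dimensional factor log(d₁+d₂) is what produces the logarithmic dependence on the tensor dimensions in the final bound.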
Then we can prove Theorem 4 as follows.
Proof. By following (Bartlett and Mendelson, 2002), we have
When each pair of the training data changes, the random variable can change by no more than due to the boundedness of the loss function . Then by McDiarmid’s inequality, we can get
where denotes the probability and . This inequality implies that with probability at least ,
Based on the property of the Rademacher complexity, we have
Then based on the definition of and Hölder’s inequality, we have
By combining the above inequalities, with probability at least , we have
Then by incorporating Theorem 5 into this inequality, we reach the conclusion.
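The bounded-difference (McDiarmid) step used in this proof can be illustrated on a toy function. The following is a minimal sketch, assuming a hypothetical setup in which the statistic is the sample mean of n i.i.d. uniform variables, so changing any single sample moves it by at most cᵢ = 1/n:

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials, t = 200, 20000, 0.05

# f = sample mean of n uniforms on [0, 1]; bounded differences c_i = 1/n.
c = np.full(n, 1.0 / n)
mcdiarmid = 2 * np.exp(-2 * t**2 / np.sum(c**2))  # two-sided tail bound

samples = rng.uniform(size=(trials, n))
deviations = np.abs(samples.mean(axis=1) - 0.5)   # E[f] = 1/2
empirical_tail = np.mean(deviations >= t)
assert empirical_tail <= mcdiarmid  # bound 2*exp(-1) vs a much smaller tail
```

Inverting the same exponential bound for a target failure probability is exactly how the "with probability at least ..." statements in the proof are obtained.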