# Deep Multi-Task Learning via Generalized Tensor Trace Norm

The trace norm is widely used in multi-task learning because it can discover low-rank structures among tasks in terms of model parameters. With the emergence of large datasets and the popularity of deep learning techniques, tensor trace norms have been used for deep multi-task models. However, existing tensor trace norms cannot discover all the low-rank structures, and they require users to manually determine the importance of their components. To solve these two issues together, in this paper we propose a Generalized Tensor Trace Norm (GTTN). The GTTN is defined as a convex combination of matrix trace norms of all possible tensor flattenings and hence it can discover all the possible low-rank structures. In the induced objective function, we learn the combination coefficients in the GTTN to automatically determine the importance. Experiments on real-world datasets demonstrate the effectiveness of the proposed GTTN.

## 1 Introduction

Given multiple related learning tasks, multi-task learning (Caruana, 1997; Zhang and Yang, 2017) aims to exploit useful information contained in them to help improve the performance of all the tasks. Multi-task learning has been applied to many application areas, including computer vision, natural language processing, and speech recognition. Over the past decades, many multi-task learning models have been devised to learn such useful information shared by all the tasks. As reviewed in (Zhang and Yang, 2017), multi-task learning models can be categorized into six classes: the feature transformation approach (Argyriou et al., 2006; Misra et al., 2016), feature selection approach (Obozinski et al., 2006; Liu et al., 2009; Lozano and Swirszcz, 2012), low-rank approach (Pong et al., 2010; Han and Zhang, 2016; Yang and Hospedales, 2017b), task clustering approach (Xue et al., 2007; Jacob et al., 2008; Kumar and III, 2012; Han and Zhang, 2015a), task relation learning approach (Bonilla et al., 2007; Zhang and Yeung, 2010; Long et al., 2017; Zhang et al., 2018), and decomposition approach (Jalali et al., 2010; Chen et al., 2010; Zweig and Weinshall, 2013; Han and Zhang, 2015b).

Among those approaches, the low-rank approach is effective at identifying low-rank model parameters. When the model parameters of a task can be organized in a vector, as in binary classification or regression tasks on vectorized data, the matrix trace norm or one of its variants is used as a regularizer on the parameter matrix, each of whose columns stores the parameters of a task, to identify the low-rank structure among tasks. Nowadays, with the collection of complex data and the popularity of deep learning techniques, each data point can be represented as a tensor (e.g., an image) and each learning task becomes more complex, e.g., a multi-class classification task. In this case, the parameters of all the tasks are stored in a tensor, making the matrix trace norm inapplicable, and instead tensor trace norms (Romera-Paredes et al., 2013; Wimalawarne et al., 2014; Yang and Hospedales, 2017b) are used to learn low-rank parameters in the parameter tensor for multi-task learning.

Unlike the matrix trace norm, which has a unique definition, the tensor trace norm has many variants because the tensor rank has multiple definitions. Here we focus on overlapped tensor trace norms, each of which equals the sum of the matrix trace norms of several flattenings of the tensor. An overlapped tensor trace norm depends on how the tensor flattening is done. For example, the Tucker trace norm (Tucker, 1966) flattens the tensor along each axis, and the Tensor-Train (TT) trace norm (Oseledets, 2011) flattens it along successive axes starting from the first one. There are two limitations in the existing tensor trace norms. Firstly, for a $p$-way tensor, there are $2^{p-1}-1$ distinct tensor flattenings, but existing overlapped tensor trace norms utilize only a subset of them, so they fail to capture all the low-rank structures in the parameter tensor. Secondly, all the tensor flattenings used in a tensor trace norm are assumed to be equally important, which is suboptimal for performance.

In this paper, to overcome the two aforementioned limitations of existing overlapped tensor trace norms, we propose a Generalized Tensor Trace Norm (GTTN). The GTTN exploits all possible tensor flattenings and is defined as a convex combination of the matrix trace norms of all possible tensor flattenings. In this way, the GTTN can capture all the low-rank structures in the parameter tensor and hence overcome the first limitation. Moreover, to alleviate the second limitation, we treat the combination coefficients in the GTTN as variables and propose an objective function to learn them from data. Another advantage of learning the combination coefficients is that they can show the importance of some axes, which can improve the interpretability of the learning model and give us insights into the problem under investigation. To obtain a full understanding of the GTTN, we study its properties. For example, the number of tensor flattenings with distinct matrix trace norms is proved to be $2^{p-1}-1$, so for the small values of $p$ (at most 5) encountered in most problems, this number is small enough that the computational complexity is comparable to that of existing tensor trace norms. We also analyze the dual norm of the GTTN and give a generalization bound. Extensive experiments on real-world datasets demonstrate the effectiveness of the proposed GTTN.

## 2 Existing Tensor Trace Norms

In multi-task learning, trace norms are widely used as regularizers to learn a low-rank structure among the model parameters of all the tasks, since minimizing the trace norm drives some singular values toward zero. When both a data point and the model parameters of a task are represented in vectorized form, as in regression or binary classification tasks, the matrix trace norm can be used. It is defined as

$$\|\mathbf{W}\|_* = \sum_{i} \sigma_i(\mathbf{W}),$$

with each column of the parameter matrix $\mathbf{W}$ storing the parameter vector of the corresponding task and $\sigma_i(\mathbf{W})$ denoting the $i$th largest singular value of $\mathbf{W}$. Regularizing with $\|\mathbf{W}\|_*$ makes $\mathbf{W}$ tend to be low-rank, which leads to linear dependency among the parameter vectors of different tasks and reflects the relatedness among tasks in terms of model parameters.
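As a small illustration of the definition above, the matrix trace norm can be computed from a singular value decomposition. The sketch below uses NumPy; the function name `matrix_trace_norm` and the toy matrix are assumptions made for this example and do not come from the paper.

```python
import numpy as np

def matrix_trace_norm(W):
    """Matrix trace (nuclear) norm: the sum of the singular values of W."""
    singular_values = np.linalg.svd(W, compute_uv=False)
    return float(singular_values.sum())

# Hypothetical toy parameter matrix: each column is one task's parameter vector,
# and all three columns are multiples of the same direction, so the rank is 1.
shared_direction = np.array([1.0, 2.0, 2.0])
task_scales = np.array([1.0, -1.0, 2.0])
W_low_rank = np.outer(shared_direction, task_scales)

rank = np.linalg.matrix_rank(W_low_rank)
trace_norm = matrix_trace_norm(W_low_rank)
print(rank, trace_norm)
```

For a rank-one matrix the trace norm reduces to the product of the Euclidean norms of the two factors, which makes the toy case easy to check by hand.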

Nowadays, data such as images can be represented in matrix or tensor form, both in the raw representation (e.g., the pixel-based representation) and in transformed representations obtained after, for example, convolutional operations. Moreover, each task becomes more complex, for example, a multi-class classification task. In those cases, the parameters of all the tasks can be organized as a $p$-way tensor ($p \ge 3$), e.g., $\mathcal{W} \in \mathbb{R}^{d_1 \times \cdots \times d_p}$. That is, for multi-class classification tasks, when $p$ equals 3, $d_1$ denotes the number of hidden units in the last hidden layer, $d_2$ can represent the number of classes, and $d_3$ can be the number of tasks. In such cases, the matrix trace norm is no longer applicable and instead tensor trace norms are investigated.

According to (Tomioka and Suzuki, 2013), tensor trace norms can be classified into two categories: overlapped tensor trace norms and latent tensor trace norms. An overlapped tensor trace norm transforms a tensor into matrices in different ways and computes the sum of the matrix trace norms of the transformed matrices. A latent tensor trace norm decomposes the tensor into multiple latent tensors and then computes the sum of the matrix trace norms of the matrices transformed from the latent tensors. Deep multi-task learning mainly uses the overlapped tensor trace norm, which is the focus of our study.

As reviewed in (Yang and Hospedales, 2017b), three tensor trace norms belonging to the overlapped tensor trace norm are used in deep multi-task learning, including the Tucker trace norm, TT trace norm, and Last Axis Flattening (LAF) trace norm. In the following, we will review those three tensor trace norms.

### 2.1 Tucker Trace Norm

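Based on the Tucker decomposition (Tucker, 1966), the Tucker trace norm for a tensor $\mathcal{W} \in \mathbb{R}^{d_1 \times \cdots \times d_p}$ can be defined as

$$|||\mathcal{W}|||_* = \sum_{i=1}^{p} \alpha_i \|\mathcal{W}_{(i)}\|_*, \tag{1}$$

where $[p]$ denotes the set of positive integers no larger than $p$, $\mathrm{permute}(\mathcal{W}, \mathbf{s})$ permutes the tensor $\mathcal{W}$ along the axis indices in $\mathbf{s}$, which is a permutation of $[p]$, $\mathrm{reshape}(\mathcal{W}, \mathbf{a})$ reshapes the tensor $\mathcal{W}$ to the new size stored in a vector $\mathbf{a}$, $\mathcal{W}_{(i)} = \mathrm{reshape}(\mathrm{permute}(\mathcal{W}, [i, 1, \ldots, i-1, i+1, \ldots, p]), [d_i, \prod_{j \neq i} d_j])$ is the mode-$i$ tensor flattening that transforms $\mathcal{W}$ into a matrix along the $i$th axis, and $\alpha_i$ denotes the weight for the mode-$i$ flattening. To control the scale of the weights, the $\alpha_i$ are required to satisfy $\alpha_i \ge 0$ and $\sum_{i=1}^{p} \alpha_i = 1$. Based on Eq. (1), the Tucker trace norm is a convex combination of the matrix trace norms of the tensor flattenings along each axis, where $\alpha_i$ controls the importance of the mode-$i$ tensor flattening. Without a priori information, different tensor flattenings are usually assumed to have equal importance by setting each $\alpha_i$ to $1/p$.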

Besides being used in deep multi-task learning, the Tucker trace norm has been adopted in multilinear multi-task learning (Romera-Paredes et al., 2013; Wimalawarne et al., 2014) which assumes the existence of multi-modal structures contained in multi-task learning problems.
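The Tucker trace norm of Eq. (1) can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: axes are 0-based, the helper names `mode_flatten` and `tucker_trace_norm` are hypothetical, and the tensor sizes are made up.

```python
import numpy as np

def mode_flatten(W, i):
    """Mode-i flattening (0-based axis i): shape (d_i, product of the other d_j)."""
    return np.moveaxis(W, i, 0).reshape(W.shape[i], -1)

def tucker_trace_norm(W, alpha=None):
    """Convex combination of the trace norms of all p mode-i flattenings."""
    p = W.ndim
    if alpha is None:
        alpha = np.full(p, 1.0 / p)  # equal importance: alpha_i = 1/p
    alpha = np.asarray(alpha, dtype=float)
    if np.any(alpha < 0) or not np.isclose(alpha.sum(), 1.0):
        raise ValueError("alpha must be nonnegative and sum to one")
    norms = [np.linalg.norm(mode_flatten(W, i), "nuc") for i in range(p)]
    return float(alpha @ norms)

rng = np.random.default_rng(0)
W = rng.standard_normal((4, 3, 2))  # hypothetical sizes: hidden units x classes x tasks
uniform_value = tucker_trace_norm(W)
last_axis_only = tucker_trace_norm(W, alpha=[0.0, 0.0, 1.0])
```

Putting all the weight on the last axis, as in `last_axis_only`, leaves only the flattening along the task axis.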

### 2.2 TT Trace Norm

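Based on the tensor-train decomposition (Oseledets, 2011), the TT trace norm for a tensor $\mathcal{W}$ can be defined as

$$|||\mathcal{W}|||_* = \sum_{i=1}^{p-1} \alpha_i \|\mathcal{W}_{[i]}\|_*, \tag{2}$$

where $\mathcal{W}_{[i]} = \mathrm{reshape}(\mathcal{W}, [\prod_{j=1}^{i} d_j, \prod_{j=i+1}^{p} d_j])$ and $\alpha_i$ denotes a nonnegative weight. Different from the mode-$i$ tensor flattening $\mathcal{W}_{(i)}$, $\mathcal{W}_{[i]}$ unfolds the tensor to a matrix along the first $i$ axes. Similar to the Tucker trace norm, the $\alpha_i$ are assumed to satisfy $\alpha_i \ge 0$ and $\sum_{i=1}^{p-1} \alpha_i = 1$. Usually each $\alpha_i$ is set by users to $1/(p-1)$ if there is no additional information about the importance of each term in Eq. (2).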
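For comparison with the mode-wise flattening, the TT flattening only merges leading axes, so a plain reshape is enough. The following NumPy sketch is illustrative; `tt_flatten` and `tt_trace_norm` are hypothetical helper names and the tensor sizes are arbitrary.

```python
import numpy as np

def tt_flatten(W, i):
    """TT flattening: the first i axes index the rows, the rest the columns (1 <= i <= p-1)."""
    rows = int(np.prod(W.shape[:i]))
    return W.reshape(rows, -1)

def tt_trace_norm(W, alpha=None):
    """Convex combination of the trace norms of the p-1 TT flattenings."""
    p = W.ndim
    if alpha is None:
        alpha = np.full(p - 1, 1.0 / (p - 1))  # equal importance: alpha_i = 1/(p-1)
    alpha = np.asarray(alpha, dtype=float)
    norms = [np.linalg.norm(tt_flatten(W, i), "nuc") for i in range(1, p)]
    return float(alpha @ norms)

rng = np.random.default_rng(0)
W = rng.standard_normal((2, 3, 4, 5))  # hypothetical four-way tensor
shapes = [tt_flatten(W, i).shape for i in range(1, W.ndim)]
tt_value = tt_trace_norm(W)
```

For the four-way example, `shapes` lists the three matrices of sizes 2 x 60, 6 x 20, and 24 x 5. The last one is the transpose of the flattening along the final axis, so it has the same trace norm.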

### 2.3 LAF Trace Norm

The LAF trace norm for a tensor $\mathcal{W}$ can be defined as

$$|||\mathcal{W}|||_* = \|\mathcal{W}_{(p)}\|_*. \tag{3}$$

The last axis in $\mathcal{W}$ is the task axis and hence the LAF trace norm is equivalent to placing the matrix trace norm on $\mathcal{W}_{(p)}$, each of whose rows stores the model parameters of one task. Compared with the Tucker trace norm in Eq. (1), the LAF trace norm can be viewed as a special case of the Tucker trace norm where $\alpha_p$ equals 1 and the other $\alpha_i$'s ($i \neq p$) are equal to 0.

Given a tensor trace norm, the objective function of a deep multi-task model can be formulated as

$$\min_{\Theta} \sum_{i=1}^{m} \frac{1}{n_i} \sum_{j=1}^{n_i} l\big(f_i(\mathbf{x}^i_j; \Theta), y^i_j\big) + \lambda |||\mathcal{W}|||_*, \tag{4}$$

where $m$ denotes the number of tasks, $n_i$ denotes the number of training data points in the $i$th task, $\mathbf{x}^i_j$ denotes the $j$th data point in the $i$th task, $y^i_j$ denotes the label of $\mathbf{x}^i_j$, $f_i(\cdot; \Theta)$ denotes a learning function for the $i$th task, $l(\cdot, \cdot)$ denotes a loss function such as the cross-entropy loss for classification tasks and the square loss for regression tasks, $\mathcal{W}$ denotes a part of $\Theta$ that is regularized by a tensor trace norm, and $\lambda$ is a regularization parameter. In problem (4), the tensor trace norm can be the Tucker trace norm, the TT trace norm, or the LAF trace norm. For simplicity, we assume here that the tensor trace norm regularization is placed on only one $\mathcal{W}$; the formulation can easily be extended to multiple $\mathcal{W}$'s with the tensor trace norm regularization.

## 3 Generalized Tensor Trace Norm

In this section, we first analyze existing tensor trace norms and then present the proposed generalized tensor trace norm as well as the optimization and generalization bound.

### 3.1 Analysis on Existing Tensor Trace Norms

As introduced in the previous section, overlapped tensor trace norms rely on different ways of tensor flattening. For example, the Tucker trace norm reshapes the tensor along each axis and the LAF trace norm focuses on the last axis, while the TT trace norm reshapes the tensor by combining the first several axes. Given the physical meaning of each axis, the LAF trace norm considers only the inter-task low-rank structure, whereas both the Tucker and TT trace norms consider not only the inter-task low-rank structure but also the intra-task low-rank structure among, for example, features. In this sense, the Tucker and TT trace norms seem to be superior to the LAF trace norm.

For overlapped tensor trace norms like the Tucker and TT trace norms, there are two important issues.

1. How to choose the ways of tensor flattening?

2. How to determine the importance of different ways of tensor flattening?

For the first issue, the Tucker trace norm chooses to reshape along each axis, while the TT trace norm combines the first several axes together to do the tensor flattening. Different ways of tensor flattening encode different beliefs about where the low-rank structure exists in $\mathcal{W}$. The Tucker trace norm assumes that the low-rank structure exists along each axis, while the TT trace norm assumes that combinations of the first several axes have a low-rank structure. However, those models may fail when such assumptions do not hold.

For the second issue, current models usually assume that different ways of tensor flattening are equally important, which is reflected in the equal values of the weights $\alpha_i$. Intuitively, different ways of tensor flattening should exhibit the low-rank structure to different degrees, and hence the $\alpha_i$ should differ from each other. In this sense, using equal values of $\alpha_i$ incurs suboptimal performance.

### 3.2 GTTN

To solve the above two issues together, we propose the generalized tensor trace norm.

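For the first issue, since for most problems we do not know which ways of tensor flattening are helpful for learning the low-rank structure, we can try all possible ways of tensor flattening. To state this mathematically, we define $\mathcal{W}_{\{s\}}$ as

$$\mathcal{W}_{\{s\}} = \mathrm{reshape}\Big(\mathrm{permute}(\mathcal{W}, [s, \neg s]), \Big[\prod_{i \in s} d_i, \prod_{j \in \neg s} d_j\Big]\Big),$$

where $s$ is a nonempty proper subset of $[p]$ (i.e., $s \subset [p]$, $s \neq \emptyset$) and $\neg s$ denotes the complement of $s$ with respect to $[p]$ (i.e., $\neg s = [p] \setminus s$). So $\mathcal{W}_{\{s\}}$ is a tensor flattening to a matrix with one dimension corresponding to the axis indices in $s$ and the other to the axis indices in $\neg s$. When $s$ contains only one element $i$, $\mathcal{W}_{\{s\}}$ becomes $\mathcal{W}_{(i)}$, the mode-$i$ tensor flattening used in the Tucker trace norm. When $s = [i]$, $\mathcal{W}_{\{s\}}$ becomes $\mathcal{W}_{[i]}$, which is used in the TT trace norm. Moreover, this new tensor flattening can be viewed as a generalization of both $\mathcal{W}_{(i)}$ and $\mathcal{W}_{[i]}$: $s$ can contain more than one element, which is more general than $\mathcal{W}_{(i)}$, and the elements of $s$ need not be successive integers starting from 1, which is more general than $\mathcal{W}_{[i]}$.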
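The generalized flattening above maps directly onto a transpose followed by a reshape. The NumPy sketch below is one possible rendering, with 0-based axes; the name `flatten_s` and the example tensor are assumptions for illustration.

```python
import numpy as np

def flatten_s(W, s):
    """Generalized flattening: rows indexed by the axes in s, columns by the remaining axes."""
    s = list(s)
    rest = [axis for axis in range(W.ndim) if axis not in s]
    if not s or not rest:
        raise ValueError("s must be a nonempty proper subset of the axes")
    rows = int(np.prod([W.shape[axis] for axis in s]))
    # permute the axes to [s, complement of s], then merge each group into one dimension
    return np.transpose(W, s + rest).reshape(rows, -1)

rng = np.random.default_rng(0)
W = rng.standard_normal((2, 3, 4, 5))  # hypothetical four-way tensor

single_axis = flatten_s(W, [1])      # a mode-wise flattening, as in the Tucker trace norm
leading_axes = flatten_s(W, [0, 1])  # the first two axes merged, as in the TT trace norm
new_pattern = flatten_s(W, [1, 3])   # a pattern used by neither Tucker nor TT
```

The third call shows the extra generality: the axes in `s` are neither a single axis nor a leading block of axes.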

As we aim to consider all possible ways of tensor flattening, similar to the Tucker and TT trace norms, we define the GTTN as

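$$|||\mathcal{W}|||_* = \sum_{s \subset [p],\, s \neq \emptyset} \alpha_s \|\mathcal{W}_{\{s\}}\|_*, \tag{5}$$

where $s$ is also used as a subscript to index the corresponding weight $\alpha_s$ for $\mathcal{W}_{\{s\}}$, $\boldsymbol{\alpha}$ denotes the set of $\alpha_s$'s, and $\mathcal{C}_{\alpha} = \{\boldsymbol{\alpha} \mid \alpha_s \ge 0, \sum_{s} \alpha_s = 1\}$ defines a constraint set for $\boldsymbol{\alpha}$. Based on the GTTN, we can solve the first issue to some extent, as it can discover all the low-rank structures by considering all possible ways of tensor flattening with appropriate settings of $\boldsymbol{\alpha}$.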
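To make Eq. (5) concrete, the sketch below evaluates the weighted sum over every nonempty proper subset of axes for a small random tensor. It is an illustration only: the weights are stored in a plain dictionary keyed by 0-based axis tuples, and all names (`flatten_s`, `nonempty_proper_subsets`, `gttn`) are hypothetical.

```python
import itertools
import numpy as np

def flatten_s(W, s):
    """Generalized flattening: rows indexed by the axes in s, columns by the remaining axes."""
    s = list(s)
    rest = [axis for axis in range(W.ndim) if axis not in s]
    rows = int(np.prod([W.shape[axis] for axis in s]))
    return np.transpose(W, s + rest).reshape(rows, -1)

def nonempty_proper_subsets(p):
    """All axis subsets s with s nonempty and s different from the full axis set."""
    for size in range(1, p):
        yield from itertools.combinations(range(p), size)

def gttn(W, alpha):
    """Weighted sum of trace norms; alpha maps each subset to a weight on the simplex."""
    weights = np.array(list(alpha.values()), dtype=float)
    if np.any(weights < 0) or not np.isclose(weights.sum(), 1.0):
        raise ValueError("weights must be nonnegative and sum to one")
    return float(sum(a * np.linalg.norm(flatten_s(W, s), "nuc") for s, a in alpha.items()))

rng = np.random.default_rng(0)
W = rng.standard_normal((2, 3, 4))  # hypothetical three-way tensor
subsets = list(nonempty_proper_subsets(W.ndim))
uniform = {s: 1.0 / len(subsets) for s in subsets}
gttn_uniform = gttn(W, uniform)
one_hot = {s: float(s == (2,)) for s in subsets}  # all weight on the last axis
gttn_last_axis = gttn(W, one_hot)
```

With a one-hot weight vector the value collapses to a single matrix trace norm, which is how the special cases discussed in Section 2 arise from suitable choices of the weights.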

In Figure 1, we show the difference among the Tucker trace norm, TT trace norm, LAF trace norm, and GTTN for a four-way tensor, which is depicted at the top. At the bottom of Figure 1, we can see that there are seven possible tensor flattenings. The Tucker trace norm uses $\mathcal{W}_{(1)}$, $\mathcal{W}_{(2)}$, $\mathcal{W}_{(3)}$, and $\mathcal{W}_{(4)}$. The TT trace norm relies on $\mathcal{W}_{[1]}$, $\mathcal{W}_{[2]}$, and $\mathcal{W}_{[3]}$. The LAF trace norm only contains $\mathcal{W}_{(4)}$. The calculation of the GTTN is based on all seven tensor flattenings. From this example, we can see that the union of the tensor flattenings used in the Tucker, TT, and LAF trace norms cannot cover all the possible ones, and the GTTN utilizes some additional tensor flattenings (e.g., $\mathcal{W}_{\{1,3\}}$ and $\mathcal{W}_{\{1,4\}}$). In this sense, the GTTN can discover more low-rank structures than existing tensor trace norms.

For the number of distinct summands in the right-hand side of Eq. (5), we have the following theorem (all the proofs are put in the appendix).

###### Theorem 1

The right-hand side of Eq. (5) has $2^{p-1}-1$ distinct summands.

As shown in the proof of Theorem 1, $\mathcal{W}_{\{s\}}$ and $\mathcal{W}_{\{\neg s\}}$ are transposes of each other with equal matrix trace norms, so we can eliminate one of them to reduce the computational cost. For notational simplicity, we do not explicitly perform this elimination in the formulation, but we do perform it in computation. In the problems we encounter, $p$ is at most 5, so the GTTN has at most 15 distinct summands. The number of distinct summands is therefore not large, making the optimization efficient.
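The count in Theorem 1 and the transpose relation can be checked numerically. The sketch below keeps one representative per pair of complementary subsets by always including axis 0; this particular choice of representative is an assumption of the example, not something the paper specifies, and the helper names are hypothetical.

```python
import itertools
import numpy as np

def flatten_s(W, s):
    """Generalized flattening: rows indexed by the axes in s, columns by the remaining axes."""
    s = list(s)
    rest = [axis for axis in range(W.ndim) if axis not in s]
    rows = int(np.prod([W.shape[axis] for axis in s]))
    return np.transpose(W, s + rest).reshape(rows, -1)

def distinct_subsets(p):
    """One representative per pair (s, complement of s): the subset that contains axis 0."""
    representatives = []
    for size in range(0, p - 1):  # number of extra axes drawn from 1..p-1
        for extra in itertools.combinations(range(1, p), size):
            representatives.append((0,) + extra)
    return representatives

counts = {p: len(distinct_subsets(p)) for p in (3, 4, 5)}
print(counts)  # {3: 3, 4: 7, 5: 15}

rng = np.random.default_rng(0)
W = rng.standard_normal((2, 3, 4, 5))
s, complement = (0, 2), (1, 3)
is_transpose = np.allclose(flatten_s(W, s), flatten_s(W, complement).T)
```

The counts match the seven flattenings of the four-way example and the 15 summands quoted for a five-way tensor.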

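Similar to the Tucker and TT trace norms, the GTTN defined in Eq. (5) still faces the second issue. To solve it, we view $\boldsymbol{\alpha}$ as variables to be optimized, and based on Eq. (5), the objective function of a deep multi-task model based on the GTTN is formulated as

$$\min_{\Theta, \boldsymbol{\alpha}} \sum_{i=1}^{m} \frac{1}{n_i} \sum_{j=1}^{n_i} l\big(f_i(\mathbf{x}^i_j; \Theta), y^i_j\big) + \lambda \sum_{s \subset [p],\, s \neq \emptyset} \alpha_s \|\mathcal{W}_{\{s\}}\|_* \quad \text{s.t.} \ \alpha_s \ge 0, \ \sum_{s \subset [p],\, s \neq \emptyset} \alpha_s = 1. \tag{6}$$

Compared with problem (4), we can see two differences. Firstly, the regularization terms in the two problems are different. Secondly, problem (6) treats $\boldsymbol{\alpha}$ as variables to be optimized, whereas the corresponding entities in problem (4) are constants set by users.

In the following theorem, we simplify problem (6) by eliminating $\boldsymbol{\alpha}$.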

###### Theorem 2

Problem (6) is equivalent to

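$$\min_{\Theta} \sum_{i=1}^{m} \frac{1}{n_i} \sum_{j=1}^{n_i} l\big(f_i(\mathbf{x}^i_j; \Theta), y^i_j\big) + \lambda \min_{s \subset [p],\, s \neq \emptyset} \|\mathcal{W}_{\{s\}}\|_*. \tag{7}$$

According to problem (7), learning $\boldsymbol{\alpha}$ tends to choose the tensor flattening with the minimal matrix trace norm.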
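The step behind this equivalence can be sketched briefly. The following is an informal illustration rather than the appendix proof, and it assumes that problem (6) minimizes the weighted sum of Eq. (5) over weights $\alpha_s$ constrained to the probability simplex. For a fixed tensor, write $c_s = \|\mathcal{W}_{\{s\}}\|_*$ (a shorthand introduced only here); the regularizer is then linear in the weights, and a linear function over the simplex attains its minimum at a vertex:

```latex
\sum_{s} \alpha_s c_s \;\ge\; \sum_{s} \alpha_s \min_{t} c_t \;=\; \min_{t} c_t
\quad \text{for all } \alpha_s \ge 0,\ \sum_{s} \alpha_s = 1,
\qquad\text{with equality when } \alpha_{s^\star} = 1,\ s^\star \in \arg\min_{t} c_t .
```

Hence minimizing over the weights for fixed $\Theta$ replaces the weighted sum by the smallest trace norm among the flattenings.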

### 3.3 Optimization

Even though problem (7) is equivalent to problem (6), in numerical optimization we choose problem (6) as the objective function to be optimized. One reason is that problem (7), which involves the minimum of matrix trace norms, is more difficult to optimize than problem (6). Another reason is that the $\boldsymbol{\alpha}$ learned in problem (6) can visualize the importance of each tensor flattening, which can improve the interpretability of the learning model.

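Since problem (6) is designed for deep neural networks, the Stochastic Gradient Descent (SGD) technique is the first choice for optimization. However, problem (6) is a constrained optimization problem, making SGD techniques not directly applicable. The constraints in problem (6) constrain $\boldsymbol{\alpha}$ to lie on a probability simplex. To convert problem (6) to an unconstrained problem that can be optimized by SGD, we reparameterize each $\alpha_s$ as

$$\alpha_s = \frac{\exp\{\beta_s\}}{\sum_{t \subset [p],\, t \neq \emptyset} \exp\{\beta_t\}}.$$

With such reparameterization, problem (6) can be reformulated as

$$\min_{\Theta, \boldsymbol{\beta}} \sum_{i=1}^{m} \frac{1}{n_i} \sum_{j=1}^{n_i} l\big(f_i(\mathbf{x}^i_j; \Theta), y^i_j\big) + \lambda \frac{\sum_{s \subset [p],\, s \neq \emptyset} \exp\{\beta_s\} \|\mathcal{W}_{\{s\}}\|_*}{\sum_{t \subset [p],\, t \neq \emptyset} \exp\{\beta_t\}}. \tag{8}$$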
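The reparameterization is a softmax over unconstrained scores. A minimal NumPy sketch of the regularization term in problem (8) follows; the score vector, the trace-norm values, and the function names are hypothetical placeholders, and subtracting the maximum score is a standard numerical safeguard added here rather than something stated in the paper.

```python
import numpy as np

def softmax(beta):
    """Map unconstrained scores to weights that are nonnegative and sum to one."""
    shifted = beta - np.max(beta)  # shifting leaves the ratio unchanged and avoids overflow
    weights = np.exp(shifted)
    return weights / weights.sum()

def gttn_regularizer(beta, trace_norms, lam):
    """Regularization term of problem (8): lam times the softmax-weighted sum of trace norms."""
    return float(lam * (softmax(beta) @ trace_norms))

beta = np.array([0.0, 1.0, -2.0])        # hypothetical unconstrained scores
trace_norms = np.array([3.0, 5.0, 4.0])  # hypothetical trace norms of three flattenings
alpha = softmax(beta)
value = gttn_regularizer(beta, trace_norms, lam=0.25)
```

Because every score maps to a strictly positive weight, the constraints of problem (6) hold automatically and any gradient-based optimizer can update the scores directly.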

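For each parameter in $\Theta$ other than $\mathcal{W}$, the gradient can be computed based on the first term in the objective function of problem (8). For each $\beta_s$, writing $h$ for the objective function of problem (8), the gradient can be computed as

$$\frac{\partial h}{\partial \beta_s} = -\lambda \frac{\exp\{\beta_s\} \sum_{t \subset [p],\, t \neq \emptyset} \exp\{\beta_t\} \|\mathcal{W}_{\{t\}}\|_*}{\big(\sum_{t \subset [p],\, t \neq \emptyset} \exp\{\beta_t\}\big)^2} + \lambda \frac{\exp\{\beta_s\} \|\mathcal{W}_{\{s\}}\|_*}{\sum_{t \subset [p],\, t \neq \emptyset} \exp\{\beta_t\}}.$$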
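The gradient with respect to the softmax scores can be written compactly as the weight of a flattening times the gap between its trace norm and the weighted average of all trace norms, which is algebraically the same as the expression above. The sketch below compares this closed form against central finite differences; all numerical values and function names are made up for illustration.

```python
import numpy as np

def softmax(beta):
    weights = np.exp(beta - np.max(beta))
    return weights / weights.sum()

def regularizer(beta, trace_norms, lam):
    """Softmax-weighted sum of trace norms, scaled by lam."""
    return lam * (softmax(beta) @ trace_norms)

def grad_beta(beta, trace_norms, lam):
    """Closed form: lam * alpha_s * (c_s - sum_t alpha_t c_t), with c the trace norms."""
    alpha = softmax(beta)
    return lam * alpha * (trace_norms - alpha @ trace_norms)

def finite_difference(beta, trace_norms, lam, eps=1e-6):
    """Central finite differences, used only to sanity-check the closed form."""
    grad = np.zeros_like(beta)
    for s in range(beta.size):
        step = np.zeros_like(beta)
        step[s] = eps
        upper = regularizer(beta + step, trace_norms, lam)
        lower = regularizer(beta - step, trace_norms, lam)
        grad[s] = (upper - lower) / (2.0 * eps)
    return grad

beta = np.array([0.3, -1.2, 0.8, 0.0])        # hypothetical scores
trace_norms = np.array([3.0, 5.0, 4.0, 6.5])  # hypothetical trace norms
analytic = grad_beta(beta, trace_norms, lam=0.65)
numeric = finite_difference(beta, trace_norms, lam=0.65)
```

The closed form makes the behaviour visible: a score decreases under gradient descent when its flattening has a trace norm above the current weighted average, and increases otherwise.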

For $\mathcal{W}$, the computation of its gradient comes from both terms in the objective function of problem (8). The first term is the conventional training loss, and the second term involves the matrix trace norm, which is non-differentiable. According to (Watson, 1992), we can compute a subgradient instead, that is, $\frac{\partial \|\mathbf{X}\|_*}{\partial \mathbf{X}} = \mathbf{U}\mathbf{V}^\top$, where $\mathbf{X} = \mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^\top$ denotes the singular value decomposition of a matrix $\mathbf{X}$.
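A subgradient of this form is easy to compute from a thin SVD. The NumPy sketch below is illustrative: the function name is hypothetical, and the random matrix is chosen so that it has full column rank (with probability one), in which case the trace norm is differentiable and the result can be compared with a directional finite difference.

```python
import numpy as np

def trace_norm_subgradient(X):
    """Return U V^T, where X = U diag(sigma) V^T is the thin SVD of X."""
    U, _, Vt = np.linalg.svd(X, full_matrices=False)
    return U @ Vt

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 3))  # hypothetical flattened parameter matrix
G = trace_norm_subgradient(X)

# Directional finite-difference check along a random direction D.
D = rng.standard_normal(X.shape)
eps = 1e-6
upper = np.linalg.norm(X + eps * D, "nuc")
lower = np.linalg.norm(X - eps * D, "nuc")
numeric = (upper - lower) / (2.0 * eps)
analytic = float(np.sum(G * D))
```

When this is applied to a flattening of the parameter tensor, the resulting matrix has to be reshaped and permuted back to the tensor layout before it is added to the gradient of the training loss. Automatic differentiation frameworks usually handle that bookkeeping.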

### 3.4 Generalization Bound

For the GTTN defined in Eq. (5), we can derive its dual norm in the following theorem.

###### Theorem 3

The dual norm of the GTTN defined in Eq. (5) is defined as

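$$|||\mathcal{W}|||_{*^\star} = \min_{\sum_{s \subset [p],\, s \neq \emptyset} \alpha_s \mathcal{Y}^{(s)} = \mathcal{W}} \ \max_{s \subset [p],\, s \neq \emptyset} \big\|\mathcal{Y}^{(s)}_{\{s\}}\big\|_\infty,$$

where $\mathcal{Y}^{(s)}$ is a variable indexed by $s$ and $\|\cdot\|_\infty$ denotes the spectral norm of a matrix, which is equal to its maximum singular value.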

Without loss of generality, here we assume which can simplify the analysis. We rewrite problem (6) into an equivalent formulation as

 (9)

where $\boldsymbol{\alpha}$ is assumed to be fixed to show its impact on the bound. Here each data point is a tensor and binary classification tasks are considered (the analysis is easy to extend to regression tasks and multi-class classification tasks), implying that and . The learning function for each task is a linear function defined as , where denotes the inner product between two tensors with equal size and denotes the th slice along the last axis which is the task axis. For simplicity, different tasks are assumed to have the same number of data points, i.e., $n_i$ equals $n_0$ for $i \in [m]$. It is very easy to extend our analysis to general settings. The generalization loss for all the tasks is defined as , where denotes the underlying data distribution for the th task and denotes the expectation. The empirical loss for all the tasks is defined as . We assume the loss function has values in and it is Lipschitz with respect to the first input argument with a Lipschitz constant . Each training data is assumed to satisfy . To characterize correlations between features, we assume that for any and , where means that is a positive semidefinite matrix, , and $\mathbf{I}$ denotes an identity matrix with an appropriate size.

For problem (9), we can derive a generalization bound in the following theorem.

###### Theorem 4

For the solution $\hat{\mathcal{W}}$ of problem (9) and $\delta > 0$, with probability at least $1 - \delta$, we have

 L(^W)≤ ^L(^W)+2ργCmn0mins≠∅s⊂[p](κm√lndsαsn0d+lndsαsn0) +√2mln1δ.

According to Theorem 4, we can see that each $\alpha_s$ can be used to weigh the second term, which is related to the model complexity.

## 4 Experiments

In this section, we conduct empirical studies for the proposed GTTN.

### 4.1 Experimental Settings

#### 4.1.1 Datasets

ImageCLEF dataset. This dataset contains 12 common categories shared by 4 tasks: Caltech-256, ImageNet ILSVRC 2012, Pascal VOC 2012, and Bing. In total, there are about 2,400 images across all the tasks.

Office-Caltech dataset. This dataset consists of 4 tasks and 2,533 images in total. One task consists of data from the 10 common categories shared with the Caltech-256 dataset, and the other three tasks consist of data from the Office dataset, whose images are collected from 3 distinct domains/tasks, namely Amazon, Webcam, and DSLR.

Office-31 dataset. This dataset contains 31 categories from Amazon, Webcam, and DSLR. In total, there are 4,110 images across all the tasks.

Office-Home dataset. This dataset contains images from 4 domains/tasks, which are artistic images, clip art, product images, and real-world images. Each task contains images from 65 object categories collected in the office and home settings. There are about 15,500 images in all the tasks.

#### 4.1.2 Baselines

We compare the GTTN method with various competitors, including the deep multi-task learning (DMTL) method, where different tasks share the first several layers as the common feature representation, the Tucker trace norm method (denoted by Tucker), the TT trace norm method (denoted by TT), the LAF trace norm method (denoted by LAF), and the LAF Tensor Factorisation method (denoted by LAF-TF) (Yang and Hospedales, 2017a).

#### 4.1.3 Implementation details

We employ the VGG-19 network (Simonyan and Zisserman, 2015) to extract features for image data by using the output of the pool5 layer and the fc7 layer, respectively, for all the models in comparison. If the pool5 layer is used, the extracted feature representation is a 3-way tensor and all the multi-task learning models adopt a five-layer architecture, where the three hidden layers transform along each mode of the input with the ReLU activation function and have 6, 6, and 256 hidden units, respectively. Otherwise, if the fc7 layer is used, all the multi-task learning models adopt a two-layer fully-connected architecture with the ReLU activation function and 1024 hidden units, where the first layer is shared by all the tasks. The architecture used is illustrated in Figure 2.

To see the effect of the training size on the performance, we vary the training proportion from 50% to 70% at an interval of 10%. The performance measure is the classification accuracy. Each experimental setting is repeated 5 times, and we report the average performance as well as the standard deviation. For all the baseline methods, we follow their original model selection procedures. The regularization parameter $\lambda$ that controls the trade-off between the training cross-entropy loss and the regularization term is set to 0.25 and 0.65, respectively, for all the 6 methods to test the sensitivity of the performance with respect to $\lambda$. In addition, we use Adam with a learning rate that varies with the iteration number, and we adopt mini-batch SGD.

### 4.2 Experimental Results

The experimental results are reported in Figures 3-10 based on different feature extractors (i.e., pool5 or fc7) and different regularization parameters (i.e., 0.25 or 0.65).

Since the output of the fc7 layer is in a vectorized representation, the model parameter is a 3-way tensor. In this case, the Tucker trace norm possesses three tensor flattenings, the TT trace norm utilizes two tensor flattenings, and the GTTN also has three tensor flattenings. So both the GTTN and the Tucker trace norm utilize all the possible tensor flattenings, with the only difference being that the GTTN learns the combination coefficients whereas the Tucker trace norm manually sets them to be identical. According to the results, the GTTN outperforms the Tucker trace norm in most cases, which verifies that learning $\boldsymbol{\alpha}$ is better than fixing it.

When using the pool5 layer as the feature extractor, the feature representation is in a 3-way tensor, making the parameter a 5-way tensor. In this case, we can see that the GTTN method performs significantly better than other baseline methods. This is mainly because the GTTN utilizes more tensor flattenings than other baseline models and hence it may discover more low-rank structures.

### 4.3 Analysis on Learned α

Tables 1 and 2 show the learned $\boldsymbol{\alpha}$ of the GTTN based on the pool5 layer when $\lambda$ takes the value of 0.25 and 0.65, respectively. In this case, the parameter is a 5-way tensor and hence the GTTN contains 15 different flattenings, which correspond to the components of $\boldsymbol{\alpha}$ in Tables 1 and 2. According to the results, different tensor flattenings have varying weights.

Similarly, Tables 3 and 4 show the learned $\boldsymbol{\alpha}$ of the GTTN based on the fc7 layer when $\lambda = 0.25$ and $\lambda = 0.65$, respectively. In this case, the parameter is a 3-way tensor, for which the GTTN contains 3 different flattenings, i.e., $\mathcal{W}_{\{1\}}$, $\mathcal{W}_{\{2\}}$, and $\mathcal{W}_{\{1,2\}}$. We can notice that the weight of $\mathcal{W}_{\{1,2\}}$ is among the largest in most settings, which may imply that the combination of the first two axes is very important.

## 5 Conclusion

In this paper, we devise a generalized tensor trace norm to capture all the low-rank structures in a parameter tensor used in deep multi-task learning and identify the importance of each structure. We analyze properties of the proposed GTTN, including its dual norm and generalization bound. Empirical studies show that it outperforms state-of-the-art counterparts and the learned combination coefficients can give us more understanding of the problem studied. As a future work, we are interested in extending the idea of GTTN to study tensor Schatten norms.

## References

• A. Argyriou, T. Evgeniou, and M. Pontil (2006) Multi-task feature learning. In Advances in Neural Information Processing Systems 19, pp. 41–48. Cited by: §1.
• P. L. Bartlett and S. Mendelson (2002) Rademacher and Gaussian complexities: risk bounds and structural results. Journal of Machine Learning Research 3, pp. 463–482. Cited by: Proof for Theorem 4.
• E. Bonilla, K. M. A. Chai, and C. Williams (2007) Multi-task Gaussian process prediction. In Advances in Neural Information Processing Systems 20, Vancouver, British Columbia, Canada, pp. 153–160. Cited by: §1.
• R. Caruana (1997) Multitask learning. Machine Learning 28 (1), pp. 41–75. Cited by: §1.
• J. Chen, J. Liu, and J. Ye (2010) Learning incoherent sparse and low-rank patterns from multiple tasks. In Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Washington, DC, USA, pp. 1179–1188. Cited by: §1.
• L. Han and Y. Zhang (2015a) Learning multi-level task groups in multi-task learning. In Proceedings of the 29th AAAI Conference on Artificial Intelligence, Cited by: §1.
• L. Han and Y. Zhang (2015b) Learning tree structure in multi-task learning. In Proceedings of the 21st ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Cited by: §1.
• L. Han and Y. Zhang (2016) Multi-stage multi-task learning with reduced rank. In Proceedings of the 30th AAAI Conference on Artificial Intelligence, Cited by: §1.
• L. Jacob, F. Bach, and J.-P. Vert (2008) Clustered multi-task learning: a convex formulation. In Advances in Neural Information Processing Systems 21, pp. 745–752. Cited by: §1.
• A. Jalali, P. D. Ravikumar, S. Sanghavi, and C. Ruan (2010) A dirty model for multi-task learning. In Advances in Neural Information Processing Systems 23, Vancouver, British Columbia, Canada, pp. 964–972. Cited by: §1.
• A. Kumar and H. D. III (2012) Learning task grouping and overlap in multi-task learning. In Proceedings of the 29th International Conference on Machine Learning, Edinburgh, Scotland, UK. Cited by: §1.
• H. Liu, M. Palatucci, and J. Zhang (2009) Blockwise coordinate descent procedures for the multi-task lasso, with applications to neural semantic basis discovery. In Proceedings of the 26th Annual International Conference on Machine Learning, Cited by: §1.
• M. Long, Z. Cao, J. Wang, and P. S. Yu (2017) Learning multiple tasks with multilinear relationship networks. In Advances in Neural Information Processing Systems 30, pp. 1593–1602. Cited by: §1.
• A. C. Lozano and G. Swirszcz (2012) Multi-level lasso for sparse multi-task regression. In Proceedings of the 29th International Conference on Machine Learning, Edinburgh, Scotland, UK. Cited by: §1.
• I. Misra, A. Shrivastava, A. Gupta, and M. Hebert (2016) Cross-stitch networks for multi-task learning. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, pp. 3994–4003. Cited by: §1.
• G. Obozinski, B. Taskar, and M. Jordan (2006) Multi-task feature selection. Technical report Department of Statistics, University of California, Berkeley. Cited by: §1.
• I. V. Oseledets (2011) Tensor-train decomposition. SIAM Journal on Scientific Computing 33 (5), pp. 2295–2317. Cited by: §1, §2.2.
• T. K. Pong, P. Tseng, S. Ji, and J. Ye (2010) Trace norm regularization: reformulations, algorithms, and multi-task learning. SIAM Journal on Optimization 20 (6), pp. 3465–3489. Cited by: §1.
• B. Romera-Paredes, H. Aung, N. Bianchi-Berthouze, and M. Pontil (2013) Multilinear multitask learning. In Proceedings of the 30th International Conference on Machine Learning, pp. 1444–1452. Cited by: §1, §2.1.
• K. Simonyan and A. Zisserman (2015) Very deep convolutional networks for large-scale image recognition. In Proceedings of the 3rd International Conference on Learning Representations, Cited by: §4.1.3.
• R. Tomioka and T. Suzuki (2013) Convex tensor decomposition via structured schatten norm regularization. In Advances in Neural Information Processing Systems 26, pp. 1331–1339. Cited by: §2.
• J. A. Tropp (2012) User-friendly tail bounds for sums of random matrices. Foundations of Computational Mathematics 12 (4), pp. 389–434. Cited by: Proof for Theorem 4.
• L. R. Tucker (1966) Some mathematical notes on three-mode factor analysis. Psychometrika 31 (3), pp. 279–311. Cited by: §1, §2.1.
• G. A. Watson (1992) Characterization of the subdifferential of some matrix norms. Linear Algebra and its Applications 170, pp. 33–45. Cited by: §3.3.
• K. Wimalawarne, M. Sugiyama, and R. Tomioka (2014) Multitask learning meets tensor factorization: task imputation via convex optimization. In Advances in Neural Information Processing Systems 27, pp. 2825–2833. Cited by: §1, §2.1.
• Y. Xue, X. Liao, L. Carin, and B. Krishnapuram (2007) Multi-task learning for classification with Dirichlet process priors. Journal of Machine Learning Research 8, pp. 35–63. Cited by: §1.
• Y. Yang and T. M. Hospedales (2017a) Deep multi-task representation learning: A tensor factorisation approach. In Proceedings of the 5th International Conference on Learning Representations. Cited by: §4.1.2.
• Y. Yang and T. M. Hospedales (2017b) Trace norm regularised deep multi-task learning. In Workshop Track Proceedings of the 5th International Conference on Learning Representations. Cited by: §1, §1, §2.
• Y. Zhang and D.-Y. Yeung (2010) A convex formulation for learning task relationships in multi-task learning. In Proceedings of the 26th Conference on Uncertainty in Artificial Intelligence, pp. 733–742. Cited by: §1.
• Y. Zhang, Y. Wei, and Q. Yang (2018) Learning to multitask. In Advances in Neural Information Processing Systems 31, pp. 5776–5787. Cited by: §1.
• Y. Zhang and Q. Yang (2017) A survey on multi-task learning. CoRR abs/1707.08114. Cited by: §1.
• A. Zweig and D. Weinshall (2013) Hierarchical regularization cascade for joint learning. In Proceedings of the 30th International Conference on Machine Learning, Atlanta, GA, USA, pp. 37–45. Cited by: §1.

## Appendix

### Proof for Theorem 1

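Proof. For a valid $s$, it is required that $s$ and its complement $[p]\setminus s$ should not be empty, implying that $s\neq\emptyset$ and $s\neq[p]$. So the total number of valid summands in the right-hand side of Eq. (5) is $2^p-2$. Based on the definition of $\mathcal{W}_{\{s\}}$, we can see that $\mathcal{W}_{\{s\}}$ is equal to the transpose of $\mathcal{W}_{\{[p]\setminus s\}}$, making $\|\mathcal{W}_{\{s\}}\|_*=\|\mathcal{W}_{\{[p]\setminus s\}}\|_*$. So for each $s$, there will always be an equivalent $[p]\setminus s$, leading to $2^{p-1}-1$ distinct summands in the right-hand side of Eq. (5).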
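The counting argument can be checked by brute force. The following Python sketch is only an illustration: the helper name `distinct_flattenings` is ours, not the paper's, and it takes for granted that a subset of axes and its complement yield flattenings that are transposes of each other, so the pair counts as one summand. It enumerates the non-empty proper subsets of the axis set and compares the number of such pairs with the closed form $2^{p-1}-1$.

```python
from itertools import combinations


def distinct_flattenings(p):
    """Return the set of unordered pairs {s, complement of s} over [p] = {1, ..., p}."""
    axes = frozenset(range(1, p + 1))
    pairs = set()
    # Sizes 1..p-1 keep both s and its complement non-empty.
    for size in range(1, p):
        for subset in combinations(sorted(axes), size):
            s = frozenset(subset)
            # s and its complement are treated as one summand (transposed flattenings).
            pairs.add(frozenset((s, axes - s)))
    return pairs


for p in range(2, 8):
    valid = 2 ** p - 2                      # non-empty proper subsets of [p]
    distinct = len(distinct_flattenings(p))
    assert distinct == valid // 2 == 2 ** (p - 1) - 1
    print(f"p={p}: valid={valid}, distinct={distinct}")
```

The enumeration grows exponentially in $p$, which is harmless here because $p$ is the number of tensor axes and is typically small.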

### Proof for Theorem 2

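Proof. Based on Eq. (5), we rewrite problem (6) as

$$\min_{\Theta,\,\alpha\in\mathcal{C}_\alpha}\ \sum_{i=1}^{m}\frac{1}{n_i}\sum_{j=1}^{n_i} l\big(f_i(x^i_j;\Theta),y^i_j\big)+\lambda\sum_{s\subset[p],\,s\neq\emptyset}\alpha_s\|\mathcal{W}_{\{s\}}\|_*,$$

which, since the loss term does not depend on $\alpha$, is equivalent to

$$\min_{\Theta}\ \sum_{i=1}^{m}\frac{1}{n_i}\sum_{j=1}^{n_i} l\big(f_i(x^i_j;\Theta),y^i_j\big)+\lambda\min_{\alpha\in\mathcal{C}_\alpha}\sum_{s\subset[p],\,s\neq\emptyset}\alpha_s\|\mathcal{W}_{\{s\}}\|_*.$$

So we just need to prove that

$$\min_{\alpha\in\mathcal{C}_\alpha}\sum_{s\subset[p],\,s\neq\emptyset}\alpha_s\|\mathcal{W}_{\{s\}}\|_*=\min_{s\subset[p],\,s\neq\emptyset}\|\mathcal{W}_{\{s\}}\|_*.$$

The optimization problem on the left-hand side of the above equation is a linear programming problem with respect to $\alpha$. It is easy to show that $\sum_{s}\alpha_s\|\mathcal{W}_{\{s\}}\|_*\ge\min_{s}\|\mathcal{W}_{\{s\}}\|_*$ for $\alpha\in\mathcal{C}_\alpha$, where the equality holds when the coefficient corresponding to the smallest $\|\mathcal{W}_{\{s\}}\|_*$ equals 1 and the other coefficients equal 0. Then we reach the conclusion.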
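The linear-programming step can be illustrated numerically. The sketch below is a minimal check under stated assumptions: the feasible set for the coefficients is taken to be the probability simplex (non-negative weights summing to one, as the "convex combination" in the abstract suggests), and the list `norms` holds made-up values standing in for the trace norms of the flattenings. The function names are ours.

```python
import random


def gttn_value(alpha, norms):
    """Convex combination sum_s alpha_s * norms[s] of hypothetical trace norms."""
    return sum(a * v for a, v in zip(alpha, norms))


def random_simplex_point(k, rng):
    """Draw non-negative weights that sum to one."""
    raw = [rng.random() + 1e-12 for _ in range(k)]
    total = sum(raw)
    return [r / total for r in raw]


rng = random.Random(0)
norms = [3.2, 1.5, 4.0, 2.7]          # hypothetical values of the flattening trace norms
smallest = min(norms)

# No point of the simplex does better than the smallest norm ...
for _ in range(1000):
    alpha = random_simplex_point(len(norms), rng)
    assert gttn_value(alpha, norms) >= smallest - 1e-12

# ... and the one-hot vertex on the minimiser attains it.
one_hot = [1.0 if i == norms.index(smallest) else 0.0 for i in range(len(norms))]
assert gttn_value(one_hot, norms) == smallest
```

Random sampling only illustrates the inequality; the proof itself rests on the fact that a linear objective over a simplex is minimized at a vertex.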

### Proof for Theorem 3

Proof. We define a linear operator , where $\mathrm{vec}(\cdot)$ denotes the columnwise concatenation of a matrix and denotes a set of successive integers from to . We define the norm as

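$$\|y\|_q=\sum_i\big\|\mathcal{Y}^{(\pi(i))}_{\{\pi(i)\}}\big\|_*,$$

where $\mathcal{Y}^{(\pi(i))}_{\{\pi(i)\}}$ denotes the inverse vectorization of a subvector of $y$ into a matrix where and $\pi(\cdot)$ transforms an index into a subset of $[p]$. Based on the definition of the dual norm, we have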

$$|||\mathcal{W}|||_{*\star}=\sup\ \langle\mathcal{W},\mathcal{X}\rangle\quad\text{s.t.}\quad|||\mathcal{X}|||_*\le 1,$$

where $\langle\cdot,\cdot\rangle$ denotes the inner product between two tensors of equal size. Since this maximization problem satisfies the Slater condition, strong duality holds. Thus, by the Fenchel duality theorem, we have

where is an indicator function of condition and it outputs 0 when is true and $+\infty$ otherwise. Since the dual norm of the trace norm is the spectral norm, we reach the conclusion.

### Proof for Theorem 4

Before presenting the proof for Theorem 4, we first prove the following theorem.

###### Theorem 5

$\sigma^i_j$, a Rademacher variable, is a uniform $\{\pm 1\}$-valued random variable, and $\mathcal{M}$ is a tensor with , where equals . Then we have

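$$\mathbb{E}\big[|||\mathcal{M}|||_{*\star}\big]\le\min_{s\neq\emptyset,\,s\subset[p]}\frac{C}{\alpha_s}\left(\frac{\kappa m}{n_0 d}\sqrt{\ln d_s}+\frac{\ln d_s}{n_0}\right),$$

where , $C$ is an absolute constant,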

Proof. We define . According to Theorem 3, we have

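$$|||\mathcal{M}|||_{*\star}=\min_{\sum_{s\neq\emptyset,\,s\subset[p]}\alpha_s\mathcal{Y}^{(s)}=\mathcal{M}}\ \max_{s\neq\emptyset,\,s\subset[p]}\big\|\mathcal{Y}^{(s)}_{\{s\}}\big\|_\infty.$$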

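Since for each $s$ we can make $\mathcal{Y}^{(s)}$ equal to $\frac{1}{\alpha_s}\mathcal{M}$ and set the other components to zero, we have

$$|||\mathcal{M}|||_{*\star}\le\frac{1}{\alpha_s}\big\|\mathcal{M}_{\{s\}}\big\|_\infty\quad\forall s\neq\emptyset,\ s\subset[p],$$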

which implies that

$$|||\mathcal{M}|||_{*\star}\le\min_{s}\frac{1}{\alpha_s}\big\|\mathcal{M}_{\{s\}}\big\|_\infty.$$

So we can get

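$$\mathbb{E}\big[|||\mathcal{M}|||_{*\star}\big]\le\mathbb{E}\Big[\min_{s}\frac{1}{\alpha_s}\big\|\mathcal{M}_{\{s\}}\big\|_\infty\Big]\le\min_{s}\mathbb{E}\Big[\frac{1}{\alpha_s}\big\|\mathcal{M}_{\{s\}}\big\|_\infty\Big].$$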
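The second inequality swaps the expectation and the minimum: the expected minimum of several random quantities never exceeds the smallest of their expectations. A small Python sketch makes this concrete with empirical averages, for which the inequality holds exactly. The sample values are synthetic placeholders with no connection to the paper's data, and the function names are ours.

```python
import random


def mean(values):
    return sum(values) / len(values)


def mean_of_min(samples):
    """Empirical E[min_s X_s]: average over draws of the row-wise minimum."""
    return mean([min(row) for row in samples])


def min_of_mean(samples):
    """min_s of the empirical E[X_s]: smallest column average."""
    columns = list(zip(*samples))
    return min(mean(col) for col in columns)


rng = random.Random(1)
# 500 hypothetical draws of 4 non-negative quantities, one column per subset s
samples = [[rng.expovariate(1.0 + s) for s in range(4)] for _ in range(500)]
assert mean_of_min(samples) <= min_of_mean(samples) + 1e-12
```

The gap between the two sides is the price paid for bounding each flattening separately, which is what allows a matrix concentration bound to be applied per subset.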

Based on Theorem 6.1 in (Tropp, 2012), we can upper-bound each expectation as

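$$\mathbb{E}\big[\|\mathcal{M}_{\{s\}}\|_\infty\big]\le C\big(\sigma_s\sqrt{\ln d_s}+\psi_s\ln d_s\big),$$

where $\mathcal{Z}^{i,j}$ is a zero tensor with only the $i$th slice along the last axis equal to , $\psi_s$ needs to satisfy , and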

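$$\sigma_s^2=\max\left(\Big\|\sum_{i=1}^{m}\sum_{j=1}^{n_0}\mathbb{E}\big[\mathcal{Z}^{i,j}_{\{s\}}(\mathcal{Z}^{i,j}_{\{s\}})^T\big]\Big\|_\infty,\ \Big\|\sum_{i=1}^{m}\sum_{j=1}^{n_0}\mathbb{E}\big[(\mathcal{Z}^{i,j}_{\{s\}})^T\mathcal{Z}^{i,j}_{\{s\}}\big]\Big\|_\infty\right).$$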

As the Frobenius norm of a matrix is no smaller than its spectral norm, we simply set . For , we have

implying that

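$$\Big\|\sum_{i=1}^{m}\sum_{j=1}^{n_0}\mathbb{E}\big[\mathcal{Z}^{i,j}_{\{s\}}(\mathcal{Z}^{i,j}_{\{s\}})^T\big]\Big\|_\infty\le\frac{\kappa m}{n_0 d}.$$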

Similarly, we have

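$$\mathbb{E}\Big[\sum_{j=1}^{n_0}(\mathcal{Z}^{i,j}_{\{s\}})^T\mathcal{Z}^{i,j}_{\{s\}}\Big]=\mathrm{diag}\Big(\frac{\mathrm{tr}(C_{s-\{p\}})}{n_0}\Big)\preceq\frac{\kappa}{n_0 d}I,$$

where $\mathrm{tr}(\cdot)$ denotes the trace of a matrix and $\mathrm{diag}(\cdot)$ converts a vector or scalar to a diagonal matrix. This inequality implies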

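$$\Big\|\sum_{i=1}^{m}\sum_{j=1}^{n_0}\mathbb{E}\big[(\mathcal{Z}^{i,j}_{\{s\}})^T\mathcal{Z}^{i,j}_{\{s\}}\big]\Big\|_\infty\le\frac{\kappa m}{n_0 d}.$$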

By combining the above inequalities, we reach the conclusion.

Then we can prove Theorem 4 as follows.

Proof. By following (Bartlett and Mendelson, 2002), we have

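$$L(\hat{\mathcal{W}})\le\hat{L}(\hat{\mathcal{W}})+\sup_{\mathcal{W}\in\mathcal{C}}\big\{L(\mathcal{W})-\hat{L}(\mathcal{W})\big\}=\hat{L}(\hat{\mathcal{W}})+\sup_{\mathcal{W}\in\mathcal{C}}\big\{\mathbb{E}[\hat{L}(\mathcal{W})]-\hat{L}(\mathcal{W})\big\}.$$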

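When any one pair of the training data changes, the random variable $\sup_{\mathcal{W}\in\mathcal{C}}\{\mathbb{E}[\hat{L}(\mathcal{W})]-\hat{L}(\mathcal{W})\}$ can change by no more than $\frac{2}{mn_0}$ due to the boundedness of the loss function $l(\cdot,\cdot)$. Then by McDiarmid’s inequality, we can get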

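$$\mathbb{P}\Big(\sup_{\mathcal{W}\in\mathcal{C}}\big\{\mathbb{E}[\hat{L}(\mathcal{W})]-\hat{L}(\mathcal{W})\big\}-\mathbb{E}\Big[\sup_{\mathcal{W}\in\mathcal{C}}\big\{\mathbb{E}[\hat{L}(\mathcal{W})]-\hat{L}(\mathcal{W})\big\}\Big]\ge t\Big)\le\exp\Big\{-\frac{t^2mn_0}{2}\Big\},$$

where $\mathbb{P}(\cdot)$ denotes the probability. Setting $\delta=\exp\{-t^2mn_0/2\}$, this inequality implies that with probability at least $1-\delta$,

$$\sup_{\mathcal{W}\in\mathcal{C}}\big\{\mathbb{E}[\hat{L}(\mathcal{W})]-\hat{L}(\mathcal{W})\big\}\le\mathbb{E}\Big[\sup_{\mathcal{W}\in\mathcal{C}}\big\{\mathbb{E}[\hat{L}(\mathcal{W})]-\hat{L}(\mathcal{W})\big\}\Big]+\sqrt{\frac{2}{mn_0}\ln\frac{1}{\delta}}.$$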
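The deviation term comes from inverting the exponential tail: setting the tail bound $\exp\{-t^2mn_0/2\}$ equal to $\delta$ and solving for $t$. The Python sketch below checks that inversion numerically. The function names and the numeric values are hypothetical, and reading $m$ as the number of tasks and $n_0$ as the number of samples per task is our assumption based on the double sums in the proof.

```python
import math


def mcdiarmid_tail(t, m, n0):
    """Right-hand side exp(-t^2 * m * n0 / 2) of the concentration inequality."""
    return math.exp(-t * t * m * n0 / 2.0)


def deviation_term(delta, m, n0):
    """Solve mcdiarmid_tail(t) = delta for t: sqrt(2 / (m * n0) * ln(1 / delta))."""
    return math.sqrt(2.0 / (m * n0) * math.log(1.0 / delta))


m, n0, delta = 10, 100, 0.05            # hypothetical numbers of tasks and samples
t = deviation_term(delta, m, n0)
assert math.isclose(mcdiarmid_tail(t, m, n0), delta, rel_tol=1e-9)
# More data shrinks the deviation term at the rate 1 / sqrt(m * n0).
assert deviation_term(delta, m, 4 * n0) < t
```

Quadrupling the total sample count $mn_0$ halves this term, which is the usual $O(1/\sqrt{mn_0})$ behaviour of concentration-based bounds.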

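Based on the property of the Rademacher complexity, we have

$$\mathbb{E}\Big[\sup_{\mathcal{W}\in\mathcal{C}}\big\{\mathbb{E}[\hat{L}(\mathcal{W})]-\hat{L}(\mathcal{W})\big\}\Big]\le 2\rho\,\mathbb{E}\Big[\sup_{\mathcal{W}\in\mathcal{C}}\Big\{\frac{1}{mn_0}\sum_{i=1}^{m}\sum_{j=1}^{n_0}\sigma^i_j f_i(x^i_j)\Big\}\Big].$$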

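Then based on the definition of $\mathcal{M}$ and Hölder’s inequality, we have

$$\sup_{\mathcal{W}\in\mathcal{C}}\Big\{\frac{1}{mn_0}\sum_{i=1}^{m}\sum_{j=1}^{n_0}\sigma^i_j f_i(x^i_j)\Big\}\le\frac{\gamma}{m}|||\mathcal{M}|||_{*\star}.$$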
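The intermediate steps behind this bound can be spelled out as follows. This is a sketch under assumptions that the surviving text does not state explicitly: each $f_i$ is linear, $f_i(x)=\langle\mathcal{W}_i,x\rangle$, with $\mathcal{W}_i$ the $i$th slice of $\mathcal{W}$ along the last axis; the $i$th slice of $\mathcal{M}$ is $\frac{1}{n_0}\sum_{j}\sigma^i_j x^i_j$; and the constraint set is $\mathcal{C}=\{\mathcal{W}:|||\mathcal{W}|||_*\le\gamma\}$.

```latex
\begin{align*}
\frac{1}{mn_0}\sum_{i=1}^{m}\sum_{j=1}^{n_0}\sigma^i_j f_i(x^i_j)
  &= \frac{1}{m}\sum_{i=1}^{m}\Big\langle \mathcal{W}_i,\ \frac{1}{n_0}\sum_{j=1}^{n_0}\sigma^i_j x^i_j\Big\rangle
   && \text{(linearity of } f_i\text{, assumed)}\\
  &= \frac{1}{m}\langle \mathcal{W},\mathcal{M}\rangle
   && \text{(slice-wise inner product)}\\
  &\le \frac{1}{m}\,|||\mathcal{W}|||_{*}\,|||\mathcal{M}|||_{*\star}
   && \text{(H\"older's inequality for a norm and its dual)}\\
  &\le \frac{\gamma}{m}\,|||\mathcal{M}|||_{*\star}
   && \text{(since } \mathcal{W}\in\mathcal{C}\text{)}.
\end{align*}
```

Taking the supremum over $\mathcal{W}\in\mathcal{C}$ on the left-hand side preserves the bound because the right-hand side does not depend on $\mathcal{W}$.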

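By combining the above inequalities, with probability at least $1-\delta$, we have

$$L(\hat{\mathcal{W}})\le\hat{L}(\hat{\mathcal{W}})+\frac{2\rho\gamma}{m}\,\mathbb{E}\big[|||\mathcal{M}|||_{*\star}\big]+\sqrt{\frac{2}{mn_0}\ln\frac{1}{\delta}}.$$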

Then by incorporating Theorem 5 into this inequality, we reach the conclusion.