# Regularization Techniques for Learning with Matrices

There is a growing body of learning problems for which it is natural to organize the parameters into a matrix, so as to appropriately regularize the parameters under some matrix norm (in order to impose more sophisticated prior knowledge). This work describes and analyzes a systematic method for constructing such matrix-based regularization methods. In particular, we focus on how the underlying statistical properties of a given problem can help us decide which regularization function is appropriate. Our methodology is based on a known duality fact: a function is strongly convex with respect to some norm if and only if its conjugate function is strongly smooth with respect to the dual norm. This result has already been found to be a key component in deriving and analyzing several learning algorithms. We demonstrate the potential of this framework by deriving novel generalization and regret bounds for multi-task learning, multi-class learning, and kernel learning.


## 1 Introduction

As we tackle more challenging learning problems, there is an increasing need for algorithms that efficiently impose more sophisticated forms of prior knowledge. Examples include: the group Lasso problem (for “shared” feature selection across problems), kernel learning, multi-class prediction, and multi-task learning. A central question here is to understand the performance of such algorithms in terms of the attendant complexity restrictions imposed by the algorithm. Such analyses often illuminate the nature in which our prior knowledge is being imposed.

The predominant modern method for imposing complexity restrictions is through regularizing a vector of parameters, and much work has gone into understanding the relationship between the nature of the regularization and the implicit prior knowledge imposed, particularly for the case of regularization with ℓ₂ and ℓ₁ norms (where the former is more tailored to rotational invariance and margins, while the latter is more suited to sparsity). When dealing with more complex problems, we need systematic tools for designing more complicated regularization schemes. This work examines regularization based on group norms and spectral norms of matrices. We analyze the performance of such regularization methods and provide a methodology for choosing a regularization function based on the underlying statistical properties of a given problem.

In particular, we utilize a recently developed methodology, based on the notion of strong convexity, for designing and analyzing the regret or generalization ability of a wide range of learning algorithms (see e.g. Shalev-Shwartz [2007], Kakade et al. [2008]). In fact, most of our efficient algorithms (both in the batch and online settings) impose some complexity control via the use of some strictly convex penalty function either explicitly via a regularizer or implicitly in the design of an online update rule. Central to understanding these algorithms is the manner in which these penalty functions are strictly convex, i.e. the behavior of the “gap” by which these convex functions lie above their tangent planes, which is strictly positive for strictly convex functions. Here, the notion of strong convexity provides one means to characterize this gap in terms of some general norm rather than just Euclidean.

The importance of strong convexity can be understood through the duality between strong convexity and strong smoothness. Strong smoothness measures how well a function is approximated at some point by its linearization. Linear functions are easy to manipulate (e.g. because of the linearity of expectation). Hence, if a function is sufficiently smooth we can more easily control its behavior. We further distill the analysis given in Shalev-Shwartz [2007], Kakade et al. [2008]: based on the strong-convexity/strong-smoothness duality, we derive a key inequality which seamlessly enables us to design and analyze a family of learning algorithms.

Our focus in this work is on learning with matrices. We characterize a number of matrix based regularization functions, of recent interest, as being strongly convex functions, allowing us to immediately derive learning algorithms by relying on the family of learning algorithms mentioned previously. By specializing the general performance bounds to each matrix based regularization method, we are able to systematically decide which regularization function is more appropriate based on the underlying statistical properties of a given problem.

### 1.1 Our Contributions

We can summarize the contributions of this work as follows:

• We show how the framework based on strong convexity/strong smoothness duality (see Shalev-Shwartz [2007], Kakade et al. [2008]) provides a methodology for analyzing matrix based learning methods, which are of much recent interest. These results reinforce the usefulness of this framework in providing both learning algorithms, and their associated complexity analysis. For this reason, we further distill the analysis given in Shalev-Shwartz [2007], Kakade et al. [2008] by emphasizing a key inequality which immediately enables us to design and analyze a family of learning algorithms.

• We provide template algorithms (both in the online and batch settings) for a number of machine learning problems of recent interest, which use matrix parameters. In particular, we provide a simple derivation of generalization/mistake bounds for: (i) online and batch multi-task learning using group or spectral norms, (ii) online multi-class categorization using group or spectral norms, and (iii) kernel learning.

• Based on the derived bounds, we interpret how statistical properties of a given problem can help us decide which regularization function is appropriate. For example, for the case of multi-class learning, we describe and analyze a new “group Perceptron” algorithm and show that with a shared structure between classes, this algorithm significantly outperforms previously proposed algorithms. Similarly, for the case of multi-task learning, the pressing question is what shared structure between the tasks allows for sample complexity improvements and by how much? We discuss these issues based on our regret and generalization bounds.

• Our unified analysis significantly simplifies previous analyses of recently proposed algorithms. For example, the generality of this framework allows us to simplify the proofs of previously proposed regret bounds for online multi-task learning (e.g. Cavallanti et al. [2008], Agarwal et al. [2008]). Furthermore, bounds that follow immediately from our analysis are sometimes much sharper than previous results (e.g. we improve the bounds for multiple kernel learning given in Lanckriet et al. [2004], Srebro and Ben-David [2006]).

### 1.2 Related work

We first discuss related work on learning with matrix parameters then discuss the use of strong convexity in learning.

Matrix Learning: There is a growing body of work studying learning problems in which the parameters can be organized as matrices. Several examples are multi-class categorization (e.g. Crammer and Singer [2000]), multi-task and multi-view learning (e.g. Cavallanti et al. [2008], Agarwal et al. [2008]), and online PCA [Warmuth and Kuzmin, 2006]. Matrix learning has also been studied under the framework of the group Lasso (e.g. Yuan and Lin [2006], Obozinski et al. [2007], Bach [2008]).

In the context of learning vectors (rather than matrices), the study of the relative performance of different regularization techniques based on properties of a given task dates back to Littlestone [1988], Kivinen and Warmuth [1997]. In the context of batch learning, it was studied by several authors (e.g. Ng [2004]).

We also note that much of the work on multi-task learning for regression is on union support recovery — a setting where the generative model specifies a certain set of relevant features (over all the tasks), and the analysis focuses on the conditions and sample sizes under which the union of the relevant features can be correctly identified (e.g. Obozinski et al. [2007], Lounici et al. [2009]). Essentially, this is a generalization of the problem of identifying the relevant feature set in the standard single task regression setting under ℓ₁ regularization. In contrast, our work focuses on the agnostic setting of understanding the sample size needed to obtain a given error rate (rather than identifying the relevant features themselves).

We also discuss related work on kernel learning in Section 6. Our analysis here utilizes the equivalence between kernel learning and group Lasso (as noted in Bach [2008]).

Strong Convexity/Strong Smoothness: The notion of strong convexity takes its roots in optimization. Zalinescu [2002] attributes it to a paper of Polyak from the 1960s. Relatively recently, its use in machine learning has been twofold: in deriving regret bounds for online algorithms and generalization bounds in batch settings.

The duality of strong convexity and strong smoothness was first used by Shalev-Shwartz and Singer [2006], Shalev-Shwartz [2007] in the context of deriving low regret online algorithms. Here, once we choose a particular strongly convex penalty function, we immediately have a family of algorithms along with a regret bound for these algorithms that is in terms of a certain strong convexity parameter. A variety of algorithms (and regret bounds) can be seen as special cases.

A similar technique, in which the Hessian is directly bounded, is described by Grove et al. [2001], Shalev-Shwartz and Singer [2007]. Another related approach involved bounding a Bregman divergence [Kivinen and Warmuth, 1997, 2001, Gentile, 2003] (see Cesa-Bianchi and Lugosi [2006] for a detailed survey). Another interesting application of the very same duality is for deriving and analyzing boosting algorithms [Shalev-Shwartz and Singer, 2008].

More recently, Kakade et al. [2008] showed how to use the very same duality for bounding the Rademacher complexity of classes of linear predictors. That the Rademacher complexity is closely related to Fenchel duality was shown in Meir and Zhang [2003], and the work in Kakade et al. [2008] made the further connection to strong convexity. Again, under this characterization, a number of generalization and margin bounds (for methods which use linear prediction) are immediate corollaries, as one only needs to specify the strong convexity parameter from which these bounds easily follow (see Kakade et al. [2008] for details).

The concept of strong smoothness (essentially a second order upper bound on a function) has also been in play in a different literature: the analysis of the concentration of martingales in smooth Banach spaces [Pinelis, 1994, Pisier, 1975]. This body of work seeks to understand the concentration properties of a random variable ∥Z∥, where Z is a (vector valued) martingale and ∥·∥ is a smooth norm, say an ℓp-norm.

Recently, Juditsky and Nemirovski [2008] used the fact that a norm is strongly convex if and only if its conjugate is strongly smooth. This duality was useful in deriving concentration properties of a random variable ∥M∥, where now M is a random matrix. The norms considered there were the Schatten ℓp-matrix norms and certain “block” composed norms (such as the ∥·∥₂,p norm).

### 1.3 Organization

The rest of the paper is organized as follows. In Section 2, we describe the general family of learning algorithms. In particular, after presenting the duality of strong convexity and strong smoothness, we isolate an important inequality (Corollary 2.2) and show that this inequality alone seamlessly yields regret bounds in the online learning model and Rademacher bounds (which lead to generalization bounds in the batch learning model). We further highlight the importance of strong convexity to matrix learning applications by drawing attention to families of strongly convex functions over matrices. To do so, we rely on the recent results of Juditsky and Nemirovski [2008]. In particular, we obtain strongly convex functions over matrices based on strongly convex vector functions, which leads to a number of corollaries relevant to problems of recent interest. Next, in Section 3 we show how the obtained bounds can be used for systematically choosing adequate prior knowledge (i.e. regularization) based on properties of the given task. We then turn to describe the applicability of our approach to more complex prediction problems. In particular, we study multi-task learning (Section 4), multi-class categorization (Section 5), and kernel learning (Section 6). Naturally, many of the algorithms we derive have been proposed before. Nevertheless, our unified analysis enables us to simplify previous analyses, understand the merits and pitfalls of different schemes, and even derive new algorithms and analyses.

## 2 Preliminaries and Techniques

In this section we describe the necessary background. Most of the results below are not new and are based on results from Shalev-Shwartz [2007], Kakade et al. [2008], Juditsky and Nemirovski [2008]. Nevertheless, we believe that the presentation given here is simpler and slightly more general.

Our results are based on basic notions from convex analysis and matrix computation. The reader not familiar with some of the objects described below may find short explanations in Appendix A.

### 2.1 Notation

We consider convex functions f : X → ℝ ∪ {∞}, where X is a Euclidean vector space equipped with an inner product ⟨·,·⟩. The subdifferential of f at x is denoted by ∂f(x). The Fenchel conjugate of f is denoted by f⋆(θ) := sup_x ⟨θ, x⟩ − f(x). Given a norm ∥·∥, its dual norm is denoted by ∥θ∥⋆ := sup_{∥x∥≤1} ⟨θ, x⟩. We say that a convex function f is V-Lipschitz w.r.t. a norm ∥·∥ if for all x there exists v ∈ ∂f(x) with ∥v∥⋆ ≤ V. Of particular interest are the ℓp-norms, ∥x∥p = (∑ᵢ |xᵢ|^p)^{1/p}.

When dealing with matrices, we consider the vector space of real matrices of size k×d and the vector space of symmetric matrices of size d×d, both equipped with the inner product ⟨X, Y⟩ := tr(X⊤Y). Given a matrix X, the vector σ(X) contains the singular values of X in non-increasing order. For a symmetric matrix X, the vector λ(X) contains the eigenvalues of X arranged in non-increasing order.

### 2.2 Strong Convexity–Strong Smoothness Duality

Recall that the domain of f may be a proper subset of X (allowing f to take infinite values is the effective way to restrict its domain to a proper subset of X). We first define strong convexity.

A function f is β-strongly convex w.r.t. a norm ∥·∥ if for all x, y in the relative interior of the domain of f and α ∈ (0, 1) we have

 f(αx + (1−α)y) ≤ αf(x) + (1−α)f(y) − (β/2) α(1−α) ∥x−y∥²

We now define strong smoothness. Note that a strongly smooth function is always finite.

A function f is β-strongly smooth w.r.t. a norm ∥·∥ if f is everywhere differentiable and if for all x, y we have

 f(x+y) ≤ f(x) + ⟨∇f(x), y⟩ + (β/2) ∥y∥²

The following theorem states that strong convexity and strong smoothness are dual properties. Recall that the biconjugate f⋆⋆ equals f if and only if f is closed and convex.

(Strong/Smooth Duality) Assume that f is a closed and convex function. Then f is β-strongly convex w.r.t. a norm ∥·∥ if and only if f⋆ is (1/β)-strongly smooth w.r.t. the dual norm ∥·∥⋆.

Subtly, note that while the domain of a strongly convex function f may be a proper subset of X (important for a number of settings), its conjugate f⋆ always has all of X as its domain (since if a function is strongly smooth then it is finite and everywhere differentiable). The above theorem can be found, for instance, in Zalinescu [2002] (see Corollary 3.5.11 on p. 217 and Remark 3.5.3 on p. 218). In the machine learning literature, a proof of one direction (strong convexity ⇒ strong smoothness) can be found in Shalev-Shwartz [2007]. We could not find a proof of the reverse implication in a place easily accessible to machine learning researchers, so a self-contained proof is provided in the appendix.
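The duality is easy to check numerically in the simplest self-dual case. The sketch below (our own illustration, not from the paper) takes f(x) = ½∥x∥₂², which is 1-strongly convex w.r.t. ∥·∥₂ and whose conjugate f⋆ = f is 1-strongly smooth, and verifies both defining inequalities on random points:

```python
import numpy as np

rng = np.random.default_rng(0)
beta = 1.0  # f(x) = 0.5*||x||_2^2 is 1-strongly convex w.r.t. ||.||_2,
            # and its conjugate f*(v) = 0.5*||v||_2^2 is 1-strongly smooth

def f(x):
    return 0.5 * np.dot(x, x)

violations = 0
for _ in range(1000):
    x, y = rng.normal(size=5), rng.normal(size=5)
    a = rng.uniform()
    # strong convexity: f(ax+(1-a)y) <= a f(x) + (1-a) f(y) - (beta/2) a(1-a) ||x-y||^2
    lhs = f(a * x + (1 - a) * y)
    rhs = a * f(x) + (1 - a) * f(y) - 0.5 * beta * a * (1 - a) * np.dot(x - y, x - y)
    if lhs > rhs + 1e-9:
        violations += 1
    # strong smoothness of the conjugate (here f* = f and grad f*(x) = x):
    # f*(x+y) <= f*(x) + <grad f*(x), y> + (1/(2*beta)) ||y||^2
    if f(x + y) > f(x) + np.dot(x, y) + 0.5 / beta * np.dot(y, y) + 1e-9:
        violations += 1
print(violations)  # prints 0
```

For this particular f both inequalities actually hold with equality, which is why the Euclidean case is the degenerate "no gap" instance of the duality.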

The following direct corollary of Theorem 2.2 is central in proving both regret and generalization bounds.

If f is β-strongly convex w.r.t. ∥·∥ and f⋆(0) = 0, then, denoting the partial sum ∑_{j≤i} vⱼ by v₁:ᵢ, we have, for any sequence v₁, …, vₙ and for any u,

 ∑_{i=1}^n ⟨vᵢ, u⟩ − f(u) ≤ f⋆(v₁:ₙ) ≤ ∑_{i=1}^n ⟨∇f⋆(v₁:ᵢ₋₁), vᵢ⟩ + (1/(2β)) ∑_{i=1}^n ∥vᵢ∥⋆²

The first inequality is Fenchel–Young and the second follows from the definition of strong smoothness of f⋆ by induction.

### 2.3 Machine learning implications of the strong-convexity / strong-smoothness duality

We consider two learning models.

• Online convex optimization: Let W be a convex set. Online convex optimization is a two player repeated game. On round t of the game, the learner (first player) chooses wₜ ∈ W and the environment (second player) responds with a convex function lₜ : W → ℝ. The goal of the learner is to minimize its regret, defined as:

 (1/n) ∑_{t=1}^n lt(wt) − min_{w∈W} (1/n) ∑_{t=1}^n lt(w) .
• Batch learning of linear predictors: Let D be a distribution over X × Y. Our goal is to learn a prediction rule from X to Y. The prediction rule we use is based on a linear mapping x ↦ ⟨w, x⟩, and the quality of the prediction is assessed by a loss function l(⟨w, x⟩, y). Our primary goal is to find w that has low risk (a.k.a. generalization error), defined as L(w) := E[l(⟨w, x⟩, y)], where the expectation is with respect to D. To do so, we can sample n i.i.d. examples from D and observe the empirical risk, L̂(w) := (1/n) ∑_{i=1}^n l(⟨w, xᵢ⟩, yᵢ). The goal of the learner is to find ŵ with a low excess risk, defined as:

 L(ŵ) − min_{w∈W} L(w) ,

where W is a set of vectors that forms the comparison class.

We now seamlessly provide learning guarantees for both models based on Corollary 2.2. We start with the online convex optimization model.

#### Regret Bound for Online Convex Optimization

Algorithm 1 provides one common algorithm which achieves the following regret bound; it is one of a family of algorithms that enjoy the same regret bound (see Shalev-Shwartz [2007]).

(Regret) Suppose Algorithm 1 is used with a function f that is β-strongly convex w.r.t. a norm ∥·∥ on W and has f(w) ≥ 0 for all w ∈ W. Suppose the loss functions lₜ are convex and V-Lipschitz w.r.t. the dual norm ∥·∥⋆. Then, the algorithm run with any positive η enjoys the regret bound,

 ∑_{t=1}^T lt(wt) − min_{u∈W} ∑_{t=1}^T lt(u) ≤ max_{u∈W} f(u)/η + η V² T/(2β)

Apply Corollary 2.2 to the sequence −ηv₁, …, −ηv_T to get, for all u,

 −η ∑_{t=1}^T ⟨vₜ, u⟩ − f(u) ≤ −η ∑_{t=1}^T ⟨vₜ, wₜ⟩ + (1/(2β)) ∑_{t=1}^T ∥ηvₜ∥⋆² .

Using the fact that lₜ is V-Lipschitz, we get ∥vₜ∥⋆ ≤ V. Plugging this into the inequality above and rearranging gives ∑ₜ ⟨vₜ, wₜ⟩ − ∑ₜ ⟨vₜ, u⟩ ≤ f(u)/η + ηV²T/(2β). By convexity of lₜ, lₜ(wₜ) − lₜ(u) ≤ ⟨vₜ, wₜ − u⟩. Therefore, ∑ₜ lₜ(wₜ) − ∑ₜ lₜ(u) ≤ f(u)/η + ηV²T/(2β). Since the above holds for all u ∈ W, the result follows.
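The proof suggests an immediate implementation. The sketch below (our own illustration; names and constants are made up) instantiates the template with f(w) = ½∥w∥₂², so β = 1, ∇f⋆(θ) = θ, and the update wₜ = ∇f⋆(−η v₁:ₜ₋₁) is plain unconstrained dual averaging on linear losses. The computed regret against a fixed comparator is then checked against the bound from the theorem:

```python
import numpy as np

rng = np.random.default_rng(1)
d, T, eta = 5, 200, 0.05
vs = rng.normal(size=(T, d))           # gradients of linear losses l_t(w) = <v_t, w>
V = np.linalg.norm(vs, axis=1).max()   # Lipschitz constant w.r.t. ||.||_2

w, grad_sum, alg_loss = np.zeros(d), np.zeros(d), 0.0
for t in range(T):
    alg_loss += vs[t] @ w              # pay l_t(w_t)
    grad_sum += vs[t]
    w = -eta * grad_sum                # w_{t+1} = grad f*(-eta * v_{1:t})

u = 0.1 * np.ones(d)                   # any fixed comparator
regret = alg_loss - vs.sum(axis=0) @ u
bound = 0.5 * (u @ u) / eta + eta * V**2 * T / 2.0   # f(u)/eta + eta*V^2*T/(2*beta)
print(regret <= bound)  # prints True
```

Swapping in a different strongly convex f (e.g. the entropic or ℓq regularizers discussed later) changes only the map ∇f⋆ in the update.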

#### Generalization bound for the batch model via Rademacher analysis

Let (x₁, …, xₙ) be a training set obtained by sampling n i.i.d. examples from D. For a class F of real valued functions, define its Rademacher complexity on the sample to be

 RT(F) := E[ sup_{f∈F} (1/n) ∑_{i=1}^n εᵢ f(xᵢ) ] .

Here, the expectation is over the εᵢ's, which are i.i.d. Rademacher random variables, i.e. P(εᵢ = +1) = P(εᵢ = −1) = 1/2. It is well known that bounds on the Rademacher complexity of a class immediately yield generalization bounds for classifiers picked from that class (assuming the loss function is Lipschitz). Recently, Kakade et al. [2008] proved Rademacher complexity bounds for classes consisting of linear predictors using strong convexity arguments. We now give a quick proof of their main result using Corollary 2.2. This proof is essentially the same as their original proof but highlights the importance of Corollary 2.2.

(Generalization) Let f be a β-strongly convex function w.r.t. a norm ∥·∥ and assume that f⋆(0) = 0. Let X = {x : ∥x∥⋆ ≤ X} and W = {w : f(w) ≤ f_max}. Consider the class of linear functions, F = {x ↦ ⟨w, x⟩ : w ∈ W}. Then, for any dataset (x₁, …, xₙ) ∈ Xⁿ, we have

 RT(F) ≤ X √(2 f_max/(β n)) .

Let λ ≥ 0 and set vᵢ := λεᵢxᵢ. Apply Corollary 2.2 with this sequence to get,

 sup_{w∈W} ∑_{i=1}^n ⟨w, λεᵢxᵢ⟩ ≤ (λ²/(2β)) ∑_{i=1}^n ∥εᵢxᵢ∥⋆² + sup_{w∈W} f(w) + ∑_{i=1}^n ⟨∇f⋆(v₁:ᵢ₋₁), λεᵢxᵢ⟩ ≤ λ²X²n/(2β) + f_max + ∑_{i=1}^n ⟨∇f⋆(v₁:ᵢ₋₁), λεᵢxᵢ⟩ .

Now take expectation on both sides. The left hand side becomes λn·RT(F), and the last term on the right hand side becomes zero, since ∇f⋆(v₁:ᵢ₋₁) is independent of εᵢ and E[εᵢ] = 0. Dividing throughout by λn, we get RT(F) ≤ λX²/(2β) + f_max/(λn). Optimizing over λ gives the result.
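The bound is easy to see in action numerically. In the Euclidean case f(w) = ½∥w∥₂² (so β = 1, W = {w : ∥w∥₂ ≤ W}, f_max = W²/2) the theorem reads RT(F) ≤ XW/√n, and the empirical supremum has the closed form (W/n)·∥∑ᵢ εᵢxᵢ∥₂, so a Monte Carlo estimate is a few lines. All numbers below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, W = 50, 5, 2.0
xs = rng.normal(size=(n, d))
X = np.linalg.norm(xs, axis=1).max()   # radius of the data

# sup_{||w||_2 <= W} (1/n) sum_i eps_i <w, x_i> = (W/n) * || sum_i eps_i x_i ||_2
trials = [np.linalg.norm(rng.choice([-1.0, 1.0], size=n) @ xs) for _ in range(2000)]
estimate = W / n * np.mean(trials)     # Monte Carlo Rademacher complexity
bound = X * W / np.sqrt(n)             # X * sqrt(2 * f_max / (beta * n))
print(estimate <= bound)               # prints True, with room to spare
```

The slack comes from two places: E∥∑εᵢxᵢ∥ is below √(∑∥xᵢ∥²) by Jensen, and X√n replaces √(∑∥xᵢ∥²) by the worst-case example norm.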

Combining the above with the contraction lemma and standard Rademacher based generalization bounds (see e.g. Bartlett and Mendelson [2002], Kakade et al. [2008]) we obtain: Let f be a β-strongly convex function w.r.t. a norm ∥·∥ and assume that f⋆(0) = 0. Let X = {x : ∥x∥⋆ ≤ X} and W = {w : f(w) ≤ f_max}. Let l be a ρ-Lipschitz scalar loss function and let D be an arbitrary distribution over X × Y. Then, the algorithm that receives n i.i.d. examples and returns the ŵ that minimizes the empirical risk, L̂(w), satisfies

 E[L(ŵ) − min_{w∈W} L(w)] ≤ O(ρX √(f_max/(β n))) ,

where the expectation is with respect to the choice of the n i.i.d. examples. We note that it is also easy to obtain a generalization bound that holds with high probability, but for simplicity of the presentation we stick to expectations.

### 2.4 Strongly Convex Matrix Functions

Before we consider strongly convex matrix functions, let us recall the following result about the strong convexity of vector ℓq-norms. Its proof can be found e.g. in Shalev-Shwartz [2007].

Let q ∈ [1, 2]. The function Ψ : ℝᵈ → ℝ defined as Ψ(x) = ½∥x∥q² is (q−1)-strongly convex with respect to ∥·∥q over ℝᵈ.

We mainly use the above lemma to obtain results with respect to the norms ∥·∥₂ and ∥·∥₁. The case q = 2 is straightforward. Obtaining results with respect to ∥·∥₁ is slightly trickier since for q = 1 the strong convexity parameter is 0 (meaning that the function is not strongly convex). To overcome this problem, we shall set q to be slightly more than 1, e.g. q = ln(d)/(ln(d)−1). For this choice of q, the strong convexity parameter becomes q−1 = 1/(ln(d)−1) ≥ 1/ln(d), and the value of p corresponding to the dual norm is p = ln(d). Note that for any x ∈ ℝᵈ we have

 ∥x∥∞ ≤ ∥x∥p ≤ (d ∥x∥∞^p)^{1/p} = d^{1/p} ∥x∥∞ = e ∥x∥∞ ≤ 3 ∥x∥∞ .

Hence the dual norms ∥·∥q and ∥·∥₁ are also equivalent up to a factor of 3. The above lemma therefore implies the following corollary. The function Ψ : ℝᵈ → ℝ defined as Ψ(x) = ½∥x∥q² for q = ln(d)/(ln(d)−1) is 1/(3 ln(d))-strongly convex with respect to ∥·∥₁ over ℝᵈ.
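The chain of norm inequalities above is easy to confirm numerically for p = ln(d); the check below is our own illustration:

```python
import numpy as np

# Check: for p = ln(d),  ||x||_inf <= ||x||_p <= d^(1/p) ||x||_inf = e * ||x||_inf
rng = np.random.default_rng(3)
d = 1000
p = np.log(d)            # d^(1/p) = exp(ln(d)/p) = e
ok = True
for _ in range(100):
    x = rng.normal(size=d)
    lp = np.sum(np.abs(x) ** p) ** (1.0 / p)
    linf = np.abs(x).max()
    ok &= (linf <= lp + 1e-9) and (lp <= np.e * linf + 1e-9)
print(ok)  # prints True
```

This is exactly why the ℓ_{ln d} norm can stand in for ℓ∞ (and dually, ℓq for ℓ₁) at the price of a constant factor, while retaining a non-trivial strong convexity parameter.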

We now consider two families of strongly convex matrix functions.

#### Schatten q-norms

The first result we need is the counterpart of Lemma 2.4 for the q-Schatten norm, defined as ∥X∥S(q) := ∥σ(X)∥q. This result can be found in Ball et al. [1994].

(Schatten matrix functions) Let q ∈ [1, 2]. The function Ψ : ℝ^{k×d} → ℝ defined as Ψ(X) = ½∥X∥S(q)² is (q−1)-strongly convex w.r.t. the q-Schatten norm ∥·∥S(q) over ℝ^{k×d}.

As above, choosing q to be ln(d̄)/(ln(d̄)−1) for d̄ = min{k, d} gives the following corollary.

The function Ψ : ℝ^{k×d} → ℝ defined as Ψ(X) = ½∥X∥S(q)² for q = ln(d̄)/(ln(d̄)−1), where d̄ = min{k, d}, is 1/(3 ln(d̄))-strongly convex with respect to ∥·∥S(1) over ℝ^{k×d}.
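Computationally, a Schatten q-norm is just the ℓq norm of the singular value vector. The small sketch below (the helper name `schatten` is ours) also checks the two familiar special cases, the Frobenius norm (q = 2) and the trace norm (q = 1):

```python
import numpy as np

def schatten(X, q):
    """Schatten q-norm: the l_q norm of the singular values of X."""
    s = np.linalg.svd(X, compute_uv=False)
    return np.sum(s ** q) ** (1.0 / q)

rng = np.random.default_rng(4)
A = rng.normal(size=(6, 4))
sigma = np.linalg.svd(A, compute_uv=False)

print(np.isclose(schatten(A, 2), np.linalg.norm(A, 'fro')))  # prints True
print(np.isclose(schatten(A, 1), sigma.sum()))               # prints True
```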

#### Group Norms.

Let X = (X¹, …, Xⁿ) be a real matrix with columns Xⁱ. We define the norm ∥X∥r,p as

 ∥X∥r,p := ∥(∥X¹∥r, …, ∥Xⁿ∥r)∥p .

That is, we apply the ℓr norm to each column of X to get a vector in ℝⁿ, to which we apply the ℓp norm to get the value of ∥X∥r,p. It is easy to check that this is indeed a norm. The dual of ∥·∥r,p is ∥·∥r⋆,p⋆, where 1/r + 1/r⋆ = 1 and 1/p + 1/p⋆ = 1. The following theorem, which appears in a slightly weaker form in Juditsky and Nemirovski [2008], provides us with an easy way to construct strongly convex group norms. We provide a proof in the appendix which is much simpler than that of Juditsky and Nemirovski [2008] and is completely “calculus free”.
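The definition translates directly into code; the helper below (our own illustration) applies the ℓr norm column-wise and then the ℓp norm to the result:

```python
import numpy as np

def group_norm(X, r, p):
    """||X||_{r,p}: l_r norm of each column, then l_p norm of those values."""
    col_norms = np.sum(np.abs(X) ** r, axis=0) ** (1.0 / r)
    return np.sum(col_norms ** p) ** (1.0 / p)

A = np.array([[3.0, 0.0],
              [4.0, 1.0]])
# column l_2 norms are 5 and 1, so ||A||_{2,1} = 6,
# and ||A||_{2,2} is just the Frobenius norm
print(group_norm(A, 2, 1))  # prints 6.0
```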

(Group Norms) Let Φ₁, Φ₂ be absolutely symmetric norms on ℝᵈ and ℝⁿ, respectively, and consider the group norm ∥X∥ := Φ₂(Φ₁(X¹), …, Φ₁(Xⁿ)). Let Φ₂∘√ denote the following function,

 (Φ₂∘√)(x) := Φ₂(√x₁, …, √xₙ) . (1)

Suppose Eq. (1) defines a norm on ℝⁿ₊. Further, let the functions ½Φ₁² and ½Φ₂² be β₁- and β₂-smooth w.r.t. Φ₁ and Φ₂ respectively. Then, ½∥·∥² is (β₁+β₂)-smooth w.r.t. the group norm ∥·∥.

The condition that Eq. (1) be a norm appears strange but in fact it already occurs in the literature. Norms satisfying it are called quadratic symmetric gauge functions (or Q-norms) [Bhatia, 1997, p. 89]. It is easy to see that ∥·∥p for p ≥ 2 is a Q-norm. Now using strong convexity/strong smoothness duality and the discussion preceding Corollary 2.4, we get the following corollary.

The function Ψ : ℝ^{d×n} → ℝ defined as Ψ(X) = ½∥X∥₂,q² for q = ln(n)/(ln(n)−1) is 1/(3 ln(n))-strongly convex with respect to ∥·∥₂,₁ over ℝ^{d×n}.

### 2.5 Putting it all together

Combining Lemma 2.4 and Corollary 2.4 with the bounds given in Theorem 2.3 and Corollary 2.3, we obtain the following two corollaries. Let W = {w : ∥w∥₁ ≤ W} and let l₁, …, lₙ be a sequence of functions which are X-Lipschitz w.r.t. ∥·∥∞. Then, there exists an online algorithm with a regret bound of the form

 (1/n) ∑_{t=1}^n lt(wt) − min_{w∈W} (1/n) ∑_{t=1}^n lt(w) ≤ O(X W √(ln(d)/n)) .

Let W = {w : ∥w∥₁ ≤ W} and let X = {x : ∥x∥∞ ≤ X}. Let l be a ρ-Lipschitz scalar loss function and let D be an arbitrary distribution over X × Y. Then, there exists a batch learning algorithm that returns a vector ŵ such that

 E[L(ŵ) − min_{w∈W} L(w)] ≤ O(X W √(ln(d)/n)) .

Results of the same flavor can be obtained for learning matrices. For simplicity, we present the following two corollaries only for the online model, but it is easy to derive their batch counterparts.

Let W = {W : ∥W∥₂,₁ ≤ W} and let l₁, …, lₙ be a sequence of functions which are X-Lipschitz w.r.t. ∥·∥₂,∞. Then, there exists an online algorithm with a regret bound of the form

 (1/n) ∑_{t=1}^n lt(Wt) − min_{W∈W} (1/n) ∑_{t=1}^n lt(W) ≤ O(X W √(ln(d)/n)) .

Let W = {W : ∥W∥S(1) ≤ W} and let l₁, …, lₙ be a sequence of functions which are X-Lipschitz w.r.t. ∥·∥S(∞). Then, there exists an online algorithm with a regret bound of the form

 (1/n) ∑_{t=1}^n lt(Wt) − min_{W∈W} (1/n) ∑_{t=1}^n lt(W) ≤ O(X W √(ln(min{k,d})/n)) .

## 3 Matrix Regularization

We are now ready to demonstrate the power of the general techniques we derived in the previous section. Consider a learning problem (either online or batch) in which each example is a matrix X of size k×d and we would like to learn a linear predictor of the form ⟨W, X⟩, where W is also a matrix of the same dimensions. The loss function takes the form l(⟨W, X⟩, y), and we assume for simplicity that l is ρ-Lipschitz with respect to its first argument. For example, l can be the absolute loss, l(a, y) = |a − y|, or the hinge-loss, l(a, y) = max{0, 1 − ya}.

For the sake of concreteness, let us focus on the batch learning setting, but we note that the discussion below is relevant to the online learning model as well. Our prior knowledge on the learning problem is encoded by the definition of the comparison class W that we use. In particular, all the comparison classes we use take the form {W : ∥W∥ ≤ c} for some norm ∥·∥ and radius c; the only difference is which norm we use. We shall compare the following four classes:

 W1,1 = {W : ∥W∥1,1 ≤ W1,1}
 W2,2 = {W : ∥W∥2,2 ≤ W2,2}
 W2,1 = {W : ∥W∥2,1 ≤ W2,1}
 WS(1) = {W : ∥W∥S(1) ≤ WS(1)}

Let us denote by Xr,p a bound on the ∥·∥r,p norm of the data matrices; we define XS(q) analogously. Applying the results of the previous section to these classes, we obtain the bounds given in Table 1, where for simplicity we ignore constants.
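Before comparing the classes, it is worth noting that for any fixed matrix the four norms are always ordered ∥W∥₂,₂ ≤ ∥W∥S(1) ≤ ∥W∥₂,₁ ≤ ∥W∥₁,₁, so at the same radius the classes are nested in the reverse order. A quick numerical check (our own illustration):

```python
import numpy as np

rng = np.random.default_rng(5)
W = rng.normal(size=(4, 6))

n11 = np.abs(W).sum()                           # ||W||_{1,1}: entrywise l_1
n22 = np.linalg.norm(W, 'fro')                  # ||W||_{2,2}: entrywise l_2
n21 = np.linalg.norm(W, axis=0).sum()           # ||W||_{2,1}: sum of column l_2 norms
ns1 = np.linalg.svd(W, compute_uv=False).sum()  # ||W||_{S(1)}: trace norm

print(n22 <= ns1 <= n21 <= n11)  # prints True
```

The first inequality is ∥σ∥₂ ≤ ∥σ∥₁, the second follows by decomposing W into its rank-one column terms, and the third is the column-wise ℓ₂ ≤ ℓ₁ inequality.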

Let us now discuss which class should be used based on prior knowledge about the learning problem. We start with the well known difference between W1,1 and W2,2. Note that both of these classes ignore the fact that W is organized as a matrix and simply refer to W as a single vector of dimension kd. The difference between W1,1 and W2,2 is therefore the usual difference between ℓ₁ and ℓ₂ regularization. To understand this difference, suppose that W⋆ is some matrix that performs well on the distribution we have. Then, we should take the radius of each class to be the minimal possible while still containing W⋆, namely, either ∥W⋆∥₁,₁ or ∥W⋆∥₂,₂. Clearly, ∥W⋆∥₂,₂ ≤ ∥W⋆∥₁,₁, and in terms of this term there is a clear advantage to using the class W2,2. On the other hand, X∞,∞ ≤ X₂,₂, so the data-dependent term favors W1,1. We therefore need to understand which of these inequalities is more important. Of course, in general, the answer to this question is data dependent. However, we can isolate properties of the distribution that can help us choose the better class.

One useful property is sparsity of either W⋆ or the examples X. If W⋆ is assumed to be sparse (i.e., it has at most s non-zero elements), then we have ∥W⋆∥₁,₁ ≤ √s ∥W⋆∥₂,₂. That is, for a small s, the difference between the two radii is small. In contrast, if W⋆ is very dense and each of its entries is bounded away from zero, then ∥W⋆∥₁,₁ can be order of √(kd) larger than ∥W⋆∥₂,₂. The same arguments hold for the examples X. Hence, with prior knowledge about the sparsity of W⋆ and X we can guess which of the bounds will be smaller.
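The sparsity bound is just Cauchy–Schwarz restricted to the s non-zero entries; a small numerical illustration (our own, with made-up matrices):

```python
import numpy as np

rng = np.random.default_rng(6)

# s-sparse matrix: ||W||_{1,1} <= sqrt(s) * ||W||_{2,2}
s = 7
W = np.zeros((10, 10))
idx = rng.choice(100, size=s, replace=False)
W.flat[idx] = rng.normal(size=s)
print(np.abs(W).sum() <= np.sqrt(s) * np.linalg.norm(W, 'fro'))  # prints True

# dense +-1 matrix: the ratio ||.||_{1,1} / ||.||_{2,2} reaches sqrt(k*d) = 10
dense = np.sign(rng.normal(size=(10, 10)))
print(np.isclose(np.abs(dense).sum(), 10 * np.linalg.norm(dense, 'fro')))
```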

Next, we tackle the more interesting cases of W2,1 and WS(1). For the former, recall that to compute ∥W∥₂,₁ we first apply the ℓ₂ norm to each column of W and then apply the ℓ₁ norm to the obtained vector of norm values. Similarly, to calculate ∥X∥₂,∞ we first apply the ℓ₂ norm to the columns of X and then apply the ℓ∞ norm to the obtained vector of norm values. Let us now compare W2,1 to W1,1. Suppose that the columns of X are very sparse. Then the ℓ₂ norm of each column of X is very close to its ℓ∞ norm. On the other hand, if some of the columns of W⋆ are dense, then ∥W⋆∥₂,₁ can be order of √k smaller than ∥W⋆∥₁,₁. In that case, the class W2,1 is preferable over the class W1,1. As we show later, this is the case in multi-class problems, and we shall indeed present an improved multi-class algorithm that uses the class W2,1. Of course, in some problems the columns of X might be very dense while the columns of W⋆ can be sparse. In such cases, using W1,1 is better than using W2,1.

Now let us compare W2,1 to W2,2. Similarly to the previous discussion, choosing W2,1 over W2,2 makes sense if we assume that the vector of ℓ₂ norms of the columns of W⋆ is sparse. This implies that we assume a “group”-sparsity pattern of W⋆, i.e., each column of W⋆ is either the all-zeros column or is dense. This type of grouped sparsity has been studied in the context of the group Lasso and multi-task learning. Indeed, we present bounds for multi-task learning that rely on this assumption. Without the group-sparsity assumption, it might be better to use W2,2 than W2,1.

Finally, we discuss when it makes sense to use WS(1). Recall that ∥W∥S(1) = ∥σ(W)∥₁, where σ(W) is the vector of singular values of W, and ∥X∥S(∞) = ∥σ(X)∥∞. Therefore, the class WS(1) should be used when we assume that the spectrum of W⋆ is sparse while the spectrum of X is dense. This means that the prior knowledge we employ is that W⋆ is of low rank while X is of high rank. Note that ∥W∥₂,₂ can be defined equivalently as ∥σ(W)∥₂. Therefore, the difference between WS(1) and W2,2 is similar to the difference between W1,1 and W2,2, just that instead of considering sparsity properties of the elements of W⋆ and X we consider sparsity properties of their spectra.
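The spectral analogue of the earlier sparsity bound can be checked the same way: for a rank-s matrix, ∥W∥S(1) ≤ √s ∥W∥S(2) by Cauchy–Schwarz on the (at most s) non-zero singular values. A small sketch (our own illustration):

```python
import numpy as np

rng = np.random.default_rng(7)
s = 2
W = rng.normal(size=(8, s)) @ rng.normal(size=(s, 8))  # rank <= s
sig = np.linalg.svd(W, compute_uv=False)               # s significant singular values

# trace norm <= sqrt(rank) * Frobenius norm
print(sig.sum() <= np.sqrt(s) * np.linalg.norm(sig))   # prints True
```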

In the next sections we demonstrate how to apply the general methodology described above in order to derive a few generalization and regret bounds for problems of recent interest.

## 4 Multi-Task Learning

Suppose we are simultaneously solving k multivariate prediction problems, where each learning example is of the form (X, y): X is a matrix whose rows are the example vectors of the different tasks, and y ∈ ℝᵏ contains the responses for the k problems. To predict the responses, we learn a matrix W such that each row of W predicts the response of the corresponding task. In this section, we denote row j of W by wⱼ and row j of X by xⱼ. The predictor for the j-th task is therefore ⟨wⱼ, xⱼ⟩. The quality of a prediction for the j-th task is assessed by a loss function lⱼ, and the total loss of W on an example (X, y) is defined to be the sum of the individual losses,

 l(W, X, y) = ∑_{j=1}^k lⱼ(⟨wⱼ, xⱼ⟩, yⱼ) .

This formulation allows us to mix regression and classification problems and even use different loss functions for different tasks. Such “heterogeneous” multi-task learning has attracted recent attention [Yang et al., 2009].
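As a toy illustration of this heterogeneous loss (all numbers here are made up), the following sums an absolute-loss regression task and a hinge-loss classification task:

```python
import numpy as np

W = np.array([[0.5, -0.2, 0.1],
              [0.3,  0.8, -0.5]])   # one weight row per task
X = np.array([[1.0,  2.0, 0.0],
              [0.5, -1.0, 1.0]])    # one example row per task
y = np.array([0.3, 1.0])            # real-valued target, then a +-1 label

preds = np.sum(W * X, axis=1)       # <w_j, x_j> for each task j
losses = [abs(preds[0] - y[0]),               # task 1: absolute loss
          max(0.0, 1.0 - y[1] * preds[1])]    # task 2: hinge loss
total = sum(losses)                 # l(W, X, y), summed over tasks
```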

If the tasks are related, then it is natural to use regularizers that “couple” the tasks together so that similarities across tasks can be exploited. Considerations of common sparsity patterns (the same features being relevant across different tasks) lead to the use of group norm regularizers (i.e. the comparison class W2,1 defined in the previous section), while rank considerations (the wⱼ's lying in a low dimensional linear space) lead to the use of unitarily invariant norms as regularizers (i.e. the comparison class WS(1)).

We now describe online and batch multi-task learning using the different matrix norms.

In the online model, on round t the learner first uses Wₜ to predict the vector of responses and then it pays the cost lₜ(Wₜ) := l(Wₜ, Xₜ, yₜ). Let Vₜ be a sub-gradient of lₜ at Wₜ. It is easy to verify that the j-th row of Vₜ is a sub-gradient of lⱼ at Wₜ. Assuming that each lⱼ is ρ-Lipschitz with respect to its first argument, we obtain that the j-th row of Vₜ equals αⱼ xⱼ for some αⱼ ∈ [−ρ, ρ]; in other words, each row of Vₜ is a bounded rescaling of the corresponding row of Xₜ. It is easy to verify that ∥Vₜ∥r,p ≤ ρ ∥Xₜ∥r,p for any r, p. In addition, since any Schatten norm is sub-multiplicative, we also have ∥Vₜ∥S(q) ≤ ρ ∥Xₜ∥S(q). We therefore obtain the following: let W1,1, W2,2, W2,1, WS(1) be the classes defined in Section 3, with radii measured w.r.t. the corresponding norms. Then, there exist online multi-task learning algorithms with regret bounds according to Table 1.

Let us now discuss a few implications of these bounds and, for simplicity, assume that $\rho = 1$. Recall that each column of $X$ represents the value of a single feature for all the tasks. As discussed in the previous section, if the matrix $X$ is dense and if we assume that a good predictor $W^\star$ is sparse, then using the class $\mathcal{W}_{1,1}$ is better than using $\mathcal{W}_{2,2}$. Such a scenario often happens when we have many irrelevant features and only few features that can predict the target reasonably well. Concretely, suppose that $X \in \{-1, +1\}^{k \times d}$, so that all of its $kd$ values are non-zero. Suppose also that there exists a matrix $W^\star$ with entries in $[-1, +1]$ that predicts the targets of the different tasks reasonably well and has only $s$ non-zero values. Then, the bound for $\mathcal{W}_{1,1}$ is order of $s \sqrt{\ln(kd)/n}$ while the bound for $\mathcal{W}_{2,2}$ is order of $\sqrt{s k d / n}$. Thus, $\mathcal{W}_{1,1}$ will be better whenever $s \ln(kd) \ll kd$.

Now, consider the class $\mathcal{W}_{2,1}$. Let us further assume the following: the $s$ non-zero elements of $W^\star$ are grouped into $s_0$ columns and are roughly distributed evenly over those columns, and the non-zeros of $X$ are roughly distributed evenly over its columns. Then, the bound for $\mathcal{W}_{2,1}$ is order of $\sqrt{s\, s_0\, k \ln(d)/n}$. This bound will be better than the bound of $\mathcal{W}_{1,1}$ if $s_0\, k \ln(d) \ll s \ln(kd)$ and will be better than the bound of $\mathcal{W}_{2,2}$ if $s_0 \ln(d) \ll d$. We see that there are scenarios in which the group norm is better than the non-grouped norms, and that the most adequate class depends on properties of the problem and our prior beliefs about a good predictor $W^\star$.

As to the bound for $\mathcal{W}_{S(1)}$, it is easy to verify that if the rows of $W^\star$ sit in a low dimensional subspace then the spectrum of $W^\star$ will be sparse, so the trace norm $\|W^\star\|_{S(1)}$ will be small. Similarly, the value of the dual norm $\|X\|_{S(\infty)}$ depends on the maximal singular value of $X$, which is likely to be small if we assume that all the “energy” of $X$ is spread over its entire spectrum. In such cases, $\mathcal{W}_{S(1)}$ can be the best choice. This is an example of a different type of prior knowledge on the problem.
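The norms compared in this discussion are easy to evaluate numerically. A minimal sketch (helper names are ours; we take groups to be columns, matching the feature-grouping discussion above) computing group and Schatten norms and checking that a rank-1 matrix indeed has a maximally sparse spectrum:

```python
import numpy as np

def group_norm(W, r, p):
    """||W||_{r,p}: the p-norm of the vector of r-norms of W's columns."""
    col = np.linalg.norm(W, ord=r, axis=0)
    return np.linalg.norm(col, ord=p)

def schatten_norm(W, p):
    """Schatten p-norm: the p-norm of the singular values of W."""
    s = np.linalg.svd(W, compute_uv=False)
    return np.linalg.norm(s, ord=p)

k, d = 4, 100
rng = np.random.default_rng(0)
# Rank-1 W: all rows lie in a 1-dimensional subspace, so the spectrum
# has a single non-zero entry and trace norm equals Frobenius norm.
W_lowrank = rng.standard_normal((k, 1)) @ rng.standard_normal((1, d))
print(schatten_norm(W_lowrank, 1) / np.linalg.norm(W_lowrank))          # ~1
print(group_norm(W_lowrank, 2, 1) >= schatten_norm(W_lowrank, 1) - 1e-9)  # True
```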

In the batch setting we see a dataset $(X_1, y_1), \ldots, (X_n, y_n)$ consisting of $n$ i.i.d. samples drawn from a distribution over $\mathcal{X} \times \mathcal{Y}$. In the $k$-task setting, $\mathcal{X} \subseteq \mathbb{R}^{k \times d}$ and $\mathcal{Y} \subseteq \mathbb{R}^k$. Analogous to the single task case, we define the risk and empirical risk of a multitask predictor $W$ as:

$$L(W) = \mathbb{E}_{(X,y)}\big[\ell(W, X, y)\big]\,, \qquad \hat{L}(W) = \frac{1}{n} \sum_{i=1}^{n} \ell(W, X_i, y_i)\,.$$

Let $\mathcal{W}$ be some class of matrices, and define the empirical risk minimizer $\hat{W} = \operatorname{argmin}_{W \in \mathcal{W}} \hat{L}(W)$. To obtain excess risk bounds for $\hat{W}$, we need to consider the $k$-task Rademacher complexity

$$R_n(\mathcal{W}) = \mathbb{E}\Big[\sup_{W \in \mathcal{W}} \frac{1}{n} \sum_{i=1}^{n} \sum_{j=1}^{k} \epsilon_i^j \langle w^j, X_i^j \rangle\Big]$$

because, assuming each $\ell_j$ is $\rho$-Lipschitz, we have the bound $\mathbb{E}[L(\hat{W})] - \min_{W \in \mathcal{W}} L(W) \le 2 \rho\, R_n(\mathcal{W})$. This bound follows easily from Talagrand’s contraction inequality and Thm. 8 in Maurer [2006]. We can use matrix strong convexity to give the following $k$-task Rademacher bound. (Multitask Generalization) Suppose $F(W) \le F_{\max}$ for all $W \in \mathcal{W}$, for a function $F$ that is $\beta$-strongly convex w.r.t. some (matrix) norm $\|\cdot\|$. If the norm is invariant under sign changes of the rows of its argument matrix then, for any dataset, we have $R_n(\mathcal{W}) \le X \sqrt{2 F_{\max} / (\beta n)}$, where $X$ is an upper bound on $\max_i \|X_i\|_\star$ (the dual norm). We can rewrite $R_n(\mathcal{W})$ as

$$\mathbb{E}\Big[\sup_{W \in \mathcal{W}} \frac{1}{n} \sum_{i=1}^{n} \sum_{j=1}^{k} \epsilon_i^j \langle w^j, X_i^j \rangle\Big] = \mathbb{E}\Big[\sup_{W \in \mathcal{W}} \frac{1}{n} \sum_{j=1}^{k} \Big\langle w^j, \sum_{i=1}^{n} \epsilon_i^j X_i^j \Big\rangle\Big] = \mathbb{E}\Big[\sup_{W \in \mathcal{W}} \frac{1}{n} \Big\langle W, \sum_{i=1}^{n} \tilde{X}_i \Big\rangle\Big]\,,$$

where $\tilde{X}_i$ is defined by its rows, $\tilde{X}_i^j = \epsilon_i^j X_i^j$, and we have switched to a matrix inner product in the last step. By the assumed invariance of the norm under row sign changes, the dual norm satisfies $\|\tilde{X}_i\|_\star = \|X_i\|_\star$. Now using Corollary 2.2 and proceeding as in the proof of Theorem 2.3, we get a bound for any $\lambda > 0$; optimizing over $\lambda$ proves the theorem.
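The rewriting in the proof is also easy to check numerically: for a norm ball, the supremum of the matrix inner product is the ball radius times the dual norm of the sign-weighted sum. A Monte-Carlo sketch (our own, for the Frobenius ball, whose dual norm is again the Frobenius norm):

```python
import numpy as np

def rademacher_frobenius(Xs, B, trials=2000, rng=None):
    """Monte-Carlo estimate of the k-task Rademacher complexity of the
    Frobenius ball {W : ||W||_F <= B}.

    By the rewriting in the text, the supremum over the ball equals
    B * || sum_i eps_i^j X_i^j ||_F / n, since Frobenius is self-dual.
    """
    if rng is None:
        rng = np.random.default_rng(0)
    n, k, d = Xs.shape
    total = 0.0
    for _ in range(trials):
        eps = rng.choice([-1.0, 1.0], size=(n, k, 1))  # one sign per (i, j)
        total += np.linalg.norm((eps * Xs).sum(axis=0))
    return B * total / (trials * n)

n, k, d = 50, 3, 8
Xs = np.random.default_rng(1).standard_normal((n, k, d))
print(rademacher_frobenius(Xs, B=1.0))  # decays like O(1/sqrt(n))
```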

Note that both group $(r,p)$-norms and Schatten-$p$ norms satisfy the invariance under row sign changes mentioned in the theorem above. Thus, we get the following corollary.

Let $\mathcal{W}$ be any of the classes defined in Section 3 and let $B$ be its radius w.r.t. the corresponding norm. Then, the (expected) excess multitask risk of the empirical multitask risk minimizer $\hat{W}$ satisfies the same bounds given in Table 1.

## 5 Multi-class learning

In this section we consider multi-class categorization problems. We focus on the online learning model. On round $t$, the online algorithm receives an instance $x_t \in \mathbb{R}^d$ and is required to predict its label as a number in $\{1, \ldots, k\}$. Following the construction of Crammer and Singer [2000], the prediction is based on a matrix $W \in \mathbb{R}^{k \times d}$ and is defined as the index of the maximal element of the vector $W x_t$. We use the hinge-loss function adapted to the multi-class setting. That is,

$$\ell(W, (x, y)) = \max_{r \neq y} \big[1 - \langle W, \Delta^r \rangle\big]_+ \,,$$

where $\Delta^r$ is a matrix with $x$ on the $y$'th row, $-x$ on the $r$'th row, and zeros in all other elements. It is easy to verify that $\ell$ upper bounds the zero-one loss, i.e. if the prediction of $W$ on $x$ is different from $y$ then $\ell(W, (x, y)) \ge 1$.
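A minimal sketch of this prediction rule and loss (function names are ours), written with an explicit maximum over the competing classes:

```python
import numpy as np

def predict(W, x):
    """Crammer-Singer style prediction: index of the largest entry of W x."""
    return int(np.argmax(W @ x))

def multiclass_hinge(W, x, y):
    """max over r != y of [1 - (<w^y, x> - <w^r, x>)]_+ ; upper bounds 0-1 loss."""
    scores = W @ x
    margins = 1.0 - (scores[y] - np.delete(scores, y))
    return float(np.maximum(margins, 0.0).max())

W = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])
x = np.array([2.0, 0.0])
y = 0
print(predict(W, x))              # 0: class 0 has the largest score
print(multiclass_hinge(W, x, y))  # 0.0: margin of at least 1 over every other class
```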

A sub-gradient of $\ell(\cdot, (x, y))$ at $W$ is either a matrix of the form $-\Delta^r$ (for a maximizing $r$) or the all zeros matrix. Note that each column of $\Delta^r$ is very sparse (it contains only two non-zero elements). Therefore, the norms of the sub-gradient are controlled by those of $x$: its largest entry is at most $\|x\|_\infty$ and its spectral norm is $\sqrt{2}\, \|x\|_2$.
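The two-nonzeros-per-column structure of this sub-gradient can be checked directly. A sketch (our own construction for the multi-class hinge; `hinge_subgradient` is a name we introduce):

```python
import numpy as np

def hinge_subgradient(W, x, y):
    """A sub-gradient of the multi-class hinge loss at W.

    Either all zeros (no margin violation) or a matrix with -x on the
    y'th row and +x on the most violating row r, zeros elsewhere -- so
    every column has at most two non-zero entries.
    """
    scores = W @ x
    s = scores.copy()
    s[y] = -np.inf
    r = int(np.argmax(s))  # most violating competitor
    V = np.zeros_like(W)
    if 1.0 - (scores[y] - scores[r]) > 0.0:
        V[y] = -x
        V[r] = x
    return V

rng = np.random.default_rng(0)
W = rng.standard_normal((5, 10))
x = rng.standard_normal(10)
V = hinge_subgradient(W, x, y=0)
print(np.count_nonzero(V, axis=0).max() <= 2)  # True: two non-zeros per column
# Spectral norm of a matrix with rows +x and -x is sqrt(2)*||x||_2.
print(np.linalg.svd(V, compute_uv=False)[0] <= np.sqrt(2) * np.linalg.norm(x) + 1e-9)  # True
```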

Based on this fact, we can easily obtain the following. Let $\mathcal{W}_{1,1}, \mathcal{W}_{2,2}, \mathcal{W}_{2,1}, \mathcal{W}_{S(1)}$ be the classes defined in Section 3 and let $X_\infty = \max_t \|x_t\|_\infty$ and $X_2 = \max_t \|x_t\|_2$. Then, there exist online multi-class learning algorithms with regret bounds given by the following table.

| class | bound |
| --- | --- |
| $\mathcal{W}_{1,1}$ | $X_\infty \sqrt{\ln(kd)/n}$ |
| $\mathcal{W}_{2,2}$ | $X_2 \sqrt{1/n}$ |
| $\mathcal{W}_{2,1}$ | $X_\infty \sqrt{\ln(d)/n}$ |
| $\mathcal{W}_{S(1)}$ | $X_2 \sqrt{\ln(\min\{d,k\})/n}$ |

Let us now discuss the implications of this bound. First, if $X_2 \approx X_\infty$, which will happen if instance vectors are sparse, then $\mathcal{W}_{1,1}$ and $\mathcal{W}_{2,1}$ will be inferior to $\mathcal{W}_{2,2}$. In such a case, using $\mathcal{W}_{S(1)}$ can be even better if $W$ sits in a low dimensional space but each row of $W$ still has a unit norm. Using $\mathcal{W}_{S(1)}$ in such a case was previously suggested by Amit et al. [2007], who observed that empirically, the class $\mathcal{W}_{S(1)}$ performs better than $\mathcal{W}_{2,2}$ when there is a shared structure between classes. The analysis given in Corollary 5 provides a first rigorous explanation of such behavior.

Second, if $d$ is much larger than $k$, and if the columns of $W$ share a common sparsity pattern, then $\mathcal{W}_{2,1}$ can be a factor of $\sqrt{k}$ better than $\mathcal{W}_{1,1}$ and a factor of order $\sqrt{d}$ better than $\mathcal{W}_{2,2}$ (up to logarithmic factors). To demonstrate this, let us assume that each vector $x_t$ is in $\{\pm 1\}^d$ and represents the advice of $d$ experts; therefore, $X_\infty = 1$ and $X_2 = \sqrt{d}$. Next, assume that a combination of the advice of $r$ experts predicts the correct label very well (e.g., the label is represented by the binary number obtained from the advice of $r = \lceil \log_2 k \rceil$ experts). In that case, $W$ will be a matrix all of whose columns are zero except for $r$ columns, which take values in $\{\pm 1\}$. The bounds for $\mathcal{W}_{2,1}$ and $\mathcal{W}_{1,1}$ in that case become order of $r \sqrt{k \ln(d)/n}$ and $r k \sqrt{\ln(kd)/n}$, respectively.
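This expert scenario is easy to verify numerically. A small sketch (dimensions chosen arbitrarily by us) confirming the norm radii that drive the comparison, with the $\mathcal{W}_{1,1}$-to-$\mathcal{W}_{2,1}$ ratio coming out to $\sqrt{k}$:

```python
import numpy as np

k, d, r = 8, 1000, 3  # k classes, d experts, r relevant experts
rng = np.random.default_rng(0)
W = np.zeros((k, d))
W[:, :r] = rng.choice([-1.0, 1.0], size=(k, r))  # +/-1 values on r columns only

l11 = np.abs(W).sum()                  # ||W||_{1,1} = r * k
l21 = np.linalg.norm(W, axis=0).sum()  # ||W||_{2,1} = r * sqrt(k)  (column groups)
fro = np.linalg.norm(W)                # ||W||_F    = sqrt(r * k)
print(l11 / l21)  # sqrt(k): the factor separating the two regret bounds
```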