# Optimistic bounds for multi-output prediction

We investigate the challenge of multi-output learning, where the goal is to learn a vector-valued function based on a supervised data set. This includes a range of important problems in Machine Learning including multi-target regression, multi-class classification and multi-label classification. We begin our analysis by introducing the self-bounding Lipschitz condition for multi-output loss functions, which interpolates continuously between a classical Lipschitz condition and a multi-dimensional analogue of a smoothness condition. We then show that the self-bounding Lipschitz condition gives rise to optimistic bounds for multi-output learning, which are minimax optimal up to logarithmic factors. The proof exploits local Rademacher complexity combined with a powerful minoration inequality due to Srebro, Sridharan and Tewari. As an application we derive a state-of-the-art generalization bound for multi-class gradient boosting.

## 1 Introduction

Multi-output prediction represents an important class of problems that includes multi-class classification Crammer and Singer (2001), multi-label classification Tsoumakas and Katakis (2007); Zhang and Zhou (2013), multi-target regression Borchani et al. (2015), label distribution learning Geng (2016), structured regression Cortes et al. (2016) and others, with a wide range of practical applications Xu et al. (2019).

Our objective is to provide a general framework for establishing guarantees for multi-output prediction problems. A fundamental challenge in the statistical learning theory of multi-output prediction is to obtain bounds which allow for (i) a favourable convergence rate with the sample size, and (ii) a favourable dependence of the risk on the dimensionality of the output space. Whilst modern applications of multi-output prediction deal with increasingly large data sets, they also incorporate problems where the target dimensionality is increasingly large. For example, the number of categories in multi-label classification is often of the order of tens of thousands, an emergent problem referred to as extreme classification Agrawal et al. (2013); Babbar and Schölkopf (2017); Bhatia et al. (2015); Jain et al. (2019).

Formally, the task of multi-output prediction is to learn a vector-valued function from a labelled training set. A common tool in the theoretical analysis of this problem has been a vector-valued extension of Talagrand’s contraction inequality for Lipschitz losses Ledoux and Talagrand (2013). Both Maurer (2016) and Cortes et al. (2016) established vector-contraction inequalities for Rademacher complexity which gave rise to learning guarantees for multi-output prediction problems with a linear dependence upon the dimensionality of the output space. More recently, Lei et al. (2019) have provided more refined vector-contraction inequalities for both Gaussian and Rademacher complexity. This approach leads to a highly favourable sub-linear dependence upon the output dimensionality, which can even be logarithmic, depending upon the degree of regularisation. However, these structural results lead to a slow convergence rate of order $O(n^{-1/2})$. Guermeur (2017) and Musayeva et al. (2019) explore an alternative approach based on covering numbers. Chzhen et al. (2017) derived a bound for multi-label classification based upon Rademacher complexities. Each of these bounds gives rise to a favourable dependence upon the dimensionality of the output space, with a rate of order $O(n^{-1/2})$.

Local Rademacher complexities provide a crucial tool in establishing faster rates of convergence Bousquet (2002); Bartlett et al. (2005); Koltchinskii et al. (2006); Lei et al. (2016). By leveraging local Rademacher complexities, Liu et al. (2019) have derived guarantees for multi-class learning with function classes which are linear in an RKHS, building upon their previous margin-based guarantees Lei et al. (2015); Li et al. (2019). This gives rise to fast rates under suitable spectral conditions. Fast rates of convergence have also been derived by Xu et al. (2016) for multi-label classification with linear function spaces. On the other hand, Chzhen (2019) has derived fast rates of convergence by exploiting an analogue of the margin assumption.

Our objective is to provide a general framework for establishing generalization bounds for multi-output prediction, which yield fast rates whenever the empirical error is small, and apply to a wide variety of function classes, including ensembles of decision trees. We address this problem by generalising to vector-valued functions a smoothness-based approach due to Srebro et al. (2010). A key advantage of our approach is that it allows us to accommodate a wide variety of multi-output loss functions, in conjunction with a variety of hypothesis classes, making our analytic strategy applicable to a variety of learning tasks. Below we summarise our contributions:

• We give a contraction inequality for the local Rademacher complexity of vector-valued functions (Proposition 1). The main ingredient is a self-bounding Lipschitz condition for multi-output loss functions which holds for several widely used examples.

• We leverage our localised contraction inequality to give a general upper bound for multi-output learning (Theorem 1), which exhibits fast rates whenever the empirical error is small.

• We demonstrate the minimax-optimality of our result, both in terms of the number of samples, and the output dimensionality, up to logarithmic factors, in the realizable setting (Theorem 5).

• Finally, to demonstrate a concrete use of our general result, we derive from it a state-of-the-art bound for ensembles of multi-output decision trees (Theorem 7).

### 1.1 Problem setting

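We shall consider multi-output prediction problems in supervised learning. Suppose we have a measurable space $\mathcal{X}$, a label space $\mathcal{Y}$ and an output space $\mathcal{V}$. We shall assume that there is an unknown probability distribution $P$ over random variables $(X, Y)$, taking values in $\mathcal{X} \times \mathcal{Y}$. The performance is quantified through a loss function $L : \mathcal{V} \times \mathcal{Y} \to [0, b]$.

Let $\mathcal{M}(\mathcal{X}, \mathcal{V})$ denote the set of measurable functions $f : \mathcal{X} \to \mathcal{V}$. The goal of the learner is to obtain $f \in \mathcal{M}(\mathcal{X}, \mathcal{V})$ such that the corresponding risk $\mathcal{E}_L(f, P) := \mathbb{E}_{(X,Y) \sim P}[L(f(X), Y)]$ is as low as possible. The learner selects $f$ based upon a sample $\mathcal{D} := \{(X_i, Y_i)\}_{i \in [n]}$, where the $(X_i, Y_i)$ are independent copies of $(X, Y)$. We let $\hat{\mathcal{E}}_L(f, \mathcal{D}) := n^{-1} \sum_{i \in [n]} L(f(X_i), Y_i)$ denote the empirical risk. When the distribution $P$ and the sample $\mathcal{D}$ are clear from context we shall write $\mathcal{E}_L(f)$ in place of $\mathcal{E}_L(f, P)$ and $\hat{\mathcal{E}}_L(f)$ in place of $\hat{\mathcal{E}}_L(f, \mathcal{D})$. We consider multi-output prediction problems in which $\mathcal{V} = \mathbb{R}^q$. We let $\|\cdot\|_\infty$ denote the max norm on $\mathbb{R}^q$ and for a positive integer $m$ we let $[m] := \{1, \ldots, m\}$.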

## 2 The self-bounding Lipschitz condition

We introduce the following self-bounding Lipschitz condition for multi-output loss functions.

###### Definition 1 (Self-bounding Lipschitz condition).

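A loss function $L : \mathcal{V} \times \mathcal{Y} \to [0, b]$ is said to be $(\lambda, \theta)$-self-bounding Lipschitz for $\lambda > 0$, $\theta \in [0, 1/2]$ if for all $u, v \in \mathcal{V}$ and $y \in \mathcal{Y}$,

$$
|L(u,y) - L(v,y)| \le \lambda \cdot \max\{L(u,y), L(v,y)\}^{\theta} \cdot \|u - v\|_\infty .
$$

This condition interpolates continuously between a classical Lipschitz condition (when $\theta = 0$) and a multi-dimensional analogue of a smoothness condition (when $\theta = 1/2$), and will be the main assumption that we use to obtain our results.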

Our motivation for introducing Definition 1 is as follows. Firstly, in recent work of Lei et al. (2019) the classical Lipschitz condition with respect to the $\ell_\infty$ norm has been utilised to derive multi-class bounds with a favourable dependence upon the number of classes $q$. The role of the $\ell_\infty$ norm is crucial since it prevents the deviations in the loss function from accumulating as the output dimension grows. Our goal is to give a general framework which simultaneously achieves a favourable dependence upon $q$. Secondly, Srebro et al. (2010) introduced a second-order smoothness condition on the loss function. This condition corresponds to the special case whereby $q = 1$ and $\theta = 1/2$. Srebro et al. (2010) showed that this smoothness condition gives rise to an optimistic bound which gives a fast rate in the realizable case. The self-bounding Lipschitz condition provides a multi-dimensional analogue of this condition when $\theta = 1/2$, which is intended to yield a favourable dependence upon both the number of samples $n$ and the number of classes $q$. The results established in Sections 3 and 5 show that this is indeed the case. Finally, by considering the range of exponents $\theta \in [0, 1/2]$ we obtain convergence rates ranging from slow to fast in the realizable case. This is reminiscent of the celebrated Tsybakov margin condition Mammen and Tsybakov (1999), which interpolates between slow and fast rates in the parametric classification setting. Crucially, however, whilst the Tsybakov margin condition is a condition on the underlying distribution which cannot be verified in practice, the self-bounding Lipschitz condition is a property of a loss function which may be verified analytically by the learner.

### 2.1 Verifying the self-bounding Lipschitz condition

We start by giving a collection of results which can be used to verify that a given loss function satisfies the self-bounding Lipschitz condition. The following lemmas are proved in Appendix B.

###### Lemma 1.

Take any , . Suppose that is a loss function such that for any , , there exists a non-negative differentiable function satisfying

1. ;

2. , .

3. The derivative is non-negative on ;

4. , ;

Then is -self-bounding Lipschitz.

Lemma 2 shows that clipping preserves this condition.

###### Lemma 2.

Suppose that is a -self-bounding Lipschitz loss function with , . Then the loss defined by is -self-bounding Lipschitz.

Finally, we note the following monotonicity property which follows straightforwardly from the definition.

###### Lemma 3.

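Suppose that $L : \mathcal{V} \times \mathcal{Y} \to [0, b]$ is a bounded $(\lambda, \theta)$-self-bounding Lipschitz loss function with $\lambda > 0$, $\theta \in [0, 1/2]$. Then given any $\tilde{\theta} \in [0, \theta]$, the loss $L$ is also $(\tilde{\lambda}, \tilde{\theta})$-self-bounding Lipschitz with $\tilde{\lambda} = \lambda \cdot b^{\theta - \tilde{\theta}}$.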
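The monotonicity property can be checked in a few lines. The following is a sketch rather than the appendix proof: it assumes the loss takes values in $[0, b]$, writes $\tilde{\theta} \le \theta$ for the smaller exponent, and derives the constant $\lambda b^{\theta - \tilde{\theta}}$ directly from Definition 1.

```latex
\begin{aligned}
|L(u,y) - L(v,y)|
  &\le \lambda \cdot \max\{L(u,y), L(v,y)\}^{\theta} \cdot \|u - v\|_\infty \\
  &= \lambda \cdot \max\{L(u,y), L(v,y)\}^{\theta - \tilde{\theta}}
       \cdot \max\{L(u,y), L(v,y)\}^{\tilde{\theta}} \cdot \|u - v\|_\infty \\
  &\le \lambda\, b^{\theta - \tilde{\theta}}
       \cdot \max\{L(u,y), L(v,y)\}^{\tilde{\theta}} \cdot \|u - v\|_\infty .
\end{aligned}
```

The last step uses only $0 \le L \le b$ and $\theta - \tilde{\theta} \ge 0$. Taking $\tilde{\theta} = 0$ recovers a classical Lipschitz bound with constant $\lambda b^{\theta}$.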

These properties can be used to establish the self-bounding Lipschitz condition in practical examples.

### 2.2 Examples

We now demonstrate several examples of multi-output loss functions that satisfy our self-bounding Lipschitz condition. In each of the examples below we shall show that the self-bounding Lipschitz condition is satisfied by applying our sufficient condition (Lemma 1). Detailed proofs are given in Appendix B.

#### 2.2.1 Multi-class losses

We begin with the canonical multi-output prediction problem of multi-class classification, in which $\mathcal{Y} = [q]$ and $\mathcal{V} = \mathbb{R}^q$. A popular loss function for the theoretical analysis of multi-class learning is the margin loss Crammer and Singer (2001). The smoothed analogue of the margin loss was introduced by Srebro et al. (2010) in the one-dimensional setting, and by Li et al. (2018) in the multi-class setting.

###### Example 1 (Smooth margin losses).

Given we define the margin function by . The zero-one loss is defined by . Whilst natural, the zero-one loss has the drawback of being discontinuous, which presents an obstacle for deriving guarantees. For each , the corresponding margin loss is defined by . The margin loss is also discontinuous. However, we may define a smooth margin loss by

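$$
:= \begin{cases}
1 & \text{if } m(u,y) \le 0, \\
2\left(\dfrac{m(u,y)}{\rho}\right)^{3} - 3\left(\dfrac{m(u,y)}{\rho}\right)^{2} + 1 & \text{if } m(u,y) \in [0, \rho], \\
0 & \text{if } m(u,y) \ge \rho.
\end{cases}
$$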

By applying Lemma 1 we can show that is -self-bounding Lipschitz with and . Moreover, the smooth margin loss satisfies for .
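The piecewise definition of the smooth margin loss can be sketched in a few lines of NumPy. The margin function used here (score of the true class minus the largest competing score) is an assumption, chosen as the standard multi-class margin; the function names are hypothetical.

```python
import numpy as np


def margin(u, y):
    """Assumed margin: true-class score minus the best competing score."""
    u = np.asarray(u, dtype=float)
    return u[y] - np.delete(u, y).max()


def smooth_margin_loss(u, y, rho):
    """Cubic interpolation: 1 when margin <= 0, 0 when margin >= rho."""
    # Clipping t to [0, 1] reproduces the two constant branches, since the
    # cubic equals 1 at t = 0 and 0 at t = 1.
    t = np.clip(margin(u, y) / rho, 0.0, 1.0)
    return 2.0 * t**3 - 3.0 * t**2 + 1.0


if __name__ == "__main__":
    for scores in ([0.0, 1.0, 0.2], [0.5, 0.0, -1.0], [2.0, 0.0, 0.5]):
        print(scores, smooth_margin_loss(scores, y=0, rho=1.0))
```

The cubic has zero derivative at both ends of $[0, \rho]$, so the loss is continuously differentiable in the margin.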

The margin loss plays a central role in learning theory and continues to receive significant attention in the analysis of multi-class prediction Guermeur (2017); Li et al. (2018); Musayeva et al. (2019), so it is fortuitous that our self-bounding Lipschitz condition incorporates the smooth margin loss. More importantly, however, the self-bounding Lipschitz condition applies to a variety of other loss functions which have received less attention in statistical learning theory.

One of the most widely used loss functions in practical applications is the multinomial logistic loss, also known as the softmax loss.

###### Example 2 (Multinomial logistic loss).

Given , the multinomial logistic loss is defined by

$$
L(u,y) = \log\Bigg(\sum_{j \in [q]} \exp(u_j - u_y)\Bigg),
$$

where and . For each let and define . By applying Lemma 1 with we can show that the multinomial logistic loss is -self-bounding Lipschitz with and .

Recently, Lei et al. (2019) emphasized that the multinomial-logistic loss is -Lipschitz with respect to the -norm (equivalently, -self-bounding Lipschitz). This gives rise to a slow rate of order . The fact that the multinomial-logistic loss is also -self bounding can be used to derive more favourable guarantees, as we shall see in Section 3.
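The self-bounding Lipschitz inequality for the multinomial logistic loss can be probed numerically. The sketch below samples random pairs $(u, v)$ and checks $|L(u,y) - L(v,y)| \le \lambda \max\{L(u,y), L(v,y)\}^{\theta} \|u - v\|_\infty$. The constants $\lambda = 2$ and $\theta = 1/2$ are assumptions chosen for this illustration, and the helper names are hypothetical; a random search can only fail to find a counterexample, it does not prove the condition.

```python
import numpy as np


def multinomial_logistic_loss(u, y):
    """log(sum_j exp(u_j - u_y)), computed stably via a max shift."""
    z = np.asarray(u, dtype=float) - u[y]
    m = z.max()
    return m + np.log(np.exp(z - m).sum())


def self_bounding_slack(loss, u, v, y, lam, theta):
    """Right-hand side minus left-hand side; non-negative means the condition holds."""
    lu, lv = loss(u, y), loss(v, y)
    rhs = lam * max(lu, lv) ** theta * np.abs(u - v).max()
    return rhs - abs(lu - lv)


rng = np.random.default_rng(0)
q = 5
worst = np.inf
for _ in range(2000):
    u = rng.normal(scale=3.0, size=q)
    # Mix small and large perturbations to probe both regimes.
    v = u + rng.normal(scale=rng.choice([1e-2, 1.0, 5.0]), size=q)
    y = int(rng.integers(q))
    slack = self_bounding_slack(multinomial_logistic_loss, u, v, y, lam=2.0, theta=0.5)
    worst = min(worst, slack)

print("smallest slack over sampled pairs:", worst)
```

Re-running with `theta=0.0` checks the classical Lipschitz condition with respect to the max norm under the same assumed constant.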

#### 2.2.2 Multi-label losses

Multi-label prediction is the challenge of classification in settings where instances may be simultaneously assigned to several categories. In multi-label classification we have $\mathcal{Y} \subseteq \{0,1\}^q$, where $q$ is the total number of possible classes. Whilst $q$ is often very large, the total number of simultaneous labels is typically much smaller. Hence, we let $\mathcal{S}(k)$ denote the set of $k$-sparse binary vectors in $\{0,1\}^q$, that is, those with at most $k$ non-zero entries, where $k \le q$. We consider the pick-all-labels loss Menon et al. (2019); Reddi et al. (2019).

###### Example 3 (Pick-all-labels).

Given , the pick-all-labels loss is defined by

$$
L(u,y) = \sum_{l \in [q]} y_l \log\Bigg(\sum_{j \in [q]} \exp(u_j - u_l)\Bigg),
$$

where and . For each we define by and let . By applying Lemma 1 with we can show that is -self-bounding Lipschitz with and .

Crucially, the constant for the pick-all-labels family of losses is a function of the sparsity of the label vectors, rather than the total number of labels. This means that our approach is applicable to multi-label problems with tens of thousands of labels, as long as the label vectors are sufficiently sparse.
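A minimal sketch of the pick-all-labels loss, assuming a binary label vector `y` of the same length as the score vector `u` (function names are hypothetical). It uses the identity $\log \sum_j \exp(u_j - u_l) = \operatorname{lse}(u) - u_l$, so the log-sum-exp is computed once regardless of how many labels are active.

```python
import numpy as np


def log_sum_exp(u):
    """Numerically stable log(sum_j exp(u_j))."""
    m = u.max()
    return m + np.log(np.exp(u - m).sum())


def pick_all_labels_loss(u, y):
    """sum_l y_l * log(sum_j exp(u_j - u_l)) for a binary label vector y."""
    u = np.asarray(u, dtype=float)
    y = np.asarray(y, dtype=float)
    # Each active label l contributes lse(u) - u_l.
    return float(y.sum() * log_sum_exp(u) - y @ u)


if __name__ == "__main__":
    scores = np.array([0.3, -1.2, 2.0, 0.0])
    labels = np.array([1, 0, 1, 0])
    print(pick_all_labels_loss(scores, labels))
```

With a one-hot `y` this reduces to the multinomial logistic loss of Example 2.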

#### 2.2.3 Losses for multi-target regression

We now return to the problem of multi-target regression in which Borchani et al. (2015).

###### Example 4 (Sup-norm losses).

Given , we can define a loss-function for multi-target regression by setting . By applying Lemma 1 with we can see that is a -self-bounding Lipschitz with and . This yields examples of -self-bounding Lipschitz loss functions for all and .

With these examples in mind we are ready to present our results.

## 3 Main results

In this section we give a general upper bound for multi-output prediction problems under the self-bounding Lipschitz condition. A key tool for proving this result will be a contraction inequality for local Rademacher complexity of vector valued functions given in Section 3.2, and which may also be of independent interest. First, we recall the concept of Rademacher complexity.

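Let $\mathcal{Z}$ be a measurable space and consider a class $\mathcal{G}$ of real-valued measurable functions on $\mathcal{Z}$. Given a sequence $z = (z_i)_{i \in [n]} \in \mathcal{Z}^n$ we define the empirical Rademacher complexity of $\mathcal{G}$ with respect to $z$ by

$$
\hat{\mathfrak{R}}_z(\mathcal{G}) := \sup_{\tilde{\mathcal{G}} \subseteq \mathcal{G} \,:\, |\tilde{\mathcal{G}}| < \infty} \mathbb{E}_{\sigma}\Bigg(\sup_{g \in \tilde{\mathcal{G}}} \frac{1}{n} \sum_{i \in [n]} \sigma_i \cdot g(z_i)\Bigg),
$$

where the expectation is taken over sequences $\sigma = (\sigma_i)_{i \in [n]}$ of independent Rademacher random variables, each uniform on $\{-1, +1\}$. Taking the supremum over finite subsets is required to ensure that the function within the expectation is measurable Talagrand (2014); this technicality can typically be overlooked. For each $n \in \mathbb{N}$, the worst-case Rademacher complexity of $\mathcal{G}$ is defined by $\mathfrak{R}_n(\mathcal{G}) := \sup_{z \in \mathcal{Z}^n} \hat{\mathfrak{R}}_z(\mathcal{G})$.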

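The Rademacher complexity is defined in the context of real-valued functions. However, in this work we deal with multi-output prediction, so we shall focus on function classes $\mathcal{F} \subseteq \mathcal{M}(\mathcal{X}, \mathbb{R}^q)$. In order to utilise the theory of Rademacher complexity in this context we shall transform function classes $\mathcal{F}$ into projected function classes $\Pi \circ \mathcal{F}$ as follows. Firstly, for each $j \in [q]$ we define $\pi_j : \mathbb{R}^q \to \mathbb{R}$ to be the projection onto the $j$-th coordinate. We then define, for each $f \in \mathcal{M}(\mathcal{X}, \mathbb{R}^q)$, the function $\Pi \circ f : \mathcal{X} \times [q] \to \mathbb{R}$ by $(\Pi \circ f)(x, j) := \pi_j(f(x))$. Finally, given $\mathcal{F} \subseteq \mathcal{M}(\mathcal{X}, \mathbb{R}^q)$ we let $\Pi \circ \mathcal{F} := \{\Pi \circ f : f \in \mathcal{F}\}$.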
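To make the projection concrete, the sketch below treats a finite class of vector-valued functions evaluated on $n$ points as an array of shape `(m, n, q)`, flattens it into real values indexed by pairs $(x_i, j)$, and estimates the empirical Rademacher complexity by Monte Carlo. The array layout, the function names and the use of a finite class are assumptions made for illustration; the worst-case complexity in the text additionally takes a supremum over samples, which this estimate does not attempt.

```python
import numpy as np


def project(outputs):
    """Flatten outputs of shape (m, n, q) to (m, n*q): one real value per pair (x_i, j)."""
    m, n, q = outputs.shape
    return outputs.reshape(m, n * q)


def empirical_rademacher(values, n_draws=2000, seed=0):
    """Monte Carlo estimate of E_sigma sup_g (1/N) sum_i sigma_i g(z_i).

    values has shape (m, N): row k holds g_k(z_1), ..., g_k(z_N) for a finite class.
    """
    rng = np.random.default_rng(seed)
    _, n_points = values.shape
    sigma = rng.choice([-1.0, 1.0], size=(n_draws, n_points))
    # For every sign draw, take the supremum over the finite class.
    sups = (sigma @ values.T / n_points).max(axis=1)
    return float(sups.mean())


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    outputs = rng.uniform(-1.0, 1.0, size=(8, 50, 3))  # m=8 functions, n=50, q=3
    print(empirical_rademacher(project(outputs)))
```

Note that the projected class is evaluated on $nq$ points, so the average inside the expectation is normalised by $nq$ rather than $n$.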

Our central result is the following relative bound.

###### Theorem 1.

Suppose we have a class of multi-output functions , and a -self-bounding Lipschitz loss function for some , , . Take , and let

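$$
\Gamma^{\lambda,\theta}_{n,q,\delta}(\mathcal{F}) := \Bigg(\lambda\bigg(\sqrt{q} \cdot \log^{3/2}(e\beta n q) \cdot \mathfrak{R}_{nq}(\Pi \circ \mathcal{F}) + \frac{1}{\sqrt{n}}\bigg)\Bigg)^{\frac{1}{1-\theta}} + \frac{b}{n} \cdot \big(\log(1/\delta) + \log(\log n)\big).
$$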

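There exist numerical constants $C_0, C_1 > 0$ such that given an i.i.d. sample the following holds with probability at least $1 - \delta$ for all $f \in \mathcal{F}$,

$$
\mathcal{E}_L(f) \le \hat{\mathcal{E}}_L(f) + C_0 \cdot \bigg(\sqrt{\hat{\mathcal{E}}_L(f) \cdot \Gamma^{\lambda,\theta}_{n,q,\delta}(\mathcal{F})} + \Gamma^{\lambda,\theta}_{n,q,\delta}(\mathcal{F})\bigg).
$$

Moreover, if $f^* \in \mathcal{F}$ minimises the risk and $\hat{f} \in \mathcal{F}$ minimises the empirical risk, then with probability at least $1 - \delta$,

$$
\mathcal{E}_L(\hat{f}) \le \mathcal{E}_L(f^*) + C_1 \cdot \bigg(\sqrt{\mathcal{E}_L(f^*) \cdot \Gamma^{\lambda,\theta}_{n,q,\delta}(\mathcal{F})} + \Gamma^{\lambda,\theta}_{n,q,\delta}(\mathcal{F})\bigg).
$$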

The proof of Theorem 1 is built upon a local contraction inequality result (Proposition 1, Section 3.2). The result follows by combining with techniques from Bousquet (2002). For details see Appendix A.

Theorem 1 gives an upper bound for the generalization gap $\mathcal{E}_L(f) - \hat{\mathcal{E}}_L(f)$, framed in terms of a complexity term $\Gamma^{\lambda,\theta}_{n,q,\delta}(\mathcal{F})$, which depends upon both the Rademacher complexity of the projected function class $\Pi \circ \mathcal{F}$ and the self-bounding Lipschitz parameters $\lambda$, $\theta$. When the empirical error is small in relation to the complexity term ($\hat{\mathcal{E}}_L(f) \le \Gamma^{\lambda,\theta}_{n,q,\delta}(\mathcal{F})$), the generalization gap is of order $\Gamma^{\lambda,\theta}_{n,q,\delta}(\mathcal{F})$. In less favourable circumstances we recover a bound of order $\sqrt{\hat{\mathcal{E}}_L(f) \cdot \Gamma^{\lambda,\theta}_{n,q,\delta}(\mathcal{F})}$.

In Section 4 we will demonstrate that in the realizable setting, Theorem 1 is minimax optimal up to logarithmic factors, both in terms of the sample size $n$ and the output dimension $q$. In Section 5 we will demonstrate that Theorem 1 yields state-of-the-art guarantees for ensembles of decision trees for multi-output prediction problems.

### 3.1 Comparison with state of the art

In this section we compare our main result (Theorem 1) with a closely related guarantee due to Lei et al. (2019). We say that a loss function $L$ is $\lambda$-Lipschitz if it is $(\lambda, \theta)$-self-bounding Lipschitz with $\theta = 0$.

###### Theorem 2.

Lei et al. (2019) Suppose we have a class of multi-output functions , and a -Lipschitz loss function for some and . Take , and let

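$$
J^{\lambda}_{n,q,\delta}(\mathcal{F}) := \lambda\bigg(\sqrt{q} \cdot \log^{3/2}(e\beta n q) \cdot \mathfrak{R}_{nq}(\Pi \circ \mathcal{F}) + \frac{1}{\sqrt{n}}\bigg).
$$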

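There exist numerical constants $C_2, C_3 > 0$ such that given an i.i.d. sample the following holds with probability at least $1 - \delta$ for all $f \in \mathcal{F}$,

$$
\mathcal{E}_L(f) \le \hat{\mathcal{E}}_L(f) + C_2 \cdot J^{\lambda}_{n,q,\delta}(\mathcal{F}) + b\sqrt{\frac{\log(1/\delta)}{n}}.
$$

Moreover, if $f^* \in \mathcal{F}$ minimises the risk and $\hat{f} \in \mathcal{F}$ minimises the empirical risk, then with probability at least $1 - \delta$,

$$
\mathcal{E}_L(\hat{f}) \le \mathcal{E}_L(f^*) + C_3 \cdot J^{\lambda}_{n,q,\delta}(\mathcal{F}) + 2b\sqrt{\frac{\log(1/\delta)}{n}}.
$$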

Theorem 2 is a mild generalization of Theorem 6 from Lei et al. (2019), which establishes the special case of Theorem 2 in which $\mathcal{F}$ is an RKHS and the learning problem is multi-class classification. For completeness we show in Appendix A that Theorem 2 follows from Proposition 1. Note that by the monotonicity property (Lemma 3) any bounded loss function which is $(\lambda, \theta)$-self-bounding Lipschitz is also Lipschitz with constant $\lambda b^{\theta}$, so the additive bound in Theorem 2 also applies.

To gain a deeper intuition for the bound in Theorem 1 we compare with the bound in Theorem 2. Let’s suppose that (for a concrete example where this is the case see Section 5). We then have . For large values of Theorem 1 gives a bound on generalization gap of order , which is slower than the rate achieved by Theorem 2 whenever . However, when is small (), Theorem 1 gives rise to a bound of order , yielding faster rates than can be obtained through the standard Lipschitz condition alone whenever . Finally note that if the loss is -self-bounding Lipschitz with then the rates given by Theorem 1 always either match or outperform the rates given by Theorem 2. Moreover, occurs for several practical examples discussed in Section 2.2 including the multinomial-logistic loss.
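The comparison can be made explicit with a short calculation. This is a sketch under stated assumptions: the projected class satisfies $\mathfrak{R}_{nq}(\Pi \circ \mathcal{F}) = \tilde{O}((nq)^{-1/2})$, the quantities $\lambda$, $b$ and $\delta$ are treated as constants inside $\tilde{O}$ except where $\lambda$ is shown, the complexity term is taken to be $\Gamma = (\lambda A_n)^{1/(1-\theta)} + \frac{b}{n}(\log(1/\delta) + \log\log n)$ with $A_n$ as below, and the square root in Theorem 1 is read as covering the product of empirical risk and $\Gamma$.

```latex
\begin{aligned}
A_n &:= \sqrt{q}\,\log^{3/2}(e\beta n q)\,\mathfrak{R}_{nq}(\Pi\circ\mathcal{F}) + n^{-1/2}
      = \tilde{O}\big(n^{-1/2}\big), \\
J &= \lambda A_n = \tilde{O}\big(\lambda\, n^{-1/2}\big), \qquad
\Gamma = \tilde{O}\Big(\big(\lambda\, n^{-1/2}\big)^{\frac{1}{1-\theta}}\Big), \\
\theta = 0 &: \quad \Gamma = \tilde{O}\big(\lambda\, n^{-1/2}\big), \qquad
\theta = \tfrac12 : \quad \Gamma = \tilde{O}\big(\lambda^{2}\, n^{-1}\big), \\
\mathcal{E}_L(f) - \hat{\mathcal{E}}_L(f)
  &\lesssim
  \begin{cases}
    \Gamma & \text{if } \hat{\mathcal{E}}_L(f) \le \Gamma, \\
    \sqrt{\hat{\mathcal{E}}_L(f)\cdot\Gamma} & \text{otherwise.}
  \end{cases}
\end{aligned}
```

For $\theta = 1/2$ the second case reads $\tilde{O}(\lambda\sqrt{\hat{\mathcal{E}}_L(f)/n})$, which is within a factor $\sqrt{b}$ of the $\lambda n^{-1/2}$ rate of Theorem 2, while the first case gives the fast $n^{-1}$ rate.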

### 3.2 A contraction inequality for the local Rademacher complexity of vector-valued function classes

We now turn to stating and proving the key ingredient of our main result, Proposition 1. First we introduce some additional notation.

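Suppose $f : \mathcal{X} \to \mathcal{V}$ is measurable. Given a loss function $L$ we define $L \circ f : \mathcal{X} \times \mathcal{Y} \to [0, b]$ by $(L \circ f)(x, y) := L(f(x), y)$. We extend this definition to function classes $\mathcal{F}$ by $L \circ \mathcal{F} := \{L \circ f : f \in \mathcal{F}\}$. Moreover, for each $z \in (\mathcal{X} \times \mathcal{Y})^n$ and $r > 0$, we define a subset $\mathcal{F}|^r_z := \{f \in \mathcal{F} : \hat{\mathcal{E}}_L(f, z) \le r\}$. Intuitively, the local Rademacher complexity allows us to zoom in upon the neighbourhood of the empirical risk minimizer. This is the subset that matters in practice and is typically much smaller than the full class $\mathcal{F}$.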

###### Proposition 1.

Suppose we have a class of multi-output functions , where . Given a -self-bounding Lipschitz loss function , where , and , , we have,

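$$
\hat{\mathfrak{R}}_z\big(L \circ \mathcal{F}|^r_z\big) \le \lambda r^{\theta}\Big(2^{9}\sqrt{q} \cdot \log^{3/2}(e\beta n q) \cdot \mathfrak{R}_{nq}(\Pi \circ \mathcal{F}) + n^{-1/2}\Big).
$$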

The proof of Proposition 1, given later in this section, relies upon covering numbers.

###### Definition 3 (Covering numbers).

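Let $(\mathcal{M}, \rho)$ be a semi-metric space. Given a set $A \subseteq \mathcal{M}$ and an $\epsilon > 0$, a subset $\tilde{A} \subseteq A$ is said to be a (proper) $\epsilon$-cover of $A$ if, for all $a \in A$, there exists some $\tilde{a} \in \tilde{A}$ with $\rho(a, \tilde{a}) \le \epsilon$. We let $N(\epsilon, A, \rho)$ denote the minimal cardinality of an $\epsilon$-cover for $A$.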
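The notion of a proper cover can be illustrated with a greedy construction on a finite class. In this sketch each function is represented by its vector of evaluations on a fixed sample, and distance is measured either by the maximum absolute difference or by the root-mean-square difference between evaluation vectors. The function name and array layout are assumptions; the greedy cover is valid and proper, but its size only upper-bounds the minimal covering number.

```python
import numpy as np


def greedy_proper_cover(values, eps, metric="inf"):
    """Return row indices forming a proper eps-cover of the rows of `values`.

    values has shape (m, n): row k holds g_k(z_1), ..., g_k(z_n).
    metric is "inf" (max absolute difference) or "l2" (root-mean-square difference).
    """
    def dist(a, b):
        if metric == "inf":
            return np.abs(a - b).max()
        return np.sqrt(np.mean((a - b) ** 2))

    centres = []
    for k, row in enumerate(values):
        # A row becomes a new centre only if no existing centre is within eps of it,
        # so every row ends up within eps of some centre drawn from the class itself.
        if all(dist(row, values[c]) > eps for c in centres):
            centres.append(k)
    return centres


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    vals = rng.uniform(-1.0, 1.0, size=(200, 10))
    for eps in (1.5, 1.0, 0.5):
        print(eps, len(greedy_proper_cover(vals, eps)))
```

Because the centres are chosen from the rows themselves, the cover is a subset of the set being covered, which is exactly the properness requirement in Definition 3.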

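We shall consider covering numbers for two classes of data-dependent semi-metric spaces. Let $\mathcal{Z}$ be a measurable space and take a class $\mathcal{G}$ of real-valued measurable functions on $\mathcal{Z}$. For each $n \in \mathbb{N}$ and each sequence $z = (z_i)_{i \in [n]} \in \mathcal{Z}^n$ we define a pair of metrics $\rho_{z,2}$ and $\rho_{z,\infty}$ by

$$
\rho_{z,2}(g_0, g_1) := \sqrt{\frac{1}{n}\sum_{i \in [n]} \big(g_0(z_i) - g_1(z_i)\big)^2}, \qquad
\rho_{z,\infty}(g_0, g_1) := \max_{i \in [n]} \big\{|g_0(z_i) - g_1(z_i)|\big\},
$$

where $g_0, g_1 \in \mathcal{G}$. The first stage of the proof of Proposition 1 uses the following lemma, which bounds the covering number of $L \circ \mathcal{F}|^r_z$ in terms of an associated covering number for $\Pi \circ \mathcal{F}$.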

###### Lemma 4.

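Suppose that $\mathcal{F} \subseteq \mathcal{M}(\mathcal{X}, \mathbb{R}^q)$ and $L$ is $(\lambda, \theta)$-self-bounding Lipschitz with $\lambda > 0$, $\theta \in [0, 1/2]$. Take $z = ((x_i, y_i))_{i \in [n]} \in (\mathcal{X} \times \mathcal{Y})^n$, $r > 0$, and define $w = ((x_i, j))_{(i,j) \in [n] \times [q]} \in (\mathcal{X} \times [q])^{nq}$. Given any $f_0, f_1 \in \mathcal{F}|^r_z$,

$$
\rho_{z,2}(L \circ f_0, L \circ f_1) \le 2^{\theta} \lambda r^{\theta} \cdot \rho_{w,\infty}(\Pi \circ f_0, \Pi \circ f_1).
$$

Moreover, for any $\epsilon > 0$, $N\big(2^{1+\theta}\lambda r^{\theta} \cdot \epsilon, L \circ \mathcal{F}|^r_z, \rho_{z,2}\big) \le N\big(\epsilon, \Pi \circ \mathcal{F}, \rho_{w,\infty}\big)$.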

###### Proof of Lemma 4.

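To prove the first part of the lemma we take $f_0, f_1 \in \mathcal{F}|^r_z$ and let $\zeta := \rho_{w,\infty}(\Pi \circ f_0, \Pi \circ f_1)$. It follows from the construction of $w$ that $|\pi_j(f_0(x_i)) - \pi_j(f_1(x_i))| \le \zeta$ for each $(i, j) \in [n] \times [q]$, so $\|f_0(x_i) - f_1(x_i)\|_\infty \le \zeta$ for each $i \in [n]$.

Furthermore, by the self-bounding Lipschitz condition we deduce that for each $i \in [n]$,

$$
\begin{aligned}
|L(f_0(x_i), y_i) - L(f_1(x_i), y_i)|
  &\le \lambda \cdot \max\{L(f_0(x_i), y_i), L(f_1(x_i), y_i)\}^{\theta} \cdot \|f_0(x_i) - f_1(x_i)\|_\infty \\
  &\le \lambda \cdot \max\{L(f_0(x_i), y_i), L(f_1(x_i), y_i)\}^{\theta} \cdot \zeta.
\end{aligned}
$$

Hence, by Jensen’s inequality we have

$$
\begin{aligned}
\rho_{z,2}(L \circ f_0, L \circ f_1)^2
  &= \frac{1}{n}\sum_{i \in [n]} \big(L(f_0(x_i), y_i) - L(f_1(x_i), y_i)\big)^2 \\
  &\le (\lambda\zeta)^2 \cdot \frac{1}{n}\sum_{i \in [n]} \max\{L(f_0(x_i), y_i), L(f_1(x_i), y_i)\}^{2\theta} \\
  &\le (\lambda\zeta)^2 \cdot \big(\hat{\mathcal{E}}_L(f_0, z) + \hat{\mathcal{E}}_L(f_1, z)\big)^{2\theta}
   \le (\lambda\zeta)^2 \cdot (2r)^{2\theta},
\end{aligned}
$$

where we use the fact that $2\theta \le 1$, so that $t \mapsto t^{2\theta}$ is concave, and that $\hat{\mathcal{E}}_L(f_0, z), \hat{\mathcal{E}}_L(f_1, z) \le r$. Thus,

$$
\rho_{z,2}(L \circ f_0, L \circ f_1) \le 2^{\theta}\lambda r^{\theta} \cdot \zeta = 2^{\theta}\lambda r^{\theta} \cdot \rho_{w,\infty}(\Pi \circ f_0, \Pi \circ f_1).
$$

This completes the proof of the first part of the lemma.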

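To prove the second part of the lemma we note that since $\mathcal{F}|^r_z \subseteq \mathcal{F}$ we have

$$
N\big(2\epsilon, \Pi \circ \mathcal{F}|^r_z, \rho_{w,\infty}\big) \le N\big(\epsilon, \Pi \circ \mathcal{F}, \rho_{w,\infty}\big),
$$

where the factor of $2$ is required as we are using proper covers, which are subsets of the set being covered (see Definition 3). Hence we may choose $f_1, \ldots, f_m \in \mathcal{F}|^r_z$ with $m \le N(\epsilon, \Pi \circ \mathcal{F}, \rho_{w,\infty})$ such that $\{\Pi \circ f_l\}_{l \in [m]}$ forms a $2\epsilon$-cover of $\Pi \circ \mathcal{F}|^r_z$ with respect to the $\rho_{w,\infty}$ metric.

To complete the proof it suffices to show that $\{L \circ f_l\}_{l \in [m]}$ is a $2^{1+\theta}\lambda r^{\theta}\epsilon$-cover of $L \circ \mathcal{F}|^r_z$ with respect to the $\rho_{z,2}$ metric.

Take any $\tilde{g} \in L \circ \mathcal{F}|^r_z$, so $\tilde{g} = L \circ \tilde{f}$ for some $\tilde{f} \in \mathcal{F}|^r_z$. Since $\{\Pi \circ f_l\}_{l \in [m]}$ forms a $2\epsilon$-cover of $\Pi \circ \mathcal{F}|^r_z$ we may choose $l \in [m]$ so that $\rho_{w,\infty}(\Pi \circ f_l, \Pi \circ \tilde{f}) \le 2\epsilon$. By the first part of the lemma we deduce that

$$
\rho_{z,2}(L \circ f_l, \tilde{g}) = \rho_{z,2}(L \circ f_l, L \circ \tilde{f}) \le 2^{1+\theta}\lambda r^{\theta} \cdot \epsilon.
$$

Since this holds for all $\tilde{g} \in L \circ \mathcal{F}|^r_z$, we see that $\{L \circ f_l\}_{l \in [m]}$ is a $2^{1+\theta}\lambda r^{\theta}\epsilon$-cover of $L \circ \mathcal{F}|^r_z$, which completes the proof of the lemma. ∎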

To prove Proposition 1, we shall also utilise two technical results to move from covering numbers to Rademacher complexity and back. First, we shall use the following powerful result from Srebro et al. (2010) which gives an upper bound for worst-case covering numbers in terms of the worst-case Rademacher complexity.

###### Theorem 3 (Srebro et al. (2010)).

Given a measurable space and a function class , any and any ,

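$$
\log N(\epsilon, \mathcal{G}, \rho_{z,\infty}) \le \big(\mathfrak{R}_n(\mathcal{G})\big)^2 \cdot \frac{4n}{\epsilon^2} \cdot \log\frac{2e\beta n}{\epsilon}.
$$

We can view this result as an analogue of Sudakov’s minoration inequality for $\ell_\infty$ covers, rather than $\ell_2$ covers.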

Secondly, we shall use Dudley’s inequality Dudley (1967) which allows us to bound Rademacher complexities in terms of covering numbers. We shall use the following variant due to Guermeur (2017) as it yields more favourable constants.

###### Theorem 4 (Guermeur (2017)).

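Suppose we have a measurable space , a function class and a sequence . For any decreasing sequence with , the following inequality holds for all ,

$$
\hat{\mathfrak{R}}_z(\mathcal{G}) \le 2 \cdot \sum_{k=1}^{K} (\epsilon_k + \epsilon_{k-1}) \cdot \sqrt{\frac{\log N(\epsilon_k, \mathcal{G}, \rho_{z,2})}{n}} + \epsilon_K.
$$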

We are now ready to complete the proof of our local Rademacher complexity inequality.

###### Proof of Proposition 1.

Take and and define . By Lemma 4 combined with Theorem 3 applied to we see that for each we have

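$$
\log N\big(2^{1+\theta}\lambda r^{\theta} \cdot \xi, L \circ \mathcal{F}|^r_z, \rho_{z,2}\big)
\le \log N\big(\xi, \Pi \circ \mathcal{F}, \rho_{w,\infty}\big)
\le \big(\mathfrak{R}_{nq}(\Pi \circ \mathcal{F})\big)^2 \cdot \frac{4nq}{\xi^2} \cdot \log\frac{2e\beta n q}{\xi}. \tag{1}
$$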

Moreover, given any , , so , so by the first part of Lemma 4 we have .

Now construct by and choose

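$$
K = \Big\lceil \log_2\Big(\beta \cdot \min\big\{\big(2 \cdot \mathfrak{R}_{nq}(\Pi \circ \mathcal{F})\big)^{-1}, 8\sqrt{n}\big\}\Big) \Big\rceil - 1.
$$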

Hence, and .

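Furthermore, for by letting , we have and , so by eq. (1)

$$
\log N\big(\epsilon_k, L \circ \mathcal{F}|^r_z, \rho_{z,2}\big) \le \big(\mathfrak{R}_{nq}(\Pi \circ \mathcal{F})\big)^2 \cdot \frac{4nq}{\xi_k^2} \cdot \log\frac{2e\beta n q}{\xi_k}.
$$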

Note also that by construction .

By Theorem 4 and we deduce that

 ^Rz(L∘F|rz) ≤2⋅K∑k=1(ϵk+ϵk−1)⋅√logN(ϵk,L∘F|rz,ρz,2)n+ϵK ≤6K∑k=1ϵk⋅√logN(ϵk,L∘F|rz,ρz,2)n+ϵK ≤6K⋅(21+θλrθ⋅Rnq(