1 Introduction
Multi-output prediction represents an important class of problems that includes multi-class classification (Crammer and Singer, 2001), multi-label classification (Tsoumakas and Katakis, 2007; Zhang and Zhou, 2013), multi-target regression (Borchani et al., 2015), label distribution learning (Geng, 2016), structured regression (Cortes et al., 2016) and others, with a wide range of practical applications (Xu et al., 2019).
Our objective is to provide a general framework for establishing guarantees for multi-output prediction problems. A fundamental challenge in the statistical learning theory of multi-output prediction is to obtain bounds which allow for (i) a favourable convergence rate with the sample size, and (ii) a favourable dependence of the risk on the dimensionality of the output space. Whilst modern applications of multi-output prediction deal with increasingly large data sets, they also incorporate problems where the target dimensionality is increasingly large. For example, the number of categories in multi-label classification is often of the order of tens of thousands, an emergent problem referred to as
extreme classification (Agrawal et al., 2013; Babbar and Schölkopf, 2017; Bhatia et al., 2015; Jain et al., 2019). Formally, the task of multi-output prediction is to learn a vector-valued function from a labelled training set. A common tool in the theoretical analysis of this problem has been a vector-valued extension of Talagrand's contraction inequality for Lipschitz losses (Ledoux and Talagrand, 2013). Both Maurer (2016) and Cortes et al. (2016) established vector-contraction inequalities for Rademacher complexity which gave rise to learning guarantees for multi-output prediction problems with a linear dependence upon the dimensionality of the output space. More recently, Lei et al. (2019) provided more refined vector-contraction inequalities for both Gaussian and Rademacher complexity. This approach leads to a highly favourable sub-linear dependence upon the output dimensionality, which can even be logarithmic, depending upon the degree of regularisation. These structural results, however, lead to a slow convergence rate in the sample size. Guermeur (2017) and Musayeva et al. (2019) explore an alternative approach based on covering numbers. Chzhen et al. (2017) derived a bound for multi-label classification based upon Rademacher complexities. Each of these bounds gives rise to a favourable dependence upon the dimensionality of the output space, but again with a slow rate in the sample size.
Local Rademacher complexities provide a crucial tool in establishing faster rates of convergence (Bousquet, 2002; Bartlett et al., 2005; Koltchinskii et al., 2006; Lei et al., 2016). By leveraging local Rademacher complexities, Liu et al. (2019) derived guarantees for multi-class learning with function classes which are linear in an RKHS, building upon their previous margin-based guarantees (Lei et al., 2015; Li et al., 2019). This gives rise to fast rates under suitable spectral conditions. Fast rates of convergence have also been derived by Xu et al. (2016) for multi-label classification with linear function spaces. On the other hand, Chzhen (2019) has derived fast rates of convergence by exploiting an analogue of the margin assumption.
Our objective is to provide a general framework for establishing generalization bounds for multi-output prediction which yield fast rates whenever the empirical error is small, and which apply to a wide variety of function classes, including ensembles of decision trees. We address this problem by generalising to vector-valued functions a smoothness-based approach due to
Srebro et al. (2010). A key advantage of our approach is that it allows us to accommodate a wide variety of multi-output loss functions, in conjunction with a range of hypothesis classes, making our analytic strategy applicable to many learning tasks. Below we summarise our contributions:

We give a contraction inequality for the local Rademacher complexity of vector-valued functions (Proposition 1). The main ingredient is a self-bounding Lipschitz condition for multi-output loss functions which holds for several widely used examples.

We leverage our localised contraction inequality to give a general upper bound for multi-output learning (Theorem 1), which exhibits fast rates whenever the empirical error is small.

We demonstrate the minimax optimality of our result, both in terms of the number of samples and the output dimensionality, up to logarithmic factors, in the realizable setting (Theorem 5).

Finally, to demonstrate a concrete use of our general result, we derive from it a state-of-the-art bound for ensembles of multi-output decision trees (Theorem 7).
1.1 Problem setting
We shall consider multi-output prediction problems in supervised learning. Suppose we have a measurable space
, a label space and an output space. We shall assume that there is an unknown probability distribution
over random variables
, taking values in . The performance is quantified through a loss function . Let denote the set of measurable functions . The goal of the learner is to obtain such that the corresponding risk is as low as possible. The learner selects based upon a sample , where are independent copies of . We let denote the empirical risk. When the distribution and the sample are clear from context we shall write in place of and in place of . We consider multi-output prediction problems in which . We let denote the max norm on and for a positive integer we let .
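As a minimal sketch of the setting (with names of our own choosing, since the notation above is partly elided), the empirical risk is simply the sample mean of the loss of a candidate predictor:

```python
import numpy as np

def empirical_risk(f, loss, xs, ys):
    """Empirical risk: the mean of loss(f(x_i), y_i) over the sample."""
    return float(np.mean([loss(f(x), y) for x, y in zip(xs, ys)]))

# Illustrative multi-output setting: predictors map reals to R^2 and we
# score them with the max-norm (sup-norm) distance to the target vector.
sup_loss = lambda u, y: float(np.max(np.abs(u - y)))
predict = lambda x: np.array([x, x])
```

For instance, `empirical_risk(predict, sup_loss, xs, ys)` averages the max-norm errors over the sample.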
2 The self-bounding Lipschitz condition
We introduce the following self-bounding Lipschitz condition for multi-output loss functions.
Definition 1 (Self-bounding Lipschitz condition).
A loss function is said to be self-bounding Lipschitz for if for all and ,
This condition interpolates continuously between a classical Lipschitz condition (when ) and a multi-dimensional analogue of a smoothness condition (when ), and will be the main assumption that we use to obtain our results.
Our motivation for introducing Definition 1 is as follows. Firstly, in recent work of Lei et al. (2019), the classical Lipschitz condition with respect to the norm has been utilised to derive multi-class bounds with a favourable dependence upon the number of classes . The role of the norm is crucial since it prevents the deviations in the loss function from accumulating as the output dimension grows. Our goal is to give a general framework which simultaneously achieves a favourable dependence upon . Secondly, Srebro et al. (2010) introduced a second-order smoothness condition on the loss function. This condition corresponds to the special case whereby and . Srebro et al. (2010) showed that this smoothness condition gives rise to an optimistic bound which yields a fast rate in the realizable case. The self-bounding Lipschitz condition provides a multi-dimensional analogue of this condition when , which is intended to yield a favourable dependence upon both the number of samples and the number of classes . The results established in Sections 3 and 5 show that this is indeed the case. Finally, by considering the range of exponents we obtain convergence rates ranging from slow to fast in the realizable case. This is reminiscent of the celebrated Tsybakov margin condition (Mammen and Tsybakov, 1999), which interpolates between slow and fast rates in the parametric classification setting. Crucially, however, whilst the Tsybakov margin condition is a condition on the underlying distribution which cannot be verified in practice, the self-bounding Lipschitz condition is a property of the loss function which may be verified analytically by the learner.
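Since the display in Definition 1 is elided above, we read the condition as |L(u,y) - L(v,y)| <= lam * max{L(u,y), L(v,y)}^theta * ||u - v||_inf. Under that reading, the following Python sketch (function names our own) checks the inequality on randomly sampled score pairs; a Monte-Carlo search of this kind can confirm a violation but can never prove the condition holds.

```python
import numpy as np

def sbl_violation(loss, lam, theta, u, v, y):
    """Amount by which (u, v) violates the assumed inequality
        |L(u,y) - L(v,y)| <= lam * max(L(u,y), L(v,y))**theta * ||u - v||_inf."""
    lu, lv = loss(u, y), loss(v, y)
    bound = lam * max(lu, lv) ** theta * np.max(np.abs(u - v))
    return abs(lu - lv) - bound

def check_sbl(loss, lam, theta, dim, n_trials=1000, seed=0):
    """Monte-Carlo check: True iff no sampled pair violates the bound."""
    rng = np.random.default_rng(seed)
    for _ in range(n_trials):
        u, v = rng.normal(size=dim), rng.normal(size=dim)
        y = rng.integers(dim)
        if sbl_violation(loss, lam, theta, u, v, y) > 1e-9:
            return False
    return True

# Test case: the binary logistic loss log(1 + exp(-u_y)) in one dimension
# is 1-Lipschitz, i.e. self-bounding Lipschitz with lam = 1, theta = 0.
logistic = lambda u, y: np.log1p(np.exp(-u[y]))
```

By contrast, the squared score `u[y] ** 2` is not Lipschitz, so the same check detects a violation for it.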
2.1 Verifying the self-bounding Lipschitz condition
We start by giving a collection of results which can be used to verify that a given loss function satisfies the self-bounding Lipschitz condition. The following lemmas are proved in Appendix B.
Lemma 1.
Take any , . Suppose that is a loss function such that for any , , there exists a nonnegative differentiable function satisfying

;

, .

The derivative is nonnegative on ;

, ;
Then is self-bounding Lipschitz.
Lemma 2 shows that clipping preserves this condition.
Lemma 2.
Suppose that is a self-bounding Lipschitz loss function with , . Then the loss defined by is self-bounding Lipschitz.
Finally, we note the following monotonicity property which follows straightforwardly from the definition.
Lemma 3.
Suppose that is a bounded self-bounding Lipschitz loss function with , . Then given any , the loss is also self-bounding Lipschitz with .
These properties can be used to establish the self-bounding Lipschitz condition in practical examples.
2.2 Examples
We now demonstrate several examples of multi-output loss functions that satisfy our self-bounding Lipschitz condition. In each of the examples below we shall show that the self-bounding Lipschitz condition is satisfied by applying our sufficient condition (Lemma 1). Detailed proofs are given in Appendix B.
2.2.1 Multi-class losses
We begin with the canonical multi-output prediction problem of multi-class classification in which and . A popular loss function for the theoretical analysis of multi-class learning is the margin loss (Crammer and Singer, 2001). The smoothed analogue of the margin loss was introduced by Srebro et al. (2010) in the one-dimensional setting, and by Li et al. (2018) in the multi-class setting.
Example 1 (Smooth margin losses).
Given we define the margin function by . The zero-one loss is defined by . Whilst natural, the zero-one loss has the drawback of being discontinuous, which presents an obstacle for deriving guarantees. For each , the corresponding margin loss is defined by . The margin loss is also discontinuous. However, we may define a smooth margin loss by
By applying Lemma 1 we can show that is self-bounding Lipschitz with and . Moreover, the smooth margin loss satisfies for .
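To make the construction concrete, here is a small Python sketch of a multi-class margin function together with one possible C^1 smoothed ramp (a cubic "smoothstep"); the precise ramp is elided in the display above, so this particular choice is our own illustrative one.

```python
import numpy as np

def margin(u, y):
    """Multi-class margin: score of the true class y minus the best other score."""
    other = np.delete(u, y)
    return u[y] - np.max(other)

def smooth_ramp(t, gamma):
    """A C^1 decreasing ramp: 1 for t <= 0, 0 for t >= gamma, with a cubic
    'smoothstep' interpolation in between (an illustrative choice)."""
    s = np.clip(t / gamma, 0.0, 1.0)
    return 1.0 - (3.0 * s**2 - 2.0 * s**3)

def smooth_margin_loss(u, y, gamma=1.0):
    """Smooth surrogate for the margin loss: penalises margins below gamma."""
    return smooth_ramp(margin(u, y), gamma)
```

A confident correct prediction (margin at least gamma) incurs loss 0, a misclassification (margin at most 0) incurs loss 1, and intermediate margins interpolate smoothly.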
The margin loss plays a central role in learning theory and continues to receive significant attention in the analysis of multi-class prediction (Guermeur, 2017; Li et al., 2018; Musayeva et al., 2019), so it is fortuitous that our self-bounding Lipschitz condition incorporates the smooth margin loss. More importantly, however, the self-bounding Lipschitz condition applies to a variety of other loss functions which have received less attention in statistical learning theory.
One of the most widely used loss functions in practical applications is the multinomial logistic loss, also known as the softmax loss.
Example 2 (Multinomial logistic loss).
Given , the multinomial logistic loss is defined by
where and . For each let and define . By applying Lemma 1 with we can show that the multinomial logistic loss is self-bounding Lipschitz with and .
Recently, Lei et al. (2019) emphasized that the multinomial logistic loss is Lipschitz with respect to the norm (equivalently, self-bounding Lipschitz). This gives rise to a slow rate in the sample size. The fact that the multinomial logistic loss is also self-bounding can be used to derive more favourable guarantees, as we shall see in Section 3.
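For concreteness, here is a standard numerically stable implementation. We write the loss as logsumexp(u) - u_y, which matches the usual definition of the softmax loss, though the display above is elided.

```python
import numpy as np

def multinomial_logistic_loss(u, y):
    """Softmax cross-entropy: log(sum_j exp(u_j)) - u_y, computed stably
    by shifting scores by their maximum before exponentiating."""
    m = np.max(u)
    return float(m + np.log(np.sum(np.exp(u - m))) - u[y])
```

Note that the loss approaches 0 as the score of the true class dominates the others, which is the regime in which the optimistic bounds of Section 3 are most favourable.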
2.2.2 Multi-label losses
Multi-label prediction is the challenge of classification in settings where instances may be simultaneously assigned to several categories. In multi-label classification we have , where is the total number of possible classes. Whilst is often very large, the total number of simultaneous labels is typically much smaller. Hence, we consider the set of sparse binary vectors , where . We consider the pick-all-labels loss (Menon et al., 2019; Reddi et al., 2019).
Example 3 (Pick-all-labels).
Given , the pick-all-labels loss is defined by
where and . For each we define by and let . By applying Lemma 1 with we can show that is self-bounding Lipschitz with and .
Crucially, the constant for the pick-all-labels family of losses is a function of the sparsity , rather than the total number of labels. This means that our approach is applicable to multi-label problems with tens of thousands of labels, as long as the label vectors are sparse.
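As an illustration, one common reading of the pick-all-labels loss (following Menon et al., 2019) is a sum, over the positive labels, of softmax cross-entropy terms; since the display above is elided, the exact form below is our own assumption.

```python
import numpy as np

def log_sum_exp(u):
    """Numerically stable log(sum(exp(u)))."""
    m = np.max(u)
    return m + np.log(np.sum(np.exp(u - m)))

def pick_all_labels_loss(u, y):
    """Pick-all-labels loss: one softmax cross-entropy term per positive label.

    u : real score vector of length q
    y : binary label vector of length q with a small number of ones (sparse)
    """
    positives = np.flatnonzero(y)
    return float(sum(log_sum_exp(u) - u[k] for k in positives))
```

Written this way, the number of terms in the sum is the number of positive labels rather than the total number of classes, matching the remark that the relevant constant depends on the sparsity rather than the total label count.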
2.2.3 Losses for multi-target regression
We now return to the problem of multi-target regression in which (Borchani et al., 2015).
Example 4 (Sup-norm losses).
Given , we can define a loss function for multi-target regression by setting . By applying Lemma 1 with we can see that is self-bounding Lipschitz with and . This yields examples of self-bounding Lipschitz loss functions for all and .
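A minimal sketch of one member of this family, assuming (since the display is elided) the simple clipped form min{||u - y||_inf, B}; by Lemma 2, clipping a self-bounding Lipschitz loss preserves the condition.

```python
import numpy as np

def clipped_sup_norm_loss(u, y, clip=1.0):
    """Clipped sup-norm regression loss: min(||u - y||_inf, clip).
    The clipping level `clip` is an illustrative choice (cf. Lemma 2)."""
    return float(min(np.max(np.abs(u - y)), clip))
```
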
With these examples in mind we are ready to present our results.
3 Main results
In this section we give a general upper bound for multi-output prediction problems under the self-bounding Lipschitz condition. A key tool for proving this result will be a contraction inequality for the local Rademacher complexity of vector-valued functions, given in Section 3.2, which may also be of independent interest. First, we recall the concept of Rademacher complexity.
Definition 2 (Rademacher complexity).
Let be a measurable space and consider a function class . Given a sequence we define the empirical Rademacher complexity of with respect to by¹
where the expectation is taken over sequences of independent Rademacher random variables with . For each , the worst-case Rademacher complexity of is defined by .
¹Taking the supremum over finite subsets is required to ensure that the function within the expectation is measurable (Talagrand, 2014). This technicality can typically be overlooked.
The Rademacher complexity is defined in the context of real-valued functions. However, in this work we deal with multi-output prediction, so we shall focus on function classes . In order to utilise the theory of Rademacher complexity in this context we shall transform function classes into the projected function classes as follows. Firstly, for each we define to be the projection onto the th coordinate. We then define, for each , the function by . Finally, given we let .
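The projection construction and Definition 2 can be sketched numerically; the following snippet (with names of our own choosing) builds the projected class of a small finite set of vector-valued functions and estimates the empirical Rademacher complexity of a finite real-valued class by Monte Carlo over Rademacher sign vectors.

```python
import numpy as np

def project(f, j):
    """Coordinate projection: turn a vector-valued function f into its j-th output."""
    return lambda x: f(x)[j]

def empirical_rademacher(funcs, xs, n_mc=2000, seed=0):
    """Monte-Carlo estimate of (1/n) E_sigma sup_g sum_i sigma_i g(x_i)
    for a finite class `funcs` of real-valued functions."""
    rng = np.random.default_rng(seed)
    n = len(xs)
    # Precompute the n-vector of values for each function in the class.
    vals = np.array([[g(x) for x in xs] for g in funcs])  # shape (|funcs|, n)
    total = 0.0
    for _ in range(n_mc):
        sigma = rng.choice([-1.0, 1.0], size=n)
        total += np.max(vals @ sigma) / n
    return total / n_mc

# Projected class of a small finite set of vector-valued functions R -> R^2.
F = [lambda x: np.array([x, -x]), lambda x: np.array([2 * x, 0.0])]
G = [project(f, j) for f in F for j in range(2)]
```

Calling `empirical_rademacher(G, xs)` on a sample `xs` then gives a crude numerical handle on the quantity that our bounds control.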
Our central result is the following relative bound.
Theorem 1.
Suppose we have a class of multi-output functions , and a self-bounding Lipschitz loss function for some , , . Take , and let
There exist numerical constants such that, given an i.i.d. sample , the following holds with probability at least for all ,
Moreover, if minimises the risk and minimises the empirical risk, then with probability at least ,
The proof of Theorem 1 is built upon a local contraction inequality result (Proposition 1, Section 3.2). The result follows by combining with techniques from Bousquet (2002). For details see Appendix A.
Theorem 1 gives an upper bound for the generalization gap , framed in terms of a complexity term , which depends upon both the Rademacher complexity of the projected function class and the self-bounding Lipschitz parameters , . When the empirical error is small in relation to the complexity term (), the generalization gap is of order . In less favourable circumstances we recover a bound of order .
In Section 4 we will demonstrate that in the realizable setting, Theorem 1 is minimax optimal up to logarithmic factors, both in terms of the sample size , and the output dimension . In Section 5 we will demonstrate that Theorem 1 yields state-of-the-art guarantees for ensembles of decision trees for multi-output prediction problems.
3.1 Comparison with state of the art
In this section we compare our main result (Theorem 1) with a closely related guarantee due to Lei et al. (2019). We say that a loss function is Lipschitz if it is self-bounding Lipschitz with .
Theorem 2.
(Lei et al., 2019). Suppose we have a class of multi-output functions , and a Lipschitz loss function for some and . Take , and let
There exist numerical constants such that, given an i.i.d. sample , the following holds with probability at least for all ,
Moreover, if minimises the risk and minimises the empirical risk, then with probability at least ,
Theorem 2 is a mild generalization of Theorem 6 from Lei et al. (2019), which establishes the special case of Theorem 2 in which is an RKHS and the learning problem is multi-class classification. For completeness we show that Theorem 2 follows from Proposition 1 in Appendix A. Note that by the monotonicity property (Lemma 3) any loss function which is self-bounding Lipschitz is also Lipschitz, so the additive bound in Theorem 2 also applies.
To gain a deeper intuition for the bound in Theorem 1 we compare it with the bound in Theorem 2. Let us suppose that (for a concrete example where this is the case see Section 5). We then have . For large values of Theorem 1 gives a bound on the generalization gap of order , which is slower than the rate achieved by Theorem 2 whenever . However, when is small (), Theorem 1 gives rise to a bound of order , yielding faster rates than can be obtained through the standard Lipschitz condition alone whenever . Finally, note that if the loss is self-bounding Lipschitz with then the rates given by Theorem 1 always either match or outperform the rates given by Theorem 2. Moreover, this occurs for several practical examples discussed in Section 2.2, including the multinomial logistic loss.
3.2 A contraction inequality for the local Rademacher complexity of vector-valued function classes
We now turn to stating and proving the key ingredient of our main result, Proposition 1. First we introduce some additional notation.
Suppose . Given a loss function we define by . We extend this definition to function classes by . Moreover, for each and , we consider a subset . Intuitively, the local Rademacher complexity allows us to zoom in upon the neighbourhood of the empirical risk minimizer. This is the subset that matters in practice, and it is typically much smaller than the full class .
Proposition 1.
Suppose we have a class of multi-output functions , where . Given a self-bounding Lipschitz loss function , where , and , , we have,
The proof of Proposition 1, given later in this section, relies upon covering numbers.
Definition 3 (Covering numbers).
Let be a semimetric space. Given a set and an , a subset is said to be a (proper) cover of if, for all , there exists some with . We let denote the minimal cardinality of an cover for .
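As a concrete illustration of Definition 3, a proper epsilon-cover of a finite point set under a given metric can be built greedily; the greedy construction below is a standard device for illustration, not taken from the paper, and it produces a valid proper cover though not necessarily a minimal one.

```python
import numpy as np

def greedy_proper_cover(points, eps, dist):
    """Greedily pick a subset C of `points` such that every point is within
    eps of some member of C (a proper eps-cover; not necessarily minimal)."""
    centers = []
    for p in points:
        if all(dist(p, c) > eps for c in centers):
            centers.append(p)
    return centers

# Sup-norm (max-norm) metric, matching the covers used in this section.
sup_dist = lambda a, b: float(np.max(np.abs(np.asarray(a) - np.asarray(b))))
```

For example, on the points {0, 0.1, 1.0} with eps = 0.2, the greedy pass keeps two centers, and every point is within eps of one of them.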
We shall consider covering numbers for two classes of data-dependent semi-metric spaces. Let be a measurable space and take . For each and each sequence we define a pair of metrics and by
where . The first stage of the proof of Proposition 1 uses the following lemma, which bounds the covering number of in terms of an associated covering number for .
Lemma 4.
Suppose that and is self-bounding Lipschitz with . Take , , and define . Given any ,
Moreover, for any , .
Proof of Lemma 4.
To prove the first part of the lemma we take and let . It follows from the construction of that for each , so for each .
Furthermore, by the self-bounding Lipschitz condition we deduce that for each ,
Hence, by Jensen’s inequality we have
where we use the fact that and . Thus,
This completes the proof of the first part of the lemma.
To prove the second part of the lemma we note that since we have²
so we may choose with such that forms a cover of with respect to the metric.
²The factor of is required as we are using proper covers, which are subsets of the set being covered (see Definition 3).
To complete the proof it suffices to show that is a cover of with respect to the metric.
Take any , so for some . Since forms a cover of we may choose so that . By the first part of the lemma we deduce that
Since this holds for all , we see that is a cover of , which completes the proof of the lemma. ∎
To prove Proposition 1, we shall also utilise two technical results to move from covering numbers to Rademacher complexity and back. First, we shall use the following powerful result from Srebro et al. (2010), which gives an upper bound for worst-case covering numbers in terms of the worst-case Rademacher complexity.
Theorem 3 (Srebro et al. (2010)).
Given a measurable space and a function class , any and any ,
We can view this result as an analogue of Sudakov’s minoration inequality for covers, rather than covers.
Secondly, we shall use Dudley's inequality (Dudley, 1967), which allows us to bound Rademacher complexities in terms of covering numbers. We shall use the following variant due to Guermeur (2017), as it yields more favourable constants.
Theorem 4 (Guermeur (2017)).
Suppose we have a measurable space , a function class and a sequence . For any decreasing sequence with , the following inequality holds for all ,
We are now ready to complete the proof of our local Rademacher complexity inequality.