Permutational Rademacher Complexity: a New Complexity Measure for Transductive Learning

05/12/2015, by Ilya Tolstikhin et al. (Moscow Institute of Physics and Technology, Max Planck Society, Universität Potsdam)

Transductive learning considers situations when a learner observes m labelled training points and u unlabelled test points with the final goal of giving correct answers for the test points. This paper introduces a new complexity measure for transductive learning called Permutational Rademacher Complexity (PRC) and studies its properties. A novel symmetrization inequality is proved, which shows that PRC provides a tighter control over expected suprema of empirical processes compared to what happens in the standard i.i.d. setting. A number of comparison results are also provided, which show the relation between PRC and other popular complexity measures used in statistical learning theory, including Rademacher complexity and Transductive Rademacher Complexity (TRC). We argue that PRC is a more suitable complexity measure for transductive learning. Finally, these results are combined with a standard concentration argument to provide novel data-dependent risk bounds for transductive learning.


1 Introduction

Rademacher complexities ([14], [2]) play an important role in the widely used concentration-based approach to statistical learning theory [4], which is closely related to the analysis of empirical processes [21]. They measure the complexity of function classes and provide data-dependent risk bounds in the standard i.i.d. framework of inductive learning, thanks to symmetrization and concentration inequalities. Recently, a number of attempts were made to apply this machinery also to the transductive learning setting [22]. In particular, the authors of [10] introduced a notion of transductive Rademacher complexity and provided an extensive study of its properties, as well as general transductive risk bounds based on this new complexity measure.

In transductive learning, a learner observes $m$ labelled training points and $u$ unlabelled test points. The goal is to give correct answers on the test points. Transductive learning naturally appears in many modern large-scale applications, including text mining, recommender systems, and computer vision, where often the objects to be classified are available beforehand. There are two different settings of transductive learning, defined by V. Vapnik in his book [22, Chap. 8]. The first one assumes that all the objects from the training and test sets are generated i.i.d. from an unknown distribution $P$. The second one is distribution free: it assumes that the training and test sets are realized by a uniform random partition of a fixed and finite general population of cardinality $N := m + u$ into two disjoint subsets of cardinalities $m$ and $u$; moreover, no assumptions are made regarding the underlying source of this general population. The second setting has gained much attention ([22], [9], [7], [10], [8], and [20]; for an extensive overview of transductive risk bounds we refer the reader to [18]), probably due to the fact that any upper risk bound for this setting directly implies a risk bound for the first setting as well [22, Theorem 8.1]. In essence, the second setting studies uniform deviations of risks computed on two disjoint finite samples. Following Vapnik's discussion in [6, p. 458], we would also like to emphasize that the second setting of transductive learning naturally appears as a middle step in proofs of standard inductive risk bounds, as a result of symmetrization or the so-called double-sample trick. In this way, better transductive risk bounds also translate into better inductive ones.

An important difference between the two settings discussed above lies in the fact that the elements of the training set in the second setting are interdependent, because they are sampled uniformly without replacement from the general population. As a result, the standard techniques developed for inductive learning, including the concentration and Rademacher complexity tools mentioned above, cannot be applied in this setting, since they rely heavily on the i.i.d. assumption. Therefore, it is important to study empirical processes in the setting of sampling without replacement.

Previous work. A large step in this direction was made in [10], where the authors presented a version of McDiarmid’s bounded difference inequality [5] for sampling without replacement together with the Transductive Rademacher Complexity (TRC). As a main application the authors derived an upper bound on the binary test error of a transductive learning algorithm in terms of TRC. However, the analysis of [10] has a number of shortcomings. Most importantly, TRC depends on the unknown labels of the test set. In order to obtain computable risk bounds, the authors resorted to the contraction inequality [15], which is known to be a loose step [17], since it destroys any dependence on the labels.

Another line of work was presented in [20], where variants of Talagrand's concentration inequality were derived for the setting of sampling without replacement. These inequalities were then applied to achieve transductive risk bounds with fast rates of convergence, following a localized approach [1]. In contrast, in this work we consider only the worst-case analysis based on global complexity measures. An analysis under additional assumptions on the problem at hand, including Mammen-Tsybakov type low noise conditions [4], is an interesting open question and is left for future work.

Summary of our results. This paper continues the analysis of empirical processes indexed by arbitrary classes of uniformly bounded functions in the setting of sampling without replacement, initiated by [10]. We introduce a new complexity measure called permutational Rademacher complexity (PRC) and argue that it captures the nature of this setting very well. Due to space limitations we present the analysis of PRC only for the special case when the training and test sets have the same size ($m = u$), which is nonetheless sufficiently illustrative (all the results presented in this paper are also available for the general case, but we defer them to a future extended version of this paper).

We prove a novel symmetrization inequality (Theorem 3.1), which shows that the expected PRC and the expected suprema of empirical processes based on sampling without replacement are equivalent up to multiplicative constants. Quite remarkably, the new upper and lower bounds (the latter is often called a desymmetrization inequality) both hold without any additive terms when $m = u$, in contrast to the standard i.i.d. setting, where an additive term is unavoidable in the lower bound. For TRC even the upper symmetrization inequality [10, Lemma 4] includes an additive term, and no desymmetrization inequality is known. This suggests that PRC may be a more suitable complexity measure for transductive learning. We would also like to note that the proof of our new symmetrization inequality is surprisingly simple compared to the one presented in [10].

Next we compare PRC with other popular complexity measures used in statistical learning theory. In particular, we provide achievable upper and lower bounds relating PRC to the conditional Rademacher complexity (Theorem 3.2). These bounds show that the PRC is upper and lower bounded by the conditional Rademacher complexity up to additive terms, which are achievable (Lemma 1). In addition, Theorem 3.2 also significantly improves the bounds on the complexity measure called maximum discrepancy presented in [2, Lemma 3]. We also provide a comparison between the expected PRC and TRC (Corollary 1), which shows that their values are close up to small multiplicative constants and additive terms.

Finally, we apply these results to obtain a new computable data-dependent risk bound for transductive learning based on the PRC (Theorem 4.2), which holds for any bounded loss function. We conclude by discussing the advantages of the new risk bound over the previously best known bound of [10].

2 Notations

We will use calligraphic symbols to denote sets, with subscripts indicating their cardinalities: for instance, $\mathcal{Z}_N$ denotes a set of cardinality $N$. For any function $f$ we will denote its average value computed on a finite set $\mathcal{Z}$ by $\bar f(\mathcal{Z}) := \frac{1}{|\mathcal{Z}|}\sum_{z \in \mathcal{Z}} f(z)$. In what follows we will consider an arbitrary space $\mathcal{Z}$ (for instance, a space of input-output pairs) and a class $\mathcal{F}$ of functions (for instance, loss functions) mapping $\mathcal{Z}$ to $\mathbb{R}$. Most of the proofs are deferred to the last section for improved readability.

Arguably, one of the most popular complexity measures used in statistical learning theory is the Rademacher complexity ([15], [14], [2]):

Definition 1 (Conditional Rademacher complexity)

Fix any subset $\mathcal{Z}_n := \{z_1, \dots, z_n\} \subseteq \mathcal{Z}$. The following random quantity is commonly known as a conditional Rademacher complexity:
$$\hat{\mathcal{R}}(\mathcal{F}, \mathcal{Z}_n) := \mathbb{E}_{\epsilon}\left[\sup_{f \in \mathcal{F}} \frac{1}{n}\sum_{i=1}^{n} \epsilon_i f(z_i)\right],$$
where $\epsilon := (\epsilon_1, \dots, \epsilon_n)$ are i.i.d. Rademacher signs, taking values $+1$ and $-1$ with probabilities $1/2$ each. When the set $\mathcal{Z}_n$ is clear from the context we will simply write $\hat{\mathcal{R}}(\mathcal{F})$.
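To make the definition concrete, here is a minimal Monte Carlo sketch (not from the paper; the finite class, the matrix representation `F[j, i] = f_j(z_i)`, and all names are illustrative assumptions) that estimates the conditional Rademacher complexity of a finite function class on a fixed sample.

```python
import numpy as np

def conditional_rademacher(F, n_draws=10_000, seed=0):
    """Monte Carlo estimate of the conditional Rademacher complexity of a finite
    class given as a matrix F with F[j, i] = f_j(z_i) on a fixed sample z_1..z_n."""
    rng = np.random.default_rng(seed)
    n = F.shape[1]
    eps = rng.choice([-1.0, 1.0], size=(n_draws, n))   # i.i.d. Rademacher signs
    sups = (eps @ F.T / n).max(axis=1)                 # sup over the class for each draw
    return sups.mean()

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    z = rng.normal(size=12)                                           # toy sample, n = 12
    F = np.vstack([np.sign(z - t) for t in np.linspace(-1, 1, 20)])   # small threshold class
    print(conditional_rademacher(F))
```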

As discussed in the introduction, Rademacher complexities play an important role in the analysis of empirical processes and statistical learning theory. However, this measure of complexity was devised mainly for the i.i.d. setting, which is different from our setting of sampling without replacement. The following complexity measure was introduced in [10] to overcome this issue:

Definition 2 (Transductive Rademacher complexity)

Fix any set $\mathcal{Z}_N := \{z_1, \dots, z_N\} \subseteq \mathcal{Z}$, positive integers $m$ and $u$ such that $m + u = N$, and $p \in [0, 1/2]$. The following quantity is called the Transductive Rademacher complexity (TRC):
$$\hat{\mathcal{R}}^{\mathrm{trc}}_{m,u}(\mathcal{F}, \mathcal{Z}_N, p) := \left(\frac{1}{m} + \frac{1}{u}\right)\mathbb{E}_{\sigma}\left[\sup_{f \in \mathcal{F}} \sum_{i=1}^{N} \sigma_i f(z_i)\right],$$
where $\sigma := (\sigma_1, \dots, \sigma_N)$ are i.i.d. random variables taking values $+1$ and $-1$ with probabilities $p$ each and $0$ with probability $1 - 2p$.
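Analogously, the following sketch gives a Monte Carlo estimate of the TRC for a finite class. It is illustrative only; the matrix representation and the default choice of $p$ are our own assumptions, with $p = mu/(m+u)^2$ being a choice commonly associated with the bounds of [10].

```python
import numpy as np

def transductive_rademacher(F, m, u, p=None, n_draws=10_000, seed=0):
    """Monte Carlo estimate of the TRC of a finite class F[j, i] = f_j(z_i) on
    N = m + u points: sigma_i equals +1 or -1 with probability p each and 0
    with probability 1 - 2p, and the supremum is scaled by (1/m + 1/u)."""
    rng = np.random.default_rng(seed)
    N = F.shape[1]
    assert N == m + u
    if p is None:
        p = m * u / N ** 2                      # assumed default choice, cf. [10]
    sigma = rng.choice([-1.0, 0.0, 1.0], size=(n_draws, N), p=[p, 1 - 2 * p, p])
    sups = (sigma @ F.T).max(axis=1)
    return (1.0 / m + 1.0 / u) * sups.mean()
```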

We summarize the importance of these two complexity measures in the analysis of empirical processes when sampling without replacement in the following result:

Theorem 2.1

Fix an $N$-element subset $\mathcal{Z}_N \subseteq \mathcal{Z}$ and let the elements of $\mathcal{Z}_m$ be sampled uniformly without replacement from $\mathcal{Z}_N$. Also let the elements of $\mathcal{X}_m$ be sampled uniformly with replacement from $\mathcal{Z}_N$. Denote $\mathcal{Z}_u := \mathcal{Z}_N \setminus \mathcal{Z}_m$ with $u := N - m$. The following upper bound in terms of the i.i.d. Rademacher complexity was provided in [20]: the expected supremum of the empirical process,
$$\mathbb{E}_{\mathcal{Z}_m}\Big[\sup_{f \in \mathcal{F}}\big(\bar f(\mathcal{Z}_u) - \bar f(\mathcal{Z}_m)\big)\Big],$$
is upper bounded by a constant multiple of the expected conditional Rademacher complexity $\mathbb{E}_{\mathcal{X}_m}\big[\hat{\mathcal{R}}(\mathcal{F}, \mathcal{X}_m)\big]$ of the i.i.d. sample (1). The following bound in terms of TRC was provided in [10]. Assume that the functions in $\mathcal{F}$ are uniformly bounded by $B$. Then, for $p_0 := \frac{mu}{N^2}$, the same expected supremum is upper bounded by $\hat{\mathcal{R}}^{\mathrm{trc}}_{m,u}(\mathcal{F}, \mathcal{Z}_N, p_0)$ plus an additive term of order $B\big(\frac{1}{m} + \frac{1}{u}\big)\sqrt{\min(m, u)}$ (2).

While (1) did not explicitly appear in [20], it can be immediately derived using [20, Corollary 8] and i.i.d. symmetrization of [13, Theorem 2.1].

Finally, we introduce our new complexity measure:

Definition 3 (Permutational Rademacher complexity)

Let $\mathcal{Z}_n \subseteq \mathcal{Z}$ be any fixed set of cardinality $n$. For any positive integer $k < n$ the following quantity will be called a permutational Rademacher complexity (PRC):
$$\hat{\mathcal{Q}}_k(\mathcal{F}, \mathcal{Z}_n) := \mathbb{E}_{\mathcal{Z}_k}\left[\sup_{f \in \mathcal{F}}\big(\bar f(\mathcal{Z}_k) - \bar f(\mathcal{Z}_{n-k})\big)\right],$$
where $\mathcal{Z}_k$ is a random subset of $\mathcal{Z}_n$ containing $k$ elements sampled uniformly without replacement and $\mathcal{Z}_{n-k} := \mathcal{Z}_n \setminus \mathcal{Z}_k$. When the set $\mathcal{Z}_n$ is clear from the context we will simply write $\hat{\mathcal{Q}}_k(\mathcal{F})$.

The name PRC is explained by the fact that if $n$ is even and $k = n/2$ then the definitions of $\hat{\mathcal{Q}}_{n/2}(\mathcal{F}, \mathcal{Z}_n)$ and $\hat{\mathcal{R}}(\mathcal{F}, \mathcal{Z}_n)$ are very similar. Indeed, in this case $\bar f(\mathcal{Z}_{n/2}) - \bar f(\mathcal{Z}_n \setminus \mathcal{Z}_{n/2}) = \frac{2}{n}\sum_{i=1}^n \eta_i f(z_i)$, where $\eta$ is a randomly permuted sequence containing an equal number of $+1$ and $-1$ signs; thus, up to normalization, the only difference is that the expectation in the PRC is over such randomly permuted balanced sequences, whereas in the Rademacher complexity the average is with respect to all possible sequences of signs. The term "permutation complexity" has already appeared in [16], where it was used to denote a novel complexity measure for model selection. However, this measure was specific to the i.i.d. setting and binary loss. Moreover, the bounds presented in [16] were of the same order as the risk bounds based on the Rademacher complexity, with worse constants in the slack term.
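Following the definition above, here is a minimal Monte Carlo sketch of the PRC for a finite class (illustrative only; the matrix representation and names are our own, and the normalization follows Definition 3 as stated here).

```python
import numpy as np

def permutational_rademacher(F, k, n_draws=10_000, seed=0):
    """Monte Carlo estimate of the PRC of a finite class F[j, i] = f_j(z_i):
    average, over uniformly random k-subsets drawn without replacement, of
    sup_f (mean of f on the k-subset minus mean of f on its complement)."""
    rng = np.random.default_rng(seed)
    n = F.shape[1]
    total = 0.0
    for _ in range(n_draws):
        perm = rng.permutation(n)
        a, b = perm[:k], perm[k:]               # uniform split without replacement
        total += (F[:, a].mean(axis=1) - F[:, b].mean(axis=1)).max()
    return total / n_draws
```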

3 Symmetrization and Comparison Results

We start by showing a version of the i.i.d. symmetrization inequality (references can be found in [15], [13]) for the setting of sampling without replacement. It shows that the expected supremum of the empirical process in this setting is, up to multiplicative constants, equivalent to the expected PRC.

Theorem 3.1

Fix an $N$-element subset $\mathcal{Z}_N \subseteq \mathcal{Z}$ and let the elements of $\mathcal{Z}_m$ be sampled uniformly without replacement from $\mathcal{Z}_N$. Denote $\mathcal{Z}_u := \mathcal{Z}_N \setminus \mathcal{Z}_m$ with $u := N - m$. If $m = u$ and $m$ is even, then for any class $\mathcal{F}$ the expected supremum of the empirical process, $\mathbb{E}_{\mathcal{Z}_m}\big[\sup_{f \in \mathcal{F}}\big(\bar f(\mathcal{Z}_u) - \bar f(\mathcal{Z}_m)\big)\big]$, is upper and lower bounded by the expected PRC $\mathbb{E}_{\mathcal{Z}_m}\big[\hat{\mathcal{Q}}_{m/2}(\mathcal{F}, \mathcal{Z}_m)\big]$ up to multiplicative absolute constants, without any additive terms.

The inequalities also hold if we include absolute values inside the suprema.

Proof

The proof can be found in Sect. 5.1.

This inequality should be compared to the previously known complexity bounds of Theorem 2.1. First of all, in contrast to (1) and (2) the new bound provides a two-sided control, which shows that PRC is a "correct" complexity measure for our setting. It is also remarkable that the lower bound (commonly known as the desymmetrization inequality) does not include any additive terms, since in the standard i.i.d. setting the lower bound holds only up to an additive term [13, Sect. 2.1]. Also note that this result does not assume the boundedness of functions in $\mathcal{F}$, which is a necessary assumption both in (2) and in the i.i.d. desymmetrization inequality.
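As an informal numerical illustration of this two-sided control (a toy experiment under our own assumptions, not an experiment from the paper), the following sketch estimates both quantities for a small finite class on a toy population so that they can be compared directly.

```python
import numpy as np

def compare_process_and_prc(F, m, n_outer=500, n_splits=100, seed=0):
    """Estimate (i) the expected supremum of the without-replacement process,
    E[sup_f (mean of f on Z_u - mean of f on Z_m)], and (ii) the expected PRC of
    the observed training half, for a finite class given as F[j, i] = f_j(z_i)."""
    rng = np.random.default_rng(seed)
    N = F.shape[1]
    proc, prc = 0.0, 0.0
    for _ in range(n_outer):
        perm = rng.permutation(N)
        train, test = perm[:m], perm[m:]
        proc += (F[:, test].mean(axis=1) - F[:, train].mean(axis=1)).max()
        # PRC of the training half: average over random balanced splits of Z_m
        inner = 0.0
        for _ in range(n_splits):
            half = rng.permutation(train)
            a, b = half[: m // 2], half[m // 2:]
            inner += (F[:, a].mean(axis=1) - F[:, b].mean(axis=1)).max()
        prc += inner / n_splits
    return proc / n_outer, prc / n_outer

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    z = rng.normal(size=40)                                           # toy population, N = 40
    F = np.vstack([np.sign(z - t) for t in np.linspace(-1, 1, 15)])   # small threshold class
    print(compare_process_and_prc(F, m=20))                           # m = u = 20
```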

Next we compare PRC with the conditional Rademacher complexity:

Theorem 3.2

Let $\mathcal{Z}_n \subseteq \mathcal{Z}$ be any fixed set of even cardinality $n$. Then the PRC is upper bounded by the conditional Rademacher complexity up to a multiplicative absolute constant:

(3) $\hat{\mathcal{Q}}_{n/2}(\mathcal{F}, \mathcal{Z}_n)$ is at most a constant multiple of $\hat{\mathcal{R}}(\mathcal{F}, \mathcal{Z}_n)$.

Moreover, if the functions in $\mathcal{F}$ are absolutely bounded by $B$, then the two complexities are also close in the additive sense:

(4) $\hat{\mathcal{Q}}_{n/2}(\mathcal{F}, \mathcal{Z}_n)$ and $\hat{\mathcal{R}}(\mathcal{F}, \mathcal{Z}_n)$ differ by at most an additive term of order $B/\sqrt{n}$.

The results also hold if we include absolute values inside the suprema in (3) and (4).

Proof

Conceptually, the proof is based on a coupling between a sequence of $n$ i.i.d. Rademacher signs and a uniform random permutation of a set containing $n/2$ plus and $n/2$ minus signs. This idea was inspired by the techniques used in [11]. The detailed proof can be found in Sect. 5.2.

Note that a typical order of $\hat{\mathcal{R}}(\mathcal{F}, \mathcal{Z}_n)$ is $n^{-1/2}$; thus the multiplicative upper bound (3) can be much tighter than the additive upper bound of (4). We would also like to note that Theorem 3.2 significantly improves the bounds of Lemma 3 in [2], which relate the so-called maximal discrepancy measure of the class to its Rademacher complexity (for further discussion we refer to the Appendix).

Our next result shows that bounds of Theorem 3.2 are essentially tight.

Lemma 1

Let $\mathcal{Z}_n \subseteq \mathcal{Z}$ with even cardinality $n$. There are two finite classes $\mathcal{F}_1$ and $\mathcal{F}_2$ of functions mapping $\mathcal{Z}$ to $\mathbb{R}$ and absolutely bounded by $B$, such that:

(5) the PRC of $\mathcal{F}_1$ is zero, while its conditional Rademacher complexity is of order $B/\sqrt{n}$;

(6) the gap between the PRC and the conditional Rademacher complexity of $\mathcal{F}_2$ matches, up to lower order terms, the multiplicative factor appearing in (3).
Proof

The proof can be found in Sect. 5.3.

Inequalities (5) simultaneously show that (a) the order of the additive bound (4) cannot be improved, and (b) the multiplicative upper bound (3) cannot be reversed. Moreover, it can be shown using (6) that the multiplicative factor appearing in (3) cannot be improved.

Finally, we compare PRC to the transductive Rademacher complexity:

Lemma 2

Fix any set $\mathcal{Z}_N \subseteq \mathcal{Z}$. If $m = u$ and $0 < p \le 1/2$, then the transductive Rademacher complexity $\hat{\mathcal{R}}^{\mathrm{trc}}_{m,u}(\mathcal{F}, \mathcal{Z}_N, p)$ is lower and upper bounded by the conditional Rademacher complexity $\hat{\mathcal{R}}(\mathcal{F}, \mathcal{Z}_N)$ up to multiplicative constants depending only on $p$.

Proof

The upper bound was presented in [10, Lemma 1]. For the lower bound, notice that for $p \le 1/2$ the i.i.d. signs $\sigma_i$ from Definition 2 have the same distribution as $\epsilon_i \delta_i$, where the $\epsilon_i$ are i.i.d. Rademacher signs and the $\delta_i$ are i.i.d. Bernoulli random variables with parameter $2p$. Thus, Jensen's inequality (applied conditionally on $\epsilon$) gives the lower bound.

Together with Theorems 3.1 and 3.2, this result shows that for $m = u$ the PRC cannot be much larger than the transductive Rademacher complexity:

Corollary 1

Using the notations of Theorem 3.1, the expected PRC $\mathbb{E}\big[\hat{\mathcal{Q}}_{m/2}(\mathcal{F}, \mathcal{Z}_m)\big]$ is upper bounded by a small multiplicative constant times the transductive Rademacher complexity of $\mathcal{F}$ on $\mathcal{Z}_N$. If the functions in $\mathcal{F}$ are uniformly bounded by $B$, then a lower bound of the same form also holds, up to an additional additive term.

Proof

Simply combine Theorem 3.1, Theorem 3.2, and Lemma 2.

4 Transductive Risk Bounds

Next we will use the results of Sect. 3 to obtain a new transductive risk bound. First we briefly describe the setting.

We will consider the second, distribution-free setting of transductive learning described in the introduction. Fix any finite general population of input-output pairs $\mathcal{Z}_N := \{(x_i, y_i)\}_{i=1}^N \subseteq \mathcal{X} \times \mathcal{Y}$, where $\mathcal{X}$ and $\mathcal{Y}$ are arbitrary input and output spaces. We make no assumptions regarding the underlying source of $\mathcal{Z}_N$. The learner receives the labelled training set $\mathcal{Z}_m$ consisting of $m$ elements sampled uniformly without replacement from $\mathcal{Z}_N$. The remaining test set $\mathcal{Z}_u := \mathcal{Z}_N \setminus \mathcal{Z}_m$ is presented to the learner without labels (we will use $\mathcal{X}_u$ to denote the inputs of $\mathcal{Z}_u$). The goal of the learner is to find a predictor $h$ in a fixed hypothesis class $\mathcal{H}$, based on the training sample $\mathcal{Z}_m$ and the unlabelled test points $\mathcal{X}_u$, which has a small test risk measured using a bounded loss function $\ell\colon \mathcal{Y} \times \mathcal{Y} \to [0, B]$. For $h \in \mathcal{H}$ and $z = (x, y)$ denote $\ell_h(z) := \ell\big(h(x), y\big)$ and also denote the loss class $\mathcal{L}_{\mathcal{H}} := \{\ell_h \colon h \in \mathcal{H}\}$. Then the test and training risks of $h$ are defined as $\mathrm{err}_u(h) := \bar{\ell}_h(\mathcal{Z}_u)$ and $\mathrm{err}_m(h) := \bar{\ell}_h(\mathcal{Z}_m)$ respectively.
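As a concrete illustration of this setting (hypothetical names and data; the 0/1 loss is used as one example of a bounded loss), the following sketch draws the training set uniformly without replacement from a fixed labelled population and computes the training and test risks of a given predictor.

```python
import numpy as np

def transductive_risks(X, y, predict, m, seed=0):
    """Distribution-free transductive setting: the N labelled pairs are fixed,
    the training set is a uniformly random m-subset (sampled without replacement),
    and the remaining u = N - m points form the test set. Returns the training
    and test risks of `predict` under the bounded 0/1 loss."""
    rng = np.random.default_rng(seed)
    N = len(y)
    perm = rng.permutation(N)
    train, test = perm[:m], perm[m:]
    losses = (predict(X) != y).astype(float)   # 0/1 loss on the whole population
    return losses[train].mean(), losses[test].mean()

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(50, 2))
    y = np.sign(X[:, 0] + 0.3 * rng.normal(size=50))
    print(transductive_risks(X, y, lambda X: np.sign(X[:, 0]), m=25))
```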

The following risk bound in terms of TRC was presented in [10, Corollary 2]:

Theorem 4.1 ([10])

If the functions in the loss class $\mathcal{L}_{\mathcal{H}}$ are uniformly bounded by $B$, then with probability at least $1 - \delta$ over the random training set $\mathcal{Z}_m$ every $h \in \mathcal{H}$ satisfies a bound of the form (7), in which the test risk $\mathrm{err}_u(h)$ is upper bounded by the training risk $\mathrm{err}_m(h)$ plus the transductive Rademacher complexity of the loss class $\mathcal{L}_{\mathcal{H}}$ and a slack term depending on $B$, $m$, $u$, and $\delta$.

Using results of Sect. 3 we obtain the following risk bound:

Theorem 4.2

If $m = u$, $m$ is even, and the loss function $\ell$ is bounded by $B$, then with probability at least $1 - \delta$ over the random training set $\mathcal{Z}_m$ every $h \in \mathcal{H}$ satisfies a bound of the form (8), in which $\mathrm{err}_u(h)$ is upper bounded by $\mathrm{err}_m(h)$ plus the expected PRC $\mathbb{E}\big[\hat{\mathcal{Q}}_{m/2}(\mathcal{L}_{\mathcal{H}}, \mathcal{Z}_m)\big]$ plus a slack term of order $B\sqrt{\log(1/\delta)/m}$.

Moreover, with a correspondingly adjusted confidence level, every $h \in \mathcal{H}$ satisfies the computable bound (9), which is obtained from (8) by replacing the expected PRC with the PRC $\hat{\mathcal{Q}}_{m/2}(\mathcal{L}_{\mathcal{H}}, \mathcal{Z}_m)$ of the observed training set, at the price of a slightly larger slack term.
Proof

The proof can be found in Sect. 5.4.

We conclude by comparing the risk bounds of Theorems 4.2 and 4.1:

1. First of all, the upper bound of (9) is computable. This bound is based on a concentration argument, which shows that the expected PRC (appearing in (8)) can be nicely estimated using the training set. Meanwhile, the upper bound of (7) depends on the unknown labels of the test set through TRC. In order to make it computable, the authors of [10] resorted to the contraction inequality, which allows one to drop any dependence on the labels for Lipschitz losses, and which is known to be a loose step [17].

2. Moreover, we would like to note that for the binary loss function TRC (as well as the Rademacher complexity) does not depend on the labels at all. Indeed, this can be shown by writing $\ell_h(z_i) = \frac{1 - y_i h(x_i)}{2}$ for $y_i, h(x_i) \in \{-1, +1\}$ and noting that $\sigma_i$ and $y_i \sigma_i$ are identically distributed for the $\sigma_i$ used in Definition 2. This is not true for PRC, which is sensitive to the labels even in this setting. As future work, we hope to use this fact for an analysis in the low-noise setting [4].

3. The slack term appearing in (8) is significantly smaller than the one of (7); for instance, when $m = u$ the latter is larger by a considerable multiplicative factor. This is caused by the additive term in the symmetrization inequality (2). At the same time, Corollary 1 shows that the complexity term appearing in (8) is at most two times larger than the TRC appearing in (7).

4. The comparison result of Theorem 3.2 shows that the upper bound of (9) is also tighter than the one that can be obtained using (1) and the conditional Rademacher complexity.

5. Similar upper bounds (up to an extra factor of 2) also hold for the excess risk $\mathrm{err}_u(\hat h) - \min_{h \in \mathcal{H}} \mathrm{err}_u(h)$, where $\hat h$ minimizes the training risk over $\mathcal{H}$. This can be proved using an argument similar to that of Theorem 4.2.

6. Finally, one more application of the concentration argument can simplify the computation of the PRC, by estimating the expected value appearing in Definition 3 with only one random partition of $\mathcal{Z}_m$, as illustrated by the sketch below.
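A minimal sketch of that suggestion (names and the one-split estimator are illustrative, not the authors' code): compute the supremum over the loss class of the difference of the two half-sample averages for a single uniformly random balanced split of the training set.

```python
import numpy as np

def prc_single_partition(L, seed=0):
    """One-shot PRC estimate: L[j, i] holds the loss of hypothesis j on training
    point i; a single uniformly random split of the m training points into two
    halves replaces the expectation over all balanced splits in Definition 3."""
    rng = np.random.default_rng(seed)
    m = L.shape[1]
    perm = rng.permutation(m)
    a, b = perm[: m // 2], perm[m // 2:]
    return (L[:, a].mean(axis=1) - L[:, b].mean(axis=1)).max()
```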

5 Full Proofs

5.1 Proof of Theorem 3.1

Lemma 3

For $k \le n$, let $\{x_1, \dots, x_k\}$ be sampled uniformly without replacement from a finite set of real numbers $\{x'_1, \dots, x'_n\}$. Then:
$$\mathbb{E}\left[\frac{1}{k}\sum_{i=1}^{k} x_i\right] = \frac{1}{n}\sum_{i=1}^{n} x'_i.$$
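A quick numerical check of this identity (toy numbers, for illustration only): enumerating all k-subsets reproduces the expectation under uniform sampling without replacement, and the average of their means equals the mean of the whole set.

```python
import numpy as np
from itertools import combinations

# Check of Lemma 3: the expected average of a k-subset sampled uniformly without
# replacement from a finite set equals the average of the whole set.
x = np.array([0.3, -1.2, 4.0, 2.5, 0.0])
k = 2
subset_means = [np.mean(c) for c in combinations(x, k)]
print(np.mean(subset_means), x.mean())   # both print the same value: 1.12
```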

Proof (of Theorem 3.1)

Fix any positive integers $n_1$ and $n_2$ such that $n_1 + n_2 = m$, which implies $n_1 \le u$ and $n_2 \le m$ (recall that $m = u$). Note that Lemma 3 implies that $\bar f(\mathcal{Z}_u) - \bar f(\mathcal{Z}_m)$ is the conditional expectation of $\bar f(\mathcal{Z}_{n_1}) - \bar f(\mathcal{Z}_{n_2})$, where $\mathcal{Z}_{n_1}$ and $\mathcal{Z}_{n_2}$ are sampled uniformly without replacement from $\mathcal{Z}_u$ and $\mathcal{Z}_m$ respectively. Using Jensen's inequality we get:

(10) $\displaystyle \mathbb{E}\Big[\sup_{f \in \mathcal{F}}\big(\bar f(\mathcal{Z}_u) - \bar f(\mathcal{Z}_m)\big)\Big] \le \mathbb{E}\Big[\sup_{f \in \mathcal{F}}\big(\bar f(\mathcal{Z}_{n_1}) - \bar f(\mathcal{Z}_{n_2})\big)\Big].$

The marginal distribution of the pair $(\mathcal{Z}_{n_1}, \mathcal{Z}_{n_2})$, appearing in (10), can be equivalently described by first sampling $\mathcal{Z}_{n_1+n_2}$ from $\mathcal{Z}_N$, then $\mathcal{Z}_{n_1}$ from $\mathcal{Z}_{n_1+n_2}$ (both times uniformly without replacement), and setting $\mathcal{Z}_{n_2} := \mathcal{Z}_{n_1+n_2} \setminus \mathcal{Z}_{n_1}$ (recall that $n_1 + n_2 = m$). Thus the right-hand side of (10) can be rewritten in terms of the expected PRC of a uniformly sampled $m$-element subset of $\mathcal{Z}_N$, which has the same distribution as $\mathcal{Z}_m$; this completes the proof of the upper bound.

We have shown that for positive integers $n_1$ and $n_2$ with $n_1 + n_2 = m$:

(11) $\displaystyle \mathbb{E}\Big[\sup_{f \in \mathcal{F}}\big(\bar f(\mathcal{Z}_u) - \bar f(\mathcal{Z}_m)\big)\Big] \le \mathbb{E}\Big[\sup_{f \in \mathcal{F}}\big(\bar f(\mathcal{Z}_{n_1}) - \bar f(\mathcal{Z}_{n_2})\big)\Big],$

where $\mathcal{Z}_{n_1}$ and $\mathcal{Z}_{n_2}$ are sampled uniformly without replacement from $\mathcal{Z}_u$ and $\mathcal{Z}_m$ respectively. Let $\mathcal{Z}_{m/2}$ be sampled uniformly without replacement from $\mathcal{Z}_m$ and let $\mathcal{Z}'_{m/2}$ be the remaining elements of $\mathcal{Z}_m$. Using Lemma 3 once again we can express $\bar f(\mathcal{Z}_m)$ as the conditional expectation of the average over such a random half. We can rewrite the right-hand side of (11) accordingly, where we have used Jensen's inequality. If we take $n_1 = n_2 = m/2$ we obtain the claimed lower bound. It is left to notice that the random subsets $\mathcal{Z}_{m/2}$ and $\mathcal{Z}'_{m/2}$ have the same distributions as $\mathcal{Z}_{n_2}$ and $\mathcal{Z}_m \setminus \mathcal{Z}_{n_2}$.

5.2 Proof of Theorem 3.2

Let $\epsilon := (\epsilon_1, \dots, \epsilon_n)$ be i.i.d. Rademacher signs, and let $\eta := (\eta_1, \dots, \eta_n)$ be a uniform random permutation of a set containing $n/2$ plus and $n/2$ minus signs. The proof of Theorem 3.2 is based on a coupling of the random variables $\epsilon$ and $\eta$, which is described in Lemma 4. We will need a number of definitions. Consider the binary cube $B_n := \{-1, +1\}^n$. Denote by $B_n^0$ the set of all the vectors in $B_n$ having an equal number of plus and minus signs. For any $\sigma \in B_n$ denote $s(\sigma) := \sum_{i=1}^n \sigma_i$ and consider the set $N(\sigma)$, which consists of the points in $B_n^0$ closest to $\sigma$ in the Hamming metric. For any $\sigma \in B_n$ let $T(\sigma)$ be a random element of $N(\sigma)$, distributed uniformly. We will use $T_i(\sigma)$ to denote the $i$-th coordinate of the vector $T(\sigma)$.

Remark 1

If $\sigma \in B_n^0$ then $N(\sigma) = \{\sigma\}$. Otherwise, $N(\sigma)$ will clearly contain more than one element of $B_n^0$. Namely, it can be shown that if for some positive integer $k$ it holds that $s(\sigma) = 2k$ (note that $s(\sigma)$ is necessarily even, since $n$ is even), then $N(\sigma)$ consists of all the vectors in $B_n^0$ which can be obtained by replacing $k$ of the $+1$ signs in $\sigma$ with $-1$ signs, and thus in this case $|N(\sigma)| = \binom{n/2 + k}{k}$; the case $s(\sigma) = -2k$ is symmetric.

Lemma 4 (Coupling)

Assume that $n$ is even. Then the random sequence $T(\epsilon)$ has the same distribution as $\eta$.

Proof

Note that the support of $T(\epsilon)$ is equal to $B_n^0$. From symmetry it is easy to conclude that the distribution of $T(\epsilon)$ is exchangeable. This means that it is invariant under permutations of the coordinates and, as a consequence, uniform on $B_n^0$.
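The following sketch (an illustration of the map described above, with hypothetical names) implements the coupling: it sends a vector of i.i.d. Rademacher signs to a uniformly chosen nearest balanced vector by flipping a uniformly random subset of the majority sign. Tabulating the outputs over many draws shows an approximately uniform distribution on the balanced vectors, in line with Lemma 4.

```python
import numpy as np
from collections import Counter

def nearest_balanced(eps, rng):
    """Map a +/-1 vector of even length to a uniform element of the set of
    balanced vectors closest to it in Hamming distance: flip a uniformly
    random subset of the majority sign of the required size."""
    eps = np.asarray(eps, dtype=int).copy()
    excess = eps.sum() // 2                  # how many majority signs to flip
    if excess != 0:
        sign = 1 if excess > 0 else -1
        idx = np.flatnonzero(eps == sign)
        flip = rng.choice(idx, size=abs(excess), replace=False)
        eps[flip] = -sign
    return eps

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, draws = 4, 60_000
    counts = Counter(
        tuple(nearest_balanced(rng.choice([-1, 1], size=n), rng)) for _ in range(draws)
    )
    print(counts)   # the 6 balanced vectors of length 4 appear with roughly equal frequency
```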

The next result is at the core of the multiplicative upper bound (3).

Lemma 5

Assume that $n$ is even. Then for any fixed $\sigma_0 \in B_n^0$ an upper bound holds on the probability that the coupled sequence $T(\epsilon)$ equals $\sigma_0$ while $\epsilon \neq \sigma_0$; this bound yields the multiplicative constant in (3).

Proof

We will first upper bound the probability that $T(\epsilon) = \sigma_0$ while $\epsilon \neq \sigma_0$, where $\sigma_0$ is (w.l.o.g.) a sequence of $n/2$ plus signs followed by a sequence of $n/2$ minus signs.

(12)

where we have used Lemma 4 and the sum in (12) is over all $2^n$ different sequences of signs $\sigma \in B_n$. For any $\sigma$ recall the notation $s(\sigma) := \sum_{i=1}^n \sigma_i$ and consider the terms in (12) corresponding to $\sigma$ with $s(\sigma) = 0$, $s(\sigma) > 0$, and $s(\sigma) < 0$:

Case 1: $s(\sigma) = 0$. These terms will be zero, since in this case $T(\sigma) = \sigma$, while the event requires $T(\sigma) = \sigma_0$ with $\sigma \neq \sigma_0$.

Case 2: $s(\sigma) > 0$. This means that $\sigma$ "has more plus signs than it should" and, according to Remark 1, the mapping $T$ will replace several of the $+1$ signs with $-1$ signs. In particular, the terms corresponding to sequences $\sigma$ for which the identity $T(\sigma) = \sigma_0$ is impossible vanish. Moreover, note that the identity $T(\sigma) = \sigma_0$ can hold only if $\sigma$ agrees with $\sigma_0$ on all the coordinates where $\sigma_0$ equals $+1$, which necessarily leads to condition (13).

From this we conclude that all the terms corresponding to $\sigma$ with $s(\sigma) > 0$ that violate condition (13) are zero. We will use $A_k$ to denote the subset of $B_n$ consisting of sequences $\sigma$ such that (a) $s(\sigma) = 2k$, (b) $\sigma \neq \sigma_0$, and (c) condition (13) holds. It can be seen that if $\sigma \in A_k$ then $\mathbb{P}\{T(\sigma) = \sigma_0\} = \binom{n/2 + k}{k}^{-1}$. This holds since, according to Remark 1, $T(\sigma)$ can take exactly $\binom{n/2 + k}{k}$ different values, while only one of them is equal to $\sigma_0$.

Let us compute the cardinality of $A_k$ for $k \ge 1$. It is easy to check that the condition $s(\sigma) = 2k$ for a positive integer $k$ implies that $\sigma$ has exactly $n/2 - k$ minus signs. Considering the fact that condition (13) holds for $\sigma \in A_k$, we have $|A_k| \le \binom{n/2}{k}$.

Combining everything together, the total contribution of the terms with $s(\sigma) > 0$ is at most $2^{-n}\sum_{k \ge 1} \binom{n/2}{k}\binom{n/2 + k}{k}^{-1}$. Finally, it is easy to show using induction that this sum of ratios of binomial coefficients admits the required upper bound.

Case 3: $s(\sigma) < 0$. We can repeat all the steps of the previous case and obtain the same bound.

Accounting for these three cases in (12), we conclude the desired bound, where we have used the upper bound on the central binomial coefficient from [19, Corollary 2.4]. We can conclude the proof of the lemma by combining the above estimates.

Proof (of Theorem 3.2)

First we prove (3). Let $\eta$ be the uniformly permuted balanced sequence of signs defined at the beginning of this section. We can write a chain of identities and inequalities (14)-(16), where we have used the coupling Lemma 4 in (14), Lemma 5 in (15), and Jensen's inequality in (16). This completes the proof of (3).

Next we prove (4). Using Lemma 4 and Jensen's inequality we get (17), where we have, perhaps misleadingly, denoted by $\mathbb{E}_T$ the conditional expectation with respect to the uniform choice of $T(\epsilon)$ from $N(\epsilon)$ given $\epsilon$. Next we have (18), where $I(\epsilon)$ is the subset of indices such that $i \in I(\epsilon)$ if and only if $T_i(\epsilon) \neq \epsilon_i$. We can continue by writing (19). Note that, since the functions in $\mathcal{F}$ are absolutely bounded by $B$, the contribution of the coordinates in $I(\epsilon)$ is at most of order $B\,|I(\epsilon)|/n$. Returning to (17) and using Remark 1, we obtain a bound in terms of $\mathbb{E}\big[|s(\epsilon)|\big]$. Khinchin's inequality [15, Lemma 4.1], together with the best known constant due to [12], controls this quantity at the order of $\sqrt{n}$, which completes the proof of (4).

5.3 Proof of Lemma 1

Proof

Let $\mathcal{Z}_n \subseteq \mathcal{Z}$ with $n$ even. Take $\mathcal{F}_1$ to be a set of two constant functions, and for all