# Multiclass versus Binary Differentially Private PAC Learning

We show a generic reduction from multiclass differentially private PAC learning to binary private PAC learning. We apply this transformation to a recently proposed binary private PAC learner to obtain a private multiclass learner with sample complexity that has a polynomial dependence on the multiclass Littlestone dimension and a poly-logarithmic dependence on the number of classes. This yields an exponential improvement in the dependence on both parameters over learners from previous work. Our proof extends the notion of Ψ-dimension defined in work of Ben-David et al. [JCSS '95] to the online setting and explores its general properties.


## 1 Introduction

Machine learning and data analytics are increasingly deployed on sensitive information about individuals. Differential privacy [DMNS06] gives a mathematically rigorous way to enable such analyses while guaranteeing the privacy of individual information. The model of differentially private PAC learning [KLN08] captures binary classification for sensitive data, providing a simple and broadly applicable abstraction for many machine learning procedures. Private PAC learning is now reasonably well-understood, with a host of general algorithmic techniques, lower bounds, and results for specific fundamental concept classes  [BNSV15, FX14, BNS13, BNS19, ALMM19, KLM20, BMNS19, KMST20].

Beyond binary classification, many problems in machine learning are better modeled as multiclass learning problems. Here, given a training set of examples from a domain X with labels from {0, 1, …, k}, the goal is to learn a function that approximately labels the data and generalizes to the underlying population from which it was drawn. Much less is presently known about differentially private multiclass learnability than is known about private binary classification, though it appears that many specific tools and techniques can be adapted one at a time. In this work, we ask: Can we generically relate multiclass to binary learning so as to automatically transfer results from the binary setting to the multiclass setting?

To illustrate, there is a simple reduction from a given multiclass learning problem to a sequence of binary classification problems. (This reduction was described by Ben-David et al. [BCHL95] for non-private learning, but works just as well in the private setting.) Intuitively, one can learn a multi-valued label one bit at a time. That is, to learn an unknown function f : X → {0, 1, …, k}, it suffices to learn the binary functions f_1, …, f_{log(k+1)}, where each f_i is the i-th bit of f.
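The bit-decomposition underlying this reduction can be sketched in a few lines of Python (an illustrative sketch of the idea only; the helper names are ours, not the paper's):

```python
# A label in {0, ..., k} is encoded in m = ceil(log2(k + 1)) bits, so a
# k-valued function f induces m binary functions f_1, ..., f_m.
import math

def bit_restriction(f, i):
    """Return the binary function f_i giving the i-th bit (i >= 1) of f's label."""
    return lambda x: (f(x) >> (i - 1)) & 1

def recombine(bit_fns):
    """Invert the decomposition: reassemble the multi-valued label from its bits."""
    return lambda x: sum(fn(x) << j for j, fn in enumerate(bit_fns))

k = 5                                # labels {0, ..., 5}
m = math.ceil(math.log2(k + 1))      # 3 bits suffice
f = lambda x: (2 * x + 1) % (k + 1)  # a toy multi-valued concept
bits = [bit_restriction(f, i) for i in range(1, m + 1)]
g = recombine(bits)
assert all(g(x) == f(x) for x in range(100))
```

Learning each `bits[i]` separately and recombining recovers the multiclass concept exactly, which is the content of the reduction.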

###### Theorem 1.1 (Informal).

Let H be a concept class consisting of {0, 1, …, k}-valued functions. If all of the binary classes H_1, …, H_{log(k+1)} are privately learnable, then H is privately learnable.

Beyond its obvious use for enabling the use of tools for binary private PAC learning on the classes H_i, we show that Theorem 1.1 has strong implications for relating the private learnability of H to the combinatorial properties of H itself. Our main application of this reductive perspective is an improved sample complexity upper bound for private multiclass learning in terms of online learnability.

### 1.1 Online vs. Private Learnability

A recent line of work has revealed an intimate connection between differentially private learnability and learnability in Littlestone’s mistake-bound model of online learning [Lit87]. For binary classes, the latter is tightly captured by a combinatorial parameter called the Littlestone dimension; a class is online learnable with mistake bound at most d if and only if its Littlestone dimension is at most d. The Littlestone dimension also qualitatively characterizes private learnability. If a class H has Littlestone dimension d, then every private PAC learner for H requires at least Ω(log* d) samples [ALMM19]. Meanwhile, Bun et al. [BLM20] showed that H is privately learnable using 2^{O(2^d)} samples, and Ghazi et al. [GGKM20] gave an improved algorithm using poly(d) samples. (Moreover, while quantitatively far apart, both the upper and lower bound are tight up to polynomial factors as functions of the Littlestone dimension alone [KLM20].)

Jung et al. [JKT20] recently extended this connection from binary to multiclass learnability. They gave upper and lower bounds on the sample complexity of private multiclass learnability in terms of the multiclass Littlestone dimension [DSBDSS11]. Specifically, they showed that if a multi-valued class H has multiclass Littlestone dimension d, then it is privately learnable using 2^{O(2^d)} samples and that every private learner requires Ω(log* d) samples.

Jung et al.’s upper bound [JKT20] directly extended the definitions and arguments from Bun et al.’s [BLM20] earlier 2^{O(2^d)}-sample algorithm for the binary case. While plausible, it is currently unknown and far from obvious whether similar adaptations can be made to the improved binary algorithm of Ghazi et al. [GGKM20]. Instead of attacking this problem directly, we show that Theorem 1.1, together with additional insights relating multiclass and binary Littlestone dimensions, allows us to generically translate sample complexity upper bounds for private learning in terms of binary Littlestone dimension into upper bounds in terms of multiclass Littlestone dimension. Instantiating this general translation using the algorithm of Ghazi et al. gives the following improved sample complexity upper bound.

###### Theorem 1.2 (Informal).

Let H be a concept class consisting of {0, 1, …, k}-valued functions and let d be the multiclass Littlestone dimension of H. Then H is privately learnable using poly(d, log k) samples.

In addition to being conceptually simple and modular, our reduction from multiclass to binary learning means that potential future improvements for binary learning will also automatically give improvements for multiclass learning. For example, if one were able to prove that all binary classes of Littlestone dimension d are privately learnable with Õ(d) samples, this would imply that every {0, 1, …, k}-valued class of multiclass Littlestone dimension d is privately learnable with Õ(d · log³ k) samples. (The nearly cubic dependence on log k follows from the fact that the accuracy of private learners can be boosted with a sample complexity blowup that is nearly inverse linear in the target accuracy [DRV10, BCS20]. See Theorem A.1.)

Finally, in Section 8, we study pure, i.e., (ϵ, 0)-differentially private PAC learning in the multiclass setting. Beimel et al. [BNS19] characterized the sample complexity of pure private learners in the binary setting using the notion of probabilistic representation dimension. We study a generalization of the representation dimension to the multiclass setting and show that it characterizes the sample complexity of pure private multiclass PAC learning up to a logarithmic term in the number of labels k. Our primary technical contribution in this section is a new and simplified proof of the relationship between representation dimension and Littlestone dimension that readily extends to the multiclass setting. This connection was previously explored by Feldman and Xiao [FX14] in the binary setting, through a connection to randomized one-way communication complexity. We instead use techniques from online learning — specifically, the experts framework and the weighted majority algorithm developed by Littlestone and Warmuth [LW94] for the binary setting and extended to the multiclass setting by Daniely et al. [DSBDSS11].

### 1.2 Techniques

Theorem 1.1 shows that a multi-valued class H is privately learnable if all of the binary classes H_i are privately learnable, which in turn holds as long as we can control their (binary) Littlestone dimensions. So the last remaining step in order to conclude Theorem 1.2 is to show that if H has bounded multiclass Littlestone dimension, then all of the classes H_i have bounded binary Littlestone dimension. At first glance, this may seem to follow immediately from the fact that (multiclass) Littlestone dimension characterizes (multiclass) online learnability — a mistake-bounded learner for a multiclass problem is, in particular, able to learn each individual output bit of the function being learned. The problem with this intuition is that the multiclass learner is given more feedback from each example, namely the entire multi-valued class label, than a binary learner for each f_i that is only given a single bit. Nevertheless, we are still able to use combinatorial methods to show that multiclass online learnability of a class H implies online learnability of all of the binary classes H_i.

###### Theorem 1.3.

Let H be a {0, 1, …, k}-valued concept class with multiclass Littlestone dimension d. Then every binary class H_i has Littlestone dimension at most 6d ln(k+1).

Moreover, this result is nearly tight. In Section 6, we show that for every d and k there is a {0, 1, …, k}-valued class with multiclass Littlestone dimension d such that at least one of the classes H_i has Littlestone dimension at least Ω(d log k).

Theorem 1.3 is the main technical contribution of this work. The proof adapts techniques introduced by Ben-David et al. [BCHL95] for characterizing the sample complexity of (non-private) multiclass PAC learnability. Specifically, Ben-David et al. introduced a family of combinatorial dimensions, parameterized by collections Ψ of maps and called Ψ-dimensions, associated to classes of multi-valued functions. One choice of Ψ corresponds to the “Natarajan dimension” [Nat89], which was previously known to give a lower bound on the sample complexity of multiclass learnability. Another choice corresponds to the “graph dimension” [Nat89], which was known to give an upper bound. Ben-David et al. gave conditions under which Ψ-dimensions for different choices of Ψ could be related to each other, concluding that the Natarajan and graph dimensions are always within an O(log k) factor, and thus characterizing the sample complexity of multiclass learnability up to such a factor.

Our proof of Theorem 1.3 proceeds by extending the definition of Ψ-dimension to online learning. We show that one choice of Ψ corresponds to the multiclass Littlestone dimension, while a different choice corresponds to an upper bound on the maximum Littlestone dimension of any binary class H_i. We relate the two quantities up to a logarithmic factor using a new variant of the Sauer-Shelah-Perles Lemma for the “0-cover numbers” of a class of multi-valued functions. While we were originally motivated by privacy, we believe that Theorem 1.3 and the toolkit we develop for understanding online Ψ-dimensions may be of broader interest in the study of (multiclass) online learnability.

Next, in Section 7 we prove that the multiclass Littlestone dimension of any {0, 1, …, k}-valued class can be no more than a multiplicative log(k+1) factor larger than the maximum Littlestone dimension over the classes H_i. We also show that this is tight. Hence, our results give a complete characterization of the relationship between the multiclass Littlestone dimension of a class and the maximum Littlestone dimension over the corresponding binary classes H_i.

Finally, we remark that Theorem 1.3 implies a qualitative converse to Theorem 1.1. If a multi-valued class H is privately learnable, then the lower bound of Jung et al. [JKT20] implies that H has finite multiclass Littlestone dimension. Theorem 1.3 then shows that all of the classes H_i have finite binary Littlestone dimension, which implies via sample complexity upper bounds for binary private PAC learnability [BLM20, GGKM20] that they are also privately learnable.

## 2 Background

##### Differential Privacy

Differential privacy is a property of a randomized algorithm guaranteeing that the distributions obtained by running the algorithm on two datasets differing in one individual’s data are indistinguishable up to a multiplicative factor e^ϵ and an additive factor δ. Formally, it is defined as follows:

###### Definition 2.1 (Differential privacy, [DMNS06]).

Let ϵ, δ ≥ 0. A randomized algorithm M is (ϵ, δ)-differentially private if for all subsets S of the output space, and for all datasets X and X′ containing n elements of the universe and differing in at most one element (we call these neighbouring datasets), we have that

 Pr[M(X) ∈ S] ≤ e^ϵ · Pr[M(X′) ∈ S] + δ.
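As a concrete sanity check of this definition (our own illustrative example, not from the paper), randomized response on a one-bit dataset, which reports the bit truthfully with probability 3/4, satisfies the inequality with ϵ = ln 3 and δ = 0 on every subset of outcomes:

```python
# Verify the (eps, delta)-DP inequality for randomized response on all subsets.
from itertools import chain, combinations
import math

def rr_dist(x):
    """Output distribution of randomized response on bit x: report x w.p. 3/4."""
    return {x: 0.75, 1 - x: 0.25}

eps, delta = math.log(3), 0.0
outcomes = [0, 1]
subsets = chain.from_iterable(combinations(outcomes, r) for r in range(3))
for S in subsets:
    for x, x_prime in [(0, 1), (1, 0)]:          # neighbouring one-bit datasets
        p = sum(rr_dist(x).get(o, 0.0) for o in S)
        p_prime = sum(rr_dist(x_prime).get(o, 0.0) for o in S)
        assert p <= math.exp(eps) * p_prime + delta + 1e-12
```

The bound is attained with equality for the singleton set containing the reported bit, so ln 3 is the exact privacy parameter of this mechanism.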

We will also need the closely related notion of (ϵ, δ)-indistinguishability of random variables.

###### Definition 2.2 ((ϵ,δ)-indistinguishability).

Two random variables a_1 and a_2 defined over the same outcome space are said to be (ϵ, δ)-indistinguishable if for all subsets S of the outcome space, we have that

 Pr[a_1 ∈ S] ≤ e^ϵ · Pr[a_2 ∈ S] + δ

and

 Pr[a_2 ∈ S] ≤ e^ϵ · Pr[a_1 ∈ S] + δ.

One useful property of differential privacy that we will use is that it is closed under ‘post-processing’; that is, the output of a differentially private algorithm cannot be made less private by applying any data-independent transformation.

###### Lemma 2.3 (Post-processing of differential privacy, [DMNS06]).

If M is (ϵ, δ)-differentially private, and f is any randomized function, then the algorithm f ∘ M is (ϵ, δ)-differentially private.

Similarly, -indistinguishability is also preserved under post-processing.

###### Lemma 2.4 (Post-processing of (ϵ,δ)-indistinguishability).

If a_1 and a_2 are random variables over the same outcome space that are (ϵ, δ)-indistinguishable, then for any possibly randomized function f, we have that f(a_1) and f(a_2) are (ϵ, δ)-indistinguishable.

##### PAC learning.

PAC learning [Val84] aims at capturing natural conditions under which an algorithm can approximately learn a hypothesis class.

###### Definition 2.5 (Hypothesis class).

A hypothesis class H with input space X and output space Y (also called the label space) is a set of functions mapping X to Y.

Where it is clear, we will not explicitly name the input and output spaces. We can now formally define PAC learning.

###### Definition 2.6 (PAC learning, [Val84]).

A learning problem is defined by a hypothesis class H. For any distribution D over the input space X, consider n independent draws x_1, …, x_n from distribution D. A labeled sample of size n is the set S = {(x_j, f(x_j))}_{j=1}^{n} where f ∈ H. We say an algorithm A taking a labeled sample of size n is an (α, β)-accurate PAC learner for the hypothesis class H if for all functions f ∈ H and for all distributions D over the input space, A, on being given a labeled sample of size n drawn from D and labeled by f, outputs a hypothesis h such that with probability greater than or equal to 1 − β over the randomness of the sample and the algorithm,

 Pr_{x∼D}[h(x) ≠ f(x)] ≤ α.

The definition above defines PAC learning in the realizable setting, where all the functions labeling the data are in H. Two well studied settings for PAC learning are the binary learning case, where Y = {0, 1}, and the multiclass learning case, where Y = {0, 1, …, k} for natural numbers k. The natural notion of complexity for PAC learning is sample complexity.

###### Definition 2.7 (Sample complexity).

The sample complexity S_{H,α,β}(A) of algorithm A with respect to hypothesis class H is the minimum size of the sample that the algorithm requires in order to be an (α, β)-accurate PAC learner for H. The PAC complexity of the hypothesis class H is

 inf_A S_{H,α,β}(A).

In this work, we will be interested in generic learners, that work for every hypothesis class.

###### Definition 2.8 (Generic learners).

We say that an algorithm A that additionally takes the hypothesis class as an input is a generic (α, β)-accurate private PAC learner with sample complexity function SC if, for every hypothesis class H, it is an (α, β)-accurate private PAC learner for H with sample complexity SC(H).

### 2.1 Differentially Private PAC Learning

We can now define differentially private PAC learning, by putting together the constraints imposed by differential privacy and PAC learning respectively.

###### Definition 2.9 (Differentially private PAC learning, [KLN08]).

An algorithm A is an (ϵ, δ)-differentially private and (α, β)-accurate private PAC learner for the hypothesis class H with sample complexity n if and only if:

1. A is an (α, β)-accurate PAC learner for the hypothesis class H with sample complexity n.

2. A is (ϵ, δ)-differentially private.

In this work, we study the complexity of private PAC learning. Our work focuses on the multiclass realizable setting.

### 2.2 Multiclass Littlestone Dimension

We recall here the definition of multiclass Littlestone dimension [DSBDSS11], which we will use extensively in this work. Unless stated otherwise, we will use the convention that the root of a tree is at depth 0. As a first step, we define a class of labeled binary trees, representing possible input-output label sequences over an input space X and label space Y.

###### Definition 2.10 (Complete io-labeled binary tree).

A complete io-labeled binary tree of depth d with input set X and output set Y consists of a complete binary tree of depth d with the following properties:

1. Every node of the tree other than the leaves is labeled by an example x ∈ X.

2. The edges going from any parent node to its two children are labeled by two different labels in Y.

3. The leaf nodes of the tree are unlabeled.

We are interested in whether the input-output labelings defined by the complete io-labeled tree can be achieved by some function in the hypothesis class; to this end, we define realizability for root-to-leaf paths.

###### Definition 2.11.

Given a complete io-labeled binary tree of depth d, consider a root-to-leaf path described as an ordered sequence (x_1, y_1, …, x_d, y_d), where x_j is a node label and y_j is the label of the edge between x_j and x_{j+1}, and where x_1 is the root. We say that the root-to-leaf path is realized by a function f if for every j in {1, …, d}, we have f(x_j) = y_j.

Using this definition we can now define what it means for a hypothesis class of functions to shatter a complete io-labeled binary tree, which helps to capture how expressive the hypothesis class is.

###### Definition 2.12 (Shattering).

We say that a complete io-labeled binary tree T of depth d with label set Y is shattered by a hypothesis class H if for all root-to-leaf sequences p of the tree, there exists a function f ∈ H that realizes p.

Using this definition of shattering we can finally define the multiclass Littlestone dimension.

###### Definition 2.13 (Multiclass Littlestone dimension, [DSBDSS11]).

The multiclass Littlestone dimension of a hypothesis class H, denoted MLD(H), is defined to be the maximum d such that there exists a complete io-labeled binary tree of depth d that is shattered by H. If no maximum exists, then we say that the multiclass Littlestone dimension of H is ∞.
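For small finite classes, this definition can be evaluated by brute force. The sketch below (our own illustrative code; functions are represented as tuples of labels over a domain {0, …, n−1}) uses the recursive view: a tree of depth d shattered at node x with edge labels y_0 ≠ y_1 requires both restricted classes to shatter trees of depth d − 1.

```python
from itertools import combinations, product

def mld(H, X):
    """Multiclass Littlestone dimension of a finite class H of label tuples."""
    best = 0
    for x in X:
        labels = {f[x] for f in H}
        for y0, y1 in combinations(sorted(labels), 2):
            # both subtrees of a shattered tree must themselves be shattered
            H0 = [f for f in H if f[x] == y0]
            H1 = [f for f in H if f[x] == y1]
            best = max(best, 1 + min(mld(H0, X), mld(H1, X)))
    return best

# the full class of {0,1,2}-valued functions on a two-point domain has MLD 2
H = list(product(range(3), repeat=2))
assert mld(H, range(2)) == 2
```

The recursion terminates because each restricted class is a strict subset of H, so the dimension of a finite class is at most log₂|H|.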

## 3 Main Results

### 3.1 Reduction from multiclass private PAC learning to binary private PAC learning

Our first main result is a reduction from multiclass private PAC learning to binary private PAC learning. Informally, the idea is that every function f mapping examples to labels in {0, 1, …, k} can be thought of as a vector of binary functions (f_1, …, f_{log(k+1)}). Here, each binary function f_i predicts a bit of the binary representation of the label predicted by f. Then, we can learn these binary functions by splitting the dataset into log(k+1) parts, and using each part to learn a different f_i. We can learn the binary functions using an (ϵ, δ)-DP binary PAC learner. Then, we can combine the binary hypotheses obtained to get a hypothesis for the multiclass setting, by applying a binary-to-decimal transformation. This process, described in Figure 1, preserves privacy since changing a single element of the input dataset changes only one of the partitions, and we apply an (ϵ, δ)-DP learning algorithm to each partition. The binary-to-decimal transformation can be seen as post-processing.

Next, we formalize this idea. Given a hypothesis class H with label set {0, 1, …, k}, construct the following hypothesis classes H_1, …, H_{log(k+1)}. For every function f ∈ H, let f_i be the function defined such that f_i(x) is the i-th bit of the binary expansion of f(x). Let the hypothesis class H_i be defined as H_i = {f_i : f ∈ H}. We will call these the binary restrictions of H.

###### Theorem 3.1.

Let H be a hypothesis class with label set {0, 1, …, k} and let H_1, …, H_{log(k+1)} be its binary restrictions. Assume we have (ϵ, δ)-differentially private, (α, β)-accurate PAC learners A_1, …, A_{log(k+1)} for H_1, …, H_{log(k+1)} with sample complexities upper bounded by SC_1(α, β), …, SC_{log(k+1)}(α, β). Then, there exists an (ϵ, δ)-differentially private, (α, β)-accurate PAC learner for the hypothesis class H that has sample complexity upper bounded by ∑_{i=1}^{log(k+1)} SC_i(α/log(k+1), β/log(k+1)).

###### Proof.

For simplicity, let k be the predecessor of a power of 2, so that log(k+1) is an integer. Note that if it is not, the argument below will work by replacing k with the predecessor of the closest power of 2 that is larger than k.

Fix any distribution D over X and an unknown function f ∈ H, and write f = Dec(f_1, …, f_{log(k+1)}), where Dec represents a binary-to-decimal conversion (which in this case will yield an output in {0, 1, …, k}) and f_i predicts the i-th bit of the binary expansion of the label predicted by f.

Assuming the algorithm A is given a labeled sample of size n = ∑_{i=1}^{log(k+1)} SC_i(α/log(k+1), β/log(k+1)) drawn independently from D and labeled by f, split the sample into log(k+1) smaller samples S_1, …, S_{log(k+1)}. The first sample will be of size SC_1(α/log(k+1), β/log(k+1)), the second sample will be of size SC_2(α/log(k+1), β/log(k+1)), and so on. For each sample S_i, replace the labels of all examples in that sample by the i-th bit of the binary expansion of what the label previously was. Note that this is equivalent to getting a sample of size SC_i(α/log(k+1), β/log(k+1)) from distribution D that is labeled by function f_i.

For all classes H_i, A runs the (ϵ, δ)-DP, (α/log(k+1), β/log(k+1))-accurate PAC learning algorithm A_i to learn H_i using the sample S_i. Let the hypothesis output when running A_i on S_i be g_i. Then, A outputs the function g = Dec(g_1, …, g_{log(k+1)}).
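The splitting-and-relabeling scheme just described can be sketched as follows (a schematic illustration only: `binary_private_learner` is a stand-in for any DP binary learner, not the paper's algorithm, and the returned lambda plays the role of the Dec recombination):

```python
import math

def bit(label, i):
    """The i-th bit (0-indexed here) of a label's binary expansion."""
    return (label >> i) & 1

def multiclass_learner(sample, k, binary_private_learner, sizes):
    """sample: list of (x, y) with y in {0,...,k}; sizes: per-bit split sizes."""
    m = math.ceil(math.log2(k + 1))
    hyps, start = [], 0
    for i in range(m):
        part = sample[start:start + sizes[i]]     # disjoint, data-independent split
        relabeled = [(x, bit(y, i)) for x, y in part]
        hyps.append(binary_private_learner(relabeled))
        start += sizes[i]
    # Dec: recombine the m binary hypotheses into one multi-valued hypothesis
    return lambda x: sum(hyps[i](x) << i for i in range(m))
```

Because the split is data-independent and the partitions are disjoint, a change to one input record touches only one sub-learner, which is exactly what the privacy argument below exploits.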

First, we argue that A is an (α, β)-accurate PAC learner for H.

 Pr[g(x) ≠ f(x)] = Pr[∃i, g_i(x) ≠ f_i(x)] ≤ ∑_{i=1}^{log(k+1)} Pr[g_i(x) ≠ f_i(x)] (1)

where the last inequality is by a union bound.

But since g_i is the output of the (α/log(k+1), β/log(k+1))-accurate PAC learner A_i on S_i, and we feed it a sufficient number of samples, we get that for any i, with probability 1 − β/log(k+1),

 Pr[g_i(x) ≠ f_i(x)] ≤ α/log(k+1).

This means that, again by a union bound, we can say that with probability 1 − β,

 ∀i, Pr[g_i(x) ≠ f_i(x)] ≤ α/log(k+1) (2)

Substituting equation 2 into equation 1, we get that with probability 1 − β over the randomness of the sample and the algorithm,

 Pr[g(x) ≠ f(x)] ≤ ∑_{i=1}^{log(k+1)} Pr[g_i(x) ≠ f_i(x)] ≤ α (3)

which means that A is an (α, β)-accurate PAC learner with sample complexity upper bounded by

 ∑_{i=1}^{log(k+1)} SC_i(α/log(k+1), β/log(k+1)).

We now argue that A is (ϵ, δ)-DP. This will follow from the ‘parallel composition’ property of (ϵ, δ)-DP.

###### Claim 3.2.

Let algorithm M have the following structure: it splits its input data into r disjoint partitions in a data-independent way. It runs (potentially different) (ϵ, δ)-DP algorithms M_1, …, M_r, one on each partition. It then outputs (M_1(X_1), …, M_r(X_r)). Then, M is (ϵ, δ)-DP.

###### Proof.

Fix any two neighbouring datasets X and X′. Then, we want to argue that the random variable M(X) is (ϵ, δ)-indistinguishable from the random variable M(X′). Observe that since X and X′ differ in only one element, when we partition them, all but one partition is the same. Assume without loss of generality that only the first partition is different, that is, X_1 ≠ X′_1, but X_j = X′_j for all j > 1. X_1 and X′_1 are neighbouring datasets since they differ in only a single element. Hence, since M_1 is (ϵ, δ)-DP, we have that M_1(X_1) is (ϵ, δ)-indistinguishable from M_1(X′_1).

Next, consider a randomized function h (that depends on the neighbouring dataset pair) to represent the output of M as follows: for any y, let h(y) = (y, M_2(X_2), …, M_r(X_r)). By Lemma 2.4, since (ϵ, δ)-indistinguishability is preserved under post-processing, we have that h(M_1(X_1)) is (ϵ, δ)-indistinguishable from h(M_1(X′_1)). But h(M_1(X_1)) = M(X) and h(M_1(X′_1)) = M(X′), where the second equality follows because X_j = X′_j for all j > 1. Hence, we get that M(X) and M(X′) are (ϵ, δ)-indistinguishable. This argument works for any pair of neighbouring databases; hence, we get that M is (ϵ, δ)-DP. ∎

Note that algorithm A follows a similar structure to that described in Claim 3.2; it divides the dataset into log(k+1) partitions, runs an (ϵ, δ)-DP PAC learning algorithm for binary hypothesis classes on each partition, and post-processes the outputs. Hence, by Claim 3.2 and by the fact that (ϵ, δ)-DP is closed under post-processing (Lemma 2.3), we get that A is (ϵ, δ)-DP. ∎
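The structural fact driving this argument — a data-independent split sends neighbouring datasets to partitions that differ in at most one block, so each sub-learner sees at most one changed record — can be sanity-checked directly (a toy illustration, not code from the paper):

```python
def split(data, sizes):
    """Data-independent split of a list into consecutive blocks of given sizes."""
    parts, start = [], 0
    for s in sizes:
        parts.append(data[start:start + s])
        start += s
    return parts

X  = [3, 1, 4, 1, 5, 9]
X2 = [3, 1, 4, 7, 5, 9]              # neighbouring: differs only in position 3
P, P2 = split(X, [2, 2, 2]), split(X2, [2, 2, 2])
changed = [i for i in range(3) if P[i] != P2[i]]
assert len(changed) <= 1             # only one sub-learner's input changes
```

Since the unchanged blocks contribute identically distributed outputs, the overall privacy loss is that of the single affected sub-learner, not the sum over all of them.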

Next, we recall that the sample complexity of privately learning binary hypothesis classes can be characterized by the Littlestone dimension of the hypothesis class [ALMM19, BLM20]. That is, there exists an (α, β)-accurate, (ϵ, δ)-DP PAC learning algorithm for any binary hypothesis class H with sample complexity upper and lower bounded by functions depending only on α, β, ϵ, δ, and d, where d is the Littlestone dimension of H. Using this characterization, we directly obtain the following corollary to Theorem 3.1.

###### Corollary 3.3.

Let H be a hypothesis class with label set {0, 1, …, k} and let H_1, …, H_{log(k+1)} be its binary restrictions. Let the Littlestone dimensions of H_1, …, H_{log(k+1)} be d_1, …, d_{log(k+1)}. Assume we have a generic (ϵ, δ)-differentially private, (α, β)-accurate PAC learner for binary hypothesis classes that has sample complexity upper bounded by a function SC_{α,β,ϵ,δ}(d), where d is the Littlestone dimension of the class. Then, there exists an (ϵ, δ)-differentially private, (α, β)-accurate PAC learner for H that has sample complexity upper bounded by ∑_{i=1}^{log(k+1)} SC_{α/log(k+1), β/log(k+1), ϵ, δ}(d_i).

Corollary 3.3 shows that the sample complexity of privately PAC learning a hypothesis class in the multiclass setting can be upper bounded by a function depending on the Littlestone dimensions of its binary restrictions. However, as described earlier, Jung et al. [JKT20] showed that the sample complexity of private multiclass PAC learning could be characterized by the multiclass Littlestone dimension. Hence, an immediate question is: what is the relationship between the multiclass Littlestone dimension of a class and the Littlestone dimensions of its binary restrictions?

### 3.2 Connection between Multiclass and Binary Littlestone Dimension

We show that the multiclass Littlestone dimension of a hypothesis class is intimately connected to the maximum Littlestone dimension over its binary restrictions.

###### Theorem 3.4.

Let H be a hypothesis class with input set X and output set {0, 1, …, k}. Let the multiclass Littlestone dimension of H be d. Let H_1, …, H_{log(k+1)} be the binary restrictions of H. Let the Littlestone dimensions of H_1, …, H_{log(k+1)} be d_1, …, d_{log(k+1)}. Then,

 max_{i=1,…,log(k+1)} d_i ≤ 6d ln(k+1).

A similar-looking theorem relating the Natarajan dimension of a hypothesis class with the maximum VC dimension over its binary restrictions was proved in Ben-David et al. [BCHL95] using the notion of Ψ-dimension. Our proof of Theorem 3.4 is inspired by this strategy. It will proceed by defining and analyzing a notion of dimension that we call Ψ-Littlestone dimension. It will also use the 0-cover function of a hypothesis class defined in Rakhlin et al. [RST15]. The details of the proof are described in Section 5.

This theorem is tight up to constant factors; for all d and k, there exists a hypothesis class with label set {0, 1, …, k} and multiclass Littlestone dimension d such that the maximum Littlestone dimension over the binary restrictions of the class is Ω(d log(k+1)). We prove this in Section 6. Additionally, the reverse direction is also true: the multiclass Littlestone dimension of any hypothesis class with label set {0, 1, …, k} is at most a log(k+1) factor larger than the maximum Littlestone dimension over its binary restrictions (this is also tight). We prove this in Section 7.

These arguments together completely describe the relationship between the multiclass Littlestone dimension of a hypothesis class with label set {0, 1, …, k} and the maximum Littlestone dimension over its binary restrictions.

Finally, combining Theorem 3.4 and Corollary 3.3, we can directly obtain the following corollary to Theorem 3.1.

###### Corollary 3.5.

Assume we have a generic (ϵ, δ)-differentially private, (α, β)-accurate PAC learner for binary hypothesis classes that has sample complexity upper bounded by a function SC_{α,β,ϵ,δ}(d), where d is the Littlestone dimension of the class. Then, there exists a generic (ϵ, δ)-differentially private, (α, β)-accurate PAC learner for multi-valued hypothesis classes (label set {0, 1, …, k}) that has sample complexity upper bounded by log(k+1) · SC_{α/log(k+1), β/log(k+1), ϵ, δ}(6d ln(k+1)), where d is the multiclass Littlestone dimension of the class.
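Numerically, this corollary composes Theorem 3.4 with a binary sample-complexity bound. The sketch below uses a placeholder binary function shaped like Theorem 3.6 (the constants and log factors are illustrative, not the paper's exact expression):

```python
import math

def binary_sc(d, alpha, beta, eps, delta):
    """Placeholder binary sample-complexity function SC_{alpha,beta,eps,delta}(d)."""
    return d**6 * math.log(d / (alpha * beta * eps * delta))**2 / (eps * alpha**2)

def multiclass_sc(d, k, alpha, beta, eps, delta):
    m = math.ceil(math.log2(k + 1))      # number of binary restrictions
    d_bin = 6 * d * math.log(k + 1)      # Theorem 3.4: max_i d_i <= 6 d ln(k+1)
    # run each binary learner at accuracy alpha/m, confidence beta/m
    return m * binary_sc(d_bin, alpha / m, beta / m, eps, delta)
```

Tracing the exponents through this composition is what produces the (log(k+1))-power blowups in the final multiclass bound of Theorem 3.7.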

We now consider an application of this result. The best known sample complexity bound for (ϵ, δ)-DP binary PAC learning is achieved by a learner described in Ghazi et al. [GGKM20]. We state a slightly looser version of their result here.

###### Theorem 3.6 (Theorem 6.4, [GGKM20]).

Let H be any binary hypothesis class with Littlestone dimension d. Then, for any α, β, ϵ, δ ∈ (0, 1), for some

 n = O( d^6 · log²( d/(αβϵδ) ) / (ϵα²) ),

there is an (ϵ, δ)-differentially private, (α, β)-accurate PAC learning algorithm for H with sample complexity upper bounded by n.

Now, applying the reduction described in Theorem 3.1, with this learner as a subroutine, we get the following theorem. (Instead of directly applying Theorem 3.6, we will instead first use a boosting procedure described in Appendix A.)

###### Theorem 3.7.

Let H be a concept class over X with label set {0, 1, …, k} and multiclass Littlestone dimension d. Then, for any α, β, ϵ, δ ∈ (0, 1), for some

 n = O( d^6 (log(k+1))^8 · log⁴( d log³(k+1) / (ϵδαβ) ) / (ϵα) ),

there is an (ϵ, δ)-differentially private, (α, β)-accurate PAC learning algorithm for H with sample complexity upper bounded by n.

###### Proof.

We will use the fact that the binary PAC learner from Ghazi et al. can be boosted to give a learner for binary hypothesis classes with Littlestone dimension d with sample complexity upper bounded by O( d^6 · log⁴( d/(ϵδαβ) ) / (ϵα) ). The main difference is that the sample complexity is nearly inverse linear in the accuracy α versus inverse quadratic. This boosting procedure is discussed in detail in Section A and the sample complexity bound we use here is derived in Corollary A.3.

Substituting this boosted sample complexity bound into Corollary 3.5 gives the result. ∎

## 4 Ψ-Littlestone Dimension

### 4.1 Definition

In this section, we define an online analog of the Ψ-dimension [BCHL95] that will help us prove Theorem 3.4. The main intuition is that, similar to the definition of Ψ-dimension, we can use what we term collapsing maps to reason about the multiclass setting while working with binary outputs. Let ψ represent a function that maps labels in {0, 1, …, k} to {0, 1, ∗}, which we call a collapsing map. We refer to a set Ψ of collapsing maps as a family. The definitions of labeled trees will be the only distinction from the regular definition of multiclass Littlestone dimension; every node will have not only an example, but also a collapsing map assigned to it.

###### Definition 4.1 (Ψ-labeled binary tree).

A complete Ψ-labeled binary tree of depth d with label set {0, 1, …, k} and mapping set Ψ on input space X consists of a complete binary tree of depth d with the following labels:

1. Every node of the tree other than the leaves is labeled by an example x ∈ X, and a collapsing map ψ ∈ Ψ.

2. The left and right edges going from any parent node to its two children are labeled by 0 and 1 respectively.

3. The leaf nodes of the tree are unlabeled.

A complete Ψ-uniformly labeled binary tree of depth d with label set {0, 1, …, k} and mapping set Ψ on input space X is defined in the same way, with the additional property that all nodes at the same depth are labeled by the same collapsing map.

Where the input space, label space, and mapping set are obvious, we will omit them and simply refer to a complete Ψ-labeled binary tree or Ψ-uniformly labeled binary tree.

###### Definition 4.2.

Consider a root-to-leaf path in a complete Ψ-labeled binary tree described as an ordered sequence (x_1, ψ_1, b_1, …, x_d, ψ_d, b_d), where each x_j is an input, ψ_j is a collapsing map, and b_j ∈ {0, 1} is an edge label. We say that this path is realized by a function f if ψ_j(f(x_j)) = b_j for every triple (x_j, ψ_j, b_j) in the ordered sequence.

We can now define what it means for a class of functions to Ψ-shatter a complete Ψ-labeled binary tree.

###### Definition 4.3 (Ψ-shattering).

We say that a complete Ψ-labeled binary tree of depth d with label set {0, 1, …, k} is Ψ-shattered by a hypothesis class H if for all root-to-leaf sequences p of the tree, there exists a function f ∈ H that realizes p. Similarly, we say that a complete Ψ-uniformly labeled binary tree of depth d with label set {0, 1, …, k} is Ψ-shattered by a hypothesis class H if for all root-to-leaf sequences p of the tree, there exists a function f ∈ H that realizes p.

Finally, we are in a position to define the Ψ-Littlestone dimension.

###### Definition 4.4 (Ψ-Littlestone dimension).

The Ψ-Littlestone dimension of a hypothesis class H, denoted ΨLD(H), is defined to be the maximum depth d such that there is a complete Ψ-labeled binary tree of depth d that is Ψ-shattered by H. If no maximum exists, then we say that the Ψ-Littlestone dimension of H is ∞. The uniform Ψ-Littlestone dimension ΨLD_U(H) is defined similarly (using the definition of Ψ-shattering for complete Ψ-uniformly labeled binary trees instead).

### 4.2 Properties of Ψ-Littlestone Dimension

In this section, we begin our investigation of Ψ-Littlestone dimensions by discussing a few simple and useful properties. We first define three important families of collapsing maps Ψ_N, Ψ_bin and Ψ_B that will play an important role in our results.

For distinct labels y_1, y_2 ∈ [k], consider the collapsing map ψ_{y_1, y_2} defined by ψ_{y_1, y_2}(y) = 0 if y = y_1, ψ_{y_1, y_2}(y) = 1 if y = y_2, and ψ_{y_1, y_2}(y) = ⋆ otherwise. Then, Ψ_N is defined to be {ψ_{y_1, y_2} : y_1, y_2 ∈ [k], y_1 ≠ y_2}. Similarly, let ψ_j be the collapsing map that maps a label in [k] to the j-th bit of its ⌈log k⌉-bit binary expansion. Then, Ψ_bin is defined to be {ψ_j : j ∈ [⌈log k⌉]}. Finally, Ψ_B is defined as the family of all collapsing maps from [k] to {0, 1}.
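As an illustration, the three families can be enumerated explicitly for small k. This is a sketch under the assumption that labels are encoded as the integers 0, …, k−1; the variable names are ours.

```python
from itertools import product
from math import ceil, log2

k = 4                      # number of classes, labels encoded as 0..k-1
m = ceil(log2(k))          # number of bits in the binary expansion

# Psi_N: psi_{y1,y2} maps y1 -> 0, y2 -> 1, every other label -> "star".
psi_N = [{y: (0 if y == y1 else 1 if y == y2 else "star") for y in range(k)}
         for y1 in range(k) for y2 in range(k) if y1 != y2]

# Psi_bin: psi_j maps a label to the j-th bit of its binary expansion.
psi_bin = [{y: (y >> j) & 1 for y in range(k)} for j in range(m)]

# Psi_B: all collapsing maps from [k] to {0, 1}.
psi_B = [dict(zip(range(k), bits)) for bits in product([0, 1], repeat=k)]

print(len(psi_N), len(psi_bin), len(psi_B))  # prints 12 2 16
```

The family sizes are k(k − 1), ⌈log k⌉ and 2^k respectively, which is why Ψ_bin is the economical choice when the number of classes is large.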

We first show that the multiclass Littlestone dimension of a hypothesis class H (denoted MLD(H)) is equal to Ψ_N LD(H).

###### Lemma 4.5.

For all hypothesis classes H, MLD(H) = Ψ_N LD(H).

###### Proof.

Consider any complete io-labeled binary tree T of depth d that is shattered by H. Construct a complete Ψ_N-labeled binary tree T′ as follows. The tree T′ will be of the same depth d as T. If, for a particular parent node in T, the two edges from that node to its children are labeled by y_1 and y_2, then let the collapsing map labeling the corresponding node in T′ be ψ_{y_1, y_2}. The edge labeled y_1 in T will be labeled 0 in T′ and the other edge will be labeled 1. Also, label the nodes of T′ with examples in exactly the same way as in T. The leaves remain unlabeled. By the definition of shattering, for every root-to-leaf path in T, there is a function f ∈ H that realizes that path. This function will continue to realize the corresponding path in T′, since ψ_{y_1, y_2}(f(x)) = 0 exactly when f(x) = y_1 and ψ_{y_1, y_2}(f(x)) = 1 exactly when f(x) = y_2. Hence, T′ is Ψ_N-shattered by H. This implies that

 MLD(H) ≤ Ψ_N LD(H).

The other direction performs this construction in reverse: it takes a complete Ψ_N-labeled binary tree T that is Ψ_N-shattered by H and creates a complete io-labeled binary tree T′ of the same depth that is shattered by H. For any node in T, if the collapsing map assigned to that node is ψ_{y_1, y_2}, the edges from the corresponding node to its children in T′ will be labeled y_1 and y_2 respectively (the edge labeled 0 in T will be labeled y_1 in T′ and the other edge will be labeled y_2). The nodes of T′ are labeled with the same examples as in T. The leaves remain unlabeled. By a similar argument to that in the previous paragraph, we have that T′ is shattered by H, which means that

 Ψ_N LD(H) ≤ MLD(H).

This proves the claim. ∎
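Lemma 4.5 can be sanity-checked by brute force on tiny finite classes. The sketch below is our own illustration (the recursive characterizations of the two dimensions are standard, not quoted from the paper): it computes MLD(H) and Ψ_N LD(H) for the class of all functions from a two-point domain to three labels, for which both dimensions equal 2.

```python
from itertools import product

def mld(H):
    """Multiclass Littlestone dimension of a finite class of tuples, via the
    recursion: MLD >= t iff some input x admits two distinct labels whose
    restricted subclasses both have MLD >= t - 1."""
    best = 0
    n_inputs = len(next(iter(H)))
    for x in range(n_inputs):
        by_label = {}
        for f in H:
            by_label.setdefault(f[x], []).append(f)
        if len(by_label) >= 2:
            dims = sorted((mld(sub) for sub in by_label.values()), reverse=True)
            best = max(best, 1 + min(dims[0], dims[1]))
    return best

def psi_n_ld(H, k):
    """Psi_N-Littlestone dimension: a node may use any map psi_{y1,y2}, so the
    two child subclasses are the functions outputting y1 and y2 at x."""
    best = 0
    n_inputs = len(next(iter(H)))
    for x in range(n_inputs):
        for y1 in range(k):
            for y2 in range(k):
                if y1 == y2:
                    continue
                H0 = [f for f in H if f[x] == y1]   # collapsed value 0
                H1 = [f for f in H if f[x] == y2]   # collapsed value 1
                if H0 and H1:
                    best = max(best, 1 + min(psi_n_ld(H0, k), psi_n_ld(H1, k)))
    return best

H = list(product(range(3), repeat=2))  # all functions {x1, x2} -> {0, 1, 2}
print(mld(H), psi_n_ld(H, 3))          # prints 2 2: the two dimensions agree
```

The brute force is exponential and only feasible for toy classes, but it mirrors the two tree constructions in the proof: each node of a shattered io-labeled tree corresponds to a choice of (x, y_1, y_2) in `psi_n_ld`.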

Next, we connect the Littlestone dimension of the binary restrictions of a hypothesis class with label set [k] to the Ψ_bin-Littlestone dimension of the class.

###### Claim 4.6.

Consider any hypothesis class H with label set [k], and let H_1, …, H_⌈log k⌉ be the binary restrictions of H (where H_j = {ψ_j ∘ f : f ∈ H}). Let the Littlestone dimension of H_j be d_j. Then,

 max_j d_j ≤ Ψ_bin LD_U(H) ≤ Ψ_bin LD(H).
###### Proof.

The second inequality follows immediately from the fact that for any d, if there exists a complete Ψ_bin-uniformly labeled binary tree of depth d that is Ψ_bin-shattered by H, then there exists a complete Ψ_bin-labeled binary tree of depth d that is Ψ_bin-shattered by H.

To prove the first inequality, fix an index j such that d_j = max_j d_j. Consider a complete io-labeled binary tree T of depth d_j that is shattered by H_j. Then, construct the following complete Ψ_bin-labeled binary tree T′ of the same depth d_j. Every node is labeled with the same example as the corresponding node in T. Every node in T′ is labeled with the collapsing map ψ_j, which maps a label to the j-th bit of its binary expansion. The leaves remain unlabeled. Then, we have that H Ψ_bin-shatters T′. Additionally, T′ is of the same depth as T and all nodes at the same depth are labeled by the same collapsing map ψ_j. Hence,

 max_j d_j ≤ Ψ_bin LD_U(H). ∎

Finally, we relate the notions of Ψ-Littlestone dimension obtained from the families Ψ_N, Ψ_bin and Ψ_B.

###### Claim 4.7.

For all hypothesis classes H,

 Ψ_N LD(H) ≤ Ψ_bin LD(H) ≤ Ψ_B LD(H).
###### Proof.

Consider any complete Ψ_N-labeled binary tree T of depth d that is Ψ_N-shattered by H. Construct a complete Ψ_bin-labeled binary tree T′ of the same depth as follows. Label the nodes of T′ with examples exactly as in T. Consider a node in T and the collapsing map ψ_{y_1, y_2} that labels the node. There is at least one bit in which the binary expansions of y_1 and y_2 differ. Let this be the j-th bit. Then, label the corresponding node in T′ with the collapsing map ψ_j, which maps every label to the j-th bit of its binary expansion. Consider the two edges emanating from this node. If the j-th bit of the binary expansion of y_1 is 0, then in T′, label the edge that was labeled 0 in T by 0 and the other by 1. Else, label the edge that was labeled 0 in T by 1 and the other by 0. Perform this transformation for every labeled node in T to obtain the corresponding labeled node in T′. The leaves of T′ remain unlabeled.

Then, T′ is Ψ_bin-shattered by H. This gives that Ψ_N LD(H) ≤ Ψ_bin LD(H). The second inequality follows because Ψ_bin ⊆ Ψ_B, and so a Ψ_bin-labeled tree that is Ψ_bin-shattered by H is automatically also a Ψ_B-labeled tree that is Ψ_B-shattered by H. ∎
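The only computation in this construction is locating a bit where the binary expansions of y_1 and y_2 differ. A short sketch (our own illustration, picking the highest-order differing bit):

```python
def differing_bit(y1, y2):
    """Index j of the highest-order bit where the binary expansions of two
    distinct labels differ; the map psi_j then separates y1 from y2."""
    assert y1 != y2
    return (y1 ^ y2).bit_length() - 1

def psi(j, y):
    """The map psi_j from Psi_bin: j-th bit of the binary expansion of y."""
    return (y >> j) & 1

j = differing_bit(5, 1)         # 5 = 0b101 and 1 = 0b001 differ in bit 2
print(j, psi(j, 5), psi(j, 1))  # prints 2 1 0
```

Since the chosen bit differs between y_1 and y_2, the two children of the node receive distinct edge labels, exactly as the proof requires.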

## 5 Proof of Theorem 3.4

In this section, we use the concept of Ψ-Littlestone dimension to prove Theorem 3.4.

### 5.1 Sauer’s Lemma for Multiclass Littlestone Dimension

In this section, we describe a version of Sauer's Lemma that suffices for our application. The argument is essentially due to Rakhlin et al. [RST15]. Theorem 7 of that paper states a Sauer's-lemma-style upper bound for a quantity they introduce called the "0-cover function", for hypothesis classes with bounded "sequential fat-shattering dimension." We show that this argument applies almost verbatim to hypothesis classes with bounded multiclass Littlestone dimension.

#### 5.1.1 0-Cover Function

We start by recalling the definition of 0-cover from Rakhlin et al.

###### Definition 5.1 (output-labeled trees, input-labeled trees).

A complete output-labeled binary tree of depth n with label set [k] is a complete binary tree of depth n such that every node of the tree is labeled with an output in [k]. A complete input-labeled binary tree of depth n with input set X is a complete binary tree of depth n such that every node of the tree is labeled with an input in X.

The convention we will use is that output- and input-labeled binary trees have their root at depth 1 (as opposed to io-labeled trees and Ψ-labeled trees, where we use the convention that the root is at depth 0). Consider a set V of complete output-labeled binary trees of depth n with label set [k]. Consider a hypothesis class H consisting of functions from input space X to label set [k]. Fix a complete input-labeled binary tree T of depth n with input space X and a complete output-labeled tree v ∈ V.

###### Definition 5.2.

We say that a root-to-leaf path p = (p_1, …, p_n) in T corresponds to a root-to-leaf path q = (q_1, …, q_n) in v if for all i < n, if node p_{i+1} in p is the left child of node p_i in T, then node q_{i+1} in q is the left child of node q_i in v, and likewise for the case where node p_{i+1} is the right child of node p_i.

###### Definition 5.3.

Let p be a root-to-leaf path in a complete input-labeled binary tree T and let the labels of the nodes in p be x_1, …, x_n, where each x_i ∈ X. The function f applied to p, denoted by f(p), is the sequence (f(x_1), …, f(x_n)).

###### Definition 5.4 (0-cover, [Rst15]).

We say that a set V of complete output-labeled trees forms a 0-cover of hypothesis class H on tree T if, for every function f ∈ H and every root-to-leaf path p in T, there exists a complete output-labeled tree v ∈ V such that, for the corresponding root-to-leaf path q in v with the labels of the nodes in q denoted by a tuple (y_1, …, y_n) (call this the label sequence of q), we have that f(p) = (y_1, …, y_n).

###### Definition 5.5 (0-cover function, [Rst15]).

Let N(0, H, T) denote the size of the smallest 0-cover of hypothesis class H on tree T. Let T_n be the set of all complete input-labeled binary trees of depth n with input space X. Then, the 0-cover function of hypothesis class H is defined as N(0, H, n) = max_{T ∈ T_n} N(0, H, T).

We use the convention that N(0, H, 0) = 1.

#### 5.1.2 Statement of theorem

The following theorem is essentially Theorem 7 of Rakhlin et al. [RST15] (with multiclass Littlestone dimension in place of sequential fat shattering dimension).

###### Theorem 5.6.

Let hypothesis class H be a set of functions f : X → [k]. Let the multiclass Littlestone dimension of H be d. Then, for all natural numbers n with n ≥ d,

 N(0, H, n) ≤ ∑_{i=0}^{d} (n choose i) k^i. (4)

For all natural numbers n with n ≥ d ≥ 1, we additionally have the following:

 N(0, H, n) ≤ ∑_{i=0}^{d} (n choose i) k^i ≤ (ekn/d)^d. (5)

Finally, for all d, for all natural numbers n ≤ d, we have N(0, H, n) ≤ k^n.

###### Proof.

Firstly, observe that for all n ≥ d ≥ 1,

 ∑_{i=0}^{d} (n choose i) k^i ≤ (kn/d)^d ∑_{i=0}^{d} (n choose i) (d/n)^i ≤ (kn/d)^d (1 + d/n)^n ≤ (ekn/d)^d.

This proves the second inequality in expression (5).
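The chain of inequalities above can be checked numerically. This is our own sanity check with arbitrarily chosen sample values of n, d and k:

```python
from math import comb, e

def sauer_sum(n, d, k):
    """Left-hand side of the bound: sum_{i=0}^{d} C(n, i) * k^i."""
    return sum(comb(n, i) * k**i for i in range(d + 1))

n, d, k = 10, 3, 5
lhs = sauer_sum(n, d, k)
mid = (k * n / d) ** d * sum(comb(n, i) * (d / n) ** i for i in range(d + 1))
rhs = (e * k * n / d) ** d
print(lhs <= mid <= rhs)  # prints True for n >= d >= 1
```

The first step uses k^i ≤ (kn/d)^d (d/n)^i for i ≤ d ≤ n, and the last uses (1 + d/n)^n ≤ e^d.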

The proof of the rest of the theorem is by double induction on d and n.

##### First base case (d=0,n≥1):

Observe that when d = 0, any two functions in H must agree on every input (otherwise a tree of depth 1 would be shattered), so the class H consists of only a single distinct function. Call this function f_0. Then, for any complete input-labeled binary tree T of depth n on input set X, create a complete output-labeled binary tree v of depth n on output set [k] as follows: for every node in T labeled by input x, label the corresponding node in v by f_0(x). Then the set V = {v} consisting of just one tree is a 0-cover for H on T. Thus we have that N(0, H, n) ≤ 1 = ∑_{i=0}^{0} (n choose i) k^i, verifying this base case.

##### Second base case (0<n≤d):

We will prove a stronger statement: we will show that for any complete input-labeled binary tree T of depth n (for any natural number n), there is a 0-cover of hypothesis class H on T of size k^n. This also proves the final part of the theorem corresponding to n ≤ d. We start by observing that there are k^n sequences of n elements from [k]. For every such sequence, create a complete output-labeled binary tree of depth n as follows: label all nodes at depth i by the i-th element of the sequence. In this way, we create k^n different trees. This set V of trees will form a 0-cover for H on T. To see this, fix a root-to-leaf path p in T and a function f ∈ H and consider the sequence f(p). Then by construction, there is a tree v ∈ V such that every root-to-leaf path in v has label sequence f(p). This implies that V is a 0-cover of hypothesis class H on T. Thus, we have that N(0, H, n) ≤ k^n for 0 < n ≤ d, verifying the second base case.
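In code, the cover in this base case is just the set of all k^n label sequences, each read as a depth-uniform output-labeled tree (our own illustration, with toy values of k and n):

```python
from itertools import product

k, n = 3, 2
# One output-labeled tree per sequence: all nodes at depth i share label seq[i],
# so every root-to-leaf path in that tree has the same label sequence seq.
cover = set(product(range(k), repeat=n))
print(len(cover))  # prints 9, i.e. k**n

# Any hypothesis f on any path (x_1, ..., x_n) produces some label sequence,
# which is necessarily one of the sequences in the cover.
f = {"x1": 2, "x2": 0}
path_labels = (f["x1"], f["x2"])
print(path_labels in cover)  # prints True
```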

##### Inductive case:

Fix a pair (n, d) such that n > d ≥ 1 (note that the base cases handle all other values of n and d). Assume that the theorem is true for all pairs of values (n′, d′) where n′ < n and d′ ≤ d. We will prove that it is true for the pair (n, d). Consider a complete input-labeled binary tree T of depth n with input set X. Let the root node of T be labeled by example x_r. Divide hypothesis class H into subclasses H_1, …, H_k as follows:

 H_i = {f ∈ H : f(x_r) = i}.

That is, H_i is the subclass of functions in H that output label i on example x_r.

###### Claim 5.7.

There exists at most one i ∈ [k] such that H_i has multiclass Littlestone dimension d. Every other subclass has multiclass Littlestone dimension at most d − 1.

###### Proof.

Assume by way of contradiction that there are two hypothesis classes H_i and H_j (with i ≠ j) that both have multiclass Littlestone dimension d. Then there are complete io-labeled binary trees T_i and T_j of depth d with input set X and output set [k] that are shattered by H_i and H_j respectively. Construct a complete io-labeled binary tree T′ of depth d + 1 with input set X and output set [k] as follows: label the root node by x_r, and label the two edges emanating from the root by i and j respectively. Set the left sub-tree of the root to be T_i and the right sub-tree to be T_j. Then H shatters T′, since H_i and H_j shatter T_i and T_j respectively, and every function in H_i (resp. H_j) outputs i (resp. j) on x_r. However, this is a contradiction, since H has multiclass Littlestone dimension d and the shattered tree T′ has depth d + 1. ∎
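Claim 5.7 can be observed on a toy class. The sketch below (our own brute-force check, reusing the standard recursive characterization of the dimension) takes the class of all functions from two inputs to three labels, splits it by the value at a root example, and counts how many subclasses attain the full dimension:

```python
from itertools import product

def mld(H):
    """Multiclass Littlestone dimension of a finite class of tuples (our own
    brute-force recursion, for sanity-checking only)."""
    best = 0
    n_inputs = len(next(iter(H)))
    for x in range(n_inputs):
        by_label = {}
        for f in H:
            by_label.setdefault(f[x], []).append(f)
        if len(by_label) >= 2:
            dims = sorted((mld(sub) for sub in by_label.values()), reverse=True)
            best = max(best, 1 + min(dims[0], dims[1]))
    return best

H = list(product(range(3), repeat=2))   # all functions on 2 inputs, 3 labels
d = mld(H)
x_r = 0                                  # root example (first input)
subclasses = [[f for f in H if f[x_r] == i] for i in range(3)]
full_dim = sum(1 for sub in subclasses if sub and mld(sub) == d)
print(d, full_dim)  # prints 2 0: no subclass attains the full dimension here
```

In this example every subclass drops to dimension d − 1; the claim only guarantees that at most one subclass can retain dimension d.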

Next, let H_{i⋆} be a hypothesis class among H_1, …, H_k with multiclass Littlestone dimension equal to d. If no such class exists, simply choose H_{i⋆} to be a class with maximum multiclass Littlestone dimension (note that this dimension will be upper bounded by d). Let T_L and T_R be the left and right sub-trees (of depth n − 1) of the root of T. By the inductive hypothesis, there are 0-covers V_L and V_R of H_{i⋆} on T_L and T_R, each of size at most ∑_{i=0}^{d} (n−1 choose i) k^i. We will now stitch together trees from V_L and V_R to create a set V of trees that will form a 0-cover of H_{i⋆} on T. Informally, we do this as follows. Every tree in V will have its root labeled by i⋆. The left sub-tree of the root will be assigned to be some tree from V_L and the right sub-tree of the root will be assigned to be some tree from V_R.

Formally, without loss of generality, let |V_L| ≥ |V_R|. Then, there exists a surjective function g from V_L to V_R. For every tree v ∈ V_L, construct a tree in V as follows: the root will be labeled by i⋆, the left subtree will be v, and the right subtree will be g(v). Clearly, the size of V is equal to the size of V_L, which is at most ∑_{i=0}^{d} (n−1 choose i) k^i. Next, we argue that the set V is a 0-cover for H_{i⋆} on T.

###### Claim 5.8.

V is a 0-cover of H_{i⋆} on T.

###### Proof.

Fix a root-to-leaf path p in T and fix a function f ∈ H_{i⋆}. Write p = (r, p′), where r is the root and p′ is the root-to-leaf path omitting the root. Consider