# Margin Maximization as Lossless Maximal Compression

The ultimate goal of a supervised learning algorithm is to produce models constructed on the training data that can generalize well to new examples. In classification, functional margin maximization (correctly classifying as many training examples as possible with maximal confidence) has been known to construct models with good generalization guarantees. This work gives an information-theoretic interpretation of a margin maximizing model on a noiseless training dataset as one that achieves lossless maximal compression of said dataset, i.e. one that extracts from the features all the useful information for predicting the label and no more. The connection offers new insights on generalization in supervised machine learning, showing margin maximization to be a special case (that of classification) of a more general principle, and explains the success and potential limitations of popular learning algorithms like gradient boosting. We support our observations with theoretical arguments and empirical evidence and identify interesting directions for future work.


## 1 Introduction

The goal of a supervised learning algorithm is to construct a model on the training set that can generalize well on new data. Yet, generalization is an elusive property, involving intractable quantities that must be approximated or bounded (like the generalization error) or notions with multiple definitions (like that of model complexity). As a result there are many different theoretical routes to generalization, often leading to conclusions that apparently 'contradict' one another or empirical evidence zhang2016understanding . For instance, why are certain learning algorithms that explore overparameterized or non-parametric model families so good at producing models that can generalize well, even without explicit regularization buhlmann2007boosting ; zhang2016understanding ; kawaguchi2017generalization ? A unified language for comparing the complexity of models trained on a given dataset can help us identify good model selection and algorithmic practices that guide the learning algorithm towards models that are complex enough to not underfit yet also maximally resistant to overfitting.

In this paper we take a step in this direction by bridging two (until now disconnected) theoretical paths to generalization in the case of classification, namely information theory shannon1948mathematical , inspired by recent advances on the information bottleneck principle tishby2000information ; tishby2015deep ; shwartz2017opening , and margin theory vapnik1982estimation ; schapire1998boosting . From an information-theoretic perspective, we would like our learning algorithm to learn a model that contains all the information from the features necessary for describing the target (we call this property losslessness) and no more information beyond that (we call this property maximal compression). Margin theory suggests constructing a model that can correctly classify as many training examples as possible with as high confidence as possible (i.e. one that maximizes the quantity known as the functional margin over the training set). We prove that in the case of classification on noiseless (i.e. unambiguously labelled) datasets, functional margin maximization is equivalent to lossless maximal compression in the information-theoretic sense. The existence of margin-based bounds on the generalization error implies that margin maximization is beneficial for achieving good generalization and therefore so is lossless maximal compression.

Our experiments on gradient boosting, a method that maximizes the training margins, show empirically that on noiseless data, margin maximization amounts to lossless maximal compression and that maximally compressed models on average exhibit the highest generalization capability (as estimated by the test error). We identify interesting similarities between the way training progresses in Deep Neural Networks (DNNs) and in gradient boosting and gain useful insights on the training of gradient boosting algorithms. All findings persist across a wide range of datasets & hyperparameter configurations.

To our knowledge, there is no prior work establishing the connection between functional margin maximization and lossless model compression in the information-theoretic sense. Both margin theory and information theory have been individually connected to generalization and have been used to explain resistance to overfitting. The idea that functional margin maximization promotes good generalization can be traced back to vapnik1982estimation . It has been used in the theoretical analysis of Boosting algorithms schapire1998boosting , with recent work using it to explain good generalization in DNNs sokolic2017generalization ; dziugaite2017computing ; neyshabur2017exploring ; wei2018margin . The related notion of geometric margin maximization has been used to justify good generalization in Support Vector Machines (SVMs) Cortes1995 . (The geometric margin of a classifier is the distance of the closest example in the training set to the decision boundary. For linear models the geometric margin is a rescaling of the functional margin, therefore a model maximizing the one also maximizes the other.) The idea that a learned representation of a dataset that generalizes well is one that extracts from the features all the useful information for predicting the target, and no more, is captured in information-theoretic terms under the information bottleneck principle tishby2000information . Recent work has offered insights into the good generalization capabilities of DNNs, utilizing these ideas tishby2015deep ; shwartz2017opening . More recently, bounds on the generalization error of a learning algorithm in terms of the mutual information between its input and output have been established xu2017information ; asadi2018chaining .

## 2 Background

### 2.1 Binary Classification

A classification algorithm receives as input a training dataset $S = \{(x_i, y_i)\}_{i \in [n]}$ consisting of pairs of feature vectors $x_i \in \mathcal{X}$ and corresponding class labels $y_i \in \mathcal{Y}$. The training set is drawn i.i.d. from some unknown probability measure $P$ on $\mathcal{X} \times \mathcal{Y}$. We shall focus on binary classification, where $\mathcal{Y} = \{-1, +1\}$. In this setting, we consider w.l.o.g. the output of the learning algorithm (model) as a function $f: \mathcal{X} \to [-1, 1]$ that allows us to predict the label on unseen examples drawn from $P$ with feature vector $x$ as $\operatorname{sign}(f(x))$. Given any probability measure $P$ on $\mathcal{X} \times \mathcal{Y}$ and any function $f$ we let $R_P(f)$ denote the probability of making an error with respect to distribution $P$,

 RP(f)=P[sign(f(X))≠Y]. (1)

Ideally the learning algorithm will output the model with the lowest possible risk w.r.t. the unknown distribution $P$, i.e. a function $f$ that minimizes the true classification risk $R_P(f)$. However, since we do not have direct access to the unknown distribution $P$ we must estimate it with the empirical measure $\hat{P}_n$, defined for each set $A \subseteq \mathcal{X} \times \mathcal{Y}$ in terms of the training data by

$$\hat{P}_n(A) = \frac{1}{n} \cdot \sum_{i \in [n]} \mathbb{1}\{(x_i, y_i) \in A\}. \tag{2}$$

The empirical risk of a function $f$ is given by

$$R_S(f) = \hat{P}_n[\operatorname{sign}(f(X)) \neq Y] = \frac{1}{n} \cdot \sum_{i \in [n]} \mathbb{1}\{\operatorname{sign}(f(x_i)) \neq y_i\}.$$
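As a minimal sketch of the empirical risk above: it is simply the fraction of training pairs whose predicted sign disagrees with the label. The toy dataset, the clipping model and the convention that a score of exactly zero maps to $+1$ are illustrative assumptions, not taken from the paper.

```python
def sign(v):
    # Assumed tie-breaking convention: a score of exactly 0 is mapped to +1.
    return 1 if v >= 0 else -1


def empirical_risk(f, data):
    """Fraction of training pairs (x, y) with sign(f(x)) != y."""
    errors = sum(1 for x, y in data if sign(f(x)) != y)
    return errors / len(data)


# Hypothetical toy dataset: one real-valued feature, labels in {-1, +1}.
toy_data = [(-2.0, -1), (-0.5, -1), (0.3, 1), (1.7, 1), (-0.1, 1)]


def toy_model(x):
    # Hypothetical model: clip the feature into the score range [-1, 1].
    return max(-1.0, min(1.0, x))


# Only the last example is misclassified, so one error out of five.
toy_risk = empirical_risk(toy_model, toy_data)
```

Here the model is any callable returning a score in $[-1, 1]$, matching the convention that the predicted label is the sign of the score.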

In what follows we shall refer to a general finitely supported measure $P$ on $\mathcal{X} \times \mathcal{Y}$. The motivating example here is the empirical measure $\hat{P}_n$, which is supported on the finite set $S$, the training dataset.

### 2.2 Information theory

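We now present some basic definitions and properties from information theory shannon1948mathematical . Let $A$ & $B$ be random variables (RVs), with alphabets $\mathcal{A}$ & $\mathcal{B}$ and joint probability measure $P$. We shall assume that $P$ is finitely supported, i.e. there exist finite subsets $\mathcal{A}_P \subseteq \mathcal{A}$ and $\mathcal{B}_P \subseteq \mathcal{B}$ such that $P(A \in \mathcal{A}_P, B \in \mathcal{B}_P) = 1$.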

The entropy of a RV $A$ measures the amount of uncertainty associated with its value when only its distribution is known. It is defined by

$$H_P(A) = -\sum_{a \in \mathcal{A}_P} P(A=a) \log P(A=a). \tag{3}$$

The amount of information shared by RVs $A$ & $B$ is their mutual information, defined as

$$I_P(A;B) = \sum_{a \in \mathcal{A}_P,\, b \in \mathcal{B}_P} P(A=a, B=b) \log \frac{P(A=a, B=b)}{P(A=a)\,P(B=b)} = I_P(B;A). \tag{4}$$

In terms of information theory, the chain rule of probability takes the form

$$I_P(A;B) = H_P(A) - H_P(A|B), \qquad I_P(A;B) = I_P(B;A) = H_P(B) - H_P(B|A), \tag{5}$$

where $H_P(A|B)$ is the conditional entropy of $A$ given $B$, given by

$$H_P(A|B) = -\sum_{a \in \mathcal{A}_P,\, b \in \mathcal{B}_P} P(A=a, B=b) \log P(A=a \mid B=b), \tag{6}$$

which measures the uncertainty of the value of RV $A$ given the value of RV $B$. From Eq. (5), it is clear that $I_P(A;B)$ measures the decrease in uncertainty about either the value of RV $A$ or the value of RV $B$, when the value of the other RV is known.

If $B$ is a deterministic transformation of $A$ then there is no uncertainty remaining about the value of $B$ given the value of $A$, so we have $H_P(B|A) = 0$. Finally, if $B$ is an invertible transformation of RV $A$, we have $H_P(A|B) = H_P(B|A) = 0$, as the value of $A$ grants us perfect knowledge of the value of $B$ and vice-versa.
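As a short worked step, consider generic RVs $A$ and $B = g(A)$, where $g$ is an illustrative name not used in the paper. Combining the deterministic case with Eq. (5) gives:

$$
\begin{aligned}
B = g(A) \;&\Rightarrow\; H_P(B|A) = 0 \\
&\Rightarrow\; I_P(A;B) = H_P(B) - H_P(B|A) = H_P(B).
\end{aligned}
$$

So a deterministic transformation shares with its input exactly its own entropy. If $g$ is additionally invertible, then $H_P(A|B) = 0$ as well, and Eq. (5) yields $H_P(A) = I_P(A;B) = H_P(B)$.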

In what follows we shall be particularly interested in the empirical entropy $H_S(\cdot)$ and the empirical mutual information $I_S(\cdot\,;\cdot)$, i.e. the quantities above computed under the empirical measure $\hat{P}_n$ of Eq. (2).
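The empirical quantities can be computed from paired samples with plug-in (maximum likelihood) frequency estimates. The sketch below is one way to do so, not the authors' code; the base-2 logarithm and the helper names are assumptions, since the text does not fix a logarithm base.

```python
from collections import Counter
from math import log2


def entropy(samples):
    """Plug-in estimate of H(A) in bits from a list of hashable samples."""
    n = len(samples)
    return -sum((c / n) * log2(c / n) for c in Counter(samples).values())


def conditional_entropy(a, b):
    """H(A|B) = H(A,B) - H(B), estimated from paired samples."""
    return entropy(list(zip(a, b))) - entropy(b)


def mutual_information(a, b):
    """I(A;B) = H(A) + H(B) - H(A,B), equivalent to H(A) - H(A|B)."""
    return entropy(a) + entropy(b) - entropy(list(zip(a, b)))


# Hypothetical example: b is a deterministic (non-invertible) function of a.
a = [0, 0, 1, 1, 2, 2, 3, 3]
b = [v % 2 for v in a]

h_a = entropy(a)                      # uniform over 4 values
h_b = entropy(b)                      # uniform over 2 values
h_b_given_a = conditional_entropy(b, a)
i_ab = mutual_information(a, b)
```

In this example `h_b_given_a` vanishes and `i_ab` equals `h_b`, the behaviour described above for a deterministic transformation.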

#### 2.2.1 The information bottleneck principle

Suppose we wish to learn a compressed representation $F$ from the original features $X$ that is useful for predicting a target variable $Y$. Treating $X$, $Y$ & $F$ as RVs, the Information Bottleneck principle tishby2000information offers a way to select a representation $F$ by trading off the information the learned representation captures from $X$ regarding the target variable $Y$, i.e. $I_S(F;Y)$ (the higher, the better for predicting $Y$), against the total information it captures from $X$, i.e. $I_S(F;X)$ (the lower, the higher the degree of compression). We thus look for a representation $F^*$ such that

$$F^* = \operatorname{argmin}_F \{ I_S(F;X) - \beta\, I_S(F;Y) \}, \tag{7}$$

where $\beta$ is a Lagrange multiplier that controls the aforementioned tradeoff.
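To make the tradeoff in Eq. (7) concrete, the following sketch evaluates the objective for three hand-picked candidate representations on a four-point toy dataset and reports which one attains the minimum for a given multiplier. The dataset, the candidate names and the base-2 logarithm are illustrative assumptions; a real use of the principle would optimize over representations rather than enumerate three of them.

```python
from collections import Counter
from math import log2


def entropy(samples):
    n = len(samples)
    return -sum((c / n) * log2(c / n) for c in Counter(samples).values())


def mutual_information(a, b):
    return entropy(a) + entropy(b) - entropy(list(zip(a, b)))


def ib_objective(f_vals, x_vals, y_vals, beta):
    """I(F;X) - beta * I(F;Y), estimated on the training sample."""
    return (mutual_information(f_vals, x_vals)
            - beta * mutual_information(f_vals, y_vals))


# Hypothetical toy data: four distinct feature values, two per class.
x_vals = [0, 1, 2, 3]
y_vals = [-1, -1, 1, 1]

# Hypothetical candidate representations F = f(X).
candidates = {
    "constant": [0, 0, 0, 0],    # keeps nothing from X
    "label": [-1, -1, 1, 1],     # keeps exactly the label information
    "identity": [0, 1, 2, 3],    # keeps everything in X
}


def best(beta):
    """Name of the candidate with the smallest objective for this beta."""
    return min(candidates,
               key=lambda k: ib_objective(candidates[k], x_vals, y_vals, beta))
```

With a large multiplier the label-preserving candidate wins, while a small multiplier favours the constant one. This is why the paper studies minimizing $I_S(F;X)$ subject to maximal $I_S(F;Y)$ rather than a fixed tradeoff.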

Recently, this principle has been used to explain good generalization in DNNs tishby2015deep ; shwartz2017opening , and the use of training set estimates of Eq. (7), or variants of it, as objective functions for training DNNs (and other models) has been growing in popularity alemi2016deep ; strouse2017deterministic ; simeone2018brief . The reasoning is that compression controls for the complexity of the learned representation, thus promoting good generalization shamir2008learning .

In this work we draw inspiration from the above line of research and our findings further reinforce the role of information compression in promoting generalization. We regard the trained model's outputs as the 'representation' $F$ and explore the properties of a model that minimizes $I_S(F;X)$ subject to maximizing $I_S(F;Y)$ on a given training set $S$. We shall call such an intuitively 'ideal' model a lossless maximal compressor of the training set $S$.

### 2.3 Margin maximization

The (normalized) functional margin (also known as the hypothesis margin or, in the case of ensembles, the voting margin) of a training example $(x_i, y_i)$ under a model $f$ is defined as

$$m_i = y_i f(x_i) \in [-1, 1].$$

It is a combined measure of confidence and correctness of the classification of the example under $f$. Its sign encodes whether the example is correctly classified ($m_i > 0$) or misclassified ($m_i < 0$), while the magnitude of the margin (i.e. the magnitude of the score, $|f(x_i)|$) measures the confidence of the model in its prediction (the higher, the more confident).
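The margin definition translates directly into code. The sketch below computes per-example margins and a few summary statistics for a toy model. The data and the clipping model are hypothetical, and counting a margin of exactly zero as an error is an assumed tie-breaking choice.

```python
def margins(f, data):
    """Normalized functional margins m_i = y_i * f(x_i), with f(x) in [-1, 1]."""
    return [y * f(x) for x, y in data]


def clip_model(x):
    # Hypothetical model: clip a real-valued feature into [-1, 1].
    return max(-1.0, min(1.0, x))


# Hypothetical toy dataset of (feature, label) pairs with labels in {-1, +1}.
data = [(-2.0, -1), (-0.5, -1), (0.3, 1), (-0.1, 1)]

m = margins(clip_model, data)

min_margin = min(m)                      # margin of the worst-classified example
avg_margin = sum(m) / len(m)             # average margin over the training set
# Assumption: a margin of exactly 0 counts as a misclassification.
train_error = sum(1 for v in m if v <= 0) / len(m)
```

A margin maximizer on a separable dataset would push every entry of `m` to $1$; here the last example has a negative margin and is the single training error.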

Maximizing the margins over the training set has been connected to good generalization vapnik1982estimation ; schapire1998boosting . An upper bound on the generalization error of an AdaBoost classifier $f$, based on its minimum margin over the training set, is proven in schapire1998boosting . Tighter generalization bounds, dependent not only on the minimum margin but on the entire distribution of the training margins, have been derived (e.g. Emargin bound wang2011refined , k-th margin bound gao2013doubt ). Beyond boosting, such bounds hold for voting classifiers in general, and recently similar bounds have been derived for DNNs sokolic2017generalization ; dziugaite2017computing ; neyshabur2017exploring ; wei2018margin .

In this work, we will establish an equivalence between models that maximize the margins on a noiseless training set $S$ and lossless maximal compressors of $S$. We will verify our observations empirically, using boosting, a method that explicitly minimizes a monotonically decreasing loss function of the margin (i.e. maximizes training examples' margins; AdaBoost, for instance, approximately maximizes the average margin & actually minimizes the margins' variance shen2010dual ). As we will see, boosting drives learning towards a lossless maximal compressor of the noiseless training dataset. It achieves the lowest generalization error (estimated by the average test set error) once lossless maximal compression has been achieved.

## 3 Lossless maximal compression & margin maximisation

### 3.1 An information-theoretic view of datasets & models

We will now define properties that capture the relationship between the information content of the model's output ($F$) and the information present in the features & the target ($X$ & $Y$, respectively) as measured on the training dataset $S$. In Figure 1 we provide a visual summary of these properties and their possible combinations. In Table 1 we summarize the information-theoretic equalities and inequalities that hold under each scenario. Proofs not directly following the statement of a lemma or theorem can be found in the Supplementary Material.

Any function $f: \mathcal{X} \to [-1, 1]$ can be considered as a model of the training dataset $S$. Typically, the model constructed by a learning algorithm is a member of some given model family. In this work we impose no restriction on the model space, i.e. $f \in \Phi$, where $\Phi$ is the set of all models. Being a deterministic transformation of $X$, $F$ cannot contain more information than $X$. So, by the data processing inequality, $I_S(F;Y) \leq I_S(X;Y)$.

###### Definition 1 (Noiselessness).

A probability distribution $P$ is noiseless if and only if $I_P(X;Y) = H_P(Y)$ (equivalently, by Eq. (5), $H_P(Y|X) = 0$).
Otherwise, $P$ is noisy and $I_P(X;Y) < H_P(Y)$. We shall say that a dataset $S$ is noiseless (respectively, noisy) if the corresponding empirical measure $\hat{P}_n$ is noiseless (respectively, noisy).

Under this information-theoretic perspective, a noiseless distribution $P$, hence a noiseless dataset $S$, is one in which the features $X$ contain all the information needed to perfectly describe the target $Y$.

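Given a distribution $P$ on $\mathcal{X} \times \mathcal{Y}$ we let $f^*_P$ denote a minimizer of the risk, i.e.

$$f^*_P \in \operatorname{argmin}_{f \in \Phi} \{ R_P(f) \}.$$

In particular, when $P$ is the underlying distribution then $f^*_P$ is the Bayes classifier. When $P$ is the empirical measure $\hat{P}_n$ then $f^*_P$ is the empirical risk minimiser $f^*_S$, where $f^*_S \in \operatorname{argmin}_{f \in \Phi} \{ R_S(f) \}$.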

###### Lemma 1.

A dataset $S$ is noiseless if and only if $R_S(f^*_S) = 0$.

In other words, a noiseless training dataset is one in which no datapoints with the same feature vector $x$ have different labels $y$. For such a dataset (also known as an unambiguously labelled or consistent training dataset in the literature), there exists a model that can achieve zero empirical risk (training error), i.e. that can perfectly classify the training data. Put differently, there exists some deterministic mapping from the features $X$ to the target $Y$.
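The characterization above suggests a direct check. The sketch below tests whether any feature vector appears with more than one label, and computes the smallest achievable training error by predicting the majority label for each distinct feature vector. Function names and the toy data are illustrative assumptions, and feature vectors are assumed to be hashable (e.g. tuples).

```python
from collections import Counter, defaultdict


def is_noiseless(data):
    """True iff no feature vector appears with two different labels."""
    labels_seen = defaultdict(set)
    for x, y in data:
        labels_seen[x].add(y)
    return all(len(labels) == 1 for labels in labels_seen.values())


def min_empirical_risk(data):
    """Training error of a model predicting the majority label for each x."""
    counts = defaultdict(Counter)
    for x, y in data:
        counts[x][y] += 1
    correct = sum(max(c.values()) for c in counts.values())
    return 1 - correct / len(data)


# Hypothetical datasets: feature vectors are tuples, labels are in {-1, +1}.
clean = [((0, 0), -1), ((0, 1), 1), ((1, 0), 1), ((0, 1), 1)]
noisy = clean + [((0, 0), 1)]   # (0, 0) now carries both labels

clean_ok = is_noiseless(clean)
noisy_ok = is_noiseless(noisy)
clean_risk = min_empirical_risk(clean)
noisy_risk = min_empirical_risk(noisy)
```

On the clean data the minimal training error is zero, while the single ambiguous feature vector in the noisy data forces at least one error out of five.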

We shall now introduce properties that make a model $f$ useful for the purposes of capturing relevant and ignoring redundant information from a training set $S$.

###### Definition 2 (Losslessness).

A model $f$ is lossless on the dataset $S$ if and only if $I_S(F;Y) = I_S(X;Y)$. Otherwise, the model is lossy on $S$ and $I_S(F;Y) < I_S(X;Y)$.

A lossless model on a dataset $S$ is one that captures all the information in the features $X$ that is relevant for describing the target $Y$. We can equivalently state that the r.v. $F$ is a sufficient statistic of the empirical distribution of the training data. (Often the term "lossless encoding" of some r.v. in the literature characterizes an encoding that allows us to recover the original value of that r.v. from it. In our case, because of the supervised nature of the learning task, it shall mean that $F$ allows us to recover the original value of the target $Y$ from it, not necessarily the value of the feature vector.)

###### Lemma 2.

Suppose dataset $S$ is noiseless. A model $f$ is lossless on $S$ if and only if there exists an invertible transformation $\phi$ such that $R_S(\phi \circ f) = 0$.

Lemma 2 means that if a model $f$ is lossless on a training set $S$, its output can be used to describe the target $Y$ with the only source of training error being the irreducible class overlap in the training set.

###### Definition 3 (Maximal Compression).

A model $f$ is a maximal compressor of the dataset $S$ if and only if $I_S(F;X) = I_S(F;Y)$. Otherwise, the model is undercompressed on $S$ and $I_S(F;X) > I_S(F;Y)$.

A model that is a maximal compressor of a training dataset $S$ is one that only captures from the features $X$ information relevant for describing the target $Y$. It does not necessarily capture all that information; this special case merits a definition of its own, given below.

###### Definition 4 (Lossless Maximal Compression - (LMC)).

A model $f$ is a lossless maximal compressor (LMC) of a training dataset $S$ if and only if it is lossless on $S$ and a maximal compressor of $S$.

###### Proposition 1.

A model $f$ is an LMC of a training dataset $S$ if and only if it satisfies

$$I_S(F;X) = I_S(F;Y) = I_S(Y;X).$$

###### Proof.

Follows straightforwardly from Definition 2 & Definition 3.∎

A model that is an LMC of a training dataset $S$ is one that captures from the features $X$ all the information relevant for describing the target $Y$, and nothing more. From an information-theoretic perspective, an LMC of $S$ is the optimal classification model that can be constructed from $S$.
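The equality chain of Proposition 1 can be checked numerically on a small sample. The sketch below reads the chain as two separate tests, $I_S(F;Y) = I_S(Y;X)$ for losslessness and $I_S(F;X) = I_S(F;Y)$ for maximal compression, and applies them to three hypothetical models on a four-point noiseless toy dataset. The helper names, the tolerance and the data are assumptions made for illustration.

```python
from collections import Counter
from math import log2


def entropy(samples):
    n = len(samples)
    return -sum((c / n) * log2(c / n) for c in Counter(samples).values())


def mutual_information(a, b):
    return entropy(a) + entropy(b) - entropy(list(zip(a, b)))


def is_lossless(f_vals, x_vals, y_vals, tol=1e-9):
    """I(F;Y) equals I(Y;X) on the sample."""
    return abs(mutual_information(f_vals, y_vals)
               - mutual_information(y_vals, x_vals)) < tol


def is_maximal_compressor(f_vals, x_vals, y_vals, tol=1e-9):
    """I(F;X) equals I(F;Y) on the sample."""
    return abs(mutual_information(f_vals, x_vals)
               - mutual_information(f_vals, y_vals)) < tol


def is_lmc(f_vals, x_vals, y_vals):
    return (is_lossless(f_vals, x_vals, y_vals)
            and is_maximal_compressor(f_vals, x_vals, y_vals))


# Hypothetical noiseless toy data: y = +1 iff x >= 2.
x_vals = [0, 1, 2, 3]
y_vals = [-1, -1, 1, 1]

f_margin_max = [float(y) for y in y_vals]        # scores -1 / +1, all margins 1
f_linear = [(x - 1.5) / 1.5 for x in x_vals]     # separates, but 4 distinct scores
f_constant = [0.5] * len(x_vals)                 # ignores the features entirely
```

On this sample `f_margin_max` passes both tests, `f_linear` is lossless but undercompressed, and `f_constant` is a maximal compressor that is lossy, which illustrates that the two properties are independent.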

We have defined the notion of a noiseless / noisy training dataset $S$, and those of a lossy / lossless model on $S$ and of an undercompressed / maximally compressed model on $S$. In Figure 1 we provide a visual summary of these properties and their possible combinations in the form of entropy Venn diagrams. In Table 1 we summarize the relationships among the various information-theoretic quantities involved that hold under each scenario.

In the next subsections, we shall use the properties we defined here to obtain a better understanding of what types of models the information-theoretically optimal model, the LMC, corresponds to for a noiseless dataset and for a noisy dataset.

### 3.2 Noiseless data: Equivalence of lossless maximal compression & margin maximisation

Let us first focus on the special case of a noiseless dataset $S$, i.e. a dataset that does not contain any datapoints with the same feature vector but different labels. We will then discuss the noisy case where ambiguously labelled datapoints can be present in the dataset.

The noiseless case merits a special discussion for several reasons: (i) It is the typical case studied in the literature and as such it allows us to connect our observations to existing work. (ii) It allows us to establish an equivalence between information-theoretic lossless maximal compression and margin maximization. (iii) It is a very common case in practice, since in high-dimensional datasets datapoints that have the same feature vector but different class labels are typically expected to be rare (because datapoints that share the same feature vector are themselves expected to be rare in high-dimensional feature spaces).

We will now show the equivalence between lossless maximal compression and margin maximisation on a noiseless dataset $S$.

###### Theorem 3.1.

Suppose $P$ is noiseless and finitely supported. A model $f$ is an LMC with respect to $P$ if and only if there exists some invertible transformation $\phi$ such that $\phi \circ f$ is a margin maximizer with respect to $P$.

Under Theorem 3.1 a classification model $f$ that maximizes the training margins on a given noiseless dataset $S$ is one that captures all the information present in the features of that dataset relevant for predicting the target label and no more. (A margin maximizer on a noiseless dataset is a model that correctly classifies all training examples, with maximal confidence. Obviously, if $y_i f(x_i) = 1$ for all $i$, then both the average and the minimal margin of $f$ on $S$ are equal to $1$ (maximal) and the variance of the margin distribution of $f$ on $S$ is $0$ (minimal).) Conversely, since an LMC is a margin maximizer up to an invertible transformation of its output, it offers the same guarantees on the generalization error as the latter. Note that Theorem 3.1 captures a relationship between a noiseless dataset $S$ and a model $f$, regardless of the underlying learning algorithm that produced it (i.e. the model family it explores or the optimization method used to explore it).

From Lemma 1 & Lemma 2, we have that a lossless model on a noiseless training dataset $S$ is one whose output can be used to classify every training example to the correct class (i.e. $S$ is separable by $f$). The success of algorithms that generate models that can interpolate the data, yet, despite exploring overparameterized model spaces, are resistant to overfitting (e.g. gradient boosting, random forests, SVMs and DNNs) has recently attracted considerable research interest wyner2015explaining ; belkin2018reconciling ; hastie2019surprises . (An interpolating classifier is one that can perfectly separate the data, i.e. achieve zero training error. In our terminology it is a lossless model on a noiseless dataset, and as such it falls within the case examined here.)

Our work connects these findings to information theory and margin theory: we posit that models generated by such methods are typically not simply lossless, but actually LMCs, hence margin maximizers, and their good generalization follows via the margin-based generalization bounds. Such algorithms have mechanisms for promoting both properties. Losslessness (interpolation, in a noiseless dataset) is afforded by overparameterizing the model space and guarantees that the model produced will not underfit. Maximal compression is afforded via explicit or implicit margin maximization and selects, from that space, a model that is maximally resistant to overfitting.

### 3.3 Noisy data: The equivalence collapses

Let us now discuss the case of a noisy dataset $S$ and how it differs from the case of noiseless data.

In the noiseless case, any model that correctly classifies every training datapoint in $S$ (i.e. achieves zero training error) is a lossless model on $S$. In the noisy case this observation is no longer relevant, as there exist at least 2 datapoints which are noisy, i.e. have the same feature vector $x$ but different labels $y$. It is no longer the case that there exists a model that can perfectly separate the data.

We can also rephrase the observation stated above as follows: any model that yields the minimal achievable training error on a noiseless training dataset $S$ is a lossless model on $S$. As we will see from Lemma 3, this condition is no longer sufficient for $f$ to be lossless on a noisy training dataset $S$.

A model that minimizes the training error on a noisy dataset $S$ will be one that classifies all points in $S$ with the same feature vector to the majority class among them. Furthermore, a margin maximizer on $S$ (we remind the reader that with this term we refer to maximizers of the average, equivalently total, margin over the training examples) is a model that minimizes the training error while also assigning maximal absolute score to its predictions (i.e. $|f(x)| = 1$). It is easy to see that, unlike in the case of a noiseless training dataset where a margin maximizer was an LMC, in the noisy case a margin maximizer cannot even be a lossless model. This is a direct consequence of Lemma 3, the proof of which can be found in Section A of the Supplementary Material.

###### Lemma 3.

Suppose that $X$ and $Y$ are discrete random variables taking values in $\mathcal{X}$ and $\mathcal{Y}$, respectively. Suppose that $f$ is a function defined on $\mathcal{X}$ and let $F = f(X)$. Then $I(F;Y) = I(X;Y)$ if and only if, for every $y \in \mathcal{Y}$, the map $x \mapsto P(Y = y \mid X = x)$ is constant on all sets of the form $\{x : f(x) = z\}$ for some $z$.

In simpler terms, Lemma 3 tells us that if for two feature vectors $x$ and $x'$ a model $f$ satisfies $f(x) = f(x')$, then it also has to be the case that $P(Y \mid X = x) = P(Y \mid X = x')$ for $f$ to be lossless (and conversely). Therefore, a margin maximizer, i.e. a model assigning maximal (i.e. the same) score both to noiseless positive examples (unambiguously labelled positive examples, for which $P(Y = 1 \mid X = x) = 1$) and to noisy positive examples (ambiguously labelled examples, i.e. ones with $1/2 < P(Y = 1 \mid X = x) < 1$), violates the condition of losslessness.

We therefore see that a margin maximizing model of a noisy training dataset $S$ cannot be an LMC of $S$ as it is not even lossless on $S$. Furthermore, as margin maximizers are themselves training error minimizers, this implies that not all training error minimizers of $S$ are LMCs (or even lossless) on $S$ either. These observations are summarized in Table 2.

A lossless model (one satisfying $I_S(F;Y) = I_S(X;Y)$) is one that captures all the information present in the features $X$ relevant for predicting the target $Y$. In the case of a noisy training dataset, this information includes the uncertainty introduced by the ambiguous labelling of a feature vector $x$, i.e. the empirical conditional distribution $\hat{P}(y \mid x)$. So a lossless model should assign different scores to feature vectors which have different values of $\hat{P}(y \mid x)$.

Moreover, $f$ is an LMC (has the minimum $I_S(F;X)$ that allows $I_S(F;Y) = I_S(X;Y)$) iff it is lossless while using the fewest values possible to encode the empirical $\hat{P}(y \mid x)$, i.e. it has as many distinct values for $f(x)$ as there are distinct values of $\hat{P}(y \mid x)$.

The above discussion provides an intuition into the limitations of margin maximization approaches in the presence of label noise. The sub-optimality of boosting (a margin maximization approach) in the presence of label noise has been observed in earlier studies kalai2003boosting ; servedio2003smooth ; bootkrajang2013boosting and here we provide an information-theoretic justification for this phenomenon. Simply put, the strategy of maximizing the margins on a noisy dataset does not produce a lossless maximally compressing model on that dataset. In fact, the resulting margin maximizing model is not even going to be lossless on the training dataset, as it will fail to capture the uncertainty over the labels. When the training data are noisy, we should instead aim to produce models whose scores capture the underlying empirical distribution $\hat{P}(y \mid x)$ (lossless). Ideally, we should aim for strategies producing models whose scores are in one-to-one correspondence with the values of $\hat{P}(y \mid x)$ (LMCs).
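The loss of information under label noise can be seen on a very small example. In the sketch below, a hypothetical noisy dataset has one unambiguous positive feature value, one ambiguous (majority positive) value and one unambiguous negative value. It compares a model assigning scores of $\pm 1$ by majority class with one whose score tracks the empirical label frequency. The dataset, the particular rescaling $2\hat{P}(y{=}1 \mid x) - 1$ and all names are illustrative assumptions.

```python
from collections import Counter
from math import log2


def entropy(samples):
    n = len(samples)
    return -sum((c / n) * log2(c / n) for c in Counter(samples).values())


def mutual_information(a, b):
    return entropy(a) + entropy(b) - entropy(list(zip(a, b)))


# Hypothetical noisy dataset: x = 1 appears with both labels.
data = ([(0, 1)] * 4                  # noiseless positives
        + [(1, 1)] * 3 + [(1, -1)]    # noisy positives (3 of 4 labelled +1)
        + [(2, -1)] * 4)              # noiseless negatives
x_vals = [x for x, _ in data]
y_vals = [y for _, y in data]

# Margin maximizer: majority class with maximal absolute score.
margin_max_scores = {0: 1.0, 1: 1.0, 2: -1.0}
# Frequency-tracking model: score 2 * P(y=+1 | x) - 1 (an assumed rescaling).
calibrated_scores = {0: 1.0, 1: 0.5, 2: -1.0}

f_mm = [margin_max_scores[x] for x in x_vals]
f_cal = [calibrated_scores[x] for x in x_vals]

i_xy = mutual_information(x_vals, y_vals)
i_mm = mutual_information(f_mm, y_vals)
i_cal = mutual_information(f_cal, y_vals)


def train_error(scores):
    return sum(1 for s, y in zip(scores, y_vals) if s * y <= 0) / len(y_vals)
```

Both models make the same single training error, yet only the frequency-tracking scores preserve all of $I_S(X;Y)$. Merging the first two feature values into one score discards the difference in their label uncertainty.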

## 4 Empirical Evidence

### 4.1 Experimental Setup

Boosting, a method that explicitly maximizes the margins of the training examples, can be shown empirically to also converge to LMC models on noiseless datasets. (Gradient boosting is a family of ensemble learning methods that construct an additive model by adding on each round the component minimizing some monotonically decreasing loss function of the margin. It can be viewed as minimizing said loss by performing gradient descent on the space of components, i.e. base learners.) After lossless maximal compression is achieved, so is the minimal generalization error, as estimated by the error on the test set. To demonstrate this, we plot the trajectory of the boosting ensemble on the entropy-normalized information plane, $I_S(F_t;Y)$ vs. $I_S(F_t;X)$. For each boosting round $t$, $F_t$ denotes the RV of which the ensemble's outputs are realizations.

The experiments were carried out on binary classification tasks on both real-world UCI datasets and artificial data (dataset descriptions in the Supplementary Material). Qualitatively, the results are similar for all datasets (see Figure 2 as well as Section C of the Supplementary Material). The boosting ensemble consisted of decision trees of bounded maximal depth, one added per round of boosting up to a fixed maximum number of rounds. No shrinkage of the updates or subsampling of the examples was performed (both are techniques to counter overfitting), and the exponential loss function was used (i.e. the loss minimized by AdaBoost). We performed no hyperparameter optimization. Plotting trajectories on the information plane follows tishby2015deep ; shwartz2017opening . All information-theoretic quantities were estimated on the training data by first discretizing the features & model outputs in equal-sized bins, then using maximum likelihood estimators. (Note that by discretizing the features we might convert an originally noiseless dataset into a noisy one. In the experiments included in this paper this did not happen for any dataset for the numbers of discretization bins chosen, so all results shown are on noiseless datasets.) The joint RV $X$ was then constructed from the discretized features. We plot average results across runs with different train-test splits on the same original data. We also visualize the trajectories obtained by some random individual runs to showcase that although they can vary significantly from one another, they all follow the same general pattern. All datasets & code used in the experiments can be downloaded at https://github.com/nnikolaou/margin_maximization_LMC.
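The estimation procedure described above can be sketched as follows: discretize each feature and the model scores into bins, form the joint feature variable as a tuple of bin indices, and compute plug-in mutual information estimates. This is not the authors' released code. Equal-width binning is one assumed reading of "equal-sized bins", the function names are hypothetical, and the entropy normalization of the plotted axes is omitted.

```python
from collections import Counter
from math import log2


def entropy(samples):
    n = len(samples)
    return -sum((c / n) * log2(c / n) for c in Counter(samples).values())


def mutual_information(a, b):
    return entropy(a) + entropy(b) - entropy(list(zip(a, b)))


def discretize(values, n_bins):
    """Map real values to equal-width bin indices 0 .. n_bins - 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0] * len(values)
    width = (hi - lo) / n_bins
    # The maximum value falls on the upper edge, so clamp it into the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


def information_plane_point(feature_columns, scores, labels, n_bins):
    """Return (I(F;X), I(F;Y)) estimated on the discretized training data."""
    binned = [discretize(col, n_bins) for col in feature_columns]
    joint_x = list(zip(*binned))          # one tuple of bin indices per example
    f_binned = discretize(scores, n_bins)
    return (mutual_information(f_binned, joint_x),
            mutual_information(f_binned, labels))


# Hypothetical toy input: one feature column, four examples.
feature_columns = [[0.0, 0.1, 0.9, 1.0]]
labels = [-1, -1, 1, 1]
scores = [-1.0, -1.0, 1.0, 1.0]

point = information_plane_point(feature_columns, scores, labels, n_bins=2)
example_bins = discretize([-1.0, -0.8, -0.2, 0.1, 0.7, 1.0], 4)
```

Calling `information_plane_point` once per boosting round, with that round's ensemble scores, would trace out a trajectory of the kind discussed in the results. Checking that the discretized features remain noiseless is a separate step not shown here.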

### 4.2 Results & Analysis

Let us first introduce some characteristic points on the information plane:
Lossless maximal compression (LMC) point: A red star on the information plane denotes the point of lossless maximal compression (the optimal feasible point a model can occupy on the plane on a given dataset), at which $I_S(F;Y)$ is the maximal achievable while $I_S(F;X)$ is minimal. At this point, and for a noiseless dataset, $I_S(F;X) = I_S(F;Y) = H_S(Y)$.
Average margin maximization point: With a hollow green circle on the information plane, we denote the model (round of boosting) under which the average (equiv. total) margin is first maximized.
Training error minimization point: With a full black dot on the information plane, we denote the model (round of boosting) under which the training error is first minimized (losslessness is achieved). At this point $I_S(F;Y)$ has reached its maximum, so for a noiseless dataset, $I_S(F;Y) = H_S(Y)$.
Test error minimization point: With a magenta square on the information plane, we denote the model (round of boosting) under which the test error (proxy for generalization error) is first minimized.

Let us now summarize our observations from Figure 2 & the figures of Section C of the Supplementary Material:
Boosting leads to lossless maximal compression: In all datasets, the boosting ensemble traces a trajectory on the information plane that leads to the LMC point and once it reaches it in never escapes.
Lossless maximal compression coincides with margin maximization: In all datasets the image on the information plane of the models that minimize the margin coincides with the LMC point.
Lossless maximal compression coincides with maximal generalization: The point of the ensemble’s trajectory corresponding to the minimal test error coincides –on average– with the LMC point on the information plane (and so does the margin maximization point). In other words, LMCs correspond to the models exhibiting –on average– the best generalization behaviour.
Average trajectory shape: After the training error is minimized, the test error can be further decreased by training for more rounds. This is a known result in boosting, explained via margin theory. Here we give an information-theoretic interpretation. Training until training error minimization, amounts to achieving losslessness. Subsequent rounds result in travelling along the line of maximal on the information plane, towards the LMC point. This compresses the model (relieves it of remaining information from irrelevant for predicting ), decreasing its effective complexity121212Holds for average trajectories. Single runs include steps that both increase & decrease ..
Training in boosting consists of 2 (typically distinct) phases: The results suggest the presence of 2 distinct phases during training under gradient boosting. A similar behaviour was observed in shwartz2017opening for the trajectories of the representations learned by DNNs. Following the terminology of shwartz2017opening , these are the empirical risk minimization (ERM) phase, when $I(F;Y)$ increases (the model better fits the training data) but typically so does $I(F;X)$ (the model uses more information from $X$), and the compression phase, when $I(F;X)$ decreases (the model uses increasingly less information from $X$, reducing its effective complexity) without decreasing $I(F;Y)$. We can view the ERM phase as decreasing the bias of the model while not decreasing its variance and the compression phase as reducing variance while not increasing bias. The ERM phase is usually much shorter than the compression phase, as is the case with DNNs shwartz2017opening . Although we typically do observe these 2 phases as distinct in the average trajectories, they need not be, as was also observed in subsequent studies in DNNs xu2018training . Trajectories of individual runs are not as smooth as the average trend; we can even observe steps that increase both bias & variance. However, the 2 phases still appear to be distinct: once losslessness is achieved (ERM phase terminates), it is maintained and pure compression begins.
Early stopping does not improve generalization in gradient boosting: As long as losslessness can be achieved, additional boosting rounds do not hurt generalization. Once the model reaches the LMC point on the information plane, it never escapes it. Subsequent iterations increase neither the training nor the test error. This suggests that early stopping with boosting is unnecessary for improving generalization and agrees with recent observations wyner2015explaining . General margin losses minimized via stochastic gradient descent (SGD) also exhibit similar behaviour soudry2017implicit .
Consistency across datasets, hyperparameter & discretization settings: The aforementioned observations hold across different datasets and hyperparameter settings. Section C of the Supplementary Material contains more results supporting this claim. They also hold if we change the number of bins used to discretize the features (provided the dataset remains noiseless) or the scores (provided they are).
Margin maximization as a built-in regularization mechanism: Additional regularization techniques like subsampling or shrinkage are not the main reason why boosting regularizes. Their contribution is small compared to the algorithms’ built-in regularization mechanism: margin maximization, which as we saw amounts to lossless maximal compression of the training dataset. This is another similarity shared with DNNs trained with SGD that achieve good generalization by tracing a similar trajectory on the information plane shwartz2017opening , and additional regularization control (e.g. dropout or batch normalization) is beneficial, but not the main contributor to their good generalization zhang2016understanding ; shwartz2017opening ; kawaguchi2017generalization .

## 5 Discussion

We characterized, from an information-theoretic perspective, models trained on a given training set w.r.t. the information they capture from it. In this light, we identified an ideal model trained on a given dataset as its lossless maximal compressor (LMC): one capturing all the information from the features relevant for predicting the target and no more. We then established that an LMC is –in the case of classification– equivalent to a margin maximizer of the dataset (provided it is noiseless, i.e. consistently labelled). The existence of margin-based bounds on the generalization error implies that margin maximization, hence lossless maximal compression, is beneficial to generalization.

Our experiments on gradient boosting demonstrate that, indeed, margin maximization amounts to lossless maximal compression on noiseless data. The evolution of the model constructed by boosting traces a trajectory on the information plane that leads to the point of lossless maximal compression, which also coincides with the point of margin maximization and the point that on average exhibits the best generalization. In agreement with recent studies on boosting, we observe that early stopping is unnecessary for improving generalization wyner2015explaining and identify interesting similarities between how training progresses in DNNs and in gradient boosting in terms of the trajectory they trace on the information plane shwartz2017opening . All observations persist across a wide range of datasets and hyperparameter configurations.

This work gives an information-theoretic interpretation of margin maximization and provides us with a principled way to define model complexity for the purposes of generalization, thus shedding more light on the success of methods like gradient boosting. It also opens various directions for future work. One is to explore how these concepts can be applied in model selection, or to inform the design of learning algorithms that traverse the information plane more efficiently to reach the LMC point. It would also be of interest to identify the analogue of the LMC in learning tasks other than classification, like ranking or regression.

##### Acknowledgements

This project was partially supported by the EPSRC LAMBDA [EP/N035127/1] & Anyscale Apps [EP/L000725/1] project grants and the EU Horizon 2020 research & innovation programme [grant No 758892, ExoAI]. NN acknowledges the support of the EPSRC Doctoral Prize Fellowship at the University of Manchester and the NVIDIA Corporation’s GPU grant. The authors thank Konstantinos Sechidis, Konstantinos Papangelou & Ingo Waldmann for their useful comments and suggestions.

## 6 Supplementary Material

### A. Proofs

In this section we shall prove Lemma 1, Lemma 2 and Theorem 3.1. Rather than proving these results directly we shall instead prove generalisations to an arbitrary finitely supported probability measure (Lemma 6, Lemma 5 and Theorem 6.1). The sample based (i.e. dataset based) results used in the main paper correspond to the special case in which the probability measure is the empirical measure .

###### Definition 5.

A finitely supported probability distribution is noiseless if and only if .

Definition 5 generalises Definition 1 which corresponds to the special case in which is the empirical measure . The proofs of the following results require the following elementary lemma.

###### Lemma 4.

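Let $\phi:[0,1]\to\mathbb{R}$ be the function defined by $\phi(0)=0$ and $\phi(z)=-z\log z$ for $z\in(0,1]$. Then we have $\phi(z)\geq 0$ for all $z\in[0,1]$, with equality if and only if $z\in\{0,1\}$.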

We shall now prove Lemma 5 which generalises Lemma 1.

###### Lemma 5.

A finitely supported probability distribution is noiseless if and only if .

###### Proof.

We can write out the conditional entropy in terms of as follows,

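$$H_P(Y|X) = \sum_{x\in\mathcal{X}_P} P(X=x)\Big(-\sum_{y\in\mathcal{Y}} P(Y=y|X=x)\log\big(P(Y=y|X=x)\big)\Big) = \sum_{x\in\mathcal{X}_P} P(X=x)\Big(\sum_{y\in\mathcal{Y}} \phi\big(P(Y=y|X=x)\big)\Big).$$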

Since is non-negative it follows that if and only if for each we have which is the case if and only if .

Now suppose that so for each , , . Then we can define so that , so . Thus for each if

 RP(f∗P)

On the other hand, if then we must have if and otherwise. Thus, for each , and so . ∎
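For the empirical measure of a discretized dataset, the entropy characterisation of noiselessness used in this proof can be checked directly. The following is a minimal sketch, assuming features are given as hashable tuples and that the function names (`conditional_entropy`, `is_noiseless`) and the tolerance are illustrative choices rather than part of the original code.

```python
from collections import Counter, defaultdict
from math import log2


def conditional_entropy(y, x):
    """Empirical H(Y|X) in bits: sum_x P(x) * sum_y phi(P(y|x)), phi(p) = -p log p."""
    n = len(x)
    labels_by_x = defaultdict(list)
    for xi, yi in zip(x, y):
        labels_by_x[xi].append(yi)
    h = 0.0
    for labels in labels_by_x.values():
        p_x = len(labels) / n
        for count in Counter(labels).values():
            p = count / len(labels)   # P(Y=y | X=x); always > 0 here
            h += p_x * (-p * log2(p))
    return h


def is_noiseless(y, x, tol=1e-12):
    """Assumed check: the dataset is noiseless when H(Y|X) is (numerically) zero."""
    return conditional_entropy(y, x) <= tol


# Consistently labelled toy data: every distinct x carries a single label.
x_clean = [(0, 0), (0, 1), (1, 0), (1, 1)]
y_clean = [-1, -1, 1, 1]

# Inconsistently labelled toy data: x = (0,) appears with both labels.
x_noisy = [(0,), (0,), (1,)]
y_noisy = [1, -1, 1]

print(is_noiseless(y_clean, x_clean), is_noiseless(y_noisy, x_noisy))
```

Each term of the sum is non-negative, so the total vanishes only when every conditional probability is 0 or 1, which is the step the proof relies on.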

Definition 6 generalises Definition 2.

###### Definition 6.

A model is lossless with respect to if and only if .

Lemma 6 generalises Lemma 2.

###### Lemma 6.

Suppose is noiseless. A model is lossless with respect to if and only if there exists an invertible transformation such that .

###### Proof.

The model is lossless if and only if , which is the case if and only if , where we have used Eq. (5) and the assumption that is noiseless. Moreover, as in the proof of Lemma 1 we have

$$H_P(Y|F) = \sum_{s\in f(\mathcal{X}_P)} P(f(X)=s)\Big(\sum_{y\in\mathcal{Y}} \phi\big(P(Y=y|f(X)=s)\big)\Big).$$

Using the fact that on with equality only at we infer that if and only if for each , we have .

Now if for some invertible transformation then

$$\sum_{s\in f(\mathcal{X}_P)} P(f(X)=s)\cdot P\big(Y\neq \mathrm{sign}(g(s))\,\big|\,f(X)=s\big) = P\big(\mathrm{sign}(g\circ f(X))\neq Y\big) = R_P(g\circ f) = R_P(f^*_P) = 0.$$

This implies that for each for , and for , , so in general , and is lossless.

Conversely, if is lossless then for each we can take

 g(s)=(2⋅P(Y=1|f(X)=s)−1)(s+23),

and extend on arbitrarily to form a bijection. Since is lossless, for each and we have . If for some we have then and so . Similarly, for with we have and so again . Hence, in general .

Definitions 7 and 8 generalise Definitions 3 and 4, respectively.

###### Definition 7.

A model is a maximal compressor of a distribution if and only if .

###### Definition 8.

A model is a lossless maximal compressor (LMC) of a training dataset if and only if it is lossless on and a maximal compressor on .

Proposition 2 generalises Proposition 1.

###### Proposition 2.

A model is an LMC of a finitely supported probability distribution , if and only if it satisfies

$$I_P(F;X) = I_P(F;Y) = I_P(Y;X).$$

###### Proof.

Follows straightforwardly from Definition 6 & Definition 7.∎
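The condition in Proposition 2 can be evaluated on the empirical measure of a small discretized dataset. The sketch below is an illustration under stated assumptions: mutual information is estimated by plug-in counts, the helper names (`entropy`, `mutual_information`, `is_lmc`) and the tolerance are hypothetical, and the toy models are given directly as lists of scores.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Empirical Shannon entropy (bits) of a list of hashable values."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())


def mutual_information(a, b):
    """Plug-in estimate I(A;B) = H(A) + H(B) - H(A,B), in bits."""
    a, b = list(a), list(b)
    return entropy(a) + entropy(b) - entropy(list(zip(a, b)))


def is_lmc(x, y, scores, tol=1e-9):
    """Check I(F;X) = I(F;Y) = I(Y;X) for model outputs F = scores."""
    i_fx = mutual_information(scores, x)
    i_fy = mutual_information(scores, y)
    i_yx = mutual_information(y, x)
    return abs(i_fx - i_fy) <= tol and abs(i_fy - i_yx) <= tol


# Noiseless toy dataset: the label depends only on the first feature.
x = [(0, 0), (0, 1), (1, 0), (1, 1)]
y = [-1, -1, 1, 1]

memorizer = [0, 1, 2, 3]    # one score per example: lossless, but not compressed
compressor = [0, 0, 1, 1]   # one score per class: lossless and maximally compressed
constant = [0, 0, 0, 0]     # discards everything: compressed, but lossy

for name, f in [("memorizer", memorizer), ("compressor", compressor), ("constant", constant)]:
    print(name, mutual_information(f, x), mutual_information(f, y), is_lmc(x, y, f))
```

Only the model that assigns one score per class satisfies both equalities; the memorizer keeps information about the features that is irrelevant to the label, and the constant model loses label information.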

Finally we shall prove Theorem 6.1 which generalises Theorem 3.1.

###### Theorem 6.1.

Suppose is noiseless and finitely supported. A model is an LMC with respect to if and only if there exists some invertible transformation such that is a margin maximizer with respect to .

###### Proof.

As we saw in the proof of Lemma 1, the fact that is noiseless implies that for each and we have . We form a partition so that for , and for , .

Now suppose that for some invertible transformation , is a margin maximizer with respect to . Hence, if for and for . This implies that . Thus, by Lemma 2, is lossless. Moreover, since is invertible, this is equivalent to for and for . Hence, when and otherwise, so in general . Thus,

$$H_P(F|Y) = \sum_{y\in\mathcal{Y}} P(Y=y) \sum_{s\in f(\mathcal{X}_P)} \phi\big(P(f(X)=s|Y=y)\big) = 0 = H_P(F|X),$$

where the final equality follows from the fact that . Hence, we have . It follows that is a maximal compressor and we have already shown that is lossless.

Conversely, let us suppose that is a lossless maximal compressor. Since is lossless we infer from Lemma 2 that there is some transformation such that , which in turn implies that if and then . Moreover, since is a maximal compressor we must have which implies

$$\sum_{y\in\mathcal{Y}} P(Y=y) \sum_{s\in f(\mathcal{X}_P)} \phi\big(P(f(X)=s|Y=y)\big) = H_P(F|Y) = H_P(F|X) = 0.$$

Thus, for each and we have , where once again we use the fact that is non-negative with zero attained at . Hence, there exists some such that for all , and some with such that for all , . Thus, if we choose to be any invertible map with and we see that is a margin maximiser. This completes the proof of the theorem. ∎
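The converse direction of this proof is constructive: an LMC of a noiseless binary dataset emits a single score on all positive examples and a different single score on all negative examples, and the transformation then sends those two scores to the two extreme outputs. The sketch below illustrates that construction under assumptions that are not in the original: labels are encoded as +1 and -1, the extreme outputs are taken to be +1 and -1, and the function name `margin_maximizing_map` is hypothetical. It only defines the map on the two observed scores, whereas the proof requires an invertible map on the whole score domain.

```python
def margin_maximizing_map(scores, labels):
    """Return {s_plus: +1, s_minus: -1} if the scores behave like an LMC, else None.

    Assumes labels are encoded as +1 / -1 and that both classes are present.
    """
    positive_scores = {s for s, y in zip(scores, labels) if y == 1}
    negative_scores = {s for s, y in zip(scores, labels) if y == -1}
    # An LMC uses exactly one score per class, and the two scores must differ.
    if len(positive_scores) != 1 or len(negative_scores) != 1:
        return None
    if positive_scores == negative_scores:
        return None
    return {next(iter(positive_scores)): 1, next(iter(negative_scores)): -1}


labels = [-1, -1, 1, 1]
lmc_scores = [0.2, 0.2, 0.7, 0.7]        # one score per class
non_lmc_scores = [0.1, 0.2, 0.7, 0.7]    # negatives are split across two scores

g = margin_maximizing_map(lmc_scores, labels)
margins = [y * g[s] for s, y in zip(lmc_scores, labels)]
print(g, margins, margin_maximizing_map(non_lmc_scores, labels))
```

After the transformation every training example attains the same, largest possible margin, which is the sense in which the composed model is a margin maximiser.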

Finally, we prove Lemma 7 which generalises Lemma 3.

###### Lemma 7.

Suppose that and are discrete random variables taking values in and , respectively. Suppose that and let . Then if and only if the map is constant on all sets of the form for some .

###### Proof.

The proof uses the entropy functional by . Note that is strictly concave. Now observe that

$$H(Y|X) = \sum_{x\in\mathcal{X}} P(X=x)\cdot\phi\big(P(Y=1|X=x)\big) = \sum_{z\in\mathcal{Z}} P(F=z)\cdot\sum_{x\in\mathcal{X}} P(X=x|F=z)\cdot\phi\big(P(Y=1|X=x)\big),$$

where we have used the fact that if and otherwise. We also have

$$H(Y|F) = \sum_{z\in\mathcal{Z}} P(F=z)\cdot\phi\big(P(Y=1|F=z)\big).$$

By the strict concavity of for each we have

$$\sum_{x\in\mathcal{X}} P(X=x|F=z)\cdot\phi\big(P(Y=1|X=x)\big) \leq \phi\Big(\sum_{x\in\mathcal{X}} P(X=x|F=z)\cdot P(Y=1|X=x)\Big) = \phi\big(P(Y=1|F=z)\big),$$

with equality if and only if is constant for all with , so constant for all . Hence, we have with equality if and only if for each , is constant on . To conclude note that and , so with equality if and only which holds if and only if for each , is constant on . ∎
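The Jensen step in this proof can be illustrated numerically. The sketch below assumes that the entropy functional is the binary entropy in bits and uses a hypothetical four-point distribution; the function names and the two maps `merge_equal` and `merge_different` are illustrative. Merging feature values with the same conditional probability of the positive class leaves the conditional entropy unchanged, while merging values with different conditional probabilities strictly increases it.

```python
from math import log2


def binary_entropy(p):
    """Binary entropy in bits, with the convention 0 * log 0 = 0."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * log2(p) - (1 - p) * log2(1 - p)


def h_y_given_x(p_x, p_y1_given_x):
    """H(Y|X) = sum_x P(X=x) * phi(P(Y=1|X=x))."""
    return sum(p_x[x] * binary_entropy(p_y1_given_x[x]) for x in p_x)


def h_y_given_f(p_x, p_y1_given_x, f):
    """H(Y|F) for F = f(X), obtained by pooling probability mass over level sets of f."""
    mass, positive_mass = {}, {}
    for x, px in p_x.items():
        z = f(x)
        mass[z] = mass.get(z, 0.0) + px
        positive_mass[z] = positive_mass.get(z, 0.0) + px * p_y1_given_x[x]
    return sum(mass[z] * binary_entropy(positive_mass[z] / mass[z]) for z in mass)


# Hypothetical distribution: x = 0 and x = 1 share the same P(Y=1|X=x).
p_x = {0: 0.25, 1: 0.25, 2: 0.25, 3: 0.25}
p_y1_given_x = {0: 0.1, 1: 0.1, 2: 0.9, 3: 0.6}


def merge_equal(x):
    """Merges x = 0 and x = 1, which have equal conditional probabilities."""
    return 0 if x in (0, 1) else x


def merge_different(x):
    """Merges x = 2 and x = 3, which have different conditional probabilities."""
    return 2 if x in (2, 3) else x


base = h_y_given_x(p_x, p_y1_given_x)
print(base, h_y_given_f(p_x, p_y1_given_x, merge_equal), h_y_given_f(p_x, p_y1_given_x, merge_different))
```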

### B. Details of datasets used

#### B1. Artificial Data

The artificial dataset was generated by scikit-learn’s make_classification() function. We generated examples, each consisting of features, only of which were relevant for predicting the class. The examples belonged to different clusters for each of the classes, each cluster’s points normally distributed (with unit standard deviation) about vertices of a -sided hypercube. Some label noise was added by randomly flipping the label of each point with probability . For more information see the function’s documentation at http://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_classification.html.

#### B2. UCI Datasets

Table 3 shows the characteristics of the real-world datasets used in our experiments. The original datasets are all from the UCI repository. Examples with missing values were discarded. The multiclass datasets were converted to balanced binary ones by setting the minority class as the ‘positive’ one and uniformly sampling examples from the remaining classes to form the ‘negative’ class. A link to the final datasets will be provided along with the code used to generate the results.
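The multiclass-to-binary conversion described above could be sketched as follows. This is not the authors' preprocessing code: it assumes that the negative class is sampled without replacement to match the size of the positive class, that ties for the minority class are broken by first occurrence, and that labels are re-encoded as +1 and -1. The function name `to_balanced_binary` and the `seed` parameter are hypothetical.

```python
import random
from collections import Counter


def to_balanced_binary(X, y, seed=0):
    """Minority class -> +1; an equally sized uniform sample of the rest -> -1."""
    counts = Counter(y)
    minority = min(counts, key=counts.get)   # ties: first class encountered
    positive_idx = [i for i, label in enumerate(y) if label == minority]
    remaining_idx = [i for i, label in enumerate(y) if label != minority]
    rng = random.Random(seed)                # local generator for reproducibility
    k = min(len(positive_idx), len(remaining_idx))
    negative_idx = rng.sample(remaining_idx, k)   # uniform, without replacement
    chosen = positive_idx + negative_idx
    X_binary = [X[i] for i in chosen]
    y_binary = [1] * len(positive_idx) + [-1] * len(negative_idx)
    return X_binary, y_binary


# Toy multiclass data: class 'a' is the minority class.
X_toy = list(range(11))
y_toy = ["a"] * 2 + ["b"] * 5 + ["c"] * 4
X_bin, y_bin = to_balanced_binary(X_toy, y_toy)
print(X_bin, y_bin)
```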