Large deviations for the perceptron model and consequences for active learning

12/09/2019 ∙ by Hugo Cui, et al. ∙ École Normale Supérieure ∙ CEA

Active learning is a branch of machine learning that deals with problems where unlabeled data is abundant yet obtaining labels is expensive. The learning algorithm has the possibility of querying a limited number of samples to obtain the corresponding labels, subsequently used for supervised learning. In this work, we consider the task of choosing the subset of samples to be labeled from a fixed finite pool of samples. We assume the pool of samples to be a random matrix and the ground truth labels to be generated by a single-layer teacher random neural network. We employ replica methods to analyze the large deviations for the accuracy achieved after supervised learning on a subset of the original pool. These large deviations then provide optimal achievable performance boundaries for any active learning algorithm. We show that the optimal learning performance can be efficiently approached by simple message-passing active learning algorithms. We also provide a comparison with the performance of some other popular active learning strategies.


I Introduction

Supervised learning consists in presenting a parametric function (often a neural network) with a series of samples and labels, and adjusting (training) the parameters (network weights) so as to match the network output with the labels as closely as possible. Active learning (AL) is concerned with choosing the most informative samples, so that training reaches a given test accuracy with the smallest possible number of labeled samples. Active learning is relevant in situations where the potential set of samples is large, but obtaining the labels is expensive (computationally or otherwise). There exist many strategies for active learning, see e.g. Settles (2009) for a review. In membership-based active learning Angluin (1988), Cohn et al. (1994), Seung et al. (1992) the algorithm is allowed to query the label of any sample, most often one it generates itself. In stream-based active learning Atlas et al. (1990) an infinite sequence of samples is presented to the learner, which can decide whether or not to query the label of each. In pool-based active learning, which is the object of the present work, the learner can only query samples that belong to a pre-existing, fixed pool of samples. It therefore needs to choose, according to some strategy, which samples to query so as to obtain the best possible test accuracy.

Pool-based active learning is relevant for many machine learning applications, e.g. because not every possible input vector is of relevance. A beautiful recent application of active learning is in computational chemistry Zhang et al. (2019), where a neural network is trained to predict inter-atomic potentials. In this case the pool of data is large and consists of all possible alloys, but not of arbitrary input vectors, and labelling is extremely expensive, as it demands resource-intensive ab-initio simulations. Consequently, only a limited number of samples can be labeled, i.e. one only possesses a certain budget for the cardinality of the training set. Another setting where a cheap large pool of input data is readily available but labelling is expensive is drug discovery Warmuth et al. (2001), where, given a target molecule, one aims to find new compounds in the pool able to bind it. Another example is text classification McCallum and Nigam (1998), Tong and Koller (2000), Hoi et al. (2006), where labelling a text requires non-negligible human input, while a large pool of texts is readily available on the internet. Establishing efficient pool-based active learning procedures in this case requires selecting a priori the most informative data samples for labelling.

Mainstream works on active learning focus on designing heuristic algorithmic strategies in a variety of settings, and on analyzing their performance. It is very rarely known what information-theoretic limitations an active learning algorithm faces, and hence evaluating the distance from optimality is mostly an open question. The main contribution of the present work is to provide a toy model that is on the one hand challenging for active learning, and for which, at the same time, the optimal performance of pool-based active learning can be computed, so that heuristic algorithms can be evaluated and benchmarked against the optimal solution. To our knowledge, this is the first work in which optimal performance results for pool-based active learning procedures are derived. More specifically, we study the random perceptron model Gardner and Derrida (1988). The available pool of samples is assumed to consist of i.i.d. normal vectors, and the teacher generating the labels is also taken to be a perceptron, with the vector of teacher weights having i.i.d. normal components. We compute the large deviation function for how likely one is to find a subset of the samples that leads to a given learning accuracy. Our results are based on the replica method computation of this large deviation function, an exact method (modulo the possibility of so-called replica symmetry breaking, which we do not evaluate in the present work) originating in theoretical statistical physics Parisi (1979); Parisi (1983); Mézard et al. (1986). Providing a rigorous proof of the obtained results, or turning them into rigorous bounds, would be a natural, and rather challenging, next step. In the algorithmic part of this work we benchmark several existing algorithms and also propose two new algorithms relying on the approximate-message-passing algorithm for estimating the label uncertainty of yet-unlabeled samples, showing that in the studied cases they closely approach the relevant information-theoretic limitations.

The paper is organized as follows: the problem is defined and related work discussed in section II. In section III, we propose a measure to quantify the informativeness of given subsets of samples. In section IV, we derive the large deviation function over all possible subset choices and deduce performance boundaries that apply to any pool-based active learning strategy. In section V, we then compare these theoretical results with the performance of existing active learning algorithms and propose two new ones, based on approximate-message-passing.

II Definition of the problem and related work

A natural modeling framework for analyzing learning processes and generalization properties is the so-called teacher-student (or planted) perceptron model Zdeborová and Krzakala (2016), where the input samples are assumed to be random i.i.d. vectors, and the ground truth labels are assumed to be generated by a neural network (denoted as the teacher) belonging to the same hypothesis class as the student neural network. In this work we restrict ourselves to single-layer neural networks (without hidden units), for which this setting was defined and studied in Gardner and Derrida (1989). Specifically, we collect the input vectors into a matrix whose dimensions are the dimension of the input space and the number of samples. The teacher generating the labels, called the teacher perceptron, is characterized by a teacher vector of weights and produces the label vector by taking the sign of the product of the input matrix with the teacher weights. Learning is then done using a student perceptron and consists in finding a weight vector whose sign outputs on the training set match the labels as closely as possible. The relevant notion of error is the test accuracy (generalization error), measuring the agreement between the teacher and the student on a new sample not present in the training set. Since both teacher and student possess the same architecture, the training process can be rephrased as an inference problem (as discussed for instance in Zdeborová and Krzakala (2016)): the student aims to infer the teacher weights, used to generate the labels, from the knowledge of a set of input-output associations. This scenario allows for nice geometrical insights (see for example Engels and van den Broeck (2001)), as the generalization properties are linked to the distance in weight space between the teacher and student functions. Note that, in the case of a noiseless labelling process, the teacher-student scenario guarantees that perfect training is always possible.
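
For concreteness, the generative process just described can be sketched in a few lines of code. The snippet below is our own illustration (all names and sizes are arbitrary, not the paper's notation); the closed-form test error uses the standard result that, for Gaussian inputs, the generalization error of a perceptron equals the angle between teacher and student weights divided by pi.

import numpy as np

def make_teacher_student_pool(N=100, P=500, seed=0):
    """Pool of i.i.d. Gaussian samples labeled by a random teacher perceptron."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((P, N))        # pool of P samples in dimension N
    w_teacher = rng.standard_normal(N)     # teacher weights, i.i.d. N(0, 1)
    y = np.sign(X @ w_teacher)             # noiseless binary labels
    return X, y, w_teacher

def generalization_error(w_student, w_teacher):
    """Test error of a student perceptron: arccos of the teacher-student overlap over pi."""
    m = w_student @ w_teacher / (np.linalg.norm(w_student) * np.linalg.norm(w_teacher))
    return np.arccos(np.clip(m, -1.0, 1.0)) / np.pi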

Active learning was previously studied in the context of the teacher-student perceptron problem. Best known is the line of work on Query by Committee Seung et al. (1992); Freund et al. (1992); Zhou (2019), dealing with the membership-based active learning setting, i.e. where the samples are picked one by one into the training set and can be absolutely arbitrary vectors of the input space. Active learning is in that case more a strategy for designing the samples than one for selecting them smartly from a predefined set. In the original work Seung et al. (1992) the new samples are chosen so that a committee of several student neural networks has the maximum possible disagreement on the new sample. The paper shows that in this way one can reach a generalization error that decreases exponentially with the size of the training set, while for a random training set the generalization error can decrease only inversely proportionally to the size of the set Engels and van den Broeck (2001). However, in many practical applications the possible set of samples to be picked into the training set is not arbitrarily big, e.g. not every input vector represents an encoding of a molecular structure. We hence argue that pool-based active learning, studied in the present paper, where the samples are selected from a predefined set, is of greater relevance to many applications.

The theoretical part of this paper is presented for a generalization of the perceptron model, specifically for the random teacher-student generalized linear model (GLM), see e.g. Barbier et al. (2019). An instance of a GLM is specified by a prior measure on the weights, from which the true generative model is assumed to be sampled, and an output channel measure defining the generative process for the labels given the pre-activations. In the part of this work where results are presented we focus on the prototypical case of the noiseless continuous perceptron, where the weights have i.i.d. Gaussian components and the output channel deterministically returns the sign of the pre-activation. Moreover, we will consider the setting where the learning model is matched to the generative model, and thus the student has perfect knowledge of the correct form of the two measures defined above.

The pool-based active learning task can now be stated more formally: given a pool of samples of fixed cardinality, the goal is to select and query the labels of a subset of smaller cardinality, according to some active learning criterion. We will refer to the cardinality of this subset as the budget of the student. The true labels are obtained by passing the teacher pre-activations through the output channel. Henceforth, measures with vector arguments are understood to be products over the coordinates of the corresponding scalar measures. For technical reasons, we rely on the strong (but customary) assumption that the samples are i.i.d. Gaussian distributed, with zero mean and unit variance per component. Note that, while this assumption implies that the full set of input data is generally unstructured and uncorrelated, it does not prevent non-trivial correlations from appearing in a smaller labeled subset selected through an active learning procedure.

In pool-based active learning settings, it is assumed that the student has a fixed budget for building its training set, i.e. only a limited number of labels can be queried for training. The active learning goal is to select, among the pool of available samples, the most informative ones to present to the student, so that the latter achieves the best possible generalization performance. While many criteria of informativeness have been considered in the literature, see e.g. Settles (2009), in the teacher-student setting there exists a natural measure of informativeness, which we define in the next section.

III The Gardner volume as a measure of optimality

A natural strategy for ranking the possible subset selections is to evaluate the mutual information $I(\boldsymbol{w}^\star; \boldsymbol{y}_S \mid X_S)$ between the teacher vector $\boldsymbol{w}^\star$ and each subset of labels $\boldsymbol{y}_S$, conditioned on the corresponding inputs $X_S$. Good selections contain larger amounts of information about the ground truth, encoded in the labels, and make the associated inference problem for the student easier. Conversely, bad selections are characterized by less informative labels. In the case of the teacher-student perceptron, where the output channel is completely deterministic and binary, the mutual information can be rewritten (following Barbier et al. (2019)) as

$I(\boldsymbol{w}^\star; \boldsymbol{y}_S \mid X_S) \;=\; -\,\mathbb{E}_{\boldsymbol{y}_S}\!\left[\log V(S)\right].$   (1)

Equation (1) allows a connection with a quantity well known in statistical physics, the so-called Gardner volume Gardner and Derrida (1988), Nishimori (2001), Engels and van den Broeck (2001), denoted in the following by $V(S)$,

$V(S) \;=\; \int \mathrm{d}P_W(\boldsymbol{w}) \, \prod_{\mu \in S} \mathbb{1}\!\left[\, y_\mu = \mathrm{sign}\!\left(\boldsymbol{w} \cdot \boldsymbol{x}_\mu \right) \right].$   (2)

The Gardner volume represents the extent of the version space Mitchell (1982), i.e. the entropy of hypotheses in the model class consistent with the labeled training set. This provides a natural measure of the quality of the student training. A narrower volume implies less uncertainty about the ground truth and is thus a desirable objective in an active learning framework. We shall focus the rest of our discussion on the large deviation properties of the Gardner volume, but we invite the reader to keep in mind that this is equivalent to studying the above defined mutual information.
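
At small sizes, the Gardner volume defined in equation (2) can be estimated directly by Monte Carlo, which is useful for sanity checks of the theory. The following sketch is our own illustration (it assumes a standard Gaussian prior on the student weights and is only feasible when the version-space mass is not exponentially small):

import numpy as np

def gardner_volume_mc(X_S, y_S, n_mc=200_000, seed=0):
    """Monte Carlo estimate of the Gardner volume of a labeled subset:
    the Gaussian-prior probability mass of student weight vectors that
    reproduce every label in the subset."""
    rng = np.random.default_rng(seed)
    N = X_S.shape[1]
    W = rng.standard_normal((n_mc, N))                       # samples from the prior
    consistent = np.all(np.sign(W @ X_S.T) == y_S, axis=1)   # reproduce all labels?
    return consistent.mean()                                 # fraction of the version space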

There exist other natural measures of informativeness, e.g. the student generalization error and the magnetization (the teacher-student overlap). In the thermodynamic limit, the generalization error is a decreasing function of the magnetization (see appendix D for more details). Moreover, we will show analytical and numerical evidence that all these measures co-vary, at least in the simple teacher-student setting studied in this work. A numerical check at finite size of the correlation between these measures can also be found in appendix E.

IV Large deviations of the Gardner volume

We consider the problem of sampling labeled subsets of fixed cardinality (the budget) from a fixed pool of data, and study the variations in the associated Gardner volumes. We will hereby assume that, for any fixed pool and subset size, the Gardner volume probability distribution follows a large deviation principle, i.e. that the number of subset choices producing a given Gardner volume is exponential in the system size. Employing statistical physics terminology, we will refer to the corresponding rate function as the complexity of labeled subsets associated with a given budget and volume.

In the large-size limit, the overwhelming majority of subsets will thus realize the typical Gardner volume, at which the complexity is maximal. Since fluctuations around this typical value are exponentially rare, random sampling will almost certainly yield Gardner volumes extremely close to the typical one. However, the aim of active learning is to find strategies for accessing the atypically informative subsets (i.e., the atypically small volumes), whence the necessity of analyzing the large deviation properties of the subset selection process.

We will here give a brief outline of the analytic computation, based on standard methods from the physics of disordered systems Parisi (1979), Parisi (1983), Mézard et al. (1986), and refer the reader to appendices A, B and C for a more detailed derivation. It is convenient to introduce a vector of binary selection variables, one per sample in the pool, equal to one when the corresponding sample is selected (and added to the labeled training set) and to zero otherwise. In this notation the selected subset is simply the set of samples whose selection variable equals one.

Since a direct computation of the complexity is not straightforward, as is customary in this type of analysis Dotsenko et al. (1994) we derive it by first evaluating its Legendre transform. We introduce the (unnormalized) measure over the selection variables

(3)

and the associated free entropy

(4)

From a statistical physics perspective, this quantity can be regarded as a grand-canonical partition function, with an inverse temperature conjugate to the Gardner volume (the associated energy function) and an effective chemical potential controlling the cardinality of the selection subset. In the thermodynamic limit, by applying the saddle-point method one can easily see that the partition function is dominated by a subset of selection vectors whose budget and energy are given by

(5)

Thus, inverting the Legendre transform yields the sought complexity

(6)

At fixed budget, the range of volumes associated with positive complexities effectively spans all the achievable Gardner volumes for subsets of that given cardinality, regardless of the actual strategy for selecting them. In particular, the endpoints of this range define the minimal and maximal Gardner volumes and provide theoretical algorithmic boundaries for all realizable active learning strategies. Note that this means that our prototypical model, albeit idealized, constitutes a nice benchmark for comparing known pool-based active learning heuristics.
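
In practice, once the free entropy has been evaluated on a grid of inverse temperatures, the complexity curve can be obtained by a numerical (inverse) Legendre transform. The following sketch is our own illustration and assumes one particular sign convention for the transform; the paper's exact conventions may differ.

import numpy as np

def complexity_from_free_entropy(betas, phi):
    """Parametric inverse Legendre transform.

    Assumed convention: phi(beta) = max_v [Sigma(v) - beta * v], so that
    v(beta) = -phi'(beta) and Sigma(v(beta)) = phi(beta) + beta * v(beta).
    `betas` and `phi` are 1D arrays with phi evaluated on the beta grid."""
    v = -np.gradient(phi, betas)     # conjugate variable (volume) at each beta
    sigma = phi + betas * v          # complexity along the parametric curve
    return v, sigma

# toy check: if phi is the transform of the binary entropy, sigma is recovered
betas = np.linspace(-3.0, 3.0, 301)
phi_toy = np.log1p(np.exp(-betas))
v, sigma = complexity_from_free_entropy(betas, phi_toy)
# here sigma should match -v*log(v) - (1-v)*log(1-v), the binomial-counting entropy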

IV.1 Replica symmetric formula for the large deviations

In practice, the analytic evaluation of the free entropy involves the computation of a quenched average of a log-partition function and is not feasible via rigorous methods. In order to perform the computation, we resort to the replica method from statistical physics Parisi (1979), Parisi (1983), Mézard et al. (1986), based on the identity, valid for any partition function $\Xi$,

$\overline{\log \Xi} \;=\; \lim_{s \to 0} \frac{1}{s} \log \overline{\Xi^{\,s}},$   (7)

and on the fact that the moments $\overline{\Xi^{\,s}}$ can be computed for integer values of $s$. We refer the interested reader to appendix B, where the computation is carried out explicitly in the more general case of a generalized linear model Krzakala et al. (2012), Barbier et al. (2019), and then specialized to the case of interest of the teacher-student perceptron. The final analytic expression for the replica symmetric free entropy in this special case reads

(8)

where we introduced additional shorthand definitions. The extremum operation entailed in the free entropy computation is performed over a set of overlap order parameters, which admit the following geometric interpretation:

  • the typical overlap between students with different labeled subsets;
  • the typical overlap between students with the same labeled subset;
  • the typical norm of a student;
  • the typical magnetization of a student, i.e. its overlap with the teacher.

Once the free entropy is evaluated, the complexity can be obtained via a numerical implementation of the extremization prescribed by the inverse Legendre transform (6).

We remark at this point that the presented replica calculation was obtained within the so-called replica symmetric ansatz. In general, it is possible for the replica symmetric result not to be exact, requiring replica symmetry breaking (RSB) in order to evaluate the correct free entropy Mézard et al. (1986). In this model, while RSB is surely not needed close to the maximum of the complexity curves (as implied by results in Barbier et al. (2019)), it plausibly arises in the highly frustrated cases of very large positive or very negative inverse temperature, corresponding to values of the complexity close to zero. At the same time, the presence of RSB usually entails corrections that are very small in magnitude. We leave the (technically more challenging) investigation of the RSB solution for the large deviations for future work.

IV.2 Large deviation results

Figure 1: (Left) Complexity-volume curves for various budgets, at fixed pool size, extracted from the large deviation computations. These curves reach their maxima at a point whose coordinates correspond to the typical Gardner volume of randomly chosen samples and to the log-number of ways of choosing that many elements from the pool. (Right) The magnetization order parameter (in other words the teacher-student overlap) as a function of the Gardner volume for a fixed pool cardinality, as extracted from the large deviation computations. As is physically intuitive, smaller Gardner volumes imply larger values of the magnetization.

In Fig. 1, we show the results of the large deviation analysis at a fixed pool size. Note that the qualitative picture is unaltered when the pool size is varied (equivalent results for other pool sizes are shown in appendix C). The different curves, obtained at fixed values of the budget, show the complexity, i.e. the exponential rate of the number of possible subset choices that realize the corresponding Gardner volumes. As expected, the maximum of each curve is observed at the typical Gardner volume of a teacher-student perceptron that has learned to correctly classify the corresponding number of i.i.d. Gaussian input patterns, and the associated complexity is simply given by the logarithm of the binomial coefficient counting the subset choices.

The cases where the extremum in equation (6) is realized for positive values of the inverse temperature describe choices of the labeled subsets that induce atypically large Gardner volumes: these correspond to active learning scenarios where the student query is worse than random sampling. The number of possible realizations of these scenarios decreases exponentially as one approaches the right-hand extremum of the region where the complexity curve is positive, which describes the largest possible volume at that given budget. An important remark is that, as soon as the selection is biased away from the typical one, the statistics of the input patterns in the labeled set is no longer i.i.d., with correlations growing stronger the larger the bias.

On the other side, negative inverse temperatures induce atypically small Gardner volumes and labeled subsets with high information content. Again, as one spans smaller and smaller volumes the associated complexity drops, making the problem of finding these desirable subsets harder and harder. The left positive-complexity extremum of the curves in the left plot of Fig. 1 corresponds to the smallest reachable Gardner volumes. We observe in the figure that for larger budgets the complexity curves quickly saturate very close to the smallest possible Gardner volume, namely the Gardner volume corresponding to the entire pool of samples.

Figure 2: Typical Gardner volume (purple) and information-theoretically smallest achievable one (orange, yellow and blue), extracted from the large deviation computation for several pool sizes. The horizontal lines depict the value of the Gardner volume corresponding to the whole pool; we see the fast saturation of the lowest volumes at these lines. The information-theoretic volume-halving limit for label-agnostic active learning procedures is plotted as a dotted line. We notice that the qualitative picture is essentially unchanged when the pool size is varied.

In the right plot of Fig. 1, we also show the prediction for the typical value of the magnetization, i.e. the overlap between teacher and students, as the Gardner volume is varied. As mentioned in section III, small Gardner volumes induce high magnetizations and thus low generalization errors.

In Fig. 2 the typical (purple) and corresponding minimum (orange, yellow, cyan) Gardner volumes are depicted as a function of the budget for various pool sizes. Note that the qualitative picture is unaltered when the pool size is varied. We further observe that the minimum volume becomes very close to the Gardner volume of the entire pool of samples already for very small budgets.

V Algorithmic implications

V.1 Generic considerations

The setting investigated in this paper provides a unique chance to benchmark the algorithmic performance of any given pool-based active learning algorithm against the optimal achievable performance, and to measure how closely the large deviation results are approached. Before turning to such algorithmic performance we should make a distinction between two classes of active learning strategies:

  • Label-agnostic algorithms, where the student is not able to extract any knowledge about the labels. In this case, for binary labels there is a simple lower bound on the Gardner volume reachable with a given number of labeled samples, obtained from the argument that every new sample can at best divide the current volume by a factor of two Seung et al. (1992); a short sketch of this argument is given after this list. This strategy is explored in the famous Query by Committee active learning strategy, and the classical work Seung et al. (1992) argues that volume halving can actually be achieved when an unlimited set of samples is available. Plotting this volume-halving bound in Fig. 2, we see that even though there exist subsets leading to smaller Gardner volumes, they cannot be found in a label-agnostic way.

  • Label-informed algorithms, where external knowledge of the labels is available and can be exploited for extracting information about which sample to choose. While in our toy setting the additional information can only consist of disclosing the true labels, which would defy the very point of active learning, in applications with structured data the structure in the input space could be exploited (e.g., through clustering or transfer learning) for making unsupervised guesses of the labels. A concrete example where external insight is available is drug discovery Warmuth et al. (2001), where additional information can be inferred from the presence (or absence) of chemical functional groups on the molecules in the data pool. In the present work, we study whether it is possible, with full access to the labels, to devise an efficient method for finding a subset of samples that achieves close to the minimal Gardner volume (note that this is still an algorithmically non-trivial problem).
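
The volume-halving argument mentioned in the first item can be sketched as follows (a standard Bayesian bisection argument, written here in our own notation rather than that of the references). If the current version space has volume $V$ and a queried sample splits it into parts $V_+$ and $V_-$ whose members predict the two possible labels, then, since the teacher is a priori uniformly distributed in the version space, the label $+1$ is revealed with probability $V_+/V$ and

$$\mathbb{E}\left[V_{\rm new}\right] \;=\; \frac{V_+}{V}\,V_+ \;+\; \frac{V_-}{V}\,V_- \;\ge\; \frac{V}{2},$$

with equality only when the sample splits the version space into two equal halves. Iterating over a budget of $n$ queries gives $\mathbb{E}[V_n] \ge 2^{-n} V_0$, i.e. at most one bit of information is gained per queried label.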

In this section we investigate both the label-agnostic and the label-informed strategies. We benchmark several well-known active learning algorithms on the model studied in the present paper, and we also design and test a new message-passing-based active learning algorithm. Before doing that, let us describe the general strategy.

Many of the commonly used active learning criteria rely on some form of label-uncertainty measure. Uncertainty sampling Settles (2009), Lewis and Gale (1994) is an active learning scheme based on the idea of iteratively selecting and labelling the data points where the prediction of the available trained model is the least confident. In general, the computational complexity associated with this type of scheme grows with the budget, as it requires an extensive number of runs of a training algorithm. Since even training a single model per pattern addition can become expensive at large system sizes, in all our numerical tests we opted for adding batches of samples to the labeled set instead of a single sample per iteration. We remark that, despite the corresponding speed-up, the observed performance deterioration is negligible. The structure of this type of algorithm is sketched in Algorithm 1.

Select heuristic strategy from Table 1
Define batch size
Initialize the labeled subset
while the budget is not exhausted do
     Obtain the required estimates given the current labeled subset
     Obtain model predictions at the still unlabeled data points
     Sort the predictions according to the sorting criterion of the chosen heuristic
     Add the first batch of elements in the sorting permutation to the labeled subset
end while
Algorithm 1 Uncertainty sampling
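
A minimal Python sketch of this batched uncertainty-sampling loop is given below; the interface (a pluggable fit routine and a score routine returning per-sample uncertainties) and all names are our own illustration, not the paper's code.

import numpy as np

def uncertainty_sampling(X_pool, oracle_labels, budget, batch_size, fit, score, seed=0):
    """Generic batched uncertainty sampling (sketch of Algorithm 1).

    fit(X, y)       -> trained model on the current labeled subset
    score(model, X) -> per-sample uncertainty scores (higher = more uncertain)
    oracle_labels   -> full label vector, queried only for selected indices."""
    rng = np.random.default_rng(seed)
    P = X_pool.shape[0]
    labeled = list(rng.choice(P, size=batch_size, replace=False))   # random seed batch
    unlabeled = [i for i in range(P) if i not in labeled]

    while len(labeled) < budget and unlabeled:
        model = fit(X_pool[labeled], oracle_labels[labeled])
        scores = score(model, X_pool[unlabeled])
        order = np.argsort(-scores)                       # most uncertain first
        n_add = min(batch_size, budget - len(labeled))
        picked = [unlabeled[j] for j in order[:n_add]]
        labeled.extend(picked)
        unlabeled = [i for i in unlabeled if i not in picked]
    return labeled

Any of the heuristics of Table 1 can then be plugged in through the score argument.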

V.2 Approximate message passing for active learning (AL-AMP)

In general, estimating the Gardner volume of a given training set, or the label uncertainty of a new sample, is a computationally hard problem. However, in perceptrons (or more general GLMs) with i.i.d. Gaussian input data, at large system size one can rely on the estimate provided by a well-known algorithm for approximate inference, Approximate Message Passing (AMP), historically also referred to as the Thouless-Anderson-Palmer (TAP) equations, see Thouless et al. (1977). The AMP algorithm Donoho et al. (2009); Rangan (2011), Zdeborová and Krzakala (2016) yields (at convergence) estimators of the posterior means and variances of the weights, thus accounting for the uncertainty in the inference process, including the uncertainty on the label of a new sample. The Gardner volume (corresponding to the so-called Bethe free entropy) can then be expressed as a simple function of the AMP fixed-point messages (see Krzakala et al. (2014) for an example). We provide pseudo-code of AMP in the case of the perceptron in Algorithm 2. An important remark is that when the training set is not sampled randomly from the pool, as in the active learning context, correlations can arise and AMP is no longer rigorously guaranteed to converge, nor to provide a consistent estimate of the Gardner volume. In the present work, we can only argue that its use seems to be justified a posteriori by the agreement between theoretical predictions and numerical experiments, for instance for the generalization error.

Initialize the estimates of the posterior means and variances of the weights
Initialize the output-channel messages
Initialize the Onsager correction terms
while Convergence criterion not satisfied do
     Compute the means and variances of the pre-activations, including the Onsager correction
     Update the output-channel messages through the output denoising functions
     Compute the input-side means and variances from the updated output messages
     Update the weight means and variances through the prior denoising functions
end while
Algorithm 2 single-instance AMP for the perceptron
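
For reference, the sketch below spells out a standard generalized-AMP iteration for the sign-output perceptron with a Gaussian prior on the weights (Donoho et al. (2009); Rangan (2011)). The notation, damping scheme and stopping rule are our own choices and need not match the paper's Algorithm 2; mild damping is used because, as discussed below, convergence is not guaranteed on actively selected (hence correlated) training sets.

import numpy as np
from scipy.stats import norm

def amp_perceptron(X, y, n_iter=200, damp=0.5, tol=1e-7):
    """GAMP for P(y|z) = Theta(y*z), i.i.d. N(0,1) prior on the weights.
    Returns approximate posterior means and variances of the weights."""
    M, N = X.shape
    X2 = X ** 2
    w_hat = np.zeros(N)      # posterior means
    v_w = np.ones(N)         # posterior variances
    g = np.zeros(M)          # output-channel messages

    for _ in range(n_iter):
        # output side, with Onsager correction
        V = X2 @ v_w
        omega = X @ w_hat - V * g
        u = y * omega / np.sqrt(V)
        ratio = norm.pdf(u) / np.clip(norm.cdf(u), 1e-12, None)
        g_new = y * ratio / np.sqrt(V)                 # g_out for the sign channel
        dg = -(g_new ** 2 + omega * g_new / V)         # derivative of g_out w.r.t. omega
        g = damp * g + (1 - damp) * g_new

        # input side
        Sigma = 1.0 / np.clip(X2.T @ (-dg), 1e-12, None)
        R = w_hat + Sigma * (X.T @ g)

        # Gaussian-prior denoising
        w_new = R / (1.0 + Sigma)
        v_new = Sigma / (1.0 + Sigma)
        if np.mean((w_new - w_hat) ** 2) < tol:
            return w_new, v_new
        w_hat = damp * w_hat + (1 - damp) * w_new
        v_w = damp * v_w + (1 - damp) * v_new
    return w_hat, v_w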

We use the AMP algorithm to introduce a new uncertainty sampling procedure relying on the information contained in the AMP messages, denoted as AL-AMP in the following. At each iteration, the single-instance AMP equations are run on the current training subset to yield posterior mean estimates $\hat{\boldsymbol{w}}$ and variances $\hat{\boldsymbol{v}}$ of the weights. These quantities can then be used to evaluate, for all the unlabeled samples, the output magnetization (i.e. the Bayesian prediction), defined as

$m_\mu \;=\; \mathrm{erf}\!\left(\frac{\omega_\mu}{\sqrt{2\,\Delta_\mu}}\right),$   (9)

where we introduced the output overlaps $\omega_\mu = \boldsymbol{x}_\mu \cdot \hat{\boldsymbol{w}}$ and variances $\Delta_\mu = (\boldsymbol{x}_\mu \odot \boldsymbol{x}_\mu) \cdot \hat{\boldsymbol{v}}$, where $\odot$ is the component-wise product. The output magnetizations correspond to the weighted output average over all the estimators contained in the current version space, and their magnitude represents the overall confidence in the classification of the still unlabeled samples. This means that AMP provides an extremely efficient way of obtaining the uncertainty information. The specifics of the algorithm can be found in Tab. 1.

We also explore numerically the label-informed active learning regime introduced in the previous section. We consider its limiting case by introducing the informed AL-AMP strategy, which can fully access the true labels in order to query the samples whose output magnetization (9) is maximally distant from the correct classification. This selection process can iteratively reduce the Gardner volume by factors larger than two. Again, the relevant specifics of the informed AL-AMP algorithm are detailed in Tab. 1.
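
In code, the two AL-AMP selection criteria amount to the following scoring functions (an illustration in our own notation, assuming AMP estimates such as those returned by the amp_perceptron sketch above). The agnostic score plugs directly into the generic loop sketched after Algorithm 1, while the informed variant additionally requires access to the disclosed labels.

import numpy as np
from scipy.special import erf

def output_magnetization(X_unlabeled, w_hat, v_w):
    """Bayesian prediction m = E[sign(x.w)] under the AMP Gaussian posterior
    approximation with weight means w_hat and variances v_w."""
    omega = X_unlabeled @ w_hat              # posterior mean of the pre-activation
    delta = (X_unlabeled ** 2) @ v_w         # posterior variance of the pre-activation
    return erf(omega / np.sqrt(2.0 * delta))

def agnostic_alamp_scores(X_unlabeled, w_hat, v_w):
    """Most uncertain = output magnetization closest to zero."""
    return -np.abs(output_magnetization(X_unlabeled, w_hat, v_w))

def informed_alamp_scores(X_unlabeled, y_unlabeled, w_hat, v_w):
    """Label-informed variant: prefer samples whose Bayesian prediction
    is maximally distant from the (disclosed) true label."""
    m = output_magnetization(X_unlabeled, w_hat, v_w)
    return -y_unlabeled * m                  # large when prediction and label disagree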

Uncertainty sampling strategies
Heuristic           | Required estimates                                  | Sorting criterion
Agnostic AL-AMP     | AMP posterior means and variances of the weights    | smallest magnitude of the output magnetization (9)
Informed AL-AMP     | AMP posterior means and variances, plus true labels | output magnetization most distant from the true label
Query by committee  | committee of students trained on the labeled subset | maximal disagreement among the committee outputs
Logistic regression | estimator trained with the logistic loss            | smallest magnitude of the pre-activation
Perceptron learning | estimator trained with the perceptron algorithm     | smallest magnitude of the pre-activation
Table 1: Summary of the specifics of the uncertainty sampling strategies considered in this paper.

V.3 Other tested measures of uncertainty

One of the most widely used uncertainty sampling procedures is the so-called Query by Committee (QBC) strategy Seung et al. (1992), Freund et al. (1992). In QBC, at each time step, a committee of students is sampled from the version space (e.g., via the Gibbs algorithm). The committee is then employed to choose the labels to be queried, by identifying the samples where maximum disagreement among the committee members' outputs is observed. The QBC algorithm was introduced as a proxy for bisection, i.e. for cutting the version space into two equal-volume halves. As already mentioned, this constitutes the optimal information gain in a label-agnostic setting Dasgupta (2005). Note, however, that the QBC procedure can achieve volume halving only in the infinite-size committee limit, with uniform version-space sampling and with infinitely many samples available. Obviously, running a large number of ergodic Gibbs sampling procedures quickly becomes computationally unfeasible. Moreover, in pool-based active learning the pool of samples is limited. In order to allow comparison with other strategies at finite sizes, we approximated the uniform sampling with a set of greedy optimization procedures (e.g., stochastic gradient descent) from random initialization conditions, checking numerically that this yields a committee of students reasonably spread out in the version space. It is possible to ensure a greater coverage of the version space by performing a short Monte-Carlo random walk for each committee member; the effect has been found to be small for computationally reasonable lengths of walk.
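
A minimal sketch of this simplified committee construction and of the disagreement score is given below (our own illustration: perceptron updates from random initializations stand in for version-space sampling, as described above).

import numpy as np

def train_committee(X_lab, y_lab, K=8, epochs=50, lr=0.1, seed=0):
    """K perceptrons trained by stochastic updates from independent random
    initializations, used as a crude stand-in for version-space sampling."""
    rng = np.random.default_rng(seed)
    N = X_lab.shape[1]
    committee = []
    for _ in range(K):
        w = rng.standard_normal(N)
        for _ in range(epochs):
            for i in rng.permutation(len(y_lab)):
                if np.sign(X_lab[i] @ w) != y_lab[i]:   # perceptron update on mistakes
                    w += lr * y_lab[i] * X_lab[i]
        committee.append(w / np.linalg.norm(w))
    return np.array(committee)

def qbc_scores(committee, X_unlabeled):
    """Disagreement score: committee votes closest to an even split are most informative."""
    votes = np.sign(X_unlabeled @ committee.T)          # shape (n_unlabeled, K)
    return -np.abs(votes.mean(axis=1))                  # 0 corresponds to a perfect tie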

We also implemented an alternative uncertainty sampling strategy, relying on a single training procedure (e.g., training with the perceptron algorithm or logistic regression) per iteration: in this case, the uncertainty information is extracted from the magnitude of the pre-activations measured at the unlabeled samples after each training cycle. This strategy implements the intuitive geometric idea of looking for the samples that are most orthogonal to the available updated estimator, which are more likely to halve the version space independently of the value of the true label.
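
The pre-activation-magnitude criterion described above can be sketched as follows (a plain gradient-descent logistic regression is used here as the single training procedure; names and hyperparameters are illustrative, not the paper's implementation).

import numpy as np

def logistic_fit(X_lab, y_lab, epochs=500, lr=0.05):
    """Gradient-descent logistic regression with labels in {-1, +1}."""
    w = np.zeros(X_lab.shape[1])
    for _ in range(epochs):
        z = y_lab * (X_lab @ w)
        # gradient of the mean logistic loss log(1 + exp(-y x.w))
        grad = -(X_lab * (y_lab * (1.0 - 1.0 / (1.0 + np.exp(-z))))[:, None]).mean(axis=0)
        w -= lr * grad
    return w

def margin_scores(w, X_unlabeled):
    """Smallest |pre-activation| = sample most orthogonal to the current estimator."""
    return -np.abs(X_unlabeled @ w)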

V.4 Algorithmic results

Figure 3: (Left) Performance of the label-agnostic (yellow circles) and label-informed (blue circles) AL-AMP, plotted together with the minimum and maximum values of the Gardner volume extracted from the large deviation computation (purple and green) and the volume-halving curve (dotted black). For comparison we also plot the typical Gardner volume (cyan) and the one obtained by random sampling (orange squares). Numerical experiments were run at finite system and pool size. For each algorithmic performance curve the average over several instances is presented; fluctuations were found to be negligible around the average and are not shown. (Right) The same plot with the Gardner volume replaced by the Bayesian test accuracy, derived in Appendix D. For the AL-AMP algorithm the accuracy is evaluated using a held-out test set. The qualitative picture is very similar to the one for the Gardner volume curves (left), once more confirming that Gardner volumes and generalization errors both constitute good measures of informativeness.
Figure 4: (Left) Performance of the label-agnostic algorithms presented in Tab. 1, plotted against the budget and compared to the volume-halving lower bound. Experiments were performed at finite system and pool size. For each algorithm the average over several instances is presented. Note that error bars are smaller than the marker size. (Right) (Bayesian) test accuracy of the same heuristics for various budgets, evaluated on a held-out test set. In blue, the Bayesian test accuracy for a typical subset; see appendix D. Again, the qualitative picture is unchanged going from the Gardner volume to the test accuracy.

In Fig. 3, we compare the minimum Gardner volume obtained from the large deviation calculation with the algorithmic performance obtained on synthetic data at finite size by the AL-AMP algorithms detailed in Algorithm 1 and Tab. 1, at a fixed data-pool size. The large deviation analysis yields values for the minimum and maximum achievable Gardner volumes at any budget. We also compare the algorithmic results with the prediction for the typical case and with the volume-halving curve. Since in the considered pool-based setting the volume-halving performance cannot be achieved for volumes smaller than the Gardner volume corresponding to the entire pool, the relevant volume-halving bound is, more precisely, the larger of the two. Random sampling displays good agreement with the expected typical volumes. Most notably, the label-agnostic AL-AMP algorithm tightly follows the volume-halving bound, thus reaching close to the optimal possible performance. Since for large training sets the typical Gardner volume decreases only slowly with the number of randomly chosen samples Engels and van den Broeck (2001), we conclude that the AL-AMP algorithm reaches close to the minimum possible Gardner volume with a budget much smaller than the pool size. We thus obtain an exponential reduction in the number of samples even in pool-based active learning, similarly to the original Query by Committee work Seung et al. (1992).

The label-informed AL-AMP also approaches the theoretically minimal volume, though not as closely. We remark that an important limitation of the AL-AMP algorithm comes from the fact that AMP is not guaranteed to provide good estimators (or to converge at all) with correlated data. For example, in the numerical experiments for obtaining the informed AL-AMP curve, we had to resort to mild damping schemes in the message passing to allow fixed points to be reached. This effect was stronger for the label-informed algorithm than for the label-agnostic one.

In Fig. 4, we provide a numerical comparison of the performance of the agnostic AL-AMP and of the other above-mentioned label-agnostic active learning algorithms, with the finite-size experiments again run at fixed system and pool size. Note that, while the different active learning strategies mentioned above were employed for selecting the labeled subset, in all cases supervised learning and the related performance estimates were obtained by running AMP. In the plot, we can see that, while AL-AMP is able to extract very close to the maximum amount of information from each query (one bit per pattern, until the volume is saturated), other heuristics with the same computational complexity are sub-optimal. In particular, in the simplified Query by Committee procedure we observe that increasing the size of the committee does not yield a very noticeable change in its performance, most probably because the committee cannot cover a sufficient portion of the version space if the computational cost is to be kept reasonable. On the other hand, using the information contained in the magnitude of the pre-activations allows better performance while also being more time-efficient, since only a single perceptron, rather than a committee thereof, has to be trained at each step. The logistic loss yields a rather good performance, close to that of AL-AMP, while uncertainty sampling with the perceptron algorithm yields a more mitigated performance.

We leave a more systematic benchmarking of the many existing strategies for future work, stressing the fact that, while there certainly exist more involved procedures that can yield better performance than the presented heuristics, the absolute performance bounds still apply, regardless of the implemented active learning strategy.

VI Conclusions

Using the replica method for a large deviation calculation of the Gardner volume, we computed, for the teacher-student perceptron model, the minimum Gardner volume (equivalently, maximum mutual information) achievable by selecting a subset of fixed cardinality from a pre-existing pool of i.i.d. normal samples. We evaluated the large deviation function under the replica symmetric assumption; checking for replica symmetry breaking and evaluating the eventual corrections to the presented results is left for future work, as is a rigorous establishment of the presented results. Our result for the information-theoretic limit of pool-based active learning in this setting complements the already known volume-halving bound for label-agnostic strategies. We hope our result may serve as a guideline to benchmark future heuristic algorithms on the present model, while our modus operandi regarding the derivation of the large deviations may help future endeavours in the theoretical analysis of active learning in more realistic settings. We presented the performance of some known heuristics, and we proposed the AL-AMP algorithms for performing uncertainty-based active learning. We showed numerically that on the present model the label-agnostic AL-AMP algorithm performs very close to the optimal bound, thus being able to achieve an accuracy corresponding to the entire pool of samples with exponentially fewer samples.

Acknowledgements.
We want to thank Guilhem Semerjian for clarifying discussions in the early stages of this work. This work is supported by the ERC under the European Union’s Horizon 2020 Research and Innovation Program 714608-SMiLe.

References

  • Settles (2009) Burr Settles, Active Learning Literature Survey, Computer Sciences Technical Report (2009).
  • Angluin (1988) Dana Angluin, “Queries and concept learning,” Machine Learning 2, 319–342 (1988).
  • Cohn et al. (1994) David A. Cohn, Les E. Atlas,  and Richard E. Ladner, “Improving generalization with active learning,” Machine Learning 15, 201–221 (1994).
  • Seung et al. (1992) H. Sebastian Seung, Manfred Opper,  and Haim Sompolinsky, “Query by committee,” COLT 5, 287–294 (1992).
  • Atlas et al. (1990) Les E. Atlas, David A. Cohn,  and Richard E. Ladner, “Training connectionist networks with queries and selective sampling,” NIPS 2, 566–573 (1990).
  • Zhang et al. (2019) Linfeng Zhang, De-Ye Lin, Han Wang, Roberto Car,  and E Weinan, “Active learning of uniformly accurate interatomic potentials for materials simulation,” Phys. Rev. Materials 3, 023804 (2019).
  • Warmuth et al. (2001) Manfred K. Warmuth, Gunnar Rätsch, Michael Mathieson, Jun Liao,  and Christian Lemmen, “Active learning in drug discovery,” NIPS 14, 1449–1456 (2001).
  • McCallum and Nigam (1998) Andrew K. McCallum and Kamal Nigam, “Employing EM and pool-based active learning for text classification,” ICML, 350–358 (1998).
  • Tong and Koller (2000) Simon Tong and Daphne Koller, “Support vector machine active learning with applications to text classification,” Journal of Machine Learning Research 2, 45–66 (2000).
  • Hoi et al. (2006) Steven C.H. Hoi, Rong Jin, Jianke Zhu,  and Michael R. Lyu, “Batch mode active learning and its application to medical image classification,” ICML 6, 417–424 (2006).
  • Gardner and Derrida (1988) Elizabeth Gardner and Bernard Derrida, “Optimal storage properties of neural network models,” J. Phys. A: Math. Gen. 21, 271–284 (1988).
  • Parisi (1979) Giorgio Parisi, “Towards a mean field theory for spin glasses,” Phys. Lett 73, 203–205 (1979).
  • Parisi (1983) Giorgio Parisi, “Order parameter for spin glasses,” Phys. Rev. Lett 50, 1946–1948 (1983).
  • Mézard et al. (1986) Marc Mézard, Miguel A. Virasoro,  and Giorgio Parisi, Spin Glass Theory and Beyond (World scientific Lecture Notes in Physics, 1986).
  • Zdeborová and Krzakala (2016) Lenka Zdeborová and Florent Krzakala, “Statistical physics of inference : thresholds and algorithms,” Adv. Phys. 5, 453–552 (2016).
  • Gardner and Derrida (1989) Elizabeth Gardner and Bernard Derrida, “Three unfinished works on the optimal storage capacity of networks,” Journal of Physics A: Mathematical and General 22, 1983 (1989).
  • Engels and van den Broeck (2001) Andreas Engels and Christian P.L. van den Broeck, Statistical mechanics of learning (Cambridge University Press, 2001).
  • Freund et al. (1992) Yoav Freund, Eli Shamir, H. Sebastian Seung,  and Naftali Tishby, “Information, prediction, and query by committee,” NIPS 5, 483–490 (1992).
  • Zhou (2019) Hai-Jun Zhou, “Active online learning in the binary perceptron problem,” Communications in Theoretical Physics 71, 243 (2019).
  • Barbier et al. (2019) Jean Barbier, Florent Krzakala, Nicolas Macris, Léo Miolane, and Lenka Zdeborová, “Optimal errors and phase transitions in high-dimensional generalized linear models,” Proceedings of the National Academy of Sciences 116, 5451–5460 (2019).
  • Nishimori (2001) Hidetoshi Nishimori, Statistical physics of spin glasses and information processing (Oxford Science Publications, 2001).
  • Mitchell (1982) Tom M. Mitchell, “Generalization as search,” Artificial Intelligence 18, 203–226 (1982).
  • Dotsenko et al. (1994) Victor Dotsenko, Silvio Franz,  and Marc Mézard, “Partial annealing and overfrustration in disordered systems,” J. Phys. A: Math. Gen. 27, 2351–2365 (1994).
  • Krzakala et al. (2012) Florent Krzakala, Marc Mézard, François Sausset, Yifan Sun, and Lenka Zdeborová, “Probabilistic reconstruction in compressed sensing: Algorithms, phase diagrams and threshold achieving matrices,” J. Stat. Mech. 2012, P08009 (2012).
  • Lewis and Gale (1994) David D. Lewis and William A. Gale, “A sequential algorithm for training text classifiers,” SIGIR 17, 3–12 (1994).
  • Thouless et al. (1977) David J. Thouless, Philip W. Anderson,  and Richard G. Palmer, “Solution of solvable model of spin glass,” Phil. Mag. 35, 593–601 (1977).
  • Donoho et al. (2009) David L. Donoho, Arian Maleki,  and Andrea Montanari, “Message-passing algorithms for compressed sensing,” Proceedings of the National Academy of Sciences 106, 18914–18919 (2009).
  • Rangan (2011) Sundeep Rangan, “Generalized approximate message passing for estimation with random linear mixing,” in 2011 IEEE International Symposium on Information Theory Proceedings (IEEE, 2011) pp. 2168–2172.
  • Krzakala et al. (2014) Florent Krzakala, André Manoel, Éric W. Tramel,  and Lenka Zdeborová, “Variational free energy for compressed sensing,” ISIT , 1499–1503 (2014).
  • Dasgupta (2005) Sanjoy Dasgupta, “Analysis of a greedy active learning strategy,” NIPS 17, 337–344 (2005).
  • Antenucci et al. (2019) Fabrizio Antenucci, Florent Krzakala, Pierfrancesco Urbani,  and Lenka Zdeborová, “Approximate survey propagation for statistical inference,” J. Stat. Mech. 2019, 023401 (2019).
  • Mézard et al. (1984) Marc Mézard, Giorgio Parisi, Nicolas Sourlas, Gérard Toulouse,  and Miguel A. Virasoro, “Replica symmetry breaking and the nature of the spin glass phase,” J. Phys. France 45, 843–854 (1984).
  • Baldassi et al. (2015) Carlo Baldassi, Alessandro Ingrosso, Carlo Lucibello, Luca Saglietti, and Riccardo Zecchina, “Subdominant dense clusters allow for simple learning and high computational performance in neural networks with discrete synapses,” Physical Review Letters 115, 128101 (2015).

Appendix A Notations and Large-Deviation (LD) measure for GLMs

In this appendix we set up the replica calculation in the more general setting of a generalized linear model (GLM) with arbitrary teacher/student prior/posterior. We allow the inference to be mismatched, i.e. we allow the teacher and student measures to be different. The specialization to the particular case of the teacher-student perceptron with no mismatch (Bayes-optimal) will be carried out in appendix C. Our computation borrows from Krzakala et al. (2012) and Barbier et al. (2019), which study the corresponding typical-case measure. Because we study large deviations, our formalism bears some resemblance to one-step Replica Symmetry Breaking (1RSB) equations, a discussion of which can be found for example in Antenucci et al. (2019), Mézard et al. (1984), Mézard et al. (1986).

A.1 Definition of the problem

We consider a student GLM (Krzakala et al. (2012)) learning from samples stacked in a matrix, with the corresponding labels stacked into a vector. We assume the teacher-student (or planted) setting (Zdeborová and Krzakala (2016)), where the labels are generated from the ground truth (teacher) weights through the teacher channel measure, the teacher weights themselves being drawn from the teacher prior. Given the samples and the labels, the student perceptron is trained so that its own weight vector tries to match the ground truth. The inference is carried out with the student prior and the student channel measure; when either of these differs from the teacher's, the student ignores the precise Markov chain from which the labels are generated, as discussed in section II, see also Zdeborová and Krzakala (2016). The likelihood that a given weight vector is the ground truth is then proportional to the student prior times the student channel measure evaluated on the observed labels. A reasonable measure of the average lack of accuracy of the student's guess is then given by the Gardner volume, viz. the partition function associated with this likelihood

(10)

The smaller this volume, the easier the student inference; see section II in the main text. The validity of the Gardner volume as a measure of informativeness is justified for the Bayes-optimal perceptron in section III of the main text.

We consider here pool-based active learning, where only a subset of the pool is used for training. The choice of subset can be conveniently parametrized by Boolean selection variables, one per sample, equal to one when the sample is used and to zero when it is not selected. For a given budget, we intend to find the selection that minimizes the Gardner volume, viz. that allows the best student guess. To do this we shall compute the complexity, i.e. the logarithm of the number of ways of selecting the samples so that the Gardner volume associated with the training of the student takes a prescribed value, as in section IV of the main text.

A.2 Assumption

To simplify, the samples are taken to be independently and identically distributed according to a standard normal distribution. Moreover, all measures over vectors are assumed to be separable, that is, factorizable as a product of identical measures over the components; measures with vector arguments are therefore understood as such products.

A.3 LD measure

The goal is to compute the averaged log partition function (free entropy in statistical physics terms)

(11)

The parameters entering this partition function can be seen as an inverse temperature and a chemical potential, respectively; see section IV in the main text. The reason why we compute the free entropy is that this quantity is the Legendre transform of the complexity

(12)

Inverting the Legendre transform is then straightforward and yields

(13)

The spectrum of volumes with positive complexity at any given fixed budget corresponds to all achievable Gardner volumes for that budget. In particular, its lower edge is the minimal Gardner volume, attained when the samples are chosen in an optimal way. Contrariwise, its upper edge is the maximal Gardner volume, attained when the samples are chosen in the least informative way for the student, so that the student inference problem is hardest. Finally, note that the selection variables play, in the grand-canonical partition function (11), the role of an annealed disorder in disordered-systems terminology Mézard et al. (1986), and shall sometimes be referred to as such in the following.

Appendix B Replica computation

B.1 Replica trick

The standard way of taking care of the logarithm in equation (11) is the replica trick (Parisi (1979), Parisi (1983), Mézard et al. (1986)),

(14)

To compute the moments of the partition function, one needs a further level of replication to take care of the power of the Gardner volume involved in the summand of equation (11):

(15)

In the present problem we thus introduce two replication levels. Each replica is hence characterized by a pair of indices: the first index specifies the disorder replica, while the second is related to the replication of the power of the Gardner volume. The teacher is set as an additional replica, and, implicitly, summed replica indices will henceforth run over all replicas. One then has

(16)

where we defined the replicated pre-activations, which are Gaussian because of the central limit theorem, and enforced the definition of their covariance (overlap) matrix with integral representations of Dirac deltas. A conjugate matrix is introduced accordingly, and matrix elements are denoted by lowercase symbols. Then

(17)

where we factorized both over the sample indices (first parenthesis) and over the weight-component indices (second parenthesis). The free entropy defined in (11) then reads

(18)

with

(19)
(20)

B.2 Replica Symmetric (RS) ansatz

The extremization in equation (18) is hard to carry out. As is now standard in the disordered systems literature we can reduce the number of parameters to be extremized over by enforcing the so-called Replica Symmetric (RS) ansatz (Mézard et al. (1986)) on both replication levels

(21)
(22)
(23)
(24)
(25)

Physically, the ansatz (21)-(25) means that two replicas seeing the same realisation of the disorder (i.e., possessing the same first index) have an overlap greater than that between students seeing different realisations (and thus possessing different first indices). The rescaling in the definition (23) is introduced just for later convenience.

Note finally that, while the ansatz (21) to (25) is replica-symmetric at both replication levels, it gives a set of equations that are formally those of a 1RSB problem (Mézard et al. (1984)). This is also a reason why taking a 1RSB ansatz (Mézard et al. (1986)) in the present large deviation calculation would be rather involved, as it would lead to equations of the usual 2RSB form, which are numerically demanding to solve.

We plug the RS ansatz (21)-(25) into the three contributions that make up equation (18). The trace term is

(26)

We can decompose the exponent in (19) according to the ansatz (21)-(25)

(27)

In the penultimate term one of the replica indices does not intervene. Introducing Hubbard-Stratonovich fields for the penultimate term and a Hubbard-Stratonovich field for the last one, this contribution reads

(28)

To carry out the computation of the remaining contribution (equation (20)) we need to explicitly compute the inverse of the Parisi matrix involved in that equation. This is done in the following subsection.

Some linear algebra for hierarchical matrices

Consider the inverse of the overlap matrix. Since it is clearly of the same hierarchical form as the overlap matrix itself, we can parametrize its coefficients in an identical fashion. Requiring that the product of the two matrices be the identity means

(29)
(30)
(31)
(32)
(33)
(34)

yielding

(35)
(36)
(37)
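
As a simple illustration of this kind of inversion (our own example, for a single-level replica-symmetric matrix rather than the two-level structure used above): for an $n \times n$ matrix $Q = (a-b)\,\mathbb{1}_n + b\,J_n$, with diagonal entries $a$, off-diagonal entries $b$ and $J_n$ the all-ones matrix, the inverse has the same structure, $Q^{-1} = (\tilde{a}-\tilde{b})\,\mathbb{1}_n + \tilde{b}\,J_n$, with

$$\tilde{a}-\tilde{b} \;=\; \frac{1}{a-b}, \qquad \tilde{b} \;=\; -\,\frac{b}{(a-b)\left[a+(n-1)b\right]},$$

as can be checked from the eigenvalues $a+(n-1)b$ (on the uniform vector) and $a-b$ (with multiplicity $n-1$). The two-level case treated in equations (29)-(37) follows the same logic, block by block.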