1 Introduction
Supervised machine learning models require enough labeled data to obtain good generalization performance. For many practical applications, such as medical diagnosis or video classification, it can be expensive or time-consuming to label data [18]. Often in these settings unlabeled data is abundant, but due to high costs only a small fraction can be labeled. In active learning, an algorithm chooses unlabeled samples for labeling [4]. The idea is that models can perform better with less labeled data if the labeled data is chosen carefully instead of randomly. This way, active learning methods make the most of a small labeling budget or can be used to reduce labeling costs. A lot of methods have been proposed for active learning [18], among which several works have used generalization bounds to perform active learning [9, 10, 7, 11]. We perform a theoretical and empirical study of active learners that choose queries which explicitly minimize generalization bounds, to investigate how the relation between the bounds impacts active learning performance. We use the kernel regularized least squares model [17] and the squared loss, a popular active learning setting [13, 20].
We study the state-of-the-art Maximum Mean Discrepancy (MMD) active learner of Chattopadhyay et al. [3], which minimizes a generalization bound [20]. The MMD is a divergence measure [8] that is closely related to the Discrepancy measure of Mansour et al. [15]; both have been used in domain adaptation [12, 5]. Using the Discrepancy, we show that we can obtain a tighter bound than the MMD bound in the realizable setting. Tighter bounds are generally considered better, as they estimate the generalization error more accurately. One might therefore also expect them to lead to better labeling choices in active learning, and so we introduce an active learner that minimizes the Discrepancy. Using a probabilistic analysis, however, we show that the Discrepancy and MMD active learners optimize their strategies for unlikely scenarios. This leads us to introduce the Nuclear Discrepancy, whose bound is looser. The Nuclear Discrepancy considers an average case scenario that may occur more often in practice.
We show empirically that active learning using the Nuclear Discrepancy improves upon the MMD and the Discrepancy. In fact, active learning using the tightest bound, the Discrepancy, performs the worst. We show experimentally that the scenarios considered by our bound occur more often, explaining these counterintuitive results. Our study shows that tighter bounds do not guarantee improved active learning performance and that a probabilistic analysis is essential.
The rest of this paper is organized as follows. In Section 3 we introduce two existing generalization bounds, the MMD and Discrepancy, and we present several novel theoretical results. We give an improved MMD bound applicable to active learning and we show how to choose the MMD kernel (contribution 1). Under these conditions the MMD and Discrepancy bound become comparable, and we show that the Discrepancy bound is tighter (contribution 2). We use these theoretical results in Section 4 to analyze these existing bounds probabilistically. In this section we explain why tighter bounds may not lead to improved active learning performance (contribution 3). This probabilistic analysis leads to our novel looser Nuclear Discrepancy bound (contribution 4). In Section 5 we benchmark the active learners on several datasets. We show that indeed our bound improves upon the Discrepancy and MMD for active learning, and we verify our probabilistic argument empirically by computing a novel error decomposition (contribution 5). In Section 6 we give a brief discussion and in Section 7 we give the conclusions of this work. All proofs are given in the supplementary material. First, however, we introduce the setting and necessary notation.
2 Setting and Notation
Let  denote the input space and  the output space. We assume there exists a deterministic labeling function . We study the binary classification setting, but all our results are applicable to regression as well. We assume there is an unknown distribution over  from which we get an independent and identically distributed (i.i.d.) unlabeled sample . Initially the labeled set  is empty. The active learner selects (queries) samples sequentially from the unlabeled pool ; these samples are labeled and added to . The samples are not removed from the set , but the set  is updated after each query. A kernel regularized least squares model is trained on the labeled set. We indicate such a model using  and its output on a sample using .
We take the kernel  of the model to be positive definite symmetric (PDS). For these kernels a reproducing kernel Hilbert space (RKHS) exists, and we indicate the RKHS by . We use the overloaded notation where  also indicates the corresponding vector in the RKHS of the model or function . The norm of a vector in the RKHS is written as . We use  to indicate the kernel function between objects  and . In this work we use the Gaussian kernel, where the bandwidth  is a hyperparameter of the kernel. For the computations of the MMD we need a second kernel, which we indicate with
and we indicate its RKHS and bandwidth by  and , respectively. We use the convention that all vectors are column vectors.  and  are the  by  and  by  matrices of the sets  and . We use the same convention as Cortes and Mohri [5], where kernel regularized least squares minimizes the following objective for a hypothesis  when trained on the sample : . We use  as shorthand for the mean squared error of  on the sample  with labels given by : . The parameter  is a regularization parameter which controls the complexity of the model. Similar to [5], we choose a subset of  as our hypothesis set: , where . In [16, Lemma 11.1] it is shown that training the kernel regularized least squares model always leads to a solution . We focus on the squared loss since the bounds give direct guarantees on this performance measure, and because the quantities in the bounds can be computed in closed form for this loss. The goal for the active learner is to choose queries in such a way as to minimize the expected loss of the model: . We would actually want to train our model on , since if the model complexity is chosen appropriately, small loss on  will lead to small expected loss on .¹ However, since we do not have labels for the samples in , we upper bound the loss on  instead. This upper bound is minimized by the active learners. The studied bounds are of the form .

¹ This holds even in the active learning setting, since the samples in  are i.i.d. samples. If desired, a standard Rademacher complexity bound can be used to bound the expected loss on  in terms of the loss on  [6].
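Kernel regularized least squares admits a closed-form solution. The sketch below is an illustration only: it assumes the objective normalization quoted above (mean squared error plus a regularization term) and a particular Gaussian kernel convention, and the function names are not the authors'.

```python
import numpy as np

def gaussian_kernel(A, B, sigma):
    # Pairwise Gaussian kernel matrix. The bandwidth convention
    # exp(-||x - x'||^2 / (2 sigma^2)) is an assumption of this sketch.
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

def train_krls(X, y, sigma, lam):
    # Closed-form kernel regularized least squares: minimizing
    # (1/m) sum_i (h(x_i) - y_i)^2 + lam * ||h||^2
    # gives the dual coefficients alpha = (K + lam * m * I)^{-1} y.
    m = len(X)
    K = gaussian_kernel(X, X, sigma)
    return np.linalg.solve(K + lam * m * np.eye(m), y)

def predict(alpha, X_train, X_test, sigma):
    # h(x) = sum_i alpha_i * K(x_i, x)
    return gaussian_kernel(X_test, X_train, sigma) @ alpha
```

With very small regularization the model (nearly) interpolates the training labels; larger `lam` shrinks the solution, restricting the hypothesis set as described above.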
Due to training,  will be relatively small. The term  is a constant that cannot be minimized during the active learning process, since it depends on the labels of the set . However, if the model mismatch is small,  will be small. Therefore we ignore this term during active learning; this is also (sometimes implicitly) done in other works [12, 3, 5]. The active learners will therefore minimize some objective  sequentially. This objective can be the MMD, the Discrepancy, or the Nuclear Discrepancy, which will be introduced in the next sections. These objectives estimate the similarity between the samples  and  and do not depend on labels. Note that the resulting active learners will be non-adaptive: the active learning strategy is independent of the labels observed during active learning.
Sequential minimization by the active learner is done as follows. The active learner forms a candidate set  for each possible query , and computes the objective for each candidate set. The query of the candidate set with minimal objective is chosen for labeling: . Note that constants  or  do not influence the query selection for any of the objectives. We consider two settings. In the realizable setting the labeling function , thus a model of our hypothesis set generates the labels and there is no model misspecification. In this case . In the agnostic setting we use binary labels, thus , and we generally have that .
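The sequential minimization just described can be sketched generically: any label-free objective (MMD, Discrepancy, Nuclear Discrepancy) is evaluated on each candidate labeled set, and the minimizer is queried. The function and parameter names below are illustrative, not the authors'.

```python
import numpy as np

def greedy_active_learning(pool, objective, n_queries):
    # Sequential (greedy) query selection: at each step, score the
    # candidate labeled set "selected + [i]" for every remaining pool
    # index i with a label-free objective, and query the minimizer.
    selected = []
    remaining = list(range(len(pool)))
    for _ in range(n_queries):
        scores = [objective(selected + [i]) for i in remaining]
        best = remaining[int(np.argmin(scores))]
        selected.append(best)
        remaining.remove(best)   # queried samples stay out of the candidates
    return selected
```

Because the objective never looks at labels, the resulting strategy is non-adaptive, exactly as noted above.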
3 Theoretical Analysis of Existing Bounds
Improved MMD Bound for Active Learning.
The MMD measures the similarity between the two samples and . Using this criterion we give a generalization bound similar to the one given in [20] that is suitable for active learning. We use the empirical MMD quantity, defined as
(1) 
Here  is the worst-case function from a set of functions . We take the set  as , where  is a yet to be specified PDS kernel with RKHS . We note that the MMD can be computed in practice using Equation 3 in [8], where we have to multiply by  since we consider the general case where . The equation is given in the supplementary material. By using the function  to approximate the worst-case loss function we can prove the following novel bound:
Theorem 1 (MMD Generalization bound)
Let be any loss function, and let . Then for any hypothesis ,
(2) 
where is given by .
The term appears because we may have that . Our MMD bound differs in two aspects from the bound of Wang and Ye [20]. In Wang and Ye [20] the MMD is estimated between the distributions and . However, to estimate the MMD between distributions i.i.d. samples are required [8, Appendix A.2]. The samples of are not i.i.d. since they are chosen by an active learner. We avoid this by using the MMD for empirical samples. The second novelty is that we measure the error of approximating the loss function using the quantity . This formulation allows us to adapt the MMD to take the hypothesis set and loss into account, similar to the Discrepancy measure of Cortes and Mohri [5].
Theorem 2 (Adjusted MMD)
Let be the squared loss and assume (realizable setting). If and , then it is guaranteed that and thus .
Corollary 1
Let and let be a Gaussian kernel with bandwidth . If is a Gaussian kernel with bandwidth and then .
Compared to other works, Theorem 2 gives a more principled way to choose the MMD kernel in the context of learning.² Previously, a Gaussian kernel with  was often used for the MMD. In particular, Corollary 1 shows that if our model uses  as bandwidth and , we may have that  even in the realizable setting, since  is too large. This is undesirable, since  cannot be minimized during active learning. Therefore our choice for , which guarantees that  in the realizable setting, is preferable.

² Note that the MMD can also be used to determine whether or not two sets of samples are from the same distribution [8].
Discrepancy Bound for Active Learning.
We give a bound of Cortes et al. [6] in terms of the Discrepancy. The Discrepancy is defined as
(3) 
Observe that the Discrepancy depends directly on the loss and the hypothesis set .
Theorem 3 (Discrepancy generalization bound)
Assume that for any and that and that is the squared loss. Then given any hypothesis ,
(4) 
where is given by .
Here measures the model misspecification. If the term becomes zero.
Eigenvalue Analysis.
The matrix is given by
(5) 
We study the bounds of the Discrepancy and the MMD using the eigenvalues of the matrix . This analysis is novel for the MMD and allows us to show that the Discrepancy bound is tighter. Furthermore, we need this analysis in the next section to motivate the Nuclear Discrepancy. Mansour et al. [15] show that the Discrepancy, in terms of the eigenvalues of , is given by
(6) 
in case  is the linear kernel.³ From this point on we assume that the eigenvalues of  are sorted by absolute value, where  is the largest absolute eigenvalue. The next original theorem shows how the MMD can be computed using .

³ The Discrepancy can also be computed for any arbitrary kernel by replacing  by  [5]; see the supplementary material for more details. All our theoretical results that follow are applicable to both  and  and are thus applicable to any kernel. For simplicity we use  in the main text.
Theorem 4
Under the conditions of Theorem 2 we have
(7) 
In this case , and by comparing Equation 6 and Equation 7 we can show that . Thus the Discrepancy bound (Theorem 3) is tighter than the MMD bound (Theorem 1) under the conditions of Theorem 2. Since the Discrepancy bound is tighter, one could argue that it estimates the expected loss more accurately than the MMD bound. Therefore, one may expect that minimization of the Discrepancy results in better active learning queries than minimization of the MMD.
4 Nuclear Discrepancy
Though it may seem obvious to expect better performance in active learning when tighter bounds are used, the probabilistic analysis given in this section indicates that the Discrepancy will perform worse than the MMD. This, in turn, will lead us to introduce the Nuclear Discrepancy. Our analysis suggests that this bound will improve upon the MMD and the Discrepancy when used for active learning. To start with, we require the following decomposition of the error:
Theorem 5
If , is the squared loss, and ( is trained on ), then
(8) 
where , and where  is the projection of  on the normalized th eigenvector of . Note that . Observe that in the above theorem, the components of  weigh the contribution of each eigenvalue to the error. Essentially, this means that if  points more in the direction of the eigenvector belonging to eigenvalue , this eigenvalue will contribute more to the error.
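The per-eigenvalue contributions in Theorem 5 can be sketched numerically. This is a hedged illustration of the decomposition only: it assumes eigenvalue i is weighed by the squared projection of some vector `u` (standing in for the quantity whose projections appear in the theorem) on the i-th eigenvector of M, with eigenvalues sorted by absolute value as in the text.

```python
import numpy as np

def eigenvalue_contributions(M, u):
    # Sketch of the error decomposition (Eq. 8): each eigenvalue of the
    # symmetric matrix M is weighed by the squared projection of u on
    # the corresponding (normalized) eigenvector.
    lam, V = np.linalg.eigh(M)
    order = np.argsort(-np.abs(lam))   # sort by absolute eigenvalue, largest first
    lam, V = lam[order], V[:, order]
    q = V.T @ u                        # projections of u on the eigenvectors
    return lam * q ** 2                # per-eigenvalue contribution
```

Summing the returned array recovers the (signed) total, which is how the stacked bar charts of Section 5 are built up per eigenvalue.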
Active learning by minimization of the Discrepancy minimizes the largest absolute eigenvalue (Equation 6). In view of the error decomposition above, we can see that the Discrepancy always considers the scenario where  points in the direction of the eigenvector with the largest absolute eigenvalue. One should realize that this is a very specific scenario, in which all other components of  are zero, and one that is very unlikely to occur. This is a worst case scenario, because it maximizes the right hand side of Equation 8 with respect to .
The MMD considers a less pessimistic and more realistic scenario. The MMD active learner minimizes the squared eigenvalues (Equation 7) and thus assumes all eigenvalues contribute to the error. However, the MMD is biased towards minimizing large absolute eigenvalues, because the eigenvalues are squared. This suggests that the MMD assumes that  is more likely to point in the direction of eigenvectors with large absolute eigenvalues, since the objective indicates these eigenvalues are more important to minimize. This is also in a sense pessimistic, since large absolute eigenvalues can contribute more to the error. However, this scenario is more plausible than the scenario considered by the Discrepancy. Because the MMD active learner optimizes its strategy for a scenario that we expect to occur more often in practice, we expect it to improve upon the Discrepancy.
In light of the foregoing, we now propose the more optimistic assumption that any is equally likely. This assumption is not true generally since is the result of a minimization problem. However, we expect this average case scenario to better reflect reality than always assuming a pessimistic scenario like the MMD where points in the direction of large eigenvalues. An active learner that optimizes its strategy for this more realistic scenario will therefore likely improve upon the MMD. Our optimistic assumption leads to the following theorem.
Theorem 6 (Probabilistic generalization bound)
Let and . Then . Assuming each is equally likely we can show that:
(9) 
where are the eigenvalues of the matrix (Equation 5).
This bound indicates that we should minimize all absolute eigenvalues if all  are equally likely to occur. Inspired by this analysis, we define the Nuclear Discrepancy quantity as . This is proportional to the so-called nuclear matrix norm of . Observe that the Nuclear Discrepancy upper bounds the MMD (Equation 7) and the Discrepancy (Equation 6). By upper bounding the Discrepancy in Equation 4 we obtain the following deterministic bound.
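The three objectives are thus three matrix norms of M, and their ordering follows from standard norm inequalities: spectral ≤ Frobenius ≤ nuclear. A small numeric check (constants dropped; the function name is illustrative):

```python
import numpy as np

def bound_quantities(M):
    # Eigenvalue-based objectives of a symmetric matrix M:
    # Discrepancy  ~ spectral norm (largest |eigenvalue|, Eq. 6),
    # MMD          ~ Frobenius-type norm (root of summed squares, Eq. 7),
    # Nuclear Disc ~ nuclear norm (sum of |eigenvalues|).
    lam = np.linalg.eigvalsh(M)
    disc = np.max(np.abs(lam))
    mmd = np.sqrt(np.sum(lam ** 2))
    nd = np.sum(np.abs(lam))
    return disc, mmd, nd
```

The guaranteed ordering disc ≤ mmd ≤ nd mirrors the text: the Discrepancy bound is the tightest and the Nuclear Discrepancy bound the loosest.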
Theorem 7 (Deterministic Nuclear Discrepancy bound)
Assume that for any and that and that is the squared loss. Then given any hypothesis ,
(10) 
where is given in Theorem 3.
The Nuclear Discrepancy bound is looser in the realizable setting than the MMD and Discrepancy bounds. Yet the bound is more optimistic, since it takes an average case scenario into account instead of an unlikely pessimistic scenario. Such average case scenarios may occur more often in practice, and therefore we expect the Nuclear Discrepancy to improve upon the MMD and the Discrepancy when minimized for active learning, since it takes these scenarios explicitly into account in its strategy.
5 Experiments
Experimental Setup and Baselines.
A training set () and test set () are used in each experiment. The training set corresponds to .  is initially empty. After each query, the labeled set is updated, and the model is trained and evaluated on the test set in terms of the mean squared error (MSE). We use the active learners to sequentially select 50 queries. As baselines we use random sampling and a sequential version of the state-of-the-art MMD active learner [3, 20]. We compare the baselines with our novel active learners: the Discrepancy active learner and the Nuclear Discrepancy active learner.
The active learning methods are evaluated on 12 datasets. Some datasets are provided by Cawley and Talbot [2]; the other datasets originate from the UCI Machine Learning Repository [14]. See the supplementary material for the dataset characteristics. Similar to Huang et al. [13], we convert multiclass datasets into two-class datasets. To ease computation times, we subsampled datasets to contain a maximum of 1000 objects. All features were normalized to have zero mean and a standard deviation of one. To make datasets conform to the realizable setting, we use the approach of Cortes and Mohri [5]: we fit a model of our hypothesis set to the whole dataset and use its outputs as labels. To set reasonable hyperparameters, we repeat the following procedure multiple times. This procedure makes sure that the 's in the bounds are small and that the model complexity is appropriate. We randomly select  examples from the dataset and label these. We train a model on these samples and evaluate the MSE on all unselected objects. The hyperparameters that result in the best performance after averaging are used in the active learning experiments. For reproducibility, we give all hyperparameter settings in the supplementary material. The procedure above leads to the hyperparameter  of the Gaussian kernel and the regularization parameter  of the model. We set the hyperparameter of the Gaussian kernel of the MMD according to our analysis in Corollary 1 as .
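The hyperparameter selection procedure above can be sketched as a repeated random-subset evaluation. This is a generic illustration, not the authors' exact protocol: `fit_eval`, the grid, and all default sizes are placeholder assumptions.

```python
import numpy as np

def select_hyperparameters(X, y, grid, fit_eval, n_repeats=10, n_label=20, seed=0):
    # Repeat: randomly label a few examples, train with each candidate
    # hyperparameter setting, and evaluate the MSE on all unselected
    # objects; pick the setting with the best average MSE.
    rng = np.random.default_rng(seed)
    scores = {params: [] for params in grid}
    for _ in range(n_repeats):
        idx = rng.choice(len(X), size=n_label, replace=False)
        mask = np.zeros(len(X), dtype=bool)
        mask[idx] = True
        for params in grid:
            mse = fit_eval(X[mask], y[mask], X[~mask], y[~mask], params)
            scores[params].append(mse)
    return min(grid, key=lambda p: np.mean(scores[p]))
```

`fit_eval` would train the kernel regularized least squares model with the candidate setting and return the MSE on the held-out objects.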
Realizable Setting.
First we benchmark our proposed active learners in the realizable setting. In this setting we are assured that in all bounds and therefore we eliminate effects that arise due to model misspecification. Also the relations between the bounds are guaranteed: the Discrepancy bound is the tightest, and the Nuclear Discrepancy bound is the loosest.
We plot several learning curves in Figure 1. The MSE of the active learner minus the mean performance (per query) of random sampling is displayed on the y-axis (lower is better). Each curve is averaged over 100 runs. Error bars represent the confidence interval computed using the standard error. We summarize our results on all datasets in Table 1 (see page 1), as is usual in active learning [20, 11]. To this end, each time after 5 queries, we compute a two-tailed t-test (significance level ) comparing the 200 MSE results of two active learning methods. If an active learner improves significantly upon another in terms of MSE we count this as a win, if there is no significant difference we count it as a tie, and if it performs significantly worse we count it as a loss.

Observe that the Discrepancy active learner, which has the tightest bound, performs worst in the majority of the cases. Especially the results on the ringnorm dataset are remarkable: here the Discrepancy performs worse than random sampling. In the majority of the cases the proposed Nuclear Discrepancy, with the loosest bound, indeed improves upon the MMD and the Discrepancy active learners.
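The win/tie/loss bookkeeping just described can be sketched as follows, using SciPy's two-tailed independent t-test; the checkpoint layout and array shapes are illustrative assumptions.

```python
import numpy as np
from scipy.stats import ttest_ind

def wins_ties_losses(mse_a, mse_b, alpha=0.05):
    # Compare two active learners at several checkpoints.
    # mse_a, mse_b: (checkpoints x runs) arrays of MSE values.
    # A two-tailed t-test per checkpoint yields a win for A
    # (significantly lower MSE), a tie, or a loss.
    wins = ties = losses = 0
    for a, b in zip(mse_a, mse_b):
        _, p = ttest_ind(a, b)
        if p >= alpha:
            ties += 1
        elif np.mean(a) < np.mean(b):
            wins += 1
        else:
            losses += 1
    return wins, ties, losses
```

Summing the per-checkpoint counts over a dataset yields rows like those of Table 1.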
Error Decomposition.
In the realizable setting we have the advantage that we know the labeling function . This allows us to compute the error decomposition of Equation 8 explicitly. See the supplementary material for details on how to compute the decomposition in case kernels are used. Using this decomposition we can explain the differences in performance between the active learners.
In Figure 2 we show this decomposition of the error using a stacked bar chart during several active learning experiments of the baseline active learner ‘Random’.⁴ Recall that the eigenvalues of the matrix  are sorted by absolute size; here we use EV1 to indicate the absolute largest eigenvalue. The contribution of EV1 is given by  (see also Equation 8). We use the notation EV 2–9 to indicate the summed contribution of the eigenvalues . The mean contribution over 100 runs is shown.

⁴ The graphs for other active learners are qualitatively similar, and for brevity we do not show them here.
Observe that the contribution of the absolute largest eigenvalue to the error is often extremely small in practice: the bar of EV1 is hardly visible in Figure 2. Note that EV1 is represented by the white bar starting at the bottom, which is only visible for the datasets thyroid and german. Therefore, the Discrepancy active learner chooses suboptimal samples: its strategy is optimized for a worst-case scenario that is very rare in practice. The MMD improves upon the Discrepancy, since we observe that the scenario of the MMD is more likely. However, observe that small absolute eigenvalues can contribute substantially to the error; this is especially clear for the ringnorm dataset, where EV 50–650 contribute the most to the error after 10 queries. In practice we did not find evidence that larger absolute eigenvalues are likely to contribute more to the error. This confirms our argument for why the Nuclear Discrepancy can improve upon the MMD: the scenario considered by the Nuclear Discrepancy is more likely to occur in practice.
Agnostic Setting.
The experiments in the realizable setting dealt with a somewhat artificial setting without model mismatch. Now we discuss the results of the agnostic setting, where the original binary labels of the datasets are used. In this setting , but  will be small due to our choice of hyperparameters, and therefore we ignore it during active learning. All results are summarized in Table 1 (see page 1). The learning curves are quite similar to those in the realizable setting; therefore we defer them to the supplementary material. One notable difference is that the learning curves are less smooth and have larger standard errors. Therefore the active learning methods are harder to distinguish, which is reflected in Table 1 by more ties. However, the trends observed in the realizable setting are still present: the Nuclear Discrepancy active learner improves more often upon the MMD than the reverse, and the MMD improves more often upon the Discrepancy than the reverse.
Realizable setting  Agnostic setting  

Dataset  D vs MMD  ND vs D  ND vs MMD  D vs MMD  ND vs D  ND vs MMD 
vehicles  0 / 10 / 0  0 / 10 / 0  0 / 10 / 0  3 / 7 / 0  0 / 9 / 1  1 / 9 / 0 
heart  0 / 3 / 7  10 / 0 / 0  7 / 3 / 0  0 / 10 / 0  0 / 10 / 0  0 / 10 / 0 
sonar  0 / 1 / 9  10 / 0 / 0  4 / 6 / 0  0 / 3 / 7  8 / 2 / 0  3 / 7 / 0 
thyroid  0 / 10 / 0  1 / 9 / 0  1 / 9 / 0  0 / 10 / 0  2 / 8 / 0  2 / 8 / 0 
ringnorm  0 / 0 / 10  10 / 0 / 0  10 / 0 / 0  0 / 1 / 9  7 / 0 / 3  1 / 1 / 8 
ionosphere  0 / 0 / 10  10 / 0 / 0  10 / 0 / 0  1 / 3 / 6  0 / 8 / 2  0 / 7 / 3 
diabetes  0 / 9 / 1  4 / 6 / 0  5 / 3 / 2  1 / 9 / 0  0 / 7 / 3  0 / 8 / 2 
twonorm  0 / 1 / 9  10 / 0 / 0  10 / 0 / 0  0 / 7 / 3  5 / 5 / 0  7 / 3 / 0 
banana  0 / 3 / 7  7 / 3 / 0  0 / 10 / 0  2 / 8 / 0  2 / 8 / 0  6 / 4 / 0 
german  0 / 1 / 9  10 / 0 / 0  10 / 0 / 0  1 / 9 / 0  1 / 9 / 0  2 / 8 / 0 
splice  1 / 9 / 0  9 / 1 / 0  8 / 2 / 0  0 / 8 / 2  6 / 4 / 0  3 / 7 / 0 
breast  1 / 0 / 9  10 / 0 / 0  10 / 0 / 0  0 / 6 / 4  2 / 8 / 0  1 / 9 / 0 
Summary  2 / 47 / 71  91 / 29 / 0  75 / 43 / 2  8 / 81 / 31  33 / 78 / 9  26 / 81 / 13 
6 Discussion
A first issue raised by this work is the following. Whereas the experiments in the realizable setting provide clear insights, the results concerning the agnostic setting are certainly not fully understood. Studying the approximation errors in the bounds may offer further insight. But such a study is not trivial, since the behavior of  in this setting depends on the precise model mismatch and the setting of the hyperparameters.
A second issue of interest is whether our Nuclear Discrepancy bound can be helpful in other settings as well. Ben-David et al. [1] give the Discrepancy bound for the zero-one loss. Given our results, we suspect this bound can be too pessimistic as well. A Nuclear Discrepancy type bound for the zero-one loss is therefore desirable. However, such a bound is non-trivial to compute, since it will involve an integral over the zero-one loss function, and therefore we defer this to future work.
Aside from the implications for active learning, our results have implications for domain adaptation as well. Cortes and Mohri [5] observe that the Discrepancy outperforms the MMD in adaptation. We suspect that the MMD can improve with our suggested choice of . Furthermore, our results suggest that the Nuclear Discrepancy is a promising bound for adaptation; however, it poses a non-trivial optimization problem, which we plan to address in the future.
Finally, in this work we have only considered non-adaptive active learners. Adaptive active learning strategies use label information to choose queries; this additional information may lead to improved performance. For domain adaptation, [6] considers such an adaptive approach to improve upon the Discrepancy. Their bound is, however, not trivial to apply to active learning, because it is intrinsically designed for domain adaptation. Nevertheless, any successful adaptation of our approach in this direction would certainly be valuable.
7 Conclusion
We proposed two novel active learning methods based on generalization bounds and compared them to the state-of-the-art MMD active learner. To investigate the relation between the bounds and active learning performance, we have shown that the Discrepancy bound is the tightest and our proposed Nuclear Discrepancy bound is the loosest.
Even though the Discrepancy bound is tighter, the Discrepancy active learner performs worse than the MMD active learner, even in the realizable setting. Our proposed Nuclear Discrepancy, which has the loosest bound, improves significantly upon both the MMD and the Discrepancy in terms of the mean squared error in the realizable setting.
We explain this counterintuitive result by showing that the Discrepancy and MMD focus too much on pessimistic scenarios that are unlikely to occur in practice. On the other hand, the Nuclear Discrepancy considers an average case scenario, which occurs more often, and therefore performs better. We show that a probabilistic approach is essential: active learners should optimize their strategy for scenarios that are likely to occur in order to perform well.
References
 Ben-David et al. [2010] Shai Ben-David, John Blitzer, Koby Crammer, Alex Kulesza, Fernando Pereira, and Jennifer Wortman Vaughan. A theory of learning from different domains. Machine Learning, 79(1):151–175, 2010.

Cawley and Talbot [2004] Gavin C. Cawley and Nicola L. C. Talbot. Fast exact leave-one-out cross-validation of sparse least-squares support vector machines. Neural Networks, 17(10):1467–1475, 2004.
Chattopadhyay et al. [2012] Rita Chattopadhyay, Zheng Wang, Wei Fan, Ian Davidson, Sethuraman Panchanathan, and Jieping Ye. Batch Mode Active Sampling Based on Marginal Probability Distribution Matching. In Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), pages 741–749, 2012.
Cohn et al. [1994] David Cohn, Les Atlas, and Richard Ladner. Improving generalization with active learning. Machine Learning, 15(2):201–221, 1994.
 Cortes and Mohri [2014] Corinna Cortes and Mehryar Mohri. Domain adaptation and sample bias correction theory and algorithm for regression. Theoretical Computer Science, 519:103–126, 2014.
 Cortes et al. [forthcoming] Corinna Cortes, Mehryar Mohri, and Andres Muñoz Medina. Adaptation Based on Generalized Discrepancy. Journal of Machine Learning Research, forthcoming. URL http://www.cs.nyu.edu/~mohri/pub/daj.pdf.

Ganti and Gray [2012] Ravi Ganti and Alexander Gray. UPAL: Unbiased Pool Based Active Learning. In Proceedings of the 15th International Conference on Artificial Intelligence and Statistics (AISTATS), pages 422–431, 2012.
Gretton et al. [2012] Arthur Gretton, Karsten M. Borgwardt, Malte J. Rasch, Bernhard Schölkopf, and Alexander Smola. A Kernel Two-sample Test. Journal of Machine Learning Research, 13(1):723–773, 2012.
 Gu and Han [2012] Quanquan Gu and Jiawei Han. Towards Active Learning on Graphs: An Error Bound Minimization Approach. In Proceedings of the 12th IEEE International Conference on Data Mining (ICDM), pages 882–887, 2012.
 Gu et al. [2012] Quanquan Gu, Tong Zhang, Jiawei Han, and Chris H Ding. Selective Labeling via Error Bound Minimization. In Proceedings of the 25th Conference on Advances in Neural Information Processing Systems (NIPS), pages 323–331, 2012.
 Gu et al. [2014] Quanquan Gu, Tong Zhang, and Jiawei Han. Batch-Mode Active Learning via Error Bound Minimization. In Proceedings of the 30th Conference on Uncertainty in Artificial Intelligence (UAI), 2014.
 Huang et al. [2007] Jiayuan Huang, Alexander J. Smola, Arthur Gretton, Karsten M. Borgwardt, and Bernhard Schölkopf. Correcting sample selection bias by unlabeled data. In Proceedings of the 19th Conference on Advances in Neural Information Processing Systems (NIPS), pages 601–608, 2007.
 Huang et al. [2010] Sheng-Jun Huang, Rong Jin, and Zhi-Hua Zhou. Active Learning by Querying Informative and Representative Examples. In Proceedings of the 23rd Conference on Advances in Neural Information Processing Systems (NIPS), pages 892–900, 2010.
 Lichman [2013] M Lichman. UCI Machine Learning Repository, 2013. URL http://archive.ics.uci.edu/ml.
 Mansour et al. [2009] Yishay Mansour, Mehryar Mohri, and Afshin Rostamizadeh. Domain Adaptation: Learning Bounds and Algorithms. In Proceedings of the 22nd Annual Conference on Learning Theory (COLT), 2009.
 Mohri et al. [2012] Mehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar. Foundations of Machine Learning. MIT press, Cambridge, Massachusetts, 2012.
 Rifkin et al. [2003] Ryan Rifkin, Gene Yeo, and Tomaso Poggio. Regularized leastsquares classification. Advances in Learning Theory: Methods, Model, and Applications, 190:131–154, 2003.
 Settles [2012] Burr Settles. Active Learning. Synthesis Lectures on Artificial Intelligence and Machine Learning, 6(1):1–114, 2012.
 Shawe-Taylor and Cristianini [2004] J. Shawe-Taylor and N. Cristianini. Kernel Methods for Pattern Analysis. Cambridge University Press, Cambridge, UK, 2004.
 Wang and Ye [2013] Zheng Wang and Jieping Ye. Querying Discriminative and Representative Samples for Batch Mode Active Learning. In Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), pages 158–166, 2013.
8 Background Theory
8.1 Computation of the MMD
The MMD quantity can be computed in practice by rewriting it as follows:
(11)  
(12)  
(13)  
(14) 
In the first step we used that , due to the reproducing property [16, p. 96]. Here  is the mapping from  to the RKHS of . The other steps follow from the linearity of the inner product. In Equation 13 we defined  and similarly for ; note that these are vectors in the RKHS of . The last step follows from the fact that the vector in  maximizing the term in Equation 13 is:
(15) 
This follows from the fact that the inner product between two vectors is maximum if the vectors point in the same direction. Because of the symmetry of with respect to and , it’s straightforward to show that this derivation also holds if we switch and .
We can compute the MMD quantity in practice by working out the norm with kernel products:
(16)  
(17) 
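As a minimal sketch, the kernel-sum expression in Equation 17 can be implemented in a few lines of NumPy; the function names and the choice of a Gaussian kernel here are ours, for illustration only:

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    """Gaussian (RBF) kernel matrix between the rows of A and the rows of B."""
    sq_dists = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * sq_dists)

def mmd(P, Q, gamma=1.0):
    """Empirical MMD between samples P (n x d) and Q (m x d) via Equation 17."""
    n, m = len(P), len(Q)
    Kpp = rbf_kernel(P, P, gamma)
    Kpq = rbf_kernel(P, Q, gamma)
    Kqq = rbf_kernel(Q, Q, gamma)
    mmd_sq = Kpp.sum() / n**2 - 2 * Kpq.sum() / (n * m) + Kqq.sum() / m**2
    return np.sqrt(max(mmd_sq, 0.0))  # clip tiny negative values caused by rounding
```

Note that the computation is symmetric in P and Q, as the derivation above requires.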
8.2 Computation of the Discrepancy
In this section we calculate the discrepancy analytically for the squared loss in the linear kernel, following the derivation of [15, p. 8]. At the end of this section we extend the computation to an arbitrary kernel, following the derivation of [5, Section 5.2]. In our setting all samples have equal weights, therefore this derivation is slightly adapted from [15] and [5]. We first rewrite the discrepancy for the linear kernel K(x, x′) = x^T x′, where each h ∈ H is of the form h(x) = w^T x with ‖w‖ ≤ Λ. As in Theorem 5 we take M = C_P − C_Q, where C_P = (1/n) Σ_{x∈P} x x^T and C_Q = (1/m) Σ_{z∈Q} z z^T, with eigendecomposition M = Σ_i λ_i e_i e_i^T. We directly use the results of Theorem 5.

disc(P, Q) = max_{h, h′ ∈ H} | L_P(h′, h) − L_Q(h′, h) |   (18)

= max_{‖u‖ ≤ 2Λ} | (1/n) Σ_i (u^T x_i)² − (1/m) Σ_j (u^T z_j)² |   (19)

= max_{‖u‖ ≤ 2Λ} | u^T M u |   (20)

= max( max_{‖u‖ ≤ 2Λ} u^T M u , max_{‖u‖ ≤ 2Λ} −u^T M u )   (21)

Here u = w′ − w, so that ‖u‖ ≤ 2Λ. First we solve the first term:

max_{‖u‖ ≤ 2Λ} u^T M u = max_{‖u‖ ≤ 2Λ} Σ_i λ_i (e_i^T u)²   (22)

Observe that Σ_i λ_i (e_i^T u)² provides a weighted combination of eigenvalues. To maximize this equation, we therefore need to put all the weight of u on the largest eigenvalue. Thus the vector that maximizes this is given by a multiple of the eigenvector e_max corresponding to the maximum eigenvalue λ_max:

u* = 2Λ e_max   (23)

Note that ‖u*‖ = 2Λ (since the eigenvector e_max is orthonormal) to maximize the quantity in Equation 22. Substituting the solution u* and using that the eigendecomposition is orthogonal we obtain:

max_{‖u‖ ≤ 2Λ} u^T M u = 4Λ² λ_max   (24)

Now for the maximization of the second term of Equation 21, observe that the solution changes sign (since now we want to place all the weight on the minimum eigenvalue λ_min). Thus we find that:

max_{‖u‖ ≤ 2Λ} −u^T M u = −4Λ² λ_min   (25)

disc(P, Q) = 4Λ² max( λ_max , −λ_min )   (26)

= 4Λ² max_i |λ_i| = 4Λ² ‖M‖₂   (27)

where ‖M‖₂ is also known as the spectral norm of the matrix M, which is given by the largest absolute eigenvalue max_i |λ_i|.
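A small NumPy sketch of the closed form in Equation 27 (the function name is ours, and the hypothesis-ball radius Λ is taken as a parameter, assuming the convention used in the derivation above):

```python
import numpy as np

def discrepancy_linear(P, Q, Lambda=1.0):
    """disc(P, Q) = 4 * Lambda^2 * max_i |lambda_i(M)| with M = C_P - C_Q."""
    n, m = len(P), len(Q)
    M = P.T @ P / n - Q.T @ Q / m        # difference of uncentered covariance matrices
    eigvals = np.linalg.eigvalsh(M)      # M is symmetric, so eigvalsh applies
    return 4 * Lambda**2 * np.max(np.abs(eigvals))
```

The largest absolute eigenvalue of the symmetric matrix M equals its spectral norm, so the same value can be obtained with `np.linalg.norm(M, 2)`.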
Now we can compute the discrepancy in a linear kernel. In an arbitrary kernel we cannot easily compute the covariance matrices of the sets P and Q, since the RKHS of K may be very large or infinite-dimensional, such as for the Gaussian kernel. In the following we rewrite the spectral norm of M in terms of kernel inner products, so the discrepancy can be computed in any arbitrary kernel.

First we introduce the set U = P ∪ Q of size n + m. We assume in the following that the data matrix X_U of U is structured as:

X_U = [ X_P ; X_Q ]   (28)

i.e., the first n rows contain the samples of P and the last m rows the samples of Q. It can be shown that M can be rewritten as [5]:

M = X_U^T W X_U   (29)

where W is an (n + m) × (n + m) diagonal matrix. The matrix W reweights all objects and is given by:

W = [ W_P 0 ; 0 −W_Q ]   (30)

where W_P = (1/n) I_n is a diagonal matrix of size n × n, and W_Q = (1/m) I_m is a diagonal matrix of size m × m.

Since the matrix products AB and BA have the same nonzero eigenvalues [5], and since ‖M‖₂ only depends on the eigenvalues, we can permute the matrices in M = X_U^T W X_U to obtain a new matrix M′ while max_i |λ_i(M)| = max_i |λ_i(M′)|:

M′ = W X_U X_U^T   (31)

= W K_U   (32)

Here K_U is the kernel matrix of the set U, meaning it contains the kernel products K(a, b) for all objects a, b in U. Now M′ only depends on the kernel matrix of U. Note that the kernel matrix should be ordered the same as X_U, thus the kernel matrix is given by:

K_U = [ K_PP K_PQ ; K_QP K_QQ ]   (33)

Now the discrepancy can be computed in any arbitrary kernel using:

disc(P, Q) = 4Λ² max_i |λ_i(W K_U)|   (34)

Note that then we have to compute the largest absolute eigenvalue of the (n + m) × (n + m) matrix W K_U to compute the discrepancy.
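The kernelized computation of Equation 34 can be sketched as follows (names ours; for the linear kernel it must agree with the covariance-based computation, which is a useful check):

```python
import numpy as np

def discrepancy_kernel(K_U, n, m, Lambda=1.0):
    """disc from the (n+m) x (n+m) kernel matrix of U = P ∪ Q (Equation 34).

    Rows/columns of K_U must list the n samples of P first, then the m of Q.
    """
    # Diagonal of W: +1/n for the samples of P, -1/m for the samples of Q.
    w = np.concatenate([np.full(n, 1.0 / n), np.full(m, -1.0 / m)])
    M_prime = w[:, None] * K_U           # M' = W K_U (row-wise reweighting)
    eigvals = np.linalg.eigvals(M_prime) # real up to rounding: M' shares its
                                         # nonzero eigenvalues with symmetric M
    return 4 * Lambda**2 * np.max(np.abs(eigvals))
```

Because W K_U shares its nonzero eigenvalues with M = X_U^T W X_U, the extra zero eigenvalues do not change the maximum absolute eigenvalue.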
9 Proofs
9.1 Proof of Theorem 1
Let l be any loss function, let h ∈ H, and let the function g be any arbitrary function in H′ with ‖g‖_{H′} ≤ Λ₂, where K′ is a PDS kernel with RKHS H′ and Λ₂ is the radius of the ball of functions over which the MMD is computed. Write l_h(x) = l(h(x), f(x)) for the loss of h with respect to the labeling function f. We indicate the empirical average of the functions l_h and g on a set S of k samples by l̄_h(S) = (1/k) Σ_{x∈S} l_h(x) and ḡ(S) = (1/k) Σ_{x∈S} g(x), respectively. We aim to bound the quantity:

| L_P(h, f) − L_Q(h, f) |   (35)

Observe that:

L_P(h, f) = (1/n) Σ_{x∈P} l(h(x), f(x)) = l̄_h(P), and similarly L_Q(h, f) = l̄_h(Q)   (36)

By reordering the terms and applying the triangle inequality twice we can show that:

| l̄_h(P) − l̄_h(Q) | ≤ | ḡ(P) − ḡ(Q) | + | l̄_h(P) − ḡ(P) | + | l̄_h(Q) − ḡ(Q) |   (37)

The first term on the right hand side can be bounded by the MMD quantity, since ‖g‖_{H′} ≤ Λ₂:

| ḡ(P) − ḡ(Q) | ≤ max_{g′ ∈ H′, ‖g′‖ ≤ Λ₂} | ḡ′(P) − ḡ′(Q) | = MMD(P, Q)   (38)

Then we obtain:

| L_P(h, f) − L_Q(h, f) | ≤ MMD(P, Q) + | l̄_h(P) − ḡ(P) | + | l̄_h(Q) − ḡ(Q) |   (39)

Now we simplify the remaining two terms on the right. These terms appear because we may have that l_h ∉ H′. Observe that due to the triangle inequality we have that:

| l̄_h(P) − ḡ(P) | ≤ (1/n) Σ_{x∈P} | l_h(x) − g(x) |   (40)

By maximizing over x we can show that:

(1/n) Σ_{x∈P} | l_h(x) − g(x) | ≤ max_{x∈P} | l_h(x) − g(x) |   (41)

Combining Equations 40 and 41 we have that:

| l̄_h(P) − ḡ(P) | ≤ (1/n) Σ_{x∈P} | l_h(x) − g(x) | ≤ max_{x∈P} | l_h(x) − g(x) |   (42)

Thus we have shown that:

| l̄_h(P) − ḡ(P) | ≤ max_{x∈P} | l_h(x) − g(x) |   (43)

This same result can be derived for the second term:

| l̄_h(Q) − ḡ(Q) | ≤ max_{x∈Q} | l_h(x) − g(x) |   (44)

Now in our setting Q ⊆ P, since the queries are selected from the pool, thus we can bound this using the maximum over P:

max_{x∈Q} | l_h(x) − g(x) | ≤ max_{x∈P} | l_h(x) − g(x) |   (45)

Therefore, we have that:

| l̄_h(Q) − ḡ(Q) | ≤ max_{x∈P} | l_h(x) − g(x) |   (46)

Combining this with Equation 39 we find:

| L_P(h, f) − L_Q(h, f) | ≤ MMD(P, Q) + 2 max_{x∈P} | l_h(x) − g(x) |   (47)

Up to now our results hold for any g ∈ H′ with ‖g‖_{H′} ≤ Λ₂. Now to make the bound tight, we minimize with respect to g:

| L_P(h, f) − L_Q(h, f) | ≤ MMD(P, Q) + 2 min_{g ∈ H′, ‖g‖ ≤ Λ₂} max_{x∈P} | l_h(x) − g(x) |   (48)

Now we make the bound independent of h by maximizing over all h ∈ H, and call the resulting quantity η:

| L_P(h, f) − L_Q(h, f) | ≤ MMD(P, Q) + 2 max_{h∈H} min_{g ∈ H′, ‖g‖ ≤ Λ₂} max_{x∈P} | l_h(x) − g(x) | = MMD(P, Q) + 2η   (49)

Because of the absolute value, the equation below also holds:

L_P(h, f) − L_Q(h, f) ≤ MMD(P, Q) + 2η   (50)

Rewriting we obtain:

L_P(h, f) ≤ L_Q(h, f) + MMD(P, Q) + 2η   (51)

For clarity, we now plug in η:

L_P(h, f) ≤ L_Q(h, f) + MMD(P, Q) + 2 max_{h∈H} min_{g ∈ H′, ‖g‖ ≤ Λ₂} max_{x∈P} | l_h(x) − g(x) |   (52)
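As a numerical sanity check of the MMD step in the chain above (our own illustration, not part of the proof): for the linear kernel with a unit ball of functions, the MMD equals ‖μ_P − μ_Q‖, and any unit-norm linear g satisfies the bound by the Cauchy–Schwarz inequality.

```python
import numpy as np

rng = np.random.default_rng(0)
P = rng.normal(size=(50, 3))   # the pool
Q = P[:10]                     # a queried subset, so Q ⊆ P as in the proof
mu_diff = P.mean(0) - Q.mean(0)
mmd = np.linalg.norm(mu_diff)  # unit-ball MMD for the linear kernel

for _ in range(1000):
    w = rng.normal(size=3)
    w /= np.linalg.norm(w)               # g(x) = w^T x with ||g|| <= 1
    gap = abs(P.mean(0) @ w - Q.mean(0) @ w)
    assert gap <= mmd + 1e-12            # the MMD inequality holds
```

The bound is attained exactly when w points in the direction of μ_P − μ_Q, mirroring the maximizer h* derived for the MMD.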
9.2 Proof of Theorem 2
Let l be the squared loss, and assume the realizable setting, i.e. f ∈ H.
9.2.1 Proof in the Linear Case
First we show the theorem for the case where K is the linear kernel: K(x, x′) = x^T x′.

Let u = w − w_f, where h(x) = w^T x and f(x) = w_f^T x. Then ‖u‖ ≤ 2Λ, since ‖w‖ ≤ Λ and ‖w_f‖ ≤ Λ. The loss function is l_h(x) = (h(x) − f(x))² = (u^T x)².

We define the squared kernel of K as K₂(x, x′) = (x^T x′)². The featuremap Ψ of K₂ that maps from R^d to R^{d²} is given by [19, chap. 9.1] (note that in [19] this kernel is defined as a polynomial kernel K(x, x′) = (x^T x′ + R)^degree; in our case for this polynomial kernel we have that R = 0 and the degree equals 2, resulting in the featuremap given in Equation 53; this is often referred to as the squared kernel):

Ψ(x) = (x_i x_j)_{i,j=1}^d   (53)

Note the kernel K₂ is a PDS kernel since its featuremap exists. The function l_h can be described as l_h(x) = (u^T x)² = ⟨Ψ(u), Ψ(x)⟩. Thus the function l_h is of the form g(x) = ⟨g, Ψ(x)⟩ with g = Ψ(u), thus l_h ∈ H₂, the RKHS of K₂. Furthermore we have that ‖Ψ(u)‖ = ‖u‖² ≤ 4Λ², since ‖u‖ ≤ 2Λ. Thus we have shown that l_h ∈ H₂ with ‖l_h‖ ≤ 4Λ².
In conclusion, if we take K′ = K₂ and Λ₂ = 4Λ² in Theorem 1, we may choose g = l_h for every h ∈ H, so that:

max_{h∈H} min_{g ∈ H₂, ‖g‖ ≤ 4Λ²} max_{x∈P} | l_h(x) − g(x) | = 0   (54)

This will ensure η = 0.
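The two identities that drive this proof, l_h(x) = ⟨Ψ(u), Ψ(x)⟩ and ‖Ψ(u)‖ = ‖u‖², can be verified numerically (a sketch with names of our choosing):

```python
import numpy as np

def psi(x):
    """Featuremap of the squared kernel: all pairwise products x_i * x_j (Eq. 53)."""
    return np.outer(x, x).ravel()

rng = np.random.default_rng(1)
u = rng.normal(size=4)
x = rng.normal(size=4)

# l_h(x) = (u^T x)^2 equals the inner product of the featuremaps
assert np.isclose((u @ x) ** 2, psi(u) @ psi(x))
# the RKHS norm of l_h is the squared Euclidean norm of u
assert np.isclose(np.linalg.norm(psi(u)), u @ u)
```

Both checks are direct consequences of expanding (Σ_i u_i x_i)² into the sum Σ_{i,j} u_i u_j x_i x_j.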
9.2.2 Proof for any Kernel
Now we prove the more general case where K is any kernel.
Transformation 
