One of the most popular tasks in machine learning is supervised learning, in which we estimate a function that maps an input to its label from a finite set of labeled examples called training data. The quality of the learned function is measured by its generalization ability, that is, roughly, its accuracy on previously unseen data. Statistical learning theory is a powerful tool that gives a framework for analysing the generalization errors of learning algorithms (Vapnik, 1998). A great number of learning algorithms have been proposed and their generalization abilities have been analysed in various settings.
In spite of the great successes of supervised learning, it has a fundamental limitation due to the cost of creating training examples. In particular, it is often the case that collecting input data is cheap but labeling them is limited or expensive, which is one of the bottlenecks in supervised learning (Roh et al., 2019). The dilemma is that more labeled data guarantee better generalization ability but incur a higher labeling cost.
In this limited situation, the importance labeling problem naturally arises, which is a special case of active learning (Settles, 2009). In the importance labeling setting, we first collect many unlabeled examples. Then we choose a limited number of examples to be labeled from the unlabeled ones. The most naive selection of labeled examples is uniform subsampling from the unlabeled data. What we expect here is that if we choose the labeled samples effectively, then better generalization ability may be acquired.
Despite the significance of the problem, little is known about the theoretical aspects of importance labeling. The essential question is which importance labeling scheme surpasses standard uniform labeling, and in which settings.
In this paper, we consider this quite general question in the context of least squares regression in Reproducing Kernel Hilbert Spaces (RKHS). The kernel method is a classical and promising approach for learning nonlinear functions (Schölkopf et al., 2002). In the kernel method, input data are mapped to a (potentially) infinite-dimensional feature space and then a linear predictor on the feature space is learned. The feature space is determined by a user-defined kernel function, and numerous kernel functions are known, e.g., the classical Gaussian kernel and the more modern neural tangent kernel (NTK) (Jacot et al., 2018). Least squares regression in RKHS has a long history and its generalization ability has been thoroughly studied in supervised learning settings (Caponnetto and De Vito, 2007; Steinwart et al., 2009; Rosasco and Villa, 2015; Dieuleveut et al., 2016; Rudi and Rosasco, 2017).
We propose CRED, a new importance labeling scheme based on the contribution ratios to the effective dimension of unlabeled data.
The generalization error of gradient descent with CRED for least squares regression in RKHS is theoretically analysed, and the superiority of the algorithm over existing methods is shown under low label noise (i.e., near interpolation) settings.
The algorithm and the theoretical results are extended to random features settings, which resolves the potential computational intractability of CRED.
A comparison of the theoretical generalization errors of our proposed algorithms and the most relevant existing methods is summarised in Table 1.
|Method||Generalization Error||Additional Assumptions|
|(S)GD (Pillaud-Vivien et al., 2018)||a.e.|
|KTR (Jun et al., 2019)||None|
|SSSL (Ji et al., 2012)||, sufficiently large|
|RF-KRLS (Rudi and Rosasco, 2017)||, sufficiently large|
is the variance of label noise, is the uniform upper bound of labels. In column "Additional Assumptions," means the number of random features and means the number of unlabeled data. Please refer to Section 2 for the definitions of these parameters. Extra log factors are hidden for simplicity, where is the confidence parameter for high probability bounds.
Here, we briefly overview the research areas and methods most relevant to our work.
Supervised Learning Supervised least squares regression in RKHS has been thoroughly studied (Yao et al., 2007; Caponnetto and De Vito, 2007; Steinwart et al., 2009; Rosasco and Villa, 2015; Dieuleveut et al., 2016; Rudi and Rosasco, 2017; Lin and Rosasco, 2017; Carratino et al., 2018; Pillaud-Vivien et al., 2018; Jun et al., 2019). In Caponnetto and De Vito (2007) and Steinwart et al. (2009), the generalization error of kernel ridge regression has been studied and it has been shown that the minimax optimal rate is attained under suitable assumptions. In Yao et al. (2007) and Rosasco and Villa (2015), gradient descent for kernel ridgeless regression has been considered and the effect of early stopping as implicit regularization has been theoretically justified. The analysis has been further improved under an additional assumption on the eigenvalue decay (Lin and Rosasco, 2017). Online stochastic gradient descent (SGD) has been studied in Dieuleveut et al. (2016), and the minimax optimal rate has been established when the true function is (nearly) attainable. Recently, Pillaud-Vivien et al. (2018) have considered multi-pass SGD and shown its optimality without attainability of the true function, under an additional assumption on the capacity of the feature space in terms of the infinity norm. The random features technique (Rahimi and Recht, 2008) is applicable to kernel regression and reduces the computational time. The generalization ability of kernel regression with random features has been studied in Rudi and Rosasco (2017) and Carratino et al. (2018), and it has been shown that the random features technique does not hurt the generalization ability when the number of random features is sufficiently large and the true function is attainable. More recently, Jun et al. (2019) have specifically discussed low label noise cases, and their proposed Kernel Truncated Randomized Ridge Regression (KTR) achieves an improved rate when the label noise is low.
Semi-Supervised Learning Semi-supervised learning has a close relation to importance labeling. In semi-supervised learning, we are given many unlabeled data and a small number of labeled data. Typically the labeled data are uniformly selected from the unlabeled data. Semi-supervised learning aims to obtain better generalization ability through the effective use of unlabeled examples, typically under the so-called cluster assumption (Balcan and Blum, 2005; Rigollet, 2007; Ben-David et al., 2008; Wasserman and Lafferty, 2008). In contrast, the importance labeling scheme in this paper aims to obtain better generalization ability through the effective choice of labeled examples, without this assumption. In Ji et al. (2012), a simple semi-supervised kernel regression algorithm called SSSL has been proposed, and it has been shown that its generalization ability surpasses that of supervised learning when the true function is attainable and deterministic. Roughly speaking, the algorithm first computes the eigen-system of the covariance operator in the feature space using unlabeled data. Then, linear regression is executed with the principal eigenfunctions as features. The theory of SSSL does not require the cluster assumption and is built on the standard theoretical settings of kernel regression.
Active Learning Active learning is also closely related to importance labeling. In active learning, we are given a model learned on a small amount of labeled data and then select new data to be labeled from the unlabeled data by utilizing the information of the learned model. In some sense, active learning is a generalization of importance labeling. However, in active learning, how to select the initially labeled data is out of scope and the selection is typically assumed to be uniform. Numerous active learning strategies have been proposed (Brinker, 2003; Dasgupta, 2005; Yu et al., 2006; Kapoor et al., 2007; Guo and Schuurmans, 2008; Wei et al., 2015; Gal et al., 2017; Sener and Savarese, 2017) (see Settles (2009) for an extensive survey) and their performances have been empirically studied, but little is known about their theoretical aspects, at least in our kernel regression setting.
Importance Sampling Importance sampling is a general technique for reducing the variance of estimators and is typically used in Monte Carlo methods and stochastic optimization (Needell et al., 2014; Zhao and Zhang, 2015; Alain et al., 2015; Csiba and Richtárik, 2018; Chen et al., 2019). The underlying idea is that if the realizations that potentially cause large variance are sampled more frequently, the variance of a bias-corrected estimator can be reduced. However, the definition of importance is strongly problem-dependent and, to the best of our knowledge, no algorithm for the importance labeling problem has been proposed so far.
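The variance reduction idea behind importance sampling can be illustrated on a toy finite population. The following is a minimal sketch, not part of the proposed method: the function names `importance_estimate` and `single_draw_variance` and the toy values are hypothetical, and the estimator simply reweights each sampled value by the inverse of its sampling probability so that it stays unbiased for the population mean.

```python
import numpy as np


def importance_estimate(values, probs, idx):
    """Bias-corrected estimate of values.mean() from indices idx drawn i.i.d. from probs."""
    values = np.asarray(values, dtype=float)
    probs = np.asarray(probs, dtype=float)
    n_pop = values.shape[0]
    # Each sampled value is reweighted by 1 / (n_pop * sampling probability).
    weights = 1.0 / (n_pop * probs[idx])
    return float(np.mean(weights * values[idx]))


def single_draw_variance(values, probs):
    """Exact variance of the one-sample bias-corrected estimator under probs."""
    values = np.asarray(values, dtype=float)
    probs = np.asarray(probs, dtype=float)
    n_pop = values.shape[0]
    per_item = values / (n_pop * probs)   # estimator value if item i is drawn
    mean = np.sum(probs * per_item)       # equals values.mean() (unbiasedness)
    return float(np.sum(probs * (per_item - mean) ** 2))


# Toy population in which one item dominates the mean (illustrative values only).
values = np.array([0.1, 0.2, 0.3, 5.0])
uniform = np.full(4, 0.25)
proportional = values / values.sum()

var_uniform = single_draw_variance(values, uniform)
var_proportional = single_draw_variance(values, proportional)
```

In this toy case, sampling proportionally to the (positive) values removes the variance entirely, whereas uniform sampling does not. In practice the ideal distribution is unknown, which is why the definition of importance has to be chosen per problem.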
2 Problem Settings and Assumptions
In this section, we describe the problem settings of this paper and the theoretical assumptions for our analysis.
2.1 Kernel Least Squares Regression with Importance Labeling
Let be i.i.d. samples from some distribution , where , and , . We denote as the marginal distribution of on and as the conditional distribution of with respect to . We subsample () from according to user-defined distribution on and we denote , .
The objective of this paper is to minimize the excess risk only using the information of labeled observations , where and is some Reproducing Kernel Hilbert Space (RKHS) with inner product and kernel .
We denote by the induced norm by and as the Euclidean norm. Let and , where the operator is the natural embedding from to and is the adjoint operator of . We define as for operator . For natural number , we denote by .
2.2 Theoretical Assumptions
Assumption 1 (Boundedness of kernel).
for some .
Assumption 2 (Smoothness of true function).
There exists such that for some with (). Here .
Assumption 2 quantifies the complexity of in terms of the eigen-system of . When , becomes a subset of and particularly , it exactly matches to . As , roughly .
Assumption 3 (Polynomial decay of eigenvalues).
There exists such that .
Parameter characterizes the complexity of feature space .
Assumption 4 (Bounded variance and uniform boundedness of labels).
There exists and such that and almost surely.
Generally label noise , but we are particularly interested in the case .
3 Proposed Algorithm
In this section, we first describe our proposed algorithm. Then the computational aspects of our algorithm are briefly discussed.
Our proposed algorithm is given in Algorithm 1. The algorithm consists of two blocks: importance labeling and optimization by gradient descent.
Importance Labeling Our proposed importance labeling is based on the contribution ratios to the effective dimension (Zhang, 2005). First recall the definition of the effective dimension, that is . The essential intuition of our scheme is that an input that has a large contribution to the effective dimension is more important than other inputs. To realize this intuition, we construct an importance sampling distribution proportional to on unlabeled data. For stability of sampling, we add the mean of the contribution ratios on unlabeled data to it. Finally, since the covariance operator is unknown, we replace it by the empirical covariance operator using unlabeled data.
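The labeling step described above can be sketched for an explicit finite-dimensional feature map. This is only an illustration under stated assumptions, not a reproduction of Algorithm 1: it assumes that the contribution of an input with feature vector phi_i is phi_i^T (Sigma_hat + lam I)^{-1} phi_i, where Sigma_hat is the empirical covariance of the unlabeled features, so that the mean of the scores equals the empirical effective dimension Tr(Sigma_hat (Sigma_hat + lam I)^{-1}). The function names, the regularization parameter `lam`, and the choice of sampling with replacement are hypothetical.

```python
import numpy as np


def contribution_scores(features, lam):
    """Assumed score s_i = phi_i^T (Sigma_hat + lam I)^{-1} phi_i for each unlabeled input."""
    features = np.asarray(features, dtype=float)
    n_unlabeled, dim = features.shape
    # Empirical covariance built from unlabeled data only (no labels are needed).
    cov = features.T @ features / n_unlabeled
    solved = np.linalg.solve(cov + lam * np.eye(dim), features.T)  # shape (dim, n_unlabeled)
    # Row-wise quadratic forms phi_i^T (cov + lam I)^{-1} phi_i.
    return np.einsum("ij,ji->i", features, solved)


def labeling_distribution(features, lam):
    """Distribution proportional to the score plus the mean score (added for stability)."""
    scores = contribution_scores(features, lam)
    mixed = scores + scores.mean()
    return mixed / mixed.sum()


def draw_labeled_indices(probs, n_labels, seed=0):
    """Draw the indices of the inputs to be labeled (with replacement, an assumption)."""
    rng = np.random.default_rng(seed)
    return rng.choice(len(probs), size=n_labels, replace=True, p=probs)
```

Under these assumptions, adding the mean score keeps every labeling probability at least 1/(2N) for N unlabeled inputs, so the inverse-probability weights used later in the optimization stay bounded.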
Optimization by Gradient Descent The optimization process is similar to standard gradient descent on labeled data, but each loss is weighted by the inverse labeling probability to guarantee the unbiasedness of the risk, and the gradient of the bias-corrected risk is then used for updating the solution.
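The weighted update described above can be sketched with explicit features as follows. This is a minimal illustration, assuming the bias-corrected risk (1/(2n)) sum_j (f(x_j) - y_j)^2 / (N q_j), where n is the number of labeled examples, N the number of unlabeled examples, and q_j the labeling probability of the j-th labeled example. The function name, the fixed step size, and the zero initialization are hypothetical choices, not details taken from Algorithm 1.

```python
import numpy as np


def importance_weighted_gd(features, labels, probs, n_unlabeled, step, n_iters):
    """Gradient descent on a squared loss weighted by inverse labeling probabilities."""
    x = np.asarray(features, dtype=float)   # (n_labeled, dim) features of labeled inputs
    y = np.asarray(labels, dtype=float)     # (n_labeled,) observed labels
    # Inverse-probability weights that make the weighted risk unbiased.
    weights = 1.0 / (n_unlabeled * np.asarray(probs, dtype=float))
    n_labeled, dim = x.shape
    w = np.zeros(dim)
    for _ in range(n_iters):
        residual = x @ w - y
        # Gradient of (1 / (2 n)) * sum_j weights_j * residual_j ** 2.
        grad = x.T @ (weights * residual) / n_labeled
        w = w - step * grad
    return w
```

The number of iterations plays the role of the regularization parameter here, so in a noisy setting it would be chosen by early stopping rather than run to convergence.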
Computational Tractability Gradient descent in an RKHS can be executed efficiently even in infinite-dimensional feature spaces thanks to the kernel trick. However, the computation of the contribution ratios to the effective dimension is generally intractable because the kernel trick is not applicable (Schölkopf et al., 2002). This computational problem can be avoided by introducing the random features technique. For the details, see Section 6.
4 Generalization Error Analysis
Here, we give the main theoretical results of CRED-GD (Algorithm 1). The proofs are found in Section B of the supplementary material. We use and notation to hide extra factors for simple statements, where is a confidence parameter for high probability bounds.
Our analysis starts from the bias-variance decomposition , where is the GD path on the excess risk, i.e., with . The first term is called the bias and can be bounded by the following lemma:
Lemma 4.1 (Bias bound, simplified version of Lemma a.1).
Next, the second term, called the variance, can be bounded as follows:
Proposition 4.2 (Variance bound, simplified version of Proposition b.1).
Suppose that is sufficiently small. Let , , and and . Then there exists an event with such that
Proposition 4.2 is the main novelty of our analysis. In (Pillaud-Vivien et al., 2018), the variance bound of the standard GD is roughly in our settings. In contrast, our bound is roughly for and sufficiently large . Since holds, CRED-GD improves the variance bound of the standard GD when is small. Later, we discuss the case (see Lemma 4.3 and Section 5).
In Pillaud-Vivien et al. (2018), under Assumption 1 and the additional assumption for some and , the authors have shown that (Lemma 13 in Pillaud-Vivien et al. (2018)), which is a better bound than ours in Lemma 4.3 when . However, in the worst case their bound matches ours in Lemma 4.3. For an example of this case, see Section 5.
For balancing the bias and variance term, we introduce a notion of the optimal number of iterations:
Definition 4.1 (Optimal number of iterations).
Optimal number of iterations for CRED-GD is defined by , where is defined as
Theorem 4.4 (Generalization Error of CRED-GD).
Wider Optimality on General Noise Settings When , the generalization error of CRED-GD with sufficiently many unlabeled data attains the optimal rate . The same rate is also achieved by supervised GD or SGD, but only under the restrictive condition in our theoretical settings (Dieuleveut et al., 2016; Pillaud-Vivien et al., 2018), which is not necessary for CRED-GD.
Low Noise Acceleration When , the rate of CRED-GD with sufficiently many unlabeled data becomes . In contrast, supervised GD or SGD only achieves in our theoretical settings when , and thus CRED-GD significantly improves the generalization ability of supervised methods. Semi-supervised method SSSL (Ji et al., 2012) only achieves when and , which is worse than ours.
Equivalence to Kernel Ridge Regression with Importance Labeling Using arguments very similar to those of our analysis, it can be shown that the analytical kernel ridge regression solution also achieves the generalization error bound in Theorem 4.4 (see Section C of the supplementary material). When is extremely small, the analytical solution is computationally cheaper than gradient descent and is sometimes useful.
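For illustration, a closed-form importance-weighted kernel ridge regression solution can be sketched as follows. This is an assumption-laden sketch rather than the estimator analysed in Section C of the supplementary material: it minimizes (1/(2n)) sum_i w_i (f(x_i) - y_i)^2 + (lam/2) ||f||^2 over the RKHS, for which the representer theorem gives f = sum_j alpha_j k(x_j, .) with (W K + n lam I) alpha = W y, where W is the diagonal matrix of inverse-probability weights. The Gaussian kernel, the function names, and the parameter `lam` are hypothetical choices.

```python
import numpy as np


def gaussian_kernel(a, b, bandwidth=1.0):
    """Gaussian kernel matrix k(a_i, b_j) = exp(-||a_i - b_j||^2 / (2 bandwidth^2))."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    sq_dists = (
        np.sum(a**2, axis=1)[:, None] + np.sum(b**2, axis=1)[None, :] - 2.0 * a @ b.T
    )
    # Clip tiny negative values caused by floating-point cancellation.
    return np.exp(-np.maximum(sq_dists, 0.0) / (2.0 * bandwidth**2))


def weighted_krr_fit(kernel_matrix, labels, weights, lam):
    """Solve (W K + n lam I) alpha = W y for the weighted ridge coefficients."""
    k = np.asarray(kernel_matrix, dtype=float)
    y = np.asarray(labels, dtype=float)
    w = np.asarray(weights, dtype=float)
    n = k.shape[0]
    return np.linalg.solve(w[:, None] * k + n * lam * np.eye(n), w * y)


def weighted_krr_predict(kernel_test_train, alpha):
    """Evaluate f(x) = sum_j alpha_j k(x, x_j) at the test inputs."""
    return np.asarray(kernel_test_train, dtype=float) @ alpha
```

With all weights equal to one, this reduces to the usual kernel ridge regression solution (K + n lam I)^{-1} y. The linear solve costs on the order of n^3 operations, which is cheap only when the number of labeled examples n is small.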
5 Sufficient Condition for
In this section, we give a sufficient condition for and its simple example. The proofs are found in Section D of the supplementary material.
Let () be the eigen-system of in , where . Assume that and for any for some and . Moreover if , we additionally assume for any for some . Then Assumption 1 is satisfied and for any ,
Let and , that is the product measure of truncated normal distributions with mean and scale parameter , i.e., independent normal distributions with mean and variance conditioned on . Let . We denote as the variance of for . Note that for sufficiently small , we have for any . Then we particularly consider linear kernel and thus . Since the covariance matrix is , the eigen-system of in is , where for . Suppose that the polynomial decay of holds: . Then from Lemma 4.3, . On the other hand, from Proposition 5.1 with , we have .
6 Extension to Random Features Settings
In this section, we discuss the application of random features technique to Algorithm 1 for computational tractability. Then we theoretically analyse the generalization error of the algorithm. The proofs are given in Section E of the supplementary material.
Suppose that kernel has an integral representation for for some . Random features , where independently, are used for an approximation of by . Here, the number of random features is a user-defined parameter and characterizes the quality of the approximation.
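As a concrete instance of such an integral representation, the Gaussian kernel can be approximated by the random Fourier features of Rahimi and Recht (2008). The sketch below assumes the Gaussian kernel exp(-||x - x'||^2 / (2 bandwidth^2)), for which frequencies are drawn from a normal distribution with standard deviation 1/bandwidth and phases uniformly from [0, 2 pi]. The function name and parameter names are hypothetical, and other kernels require other sampling distributions.

```python
import numpy as np


def random_fourier_features(inputs, n_features, bandwidth=1.0, seed=0):
    """Map inputs to features whose inner products approximate a Gaussian kernel."""
    x = np.asarray(inputs, dtype=float)
    rng = np.random.default_rng(seed)
    # Frequencies sampled from the spectral density of the Gaussian kernel.
    freqs = rng.normal(scale=1.0 / bandwidth, size=(x.shape[1], n_features))
    # Random phases make each cosine feature an unbiased kernel estimate.
    phases = rng.uniform(0.0, 2.0 * np.pi, size=n_features)
    return np.sqrt(2.0 / n_features) * np.cos(x @ freqs + phases)
```

For a feature matrix `z` returned by this function, `z @ z.T` approximates the kernel matrix, with an entrywise error that shrinks roughly like the inverse square root of the number of random features. All covariance computations can then be carried out on explicit finite-dimensional matrices.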
The random features version of CRED-GD is given in Algorithm 2. The only difference from Algorithm 1 is the replacement of by the random features . Note that we can properly compute the importance labeling distribution using standard SVD solvers thanks to the random features technique.
We need the following additional assumption for theoretical analysis:
We define by and by the adjoint of . Then we denote and .
Generalization Error Analysis
We consider generalization error . We decompose the generalization error into bias and variance:
where is the path of GD with RF on the excess risk, i.e., with . The bias term can be bounded similarly to Lemma 4.1:
Lemma 6.1 (Bias bound for RF setting, simplified version of Lemma e.1).
Compared to Lemma 4.1, the additional condition is assumed. This implies that an appropriately large number of random features is required to make the bias small.
The variance conditioned on random features can be bounded in exactly the same manner as in the proof of Proposition 4.2, replacing by and by . The latter is trivially bounded by . The key lemma for bounding is the following:
Lemma 6.2 (Proposition 10 in (Rudi and Rosasco, 2017)).
Suppose that Assumption 5 holds. We denote for . For any and sufficiently small , if , with probability at least it holds that
Combining the bias and variance bounds with Lemma 6.2 yields the following theorem:
7 Numerical Experiments
In this section, numerical results are provided to empirically verify our theoretical findings.
Experimental Settings In our experiments, the input data of the public datasets MNIST and Fashion MNIST (Xiao et al., 2017) were used. First, we randomly split each dataset into train () and test () sets and normalized the input data by dividing . We conducted both linear regression (LR) and nonlinear regression (NLR) tasks. For the linear tasks, we used the original inputs with bias as features. For the nonlinear tasks, we used a randomly initialized fully connected ReLU network with three hidden layers of width and without an output layer as features. Here, the random weights were drawn from i.i.d. standard normal distributions. Then we randomly generated true linear functions¹ on the feature spaces and generated noisy labels based on them, where the noises were drawn from i.i.d. normal distributions with mean and variance . We compared our proposed method² with KRR (Kernel Ridge Regression), KTR (Jun et al., 2019) and SSSL (Ji et al., 2012). The hyper-parameters were fairly and reasonably determined.³ The train data was used as unlabeled data and the labeled data was selected from it. The number of labeled data ranged in . We independently ran each experiment five times and recorded the median of the test RMSE for each setting.
¹ Each true function was generated by , where and was the eigen-system of the covariance matrix in the corresponding feature space.
² As mentioned before, we used very small synthetic label noise in some experiments, and the convergence of gradient descent was then sometimes quite slow. Hence we replaced the optimization methods with analytical methods. As pointed out at the end of Section 4, the same generalization error bound is guaranteed for the analytical solution.
³ CRED has hyper-parameter and selecting the best one requires additional labeling. In our experiments, we recorded the best test error by trying in . This potentially violates fair comparison with the other methods because CRED implicitly uses ten patterns of labeled data. Hence the other methods were run ten times with independent uniform labeling and the best test error was recorded as one experimental trial.
Results Figure 1 shows the comparisons of the test RMSE of our proposed method with previous methods. In all cases, our method consistently outperformed the other methods. In particular, when the label noise was small, our method achieved much smaller test RMSE than the other methods.
8 Conclusion and Future Work
In this paper, we proposed a new importance labeling scheme called CRED. The generalization error of GD with CRED was theoretically analysed, and a much better bound than those of previous methods was derived when the label noise is small. Further, the algorithm and analysis were extended to random features settings, which resolved the computational intractability of CRED. Finally, we provided numerical comparisons with existing methods. The numerical results showed empirical superiority over the other methods and verified our theoretical findings.
One direction of future work would be an application of our importance labeling idea to deep learning. Since the feature space of a deep neural network is updated during training, our importance labeling scheme can be naturally extended to active learning settings. The theoretical and empirical study of the application of our importance labeling idea to active learning of deep neural networks is a promising direction for future work.
TS was partially supported by JSPS KAKENHI (18K19793, 18H03201, and 20H00576), Japan DigitalDesign, and JST CREST.
- Alain et al. (2015). Variance reduction in SGD by distributed importance sampling. arXiv preprint arXiv:1511.06481.
- Balcan and Blum (2005). A PAC-style model for learning from labeled and unlabeled data. In International Conference on Computational Learning Theory, pp. 111–126.
- Ben-David et al. (2008). Does unlabeled data provably help? Worst-case analysis of the sample complexity of semi-supervised learning. In COLT, pp. 33–44.
- Brinker (2003). Incorporating diversity in active learning with support vector machines. In Proceedings of the 20th International Conference on Machine Learning (ICML-03), pp. 59–66.
- Caponnetto and De Vito (2007). Optimal rates for the regularized least-squares algorithm. Foundations of Computational Mathematics 7 (3), pp. 331–368.
- Carratino et al. (2018). Learning with SGD and random features. In Advances in Neural Information Processing Systems, pp. 10192–10203.
- Chen et al. (2019). Fast and accurate stochastic gradient estimation. In Advances in Neural Information Processing Systems, pp. 12339–12349.
- Csiba and Richtárik (2018). Importance sampling for minibatches. The Journal of Machine Learning Research 19 (1), pp. 962–982.
- Dasgupta (2005). Analysis of a greedy active learning strategy. In Advances in Neural Information Processing Systems, pp. 337–344.
- Dieuleveut et al. (2016). Nonparametric stochastic approximation with large step-sizes. The Annals of Statistics 44 (4), pp. 1363–1399.
- Gal et al. (2017). Deep Bayesian active learning with image data. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pp. 1183–1192.
- Guo and Schuurmans (2008). Discriminative batch mode active learning. In Advances in Neural Information Processing Systems, pp. 593–600.
- Jacot et al. (2018). Neural tangent kernel: convergence and generalization in neural networks. In Advances in Neural Information Processing Systems, pp. 8571–8580.
- Ji et al. (2012). A simple algorithm for semi-supervised learning with improved generalization error bound. arXiv preprint arXiv:1206.6412.
- Jun et al. (2019). Kernel truncated randomized ridge regression: optimal rates and low noise acceleration. In Advances in Neural Information Processing Systems, pp. 15332–15341.
- Kapoor et al. (2007). Active learning with Gaussian processes for object categorization. In 2007 IEEE 11th International Conference on Computer Vision, pp. 1–8.
- Lin and Rosasco (2017). Optimal rates for multi-pass stochastic gradient methods. The Journal of Machine Learning Research 18 (1), pp. 3375–3421.
- Needell et al. (2014). Stochastic gradient descent, weighted sampling, and the randomized Kaczmarz algorithm. In Advances in Neural Information Processing Systems, pp. 1017–1025.
- Pillaud-Vivien et al. (2018). Statistical optimality of stochastic gradient descent on hard learning problems through multiple passes. In Advances in Neural Information Processing Systems, pp. 8114–8124.
- Rahimi and Recht (2008). Random features for large-scale kernel machines. In Advances in Neural Information Processing Systems, pp. 1177–1184.
- Rigollet (2007). Generalization error bounds in semi-supervised classification under the cluster assumption. Journal of Machine Learning Research 8 (Jul), pp. 1369–1392.
- Roh et al. (2019). A survey on data collection for machine learning: a big data-AI integration perspective. IEEE Transactions on Knowledge and Data Engineering.
- Rosasco and Villa (2015). Learning with incremental iterative regularization. In Advances in Neural Information Processing Systems, pp. 1630–1638.
- Rudi and Rosasco (2017). Generalization properties of learning with random features. In Advances in Neural Information Processing Systems, pp. 3215–3225.
- Schölkopf et al. (2002). Learning with kernels: support vector machines, regularization, optimization, and beyond. MIT Press.
- Sener and Savarese (2017). Active learning for convolutional neural networks: a core-set approach. arXiv preprint arXiv:1708.00489.
- Settles (2009). Active learning literature survey. Technical report, University of Wisconsin-Madison, Department of Computer Sciences.
- Steinwart et al. (2009). Optimal rates for regularized least squares regression. In COLT, pp. 79–93.
- Vapnik (1998). Statistical learning theory. Wiley, New York.
- Wasserman and Lafferty (2008). Statistical analysis of semi-supervised regression. In Advances in Neural Information Processing Systems, pp. 801–808.
- Wei et al. (2015). Submodularity in data subset selection and active learning. In International Conference on Machine Learning, pp. 1954–1963.
- Xiao et al. (2017). Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747.
- Yao et al. (2007). On early stopping in gradient descent learning. Constructive Approximation 26 (2), pp. 289–315.
- Yu et al. (2006). Active learning via transductive experimental design. In Proceedings of the 23rd International Conference on Machine Learning, pp. 1081–1088.
- Zhang (2005). Learning bounds for kernel regression using effective data dimensionality. Neural Computation 17 (9), pp. 2077–2098.
- Zhao and Zhang (2015). Stochastic optimization with importance sampling for regularized loss minimization. In International Conference on Machine Learning, pp. 1–9.
Appendix A Auxiliary Results
First we introduce GD path on the excess risk:
with for .
Lemma A.1 (Proposition 2 and Extension of Lemma 16 in [Lin and Rosasco, 2017]).
. Observe that . This finishes the proof. ∎
Suppose that Assumption 1 holds. For any ,
From Assumption 1, we immediately obtain the claim. ∎
Lemma A.4 (Spectral filters).
Let for and . Also we define and . Then the following inequalities hold:
for any and
for any and .
When , the inequalities always hold and so we assume . Note that . The first inequality is trivial because . We show the second inequality. Note that . Observe that, from elementary calculus, function for is maximized at and has maximum value . This finishes the proof. ∎
for . will be defined later (see Definition 4.1). Then we define , where is uniformly at random on .
Suppose that Assumption 1 holds. Let and . Suppose that . When , with probability at least