Over the last several years, feature evolvable learning has drawn extensive attention [ZZL16, HZZ17a, HZ18], where old features vanish and new features emerge as data streams arrive continuously. Previous studies consider various problem settings. For instance, in FESL [HZZ17a], there is an overlapping period in which old and new features exist simultaneously when the feature space switches. Hou and Zhou [HZ18] investigate the scenario where, when old features disappear, part of them survive and continue to exist alongside the newly arriving features; in other words, the old and new feature spaces share overlapping features. Zhang et al. [ZZL16] study the case where each new sample has at least as many features as the old ones, rendering trapezoidal data streams. Subsequent works consider the situation where features can vary arbitrarily at different time steps under certain assumptions [BAW19, HWW19].
Note that the setting of feature evolvable learning is different from transfer learning [PY10] or domain adaptation [Jia08, SSW15]. Transfer learning usually assumes that data come in batches rather than in streaming form. One exception is online transfer learning [ZHWL14], in which data from both feature spaces arrive sequentially. However, it assumes that all the feature spaces appear simultaneously throughout the whole learning process, an assumption that does not hold in feature evolvable learning. Domain adaptation usually assumes that the data distribution changes across domains while the feature spaces stay the same, which is evidently different from the setting of feature evolvable learning.
These conventional feature evolvable learning methods all assume that a label is revealed in each round. However, in real applications, labels may be rarely given during the whole learning process. For example, in an object detection system, a robot takes high-frame-rate pictures to learn the names of different objects. Like a child learning in the real world, the robot receives names from humans only rarely. Thus we face an online semi-supervised learning problem. We focus on manifold regularization, which assumes that similar samples should have the same label and which has been successfully applied in many practical tasks [ZLR05]. However, this method needs to store previous samples, which poses a challenge on storage. Besides, different devices have different storage budgets, and even the available storage of the same device may differ at different times. Thus it is important to fit our method to different storage situations and maximize its performance [HZZ17b].
In this paper, we propose a new setting: Storage-Fit Feature-Evolvable streaming Learning (SFEL). We focus on FESL [HZZ17a]; other feature evolvable learning methods based on online learning techniques can also adapt to our framework. Our contributions are threefold.
Both theoretical and experimental results show that our method can always follow the best baseline at any time step, and thus always performs well during the whole learning process in the new feature space despite the fact that only few data have emerged at the beginning. This is a fundamental requirement in the feature evolvable learning scenario, and FESL cannot achieve this goal when labels are barely given.
Besides, the experimental results show that manifold regularization plays an important role when only few labels are available.
Finally, we theoretically and experimentally validate that a larger buffer brings better performance. Therefore, our method can fit different storage situations by taking full advantage of the budget.
We focus on binary classification task. In each round of the learning process, the learner observes an instance and gives its prediction. After the prediction has been made, the true label is revealed with a small probability. Otherwise, the instance remains unlabeled. The learner updates its predictor based on the observed instance and the label, if any.
In this paper, we focus on FESL [HZZ17a]; other feature evolvable learning methods based on online learning techniques can also adapt to our framework. Figure 1 presents how streaming data arrive in FESL:
For $t = 1, \dots, T_1 - B$, the first period, in each round the learner observes a vector $\mathbf{x}_t^{S_1} \in \mathbb{R}^{d_1}$ from the feature space $S_1$, where $d_1$ is the number of features of $S_1$ and $T_1$ is the number of total rounds in $S_1$.
For $t = T_1 - B + 1, \dots, T_1$, the second period, in each round the learner observes two vectors $\mathbf{x}_t^{S_1}$ and $\mathbf{x}_t^{S_2}$ from $S_1$ and $S_2$, where $d_2$ is the number of features of $S_2$.
For $t = T_1 + 1, \dots, T_1 + T_2$, the third period, in each round the learner observes a vector $\mathbf{x}_t^{S_2} \in \mathbb{R}^{d_2}$ from $S_2$, where $T_2$ is the number of rounds in $S_2$. Besides, the overlapping period length $B$ is usually small, so we can omit the streaming data from $S_2$ on the overlapping rounds, since they have only a minor effect on training the model in $S_2$.
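For concreteness, the three-period stream can be simulated as follows. This is a hypothetical sketch: the period lengths, feature dimensions, and label probability are illustrative values, not the paper's.

```python
import numpy as np

rng = np.random.default_rng(0)
T1, B, T2 = 100, 10, 100      # period lengths (illustrative)
d1, d2 = 5, 8                 # feature dimensions of S1 and S2
label_prob = 0.1              # chance that the true label is revealed

def fesl_stream():
    """Yield (x_S1, x_S2, y_or_None) for one FESL cycle."""
    for t in range(1, T1 + T2 + 1):
        y = 1 if rng.random() < 0.5 else -1
        y_obs = y if rng.random() < label_prob else None
        if t <= T1 - B:              # first period: only old features
            yield rng.normal(size=d1), None, y_obs
        elif t <= T1:                # second period: overlap, both spaces
            yield rng.normal(size=d1), rng.normal(size=d2), y_obs
        else:                        # third period: only new features
            yield None, rng.normal(size=d2), y_obs

stream = list(fesl_stream())
```

The learner sees only the yielded tuple each round; most rounds carry `None` in place of the label.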
These three periods recur again and again, forming cycles, and each cycle includes exactly two feature spaces. Thus we only need to focus on one cycle; it is easy to extend to the case with multiple cycles. Besides, the old features in one cycle vanish simultaneously: consider the example of ecosystem protection, where all the sensors share the same expected lifespan and thus wear out at the same time. The case where old features vanish asynchronously has been studied in PUFE [HZZ19], which can adapt to our framework as well.
In FESL a linear predictor is adopted, whereas, to be general, a nonlinear predictor is chosen in our paper. Let $\kappa_i$ denote a kernel over $\mathbb{R}^{d_i} \times \mathbb{R}^{d_i}$ and $\mathcal{H}_i$ the corresponding Reproducing Kernel Hilbert Space (RKHS) [SS02], where $i \in \{1, 2\}$ indexes the feature space. Denote by $f_{i,t} \in \mathcal{H}_i$ the predictor learned from the $i$-th feature space in the $t$-th round. The loss function $\ell(f(\mathbf{x}), y)$ is convex in its first argument; for example, in a classification task, we have the logistic loss $\ln(1 + \exp(-y f(\mathbf{x})))$, the hinge loss $\max(0, 1 - y f(\mathbf{x}))$, etc.
If labels are fully provided, the loss suffered by the predictor in each round is simply the prediction loss mentioned above. Then the most straightforward baseline algorithm is to apply online gradient descent [Zin03] on rounds $1, \dots, T_1$ with the streaming data from $S_1$, and to invoke it again on rounds $T_1+1, \dots, T_1+T_2$ with the streaming data from $S_2$. The models are updated according to
$f_{i,t+1} = f_{i,t} - \tau_t \nabla \ell\big(f_{i,t}(\mathbf{x}_t^{S_i}), y_t\big), \quad i \in \{1, 2\},$   (1)
where $\nabla \ell$ is the gradient of the loss function at the current round and $\tau_t$ is a time-varying step size.
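A minimal sketch of this baseline, assuming a linear predictor with logistic loss and a $1/\sqrt{t}$ step size; the toy stream and the every-fifth-round label schedule are invented for illustration:

```python
import numpy as np

def nogd_logistic(stream, d, eta0=1.0):
    """Baseline sketch: online gradient descent with logistic loss,
    updating only on rounds where a label is revealed."""
    w = np.zeros(d)
    preds = []
    for t, (x, y) in enumerate(stream, start=1):
        preds.append(1.0 if w @ x >= 0 else -1.0)   # predict before label
        if y is not None:
            # gradient of log(1 + exp(-y * w.x)) with respect to w
            grad = -y * x / (1.0 + np.exp(y * (w @ x)))
            w -= eta0 / np.sqrt(t) * grad           # time-varying step size
    return w, preds

# toy stream: y is the sign of the first feature, revealed every 5th round
rng = np.random.default_rng(0)
data = [(x, (1.0 if x[0] > 0 else -1.0) if i % 5 == 0 else None)
        for i, x in enumerate(rng.normal(size=(300, 3)))]
w, preds = nogd_logistic(data, d=3)
```

On this toy stream the learned weight on the informative first feature grows consistently, while the weights on the noise features stay near zero.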
Unfortunately, on the one hand, we cannot obtain a label in every round. On the other hand, this baseline method wastes the effort spent on both data collecting and model training in the old feature space $S_1$. Thus, our goal is to leverage the model learned from $S_1$ during the first period to boost the learning from $S_2$ during the third period, under the circumstance where labels are rarely provided.
3 Our Approach
When the feature space changes, the effort of data collecting and model training in the old feature space is wasted by the baseline method, since we cannot observe data from $S_1$ when $t > T_1$ and thus the model learned from $S_1$ cannot be used directly. To tackle this challenge, FESL learns a mapping $\psi$ between $S_2$ and $S_1$ by least squares during the overlapping period. Then when $S_1$ disappears, we can leverage this mapping to map the new data $\mathbf{x}_t^{S_2}$ into $S_1$ and thereby recover the data from $S_1$, i.e., $\psi(\mathbf{x}_t^{S_2})$. In this way, the well-learned model $f_1$ from $S_1$ can make good predictions on the recovered data and update itself with them. Concurrently, a new model $f_2$ is learned in $S_2$, and another prediction on $\mathbf{x}_t^{S_2}$ is also made. At the beginning, $f_1$'s prediction is good thanks to the well-learned predictor, while $f_2$'s prediction is poor due to limited data. But after some time, $f_1$ may become worse because of the accumulated error brought by the inaccurate mapping, while $f_2$ becomes better with more and more data.
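The overlap-period mapping can be sketched as a linear least-squares fit on the paired samples. This is a hypothetical example: the linear ground truth, dimensions, and noise level are invented for illustration, and FESL's actual mapping may differ in detail.

```python
import numpy as np

rng = np.random.default_rng(1)
d1, d2, B = 4, 6, 50                    # dimensions and overlap length
M_true = rng.normal(size=(d2, d1))      # ground-truth linear relation
X2 = rng.normal(size=(B, d2))           # overlap-period data from S2
X1 = X2 @ M_true + 0.01 * rng.normal(size=(B, d1))   # paired S1 data

# least-squares estimate of the mapping psi: S2 -> S1
M_hat, *_ = np.linalg.lstsq(X2, X1, rcond=None)

x2_new = rng.normal(size=d2)            # a round after S1 has vanished
x1_recovered = x2_new @ M_hat           # recovered old-space features
```

With enough overlap rounds relative to the noise, the recovered features are close to what the vanished sensors would have produced, so the old model remains usable.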
FESL dynamically combines these two changing predictions with weights calculated from the loss of each base model. In this way, it achieves the fundamental goal in feature evolvable learning, i.e., it can always follow the best base model at any time step and thus always performs well during the whole learning process in the new feature space. However, in our scenario, labels are rarely given, so the losses of the base models can seldom be calculated, and FESL fails to achieve the fundamental goal. Our basic idea is to leverage an online semi-supervised learning method to calculate a risk for each base model even when no label is provided, thereby realizing the fundamental goal again. However, this raises a challenge on storage and computation, since manifold regularization needs to store all the observed instances. Thus we use a buffering strategy to alleviate this problem, enabling our method to fit different storage budgets and maximize its performance.
3.1 Manifold Regularization
With limited labels, we face an online semi-supervised learning problem. There are several convex semi-supervised learning methods, e.g., manifold regularization and multi-view learning, whose batch risk is a sum of convex functions of the predictor. For such convex semi-supervised learning methods, one can derive a corresponding online semi-supervised learning algorithm using online convex programming [GLZ08]. We focus on manifold regularization; the online versions of multi-view learning and other convex semi-supervised learning methods can be derived similarly.
In online learning, the learner only has access to the input sequence up to the current time. We thus use manifold regularization to define the instantaneous regularized risk at time $t$ as
$J_t(f) = \frac{T}{l}\,\delta_t\,\ell\big(f(\mathbf{x}_t), y_t\big) + \frac{\lambda_1}{2}\,\|f\|_{\mathcal{H}_i}^2 + \lambda_2 \sum_{j=1}^{t-1} w_{jt}\,\big(f(\mathbf{x}_j) - f(\mathbf{x}_t)\big)^2,$   (2)
where $l$ is the number of labeled samples, $\delta_t \in \{0,1\}$ indicates whether the label is given or not, $f$ is the predictor learned in the $i$-th feature space, and $w_{jt}$ is the edge weight, which defines a graph over the samples, such as a fully connected graph with Gaussian weights $w_{jt} = \exp(-\|\mathbf{x}_j - \mathbf{x}_t\|^2 / (2\sigma^2))$. The last term in $J_t$ is the manifold regularization term, which involves the graph edges from $\mathbf{x}_t$ to all previous samples up to time $t$. The factor $T/l$ in the first term of (2) is the empirical estimate of the inverse label probability, which we assume is given and easily determined from the rate at which humans can label the data at hand.
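A functional sketch of such an instantaneous risk, using the hinge loss as the convex loss; the RKHS-norm penalty of the full objective is omitted here for brevity, and all parameter values are illustrative:

```python
import numpy as np

def rbf(a, b, sigma=1.0):
    return np.exp(-np.sum((a - b) ** 2) / (2 * sigma ** 2))

def instantaneous_risk(f, x_t, y_t, X_prev, inv_p, lam2=0.1):
    """Manifold-regularized instantaneous risk sketch: a scaled hinge
    loss when a label is revealed, plus a graph term linking x_t to
    all previously seen samples."""
    loss = inv_p * max(0.0, 1.0 - y_t * f(x_t)) if y_t is not None else 0.0
    manifold = sum(rbf(x_j, x_t) * (f(x_j) - f(x_t)) ** 2 for x_j in X_prev)
    return loss + lam2 * manifold

f = lambda x: x[0]                       # toy predictor
X_prev = [np.array([0.0, 0.0]), np.array([1.0, 0.0])]
x_t = np.array([1.0, 0.0])
r_unlabeled = instantaneous_risk(f, x_t, None, X_prev, inv_p=10.0)
```

The key point is that `r_unlabeled` is strictly positive even though no label was revealed: the manifold term alone penalizes a predictor that assigns different values to nearby samples.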
The online gradient descent algorithm applied to the instantaneous regularized risk yields
$f_{i,t+1} = f_{i,t} - \tau_t\,\nabla J_t(f_{i,t}),$   (3)
where $\tau_t$ is a time-varying step size. Thus even if no label is revealed, we can still update our model according to (3). Then in round $t > T_1$, the learner can calculate two base predictions based on the models $f_{1,t}$ and $f_{2,t}$, namely $f_{1,t}(\psi(\mathbf{x}_t^{S_2}))$ and $f_{2,t}(\mathbf{x}_t^{S_2})$, at each time step. In this case, by forming an ensemble over the two base predictions in each round, our SFEL is able to achieve the fundamental goal, i.e., following the best base prediction all the time. The initialization process, which obtains the relationship mapping $\psi$ and the model $f_{1,T_1}$ during rounds $1, \dots, T_1$, is summarized in Algorithm 1.
3.2 Combining Base Learners
We propose an ensemble that combines the base learners with weights based on the exponential of the cumulative risk [CBL06]. The prediction of our method at time $t$ is the weighted average of the base predictions:
$\hat{y}_t = \sum_{i=1}^{2} \alpha_{i,t}\,\hat{y}_{i,t},$   (4)
where $\alpha_{i,t}$ is the weight of the $i$-th base prediction. With the previous risks of each base model, we update the weights of the two base models as follows:
$\alpha_{i,t} = \frac{\exp(-\eta R_{i,t-1})}{\sum_{j=1}^{2} \exp(-\eta R_{j,t-1})},$   (5)
where $\eta$ is a tuned parameter and $R_{i,t}$ is the cumulative risk of the $i$-th base model until time $t$: $R_{i,t} = \sum_{u=T_1+1}^{t} J_u(f_{i,u})$. The risk of our predictor is calculated by
$R_T = \sum_{t=T_1+1}^{T_1+T_2} \sum_{i=1}^{2} \alpha_{i,t}\,J_t(f_{i,t}).$   (6)
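The weighting scheme of (5) can be sketched as follows; the two risk sequences and the value of `eta` are illustrative:

```python
import numpy as np

def combine(risks1, risks2, eta=0.5):
    """Exponentially weighted combination of two base models: in each
    round, weights are proportional to exp(-eta * cumulative risk)."""
    R1 = R2 = 0.0
    weights = []
    for r1, r2 in zip(risks1, risks2):
        e1, e2 = np.exp(-eta * R1), np.exp(-eta * R2)
        weights.append((e1 / (e1 + e2), e2 / (e1 + e2)))
        R1 += r1                     # accumulate risks after weighting
        R2 += r2
    return weights

# base model 1 consistently incurs lower risk than base model 2
risks1 = [0.1] * 100
risks2 = [0.9] * 100
w = combine(risks1, risks2)
```

The weights start uniform and concentrate on the lower-risk base model as its cumulative advantage grows.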
The updating rule of the weights shows that if the risk of one model in the previous round is large, its weight will decrease in the next round, which is reasonable and leads to the good theoretical result shown in Theorem 1. Thus the learning paradigm is as follows. We first learn a model $f_{1,t}$ using (3) on rounds $1, \dots, T_1$, during which we also learn the relationship mapping $\psi$ on the overlapping rounds $T_1-B+1, \dots, T_1$. Then for $t = T_1+1, \dots, T_1+T_2$, we learn a model $f_{2,t}$ in each round with the new data $\mathbf{x}_t^{S_2}$ from $S_2$:
$f_{2,t+1} = f_{2,t} - \tau_t\,\nabla J_t(f_{2,t}),$   (7)
and keep updating $f_{1,t}$ on the recovered data $\tilde{\mathbf{x}}_t = \psi(\mathbf{x}_t^{S_2})$:
$f_{1,t+1} = f_{1,t} - \tau_t\,\nabla J_t(f_{1,t}),$   (8)
where $J_t$ in (7) and (8) is evaluated on $\mathbf{x}_t^{S_2}$ and $\tilde{\mathbf{x}}_t$, respectively. Then we combine the predictions of the two models by the weights calculated in (5).
To obtain $f_{i,t+1}$, we need to calculate the coefficients of its kernel expansion
$f_{i,t} = \sum_{j=1}^{t-1} \beta_j\,\kappa_i(\mathbf{x}_j, \cdot).$   (9)
We follow the kernel online semi-supervised learning approach [GLZ08] and update the coefficients by writing the gradient as
$\nabla J_t(f) = \frac{T}{l}\,\delta_t\,\ell'\big(f(\mathbf{x}_t), y_t\big)\,\kappa_i(\mathbf{x}_t,\cdot) + \lambda_1 f + 2\lambda_2 \sum_{j=1}^{t-1} w_{jt}\,\big(f(\mathbf{x}_j) - f(\mathbf{x}_t)\big)\,\big(\kappa_i(\mathbf{x}_j,\cdot) - \kappa_i(\mathbf{x}_t,\cdot)\big),$   (10)
in which we compute the derivative according to the reproducing property of the RKHS, i.e., $f(\mathbf{x}) = \langle f, \kappa_i(\mathbf{x},\cdot)\rangle$, and $\ell'$ is the (sub)gradient of the loss function with respect to $f(\mathbf{x}_t)$. Putting (10) back into (7) or (8) and replacing $f$ with its kernel expansion (9), we obtain the coefficients of $f_{i,t+1}$ as follows:
$\beta_j \leftarrow (1-\tau_t\lambda_1)\,\beta_j - 2\tau_t\lambda_2\, w_{jt}\,\big(f_{i,t}(\mathbf{x}_j) - f_{i,t}(\mathbf{x}_t)\big)$ for $j < t$,   (11)
$\beta_t = -\tau_t\Big(\frac{T}{l}\,\delta_t\,\ell'\big(f_{i,t}(\mathbf{x}_t), y_t\big) - 2\lambda_2 \sum_{j=1}^{t-1} w_{jt}\,\big(f_{i,t}(\mathbf{x}_j) - f_{i,t}(\mathbf{x}_t)\big)\Big),$   (12)
where $\beta_t$ is the coefficient of the new representer $\kappa_i(\mathbf{x}_t,\cdot)$ and the existing coefficients are shrunk by the factor $(1-\tau_t\lambda_1)$.
As can be seen from (11) and (12), when updating the model we need to store every observed sample and compute the edge weights between the new incoming sample and all previously observed ones. These operations impose a heavy burden on computation and storage. To alleviate this problem, we do not store all the observed samples; instead, we use a buffer to store a small part of them, a strategy we call buffering.
We denote the buffer by $\mathcal{B}$ and let its size be $s$. To make the samples in the buffer representative, each received sample should enter the buffer with equal probability. We therefore exploit the reservoir sampling technique [Vit85] to achieve this goal, which enables us to represent all the received samples with a fixed-size buffer. Specifically, when receiving a sample $\mathbf{x}_t$, we add it to the buffer directly if the buffer is not full ($|\mathcal{B}| < s$). Otherwise, with probability $s/t$, we update the buffer by replacing a uniformly chosen sample in $\mathcal{B}$ with $\mathbf{x}_t$. The key property of reservoir sampling is that the samples in the buffer are provably uniform samples of the original data. The instantaneous risk is then approximated by
$\hat{J}_t(f) = \frac{T}{l}\,\delta_t\,\ell\big(f(\mathbf{x}_t), y_t\big) + \frac{\lambda_1}{2}\,\|f\|_{\mathcal{H}_i}^2 + \lambda_2\,\frac{t}{s} \sum_{\mathbf{x}_j \in \mathcal{B}} w_{jt}\,\big(f(\mathbf{x}_j) - f(\mathbf{x}_t)\big)^2,$   (13)
where the scaling factor $t/s$ keeps the magnitude of the buffered manifold regularizer comparable to that of the unbuffered one. Accordingly, the predictor becomes
$f_{i,t} = \sum_{\mathbf{x}_j \in \mathcal{B}} \beta_j\,\kappa_i(\mathbf{x}_j, \cdot).$   (14)
If the buffer is not yet full ($|\mathcal{B}| < s$), we update the coefficients by (11) and (12) directly. Otherwise, if the new incoming sample replaces some sample in the buffer, there are two steps to update our predictor. The first step is to update $f_{i,t}$ to an intermediate function $f'$ represented by $s+1$ elements, namely the old buffer plus the newly observed sample:
$\beta_j \leftarrow (1-\tau_t\lambda_1)\,\beta_j - 2\tau_t\lambda_2\,\frac{t}{s}\, w_{jt}\,\big(f_{i,t}(\mathbf{x}_j) - f_{i,t}(\mathbf{x}_t)\big)$ for $\mathbf{x}_j \in \mathcal{B}$,   (15)
$\beta_t = -\tau_t\Big(\frac{T}{l}\,\delta_t\,\ell'\big(f_{i,t}(\mathbf{x}_t), y_t\big) - 2\lambda_2\,\frac{t}{s} \sum_{\mathbf{x}_j \in \mathcal{B}} w_{jt}\,\big(f_{i,t}(\mathbf{x}_j) - f_{i,t}(\mathbf{x}_t)\big)\Big),$   (16)
where $\ell'$ is the (sub)gradient of the loss with respect to $f_{i,t}(\mathbf{x}_t)$ as in (10).
The second step is to replace the buffered sample selected by reservoir sampling, say $\mathbf{x}_r$, with the newest sample, and to obtain $f_{i,t+1}$, which uses $s$ base representers, by approximating $f'$, which uses $s+1$ base representers:
$f_{i,t+1} = \operatorname{arg\,min}_{f \in \operatorname{span}\{\kappa_i(\mathbf{x}_j,\cdot) \,:\, \mathbf{x}_j \in (\mathcal{B}\setminus\{\mathbf{x}_r\})\cup\{\mathbf{x}_t\}\}} \|f - f'\|_{\mathcal{H}_i}^2.$   (18)
This can be intuitively regarded as spreading the replaced representer's weighted contribution over the remaining samples in the buffer, including the newly added one. The optimal coefficients in (18) can be found efficiently by kernel matching pursuit [VB02].
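The replacement step can be sketched by approximating the intermediate predictor on the reduced representer set. Here plain least squares over random evaluation points stands in for kernel matching pursuit, and all sizes and points are illustrative:

```python
import numpy as np

def rbf_gram(A, B, sigma=1.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

def refit_coefficients(old_pts, old_alpha, new_pts, eval_pts):
    """Approximate a predictor expanded over old_pts by one expanded
    over new_pts, matching function values on eval_pts by least squares
    (a simpler stand-in for the paper's kernel matching pursuit)."""
    targets = rbf_gram(eval_pts, old_pts) @ old_alpha   # values to match
    K = rbf_gram(eval_pts, new_pts)
    new_alpha, *_ = np.linalg.lstsq(K, targets, rcond=None)
    return new_alpha

rng = np.random.default_rng(2)
old_pts = rng.normal(size=(11, 2))    # buffer of 10 plus the new sample
old_alpha = rng.normal(size=11)
new_pts = old_pts[1:]                 # sample 0 was evicted
eval_pts = rng.normal(size=(50, 2))
new_alpha = refit_coefficients(old_pts, old_alpha, new_pts, eval_pts)
```

The evicted representer's contribution is absorbed into the coefficients of the remaining ones, so the predictor changes as little as possible at the evaluation points.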
If the new incoming sample does not replace any sample in the buffer, $f_{i,t+1}$ still consists of the representers from the unchanged buffer; only the coefficients of these representers are updated, analogously to (15) and (16), with the contribution of the new sample spread over the buffered representers as in (18). Algorithm 2 summarizes our SFEL.
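Putting the pieces together, the round structure of SFEL can be sketched as below. This is a deliberately simplified, hypothetical sketch: two buffered kernel learners are updated with hinge-plus-manifold gradients, reservoir replacement simply drops the evicted representer's weight instead of redistributing it by matching pursuit, the $t/s$ rescaling is omitted, and every constant is illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)

def rbf(a, b, sigma=1.0):
    return np.exp(-np.sum((np.asarray(a) - np.asarray(b)) ** 2) / (2 * sigma ** 2))

class KernelModel:
    """One buffered, manifold-regularized kernel learner (simplified)."""
    def __init__(self, lam1=0.01, lam2=0.01, buf_size=10):
        self.pts, self.alpha = [], []
        self.lam1, self.lam2, self.buf_size = lam1, lam2, buf_size

    def predict(self, x):
        return sum(a * rbf(p, x) for p, a in zip(self.pts, self.alpha))

    def update(self, x, y, t, inv_p=10.0, eta=0.05):
        fx = self.predict(x)
        # w_{jt} * (f(x_j) - f(x_t)) for every buffered point
        diffs = [rbf(p, x) * (self.predict(p) - fx) for p in self.pts]
        # shrink old coefficients (norm term) and apply the manifold pull
        self.alpha = [(1 - eta * self.lam1) * a - 2 * eta * self.lam2 * d
                      for a, d in zip(self.alpha, diffs)]
        # new representer: hinge subgradient (if labeled) + manifold push
        g = -inv_p * y if (y is not None and y * fx < 1.0) else 0.0
        a_new = -eta * (g - 2 * self.lam2 * sum(diffs))
        if len(self.pts) < self.buf_size:
            self.pts.append(x); self.alpha.append(a_new)
        elif rng.random() < self.buf_size / t:        # reservoir replacement
            k = int(rng.integers(self.buf_size))
            self.pts[k], self.alpha[k] = x, a_new     # crude: drop old weight

def sfel_round(models, risks, x_views, y, t, eta_w=0.5, inv_p=10.0):
    """One round: exponential-weight combination, then risk and model updates."""
    r = np.asarray(risks)
    w = np.exp(-eta_w * (r - r.min()))   # subtract min for numerical safety
    w = w / w.sum()
    preds = [m.predict(x) for m, x in zip(models, x_views)]
    y_hat = float(np.dot(w, preds))
    for i, (m, x) in enumerate(zip(models, x_views)):
        ri = inv_p * max(0.0, 1.0 - y * preds[i]) if y is not None else 0.0
        ri += sum(rbf(p, x) * (m.predict(p) - preds[i]) ** 2 for p in m.pts)
        risks[i] += ri                   # risk computable without a label
        m.update(x, y, t, inv_p)
    return y_hat

# toy run: the recovered view and the new view are identical here,
# and a label arrives every fifth round
models, risks, outs = [KernelModel(), KernelModel()], [0.0, 0.0], []
data_rng = np.random.default_rng(4)
for t in range(1, 61):
    x = data_rng.normal(size=3)
    y = (1.0 if x[0] > 0 else -1.0) if t % 5 == 0 else None
    outs.append(sfel_round(models, risks, [x, x], y, t))
```

Even on unlabeled rounds both base risks keep accumulating through the manifold term, which is what lets the ensemble weights keep tracking the better base model.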
In this part, we borrow the notion of regret from online learning to measure the performance of SFEL. Specifically, we give a risk bound showing that the performance will be improved with the assistance of the old feature space. Let $R_{1,T}$ and $R_{2,T}$ be the cumulative risks suffered by the two base models on rounds $T_1+1, \dots, T_1+T_2$, and let $R_T$ be the cumulative risk suffered by our method, following the definition of our predictor's risk in (6). Then we have (the proof can be found in the supplementary file):
Theorem 1. Assume that the risk function takes values in $[0,1]$. For all $T_2 \ge 1$, the cumulative risk of Algorithm 2 with parameter $\eta = \sqrt{8\ln 2 / T_2}$ satisfies
$R_T \le \min(R_{1,T}, R_{2,T}) + \sqrt{(T_2 \ln 2)/2}.$   (20)
This theorem implies that the cumulative risk of Algorithm 2 over rounds $T_1+1, \dots, T_1+T_2$ is comparable to the minimum of $R_{1,T}$ and $R_{2,T}$. Furthermore, we define $\Delta = R_{2,T} - R_{1,T}$. If $\Delta > \sqrt{(T_2 \ln 2)/2}$, it is easy to verify that $R_T$ is smaller than $R_{2,T}$. In summary, on rounds $T_1+1, \dots, T_1+T_2$, when the base model with assistance from $S_1$ is better than the one without assistance to a certain degree, our method is better than learning without assistance.
Besides, we show that a larger buffer brings better performance under our buffering strategy. Concretely, let $M_t$ be the last term of the objective (2), which is formed by all the observed samples up to the current iteration, and denote by $\hat{M}_t$ the approximated version formed by the observed samples in the buffer. Then we have (the proof can be found in the supplementary file):
Theorem 2. With the reservoir sampling mechanism, the approximated objective is an unbiased estimator of the objective formed by the original data, namely, $\mathbb{E}[\hat{M}_t] = M_t$.
Theorem 2 demonstrates the rationality of the reservoir sampling mechanism in buffering: the objective formed by the observed samples in the buffer is provably unbiased with respect to that formed by all the observed samples. The variance of the approximated objective decreases when a larger buffer holds more observed samples, leading to a more accurate approximation; this suggests making the best of the storage budget to store previously observed samples. Since various devices have different storage budgets, and even the same device may provide different available storage over time, we can fit our method to different storage situations and maximize performance by taking full advantage of the budget.
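The property underlying Theorem 2, that reservoir sampling keeps each seen sample in the buffer with probability $s/t$, can be checked empirically with a minimal implementation; all sizes below are illustrative:

```python
import random

def reservoir_update(buffer, x, t, size, rng=random):
    """Algorithm R (Vitter, 1985): after t samples have been seen,
    each one sits in the size-`size` buffer with probability size/t."""
    if len(buffer) < size:
        buffer.append(x)
    elif rng.random() < size / t:
        buffer[rng.randrange(size)] = x   # replace a uniformly chosen slot

rng = random.Random(0)
counts = [0] * 100
for _ in range(10000):
    buf = []
    for t in range(1, 101):               # stream the items 0..99
        reservoir_update(buf, t - 1, t, size=10, rng=rng)
    for item in buf:
        counts[item] += 1
# each item should be kept roughly 10000 * 10/100 = 1000 times
```

Uniform inclusion probability is exactly what makes the rescaled buffered objective (13) an unbiased estimate of the full manifold term.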
In this section, we conduct experiments in different scenarios to validate the three claims presented in Introduction.
4.1 Compared Methods
We compare our SFEL to three approaches. The first, mentioned in Section 2, invokes online gradient descent from scratch once the feature space changes; we name it NOGD (Naive Online Gradient Descent). The other two approaches utilize the model learned from feature space $S_1$ by online gradient descent to make predictions on the recovered data. The difference between them is that one keeps updating with the recovered data while the other does not: the former is called "updating Recovered Online Gradient Descent" (uROGD) and the latter "fixed Recovered Online Gradient Descent" (fROGD). Note that these baselines do not update on rounds where no label is revealed. Besides, we emphasize that comparing our method to these baselines in the FESL scenario suffices to validate the effectiveness of our framework, since our goal is to be comparable to these base models and to show that manifold regularization is useful and that our method can fit the storage budget to maximize its performance. With the manifold regularization and buffering strategy, other feature evolvable learning methods based on the online learning technique can adapt to our framework similarly; this is beyond the scope of this paper and is deferred to future work.
4.2 Evaluation and Parameter Setting
We evaluate the empirical performance of the proposed approaches on the classification task on rounds $T_1+1, \dots, T_1+T_2$. We assume all the labels can be obtained in hindsight, so the accuracy is calculated over all rounds. Besides, to verify that Theorem 1 is reasonable, we present the trend of the average cumulative risk: at each time $t$, the risk of every method is the average of the cumulative risk over rounds $T_1+1, \dots, t$. The probability that a label is revealed is set to a small constant; the particular value is not essential. The performance of each approach is averaged over independent runs.
We conduct our experiments on datasets from different domains, including economy and biology (the datasets can be found at http://archive.ics.uci.edu/ml/ and http://www.lamda.nju.edu.cn/data_RFID.ashx). Note that FESL uses more datasets; however, most of them are text datasets, which do not satisfy the manifold characteristic. Only the datasets adopted here satisfy the manifold characteristic, and the Swiss dataset is the ideal one. We would like to conduct more experiments on other datasets satisfying the manifold characteristic in the future. To generate synthetic data, we artificially map the original datasets into another feature space by random matrices, so that we have data from both feature spaces $S_1$ and $S_2$. Since the original data are in batch mode, we manually feed them sequentially; in this way the synthetic data are generated. As for the real dataset, we use the "RFID" dataset provided by FESL, which satisfies all the assumptions in the Preliminary section.
We have three claims, mentioned in the Introduction. The first is that manifold regularization (MR) brings better performance when there are only few labels. The second is that our method can always follow the best baseline at any time step. The last is that a larger buffer brings better performance, so our method can fit different storage budgets by taking full advantage of them. In the following, we show the part of the experimental results that validates these three claims; the rest can be found in the supplementary file due to page limitations.
MR Brings Better Performance
In Table 1, “+MR” means baseline methods are boosted by manifold regularization (MR). We can see that MR makes all the baselines better, and our method also benefits from it.
Following the Best Baseline
Figure 2 (a-c) shows the trend of the risk of our method and all the baseline methods boosted by MR. Our method is the ensemble of NOGD+MR and uROGD+MR. We can see that our method's risk is always comparable with that of the best baseline method, which validates Theorem 1. Note that our method's goal is to be comparable with the baseline methods rather than to beat them. Nevertheless, as can be seen from Table 1, our method's classification accuracy surprisingly outperforms that of all the baseline methods.
Figure 2 (d) compares the performance of our method across different buffer sizes. We can see that a larger buffer brings better performance, which validates Theorem 2. In this way, our method SFEL can fit different storage budgets and maximize performance by taking full advantage of the budget. We can also see that the Swiss dataset (shaped like a swiss roll), which possesses the best manifold property, benefits the most from increasing the buffer size. The exact numerical values can be found in the supplementary file.
Learning with feature evolvable streams usually assumes that a label is revealed in each round. However, in reality this assumption may not hold. We introduce manifold regularization into FESL and make FESL work well in this scenario; other feature evolvable learning methods resembling FESL can also adapt to our framework. Both theoretical and experimental results validate that our method achieves the fundamental goal again, i.e., it follows the best baseline and thus works well at any time step despite the fact that few labels are given. Besides, we theoretically and empirically demonstrate that a larger buffer brings better performance, so our method can fit different storage budgets by taking full advantage of them.
A more ambitious goal is to fit the storage budget dynamically during the learning process, which would have wide applications since a running device may provide varying available storage rather than a fixed amount. This is deferred to future work, and the dynamic reservoir sampling technique [AKLW07] may help.
- [AKLW07] M. Al-Kateb, B. S. Lee, and X. S. Wang. Adaptive-size reservoir sampling over data streams. In Proceedings of the 19th International Conference on Scientific and Statistical Database Management, page 22, 2007.
- [BAW19] E. Beyazit, J. Alagurajah, and X. Wu. Online learning from data streams with varying feature spaces. In Proceedings of the 33rd AAAI Conference on Artificial Intelligence, pages 3232–3239, 2019.
- [CBL06] N. Cesa-Bianchi and G. Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006.
- [GLZ08] A. B. Goldberg, M. Li, and X. Zhu. Online manifold regularization: A new learning setting and empirical study. In Proceedings of the 19th European Conference on Machine Learning and Principles of Knowledge Discovery in Databases, pages 393–407, 2008.
- [HWW19] Y. He, B. Wu, D. Wu, E. Beyazit, S. Chen, and X. Wu. Online learning from capricious data streams: A generative approach. In Proceedings of the 28th International Joint Conference on Artificial Intelligence, pages 2491–2497, 2019.
- [HZ18] C. Hou and Z.-H. Zhou. One-pass learning with incremental and decremental features. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(11):2776–2792, 2018.
- [HZZ17a] B.-J. Hou, L. Zhang, and Z.-H. Zhou. Learning with feature evolvable streams. In Advances in Neural Information Processing Systems 30, pages 1417–1427, 2017.
- [HZZ17b] B.-J. Hou, L. Zhang, and Z.-H. Zhou. Storage fit learning with unlabeled data. In Proceedings of the 26th International Joint Conference on Artificial Intelligence, pages 1844–1850, 2017.
- [HZZ19] B.-J. Hou, L. Zhang, and Z.-H. Zhou. Prediction with unpredictable feature evolution. CoRR, abs/1904.12171, 2019.
- [Jia08] J. Jiang. A literature survey on domain adaptation of statistical classifiers. Technical report, URL: http://sifaka.cs.uiuc.edu/jiang4/domainadaptation/survey, 2008.
- [PY10] S. J. Pan and Q. Yang. A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering, 22:1345–1359, 2010.
- [SS02] B. Schölkopf and A. J. Smola. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. Adaptive Computation and Machine Learning series. MIT Press, 2002.
- [SSW15] S. Sun, H. Shi, and Y. Wu. A survey of multi-source domain adaptation. Information Fusion, 24:84–92, 2015.
- [VB02] P. Vincent and Y. Bengio. Kernel matching pursuit. Machine Learning, 48(1-3):165–187, 2002.
- [Vit85] J. S. Vitter. Random sampling with a reservoir. ACM Transactions on Mathematical Software, 11(1):37–57, 1985.
- [ZHWL14] P. Zhao, S. Hoi, J. Wang, and B. Li. Online transfer learning. Artificial Intelligence, 216:76–102, 2014.
- [Zin03] M. Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In Proceedings of the 20th International Conference on Machine Learning, pages 928–936, 2003.
- [ZLR05] X. Zhu, J. Lafferty, and R. Rosenfeld. Semi-supervised learning with graphs. PhD thesis, Carnegie Mellon University, language technologies institute, school of computer science, 2005.
- [ZZL16] Q. Zhang, P. Zhang, G. Long, W. Ding, C. Zhang, and X. Wu. Online learning from trapezoidal data streams. IEEE Transactions on Knowledge and Data Engineering, 28:2709–2723, 2016.
In the supplementary material, we prove the two theorems presented in the section "Our Approach" and give additional experimental results.
Appendix A Analysis
In this section, we will give the detailed proofs of the two theorems in the section “Our Approach”.
A.1 Proof of Theorem 1
Proof: In order to prove Theorem 1, we first introduce the potential function
$W_t = \exp(-\eta R_{1,t}) + \exp(-\eta R_{2,t}),$ with $W_{T_1} = 2$.
Then, since the risks take values in $[0,1]$, Hoeffding's lemma gives
$\ln \frac{W_t}{W_{t-1}} = \ln \Big(\sum_{i=1}^{2} \alpha_{i,t}\,\exp\big(-\eta J_t(f_{i,t})\big)\Big) \le -\eta \sum_{i=1}^{2} \alpha_{i,t}\,J_t(f_{i,t}) + \frac{\eta^2}{8}.$
Summing over $t = T_1+1, \dots, T_1+T_2$, telescoping, and rearranging give
$R_T \le -\frac{1}{\eta}\ln W_{T_1+T_2} + \frac{\ln 2}{\eta} + \frac{\eta T_2}{8}.$
Considering the boundedness $W_{T_1+T_2} \ge \exp\big(-\eta \min(R_{1,T}, R_{2,T})\big)$, we thus have
$R_T \le \min(R_{1,T}, R_{2,T}) + \frac{\ln 2}{\eta} + \frac{\eta T_2}{8}.$
When $\eta$ is optimally set to $\sqrt{8\ln 2 / T_2}$, (20) is immediately derived. $\square$
A.2 Proof of Theorem 2
The proof of this theorem can be derived simply by induction. We show that each of the first $t$ elements is kept in the buffer $\mathcal{B}_t$ with probability $s/t$, where $\mathcal{B}_t$ is the buffer in the $t$-th step.
For the base case $t = s$, the first $s$ elements are all in the buffer, each with probability $1 = s/s$.
Assume each of the first $t$ elements has been chosen into $\mathcal{B}_t$ with probability $s/t$.
The algorithm chooses the $(t+1)$-th element with probability $s/(t+1)$.
If this element is chosen, each element in $\mathcal{B}_t$ has probability $1/s$ of being replaced.
The probability that a given element in $\mathcal{B}_t$ is replaced by the $(t+1)$-th element is therefore $\frac{s}{t+1} \cdot \frac{1}{s} = \frac{1}{t+1}$.
Thus the probability of an element not being replaced is $\frac{t}{t+1}$.
So $\mathcal{B}_{t+1}$ contains any given earlier element because it was chosen into $\mathcal{B}_t$ and not replaced, with probability $\frac{s}{t} \cdot \frac{t}{t+1} = \frac{s}{t+1}$,
or it contains the newest element because it was chosen in the latest round, with probability $\frac{s}{t+1}$.
We have thus proved that each element in the buffer at time step $t$ is sampled with equal probability $s/t$. Then we have
$\mathbb{E}[\hat{M}_t] = \lambda_2\,\frac{t}{s}\sum_{j=1}^{t-1} \Pr[\mathbf{x}_j \in \mathcal{B}_t]\; w_{jt}\,\big(f(\mathbf{x}_j) - f(\mathbf{x}_t)\big)^2 = \lambda_2 \sum_{j=1}^{t-1} w_{jt}\,\big(f(\mathbf{x}_j) - f(\mathbf{x}_t)\big)^2 = M_t. \;\; \square$
Appendix B Additional Experiments
In this section, we show some additional experimental results to validate the effectiveness of our method.
Figure 3 (a-b) shows the trend of the risk of our method and all the baseline methods boosted by MR. Our method is the ensemble of NOGD+MR and uROGD+MR. We can see that our method's risk is always comparable with that of the best baseline method, which validates Theorem 1. Note that our method's goal is to be comparable with the baseline methods rather than to beat them.
Table 2 provides the exact numerical values of the performance of our method under different buffer sizes. We can see that a larger buffer brings better performance, which validates Theorem 2. In this way, our method SFEL can fit different storage budgets and maximize performance by taking full advantage of the budget. We can also see that the Swiss dataset (shaped like a swiss roll), which possesses the best manifold property, benefits the most from increasing the buffer size.