Learning with Feature Evolvable Streams

06/16/2017 · Bo-Jian Hou et al., Nanjing University

Learning with streaming data has attracted much attention during the past few years. Though most studies consider data streams with fixed features, in real practice the features may be evolvable. For example, features of data gathered by limited-lifespan sensors will change when these sensors are substituted by new ones. In this paper, we propose a novel learning paradigm: Feature Evolvable Streaming Learning, where old features vanish and new features occur. Rather than relying on only the current features, we attempt to recover the vanished features and exploit them to improve performance. Specifically, we learn two models from the recovered features and the current features, respectively. To benefit from the recovered features, we develop two ensemble methods. In the first method, we combine the predictions from the two models and theoretically show that with the assistance of old features, the performance on new features can be improved. In the second approach, we dynamically select the best single prediction and establish a better performance guarantee when the best model switches. Experiments on both synthetic and real data validate the effectiveness of our proposal.




1 Introduction

In many real tasks, data are accumulated over time, and thus learning with streaming data has attracted much attention during the past few years. Many effective approaches have been developed, such as the Hoeffding tree Domingos and Hulten (2000), the Bayes tree Seidl et al. (2009), the evolving granular neural network (eGNN) Leite et al. (2009), the Core Vector Machine (CVM) Tsang et al. (2007), etc. Though these approaches are effective for certain scenarios, they share a common assumption, i.e., that the data stream comes with a fixed, stable feature space. In other words, the data samples are always described by the same set of features. Unfortunately, this assumption does not hold in many streaming tasks. For example, for ecosystem protection one can deploy many sensors in a reserve to collect data, where each sensor corresponds to an attribute/feature. Due to their limited lifespan, many sensors will wear out after some period, and new sensors will be deployed to replace them. Thus, features corresponding to the old sensors vanish while features corresponding to the new sensors appear, and the learning algorithm needs to work well under such an evolving environment. Note that the ability to adapt to environmental change is one of the fundamental requirements for learnware Zhou (2016), of which an important aspect is the ability to handle evolvable features.

A straightforward approach is to rely on the new features and learn a new model to use. However, this solution suffers from some deficiencies. First, when new features have just emerged, there are few data samples described by them, and thus the training samples might be insufficient to train a strong model. Second, the old model of the vanished features is ignored, which is a big waste of our data-collection effort. To address these limitations, in this paper we propose a novel learning paradigm: Feature Evolvable Streaming Learning (FESL). We formulate the problem based on a key observation: in general, features do not change in an arbitrary way; instead, there are some overlapping periods in which both old and new features are available. Back to the ecosystem protection example: since the lifespan of the sensors is known to us, e.g., how long their batteries will last is prior knowledge, we usually deploy a set of new sensors before the old ones wear out. Thus, the data stream arrives in the way shown in Figure 1, where in period T1 the original set of features is valid, and at the end of T1, period B1 appears, in which the original set of features is still accessible but some new features are included; then in T2 the original set of features vanishes and only the new features are valid, but at the end of T2, period B2 appears, where newer features come. This process repeats again and again. Note that the T1 and T2 periods are usually long, whereas the B1 and B2 periods are short because, as in the ecosystem protection example, the B1 and B2 periods are just used to switch the sensors, and we do not want to waste much of the sensors' lifetime on such overlapping periods.

Figure 1: Illustration of how the data stream arrives.

In this paper, we propose to solve the FESL problem by utilizing the overlapping period to discover the relationship between the old and new features, and exploiting the old model even when only the new features are available. Specifically, we try to learn a mapping from new features to old features through the samples in the overlapping period. In this way, we are able to reconstruct old features from new ones and thus the old model can still be applied. To benefit from additional features, we develop two ensemble methods, one is in a combination manner and the other in a dynamic selection manner. In the first method, we combine the predictions from two models and theoretically show that with the assistance of old features, the performance on new features can be improved. In the second approach, we dynamically select the best single prediction and establish a better performance guarantee when the best model switches at an arbitrary time. Experiments on synthetic and real datasets validate the effectiveness of our proposal.

The rest of this paper is organized as follows. Section 2 introduces related work. Section 3 presents the formulation of FESL. Our proposed approaches with the corresponding analyses are presented in Section 4. Section 5 reports experimental results. Finally, Section 6 concludes.

2 Related Work

Data stream mining comprises several tasks, including classification, clustering, frequency counting, and time series analysis. Our work is most related to the classification task, and our approach can also solve regression problems. Existing techniques for data stream classification can be divided into two categories: one considers only a single classifier and the other considers ensembles of classifiers. For the former, several methods originate from approaches such as decision trees Domingos and Hulten (2000), Bayesian classification Seidl et al. (2009), neural networks Leite et al. (2009), support vector machines Tsang et al. (2007), and k-nearest neighbours Aggarwal et al. (2006). For the latter, various ensemble methods have been proposed, including Online Bagging & Boosting Oza (2005), Weighted Ensemble Classifiers Wang et al. (2003); Nguyen et al. (2012), Adapted One-vs-All Decision Trees (OVA) Hashemi et al. (2009) and Meta-knowledge Ensemble Zhang et al. (2011). For more details, please refer to Gaber et al. (2005); Gama and Rodrigues (2009); Aggarwal (2010); de Andrade Silva et al. (2013); Nguyen et al. (2015). These traditional streaming-data algorithms often assume that the data samples are described by the same set of features, while in many real streaming tasks the features often change. We want to emphasize that although concept drift happens in streaming data, where the underlying data distribution changes over time Aggarwal (2010); Gama and Rodrigues (2009); Bifet et al. (2010), the number of features never changes under concept drift, which is different from our problem. Most studies related to changing features focus on feature selection and extraction Samina et al. (2014); Zhou et al. (2012), and to the best of our knowledge, none of them consider the evolution of the feature set during the learning process.

Data stream mining is a hot research direction in the area of data mining, while online learning Zinkevich (2003); Hoi et al. (2014) is a related topic from the area of machine learning. Online learning can also tackle the streaming-data problem since it assumes that the data come in a streaming way. Online learning has been extensively studied under different settings, such as learning with experts Cesa-Bianchi and Lugosi (2006) and online convex optimization Hazan et al. (2007); Shalev-Shwartz (2012). There are strong theoretical guarantees for online learning, which usually uses the regret or the number of mistakes to measure the performance of the learning procedure. However, most existing online learning algorithms are limited to the case where the feature set is fixed. Other related topics involving multiple feature sets include multi-view learning Li et al. (2014); Muslea et al. (2002); Xu et al. (2013), transfer learning Pan and Yang (2010); Raina et al. (2007) and incremental attribute learning Guan and Li (2001). Although both our approaches and multi-view learning exploit the relation between different sets of features, there is a fundamental difference: multi-view learning assumes that every sample is described by multiple feature sets simultaneously, whereas in FESL only a few samples in the feature-switching period have two sets of features, and no matter how many periods there are, the switching part involves only two sets of features. Transfer learning usually assumes that data come in batch mode; few transfer methods consider the streaming case where data arrive sequentially and cannot be stored completely. One exception is online transfer learning Zhao et al. (2014), in which data from both sets of features arrive sequentially. However, it assumes that all the feature spaces appear simultaneously during the whole learning process, an assumption that does not hold in FESL. As for incremental attribute learning, the old sets of features do not vanish, or do not vanish entirely, while in FESL the old features vanish completely when new sets of features come.

The most related work is Hou and Zhou (2016), which also handles evolving features in streaming data. In contrast to our setting, where there are overlapping periods, Hou and Zhou (2016) handles situations with no overlapping period but with overlapping features. Thus, the technical challenges and solutions are different.

3 Preliminaries

Figure 2: Specific illustration with one cycle.

We focus on both classification and regression tasks. On each round of the learning process, the algorithm observes an instance and gives its prediction. After the prediction has been made, the true label is revealed and the algorithm suffers a loss which reflects the discrepancy between the prediction and the ground truth. We define a “feature space" in our paper as a set of features. That the feature space changes means that both the underlying distribution over the feature set and the number of features change. Consider a process with three periods: in the first period, a large amount of streaming data comes from the old feature space; in the second period, named the overlapping period, a few data come from both the old and the new feature space; in the third period, streaming data come only from the new feature space. We call this whole process a cycle. As can be seen from Figure 1, each cycle involves merely two feature spaces. Thus, we only need to focus on one cycle, and it is easy to extend to the case with multiple cycles. Besides, we assume that the old features in one cycle vanish simultaneously, motivated by the example that in ecosystem protection all the sensors share the same expected lifespan and thus wear out at the same time. We will study the case where old features do not vanish simultaneously in future work.

Based on the above discussion, we only consider two feature spaces, denoted by S1 and S2, respectively. Suppose that in the overlapping period there are B rounds of instances from both S1 and S2. As can be seen from Figure 2, the process can be summarized as follows.

  • For t = 1, …, T1 − B, in each round the learner observes a vector x_t ∈ R^{d1} sampled from S1, where d1 is the number of features of S1 and T1 is the total number of rounds in S1.

  • For t = T1 − B + 1, …, T1, in each round the learner observes two vectors x_t^{S1} ∈ R^{d1} and x_t^{S2} ∈ R^{d2} from S1 and S2, respectively, where d2 is the number of features of S2.

  • For t = T1 + 1, …, T1 + T2, in each round the learner observes a vector x_t ∈ R^{d2} sampled from S2, where T2 is the number of rounds in S2. Note that B is small, so we can omit the streaming data from S2 on rounds T1 − B + 1, …, T1, since they have a minor effect on training the model in S2.

We use ||v|| to denote the ℓ2-norm of a vector v. The inner product is denoted by ⟨·, ·⟩. Let Ω1 ⊆ R^{d1} and Ω2 ⊆ R^{d2} be two sets of linear models that we are interested in. We define the projection Π_Ω(b) = argmin_{a ∈ Ω} ||a − b||. We restrict our prediction function in the i-th feature space and the t-th round to be linear, taking the form ⟨w_{i,t}, x⟩ where w_{i,t} ∈ Ω_i. The loss function ℓ(⟨w, x⟩, y) is convex in its first argument; in implementing the algorithms, we use the logistic loss for the classification task, namely ℓ(⟨w, x⟩, y) = (1/ln 2) ln(1 + exp(−y⟨w, x⟩)), and the square loss for the regression task, namely ℓ(⟨w, x⟩, y) = (y − ⟨w, x⟩)².

The most straightforward, baseline algorithm is to apply online gradient descent Zinkevich (2003) on rounds 1, …, T1 with the streaming data from S1, and invoke it again on rounds T1 + 1, …, T1 + T2 with the streaming data from S2. The models are updated according to (1), where τ_t is a varied step size:

    w_{i,t+1} = Π_{Ω_i}( w_{i,t} − τ_t ∇ℓ(⟨w_{i,t}, x_t⟩, y_t) ).    (1)
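As a concrete illustration, the projected update (1) can be sketched as below. This is a minimal sketch under our own assumptions: the model set Ω is taken to be an L2 ball of an illustrative radius (so the projection is a rescaling), the logistic loss defined above is used, and the toy stream is noise-free.

```python
import numpy as np

def ogd_step(w, x, y, step, radius=10.0):
    """One projected OGD step in the spirit of update rule (1).

    Assumes Omega is the L2 ball of the given radius, so the projection
    Pi_Omega is a simple rescaling; 'radius' is an illustrative choice.
    """
    # gradient of the logistic loss (1/ln 2) * ln(1 + exp(-y <w, x>)) in w
    margin = y * np.dot(w, x)
    grad = -(y * x) / (np.log(2) * (1.0 + np.exp(margin)))
    w = w - step * grad
    norm = np.linalg.norm(w)
    if norm > radius:  # project back onto the ball
        w = w * (radius / norm)
    return w

# baseline (NOGD-style) run on a toy stream with noise-free labels
rng = np.random.default_rng(0)
d1, T1 = 5, 200
w_true = rng.normal(size=d1)
w = np.zeros(d1)
for t in range(1, T1 + 1):
    x = rng.normal(size=d1)
    y = 1.0 if np.dot(w_true, x) > 0 else -1.0
    w = ogd_step(w, x, y, step=1.0 / np.sqrt(t))  # varied step size 1/sqrt(t)

# held-out accuracy of the learned linear model
X_test = rng.normal(size=(500, d1))
acc = float(np.mean(np.sign(X_test @ w) == np.sign(X_test @ w_true)))
```

With the decaying step size 1/√t, the learned direction aligns with the true separator, so `acc` ends up well above chance.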


4 Our Proposed Approach

In this section, we first introduce the basic idea of the solution to FESL, then two different kinds of approaches with the corresponding analyses are proposed.

The major limitation of the baseline algorithm mentioned above is that the model learned on rounds 1, …, T1 is ignored on rounds T1 + 1, …, T1 + T2. The reason is that from round T1 + 1 on, we cannot observe data from feature space S1, and thus the model w_{1,t}, which operates in S1, cannot be used directly. To address this challenge, we assume there is a certain relationship ψ : R^{d2} → R^{d1} between the two feature spaces, and we try to discover it in the overlapping period. There are several methods for learning a relationship between two sets of features, including multivariate regression Kibria (2007), streaming multi-label learning Read et al. (2011), etc. In our setting, since the overlapping period is very short, it is unrealistic to learn a complex relationship between the two spaces. Instead, we use a linear mapping to approximate ψ. Assume the coefficient matrix of the linear mapping is M; then during rounds T1 − B + 1, …, T1, the estimation of M can be based on least squares:

    min_M  Σ_{t=T1−B+1}^{T1} || x_t^{S1} − M⊤ x_t^{S2} ||².

The optimal solution to the above problem is given by

    M* = ( Σ_{t=T1−B+1}^{T1} x_t^{S2} (x_t^{S2})⊤ )^{−1} ( Σ_{t=T1−B+1}^{T1} x_t^{S2} (x_t^{S1})⊤ ).

Then, if we only observe an instance x_t^{S2} from S2, we can recover an instance in S1 by ψ(x_t^{S2}) = M*⊤ x_t^{S2}, to which w_{1,t} can be applied. Based on this idea, we will make two changes to the baseline algorithm:

  • During rounds T1 − B + 1, …, T1, we learn a relationship ψ from the paired instances x_t^{S1}, x_t^{S2}, t = T1 − B + 1, …, T1.

  • From round T1 + 1 on, we keep updating w_{1,t} using the recovered data ψ(x_t^{S2}) and predict the target by utilizing the predictions of both w_{1,t} and w_{2,t}.
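The least-squares estimation and the recovery step above can be sketched as follows. This is our own minimal sketch: the rows of X1 and X2 stand for the B overlapping-period instances, the hidden relation is linear and noise-free for illustration, and all sizes are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(1)
B, d1, d2 = 30, 5, 8
M_true = rng.normal(size=(d2, d1))   # hidden linear relation (illustrative)
X2 = rng.normal(size=(B, d2))        # overlapping-period instances from S2
X1 = X2 @ M_true                     # matching instances from S1 (noise-free)

# least squares: min_M sum_t || x_t^{S1} - M^T x_t^{S2} ||^2
M_hat, *_ = np.linalg.lstsq(X2, X1, rcond=None)

# recover an old-space instance from a new-space one; the old model w_1
# can then be applied to x_rec
x_new = rng.normal(size=d2)
x_rec = M_hat.T @ x_new
```

Because B exceeds d2 and the design is well conditioned, the noise-free mapping is recovered essentially exactly here; with real data the recovery carries an estimation error, which is exactly what the ensemble methods below are designed to tolerate.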

In round t, the learner can calculate two base predictions based on the models w_{1,t} and w_{2,t}: f_{1,t} = ⟨w_{1,t}, ψ(x_t^{S2})⟩ and f_{2,t} = ⟨w_{2,t}, x_t^{S2}⟩. By utilizing the two base predictions in each round, we propose two methods, both of which are able to follow the better base prediction empirically and theoretically. The process of obtaining the relationship mapping ψ and the model w_{1,T1} during rounds 1, …, T1 is summarized in Algorithm 1.

1: Initialize w_{1,1} ∈ Ω1 randomly;
2: for t = 1, …, T1 do
3:     Receive x_t^{S1} and predict f_t = ⟨w_{1,t}, x_t^{S1}⟩; receive the target y_t, and suffer loss ℓ(f_t, y_t);
4:     Update w_{1,t} using (1) where τ_t = 1/√t;
5:     if t > T1 − B then store the pair (x_t^{S1}, x_t^{S2}) for estimating M;
Algorithm 1 Initialize

4.1 Weighted Combination

We first propose an ensemble method that combines predictions with weights based on the exponential of the cumulative loss Cesa-Bianchi and Lugosi (2006). The prediction at time t is the weighted average of all the base predictions:

    p_t = Σ_{i=1}^{2} α_{i,t} f_{i,t},    (2)

where α_{i,t} is the weight of the i-th base prediction. With the previous loss of each base model, we can update the weights of the two base models as follows:

    α_{i,t+1} = α_{i,t} exp(−η ℓ(f_{i,t}, y_t)) / Σ_{j=1}^{2} α_{j,t} exp(−η ℓ(f_{j,t}, y_t)),    (3)

where η is a tuned parameter. The updating rule of the weights shows that if the loss of one of the models on the previous round is large, then its weight will decrease at an exponential rate in the next round, which is reasonable and yields the good theoretical result shown in Theorem 1. Algorithm 2 summarizes our first approach for FESL, named FESL-c(ombination). We first learn a model w_{1,t} using online gradient descent on rounds 1, …, T1, during which we also learn a relationship ψ on rounds T1 − B + 1, …, T1. For t = T1 + 1, …, T1 + T2, we learn a model w_{2,t} on each round and keep updating w_{1,t} on the recovered data as shown in (4), where τ_t is a varied step size:

    w_{1,t+1} = Π_{Ω_1}( w_{1,t} − τ_t ∇ℓ(⟨w_{1,t}, ψ(x_t^{S2})⟩, y_t) ).    (4)

Then we combine the predictions of the two models with the weights calculated in (3).

1: Initialize ψ and w_{1,T1+1} during rounds 1, …, T1 using Algorithm 1;
2: Initialize w_{2,T1+1} randomly and α_{1,T1+1}, α_{2,T1+1} by 1/2;
3: for t = T1 + 1, …, T1 + T2 do
4:     Receive x_t^{S2} and compute the base predictions f_{1,t} = ⟨w_{1,t}, ψ(x_t^{S2})⟩ and f_{2,t} = ⟨w_{2,t}, x_t^{S2}⟩;
5:     Predict p_t using (2), then receive the target y_t, and suffer loss ℓ(p_t, y_t);
6:     Update the weights using (3);
7:     Update w_{1,t} and w_{2,t} using (4) and (1) respectively, where τ_t = 1/√(t − T1);
Algorithm 2 FESL-c(ombination)
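One FESL-c-style round can be sketched as follows: a weighted-average prediction as in (2) followed by the exponential re-weighting (3). This is our own minimal sketch; the toy base predictions and the clipped square loss are illustrative choices, not the authors' reference code.

```python
import numpy as np

def combine_and_update(alpha, preds, y, eta, loss):
    """Weighted-average prediction, cf. (2), followed by the exponential
    weight update, cf. (3). `alpha` is the weight vector over base models."""
    p = float(np.dot(alpha, preds))                # combined prediction
    losses = np.array([loss(f, y) for f in preds])
    alpha = alpha * np.exp(-eta * losses)          # exponential down-weighting
    return p, alpha / alpha.sum()                  # renormalize

# square loss clipped to [0,1], matching the bounded-loss assumption
sq = lambda f, y: min((f - y) ** 2, 1.0)

T2 = 100
eta = np.sqrt(8 * np.log(2) / T2)
alpha = np.array([0.5, 0.5])
# toy stream where base model 2 (prediction 0.1) is consistently better
for _ in range(T2):
    p, alpha = combine_and_update(alpha, np.array([0.8, 0.1]), 0.0, eta, sq)
```

After many rounds, nearly all the weight concentrates on the better base model, so the combined prediction follows it, which is exactly the behavior Theorem 1 quantifies.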


We now borrow the notion of regret from online learning to measure the performance of FESL-c. Specifically, we give a loss bound showing that the performance is improved with the assistance of the old feature space. For the sake of soundness, we put the proofs of our theorems in the supplementary file. We define L_{S1} and L_{S2} as the two cumulative losses suffered by the base models on rounds T1 + 1, …, T1 + T2,

    L_{S1} = Σ_{t=T1+1}^{T1+T2} ℓ(f_{1,t}, y_t),    L_{S2} = Σ_{t=T1+1}^{T1+T2} ℓ(f_{2,t}, y_t),

and L_{S12} as the cumulative loss suffered by our method: L_{S12} = Σ_{t=T1+1}^{T1+T2} ℓ(p_t, y_t). Then we have:

Theorem 1.

Assume that the loss function ℓ is convex in its first argument and takes values in [0,1]. For all T2 > 1 and for all y_t with t = T1 + 1, …, T1 + T2, L_{S12} with parameter η = √(8 ln 2 / T2) satisfies

    L_{S12} ≤ min(L_{S1}, L_{S2}) + √((T2 / 2) ln 2).

This theorem implies that the cumulative loss L_{S12} of Algorithm 2 over rounds T1 + 1, …, T1 + T2 is comparable to the minimum of L_{S1} and L_{S2}. Furthermore, if L_{S2} − L_{S1} > √((T2 / 2) ln 2), it is easy to verify that L_{S12} is smaller than L_{S2}. In summary, on rounds T1 + 1, …, T1 + T2, when L_{S1} is better than L_{S2} to a certain degree, the model with assistance from S1 is better than the model without assistance.

4.2 Dynamic Selection

The combination approach in the above subsection combines the base models to improve the overall performance. Generally, a combination of several classifiers performs better than selecting a single classifier Zhou (2012). However, this requires that the base models not be too bad; for example, in AdaBoost the accuracy of the base classifiers should be no less than 0.5 Freund and Schapire (1997). In our FESL problem, however, on rounds T1 + 1, …, T1 + T2, w_{2,t} may not satisfy this requirement in the beginning due to insufficient training data, while w_{1,t} may become worse as more and more data come, owing to the accumulation of the recovery error. Thus, it may not be appropriate to combine the two models all the time; dynamically selecting the best single model may be a better choice. Hence we propose a method based on a new strategy, i.e., dynamic selection, similar to Dynamic Classifier Selection Zhou (2012), which uses only the best single model rather than combining both of them in each round. Note that although we select only one of the models, we retain and utilize both of them to update their weights, so it is still an ensemble method. The basic idea of dynamic selection is to select the model of larger weight with higher probability. Algorithm 3 summarizes our second approach for FESL, named FESL-s(election). Specifically, the steps of Algorithm 3 on rounds 1, …, T1 are the same as those in Algorithm 2. For t = T1 + 1, …, T1 + T2, we still update the weights of each model. However, when making a prediction, we do not combine the models' predictions; instead, we adopt the result of the “best" model sampled according to the distribution of their weights,

    p_{i,t} = α_{i,t} / (α_{1,t} + α_{2,t}).    (7)


To track the best model, we use a different way of updating the weights, given as follows Cesa-Bianchi and Lugosi (2006):

    α_{i,t+1} = (δ / 2) W_{t+1} + (1 − δ) v_{i,t+1},    (8)

where we define v_{i,t+1} = α_{i,t} exp(−η ℓ(f_{i,t}, y_t)), W_{t+1} = v_{1,t+1} + v_{2,t+1}, and H(x) = −x ln x − (1 − x) ln(1 − x) is the binary entropy function defined for x ∈ (0, 1).

1: Initialize ψ and w_{1,T1+1} during rounds 1, …, T1 using Algorithm 1;
2: Initialize w_{2,T1+1} randomly and α_{1,T1+1}, α_{2,T1+1} by 1/2;
3: for t = T1 + 1, …, T1 + T2 do
4:     Receive x_t^{S2} and compute the base predictions f_{1,t} = ⟨w_{1,t}, ψ(x_t^{S2})⟩ and f_{2,t} = ⟨w_{2,t}, x_t^{S2}⟩;
5:     Draw a model w_{i,t} according to the distribution (7) and predict p_t = f_{i,t};
6:     Receive the target y_t, suffer loss ℓ(p_t, y_t), and update the weights using (8);
7:     Update w_{1,t} and w_{2,t} using (4) and (1) respectively, where τ_t = 1/√(t − T1).
Algorithm 3 FESL-s(election)
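A round of dynamic selection can be sketched as below: sample one base model from the weight distribution, then apply a fixed-share weight update so the weights can recover after the best model switches. This follows the standard fixed-share forecaster of Cesa-Bianchi and Lugosi (2006); the exact parameterization of (7)-(8) in the paper may differ from this sketch, and the toy stream, η, and losses are our own illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(2)

def select_and_update(alpha, preds, y, eta, delta, loss):
    """One FESL-s-style round: sample a single base model from the weight
    distribution (cf. (7)), then apply a fixed-share update (cf. (8)) so
    that the selector can track the better model after it switches."""
    probs = alpha / alpha.sum()
    i = rng.choice(len(alpha), p=probs)      # draw one model
    p = preds[i]
    losses = np.array([loss(f, y) for f in preds])
    v = alpha * np.exp(-eta * losses)        # loss-based update
    W = v.sum()
    alpha = (delta / len(alpha)) * W + (1.0 - delta) * v   # fixed share
    return p, alpha

sq = lambda f, y: min((f - y) ** 2, 1.0)
T2 = 200
alpha = np.array([0.5, 0.5])
delta = 1.0 / (T2 - 1)
eta = 0.5                                    # illustrative choice
picks = []
for t in range(T2):
    # model 1 is better in the first half, model 2 in the second half
    preds = (0.0, 0.9) if t < T2 // 2 else (0.9, 0.0)
    p, alpha = select_and_update(alpha, np.array(preds), 0.0, eta, delta, sq)
    picks.append(p)
```

The δ-share of the total weight keeps a floor under the currently worse model, which is what lets the selector switch quickly at round T2/2 instead of being locked into the first winner.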


From round T1 + 1 on, the first model w_{1,t} tends to become worse due to the cumulative recovery error, while the second model w_{2,t} will become better thanks to the large amount of incoming data. Since w_{1,t} is initialized by the model learned from the old feature space and w_{2,t} is initialized randomly, it is reasonable to assume that w_{1,t} is better than w_{2,t} in the beginning, but inferior to w_{2,t} after a sufficiently large number of rounds. Let s be the round after which w_{1,t} is worse than w_{2,t}. We define

    L_s = Σ_{t=T1+1}^{s} ℓ(f_{1,t}, y_t) + Σ_{t=s+1}^{T1+T2} ℓ(f_{2,t}, y_t);

we can verify that

    L_s ≤ min(L_{S1}, L_{S2}).    (9)

Then a more ambitious goal is to compare the proposed algorithm against w_{1,t} from rounds T1 + 1 to s, and against w_{2,t} from rounds s + 1 to T1 + T2, which motivates us to study the performance measure L_{S12} − L_s. Because the exact value of s is generally unknown, we need to bound the worst case, max_s (L_{S12} − L_s). An upper bound of max_s (L_{S12} − L_s) is given as follows.

Theorem 2.

For all T2 > 1, if the model is run with parameters δ = 1/(T2 − 1) and η = √((8/T2)(2 ln 2 + (T2 − 1) H(δ))), then

    max_s (L_{S12} − L_s) ≤ √((T2 / 2)(2 ln 2 + (T2 − 1) H(δ))),

where H is the binary entropy function.

According to Theorem 2, L_{S12} is comparable to L_s. Due to (9), we can conclude that the upper bound of L_{S12} in Algorithm 3 is tighter than that of Algorithm 2.

5 Experiments

In this section, we first introduce the datasets we use. We want to emphasize that we collected one real dataset ourselves, since our feature-evolving setting is relatively novel and the required datasets are not widely available yet. Then we introduce the compared methods and settings. Finally, the experimental results are given.

Dataset n d1 d2 Dataset n d1 d2 Dataset n d1 d2
Australian 690 42 29 r.EN-FR 18,758 21,531 24,892 r.GR-IT 29,953 34,279 15,505
Credit-a 653 15 10 r.EN-GR 18,758 21,531 34,215 r.GR-SP 29,953 34,279 11,547
Credit-g 1,000 20 14 r.EN-IT 18,758 21,531 15,506 r.IT-EN 24,039 15,506 21,517
Diabetes 768 8 5 r.EN-SP 18,758 21,531 11,547 r.IT-FR 24,039 15,506 24,892
DNA 940 180 125 r.FR-EN 26,648 24,893 21,531 r.IT-GR 24,039 15,506 34,278
German 1,000 59 41 r.FR-GR 26,648 24,893 34,287 r.IT-SP 24,039 15,506 11,547
Kr-vs-kp 3,196 36 25 r.FR-IT 26,648 24,893 15,503 r.SP-EN 12,342 11,547 21,530
Splice 3,175 60 42 r.FR-SP 26,648 24,893 11,547 r.SP-FR 12,342 11,547 24,892
Svmguide3 1,284 22 15 r.GR-EN 29,953 34,279 21,531 r.SP-GR 12,342 11,547 34,262
RFID 940 78 72 r.GR-FR 29,953 34,279 24,892 r.SP-IT 12,342 11,547 15,500
Table 1: Detailed description of the datasets: n is the number of examples, and d1 and d2 denote the dimensionality of the first and second feature space, respectively. The first 9 datasets in the left column are synthetic datasets, “r.EN-FR" means the dataset EN-FR comes from Reuter, and “RFID" is the real dataset.

5.1 Datasets

We conduct our experiments on 30 datasets consisting of 9 synthetic datasets, 20 Reuter datasets and 1 real dataset. To generate synthetic data, we randomly choose some datasets from different domains, including economy, biology, etc. (the datasets can be found at http://archive.ics.uci.edu/ml/), whose scales vary from 690 to 3,196. They only have one feature space at first. We artificially map the original datasets into another feature space by random Gaussian matrices, so that we have data from both feature spaces S1 and S2. Since the original data are in batch mode, we manually make them come sequentially. In this way, the synthetic data are completely generated. We also conduct our experiments on 20 datasets from Reuter Amini et al. (2009). They are multi-view datasets with large scales varying from 12,342 to 29,953. Each dataset has two views which represent two different languages, respectively. We regard the two views as the two feature spaces. These datasets thus have two feature spaces, but the original data are in batch mode, so we artificially make them come in a streaming way.
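This synthetic-data construction can be sketched as follows; the sizes, period lengths and variable names are our own illustrative choices: map a batch dataset into a second feature space with a random Gaussian matrix, then replay the rows as a stream with a short overlapping period.

```python
import numpy as np

rng = np.random.default_rng(3)
n, d1, d2, B = 1000, 20, 14, 10       # illustrative sizes
X1 = rng.normal(size=(n, d1))         # original feature space S1
G = rng.normal(size=(d1, d2))         # random Gaussian mapping
X2 = X1 @ G                           # second feature space S2

T1 = n // 2                           # first half of the stream uses S1
stream = []
for t in range(n):
    if t < T1 - B:
        stream.append(("S1", X1[t]))              # only old features
    elif t < T1:
        stream.append(("S1+S2", (X1[t], X2[t])))  # overlapping period
    else:
        stream.append(("S2", X2[t]))              # only new features
```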

We use RFID techniques to collect the real data, which contain 450 instances from S1 and S2, respectively. RFID techniques are widely used for moving-goods detection Wang et al. (2016). In our case, we want to utilize RFID to predict the coordinates of moving goods attached with RFID tags. Concretely, we arranged several RFID aerials around an indoor area. In each round, each RFID aerial received the tag signals; then the tagged goods moved, and at the same time we recorded the goods' coordinates. Before the aerials expired, we arranged new aerials beside the old ones to avoid a situation without any aerials. So in this overlapping period, we have data from both the old and new feature spaces. After the old aerials expired, we continued to use the new ones to receive signals, and then we only have data from feature space S2. Therefore, the RFID data we collected fully satisfy our assumptions. The details of all the datasets we use are presented in Table 1.

5.2 Compared Approaches and Settings

We compare our FESL-c and FESL-s with three approaches. One is mentioned in Section 3: once the feature space changes, the online gradient descent algorithm is invoked from scratch; we name it NOGD (Naive Online Gradient Descent). The other two approaches utilize the model learned from feature space S1 by online gradient descent to make predictions on the recovered data. The difference between them is that one keeps updating with the recovered data while the other does not: the one which keeps updating is called Updating Recovered Online Gradient Descent (ROGD-u) and the one which stays fixed is called Fixed Recovered Online Gradient Descent (ROGD-f). We evaluate the empirical performance of the proposed approaches on classification and regression tasks on rounds T1 + 1, …, T1 + T2. To verify that our analysis is reasonable, we present the trend of the average cumulative loss: at each time t, the reported loss of every method is the average of its cumulative loss over the rounds seen so far. We also present the classification performance over all instances on rounds T1 + 1, …, T1 + T2 on the synthetic and Reuter data. The performance of all approaches is obtained by averaging the results over 10 independent runs on the synthetic data. Due to the large scale of the Reuter data, we only conduct 3 independent runs on them and report the average results.
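The average-cumulative-loss curve plotted in the loss-trend figures can be computed as below; the per-round losses and variable names are our own toy values for illustration.

```python
# per-round losses of one method on rounds T1+1, ..., T1+T2 (toy values)
losses = [1.0, 0.5, 0.0, 0.5]

avg_cumulative = []
total = 0.0
for t, loss in enumerate(losses, start=1):
    total += loss
    avg_cumulative.append(total / t)   # average cumulative loss up to round t
```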

Figure 3: The trend of average cumulative loss of the three baseline methods and the two proposed methods (panels (a)-(i): australian, credit-a, credit-g, diabetes, r.EN-SP, r.FR-SP, r.GR-EN, r.IT-FR, RFID). The smaller the cumulative loss, the better. The average cumulative loss of our methods at any time is comparable to the best of the baseline methods, and smaller on 8 of the 9 datasets shown.

The parameters we need to set are the number of instances in the overlapping period, i.e., B, the numbers of instances in S1 and S2, i.e., T1 and T2, and the step size τ_t. For all the baseline methods and our methods, the parameters are the same. In our experiments, we set B to 5 or 10 for the synthetic data, 50 for the Reuter data and 40 for the RFID data. We set T1 and T2 to be almost half of the number of instances, and the step size τ_t to be of the form c/√t, where the constant c is searched over a range of candidate values. The detailed setting of the step size for each dataset is presented in the supplementary file.

5.3 Results

Here we only present part of the loss-trend results; the other results are presented in the supplementary file. Figure 3 gives the trend of average cumulative loss: (a-d) are the results on synthetic data, (e-h) are the results on Reuter data, and (i) is the result on the real data. The smaller the average cumulative loss, the better. From the experimental results, we have the following observations. First, all the curves with circle marks, representing NOGD, decrease rapidly, which conforms to the fact that NOGD becomes better and better on rounds T1 + 1, …, T1 + T2 as more and more data come. Besides, the curves with star marks, representing ROGD-u, also decline, though not as noticeably, since ROGD-u has already learned well on rounds 1, …, T1 and tends to converge, so updating with more recovered data does not bring much benefit. Moreover, the curves with plus marks, representing ROGD-f, do not drop but even go up instead, which is also reasonable: the model is fixed, so if there are recovery errors, it performs worse. Lastly, our methods are based on NOGD and ROGD-u, so their average cumulative losses also decrease. As can be seen from Figure 3, the average cumulative losses of our methods are comparable to the best of the baseline methods on all the datasets and are smaller on 8 of them, and FESL-s exhibits a slightly smaller average cumulative loss than FESL-c. One may notice that NOGD is always worse than ROGD-u on the synthetic and real data, while on the Reuter data NOGD becomes better than ROGD-u after a few rounds. This is because on the synthetic and real data there are not enough rounds for all the methods to converge, whereas on the Reuter data the large number of instances ensures the convergence of every method. When all the methods converge, NOGD is better than the other baseline methods since it always receives the real instances, while ROGD-u and ROGD-f receive the recovered instances, which may contain recovery errors.
As can be seen from (e-h), in the first few rounds our methods are comparable to ROGD-u; once NOGD becomes better than ROGD-u, our methods are comparable to NOGD, which shows that our methods are comparable to the best baseline all the time. Moreover, FESL-s performs worse than FESL-c in the beginning, but afterwards it becomes slightly better than FESL-c.

Table 2 shows the accuracy results on the synthetic and Reuter datasets. We can see that on the synthetic datasets, FESL-s outperforms the other methods on 8 datasets, FESL-c is the best on 5 datasets, and ROGD-u is also best on 5. NOGD performs worst since it starts from scratch. ROGD-u is better than NOGD and ROGD-f because it exploits the better-trained model from the old feature space and keeps updating it with recovered instances. Our two methods are based on NOGD and ROGD-u, and we can see that they follow the best baseline method or even outperform it. On the Reuter datasets, FESL-c outperforms the other methods on 17 datasets, FESL-s is the best on 9 datasets, NOGD on 8, and ROGD-u on 1. In the Reuter datasets, the period on the new feature space is longer than in the synthetic datasets, so NOGD can update itself into a good model, whereas ROGD-u updates itself with recovered data, so its model becomes worse as the recovery error accumulates. ROGD-f does not update itself at all, and thus performs worst. Our two methods take advantage of NOGD and ROGD-u and perform better than both.

Dataset NOGD ROGD-u ROGD-f FESL-c FESL-s
australian .767±.009 .849±.009 .809±.025 .849±.009 .849±.009
credit-a .811±.006 .826±.018 .785±.051 .827±.014 .831±.009
credit-g .659±.010 .733±.006 .716±.011 .733±.006 .733±.006
diabetes .650±.002 .652±.009 .651±.006 .652±.007 .652±.009
dna .610±.013 .691±.023 .608±.064 .691±.023 .692±.021
german .684±.006 .700±.002 .700±.002 .700±.001 .703±.004
kr-vs-kp .612±.005 .621±.036 .538±.024 .626±.028 .630±.016
splice .568±.005 .612±.022 .567±.057 .612±.022 .612±.022
svmguide3 .680±.010 .779±.010 .748±.012 .779±.010 .778±.010
r.EN-FR .902±.004 .849±.003 .769±.069 .903±.003 .902±.005
r.EN-GR .867±.005 .836±.007 .802±.036 .870±.002 .870±.003
r.EN-IT .858±.014 .847±.014 .831±.018 .861±.010 .863±.013
r.EN-SP .900±.002 .848±.002 .825±.001 .901±.001 .899±.002
r.FR-EN .858±.007 .776±.009 .754±.012 .858±.007 .858±.007
r.FR-GR .869±.004 .774±.019 .753±.021 .870±.004 .868±.003
r.FR-IT .874±.005 .780±.022 .744±.040 .874±.005 .873±.005
r.FR-SP .872±.001 .778±.022 .735±.013 .872±.001 .871±.002
r.GR-EN .907±.000 .850±.007 .801±.035 .907±.001 .906±.000
r.GR-FR .898±.001 .827±.009 .802±.023 .898±.001 .898±.000
r.GR-IT .847±.011 .851±.017 .816±.006 .850±.018 .851±.017
r.GR-SP .902±.001 .845±.003 .797±.012 .902±.001 .902±.001
r.IT-EN .854±.003 .760±.006 .730±.024 .856±.002 .854±.003
r.IT-FR .863±.002 .753±.012 .730±.020 .864±.002 .862±.003
r.IT-GR .849±.004 .736±.022 .702±.012 .849±.004 .846±.004
r.IT-SP .839±.006 .753±.014 .726±.005 .839±.007 .839±.006
r.SP-EN .926±.002 .860±.005 .814±.021 .926±.002 .924±.001
r.SP-FR .876±.005 .873±.017 .833±.042 .876±.014 .878±.012
r.SP-GR .871±.013 .827±.025 .810±.026 .873±.013 .873±.013
r.SP-IT .928±.002 .861±.005 .826±.005 .928±.003 .927±.002
Table 2: Accuracy with its variance on the synthetic and Reuter datasets. The larger, the better. The best ones among all the methods are marked in bold in the original table.

6 Conclusion

In this paper, we focus on a new setting: feature evolvable streaming learning. Our key observation is that in learning with streaming data, old features could vanish and new ones could occur. To make the problem tractable, we assume there is an overlapping period that contains samples from both feature spaces. Then, we learn a mapping from new features to old features, and in this way both the new and old models can be used for prediction. In our first approach FESL-c, we ensemble two predictions by learning weights adaptively. Theoretical results show that the assistance of the old feature space can improve the performance of learning with streaming data. Furthermore, we propose FESL-s to dynamically select the best model with better performance guarantee.


Acknowledgments

This research was supported by NSFC (61333014, 61603177), JiangsuSF (BK20160658), Huawei Fund (YBN2017030027) and the Collaborative Innovation Center of Novel Software Technology and Industrialization.


References

  • Aggarwal et al. (2006) C. C. Aggarwal, J. Han, J. Wang, and P. S. Yu. A framework for on-demand classification of evolving data streams. IEEE Transactions on Knowledge and Data Engineering, 18:577–589, 2006.
  • Aggarwal (2010) C. C. Aggarwal. Data streams: An overview and scientific applications. In Scientific Data Mining and Knowledge Discovery - Principles and Foundations, pages 377–397. Springer, 2010.
  • Amini et al. (2009) M.-R. Amini, N. Usunier, and C. Goutte. Learning from multiple partially observed views - an application to multilingual text categorization. In Advances in Neural Information Processing Systems 22, pages 28–36, 2009.
  • Bifet et al. (2010) A. Bifet, G. Holmes, R. Kirkby, and B. Pfahringer. MOA: Massive online analysis. Journal of Machine Learning Research, 11:1601–1604, 2010.
  • Cesa-Bianchi and Lugosi (2006) N. Cesa-Bianchi and G. Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006.
  • de Andrade Silva et al. (2013) J. de Andrade Silva, E. R. Faria, R. C. Barros, E. R. Hruschka, A. C. P. L. F. de Carvalho, and J. Gama. Data stream clustering: A survey. ACM Computing Surveys, 46:13:1–13:31, 2013.
  • Domingos and Hulten (2000) P. M. Domingos and G. Hulten. Mining high-speed data streams. In Proceedings of the 6th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 71–80, 2000.
  • Freund and Schapire (1997) Y. Freund and R. E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55:119–139, 1997.
  • Gaber et al. (2005) M. M. Gaber, A. B. Zaslavsky, and S. Krishnaswamy. Mining data streams: A review. SIGMOD Record, 34:18–26, 2005.
  • Gama and Rodrigues (2009) J. Gama and P. P. Rodrigues. An overview on mining data streams. In Foundations of Computational Intelligence, pages 29–45. Springer, 2009.
  • Guan and Li (2001) S. U. Guan and S. Li. Incremental learning with respect to new incoming input attributes. Neural Processing Letters, 14:241–260, 2001.
  • Hashemi et al. (2009) S. Hashemi, Y. Yang, Z. Mirzamomen, and M. R. Kangavari. Adapted one-versus-all decision trees for data stream classification. IEEE Transactions on Knowledge and Data Engineering, 21:624–637, 2009.
  • Hazan et al. (2007) E. Hazan, A. Agarwal, and S. Kale. Logarithmic regret algorithms for online convex optimization. Maching Learning, 69:169–192, 2007.
  • Hoeffding (1963) W. Hoeffding. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58:13–30, 1963.
  • Hoi et al. (2014) S. Hoi, J. Wang, and P. Zhao. LIBOL: A library for online learning algorithms. Journal of Machine Learning Research, 15:495–499, 2014.
  • Hou and Zhou (2016) C. Hou and Z.-H. Zhou. One-pass learning with incremental and decremental features. ArXiv e-prints, arXiv:1605.09082, 2016.
  • Kibria (2007) B. M. Golam Kibria. Bayesian statistics and marketing. Technometrics, 49:230, 2007.
  • Leite et al. (2009) D. Leite, P. Costa Jr., and F. Gomide. Evolving granular classification neural networks. In Proceedings of International Joint Conference on Neural Networks 2009, pages 1736–1743, 2009.
  • Li et al. (2014) S.-Y. Li, Y. Jiang, and Z.-H. Zhou. Partial multi-view clustering. In Proceedings of the 28th AAAI Conference on Artificial Intelligence, pages 1968–1974, 2014.
  • Muslea et al. (2002) I. Muslea, S. Minton, and C. Knoblock. Active + semi-supervised learning = robust multi-view learning. In Proceedings of the 19th International Conference on Machine Learning, pages 435–442, 2002.
  • Nguyen et al. (2012) H.-L. Nguyen, Y.-K. Woon, W. K. Ng, and L. Wan. Heterogeneous ensemble for feature drifts in data streams. In Proceedings of the 16th Pacific-Asia Conference on Knowledge Discovery and Data Mining, pages 1–12, 2012.
  • Nguyen et al. (2015) H.-L. Nguyen, Y.-K. Woon, and W. K. Ng. A survey on data stream clustering and classification. Knowledge and Information Systems, 45:535–569, 2015.
  • Oza (2005) N. C. Oza. Online bagging and boosting. In Proceedings of the IEEE International Conference on Systems, Man and Cybernetics 2005, pages 2340–2345, 2005.
  • Pan and Yang (2010) S. J. Pan and Q. Yang. A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering, 22:1345–1359, 2010.
  • Raina et al. (2007) R. Raina, A. Battle, H. Lee, B. Packer, and A. Ng. Self-taught learning: Transfer learning from unlabeled data. In Proceedings of the 24th International Conference on Machine Learning, pages 759–766, 2007.
  • Read et al. (2011) J. Read, A. Bifet, G. Holmes, and B. Pfahringer. Streaming multi-label classification. In Proceedings of the 2nd Workshop on Applications of Pattern Analysis, pages 19–25, 2011.
  • Samina et al. (2014) K. Samina, K. Tehmina, and N. Shamila. A survey of feature selection and feature extraction techniques in machine learning. In Proceedings of Science and Information Conference 2014, pages 372–378, 2014.
  • Seidl et al. (2009) T. Seidl, I. Assent, P. Kranen, R. Krieger, and J. Herrmann. Indexing density models for incremental learning and anytime classification on data streams. In Proceedings of the 12th International Conference on Extending Database Technology, pages 311–322, 2009.
  • Shalev-Shwartz (2012) S. Shalev-Shwartz. Online learning and online convex optimization. Foundations and Trends in Machine Learning, 4:107–194, 2012.
  • Tsang et al. (2007) I. W. Tsang, A. Kocsor, and J. T. Kwok. Simpler core vector machines with enclosing balls. In Proceedings of the 24th International Conference on Machine Learning, pages 911–918, 2007.
  • Wang et al. (2003) H. Wang, W. Fan, P. S. Yu, and J. Han. Mining concept-drifting data streams using ensemble classifiers. In Proceedings of the 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 226–235, 2003.
  • Wang et al. (2016) C. Wang, L. Xie, W. Wang, T. Xue, and S. Lu. Moving tag detection via physical layer analysis for large-scale RFID systems. In Proceedings of the 35th Annual IEEE International Conference on Computer Communications, pages 1–9, 2016.
  • Xu et al. (2013) C. Xu, D. Tao, and C. Xu. A survey on multi-view learning. ArXiv e-prints, arXiv:1304.5634, 2013.
  • Zhang et al. (2011) P. Zhang, J. Li, P. Wang, B. J. Gao, X. Zhu, and L. Guo. Enabling fast prediction for ensemble models on data streams. In Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 177–185, 2011.
  • Zhao et al. (2014) P. Zhao, S. Hoi, J. Wang, and B. Li. Online transfer learning. Artificial Intelligence, 216:76–102, 2014.
  • Zhou et al. (2012) G. Zhou, K. Sohn, and H. Lee. Online incremental feature learning with denoising autoencoders. In Proceedings of the 15th International Conference on Artificial Intelligence and Statistics, pages 1453–1461, 2012.
  • Zhou (2012) Z.-H. Zhou. Ensemble methods: Foundations and algorithms. CRC press, 2012.
  • Zhou (2016) Z.-H. Zhou. Learnware: On the future of machine learning. Frontiers of Computer Science, 10:589–590, 2016.
  • Zinkevich (2003) M. Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In Proceedings of the 20th International Conference on Machine Learning, pages 928–936, 2003.

Supplementary Material of “Learning with Feature Evolvable Streams”

In the Appendix, we prove the two theorems in the section “Our Proposed Approaches” and give some additional experimental results.

Appendix A: Analysis

In this section, we give the detailed proofs of the two theorems in “Our Proposed Approaches”. The two theorems are special cases of Theorem 2.2 and Corollary 5.1, respectively, in Cesa-Bianchi and Lugosi (2006).

A.1 Proof of Theorem 1

To prove Theorem 1, we propose to bound the quantity $\ln(W_T/W_0)$, where $w_{i,t} = e^{-\eta L_{i,t}}$ for $i = 1, \dots, N$ (with $N = 2$ base learners in our setting) and $W_t = \sum_{i=1}^{N} w_{i,t}$. Here $L_{i,t}$ is the cumulative loss at time $t$ of the $i$-th base learner, namely $L_{i,t} = \sum_{s=1}^{t} \ell(f_{i,s}, y_s)$. In the proof we use the following classical inequality due to Hoeffding Hoeffding (1963).

Lemma 1.

Let $X$ be a random variable with $a \le X \le b$. Then for any $s \in \mathbb{R}$,
$$\ln \mathbb{E}\left[e^{sX}\right] \le s\,\mathbb{E}[X] + \frac{s^2 (b-a)^2}{8}.$$

The detailed proof of Lemma 1 can be found in Section A.1 of the Appendix in Cesa-Bianchi and Lugosi (2006).
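As a quick numerical sanity check of Lemma 1, one can compare both sides for a two-point random variable (a minimal sketch; the helper name `hoeffding_gap` is our own):

```python
import math

def hoeffding_gap(a, b, p, s):
    """Bound minus log-MGF for X = a w.p. p, X = b w.p. 1 - p, so a <= X <= b.
    Lemma 1 says this gap is nonnegative for every real s."""
    mean = p * a + (1 - p) * b
    log_mgf = math.log(p * math.exp(s * a) + (1 - p) * math.exp(s * b))
    bound = s * mean + s * s * (b - a) ** 2 / 8
    return bound - log_mgf

# the gap is nonnegative across a grid of distributions and scales
assert all(hoeffding_gap(0.0, 1.0, p, s) >= 0.0
           for p in (0.1, 0.5, 0.9)
           for s in (-2.0, -0.5, 0.5, 2.0))
```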

Proof of Theorem 1.

First observe that
$$\ln \frac{W_T}{W_0} = \ln\Big(\sum_{i=1}^{N} e^{-\eta L_{i,T}}\Big) - \ln N \ge \ln\Big(\max_{i=1,\dots,N} e^{-\eta L_{i,T}}\Big) - \ln N = -\eta \min_{i=1,\dots,N} L_{i,T} - \ln N.$$

On the other hand, for each $t = 1, \dots, T$,
$$\ln \frac{W_t}{W_{t-1}} = \ln \frac{\sum_{i=1}^{N} w_{i,t-1}\, e^{-\eta\, \ell(f_{i,t},\, y_t)}}{\sum_{j=1}^{N} w_{j,t-1}}.$$

Now using Lemma 1 (with $s = -\eta$ and $X$ taking value $\ell(f_{i,t}, y_t) \in [0,1]$ with probability $w_{i,t-1}/W_{t-1}$), we observe that the quantity above may be upper bounded by
$$-\eta \sum_{i=1}^{N} \frac{w_{i,t-1}}{W_{t-1}}\, \ell(f_{i,t}, y_t) + \frac{\eta^2}{8} \le -\eta\, \ell(\hat{p}_t, y_t) + \frac{\eta^2}{8},$$
where we used the convexity of the loss function in its first argument and the way the weights are updated. Summing over $t = 1, \dots, T$, we get
$$\ln \frac{W_T}{W_0} \le -\eta \sum_{t=1}^{T} \ell(\hat{p}_t, y_t) + \frac{\eta^2 T}{8}.$$

Combining this with the lower bound above and solving for $\sum_{t=1}^{T} \ell(\hat{p}_t, y_t)$, we find that
$$\sum_{t=1}^{T} \ell(\hat{p}_t, y_t) - \min_{i=1,\dots,N} L_{i,T} \le \frac{\ln N}{\eta} + \frac{\eta T}{8},$$
as desired. In particular, with $\eta = \sqrt{8 \ln N / T}$, the upper bound becomes $\sqrt{(T/2)\ln N}$. ∎
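The bound just derived can be checked numerically. The sketch below feeds arbitrary per-round losses in $[0,1]$ to an exponentially weighted average over the experts' losses (which, by the convexity step in the proof, upper bounds the forecaster's loss) and verifies that the regret never exceeds $\sqrt{(T/2)\ln N}$ under the tuned $\eta$; the function name `ewa_regret` is our own.

```python
import math, random

def ewa_regret(expert_losses, eta):
    """Regret of the exponentially weighted forecaster when its per-round
    loss is the weighted average of expert losses (a sketch: by convexity
    the true forecaster's loss is at most this average)."""
    n = len(expert_losses)
    T = len(expert_losses[0])
    cum = [0.0] * n          # cumulative loss of each expert
    total_loss = 0.0
    for t in range(T):
        ws = [math.exp(-eta * cum[i]) for i in range(n)]
        z = sum(ws)
        total_loss += sum(ws[i] / z * expert_losses[i][t] for i in range(n))
        for i in range(n):
            cum[i] += expert_losses[i][t]
    return total_loss - min(cum)

random.seed(0)
T, N = 500, 2
losses = [[random.random() for _ in range(T)] for _ in range(N)]
eta = math.sqrt(8 * math.log(N) / T)
assert ewa_regret(losses, eta) <= math.sqrt(T * math.log(N) / 2)
```

Because the inequality in the proof is deterministic, the assertion holds for any loss sequence in $[0,1]$, not just this random one.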

A.2 Proof of Theorem 2

To prove Theorem 2, we first give some definitions. Since in FESL-s we only choose one base learner's prediction as our final prediction in each round, we use $i_t$ to denote the index of the chosen base learner in the $t$-th round, for $t = 1, \dots, T$. We call $i_t$ an action. So the loss in round $t$ can be denoted as $\ell(f_{i_t,t}, y_t)$. Thus, randomly choosing one base learner in each round is a randomized version of FESL-c, so we call it randomized FESL-c. Denote by $p_t$ the distribution according to which the random action $i_t$ is drawn at time $t$; then $\bar{\ell}(p_t, y_t) = \mathbb{E}_{i_t \sim p_t}\left[\ell(f_{i_t,t}, y_t)\right]$ is the expected loss of randomized FESL-c at time $t$. Then we have the following lemma:

Lemma 2.

Let $\delta \in (0,1)$ and $T \ge 1$. The randomized FESL-c over $N$ base learners with $\eta = \sqrt{8 \ln N / T}$ satisfies, with probability at least $1 - \delta$,
$$\sum_{t=1}^{T} \ell(f_{i_t,t}, y_t) - \min_{i=1,\dots,N} L_{i,T} \le \sqrt{\frac{T}{2}\ln N} + \sqrt{\frac{T}{2}\ln\frac{1}{\delta}}.$$


The random variables $\ell(f_{i_t,t}, y_t) - \bar{\ell}(p_t, y_t)$, for $t = 1, \dots, T$, form a sequence of bounded martingale differences. With a simple application of the Hoeffding-Azuma inequality, combined with the result of Theorem 1, we obtain the result of this lemma. ∎
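The Hoeffding-Azuma step can be sanity-checked by simulation. The snippet below uses i.i.d. $\pm 1$ differences, a special case of a bounded martingale difference sequence, and checks that the deviation $\sqrt{2T\ln(1/\delta)}$ is exceeded with frequency at most $\delta$ (a Monte Carlo illustration under our own naming, not part of the proof):

```python
import math, random

def azuma_exceed_rate(T, delta, runs, seed=0):
    """Fraction of runs in which a sum of i.i.d. +/-1 differences (a special
    bounded martingale) exceeds the Hoeffding-Azuma deviation
    sqrt(2 T ln(1/delta)); the inequality says this is at most delta."""
    rng = random.Random(seed)
    thresh = math.sqrt(2 * T * math.log(1 / delta))
    exceed = sum(
        sum(rng.choice((-1.0, 1.0)) for _ in range(T)) > thresh
        for _ in range(runs)
    )
    return exceed / runs

# empirical exceedance frequency stays (well) below delta
assert azuma_exceed_rate(T=100, delta=0.05, runs=2000) <= 0.05
```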

In addition, $(i_1^*, \dots, i_T^*)$ is defined as a sequence of base learner indices, so that we can study a more ambitious goal where the comparator is $\min_{(i_1^*, \dots, i_T^*)} \sum_{t=1}^{T} \ell(f_{i_t^*,t}, y_t)$, the minimum being taken over sequences that switch at most once. It is not difficult to modify the randomized FESL-c in order to achieve this goal. Specifically, we associate a compound action with each sequence $(i_1, \dots, i_T)$ which only switches once. Then we can run our randomized FESL-c over the set of compound actions: at any time $t$ the randomized FESL-c draws a compound action $(i_1, \dots, i_T)$ and plays action $i_t$. Denote by $M$ the number of all compound actions. Then, in FESL-c we only have $2$ base learners, while in this randomized FESL-c we have $M$ base learners. Then Lemma 2 implies that the regret with respect to the best compound action is bounded by $\sqrt{(T/2)\ln M} + \sqrt{(T/2)\ln(1/\delta)}$. Hence, it suffices to count the number of compound actions: for each number of switches $k \le s$ there are $\binom{T-1}{k}$ ways to pick the time steps where a switch occurs, and there are $N(N-1)^k$ ways to assign a distinct action to each of the resulting blocks (here $N = 2$ and $s = 1$). This gives
$$M = \sum_{k=0}^{s} \binom{T-1}{k} N (N-1)^k \le N (N-1)^s\, e^{(T-1) H\left(\frac{s}{T-1}\right)},$$

where $H(x) = -x\ln x - (1-x)\ln(1-x)$ is the binary entropy function defined for $x \in (0,1)$. Substituting this bound into Lemma 2, we find that, with probability at least $1-\delta$, the cumulative loss of FESL-s satisfies
$$\sum_{t=1}^{T} \ell(f_{i_t,t}, y_t) - \min_{(i_1^*, \dots, i_T^*)} \sum_{t=1}^{T} \ell(f_{i_t^*,t}, y_t) \le \sqrt{\frac{T}{2}\Big(\ln N + s\ln(N-1) + (T-1)\,H\Big(\frac{s}{T-1}\Big)\Big)} + \sqrt{\frac{T}{2}\ln\frac{1}{\delta}},$$
where the minimum is over all action sequences with at most $s$ switches (here $N = 2$ and $s = 1$).
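The counting argument above is easy to verify programmatically. The sketch below computes $M$ for $N = 2$ base learners and $s = 1$ switch and checks it against the binary-entropy bound (the helper names are our own):

```python
from math import comb, exp, log

def num_compound_actions(N, T, s):
    """Count action sequences of length T over N base learners with at most
    s switches: choose the switch positions, then distinct block labels."""
    return sum(comb(T - 1, k) * N * (N - 1) ** k for k in range(s + 1))

def entropy_bound(N, T, s):
    """Upper bound M <= N (N-1)^s exp((T-1) H(s/(T-1))), with H the binary
    entropy in natural log; valid for 0 < s < T - 1."""
    x = s / (T - 1)
    H = -x * log(x) - (1 - x) * log(1 - x)
    return N * (N - 1) ** s * exp((T - 1) * H)

N, T, s = 2, 100, 1
M = num_compound_actions(N, T, s)
assert M == 2 * T                    # with N = 2, s = 1: 2 + 2(T-1) = 2T
assert M <= entropy_bound(N, T, s)   # the entropy bound indeed dominates
```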