Prediction with Unpredictable Feature Evolution

04/27/2019 ∙ by Bo-Jian Hou, et al. ∙ Nanjing University

The feature space can change or evolve when learning with streaming data. Several recent works have studied feature evolvable learning. They usually assume that features do not vanish or appear in an arbitrary way. For example, when the battery lifespan of the sensors is known, all old sensors are replaced at the same time, so the old features vanish and the new features emerge simultaneously. However, different sensors have different lifespans, and thus the feature evolution can be unpredictable. In this paper, we propose a novel paradigm: Prediction with Unpredictable Feature Evolution (PUFE). We first complete the unpredictable overlapping period into an organized matrix and give a theoretical bound on the least number of observed entries. Then we learn the mapping from the completed matrix to recover the data from the old feature space when observing the data from the new feature space. With predictions on the recovered data, our model can exploit the advantage of the old feature space and is always comparable with any combination of the predictions on the current instance. Experiments on synthetic and real datasets validate the effectiveness of our method.

1 Introduction

In the big data era, data usually arrive in a streaming way because of their large volume and high velocity. Learning with streaming data has been studied extensively DBLP:conf/kdd/DomingosH00 ; DBLP:conf/edbt/SeidlAKKH09 ; DBLP:conf/ijcnn/LeiteCG09 ; DBLP:conf/icml/TsangKK07 . The corresponding methods usually assume that the feature space of the streaming data is fixed. However, in recent years, people have realized that the feature space of streaming data can evolve. For example, in an environment monitoring task, we deploy several sensors to gather data on temperature, humidity, illumination, etc., in order to predict the future status of the environment. Because sensors have limited lifespans, we need to replace the expired sensors with new ones. Thus the features corresponding to the expired sensors vanish and the features corresponding to the new sensors emerge.

There are two essential problems in this learning paradigm. The first is how to learn well when there are only a few data in the new feature space. The second is how to avoid wasting the data collection effort when old features vanish. These two problems are also described in learnware DBLP:journals/fcsc/Zhou16a , where the ability to handle evolvable features is needed. To solve these two problems, the authors of DBLP:conf/nips/Hou0Z17 assume that the sensors’ lifespans are known and that the old sensors expire at the same time, which means that the corresponding features vanish at the same time. Before the old sensors expire, new sensors are deployed in order to avoid the situation where no sensors work. Thus an overlapping period appears in which data have both old and new features. Through this overlapping period, a bridge is built to connect the old and new feature spaces.

Figure 1: Illustration of how the data stream arrives.

However, due to the different situations of sensors, such as differences in position, temperature, signal magnitude, etc., the sensors’ expiring times are different and unpredictable. Thus, the features corresponding to sensors with short lifespans vanish earlier than the features of sensors with long lifespans. Besides, although we do not know the exact lifespans of the sensors, we still roughly know them and will deploy new sensors simultaneously as replenishment before the old sensors expire. This kind of operation is natural and reasonable since, when the rough lifespans of all sensors are known, it is much more efficient and saves more work than deploying sensors one by one. Note that the positions or the number of the new sensors can differ from those of the old ones. Therefore, the way the data stream arrives can be illustrated as in Figure 1. Each column with blue or red color represents a feature gathered from a sensor. At the beginning, all sensors work together to gather data, which generates a complete matrix in the previous feature space. Then some sensors start to wear out, and the corresponding features (blue columns) stop extending, which produces a matrix with missing items. At the same time, because new sensors are deployed as substitutes, new features (red columns) start to grow, and a matrix in the current feature space appears. After this overlapping period, all old sensors have expired and all new sensors continue to gather data concurrently, so an intact matrix in the current feature space emerges.

In this paper, we propose a novel paradigm: Prediction with Unpredictable Feature Evolution (PUFE), where old features vanish unpredictably and thereby render the feature evolution unpredictable. In the first step, we leverage Frequent Directions DBLP:conf/kdd/Liberty13 and provide a matrix completion technique to complete the matrix with missing items from the overlapping period. We give a theoretical analysis of this matrix completion in the situation of sampling without replacement. We are then able to learn a mapping from the current feature space to the previous feature space. With this mapping, we can recover the data in the previous feature space when obtaining the data from the current feature space. In this way, the well-learned model from the previous feature space can be applied to the recovered data. With the prediction on the recovered data, our model can make use of the advantage of each base model and avoid their drawbacks. We give a theoretical guarantee that our model is always comparable with any combination of all the base models. Furthermore, our model can be extended to adaptively tackle the situation when a newer feature space appears. In other words, we do not need to decide manually which base model should be incorporated and which should be discarded. Our contributions can be summarized in the following four parts:

  1. We propose a more practical setting where the old features vanish unpredictably because different sensors have different lifespans.

  2. To tackle the unpredictable feature evolution, we formulate the setting as a matrix completion problem and propose an effective method with a theoretical guarantee: the number of observed entries needed to recover the target matrix exactly depends only on the number of rows of the target matrix and on its rank.

  3. We propose a new method to use the help from the previous feature space and give two theoretical guarantees: our model is always comparable to the best baseline, and it can be extended to adaptively tackle the situation when a newer feature space appears.

  4. The experiments show that our model is comparable to the best baseline and, surprisingly, better in most cases, which validates the effectiveness of PUFE and supports the theorems.

The rest of this paper is organized as follows. Section 2 introduces related work. Our proposed approach with corresponding theoretical guarantees is presented in Section 3. Section 4 reports experimental results. Finally, Section 5 concludes our paper.

2 Related Work

Our work is most related to DBLP:conf/nips/Hou0Z17 , which proposes a setting called “feature evolvable streaming learning”. The authors observe that in learning with streaming data, old features can vanish and new ones can appear. To make the problem tractable, they assume there is an overlapping period that contains samples from both feature spaces. They then learn a mapping from new features to old features, so that both the new and old models can be used for prediction. The overlapping period comes from the assumption that the old features vanish simultaneously, but this assumption usually does not hold. A more practical assumption is that different features vanish unpredictably, and thus there is no intact overlapping period. In this paper, we focus on this new setting and propose an effective method to tackle it. Another closely related work is DBLP:journals/pami/HouZ18 , which also handles evolving features in streaming data. However, it assumes overlapping features instead of an overlapping period, so the technical challenges and solutions are different. Besides, it assumes knowledge of how features vanish and emerge, which differs from our unpredictable feature evolution setting. Similar to DBLP:journals/pami/HouZ18 , the authors of DBLP:conf/icml/YeZ0Z18 also assume that there are overlapping features when the feature space evolves. They use an optimal transport technique to learn a mapping between the two feature spaces, but they consider the batch mode rather than the streaming mode. Learning with trapezoidal data streams DBLP:conf/icdm/ZhangZLDZW15 ; DBLP:journals/tkde/ZhangZL0ZW16 is also closely related to our work. It deals with trapezoidal data streams in which both instances and features can increase. Although the feature space evolves there as well, new data always share features with all old data, which is different from our setting.

Our work is also related to data stream mining, such as evolving neural networks DBLP:conf/ijcnn/LeiteCG09 , core vector machines DBLP:conf/icml/TsangKK07 , k-nearest neighbour DBLP:journals/tkde/AggarwalHWY06 , online bagging & boosting DBLP:conf/smc/Oza05 and weighted ensemble classifiers DBLP:conf/kdd/WangFYH03 ; DBLP:conf/pakdd/NguyenWNW12 . For more details, please refer to DBLP:books/daglib/p/Aggarwal10 . These conventional data stream mining methods usually assume that the data samples are described by the same set of features, while in many real streaming tasks the features often change. Other related topics involving multiple feature sets include multi-view learning DBLP:conf/aaai/LiJZ14 ; DBLP:conf/icml/MusleaMK02 ; DBLP:journals/corr/abs-1304-5634 , transfer learning DBLP:journals/tkde/PanY10 ; DBLP:conf/icml/RainaBLPN07 , etc. Although multi-view learning exploits the relation between different sets of features as ours does, there is a fundamental difference: multi-view learning assumes that every sample is described by multiple feature sets simultaneously, whereas in PUFE only a few samples in the feature switching period have two sets of features. Transfer learning usually assumes that data come in batches; few works consider the streaming case where data arrive sequentially and cannot be stored completely. One exception is online transfer learning DBLP:journals/ai/ZhaoHWL14 , in which data from both sets of features arrive sequentially. However, it assumes that all the feature spaces appear simultaneously during the whole learning process, and this assumption does not hold in PUFE.

Online learning DBLP:conf/icml/Zinkevich03 ; DBLP:journals/jmlr/HoiWZ14 is another related topic from the area of machine learning. It can naturally tackle the streaming data problem since it assumes that the data come in a streaming way. Specifically, at each round, after the learner makes a prediction on the given instance, the adversary reveals the loss, with which the learner improves its predictions so as to minimize the total loss over all rounds. Online learning has been extensively studied under different settings, such as learning with experts DBLP:books/daglib/0016248 and online convex optimization DBLP:journals/ml/HazanAK07 ; DBLP:journals/ftml/Shalev-Shwartz12 . There are strong theoretical guarantees for online learning, which usually uses the regret or the number of mistakes to measure the performance of the learning procedure. However, most existing online learning algorithms are limited to the case where the feature set is fixed.

3 The Proposed Approach

Our goal is to leverage the assistance from the previous feature space to always obtain good performance in the current feature space, whether at the beginning of the current period or at any later time step. The key idea is to establish a relationship between the previous and current feature spaces through an overlapping period in which both previous and current features exist. The well-learned model from the previous feature space can then be applied in the current feature space to assist the performance there. In our setting, however, we do not have an intact overlapping period, so we need to study whether we can rebuild it. Since time is seasonal, it is reasonable to assume that two instances at the same periodic point are linearly related. Thus, we have a chance to rebuild the overlapping period with the help of the observed instances. Concretely, our method has four main steps. The first step is to learn a good model in the previous feature space as a prepared backup. In the second step, in order to build the relationship between the previous and current feature spaces, we complete the overlapping period, that is, the matrix with missing items shown in Figure 1, where the features start to vanish and emerge. In the third step, we learn the mapping between the two feature spaces over the completed overlapping period. Finally, we learn a model in the current feature space, which is boosted by the well-learned model from the previous feature space. The framework of our method is summarized in Algorithm 1.

Note that completing the matrix in step two is not trivial, since the items are not missing at random but according to a certain rule. This is discussed in Section 3.2, where we also give a theoretical guarantee on the number of observed entries, which is much smaller than the conventional one. Learning a good model with the assistance from the previous feature space in step four is not easy either. At the beginning, we should follow the good model learned from the previous feature space. However, recovering the data introduces errors, and after a period of time this model becomes worse and worse as the recovery error accumulates. Thus we have to avoid this damage and obtain help adaptively. In the following, we give the details of the four steps. Steps two and four contain the main contributions of this paper.

1:  Learn a model sequentially in the previous feature space with Algorithm 2.
2:  Complete the matrix of the overlapping period shown in Figure 1 sequentially with Algorithm 3.
3:  Learn a mapping sequentially from the current feature space to the previous feature space using (3).
4:  Make predictions sequentially in the current feature space with the assistance from the previous feature space with Algorithm 4.
Algorithm 1 Framework of PUFE

3.1 Learn a Model in the Previous Feature Space

1:  Initialize the model randomly.
2:  for each round in the previous feature space do
3:     Receive the instance and predict
4:     Receive the target, and suffer the loss
5:     Update the model using (1)
6:  end for
Algorithm 2 Learn a Model in the Previous Feature Space

We work with the norm and the inner product of vectors. We consider the set of linear models in the previous feature space that we are interested in, together with the projection onto this set. We restrict our prediction function at each round to be linear, i.e., the inner product of the current model and the instance from the previous feature space at that round. The loss function is convex in its first argument. In implementing the algorithms, we use the logistic loss for classification tasks and the square loss for regression tasks.

We follow DBLP:conf/nips/Hou0Z17 to learn an online linear model by online gradient descent DBLP:conf/icml/Zinkevich03 . The models are updated according to:

(1)

where the step size varies over rounds. The process of learning a model in the previous feature space is summarized in Algorithm 2.
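The learning loop of this section can be sketched in a few lines of NumPy. This is only a minimal illustration under stated assumptions, not the authors' implementation: the feasible set is assumed to be an L2 ball of radius `radius`, the step size is assumed to be `1 / (c * sqrt(t))`, the model starts at zero instead of a random point, and any constant scaling of the logistic loss is omitted. All function and parameter names are hypothetical.

```python
import numpy as np

def logistic_loss_grad(w, x, y):
    """Gradient of ln(1 + exp(-y * <w, x>)) w.r.t. w, for labels y in {-1, +1}."""
    margin = y * np.dot(w, x)
    return -y * x / (1.0 + np.exp(margin))

def square_loss_grad(w, x, y):
    """Gradient of (<w, x> - y)^2 w.r.t. w."""
    return 2.0 * (np.dot(w, x) - y) * x

def project_l2_ball(w, radius):
    """Euclidean projection onto {w : ||w|| <= radius} (assumed feasible set)."""
    norm = np.linalg.norm(w)
    return w if norm <= radius else w * (radius / norm)

def online_gradient_descent(X, y, grad_fn, radius=1.0, c=1.0):
    """One pass over the stream: predict, receive the target, take a projected step."""
    w = np.zeros(X.shape[1])  # assumption: zero start instead of a random start
    for t, (x_t, y_t) in enumerate(zip(X, y), start=1):
        step = 1.0 / (c * np.sqrt(t))  # assumed form of the varied step size
        w = project_l2_ball(w - step * grad_fn(w, x_t, y_t), radius)
    return w
```

For a regression stream, the same loop is used with `square_loss_grad` in place of `logistic_loss_grad`.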

3.2 Complete Matrix

We assume each feature is represented by the data gathered by a sensor. In our scenario, when an old sensor expires, the corresponding feature vanishes forever. In other words, for each row of the matrix to be completed, the number of observed entries is always smaller than or equal to that of the preceding row. Besides, each element in the current row is observed only once, and the features vanish uniformly at random since the corresponding sensors expire uniformly at random. Thus this setting can be formulated as a matrix completion problem in which each row is sampled uniformly at random without replacement. Traditional matrix completion methods based on nuclear norm minimization DBLP:journals/jmlr/Recht11 ; DBLP:journals/cacm/CandesR12 are not appropriate in our setting because they usually assume that the observed entries are sampled uniformly at random from the whole matrix, whereas in our setting entries are observed according to the rule mentioned above. On the other hand, we handle a data stream, so it is more natural and appropriate to process it sequentially. Thus, it is desirable to complete each row immediately upon receiving it, which traditional matrix completion approaches cannot do either.

Specifically, we refer to the rows and columns of a matrix by their indices. For an index set, the corresponding sub-vector contains the elements of a vector indexed by the set, and the corresponding sub-matrix contains the rows of a matrix indexed by the set. Consider the matrix to be completed, i.e., the overlapping period with missing items in the previous feature space. This matrix and the complete matrix collected before it share the same feature space, and the same column in both matrices holds data gathered by the same sensor. Thus it is reasonable to assume that the two matrices are spanned by the same row space. Therefore, we can leverage the complete matrix to obtain this row space and recover each row of the incomplete matrix. Concretely, we calculate the top right singular vectors of the complete matrix, as many as its rank, which span its row space. Since in our online or one-pass setting we can only obtain one instance (row) at a time, we use the Frequent Directions technique DBLP:conf/kdd/Liberty13 ; DBLP:journals/siamcomp/GhashamiLPW16 ; DBLP:conf/icml/Huang18 to calculate them. Frequent Directions can compute the row space of a matrix in a streaming way. For each row of the incomplete matrix, we only observe a subset of its entries. This is equivalent to sampling a set of entries of the row uniformly at random without replacement. We then solve the following optimization problem

(2)

to recover this row from the optimal solution of (2), which only involves the part of the singular vectors indexed by the observed entries. Problem (2) has a closed-form solution, so each row can be recovered directly.

The detailed procedures are summarized in Algorithm 3.
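The recovery of a single row can be sketched as follows. This is an illustration under assumptions rather than the paper's exact formulation: problem (2) is taken to be an ordinary least-squares fit of the observed entries against the corresponding rows of the basis, the basis `V` is stored with shape `(d, r)` (one column per singular vector), and the names `complete_row`, `x_obs` and `obs_idx` are hypothetical.

```python
import numpy as np

def complete_row(x_obs, obs_idx, V):
    """Recover a full row from its observed entries.

    x_obs   : values of the observed entries, shape (k,)
    obs_idx : column indices of the observed entries, shape (k,)
    V       : assumed (d, r) matrix whose columns span the row space
    """
    V_obs = V[obs_idx, :]  # part of the basis seen by the observed entries
    # Least-squares coefficients of the row in the basis (assumed form of (2)).
    z, *_ = np.linalg.lstsq(V_obs, x_obs, rcond=None)
    return V @ z  # re-expand the coefficients to all d features
```

When the row truly lies in the span of `V` and the observed part of the basis has full column rank, the recovered row coincides with the original one up to numerical precision.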

1:  Input: number of observed entries per row, .
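2:  Calculate the top right singular vectors of the complete matrix in the previous feature space by Frequent Directions.
3:  for each row of the matrix to be completed do
4:     Sample a set of entries uniformly at random without replacement.
5:     Calculate the recovered row from the closed-form solution of (2).
6:  end for
7:  Output: the completed matrix.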
Algorithm 3 Complete Matrix
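Step 2 of Algorithm 3 relies on Frequent Directions. The sketch below follows the commonly described variant of that algorithm (shrink by the squared singular value at the middle index once the buffer is full); it is a simplified illustration, not the authors' code. The sketch size `ell` and the function names are hypothetical, and the restriction `2 <= ell <= d` is a simplification of this example only.

```python
import numpy as np

def frequent_directions(rows, d, ell):
    """Stream rows of length d and keep an (ell, d) sketch of their row space."""
    if not 2 <= ell <= d:
        raise ValueError("this sketch assumes 2 <= ell <= d")
    B = np.zeros((ell, d))
    next_free = 0
    for a in rows:
        if next_free == ell:  # buffer full: shrink to free half of the rows
            _, s, Vt = np.linalg.svd(B, full_matrices=False)
            delta = s[ell // 2] ** 2
            s_shrunk = np.sqrt(np.maximum(s ** 2 - delta, 0.0))
            B = s_shrunk[:, None] * Vt  # rows from index ell // 2 on are now zero
            next_free = ell // 2
        B[next_free] = a
        next_free += 1
    return B

def top_right_singular_vectors(B, r):
    """Return a (d, r) basis built from the top-r right singular vectors of B."""
    _, _, Vt = np.linalg.svd(B, full_matrices=False)
    return Vt[:r].T
```

The basis returned by `top_right_singular_vectors` can then be used to recover each incomplete row by least squares on its observed entries. When the streamed matrix has exact rank at most `ell // 2`, no energy is removed by the shrinking step, so the sketch spans the same row space as the full matrix.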

3.2.1 Theoretical Guarantee

Let , let and where and are the top- left and right singular vectors of , respectively. The incoherence measure for and is defined as

In the following theorem (proof deferred to supplementary file), we show that in the low-rank case where when observing entries, we can recover exactly with high probability.

Theorem 1

Assume the rank of is , and the number of observed entries in is . With a probability at least , Algorithm 3 recovers exactly.

Remark: The rows contain fewer and fewer observed entries as time goes on. Thus exact recovery is possible as long as the number of observed entries in the last row exceeds the required amount. Rows with fewer observed entries than this amount are simply discarded. The remaining rows then form an intact overlapping period that can be used to learn a mapping. Because the row space is obtained for free from the complete matrix, the resulting sample complexity is much smaller than that of conventional matrix completion DBLP:journals/jmlr/Recht11 .

3.3 Learn a Mapping from the Current to the Previous Feature Space

There are several methods to learn a relationship between two sets of features, including multivariate regression DBLP:journals/technometrics/Kibria07 , streaming multi-label learning DBLP:journals/jmlr/ReadBHP11 , etc. We follow DBLP:conf/nips/Hou0Z17 and choose the popular and effective method of least squares stigler1981gauss , which can be formulated as follows.

where the instance comes from the current feature space at each round. If the overlapping period is very short, it is unrealistic to learn a complex relationship between the two spaces. Instead, we can use a linear mapping as an approximation. Let the linear mapping be represented by a coefficient matrix; then, during the rounds of the overlapping period, this coefficient matrix can be estimated by the linear least squares method

The optimal solution to the above problem is given by

(3)

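Note that we do not need a budget to store instances from the overlapping period, because during this period the solution can be calculated in an online way: we first iteratively accumulate the two quantities that make up (3) and then combine them to obtain the coefficient matrix. Afterwards, if we only observe an instance from the current feature space, we can recover an instance in the previous feature space by applying the learned mapping, and the model learned in the previous feature space can be applied to the recovered instance.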
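The online computation of the mapping can be sketched as below. The sketch assumes that the two accumulated quantities are the sum of outer products of current-space instances with themselves and with the matching previous-space instances, which is the standard normal-equation form of linear least squares; it also swaps in a pseudo-inverse for numerical safety. The class and attribute names are hypothetical.

```python
import numpy as np

class OnlineLeastSquaresMap:
    """Accumulate sufficient statistics for a linear map from current to previous features."""

    def __init__(self, d_cur, d_prev):
        self.A = np.zeros((d_cur, d_cur))   # running sum of x_cur x_cur^T
        self.B = np.zeros((d_cur, d_prev))  # running sum of x_cur x_prev^T

    def update(self, x_cur, x_prev):
        """Process one row of the (completed) overlapping period; nothing is stored."""
        self.A += np.outer(x_cur, x_cur)
        self.B += np.outer(x_cur, x_prev)

    def solve(self):
        """Coefficient matrix M of shape (d_cur, d_prev); pinv is an assumption for stability."""
        return np.linalg.pinv(self.A) @ self.B

    def recover(self, x_cur):
        """Map a current-space instance back to the previous feature space."""
        return self.solve().T @ x_cur
```

Memory use is independent of the length of the overlapping period, since only the two accumulators are kept.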

3.4 Prediction in Current Feature Space

1:  Let .
2:  for  do
3:     Predict the weight of each base model using (5).
4:     Receive and make prediction or
5:     Calculate our prediction by
6:     Receive target , each base model suffers loss and our model suffers loss
7:     Set
8:     Update and using (4) and (6) respectively where .
9:  end for
Algorithm 4 Prediction in Current Feature Space

From the first round of the current feature space on, we can keep updating the model learned in the previous feature space using the recovered data, i.e.,

(4)

where the step size varies over rounds. The learner can then calculate two main base predictions: one from the updated previous-space model on the recovered instance and one from the model learned in the current feature space on the current instance. If we do not update the previous-space model and only use it to predict on the recovered data, we obtain another base prediction. By ensembling the base predictions in each round, or more concretely by taking our prediction to be the weighted combination of these base predictions, our model is able to follow the best base model both empirically and theoretically. We borrow the idea of learning with experts DBLP:conf/colt/LuoS15 to realize this. We first introduce some notation. To be general, we allow an arbitrary number of base models. Each base model has a weight and suffers a loss at every round. Our prediction at each round is the weighted combination of the base predictions, i.e., the inner product of the vector of weights and the vector of base predictions. The weights are taken from the simplex of all distributions over the base models. We define the weight function:

where is the potential function with defined to be 1. Then at each round we set to be proportional to :

(5)

where is the cumulative magnitude of up to time . When receiving instance from the current feature space , we can make prediction or Then with , we calculate our prediction by . After receiving target , our model and the base models suffer loss and , respectively. Then we update by

(6)

and by (4), where the step size varies over rounds and the feasible set is the set of linear models in the current feature space. The procedure of learning the model in the current feature space is summarized in Algorithm 4. In the following, we give a theoretical guarantee that this weight-adjusting strategy is able to follow the best models.
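The weighting scheme borrowed from DBLP:conf/colt/LuoS15 can be sketched as follows. The concrete formulas used here are the ones commonly given for AdaNormalHedge (potential `exp(max(R, 0)^2 / (3 C))` and a weight function built from a difference of potentials); they are stated as an assumption about what the stripped equations contain, and losses are assumed to lie in [0, 1]. Class and method names are hypothetical.

```python
import numpy as np

def potential(R, C):
    """Assumed potential: exp(max(R, 0)^2 / (3 C)), evaluated only for C > 0."""
    return np.exp(np.maximum(R, 0.0) ** 2 / (3.0 * C))

def weight(R, C):
    """Assumed weight function: half the difference of shifted potentials."""
    return 0.5 * (potential(R + 1.0, C + 1.0) - potential(R - 1.0, C + 1.0))

class AdaNormalHedge:
    def __init__(self, n_models, prior=None):
        self.q = (np.full(n_models, 1.0 / n_models) if prior is None
                  else np.asarray(prior, dtype=float))
        self.R = np.zeros(n_models)  # cumulative instantaneous regret per base model
        self.C = np.zeros(n_models)  # cumulative magnitude of that regret

    def weights(self):
        """Weights proportional to prior times weight function; fall back to the prior."""
        p = self.q * weight(self.R, self.C)
        total = p.sum()
        return p / total if total > 0 else self.q / self.q.sum()

    def update(self, our_loss, base_losses):
        """Regret of each base model is our loss minus its loss in this round."""
        r = our_loss - np.asarray(base_losses, dtype=float)
        self.R += r
        self.C += np.abs(r)
```

In each round one would call `weights()`, form the combined prediction as the weighted sum of the base predictions, and after the target is revealed call `update` with the loss of the combined prediction and the losses of the base models. No learning rate has to be tuned.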

3.4.1 Theoretical Guarantee

We denote the cumulative loss of each model from by

The cumulative loss of our model from is denoted by

Then we have the following theorem.

Theorem 2

For any distribution , the cumulative loss of our model is bounded as follows:

(7)

where , . We use to hide the “” terms since they are very small and thus we consider these terms to be nearly constant.

Remark: This theorem is a special case of Theorem 1 from DBLP:conf/colt/LuoS15 . One can find the proof in its appendix. This theorem shows that our model is comparable to any linear combination of base models. Furthermore, since is the cumulative magnitude of . Thus If concentrates on the best model with minimum cumulative loss, then the upper bound will become which is exactly the bound in DBLP:conf/nips/Hou0Z17 , which means that our model is comparable to the best model. Yet our bound has several merits over that in DBLP:conf/nips/Hou0Z17 . First ours is parameter free which means we do not have to tune that appears in the exponential formula in DBLP:conf/nips/Hou0Z17 . Second, is the worst case. As long as is not always the worst, will be much smaller than . Another advantage is that we can utilize any number of base models while in DBLP:conf/nips/Hou0Z17 they only focus on two.

Furthermore, we can set in the following way:

(8)

where each base model has a confidence at every round. The problem studied above is the special case of this general setting in which every base model has full confidence at every round. A confidence of zero means that the base model does not contribute to our prediction. Besides, the upper bound in Theorem 2 still holds, which is summarized below.

Theorem 3

For any distribution , the cumulative loss of our model is bounded as follows:

(9)

where is the total number of the base models created from to .

Remark: The proof of this theorem can also be found in the appendix of DBLP:conf/colt/LuoS15 . Adding the confidence term brings an obvious benefit in our continual setting. Specifically, we focus on the case of binary confidence, which means that a base model either participates in our prediction or does not. If the confidence is zero, the base model is “asleep” at that round. A base model that has never appeared before should be thought of as being asleep for all previous rounds. Thus, if the current feature space vanishes and a new feature space appears, new base models appear, and these base models can be regarded as having been asleep during the current and previous feature spaces. In this way, we do not need to decide manually which base model should be incorporated and which should be discarded.
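The effect of the confidence term can be illustrated with a small sketch. It assumes the confidence-weighted form usually given for this family of algorithms: the weight of each base model is multiplied by its confidence, and the instantaneous regret is scaled by the same confidence, so an asleep model neither receives weight nor accumulates regret. The function names and the fallback to the confidence-weighted prior are hypothetical choices of this example.

```python
import numpy as np

def potential(R, C):
    """Assumed potential: exp(max(R, 0)^2 / (3 C)), evaluated only for C > 0."""
    return np.exp(np.maximum(R, 0.0) ** 2 / (3.0 * C))

def weight(R, C):
    """Assumed weight function: half the difference of shifted potentials."""
    return 0.5 * (potential(R + 1.0, C + 1.0) - potential(R - 1.0, C + 1.0))

def confident_weights(q, R, C, conf):
    """Weights proportional to confidence * prior * weight function.

    conf holds the per-model confidence in [0, 1]; 0 means the model is asleep.
    At least one model is assumed to be awake.
    """
    p = conf * q * weight(R, C)
    if p.sum() <= 0:
        p = conf * q  # hypothetical fallback when every awake weight is zero
    return p / p.sum()

def confident_regret(our_loss, base_losses, conf):
    """Instantaneous regret scaled by confidence; asleep models get zero regret."""
    return conf * (our_loss - np.asarray(base_losses, dtype=float))
```

A base model created for a newer feature space would simply carry confidence 0 for all earlier rounds and confidence 1 once it exists.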

4 Experiments

Dataset        NOGD          ROGD-f        r   ROGD-u        r   FESL-c        r   FESL-s        r   PUFE (Ours)   r   average r
australian-C   .7727±.0061   .8473±.0188   1   .8631±.0044   1   .8630±.0044   1   .8627±.0042   1   .8630±.0044   1   1.0
australian-I   .7727±.0061   .7891±.0677   3   .8542±.0094   3   .8541±.0095   3   .8536±.0092   3   .8542±.0093   3   3.0
australian-IC  .7727±.0061   .8066±.0602   2   .8550±.0086   2   .8549±.0087   2   .8543±.0085   2   .8549±.0086   2   2.0
credit-a-C     .6876±.0128   .6457±.0710   1   .7886±.0287   1   .7760±.0283   1   .7840±.0276   1   .7876±.0286   1   1.0
credit-a-I     .6876±.0128   .6251±.1058   2   .6998±.0650   3   .7186±.0294   3   .7178±.0324   3   .7211±.0317   3   2.8
credit-a-IC    .6876±.0128   .6251±.1058   2   .7005±.0665   2   .7190±.0301   2   .7185±.0327   2   .7215±.0324   2   2.0
diabetes-C     .6136±.0064   .6575±.0160   1   .6792±.0044   1   .6769±.0045   1   .6773±.0043   1   .6794±.0041   1   1.0
diabetes-I     .6136±.0064   .4859±.1104   2   .6599±.0179   3   .6597±.0171   3   .6560±.0223   3   .6564±.0206   3   2.8
diabetes-IC    .6136±.0064   .4858±.1104   3   .6600±.0180   2   .6598±.0171   2   .6562±.0223   2   .6565±.0207   2   2.2
dna-C          .6084±.0041   .7142±.0337   1   .7526±.0299   1   .7526±.0299   1   .7525±.0295   1   .7526±.0297   1   1.0
dna-I          .6084±.0041   .6318±.0571   2   .7164±.0313   3   .7164±.0313   3   .7162±.0310   3   .7164±.0311   3   2.8
dna-IC         .6084±.0041   .6317±.0571   3   .7165±.0313   2   .7165±.0313   2   .7163±.0310   2   .7165±.0311   2   2.2
german-C       .6843±.0046   .7000±.0016   1   .7002±.0014   1   .6997±.0016   1   .6999±.0034   1   .7002±.0014   1   1.0
german-I       .6843±.0046   .6960±.0054   3   .6996±.0022   3   .6991±.0020   3   .6993±.0018   3   .6999±.0022   3   3.0
german-IC      .6843±.0046   .6964±.0057   2   .6998±.0020   2   .6992±.0019   2   .6998±.0020   2   .7000±.0019   2   2.0
kr-vs-kp-C     .6110±.0034   .6222±.0563   1   .7353±.0285   1   .7345±.0283   1   .7314±.0308   1   .7352±.0286   1   1.0
kr-vs-kp-I     .6110±.0034   .5733±.0319   2   .7114±.0351   2   .7111±.0350   2   .7104±.0352   3   .7114±.0351   2   2.2
kr-vs-kp-IC    .6110±.0034   .5733±.0319   2   .7114±.0351   2   .7111±.0351   2   .7105±.0352   2   .7114±.0352   2   2.0
splice-C       .5664±.0024   .5890±.0368   1   .6564±.0129   1   .6564±.0129   1   .6563±.0129   1   .6564±.0129   1   1.0
splice-I       .5664±.0024   .5613±.0350   2   .6478±.0125   3   .6478±.0125   3   .6478±.0125   2   .6478±.0124   3   2.6
splice-IC      .5664±.0024   .5613±.0350   2   .6479±.0125   2   .6479±.0125   2   .6478±.0125   2   .6479±.0125   2   2.0
svmguide3-C    .6802±.0048   .7483±.0124   1   .7839±.0067   1   .7839±.0067   1   .7835±.0066   1   .7839±.0067   1   1.0
svmguide3-I    .6802±.0048   .6055±.0609   2   .7494±.0373   3   .7439±.0429   2   .7455±.0401   2   .7422±.0438   2   2.2
svmguide3-IC   .6802±.0048   .6055±.0609   2   .7497±.0365   2   .7439±.0429   2   .7455±.0402   2   .7422±.0438   2   2.0
RFID-C         2.175±0.058   1.641±0.084   1   1.297±0.082   1   1.309±0.081   1   1.309±0.082   1   1.304±0.082   1   1.0
RFID-I         2.175±0.058   2.177±0.092   3   1.719±0.069   3   1.732±0.069   3   1.730±0.068   3   1.724±0.068   3   3.0
RFID-IC        2.175±0.058   1.992±0.078   2   1.575±0.065   2   1.588±0.064   2   1.589±0.066   2   1.581±0.064   2   2.0
Table 1: The first eight big rows (each containing three unit rows) report the accuracy ± standard deviation on the synthetic datasets. The last big row reports the mean square error ± standard deviation on the real dataset. The best result among all methods is shown in bold, and the best result among FESL-c, FESL-s and PUFE is additionally marked. “r” means “rank”. “dataset-C”, “dataset-I” and “dataset-IC” mean that the overlapping period is complete, incomplete, and incomplete but completed by our method, respectively. For all methods except NOGD, we expect “dataset-C” to rank 1, “dataset-I” to rank 3 and “dataset-IC” to rank 2.

In this section, we first introduce the datasets that we use. Then we describe the compared approaches and experimental settings. Finally, we show the results of our experiments.

4.1 Datasets

We conduct our experiments on 9 datasets, consisting of 8 synthetic datasets and 1 real dataset. To generate synthetic data, we randomly choose datasets from different domains, including economy, biology, literature, etc. (the datasets can be found at http://archive.ics.uci.edu/ml/). We artificially map the original datasets into another feature space by random Gaussian matrices, so that we have data from both the previous and the current feature space. Since the original data are in batch mode, we manually make them arrive sequentially. In the overlapping period, we discard entries of each row uniformly at random from the remaining features, obeying the vanishing rule described in Section 3.2. This completes the generation of the synthetic data.
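The generation procedure can be sketched as follows. This is an illustration under assumptions, not the authors' script: the number of features dropped per row (`n_drop_per_row`) is a hypothetical parameter, at least one feature is kept alive, and the random Gaussian matrix is drawn with unit variance.

```python
import numpy as np

def make_second_feature_space(X_prev, d_cur, rng):
    """Map data into another feature space with a random Gaussian matrix."""
    G = rng.normal(size=(X_prev.shape[1], d_cur))
    return X_prev @ G

def monotone_vanishing_mask(n_rows, n_features, n_drop_per_row, rng):
    """Boolean mask for the overlapping period; True means the entry is observed.

    Features vanish uniformly at random among those still alive, and a vanished
    feature never comes back, so each row observes a subset of the previous row.
    """
    alive = np.ones(n_features, dtype=bool)
    mask = np.zeros((n_rows, n_features), dtype=bool)
    for t in range(n_rows):
        candidates = np.flatnonzero(alive)
        # Assumption of this sketch: always keep at least one feature alive.
        k = min(n_drop_per_row, max(len(candidates) - 1, 0))
        if k > 0:
            gone = rng.choice(candidates, size=k, replace=False)
            alive[gone] = False
        mask[t] = alive
    return mask
```

Applying the mask to the previous-space rows of the overlapping period (for example by setting unobserved entries to `np.nan`) yields an incomplete matrix that obeys the vanishing rule.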

We use a real dataset named “RFID”, collected by ourselves, which contains 450 instances from the previous and the current feature space, respectively. The RFID technique is widely used for detecting moving goods DBLP:conf/infocom/WangX0XL16 . This dataset uses the RFID technique to gather the location coordinates of moving goods with RFID tags attached. Concretely, several RFID aerials are arranged around an indoor area. In each round, each RFID aerial receives the tag signals; then the tagged goods move, and at the same time their coordinates are recorded. Before the aerials expire, new aerials are arranged beside the old ones to avoid the situation without aerials. So in this overlapping period, data come from both the previous and the current feature space. After the old aerials expire, the new ones continue to receive signals, and only data from the current feature space remain. The overlapping period in this dataset is complete, so we simulate unpredictable feature evolution as we did for the synthetic data. The modified RFID data therefore satisfy our assumptions.

4.2 Compared Approaches and Settings

We compare PUFE with five approaches. In the first approach, once the feature space changes, the online gradient descent algorithm is invoked from scratch; it is named NOGD (Naive Online Gradient Descent). The second and third approaches are called ROGD-u (Updating Recovered Online Gradient Descent) and ROGD-f (Fixed Recovered Online Gradient Descent). Both use the model learned from the previous feature space by online gradient descent to predict on the recovered data; ROGD-u keeps updating this model with the recovered data, while ROGD-f keeps it fixed. The fourth and fifth approaches, FESL-c and FESL-s, are from DBLP:conf/nips/Hou0Z17 ; they use the exponential of the loss to update the weights of the base models and assume a complete overlapping period. The difference between them is that FESL-c combines the base models while FESL-s selects the best base model.

Figure 2: The trend of average cumulative loss on synthetic and real data. Panels: (a) australian, (b) credit-a, (c) diabetes, (d) dna, (e) german, (f) kr-vs-kp, (g) splice, (h) svmguide3, (i) RFID. The smaller the average cumulative loss, the better. At every time step, the average cumulative loss of our method is comparable to the best of the baseline methods.

We evaluate the empirical performance of the approaches on classification and regression tasks during the rounds of the current feature space. We first report the accuracy on the synthetic datasets and the mean square error on the real dataset, computed over all instances in these rounds. To verify that our completion module is effective, we run experiments on each dataset under three settings: “complete overlapping period”, “incomplete overlapping period”, and “incomplete overlapping period but we complete it”. We expect the first setting to perform best, since the overlapping period is complete; the third to be the runner-up, since the incomplete overlapping period is repaired; and the second to be the worst, owing to the lack of information. Furthermore, to verify that our model is comparable to the best base model, we present the trend of the average cumulative loss: at each time, the loss reported for every method is its cumulative loss up to that time divided by the number of rounds elapsed. All results are averaged over 10 independent runs. The parameters to set are the number of instances in the overlapping period, the numbers of instances in the previous and current feature spaces, and the step size, which depends on time. All baseline methods and our method use the same parameters. In our experiments, the number of instances in the overlapping period is set to 10, 20, or 25 for synthetic data and 40 for RFID data. The numbers of instances in the previous and current feature spaces are each set to approximately half of the total number of instances, and the free parameter of the step size is chosen by search over a range of candidate values.
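The average cumulative loss described above, and the averaging over independent runs, can be computed as in the sketch below. The function names and the toy per-round losses are assumptions chosen for illustration.

```python
from itertools import accumulate


def average_cumulative_loss(losses):
    """Return, for each time step, the cumulative loss so far divided by
    the number of rounds elapsed."""
    return [total / (t + 1) for t, total in enumerate(accumulate(losses))]


def mean_over_runs(trends):
    """Average several equally long trends element-wise (one trend per run)."""
    return [sum(values) / len(values) for values in zip(*trends)]


# Toy per-round losses from two independent runs of the same method.
run_a = [1.0, 0.0, 1.0, 0.0]
run_b = [0.0, 0.0, 1.0, 1.0]

trend_a = average_cumulative_loss(run_a)
trend_b = average_cumulative_loss(run_b)
mean_trend = mean_over_runs([trend_a, trend_b])
```

Plotting `mean_trend` against the round index gives one curve of the kind shown in Figure 2; a curve that falls over time indicates a method that keeps improving as data arrive.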

4.3 Results

The accuracy and mean square error results are shown in Table 1. The first eight big rows (each containing three unit rows) give the accuracy with its standard deviation on the synthetic datasets. The last big row gives the mean square error with its standard deviation on the real dataset. The best results among all methods are in bold, and the best among FESL-c, FESL-s, and PUFE carry an additional mark. “r” means “rank”. “dataset-C”, “dataset-I”, and “dataset-IC” mean that the overlapping period is “complete”, “incomplete”, and “incomplete but we complete it”, respectively. For all methods except NOGD, we expect “dataset-C” to rank 1, “dataset-IC” to rank 2, and “dataset-I” to rank 3. NOGD starts learning only once the feature space has changed, so it is not influenced by the overlapping period. As can be seen, of the 27 cases in total, our PUFE outperforms all other methods on 15 and outperforms FESL-c and FESL-s on 23. Note that PUFE does not have to be better than all the base models, only comparable to the best of them. NOGD performs worst since it starts from scratch. ROGD-u is better than NOGD and ROGD-f because it exploits the better-trained model from the old feature space and keeps updating it with recovered instances. Our method follows the best baseline method and sometimes even outperforms it. The ranking of the three settings also nearly matches our expectation: performance on every “dataset-C” ranks 1, on almost every “dataset-IC” ranks 2, and on almost every “dataset-I” ranks 3. In the cases that violate the expectation, our completion operation does not improve the accuracy or mean square error, because the data matrix is not low-rank. For example, on “kr-vs-kp”, completion does not improve the performance of ROGD-f, ROGD-u, FESL-c, or PUFE. We find that the matrix in “kr-vs-kp” has full rank, which means we cannot recover the original complete matrix well.
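Whether a data matrix is low-rank or full-rank, as in the “kr-vs-kp” observation above, can be checked before relying on completion. The sketch below computes the exact rank by Gaussian elimination over rational numbers; the function name `matrix_rank` and the toy matrices are assumptions, and for noisy real-valued data a numerical rank based on singular values with a tolerance would be the more usual choice.

```python
from fractions import Fraction


def matrix_rank(rows):
    """Exact rank of a matrix (list of rows) via Gaussian elimination."""
    m = [[Fraction(x) for x in row] for row in rows]
    n_cols = len(m[0]) if m else 0
    rank = 0
    for col in range(n_cols):
        # Find a row at or below the current pivot position with a non-zero entry.
        pivot = next((r for r in range(rank, len(m)) if m[r][col] != 0), None)
        if pivot is None:
            continue
        m[rank], m[pivot] = m[pivot], m[rank]
        # Eliminate the entries below the pivot.
        for r in range(rank + 1, len(m)):
            factor = m[r][col] / m[rank][col]
            m[r] = [a - factor * b for a, b in zip(m[r], m[rank])]
        rank += 1
        if rank == len(m):
            break
    return rank


# Second row is twice the first, so this matrix has rank 2.
low_rank = [[1, 2, 3], [2, 4, 6], [1, 0, 1]]
full_rank = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]

low = matrix_rank(low_rank)
full = matrix_rank(full_rank)
is_full = full == min(len(full_rank), len(full_rank[0]))
```

A matrix whose rank equals its smaller dimension leaves no redundancy between rows for completion to exploit, which is consistent with completion not helping in that case.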

Figure 2 gives the trend of the average cumulative loss. Panels (a-h) show the results on synthetic data and panel (i) the result on the real data. The smaller the average cumulative loss, the better. From the experimental results, we make the following observations. First, the loss of NOGD decreases rapidly, which conforms to the fact that NOGD improves steadily as more and more correct data arrive. Second, the loss of ROGD-u also declines, but less noticeably: ROGD-u has already learned well on the earlier rounds and tends to converge, so updating with more recovered data brings little additional benefit. Third, the loss of ROGD-f does not drop and can even rise, which is also reasonable: its model is fixed, so any recovery errors degrade its performance. FESL-c and FESL-s are based on NOGD and ROGD-u, so their average cumulative losses also decrease. Our PUFE is based on the five base methods; its average cumulative loss follows the best curve throughout and is already good at the beginning of the period. This matters because at the beginning of the current feature space data are few and a good model is hard to learn, yet good performance is needed every day or even at every single time step.

5 Conclusion and Discussion

In this paper, we focus on a new and more practical setting: prediction with unpredictable feature evolution. In this setting, the vanishing of old features is usually unpredictable, which leaves the overlapping period fragmentary. We complete this fragmentary period by formulating it as a matrix completion problem. Using the row space obtained for free from the preceding matrix, we need only a bounded number of observed entries, depending on the number of rows of the target matrix and its rank, to recover the target matrix exactly. We also provide a new way to adaptively combine the base models. Theoretical results show that our model is always comparable to the best base model. Our model therefore remains reliable at the beginning of the new feature space, which is in line with robustness, an important topic in the machine learning community today.
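The intuition for why a known row space reduces the number of entries that must be observed can be sketched as follows. This is not the completion algorithm or the bound of this paper; it is a minimal noise-free illustration under the assumption that every row of the target matrix is a linear combination of a few known basis rows, so that each row has only as many unknowns as the rank. The names `solve` and `complete_row` and the toy basis are hypothetical.

```python
from fractions import Fraction


def solve(a, b):
    """Solve the square system a x = b exactly by Gauss-Jordan elimination.
    Raises StopIteration if the system is singular."""
    n = len(a)
    aug = [[Fraction(v) for v in row] + [Fraction(rhs)] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if aug[r][col] != 0)
        aug[col], aug[pivot] = aug[pivot], aug[col]
        aug[col] = [v / aug[col][col] for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [row[n] for row in aug]


def complete_row(row, basis):
    """Fill the None entries of `row`, assuming it lies in the span of `basis`.

    Solves least squares on the observed columns via the normal equations,
    then reconstructs the whole row from the fitted coefficients.
    """
    observed = [j for j, v in enumerate(row) if v is not None]
    r = len(basis)
    gram = [[sum(basis[i][j] * basis[k][j] for j in observed) for k in range(r)]
            for i in range(r)]
    rhs = [sum(basis[i][j] * row[j] for j in observed) for i in range(r)]
    coeffs = solve(gram, rhs)
    return [sum(c * basis[i][j] for i, c in enumerate(coeffs))
            for j in range(len(row))]


# Known rank-2 row space (assumed to come from the preceding matrix).
basis = [[1, 0, 1, 2], [0, 1, 1, -1]]
# True row is 3 * basis[0] + 2 * basis[1] = [3, 2, 5, 4]; two entries are missing.
partial = [3, None, 5, None]
completed = complete_row(partial, basis)
```

Here two observed entries suffice to recover a row of length four because the row space has rank two; without the basis, nothing would tie the missing entries to the observed ones.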

Finally, we want to emphasize that although this is a more realistic setting, data for it are not yet widely available. We therefore collected a real dataset ourselves. We also use synthetic datasets that fully satisfy our setting to validate the effectiveness of PUFE. Since feature evolution is an important and difficult problem, we would like to collect more real datasets in the future.

References

  • [Agg10] C. C. Aggarwal. Data streams: An overview and scientific applications. In Scientific Data Mining and Knowledge Discovery - Principles and Foundations, pages 377–397. Springer, 2010.
  • [AHWY06] C. C. Aggarwal, J. Han, J. Wang, and P. S. Yu. A framework for on-demand classification of evolving data streams. IEEE Transactions on Knowledge and Data Engineering, 18:577–589, 2006.
  • [CL06] N. Cesa-Bianchi and G. Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006.
  • [CR12] Emmanuel J. Candès and Benjamin Recht. Exact matrix completion via convex optimization. Commun. ACM, 55(6):111–119, 2012.
  • [DH00] P. M. Domingos and G. Hulten. Mining high-speed data streams. In Proceedings of the 6th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 71–80, 2000.
  • [GLPW16] Mina Ghashami, Edo Liberty, Jeff M. Phillips, and David P. Woodruff. Frequent directions: Simple and deterministic matrix sketching. SIAM Journal on Computing, 45(5):1762–1792, 2016.
  • [HAK07] E. Hazan, A. Agarwal, and S. Kale. Logarithmic regret algorithms for online convex optimization. Machine Learning, 69:169–192, 2007.
  • [Hua18] Zengfeng Huang. Near optimal frequent directions for sketching dense and sparse matrices. In Proceedings of the 35th International Conference on Machine Learning, pages 2053–2062, 2018.
  • [HWZ14] S. Hoi, J. Wang, and P. Zhao. LIBOL: A library for online learning algorithms. Journal of Machine Learning Research, 15:495–499, 2014.
  • [HZ18] Chenping Hou and Zhi-Hua Zhou. One-pass learning with incremental and decremental features. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(11):2776–2792, 2018.
  • [HZZ17] Bo-Jian Hou, Lijun Zhang, and Zhi-Hua Zhou. Learning with feature evolvable streams. In Advances in Neural Information Processing Systems 30, pages 1416–1426, 2017.
  • [Kib07] B. M. Golam Kibria. Bayesian statistics and marketing. Technometrics, 49:230, 2007.
  • [Lib13] Edo Liberty. Simple and deterministic matrix sketching. In The 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 581–588, 2013.
  • [LJG09] D. Leite, P. Costa Jr., and F. Gomide. Evolving granular classification neural networks. In Proceedings of International Joint Conference on Neural Networks 2009, pages 1736–1743, 2009.
  • [LJZ14] S.-Y. Li, Y. Jiang, and Z.-H. Zhou. Partial multi-view clustering. In Proceedings of the 28th AAAI Conference on Artificial Intelligence, pages 1968–1974, 2014.
  • [LS15] Haipeng Luo and Robert E. Schapire. Achieving all with no parameters: Adanormalhedge. In Proceedings of The 28th Conference on Learning Theory, pages 1286–1304, 2015.
  • [MMK02] I. Muslea, S. Minton, and C. Knoblock. Active + semi-supervised learning = robust multi-view learning. In Proceedings of the 19th International Conference on Machine Learning, pages 435–442, 2002.
  • [NWNW12] H.-L. Nguyen, Y.-K. Woon, W. K. Ng, and L. Wan. Heterogeneous ensemble for feature drifts in data streams. In Proceedings of the 16th Pacific-Asia Conference on Knowledge Discovery and Data Mining, pages 1–12, 2012.
  • [Oza05] N. C. Oza. Online bagging and boosting. In Proceedings of the IEEE International Conference on Systems, Man and Cybernetics 2005, pages 2340–2345, 2005.
  • [PY10] S. J. Pan and Q. Yang. A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering, 22:1345–1359, 2010.
  • [RBHP11] J. Read, A. Bifet, G. Holmes, and B. Pfahringer. Streaming multi-label classification. In Proceedings of the 2nd Workshop on Applications of Pattern Analysis, pages 19–25, 2011.
  • [RBL07] R. Raina, A. Battle, H. Lee, B. Packer, and A. Ng. Self-taught learning: Transfer learning from unlabeled data. In Proceedings of the 24th International Conference on Machine Learning, pages 759–766, 2007.
  • [Rec11] Benjamin Recht. A simpler approach to matrix completion. Journal of Machine Learning Research, 12:3413–3430, 2011.
  • [SAK09] T. Seidl, I. Assent, P. Kranen, R. Krieger, and J. Herrmann. Indexing density models for incremental learning and anytime classification on data streams. In Proceedings of the 12th International Conference on Extending Database Technology, pages 311–322, 2009.
  • [Sha12] S. Shalev-Shwartz. Online learning and online convex optimization. Foundations and Trends in Machine Learning, 4:107–194, 2012.
  • [Sti81] S. M. Stigler. Gauss and the invention of least squares. The Annals of Statistics, pages 465–474, 1981.
  • [TKK07] I. W. Tsang, A. Kocsor, and J. T. Kwok. Simpler core vector machines with enclosing balls. In Proceedings of the 24th International Conference on Machine Learning, pages 911–918, 2007.
  • [WFYH03] H. Wang, W. Fan, P. S. Yu, and J. Han. Mining concept-drifting data streams using ensemble classifiers. In Proceedings of the 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 226–235, 2003.
  • [WXW16] C. Wang, L. Xie, W. Wang, T. Xue, and S. Lu. Moving tag detection via physical layer analysis for large-scale RFID systems. In Proceedings of the 35th Annual IEEE International Conference on Computer Communications, pages 1–9, 2016.
  • [XTX13] C. Xu, D. Tao, and C. Xu. A survey on multi-view learning. ArXiv e-prints, arXiv:1304.5634, 2013.
  • [YZJZ18] Han-Jia Ye, De-Chuan Zhan, Yuan Jiang, and Zhi-Hua Zhou. Rectify heterogeneous models with semantic mapping. In Proceedings of the 35th International Conference on Machine Learning, pages 1904–1913, 2018.
  • [Zho16] Z.-H. Zhou. Learnware: On the future of machine learning. Frontiers of Computer Science, 10:589–590, 2016.
  • [ZHWL14] P. Zhao, S. Hoi, J. Wang, and B. Li. Online transfer learning. Artificial Intelligence, 216:76–102, 2014.
  • [Zin03] M. Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In Proceedings of the 20th International Conference on Machine Learning, pages 928–936, 2003.
  • [ZZL15] Qin Zhang, Peng Zhang, Guodong Long, Wei Ding, Chengqi Zhang, and Xindong Wu. Towards mining trapezoidal data streams. In IEEE International Conference on Data Mining, pages 1111–1116, 2015.
  • [ZZL16] Qin Zhang, Peng Zhang, Guodong Long, Wei Ding, Chengqi Zhang, and Xindong Wu. Online learning from trapezoidal data streams. IEEE Transactions on Knowledge and Data Engineering, 28:2709–2723, 2016.