1 Introduction and Motivation
Predictive models are the core of many intelligent systems. A common approach to induce such models is to apply machine learning algorithms for classification. This type of algorithm trains classifiers based on historical data. The training data is represented as a set of instances, each described by a set of features and a discrete, dependent class label.
The features selected to describe a data instance may have a major effect on the performance of a learning algorithm. In many real world applications, the acquisition of these features can be expensive, and thus there is a trade-off between model performance and resource consumption. By intelligently selecting features, the training expense can be significantly reduced, while achieving an acceptable level of performance. Moreover, an intelligent selection of features results in a reduction in the need for data storage and processing, and contributes to a more efficient deployment of the learning algorithm. When data acquisition is limited by budget, intelligent feature acquisition becomes a necessity, and this domain is referred to asbudgeted learning.
Budgeted learning applies to supervised learning tasks where the class labels are fully specified but some or all of the feature-values are initially omitted. With budgeted learning, a learner sequentially acquires feature-values at a cost which may vary depending on the feature. The objective of the learner in this case (under an acquisition budget constraint), is to maximize the prediction performance by actively acquiring feature-values[deng2013new].
Existing budgeted learning algorithms assume that it is possible to acquire all of an instance’s feature-values throughout the training phase. However, this assumption does not always apply to real world situations where feature-values cannot be acquired at any time. In many real world scenarios, a learner must be trained in an online manner from data streams. A data stream is a potentially unbounded sequence of data instances. The challenge in this setting is that the feature-values of each current instance can only be acquired at the current time. For example, in the case of smartphone based activity recognition, for each time interval the learner must decide whether the location feature of the user should be acquired (via the smartphone’s expensive GPS sensor) or to instead rely on other features for the task (e.g., device acceleration). Similar issues arise in a wide variety of real world situations involving tasks such as context prediction and health monitoring [tayde2015context, perera2014context].
In this paper we address budgeted learning in an online setting: given a data stream, a classifier-induction algorithm, the cost of acquiring each feature-value, and a fixed acquisition budget, the online budgeted learning issue is to identify the optimal set of features that should be acquired for each new instance, so as to maximize the induced classifier’s performance, while adhering to the acquisition budget.
The contributions of this paper are (1) the creation of a general framework that addresses budgeted learning in an online setting, and (2) the establishment of four different feature value acquisition policies which can be used with the framework. We also describe an Oracle policy, which evaluates an upper bound of performance (note that the policy is not viable, since it utilizes instances from the future which are unavailable in a given current state). We evaluate the framework and acquisition policies using five real world datasets. The evaluation presents a trade-off analysis of the budget and the prediction accuracy of the resulting induced classifier. The proposed framework and policies are simple to implement and provide efficient online feature-value acquisition under different budget constraints.
2 Related Work
Previous work has addressed the issue of active feature-value acquisition [saar2009active, bilgic2007voila, attenberg2011selective]. Active feature-value acquisition (AFA), which is similar to budgeted learning, acquires feature-values for missing data. The goal of AFA is to acquire feature-values in order to obtain a desired performance level at a minimal cost. The key difference between AFA and budgeted learning is that budgeted learning is limited by budget, while AFA does not have a strict budget. Like budgeted learning, AFA works under the assumption that feature-values can be acquired at any time during the training phase.
The feature-value acquisition problem is influenced by the classic multi-armed bandit problem, first introduced by Robbins (1952). The multi-arm bandit problem can be described as follows [gittins1979bandit]
: there are N arms, each having an unknown success probability of emitting a unit reward which are assumed to be independent of each other. The objective is to pull arms sequentially in order to maximize the total reward. Previous work by Deng et al.[deng2013new] approached budgeted learning as a multi-armed bandit problem in which each pull is related to the acquisition of a feature-value for an instance, and the reward is related to the model’s improved performance. We adopt the described approach but adjust it to the online setting.
An alternative line of research explores the domain of efficient feature-value acquisition from the perspective of adaptive sampling. Many context-aware systems use the sensors of portable devices to build predictive models for classification. Since portable devices are very constrained in terms of power, the amount of sensing must be minimized. Previous research describes an architecture in which the sampling rates of sensors are adapted according to the information they provide about the possible next contexts [wood2012adaptive]
. This approach assumes that instances are time-dependent, so the context transition can be represented as a Markov chain. In our work, we assume that the order in which the data stream produces instances may not be relevant, and thus cannot be temporally modeled.
3 Problem Formulation and Framework
Given the challenges of real world settings described above, our goal is to place an online budgeted feature-value acquisition framework between a data stream and a given classifier . Instances are produced by in a sequential manner. An instance is represented as a set of features and a discrete, dependent class label . When a new instance arrives from , both the feature-values and the class label value are initially unknown. The feature-value for of instance can be acquired at a given fixed cost . Let be a subset of which represents the set of acquired features for an instance . We denote the total acquisition cost for instance as where
Similar to the classic budgeted learning problem, we assume that there is no cost associated with acquiring the value of the class label . Let be the total budget constraint. The budget for feature-value acquisition of an instance is denoted as . Post-acquisition instances are stored in the training set . Finally, the total budget fulfills: . An acquisition policy allocates budget for each instance and seeks the optimal which fulfills . The objective of is to maximize the performance of classifier which is induced based on the acquired training set .
The framework follows a generic iterative procedure (pseudo code available in Algorithm 1):
New instance arrives from .
Policy determines .
Policy iteratively acquires feature-values for .
The class label for is sampled.
The instance is passed to the training set .
In order to ensure that features that exceed the budget are not selected, the framework maintains a so-called ”potential features set”, denoted as . The is initialized with the for each new upcoming instance . After each feature-value acquisition iteration, features that exceed the remaining budget are removed from the set.
Our suggested framework is based on the following assumptions: (1) the decision whether to acquire a feature-value can be made only when the instance is active, e.g., if we collect data from a mobile phone, and we decided not to collect data from a specific sensor at a particular time, we can’t go back in time and acquire it. (2) The sequential feature-value acquisition for an instance is supported by assuming negligible acquisition and decision times. (3) The class label value for an instance is provided at the end of the feature-value acquisition phase, so it has no effect on the decision regarding which feature-values to acquire. (4) Misclassification costs are considered to be equal. (5) For simplification, the data stream length is assumed to be predefined.
4 Acquisition Policies
In this section, we describe the four acquisition policies which we implement for our framework. The four polices reflect the multi-armed bandit problem and can be categorized as random policies and adaptive policies. The two policy types differ from each other by (1) the information on which they base their decisions, (2) the method for the division of the budget between the instances, and (3) the way that new information from the stream is utilized.
4.1 Random Policies
We define random policies as policies which are not sensitive to the information that arrives from the data stream. Random policies operate in a stochastic manner, where each feature has a probability to be selected for acquisition. The probabilities are generated at the start of each feature-value acquisition iteration (see line 9 in Algorithm 1). Only prior knowledge, e.g., acquisition costs, may be taken into account when probabilities are generated. The budget for each instance is the same and is calculated as .
Pure Random Acquisition Policy
The pure random acquisition policy selects features following a uniform random distribution. Specifically, the probability to select from is
We use this policy as the baseline for our evaluation.
Cost-Sensitive Random Acquisition Policy
In the cost-sensitive random acquisition policy, features are selected randomly according to a probability that is negatively correlated with the feature’s acquisition cost. Specifically, the probability to select from is
The motivation for the use of this policy is that we want to acquire features with low acquisition cost more often than other features.
4.2 Adaptive Policies
A policy is defined as an adaptive policy if it utilizes new information coming from the data stream. Adaptive policies approach the problem of online budgeted learning as an explore or exploit problem. When deciding which feature to select next, the policy decides whether to exploit the features that have been considered as valuable for previous instances, or to explore the other features. Our suggested adaptive policies tend to favor exploitation, therefore we involve stochastic elements in order to facilitate exploration.
The adaptive policies require initial data upon which to base their exploitation. Therefore, we allocate part of the total acquisition budget for complete feature-value acquisition of the first instances. We denote the set of complete instances as . Then, we induce a base acquisition policy that relies on . The budget for each of the remaining instances is calculated as where the rest of the budget and the stream of the remaining instances are denoted as and , respectively. The acquisition policy is updated when a new instance is added to the training set. An illustration of the process is presented in Figure 1.
Variance Cost-Sensitive Acquisition Policy
In the variance cost-sensitive acquisition policy, we first acquire complete instances as previously described. Then, we calculate the variance of each of the features. For the rest of the instances, features are selected following a probability which is positively correlated with the feature’s variance divided by the feature’s cost. Before calculating a feature’s variance we rescale the feature’s observed values to the range [0,1] using min-max normalization to avoid unit bias. Letbe the rescaled version of . The variance of is denoted as and defined as
The probability to select from is defined as
With each feature-value acquisition, the feature’s respective variance is recalculated.
This policy is based on the notion that features with high variance may describe the dependent variable better than other features. This idea is inspired by the principal component analysis (PCA) algorithm which seeks the principal component that has the greatest possible variance. We divide the variance by the acquisition cost in order to prioritize the less expensive features. Thus, we would like to acquire features with greater variance and lower acquisition cost more often than other features.
Tree Classifier Based Acquisition Policy
In the tree classifier based acquisition policy, we first acquire complete instances as previously described. With these instances we induce a tree based classifier using a cost-sensitive version of the C4.5 algorithm [lauwereins2014context]
. The decision tree, described by Lauwereins et al., selects features for tree splits according to their trade-off between information gain and power consumption (in our case, the power consumption is redefined as feature acquisition cost). When a new instance arrives from the data stream, we perform a walk through the tree from the root node down. The inner nodes visited during the walk indicate useful features. Therefore, we select and acquire those features until (1) we select a feature that is not contained in the, or (2) we reach a leaf node. In either case the rest of the features in the are acquired according to the variance cost-sensitive acquisition policy. Once the budget has been met and the instance has been added to the training set , the decision tree is rebuilt with the latest training data.
This policy is motivated by the idea that decision trees represent a collection of hypotheses. Therefore, when a new instance arrives, its matching hypothesis, represented by a path in the tree, is tested through the instance. Thus, exploitation is applied. Moreover, the policy encourages exploration, since it may select features that are not covered by the matching hypothesis.
We tested the framework and acquisition policies on five real world datasets. The datasets reflect real world domains where the training data is collected in an online manner and a cost can be associated with the acquisition of each feature-value. Four of the datasets were obtained from the UCI repository [Lichman:2013, reyes2016transition]. The fifth dataset, a smartphone sensor dataset involving SMS malware, was collected in our labs.
The reason we collected the fifth dataset is to illustrate a real world scenario in which online budgeted learning is a necessity. With the high adoption of smart mobile devices among world’s population, the role of machine learning on these devices becomes important (e.g., see [mirsky2016anomaly]). However, most of the smart mobile devices, e.g., smartphones, are limited in storage and power resources. We demonstrated the contribution of online budgeted learning by using this dataset to detect malicious SMSs. The dataset contains features which capture the motion of a user’s smartphone during the transmission of SMSs over the course of several months. In order to compile the dataset, we placed a malicious agent on the user’s smartphone; the agent sends SMSs simulating the behavior of actual sophisticated SMS malware aimed at financial theft (a malware that sends SMSs to a premium phone number in order to charge the user’s account). Each instance in the dataset captures the transmission of an SMS with 149 features describing the smartphone’s acceleration, rotation, orientation, and location. A label is provided to each instance to indicate whether the transmission was malicious or benign.
Table 1 provides a detailed comparison of the five datasets. The datasets can be divided into two dataset collections: (1) a collection in which mobile device sensors are sampled, denoted as the sensors collection. This collection consists the HAPT dataset - a smartphone-based recognition of human activities and postural transitions dataset [reyes2016transition], and our own collected SMS leakage dataset. (2) A collection in which medical tests are performed, denoted as the medical collection. This collection consists the Cardiotocography and the Thyroid datasets [Lichman:2013]. The sensors collection datasets are characterized by a large number of features and instances, while the medical collection datasets are characterized by significantly lower dimensions. All of the datasets have no missing values, so the acquisition policies were not restricted. For most of the datasets, feature acquisition costs were not provided. Therefore, for the sensors collection datasets we gave each feature an acquisition cost which is associated with its respective sensor power consumption. Some of the sensor power consumption data was obtained from previous research by ”Google”,111https://dl.google.com/io/2009/pres/W_0300_CodingforLife-BatteryLifeThatIs.pdf while the rest were randomly generated and fine-tuned based on our prior knowledge. The sensor power consumption data is presented in Table 2. For the medical collection datasets (except the Thyroid dataset), acquisition costs were randomly generated.
|SMS Leakage (SMSL)||Sensors||90,528||149||2|
|Sensor||Evaluated Power Consumption (mA)|
We evaluated the four acquisition policies previously described on each of the datasets. Each acquisition policy was tested with nine different total acquisition budget values. Each acquisition budget was calculated with the percent of the maximum cost , where
. The performance of each acquisition policy was evaluated over 10 independent runs. In each run, 70% of the instances were randomly selected and treated as the data stream. The test set was formed by the remaining instances. In each run, when the acquisition phase was completed, we trained an XgBoost classifier[chen2016xgboost]
on the training set. The XgBoost classifier was initialized with fixed parameters for each of the datasets. Parameters were configured based on best practice heuristics.222https://www.analyticsvidhya.com/blog/2016/03/complete-guide-parameter-tuning-xgboost-with-codes-python The prediction accuracy of the test set (measured by multi-class area under the ROC curve (AUC)) served as the performance measure of the induced classifier. For the adaptive policies, we used 20% of the budget for acquiring the complete instances, and the rest of the budget was used to acquire features dynamically as described in section 4.2. We used this portion of the budget based on preliminary experiments that we conducted to determine the optimal ratio to use. Since there is no difference between the policies when there are no budget constraints, we performed complete feature-value acquisition for each of datasets within each run described above.
As it is not feasible to determine an optimal acquisition policy to compare our proposed methods to, we needed a simple method for estimating the upper bound of performance. Therefore, we defined an Oracle acquisition policy which assumes that a given Oracle provides a list of features, ordered based on their trade-off between information gain and cost. The policy allocates equal budget for each instance and selects features according to their order in the list. Although the Oracle acquisition policy is not a viable acquisition approach, it does provide a simple estimation of the upper bound of performance. For our experiments, an ordered list of features with their information gain was generated by training the XgBoost classifier with the complete data.
To evaluate the overall behavior of each of the acquisition policies, we constructed curves that reflect the performance of the induced classifier for each of the budget constraints. Each point in the graphs represents the mean of the area under the ROC curve (AUC) over the 10 runs for the specified budget and acquisition policy. To keep the plots uncluttered, in figure 2 we plot results for the complete data acquisition (Complete), the random acquisition policy (Random), the cost-sensitive acquisition policy (Cost), the variance cost-sensitive acquisition policy (Variance cost), the tree classifier based acquisition policy (Classifier based) and the Oracle acquisition policy (Oracle).
As can be seen in figure 2, as expected, the Oracle policy achieved the best performance for all of the datasets and values, however as mentioned previously, it is not a viable approach. The adaptive policies performed better than the random policies, except for the HAPT dataset. For the HAPT dataset, the random policies performed equal to or better than the adaptive policies for . Our explanation for this is that since the HAPT dataset has the largest number of features, the adaptive policies tend to exploit a small set of its features, while other more promising features are not explored. For the Thyroid dataset, the adaptive policies obtained similar results to those achieved by the Oracle policy. For both the SMS malware and Thyroid datasets, the adaptive policies outperformed the random policies. The difference was more significant with small values. Our explanation for the latter is that when there is a small budget and a large set of features, identifying the optimal set of features becomes a harder task. Therefore, we expect that good policies will obtain significant better performance in such cases. With regards to the adaptive policies, we found that the variance cost-sensitive policy outperformed the tree classifier based policy with the sensor dataset collection. However, with the medical dataset collection, the two policies obtained similar results. We believe these results are because the variance cost-sensitive policy is more exploratory than the tree classifier based policy.
6 Conclusions and Future Work
We presented a general framework to address online budgeted feature-value acquisition, in which instances arrive from a data stream and an acquisition policy selects which features should be acquired within given budget constraints. We also introduced four acquisition policies categorized as random or adaptive policies and defined an Oracle policy, which is not viable for acquisition but provides an upper bound of performance. We evaluated the acquisition policies on five real world datasets with various budget constraints and demonstrated that the adaptive acquisition policies outperform the random policies for most of the budget values and datasets. In some cases, the adaptive acquisition policies achieved similar results to those achieved by the Oracle policy.
There are several directions for future work. First, based on our observations of the large datasets, it would be beneficial to be able to control the amount of exploration and exploitation an adaptive policy should perform. A possible direction for the latter may be influenced by the simulated annealing technique [aarts1988simulated]. Furthermore, we plan to test the computational resources (e.g., machine time) required to train the classifier based on the training data acquired under different budget constraints. Moreover, we may develop acquisition methods for problems in which the length of the data stream is unknown. To sum up, we believe that future research in this area is necessary for many real world applications in which feature-value acquisition is limited by budget constraints and must be performed in an online manner.