Dynamic Feature Acquisition Using Denoising Autoencoders

by Mohammad Kachuee, et al.

In real-world scenarios, different features have different acquisition costs at test-time, which necessitates cost-aware methods to optimize the trade-off between cost and performance. This paper introduces a novel and scalable approach for cost-aware feature acquisition at test-time. The method incrementally asks for features based on the available context, that is, the known feature values. The proposed method is based on sensitivity analysis in neural networks and density estimation using denoising autoencoders with binary representation layers. In the proposed architecture, a denoising autoencoder is used to handle unknown features (i.e., features that are yet to be acquired), and the sensitivity of predictions with respect to each unknown feature is used as a context-dependent measure of informativeness. We evaluated the proposed method on eight different real-world datasets as well as one synthesized dataset and compared its performance with several other approaches in the literature. According to the results, the suggested method is capable of efficiently acquiring features at test-time in a cost- and context-aware fashion.



I Introduction

Feature selection methods have been studied extensively in the literature. Usually, the main goal of feature selection is defined as selecting a subset of the available features to increase the prediction performance and to reduce over-fitting. In real-world scenarios, however, the cost of extracting or acquiring each feature differs from that of other features. The cost difference can be due to various factors such as differences in computational load in the extraction of features [1, 2], user disruptions in computer and user interactions [3], patient pain in medical procedures and tests [4], and so forth [5]. In these scenarios, selecting a feature that entails a high cost but contributes only marginally to the prediction accuracy would be unacceptable. In other words, there exists a trade-off between the feature cost and prediction performance that should be considered in the algorithm design.

To overcome this issue, methods have been suggested in the literature that adapt feature selection algorithms to consider the cost of each feature [6, 7, 8, 9]. However, another point of concern is that selecting a fixed set of features during the training phase and using them at test-time would not be an optimal solution, as it neglects the potential interdependence between features. In many scenarios, there are features that are either freely available or easy to acquire at test-time. An optimal decision about other features to include in the analysis can be highly dependent on them. For instance, a doctor decides whether to prescribe an MRI scan based on the patient's currently available information such as age, gender, symptoms, and so on. In this example, having a fixed list of required tests and asking patients to provide the results of these tests for a clinical visit would result in the high cost of an MRI for all patients. In other words, the decision to include each feature should be based on the learned system dynamics as well as the available information at test-time.

In this paper, we suggest a novel approach for feature acquisition considering costs at test-time (FACT). The proposed solution is capable of incrementally asking for features to be included in the prediction based on the current available context and user-defined feature costs. The rest of the paper is organized as follows. Section II briefly reviews the current relevant literature. Section III introduces the suggested approach including theoretical and implementation details. Section IV presents the results of using the suggested method and compares them with the state-of-the-art approaches in the literature. Finally, Section VI concludes the paper.

II Related Work

One approach to incorporating feature acquisition costs, or feature costs in general, is to consider the feature costs during the training phase and trade off the prediction accuracy against the prediction cost. An example of this approach is limiting the number of features that are actually used in the predictor model by using regularization [10]. In this method, the regularization forces the weights corresponding to certain features to be zero, and hence these features can be omitted during the test phase. There are other methods in the literature that try to define and solve optimization problems over both the prediction performance and prediction costs [11, 7, 12]. Nevertheless, in all these methods, the final set of selected features is fixed, and these methods fail to capture and take advantage of the contextual information available at test-time.

One intuitive approach to incorporate feature costs during the training phase, while considering the available context during the test phase, is using the idea of decision trees. One of the most famous examples of this approach is the face detection cascade classifier by Viola and Jones [13]. While their goal was to increase the prediction speed by rejecting negative samples as soon as possible within a cascade of classifiers, many papers followed their architecture and incorporated feature cost in creating cascade predictors [14, 15]. One main drawback of cascade approaches is that cascades are only applicable to problems with a considerable class imbalance such as face detection or spam email detection. In these cases, the number of negative samples is significantly higher than the number of positive samples. However, there are many real-world applications in which the classes are relatively balanced such as document classification or image classification. To overcome this issue, in [16, 17] the authors suggested the idea of classifier trees instead of classifier cascades to handle the problems where cascades are not applicable.

While cascade and tree based test-time feature acquisition methods are shown to perform reasonably well in many scenarios, there are many problems and applications such as large-scale image classification, voice recognition, and natural language processing where tree and cascade classifiers are not intrinsically strong enough to make accurate predictions. Another important limitation of cascade and tree based approaches is that, while they include the context information to some extent, their feature query decisions are not truly instance specific. Specifically, they are limited by the fixed predetermined structure of the tree, which dictates the features to be acquired at each tree node.

In order to address these issues, recently, there has been great attention toward using learning methods to solve the generic problem of cost-sensitive and context-aware feature acquisition. He et al. [18] suggested a method based on imitation learning that trains a model that is able to predict an optimal feature query decision to be made given the available features. Contardo et al. [19, 20] introduced the idea of defining the problem as a reinforcement learning problem and solving it as a separate problem. While these methods are successful in terms of truly incorporating the test-time context information into the decisions, they require the extra effort of training a feature query model in addition to the target predictor.

An alternative idea for measuring the informativeness of features given the context is using sensitivity analysis at test-time to measure the influence of each feature on the predictions. Early et al. [21] introduced a method based on sensitivity analysis that exhaustively measures the impact of acquiring each feature on the prediction outcome. Their solution does not require training any other model, and it works in conjunction with almost any supervised learning algorithm. However, exhaustive sensitivity measurement is computationally expensive. It is impractical in problems with a large number of features to exhaustively examine the sensitivity with respect to each unknown feature.

In this paper we suggest a novel approach that is based on the idea of sensitivity analysis. The proposed approach incrementally asks for features based on the feature acquisition costs and the expected effect each feature can induce on the prediction. Furthermore, the devised method uses back-propagation of gradients and binary representation layers in neural networks to address the computational load as well as scalability concerns. In an earlier work [22], we introduced the idea of sensitivity analysis as a method for dynamic feature selection. However, in this paper, we extend the idea by considering feature acquisition costs, introducing improvements such as feature encoding, and conducting more detailed experiments and analysis.

III Proposed Method

III-A Problem Definition

In this paper, we consider the problem of predicting target classes ($\tilde{y}$) corresponding to a given feature vector ($x$). Each feature vector consists of known features as well as unknown features (i.e., missing values) that are set to zero. The complete feature vector without any missing values is denoted by $\tilde{x}$. To indicate unknown features, a vector $k$ is defined that acts as a mask and indicates known and unknown features with one and zero values, respectively. In addition, we define a feature acquisition cost vector ($c$) that defines the cost of acquiring each feature.

For simplicity of analysis, we consider the incremental problem of having a feature vector ($x^t$) and the corresponding mask vector ($k^t$) at time step $t$. Additionally, we consider the cost values to be time dependent and defined for each time step ($c^t$). Using this notation, at each time step $t$, the current feature vector can be represented as

$$x^t = \tilde{x} \odot k^t, \tag{1}$$

which is acquired at the total cost of

$$C^t_{total} = (k^t)^\top c^t. \tag{2}$$

Apart from this, at each time step, we have an expected prediction value ($\hat{y}^t$) using a predictor function ($F$) that takes $x^t$ as input:

$$\hat{y}^t = F(x^t). \tag{3}$$

In this setup, we define the feature query operator ($q$) as a function that acquires the value of feature $j$ in the incomplete feature vector and outputs the feature vector of the next time step, $x^{t+1}$:

$$x^{t+1} = q(x^t, j). \tag{4}$$

Furthermore, we define the desired feature to be queried at time $t$ as the feature that decreases the prediction error significantly while, at the same time, incurring a low acquisition cost. Mathematically speaking, we can use prediction accuracy improvement per acquisition cost as a measure of efficiency for the feature query. Accordingly, the desired feature to be queried at time step $t$ can be found by

$$j^t = \arg\min_j \left( \left\lVert \tilde{y} - F(q(x^t, j)) \right\rVert + b \right) c^t_j, \tag{5}$$

where $\tilde{y}$ is the ground-truth target value, and $b$ is a bias value that prevents the first term from becoming zero. It is worth mentioning that the solution introduced here is an incremental solution that greedily selects the feature to be acquired at each step.
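The bookkeeping described in this subsection can be sketched in a few lines of plain Python. This is only an illustrative sketch under stated assumptions: the function names (`apply_mask`, `total_cost`, `query`, `oracle_choice`) are hypothetical, costs are assumed constant in time, the target and prediction are scalars, and the greedy criterion is assumed to multiply the biased prediction error by the feature cost. The oracle needs the ground-truth target, which is not available at test-time.

```python
from typing import Callable, List, Optional, Sequence, Tuple


def apply_mask(x_full: Sequence[float], mask: Sequence[int]) -> List[float]:
    """Zero out unknown features: known entries keep their value."""
    return [v * k for v, k in zip(x_full, mask)]


def total_cost(mask: Sequence[int], cost: Sequence[float]) -> float:
    """Sum the costs of features marked as known (constant costs assumed)."""
    return sum(c for k, c in zip(mask, cost) if k == 1)


def query(x_full: Sequence[float], mask: Sequence[int],
          j: int) -> Tuple[List[float], List[int]]:
    """Acquire feature j: return the next feature vector and the next mask."""
    next_mask = list(mask)
    next_mask[j] = 1
    return apply_mask(x_full, next_mask), next_mask


def oracle_choice(x_full: Sequence[float], mask: Sequence[int],
                  cost: Sequence[float], y_true: float,
                  predict: Callable[[Sequence[float]], float],
                  bias: float = 1e-3) -> Optional[int]:
    """Greedy oracle: minimize (|error after query| + bias) * cost.

    Assumed form of the criterion; uses y_true, so it is an idealized
    reference rather than something usable at test-time.
    """
    best_j, best_score = None, float("inf")
    for j, known in enumerate(mask):
        if known:
            continue  # only unknown features can be queried
        x_next, _ = query(x_full, mask, j)
        score = (abs(y_true - predict(x_next)) + bias) * cost[j]
        if score < best_score:
            best_j, best_score = j, score
    return best_j
```

With a toy predictor that simply returns the second feature, the oracle picks that feature when it is reasonably priced and switches to a cheaper one when its cost becomes very large, which is the trade-off the criterion is meant to capture.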

Table I presents a summary of the notations used throughout the paper.

Symbol          Description
$\tilde{y}$     Ground-truth target values
$\hat{y}$       Predicted target values
$x$             Incomplete feature vector at test-time
$\tilde{x}$     Complete feature vector without any missing values
$x_b$           Binary representation of the feature vector
$\hat{x}$       Reconstructed feature vector
$\hat{x}_b$     Binary representation of the reconstructed feature vector
$k$             Mask vector indicating known and unknown features
$c$             Feature acquisition cost vector
$z$             Encoded feature vector
TABLE I: The summary of the notations used throughout the paper.

III-B Sensitivity-based Feature Acquisition

While (5) suggests which feature to acquire at each step, directly using this equation is not practical, because the first term in this equation is usually not known and is difficult to estimate. To resolve this issue, it is possible to use the sensitivity of model predictions with respect to each missing feature as a measure of the potential impact of that feature on the final predictions. As a result, (5) can be rewritten using the suggested sensitivity measure ($S_j$) as:

$$j^t = \arg\max_j \frac{S_j}{c^t_j}. \tag{6}$$

Note that because a higher sensitivity is synonymous with a more informative feature to select, the argmin function in (5) is replaced by argmax. Furthermore, the prediction sensitivity with respect to input $j$ can be defined as

$$S_j = \int \left| \frac{\partial F(x)}{\partial x_j} \right| P(x_j \mid x^t, \theta) \, dx_j. \tag{7}$$

In this equation, the first term corresponds to the derivative of the predictor function with respect to each missing feature. The second term is the confidence of inferring the $j$'th feature given the context ($x^t$) and model parameters ($\theta$). By substituting this into (6), the feature query criterion can be written as

$$j^t = \arg\max_j \frac{1}{c^t_j} \int \left| \frac{\partial F(x)}{\partial x_j} \right| P(x_j \mid x^t, \theta) \, dx_j. \tag{8}$$

Furthermore, the continuous integral in (8) can be approximated by a discrete summation:

$$j^t = \arg\max_j \frac{1}{c^t_j} \sum_{v \in V} \left| \frac{\partial F(x)}{\partial x_j} \right|_{x_j = v} P(x_j = v \mid x^t, \theta), \tag{9}$$

where $V$ is a set of samples from the range of possible values that can be taken by each feature. By adjusting the granularity of the values in the set, one can trade off between the approximation accuracy and the computational load of the expected value approximation.
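The discrete approximation of the expected sensitivity can be illustrated with a small, self-contained sketch. It is not the implementation used in the paper: the derivative is estimated here with central finite differences instead of back-propagation, the predictor is a toy scalar function, and the candidate values and their probabilities are supplied by hand rather than estimated by an autoencoder. All names (`finite_diff`, `expected_sensitivity`, `next_feature`) are hypothetical.

```python
from typing import Callable, Dict, List, Sequence


def finite_diff(predict: Callable[[Sequence[float]], float],
                x: Sequence[float], j: int, eps: float = 1e-4) -> float:
    """Central finite-difference estimate of d predict / d x[j]."""
    hi, lo = list(x), list(x)
    hi[j] += eps
    lo[j] -= eps
    return (predict(hi) - predict(lo)) / (2.0 * eps)


def expected_sensitivity(predict: Callable[[Sequence[float]], float],
                         x: Sequence[float], j: int,
                         values: Sequence[float],
                         probs: Sequence[float]) -> float:
    """Sum over candidate values v of |dF/dx_j at x_j = v| * P(x_j = v)."""
    total = 0.0
    for v, p in zip(values, probs):
        x_v = list(x)
        x_v[j] = v  # hypothesize that the unknown feature takes value v
        total += abs(finite_diff(predict, x_v, j)) * p
    return total


def next_feature(predict: Callable[[Sequence[float]], float],
                 x: Sequence[float], mask: Sequence[int],
                 cost: Sequence[float], values: Sequence[float],
                 probs: Dict[int, Sequence[float]]) -> int:
    """Pick the unknown feature with the highest sensitivity per unit cost."""
    unknown: List[int] = [j for j, k in enumerate(mask) if k == 0]
    return max(
        unknown,
        key=lambda j: expected_sensitivity(predict, x, j, values, probs[j]) / cost[j],
    )
```

In this sketch, a coarser or finer `values` grid plays the role of the sample set whose granularity trades approximation accuracy against computation.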

Fig. 1: Network architecture of the proposed method including: encoder, decoder, and predictor parts. The encoder part is responsible for handling missing features. The decoder part is used for feature density estimation. The predictor is responsible for making predictions; additionally, its derivatives with respect to inputs are used for measuring sensitivities.

III-C Proposed Solution

The required terms in (9) for finding the feature to query include: the cost of acquiring each feature, the derivative of the prediction function with respect to each input at different input values, and the probability of having each value for each feature given the available context. Feature query costs are assumed to be given by the user and known for each time step. For the latter two, while it is possible to model and estimate each term using conventional modeling methods, evaluating the summation exhaustively may be computationally expensive and impractical in many applications. Here, we introduce a novel method based on autoencoders with binary representation layers that can estimate the whole summation with a single forward and backward propagation in neural networks.

The left and the upper right parts of Fig. 1 show the architecture of the proposed network for the context-aware and one-shot estimation of the distribution of each feature. As depicted, an autoencoder architecture is designed to convert each feature in the feature vector ($x$) to a binary representation ($x_b$). Then, it encodes the features to a more compact representation ($z$), and finally reconstructs the original feature vector ($\hat{x}$) by creating a binary decoded vector ($\hat{x}_b$). Here, in order to have an estimate for the probability of each bit being set, a sigmoid activation function is used for the binary reconstruction layer ($\hat{x}_b$). For all other layers, however, we used the rectified linear unit (ReLU) [23] non-linearity. Additionally, the network optimization cost function is defined as the weighted sum of cross-entropies for binary feature words. Here, the term word refers to the set of encoded bits that represent a feature. The weights are adjusted to offset the importance of the reconstruction error caused by errors in bits of different significance within the word. It is worth mentioning that the trained autoencoder, as explained here, takes an input feature vector where missing features are set to zero, and it is capable of estimating the probability of each bit being set in the binary decode layer ($\hat{x}_b$).

In addition to the autoencoder part, in the network of Fig. 1, we create a predictor model by stacking a few layers on top of the encoded representation ($z$) and training the encoder as well as the predictor parts of the network in a supervised fashion. Here, in order to measure the sensitivity of the output predictions with respect to different changes in each feature, we suggest using the summation of absolute derivatives of the output layer neurons with respect to each bit of the missing features. The final estimation of the summation in (9) is achieved by an element-wise multiplication of the bit probabilities estimated from the autoencoder's binary reconstruction layer and the sensitivities calculated from the derivative of the output layer with respect to each input feature bit. Specifically, this paper suggests defining the set $V$ as

$$V = \{ 2^{-i} \mid i = 1, \dots, N \}, \tag{10}$$

where $N$ is the total number of bits used in the binary representation of each feature. Using (9) and (10), the feature to be acquired is given by

$$j^t = \arg\max_j \frac{1}{c^t_j} \sum_{i=1}^{N} S_{j,i} \, \hat{x}_{b,j,i}, \tag{11}$$

where $\hat{x}_{b,j,i}$ is the estimated probability of bit $i$ of feature $j$ being set, and the sensitivity term is defined as

$$S_{j,i} = \sum_{m} \left| \frac{\partial \hat{y}_m}{\partial x_{b,j,i}} \right|. \tag{12}$$
It is worth mentioning that, in addition to the common neural network hyper-parameters, the only hyper-parameter added by the suggested method is the number of bits $N$, which controls the accuracy of the binary representation. Additionally, as we do not make any assumptions about the values used as feature costs and the proposed method is incremental, it can be applied to scenarios where feature costs are subject to change during the course of operation at test-time. However, in our experiments, in order to make the comparison of results easier, we evaluate the proposed method on scenarios where feature acquisition costs are constant in time.
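The bit-level query criterion described above (element-wise product of bit probabilities and bit sensitivities, summed per feature and divided by the feature cost) can be sketched as follows. This is a minimal illustration, assuming that the Jacobian of the outputs with respect to the input bits and the decoder's bit probabilities have already been computed by some network; in the sketch they are plain nested lists, and the function names are hypothetical.

```python
from typing import List, Sequence


def bit_sensitivity(jacobian: Sequence[Sequence[Sequence[float]]]) -> List[List[float]]:
    """Sum of absolute output derivatives per input bit.

    jacobian[m][j][i] is assumed to hold d y_m / d x_b[j][i] for output m,
    feature j, and bit i.
    """
    n_feat = len(jacobian[0])
    n_bits = len(jacobian[0][0])
    return [
        [sum(abs(out[j][i]) for out in jacobian) for i in range(n_bits)]
        for j in range(n_feat)
    ]


def select_feature(bit_probs: Sequence[Sequence[float]],
                   bit_sens: Sequence[Sequence[float]],
                   cost: Sequence[float],
                   mask: Sequence[int]) -> int:
    """Return the unknown feature maximizing sum_i(S[j][i] * p[j][i]) / cost[j]."""
    best_j, best_score = -1, float("-inf")
    for j, known in enumerate(mask):
        if known:
            continue  # already acquired features are never queried again
        score = sum(s * p for s, p in zip(bit_sens[j], bit_probs[j])) / cost[j]
        if score > best_score:
            best_j, best_score = j, score
    return best_j
```

Because both inputs come from one forward pass (bit probabilities) and one backward pass (Jacobian), the selection itself is a cheap weighted sum per feature.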

Dataset                      Instances  Features  Classes  Network Architecture                               Latent Missing Distribution
MNIST [24]                   70000      784 (1)   10       Encoder: [697×8, 64, 32]; Predictor: [16, 10]      Beta Distribution
Yahoo LTRC [25]              34815      519       5        Encoder: [519×8, 128, 32]; Predictor: [16, 8]      Beta Distribution
HAPT [26]                    10929      561       12       Encoder: [561×8, 64, 32]; Predictor: [16]          Beta Distribution
Reuters R8 [27]              7674       1000      8        Encoder: [1000×8, 64, 32]; Predictor: [16, 8]      Beta Distribution
UCI Mushroom [28]            8124       22 (2)    2        Encoder: [116×8, 16]; Predictor: [4]               Beta Distribution
UCI Landsat [29]             6435       36        7        Encoder: [36×8, 16, 8]; Predictor: [4]             Beta Distribution
UCI CTG [30]                 2126       23        3        Encoder: [23×8, 8]; Predictor: [4]                 Beta Distribution
Synthesized (Section IV-D)   16000      64        2        Encoder: [64×8, 16, 10]; Predictor: [8, 4]         Beta Distribution
Diabetes (Section IV-E)      279        16        3        Encoder: [16×8, 8]; Predictor: [4]                 Beta Distribution
(1) 697 features after omitting low-STD features corresponding to margin pixels.
(2) 116 features after one-hot encoding of categorical features.
TABLE II: The summary of datasets and experimental settings.

III-D Implementation Details

Prior to the analysis, we normalized all feature values in each dataset to the range of zero to one. Throughout the experiments, we used the TensorFlow numerical computation library and explored feed-forward neural network architectures. The ReLU non-linearity is used for all hidden layers except the binary representation layers. For converting feature values to the suggested binary representation, we implemented the bit-by-bit recursive conversion in an efficient and parallel manner. For converting back from the binary representation, we implemented the weighted summation of bit values using a fully parallel matrix multiplication, reshape, and addition. In this work, the Adaptive Moment (Adam) optimization algorithm [32] is used to train each network. The Adam hyper-parameters: learning rate (), decay rate for the first moment (), and decay rate for the second raw moment () are set to , , and , respectively.
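As an illustration of the conversion between normalized feature values and their binary words, the following sketch encodes a value in the range of zero to one as a fixed-point binary fraction and decodes it by a weighted sum of bits. It is a scalar, sequential sketch under the assumption that bit i (most significant first) carries weight 2^-i; the parallel matrix-based implementation mentioned above is not reproduced, and the function names are hypothetical.

```python
from typing import List, Sequence


def encode_bits(value: float, n_bits: int = 8) -> List[int]:
    """Bit-by-bit conversion of a value in [0, 1] to n_bits fractional bits."""
    v = min(max(value, 0.0), 1.0)  # clamp to the normalized range
    bits = []
    for _ in range(n_bits):
        v *= 2.0
        bit = 1 if v >= 1.0 else 0
        bits.append(bit)
        v -= bit  # keep the remainder for the next, less significant bit
    return bits


def decode_bits(bits: Sequence[float]) -> float:
    """Weighted sum of bits (or bit probabilities) with weights 2^-i."""
    return sum(b * 2.0 ** -(i + 1) for i, b in enumerate(bits))
```

Since `decode_bits` is a plain weighted sum, it also accepts bit probabilities from a sigmoid layer, in which case it returns the expected reconstructed value under independent bits.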

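The process of training the network starts with training the autoencoder part using a weighted cross-entropy loss between the binary representation of the complete feature vectors and the estimated probabilities from the binary reconstruction layer:

$$L = -\sum_{j} \sum_{i=1}^{N} w_i \left[ \tilde{x}_{b,j,i} \log \hat{x}_{b,j,i} + (1 - \tilde{x}_{b,j,i}) \log (1 - \hat{x}_{b,j,i}) \right], \tag{13}$$

where $\tilde{x}_{b,j,i}$ is bit $i$ of the binary representation of feature $j$ in the complete feature vector, $\hat{x}_{b,j,i}$ is the corresponding estimated probability, and $w_i$ is the weight assigned to bit $i$.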
In order to train the denoising autoencoder, for each training instance, we sample random values from a latent Beta distribution and use the sampled values as the probability of missing each feature in the training data. After training the autoencoder part, the trained autoencoder network weights are stored, and a few prediction layers are added on top of the encoder part. The reason we store autoencoder weights is that fine-tuning the weights for the prediction task would affect the distribution estimation functionality of the originally trained autoencoder. In other words, we do fine-tuning for the supervised prediction task, while a copy of the original not fine-tuned autoencoder is used for probabilistic modeling. Here, to train the predictor network, we use a smaller learning rate () for the pre-trained encoder and a larger learning rate () for the new predictor weights.
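The two ingredients of the denoising training step, sampling which features to hide and scoring the reconstruction with a weighted cross-entropy over bits, can be sketched as below. Several details are assumptions made for illustration only: one missing probability is drawn per training instance, the Beta parameters are arbitrary, and the bit weights are passed in by the caller because their exact values are not given in the text. The function names are hypothetical.

```python
import math
import random
from typing import List, Sequence


def sample_missing_mask(n_features: int, alpha: float, beta: float,
                        rng: random.Random) -> List[int]:
    """Draw a missing probability from Beta(alpha, beta), then a 0/1 mask.

    A mask entry of 1 means the feature is kept (known); 0 means it is
    hidden from the autoencoder input for this training instance.
    """
    p_missing = rng.betavariate(alpha, beta)
    return [0 if rng.random() < p_missing else 1 for _ in range(n_features)]


def weighted_bce(target_bits: Sequence[Sequence[int]],
                 pred_probs: Sequence[Sequence[float]],
                 bit_weights: Sequence[float],
                 eps: float = 1e-7) -> float:
    """Weighted cross-entropy summed over all features and bits."""
    loss = 0.0
    for word_t, word_p in zip(target_bits, pred_probs):
        for t, p, w in zip(word_t, word_p, bit_weights):
            p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
            loss -= w * (t * math.log(p) + (1 - t) * math.log(1.0 - p))
    return loss
```

Giving larger weights to more significant bits is one natural way to make an error in a high-order bit cost more than an error in a low-order bit, which is the stated purpose of the weighting.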

For the efficient calculation of derivatives, we use back-propagation from the values in the output prediction layer to each binary input bit. This section only describes the general architecture and training procedures. As we have conducted various experiments on different datasets, the exact network architecture of each case is explained in Section IV.

IV Experimental Results

IV-A Datasets and Experiments

The proposed method is evaluated on seven different real-world datasets including human activity recognition (HAPT) [26], hand-written character recognition (MNIST) [24], document classification (Reuters R8) [27], and web ranking (Yahoo LTRC) [25] as well as three other classification datasets. Apart from these, we have evaluated the method on a synthesized dataset, which is explained in Section IV-D, and a dataset in the health domain, explained in Section IV-E. Table II presents a summary of the conducted experiments. The table also includes the network architecture and the missing value distribution used during the training phase. In each case, the architecture column contains layer sizes for the binary layers, encoder layers, and predictor layers. In this table, the decoder layers are not shown and are equal to the encoder layer sizes in reverse order. We used an 8-bit binary representation throughout the experiments. Regarding the feature size of each dataset, we report both the nominal feature count and the number of features we have used in our experiments. Specifically, for the MNIST dataset, each pixel is considered as a feature, and we removed features corresponding to pixels near the margin that are almost always zero, i.e., those whose standard deviation across all samples falls below a small threshold. Also, regarding the Mushroom dataset, one-hot encoding of categorical features resulted in 116 features to be used as input. An important point to consider for these features is that, during sensitivity measurement and acquisition, all one-hot features corresponding to a categorical feature should be treated as a single feature to acquire.

Regarding the feature acquisition costs, for the LTRC dataset, as suggested by [16], we used real cost values in the range of to based on the time required to extract each feature. In order to introduce feature costs to the MNIST dataset, we have followed the method suggested by [33]. Using this method, we create feature vectors by concatenating MNIST images at different resolutions including , , , and with feature acquisition costs of , , , and for acquiring a feature at each resolution, respectively. For the Landsat dataset, each sample consists of features from four different frequency bands. Here, we considered features of the same frequency band to have the same acquisition cost, equal to the frequency band number (from 1 to 4). Finally, for the CTG dataset, we assumed features measuring event counts to have a cost value of , features measuring statistical information to have a cost value of , and histogram features to have a cost of . For the synthesized and diabetes datasets, a complete explanation of the experiments is presented in Section IV-D and Section IV-E, respectively. Lastly, for the other three datasets, we assumed feature costs to be equal for all features.

Based on the aforementioned experimental setup, in the following parts of this section, we evaluate the proposed method for feature acquisition considering costs at test-time (FACT) on each dataset.

IV-B Performance Evaluation

Columns: FACT accuracy (%) at different percentages of the total cost used (1); RFC accuracy (%); Denoising (%); AUACC (2) of Rand, DPFQ (3), and FACT.
Rows: one per dataset, including Yahoo LTRC, Reuters R8, UCI Mushroom, and UCI Landsat.
(1) Total cost is defined as the total cost of acquiring all features.
(2) AUACC is defined as the area under the accuracy and the feature acquisition cost curve.
(3) Cost-aware version of the method suggested in [22].
TABLE III: Results of evaluating the proposed method on different datasets.

We split each dataset into three parts: for test, for validation, and the rest for training. During the training phase, we use the validation and train sets to train the networks as explained in Section III-D and following the setups introduced in Table II. It is worth noting that the proposed method does not necessarily require all training features to be known. In fact, as explained in Section III-D, we use a latent Beta distribution to simulate the existence of unknown features. After the training phase, we use the test set to simulate the case of feature acquisition using the proposed method by assuming all the features to be initially unknown and using the feature query criterion of (11) to ask for features incrementally. In our experiments (except the experiments in Section IV-D), we continue the feature query until all the features have been queried and report the accuracy as well as the cost at each point during feature acquisition. In real applications, however, the incremental acquisition should be stopped after reaching a certain criterion such as a minimum confidence of predictions.
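The simulated acquisition procedure, starting with all features unknown and querying one feature at a time until everything is acquired or a stopping criterion is met, can be sketched as a simple loop. The scoring and confidence functions are placeholders supplied by the caller (hypothetical `score_fn` and `confidence_fn`); in the paper's setting they would come from the trained networks, which are not reproduced here.

```python
from typing import Callable, List, Optional, Sequence, Tuple

ScoreFn = Callable[[Sequence[float], Sequence[int], int], float]
ConfFn = Callable[[Sequence[float], Sequence[int]], float]


def acquire(x_full: Sequence[float], cost: Sequence[float],
            score_fn: ScoreFn, confidence_fn: Optional[ConfFn] = None,
            min_confidence: Optional[float] = None
            ) -> Tuple[List[int], List[Tuple[int, float]]]:
    """Greedy acquisition; returns the final mask and a (feature, spent) trace."""
    n = len(x_full)
    mask = [0] * n  # all features are initially unknown
    spent = 0.0
    trace: List[Tuple[int, float]] = []
    while not all(mask):
        x = [v * k for v, k in zip(x_full, mask)]
        # Optional early stop once the prediction is confident enough.
        if (confidence_fn is not None and min_confidence is not None
                and confidence_fn(x, mask) >= min_confidence):
            break
        unknown = [j for j in range(n) if mask[j] == 0]
        j_best = max(unknown, key=lambda j: score_fn(x, mask, j) / cost[j])
        mask[j_best] = 1
        spent += cost[j_best]
        trace.append((j_best, spent))
    return mask, trace
```

Recording the cumulative cost after each query gives exactly the points needed to plot an accuracy versus acquisition cost curve once a prediction is evaluated at each step.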

Table III presents the results of the proposed method on each dataset. The table contains the test accuracy on each dataset when spending different percentages of the total cost of the original feature set. Here, total cost is defined as the cost of acquiring all features. As a baseline, we have reported the test performance of using a random forest classifier (RFC) on the complete feature set (i.e., acquiring all features and spending the maximum cost). The table also reports the denoising percentage of the trained denoising autoencoder calculated as


Additionally, in this table, the area under the accuracy-cost curves (AUACC) of the proposed feature query method as well as the AUACC of randomly asking for the unknown features are presented. We have also included AUACC results of a cost-aware version of the method suggested in [22] in which sensitivities are normalized by feature costs (see the DPFQ column). It is worth mentioning that AUACC values are calculated as the normalized area under the accuracy versus the acquisition cost curve from a cost of zero to the cost at which the accuracy converges to the maximum accuracy.
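A possible way to compute such an area-under-curve summary is sketched below using the trapezoidal rule. The normalization is an assumption: the area is divided by the cost span so that the result can be read as an average accuracy over the integrated range, and "convergence" is taken to be the first point whose accuracy reaches the maximum observed accuracy within a tolerance. The function name `auacc` is hypothetical.

```python
from typing import Sequence


def auacc(costs: Sequence[float], accuracies: Sequence[float],
          tol: float = 1e-9) -> float:
    """Normalized area under the accuracy vs. acquisition cost curve.

    `costs` must be increasing and start at the initial (zero) cost.
    Integration stops at the first point reaching the maximum accuracy.
    """
    max_acc = max(accuracies)
    end = next(i for i, a in enumerate(accuracies) if a >= max_acc - tol)
    if end == 0:
        return accuracies[0]  # the curve is already at its maximum
    area = 0.0
    for i in range(end):
        width = costs[i + 1] - costs[i]
        area += 0.5 * (accuracies[i] + accuracies[i + 1]) * width
    return area / (costs[end] - costs[0])
```

Under this normalization, a method that reaches high accuracy early in the cost range obtains a value closer to its maximum accuracy than one that improves only near the end.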

According to the results presented in Table III, the proposed method can be used to effectively reduce the cost of the features that must be acquired at test-time for accurate predictions. Also, compared with the baseline accuracy of using the complete feature set, the results show that our method trains predictor models that can make viable predictions using only a subset of the features without sacrificing prediction performance. Regarding the denoising percentage, in most of the cases, the achieved denoising percentage is significant, confirming that a denoising autoencoder is capable of encoding features into a shorter representation that is more robust to the presence of missing values. Regarding the area under the curve values, in all cases, the AUACC values of FACT are considerably higher than those of its random selection counterpart. It is also worth noting that for a few datasets (i.e., Reuters R8, UCI Landsat, HAPT, and UCI CTG) there is a considerable class imbalance that affects the baseline accuracies.

To further illustrate the performance of the proposed approach, we used the MNIST dataset to visualize the effect of cost and context on the selection of features to be queried. Here, features are pixel values at different locations across each image, and the context is the available pixel values at each time step. Fig. 2 shows the effect of context and cost on the order of features that are queried by the proposed algorithm as well as a static order that is measured based on mutual information between pixels and target classes. Here, we present results for the original MNIST dataset with single resolution images and equal feature costs (see Fig. 2(a)) as well as the introduced multi-resolution setup with different feature costs for each resolution (see Fig. 2(c)). In this figure, pixels with higher importance to be queried are indicated by warmer colors and less important pixels are indicated by colder colors. As is evident from this figure, the proposed context-aware method, based on the available pixels, acquires features in different orders and scans for digit edges or discriminative areas. On the other hand, the static feature acquisition method only asks for central pixels of each image in a fixed order (see Fig. 2(b)). In addition, regarding the multi-resolution case, as can be seen from the figure, the informative pixels from lower resolutions that incur lower costs are preferred to more costly higher resolution pixels. For instance, in the lower left corner image of Fig. 2(c), the parts that are first acquired from the high-resolution pixels are the parts that differentiate the digits 4 and 9, which are not clear enough in lower resolutions.

Fig. 2: Visualization of: (a) using the proposed approach on the MNIST dataset with equal feature costs, (b) using static feature acquisition using mutual information between pixels and targets on MNIST dataset, and (c) using the proposed approach on the multi-resolution MNIST dataset with different feature costs at each resolution. Pixels with more importance/priority to be queried are indicated by warmer colors.

IV-C Comparison with Other Work

Fig. 3 presents a comparison of the proposed feature acquisition method with a feature acquisition method based on recurrent neural networks (RADIN) [19] and a tree-based feature acquisition method (GreedyMiser) [15]. In this comparison, we have used the MNIST dataset to evaluate the performance of each method based on its accuracy using a different number of features. As can be inferred from the figure, when the number of features to be queried is significantly smaller than the total number of features, the accuracy achieved using FACT is lower than that of the other methods. Nevertheless, the rate of increase in the accuracy with respect to the number of queried features is significantly higher for the presented method than for the other methods, which makes it superior in the other cases.

In this plot and similar cost versus accuracy plots in this section, we provide 95% confidence intervals presented as error-bars measured by running each experiment multiple times using different random initializations.

Fig. 3: Comparison of the proposed method (FACT) with RADIN [19] and GreedyMiser [15] methods on the MNIST dataset.

Fig. 4 presents a comparison between the feature acquisition and cost curve of the proposed method and those of three other approaches in the literature that use classifier cascades or trees: CSTC [16], Cronus [14], and Early Exit [34]. Here, we used the LTRC dataset with the feature costs as suggested by [16] to plot the feature acquisition cost versus the corresponding normalized discounted cumulative gain (NDCG) [35] performance measure. NDCG is a well-known measure of ranking quality that is used to measure the effectiveness and the relevance of ranking results in search engines. As can be seen from this figure, FACT reaches a higher NDCG at a given acquisition cost than the other methods.
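For readers unfamiliar with NDCG, the following sketch computes one common variant of it, with gain 2^rel - 1 and a log2 position discount. Which exact variant and cutoff were used for the LTRC experiments is not stated here, so the gain function, the discount, and the optional cutoff `k` should all be read as assumptions of this example.

```python
import math
from typing import Optional, Sequence


def dcg(relevances: Sequence[float], k: Optional[int] = None) -> float:
    """Discounted cumulative gain of relevance labels listed in ranked order."""
    ranked = relevances if k is None else relevances[:k]
    return sum((2.0 ** r - 1.0) / math.log2(i + 2) for i, r in enumerate(ranked))


def ndcg(relevances: Sequence[float], k: Optional[int] = None) -> float:
    """DCG divided by the DCG of the ideal (descending relevance) ordering."""
    ideal = dcg(sorted(relevances, reverse=True), k)
    return dcg(relevances, k) / ideal if ideal > 0 else 0.0
```

A ranking that already lists documents in decreasing relevance scores 1, and any misordering lowers the score in proportion to how far relevant documents are pushed down the list.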

Fig. 4: Comparison of the proposed method (FACT) with CSTC [16], Cronus [14], and Early Exit [34] methods on the LTRC dataset.
Fig. 5: Comparison of the proposed method (FACT) with exhaustive sensitivity-based (Exhaustive) [21] and random selection (Rand) methods on the Landsat dataset. This figure shows that the proposed method is able to approximate the ground-truth sensitivity values accurately and efficiently.

In order to evaluate the performance of the proposed method for the estimation of sensitivity values, we have implemented an exhaustive feature query method as suggested by [21] using sensitivity as the utility function. To make a fair comparison, we have used the trained predictor network, and exhaustively measured the effect of changing each input on the prediction probabilities. Here, in order to estimate the probability of each change, we have used a 5-bin histogram for each feature. Fig. 5 presents a comparison between the accuracy achieved using FACT and the exhaustive sensitivity-based method on the Landsat dataset. As a baseline, we have also included the curve corresponding to randomly selecting and acquiring features. As it is evident from the figure, the proposed method is almost equivalent to the exhaustive method in terms of the accuracy achieved at each total acquisition cost. This is promising considering the fact that the proposed approach tries to approximate the exhaustive sensitivity measurement in an efficient and scalable manner. In other words, compared to the proposed method, the exhaustive method is significantly slower and less efficient at test-time. Specifically, the average processing time of the exhaustive method was about for each sample, while the corresponding processing time for FACT was about (i.e., about times faster). In other test cases with more features, comparing with the exhaustive method was not possible due to the exponentially increasing computational load of evaluating the exhaustive method.
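The exhaustive baseline described above can be sketched as follows. This is only an illustration under stated assumptions: the predictor is a toy function, unknown features are held at a placeholder value of zero, and the sensitivity of a feature is taken as the histogram-weighted absolute change in the predicted probabilities. The helper names (`histogram`, `exhaustive_sensitivity`, `toy_predict`) are hypothetical and do not come from [21] or from the authors' implementation.

```python
import math

def histogram(values, n_bins=5):
    """Return bin centers and empirical bin probabilities for one feature."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0  # guard against constant features
    counts = [0] * n_bins
    for v in values:
        counts[min(int((v - lo) / width), n_bins - 1)] += 1
    centers = [lo + (i + 0.5) * width for i in range(n_bins)]
    probs = [c / len(values) for c in counts]
    return centers, probs

def exhaustive_sensitivity(predict, x, unknown, hists):
    """One predictor evaluation per (unknown feature, histogram bin) pair."""
    base = predict(x)
    scores = {}
    for j in unknown:
        centers, probs = hists[j]
        total = 0.0
        for center, prob in zip(centers, probs):
            x_new = list(x)
            x_new[j] = center
            out = predict(x_new)
            total += prob * sum(abs(a - b) for a, b in zip(out, base))
        scores[j] = total
    return scores

def toy_predict(x):
    """Toy two-class predictor that depends only on feature 0."""
    p = 1.0 / (1.0 + math.exp(-4.0 * x[0]))
    return [1.0 - p, p]

train = [[-1.0, 5.0], [-0.5, 1.0], [0.0, 3.0], [0.5, 2.0], [1.0, 4.0]]
hists = {j: histogram([row[j] for row in train]) for j in range(2)}
scores = exhaustive_sensitivity(toy_predict, [0.0, 0.0], unknown=[0, 1], hists=hists)
```

In this sketch the number of predictor evaluations grows with the number of unknown features times the number of bins for every single query, which is the overhead the proposed method is designed to avoid.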

It is worth mentioning that this computational advantage comes from the fact that, for each incremental feature query, we approximate the summation of (11) for all unknown features using one forward and one backward network computation. In contrast, the exhaustive sensitivity measurement computes the summation separately for each feature and over the range of all its possible values. In other words, the proposed method scales linearly with the growth in the number of unknown features, while the exhaustive method scales in polynomial time.
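The reason a single forward and backward computation suffices is that backpropagation yields the derivative of the output with respect to every input at once. The sketch below illustrates this on a small hand-written network with one tanh hidden layer and a sigmoid output; the architecture, the weights, and the function names are assumptions chosen for illustration, and the sketch does not reproduce the summation in (11).

```python
import math

def forward(x, W1, b1, w2, b2):
    """Forward pass: tanh hidden layer followed by a sigmoid output unit."""
    hidden = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(W1, b1)]
    z = sum(w * h for w, h in zip(w2, hidden)) + b2
    return hidden, 1.0 / (1.0 + math.exp(-z))

def input_gradient(x, W1, b1, w2, b2):
    """One forward and one backward pass give dy/dx_j for all inputs j."""
    hidden, y = forward(x, W1, b1, w2, b2)
    dz = y * (1.0 - y)                                   # sigmoid derivative
    dh = [dz * w * (1.0 - h * h) for w, h in zip(w2, hidden)]  # tanh derivative
    grad = [sum(d * row[j] for d, row in zip(dh, W1)) for j in range(len(x))]
    return y, grad

# Hypothetical weights for a network with 3 inputs and 2 hidden units.
W1 = [[0.5, -1.0, 0.2], [1.5, 0.3, -0.7]]
b1 = [0.1, -0.2]
w2 = [1.0, -2.0]
b2 = 0.05
x = [0.2, -0.4, 0.9]

y, grad = input_gradient(x, W1, b1, w2, b2)
sensitivity = [abs(g) for g in grad]  # one sensitivity value per input feature
```

The cost of the backward pass is independent of how many candidate values each feature can take, in contrast to an exhaustive evaluation over all values of each feature.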

IV-D Evaluation using Synthesized Data

Fig. 6: Evaluation of the proposed method on synthesized data. (a) Cost and static importance of each feature. (b) The feature acquisition order for 50 different test samples (warmer colors mean more priority). (c) Accuracy versus acquisition cost curves for the proposed method (FACT), acquisition using static order, and random selection.

In order to gain more insight into the performance of the proposed method, we evaluated it on a synthesized dataset. The dataset is generated as follows: first, we randomly sampled cluster centers from a -dimensional space. Then, points were sampled around each cluster center from a normal distribution with a mean of zero and variance of . Afterwards, we randomly assigned each cluster to one of two classes. To each feature vector created so far, containing features, we appended another features with random values drawn from a normal distribution. These features have no predictive value. Accordingly, the resulting feature vectors are of size . Finally, we made the dataset cost-sensitive by defining feature costs for the first and second features to be a monotonically increasing function from to . See Fig. 6(a) for a visualization of feature costs and the static importance computed using the mutual information between features and labels.
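The generation procedure can be sketched as below. Every numeric value in this sketch (number of clusters, points per cluster, feature counts, noise scale, cost range) is a hypothetical placeholder rather than the setting used in the experiments. The sketch also assumes that costs increase linearly within each half of the feature vector, which is one possible reading of the description above.

```python
import random

def linear_costs(n, low, high):
    """Monotonically increasing costs from low to high over n features."""
    if n == 1:
        return [low]
    return [low + (high - low) * i / (n - 1) for i in range(n)]

def make_synthetic(n_clusters=4, points_per_cluster=25, n_informative=8,
                   n_noise=8, std=0.3, cost_low=0.1, cost_high=1.0, seed=0):
    """Clustered informative features plus appended uninformative features."""
    rng = random.Random(seed)
    X, y = [], []
    for _ in range(n_clusters):
        center = [rng.uniform(-1.0, 1.0) for _ in range(n_informative)]
        label = rng.randrange(2)  # each cluster is assigned to one of two classes
        for _ in range(points_per_cluster):
            informative = [c + rng.gauss(0.0, std) for c in center]
            noise = [rng.gauss(0.0, 1.0) for _ in range(n_noise)]  # no predictive value
            X.append(informative + noise)
            y.append(label)
    costs = (linear_costs(n_informative, cost_low, cost_high)
             + linear_costs(n_noise, cost_low, cost_high))
    return X, y, costs

X, y, costs = make_synthetic()
```

Because the appended features are drawn independently of the class label, an acquisition method that accounts for informativeness should rarely request them.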

Fig. 6(b) demonstrates the order in which each feature is acquired using the proposed method. In this figure, each row corresponds to a test sample (here only 50 samples are visualized) and each column represents a feature. The features that are acquired earlier are indicated with warmer colors. Here, we continued the feature acquisition until we reached of the maximum achievable accuracy. As can be seen, the second half of the features, which are not informative, are mainly skipped by the proposed method. Specifically, only about of features from the second half are selected by FACT, which means that most of the uninformative features are not acquired by the algorithm. On the other hand, based on the cost and value of each feature from the first half, the proposed method acquired features that are more informative and have lower cost values. Apart from this, Fig. 6(c) presents the accuracy versus total feature acquisition cost on the test set. As can be seen from the curve, FACT converges to the maximum accuracy much faster than the static and random acquisition methods. This is mainly because the proposed method strongly prefers informative features with low cost, while the other methods disregard this information.

IV-E Evaluation using Real-World Health Data

In order to evaluate the performance of the proposed method on a dataset in the health domain, where feature acquisition costs are inherently important, we used the thyroid disease classification dataset [36] (available at: http://archive.ics.uci.edu/ml/datasets/thyroid+disease). Here, we have features from different categories including demographics, questionnaire, examination, and lab results. Furthermore, this dataset provides the acquisition cost of each feature, which ranges from for features such as age to for certain blood tests.

Fig. 7(a) presents a visualization of the order in which each feature is acquired for 40 randomly selected test samples. As can be observed from this visualization, FACT gives more priority to low-cost features, while costly but informative features are acquired with a lower priority. Apart from this, Fig. 7(b) presents the accuracy versus acquisition cost curve for FACT, static, and random methods. As can be seen from this figure, FACT outperforms the baseline approaches.

Fig. 7: Evaluation of the proposed method on thyroid disease classification task. (a) The feature acquisition order for 50 different test samples (warmer colors mean more priority). (b) Accuracy versus acquisition cost curves for the proposed method (FACT), acquisition using static order, and random selection.

V Discussion

Fig. 8: Influence of different beta distribution parameters on the accuracy versus acquisition cost curve for the synthesized dataset.

There are many methods, such as mutual information and information gain, that are traditionally used in the literature to measure the value of each feature [37]. However, these methods are usually limited to considering only linear relationships or considering a single feature rather than the joint distribution of features. For instance, given evidence about a subset of features, the correlations between the rest of the features may be affected in ways that traditional approaches are usually incapable of capturing. In this paper, we suggest inferring the dynamics between features and classes by sensitivity analysis of trained predictors. This approach employs the hidden information captured inside a black-box network to measure the value of acquiring each feature given the available context.

In this paper, we use (6) as a measure of feature informativeness per unit of the cost to make feature acquisition decisions. However, an alternative approach, which may result in better accuracies at a certain context, would be defining an objective function that balances the cost versus performance trade-off using a hyper-parameter.
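The two decision rules mentioned above can be contrasted in a short sketch. The first rule picks the unknown feature with the highest sensitivity per unit cost, in the spirit of (6). The second is only one hypothetical form of a hyper-parameter-weighted objective (sensitivity minus a multiple of the cost); its form, the weight `lam`, and the numbers below are assumptions for illustration.

```python
def next_feature_ratio(sensitivity, cost, unknown):
    """Select the unknown feature with the largest sensitivity per unit cost."""
    return max(unknown, key=lambda j: sensitivity[j] / cost[j])

def next_feature_tradeoff(sensitivity, cost, unknown, lam):
    """Hypothetical alternative: maximize sensitivity minus lam times cost."""
    return max(unknown, key=lambda j: sensitivity[j] - lam * cost[j])

# Illustrative values: feature 0 is the most informative but also the most costly.
sensitivity = [0.9, 0.3, 0.5]
cost = [10.0, 1.0, 4.0]
unknown = [0, 1, 2]

choice_ratio = next_feature_ratio(sensitivity, cost, unknown)
choice_small_lam = next_feature_tradeoff(sensitivity, cost, unknown, lam=0.01)
choice_large_lam = next_feature_tradeoff(sensitivity, cost, unknown, lam=0.1)
```

With a small weight on cost the weighted objective favors the expensive but informative feature, whereas a larger weight reproduces the choice of the ratio rule in this example.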

Furthermore, this paper suggests an encoding and decoding approach to create a range of changes so that the final summation of sensitivities is a better approximation of the total sensitivity with respect to each feature. Aside from binary quantization, we explored other methods such as variable-length and constant-length quantizations. While these usually work reasonably well, we decided to use binary encoding as it is more efficient to implement and more familiar to readers.

In this paper, a beta corruption function is used to introduce missing features and to train the denoising autoencoder. Based on our experiments, as long as its parameters are chosen reasonably, the corruption function does not have any direct influence on the performance of the predictor or the feature acquisition functionality. Specifically, we measured the influence of changing the beta parameters from to and the changes in the area under the accuracy versus cost curve were less than (see Fig. 8 for an example). In this paper, we suggest beta distribution parameters of for most datasets, and parameters of for sparse datasets such as mushroom. It is also worth mentioning that the corruption function is applied to all features independently. Therefore, it does not introduce any bias toward certain features.
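A corruption function of this kind can be sketched as follows. The sketch assumes that a missing rate is drawn from a beta distribution for each sample and that each feature is then dropped independently with that rate; the beta parameters and the placeholder used for missing values are hypothetical and are not the settings recommended above.

```python
import random

def beta_corrupt(x, alpha, beta, rng, missing=0.0):
    """Drop each feature independently with a rate drawn from Beta(alpha, beta)."""
    rate = rng.betavariate(alpha, beta)           # per-sample missing rate in [0, 1]
    mask = [rng.random() >= rate for _ in x]      # True means the feature is kept
    corrupted = [v if keep else missing for v, keep in zip(x, mask)]
    return corrupted, mask

rng = random.Random(0)
sample = [0.7, 1.2, -0.3, 2.5, 0.1]
corrupted, mask = beta_corrupt(sample, alpha=2.0, beta=2.0, rng=rng)
```

Since the same rate applies to every feature of a sample, no feature is more likely to be dropped than another, which matches the observation that the corruption introduces no bias toward certain features.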

VI Conclusion

In this paper, we introduced a novel method for cost- and context-aware feature acquisition at test-time. The proposed method, based on denoising autoencoders with binary representation layers, efficiently estimates context-dependent feature distributions and measures the sensitivity of the output with respect to each unknown feature. Furthermore, we evaluated the proposed approach on eight different real-world datasets covering various problem scenarios and applications. Finally, we compared the results of the introduced method with those of other state-of-the-art approaches in the literature. According to the results, the suggested method is capable of dynamically and efficiently deciding which feature to acquire based on feature costs and the available context.


  • [1] H. Bay, A. Ess, T. Tuytelaars, and L. Van Gool, “Speeded-up robust features (surf),” Computer vision and image understanding, vol. 110, no. 3, pp. 346–359, 2008.
  • [2] M. Kachuee, M. M. Kiani, H. Mohammadzade, and M. Shabany, “Cuffless blood pressure estimation algorithms for continuous health-care monitoring,” IEEE Transactions on Biomedical Engineering, vol. 64, no. 4, pp. 859–869, 2017.
  • [3] K. Early, J. Mankoff, and S. E. Fienberg, “Dynamic question ordering in online surveys,” arXiv preprint arXiv:1607.04209, 2016.
  • [4] P. K. Sharpe and R. Solly, “Dealing with missing values in neural network-based diagnostic systems,” Neural Computing & Applications, vol. 3, no. 2, pp. 73–77, 1995.
  • [5] B. Krishnapuram, S. Yu, and R. B. Rao, Cost-sensitive Machine Learning.   CRC Press, 2011.
  • [6] M. Liu, C. Xu, Y. Luo, C. Xu, Y. Wen, and D. Tao, “Cost-sensitive feature selection via f-measure optimization reduction.” in AAAI, 2017, pp. 2252–2258.
  • [7] H. Ghasemzadeh, N. Amini, R. Saeedi, and M. Sarrafzadeh, “Power-aware computing in wearable sensor networks: An optimal feature selection,” IEEE Transactions on Mobile Computing, vol. 14, no. 4, pp. 800–812, 2015.
  • [8] F. Min, Q. Hu, and W. Zhu, “Feature selection with test cost constraint,” International Journal of Approximate Reasoning, vol. 55, no. 1, pp. 167–179, 2014.
  • [9] P. Cao, D. Zhao, and O. Zaiane, “An optimized cost-sensitive svm for imbalanced data learning,” in Pacific-Asia Conference on Knowledge Discovery and Data Mining.   Springer, 2013, pp. 280–292.
  • [10] B. Efron, T. Hastie, I. Johnstone, R. Tibshirani et al., “Least angle regression,” The Annals of statistics, vol. 32, no. 2, pp. 407–499, 2004.
  • [11] R. Greiner, A. J. Grove, and D. Roth, “Learning cost-sensitive active classifiers,” Artificial Intelligence, vol. 139, no. 2, pp. 137–174, 2002.
  • [12] S. Ji and L. Carin, “Cost-sensitive feature acquisition and classification,” Pattern Recognition, vol. 40, no. 5, pp. 1474–1485, 2007.
  • [13] P. Viola and M. J. Jones, “Robust real-time face detection,” International journal of computer vision, vol. 57, no. 2, pp. 137–154, 2004.
  • [14] M. Chen, Z. Xu, K. Weinberger, O. Chapelle, and D. Kedem, “Classifier cascade for minimizing feature evaluation cost,” in Artificial Intelligence and Statistics, 2012, pp. 218–226.
  • [15] Z. Xu, K. Weinberger, and O. Chapelle, “The greedy miser: Learning under test-time budgets,” arXiv preprint arXiv:1206.6451, 2012.
  • [16] Z. E. Xu, M. J. Kusner, K. Q. Weinberger, M. Chen, and O. Chapelle, “Classifier cascades and trees for minimizing feature evaluation cost.” Journal of Machine Learning Research, vol. 15, no. 1, pp. 2113–2144, 2014.
  • [17] S. Karayev, T. Baumgartner, M. Fritz, and T. Darrell, “Timely object recognition,” in Advances in Neural Information Processing Systems, 2012, pp. 890–898.
  • [18] H. He, H. Daumé III, and J. Eisner, “Cost-sensitive dynamic feature selection,” in ICML Inferning Workshop, 2012.
  • [19] G. Contardo, L. Denoyer, and T. Artières, “Recurrent neural networks for adaptive feature acquisition,” in International Conference on Neural Information Processing.   Springer, 2016, pp. 591–599.
  • [20] G. Contardo, L. Denoyer, and T. Artieres, “Sequential cost-sensitive feature acquisition,” in International Symposium on Intelligent Data Analysis.   Springer, 2016, pp. 284–294.
  • [21] K. Early, S. E. Fienberg, and J. Mankoff, “Test time feature ordering with focus: interactive predictions with minimal user burden,” in Proceedings of the 2016 ACM International Joint Conference on Pervasive and Ubiquitous Computing.   ACM, 2016, pp. 992–1003.
  • [22] M. Kachuee, A. Hosseini, B. Moatamed, S. Darabi, and M. Sarrafzadeh, “Context-aware feature query to improve the prediction performance,” in Signal and Information Processing (GlobalSIP), 2017 IEEE Global Conference on.   IEEE, 2017, pp. 838–842.
  • [23] V. Nair and G. E. Hinton, “Rectified linear units improve restricted boltzmann machines,” in Proceedings of the 27th international conference on machine learning (ICML-10), 2010, pp. 807–814.
  • [24] Y. LeCun, C. Cortes, and C. J. Burges, “The mnist database of handwritten digits,” 1998.
  • [25] O. Chapelle and Y. Chang, “Yahoo! learning to rank challenge overview,” in Proceedings of the Learning to Rank Challenge, 2011, pp. 1–24.
  • [26] J.-L. Reyes-Ortiz, L. Oneto, A. Sama, X. Parra, and D. Anguita, “Transition-aware human activity recognition using smartphones,” Neurocomputing, vol. 171, pp. 754–767, 2016.
  • [27] D. D. Lewis, “Reuters-21578 text categorization test collection, distribution 1.0,” 1997.
  • [28] J. Schlimmer, “Mushroom records drawn from the audubon society field guide to north american mushrooms,” GH Lincoff (Pres), New York, 1981.
  • [29] C. L. Blake and C. J. Merz, “Uci repository of machine learning databases [http://www.ics.uci.edu/~mlearn/mlrepository.html]. irvine, ca: University of california,” Department of Information and Computer Science, vol. 55, 1998.
  • [30] D. Ayres-de Campos, J. Bernardes, A. Garrido, J. Marques-de Sa, and L. Pereira-Leite, “Sisporto 2.0: a program for automated analysis of cardiotocograms,” Journal of Maternal-Fetal Medicine, vol. 9, no. 5, pp. 311–318, 2000.
  • [31] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin et al., “Tensorflow: Large-scale machine learning on heterogeneous distributed systems,” arXiv preprint arXiv:1603.04467, 2016.
  • [32] D. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
  • [33] K. Trapeznikov and V. Saligrama, “Supervised sequential classification under budget constraints,” in Artificial Intelligence and Statistics, 2013, pp. 581–589.
  • [34] B. B. Cambazoglu, H. Zaragoza, O. Chapelle, J. Chen, C. Liao, Z. Zheng, and J. Degenhardt, “Early exit optimizations for additive machine learned ranking systems,” in Proceedings of the third ACM international conference on Web search and data mining.   ACM, 2010, pp. 411–420.
  • [35] K. Järvelin and J. Kekäläinen, “Cumulated gain-based evaluation of IR techniques,” ACM Transactions on Information Systems (TOIS), vol. 20, no. 4, pp. 422–446, 2002.
  • [36] D. Dheeru and E. Karra Taniskidou, “UCI Machine Learning Repository,” 2017.
  • [37] H. Liu and R. Setiono, “Chi2: Feature selection and discretization of numeric attributes,” in Tools with artificial intelligence, 1995. proceedings., seventh international conference on.   IEEE, 1995, pp. 388–391.