1 Introduction
Machine learning algorithms are widely used in many real-world applications, ranging from email-spam [26] and adult content filtering [12], to web-search engines [28]. As machine learning transitions into these industry fields, managing the CPU cost at test-time becomes increasingly important. In applications of such large scale, computation must be budgeted and accounted for. Moreover, reducing energy wasted on unnecessary computation can lead to monetary savings and reductions of greenhouse gas emissions.
The test-time cost consists of the time required to evaluate a classifier and the time to extract features for that classifier, where the extraction time across features is highly variable. Imagine introducing a new feature to an email spam filtering algorithm that requires 0.01 seconds to extract per incoming email. If a web-service receives one billion emails (which many do daily), it would require 115 extra CPU days to extract just this feature. Although this additional feature may increase the accuracy of the filter, the cost of computing it for every email is prohibitive. This introduces the problem of balancing the test-time cost and the classifier accuracy. Addressing this trade-off in a principled manner is crucial for the applicability of machine learning.
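The CPU-day figure above can be checked with a few lines of arithmetic. This is only a sketch: the per-email extraction time of 0.01 seconds is an assumption chosen because it reproduces the quoted 115 CPU days, and the function name is illustrative.

```python
SECONDS_PER_DAY = 24 * 60 * 60

def extra_cpu_days(num_inputs: int, seconds_per_input: float) -> float:
    """CPU days spent extracting one feature for every input."""
    return num_inputs * seconds_per_input / SECONDS_PER_DAY

# One billion emails; 0.01 s per email is an assumed extraction time.
days = extra_cpu_days(1_000_000_000, 0.01)
print(f"{days:.1f} CPU days")
```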
In this paper, we propose a novel algorithm, Cost-Sensitive Tree of Classifiers (CSTC). A CSTC tree (illustrated schematically in Fig. 1) is a tree of classifiers that is carefully constructed to reduce the average test-time complexity of machine learning algorithms, while maximizing their accuracy. Different from prior work, which reduces the total cost for every input [11] or which stages the feature extraction into linear cascades [24; 20; 22; 21; 8], a CSTC tree incorporates input-dependent feature selection into training and dynamically allocates higher feature budgets for infrequently traveled tree-paths. By introducing a probabilistic tree-traversal framework, we can compute the exact expected test-time cost of a CSTC tree. CSTC is trained with a single global loss function, whose test-time cost penalty is a direct relaxation of this expected cost. This principled approach leads to unmatched test-cost/accuracy trade-offs as it naturally divides the input space into sub-regions and extracts expensive features only when necessary.
We make several novel contributions: 1. We introduce the meta-learning framework of CSTC trees and derive the expected cost of an input traversing the tree during test-time. 2. We relax this expected cost with a mixed-norm relaxation and derive a single global optimization problem to train all classifiers jointly. 3. We demonstrate on synthetic data that CSTC effectively allocates features to classifiers where they are most beneficial and show on large-scale real-world web-search ranking data that CSTC significantly outperforms the current state-of-the-art in test-time cost-sensitive learning, maintaining the performance of the best algorithms for web-search ranking at a fraction of their computational cost.
2 Related Work
A basic approach to control test-time cost is the use of $l_1$ norm regularization [11], which results in a sparse feature set, and can significantly reduce the feature cost during test-time (as unused features are never computed). However, this approach fails to address the fact that some inputs may be successfully classified by only a few cheap features, whereas others strictly require expensive features for correct classification.
There is much previous work that extends single classifiers to classifier cascades (mostly for binary classification) [24; 20; 22; 21; 8]. In these cascades, several classifiers are ordered into a sequence of stages. Each classifier can either reject inputs (predicting them), or pass them on to the next stage, based on the prediction of each input. To reduce the test-time cost, these cascade algorithms enforce that classifiers in early stages use very few and/or cheap features and reject many easily-classified inputs. Classifiers in later stages, however, are more expensive and cope with more difficult inputs. This linear structure is particularly effective for applications with highly skewed class imbalance and generic features. One celebrated example is face detection in images, where the majority of all image regions do not contain faces and can often be easily rejected based on the response of a few simple Haar features [24]. The linear cascade model is however less suited for learning tasks with balanced classes and specialized features. It cannot fully capture the scenario where different partitions of the input space require different expert features, as all inputs follow the same linear chain.
Grubb & Bagnell [15] and Xu et al. [27] focus on training a classifier that explicitly trades off test-time cost and accuracy. Instead of optimizing the trade-off by building a cascade, they push the cost trade-off into the construction of the weak learners. It should be noted that, in spite of the high accuracy achieved by these techniques, the algorithms are based heavily on stage-wise regression (gradient boosting) [13], and are less likely to work with more general weak learners.
Gao & Koller [14] use locally weighted regression during test time to predict the information gain of unknown features. Different from our algorithm, their model is learned during test-time, which introduces an additional cost especially for large data sets. In contrast, our algorithm learns and fixes a tree structure in training and has a test-time complexity that is constant with respect to the training set size.
Karayev et al. [18] use reinforcement learning to dynamically select features to maximize the average precision over time in an object detection setting. In this case, the dataset has multi-labeled inputs and thus warrants a different approach than ours.
Hierarchical Mixture of Experts (HME) [17] also builds tree-structured classifiers. However, in contrast to CSTC, this work is not motivated by reductions in test-time cost and results in fundamentally different models. In CSTC, each classifier is trained with the test-time cost in mind and each test-input only traverses a single path from the root down to a terminal element, accumulating path-specific costs. In HME, all test-inputs traverse all paths and all leaf-classifiers contribute to the final prediction, incurring the same cost for all test-inputs.
Recent tree-structured classifiers include the work of Deng et al. [9], who speed up the training and evaluation of label trees [1], by avoiding many binary one-vs-all classifier evaluations. Differently, we focus on problems in which feature extraction time dominates the test-time cost, which motivates different algorithmic setups. Dredze et al. [10] combine the cost to select a feature with the mutual information of that feature to build a decision tree that reduces the feature extraction cost. Different from this work, they do not directly minimize the total test-time cost of the decision tree or the risk. Possibly most similar to our work are Busa-Fekete et al. [4], who learn a directed acyclic graph via a Markov decision process to select features for different instances, and Wang & Saligrama [25], who adaptively partition the feature space and learn local region-specific classifiers. Although each work is similar in motivation, the algorithmic frameworks are very different and can be regarded as complementary to ours.
3 Cost-sensitive classification
We first introduce our notation and then formalize our test-time cost-sensitive learning setting. Let the training data consist of inputs $\mathcal{D} = \{x_1, \dots, x_n\} \subset \mathbb{R}^d$ with corresponding class labels $\{y_1, \dots, y_n\} \subseteq \mathcal{Y}$, where $\mathcal{Y} = \mathbb{R}$ in the case of regression ($\mathcal{Y}$ could also be a finite set of categorical labels; because of space limitations we do not focus on this case in this paper).
Non-linear feature space. Throughout this paper, we focus on linear classifiers but in order to allow non-linear decision boundaries we map the input into a non-linear feature space with the “boosting trick” [13; 7], prior to our optimization. In particular, we first train gradient boosted regression trees with a squared loss penalty [13], $H'(x_i) = \sum_{t=1}^{T} h_t(x_i)$, where each function $h_t(\cdot)$ is a limited-depth CART tree [3]. We then apply the mapping $x_i \to \phi(x_i)$ to all inputs, where $\phi(x_i) = [h_1(x_i), \dots, h_T(x_i)]^\top$. To avoid confusion between CART trees and the CSTC tree, we refer to CART trees as weak learners.
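The mapping into the weak-learner feature space can be sketched as follows. This is a minimal illustration rather than the setup used in the paper: it assumes depth-1 trees (stumps) instead of limited-depth CART trees, an illustrative learning rate folded into each weak learner, and NumPy; all function names are hypothetical.

```python
import numpy as np

def fit_stump(X, residual):
    """Fit a depth-1 regression tree to the residual by exhaustive search."""
    best = None
    for feature in range(X.shape[1]):
        # Skip the largest value so the right-hand side is never empty.
        for threshold in np.unique(X[:, feature])[:-1]:
            left = X[:, feature] <= threshold
            left_value, right_value = residual[left].mean(), residual[~left].mean()
            error = ((residual - np.where(left, left_value, right_value)) ** 2).sum()
            if best is None or error < best[0]:
                best = (error, feature, threshold, left_value, right_value)
    return best[1:]

def stump_predict(stump, X):
    feature, threshold, left_value, right_value = stump
    return np.where(X[:, feature] <= threshold, left_value, right_value)

def boost(X, y, num_learners=20, learning_rate=0.5):
    """Gradient boosting with squared loss: each stump fits the current residual."""
    stumps, prediction = [], np.zeros(len(y))
    for _ in range(num_learners):
        feature, threshold, left_value, right_value = fit_stump(X, y - prediction)
        # Fold the learning rate into the weak learner so the ensemble is a plain sum.
        stump = (feature, threshold, learning_rate * left_value, learning_rate * right_value)
        stumps.append(stump)
        prediction += stump_predict(stump, X)
    return stumps

def phi(stumps, X):
    """Map inputs to the weak-learner feature space, one column per weak learner."""
    return np.column_stack([stump_predict(s, X) for s in stumps])

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1]
stumps = boost(X, y)
Phi = phi(stumps, X)  # one row per input, one column per weak learner
```

Summing the columns of `Phi` recovers the boosted prediction, so a linear classifier with all weights equal to one reproduces the original ensemble.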
Risk minimization. At each node in the CSTC tree we propose to learn a linear classifier in this feature space, $H(x_i) = \phi(x_i)^\top \beta$ with $\beta \in \mathbb{R}^T$, which is trained to explicitly reduce the CPU cost during test-time. We learn the weight vector $\beta$ by minimizing a convex empirical risk function $\ell(\phi(x_i)^\top \beta, y_i)$ with $l_1$ regularization, $|\beta|$. In addition, we incorporate a cost term $c(\beta)$, which we derive in the following subsection, to restrict test-time cost. The combined test-time cost-sensitive loss function becomes
$$\min_{\beta} \; \sum_{i} \ell(\phi(x_i)^\top \beta, y_i) + \rho |\beta| + \lambda\, c(\beta), \tag{1}$$
where $\lambda$ is the accuracy/cost trade-off parameter, and $\rho$ controls the strength of the regularization.
Test-time cost. There are two factors that contribute to the test-time cost of each classifier: the weak learner evaluation cost of all active $h_t(\cdot)$ (with $|\beta_t| > 0$) and the feature extraction cost for all features used in these weak learners. We assume that features are computed on demand, incurring their extraction cost the first time they are used, and are free for future use (as feature values can be cached). We define an auxiliary matrix $F \in \{0, 1\}^{d \times T}$ with $F_{\alpha t} = 1$ if and only if the weak learner $h_t$ uses feature $f_\alpha$. Let $e_t > 0$ be the cost to evaluate a weak learner $h_t(\cdot)$, and $c_\alpha$ be the cost to extract feature $f_\alpha$. With this notation, we can formulate the total test-time cost for an instance precisely as
$$c(\beta) = \sum_{t} e_t \|\beta_t\|_0 + \sum_{\alpha} c_\alpha \Big\| \sum_{t} |F_{\alpha t} \beta_t| \Big\|_0, \tag{2}$$
where the $l_0$ norm for scalars is defined as $\|a\|_0 \in \{0, 1\}$ with $\|a\|_0 = 1$ if and only if $a \neq 0$. The first term assigns cost $e_t$ to every weak learner used in $\beta$, the second term assigns cost $c_\alpha$ to every feature that is extracted by at least one of such weak learners.
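The exact cost described above can be written as a short function: every weak learner with non-zero weight pays its evaluation cost, and every feature used by at least one such weak learner pays its extraction cost once. The sketch below uses plain Python lists; the argument names and the toy costs are illustrative assumptions.

```python
def exact_test_cost(beta, F, eval_cost, feature_cost):
    """Exact test-time cost of one linear classifier over weak learners.

    beta[t]          weight on weak learner t
    F[a][t]          1 if weak learner t uses feature a, else 0
    eval_cost[t]     cost of evaluating weak learner t
    feature_cost[a]  cost of extracting feature a
    """
    active = [t for t, weight in enumerate(beta) if weight != 0]
    learner_cost = sum(eval_cost[t] for t in active)
    # A feature is paid for once if any active weak learner uses it.
    extracted = [a for a, row in enumerate(F) if any(row[t] for t in active)]
    return learner_cost + sum(feature_cost[a] for a in extracted)

# Three weak learners over two features; learners 0 and 1 share feature 0.
F = [[1, 1, 0],
     [0, 0, 1]]
cost = exact_test_cost(beta=[0.7, -0.2, 0.0], F=F,
                       eval_cost=[1, 1, 1], feature_cost=[5, 20])
# cost == 7: two evaluations plus feature 0; feature 1 is never extracted.
```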
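Test-cost relaxation. The cost formulation in (2) is exact but difficult to optimize as the $l_0$ norms are non-continuous and non-differentiable. As a solution, throughout this paper we use the mixed-norm relaxation of the $l_0$ norm over sums,
$$\sum_{j} \Big\| \sum_{i} |A_{ij}| \Big\|_0 \;\to\; \sum_{j} \sqrt{\sum_{i} (A_{ij})^2}, \tag{3}$$
described by [19]. Note that for a single element this relaxation relaxes the $l_0$ norm to the $l_1$ norm, $\|A_{ij}\|_0 \to \sqrt{(A_{ij})^2} = |A_{ij}|$, and recovers the commonly used approximation to encourage sparsity [11; 23]. We plug the cost term (2) into the loss in (1) and apply the relaxation (3) to all $l_0$ norms to obtain
$$\sum_{i} \ell_i + \rho |\beta| + \lambda \Big( \sum_{t} e_t |\beta_t| + \sum_{\alpha} c_\alpha \sqrt{\sum_{t} (F_{\alpha t} \beta_t)^2} \Big), \tag{4}$$
where we abbreviate $\ell_i = \ell(\phi(x_i)^\top \beta, y_i)$ for simplicity. While (4) is cost-sensitive, it is restricted to a single linear classifier. In the next section we describe how to expand this formulation into a cost-effective tree-structured model.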
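The relaxed cost term for a single classifier can be sketched in the same way: the evaluation term becomes a cost-weighted sum of absolute weights, and the feature term takes a square root over the squared weights of all weak learners that use each feature. The names and toy values below are illustrative assumptions.

```python
import math

def relaxed_test_cost(beta, F, eval_cost, feature_cost):
    """Mixed-norm surrogate for the exact test-time cost of a single classifier."""
    learner_term = sum(e * abs(b) for e, b in zip(eval_cost, beta))
    feature_term = sum(
        c * math.sqrt(sum((f * b) ** 2 for f, b in zip(row, beta)))
        for c, row in zip(feature_cost, F)
    )
    return learner_term + feature_term

# Three weak learners over two features; learners 0 and 1 share feature 0.
F = [[1, 1, 0],
     [0, 0, 1]]
beta = [0.7, -0.2, 0.0]
relaxed = relaxed_test_cost(beta, F, eval_cost=[1, 1, 1], feature_cost=[5, 20])
doubled = relaxed_test_cost([2 * b for b in beta], F, [1, 1, 1], [5, 20])
```

Unlike the exact cost, the surrogate grows with the magnitude of the weights (`doubled` is twice `relaxed`), which is why a separate correction of the weights is useful after training.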
4 Cost-sensitive tree
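We begin by introducing foundational concepts regarding the CSTC tree and derive a global loss function (5). Similar to the previous section, we first derive the exact cost term and then relax it with the mixed-norm. Finally, we describe how to optimize this function efficiently and to undo some of the inaccuracy induced by the mixed-norm relaxations.
$$\min_{\{\beta^k, \theta^k\}_{v^k \in V}} \; \sum_{v^k \in V} \Big( \frac{1}{n} \sum_{i=1}^{n} p_i^k \ell_i^k + \rho |\beta^k| \Big) + \lambda \sum_{v^l \in L} p^l \Big[ \sum_{t} e_t \sqrt{\sum_{v^j \in \pi^l} (\beta_t^j)^2} + \sum_{\alpha} c_\alpha \sqrt{\sum_{v^j \in \pi^l} \sum_{t} (F_{\alpha t} \beta_t^j)^2} \Big] \tag{5}$$
Here $\ell_i^k$ abbreviates $\ell(\phi(x_i)^\top \beta^k, y_i)$; the node set $V$, the terminal elements $L$, the paths $\pi^l$ and the traversal probabilities $p_i^k$ and $p^l$ are defined in Section 4.1.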
CSTC nodes. We make the assumption that instances with similar labels can utilize similar features. (For example, in web-search ranking, features generated by browser statistics are typically predictive only for highly relevant pages as they require the user to spend significant time on the page and interact with it.) We therefore design our tree algorithm to partition the input space based on classifier predictions. Classifiers that reside deep in the tree become experts for a small subset of the input space and intermediate classifiers determine the path of instances through the tree. We distinguish between two different elements in a CSTC tree (depicted in Figure 1): classifier nodes (white circles) and terminal elements (black squares). Each classifier node $v^k$ is associated with a weight vector $\beta^k$ and a threshold $\theta^k$. Different from cascade approaches, these classifiers not only classify inputs using $\beta^k$, but also branch them by their threshold $\theta^k$, sending inputs to their upper child if $\phi(x_i)^\top \beta^k > \theta^k$, and to their lower child otherwise. Terminal elements are “dummy” structures and are not classifiers. They return the predictions of their direct parent classifier nodes, essentially functioning as a placeholder for an exit out of the tree. The tree structure may be a full balanced binary tree of some depth (e.g., Figure 1), or can be pruned based on a validation set (e.g., Figure 4, left).
During test-time, inputs are first applied to the root node $v^0$. The root node produces predictions $\phi(x_i)^\top \beta^0$ and sends the input $x_i$ along one of two different paths, depending on whether $\phi(x_i)^\top \beta^0 > \theta^0$. By repeatedly branching the test-inputs, classifier nodes sitting deeper in the tree only handle a small subset of all inputs and become specialized towards that subset of the input space.
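The test-time procedure just described amounts to following one root-to-exit path. The sketch below is an illustration only: the `Node` class and `predict` function are hypothetical names, a missing child stands in for a terminal element, and the input is assumed to be already mapped into the weak-learner feature space.

```python
from dataclasses import dataclass
from typing import Optional, Sequence

@dataclass
class Node:
    beta: Sequence[float]            # weight vector of this classifier node
    theta: float = 0.0               # branching threshold
    upper: Optional["Node"] = None   # child taken when the score exceeds theta
    lower: Optional["Node"] = None   # child taken otherwise

def predict(root: Node, features: Sequence[float]):
    """Follow one root-to-exit path; return the last score and the nodes visited."""
    node, path = root, []
    while True:
        score = sum(w * f for w, f in zip(node.beta, features))
        path.append(node)
        child = node.upper if score > node.theta else node.lower
        if child is None:  # a missing child plays the role of a terminal element
            return score, path
        node = child

root = Node([1.0, 0.0], theta=0.0,
            upper=Node([0.0, 2.0]), lower=Node([0.0, -1.0]))
score, path = predict(root, [0.5, 3.0])  # root score 0.5 > 0, so the upper child predicts
```

Only the classifiers on the returned path are evaluated, so only their weak learners and features contribute to the cost of this input.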
4.1 Tree loss
We derive a single global loss function over all nodes in the CSTC tree.
Soft tree traversal. Training the CSTC tree with hard thresholds leads to a combinatorial optimization problem, which is NP-hard. Therefore, during training, we softly partition the inputs and assign traversal probabilities $p_i^k = p(v^k | x_i)$ to denote the likelihood of input $x_i$ traversing through node $v^k$. Every input $x_i$ traverses through the root, so we define $p_i^0 = 1$ for all $i$. We use the sigmoid function to define a soft belief that an input $x_i$ will transition from classifier node $v^k$ to its upper child $v^u$ as $p(v^u | x_i, v^k) = \sigma(\phi(x_i)^\top \beta^k - \theta^k)$. (The sigmoid function is defined as $\sigma(a) = 1 / (1 + \exp(-a))$ and takes advantage of the fact that $\sigma(a) \in (0, 1)$ and that $\sigma(\cdot)$ is strictly monotonic.) The probability of reaching child $v^u$ from the root is, recursively, $p_i^u = p(v^u | x_i, v^k)\, p_i^k$, because each node has exactly one parent. For a lower child $v^l$ of parent $v^k$ we naturally obtain $p_i^l = [1 - p(v^u | x_i, v^k)]\, p_i^k$. In the following paragraphs we incorporate this probabilistic framework into the single-node risk and cost terms of eq. (4) to obtain the corresponding expected tree risk and tree cost.
Expected tree risk. The expected tree risk can be obtained by summing over all nodes $V$ and inputs and weighing the risk $\ell(\cdot)$ of input $x_i$ at node $v^k$ by the probability $p_i^k$,
$$\frac{1}{n} \sum_{i=1}^{n} \sum_{v^k \in V} p_i^k\, \ell(\phi(x_i)^\top \beta^k, y_i). \tag{6}$$
This has two effects: 1. the local risk for each node focuses more on likely inputs; 2. the global risk attributes more weight to classifiers that serve many inputs.
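The soft traversal and the probability-weighted risk can be sketched for a single input as follows. The sketch assumes nodes are stored in an array where parents precede their children, and that per-node scores and losses are already computed; all names are illustrative.

```python
import math

def sigmoid(a: float) -> float:
    return 1.0 / (1.0 + math.exp(-a))

def traversal_probabilities(scores, thetas, children):
    """Probability that one input reaches each classifier node.

    scores[k]    score of the input at node k
    thetas[k]    threshold of node k
    children[k]  (upper, lower) child indices, or None if node k has no classifier children
    Nodes are assumed to be indexed so that parents precede their children.
    """
    p = [0.0] * len(scores)
    p[0] = 1.0  # every input passes through the root
    for k, pair in enumerate(children):
        if pair is None:
            continue
        upper, lower = pair
        go_up = sigmoid(scores[k] - thetas[k])
        p[upper] = go_up * p[k]
        p[lower] = (1.0 - go_up) * p[k]
    return p

def expected_risk(p, losses):
    """Risk of one input: per-node losses weighted by traversal probabilities."""
    return sum(pk * loss for pk, loss in zip(p, losses))

# Root with two children; a root score equal to the threshold splits the mass evenly.
p = traversal_probabilities([0.0, 0.0, 0.0], [0.0, 0.0, 0.0], [(1, 2), None, None])
risk = expected_risk(p, losses=[1.0, 2.0, 4.0])
# p == [1.0, 0.5, 0.5] and risk == 4.0
```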
Expected tree costs. The cost of a test-input is the cumulative cost across all classifiers along its path through the CSTC tree. Figure 1 illustrates an example of a CSTC tree with all paths highlighted in color. Every test-input must follow along exactly one of the paths from the root to a terminal element. Let $L$ denote the set of all terminal elements (e.g., in figure 1 we have ), and for any $v^l \in L$ let $\pi^l$ denote the set of all classifier nodes along the unique path from the root $v^0$ before terminal element $v^l$ (e.g., ). The evaluation and feature cost of this unique path is exactly
$$c^l = \sum_{t} e_t \Big\| \sum_{v^j \in \pi^l} |\beta_t^j| \Big\|_0 + \sum_{\alpha} c_\alpha \Big\| \sum_{v^j \in \pi^l} \sum_{t} |F_{\alpha t} \beta_t^j| \Big\|_0 .$$
This term is analogous to eq. (2), except the cost $e_t$ of the weak learner $h_t$ is paid if any of the classifiers $v^j$ in path $\pi^l$ use this tree (i.e. assign $\beta_t^j$ non-zero weight). Similarly, the cost of a feature $f_\alpha$ is paid exactly once if any of the weak learners of any of the classifiers along $\pi^l$ require it. Once computed, a feature or weak learner can be reused by all classifiers along the path for free (as the computation can be cached very efficiently).
Given an input $x_i$, the probability of reaching terminal element $v^l \in L$ (traversing along path $\pi^l$) is $p_i^l$. Therefore, the marginal probability that a training input (picked uniformly at random from the training set) reaches $v^l$ is $p^l = \frac{1}{n} \sum_{i=1}^{n} p_i^l$. With this notation, the expected cost for an input traversing the CSTC tree becomes $E[c] = \sum_{v^l \in L} p^l c^l$. Using our $l_0$ norm relaxation in eq. (3) on both $l_0$ norms in $c^l$ gives the final expected tree cost penalty
$$\sum_{v^l \in L} p^l \Big[ \sum_{t} e_t \sqrt{\sum_{v^j \in \pi^l} (\beta_t^j)^2} + \sum_{\alpha} c_\alpha \sqrt{\sum_{v^j \in \pi^l} \sum_{t} (F_{\alpha t} \beta_t^j)^2} \Big],$$
which naturally encourages weak learner and feature re-use along paths through the CSTC tree.
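A small sketch can make the re-use effect concrete. It assumes the relaxed path cost takes a square root over the squared weights that a weak learner (or a feature's weak learners) receives from all classifiers on the path, and that the usage matrix `F` is binary; names and numbers are illustrative.

```python
import math

def path_cost(betas, F, eval_cost, feature_cost):
    """Relaxed cost of one path; betas holds the weight vectors of the nodes on it."""
    num_learners = len(eval_cost)
    # Squared weight mass each weak learner receives along the whole path.
    mass = [sum(beta[t] ** 2 for beta in betas) for t in range(num_learners)]
    learner_term = sum(e * math.sqrt(m) for e, m in zip(eval_cost, mass))
    # F is assumed binary, so multiplying by F[a][t] selects the learners using feature a.
    feature_term = sum(
        c * math.sqrt(sum(f * m for f, m in zip(row, mass)))
        for c, row in zip(feature_cost, F)
    )
    return learner_term + feature_term

def expected_tree_cost(paths, reach_probability, F, eval_cost, feature_cost):
    """Path costs weighted by the marginal probability of reaching each exit."""
    return sum(p * path_cost(betas, F, eval_cost, feature_cost)
               for betas, p in zip(paths, reach_probability))

# One weak learner using one feature, shared by two classifiers on the same path.
F, eval_cost, feature_cost = [[1]], [1], [2]
shared = path_cost([[3.0], [4.0]], F, eval_cost, feature_cost)         # 15.0
separate = (path_cost([[3.0]], F, eval_cost, feature_cost)
            + path_cost([[4.0]], F, eval_cost, feature_cost))          # 21.0
total = expected_tree_cost([[[3.0], [4.0]], [[3.0]]], [0.25, 0.75],
                           F, eval_cost, feature_cost)                 # 10.5
```

Sharing a weak learner along a path (`shared`) is penalized less than using it in two unrelated classifiers (`separate`), which is the re-use incentive described above.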
4.2 Optimization Details
There are many techniques to minimize the loss in (5). We use a cyclic optimization procedure, solving for each classifier node $v^k$ one at a time, keeping all other nodes fixed. For a given classifier node $v^k$, the traversal probabilities $p_i^j$ of a descendant node $v^j$ and the probability $p^l$ of an instance reaching a terminal element also depend on $\beta^k$ and $\theta^k$ (through its recursive definition) and must be incorporated into the gradient computation.
To minimize (5) with respect to parameters $\beta^k, \theta^k$, we use the lemma below to overcome the non-differentiability of the square-root terms (and $l_1$ norms) resulting from the relaxations (3).
Lemma 1. Given any $g(x) > 0$, the following holds:
$$\sqrt{g(x)} = \min_{z > 0} \frac{1}{2} \left[ \frac{g(x)}{z} + z \right]. \tag{7}$$
The lemma can be proved as $z = \sqrt{g(x)}$ minimizes the function on the right hand side. Further, it is shown in [2] that the right hand side is jointly convex in and , so long as is convex.
For each square-root or $l_1$ term we introduce an auxiliary variable (i.e., $z$ above) and alternate between minimizing the loss in (5) with respect to $\beta^k, \theta^k$ and the auxiliary variables. The former is performed with conjugate gradient descent and the latter can be computed efficiently in closed form. This pattern of block-coordinate descent followed by a closed form minimization is repeated until convergence. Note that the loss is guaranteed to converge to a fixed point because each iteration decreases the loss function, which is bounded below by $0$.
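The closed-form update of an auxiliary variable can be checked numerically. The sketch assumes the lemma takes the standard variational form, in which the square root of a positive quantity $g$ equals the minimum over $z > 0$ of $\frac{1}{2}(g/z + z)$; the function names are illustrative.

```python
import math

def surrogate(g: float, z: float) -> float:
    """Smooth upper bound on sqrt(g) for any z > 0: 0.5 * (g / z + z)."""
    return 0.5 * (g / z + z)

def best_auxiliary(g: float) -> float:
    """Closed-form minimizer of surrogate(g, z) over z > 0."""
    return math.sqrt(g)

g = 9.0
z_star = best_auxiliary(g)
tight = surrogate(g, z_star)                                   # equals sqrt(g)
loose = [surrogate(g, z) for z in (0.5, 1.0, 2.0, 4.0, 10.0)]  # all larger than sqrt(g)
```

With the auxiliary variable held fixed, the surrogate is a smooth quadratic-over-constant function of the weights inside $g$, which is what makes a gradient-based step on the weights possible.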
Initialization. The minimization of eq. (5) is non-convex and therefore initialization dependent. However, minimizing eq. (5) with respect to the parameters of leaf classifier nodes is convex, as the loss function, after substitutions based on lemma 1, becomes jointly convex (because of the lack of descendant nodes). We therefore initialize the tree top-to-bottom, starting at $v^0$, and optimize over $v^k$ by minimizing (5) while considering all descendant nodes of $v^k$ as “cut-off” (thus pretending node $v^k$ is a leaf).
Tree pruning. To obtain a more compact model and to avoid overfitting, the CSTC tree can be pruned with the help of a validation set. As each node is a classifier, we can apply the CSTC tree on a validation set and compute the validation error at each node. We prune away nodes that, upon removal, do not decrease the performance of CSTC on the validation set (in the case of ranking data, we can even use validation NDCG as our pruning criterion).
Fine-tuning. The relaxation in (3) makes the exact cost terms differentiable and is well suited to approximate which dimensions in a vector $\beta^k$ should be assigned non-zero weights. The mixed-norm does however impact the performance of the classifiers because (different from the $l_0$ norm) larger weights in $\beta$ incur larger penalties in the loss. We therefore introduce a post-processing step to correct the classifiers from this unwanted regularization effect. We re-optimize all predictive classifiers (classifiers with terminal element children, i.e. classifiers that make final predictions), while clamping all features with zero-weight to strictly remain zero.
$$\min_{\bar{\beta}^k} \; \frac{1}{n} \sum_{i=1}^{n} p_i^k\, \ell(\phi(x_i)^\top \bar{\beta}^k, y_i) + \rho |\bar{\beta}^k| \quad \text{subject to: } \bar{\beta}_t^k = 0 \text{ if } \beta_t^k = 0. \tag{8}$$
The final CSTC tree uses these re-optimized weight vectors $\bar{\beta}^k$ for all predictive classifier nodes $v^k$.
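The clamping idea can be illustrated with a deliberately simplified refit. This sketch is not the procedure above: it assumes a squared loss, drops the regularization term and early stopping, and solves a weighted least-squares problem restricted to the non-zero weights with NumPy; `fine_tune` and its arguments are hypothetical names.

```python
import numpy as np

def fine_tune(Phi, y, beta, sample_weight=None):
    """Refit the non-zero weights by weighted least squares; zeros stay clamped at zero."""
    beta = np.asarray(beta, dtype=float)
    support = np.flatnonzero(beta)  # indices of weak learners the classifier already uses
    weight = np.ones(len(y)) if sample_weight is None else np.asarray(sample_weight, dtype=float)
    root = np.sqrt(weight)          # weighting rows by sqrt(w) weights squared errors by w
    refit, *_ = np.linalg.lstsq(Phi[:, support] * root[:, None], y * root, rcond=None)
    tuned = np.zeros(len(beta))
    tuned[support] = refit
    return tuned

rng = np.random.default_rng(0)
Phi = rng.normal(size=(50, 4))
true_beta = np.array([2.0, 0.0, -1.0, 0.0])
y = Phi @ true_beta
shrunk = np.array([1.2, 0.0, -0.4, 0.0])  # right support, magnitudes shrunk by the penalty
tuned = fine_tune(Phi, y, shrunk)
```

Because the support is unchanged, the refit does not add any weak learner or feature to the classifier, so its test-time cost stays the same while the shrinkage is undone.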
5 Results
In this section, we first evaluate CSTC on a carefully constructed synthetic data set to test our hypothesis that CSTC learns specialized classifiers that rely on different feature subsets. We then evaluate the performance of CSTC on the large scale Yahoo! Learning to Rank Challenge data set and compare it with state-of-the-art algorithms.
5.1 Synthetic data
We construct a synthetic regression dataset, sampled from the four quadrants of the plane, where . The features belong to two categories: cheap features, with cost , which can be used to identify the quadrant of an input; and four expensive features with cost , which represent the exact label of an input if it is from the corresponding region (a random number otherwise). Since in this synthetic data set we do not transform the feature space, we have $\phi(x) = x$, and $F$ (the weak learner feature-usage variable) is the identity matrix. By design, a perfect classifier can use the two cheap features to identify the sub-region of an instance and then extract the correct expensive feature to make a perfect prediction. The minimum feature cost of such a perfect classifier is exactly per instance. The labels are sampled from Gaussian distributions with quadrant-specific means and variance . Figure 2 shows the CSTC tree and the predictions of test inputs made by each node. In every path along the tree, the first two classifiers split on the two cheap features and identify the correct sub-region of the input. The final classifier extracts a single expensive feature to predict the labels. As such, the mean squared errors on the training and testing data both approach 0.
5.2 Yahoo! Learning to Rank
To evaluate the performance of CSTC on real-world tasks, we test our algorithm on the public Yahoo! Learning to Rank Challenge data set (http://learningtorankchallenge.yahoo.com) [6]. The set contains 19,944 queries and 473,134 documents. Each query-document pair consists of 519 features. An extraction cost, which takes on a value in the set , is associated with each feature. (The extraction costs were provided by a Yahoo! employee.) The unit of these values is the time required to evaluate a weak learner $h_t(\cdot)$. The label $y_i \in \{4, 3, 2, 1, 0\}$ denotes the relevancy of a document to its corresponding query, with $4$ indicating a perfect match. In contrast to Chen et al. [8], we do not inflate the number of irrelevant documents (by counting them times). We measure the performance using NDCG@5 [16], a preferred ranking metric when multiple levels of relevance are available. Unless otherwise stated, we restrict CSTC to a maximum of nodes. All results are obtained on a desktop with two 6-core Intel i7 CPUs. Minimizing the global objective requires less than 3 hours to complete, and fine-tuning the classifiers takes about 10 minutes.
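For readers unfamiliar with the metric, NDCG@5 for a single query can be sketched as below. The gain $2^{\text{rel}} - 1$ and the $\log_2$ position discount are a common convention and an assumption here; the evaluation described above may use a different variant, and the function names are illustrative.

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain of a ranked list of graded relevance labels."""
    return sum((2 ** rel - 1) / math.log2(rank + 2)
               for rank, rel in enumerate(relevances[:k]))

def ndcg_at_k(scores, labels, k=5):
    """NDCG@k for one query: rank documents by score, normalize by the ideal ordering."""
    ranked = [label for _, label in sorted(zip(scores, labels), key=lambda pair: -pair[0])]
    ideal = dcg_at_k(sorted(labels, reverse=True), k)
    return dcg_at_k(ranked, k) / ideal if ideal > 0 else 0.0

perfect = ndcg_at_k(scores=[3.0, 2.0, 1.0], labels=[2, 1, 0])   # ideal ordering
reversed_order = ndcg_at_k(scores=[1.0, 2.0, 3.0], labels=[2, 1, 0])
```

Only the ordering induced by the scores matters, so rescaling or shifting all predictions for a query leaves the metric unchanged.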
Comparison with prior work. Figure 3 shows a comparison of CSTC with several recent algorithms for test-time cost-sensitive learning. We show NDCG versus cost (in units of weak learner evaluations). The plot shows different stages in our derivation of CSTC: the initial cost-insensitive ensemble classifier [13] from section 3 (stage-wise regression), a single cost-sensitive classifier as described in eq. (4), the CSTC tree (5) and CSTC tree with fine-tuning (8). We obtain the curves by varying the accuracy/cost trade-off parameter $\lambda$ (and perform early stopping based on the validation data, for fine-tuning). For CSTC tree we evaluate six settings, . In the case of stage-wise regression, which is not cost-sensitive, the curve is simply a function of boosting iterations.
For competing algorithms, we include Early exit [5] which improves upon stage-wise regression by short-circuiting the evaluation of unpromising documents at test-time, reducing the overall test-time cost. The authors propose several criteria for rejecting inputs early and we use the best-performing method “early exits using proximity threshold”. For Cronus [8], we use a cascade with a maximum of 10 nodes. All hyper-parameters (cascade length, keep ratio, discount, early-stopping) were set based on a validation set. The cost/accuracy curve was generated by varying the corresponding trade-off parameter, .
As shown in the graph, CSTC significantly improves the cost/accuracy trade-off curve over all other algorithms. The power of Early exit is limited in this case as the test-time cost is dominated by feature extraction, rather than the evaluation cost. Compared with Cronus, CSTC has the ability to identify features that are most beneficial to different groups of inputs. It is this ability that allows CSTC to maintain the high NDCG significantly longer as the cost-budget is reduced.
Note that CSTC with fine-tuning achieves only a very small improvement over CSTC without it. Although the fine-tuning step decreases the mean squared error on the test-set, it has little effect on NDCG, which is only based on the relative ranking of the documents (as opposed to their exact predictions). Moreover, because we fine-tune prediction nodes until validation NDCG decreases, for the majority of $\lambda$ values, only a small amount of fine-tuning occurs.
Input space partition. Figure 4 (left) shows a pruned CSTC tree () for the Yahoo! data set. The number above each node indicates the average label of the testing inputs passing through that node. We can observe that different branches aim at different parts of the input domain. In general, the upper branches focus on correctly classifying higher ranked documents, while the lower branches target low-rank documents. Figure 4 (right) shows the Jaccard matrix of the predictive classifiers from the same CSTC tree. The matrix shows a clear trend that the Jaccard coefficients decrease monotonically away from the diagonal. This indicates that classifiers share fewer features in common if their average labels are further apart (the most different classifiers and have only of their features in common) and validates that classifiers in the CSTC tree extract different features in different regions of the tree.
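The Jaccard coefficient between two classifiers compares the sets of features they extract: the size of the intersection divided by the size of the union. A minimal sketch, with illustrative feature sets rather than those of the actual tree:

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard coefficient of two feature sets; two empty sets count as identical."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def jaccard_matrix(feature_sets):
    """Pairwise Jaccard coefficients between the feature sets of several classifiers."""
    return [[jaccard(a, b) for b in feature_sets] for a in feature_sets]

# Hypothetical feature sets of three predictive classifiers, ordered by average label.
feature_sets = [{1, 2, 3}, {2, 3, 4}, {4, 5}]
matrix = jaccard_matrix(feature_sets)
# matrix[0][1] == 0.5, matrix[1][2] == 0.25, matrix[0][2] == 0.0
```

In this toy example the coefficients shrink away from the diagonal, which is the pattern the text reports for the trained tree.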
Feature extraction. We also investigate the features extracted in individual classifier nodes. Figure 5 shows the fraction of features, with a particular cost, extracted at different depths of the CSTC tree for the Yahoo! data. We observe a general trend that as depth increases, more features are being used. However, cheap features () are fully extracted early-on, whereas expensive features () are extracted by classifiers sitting deeper in the tree, where each individual classifier only copes with a small subset of inputs. The expensive features are used to classify these subsets of inputs more precisely. The only feature that has cost is extracted at all depths, which seems essential to obtain high NDCG [8].
6 Conclusions
We introduce Cost-Sensitive Tree of Classifiers (CSTC), a novel learning algorithm that explicitly addresses the trade-off between accuracy and expected test-time CPU cost in a principled fashion. The CSTC tree partitions the input space into sub-regions and identifies the most cost-effective features for each one of these regions, allowing it to match the high accuracy of the state-of-the-art at a small fraction of the cost. We obtain the CSTC algorithm by formulating the expected test-time cost of an instance passing through a tree of classifiers and relaxing it into a continuous cost function. This cost function can be minimized while learning the parameters of all classifiers in the tree jointly. By making the test-time cost vs. accuracy trade-off explicit we enable high performance classifiers that fit into computational budgets and can reduce unnecessary energy consumption in large-scale industrial applications. Further, engineers can design highly specialized features for particular edge-cases of their input domain and CSTC will automatically incorporate them on-demand into its tree structure.
Acknowledgements
KQW, ZX, MK, and MC are supported by NIH grant U01 1U01NS07345701 and NSF grants 1149882 and 1137211. The authors thank John P. Cunningham for clarifying discussions and suggestions.
References
 Bengio et al. [2010] Bengio, S., Weston, J., and Grangier, D. Label embedding trees for large multiclass tasks. NIPS, 23:163–171, 2010.
 Boyd & Vandenberghe [2004] Boyd, S.P. and Vandenberghe, L. Convex optimization. Cambridge Univ Pr, 2004.
 Breiman [1984] Breiman, L. Classification and regression trees. Chapman & Hall/CRC, 1984.
 Busa-Fekete et al. [2012] Busa-Fekete, R., Benbouzid, D., Kégl, B., et al. Fast classification using sparse decision dags. In ICML, 2012.
 Cambazoglu et al. [2010] Cambazoglu, B.B., Zaragoza, H., Chapelle, O., Chen, J., Liao, C., Zheng, Z., and Degenhardt, J. Early exit optimizations for additive machine learned ranking systems. In WSDM’3, pp. 411–420, 2010.
 Chapelle & Chang [2011] Chapelle, O. and Chang, Y. Yahoo! learning to rank challenge overview. In JMLR: Workshop and Conference Proceedings, volume 14, pp. 1–24, 2011.
 Chapelle et al. [2011] Chapelle, O., Shivaswamy, P., Vadrevu, S., Weinberger, K., Zhang, Y., and Tseng, B. Boosted multitask learning. Machine learning, 85(1):149–173, 2011.
 Chen et al. [2012] Chen, M., Xu, Z., Weinberger, K. Q., and Chapelle, O. Classifier cascade for minimizing feature evaluation cost. In AISTATS, 2012.
 Deng et al. [2011] Deng, J., Satheesh, S., Berg, A.C., and Fei-Fei, L. Fast and balanced: Efficient label tree learning for large scale object recognition. In NIPS, 2011.
 Dredze et al. [2007] Dredze, M., Gevaryahu, R., and Elias-Bachrach, A. Learning fast classifiers for image spam. In Proceedings of the Conference on Email and Anti-Spam (CEAS), 2007.
 Efron et al. [2004] Efron, B., Hastie, T., Johnstone, I., and Tibshirani, R. Least angle regression. The Annals of Statistics, 32(2):407–499, 2004.
 Fleck et al. [1996] Fleck, M., Forsyth, D., and Bregler, C. Finding naked people. ECCV, pp. 593–602, 1996.
 Friedman [2001] Friedman, J.H. Greedy function approximation: a gradient boosting machine. The Annals of Statistics, pp. 1189–1232, 2001.
 Gao & Koller [2011] Gao, T. and Koller, D. Active classification based on value of classifier. In NIPS, pp. 1062–1070. 2011.
 Grubb & Bagnell [2012] Grubb, A. and Bagnell, J. A. Speedboost: Anytime prediction with uniform near-optimality. In AISTATS, 2012.
 Järvelin & Kekäläinen [2002] Järvelin, K. and Kekäläinen, J. Cumulated gain-based evaluation of IR techniques. ACM TOIS, 20(4):422–446, 2002.
 Jordan & Jacobs [1994] Jordan, M.I. and Jacobs, R.A. Hierarchical mixtures of experts and the EM algorithm. Neural computation, 6(2):181–214, 1994.
 Karayev et al. [2012] Karayev, S., Baumgartner, T., Fritz, M., and Darrell, T. Timely object recognition. In Advances in Neural Information Processing Systems 25, pp. 899–907, 2012.
 Kowalski [2009] Kowalski, M. Sparse regression using mixed norms. Applied and Computational Harmonic Analysis, 27(3):303–324, 2009.
 Lefakis & Fleuret [2010] Lefakis, L. and Fleuret, F. Joint cascade optimization using a product of boosted classifiers. In NIPS, pp. 1315–1323. 2010.
 Pujara et al. [2011] Pujara, J., Daumé III, H., and Getoor, L. Using classifier cascades for scalable email classification. In CEAS, 2011.
 Saberian & Vasconcelos [2010] Saberian, M. and Vasconcelos, N. Boosting classifier cascades. In Lafferty, J., Williams, C. K. I., Shawe-Taylor, J., Zemel, R.S., and Culotta, A. (eds.), NIPS, pp. 2047–2055. 2010.
 Schölkopf & Smola [2001] Schölkopf, B. and Smola, A.J. Learning with kernels: Support vector machines, regularization, optimization, and beyond. MIT press, 2001.
 Viola & Jones [2004] Viola, P. and Jones, M.J. Robust real-time face detection. IJCV, 57(2):137–154, 2004.
 Wang & Saligrama [2012] Wang, J. and Saligrama, V. Local supervised learning through space partitioning. In Advances in Neural Information Processing Systems 25, pp. 91–99, 2012.
 Weinberger et al. [2009] Weinberger, K.Q., Dasgupta, A., Langford, J., Smola, A., and Attenberg, J. Feature hashing for large scale multitask learning. In ICML, pp. 1113–1120, 2009.
 Xu et al. [2012] Xu, Z., Weinberger, K., and Chapelle, O. The greedy miser: Learning under test-time budgets. In ICML, pp. 1175–1182, 2012.
 Zheng et al. [2008] Zheng, Z., Zha, H., Zhang, T., Chapelle, O., Chen, K., and Sun, G. A general boosting method and its application to learning ranking functions for web search. In NIPS, pp. 1697–1704. Cambridge, MA, 2008.