1 Introduction
Structured prediction is a powerful and flexible framework for making a joint prediction over mutually dependent output variables. It has been successfully applied to a wide range of computer vision and natural language processing tasks ranging from text classification to human detection. However, the superior performance and flexibility of structured predictors come at the cost of computational complexity. In order to construct computationally efficient algorithms, a tradeoff must be made between the expressiveness and speed of structured models.
The cost of inference in structured prediction can be broken down into three parts: acquiring the features, evaluating the part responses, and solving a combinatorial optimization problem to make a prediction based on part responses. Past research has focused on evaluating part responses and solving the combinatorial optimization problem, and proposed efficient inference algorithms for specific structures (e.g., Viterbi and CKY parsing algorithms) and general structures (e.g., variational inference jordan1999introduction ). However, these methods overlook feature acquisition and part response, which are bottlenecks when the underlying structure is relatively simple or is efficiently solved.
Consider the dependency parsing task, where the goal is to create a directed tree that describes semantic relations between words in a sentence. The task can be formulated as a structured prediction problem, where inference amounts to finding a maximum spanning tree (MST) in a directed graph mcdonald2005non . Each node in the graph represents a word, and each directed edge is scored by how likely one word is to depend on the other. Fig. 1 shows an example of a dependency parse and the tradeoff between a rich set of features and the prediction time. Introducing complex features has the potential to increase system performance; however, they only distinguish among a small subset of “difficult” parts. Therefore, computing complex features for all parts on every example is both computationally costly and unnecessary to achieve high levels of performance.
We address the problem of structured prediction under test-time budget constraints, where the goal is to learn a system that is computationally efficient at test time with little loss of predictive performance. We consider test-time costs associated with the computational cost of evaluating feature transforms or acquiring new sensor measurements. Intuitively, our goal is to learn a system that identifies the parts in each example that are incorrectly classified or localized using “cheap” features and that additionally yield large reductions in error over the entire structure when given “expensive” features, owing to improved distinguishability and relationships to other parts in the example.
We consider two forms of the budgeted structured learning problem: prediction under expected budget constraints and anytime prediction. For both cases, we consider the streaming test-time scenario where the system operates on each test example without observing or interacting with other test examples. In the expected budget constraint setting, the system chooses features to acquire for each example to minimize prediction error subject to an average feature cost constraint. A fixed budget is given by the user at training time, and at test time the system is tasked with both allocating resources and determining the features to be acquired with the allocated resources. In the anytime structured prediction setting, no budget is specified by the user at training time. Instead, the system chooses features to be acquired sequentially for each example to minimize prediction error at each time step, allowing for accurate prediction at any time. This setting requires a single system able to predict for any budget constraint on each example.
We learn systems capable of adaptive feature acquisition for both settings. We propose learning policy functions to exploit relationships between parts and adapt to examples of varying length. This formulation naturally reduces policy learning to a structured learning problem, allowing the original model to be used with minor modification. The resulting systems reduce prediction cost at test time with minimal loss of predictive performance.
We summarize our contributions as follows:
• Formulation of structured prediction under expected budget constraints and anytime prediction.
• Reduction of both these settings to conventional structured prediction problems.
• Demonstration that structured models benefit from having access to features of multiple complexities and can perform well when only a subset of parts use the expensive features.
2 Budgeted Structured Learning
We begin by reviewing the structured prediction problem and formulating it under an expected budget constraint. We then extend the formulation to anytime structured prediction.
Structured Prediction: The goal in structured prediction is to learn a function, , that maps from an input space, , to a structure space, . In contrast to multiclass classification, the space of outputs is not simply categorical but instead is assumed to be an exponentially large space of outputs (often of varying size dependent on the feature space) containing some underlying structure, generally represented by multiple parts and relationships between parts. For example, in dependency parsing, are features representing a sentence (e.g., words, POS tags), and is a parse tree.
In a structured prediction model, the mapping function is often modeled as , where is a scoring function. We assume the score can be broken up into subscores across components , , where is the output assignment associated with the component . The number of subcomponents, , varies across examples. For the dependency parsing example, each is an edge in the directed graph, and is an indicator variable for whether the edge is in the parse tree. The score of a parse tree is the sum of the scores of all its edges.
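To make the decomposition concrete, the following is a minimal sketch (not the authors' implementation) of prediction with a score that sums over parts. The toy chain structure and the names `unary`, `pairwise`, `score`, and `predict` are assumptions introduced for illustration, and brute-force enumeration stands in for an efficient decoder such as Viterbi.

```python
from itertools import product

def score(labels, unary, pairwise):
    """Total score = sum of per-position subscores + sum of adjacent-pair subscores."""
    total = sum(unary[i][y] for i, y in enumerate(labels))
    total += sum(pairwise[a][b] for a, b in zip(labels, labels[1:]))
    return total

def predict(unary, pairwise):
    """Brute-force argmax over all labelings (exponential; illustration only)."""
    n_labels = len(pairwise)
    candidates = product(range(n_labels), repeat=len(unary))
    return max(candidates, key=lambda labels: score(labels, unary, pairwise))

# Hypothetical subscores for a 3-position chain with 2 labels per position.
unary = [[2.0, 0.0], [0.4, 0.5], [0.0, 1.0]]
pairwise = [[1.0, 0.0], [0.0, 1.0]]
best = predict(unary, pairwise)
```

The number of candidate labelings grows exponentially with the number of positions, which is why practical systems rely on structure-specific decoders rather than enumeration.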
2.1 Structured Prediction Under an Expected Budget
Our goal is to reduce the cost of prediction at test time (representing computational time, energy consumption, etc.). We consider the case where a variety of scoring functions are available to be used for each component. Additionally, associated with each scoring function is an evaluation cost (such as the time or energy consumption required to extract the features for the scoring function).
For each example, we define a state , where the space of states is defined , representing which of the features is used for the components during prediction. In the state, the element indicates that the feature will be used during prediction for component . For any state , we define the evaluation cost: where is the (known) cost of evaluating the feature for a single part.
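As an illustration of the state and its evaluation cost, here is a small sketch under assumed conventions: a state is a list with one entry per component, each entry is the index of the feature used for that component, and `feature_costs` holds the known per-part cost of each feature. The names and the cost values are hypothetical.

```python
def state_cost(state, feature_costs):
    """Sum the known per-part cost of the feature chosen for each component."""
    return sum(feature_costs[k] for k in state)

# Hypothetical costs in arbitrary units: feature 0 is cheap, feature 1 is expensive.
feature_costs = [1.0, 4.0]
state = [0, 1, 1, 0, 0]  # components 1 and 2 use the expensive feature
total = state_cost(state, feature_costs)
```

Because the cost is additive over components, changing the feature of one component changes the total by the difference of two per-part costs.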
We assume that we are given a structured prediction model that maps from a set of features and a state to a structured label prediction . For a predicted label, we have a loss that maps from a predicted and true structured label, and , respectively, to an error cost, generally either an indicator error, , or a Hamming error, . For an example and state , we now define the modified loss that represents the error induced by predicting a label from using the sensors in combined with the cost of acquiring the sensors in , where is a tradeoff parameter adjusted according to the budget. (Footnote 1: Our framework does not restrict the type of modified loss, , or the state cost, and extends to general losses.) A small value of encourages correct classification at the expense of feature cost, whereas a large value of penalizes use of costly features, enforcing a tighter budget constraint.
We define a policy that maps from the feature space and the initial state to a new state. For ease of reference, we refer to this policy as the feature selection policy. Our goal is to learn a policy chosen from a family of functions that, for a given example , maps to a state with minimal expected modified loss. In practice, we minimize the empirical average over a set of i.i.d. training examples:
(1)
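The modified loss described above can be sketched as a prediction error plus a weighted acquisition cost. This is an assumed additive form using the Hamming error; the names `hamming_loss`, `modified_loss`, and `lam` (the tradeoff parameter) are illustrative and not taken from the paper.

```python
def hamming_loss(pred, truth):
    """Number of parts whose predicted label differs from the true label."""
    return sum(p != t for p, t in zip(pred, truth))

def modified_loss(pred, truth, state, feature_costs, lam):
    """Prediction error plus lam times the cost of the features used in `state`."""
    cost = sum(feature_costs[k] for k in state)
    return hamming_loss(pred, truth) + lam * cost

# Hypothetical example: one wrong part, one expensive feature acquired.
pred, truth = [1, 0, 2], [1, 1, 2]
state = [0, 1, 0]
value = modified_loss(pred, truth, state, feature_costs=[0.0, 2.0], lam=0.25)
```

With `lam` close to zero the error term dominates, so acquiring costly features is nearly free; a larger `lam` makes each acquisition more expensive relative to an error.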
Note that the objective of the minimization can be expanded with respect to the space of states, allowing the optimization problem in (1) to be expressed as a sum over states. From this, we can reformulate the problem of learning a policy as a structured learning problem.
Theorem 2.1.
The minimization in (1) is equivalent to the structured learning problem:
Proofs can be found in Suppl. Material. Theorem 2.1 maps the policy learning problem in (1) to a weighted structured learning problem. For each example , an example/label pair is created for each state with an importance weight representing the savings lost by not choosing the state .
Unfortunately, the expansion of the cost function over the space of states introduces the summation over the combinatorial space of states. To avoid this, we instead introduce an approximation to the objective in (1). Using a single indicator function, we formulate the approximate policy
(2) 
where the pseudolabel is defined:  (3) 
and the example weight is defined as .
This formulation reduces the objective from a summation over the combinatorial set of states to a single indicator function for each example and represents an upper bound on the original risk.
Note that the second term in (2) is not dependent on . Thus, Theorem 2.2 leads to an efficient algorithm for learning a policy function by solving an importance-weighted structured learning problem:
(4) 
where each example has a pseudolabel and importance weight .
Combinatorial Search Space: Finding the pseudolabel in Eqn. (3) involves searching over the combinatorially large space of states, , which is computationally intractable. Instead, we present trajectory-based and parsimonious pseudolabels for approximating .
Trajectory Search: The trajectory-based pseudolabel is a greedy approximation to the optimization in Eqn. (3). To this end, define as the 1-stage feasible transitions: where is the Hamming distance. We define a trajectory of states where . The initial state is assumed to be where none of the features are evaluated for the components. For each example , we obtain a trajectory , where the terminal state is the all-one state. We choose the pseudolabel from the trajectory: Note that by restricting the search to states differing by a single component, the approximation only needs to perform a polynomial search over states, as opposed to the exhaustive combinatorial search in Eqn. (3). Observe that the modified loss is not strictly decreasing along the trajectory, as the cost of adding features may outweigh the reduction in loss at any time. Empirically, this approach is computationally tractable and produces strong results.
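The greedy trajectory can be sketched as follows. This is an illustrative reading of the description above, assuming two feature tiers (0 = cheap, 1 = expensive) and a caller-supplied `modified_loss(state)` callable; the function names and the toy loss are hypothetical.

```python
def greedy_trajectory(n_parts, modified_loss):
    """Start from the all-zero state and upgrade one component per step,
    always taking the single-component transition with the lowest modified loss."""
    state = tuple([0] * n_parts)
    trajectory = [state]
    while 0 in state:
        candidates = [state[:j] + (1,) + state[j + 1:]
                      for j in range(n_parts) if state[j] == 0]
        state = min(candidates, key=modified_loss)
        trajectory.append(state)
    return trajectory

def trajectory_pseudolabel(n_parts, modified_loss):
    """Pick the state on the greedy trajectory with the lowest modified loss."""
    return min(greedy_trajectory(n_parts, modified_loss), key=modified_loss)

# Toy setting: parts 1 and 3 are wrong under cheap features and each is fixed
# by its own expensive feature; every expensive feature costs lam.
hard_parts = {1, 3}
lam = 0.3

def toy_modified_loss(state):
    errors = sum(1 for j in hard_parts if state[j] == 0)
    return errors + lam * sum(state)

trajectory = greedy_trajectory(4, toy_modified_loss)
label = trajectory_pseudolabel(4, toy_modified_loss)
```

The search evaluates at most on the order of n² states for n components, and in the toy case the modified loss first falls and then rises once only unhelpful features remain, which is why the pseudolabel is taken from the middle of the trajectory rather than its end.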
Parsimonious Search: Rather than a trajectory search, which requires an inference update each time we acquire more features, we consider an alternative one-stage update here. The idea is to look for 1-step transitions that can potentially improve the cost. We then simultaneously update all the features that produce an improvement. This obviates the need for a trajectory search. In addition, we can incorporate a guaranteed loss improvement into our parsimonious search. Note that the potential candidate transitions can be non-unique, and thus we generate a collection of potential state transitions, . To obtain the final state we take the union over these transitions, namely, Suppose we set the margin , and replace the cost function with the loss function. Then this optimization is relatively simple (assuming that acquiring more features does not increase the loss), because the new state is simply the collection of transitions where the subcomponents are incorrect. Finding the parsimonious pseudolabel is computationally efficient and empirically shows similar performance to the trajectory-based pseudolabel.
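In the simplified case just described, where the new state collects the components that are incorrect, the parsimonious pseudolabel can be sketched in one line. The sketch assumes two feature tiers and that a prediction made with only cheap features is available; the function name and the toy strings are hypothetical.

```python
def parsimonious_pseudolabel(cheap_prediction, truth):
    """Mark a component for the expensive feature (1) exactly when the
    cheap-feature prediction gets that component wrong."""
    return tuple(int(p != t) for p, t in zip(cheap_prediction, truth))

# Toy OCR-style example: letters 1 and 4 are misread with cheap features.
label = parsimonious_pseudolabel("bvdgct", "budget")
```

Unlike the trajectory search, this needs a single inference pass per training example.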
Choosing the pseudolabel requires knowledge of the budget to set the cost tradeoff parameter . If the budget is unspecified or varies over time, a system capable of adapting to changing budget demands is necessary. To handle this scenario, we propose an anytime system in the next section.
2.2 Anytime Structured Prediction
In many applications, the budget constraint is unknown a priori or varies from example to example due to changing resource availability, and an expected budget system as in Section 2.1 does not yield a feasible system. We instead consider the problem of learning an anytime system grubb2012speedboost . In this setting, a single system is designed such that when a new example arrives at test time, features are acquired until an arbitrary budget constraint (which may vary over different examples) is met for the particular example. Note that an anytime system is a special case of the expected budget constrained system: instead of an expected budget, a hard per-example budget is given. A single system is applied to all feasible budgets, as opposed to learning unique systems for each budget constraint.
We model the anytime learning problem as sequential state selection. The goal is to select a trajectory of states, starting from an initial state where all components use features with negligible cost. To select this trajectory of states, we define policy functions , where is a function that maps from a set of structured features and current state to a new state .
The sequential selection system is then defined by the policy functions . For an example , the policy functions produce a trajectory of states defined as follows:
Our goal is to learn a system with small expected loss at any time . Formally, we define this as the average modified loss of the system over the trajectory of states:
(5) 
where is a user-specified family of functions. Unfortunately, the problem of learning the policy functions is highly coupled due to the dependence of the state trajectory on all policy functions. Note that as there is no fixed budget, the choice of dictates the behavior of the anytime system. Decreasing leads to larger increases in classification performance at the expense of budget granularity.
We propose a greedy approximation to the policy learning problem by sequentially learning policy functions that minimize the modified loss:
(6) 
for . Note that the selected in (6) does not take into account the future effect on the loss in (5). We consider in (6) to be a greedy approximation as it is instead chosen to minimize the immediate loss at time .
We restrict the output space of states for the policy to have the same nonzero components as the previous state with a single feature added. This space of states can be defined where is the Hamming distance. Note that this mirrors the trajectory used for the trajectory-based pseudolabel.
As in Section 2.1, we take an empirical risk minimization approach to learning policies. To this end, we sequentially learn a set of functions minimizing the risk:
(7) 
enumerating over the space of states that the policy can map each example. Note that the space of states may be empty if all features are acquired for example by step .
As in Thm. 2.1, the problem of learning the sequence of policy functions can be viewed as a weighted structured learning problem.
Theorem 2.3.
The optimization problem in (7) is equivalent to solving an importance weighted structured learning problem using an indicator risk of the form:
(8) 
where the weight is defined: 
This is equivalent to an importance weighted structured learning problem, where each state in defines a pseudolabel for the example with an associated importance .
Theorem 2.3 reduces the problem of learning a policy to an importance-weighted structured learning problem. Replacing the indicators with upper-bounding convex surrogate functions results in a convex minimization problem to learn the policies . In particular, use of a hinge-loss surrogate converts this problem to the commonly used structural SVM. Experimental results show significant cost savings by applying this sequential policy.
The training algorithm is presented in Algorithm 1. At time , the policy is trained to minimize the immediate loss. Given this policy, the states of examples at time are fixed, and is trained to minimize the immediate loss given these states. The algorithm continues learning policies until every feature for every example has been acquired. At test time, the system sequentially applies the trained policy functions until the specified budget is reached, as shown in Algorithm 2.
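The listing for Algorithm 2 is not reproduced in this text, so the following is only a sketch of the test-time loop as described in prose: apply the learned policies in order, each adding one feature, and stop before the budget would be exceeded. The signature of `anytime_predict`, the toy policy, and the toy predictor are assumptions for illustration.

```python
def anytime_predict(x, policies, predictor, cost, budget):
    """Apply policies sequentially; stop before the budget would be exceeded."""
    state = tuple(0 for _ in x)  # initial state: only negligible-cost features
    for policy in policies:
        proposal = policy(x, state)
        if cost(proposal) > budget:
            break  # acquiring the next feature would exceed the budget
        state = proposal
    return predictor(x, state), state

def upgrade_first_cheap_part(x, state):
    """Toy policy: add the expensive feature to the first part still on tier 0."""
    for j, tier in enumerate(state):
        if tier == 0:
            return state[:j] + (1,) + state[j + 1:]
    return state

def toy_predictor(x, state):
    """Toy predictor: parts with the expensive feature are rendered in upper case."""
    return "".join(ch.upper() if tier else ch for ch, tier in zip(x, state))

prediction, final_state = anytime_predict(
    "abcd", [upgrade_first_cheap_part] * 4, toy_predictor, cost=sum, budget=2)
```

Because the same sequence of policies is used for every budget, stopping the loop earlier or later yields a prediction for any intermediate budget without retraining.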
3 Related Work
Multiclass prediction with a test-time budget has received significant attention (see e.g., viola2001robust ; chen2012classifier ; busa2012fast ; karayev2013dynamic ; xu2013cost ; trapeznikov2013supervised ; kusner2014feature ; wang2014model ; wang2014lp ). Fundamentally, approaches based on multiclass classification cannot be directly applied to structured settings for two reasons: (1) Structured Feature Selection Policy: Unlike multiclass prediction, in a structured setting, we have many parts with associated features and costs for each part. This often requires a coupled, adaptive, part-by-part feature selection policy applied to varying structures; (2) Structured Inference Costs: In contrast to multiclass prediction, structured prediction requires solving a constrained optimization problem at test time, which is often computationally expensive and must be taken into account.
Strubell et al. strubell2015learning improve the speed of a parser that operates on search-based structured prediction models, where joint prediction is decomposed into a sequence of decisions. In such a case, resource-constrained multiclass approaches can be applied; however, this reduction only applies to search-based models, which are fundamentally different from the graph-based models we discuss (with different types of theoretical guarantees and use cases). Applying their policy to the case of graphical models requires repeated inferences, dramatically increasing the computational cost when inference is slow. (Footnote 2: The equivalent policy of strubell2015learning applied to our inference algorithm is marked as the myopic policy in our experiments. Due to the high cost of repeated inference, the resulting policy is computationally intensive.)
Similar observations apply to Weiss et al. weiss2013dynamic ; weiss2013learning , who present a scheme for adaptive feature selection assuming the computational costs of policy execution and inference are negligible. Their approach uses a reinforcement learning scheme, requiring inference at each step of their policy to estimate rewards. For complex inference tasks, repeatedly executing the policy (performing inference) can negate any computational gains induced by adaptive feature selection (see Fig. 3 in weiss2013learning ).
He et al. he2013dynamic use imitation learning to adaptively select features for dependency parsing. Their approach can be viewed as an approximation of Eqn. (4) with a parsimonious search. Although their policy avoids performing inference to estimate reward, multiple inferences are required for each instance due to the design of the action space. Overhead is avoided by exploiting the specific inference structure (maximum spanning tree over a fully connected graph), and it is unclear whether it can be generalized.
Methods to increase the speed of inference (predicting given the part responses) have been proposed weiss2010structured ; shi2015learning . These approaches can be incorporated into our approach to further reduce computational cost and are therefore complementary. More focused research has improved the speed of individual algorithms such as object detection using deformable parts models felzenszwalb2010object ; zhu2014active and dependency parsing he2013dynamic ; strubell2015learning . These methods are specialized, failing to generalize to varying graph sizes and/or structures and relying on problem-specific heuristics or algorithm-specific properties.
Adaptive feature approaches have been designed to improve accuracy, including easy-first decoding strategies goldberg2010efficient ; stoyanov2012easy ; however, these methods focus on performance as opposed to computational cost.
4 Experiments
In this section, we demonstrate the effectiveness of the proposed algorithm on two structured prediction tasks in different domains: dependency parsing and OCR. We report results for both anytime and expected-case policies and refer to the latter as the one-shot policy. Our focus is mainly on the policy and not on achieving state-of-the-art performance in either of these domains.
At a high level, policies for resource-constrained structured prediction must manage and trade off three resources, namely, feature acquisition costs, intermediate inference costs, and the overhead costs of the policy that decides between feature acquisition and inference. Some of the methods described earlier account for feature costs but not for inference and overhead costs. Other methods incorporate inference into their policy (meta-features) for selecting new features but do not account for the resulting policy overhead. Our approach poses policy optimization as a structured learning problem and in turn jointly optimizes these resources, as demonstrated empirically in our experiments.
We compare our system to the Q-learning approach in weiss2013learning and two baselines: a uniform policy and a myopic policy. The uniform policy takes random part-level actions. It helps us show that the performance of our policy does not come from removing redundant features, but from clever allocation of them among samples. As a second baseline, we adapt the myopic policy used by trapeznikov2013supervised to the structured prediction case. The myopic policy first runs the structured predictor on all cheap features, then looks at the total confidence of the classifier normalized by the sample size (e.g., sentence length). If the confidence is below a threshold, it chooses to acquire expensive features for all positions. Finally, we compare against the Q-learning method proposed by weiss2013learning . This method requires global features for structures with varying size. From now on, we refer to features that require access to more than one part as complex features and to part-level features as simple features. In their case, they use confidence feedback from the structured predictor, which induces additional inference overhead for the policy. In addition, it is not straightforward to apply this approach to part-by-part feature selection on structures with varying sizes.
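The two baselines can be sketched as follows. Both functions are illustrative assumptions: the text does not specify how the uniform policy samples its random part-level actions, so an independent coin flip per part is assumed here, and the confidence value and threshold passed to the myopic policy are placeholders.

```python
import random

def uniform_policy(n_parts, fraction, rng):
    """Assumed form: choose the expensive tier (1) for each part independently
    with probability `fraction`."""
    return [1 if rng.random() < fraction else 0 for _ in range(n_parts)]

def myopic_policy(total_confidence, n_parts, threshold):
    """All-or-nothing rule: if the length-normalized confidence of the
    cheap-feature prediction is below the threshold, use expensive
    features for every position; otherwise keep cheap features everywhere."""
    normalized = total_confidence / n_parts
    tier = 1 if normalized < threshold else 0
    return [tier] * n_parts

low = myopic_policy(total_confidence=3.0, n_parts=10, threshold=0.5)
high = myopic_policy(total_confidence=8.0, n_parts=10, threshold=0.5)
sampled = uniform_policy(n_parts=10, fraction=0.3, rng=random.Random(0))
```

The myopic rule makes one global decision per example, so a low-confidence example pays for a second full inference pass in addition to all of the expensive features.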
We adopt the Structured SVM tsochantaridis2005large to solve the policy learning problems for the expected and anytime cases defined in (4) and (8), respectively. For the structure of the policy, we use a graph with no edges due to its simplicity. In this form, the policy learning problem can be written as a sample-weighted SVM. We discuss the details in the appendix due to space constraints.
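To illustrate what a sample-weighted SVM looks like when the policy has no edges, the sketch below trains a binary linear classifier (cheap vs. expensive feature for a part) by subgradient descent on an importance-weighted hinge loss. It is a generic stand-in written for this text, not the authors' solver; the function name, hyperparameters, and toy data are all assumptions.

```python
def train_weighted_svm(X, y, weights, epochs=200, lr=0.1, reg=0.01):
    """Subgradient descent on
    sum_i w_i * max(0, 1 - y_i * <theta, x_i>) + (reg / 2) * ||theta||^2."""
    dim = len(X[0])
    theta = [0.0] * dim
    for _ in range(epochs):
        grad = [reg * t for t in theta]
        for x, label, w in zip(X, y, weights):
            margin = label * sum(t * v for t, v in zip(theta, x))
            if margin < 1.0:  # hinge is active: this example contributes
                for d in range(dim):
                    grad[d] -= w * label * x[d]
        theta = [t - lr * g for t, g in zip(theta, grad)]
    return theta

def decide(theta, x):
    """+1 means acquire the expensive feature for this part, -1 means keep cheap."""
    return 1 if sum(t * v for t, v in zip(theta, x)) >= 0 else -1

# Toy part-level data: one informative feature plus a constant bias term.
X = [[1.0, 1.0], [2.0, 1.0], [-1.0, 1.0], [-2.0, 1.0]]
y = [1, 1, -1, -1]                # pseudolabels
weights = [1.0, 0.5, 1.0, 2.0]    # importance weights
theta = train_weighted_svm(X, y, weights)
decisions = [decide(theta, x) for x in X]
```

Examples with larger importance weights pull the decision boundary harder, which matches the role of the weights as the penalty for disagreeing with the pseudolabel.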
We show in the following that complex features indeed benefit the policy, but simple features perform better for cases where the inference time and feature costs are comparable and the additional overhead is unwanted. Finally, we show that part by part selection outperforms global selection.
Optical Character Recognition We tested our algorithm on a sequence-labeling problem, the OCR dataset taskar2003max , composed of 6,877 handwritten words, where each word is represented as a sequence of 16x8 binary letter images. We use a linear-chain Markov model and, similar to the setup in weiss2013dynamic ; wang2014model , use raw pixel values and HOG features with 3x3 cell size as our feature templates. We split the data such that 90% is used for training and 10% for testing.
Fig. 3 shows the average letter accuracy vs. total running time. The proposed system reduces the budget for all performance levels, with savings of up to 50 percent at the top performance. Note that Weiss13 cannot operate at the part-by-part level when the graph structure varies. We see that complex part-by-part selection has a significant advantage over using uniform feature templates. Furthermore, Fig. 3 shows the behavior of the policy on an individual example for the anytime model: significant gains in accuracy are made in the first several steps by correctly identifying the noisy letters.
Dependency Parsing We follow the setting in he2013dynamic and conduct experiments on the English Penn Treebank (PTB) corpus marcus1993building . All algorithms are implemented on top of the graph-based dependency parser mcdonald2005non in the IllinoisSL library chang2015illinoissl , where the code is optimized for speed. Two sets of feature templates are considered for the parser. (Footnote 3: Complex features often contribute only a small performance improvement. Adding complex or redundant features can easily yield arbitrarily large speedups, and comparing speedups of different systems with different accuracy levels is not meaningful (see Fig. 3 in he2013dynamic ). In addition, a greedy-style parser such as strubell2015learning might be faster by nature. Discussing different architectures and features is outside the scope of this paper.) The first ( ) considers the part-of-speech (POS) tags and lexicons of , , and their surrounding words (see mcdonald2005non ). The other ( ) only considers the POS features. The policy assigns one of these two feature templates to each word in the sentence, such that all the directed edges corresponding to the word share the same feature templates. The first feature template, , takes 165 µs per word and the second feature template, , takes 275 µs per word to extract the features and compute edge scores. Decoding with the Chu–Liu–Edmonds algorithm takes 75 µs per word, supporting our hypothesis that feature extraction makes up a significant portion of the total running time, yet the inference time is not negligible. Due to the space limit, we present further details of the experiment setting in the appendix.
Fig. 4 shows the test performance (unlabeled attachment accuracy) along with inference time. We see that all one-shot policies perform similarly, losing negligible accuracy when using half of the available expensive features. When we apply the length dictionary filtering heuristic in he2013dynamic ; rush12Vine , our parser achieves 89.7% UAS on PTB section 23 with an overall running time of merely 7.5 seconds (I/O excluded, 10 s with I/O) and obtains a 2.9X total speedup while losing only 1% UAS compared to the baseline. (Footnote 4: This heuristic only works for parsing. Therefore, we exclude it when presenting Figure 4, as it does not reflect the performance of policies in general.) This significant speedup over an efficient implementation is remarkable. (Footnote 5: In contrast, the baseline system in he2013dynamic is slower than ours by about three times. When operating at an accuracy level of 90%, Figure 3 in he2013dynamic shows that their final system takes about 20 s. We acknowledge that he2013dynamic use different features, policy settings, and hardware from ours; therefore these numbers might not be comparable.)
Although the difference is marginal, the one-shot policy with greedy trajectory has the strongest performance in low-budget regions. This is because the greedy trajectory search has better granularity than the parsimonious search in choosing positions that decrease the loss early on. The anytime policy is below the one-shot policy for all budget levels. As discussed in Section 2.2, the anytime policy is more constrained in that it has to achieve a fixed budget for all examples. The naive myopic policy performs worse than uniform since it has to run inference twice on samples with low confidence, adding approximately 4.5 seconds of extra time for the full test dataset. We then explore the effect of importance weights for the greedy policy and notice a small improvement. We hypothesize that this is due to the functional complexity of the policy being a limiting factor.
We also conduct ablation studies to better understand the policy behavior. Fig. 4 shows the distribution of depth in the ground-truth dependency tree for the words that use expensive and cheap features. We expect investing more time in the low-depth words (the root in the extreme) to yield higher accuracy gains. We observe this phenomenon empirically, as the policy concentrates on extracting features close to the root.
Appendix A Proofs
Proof of Theorem 2.1
The objective in (1) can be expressed:
Note that , allowing for further simplification:
Removing constant terms (that do not affect the output of the argmin) yields the expression in Thm. 2.1.
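Since the displayed equations of this proof did not survive in this text, the following is a hedged reconstruction of the argument under assumed notation: $C(x_i, y_i, s)$ is the modified loss, $\pi$ the policy (with the initial state suppressed), $\mathcal{S}$ the space of states, and $C_i^{\max} = \max_{s' \in \mathcal{S}} C(x_i, y_i, s')$. The exact form of the weights in the original theorem may differ.

```latex
\begin{align*}
\sum_{i=1}^{n} C\bigl(x_i, y_i, \pi(x_i)\bigr)
  &= \sum_{i=1}^{n} \sum_{s \in \mathcal{S}} C(x_i, y_i, s)\,
     \mathbb{1}_{\pi(x_i) = s} \\
  &= \sum_{i=1}^{n} \Bigl[ C_i^{\max}
     - \sum_{s \in \mathcal{S}} \bigl(C_i^{\max} - C(x_i, y_i, s)\bigr)\,
       \mathbb{1}_{\pi(x_i) = s} \Bigr]
     && \text{since } \textstyle\sum_{s} \mathbb{1}_{\pi(x_i) = s} = 1 \\
  &= \sum_{i=1}^{n} \Bigl[ C_i^{\max}
     - \sum_{s \in \mathcal{S}} \bigl(C_i^{\max} - C(x_i, y_i, s)\bigr)
     + \sum_{s \in \mathcal{S}} \bigl(C_i^{\max} - C(x_i, y_i, s)\bigr)\,
       \mathbb{1}_{\pi(x_i) \neq s} \Bigr]
     && \text{since } \mathbb{1}_{\pi(x_i) = s} = 1 - \mathbb{1}_{\pi(x_i) \neq s}
\end{align*}
% The first two terms in the bracket do not depend on \pi, hence
\[
\arg\min_{\pi} \sum_{i=1}^{n} C\bigl(x_i, y_i, \pi(x_i)\bigr)
  = \arg\min_{\pi} \sum_{i=1}^{n} \sum_{s \in \mathcal{S}}
    \bigl(C_i^{\max} - C(x_i, y_i, s)\bigr)\, \mathbb{1}_{\pi(x_i) \neq s}.
\]
```

Under this reading, each weight $C_i^{\max} - C(x_i, y_i, s)$ is nonnegative and measures the savings lost by not choosing state $s$, which agrees with the description of the importance weights in Section 2.1.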
Proof of Theorem 2.2
For a single example/label pair , consider the two possible values of the term in the summation of (4). In the event that :
which is equivalent to the value of (2) if . Otherwise, , and therefore:
This is an upper bound on (1), and therefore (2) is a valid upper bound on (1).
Proof of Theorem 2.3
Note that the objective in (7) can be expressed:
Note that , allowing for further simplification:
Removing constant terms (that do not affect the output of the argmin) yields the expression in (8).
Appendix B Implementation details
Dependency parsing
We split the PTB corpus into two parts, Sections 02-22 and Section 23, as training and test sets. We then use a modified cross-validation mechanism to train the feature selector and the dependency parser. Note that the cost of the policy is dependent on the structured predictor. Therefore, learning the policy on the same training set as the predictor may cause the structured loss to be overly optimistic. We use a cross-validation scheme to deal with this issue by splitting the training data into folds. For each fold, we generate label predictions based on the structured predictor trained on the remaining folds. Finally, we gather these label predictions and train a policy on the complete data.
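The fold-based scheme can be sketched as below. This is a generic illustration under assumptions: the number of folds is not stated in this text, `train` and `predict` are caller-supplied placeholders for the structured predictor, and examples are assumed hashable so predictions can be stored in a dictionary.

```python
def cross_fit_predictions(examples, n_folds, train, predict):
    """For each fold, train on the remaining folds and predict on the held-out
    fold, so every prediction comes from a model that never saw that example."""
    folds = [examples[k::n_folds] for k in range(n_folds)]
    out = {}
    for k, held_out in enumerate(folds):
        training = [ex for j, fold in enumerate(folds) if j != k for ex in fold]
        model = train(training)
        for ex in held_out:
            out[ex] = predict(model, ex)
    return out

# Toy check: the "model" is just the set of examples it was trained on, and the
# "prediction" reports whether the example was seen during training.
examples = list(range(10))
seen = cross_fit_predictions(
    examples, n_folds=5,
    train=lambda data: frozenset(data),
    predict=lambda model, ex: ex in model)
```

The gathered predictions then serve as the training signal for the policy over the complete data, avoiding the optimistic losses that would result from predicting on the predictor's own training examples.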
The dependency parser is trained with the averaged Structured Perceptron model, with the learning rate and number of epochs set to 0.1 and 50, respectively. This setting achieves the best test performance. Notice that if we trained two dependency models with different feature sets separately, the scale of the edge scores might differ, resulting in suboptimal test performance. To fix this issue, we generate data with random edge features and train the model to minimize the joint loss over all states.
Finally, we found that for dependency parsing expensive features are only necessary in several critical locations in the sentence. Therefore, budget levels above 10% turned out to be unachievable for any feature-tradeoff parameter lambda in the pseudolabels. To obtain those regions, we varied the class weights of feature templates in the training of the one-shot feature selector.
References

[1] R. Busa-Fekete, D. Benbouzid, and B. Kégl. Fast classification using sparse decision DAGs. In 29th International Conference on Machine Learning (ICML 2012), pages 951–958. Omnipress, 2012.
[2] K.-W. Chang, S. Upadhyay, M.-W. Chang, V. Srikumar, and D. Roth. IllinoisSL: A JAVA Library for Structured Prediction. arXiv preprint arXiv:1509.07179, 2015.
[3] M. Chen, K. Q. Weinberger, O. Chapelle, D. Kedem, and Z. Xu. Classifier cascade for minimizing feature evaluation cost. In International Conference on Artificial Intelligence and Statistics, pages 218–226, 2012.
[4] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan. Object detection with discriminatively trained part-based models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(9):1627–1645, 2010.
[5] Y. Goldberg and M. Elhadad. An efficient algorithm for easy-first non-directional dependency parsing. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 742–750. Association for Computational Linguistics, 2010.
[6] A. Grubb and D. Bagnell. Speedboost: Anytime prediction with uniform near-optimality. In International Conference on Artificial Intelligence and Statistics, pages 458–466, 2012.
[7] H. He, H. Daumé III, and J. Eisner. Dynamic Feature Selection for Dependency Parsing. In EMNLP, pages 1455–1464, 2013.
[8] M. I. Jordan, Z. Ghahramani, T. S. Jaakkola, and L. K. Saul. An introduction to variational methods for graphical models. Machine Learning, 37(2):183–233, 1999.
[9] S. Karayev, M. Fritz, and T. Darrell. Dynamic feature selection for classification on a budget. In International Conference on Machine Learning (ICML): Workshop on Prediction with Sequential Models, 2013.
[10] M. J. Kusner, W. Chen, Q. Zhou, Z. E. Xu, K. Q. Weinberger, and Y. Chen. Feature-Cost Sensitive Learning with Submodular Trees of Classifiers. In AAAI, pages 1939–1945, 2014.
[11] M. P. Marcus, M. A. Marcinkiewicz, and B. Santorini. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19(2):313–330, 1993.
[12] R. McDonald, F. Pereira, K. Ribarov, and J. Hajič. Non-projective dependency parsing using spanning tree algorithms. In Proceedings of the Conference on Human Language Technology and Empirical Methods in Natural Language Processing, pages 523–530. Association for Computational Linguistics, 2005.
[13] A. Rush and S. Petrov. Vine pruning for efficient multi-pass dependency parsing. In NAACL, 2012.
[14] T. Shi, J. Steinhardt, and P. Liang. Learning Where to Sample in Structured Prediction. In AISTATS, 2015.
[15] V. Stoyanov and J. Eisner. Easy-first Coreference Resolution. In COLING, pages 2519–2534. Citeseer, 2012.
[16] E. Strubell, L. Vilnis, K. Silverstein, and A. McCallum. Learning Dynamic Feature Selection for Fast Sequential Prediction. arXiv preprint arXiv:1505.06169, 2015.
[17] B. Taskar, C. Guestrin, and D. Koller. Max-Margin Markov Networks. In Advances in Neural Information Processing Systems, 2003.
[18] K. Trapeznikov and V. Saligrama. Supervised sequential classification under budget constraints. In Proceedings of the Sixteenth International Conference on Artificial Intelligence and Statistics, pages 581–589, 2013.
[19] I. Tsochantaridis, T. Joachims, T. Hofmann, and Y. Altun. Large margin methods for structured and interdependent output variables. Journal of Machine Learning Research, pages 1453–1484, 2005.
[20] P. Viola and M. Jones. Robust real-time object detection. International Journal of Computer Vision, 4, 2001.
[21] J. Wang, T. Bolukbasi, K. Trapeznikov, and V. Saligrama. Model selection by linear programming. In Computer Vision–ECCV 2014, pages 647–662. Springer, 2014.
[22] J. Wang, K. Trapeznikov, and V. Saligrama. An LP for Sequential Learning Under Budgets. In Proceedings of the Seventeenth International Conference on Artificial Intelligence and Statistics, pages 987–995, 2014.
[23] D. Weiss, B. Sapp, and B. Taskar. Dynamic structured model selection. In Proceedings of the IEEE International Conference on Computer Vision, pages 2656–2663, 2013.
[24] D. Weiss and B. Taskar. Structured Prediction Cascades. In International Conference on Artificial Intelligence and Statistics, pages 916–923, 2010.
[25] D. J. Weiss and B. Taskar. Learning adaptive value of information for structured prediction. In Advances in Neural Information Processing Systems, pages 953–961, 2013.
[26] Z. Xu, M. Kusner, M. Chen, and K. Q. Weinberger. Cost-Sensitive Tree of Classifiers. In Proceedings of the 30th International Conference on Machine Learning (ICML-13), pages 133–141, 2013.
[27] M. Zhu, N. Atanasov, G. J. Pappas, and K. Daniilidis. Active deformable part models inference. In Computer Vision–ECCV 2014, pages 281–296. Springer, 2014.