1 Introduction
One of the core combinatorial online learning problems is that of learning a minimum loss path in a directed graph. Examples can be found in structured prediction problems such as machine translation, automatic speech recognition, optical character recognition and computer vision. In these problems, predictions (or predictors) can be decomposed into possibly overlapping substructures that may correspond to words, phonemes, characters, or image patches. They can be represented in a directed graph where each edge represents a different substructure.
The number of paths, which serve as experts, is typically exponential in the size of the graph. Extensive work has been done to design efficient algorithms when the loss is additive, that is, when the loss of a path is the sum of the losses of the edges along that path. Several efficient algorithms with favorable guarantees have been designed both for the full information setting (Takimoto and Warmuth, 2003; Kalai and Vempala, 2005; Koolen et al., 2010) and for different bandit settings (György et al., 2007; Cesa-Bianchi and Lugosi, 2012) by exploiting the additivity of the loss.
However, in modern machine learning applications such as machine translation, speech recognition and computational biology, the loss of each path is often not additive in the edges along the path. For instance, in machine translation, the BLEU score similarity determines the loss. The BLEU score can be closely approximated by the inner product of the count vectors of the $n$-gram occurrences in two sequences, where typically $n \le 4$ (see Figure 1). In computational biology tasks, the losses are determined based on the inner product of the (discounted) count vectors of occurrences of $n$-grams with gaps (gappy $n$-grams). In other applications, such as speech recognition and optical character recognition, the loss is based on the edit-distance. Since the performance of the algorithms in these applications is measured via non-additive loss functions, it is natural to seek learning algorithms optimizing these losses directly. This motivates our study of online path learning for non-additive losses.
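To make the count-based approximation concrete, the sketch below computes $n$-gram count vectors and their inner product for two token sequences. It is a minimal illustration in Python, not part of the formal development; the function names are ours.

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count vector of n-gram occurrences in a token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def count_gain(hypothesis, reference, n):
    """Inner product of the two n-gram count vectors."""
    h, r = ngram_counts(hypothesis, n), ngram_counts(reference, n)
    return sum(c * r[g] for g, c in h.items())
```

For instance, with bigrams ($n = 2$), the sequences "he would like cake" and "he would like to eat cake" share the bigrams "he would" and "would like", so their gain is 2.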
One of the applications of our algorithm is ensemble structured prediction. Online learning of ensembles of structured prediction experts can significantly improve the performance of algorithms in a number of areas including machine translation, speech recognition, other language processing areas, optical character recognition, and computer vision (Cortes et al., 2014). In general, ensemble structured prediction is motivated by the fact that one particular expert may be better at predicting one substructure while some other expert may be more accurate at predicting another substructure. Therefore, it is desirable to interleave the substructure predictions of all experts to obtain a more accurate prediction. This application becomes particularly important in the bandit setting. Suppose one wishes to combine the outputs of different translators as in Figure 1. Instead of competing against the output of the best single translator, the comparator is the best “interleaved translation”, where each word in the translation can come from a different translator. However, computing the loss or the gain (such as the BLEU score) of each path can be costly and may require the learner to resort to learning from partial feedback only.
Online path learning with non-additive losses has been previously studied by Cortes et al. (2015). That work focuses on the full information case, providing efficient implementations of the Expanded Hedge (Takimoto and Warmuth, 2003) and Follow-the-Perturbed-Leader (Kalai and Vempala, 2005) algorithms under some technical assumptions on the outputs of the experts.
In this paper, we design algorithms for online path learning with non-additive gains or losses in the full information setting, as well as in several bandit settings specified in detail in Section 2. In the full information setting, we design an efficient algorithm that enjoys regret guarantees more favorable than those of Cortes et al. (2015), while not requiring any additional assumptions. In the bandit settings, our algorithms are, to the best of our knowledge, the first efficient methods for learning with non-additive losses.
The key technical tools used in this work are weighted automata and transducers (Mohri, 2009). We transform the original path graph $\mathcal{A}$ (e.g., Figure 1) into an intermediate graph $\mathcal{A}'$. The paths in $\mathcal{A}$ are mapped to paths in $\mathcal{A}'$, but the losses in $\mathcal{A}'$ are additive along the paths. Remarkably, the size of $\mathcal{A}'$ does not depend on the size of the alphabet (the word vocabulary in translation tasks) from which the output labels of the edges are drawn. The construction of $\mathcal{A}'$ is highly non-trivial and is our primary contribution. This alternative graph $\mathcal{A}'$, in which the losses are additive, enables us to extend many well-known algorithms in the literature to the path learning problem.
The paper is organized as follows. We introduce the path learning setup in Section 2. In Section 3, we explore the wide family of non-additive count-based gains and introduce the alternative graph $\mathcal{A}'$ using automata and transducer tools. We present our algorithms in Section 4 for the full information, semi-bandit and full bandit settings for count-based gains. Next, we extend our results to gappy count-based gains in Section 5. The application of our method to ensemble structured prediction is detailed in Appendix A. In Appendix B, we go beyond count-based gains and consider arbitrary (non-additive) gains. Even with no assumption about the structure of the gains, we can efficiently implement the EXP3 algorithm in the full bandit setting. Naturally, the regret bounds for this algorithm are weaker, since no special structure of the gains can be exploited in the absence of any assumption.
2 Basic Notation and Setup
We describe our path learning setup in terms of finite automata. Let $\mathcal{A}$ denote a fixed acyclic finite automaton, which we call the expert automaton. $\mathcal{A}$ admits a single initial state and one or several final states, indicated by bold and double circles, respectively; see Figure 2(a). Each transition of $\mathcal{A}$ is labeled with a unique name. Denote the set of all transition names by $E$. An automaton with a single initial state is deterministic if no two outgoing transitions from a given state admit the same name. Thus, our automaton $\mathcal{A}$ is deterministic by construction since the transition names are unique. An accepting path is a sequence of transitions from the initial state to a final state. The expert automaton $\mathcal{A}$ can be viewed as an indicator function over strings in $E^*$ such that $\mathcal{A}(\pi) = 1$ iff $\pi$ is an accepting path. Each accepting path serves as an expert, and we equivalently refer to it as a path expert. At each round $t$, each transition $e$ outputs a symbol from a finite non-empty alphabet $\Sigma$, denoted by $\mathsf{out}_t(e)$. The prediction of a path expert $\pi$ at round $t$ is the sequence of output symbols along its transitions at that round and is denoted by $\mathsf{out}_t(\pi)$. We also denote by $\mathcal{A}_t$ the automaton with the same topology as $\mathcal{A}$ where each transition $e$ is labeled with $\mathsf{out}_t(e)$; see Figure 2(b). At each round $t$, a target sequence $y_t \in \Sigma^*$ is presented to the learner. The gain/loss of each path expert $\pi$ is $U(\mathsf{out}_t(\pi), y_t)$, where $U \colon \Sigma^* \times \Sigma^* \to \mathbb{R}$. Our focus is on functions $U$ that are not necessarily additive along the transitions in $\mathcal{A}$. For example, $U$ can be either a distance function (e.g., edit-distance) or a similarity function (e.g., $n$-gram gain).
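As a concrete illustration of this setup, the sketch below represents an acyclic expert automaton as an adjacency map and enumerates its path experts. The representation and function names are ours, chosen only for illustration.

```python
def path_experts(transitions, initial, finals):
    """Enumerate the accepting paths (sequences of transition names) of an
    acyclic automaton given as {state: [(name, next_state), ...]}."""
    paths = []

    def walk(state, prefix):
        if state in finals:
            paths.append(tuple(prefix))
        for name, nxt in transitions.get(state, []):
            walk(nxt, prefix + [name])

    walk(initial, [])
    return paths
```

On a small automaton with two parallel transitions followed by a common one, this yields the two path experts ("a", "c") and ("b", "c").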
We consider standard online learning scenarios of prediction with path experts. At each round $t$, the learner picks a path expert $\pi_t$ and predicts with its prediction $\mathsf{out}_t(\pi_t)$. The learner receives the gain $U(\mathsf{out}_t(\pi_t), y_t)$. Depending on the setting, the adversary may reveal some information about $y_t$ and the output symbols of the transitions (see Figure 3). In the full information setting, $y_t$ and $\mathsf{out}_t(e)$ are revealed to the learner for every transition $e$ in $\mathcal{A}$. In the semi-bandit setting, the adversary reveals $\mathsf{out}_t(e)$ and the associated gain contribution for every transition $e$ along $\pi_t$. In the full bandit setting, the scalar gain $U(\mathsf{out}_t(\pi_t), y_t)$ is the only information revealed to the learner. The goal of the learner is to minimize the regret, defined as the cumulative gain of the best path expert chosen in hindsight minus the cumulative expected gain of the learner.
3 Count-Based Gains
Many of the most commonly used non-additive gains in applications belong to the broad family of count-based gains, which are defined in terms of the number of occurrences of a fixed set of patterns, $p_1, \ldots, p_N$, in the sequence output by a path expert. These patterns may be $n$-grams, that is, sequences of $n$ consecutive symbols, as in a common approximation of the BLEU score in machine translation; a set of relevant subsequences of variable length in computational biology; or patterns described by complex regular expressions in pronunciation modeling.
For any sequence $y \in \Sigma^*$, let $\Theta(y)$ denote the vector whose $k$th component is the number of occurrences of $p_k$ in $y$, for $k \in \{1, \ldots, N\}$. (This can be extended to the case of weighted occurrences, where more emphasis is assigned to some patterns, whose occurrence counts are then multiplied by a larger factor, and less emphasis to others.) The count-based gain function at round $t$ for a path expert $\pi$ in $\mathcal{A}$ given the target sequence $y_t$ is then defined as a dot product:

$$U(\mathsf{out}_t(\pi), y_t) = \Theta(\mathsf{out}_t(\pi)) \cdot \Theta(y_t). \qquad (1)$$
Such gains are not additive along the transitions, and the standard online path learning algorithms for additive gains cannot be applied. Consider, for example, the special case of $n$-gram-based gains in Figure 1. These gains cannot be expressed additively if the target sequence is, for instance, “He would like to eat cake” (see Appendix F). The challenge of learning with non-additive gains is even more apparent in the case of gappy count-based gains, which allow for gaps of varying length in the patterns of interest. We defer the study of gappy count-based gains to Section 5.
How can we design algorithms for online path learning with such non-additive gains? Can we design algorithms with favorable regret guarantees for all three settings of full information, semi-bandit and full bandit? The key idea behind our solution is to design a new automaton $\mathcal{A}'$ whose paths can be identified with those of $\mathcal{A}$ and, crucially, whose gains are additive. We will construct $\mathcal{A}'$ by defining a set of context-dependent rewrite rules, which can be compiled into a finite-state transducer defined below. The context-dependent automaton $\mathcal{A}'$ can then be obtained by composition of this transducer with $\mathcal{A}$. In addition to playing a key role in the design of our algorithms (Section 4), $\mathcal{A}'$ provides a compact representation of the gains, since its size is substantially smaller than the number of patterns $N$.
3.1 Context-Dependent Rewrite Rules
We will use context-dependent rewrite rules to map $\mathcal{A}$ to the new representation $\mathcal{A}'$. These are rules that admit the following general form:

$$\phi \to \psi \;/\; \lambda \,\_\_\, \rho,$$

where $\phi$, $\psi$, $\lambda$, and $\rho$ are regular expressions over the alphabet of the rules. These rules must be interpreted as follows: $\phi$ is to be replaced by $\psi$ whenever it is preceded by $\lambda$ and followed by $\rho$. Thus, $\lambda$ and $\rho$ represent the left and right contexts of application of the rules. Several types of rules can be considered depending on whether they are obligatory or optional, and on their direction of application: from left to right, from right to left, or simultaneous (Kaplan and Kay, 1994). We will only be considering rules with simultaneous application.
Such context-dependent rules can be efficiently compiled into a finite-state transducer (FST), under the technical condition that they do not rewrite their non-contextual part (Mohri and Sproat, 1996; Kaplan and Kay, 1994). (The rules can additionally be augmented with weights, which can help cover the case of weighted count-based gains; the result of the compilation is then a weighted transducer (Mohri and Sproat, 1996), and our algorithms and theory extend to that case.) An FST $T$ over an input alphabet $\Sigma_1$ and output alphabet $\Sigma_2$ defines an indicator function over pairs of strings in $\Sigma_1^* \times \Sigma_2^*$. Given $x \in \Sigma_1^*$ and $y \in \Sigma_2^*$, we have $T(x, y) = 1$ if there exists a path from an initial state to a final state with input label $x$ and output label $y$, and $T(x, y) = 0$ otherwise.
To define our rules, we first introduce the alphabet $E'$ of transition names for the target automaton $\mathcal{A}'$. These names capture all possible contexts of length $r_k$, where $r_k$ is the length of pattern $p_k$:

$$E' = \bigl\{\, e_1 \# e_2 \# \cdots \# e_{r_k} \;\big|\; e_1 e_2 \cdots e_{r_k} \text{ is a sequence of consecutive transitions in } \mathcal{A},\; k \in \{1, \ldots, N\} \,\bigr\},$$

where the ‘#’ symbol “glues” $e_1, \ldots, e_{r_k}$ together and forms one single symbol in $E'$. We will have one context-dependent rule of the following form for each element $e_1 \# \cdots \# e_{r_k}$ of $E'$:

$$e_1 e_2 \cdots e_{r_k} \to e_1 \# e_2 \# \cdots \# e_{r_k} \;/\; \lambda = \rho = \epsilon. \qquad (2)$$

Thus, in our case, the left and right contexts are the empty strings, meaning that the rules can apply (simultaneously) at every position. (Context-dependent rewrite rules are powerful tools for identifying patterns through their left and right contexts; for our application to count-based gains, however, identifying the patterns is independent of their context, and we do not need to fully exploit the strength of these rules.) In the special case where the patterns are the set of $n$-grams, $r_k$ is fixed and equal to $n$. Figure 4 shows the result of the rule compilation in the bigram case: this transducer inserts $e \# e'$ whenever the transitions $e$ and $e'$ are found consecutively, and otherwise outputs the empty string. We will denote the resulting FST by $T$.
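The effect of the compiled transducer on a complete path, in the fixed-length $n$-gram case, is to emit one glued symbol per window of consecutive transition names. The following is a simplified sketch of that behavior (not of the general FST compilation); the function name is ours.

```python
def glued_symbols(path, n):
    """Output of the simultaneous rewrite rules on a path: one '#'-glued
    context symbol per window of n consecutive transition names."""
    return ["#".join(path[i:i + n]) for i in range(len(path) - n + 1)]
```

For the path e1 e2 e3 with bigram windows, this produces the two symbols e1#e2 and e2#e3; a path shorter than $n$ yields the empty output.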
3.2 Context-Dependent Automaton
To construct the context-dependent automaton $\mathcal{A}'$, we will use the composition operation. The composition of $\mathcal{A}$ and $T$ is an FST denoted by $\mathcal{A} \circ T$ and defined as the following product of two outcomes for all inputs:

$$(\mathcal{A} \circ T)(x, y) = \mathcal{A}(x) \cdot T(x, y).$$

There is an efficient algorithm for the composition of FSTs and automata (Pereira and Riley, 1997; Mohri et al., 1996; Mohri, 2009), whose worst-case complexity is in $O(|\mathcal{A}|\,|T|)$. The automaton $\mathcal{A}'$ is obtained from the FST $\mathcal{A} \circ T$ by projection, that is, by simply omitting the input label of each transition and keeping only the output label. Thus, if we denote by $\Pi$ the projection operator, then $\mathcal{A}'$ is defined as

$$\mathcal{A}' = \Pi(\mathcal{A} \circ T).$$
Observe that $\mathcal{A}'$ admits a fixed topology (states and transitions) at any round $t$. It can be constructed in a preprocessing stage using the FST operations of composition and projection. Additional FST operations such as $\epsilon$-removal and minimization can help further optimize the automaton obtained after projection (Mohri, 2009). Proposition 3.2, proven in Appendix D, ensures that for every accepting path in $\mathcal{A}$, there is a unique corresponding accepting path in $\mathcal{A}'$. Figure 5 shows the automata $\mathcal{A}$ and $\mathcal{A}'$ in a simple case, and how a path in $\mathcal{A}$ is mapped to another path in $\mathcal{A}'$.

Proposition 3.2. Let $\mathcal{A}$ be an expert automaton and let $T$ be a deterministic transducer representing the rewrite rules (2). Then, for each accepting path $\pi$ in $\mathcal{A}$, there exists a unique corresponding accepting path $\pi'$ in $\mathcal{A}' = \Pi(\mathcal{A} \circ T)$.
The size of the context-dependent automaton $\mathcal{A}'$ depends on the expert automaton $\mathcal{A}$ and the lengths of the patterns. Notice that, crucially, its size is independent of the size of the alphabet $\Sigma$. Appendix A analyzes more specifically the size of $\mathcal{A}'$ in the important application of ensemble structured prediction with $n$-gram gains.
At any round $t$ and for any $e' = e_1 \# \cdots \# e_r$ in $E'$, let $\mathsf{out}_t(e')$ denote the sequence $\mathsf{out}_t(e_1) \cdots \mathsf{out}_t(e_r)$, that is, the sequence obtained by concatenating the outputs of $e_1, \ldots, e_r$. Let $\mathcal{A}'_t$ be the automaton with the same topology as $\mathcal{A}'$ where each label $e'$ is replaced by $\mathsf{out}_t(e')$. Once $y_t$ is known, the representation $\Theta(y_t)$ can be found, and consequently, the additive contribution of each transition of $\mathcal{A}'$ can be computed. The following theorem, which is proved in Appendix D, shows the additivity of the gains in $\mathcal{A}'$. See Figure 6 for an example.

Theorem. At any round $t$, define the gain of the transition $e'$ in $\mathcal{A}'$ by $U_t(e') = [\Theta(y_t)]_k$ if $\mathsf{out}_t(e') = p_k$ for some $k$, and $U_t(e') = 0$ if no such $k$ exists. Then, the gain of each path $\pi$ in $\mathcal{A}$ at trial $t$ can be expressed as an additive gain of the corresponding unique path $\pi'$ in $\mathcal{A}'$:

$$U(\mathsf{out}_t(\pi), y_t) = \sum_{e' \in \pi'} U_t(e').$$
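The additivity stated above can be checked numerically in the $n$-gram case: the inner product of count vectors coincides with a sum of per-window contributions, one per transition of the context-dependent automaton. The sketch below, with names of our choosing, is only illustrative.

```python
from collections import Counter

def ngrams(seq, n):
    return [tuple(seq[i:i + n]) for i in range(len(seq) - n + 1)]

def dot_product_gain(prediction, target, n):
    """Non-additive form: inner product of the n-gram count vectors."""
    p, t = Counter(ngrams(prediction, n)), Counter(ngrams(target, n))
    return sum(c * t[g] for g, c in p.items())

def additive_gain(prediction, target, n):
    """Additive form: each window of the predicted path (a transition of
    the context-dependent automaton) contributes the target count of the
    n-gram it outputs."""
    target_counts = Counter(ngrams(target, n))
    return sum(target_counts[g] for g in ngrams(prediction, n))
```

Both forms agree on every pair of sequences, mirroring the theorem: each occurrence of a pattern in the prediction contributes once, weighted by that pattern's count in the target.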
4 Algorithms
In this section, we present algorithms and associated regret guarantees for online path learning with non-additive count-based gains in the full information, semi-bandit and full bandit settings. The key component of our algorithms is the context-dependent automaton $\mathcal{A}'$. In what follows, we denote by $K$ the length of the longest path in $\mathcal{A}'$, by $B$ an upper bound on the gain of each transition in $\mathcal{A}'$, by $M$ the number of path experts, and by $|E'|$ and $|Q'|$ the number of transitions and states in $\mathcal{A}'$, respectively. We note that $K$ is at most the length of the longest path in $\mathcal{A}$, since each transition in $\mathcal{A}$ admits a unique label.
Remark.
The number of accepting paths in $\mathcal{A}'$ is often equal to, but sometimes less than, the number of accepting paths in $\mathcal{A}$. In some degenerate cases, several paths in $\mathcal{A}$ may correspond to one single path $\pi'$ in $\mathcal{A}'$. (For example, in the case of $n$-gram gains, all paths in $\mathcal{A}$ of length less than $n$ correspond to a path with empty output in $\mathcal{A}'$ and will always have a gain of zero.) This implies that these paths in $\mathcal{A}$ will consistently have the same gain in every round, namely the additive gain of $\pi'$ in $\mathcal{A}'$. Thus, if $\pi'$ is predicted by the algorithm in $\mathcal{A}'$, any of the corresponding paths can be equivalently used for prediction in the original expert automaton $\mathcal{A}$.
4.1 Full Information: Context-dependent Component Hedge Algorithm
Koolen et al. (2010) gave an algorithm for online path learning with non-negative additive losses in the full information setting, the Component Hedge (CH) algorithm. For count-based losses, Cortes et al. (2015) provided an efficient Rational Randomized Weighted Majority (RRWM) algorithm. That algorithm requires the use of determinization (Mohri, 2009), which is only known to have polynomial computational complexity under additional technical assumptions on the outputs of the path experts. In this section, we present an extension of CH, the Context-dependent Component Hedge (CDCH), for the online path learning problem with non-additive count-based gains. CDCH admits more favorable regret guarantees than RRWM and can be efficiently implemented without any additional assumptions.
Our CDCH algorithm requires a modification of $\mathcal{A}'$ such that all paths admit an equal number of transitions (equal to the length of the longest path). This modification can be done by adding states and zero-gain transitions (György et al., 2007). Abusing the notation, we will denote this new automaton by $\mathcal{A}'$ in this subsection. At each iteration $t$, CDCH maintains a weight vector $w_t$ in the unit-flow polytope over $\mathcal{A}'$, which is the set of non-negative vectors indexed by the transitions that satisfy the following conditions: (1) the weights of the outgoing transitions from the initial state sum to one, and (2) for every non-final state, the sums of the weights of the incoming and outgoing transitions are equal. For each transition $e'$, we observe the gain $U_t(e')$ and define the loss of that transition as $\ell_t(e') = B - U_t(e')$. After observing the loss of each transition in $\mathcal{A}'$, CDCH updates each component of $w_t$ as $\hat w_t(e') = w_t(e')\, e^{-\eta\, \ell_t(e')}$ (where $\eta$ is a specified learning rate), and sets $w_{t+1}$ to the relative entropy projection of the updated $\hat w_t$ back onto the unit-flow polytope $\mathcal{F}$, i.e., $w_{t+1} = \operatorname{argmin}_{w \in \mathcal{F}} \Delta(w \,\|\, \hat w_t)$.
CDCH predicts by decomposing $w_t$ into a convex combination of at most $|E'|$ paths in $\mathcal{A}'$ and then sampling a single path according to this mixture, as described below. Recall that each path in $\mathcal{A}'$ identifies a path in $\mathcal{A}$, which can be recovered efficiently. Therefore, the inference step of CDCH takes time at most polynomial in the size of $\mathcal{A}'$. To determine a decomposition, we find a path from the initial state to a final state with non-zero weights on all its transitions, remove from each transition on that path the largest weight that can be removed (the minimum weight along the path), and use that weight as the mixture weight of the path. The algorithm proceeds in this way until the outflow from the initial state is zero. The following theorem from Koolen et al. (2010) gives a regret guarantee for the CDCH algorithm.
Theorem 4.1. With proper tuning of the learning rate $\eta$, the regret of CDCH is bounded as follows:
The regret bounds of Theorem 4.1 are in terms of the count-based gain $U$. Cortes et al. (2015) gave regret guarantees for the RRWM algorithm with count-based losses, defined as the negated count-based gains. In Appendix E, we show that the regret associated with the gain is upper-bounded by the regret bound associated with the loss. Observe that, even with this approximation, the regret guarantees that we provide for CDCH are tighter. In addition, our algorithm does not require additional assumptions for an efficient implementation, in contrast with the RRWM algorithm of Cortes et al. (2015).
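The greedy decomposition used in the prediction step, which peels off one positive-flow path at a time and subtracts its bottleneck weight, can be sketched as follows for a unit flow given as a dictionary over edges. This is a simplified illustration with our own names, not the paper's implementation.

```python
def decompose_flow(flow, initial, finals, eps=1e-12):
    """Greedily decompose a unit flow {(u, v): weight} on a DAG into a
    convex combination of source-to-sink paths."""
    out = {}
    for (u, v), w in flow.items():
        out.setdefault(u, {})[v] = w
    mixture = []
    while True:
        # follow positive-weight edges from the initial state
        path, state = [], initial
        while state not in finals:
            nxt = next((v for v, w in out.get(state, {}).items() if w > eps), None)
            if nxt is None:            # no remaining flow: decomposition done
                return mixture
            path.append((state, nxt))
            state = nxt
        p = min(out[u][v] for u, v in path)   # bottleneck weight of this path
        for u, v in path:
            out[u][v] -= p
        mixture.append((p, path))
```

Each iteration zeroes out at least one edge, so the mixture contains at most as many paths as there are transitions, and its weights sum to the total outflow of the initial state.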
4.2 Semi-Bandit: Context-dependent Semi-Bandit Algorithm
György et al. (2007) gave an efficient algorithm for online path learning with additive losses in the semi-bandit setting. In this section, we present a Context-dependent Semi-Bandit (CDSB) algorithm extending that work to online path learning with count-based gains in the semi-bandit setting. To the best of our knowledge, this is the first efficient algorithm with favorable regret bounds for this problem.
As with the algorithm of György et al. (2007), CDSB makes use of a set $\mathcal{C}$ of covering paths with the property that, for each transition $e'$ in $\mathcal{A}'$, there is an accepting path in $\mathcal{C}$ to which $e'$ belongs. At each round $t$, CDSB keeps track of a distribution $p_t$ over all path experts by maintaining a weight $w_t(e')$ on each transition $e'$ in $\mathcal{A}'$ such that the weights of the outgoing transitions of each state sum to one and $p_t(\pi') = \prod_{e' \in \pi'} w_t(e')$ for all accepting paths $\pi'$ in $\mathcal{A}'$. Therefore, we can sample a path from $p_t$ in at most $K$ steps by selecting a random transition at each state according to the distribution defined by $w_t$. To make a prediction, we sample a path in $\mathcal{A}'$ according to a mixture distribution $\hat p_t = (1 - \gamma)\, p_t + \gamma\, \tilde p$, where $\tilde p$ is a uniform distribution over the paths in $\mathcal{C}$. We select $p_t$ with probability $1 - \gamma$ or $\tilde p$ with probability $\gamma$ and sample a random path $\pi'_t$ from the randomly chosen distribution. Once a path $\pi'_t$ in $\mathcal{A}'$ is sampled, we observe the gain of each transition $e'$ of $\pi'_t$, denoted by $U_t(e')$. CDSB sets $\hat U_t(e') = \bigl(U_t(e') + \beta\bigr)/q_t(e')$ if $e' \in \pi'_t$ and $\hat U_t(e') = \beta / q_t(e')$ otherwise. Here, $\gamma$ and $\beta$ are parameters of the algorithm and $q_t(e')$ is the flow through $e'$ under $\hat p_t$, which can be computed using a standard shortest-distance algorithm over the probability semiring (Mohri, 2009). The updated distribution is $p_{t+1}(\pi') \propto p_t(\pi') \exp\bigl(\eta \sum_{e' \in \pi'} \hat U_t(e')\bigr)$. Next, the weight pushing algorithm (Mohri, 1997) is applied (see Appendix C), which results in new transition weights $w_{t+1}$ such that the total outflow of each state is again one and the updated path probabilities are $p_{t+1}(\pi') = \prod_{e' \in \pi'} w_{t+1}(e')$, thereby facilitating sampling. The computational complexity of each of the steps above is polynomial in the size of $\mathcal{A}'$. The following theorem from György et al. (2007) provides a regret guarantee for the CDSB algorithm.
Theorem. Let $\mathcal{C}$ denote the set of “covering paths” in $\mathcal{A}'$. For any $\delta \in (0, 1)$, with proper tuning of the parameters $\eta$, $\gamma$, and $\beta$, the regret of the CDSB algorithm can be bounded as follows with probability at least $1 - \delta$:
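The weight-pushing step invoked by CDSB admits a simple implementation in the acyclic case, sketched below over the probability semiring. The representation and function names are ours; see Mohri (1997) for the general algorithm.

```python
def weight_push(transitions, finals):
    """Weight pushing over the probability semiring on an acyclic graph:
    reweights transitions so outgoing weights at every state sum to one,
    while preserving the relative weight of every complete path.
    transitions: {state: [(weight, next_state), ...]}.
    Assumes every state can reach a final state."""
    d = {}  # total path weight from each state to the final states

    def total(q):
        if q not in d:
            d[q] = (1.0 if q in finals else 0.0) + sum(
                w * total(nxt) for w, nxt in transitions.get(q, []))
        return d[q]

    # new weight of (q -> nxt) is w * d[nxt] / d[q]
    return {q: [(w * total(nxt) / total(q), nxt) for w, nxt in edges]
            for q, edges in transitions.items()}
```

After pushing, sampling a path is a sequence of local categorical draws, and the product of the pushed weights along a path equals that path's share of the total weight.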
4.3 Full Bandit: Context-dependent ComBand Algorithm
Here, we present an algorithm for online path learning with count-based gains in the full bandit setting. Cesa-Bianchi and Lugosi (2012) gave an algorithm for online path learning with additive gains, ComBand. Our generalization, called Context-dependent ComBand (CDCB), is the first efficient algorithm with favorable regret guarantees for learning with count-based gains in this setting. For the full bandit setting with arbitrary gains, we develop an efficient execution of EXP3, called EXP3-AG, in Appendix B.
As with CDSB, CDCB maintains a distribution over all path experts using weights $w_t(e')$ on the transitions, such that the outflow of each state is one and the probability of each path expert is the product of the weights of the transitions along that path. To make a prediction, we sample a path in $\mathcal{A}'$ according to a mixture distribution $\hat p_t = (1 - \gamma)\, p_t + \gamma\, \tilde p$, where $\tilde p$ is a uniform distribution over the paths in $\mathcal{A}'$. Note that this sampling can be efficiently implemented as follows. As a preprocessing step, define $\tilde p$ using a separate set of weights over the transitions of $\mathcal{A}'$ in the same form: set all the weights to one and apply the weight-pushing algorithm to obtain a uniform distribution over the path experts. Next, we select $p_t$ with probability $1 - \gamma$ or $\tilde p$ with probability $\gamma$ and sample a random path from the randomly chosen distribution.
After observing the scalar gain $U_t(\pi'_t)$ of the chosen path $\pi'_t$, CDCB computes a surrogate gain vector for all transitions in $\mathcal{A}'$ via $\hat U_t = U_t(\pi'_t)\, P_t^{+}\, v(\pi'_t)$, where $P_t^{+}$ is the pseudo-inverse of $P_t = \mathbb{E}_{\hat p_t}\bigl[v(\pi')\, v(\pi')^\top\bigr]$ and $v(\pi'_t) \in \{0,1\}^{|E'|}$ is a bit representation of the path $\pi'_t$. As for CDSB, we set $\tilde w_{t+1}(e') = w_t(e')\, e^{\eta \hat U_t(e')}$ and update via weight pushing to compute $w_{t+1}$. We obtain the following regret guarantee from Cesa-Bianchi and Lugosi (2012) for CDCB:
Theorem. Let $\lambda_{\min}$ denote the smallest non-zero eigenvalue of $\mathbb{E}_{\tilde p}\bigl[v(\pi')\, v(\pi')^\top\bigr]$, where $v(\pi')$ is the bit representation of the path $\pi'$, distributed according to the uniform distribution $\tilde p$. With proper tuning of the parameters $\gamma$ and $\eta$, the regret of CDCB can be bounded as follows:

5 Extension to Gappy Count-Based Gains
Here, we generalize the results of Section 3 to a broader family of non-additive gains called gappy count-based gains: the gain of each path depends on the discounted counts of gappy occurrences of a fixed set of patterns in the sequence output by that path. In a gappy occurrence, there can be “gaps” between the symbols of the pattern. The count of a gappy occurrence is discounted multiplicatively by $\gamma^{g}$, where $\gamma \in [0, 1]$ is a fixed discount rate and $g$ is the total length of the gaps. For example, the gappy occurrences of the pattern $ab$ in the sequence $abab$ with discount rate $\gamma$ are:

- positions 1–2, length of gap $= 0$, discount factor $= 1$;
- positions 1 and 4, length of gap $= 2$, discount factor $= \gamma^2$;
- positions 3–4, length of gap $= 0$, discount factor $= 1$,

which makes the total discounted count of gappy occurrences of $ab$ in $abab$ equal to $2 + \gamma^2$. Each sequence of symbols $y$ can be represented as a discounted count vector $\Theta_\gamma(y)$ of gappy occurrences of the patterns, whose $k$th component is the discounted number of gappy occurrences of $p_k$ in $y$. The gain function is defined in the same way as in Equation (1). (The regular count-based gain can be recovered by setting $\gamma = 0$, with the convention $0^0 = 1$.) A typical instance of such gains is the gappy $n$-gram gain, where the patterns are all $n$-grams.
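The discounted gappy counts can be computed directly from their definition. The brute-force sketch below (exponential in the pattern length, for illustration only; the name is ours) also confirms that a zero discount rate recovers the regular counts:

```python
from itertools import combinations

def gappy_count(pattern, seq, rate):
    """Discounted count of gappy occurrences of `pattern` in `seq`: an
    occurrence at positions i_1 < ... < i_r contributes
    rate ** (total gap length) = rate ** (i_r - i_1 + 1 - r)."""
    r = len(pattern)
    total = 0.0
    for idx in combinations(range(len(seq)), r):
        if all(seq[i] == p for i, p in zip(idx, pattern)):
            total += rate ** (idx[-1] - idx[0] + 1 - r)
    return total
```

On the example above, the pattern "ab" in "abab" yields 2 + rate**2 (two gapless occurrences and one with a gap of length two), and rate 0 leaves exactly the two gapless occurrences.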
The key to extending our results in Section 3 to gappy count-based gains is an appropriate definition of the alphabet $E'$, the rewrite rules, and a new context-dependent automaton $\mathcal{A}'$. Once $\mathcal{A}'$ is constructed, the algorithms and regret guarantees presented in Section 4 can be extended to gappy count-based gains. To the best of our knowledge, this provides the first efficient online algorithms with favorable regret guarantees for gappy count-based gains in the full information, semi-bandit and full bandit settings.
Context-Dependent Rewrite Rules.
We extend the definition of $E'$ so that each symbol also encodes the total length of the gaps: a symbol now takes the form $e_{i_1} \# \cdots \# e_{i_r} \# g$. Note that the discount factor of a gappy occurrence depends only on the total length of the gaps and not on their positions. Exploiting this fact, for each pattern of length $r$ and total gap length $g$, we reduce the number of output symbols by encoding the total gap length as opposed to the positions of the gaps.
We extend the rewrite rules in order to incorporate the gappy occurrences. Given a pattern length $r$ and a total gap length $g$, for all path segments $e_1 \cdots e_{r+g}$ of length $r + g$ in $\mathcal{A}$ and all subsequences $e_{i_1} \cdots e_{i_r}$ with $i_1 = 1$ and $i_r = r + g$, we introduce the rule:

$$e_1 \cdots e_{r+g} \to e_{i_1} \# \cdots \# e_{i_r} \# g \;/\; \lambda = \rho = \epsilon.$$

As with the non-gappy case in Section 3, the simultaneous application of all these rewrite rules can be efficiently compiled into an FST $T$. The context-dependent transducer $T$ maps any sequence of transition names in $\mathcal{A}$ into the sequence of corresponding gappy occurrences. For instance, given a path segment of length four as input, $T$ outputs the corresponding gappy trigrams.
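The output symbols generated by these rules for a single path segment can be enumerated as follows. The '#'-glued encoding with a trailing gap length is our illustrative rendering of the extended alphabet symbols, and the function name is ours.

```python
from itertools import combinations

def gappy_rule_outputs(segment, r):
    """Glued output symbols produced by the gappy rewrite rules for one
    path segment: every length-r subsequence that keeps the first and
    last transition, tagged with the total gap length (assumes r >= 2)."""
    g = len(segment) - r                    # total gap length
    inner = range(1, len(segment) - 1)      # positions available for gaps
    outs = []
    for mid in combinations(inner, r - 2):
        names = [segment[0]] + [segment[i] for i in mid] + [segment[-1]]
        outs.append("#".join(names) + "#" + str(g))
    return outs
```

For a segment of four transitions and pattern length three (gappy trigrams with one gap), this produces the two symbols obtained by skipping either of the two middle transitions.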
Context-Dependent Automaton $\mathcal{A}'$.
As in Section 3.2, we construct the context-dependent automaton as $\mathcal{A}' = \Pi(\mathcal{A} \circ T)$, which admits a fixed topology through the trials. The rewrite rules are constructed in such a way that different paths in $\mathcal{A}$ are rewritten differently. Therefore, $T$ assigns a unique output to any given path expert in $\mathcal{A}$. Proposition 3.2 ensures that for every accepting path in $\mathcal{A}$, there is a unique corresponding accepting path in $\mathcal{A}'$.
For any round $t$ and any $e' = e_{i_1} \# \cdots \# e_{i_r} \# g$ in $E'$, define $\mathsf{out}_t(e') = \mathsf{out}_t(e_{i_1}) \cdots \mathsf{out}_t(e_{i_r})$. Let $\mathcal{A}'_t$ be the automaton with the same topology as $\mathcal{A}'$ where each label $e'$ is replaced by $\mathsf{out}_t(e')$. Given $y_t$, the representation $\Theta_\gamma(y_t)$ can be found, and consequently, the additive contribution of each transition of $\mathcal{A}'$. Again, we show the additivity of the gain in $\mathcal{A}'$ (see Appendix D for the proof).

Theorem. Given the trial $t$ and the discount rate $\gamma$, for each transition $e'$ in $\mathcal{A}'$, define the gain $U_t(e') = \gamma^{g}\, [\Theta_\gamma(y_t)]_k$ if $\mathsf{out}_t(e') = p_k$ for some $k$ and the symbol $e'$ encodes total gap length $g$, and $U_t(e') = 0$ if no such $k$ and $g$ exist. Then, the gain of each path $\pi$ in $\mathcal{A}$ at trial $t$ can be expressed as an additive gain of the corresponding path $\pi'$ in $\mathcal{A}'$:

$$U(\mathsf{out}_t(\pi), y_t) = \sum_{e' \in \pi'} U_t(e').$$
In this way, the algorithms and regret guarantees presented in Section 4 extend to gappy count-based gains in the full information, semi-bandit and full bandit settings.
6 Conclusion and Open Problems
We presented several new algorithms for online non-additive path learning with very favorable regret guarantees for the full information, semi-bandit, and full bandit scenarios. We conclude with two open problems. (1) Non-acyclic expert automata: we assumed here that the expert automaton is acyclic and that the language of patterns is finite. Solving the non-additive path learning problem with a cyclic expert automaton together with an (infinite) regular language of patterns remains an open problem. (2) Incremental construction of $\mathcal{A}'$: in this work, regardless of the data and the setting, the context-dependent automaton $\mathcal{A}'$ is constructed in advance as a preprocessing step. Is it possible to construct $\mathcal{A}'$ gradually as the learner goes through the trials, keeping it as small as possible as the algorithm explores the set of paths and learns from the revealed data?
The work of MM was partly funded by NSF CCF-1535987 and NSF IIS-1618662. Part of this work was done while MKW was at UC Santa Cruz, supported by NSF grant IIS-1619271.
References
 Allauzen et al. (2007) Cyril Allauzen, Michael Riley, Johan Schalkwyk, Wojciech Skut, and Mehryar Mohri. OpenFst: a general and efficient weighted finite-state transducer library. In Proceedings of CIAA, pages 11–23. Springer, 2007.
 Auer et al. (2002) Peter Auer, Nicolò Cesa-Bianchi, Yoav Freund, and Robert E. Schapire. The nonstochastic multiarmed bandit problem. SIAM Journal on Computing, 32(1):48–77, 2002.
 Cesa-Bianchi and Lugosi (2012) Nicolò Cesa-Bianchi and Gábor Lugosi. Combinatorial bandits. Journal of Computer and System Sciences, 78(5):1404–1422, 2012.
 Cortes et al. (2014) Corinna Cortes, Vitaly Kuznetsov, and Mehryar Mohri. Ensemble methods for structured prediction. In Proceedings of ICML, 2014.
 Cortes et al. (2015) Corinna Cortes, Vitaly Kuznetsov, Mehryar Mohri, and Manfred K. Warmuth. Online learning algorithms for path experts with non-additive losses. In Proceedings of The 28th Conference on Learning Theory, COLT 2015, Paris, France, July 3–6, 2015, pages 424–447, 2015.
 György et al. (2007) András György, Tamás Linder, Gábor Lugosi, and György Ottucsák. The online shortest path problem under partial monitoring. Journal of Machine Learning Research, 8(Oct):2369–2403, 2007.
 Kalai and Vempala (2005) Adam Kalai and Santosh Vempala. Efficient algorithms for online decision problems. Journal of Computer and System Sciences, 71(3):291–307, 2005.
 Kaplan and Kay (1994) Ronald M. Kaplan and Martin Kay. Regular models of phonological rule systems. Computational Linguistics, 20(3):331–378, 1994.
 Koolen et al. (2010) Wouter M. Koolen, Manfred K. Warmuth, and Jyrki Kivinen. Hedging structured concepts. In Proceedings of COLT, pages 93–105, 2010.
 Mohri (1997) Mehryar Mohri. Finite-state transducers in language and speech processing. Computational Linguistics, 23(2):269–311, 1997.
 Mohri (2002) Mehryar Mohri. Semiring frameworks and algorithms for shortest-distance problems. Journal of Automata, Languages and Combinatorics, 7(3):321–350, 2002.
 Mohri (2009) Mehryar Mohri. Weighted automata algorithms. In Handbook of Weighted Automata, pages 213–254. Springer, 2009.
 Mohri and Sproat (1996) Mehryar Mohri and Richard Sproat. An efficient compiler for weighted rewrite rules. In Proceedings of the 34th Annual Meeting of the Association for Computational Linguistics, pages 231–238, 1996.
 Mohri et al. (1996) Mehryar Mohri, Fernando Pereira, and Michael Riley. Weighted automata in text and speech processing. In Proceedings of the ECAI-96 Workshop on Extended Finite State Models of Language, 1996.
 Pereira and Riley (1997) Fernando Pereira and Michael Riley. Speech recognition by composition of weighted finite automata. In Finite-State Language Processing, pages 431–453. MIT Press, 1997.
 Stoltz (2005) Gilles Stoltz. Information incomplète et regret interne en prédiction de suites individuelles. PhD thesis, Université Paris-Sud, 2005.
 Takimoto and Warmuth (2003) Eiji Takimoto and Manfred K. Warmuth. Path kernels and multiplicative updates. JMLR, 4:773–818, 2003.
Appendix A Applications to Ensemble Structured Prediction
The algorithms discussed in Section 4 can be used for the online learning of ensembles of structured prediction experts, and consequently, can significantly improve the performance of algorithms in a number of areas including machine translation, speech recognition, other language processing areas, optical character recognition, and computer vision. In structured prediction problems, the output associated with a model is a structure that can be decomposed into and represented by substructures. For instance, the model may be a machine translation system and a substructure a particular word.
The problem of ensemble structured prediction can be described as follows. The learner has access to a set of experts to make an ensemble prediction. Therefore, at each round $t$, the learner can use the outputs of these experts. As illustrated in Figure 7(a), the output of each expert consists of substructures.