
# Online Non-Additive Path Learning under Full and Partial Information

We consider the online path learning problem in a graph with non-additive gains/losses. Various settings of full information, semi-bandit, and full bandit are explored. We give an efficient implementation of the EXP3 algorithm for the full bandit setting with any (non-additive) gain. Then, focusing on the large family of non-additive count-based gains, we construct an intermediate graph which has equivalent gains that are additive. By operating on this intermediate graph, we are able to use algorithms like Component Hedge and ComBand for the first time for non-additive gains. Finally, we apply our methods to the important application of ensemble structured prediction.


## 1 Introduction

One of the core combinatorial online learning problems is that of learning a minimum loss path in a directed graph. Examples can be found in structured prediction problems such as machine translation, automatic speech recognition, optical character recognition and computer vision. In these problems, predictions (or predictors) can be decomposed into possibly overlapping substructures that may correspond to words, phonemes, characters, or image patches. They can be represented in a directed graph where each edge represents a different substructure.

The number of paths, which serve as experts, is typically exponential in the size of the graph. Extensive work has been done to design efficient algorithms when the loss is additive, that is when the loss of the path is the sum of the losses of the edges along that path. Several efficient algorithms with favorable guarantees have been designed both for the full information setting (Takimoto and Warmuth, 2003; Kalai and Vempala, 2005; Koolen et al., 2010) and different bandit settings (György et al., 2007; Cesa-Bianchi and Lugosi, 2012) by exploiting the additivity of the loss.

However, in modern machine learning applications such as machine translation, speech recognition and computational biology, the loss of each path is often not additive in the edges along the path. For instance, in machine translation, the BLEU score similarity determines the loss. The BLEU score can be closely approximated by the inner product of the count vectors of the $n$-gram occurrences in two sequences, where typically $n = 4$ (see Figure 1). In computational biology tasks, the losses are determined based on the inner product of the (discounted) count vectors of occurrences of $n$-grams with gaps (gappy $n$-grams). In other applications, such as speech recognition and optical character recognition, the loss is based on the edit-distance. Since the performance of the algorithms in these applications is measured via non-additive loss functions, it is natural to seek learning algorithms optimizing these losses directly. This motivates our study of online path learning for non-additive losses.

One of the applications of our algorithm is ensemble structured prediction. Online learning of ensembles of structured prediction experts can significantly improve the performance of algorithms in a number of areas including machine translation, speech recognition, other language processing areas, optical character recognition, and computer vision (Cortes et al., 2014). In general, ensemble structured prediction is motivated by the fact that one particular expert may be better at predicting one substructure while some other expert may be more accurate at predicting another substructure. Therefore, it is desirable to interleave the substructure predictions of all experts to obtain a more accurate prediction. This application becomes particularly important in the bandit setting. Suppose one wishes to combine the outputs of different translators as in Figure 1. Instead of comparing oneself to the outputs of the best translator, the comparator is the best “interleaved translation” where each word in the translation can come from a different translator. However, computing the loss or the gain (such as the BLEU score) of each path can be costly and may require the learner to resort to learning from partial feedback only.

Online path learning with non-additive losses has been previously studied by Cortes et al. (2015). That work focuses on the full information case, providing efficient implementations of the Expanded Hedge (Takimoto and Warmuth, 2003) and Follow-the-Perturbed-Leader (Kalai and Vempala, 2005) algorithms under some technical assumptions on the outputs of the experts.

In this paper, we design algorithms for online path learning with non-additive gains or losses in the full information, as well as in several bandit settings specified in detail in Section 2. In the full information setting, we design an efficient algorithm that enjoys regret guarantees that are more favorable than those of Cortes et al. (2015), while not requiring any additional assumption. In the bandit settings, our algorithms, to the best of our knowledge, are the first efficient methods for learning with non-additive losses.

The key technical tools used in this work are weighted automata and transducers (Mohri, 2009). We transform the original path graph $\mathcal{A}$ (e.g. Figure 1) into an intermediate graph $\mathcal{A}'$. The paths in $\mathcal{A}$ are mapped to the paths in $\mathcal{A}'$, but now the losses in $\mathcal{A}'$ are additive along the paths. Remarkably, the size of $\mathcal{A}'$ does not depend on the size of the alphabet (word vocabulary in translation tasks) from which the output labels of edges are drawn. The construction of $\mathcal{A}'$ is highly non-trivial and is our primary contribution. This alternative graph $\mathcal{A}'$, in which the losses are additive, enables us to extend many well-known algorithms in the literature to the path learning problem.

The paper is organized as follows. We introduce the path learning setup in Section 2. In Section 3, we explore the wide family of non-additive count-based gains and introduce the alternative graph $\mathcal{A}'$ using automata and transducer tools. We present our algorithms in Section 4 for the full information, semi- and full bandit settings for the count-based gains. Next, we extend our results to gappy count-based gains in Section 5. The application of our method to ensemble structured prediction is detailed in Appendix A. In Appendix B, we go beyond count-based gains and consider arbitrary (non-additive) gains. Even with no assumption about the structure of the gains, we can efficiently implement the EXP3 algorithm in the full bandit setting. Naturally, however, the regret bounds for this algorithm are weaker, since no special structure of the gains can be exploited in the absence of any assumption.

## 2 Basic Notation and Setup

We describe our path learning setup in terms of finite automata. Let $\mathcal{A}$ denote a fixed acyclic finite automaton. We call $\mathcal{A}$ the expert automaton. $\mathcal{A}$ admits a single initial state and one or several final states which are indicated by bold and double circles, respectively, see Figure 2(a). Each transition of $\mathcal{A}$ is labeled with a unique name. Denote the set of all transition names by $E$. An automaton with a single initial state is deterministic if no two outgoing transitions from a given state admit the same name. Thus, our automaton $\mathcal{A}$ is deterministic by construction since the transition names are unique. An accepting path is a sequence of transitions from the initial state to a final state. The expert automaton $\mathcal{A}$ can be viewed as an indicator function over strings in $E^*$ such that $\mathcal{A}(\pi) = 1$ iff $\pi$ is an accepting path. Each accepting path serves as an expert and we equivalently refer to it as a path expert. The set of all path experts is denoted by $\mathcal{P}$. At each round $t \in [T]$, each transition $e \in E$ outputs a symbol from a finite non-empty alphabet $\Sigma$, denoted by $\text{out}_t(e) \in \Sigma$. The prediction of each path expert $\pi \in \mathcal{P}$ at round $t$ is the sequence of output symbols along its transitions at that round and is denoted by $\text{out}_t(\pi) \in \Sigma^*$. We also denote by $\text{out}_t(\mathcal{A})$ the automaton with the same topology as $\mathcal{A}$ where each transition $e$ is labeled with $\text{out}_t(e)$, see Figure 2(b). At each round $t$, a target sequence $y_t \in \Sigma^*$ is presented to the learner. The gain/loss of each path expert $\pi$ is $\mathcal{U}(\text{out}_t(\pi), y_t)$ where $\mathcal{U} : \Sigma^* \times \Sigma^* \to \mathbb{R}_{\geq 0}$. Our focus is on functions $\mathcal{U}$ that are not necessarily additive along the transitions in $\mathcal{A}$. For example, $\mathcal{U}$ can be either a distance function (e.g. edit-distance) or a similarity function (e.g. $n$-gram gain with $n \geq 2$).

We consider standard online learning scenarios of prediction with path experts. At each round $t \in [T]$, the learner picks a path expert $\pi_t$ and predicts with its prediction $\text{out}_t(\pi_t)$. The learner receives the gain $\mathcal{U}(\text{out}_t(\pi_t), y_t)$. Depending on the setting, the adversary may reveal some information about $y_t$ and the output symbols of the transitions (see Figure 3). In the full information setting, $y_t$ and $\text{out}_t(e)$ are revealed to the learner for every transition $e$ in $\mathcal{A}$. In the semi-bandit setting, the adversary reveals $y_t$ and $\text{out}_t(e)$ for every transition $e$ along $\pi_t$. In the full bandit setting, $\mathcal{U}(\text{out}_t(\pi_t), y_t)$ is the only information that is revealed to the learner. The goal of the learner is to minimize the regret, which is defined as the cumulative gain of the best path expert chosen in hindsight minus the cumulative expected gain of the learner.

## 3 Count-Based Gains

Many of the most commonly used non-additive gains in applications belong to the broad family of count-based gains, which are defined in terms of the number of occurrences of a fixed set of patterns, $\theta_1, \ldots, \theta_p$, in the sequence output by a path expert. These patterns may be $n$-grams, that is sequences of $n$ consecutive symbols, as in a common approximation of the BLEU score in machine translation, a set of relevant subsequences of variable length in computational biology, or patterns described by complex regular expressions in pronunciation modeling.

For any sequence $y \in \Sigma^*$, let $\Theta(y) \in \mathbb{R}^p$ denote the vector whose $k$th component is the number of occurrences of $\theta_k$ in $y$, $k \in [p]$. (This can be extended to the case of weighted occurrences, where more emphasis is assigned to some patterns, whose occurrences are then multiplied by a weighting factor, and less emphasis to others.) The count-based gain function $\mathcal{U}$ at round $t$ for a path expert $\pi$ in $\mathcal{A}$ given the target sequence $y_t$ is then defined as a dot product:

$$\mathcal{U}(\text{out}_t(\pi), y_t) := \Theta(\text{out}_t(\pi)) \cdot \Theta(y_t) \geq 0. \tag{1}$$

Such gains are not additive along the transitions and the standard online path learning algorithms for additive gains cannot be applied. Consider, for example, the special case of $n$-gram-based gains in Figure 1. These gains cannot be expressed additively if the target sequence is, for instance, “He would like to eat cake” (see Appendix F). The challenge of learning with non-additive gains is even more apparent in the case of gappy count-based gains, which allow for gaps of varying length in the patterns of interest. We defer the study of gappy count-based gains to Section 5.
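To make the non-additivity concrete, here is a minimal Python sketch of the count-based gain of Equation (1) for the special case where the patterns are all $n$-grams. The function names and the two example path outputs are illustrative assumptions, not taken from the paper.

```python
from collections import Counter

def ngram_counts(seq, n):
    """Count occurrences of every n-gram (as a tuple) in a token sequence."""
    return Counter(tuple(seq[i:i + n]) for i in range(len(seq) - n + 1))

def count_based_gain(prediction, target, n):
    """Dot product of the n-gram count vectors of prediction and target."""
    pred_counts = ngram_counts(prediction, n)
    target_counts = ngram_counts(target, n)
    # A Counter returns 0 for n-grams that never occur in the target.
    return sum(c * target_counts[g] for g, c in pred_counts.items())

target = "he would like to eat cake".split()
# Two hypothetical path outputs that differ in a single word.
path_a = "he would like to eat cake".split()
path_b = "he would love to eat cake".split()

gain_a = count_based_gain(path_a, target, n=2)  # 5 matching bigrams
gain_b = count_based_gain(path_b, target, n=2)  # 3 matching bigrams
```

Replacing a single word removes two bigram matches, because each word contributes to the gain only jointly with its neighbours. With `n=1` the same substitution costs exactly one match, which is the additive case.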

How can we design algorithms for online path learning with such non-additive gains? Can we design algorithms with favorable regret guarantees for all three settings of full information, semi- and full bandit? The key idea behind our solution is to design a new automaton $\mathcal{A}'$ whose paths can be identified with those of $\mathcal{A}$ and, crucially, whose gains are additive. We will construct $\mathcal{A}'$ by defining a set of context-dependent rewrite rules, which can be compiled into a finite-state transducer $T_{\mathcal{A}}$ defined below. The context-dependent automaton $\mathcal{A}'$ can then be obtained by composition of the transducer $T_{\mathcal{A}}$ with $\mathcal{A}$. In addition to playing a key role in the design of our algorithms (Section 4), $\mathcal{A}'$ provides a compact representation of the gains since its size is substantially less than the dimension $p$ (number of patterns).

### 3.1 Context-Dependent Rewrite Rules

We will use context-dependent rewrite rules to map $\mathcal{A}$ to the new representation $\mathcal{A}'$. These are rules that admit the following general form:

$$\phi \to \psi \,/\, \lambda \;\underline{\qquad}\; \rho,$$

where $\phi$, $\psi$, $\lambda$, and $\rho$ are regular expressions over the alphabet of the rules. These rules must be interpreted as follows: $\phi$ is to be replaced by $\psi$ whenever it is preceded by $\lambda$ and followed by $\rho$. Thus, $\lambda$ and $\rho$ represent the left and right contexts of application of the rules. Several types of rules can be considered depending on whether they are obligatory or optional, and on their direction of application: from left to right, right to left, or simultaneous (Kaplan and Kay, 1994). We will only consider rules with simultaneous application.

Such context-dependent rules can be efficiently compiled into a finite-state transducer (FST), under the technical condition that they do not rewrite their non-contextual part (Mohri and Sproat, 1996; Kaplan and Kay, 1994). (Additionally, the rules can be augmented with weights, which can help us cover the case of weighted count-based gains, in which case the result of the compilation is a weighted transducer (Mohri and Sproat, 1996). Our algorithms and theory can be extended to that case.) An FST $T$ over an input alphabet $\Sigma$ and output alphabet $\Sigma'$ defines an indicator function over the pairs of strings in $\Sigma^* \times \Sigma'^*$. Given $x \in \Sigma^*$ and $y \in \Sigma'^*$, we have $T(x, y) = 1$ if there exists a path from an initial state to a final state with input label $x$ and output label $y$, and $T(x, y) = 0$ otherwise.

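To define our rules, we first introduce the alphabet $E'$ as the set of transition names for the target automaton $\mathcal{A}'$. These capture all possible contexts of length $r$, where $r$ is the length of a pattern $\theta_k$:

$$E' = \{\#e_1 \cdots e_r \mid e_1 \cdots e_r \text{ is a path segment of length } r \text{ in } \mathcal{A},\ r \in \{|\theta_1|, \ldots, |\theta_p|\}\},$$

where the ‘#’ symbol “glues” $e_1, \ldots, e_r \in E$ together and forms one single symbol in $E'$. We will have one context-dependent rule of the following form for each element $\#e_1 \cdots e_r \in E'$:

$$e_1 \cdots e_r \to \#e_1 \cdots e_r \,/\, \epsilon \;\underline{\qquad}\; \epsilon. \tag{2}$$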

Thus, in our case, the left- and right-contexts are the empty strings, meaning that the rules can apply (simultaneously) at every position. (Context-dependent rewrite rules are powerful tools for identifying different patterns using their left- and right-contexts. For our application of count-based gains, however, identifying these patterns is independent of their context and we do not need to fully exploit the strength of these rewrite rules.) In the special case where the patterns are the set of $n$-grams, $r$ is fixed and equal to $n$. Figure 4 shows the result of the rule compilation in that case for $n = 2$. This transducer inserts $\#e_1 e_2$ whenever $e_1$ and $e_2$ are found consecutively and otherwise outputs the empty string. We will denote the resulting FST by $T_{\mathcal{A}}$.

### 3.2 Context-Dependent Automaton A′

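To construct the context-dependent automaton $\mathcal{A}'$, we will use the composition operation. The composition of $\mathcal{A}$ and $T_{\mathcal{A}}$ is an FST denoted by $\mathcal{A} \circ T_{\mathcal{A}}$ and defined as the following product of two outcomes for all inputs:

$$\forall x \in E^*,\ \forall y \in E'^*: \quad (\mathcal{A} \circ T_{\mathcal{A}})(x, y) := \mathcal{A}(x) \cdot T_{\mathcal{A}}(x, y).$$

There is an efficient algorithm for the composition of FSTs and automata (Pereira and Riley, 1997; Mohri et al., 1996; Mohri, 2009), whose worst-case complexity is in $O(|\mathcal{A}|\,|T_{\mathcal{A}}|)$. The automaton $\mathcal{A}'$ is obtained from the FST $\mathcal{A} \circ T_{\mathcal{A}}$ by projection, that is by simply omitting the input label of each transition and keeping only the output label. Thus if we denote by $\Pi$ the projection operator, then $\mathcal{A}'$ is defined as

$$\mathcal{A}' := \Pi(\mathcal{A} \circ T_{\mathcal{A}}).$$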

Observe that $\mathcal{A}'$ admits a fixed topology (states and transitions) at any round $t \in [T]$. It can be constructed in a pre-processing stage using the FST operations of composition and projection. Additional FST operations such as $\epsilon$-removal and minimization can help further optimize the automaton obtained after projection (Mohri, 2009). Proposition 3.2, proven in Appendix D, ensures that for every accepting path $\pi$ in $\mathcal{A}$, there is a unique corresponding accepting path $\pi'$ in $\mathcal{A}'$. Figure 5 shows the automata $\mathcal{A}$ and $\mathcal{A}'$ in a simple case and how a path $\pi$ in $\mathcal{A}$ is mapped to another path $\pi'$ in $\mathcal{A}'$.

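Proposition 3.2. Let $\mathcal{A}$ be an expert automaton and let $T_{\mathcal{A}}$ be a deterministic transducer representing the rewrite rules (2). Then, for each accepting path $\pi$ in $\mathcal{A}$, there exists a unique corresponding accepting path $\pi'$ in $\mathcal{A}' = \Pi(\mathcal{A} \circ T_{\mathcal{A}})$.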

The size of the context-dependent automaton $\mathcal{A}'$ depends on the expert automaton $\mathcal{A}$ and the lengths of the patterns. Notice that, crucially, its size is independent of the size of the output alphabet $\Sigma$. Appendix A analyzes more specifically the size of $\mathcal{A}'$ in the important application of ensemble structured prediction with $n$-gram gains.

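At any round $t \in [T]$ and for any $\#e_1 \cdots e_r \in E'$, let $\text{out}_t(\#e_1 \cdots e_r)$ denote the sequence $\text{out}_t(e_1) \cdots \text{out}_t(e_r)$, that is the sequence obtained by concatenating the outputs of $e_1, \ldots, e_r$. Let $\text{out}_t(\mathcal{A}')$ be the automaton with the same topology as $\mathcal{A}'$ where each label $e' \in E'$ is replaced by $\text{out}_t(e')$. Once $y_t$ is known, the representation $\Theta(y_t)$ can be found, and consequently, the additive contribution of each transition of $\mathcal{A}'$ can be computed. The following theorem, which is proved in Appendix D, shows the additivity of the gains in $\mathcal{A}'$. See Figure 6 for an example.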

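Theorem. At any round $t \in [T]$, define the gain $g_{e',t}$ of the transition $e' \in E'$ in $\mathcal{A}'$ by $g_{e',t} := [\Theta(y_t)]_k$ if $\text{out}_t(e') = \theta_k$ for some $k \in [p]$ and $g_{e',t} := 0$ if no such $k$ exists. Then, the gain of each path $\pi$ in $\mathcal{A}$ at trial $t$ can be expressed as an additive gain of the corresponding unique path $\pi'$ in $\mathcal{A}'$:

$$\forall t \in [T],\ \forall \pi \in \mathcal{P}: \quad \mathcal{U}(\text{out}_t(\pi), y_t) = \sum_{e' \in \pi'} g_{e',t}.$$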
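The additivity statement can be checked numerically on a toy example. The Python sketch below is only an illustration under stated assumptions: it sidesteps the transducer machinery (composition and projection) and instead glues each window of $n$ consecutive transitions of a path directly into one symbol, which is what the rewrite rules (2) produce for $n$-gram patterns. The toy automaton, the output labels, and all function names are hypothetical.

```python
from collections import Counter

# Toy acyclic expert automaton (hypothetical): name -> (source, destination).
TRANSITIONS = {
    "e1": (0, 1), "e2": (0, 1),
    "e3": (1, 2), "e4": (1, 2),
    "e5": (2, 3), "e6": (2, 3),
}
INITIAL, FINAL = 0, 3
N = 2  # the patterns are all bigrams

def accepting_paths(state=INITIAL):
    """Enumerate accepting paths as lists of transition names."""
    if state == FINAL:
        yield []
        return
    for name, (src, dst) in TRANSITIONS.items():
        if src == state:
            for rest in accepting_paths(dst):
                yield [name] + rest

def rewrite(path, n=N):
    """Glue every window of n consecutive transitions into one symbol of E'."""
    return [tuple(path[i:i + n]) for i in range(len(path) - n + 1)]

def ngram_counts(seq, n=N):
    return Counter(tuple(seq[i:i + n]) for i in range(len(seq) - n + 1))

def check_additivity(out_t, target):
    """Return (path, count-based gain, sum of glued-transition gains) triples."""
    theta_y = ngram_counts(target)
    results = []
    for path in accepting_paths():
        output = [out_t[e] for e in path]
        direct = sum(c * theta_y[g] for g, c in ngram_counts(output).items())
        # Gain of a glued transition: target count of the n-gram it outputs.
        additive = sum(theta_y[tuple(out_t[e] for e in glued)]
                       for glued in rewrite(path))
        results.append((path, direct, additive))
    return results

out_t = {"e1": "a", "e2": "b", "e3": "b", "e4": "a", "e5": "a", "e6": "b"}
results = check_additivity(out_t, target=list("abab"))
```

For every one of the eight accepting paths, the second and third entries of the returned triple agree, which is the identity stated in the theorem restricted to this toy case.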

## 4 Algorithms

In this section, we present algorithms and associated regret guarantees for online path learning with non-additive count-based gains in the full information, semi-bandit and full bandit settings. The key component of our algorithms is the context-dependent automaton $\mathcal{A}'$. In what follows, we denote the length of the longest path in $\mathcal{A}'$ by $K$, an upper bound on the gain of each transition in $\mathcal{A}'$ by $B$, the number of path experts by $N$, and the number of transitions and states in $\mathcal{A}'$ by $M$ and $Q$, respectively. We note that $K$ is at most the length of the longest path in $\mathcal{A}$ since each transition in $\mathcal{A}'$ admits a unique label.

#### Remark.

The number of accepting paths in $\mathcal{A}'$ is often equal to but sometimes less than the number of accepting paths in $\mathcal{A}$. In some degenerate cases, several paths $\pi_1, \ldots, \pi_k$ in $\mathcal{A}$ may correspond to one single path $\pi'$ in $\mathcal{A}'$. (For example, in the case of $n$-gram gains, all the paths in $\mathcal{A}$ with a length less than $n$ correspond to a path with empty output in $\mathcal{A}'$ and will always have a gain of zero.) This implies that $\pi_1, \ldots, \pi_k$ in $\mathcal{A}$ will always have the same gain in every round, namely the additive gain of $\pi'$ in $\mathcal{A}'$. Thus, if $\pi'$ is predicted by the algorithm in $\mathcal{A}'$, any of the paths $\pi_1, \ldots, \pi_k$ can be equivalently used for prediction in the original expert automaton $\mathcal{A}$.

### 4.1 Full Information: Context-dependent Component Hedge Algorithm

Koolen et al. (2010) gave an algorithm for online path learning with non-negative additive losses in the full information setting, the Component Hedge (CH) algorithm. For count-based losses, Cortes et al. (2015) provided an efficient Rational Randomized Weighted Majority (RRWM) algorithm. This algorithm requires the use of determinization (Mohri, 2009) which is only shown to have polynomial computational complexity under some additional technical assumptions on the outputs of the path experts. In this section, we present an extension of CH, the Context-dependent Component Hedge (CDCH), for the online path learning problem with non-additive count-based gains. CDCH admits more favorable regret guarantees than RRWM and can be efficiently implemented without any additional assumptions.

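Our CDCH algorithm requires a modification of $\mathcal{A}'$ such that all paths admit an equal number of transitions (same as the longest path). This modification can be done by adding additional states and zero-gain transitions (György et al., 2007). Abusing the notation, we will denote this new automaton by $\mathcal{A}'$ in this subsection. At each iteration $t \in [T]$, CDCH maintains a weight vector $\mathbf{w}_t$ in the unit-flow polytope over $\mathcal{A}'$, which is the set of vectors $\mathbf{w} \in \mathbb{R}^M$ with non-negative components satisfying the following conditions: (1) the weights of the outgoing transitions from the initial state sum up to one, and (2) for every non-final state, the sum of the weights of the incoming transitions equals the sum of the weights of the outgoing transitions. For each $t$, we observe the gain $g_{e',t}$ of each transition $e'$, and define the loss of that transition as $\ell_{e'} = B - g_{e',t}$. After observing the loss of each transition $e'$ in $\mathcal{A}'$, CDCH updates each component of $\mathbf{w}_t$ as $\hat{w}(e') \leftarrow w_t(e') \exp(-\eta\, \ell_{e'})$ (where $\eta$ is a specified learning rate), and sets $\mathbf{w}_{t+1}$ to the relative entropy projection of the updated $\hat{\mathbf{w}}$ back to the unit-flow polytope, i.e. $\mathbf{w}_{t+1}$ is the point $\mathbf{w}$ of the polytope minimizing the relative entropy $\Delta(\mathbf{w} \,\|\, \hat{\mathbf{w}})$.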

CDCH predicts by decomposing $\mathbf{w}_t$ into a convex combination of at most $M$ paths in $\mathcal{A}'$ and then sampling a single path according to this mixture as described below. Recall that each path in $\mathcal{A}'$ identifies a path in $\mathcal{A}$ which can be recovered in time $O(K)$. Therefore, the inference step of the CDCH algorithm takes time at most polynomial in $M$. To determine a decomposition, we find a path from the initial state to a final state with non-zero weights on all transitions, remove the largest possible weight (that is, the smallest transition weight on that path) from each transition on that path, and use it as the mixture weight for that path. Each such step sets the weight of at least one transition to zero, which is why at most $M$ paths are needed. The algorithm proceeds in this way until the outflow from the initial state is zero. The following theorem from (Koolen et al., 2010) gives a regret guarantee for the CDCH algorithm.
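The decomposition step can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it assumes the weights form an exact unit flow, that final states have no outgoing transitions, and it removes from each path the largest amount that keeps all weights non-negative (the smallest weight on that path). The function name, the toy automaton, and the example weights are hypothetical.

```python
def decompose_unit_flow(weights, transitions, initial, finals, tol=1e-12):
    """Greedily split a unit flow into (path, mixture weight) pairs.

    weights:     dict transition name -> flow value
    transitions: dict transition name -> (source state, destination state)
    """
    flow = dict(weights)
    mixture = []
    while True:
        # Follow positive-flow transitions from the initial state to a final state.
        path, state = [], initial
        while state not in finals:
            nxt = next((e for e, (src, _) in transitions.items()
                        if src == state and flow[e] > tol), None)
            if nxt is None:
                break
            path.append(nxt)
            state = transitions[nxt][1]
        if state not in finals or not path:
            return mixture  # no flow left out of the initial state
        bottleneck = min(flow[e] for e in path)
        for e in path:
            flow[e] -= bottleneck  # zeroes at least one transition
        mixture.append((path, bottleneck))

# Toy automaton: two parallel transitions per layer, states 0 -> 1 -> 2.
transitions = {"a": (0, 1), "b": (0, 1), "c": (1, 2), "d": (1, 2)}
weights = {"a": 0.5, "b": 0.5, "c": 0.25, "d": 0.75}
mixture = decompose_unit_flow(weights, transitions, initial=0, finals={2})
```

A path can then be drawn with `random.choices([p for p, _ in mixture], weights=[c for _, c in mixture])`. Since every iteration empties at least one transition, the loop runs at most once per transition.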

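Theorem 4.1. With proper tuning of the learning rate $\eta$, the regret of CDCH is bounded as below:

$$\forall \pi^* \in \mathcal{P}: \quad \sum_{t=1}^{T} \mathcal{U}(\text{out}_t(\pi^*), y_t) - \mathcal{U}(\text{out}_t(\pi_t), y_t) \leq \sqrt{2\, T B^2 K^2 \log(KM)} + B K \log(KM).$$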

The regret bounds of Theorem 4.1 are in terms of the count-based gain . Cortes et al. (2015) gave regret guarantees for the RRWM algorithm with count-based losses defined by . In Appendix E, we show that the regret associated with is upper-bounded by the regret bound associated with . Observe that, even with this approximation, the regret guarantees that we provide for CDCH are tighter by a factor of . In addition, our algorithm does not require additional assumptions for an efficient implementation compared to the RRWM algorithm of Cortes et al. (2015).

### 4.2 Semi-Bandit: Context-dependent Semi-Bandit Algorithm

György et al. (2007) gave an efficient algorithm for online path learning with additive losses in the semi-bandit setting. In this section, we present a Context-dependent Semi-Bandit (CDSB) algorithm extending that work to solving the problem of online path learning with count-based gains in a semi-bandit setting. To the best of our knowledge, this is the first efficient algorithm with favorable regret bounds for this problem.

As with the algorithm of György et al. (2007), CDSB makes use of a set $C$ of covering paths with the property that, for each $e' \in E'$, there is an accepting path $\pi$ in $C$ such that $e'$ belongs to $\pi$. At each round $t \in [T]$, CDSB keeps track of a distribution $p_t$ over all path experts by maintaining a weight $w_t(e')$ on each transition $e'$ in $\mathcal{A}'$ such that the weights of outgoing transitions for each state sum up to $1$ and $p_t(\pi') = \prod_{e' \in \pi'} w_t(e')$, for all accepting paths $\pi'$ in $\mathcal{A}'$. Therefore, we can sample a path $\pi'$ from $p_t$ in at most $K$ steps by selecting a random transition at each state according to the distribution defined by $w_t$. To make a prediction, we sample a path in $\mathcal{A}'$ according to a mixture distribution $(1 - \gamma)\, p_t + \gamma\, \mu$, where $\mu$ is a uniform distribution over paths in $C$. We select $p_t$ with probability $1 - \gamma$ or $\mu$ with probability $\gamma$ and sample a random path from the randomly chosen distribution.

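Once a path $\pi'_t$ in $\mathcal{A}'$ is sampled, we observe the gain of each transition $e'$ of $\pi'_t$, denoted by $g_{e',t}$. CDSB sets $\hat{w}_t(e') = w_t(e') \exp(\eta\, \tilde{g}_{e',t})$, where $\tilde{g}_{e',t} = (g_{e',t} + \beta)/q_{e',t}$ if $e' \in \pi'_t$ and $\tilde{g}_{e',t} = \beta/q_{e',t}$ otherwise. Here, $\eta, \beta, \gamma > 0$ are parameters of the algorithm and $q_{e',t}$ is the flow through $e'$ in $\mathcal{A}'$, which can be computed using a standard shortest-distance algorithm over the probability semiring (Mohri, 2009). The updated distribution is $p_{t+1}(\pi') \propto \prod_{e' \in \pi'} \hat{w}_t(e')$. Next, the weight pushing algorithm (Mohri, 1997) is applied (see Appendix C), which results in new transition weights $w_{t+1}$ such that the total outflow out of each state is again one and the updated probabilities are $p_{t+1}(\pi') = \prod_{e' \in \pi'} w_{t+1}(e')$, thereby facilitating sampling. The computational complexity of each of the steps above is polynomial in the size of $\mathcal{A}'$. The following theorem from György et al. (2007) provides a regret guarantee for the CDSB algorithm.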
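The weight-pushing and sampling steps can be sketched as follows. This is a simplified illustration over the probability semiring, assuming an acyclic automaton with strictly positive weights in which every state lies on some accepting path and final states have no outgoing transitions. The function names and the toy weights are hypothetical and are not taken from Appendix C.

```python
import random
from functools import lru_cache

def push_weights(w_hat, transitions, finals):
    """Renormalize weights so that outgoing weights sum to one at each state
    while every path keeps a probability proportional to its weight product."""
    outgoing = {}
    for e, (src, _) in transitions.items():
        outgoing.setdefault(src, []).append(e)

    @lru_cache(maxsize=None)
    def backward(state):
        # Total weight of all paths from `state` to a final state.
        total = 1.0 if state in finals else 0.0
        for e in outgoing.get(state, []):
            total += w_hat[e] * backward(transitions[e][1])
        return total

    return {e: w_hat[e] * backward(dst) / backward(src)
            for e, (src, dst) in transitions.items()}

def sample_path(weights, transitions, initial, finals, rng=random):
    """Sample an accepting path by picking one outgoing transition per state."""
    path, state = [], initial
    while state not in finals:
        options = [e for e, (src, _) in transitions.items() if src == state]
        e = rng.choices(options, weights=[weights[o] for o in options])[0]
        path.append(e)
        state = transitions[e][1]
    return path

transitions = {"a": (0, 1), "b": (0, 1), "c": (1, 2), "d": (1, 2)}
w_hat = {"a": 2.0, "b": 1.0, "c": 1.0, "d": 3.0}  # unnormalized weights
pushed = push_weights(w_hat, transitions, finals={2})
path = sample_path(pushed, transitions, initial=0, finals={2})
```

In this toy case the total path weight is 12, so the path `a, d` with weight product 6 ends up with probability `pushed["a"] * pushed["d"]`, which equals one half.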

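Theorem. Let $C$ denote the set of “covering paths” in $\mathcal{A}'$. For any $\delta \in (0, 1)$, with proper tuning of the parameters $\eta$, $\beta$, and $\gamma$, the regret of the CDSB algorithm can be bounded as follows with probability $1 - \delta$:

$$\forall \pi^* \in \mathcal{P}: \quad \sum_{t=1}^{T} \mathcal{U}(\text{out}_t(\pi^*), y_t) - \mathcal{U}(\text{out}_t(\pi_t), y_t) \leq 2B\sqrt{TK}\left(\sqrt{4K|C| \ln N} + \sqrt{M \ln \frac{M}{\delta}}\right).$$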

### 4.3 Full Bandit: Context-dependent ComBand Algorithm

Here, we present an algorithm for online path learning with count-based gains in the full bandit setting. Cesa-Bianchi and Lugosi (2012) gave an algorithm for online path learning with additive gains, ComBand. Our generalization, called Context-dependent ComBand (CDCB), is the first efficient algorithm with favorable regret guarantees for learning with count-based gains in this setting. For the full bandit setting with arbitrary gains, we develop an efficient execution of EXP3, called EXP3-AG, in Appendix B.

As with CDSB, CDCB maintains a distribution $p_t$ over all path experts using weights $w_t$ on the transitions such that the outflow of each state is one and the probability of each path expert is the product of the weights of the transitions along that path. To make a prediction, we sample a path in $\mathcal{A}'$ according to a mixture distribution $(1 - \gamma)\, p_t + \gamma\, \mu$, where $\mu$ is a uniform distribution over the paths in $\mathcal{A}'$. Note that this sampling can be efficiently implemented as follows. As a pre-processing step, define $\mu$ using a separate set of weights over the transitions of $\mathcal{A}'$ in the same form. Set all the weights to one and apply the weight-pushing algorithm to obtain a uniform distribution over the path experts. Next, we select $p_t$ with probability $1 - \gamma$ or $\mu$ with probability $\gamma$ and sample a random path from the randomly chosen distribution.

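After observing the scalar gain $g_{\pi'_t}$ of the chosen path, CDCB computes a surrogate gain vector $\tilde{g}_t$ for all transitions in $\mathcal{A}'$ via $\tilde{g}_t = g_{\pi'_t}\, P^{+} v_{\pi'_t}$, where $P^{+}$ is the pseudo-inverse of $\mathbb{E}[v_{\pi'_t} v_{\pi'_t}^\top]$ and $v_{\pi'_t} \in \{0, 1\}^M$ is a bit representation of the path $\pi'_t$. As for CDSB, we set $\hat{w}_t(e') = w_t(e') \exp(\eta\, \tilde{g}_{e',t})$ and update via weight-pushing to compute $w_{t+1}$. We obtain the following regret guarantees from Cesa-Bianchi and Lugosi (2012) for CDCB: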
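A short calculation suggests why the surrogate gain is a reasonable stand-in for the unobserved per-transition gains. The sketch below follows the standard ComBand argument of Cesa-Bianchi and Lugosi (2012) and uses notation introduced here as an assumption: $g_t \in \mathbb{R}^M$ is the (unobserved) vector of transition gains at round $t$, $q_t$ is the distribution from which the path is sampled, $v_\pi \in \{0,1\}^M$ is the bit vector of path $\pi$, and $P_t$ is the second-moment matrix of $v_\pi$ under $q_t$.

```latex
\begin{align*}
\tilde{g}_t &= \big(v_{\pi_t}^\top g_t\big)\, P_t^{+} v_{\pi_t},
  \qquad P_t = \mathbb{E}_{\pi \sim q_t}\!\big[v_{\pi} v_{\pi}^\top\big], \\
\mathbb{E}_{\pi_t \sim q_t}\big[\tilde{g}_t\big]
  &= P_t^{+}\, \mathbb{E}_{\pi_t \sim q_t}\!\big[v_{\pi_t} v_{\pi_t}^\top\big]\, g_t
   = P_t^{+} P_t\, g_t, \\
v_{\pi}^\top\, \mathbb{E}_{\pi_t \sim q_t}\big[\tilde{g}_t\big]
  &= v_{\pi}^\top P_t^{+} P_t\, g_t = v_{\pi}^\top g_t
  \quad \text{for every path } \pi \text{ with } q_t(\pi) > 0.
\end{align*}
```

The last equality holds because $P_t$ is symmetric, so $P_t^{+} P_t$ is the orthogonal projection onto the span of the bit vectors of the paths in the support of $q_t$. Mixing in the uniform distribution with a positive probability puts every path in that support, so the expected surrogate gain of each path equals its true additive gain.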

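Theorem. Let $\lambda_{\min}$ denote the smallest non-zero eigenvalue of $\mathbb{E}[v_{\pi} v_{\pi}^\top]$, where $v_{\pi} \in \{0, 1\}^M$ is the bit representation of the path $\pi$ which is distributed according to the uniform distribution $\mu$. With proper tuning of the parameters $\eta$ and $\gamma$, the regret of CDCB can be bounded as follows:

$$\forall \pi^* \in \mathcal{P}: \quad \sum_{t=1}^{T} \mathcal{U}(\text{out}_t(\pi^*), y_t) - \mathcal{U}(\text{out}_t(\pi_t), y_t) \leq 2B\sqrt{\left(\frac{2K}{M \lambda_{\min}} + 1\right) T M \ln N}.$$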

## 5 Extension to Gappy Count-Based Gains

Here, we generalize the results of Section 3 to a broader family of non-additive gains called gappy count-based gains: the gain of each path depends on the discounted counts of gappy occurrences of a fixed set of patterns in the sequence output by that path. In a gappy occurrence, there can be “gaps” between symbols of the pattern. The count of a gappy occurrence is discounted multiplicatively by where is a fixed discount rate and is the total length of gaps. For example, the gappy occurrences of the pattern in a sequence with discount rate are

• , length of gap = , discount factor = ;

• , length of gap = , discount factor = ;

• , length of gap = , discount factor = ,

which makes the total discounted count of gappy occurrences of in to be . Each sequence of symbols can be represented as a discounted count vector of gappy occurrences of the patterns whose th component is “the discounted number of gappy occurrences of in ”. The gain function is defined in the same way as in Equation (1).555The regular count-based gain can be recovered by setting . A typical instance of such gains is gappy -gram gains where the patterns are all -many -grams.

The key to extending our results in Section 3 to gappy $n$-grams is an appropriate definition of the alphabet $E'$, the rewrite rules, and a new context-dependent automaton $\mathcal{A}'$. Once $\mathcal{A}'$ is constructed, the algorithms and regret guarantees presented in Section 4 can be extended to gappy count-based gains. To the best of our knowledge, this provides the first efficient online algorithms with favorable regret guarantees for gappy count-based gains in full information, semi-bandit and full bandit settings.

#### Context-Dependent Rewrite Rules.

We extend the definition of $E'$ so that it also encodes the total length $k$ of the gaps, written as a subscript on each glued symbol: $(\#e_1 \cdots e_n)_k$. Note that the discount factor in gappy occurrences does not depend on the position of the gaps. Exploiting this fact, for each pattern of length $n$ and total gap length $k$, we reduce the number of output symbols by a factor of $\binom{k+n-2}{k}$ by encoding the number of gaps as opposed to the position of the gaps.

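We extend the rewrite rules in order to incorporate the gappy occurrences. Given $(\#e_{i_1} e_{i_2} \cdots e_{i_n})_k \in E'$, for all path segments $e_{j_1} e_{j_2} \cdots e_{j_{n+k}}$ of length $n + k$ in $\mathcal{A}$ where $e_{i_1} \cdots e_{i_n}$ is a subsequence of $e_{j_1} \cdots e_{j_{n+k}}$ with $i_1 = j_1$ and $i_n = j_{n+k}$, we introduce the rule:

$$e_{j_1} e_{j_2} \cdots e_{j_{n+k}} \longrightarrow (\#e_{i_1} e_{i_2} \cdots e_{i_n})_k \,/\, \epsilon \;\underline{\qquad}\; \epsilon.$$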

As with the non-gappy case in Section 3, the simultaneous application of all these rewrite rules can be efficiently compiled into an FST $T_{\mathcal{A}}$. The context-dependent transducer $T_{\mathcal{A}}$ maps any sequence of transition names in $E$ into a sequence of corresponding gappy occurrences. The example below shows how $T_{\mathcal{A}}$ outputs the gappy trigrams given a path segment of length $5$ as input:

$$\begin{aligned}
e_1, e_2, e_3, e_4, e_5 \;\xrightarrow{\;T_{\mathcal{A}}\;}\;
& (\#e_1e_2e_3)_0,\ (\#e_2e_3e_4)_0,\ (\#e_3e_4e_5)_0,\ (\#e_1e_2e_4)_1,\ (\#e_1e_3e_4)_1,\ (\#e_2e_3e_5)_1, \\
& (\#e_2e_4e_5)_1,\ (\#e_1e_2e_5)_2,\ (\#e_1e_4e_5)_2,\ (\#e_1e_3e_5)_2.
\end{aligned}$$
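The list of gappy trigrams above can be reproduced by direct enumeration. The Python sketch below is an illustration only: it does not build the transducer, and the function name is hypothetical. It picks every ordered choice of $n$ transitions from the segment and records the total gap length, i.e. the number of skipped transitions between the first and the last chosen one.

```python
from itertools import combinations

def gappy_symbols(segment, n):
    """List (subsequence, total gap length) pairs for gappy n-grams in a segment.

    The span of an occurrence runs from its first to its last kept transition;
    the gap length is the number of skipped transitions inside that span.
    """
    symbols = []
    for idx in combinations(range(len(segment)), n):
        gap = (idx[-1] - idx[0] + 1) - n
        symbols.append((tuple(segment[i] for i in idx), gap))
    # Stable sort: group by gap length, keep enumeration order within a group.
    return sorted(symbols, key=lambda s: s[1])

symbols = gappy_symbols(["e1", "e2", "e3", "e4", "e5"], n=3)
by_gap = {}
for sub, gap in symbols:
    by_gap.setdefault(gap, []).append(sub)
```

For this segment the enumeration yields three trigrams with gap 0, four with gap 1, and three with gap 2, matching the ten symbols in the example. Given a discount rate, the discounted count of a pattern is then the sum of the rate raised to the gap length over its occurrences.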

#### Context-Dependent Automaton A′.

As in Section 3.2, we construct the context-dependent automaton as $\mathcal{A}' := \Pi(\mathcal{A} \circ T_{\mathcal{A}})$, which admits a fixed topology through trials. The rewrite rules are constructed in a way such that different paths in $\mathcal{A}$ are rewritten differently. Therefore, $T_{\mathcal{A}}$ assigns a unique output to a given path expert in $\mathcal{A}$. Proposition 3.2 ensures that for every accepting path $\pi$ in $\mathcal{A}$, there is a unique corresponding accepting path $\pi'$ in $\mathcal{A}'$.

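For any round $t \in [T]$ and any $(\#e_{i_1} \cdots e_{i_n})_k \in E'$, define $\text{out}_t((\#e_{i_1} \cdots e_{i_n})_k) := (\text{out}_t(e_{i_1}) \cdots \text{out}_t(e_{i_n}))_k$. Let $\text{out}_t(\mathcal{A}')$ be the automaton with the same topology as $\mathcal{A}'$ where each label $e' \in E'$ is replaced by $\text{out}_t(e')$. Given $y_t$, the representation $\Theta(y_t)$ can be found, and consequently, the additive contribution of each transition of $\mathcal{A}'$ can be computed. Again, we show the additivity of the gain in $\mathcal{A}'$ (see Appendix D for the proof).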

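Theorem. Given the trial $t$ and discount rate $\gamma \in [0, 1]$, for each transition $e' \in E'$ in $\mathcal{A}'$, define the gain $g_{e',t} := \gamma^k\, [\Theta(y_t)]_i$ if $\text{out}_t(e') = (\theta_i)_k$ for some $i$ and $k$, and $g_{e',t} := 0$ if no such $i$ and $k$ exist. Then, the gain of each path $\pi$ in $\mathcal{A}$ at trial $t$ can be expressed as an additive gain of $\pi'$ in $\mathcal{A}'$:

$$\forall t \in [1, T],\ \forall \pi \in \mathcal{P}: \quad \mathcal{U}(\text{out}_t(\pi), y_t) = \sum_{e' \in \pi'} g_{e',t}.$$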

Since the gains are additive in the context-dependent automaton, the algorithms and regret guarantees presented in Section 4 extend to gappy count-based gains in the full information, semi-bandit and full bandit settings.

## 6 Conclusion and Open Problems

We presented several new algorithms for online non-additive path learning with favorable regret guarantees for the full information, semi-bandit, and full bandit scenarios. We conclude with two open problems: (1) Non-acyclic expert automata: we assumed here that the expert automaton $\mathcal{A}$ is acyclic and the language of patterns is finite. Solving the non-additive path learning problem with a cyclic expert automaton together with an (infinite) regular language of patterns remains an open problem; (2) Incremental construction of $\mathcal{A}'$: in this work, regardless of the data and the setting, the context-dependent automaton $\mathcal{A}'$ is constructed in advance as a pre-processing step. Is it possible to construct $\mathcal{A}'$ gradually as the learner goes through trials? Can we build $\mathcal{A}'$ incrementally in different settings and keep it as small as possible as the algorithm is exploring the set of paths and learning about the revealed data?

The work of MM was partly funded by NSF CCF-1535987 and NSF IIS-1618662. Part of this work was done while MKW was at UC Santa Cruz, supported by NSF grant IIS-1619271.

## References

• Allauzen et al. (2007) Cyril Allauzen, Michael Riley, Johan Schalkwyk, Wojciech Skut, and Mehryar Mohri. OpenFst: a general and efficient weighted finite-state transducer library. In Proceedings of CIAA, pages 11–23. Springer, 2007.
• Auer et al. (2002) Peter Auer, Nicolo Cesa-Bianchi, Yoav Freund, and Robert E Schapire. The nonstochastic multiarmed bandit problem. SIAM journal on computing, 32(1):48–77, 2002.
• Cesa-Bianchi and Lugosi (2012) Nicolo Cesa-Bianchi and Gábor Lugosi. Combinatorial bandits. Journal of Computer and System Sciences, 78(5):1404–1422, 2012.
• Cortes et al. (2014) Corinna Cortes, Vitaly Kuznetsov, and Mehryar Mohri. Ensemble methods for structured prediction. In Proceedings of ICML, 2014.
• Cortes et al. (2015) Corinna Cortes, Vitaly Kuznetsov, Mehryar Mohri, and Manfred K. Warmuth. On-line learning algorithms for path experts with non-additive losses. In Proceedings of The 28th Conference on Learning Theory, COLT 2015, Paris, France, July 3-6, 2015, pages 424–447, 2015.
• György et al. (2007) András György, Tamás Linder, Gábor Lugosi, and György Ottucsák. The on-line shortest path problem under partial monitoring. Journal of Machine Learning Research, 8(Oct):2369–2403, 2007.
• Kalai and Vempala (2005) Adam Kalai and Santosh Vempala. Efficient algorithms for online decision problems. Journal of Computer and System Sciences, 71(3):291–307, 2005.
• Kaplan and Kay (1994) Ronald M. Kaplan and Martin Kay. Regular models of phonological rule systems. Computational Linguistics, 20(3):331–378, 1994.
• Koolen et al. (2010) Wouter M. Koolen, Manfred K. Warmuth, and Jyrki Kivinen. Hedging structured concepts. In Proceedings of COLT, pages 93–105, 2010.
• Mohri (1997) Mehryar Mohri. Finite-state transducers in language and speech processing. Computational Linguistics, 23(2):269–311, 1997.
• Mohri (2002) Mehryar Mohri. Semiring Frameworks and Algorithms for Shortest-Distance Problems. Journal of Automata, Languages and Combinatorics, 7(3):321–350, 2002.
• Mohri (2009) Mehryar Mohri. Weighted automata algorithms. In Handbook of Weighted Automata, pages 213–254. Springer, 2009.
• Mohri and Sproat (1996) Mehryar Mohri and Richard Sproat. An efficient compiler for weighted rewrite rules. In Proceedings of the 34th annual meeting on Association for Computational Linguistics, pages 231–238. Association for Computational Linguistics, 1996.
• Mohri et al. (1996) Mehryar Mohri, Fernando Pereira, and Michael Riley. Weighted automata in text and speech processing. In Proceedings of ECAI-96 Workshop on Extended finite state models of language, 1996.
• Pereira and Riley (1997) Fernando Pereira and Michael Riley. Speech recognition by composition of weighted finite automata. In Finite-State Language Processing, pages 431–453. MIT Press, 1997.
• Stoltz (2005) Gilles Stoltz. Information incomplete et regret interne en prédiction de suites individuelles. PhD thesis, Univ. Paris Sud, 2005.
• Takimoto and Warmuth (2003) Eiji Takimoto and Manfred K. Warmuth. Path kernels and multiplicative updates. JMLR, 4:773–818, 2003.

## Appendix A Applications to Ensemble Structured Prediction

The algorithms discussed in Section 4 can be used for the online learning of ensembles of structured prediction experts, and consequently, significantly improve the performance of algorithms in a number of areas including machine translation, speech recognition, other language processing areas, optical character recognition, and computer vision. In structured prediction problems, the output associated with a model $h$ is a structure $y$ that can be decomposed and represented by $\ell$ substructures $y_1, \ldots, y_\ell$. For instance, $h$ may be a machine translation system and $y_i$ a particular word.

The problem of ensemble structured prediction can be described as follows. The learner has access to a set of experts to make an ensemble prediction. Therefore, at each round , the learner can use the outputs of the experts . As illustrated in Figure 7(a), each expert consists of substructures .