The combination or aggregation of predictions is central to machine learning. Traditional Bayesian updating can be viewed as a particular way of aggregating information that takes account of prior information. Notions of “mixability”, which play a central role in the setting of prediction with expert advice, offer a more general way to aggregate that also takes account of the loss function used to evaluate predictions (how well they fit the data). As shown by Vovk (2001), his more general “aggregating algorithm” reduces to Bayesian updating when log loss is used. However, as we will show, there is another design variable that to date has not been fully exploited. The aggregating algorithm makes use of a distance between the current distribution and a prior, which serves as a regulariser. In particular, the aggregating algorithm uses the KL-divergence. We consider the general setting of an arbitrary loss and an arbitrary regulariser (in the form of a Bregman divergence) and show that we recover the core technical result of traditional mixability: if a loss is mixable in the generalised sense then there is a generalised aggregating algorithm which can be guaranteed to have constant regret.
In symbols (more formally defined later), if we use $\ell_x(p_\theta)$ to denote the loss of the prediction $p_\theta$ by expert $\theta$ on observation $x$, and $D(\mu, \mu')$ is used to penalise the “distance” between the choice of updated distribution $\mu$ and what it was previously, $\mu'$, then we can recover both Bayesian updating and the updates of the aggregating algorithm as minimisers over $\mu$ of $\sum_\theta \mu(\theta)\,\ell_x(p_\theta) + D(\mu, \mu')$ via the choices summarised in the table below.
| Update | Loss | Regulariser |
|---|---|---|
| Bayesian updating | log loss | KL divergence |
| Aggregating algorithm | general mixable loss | KL divergence |
| This paper | general $\Phi$-mixable loss | general Bregman divergence |
We show that there is a single notion of mixability that applies to all three of these cases and guarantees the corresponding updates can be used to achieve constant regret.
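The first row of the table can be checked numerically. The sketch below assumes a finite set of experts, log loss, and an unscaled KL regulariser; the function names and the example numbers are illustrative and not taken from the paper. It compares the Bayesian posterior with a few other candidate mixtures under the objective "expected log loss plus KL divergence from the prior".

```python
import math

def bayes_posterior(prior, likelihoods):
    """Posterior over experts: prior times likelihood, renormalised."""
    joint = [m * q for m, q in zip(prior, likelihoods)]
    total = sum(joint)
    return [j / total for j in joint]

def objective(mu, prior, likelihoods):
    """Expected log loss under mu plus KL(mu || prior)."""
    value = 0.0
    for m, m0, q in zip(mu, prior, likelihoods):
        if m > 0:  # terms with zero mass contribute nothing
            value += m * (-math.log(q)) + m * math.log(m / m0)
    return value

# Hypothetical numbers: three experts and the probability each assigned
# to the outcome that was actually observed.
prior = [0.5, 0.3, 0.2]
likelihoods = [0.1, 0.6, 0.3]
posterior = bayes_posterior(prior, likelihoods)
best = objective(posterior, prior, likelihoods)
```

At the posterior the objective equals the negative log of the marginal likelihood, and the other mixtures tried in the check below do no better.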
We stress that the idea of the more general regularisation and updates is hardly new. See, for example, the discussion of potential-based methods in (Cesa-Bianchi, 2006) and other references later in the paper. The key novelty is the generalised notion of mixability, the name of which is justified by the key new technical result: a constant regret bound, assuming the general mixability condition, achieved via a generalised algorithm which can be seen as intimately related to mirror descent. Crucially, our result depends on some properties of the conjugates of potentials defined over probabilities that do not hold for potential functions defined over more general spaces.
1.1 Prediction With Expert Advice and Mixability
A prediction with expert advice game is defined by its loss, a collection of experts that the player must compete against, and a fixed number of rounds. In each round the experts reveal their predictions to the player and then the player makes a prediction. An observation is then revealed to the experts and the player, and all receive a penalty determined by the loss. The aim of the player is to keep its total loss close to that of the best expert once all the rounds have completed. The difference between the total loss of the player and the total loss of the best expert is called the regret and is typically the focus of the analysis of this style of game. In particular, we are interested in when the regret is constant, that is, independent of the number of rounds played.
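More formally, let $X$ denote a set of possible observations. We consider a version of the game where predictions made by the player and the experts are all distributions over $X$. The set of such distributions will be denoted $\Delta_X$ and the probability (or density) $p \in \Delta_X$ assigns to $x \in X$ will be denoted $p(x)$. A loss $\ell : \Delta_X \to \mathbb{R}^X$ assigns the penalty $\ell_x(p)$ to predicting $p \in \Delta_X$ when $x \in X$ is observed. The set of experts is denoted $\Theta$ and in each round $t = 1, \ldots, T$, each expert $\theta \in \Theta$ makes a prediction $p^t_\theta \in \Delta_X$. These are revealed to the player who makes a prediction $\hat{p}^t \in \Delta_X$. Once observation $x^t \in X$ is revealed the experts receive loss $\ell_{x^t}(p^t_\theta)$ and the player receives loss $\ell_{x^t}(\hat{p}^t)$. The aim of the player is to minimise its regret $\mathrm{Regret}(T) := L^T - \min_{\theta \in \Theta} L^T_\theta$ where $L^T := \sum_{t=1}^T \ell_{x^t}(\hat{p}^t)$ and $L^T_\theta := \sum_{t=1}^T \ell_{x^t}(p^t_\theta)$.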
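The algorithm that witnesses the original mixability result is known as the aggregating algorithm (AA) (Vovk, 2001). It works similarly to exponentiated gradient algorithms (Cesa-Bianchi, 2006) in that it updates a mixture distribution over experts based on their performance at the end of each round. (Footnote 1: To keep track of the two spaces $X$ and $\Theta$ we adopt the convention of using Roman letters for distributions in $\Delta_X$ and vectors in $\mathbb{R}^X$, and Greek letters for distributions in $\Delta_\Theta$ and vectors in $\mathbb{R}^\Theta$.) The mixture is then used to “blend” the predictions of the experts in the next round in such a way as to achieve low regret. In the aggregating algorithm, the mixture is initially set to some “prior” $\mu^0 \in \Delta_\Theta$. After $t-1$ rounds where the observations were $x^1, \ldots, x^{t-1}$ and the expert predictions were $p^s_\theta$ for $s = 1, \ldots, t-1$, the mixture is set to

$$\mu^{t-1}(\theta) := \frac{\mu^0(\theta)\,\exp(-\eta L^{t-1}_\theta)}{\sum_{\theta' \in \Theta} \mu^0(\theta')\,\exp(-\eta L^{t-1}_{\theta'})} \tag{1}$$

where $L^{t-1}_\theta := \sum_{s=1}^{t-1} \ell_{x^s}(p^s_\theta)$ is the cumulative loss of expert $\theta$ and $\eta > 0$ is the constant appearing in the mixability condition below.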
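On round $t$, after seeing all the expert predictions $p^t_\theta$, the AA plays a $\hat{p}^t \in \Delta_X$ such that for all $x \in X$

$$\ell_x(\hat{p}^t) \le -\frac{1}{\eta} \log \sum_{\theta \in \Theta} \mu^{t-1}(\theta)\,\exp\big(-\eta\,\ell_x(p^t_\theta)\big). \tag{2}$$

Mixability is precisely the condition on the loss that guarantees that such a prediction can always be found.
Definition 1. A loss $\ell$ is said to be $\eta$-mixable for $\eta > 0$ if for all mixtures $\mu \in \Delta_\Theta$ and all predictions $\{p_\theta\}_{\theta \in \Theta}$ there exists a $\hat{p} \in \Delta_X$ such that (2) holds for all $x \in X$.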
The key result concerning mixability is that it characterises when constant regret is achievable.
Theorem 1 (Vovk (2001)).
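If $\ell$ is $\eta$-mixable for some $\eta > 0$ then for any game of $T$ rounds with finitely many experts $\Theta$ the aggregating algorithm will guarantee, for every expert $\theta \in \Theta$,

$$\sum_{t=1}^T \ell_{x^t}(\hat{p}^t) \le \sum_{t=1}^T \ell_{x^t}(p^t_\theta) + \frac{\log |\Theta|}{\eta}.$$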
Furthermore, Vovk (2001) also supplies a converse: a constant regret bound is only achievable for $\eta$-mixable losses. Later work by Erven et al. (2012) has shown that mixability of proper losses (see §2.1) can be characterised in terms of the curvature of the corresponding entropy for $\ell$, that is, in terms of .
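As a concrete illustration of Theorem 1, the following sketch runs the aggregating algorithm for log loss with the constant set to 1 and a uniform prior. In that special case the weighted mixture of the expert predictions meets the mixability inequality with equality, so it can be used as the player's prediction. The function name `aa_log_loss` and the data layout are assumptions made for this example.

```python
import math

def aa_log_loss(expert_preds, outcomes):
    """Aggregating algorithm for log loss with eta = 1 and a uniform prior.

    expert_preds[t][k] is expert k's distribution (a list of probabilities)
    on round t; outcomes[t] is the index of the observed outcome.
    Returns the player's total loss and the list of expert total losses.
    """
    n = len(expert_preds[0])
    weights = [1.0 / n] * n
    player_loss = 0.0
    expert_losses = [0.0] * n
    for preds, x in zip(expert_preds, outcomes):
        # The player predicts the weighted mixture of the experts; only the
        # probability it gives to the observed outcome enters the loss.
        mix_prob = sum(w * p[x] for w, p in zip(weights, preds))
        player_loss += -math.log(mix_prob)
        losses = [-math.log(p[x]) for p in preds]
        expert_losses = [a + b for a, b in zip(expert_losses, losses)]
        # Exponential-weights update followed by renormalisation.
        weights = [w * math.exp(-l) for w, l in zip(weights, losses)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return player_loss, expert_losses
```

With two experts, the regret returned by this sketch should never exceed $\log 2$, whatever the number of rounds.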
Our main contribution is a generalisation of the notion of mixability and a corresponding generalisation of Theorem 1. Specifically, for any entropy $\Phi$ (i.e., convex function on the simplex $\Delta_\Theta$) we define $\Phi$-mixability for losses (Definition 2) and provide two equivalent characterisations that lend themselves to some novel interpretations (Lemma 4). We use these characterisations to prove the following key result. Denote by $\delta_\theta \in \Delta_\Theta$ the unit mass on $\theta$: $\delta_\theta(\theta) = 1$ and $\delta_\theta(\theta') = 0$ for all $\theta' \neq \theta$. Let $D_\Phi$ denote the Bregman divergence induced by $\Phi$, defined formally below in (4).
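Theorem 2. If $\ell$ is $\Phi$-mixable then there is a family of strategies parameterised by $\mu \in \Delta_\Theta$ which, for any sequence of observations $x^1, \ldots, x^T \in X$ and sequence of expert predictions $p^1, \ldots, p^T$, plays a sequence $\hat{p}^1, \ldots, \hat{p}^T \in \Delta_X$ such that for all $\theta \in \Theta$

$$\sum_{t=1}^T \ell_{x^t}(\hat{p}^t) \le \sum_{t=1}^T \ell_{x^t}(p^t_\theta) + D_\Phi(\delta_\theta, \mu). \tag{3}$$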
The standard notion of mixability is recovered when $\Phi = \frac{1}{\eta} H$ for $\eta > 0$ and $H$ the (negative) Shannon entropy on $\Delta_\Theta$. In this case, Theorem 1 is obtained as a corollary for $\mu$ the uniform distribution over $\Theta$. A compelling feature of our result is that it gives a natural interpretation of the constant in the regret bound: if $\mu$ is the initial guess as to which expert is best before the game starts, the “price” that is paid by the player is exactly how far (as measured by $D_\Phi$) the initial guess was from the distribution that places all its mass on the best expert.
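The size of this “price” is easy to compute in the classical case. The sketch below assumes the entropy is the negative Shannon entropy scaled by $1/\eta$ (the helper names are illustrative) and evaluates the Bregman divergence from a point mass to a strictly positive initial guess.

```python
import math

def neg_entropy(mu):
    """Negative Shannon entropy: sum of mu * log(mu), with 0 log 0 = 0."""
    return sum(m * math.log(m) for m in mu if m > 0)

def bregman_neg_entropy(mu, nu, eta=1.0):
    """Bregman divergence of neg_entropy / eta from nu to mu.

    nu must be strictly positive so that the gradient exists.
    """
    grad = [math.log(n) + 1.0 for n in nu]
    linear = sum(g * (m - n) for g, m, n in zip(grad, mu, nu))
    return (neg_entropy(mu) - neg_entropy(nu) - linear) / eta
```

For a uniform guess over four experts and $\eta = 2$ this evaluates to $\log(4)/2$, matching the constant in Theorem 1. For a non-uniform guess it evaluates to the negative log of the mass that the guess placed on the best expert, divided by $\eta$.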
In addition, an algorithm analogous to the Aggregating Algorithm is naturally recovered to witness the above bound during the construction of the proof; see (12). Like the usual Aggregating Algorithm, our “generalised Aggregating Algorithm” updates its mixtures according to the past performances of the experts. However, our algorithm is most easily understood as doing so via updates to the duals of the distributions induced by $\Phi$.
1.3 Related Work
The starting point for mixability and the aggregating algorithm is the work of Vovk (1995, 1990). The general setting of prediction with expert advice is summarised in (Cesa-Bianchi, 2006, Chapters 2 and 3). There one can find a range of results that study different aggregation schemes and different assumptions on the losses (exp-concave, mixable). Variants of the aggregating algorithm have been studied for classically mixable losses, with a tradeoff between tightness of the bound (in a constant factor) and the computational complexity (Kivinen and Warmuth, 1999). Weakly mixable losses are a generalisation of mixable losses. They have been studied in Kalnishkan and Vyugin (2008) where it is shown there exists a variant of the aggregating algorithm that achieves regret $c\sqrt{T}$ for some constant $c$. Vovk (2001, in §2.2) makes the observation that his Aggregating Algorithm reduces to Bayesian mixtures in the case of the log loss game. See also the discussion in (Cesa-Bianchi, 2006, page 330) relating certain aggregation schemes to Bayesian updating.
The general form of updating we propose is similar to that considered by Kivinen and Warmuth (1997) who consider finding a vector $w$ minimising $d(w, s) + \eta L(y_t, w \cdot x_t)$ where $s$ is some starting vector, $(x_t, y_t)$ is the instance/label observation at round $t$ and $L$ is a loss. The key difference between their formulation and ours is that our loss term is (in their notation) $w \cdot L(y_t, x_t)$, i.e., the linear combination of the losses of the $x_t$ at $y_t$ and not the loss of their inner product.
Online methods of density estimation for exponential families are discussed in (Azoury and Warmuth, 2001, §3) where they compare the online and offline updates of the same sequence and make heavy use of the relationship between the KL divergence between members of an exponential family and an associated Bregman divergence between the parameters of those members.
The analysis of mirror descent by Beck and Teboulle (2003) shows that it achieves constant regret when the entropic regulariser is used. However, they do not consider whether similar results extend to other entropies defined on the simplex.
2 Generalised Mixability
This work was motivated by the observation that the original mixability definition (2) looks very closely related to the log-sum-exp function, which is known to be the simplex-restricted conjugate of Shannon entropy. We wondered whether the proof that mixability implies constant regret was due to unique properties of Shannon entropy or whether alternative notions of entropy could lead to similar results. We found that the key step of the original mixability proof (that allows the sum of bounds to telescope) holds for any convex function defined on the simplex. This is because the conjugates of such functions have a translation invariant property that allows the original telescoping series argument to go through in the general case. By re-expressing the original proof using only the tools of convex analysis we were able to naturally derive the corresponding update algorithm and express the constant term in the bound as a Bregman divergence.
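We begin by introducing some basic concepts and notation from convex analysis. Terms not defined here can be found in a reference such as (Hiriart-Urruty and Lemaréchal, 2001). A function $\Phi : \Delta_\Theta \to \mathbb{R}$ is called an entropy if it is proper, convex, and lower semi-continuous. The Bregman divergence associated with a suitably differentiable entropy $\Phi$ is given by

$$D_\Phi(\mu, \mu') := \Phi(\mu) - \Phi(\mu') - \langle \nabla\Phi(\mu'), \mu - \mu' \rangle \tag{4}$$

for all $\mu \in \Delta_\Theta$ and $\mu' \in \mathrm{ri}(\Delta_\Theta)$, the relative interior of $\Delta_\Theta$. The convex conjugate of $\Phi$ is defined to be $\Phi^*(v) := \sup_{\mu \in \Delta_\Theta} \langle \mu, v \rangle - \Phi(\mu)$ where $v \in \Delta_\Theta^*$, i.e., the dual space to $\Delta_\Theta$. One could also write the supremum over $\mathbb{R}^\Theta$ by the convention of setting $\Phi(\mu) = +\infty$ for $\mu \notin \Delta_\Theta$. For differentiable $\Phi$, it is known that the supremum defining $\Phi^*$ is attained at $\mu = \nabla\Phi^*(v)$ (Hiriart-Urruty and Lemaréchal, 2001). That is,

$$\Phi^*(v) = \langle \nabla\Phi^*(v), v \rangle - \Phi(\nabla\Phi^*(v)). \tag{5}$$

A similar result holds for $\Phi$ by applying this result to $\Phi^*$ and using $(\Phi^*)^* = \Phi$. We will make use of this result to establish the following inequality connecting a Bregman divergence with its conjugate.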
For all and we have
By definition and using (5) expands to . Subtracting the former from the latter gives
which, when rearranged gives which then gives the result. ∎
We will also make use of a property of conjugates of entropies called translation invariance (Othman and Sandholm, 2011). This notion is central to what are called convex and coherent risk functions in mathematical finance (Föllmer and Schied, 2004). In the following result and throughout, we use $\mathbb{1}$ for the point in $\Delta_\Theta^*$ such that $\mathbb{1}_\theta = 1$ for all $\theta \in \Theta$.
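Lemma 2. If $\Phi$ is an entropy then its convex conjugate $\Phi^*$ is translation invariant, that is, for all $v \in \Delta_\Theta^*$ and $\alpha \in \mathbb{R}$ we have $\Phi^*(v + \alpha\mathbb{1}) = \Phi^*(v) + \alpha$ and its gradient satisfies $\nabla\Phi^*(v + \alpha\mathbb{1}) = \nabla\Phi^*(v)$.
Proof. By definition of the convex conjugate we have

$$\Phi^*(v + \alpha\mathbb{1}) = \sup_{\mu \in \Delta_\Theta} \langle \mu, v + \alpha\mathbb{1} \rangle - \Phi(\mu) = \sup_{\mu \in \Delta_\Theta} \langle \mu, v \rangle + \alpha - \Phi(\mu) = \Phi^*(v) + \alpha$$

since $\langle \mu, \mathbb{1} \rangle = 1$ for all $\mu \in \Delta_\Theta$. Taking derivatives of both sides gives the second part of the lemma. ∎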
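Translation invariance can be seen numerically on the log-sum-exp function, which the text identifies as the simplex-restricted conjugate of Shannon entropy. The sketch below is illustrative only: it assumes a finite expert set, and the helper names are not from the paper. The gradient of log-sum-exp is the softmax map, so the second part of the lemma says that softmax ignores constant shifts.

```python
import math

def log_sum_exp(v):
    """Numerically stable log(sum(exp(v_i)))."""
    top = max(v)
    return top + math.log(sum(math.exp(x - top) for x in v))

def softmax(v):
    """Gradient of log_sum_exp: a point in the probability simplex."""
    top = max(v)
    e = [math.exp(x - top) for x in v]
    total = sum(e)
    return [x / total for x in e]

v = [0.3, -1.2, 2.0]
alpha = 5.0
shifted = [x + alpha for x in v]
```

Shifting every coordinate of `v` by `alpha` raises the value of `log_sum_exp` by exactly `alpha` and leaves `softmax` unchanged.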
We will also make use of the readily established fact that for any convex $\Phi$ and all $\eta > 0$ we have $(\eta^{-1}\Phi)^*(v) = \eta^{-1}\Phi^*(\eta v)$.
Probably the most well studied example of what we call an entropy is the negative of the Shannon entropy, $H(\mu) := \sum_{\theta \in \Theta} \mu(\theta) \log \mu(\theta)$, which is known to be convex, proper, and lower semi-continuous and thus is an entropy. (Footnote 2: We write Shannon entropy here as a sum but can also consider the continuous version relative to some reference measure $\rho$, that is, $H(\mu) = \int_\Theta \mu(\theta) \log \mu(\theta)\, d\rho(\theta)$. For simplicity, we stick to the countable case.) When we look at the form of the original definition of mixability, we observe that it is closely related to the conjugate of $H$:

$$H^*(v) = \log \sum_{\theta \in \Theta} \exp(v_\theta), \tag{6}$$

which is sometimes called the log-sum-exp or partition function. This observation is what motivated this work and drives our generalisation to other entropies.
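The conjugacy between negative Shannon entropy and log-sum-exp can be checked directly from the definition of the convex conjugate as a supremum over the simplex. In the sketch below (finite expert set assumed, helper names illustrative) the objective inside the supremum is evaluated at the softmax of `v`, where the supremum is attained, and at a few other points of the simplex.

```python
import math

def neg_entropy(mu):
    """Negative Shannon entropy with the convention 0 log 0 = 0."""
    return sum(m * math.log(m) for m in mu if m > 0)

def log_sum_exp(v):
    top = max(v)
    return top + math.log(sum(math.exp(x - top) for x in v))

def softmax(v):
    top = max(v)
    e = [math.exp(x - top) for x in v]
    total = sum(e)
    return [x / total for x in e]

def conjugate_objective(mu, v):
    """The quantity <mu, v> - H(mu) whose supremum over mu defines H*(v)."""
    return sum(m * x for m, x in zip(mu, v)) - neg_entropy(mu)
```

The value at `softmax(v)` coincides with `log_sum_exp(v)`, and no other mixture tried in the check exceeds it.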
Entropies are known to be closely related to the Bayes risk of what are called proper losses or proper scoring rules (Dawid, 2007; Gneiting and Raftery, 2007). Specifically, if a loss $\ell$ is used to assign a penalty $\ell_x(p)$ to a prediction $p$ upon outcome $x$ it is said to be proper if its expected value under $x \sim p$ is minimised by predicting $p$. That is, for all $p, q \in \Delta_X$ we have

$$\mathbb{E}_{x \sim p}[\ell_x(q)] \ge \mathbb{E}_{x \sim p}[\ell_x(p)] =: -F^\ell(p)$$

where $-F^\ell$ is the Bayes risk of $\ell$ and is necessarily concave (Erven et al., 2012), thus making $F^\ell$ convex and thus an entropy. The correspondence also goes the other way: given any convex function $F$ we can construct a unique proper loss. The following representation can be traced back to Savage (1971) but is expressed here using conjugacy.
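Lemma 3. If $F$ is a differentiable entropy then the loss $\ell^F$ defined by

$$\ell^F(p) := F^*(\nabla F(p))\,\mathbb{1} - \nabla F(p) \tag{7}$$

is proper.
Proof. By eq. (5) we have $F^*(\nabla F(p)) = \langle p, \nabla F(p) \rangle - F(p)$, giving us

$$\langle p, \ell^F(q) \rangle - \langle p, \ell^F(p) \rangle = F(p) - F(q) - \langle \nabla F(q), p - q \rangle = D_F(p, q) \ge 0,$$

from which propriety follows. ∎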
It is straightforward to show that the proper loss associated with the negative Shannon entropy is the log loss, that is, $\ell^H_x(p) = -\log p(x)$.
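The link between an entropy and its proper loss can be sketched in code using the classical Savage form, in which the loss on outcome $x$ is $-F(p) - \langle \nabla F(p), \delta_x - p \rangle$. This form, the finite outcome set and all helper names are assumptions of the example, not notation fixed by the text. Plugging in the negative Shannon entropy should return the log loss.

```python
import math

def neg_entropy(p):
    return sum(px * math.log(px) for px in p if px > 0)

def grad_neg_entropy(p):
    """Gradient of neg_entropy; requires strictly positive p."""
    return [math.log(px) + 1.0 for px in p]

def savage_loss(F, grad_F, p):
    """Loss vector with entries -F(p) - <grad F(p), delta_x - p>."""
    g = grad_F(p)
    inner = sum(gx * px for gx, px in zip(g, p))
    return [-F(p) - (gx - inner) for gx in g]

def expected_loss(p, q):
    """Expected loss of predicting q when outcomes are drawn from p."""
    loss = savage_loss(neg_entropy, grad_neg_entropy, q)
    return sum(px * lx for px, lx in zip(p, loss))
```

Propriety then shows up as `expected_loss(p, q)` being smallest when `q` equals `p`.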
For a loss define the assessment to be the loss of each model/expert on observation , i.e.,
Suppose is a differentiable entropy on . A loss is -mixable if for all and all there is a such that for all ,
We can readily show that this definition reduces to the standard mixability definition when since, in this case,
by (6) and the fact that for any convex . As mentioned above, the proper loss corresponding to this choice of is easily seen to be by substitution into (7). Thus, the mixability inequality becomes which is equivalent to (2).
We now show that the above definition is equivalent to one involving the Bregman divergence for and also the difference in the “potential” evaluated at before and after it is updated by .
Suppose is a differentiable entropy on . Then the -mixability condition (8) is equivalent to the following:
3 The Generalised Aggregating Algorithm
In this section we prove our main result (Theorem 2) and examine the “generalised Aggregating Algorithm” that witnesses the bound. The updating strategy we use is the one that repeatedly returns the minimiser of the right-hand side of (10).
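On round $t$, after observing $x^t \in X$, the generalised aggregating algorithm (GAA) updates the mixture $\mu^{t-1} \in \Delta_\Theta$ by setting

$$\mu^t := \arg\min_{\mu \in \Delta_\Theta}\ \langle \mu, \ell_{x^t}(p^t) \rangle + D_\Phi(\mu, \mu^{t-1}), \tag{12}$$

where $\ell_{x^t}(p^t) := (\ell_{x^t}(p^t_\theta))_{\theta \in \Theta}$ is the assessment of the experts on round $t$.
The next lemma shows that this updating process simply aggregates the assessments in the dual space with $\nabla\Phi(\mu^0)$ as the starting point.
Lemma 5. The GAA updates $\mu^t$ satisfy $\nabla\Phi(\mu^t) = \nabla\Phi(\mu^{t-1}) - \ell_{x^t}(p^t)$ for all $t$ and so

$$\nabla\Phi(\mu^T) = \nabla\Phi(\mu^0) - \sum_{t=1}^T \ell_{x^t}(p^t). \tag{13}$$

Proof. By considering the Lagrangian $\mathcal{L}(\mu, \alpha) = \langle \mu, \ell_{x^t}(p^t) \rangle + D_\Phi(\mu, \mu^{t-1}) + \alpha(\langle \mu, \mathbb{1} \rangle - 1)$ and setting its derivative to zero we see that the minimising $\mu^t$ must satisfy $\nabla\Phi(\mu^t) = \nabla\Phi(\mu^{t-1}) - \ell_{x^t}(p^t) - \alpha_t\mathbb{1}$ where $\alpha_t$ is the dual variable at step $t$. For convex $\Phi$, the functions $\nabla\Phi^*$ and $\nabla\Phi$ are inverses (Hiriart-Urruty and Lemaréchal, 2001) so $\mu^t = \nabla\Phi^*(\nabla\Phi(\mu^{t-1}) - \ell_{x^t}(p^t) - \alpha_t\mathbb{1}) = \nabla\Phi^*(\nabla\Phi(\mu^{t-1}) - \ell_{x^t}(p^t))$ by the translation invariance of $\Phi^*$ (Lemma 2). This means the constants $\alpha_t$ are arbitrary and can be ignored. Thus, the mixture updates satisfy the relation in the lemma and summing over $t$ gives (13). ∎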
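The dual-space view of the update can be made concrete for the classical choice of entropy. The sketch below assumes the entropy is the negative Shannon entropy scaled by $1/\eta$ over a finite expert set, so that its gradient is $(\log \mu + 1)/\eta$ and the gradient of its simplex-restricted conjugate is the softmax of $\eta v$. The function names are illustrative. Subtracting the loss vector in the dual space and mapping back should then reproduce the familiar exponential-weights update.

```python
import math

def grad_phi(mu, eta):
    """Gradient of (1/eta) * sum(mu * log(mu)); requires positive mu."""
    return [(math.log(m) + 1.0) / eta for m in mu]

def grad_phi_star(v, eta):
    """Gradient of the simplex-restricted conjugate: softmax of eta * v."""
    top = max(v)
    e = [math.exp(eta * (x - top)) for x in v]
    total = sum(e)
    return [x / total for x in e]

def gaa_step(mu, losses, eta):
    """One update: move to the dual, subtract the losses, map back."""
    dual = [g - l for g, l in zip(grad_phi(mu, eta), losses)]
    return grad_phi_star(dual, eta)

def exp_weights_step(mu, losses, eta):
    """Classical multiplicative update with renormalisation."""
    w = [m * math.exp(-eta * l) for m, l in zip(mu, losses)]
    total = sum(w)
    return [x / total for x in w]
```

Because the softmax ignores constant shifts, the additive constant in `grad_phi` plays no role, which mirrors the way the dual variables drop out in the proof.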
To see how the updates just described are indeed a generalisation of those used by the original aggregating algorithm, we can substitute and in (12). Because is maximal for uniform distributions we must have and so . However, by (10) we see that
and then substituting gives the update equation in (1).
3.1 The proof of Theorem 2
Proof of Theorem 2.
Note that the proof above gives us something even stronger — eq. (16) states that the GAA satisfies the stronger condition that eq. (3) hold for all , in addition to all , where the loss is an expected loss under . In particular, choosing in eq. (15), we have
Finally, we briefly note some similarities between the Generalised Aggregating Algorithm and the literature on automated market makers for prediction markets. The now-standard framework of Abernethy et al. (2013) defines the cost of a purchase of some bundle of securities as the difference in a convex potential function. Formally, for some convex , a purchase of bundle given current market state is given by . The instantaneous prices in the market at state are therefore . As the prices correspond to probabilities in their framework, it must be the case that satisfies . From this we can conclude as we have done above that is translation invariant, and thus one can restate the cost of the bundle as .
We are now in a position to draw an analogy with our GAA. The formulation of $\Phi$-mixability in eq. (11) says that the loss upon observing $x$ must be bounded above by , which is exactly the negative of the expression above, where , , and . Thus, $\Phi$-mixability is saying the loss must be at least as good for the algorithm as in the market making setting, and hence it is not surprising that the loss bounds are the same in both settings; see Abernethy et al. (2013) for more details.
4 Future Work
Our exploration into a generalised notion of mixability opens more doors than it closes. In the following, we briefly outline several open directions.
Relation to original mixability result
The proof of our main result, Theorem 2, shows that in essence, an algorithm can guarantee constant regret, expressed in terms of a -divergence between a starting point and the best expert, if and only if the underlying loss is -mixable. The original mixability result of Vovk (2001) states that one achieves a constant regret of if and only if is, in our terminology, -mixable. But of course for any which is bounded on , the penalty is also bounded, and hence it would seem that for all bounded , a loss is mixable in the sense of Vovk if and only if it is -mixable for some .
Relation to curvatures of and
A recent result of Erven et al. (2012) shows that the mixability constant from the original Definition 1 can be calculated as the ratio of curvatures between the Bayes risk of the loss and Shannon entropy. It would stand to reason therefore that for any , the -mixability constant for a loss , defined as the largest such that is -mixable, would be similarly defined as the ratio to instead of .
Optimal regret bound
The curvature discussion above addresses the question of finding, given a , the largest such that is -mixable. Note that the larger is, the smaller the corresponding regret term is. Hence, for fixed , this -mixability constant yields the tightest bound. The question remains, however: what is the tightest bound one can achieve across all choices of ? Again in reference to Vovk, it seems that the choice of may not matter, at least as long as is a constant independent of . It would be clarifying to directly assert this claim or find a counter-example.
- Abernethy et al.  Jacob Abernethy, Yiling Chen, and Jennifer Wortman Vaughan. Efficient market making via convex optimization, and a connection to online learning. ACM Transactions on Economics and Computation, 1(2):12, 2013. URL http://dl.acm.org/citation.cfm?id=2465777.
- Azoury and Warmuth  Katy S Azoury and Manfred K Warmuth. Relative loss bounds for on-line density estimation with the exponential family of distributions. Machine Learning, 43(3):211–246, 2001.
- Beck and Teboulle  Amir Beck and Marc Teboulle. Mirror descent and nonlinear projected subgradient methods for convex optimization. Operations Research Letters, 31(3):167–175, 2003.
- Cesa-Bianchi  Nicolo Cesa-Bianchi. Prediction, learning, and games. Cambridge University Press, 2006.
- Dawid  A Philip Dawid. The geometry of proper scoring rules. Annals of the Institute of Statistical Mathematics, 59(1):77–93, 2007.
- Erven et al.  Tim van Erven, Mark D Reid, and Robert C Williamson. Mixability is Bayes risk curvature relative to log loss. The Journal of Machine Learning Research, 13:1639–1663, 2012.
- Föllmer and Schied  Hans Föllmer and Alexander Schied. Stochastic Finance, volume 27 of De Gruyter Studies in Mathematics, 2004.
- Gneiting and Raftery  Tilmann Gneiting and Adrian E Raftery. Strictly proper scoring rules, prediction, and estimation. Journal of the American Statistical Association, 102(477):359–378, 2007.
- Hiriart-Urruty and Lemaréchal  J.B. Hiriart-Urruty and C. Lemaréchal. Fundamentals of convex analysis. Springer Verlag, 2001.
- Kalnishkan and Vyugin  Yuri Kalnishkan and Michael V. Vyugin. The weak aggregating algorithm and weak mixability. Journal of Computer and System Sciences, 74:1228–1244, 2008.
- Kivinen and Warmuth  Jyrki Kivinen and Manfred K Warmuth. Exponentiated gradient versus gradient descent for linear predictors. Information and Computation, 132(1):1–63, 1997.
- Kivinen and Warmuth  Jyrki Kivinen and Manfred K Warmuth. Averaging expert predictions. In Computational Learning Theory, pages 153–167. Springer, 1999.
- Othman and Sandholm  Abraham Othman and Tuomas Sandholm. Liquidity-sensitive automated market makers via homogeneous risk measures. In Internet and Network Economics, pages 314–325. Springer, 2011.
- Savage  Leonard J Savage. Elicitation of personal probabilities and expectations. Journal of the American Statistical Association, 66(336):783–801, 1971.
- Vovk  Volodya Vovk. Aggregating strategies. In Proceedings of the Third Annual Workshop on Computational Learning Theory (COLT), pages 371–383, 1990.
- Vovk  Volodya Vovk. A game of prediction with expert advice. In Proceedings of the Eighth Annual Conference on Computational Learning Theory, pages 51–60. ACM, 1995.
- Vovk  Volodya Vovk. Competitive on-line statistics. International Statistical Review, 69(2):213–248, 2001.