In this paper we use PDE tools to analyze one of the classical problems in machine learning, namely prediction with expert advice. In this framework, a game is played between a player and the nature (also called the adversary in learning literature). At each time step, given past information, the player has to choose an expert among experts. Simultaneously the nature chooses the winning experts. Then, both choices are announced. If the player chooses a winning expert, the player also wins. The objective of the player is to minimize his regret with respect to the best performing expert, i.e., minimize
where is the total gain of the expert and is the gain of the player at the final time. The objective of the nature is to choose the winning experts so as to maximize the regret of the player. This problem, which has been extensively studied in learning theory [6, 5, 20, 12, 13, 17, 22], can also be seen as a discrete time and discrete space robust utility maximization problem similar to  for a particular choice of the utility function.
For the case of 2 experts, the optimal strategy for the adversary was first described by Cover  in the 1960’s. Recently, using an ansatz of power type for the best regret, Gravin et al. showed in  that the so-called comb strategy, in which the nature chooses the leading and the third leading experts, is optimal. However, the power type ansatz for the value of the game does not generalize to a larger number of experts.
In this paper, we follow the setting of , where the maturity of the game is a geometric random variable with parameter  and study the game where both the player and the nature can use randomized strategies. In this framework, we prove 2 conjectures stated in  for the game with experts. We use tools from stochastic analysis and PDE theory to give an explicit expansion of the value function of the game for small , i.e., long time asymptotics. In Theorem 3.1, this expansion allows us to prove that the value of the game, also called best regret, indeed grows as as conjectured in .
The proof of this result proceeds in two steps. The first step can be found in , where, using tools from viscosity theory, the author shows that the rescaled value function (2.4) solves the elliptic PDE (2.5). The second step, which is the main contribution of this paper, is to explicitly solve this PDE for the case of 4 experts. In order to find this expression, we use the conjectured optimal strategy in , and relate the value function of the control problem (2.6) to an expectation of a functional of an obliquely reflected Brownian motion. This expression is a discounted expected value of the local time that measures the number of times the best two experts’ gains cross each other. Then, using appropriate differentiation of the dynamic programming equation (5.8), we characterize the value of the expectation on two “opposite” faces of the domain of reflection by a system of hyperbolic PDEs (5.17) and (5.18). We then solve this system to explicitly compute the value for the conjectured control at the boundary, which in turn leads to the value in the whole domain. Finally, we check that the value given for the conjectured control solves the nonlinear PDE (2.5), thus proving, by a simple verification argument, the optimality of the comb strategy, which is the second conjecture in  that we prove; see Theorem 3.2. The direct proof of the verification argument is quite tedious. We instead use a method that relies on Proposition 5.4, which is a type of maximum principle for the system of hyperbolic equations (5.28).
From the perspective of control theory, we note that the setting of  is in fact similar to the weak formulation (or feedback/closed loop formulation) of zero-sum games in the sense of  (see also ) where the player and the nature observe the same source of information, i.e., the path of the gains of the experts and the player. One can also state the game in the Elliott-Kalton sense  where, similarly to , the nature learns the choice of the player before making its decision. These two formulations generally lead to different values; see Remark 4.2 in .
Our expansion is in accordance with well known results in prediction problems. Indeed, it is known that in the long run, there is an upper bound for the value of regret minimization problems that grows at most as which is achieved by the so-called multiplicative weight algorithms . In this paper, we compute the exact scaling for the geometric stopping problem which also allows us to directly provide explicit algorithms for both the player and the nature.
The rest of the paper is organized as follows. In Section 2, we introduce our notation and define the value function of the regret minimization problem. In Section 3, we give the main results of the paper. These results are proven in Section 6. Sections 4 and 5 present the methodology used to find the explicit solution (3.1).
2 Statement of the problem
We fix and denote by
the set of probability measures on and by the set of probability measures on , the power set of . These sets of probability measures are in fact in bijection with, respectively, the and dimensional unit simplexes. We denote by the canonical basis of and for , stands for . Similarly to , for all , we denote by the ranked coordinates of with
and define the function
We assume that a player and the nature interact through the evolution of the state of experts. At time , the state of the game at hand is described by , the history of the gains of each expert and the history of the gains of the player. At time step , observing , simultaneously, the player chooses and the nature chooses . The gain of each expert chosen by the nature increases by i.e.,
If the player also chooses an expert chosen by the nature, then the gain of player also increases i.e.,
The regret of the player at time is defined as
Let denote the random maturity of the problem. We assume that is a geometric random variable with parameter .
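The rules of the game described above can be illustrated by a minimal simulation sketch. Everything named here (`simulate_regret`, `nature_all`, `player_first`, the argument names) is hypothetical and introduced only for illustration. The sketch assumes that each winning expert gains one unit per round and that the game stops before each round with probability `delta`; the exact convention for the geometric maturity is an assumption.

```python
import random


def simulate_regret(n_experts, delta, nature, player, rng):
    """Play one game with geometric maturity and return the final regret.

    nature(gains, player_gain, rng) -> set of winning expert indices
    player(gains, player_gain, rng) -> index of the expert the player follows
    Both callables see the same state, mirroring the simultaneous choice.
    """
    gains = [0] * n_experts  # cumulative gain of each expert
    player_gain = 0          # cumulative gain of the player
    while True:
        # Assumed convention: stop before each round with probability delta.
        if rng.random() < delta:
            break
        # Both choices are made from the same past information ...
        winners = nature(gains, player_gain, rng)
        pick = player(gains, player_gain, rng)
        # ... and only then announced and applied.
        for i in winners:
            gains[i] += 1  # assumed unit gain per winning expert
        if pick in winners:
            player_gain += 1
    # Regret with respect to the best performing expert.
    return max(gains) - player_gain


def nature_all(gains, player_gain, rng):
    """Illustrative nature strategy: every expert wins every round."""
    return set(range(len(gains)))


def player_first(gains, player_gain, rng):
    """Illustrative player strategy: always follow expert 0."""
    return 0
```

When the nature lets every expert win, the player can never fall behind, so the regret returned by `simulate_regret(4, 0.1, nature_all, player_first, random.Random(0))` is zero regardless of the random maturity.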
We now convexify the problem by assuming that instead of choosing deterministic and , the nature and the player choose randomized strategies. At time , the player chooses a probability distribution and the nature chooses that may depend on the observation . We denote by the set of such sequences and by the set of such sequences . With some notational abuse, we denote by the random variable with distribution and the random variable with distribution .
The objective of the player is to minimize his expected regret at time and the objective of the nature is to maximize the regret of the player. Hence we have a zero-sum game with the lower and the upper values for the game
where is the probability distribution under which we evaluate the regret given the controls and . We denote by
The game has a value, i.e.,
There exists independent of such that for all and we have that
Additionally, satisfies the following dynamic programming principle
2.1 Limiting behavior of
The main objective of the paper is to provide an explicit formula at the leading order for the function for small . For this purpose define the rescaled value function:
The next result shows that the limiting behavior of the value of the game can be characterized by the value of a stochastic control problem.
As , the function converges locally uniformly to which is the unique viscosity solution of the equation
in the class of functions with linear growth. Additionally admits the Feynman-Kac representation
where with a 1-dimensional Brownian motion and the progressively measurable process satisfies for all , for some .
The fact that converges to is a consequence of [8, Theorem 7]. Note also that an analysis of the proof of [8, Theorem 7] and the general methodology of proof in  allows us to claim that the convergence is in fact locally uniform. The fact that admits the representation (2.6) is a consequence of the uniqueness of the viscosity solution of (2.5) with linear growth, which is proven in [7, Theorem 5.1], and the stochastic Perron’s method of . ∎
3 Main Results
3.1 Explicit solution for experts
The main contribution of the paper is to provide a method to explicitly solve the PDE (2.5).
With for experts, for , the function is given by the expression
Additionally, is twice continuously differentiable and monotone, and if is a maximizer of the Hamiltonian then its complement is also a maximizer of the same Hamiltonian.
Footnote: Monotone here means
In fact, has the following expansion at the origin
The proof of this result is provided in Section 6 after developing the methodology required to obtain this expression. Note that one can check by hand (or preferably with a computer) that the expression given in (3.1) solves the equation (2.5) when all are different from each other. However, due to potential discontinuities of the derivatives when two of the are equal, we need to check that the almost everywhere solution of the equation (2.5) defined via this expression is twice continuously differentiable and is therefore a smooth solution. ∎
Equation (3.3) is the main result for the long time behavior of the regret minimization problem with geometric stopping and was conjectured in . The optimal regret scales as the square root of the time scale at hand. In this case of geometric stopping gives the term of proportionality between the optimal regret and the stopping time parameter.
3.2 Asymptotically optimal strategies
Given the value of , we now describe a family of asymptotically optimal strategies for nature. Inspired by  we give the following definition.
(i) We denote
the set of maximizers of the Hamiltonian.
(ii) For all with , we denote the comb strategy, which is the control for the problem (2.6) that consists in choosing the experts and . We take the convention that if two components and of the point are equal for then the ordering of the point is taken with .
(iii) We denote the balanced comb strategy which is the control for the nature in game (2.3) that consists in choosing at , with probability and with probability .
One may conjecture that it is asymptotically optimal for the nature to choose for all an element in . However, this conjecture is not true since the strategy is not balanced in the sense of . Indeed, assume for example that for is reduced to a unique subset of cardinality , meaning . In this case, choosing the expert would be suboptimal for the nature since the player can also guess this control and choose the expert . It is proven in  that in order to be optimal any strategy of the nature has to be balanced. Thanks to Theorem 3.1, the simplest strategy for the nature would be to randomize its strategy between the maximizer of the Hamiltonian and its complement.
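The comb strategy and its balanced version can be sketched in a few lines. This is only an illustration: the function names are hypothetical, ties are broken in favor of the lower index (an assumed convention), and the balanced strategy is assumed to pick the comb set and its complement with probability 1/2 each.

```python
import random


def ranked_indices(x):
    """Expert indices ordered from the largest gain to the smallest.

    Assumed tie-breaking convention: among equal gains, the lower index
    is ranked first.
    """
    return sorted(range(len(x)), key=lambda i: (-x[i], i))


def comb_set(x):
    """Comb strategy: the leading and the third leading experts."""
    order = ranked_indices(x)
    return frozenset((order[0], order[2]))


def balanced_comb(x, rng):
    """Randomize between the comb set and its complement.

    The probability 1/2 for each branch is an assumption of this sketch.
    """
    comb = comb_set(x)
    complement = frozenset(range(len(x))) - comb
    return comb if rng.random() < 0.5 else complement
```

For example, with gains `[3, 1, 2, 0]` the ranking is expert 0, then 2, then 1, then 3, so `comb_set` selects experts 0 and 1, and `balanced_comb` returns either that set or its complement, experts 2 and 3.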
The main result for asymptotically optimal strategies is the following theorem.
The control is asymptotically optimal for the nature, in the sense that
where is locally uniform in , and we denote
The proof is deferred to Section 6.2. We will finish this section with a few remarks.
As a sanity check, the expansion of implies that the Hessian of is
where the second equality follows from (2.5) and the optimality of the comb strategies.
We note that at the leading order it is optimal for the nature to choose the controls in the sense that for every family and for , we have that
This inequality means that up to an error negligible at the leading order, the comb strategy is optimal for the nature.
which is obtained by taking a continuum analogue of . Compared to this 3-dimensional counterpart, the expression (3.1) is not a simple sum of exponentials. Instead of a guess-and-verify approach, we needed to directly compute the value of comb strategies.
Note that for all we have with . Hence . The claim is a direct consequence of (3.2). Thanks to this observation, we can define via the feedback control : at point , the player chooses the expert with probability and define the value
We conjecture that
which would imply that is an asymptotically optimal strategy for the player. The main difficulty in obtaining such a result is establishing locally uniform bounds for when .
4 Value for comb strategies
Inspired by the conjecture in , our objective here is to introduce the value of the control problem (2.6) corresponding to comb strategies. Then, in Section 5, we develop a methodology to compute this value. Finally, in Section 6, we check that the value computed in these sections is a solution to (2.5).
We note that Sections 4 and 5 are only included in the paper to explain how to find the expression (3.1). Indeed, the only rigorous proofs for our results are in Section 6. Therefore, in Sections 4 and 5, we will slightly deviate from mathematical rigor.
The optimal strategy for (2.6) conjectured in  consists in choosing the best and the third best experts. This is a rank-based interaction for the evolution of the components of , the optimally controlled state. Therefore, for any , it is expected that solves the following SDE
where and is the control corresponding to comb strategy.
It is not clear that (4.1) admits a strong solution. In fact, based on [9, Theorem 4.1], we conjecture that there is no strong solution to (4.1). However, it is expected that the ranked components are well-defined. Given also the fact that the payoff of the problem is symmetric, we will directly define our value of interest via an obliquely reflected Brownian motion. This procedure also allows a reduction of the dimension of the problem. We first recall the definition of an obliquely reflected Brownian motion given in [23, Definition 2.1].
We say that the family of continuous processes and probability measures is a weak solution to the semimartingale reflected Brownian motion on with covariance matrix and reflection matrix if
i) For all and
ii) The process is a Brownian motion with covariance matrix under .
iii) is adapted to the filtration generated by , , is continuous, nondecreasing, and
We will denote by the family with
and . These processes have the following semimartingale decomposition for ,
and denote for the local time of at the origin. The matrix
is a tridiagonal Toeplitz matrix whose eigenvalues are less than in absolute value. Therefore, thanks to [23, Theorem 2.1], there exists a unique solution to the oblique reflection problem. However, the existence of a solution to (4.1) is not straightforward. If a solution to this system exists then we clearly would have
with . We will assume that this is the case. (This is the only non-rigorous part of the derivation. But we should again remark that a rigorous verification of our claims is in Section 6 and the arguments here are performed for giving an intuitive construction of the solution.) In the sequel we will denote
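The eigenvalue bound for a tridiagonal Toeplitz matrix used above can be checked with the classical closed form: for an n x n symmetric tridiagonal Toeplitz matrix with diagonal entry a and off-diagonal entry b, the eigenvalues are a + 2b cos(kπ/(n+1)) for k = 1, ..., n, with eigenvectors whose j-th entry is sin(jkπ/(n+1)). The sketch below verifies this on a hypothetical instance (size 3, a = 0, b = 1/2); these values are assumptions chosen for illustration and are not taken from the matrix of this section.

```python
import math


def toeplitz_eigenpairs(n, a, b):
    """Closed-form eigenpairs of the n x n symmetric tridiagonal Toeplitz
    matrix with a on the diagonal and b on both off-diagonals."""
    pairs = []
    for k in range(1, n + 1):
        theta = k * math.pi / (n + 1)
        eigenvalue = a + 2 * b * math.cos(theta)
        eigenvector = [math.sin(j * theta) for j in range(1, n + 1)]
        pairs.append((eigenvalue, eigenvector))
    return pairs


def apply_tridiagonal(a, b, v):
    """Multiply the tridiagonal Toeplitz matrix by the vector v."""
    n = len(v)
    out = []
    for i in range(n):
        s = a * v[i]
        if i > 0:
            s += b * v[i - 1]
        if i < n - 1:
            s += b * v[i + 1]
        out.append(s)
    return out


# Hypothetical instance (assumed values, for illustration only).
n, a, b = 3, 0.0, 0.5
pairs = toeplitz_eigenpairs(n, a, b)
spectral_radius = max(abs(lam) for lam, _ in pairs)

# Residual of A v = lambda v over all eigenpairs; it should be near zero.
max_residual = max(
    abs(av_i - lam * v_i)
    for lam, v in pairs
    for av_i, v_i in zip(apply_tridiagonal(a, b, v), v)
)
```

In this assumed instance the spectral radius is cos(π/4), which is strictly below 1, the kind of bound that existence and uniqueness results for oblique reflection typically require.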
4.2 Value associated to an obliquely reflected Brownian motion
We now give a lemma that allows us to define our candidate solution to (2.5).
Assume that there exists a weak solution to (4.1). Then for all we have
One interpretation of the previous lemma is that the optimal strategy aims to maximize the third component of the local time of a reflected Brownian motion. This is consistent with the discrete time problem in the case or where the optimal strategies of the nature are proven to maximize the number of crossings between the leading and the second leading experts [6, 12]. We note that this strategy also maximizes the expected value of where is an exponentially distributed stopping time.
The function defined by
is a viscosity solution of
with the reflection conditions
5 Characterization of the value on the reflection boundary
We now characterize the function via a system of first-order hyperbolic PDEs.
5.1 The value of for
We start by characterizing on the set .
The function admits the following factorisation
and solves the two-dimensional obliquely reflected Brownian motion problem
where are the local times at zero of and respectively.
Additionally, for all , we have
If it is clear due to the uniqueness of the solution of the oblique reflection problem (4.1) that for all
First, we compute the functions
Let and and define
where and . Then, by the dynamic programming principle
Assuming is smooth we differentiate this equality in , then in the expression we send for fixed to obtain