# An Online-Learning Approach to Inverse Optimization

In this paper, we demonstrate how to learn the objective function of a decision-maker while only observing the problem input data and the decision-maker's corresponding decisions over multiple rounds. Our approach is based on online learning and works for linear objectives over arbitrary feasible sets for which we have a linear optimization oracle. As such, it generalizes previous approaches based on KKT-system decomposition and dualization. The two exact algorithms we present, based on multiplicative weights updates and online gradient descent respectively, converge at a rate of $O(1/\sqrt{T})$ and thus allow taking decisions which are essentially as good as those of the observed decision-maker already after relatively few observations. We also discuss several useful generalizations, such as the approximate learning of non-linear objective functions and the case of suboptimal observations. Finally, we show the effectiveness and possible applications of our methods in a broad computational study.

## 1 Introduction

Human decision-makers are very good at taking decisions under rather imprecise specification of the decision-making problem, both in terms of constraints as well as objective. One might argue that the human decision-maker can pretty reliably learn from observed previous decisions – a traditional learning-by-example setup. At the same time, when we try to turn these decision-making problems into actual optimization problems, we often run into all types of issues in terms of specifying the model. In an optimal world, we would be able to infer or learn the optimization problem from previously observed decisions taken by an expert.

This problem naturally occurs in many settings where we do not have direct access to the decision-maker’s preference or objective function but can observe his behaviour, and where the learner as well as the decision-maker have access to the same information. Natural examples are as diverse as making recommendations based on user history and strategic planning problems, where the agent’s preferences are unknown but the system is observable. Other examples include knowledge transfer from a human planner into a decision support system: often human operators have arrived at finely-tuned “objective functions” through many years of experience, and in many cases it is desirable to replicate the decision-making process both for scaling up and also for potentially including it in large-scale scenario analysis and simulation to explore responses under varying conditions.

Here we consider the learning of preferences or objectives from an expert by means of observing his actions. More precisely, we observe a set of input parameters and corresponding decisions of the form $(p_1, x_1), \ldots, (p_T, x_T)$. They are such that $p_t \in P$ with $t = 1, \ldots, T$ is a certain realization of problem parameters from a given set $P \subseteq \mathbb{R}^k$ and $x_t$ is an optimal solution to the optimization problem

$$
\begin{aligned}
\max\quad & c_{\mathrm{true}}^\top x && (1)\\
\text{s.t.}\quad & x \in X(p_t), && (2)
\end{aligned}
$$

where $c_{\mathrm{true}} \in \mathbb{R}^n$ is the expert's true but unknown objective and $X(p_t) \subseteq \mathbb{R}^n$ for some (fixed) $n \in \mathbb{N}$. We assume that we have full information on the feasible set $X(p_t)$ and that we can compute $\operatorname{argmax}\{c^\top x \mid x \in X(p_t)\}$ for any candidate objective $c \in \mathbb{R}^n$ and $t = 1, \ldots, T$. We present two online-learning algorithms, based on the multiplicative weights update (MWU) algorithm and online gradient descent (OGD) respectively, that allow us to learn a strategy $(c_1, \ldots, c_T)$ of subsequent objective function choices with the following guarantee: if we optimize according to the surrogate objective function $c_t$ instead of the actual unknown objective function $c_{\mathrm{true}}$ in response to parameter realization $p_t$, we obtain a sequence of optimal decisions (w.r.t. each $c_t$) given by

$$
\bar{x}_t = \operatorname{argmax}\{c_t^\top x \mid x \in X(p_t)\} \tag{3}
$$

that are essentially as good as the decisions $x_t$ taken by the expert on average. To this end, we interpret the observations of parameters and expert solutions as revealed over multiple rounds such that in each round $t$ we are shown the parameters $p_t$ first, then take our optimal decision $\bar{x}_t$ according to our objective function $c_t$, then we are shown the solution $x_t$ chosen by the expert, and finally we are allowed to update $c_t$ for the next round. For this setup, we will be able to show that our MWU-based algorithm attains an error bound of

$$
0 \le \frac{1}{T} \sum_{t=1}^{T} (c_t - c_{\mathrm{true}})^\top (\bar{x}_t - x_t) \le 2K \sqrt{\frac{\ln n}{T}}, \tag{4}
$$

where $K$ is an upper bound on the $\ell_\infty$-diameter of the feasible regions $X(p_t)$ with $t = 1, \ldots, T$. This implies that both the deviations in true cost as well as the deviations in surrogate cost can be made arbitrarily small on average. In other words, the average regret for having decided optimally according to the surrogate objectives $c_t$ vs. having decided optimally for the true objective $c_{\mathrm{true}}$ vanishes at a rate of $O(1/\sqrt{T})$. While this algorithm is only applicable if $c_{\mathrm{true}} \ge 0$ holds, our algorithm based on OGD works without this restriction. If $K$ is an upper bound on the $\ell_2$-diameter of the feasible regions $X(p_t)$, and $L$ is an upper bound on the $\ell_2$-diameter of the set from which both $c_{\mathrm{true}}$ and the $c_t$'s originate, then the OGD-based algorithm achieves an error bound of

$$
0 \le \frac{1}{T} \sum_{t=1}^{T} (c_t - c_{\mathrm{true}})^\top (\bar{x}_t - x_t) \le \frac{3LK}{2\sqrt{T}}. \tag{5}
$$

These results show that linear objective functions over general feasible sets can be learned from relatively few observations of historical optimal parameter-solution pairs. We will derive various extensions of our scheme, such as approximately learning non-linear objective functions and learning from suboptimal decisions. We will also briefly discuss the case where the objective $c_{\mathrm{true}}$ is known, but some linear constraints are unknown.

### Literature Overview

The idea of learning or inferring parts of an optimization model from data is a reasonably well-studied problem under many different assumptions and applications and has gained significant attention in the optimization community over the last few years, as discussed for example in den Hertog and Postek (2016), Lodi (2016) or Simchi-Levi (2014). These papers argue that there would be significant benefits in combining traditional optimization models with data-derived components. Most approaches in the literature focus on deriving the objective function of an expert decision-maker in a static fashion, based on past observations of input data and the decisions he took in each instance. In almost all cases, the objective functions are learned by considering the KKT-conditions or the dual of the (parameterized) optimization problem, and as such convexity for both the feasible region and the objective function is inherently assumed. Examples of this approach include Keshavarz et al. (2011), Li (2016) as well as Thai and Bayen (2018), where the latter one also considers the derivation of variational inequalities from data. Sometimes also distributional assumptions regarding the observations are made. Applications of such approaches have been heavily studied in the context of energy systems (Ratliff et al. (2014); Konstantakopoulos et al. (2016)), robot motion (Papadopoulos et al. (2016); Yang et al. (2014)), medicine (Sayre and Ruan (2014)) and revenue management (Kallus and Udell (2015); Qiang and Bayati (2016); Chen et al. (2015); Kallus and Udell (2016); Bertsimas and Kallus (2016)); also in the situation where the observed decisions were not necessarily optimal (Nielsen and Jensen (2004)).

Very closely related to our learning approach in terms of the problem formulation is Esfahani et al. (2018). This work studies different loss functions for evaluating a learned objective function on a data sample, which leads the authors to the minimization of the same regret function that we consider in the present paper. However, as their solution approach is based on duality, it does not extend to the integer case like the ideas presented here. Also closely related is the research reported in Troutt et al. (2005), which was later extended in Troutt et al. (2006), where an optimization model is defined that searches for a linear optimization problem that minimizes the total difference between the observed solutions and solutions found by optimizing according to that optimization problem. In the latter case, the models are solved using LP duality and cutting planes. In the follow-up work Troutt et al. (2008), a genetic algorithm is used to solve the problem heuristically under rather general assumptions, but inherently without any quality guarantees, and in Troutt et al. (2011) the authors study experimental setups for learning objectives under various stochastic assumptions, focussing on maximum likelihood estimation, which is generally the case for their line of work; we make no such assumptions.

Closely related to learning optimization models from observed data is the subject of inverse optimization. Here the goal is to find an objective function that renders the observed solutions optimal with respect to the concurrently observed parameter realizations. Approaches in this field mostly stem from convex optimization, and they are used for inverse optimal control (Iyengar and Kang (2005); Panchea and Ramdani (2015); Molloy et al. (2016)), inverse combinatorial optimization (D. Burton (1997); Burton and Toint (1994, 1992); Sokkalingam et al. (1999); Ahuja and Orlin (2000)), integer inverse optimization (Schaefer (2009)) and inverse optimization in the presence of noisy data, such as observed decisions that were suboptimal (Aswani et al. (2018); Chan et al. (2018)).

All these approaches heavily rely on duality and thus require convexity assumptions both for the feasible region as well as the objectives. As such, they cannot deal with more complex, possibly non-convex decision domains. This in particular includes the important case of integer-valued decisions (such as yes/no-decisions or, more generally, mixed-integer programming) and also many other non-convex setups (several of which admit efficient linear optimization algorithms). Previously, this was only possible when the structure of the feasible set could be beneficially exploited. In contrast, our approach does not make any such assumptions and only requires access to a linear optimization oracle (in short: LP oracle) for the feasible region $X$. Such an oracle is defined as a method which, given a vector $c \in \mathbb{R}^n$, returns $\operatorname{argmax}\{c^\top x \mid x \in X\}$.
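To make the oracle notion concrete, here is a minimal sketch in Python. It assumes a feasible set small enough to enumerate (a toy 0/1 knapsack); the function names `knapsack_feasible_set` and `linear_oracle` are illustrative and not taken from the paper. In practice the oracle would typically be a call to an LP or MIP solver.

```python
from itertools import product

def knapsack_feasible_set(weights, capacity):
    """Enumerate all 0/1 vectors x with sum(weights[i] * x[i]) <= capacity."""
    n = len(weights)
    return [x for x in product((0, 1), repeat=n)
            if sum(w * xi for w, xi in zip(weights, x)) <= capacity]

def linear_oracle(c, feasible_points):
    """Return a point of `feasible_points` maximizing the linear objective c^T x."""
    return max(feasible_points,
               key=lambda x: sum(ci * xi for ci, xi in zip(c, x)))

# Toy data (made up for illustration): three items, capacity 5.
X = knapsack_feasible_set(weights=[2, 3, 4], capacity=5)
x_best = linear_oracle([1.0, 2.0, 2.5], X)
```

The feasible set here is non-convex (integer points only), yet the oracle is all the learning algorithms need to interact with it.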

Also related to our work is inverse reinforcement learning and apprenticeship learning, where the reward function is the target to be learned. However, in this case the underlying problem is modelled as a Markov decision process (MDP); see, for example, the results in Syed and Schapire (2007) and Ratia et al. (2012). Typically, the obtained guarantees are of a different form though. Similarly, our work is not to be confused with the methods developed in Taskar et al. (2005) and Daumé et al. (2005), where online algorithms are used for learning aggregation vectors for edge features in graphs, with inverse optimization as a subroutine to define the update rule. In contrast, we do inverse optimization by means of online-learning algorithms, which is basically the reverse setup.

Our approach is based on online learning, and we mainly use the simple EXP algorithm here to attain the stated asymptotic regret bound. The EXP algorithm is commonly also called Multiplicative Weights Update (MWU) algorithm and was developed in Littlestone and Warmuth (1994), Vovk (1990) as well as Freund and Schapire (1997) (see Arora et al. (2012); Hazan (2016) for a comprehensive introduction; see also Audibert et al. (2013)). A similar algorithm was used in Plotkin et al. (1995) for solving fractional packing and covering problems. To generalize the applicability of our approach, we also derive a second algorithm based on Online Gradient Descent (OGD) due to Zinkevich (see Zinkevich (2003)). We finally point out that our feedback is stronger than bandit feedback. This requirement is not unexpected as the costs chosen by the “adversary” depend on our decision; as such the bandit model (see, for example, Dani et al. (2008), Abbasi-Yadkori et al. (2011)) does not readily apply.

### Contribution

To the best of the authors’ knowledge, this paper makes the first attempt to learn the objective function of an optimization model from data using an online-learning approach.

#### Online Learning of Optimization Problems

Based on samples for the input-output relationship of an optimization problem solved by a decision-maker, our aim is to learn an objective function which is consistent with the observed input-output relationship. This is indeed the best one can hope for: an adversary could play the same environment for rounds and then switch. This is less of an issue if the environments form samples that are independent and identically distributed (i.i.d.) from some distribution.

In our setup, the expert solves the decision-making problem repeatedly for different input parameter realizations. From these observations, we are able to learn a strategy of objective functions that emulate the expert’s unknown objective function such that the difference in solution quality between the solutions converges to zero on average.

While previous methods based on dualization or KKT-system-based approaches can lead to similar or even stronger results in the continuous/convex case, online learning allows us to relax this convexity requirement and to work with arbitrary decision domains as long as we are able to optimize a linear function over them, in particular mixed-integer programs (MIPs). Thus, we do not explicitly analyze the KKT-system or the dual program (in the case of linear programs (LPs); see Remark 3.1). In particular, one might consider our approach as an algorithmic analogue of the KKT-system (or dual program) in the convex case.

To summarize, we stress that (a) we do not make any assumptions regarding distribution of the observations, (b) the observations can be chosen by a fully-adaptive adversary, and (c) we do not require any convexity assumptions regarding the feasible regions and only rely on access to an LP oracle. We would also like to mention that our approach can be extended to work with slowly changing objectives using appropriate online-learning algorithms such as, for example, those found in Jadbabaie et al. (2015) or Zinkevich (2003); the regret bounds will depend on the rate of change.

We conduct extensive experiments to demonstrate the effectiveness and wide applicability of our algorithmic approach. To this end, we investigate its use for learning the objective functions of several combinatorial optimization problems that frequently occur in practice (possibly as subproblems of larger problems) and explore, among other things, how well the learned objective generalizes to unseen data samples.

The present paper is the full version of an extended abstract submitted to the International Conference on Machine Learning (ICML) 2017, see Bärmann et al. (2017).

## 2 Problem Setting

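We consider the following family of optimization problems, which depend on parameters $p \in P \subseteq \mathbb{R}^k$ for some $k \in \mathbb{N}$:

$$
\begin{aligned}
\max\quad & c_{\mathrm{true}}^\top x && (6)\\
\text{s.t.}\quad & x \in X(p), && (7)
\end{aligned}
$$

where $c_{\mathrm{true}} \in \mathbb{R}^n$ is the objective function and $X(p) \subseteq \mathbb{R}^n$ is the feasible region, which depends on the parameters $p$. Of particular interest to us will be feasible regions that arise as polyhedra defined by linear constraints and their intersections with integer lattices, i.e. the cases of LPs and MIPs:

$$
X(p) = \{x \in \mathbb{Z}^{n-l} \times \mathbb{R}^{l} \mid A(p)\,x \le b(p)\}
$$

with $A(p) \in \mathbb{R}^{m \times n}$ and $b(p) \in \mathbb{R}^{m}$. However, our approach can also readily be applied in the case of more complex feasible regions, such as mixed-integer sets bounded by convex functions:

$$
X(p) = \{x \in \mathbb{Z}^{n-l} \times \mathbb{R}^{l} \mid G(p, x) \le 0\}
$$

with $G$ convex in $x$, or even more general settings. In fact, for any possible choice of model for the sets $X(p)$ of feasible decisions, we only require the availability of a linear optimization oracle, i.e. an algorithm which is able to determine $\operatorname{argmax}\{c^\top x \mid x \in X(p)\}$ for any $c \in \mathbb{R}^n$ and $p \in P$. We call a decision $x \in X(p)$ optimal for $p$ if it is an optimal solution to Problem (6)–(7) for this $p$.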

We assume that Problem (6)–(7) models a parameterized optimization problem which has to be solved repeatedly for various input parameter realizations $p$. Our task is to learn the fixed objective function $c_{\mathrm{true}}$ from given observations of the parameters $p$ and a corresponding optimal solution $x$ to Problem (6)–(7). To this end, we further assume that we are given a series of observations $(p_t, x_t)$, $t = 1, \ldots, T$, of parameter realizations $p_t \in P$ together with an optimal solution $x_t$ to Problem (6)–(7) computed by the expert for $p_t$; these observations are revealed over time in an online fashion: in round $t$, we obtain a parameter setting $p_t$ and compute an optimal solution $\bar{x}_t \in X(p_t)$ with respect to an objective function $c_t$ based on what we have learned about $c_{\mathrm{true}}$ so far. Then we are shown the solution $x_t$ the expert with knowledge of $c_{\mathrm{true}}$ would have taken and can use this information to update our inferred objective function for the next round. In the end, we would like to be able to use our inferred objective function to take decisions that are essentially as good as those chosen by the expert in an appropriate aggregation measure such as, for example, “on average” or “with high probability”. The quality of the inferred objective is measured in terms of cost deviation between our solutions $\bar{x}_t$ and the solutions $x_t$ obtained by the expert, details of which will be given in the next section.

To fix some useful notation, let $v(i)$ denote the $i$-th component of a vector $v$ throughout, and let $[n] := \{1, \ldots, n\}$ for any natural number $n$. Furthermore, let $\mathbb{1}$ denote the all-ones vector in $\mathbb{R}^n$. Finally, we need a suitable measure for the diameter of a given set.

###### Definition 2.1.

The $\ell_p$-diameter of a set $S \subseteq \mathbb{R}^n$, denoted by $\operatorname{diam}_p(S)$, is the largest distance between any two points $x_1, x_2 \in S$, measured in the $p$-norm, $1 \le p \le \infty$, i.e.

$$
\operatorname{diam}_p(S) := \max_{x_1, x_2 \in S} \lVert x_1 - x_2 \rVert_p. \tag{8}
$$

As a technical assumption, we further demand that $c_{\mathrm{true}} \in F$ for some convex, compact and non-empty subset $F \subseteq \mathbb{R}^n$, which is known beforehand. This is no actual restriction, as $F$ could be chosen to be any ball according to some $p$-norm, $1 \le p \le \infty$, for example. In particular, this ensures that we do not have to deal with issues that arise when rescaling our objective.

## 3 Learning Objectives

Ideally, we would like to find the true objective function $c_{\mathrm{true}}$ as a solution to the following optimization problem:

$$
\min_{c \in F} \sum_{t \in [T]} \Big( \big( \max_{x \in X(p_t)} c^\top x \big) - c^\top x_t \Big), \tag{9}
$$

where $\lVert \cdot \rVert$ is an arbitrary norm on $\mathbb{R}^n$ and $x_t \in X(p_t)$ is the optimal decision taken by the expert in round $t$. The true objective function $c_{\mathrm{true}}$ is an optimal solution to Problem (9) with objective value $0$. This is because any solution $\hat{c} \in F$ is feasible and produces non-negative summands

$$
\big( \max_{x \in X(p_t)} \hat{c}^\top x \big) - \hat{c}^\top x_t \tag{10}
$$

for $t \in [T]$, as we assume $x_t \in X(p_t)$ to be optimal for $p_t$ with respect to $c_{\mathrm{true}}$.

Problem (9) contains $T$ instances of the following maximization subproblem:

$$
\begin{aligned}
\max\quad & c^\top x && \text{(11a)}\\
\text{s.t.}\quad & x \in X(p_t). && \text{(11b)}
\end{aligned}
$$

For each $t = 1, \ldots, T$, the corresponding Subproblem (11) asks for an optimal solution $\bar{x}_t$ when optimizing over the feasible set $X(p_t)$ with a given $c \in F$ as the objective function. When solving Problem (9), we are interested in an objective function vector $c \in F$ that delivers a consistent explanation for why the expert chose $x_t$ as his response to the parameters $p_t$ in round $t = 1, \ldots, T$. We call an objective function $c \in F$ from some prescribed set of objective functions $F \subseteq \mathbb{R}^n$ consistent with the observations $(p_t, x_t)$, $t \in [T]$, if it is optimal for the resulting Problem (9). The aim is to find an objective $c \in F$ for which the optimal solution of Subproblem (11) attains a value as close as possible to that of the expert’s decision, averaged over all observations. The approaches we present here will provide even stronger guarantees in some cases, such as the one described in Section 3.2, showing that we can replicate the decision-making behaviour of the expert.

###### Remark 3.1.

Note that in the case of polyhedral feasible regions, i.e. $X(p_t) = \{x \in \mathbb{R}^n \mid A_t x \le b_t\}$ with $A_t \in \mathbb{R}^{m_t \times n}$ and $b_t \in \mathbb{R}^{m_t}$ for $t = 1, \ldots, T$, as well as a polyhedral region $F = \{c \in \mathbb{R}^n \mid Bc \le d\}$, Problem (9) can be reformulated as a linear program by dualizing the $T$ instances of Subproblem (11). This yields

$$
\begin{aligned}
\min\quad & \sum_{t=1}^{T} \big( b_t^\top y_t - c^\top x_t \big) && \text{(13a)}\\
\text{s.t.}\quad & A_t^\top y_t = c \quad (\forall t = 1, \ldots, T) && \text{(13b)}\\
& y_t \ge 0 \quad (\forall t = 1, \ldots, T) && \text{(13c)}\\
& Bc \le d, && \text{(13d)}
\end{aligned}
$$

where the $y_t$ are the corresponding dual variables and the $x_t$ are the observed decisions from the expert (i.e. the latter are part of the input data). This problem asks for a primal objective function vector $c$ that minimizes the total duality gap summed over all primal-dual pairs $(x_t, y_t)$ while all $y_t$’s shall be dual feasible, which makes the $x_t$’s the respective primal optimal solutions. Thus, Problem (9) can be seen as a direct generalization of the linear primal-dual optimization problem. In fact, our approach also covers non-convex cases, e.g. mixed-integer linear programs.

Problem (9) can be interpreted as a game over $T$ rounds between a player who chooses an objective function $c_t$ in round $t$ and a player who knows the true objective function $c_{\mathrm{true}}$ and chooses the observations $(p_t, x_t)$ in a potentially adversarial way. The payoff of the latter player in each round $t$ is equal to $c_t^\top(\bar{x}_t - x_t)$, i.e. the difference in cost between our solution and the expert’s solution as given by our guessed objective function $c_t$.

As Problem (9) is hard to solve in general, we will design online-learning algorithms that, rather than finding an optimal objective $c$, find a strategy of objective functions $(c_1, \ldots, c_T)$ to play in each round whose error in solution quality as compared to the true objective function is as small as possible. Our aim will then be to give a quality guarantee for this strategy in terms of the number of observations.

To allow for approximation guarantees, it will not only be necessary that the set of possible objective functions to choose from is bounded, but also that the observed feasible sets have a common upper bound on their $\ell_p$-diameter.

From a meta perspective, our approach works as outlined in Algorithm 1.

It chooses an arbitrary objective $c_1 \in F$ in the first round, as there is no better indication of what to do at this point. Then, in each round $t$, it computes an optimal solution $\bar{x}_t$ over $X(p_t)$ with respect to the current guess of objective function $c_t$. Upon the following observation of the expert’s solution $x_t$, it updates its guess of objective function to use it in the next round.
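The loop just described can be sketched as follows. This is only an illustration of the round structure, not the paper's Algorithm 1 verbatim: the names `run_online_learning`, `oracle` and `update` are hypothetical, the feasible set is passed directly as a list of points, and the additive update used in the toy run is a placeholder rule chosen for readability (it is neither the MWU nor the OGD update).

```python
def dot(u, v):
    """Inner product of two equally long sequences."""
    return sum(a * b for a, b in zip(u, v))

def run_online_learning(observations, oracle, initial_objective, update):
    """Guess c_t, solve with the oracle, observe the expert, update the guess.

    observations : iterable of (p_t, x_t) pairs (x_t is the expert's decision)
    oracle(c, p) : returns a maximizer of c^T x over X(p)
    update(c, x_bar, x_expert) : returns the objective guess for the next round
    """
    c = list(initial_objective)
    played = []
    for p_t, x_expert in observations:
        x_bar = oracle(c, p_t)            # our decision under the current guess
        played.append((list(c), x_bar))
        c = update(c, x_bar, x_expert)    # learn from the expert's decision
    return played

# Toy run: X(p) is given as an explicit list of points (illustrative data).
points = [(1, 0), (0, 1)]
oracle = lambda c, pts: max(pts, key=lambda x: dot(c, x))
# Placeholder update (assumption): shift the guess towards the expert's decision.
update = lambda c, xb, xe: [ci + 0.5 * (e - b) for ci, b, e in zip(c, xb, xe)]
observations = [(points, (0, 1))] * 3     # the expert always picks (0, 1)
played = run_online_learning(observations, oracle, [1.0, 0.0], update)
```

Note the order of information: the guess $c_t$ and the decision $\bar{x}_t$ are fixed before the expert's solution $x_t$ is used.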

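Clearly, the accumulated objective value of a strategy $(c_1, \ldots, c_T)$ over $T$ rounds is given by $\sum_{t=1}^{T} c_t^\top \bar{x}_t$, while that of $c_{\mathrm{true}}$ would be $\sum_{t=1}^{T} c_{\mathrm{true}}^\top x_t$. Via the proposed scheme, it would be overly ambitious to demand that the objectives $c_t$ converge to $c_{\mathrm{true}}$, or even that the accumulated objective values coincide in the limit, as the following example shows.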

###### Example 3.2.

Consider the case and for . If the first player chooses for some as his objective function guess in each round , he will obtain optimal solutions with respect to . However, both the objective functions and the objective values will be far off. Indeed, when taking the -norm, we have for . And if for all , we additionally have , but for .

Altogether, we cannot expect to approximate the true objective function or the true optimal values in general. Neither can we expect to approximate the solutions $x_t$, because even if we have the correct objective function in each round, the optima do not necessarily have to be unique.

As a more appropriate measure of quality, we will show that our algorithms based on online learning produce strategies $(c_1, \ldots, c_T)$ with

$$
\lim_{T \to \infty} \frac{1}{T} \sum_{t=1}^{T} (c_t - c_{\mathrm{true}})^\top (\bar{x}_t - x_t) = 0, \tag{15}
$$

which, as we will see, directly implies both

$$
\lim_{T \to \infty} \frac{1}{T} \sum_{t=1}^{T} c_t^\top (\bar{x}_t - x_t) = 0 \tag{16}
$$

and

$$
\lim_{T \to \infty} \frac{1}{T} \sum_{t=1}^{T} c_{\mathrm{true}}^\top (x_t - \bar{x}_t) = 0, \tag{17}
$$

with non-negative summands for all rounds $t$ in all three expressions. The objective error $\sum_{t=1}^{T} c_t^\top(\bar{x}_t - x_t)$ is the objective function of Problem (9) when relaxing the requirement to play the same objective function in each round and instead passing to a strategy of objective functions. Equation (16) states that the average objective error over all observations converges to zero with the number of observations going to infinity. The solution error $\sum_{t=1}^{T} c_{\mathrm{true}}^\top(x_t - \bar{x}_t)$ is the cumulative suboptimality of the solutions $\bar{x}_t$ compared to the optimal solutions $x_t$ with respect to the true objective function. According to Equation (17), it equally tends to zero on average with an increasing number of observations. This means it is possible to take decisions $\bar{x}_t$ which are essentially as good as the decisions $x_t$ of the expert with respect to $c_{\mathrm{true}}$ over the long run.

Our measure of quality of a strategy of objective functions (15) is derived from the notion of regret, which is commonly used in online learning to characterize the quality of a learning algorithm: given an algorithm $\mathcal{A}$ which plays solutions $z_t$ from some decision set $\mathcal{K}$ in response to loss functions $f_t$ observed from an adversary over rounds $t = 1, \ldots, T$, it is given by $\sum_{t=1}^{T} f_t(z_t) - \min_{z \in \mathcal{K}} \sum_{t=1}^{T} f_t(z)$. Minimizing the regret of a sequence of decisions thus aims to find a strategy that performs at least as well as the best fixed decision in hindsight, i.e. the best static solution that can be played with full advance knowledge of the loss functions the adversary will play. See Hazan (2016), for example, for a broad introduction to regret minimization in online learning.

In our approach, we interpret the set of possible objective functions $F$ in Problem (9) as the set of feasible decisions from which our learning algorithms choose an objective $c_t$ in each round $t$. Furthermore, we use $f_t(c) = c^\top(\bar{x}_t - x_t)$ as the corresponding loss function in round $t$. We are then interested in the regret against $c_{\mathrm{true}}$, which is given by $\sum_{t=1}^{T} (c_t - c_{\mathrm{true}})^\top(\bar{x}_t - x_t)$. Equation (15) states that the average of this total error tends to zero as the number of observations increases. Note that $c_{\mathrm{true}}$ is not necessarily the best fixed objective in hindsight; the latter would be given by a standard unit vector $e_i$ for an appropriate index $i \in [n]$, which is rather meaningless in our case.

In the following, we derive two online-learning algorithms for which Equation (15) holds provably as well as an intuitive heuristic for LPs for which Equation (15) holds empirically in our experiments in Section 4.

### 3.1 An Algorithm based on Multiplicative Weights Updates

A classical algorithm in online learning is the multiplicative weights update (MWU) algorithm, which solves the following problem: given a set of $n$ decisions, a player is required to choose one of these decisions in each round $t = 1, \ldots, T$. Each time, after the player has chosen his decision, an adversary reveals to him the costs $m_t \in [-1, 1]^n$ of the decisions in the current round. The objective of the player is to minimize his overall cost over the time horizon $T$. The MWU algorithm solves this problem by maintaining weights $w_t \in \mathbb{R}^n_+$ which are updated from round to round, starting with the initial weights $w_1 = \mathbb{1}$. These weights are used to derive a probability distribution $p_t = w_t / \lVert w_t \rVert_1$. In round $t$, the player samples a decision $i$ from $[n]$ according to $p_t$. Upon observation of the costs $m_t$, the player updates his weights according to

$$
w_{t+1} = w_t - \eta\,(w_t \odot m_t), \tag{18}
$$

where $0 < \eta \le \tfrac{1}{2}$ is a suitable step size, in online learning also called learning rate, and $u \odot v$ denotes the componentwise multiplication of two vectors $u, v \in \mathbb{R}^n$. The expected cost of the player in round $t$ is then given by $m_t^\top p_t$, and the total expected cost is given by $\sum_{t=1}^{T} m_t^\top p_t$. MWU attains the following regret bound against any fixed distribution:
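As a small illustration of update (18), the following sketch applies two MWU steps to three decisions. The function names and the cost vectors are made up for this example; only the update rule itself is taken from the text.

```python
def mwu_step(weights, costs, eta):
    """One update w_{t+1}(i) = w_t(i) * (1 - eta * m_t(i)), i.e. Equation (18)."""
    return [w * (1.0 - eta * m) for w, m in zip(weights, costs)]

def distribution(weights):
    """Normalize non-negative weights to a probability distribution."""
    total = sum(weights)
    return [w / total for w in weights]

weights = [1.0, 1.0, 1.0]                       # initial weights: all ones
for costs in ([1.0, 0.0, -1.0], [1.0, 0.0, -1.0]):
    weights = mwu_step(weights, costs, eta=0.5)
probs = distribution(weights)
```

Decisions with high observed cost lose weight and decisions with negative cost gain weight, so the sampling distribution drifts towards the cheaper decisions.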

###### Lemma 3.3 (Arora et al. (2012, Corollary 2.2)).

The MWU algorithm guarantees that after $T$ rounds, for any distribution $p$ on the decisions, we have

$$
\sum_{t=1}^{T} m_t^\top p_t \le \sum_{t=1}^{T} (m_t + \eta\,|m_t|)^\top p + \frac{\ln n}{\eta}, \tag{19}
$$

where the absolute value $|m_t|$ is to be understood componentwise.

The above regret bound is valid for any distribution $p$, in particular for the best distribution in hindsight, i.e. the distribution that would have performed best given the observed cost vectors $m_t$. The latter is again given by some standard unit vector.

We will now reinterpret the distributions $p_t$, a suitable distribution $p$ to compare their regret to as well as the cost vectors $m_t$ in MWU in a way that will allow us to learn an objective function from observed solutions. Namely, we will identify the distributions $p_t$ with the objective functions $c_t$ in the strategy of the player and the distribution $p$ with the actual objective function $c_{\mathrm{true}}$. The difference between the optimal solution $\bar{x}_t$ computed by the player and the optimal solution $x_t$ of the expert will then act as the cost vector $m_t$ (after appropriate normalization).

Naturally, this limits us to $F = \{c \in \mathbb{R}^n_+ \mid \lVert c \rVert_1 = 1\}$, i.e. the objective functions have to lie in the positive orthant (while normalization is without loss of generality). However, whenever this restriction applies, we obtain a very lightweight method for learning the objective function of an optimization problem. In Section 3.3, we will present an algorithm which works without this assumption on $F$.

Our application of MWU to learning the objective function of an optimization problem proceeds as outlined in Algorithm 2.
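The following sketch shows how this reinterpretation could look in code. It is reconstructed from the surrounding text and the proof of Theorem 3.4 rather than copied from Algorithm 2: the normalized cost vector $y_t = (\bar{x}_t - x_t)/\lVert \bar{x}_t - x_t \rVert_\infty$, the step size $\eta = \sqrt{\ln n / T}$ (capped at $1/2$ here as a safeguard), the skipped update when both decisions agree, and all names and toy data are assumptions of this sketch.

```python
import math

def dot(u, v):
    """Inner product of two equally long sequences."""
    return sum(a * b for a, b in zip(u, v))

def mwu_learn_objective(observations, oracle, n):
    """Play objectives c_t on the simplex and update them multiplicatively.

    observations : list of (p_t, x_t) pairs, x_t being the expert's decision
    oracle(c, p) : hypothetical helper returning a maximizer of c^T x over X(p)
    Returns the played objectives c_t and our decisions x_bar_t.
    """
    T = len(observations)
    eta = min(0.5, math.sqrt(math.log(n) / T))  # step size, capped at 1/2
    w = [1.0] * n
    objectives, decisions = [], []
    for p_t, x_t in observations:
        total = sum(w)
        c_t = [wi / total for wi in w]          # normalized weights = objective
        x_bar = oracle(c_t, p_t)                # our decision under c_t
        objectives.append(c_t)
        decisions.append(x_bar)
        diff = [a - b for a, b in zip(x_bar, x_t)]
        scale = max(abs(d) for d in diff)
        if scale > 0:                           # no update if decisions agree
            y = [d / scale for d in diff]       # cost vector in [-1, 1]^n
            w = [wi * (1.0 - eta * yi) for wi, yi in zip(w, y)]
    return objectives, decisions

# Toy instance (all data made up): X(p) is an explicit list of points.
c_true = [0.6, 0.3, 0.1]
feasible_sets = [
    [(1, 0, 0), (0, 1, 0), (0, 0, 1)],
    [(1, 1, 0), (0, 1, 1), (1, 0, 1)],
    [(2, 0, 0), (0, 0, 3), (0, 2, 0)],
]
oracle = lambda c, points: max(points, key=lambda x: dot(c, x))
T = 300
observations = [(feasible_sets[t % 3], oracle(c_true, feasible_sets[t % 3]))
                for t in range(T)]

objectives, decisions = mwu_learn_objective(observations, oracle, n=3)
avg_error = sum(
    dot([a - b for a, b in zip(c_t, c_true)],
        [a - b for a, b in zip(x_bar, x_t)])
    for c_t, x_bar, (_, x_t) in zip(objectives, decisions, observations)
) / T
K = 3  # upper bound on the l_inf-diameter of the toy feasible sets
bound = 2 * K * math.sqrt(math.log(3) / T)
```

On this toy instance, `avg_error` can be compared against `bound`, which mirrors the guarantee stated in Theorem 3.4.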

For the series of objective functions that our algorithm returns, we can establish the following guarantee:

###### Theorem 3.4.

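Let $K \ge 0$ with $\operatorname{diam}_\infty(X(p_t)) \le K$ for all $t = 1, \ldots, T$. Then we have

$$
0 \le \frac{1}{T} \sum_{t=1}^{T} (c_t - c_{\mathrm{true}})^\top (\bar{x}_t - x_t) \le 2K \sqrt{\frac{\ln n}{T}},
$$

and in particular it also holds:

1. $0 \le \frac{1}{T} \sum_{t=1}^{T} c_t^\top (\bar{x}_t - x_t) \le 2K \sqrt{\frac{\ln n}{T}}$,

2. $0 \le \frac{1}{T} \sum_{t=1}^{T} c_{\mathrm{true}}^\top (x_t - \bar{x}_t) \le 2K \sqrt{\frac{\ln n}{T}}$.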

###### Proof.

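According to the standard performance guarantee of MWU from Lemma 3.3, Algorithm 2 attains the following bound on the total error of the sequence $(c_1, \ldots, c_T)$ compared to $c_{\mathrm{true}}$ with respect to the cost vectors $y_t$:

$$
\sum_{t=1}^{T} c_t^\top y_t \le \sum_{t=1}^{T} c_{\mathrm{true}}^\top (y_t + \eta\,|y_t|) + \frac{\ln n}{\eta},
$$

where the absolute value $|y_t|$ is to be understood componentwise. Using that each entry of $|y_t|$ is at most $1$ and dividing by $T$, we can conclude

$$
\frac{1}{T} \sum_{t=1}^{T} c_t^\top y_t - \frac{1}{T} \sum_{t=1}^{T} c_{\mathrm{true}}^\top y_t \le \eta \sum_{i=1}^{n} c_{\mathrm{true}}(i) + \frac{\ln n}{\eta T}
$$

and further, as $\lVert c_{\mathrm{true}} \rVert_1 = 1$,

$$
\frac{1}{T} \sum_{t=1}^{T} c_t^\top y_t - \frac{1}{T} \sum_{t=1}^{T} c_{\mathrm{true}}^\top y_t \le \eta + \frac{\ln n}{\eta T}.
$$

The right-hand side attains its minimum for $\eta = \sqrt{\ln n / T}$, which yields the bound

$$
\frac{1}{T} \sum_{t=1}^{T} c_t^\top y_t - \frac{1}{T} \sum_{t=1}^{T} c_{\mathrm{true}}^\top y_t \le 2 \sqrt{\frac{\ln n}{T}}.
$$

Substituting back for the $y_t$’s and using

$$
\max_{t = 1, \ldots, T} \lVert \bar{x}_t - x_t \rVert_\infty \le \max_{t = 1, \ldots, T} \operatorname{diam}_\infty(X(p_t)) \le K,
$$

we obtain

$$
\frac{1}{T} \sum_{t=1}^{T} c_t^\top (\bar{x}_t - x_t) + \frac{1}{T} \sum_{t=1}^{T} c_{\mathrm{true}}^\top (x_t - \bar{x}_t) \le 2K \sqrt{\frac{\ln n}{T}}. \tag{20}
$$

Observe that for each summand we have $c_t^\top(\bar{x}_t - x_t) \ge 0$, as $x_t \in X(p_t)$ and $\bar{x}_t$ is the maximum over this set with respect to $c_t$. With a similar argument, we see that $c_{\mathrm{true}}^\top(x_t - \bar{x}_t) \ge 0$ for all $t = 1, \ldots, T$. Thus, we have

$$
0 \le \frac{1}{T} \sum_{t=1}^{T} (c_t - c_{\mathrm{true}})^\top (\bar{x}_t - x_t) \le 2K \sqrt{\frac{\ln n}{T}}, \tag{21}
$$

and similarly for the separate terms by an analogous argument. This establishes the claim. ∎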

Note that by using exponential updates of the form

$$
w_{t+1}(i) \leftarrow w_t(i)\, e^{-\eta\, y_t(i)}
$$

in Line 13 of the algorithm, we could attain essentially the same bound, cf. (Arora et al., 2012, Theorem 2.3). Secondly, we remark that our choice of the learning rate $\eta$ requires the number of rounds $T$ to be known beforehand; if this is not the case, we can use the standard doubling trick (see Cesa-Bianchi and Lugosi (2006)) or use an anytime variant of MWU.

From the above theorem, we can conclude that the average error over all observations incurred by choosing objective function $c_t$ in iteration $t$ of Algorithm 2 instead of $c_{\mathrm{true}}$ converges to $0$ with an increasing number of observations $T$ at a rate of roughly $O(1/\sqrt{T})$:

###### Corollary 3.5.

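Let $K \ge 0$ with $\operatorname{diam}_\infty(X(p_t)) \le K$ for all $t = 1, \ldots, T$. Then we have

1. $\lim_{T \to \infty} \frac{1}{T} \sum_{t=1}^{T} c_t^\top (\bar{x}_t - x_t) = 0$ and

2. $\lim_{T \to \infty} \frac{1}{T} \sum_{t=1}^{T} c_{\mathrm{true}}^\top (x_t - \bar{x}_t) = 0$.

In other words, both the average error incurred from replacing the actual objective function $c_{\mathrm{true}}$ by the estimates $c_t$ as well as the average error in solution quality with respect to $c_{\mathrm{true}}$ tend to $0$ as $T$ grows.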
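To get a feeling for the rate, one can invert the bound $2K\sqrt{\ln n / T} \le \delta$, which gives $T \ge 4K^2 \ln n / \delta^2$. The small helper below evaluates both directions; the function names and the example values ($K = 1$, $n = 100$, $\delta = 0.1$) are chosen purely for illustration.

```python
import math

def error_bound(K, n, T):
    """Right-hand side 2K * sqrt(ln n / T) of the bound in Theorem 3.4."""
    return 2.0 * K * math.sqrt(math.log(n) / T)

def rounds_for_accuracy(K, n, delta):
    """Smallest integer T with 2K * sqrt(ln n / T) <= delta."""
    return math.ceil(4.0 * K ** 2 * math.log(n) / delta ** 2)

T_needed = rounds_for_accuracy(K=1.0, n=100, delta=0.1)
```

The dependence on the dimension $n$ is only logarithmic, while halving the target accuracy $\delta$ quadruples the number of rounds.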

Moreover, using Markov’s inequality we also obtain the following quantitative bound on the deviation by more than a given $\varepsilon > 0$ from the average cost:

###### Corollary 3.6.

Let . Then the fraction of observations  with

 c⊤true(xt−¯xt)≥2K√lnnT+ε (22)

is at most

$$1 - \frac{\varepsilon}{2K\sqrt{\frac{\ln n}{T}} + \varepsilon}. \tag{23}$$

In particular, for any $0 < p < 1$ we have that after

$$T \geq \ln n \left(\frac{(1-p)\,2K}{p\,\varepsilon}\right)^2 \tag{24}$$

observations the fraction of observations with cost

$$c_{\text{true}}^\top (x_t - \bar{x}_t) \geq \frac{\varepsilon}{1-p} \geq 2K\sqrt{\frac{\ln n}{T}} + \varepsilon \tag{25}$$

is at most $p$.

###### Proof.

Markov’s inequality states

$$\left|\{x \in X \mid f(x) \geq a\}\right| \leq \frac{1}{a} \sum_{x \in X} |f(x)| \tag{26}$$

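for a finite set $X$, a function $f \colon X \to \mathbb{R}$ and $a > 0$. With $X = \{1, \dots, T\}$, $f(t) = c_{\text{true}}^\top (x_t - \bar{x}_t)$ for $t = 1, \dots, T$ as well as $a = 2K\sqrt{\ln n / T} + \varepsilon$, and bounding the sum on the right-hand side via Theorem 3.4, we obtain the desired upper bound on the fraction of high deviations. The second part follows from solving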

$$1 - \frac{\varepsilon}{2K\sqrt{\frac{\ln n}{T}} + \varepsilon} \leq p \tag{27}$$

for $T$ and plugging in values. ∎
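To get a feeling for the orders of magnitude in Corollary 3.6, the two formulas (23) and (24) can be evaluated directly. The following is a small sketch; the function names and the sample values for $n$, $K$, $\varepsilon$ and $p$ are purely illustrative assumptions.

```python
import math

def deviation_fraction_bound(T, n, K, eps):
    """Upper bound (23) on the fraction of rounds whose cost exceeds the threshold in (22)."""
    avg_bound = 2 * K * math.sqrt(math.log(n) / T)
    return 1 - eps / (avg_bound + eps)

def required_observations(n, K, eps, p):
    """Smallest integer T satisfying (24)."""
    return math.ceil(math.log(n) * ((1 - p) * 2 * K / (p * eps)) ** 2)

# Hypothetical sample values, chosen only for illustration.
n, K, eps, p = 10, 1.0, 0.1, 0.1
T_needed = required_observations(n, K, eps, p)
fraction = deviation_fraction_bound(T_needed, n, K, eps)
print(T_needed, fraction)
```

Since $T$ enters (24) through $1/(p\varepsilon)^2$, halving either $p$ or $\varepsilon$ roughly quadruples the number of observations required.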

###### Remark 3.7.

It is straightforward to extend the result from Theorem 3.4 to a more general setup, namely the learning of an objective function which is linearly composed from a set of basis functions. To this end, we consider the problem

$$\begin{aligned} \max \quad & c_{\text{true}}^\top f(x) && (28) \\ \text{s.t.} \quad & x \in X(p), && (29) \end{aligned}$$

where with , on compact and parameterized in as above. In order to apply Theorem 3.4 to this case, the $\ell_\infty$-diameter of the image of $f$ additionally needs to be finite, which is naturally the case, for example, if $f$ is Lipschitz continuous with respect to the maximum norm with Lipschitz constant . Then we can change the cost function in Line 11 of Algorithm 2 to

$$y_t = \frac{f(\bar{x}_t) - f(x_t)}{\|f(\bar{x}_t) - f(x_t)\|_\infty}, \tag{30}$$

which yields a guarantee of

$$0 \leq \frac{1}{T}\sum_{t=1}^{T} (c_t - c_{\text{true}})^\top \left(f(\bar{x}_t) - f(x_t)\right) \leq 2K\sqrt{\frac{\ln n}{T}}, \tag{31}$$

with .
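The modified cost vector (30) is straightforward to compute once the basis functions are fixed. The sketch below assumes a hypothetical set of four polynomial basis functions in two variables; the function names are illustrative, and returning the zero vector when $f(\bar{x}_t) = f(x_t)$ is an assumption made here, since (30) is undefined in that case.

```python
def normalized_feature_difference(f, x_bar, x_obs):
    """Cost vector y_t from (30): difference of basis-function values, scaled to max-norm 1."""
    diff = [a - b for a, b in zip(f(x_bar), f(x_obs))]
    norm = max(abs(d) for d in diff)
    if norm == 0:
        # Assumption: no update signal when both points have identical features.
        return [0.0] * len(diff)
    return [d / norm for d in diff]

def basis(x):
    """Hypothetical basis functions f_1, ..., f_4 in two variables."""
    return [x[0], x[1], x[0] * x[1], x[0] ** 2]

y = normalized_feature_difference(basis, (1.0, 2.0), (0.5, 1.0))
print(y)
```

The rest of the learning loop stays as before; only the quantity fed into the weight update changes from a difference of solutions to a difference of feature vectors.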

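We would like to point out that the requirement to observe optimal solutions $x_t$ to learn the objective function which produced them can be relaxed in all the above considerations. Assume that we observe $\varepsilon$-optimal solutions $\hat{x}_t$ instead, i.e. they satisfy $c_{\text{true}}^\top \hat{x}_t \geq (1 - \varepsilon)\, c_{\text{true}}^\top x_t$ for all $t = 1, \dots, T$ and some $\varepsilon > 0$. In this case, the upper bound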

$$\frac{1}{T}\sum_{t=1}^{T} (c_t - c_{\text{true}})^\top (\bar{x}_t - \hat{x}_t) \leq 2K\sqrt{\frac{\ln n}{T}},$$

which is analogous to what we derived in Theorem 3.4, still holds, as it does not depend on the optimality of the observed solutions. On the other hand, we have

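$$\frac{1}{T}\sum_{t=1}^{T} (c_t - c_{\text{true}})^\top (\bar{x}_t - \hat{x}_t) \geq \frac{1}{T}\sum_{t=1}^{T} c_{\text{true}}^\top (\hat{x}_t - \bar{x}_t) \geq \frac{1}{T}\sum_{t=1}^{T} c_{\text{true}}^\top \left((1 - \varepsilon) x_t - \bar{x}_t\right)$$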

due to the optimality of the $\bar{x}_t$’s with respect to the $c_t$’s and the $\varepsilon$-optimality of the $\hat{x}_t$’s. Altogether, this yields

$$\frac{1}{T}\sum_{t=1}^{T} c_{\text{true}}^\top \left((1 - \varepsilon) x_t - \bar{x}_t\right) \leq 2K\sqrt{\frac{\ln n}{T}}$$

and consequently

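$$\frac{1}{T}\sum_{t=1}^{T} c_{\text{true}}^\top \bar{x}_t \geq \frac{1}{T}\sum_{t=1}^{T} (1 - \varepsilon)\, c_{\text{true}}^\top x_t - 2K\sqrt{\frac{\ln n}{T}},$$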

such that in the limit, our solutions become $\varepsilon$-optimal on average. Note that a similar result can be obtained if we assume an additive error in the observed solutions instead of a multiplicative one.

### 3.2 The Stable Case

While in most applications it suffices to produce solutions via the surrogate objectives that are essentially equivalent to those for the true objective, we now show that slightly stronger assumptions yield significantly stronger convergence guarantees for the solutions: in the long run, we learn to emulate the true optimal solutions, provided that the problems have unique solutions in a sense we make precise now.

We say that the sequence of feasible regions $(X(p_t))_t$ is $\Delta$-stable for $c_{\text{true}}$ for some $\Delta > 0$ if for any , with , and so that for we have

$$c_{\text{true}}^\top (x_t - \bar{x}_t) \geq \Delta,$$

i.e. either the two optimal solutions coincide or they differ by at least with respect to . In particular, optimizing over leads to a unique optimal solution for all with . While this condition – which is well known as the sharpness of a minimizer in convex optimization – sounds unnatural at first, it is, for example, trivially satisfied for the important case where with is a polytope with vertices in and is a rational vector. In this case, write with and observe that the minimum change in objective value between any two vertices of the 0/1-polytope with is bounded by , so that -stability with holds in this case. The same argument works for more general polytopes via bounding the minimum non-zero change in objective function value via the encoding length.

We obtain the following simple corollary of Theorem 3.4.

###### Corollary 3.8.

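Let $K \geq 0$ with $\operatorname{diam}_\infty(X(p_t)) \leq K$ for all $t = 1, \dots, T$, let $(X(p_t))_t$ be $\Delta$-stable for some $\Delta > 0$, and let $N_T := \{t \in \{1, \dots, T\} \mid \bar{x}_t \neq x_t\}$. Then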

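$$|N_T| \leq \frac{2K\sqrt{T \ln n}}{\Delta}.$$

###### Proof.

From Theorem 3.4, we have

$$0 \leq \frac{1}{T}\sum_{t=1}^{T} c_{\text{true}}^\top (x_t - \bar{x}_t) \leq 2K\sqrt{\frac{\ln n}{T}}. \tag{32}$$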

Now let $N_T$ be as above, so that

$$0 \leq \frac{1}{T}\sum_{t \in N_T} c_{\text{true}}^\top (x_t - \bar{x}_t) \leq 2K\sqrt{\frac{\ln n}{T}}. \tag{33}$$

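Observe that $c_{\text{true}}^\top (x_t - \bar{x}_t) \geq \Delta$ for all $t \in N_T$, as $x_t$ was optimal for $c_{\text{true}}$, together with $\Delta$-stability. We thus obtain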

$$\frac{1}{T} |N_T| \Delta \leq 2K\sqrt{\frac{\ln n}{T}}, \tag{34}$$

which is equivalent to

$$|N_T| \leq \frac{2K\sqrt{T \ln n}}{\Delta}. \tag{35}$$

∎

From the above corollary, we obtain in particular that in the $\Delta$-stable case we have $|N_T| / T \to 0$ as $T \to \infty$, i.e. the average number of times that $\bar{x}_t$ deviates from $x_t$ tends to $0$ in the long run. We hasten to stress, however, that the convergence implied by this bound can potentially be slow as it is exponential in the actual encoding length of $\Delta$; this is to be expected given the convergence rates of our algorithm and online-learning algorithms in general.
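The dependence of the bound (35) on the stability parameter can be made tangible numerically. The following sketch evaluates the bound on the fraction of deviating rounds for a few horizons; the function name and the values chosen for $n$, $K$ and $\Delta$ are illustrative assumptions only.

```python
import math

def deviation_count_bound(T, n, K, delta):
    """Upper bound (35) on the number of rounds in which our solution differs from the observed one."""
    return 2 * K * math.sqrt(T * math.log(n)) / delta

# Hypothetical sample values, chosen only for illustration.
n, K, delta = 8, 1.0, 0.05
horizons = [10**3, 10**5, 10**7]
fractions = [deviation_count_bound(T, n, K, delta) / T for T in horizons]
for T, frac in zip(horizons, fractions):
    print(f"T = {T:>8}: |N_T| / T <= {frac:.4f}")
```

Note that for small $T$ the bound may exceed $T$ itself and is then vacuous; it only becomes informative once $T$ is large compared to $(2K/\Delta)^2 \ln n$, which reflects the slow convergence for small $\Delta$.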

### 3.3 An Algorithm based on Online Gradient Descent

The algorithm based on MWU introduced in Section 3.1 has the limitation that it is only applicable for learning non-negative objectives. In addition, it cannot make use of any prior knowledge about the structure of $c_{\text{true}}$ other than that it comes from the positive orthant. To lift these limitations, we will extend our approach using online gradient descent (OGD), which is an online-learning algorithm applicable to the following game over $T$ rounds: in each round $t = 1, \dots, T$, the player chooses a solution $x_t$ from a convex, compact and non-empty feasible set $F$. Then the adversary reveals a convex objective function $c_t \colon F \to \mathbb{R}$, and the player incurs a cost of $c_t(x_t)$. OGD proceeds by choosing an arbitrary $x_1 \in F$ in the first round and updating this choice after observing $c_t$ via

$$x_{t+1} = P\left(x_t - \eta_t \nabla c_t(x_t)\right), \tag{36}$$

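where $P$ is the projection onto the set $F$ and $\eta_t$ is the learning rate. With the abbreviations $D := \max_{x, y \in F} \|x - y\|_2$ and $G := \max_{x \in F,\, t = 1, \dots, T} \|\nabla c_t(x)\|_2$, the regret of the player can then be bounded as follows.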

###### Lemma 3.9 (Zinkevich (2003, Theorem 1)).

For $\eta_t = 1/\sqrt{t}$, $t = 1, \dots, T$, we have

$$\sum_{t=1}^{T} c_t(x_t) - \min_{x \in F} \sum_{t=1}^{T} c_t(x) \leq \frac{D^2 \sqrt{T}}{2} + \left(\sqrt{T} - \frac{1}{2}\right) G^2. \tag{37}$$
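The update (36) and the regret bound (37) can be illustrated on a toy instance. The sketch below assumes that $F$ is the Euclidean unit ball (so that the projection has a closed form and $D = 2$) and that the adversary plays random linear costs $c_t(x) = g_t^\top x$ with $\|g_t\|_2 \leq 1$ (so $G = 1$); these choices and all names are hypothetical and serve only to make the example self-contained.

```python
import math
import random

def project_to_unit_ball(x):
    """Euclidean projection onto F = {x : ||x||_2 <= 1}."""
    norm = math.sqrt(sum(xi * xi for xi in x))
    return list(x) if norm <= 1.0 else [xi / norm for xi in x]

def ogd_linear(gradients):
    """Online gradient descent with eta_t = 1/sqrt(t) for linear costs c_t(x) = g_t . x.

    Returns the total cost incurred by the player.
    """
    x = [0.0] * len(gradients[0])               # arbitrary starting point x_1 in F
    total_cost = 0.0
    for t, g in enumerate(gradients, start=1):
        total_cost += sum(gi * xi for gi, xi in zip(g, x))
        eta = 1.0 / math.sqrt(t)
        x = project_to_unit_ball([xi - eta * gi for xi, gi in zip(x, g)])
    return total_cost

random.seed(0)
T, d = 500, 3
gradients = []
for _ in range(T):
    g = [random.uniform(-1.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(gi * gi for gi in g))
    gradients.append([gi / max(norm, 1.0) for gi in g])   # enforce ||g_t||_2 <= 1

total_cost = ogd_linear(gradients)

# For linear costs over the unit ball, the best fixed point attains -||sum_t g_t||_2.
g_sum = [sum(g[i] for g in gradients) for i in range(d)]
best_fixed_cost = -math.sqrt(sum(s * s for s in g_sum))
regret = total_cost - best_fixed_cost

D, G = 2.0, 1.0
bound = D**2 * math.sqrt(T) / 2 + (math.sqrt(T) - 0.5) * G**2
print(f"regret = {regret:.2f}, bound = {bound:.2f}")
```

For other feasible sets, only `project_to_unit_ball` needs to be replaced by the corresponding projection, which in general requires solving a convex quadratic problem.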

Concerning the choice of learning rate, there are a couple of things to note. Firstly, the learning rate in round $t$ does not depend on the total number of rounds $T$ of the game. This means that the resulting version of OGD works without prior knowledge of $T$. It is even possible to improve slightly on the above result: by choosing the learning rate in round