Over the past years, online learning has become a very active research field. This is due to the wide spread of applications with evolving or adversarial environments, e.g. routing schemes in networks, online marketplaces, spam filtering, etc. An online learning algorithm has to choose an action over a (possibly infinite) set of feasible decisions. A loss/reward, which may be adversarially chosen, is associated with each decision. The losses/rewards are unknown to the algorithm beforehand. The goal is to minimize the regret, i.e. the difference between the total loss/reward of the online algorithm and that of the best single action in hindsight. A “good” online learning algorithm is one whose regret is sublinear as a function of the length of the time horizon, since then, on average, the algorithm performs as well as the best single action in hindsight. Such an online algorithm is said to have vanishing regret. For problems whose offline version is NP-hard, the notions of regret and vanishing regret have been extended to the notions of α-regret and vanishing α-regret, in order to take into account the existence of an α-approximation algorithm, instead of an exact algorithm, for solving the offline optimization problem.
While many online learning problems can be modeled as the so-called “experts problem” by associating a feasible solution to each expert, there is clearly an efficiency challenge: there are potentially exponentially many solutions, which makes such an approach problematic in practice. Other methods have been used, such as online gradient descent, the follow the leader algorithm and its extension follow the perturbed leader for linear objective functions, its generalization to submodular objective functions, or the generalized follow the perturbed leader algorithm. Hazan and Koren proved that a no-regret algorithm with running time polynomial in the size of the problem does not exist in general settings without any assumption on the structure of the problem.
Our work takes into account the computational efficiency of the online learning algorithm in the same vein as the works in [1, 15, 12, 22, 6, 7, 14, 9]. We study various discrete nonlinear combinatorial optimization problems in an online learning framework, focusing in particular on the family of min-max discrete optimization problems.
Our goal is to address the two following central questions:
(Q1) Are there negative results showing that achieving vanishing regret (or even vanishing approximate regret) is computationally hard?
(Q2) Are there notable differences in the efficiency of follow the leader and gradient descent strategies for discrete problems?
Formally, an online learning problem consists of a decision space $\mathcal{X}$, a state space $\mathcal{S}$, and an objective function $f:\mathcal{X}\times\mathcal{S}\to\mathbb{R}$ that can be either a cost or a reward function. Any problem of this class can be viewed as an iterative adversarial game with $T$ rounds where the following procedure is repeated for $t = 1, \dots, T$: (a) decide an action $x_t \in \mathcal{X}$, (b) observe a state $s_t \in \mathcal{S}$, (c) suffer loss or gain reward $f(x_t, s_t)$.
We use $f_t(\cdot) = f(\cdot, s_t)$ as another way to refer to the objective function $f$ after observing the state $s_t$, i.e. the objective function at round $t$.
The objective of the player is to minimize/maximize the cumulative cost/reward of his decided actions, which is given by the aggregation $\sum_{t=1}^{T} f(x_t, s_t)$. An online learning algorithm is any algorithm that decides the action $x_t$ at every round $t$ before observing $s_t$. We compare the decisions of the algorithm with those of the best static action in hindsight, defined as $x^* \in \arg\min_{x \in \mathcal{X}} \sum_{t=1}^{T} f(x, s_t)$, or $x^* \in \arg\max_{x \in \mathcal{X}} \sum_{t=1}^{T} f(x, s_t)$, for minimization or maximization problems, respectively. This is the action that a (hypothetical) offline oracle would compute, if it had access to the entire sequence $s_1, \dots, s_T$. The typical measurement for the efficiency of an online learning algorithm is the regret, defined as $R(T) = \sum_{t=1}^{T} f(x_t, s_t) - \sum_{t=1}^{T} f(x^*, s_t)$.
A learning algorithm typically uses some kind of randomness, and the regret denotes the expectation of the above quantity. We are interested in online learning algorithms that have the “vanishing regret” property. This means that as the “game” progresses ($T \to \infty$), the difference between the algorithm’s average cost/payoff and the average cost/payoff of the optimum action in hindsight tends to zero. Typically, a vanishing regret algorithm is an algorithm with regret $R(T)$ such that $\lim_{T \to \infty} R(T)/T = 0$. However, as we are interested in polynomial time algorithms, we consider only regret of the form $R(T) = O(T^{1-\gamma})$ for some constant $\gamma > 0$ (which guarantees convergence in polynomial time). Throughout the paper, whenever we mention vanishing regret, we mean regret of this form.
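As a concrete illustration of the definitions above, the following sketch (an illustrative helper, not from the paper) computes the regret of a sequence of decisions against the best static action in hindsight for a finite action set:

```python
def regret(losses, decisions):
    """losses[t][a]: loss of action a at round t; decisions[t]: action played at t."""
    T = len(losses)
    alg_cost = sum(losses[t][decisions[t]] for t in range(T))
    # cost of the best single action in hindsight
    best_static = min(sum(losses[t][a] for t in range(T))
                      for a in range(len(losses[0])))
    return alg_cost - best_static
```

A vanishing regret algorithm keeps this quantity sublinear in $T$ against any loss sequence.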
For many online learning problems, even their offline versions are NP-hard. Thus, it is not feasible to produce a vanishing regret sequence with an efficient algorithm. For such cases, the notion of α-regret has been defined as:
$$R_\alpha(T) = \sum_{t=1}^{T} f(x_t, s_t) - \alpha \cdot \min_{x \in \mathcal{X}} \sum_{t=1}^{T} f(x, s_t).$$
Hence, we are interested in vanishing α-regret sequences for some ratio α for which we know how to approximate the offline problem. The notion of vanishing α-regret is defined in the same way as that of vanishing regret. In this article we focus on computational issues. Efficiency for an online learning algorithm needs to capture both the computation of $x_t$ and the convergence speed. This is formalized in the following definition (where $n$ denotes the size of the instance).
A polynomial time vanishing α-regret algorithm is an online learning algorithm for which (1) the computation of $x_t$ is polynomial in $n$ and $t$, and (2) the expected α-regret is bounded by $p(n) \cdot T^{1-\gamma}$ for some polynomial $p$ and some constant $\gamma > 0$.
Note that in case $\alpha = 1$, we simply use the term polynomial time vanishing regret algorithm.
1.1 Our contribution
In Section 2, we provide a general reduction showing that many (min-max) polynomial time solvable problems not only do not have a vanishing regret algorithm, but also no vanishing α-regret algorithm for some α > 1 (unless NP = RP). Then, we focus on a particular min-max problem, the min-max version of the vertex cover problem, which is solvable in polynomial time in the offline case. The previous reduction proves that there is no vanishing (2−ε)-regret online algorithm, unless the Unique Games problem admits a randomized polynomial-time algorithm; we prove a matching upper bound by providing an online algorithm based on the online gradient descent method.
In Section 3, we turn our attention to online learning algorithms that are based on an offline optimization oracle that, given a set of instances of the problem, is able to compute the optimum static solution. We show that for different nonlinear discrete optimization problems, it is strongly NP-hard to implement this offline optimization oracle, even for problems that can be solved in polynomial time in the static case (e.g. min-max vertex cover, min-max perfect matching, etc.). We also prove that the offline optimization oracle is strongly NP-hard for the problem of scheduling a set of jobs on $m$ identical machines, where $m$ is a fixed constant. To the best of our knowledge, up to now algorithms based on the follow the leader method for non-linear objective functions require an exact oracle or an FPTAS oracle in order to obtain vanishing regret. Thus, strong NP-hardness for the multiple-instance version of the offline problem indicates that follow-the-leader-type strategies cannot be used for the online problem, at least with our current knowledge. On the positive side, we present an online algorithm with vanishing regret that is based on the follow the perturbed leader algorithm for a generalization of the knapsack problem.
1.2 Further related works
Online Learning, or Online Convex Optimization, is an active research domain. In this section, we only summarize works which are directly related to ours. We refer the reader to comprehensive books [21, 11] and references therein for a more complete overview. The first no-regret algorithm was given by Hannan. Subsequently, Littlestone and Warmuth and Freund and Schapire gave improved algorithms with regret $O(\sqrt{T \log N})$, where $N$ is the size of the action space. However, these algorithms have running time exponential in the size of the input for many applications, in particular for combinatorial optimization problems. An intriguing question is whether there exists a no-regret online algorithm with running time polynomial in the input size. Hazan and Koren proved that no such algorithm exists in general settings without any assumption on the structure. Designing online polynomial-time algorithms with approximation and vanishing regret guarantees for combinatorial optimization problems is a major research agenda.
In their breakthrough paper, Kalai and Vempala presented the first efficient online algorithm, called Follow-the-Perturbed-Leader (FTPL), for linear objective functions. The strategy consists of adding a perturbation to the cumulative gain (payoff) of each action and then selecting the action with the highest perturbed gain. This strategy has been generalized and successfully applied in several settings [12, 22, 6, 7]. Specifically, FTPL and its generalized versions have been used to design efficient online no-regret algorithms with oracles beyond the linear setting: in submodular settings and non-convex settings. However, all these approaches require best-response oracles, and as we show in this paper, for several problems such best-response oracles require exponential computation time.
Another direction is to design online learning algorithms using (offline polynomial-time) approximation algorithms as oracles. Kakade et al. provided an algorithm inspired by Zinkevich’s algorithm (gradient descent): at every step, the algorithm updates the current solution in the direction of the gradient and projects back to the feasible set using an approximation algorithm. They showed that, given an α-approximation algorithm for a linear optimization problem, the online algorithm achieves a vanishing α-regret bound using a bounded number of calls to the approximation algorithm per round on average. Later on, Garber gave an algorithm with a comparable α-regret bound using fewer calls to the approximation algorithm per round on average. These algorithms rely crucially on the linearity of the objective functions, and it remains an interesting open question to design such algorithms for online non-linear optimization problems.
2 Hardness of online learning for min-max problems
2.1 General reduction
As mentioned in the introduction, in this section we give some answers to question (Q1) by ruling out the existence of vanishing regret algorithms for a broad family of online min-max problems, even ones that are polynomial-time solvable in the offline case. In fact, we provide a general reduction (see Theorem 1) showing that many min-max problems do not admit vanishing α-regret for some α > 1 unless NP = RP.
More precisely, we focus on a class of cardinality minimization problems where, given an $n$-element set $U$, a set of constraints on the subsets of $U$ (defining feasible solutions) and an integer $k$, the goal is to determine whether there exists a feasible solution of size at most $k$. This is a general class of problems, including for instance graph problems such as Vertex Cover, Dominating Set, Feedback Vertex Set, etc.
Given such a cardinality problem Π, let min-max-Π be the optimization problem where, given non-negative weights on the elements of $U$, one has to compute a feasible solution (under the same set of constraints as in problem Π) such that the maximum weight of its elements is minimized. The online min-max-Π problem is the online learning variant of min-max-Π, where the weights on the elements of $U$ change over time.
Interestingly, the min-max versions of all the problems mentioned above are polynomially solvable. This is actually true as soon as, for problem Π, every superset of a feasible solution is feasible. Then one just has to check, for each possible weight value $w$, whether the set of all elements of weight at most $w$ satisfies the constraints. For example, one can decide whether there exists a vertex cover of maximum weight at most $w$ as follows: remove all vertices of weight strictly larger than $w$, and check whether the remaining vertices form a vertex cover.
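The threshold argument above can be sketched as follows for vertex cover (illustrative code with hypothetical helper names): for each candidate weight $w$, keep only the vertices of weight at most $w$ and test whether they still cover every edge.

```python
def minmax_vertex_cover_value(edges, weight):
    """Smallest w such that the vertices of weight <= w form a vertex cover."""
    for w in sorted(set(weight.values())):
        kept = {v for v, wv in weight.items() if wv <= w}
        if all(u in kept or v in kept for (u, v) in edges):
            return w
    return None  # no feasible cover even when keeping all vertices
```

Since only the $n$ distinct weight values need to be tried, this runs in polynomial time.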
We will show that, in contrast, if Π is NP-complete then its online learning min-max version has no vanishing regret algorithm (unless NP = RP), and that if Π has an inapproximability gap, then there is no vanishing α-regret algorithm, for a corresponding α, for its online learning min-max version. Let us first recall the notion of approximation gap, where $k^*$ denotes the minimum size of a feasible solution to the cardinality problem Π.
Given two numbers $A < B$, let [A,B]-Gap-Π be the decision problem where, given an instance of Π such that either $k^* \le A$ or $k^* \ge B$, we need to decide whether $k^* \le A$.
Now we can state the main result of the section.
Let Π be a cardinality minimization problem and $A < B$ be real numbers. Assume that the problem [A,B]-Gap-Π is NP-complete. Then, for every $\alpha \le B/A - \epsilon$, where $\epsilon > 0$ is an arbitrarily small constant, there is no polynomial time vanishing α-regret algorithm for online min-max-Π unless NP = RP.
We prove this theorem by deriving, under the assumption of a vanishing α-regret algorithm for online min-max-Π, a randomized polynomial time algorithm for [A,B]-Gap-Π that gives the correct answer with small probability of error. This would imply that the [A,B]-Gap-Π problem is in RP and thus NP = RP.
Let $\mathcal{A}$ be a vanishing α-regret algorithm for online min-max-Π for some $\alpha \le B/A - \epsilon$. Let $T$ be a time horizon which will be fixed later. We construct the following (offline) algorithm for [A,B]-Gap-Π using $\mathcal{A}$ as an oracle (subroutine). At every step $t$, use the oracle to compute a solution $S_t$. Then, choose one element of $S_t$ uniformly at random and assign weight 1 to that element; assign weight 0 to all other elements. Consequently, the cost incurred to $\mathcal{A}$ is 1 at every step. These weight assignments, though simple, are crucial. Intuitively, they are used to learn about the optimal solution of the [A,B]-Gap-Π problem (given the performance of the learning algorithm $\mathcal{A}$). The formal description is given in Algorithm 1.
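The main loop of this reduction can be sketched as follows (an illustration under stated assumptions: the oracle interface and the cardinality threshold are hypothetical names, since the formal Algorithm 1 is not reproduced here):

```python
import random

def gap_decider(oracle, n, threshold, T, rng=None):
    """Run the online learner for T rounds; answer SMALL if it ever proposes a
    feasible solution of cardinality at most `threshold`, LARGE otherwise."""
    rng = rng or random.Random(0)
    for _ in range(T):
        S = oracle.decide()          # feasible solution proposed this round
        if len(S) <= threshold:
            return "SMALL"
        chosen = rng.choice(sorted(S))
        weights = [0] * n
        weights[chosen] = 1          # one random element of S gets weight 1
        oracle.observe(weights)      # hence the learner suffers loss 1
    return "LARGE"
```

The point of the random unit weight is that a small hidden solution rarely contains the weighted element, so a low-regret learner is forced to reveal small solutions.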
We now analyze Algorithm 1. If the algorithm outputs YES, this means that at some step $t$ the oracle has found a feasible solution $S_t$ of cardinality smaller than $B$. Since $k^*$ (the minimum cardinality of a feasible solution) is known to be either at most $A$ or at least $B$, the output is always correct.
If the algorithm outputs NO, then this means that every solution $S_t$ had cardinality at least $B$. We bound the probability that Algorithm 1 returns a wrong answer in this case. Let $R_\alpha$ be the α-regret achieved by the oracle (online learning algorithm) on the set of instances produced in Algorithm 1, and let $E$ denote the event that the algorithm returns a wrong answer. By Adam’s law (the law of total expectation), we have:
$$\mathbb{E}[R_\alpha] = \Pr[E] \cdot \mathbb{E}[R_\alpha \mid E] + \Pr[\bar{E}] \cdot \mathbb{E}[R_\alpha \mid \bar{E}].$$
From Algorithm 1 it should be clear that at every step the oracle suffers loss exactly 1, so its total loss is $T$. By the definition of α-regret, this means that:
$$R_\alpha = T - \alpha \cdot \min_{S} \sum_{t=1}^{T} f_t(S).$$
Now, we consider a minimum cardinality feasible solution $S^*$ (for the initial instance of the cardinality minimization problem Π). As Algorithm 1 returns a wrong answer, we have $|S^*| \le A$, while at every time $t$, $S_t$ has at least $B$ elements. Furthermore, by the construction of the weights, there is only one element with weight 1, chosen uniformly in $S_t$. Thus $f_t(S^*) = 1$ with probability at most $A/B$ (and $f_t(S^*) = 0$ otherwise). Thus, we get:
$$\mathbb{E}[R_\alpha \mid E] \ge T - \alpha \cdot \frac{A}{B} \cdot T \ge \epsilon \cdot \frac{A}{B} \cdot T,$$
since $\alpha \le B/A - \epsilon$. Hence, $\mathbb{E}[R_\alpha \mid E] = \Omega(T)$.
As $\mathcal{A}$ has vanishing α-regret, there exists a constant $\gamma > 0$ such that $\mathbb{E}[R_\alpha] \le p(n) \cdot T^{1-\gamma}$, where $p$ is a polynomial of the problem parameters. Therefore,
$$\Pr[E] \le \frac{\mathbb{E}[R_\alpha]}{\mathbb{E}[R_\alpha \mid E]} = O\big(p(n) \cdot T^{-\gamma}\big).$$
Choosing the parameter $T$ as a sufficiently large polynomial in $n$, we get that $\Pr[E] \le 1/3$. Besides, the running time of Algorithm 1 is polynomial since it consists of $T$ (polynomial in the size of the problem) iterations and the running time of each iteration is polynomial (as $\mathcal{A}$ is a polynomial time algorithm).
In conclusion, if there exists a vanishing α-regret algorithm for online min-max-Π, then the NP-complete problem [A,B]-Gap-Π is in RP, implying NP = RP. ∎
The inapproximability (gap) results for the aforementioned problems give lower bounds on the approximation ratio of any vanishing α-regret algorithm for their online min-max versions. For instance, the online min-max dominating set problem has no vanishing α-regret algorithm for any constant α, based on the known approximation hardness of dominating set. We state the lower bound explicitly for the online min-max vertex cover problem in the following corollary, as we refer to it later when showing a matching upper bound. The bounds are based on the known hardness results for vertex cover (NP-hardness and UGC-hardness, respectively).
The online min-max vertex cover problem does not admit a polynomial time vanishing α-regret algorithm, for some explicit constant α > 1, unless NP = RP. It does not admit a polynomial time vanishing (2−ε)-regret algorithm, for any ε > 0, unless the Unique Games problem admits a randomized polynomial-time algorithm.
Now, consider NP-complete cardinality problems which have no known inapproximability gap (for instance Vertex Cover in planar graphs, which admits a PTAS). Then we can show the following impossibility result.
If a cardinality problem Π is NP-complete, then there is no polynomial time vanishing regret algorithm for online min-max-Π unless NP = RP.
We note that the proof of Theorem 1 does not require $A$, $B$ and $\epsilon$ to be constants: they can be functions of the instance, and the result holds as soon as these quantities are polynomially bounded (so that $T$ remains polynomially bounded in $n$). Then, for a cardinality problem Π, deciding whether $k^* \le k$ is the same as deciding whether $k^* \le k$ or $k^* \ge k + 1$. By setting $A = k$, $B = k + 1$ and $\alpha = 1$ in the proof of Theorem 1 we get the result. ∎
2.2 Min-max Vertex Cover: matching upper bound with Gradient Descent
In this section we present an online algorithm for the min-max vertex cover problem based on the classic Online Gradient Descent (OGD) algorithm. In the latter, at every step the solution is obtained by updating the previous one in the direction of the (sub-)gradient of the objective and projecting onto a feasible convex set. The particular nature of the min-max vertex cover problem is that the objective function is an $\ell_\infty$-type norm and the set of feasible solutions is discrete (non-convex). In our algorithm, we consider the following standard relaxation of the problem:
$$\min \; \max_{v \in V} w(v)\, x_v \quad \text{subject to} \quad x_u + x_v \ge 1 \;\; \forall (u,v) \in E, \qquad x \in [0,1]^{V}.$$
At time step $t$, we update the solution in the direction of a sub-gradient that is nonzero only in the coordinate attaining the maximum weight and 0 in all other coordinates. Moreover, after projecting the solution onto the polytope of the relaxation, we round it by a simple procedure: $\bar{x}_v = 1$ if $x_v \ge 1/2$ and $\bar{x}_v = 0$ otherwise. The formal algorithm is given in Algorithm 2.
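One update-and-round step of this scheme might look as follows (a sketch only; the projection onto the relaxation polytope is abstracted as a user-supplied `project` routine, and the sub-gradient magnitude is folded into the step size for simplicity):

```python
def ogd_round_step(x, i_max, eta, project):
    """x: current fractional point; i_max: coordinate attaining the current
    maximum weight; eta: step size; project: projection onto the polytope."""
    y = list(x)
    y[i_max] -= eta                  # move against the sub-gradient coordinate
    y = project(y)                   # project back onto the relaxed feasible set
    return [1 if yi >= 0.5 else 0 for yi in y]   # threshold rounding
```

The factor-2 loss of the rounding is exactly what makes the resulting guarantee a 2-regret bound rather than a plain regret bound.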
The following theorem, coupled with Corollary 2, shows the tight bound of 2 on the approximation ratio of polynomial-time online algorithms for Min-max Vertex Cover (assuming the Unique Games Conjecture).
Assume that the weights are bounded by a constant. Then, after $T$ time steps, Algorithm 2 achieves an expected 2-regret of $O(\sqrt{T})$.
3 Computational issues for Follow the Leader based methods
The most natural approach in online learning is for the player to always pick the leading action, i.e. the action that is optimal on the observed history. However, it can be proven that any deterministic algorithm that always decides on the leading action can be “tricked” by the adversary into making decisions that are worse than the optimal action in hindsight, thus leading to large regret. In this regard, we need to add a regularization term containing randomness to the optimization oracle in order to make our algorithms less predictable and more stable. Thus, the Follow the Regularized Leader strategy in a minimization problem consists of deciding on an action
$$x_{t+1} \in \arg\min_{x \in \mathcal{X}} \Big( \sum_{\tau=1}^{t} f_\tau(x) + R(x) \Big),$$
where $R(x)$ is the regularization term.
There are many variations of the Follow the Leader (FTL) algorithm, differing in the objective functions they handle and in the type of regularization term. For linear objectives, Kalai and Vempala suggested the Follow the Perturbed Leader algorithm, where the regularization term is simply the cost/payoff of each action on a randomly generated instance of the problem. Dudik et al. were able to generalize the FTPL algorithm of Kalai and Vempala to non-linear objectives, by introducing the concept of shared randomness and a much more complex perturbation mechanism.
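For intuition, the FTPL selection rule can be sketched as follows (a minimal illustration over an explicit action set, not the exact noise distribution or oracle formulation of Kalai and Vempala):

```python
import random

def ftpl_choose(cum_costs, eps, rng):
    """Pick the action minimizing the perturbed cumulative cost.
    cum_costs: dict mapping each action to its total observed cost so far;
    eps: controls the perturbation scale (larger noise for smaller eps)."""
    perturbed = {a: c + rng.uniform(0.0, 1.0 / eps) for a, c in cum_costs.items()}
    return min(perturbed, key=perturbed.get)
```

In the linear setting the perturbation can be folded into one extra "instance", so a single call to the offline oracle per round suffices.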
A common element of every Follow the Leader based method is the need for an optimization oracle over the observed history of the problem. This is a minimum requirement, since the regularization term can make determining the leader even harder; however, most algorithms are able to map the perturbations to the value of the objective function on a set of instances of the problem and thus eliminate this extra complexity. To the best of our knowledge, up to now FTL algorithms for non-linear objective functions require an exact oracle or an FPTAS oracle in order to obtain vanishing regret. Thus, strong NP-hardness for the multiple-instance version of the offline problem indicates that the FTL strategy cannot be used for the online problem, at least with our current knowledge.
3.1 Computational hardness results
As we mentioned, algorithms that use the “Follow the Leader” strategy heavily rely on the existence of an optimization oracle for the multi-instance version of the offline problem. For linear objectives, it is easy to see that optimization over a set of instances is equivalent to optimization over a single instance, and thus any algorithm for the offline problem can be transformed into an online learning algorithm. However, for non-linear problems this is not always the case, since even when the offline problem is polynomial-time solvable, the corresponding variation with multiple instances can be strongly NP-hard.
In this section we present some problems for which we can prove that the optimum solution over a set of instances is hard to approximate. More precisely, in the multi-instance version of a given problem, we are given an integer $T$, a set of feasible solutions $\mathcal{F}$, and $T$ objective functions $f_1, \dots, f_T$ over $\mathcal{F}$. The goal is to minimize (over $S \in \mathcal{F}$) the total cost $\sum_{t=1}^{T} f_t(S)$.
We will show computational hardness results for the multi-instance versions of:
min-max vertex cover (already defined).
min-max perfect matching, where we are given an undirected graph and a weight function on the edges, and we need to determine a perfect matching such that the weight of the heaviest edge in the matching is minimized.
min-max path, where we are given an undirected graph, two vertices $s$ and $t$, and a weight function on the edges, and we need to determine an $s$–$t$ path such that the weight of the heaviest edge in the path is minimized.
$P_m||C_{\max}$, where we are given $m$ identical parallel machines ($m$ a fixed constant), a set of $n$ jobs and their processing times, and we need to determine a schedule of the jobs on the machines (without preemption) such that the makespan, i.e. the time that elapses until the last job is completed, is minimized.
Hence, in the multi-instance versions of these problems, we are given $T$ weight functions over vertices (min-max vertex cover) or edges (min-max perfect matching, min-max path), or $T$ processing time vectors ($P_m||C_{\max}$).
The multi-instance versions of min-max vertex cover, min-max perfect matching, min-max path and $P_m||C_{\max}$ are strongly NP-hard.
Here we present the proof for the multi-instance versions of the min-max perfect matching and min-max path problems, which use a similar reduction from the Max-3-DNF problem. The proofs for multi-instance min-max vertex cover and multi-instance $P_m||C_{\max}$ can be found in Appendices A.1 and A.2, respectively.
In the Max-3-DNF problem, we are given a set of $n$ Boolean variables and $m$ clauses that are conjunctions of three variables or their negations, and we need to determine a truth assignment such that the number of satisfied clauses is maximized.
We start with the multi-instance min-max perfect matching problem. For every instance of the Max-3-DNF problem, we construct a graph $G$ and $m$ weight functions $w_1, \dots, w_m$ defined as follows:
To each variable $x_i$ we associate a 4-cycle on vertices $a_i, b_i, c_i, d_i$. This 4-cycle has two perfect matchings: either $a_i$ is matched with $b_i$ and $c_i$ is matched with $d_i$, corresponding to setting the variable $x_i$ to true, or vice versa, corresponding to setting $x_i$ to false. This specifies a one-to-one correspondence between the solutions of the two problems.
Each weight function $w_j$ corresponds to one conjunction: for every variable $x_i$ appearing in clause $j$, the matching edges corresponding to setting $x_i$ against its literal in clause $j$ get weight 1 under $w_j$, and the other matching edges of that 4-cycle get weight 0. Edges in the 4-cycles of variables not appearing in clause $j$ always get weight 0.
The above construction can obviously be done in time polynomial in the size of the input. It remains to show the correlation between the objective values of these solutions. If a clause is satisfied by a truth assignment then (since it is a conjunction) every literal in the clause must be satisfied. From the construction of the instance of multi-instance min-max matching, the corresponding matching will have a maximum weight of 0 for the weight function $w_j$. If a clause is not satisfied by a truth assignment, then the corresponding matching will have a maximum weight of 1 for the weight function $w_j$. Thus, from the reduction we get
$$\mathrm{val}(M_a) = m - \mathrm{val}(a),$$
where $\mathrm{val}$ stands for the value of a solution (and $M_a$ is the matching corresponding to assignment $a$). This equation already proves the hardness result of Theorem 5. It actually also shows APX-hardness. Indeed, the optimal value OPT of Max-3-DNF satisfies $\mathrm{OPT} \ge m/8$, since a random assignment satisfies each conjunction of three literals with probability $1/8$. Assuming the existence of an approximation scheme for the multi-instance min-max perfect matching problem, we would therefore get an approximation scheme for Max-3-DNF. Since Max-3-DNF is APX-hard, multi-instance min-max perfect matching is also APX-hard.
A similar reduction leads to the same result for the min-max path problem: starting from an instance of Max-3-DNF, build a graph on vertices $v_0, v_1, \dots, v_n$, where vertex $v_i$ corresponds to variable $x_i$. There are two parallel edges $e_i^T$ and $e_i^F$ between $v_{i-1}$ and $v_i$. We are looking for $v_0$–$v_n$ paths. Taking edge $e_i^T$ (resp. $e_i^F$) corresponds to setting $x_i$ to true (resp. false). As previously, this gives a one-to-one correspondence between solutions. Each clause $c_j$ corresponds to one weight function $w_j$: if $x_i$ appears positively in $c_j$ then $w_j(e_i^F) = 1$, and if $x_i$ appears negatively then $w_j(e_i^T) = 1$. All other weights are 0. Then, for a path $P$, the maximum weight of $P$ under $w_j$ is 0 if and only if $c_j$ is satisfied by the corresponding truth assignment. The remainder of the proof is exactly the same as the one of min-max perfect matching. ∎
Theorem 5 gives insight into the hardness of non-linear multi-instance problems compared to their single-instance counterparts. As we proved, the multi-instance $P_m||C_{\max}$ is strongly NP-hard while $P_m||C_{\max}$ is known to admit an FPTAS [20, 23]. Also, the multi-instance versions of min-max perfect matching, min-max path and min-max vertex cover are proved to be APX-hard while their single-instance versions can be solved in polynomial time. We also note that these hardness results hold for the very specific case where weights/processing times are in $\{0,1\}$, for which $P_m||C_{\max}$, as well as the other problems, becomes trivial.
We also note that the inapproximability bound we obtained for the multi-instance min-max vertex cover under the UGC is tight, since we can formulate the problem as a linear program, solve its continuous relaxation and then use a rounding algorithm to get a vertex cover of cost at most twice the optimum for the problem.
The results on the min-max vertex cover problem also provide some answer to question (Q2) raised in the introduction. As we proved in Section 2.2, the online gradient descent method (paired with a rounding algorithm) suffices to give a vanishing 2-regret algorithm for online min-max vertex cover. However, since the multi-instance version of the problem is APX-hard, there is no indication that the follow the leader approach can be used to obtain the same result and match the lower bound of Corollary 2 for the problem.
3.2 Online generalized knapsack problem
In this section we present a vanishing regret algorithm for the online learning version of the following generalized knapsack problem. In the traditional knapsack problem, one has to select a set of items with total weight not exceeding a fixed “knapsack” capacity so as to maximize the total profit of the set. Instead, we assume that the knapsack can be customized to fit more items. Specifically, there is a capacity $C$, and if the total weight of the items exceeds this capacity, then we have to pay $c$ times the extra weight. Formally:
Definition 3 (Generalized Knapsack Problem (GKP)).
Given a set of $n$ items with non-negative weights $w_i$ and non-negative profits $p_i$, a knapsack capacity $C$ and a constant $c \ge 0$, determine a set $S$ of items that maximizes the total profit:
$$f(S) = \sum_{i \in S} p_i - c \cdot \max\Big\{0, \; \sum_{i \in S} w_i - C\Big\}.$$
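This objective is easily computed: total profit minus $c$ times the weight overflow beyond the capacity (an illustrative transcription with hypothetical variable names):

```python
def gkp_value(weights, profits, selected, C, c):
    """GKP objective: profit of the selected items, penalized by c times the
    amount by which their total weight exceeds the capacity C."""
    W = sum(weights[i] for i in selected)
    P = sum(profits[i] for i in selected)
    return P - c * max(0, W - C)
```

Note that the value can be negative when the overflow penalty exceeds the profit, which is why an online player may suffer a loss.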
This problem, as well as generalizations with other penalty costs for overweight, has been studied for instance in [4, 2] (see there for practical motivations). In an online learning setting, we assume that we have $n$ items with static weights and a static constant $c$. At each time step, we need to select a subset of the items, and only then do we learn the capacity of the knapsack and the profit of every item, gaining profit or even suffering loss based on our decision.
As we showed in Section 3.1, many non-linear problems do not have an efficient (polynomial) offline oracle and, as a direct consequence, the follow the leader strategy cannot directly be applied to obtain vanishing regret. While GKP is clearly not linear, due to the maximum in the profit function, we will show that there exists an FPTAS for its multiple-instance variant. We will use this result to obtain a vanishing regret algorithm for the online version of GKP (Theorem 6).
Since the problem is not linear, we use the generalized FTPL (GFTPL) framework of Dudik et al., which does not rely on the assumption that the objective function is linear. While in the linear case it was sufficient to consider an “extra” random observation (FTPL), a much more complex perturbation mechanism is needed for the analysis to work when the objective function is not linear. The key idea of the GFTPL algorithm is to use common randomness for every feasible action but to apply it in a different way. This concept was referred to by the authors as shared randomness, formalized via the notion of a translation matrix. The method is presented in Appendix B.1.
There is a polynomial time vanishing regret algorithm for GKP.
(sketch) The proof is based on the three following steps:
First, we note that GFTPL works (gives vanishing regret) even if the oracle is only an FPTAS. This is necessary since our problem is clearly NP-hard.
Second, we provide for GKP an ad hoc translation matrix. This shows that the GFTPL method can be applied to our problem. Moreover, this matrix is built in such a way that the oracle needed for GFTPL is precisely a multi-instance oracle.
Third, we show that there exists an FPTAS multi-instance oracle.
The first two points are given in Appendices B.1 and B.2, respectively. We only show the last point. To do this, we show that we can map a set of instances of the generalized knapsack problem to a single instance of the more general convex-generalized knapsack problem. Suppose that we have a set of $T$ instances of GKP. Then, the total profit of every item set $S$ is:
$$\sum_{t=1}^{T} f_t(S) = P(S) - c \sum_{t=1}^{T} \max\Big\{0, \; W(S) - C_t\Big\},$$
where $P(S) = \sum_{i \in S} \sum_{t=1}^{T} p_i^t$ and $W(S) = \sum_{i \in S} w_i$. Let $W$ denote the total weight of the item set and $C_{(1)} \le \dots \le C_{(T)}$ a non-decreasing ordering of the knapsack capacities. Then the total penalty $g(W) = c \sum_{t=1}^{T} \max\{0, W - C_{(t)}\}$ is a piecewise linear function of $W$ whose slope increases at each breakpoint $C_{(t)}$.
Note that the above function is always convex. This means that at every time step, we need an FPTAS for the maximization of $P(S) - g(W(S))$ over item sets $S$, where $g$ is a convex function. Such an FPTAS is known to exist; its time complexity analysis assumes that the convex function can be evaluated in constant time. In our case the convex function is part of the input; with binary search over the sorted capacities we can evaluate it in logarithmic time. ∎
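The logarithmic-time evaluation mentioned in the proof can be sketched with prefix sums over the sorted capacities (illustrative code under the stated assumptions):

```python
import bisect

def make_penalty(capacities, c):
    """Return g with g(W) = c * sum_t max(0, W - C_t), evaluable in O(log T)."""
    caps = sorted(capacities)
    prefix = [0]                      # prefix[i] = sum of the i smallest capacities
    for C in caps:
        prefix.append(prefix[-1] + C)
    def g(W):
        i = bisect.bisect_right(caps, W)   # number of capacities not exceeding W
        return c * (i * W - prefix[i])     # c * sum over those capacities of (W - C)
    return g
```

After the one-time sort, each oracle call evaluates the convex penalty with a single binary search, as required by the FPTAS.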
In this paper, we have presented a general framework showing the hardness of online learning for min-max problems. We have also shown a sharp separation between two widely studied online learning strategies, online gradient descent and follow the leader, from the approximation and computational complexity standpoints. The paper gives rise to several interesting directions. A first one is to extend the reduction framework to objectives other than min-max. A second direction is to design online vanishing regret algorithms with approximation ratios matching the lower bound guarantees. Finally, the proof of Theorem 1 needs a non-oblivious adversary. An interesting direction would be to obtain the same lower bounds with an oblivious adversary, if possible.
Appendix A Hardness of multi-instance problems (Theorem 5)
A.1 Hardness of multi-instance min-max vertex cover
We make a straightforward reduction from the vertex cover problem. Consider any instance $G = (V, E)$ of the vertex cover problem, with $|V| = n$. We construct $n$ weight functions $w_1, \dots, w_n$ such that in $w_i$ vertex $v_i$ has weight 1 and all other vertices have weight 0. If we consider the instance of the multi-instance min-max vertex cover with graph $G$ and weight functions $w_1, \dots, w_n$, it is clear that any vertex cover has total cost equal to its size, since for any vertex $v_i$ there is exactly one weight function where it has weight 1, and it has weight 0 in every other weight function.
Since vertex cover is strongly NP-hard, NP-hard to approximate within some constant ratio and UGC-hard to approximate within ratio $2 - \epsilon$, the same negative results hold for the multi-instance min-max vertex cover problem.
A.2 Hardness of multi-instance P3||Cmax
We prove that the multi-instance problem is strongly NP-hard even when the processing times are in {0, 1}, using a reduction from the NP-complete 3-coloring problem. In the 3-coloring (3C) problem, we are given a graph and we need to decide whether its vertices can be colored with 3 colors such that no two vertices connected by an edge receive the same color.
For every instance of the 3C problem with vertices and edges, we construct (in polynomial time) an instance of the multi-instance problem with one job per vertex and one processing time vector per edge. Every edge corresponds to a processing time vector in which the jobs of its two endpoints have processing time 1 and every other job has processing time 0. It is easy to see that at each time step the makespan is either 1 or 2, and thus the total makespan is at least the number of edges and at most twice that number.
If there exists a 3-coloring of the graph, then by assigning every color to a machine, at each time step there are no two jobs with non-zero processing time on the same machine, and thus the makespan is 1; the total solution then has cost equal to the number of edges. Conversely, if the total solution has this cost, then the makespan was 1 at every time step, and by assigning the same color to the jobs of every machine we get a 3-coloring of the graph. Hence, the multi-instance variation of the problem is strongly NP-hard.
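The equivalence above can be checked on small graphs with the following sketch (our own identifiers; jobs are indexed by vertices, machines by colors):

```python
def coloring_to_instance(n_jobs, edges):
    """One processing-time vector per edge (u, v): jobs u and v get
    processing time 1, all other jobs get processing time 0."""
    return [[1 if j in (u, v) else 0 for j in range(n_jobs)]
            for (u, v) in edges]

def total_makespan(assignment, vectors):
    """assignment[j] in {0, 1, 2} is the machine (color) of job j.
    The per-vector makespan is the maximum machine load: it is 1 iff
    the corresponding edge is properly colored, and 2 otherwise."""
    total = 0
    for p in vectors:
        loads = [0, 0, 0]
        for j, t in enumerate(p):
            loads[assignment[j]] += t
        total += max(loads)
    return total
```

On a triangle (3 edges), a proper 3-coloring yields total makespan 3, while a monochromatic assignment yields 6.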
Appendix B A polynomial time vanishing regret algorithm for GKP (Theorem 6)
B.1 Generalized follow the perturbed leader
For the sake of completeness, we introduce the generalized FTPL (GFTPL) method of Dudik et al. , which can be used to achieve a vanishing regret for non-linear objective functions for some discrete problems. The key idea of the GFTPL algorithm is to use common randomness for every feasible action, but to apply it in a different way for each of them. This concept was referred to by the authors of  as shared randomness. In their algorithm, the regularization term of the FTPL algorithm is replaced by the inner product of a random vector with a vector associated to the action. In FTPL it was sufficient to use the action itself as this vector, but in this general setting it must be the row of a translation matrix that corresponds to the action.
Definition 4 (Admissible Matrix ).
A matrix is admissible if its rows are distinct. It is -admissible if it is admissible and also (i) the number of distinct elements within each column is at most and (ii) the distinct elements within each column differ by at least .
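The two admissibility conditions can be stated as a small checker; `kappa` and `delta` are our placeholder names for the two bounds in the definition:

```python
def is_admissible(matrix):
    """A matrix is admissible if its rows are pairwise distinct."""
    return len({tuple(r) for r in matrix}) == len(matrix)

def is_kappa_delta_admissible(matrix, kappa, delta):
    """Additionally: (i) each column has at most kappa distinct values,
    and (ii) any two distinct values in a column differ by >= delta."""
    if not is_admissible(matrix):
        return False
    for col in zip(*matrix):
        vals = sorted(set(col))
        if len(vals) > kappa:
            return False
        if any(b - a < delta for a, b in zip(vals, vals[1:])):
            return False
    return True
```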
Definition 5 (Translation Matrix ).
A translation matrix is a -admissible matrix with rows and columns. Since the number of rows is equal to the number of feasible actions, we denote by the row corresponding to action . In the general case, we use to denote the diameter of the translation matrix.
From the definition of the translation matrix it becomes clear that the action space needs to be finite. Note that the number of feasible actions can be exponential in the input size, since we never need to compute the translation matrix explicitly. The generalized FTPL algorithm for a maximization problem is presented in Algorithm 3. At time , the algorithm selects as the perturbed leader the action that maximizes the total payoff on the observed history plus some noise given by the inner product of the action's translation row with the perturbation vector . Note that in  the algorithm only needs an oracle with an additive error . We will see later that it also works with a multiplicative error (more precisely, with an FPTAS).
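The loop can be sketched as follows. This is a schematic version under our own naming: `payoff`, `gamma_row` (the translation-matrix row of an action), and `observe` (the adversary's revealed instance at time t) are assumed black boxes, and the exact noise distribution is a placeholder:

```python
import random

def gftpl(actions, payoff, gamma_row, n_cols, observe, T, eta=1.0):
    """At each step, play the action maximizing the payoff on the
    observed history plus the perturbation <gamma_row(x), alpha>,
    where alpha is a random vector drawn once at the start."""
    alpha = [random.uniform(0, eta) for _ in range(n_cols)]
    history, plays = [], []
    for t in range(T):
        def perturbed(x):
            past = sum(payoff(x, obs) for obs in history)
            noise = sum(g * a for g, a in zip(gamma_row(x), alpha))
            return past + noise
        x_t = max(actions, key=perturbed)  # the oracle call; an (approximate) maximizer suffices
        plays.append(x_t)
        history.append(observe(t))  # adversary reveals instance t
    return plays
```

The exhaustive `max` over actions is only for illustration; in the intended setting this step is delegated to the (approximate) oracle.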
Let us denote by the diameter of the objective function, i.e., .
Theorem 7 ().
By using an appropriate to draw the random vector, the regret of the generalized FTPL algorithm is:
By setting , this clearly gives a vanishing regret.
Let us point out two difficulties in using this algorithm. First, the oracle has to solve a problem whose objective function is the sum of a multi-instance version of the offline problem and the perturbation. We will see in Appendix B.2 how the perturbation mechanism can be implemented as the payoff of the action on a set of (random) observations of the problem.
Second, if the multi-instance version is NP-hard, an efficient algorithm solving the oracle with an additive error is unlikely to exist. We remark that the assumption of an additive error can be replaced by the assumption of the existence of an FPTAS for the oracle. Namely, let us consider a modification of Algorithm 3 where at each time we compute a solution such that:
Then, denoting by the maximum payoff, i.e., , and applying the same analysis as in , we can show that by fixing we are guaranteed to get an action with at least the same total perturbed payoff as the decision obtained with an additive optimization parameter . The computation is polynomial if we use an FPTAS. Then, we can still get a vanishing regret by using instead of (considering all parameters of the problem as constants).
As a corollary, we can achieve a vanishing regret for any online learning problem in our setting by assuming access to an oracle OPT that can compute (for any ) in polynomial time a decision satisfying Equation (1).
B.2 Distinguisher sets and a translation matrix for GKP
As noted above, an important issue in the method arises from the perturbation. Until now, the translation matrix could be any -admissible matrix as long as it had one distinct row for every feasible action. However, this matrix has to be taken into account by the oracle in order to decide . In  the authors introduce the concept of implementability, which overcomes this problem. We present a simplified version of this property.
Definition 6 (Distinguisher Set).
A distinguisher set for an offline problem P is a set of instances such that for any feasible actions :
This means that is a set of instances that “forces” any two different actions to differ in at least one of their payoffs over the instances in . If we can determine such a set, then we can construct a translation matrix that significantly simplifies our assumptions on the oracle.
Let be a distinguisher set for our problem. Then, for every feasible action we can construct the corresponding row of such that:
Since is a distinguisher set, the translation matrix is guaranteed to be admissible. Furthermore, according to the set we can always determine some and parameters for the translation matrix. By implementing using a distinguisher set, the expression we need to (approximately) maximize at each round can be written as:
This shows that the perturbations transform into a set of weighted instances, where the weights are drawn uniformly at random. This is already a significant improvement, since the oracle now has to consider only weighted instances of the offline problem instead of the arbitrary perturbation we were assuming until now. Furthermore, for a variety of problems (including GKP), we can construct a distinguisher set such that:
If this is true, then we can shift the random weights of the oracle inside the instances:
Thus, if we have a distinguisher set for a given problem, to apply GFTPL all we need is an FPTAS for optimizing the total payoff over a set of weighted instances.
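The resulting perturbed objective can be sketched as a single weighted multi-instance payoff; `payoff` is an assumed black box for the offline problem, and the history instances carry weight 1 while instance j of the distinguisher set carries the random weight alpha_j:

```python
def perturbed_objective(x, history, distinguisher, alpha, payoff):
    """Payoff of action x over the history plus the distinguisher
    instances, with each distinguisher instance weighted by its
    random coefficient. The perturbation is thus absorbed into
    extra weighted instances of the offline problem."""
    weighted = [(obs, 1.0) for obs in history] + \
               list(zip(distinguisher, alpha))
    return sum(w * payoff(x, inst) for inst, w in weighted)
```

An FPTAS for maximizing this weighted multi-instance payoff is then exactly what GFTPL needs at each round.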
We now provide a distinguisher set for the generalized knapsack problem. Consider a set of instances of the problem such that in instance , item has profit , all other items have profit 0, and the knapsack capacity is . Since the total weight of a set of items can never exceed the capacity, it is easy to see that:
For any two different assignments, there is at least one item that they do not have in common. It is easy to see that in the corresponding instance, one of the assignments has total profit and the other has total profit 0. Thus, the proposed set of instances is indeed a distinguisher set for the generalized knapsack problem. We use this set of instances to implement the matrix. Then, every column of the matrix has exactly 2 distinct values, 0 and , making the translation matrix -admissible. As a result, in order to achieve a vanishing regret for the online learning version of GKP, all we need is an FPTAS for the multi-instance generalized knapsack problem.
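The construction can be sketched as follows; the profit value `p` and the capacity are placeholder assumptions (the capacity is set to the total item weight so that every item set is feasible):

```python
def distinguisher_set(n_items, total_weight, p=1.0):
    """Instance i: item i has profit p, all others profit 0, and the
    capacity equals the total weight, so every item set is feasible."""
    return [{"profits": [p if j == i else 0.0 for j in range(n_items)],
             "capacity": total_weight}
            for i in range(n_items)]

def profit(item_set, inst):
    """Total profit of an item set in one instance: p if the
    distinguished item is in the set, 0 otherwise."""
    return sum(inst["profits"][j] for j in item_set)
```

Two distinct item sets differ on some item i, and in instance i one set has profit p and the other 0, so the induced translation-matrix rows are distinct with only the two values 0 and p per column.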
- Agarwal et al.  Naman Agarwal, Alon Gonen, and Elad Hazan. Learning in non-convex games with an optimization oracle. In Proc. 32nd Conference on Learning Theory, volume 99, pages 18–29, 2019.
- Antoniadis et al.  Antonios Antoniadis, Chien-Chung Huang, Sebastian Ott, and José Verschae. How to pack your items when you have to buy your knapsack. In Proc. Mathematical Foundations of Computer Science, pages 62–73, 2013.
- Awerbuch and Kleinberg  Baruch Awerbuch and Robert Kleinberg. Online linear optimization and adaptive routing. Journal of Computer and System Sciences, 74(1):97–114, 2008.
- Barman et al.  Siddharth Barman, Seeun Umboh, Shuchi Chawla, and David L. Malec. Secretary problems with convex costs. In Proc. 39th Colloquium on Automata, Languages, and Programming, pages 75–87, 2012.
- Cesa-Bianchi et al.  N. Cesa-Bianchi, C. Gentile, and Y. Mansour. Regret minimization for reserve prices in second-price auctions. IEEE Transactions on Information Theory, 61(1):549–564, Jan 2015.
- Daskalakis and Syrgkanis  Constantinos Daskalakis and Vasilis Syrgkanis. Learning in auctions: Regret is hard, envy is easy. In Proc. 57th Symposium on Foundations of Computer Science, pages 219–228, 2016.
- Dudik et al.  Miroslav Dudik, Nika Haghtalab, Haipeng Luo, Robert E Schapire, Vasilis Syrgkanis, and Jennifer Wortman Vaughan. Oracle-efficient online learning and auction design. In Proc. 58th Symposium on Foundations of Computer Science (FOCS), pages 528–539, 2017.
- Freund and Schapire  Yoav Freund and Robert E Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of computer and system sciences, 55(1):119–139, 1997.
- Garber  Dan Garber. Efficient online linear optimization with approximation algorithms. In Advances in Neural Information Processing Systems, pages 627–635, 2017.
- Hannan  James Hannan. Approximation to bayes risk in repeated play. Contributions to the Theory of Games, 3:97–139, 1957.
- Hazan  Elad Hazan. Introduction to online convex optimization. Foundations and Trends in Optimization, 2(3-4):157–325, 2016.
- Hazan and Kale  Elad Hazan and Satyen Kale. Online submodular minimization. Journal of Machine Learning Research, 13:2903–2922, 2012.
- Hazan and Koren  Elad Hazan and Tomer Koren. The computational power of optimization in online learning. In Proc. 48th Symposium on Theory of Computing, pages 128–141, 2016.
- Kakade et al.  Sham M Kakade, Adam Tauman Kalai, and Katrina Ligett. Playing games with approximation algorithms. SIAM Journal on Computing, 39(3):1088–1106, 2009.
- Kalai and Vempala  Adam Kalai and Santosh Vempala. Efficient algorithms for online decision problems. Journal of Computer and System Sciences, 71(3):291–307, 2005.
- Khot and Regev  Subhash Khot and Oded Regev. Vertex cover might be hard to approximate to within 2-epsilon. J. Comput. Syst. Sci., 74(3):335–349, 2008.
- Khot et al.  Subhash Khot, Dor Minzer, and Muli Safra. Pseudorandom sets in grassmann graph have near-perfect expansion. In Proc. 59th Symposium on Foundations of Computer Science, pages 592–601, 2018.
- Littlestone and Warmuth  Nick Littlestone and Manfred K Warmuth. The weighted majority algorithm. Information and computation, 108(2):212–261, 1994.
- Raz and Safra  Ran Raz and Shmuel Safra. A sub-constant error-probability low-degree test, and a sub-constant error-probability PCP characterization of NP. In Proc. 29th Symposium on the Theory of Computing, pages 475–484, 1997.
- Sahni  Sartaj K. Sahni. Algorithms for scheduling independent tasks. J. ACM, 23(1):116–127, 1976.
- Shalev-Shwartz et al.  Shai Shalev-Shwartz et al. Online learning and online convex optimization. Foundations and Trends in Machine Learning, 4(2):107–194, 2012.
- Syrgkanis et al.  Vasilis Syrgkanis, Akshay Krishnamurthy, and Robert Schapire. Efficient algorithms for adversarial contextual learning. In International Conference on Machine Learning, pages 2159–2168, 2016.
- Woeginger  Gerhard J. Woeginger. When does a dynamic programming formulation guarantee the existence of a fully polynomial time approximation scheme (FPTAS)? INFORMS J. on Computing, 12(1), January 2000.
- Zinkevich  Martin Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In Proc. 10th International Conference on Machine Learning, pages 928–935, 2003.