1 Introduction
Consider computing the shortest path from home to work every morning. The shortest path may vary from day to day—sometimes side roads beat the highway; sometimes the bridge is closed due to construction. However, although San Francisco and New York are contained in the same road network, it is unlikely that a San Francisco-area commuter would ever find New York along her shortest path—the edge times in the graph do not change that dramatically from day to day.
With this motivation in mind, we study a learning problem where the goal is to speed up repeated computations when the sequence of instances shares common substructure. Examples include repeatedly computing the shortest path between the same two nodes on a graph with varying edge weights, repeatedly computing string matchings, and repeatedly solving linear programs with mildly varying objectives. Our work is in the spirit of recent work in data-driven algorithm selection (e.g., Gupta and Roughgarden, 2017; Balcan et al., 2017, 2018a, 2018b) and online learning (e.g., Cesa-Bianchi and Lugosi, 2006), although with some key differences, which we discuss below.
The basis of this work is the observation that for many realistic instances of repeated problems, vast swaths of the search space may never contain an optimal solution—perhaps the shortest path is always contained in a specific region of the road network; large portions of a DNA string may never contain the patterns of interest; a few key linear programming constraints may be the only ones that bind. Algorithms designed to satisfy worst-case guarantees may thus waste substantial computation time on futile searching. For example, even if a single, fixed path from home to work were best every day, Dijkstra's algorithm would consider all nodes within distance $d_t$ of home on day $t$, where $d_t$ is the length of the optimal path on day $t$, as illustrated in Figure 1.
We develop a simple solution, inspired by online learning, that leverages this observation to the maximal extent possible. On each problem, our algorithm typically searches over a small, pruned subset of the solution space, which it learns over time. This pruning is the minimal subset containing all previously returned solutions. These rounds are analogous to "exploit" rounds in online learning. To learn a good subset, our algorithm occasionally deploys a worst-case-style algorithm, which explores a large part of the solution space and guarantees correctness on any instance. These rounds are analogous to "explore" rounds in online learning. If, for example, a single fixed path were always optimal, our algorithm would almost always immediately output that path, as it would be the only one in its pruned search space. Occasionally, it would run a full Dijkstra computation to check whether it should expand the pruned set. Roughly speaking, we prove that our algorithm's solution is almost always correct, but its cumulative runtime is not much larger than that of running an optimal algorithm on the maximally pruned search space in hindsight. Our results hold for worst-case sequences of problem instances, and we do not make any distributional assumptions.
In a bit more detail, let $f$ be a function that takes as input a problem instance $x$ and returns a solution $f(x) \in \mathcal{Y}$. Our algorithm receives a sequence of inputs $x_1, \dots, x_T$ from a set $\mathcal{X}$. Our high-level goal is to correctly compute $f(x_t)$ on almost every round $t$ while minimizing runtime. For example, each $x_t$ might be a set of graph edge weights for some fixed graph and $f(x_t)$ might be the shortest $s$-$v$ path for some fixed vertices $s$ and $v$. Given a sequence $x_1, \dots, x_T$, a worst-case algorithm would simply compute and return $f(x_t)$ for every instance $x_t$. However, in many application domains, we have access to other functions mapping $\mathcal{X}$ to $\mathcal{Y}$ that are faster to compute than $f$. These simpler functions are defined by subsets of a universe $U$ that represents the entire search space. We call each subset $S \subseteq U$ a "pruning" of the search space. For example, in the shortest-paths problem, $U$ equals the set of edges and a pruning $S$ is a subset of the edges. The function corresponding to $S$, which we denote $f_S$, also takes as input edge weights $x$, but returns the shortest path from $s$ to $v$ using only edges from the set $S$. By definition, the function that is correct on every input is $f_U$. We assume that for every $x$, there is a set $S^*(x)$ such that $f_S(x) = f(x)$ if and only if $S \supseteq S^*(x)$—a mild assumption we discuss in more detail later on.
Given a sequence of inputs , our algorithm returns the value on round , where is chosen based on the first inputs . Our goal is two fold: first, we hope to minimize the size of each (and thereby maximally prune the search space), since is often monotonically related to the runtime of computing . For example, a shortest path computation will typically run faster if we consider only paths that use a small subset of edges. To this end, we prove that if is the smallest set such that for all (or equivalently,^{1}^{1}1We explain this equivalence in Lemma 3. ), then
where the expectation is over the algorithm's randomness. At the same time, we seek to minimize the number of mistakes our algorithm makes (i.e., rounds where $f_{\hat S_t}(x_t) \ne f(x_t)$). We prove that the expected fraction of rounds where $f_{\hat S_t}(x_t) \ne f(x_t)$ is $O(|S^*| / \sqrt{T})$. Finally, the expected runtime of the algorithm is the expected time required to compute $f_{\hat S_t}(x_t)$ for $t = 1, \dots, T$, plus the expected time required to determine the subsets $\hat S_t$. (As we will formalize, when determining $\hat S_t$, our algorithm must compute the smallest set $S^*(x_s)$ such that $f_{S^*(x_s)}(x_s) = f(x_s)$ on some of the inputs $x_s$; in all of the applications we discuss, the total runtime required for these computations is upper bounded by the total time required to compute $f(x_s)$ on those rounds.)
We instantiate our algorithm and corresponding theorem in three diverse settings—shortest-path routing, linear programming, and string matching—to illustrate the flexibility of our approach. We present experiments on real-world maps and economically motivated linear programs. In the case of shortest-path routing, our algorithm's performance is illustrated in Figure 1. Our algorithm explores, on average, up to five times fewer nodes than Dijkstra's algorithm, while sacrificing accuracy on only a small number of rounds. In the case of linear programming, when the objective function is perturbed on each round but the constraints remain invariant, we show that it is possible to significantly prune the constraint matrix, allowing our algorithm to perform fewer simplex iterations while finding solutions that are nearly always optimal.
1.1 Related work
Our work advances a recent line of research studying the foundations of algorithm configuration. Many of these works study a distributional setting (Ailon et al., 2011; Clarkson et al., 2014; Gupta and Roughgarden, 2017; Kleinberg et al., 2017; Balcan et al., 2017, 2018a, 2018b; Weisz et al., 2018): there is a distribution over problem instances and the goal is to use a set of samples from this distribution to determine an algorithm from some fixed class with the best expected performance. In our setting, there is no distribution over instances: they may be adversarially selected. Additionally, we focus on quickly computing solutions for problems solvable in polynomial time rather than on developing algorithms for NP-hard problems, which have been the main focus of prior work.
Several works have also studied online algorithm configuration without distributional assumptions from a theoretical perspective (Gupta and Roughgarden, 2017; Cohen-Addad and Kanade, 2017; Balcan et al., 2018b). Before the arrival of any problem instance, the learning algorithm fixes a class of algorithms to learn over. The classes of algorithms that Gupta and Roughgarden (2017), Cohen-Addad and Kanade (2017), and Balcan et al. (2018b) study are infinite, defined by real-valued parameters. The goal is to select parameters at each timestep while minimizing regret. These works provide conditions under which it is possible to design algorithms achieving sublinear regret. These are conditions on the cost functions mapping the real-valued parameters to the algorithm's performance on any input. In our setting, the choice of a pruning can be viewed as a parameter, but this parameter is combinatorial, not real-valued, so the prior analyses do not apply.
Several works have studied how to take advantage of structure shared over a sequence of repeated computations for specific applications, including linear programming (Banerjee and Roy, 2015) and matching (Deb et al., 2006). As in our work, these algorithms have full access to the problem instances they are attempting to solve. These approaches, however, are quite different (e.g., using machine-learned classifiers) and highly tailored to the application domain, whereas we provide a general algorithmic framework and instantiate it in several different settings.
Since our algorithm receives input instances in an online fashion and makes no distributional assumptions on these instances, our setting is reminiscent of online optimization. However, unlike the typical online setting, we observe each input $x_t$ before choosing an output $y_t$. Thus, if runtime costs were not a concern, we could always return the best output for each input. We seek to trade off correctness for lower runtime costs. In contrast, in online optimization, one must commit to an output $y_t$ before seeing the input $x_t$, in both the full-information and bandit settings (see, e.g., Kalai and Vempala, 2005; Awerbuch and Kleinberg, 2008). In such a setting, one cannot hope to return the best $y_t$ for each $x_t$ with significant probability. Instead, the typical goal is that the performance over all inputs should compete with the performance of the best fixed output in hindsight.
2 Model
We start by defining our model of repeated computation. Let $\mathcal{X}$ be an abstract set of problem instances and let $\mathcal{Y}$ be a set of possible solutions. We design an algorithm that operates over $T$ rounds: on round $t$, it receives an instance $x_t \in \mathcal{X}$ and returns some element of $\mathcal{Y}$.
[Repeated algorithm] Over $T$ rounds, a repeated algorithm $\mathcal{A}$ encounters a sequence of inputs $x_1, \dots, x_T \in \mathcal{X}$. On round $t$, after receiving input $x_t$, it outputs $\mathcal{A}(x_{1:t}) \in \mathcal{Y}$, where $x_{1:t}$ denotes the sequence $x_1, \dots, x_t$. A repeated algorithm may maintain a state from period to period, and thus $\mathcal{A}(x_{1:t})$ may potentially depend on all of $x_1, \dots, x_t$.
We assume each problem instance has a unique correct solution (invoking tie-breaking assumptions as necessary; in Section 6, we discuss how to handle problems that admit multiple solutions). We denote the mapping from instances to correct solutions as $f : \mathcal{X} \to \mathcal{Y}$. For example, in the case of shortest paths, we fix a graph $G = (V, E)$ and a pair of source and terminal nodes $s, v \in V$. Each instance $x$ represents a weighting of the graph's edges. The set $\mathcal{Y}$ consists of all paths from $s$ to $v$ in $G$. Then $f(x)$ returns the shortest path from $s$ to $v$ in $G$, given the edge weights $x$ (breaking ties according to some canonical ordering of the elements of $\mathcal{Y}$, as discussed in Section 6). To measure correctness, we use a mistake bound model (see, e.g., Littlestone, 1987).
[Repeated algorithm mistake bound] The mistake bound of the repeated algorithm $\mathcal{A}$ given inputs $x_1, \dots, x_T$ is
$$\mathbb{E}\left[\sum_{t=1}^{T} \mathbf{1}\left\{\mathcal{A}(x_{1:t}) \ne f(x_t)\right\}\right],$$
where the expectation is over the algorithm's random choices. To minimize the number of mistakes, the naïve algorithm would simply compute the function $f(x_t)$ at every round $t$. However, in our applications, we will have the option of computing other functions mapping the set of inputs $\mathcal{X}$ to the set of outputs $\mathcal{Y}$ that are faster to compute than $f$. Broadly speaking, these simpler functions are defined by subsets of a universe $U$, or "prunings" of $U$. For example, in the shortest-paths problem, given a fixed graph $G = (V, E)$ as well as source and terminal nodes $s, v \in V$, the universe is the set of edges, i.e., $U = E$. Each input $x$ is a set of edge weights and $f(x)$ computes the shortest $s$-$v$ path in $G$ under the input weights. The simpler function corresponding to a subset of edges $S \subseteq E$ also takes as input weights $x$, but it returns the shortest path from $s$ to $v$ using only edges from the set $S$ (returning a null element $\bot$ if no such path exists). Intuitively, the universe $U$ contains all the information necessary to compute the correct solution to any input $x$, whereas the function corresponding to a subset $S \subseteq U$ can only compute a subproblem using information restricted to $S$.
Let $f_S$ denote the function corresponding to the set $S \subseteq U$. We make two natural assumptions on these functions. First, we assume the function $f_U$ corresponding to the universe is always correct. Second, we assume that for each input $x$ there is a unique smallest set $S^*(x)$ that any pruning must contain in order to correctly compute $f(x)$. These assumptions are summarized below.
For all $x \in \mathcal{X}$, $f_U(x) = f(x)$. Also, for every $x \in \mathcal{X}$, there exists a unique smallest set $S^*(x) \subseteq U$ such that for every $S \subseteq U$, $f_S(x) = f(x)$ if and only if $S \supseteq S^*(x)$.
Given a sequence of inputs $x_1, \dots, x_T$, our algorithm returns the value $f_{\hat S_t}(x_t)$ on round $t$, where the choice of $\hat S_t$ depends on the first $t-1$ inputs. In our applications, it is typically faster to compute $f_S$ than $f_{S'}$ if $|S| \le |S'|$. Thus, our goal is to minimize the number of mistakes the algorithm makes while simultaneously minimizing $\sum_{t=1}^{T} |\hat S_t|$. Though we are agnostic to the specific runtime of computing each function $f_S$, minimizing $\sum_{t=1}^{T} |\hat S_t|$ roughly amounts to minimizing the search-space size and our algorithm's runtime in the applications we consider.
We now describe how this model can be instantiated in three classic settings: shortest-path routing, string search, and linear programming.
Shortestpath routing.
In the repeated shortest-paths problem, we are given a graph $G = (V, E)$ (with static structure) and a fixed pair of source and terminal nodes $s, v \in V$. In period $t$, the algorithm receives a nonnegative weight assignment $x_t : E \to \mathbb{R}_{\ge 0}$. Figure 2 illustrates the pruning model applied to the repeated shortest-paths problem.
For this problem, the universe is the edge set (i.e., $U = E$) and a pruning $S$ is a subset of edges in the graph. The set $\mathcal{X}$ consists of all possible weight assignments to edges in the graph and $\mathcal{Y}$ is the set of all paths in the graph, together with $\bot$ indicating that no path exists. The function $f$ returns the shortest $s$-$v$ path in $G$ given edge weights $x$. For any $S \subseteq E$, the function $f_S$ computes the shortest $s$-$v$ path on the subgraph induced by the edges in $S$ (breaking ties by a canonical edge ordering). If $S$ does not include any $s$-$v$ path, we define $f_S(x) = \bot$. Part 1 of Assumption 2 holds because $U = E$, so $f_U$ computes the shortest path on the entire graph. Part 2 of Assumption 2 also holds, with $S^*(x)$ equal to the set of edges in the path $f(x)$: since ties are broken by a canonical edge ordering, $f_S(x) = f(x)$ if and only if $S \supseteq S^*(x)$. To "canonicalize" the algorithm so there is always a unique solution, we assume there is a given ordering on edges and that ties among equal-length paths are broken lexicographically according to the path description. This is easily achieved by keeping the heap maintained by Dijkstra's algorithm sorted not only by distances but also lexicographically.
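As a concrete illustration, the pruned function $f_S$ can be computed by running Dijkstra's algorithm on the subgraph induced by $S$. The sketch below is our own minimal implementation (it ignores the lexicographic tie-breaking described above) and is not the paper's code:

```python
import heapq

def dijkstra_pruned(S, weights, s, v):
    """Compute f_S(x): the shortest s-v path using only edges in the pruning S.

    S       -- set of undirected edges (a, b) to consider (the pruning)
    weights -- dict mapping each edge (a, b) to its nonnegative weight x(e)
    Returns the path as a list of nodes, or None (standing in for the
    null element) if S contains no s-v path.
    """
    adj = {}
    for (a, b) in S:
        adj.setdefault(a, []).append(b)
        adj.setdefault(b, []).append(a)  # undirected graph
    dist, prev = {s: 0.0}, {}
    heap = [(0.0, s)]
    while heap:
        d, node = heapq.heappop(heap)
        if node == v:
            break
        if d > dist.get(node, float("inf")):
            continue  # stale heap entry
        for nb in adj.get(node, []):
            e = (node, nb) if (node, nb) in weights else (nb, node)
            nd = d + weights[e]
            if nd < dist.get(nb, float("inf")):
                dist[nb], prev[nb] = nd, node
                heapq.heappush(heap, (nd, nb))
    if v not in dist:
        return None
    path, node = [v], v
    while node != s:
        node = prev[node]
        path.append(node)
    return path[::-1]
```

Passing $S = E$ recovers $f_U = f$; passing a smaller $S$ restricts the search to the pruned subgraph, which is the source of the runtime savings.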
Linear programming.
We consider computing $\max\{c_t^\top z : Az \le b\}$, where we assume that the constraint matrix $A \in \mathbb{R}^{m \times n}$ and the vector $b \in \mathbb{R}^m$ are fixed across all time steps but the vector $c_t \in \mathbb{R}^n$ defining the objective function may differ for each $t$. To instantiate our pruning model, the universe $U = \{1, \dots, m\}$ is the set of all constraint indices and each $S \subseteq U$ indicates a subset of those constraints. The set $\mathcal{X}$ equals the set of possible objective vectors. For simplicity, we assume that the set of objectives contains only directions $c$ such that there is a unique solution that is the intersection of exactly $n$ constraints in $A$. This avoids both dealing with solutions that are the intersection of more than $n$ constraints and directions that are underdetermined and have infinitely many solutions forming a facet. See Section 6 for a discussion of this issue in general. Given $c$, the function $f$ computes the linear program's optimal solution, i.e., $f(c) = \operatorname{argmax}\{c^\top z : Az \le b\}$. For a subset of constraints $S \subseteq U$, the function $f_S$ computes the optimal solution restricted to those constraints, i.e., $f_S(c) = \operatorname{argmax}\{c^\top z : A_S z \le b_S\}$, where $A_S$ is the submatrix of $A$ consisting of the rows indexed by elements of $S$ and $b_S$ is the vector $b$ with indices restricted to elements of $S$. We further write $f_S(c) = \bot$ if there is no unique solution to the restricted linear program (which may happen for small sets $S$ even if the whole LP does have a unique solution). Part 1 of Assumption 2 holds because $A_U = A$ and $b_U = b$, so it is indeed the case that $f_U = f$. To see why part 2 of Assumption 2 also holds, let $S^*(c)$ denote the indices of the $n$ constraints that are tight at $f(c)$, and suppose that $f_S(c) = f(c)$. Then the vector $f_S(c)$ must be the intersection of exactly $n$ tight constraints of the restricted program, which by definition are indexed by elements of $S$. This means that $S \supseteq S^*(c)$.
String search.
In string search, the goal is to find the location of a short pattern in a long string. At timestep $t$, the algorithm receives a long string of some fixed length $m$ and a pattern of some fixed length $k$. We denote the long string as $x_\ell$ and the pattern as $x_p$. The goal is to find an index $j$ such that $x_\ell[j .. j+k-1] = x_p$. The function $f$ returns the smallest such index $j$, or $\bot$ if there is no match. In this setting, the set of inputs $\mathcal{X}$ consists of all string pairs of lengths $m$ and $k$ (e.g., over the alphabet $\{A, C, G, T\}$ for DNA sequences) and the set $\mathcal{Y} = \{1, \dots, m-k+1\} \cup \{\bot\}$ is the set of all possible match indices. The universe $U = \{1, \dots, m-k+1\}$ also consists of all possible match indices. For any $S \subseteq U$, the function $f_S$ returns the smallest index $j \in S$ such that $x_\ell[j .. j+k-1] = x_p$; it returns $\bot$ if there is no such match. We can see that part 1 of Assumption 2 holds: $f_U(x) = f(x)$ for all $x$, since $f_U$ checks every index in $U$ for a match. Moreover, part 2 of Assumption 2 holds because $f_S(x) = f(x)$ if and only if $S \supseteq S^*(x)$, where $S^*(x) = \{f(x)\}$ when a match exists and $S^*(x) = \emptyset$ otherwise.
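The restricted matcher $f_S$ is a few lines of code. The sketch below is ours (0-indexed, with `None` standing in for $\bot$), not the paper's implementation:

```python
def match_restricted(S, long_str, pattern):
    """Compute f_S(x): the smallest index j in S at which pattern occurs
    in long_str, or None if no index in S matches. Passing
    S = range(len(long_str) - len(pattern) + 1) recovers f_U = f.
    """
    k = len(pattern)
    for j in sorted(S):
        # compare the length-k window starting at j against the pattern
        if long_str[j:j + k] == pattern:
            return j
    return None
```

For example, in `"abracadabra"` the pattern `"abra"` occurs at indices 0 and 7; the full search returns 0, and a pruning that omits index 0 returns the wrong answer, illustrating why $S^*(x) = \{f(x)\}$.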
3 The algorithm
We now present an algorithm (Algorithm 1) that encounters a sequence of inputs $x_1, \dots, x_T$ one by one. At timestep $t$, it computes the value $f_{\hat S_t}(x_t)$, where the choice of $\hat S_t$ depends on the first $t-1$ inputs. We prove that, in expectation, the number of mistakes it makes (i.e., rounds where $f_{\hat S_t}(x_t) \ne f(x_t)$) is small, as is $\sum_{t=1}^{T} |\hat S_t|$.
Our algorithm keeps track of a pruning of $U$, which we call $\hat S_t$ at timestep $t$. In the first round, the pruned set is empty ($\hat S_1 = \emptyset$). On round $t$, with some probability $\gamma_t$, the algorithm computes and outputs $f_U(x_t) = f(x_t)$ and then computes $S^*(x_t)$, the unique smallest set that any pruning must contain in order to correctly compute $f(x_t)$. (As we discuss in Section 3.1, in all of the applications we consider, computing $S^*(x_t)$ amounts to evaluating $f(x_t)$.) The algorithm unions $\hat S_t$ with $S^*(x_t)$ to create the set $\hat S_{t+1} = \hat S_t \cup S^*(x_t)$. Otherwise, with probability $1 - \gamma_t$, it outputs $f_{\hat S_t}(x_t)$ and does not update the set (i.e., $\hat S_{t+1} = \hat S_t$). It repeats in this fashion for all $T$ rounds.
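Concretely, the explore/exploit loop above can be sketched as follows. This is a minimal sketch in Python; the callables `solve_full`, `solve_pruned`, and `gamma` are our own placeholders (the worst-case solver, the pruned solver $f_S$, and the exploration schedule $\gamma_t$), not the paper's notation:

```python
import random

def learn_to_prune(inputs, gamma, solve_full, solve_pruned, rng=random):
    """Sketch of the explore/exploit loop.

    solve_full(x)      -> (f(x), S_star(x)): exact solution plus the minimal
                          pruning needed to reproduce it.
    solve_pruned(S, x) -> f_S(x): solution restricted to the pruning S.
    gamma(t)           -> exploration probability on round t (1-indexed).
    """
    S = set()        # current pruning, initially empty
    outputs = []
    for t, x in enumerate(inputs, start=1):
        if rng.random() < gamma(t):
            y, s_star = solve_full(x)  # explore: correct by construction
            S |= s_star                # grow the pruning
        else:
            y = solve_pruned(S, x)     # exploit: fast, but may be wrong
        outputs.append(y)
    return outputs
```

With `gamma(t) = 1` every round is an explore round (always correct, never pruned); with `gamma(t) = 0` the pruning never grows, and the algorithm exploits the empty set forever.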
In the remainder of this section, we use the notation $S^*$ to denote the smallest set such that $f_{S^*}(x_t) = f(x_t)$ for all $t \in [T]$. To prove our guarantees, we use the following helpful lemma:
For any $x_1, \dots, x_T \in \mathcal{X}$, $S^* = \bigcup_{t=1}^{T} S^*(x_t)$.
Proof.
First, we prove that $S^* \supseteq \bigcup_{t=1}^{T} S^*(x_t)$. For a contradiction, suppose that for some $t$, there exists an element $e \in S^*(x_t)$ such that $e \notin S^*$. This means that $f_{S^*}(x_t) = f(x_t)$ but $S^* \not\supseteq S^*(x_t)$, which contradicts Assumption 2: $S^*(x_t)$ is the unique smallest subset of $U$ such that for any set $S \subseteq U$, $f_S(x_t) = f(x_t)$ if and only if $S \supseteq S^*(x_t)$. Therefore, $S^* \supseteq \bigcup_{t=1}^{T} S^*(x_t)$. Next, let $S' = \bigcup_{t=1}^{T} S^*(x_t)$. Since $S' \supseteq S^*(x_t)$ for every $t$, Assumption 2 implies that $f_{S'}(x_t) = f(x_t)$ for all $t \in [T]$. Based on the definition of $S^*$ as the smallest such set and the fact that $S^* \supseteq S'$, we conclude that $S^* = S'$. ∎
We now provide a mistake bound for Algorithm 1.
For any set $S'$ such that $f_{S'}(x_t) = f(x_t)$ for all $t \in [T]$ and any inputs $x_1, \dots, x_T$, Algorithm 1 has a mistake bound of
$$\sum_{e \in S'} \sum_{i=1}^{m_e} \prod_{j=1}^{i-1}\left(1 - \gamma_{t_j^e}\right),$$
where $t_1^e < \cdots < t_{m_e}^e$ are the rounds $t$ on which $e \in S^*(x_t)$.
Proof.
Let $\hat S_1, \dots, \hat S_T$ be the sets such that on round $t$, Algorithm 1 computes the function $f_{\hat S_t}$ when it does not explore. Consider any element $e \in S'$. Let $M_e$ be the number of times $e \in S^*(x_t)$ but $e \notin \hat S_t$ for some $t$. In other words, $M_e = |\{t : e \in S^*(x_t) \setminus \hat S_t\}|$. Every time the algorithm makes a mistake, the current set $\hat S_t$ must fail to contain some $e \in S^*(x_t)$ (otherwise, $\hat S_t \supseteq S^*(x_t)$, so the algorithm would not have made a mistake, by Assumption 2). Since $f_{S'}(x_t) = f(x_t)$ implies $S' \supseteq S^*(x_t)$, every time the algorithm makes a mistake, $M_e$ is incremented by 1 for at least one $e \in S'$. Therefore,
$$\mathbb{E}[\text{number of mistakes}] \le \sum_{e \in S'} \mathbb{E}[M_e], \qquad (1)$$
where the expectation is over the random choices of Algorithm 1.
For any element $e \in S'$, let $t_1^e < t_2^e < \cdots < t_{m_e}^e$ be the iterations $t$ where $e \in S^*(x_t)$. By definition, $M_e$ will only be incremented on some subset of these rounds. Suppose $M_e$ is incremented by 1 on round $t_i^e$. It must be that $e \notin \hat S_{t_i^e}$. Since the sets $\hat S_1 \subseteq \hat S_2 \subseteq \cdots$ only grow, it must also be that $e \notin \hat S_{t_j^e}$ for every $j < i$. Therefore, in each round $t_j^e$ with $j < i$, Algorithm 1 must not have computed $S^*(x_{t_j^e})$, because otherwise $e$ would have been added to the set $\hat S_{t_j^e + 1}$. We can bound the probability of these bad events as
$$\Pr\left[M_e \text{ is incremented on round } t_i^e\right] \le \prod_{j=1}^{i-1}\left(1 - \gamma_{t_j^e}\right).$$
As a result,
$$\mathbb{E}[M_e] \le \sum_{i=1}^{m_e} \prod_{j=1}^{i-1}\left(1 - \gamma_{t_j^e}\right). \qquad (2)$$
The theorem statement follows by combining Equations (1) and (2). ∎
Algorithm 1 with $\gamma_t = \gamma$ for all $t$ has a mistake bound of $|S^*| / \gamma$: taking $S' = S^*$ in Theorem 3, each inner sum is at most $\sum_{i \ge 1} (1 - \gamma)^{i-1} = 1/\gamma$.
In the following theorem, we prove that the mistake bound in Theorem 3 is nearly tight. In particular, we show that for any $\gamma \in (0, 1)$ there exists a random sequence of inputs $x_1, \dots, x_T$ such that $\mathbb{E}[|S^*|] = n\left(1 - \left(1 - \frac{1}{n}\right)^T\right)$ and the expected number of mistakes is $\frac{n(1-\gamma)}{\gamma}\left(1 - \left(1 - \frac{\gamma}{n}\right)^T\right)$. This nearly matches the upper bound of $|S^*|/\gamma$ from Theorem 3.
The full proof is in Appendix A. In Section 4, we show that Algorithm 1 in fact achieves a near-optimal tradeoff between runtime and pruned-subset size over all possible pruning-based repeated algorithms.
For any $n$, any time horizon $T$, and any $\gamma \in (0, 1)$, there is a random sequence of inputs $x_1, \dots, x_T$ to Algorithm 1 such that
$$\mathbb{E}[|S^*|] = n\left(1 - \left(1 - \frac{1}{n}\right)^T\right),$$
and its expected number of mistakes with $\gamma_t = \gamma$ for all $t$ is
$$\frac{n(1-\gamma)}{\gamma}\left(1 - \left(1 - \frac{\gamma}{n}\right)^T\right).$$
The expectation is over the sequence of inputs.
Proof sketch.
We base this construction on shortest-path routing. There is a fixed graph $G = (V, E)$, where $V$ consists of two vertices labeled $u$ and $v$ and $E$ consists of $n$ edges labeled $e_1, \dots, e_n$, each of which connects $u$ and $v$. The set $\mathcal{X} = \{x^{(1)}, \dots, x^{(n)}\}$ consists of $n$ possible edge weightings. Under the edge weights $x^{(i)}$, the edge $e_i$ has a weight of 0 and all other edges have a weight of 1. We prove the theorem by choosing an input at each round uniformly at random from $\mathcal{X}$. ∎
In Theorem 3, we bounded the expected number of mistakes Algorithm 1 makes. Next, we bound $\mathbb{E}\left[\sum_{t=1}^{T} |\tilde S_t|\right]$, where $\tilde S_t$ is the set such that Algorithm 1 outputs $f_{\tilde S_t}(x_t)$ in round $t$ (so either $\tilde S_t = \hat S_t$ or $\tilde S_t = U$, depending on the algorithm's random choice). In our applications, minimizing $\sum_{t=1}^{T} |\tilde S_t|$ means minimizing the search-space size, which roughly amounts to minimizing the average expected runtime of Algorithm 1.
For any inputs $x_1, \dots, x_T$, let $\hat S_1, \dots, \hat S_T$ be the sets such that on round $t$, Algorithm 1 computes the function $f_{\hat S_t}$ when it does not explore. Then
$$\mathbb{E}\left[\sum_{t=1}^{T} |\tilde S_t|\right] \le |U| \sum_{t=1}^{T} \gamma_t + T \cdot |S^*|,$$
where the randomness is over the coin tosses of Algorithm 1.
Proof.
We know that for all $t$, $\tilde S_t = U$ with probability $\gamma_t$ and $\tilde S_t = \hat S_t$ with probability $1 - \gamma_t$. Therefore,
$$\mathbb{E}\left[\sum_{t=1}^{T} |\tilde S_t|\right] = \sum_{t=1}^{T} \left(\gamma_t |U| + (1 - \gamma_t)\,\mathbb{E}\left[|\hat S_t|\right]\right) \le |U| \sum_{t=1}^{T} \gamma_t + T \cdot |S^*|,$$
where the final inequality holds because $\hat S_t \subseteq S^*$ for all $t$. ∎
If we set $\gamma_t = 1/\sqrt{t}$ for all $t$, we have the following corollary, since $\sum_{t=1}^{T} 1/\sqrt{t} \le 2\sqrt{T}$. Given a set of inputs $x_1, \dots, x_T$, let $\hat S_1, \dots, \hat S_T$ be the sets such that on round $t$, Algorithm 1 computes the function $f_{\hat S_t}$ when it does not explore. If $\gamma_t = 1/\sqrt{t}$ for all $t$, then
$$\mathbb{E}\left[\sum_{t=1}^{T} |\tilde S_t|\right] \le 2\sqrt{T}\,|U| + T \cdot |S^*|,$$
where the expectation is over the random choices of Algorithm 1.
3.1 Instantiations of Algorithm 1
We now revisit and discuss instantiations of Algorithm 1 for the three applications outlined in Section 2: shortest-path routing, linear programming, and string search. For each problem, we describe how one might compute the sets $S^*(x_t)$ for all $t \in [T]$.
Shortestpath routing.
In this setting, the algorithm computes the true shortest path $f(x_t)$ using, say, Dijkstra's shortest-path algorithm, and the set $S^*(x_t)$ is simply the set of edges in that path. Since $S^* = \bigcup_{t=1}^{T} S^*(x_t)$ is the union of the returned shortest paths, the mistake bound of $|S^*|/\gamma$ given by Corollary 1 is particularly strong when the shortest path does not vary much from day to day. Corollary 3 guarantees that the average edge-set size run through Dijkstra's algorithm is at most $2|E|/\sqrt{T} + |S^*|$. Since the worst-case running time of Dijkstra's algorithm on a graph $G = (V, E)$ is $O(|E| + |V| \log |V|)$, minimizing the average edge-set size is a good proxy for minimizing runtime.
Linear programming.
In the context of linear programming, computing the set $S^*(c_t)$ is equivalent to computing $f(c_t)$ and returning the set of tight constraints. Since $S^*$ is the union of these tight-constraint sets, the mistake bound of $|S^*|/\gamma$ given by Corollary 1 is strongest when the same constraints are tight across most timesteps. Corollary 3 guarantees that the average constraint-set size considered in each round is at most $2m/\sqrt{T} + |S^*|$, where $m = |U|$ is the total number of constraints. Since many well-known solvers take time polynomial in the number of constraints to compute $f(c_t)$, minimizing $|S^*|$ is a close proxy for minimizing runtime.
String search.
In this setting, the set $S^*(x)$ consists of the smallest index $j$ such that $x_\ell[j .. j+k-1] = x_p$, i.e., $S^*(x) = \{f(x)\}$. This means that computing $S^*(x)$ is equivalent to computing $f(x)$. The mistake bound given by Corollary 1 is particularly strong when the matching indices are similar across string pairs. Corollary 3 guarantees that the average size of the searched index set in each round is at most $2|U|/\sqrt{T} + |S^*|$. Since the expected average running time of our algorithm using the naïve string-matching algorithm to compute $f_{\tilde S_t}$ is proportional to the number of indices searched (times the pattern length $k$), minimizing $|S^*|$ amounts to minimizing runtime.
4 Lower bound on the tradeoff between accuracy and runtime
We now prove a lower bound on the tradeoff between runtime and the number of mistakes made by any repeated algorithm. We analyze a shortest-path problem with two nodes $u$ and $v$ connected by $n \ge 2$ parallel edges $e_1, \dots, e_n$. Thus, all paths are single edges. For any $n$ and $T$, consider the following distribution over tuples of edge weights:

- The weight on edge $e_1$ is always 1/2.

- An edge $e_i$ with $i \ne 1$ and an integer $\tau \in \{1, \dots, 2T\}$ are chosen uniformly at random. The weight on edge $e_i$ is 1 on periods preceding $\tau$ and 0 from period $\tau$ onward.

- The weight on every other edge is 1 on every period.

Note that because $\tau$ is uniform over $\{1, \dots, 2T\}$, with probability 1/2 we have $\tau > T$, in which case edge $e_1$ is the unique shortest path ($f(x_t) = e_1$) for all $T$ periods. Otherwise, $\tau \le T$ and $e_i$ is the unique shortest path from period $\tau$ onward. We say that an algorithm inspects an edge on period $t$ if it examines the memory location associated with that edge's weight.
Fixing $n$ and any even integer $T$, any repeated algorithm must satisfy:
$$\mathbb{E}[\text{number of inspections}] + n \cdot \mathbb{E}[\text{number of mistakes}] = \Omega(nT),$$
where the expectation is over the random edge weights and the randomness of the algorithm. The total number of inspections the algorithm makes is clearly a lower bound on its total runtime, so Theorem 4 demonstrates a tradeoff between runtime and accuracy. This theorem is tight up to constant factors, as can be seen from the trivial algorithm that inspects every edge on every period until it encounters a 0 on some edge and then outputs that edge henceforth: it makes no mistakes and runs in expected time $\Theta(nT)$. Conversely, the algorithm that always outputs edge $e_1$ makes no inspections and makes $\Theta(T)$ expected mistakes.
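The mistake-free end of this tradeoff can be simulated directly. The sketch below is our own illustration of the trivial algorithm (0-based edge indices, with edge 0 playing the role of the weight-1/2 edge), not code from the paper:

```python
def inspect_until_zero(weight_rounds):
    """On the two-node, n-parallel-edge instance, inspect edges each period
    until some edge has weight 0, then output that edge forever after.

    weight_rounds: list of per-period weight vectors (index = edge).
    Returns (outputs, total_inspections). On the distribution in the text
    this makes no mistakes, at an expected inspection cost of Theta(nT).
    """
    outputs, inspections, committed = [], 0, None
    for w in weight_rounds:
        if committed is None:
            for e, we in enumerate(w):
                inspections += 1
                if we == 0:
                    committed = e  # a 0-weight edge is the unique shortest path
                    break
        # before any zero appears, edge 0 (weight 1/2) is the shortest path
        outputs.append(committed if committed is not None else 0)
    return outputs, inspections
```

The opposite extreme, always outputting edge 0 with no inspections, trades all of that inspection cost for mistakes on the periods after the switch time.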
5 Experiments
In this section, we present experimental results for shortestpath routing and linear programming.
Shortestpath routing.
We test Algorithm 1's performance on real-world street maps, which we access via Python's OSMnx package (Boeing, 2017). Each street is an edge in the graph and each intersection is a node. An edge's weight is the corresponding street's length. We run our algorithm for 30 rounds (i.e., $T = 30$) with a fixed exploration probability $\gamma_t = \gamma$ for all $t$. On each round, we randomly perturb each edge's weight via the following procedure. Let $G = (V, E)$ be the original graph we access via OSMnx, and let $w \in \mathbb{R}^{|E|}$ be a vector representing all edges' weights. On the $t$th round, we select a vector $\zeta^{(t)} \in \mathbb{R}^{|E|}$ such that each component is drawn i.i.d. from the normal distribution with a mean of 0 and a standard deviation of 1. We then define a new edge-weight vector $w^{(t)}$ by perturbing each coordinate of $w$ with the corresponding coordinate of $\zeta^{(t)}$. In Appendix C, we experiment with alternative perturbation methods. In Figures 2(a) and 2(b), we illustrate our algorithm's performance in Pittsburgh. Figure 2(a) illustrates the nodes explored by our algorithm over the $T$ rounds. The goal is to get from the upper star to the lower star. The nodes colored grey are the nodes Dijkstra's algorithm would have visited had we run Dijkstra's algorithm on all rounds. The nodes colored black are the nodes in the pruned subgraph after the $T$ rounds. Figure 2(b) illustrates the results of running our algorithm a total of 5000 times ($T = 30$ rounds each run). The top (orange) line shows the number of nodes Dijkstra's algorithm explored, averaged over all 5000 runs. The bottom (blue) line shows the average number of nodes our algorithm explored. Our algorithm returned the incorrect path on only a small fraction of the rounds. In Appendix C, we show a plot of the average pruned-set size as a function of the number of rounds.
Linear programming.
We generate linear programming instances representing the linear relaxation of the combinatorial auction winner determination problem. See Appendix C for the specific form of this linear relaxation. We use the Combinatorial Auction Test Suite (CATS) (Leyton-Brown et al., 2000) to generate these instances. This test suite is meant to generate instances that are realistic and economically well-motivated. We use the CATS generator to create an initial instance with an objective function defined by a vector $c$ and constraints defined by a matrix $A$ and a vector $b$. On each round, we perturb the objective vector as we describe in Appendix C.2.
From the CATS "Arbitrary" generator, we create an instance with 204 bids and 538 goods, which yields 204 variables and 946 constraints. We run Algorithm 1 for 30 rounds ($T = 30$) with a fixed exploration probability $\gamma_t = \gamma$ for all $t$, and we repeat this 5000 times. In Figure 2(c), the top (orange) line shows the number of simplex iterations the full simplex algorithm makes, averaged over all 5000 runs. The bottom (blue) line shows the number of simplex iterations our algorithm makes, averaged over all 5000 runs. We solve the linear program on each round using the SciPy default linear programming solver (Jones et al., 2001–), which implements the simplex algorithm (Dantzig, 2016). Our algorithm returned the incorrect solution on only a small fraction of the rounds. In Appendix C, we show a plot of the average pruned-set size as a function of the number of rounds.
6 Multiple solutions and approximations
In this work, we have assumed that each problem has a unique solution, which we can enforce by defining a canonical ordering on solutions. For string matching, this could be the first match in a string as opposed to any match. For shortest-path routing, it is not difficult to modify shortest-path algorithms to find, among the shortest paths, the one with the lexicographically "smallest" description given some ordering of edges. Alternatively, one might simply assume that there is exactly one solution, e.g., no ties in a shortest-path problem with real-valued edge weights. This latter solution is what we have chosen for the linear programming model, for simplicity.
It would be natural to try to extend our work to problems that have multiple solutions, or even to approximate solutions. However, addressing multiple solutions in repeated computation rapidly raises NP-hard challenges. To see this, consider a graph with two nodes, $u$ and $v$, connected by $n$ parallel edges. Suppose the goal is to find any shortest path and suppose that in each period, the edge weights are all 0 or 1, with at least one edge having weight 0. If $Z_t$ is the set of edges with 0 weight on period $t$, finding the smallest pruning which includes a shortest path on each period is trivially equivalent to set cover on the sets $Z_1, \dots, Z_T$. Hence, any repeated algorithm handling problems with multiple solutions must address this computational hardness.
7 Conclusion
We propose an algorithm for quickly solving a series of related problems. Our algorithm learns irrelevant regions of the solution space that may be pruned across instances. With high probability, our algorithm makes few mistakes, and it may prune large swaths of the search space. For problems where the solution can be checked much more quickly than found (such as linear programming), one can also check each solution and rerun the worstcase algorithm on the few errors to ensure zero mistakes. In other cases, there is a tradeoff between the mistake probability and runtime.
Acknowledgments
This work was supported in part by Israel Science Foundation (ISF) grant #1044/16, a subcontract on the DARPA Brandeis Project, and the Federmann Cyber Security Center in conjunction with the Israel national cyber directorate.
References
 Ailon et al. (2011) Nir Ailon, Bernard Chazelle, Kenneth L. Clarkson, Ding Liu, Wolfgang Mulzer, and C. Seshadhri. Self-improving algorithms. SIAM J. Comput., 40(2):350–375, 2011.
 Awerbuch and Kleinberg (2008) Baruch Awerbuch and Robert Kleinberg. Online linear optimization and adaptive routing. J. Comput. Syst. Sci., 74(1):97–114, 2008.
 Balcan et al. (2017) Maria-Florina Balcan, Vaishnavh Nagarajan, Ellen Vitercik, and Colin White. Learning-theoretic foundations of algorithm configuration for combinatorial partitioning problems. In Proceedings of the Conference on Learning Theory (COLT), 2017.
 Balcan et al. (2018a) Maria-Florina Balcan, Travis Dick, Tuomas Sandholm, and Ellen Vitercik. Learning to branch. In Proceedings of the International Conference on Machine Learning (ICML), 2018a.
 Balcan et al. (2018b) Maria-Florina Balcan, Travis Dick, and Ellen Vitercik. Dispersion for data-driven algorithm design, online learning, and private optimization. In Proceedings of the IEEE Symposium on Foundations of Computer Science (FOCS), 2018b.
 Banerjee and Roy (2015) Ashis Gopal Banerjee and Nicholas Roy. Efficiently solving repeated integer linear programming problems by learning solutions of similar linear programming problems using boosting trees. Technical Report MIT-CSAIL-TR-2015-00, MIT Computer Science and Artificial Intelligence Laboratory, 2015.
 Boeing (2017) Geoff Boeing. OSMnx: New methods for acquiring, constructing, analyzing, and visualizing complex street networks. Computers, Environment and Urban Systems, 65:126–139, 2017.
 Cesa-Bianchi and Lugosi (2006) Nicolò Cesa-Bianchi and Gábor Lugosi. Prediction, learning, and games. Cambridge University Press, 2006.
 Clarkson et al. (2014) Kenneth L. Clarkson, Wolfgang Mulzer, and C. Seshadhri. Self-improving algorithms for coordinatewise maxima and convex hulls. SIAM J. Comput., 43(2):617–653, 2014.
 Cohen-Addad and Kanade (2017) Vincent Cohen-Addad and Varun Kanade. Online optimization of smoothed piecewise constant functions. In Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS), 2017.
 Dantzig (2016) George Dantzig. Linear programming and extensions. Princeton University Press, 2016.
 Deb et al. (2006) Supratim Deb, Devavrat Shah, et al. Fast matching algorithms for repetitive optimization: An application to switch scheduling. In Proceedings of the Conference on Information Sciences and Systems (CISS), 2006.
 Gupta and Roughgarden (2017) Rishi Gupta and Tim Roughgarden. A PAC approach to application-specific algorithm selection. SIAM J. Comput., 46(3):992–1017, 2017.
 Jones et al. (2001–) Eric Jones, Travis Oliphant, Pearu Peterson, et al. SciPy: Open source scientific tools for Python, 2001–. URL http://www.scipy.org/. [Online; accessed January 2019].
 Kalai and Vempala (2005) Adam Tauman Kalai and Santosh Vempala. Efficient algorithms for online decision problems. J. Comput. Syst. Sci., 71(3):291–307, 2005.
 Kleinberg et al. (2017) Robert Kleinberg, Kevin LeytonBrown, and Brendan Lucier. Efficiency through procrastination: Approximately optimal algorithm configuration with runtime guarantees. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI), 2017.
 LeytonBrown et al. (2000) Kevin LeytonBrown, Mark Pearson, and Yoav Shoham. Towards a universal test suite for combinatorial auction algorithms. In Proceedings of the ACM Conference on Electronic Commerce (ACMEC), 2000.
 Littlestone (1987) Nick Littlestone. Learning quickly when irrelevant attributes abound: A new linearthreshold algorithm. Machine Learning, 2(4):285–318, 1987.
 Weisz et al. (2018) Gellért Weisz, Andrés György, and Csaba Szepesvári. LEAPSANDBOUNDS: A method for approximately optimal algorithm configuration. In Proceedings of the International Conference on Machine Learning (ICML), 2018.
Appendix A Proofs from Section 3
We first restate and prove the lower bound from Section 3.
Proof.
We base our construction on the shortest-path problem. There is a fixed graph $G = (V, E)$, where $V$ consists of two vertices labeled $u$ and $v$ and $E$ consists of $n$ edges labeled $e_1, \dots, e_n$, each of which connects $u$ and $v$. The set $\mathcal{X} = \{x^{(1)}, \dots, x^{(n)}\}$ consists of $n$ possible edge weightings, where $x^{(i)}$ assigns weight 0 to edge $e_i$ and weight 1 to every other edge. In other words, under the edge weights $x^{(i)}$, the edge $e_i$ has a weight of 0 and all other edges have a weight of 1. The shortest path from $u$ to $v$ under the edge weights $x^{(i)}$ consists of the single edge $e_i$ and has a total weight of 0. Therefore, $f(x^{(i)}) = e_i$ and $S^*(x^{(i)}) = \{e_i\}$. Given any nonempty subset of edges $S$, $f_S(x^{(i)}) = e_i$ if $e_i \in S$, and otherwise $f_S$ breaks ties among the weight-1 edges in $S$ according to a fixed but arbitrary tie-breaking rule.
To construct the random sequence of inputs from the theorem statement, in each round we choose the input $x_t$ uniformly at random from the set $\mathcal{X}$. Therefore, letting $S^* = \bigcup_{t=1}^{T} S^*(x_t)$, we have $\mathbb{E}[|S^*|] = n\left(1 - \left(1 - \frac{1}{n}\right)^T\right)$, because when throwing $T$ balls uniformly at random into $n$ bins, the expected number of empty bins is $n\left(1 - \frac{1}{n}\right)^T$.
We now prove that
$$\mathbb{E}[\text{number of mistakes}] = \frac{n(1-\gamma)}{\gamma}\left(1 - \left(1 - \frac{\gamma}{n}\right)^T\right),$$
where the expectation is over the sequence of inputs. To this end, let $\hat S_1, \dots, \hat S_T$ be the sets such that at round $t$, Algorithm 1 outputs $f_{\hat S_t}(x_t)$ when it does not explore.
Analyzing a single summand: on round $t$, the algorithm errs exactly when it exploits (probability $1 - \gamma$) and the realized zero-weight edge $e_{i_t}$ is not yet in $\hat S_t$. The latter requires that no earlier round both drew $e_{i_t}$ and explored, and each earlier round does so independently with probability $\gamma/n$. Therefore,
$$\Pr[\text{mistake on round } t] = (1 - \gamma)\left(1 - \frac{\gamma}{n}\right)^{t-1}.$$
Therefore,
$$\mathbb{E}[\text{number of mistakes}] = (1 - \gamma)\sum_{t=1}^{T}\left(1 - \frac{\gamma}{n}\right)^{t-1} = \frac{n(1-\gamma)}{\gamma}\left(1 - \left(1 - \frac{\gamma}{n}\right)^T\right),$$
as claimed. ∎
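The balls-in-bins identity used at the start of the proof can be verified exhaustively for small $n$ and $T$; the following is a sanity check of ours, not part of the proof:

```python
from itertools import product

def expected_distinct(n, T):
    """Exact expected number of distinct bins hit, averaged over all n**T
    equally likely sequences of T uniform throws into n bins. This equals
    n * (1 - (1 - 1/n)**T), i.e., n minus the expected number of empty bins.
    """
    total = sum(len(set(seq)) for seq in product(range(n), repeat=T))
    return total / n ** T
```

For instance, with $n = 3$ bins and $T = 2$ throws, the exact average is $3(1 - (2/3)^2) = 5/3$.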
Appendix B Proof of lower bound
We now prove Theorem 4.
Proof.
First, consider a deterministic alg