Traditional optimization techniques make use of a known mathematical formulation of the objective function, for example by calculating its derivative or a lower bound. However, many objective functions in real-life situations have no known mathematical formulation. For example, smart grids and railways are complex networks where every decision influences the whole network. In such applications, we can observe the effect of decisions either in real life, or by running a simulation. Waiting for such a result can take considerable time, or may have some other cost associated with it. Furthermore, the outcome of two observations with the same decision variables may be different. Such problems have been approached using methods such as black-box or Bayesian optimization, simulation-based optimization, and derivative-free optimization. Here, a model is fitted to the relation between decision variables and objective function, and standard optimization techniques are then applied to the model instead of the original objective. These so-called surrogate modeling techniques have been applied successfully to continuous optimization problems in signal processing, optics, robotics, and more. However, how these techniques can be applied effectively to combinatorial optimization problems is still an ongoing research question. A common approach is to simply round to the nearest integer, a method that is known to be sub-optimal in traditional optimization as well as in black-box optimization. Another option is to use discrete surrogate models from machine learning, such as regression trees or linear model trees. Although powerful, these make both model fitting and optimization computationally expensive.
This work describes an approach where the surrogate model is still continuous, but where finding the optimum of the surrogate model gives an integer solution. The main contributions are as follows:
A surrogate modeling algorithm, called IDONE, with two variants (one with a basic and one with a more complex surrogate model).
A proof that finding the optimum of the surrogate model gives an integer solution.
Experimental results that show when IDONE outperforms random search, simulated annealing and Bayesian optimization.
Section 2 gives a general description of the problem and an overview of related work. Section 3 describes the IDONE algorithm and the proof. In Section 4, IDONE is compared to random search, simulated annealing and Bayesian optimization on two different problems: finding robust routes in a noise-perturbed traveling salesman benchmark problem, and a convex binary optimization problem. Finally, Section 5 contains conclusions and future work.
2 Problem description and related work
Consider the problem of minimizing an objective f with integer and bound constraints:

min f(x) subject to x ∈ Z^d, lb ≤ x ≤ ub, (1)

where the lower and upper bounds lb and ub are also assumed to be integer. It is assumed that f does not have a known mathematical formulation, and can only be accessed via noisy measurements y = f(x) + ε, with ε zero-mean noise with finite variance. Furthermore, taking a measurement is expensive, be it computationally or due to a long measuring time, human decision making, deteriorating biological samples, or other reasons. Examples are hyper-parameter optimization in deep learning, contamination control of a food supply chain, and structure design in material science.
Although many standard optimization methods are unfit for this problem, a vast number of methods were designed with most of the above assumptions in mind. For example, local search heuristics such as hill-climbing, simulated annealing, or tabu search are general enough to be applied to this problem, and have the advantage of being easy to implement. These heuristics are considered the baseline in this work, together with random search (assign completely random values to the variables and keep the best result). Population-based heuristics such as genetic algorithms and ant colony optimization operate in the same way as local search algorithms, but keep track of multiple candidate solutions that together form a population. These algorithms have been applied successfully in many applications, but are unfit for the problem described in this paper, since evaluating a whole population of candidate solutions is not practical if each measurement is expensive. The same holds for algorithms that filter out the noise by averaging, such as COMPASS, since they evaluate one candidate solution multiple times. Typical integer programming algorithms such as branch and bound work well on standard combinatorial problems, but when the objective function is unknown and observations of single values are noisy or expensive, it is very difficult to obtain reasonable bounds.
Surrogate modeling techniques operate in a different way from the above methods: past measurements are used to fit a model, which is then used to select the next candidate solution. Bayesian optimization algorithms [12, 16], for example, have been successfully applied in many different fields. These methods use probability theory to determine the most promising candidate point according to the surrogate model. However, when the variables are discrete, the typical approach is to relax the integer constraints, which often leads to sub-optimal solutions. This problem has been tackled by modifying the covariance function used in the surrogate model. Another approach is based on semi-definite programming. The HyperOpt algorithm takes yet a different approach by using a Tree-structured Parzen Estimator as the surrogate model, which is discrete in case the variables are discrete. HyperOpt is considered the main contender in this paper.
A downside of many Bayesian optimization algorithms is that the computation time per iteration scales quadratically with the number of measurements taken up until that point. This causes these methods to become slower over time, and after a certain number of iterations they may even violate the assumption that the bottleneck in the problem is the cost of evaluating the objective. This downside is recognized and overcome in two similar algorithms: COMBO and DONE. Both algorithms use the power of random features to get a fixed computation time every iteration, but COMBO is designed for combinatorial optimization while DONE is designed for continuous optimization. A disadvantage of COMBO is that it evaluates the surrogate model at every possible candidate point, a number that grows exponentially with the input dimension. Though evaluating the surrogate model takes very little time (compared to evaluating the original objective), this still makes the algorithm unfit for problems where the input dimension is large. A variant of DONE named CDONE has been applied to a mixed-integer problem where the integer constraints were relaxed, but as mentioned earlier, this can lead to sub-optimal solutions.
However, the downside of having to relax the integer constraints can be circumvented. By choosing the basis functions in a certain way, we show how a model can be constructed for which it is known beforehand that the minima of the model lie exactly in integer points. This makes it possible to apply the algorithm to combinatorial problems, as explained in the next section.
3 IDONE algorithm
The DONE algorithm  and its variants are able to solve problem (1) without the integer constraint by making use of a surrogate model. Every time a new measurement comes in, the surrogate model is updated, the minimum of the surrogate model is found, and an exploration step is performed. To make the algorithm suitable for combinatorial optimization we propose a variant of DONE called IDONE (integer-DONE), where the surrogate model is guaranteed to have integer-valued minima.
3.1 Piece-wise linear surrogate model
The proposed surrogate model M is a linear combination of rectified linear units (ReLUs), a basis function that is commonly used in the deep learning community:

M(x) = sum_k c_k max(0, w_k^T x + b_k),

with weights c_k and fixed parameters w_k and b_k. Unlike what is common practice in the deep learning community, the parameters w_k and b_k remain fixed in this surrogate model. This makes the model linear in its trainable parameters c_k, allowing it to be trained via linear regression instead of iterative methods. This is explained in Section 3.2.
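As a concrete sketch, such a model and its evaluation can be written as follows; the specific parameters W, b, c below are hypothetical toy values, not ones used in the text:

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def surrogate(x, W, b, c):
    """Evaluate M(x) = sum_k c_k * max(0, w_k . x + b_k).
    W (K x d) and b (K,) stay fixed; only the weights c (K,) are trained,
    so M is linear in its trainable parameters."""
    return float(c @ relu(W @ x + b))

# Hand-picked (hypothetical) parameters for a 2-variable model:
W = np.array([[1.0, 0.0],
              [0.0, 1.0]])
b = np.array([0.0, -1.0])
c = np.array([2.0, 3.0])
value = surrogate(np.array([1.0, 2.0]), W, b, c)  # 2*max(0,1) + 3*max(0,1) = 5
```

Because the ReLUs are fixed, evaluating M is a single matrix-vector product followed by a weighted sum, which keeps both fitting and evaluation cheap.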
Because of the choice of basis functions, the surrogate model is actually piece-wise linear, which causes its local minima to lie in one of its corner points:

Theorem 1. Any strict local minimum of M lies in a point x with w_k^T x + b_k = 0 for d linearly independent w_k.

The reverse of this theorem is not necessarily true: if x satisfies w_k^T x + b_k = 0 for d linearly independent w_k, then it depends on the parameters of the model whether x is actually a local minimum or not.
The number of local minima and their locations depend on the parameters w_k and b_k. In this work, we provide two options for choosing these parameters in such a way that the local minima are always found in integer solutions. In the first case, the functions w_k^T x + b_k are simply chosen to have their zeros on hyper-planes that together form an integer lattice:
Definition 1 (Basic model).
An example of a basis function in this model is max(0, x_i - a), with a an integer between the bounds of variable x_i. On top of these, the model contains a bias term: a basis function that is equal to 1 everywhere. This allows the model to be shifted up or down.
Since all the basis functions depend only on one variable, this basic model might not be fit for problems where the decision variables have complex interactions. Therefore, in the advanced model, we use the same basis functions, but we also add basis functions that depend on two variables:
Definition 2 (Advanced model).
This model has more basis functions than the basic model. The added functions are zero on diagonal lines through pairs of variables; see Figure 1.
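Under the assumption that the basic model places one hinge per integer level of each variable, and that the advanced model adds hinges on pairwise differences x_i - x_j (zero on the diagonal lines of Figure 1), the parameter construction can be sketched as follows; the exact offsets generated by the text's Algorithms 1 and 2 may differ, and the constant bias function is left out of this sketch:

```python
import numpy as np

def basic_basis(lb, ub):
    """Basic model sketch: for each variable i, one ReLU hinge per integer
    level a in [lb_i, ub_i], i.e. the function x -> max(0, x_i - a)."""
    d = len(lb)
    W, b = [], []
    for i in range(d):
        for a in range(lb[i], ub[i] + 1):
            w = np.zeros(d)
            w[i] = 1.0
            W.append(w)
            b.append(-float(a))
    return np.array(W), np.array(b)

def advanced_basis(lb, ub):
    """Advanced model sketch: the basic hinges plus hinges on x_i - x_j,
    which are zero on diagonal lines through pairs of variables."""
    W, b = basic_basis(lb, ub)
    W, b = list(W), list(b)
    d = len(lb)
    for i in range(d):
        for j in range(i + 1, d):  # each pair once
            for a in range(lb[i] - ub[j], ub[i] - lb[j] + 1):
                w = np.zeros(d)
                w[i], w[j] = 1.0, -1.0
                W.append(w)
                b.append(-float(a))
    return np.array(W), np.array(b)

Wb, bb = basic_basis([0, 0], [1, 1])      # 2 hinges per binary variable
Wa, ba = advanced_basis([0, 0], [1, 1])   # plus 3 diagonal hinges for the pair
```

All hinge locations are integer by construction, which is what places the kinks of the piece-wise linear model on the integer lattice.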
We now show one of our main contributions.
Theorem 2.
(I) If x is a strict local minimum of the basic model, then x ∈ Z^d.
(II) If x is a strict local minimum of the advanced model, then x ∈ Z^d.
(III) If x is a non-strict local minimum of the basic model, it holds that the model retains the same value when going from x to the nearest point that satisfies x ∈ Z^d.
(IV) If x is a non-strict local minimum of the advanced model, it holds that the model retains the same value when rounding x to the nearest integer.
(I) Let x be a strict local minimum of the basic model. By Theorem 1, there are d linearly independent w_k with w_k^T x + b_k = 0. From Algorithm 1 it can be seen that all functions w_k^T x + b_k have the form x_i - a, for some variable index i and integer a. Since d of these functions are linearly independent, all of them must have a different i. Since all of them satisfy x_i - a = 0, it holds that x_i = a_i for integers a_i, for all i, which is what is claimed.
(II) Let x be a strict local minimum of the advanced model. By Theorem 1, there are d linearly independent w_k with w_k^T x + b_k = 0. This means that, together, the w_k must depend on all variables x_i, i = 1, …, d. From Algorithm 2 it can be seen that all functions w_k^T x + b_k either have the same form as in the basic model, that is, x_i - a with a an integer, or they have the form x_i - x_j - a with a an integer and i ≠ j. No matter the form, the offset a is an integer, and thus

w_k^T x ∈ Z for every k with w_k^T x + b_k = 0. (3)

To arrive at a contradiction, suppose that there exists an i such that x_i ∉ Z. Then by (3), none of these w_k can have the single-variable form x_i - a, so every w_k that depends on x_i must have the pairwise form x_i - x_j - a, with x_j ∉ Z as well. Repeating the argument, all variables involved in these pairwise functions are non-integer, and the corresponding w_k are difference vectors on this set of variables. But such difference vectors can never span all the directions of the variables they involve, see Algorithm 2 (the for-loop starts at i + 1), so the w_k cannot be d linearly independent. Therefore, it is not true that there exists an i such that x_i ∉ Z. Hence, x ∈ Z^d, which is what the theorem claims.
(III) Let x be a non-strict local minimum of the basic model and suppose x_i ∉ Z for some i. Let L be the line segment from x to the nearest point with an integer i-th coordinate (keeping all other coordinates fixed), without including that point. Since the only functions that depend on x_i have the form x_i - a with a an integer, no sign change happens on L for any w_k^T x + b_k that depends on x_i. Therefore, model M is linear on this line segment, and since x is a non-strict local minimum and M is continuous, M retains the same value when replacing x_i by its nearest integer. This can be repeated for all i for which x_i ∉ Z, which proves the claim.
(IV) Let x be a non-strict local minimum of the advanced model and suppose x ∉ Z^d. We first show that rounding x to the nearest integer does not change the sign of any w_k^T x + b_k. Note that all the w_k^T x + b_k of the advanced model have the form x_i - a or x_i - x_j - a, for some i, j and some integer a. Let ⌊·⌉ denote rounding to the nearest integer. Because a is an integer, the rounded expression ⌊x_i⌉ - a (or ⌊x_i⌉ - ⌊x_j⌉ - a) is itself an integer, and it differs from the original value by at most 1. Hence a strictly positive value rounds to an integer greater than -1, that is, to a non-negative value, and similarly a strictly negative value rounds to a non-positive value.
Since the sign of none of the w_k^T x + b_k changes when rounding, and model M is only nonlinear where some w_k^T x + b_k changes sign, it follows that M is linear on the line segment from x to the nearest integer point. Together with the fact that x is a non-strict local minimum, it follows that M retains the same value on this line segment. Finally, the claim is valid because M is continuous. ∎
3.2 Fitting the model
Because the surrogate model is linear in its parameters c_k, fitting the model can be done with linear regression. Given input-output pairs (x_n, y_n), this is done by solving the regularized linear least squares problem

min_c sum_n (M(x_n; c) - y_n)^2 + λ ||c - c_0||^2, (4)

with regularization parameter λ > 0 and initial weights c_0. The regularization part is added to overcome ill-conditioning, noise, and model over-fitting. Furthermore, by choosing c_0 appropriately, it is ensured that the surrogate model is convex before the first iteration. In this work, a fixed value of λ has been chosen.
To prevent having to solve this problem from scratch at every iteration, (4) is solved with the recursive least squares algorithm. This algorithm has runtime O(D^2) per iteration, with D the number of basis functions used by the model. This implies that the computation time per iteration does not depend on the number of measurements, which is a big advantage compared to Bayesian optimization algorithms (whose complexity per iteration typically grows quadratically with the number of measurements). The memory is also O(D^2), because a covariance matrix needs to be stored. Since D scales linearly with the input dimension d and with the distance between the lower and upper bounds, the computational complexity of fitting the surrogate model is O(d^2 b^2), with b the largest difference between the upper and lower bounds.
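A single recursive least squares step can be sketched as follows; this is the textbook rank-one update (here without a forgetting factor), not necessarily the exact variant used in the text:

```python
import numpy as np

def rls_update(c, P, phi, y):
    """One recursive least-squares step: fold the sample (phi, y) into the
    weights c and covariance-like matrix P in O(D^2) time, independent of
    how many samples were processed before."""
    Pphi = P @ phi
    g = Pphi / (1.0 + phi @ Pphi)   # gain vector
    c = c + g * (y - phi @ c)       # correct by the prediction error
    P = P - np.outer(g, Pphi)       # rank-one downdate of P
    return c, P

# Recover c* = (2, -1) from noiseless samples y = phi . c*:
c_true = np.array([2.0, -1.0])
c_hat = np.zeros(2)
P = 1e6 * np.eye(2)                 # large initial P ~ weak regularization
for phi in [np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([1.0, 1.0])]:
    c_hat, P = rls_update(c_hat, P, phi, phi @ c_true)
```

Initializing P as a multiple of the identity corresponds to the regularization term in (4): the larger the initial P, the weaker the pull toward the initial weights.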
3.2.1 Model visualization
To visualize the surrogate model used by the IDONE algorithm, the fitting procedure is applied to a simple traveling salesman problem with four cities. The distance matrix for the cities is shown in Table 1. The decision variables are chosen as follows: the route starts at city 1, then variable x_1 determines which of the three remaining cities is visited, then variable x_2 determines which of the two remaining cities is visited; then the one remaining city is visited, and finally city 1 is visited again. This problem has two optimal solutions (route 1-2-4-3-1 and route 1-3-4-2-1), both with the same minimal total distance, while all other solutions share one and the same larger total distance.
Figure 2 shows what the surrogate model looks like after taking measurements in all possible data points for this problem, which is possible due to the low number of possibilities. It can be observed that this model is piece-wise linear and that any local minimum retains the same value when rounding to the nearest integer. Furthermore, the diagonal lines (see also Figure 1) make the advanced model more accurate.
3.3 Finding the minimum of the model
After fitting the model at iteration n, the algorithm proceeds to find a local minimum of the surrogate model using the new weights:

min_x M(x) subject to lb ≤ x ≤ ub.

The BFGS method with a relaxation of the integer constraint was used to solve the above problem, with a provided analytical derivative of M. At the kink of the ReLU basis function, where the derivative is not defined, a fixed subgradient value was provided. The optimal solution was rounded to the nearest integer, as justified by Theorem 2.
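The relax-then-round step can be sketched as follows. SciPy's L-BFGS-B is used here (a bound-constrained quasi-Newton method in the BFGS family), and the stand-in objective is a smooth function with an integer-valued minimum; for the actual IDONE surrogate, Theorem 2 is what justifies the final rounding:

```python
import numpy as np
from scipy.optimize import minimize

def minimize_and_round(model, grad, x0, lb, ub):
    """Relax the integer constraints, minimize the continuous model with
    L-BFGS-B (which handles the bound constraints), then round the
    continuous solution to the nearest feasible integer point."""
    res = minimize(model, np.asarray(x0, dtype=float), jac=grad,
                   method="L-BFGS-B", bounds=list(zip(lb, ub)))
    return np.clip(np.round(res.x), lb, ub).astype(int)

# Stand-in objective with an integer-valued minimum at (1, 3):
target = np.array([1.0, 3.0])
model = lambda x: float(np.sum((x - target) ** 2))
grad = lambda x: 2.0 * (x - target)
x_min = minimize_and_round(model, grad, [2.5, 2.5], [0, 0], [5, 5])
```

For a model whose minima are guaranteed to be integer-valued, the rounding step loses nothing; for a general relaxed model it could, which is exactly the pitfall discussed in Section 2.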
3.4 Exploration

After fitting the model and finding its minimum x_min, a new point needs to be chosen to evaluate the function f. As in DONE, a random perturbation δ is added to the found minimum: x_{n+1} = x_min + δ. But instead of a continuous random variable, δ is a discrete random variable that perturbs each variable by +1 or -1 with a small probability each, and leaves it unchanged otherwise. In this work, this probability has been chosen to decrease with d, the number of variables.
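The exploration step can be sketched as follows; the choice p = 1/d below is an illustrative assumption, and the exact perturbation probabilities in the text depend on its chosen parameter:

```python
import numpy as np

def explore(x, lb, ub, p, rng):
    """Exploration step: perturb each coordinate of the candidate by +1 or
    -1, each with probability p, and clip to the bounds so the perturbed
    point stays feasible and integer."""
    x = np.array(x, dtype=int)
    u = rng.random(len(x))
    x[u < p] += 1
    x[(u >= p) & (u < 2 * p)] -= 1
    return np.clip(x, lb, ub)

rng = np.random.default_rng(0)
d = 6
y = explore([2] * d, [0] * d, [4] * d, 1.0 / d, rng)
```

With p proportional to 1/d, only a couple of coordinates change per iteration in expectation, so exploration stays local around the surrogate's minimum.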
3.5 IDONE algorithm
The IDONE algorithm iterates over three phases: updating the surrogate model with recursive least squares, finding the minimum of the model, and performing the exploration step. The pseudocode for the algorithm is shown in Algorithm 3. Depending on which subroutine is used in the first line, we refer to this algorithm as either IDONE-basic (using the basic model) or IDONE-advanced (using the advanced model).
4 Experimental results
To determine how the IDONE algorithm compares to other black-box optimization algorithms in terms of convergence speed and scalability, it has been applied to two problems: finding a robust route for a noise-perturbed asymmetric traveling salesman benchmark problem with 17 cities, and an artificial convex binary optimization problem. The first problem gives a first indication of the algorithm's performance on an objective function that follows from a simulation with a network structure. The second problem, being convex and having a known global optimum, presents an easier and more tangible situation, which makes the results easier to interpret.
The algorithm is compared with several different black-box optimization algorithms: random search (RS), simulated annealing (SA), and two different Bayesian optimization algorithms: Matlab's bayesopt function (BO) and the Python library HyperOpt (HypOpt). The RS, IDONE-advanced, and HypOpt algorithms are implemented in Python and run on a cluster (Intel Xeon E5-2650 CPUs), while the SA, BO, and IDONE-basic algorithms are implemented in Matlab and run on a laptop (Intel Core i7-6600U CPU), which is faster in single-thread performance according to https://www.cpubenchmark.net/singleThread.html. (The cluster makes it faster to run multiple experiments at the same time, but no effort has been made to parallelize the algorithms. Matlab makes use of multiple cores automatically, and since the mentioned laptop has four cores, the algorithms that were implemented in Matlab can be faster than the Python implementations by more than the single-thread factor alone.) Whenever we compare the runtimes of algorithms evaluated in these two different machine environments, we will be more careful with our conclusions. For BO and HypOpt, we used the default settings. It should be noted that BO and HypOpt are both aimed at minimizing black-box functions using as few function evaluations as possible. For SA, the settings are explained below.
In the context of the IDONE algorithm, the SA algorithm essentially consists of just the exploration step of the IDONE algorithm (see Section 3.4), coupled with a probability of returning to the previous candidate solution. Suppose the current best solution is x, and the exploration step as defined in Section 3.4 gives a new candidate solution x'. If f(x') ≤ f(x), then x' is accepted as the new best solution. Else, there is a probability that x' is still accepted as the new best solution. This probability is equal to exp(-(f(x') - f(x)) / T), with T a so-called temperature. In this work, the simulated annealing algorithm starts out with a starting temperature T_0, and the temperature is multiplied with a factor α < 1 every iteration. This strategy is called a cooling schedule. The values of T_0 and α have been chosen separately for the asymmetric traveling salesman problem and for the convex binary optimization problem.
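The acceptance rule and geometric cooling schedule can be sketched as follows; the values T_0 = 100 and α = 0.9 below are hypothetical placeholders, not the settings used in the experiments:

```python
import math
import random

def sa_accept(f_old, f_new, T, rng):
    """Metropolis acceptance rule: always accept an improvement; accept a
    worse candidate with probability exp(-(f_new - f_old) / T)."""
    if f_new <= f_old:
        return True
    return rng.random() < math.exp(-(f_new - f_old) / T)

def temperature(T0, alpha, k):
    """Geometric cooling schedule: T_k = alpha^k * T0."""
    return T0 * alpha ** k

rng = random.Random(0)
```

As T shrinks over the iterations, the chance of accepting a worse candidate vanishes, so the search gradually turns from exploration into pure hill-climbing.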
4.1 Robust routes for an asymmetric traveling salesman problem (17 cities)
Consider the asymmetric TSP benchmark called BR17. This benchmark was taken from the TSPLIB website, a library of sample instances for the traveling salesman problem. While there exist specific solvers developed for this problem, these solvers are not adequate if the objective to be minimized is perturbed by noise. Here, noise with a uniform distribution was added to the distances between any two cities (for distances other than 0 or infinity, which both occur once per city, the mean distance between cities for this instance gives a reference scale for the noise). Furthermore, every time a sequence of cities has been chosen, we evaluate this route a fixed number of times, with different noise samples. The objective is the worst-case (largest) route length of the chosen sequence of cities. Minimizing this objective should then result in a route that is robust to the noise in the problem. For the variables, the same encoding as in Section 3.2.1 has been used, giving 15 integer variables in total.
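The encoding and the worst-case objective can be sketched as follows, on a toy 4-city instance (zero-based cities; the distance matrix and noise scale are stand-ins, not the BR17 data):

```python
import numpy as np

def decode_route(x):
    """Decode the integer vector x into a tour that starts and ends at
    city 0: x[i] indexes into the list of not-yet-visited cities, as in
    the encoding of Section 3.2.1 (zero-based here)."""
    remaining = list(range(1, len(x) + 2))
    route = [0]
    for choice in x:
        route.append(remaining.pop(choice))
    route.append(remaining.pop())  # one city left: forced
    route.append(0)                # return to the start
    return route

def robust_length(x, D, n_eval, noise_scale, rng=np.random.default_rng(0)):
    """Worst-case (largest) noisy route length over n_eval evaluations."""
    route = decode_route(x)
    worst = -np.inf
    for _ in range(n_eval):
        length = sum(D[a][b] + noise_scale * rng.uniform(-1, 1)
                     for a, b in zip(route, route[1:]))
        worst = max(worst, length)
    return worst

# Toy symmetric distance matrix for 4 cities:
D = [[0, 1, 2, 3],
     [1, 0, 4, 5],
     [2, 4, 0, 6],
     [3, 5, 6, 0]]
```

With the noise scale set to zero, the worst case reduces to the plain route length, which makes the decoding easy to check.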
All algorithms were run multiple times on this problem, and the results are shown in Figure 3. The differences between the run times of the algorithms are significantly larger than the differences between the two machine environments, so we ignore this subtlety here. The BO algorithm was not included, as it took multiple hours per run. It can be seen that IDONE-advanced achieves similar results to HyperOpt, although its computation time is over ten times larger. Both HyperOpt and IDONE-advanced outperform the simpler benchmark methods. IDONE-basic appears unable to deal with the complex interaction between the variables due to the basic structure of its model, as it performs worse than the baseline algorithms.
4.2 Convex binary optimization
To gain a better understanding of the different algorithms, the second experiment is done on a function with a known mathematical formulation. Consider the function

f(x) = (x - x*)^T A (x - x*),

with A a random positive semi-definite d × d matrix, x* a randomly chosen binary vector, and d the number of binary variables. Neither the optimal solution x* nor the structure of the function is given to the different algorithms; they are only given the number of variables and the fact that they are binary. Starting from a matrix B whose elements are randomly generated from a uniform distribution, matrix A is constructed from B in such a way that it is positive semi-definite, with the help of the identity matrix. The function f can only be accessed via noisy measurements y = f(x) + ε, with ε a uniform random variable. We ran repeated experiments with this function, with IDONE and the other black-box optimization algorithms. For each run, A and x* were randomly generated, as well as the initial guess. All algorithms were stopped after a fixed budget of function evaluations, and the best found objective value was stored at each iteration. Figure 4 shows a convergence plot for one of the larger problem sizes. It can be seen that the two variants of IDONE have the fastest convergence. The large number of variables is too much for a pure random search, but also for HyperOpt. Simulated annealing still gives decent results on this problem.
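One plausible instantiation of this construction is sketched below; the exact recipe for A and the noise magnitude used in the experiments are not specified here, so A = B^T B + I and the noise term are assumptions:

```python
import numpy as np

def make_problem(d, noise_scale=0.0, rng=np.random.default_rng(42)):
    """Generate one instance: A = B^T B + I (positive definite) from a
    uniformly random B, a random binary optimum x*, and a noisy oracle
    f(x) = (x - x*)^T A (x - x*) + noise. The exact construction in the
    text may differ; this is an illustrative assumption."""
    B = rng.uniform(-1.0, 1.0, size=(d, d))
    A = B.T @ B + np.eye(d)
    x_star = rng.integers(0, 2, size=d)

    def f(x):
        r = np.asarray(x, dtype=float) - x_star
        return float(r @ A @ r) + noise_scale * rng.uniform(-1.0, 1.0)

    return f, x_star

f, x_star = make_problem(8)
```

Because A is positive definite, the relaxed function is convex with its unique minimum at x*, so the binary optimum is known by construction and every run has a verifiable target value.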
Figure 5 shows the final objective value and computation time after the full budget of iterations for the same problem, for different values of d. The number of variables was varied over a range of values. The Bayesian optimization implementation of Matlab was only evaluated for the smallest problem sizes due to its large computation time. As can be seen, IDONE (both versions) is the only algorithm that consistently gives a solution at or close to the optimal solution (whose objective value is known by construction) for the highest dimensions. Where all algorithms get at or close to the optimal solution for the low-dimensional problems, the difference between the algorithms becomes more distinguishable when more variables are considered. The strengths of HyperOpt, such as dealing with different types of variables that can have complex interactions, are not relevant for this particular problem, and the Parzen estimator surrogate model does not seem to scale as well to higher dimensions as the piece-wise linear model used by IDONE.
5 Conclusions and future work
The IDONE algorithm is a black-box optimization algorithm designed for combinatorial problems with binary or integer constraints, and it has been shown to be useful in particular when the objective can only be accessed via costly and noisy evaluations. By using a surrogate model that is designed in such a way that its optimum lies in an integer solution, the algorithm can be applied to combinatorial optimization problems without having to resort to rounding in the objective function. IDONE has a fixed computation time per iteration that scales quadratically with the number of variables but is not influenced by the number of times the function has been evaluated, which is an advantage compared to Bayesian optimization algorithms. One variant of the proposed algorithm, IDONE-advanced, has been shown to outperform random search and simulated annealing on the problem of finding robust routes in a noise-perturbed traveling salesman benchmark problem, and on a convex binary optimization problem with a large number of variables. The other variant of the algorithm, IDONE-basic, is a lot faster, but only performed well in the second experiment. HyperOpt, a popular Bayesian optimization algorithm, performs similarly to IDONE-advanced on the traveling salesman benchmark problem, but does not scale as well on the binary optimization problem. These results show that there is room for improvement in the use of surrogate models for black-box combinatorial optimization, and that using continuous models with integer-valued local minima is a new and promising way forward.
In future work, the special structure of the surrogate model will be further exploited to provide a faster implementation, and the algorithm will be tested on real-life applications of combinatorial optimization with expensive objective functions. The question also arises whether this algorithm would perform well in situations where the objective function is not expensive to evaluate, or does not contain noise. Population-based methods perform particularly well on cheap black-box objective functions, so it would be interesting to see whether they could be combined with the surrogate model used in this paper. As for the noiseless case, it is known that for continuous variables it then becomes easy to estimate the gradient and use more traditional gradient-based methods, but in the case of discrete variables the traditional combinatorial optimization methods might still benefit from IDONE's piece-wise linear surrogate model. As surrogate-based optimization techniques have had great success in continuous optimization problems from many different fields, we hope that this work opens up a route to similar success for the many open combinatorial problems in these fields.
-  (2003) Local search in combinatorial optimization. Princeton University Press. Cited by: §2.
-  (2018) Bayesian optimization of combinatorial structures. In International Conference on Machine Learning, pp. 471–480. Cited by: §2, §2.
-  Bayesopt (MathWorks documentation). Note: https://www.mathworks.com/help/stats/bayesopt.html, accessed 14-10-2019. Cited by: §4.
-  Hyperopt: a Python library for optimizing the hyperparameters of machine learning algorithms. In Proceedings of the 12th Python in Science Conference, pp. 13–20. Cited by: §2, §4.
-  Online optimization with costly and noisy measurements using random Fourier expansions. IEEE Transactions on Neural Networks and Learning Systems 29 (1), pp. 167–182. Cited by: §1, §2, §3.4, §3.
-  Online function minimization with convex random ReLU expansions. In Machine Learning for Signal Processing (MLSP), 2017 IEEE 27th International Workshop on, pp. 1–6. Cited by: §2, §2, §3.2.
-  (2009) Introduction to derivative-free optimization. Vol. 8, SIAM. Cited by: §1.
-  (1999) Ant algorithms for discrete optimization. Artificial life 5 (2), pp. 137–172. Cited by: §2.
-  (2017-06) Dealing with Integer-valued Variables in Bayesian Optimization with Gaussian Processes. ArXiv e-prints. Cited by: §1, §2.
-  Simulation-based optimization: parametric optimization techniques and reinforcement learning. Vol. 55, Springer. Cited by: §1.
-  (2006) Discrete optimization via simulation using compass. Operations Research 54 (1), pp. 115–129. Cited by: §2.
-  (1998) Efficient global optimization of expensive black-box functions. Journal of Global optimization 13 (4), pp. 455–492. Cited by: §1, §2.
-  (1997) A discrete binary version of the particle swarm algorithm. In Systems, Man, and Cybernetics, 1997. Computational Cybernetics and Simulation., 1997 IEEE International Conference on, Vol. 5, pp. 4104–4108. Cited by: §2.
-  (2015) Deep learning. Nature 521 (7553), pp. 436. Cited by: §3.1.
-  (2009) A Bayesian exploration-exploitation approach for optimal online sensing and planning with a visually guided mobile robot. Autonomous Robots 27 (2), pp. 93–103. Cited by: §1.
-  (2012) Bayesian approach to global optimization: theory and applications. Vol. 37, Springer Science & Business Media. Cited by: §2.
-  (2016) Branch-and-bound algorithms: a survey of recent advances in searching, branching, and pruning. Discrete Optimization 19, pp. 79–102. Cited by: §2.
-  (2008) Uniform approximation of functions with random bases. In Communication, Control, and Computing, 2008 46th Annual Allerton Conference on, pp. 555–561. Cited by: §2.
-  (1992) Discrete optimization of structures using genetic algorithms. Journal of structural engineering 118 (5), pp. 1233–1250. Cited by: §2.
-  (1998) Recursive least-squares adaptive filters. The Digital Signal Processing Handbook 21 (1). Cited by: §3.2.
-  (2012) Practical Bayesian optimization of machine learning algorithms. In Advances in neural information processing systems, pp. 2951–2959. Cited by: §1.
-  TSPLIB. Note: http://elib.zib.de/pub/mp-testdata/tsp/tsplib/tsplib.html, accessed 04-02-2019. Cited by: §4.1.
-  (2016) COMBO: an efficient Bayesian optimization library for materials science. Materials discovery 4, pp. 18–21. Cited by: §2, §2.
-  Multi-objective optimization with surrogate trees. In Proceedings of the 15th Annual Conference on Genetic and Evolutionary Computation, pp. 679–686. Cited by: §1.
-  (2017) Auction optimization using regression trees and linear models as integer programs. Artificial Intelligence 244, pp. 368–395. Cited by: §1.
-  (1999) Numerical optimization. Springer Science 35, pp. 67–68. Cited by: §3.3.