Log In Sign Up

Online Mixed-Integer Optimization in Milliseconds

by   Dimitris Bertsimas, et al.

We propose a method to solve online mixed-integer optimization (MIO) problems at very high speed using machine learning. By exploiting the repetitive nature of online optimization, we are able to greatly speedup the solution time. Our approach encodes the optimal solution into a small amount of information denoted as strategy using the Voice of Optimization framework proposed in [BS18]. In this way the core part of the optimization algorithm becomes a multiclass classification problem which can be solved very quickly. In this work we extend that framework to real-time and high-speed applications focusing on parametric mixed-integer quadratic optimization (MIQO). We propose an extremely fast online optimization algorithm consisting of a feedforward neural network (NN) evaluation and a linear system solution where the matrix has already been factorized. Therefore, this online approach does not require any solver nor iterative algorithm. We show the speed of the proposed method both in terms of total computations required and measured execution time. We estimate the number of floating point operations (flops) required to completely recover the optimal solution as a function of the problem dimensions. Compared to state-of-the-art MIO routines, the online running time of our method is very predictable and can be lower than a single matrix factorization time. We benchmark our method against the state-of-the-art solver Gurobi obtaining from two to three orders of magnitude speedups on benchmarks with real-world data.


Learning Mixed-Integer Convex Optimization Strategies for Robot Planning and Control

Mixed-integer convex programming (MICP) has seen significant algorithmic...

CoCo: Online Mixed-Integer Control via Supervised Learning

Many robotics problems, from robot motion planning to object manipulatio...

Optimizing Wi-Fi Channel Selection in a Dense Neighborhood

In dense neighborhoods, there are often dozens of homes in close proximi...

Optimal Solution Predictions for Mixed Integer Programs

Mixed Integer Programming (MIP) is one of the most widely used modeling ...

An Approximation Algorithm for Indefinite Mixed Integer Quadratic Programming

In this paper, we give an algorithm that finds an epsilon-approximate so...

Greedy Clustering-Based Algorithm for Improving Multi-point Robotic Manipulation Sequencing

The problem of optimizing a sequence of tasks for a robot, also known as...

1 Introduction

MIO has become a powerful tool for modeling and solving real-world decision making problems; see [JLN10]. While most MIO problems are -hard and thus considered intractable, we are now able to solve instances with complexity and dimensions that were unthinkable just a decade ago. In [Bix10] the authors analyzed the impressive rate at which the computational power of MIO grew in the last 25 years providing over a trillion times speedups. This remarkable progress is due to both algorithmic and hardware improvements. Despite these advances, MIO is still considered harder to solve than convex optimization and, therefore, it is more rarely applied to online settings.

Online optimization differs from general optimization by requiring on the one hand computing times strictly within the application limits and on the other hand limited computing resources. Fortunately, while online optimization problems are not the same between each solve, only some parameters vary and the structure remains unchanged. For this reason, online optimization falls into the broader class of parametric optimization where we can greatly exploit the repetitive structure of the problem instances. In particular, there is a significant amount of data that we can reuse from the previous solutions.

In a recent work [BS18], the authors constructed a framework to predict and interpret the optimal solution of parametric optimization problems using machine learning. By encoding the optimal solution into a small amount of information denoted as strategy, the authors convert the solution algorithm into a multiclass classification problem. Using interpretable machine learning predictors such as OCT, Bertsimas and Stellato were able to understand and interpret how the problem parameters affect the optimal solutions. Therefore they were able to give optimization a voice that the practitioner can understand.

In this paper we extend the framework from [BS18] to online optimization focusing on speed and real-time applications instead of interpretability. This allows us to obtain an end-to-end approach to solve mixed-integer optimization problems online without the need of any solver nor linear system factorization. The online solution is extremely fast and can be carried out less than a millisecond reducing the online computation time by more than two orders of magnitude compared to state-of-the-art algorithms.

1.1 Contributions

In this work, by exploiting the structure of MIQO problems, we derive a very fast online solution algorithm where the whole optimization is reduced to a NN prediction and a single linear system solution. Even though our approach shares the same framework as [BS18], it is substantially different in the focus and the final algorithm. The focus is primarily speed and online optimization applications and not interpretability as in [BS18]. This is why for our predictions we use non interpretable, but very fast, methods such as NN. Furthermore, our final algorithm does not involve any convex optimization problem solution as in [BS18]. Instead, we just apply simple matrix-vector multiplications. Our specific contributions include:

  1. We focus on the class of MIQO instead of dealing with general MICO as in [BS18]. This allows us to replace the final step to recover the solution with a simple linear system solution based on the KKT optimality conditions of the reduced problem. Therefore, the whole procedure does not require any solver to run compared to [BS18].

  2. In several practical applications of MIQO, the KKT matrix of the reduced problem does not change with the parameters. In this work we factorize it offline and cache the factorization for all the possible solution strategies appearing in the data. By doing so, our online solution step becomes a sequence of simple forward-backward substitutions that we can carry out very efficiently. Hence, with the offline factorization, our overall online procedure does not even require a single matrix factorization. Compared to [BS18], this further simplifies the online solution step.

  3. After the algorithm simplifications, we derive the precise complexity of the overall algorithm in terms of flop which does not depend on the problem parameters values. This makes the execution time predictable and reliable compared to BNB algorithms which often get stuck in the tree search procedure.

  4. We benchmark our method against state-of-the-art MIQO solver Gurobi on sparse portfolio trading and fuel battery management examples showing significant speedups in terms of execution time and reductions in its variance.

1.2 Outline

The structure of the paper is as follows. In Section 1.3, we review recent work on machine learning for optimization outlining the relationships and limitations of other methods compared to approach presented in this work. In addition, we outline the recent developments in high-speed online optimization and the limited advances that appeared so far for MIO. In Section 2, we introduce the Voice of Optimization framework from [BS18] for general MIO describing the concept of solution strategy. In Section 3, we describe the strategies selection problem as a multiclass classification problem and propose the NN architecture used in the prediction phase. Section 4 describes the computation savings that we can obtain with problem with a specific structure such as MIQO and the worst-case complexity in terms of number of flop. In Section 5, we describe the overall algorithm with the implementation details. Benchmarks comparing our method to the state-of-the-art solver Gurobi examples with realistic data appear in Section 6.

1.3 Related Work

Machine learning for optimization.

Recently the operations research community started to focus on systematic ways to analyze and solve combinatorial optimization problems with the help of machine learning. For an extensive review on the topic we refer the reader to 


Machine learning has so far helped optimization in two directions. The first one investigates heuristics to improve solution algorithms. Iterative routines deal with repeated decisions where the answers are based on expert knowledge and manual tuning. A common example is branching heuristics in BNB algorithms. In general these rules are hand tuned and encoded into the solvers. However, the hand tuning can be hard and is in general suboptimal, especially with complex decisions such as BNB algorithm behavior. To overcome these limitations in 

[KBS16] the authors learn the branching rules without the need of expert knowledge showing comparable or even better performance than hand-tuned algorithms. Another example appears in [BLZ18] where the authors investigate whether it is faster to solve MIQO by linearizing or not the cost function. This problem becomes a classification problem that offers an advantage based on previous data compared to how the decision is heuristically made inside off-the-shelf solvers.

The second direction poses combinatorial problems as control tasks that we can analyze under the reinforcement learning framework 

[SB18]. This is applicable to problems with multistage decisions such as network problems or knapsack-like problems. [DKZ17] learn the heuristic criteria for stage-wise decisions in problems over graphs. In other words, they build a greedy heuristic framework, where they learn the node selection policy using a specialized neural network able to process graphs of any size [DDS16]. For every node, the authors feed a graph representation of the problem to the network and they receive an action-value pair for every node suggesting the next one to pick in their solution.

Even though these two directions introduce optimization to the benefits of machine learning and show promising results, they do not consider the parametric nature of problems solved in real-world applications. We often solve the same problem formulation with slightly varying parameters several times generating a large amount of data describing how the parameters affect the solution.

Recently, Misra et al. [MRN19] exploited this idea to devise a sampling scheme to collect all the optimal active sets appearing from the problem parameters in continuous convex optimization. While this approach is promising, it evaluates online all the combinations of the collected active sets without predicting the optimal ones using machine learning. Another related work appeared in [KKK19] where the authors warm-start online active set solvers using the predicted active sets from a machine learning framework. However, they do not provide probabilistic guarantees for that method and their sampling scheme is tailored to their specific application of QO for MPC.

In our work, we propose a new method that exploits the amount of data generated by parametric problems to solve MIO online at high speed. In particular we study how efficient we can make the computations using a combination of machine learning and optimization. To the authors knowledge this is the first time machine learning is used to both reduce and make more consistent the solution time of MIO algorithms.

Online optimization.

Applications of online optimization span a wide variety of fields including scheduling [CaPM10], supply chain management [YG08], hybrid model predictive control [BM99], signal decoding [DCB00].

Over the last decade there has been a significant attention from the community for tools for generating custom solvers for online parametric programs. CVXGEN [MB12] is a code generation software for parametric QO that produces a fast and reliable solver. However, its code size grows dramatically with the problem dimensions and it is not applicable to large problems. More recently, the OSQP solver [SBG18] showed remarkable performance with a light first-order method greatly exploiting the structure of parametric programs to save computation time. The OSQP authors also proposed an optimized version for embedded applications in [BSM17]. Other solvers that can exploit the structure of parametric programs in online settings include qpOASES [FKP14] for QO and ECOS [DCB13] for SOCO.

However, all these approaches focus on continuous convex problems with no integer variables such as QO. This is because of two main reasons. On the one hand, mixed-integer optimization algorithms are far more complicated to implement than convex optimization ones since they feature a massive amount of heuristics and preprocessing. This is why there is still a huge gap in performance between open and commercial solvers for MIO. On the other hand, for many online applications the solution time required to solve MIO problems is still not compatible with the amount of time allowed. An example is hybrid MPC where depending on the system dynamics, we have to solve MIO problems online in fractions of a second. Explicit hybrid MPC tackles this issue by precomputing offline the entire mapping between the parameter space to the optimal solution [BMDP02]. However, the memory required for storing such solutions grows exponentially with the problem dimensions and this approach easily becomes intractable. Other approaches solve these problems only suboptimally using heuristics to deal with insufficient time to compute the globally optimal solution. Examples include the Feasibility Pump heuristic [FGL05] that iteratively solves LO subproblems and rounds their solutions until it finds a feasible solution for the original MIO problem. Another heuristic works by integrating the ADMM [DTB18] with rounding steps to obtain integer feasible solutions for the original problem. The downside of these heuristics is that they do not exploit the large amount of data that we gain by solving the parametric problems over and over again in online settings.

Despite all these efforts in solving MIO online, there is still a significant gap between the solution time and the real-time constraints of many applications. In this work we propose an alternative approach that exploits data coming from previous problem solutions with high-performance MIO solvers to reduce the online computation time making it possible to apply MIO to online problems that were not approachable before.

2 The Voice of Optimization

In this section we introduce the idea of the optimal strategy following the same framework first introduced in [BS18]. Given a parameter , we define a strategy as the complete information needed to efficiently recover the optimal solution of an optimization problem.

Consider the mixed-integer optimization problem


where is the decision variable and defines the parameters affecting the problem. We denote the cost as and the constraints as . The vector denotes the optimal solution and the optimal cost function value given the parameter .

Optimal strategy.

We now define the optimal strategy as the set of tight constraints together with the value of integer variables at the optimal solution. We denote the tight constraints as constraints that are equalities at optimality,


Hence, given the all the other constraints are redundant for the original problem.

When the problem is non degenerate the number of tight constraints is exactly because the tight constraints gradients are linearly independent [NW06, Section 16.4]. For degenerate problems the number of tight constraints can be more than the number of decision variables, i.e., . However, these situations are often less relevant in practice and they can be avoided by perturbing the cost function. Moreover, in practice the number of active constraints is significantly lower than total the number of constraints even for degenerate problems.

When we constrain some of the components of to be integer, does not uniquely identify a feasible optimal solution. However, after fixing the integer components to their optimal values , the tight constraints uniquely identify the optimal solution. Hence the strategy identifying the optimal solution is a tuple containing the index of tight constraints at optimality and the optimal value of the integer variables, i.e., .

Solution method.

Given the optimal strategy, solving (1) corresponds to solving the following optimization problem


Solving (3) is much easier than (1) because it is continuous, convex and has a smaller number of constraints. Since it is a very fast and direct step, we will denote it as solution decoding. In Section 4 we describe the details of how to exploit the structure of (3) and compute the optimal solution online at very high speeds.

3 Machine Learning

In this section, we describe how we learn the mapping from the parameters to the optimal strategies . In this way, we can replace the hardest part of the optimization routine by a prediction step in a multiclass classification problem where each strategy is a class label.

3.1 Multiclass Classifier

Our classification problem features data points where are the parameters and the corresponding labels identifying the optimal strategies. Set is the set of strategies of cardinality .

We solve the classification task using NN. NN have recently become the most popular machine learning method radically changing the way we classify in fields such as computer vision 

[KSH12], speech recognition [HDY12], autonomous driving [BDTD16], and reinforcement learning [SSS17].

In this work we choose feedforward neural networks because they offer a good balance between simplicity and accuracy without the need of more advanced architectures such as convolutional or recurrent NN [LBH15].


We deploy a similar architecture as in [BS18] consisting of layers defining a composition of functions of the form

where each layer consists of


In each layer, corresponds to the number of nodes and the dimension of vector . We define the input layer as and the output layer as so that and .

Each layer performs an affine transformation with parameters and

. In addition, it includes an activation function

to model nonlinearities as a ReLU defined as

The operator is intended element-wise. The ReLU operator has become popular because it promotes sparsity to the model outputting for the negative components of

and because it does not experience vanishing gradient issues typical in sigmoid functions 


The output layer features a softmax activation function to provide a normalized ranking between the strategies and quantify how likely they are to be the right one,

where are the elements of . Therefore, and is the predicted class.


In order to define a proper cost function and train the network we rewrite the labels as a one-hot encoding,

i.e., where is the total number of classes and all the elements of are 0 except the one corresponding to the class which is . Then we define a smooth cost function, i.e., the cross-entropy loss for which gradient descent-like algorithms work well

where is intended element-wise. This loss

can also be interpreted as the distance between the predicted probability density of the labels and to the true one. The training step consists of applying the classic stochastic gradient descent with the derivatives of the cost function obtained using the back-propagation rule.

Online predictions.

After we complete the model training, we can predict the optimal strategy given . However, the model can never be perfect and there can be situations where the prediction is not correct. To overcome these possible limitations, instead of considering only the best class predicted by the NN we pick the most-likely classes. Afterwards, we can evaluate in parallel their feasibility and objective value and pick the feasible one with the best objective.

3.2 Strategies Exploration

It is difficult to estimate the amount of data required to accurately learn the classifier for problem (1). In particular, given different strategies obtained from data points , how likely is it to encounter new strategies in the subsequent samples?

Following the approach in [BS18], we use the same algorithm to iteratively sample and estimate the probability of finding unseen strategies using the Good-Turing estimator [Goo53]. Given a desired probability guarantee , the algorithm terminates when a bound on the probability of finding new strategies falls below .

4 High-Speed Online Optimization

Thanks to the learned predictor, our method offers great computational savings compared to solving each problem instance from scratch. In addition, when the problem offers a specific structure, we can gain even further speedups and replace the whole optimizer with a linear system solution.

Two major challenges in optimization.

By using our previous solutions, the learned predictor maps new parameters to the optimal strategy replacing two of the arguably hardest tasks in numerical optimization algorithms:

Tight constraints.

Identifying the tight constraints at the optimal solution is in general a hard task because of the combinatorial complexity to search over all the possible combinations of constraints. For this reason the worst-case complexity active-set methods (also called simplex methods for LO) is exponential in the number of constraints [BT97].

Integer variables.

It is well known that finding the optimal solution of mixed-integer programs is -hard [BW05]. Hence, solving problem (1) online might require a prohibitive computation time because of the combinatorial complexity of computing the optimal values of the integer variables.

We solve both these issues by evaluating our predictor that outputs the optimal strategy , i.e., the tight constraints at optimality and the value of the integer variables . After evaluating the predictor, computing the optimal solution consists of solving (3) which we can achieve much more efficiently than solving (1), especially in case of special structure.

Special structure.

When is a linear function in we can directly consider the tight constraints as equalities in (3) without losing convexity. In these cases (3) becomes a convex equality constrained problem that can be solved via Newton’s method [BV04, Section 10.2]. We can further simplify the online solution in special cases such as MIQO (and also MILO) of the form


with cost , , and constraints and . We omitted the dependency of the problem data on for ease of notation. Given the optimal strategy , computing the optimal solution to (5) corresponds to solving the following reduced problem from (3)


Since it is an equality constrained QO, we can compute the optimal solution by solving the linear system defined by its KKT conditions [BV04, Section 10.1.1]



is the identity matrix. Vectors or matrices with a subscript index set identify only the rows corresponding to the indices in that set.

are the dual variables of the reduced continuous problem. The dimensions of the KKT matrix are where


We can apply the same method to MILO by setting .

4.1 The Efficient Solution Computation

In case of MIQO solving linear system (7) corresponds to computing the solution to (5). Let us analyze the components involved in the online computations to further optimize the solution time.

Linear system solution.

The linear system (7) is sparse and symmetric and we can solve it with both direct methods and indirect methods. Regarding direct methods, we compute a sparse permuted factorization [Dav06] of the KKT matrix where is a square lower triangular matrix and a diagonal matrix both with dimensions . The factorization step requires number of flop which can be expensive for large systems. After the factorization, the solution consists only in forward-backward solves which can be evaluated very efficiently and in parallel with complexity . Alternatively, when the system is very large we can use an indirect method such as MINRES [PS82] to iteratively approximate the solution by using simple matrix-vector multiplications at each step with complexity . Note that indirect methodsm while more amenable for large problems, can suffer from bad scaling of the matrix data requiring many steps before convergence.

Matrix caching.

In several cases does not affect the matrices and in (5). In other words, enters only in the linear part of the cost and the right hand side of the constraints and does not affect the KKT matrix in (7). This means that, since we know all the strategies appeared in the training phase, we can factor each of the KKT matrices and store the factors and offline. Therefore, whenever we predict a new strategy we can just perform forward-backward solves to obtain the optimal solution without having to perform a new factorization. This step requires  flop – an order of magnitude less than factorizing the matrix.

Parallel strategy evaluation.

In Section 3.1 we explained how we transform the strategy selection into a multiclass classification problem. Since we have the ability to compare the quality of different strategies in terms of optimality and feasibility, online we choose the most likely strategies from the predictor output and we compare their performance. Since the comparison is completely independent between the candidate strategies, we can perform it in parallel saving additional computation time.

4.2 Online Complexity

We now measure the online complexity of the proposed approach in terms of flop. The first step consists in evaluating the neural network with layers. As shown in (4), each layer consists in a matrix-vector multiplication and additions which has order operations [BV04, Section C.1.2]. The ReLU step in the layer does not involve any flop since it is a simple truncation of the non positive components of the layer input. Summing these operations over all the layers, the complexity of evaluating the network becomes

where is the dimension of the parameter and we assume that the number of strategies

is larger than the dimension of any layer. Note that the final softmax layer, while being very useful in the training phase and in its interpretation as likelihood for each class, is unnecessary in the online evaluation since we just need to rank the best strategies.

After the neural network evaluation we can decode the optimal solution by solving the KKT linear system (7) of dimension defined in (8). Since we already factorized it, the online computaitons are just simple forward-backward substitutions as discussed in Section 4.1. Therefore, the flop required to solve the linear system are

This dimension is in general much smaller than the number of constraints and mostly depends only on the problem variables. In addition the KKT matrix (7) is sparse and if the factorization matrices are sparse as well, we could further reduce the flop required.

The overall complexity of the complete MIQO online solution taking into account both NN prediction and solution decoding becomes


which does not depend on the cube of any of the problem dimensions. Note that this means that from an asymptotic perspective, our method is cheaper than factorizing a single linear system since that operation would scale with the cube of its input data.

Despite these high-speed considerations on the number of operations required, it is very important to remark how predictable and reliable the computation time of our approach is. The execution time of BNB algorithms greatly depends on how the solution search tree is analyzed and pruned. This can vary significantly when problem parameters change making the whole online optimization procedure unreliable in real-time applications. On the contrary, our method offers a fixed number of operations which we evaluate every time we encounter a new parameter.

5 Machine Learning Optimizer

Our implementation extends the software tool MLOPT from [BS18] which is implemented in Python and integrated with CVXPY [DB16] to model the optimization problems. MLOPT relies on the Gurobi Optimizer [Gur16]

to solve the problems in the training phase with high accuracy and identify the tight constraints. Then MLOPT passes the data to PyTorch 

[PGC17] to train the NN to classify the strategies. We train the NN with batch size

, number of epochs

and dense linear layers. We split the training data into training and validation to tune the learning rate over parameters ranging from to . In addition to the MLOPT framework in [BS18], we include a specialized solution method for MIQO based on the techniques described in Section 4 where we factorize and cache the factorization of the KKT matrices (7) for each strategy of the problem to speedup the online computations. As outlined in Section 4, when we classify online using the NN we pick the best strategies and evaluate them in parallel, in terms of objective function value and infeasibility, to choose the best one. This step requires a minimal overhead since the matrices are all factorized and we can execute the evaluations in parallel. We also parallelize the training phase where we collect data and solve the problems over multiple CPUs. The NN training takes place on a GPU which greatly reduces the training time. An overview of the online and offline algorithms appears in Figure 1.





Factorizationand Caching




Figure 1: Algorithm implementation

6 Computational Benchmarks

In this section, we benchmark our learning scheme on multiple parametric examples from continuous and mixed-integer problems. We compare the predictive performance and the computational time required by our method to solve the problem compared to using GUROBI Optimizer [Gur16]. We run the experiments on the MIT SuperCloud facility in collaboration with the Lincoln Laboratory [RKB18] exploiting 20 parallel Intel Xeon E5-2650 cores for the data collection involving the problems solution and a NVIDIA Tesla V100 GPU with 16GB for the neural network training. For each problem we sample 100.000 points to collect the strategies. We choose this number because the NN training works better with a large number of data points. In addition, for all the examples the Good-Turing estimator condition from Section 3.2 was always satisfied for . We use samples in the test set of the algorithm.

Infeasibility and suboptimality.

We follow the same criteria for calculating the relative suboptimality and infeasibility as in [BS18]. We repeat them here for completeness. After the learning phase, we compare the predicted solution to the optimal one obtained by solving the instances from scratch. Given a parameter , we say that the predicted solution is infeasible if the constraints are violated more than according to the infeasibility metric

where normalizes the violation depending on the size of the summands of . In case of MIQO, and . If the predicted solution is feasible, we define its suboptimality as

where is the objective of our MIQO problem in (5). Note that by construction. We consider a predicted solution to be accurate if it is feasible and if the suboptimality is less than the tolerance . For each instance we report the average infeasibility and suboptimality over the samples.

6.1 Fuel Cell Energy Management

Fuel cells are a green highly efficient energy source that need to be controlled to stay within admissible operating ranges. Too large switching between ON and OFF states can reduce both the lifespan of the energy cell and increase energy losses. This is why fuel cells are often paired with energy storage devices such as capacitors which help reducing the switching frequency during fast transients.

In this example we would like to control the energy balance between a super capacitor and a fuel cell in order to match the demanded power [FDM15]. The goal is to minimize the energy losses while maintaining the device within acceptable operating limits to prevent lifespan degradation.

We can model the capacitor dynamics as


where is the sampling time and is the energy stored. is the power provided by the fuel cell and is the desired load power.

At each time

we model the on-off state of the fuel cell with the binary variable

. When the battery is off () we do not consume any energy, thus we have . When the engine is on () it consumes units of fuel, with . We define the stage power cost as

We now model the total sum of the switchings over a time window in order to constrain its value. In order to do so we introduce binary variable determining whether the cell switches at time either from ON to OFF or viceversa. Additionally we introduce the auxiliary variable accounting for the amount of change brought by in the battery state ,


We can model these logical relationships as the following linear inequality [FDM15],


Hence we can write the number of switchings appeared up to time over the past time window of length as


and impose the constraints . The complete fuel cell problem becomes


The problem parameters are where and .

Problem setup.

We chose parameters from [FDM15] with values , , . We define energy and power constraints with , and . The initial values are , and . We randomly generated the load profile .

We obtained the samples by first simulating a closed-loop trajectory with the load profile and then sampling around the trajectory points where each component and of is uniformly sampled from a hypersphere with radius 0.5 centered at the corresponding trajectory point. After sampling we enforced the feasibility of the problem parameters according to the constraints in (14). We use the best strategies when performing the predictions.


The accuracy and suboptimality results are shown in Table 1. The number of strategies considerably grows with the horizon length. The prediction is very accurate up to horizon . With horizon and there is a accuracy degradation. However, the infeasibility and suboptimality are relatively low compared to the number of correct predictions. This means that even though we do not predict the optimal strategy, we often find solutions with often similar if not equal quality. We displayed the execution times comparison in Table 2 and Figure 2. There is a significantly higher variance between samples in the Gurobi execution time and substantial growth with the problem horizon compared to MLOPT. This is due to the fact that BNB algorithm can take significantly longer to search for the optimal solution for some parameters values. Our method, instead, does not significantly increase the average solution time that is always around . In addition, the large majority of our samples is concentrated around the mean making our approach more reliable independently from the value of the parameters.

[head to column names, late after line=
]./data/benchmarks/battery/complete.csv T=, n_strategies=, accuracy=, avg_subopt=, avg_infeas=,
Table 1: Fuel cell energy management accuracy, suboptimality and infeasibility.
[head to column names, late after line=
]./data/benchmarks/battery/complete.csv T=, mean_time_pred=, mean_time_full=, std_time_pred=, std_time_full=,
Table 2: Fuel cell energy management timing comparison between Gurobi and MLOPT.
Figure 2: Execution times for the battery example comparing Gurobi and MLOPT.

6.2 Portfolio Trading

Consider the portfolio investment problem [Mar52, BBD17]. This problem has been extensively analyzed in robust optimization [BP08] and stochastic control settings [HDG07]. The decision variables are the normalized portfolio weights at time corresponding to each of the assets in the portfolio and a riskless asset at the -th denoting a cash account. We define the trade as the difference of consecutive weights where is the given vector of current asset investments acting as a problem parameter. The goal is to maximize the risk-adjusted returns as follows


Four terms compose the stage rewards. First, we describe returns as a function of the estimated stock returns at time . Second, we define the risk cost as

where is the estimate of the return covariance at time . Third, we define the holding cost as

where is the borrowing fee for shorting asset at time . The fourth term describes a penalty on the trades defined as

Parameter denotes the relative cost importance of penalizing the trades. The first constraint enforces the portfolio weights normalization while the second constraints the maximum number of nonzero asset investments, i.e., the cardinality, to be less than .

For the complete model derivation without cardinality constraints see [BBD17, Section 5.2].

Risk model.

We use a common risk model described as as a -factor model where is the factor loading matrix and is an estimate of the factor returns covariance. Each entry is the loading of asset to factor . is a nonnegative diagonal matrix accounting for the additional variance in the asset returns usually called idiosyncratic risk. We compute the factor model estimates with factors by using as similar method as in [BBD17, Section 7.3] where we take into account data from the two years time window before time .

Return forecasts.

In practice return forecasts are always proprietary and come from sophisticated prediction techniques based on a wide variety of data available to the trading companies. In this example, we simply add zero-mean noise to the realized returns to obtain the estimates and then rescale them to have a realistic mean square error of our prediction. While these estimates are not real because they use the actual returns, they provide realistic values for the purpose of our computations. We assume to know the risk-free interest rates exactly with . The return estimates for the non-cash assets are where and is the scaling to minimize the mean-squared error  [BBD17, Section 7.3]. This method gives us return forecasts in the order of with an information ratio typical of a proprietary return forecast prediction.


The problem parameters are which, respectively, correspond to the previous assets allocation, vector of returns and the risk model matrices. Note that the risk model is updated at the beginning of every month.

Problem setup.

We simulate the trading system using the S&P100 values [Qua19] from 2008 to 2013 with risk cost , borrow cost and trade cost . These values are similar as the benchmark values in [BBD17]. We use different sparsity levels from to . Afterwards we collect data by sampling around the trajectory points from a hypersphere with radius times the magnitude of the parameter. For example, for a vector of returns of magnitude we sample around with a radius . We use the best strategies when performing the predictions.


The results are shown in Table 3. The number of strategies does not significantly increase with the sparsity level and our accuracy is always very high around . Suboptimality and infeasibility are also very low less than . The execution times appear in Table 4 and Figure 3 where our method shows a clear advantage over Gurobi. There are more than three orders of magnitude speedups and the execution times are in the order of a millisecond with a very low variance.

[head to column names, late after line=
]./data/benchmarks/portfolio/complete_clean.csv K=, n_strategies=, accuracy=, avg_subopt=, avg_infeas=,
Table 3: Portfolio trading accuracy, suboptimality and infeasibility.
[head to column names, late after line=
]./data/benchmarks/portfolio/complete_clean.csv K=, mean_time_pred=, mean_time_full=, std_time_pred=, std_time_full=,
Table 4: Portfolio trading timing comparison between Gurobi and MLOPT.
Figure 3: Execution times for the portfolio example comparing Gurobi vs MLOPT.

7 Conclusions

We proposed a machine learning method for solving online MIO at very high speed. By using the Voice of Optimization framework [BS18] we exploited the repetitive nature of online optimization problems to greatly reduce the solution time. Our method casts the core part of the optimization algorithm as a multiclass classification problem that we solve using a NN. In this work we considered the class of MIQO which, while covering the vast majority of real-world optimization problems, allows us to further reduce the computation time exploiting its structure. For MIQO, we only have to solve a linear system to obtain the optimal solution after the NN prediction. In other words, our approach does not require any solver nor iterative routine to solve parametric MIQO. Our method is not only extremely fast but also reliable. Compared to branch-and-bound methods which show a significant variance in solution time depending on the problem parameters, our approach has a very predictable online solution time for which we can exactly bound the flop complexity. This is of crucial importance for real-time applications. We benchmarked our approach against the state-of-the-art solver Gurobi on different real-world examples showing to fold speedups in online execution time.


The authors acknowledge the MIT SuperCloud and Lincoln Laboratory Supercomputing Center for providing High Performace Computing resources that have contributed to the research results reported within this paper.


  • [BBD17] S. Boyd, E. Busseti, S. Diamond, R. N. Kahn, K. Koh, P. Nystrup, and J. Speth. Multi-period trading via convex optimization. Foundations and Trends in Optimization, 3(1):1–76, 2017.
  • [BDTD16] M. Bojarski, D. Del Testa, D. Dworakowski, B. Firner, B. Flepp, P. Goyal, L. D. Jackel, M. Monfort, U. Muller, J. Zhang, X. Zhang, J. Zhao, and K. Zieba. End to end learning for self-driving cars. arXiv:1604.07316, 2016.
  • [Bix10] R. E. Bixby. A brief history of linear and mixed-integer programming computation. Documenta Mathematica, pages 107–121, 2010.
  • [BLP18] Y. Bengio, A. Lodi, and A. Prouvost. Machine learning for combinatorial optimization: a methodological tour d’horizon. arXiv:1811.0612, 2018.
  • [BLZ18] P. Bonami, A. Lodi, and G. Zarpellon. Learning a classification of mixed-integer quadratic programming problems. In Willem-Jan van Hoeve, editor,

    Integration of Constraint Programming, Artificial Intelligence, and Operations Research

    , pages 595–604, Cham, 2018. Springer International Publishing.
  • [BM99] A. Bemporad and M. Morari. Control of systems integrating logic, dynamics, and constraints. Automatica, 35(3):407–427, 1999.
  • [BMDP02] A. Bemporad, M. Morari, V. Dua, and E. N. Pistikopoulos. The explicit linear quadratic regulator for constrained systems. Automatica, 38(1):3–20, 2002.
  • [BP08] D. Bertsimas and D. Pachamanova. Robust multiperiod portfolio management in the presence of transaction costs. Computers & Operations Research, 35(1):3 – 17, 2008.
  • [BS18] D. Bertsimas and B. Stellato. The Voice of Optimization. arXiv:1812.09991, 2018.
  • [BSM17] G. Banjac, B. Stellato, N. Moehle, P. Goulart, A. Bemporad, and S. Boyd. Embedded code generation using the OSQP solver. In IEEE Conference on Decision and Control (CDC), 2017.
  • [BT97] D. Bertsimas and J. N. Tsitsiklis. Introduction to Linear Optimization. Athena Scientific, 1997.
  • [BV04] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.
  • [BW05] D. Bertsimas and R. Weismantel. Optimization Over Integers. Dynamic Ideas Press, 2005.
  • [CaPM10] J. P. S Catalão, H.M.I. Pousinho, and V.M.F. Mendes. Scheduling of head-dependent cascaded hydro systems: Mixed-integer quadratic programming approach. Energy Conversion and Management, 51(3):524 – 530, 2010.
  • [Dav06] T. A. Davis. Direct Methods for Sparse Linear Systems. Society for Industrial and Applied Mathematics, 2006.
  • [DB16] S. Diamond and S. Boyd. CVXPY: A Python-embedded modeling language for convex optimization. Journal of Machine Learning Research, 17(83):1–5, 2016.
  • [DCB00] O. Damen, A. Chkeif, and Belfiore. Lattice code decoder for space-time codes. IEEE Communications Letters, 4(5):161–163, May 2000.
  • [DCB13] A. Domahidi, E. Chu, and S. Boyd. ECOS: An SOCP solver for embedded systems. In European Control Conference (ECC), pages 3071–3076, 2013.
  • [DDS16] H.un Dai, B. Dai, and L. Song. Discriminative embeddings of latent variable models for structured data. In Proceedings of the 33rd International Conference on International Conference on Machine Learning - Volume 48, ICML’16, pages 2702–2711., 2016.
  • [DKZ17] H. Dai, E. B. Khalil, Y. Zhang, B. Dilkina, and L. Song. Learning combinatorial optimization algorithms over graphs. arXiv:1704.01665, 2017.
  • [DTB18] S. Diamond, R. Takapoui, and S. Boyd. A general system for heuristic minimization of convex functions over non-convex sets. Optimization Methods and Software, 33(1):165–193, 2018.
  • [FDM15] D. Frick, A. Domahidi, and M. Morari. Embedded optimization for mixed logical dynamical systems. Computers & Chemical Engineering, 72:21–33, 2015.
  • [FGL05] M. Fischetti, F. Glover, and A. Lodi. The feasibility pump. Mathematical Programming, 104(1):91–104, 2005.
  • [FKP14] H. J. Ferreau, C. Kirches, A. Potschka, H. G. Bock, and M. Diehl. qpOASES: a parametric active-set algorithm for quadratic programming. Mathematical Programming Computation, 6(4):327–363, 2014.
  • [GBC16] I. Goodfellow, Y. Bengio, and A. Courville. Deep Learning. MIT Press, 2016.
  • [Goo53] I. J. Good. The population frequencies of species and the estimation of population parameters. Biometrika, 40(3/4):237–264, 1953.
  • [Gur16] Gurobi Optimization, Inc. Gurobi optimizer reference manual, 2016.
  • [HDG07] F. Herzog, G. Dondi, and H. P. Geering. Stochastic model predictive control and portfolio optimization. International Journal of Theoretical and Applied Finance (IJTAF), 10(02):203–233, 2007.
  • [HDY12] G. Hinton, L. Deng, D. Yu, G. E. Dahl, A. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. N. Sainath, and B. Kingsbury. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal Processing Magazine, 29(6):82–97, Nov 2012.
  • [JLN10] M. Jünger, T. M. Liebling, D. Naddef, G. L. Nemhauser, W. R. Pulleyblank, G. Reinelt, G. Rinaldi, and L. A. Wolsey. 50 Years of Integer Programming 1958-2008. Springer Berlin Heidelberg, 2010.
  • [KBS16] E. B. Khalil, P. L. Bodic, L. Song, G. Nemhauser, and B. Dilkina. Learning to branch in mixed integer programming. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, AAAI’16, pages 724–731. AAAI Press, 2016.
  • [KKK19] M. Klaučo, M. Kalúz, and M. Kvasnica. Machine learning-based warm starting of active set methods in embedded model predictive control. Engineering Applications of Artificial Intelligence, 77:1 – 8, 2019.
  • [KSH12] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, page 2012, 2012.
  • [LBH15] Y. LeCun, Y. Bengio, and G. Hinton. Deep learning. Nature, 521(7553):436–444, may 2015.
  • [Mar52] H. Markowitz. Portfolio selection. The Journal of Finance, 7(1):77–91, 1952.
  • [MB12] J. Mattingley and S. Boyd. CVXGEN: A code generator for embedded convex optimization. Optimization and Engineering, 13(1):1–27, 2012.
  • [MRN19] S. Misra, L. Roald, and Y. Ng. Learning for constrained optimization: Identifying optimal active constraint sets. arXiv:1802.09639, 2019.
  • [NW06] J. Nocedal and S. J. Wright. Numerical Optimization. Springer Series in Operations Research and Financial Engineering. Springer, Berlin, 2006.
  • [PGC17] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer. Automatic differentiation in PyTorch. In NIPS-W, 2017.
  • [PS82] C. C. Paige and M. A. Saunders. LSQR: An algorithm for sparse linear equations and sparse least squares. ACM Trans. Math. Softw., 8(1):43–71, March 1982.
  • [Qua19] Quandl. WIKI S&P100 End-Of-Day Data, 2019.
  • [RKB18] A. Reuther, J. Kepner, C. Byun, S. Samsi, W. Arcand, D. Bestor, B. Bergeron, V. Gadepally, M. Houle, M. Hubbell, M. Jones, A. Klein, L. Milechin, J. Mullen, A. Prout, A. Rosa, C. Yee, and P. Michaleas. Interactive supercomputing on 40,000 cores for machine learning and data analysis. In 2018 IEEE High Performance extreme Computing Conference (HPEC), pages 1–6, Sep. 2018.
  • [SB18] R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. A Bradford Book, 2nd edition, 2018.
  • [SBG18] B. Stellato, G. Banjac, P. Goulart, A. Bemporad, and S. Boyd. OSQP: An Operator Splitting Solver for Quadratic Programs. arXiv:1711.08013, 2018.
  • [SSS17] D. Silver, J. Schrittwieser, K. Simonyan, I. Antonoglou, A. Huang, A. Guez, T. Hubert, L. Baker, M. Lai, A. Bolton, Y. Chen, L. Lillicrap, F. Hui, L. Sifre, G. van den Driessche, T. Graepel, and D. Hassabis. Mastering the game of go without human knowledge. Nature, 550(7676):354–359, oct 2017.
  • [YG08] F. You and I. E. Grossmann. Mixed-integer nonlinear programming models and algorithms for large-scale supply chain design with stochastic inventory management. Industrial & Engineering Chemistry Research, 47(20):7802–7817, 2008.