1 Introduction
MIO has become a powerful tool for modeling and solving realworld decision making problems; see [JLN10]. While most MIO problems are hard and thus considered intractable, we are now able to solve instances with complexity and dimensions that were unthinkable just a decade ago. In [Bix10] the authors analyzed the impressive rate at which the computational power of MIO grew in the last 25 years providing over a trillion times speedups. This remarkable progress is due to both algorithmic and hardware improvements. Despite these advances, MIO is still considered harder to solve than convex optimization and, therefore, it is more rarely applied to online settings.
Online optimization differs from general optimization by requiring on the one hand computing times strictly within the application limits and on the other hand limited computing resources. Fortunately, while online optimization problems are not the same between each solve, only some parameters vary and the structure remains unchanged. For this reason, online optimization falls into the broader class of parametric optimization where we can greatly exploit the repetitive structure of the problem instances. In particular, there is a significant amount of data that we can reuse from the previous solutions.
In a recent work [BS18], the authors constructed a framework to predict and interpret the optimal solution of parametric optimization problems using machine learning. By encoding the optimal solution into a small amount of information denoted as strategy, the authors convert the solution algorithm into a multiclass classification problem. Using interpretable machine learning predictors such as OCT, Bertsimas and Stellato were able to understand and interpret how the problem parameters affect the optimal solutions. Therefore they were able to give optimization a voice that the practitioner can understand.
In this paper we extend the framework from [BS18] to online optimization focusing on speed and realtime applications instead of interpretability. This allows us to obtain an endtoend approach to solve mixedinteger optimization problems online without the need of any solver nor linear system factorization. The online solution is extremely fast and can be carried out less than a millisecond reducing the online computation time by more than two orders of magnitude compared to stateoftheart algorithms.
1.1 Contributions
In this work, by exploiting the structure of MIQO problems, we derive a very fast online solution algorithm where the whole optimization is reduced to a NN prediction and a single linear system solution. Even though our approach shares the same framework as [BS18], it is substantially different in the focus and the final algorithm. The focus is primarily speed and online optimization applications and not interpretability as in [BS18]. This is why for our predictions we use non interpretable, but very fast, methods such as NN. Furthermore, our final algorithm does not involve any convex optimization problem solution as in [BS18]. Instead, we just apply simple matrixvector multiplications. Our specific contributions include:

We focus on the class of MIQO instead of dealing with general MICO as in [BS18]. This allows us to replace the final step to recover the solution with a simple linear system solution based on the KKT optimality conditions of the reduced problem. Therefore, the whole procedure does not require any solver to run compared to [BS18].

In several practical applications of MIQO, the KKT matrix of the reduced problem does not change with the parameters. In this work we factorize it offline and cache the factorization for all the possible solution strategies appearing in the data. By doing so, our online solution step becomes a sequence of simple forwardbackward substitutions that we can carry out very efficiently. Hence, with the offline factorization, our overall online procedure does not even require a single matrix factorization. Compared to [BS18], this further simplifies the online solution step.

After the algorithm simplifications, we derive the precise complexity of the overall algorithm in terms of flop which does not depend on the problem parameters values. This makes the execution time predictable and reliable compared to BNB algorithms which often get stuck in the tree search procedure.

We benchmark our method against stateoftheart MIQO solver Gurobi on sparse portfolio trading and fuel battery management examples showing significant speedups in terms of execution time and reductions in its variance.
1.2 Outline
The structure of the paper is as follows. In Section 1.3, we review recent work on machine learning for optimization outlining the relationships and limitations of other methods compared to approach presented in this work. In addition, we outline the recent developments in highspeed online optimization and the limited advances that appeared so far for MIO. In Section 2, we introduce the Voice of Optimization framework from [BS18] for general MIO describing the concept of solution strategy. In Section 3, we describe the strategies selection problem as a multiclass classification problem and propose the NN architecture used in the prediction phase. Section 4 describes the computation savings that we can obtain with problem with a specific structure such as MIQO and the worstcase complexity in terms of number of flop. In Section 5, we describe the overall algorithm with the implementation details. Benchmarks comparing our method to the stateoftheart solver Gurobi examples with realistic data appear in Section 6.
1.3 Related Work
Machine learning for optimization.
Recently the operations research community started to focus on systematic ways to analyze and solve combinatorial optimization problems with the help of machine learning. For an extensive review on the topic we refer the reader to
[BLP18].Machine learning has so far helped optimization in two directions. The first one investigates heuristics to improve solution algorithms. Iterative routines deal with repeated decisions where the answers are based on expert knowledge and manual tuning. A common example is branching heuristics in BNB algorithms. In general these rules are hand tuned and encoded into the solvers. However, the hand tuning can be hard and is in general suboptimal, especially with complex decisions such as BNB algorithm behavior. To overcome these limitations in
[KBS16] the authors learn the branching rules without the need of expert knowledge showing comparable or even better performance than handtuned algorithms. Another example appears in [BLZ18] where the authors investigate whether it is faster to solve MIQO by linearizing or not the cost function. This problem becomes a classification problem that offers an advantage based on previous data compared to how the decision is heuristically made inside offtheshelf solvers.The second direction poses combinatorial problems as control tasks that we can analyze under the reinforcement learning framework
[SB18]. This is applicable to problems with multistage decisions such as network problems or knapsacklike problems. [DKZ17] learn the heuristic criteria for stagewise decisions in problems over graphs. In other words, they build a greedy heuristic framework, where they learn the node selection policy using a specialized neural network able to process graphs of any size [DDS16]. For every node, the authors feed a graph representation of the problem to the network and they receive an actionvalue pair for every node suggesting the next one to pick in their solution.Even though these two directions introduce optimization to the benefits of machine learning and show promising results, they do not consider the parametric nature of problems solved in realworld applications. We often solve the same problem formulation with slightly varying parameters several times generating a large amount of data describing how the parameters affect the solution.
Recently, Misra et al. [MRN19] exploited this idea to devise a sampling scheme to collect all the optimal active sets appearing from the problem parameters in continuous convex optimization. While this approach is promising, it evaluates online all the combinations of the collected active sets without predicting the optimal ones using machine learning. Another related work appeared in [KKK19] where the authors warmstart online active set solvers using the predicted active sets from a machine learning framework. However, they do not provide probabilistic guarantees for that method and their sampling scheme is tailored to their specific application of QO for MPC.
In our work, we propose a new method that exploits the amount of data generated by parametric problems to solve MIO online at high speed. In particular we study how efficient we can make the computations using a combination of machine learning and optimization. To the authors knowledge this is the first time machine learning is used to both reduce and make more consistent the solution time of MIO algorithms.
Online optimization.
Applications of online optimization span a wide variety of fields including scheduling [CaPM10], supply chain management [YG08], hybrid model predictive control [BM99], signal decoding [DCB00].
Over the last decade there has been a significant attention from the community for tools for generating custom solvers for online parametric programs. CVXGEN [MB12] is a code generation software for parametric QO that produces a fast and reliable solver. However, its code size grows dramatically with the problem dimensions and it is not applicable to large problems. More recently, the OSQP solver [SBG18] showed remarkable performance with a light firstorder method greatly exploiting the structure of parametric programs to save computation time. The OSQP authors also proposed an optimized version for embedded applications in [BSM17]. Other solvers that can exploit the structure of parametric programs in online settings include qpOASES [FKP14] for QO and ECOS [DCB13] for SOCO.
However, all these approaches focus on continuous convex problems with no integer variables such as QO. This is because of two main reasons. On the one hand, mixedinteger optimization algorithms are far more complicated to implement than convex optimization ones since they feature a massive amount of heuristics and preprocessing. This is why there is still a huge gap in performance between open and commercial solvers for MIO. On the other hand, for many online applications the solution time required to solve MIO problems is still not compatible with the amount of time allowed. An example is hybrid MPC where depending on the system dynamics, we have to solve MIO problems online in fractions of a second. Explicit hybrid MPC tackles this issue by precomputing offline the entire mapping between the parameter space to the optimal solution [BMDP02]. However, the memory required for storing such solutions grows exponentially with the problem dimensions and this approach easily becomes intractable. Other approaches solve these problems only suboptimally using heuristics to deal with insufficient time to compute the globally optimal solution. Examples include the Feasibility Pump heuristic [FGL05] that iteratively solves LO subproblems and rounds their solutions until it finds a feasible solution for the original MIO problem. Another heuristic works by integrating the ADMM [DTB18] with rounding steps to obtain integer feasible solutions for the original problem. The downside of these heuristics is that they do not exploit the large amount of data that we gain by solving the parametric problems over and over again in online settings.
Despite all these efforts in solving MIO online, there is still a significant gap between the solution time and the realtime constraints of many applications. In this work we propose an alternative approach that exploits data coming from previous problem solutions with highperformance MIO solvers to reduce the online computation time making it possible to apply MIO to online problems that were not approachable before.
2 The Voice of Optimization
In this section we introduce the idea of the optimal strategy following the same framework first introduced in [BS18]. Given a parameter , we define a strategy as the complete information needed to efficiently recover the optimal solution of an optimization problem.
Consider the mixedinteger optimization problem
(1) 
where is the decision variable and defines the parameters affecting the problem. We denote the cost as and the constraints as . The vector denotes the optimal solution and the optimal cost function value given the parameter .
Optimal strategy.
We now define the optimal strategy as the set of tight constraints together with the value of integer variables at the optimal solution. We denote the tight constraints as constraints that are equalities at optimality,
(2) 
Hence, given the all the other constraints are redundant for the original problem.
When the problem is non degenerate the number of tight constraints is exactly because the tight constraints gradients are linearly independent [NW06, Section 16.4]. For degenerate problems the number of tight constraints can be more than the number of decision variables, i.e., . However, these situations are often less relevant in practice and they can be avoided by perturbing the cost function. Moreover, in practice the number of active constraints is significantly lower than total the number of constraints even for degenerate problems.
When we constrain some of the components of to be integer, does not uniquely identify a feasible optimal solution. However, after fixing the integer components to their optimal values , the tight constraints uniquely identify the optimal solution. Hence the strategy identifying the optimal solution is a tuple containing the index of tight constraints at optimality and the optimal value of the integer variables, i.e., .
Solution method.
Given the optimal strategy, solving (1) corresponds to solving the following optimization problem
(3) 
Solving (3) is much easier than (1) because it is continuous, convex and has a smaller number of constraints. Since it is a very fast and direct step, we will denote it as solution decoding. In Section 4 we describe the details of how to exploit the structure of (3) and compute the optimal solution online at very high speeds.
3 Machine Learning
In this section, we describe how we learn the mapping from the parameters to the optimal strategies . In this way, we can replace the hardest part of the optimization routine by a prediction step in a multiclass classification problem where each strategy is a class label.
3.1 Multiclass Classifier
Our classification problem features data points where are the parameters and the corresponding labels identifying the optimal strategies. Set is the set of strategies of cardinality .
We solve the classification task using NN. NN have recently become the most popular machine learning method radically changing the way we classify in fields such as computer vision
[KSH12], speech recognition [HDY12], autonomous driving [BDTD16], and reinforcement learning [SSS17].In this work we choose feedforward neural networks because they offer a good balance between simplicity and accuracy without the need of more advanced architectures such as convolutional or recurrent NN [LBH15].
Architecture.
We deploy a similar architecture as in [BS18] consisting of layers defining a composition of functions of the form
where each layer consists of
(4) 
In each layer, corresponds to the number of nodes and the dimension of vector . We define the input layer as and the output layer as so that and .
Each layer performs an affine transformation with parameters and
. In addition, it includes an activation function
to model nonlinearities as a ReLU defined as
The operator is intended elementwise. The ReLU operator has become popular because it promotes sparsity to the model outputting for the negative components of
and because it does not experience vanishing gradient issues typical in sigmoid functions
[GBC16].The output layer features a softmax activation function to provide a normalized ranking between the strategies and quantify how likely they are to be the right one,
where are the elements of . Therefore, and is the predicted class.
Learning.
In order to define a proper cost function and train the network we rewrite the labels as a onehot encoding,
i.e., where is the total number of classes and all the elements of are 0 except the one corresponding to the class which is . Then we define a smooth cost function, i.e., the crossentropy loss for which gradient descentlike algorithms work wellwhere is intended elementwise. This loss
can also be interpreted as the distance between the predicted probability density of the labels and to the true one. The training step consists of applying the classic stochastic gradient descent with the derivatives of the cost function obtained using the backpropagation rule.
Online predictions.
After we complete the model training, we can predict the optimal strategy given . However, the model can never be perfect and there can be situations where the prediction is not correct. To overcome these possible limitations, instead of considering only the best class predicted by the NN we pick the mostlikely classes. Afterwards, we can evaluate in parallel their feasibility and objective value and pick the feasible one with the best objective.
3.2 Strategies Exploration
It is difficult to estimate the amount of data required to accurately learn the classifier for problem (1). In particular, given different strategies obtained from data points , how likely is it to encounter new strategies in the subsequent samples?
Following the approach in [BS18], we use the same algorithm to iteratively sample and estimate the probability of finding unseen strategies using the GoodTuring estimator [Goo53]. Given a desired probability guarantee , the algorithm terminates when a bound on the probability of finding new strategies falls below .
4 HighSpeed Online Optimization
Thanks to the learned predictor, our method offers great computational savings compared to solving each problem instance from scratch. In addition, when the problem offers a specific structure, we can gain even further speedups and replace the whole optimizer with a linear system solution.
Two major challenges in optimization.
By using our previous solutions, the learned predictor maps new parameters to the optimal strategy replacing two of the arguably hardest tasks in numerical optimization algorithms:
 Tight constraints.

Identifying the tight constraints at the optimal solution is in general a hard task because of the combinatorial complexity to search over all the possible combinations of constraints. For this reason the worstcase complexity activeset methods (also called simplex methods for LO) is exponential in the number of constraints [BT97].
 Integer variables.
We solve both these issues by evaluating our predictor that outputs the optimal strategy , i.e., the tight constraints at optimality and the value of the integer variables . After evaluating the predictor, computing the optimal solution consists of solving (3) which we can achieve much more efficiently than solving (1), especially in case of special structure.
Special structure.
When is a linear function in we can directly consider the tight constraints as equalities in (3) without losing convexity. In these cases (3) becomes a convex equality constrained problem that can be solved via Newton’s method [BV04, Section 10.2]. We can further simplify the online solution in special cases such as MIQO (and also MILO) of the form
(5) 
with cost , , and constraints and . We omitted the dependency of the problem data on for ease of notation. Given the optimal strategy , computing the optimal solution to (5) corresponds to solving the following reduced problem from (3)
(6) 
Since it is an equality constrained QO, we can compute the optimal solution by solving the linear system defined by its KKT conditions [BV04, Section 10.1.1]
(7) 
Matrix
is the identity matrix. Vectors or matrices with a subscript index set identify only the rows corresponding to the indices in that set.
are the dual variables of the reduced continuous problem. The dimensions of the KKT matrix are where(8) 
We can apply the same method to MILO by setting .
4.1 The Efficient Solution Computation
In case of MIQO solving linear system (7) corresponds to computing the solution to (5). Let us analyze the components involved in the online computations to further optimize the solution time.
Linear system solution.
The linear system (7) is sparse and symmetric and we can solve it with both direct methods and indirect methods. Regarding direct methods, we compute a sparse permuted factorization [Dav06] of the KKT matrix where is a square lower triangular matrix and a diagonal matrix both with dimensions . The factorization step requires number of flop which can be expensive for large systems. After the factorization, the solution consists only in forwardbackward solves which can be evaluated very efficiently and in parallel with complexity . Alternatively, when the system is very large we can use an indirect method such as MINRES [PS82] to iteratively approximate the solution by using simple matrixvector multiplications at each step with complexity . Note that indirect methodsm while more amenable for large problems, can suffer from bad scaling of the matrix data requiring many steps before convergence.
Matrix caching.
In several cases does not affect the matrices and in (5). In other words, enters only in the linear part of the cost and the right hand side of the constraints and does not affect the KKT matrix in (7). This means that, since we know all the strategies appeared in the training phase, we can factor each of the KKT matrices and store the factors and offline. Therefore, whenever we predict a new strategy we can just perform forwardbackward solves to obtain the optimal solution without having to perform a new factorization. This step requires flop – an order of magnitude less than factorizing the matrix.
Parallel strategy evaluation.
In Section 3.1 we explained how we transform the strategy selection into a multiclass classification problem. Since we have the ability to compare the quality of different strategies in terms of optimality and feasibility, online we choose the most likely strategies from the predictor output and we compare their performance. Since the comparison is completely independent between the candidate strategies, we can perform it in parallel saving additional computation time.
4.2 Online Complexity
We now measure the online complexity of the proposed approach in terms of flop. The first step consists in evaluating the neural network with layers. As shown in (4), each layer consists in a matrixvector multiplication and additions which has order operations [BV04, Section C.1.2]. The ReLU step in the layer does not involve any flop since it is a simple truncation of the non positive components of the layer input. Summing these operations over all the layers, the complexity of evaluating the network becomes
where is the dimension of the parameter and we assume that the number of strategies
is larger than the dimension of any layer. Note that the final softmax layer, while being very useful in the training phase and in its interpretation as likelihood for each class, is unnecessary in the online evaluation since we just need to rank the best strategies.
After the neural network evaluation we can decode the optimal solution by solving the KKT linear system (7) of dimension defined in (8). Since we already factorized it, the online computaitons are just simple forwardbackward substitutions as discussed in Section 4.1. Therefore, the flop required to solve the linear system are
This dimension is in general much smaller than the number of constraints and mostly depends only on the problem variables. In addition the KKT matrix (7) is sparse and if the factorization matrices are sparse as well, we could further reduce the flop required.
The overall complexity of the complete MIQO online solution taking into account both NN prediction and solution decoding becomes
(9) 
which does not depend on the cube of any of the problem dimensions. Note that this means that from an asymptotic perspective, our method is cheaper than factorizing a single linear system since that operation would scale with the cube of its input data.
Despite these highspeed considerations on the number of operations required, it is very important to remark how predictable and reliable the computation time of our approach is. The execution time of BNB algorithms greatly depends on how the solution search tree is analyzed and pruned. This can vary significantly when problem parameters change making the whole online optimization procedure unreliable in realtime applications. On the contrary, our method offers a fixed number of operations which we evaluate every time we encounter a new parameter.
5 Machine Learning Optimizer
Our implementation extends the software tool MLOPT from [BS18] which is implemented in Python and integrated with CVXPY [DB16] to model the optimization problems. MLOPT relies on the Gurobi Optimizer [Gur16]
to solve the problems in the training phase with high accuracy and identify the tight constraints. Then MLOPT passes the data to PyTorch
[PGC17] to train the NN to classify the strategies. We train the NN with batch size, number of epochs
and dense linear layers. We split the training data into training and validation to tune the learning rate over parameters ranging from to . In addition to the MLOPT framework in [BS18], we include a specialized solution method for MIQO based on the techniques described in Section 4 where we factorize and cache the factorization of the KKT matrices (7) for each strategy of the problem to speedup the online computations. As outlined in Section 4, when we classify online using the NN we pick the best strategies and evaluate them in parallel, in terms of objective function value and infeasibility, to choose the best one. This step requires a minimal overhead since the matrices are all factorized and we can execute the evaluations in parallel. We also parallelize the training phase where we collect data and solve the problems over multiple CPUs. The NN training takes place on a GPU which greatly reduces the training time. An overview of the online and offline algorithms appears in Figure 1.6 Computational Benchmarks
In this section, we benchmark our learning scheme on multiple parametric examples from continuous and mixedinteger problems. We compare the predictive performance and the computational time required by our method to solve the problem compared to using GUROBI Optimizer [Gur16]. We run the experiments on the MIT SuperCloud facility in collaboration with the Lincoln Laboratory [RKB18] exploiting 20 parallel Intel Xeon E52650 cores for the data collection involving the problems solution and a NVIDIA Tesla V100 GPU with 16GB for the neural network training. For each problem we sample 100.000 points to collect the strategies. We choose this number because the NN training works better with a large number of data points. In addition, for all the examples the GoodTuring estimator condition from Section 3.2 was always satisfied for . We use samples in the test set of the algorithm.
Infeasibility and suboptimality.
We follow the same criteria for calculating the relative suboptimality and infeasibility as in [BS18]. We repeat them here for completeness. After the learning phase, we compare the predicted solution to the optimal one obtained by solving the instances from scratch. Given a parameter , we say that the predicted solution is infeasible if the constraints are violated more than according to the infeasibility metric
where normalizes the violation depending on the size of the summands of . In case of MIQO, and . If the predicted solution is feasible, we define its suboptimality as
where is the objective of our MIQO problem in (5). Note that by construction. We consider a predicted solution to be accurate if it is feasible and if the suboptimality is less than the tolerance . For each instance we report the average infeasibility and suboptimality over the samples.
6.1 Fuel Cell Energy Management
Fuel cells are a green highly efficient energy source that need to be controlled to stay within admissible operating ranges. Too large switching between ON and OFF states can reduce both the lifespan of the energy cell and increase energy losses. This is why fuel cells are often paired with energy storage devices such as capacitors which help reducing the switching frequency during fast transients.
In this example we would like to control the energy balance between a super capacitor and a fuel cell in order to match the demanded power [FDM15]. The goal is to minimize the energy losses while maintaining the device within acceptable operating limits to prevent lifespan degradation.
We can model the capacitor dynamics as
(10) 
where is the sampling time and is the energy stored. is the power provided by the fuel cell and is the desired load power.
At each time
we model the onoff state of the fuel cell with the binary variable
. When the battery is off () we do not consume any energy, thus we have . When the engine is on () it consumes units of fuel, with . We define the stage power cost asWe now model the total sum of the switchings over a time window in order to constrain its value. In order to do so we introduce binary variable determining whether the cell switches at time either from ON to OFF or viceversa. Additionally we introduce the auxiliary variable accounting for the amount of change brought by in the battery state ,
(11) 
We can model these logical relationships as the following linear inequality [FDM15],
(12) 
Hence we can write the number of switchings appeared up to time over the past time window of length as
(13) 
and impose the constraints . The complete fuel cell problem becomes
(14) 
The problem parameters are where and .
Problem setup.
We chose parameters from [FDM15] with values , , . We define energy and power constraints with , and . The initial values are , and . We randomly generated the load profile .
We obtained the samples by first simulating a closedloop trajectory with the load profile and then sampling around the trajectory points where each component and of is uniformly sampled from a hypersphere with radius 0.5 centered at the corresponding trajectory point. After sampling we enforced the feasibility of the problem parameters according to the constraints in (14). We use the best strategies when performing the predictions.
Results.
The accuracy and suboptimality results are shown in Table 1. The number of strategies considerably grows with the horizon length. The prediction is very accurate up to horizon . With horizon and there is a accuracy degradation. However, the infeasibility and suboptimality are relatively low compared to the number of correct predictions. This means that even though we do not predict the optimal strategy, we often find solutions with often similar if not equal quality. We displayed the execution times comparison in Table 2 and Figure 2. There is a significantly higher variance between samples in the Gurobi execution time and substantial growth with the problem horizon compared to MLOPT. This is due to the fact that BNB algorithm can take significantly longer to search for the optimal solution for some parameters values. Our method, instead, does not significantly increase the average solution time that is always around . In addition, the large majority of our samples is concentrated around the mean making our approach more reliable independently from the value of the parameters.
[head to column names, late after line=  

]./data/benchmarks/battery/complete.csv T=, n_strategies=, accuracy=, avg_subopt=, avg_infeas=, 
[head to column names, late after line=  

]./data/benchmarks/battery/complete.csv T=, mean_time_pred=, mean_time_full=, std_time_pred=, std_time_full=, 
6.2 Portfolio Trading
Consider the portfolio investment problem [Mar52, BBD17]. This problem has been extensively analyzed in robust optimization [BP08] and stochastic control settings [HDG07]. The decision variables are the normalized portfolio weights at time corresponding to each of the assets in the portfolio and a riskless asset at the th denoting a cash account. We define the trade as the difference of consecutive weights where is the given vector of current asset investments acting as a problem parameter. The goal is to maximize the riskadjusted returns as follows
(15) 
Four terms compose the stage rewards. First, we describe returns as a function of the estimated stock returns at time . Second, we define the risk cost as
where is the estimate of the return covariance at time . Third, we define the holding cost as
where is the borrowing fee for shorting asset at time . The fourth term describes a penalty on the trades defined as
Parameter denotes the relative cost importance of penalizing the trades. The first constraint enforces the portfolio weights normalization while the second constraints the maximum number of nonzero asset investments, i.e., the cardinality, to be less than .
For the complete model derivation without cardinality constraints see [BBD17, Section 5.2].
Risk model.
We use a common risk model described as as a factor model where is the factor loading matrix and is an estimate of the factor returns covariance. Each entry is the loading of asset to factor . is a nonnegative diagonal matrix accounting for the additional variance in the asset returns usually called idiosyncratic risk. We compute the factor model estimates with factors by using as similar method as in [BBD17, Section 7.3] where we take into account data from the two years time window before time .
Return forecasts.
In practice return forecasts are always proprietary and come from sophisticated prediction techniques based on a wide variety of data available to the trading companies. In this example, we simply add zeromean noise to the realized returns to obtain the estimates and then rescale them to have a realistic mean square error of our prediction. While these estimates are not real because they use the actual returns, they provide realistic values for the purpose of our computations. We assume to know the riskfree interest rates exactly with . The return estimates for the noncash assets are where and is the scaling to minimize the meansquared error [BBD17, Section 7.3]. This method gives us return forecasts in the order of with an information ratio typical of a proprietary return forecast prediction.
Parameters.
The problem parameters are which, respectively, correspond to the previous assets allocation, vector of returns and the risk model matrices. Note that the risk model is updated at the beginning of every month.
Problem setup.
We simulate the trading system using the S&P100 values [Qua19] from 2008 to 2013 with risk cost , borrow cost and trade cost . These values are similar as the benchmark values in [BBD17]. We use different sparsity levels from to . Afterwards we collect data by sampling around the trajectory points from a hypersphere with radius times the magnitude of the parameter. For example, for a vector of returns of magnitude we sample around with a radius . We use the best strategies when performing the predictions.
Results.
The results are shown in Table 3. The number of strategies does not significantly increase with the sparsity level and our accuracy is always very high around . Suboptimality and infeasibility are also very low less than . The execution times appear in Table 4 and Figure 3 where our method shows a clear advantage over Gurobi. There are more than three orders of magnitude speedups and the execution times are in the order of a millisecond with a very low variance.
[head to column names, late after line=  

]./data/benchmarks/portfolio/complete_clean.csv K=, n_strategies=, accuracy=, avg_subopt=, avg_infeas=, 
[head to column names, late after line=  

]./data/benchmarks/portfolio/complete_clean.csv K=, mean_time_pred=, mean_time_full=, std_time_pred=, std_time_full=, 
7 Conclusions
We proposed a machine learning method for solving online MIO at very high speed. By using the Voice of Optimization framework [BS18] we exploited the repetitive nature of online optimization problems to greatly reduce the solution time. Our method casts the core part of the optimization algorithm as a multiclass classification problem that we solve using a NN. In this work we considered the class of MIQO which, while covering the vast majority of realworld optimization problems, allows us to further reduce the computation time exploiting its structure. For MIQO, we only have to solve a linear system to obtain the optimal solution after the NN prediction. In other words, our approach does not require any solver nor iterative routine to solve parametric MIQO. Our method is not only extremely fast but also reliable. Compared to branchandbound methods which show a significant variance in solution time depending on the problem parameters, our approach has a very predictable online solution time for which we can exactly bound the flop complexity. This is of crucial importance for realtime applications. We benchmarked our approach against the stateoftheart solver Gurobi on different realworld examples showing to fold speedups in online execution time.
Acknowledgment
The authors acknowledge the MIT SuperCloud and Lincoln Laboratory Supercomputing Center for providing High Performace Computing resources that have contributed to the research results reported within this paper.
References
 [BBD17] S. Boyd, E. Busseti, S. Diamond, R. N. Kahn, K. Koh, P. Nystrup, and J. Speth. Multiperiod trading via convex optimization. Foundations and Trends in Optimization, 3(1):1–76, 2017.
 [BDTD16] M. Bojarski, D. Del Testa, D. Dworakowski, B. Firner, B. Flepp, P. Goyal, L. D. Jackel, M. Monfort, U. Muller, J. Zhang, X. Zhang, J. Zhao, and K. Zieba. End to end learning for selfdriving cars. arXiv:1604.07316, 2016.
 [Bix10] R. E. Bixby. A brief history of linear and mixedinteger programming computation. Documenta Mathematica, pages 107–121, 2010.
 [BLP18] Y. Bengio, A. Lodi, and A. Prouvost. Machine learning for combinatorial optimization: a methodological tour d’horizon. arXiv:1811.0612, 2018.

[BLZ18]
P. Bonami, A. Lodi, and G. Zarpellon.
Learning a classification of mixedinteger quadratic programming
problems.
In WillemJan van Hoeve, editor,
Integration of Constraint Programming, Artificial Intelligence, and Operations Research
, pages 595–604, Cham, 2018. Springer International Publishing.  [BM99] A. Bemporad and M. Morari. Control of systems integrating logic, dynamics, and constraints. Automatica, 35(3):407–427, 1999.
 [BMDP02] A. Bemporad, M. Morari, V. Dua, and E. N. Pistikopoulos. The explicit linear quadratic regulator for constrained systems. Automatica, 38(1):3–20, 2002.
 [BP08] D. Bertsimas and D. Pachamanova. Robust multiperiod portfolio management in the presence of transaction costs. Computers & Operations Research, 35(1):3 – 17, 2008.
 [BS18] D. Bertsimas and B. Stellato. The Voice of Optimization. arXiv:1812.09991, 2018.
 [BSM17] G. Banjac, B. Stellato, N. Moehle, P. Goulart, A. Bemporad, and S. Boyd. Embedded code generation using the OSQP solver. In IEEE Conference on Decision and Control (CDC), 2017.
 [BT97] D. Bertsimas and J. N. Tsitsiklis. Introduction to Linear Optimization. Athena Scientific, 1997.
 [BV04] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.
 [BW05] D. Bertsimas and R. Weismantel. Optimization Over Integers. Dynamic Ideas Press, 2005.
 [CaPM10] J. P. S Catalão, H.M.I. Pousinho, and V.M.F. Mendes. Scheduling of headdependent cascaded hydro systems: Mixedinteger quadratic programming approach. Energy Conversion and Management, 51(3):524 – 530, 2010.
 [Dav06] T. A. Davis. Direct Methods for Sparse Linear Systems. Society for Industrial and Applied Mathematics, 2006.
 [DB16] S. Diamond and S. Boyd. CVXPY: A Pythonembedded modeling language for convex optimization. Journal of Machine Learning Research, 17(83):1–5, 2016.
 [DCB00] O. Damen, A. Chkeif, and Belfiore. Lattice code decoder for spacetime codes. IEEE Communications Letters, 4(5):161–163, May 2000.
 [DCB13] A. Domahidi, E. Chu, and S. Boyd. ECOS: An SOCP solver for embedded systems. In European Control Conference (ECC), pages 3071–3076, 2013.
 [DDS16] H.un Dai, B. Dai, and L. Song. Discriminative embeddings of latent variable models for structured data. In Proceedings of the 33rd International Conference on International Conference on Machine Learning  Volume 48, ICML’16, pages 2702–2711. JMLR.org, 2016.
 [DKZ17] H. Dai, E. B. Khalil, Y. Zhang, B. Dilkina, and L. Song. Learning combinatorial optimization algorithms over graphs. arXiv:1704.01665, 2017.
 [DTB18] S. Diamond, R. Takapoui, and S. Boyd. A general system for heuristic minimization of convex functions over nonconvex sets. Optimization Methods and Software, 33(1):165–193, 2018.
 [FDM15] D. Frick, A. Domahidi, and M. Morari. Embedded optimization for mixed logical dynamical systems. Computers & Chemical Engineering, 72:21–33, 2015.
 [FGL05] M. Fischetti, F. Glover, and A. Lodi. The feasibility pump. Mathematical Programming, 104(1):91–104, 2005.
 [FKP14] H. J. Ferreau, C. Kirches, A. Potschka, H. G. Bock, and M. Diehl. qpOASES: a parametric activeset algorithm for quadratic programming. Mathematical Programming Computation, 6(4):327–363, 2014.
 [GBC16] I. Goodfellow, Y. Bengio, and A. Courville. Deep Learning. MIT Press, 2016.
 [Goo53] I. J. Good. The population frequencies of species and the estimation of population parameters. Biometrika, 40(3/4):237–264, 1953.
 [Gur16] Gurobi Optimization, Inc. Gurobi optimizer reference manual, 2016.
 [HDG07] F. Herzog, G. Dondi, and H. P. Geering. Stochastic model predictive control and portfolio optimization. International Journal of Theoretical and Applied Finance (IJTAF), 10(02):203–233, 2007.
 [HDY12] G. Hinton, L. Deng, D. Yu, G. E. Dahl, A. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. N. Sainath, and B. Kingsbury. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal Processing Magazine, 29(6):82–97, Nov 2012.
 [JLN10] M. Jünger, T. M. Liebling, D. Naddef, G. L. Nemhauser, W. R. Pulleyblank, G. Reinelt, G. Rinaldi, and L. A. Wolsey. 50 Years of Integer Programming 19582008. Springer Berlin Heidelberg, 2010.
 [KBS16] E. B. Khalil, P. L. Bodic, L. Song, G. Nemhauser, and B. Dilkina. Learning to branch in mixed integer programming. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, AAAI’16, pages 724–731. AAAI Press, 2016.
 [KKK19] M. Klaučo, M. Kalúz, and M. Kvasnica. Machine learningbased warm starting of active set methods in embedded model predictive control. Engineering Applications of Artificial Intelligence, 77:1 – 8, 2019.
 [KSH12] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, page 2012, 2012.
 [LBH15] Y. LeCun, Y. Bengio, and G. Hinton. Deep learning. Nature, 521(7553):436–444, may 2015.
 [Mar52] H. Markowitz. Portfolio selection. The Journal of Finance, 7(1):77–91, 1952.
 [MB12] J. Mattingley and S. Boyd. CVXGEN: A code generator for embedded convex optimization. Optimization and Engineering, 13(1):1–27, 2012.
 [MRN19] S. Misra, L. Roald, and Y. Ng. Learning for constrained optimization: Identifying optimal active constraint sets. arXiv:1802.09639, 2019.
 [NW06] J. Nocedal and S. J. Wright. Numerical Optimization. Springer Series in Operations Research and Financial Engineering. Springer, Berlin, 2006.
 [PGC17] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer. Automatic differentiation in PyTorch. In NIPSW, 2017.
 [PS82] C. C. Paige and M. A. Saunders. LSQR: An algorithm for sparse linear equations and sparse least squares. ACM Trans. Math. Softw., 8(1):43–71, March 1982.
 [Qua19] Quandl. WIKI S&P100 EndOfDay Data, 2019.
 [RKB18] A. Reuther, J. Kepner, C. Byun, S. Samsi, W. Arcand, D. Bestor, B. Bergeron, V. Gadepally, M. Houle, M. Hubbell, M. Jones, A. Klein, L. Milechin, J. Mullen, A. Prout, A. Rosa, C. Yee, and P. Michaleas. Interactive supercomputing on 40,000 cores for machine learning and data analysis. In 2018 IEEE High Performance extreme Computing Conference (HPEC), pages 1–6, Sep. 2018.
 [SB18] R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. A Bradford Book, 2nd edition, 2018.
 [SBG18] B. Stellato, G. Banjac, P. Goulart, A. Bemporad, and S. Boyd. OSQP: An Operator Splitting Solver for Quadratic Programs. arXiv:1711.08013, 2018.
 [SSS17] D. Silver, J. Schrittwieser, K. Simonyan, I. Antonoglou, A. Huang, A. Guez, T. Hubert, L. Baker, M. Lai, A. Bolton, Y. Chen, L. Lillicrap, F. Hui, L. Sifre, G. van den Driessche, T. Graepel, and D. Hassabis. Mastering the game of go without human knowledge. Nature, 550(7676):354–359, oct 2017.
 [YG08] F. You and I. E. Grossmann. Mixedinteger nonlinear programming models and algorithms for largescale supply chain design with stochastic inventory management. Industrial & Engineering Chemistry Research, 47(20):7802–7817, 2008.