1 Introduction
1.1 The Synergy between Optimization and Machine Learning
1.1.1 Optimization for Machine Learning
Many problems in Machine Learning (ML) can be written as optimization problems (bennett2006interplay)
. In ML, a loss function provides a measure of the difference between the value predicted by the ML algorithm and the actual value. It follows then that ML practitioners seek to minimize this loss function, which leads to the use of Optimization for ML. In certain applications, e.g. deep neural networks, the objective function of the learning task is nonconvex, leading to the application of nonconvex optimization methods
(Jain_2017). The problem of selecting the right parameters for an ML model to give the best performance is called hyperparameter tuning. This can be accomplished using heuristic methods such as random search or grid search
(NIPS2011). Nevertheless, this problem also lends itself to more traditional optimization methods such as gradient descent (YANG2020295). In the book (bertsimas2019machine), one of the key underlying ideas is that machine learning is inherently linked to optimization, rather than probability theory. They demonstrate this by using mixedinteger optimization (MIO) techniques to construct optimal decision trees, and show that their MIOconstructed trees outperform classification and regression trees (CART)
(breiman1984classification) (which are constructed using heuristic methods) by up to 7% in some cases. Clearly there is a huge scope for optimization in ML. The paper (sun2019survey) provides a recent survey of optimization methods from an ML perspective.1.1.2 Machine Learning for Optimization
In the optimization field, there is more and more attention for using ML for optimization. Optimization problems are prevalent in all sorts of industries, including logistics, agriculture, energy, and many others. In practice, these problems can be large and difficult to solve. It stands to reason therefore that given the recent advances in ML (e.g. (silver2016mastering)
), ML is increasingly finding a role in combinatorial optimization (CO). There are more and more works exploiting the flexibility and usefulness of ML to aid in the solution of difficult optimization problems. In
(abbasi2020predicting) for example, the authors train ML models on the optimal solutions of a largescale supply chain problem to learn optimal decisions. This allows the ML model can be used on a daily basis, instead of needing to solve the largescale optimization problem. In another example, FISCHETTI2019use ML and optimization to speed up the solution of a wind farm layout optimization problem. They use solutions from already solved problems to train ML models, with the aim of being able to predict the solutions of a new layout problem without lengthy computation times. In addition to speeding up the solution of problems, ML and deep learning have also been used to speed up tree search in
(HOTTUNG2020) for example, to accelerate the BranchandPrice Algorithm in (VACLAVIK2018), and to guide branching in Mixed Integer Programs (MIP) (KHALIL2016). A recently published survey (BENGIO2020) provides an overview of recent attempts to leverage machine learning to solve combinatorial optimization problems.1.2 Motivation for this Study
The examples in the previous paragraph use ML to learn or obtain solutions to optimization problems in a more efficient manner. In our view however, there is even more value in using ML for optimization: ML can be used to learn constraints. Many reallife optimization problems contain one or more constraints or objectives for which there are no explicit formulae. However when there is data available, one can use the data to learn the constraints. Two such examples are given below.
Example 1  The World Food Programme (WFP): This is the problem of providing palatable rations to recipients of the United Nations (UN) WFP. The paper (peters2016nutritious), contains a MIP model for optimizing the food supply chain for the WFP. A key part of this model is the diet model (stigler1945cost; dantzig1990diet), which is concerned with selecting a set of foods to satisfy a set of daily nutritional requirements at minimum cost. Besides providing food, it is of additional benefit to the recipients that the provided rations be palatable. These palatability constraints cannot be manually developed. However, abundant data has been used to learn the palatability constraints (maragno2022constraint), which can then be added to the model.
Example 2  Radiotherapy Optimization: Optimization is used nowadays to optimize treatment plans for thousands of cancer patients every day (eikelder2019adjustable)
. However, e.g., the nonlinear formula for the Tumour Control Probability (TCP) is based on a population average. Together with researchers from Harvard Medical School, we are in the process of using the abundant patient data that is available, and learning the TCP constraint function including patient characteristics. This learned constraint can then be added to the final optimization model.
In addition to using traditional ML approaches to learn constraints for optimization problems (e.g. regression trees and linear regression in
(verwer2017auction) and neural networks in (villarrubia2018artificial)), other techniques such as genetic programming (e.g. in
(pawlak2018synthesis)), mixedinteger linear programming (MILP) (e.g. in
(pawlak2017automatic)) and symbolic regression have also been used. In our literature search, we have noticed several works of this kind from several different application areas and even several scientific fields. However it seems beneficial to have these works collated and discussed in one place. Additionally, we propose to take a more holistic view of the entire constraint learning process, starting from data gathering and processing, to resolution and validation of the final optimization model.This paper is therefore intended to provide both a survey of literature on constraint learning for optimization, as well as offer a framework for effectively facilitating Optimization with Constraint Learning (OCL). The five steps of the proposed framework are: (i) setup of the conceptual optimization model, (ii) data gathering and processing, (iii) selection and training of the predictive model(s), (iv) resolution of the optimization model, and (v) validation/improvement of the optimization model. These steps will be discussed in more detail throughout this paper. We use the term “Constraint Learning” to make clear that we are interested in cases in which most of the optimization model is known, but some of the constraints are difficult to model explicitly. This is to differentiate from other terms such as “model learning” (donti2017task) and “model seeking” (beldiceanu2016modelseeker) seen in the literature. We should also note that there are several approaches in the literature which are similar to constraint learning but do not fit in this survey for several reasons. In Constraint Programming (CP) for example, the process of learning CP models from data (positive and/or negative examples) has been widely studied in papers such as (bessiere2004leveraging; o2010automated; beldiceanu2016modelseeker; kolb2016learning; galassi2018model)
. These are not included as we are interested in cases where predictive models are embedded as constraints in mathematical optimization. Additionally, papers such as
(bertsimas2020predictive; elmachtoub2021smart; demirovic2019predict+) which focus on predicting problem parameters, or use predictive models to approximate known objective functions and constraints (such as (villarrubia2018artificial)) are excluded. We also exclude papers such as (nascimento2000neural; nagata2003optimization; venzke2020learning; venzke2020verification) where the structure of the problem is known, but predictive models are used to replace the problem in order to speed up computation. Similarly, papers like shang2017data; han2020multiple which use predictive models to learn the uncertainty set for robust optimization problems are not considered. In gilan2015sustainable; bagloee2018hybrid; say2018planning; say2020theoretical, although there is an interplay between ML and optimization in some way, these papers are out of the scope of this survey as they do not focus on constraint learning.While lombardi2018boosting and kotary2021end provide short surveys on this subject, and we present a more extensive framework for constraint learning and consider a wider range of the literature. Each paper identified in our literature search will be discussed in light of the proposed framework, and gaps in knowledge and scope for future lines of research will be identified. Running examples will be used to illustrate the steps of the framework. It is envisaged that in addition to providing a review as well as the framework, our work will bring together ideas from diverse fields.
1.3 Outline
This framework and survey is structured as follows. In Section 2, we first describe the running examples that will be used to illustrate our framework. Then in Section 3, we describe this framework and review the recent literature on light of it in Section 4. Areas for further research are presented in Section 5 with general conclusions given in Section 6.
2 Running Examples
In order to illustrate the concepts presented in later sections, we first present two practical optimization problems in which constraint learning plays an important role. These examples have been mentioned in Section 1.2 but are now given in more detail.
2.1 Palatability Considerations in WFP Food Baskets
This first example deals with the issue of providing palatable rations in the food baskets delivered by the WFP to population groups in need. These food baskets are designed to meet the nutritional requirements of a population, and take into account existing levels of malnutrition and disease, climatic conditions, activity levels, and so on
(wfpwebsite). In addition to considering the nutritional content, we would also like to be able to consider the palatability of the rations. This problem is particularly relevant in that the palatability of a food basket can change drastically between different groups of beneficiaries. Since the palatability constraint is difficult to model manually, and a data set with different food baskets and their palatability scores is available, a predictive model can be used to derive the palatability constraint from the data, which can then be embedded into the original food basket optimization problem.Let be the set of commodities (such as rice, flour, chickpeas, sugar, etc.) which may be included in a food basket, and be the set of nutrients (such as vitamins, minerals, proteins, etc.) which are important for maintaining good health. If the nutritional requirement for nutrient (in grams/person/day) is denoted by , and the nutritional value for nutrient l (per grams of commodity ) is denoted by , then a simple model for food basket optimization problem can be given as
(1a)  
s.t.  (1b)  
(1c)  
(1d) 
where
is the vector of commodities,
is the amount of each commodity (measured in grams) included in the food ration, is the vector of costs associated with each commodity (in $/gram), and the objective function (1a) is a minimization of the ration cost. Constraint (1b) is used to ensure a minimum level of nutritional requirements in the rations. Constraint (1c) is the embedding of the predictive model used to ensure that the palatability of the ration is at least equal to . The diet problem in (1a) to (1d) is only a part of a much bigger model that also includes the sourcing, facility location and transportation problems. The constraint to be learned here is a function of relatively few variables ( in this case), compared with the thousands of variables in the full model.The predictive model is trained on a sparse data set in composed of samples, with each sample composed of features and one label. Each sample is a different food basket where the features are the amount of each commodity that make up the food basket, and the label corresponds to the palatability score associated with each specific food basket. This label can either be numeric or categorical, resulting in being either a regression or classification model respectively.
2.2 Radiotherapy Optimization
The second example problem we will use to illustrate the framework is radiotherapy (RT) optimization. In RT, ionizing radiation beams are used to control tumor growth by damaging cancerous tissue. Healthy, noncancerous tissue (also known as organsatrisk (OAR)) are however present in close proximity to the tumors, and consequently incur unavoidable damage. The tumour damage is quantified as the Tumor Control Probability (TCP), while the unavoidable damage to the OARs is quantified using the Normal Tissue Complications Probability (NTCP). The final optimization problem is to maximize the TCP subject to a set of constraints ensuring that the NTCP does not exceed a specified level.
The radiation beams are delivered from a radiation source such as a linear accelerator or a cyclotron, and directed towards the tumor cells from multiple angles using a rotating gantry. The angles of delivery are predetermined by clinicians and at each given angle, the radiation beam is shaped by passing it through a multileaf collimator (MLC) (Figure 1a). The MLC adjusts the shape of the radiation beam so that it generally resembles the shape of the tumor (Figure 1b).
To facilitate the treatment planning process, the target volumes (both tumors and OARs) are discretized into 3dimensional volumetric pixels (or voxels), while each radiation beam is discretized into smaller subunits called beamlets. Let be the total number of available voxels, and be the dimensional vector of beamlet intensities (also known as a fluence map). Then, the dose deposition matrix is an dimensional matrix, where each value determines the unit intensity of dose delivered to voxel from beamlet . The physical radiation dose delivered to each voxel is an dimensional vector, and is the product of and . A schematic diagram of the radiation delivery is shown in Figure 2.
If is the set of OARs (e.g. stomach, liver, spinal cord) which need to be protected, then the optimization problem can be written as
(2a)  
s.t.  (2b)  
(2c)  
(2d)  
(2e) 
where is a vector containing the biological characteristics of the tumor and the OARs, as well as other patientspecific information. is the maximum allowable damage to OAR while equation (2c) is a placeholder for other constraints on the dose distribution (e.g. minimum dose, maximum dose, etc.) and is usually convex or can be converted to a convex form. The decision variables and are the set of the radiation intensities and the doses to be delivered over the course of the treatment respectively. The problem size depends on the number of beamlets, constraints, as well as distinct OARs considered in the treatment planning. A more detailed treatment of intensitymodulated radiation therapy optimization can be found in (hoffmann2008convex).
In the current literature, explicit expressions are used for the TCP and NTCP functions (GAY2007115)
. These however are surrogate functions based on populationwide estimates, and as such the response of different patients to the same treatment can be quite varied
(SCOTT2020). Thus, while some patients experience complete remission with minimal damage to OARs, others suffer from radiation poisoning or tumor recurrence. It has been shown in (TUCKER2013251) that the inclusion of patientspecific characteristics (such as genetic predispositions) can improve the prediction of patients’ response to RT treatment. We therefore intend to use data to learn (2a) and (2b), leading to the creation of more effective treatment plans for patients.The above two examples have illustrated the potential benefits of OCL. In the next section, we will describe the general framework for OCL, as well as show how to apply the steps of the proposed framework to optimization problems.
3 General Description of Framework
The proposed framework is the result of a careful analysis of many approaches adopted in the literature covered by this survey. It is the formalization of a procedure that (partially) recurs in most of the approaches for learning constraints. We believe that a complete view of all the steps that characterize the framework will help future research see the big picture of constraint learning, and the impact that each step has on the final performance (in a loose sense). These steps are enumerated below:

Setup of the conceptual optimization model

Data gathering and preprocessing

Selection and training of predictive models

Resolution of the optimization model

Verification and improvement of the optimization model.
Although the steps of the framework are given in sequential order, it should be noted that some of these steps may be repeated in nonsequential order, i.e. internal loops are possible. This may be necessary in order to improve the performance of the optimized model.
3.1 Step 1: Setup of the Conceptual Optimization Model
A graphical overview of this first step is given in Figure 3. It consists of five substeps, of which the first three are also commonly performed in traditional operations research (OR) modeling. This initial stage is usually done in collaboration with domain experts. This step begins with (i) defining the decision variables and parameters involved, and (ii) defining those constraints that are easy to model manually.
Mathematically, let be the vector of decision variables for an optimization problem which may be written as
(3a)  
s.t.  (3b)  
(3c) 
where the objective function is being minimized (and may also be maximized without loss of generality), are known constraints of the problem, and is a bound constraint on . The constraints are welldefined and represent the case in which a process or behaviour can be modelled manually.
In certain cases however, some constraints are difficult to model manually or to state in concise mathematical terms. A traditional approach would suggest approximating these complex constraints with other simpler constraints, but this results in a loss of model accuracy. By using constraint learning methods however, predictive models can be used to design datacentric constraints.
Let represent a quantity of interest which is difficult to model. If is a function of the decision variables , then a predictive model can be used to learn :
(4) 
This learned function can then be embedded as a constraint into the original problem. The palatability constraint of Section 2.1 is an example of a quantity of interest which is difficult to model. could be the palatability, and is a function of the amount of each commodity in the food basket. It should be noted that, depending on the predictive model learned, it might not be possible for optimization solvers to directly handle the learned models. is therefore more likely to be an encoding of the predictive model, and not the actual model. If are constraints on , then (3a)  (3c) become
(5a)  
s.t.  (5b)  
(5c)  
(5d)  
(5e) 
It should be noted that some part of the known constraint might also be a function of , i.e. could be , however we will continue with the definition where the known and unknown constraints are written separately. Similarly, some part of the objective function might also depend on the quantity to be learned (i.e. ). The objective function can also be learned, but we will also continue with the definition where is known. We would also like to make a subtle distinction between constraint learning and constraint function learning. In the former case, we learn a constraint using its feasibility threshold during the model training process. In the latter case, we learn a continuous predictive function without considering the feasibility threshold, and we add this threshold as a constraint in the optimization model. In either case, the above formulation holds.
One of the factors that affects the performance of a predictive model is the quality of the features used to find patterns in the dataset. In (5c), the features used in the predictive model correspond to the vector of decision variables (or a subset of ). This is not always the best set of features. In fact, the predictive model’s performance can be often improved with the addition of other features that are not part of the decisionmaking process, but convey additional information about the problem. For example in Section 2.2, although the decision variables are radiation intensities , additional information is conveyed by the vector which contains patientspecific information. The inclusion of these variables helps to improve the predictive model’s performance, and therefore the performance of the optimization model. Replacing (5c) with gives
(6a)  
s.t.  (6b)  
(6c)  
(6d)  
(6e) 
Although the ’s are known in the radiotherapy optimization example (e.g. patientspecific information such as gene expressions), in certain cases they may not be known and as such, we may have to optimize for the average or worst case. It may also be possible to treat any uncertainties in the ’s using a robust or stochastic optimization approach.
The vector can also be exploited to build more complex features using a process known as feature engineering. The idea behind this process is to use effective combinations of decision variables as new features to improve the performance of the predictive model, reduce the number of features used in the learning process, or embed prior knowledge into the predictive model. These new features can be expressed using auxiliary constraints , where is the specific function used to extract the feature :
(7a)  
s.t.  (7b)  
(7c)  
(7d)  
(7e)  
(7f) 
The palatability constraint in Section 2.1 is an example of such a case. It is not simply a function of the decision variables, but more accurately a function of aggregates of . These variables represent the amount of food per macro category (Cereals & Grains, Pulses & Vegetables, Oils & Fats, Mixed & Blended foods, Meat & Fish & Dairy), so that (1c) becomes .
Model (7a)  (7f) is a broad formulation of all the fundamental aspects that characterize optimization with constraint learning. is the vector of decision variables, are constraints which are easy to model, and is the difficult part of the model to design manually. It is the output of a predictive model which takes as input decision variables , nondecision variables , and , which is the vector of variables designed as a combination of the ’s.
3.1.1 Trust Regions
The performance of a predictive model worsens when it is evaluated on data that is much different from the data used to train it. During the optimization process, solutions far from what has been seen in the training and test sets could be evaluated. In this case, the predictive performance could differ much from our expectations. In order to address this issue, we can restrict the optimization to a region that is densely populated by samples that have been used to train and test the predictive model. This is known as a Trust Region (TR), and we assume that within this region, the predictive performance will be consistent with what we have seen during the model evaluation. The concept of a trust region is used in (biggs2018optimizing) and is thoroughly explained in (maragno2022constraint) where the authors constrain the optimal solution to be within the convex hull of all samples (training and test samples regardless of their label). This prevents the optimization process from evaluating solutions very different from what has been seen during the training and evaluation of the model. A larger TR might lead to a better solution in terms of objective function value, but also to less confidence in the performance score obtained in the predictive model evaluation.
3.2 Step 2: Data Gathering and Preprocessing
Once the conceptual optimization model has been formulated, the next step is to focus on collecting and preprocessing the data which will be used to learn the constraints (7c). As data gathering and processing is a huge subject in its own right, we will only discuss those aspects that are important in the context of constraint learning. Unlike the first step of the proposed framework, there is no order to the different substeps here; we simply highlight the different characteristics of the data, and how best to exploit them for our purposes. A graphical overview of this second step is given in Figure 4.
These characteristics inform the tasks to be performed, and have a direct impact not only on the predictive model, but also on the final optimization model. For example, the amount of data available might influence the choice of the ML method used to learn the constraint, which may then have an influence on the approach used to solve the final optimization model.
Understanding the data required to train and test predictive models is an activity that requires the support of domain experts. The data may already be available or may need to be generated or collected. In the second case, the best way to generate or collect and store data must be determined, and the data may either be obtained from realworld processes, or obtained from simulations of the processes. Whether or not the data is real or simulated depends on the problem under consideration, and the availability of resources (in terms of time and money). If the right data are available, then they can be used directly (with or without preprocessing) to learn the constraints (7c). If the data are not yet available, decisions have to be made on how to collect the data. Data collection or generation strategies should follow established principles such as Design of Experiments (DoE) techniques (which are applicable for physical experiments) or Design of Computer Experiments (DoCE) techniques (applicable for computational experiments). In the WFP application of Section 2.1 for example, collecting data on the palatability of rations during an epidemic or in a war zone might not be possible. Generation or simulation of data following DoCE principles is useful in such a situation. A recent review of DoCE is given in (GARUD201771).
Data collection or generation can take place either in a single run or in multiple runs. Performing data collection multiple times can make the learning and optimization process more accurate. For example if the simulation of a process from which data is obtained takes several hours, then the choice of input parameters to the simulator might need to be carefully considered. Adaptive DoCE techniques such as those in (CROMBECQ2011683) and (xu2014robust) may be used in such cases. In Derivative Free Optimization (DFO) only a few or even one new data point is collected in each iteration to minimize the number of data points needed to obtain an optimal solution. We refer to Conn2009IntroductionTD for a detailed treatment on DFO.
In addition to the availability or sourcing of the data, the size of the dataset also informs the choice of the learning approach. It is known that deep neural networks are able to exploit large datasets better than traditional algorithms such as logistic regression or support vector machine
(NgDeepLearn2020). In contrast, methods such as kriging (matheron1963principles) can be used with smaller datasets.Regardless of the size of the datasets, most data (especially those not generated with the use of simulators) need to be preprocessed before they can be used in the next step (learning). Many preprocessing activities have been widely discussed in ML literature but common preprocessing steps include imputation, normalization or scaling, binarization and binning
(scikitlearn). Imputation is the process where missing values in a dataset are replaced by estimates, while normalisation or scaling is where the values of the features are scaled to an interval (usually or ) to avoid numerical illconditioning (santner2003design). Binarization is the process of converting data to binary values based on whether they are above a threshold or not, and binning involves creating new features as discretized versions of existing continuous variables. Whatever data collection, generation and preprocessing approaches are used, the end result of this second step is a dataset that is ready to be used for training, testing and validating the predictive algorithms. A review of the various data generation and preprocessing approaches used in the context of constraint learning is given in Section 4.2.3.3 Step 3: Selection and Training of Predictive Model(s)
Once the data gathering and preprocessing activities have been completed, it is necessary to select, train, validate, and test the predictive model that will subsequently be used to represent one or more constraints in the final optimization model. Selection and training are two activities that go hand in hand. It is not possible to select the best predictive model without first training and testing it on the available dataset. Correspondingly, poor performance of a model after training might require a different predictive model to be selected. A graphical overview of this third step is shown in Figure 5.
The selection of the predictive model is based on six main factors:

Classification or regression: The choice of predictive model will depend on whether the constraint to be learned results in a discrete set of values (classification), or a continuous spectrum of values (regression). This has consequences on the format of the predictive model output, thus limiting the choice to a restricted set of models. In constraint learning, the use of a classification model results in a feasible set being learned, while the use of a regression model results in a function being learned. The palatability score function in (1c
) can for example, be a regression model, where the output is a continuous value giving a measure of the diet’s palatability. It can also be a classification function where the output is a binary variable stating whether the diet is palatable or not.

Computational complexity: The choice of predictive model directly affects the computational complexity of the final optimization model. In contrast to a linear regression model, deep learning models require several mathematical or logical statements to represent them and undoubtedly increase the complexity of the final optimization problem. This increased complexity may compromise the optimality of the final solution, and a tradeoff may be required. For example, fewer layers may be used, or the number of neurons in a layer may be small. A decision might also be made to select a predictive model so that the complexity of the final optimization problem does not increase. In the WFP example, the underlying problem is linear so a linear predictive model may be chosen to learn
. 
Model performance: In addition to computational complexity, the predictive model has to perform adequately with respect to some performance measure (e.g. accuracy or mean squared error). A simple but inaccurate model can have worse consequences than a complex but accurate model. The performance should be evaluated on both a test set and in the final optimization model. Predictive models that perform well on the test set may perform poorly on the “extreme” solutions evaluated by the solver during the optimization process, and so this means that the performance evaluation has to be performed both in Step 3 and Step 5.

Confidence intervals: If the predictive models used provide confidence intervals, it might be possible to use this information in approaches such as stochastic optimization or robust optimization.

Data quality: Despite the attention focused on the dataset in Step 2, it may be necessary to choose a predictive algorithm that is robust to noise or missing data. This is especially the case if the data to be used has been provided by a thirdparty.

Interpretability: The selection of a predictive model might also depend on whether or not the practitioner requires the model to be interpretable. In the radiotherapy optimization problem of Section 2.2, clinicians may prefer highly explainable predictive models to be used to learn the and functions due to the seriousness of the application. There are however pros and cons to consider. Decision trees, for example, are known to be highly explainable but may require the linearization of several disjunctions to encode them in the optimization problem.
A review of the various predictive models used for constraint learning in optimization is given in Section 4.3.
The implementation process is a cycle of training the predictive model, evaluating the model’s performance, and improving the model in terms of the data, model parameters, and even the model used. If a feedforward neural network is used for example, parameters such as the number of layers, the number of nodes per layer, the activation functions, whether or not to use dropout regularization, etc., need to be determined. Outofsample testing should be done to ensure adequate performance of the model, with changes made to improve performance if necessary.
In the context of constraint learning, under the same computational complexity, the best predictive model is the one with higher performance in the final optimization model. The goal should be to generate prediction models that aim to minimize decision error, and not just prediction error.
3.4 Step 4: Resolution of the Optimization Model
Once the predictive model has been used to learn the difficulttomodel constraint, it will need to be embedded into the rest of the optimization problem. Figure 6 gives a graphical overview of this step. Recall that the predictive model in (7c) may not be easy to add as a constraint, and may need to be encoded in a such a way that it can be handled by commercial or conventional solvers. This however depends on the type of model used, as a linear or quadratic function can be embedded more easily than a neural network for example. The integration of the learned constraints is often not trivial, and may result in a significant reformulation of the model, while trying not to affect its predictive performance. If possible, the predictive model may be incorporated either by providing gradients directly to the solver, or by reformulating it such that it becomes linear, convex, conic, etc., and can more readily be incorporated into the optimization problem. If desired, the optimization models may be modified to take into account any uncertainty measures provided by the predictive models. A deeper look at the various methods used in the literature to encode predictive models is given in Section 4.4.
In addition to encoding the learned constraints in a manner that the solver can exploit, a successful embedding can also include providing additional information (such as bounds) to the solver of choice in order to facilitate a more efficient solution process. Not only is it necessary to encode the learned constraints in a manner that the solver can handle, a successful embedding should also make it possible to take full advantage of the optimization solver’s capabilities.
3.5 Step 5: Verification and Improvement of the Optimization Model
In this final step illustrated in Figure 7
, we evaluate the performance of the constraint learning process with respective to three things: the final optimization model, the predictive model, and the data. For the optimization model, the focus is on both the goodness of the solution (i.e. the objective function value), as well as the time taken to find this solution. For the performance of the predictive model, the focus is on ensuring that the learned constraints accurately represent the difficulttomodel constraints. If it is possible to generate data, then using different data to improve the predictive model can in turn help to improve the optimal solution. One way of doing this is via a process called active learning. Here, the dataset is continually modified by adding new solutions (output of the optimization model) to it, which is then used to retrain the predictive model. A review of the approaches to used to improve verify and improve models with learned constraints is provided in Section
4.5.While the above framework steps have been given in sequential order, it should be reiterated that some of these steps may be repeated. For example, if it is seen that in step 5 that the learned constraint is a poor representation of the reality, we may need to go back to step 1. Similarly, the incorporation of new data will mean that the predictive model needs to be retrained and reembedded, and so on. Internal loops are also possible — it may be necessary to iterate between steps 2 and 3 until an acceptable predictive model is developed, or between steps 4 and 5 until satisfactory results are achieved.
4 Review
4.1 Setup of the Conceptual Model
In the literature, only a few papers have attempted to formalize the process of optimization with constraint learning. The papers by lombardi2017empirical and maragno2022constraint provide the most detailed approaches at formalising this process. There is also a clear structure provided in (de2003integrating) for learning unknown constraints. The methodology in lombardi2017empirical, called Empirical Model Learning (EML), is described as a way of obtaining components of a prescriptive model from machine learning. Similar to the steps given in Figure 3
, they define a conceptual model including known constraints, constraints to be learned, as well as the definition of constraints used for feature extraction.
maragno2022constraint also provide a general framework for OCL that can be applied every time data is available. A key focus of this approach is to learn constraints and embed them in such a way that the computational complexity of the final model is not increased. padmanabhan2003use and deng2018coalescing also conceptually discuss the use of data mining to define objectives and constraints.While other papers which use a predictive element to learn constraints do not have as detailed a setup, they incorporate some of the elements of Step 1 in their approach. Some authors (e.g. (verwer2017auction)) explicitly define constraints to perform feature extraction, while most of the papers reviewed simply identify which constraints are difficult to model manually and replace them with predictive models. This will be seen later in Sections 4.3 and 4.4.
The approach taken to formalize the constraint learning process in pawlak2017automatic is different in that their framework is a MILP. Similarly, other authors also leverage MILP (schede2019learning), Integer Programming (IP) (cozad2014learning), or LP (sroka2018one) frameworks for constraint learning. A number of authors (pawlak2017synthesis; pawlak2018synthesis; pawlak2019synthesis) also use a genetic programming framework to obtain constraints from data. In these different setups, the authors give the relative simplicity of their approaches as their key justification — no predictive models have to be learned, no constraints for feature extraction have to be included, and no complex embedding or reformulating procedures need to be followed. The disadvantage of these approaches however is that they don’t make use of the learning power of predictive models which have been shown to be able to learn very complex decision boundaries well.
A key point to highlight is that the use of constraint learning in operations research is a relatively new development. Consequently, not many papers are seen to have a conceptually structured approach to doing optimization with constraint learning. Although several papers thoroughly describe their methodology of embedding a ML model into an optimization problem, they don’t formalise the process of constraint learning from beginning to end. We believe that the formalization of this process is one of the important contributions of this paper to the literature.
4.2 Data Gathering and Preprocessing
Of the papers considered in this survey, the authors of (thams2017data) have the most detailed approach with regards to the data generation and preprocessing phase detailed in Section 3.2. Their procedure, also used in (halilbavsic2018data; venzke2020neural)
, is designed to either make use of available data from sensors, or generate it using a simulator. Possible operating points of the system under consideration are fed into a simulator, and the output of the simulator is analysed and classified as either
stable or unstable. Theses operating points along with their classifications are then stored in a database. While the data is only generated once, they attempt to keep the generated database as small as possible for computational reasons, while collecting enough data to train their predictive models. They do this by focusing on generating points close to the boundary of their twoclass problem, while neglecting operating points far away from the boundary. In this way they balance exploration and exploitation when generating data. In addition to this, class imbalance in the dataset is addressed using several strategies, and feature selection is done so that only easily accessible variables are considered.
With regards to following DoCE principles, the authors of (lombardi2017empirical) use factorial design to generate multiple training sets from a simulator. Data generation here, although time consuming, only needs to be done once. The data is also normalized so that the input features fall within a certain range. The approach developed in (cozad2014learning) also takes DoCE principles into account by using either LHS or twolevel factorial design. They also use adaptive sampling to iteratively refine the models generated and provide discussions on whether to use explorationbased or exploitationbased sampling. de2003integrating
use small data collected using onfield survey (an expensive process) to find a probability distribution that is later used in a simulator that generates enough data to train the predictive model. In
(fahmi2012process)where several constraints are replaced using neural networks, the data is obtained from multiple simulators. Some transformation is done to the data to ensure that the distribution of the dependent variable is as close as possible to a uniform distribution. The authors here note that a large amount of data needed to train their neural networks for higher accuracy. Data may also be collated from multiple available sources as in the case of
bertsimas2016analytics.In terms of the sizes of the datasets surveyed, it can be seen that not many of the problems considered have large datasets. Data containing or fewer examples are common. In (pawlak2017automatic), the total amount of data generated using uniform sampling varied from to examples. Similarly, schweidtmann2019deterministic use data with or fewer examples. Data of a similar size is also seen in (kudla2018one), (yang2021optimization) and several others. For one of the test problems in (schweidtmann2019deterministic), only examples are used. On the larger side, chen2020input use data containing instances, venzke2020neural use data points generated using LHS, while the data used in (say2017nonlinear) contains data points. For certain constraint learning approaches, the availability of a large amount of data is a must. The local search approach used in (sroka2018one) is such an example, where the test set contains examples generated via uniform sampling.
Some approaches for constraint learning necessitate the significant preprocessing of the data. In order to learn ellipsoidal constraints in (pawlak2021ellipsoidal)
, the uniformly generated data first needs to be clustered and standardized, before Principal Component Analysis (PCA) is applied to the data. This clustering and PCA procedure shapes the data so that it can be represented using an ellipsoidal constraints. In other cases, although preprocessing is done, it is not crucial to the constraint learning approach (e.g. in
(biggs2018optimizing) and (pawlak2019synthesis)).To summarize, it is clear from the literature that with the exception of a few authors, not many papers have put much emphasis into the data generation and preprocessing phase. This is sometimes because the data is available, is neither big nor dirty, and has no need for preprocessing. For example, the situation of having highdimensional data, where the number of examples is much less than the number of features, is only explicitly mentioned in
(mivsic2020optimization). In other cases however, the data step is an afterthought. Even when data is generated, this is cursorily mentioned without giving details or following exhaustive DoCE principles. We believe that our framework will help to provoke more thought on this phase of the constraint learning process.4.3 Selection and Training of ML models
Neural  Decision  Random  Other  Support Vector  Clustering  (M)ILP  Other  
Networks  Trees  Forest  Ensemble  Machines  
bergman2019janos  x  x  
biggs2018optimizing  x  x  
chen2020input  x  
chi2007modeling  x  
cozad2014learning  x  x  
cremer2018data  x  x  
de2003integrating  x  
fahmi2012process  x  x  
garg2018kernel  x  
grimstad2019relu  x  
gutierrez2010neural  x  
halilbavsic2018data  x  
jalali2019designing  x  
kudla2018one  x  
lombardi2017empirical  x  x  
maragno2022constraint  x  x  x  x  x  x  x  
mivsic2020optimization  x  x  
paulus2021comboptnet  x  
pawlak2017automatic  x  
pawlak2017synthesis  x  
pawlak2018synthesis  x  
pawlak2019synthesis  x  
pawlak2021ellipsoidal  x  x  
pawlak2021grammatical  x  x  
prat2020learning  x  
say2017nonlinear  x  
schede2019learning  x  x  
schweidtmann2019deterministic  x  
spyros2020decision  x  x  
sroka2018one  x  x  
thams2017data  x  
venzke2020neural  x  
verwer2017auction  x  x  
xavier2021learning  x  
yang2021optimization  x 
Before we begin this section, we would like to note that papers which use regression functions as the main constraint learning methods have been excluded from Table 1, as they are too numerous to be included in this paper. Regression with first and secondorder polynomials is used a lot in engineering, especially in combination with DoE and DoCE. We refer to kleijnen2015 for a detailed treatment. Kriging models are also used a lot in engineering, especially when the objective function values can be obtained by (expensive) computer simulations. The big advantage of kriging is that it has very high predictive power even for relatively small number of data points. The disadvantages of kriging are (i) it costs much computing time to obtain the kriging model (ii) the data points have to be spacefilling (iii) embedding kriging functions in the optimization model makes it nonconvex and hard to solve. See forrester2008 for a detailed treatment on kriging.
It can be seen from Table 1 that ANNs and DTs are the most popularly used predictive models in constraint learning (with the exception of regression models). This could be because these methods have been shown to have high representative power, can be used for both regression or classification, and generally perform well. Training, testing and evaluation is generally done as prescribed in the ML literature, and the methods are generally evaluated using metrics such as the Mean Square Error (MSE), accuracy, and so on. In terms of size of the ANNs, with the exception of yang2021optimization who use a network with six hidden layers, most networks used are generally small. One reason for this could be to reduce the burden of embedding large networks as constraints. Consider the fully connected neural network shown in Figure 8.
The parameter represents the weight of input to neuron in layer , while represents the bias of neuron in layer . The activation (or output) of a neuron in layer is given generally as
(8) 
where is the activation function used in layer and is the number of neurons in layer .
The activation functions used in the hidden layers in the constraint learning literature are usually either the Rectified Linear Unit (ReLU) (e.g. in
(say2017nonlinear)), the sigmoid function (e.g. in
(chen2020input)), (some form of) the hyperbolic tangent function (e.g. in (fahmi2012process)), or the softplus activation function (e.g. in (chen2020input)). These are shown in Figure 9. The output layer activation function is usually an identity linear function.As the input of a current layer is the output of a previous one, using multiple layers results in nested activation functions in the final model. This could lead to issues regarding convexity and/or differentiability, therefore smaller networks are used. Another reason for small neural network sizes is that using larger networks are of no benefit in some cases. lombardi2017empirical for example use a network with one hidden layer (with two neurons) as increasing the network size did not improve the solution quality of the optimization problem. To view the effect of network size on computation times, schweidtmann2019deterministic varied both the number of neurons, and the number of layers. Results showed that the solution time increases approximately exponentially with the number of layers used to learn the constraints. Deep neural networks required more processing time than shallow neural networks for the same number of neurons.
An alternative to using small networks could be to use sparsification. venzke2020neural use a network with three hidden layers and fifty neurons in each layer to learn constraints, however they reduce the size of the network by enforcing of the weights to be zero. This sparsification procedure ensures that the network is not too large and also avoids overfitting during training. say2017nonlinear use a similar technique to remove connections to and from neurons with very small weights in order to strengthen the MILP formulation. Another alternative could be to ensure that the constraint learning procedure results in a convex optimization problems. chen2020input and yang2021optimization use Input Convex Neural Networks (ICNN) (amos2017input) as their predictive models. An ICNN requires that all weights in the network are nonnegative, and that all activation functions are convex and nondecreasing. Although these networks require more training iterations and have lower representation power than conventional networks, their use guarantees that the final problem is a convex optimization problem. This allows the authors to balance the tradeoff between model accuracy and computational tractability.
In addition to ANNs, DTs are also popularly used in constraint learning. DTs are able to capture nonconvex and discontinuous feasibility spaces, and provide these rules in an easily understandable format. Take for example, the decision tree in Figure 10 used to determine whether to go hiking or not.
It can easily be seen that if the outlook is sunny and the humidity is greater than 65, then we should stay indoors. For this reason, thams2017data and halilbavsic2018data specifically choose decision trees as their method of choice to learn constraints in an optimal power flow problem. Their rationale is that as electricity is a crucial utility, using easily interpretable rules in their analysis can lower the barrier to power plant operators adopting their methods. This interpretability requirement could also favour the use of decision trees in the radiotherapy example of Section 2.2.
As random forests are a natural extension to using decision trees, they have also been used in constraint learning.
biggs2018optimizing learned objective functions using random forest while mivsic2020optimization optimized tree ensembles. AdaBoost (freund1997decision) (with DTs as base learners) is also used in (cremer2018data) to learn a probabilistic description of a constraint.Support Vector Machines (SVM) can be expressed very clearly as optimization problems, and as a result, there is a significant amount of optimization literature on SVMs. Most of the literature however deal with subjects such as optimal feature selection (e.g. (JIMENEZCORDERO202124)), or mathematical formulation and solution of the SVM problem (e.g. (BALDOMERONARANJO202084)), or improving predictions (e.g. (YAO2017679)). There are however a few cases where SVM is used in the constraint learning process. chi2007modeling uses SVM for regression with a 2nd order polynomial kernel function. The resulting learned constraint is of the form shown in (9) and is a linear combination of linear, quadratic and twoway interaction terms.
(9) 
A feature selection process is further used to remove insignificant terms and improve model accuracy. xavier2021learning also use SVM with linear kernels to learn hyperplanes within which the solution of the optimization problem is very likely to exist.
Clustering is also used in constraint learning, although it is often used in addition to other methods. pawlak2021ellipsoidal combined clustering with PCA and ellipsoidal constraints to get an MIQCP, while sroka2018one combine clustering with local search to search the clusters for LP constraints that represent each cluster.
Combinations of predictive models are often used. verwer2017auction combine regression trees and regression models for learning revenue constraints for an auction optimization problem. In addition to using ANNs to learn most of the constraints required, fahmi2012process also use nonlinear regression models for specific constraints. paulus2021comboptnet use both integer programming and neural networks to learn both the cost terms and the constraints for integer programs. Although schede2019learning use an LP formulation to learn polytopes that cover feasible examples, they also include a decision tree heuristic to model the importance of examples to be considered. Additionally, maragno2022constraint are able to use different predictive models to learn different constraints of the same optimization problem in order to achieve a better performance.
There are also approaches which don’t use predictive models to learn constraints. pawlak2017synthesis; pawlak2018synthesis and pawlak2019synthesis use genetic programming as their approach of choice, pawlak2017automatic use a mixedinteger formulation and cozad2014learning combine simple basis functions using a MILP formulation to learn constraints.
Although RF is an ensemble method, the column “Other Ensemble” in the Table 1 is separated from the “Random Forest” column to highlight the fact that there is an opportunity to use other ensemble methods where different individual base learners can be used and their predictors combined. There is also the opportunity to try a wider variety of predictive models such as symbolic regression or Bayesian methods, especially if they are more easily represented using mathematical functions. Studies could be done to see if certain families of models perform better for certain types of problems. Finally, methods that provide some measure of uncertainty (such as confidence intervals or conditional probabilities) might also be chosen in order to be able to incorporate this uncertainty into the final mathematical model.
4.4 Resolution of the Optimization Model
With regression functions, the process of embedding them as constraints is relatively straightforward as the functions are simply a weighted sum of the decision variables (bertsimas2016analytics; verwer2017auction). With other predictive models however, the embedding process is more involved. With ANNs for example, their embedding is also dependent on the activation functions used. In order to incorporate a trained neural network with ReLU activation functions as a constraint, the function can be replaced by introducing binary variables for each neuron in the network (e.g. in (venzke2020neural)). If the input to a neuron , then and (10a) and (10b) force the neuron output to the input . On the other hand, if the input to the neuron then , and (10c) and (10d) force the output to . The resulting MILP can then be solved with offtheshelf solvers such as Gurobi or CPLEX.
(10a)  
(10b)  
(10c)  
(10d)  
(10e) 
The disadvantages of this approach are twofold. Firstly, large networks require a correspondingly large number of binary variables. Secondly, the bounds and have to be chosen to be as tight as possible for the MILP solver. anderson2020strong present a MIP formulation that produces strong relaxations for ReLUs. They start with the bigM formulation and then add strengthening inequalities as needed. say2017nonlinear also encode ReLU into MILP using bigM constraints, but include reformulations to strengthen the bounds. They also use sparsification to remove neurons with very small weights, as constraints with very small coefficients can be difficult to solve for commercial solvers (d2016towards).
The embedding of trained ANNs as constraints is not always as complicated as mentioned above. Equation (8) can be used directly as a constraint, as long as the solver can handle the activation function . This approach is used in (fahmi2012process) with the hyperbolic tangent activation function, as well as in (gutierrez2010neural) where the activation function is differentiable so that the constraint can be easily handled by conventional solvers. It is nevertheless noted in (lombardi2017empirical) that several commercial solvers rely on convexity for providing globally optimal results, and that this approach may result in locally optimum solutions depending on the used. In (chen2020input), the use of ICNN guarantees that the optimization problem is convex and will converge to a globally optimal solution. They do however speed up this process by using a smoothened version of the ReLU, called the Softplus activation function (see Figure 8(d)). A large value of the parameter in the Softplus function allows it to be smooth enough, while still being convex. The final problem is solved using dual decomposition (yu2006dual; chiang2007layering). yang2021optimization also use ICNNs to get convex optimization problems which are then solved using sequential quadratic programming.
Noticing the use of ANNs in various optimization applications, schweidtmann2019deterministic provide an efficient method for deterministic global optimization problems which have ANNs embedded in them. As the only nonlinearities of ANNs are their activation functions (the function in their case), the tightness of their relaxations are influenced by the relaxations of the functions. The key idea of their paper is to therefore hide the equations that describe the ANN and its variables from the BranchandBound (B&B) solver using McCormick relaxations of the activation functions. These relaxations are once continuously differentiable, and the lower bounds are calculated by automatic propagation of McCormick relaxations through the network equations. This “ReducedSpace (RS)” formulation results in significantly fewer variables and equalities than the original formulation (by two orders of magnitude in some cases), with significantly shorter computation times in most cases. They also see that shallow ANNs show much tighter relaxations than deep ANNs. grimstad2019relu also provide bound tightening procedures for ReLU networks to reduce solution times.
The embedding of kriging functions in the final model is done in a direct way: substituting the kriging function leads to an explicitly formulated, but highly nonconvex constraint. First and second order derivatives can be calculated, and therefore standard nonlinear solvers can be used. However, due to the nonconvexity, solvers might end up in local optima. See also dbb011afe7154a63a10e75e379fa5a1f and STINSTRA2008816 for examples.
With DTs, the paths from root to leaf nodes can be represented using disjunctions of conjunctions. In Figure 10, there are three paths that lead us to go hiking which can be written as follows:
(11)  
These rules can either be linearized, or embedded as constraints using a framework like Generalized Disjunctive Programming (GDP) (Grossmann2009; GAMS2021). Although lombardi2017empirical do not embed DTs into MINLPs due to the potential for poor bounds being obtained as a result of the extensive linearization of disjunctions required, other authors have. In (kudla2018one), branches of the tree from root to leaf nodes are encoded as linear constraints using a bigM formulation. Paths that end up in feasible decisions are denoted with the (+) sign (see Figure 11). These paths are then converted into linear constraints using auxiliary binary variables to show which path in the tree is followed. For example at the top of the tree, if a binary variable , then is active and (12) holds. Otherwise if , then is active and (13) holds.
(12)  
(13) 
This process is applied to the whole tree until a set of linear constraints capturing the tree is obtained. A similar approach is used in (verwer2017auction) to embed regression trees as constraints. In (thams2017data), each path from the root to a leaf node is represented by two constraints, with one binary variable per path to determine if a path is chosen. The resulting MILP is solved with Gurobi. halilbavsic2018data also follow a similar procedure except that the resulting problem is further reformulated as a MixedInteger Second Order Cone Program (MISOCP) because of the presence of additional nonlinear constraints. The resulting MISOCP and MINLP are then solved using commercial solvers.
To embed random forests as constraints, multiple DTs can be used as above, with an additional constraint that selects a consensus based on the outcome chosen by most of the individual decision trees (bonfietti2015embedding). In (biggs2018optimizing) the authors maximize a random forest as the objective. For each tree, they use logical constraints to determine which leaf a solution lies in. For the forest, they use a constraint to ensure that only one leaf for each tree is active. To ensure that the feasible set is bounded, and that the solution is not too dissimilar from the observed data, the solution is constrained to be within the convex hull. This is similar to the approach in (mivsic2020optimization) who also have an objective function expressed as the prediction of a tree ensemble, and who also use some form of Benders reformulation with lazy constraint generation to solve the problem. biggs2018optimizing also compare solving the MIP for the whole random forest, versus splitting the forest into a number of subsets, and solving each subset using the MIP, and a crossvalidation heuristic procedure to achieve performance improvements. In (cremer2018data) where the AdaBoost ensemble learning method is used (with DTs as the weak learners), there is one set of constraints for each DT, and exactly one disjunction must be selected for each learner. Probability estimates for all learners are combined in a weighted sum, and the resulting MILP is solved using a commercial solver.
The advantage of using LP or MILP approaches to learn constraints is seen in this step of the framework, as there are no complex embedding procedures to be followed. In (pawlak2017automatic) the resolution of the final optimization problem here is relatively easy as the entire constraint learning process was formulated as a MILP from the very beginning. This is also seen in papers like (sroka2018one) and (schede2019learning). In (pawlak2021ellipsoidal), the data is first clustered, and then PCA is used to give each cluster an ellipsoidal shape. These ellipsoids can then be represented using quadratic constraints. Their approach produces Mixed Integer Quadratically Constrained Programs (MIQCP), which can then be solved by any solver that supports quadratic programming.
In summary, it can be seen that this step of the constraint learning process is the most complex one. Incorporating predictive models sometimes results in significant reformulation of the original model. Embedding of the predictive models should however be done in a way that takes full advantage of the optimization solver’s capabilities. This can be done by providing gradients or other useful information directly to the solver, or reformulating the problem to become linear, convex, conic quadratic, etc. Constraint learning approaches which use MILP or LP frameworks such as in (pawlak2017automatic) avoid these complex embedding procedures, however the disadvantages of this approach could include a lack of flexibility and limited learning ability. There is therefore a tradeoff between simplicity and performance to be considered. Wherever possible, it is desirable to keep the final optimization problem in the same complexity class as the original problem.
4.5 Verification and Improvement of the Optimization Model
The validation of the final optimization model is straightforward when dealing with benchmark problems, as the groundtruth is already known. This is seen in a lot of the literature. In (pawlak2017automatic; pawlak2018synthesis) and (pawlak2019synthesis)
for example, the use of benchmark problems allows for the comparison of the number of synthesized constraints with the actual number of constraints from the benchmark problems. They are also able to assess the similarity of the syntaxes by computing the mean angle between weight vectors of the corresponding constraints in the synthesized and actual models. The feasible regions of the synthesized models can also be compared to those of the actual feasible regions using the Jaccard Index
(jaccard1912distribution), as it is used to measure the similarity between sets. This is used in (kudla2018one; pawlak2017automatic) and others. Visual comparisons (limited to 3dimensions) have also been used to illustrate how well a constraint learning approach captures the real feasible region (e.g. (pawlak2021ellipsoidal)). In (schede2019learning), the authors compute the probabilities that feasible (and infeasible) points of the benchmark problem lie in the feasible (and infeasible) region of the learned problem. They also compare the differences between the optimum values of the original and learned problems. gutierrez2010neural; cremer2018data and schweidtmann2019deterministic similarly also test their approaches on benchmark problems, and compare their results to the known solutions of the problems. In addition to using benchmark problems, xavier2021learning also evaluate the performance of the method by using outofdistribution data to measure the robustness against moderate dataset shift.Using synthetic or wellknown benchmark problems allows one to accurately determine how far learned constraints are from the real ones, however this is not always possible. Consequently, other model verification procedures are seen in the literature. Simulators can also be used to evaluate constraint learning approaches by comparing learned solutions to solutions from the simulator (e.g. in (verwer2017auction)).
Authors also compare the constraint learning approaches with the typical approaches used to solve those types of problems. This is seen in (say2017nonlinear), (thams2017data), (chen2020input) among others. cozad2014learning implement model improvement by iteratively generating new data and updating the model with these data. The process continues until the error is less than a specified tolerance value. An overview of the approaches used for verification and improvement of the learned models is given in Table 2.
Benchmark  Compare Constraints  Simulator  Compare to  Adaptive  
Problems  and/or Feasible Regions  Results  Other Approaches  Sampling  Other  
bergman2019janos  x  
chen2020input  x  x  x  
chi2007modeling  x  
cozad2014learning  x  x  
cremer2018data  x  x  x  
de2003integrating  x  x  
fahmi2012process  x  
garg2018kernel  x  x  
grimstad2019relu  x  
gutierrez2010neural  x  x  
halilbavsic2018data  x  x  
jalali2019designing  x  x  
kudla2018one  x  x  x  
lombardi2017empirical  x  x  
mivsic2020optimization  x  
paulus2021comboptnet  x  
pawlak2017automatic  x  x  
pawlak2017synthesis  x  x  
pawlak2018synthesis  x  x  
pawlak2019synthesis  x  x  x  
pawlak2021ellipsoidal  x  x  x  
pawlak2021grammatical  x  x  x  
prat2020learning  x  x  
say2017nonlinear  x  x  
schede2019learning  x  x  x  
schweidtmann2019deterministic  x  x  
spyros2020decision  x  x  
sroka2018one  x  x  x  
thams2017data  x  x  
venzke2020neural  x  x  
verwer2017auction  x  
xavier2021learning  x  x  
yang2021optimization  x  x 
5 Opportunities for Further Research
As the process of constraint learning is relatively new, it stands to reason that there will be several areas where further research can be carried out. Following, we summarize some of these opportunities for further research in light of the steps of our proposed framework.
Data Gathering and Preprocessing: As seen in Section 4.2, the data gathering and preprocessing phase of the constraint learning process is often an afterthought in optimization. There is therefore an opportunity to understand the effects of the data generation and preprocessing phase on the solution of the optimization problem. Is there a particular strategy (e.g. DoCE, Kriging, etc.) that performs best for constraint learning? It is also necessary to understand the size and type of data that is needed in order to get a good predictive model for constraint learning, and not just for prediction.
Selection and Training: Firstly, the choice of model to use to learn constraints should also be further investigated. Most of the approaches used to learn constraints are the most commonly known methods (regression, ANNs, DTs, etc.), however there is an opportunity to see how less wellknown approaches perform in terms of solution quality, as well as ease of embedding as constraints. Secondly, predictive models often provide a measure of the uncertainty of their predictions. Neural Network classifiers, for example, can have Softmax scores while Decision Trees report the number of misclassified examples at their leaf nodes. Although cremer2018data have used probabilistic information from their classifier to good results, further research on how to incorporate these measures of uncertainty into optimization problems should be carried out. Approaches such as robust optimization or stochastic programming might be useful here. Thirdly, only a few approaches have used ensemble methods to learn constraints. While not all problems will require the additional predictive power that an ensemble approach brings, it is to be expected that as the best predictive models in ML applications are often ensembles, it makes sense to incorporate their superior abilities into constraint learning. Finally, in Section 3.1, we make the distinction between constraint learning and constraint function learning. Further research is required on which approach performs best, either generally or on a casebycase basis. The relationship between the two approaches should be investigated. When do two methods result in the same optimal solution? In other words, when do they define the same feasible region?
Resolution of the Optimization Model: Several approaches for embedding ML models as constraints have been seen, however can these models be embedded in more efficient ways? The difficulty of using larger neural networks or larger decision trees to learn constraints might be overcome if more efficient embedding procedures are developed. In (schweidtmann2019deterministic), difficult parts of the neural network ( activation functions) were hidden from the B&B solver. Further research on hiding unnecessary information from solvers is needed. Further research is also needed on providing additional useful information (such as precomputed derivatives) to commercial solvers so as to improve solution time and quality. This will be useful when the incorporation of predictive models make these computations impractical or expensive.
Verification and Improvement: In the literature reviewed, methods for model verification and improvement, usually consist of either comparing to benchmark problems, or to other competing approaches. It could be promising to look into developing formal approaches for verifying learned constraint models. A framework for verifying neural network behavior was developed in (venzke2020verification), and research could be done to see if similar frameworks can be applied during the constraint learning process. In terms of model improvement, more focus should be put on improving models via processes such as active learning or adaptive sampling. In such cases, it would be interesting to see the effects (if any) of exploration versus exploitation when generating data on the overall solution of the optimization problem. Questions also arise with respect to the robustness of the solution – What is the sensitivity of the optimal solution with respect to the uncertainty in the learned constraint? How is the robustness of the solution related to the noise in the dataset? A structured and formal approach for verifying optimization problems with learned constraints can help to answer questions such as these.
6 Conclusions
Reading any of the literature surveyed in this paper will clearly show the benefits of OCL. OCL helps to capture constraints which would otherwise have been difficult to capture, providing there is data available. These benefits are clearly stated by halilbavsic2018data where their datadriven approach increased the feasible space, identified the global optimum and was solved in less time than conventional methods. There is however a need to go about the process of OCL in an organized manner, in order to avoid potential pitfalls. We have therefore provided a framework for OCL, which we believe will help to formalize and direct the process of OCL. We start with the setup of the conceptual model, and then look at the process of collecting and preprocessing the data. We also look at the factors that influence the methods and algorithms used to learn the constraints, and the techniques by which these learned constraints are embedded into the optimization models. Finally, we highlight the steps for verification and improvement of these models. We also provide two running examples — the WFP diet palatability problem, and the radiotherapy optimization problem — and used these examples to illustrate our framework. Besides providing the framework, we have reviewed the literature in light of the different steps of the framework, highlighting common trends in constraint learning, as well as possible areas for future research. It is our belief that this paper will help to guide future efforts in OCL, and be of benefit to the wider OR community.
Acknowledgements
This work was supported by the Dutch Scientific Council (NWO) grant OCENW.GROOT.2019.015, Optimization for and with Machine Learning (OPTIMAL). The authors would also like to thank Ilker Birbil for his comments during the preparation of the manuscript.
Comments
There are no comments yet.