Constrained Machine Learning: The Bagel Framework

by Guillaume Perez, et al.

Machine learning models are widely used for real-world applications, such as document analysis and vision. Constrained machine learning problems are problems where learned models have to both be accurate and respect constraints. For continuous convex constraints, many works have been proposed, but learning under combinatorial constraints is still a hard problem. The goal of this paper is to broaden the modeling capacity of constrained machine learning problems by incorporating existing work from combinatorial optimization. We propose first a general framework called BaGeL (Branch, Generate and Learn) which applies Branch and Bound to constrained learning problems where a learning problem is generated and trained at each node until only valid models are obtained. Because machine learning has specific requirements, we also propose an extended table constraint to split the space of hypotheses. We validate the approach on two examples: a linear regression under configuration constraints and a non-negative matrix factorization with prior knowledge for latent semantics analysis.







The field of artificial intelligence has improved drastically in recent years, with impressive results from image recognition and deep generative models [15, 17, 13]. In such applications, a priori knowledge may be available, for example that only a subset of the input features is relevant (sparsity) [4]. By design, vanilla learning models are not well suited for learning under such assumptions. For combinatorial constraints such as the bounded L0 norm, which restricts the number of non-zero values, a large body of research exists [31, 11], and for some instances the optimal model (i.e. the model respecting the constraint with the lowest loss) can be found. In the general case, however, it is still a hard question. Although many approaches have been proposed, most of them are ad-hoc algorithms that heuristically enforce a given combinatorial constraint. For sparsity, users will mostly use the L1 (i.e. sum of absolute values) relaxation instead of L0 [5, 30]. Another example is a recent work on topic modeling for materials science [1], which required a set of combinatorial constraints on the solutions of a matrix factorization model to be satisfied. In such works, a machine learning model that does not respect the constraints is useless, even if its loss value is low. Finding elegant relaxations, which are often both convex and differentiable, is not necessarily useful unless it leads to a solution that actually respects the constraints.

On the other hand, Branch and Bound (B&B) is one of the most famous tools for solving hard combinatorial optimization problems and is widely used [22]. B&B has been applied to many areas, and a dedicated version is often derived to fit the requirements of a given family of problems (branch and price [2], branch and cut [33], etc.). B&B is the main solving method of many combinatorial solvers, such as constraint programming solvers [28] and integer programming solvers [33]. These solvers have proven their efficiency in solving hard combinatorial problems.

The goal of this paper is to broaden the modeling capacity of constrained machine learning problems (CMLPs) by reusing existing work from combinatorial optimization frameworks, especially constraint programming. In this paper, we introduce the Branch, Generate and Learn (BaGeL) framework. This framework makes it possible to learn while enforcing combinatorial constraints on the learned representation. Its goal is to recursively generate learning problems in which the hypothesis space is restricted, until only valid hypotheses are allowed. The difference with the classic B&B framework is that, at each node, a learning problem is generated and trained. The generated problems are more and more restricted, until only valid models are obtainable. The tree search part of BaGeL ensures that the whole space of valid models is explored.

Moreover, CMLPs have different needs than basic combinatorial problems. For this reason, a generalization of table constraints [9, 26] is proposed. The table constraint is one of the most used constraints in constraint programming. The proposed version is a hard, non-convex, and non-connected constraint, applied to a sub-sequence of the model's features. This newly introduced constraint restricts the feature spaces to possibly non-connected and non-convex sub-spaces that act as a bias on the hypothesis space. Used with the BaGeL framework, combinations of this constraint guide the search for valid models. Moreover, it makes it possible, for example, to fit data sets using prior knowledge, and it generalizes many constraints such as the classic Lasso regularization constraint.

Finally, our experimental section shows on two hard combinatorial learning problems that the proposed method efficiently handles complex constraints and leads to good solutions.

Modeling CMLPs

A Constrained Machine Learning Problem (CMLP) is a learning problem in which restrictions are imposed on the hypothesis set, the learned representations, the inputs, or any other part of the learning process. A well-known example is the Lasso regularization, which adds a regularization constraint to the learning model. In this example, the constraint is on the model parameters.

Definition 1

A Constrained Machine Learning Problem (CMLP) is a 5-tuple made of a set of variables, a learning model together with its parameter variables, a set of constraints defined over a subset of the union of the variables and the parameters, a data set, and a function of the variables, the model, and the data set to optimize.

Each variable is associated with a domain containing the finite set of values that can be assigned to it. The solution of a constrained learning problem is an assignment of the variables and of the model parameters such that all the constraints are satisfied and the objective function is minimized with respect to the data set.

Example 1

Smart Design. We consider the case of a machine learning model that can take multiple sensors as input. Such a component can be a camera (matrix of pixels), a radar or a lidar (point cloud), or a mono-valued sensor such as temperature, pressure or contact. The presence and the number of occurrences of these sensors determine a cost for the system and an energy consumption. The goal is to find an acceptable configuration of the system minimizing the function to be learned. Using a regularization may suggest using only some pixels of the camera and some points of the radar, which requires both sensors to be present and may exceed the maximum cost of the design.

For this smart design example, we consider optimizing the parameters of a linear regression model from a training set. The parameters of the model are the components of its parameter vector. The quality of a learned model is evaluated by a loss function. Consider a set of components, each of which can be a camera, a lidar, a multi-valued sensor, etc. Each component is a pair made of an input tensor and a cost. A maximum allowed cost is given, and a Boolean vector represents the use of each component. The following problem is the smart design problem:


subject to


where the L0 norm is the total number of non-zero values in a vector. In such settings, finding the optimal selection of components such that the loss function is minimized and the budget is respected is a hard combinatorial problem. Note that when all the costs are equal to 1 and all components contain only one input, this is equivalent to the well-known L0 constraint, a problem of utmost importance [24, 31, 11, 5, 30, 3, 25].
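To make the budget constraint concrete, the following minimal sketch enumerates by brute force the Boolean use-vectors whose total cost stays within the maximum allowed cost. It assumes that the budget takes the form "the sum of the costs of the used components is at most the maximum cost"; the names `feasible_selections`, `costs` and `max_cost`, as well as the instance, are illustrative and not taken from the paper.

```python
from itertools import product

def feasible_selections(costs, max_cost):
    """Return every Boolean use-vector whose total cost is within the budget.

    Assumption: the budget constraint is sum(cost_i * x_i) <= max_cost.
    """
    selections = []
    for x in product((0, 1), repeat=len(costs)):  # 2^n candidate vectors
        total = sum(c for c, used in zip(costs, x) if used)
        if total <= max_cost:
            selections.append(x)
    return selections

# Hypothetical instance: three components with costs 2, 1, 2 and a budget of 3.
valid = feasible_selections([2, 1, 2], 3)
print(len(valid), "feasible selections out of", 2 ** 3)
```

The number of candidate vectors doubles with every additional component, which is why exhaustive enumeration does not scale and a tree search with pruning is of interest.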

Consider a toy problem with , =, =, =, =. Let . The possible solutions of equation (3) are , , , , , , , , . The optimal solution is the one optimizing equation (1).

Extended Table constraint

In addition to existing combinatorial constraints, machine learning specific needs lead to the definition of dedicated constraints. Indeed, prior knowledge, sparsity, etc. are currently leading the research direction of the optimization methods for machine learning because of the uniqueness of the constraints. This section introduces the extended table constraint, a generic constraint which will be used to split the features’ search space into sub-spaces.

Table constraints, also named extensional constraints [9, 26], are defined by the list of valid value assignments for a vector of discrete variables in constraint programming. They are widely used, mostly in industrial applications, because of their simplicity and expressiveness. They are often built by extracting the solutions of sub-problems [18, 8]. For example, consider the following binary table constraint enforcing that the variables must have a difference of or . For a pair of discrete variables associated with the domain , the table is . Thus the assignment is valid since and so is .

We extend the standard table constraint definition, given for a vector of variables and a set of n-tuples, with a cost function and a threshold value.

Definition 2

A vector of variables is valid with respect to the extended table constraint ET() if and only if there exists an n-tuple in the table whose cost with respect to the vector does not exceed the threshold value.

Note that this definition is equivalent to classical table constraints if the threshold value is zero and the cost function is the Euclidean distance. The cost function makes it possible to represent both the feasibility set and the distance of a vector to it. Extended table constraints aim at splitting the search space of the learned features into specific sub-spaces.
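The validity test of Definition 2 can be sketched in a few lines. This is only an illustration under stated assumptions: the comparison is taken to be "cost less than or equal to the threshold", and the function name `extended_table_valid` and the table values are hypothetical. With the Euclidean distance (`math.dist`) and a threshold of zero, the check reduces to membership in a classical table.

```python
import math

def extended_table_valid(x, table, cost, threshold):
    """True if some tuple t of the table satisfies cost(x, t) <= threshold."""
    return any(cost(x, t) <= threshold for t in table)

# Hypothetical table over pairs of values.
table = [(1, 2), (2, 3)]

in_table = extended_table_valid((1, 2), table, math.dist, 0)      # exact match
not_in_table = extended_table_valid((1, 3), table, math.dist, 0)  # no exact match
relaxed = extended_table_valid((1, 3), table, math.dist, 1.0)     # within distance 1
print(in_table, not_in_table, relaxed)
```

Passing the cost function as a parameter keeps the same check usable for other distances, which is the degree of freedom the extended table constraint adds over the classical one.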

Figure 1: Topic modeling problem with topics from a database. (top-left) A database of topics. (top-right) An example of a topic vector: 1 indicates that the word belongs to the topic, 0 otherwise. (bottom) Each column of the left matrix is a subset of the words of a topic from the database.

Example 2

Prior NMF. To illustrate our concepts, we consider a version of a topic modeling problem from latent semantics analysis. Extracting the topics of a text can be done using non-negative matrix factorization (NMF) [32]. Given a matrix representing a document, where each column represents the words of a paragraph, we want to decompose it into two non-negative matrices. The columns of the left matrix represent the words that belong to a topic, and the right matrix represents the activation of the topics in each paragraph (or column) of the document matrix. The additional constraint is a restriction on the columns of the left matrix that ensures that the topics are taken from an existing topic set.

Let be a table of prior topics, where each entry is a Boolean vector taking value 1 if the word belongs to the topic and 0 otherwise. Let the function , with being the Hadamard product. Each column of matrix is constrained by an extended table constraint . Such constraint implies that the columns of matrix should match the database . Finally, a constraint ensures that all the selected topics from the prior data base are pairwise different. An alldifferent constraint [27] is used. Let be the number of prior topics. Let for be decision variables representing the selected topic for column . implies that the second column of matrix is associated with topic . The problem can be defined as:


subject to


Figure 1 illustrates this problem. The cost function returns the minimum number of words to remove from a column of the left matrix to match at least one topic from the database. For example, a column compared with the fishing topic from Figure 1 can have a cost of 1, which means there is one word that does not belong to the fishing vocabulary: plane. The threshold value is the acceptable number of words that differentiate each vector in the solution from its nearest database entry. If the threshold is zero, then no new word is accepted. Thus the constraint enforces that the columns of the left matrix are subsets of the topics from the database.
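A minimal sketch of this cost is given below. It assumes that the cost of a column with respect to one topic is the number of non-zero entries of the column that fall outside the topic, and that the column is valid if this count is within the threshold for at least one database entry. The vocabulary, the topic vectors, and the function names are hypothetical and do not reproduce Figure 1.

```python
def words_to_remove(column, topic):
    """Count the non-zero entries of `column` that lie outside the Boolean `topic`."""
    return sum(1 for value, in_topic in zip(column, topic)
               if value != 0 and not in_topic)

def column_is_valid(column, topic_table, threshold):
    """Extended-table check: some topic is matched up to `threshold` extra words."""
    return any(words_to_remove(column, t) <= threshold for t in topic_table)

# Hypothetical vocabulary: boat, net, hook, plane.
fishing = (1, 1, 1, 0)
aviation = (0, 0, 0, 1)
column = (0.5, 0.0, 0.2, 0.1)  # uses boat, hook and plane

print(words_to_remove(column, fishing))    # extra words w.r.t. fishing
print(words_to_remove(column, aviation))   # extra words w.r.t. aviation
print(column_is_valid(column, [fishing, aviation], threshold=0))
print(column_is_valid(column, [fishing, aviation], threshold=1))
```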

For discrete problems with finite domains, any constraint can be translated into a classic table constraint. The only issue is that the size of this constraint may growth exponentially. In the context of the extended table constraint, the distance function and threshold value help the representation of other constraints. First, norm constraints such as [24, 31, 5] can be translated into the extended table constraint . Moreover, the smart design constraints can be translated into an extended table too. Consider the Example 1 and its solutions table . Let be the set of tuples where for each tuple , there is a tuple . For each value of associated to component , a vector of the size of is created and filled with . Tuple is defined by the concatenation of these vectors. Finally, constraint is equivalent to constraint (2) and (3).

BaGeL: A framework for CMLPs

In this section, the Branch, Generate and Learn (BaGeL) framework is presented. This framework is an adaptation of the B&B framework for CMLPs with combinatorial constraints. The main goal of BaGeL is to abstract the constraining of the learned representations from the learning process itself. Analogously to B&B, BaGeL uses a tree search and needs to define the three main components: branching, search, and pruning [22]. The fourth component, which is new and the most important, is the link between the learning process and the combinatorial process. This link is the generation of learning problems in which the set of hypotheses is restricted. These restrictions become stronger and stronger, until only valid hypotheses are allowed. The constrained generated problems take advantage of different methods, such as fixing some variables, modifying the configuration of the learning process, or restricting the hypothesis space of the learning model using methods such as projections. A generated problem can be any existing machine learning problem, even another CMLP solved by BaGeL.

Problem generation Given the partial assignment of variables at node , the generation function is , where is a learning problem. In addition, each solution of can be translated into a assignment of . Let be the optimal solution of problem . Let be the search space of problem . Let be the space of all models satisfying the constraints.

Definition 3

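A node is a leaf if the search space of its generated problem is included in the space of all models satisfying the constraints.

This definition implies that if the solutions of the generated problem necessarily respect the constraints, there is no reason to restrict the space further. Let be the parent nodes of node .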

Definition 4

A BaGeL problem is monotonic restrictive if and only if

Property 1

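Let a BaGeL minimization problem be monotonic restrictive. Consider the best leaf (solution) found so far and the current node. If the optimal loss of the problem generated at the current node is not lower than the loss of the best leaf, then the current node can be safely pruned.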

The proof is omitted as this property is classic for B&B like algorithms. It allows BaGeL to stop exploring unpromising branches of the search tree.

The remaining case is when a node is a leaf with respect to definition 3, but finding the optimal solution is a hard problem. For these cases, BaGeL proposes to sample the solution space to evaluate nodes. In the general case, dedicated evaluation functions should be defined.

Property 2

Let a BaGeL model be monotonic restrictive. A node is a leaf if the optimal solution of its generated problem is found and this solution satisfies the constraints.

If the optimal model of a node respects the constraints, it is unnecessary to search the sub-tree emanating from this node any further. This characteristic is unusual for B&B-like algorithms and makes it possible to strongly cut the search space. Consider for example the smart design problem, where an assignment of the model parameters is enough to check whether it is valid. In the general case, an assignment of the parameters of the model might not be enough to check validity, because the variables might need a deeper search for satisfiability.

A possible implementation of the BaGeL framework is given in Algorithm 1. The algorithm starts by opening the root node. Then the main loop of the algorithm will be run until no more nodes are open, or a stopping condition is reached. The stopping condition is implemented by the method . Each time a node is picked by the search strategy, the BaGeL framework starts by enforcing the pruning rules, generates the learning problem and then trains it. Finally, if the node is not a leaf, decisions are generated by the branching strategy and added to the current set of open nodes. The goal of these decisions is to restrict more and more the search space, such that the deeper we are in the tree, the more constrained the learning problem will be. Concretely, in each node, the pruning phase reduces the search space allocated to the learning model, and the learning phase tries to find the best solution with respect to these restrictions.
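The loop described above can be sketched on the smart design problem as follows. This is an illustrative sketch under several assumptions, not the authors' implementation: nodes are partial Boolean assignments, the search strategy is depth first, the budget is the sum of the costs of the used components, and training is abstracted as a callable `train(mask)` returning the best loss when only the components allowed by `mask` may be used. The `toy_train` function is a hypothetical stand-in for an actual regression fit.

```python
def bagel_smart_design(costs, max_cost, train):
    """Depth-first BaGeL-style search over Boolean component selections.

    A node is a tuple with entries 1 (used), 0 (unused) or None (undecided).
    """
    best_loss, best_mask = float("inf"), None
    open_nodes = [(None,) * len(costs)]          # open the root node
    while open_nodes:
        node = open_nodes.pop()                  # search strategy: depth first
        # Pruning rule: components already forced in must fit the budget.
        if sum(c for c, x in zip(costs, node) if x == 1) > max_cost:
            continue
        # Generation: undecided components stay allowed (upper bound 1).
        mask = tuple(0 if x == 0 else 1 for x in node)
        loss = train(mask)                       # learning step
        if loss >= best_loss:                    # bound: cannot beat the best leaf
            continue
        # Leaf test: every model of the restricted space respects the budget.
        if sum(c for c, m in zip(costs, mask) if m) <= max_cost:
            best_loss, best_mask = loss, mask
            continue
        i = node.index(None)                     # branch on the first undecided
        for value in (0, 1):
            open_nodes.append(node[:i] + (value,) + node[i + 1:])
    return best_loss, best_mask

# Hypothetical stand-in for training: excluding component i costs targets[i]**2.
targets = [3.0, 1.0, 2.0]

def toy_train(mask):
    return sum(t * t for t, m in zip(targets, mask) if not m)

result = bagel_smart_design([2, 1, 2], 3, toy_train)
print(result)
```

Because excluding more components can only increase the toy loss, the stand-in is monotonic restrictive, which is what justifies the bound-based pruning in this sketch.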

Consider the Smart design problem instance defined in Example 1. Let the decisions at depth to be of the form then . Let be the sum of all the component for which .

Property 3

The pruning rule applies for the smart design problem.

Let be the upper bound of variable . The generation function returns the following learning problem:


where is the Hadamard product. Since is a constant vector, problem (8) can be optimally solved. Let be a solution of (8). An assignment of of (1) is obtained by . Figure 2 shows a possible run of the BaGeL framework on this example. At root node, the cost is 0.12, which represents a lower bound of the problem. Then, decision is taken and applied. The next node loss is 0.14. Decision is applied. All the combinations of and are solutions, the current node is a leaf. Its value is 0.21, the node becomes the current best. The algorithm backtracks to the previous node and applies decision . The best solution is 0.22, which is dominated by 0.21, the node is pruned and the algorithm backtracks to the root. Decision is applied, pruning rules can safely remove 1 from and . All the remaining combination of are solutions of the problem, and the loss is 0.19. This node is the new best. The algorithm backtracks to the root. The optimal value is 0.19.

Figure 2: Branch and Generate algorithm for a Smart Design. Green nodes are solutions that respect the constraint; red nodes do not. The value inside a node is its loss.

Example: Prior NMF

Consider the topic modeling problem using non-negative matrix factorization [32] from the previous section. To the best of our knowledge, no algorithm is able to solve such a problem directly. Nevertheless, algorithms exist for the vanilla version, which does not contain any constraints. Gradient methods have proven their efficiency for solving such problems.

Consider a BaGeL branching strategy that for each value , generate a decision . Decision implies that . The resulting tree is no longer a binary tree but an n-ary tree. Consider a search strategy that selects the next decision by selecting the . Let be the current node, and be the sequence of decisions taken so far. Let be a matrix having the same dimensions as . Each column of is defined by . With the element-wise logical OR operator. All the other columns are filled with ones. The generated problem for node is:


This problem can also be solved by a gradient method, since the matrix is not a variable but a constant.

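Algorithm 1 BaGeL algorithm
root ← Node()
open ← Set(root)
while open is not empty and the stopping condition is not reached do
  pick a node from open with the search strategy
  apply the pruning rules, then generate and train the learning problem
  if the node is not a leaf, add its branching decisions to open
end while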

Link with combinatorial solvers. The BaGeL framework can, and should, be implemented in most existing combinatorial solvers, such as constraint programming (CP) solvers. Using existing CP solvers directly provides us with strong combinatorial optimization power, allowing us to focus on the learning part. The learning part can be encapsulated directly as a constraint. Indeed, branch and bound and its three components are already the core of such solvers; only the generation and learning parts are missing. This constraint should be run once the propagation is finished, which implies generating the learning problem as a function of the state of the CSP, training the model, and extracting the loss. If the stopping conditions related to learning are reached, then a fail is triggered; otherwise, the CP solver continues its work.

Related work

In many cases, the BaGeL principles are going to be relatively similar to those of existing B&B problems, such as the search strategy, which will still select a given sub-problem from a set of possible ones. Thanks to that, we will be able to reuse the large body of work on black-box optimization, especially for the modeling part (constraint language, etc.). The major difference lies in the pruning rules, the use of the objective function from the model, and the interactions between the B&B variables and the learning process (i.e. the learning model generation). In our problem, the objective function is often going to be the same as that of the machine learning model that we try to optimize. B&B algorithms have already been used in machine learning and feature selection, thus searching over the L0 constraint [23, 6]. Our work generalizes such approaches and allows searching not only over the domains of the variables, but also over the constraint set, because we can have higher-level variables (i.e. searching over the table points). Many works now use machine learning to improve branch and bound, and optimization in general [29, 14]. While these are promising works, this paper aims to do the opposite: to use B&B to improve the consistency of machine learning models. Recently, combinatorial optimization solvers have been used for machine learning in the context of moving targets [10]. That work iterates between a pure machine learning phase and a pure combinatorial phase to change the target of the optimization. This is different from our work, in which we propose to use the combinatorial part to restrict the hypothesis set of the machine learning part. Learning models are often used inside CP solvers, either to approximate or learn constraints, or as objective functions. In all of these works, the machine learning models are pre-trained, and the parameters are fixed [20]. Table constraints with costs are a known topic in constraint programming, where soft implementations are used to model over-constrained problems [16]. Our setting is different, as we generalize such work and embed different distance functions. Finally, the proposed work differs from methods that iterate between applying different optimization directions to the current point [7, 21]. The proposed method generates a new sub-problem each time, which is then optimized, instead of shifting the current solution.


The experimental section is split in two parts. The first part is about smart design. The second part is about the NMF with prior. This section shows how the BaGeL framework leads to high quality solutions for these constrained learning problems. Both models have been implemented in the same BaGeL framework implementation.

Smart Design instances

We generated several hundred instances with the number of features in the list (10, 20, 40, 70, 100, 130, 150, 180, 200, 225, 250, 300, 350), the number of samples in the list (100, 400, 700, 1000, 1500, 3000, 7000, 10000), and the percentage of total weight, used to define the maximum cost value, in the list (0.90, 0.80, 0.60, 0.30). All of the instances contain additional Gaussian noise. We set a timeout of 10 minutes. All of the source code can be found on GitHub (hidden link). As competitors, we defined two methods, both based on the regression optimization problem of equation (1). The first one applies classic linear regression and then removes the smallest coefficients until the constraint is respected. The model is then fitted again with the selected subset of features. We call this algorithm Basic Repair. The second one uses the information of the weights of the inputs by computing the ratio of coefficient over weight. We call this algorithm orthogonal repair, by analogy with orthogonal basis pursuit. In order to evaluate the algorithms, we split our data set into a training set of 80% and a test set of 20% used for evaluation, and applied 5 folds. All the results shown here are with respect to the test set unless explicitly stated otherwise.
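The selection step of the two competitors can be sketched as a greedy removal loop. This is a simplified illustration under assumptions that are not in the text: each component holds a single feature, the coefficients come from an unconstrained fit performed beforehand, and the final refit on the kept features is omitted. The function name `repair_selection` and the numbers are hypothetical.

```python
def repair_selection(coefs, costs, max_cost, use_ratio=False):
    """Drop features greedily until the total cost fits the budget.

    use_ratio=False ranks features by |coefficient| (basic variant);
    use_ratio=True ranks them by |coefficient| / cost (ratio variant).
    """
    keep = set(range(len(coefs)))

    def score(i):
        return abs(coefs[i]) / costs[i] if use_ratio else abs(coefs[i])

    while keep and sum(costs[i] for i in keep) > max_cost:
        keep.remove(min(keep, key=score))  # remove the least useful feature
    return sorted(keep)

# Hypothetical coefficients of an unconstrained fit and per-feature costs.
coefs = [0.9, 0.1, 0.5]
costs = [4, 1, 1]
basic = repair_selection(coefs, costs, max_cost=4)
ratio = repair_selection(coefs, costs, max_cost=4, use_ratio=True)
print(basic, ratio)
```

On this toy instance the two rankings keep different features, which illustrates why the cost-aware ratio can change the repaired model.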

Loss. Figure 3 shows the test loss of all the methods for different values of the maximum cost variable. As we can see, the BaGeL framework results are strictly better than those of the Basic Repair and orthogonal repair methods. Such a result is not surprising, as BaGeL searches for the best subset of components that minimizes the loss, whereas the two repair methods keep the components that are the most used, or whose ratio is maximal, in the optimal solution of the unconstrained fitting problem. Figure 4 shows the test loss of all the methods with respect to different numbers of samples. The fewer samples we have, the harder it is to fit the model. From these figures, we can see that, with respect to either the constraint bound or the number of samples, the BaGeL framework is able to extract better solutions than the reconstruction methods.

Constraint. Figure 5 shows the tightness of the satisfaction of the maximum cost constraint, given as a percentage by the ratio of the actual cost to the maximum cost. As we can see, the BaGeL framework seems to tighten the constraint as much as possible, and in most experiments there is a gap of around 10% between the BaGeL algorithm and the reconstruction methods. Such a result is interesting, as it is intuitive that being less impacted by the constraints should give more freedom to the learning part, even if in many cases the introduced constraints can also help the learning process.

Time. The last important factor to analyze is the time. For the reconstruction methods, the time grows with the number of features and the number of samples, but compared to the BaGeL time, their running time is insignificant most of the time. Figure 6 shows the impact of the number of samples and the number of features on the running time of the BaGeL framework. As we can see, the time consumption grows drastically with the number of features and the number of samples. In combinatorial problems, it is not unusual to have exponential running times, as these methods are used to solve NP-hard problems. In most large-scale instances, solving the complete tree search will be computationally infeasible. As for classic B&B problems, smart search strategies and decomposition methods should be defined to scale up. Figure 7 shows the impact of the constraint on the running time. As we can see, the longest running times occur neither for the smallest value of the constraint nor for the largest, but for a value in the middle, here 0.6 of the total cost. This result could be explained as follows: when the constraint is tight, the search space is strongly cut by the constraint; when the constraint is loose, good solutions can be found easily, which also cuts the search space. Such results are inherent in combinatorial optimization solvers, where adding constraints to solve a problem faster is a common practice [12].

Figure 3: Mean loss value for a given percentage of the maximum loss.
Figure 4: Mean loss value for a given number of samples.
Figure 5: Tightness of the constraint satisfaction for the maximum cost constraint.
Figure 6: Mean time for a given number of samples, or given a number of features.
Figure 7: Mean time with respect to the maximum cost constraint.

Prior database topic modeling

The filtering of the allDifferent constraint is a binary one. We generated random instances matching these settings (i.e. having a given degree of novelty with respect to a database of knowledge). The number of words is in the list (20, 30, 50, 75, 100, 150). The number of true topics per instance is in the list (4, 5, 6, 7, 8). The number of additional false topics added to the prior database is in the list (2, 3, 5, 10). The number of documents per instance is in the list (50, 100, 150, 200, 250, 300). The database of priors is created by removing topics from the topics involved in an instance and concatenating the result with the false topics. In addition, we set the sparsity to 0.8 for both factor matrices. The minimal number of topics per document is 2. All of the instances contain additional Gaussian noise.

The purpose of such an experiment is to check whether a hard implementation of prior knowledge using combinatorial search can lead the learning process. The NMF solver uses multiplicative gradient updates [19], and all of the source code can be found on GitHub (hidden link). Such an implementation has the advantage of being unable to change a value initially set to zero by the product with the constant matrix of the generated problem. The search strategy chooses the next point in the prior database by selecting the point closest to the current column of the left matrix with respect to the L2 norm.
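The zero-preserving behavior of multiplicative updates can be checked on a small example. The sketch below uses the standard multiplicative rules for the squared reconstruction error in plain Python; the names `v`, `w`, `h` and `mask`, the small `eps` added to the denominators, and the data are assumptions made for illustration, not the authors' code.

```python
def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def multiplicative_step(v, w, h, eps=1e-9):
    """One round of multiplicative updates for the squared error of V ~ W H."""
    wt = transpose(w)
    num, den = matmul(wt, v), matmul(matmul(wt, w), h)
    h = [[h[i][j] * num[i][j] / (den[i][j] + eps) for j in range(len(h[0]))]
         for i in range(len(h))]
    ht = transpose(h)
    num, den = matmul(v, ht), matmul(w, matmul(h, ht))
    w = [[w[i][j] * num[i][j] / (den[i][j] + eps) for j in range(len(w[0]))]
         for i in range(len(w))]
    return w, h

# Hypothetical data: 4 words, 3 paragraphs, 2 topics.
v = [[1.0, 0.5, 0.0], [0.8, 0.4, 0.1], [0.0, 0.3, 0.9], [0.1, 0.2, 0.7]]
mask = [[1, 0], [1, 0], [0, 1], [0, 1]]          # words allowed in each topic
w = [[0.5 * m for m in row] for row in mask]     # masked initialization
h = [[0.5, 0.5, 0.5], [0.5, 0.5, 0.5]]
for _ in range(50):
    w, h = multiplicative_step(v, w, h)

zeros_kept = all(w[i][j] == 0.0
                 for i in range(4) for j in range(2) if mask[i][j] == 0)
print(zeros_kept)
```

Each update multiplies the current entry by a non-negative factor, so an entry that starts at zero stays at zero; masking the initialization is therefore enough to enforce the support chosen by the search.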

Figure 9 shows the percentage of topics correctly extracted from the database of prior knowledge. As we can see, most of the time, the true topics are extracted from the prior knowledge database. A deeper investigation showed that, in several instances where some false topics were used, they made it possible to model the noise.

Figure 8 (left) shows the losses for the true reconstruction using the true factor matrices, and the best loss obtained by the algorithm. The true reconstruction loss is high because noise has been added to all instances. Such results, together with Figure 9, show that the algorithm is able to use prior knowledge, to find good decompositions, and to handle the noise.

Finally, Figure 8 (right) shows the exponential behavior of the tree search, both in terms of time and of nodes. Such a result implies that, for large-scale instances, users will have to use classical existing tools that reduce the search time, such as restarts or custom searches.

This final experiment showed that the BaGeL framework, together with the extended table constraints, efficiently applied the prior knowledge to our learning models.

Figure 8: (left) Losses over all the instances with respect to the number of topics. (right) Time and number of opened nodes with respect to the number of topics.
Figure 9: Percentage of correct selection of priors in the best solution found.


This paper proposes to combine classical tools from combinatorial optimization and machine learning to tackle constrained machine learning problems. The main advantage of such an approach is to reuse the large body of work in combinatorial optimization for efficiently constraining machine learning models. A simple yet generic branch-and-bound-like framework named BaGeL is proposed to solve constrained learning problems involving combinatorial constraints. Moreover, this paper introduces the extended table constraint, a constraint helping modelers to split the search space of the learning part. The experimental section shows that the proposed methodology can tackle problems that are known to be hard to solve otherwise.


  • [1] Junwen Bai, Yexiang Xue, Johan Bjorck, Ronan Le Bras, Brendan Rappazzo, Richard Bernstein, Santosh K. Suram, Robert Bruce van Dover, John M. Gregoire, and Carla P. Gomes. Phase mapper: Accelerating materials discovery with ai. AI Magazine, 39(1):15–26, Mar. 2018.
  • [2] Cynthia Barnhart, Ellis L Johnson, George L Nemhauser, Martin WP Savelsbergh, and Pamela H Vance. Branch-and-price: Column generation for solving huge integer programs. Operations research, 46(3):316–329, 1998.
  • [3] Małgorzata Bogdan, Ewout Van Den Berg, Chiara Sabatti, Weijie Su, and Emmanuel J Candès. Slope—adaptive variable selection via convex optimization. The annals of applied statistics, 9(3):1103, 2015.
  • [4] Emmanuel Candes, Terence Tao, et al.

    The dantzig selector: Statistical estimation when p is much larger than n.

    The annals of Statistics, 35(6):2313–2351, 2007.
  • [5] Emmanuel J Candes, Michael B Wakin, and Stephen P Boyd. Enhancing sparsity by reweighted l1 minimization. Journal of Fourier analysis and applications, 14(5-6):877–905, 2008.
  • [6] Olivier Chapelle, Vikas Sindhwani, and S Sathiya Keerthi.

    Branch and bound for semi-supervised support vector machines.

    In Advances in neural information processing systems, pages 217–224, 2007.
  • [7] Laurent Condat. A primal–dual splitting method for convex optimization involving lipschitzian, proximable and linear composite terms. Journal of optimization theory and applications, 158(2):460–479, 2013.
  • [8] Jip J Dekker, Gustav Björdal, Mats Carlsson, Pierre Flener, and Jean-Noël Monette. Auto-tabling for subproblem presolving in minizinc. Constraints, pages 1–18, 2017.
  • [9] Jordan Demeulenaere, Renaud Hartert, Christophe Lecoutre, Guillaume Perez, Laurent Perron, Jean-Charles Régin, and Pierre Schaus. Compact-table: Efficiently filtering table constraints with reversible sparse bit-sets. In International Conference on Principles and Practice of Constraint Programming, pages 207–223. Springer, 2016.
  • [10] Fabrizio Detassis, Michele Lombardi, and Michela Milano.

    Teaching the old dog new tricks: Supervised learning with constraints, 2021.

  • [11] David L Donoho and Michael Elad. Optimally sparse representation in general (nonorthogonal) dictionaries via l1 minimization. Proceedings of the National Academy of Sciences, 100(5):2197–2202, 2003.
  • [12] Carla Gomes and Meinolf Sellmann. Streamlined constraint reasoning. In International Conference on Principles and Practice of Constraint Programming, pages 274–289. Springer, 2004.
  • [13] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in neural information processing systems, pages 2672–2680, 2014.
  • [14] Shai Haim and Toby Walsh. Restart strategy selection using machine learning techniques. In International Conference on Theory and Applications of Satisfiability Testing, pages 312–325. Springer, 2009.
  • [15] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In

    Proceedings of the IEEE conference on computer vision and pattern recognition

    , pages 770–778, 2016.
  • [16] Minh Thanh Khong, Yves Deville, Pierre Schaus, and Christophe Lecoutre. Efficient reification of table constraints. In 2017 IEEE 29th International Conference on Tools with Artificial Intelligence (ICTAI), pages 118–122. IEEE, 2017.
  • [17] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012.
  • [18] Olivier Lhomme. Practical reformulations with table constraints. In ECAI, pages 911–912, 2012.
  • [19] C. J. Lin. On the convergence of multiplicative update algorithms for nonnegative matrix factorization.

    IEEE Transactions on Neural Networks

    , 18(6):1589–1596, Nov 2007.
  • [20] Michele Lombardi, Michela Milano, and Andrea Bartolini. Empirical decision model learning. Artificial Intelligence, 244:343–367, 2017.
  • [21] Giuseppe Marra, Matteo Tiezzi, Stefano Melacci, Alessandro Betti, Marco Maggini, and Marco Gori. Local propagation in constraint-based neural network. arXiv preprint arXiv:2002.07720, 2020.
  • [22] David R Morrison, Sheldon H Jacobson, Jason J Sauppe, and Edward C Sewell. Branch-and-bound algorithms: A survey of recent advances in searching, branching, and pruning. Discrete Optimization, 19:79–102, 2016.
  • [23] Patrenahalli M. Narendra and Keinosuke Fukunaga. A branch and bound algorithm for feature subset selection. IEEE Transactions on computers, (9):917–922, 1977.
  • [24] Balas Kausik Natarajan. Sparse approximate solutions to linear systems. SIAM journal on computing, 24(2):227–234, 1995.
  • [25] Guillaume Perez, Michel Barlaud, Lionel Fillatre, and Jean-Charles Régin. A filtered bucket-clustering method for projection onto the simplex and the ball. Mathematical Programming, May 2019.
  • [26] Guillaume Perez and Jean-Charles Régin. Improving gac-4 for table and mdd constraints. In International Conference on Principles and Practice of Constraint Programming, pages 606–621. Springer, 2014.
  • [27] Jean-Charles Régin. A filtering algorithm for constraints of difference in csps. In AAAI, volume 94, pages 362–367, 1994.
  • [28] Francesca Rossi, Peter Van Beek, and Toby Walsh. Handbook of constraint programming. Elsevier, 2006.
  • [29] Daniel Selsam, Matthew Lamm, Benedikt Bünz, Percy Liang, Leonardo de Moura, and David L Dill. Learning a sat solver from single-bit supervision. arXiv preprint arXiv:1802.03685, 2018.
  • [30] Konstantinos Slavakis, Yannis Kopsinis, and Sergios Theodoridis. Adaptive algorithm for sparse system identification using projections onto weighted l1 balls. In Acoustics Speech and Signal Processing (ICASSP), 2010 IEEE International Conference on, pages 3742–3745. IEEE, 2010.
  • [31] Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological), pages 267–288, 1996.
  • [32] Yu-Xiong Wang and Yu-Jin Zhang. Nonnegative matrix factorization: A comprehensive review. IEEE Transactions on Knowledge and Data Engineering, 25(6):1336–1353, 2013.
  • [33] Laurence A Wolsey and George L Nemhauser. Integer and combinatorial optimization. John Wiley & Sons, 2014.