Structured Learning Modulo Theories

05/07/2014 ∙ by Stefano Teso, et al. ∙ Fondazione Bruno Kessler, Università di Trento

Modelling problems containing a mixture of Boolean and numerical variables is a long-standing interest of Artificial Intelligence. However, performing inference and learning in hybrid domains is a particularly daunting task. The ability to model this kind of domain is crucial in "learning to design" tasks, that is, learning applications where the goal is to learn from examples how to perform automatic de novo design of novel objects. In this paper we present Structured Learning Modulo Theories, a max-margin approach for learning in hybrid domains based on Satisfiability Modulo Theories, which combines Boolean reasoning and optimization over continuous linear arithmetical constraints. The main idea is to leverage a state-of-the-art generalized Satisfiability Modulo Theory solver to implement the inference and separation oracles of Structured Output SVMs. We validate our method on artificial and real-world scenarios.


1 Introduction

Research in machine learning has progressively widened its scope, from simple scalar classification and regression tasks to more complex problems involving multiple related variables. Methods developed in the related fields of statistical relational learning (SRL) Getoor and Taskar (2007) and structured-output learning Bakir et al. (2007) make it possible to learn, reason and perform inference about relational entities characterized by both hard and soft constraints. Most methods rely on some form of (finite) First-Order Logic (FOL) to encode the learning problem, and define the constraints as weighted logical formulae. One issue with these approaches is that First-Order Logic is not suited for efficient reasoning over hybrid domains, characterized by both continuous and discrete variables. The Booleanization of an n-bit integer variable requires 2^n distinct Boolean states, making naive translation impractical; for rational variables the situation is even worse. In addition, standard FOL automated reasoning techniques offer no mechanism to deal efficiently with operators among numerical variables, like comparisons (e.g. “less-than”, “equal-to”) and arithmetical operations (e.g. summation), limiting the range of realistically applicable constraints to those based solely on logical connectives. On the other hand, many real-world domains are inherently hybrid and require reasoning over inter-related continuous and discrete variables. This is especially true in constructive machine learning tasks, where the focus is on the de-novo design of objects with certain characteristics to be learned from examples (e.g. a recipe for a dish, with ingredients, proportions, etc.).

There is relatively little previous work on hybrid SRL methods. A number of approaches Lippi and Frasconi (2009); Broecheler et al. (2010); Kuželka et al. (2011); Diligenti et al. (2012) focused on the feature representation perspective, in order to extend statistical relational learning algorithms to deal with continuous features as inputs. On the other hand, performing inference over joint continuous-discrete relational domains is still a challenge. The few existing attempts Goodman et al. (2008); Wang and Domingos (2008); Närman et al. (2010); Gutmann et al. (2011); Choi and Amir (2012); Islam et al. (2012) aim at extending statistical relational learning methods to the hybrid domain. All these approaches focus on modeling the probabilistic relationships between variables. While this makes it possible to compute marginal probabilities in addition to most probable configurations, it imposes strong limitations on the type of constraints they can handle. Inference is typically run by approximate methods, based on variational approximations or sampling strategies. Exact inference, support for hard numeric (in addition to Boolean) constraints, and the combination of diverse theories, like linear algebra over rationals and integers, are out of the scope of these approaches. Hybrid Markov Logic networks Wang and Domingos (2008) and Church Goodman et al. (2008) are the two formalisms which are closest to the scope of this paper. Hybrid Markov Logic networks Wang and Domingos (2008) extend Markov Logic by including continuous variables, and allow the embedding of numerical comparison operators (namely equality and inequalities) into the constraints by defining an ad hoc translation of said operators to a continuous form amenable to numerical optimization. Inference relies on a stochastic local search procedure that interleaves calls to a MAX-SAT solver and to a numerical optimization procedure. This inference procedure is incapable of dealing with hard numeric constraints because of the lack of feedback from the continuous optimizer to the satisfiability module. Church Goodman et al. (2008) is a very expressive probabilistic programming language that can potentially represent arbitrary constraints on both continuous and discrete variables. Its focus is on modelling the generative process underlying the program, and inference is based on sampling techniques. This makes inference involving continuous optimization subtasks and hard constraints prohibitively expensive, as will be discussed in the experimental evaluation.

In order to overcome the limitations of existing approaches, we focused on the most recent advances in automated reasoning over hybrid domains. Researchers in automated reasoning and formal verification have developed logical languages and reasoning tools that allow for native reasoning over mixtures of Boolean and numerical variables (or even more complex structures). These languages are grouped under the umbrella term of Satisfiability Modulo Theories (SMT) Barrett et al. (2009). Each such language corresponds to a decidable fragment of First-Order Logic augmented with an additional background theory T. There are many such background theories, including those of linear arithmetic over the rationals (LRA) or over the integers (LIA), among others Barrett et al. (2009). In SMT, a formula can contain Boolean variables (i.e. 0-ary logical predicates) and connectives, mixed with symbols defined by the theory T, e.g. rational variables and arithmetical operators. For instance, the SMT(LRA) syntax allows one to write formulas such as:

where the variables are either Boolean or rational. (Note that SMT solvers also handle formulas over combinations of theories, e.g. formulas mixing integer variables, array variables and uninterpreted function symbols, where read and write are functions of the theory of arrays Barrett et al. (2009). However, for the scope of this paper it suffices to consider the LRA and LIA theories, and their combination.)
More specifically, SMT is a decision problem: it consists in finding an assignment to the variables of a quantifier-free formula, both the Boolean and the theory-specific ones, that makes the formula true, and it can be seen as an extension of SAT.
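As an illustration of what such hybrid formulas look like in practice, the following sketch encodes a small quantifier-free SMT(LRA) formula mixing Boolean and rational variables and checks its satisfiability. It uses the Python bindings of the Z3 solver (mentioned in Section 3) purely for convenience, and the formula itself is an assumption made up for illustration; it is not the example formula from the text.

```python
from z3 import Solver, Bool, Real, And, Or, Not, Implies, sat

A = Bool('A')                  # a 0-ary logical predicate (Boolean variable)
x, y = Real('x'), Real('y')    # rational (LRA) variables

s = Solver()
# Boolean structure mixed with linear-arithmetic atoms:
s.add(Implies(A, x + y >= 1))  # if A holds, x + y must be at least 1
s.add(Implies(Not(A), y < 0))  # otherwise y must be negative
s.add(And(x >= 0, x <= 1))     # x is confined to the unit interval
s.add(Or(A, x > y))            # at least one of these must hold

if s.check() == sat:
    print(s.model())           # a theory-consistent assignment to A, x and y
```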

Recently, researchers have leveraged SMT from decision to optimization. In particular, MAX-SAT Modulo Theories (MAX-SMT) Nieuwenhuis and Oliveras (2006); Cimatti et al. (2010, 2013) generalizes MAX-SAT Li and Manyà (2009) to SMT formulae, and consists in finding a theory-consistent truth assignment to the atoms of an input SMT formula which maximizes the total weight of its satisfied clauses. More generally, Optimization Modulo Theories (OMT) Nieuwenhuis and Oliveras (2006); Sebastiani and Tomasi (2012, 2014); Li et al. (2014) consists in finding a model of the formula which minimizes the value of some (arithmetical) term, and strictly subsumes MAX-SMT Sebastiani and Tomasi (2014). Most important for the scope of this paper is that there are high-quality OMT solvers which, at least for the LRA theory, can handle problems with thousands of hybrid variables.

In this paper we propose Learning Modulo Theories (LMT), a class of novel hybrid statistical relational learning methods. The main idea is to combine a solver able to deal with Boolean and rational variables with a structured-output learning method. In particular, we rely on structured-output Support Vector Machines (SVMs) Tsochantaridis et al. (2005); Joachims et al. (2009), a very flexible max-margin structured prediction method. Structured-output SVMs are a generalization of binary SVM classifiers to the prediction of structured outputs, like sequences or trees. They generalize the max-margin principle by learning to separate correct from incorrect output structures with a large margin. Training structured-output SVMs requires a separation oracle for generating counter-examples and updating the parameters, while using them at prediction stage requires an inference oracle generating the highest-scoring candidate structure for a given input. In order to implement the two oracles, we leverage a state-of-the-art OMT solver. This combination enables LMT to perform learning and inference in mixed Boolean-numerical domains. Thanks to the efficiency of the underlying OMT solver, and of the cutting-plane algorithm we employ for weight learning, LMT is capable of addressing constructive learning problems which cannot be efficiently tackled with existing methods. Furthermore, LMT is generic, and can in principle be applied to any of the existing background theories. This paper builds on previous work in which MAX-SMT was used for interactive preference elicitation Campigotto et al. (2011). Here we focus on generating novel structures/configurations from few prototypical examples, and cast the problem as supervised structured-output learning. Furthermore, we increase the expressive power from MAX-SMT to full OMT. This makes it possible to model much richer cost functions, for instance by penalizing an unsatisfied constraint by a cost proportional to its distance from satisfaction.

The rest of the paper is organized as follows. In Section 2 we review the relevant related work, with an in-depth discussion on all hybrid approaches and their relationships with our proposed framework. Section 3 provides an introduction to SMT and OMT technology. Section 4 reviews structured-output SVMs and shows how to cast LMT in this learning framework. Section 5 reports an experimental evaluation showing the potential of the approach. Finally, conclusions are drawn in Section 6.

2 Related Work

There is a body of work concerning the integration of relational and numerical data from a feature representation perspective, in order to effectively incorporate numerical features into statistical relational learning models. Lippi and Frasconi Lippi and Frasconi (2009) incorporate neural networks as feature generators within Markov Logic Networks, where neural networks act as numerical functions complementing the Boolean formulae of standard MLNs. Semantic Based Regularization Diligenti et al. (2012) is a framework for integrating logic constraints within kernel machines, by turning them into real-valued constraints using appropriate transformations (T-norms). The resulting optimization problem is no longer convex in general, and the authors suggest a stepwise approach that adds constraints in an incremental fashion, in order to solve progressively more complex problems. In Probabilistic Soft Logic Broecheler et al. (2010), arbitrarily complex similarity measures between objects are combined with logic constraints, again using T-norms for the continuous relaxation of Boolean operators. In Gaussian Logic Kuželka et al. (2011), numeric variables are modeled with multivariate Gaussian distributions. Their parameters are tied according to logic formulae defined over these variables, and combined with weighted first-order formulae modeling the discrete part of the domain (as in standard MLNs). All these approaches aim at extending statistical relational learning algorithms to deal with continuous features as inputs. On the other hand, our framework aims at allowing learning and inference over hybrid continuous-discrete domains, where continuous and discrete variables are the output of the inference process.

While a number of efficient lifted-inference algorithms have been developed for Relational Continuous Models Choi et al. (2010, 2011); Ahmadi et al. (2011), performing inference over joint continuous-discrete relational domains is still a challenge. The few existing attempts aim at extending statistical relational learning methods to the hybrid domain.

Hybrid Probabilistic Relational Models Närman et al. (2010) extend Probabilistic Relational Models (PRMs) to deal with hybrid domains by specifying templates for hybrid distributions, just as standard PRMs specify templates for discrete distributions. A template instantiation over a database defines a Hybrid Bayesian Network Murphy (1998); Lauritzen (1992). Inference in Hybrid BNs is known to be hard, and restrictions are typically imposed on the allowed relational structure (e.g. in conditional Gaussian models, discrete nodes cannot have continuous parents). On the other hand, LMT can accommodate arbitrary combinations of predicates from the theories for which a solver is available. These currently include linear arithmetic over both rationals and integers, as well as a number of other theories like strings, arrays and bit-vectors.

Relational Hybrid Models Choi and Amir (2012) (RHM) extend Relational Continuous Models to represent combinations of discrete and continuous distributions. The authors present a family of lifted variational algorithms for performing efficient inference, showing substantial improvements over their ground counterparts. As for most hybrid SRL approaches which will be discussed further on, the authors focus on efficiently computing probabilities rather than efficiently finding optimal configurations. Exact inference, hard constraints and theories like algebra over integers, which are naturally handled by our LMT framework, are all out of the scope of these approaches. Nonetheless, lifted inference is a powerful strategy to scale up inference and equipping OMT and SMT tools with lifting capabilities is a promising direction for future improvements.

The PRISM Sato (1995) system provides primitives for Gaussian distributions. However, inference is based on proof enumeration, which makes support for continuous variables very limited. Islam et al. Islam et al. (2012) recently extended PRISM to perform inference over continuous random variables by a symbolic procedure which avoids the enumeration of individual proofs. The extension makes it possible to encode models like Hybrid Bayesian Networks and Kalman Filters. Being built on top of the PRISM system, the approach assumes the exclusive explanation and independence property: no two different proofs for the same goal can be true simultaneously, and all random processes within a proof are independent (some research directions for lifting these restrictions have been suggested Islam (2012)). LMT makes no assumptions about the relationships between proofs.

Hybrid Markov Logic Networks Wang and Domingos (2008) extend Markov Logic Networks to deal with numeric variables. A Hybrid Markov Logic Network consists of both First-Order Logic formulae and numeric terms. Most probable explanation (MPE) inference is performed by a hybrid version of MAXWalkSAT, where optimization of numeric variables is performed by a general-purpose global optimization algorithm (L-BFGS). This approach is extremely flexible and makes it possible to encode arbitrary numeric constraints, like soft equalities and inequalities with quadratic or exponential costs. A major drawback of this flexibility is the computational cost: each single inference step on continuous variables requires solving a global optimization problem, making the approach infeasible for medium- to large-scale problems. Furthermore, this inference procedure is incapable of dealing with hard constraints involving numeric variables, as found for instance in layout problems (see e.g. the constraints on touching blocks or connected segments in the experimental evaluation). This is due to the lack of feedback from the continuous optimizer to the satisfiability module, which should inform it about conflicting constraints and help guide the search towards a more promising portion of the search space. Conversely, the OMT technology underlying LMT is built on top of SMT solvers and is hence specifically designed to tightly integrate theory-specific and SAT solvers Nieuwenhuis and Oliveras (2006); Cimatti et al. (2010); Sebastiani and Tomasi (2012, 2014); Li et al. (2014). Note that the tight interaction between theory-specific and modern CDCL SAT solvers, plus the many techniques developed for maximizing their synergy, are widely recognised as one key reason for the success of SMT solvers Barrett et al. (2009). Note also that previous attempts to substitute standard SAT solvers with WalkSAT inside an SMT solver have failed, producing a dramatic worsening of performance Griggio et al. (2011).

Hybrid ProbLog Gutmann et al. (2011) is an extension of the probabilistic logic language ProbLog De Raedt et al. (2007) to deal with continuous variables. A ProbLog program consists of a set of probabilistic Boolean facts, and a set of deterministic first-order logic formulae representing the background knowledge. Hybrid ProbLog introduces a set of probabilistic continuous facts, containing both discrete and continuous variables. Each continuous variable is associated with a probability density function. The authors show how to compute the probability of success of a query by partitioning the continuous space into admissible intervals, within which values are interchangeable with respect to the provability of the query. The drawback of this approach is that, in order to make this computation feasible, severe limitations have to be imposed on the use of continuous variables. No algebraic operations or comparisons are allowed between continuous variables, which must remain uncoupled. Some of these limitations have been overcome in a recent approach Gutmann et al. (2011) which performs inference by forward (i.e. from facts to rules) rather than backward reasoning, the typical inference process in (probabilistic) logic programming engines (SLD-resolution and its probabilistic extensions). Forward reasoning is more amenable to sampling strategies for performing approximate inference and dealing with continuous variables. On the other hand, inference by sampling makes it prohibitively expensive to reason with hard continuous constraints.

Church Goodman et al. (2008) is a very expressive probabilistic programming language that can easily accommodate hybrid discrete-continuous distributions and arbitrary constraints. In order to deal with the resulting complexity, inference is again performed by sampling techniques, which results in the same limitations mentioned above. Indeed, our experimental evaluation shows that Church is incapable of solving in reasonable time the simple task of generating a pair of blocks conditioned on the fact that they touch somewhere. (The only publicly available version of Hybrid ProbLog is the original one by Gutmann, Jaeger and De Raedt Gutmann et al. (2011), which does not support arithmetic over continuous variables. However, we have no reason to expect that the more recent version based on sampling would behave substantially differently from what we observe with Church.)

An advantage of these probabilistic inference approaches is that they can return marginal probabilities in addition to most probable explanations. This is actually the main focus of these approaches, and the reason why they are less suitable for solving the latter problem when the search space becomes strongly disconnected. Like most structured-output approaches on which it builds, LMT is currently limited to the task of finding an optimal configuration, which in a probabilistic setting corresponds to generating the most probable explanation. We are planning to extend it to also perform probability computation, as discussed in the conclusions of the paper.

3 From Satisfiability to Optimization Modulo Theories

Propositional satisfiability (SAT) is the problem of deciding whether a logical formula over Boolean variables and logical connectives can be satisfied by some truth value assignment of the Boolean variables. (CDCL SAT-solving algorithms, and the SMT-solving ones built on top of them, require the input formula to be in conjunctive normal form (CNF), i.e., a conjunction of clauses, each clause being a disjunction of propositions or of their negations. Since they pre-convert input formulae into CNF very effectively Prestwich (2009), we assume without loss of generality that input formulae may have any form.) In the last two decades we have witnessed an impressive advance in the efficiency of SAT solvers, which nowadays can handle industrial-derived formulae with up to millions of variables. Modern SAT solvers are based on the conflict-driven clause-learning (CDCL) schema Marques-Silva et al. (2009), and adopt a variety of very efficient search techniques Biere et al. (2009).

In the contexts of automated reasoning (AR) and formal verification (FV), important decision problems are effectively encoded into and solved as Satisfiability Modulo Theories (SMT) problems de Moura and Bjørner (2011). SMT is the problem of deciding the satisfiability of a (typically quantifier-free) first-order formula with respect to some decidable background theory, which can also be a combination of theories. Theories of practical interest are, e.g., those of equality and uninterpreted functions, of linear arithmetic over the rationals (LRA) or over the integers (LIA), of non-linear arithmetic over the reals, of arrays, of bit-vectors, and their combinations.

In the last decade efficient SMT solvers have been developed following the so-called lazy approach, which combines the power of modern CDCL SAT solvers with the expressivity of dedicated decision procedures for several first-order theories of interest. Modern lazy SMT solvers, like e.g. CVC4 (http://cvc4.cs.nyu.edu/), MathSAT5 (http://mathsat.fbk.eu/), Yices (http://yices.csl.sri.com/), and Z3 (http://research.microsoft.com/en-us/um/redmond/projects/z3/ml/z3.html), combine a variety of solving techniques coming from very heterogeneous domains. We refer the reader to Sebastiani (2007); Barrett et al. (2009) for an overview of lazy SMT solving, and to the URLs of the above solvers for a description of their supported theories and functionalities.

More recently, SMT has also been leveraged from decision to optimization. Optimization Modulo Theories (OMT) Nieuwenhuis and Oliveras (2006); Sebastiani and Tomasi (2012, 2014); Li et al. (2014) is the problem of finding a model for an SMT formula which minimizes the value of some arithmetical cost function. References Sebastiani and Tomasi (2012, 2014); Li et al. (2014) present general OMT procedures adding to SMT the capability of finding models that minimize LRA cost functions. This problem is denoted OMT(LRA) if only the LRA theory is involved in the SMT formula, and OMT(LRA ∪ T) if some other theories are involved. Such procedures combine standard lazy SMT solving with LP minimization techniques. OMT(LRA) procedures have been implemented in the OptiMathSAT tool (http://optimathsat.disi.unitn.it/), a sub-branch of MathSAT5.

Example 3.1

Consider the following toy LRA formula φ:

and the problem of finding the model of φ (if any) which makes the value of the cost term minimum. In fact, depending on the truth value of the Boolean variable occurring in φ, there are two possible alternative sets of constraints to minimize:

whose minimum-cost models are, respectively:

from which we can conclude that the latter is a minimum-cost model for φ.
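Since the concrete formula of Example 3.1 is not reproduced above, the following sketch sets up an analogous toy OMT(LRA) problem with Z3's Optimize interface (the paper itself relies on OptiMathSAT): a Boolean variable selects between two alternative sets of linear constraints, and the solver returns the model that minimizes the cost term. Both the formula and the cost are illustrative assumptions.

```python
from z3 import Optimize, Bool, Real, And, Implies, Not, sat

A = Bool('A')
x = Real('x')
cost = Real('cost')

opt = Optimize()
# Depending on the truth value of A, two alternative sets of constraints apply.
opt.add(Implies(A, And(x >= 2, cost == 3 * x)))
opt.add(Implies(Not(A), And(x >= 1, cost == 5 * x)))
opt.minimize(cost)              # OMT objective: minimize the LRA term 'cost'

if opt.check() == sat:
    m = opt.model()
    # The solver implicitly compares the two branch-wise minima (6 vs. 5 here)
    # and returns the overall minimum-cost model.
    print(m[A], m[x], m[cost])
```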

Overall, for the scope of this paper, it is important to highlight the fact that OMT solvers are available which, thanks to the underlying SAT and SMT technologies, can handle problems with a large number of hybrid variables (in the order of thousands, at least for the LRA theory).

To this extent, we notice that the underlying theories provide the meaning of, and the reasoning capabilities for, specific predicates and function symbols (e.g., arithmetical symbols such as “+” and “≤”, or the array-specific symbols “read(...)” and “write(...)”) that would otherwise be very difficult to describe, or to reason over, with logic-based automated reasoning tools (e.g., traditional first-order theorem provers cannot handle arithmetical reasoning efficiently) or with arithmetical ones (e.g., DLP, ILP, MILP, LGDP tools Balas (2010); Lodi (2009); Sawaya and Grossmann (2012) or CLP tools Jaffar and Maher (1994); Codognet and Diaz (1996); Jaffar et al. (1992) do not handle symbolic theory reasoning on theories like those of uninterpreted functions or arrays). Also, the underlying CDCL SAT solver allows SMT solvers to handle a large amount of Boolean reasoning very efficiently, which is typically out of the reach of both first-order theorem provers and arithmetical tools.

These facts motivate our choice of SMT/OMT technology, and hence of the OptiMathSAT tool, as the workhorse engine for reasoning in hybrid domains. Hereafter in the paper we consider only plain OMT(LRA).

Another prospective advantage of SMT technology is that modern SMT solvers (e.g., MathSAT5, Z3, …) have an incremental interface, which allows for solving sequences of “similar” formulae without restarting the search from scratch at each new formula, and instead reusing “common” parts of the search performed for previous formulae (see, e.g., Cimatti et al. (2013)). This drastically improves overall performance on sequences of similar formulae. An incremental extension of OptiMathSAT, fully exploiting that of MathSAT5, is currently available.

Note that a current limitation of SMT solvers is that, unlike traditional theorem provers, they typically handle efficiently only quantifier-free formulae. Attempts at extending SMT to quantified formulae have been made in the literature Rümmer (2008); Baumgartner and Tinelli (2011); Kruglov (2013), and a few SMT solvers (e.g., Z3) do provide some support for quantified formulae. However, the state of the art of these extensions is still far from satisfactory. Nonetheless, the method we present in this paper can be easily adapted to such extensions once they reach the required level of maturity.

4 Learning Modulo Theories using Cutting Planes

4.1 An introductory example

In order to introduce the LMT framework, we start with a toy learning example. We are given a unit-length bounding box that contains a given, fixed block (rectangle), as in Figure 1 (a). The block is identified by the four constants (x1, y1, dx1, dy1), where (x1, y1) indicates the bottom-left corner of the rectangle, and dx1, dy1 its width and height, respectively. Now, suppose that we are assigned the task of fitting another block, identified by the variables (x2, y2, dx2, dy2), in the same bounding box, so as to minimize the following cost function:

cost(x2, y2, dx2, dy2) = w_dx · dx2 + w_dy · dy2    (1)

with the additional requirements that (i) the two blocks “touch” either from above, below, or sideways, and (ii) the two blocks do not overlap.

It is easy to see that the weights w_dx and w_dy control the shape and location of the optimal solution. If both weights are positive, then the cost is minimized by any block of null size located along the perimeter of block 1. If both weights are negative and w_dx < w_dy, then the optimal block will be placed so as to occupy as much horizontal space as possible, while if w_dy < w_dx it will prefer to occupy as much vertical space as possible, as in Figure 1 (b,c). If w_dx and w_dy are close, then the optimal solution depends on the relative amount of available vertical and horizontal space in the bounding box.
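To make the toy problem concrete, here is a sketch of how it can be encoded as an OMT problem, again using Z3's Optimize in place of OptiMathSAT. The numeric values of the fixed block and of the weights, and the particular touch/no-overlap encoding, are illustrative assumptions consistent with conditions (i) and (ii).

```python
from z3 import Optimize, Real, And, Or, sat

# Fixed block 1 (constants) and weights: assumed values for illustration.
x1, y1, dx1, dy1 = 0.4, 0.4, 0.2, 0.2
w_dx, w_dy = -1.0, -2.0        # both negative, vertical space preferred (w_dy < w_dx)

# Block 2 (the output): bottom-left corner, width and height.
x2, y2 = Real('x2'), Real('y2')
dx2, dy2 = Real('dx2'), Real('dy2')

opt = Optimize()
# Hard constraints: block 2 is a valid block inside the unit bounding box.
opt.add(x2 >= 0, y2 >= 0, dx2 >= 0, dy2 >= 0, x2 + dx2 <= 1, y2 + dy2 <= 1)
# Hard constraint: the blocks touch along an edge and hence do not overlap.
overlap_x = And(x2 < x1 + dx1, x1 < x2 + dx2)
overlap_y = And(y2 < y1 + dy1, y1 < y2 + dy2)
opt.add(Or(And(x2 + dx2 == x1, overlap_y),   # block 2 touches block 1 from the left
           And(x1 + dx1 == x2, overlap_y),   # ... from the right
           And(y2 + dy2 == y1, overlap_x),   # ... from below
           And(y1 + dy1 == y2, overlap_x)))  # ... from above

# Cost function of Equation (1): w_dx * dx2 + w_dy * dy2.
opt.minimize(w_dx * dx2 + w_dy * dy2)

if opt.check() == sat:
    print(opt.model())          # an optimal block occupying as much vertical space as possible
```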

Figure 1: (a) Initial configuration. (b) Optimal configuration when horizontal space is preferred. (c) Optimal configuration when vertical space is preferred.

This toy example illustrates two key points. First, the problem involves a mixture of numerical variables (coordinates and sizes of block 2) and Boolean variables, along with hard rules that control the feasible space of the optimization procedure (conditions (i) and (ii)), and costs, or soft rules, which control the shape of the optimization landscape. This is the kind of problem that can be solved in terms of Optimization Modulo Linear Arithmetic, OMT(LRA). Second, it is possible to estimate the weights w_dx and w_dy from data, in order to learn what kind of blocks are to be considered optimal. In the following we will describe how such a learning task can be framed within the structured-output SVM framework.

4.2 Notation

Boolean variables (e.g. above, right, …)
Rational variables (written as lower-case letters)
Complete object, consisting of an input (observed) part and an output (query) part
Constraints
Indicator function for a Boolean constraint over the object
Cost function for an arithmetical constraint over the object
Feature representation of the complete object
Feature associated to a Boolean constraint
Feature associated to an arithmetical constraint
Weights
Table 1: Explanation of the notation used throughout the text.

We consider the problem of learning from a training set of complex objects, where each object is represented as a set of Boolean and rational variables. We indicate Boolean variables using predicates such as above or right, and write rational variables as lower-case letters. Please note that while we write Boolean variables using a First-Order syntax for readability, our method does require the grounding of all Boolean predicates prior to learning and inference. In the present formulation, we assume objects to be composed of two parts: the input (or observed) part and the output (or query) part. (We depart from the conventional notation for input/output pairs to avoid name clashes with the block coordinate variables.) The learning problem is defined by a set of constraints, each of which is either a Boolean- or rational-valued function of the object. For each Boolean-valued constraint, we use an indicator function which evaluates to 1 if the constraint is satisfied and to −1 otherwise (the choice of −1 to represent falsity is customary in the max-margin literature). Similarly, each real-valued constraint has an associated cost. The feature space representation of an object is given by a feature vector, which is a function of the constraints. Each soft constraint has an associated finite weight (to be learned from the data), while hard constraints have no associated weight. We denote the vector of learned weights as w. Table 1 summarizes the notation used throughout the text.

4.3 A Structural SVM approach to LMT

Structured-output SVMs Tsochantaridis et al. (2005) are a very flexible framework that generalizes max-margin methods to the prediction of complex outputs such as strings, trees and graphs. In this setting the association between inputs and outputs is controlled by a so-called compatibility function, defined as a linear combination of the joint feature space representation of the input-output pair. Inference amounts to finding the most compatible output for a given input, which equates to solving the following optimization problem:

(2)

Performing inference on structured domains is non-trivial, since the maximization ranges over an exponential (and possibly unbounded) number of candidate outputs.

Learning is formulated within the regularized empirical risk minimization framework. In order to learn the weights from a training set of examples, one needs to define a non-negative loss function that, for any given observation, quantifies the penalty incurred when predicting an output different from the correct one. Learning can then be expressed as the problem of finding the weights that minimize the per-instance error and the model complexity Tsochantaridis et al. (2005):

(3)

Here the constraints require that the compatibility between any input and its corresponding correct output is always higher than that with all wrong outputs by a margin, with slack variables playing the role of per-instance violations. This formulation is called n-slack margin rescaling, and it is the original and most accessible formulation of structured-output SVMs. See Joachims et al. (2009) for an extensive exposition of alternative formulations.

Weight learning is a quadratic program, and can be solved very efficiently with a cutting-plane (CP) algorithm Tsochantaridis et al. (2005). Since in Eq (3) there is an exponential number of constraints, it is infeasible to naively account for all of them during learning. Based on the observation that the constraints obey a subsumption relation, the CP algorithm Joachims et al. (2009) sidesteps the issue by keeping a working set of active constraints: at each iteration, it augments the working set with the most violated constraint, and then solves the corresponding reduced quadratic program using a standard SVM solver. This procedure is guaranteed to find an ε-approximate solution to the QP in a polynomial number of iterations, independently of the cardinality of the output space and of the number of examples Tsochantaridis et al. (2005). The n-slack margin rescaling version of the CP algorithm can be found in Algorithm 1 (adapted from Joachims et al. (2009)). Please note that in our experiments we make use of the faster, but otherwise equivalent, 1-slack margin rescaling variant Joachims et al. (2009). We report the n-slack margin rescaling version here for ease of exposition.

Data: Training instances, regularization parameter C, tolerance ε
Result: Learned weights w
1        initialize the working set of active constraints of every training instance to the empty set;
2        repeat
3               for each training instance do
4                      compute the output yielding the most violated constraint for that instance (separation oracle);
5                      if its violation exceeds the current slack by more than ε then
6                             add it to the instance's working set, and re-solve the reduced quadratic program
                              s.t. the constraints currently in the working sets;
7                      end if
8               end for
9        until no working set has changed during the iteration;
return the learned weights w
Algorithm 1 Cutting-plane algorithm for training structural SVMs, according to the n-slack formulation presented in Joachims et al. (2009).

The CP algorithm is generic, meaning that it can be adapted to any structured prediction problem, as long as it is provided with: (i) a joint feature space representation of input-output pairs (and consequently a compatibility function); (ii) an oracle to perform inference, i.e. to solve Equation (2); and (iii) an oracle to retrieve the most violated constraint of the QP, i.e. to solve the separation problem:

(4)

The two oracles are used as sub-routines during the optimization procedure. For a more detailed account, and in particular for the derivation of the separation oracle formulation, please refer to Tsochantaridis et al. (2005).

One key aspect of structured-output SVMs is that efficient implementations of the two oracles are fundamental for the learning task to be tractable in practice. The idea behind Learning Modulo Theories is that, when a hybrid Boolean-numerical problem can be encoded in SMT, the two oracles can be implemented using an Optimization Modulo Theories solver. This is precisely what we propose in the present paper. In the following sections we show how to define a feature space for hybrid Boolean-numerical learning problems, and how to use OMT solvers to efficiently perform inference and separation.
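The following sketch outlines how the n-slack cutting-plane loop of Algorithm 1 interacts with the two oracles. It is a schematic rendition rather than the paper's actual implementation (which couples SVMstruct with OptiMathSAT): the separation oracle, the joint feature map, the loss and the reduced-QP solver are supplied as callables, and implementing the oracles on top of an OMT solver such as OptiMathSAT or Z3 is left to the caller.

```python
from typing import Any, Callable, List, Tuple

def cutting_plane(train: List[Tuple[Any, Any]],                  # (input, correct output) pairs
                  psi: Callable[[Any, Any], List[float]],        # joint feature map
                  loss: Callable[[Any, Any], float],             # structured loss
                  separation: Callable[[List[float], Any, Any], Any],  # most-violated output (OMT)
                  solve_qp: Callable[[List[Tuple[Any, Any]], List[List[Any]]],
                                     Tuple[List[float], List[float]]],  # reduced QP solver
                  eps: float = 1e-3, max_iter: int = 100) -> List[float]:
    """Schematic n-slack cutting-plane loop for structured-output SVMs."""
    n = len(train)
    w = [0.0] * len(psi(*train[0]))            # weight vector
    xi = [0.0] * n                             # per-example slacks
    working = [[] for _ in range(n)]           # working sets of active constraints

    for _ in range(max_iter):
        changed = False
        for i, (I, O_true) in enumerate(train):
            # Separation oracle: candidate output with the largest (loss + compatibility).
            O_hat = separation(w, I, O_true)
            margin = sum(wk * (a - b)
                         for wk, a, b in zip(w, psi(I, O_true), psi(I, O_hat)))
            if loss(O_true, O_hat) - margin > xi[i] + eps:
                working[i].append(O_hat)       # add the violated constraint
                changed = True
                w, xi = solve_qp(train, working)   # re-solve the reduced QP (delegated)
        if not changed:
            break                              # no working set changed: done
    return w
```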

4.4 Learning Modulo Theories with OMT

Let us formalize the previous toy example in the language of LMT. In the following we give a step-by-step description of all the building blocks of an LMT problem: the background knowledge, the hard and soft constraints, the cost function, and the loss function.

Input, Output, and Background Knowledge

Here the input to the problem is the observed block 1, while the output is the generated block 2. In order to encode the set of constraints that underlie both the learning and the inference problems, it is convenient to first introduce a background knowledge of predicates expressing facts about the relative positioning of blocks. To this end we add a fresh predicate encoding the fact that "a generic block touches a second block from the left", defined as follows:

Similarly, we add analogous predicates for the remaining directions, i.e. touching from below, from the right, and from above (see Figure 2 for the full definitions).

Hard constraints

The hard constraints represent the fact that the output should be a valid block within the bounding box (all the constraints are implicitly conjoined):

Then we require the output block 2 to "touch" the input block 1:

Note that whenever this rule is satisfied, both conditions (i) and (ii) of the toy example hold, i.e. touching blocks never overlap.

Touching blocks: one block touches the other from the left; from below; from the right; from above.
Figure 2: Background knowledge used in the toy block example.

Cost function

Finally, we encode the cost function, completing the description of the optimization problem. In the following we will see that the definition of the cost function also implicitly defines the set of features, or equivalently the set of soft constraints, of the LMT problem.

Soft constraints and Features

Now, suppose we were given a training set of instances analogous to those pictured in Figure 1 (c), i.e. where the supervision includes output blocks that preferentially fill as much vertical space as possible. The learning algorithm should be able to learn this preference by inferring appropriate weights. This kind of learning task can be cast within the structured SVM framework, by defining an appropriate joint feature space and oracles for the inference and separation problems.

Let us focus on the feature space first. Our definition is grounded on the concept of reward assigned to an object with respect to the set of formulae. We construct the feature vector by collating the per-formula rewards, where the reward of a Boolean constraint is given by its indicator function and the reward of a real-valued constraint is derived from the cost associated to it; please refer to Table 1 for more details. In other words, the feature representation of a complex object is the vector of all indicator/cost functions associated to the soft constraints. Returning to the toy example, where the cost function is the one given in Equation 1, the feature vector of an instance is simply (−dx2, −dy2), which reflects the size of the output block 2. The negative sign here is due to interpreting the features as rewards (to be maximized), while the corresponding soft constraints can be seen as costs (to be minimized); see Eq 5 where this relationship is made explicit.

According to this definition both satisfied and unsatisfied rules contribute to the total reward, and two objects that satisfy/violate similar sets of constraints will be close in feature space. The compatibility function computes the (weighted) total reward assigned to an object with respect to the constraints. Using this definition, the maximization in the inference problem (Equation 2) can be seen as attempting to find the output that maximizes the total reward with respect to the input and the rules, or equivalently the one with minimum cost. Since these rewards can be expressed in terms of Satisfiability Modulo Linear Arithmetic, the latter minimization problem can be readily cast as an OMT problem. Translating back to the example, maximizing the compatibility function boils down to:

maximize  w_dx · (−dx2) + w_dy · (−dy2)    (5)

which is exactly the cost minimization problem in Equation 1.

Loss function

The loss function determines the dissimilarity between output structures, which in our case contain a mixture of Boolean and rational variables. We observe that by picking a loss function expressible as an OMT(LRA) problem, we can readily use the same OMT solver used for inference to also solve the CP separation oracle (Equation (4)). This can be achieved by selecting, for example, a Hamming loss in feature space, i.e. the sum over all features of the absolute difference between the feature values of the correct and of the predicted output. This loss function is piecewise-linear, and as such satisfies the desideratum.
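To illustrate why such a loss keeps separation inside OMT(LRA), the sketch below writes a feature-space Hamming loss directly as an LRA term over the solver's feature expressions, encoding absolute values with If; adding it to the compatibility difference gives a separation objective that a single OMT call can maximize. The two-feature setup, the weights and the bounds are illustrative assumptions.

```python
from z3 import Optimize, Real, If, sat

def abs_lra(t):
    return If(t >= 0, t, -t)        # |t| as a piecewise-linear LRA term

# Features of a candidate output (solver variables) vs. features of the correct output.
psi = [Real('psi_0'), Real('psi_1')]    # e.g. (-dx2, -dy2) of a candidate block
psi_true = [-0.2, -0.6]                 # features of the supervised output (assumed)
w = [0.5, 1.5]                          # current weight vector (assumed)

# Hamming loss in feature space: sum of absolute per-feature differences.
loss = sum(abs_lra(p - pt) for p, pt in zip(psi, psi_true))
# Compatibility of the candidate minus that of the correct output.
compat_gap = sum(wk * (p - pt) for wk, p, pt in zip(w, psi, psi_true))

opt = Optimize()
opt.add(psi[0] >= -1, psi[0] <= 0, psi[1] >= -1, psi[1] <= 0)   # toy feasible region
opt.maximize(loss + compat_gap)         # separation oracle: most violated candidate
if opt.check() == sat:
    print(opt.model())
```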

Since both the inference and separation oracles required by the CP algorithm can be encoded in OMT(LRA), we can apply an OMT solver to efficiently solve the learning task. In particular, our current implementation is based on a vanilla copy of SVMstruct (http://www.cs.cornell.edu/people/tj/svm_light/svm_struct.html), which acts as a cutting-plane solver, whereas inference and separation are implemented with the OptiMathSAT OMT solver.

To summarize, an LMT problem can be broken down into several components: a background knowledge, a set of soft and hard constraints, a cost function and a loss function. The background knowledge amounts to a set of SMT formulae and constants useful for encoding the problem constraints, which in turn determine the relation between inputs and outputs. The hard constraints define the space of candidate outputs, while the soft constraints correspond one-to-one to features. The overall cost function is a linear combination of the dissatisfaction/cost (or, equivalently, satisfaction/reward) associated to the individual soft constraints, and as such is controlled entirely by the choice of features. Finally, the loss function determines the dissimilarity between output structures. While in the present paper we focused on a Hamming loss in feature space, LMT can work with any loss function that can be encoded as an SMT formula.

5 Experimental Evaluation

In the following we evaluate LMT on two novel learning tasks that stress its ability to deal with rather complex mixed Boolean-numerical problems.

5.1 Stairway to Heaven

In this section we are interested in learning how to assemble different kinds of stairways from examples. For the purpose of this paper, a stairway is simply a collection of blocks (rectangles) located within a two-dimensional, unit-sized bounding box. Clearly not all possible arrangements of blocks form a stairway; a stairway must satisfy the following conditions: (i) the first block touches either the top or the bottom corner of the left edge of the bounding box; (ii) the last block touches the opposite corner at the right edge of the bounding box; (iii) there are no gaps between consecutive blocks; (iv) consecutive blocks must actually form a step; and (v) no two blocks overlap. Note that the property of "being a stairway" is a collective property of the blocks.

More formally, each block i consists of four rational variables: the origin (x_i, y_i), which indicates the bottom-left corner of the block, a width dx_i and a height dy_i; the top-right corner of the block is (x_i + dx_i, y_i + dy_i). A stairway is simply an assignment to all variables that satisfies the above conditions.

Our definition does not impose any constraint on the orientation of stairways: it is perfectly legitimate to have left stairways that start at the top-left corner of the bounding box and reach the bottom-right corner, and right stairways that connect the bottom-left corner to the top-right one. For instance, a left stairway can be defined with the following block assignment (see Figure 3 (a)):

Similarly, a right stairway is obtained with the assignment (Figure 3 (b)):

We also note that the above conditions do not impose any explicit restriction on the width and height of individual blocks (as long as consecutive ones are in contact and there is no overlap). Consequently we allow both for ladder stairways, where the total amount of vertical and horizontal surface of the individual blocks is minimal, as in Figure 3 (a) and (b), and for pillar stairways, where either the vertical or horizontal block lengths are maximized, as in Figure 3 (c). There is of course an uncountable number of intermediate stairways that do not belong to either of these categories.

Figure 3: (a) A left ladder stairway. (b) A right ladder stairway. (c) A right pillar stairway. (d) A block assignment that violates conditions (i), (ii) and (iv), and as such does not form a stairway.

Inference amounts to generating a set of variable assignments to all blocks, so that none of conditions (i)-(v) is violated and the cost of the soft rules is minimized. This can be easily encoded as an OMT(LRA) problem. As a first step, we define a background knowledge of useful predicates. We use four predicates to encode the fact that a block may touch one of the four corners of the bounding box, namely the bottom-left, bottom-right, top-left and top-right corners, which can be written as, e.g.:

We also define predicates describing the relative positions of two blocks i and j; for instance, one such predicate

encodes the fact that one block is touching the other from the left. Similarly, we also define predicates for touching from below and from above. Finally, and most importantly, we combine the above predicates to define the concept of step, i.e. two blocks i and j that are both touching and positioned so as to form a stair, yielding the predicate left_step(i,j):

We define right_step(i,j) in the same manner. For a complete description of the background knowledge, see Table 2.

Corners: block at bottom-left corner; block at bottom-right corner; block at top-left corner; block at top-right corner.
Relative block positions: block touches block from the left; block touches block from below; block touches block from above.
Steps: left step; right step.
Table 2: Background knowledge used in the stairways experiment.

The background knowledge makes it possible to encode the property of being a left stairway as:

Analogously, any right stairway satisfies the following condition:

However, our inference procedure does not have access to this knowledge. We rather encode an appropriate set of soft rules (costs) which, along with the associated weights, should bias the optimization towards block assignments that form a stairway of the correct type.

We include a few hard rules to constrain the space of admissible block assignments. We require that all blocks fall within the bounding box:

We also require that blocks do not overlap:

Finally, we require (without loss of generality) the blocks to be ordered from left to right.

Note that only condition (v) is modelled as a hard constraint. The others are implicitly part of the problem cost. Our cost model is based on the observation that it is possible to discriminate between the different stairway types using only four factors: minimum and maximum step size, and the amount of horizontal and vertical material. These four factors are useful features for discriminating between the different stairway types without having to resort to quadratic terms, e.g. the areas of the individual blocks. For instance, in the cost we account for both the maximum step height over all left steps (a good stairway should not have steps that are too high):

and the minimum step width over all right steps (good stairways should have sufficiently large steps):

The value of these costs depends on whether a pair of blocks actually forms a left step, a right step, or no step at all. Note that these costs are multiplied by the number of blocks. This renormalizes the costs according to the number of steps; e.g. the step height of a stairway with twice as many uniform steps is half as large. Finally, we also include the average amount of vertical material among the cost terms. All the other costs can be written similarly; see Table 3 for the complete list. As we will see, the normalization of the individual costs makes it possible to learn weights that generalize to stairways with a larger number of blocks than those seen during training.

Putting all the pieces together, the complete cost is:

Minimizing the weighted cost implicitly requires the inference engine to decide whether it is preferable to generate a left or a right stairway, thanks to the step-related components, and whether the stairway should be a ladder or a pillar stairway, due to the vertical and horizontal material components. The actual weights are learned, allowing the learnt model to reproduce whichever stairway type is present in the training data.
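As a sketch of how such cost terms stay within OMT(LRA), the snippet below builds a "maximum step height over all left steps" term with an auxiliary variable and If-based conditionals, plus the average amount of vertical material as a plain linear term. It reuses the block and left_step helpers in the form assumed in the previous sketch; the exact functional form and weights of the paper's cost are not reproduced here.

```python
from z3 import Optimize, Real, If, And

def block(i):
    return (Real(f'x_{i}'), Real(f'y_{i}'), Real(f'dx_{i}'), Real(f'dy_{i}'))

def left_step(bi, bj):
    # plausible formalization, as in the previous sketch
    xi, yi, dxi, dyi = bi
    xj, yj, dxj, dyj = bj
    return And(xj + dxj == xi, yj < yi + dyi, yi < yj + dyj, yj + dyj > yi + dyi)

n = 4
blocks = [block(i) for i in range(n)]
opt = Optimize()

# Maximum step height over all left steps: an auxiliary variable bounded from below by
# the height of block j whenever the consecutive pair (i, j) forms a left step, 0 otherwise.
max_step_h = Real('max_step_h')
opt.add(max_step_h >= 0)
for bi, bj in zip(blocks, blocks[1:]):
    opt.add(max_step_h >= If(left_step(bi, bj), bj[3], 0))

# Average amount of vertical material: a plain linear LRA term.
vertical_material = sum(b[3] for b in blocks) / n

# Both terms enter the weighted cost (weights are learned; fixed here for illustration).
opt.minimize(2 * max_step_h + 1 * vertical_material)
```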

(a) Hard constraints
Bounding box
No overlap
Blocks left to right
(b) Soft constraints (features)
Max step height left
Min step height left
Max step height right
Min step height right
Max step width left
Min step width left
Max step width right
Min step width right
Vertical material
Horizontal material
(c) Cost