# Train and Test Tightness of LP Relaxations in Structured Prediction

Structured prediction is used in areas such as computer vision and natural language processing to predict structured outputs such as segmentations or parse trees. In these settings, prediction is performed by MAP inference or, equivalently, by solving an integer linear program. Because of the complex scoring functions required to obtain accurate predictions, both learning and inference typically require the use of approximate solvers. We propose a theoretical explanation for the striking observation that approximations based on linear programming (LP) relaxations are often tight on real-world instances. In particular, we show that learning with LP relaxed inference encourages integrality of training instances, and that tightness generalizes from train to test data.


## 1 Introduction

Many applications of machine learning can be formulated as prediction problems over structured output spaces
(Bakir et al., 2007; Nowozin et al., 2014). In such problems output variables are predicted jointly in order to take into account mutual dependencies between them, such as high-order correlations or structural constraints (e.g., matchings or spanning trees). Unfortunately, the improved expressive power of these models comes at a computational cost, and indeed, exact prediction and learning become NP-hard in general. Despite this worst-case intractability, efficient approximations often achieve very good performance in practice. In particular, one type of approximation which has proved effective in many applications is based on linear programming (LP) relaxation. In this approach the prediction problem is first cast as an integer LP (ILP), and then the integrality constraints are relaxed to obtain a tractable program. In addition to achieving high prediction accuracy, it has been observed that LP relaxations are often tight in practice. That is, the solution to the relaxed program happens to be optimal for the original hard problem (an integral solution is found). This is particularly surprising since the LPs have complex scoring functions that are not constrained to be from any tractable family. A major open question is to understand why these real-world instances behave so differently from the theoretical worst case.

This paper aims to address this question and to provide a theoretical explanation for the tightness of LP relaxations in the context of structured prediction. In particular, we show that the approximate training objective, although designed to produce accurate predictors, also induces tightness of the LP relaxation as a byproduct. Our analysis also suggests that exact training may have the opposite effect. To explain tightness of test instances, we prove a generalization bound for tightness. Our bound implies that if many training instances are integral, then test instances are also likely to be integral. Our results are consistent with previous empirical findings, and to our knowledge provide the first theoretical justification for the wide-spread success of LP relaxations for structured prediction in settings where the training data is not linearly separable.

## 2 Related Work

Many structured prediction problems can be represented as ILPs (Roth and Yih, 2005; Martins et al., 2009a; Rush et al., 2010). Although these problems are NP-hard in general (Roth, 1996; Shimony, 1994), various effective approximations have been proposed. These include both search-based methods (Daumé III et al., 2009; Zhang et al., 2014) and natural LP relaxations of the hard ILP (Schlesinger, 1976; Koster et al., 1998; Chekuri et al., 2004; Wainwright et al., 2005). Tightness of LP relaxations for special classes of problems has been studied extensively in recent years, typically by restricting either the structure of the model or its score function. For example, the pairwise LP relaxation is known to be tight for tree-structured models and for supermodular scores (see, e.g., Wainwright and Jordan, 2008; Thapper and Živný, 2012), and the cycle relaxation (equivalently, the second level of the Sherali-Adams hierarchy) is known to be tight both for planar Ising models with no external field (Barahona, 1993) and for almost balanced models (Weller et al., 2016). To facilitate efficient prediction, one could restrict the model class to be tractable. For example, Taskar et al. (2004) learn supermodular scores, and Meshi et al. (2013) learn tree structures.

However, the sufficient conditions mentioned above are by no means necessary, and indeed, many score functions that are useful in practice do not satisfy them but still produce integral solutions (Roth and Yih, 2004; Sontag et al., 2008; Finley and Joachims, 2008; Martins et al., 2009b; Koo et al., 2010). For example, Martins et al. (2009b) showed that predictors that are learned with LP relaxation yield integral LPs on a large fraction of the test data on a dependency parsing problem (see Table 2 therein). Koo et al. (2010) observed a similar behavior for dependency parsing on a number of languages, as can be seen in Fig. 1 (kindly provided by the authors). The same phenomenon has been observed for a multi-label classification task, where test integrality was similarly high (Finley and Joachims, 2008, Table 3).

Learning structured output predictors from labeled data was proposed in various forms by Collins (2002); Taskar et al. (2003); Tsochantaridis et al. (2004). These formulations generalize training methods for binary classifiers, such as the Perceptron algorithm and support vector machines (SVMs), to the case of structured outputs. The learning algorithms repeatedly perform prediction, necessitating the use of approximate inference within training as well as at test time. A common approach, introduced right at the inception of structured SVMs by Taskar et al. (2003), is to use LP relaxations for this purpose.

The most closely related work to ours is Kulesza and Pereira (2007), which showed that not all approximations are equally good, and that it is important to match the inference algorithms used at train and test time. The authors defined the concept of algorithmic separability, which refers to the setting where an approximate inference algorithm achieves zero loss on a data set. The authors studied the use of LP relaxations for structured learning, giving generalization bounds for the true risk of LP-based prediction. However, since the generalization bounds in Kulesza and Pereira (2007) are focused on prediction accuracy, the only settings in which tightness on test instances can be guaranteed are when the training data is algorithmically separable, which is seldom the case in real-world structured prediction tasks (the models are far from perfect). Our paper’s main result (Theorem 4.1), on the other hand, guarantees that the expected fraction of test instances for which an LP relaxation is integral is close to that which was estimated on training data. This then allows us to talk about the generalization of computation. For example, suppose one uses LP relaxation-based algorithms that iteratively tighten the relaxation, such as Sontag and Jaakkola (2008); Sontag et al. (2008), and observes that 20% of the instances in the training data are integral using the pairwise relaxation and that after tightening using cycle constraints the remaining 80% are now integral too. Our generalization bound then guarantees that approximately the same ratio will hold at test time (assuming sufficient training data).

Finley and Joachims (2008) also studied the effect of various approximate inference methods in the context of structured prediction. Their theoretical and empirical results also support the superiority of LP relaxations in this setting. Martins et al. (2009b) established conditions which guarantee algorithmic separability for LP relaxed training, and derived risk bounds for a learning algorithm which uses a combination of exact and relaxed inference.

Finally, recently Globerson et al. (2015) studied the performance of structured predictors for 2D grid graphs with binary labels from an information-theoretic point of view. They proved lower bounds on the minimum achievable expected Hamming error in this setting, and proposed a polynomial-time algorithm that achieves this error. Our work is different since we focus on LP relaxations as an approximation algorithm, we handle the most general form without making any assumptions on the model or error measure (except score decomposition), and we concentrate solely on the computational aspects while ignoring any accuracy concerns.

## 3 Background

In this section we review the formulation of the structured prediction problem, its LP relaxation, and the associated learning problem. Consider a prediction task where the goal is to map a real-valued input vector $x$ to a discrete output vector $y = (y_1, \ldots, y_n)$. A popular model class for this task is based on linear classifiers. In this setting prediction is performed via a linear discriminant rule: $y(x; w) = \arg\max_{y'} w^\top \phi(x, y')$, where $\phi(x, y) \in \mathbb{R}^d$ is a function mapping input-output pairs to feature vectors, and $w \in \mathbb{R}^d$ is the corresponding weight vector. Since the output space is often huge (exponential in $n$), it will generally be intractable to maximize over all possible outputs.

In many applications the score function has a particular structure. Specifically, we will assume that the score decomposes as a sum of simpler score functions: $w^\top \phi(x, y) = \sum_c w_c^\top \phi_c(x, y_c)$, where $y_c$ is an assignment to a (non-exclusive) subset of the variables $c$. For example, it is common to use such a decomposition that assigns scores to single and pairs of output variables corresponding to nodes and edges of a graph $G = (V, E)$: $w^\top \phi(x, y) = \sum_{i \in V} w_i^\top \phi_i(x, y_i) + \sum_{ij \in E} w_{ij}^\top \phi_{ij}(x, y_i, y_j)$. Viewing this as a function of $y_c$, we can write the prediction problem as: $\max_y \sum_c \theta_c(y_c; x, w)$ (we will sometimes omit the dependence on $x$ and $w$ in the sequel).

Due to its combinatorial nature, the prediction problem is generally NP-hard. Fortunately, efficient approximations have been proposed. Here we will be particularly interested in approximations based on LP relaxations. We begin by formulating prediction as the following ILP (for convenience we introduce singleton factors $\theta_i$, which can be set to $0$ if needed):

$$\max_{\substack{\mu \in \mathcal{M}_L \\ \mu \in \{0,1\}^q}} \; \sum_c \sum_{y_c} \mu_c(y_c)\,\theta_c(y_c) + \sum_i \sum_{y_i} \mu_i(y_i)\,\theta_i(y_i) \;=\; \theta^\top \mu$$

Here, $\mu_c(y_c)$ is an indicator variable for a factor $c$ and local assignment $y_c$, and $q$ is the total number of factor assignments (dimension of $\mu$). The set $\mathcal{M}_L$ is known as the local marginal polytope (Wainwright and Jordan, 2008). First, notice that there is a one-to-one correspondence between feasible $\mu$'s and assignments $y$'s, which is obtained by setting $\mu$ to indicators over local assignments ($y_c$ and $y_i$) consistent with $y$. Second, while solving ILPs is NP-hard in general, it is easy to obtain a tractable program by relaxing the integrality constraints $\mu \in \{0,1\}^q$, which may introduce fractional solutions to the LP. This relaxation is the first level of the Sherali-Adams hierarchy (Sherali and Adams, 1990), which provides successively tighter LP relaxations of an ILP. Notice that since the relaxed program is obtained by removing constraints, its optimal value upper bounds the ILP optimum.
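The correspondence between assignments and integral indicator vectors can be checked on a toy model. The following is a minimal sketch, not the authors' code: the 3-node chain, the numeric scores, and the helper names (`score`, `indicator_vector`, `linear_score`) are all made up for illustration, and the ILP is solved by brute-force enumeration rather than by an LP solver.

```python
from itertools import product

# Hypothetical toy model: 3 binary variables on a chain 0 - 1 - 2.
nodes = [0, 1, 2]
edges = [(0, 1), (1, 2)]
labels = [0, 1]

# Made-up singleton scores theta_i(y_i), for illustration only.
theta_node = {
    (0, 0): 0.0, (0, 1): 1.0,
    (1, 0): 0.5, (1, 1): 0.0,
    (2, 0): 0.0, (2, 1): 0.2,
}
# Made-up pairwise scores theta_ij(y_i, y_j): a small reward for agreement.
theta_edge = {(e, a, b): (0.1 if a == b else 0.0)
              for e in edges for a in labels for b in labels}


def score(y):
    """Decomposed score: sum of node and edge scores of assignment y."""
    total = sum(theta_node[(i, y[i])] for i in nodes)
    total += sum(theta_edge[((i, j), y[i], y[j])] for (i, j) in edges)
    return total


def indicator_vector(y):
    """Integral mu: one 0/1 indicator per (factor, local assignment)."""
    mu = {(i, a): int(y[i] == a) for i in nodes for a in labels}
    mu.update({((i, j), a, b): int(y[i] == a and y[j] == b)
               for (i, j) in edges for a in labels for b in labels})
    return mu


def linear_score(mu):
    """theta^T mu, summed over all factor assignments."""
    theta = {**theta_node, **theta_edge}
    return sum(theta[key] * value for key, value in mu.items())


# Brute-force ILP solution: enumerate all 2^3 assignments.
best_y = max(product(labels, repeat=len(nodes)), key=score)
q = len(indicator_vector(best_y))  # number of factor assignments
```

For every assignment `y`, `linear_score(indicator_vector(y))` agrees with `score(y)`, which is the sense in which the ILP over integral points of the local polytope is the same problem as maximizing over outputs. Enumeration is only viable for toy sizes; it is used here to keep the sketch free of solver dependencies.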

In order to achieve high prediction accuracy, the parameters $w$ are learned from training data. In this supervised learning setting, the model is fit to labeled examples $\{(x^{(m)}, y^{(m)})\}_{m=1}^{M}$, where the goodness of fit is measured by a task-specific loss $\Delta(y(x^{(m)}; w), y^{(m)})$. In the structured SVM (SSVM) framework (Taskar et al., 2003; Tsochantaridis et al., 2004), the empirical risk is upper bounded by a convex surrogate called the structured hinge loss, which yields the training objective (for brevity, we omit the regularization term; however, all of our results below still hold with regularization):

$$\min_w \sum_m \max_y \Big[ w^\top \big( \phi(x^{(m)}, y) - \phi(x^{(m)}, y^{(m)}) \big) + \Delta(y, y^{(m)}) \Big] . \tag{1}$$

This is a convex function of $w$ and hence can be optimized in various ways. Notice, however, that the objective includes a maximization over outputs $y$ for each training example. This loss-augmented prediction task needs to be solved repeatedly during training (e.g., to evaluate subgradients), which makes training intractable in general. Fortunately, as in prediction, LP relaxation can be applied to the structured loss (Taskar et al., 2003; Kulesza and Pereira, 2007), which yields the relaxed training objective:

$$\min_w \sum_m \max_{\mu \in \mathcal{M}_L} \Big[ \theta_m^\top (\mu - \mu_m) + \ell_m^\top \mu \Big] , \tag{2}$$

where $\theta_m$ is a score vector in which each entry represents $w_c^\top \phi_c(x^{(m)}, y_c)$ for some $c$ and $y_c$, similarly $\ell_m$ is a vector with entries $\Delta_c(y_c, y_c^{(m)})$ (we assume that the task loss decomposes as the model score), and $\mu_m$ is the integral vector corresponding to $y^{(m)}$.
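One relationship between the two objectives is worth spelling out: for any fixed weights, the relaxed objective is at least as large as the exact one. The short derivation below is a sketch under the assumption stated above that the task loss decomposes like the score, so that each output $y$ corresponds to an integral point of the local polytope; the symbols follow Eq. (1) and Eq. (2).

```latex
\begin{align*}
\max_{y}\Big[w^\top\big(\phi(x^{(m)},y)-\phi(x^{(m)},y^{(m)})\big)+\Delta(y,y^{(m)})\Big]
 &= \max_{\mu\in\mathcal{M}_L\cap\{0,1\}^q}\Big[\theta_m^\top(\mu-\mu_m)+\ell_m^\top\mu\Big] \\
 &\le \max_{\mu\in\mathcal{M}_L}\Big[\theta_m^\top(\mu-\mu_m)+\ell_m^\top\mu\Big].
\end{align*}
```

Summing over $m$ shows that Eq. (2) upper bounds Eq. (1) for every $w$, since the inner maximization in Eq. (2) ranges over a superset of the points considered in Eq. (1).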

## 4 Analysis

In this section we present our main results, proposing a theoretical justification for the observed tightness of LP relaxations used for inference in models learned by structured prediction, both on training and held-out data. To this end, we make two complementary arguments: in Section 4.1 we argue that optimizing the relaxed training objective of Eq. (2) also has the effect of encouraging tightness of training instances; in Section 4.2 we show that tightness generalizes from train to test data.

### 4.1 Tightness at Training

We first show that the relaxed training objective in Eq. (2), although designed to achieve high accuracy, also induces tightness of the LP relaxation. In order to simplify notation we focus on a single training instance and drop the index $m$. Denote the solutions to the relaxed and integer LPs as:

$$\mu_L \in \arg\max_{\mu \in \mathcal{M}_L} \theta^\top \mu \qquad\qquad \mu_I \in \arg\max_{\substack{\mu \in \mathcal{M}_L \\ \mu \in \{0,1\}^q}} \theta^\top \mu$$

Also, let $\mu_T$ be the integral vector corresponding to the ground-truth output $y$. Now consider the following decomposition:

$$\underbrace{\theta^\top(\mu_L - \mu_T)}_{\text{relaxed-hinge}} \;=\; \underbrace{\theta^\top(\mu_L - \mu_I)}_{\text{integrality gap}} \;+\; \underbrace{\theta^\top(\mu_I - \mu_T)}_{\text{exact-hinge}} \tag{3}$$

This equality states that the difference in scores between the relaxed optimum and ground-truth (relaxed-hinge) can be written as a sum of the integrality gap and the difference in scores between the exact optimum and the ground-truth (exact-hinge) (notice that all terms are non-negative). This simple decomposition has several interesting implications.

First, we can immediately derive the following bound on the integrality gap:

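$$\begin{aligned}
\theta^\top(\mu_L - \mu_I) &= \theta^\top(\mu_L - \mu_T) - \theta^\top(\mu_I - \mu_T) && (4) \\
&\le \theta^\top(\mu_L - \mu_T) && (5) \\
&\le \theta^\top(\mu_L - \mu_T) + \ell^\top \mu_L && (6) \\
&\le \max_{\mu \in \mathcal{M}_L} \big( \theta^\top(\mu - \mu_T) + \ell^\top \mu \big), && (7)
\end{aligned}$$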

where Eq. (7) is precisely the relaxed training objective from Eq. (2). Therefore, optimizing the approximate training objective of Eq. (2) minimizes an upper bound on the integrality gap. Hence, driving down the approximate objective also reduces the integrality gap of training instances. One case where the integrality gap becomes zero is when the data is algorithmically separable. In this case the relaxed-hinge term vanishes (the exact-hinge must also vanish), and integrality is assured.

However, the bound above might sometimes be loose. Indeed, to get the bound we have discarded the exact-hinge term (Eq. (5)), added the task-loss (Eq. (6)), and maximized the loss-augmented objective (Eq. (7)). At the same time, Eq. (4) provides a precise characterization of the integrality gap. Specifically, the gap is determined by the difference between the relaxed-hinge and the exact-hinge terms. This implies that even when the relaxed-hinge is not zero, a small integrality gap can still be obtained if the exact-hinge is also large. In fact, the only way to get a large integrality gap is by setting the exact-hinge much smaller than the relaxed-hinge. But when can this happen?

A key point is that the relaxed and exact hinge terms are upper bounded by the relaxed and exact training objectives, respectively (the latter additionally depend on the task loss $\Delta$). Therefore, minimizing the training objective will also reduce the corresponding hinge term (see also Section 5). Using this insight, we observe that relaxed training reduces the relaxed-hinge term without directly reducing the exact-hinge term, and thereby induces a small integrality gap. On the other hand, this also suggests that exact training may actually increase the integrality gap, since it reduces the exact-hinge without also directly reducing the relaxed-hinge term. This finding is consistent with previous empirical evidence. Specifically, Martins et al. (2009b, Table 2) showed that on a dependency parsing problem, training with the relaxed objective achieved a higher fraction of integral solutions than exact training. An even stronger effect was observed by Finley and Joachims (2008, Table 3) for multi-label classification, where relaxed training resulted in far more integral instances than exact training (‘Yeast’ dataset).

In Section 5 we provide further empirical support for our explanation. First, however, we show its possible limitations by providing a counter-example. The counter-example demonstrates that despite training with a relaxed objective, the exact-hinge can in some cases actually be smaller than the relaxed-hinge, leading to a loose relaxation. Although this illustrates the limitations of the explanation above, we point out that the corresponding learning task is far from natural; we believe it is unlikely to arise in real-world applications.

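Specifically, we construct a learning scenario where relaxed training obtains zero exact-hinge and non-zero relaxed-hinge, so the relaxation is not tight. Consider a model where $x \in \mathbb{R}^3$, $y \in \{0,1\}^3$, and the prediction is given by:

$$y(x; w) = \arg\max_y \Big( x_1 y_1 + x_2 y_2 + x_3 y_3 + w \big[ \mathbb{1}\{y_1 \neq y_2\} + \mathbb{1}\{y_1 \neq y_3\} + \mathbb{1}\{y_2 \neq y_3\} \big] \Big).$$

The corresponding LP relaxation is then:

$$\max_{\mu \in \mathcal{M}_L} \Big( x_1 \mu_1(1) + x_2 \mu_2(1) + x_3 \mu_3(1) + w \big[ \mu_{12}(01) + \mu_{12}(10) + \mu_{13}(01) + \mu_{13}(10) + \mu_{23}(01) + \mu_{23}(10) \big] \Big).$$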

Next, we construct a trainset where the first instance is: , and the second is: . It can be verified that minimizes the relaxed objective (Eq. (2)). However, with this weight vector the relaxed-hinge for the second instance is equal to 1, while the exact-hinge for both instances is 0 (the data is separable w.r.t. ). Consequently, there is an integrality gap of 1 for the second instance, and the relaxation is loose (the first instance is actually tight).
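The mechanism behind the loose instance can be reproduced numerically. The sketch below is illustrative only: the instance values are not recoverable from the text here, so it assumes an input `x = (0, 0, 0)`, ground-truth labels `y_true = (1, 1, 0)`, and weight `w = 1`, chosen because they reproduce the hinge values quoted above. It evaluates one fractional point of the local polytope (all node marginals equal to 1/2, all edge mass on disagreeing pairs) instead of calling an LP solver; with `x = 0` and `w > 0` no feasible point can score more than `3 * w`, so this point is LP-optimal for the assumed instance.

```python
from itertools import product

w = 1.0                     # assumed weight
x = (0.0, 0.0, 0.0)         # assumed input of the loose instance
y_true = (1, 1, 0)          # assumed ground-truth labels
edges = [(0, 1), (0, 2), (1, 2)]


def integral_score(y):
    """Score of an integral assignment under the triangle model above."""
    unary = sum(xi * yi for xi, yi in zip(x, y))
    disagreements = sum(y[i] != y[j] for i, j in edges)
    return unary + w * disagreements


# Exact (ILP) optimum by enumerating all 8 assignments.
exact_opt = max(integral_score(y) for y in product((0, 1), repeat=3))

# Fractional candidate: uniform node marginals, edges split between (0,1) and (1,0).
mu_node = {i: {0: 0.5, 1: 0.5} for i in range(3)}
mu_edge = {e: {(0, 0): 0.0, (0, 1): 0.5, (1, 0): 0.5, (1, 1): 0.0} for e in edges}


def in_local_polytope(mu_node, mu_edge, tol=1e-12):
    """Check non-negativity, normalization, and edge-to-node marginalization."""
    for marginal in mu_node.values():
        if abs(sum(marginal.values()) - 1.0) > tol:
            return False
    for (i, j), table in mu_edge.items():
        if any(v < -tol for v in table.values()):
            return False
        for a in (0, 1):
            if abs(table[(a, 0)] + table[(a, 1)] - mu_node[i][a]) > tol:
                return False
            if abs(table[(0, a)] + table[(1, a)] - mu_node[j][a]) > tol:
                return False
    return True


def relaxed_score(mu_node, mu_edge):
    """LP objective of the relaxation evaluated at a (possibly fractional) point."""
    unary = sum(x[i] * mu_node[i][1] for i in range(3))
    pairwise = sum(t[(0, 1)] + t[(1, 0)] for t in mu_edge.values())
    return unary + w * pairwise


frac_score = relaxed_score(mu_node, mu_edge)
exact_hinge = exact_opt - integral_score(y_true)
relaxed_hinge = frac_score - integral_score(y_true)
integrality_gap = frac_score - exact_opt
```

The key structural fact is that three binary variables cannot all pairwise disagree, so any integral assignment collects at most two of the three disagreement rewards, while the fractional point collects all three.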

Finally, note that our derivation above (Eq. (4)) holds for any integral $\mu$, and not just the ground-truth $\mu_T$. In other words, the only property of $\mu_T$ we are using here is its integrality. Indeed, in Section 5 we verify empirically that training a model using random labels still attains the same level of tightness as training with the ground-truth labels. On the other hand, accuracy drops dramatically, as expected. This analysis suggests that tightness is not related to accuracy of the predictor. Finley and Joachims (2008) explained tightness of LP relaxations by noting that fractional solutions always incur a loss during training. Our analysis suggests an alternative explanation, emphasizing the difference in scores (Eq. (4)) rather than the loss, and decoupling tightness from accuracy.

### 4.2 Generalization of Tightness

Our argument in Section 4.1 concerns only the tightness of train instances. However, the empirical evidence discussed above pertains to test data. To bridge this gap, in this section we show that train tightness implies test tightness. We do so by proving a generalization bound for tightness based on Rademacher complexity.

We first define a loss function which measures the lack of integrality (or, fractionality) for a given instance. To this end, we consider the discrete set of vertices of the local polytope $\mathcal{M}_L$ (excluding its convex hull), denoting by $\mathcal{M}^I$ and $\mathcal{M}^F$ the sets of fully-integral and non-integral (i.e., fractional) vertices, respectively (so $\mathcal{M}^I \cap \mathcal{M}^F = \emptyset$, and $\mathcal{M}^I \cup \mathcal{M}^F$ consists of all vertices of $\mathcal{M}_L$). Considering vertices is without loss of generality, since linear programs always have a vertex that is optimal. Next, let $\theta_x$ be the mapping from weights and inputs to scores (as used in Eq. (2)), and let $I^*(\theta) = \max_{\mu \in \mathcal{M}^I} \theta^\top \mu$ and $F^*(\theta) = \max_{\mu \in \mathcal{M}^F} \theta^\top \mu$ be the best integral and fractional scores attainable, respectively. By convention, we set $F^*(\theta) = -\infty$ whenever $\mathcal{M}^F = \emptyset$. The fractionality of $\theta$ can be measured by the quantity $D(\theta) = F^*(\theta) - I^*(\theta)$. If this quantity is large then the LP has a fractional solution with a much better score than any integral solution. We can now define the loss:

$$L(\theta) = \begin{cases} 1 & D(\theta) > 0 \\ 0 & \text{otherwise} \end{cases} . \tag{8}$$

That is, the loss equals $1$ if and only if the optimal fractional solution has a (strictly) higher score than the optimal integral solution. (Notice that the loss will be $0$ whenever the non-integral and integral optima are equal, but this is fine for our purpose, since we consider the relaxation to be tight in this case.) Notice that this loss ignores the ground-truth $y$, as expected. In addition, we define a ramp loss parameterized by $\gamma > 0$ which upper bounds the fractionality loss:

$$\varphi_\gamma(\theta) = \begin{cases} 0 & D(\theta) \le -\gamma \\ 1 + D(\theta)/\gamma & -\gamma < D(\theta) \le 0 \\ 1 & D(\theta) > 0 \end{cases} , \tag{9}$$

For this loss to be zero, the best integral solution has to be better than the best fractional solution by at least $\gamma$, which is a stronger requirement than mere tightness. In Section 4.2.1 we give examples of models that are guaranteed to satisfy this stronger requirement, and in Section 5 we also show this often happens in practice. We point out that $\varphi_\gamma$ is generally hard to compute, as is $L$ (due to the discrete optimization involved in computing $I^*$ and $F^*$). However, here we are only interested in proving that tightness is a generalizing property, so we will not worry about computational efficiency for now. We are now ready to state the main theorem of this section.
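The two losses are simple functions of the scalar fractionality once it is known. The sketch below assumes that `d` stands for the best fractional score minus the best integral score, that Eq. (8) is the 0/1 indicator of `d > 0`, and that Eq. (9) is the ramp that is 0 below `-gamma`, linear on `(-gamma, 0]`, and 1 above 0. The function names are hypothetical, and computing `d` itself (the hard part) is deliberately left out.

```python
def fractionality_loss(d):
    """0/1 loss: 1 if the best fractional score strictly beats the best integral one."""
    return 1.0 if d > 0 else 0.0


def ramp_loss(d, gamma):
    """Ramp surrogate: 0 for d <= -gamma, linear on (-gamma, 0], 1 for d > 0."""
    if gamma <= 0:
        raise ValueError("gamma must be positive")
    if d <= -gamma:
        return 0.0
    if d <= 0:
        return 1.0 + d / gamma
    return 1.0


# The ramp dominates the 0/1 loss everywhere, which is what the bound requires.
sample_values = [-2.0, -1.0, -0.5, -0.1, 0.0, 0.1, 3.0]
dominates = all(ramp_loss(d, 1.0) >= fractionality_loss(d) for d in sample_values)
```

A tight instance with `d = 0` has zero 0/1 loss but ramp loss 1, which is why the ramp encodes a margin requirement that is stronger than tightness alone.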

###### Theorem 4.1.

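Let inputs be independently selected according to a probability measure $P(X)$, and let $\mathcal{F}$ be the class of all scoring functions $f: x \mapsto \theta_x$ with $\|w\|_2 \le B$. Let $\|\phi(x, y_c)\|_2 \le \hat{R}$ for all $x$, $c$, $y_c$, and let $q$ be the total number of factor assignments (dimension of $\theta_x$). Then for any number of samples $M$ and any $0 < \delta < 1$, with probability at least $1 - \delta$, every $f \in \mathcal{F}$ satisfies:

$$E_P[L(\theta_X)] \;\le\; \hat{E}_M[\varphi_\gamma(\theta_X)] + O\!\left( \frac{q^{1.5} B \hat{R}}{\gamma \sqrt{M}} \right) + \sqrt{\frac{8 \ln(2/\delta)}{M}} \tag{10}$$

where $\hat{E}_M$ is the empirical expectation.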

###### Proof.

Our proof relies on the following general result from Bartlett and Mendelson (2002).

###### Theorem 4.2 (Bartlett and Mendelson (2002), Theorem 8).

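Consider a loss function $L: \mathcal{Y} \times \mathcal{A} \to [0, 1]$ and a dominating function $\varphi: \mathcal{Y} \times \mathcal{A} \to [0, 1]$ (i.e., $L(y, a) \le \varphi(y, a)$ for all $y, a$). Let $\mathcal{F}$ be a class of functions mapping $\mathcal{X}$ to $\mathcal{A}$, and let $\{(x^{(m)}, y^{(m)})\}_{m=1}^{M}$ be independently selected according to a probability measure $P(x, y)$. Then for any number of samples $M$ and any $0 < \delta < 1$, with probability at least $1 - \delta$, every $f \in \mathcal{F}$ satisfies:

$$E[L(y, f(x))] \;\le\; \hat{E}_M[\varphi(y, f(x))] + R_M(\tilde{\varphi} \circ f) + \sqrt{\frac{8 \ln(2/\delta)}{M}} ,$$

where $\hat{E}_M$ is the empirical expectation, $\tilde{\varphi} \circ f = \{(x, y) \mapsto \varphi(y, f(x)) - \varphi(y, 0) : f \in \mathcal{F}\}$, and $R_M(\mathcal{F})$ is the Rademacher complexity of the class $\mathcal{F}$.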

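To use this result, we define $f(x) = \theta_x$ and $\varphi(y, f(x)) = \varphi_\gamma(\theta_x)$, and take $\mathcal{F}$ to be the class of all such functions satisfying $\|w\|_2 \le B$ and $\|\phi(x, y_c)\|_2 \le \hat{R}$. In order to obtain a meaningful bound, we would like to bound the Rademacher term $R_M(\tilde{\varphi} \circ f)$. Theorem 12 in Bartlett and Mendelson (2002) states that if $\tilde{\varphi}$ is Lipschitz with constant $\lambda$ and satisfies $\tilde{\varphi}(0) = 0$, then $R_M(\tilde{\varphi} \circ f) \le 2 \lambda R_M(\mathcal{F})$. In addition, Weiss and Taskar (2010) show that $R_M(\mathcal{F}) = O(q B \hat{R} / \sqrt{M})$. Therefore, it remains to compute the Lipschitz constant of $\tilde{\varphi}$, which is equal to the Lipschitz constant of $\varphi_\gamma$. For this purpose, we will bound the Lipschitz constant of $D$, and then use the fact that the Lipschitz constant of $\varphi_\gamma$ is at most that of $D$ divided by $\gamma$ (from Eq. (9)).

Let $\mu_I^k \in \arg\max_{\mu \in \mathcal{M}^I} \theta^k \cdot \mu$ and $\mu_F^k \in \arg\max_{\mu \in \mathcal{M}^F} \theta^k \cdot \mu$ for $k \in \{1, 2\}$, then:

$$\begin{aligned}
D(\theta^1) - D(\theta^2) &= (\mu_F^1 - \mu_I^1) \cdot \theta^1 - (\mu_F^2 - \mu_I^2) \cdot \theta^2 \\
&= (\mu_F^1 \cdot \theta^1 - \mu_F^2 \cdot \theta^2) + (\mu_I^2 \cdot \theta^2 - \mu_I^1 \cdot \theta^1) \\
&= (\mu_F^1 \cdot \theta^1 - \mu_F^2 \cdot \theta^2) + (\mu_F^1 \cdot \theta^2 - \mu_F^1 \cdot \theta^2) \\
&\quad + (\mu_I^2 \cdot \theta^2 - \mu_I^1 \cdot \theta^1) + (\mu_I^2 \cdot \theta^1 - \mu_I^2 \cdot \theta^1) \\
&= \mu_F^1 \cdot (\theta^1 - \theta^2) + (\mu_F^1 - \mu_F^2) \cdot \theta^2 \\
&\quad + \mu_I^2 \cdot (\theta^2 - \theta^1) + (\mu_I^2 - \mu_I^1) \cdot \theta^1 \\
&\le (\mu_F^1 - \mu_I^2) \cdot (\theta^1 - \theta^2) && \text{[optimality of } \mu_F^2 \text{ and } \mu_I^1\text{]} \\
&\le \|\mu_F^1 - \mu_I^2\|_2 \, \|\theta^1 - \theta^2\|_2 && \text{[Cauchy-Schwarz]} \\
&\le \sqrt{q} \, \|\theta^1 - \theta^2\|_2
\end{aligned}$$

Therefore, the Lipschitz constant of $D$ is at most $\sqrt{q}$, and that of $\varphi_\gamma$ is at most $\sqrt{q}/\gamma$.

Combining everything together, and dropping the spurious dependence on $y$, we obtain the bound in Eq. (10). Finally, we point out that when using an $L_2$ regularizer at training, we can actually drop the assumption $\|w\|_2 \le B$ and instead use a bound on the norm of the optimal solution (as in the analysis of Shalev-Shwartz et al. (2011)). ∎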
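The Lipschitz step of the proof can be sanity-checked numerically. The sketch below is a hedged illustration, not part of the proof: it uses randomly generated vectors with entries in $[0, 1]$ as stand-ins for the integral and fractional vertex sets (they are not vertices of an actual local polytope), takes the fractionality to be the best fractional score minus the best integral score, and checks that its change never exceeds $\sqrt{q}$ times the Euclidean distance between score vectors. The function names are hypothetical.

```python
import math
import random


def best_score(vertices, theta):
    """Maximum of theta . mu over a finite set of candidate points."""
    return max(sum(m * t for m, t in zip(mu, theta)) for mu in vertices)


def fractionality(theta, integral_vertices, fractional_vertices):
    """Best fractional score minus best integral score."""
    return best_score(fractional_vertices, theta) - best_score(integral_vertices, theta)


def check_lipschitz(q=6, trials=200, seed=0):
    """Return (largest observed ratio, sqrt(q)) over random pairs of score vectors."""
    rng = random.Random(seed)
    # Stand-in point sets: any vectors with entries in [0, 1] satisfy the argument.
    integral = [[rng.choice((0.0, 1.0)) for _ in range(q)] for _ in range(5)]
    fractional = [[rng.choice((0.0, 0.5, 1.0)) for _ in range(q)] for _ in range(5)]
    worst_ratio = 0.0
    for _ in range(trials):
        theta1 = [rng.uniform(-1.0, 1.0) for _ in range(q)]
        theta2 = [rng.uniform(-1.0, 1.0) for _ in range(q)]
        change = abs(fractionality(theta1, integral, fractional)
                     - fractionality(theta2, integral, fractional))
        distance = math.sqrt(sum((a - b) ** 2 for a, b in zip(theta1, theta2)))
        if distance > 0:
            worst_ratio = max(worst_ratio, change / distance)
    return worst_ratio, math.sqrt(q)


worst_ratio, bound = check_lipschitz()
```

A random search like this can only fail to find a violation; it does not establish the bound, which follows from the chain of inequalities above.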

Theorem 4.1 shows that if we observe high integrality (equivalently, low fractionality) on a finite sample of training data, then it is likely that integrality of test data will not be much lower, provided a sufficient number of samples.

Our result actually applies more generally to any two disjoint sets of vertices, and is not limited to $\mathcal{M}^I$ and $\mathcal{M}^F$. For example, we can replace $\mathcal{M}^I$ by the set of vertices with at most 10% fractional values, and $\mathcal{M}^F$ by the rest of the vertices of the local polytope. This gives a different meaning to the loss $L(\theta)$, and the rest of our analysis holds unchanged. Consequently, our generalization result implies that it is likely to observe a similar portion of instances with at most 10% fractional values at test time as we did at training.

#### 4.2.1 γ-tight relaxations

In this section we study the stronger notion of tightness required by our surrogate fractionality loss (Eq. (9)), and show examples of models that satisfy it. We use the following definition.

• An LP relaxation is called $\gamma$-tight if $I^*(\theta) \ge F^*(\theta) + \gamma$ (so $\varphi_\gamma(\theta) = 0$). That is, the best integral value is larger than the best non-integral value by at least $\gamma$. (Notice that scaling up $\theta$ will also increase $\gamma$, but our bound in Eq. (10) also grows with the norm of $\theta$ (via $B\hat{R}$). Therefore, we assume here that $\|\theta\|_2$ is bounded.)

We focus on binary pairwise models and show two cases where the model is guaranteed to be $\gamma$-tight. Proofs are provided in Appendix A. Our first example involves balanced models, which are binary pairwise models that have supermodular scores, or can be made supermodular by “flipping” a subset of the variables (for more details, see Appendix A).

###### Proposition 4.3.

A balanced model with a unique optimum is $(\alpha/2)$-tight, where $\alpha$ is the difference between the best and second-best (integral) solutions.

This result is of particular interest when learning structured predictors where the edge scores depend on the input. Whereas one could learn supermodular models by enforcing linear inequalities, we know of no tractable means of restricting the model to be balanced. Instead, one could learn over the full space of models using LP relaxation. If the learned models are balanced on the training data, Prop. 4.3 together with Theorem 4.1 tell us that the pairwise LP relaxation is likely to be tight on test data as well.

Our second example concerns models with singleton scores that are much stronger than the pairwise scores. Consider a binary pairwise model (this case easily generalizes to non-binary variables) in minimal representation, where $\bar{\theta}_i$ are node scores and $\bar{\theta}_{ij}$ are edge scores in this representation (see Appendix A for full details). Further, for each variable $i$, define the set of neighbors with attractive edges $N_i^+ = \{j \in N_i \mid \bar{\theta}_{ij} > 0\}$, and the set of neighbors with repulsive edges $N_i^- = \{j \in N_i \mid \bar{\theta}_{ij} < 0\}$.

###### Proposition 4.4.

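If all variables satisfy the condition:

$$\bar{\theta}_i \ge -\sum_{j \in N_i^-} \bar{\theta}_{ij} + \beta, \quad \text{or} \quad \bar{\theta}_i \le -\sum_{j \in N_i^+} \bar{\theta}_{ij} - \beta$$

for some $\beta > 0$, then the model is $(\beta/2)$-tight.

Finally, we point out that in both of the examples above, the conditions can be verified efficiently and if they hold, the value of $\gamma$ can be computed efficiently.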
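As an illustration of how cheaply the condition of Prop. 4.4 can be checked, the sketch below computes the largest margin $\beta$ for which every variable satisfies one of the two inequalities. It assumes that attractive edges are those with positive edge score and repulsive edges those with negative edge score in the minimal representation, and that edges are given as unordered pairs; the function name and the dictionary layout are hypothetical. A non-positive result means that at least one variable violates both inequalities.

```python
def beta_margin(node_scores, edge_scores):
    """Largest beta such that every variable satisfies one of the two inequalities.

    node_scores: dict mapping variable i to its node score.
    edge_scores: dict mapping an unordered pair (i, j) to its edge score.
    """
    margins = []
    for i, theta_i in node_scores.items():
        incident = [s for (a, b), s in edge_scores.items() if i in (a, b)]
        repulsive_sum = sum(s for s in incident if s < 0)
        attractive_sum = sum(s for s in incident if s > 0)
        # theta_i >= -repulsive_sum + beta   <=>  beta <= theta_i + repulsive_sum
        # theta_i <= -attractive_sum - beta  <=>  beta <= -(theta_i + attractive_sum)
        margins.append(max(theta_i + repulsive_sum, -(theta_i + attractive_sum)))
    return min(margins)


# Made-up example with strong node scores relative to the edge scores.
strong = beta_margin({0: 2.0, 1: -3.0, 2: 1.5}, {(0, 1): 0.5, (1, 2): -1.0})

# Made-up example where a strong repulsive edge overwhelms weak node scores.
weak = beta_margin({0: 0.1, 1: 0.1}, {(0, 1): -1.0})
```

The check is a single pass over variables and their incident edges, so its cost is linear in the size of the graph when adjacency lists are precomputed; the dictionary scan used here is quadratic and kept only for brevity.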

## 5 Experiments

In this section we present some numerical results to support our theoretical analysis. We run experiments for both a multi-label classification task and an image segmentation task. For training we have implemented the block-coordinate Frank-Wolfe algorithm for structured SVM (Lacoste-Julien et al., 2013), using GLPK as the LP solver. In all of our experiments we use a standard $L_2$ regularizer, chosen via cross-validation.

##### Multi-label classification

For multi-label classification we adopt the experimental setting of Finley and Joachims (2008). In this setting labels are represented by binary variables, the model consists of singleton and pairwise factors forming a fully connected graph over the labels, and the task loss is the normalized Hamming distance.

Fig. 2 shows relaxed and exact training iterations for the ‘Yeast’ dataset (14 labels). We plot the relaxed and exact hinge terms (Eq. (3)), the exact and relaxed SSVM training objectives (Eq. (1) and Eq. (2), respectively; the displayed objective values are averaged over train instances and exclude regularization), the fraction of train and test instances having integral solutions, as well as test accuracy (measured by $F_1$ score). Whenever a fractional solution was found with relaxed inference, a simple rounding scheme was applied to obtain a valid prediction.

First, we note that the relaxed-hinge values are nicely correlated with the relaxed training objective, and likewise the exact-hinge is correlated with the exact objective (left and middle, top). Second, observe that with relaxed training, the relaxed-hinge and the exact-hinge are very close (left, top), so the integrality gap, given by their difference, remains small. On the other hand, with exact training the exact-hinge is reduced much more than the relaxed-hinge, which results in a large integrality gap (middle, top). Indeed, we can see that the percentage of integral solutions is far higher for relaxed training (left, bottom) than for exact training (middle, bottom). To get a better understanding, we show a histogram of the difference between the optimal integral and fractional values, i.e., the integrality margin ($I^* - F^*$), under the final learned model for all training instances (right). It can be seen that with relaxed training this margin is positive (although small), while exact training results in larger negative values. Third, we notice that train and test integrality levels are very close to each other, almost indistinguishable (left and middle, bottom), which provides some empirical support for our generalization result from Section 4.2.

We next train a model using random labels (with similar label counts as the true data). In this setting the learned model obtains tight training instances (not shown), which supports our claim that any integral solution can be used in place of the ground-truth, and that accuracy is not important for tightness. Finally, in order to verify that tightness is not coincidental, we tested the tightness of the relaxation induced by a random weight vector $w$. We found that random models are never tight (in 20 trials), which shows that tightness of the relaxation does not come by chance.

We now proceed to perform experiments on the ‘Scene’ dataset (6 labels). The results, in Fig. 3, are quite similar to the ‘Yeast’ results, except for the behavior of exact training (middle) and the integrality margin (right). Specifically, we observe that in this case the relaxed-hinge and exact-hinge are close in value (middle, top), as for relaxed training (left, top). As a consequence, the integrality gap is very small and the relaxation is tight for almost all train (and test) instances. These results show that sometimes optimizing the exact objective can reduce the relaxed objective (and relaxed-hinge) as well. Further, in this setting we observe a larger integrality margin (right), which means that the integral optimum is strictly better than the fractional one.

We conjecture that the LP instances are easy in this case due to the dominance of the singleton scores. (With ILP training, the condition in Prop. 4.4 is satisfied for 65% of all variables, although only 1% of the training instances satisfy it for all their variables.) Specifically, the features provide a strong signal which allows label assignment to be decided mostly based on the local score, with little influence coming from the pairwise terms. To test this conjecture we repeat the experiment while injecting Gaussian noise into the input features, forcing the model to rely more on the pairwise interactions. We find that with the noisy singleton scores the results are indeed similar to the ‘Yeast’ dataset, where a large integrality gap is observed and fewer instances are tight (see Appendix B in the supplement).

##### Image segmentation

Finally, we conduct experiments on a foreground-background segmentation problem using the Weizmann Horse dataset (Borenstein et al., 2004). The data consists of 328 images, of which we use the first 50 for training and the rest for testing. Here a binary output variable is assigned to each pixel, and there are variables per image on average. We extract singleton and pairwise features as described in Domke (2013). Fig. 4 shows the same quantities as in the multi-label setting, except for the accuracy measure – here we compute the percentage of correctly classified pixels rather than . We observe a very similar behavior to that of the ‘Scene’ multi-label dataset (Fig. 3). Specifically, both relaxed and exact training produce a small integrality gap and high percentage of tight instances. Unlike the ‘Scene’ dataset, here only 1.2% of variables satisfy the condition in Prop. 4.4 (using LP training). In all of our experiments the learned model scores were never balanced (Prop. 4.3), although for the segmentation problem we believe the models learned are close to balanced, both for relaxed and exact training.

## 6 Conclusion

In this paper we propose an explanation for the tightness of LP relaxations which has been observed in many structured prediction applications. Our analysis is based on a careful examination of the integrality gap and its relation to the training objective. It shows how training with LP relaxations, although designed with accuracy considerations in mind, also induces tightness of the relaxation. Our derivation also suggests that exact training may sometimes have the opposite effect, increasing the integrality gap.

To explain tightness of test instances, we show that tightness generalizes from train to test instances. Compared to the generalization bound of Kulesza and Pereira (2007), our bound only considers the tightness of the instance, ignoring label errors. Thus, for example, if learning happens to settle on a set of parameters in a tractable regime (e.g., supermodular potentials or stable instances (Makarychev et al., 2014)) for which the LP relaxation is tight for all training instances, our generalization bound guarantees that with high probability the LP relaxation will also be tight on test instances. In contrast, in Kulesza and Pereira (2007)’s bound, tightness on test instances can only be guaranteed when the training data is algorithmically separable (i.e., LP-relaxed inference predicts perfectly).

Our work suggests many directions for further study. Our analysis in Section 4.1 focuses on the score hinge and ignores the task loss. It would be interesting to further study the effect of various task losses on tightness of the relaxation at training. Next, our bound in Section 4.2 is intractable to compute due to the hardness of the surrogate loss. It is therefore desirable to derive a tractable alternative which could be used to obtain a useful guarantee in practice. The upper bound on integrality shown in Section 4.1 holds for other convex relaxations which have been proposed for structured prediction, such as semi-definite programming relaxations (Kumar et al., 2009). However, it is less clear how to extend the generalization result to such non-polyhedral relaxations. Finally, we hope that our methodology will be useful for shedding light on tightness of convex relaxations in other learning problems.

## Appendix A γ-Tight LP Relaxations

In this section we provide full derivations for the results in Section 4.2.1. We make extensive use of the results in Weller et al. (2016) (some of which are restated here for completeness). We start by defining a model in minimal representation, which will be convenient for the derivations that follow. Specifically, in the case of binary variables with pairwise factors, we define a value $\eta_i$ for each variable, and a value $\eta_{ij}$ for each pair. The mapping between the over-complete vector $\mu$ and the minimal vector $\eta$ is as follows. For singleton factors, we have:

$$\mu_i = \begin{pmatrix} 1 - \eta_i \\ \eta_i \end{pmatrix}$$

Similarly, for the pairwise factors, we have:

$$\mu_{ij} = \begin{pmatrix} 1 + \eta_{ij} - \eta_i - \eta_j & \eta_j - \eta_{ij} \\ \eta_i - \eta_{ij} & \eta_{ij} \end{pmatrix}$$

The corresponding mapping to minimal parameters is then:

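$$
\begin{aligned}
\bar\theta_i &= \theta_i(1) - \theta_i(0) + \sum_{j \in N_i} \big(\theta_{ij}(1,0) - \theta_{ij}(0,0)\big) \\
\bar\theta_{ij} &= \theta_{ij}(1,1) + \theta_{ij}(0,0) - \theta_{ij}(0,1) - \theta_{ij}(1,0)
\end{aligned}
$$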
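As a sanity check on this mapping, the following minimal sketch converts over-complete parameters to minimal ones and confirms that both representations score every assignment identically once the constant terms are accounted for. The function names, the data layout, and the random 4-cycle are illustrative assumptions, not part of the original derivation.

```python
import itertools
import random

def to_minimal(theta_i, theta_ij):
    """Map over-complete parameters of a binary pairwise model to minimal ones.

    theta_i:  list of pairs, theta_i[i][a] = theta_i(y_i = a)
    theta_ij: dict {(i, j): 2x2 nested list}, entry [a][b] = theta_ij(y_i = a, y_j = b)
    Returns (theta_bar_i, theta_bar_ij, const), where const collects the
    terms that do not depend on the assignment.
    """
    tb_i = [t[1] - t[0] for t in theta_i]
    tb_ij = {}
    const = sum(t[0] for t in theta_i)
    for (i, j), t in theta_ij.items():
        # Each edge adds to the singleton term of both endpoints; for the
        # second endpoint j the roles of the two arguments are swapped.
        tb_i[i] += t[1][0] - t[0][0]
        tb_i[j] += t[0][1] - t[0][0]
        tb_ij[(i, j)] = t[1][1] + t[0][0] - t[0][1] - t[1][0]
        const += t[0][0]
    return tb_i, tb_ij, const

def score_overcomplete(y, theta_i, theta_ij):
    s = sum(theta_i[i][y[i]] for i in range(len(y)))
    return s + sum(t[y[i]][y[j]] for (i, j), t in theta_ij.items())

def score_minimal(y, tb_i, tb_ij, const):
    s = const + sum(w * y[i] for i, w in enumerate(tb_i))
    return s + sum(w * y[i] * y[j] for (i, j), w in tb_ij.items())

# Hypothetical 4-cycle with random scores (assumption, for illustration only).
rng = random.Random(0)
edges = [(0, 1), (1, 2), (2, 3), (0, 3)]
theta_i = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(4)]
theta_ij = {e: [[rng.gauss(0, 1) for _ in range(2)] for _ in range(2)] for e in edges}
tb_i, tb_ij, const = to_minimal(theta_i, theta_ij)

max_err = max(
    abs(score_overcomplete(y, theta_i, theta_ij) - score_minimal(y, tb_i, tb_ij, const))
    for y in itertools.product((0, 1), repeat=4)
)
print(f"largest score mismatch over all 16 assignments: {max_err:.2e}")
```

The constant is dropped in the LP objective, which is why that objective is stated only up to constants.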

In this representation, the LP relaxation is given by (up to constants):

$$\max_{\eta \in \mathcal{L}} f(\eta) := \sum_{i=1}^{n} \bar\theta_i \eta_i + \sum_{ij \in E} \bar\theta_{ij} \eta_{ij}$$

where $\mathcal{L}$ is the appropriate transformation of the local polytope to the equivalent reduced space of $\eta$:

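$$
\begin{aligned}
0 \le \eta_i &\le 1 && \forall i \\
\max(0, \eta_i + \eta_j - 1) \le \eta_{ij} &\le \min(\eta_i, \eta_j) && \forall ij \in E
\end{aligned}
$$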
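To make the objective and constraints concrete, the sketch below evaluates $f(\eta)$ and checks feasibility for a small hypothetical model: a triangle with repulsive edges, chosen here only for illustration. It exhibits a feasible fractional point that scores higher than every integral assignment, so the relaxation is not tight for this particular model. All names and numbers are assumptions.

```python
import itertools

def f(eta_i, eta_ij, tb_i, tb_ij):
    """LP objective in minimal representation (constants dropped)."""
    s = sum(w * eta_i[i] for i, w in enumerate(tb_i))
    return s + sum(w * eta_ij[e] for e, w in tb_ij.items())

def is_feasible(eta_i, eta_ij, tol=1e-9):
    """Check the box constraints on eta_i and the edge constraints on eta_ij."""
    if any(v < -tol or v > 1 + tol for v in eta_i):
        return False
    for (i, j), v in eta_ij.items():
        lo = max(0.0, eta_i[i] + eta_i[j] - 1.0)
        hi = min(eta_i[i], eta_i[j])
        if v < lo - tol or v > hi + tol:
            return False
    return True

# Hypothetical model: three variables, all edges repulsive.
tb_i = [1.0, 1.0, 1.0]
tb_ij = {(0, 1): -2.0, (1, 2): -2.0, (0, 2): -2.0}

# For an integral assignment y the only feasible edge value is y_i * y_j.
best_integral = max(
    f(y, {(i, j): y[i] * y[j] for (i, j) in tb_ij}, tb_i, tb_ij)
    for y in itertools.product((0, 1), repeat=3)
)

# A fractional point: every variable at 1/2, every edge term at 0.
frac_i = [0.5, 0.5, 0.5]
frac_ij = {e: 0.0 for e in tb_ij}
frac_feasible = is_feasible(frac_i, frac_ij)
frac_value = f(frac_i, frac_ij, tb_i, tb_ij)

print("best integral value:", best_integral)
print("fractional point feasible:", frac_feasible, "value:", frac_value)
```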

If $\bar\theta_{ij} > 0$ ($\bar\theta_{ij} < 0$), then the edge is called attractive (repulsive). If all edges are attractive, then the LP relaxation is known to be tight (Wainwright and Jordan, 2008). When not all edges are attractive, in some cases it is possible to make them attractive by flipping a subset of the variables. (Footnote 10: The flip-set, if it exists, is easy to find by making a single pass over the graph; see Weller (2015) for more details.) In such cases the model is called balanced.
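One way to search for a flip-set is a breadth-first traversal that propagates parity constraints: flipping exactly one endpoint of an edge changes its sign, so the endpoints of a repulsive edge must differ in flip status and the endpoints of an attractive edge must agree. The sketch below is an assumed implementation of that idea, not necessarily the procedure of Weller (2015); `find_flip_set` is a hypothetical name.

```python
from collections import deque

def find_flip_set(n, tb_ij):
    """Return a set of variables whose flipping makes every edge attractive,
    or None if no such set exists (the model is not balanced).

    tb_ij: dict {(i, j): theta_bar_ij}; edges with weight 0 impose no constraint.
    """
    adj = [[] for _ in range(n)]
    for (i, j), w in tb_ij.items():
        if w == 0:
            continue
        differ = w < 0  # repulsive edge: exactly one endpoint must be flipped
        adj[i].append((j, differ))
        adj[j].append((i, differ))

    flip = [None] * n
    for start in range(n):
        if flip[start] is not None:
            continue
        flip[start] = False  # free choice per connected component
        queue = deque([start])
        while queue:
            i = queue.popleft()
            for j, differ in adj[i]:
                want = flip[i] != differ  # XOR of the two booleans
                if flip[j] is None:
                    flip[j] = want
                    queue.append(j)
                elif flip[j] != want:
                    return None  # a cycle with an odd number of repulsive edges
    return {i for i in range(n) if flip[i]}

# Hypothetical examples.
print(find_flip_set(3, {(0, 1): -1.0, (1, 2): 1.0}))                # balanced chain
print(find_flip_set(3, {(0, 1): -1.0, (1, 2): -1.0, (0, 2): -1.0}))  # frustrated triangle
```

Each edge is visited a constant number of times, so the check runs in time linear in the size of the graph.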

In the sequel we will make use of the known fact that all vertices of the local polytope are half-integral (take values in $\{0, \tfrac{1}{2}, 1\}$) (Wainwright and Jordan, 2008). We are now ready to prove the propositions (restated here for convenience).

### A.1 Proof of Proposition 4.3

Proposition 4.3 A balanced model with a unique optimum is $(\alpha/2)$-tight, where $\alpha$ is the difference between the best and second-best (integral) solutions.

###### Proof.

Weller et al. (2016) define for a given variable $i$ the function $F^i_{\mathcal{L}}(z)$, which returns for every $z \in [0, 1]$ the constrained optimum:

$$F^i_{\mathcal{L}}(z) = \max_{\substack{\eta \in \mathcal{L} \\ \eta_i = z}} f(\eta)$$

Given this definition, they show that for a balanced model, $F^i_{\mathcal{L}}(z)$ is a linear function (Weller et al., 2016, Theorem 6).

Let $m$ be the optimal score, let $\eta^*$ be the unique optimum integral vertex in minimal form so $f(\eta^*) = m$, and any other integral vertex has value at most $m - \alpha$. Denote the state of $\eta^*$ at coordinate $i$ by $z^*$, and consider computing the constrained optimum holding $\eta_i$ to various states. By assumption, any other integral vertex has value at most $m - \alpha$, therefore,

$$
\begin{aligned}
F^i_{\mathcal{L}}(z^*) &= m \\
F^i_{\mathcal{L}}(1 - z^*) &\le m - \alpha
\end{aligned}
$$

(the second line holds with equality if there exists a second-best solution whose coordinate $i$ takes the value $1 - z^*$). Since $F^i_{\mathcal{L}}(z)$ is a linear function, we have that:

$$F^i_{\mathcal{L}}(1/2) \le m - \alpha/2 \tag{11}$$
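Spelled out, this step uses only that a linear function evaluated at the midpoint of an interval equals the average of its endpoint values. Since $1/2$ is the midpoint of $z^*$ and $1 - z^*$ for either $z^* \in \{0, 1\}$:

$$F^i_{\mathcal{L}}(1/2) = \tfrac{1}{2} F^i_{\mathcal{L}}(z^*) + \tfrac{1}{2} F^i_{\mathcal{L}}(1 - z^*) \le \tfrac{1}{2} m + \tfrac{1}{2} (m - \alpha) = m - \alpha/2$$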

Next, towards contradiction, suppose that there exists a fractional vertex $\eta^f$ with value $f(\eta^f) > m - \alpha/2$. Let $j$ be a fractional coordinate, so $\eta^f_j = 1/2$ (since vertices are half-integral). Our assumption implies that $F^j_{\mathcal{L}}(1/2) > m - \alpha/2$, but this contradicts Eq. (11), which holds for every coordinate. Therefore, we conclude that any fractional solution has value at most $m - \alpha/2$. ∎

It is possible to check in polynomial time if a model is balanced, if it has a unique optimum, and compute $\alpha$. This can be done by computing the difference in value to the second-best. In order to find the second-best, one can constrain each variable in turn to differ from its state in the optimal solution and recompute the MAP solution; finally, take the maximum over all these trials.
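The second-best procedure can be sketched as follows. For brevity the sketch uses brute-force enumeration as a stand-in for a MAP solver, so it is exponential in the number of variables and is meant only to illustrate the constrain-and-recompute loop. For a balanced model one would substitute a polynomial-time solver. The function names and the example chain are assumptions.

```python
import itertools

def score(y, tb_i, tb_ij):
    """Score of an integral assignment in minimal representation."""
    s = sum(w * y[i] for i, w in enumerate(tb_i))
    return s + sum(w * y[i] * y[j] for (i, j), w in tb_ij.items())

def constrained_map(tb_i, tb_ij, fixed=None):
    """Brute-force MAP over {0,1}^n, optionally with some coordinates fixed.

    Stand-in for a real MAP solver; exponential in n.
    """
    fixed = fixed or {}
    n = len(tb_i)
    candidates = (
        y for y in itertools.product((0, 1), repeat=n)
        if all(y[i] == v for i, v in fixed.items())
    )
    return max(candidates, key=lambda y: score(y, tb_i, tb_ij))

def optimum_gap(tb_i, tb_ij):
    """Return (y_star, m, alpha); alpha == 0 means the optimum is not unique."""
    y_star = constrained_map(tb_i, tb_ij)
    m = score(y_star, tb_i, tb_ij)
    # Force each variable in turn away from its optimal state and re-solve.
    second = max(
        score(constrained_map(tb_i, tb_ij, {i: 1 - y_star[i]}), tb_i, tb_ij)
        for i in range(len(tb_i))
    )
    return y_star, m, m - second

# Hypothetical attractive chain on three variables.
tb_i = [1.0, -0.5, 0.25]
tb_ij = {(0, 1): 1.0, (1, 2): 1.0}
y_star, m, alpha = optimum_gap(tb_i, tb_ij)
print("optimum:", y_star, "value:", m, "alpha:", alpha)
```

Any assignment other than the optimum differs from it in at least one coordinate, so the best of the constrained solutions is the second-best overall.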

### A.2 Proof of Proposition 4.4

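Proposition 4.4 If all variables satisfy the condition:

$$\bar\theta_i \ge -\sum_{j \in N_i^-} \bar\theta_{ij} + \beta, \quad \text{or} \quad \bar\theta_i \le -\sum_{j \in N_i^+} \bar\theta_{ij} - \beta$$

for some $\beta > 0$, then the model is $(\beta/2)$-tight.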

###### Proof.

For any binary pairwise model, given singleton terms $\{\eta_i\}$, the optimal edge terms $\{\eta_{ij}\}$ are given by (for details see Weller et al., 2016):

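$$
\eta_{ij}(\eta_i, \eta_j) =
\begin{cases}
\min(\eta_i, \eta_j) & \text{if } \bar\theta_{ij} > 0 \\
\max(0, \eta_i + \eta_j - 1) & \text{if } \bar\theta_{ij} < 0
\end{cases}
$$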

Now, consider a variable $i$ and let $N_i$ be the set of its neighbors in the graph. Further, define the sets $N_i^+ = \{j \in N_i \mid \bar\theta_{ij} > 0\}$ and $N_i^- = \{j \in N_i \mid \bar\theta_{ij} < 0\}$, corresponding to attractive and repulsive edges, respectively. We next focus on the parts of the objective affected by the value of $\eta_i$ (recomputing optimal edge terms); recall that all vertices are half-integral:

It is easy to verify that the condition $\bar\theta_i \ge -\sum_{j \in N_i^-} \bar\theta_{ij} + \beta$ guarantees that $\eta_i = 1$ in the optimal solution. We next bound the difference in objective values resulting from setting $\eta_i = 1/2$.

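$$
\Delta f = \frac{1}{2} \Bigg( \bar\theta_i + \sum_{\substack{j \in N_i^+ \\ \eta_j = 1}} \bar\theta_{ij} + \sum_{\substack{j \in N_i^- \\ \eta_j \in \{\frac{1}{2}, 1\}}} \bar\theta_{ij} \Bigg) \ge \frac{1}{2} \Bigg( \bar\theta_i + \sum_{j \in N_i^-} \bar\theta_{ij} \Bigg) \ge \beta/2
$$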

Similarly, when $\bar\theta_i \le -\sum_{j \in N_i^+} \bar\theta_{ij} - \beta$, then $\eta_i = 0$ in any optimal solution. The difference in objective values from setting $\eta_i = 1/2$ in this case is:

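$$
\Delta f = -\frac{1}{2} \Bigg( \bar\theta_i + \sum_{\substack{j \in N_i^+ \\ \eta_j \in \{\frac{1}{2}, 1\}}} \bar\theta_{ij} + \sum_{\substack{j \in N_i^- \\ \eta_j = 1}} \bar\theta_{ij} \Bigg) \ge -\frac{1}{2} \Bigg( \bar\theta_i + \sum_{j \in N_i^+} \bar\theta_{ij} \Bigg) \ge \beta/2
$$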

Notice that for more fractional coordinates the difference in values can only increase, so in any case the fractional solution is worse by at least $\beta/2$. ∎
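The condition of Proposition 4.4 is cheap to check directly from the minimal parameters. The sketch below computes the largest $\beta$ for which every variable satisfies the condition, then confirms the resulting margin on a small hypothetical model by enumerating all half-integral points with the closed-form optimal edge terms. The enumeration is exponential and serves only as an illustration; all names and numbers are assumptions.

```python
import itertools

def beta_margin(tb_i, tb_ij):
    """Largest beta such that every variable satisfies the Prop. 4.4 condition.

    A non-positive return value means at least one variable violates it.
    """
    n = len(tb_i)
    pos = [0.0] * n  # sum of attractive edge weights at each variable
    neg = [0.0] * n  # sum of repulsive edge weights at each variable
    for (i, j), w in tb_ij.items():
        for v in (i, j):
            if w > 0:
                pos[v] += w
            else:
                neg[v] += w
    # Per-variable margin: the better of the two alternatives in the condition.
    return min(max(tb_i[v] + neg[v], -(tb_i[v] + pos[v])) for v in range(n))

def edge_term(a, b, w):
    """Optimal edge value given the singleton values a, b and the edge weight w."""
    return min(a, b) if w > 0 else max(0.0, a + b - 1.0)

def f(eta, tb_i, tb_ij):
    s = sum(w * eta[i] for i, w in enumerate(tb_i))
    return s + sum(w * edge_term(eta[i], eta[j], w) for (i, j), w in tb_ij.items())

# Hypothetical model: a repulsive (hence not balanced) triangle with strong singletons.
tb_i = [3.0, -1.0, 2.5]
tb_ij = {(0, 1): -1.0, (1, 2): -1.0, (0, 2): -1.0}
n = len(tb_i)

beta = beta_margin(tb_i, tb_ij)
integral = list(itertools.product((0.0, 1.0), repeat=n))
fractional = [p for p in itertools.product((0.0, 0.5, 1.0), repeat=n) if 0.5 in p]
best_int = max(f(p, tb_i, tb_ij) for p in integral)
best_frac = max(f(p, tb_i, tb_ij) for p in fractional)
gap = best_int - best_frac

print("beta:", beta, "integral optimum:", best_int, "best fractional:", best_frac)
```

In this example the observed margin is larger than $\beta/2$, which is consistent with the proposition giving a lower bound rather than an exact value.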

## Appendix B Additional Experimental Results

In this section we present additional experimental results for the ‘Scene’ dataset. Specifically, we inject random Gaussian noise to the input features in order to reduce the signal in the singleton scores and increase the role of the pairwise interactions. This makes the problem harder since the prediction needs to account for global information.

In Fig. 5 we observe that with exact training the exact loss is minimized, causing the exact-hinge to decrease, since it is upper bounded by the loss (middle, top). On the other hand, the relaxed-hinge (and relaxed loss) increase during training, which results in a large integrality gap and fewer tight instances. In contrast, with relaxed training the relaxed loss is minimized, which causes the relaxed-hinge to decrease. Since the exact-hinge is upper bounded by the relaxed-hinge it also decreases, but both hinge terms decrease similarly and remain very close to each other. This results in a small integrality gap and tightness of almost all instances.

Finally, in contrast to other settings, in Fig. 5 we observe that with exact training the test tightness is noticeably higher (about 20%) than the train tightness (Fig. 5, middle, bottom). This does not contradict our bound from Theorem 4.1, since in fact the test fractionality is even lower than the bound suggests. On the other hand, this result does entail that train and test tightness may sometimes behave differently, which means that we might need to increase the size of the trainset in order to get a tighter bound.

## References

• Bakir et al. (2007) G. H. Bakir, T. Hofmann, B. Schölkopf, A. J. Smola, B. Taskar, and S. V. N. Vishwanathan. Predicting Structured Data. The MIT Press, 2007.
• Barahona (1993) F. Barahona. On cuts and matchings in planar graphs. Mathematical Programming, 60:53–68, 1993.
• Bartlett and Mendelson (2002) P. L. Bartlett and S. Mendelson. Rademacher and Gaussian complexities: Risk bounds and structural results. The Journal of Machine Learning Research, 3:463–482, 2002.
• Borenstein et al. (2004) E. Borenstein, E. Sharon, and S. Ullman. Combining top-down and bottom-up segmentation. In CVPR, 2004.
• Chekuri et al. (2004) C. Chekuri, S. Khanna, J. Naor, and L. Zosin. A linear programming formulation and approximation algorithms for the metric labeling problem. SIAM J. on Discrete Mathematics, 18(3):608–625, 2004.
• Collins (2002) M. Collins. Discriminative training methods for hidden Markov models: Theory and experiments with perceptron algorithms. In EMNLP, 2002.
• Daumé III et al. (2009) H. Daumé III, J. Langford, and D. Marcu. Search-based structured prediction. Machine Learning, 75(3):297–325, 2009.
• Domke (2013) J. Domke. Learning graphical model parameters with approximate marginal inference. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 35(10), 2013.
• Finley and Joachims (2008) T. Finley and T. Joachims. Training structural SVMs when exact inference is intractable. In Proceedings of the 25th International Conference on Machine learning, pages 304–311, 2008.
• Globerson et al. (2015) A. Globerson, T. Roughgarden, D. Sontag, and C. Yildirim. How hard is inference for structured prediction? In ICML, 2015.
• Koo et al. (2010) T. Koo, A. M. Rush, M. Collins, T. Jaakkola, and D. Sontag. Dual decomposition for parsing with non-projective head automata. In EMNLP, 2010.
• Koster et al. (1998) A. Koster, S. van Hoesel, and A. Kolen. The partial constraint satisfaction problem: Facets and lifting theorems. Operations Research Letters, 23:89–97, 1998.
• Kulesza and Pereira (2007) A. Kulesza and F. Pereira. Structured learning with approximate inference. In Advances in Neural Information Processing Systems 20, pages 785–792. 2007.
• Kumar et al. (2009) M. P. Kumar, V. Kolmogorov, and P. H. S. Torr. An analysis of convex relaxations for MAP estimation of discrete MRFs. JMLR, 10:71–106, 2009.
• Lacoste-Julien et al. (2013) S. Lacoste-Julien, M. Jaggi, M. Schmidt, and P. Pletscher. Block-coordinate Frank-Wolfe optimization for structural SVMs. In ICML, pages 53–61, 2013.
• Makarychev et al. (2014) K. Makarychev, Y. Makarychev, and A. Vijayaraghavan. Bilu–Linial stable instances of max cut and minimum multiway cut. Proc. nd Symposium on Discrete Algorithms (SODA), 2014.
• Martins et al. (2009a) A. Martins, N. Smith, and E. P. Xing. Concise integer linear programming formulations for dependency parsing. In ACL, 2009a.
• Martins et al. (2009b) A. Martins, N. Smith, and E. P. Xing. Polyhedral outer approximations with application to natural language parsing. In Proceedings of the 26th International Conference on Machine Learning, 2009b.
• Meshi et al. (2013) O. Meshi, E. Eban, G. Elidan, and A. Globerson. Learning max-margin tree predictors. In UAI, 2013.
• Nowozin et al. (2014) S. Nowozin, P. V. Gehler, J. Jancsary, and C. Lampert. Advanced Structured Prediction. MIT Press, 2014.
• Roth (1996) D. Roth. On the hardness of approximate reasoning. Artificial Intelligence, 82, 1996.
• Roth and Yih (2004) D. Roth and W. Yih. A linear programming formulation for global inference in natural language tasks. In CoNLL, The 8th Conference on Natural Language Learning, 2004.
• Roth and Yih (2005) D. Roth and W. Yih. Integer linear programming inference for conditional random fields. In ICML, pages 736–743. ACM, 2005.
• Rush et al. (2010) A. M. Rush, D. Sontag, M. Collins, and T. Jaakkola. On dual decomposition and linear programming relaxations for natural language processing. In EMNLP, 2010.
• Schlesinger (1976) M. I. Schlesinger. Syntactic analysis of two-dimensional visual signals in noisy conditions. Kibernetika, 4:113–130, 1976.
• Shalev-Shwartz et al. (2011) S. Shalev-Shwartz, Y. Singer, N. Srebro, and A. Cotter. Pegasos: Primal estimated sub-gradient solver for SVM. Mathematical Programming, 127(1):3–30, 2011.
• Sherali and Adams (1990) H. D. Sherali and W. P. Adams. A hierarchy of relaxations between the continuous and convex hull representations for zero-one programming problems. SIAM J. on Disc. Math., 3(3):411–430, 1990.
• Shimony (1994) Y. Shimony. Finding the MAPs for belief networks is NP-hard. Artificial Intelligence, 68(2):399–410, 1994.
• Sontag and Jaakkola (2008) D. Sontag and T. Jaakkola. New outer bounds on the marginal polytope. In J. Platt, D. Koller, Y. Singer, and S. Roweis, editors, Advances in Neural Information Processing Systems 20, pages 1393–1400. MIT Press, Cambridge, MA, 2008.
• Sontag et al. (2008) D. Sontag, T. Meltzer, A. Globerson, T. Jaakkola, and Y. Weiss. Tightening LP relaxations for MAP using message passing. In UAI, pages 503–510, 2008.
• Taskar et al. (2003) B. Taskar, C. Guestrin, and D. Koller. Max-margin Markov networks. In Advances in Neural Information Processing Systems. MIT Press, 2003.
• Taskar et al. (2004) B. Taskar, V. Chatalbashev, and D. Koller. Learning associative Markov networks. In Proc. ICML. ACM Press, 2004.
• Thapper and Živný (2012) J. Thapper and S. Živný. The power of linear programming for valued CSPs. In FOCS, 2012.
• Tsochantaridis et al. (2004) I. Tsochantaridis, T. Hofmann, T. Joachims, and Y. Altun. Support vector machine learning for interdependent and structured output spaces. In ICML, pages 104–112, 2004.
• Wainwright and Jordan (2008) M. Wainwright and M. I. Jordan. Graphical Models, Exponential Families, and Variational Inference. Now Publishers Inc., Hanover, MA, USA, 2008.
• Wainwright et al. (2005) M. Wainwright, T. Jaakkola, and A. Willsky. MAP estimation via agreement on trees: message-passing and linear programming. IEEE Transactions on Information Theory, 51(11):3697–3717, 2005.