Bethe Learning of Conditional Random Fields via MAP Decoding

03/04/2015 ∙ by Kui Tang, et al.

Many machine learning tasks can be formulated in terms of predicting structured outputs. In frameworks such as the structured support vector machine (SVM-Struct) and the structured perceptron, discriminative functions are learned by iteratively applying efficient maximum a posteriori (MAP) decoding. However, maximum likelihood estimation (MLE) of probabilistic models over these same structured spaces requires computing partition functions, which is generally intractable. This paper presents a method for learning discrete exponential family models using the Bethe approximation to the MLE. Remarkably, this problem also reduces to iterative MAP decoding. This connection emerges by combining the Bethe approximation with a Frank-Wolfe (FW) algorithm on a convex dual objective, which circumvents the intractable partition function. The result is a new single-loop algorithm, MLE-Struct, which is substantially more efficient than previous double-loop methods for approximate maximum likelihood estimation. Our algorithm outperforms existing methods in experiments involving image segmentation, matching problems from vision, and a new dataset of university roommate assignments.


1 Introduction

Learning the parameters of a Markov random field (MRF) or a conditional random field (CRF) is a ubiquitous problem in machine learning and related fields. Often, the parameters are learned via regularized maximum likelihood estimation (MLE), and prediction is then performed via maximum a posteriori (MAP) or marginal inference. (In this paper, MAP inference refers to predicting an output $x$ given parameters $\theta$, while MLE learning refers to estimating $\theta$ from observations $x^1,\ldots,x^M$, optionally with quadratic regularization.) As the log-likelihood is concave, it can in principle be maximized by gradient ascent. However, this requires repeatedly computing gradients of the log-partition function, which is intractable in general. One can circumvent this difficulty by using surrogates for the log-partition function sutton2005, ganapathi2008, domke2013 or by approximating the partition function using sampling petterson2009exponential, papandreou2011perturb.

Alternatively, one can avoid likelihoods entirely, and use methods such as the structured perceptron or structured support vector machines (SVM-Struct) that rely only on a MAP solver collins2002discriminative,roller2004max,tsochantaridis2004support, finley2008training. Such methods can often be quite accurate and are typically faster than approximate MLE, since MAP, or relaxations thereof, can be performed quickly using sophisticated combinatorial solvers. By using such solvers as black boxes, MAP-based training methods also offer users an attractive abstraction between the learning problem and the optimization algorithm. On the other hand, MLE remains a primary goal for many practitioners, since it may yield superior predictive accuracy, offers parameter values with increased interpretability and statistical properties, and supports test-time marginal inference.

In this work, we introduce MLE-Struct, a novel approximate MLE algorithm that also only requires access to a MAP solver. We combine Bethe-style convex free energies with the Frank-Wolfe (FW) method frank1956algorithm,jaggi2013revisiting. A naive application of FW for approximate MLE would perform approximate marginal inference using repeated calls to MAP, as in the experiments of sontag2007new, and then use these marginals to perform a single gradient step on the parameters. This double-loop algorithm requires a significant number of MAP solver calls, especially if very accurate answers are required. Our approach achieves fast learning by avoiding this costly double loop structure. First, we employ a generic reweighted entropy approximation technique that yields convex Bethe-style surrogate likelihoods for any underlying undirected graphical model. Then, we construct a constrained, convex dual problem for this approximate maximum likelihood objective. We demonstrate that the approximate dual problem can be minimized efficiently using FW: each of the linear subproblems that are solved as part of the algorithm can be formulated as separate approximate or exact MAP inference tasks on each training example. Finally, we introduce a technique to accelerate the line search subroutine of FW by precomputing certain data-dependent terms.

We can also use FW to perform test-time marginal inference using the procedure of sontag2007new. Therefore, at both train and test time we can interact with our underlying problem structure using only a MAP routine. This allows us to design fast approximate learning and prediction algorithms for a wide variety of settings in which efficient approximate/exact MAP solvers exist: bipartite/general matching and b-matching problems (via the max-flow and blossom algorithms), pairwise binary graphical models (via QPBO), planar Ising models with no external field (via a reduction to matching) schraudolph08efficient, among others.

We apply our method to learn pairwise binary CRFs and distributions over matchings on both bipartite and general graphs. Our method provides good predictive performance while often solving the approximate MLE problem significantly faster and with fewer numerical instabilities than other approximate MLE methods. We also apply our method to a new dataset of housing preferences and roommate assignments of university students in order to predict good freshman roommate assignments.

2 Background and Related Work

We consider conditional random fields where, in addition to samples $x^1,\ldots,x^M$ from some discrete space $\mathcal{X}$, we also observe feature vectors $w^1,\ldots,w^M$ lafferty2001conditional. In this case, the conditional probability of the $m$th sample has the form

$$p(x^m \mid w^m; \theta) = \exp\big(\langle \theta, \phi(x^m, w^m)\rangle - \log Z(\theta, w^m)\big), \qquad (1)$$

where $\phi(x, w)$ is a vector of sufficient statistics, $\theta$ is a vector of parameters, and the partition function is given by $Z(\theta, w) = \sum_{x \in \mathcal{X}} \exp\big(\langle \theta, \phi(x, w)\rangle\big)$.

In typical applications, the joint probability distribution factors over a hypergraph $G = (V, \mathcal{A})$, where $\mathcal{A}$ is a collection of subsets of $V$. For ease of presentation, we assume that the sufficient statistics decompose over the nodes and hyperedges of $G$. That is, for $i \in V$ and $\alpha \in \mathcal{A}$,

$$\langle \theta, \phi(x, w)\rangle = \sum_{i \in V} \langle \theta, \phi_i(x_i, w)\rangle + \sum_{\alpha \in \mathcal{A}} \langle \theta, \phi_\alpha(x_\alpha, w)\rangle.$$

These models include both Potts and Ising models, as well as log-linear distributions over matchings (10).

Given observations $x^1,\ldots,x^M$ with corresponding feature vectors $w^1,\ldots,w^M$, where each $p(x^m \mid w^m; \theta)$ is of the form (1), we would like to learn $\theta$ by maximizing the log-likelihood of the observations plus a quadratic regularizer. Here,

$$\ell(\theta) = \sum_{m=1}^M \Big[ \langle \theta, \phi(x^m, w^m)\rangle - \log Z(\theta, w^m) \Big] - \frac{\lambda}{2}\|\theta\|_2^2.$$
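To make this setup concrete, the following small Python sketch (ours, for illustration only) computes the exact log-partition function and the gradient of the regularized log-likelihood by brute-force enumeration on a toy three-node chain; for realistic structured spaces this enumeration is exactly the intractable step.

```python
import itertools
import numpy as np

def log_Z(theta, phi, states):
    """Exact log-partition function by brute-force enumeration (toy sizes only)."""
    scores = np.array([theta @ phi(x) for x in states])
    m = scores.max()
    return m + np.log(np.exp(scores - m).sum())

def regularized_loglik_grad(theta, phi, states, samples, lam):
    """Gradient of the regularized log-likelihood:
    empirical sufficient statistics minus model expectations, minus lam * theta."""
    scores = np.array([theta @ phi(x) for x in states])
    p = np.exp(scores - scores.max()); p /= p.sum()
    expected = sum(pi * phi(x) for pi, x in zip(p, states))
    empirical = sum(phi(x) for x in samples) / len(samples)
    return len(samples) * (empirical - expected) - lam * theta

# Toy pairwise binary model on a 3-node chain: phi = (node indicators, edge agreements).
edges = [(0, 1), (1, 2)]
phi = lambda x: np.array([x[0], x[1], x[2]] + [float(x[i] == x[j]) for i, j in edges])
states = list(itertools.product([0, 1], repeat=3))
theta = np.zeros(5)
samples = [(0, 0, 0), (1, 1, 1), (1, 1, 0)]
print(log_Z(theta, phi, states), regularized_loglik_grad(theta, phi, states, samples, 0.1))
```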

2.1 Convex Free Energy Approximations

The central challenge in maximum likelihood estimation is computing the partition function for each sample at each iteration. In this work, we approximate the partition function in order to make the learning problem tractable. We begin with the Bethe free energy, a standard approximation to the so-called Gibbs free energy that is motivated by ideas from statistical physics. The approximation has been generalized to include different counting numbers that result in alternative entropy approximations weissdbcnt. We focus on a restricted set of counting numbers that result in a family of convex reweighted free energies.

The reweighted free energy at temperature $T$ is specified by a polytope approximation $\mathcal{L}(G)$, the hypergraph $G = (V, \mathcal{A})$, an entropy approximation $H_\rho$, and a vector $\rho$ of counting numbers (henceforth referred to as reweighting parameters):

$$F_{\rho, T}(\mu; \theta, w) = U(\mu; \theta, w) - T\, H_\rho(\mu), \qquad (2)$$

where the energy is given by

$$U(\mu; \theta, w) = -\sum_{i \in V} \sum_{x_i} \mu_i(x_i)\, \langle \theta, \phi_i(x_i, w)\rangle - \sum_{\alpha \in \mathcal{A}} \sum_{x_\alpha} \mu_\alpha(x_\alpha)\, \langle \theta, \phi_\alpha(x_\alpha, w)\rangle,$$

the entropy approximation is given by

$$H_\rho(\mu) = \sum_{i \in V} \Big(1 - \sum_{\alpha \ni i} \rho_\alpha\Big) H(\mu_i) + \sum_{\alpha \in \mathcal{A}} \rho_\alpha H(\mu_\alpha),$$

and $\mu$ is restricted to lie in an outer bound of the marginal polytope known as the local (marginal) polytope,

$$\mathcal{L}(G) = \Big\{ \mu \ge 0 : \sum_{x_i} \mu_i(x_i) = 1 \ \forall i \in V; \ \sum_{x_\alpha \setminus x_i} \mu_\alpha(x_\alpha) = \mu_i(x_i) \ \forall \alpha \in \mathcal{A}, i \in \alpha, x_i \Big\}.$$

The reweighted partition function is then computed by minimizing (2) over $\mathcal{L}(G)$: $T \log Z_{\rho,T}(\theta, w) = -\min_{\mu \in \mathcal{L}(G)} F_{\rho,T}(\mu; \theta, w)$.

Setting $\rho_\alpha = 1$ for each $\alpha \in \mathcal{A}$ recovers the typical Bethe free energy approximation. The reweighting parameters can always be chosen so that the approximate free energy is convex margwain, hazan2012convergent, ruozzi2013. For example, the tree-reweighted belief propagation algorithm (TRW) chooses the reweighting parameters so that they correspond to (hyper)edge appearance probabilities of a collection of spanning (hyper)trees.
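As an illustration of the quantities just defined, the following Python sketch (ours, with hypothetical argument names and data layout) evaluates the reweighted free energy $F_\rho = U - T\,H_\rho$ for a pairwise model, using the counting-number form of the entropy given above.

```python
import numpy as np

def reweighted_free_energy(mu_i, mu_ij, theta_i, theta_ij, edges, rho, T=1.0):
    """Reweighted Bethe-style free energy F = U - T * H_rho for a pairwise model.

    mu_i:  (n, 2) singleton pseudomarginals.
    mu_ij: dict mapping edge (i, j) -> (2, 2) pairwise pseudomarginal table.
    theta_i, theta_ij: potentials in the same shapes; rho: dict of edge counting numbers.
    A sketch under the form H_rho = sum_i (1 - sum_{a contains i} rho_a) H_i + sum_a rho_a H_a.
    """
    H = lambda p: -np.sum(p * np.log(np.clip(p, 1e-12, None)))
    energy = -np.sum(mu_i * theta_i) - sum(np.sum(mu_ij[e] * theta_ij[e]) for e in edges)
    deg = np.zeros(len(mu_i))
    for (i, j) in edges:
        deg[i] += rho[(i, j)]; deg[j] += rho[(i, j)]
    entropy = sum((1 - deg[i]) * H(mu_i[i]) for i in range(len(mu_i)))
    entropy += sum(rho[e] * H(mu_ij[e]) for e in edges)
    return energy - T * entropy
```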

2.2 Saddle Point Formulation

We approximate the exact partition function in the MLE objective with a reweighted free energy approximation of the form (2). This results in the saddle-point problem

$$\max_\theta \ \min_{\mu^1,\ldots,\mu^M \in \mathcal{L}(G)} \ \sum_{m=1}^M \Big[ \langle \theta, \phi(x^m, w^m)\rangle + F_{\rho, T}(\mu^m; \theta, w^m) \Big] - \frac{\lambda}{2}\|\theta\|_2^2. \qquad (3)$$

heinglob2011 investigated unregularized likelihoods of this form for MRFs, and demonstrated that convexity of the Bethe free energy guarantees that the empirical marginals satisfy a moment matching condition: the empirical marginals minimize the Bethe free energy at the $\theta$ that maximizes the approximate log-likelihood. Moment matching is not necessarily achieved for general MRFs when the reweighted approximation is not convex heinglob2011. margwain investigated the use of TRW for learning in pairwise binary graphical models. They observed that the parameters learned via TRW were more robust to the addition of new data than those learned by BP. This robustness of convex free energy approximations for learning can be made theoretically precise wainwright2006wrong.

If we compute the partition function via an iterative procedure, then solving (3) necessarily requires a double-loop algorithm, which can be expensive for large datasets. Existing work on Bethe learning has sought to design more efficient approximate learning algorithms. sutton2005 proposed a piecewise training scheme whereby the graph is divided into smaller subgraphs over which the partition function can be computed exactly or approximately. These results are then combined to approximate the true partition function. Because the subproblems are typically much smaller, the procedure is quite fast, but it can be inaccurate if the pieces are too small ganapathi2008. For bipartite matchings, one can obtain an unbiased but noisy gradient of the log-likelihood by utilizing a perfect sampler due to huber08fast. petterson2009exponential use this approach for ranking and graph matching problems, but limited themselves to graphs with 20 vertices, and each observation required its own set of samples. domke2013 proposed performing MLE using a small, fixed number of TRW iterations as part of a procedure to estimate the gradient. However, if TRW is not converging quickly (i.e., a reasonable solution is not obtained after running for a fixed number of iterations), the resulting procedure can fail to converge. vishwanathan2006accelerated proposed improving the convergence of the outer loop using accelerated gradient methods. All of the above methods rely on a double loop.

3 Approximate MLE

We now consider a convex dual reformulation of (3) that applies to convex free energies and yields a new, fast learning algorithm. We first note that (3) is concave in the variables being maximized and convex in the variables being minimized, and one set of variables (the $\mu^m$) is constrained to a compact domain. We can thus invoke Sion's minimax theorem sion1958 to swap the $\max$ and $\min$ operators. Next, we can analytically solve for the optimal $\theta$ in terms of fixed $\mu^1,\ldots,\mu^M$. Setting the gradient with respect to $\theta$ equal to zero in (3) yields

$$\theta^*(\mu) = \frac{1}{\lambda} \sum_{m=1}^M \big( \phi(x^m, w^m) - \mathbb{E}_{\mu^m}[\phi(x, w^m)] \big), \qquad (4)$$

$$\mathbb{E}_{\mu^m}[\phi(x, w^m)] = \sum_{i \in V} \sum_{x_i} \mu_i^m(x_i)\, \phi_i(x_i, w^m) + \sum_{\alpha \in \mathcal{A}} \sum_{x_\alpha} \mu_\alpha^m(x_\alpha)\, \phi_\alpha(x_\alpha, w^m). \qquad (5)$$

Finally, substituting these back into (3) yields the following optimization problem over the local marginal polytope:

$$\min_{\mu^1,\ldots,\mu^M \in \mathcal{L}(G)} \ \frac{1}{2\lambda} \Big\| \sum_{m=1}^M \big( \phi(x^m, w^m) - \mathbb{E}_{\mu^m}[\phi(x, w^m)] \big) \Big\|_2^2 - T \sum_{m=1}^M H_\rho(\mu^m). \qquad (6)$$

The linearly-constrained convex objective (6) can be minimized via general convex optimization techniques, such as the ellipsoid method—though this can be slow in practice. In the sequel, we minimize this objective with the Frank-Wolfe algorithm (FW).

Convex free energies can be obtained using the reweighting techniques above, and whenever the graph is a tree the standard Bethe free energy is both convex and exact. In principle, a similar argument can be made for any convex approximation of the partition function, though we only focus on convex Bethe-style approximations in this work.

The MLE objective is dual to the maximum entropy problem, and recent work on approximate MLE has focused on different families of entropy approximations wainwright2008graphical. ganapathi2008 also followed a maximum entropy approach to the approximate MLE problem. They proposed approximating the entropy objective using the Bethe entropy approximation (i.e., $\rho_\alpha = 1$ for all $\alpha$), but specifically avoided convex entropy approximations. Unfortunately, this results in a non-convex optimization problem in general, for which the authors use the concave-convex procedure. Other recent work approximated the MLE problem using convex free energies, but did not consider the maximum entropy approach domke2013.

3.1 Frank-Wolfe Algorithm for Maximum Likelihood Learning

Following jaggi2013revisiting, FW minimizes a general convex function $f$ over a compact convex set $\mathcal{D}$ via a sequence of iterates defined by

$$s^{(k)} = \arg\min_{s \in \mathcal{D}} \ \big\langle s, \nabla f(x^{(k)}) \big\rangle, \qquad (7)$$

$$x^{(k+1)} = (1 - \gamma_k)\, x^{(k)} + \gamma_k\, s^{(k)}, \qquad (8)$$

where the step size $\gamma_k$ is either selected using line search or fixed at $\gamma_k = \frac{2}{k+2}$.

For the objective function in (6), each step requires minimizing a linear objective over a linear set of constraints:

$$\min_{s^1,\ldots,s^M \in \mathcal{L}(G)} \ \sum_{m=1}^M \big\langle s^m, \nabla_{\mu^m} f(\mu^{(k)}) \big\rangle. \qquad (9)$$

Since the constraints are separable across training examples, (9) decouples into $M$ independent linear programs (LPs) that can be solved in parallel. Depending on the specific application, purely combinatorial methods or reweighted message-passing algorithms may provide faster and more space-efficient alternatives to generic LP solvers. Note that (6) could not be solved efficiently with projected gradient algorithms, since projection onto the local or marginal polytopes is not tractable.

Despite the ability to solve (9) in parallel, it can still be prohibitive for large sample sizes. For convex optimization problems over separable constraint spaces, lacoste2013block propose a block-coordinate FW algorithm (BCFW). The BCFW procedure performs the FW iteration over a single, randomly selected block $\mu^m$ and leaves the remaining coordinates untouched. BCFW requires less work at each iteration, but the asymptotic rate of convergence remains the same as that of the standard FW algorithm lacoste2013block. This block-coordinate approach is known to outperform FW for the SVM-Struct problem. Technical details concerning the convergence of FW for this problem, including methods to bound the convergence rate, can be found in Appendix B. Both the FW and BCFW versions of our algorithm, MLE-Struct, are described in Algorithm 1. Line search can be accelerated by precomputing quadratic terms, as discussed in Appendix D.2.

  Input: training examples $\{(x^m, w^m)\}_{m=1}^M$, reweighting parameters $\rho$, regularizer $\lambda$
  Output: approximate maximum likelihood parameters $\hat\theta$.
  Initialization: set each $\mu^m$ to the uniform distribution.
  repeat
     for $m = 1, \ldots, M$ in parallel (batch) or $m$ chosen uniformly at random (block) do
        $s^m \leftarrow \arg\min_{s \in \mathcal{L}(G)} \langle s, \nabla_{\mu^m} f(\mu) \rangle$  (a MAP/LP call on example $m$)
        Set $\gamma = \frac{2}{k+2}$ for batch and $\gamma = \frac{2M}{k+2M}$ for block, or use line search.
        $\mu^m \leftarrow (1 - \gamma)\,\mu^m + \gamma\, s^m$
     end for
  until converged
  Set $\hat\theta$ using (4) and (5).
Algorithm 1 MLE-Struct: Frank-Wolfe Approximate Maximum-Likelihood Learning
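A minimal Python sketch of the batch variant of Algorithm 1, assuming a black-box map_oracle that maximizes a linear objective over the local polytope (e.g., an exact matching solver or an LP/QPBO relaxation) and a helper entropy_grad for the gradient of $H_\rho$; the names and data layout (phis, Phis) are our illustrative conventions, not the authors' implementation.

```python
import numpy as np

def mle_struct_fw(phis, Phis, map_oracle, entropy_grad, lam=1.0, T=1.0, iters=200):
    """Batch Frank-Wolfe on the dual objective (6), sketched for illustration.

    phis[m]      : empirical sufficient statistics of example m (vector of length d).
    Phis[m]      : d x p_m matrix so that E_mu[phi] = Phis[m] @ mu[m].
    map_oracle   : map_oracle(scores, m) -> vertex of the local polytope maximizing
                   <scores, s> for example m (a black-box MAP/LP solver).
    entropy_grad : entropy_grad(mu_m, m) -> gradient of H_rho at mu_m.
    Returns the learned parameter vector theta via equation (4).
    """
    M = len(phis)
    # illustrative init; in practice set each node/edge marginal block to uniform
    mu = [np.full(Phi.shape[1], 1.0 / Phi.shape[1]) for Phi in Phis]
    for k in range(iters):
        # theta*(mu) from equation (4), evaluated at the current iterate
        theta = sum(phis[m] - Phis[m] @ mu[m] for m in range(M)) / lam
        gamma = 2.0 / (k + 2.0)
        for m in range(M):
            # gradient of (6) w.r.t. mu[m]; the linear subproblem is a MAP call
            grad = -Phis[m].T @ theta - T * entropy_grad(mu[m], m)
            s = map_oracle(-grad, m)          # maximize <-grad, s> = minimize <grad, s>
            mu[m] = (1.0 - gamma) * mu[m] + gamma * s
    return sum(phis[m] - Phis[m] @ mu[m] for m in range(M)) / lam
```

The block-coordinate variant updates only one randomly chosen $\mu^m$ per step, with the larger block step size suggested by lacoste2013block (e.g., $\gamma = \frac{2M}{k+2M}$), or with line search.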

3.2 Frank-Wolfe for Marginal Inference

In many applications, computing marginals is useful at test time as well as during learning. Fortunately, we can use FW to perform marginal inference, thus maintaining our ability to interact with the underlying model only through a MAP solver. Specifically, approximate reweighted Bethe marginals can be obtained by minimizing (2) with respect to $\mu$, which is a convex problem suitable for FW. The technique was first used in sontag2007new, using a generic LP solver for MAP.

In Appendix F, we provide experiments on CRFs defined over bipartite matchings (see Section 4.1), demonstrating the favorable accuracy and speed of FW-based inference versus the BP algorithm of huang2009approximating and an instance of the Perturb-and-MAP framework designed specifically for matchings KeLi2013. We find that FW outperforms Perturb-and-MAP in terms of both accuracy and convergence speed. FW and BP minimize the same objective, since the Bethe entropy is convex for matchings, so we compare them purely in terms of speed. We find that FW is preferable to BP in most regimes, except when extremely precise optimization is required.
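The same oracle supports test-time inference. The sketch below (again ours, assuming the same kind of black-box MAP oracle and entropy-gradient helper as in the learning sketch above) applies FW to minimize the free energy (2) for a single example.

```python
import numpy as np

def fw_marginals(theta_scores, entropy_grad, map_oracle, p, T=1.0, iters=200):
    """Approximate marginal inference by Frank-Wolfe on the free energy (2).

    theta_scores : vector of potentials <theta, phi_.(., w)> for this example.
    entropy_grad : gradient of H_rho at the current pseudomarginals.
    map_oracle   : black-box MAP call maximizing a linear objective over the local polytope.
    """
    mu = np.full(p, 1.0 / p)                          # illustrative uniform starting point
    for k in range(iters):
        grad = -theta_scores - T * entropy_grad(mu)   # gradient of U - T * H_rho
        s = map_oracle(-grad)                         # MAP call: minimize <grad, s>
        gamma = 2.0 / (k + 2.0)
        mu = (1.0 - gamma) * mu + gamma * s
    return mu
```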

4 Applications and Experiments

We apply the MLE-Struct framework to a variety of exponential family models defined over different combinatorial structures, including grid CRFs for image segmentation, bipartite matchings in vision applications, and general perfect matchings for a university roommate assignment problem. For CRFs, MAP inference is intractable, but we can efficiently solve the LP relaxation, which is equivalent to MAP inference over the local polytope, using QPBO rother2007qpbo. This means our estimated pseudomarginals will not be globally consistent, but the procedure can still yield accurate predictions margwain. For the matching problems, we use efficient max-flow and blossom solvers to obtain exact MAP solutions (i.e., over the marginal polytope) goldberg1995efficient, kolmogorov2009blossom. In this case, our estimated pseudomarginals will be globally consistent. Appendix A details the data sources, feature extraction, and machine setup.

4.1 Permanents and Perfect Matchings

We first consider the problem of learning distributions over perfect matchings of a given graph. For a graph $G$ and edge weights $W$, the probability of observing a particular matching is

$$p(X \mid W) = \frac{\exp\big(\langle W, X\rangle\big)}{Z(W)}, \qquad (10)$$

where $X$ is the adjacency matrix of a perfect matching in $G$, $W$ is a weighted adjacency matrix of $G$, and $Z(W) = \sum_{X'} \exp\big(\langle W, X'\rangle\big)$ is the partition function. Each entry of $W$ is a function of edge-wise features. Our formulation can be relaxed to distributions over all matchings by allowing $X$ to correspond to the adjacency matrix of any (not necessarily perfect) matching.

When $G$ is bipartite, the partition function equals the permanent of the matrix of exponentiated edge weights and is thus #P-hard to compute valiant1979complexity. Although the partition function can be computed to any given accuracy using a fully polynomial randomized approximation scheme jerrum2004polynomial, such algorithms are impractical for graphs of any significant size.

In practice, $W$ is unknown and must be learned from data. We can learn a generative model by estimating $W$ directly, or a conditional model by first assuming that $W$ is a linear combination of some feature maps and then learning the combination weights. For concreteness, suppose we have $K$ features, and for the $k$th feature we have a matrix $F^k$. Let $\theta \in \mathbb{R}^K$ be our model parameters, so that the weight on edge $(i, j)$ is $W_{ij} = \sum_{k=1}^K \theta_k F^k_{ij}$. Then the conditional likelihood is

$$p(X \mid F; \theta) = \exp\Big( \sum_{k=1}^K \theta_k \langle F^k, X\rangle - \log Z(\theta, F) \Big). \qquad (11)$$
Theorem 4.1.

For any reweighting parameters $\rho$ with entries in $[\tfrac{1}{2}, 1]$, any graph (bipartite or general), and any matching model (perfect or imperfect), the reweighted free energy (2) is convex over the local polytope.

Theorem 4.1 is proven in Appendix C. By inclusion, it implies (2) is also convex over the marginal polytope. This generalizes earlier known results of convexity for bipartite perfect matchings betheperm, chertkov2013. Due to the convexity of the Bethe entropy and the availability of high-quality maximum-weight matching solvers, Algorithm 1 is well-suited to the approximate MLE task. A derivation of the specific form of (6) for matchings and a technique for making the associated line search particularly efficient by precomputing certain data-dependent terms can be found in Appendix D.
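For bipartite matchings, each MAP call inside Algorithm 1 is a maximum-weight assignment problem. As a sketch (our illustration; the authors use max-flow/blossom solvers), SciPy's Hungarian-algorithm routine can serve as the black-box oracle, with the weight matrix built from feature maps as in (11).

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def matching_map_oracle(scores):
    """Return the 0/1 adjacency matrix of the maximum-weight perfect bipartite matching
    for an n x n score matrix. Serves as the black-box MAP call in Algorithm 1."""
    rows, cols = linear_sum_assignment(scores, maximize=True)
    X = np.zeros_like(scores)
    X[rows, cols] = 1.0
    return X

# Example: weight matrix W = sum_k theta_k * F_k built from K feature maps, as in (11).
K, n = 3, 5
rng = np.random.default_rng(0)
F = rng.normal(size=(K, n, n))        # hypothetical edge-wise feature maps
theta = rng.normal(size=K)
W = np.tensordot(theta, F, axes=1)    # W[i, j] = sum_k theta_k * F_k[i, j]
print(matching_map_oracle(W))
```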

4.1.1 Synthetic Bipartite Matchings

(a) High SNR problem: Average (divided by sample size) true regularized log-likelihood evaluated at the exact MLE as well as at the parameters that maximize the RW and Bethe approximate log-likelihoods. Higher is better, with the Exact curve being the upper bound. The data are nearly as probable under the Bethe estimator as they are under the exact MLE.
(b) Low SNR problem: Same plot as (a). All methods perform comparably in this setting.
(c) High SNR problem: Optimal values of the regularized true and approximate log-likelihoods under the RW and Bethe approximations. The true likelihood is always bounded by the approximations.
(d) Low SNR problem: Same as (c)
Figure 1: Exact likelihoods of the Bethe and RW estimates, and sandwich bounds on the likelihood.

We begin with a synthetic experiment that uses the flexibility of MLE-Struct to analyze the accuracy of various entropy approximations for matchings. We sample bipartite matchings from the distribution (11). We explore two choices of the weight matrix $W$: one in the high SNR regime, where the diagonal entries strongly dominate the off-diagonal entries, and one in the low SNR regime, where the diagonal and off-diagonal weights are closer in magnitude.

Our problems are small enough that we can compute exact partition functions and their gradients with Ryser's algorithm ryser1963. Hence, we can perform exact MLE with gradient descent, and we can also evaluate the true (regularized) likelihood of our estimates. We ran Algorithm 1 with $\rho_\alpha = 1$ for all $\alpha$ and call this result the Bethe estimator. In addition, an appropriate reweighted setting for this problem guarantees a concave entropy approximation and an upper bound on the partition function weissconv. We also ran Algorithm 1 at this setting and denote the result as the RW estimator.
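For reference, here is a short Python sketch (ours, not the paper's code) of Ryser's inclusion-exclusion formula, which makes the exact partition function of (10) computable at these toy sizes and underlies the exact MLE baseline above.

```python
import itertools
import numpy as np

def ryser_permanent(A):
    """Permanent of an n x n matrix via Ryser's inclusion-exclusion formula, O(2^n * n^2)."""
    n = A.shape[0]
    total = 0.0
    for r in range(1, n + 1):
        for cols in itertools.combinations(range(n), r):
            row_sums = A[:, cols].sum(axis=1)
            total += (-1) ** r * np.prod(row_sums)
    return (-1) ** n * total

# The exact partition function of (10) is the permanent of the exponentiated weights.
W = np.random.default_rng(1).normal(size=(4, 4))
log_Z_exact = np.log(ryser_permanent(np.exp(W)))
```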

Figure 1(a) displays the average regularized log-likelihood of each estimator, higher being better and the Exact curve being an upper bound. In both low and high SNR regimes, the Bethe estimator is superior to the RW estimator. Reweighted entropies such as the one chosen here are known to perform poorly as estimators of the true partition function as compared to belief propagation. Interestingly, although the objective values of the estimates are different in each case, in the low SNR regime, all estimation methods produce about the same likelihood.

Our framework can also be used to bound the value of the true likelihood. First, since $Z_{\mathrm{RW}}$ provides an upper bound on $Z$ margwain, we have $\ell_{\mathrm{RW}}(\theta) \le \ell(\theta)$ for any $\theta$. Second, for matchings, we have $Z_{\mathrm{B}}(\theta) \le Z(\theta)$ gurvitsnew. Therefore, we have

$$\ell_{\mathrm{RW}}(\theta) \ \le\ \ell(\theta) \ \le\ \ell_{\mathrm{B}}(\theta) \qquad (12)$$

for all $\theta$. Moreover, $\ell_{\mathrm{RW}}$ and $\ell_{\mathrm{B}}$ are global bounds on the maximum likelihood, so the inequalities also hold at their respective optima. That is,

$$\ell_{\mathrm{RW}}(\hat\theta_{\mathrm{RW}}) \ \le\ \ell(\hat\theta_{\mathrm{MLE}}) \ \le\ \ell_{\mathrm{B}}(\hat\theta_{\mathrm{B}}), \qquad (13)$$

where $\hat\theta_{\mathrm{RW}}$ is the RW estimator, $\hat\theta_{\mathrm{MLE}}$ is the regularized MLE, and $\hat\theta_{\mathrm{B}}$ is the Bethe estimator. We plot the quantities of (13) in Figures 1(c) and 1(d). We can also use (12) to obtain upper and lower bounds on $\ell(\hat\theta_{\mathrm{B}})$ and $\ell(\hat\theta_{\mathrm{RW}})$ by using FW for inference to compute $\ell_{\mathrm{RW}}(\hat\theta_{\mathrm{B}})$ and $\ell_{\mathrm{B}}(\hat\theta_{\mathrm{RW}})$, since $\ell_{\mathrm{B}}(\hat\theta_{\mathrm{B}})$ and $\ell_{\mathrm{RW}}(\hat\theta_{\mathrm{RW}})$ are already available upon convergence of Algorithm 1. Appendix A shows these results.

4.1.2 Graph Matchings

Test loss: 0.013, 0.013, 0.017, 0.017, 0.020, 0.017
Figure 2: Test loss for approximate MLE using FW on the house data set at a fixed gap, for six different choices of uniform reweighting vectors $\rho$.
Figure 3: Top: Approximate MAP assignment produced from the learned model. Red edges were incorrectly matched. Middle: Pseudomarginals for a correctly predicted edge. The correct edge has high probability (red) while all others have low probability (blue). Bottom: Pseudomarginals for a wrongly predicted edge. There are two edges with nontrivial probability (red and green). When the model is forced to pick one, it picked the wrong one.
Hotel
Gap   FW      lin.+l.
0     0.0     0.0
10    0.0     0.0022
20    0.0049  0.0049
30    0.020   0.020
40    0.023   0.013
50    0.0614  0.050
60    0.13    0.12
70    0.17    0.15
80    0.24    0.19
90    0.33    0.30

House
Gap   FW      lin.+l.
0     0.0     0.0
10    0.0040  0.0
20    0.0022  0.0022
30    0.0049  0.0
40    0.0     0.0
50    0.017   0.022
60    0.041   0.051
70    0.051   0.067
80    0.080   0.14
90    0.12    0.11

Figure 4: Test loss versus gap between frames for the hotel/house data, measured via the Hamming loss, for both MLE-Struct (FW) and the method of Caetano et al. (2009) (lin.+l.).

We now apply the bipartite matching model to a graph matching problem arising in computer vision over the CMU house and hotel image sequences. We follow the setup of caetano2009learning. The data consist of 111 frames of a toy house and 101 separate frames of a toy hotel, each rotated a fixed angle from its predecessor. Each frame was hand-labeled with the same 30 landmark points. We consider pairs of images a fixed number of frames apart (the gap), which we divide into training, validation, and testing sets following the same splits as caetano2009learning. We measure the average Hamming error between the predicted matching (the MAP estimate under our learned parameters) and the ground truth.

We compare our algorithm against the linear+learning method of caetano2009learning, which fits the parameters of a linear model using the same features as our algorithm but with a hinge-loss objective. The results of the experiments are reported in Figure 4. For each subsequence, we chose the regularization parameter via cross-validation. Both methods perform comparably, with our method doing slightly better on the houses and the method of caetano2009learning doing slightly better on the hotels. We also compared the performance of our algorithm with different reweighting parameters $\rho$. Figure 2 shows the results for the house data at a fixed gap for various choices of $\rho$. We observed little difference in test error as $\rho$ varies; this was confirmed over synthetic as well as real data. As a result, we did not tune $\rho$ for different data/problem setups.

Figure 3 illustrates one advantage of learning a probabilistic model over a discriminative model: the pseudomarginals indicate the model’s confidence in a prediction. In many cases, when the algorithm made the wrong prediction, two edges incident to a specific node had relatively high pseudomarginal probabilities. In these cases, the errors were not completely unfounded. Similar image parts were matched, albeit incorrectly.

Algorithm 1 permits fast and simple approximate MLE in problems where it was previously quite difficult. Namely, we found that using standard BP approaches huang2009approximating,bayati2011 to compute marginals for the standard double-loop MLE approach was very unstable for this problem because BP often failed to converge after taking several gradient steps. In later sections, we juxtapose Algorithm 1 with alternative approximate MLE approaches.

4.2 Unipartite Perfect Matchings

Figure 5: ROC curves for roommate matching of our algorithm and a constant baseline.
Profile Item Weight
Smoking -0.0484
Personality -0.0370
I generally go to bed at… -0.0296
I generally wake up at… -0.0218
Study with audio/visual -0.0133
Overnight Guests -0.0097
Cleanliness -0.0056
Table 1: Distance features of the roommate survey data with the largest learned weights. More negative values more strongly penalize pairs whose answers differ.

Many undergraduate institutions assign first-year students to roommates based on questionnaire responses, but allow returning students to pick their own roommates in subsequent years. Therefore, we can use observed roommate matchings of returning students to train a model for students’ preferences. Such a model can then be used by the administration to assign first-year students roommates that they will get along with.

We obtained an anonymized dataset of campus housing room assignments and questionnaire responses for undergraduate students at a major US institution for the years 2010–2012. We used data from 2010 and 2011 to train and data from 2012 to test. We prune those students who did not live in campus housing or were assigned to single rooms (there were no rooms with three or more residents). The remaining students were thus assigned to one roommate, and form perfect matchings in complete graphs of 2374–2504 nodes. As our data includes neither year nor gender, we treat the entire matching assignment for one year as one observation.

Our questionnaire data consist of 2 binary features and 12 ordinal features with 5 levels each. For each pair of students and each questionnaire item, we created one absolute-difference feature and a set of interaction features, consisting of one indicator for each possible pair of answers to that item. For simplicity, we assumed symmetric interactions. For each student pair, the weighted score for their matching is a linear combination of these features.
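A sketch of this pair-feature construction for a single ordinal item (our illustration; the authors' exact preprocessing may differ):

```python
import numpy as np

def pair_features(a, b, n_levels=5):
    """Features for one student pair from one ordinal questionnaire item.

    a, b are the two students' answers in {0, ..., n_levels-1}. Returns the
    absolute-difference feature plus symmetric interaction indicators, one per
    unordered pair of answers (a sketch of the construction described above)."""
    diff = abs(a - b)
    interactions = np.zeros((n_levels, n_levels))
    lo, hi = min(a, b), max(a, b)       # symmetric: (a, b) and (b, a) share an indicator
    interactions[lo, hi] = 1.0
    return np.concatenate(([diff], interactions[np.triu_indices(n_levels)]))

# The edge weight for a pair is a linear combination of these features across all items.
w_edge = lambda theta, feats: float(theta @ feats)
```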

We fit a model using MLE-Struct. Table 1 lists the largest coordinates of $\hat\theta$ for the distance features on ordinal questionnaire responses. Here, more negative values indicate that closer agreement is required. We see that smoking, personality (introverted vs. extroverted), and sleeping habits require the strongest agreement. For more results and details of the features, see Appendix A.

As we are effectively performing multiclass classification with roughly 2,500 classes and only 14 questionnaire items, we do not expect high accuracy in terms of Hamming error. Instead, we consider the use case where the model is used to reject very bad roommate assignments. To evaluate this, we use our learned $\hat\theta$ to form the weight matrix from features of the test year (2012), and use the entries of this matrix as scores for a binary classifier. We then plot ROC curves in Figure 5, where we demonstrate gains over random guessing and over a constant baseline that places the same (negative) weight on every distance feature and zero weight on the interactions. In particular, our algorithm dominates in the low false-positive regime of the curve. As competitors, we also evaluated a structured perceptron and a structured SVM using the same MAP decoder, but even after extensive parameter tuning they did not generalize well to test data, obtaining test AUCs worse than the constant baseline.
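A sketch of this evaluation protocol (illustrative; scikit-learn is our choice here, not necessarily the authors'):

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

def roommate_roc(score_matrix, true_matching_pairs):
    """Score every candidate pair by its learned edge weight and compare against
    the observed matching, treating matched pairs as positives."""
    n = score_matrix.shape[0]
    iu = np.triu_indices(n, k=1)                      # each unordered pair once
    y_score = score_matrix[iu]
    truth = np.zeros((n, n), dtype=bool)
    for i, j in true_matching_pairs:
        truth[min(i, j), max(i, j)] = True
    y_true = truth[iu]
    fpr, tpr, _ = roc_curve(y_true, y_score)
    return fpr, tpr, roc_auc_score(y_true, y_score)
```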

4.3 Grid CRFs

(a) Difference between the objective value and the midpoint of the best values attained by each algorithm (a proxy for the duality gap). The domke10 curve crosses this midpoint but does not converge to the correct value.
(b) Test error (mean Hamming loss) vs. time. The MLE-Struct-wavg curve traces the error of our method with averaged iterates, which achieves the lowest test error and reaches it fastest.
Figure 6: Horse dataset performance measures. MLE-Struct is our method; the legend is shared across both plots.
(a) Raw test image (#207) and ground truth segmentation.
(b) Estimated marginals of our algorithm (MLE-Struct) and that of domke2013 (domke40) on image #207 after 9.2 minutes and after 3.7 hours of training time. Our algorithm terminated after 3.7 hours, while that of domke2013 took nearly twice as long to converge.
Figure 7: Visual results on the horse dataset.

Next we study a binary image segmentation problem on the Weizmann horses dataset borenstein2002class. We formulate this as a pairwise binary model with a variable for each pixel indicating whether it is part of a horse. We used the same features and resized images obtained from domke2013 and kept their split of 200 training and 128 testing images. This results in grid CRFs of approximately 10,000–40,000 nodes.

In initial experiments, we tried a naive double-loop method: subgradient descent on $\theta$, using TRW belief propagation to compute subgradients of the TRW log-partition function. This method was too slow and unstable even for small real problems. Instead, we compared our algorithm against the method proposed by domke2013, which solves (3) by using a small, fixed number of TRW iterations to approximate the partition function, and thereby the gradient of the approximate likelihood. The outer loop runs L-BFGS until convergence. Limiting the number of inner TRW iterations is key to this algorithm's efficiency, but it burdens the user with another tuning parameter.

In Figure 6, we compare two variants of our algorithm with three variants of the algorithm of domke2013 in terms of objective value and test error over time. Both methods optimize the same objective: we optimize the dual, while they optimize the primal. We set $\rho_{ij} = \tfrac{1}{2}$ in our parameterization, which matches the setting of their published results. The MLE-Struct curves result from applying the block-coordinate version of our Algorithm 1; we obtained the fastest convergence without using line search. MLE-Struct-wavg uses the same algorithm but evaluates the test error using a weighted average of the iterates, as described by lacoste2013block. The curve for averaged iterates is substantially smoother than the raw MLE-Struct curve and very quickly attains a low test error. The domke-$k$ curves result from running their algorithm with $k$ TRW inner-loop iterations. This method is not guaranteed to converge to the global optimum for any finite $k$; a practitioner must run the algorithm for a sequence of increasing values of $k$ to confirm convergence to the correct value. In contrast, our FW algorithm requires only a single run and computes an upper bound on the duality gap as a byproduct jaggi2013revisiting.

In early iterations, our algorithm achieves the lowest objective value and test error. Our algorithm attains low test error even when the objective value is relatively far from optimal. This phenomenon is a result of the dual formulation: we iteratively move $\mu$ to minimize the objective, but for each value of $\mu$, we compute the optimal $\theta$ in closed form as a linear function of $\mu$. While $\mu$ may initially be very inaccurate, contributing to a large objective value, $\theta$ is much lower dimensional, so some of the errors may cancel when computing $\theta^*(\mu)$, resulting in good predictions nevertheless. In contrast, the method of domke2013 iteratively moves $\theta$, and for each value of $\theta$, computes an approximately optimal $\mu$ using TRW. Prior to convergence, this may enable a better fit to the training data at the expense of accurate estimation of $\theta$.

The effect is readily apparent when visualizing predicted marginals on the test set in Figure 7. The domke40 method takes 9.2 minutes to complete one iteration. Parameters after one iteration are nowhere near the MLE, as evidenced by the first row of Figure 7(b) and the early portion of the objective value plot in Figure 6(a); the mean Hamming loss on this sample is 0.249. Their marginal estimates at this point have only used local intensity data: light regions are classified as "horse" and dark regions as "not horse." In contrast, in about the same time, our BCFW method has already run for 12,000 iterations, made 60 passes over the training data, and essentially recovered the correct segmentation (except for difficult portions on the mane and hind legs, where the background texture is confusing), with a mean Hamming loss of 0.068. After 3.7 hours, both algorithms produce comparable visualizations, though we obtain a normalized Hamming error of 0.074 on this sample, while domke40 obtains 0.109. It takes 8.2 hours for domke40 to converge according to its internal criteria, attaining a final test error of 0.0813 on this image.

5 Discussion and Conclusion

In maximum likelihood estimation of discrete exponential family models, replacing the Gibbs free energy with a convex free energy approximation leads to a concave-convex saddle-point problem. We have shown that adding a quadratic regularizer enables a closed-form maximization, leaving a single convex minimization problem, which can be solved efficiently using the Frank-Wolfe algorithm. We can scale to large datasets by using block-coordinate Frank-Wolfe, and we rapidly achieve low test error by solving the dual objective. This method accomplishes approximate MLE using a simple wrapper around a black-box MAP solver. Previously, practitioners either employed expensive double-loop MLE procedures or abandoned MLE by resorting to structured SVMs and perceptrons. Our method is competitive with max-margin MAP-based estimation methods in terms of prediction error, faster than competing MLE methods in minimizing test error, and simple to implement. In future work, we will extend the method to other combinatorial problems, incorporate structure learning with $\ell_1$ regularization, and handle latent variable models as a single minimax problem.


Supplementary Material

The following material provides details about techniques employed in the paper. Appendix A presents details of the experimental setup and additional results. Appendix B discusses the convergence properties of FW for our problems. Appendix C proves convexity of the Bethe free energy for general matchings. Appendix D derives an instance of our Algorithm 1 for matchings, and Appendix E derives the same for linear CRFs.

Then, we present a full derivation of the dual objective and line search objective for doing FW learning for graph matchings. This is presented in terms of matrix-valued terms (various weighted adjacency matrices for the graph), which facilitates easy implementation, since MAP solvers operate on such matrices.

Appendix A Details on Experiments

In this appendix, we explain further details of experiments.

A.1 Synthetic Experiments

(a) Exact and approximate likelihoods evaluated at the parameters that maximize the Bethe log-likelihood.
(b) Exact and approximate likelihoods evaluated at the parameters that maximize the RW log-likelihood.
Figure 8: Sandwich bounds on the likelihood at the Bethe and TRW estimators. Top is the high SNR problem, bottom is the low SNR problem.

For matchings, the TRW likelihood lower bounds the true likelihood, while the Bethe likelihood upper bounds the true likelihood at all parameter values. Thus, we can obtain two-sided bounds on the true likelihood of the Bethe estimator $\hat\theta_{\mathrm{B}}$ by evaluating $\ell_{\mathrm{RW}}(\hat\theta_{\mathrm{B}})$, and on the true likelihood of the RW estimator $\hat\theta_{\mathrm{RW}}$ by evaluating $\ell_{\mathrm{B}}(\hat\theta_{\mathrm{RW}})$. Figure 8 displays these results. The approximate likelihoods were computed using FW for inference, using the procedure described in Section 3.2, while the exact likelihood was computed using Ryser's algorithm, which is feasible for this small problem.

A.2 Horse Experiments

Timing experiments were performed on a dedicated 8-core (16 hyper-thread), 2.67 GHz Intel Xeon X5550 machine with 24 GB of physical RAM, running Ubuntu 12.04.5 LTS and Matlab R2012b. Computations were restricted to a single core, and at most two experiments were run at a time. Our algorithms were implemented in Matlab, interfacing with combinatorial solvers written in C or C++. We downloaded the code for domke2013 and caetano2009learning from the authors' websites; these were implemented in C++ and in Matlab with C extensions, respectively. We obtained the original experiment scripts for domke2013 through correspondence with the author.

A.3 Roommate Experiments

This dataset was obtained from a major US university over a three-year period from 2010 to 2012. The anonymized dataset consists of roommate assignments for pairs of students in each of the three years. In addition, each student was required to complete a brief housing survey asking for their preferences in terms of cleanliness, sleeping schedule and habits, personality, study preferences, etc. Our questionnaire data consist of 2 binary features and 12 ordinal features with 5 levels each. For each pair of students and each questionnaire item, we created one absolute-difference feature and several interaction indicator features, one for each possible pair of answers to that item. For simplicity, we assumed symmetric interactions. For each student pair, the weighted score for their matching is a linear combination of these features. The learned weights for each distance feature and their relative rankings are given in Figure 9. As we are using a log-linear model, the weights should be interpreted as log-odds ratios for a unit increase in absolute distance (for the ordinal features).

A few qualitative observations about these results follow. First, single-sex floor, rising sophomore, and allow in Brownstone all received relatively low weights, indicating perhaps that the data were too noisy with respect to these survey responses. Second, personality, smoking, and bedtime were among the strongest predictors of a successful match, while cleanliness, study hours, and study location were among the least important.

When comparing to the BCFW algorithm for a structured SVM, we employed the publicly available code from the authors. We tried regularization parameter (lambda) values in the range [10e-4, 10e2]. Our best-performing configuration achieved an AUC of 0.504 and a Hamming error of 0.9992, which outperforms random guessing but significantly underperforms a model trained with MLE.

Profile Item                 Weight    Rank
I generally go to bed at…    -0.0296   3
I generally wake up at…      -0.0218   4
Rising Sophomore              0        10
Cleanliness                  -0.0056   7
Smoking                      -0.0484   1
Sleeping Habits              -0.0025   8
Overnight Guests             -0.0097   6
Personality                  -0.037    2
Usual Study Hours             0.0131   12
Study Location               -0.0006   9
Study with audio/visual      -0.0133   5
Single-sex floor (1)          0.2277   14
Single-sex floor (2)          0.0518   13
Allow in Brownstone           0        10
Figure 9: Distance features for the roommate experiments, their learned weights, and their relative importance rankings under the regularization used in our experiments.

Appendix B Convergence of Frank-Wolfe

Recall the approximate max-entropy objective function (6), which we denote by $f$:

$$f(\mu) = \frac{1}{2\lambda} \Big\| \sum_{m=1}^M \big( \phi(x^m, w^m) - \mathbb{E}_{\mu^m}[\phi(x, w^m)] \big) \Big\|_2^2 - T \sum_{m=1}^M H_\rho(\mu^m).$$

In this appendix, we discuss the convergence rate of the FW algorithm for the minimization of this convex function over the local polytope. jaggi2013revisiting has shown that the suboptimality of the iterates of FW decays as

$$f(\mu^{(k)}) - f(\mu^*) \ \le\ \frac{2\, C_f}{k + 2}\,(1 + \delta), \qquad (14)$$

where $\delta$ is the accuracy to which each of the linear subproblems (i.e., the optimization problem performed at each iteration) is solved and $C_f$ is the curvature of the function $f$.

Curvature is a stronger notion of the function's geometry than its Lipschitz parameter, since it is affine-invariant, like the entire Frank-Wolfe algorithm jaggi2013revisiting. The curvature of a differentiable function $f$ over a domain $\mathcal{D}$ is given by

$$C_f \ =\ \sup_{\substack{x, s \in \mathcal{D},\ \gamma \in [0, 1],\\ y = x + \gamma (s - x)}} \ \frac{2}{\gamma^2} \Big( f(y) - f(x) - \langle y - x, \nabla f(x) \rangle \Big). \qquad (15)$$

The curvature $C_f$ quantifies how much $f$ can differ from its linearization. For twice-differentiable functions, the curvature can be upper bounded as follows:

$$C_f \ \le\ \operatorname{diam}(\mathcal{D})^2 \cdot \sup_{z \in \mathcal{D}} \big\| \nabla^2 f(z) \big\|.$$

For our objective function, the curvature is most heavily influenced by the entropy term, as the curvature of the quadratic piece is simply a constant that depends on the sufficient statistics. Unfortunately, the curvature of $f$ is unbounded: as one approaches integer points in the local polytope, the entropy approximation becomes arbitrarily steep. Therefore, the worst-case convergence rate of FW for this problem is unbounded. In order to obtain a bounded curvature, we could strengthen the box constraints in the marginal polytope by requiring

$$\delta \ \le\ \mu_\alpha(x_\alpha) \ \le\ 1 - \delta$$

for some $\delta > 0$. For each pseudomarginal coordinate, the part of the entropy approximation that depends on it is a weighted sum of terms of the form $g(z) = z \log z$. That is, all we need to do is compute bounds on the curvature of functions of the form $g(z) = z \log z$ over $[\delta, 1 - \delta]$. For this function, $g''(z) = 1/z$ for any $z > 0$. To bound the curvature from above, we combine this with the diameter of the restricted interval in the bound above. This gives

$$C_g \ \le\ \frac{(1 - 2\delta)^2}{\delta},$$

which is a fixed constant and tends to zero as $\delta$ tends toward $\tfrac{1}{2}$. Hence, as long as the optimal pseudomarginals lie strictly inside the box constraints, this modified FW is always guaranteed a linear rate of convergence to the optimum. The pseudomarginals produced by both BP and RBP often lie strictly inside the box constraints, so this is typically not an issue in practice. In order to find an appropriate $\delta$, we could use, for example, the bounds on the pseudomarginals proposed by mooij2007. These bounds are obtained by running BP/RBP for a fixed number of iterations. An appropriate $\delta$ can then be selected such that the interval $[\delta, 1 - \delta]$ contains all of the bounds, as in mooij2007.

Adding additional box constraints is expensive (we can no longer use the combinatorial algorithms that we used for matchings and pairwise binary MRFs), and they may not be necessary in practice. For pseudomarginals $\mu$, the steepness only becomes unmanageable when there are components of $\mu$ that are close to 0 or 1 (in a Gibbs distribution, no clique assignment ever has zero probability, so the gradient of the entropy term is always well-defined). If the iterates of the algorithm never get too close to the boundary of $\mathcal{L}(G)$, then the effective curvature will be reasonable. Of course, if the optimal pseudomarginals of the reweighted approximation are close to the boundary of $\mathcal{L}(G)$, then the curvature will be large in the neighborhood of the solution, and we should not expect fast convergence. A rough way to estimate this distance from the boundary is via the entropy of the pseudomarginals, and our experience has shown that the algorithm converges faster when the true pseudomarginal distribution has higher entropy.

Appendix C Convexity of the Bethe Free Energy for General Matchings

In this appendix, we argue that the Bethe free energy for the (not necessarily perfect) matching problem is convex over general graphs. The convexity of the Bethe approximation for the bipartite matching problem was investigated experimentally by huang2009approximating and then proven by betheperm. The same argument holds for any choice of reweighting parameters $\rho$ with entries in $[\tfrac{1}{2}, 1]$. For simplicity, we only argue the case $\rho_\alpha = 1$ for all $\alpha$; the general case is similar to Theorem 60 in betheperm. The entropy and polytope approximations are formulated as follows:

$$H_{\mathrm{B}}(\mu) \ =\ -\sum_{(i,j) \in E} \mu_{ij} \log \mu_{ij} \ +\ \sum_{(i,j) \in E} (1 - \mu_{ij}) \log(1 - \mu_{ij}),$$

where $\mu$ is restricted to the local polytope $\big\{ \mu \ge 0 : \sum_{j \in N(i)} \mu_{ij} = 1 \ \forall i \in V \big\}$ (relaxed to $\le 1$ for imperfect matchings).

The proof of convexity follows directly from the concavity of the per-node function $q(x) = -\sum_j x_j \log x_j + \sum_j (1 - x_j) \log(1 - x_j)$ over the simplex, for each node $i \in V$ betheperm.

Theorem C.1.

For any reweighting parameters $\rho$ with entries in $[\tfrac{1}{2}, 1]$, any graph (bipartite or general), and any matching model (perfect or imperfect), the reweighted free energy (2) is convex over the local polytope.

Proof.

For the case of the perfect matching problem on a graph $G = (V, E)$, the entropy term of the Bethe free energy can be written as

$$H_{\mathrm{B}}(\mu) \ =\ \frac{1}{2} \sum_{i \in V} \Big[ -\sum_{j \in N(i)} \mu_{ij} \log \mu_{ij} \ +\ \sum_{j \in N(i)} (1 - \mu_{ij}) \log(1 - \mu_{ij}) \Big],$$

since every edge is counted by both of its endpoints. Each bracketed term is the function $q$ applied to the pseudomarginals incident to node $i$, which the local polytope constrains to the simplex. As each such term is concave, the entropy function is concave, which implies that the free energy approximation is convex. ∎

Appendix D Frank-Wolfe and Matchings

In this appendix, we describe a conditional random field over perfect matchings, formulate the approximate learning problem in this context, and describe the line search procedure used as part of the FW algorithm.

Assume we have $N$ observations, each consisting of $n$ items matched to $n$ other items. We represent the $m$th observation by $(A^m, B^m, X^m)$, where $A^m$ and $B^m$ are $n \times d_A$ and $n \times d_B$ data matrices and $X^m$ is an $n \times n$ column permutation matrix. (That is, if the matching maps $i$ to $j$, then $X^m_{ij} = 1$ and $X^m_{ij'} = 0$ for all $j' \ne j$.) Note that $A^m$ and $B^m$ contain the data for the two separate parts of the graph.

In general, conditional random field features can be arbitrary functions of $(A, B, X)$. To produce a model whose MAP solution is a maximum-weight perfect matching, we require the features to be linear in $X$. Since $X_{ij}$ denotes the presence or absence of edge $(i, j)$, its coefficient ought to depend only on the data for items $i$ and $j$. Therefore, we use the feature map $\Phi_k(A, B)$ with $(\Phi_k)_{ij} = f_k(a_i, b_j)$. That is, the $k$th feature is a linear function of $X$ with coefficients given by applying a single function $f_k$ to every pair of rows in $A$ and $B$. We will have $K$ features in total. We now write $\Phi_k^m = \Phi_k(A^m, B^m)$ and dispense with $A^m$ and $B^m$. The probability of one observation is thus

$$p(X^m \mid \Phi^m; \theta) = \exp\Big( \sum_{k=1}^K \theta_k \langle \Phi_k^m, X^m \rangle - \log Z(\theta, \Phi^m) \Big).$$

So the log-likelihood for $N$ i.i.d. observations is

$$\ell(\theta) = \sum_{m=1}^N \Big[ \sum_{k=1}^K \theta_k \langle \Phi_k^m, X^m \rangle - \log Z(\theta, \Phi^m) \Big]. \qquad (16)$$

We focus on the case of complete bipartite graphs.

D.1 Minimax Formulation

We now replace $\log Z$ with its reweighted approximation $\log Z_\rho$ and add an $\ell_2$ regularizer. Note that $\rho$ is a vector of reweighting parameters, one per node, with entries between 0.5 and 1. Using the variational formulation of the (Bethe) free energy, we write the maximum Bethe likelihood problem as a minimax problem, which we further analytically reduce to a convex program with linear constraints. Begin with

$$\log Z_\rho(\theta, \Phi^m) \ =\ \max_{\mu^m \in \mathcal{L}} \ \Big[ \sum_{k=1}^K \theta_k \langle \Phi_k^m, \mu^m \rangle \ +\ H_\rho(\mu^m) \Big], \qquad (17)$$

where $\mathcal{L}$ is the set of doubly stochastic matrices and $H_\rho$ is the reweighted entropy for matchings, in which the reweighting parameters enter only through the pairwise sums $\rho_i + \rho_j$. To simplify subsequent derivations, we collect these sums into a matrix $R$ with $R_{ij} = \rho_i + \rho_j$, and we stack the vectorized feature maps of each observation into a single matrix so that the linear terms become matrix–vector products. Plugging (17) into (16) and adding a quadratic penalty gives the problem

$$\max_\theta \ \sum_{m=1}^N \Big[ \sum_k \theta_k \langle \Phi_k^m, X^m \rangle - \max_{\mu^m \in \mathcal{L}} \Big( \sum_k \theta_k \langle \Phi_k^m, \mu^m \rangle + H_\rho(\mu^m) \Big) \Big] - \frac{\lambda}{2}\|\theta\|_2^2$$
$$=\ \max_\theta \ \min_{\mu \in \mathcal{L}^N} \ \sum_{m=1}^N \Big[ \sum_k \theta_k \langle \Phi_k^m, X^m - \mu^m \rangle - H_\rho(\mu^m) \Big] - \frac{\lambda}{2}\|\theta\|_2^2$$
$$=\ \min_{\mu \in \mathcal{L}^N} \ \max_\theta \ \sum_{m=1}^N \Big[ \sum_k \theta_k \langle \Phi_k^m, X^m - \mu^m \rangle - H_\rho(\mu^m) \Big] - \frac{\lambda}{2}\|\theta\|_2^2.$$

The second line is justified because the minimizations in $\mu^1,\ldots,\mu^N$ are separable, so the $\min$ and sum operators commute. The cost is that we must now minimize over the larger product space $\mathcal{L}^N$, but we will see later why this is not a problem. The last line follows from Sion's minimax theorem: the minimization domain is compact and convex, and the objective is convex in the minimization variable and concave in the maximization variable sion1958. The theorem requires only one compact domain, so $\theta$ can remain unconstrained. Thus, for any $\mu$, the concave function of $\theta$ attains its maximum at the stationary point where its gradient vanishes, i.e., $\theta_k^* = \frac{1}{\lambda} \sum_m \langle \Phi_k^m, X^m - \mu^m \rangle$. Moreover, the objective is strictly concave in $\theta$ for $\lambda > 0$, so the maximum is unique. Plugging $\theta^*$ into (D.1) and simplifying, we get

$$\min_{\mu \in \mathcal{L}^N} \ \frac{1}{2\lambda} \sum_{k=1}^K \Big( \sum_{m=1}^N \langle \Phi_k^m, X^m - \mu^m \rangle \Big)^2 \ -\ \sum_{m=1}^N H_\rho(\mu^m).$$

D.2 Line Search

To compute the next iterate of FW, we use line search: write $\mu(\gamma) = \mu^{(t)} + \gamma\,(s^{(t)} - \mu^{(t)})$ for $\gamma \in [0, 1]$. Plugging $\mu(\gamma)$ into (D.1), we get a one-dimensional objective of the form

$$g(\gamma) \ =\ a\,\gamma^2 + b\,\gamma + c \ -\ \sum_{m=1}^N H_\rho\big(\mu^m(\gamma)\big), \qquad (20)$$

where the scalars $a$, $b$, and $c$ depend only on the current iterate, the FW direction, and the data.

Thus we can precompute the expensive matrix products in the quadratic term, so that each candidate $\gamma$ during the line search requires only an entropy evaluation.
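A sketch of this precomputation (our illustration, with a hypothetical data layout): along the line-search path the quadratic part of the objective reduces to three scalars, so only the cheap entropy term needs to be re-evaluated per candidate step size.

```python
import numpy as np

def linesearch_coeffs(mu, s, Phi_list, X_stats, lam):
    """Precompute the quadratic-in-gamma part of the line-search objective.

    Along mu(gamma) = mu + gamma * (s - mu), the quadratic term of the dual objective,
    (1 / (2*lam)) * sum_k (sum_m <Phi_k, X^m - mu^m(gamma)>)^2, equals a*gamma^2 + b*gamma + c,
    so a, b, c can be computed once per FW step. Assumed layout: Phi_list[m] has shape
    (K, n, n); X_stats[m], mu[m], and s[m] have shape (n, n)."""
    # residual r(gamma) = r0 + gamma * d, computed feature-wise
    r0 = sum(np.tensordot(Phi_list[m], X_stats[m] - mu[m]) for m in range(len(mu)))
    d = sum(np.tensordot(Phi_list[m], mu[m] - s[m]) for m in range(len(mu)))
    a = d @ d / (2 * lam)
    b = r0 @ d / lam
    c = r0 @ r0 / (2 * lam)
    return a, b, c

# During line search, evaluate a*g**2 + b*g + c minus the (cheap) entropy term for each
# candidate gamma, rather than recomputing the feature products.
```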

D.3 General Matchings

The above FW procedure only requires a few changes when switching from complete bipartite graphs to general graphs. The same equations and steps hold when we replace biadjacency feature matrices with adjacency features, and permutation matrices with matrices representing perfect matchings. There are only a few technical caveats. First, for general graphs we need to allow some pseudomarginals $\mu_{ij}$ to be zero. This can occur either because there is no edge between $i$ and $j$, or because $i$ and $j$ are neighbors but there is no possible perfect matching in which they are linked. In both of these cases, we simply clamp $\mu_{ij}$ at zero. Similarly, some edges occur in every perfect matching, so we need to discover these a priori and clamp their pseudomarginals at one. Second, unlike for bipartite matching, initialization of $\mu$ is non-trivial, since the set of neighbors is different for every node. We cannot choose an integral point of the local marginal polytope as an initial point, since the curvature is infinite there. Instead, for every edge in the graph, we can find one matching that contains that edge and one matching that does not by solving a series of matching problems. We average all of these matchings to obtain an initial feasible point.
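A sketch of this initialization for general graphs, using networkx's blossom-based matching routine as the solver (our illustration, not the authors' implementation):

```python
import networkx as nx
import numpy as np

def feasible_interior_init(G):
    """Initialize pseudomarginals for a general graph by averaging perfect matchings:
    for every edge, find one perfect matching containing it and one avoiding it,
    then average the 0/1 adjacency matrices of all matchings found."""
    nodes = list(G.nodes)
    idx = {v: i for i, v in enumerate(nodes)}
    matchings = []
    for (u, v) in G.edges:
        H_out = G.copy(); H_out.remove_edge(u, v)            # forbid edge (u, v)
        M_out = nx.max_weight_matching(H_out, maxcardinality=True)
        if 2 * len(M_out) == len(nodes):                     # keep only perfect matchings
            matchings.append(set(M_out))
        H_in = G.copy(); H_in.remove_nodes_from([u, v])      # force edge (u, v)
        M_in = nx.max_weight_matching(H_in, maxcardinality=True)
        if 2 * len(M_in) == len(nodes) - 2:
            matchings.append(set(M_in) | {(u, v)})
    mu = np.zeros((len(nodes), len(nodes)))
    for M in matchings:
        for (a, b) in M:
            mu[idx[a], idx[b]] += 1.0; mu[idx[b], idx[a]] += 1.0
    return mu / max(len(matchings), 1)
```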

Appendix E Frank-Wolfe and Linear CRFs

E.1 Notation

We work with a conditional random field of discrete labels over a graph $G = (V, E)$ in the standard overcomplete parameterization. That is, $x_i$ is an indicator vector for the state of node $i$, and $x_{ij}$ is an indicator matrix for the joint state of edge $(i, j)$; we will also treat $x_{ij}$ as a vector when convenient. We denote an element of a matrix or vector by parentheses. For node $i$, let $w_i$ be its feature vector, and for edge $(i, j)$, let $w_{ij}$ be its feature vector. Implicitly, these feature vectors are derived by applying some fixed function to the input. We will learn a linear map for the node and edge parameters:

$$\theta_i = \Theta_{\mathrm{n}}\, w_i, \qquad \theta_{ij} = \Theta_{\mathrm{e}}\, w_{ij}.$$

Now suppose we have $N$ exchangeable samples, and let the superscript $m$ denote quantities belonging to the $m$th sample. Our joint log-likelihood is thus

$$\ell(\Theta_{\mathrm{n}}, \Theta_{\mathrm{e}}) \ =\ \sum_{m=1}^N \Big[ \sum_{i \in V} \langle \Theta_{\mathrm{n}} w_i^m, x_i^m \rangle + \sum_{(i,j) \in E} \langle \Theta_{\mathrm{e}} w_{ij}^m, x_{ij}^m \rangle - \log Z(\Theta_{\mathrm{n}}, \Theta_{\mathrm{e}}, w^m) \Big].$$

E.2 Minimax Formulation

We replace $\log Z$ with a parameterized surrogate likelihood built from the reweighted free energy, which interpolates between the TRW and Bethe approximations. We use the variational formulation of the corresponding log-partition approximation over the local polytope $\mathcal{L}(G)$. Note that the Bethe approximation is not convex in this setting, but TRW is. For grid MRFs, each edge has probability roughly $\tfrac{1}{2}$ of appearing in a spanning tree, so we set $\rho_{ij} = \tfrac{1}{2}$ for each edge.

Since we are estimating matrix parameters, we add a Frobenius-norm penalty. The minimax formulation is thus the analogue of (3): maximize over $(\Theta_{\mathrm{n}}, \Theta_{\mathrm{e}})$ and minimize over pseudomarginals $\mu^1,\ldots,\mu^N \in \mathcal{L}(G)$ the sum of the linear terms and the free energies, minus $\frac{\lambda}{2}\big(\|\Theta_{\mathrm{n}}\|_F^2 + \|\Theta_{\mathrm{e}}\|_F^2\big)$, where the reweighted approximate entropy is given by

$$H_\rho(\mu) \ =\ \sum_{i \in V} H(\mu_i) \ -\ \sum_{(i,j) \in E} \rho_{ij}\, I(\mu_{ij}), \qquad (22)$$

where $I(\mu_{ij})$ is the mutual information between variables $i$ and $j$, and $H(\mu_i)$ and $H(\mu_{ij})$ are singleton and pairwise entropies. We have used the identity $I(\mu_{ij}) = H(\mu_i) + H(\mu_j) - H(\mu_{ij})$. We have implicitly used the pairwise marginalization constraints when using the mutual information identity, so these gradients are valid only on the local polytope, a fact that is important to remember when optimizing.

The stationary point of the objective in (E.2) with respect to the node parameter matrix is thus

$$\Theta_{\mathrm{n}}^* \ =\ \frac{1}{\lambda} \sum_{m=1}^N \sum_{i \in V} \big( x_i^m - \mu_i^m \big) \big( w_i^m \big)^\top. \qquad (23)$$

Similarly, $\Theta_{\mathrm{e}}^* = \frac{1}{\lambda} \sum_{m=1}^N \sum_{(i,j) \in E} \big( x_{ij}^m - \mu_{ij}^m \big) \big( w_{ij}^m \big)^\top$. Recalling the definition of the Frobenius norm and rearranging some summations, we arrive at the specialization of the dual objective (6) to linear CRFs.