1 Introduction
Structured prediction has shown great success in modeling problems with complex dependencies between output variables. Practitioners often use undirected graphical models, which encode conditional dependency relationships via a graph. However, the tractability of exact inference in these models is limited by the graph’s treewidth, often yielding a harsh tradeoff between model expressivity and tractability.
Graphical models are good at modeling local dependencies between variables, such as the importance of surrounding context in determining the meaning of words or phrases. However, their sensitivity to cyclic dependencies often renders them unsuitable for modeling preferences for certain globally consistent states. For example, in the canonical NLP task of partofspeech tagging, there is no clear way to enforce the constraint that every sentence have at least one verb without increasing the likelihood that every token is predicted to be a verb.
Concretely, exact marginal inference in a discrete graphical model can be posed as the following optimization problem
(1) 
where
is a concatenated vector of node and clique marginals,
is the entropy, is the marginal polytope, and are parameters. Here we face a tradeoff: adding longrange dependencies directly to the model increases the clique size and thus the complexity of the problem and size of , rendering inference intractable. However, the linear scoring function breaks down over cliques, preventing us from enforcing global regularities in any other way. In this work, we propose to augment the inference objective (1) and instead optimize(2) 
Here, is some arbitrary parametric function of the entire concatenated marginal vector, where may depend on input features. Since is nonlinear, it can enforce many types of nonlocal properties. Interestingly, whenever is convex, and whenever inference is easy in the underlying model, i.e., solving (1) is tractable, we can solve (2) using nonEuclidean projected gradient methods using the Bethe entropy as a distancegenerating function. Unlike many messagepassing algorithms, our procedure maintains primal feasibility across iterations, allowing its use as an anytime algorithm. Furthermore, for nonconvex , we also show convergence to a local optimum of (2). Finally, we present algorithms for discriminative learning of the parameters . In a slight abuse of terminology, we call a nonlocal energy function.
Ours is not the first work to consider modeling global preferences by augmenting a tractable base inference objective with nonlocal terms. For example, generalized meanfield variational inference algorithms augment a tractable distribution (the distribution) with a nonlinear, nonconvex global energy function that scores terms in the full model (the distribution) using products of marginals of (Wainwright & Jordan, 2008). This is one special case of our nonlocal inference framework, and we present algorithms for solving the problem for much more general , with compelling applications.
Additionally, the modeling utility provided by global preferences has motivated work in dual decomposition, where inference in loopy or globallyconstrained models is decomposed into repeated calls to inference in tractable independent subproblems (Komodakis et al., 2007; Sontag et al., 2011). It has seen wide success due to its ease of implementation, since it reuses existing inference routines as black boxes. However, the technique is restricted to modeling linear constraints, imposed a priori. Similarly, these types of constraints have also been imposed on expectations of the posterior distribution for use in semisupervised learning, as in posterior regularization and generalized expectation (Ganchev et al., 2010; Mann & McCallum, 2010). In contrast, our methods are designed to discriminatively learn expressive inference procedures, with minimal domain knowledge required, rather than regularizing inference and learning.
First, we provide efficient algorithms for solving the marginal inference problem (2) and performing MAP prediction in the associated distribution, for both convex and nonconvex global energy functions. After that, we provide a learning algorithm for and the parametrized functions using an interpretation of (2) as approximate variational inference in a a probabilistic model. All of our algorithms are easy to implement and rely on simple wrappers around blackbox inference subroutines.
Our experiments demonstrate the power and generality of our approach by achieving stateoftheart results on several tasks. We extract accurate citations from research papers by learning discriminative global regularities of valid outputs, outperforming a strong dual decompositionbased baseline (Anzaroot et al., 2014). In a benchmark OCR task (Taskar et al., 2004), we achieve stateoftheart results with a learned nonconvex, nonlocal energy function, that guides output decodings to lie near dictionary words. Finally, our general algorithm for solving (2) provides large speed improvements for the challenging task of inference in chainstructured collective graphical models (CGMs), applied to bird migration (Sheldon & Dietterich, 2011).
2 Background
Let denote a set of discrete variables and be a collection of input features. We define the conditional distribution , where is a mapping from to a set of sufficient statistics, is a differentiable vectorvalued mapping, and . Conditional random fields (CRFs) assume that are given a graph structure and maps to a 01 vector capturing joint settings of each clique (Lafferty et al., 2001). Going forward, we often suppress the explicit dependency of on . For fixed , the model is called a Markov random field (MRF).
Given a distribution , define the expected sufficient statistics operator . For the CRF statistics above, is a concatenated vector of node and clique marginals. Therefore, marginal inference, the task of finding the marginal distribution of over , is equivalent to computing the expectation .
For treestructured graphical models, is a bijection, though this is not true for general graphs. Furthermore, for trees the entropy is equal to the Bethe entropy , defined, for example, in Wainwright & Jordan (2008). The marginal polytope is the set of that correspond to some .
As mentioned in the introduction, marginal inference can be posed as the optimization problem (1). MAP inference finds the joint setting
with maximum probability. For CRFs, this is equivalent to
(3) 
For treestructured CRFs, marginal and MAP inference can be performed efficiently using dynamic programming. Our experiments focus on such graphs. However, the inference algorithms we present can be extended to general graphs wherever marginal inference is tractable using a convex entropy approximation and a local polytope relaxation.
3 Marginal Inference With NonLocal Energies
We move beyond the standard inference objective (1), augmenting it with a nonlocal energy term as in (2):
Here, is some arbitrary parametrized function of the marginals, and may depend on input features .
Intuitively, we are augmenting the inference objective (1) by allowing it to optimize a broader set of tradeoffs – not only between expected node scores, clique scores, and entropy, but also global functions of the marginals. To be concrete, in our citation extraction experiments (Section 8.1), for example, we employ the simple structure:
(4) 
Where each is a univariate convex function and each is constrained to be nonnegative, in order to maintain the overall convexity of . We further employ
(5) 
where encodes a ‘linear measurement’ of the marginals and is some univariate convex function.
4 Variational Interpretation and Map Prediction
We next provide two complementary interpretations of (2
) as variational inference in a class of tractable probability distributions over
. They yield precisely the same variational expression. However, both are useful because the first helps motivate a MAP prediction algorithm, while the second helps characterize our learning algorithm in Section 7 as (approximate) variational EM.Proposition 1.
Proof.
Therefore, we can characterize a joint distribution over
by first finding by solving (2) and then defining an MRF over with parameters . Even more conveniently, our inference technique in Section 6iteratively estimates
on the fly, namely via the dual iterate in Algorithm 1.Ultimately, in many prediction problems we seek a single output configuration rather than an inferred distribution over outputs. Proposition 1 suggests a simple prediction procedure: first, find the variational distribution over parametrized as an MRF with parameter . Then, perform MAP in this MRF. Assuming an available marginal inference routine for this MRF, we assume the tractability of MAP – for example using a dynamic program. We avoid predicting by locally maximizing nodes’ marginals, since this would not necessarily yield feasible outputs.
Instead of solving (2), we could have introduced global energy terms to the MAP objective (3) that act directly on values rather than on expectations , as in (2
). However, this yields a difficult combinatorial optimization problem for prediction and does not yield a natural way to learn the parametrization of the global energy. Section
8.1 demonstrates that using energy terms defined on marginals, and performing MAP inference in the associated MRF, performs as well or better than an LP technique designed to directly perform MAP subject to global penalty terms.Our second variational interpretation characterizes as a variational approximation to a complex joint distribution:
(8) 
We assume that isolated marginal inference in is tractable, while is an alternative structured distribution over for which we do not have an efficient inference algorithm. Specifically, we assume that can be solved for . Furthemore, we assume that , where is a convex function, conditional on input features . Going forward, we will often surpress the dependence of on . Above, is the normalizing constant of the combined distribution. Note that if was linear, inference in both and would be tractable, since the distribution would decompose over the same cliques as .
Not surprisingly, (8) is intractable to reason about, due to the nonlocal terms in (2), so we approximate it with a variational distribution . The connection between this variational approximation and Proposition 1 is derived in Appendix A. Here, we assume no clique structure on , but show that minimizing a variational approximation of , for a given , yields a that is parametrized compactly as the MRF in Proposition 1. We discuss the relationship between this and general meanfield inference in Section 5.
Although the analysis of this section assumes convexity of , our inference techniques can be applied to nonconvex , as discussed in Section 6.3, and our learning algorithm produces stateoftheart results even in the nonconvex regime for a benchmark OCR task.
5 Related Modeling Techniques
Mean field variational inference in undirected graphical models is a particular application of our inference framework, with a nonconvex (Wainwright & Jordan, 2008). The technique estimates marginal properties of a complex joint distribution using the clique marginals of some tractable base distribution , not necessarily fully factorized. This induces a partitioning of the cliques of into those represented directly by and those where we define clique marginals as a product distribution of the relevant nodes’ marginals in . To account for the energy terms of the full model involving cliques absent in the simple base model, the energy of the base model is augmented with an extra function of .
(9) 
where is the set of cliques not included in the tractable submodel, are the potentials of the original graphical model corresponding to the missing cliques, and
represents a repeated outer (tensor) product of the node marginals for the nodes in those cliques.
Note is nonlinear and nonconvex. Our work generalizes (9) by allowing arbitrary nonlinear interaction terms between components of . This is very powerful – for example, in our citation extraction experiments in Section 8.1, expressing these global terms in a standard graphical model would require many factors touching all variables. Local coordinate ascent meanfield can be frustrated by these rigid global terms. Our gradientbased method avoids these issues by updating all marginals simultaneously.
Dual decomposition is a popular method for performing MAP inference in complex structured prediction models by leveraging repeated calls to MAP in tractable submodels (Komodakis et al., 2007; Sontag et al., 2011). The family of models solvable with dual decomposition is limited, however, because the terms that link the submodels must be expressible as linear constraints. Similar MAP techniques (Ravikumar et al., 2010; Martins et al., 2011; Fu & Banerjee, 2013) based on the alternating direction method of multipliers (ADMM) can be adapted for marginal inference, in problems where marginal inference in submodels is tractable. However, the nonlocal terms are defined as linear functions on settings of graphical model nodes, while our nonlinear terms provide practitioners with an expressive means to learn and enforce regularities of the inference output.
Posterior regularization (PR) (Ganchev et al., 2010), learning from measurements (LFM) Liang et al. (2009) , and generalized expectations (GE) (Mann & McCallum, 2010), are a family of closelyrelated techniques for performing unsupervised or semisupervised learning of a conditional distribution or a generative model
using expectationmaximization (EM), where the Estep for latent variables
does not come directly from inference in the model, but instead from projection onto a set of expectations obeying global regularity properties. In PR and GE, this yields a projection objective of the form (2), where the terms come from a Lagrangian relaxation of regularity constraints, and corresponds to dual variables. Originally, PR employed linear constraints on marginals, but He et al. (2013) extend the framework to arbitrary convex differentiable functions. Similarly, in LFM such an inference problem arises because we perform posterior inference assuming that the observations have been corrupted under some noise model. Tarlow & Zemel (2012) also present a method for learning with certain forms of nonlocal losses in a maxmargin framework.Our goals are very different than the above learning methods. We do not impose nonlocal terms in order to regularize our learning process or allow it to cope with minimal annotation. Instead, we use to increase the expressivity of our model, performing inference for every test example, using a different , since it depends on input features. Since we are effectively ‘learning the regularizer,’ on fullylabeled data, our learning approach in Section 7 differs from these methods. Finally, unlike these frameworks, we employ nonconvex terms in some of our experiments. The algorithmic consequences of nonconvexity are discussed in Section 6.3.
6 Optimizing the NonLocal Marginal Inference Objective
We now present an approach to solving (2) using nonEuclidean projected gradient methods, which require access to a procedure for marginal inference in the base distribution (which we term the marginal oracle), as well as access to the gradient of the energy function . We pose these algorithms in the composite minimization framework, which gives us access to a wide variety of algorithms that are discussed in the supplementary material.
6.1 Convex Optimization Background
Before presenting our algorithms, we review several definitions from convex analysis (Rockafellar, 1997).
We call a function strongly convex with respect to a norm , if for all ,
Proposition 2 (e.g. Beck & Teboulle (2003)).
The negative entropy function is 1strongly convex with respect to the 1norm over the interior of the simplex (restricting to ).
Given a smooth and strongly convex function , we can also define an associated generalized (asymmetric) distance measure called the Bregman divergence (Bregman, 1967) generated by ,
For example, the KL divergence is the Bregman divergence associated to the negative entropy function, and the squared Euclidean distance is its own associated divergence.
Composite minimization (Passty, 1979) is a family of techniques for minimizing functions of the form , where we have an oracle that allows us to compute minimizations over in closed form (usually here takes the form of a regularizer). Problems of this form are often solved with an algorithm called proximal gradient, which minimizes over some convex set using:
for some decreasing sequence of learning rates . Note that because of the requirement , proximal gradient generalizes projected gradient descent – since unconstrained minimization might take us out of the feasible region , computing the update requires projecting onto .
But there is no reason to use the squared Euclidean distance when computing our updates and performing the projection. In fact, the squared term can be replaced by any Bregman divergence. This family of algorithms includes the mirror descent and dual averaging algorithms (Beck & Teboulle, 2003; Nesterov, 2009).
We base our projected inference algorithms on regularized dual averaging (RDA) (Xiao, 2010). The updates are:
(10) 
where is the average gradient of encountered so far. One benefit of RDA is that it does not require the use of a learning rate parameter () when using a strongly convex regularizer. RDA can be interpreted as doing a projection onto using the Bregman divergence generated by the strongly convex function .
6.2 Our Algorithm
These nonEuclidean proximal algorithms are especially helpful when we are unable to compute a projection in terms of Euclidean distance, but can do so using a different Bregman divergence. We will show that this is exactly the case for our problem of projected inference: the marginal oracle allows us to project in terms of KL divergence.
However, to maintain tractability we avoid using the entropy function on the exponentiallylarge simplex , and instead optimize over the structured, factorized marginal polytope and its corresponding structured Bethe entropy . For treestructured models, and have identical values, but different inputs. It remains to show the strong convexity of so we can use it in RDA.
Proposition 3.
For trees with nodes, the negative Bethe entropy function is strongly convex with respect to the 2norm over the interior of the marginal polytope .
Proof.
Consequence of Lemma 1 in Fu & Banerjee (2013). ∎
With these definitions in hand, we present BetheRDA projected inference Algorithm 1. This algorithm corresponds to instantiating (10) with and . Note the simplicity of the algorithm when choosing . It is intuitively appealing that the algorithm amounts to no more than calling our marginal inference oracle with iteratively modified parameters.
Proposition 4.
Proof.
This follows from Theorem 3 of Xiao (2010) along with the strong convexity of . ∎
If we have more structure in the energy functions, specifically a Lipschitzcontinuous gradient, we can modify the algorithm to use Nesterov’s acceleration technique and achieve a convergence of . Details can be found in Appendix D. Additionally, in practice these problems need not be solved to optimality and give stable results after a few iterations, as demonstrated in Figure 1.
6.3 Inference With NonConvex, NonLocal Energies
An analogy can be made here to loopy belief propagation – even in the case of nonconvex loss functions (and even nonconvex entropy functions with associated inexact marginal oracles), the updates of our inference (and learning) algorithms are welldefined. Importantly, since one of our motivations for developing nonlocal inference was to generalize mean field inference, and the additional penalty terms are nonconvex in that case, we would like our algorithms to work for the nonconvex case as well.
Unlike loopy belief propagation, however, since we derive our algorithms in the framework of composition minimization, we have access to a wealth of theoretical guarantees. Based on results from the theory of optimization with firstorder surrogate loss functions (Mairal, 2013), in Appendix C we propose a small modification to Algorithm 1 with an asymptotic convergence condition even for nonconvex energies. In practice we find that the unmodified Algorithm 1 also works well for these problems, and experimentally in Section 8.2, we see good performance in both inference and learning with nonconvex energy functions.
7 Learning Models With NonLocal Energies
We seek to learn the parameters and of the underlying CRF base model and , respectively. Let be training examples. Let be the variational distribution for resulting from applying Proposition 1. Namely, is an MRF with parameters
(11) 
We employ the notation to highlight the role of : for a given pair, the family of variational distributions over is indexed by possible values of (recall we suppress the explicit dependence of and on ). Finally, define the shorthand .
interacts with the data in a complex manner that prevents us from using standard learning techniques for the exponential family. Namely, we can not easily differentiate a likelihood with respect to , since this requires differentiating the output of a convex optimization procedure, and the extra term in (2) prevents the use of conjugate duality relationships available for the exponential family. We could have used automatic methods to differentiate the iterative inference procedure (Stoyanov et al., 2011; Domke, 2012), but found our learning algorithm works well.
We employ a variational learning algorithm, presented in Algorithm 2, alternately updating the parameters of our tractable CRFstructured variational distributions, and updating the parameters assuming the following surrogate likelihood given by these CRF approximations:
(12) 
Given and , we update using Algorithm 1. Given , we update and by taking a single step in the direction of the gradient of the surrogate likelihood (12). We avoid taking more than one gradient step, since the gradients for and depend on and an update to and will break the property that . Therefore, we recompute every time we update the parameters.
Overall, it remains to show how to compute gradients of (12). For , we have the standard CRF likelihood gradient (Sutton & McCallum, 2006):
(13) 
For , we have:
(14) 
From (11), is also and
(15) 
Clearly, this depends on the structure of . Consider the parametrization (4). With this, we have:
(16) 
Therefore, we have . For linear measurements (5), this amounts to
(17) 
This has a simple interpretation: the gradient with respect to equals the gradient of the scalar loss at the current marginals times the difference in linear measurements between the ground truth labels and the inferred marginals.
Algorithm 2 has an expensive doubleloop structure. In practice it is sufficient to employ a ‘doublystochastic’ version given in Algorithm 3, where we sample a training example and use this to only perform a single gradient step on and . To demonstrate the simplicity of implementing our learning algorithm, we avoid any abstract derivative notation in Algorithm 3 by specializing it to the case of (17). In our experiments, however, we sometimes do not use linear measurements. Overall, all our experiments use the fast doublystochastic approach of Algorithm 3 solely, since it performs well. In general, our learning algorithms are not guaranteed to converge because we approximate the complex interaction between and with alternating updates. In practice, however, terminating after a fixed number of iterations yields models that generalize well.
Finally, recall that the notation suppresses the potential dependence of on . We assume each is a differentiable function of features of . Therefore, in our experiments where depends on , we perform gradient updates for the parametrization of
via further application of the chain rule.
8 Experiments
8.1 Citation Extraction
Model  F1 

Our Baseline  94.47 
Nonlocal Energies  95.47 
Baseline (Anzaroot et al., 2014)  94.41 
SoftDD (Anzaroot et al., 2014)  95.39 
We first apply our algorithm to the NLP task of performing text field segmentation on the UMass citation dataset (Anzaroot & McCallum, 2013), which contains strings of citations from research papers, segmented into fields (author, title, etc.). Our modeling approach, closely follows Anzaroot et al. (2014), who extract segmentations using a linearchain segmentation model, to which they add a large set of ‘soft’ linear global regularity constraints.
Let be a candidate labeling. Imagine, for example, that we constrain predicted segmentations to have no more predicted last names than first names. Then, the numbers of first and last names can be computed by linear measurements and , respectively. A hard constraint on would enforce . This is relaxed in Anzaroot et al. (2014) to a penalty term
(18) 
that is added to the MAP inference objective, where is a hinge function. For multiple soft constraints, the overall prediction problem is
(19) 
where are the parameters of the underlying linearchain model. They use a dual decomposition style algorithm for solving (19), that crucially relies on the specific structure of the hinge terms . They learn the
for hundreds of ‘soft constraints’ using a perceptronstyle algorithm.
We consider the same set of measurement vectors , but impose nonlocal terms that act on marginals rather than specific values . Further, we use smoothed hinge functions, which improve the convergence rate of inference (Rennie, 2005). We find the variational distribution by solving the marginal inference version of (19), an instance of our inference framework with linear measurements (5):
(20) 
As in Anzaroot et al. (2014), we first learn chain CRF parameters on the training set. Then, we learn the parameters on the development set, using Algorithm 3
, and tune hyperparameters for development set performance. At both train and test time, we ignore any terms in (
20) for which .We present our results in Table 1, measuring segmentlevel F1. We can see that our baseline chain has slightly higher accuracy than the baseline approach of Anzaroot et al. (2014), possibly due to optimization differences. Our augmented model (NonLocal Energies) matches and very slightly beats their soft dual decomposition (SoftDD) procedure. This is especially impressive because they employ a specialized linearprogramming solver and learning algorithm adapted to the task of MAP inference under hingeloss soft constraints, whereas we simply plug in our general learning and inference algorithms for nonlocal structured prediction – applicable to any set of energy functions.
Our comparable performance provides experimental evidence for our intuition that preferences about MAP configurations can be expressed (and “relaxed”) as functions of expectations. Anzaroot et al. (2014) solve a penalized MAP problem directly, while our prediction algorithm first finds a distribution satisfying these preferences, and then performs standard MAP inference in that distribution.
8.2 Handwriting Recognition
NGrams  2  3  4  5  6 

Accuracy  85.02  96.20  97.21  98.27  98.54 
SPC (Weiss et al., 2012)  Accuracy 

2gram  85.02 
3gram  96.20 
4gram  97.21 
5gram  98.27 
6gram  98.54 
Characterwise accuracy of our baselines, and models using learned nonlocal energies on Handwriting Recognition dataset. Note that word classifier baseline is also given in characterwise accuracy for comparison.
Model  Accuracy 

2gram (base model)  84.93 
94.01  
(MM)  94.96 
98.26  
(MM)  98.83 
55Class Classifier (MM)  86.06 
We next apply our algorithms to the widelyused handwriting recognition dataset of Taskar et al. (2004). We follow the setup of Weiss et al. (2012), splitting the data into 10 equally sized folds, using 9 for training and one to test. We report the crossvalidation results across all 10 folds.
The structured prediction cascades of Weiss et al. (2012) achieve high performance on this dataset by using extremely high order cliques of characters (up to 6grams), for which they consider only a small number of candidate outputs. Their stateoftheart results are reproduced in Table 2. The excellent performance of these largeclique models is consequence of the fact that the data contains only 55 unique words, written by 150 different people. Once the model has access to enough higherorder context, the problem becomes much easier to solve.
With this in mind, we design two nonconvex, nonlocal energy functions. These energies are intended to regularize our predictions to lie close to known elements of the vocabulary. Our base model is a standard linearchain CRF with image features on the nodes, and no features on the bigram edge potentials. Let be a function that takes the concatenated vector of node and edge marginals and sums up all of the node marginals, giving the global unigram expected sufficient statistics. Let indicate the set of all such unique vectors when applying to the train set empirical sufficient statistics for each data case . Simply, this gives 55 vectors of length 26 containing the unigram counts for each unique word in the train set.
Our intuition is that we would like to be able to “nudge” the results of inference in our chain model by pulling the inferred to be close to one of these global statistics vectors. We add the following nonconvex nonlocal energy function to the model:
(21) 
We learn two variants of this model, which differently parametrize the dependence of on . The first has a single bias feature on the nonlocal energy. The second conditions on a global representation of the sequence: concretely, we approximate the RBF kernel mean map (MM) (Smola et al., 2007) using random Fourier features (RFF) (Rahimi & Recht, 2007)
. This simply involves multiplying each image feature vector in the sequence by a random matrix with
rows, applying a pointwise nonlinearity, and taking to be a linear function of the average vector.Results of these experiments can be seen in Table 4. Adding the nonlocal energy brings our performance well above the baseline bigram chain model, and our training procedure is able to give substantially better performance when depends on the above input features.
The energy , based on unigram sufficient statistics, is not able to capture the relative ordering of letters in the vocabulary words, which the structured prediction cascades models do capture. This motivates us to consider another energy function. Let be the set of unique vectors of concatenated node marginal statistics for the train set. This gives 55 vectors of length , where is the length of the th distinct train word. Next, we define a different energy function to add to our base chain model:
(22) 
Once again we implement featurized and nonfeaturized versions of this model. As noted in structured prediction cascades, giving the model access to this level of highorder structure in the data makes the inference problem extremely easy. Our model outperforms the best structured prediction cascades results, and we note again an improvement from using the featurized over the nonfeaturized .
Of course, since the dataset has only 55 actual labels, and some of those are not valid for different input sequences due to length mismatches, this is arguably a classification problem as much as a structured prediction problem. To address this, we create another baseline, which is a constrained 55class logistic regression classifier (constrained to only allow choosing output classes with appropriate lengths given the input). We use our same global meanmap features from the
variants of the structured model and report these results in Table 4. We also tune the number of random Fourier features as a hyperparameter to give the classifier as much expressive power as possible. As we can see, the performance is still significantly below the best structured models, indicating that the interplay between local and global structure is important.8.3 Collective Graphical Models
625  10k  50k  

Our Method  0.19  2.7  14 
IP  2.8  93  690 
Next, we demonstrate that that our proximal gradientbased inference framework dramatically speeds up approximate inference in collective graphical models (CGMs) (Sheldon & Dietterich, 2011)
. CGMs are a method for structured learning and inference with noisy aggregate observation data. The largescale dependency structure is represented via a graphical model, but the nodes represent not just single variables, but aggregate sufficient statistics of large sets of underlying variables, corrupted by some noise model. In previous work, CGMs have been successfully applied to modeling bird migration. Here, the base model is a linear chain representing a time series of bird locations. Each observed variable corresponds to counts from bird watchers in different locations. These observations are assumed to be Poisson distributed with rate proportional to the true count of birds present. The CGM MAP task is to infer the underlying migration patterns.
Sheldon et al. (2013) demonstrate that MAP in CGMs is NPhard, even for trees, but that approximate MAP can be performed by solving a problem of the form (2):
(23) 
where are (concave) Poisson loglikelihoods and each is an observed bird count.
For the case where the underlying CGM graph is a tree, the ‘hard EM’ learning algorithm of Sheldon et al. (2013) is the same as Algorithm 2 specialized to their model. Therefore, Sheldon et al. (2013) provide additional experimental evidence that our alternating surrogatelikelihood optimization works well in practice.
The learning procedure of Sheldon et al. (2013) is very computationally expensive because they solve instances of (23) using an interiorpoint solver in the inner loop. For the special case of trees, Algorithm 1 is directly applicable to (23). Using synthetic data and code obtained from the authors, we compare their generic solver to Algorithm 1 for solving instances of (23). In Table 5, we see that our method achieves a large speedup with no loss in solution accuracy (since it solves the same convex problem).
9 Discussion and Future Work
Our results show that our inference and learning framework allows for tractable modeling of nonlocal dependency structures, resistant to traditional probabilistic formulations. By approaching structured modeling not via independence assumptions, but as arbitrary penalty functions on the marginal vectors , we open many new modeling possibilities. Additionally, our generic gradientbased inference method can achieve substantial speedups on preexisting problems of interest. In future work, we will apply our framework to new problems and new domains.
Acknowledgements
This work was supported in part by the Center for Intelligent Information Retrieval, in part by DARPA under agreement number FA87501320020, and in part by NSF grant #CNS0958392. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright notation thereon. Any opinions, findings and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect those of the sponsor.
References
 Anzaroot & McCallum (2013) Anzaroot, Sam and McCallum, Andrew. A new dataset for finegrained citation field extraction. In ICML Workshop on Peer Reviewing and Publishing Models, 2013.
 Anzaroot et al. (2014) Anzaroot, Sam, Passos, Alexandre, Belanger, David, and McCallum, Andrew. Learning soft linear constraints with application to citation field extraction. In ACL, 2014.
 Beck & Teboulle (2003) Beck, Amir and Teboulle, Marc. Mirror descent and nonlinear projected subgradient methods for convex optimization. Operations Research Letters, 31(3):167–175, 2003.
 Bregman (1967) Bregman, Lev M. The relaxation method of finding the common point of convex sets and its application to the solution of problems in convex programming. USSR computational mathematics and mathematical physics, 7(3):200–217, 1967.
 Domke (2012) Domke, Justin. Generic methods for optimizationbased modeling. In AISTATS, 2012.
 Duchi et al. (2010) Duchi, John, ShalevShwartz, Shai, Singer, Yoram, and Tewari, Ambuj. Composite objective mirror descent. In COLT, 2010.
 Fu & Banerjee (2013) Fu, Qiang and Banerjee, Huahua Wang Arindam. Betheadmm for tree decomposition based parallel map inference. In UAI, 2013.
 Ganchev et al. (2010) Ganchev, Kuzman, Graça, Joao, Gillenwater, Jennifer, and Taskar, Ben. Posterior regularization for structured latent variable models. JMLR, 99:2001–2049, 2010.
 He et al. (2013) He, L., Gillenwater, J., and Taskar, B. GraphBased Posterior Regularization for SemiSupervised Structured Prediction. In CoNLL, 2013.
 Komodakis et al. (2007) Komodakis, Nikos, Paragios, Nikos, and Tziritas, Georgios. Mrf optimization via dual decomposition: Messagepassing revisited. In IEEE ICCV, 2007.
 Lafferty et al. (2001) Lafferty, John, McCallum, Andrew, and Pereira, Fernando CN. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In ICML, 2001.
 Liang et al. (2009) Liang, Percy, Jordan, Michael I, and Klein, Dan. Learning from measurements in exponential families. In ICML, 2009.
 Mairal (2013) Mairal, Julien. Optimization with firstorder surrogate functions. In ICML, 2013.
 Mann & McCallum (2010) Mann, Gideon S and McCallum, Andrew. Generalized expectation criteria for semisupervised learning with weakly labeled data. JMLR, 11:955–984, 2010.
 Martins et al. (2011) Martins, André, Figueiredo, Mário, Aguiar, Pedro, Smith, Noah A, and Xing, Eric P. An augmented lagrangian approach to constrained map inference. In ICML, 2011.
 Nesterov (2009) Nesterov, Yurii. Primaldual subgradient methods for convex problems. Mathematical programming, 120(1):221–259, 2009.
 Passty (1979) Passty, Gregory B. Ergodic convergence to a zero of the sum of monotone operators in hilbert space. Journal of Mathematical Analysis and Applications, 72(2):383 – 390, 1979.
 Rahimi & Recht (2007) Rahimi, Ali and Recht, Benjamin. Random features for largescale kernel machines. In NIPS, 2007.
 Ravikumar et al. (2010) Ravikumar, Pradeep, Agarwal, Alekh, and Wainwright, Martin J. Messagepassing for graphstructured linear programs: Proximal methods and rounding schemes. JMLR, 11:1043–1080, 2010.
 Rennie (2005) Rennie, Jason DM. Smooth hinge classification, 2005.
 Rockafellar (1997) Rockafellar, R Tyrell. Convex Analysis, volume 28. Princeton University Press, 1997.
 Sheldon et al. (2013) Sheldon, Daniel, Sun, Tao, Kumar, Akshat, and Dietterich, Thomas G. Approximate inference in collective graphical models. In ICML, 2013.
 Sheldon & Dietterich (2011) Sheldon, Daniel R and Dietterich, Thomas G. Collective graphical models. In NIPS, 2011.
 Smola et al. (2007) Smola, Alex, Gretton, Arthur, Song, Le, and Schölkopf, Bernhard. A hilbert space embedding for distributions. In Algorithmic Learning Theory, pp. 13–31. Springer, 2007.

Sontag et al. (2011)
Sontag, David, Globerson, Amir, and Jaakkola, Tommi.
Introduction to dual decomposition for inference.
Optimization for Machine Learning
, 1:219–254, 2011.  Stoyanov et al. (2011) Stoyanov, Veselin, Ropson, Alexander, and Eisner, Jason. Empirical risk minimization of graphical model parameters given approximate inference, decoding, and model structure. In AISTATS, 2011.
 Sutton & McCallum (2006) Sutton, Charles and McCallum, Andrew. An introduction to conditional random fields for relational learning. Introduction to statistical relational learning, pp. 93–128, 2006.
 Tarlow & Zemel (2012) Tarlow, Daniel and Zemel, Richard S. Structured output learning with high order loss functions. In AISTATS, 2012.
 Taskar et al. (2004) Taskar, Ben, Carlos, Guestrin, and Koller, Daphne. Maxmargin markov networks. In NIPS, 2004.
 Wainwright & Jordan (2008) Wainwright, Martin J and Jordan, Michael I. Graphical models, exponential families, and variational inference. Foundations and Trends in Machine Learning, (12):1–305, 2008.
 Weiss et al. (2012) Weiss, D., Sapp, B., and Taskar, B. Structured Prediction Cascades. ArXiv eprints, August 2012.
 Xiao (2010) Xiao, Lin. Dual averaging methods for regularized stochastic learning and online optimization. JMLR, 11:2543–2596, 2010.
Appendix A Variational Approximation
During learning, reasoning about in (8) is difficult, due to the intractability of . In response, we approximate it with a variational distribution:
(24) 
where
(25)  
(26) 
Given , , and , we select by minimizing the approximation (26). Note that the surrogate we minimize is a lower bound to (25), as , by Jensen’s inequality and the convexity of . This differs from many meanfield variational inference approaches that minimize an upper bound.
So far, we have not assumed any structure on . Next, we show that the minimizer of (26) is a MRF with the same clique structure as . This provides an alternative derivation of the techniques in Section 4.
Let denote the probability under of a given joint configuration . There are exponentiallymany such , and is the entropy on the simplex . Since minimizes (26), we have the following stationarity condition for every :
(27) 
Here, is a dual variable for the constraint . Rearranging, we have:
(28)  
(29) 
where is a normalizing constant.
Proposition 5.
There exists a vector such that the quantity for all . Furthermore, is a simple, closedform function of .
Proof.
We have , since . Therefore, . ∎
Corollary 1.
Since , Proposition 5 implies is an MRF with the same clique decomposition as .
So far, is implicitly defined in terms of its own marginals . Since we assume and have the same sufficient statistics , we can use the Bethe entropy representation . This transforms (26) to the augmented inference problem (2). Therefore, we can directly solve for , which can then be used to provide a closedform expression for the CRF distribution .
Appendix B Additional Experiments
In Figure 2, we examine the convergence behavior of our algorithm on the citation dataset. This demonstrates that our inference procedure converges quite quickly except for a small number of difficult cases, where the global energy and the local evidence are in significant disagreement.
Appendix C NonConvex Energies and Composite Mirror Descent
We introduce a small modification of Algorithm 1, along with a rough proof sketch of its convergence even in the case of nonconvex energy functions. Because it leans heavily on significant prior work in optimization, it is hard to give a selfcontained proof of the results in this section, and our argument takes the form of a proof sketch that appeals to these other works. However, the basic argument simply combines the strong convexity of and its associated Bregman divergence, along with the results of Mairal (2013) for the case of composite minimization of nonconvex functions using the Euclidean Bregman divergence, and the fact that the local updates performed using entropy as a distancegenerating function have a logbarrier function for the constraint set , effectively bounding the norm of the gradient of when restricted to the set of iterates actually visited during optimization.
While Algorithm 1 was built on the framework of regularized dual averaging (RDA), we introduce a slightly different formulation based on composite mirror descent (COMID) (Duchi et al., 2010). Like RDA, COMID is a gradient method for minimizing functions of the form . At each time step , COMID makes the update
(30) 
where is some strongly convex function and is its associated Bregman divergence. In Algorithm 4, we present an instantiation of composite mirror descent for our inference problem.
At first glance, this seems significantly different from our original Algorithm 1, but remembering that because of conjugate duality of the exponential family, we can see that it actually only a corresponds to a slight reweighting of the iterates of Algorithm 1.
Proposition 6.
Proof.
This follows from a standard onlinetobatch conversion, along with the strong convexity of and Theorem 7 of Duchi et al. (2010). ∎
Now, having introduced composite mirror descent in (30), will lean heavily on the framework for optimization with firstorder surrogate losses of Mairal (2013) to show that these types of algorithms should converge even in the nonconvex case. We now recall a few definitions from that work.
First, we define the asymptotic stationary point condition, which gives us a notion of convergence in the nonconvex optimization case.
Definition 1 (Asymptotic Stationary Point (Mairal, 2013)).
For a sequence , and differentiable function , we say it satisfies an asymptotic stationary point condition if
We call a function strongly smooth if
is a bound on the largest eigenvalue of the Hessian – this tells us how the norm of the gradient changes. This is also known as a
Lipschitz continuous gradient. Now we recall the notion of a majorant firstorder surrogate function.Definition 2 (Majorant FirstOrder Surrogate (Mairal, 2013)).
A function g: is a majorant firstorder surrogate of near when the following conditions are satisfied

Majorant: we have .

Smoothness: the approximation error is differentiable, and its gradient is Lipschitz continuous, moreover, we have and
We denote by the set of such surrogates.
Now we recall the majorant firstorder surrogate property for the composite minimization step in the case of Euclidean Bregman divergence (Euclidean distance).
Proposition 7 (Proximal Gradient Surrogates (Mairal, 2013)).
Assume that where is differentiable with an Lipschitz gradient. Then, admits the following majorant surrogate in :
(31) 
We can use this result to establish a majorant property for the composite mirror descent surrogate (30) given a strongly convex and strongly smooth Bregman divergence.
Proposition 8 (Composite Mirror Descent Surrogates).
Assume that where is differentiable with an Lipschitz gradient, is a strongly convex and strongly smooth function, and is its Bregman divergence. Then, admits the following majorant surrogate in :
(32) 
Proof.
However, small technical conditions keep Proposition 8 from applying directly to our case. The Bethe entropy , and thus its associated Bregman divergence, is not strongly smooth – its gradient norm is unbounded as we approach the corners of the marginal polytope. However, it is locally Lipschitz – every point in the domain has a neighborhood for which the function is Lipschitz. In practice, since the mirror descent updates have a barrier function for the constraint set , our iterative algorithm will never get too close to the boundary of the polytope and it is effectively strongly smooth for purposes of our minimization algorithm. This is not a rigorous argument, but is both intuitively plausible and born out in experiments.
Proposition 9.
The sequence of iterates from Algorithm 4, when bounded away from the corners of the marginal polytope constraint set , and for appropriate choice of learning rates , convex , and strongly smooth (but possibly nonconvex) energy function , satisfies an asymptotic stationary point condition.
Proof.
This follows from application of Proposition 8, and noting that Algorithm 4 corresponds to the generalized surrogateminimization scheme in Algorithm 1 of Mairal (2013). The asymptotic stationary point condition then follows from Proposition 2.1 of Mairal (2013). The appropriate learning rates must be chosen by the Lipschitz constant of the gradient of , as well as the effective Lipschitz constant of the gradient of , given how far we are bounded from the edge of the constraint set (this effective smoothness constant is determined by the norm of our parameter vector ). ∎
In this section we have given a rough proof sketch for the asymptotic convergence of our inference algorithms even in the case of nonconvex energies. Our heuristic argument for the effective smoothness of the entropy
is the most pressing avenue for future work, but we believe it could be made rigorous by examining the norm of the parameter vector and how it contributes to the “sharpness” of the barrier function for the mirror descent iterates.Appendix D Accelerated BetheRDA
If we have strongly smooth losses ( is a bound on the largest eigenvalue of the Hessian), we can use an accelerated dual averaging procedure to obtain an even faster convergence rate of . Let be the diameter of the marginal polytope as measured by the strongly convex distancegenerating function (using its associated Bregman divergence.) Then Algorithm 5 gives us a convergence rate of by Corollary 7 of Xiao (2010).