Deep Structured Prediction with Nonlinear Output Transformations

11/01/2018 ∙ by Colin Graber, et al. ∙ Google ∙ University of Illinois at Urbana-Champaign

Deep structured models are widely used for tasks like semantic segmentation, where explicit correlations between variables provide important prior information which generally helps to reduce the data needs of deep nets. However, current deep structured models are restricted by oftentimes very local neighborhood structure, which cannot be increased for computational complexity reasons, and by the fact that the output configuration, or a representation thereof, cannot be transformed further. Very recent approaches which address those issues include graphical model inference inside deep nets so as to permit subsequent non-linear output space transformations. However, optimization of those formulations is challenging and not well understood. Here, we develop a novel model which generalizes existing approaches, such as structured prediction energy networks, and discuss a formulation which maintains applicability of existing inference techniques.




1 Introduction

Nowadays, machine learning models are used widely across disciplines from computer vision and natural language processing to computational biology and physical sciences. This wide usage is fueled, particularly in recent years, by easily accessible software packages and computational resources, large datasets, a problem formulation which is general enough to capture many cases of interest, and, importantly, trainable high-capacity models, i.e., deep nets.

While deep nets are a very convenient tool these days, enabling rapid progress in both industry and academia, their training is known to require significant amounts of data. One possible reason is the fact that prior information on the structural properties of output variables is not modeled explicitly. For instance, in semantic segmentation, neighboring pixels are semantically similar, or in disparity map estimation, neighboring pixels often have similar depth. The hope is that if such structural assumptions hold true in the data, then learning becomes easier (e.g., smaller sample complexity) Ciliberto et al. (2018). To address a similar shortcoming of linear models, in the early 2000s, structured models were proposed to augment support vector machines (SVMs) and logistic regression. Those structured models are commonly referred to as ‘Structured SVMs’ Taskar et al. (2003); Tsochantaridis et al. (2005) and ‘conditional random fields’ Lafferty et al. (2001), respectively.

More recently, structured models have also been combined with deep nets, first in a two-step training setup where the deep net is trained before being combined with a structured model, e.g., Alvarez et al. (2012); Chen et al. (2015), and then by considering a joint formulation Tompson et al. (2014); Zheng et al. (2015); Chen et al. (2015); Schwing and Urtasun (2015). In these cases, structured prediction is used on top of a deep net, using simple models for the interactions between output variables, such as plain summation. This formulation may be limiting in the type of interactions it can capture. To address this shortcoming, very recently, efforts have been made to include structured prediction inside, i.e., not on top of, a deep net. For instance, structured prediction energy networks (SPENs) Belanger and McCallum (2016); Belanger et al. (2017) were proposed to reduce the excessively strict inductive bias that is assumed when computing a score vector with one entry per output space configuration. Different from the aforementioned classical techniques, SPENs compute independent prediction scores for each individual component of the output as well as a global score which is obtained by passing a complete output prediction through a deep net. Unlike prior approaches, SPENs do not allow for the explicit specification of output structure, and structural constraints are not maintained during inference.

In this work, we represent output variables as an intermediate structured layer in the middle of the neural architecture. This gives the model the power to capture complex nonlinear interactions between output variables, which prior deep structured methods do not capture. Simultaneously, structural constraints are enforced during inference, which is not the case with SPENs. We provide two intuitive interpretations for including structured prediction inside a deep net rather than at its output. First, this formulation allows one to explicitly model local output structure while simultaneously assessing global output coherence in an implicit manner. This increases the expressivity of the model without incurring the cost of including higher-order potentials within the explicit structure. A second view interprets learning of the network above the structured ‘output’ as training of a loss function which is suitable for the considered task.

Including structure inside deep nets isn’t trivial. For example, it is reported that SPENs are hard to optimize Belanger et al. (2017). To address this issue, here, we discuss a rigorous formulation for structure inside deep nets. Different from SPENs which apply a continuous relaxation to the output space, here, we use a Lagrangian framework. One advantage of the resulting objective is that any classical technique for optimization over the structured space can be readily applied.

We demonstrate the effectiveness of our proposed approach on real-world applications, including OCR, image tagging, multilabel classification and semantic segmentation. In each case, the proposed approach is able to improve task performance over deep structured baselines.

2 Related Work

We briefly review related work and contrast existing approaches to our formulation.

Figure 1: Comparison between different model types: (a) deep nets, (b) structured deep nets, (c) SPENs, (d) non-linear structured deep nets.
Figure 2: A diagram of the proposed nonlinear structured deep network model. Each image is transformed via a 2-layer MLP ($H$) into a 26-dimensional feature representation. Structured inference uses this representation to provide a feature vector $y$ which is subsequently transformed by another 2-layer MLP ($T$) to produce the final model score.

Structured Prediction:

Interest in structured prediction sparked from the seminal works of Lafferty et al. (2001); Taskar et al. (2003); Tsochantaridis et al. (2005) and has continued to grow in recent years. These techniques were originally formulated to augment linear classifiers such as SVMs or logistic regression with a model for correlations between multiple variables of interest. Although the prediction problem (i.e., inference) in such models is NP-hard in general (Shimony, 1994), early work on structured prediction focused on special cases where the inference task was tractable. Later work addressed cases where inference was intractable and focused on designing efficient formulations and algorithms.

Existing structured prediction formulations define a score, which is often assumed to consist of multiple local functions, i.e., functions which depend on small subsets of the variables. The parameters of the score function are learned from a given dataset by encouraging the score for a ground-truth configuration to be higher than that of any other configuration. Several works studied the learning problem when inference is hard (Finley and Joachims, 2008; Kulesza and Pereira, 2008; Pletscher et al., 2010; Hazan and Urtasun, 2010; Meshi et al., 2010; Komodakis, 2011; Schwing et al., 2011; Meshi et al., 2016), designing effective approximations.

Deep Potentials:

After impressive results were demonstrated by Krizhevsky et al. (2012) on the ImageNet dataset (Deng et al., 2009), deep nets gained a significant amount of attention. Alvarez et al. (2012); Chen et al. (2015); Song et al. (2016) took advantage of accurate local classification for tasks such as semantic segmentation by combining deep nets with graphical models in a two-step procedure. More specifically, deep nets were first trained to produce local evidence (see Figure 1(a)). In a second training step, local evidence was fixed and correlations were learned. While leading to impressive results, a two-step training procedure seemed counterintuitive, and Tompson et al. (2014); Zheng et al. (2015); Chen et al. (2015); Schwing and Urtasun (2015); Lin et al. (2015) proposed a unified formulation (see Figure 1(b)) which was subsequently shown to perform well on tasks such as semantic segmentation and image tagging.

Our proposed approach is different in that we combine deep potentials with another deep net that is able to transform the inferred output space or features thereof.

Autoregressive Models:

Another approach to solve structured prediction problems using deep nets defines an order over the output variables and predicts one variable at a time, conditioned on the previous ones. This approach relies on the chain rule, where the conditional is modeled with a recurrent deep net (RNN). It has achieved impressive results in machine translation (Sutskever et al., 2014; Leblond et al., 2018), computer vision (Oord et al., 2016), and multi-label classification Nam et al. (2017). The success of these methods ultimately depends on the ability of the neural net to model the conditional distribution, and they are often sensitive to the order in which variables are processed. In contrast, we use a more direct way of modeling structure and a more global approach to inference by predicting all variables together.
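For reference, the chain-rule factorization that autoregressive models rely on can be written as follows. The notation ($x_1, \ldots, x_K$ for the ordered output variables and $c$ for the input) is assumed here for illustration; the cited works use their own symbols.

```latex
p(x \mid c) \;=\; \prod_{k=1}^{K} p\left(x_k \mid x_1, \ldots, x_{k-1}, c\right)
```

Each factor is what the recurrent net models, which is why the chosen variable order matters.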


Most related to our approach is the recent work of Belanger and McCallum (2016); Belanger et al. (2017), which introduced structured prediction energy networks (SPENs) with the goal to address the inductive bias. More specifically, Belanger and McCallum (2016) observed that automatically learning the structure of deep nets leads to improved results. To optimize the resulting objective, a relaxation of the discrete output variables to the unit interval was applied and stochastic gradient descent or entropic mirror descent were used. Similar in spirit but more practically oriented is work by Nguyen et al. (2017). These approaches are illustrated in Figure 1(c). Despite additional improvements Belanger et al. (2017), optimization of the proposed approach remains challenging due to the non-convexity which may cause the output space variables to get stuck in local optima.

‘Deep value networks,’ proposed by Gygli et al. (2017), are another gradient-based approach which uses the same architecture and relaxation as SPENs. However, the training objective is inspired by value-based reinforcement learning.

Our proposed method differs in two ways. First, we maintain the possibility to explicitly encourage structure inside deep nets. Hence our approach extends SPENs by including additional modeling capabilities. Second, instead of using a continuous relaxation of the output space variables, we formulate inference via a Lagrangian. Due to this formulation we can apply any of the existing inference mechanisms from belief propagation Pearl (1982) and all its variants Meltzer et al. (2009); Hazan and Shashua (2010) to LP relaxations Wainwright and Jordan (2008). More importantly, this also allows us (1) to naturally handle problems that are more general than multi-label classification; and (2) to use standard structured loss functions, rather than having to extend them to continuous variables, as SPENs do.

3 Model Description

Formally, let $c$ denote input data that is available for conditioning, for example sentences, images, video or volumetric data. Let $x = (x_1, \ldots, x_K) \in \mathcal{X}$ denote the multi-variate output space with $x_k \in \mathcal{X}_k$, $k \in \{1, \ldots, K\}$, indicating a single variable defined on the domain $\mathcal{X}_k$, assumed to be discrete. Generally, inference amounts to finding the configuration $x^* = \arg\max_{x \in \mathcal{X}} F(x, c, w)$ which maximizes a score $F(x, c, w)$ that depends on the condition $c$, the configuration $x$ and some model parameters $w$.

Classical deep nets assume variables to be independent of each other (given the context). Hence, the score decomposes into a sum of local functions $F(x, c, w) = \sum_{k=1}^{K} f_k(x_k, c, w)$, each depending only on a single $x_k$. Due to this decomposition, inference is easily possible by optimizing each $f_k(x_k, c, w)$ with respect to $x_k$ independently of the other ones. Such a model is illustrated in Figure 1(a). It is however immediately apparent that this approach doesn’t explicitly take correlations between any pair of variables into account.
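A minimal sketch of inference under this independence assumption: when the score is a sum of per-variable terms, the maximizer is found by taking an argmax per variable. The function name and the toy scores below are illustrative and not taken from the paper's code.

```python
from typing import List


def independent_inference(unary_scores: List[List[float]]) -> List[int]:
    """Maximize a sum of per-variable scores by maximizing each term separately.

    unary_scores[k][v] is the local score of assigning label v to variable k.
    """
    return [max(range(len(scores)), key=scores.__getitem__) for scores in unary_scores]


# Toy example: two variables with three labels each.
toy_scores = [[0.1, 2.0, -1.0], [0.5, 0.2, 0.9]]
prediction = independent_inference(toy_scores)
```

Because no term couples two variables, the cost is linear in the number of variables and labels.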

To model such context more globally, the score $F(x, c, w) = \sum_{r \in \mathcal{R}} f_r(x_r, c, w)$ is composed of overlapping local functions $f_r$ that are no longer restricted to depend on only a single variable $x_k$. Rather, $f_r$ depends on arbitrary subsets of variables $x_r = (x_k)_{k \in r}$ with $r \subseteq \{1, \ldots, K\}$. The set $\mathcal{R}$ subsumes all subsets $r$ that are required to describe the score for the considered task. Finding the highest scoring configuration for this type of function generally requires global inference, which is NP-hard Shimony (1994). It is common to resort to well-studied approximations Wainwright and Jordan (2008), unless exact techniques such as dynamic programming or submodular optimization Schrijver (2004); McCormick (2008); Stobbe and Krause (2010); Jegelka et al. (2011) are applicable. The complexity of those approximations increases with the size of the largest variable index subset $r$. Therefore, many of the models considered to date do not exceed pairwise interactions. This is shown in Figure 1(b).
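As an illustration of a case where exact inference is tractable, the sketch below runs dynamic programming (Viterbi) on a chain with unary and shared pairwise scores, and checks it against brute-force enumeration. The chain structure, function names and toy numbers are assumptions made for this example only.

```python
import itertools
from typing import List


def chain_score(x: List[int], unary: List[List[float]], pairwise: List[List[float]]) -> float:
    """Sum of unary scores plus pairwise scores of adjacent variables."""
    total = sum(unary[k][x[k]] for k in range(len(x)))
    total += sum(pairwise[x[k]][x[k + 1]] for k in range(len(x) - 1))
    return total


def chain_map(unary: List[List[float]], pairwise: List[List[float]]) -> List[int]:
    """Exact maximizer of chain_score via dynamic programming."""
    num_labels = len(unary[0])
    best = list(unary[0])          # best[v]: best score of a prefix ending in label v
    backpointers = []
    for k in range(1, len(unary)):
        new_best, pointers = [], []
        for b in range(num_labels):
            a_star = max(range(num_labels), key=lambda a: best[a] + pairwise[a][b])
            pointers.append(a_star)
            new_best.append(best[a_star] + pairwise[a_star][b] + unary[k][b])
        best = new_best
        backpointers.append(pointers)
    labels = [max(range(num_labels), key=best.__getitem__)]
    for pointers in reversed(backpointers):
        labels.append(pointers[labels[-1]])
    return labels[::-1]


def brute_force_map(unary: List[List[float]], pairwise: List[List[float]]) -> List[int]:
    """Enumerate all configurations; only feasible for tiny problems."""
    configs = itertools.product(range(len(unary[0])), repeat=len(unary))
    return list(max(configs, key=lambda x: chain_score(list(x), unary, pairwise)))


toy_unary = [[1.0, 0.0], [0.0, 0.2], [0.0, 1.0]]
toy_pairwise = [[0.5, 0.0], [0.0, 0.5]]   # rewards equal neighboring labels
```

Enumeration grows exponentially with the number of variables, whereas the dynamic program is linear in the chain length and quadratic in the number of labels.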

Beyond this restriction to low-order locality, the score function being expressed as a sum is itself a limitation. It is this latter restriction which we address directly in this work. However, we emphasize that the employed non-linear output space transformations are able to extract non-local high-order correlations implicitly, hence we address locality indirectly.

To alleviate the restriction of the score function being a sum, and to implicitly enable high-order interactions while modeling structure, our framework extends the aforementioned score via a non-linear transformation of its output, formally,

$$F(x, c, w) = T(c, H(x, c, w), w).$$

This is illustrated as a general concept in Figure 1(d) and with the specific model used by our experiments in Figure 2. We use $T$ to denote the additional (top) non-linear output transformation. Parameters may or may not be shared between bottom and top layers, i.e., we view $w$ as a long vector containing all trainable model weights. Different from structured deep nets, where $F$ is required to be real-valued, $H$ may be vector-valued. In this work, $H$ is a vector where each entry represents the score $f_r(x_r, c, w)$ for a given region $r$ and assignment to that region $x_r$, i.e., the vector $H$ has $\sum_{r \in \mathcal{R}} |\mathcal{X}_r|$ entries; however, other forms are possible. It is immediately apparent that $T = \mathbf{1}^\top H$, where $\mathbf{1}$ denotes the all ones vector, yields the classical score function $F(x, c, w) = \sum_{r \in \mathcal{R}} f_r(x_r, c, w)$, and that other more complex and in particular deep net based transformations $T$ are directly applicable.

Further note that for deep net based transformations $T$, $x$ is no longer part of the outer-most function, making the proposed approach more general than existing methods. Particularly, the ‘output space’ configuration $x$ is obtained inside a deep net, consisting of the bottom part $H$ and the top part $T$. This can be viewed as a structure-layer, a natural way to represent meaningful features in the intermediate nodes of the network. Also note that SPENs Belanger and McCallum (2016) can be viewed as a special case of our approach (ignoring optimization). Specifically, we obtain the SPEN formulation when $H$ consists of purely local scoring functions, i.e., when $H_k = f_k(x_k, c, w)$. This is illustrated in Figure 1(c).
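The relation between the bottom vector $H$ and the top transformation $T$ can be sketched on a small chain. The code below assumes that $H$ holds one entry per region and assignment and that entries of assignments not selected by $x$ are zero; this zero-filling, the tanh network and all weights are assumptions for illustration, not details confirmed by the text.

```python
import math


def region_score_vector(x, unary, pairwise):
    """Sketch of H(x, c, w) for a chain: one entry per (region, assignment).

    Assumption: entries of assignments that x does not select are zero.
    """
    num_labels = len(unary[0])
    h = []
    for k in range(len(x)):                       # unary regions
        h += [unary[k][v] if v == x[k] else 0.0 for v in range(num_labels)]
    for k in range(len(x) - 1):                   # pairwise regions
        for a in range(num_labels):
            for b in range(num_labels):
                selected = (a, b) == (x[k], x[k + 1])
                h.append(pairwise[a][b] if selected else 0.0)
    return h


def linear_top(h):
    """T = 1^T H: recovers the classical sum of region scores."""
    return sum(h)


def nonlinear_top(h, w1, w2):
    """Hypothetical one-hidden-layer top network with tanh units."""
    hidden = [math.tanh(sum(w * v for w, v in zip(row, h))) for row in w1]
    return sum(w * z for w, z in zip(w2, hidden))


unary = [[1.0, 0.0], [0.0, 0.2], [0.0, 1.0]]
pairwise = [[0.5, 0.0], [0.0, 0.5]]
h = region_score_vector([0, 1, 1], unary, pairwise)
w1 = [[0.1] * len(h), [-0.2] * len(h)]            # made-up weights
w2 = [1.0, 0.5]
classical = linear_top(h)
transformed = nonlinear_top(h, w1, w2)
```

With three binary variables and two edges the vector has 3·2 + 2·4 = 14 entries, and the linear top reproduces the summed score, while the tanh network mixes all entries and therefore no longer decomposes over regions.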

Generality has implications on inference and learning. Specifically, inference, i.e., solving the program

$$x^* = \arg\max_{x \in \mathcal{X}} T(c, H(x, c, w), w)$$

involves back-propagation through the non-linear output transformation $T$. Note that back-propagation through $T$ encodes top-down information into the inferred configuration, while forward propagation through $H$ provides a classical bottom-up signal. Because of the top-down information we say that global structure is implicitly modeled. Alternatively, $T$ can be thought of as an adjustable loss function which matches predicted scores $H$ to data $c$.

Unlike previous structured models, the scoring function $F(x, c, w) = T(c, H(x, c, w), w)$ does not decompose across the regions in $\mathcal{R}$. As a result, inference techniques developed previously for structured models do not apply directly here, and new techniques must be developed. To optimize the inference program $\max_{x \in \mathcal{X}} T(c, H(x, c, w), w)$ for continuous variables $x$, gradient descent via back-propagation is applicable. In the absence of any other strategy, for discrete $x$, SPENs apply a continuous relaxation where constraints restrict the domain. However, no guarantees are available for this form of optimization, even if maximization over the output space and back-propagation are tractable computationally. Additionally, projection into $\mathcal{X}$ is nontrivial here due to the additional structured constraints. To obtain consistency with existing structured deep net formulations and to maintain applicability of classical inference methods such as dynamic programming and LP relaxations, in the following, we discuss an alternative formulation for both inference and learning.

3.1 Inference

We next describe a technique to optimize structured deep nets augmented by non-linear output space transformations. This method is compelling because existing frameworks for graphical models can be deployed. Importantly, optimization over computationally tractable output spaces remains computationally tractable in this formulation. To achieve this goal, a dual-decomposition based Lagrangian technique is used to split the objective into two interacting parts, optimized with an alternating strategy. The resulting inference program is similar in spirit to inference problems derived in other contexts using similar techniques (see, for example, Komodakis and Paragios, 2009). Formally, note that the inference task $\max_{x \in \mathcal{X}} T(c, H(x, c, w), w)$ is equivalent to the following constrained program:

$$\max_{x \in \mathcal{X},\, y} T(c, y, w) \quad \text{s.t.} \quad y = H(x, c, w),$$

where the variable $y$ may be a vector of scores. By introducing Lagrange multipliers $\lambda$, the proposed objective is reformulated into the following saddle-point problem:

$$\min_{\lambda} \left( \max_{y} \left\{ T(c, y, w) - \lambda^\top y \right\} + \max_{x \in \mathcal{X}} \lambda^\top H(x, c, w) \right).$$

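Algorithm 1 Inference Procedure
  Input: learning rates for $y$ and $\lambda$; initial values for $y$ and $\lambda$; number of iterations
  for each iteration do
     repeat
        … (update of $y$)
     until convergence
     … (update of $\lambda$)
  end for
  … (averaging of the last iterates)
  Return: $\mu$, $\lambda$, $y$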

Two advantages of the resulting program are immediately apparent. Firstly, the objective for the maximization over the output space $\mathcal{X}$, required for the second term in parentheses, decomposes linearly across the regions in $\mathcal{R}$. As a result, this subproblem can be tackled with classical techniques, such as dynamic programming, message passing, etc., for which a great amount of literature is readily available (e.g., Werner, 2007; Boykov et al., 2001; Wainwright and Jordan, 2003; Globerson and Jaakkola, 2006; Sontag et al., 2008; Komodakis and Paragios, 2009; Hazan and Shashua, 2010; Schwing et al., 2011, 2012, 2014; Kappes et al., 2015; Meshi et al., 2015; Meshi and Schwing, 2017). Secondly, maximization over the output space $\mathcal{X}$ is connected to back-propagation only via Lagrange multipliers. Therefore, back-propagation methods can run independently. Here, we optimize over $\mathcal{X}$ by following Hazan et al. (2016); Chen et al. (2015), using a message passing formulation based on an LP relaxation of the original program.
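The split into a term over $y$ and a term over $x$ can be made concrete on a toy problem. The sketch below is not the paper's algorithm: it assumes a made-up concave top function $T(y) = -\tfrac{1}{2}\lVert y - a \rVert^2$ (so the maximization over $y$ has a closed form), a bottom vector with one local score per variable, and a plain subgradient method on $\lambda$. It only illustrates that the $x$-term decomposes per variable and that the dual value upper-bounds the primal one.

```python
import itertools

# Hypothetical toy problem: two binary variables, H(x) stacks one local score per variable.
UNARY = [[0.0, 1.0], [2.0, 0.5]]
TARGET = [0.8, 1.0]  # parameter "a" of the made-up concave top function T


def H(x):
    return [UNARY[k][x[k]] for k in range(len(x))]


def T(y):
    """Concave stand-in for the top network: T(y) = -0.5 * ||y - TARGET||^2."""
    return -0.5 * sum((yi - ti) ** 2 for yi, ti in zip(y, TARGET))


def primal_value():
    """max_x T(H(x)) by enumeration (only feasible for tiny problems)."""
    return max(T(H(x)) for x in itertools.product(range(2), repeat=2))


def dual_value(lam):
    """max_y {T(y) - lam^T y} + max_x lam^T H(x), returning the maximizers too."""
    y_star = [ti - li for ti, li in zip(TARGET, lam)]  # closed form for this T
    top = T(y_star) - sum(li * yi for li, yi in zip(lam, y_star))
    # The x-term is linear in H, so it splits into one maximization per variable.
    x_star = [max(range(2), key=lambda v, k=k: lam[k] * UNARY[k][v]) for k in range(2)]
    bottom = sum(li * hi for li, hi in zip(lam, H(x_star)))
    return top + bottom, y_star, x_star


def minimize_dual(steps=50):
    """Subgradient descent on lam; returns the smallest dual value seen."""
    lam, best = [0.0, 0.0], float("inf")
    for t in range(1, steps + 1):
        value, y_star, x_star = dual_value(lam)
        best = min(best, value)
        subgrad = [hi - yi for hi, yi in zip(H(x_star), y_star)]
        lam = [li - (0.5 / t) * gi for li, gi in zip(lam, subgrad)]
    return best
```

For discrete $x$ the bound need not be tight, which is consistent with the remark that only an LP relaxation of the original program is solved.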

Solving inference requires finding the saddle point of this Lagrangian objective over $\lambda$, $y$, and $x$. However, the fact that maximization with respect to the output space $\mathcal{X}$ is a discrete optimization problem complicates this process somewhat. To simplify this, we follow the derivation in (Hazan et al., 2016; Chen et al., 2015) by dualizing the LP relaxation problem to convert maximization over $\mathcal{X}$ into a minimization over dual variables $\mu$. This allows us to rewrite inference as follows:

$$\min_{\mu} \left( \min_{\lambda} \left( \max_{y} \left\{ T(c, y, w) - \lambda^\top y \right\} + H^D(\mu, c, \lambda, w) \right) \right),$$

where $H^D(\mu, c, \lambda, w)$ is the relaxed dual objective of the original discrete optimization problem. The algorithm is summarized in Algorithm 1. See Section 3.3 for discussion of the approach taken to optimize the saddle point.

For arbitrary region decompositions $\mathcal{R}$ and potential transformations $T$, inference can only be guaranteed to converge to local optima of the optimization problem. There do exist choices for $\mathcal{R}$ and $T$, however, where global convergence guarantees can be attained: specifically, if $\mathcal{R}$ forms a tree (Pearl, 1982) and if $T$ is concave in $y$ (which can be attained, for example, using an input-convex neural network (Amos et al., 2017)). We leave exploration of the impact of local versus global inference convergence on model performance for future work. For now, we note that the experimental results presented in Section 4 imply that inference converges sufficiently well in practice for this model to make better predictions than the baselines.

3.2 Learning

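We formulate the learning task using the common framework for structured support vector machines (Taskar et al., 2003; Tsochantaridis et al., 2005). Given an arbitrary scoring function $F$, we find optimal weights $w$ by maximizing the margin between the score assigned to the ground-truth configuration and the highest-scoring incorrect configuration:

$$\min_{w} \sum_{(x, c) \in \mathcal{D}} \left( \max_{\hat{x} \in \mathcal{X}} \left\{ F(\hat{x}, c, w) + L(x, \hat{x}) \right\} - F(x, c, w) \right),$$

where $\mathcal{D}$ denotes the training set and $L(x, \hat{x})$ the task loss. This formulation applies to any scoring function $F$, and we can therefore substitute in the non-linear score $F(x, c, w) = T(c, H(x, c, w), w)$ to arrive at the final learning objective:

$$\min_{w} \sum_{(x, c) \in \mathcal{D}} \left( \max_{\hat{x} \in \mathcal{X}} \left\{ T(c, H(\hat{x}, c, w), w) + L(x, \hat{x}) \right\} - T(c, H(x, c, w), w) \right).$$
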
Algorithm 2 Weight Update Procedure
  Input: learning rate, …
  for each iteration do
     …
     for every datapoint in a minibatch do
        run inference as in Algorithm 1 (adding the loss term)
        …
     end for
     … (gradient step on $w$)
  end for

To solve loss augmented inference we follow the dual-decomposition based derivation discussed in Section 3.1. In short, we replace loss augmented inference with the dualized inference program obtained there, adding the loss term $L(x, \hat{x})$. This requires the loss to decompose according to $\mathcal{R}$, which is satisfied by many standard losses (e.g., the Hamming loss). Note that beyond extending SPEN, the proposed learning formulation is less restrictive than SPEN since we don’t assume the loss to be differentiable.
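The margin-based objective with a Hamming loss can be illustrated by brute force on a tiny output space. In this sketch the scoring functions are stand-ins chosen for the example, and loss augmented inference is done by enumeration rather than by the dual-decomposition procedure described in the text.

```python
import itertools


def hamming(x_true, x_hat):
    """Number of positions where the two configurations differ."""
    return sum(a != b for a, b in zip(x_true, x_hat))


def structured_hinge(score, x_true, num_labels):
    """max over x_hat of {score(x_hat) + hamming(x_true, x_hat)} minus score(x_true)."""
    configs = itertools.product(range(num_labels), repeat=len(x_true))
    augmented = max(score(x_hat) + hamming(x_true, x_hat) for x_hat in configs)
    return augmented - score(x_true)


x_true = (1, 0)


def confident_score(x):
    """Hypothetical score that separates the ground truth by a wide margin."""
    return 3.0 * sum(a == b for a, b in zip(x, x_true))


def weak_score(x):
    """Hypothetical score whose margin is smaller than the Hamming loss."""
    return 0.5 * sum(a == b for a, b in zip(x, x_true))
```

The loss is zero once the ground truth beats every other configuration by at least its Hamming distance. Because the Hamming loss is a sum over single variables, it can be folded into the unary terms, which is the decomposition requirement mentioned above.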

Figure 3: Sample datapoints for experiments: (a) word recognition datapoints, (b) segmentation datapoints.

We optimize the learning objective by alternating between inference to update $\lambda$, $y$, and $\mu$ and taking gradient steps in $w$. Note that the specified formulation has the additional benefit of allowing the optimization performed during inference to be interleaved with updates of the model parameters $w$, though we leave this exploration to future work. Since inference and learning are based on a saddle-point formulation, specific attention has to be paid to ensure convergence to the desired values. We discuss those details subsequently.

3.3 Implementation Details

We use the primal-dual algorithm from (Chambolle and Pock, 2011) to solve the saddle-point problem of the dualized inference program. Though an averaging scheme is not specified, we observe better convergence in practice by averaging over the last iterates of $y$ and $\lambda$. The overall inference procedure is outlined in Algorithm 1.
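Averaging over the last iterates can be sketched as follows. How many iterates are averaged is not recoverable from the text, so `num_last` is a hypothetical parameter, and the iterates are represented as plain lists of floats.

```python
def average_last_iterates(iterates, num_last):
    """Element-wise mean of the final `num_last` iterates of a vector sequence."""
    tail = iterates[-num_last:]
    return [sum(column) / len(tail) for column in zip(*tail)]


# Toy sequence of three 2-dimensional iterates.
toy_iterates = [[0.0, 0.0], [2.0, 4.0], [4.0, 8.0]]
averaged = average_last_iterates(toy_iterates, num_last=2)
```

Averaging damps the oscillation that saddle-point iterates often show around the solution.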

For learning, we select a minibatch of data at every iteration. For all samples in the minibatch we first perform loss augmented inference following Algorithm 1 modified by adding the loss. Every round of inference is followed by an update of the weights of the model, which is accomplished via gradient descent. This process is summarized in Algorithm 2. Note that the current gradient depends on the model estimates for $\lambda$, $y$, and $\mu$.

We implemented this non-linear structured deep net model using the PyTorch framework (footnote 2: code available at: …). Our implementation allows for the usage of arbitrary higher-order graphical models, and it allows for an arbitrary composition of graphical models within the vector $H$ specified previously. The message passing implementation used to optimize over the discrete space $\mathcal{X}$ is in C++ and integrated into the model code as a Python extension.

                    Chain                               Second-order
              Train            Test             Train            Test
Unary         0.003  0.2946    0.000  0.2350    -                -
DeepStruct    0.077  0.4548    0.040  0.3460    0.084  0.4528    0.030  0.3220
LinearTop     0.137  0.5308    0.085  0.4030    0.164  0.5386    0.090  0.4090
NLTop         0.156  0.5464    0.075  0.4150    0.242  0.5828    0.140  0.4420
Table 1: Results for word recognition experiments. The two numbers per entry represent the word and character accuracies, respectively. The Unary row reports a single Train/Test pair.
               Train    Validation    Test
Unary          1.670    2.176         2.209
DeepStruct     1.135    2.045         2.045
DeepStruct++   1.139    2.003         2.057
SPENInf        1.121    2.016         2.061
NLTop          1.111    1.976         2.038
Table 2: Results for image tagging experiments. All values are Hamming losses.
               Train     Validation    Test
Unary          0.8005    0.7266        0.7100
DeepStruct     0.8216    0.7334        0.7219
SPENInf        0.8585    0.7542        0.7525
NLTop          0.8542    0.7552        0.7522
Oracle         0.9260    0.8792        0.8633
Table 3: Results for segmentation experiments. All values are mean intersection-over-union.

4 Experiments

We evaluate our non-linear structured deep net model on several diverse tasks: word recognition, image tagging, multilabel classification, and semantic segmentation. For these tasks, we trained models using some or all of the following configurations: Unary consists of a deep network model containing only unary potentials. DeepStruct consists of a deep structured model Chen et al. (2015); unless otherwise specified, these were trained by fixing the pretrained Unary potentials and learning pairwise potentials. For all experiments, these potentials have the form $f_{i,j}(x_i, x_j, W) = W_{x_i, x_j}$, where $W_{x_i, x_j}$ is the $(x_i, x_j)$-th element of the weight matrix $W$ and $i$, $j$ are a pair of nodes in the graph. For the word recognition and segmentation experiments, the pairwise potentials are shared across every pair, and in the others, unique potentials are learned for every pair. These unary and pairwise potentials are then fixed, and a “Top” model is trained using them; LinearTop consists of a structured deep net model with linear $T$, i.e., $T(c, y, w) = w^\top y$, while NLTop consists of a structured deep net model where the form of $T$ is task-specific. For all experiments, additional details are discussed in Appendix A.1, including specific architectural details and hyperparameter settings.

Word Recognition:

Our first set of experiments was run on a synthetic word recognition dataset. The dataset was constructed by taking a list of 50 common five-letter English words, e.g., ‘close,’ ‘other,’ and ‘world,’ and rendering each letter as a 28x28 pixel image. This was done by selecting a random image of each letter from the Chars74K dataset de Campos et al. (2009), randomly rotating, shifting, and scaling them, and then inserting them into random background patches with high intensity variance. The task is then to identify each word from the five letter images. The training, validation, and test sets for these experiments consist of 1,000, 200, and 200 words, respectively, generated in this way. See Figure 3(a) for sample words from this dataset.

Here, Unary consists of a two-layer perceptron trained using a max-margin loss on the individual letter images as a 26-way letter classifier. Both LinearTop and NLTop models were trained for this task, the latter of which consists of 2-layer sigmoidal multilayer perceptrons. For all structured models, two different graphs were used: each contains five nodes, one per letter in each word. The first contains four pairwise edges connecting each adjacent letter, and the second additionally contains second-order edges connecting letters to letters two positions away. Both graph configurations of the LinearTop and NLTop models finished 400 epochs of training in approximately 2 hours.
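The two graph configurations can be written down directly as edge lists. This is an illustrative sketch; the function name and the representation of edges as index pairs are assumptions, not the paper's data structures.

```python
def word_graph_edges(num_letters=5, second_order=False):
    """Edges of the chain graph, optionally with links two positions apart."""
    edges = [(i, i + 1) for i in range(num_letters - 1)]
    if second_order:
        edges += [(i, i + 2) for i in range(num_letters - 2)]
    return edges


chain_edges = word_graph_edges()
second_order_edges = word_graph_edges(second_order=True)
```

For five letters the chain has four edges, and the second-order graph adds three more for a total of seven.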

The word and character accuracy results for these experiments are presented in Table 1. We observe that, for both graph types, adding structure improves model performance. Additionally, including a global potential transformation increases performance further, and this improvement is increased when the transformation is nonlinear.
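The two reported metrics can be computed as sketched below, assuming a word counts as correct only if all of its letters are correct and character accuracy is pooled over all letters. The exact evaluation code is not given in the text, so this is one plausible reading.

```python
def word_and_char_accuracy(predicted, targets):
    """Return (word accuracy, character accuracy) for equal-length word lists."""
    words_correct = sum(p == t for p, t in zip(predicted, targets))
    chars_correct = sum(
        pc == tc for p, t in zip(predicted, targets) for pc, tc in zip(p, t)
    )
    total_chars = sum(len(t) for t in targets)
    return words_correct / len(targets), chars_correct / total_chars


word_acc, char_acc = word_and_char_accuracy(["close", "othar"], ["close", "other"])
```

A single wrong letter costs a whole word, which is why word accuracy in Table 1 is much lower than character accuracy.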

Multilabel Classification:

For this set of experiments, we compare against SPENs on the Bibtex and Bookmarks datasets used by Belanger and McCallum (2016) and Tu and Gimpel (2018). These datasets consist of binary feature vectors, each of which is assigned some subset of 159/208 possible labels, respectively. 500/1000 pairs were chosen for the structured models for Bibtex and Bookmarks, respectively, by selecting the labels appearing most frequently together within the training data.
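Selecting the label pairs that co-occur most often can be sketched with a counter over pairs. The representation of each training example as a set of label indices and the function name are assumptions for illustration.

```python
from collections import Counter
from itertools import combinations


def most_frequent_label_pairs(label_sets, num_pairs):
    """Return the `num_pairs` label pairs that appear together most often."""
    counts = Counter()
    for labels in label_sets:
        counts.update(combinations(sorted(labels), 2))
    return [pair for pair, _ in counts.most_common(num_pairs)]


toy_label_sets = [{0, 1}, {0, 1, 2}, {1, 2}, {0, 1}]
top_pairs = most_frequent_label_pairs(toy_label_sets, num_pairs=2)
```

In the toy data the pair (0, 1) occurs three times and (1, 2) twice, so these two become the edges of the structured model.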

Our Unary model obtained macro-averaged F1 scores of 44.0 and 38.4 on the Bibtex and Bookmarks datasets, respectively; DeepStruct and NLStruct performed comparably. Note that these classifier scores outperform the SPEN results reported in Tu and Gimpel (2018) of 42.4 and 34.4, respectively.

Image Tagging:

Next, we train image tagging models using the MIRFLICKR25k dataset Huiskes and Lew (2008). It consists of 25,000 images taken from Flickr, each of which is assigned some subset of a possible 24 tags. The train/development/test sets for these experiments consisted of 10,000/5,000/10,000 images, respectively.

Here, the Unary classifier consists of AlexNet Krizhevsky et al. (2012), first pre-trained on ImageNet and then fine-tuned on the MIRFLICKR25k data. For DeepStruct, both the unary and pairwise potentials were trained jointly. A fully connected pairwise graphical model was used, with one binary node per label and an edge connecting every pair of labels. Training of the NLStruct model was completed in approximately 9.2 hours.

The results for this set of experiments are presented in Table 2. We observe that adding explicit structure improves a non-structured model and that adding implicit structure through $T$ improves an explicitly structured model. We additionally compare against a SPEN-like inference procedure (SPENInf) as follows: we load the trained NLTop model and find the optimal output structure by relaxing $x$ to be in $[0, 1]$ and using gradient ascent (the final output is obtained by rounding). We observe that using this inference procedure provides inferior results to our approach.
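The SPEN-like procedure (relax, ascend, round) can be sketched on a small quadratic score over binary labels. The score used here, a sum of unary terms plus symmetric pairwise terms, stands in for the trained model, and the step size and iteration count are arbitrary choices for this example.

```python
def relaxed_inference(unary, pairwise, steps=200, lr=0.1):
    """Projected gradient ascent on x in [0, 1]^n, followed by rounding.

    Score: sum_i unary[i] * x[i] + sum_{i<j} pairwise[i][j] * x[i] * x[j],
    with `pairwise` a symmetric matrix whose diagonal is ignored.
    """
    n = len(unary)
    x = [0.5] * n
    for _ in range(steps):
        grad = [
            unary[i] + sum(pairwise[i][j] * x[j] for j in range(n) if j != i)
            for i in range(n)
        ]
        # Gradient step followed by projection back onto the unit interval.
        x = [min(1.0, max(0.0, xi + lr * g)) for xi, g in zip(x, grad)]
    return [int(round(xi)) for xi in x]


coupled = relaxed_inference([1.0, -0.5], [[0.0, 1.0], [1.0, 0.0]])
independent = relaxed_inference([-1.0, -1.0], [[0.0, 0.0], [0.0, 0.0]])
```

In the first toy problem the positive coupling pulls the second label on even though its unary score is negative. For non-concave scores this procedure can stop at a local optimum, which matches the concern raised about relaxation-based inference.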

To verify that the improved results for NLTop are not the result of an increased number of parameters, we additionally trained another DeepStruct model containing more parameters, which is called DeepStruct++ in Table 2. Specifically, we fixed the original DeepStruct potentials and learned two additional 2-layer multilayer perceptrons that further transformed the unary and pairwise potentials. Note that this model adds approximately 1.8 times more parameters than NLTop (i.e., 2,444,544 vs. 1,329,408) but performs worse. NLTop can capture global structure that may be present in the data during inference, whereas DeepStruct only captures local structure during inference.
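The stated factor follows directly from the two parameter counts:

```latex
\frac{2{,}444{,}544}{1{,}329{,}408} \approx 1.84
```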

Semantic Segmentation:

Finally, we run foreground-background segmentation on the Weizmann Horses database (Borenstein and Ullman, 2002), consisting of 328 images of horses paired with segmentation masks (see Figure 3(b) for example images). We use train/validation/test splits of 196/66/66 images, respectively. Additionally, we scale the input images such that the smaller dimension is 224 pixels long and take a center crop of 224x224 pixels; the same is done for the masks, except using a length of 64 pixels. The Unary classifier is similar to FCN-AlexNet from (Shelhamer et al., 2016), while NLStruct consists of a convolutional architecture built from residual blocks (He et al., 2016). We additionally train a model with similar architecture to NLStruct where ground-truth labels are included as an input into $T$ (Oracle). Here, the NLStruct model required approximately 10 hours to complete 160 training epochs.

Table 3 displays the results for this experiment. Once again, we observe that adding the potential transformation is able to improve task performance. The far superior performance by the Oracle model validates our approach, as it suggests that our model formulation has the capacity to take a fixed set of potentials and rebalance them in such a way that performs better than using those potentials alone. We also evaluate the model using the same SPEN-like inference procedure as described in the Image Tagging experiment (SPENInf). In this case, both approaches performed comparably.
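Mean intersection-over-union for a foreground/background mask can be computed as sketched below. The text does not say whether the mean is taken per image or over the whole dataset, so this sketch simply averages the per-class IoU of one flattened mask, skipping classes absent from both prediction and target.

```python
def mean_iou(predicted, target, num_classes=2):
    """Average per-class intersection-over-union of two flat label lists."""
    ious = []
    for cls in range(num_classes):
        intersection = sum(p == cls and t == cls for p, t in zip(predicted, target))
        union = sum(p == cls or t == cls for p, t in zip(predicted, target))
        if union > 0:
            ious.append(intersection / union)
    return sum(ious) / len(ious)


toy_iou = mean_iou([1, 1, 0, 0], [1, 0, 0, 0])
```

In the toy masks the background IoU is 2/3 and the foreground IoU is 1/2, giving a mean of 7/12.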

5 Conclusion and Future Work

In this work we developed a framework for deep structured models which allows for implicit modeling of higher-order structure as an intermediate layer in the deep net. We showed that our approach generalizes existing models such as structured prediction energy networks. We also discussed an optimization framework which retains applicability of existing inference engines such as dynamic programming or LP relaxations. Our approach was shown to improve performance on a variety of tasks over a base set of potentials.

Moving forward, we will continue to develop this framework by investigating other possible architectures for the top network and other methods of solving inference. Additionally, we hope to assess this framework's applicability to other tasks. In particular, the tasks chosen for experimentation here contained fixed-size output structures; however, it is common for the outputs of structured prediction tasks to be of variable size. This requires different architectures for the top network than the ones considered here.

Acknowledgments: This material is based upon work supported in part by the National Science Foundation under Grant No. 1718221, Samsung, 3M, and the IBM-ILLINOIS Center for Cognitive Computing Systems Research (C3SR). We thank NVIDIA for providing the GPUs used for this research.



References

  • Alvarez et al. [2012] J. Alvarez, Y. LeCun, T. Gevers, and A. Lopez. Semantic road segmentation via multi-scale ensembles of learned features. In Proc. ECCV, 2012.
  • Amos et al. [2017] B. Amos, L. Xu, and J. Z. Kolter. Input convex neural networks. In Proc. ICML, 2017.
  • Belanger and McCallum [2016] D. Belanger and A. McCallum. Structured Prediction Energy Networks. In Proc. ICML, 2016.
  • Belanger et al. [2017] D. Belanger, B. Yang, and A. McCallum. End-to-end learning for structured prediction energy networks. In Proc. ICML, 2017.
  • Borenstein and Ullman [2002] E. Borenstein and S. Ullman. Class-specific, top-down segmentation. In Proc. ECCV, 2002.
  • Boykov et al. [2001] Y. Boykov, O. Veksler, and R. Zabih. Fast Approximate Energy Minimization via Graph Cuts. PAMI, 2001.
  • Chambolle and Pock [2011] A. Chambolle and T. Pock. A first-order primal-dual algorithm for convex problems with applications to imaging. Journal of mathematical imaging and vision, 2011.
  • Chen et al. [2015] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. Semantic image segmentation with deep convolutional nets and fully connected CRFs. In Proc. ICLR, 2015.
  • Chen et al. [2015] L.-C. Chen, A. G. Schwing, A. L. Yuille, and R. Urtasun. Learning Deep Structured Models. In Proc. ICML, 2015. (Equal contribution.)
  • Ciliberto et al. [2018] Carlo Ciliberto, Francis Bach, and Alessandro Rudi. Localized structured prediction. arXiv preprint arXiv:1806.02402, 2018.
  • de Campos et al. [2009] T. de Campos, B. R. Babu, and M. Varma. Character recognition in natural images. 2009.
  • Deng et al. [2009] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A Large-Scale Hierarchical Image Database. In Proc. CVPR, 2009.
  • Finley and Joachims [2008] T. Finley and T. Joachims. Training structural SVMs when exact inference is intractable. In Proc. ICML, 2008.
  • Globerson and Jaakkola [2006] A. Globerson and T. Jaakkola. Approximate Inference Using Planar Graph Decomposition. In Proc. NIPS, 2006.
  • Gygli et al. [2017] M. Gygli, M. Norouzi, and A. Angelova. Deep value networks learn to evaluate and iteratively refine structured outputs. In Proc. ICML, 2017.
  • Hazan and Shashua [2010] T. Hazan and A. Shashua. Norm-Product Belief Propagation: Primal-Dual Message-Passing for LP-Relaxation and Approximate Inference. Trans. Information Theory, 2010.
  • Hazan and Urtasun [2010] T. Hazan and R. Urtasun. A Primal-Dual Message-Passing Algorithm for Approximated Large Scale Structured Prediction. In Proc. NIPS, 2010.
  • Hazan et al. [2016] T. Hazan, A. G. Schwing, and R. Urtasun. Blending Learning and Inference in Conditional Random Fields. JMLR, 2016.
  • He et al. [2016] K. He, X. Zhang, S. Ren, and J. Sun. Deep Residual Learning for Image Recognition. In Proc. CVPR, 2016.
  • Huiskes and Lew [2008] M. J. Huiskes and M. S. Lew. The MIR Flickr retrieval evaluation. In Proc. ACM International Conference on Multimedia Information Retrieval. ACM, 2008.
  • Jegelka et al. [2011] S. Jegelka, H. Lin, and J. Bilmes. On Fast Approximate Submodular Minimization. In Proc. NIPS, 2011.
  • Kappes et al. [2015] J. H. Kappes, B. Andres, F. A. Hamprecht, C. Schnörr, S. Nowozin, D. Batra, S. Kim, B. X. Kausler, T. Kröger, J. Lellmann, N. Komodakis, B. Savchynskyy, and C. Rother. A comparative study of modern inference techniques for structured discrete energy minimization problems. IJCV, 2015.
  • Komodakis [2011] N. Komodakis. Efficient training for pairwise or higher order CRFs via dual decomposition. In Proc. CVPR, 2011.
  • Komodakis and Paragios [2009] N. Komodakis and N. Paragios. Beyond pairwise energies: Efficient optimization for higher-order MRFs. In Proc. CVPR, 2009.
  • Krizhevsky et al. [2012] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet Classification with Deep Convolutional Neural Networks. In Proc. NIPS, 2012.
  • Kulesza and Pereira [2008] A. Kulesza and F. Pereira. Structured learning with approximate inference. In Proc. NIPS, 2008.
  • Lafferty et al. [2001] J. Lafferty, A. McCallum, and F. Pereira. Conditional Random Fields: Probabilistic Models for segmenting and labeling sequence data. In Proc. ICML, 2001.
  • Leblond et al. [2018] R. Leblond, J.-B. Alayrac, A. Osokin, and S. Lacoste-Julien. SEARNN: Training RNNs with global-local losses. In Proc. ICLR, 2018.
  • Lin et al. [2015] G. Lin, C. Shen, I. Reid, and A. van den Hengel. Deeply learning the messages in message passing inference. In Proc. NIPS, 2015.
  • McCormick [2008] S. T. McCormick. Submodular Function Minimization, pages 321–391. Elsevier, 2008.
  • Meltzer et al. [2009] T. Meltzer, A. Globerson, and Y. Weiss. Convergent Message Passing Algorithms: A Unifying View. In Proc. UAI, 2009.
  • Meshi and Schwing [2017] O. Meshi and A. Schwing. Asynchronous parallel coordinate minimization for MAP inference. In Proc. NIPS, 2017.
  • Meshi et al. [2010] O. Meshi, D. Sontag, T. Jaakkola, and A. Globerson. Learning Efficiently with Approximate Inference via Dual Losses. In Proc. ICML, 2010.
  • Meshi et al. [2015] O. Meshi, M. Mahdavi, and A. G. Schwing. Smooth and Strong: MAP Inference with Linear Convergence. In Proc. NIPS, 2015.
  • Meshi et al. [2016] O. Meshi, M. Mahdavi, A. Weller, and D. Sontag. Train and test tightness of LP relaxations in structured prediction. In Proc. ICML, 2016.
  • Nam et al. [2017] J. Nam, E. Loza Mencía, H. J. Kim, and J. Fürnkranz. Maximizing subset accuracy with recurrent neural networks in multi-label classification. In Proc. NIPS, 2017.
  • Nguyen et al. [2017] K. Nguyen, C. Fookes, and S. Sridharan. Deep Context Modeling for Semantic Segmentation. In WACV, 2017.
  • Oord et al. [2016] A. van den Oord, N. Kalchbrenner, and K. Kavukcuoglu. Pixel recurrent neural networks. In Proc. ICML, 2016.
  • Pearl [1982] J. Pearl. Reverend Bayes on inference engines: A distributed hierarchical approach. In Proc. AAAI, 1982.
  • Pletscher et al. [2010] P. Pletscher, C. S. Ong, and J. M. Buhmann. Entropy and Margin Maximization for Structured Output Learning. In Proc. ECML PKDD, 2010.
  • Schrijver [2004] A. Schrijver. Combinatorial Optimization. Springer, 2004.
  • Schwing and Urtasun [2015] A. G. Schwing and R. Urtasun. Fully Connected Deep Structured Networks. 2015.
  • Schwing et al. [2011] A. G. Schwing, T. Hazan, M. Pollefeys, and R. Urtasun. Distributed Message Passing for Large Scale Graphical Models. In Proc. CVPR, 2011.
  • Schwing et al. [2012] A. G. Schwing, T. Hazan, M. Pollefeys, and R. Urtasun. Globally Convergent Dual MAP LP Relaxation Solvers Using Fenchel-Young Margins. In Proc. NIPS, 2012.
  • Schwing et al. [2014] A. G. Schwing, T. Hazan, M. Pollefeys, and R. Urtasun. Globally Convergent Parallel MAP LP Relaxation Solver Using the Frank-Wolfe Algorithm. In Proc. ICML, 2014.
  • Shelhamer et al. [2016] E. Shelhamer, J. Long, and T. Darrell. Fully convolutional networks for semantic segmentation. PAMI, 2016.
  • Shimony [1994] S. E. Shimony. Finding MAPs for Belief Networks is NP-hard. Artificial Intelligence, 1994.
  • Song et al. [2016] Y. Song, A. G. Schwing, R. Zemel, and R. Urtasun. Training Deep Neural Networks via Direct Loss Minimization. In Proc. ICML, 2016.
  • Sontag et al. [2008] D. Sontag, T. Meltzer, A. Globerson, and T. Jaakkola. Tightening LP Relaxations for MAP Using Message Passing. In Proc. NIPS, 2008.
  • Stobbe and Krause [2010] P. Stobbe and A. Krause. Efficient Minimization of Decomposable Submodular Functions. In Proc. NIPS, 2010.
  • Sutskever et al. [2014] I. Sutskever, O. Vinyals, and Q. V. Le. Sequence to Sequence Learning with Neural Networks. In Proc. NIPS, 2014.
  • Taskar et al. [2003] B. Taskar, C. Guestrin, and D. Koller. Max-Margin Markov Networks. In Proc. NIPS, 2003.
  • Tompson et al. [2014] J. Tompson, A. Jain, Y. LeCun, and C. Bregler. Joint Training of a Convolutional Network and a Graphical Model for Human Pose Estimation. In Proc. NIPS, 2014.
  • Tsochantaridis et al. [2005] I. Tsochantaridis, T. Joachims, T. Hofmann, and Y. Altun. Large Margin Methods for Structured and Interdependent Output Variables. JMLR, 2005.
  • Tu and Gimpel [2018] L. Tu and K. Gimpel. Learning approximate inference networks for structured prediction. In Proc. ICLR, 2018.
  • Wainwright and Jordan [2003] M. J. Wainwright and M. I. Jordan. Variational Inference in Graphical Models: The View from the Marginal Polytope. In Proc. Conf. on Control, Communication and Computing, 2003.
  • Wainwright and Jordan [2008] M. J. Wainwright and M. I. Jordan. Graphical Models, Exponential Families and Variational Inference. Foundations and Trends in Machine Learning, 2008.
  • Werner [2007] T. Werner. A Linear Programming Approach to Max-Sum Problem: A Review. PAMI, 2007.
  • Zheng et al. [2015] S. Zheng, S. Jayasumana, B. Romera-Paredes, V. Vineet, Z. Su, D. Du, C. Huang, and P. H. S. Torr. Conditional random fields as recurrent neural networks. In Proc. ICCV, 2015.

Appendix A

A.1 Experimental Details

For all experiments, the learning rate, weight decay coefficient, and number of epochs were tuned using a validation set; for datasets without a specified validation set, a portion of the training data was held out and used instead to tune these parameters. All models were trained using minibatch gradient descent. Additionally, all structured models (everything besides the Unary configuration) used loss-augmentation terms during training. For NLTop models, the number of iterations per round of inference was set to 100; this value was chosen based on preliminary experimentation.

Word recognition experiments

The Unary classifier model consists of a two-layer multilayer perceptron with 128 hidden units and a ReLU nonlinearity. The multilayer perceptron used in the NLTop configuration contains 2834 hidden units, which is equal to the size of the and vectors, and uses sigmoid activation functions. The weights for the first layer of this model were initialized to the identity matrix, and the weights for the second layer were initialized to a vector of all
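One way to see where a hidden size of 2834 could come from is to count the potentials of a chain-structured model. The sketch below is a sanity check only, and it rests on assumptions that are not stated in this section: words of five characters, a 26-letter alphabet, and pairwise potentials between adjacent characters. The helper names are hypothetical.

```python
def chain_potential_count(num_nodes, num_labels):
    """Count unary plus pairwise potentials for a chain of `num_nodes` nodes."""
    unary = num_nodes * num_labels
    pairwise = (num_nodes - 1) * num_labels * num_labels
    return unary + pairwise


def identity_matrix(size):
    """Square identity matrix as nested lists (identity initialization)."""
    return [[1.0 if r == c else 0.0 for c in range(size)] for r in range(size)]


# Assumed setting: 5 characters per word, 26 possible letters per character.
hidden_units = chain_potential_count(num_nodes=5, num_labels=26)

# Tiny stand-in for the identity-initialized first layer.
first_layer = identity_matrix(4)
```

Under these assumptions the count is 5 * 26 + 4 * 26 * 26 = 2834, which matches the hidden size reported above. An identity-initialized first layer requires a square weight matrix, so the layer's input size would equal its hidden size.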


Multilabel classification experiments

The Unary classifier model consists of two-layer multilayer perceptrons with ReLU nonlinearities and 318/600 hidden units for Bibtex/Bookmarks, respectively. For all pairs of nodes and , pairwise potentials were constrained such that and . For the Bibtex experiment, the NLTop model is a 2-layer multilayer perceptron with 1000 hidden units and uses leaky ReLU activation functions with negative slopes of 0.25. For the Bookmarks experiment, the NLTop model is a 2-layer multilayer perceptron with 4000 hidden units and sigmoid activation functions; the potentials were divided by 100 before being input into the top network.
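The two top-network variants described above can be illustrated with a toy forward pass. This is a minimal sketch, not the authors' implementation: the function `two_layer_score`, the tiny weights, and the choice of a scalar output are assumptions made for illustration. Only the negative slope of 0.25 and the division by 100 come from the text.

```python
import math


def leaky_relu(x, negative_slope=0.25):
    """Leaky ReLU with the negative slope reported for the Bibtex model."""
    return x if x >= 0.0 else negative_slope * x


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def two_layer_score(inputs, w1, b1, w2, b2, activation):
    """Hidden layer with `activation`, then a linear map to one scalar."""
    hidden = [activation(sum(w * x for w, x in zip(row, inputs)) + b)
              for row, b in zip(w1, b1)]
    return sum(w * h for w, h in zip(w2, hidden)) + b2


# Toy potentials and weights (hypothetical values, two hidden units).
potentials = [1.0, -2.0]
w1 = [[1.0, 0.0], [0.0, 1.0]]
b1 = [0.0, 0.0]
w2 = [1.0, 1.0]
b2 = 0.0

# Bibtex-style variant: leaky ReLU hidden units.
bibtex_style = two_layer_score(potentials, w1, b1, w2, b2, leaky_relu)

# Bookmarks-style variant: potentials divided by 100, sigmoid hidden units.
scaled = [p / 100.0 for p in potentials]
bookmarks_style = two_layer_score(scaled, w1, b1, w2, b2, sigmoid)
```

In this toy case the leaky ReLU variant maps the hidden pre-activations 1.0 and -2.0 to 1.0 and -0.5, giving a score of 0.5.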

Image tagging experiments

The Unary classifier model consists of the pre-trained AlexNet model provided by PyTorch, with the final classifier layer stripped and replaced to generate unary potentials. For all pairs of nodes and , pairwise potentials were constrained such that and . The multilayer perceptron used in the NLTop configuration contains 1152 hidden units and uses hardtanh activation functions. For the configuration with additional parameters, the additional MLPs contained hidden sizes equal to both the input and output sizes, which were equal to the number of unary/pairwise potentials, respectively. Each MLP took one of these sets of potentials as input and transformed them, leading to a new set of potentials. For the SPENInf experiments, we run inference using 5 random initializations of and report the average task losses. The standard deviations for Train, Validation, and Test sets were , , and , respectively.
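The reporting protocol for SPENInf (average task loss over 5 random initializations, plus a standard deviation) can be sketched as below. Everything except the count of 5 restarts is an assumption: `toy_inference` is a hypothetical stand-in that runs gradient descent on a one-dimensional non-convex energy, not the inference procedure used in the paper.

```python
import random
import statistics


def toy_inference(rng, steps=200, lr=0.02):
    """Hypothetical stand-in: gradient descent on a non-convex 1-D energy."""
    y = rng.uniform(-2.0, 2.0)                    # random initialization
    for _ in range(steps):
        grad = 4.0 * y * (y * y - 1.0) + 0.3      # d/dy of the energy below
        y -= lr * grad
    return (y * y - 1.0) ** 2 + 0.3 * y           # final energy as the "loss"


def average_over_restarts(run_inference, num_restarts=5, seed=0):
    """Run inference from several random starts; return (mean, std) of losses."""
    rng = random.Random(seed)
    losses = [run_inference(rng) for _ in range(num_restarts)]
    return statistics.mean(losses), statistics.stdev(losses)


mean_loss, std_loss = average_over_restarts(toy_inference)
```

Because the toy energy has two local minima, different initializations can converge to different solutions, which is why a spread across restarts is worth reporting alongside the mean.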

Figure 4: The model used for Semantic Segmentation experiments.

Semantic segmentation experiments

Unary consists of an AlexNet model (Krizhevsky et al., 2012) pretrained on ImageNet with the first MaxPool layer removed and the stride of the second MaxPool layer changed from 2 to 1. Additionally, all fully connected layers are replaced with 1x1 convolutional layers, and the final classifier layer is replaced with a 1x1 convolutional layer that outputs two channels, one for each of the two possible labels. Finally, a deconvolutional layer with a kernel of size 14x14 is used to upsample the output of the previous part to the final 2x64x64 potential maps. The architecture used for NLTop is presented in Fig. 4. The input to the top network contains 13 channels of information: two of these channels consist of the unary potentials, with one channel per possible output value; four of these channels consist of the row pairwise potentials, with one channel per possible pair of output values (and one column of padding, since there are n - 1 pairs along a row with n nodes); four of these channels consist of the column pairwise potentials, with one channel per possible pair of output values (and one row of padding); and three of these channels consist of the (normalized) input images. The Oracle experiments use the same architecture, except with 23 channels of information being used (the additional ten channels being the ground-truth beliefs, reshaped in the same manner as the potentials). Hence, for this experiment, every convolutional layer except the last contains 23 filters.
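The channel layout described above can be made concrete with a small shape-only sketch. It is written under several assumptions that the text does not confirm: padding is zero-valued and appended as the last column or row, the image has already been resized to the 64x64 potential resolution, and the helper names are hypothetical.

```python
def zeros(channels, height, width):
    """Nested-list tensor of zeros with shape channels x height x width."""
    return [[[0.0] * width for _ in range(height)] for _ in range(channels)]


def pad_last_column(channel):
    """Append one zero column (assumed padding position and value)."""
    return [row + [0.0] for row in channel]


def pad_last_row(channel):
    """Append one zero row (assumed padding position and value)."""
    return channel + [[0.0] * len(channel[0])]


def build_top_network_input(unary, row_pairwise, col_pairwise, image):
    """Stack 2 unary + 4 row-pairwise + 4 column-pairwise + 3 image channels."""
    channels = list(unary)
    channels += [pad_last_column(c) for c in row_pairwise]
    channels += [pad_last_row(c) for c in col_pairwise]
    channels += list(image)
    return channels


size = 64
stacked = build_top_network_input(
    unary=zeros(2, size, size),
    row_pairwise=zeros(4, size, size - 1),   # n - 1 pairs along each row
    col_pairwise=zeros(4, size - 1, size),   # n - 1 pairs along each column
    image=zeros(3, size, size),              # assumed resized to 64x64
)
num_channels = len(stacked)
```

Applying the same reshaping to the ten belief channels (2 + 4 + 4) and appending them would give the 23-channel Oracle input.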