Improving Joint Training of Inference Networks and Structured Prediction Energy Networks

11/07/2019 ∙ by Lifu Tu, et al. ∙ NYU ∙ Toyota Technological Institute at Chicago

Deep energy-based models are powerful, but pose challenges for learning and inference (Belanger and McCallum, 2016). Tu and Gimpel (2018) developed an efficient framework for energy-based models by training "inference networks" to approximate structured inference instead of using gradient descent. However, their alternating optimization approach suffers from instabilities during training, requiring additional loss terms and careful hyperparameter tuning. In this paper, we contribute several strategies to stabilize and improve this joint training of energy functions and inference networks for structured prediction. We design a compound objective to jointly train both cost-augmented and test-time inference networks along with the energy function. We propose joint parameterizations for the inference networks that encourage them to capture complementary functionality during learning. We empirically validate our strategies on two sequence labeling tasks, showing easier paths to strong performance than prior work, as well as further improvements with global energy terms.


1 Introduction

Energy-based modeling (LeCun et al., 2006) associates a scalar compatibility measure to each configuration of input and output variables. Belanger and McCallum (2016) formulated deep energy-based models for structured prediction, which they called structured prediction energy networks (SPENs). SPENs use arbitrary neural networks to define the scoring function over input/output pairs. However, this flexibility leads to challenges for learning and inference. The original work on SPENs used gradient descent for structured inference (Belanger and McCallum, 2016; Belanger et al., 2017). Tu and Gimpel (2018, 2019) found improvements in both speed and accuracy by replacing the use of gradient descent with a method that trains a neural network (called an "inference network") to do inference directly. Their formulation, which jointly trains the inference network and energy function, is similar to training in generative adversarial networks (Goodfellow et al., 2014), which is known to suffer from practical difficulties in training due to the use of alternating optimization (Salimans et al., 2016). To stabilize training, Tu and Gimpel (2018) experimented with several additional terms in the training objectives, finding performance to be dependent on their inclusion.

Also, when using the approach of Tu and Gimpel (2018), there is a mismatch between the training and test-time uses of the trained inference network. During training with hinge loss, the inference network is actually trained to do “cost-augmented” inference. However, at test time, the goal is to simply minimize the energy without any cost term. Tu and Gimpel (2018) fine-tuned the cost-augmented network to match the test-time criterion, but found only minimal change from this fine-tuning. This suggests that the cost-augmented network was mostly acting as a test-time inference network by convergence, which may be hindering the potential contributions of cost-augmented inference in max-margin structured learning (Tsochantaridis et al., 2004; Taskar et al., 2004).

In this paper, we contribute a new training objective for SPENs that addresses the above concern and also contribute several techniques for stabilizing and improving learning. We design a compound objective to jointly train both cost-augmented and test-time inference networks along with the energy function. In the context of the new objective, we propose shared parameterizations for the two inference networks that encourage them to capture complementary functionality while reducing the total number of parameters being trained. Quantitative and qualitative analysis shows clear differences in the characteristics of the trained cost-augmented and test-time inference networks. We also present three methods to streamline and stabilize training that help with both the old and new objectives. We empirically validate our strategies on two sequence labeling tasks from natural language processing (NLP), namely part-of-speech tagging and named entity recognition. We show easier paths to strong performance than prior work, and further improvements with global energy terms.

While SPENs have been used for multiple NLP tasks, including multi-label classification (Belanger and McCallum, 2016), part-of-speech tagging (Tu and Gimpel, 2018), and semantic role labeling (Belanger et al., 2017), they are not widely used in NLP. Structured prediction is extremely common in NLP, but is typically approached using methods that are more limited than SPENs (such as conditional random fields) or models that suffer from a train/test mismatch (such as most auto-regressive models). SPENs offer a maximally expressive framework for structured prediction while avoiding the train/test mismatch and therefore offer great potential for NLP. However, the training and inference difficulties have deterred NLP researchers. Our hope is that our methods can enable SPENs to be applied to a larger set of applications, including generation tasks.

2 Background

We denote the input space by $\mathcal{X}$. For an input $x \in \mathcal{X}$, we denote the structured output space by $\mathcal{Y}(x)$. The entire space of structured outputs is denoted $\mathcal{Y} = \cup_{x \in \mathcal{X}} \mathcal{Y}(x)$. A SPEN (Belanger and McCallum, 2016) defines an energy function $E_\Theta : \mathcal{X} \times \mathcal{Y} \rightarrow \mathbb{R}$, parameterized by $\Theta$, that computes a scalar energy for an input/output pair. At test time, for a given input $x$, prediction is done by choosing the output with lowest energy:

$$\hat{y} = \operatorname*{arg\,min}_{y \in \mathcal{Y}(x)} E_\Theta(x, y) \qquad (1)$$

However, solving equation (1) requires combinatorial algorithms because $\mathcal{Y}(x)$ is a structured, discrete space. This becomes intractable when $E_\Theta$ does not decompose into a sum over small "parts" of $y$. Belanger and McCallum (2016) relax this problem by allowing the discrete vector $y$ to be continuous; $\mathcal{Y}_R(x)$ denotes the relaxed output space. They solve the relaxed problem by using gradient descent to iteratively minimize the energy with respect to $y$. The energy function parameters $\Theta$ are trained using a structured hinge loss, which requires repeated cost-augmented inference during training. Using gradient descent for the repeated cost-augmented inference steps is time-consuming and makes learning unstable (Belanger et al., 2017).

Tu and Gimpel (2018) propose an alternative that replaces gradient descent with a neural network trained to do inference, i.e., to mimic the function performed in equation (1). This "inference network" $A_\Psi : \mathcal{X} \rightarrow \mathcal{Y}_R$, parameterized by $\Psi$, is trained with the goal that

$$A_\Psi(x) \approx \operatorname*{arg\,min}_{y \in \mathcal{Y}_R(x)} E_\Theta(x, y) \qquad (2)$$

When training the energy function parameters $\Theta$, Tu and Gimpel (2018) replaced the cost-augmented inference step in the structured hinge loss from Belanger and McCallum (2016) with a cost-augmented inference network $F_\Phi$ and trained the energy function and inference network parameters jointly:

$$\min_\Theta \max_\Phi \sum_{\langle x_i, y_i \rangle \in D} \big[ \triangle(F_\Phi(x_i), y_i) - E_\Theta(x_i, F_\Phi(x_i)) + E_\Theta(x_i, y_i) \big]_+ \qquad (3)$$

where $D$ is the set of training pairs, $[h]_+ = \max(0, h)$, and $\triangle(y, y')$ is a structured cost function that computes the distance between its two arguments. Tu and Gimpel (2018) alternately optimized $\Theta$ and $\Phi$, which is similar to training in generative adversarial networks (Goodfellow et al., 2014). As alternating optimization can be difficult in practice (Salimans et al., 2016), Tu and Gimpel experimented with including several additional terms in the above objective to stabilize training. We adopt the same learning framework as Tu and Gimpel of jointly learning the energy function and inference network, but we propose a novel objective function that jointly trains a cost-augmented inference network, a test-time inference network, and the energy function.
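To make Eq. (3) concrete, the following is a minimal sketch (not the authors' released implementation) of the per-example hinge term, assuming an `energy(x, y)` callable for $E_\Theta$, a `cost_aug_net(x)` callable for $F_\Phi$ that returns a (T, L) matrix of label probabilities, and the L1 distance as the cost $\triangle$ (the cost used in our sequence labeling experiments; see the appendix).

```python
# Hypothetical sketch of the per-example margin-rescaled hinge term in Eq. (3).
# `energy` plays the role of E_Theta and `cost_aug_net` the role of F_Phi.
import torch

def margin_rescaled_hinge(energy, cost_aug_net, x, y_gold):
    """Computes [Delta(F(x), y) - E(x, F(x)) + E(x, y)]_+ for one training pair."""
    y_hat = cost_aug_net(x)                   # (T, L) relaxed label distributions
    delta = torch.abs(y_hat - y_gold).sum()   # structured cost; L1 distance in our experiments
    hinge = delta - energy(x, y_hat) + energy(x, y_gold)
    return torch.clamp(hinge, min=0.0)        # truncation at zero, i.e., [.]_+
```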

The energy functions we use for our sequence labeling tasks are taken from Tu and Gimpel (2018) and are described in detail in the appendix.

3 An Objective for Joint Learning of Inference Networks

We now describe our “compound” objective that combines two widely-used losses in structured prediction. We first present it without inference networks:

$$\min_\Theta \sum_{\langle x_i, y_i \rangle \in D} \Big( \max_{y \in \mathcal{Y}_R(x_i)} \big[ \triangle(y, y_i) - E_\Theta(x_i, y) + E_\Theta(x_i, y_i) \big]_+ + \max_{y \in \mathcal{Y}_R(x_i)} \big[ - E_\Theta(x_i, y) + E_\Theta(x_i, y_i) \big]_+ \Big) \qquad (4)$$

This objective contains two different inference problems, which are also the two inference problems that must be solved in structured max-margin learning, whether during training or at test time. Eq. (1) shows the test-time inference problem. The other one is cost-augmented inference, defined as follows:

$$\operatorname*{arg\,min}_{y \in \mathcal{Y}_R(x_i)} \big( E_\Theta(x_i, y) - \triangle(y, y_i) \big) \qquad (5)$$

where $y_i$ is the gold standard output. This inference problem involves finding an output with low energy but high cost relative to the gold standard. Thus, it is not well-aligned with the test-time inference problem. Tu and Gimpel (2018) used the same inference network for solving both problems, which led them to fine-tune the network at test-time with a different objective. We avoid this issue by training two inference networks: $A_\Psi$ for test-time inference and $F_\Phi$ for cost-augmented inference:

$$\min_\Theta \max_{\Phi, \Psi} \sum_{\langle x_i, y_i \rangle \in D} \Big( \underbrace{\big[ \triangle(F_\Phi(x_i), y_i) - E_\Theta(x_i, F_\Phi(x_i)) + E_\Theta(x_i, y_i) \big]_+}_{\text{margin-rescaled loss}} + \underbrace{\big[ - E_\Theta(x_i, A_\Psi(x_i)) + E_\Theta(x_i, y_i) \big]_+}_{\text{perceptron loss}} \Big) \qquad (6)$$

As indicated, this loss can be viewed as the sum of the margin-rescaled and perceptron losses for SPEN training with inference networks. We treat this optimization problem as a minmax game and find a saddle point for the game, similar to Tu and Gimpel (2018) and Goodfellow et al. (2014). We alternately optimize $\Theta$, $\Phi$, and $\Psi$. The objective for the energy function parameters $\Theta$ is:

$$\min_\Theta \sum_{\langle x_i, y_i \rangle \in D} \big[ \triangle(F_\Phi(x_i), y_i) - E_\Theta(x_i, F_\Phi(x_i)) + E_\Theta(x_i, y_i) \big]_+ + \big[ - E_\Theta(x_i, A_\Psi(x_i)) + E_\Theta(x_i, y_i) \big]_+$$

When we remove 0-truncation (see Sec. 4.1), the objective for the inference network parameters $\Phi$ and $\Psi$ is:

$$\max_{\Phi, \Psi} \sum_{\langle x_i, y_i \rangle \in D} \big( \triangle(F_\Phi(x_i), y_i) - E_\Theta(x_i, F_\Phi(x_i)) \big) - E_\Theta(x_i, A_\Psi(x_i))$$

(the $E_\Theta(x_i, y_i)$ terms are constant with respect to $\Phi$ and $\Psi$ and can be dropped).
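As a concrete illustration, here is a minimal sketch (in the same hypothetical setup as the earlier sketch) of the two alternating objectives: the truncated loss minimized with respect to $\Theta$ and the untruncated objective maximized with respect to $\Phi$ and $\Psi$.

```python
# Hypothetical sketch of the two sides of the compound objective in Eq. (6).
# `energy` is E_Theta, `cost_aug_net` is F_Phi, `inf_net` is A_Psi; each consumes/returns
# (T, L) relaxed label matrices, and the cost Delta is L1 distance as in our experiments.
import torch

def energy_step_loss(energy, cost_aug_net, inf_net, x, y_gold):
    """Loss minimized w.r.t. Theta; both hinge terms keep the truncation at zero."""
    y_f = cost_aug_net(x).detach()   # inference networks are held fixed during E steps
    y_a = inf_net(x).detach()
    delta = torch.abs(y_f - y_gold).sum()
    margin_rescaled = torch.clamp(delta - energy(x, y_f) + energy(x, y_gold), min=0.0)
    perceptron = torch.clamp(-energy(x, y_a) + energy(x, y_gold), min=0.0)
    return margin_rescaled + perceptron

def inference_step_objective(energy, cost_aug_net, inf_net, x, y_gold):
    """Objective maximized w.r.t. Phi and Psi, with zero truncation removed (Sec. 4.1)."""
    y_f = cost_aug_net(x)
    y_a = inf_net(x)
    delta = torch.abs(y_f - y_gold).sum()
    # The E_Theta(x, y_gold) terms are constant w.r.t. Phi and Psi and are omitted here.
    return (delta - energy(x, y_f)) - energy(x, y_a)
```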

Figure 1: Joint parameterizations for the cost-augmented inference network $F_\Phi$ and the test-time inference network $A_\Psi$.

Joint Parameterizations.

If we were to train independent inference networks $F_\Phi$ and $A_\Psi$, this new objective could be much slower than the original approach of Tu and Gimpel (2018). However, the compound objective offers several natural options for defining joint parameterizations of the two inference networks. We consider three options, which are visualized in Figure 1 and described below:

  • Separated: $F_\Phi$ and $A_\Psi$ are two independent networks with their own architectures and parameters, as shown in Figure 1(a).

  • Shared: $F_\Phi$ and $A_\Psi$ share a "feature" network, as shown in Figure 1(b). We consider this option because both $F_\Phi$ and $A_\Psi$ are trained to produce output labels with low energy. However, $F_\Phi$ also needs to produce output labels with high cost (i.e., far from the gold standard).

  • Stacked: the cost-augmented network is a function of the output of the test-time network and the gold standard output $y$. That is, $F_\Phi(x) = g_\Phi(A_\Psi(x), y)$, where $g_\Phi$ is a parameterized function. This is depicted in Figure 1(c). Note that we block the gradient at $A_\Psi(x)$ when updating $\Psi$.

For the function $g_\Phi$ in the stacked option, we use an affine transform on the concatenation of the inference network label distribution and the gold standard one-hot vector. That is, denoting the vector at position $t$ of the cost-augmented network output by $F_\Phi(x)_t$, we have:

$$F_\Phi(x)_t = W \big[ A_\Psi(x)_t \, ; \, y_t \big] + b$$

where semicolon (;) is vertical concatenation, $y_t$ (position $t$ of $y$) is an $L$-dimensional one-hot vector, $A_\Psi(x)_t$ is the vector at position $t$ of $A_\Psi(x)$, $W$ is an $L \times 2L$ matrix, and $b$ is a bias.
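A minimal sketch of the stacked parameterization follows, assuming the test-time network output and the gold standard labels are (T, L) matrices; the module name and the use of `nn.Linear` for the affine transform are illustrative.

```python
# Hypothetical sketch of g_Phi in the stacked parameterization: an affine transform of
# the concatenation [A_Psi(x)_t ; y_t] at each position.
import torch
import torch.nn as nn

class StackedCostAugmented(nn.Module):
    def __init__(self, num_labels):
        super().__init__()
        self.affine = nn.Linear(2 * num_labels, num_labels)  # W is L x 2L, plus a bias b

    def forward(self, a_out, y_gold):
        # a_out:  (T, L) output of the test-time inference network A_Psi
        # y_gold: (T, L) one-hot gold-standard labels
        # detach() blocks the gradient at A_Psi(x), so Psi is not updated through this path.
        concat = torch.cat([a_out.detach(), y_gold], dim=-1)  # (T, 2L)
        return self.affine(concat)                            # (T, L) cost-augmented output
```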

One motivation for these parameterizations is to reduce the total number of parameters in the procedure. Generally, the number of parameters is expected to decrease when moving from separated to shared to stacked. We will compare the three options empirically in our experiments, in terms of both accuracy and number of parameters.

Another motivation, specifically for the third option, is to distinguish the two inference networks in terms of their learned functionality. With all three parameterizations, the cost-augmented network will be trained to produce an output that differs from the gold standard, due to the presence of the $\triangle(F_\Phi(x_i), y_i)$ term in the combined objective. However, Tu and Gimpel (2018) found that the trained cost-augmented network was barely affected by fine-tuning for the test-time inference objective. This suggests that the cost-augmented network was mostly acting as a test-time inference network by the time of convergence. With the stacked parameterization, however, we explicitly provide the gold standard to the cost-augmented network, permitting it to learn to change the predictions of the test-time network in appropriate ways to improve the energy function.

4 Training Stability and Effectiveness

We now discuss several methods that simplify and stabilize training SPENs with inference networks. When describing them, we will illustrate their impact by showing training trajectories for the Twitter part-of-speech tagging task described in Section 6 and the appendix.

4.1 Removing Zero Truncation

Tu and Gimpel (2018) used the following objective for the cost-augmented inference network (maximizing it with respect to $\Phi$):

$$\max_\Phi \sum_{\langle x_i, y_i \rangle \in D} \big[ \triangle(F_\Phi(x_i), y_i) - E_\Theta(x_i, F_\Phi(x_i)) + E_\Theta(x_i, y_i) \big]_+$$

where $[h]_+ = \max(0, h)$. However, there are two potential reasons why the bracketed term will equal zero and therefore trigger no gradient update. First, $E_\Theta$ (the energy function, corresponding to the discriminator in a GAN) may already be well-trained, and it does a good job separating the gold standard output and the cost-augmented inference network output. Or, it may be the case that the cost-augmented inference network (corresponding to the generator in a GAN) is so poorly trained that the energy of its output is extremely large, leading the margin constraints to be satisfied and the loss to be zero.

In standard margin-rescaled max-margin learning for structured prediction (Taskar et al., 2004; Tsochantaridis et al., 2004), the cost-augmented inference step is performed exactly (or approximately with reasonable guarantees of effectiveness), ensuring that when the loss is zero, the energy parameters are well trained. However, in our case, the loss may be zero simply because the cost-augmented inference network is undertrained, which will be the case early in training. Then, when using zero truncation, the gradient of the inference network parameters will be zero. This is likely why Tu and Gimpel (2018) found it important to add several stabilization terms to the objective. We find that by instead removing the truncation, learning stabilizes and becomes less dependent on these additional terms. Note that we retain the truncation at zero when updating the energy parameters $\Theta$.

As shown in Figure 3(a) in the appendix, without any stabilization terms and with truncation, the inference network will barely move from its starting point and learning fails overall. However, without truncation, the inference network can work well even without any stabilization terms.
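The difference amounts to whether the hinge quantity is clamped at zero before the inference network's gradient step; a minimal sketch, assuming the hinge quantity has already been computed as a scalar tensor:

```python
# Hypothetical sketch contrasting the truncated and untruncated inference-network
# objectives from Sec. 4.1; h = Delta - E(x, F(x)) + E(x, y) for one training pair.
import torch

def inference_net_objective(h: torch.Tensor, truncate: bool) -> torch.Tensor:
    if truncate:
        # [h]_+ : once the margin is satisfied (h <= 0), the inference network
        # receives no gradient, which stalls an undertrained F_Phi early in training.
        return torch.clamp(h, min=0.0)
    # Without truncation the inference network always receives a gradient signal.
    return h
```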

Figure 2: Training trajectories with different numbers of I steps. The three curves in each setting correspond to different random seeds. (a) the cost-augmented loss after the I steps; (b) the margin-rescaled loss after the I steps; (c) the gradient norm of the energy function parameters $\Theta$ after the E steps; (d) the gradient norm of the test-time inference network parameters $\Psi$ after the I steps. Tu and Gimpel (2018) use one I step after each E step.

4.2 Local Cross Entropy (CE) Loss

Tu and Gimpel (2018) proposed adding a local cross entropy loss, which is the sum of the label cross entropy losses over all positions in the sequence, to stabilize inference network training. We similarly find this term to help speed up convergence and improve accuracy. Figure 3(b) shows faster convergence to high accuracy when adding the local CE term. More comparisons are in Section 7.
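A minimal sketch of the local CE term, assuming the inference network outputs a (T, L) matrix of label probabilities and the gold standard is one-hot:

```python
# Hypothetical sketch of the local cross entropy term: the sum over positions of the
# cross entropy between the predicted label distribution and the gold label.
import torch

def local_cross_entropy(probs: torch.Tensor, y_gold: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    # probs: (T, L) predicted label distributions; y_gold: (T, L) one-hot gold labels
    return -(y_gold * torch.log(probs + eps)).sum()
```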

4.3 Multiple Inference Network Update Steps

When training SPENs with inference networks, the inference network parameters are nested within the energy function. We found that the gradient components of the inference network parameters consequently have smaller absolute values than those of the energy function parameters. So, we alternate between $k$ steps of optimizing the inference network parameters $\Phi$ and $\Psi$ ("I steps") and one step of optimizing the energy function parameters $\Theta$ ("E steps"). We find this strategy especially helpful when using complex inference network architectures.

To analyze this, we compute the cost-augmented loss and the margin-rescaled loss averaged over all training pairs after each set of I steps. The I steps seek to maximize these terms and the E steps seek to minimize them. Figs. 2(a) and (b) show the two losses during training for different numbers of I steps per E step. Fig. 2(c) shows the norm of the energy function gradient after the E steps, and Fig. 2(d) shows the norm of the test-time inference network gradient after the I steps. With only one I step per E step ($k = 1$), the inference network lags behind the energy, making the energy parameter updates very small, as shown by the small norms in Fig. 2(c). The inference network gradient norm (Fig. 2(d)) remains high, indicating underfitting. However, increasing $k$ too much also harms learning, as evidenced by the "plateau" effect in the curves for the largest values of $k$; this indicates that the energy function is lagging behind the inference network. Intermediate values of $k$ lead to more of a balance between the energy function and the inference networks, and gradient norms that are mostly decreasing during training. We treat $k$ as a hyperparameter that is tuned in our experiments.
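The overall schedule can be sketched as follows, reusing the hypothetical `energy_step_loss` and `inference_step_objective` helpers from Section 3; the local CE term and the optimizer settings from the appendix are omitted for brevity.

```python
# Hypothetical sketch of the alternating schedule: k I steps for every one E step.
# opt_inf updates both Phi (cost_aug_net) and Psi (inf_net); opt_energy updates Theta.
def train_epoch(data, energy, cost_aug_net, inf_net, opt_energy, opt_inf, k):
    for x, y_gold in data:
        # k I steps: maximize the untruncated objective w.r.t. Phi and Psi.
        for _ in range(k):
            opt_inf.zero_grad()
            obj = inference_step_objective(energy, cost_aug_net, inf_net, x, y_gold)
            (-obj).backward()      # maximize by minimizing the negation
            opt_inf.step()
        # One E step: minimize the truncated compound loss w.r.t. Theta.
        opt_energy.zero_grad()
        energy_step_loss(energy, cost_aug_net, inf_net, x, y_gold).backward()
        opt_energy.step()
```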

5 Global Energies for Sequence Labeling

In addition to new training strategies, we also experiment with several global energy terms for sequence labeling. Eq. (8) in the appendix shows the base energy. To capture long-distance dependencies, we include global energy (GE) terms in the form of Eq. (9). Tu and Gimpel (2018) pretrained their tag language model (TLM) on a large, automatically-tagged corpus and fixed its parameters when optimizing $\Theta$. We instead do not pretrain the TLM; we learn its parameters when training the energy function.

We also propose new global energy terms. Define $E^{\rightarrow}_{\mathrm{TLM}}(x, y)$ as in Eq. (9), where the underlying network is a forward LSTM TLM that takes a sequence of labels as input and returns a distribution over next labels. First, we add a TLM in the backward direction (denoted $E^{\leftarrow}_{\mathrm{TLM}}$ and defined analogously to the forward TLM). Second, we include words as additional inputs to the forward and backward TLMs. We define $E^{\rightarrow}_{\mathrm{TLM+w}}(x, y)$ analogously to Eq. (9), where the underlying network is a forward LSTM TLM that conditions on the words as well as the previous labels. We define the backward version similarly (denoted $E^{\leftarrow}_{\mathrm{TLM+w}}$). The global energy is therefore

$$E_{\mathrm{GE}}(x, y) = E^{\rightarrow}_{\mathrm{TLM}}(x, y) + E^{\leftarrow}_{\mathrm{TLM}}(x, y) + \lambda \big( E^{\rightarrow}_{\mathrm{TLM+w}}(x, y) + E^{\leftarrow}_{\mathrm{TLM+w}}(x, y) \big) \qquad (7)$$

Here $\lambda$ is a hyperparameter that is tuned. We experiment with three settings for the global energy: GE(a): the forward TLM only, as in Tu and Gimpel (2018); GE(b): the forward and backward TLMs ($\lambda = 0$); GE(c): all four TLMs in Eq. (7).
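A minimal sketch of assembling the three GE settings from four TLM energy terms, each assumed to be a callable mapping (x, y) to a scalar energy; the placement of $\lambda$ follows Eq. (7) as written above and should be treated as an assumption.

```python
# Hypothetical sketch of the global energy settings GE(a), GE(b), GE(c). Each TLM energy
# (forward, backward, and their word-conditioned variants) is a callable (x, y) -> scalar.
def global_energy(x, y, e_fwd, e_bwd, e_fwd_w, e_bwd_w, lam, setting="c"):
    if setting == "a":      # GE(a): forward TLM only, as in Tu and Gimpel (2018)
        return e_fwd(x, y)
    if setting == "b":      # GE(b): forward and backward TLMs (lambda = 0 in Eq. (7))
        return e_fwd(x, y) + e_bwd(x, y)
    # GE(c): all four TLMs in Eq. (7); word-conditioned terms weighted by lambda
    return e_fwd(x, y) + e_bwd(x, y) + lam * (e_fwd_w(x, y) + e_bwd_w(x, y))
```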

6 Experimental Setup

We consider two sequence labeling tasks: Twitter part-of-speech (POS) tagging (Gimpel et al., 2011) and named entity recognition (NER; Tjong Kim Sang and De Meulder, 2003), described in detail in the appendix. We consider three NER modeling configurations. NER uses only words as input and pretrained, fixed GloVe embeddings. NER+ uses words, the case of the first letter, POS tags, and chunk labels, as well as pretrained GloVe embeddings with fine-tuning. NER++ includes everything in NER+ as well as character-based word representations obtained using a convolutional network over the character sequence in each word. Unless otherwise indicated, our SPENs use the energy in Eq. (8). As a baseline, we use a BiLSTM tagger trained only with the local CE term.

loss              zero trunc.  CE   POS acc (%)  NER F1 (%)  NER+ F1 (%)
margin-rescaled   yes          no   13.9         3.91        3.91
margin-rescaled   no           no   87.9         85.1        88.6
margin-rescaled   yes          yes  89.4*        85.2*       89.5*
margin-rescaled   no           yes  89.4         85.2        89.5
perceptron        no           no   88.2         84.0        88.1
perceptron        no           yes  88.6         84.7        89.0

Table 1: Test set results for Twitter POS tagging and NER of several SPEN configurations. Results with * correspond to the setting of Tu and Gimpel (2018).
                                      POS                                  NER                                  NER+
                                      acc (%)  trained  infer.  speed     F1 (%)  trained  infer.  speed        F1 (%)
BiLSTM                                88.8     166K     166K    -         84.9    239K     239K    -            89.3

SPENs with inference networks (Tu and Gimpel, 2018):
margin-rescaled                       89.4     333K     166K    -         85.2    479K     239K    -            89.5
perceptron                            88.6     333K     166K    -         84.4    479K     239K    -            89.0

SPENs with inference networks, compound objective, CE, no zero truncation (this paper):
separated                             89.7     500K     166K    66        85.0    719K     239K    32           89.8
shared                                89.8     339K     166K    78        85.6    485K     239K    38           90.1
stacked                               89.8     335K     166K    92        85.6    481K     239K    46           90.1

Table 2: Test set results for Twitter POS tagging and NER. "trained" is the number of trained parameters; "infer." is the number of parameters needed during the inference procedure. Training speeds (examples/second) are shown for the joint parameterizations to compare them in terms of efficiency. The best overall setting (highest performance with fewest parameters and fastest training) is the stacked parameterization.

7 Results and Analysis

Effect of Removing Truncation.

Table 1 shows results for the margin-rescaled and perceptron losses when considering the removal of zero truncation and its interaction with the use of the local CE term. Training fails for both tasks when using zero truncation without the CE term. Removing truncation makes learning succeed and leads to effective models even without using CE. However, when using the local CE term, truncation has little effect on performance. The importance of CE in prior work (Tu and Gimpel, 2018) is likely due to the fact that truncation was being used.

Effect of Local CE.

The local cross entropy (CE) term is useful for both tasks, though it appears more helpful for tagging. This may be because POS tagging is a more local task. Regardless, for both tasks, as shown in Section 4.2, the inclusion of the CE term speeds convergence and improves training stability. For example, on NER, using the CE term reduces the number of epochs chosen by early stopping from 100 to 25. On Twitter POS tagging, using the CE term reduces the number of epochs chosen by early stopping from 150 to 60.

Effect of Compound Objective and Joint Parameterizations.

The compound objective is the sum of the margin-rescaled and perceptron losses, and outperforms them both (see Table 2). Across all tasks, the shared and stacked parameterizations are more accurate than the previous objectives. For the separated parameterization, the performance drops slightly for NER, likely due to the larger number of parameters. The shared and stacked options also have fewer parameters to train than the separated option, and the stacked version processes examples at the fastest rate during training.

                         POS    NER
margin-rescaled          0.2    0
compound, separated      2.2    0.4
compound, shared         1.9    0.5
compound, stacked        2.6    1.7

test-time ($A_\Psi$)     cost-augmented ($F_\Phi$)
common noun              proper noun
proper noun              common noun
common noun              adjective
proper noun              proper noun + possessive
adverb                   adjective
preposition              adverb
adverb                   preposition
verb                     common noun
adjective                verb

Table 3: Top: differences in accuracy/F1 between the test-time inference networks and the cost-augmented networks (on development sets). The "margin-rescaled" row uses a SPEN with the local CE term and without zero truncation, where $A_\Psi$ is obtained by fine-tuning $F_\Phi$ as done by Tu and Gimpel (2018). Bottom: most frequent output differences between $A_\Psi$ and $F_\Phi$ on the POS development set.

The top part of Table 3 shows how the performance of the test-time inference network and the cost-augmented inference network varies when using the new compound objective. The differences between $A_\Psi$ and $F_\Phi$ are larger than in the baseline configuration, showing that the two are learning complementary functionality. With the stacked parameterization, the cost-augmented network receives the gold standard label sequence as an additional input, which leads to the largest differences, as the cost-augmented network can explicitly favor incorrect labels. (We also tried a BiLSTM in the final layer of the stacked parameterization, but results were similar to the simpler affine architecture, so we only report results with the affine architecture.)

The bottom part of Table 3 shows qualitative differences between the two inference networks. On the POS development set, we count the differences between the predictions of $A_\Psi$ and $F_\Phi$ when $A_\Psi$ makes the correct prediction. (For this analysis we used the BiLSTM version of the stacked parameterization.) The most frequent combinations show that $F_\Phi$ tends to output tags that are highly confusable with those output by $A_\Psi$. For example, it often outputs proper noun when the gold standard is common noun, or vice versa. It also captures the noun-verb ambiguity and ambiguities among adverbs, adjectives, and prepositions.

Global Energies.

The results are shown in Table 4. Adding the backward (b) and word-augmented TLMs (c) improves over only using the forward TLM from Tu and Gimpel (2018). With the global energies, our performance is comparable to several strong results (cf. 90.94 of Lample et al., 2016 and 91.37 of Ma and Hovy, 2016). However, it is still lower than the state of the art (Akbik et al., 2018; Devlin et al., 2019), likely due to the lack of contextualized embeddings.

                                         NER    NER+   NER++
margin-rescaled                          85.2   89.5   90.2
compound, stacked, CE, no truncation     85.6   90.1   90.8
  + global energy GE(a)                  85.8   90.2   -
  + global energy GE(b)                  85.9   90.2   -
  + global energy GE(c)                  86.3   90.4   91.0†

Table 4: NER test F1 scores with global energy terms. †: We took the best configuration from NER/NER+ and evaluated it in the NER++ setting.

8 Related Work

Aside from the relevant work discussed already, there are several efforts aimed at stabilizing and improving learning in adversarial frameworks, for example those developed for generative adversarial networks (GANs) (Goodfellow et al., 2014; Salimans et al., 2016; Zhao et al., 2017; Arjovsky et al., 2017). Progress in training GANs has come largely from overcoming learning difficulties by modifying loss functions and optimization, and GANs have become more successful and popular as a result. Notably, Wasserstein GANs (Arjovsky et al., 2017) provided the first convergence measure in GAN training using Wasserstein distance. To compute Wasserstein distance, the discriminator uses weight clipping, which limits network capacity. Weight clipping was subsequently replaced with a gradient norm constraint (Gulrajani et al., 2017). Miyato et al. (2018) proposed a novel weight normalization technique called spectral normalization. These methods may be applicable to the similar optimization problems solved in learning SPENs. Another direction may be to explore alternative training objectives for SPENs, such as those that use weaker supervision than complete structures (Rooshenas et al., 2018).

9 Conclusions

We contributed several strategies to stabilize and improve joint training of SPENs and inference networks. Our use of joint parameterizations mitigates the need for fine-tuning of inference networks, leads to complementarity in the learned cost-augmented and test-time networks, and yields improved performance overall. These developments offer promise for SPENs to be more easily trained and deployed for a broad range of NLP tasks.

Future work will explore other structured prediction tasks, such as parsing and generation. We have taken initial steps in this direction, experimenting with constituency parsing using the attention-augmented sequence-to-sequence model of Tran et al. (2018). Preliminary experiments are positive (on the Switchboard-NXT dataset (Calhoun et al., 2010), the seq2seq baseline achieves 82.80 F1 on the development set and the SPEN with the stacked parameterization achieves 83.11), but significant challenges remain, specifically in terms of defining appropriate inference network architectures to enable efficient learning.

References

  • A. Akbik, D. Blythe, and R. Vollgraf (2018) Contextual string embeddings for sequence labeling. In Proceedings of the 27th International Conference on Computational Linguistics, pp. 1638–1649.
  • M. Arjovsky, S. Chintala, and L. Bottou (2017) Wasserstein generative adversarial networks. In Proceedings of the 34th International Conference on Machine Learning.
  • D. Belanger and A. McCallum (2016) Structured prediction energy networks. In Proceedings of the 33rd International Conference on Machine Learning.
  • D. Belanger, B. Yang, and A. McCallum (2017) End-to-end learning for structured prediction energy networks. In Proceedings of the 34th International Conference on Machine Learning.
  • S. Calhoun, J. Carletta, J. M. Brenier, N. Mayo, D. Jurafsky, M. Steedman, and D. Beaver (2010) The NXT-format Switchboard Corpus: a rich resource for investigating the syntax, semantics, pragmatics and prosody of dialogue. Language Resources and Evaluation 44 (4), pp. 387–419.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 4171–4186.
  • K. Gimpel, N. Schneider, B. O'Connor, D. Das, D. Mills, J. Eisenstein, M. Heilman, D. Yogatama, J. Flanigan, and N. A. Smith (2011) Part-of-speech tagging for Twitter: annotation, features, and experiments. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pp. 42–47.
  • I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio (2014) Generative adversarial nets. In Advances in Neural Information Processing Systems 27, pp. 2672–2680.
  • I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. C. Courville (2017) Improved training of Wasserstein GANs. In Advances in Neural Information Processing Systems 30, pp. 5767–5777.
  • S. Hochreiter and J. Schmidhuber (1997) Long short-term memory. Neural Computation.
  • D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980.
  • G. Lample, M. Ballesteros, S. Subramanian, K. Kawakami, and C. Dyer (2016) Neural architectures for named entity recognition. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 260–270.
  • Y. LeCun, S. Chopra, R. Hadsell, M. Ranzato, and F. Huang (2006) A tutorial on energy-based learning. In Predicting Structured Data.
  • X. Ma and E. Hovy (2016) End-to-end sequence labeling via bi-directional LSTM-CNNs-CRF. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1064–1074.
  • T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean (2013) Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems 26, pp. 3111–3119.
  • T. Miyato, T. Kataoka, M. Koyama, and Y. Yoshida (2018) Spectral normalization for generative adversarial networks. In Proceedings of International Conference on Learning Representations (ICLR).
  • O. Owoputi, B. O'Connor, C. Dyer, K. Gimpel, N. Schneider, and N. A. Smith (2013) Improved part-of-speech tagging for online conversational text with word clusters. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 380–390.
  • J. Pennington, R. Socher, and C. D. Manning (2014) GloVe: global vectors for word representation. In Proceedings of EMNLP.
  • L. Ratinov and D. Roth (2009) Design challenges and misconceptions in named entity recognition. In Proceedings of the Thirteenth Conference on Computational Natural Language Learning (CoNLL-2009), pp. 147–155.
  • A. Rooshenas, A. Kamath, and A. McCallum (2018) Training structured prediction energy networks with indirect supervision. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pp. 130–135.
  • T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen (2016) Improved techniques for training GANs. In Advances in Neural Information Processing Systems 29, pp. 2234–2242.
  • B. Taskar, C. Guestrin, and D. Koller (2004) Max-margin Markov networks. In Advances in Neural Information Processing Systems 16, pp. 25–32.
  • E. F. Tjong Kim Sang and F. De Meulder (2003) Introduction to the CoNLL-2003 shared task: language-independent named entity recognition. In Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003, pp. 142–147.
  • T. Tran, S. Toshniwal, M. Bansal, K. Gimpel, K. Livescu, and M. Ostendorf (2018) Parsing speech: a neural approach to integrating lexical and acoustic-prosodic information. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pp. 69–81.
  • I. Tsochantaridis, T. Hofmann, T. Joachims, and Y. Altun (2004) Support vector machine learning for interdependent and structured output spaces. In Proceedings of the Twenty-first International Conference on Machine Learning.
  • L. Tu, K. Gimpel, and K. Livescu (2017) Learning to embed words in context for syntactic tasks. In Proceedings of the 2nd Workshop on Representation Learning for NLP, pp. 265–275.
  • L. Tu and K. Gimpel (2018) Learning approximate inference networks for structured prediction. In Proceedings of International Conference on Learning Representations (ICLR).
  • L. Tu and K. Gimpel (2019) Benchmarking approximate inference methods for neural structured prediction. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 3313–3324.
  • J. J. Zhao, M. Mathieu, and Y. LeCun (2017) Energy-based generative adversarial network. In Proceedings of International Conference on Learning Representations (ICLR).

Appendix A Appendix

A.1 Energy Functions and Inference Networks for Sequence Labeling

Our experiments in this paper consider sequence labeling tasks, so the input $x$ is a length-$T$ sequence of tokens, where $x_t$ denotes the token at position $t$. The output $y$ is a sequence of labels, also of length $T$. We use $y_t$ to denote the output label at position $t$, where $y_t$ is a vector of length $L$ (the number of labels in the label set) and where $y_{t,j}$ is the $j$th entry of the vector $y_t$. In the original output space $\mathcal{Y}(x)$, $y_{t,j}$ is 1 for a single $j$ and 0 for all others. In the relaxed output space $\mathcal{Y}_R(x)$, $y_{t,j}$ can be interpreted as the probability of the $t$th position being labeled with label $j$. We then use the following energy for sequence labeling (Tu and Gimpel, 2018):

$$E_\Theta(x, y) = -\left( \sum_{t=1}^{T} \sum_{j=1}^{L} y_{t,j} \big( U_j^\top b(x, t) \big) + \sum_{t=2}^{T} y_{t-1}^\top W \, y_t \right) \qquad (8)$$

where $U_j$ is a parameter vector for label $j$ and the parameter matrix $W \in \mathbb{R}^{L \times L}$ contains label-pair parameters. Also, $b(x, t)$ denotes the "input feature vector" for position $t$. We define it to be the $d$-dimensional BiLSTM (Hochreiter and Schmidhuber, 1997) hidden vector at $t$. The full set of energy parameters $\Theta$ includes the $U_j$ vectors, $W$, and the parameters of the BiLSTM.
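A minimal sketch of Eq. (8), assuming the BiLSTM feature vectors $b(x, t)$ are precomputed and passed in as a (T, d) matrix; parameter initialization and module names are illustrative.

```python
# Hypothetical sketch of the sequence labeling energy in Eq. (8).
import torch
import torch.nn as nn

class SequenceLabelingEnergy(nn.Module):
    def __init__(self, num_labels, feature_dim):
        super().__init__()
        self.U = nn.Parameter(0.01 * torch.randn(num_labels, feature_dim))  # one U_j per label
        self.W = nn.Parameter(0.01 * torch.randn(num_labels, num_labels))   # label-pair parameters

    def forward(self, feats, y):
        # feats: (T, d) BiLSTM feature vectors b(x, t); y: (T, L) relaxed label vectors
        unary = (y * (feats @ self.U.t())).sum()      # sum_t sum_j y_{t,j} (U_j^T b(x, t))
        pairwise = ((y[:-1] @ self.W) * y[1:]).sum()  # sum_t y_{t-1}^T W y_t
        return -(unary + pairwise)
```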

Tu and Gimpel (2018) also added a global energy term that they referred to as a "tag language model" (TLM). We use $\mathrm{TLM}(\cdot)$ to denote an LSTM TLM that takes a sequence of labels as input and returns a distribution over next labels. We define $\bar{y}_t = \mathrm{TLM}(\langle y_0, y_1, \ldots, y_{t-1} \rangle)$, the TLM's distribution over the label at position $t$ given the preceding labels. Then, the energy term is:

$$E_{\mathrm{TLM}}(x, y) = -\sum_{t=1}^{T+1} \log \big( y_t^\top \bar{y}_t \big) \qquad (9)$$

where $y_0$ is the start-of-sequence symbol and $y_{T+1}$ is the end-of-sequence symbol. This energy returns the negative log-likelihood of the candidate output $y$ under the TLM.
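A minimal sketch of the TLM energy in Eq. (9), assuming a learned start-of-sequence vector and omitting the end-of-sequence term for brevity; architecture details are illustrative.

```python
# Hypothetical sketch of the tag language model (TLM) energy in Eq. (9).
import torch
import torch.nn as nn

class TagLanguageModelEnergy(nn.Module):
    def __init__(self, num_labels, hidden_size=128):
        super().__init__()
        self.start = nn.Parameter(torch.zeros(1, num_labels))  # stands in for y_0
        self.lstm = nn.LSTM(num_labels, hidden_size, batch_first=True)
        self.proj = nn.Linear(hidden_size, num_labels)

    def forward(self, y, eps=1e-8):
        # y: (T, L) relaxed label vectors for the candidate output
        inputs = torch.cat([self.start, y], dim=0).unsqueeze(0)     # (1, T+1, L)
        hidden, _ = self.lstm(inputs[:, :-1])                       # condition on y_0 .. y_{t-1}
        ybar = torch.softmax(self.proj(hidden), dim=-1).squeeze(0)  # (T, L) distributions ybar_t
        # -sum_t log(y_t^T ybar_t); the end-of-sequence term of Eq. (9) is omitted here
        return -torch.log((y * ybar).sum(dim=-1) + eps).sum()
```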

For inference networks, we use architectures similar to those used by Tu and Gimpel (2018). In particular, we choose BiLSTMs as the inference network architectures in our experiments. We also use BiLSTMs for the baselines and both the inference networks and baseline models use the same hidden sizes.
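A minimal sketch of such a BiLSTM inference network, with illustrative embedding handling and hyperparameters:

```python
# Hypothetical sketch of a BiLSTM inference network that outputs a label distribution
# at each position via a softmax output layer.
import torch
import torch.nn as nn

class BiLSTMInferenceNetwork(nn.Module):
    def __init__(self, vocab_size, num_labels, embed_dim=100, hidden_size=100):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.bilstm = nn.LSTM(embed_dim, hidden_size, bidirectional=True, batch_first=True)
        self.out = nn.Linear(2 * hidden_size, num_labels)

    def forward(self, word_ids):
        # word_ids: (T,) token indices for one sentence
        emb = self.embed(word_ids).unsqueeze(0)   # (1, T, embed_dim)
        hidden, _ = self.bilstm(emb)              # (1, T, 2 * hidden_size)
        return torch.softmax(self.out(hidden), dim=-1).squeeze(0)  # (T, L) label distributions
```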

A.2 Experimental Setup Details

Twitter Part-of-Speech (POS) Tagging.

We use the Twitter POS data from Gimpel et al. (2011) and Owoputi et al. (2013), which contains 25 tags. We use 100-dimensional skip-gram (Mikolov et al., 2013) embeddings from Tu et al. (2017). Like Tu and Gimpel (2018), we use a BiLSTM to compute the input feature vector for each position, using a hidden dimension of size 100. We also use BiLSTMs for the inference networks. The output layer of the inference network is a softmax function, so the inference network produces a distribution over labels at each position. The cost function $\triangle$ is L1 distance. We train the inference network using stochastic gradient descent (SGD) with momentum and train the energy parameters using Adam (Kingma and Ba, 2014). We also explore training the inference network using Adam when we do not use the local CE loss. (We find that Adam works better than SGD when training the inference network without the local cross entropy term.) In experiments with the local CE term, its weight is set to 1.

Named Entity Recognition (NER).

We use the CoNLL 2003 English dataset (Tjong Kim Sang and De Meulder, 2003; Ma and Hovy, 2016; Lample et al., 2016). We use the BIOES tagging scheme, following previous work (Ratinov and Roth, 2009), resulting in 17 NER labels. We use 100-dimensional pretrained GloVe embeddings (Pennington et al., 2014). The task is evaluated using F1 score computed with the conlleval script. The architectures for the feature networks in the energy function and inference networks are all BiLSTMs. The architectures for tag language models are LSTMs. We use a dropout keep-prob of 0.7 for all LSTM cells. The hidden size for all LSTMs is 128. We use Adam (Kingma and Ba, 2014) and do early stopping on the development set.

The hyperparameter $k$ (the number of I steps) is tuned over the set {1, 2, 5, 10, 50}. The global energy weight $\lambda$ is tuned over the set {0, 0.5, 1}.

Figure 3: Training trajectories with different settings: (a) truncating at 0; (b) adding the local CE loss. The three curves for each setting correspond to different random seeds. Tu and Gimpel (2018) use truncation and CE during training.