Neural networks have long been used for prediction tasks involving complex structured outputs (LeCun et al., 2006; Collobert et al., 2011; Lample et al., 2016). In structured prediction, output variables obey local and global constraints that are difficult to satisfy using purely local feedforward prediction from an input representation. For example, in sequence tagging tasks such as named entity recognition, the outputs must obey several hard constraints, e.g., I-PER cannot follow B-ORG. The results of Collobert et al. (2011) show a significant improvement when such structural output constraints are enforced by incorporating a linear-chain graphical model that captures the interactions between adjacent output variables. The addition of a graphical model to enforce output consistency is now common practice in deep structured prediction models for tasks such as sequence tagging (Lample et al., 2016) and image segmentation (Chen et al., 2015).
From a probabilistic perspective, the potentials of a probabilistic graphical model over the output variables are often parameterized using a deep neural network that learns global features of the input (LeCun et al., 2006; Collobert et al., 2011; Lample et al., 2016). This approach takes advantage of deep architectures to learn robust feature representations for the inputs $\mathbf{x}$, but is limited to relatively simple pre-existing graphical model structures to model the interactions among the outputs $\mathbf{y}$.
This paper presents work in which feature learning is used not only to learn rich representations of inputs, but also to learn latent output structure. We present a model for sequence tagging that takes the form of a latent-variable conditional random field (Quattoni et al., 2007; Sutton et al., 2007; Morency et al., 2007), where interactions in the latent state space are parameterized by low-rank embeddings. This low-rank structure allows us to use a larger number of latent states, learning rich and interpretable substructures in the output space without overfitting. Additionally, unlike LSTMs, the model permits exact MAP and marginal inference via the Viterbi and forward-backward algorithms. Because the model learns large numbers of latent hidden states, interactions among the outputs $\mathbf{y}$ are not limited to simple Markov dependencies among labels as in most deep learning approaches to sequence tagging.
Previous work on representation learning for structured outputs has taken several forms. Output-embedding models such as Srikumar & Manning (2014) have focused on learning low-rank similarity among label vectors, with no additional latent structure. The input-output HMM (Bengio & Frasconi, 1995) incorporates learned latent variables, parameterized by a neural network, but the lack of low-rank structure limits the size of the latent space. Structured prediction energy networks (Belanger & McCallum, 2016) use deep neural networks to learn global output representations, but do not allow for exact inference and are difficult to apply in cases when the number of outputs varies independently of the number of inputs, such as entity extraction systems.
In this preliminary work, we demonstrate the utility of learning a large embedded latent output space on a synthetic task based on CoNLL named entity recognition (NER). We consider the task synthetic because we employ input features involving only single tokens, which allows us to better examine the effects of both learned latent output variables and low-rank embedding structure. (The use of NER data is preferable, however, to completely synthetically generated data because its real-world text naturally contains easily interpretable complex latent structure.) We demonstrate significant accuracy gains from low-rank embeddings of large numbers of latent variables in output space, and explore the interpretable latent structure learned by the model. These results show promise for future application of low-rank latent embeddings to sequence modeling tasks involving more complex long-term memory, such as citation extraction, résumé field extraction, and semantic role labeling.
2 Related Work
The ability of neural networks to efficiently represent local context features sometimes allows them to make surprisingly good independent decisions for each structured output variable (Collobert et al., 2011). However, these independent classifiers are often insufficient for structured prediction tasks where there are strong dependencies between the output labels (Collobert et al., 2011; Lample et al., 2016). A natural solution is to use these neural feature representations to parameterize the factors of a conditional random field (Lafferty et al., 2001) for joint inference over output variables (Collobert et al., 2011; Jaderberg et al., 2015; Lample et al., 2016). However, most previous work restricts the linear-chain CRF states to be the labels themselves, learning no additional output structure.
The latent dynamic conditional random field (LDCRF) learns additional output structure beyond the labels by employing hidden states (latent variables) with Markov dependencies, each associated with a label; it has been applied to human gesture recognition (Morency et al., 2007). The dynamic conditional random field (DCRF) learns a factorized representation of each state (Sutton et al., 2007). The hidden-state conditional random field (HCRF) also employs a Markov sequence of latent variables, but the latent variables are used to predict a single label rather than a sequence of labels; it has been applied to phoneme recognition (Gunawardana et al., 2005) and gesture recognition (Quattoni et al., 2007). All these models learn output representations while preserving the ability to perform exact joint inference by belief propagation. While the above use a log-linear parameterization of the potentials over latent variables, the input-output HMM (Bengio & Frasconi, 1995) uses a separate neural network for each source state to produce transition probabilities to its destination states.
Experiments in all of the above parameterizations use only a small hidden state space due to the large numbers of parameters required. In this paper we enable a large number of states by using a low-rank factorization of the transition potentials between latent states, effectively learning distributed embeddings for the states. This is superficially similar to the label embedding model of Srikumar & Manning (2014), but that work learns embeddings only to model similarity between observable output labels, and does not learn a latent output state structure.
3 Embedded Latent CRF Model
We consider the task of sequence labeling: given an input sequence $\mathbf{x} = \{x_1, x_2, \ldots, x_T\}$, find the corresponding output labels $\mathbf{y} = \{y_1, y_2, \ldots, y_T\}$, where each output $y_t$ is one of $N$ possible output labels. Each input $x_t$ is associated with a feature vector $f_t \in \mathbb{R}^n$, such as that produced by a feed-forward or recurrent neural network.
The models we consider will associate each input sequence with a sequence of hidden states $\mathbf{z} = \{z_1, z_2, \ldots, z_T\}$. These discrete hidden states capture rich transition dynamics of the output labels. We consider the case where the number of hidden states $M$ is much larger than the number of output labels $N$, i.e., $M \gg N$.
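Given the above notation, the energy for a particular configuration of inputs $\mathbf{x}$, hidden states $\mathbf{z}$, and outputs $\mathbf{y}$ is:

$$E(\mathbf{y}, \mathbf{z} \mid \mathbf{x}) = \sum_{t=1}^{T} \left[ \psi_{zy}(z_t, y_t) + \psi_{xz}(x_t, z_t) \right] + \sum_{t=1}^{T-1} \psi_{zz}(z_t, z_{t+1}) \qquad (1)$$

where the $\psi$ are scalar scoring functions of their arguments. $\psi_{xz}$ and $\psi_{zy}$ are the local scores for the interaction between the input features and the hidden states, and the hidden state and the output state, respectively. $\psi_{zz}$ are the scores for transitioning from a hidden state $z_t$ to hidden state $z_{t+1}$. The distribution over output labels is given by:

$$P(\mathbf{y} \mid \mathbf{x}) = \frac{1}{Z} \sum_{\mathbf{z}} \exp\left( E(\mathbf{y}, \mathbf{z} \mid \mathbf{x}) \right) \qquad (2)$$

where $Z = \sum_{\mathbf{y}} \sum_{\mathbf{z}} \exp\left( E(\mathbf{y}, \mathbf{z} \mid \mathbf{x}) \right)$ is the partition function.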
In the case of our Embedded Latent CRF model, as in the LDCRF model, latent states are deterministically partitioned to correspond to output values. That is, the number of latent states $M$ is a multiple of the number of output values $N$, and the hidden-to-output score $\psi_{zy}(z_t, y_t)$ is $0$ for pairs $(z_t, y_t)$ in the partitioning and $-\infty$ otherwise. The other scoring functions are learned as global bilinear parameter matrices.
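The deterministic partitioning can be made concrete with a small sketch. It assumes (the text does not specify the layout) that latent states are numbered contiguously so that each label owns an equal-sized block, and that the hidden-to-output score is 0 inside a block and negative infinity outside it. The names `states_per_label`, `state_to_label`, and `psi_zy` are hypothetical.

```python
NEG_INF = float("-inf")


def state_to_label(state: int, states_per_label: int) -> int:
    """Label that owns a latent state, assuming contiguous equal-sized blocks."""
    return state // states_per_label


def psi_zy(state: int, label: int, states_per_label: int) -> float:
    """Deterministic hidden-to-output score: 0 inside the label's block, -inf outside."""
    return 0.0 if state_to_label(state, states_per_label) == label else NEG_INF


num_labels = 8        # e.g. the eight IOB labels of CoNLL-2003
states_per_label = 4  # hypothetical multiplier, chosen only for illustration
num_states = num_labels * states_per_label

# For every latent state, list the labels it is compatible with (score 0).
compatible = [
    [y for y in range(num_labels) if psi_zy(z, y, states_per_label) == 0.0]
    for z in range(num_states)
]
```

Under this layout every latent state is compatible with exactly one label, which is what later allows the sum over labels to drop out of the partition function.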
In order to manage large numbers of latent states without overfitting, the Embedded Latent CRF enforces an additional restriction that the transition scoring function $\psi_{zz}$ possess a low-rank structure: that is, $\psi_{zz}(z_t, z_{t+1}) = z_t^\top A z_{t+1} = z_t^\top U^\top V z_{t+1}$, where $U$ and $V$ are skinny rectangular matrices and $z_t, z_{t+1}$ are represented by one-hot vectors.
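To illustrate the low-rank restriction, here is a minimal sketch in plain Python. It assumes the transition matrix is factorized as `A = U^T V`, with `U` and `V` stored as `rank x num_states` nested lists so that column `i` acts as the embedding of state `i`; this storage layout, the random initialisation, and the sizes are illustrative assumptions, not the authors' implementation.

```python
import random


def transition_score(U, V, i, j):
    """Score for moving from state i to state j: dot product of column i of U and column j of V."""
    return sum(U[k][i] * V[k][j] for k in range(len(U)))


def low_rank_transition(U, V):
    """Materialise the full matrix A = U^T V (only needed for inspection)."""
    num_states = len(U[0])
    return [
        [transition_score(U, V, i, j) for j in range(num_states)]
        for i in range(num_states)
    ]


num_states, rank = 128, 16  # hypothetical sizes
rng = random.Random(0)
U = [[rng.gauss(0.0, 0.1) for _ in range(num_states)] for _ in range(rank)]
V = [[rng.gauss(0.0, 0.1) for _ in range(num_states)] for _ in range(rank)]
A = low_rank_transition(U, V)

# Parameter counts: a full transition matrix versus the two low-rank factors.
full_params = num_states * num_states
low_rank_params = 2 * rank * num_states
```

With these sizes the factorized form stores a quarter of the parameters of the full matrix, and the saving grows as the number of states increases while the rank stays fixed.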
While inference in this model is tractable using tree belief propagation even when learning the hidden-to-output scores $\psi_{zy}$, the deterministic factors make it especially simple to implement.
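Computing the quantities involved in (2) can be carried out efficiently with dynamic programming using the forward algorithm, as in HMMs. To see this, note that to compute the numerator $\sum_{\mathbf{z}} \exp\left( E(\mathbf{y}, \mathbf{z} \mid \mathbf{x}) \right)$, given an output label sequence $\mathbf{y}$, we can fold the local scores $\psi_{xz}$ and $\psi_{zy}$ from (1) into one score $\psi_{xzy}$, and summing the resulting energy corresponds exactly to the forward algorithm in a CRF with $M$ states. Because each latent state is compatible with exactly one output label, the sum over $\mathbf{y}$ drops out of the partition function, which can also be computed by dynamic programming:

$$Z = \sum_{\mathbf{z}} \exp\left( \sum_{t=1}^{T} \psi_{xz}(x_t, z_t) + \sum_{t=1}^{T-1} \psi_{zz}(z_t, z_{t+1}) \right)$$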
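The two dynamic programs can be sketched in log space as follows. This is an illustrative sketch rather than the authors' code: it assumes `local[t][j]` holds the input-to-hidden score of state `j` at position `t`, `trans[i][j]` holds the transition score, and latent states are grouped into contiguous blocks of `states_per_label` states per label (all hypothetical names and layout).

```python
import math

NEG_INF = float("-inf")


def logsumexp(values):
    """Numerically stable log(sum(exp(v))) that tolerates -inf entries."""
    m = max(values)
    if m == NEG_INF:
        return NEG_INF
    return m + math.log(sum(math.exp(v - m) for v in values))


def forward_log_sum(local, trans):
    """Log of the sum, over all latent paths, of exp(local scores + transition scores)."""
    alpha = list(local[0])
    for t in range(1, len(local)):
        alpha = [
            local[t][j] + logsumexp([alpha[i] + trans[i][j] for i in range(len(alpha))])
            for j in range(len(alpha))
        ]
    return logsumexp(alpha)


def clamp_to_labels(local, labels, states_per_label):
    """Fold the deterministic hidden-to-output score into the local scores."""
    return [
        [s if j // states_per_label == y else NEG_INF for j, s in enumerate(row)]
        for row, y in zip(local, labels)
    ]


def log_likelihood(local, trans, labels, states_per_label):
    """log P(y | x): clamped forward pass (numerator) minus free forward pass (log Z)."""
    log_numerator = forward_log_sum(clamp_to_labels(local, labels, states_per_label), trans)
    log_z = forward_log_sum(local, trans)
    return log_numerator - log_z


# Tiny hypothetical example: 3 positions, 2 labels, 2 latent states per label.
local = [[0.1 * (t + 1) * (j - 1.5) for j in range(4)] for t in range(3)]
trans = [[0.05 * (i - j) for j in range(4)] for i in range(4)]
print(log_likelihood(local, trans, [0, 1, 0], states_per_label=2))
```

Each forward pass costs on the order of $T M^2$ operations for $T$ positions and $M$ latent states, the same as a standard linear-chain CRF with $M$ states.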
At test time we perform MAP inference using the exact Viterbi algorithm, which can be done as in the above dynamic program, replacing sums with maxes.
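A minimal Viterbi sketch follows, obtained by replacing the sums of the forward recursion with maxes and keeping backpointers. It assumes the same hypothetical layout as before (`local[t][j]` for input-to-hidden scores, `trans[i][j]` for transitions, contiguous blocks of `states_per_label` latent states per label), and it reads the labels off the single best latent path, which is one natural reading of the decoding step rather than a confirmed detail of the authors' implementation.

```python
def viterbi(local, trans):
    """Return (best_score, best_path) over latent state sequences."""
    num_states = len(local[0])
    delta = list(local[0])
    backpointers = []
    for t in range(1, len(local)):
        new_delta, pointers = [], []
        for j in range(num_states):
            # Best predecessor of state j at position t.
            best_i = max(range(num_states), key=lambda i: delta[i] + trans[i][j])
            pointers.append(best_i)
            new_delta.append(local[t][j] + delta[best_i] + trans[best_i][j])
        delta = new_delta
        backpointers.append(pointers)
    # Trace the best path backwards from the best final state.
    last = max(range(num_states), key=lambda j: delta[j])
    path = [last]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return delta[last], path


def decode_labels(path, states_per_label):
    """Map each latent state on the path to the label that owns its block."""
    return [z // states_per_label for z in path]


# Tiny hypothetical example: 3 positions, 4 latent states, no transition preference.
local = [[5.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 5.0], [0.0, 5.0, 0.0, 0.0]]
trans = [[0.0] * 4 for _ in range(4)]
score, path = viterbi(local, trans)
labels = decode_labels(path, states_per_label=2)
```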
4 Experiments
We demonstrate the benefits of latent and embedded large-cardinality state spaces on a synthetic task based on CoNLL-2003 named entity recognition (Sang & Meulder, 2003), which provides easily interpretable inputs and outputs to explore. With IOB encoding the dataset has eight output labels representing non-entities, as well as inside and beginning of person, location, organization and miscellaneous entities (B-PER is not needed). We consider the task synthetic because we use relatively impoverished input features to explore the capacity of the output representation. We use the BiLSTM+CRF featurization from Lample et al. (2016), but instead of using a BiLSTM, we produce local potentials from a feedforward network conditioned on word- and character-level features from the current time-step only. We demonstrate that performance of this model is improved by increasing the size of the latent state space (perhaps unsurprisingly), and that significant further improvement can be obtained from learning low-rank embeddings of the latent states. Qualitatively, we also show the latent states learn interpretable structure.
of 1e-6. We initialise our word-level embeddings using pretrained 100-dimensional skip-n-gram embeddings (Ling et al., 2015) where available, and use Glorot initialisation (Glorot & Bengio, 2010) otherwise.
For the Embedded Latent CRF, we learn 16-dimensional embeddings (the two low-rank factor matrices of the transition potential, $U$ and $V$, have rank 16) for all but the 16-hidden-state ELCRF, for which we learn 8-dimensional embeddings. We train all models for 200 epochs with early stopping on the validation set. We use the train-dev-test split of the CoNLL-2003 dataset.
4.1 Quantitative Results
Table 1 reports the performance of all the models. We see that the performance on this structured prediction task is improved both by increasing the size of the latent state space, and by embedding the states into a lower-dimensional space (i.e., a low-rank factorization of the log-space transition potential). The benefits of a complex output space are especially important when using a restricted set of input features, as noted by Liang et al. (2008).
|#States||LDCRF F1||ELCRF F1|
4.2 Qualitative Insight
We now turn to a qualitative analysis of the ELCRF model. Table 2 gives examples of some hidden states and tokens for which they were activated. First, we observe that the model has discovered separate latent states for surnames (228) and surname nobiliary particles (192). The latter almost always transitions to a surname state, and capturing this special transition signature (as distinct from Per label generically, as in a traditional CRF without latent states) improves accuracy when the surname is poorly associated with the Per label.
We also observe the model’s ability to detect phrase boundaries. This is true not only for Per phrases, where the model identifies boundaries by identifying first names and last names, but also for Misc phrases and Org phrases. We observe that state 257 fires every time an I-Misc token is followed by an I-Misc token, signalling the start of an I-Misc phrase, and state 272 fires at the end of that phrase. Similarly, state 392 signals the start of an Org phrase and 445 signals the end.
|State||Tokens|
|192||Van, Dal, De, Manne, Jan, Den, Della, Der|
|204||Hendrix, Lien, Werner, Peter, Sylvie, Jack|
|228||Miller, Cesar, Jensen, Dickson, Abbott|
|283||British, German, Polish, Australian|
|297||911, 310, 150, 11|
|269||Korean-related, Beijing-funded, Richmond-based|
The latent state space also learns block structure in the state transition matrix which gives the joint prediction a long-term bidirectional “memory,” encouraging unusual and beneficial interpretation of other parts of the sequence. For example, in the sentence “Boston ’s Mo Vaughn went 3-for-3 with a walk , stole home …,” our model with factorized latent states was correctly able to label “Boston” as I-Org (the team) due to the longer range context of a baseball game, whereas the model without latent states incorrectly labels it I-Loc. In the phrase “Association for Relations Across the Taiwan Straits” our model correctly labels the entire sequence as an Org using a special state for “the,” while the traditional model loses context at “the” and labels the last two words as a Loc.
5 Conclusion and Future Work
We present a method for learning output representations in finite-state sequence modeling. Our Embedded Latent CRF learns state embeddings by representing the transition matrix in a large latent state space with low-rank factorization. Unlike most recent work, which learns input representations but settles for simple Markov dependencies among the given output labels, our approach learns output representations of a large number of memory-providing, expressive and interpretable states, avoids overfitting due to factorization, and maintains tractable exact inference plus maximum likelihood learning using dynamic programming. In future work we will apply this model to non-synthetic sequence labeling tasks involving complex joint predictions and long-term memory, such as citation extraction, résumé field extraction, and semantic role labeling.
- Belanger & McCallum (2016) Belanger, David and McCallum, Andrew. Structured prediction energy networks. In ICML, 2016.
- Bengio & Frasconi (1995) Bengio, Yoshua and Frasconi, Paolo. An input output HMM architecture. In NIPS, 1995.
- Chen et al. (2015) Chen, Liang-Chieh, Papandreou, George, Kokkinos, Iasonas, Murphy, Kevin, and Yuille, Alan L. Semantic image segmentation with deep convolutional nets and fully connected CRFs. In ICLR, 2015.
- Collobert et al. (2011) Collobert, Ronan, Weston, Jason, Bottou, Léon, Karlen, Michael, Kavukcuoglu, Koray, and Kuksa, Pavel P. Natural language processing (almost) from scratch. JMLR, 2011.
- Glorot & Bengio (2010) Glorot, Xavier and Bengio, Yoshua. Understanding the difficulty of training deep feedforward neural networks. In AISTATS, 2010.
- Gunawardana et al. (2005) Gunawardana, Asela, Mahajan, Milind, Acero, Alex, and Platt, John C. Hidden conditional random fields for phone classification. In Interspeech, 2005.
- Jaderberg et al. (2015) Jaderberg, Max, Simonyan, Karen, Vedaldi, Andrea, and Zisserman, Andrew. Deep structured output learning for unconstrained text recognition. In ICLR, 2015.
- Kingma & Ba (2015) Kingma, Diederik P. and Ba, Jimmy. Adam: A method for stochastic optimization. In ICLR, 2015.
- Lafferty et al. (2001) Lafferty, John D., McCallum, Andrew, and Pereira, Fernando C. N. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In ICML, 2001.
- Lample et al. (2016) Lample, Guillaume, Ballesteros, Miguel, Subramanian, Sandeep, Kawakami, Kazuya, and Dyer, Chris. Neural architectures for named entity recognition. In NAACL, 2016.
- LeCun et al. (2006) LeCun, Yann, Chopra, Sumit, Hadsell, Raia, Ranzato, M, and Huang, F. A tutorial on energy-based learning. Predicting structured data, 2006.
- Liang et al. (2008) Liang, Percy, Daumé III, Hal, and Klein, Dan. Structure compilation: trading structure for features. In ICML, 2008.
- Ling et al. (2015) Ling, Wang, Tsvetkov, Yulia, Amir, Silvio, Fermandez, Ramon, Dyer, Chris, Black, Alan W., Trancoso, Isabel, and Lin, Chu-Cheng. Not all contexts are created equal: Better word representations with variable attention. In EMNLP, 2015.
- Morency et al. (2007) Morency, Louis-Philippe, Quattoni, Ariadna, and Darrell, Trevor. Latent-dynamic discriminative models for continuous gesture recognition. In CVPR, 2007.
- Quattoni et al. (2007) Quattoni, Ariadna, Wang, Sybor, Morency, Louis-Philippe, Collins, Michael, and Darrell, Trevor. Hidden conditional random fields. PAMI, 2007.
- Sang & Meulder (2003) Sang, Erik F. Tjong Kim and Meulder, Fien De. Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. In CoNLL, 2003.
- Srikumar & Manning (2014) Srikumar, Vivek and Manning, Christopher D. Learning distributed representations for structured output prediction. In NIPS, 2014.
- Sutton et al. (2007) Sutton, Charles, McCallum, Andrew, and Rohanimanesh, Khashayar. Dynamic conditional random fields: Factorized probabilistic models for labeling and segmenting sequence data. JMLR, 2007.