Conditional Random Field (CRF) is a widely used algorithm for structured prediction. It is an undirected graphical model trained to maximize a conditional probability. The undirected graph can be encoded with a set of features (node features and edge features). Usually, these features are sparse and carefully hand-designed.
To reduce the effort of feature engineering, neural network models are used to extract features automatically [Chen and Manning2014, Collobert et al.2011]. These models learn dense features, which better represent both syntactic and semantic information. Because of the success of CRF and neural networks, many models take advantage of both. Collobert et al. (2011) used a CRF objective to compute sentence-level probabilities for convolutional neural networks. Durrett and Klein (2015) introduced a neural CRF model that joins sparse features and dense features for parsing. Andor et al. (2016) proposed a transition-based neural model with a globally normalized CRF objective, using feedforward neural networks to learn neural features.
The marriage of feedforward neural networks and CRF is natural because a feedforward neural network scores local unstructured decisions while CRF makes global structured decisions. It is harder to combine a recurrent neural model with CRF because both of them perform structured inference. Huang et al. (2015) provided a solution that combines a recurrent structure with a CRF structure, and achieved good performance in sequence labelling. However, their model only encodes node features, while both node features and edge features are important to CRF.
In order to completely encode non-linear features for CRF, we propose a new recurrent neural CRF model. Our model uses LSTM to learn edge information from the input words, and takes the LSTM output as the CRF energy function. We do not change the internal structure of either LSTM or CRF, so decoding uses standard recurrent propagation and CRF dynamic programming inference, without any extra effort. In our model, we use edge embeddings to capture connections inside the input structure. LSTM is used to learn hidden edge features from the edge embeddings. After that, CRF globally normalizes the scores of the LSTM output. Andor et al. (2016) proved that a globally normalized CRF objective solves the label bias problem for neural models.
The contributions of our paper are as follows:
We propose a neural model that can learn non-linear edge features. We find that learning non-linear edge features is even more important than learning node features, because it enables the model to capture non-linear structural dependence.
We evaluate our model on several well-known sequence labelling tasks, including shallow parsing, NP chunking, POS tagging and Chinese word segmentation. The results show that our model can outperform state-of-the-art methods on these tasks.
In structured prediction, our goal is to predict the structure $y$ given the observations $x$. The label at position $t$ in the structure is denoted as $y_t$, and the corresponding observation is $x_t$.
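CRF [Lafferty, McCallum, and Pereira2001] is a popular and effective algorithm for structured prediction. It has a log-linear conditional probability with respect to energy functions over local cliques and transition cliques:
$$P(y \mid x) = \frac{1}{Z(x)} \exp\Big( \sum_{t} \phi(y_t, x) + \sum_{t} \psi(y_{t-1}, y_t, x) \Big)$$
where $\phi(y_t, x)$ is the energy function over the local clique at position $t$, $\psi(y_{t-1}, y_t, x)$ is the energy function over the transition clique, and $Z(x)$ is the normalization term.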
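As an illustration of how such a conditional probability can be evaluated, the following is a minimal sketch of a linear-chain CRF in log space. The names `local` (energies per position and tag) and `trans` (a shared tag-to-tag energy matrix) are assumptions made for this example, as is the choice to omit a start-of-sequence transition; the sketch is not taken from the paper's implementation.

```python
import math
from itertools import product


def logsumexp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def sequence_score(local, trans, tags):
    """Unnormalized score: local energies plus transition energies along the path."""
    score = local[0][tags[0]]
    for t in range(1, len(tags)):
        score += trans[tags[t - 1]][tags[t]] + local[t][tags[t]]
    return score


def log_partition(local, trans):
    """Forward algorithm: log of the summed exp(score) over all tag sequences."""
    num_tags = len(local[0])
    alpha = list(local[0])
    for t in range(1, len(local)):
        alpha = [
            local[t][j] + logsumexp([alpha[i] + trans[i][j] for i in range(num_tags)])
            for j in range(num_tags)
        ]
    return logsumexp(alpha)


def log_prob(local, trans, tags):
    """Conditional log-probability of one tag sequence."""
    return sequence_score(local, trans, tags) - log_partition(local, trans)


def brute_force_log_partition(local, trans):
    """Reference implementation that enumerates every sequence (tiny inputs only)."""
    num_tags = len(local[0])
    scores = [
        sequence_score(local, trans, tags)
        for tags in product(range(num_tags), repeat=len(local))
    ]
    return logsumexp(scores)
```

The forward recursion costs time linear in the sequence length, whereas the brute-force version is exponential and is included only as a correctness check.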
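Energy functions are used to learn features. Since the conventional CRF is a log-linear model, both the local clique and the transition clique have linear energy functions:
$$\psi(y_{t-1}, y_t, x) = \sum_{k} \lambda_k f_k(y_{t-1}, y_t, x), \qquad \phi(y_t, x) = \sum_{k} \mu_k g_k(y_t, x)$$
where $f_k$ is the indicator function of the $k$-th feature for the transition clique $(y_{t-1}, y_t)$, $g_k$ is the indicator function of the $k$-th feature for the local clique $y_t$, and $\lambda_k$ and $\mu_k$ are parameters of CRF.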
Therefore, conventional CRF can only learn linear features. To learn high-order features, LSTM is combined with the CRF model [Huang, Xu, and Yu2015]. At each time step, LSTM recurrently reads a word and outputs a score for each candidate label. The output function of LSTM can be used as the energy function over local cliques:
where is the hidden state of LSTM at the time step, is the element of vector, and is a transition score for jumping from tag to tag.
As for transition cliques, energy function is a transition matrix of variables for jumping from tag to tag:
so the energy function over transition cliques is linear, as in conventional CRF. Therefore, LSTM-CRF learns non-linear node features (over local cliques) and linear edge features (over transition cliques).
To incorporate more context information, the LSTM layer can be replaced with a bidirectional LSTM (BiLSTM) layer. BiLSTM contains both forward information and backward information, so BiLSTM-CRF performs better than LSTM-CRF.
|Linear edge features||Non-linear edge features|
Current LSTM-CRF only learns linear edge features because it has a linear energy function over transition cliques. Do and Artières (2010) show that a non-linear energy function performs better at extracting features for structured prediction. To obtain a non-linear energy function, we propose a new recurrent neural CRF, which uses LSTM as the energy function over transition cliques. Therefore, our model is able to learn non-linear edge features.
To learn non-linear edge features, we use edge embeddings to provide raw edge information. In natural language processing, the input structure is usually a sequence of words, so the edges of the input structure are the connections between neighboring words. We have three methods to produce edge embeddings from the input structure.
Bigram: Bigram embedding is an intuitive way to capture features of neighboring words. We can build a bigram dictionary and assign a vector to each key. It has proved effective in several models [Pei, Ge, and Chang2014, Chen et al.2015], but it may suffer from sparsity and slow training.
Feedforward layer: A feedforward layer is another method to learn information from input words. It takes two word embeddings as input and outputs an edge embedding after a single neural network layer.
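The ways of producing an edge embedding described here can be sketched as follows. This is a hedged illustration only: the function names, the `tanh` activation, the uniform initialization range, and the lazy creation of bigram vectors are assumptions for this example, not details given in the paper.

```python
import math
import random


def concat_edge(left, right):
    """Edge embedding as the concatenation of two neighboring word vectors."""
    return list(left) + list(right)


def bigram_edge(table, left_word, right_word, dim, rng):
    """Edge embedding looked up in a bigram dictionary (created on first use)."""
    key = (left_word, right_word)
    if key not in table:
        # Assumed initialization: small uniform random values.
        table[key] = [rng.uniform(-0.1, 0.1) for _ in range(dim)]
    return table[key]


def feedforward_edge(left, right, weights, bias):
    """Edge embedding from one neural layer over the two word vectors.

    weights has one row per output unit; tanh is an assumed activation.
    """
    inputs = concat_edge(left, right)
    return [
        math.tanh(sum(w * v for w, v in zip(row, inputs)) + b)
        for row, b in zip(weights, bias)
    ]
```

The bigram table grows with the number of distinct word pairs, which is one way to see the sparsity concern mentioned above, while the feedforward layer shares its parameters across all word pairs.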
Figure 1 shows our proposed Recurrent Neural CRF model. Our model contains three layers: input layer, LSTM layer and CRF layer.
Input Layer: The input layer reads words and provides edge embeddings for the LSTM layer. An edge embedding is the concatenation of neighboring word vectors, and it provides raw primary edge features.
LSTM Layer: The LSTM layer recurrently reads edge embeddings from the input layer and computes its output as the energy function over transition cliques for the CRF layer. Our LSTM layer does not normalize the energy output (for example with a softmax function); normalization is deferred to the CRF layer. Thus, our model is globally normalized, which can solve the label bias problem [Andor et al.2016].
CRF layer: The CRF layer predicts the output structure given the energy function from the LSTM layer. Since we do not change the internal structure of CRF, the Viterbi algorithm can still efficiently find the structure with the highest conditional probability.
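Decoding with position-dependent transition energies can be sketched as below. The layout is an assumption for illustration: `local[t][j]` holds the local energy of tag `j` at position `t`, and `trans[t][i][j]` holds the energy of moving from tag `i` at position `t-1` to tag `j` at position `t` (so `trans[0]` is unused).

```python
def viterbi(local, trans):
    """Return the highest-scoring tag path and its score.

    local[t][j]: energy of tag j at position t.
    trans[t][i][j]: energy of tag i at t-1 followed by tag j at t (trans[0] unused).
    """
    num_tags = len(local[0])
    best = list(local[0])  # best[j]: best score of any path ending in tag j
    backptr = []
    for t in range(1, len(local)):
        new_best, pointers = [], []
        for j in range(num_tags):
            # Best previous tag for reaching tag j at position t.
            i_star = max(range(num_tags), key=lambda i: best[i] + trans[t][i][j])
            new_best.append(best[i_star] + trans[t][i_star][j] + local[t][j])
            pointers.append(i_star)
        best = new_best
        backptr.append(pointers)
    last = max(range(num_tags), key=lambda j: best[j])
    path = [last]
    for pointers in reversed(backptr):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, best[last]
```

Because the recursion only needs one transition matrix per position, it works equally well whether those matrices are shared parameters or are produced by a neural layer for each edge.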
In our model, the objective function is similar to the CRF objective, which allows gradients to be computed via dynamic programming. To learn non-linear features, we replace the linear energy function with the LSTM output function:
where is LSTM energy output which contains hidden edge information.
The objective has a non-linear transition energy function, which neither conventional CRF nor LSTM-CRF has. The local energy function can be either linear or non-linear. We call our model with a linear local energy function Edge-based-1 and the model with a non-linear local energy function Edge-based-2.
Edge-based-1: The Edge-based-1 model has a linear local energy function, which captures the simplest linear node features. We use it to stress the importance of learning non-linear edge features. Our experiments show that a model learning only non-linear edge features outperforms a model learning only non-linear node features. The local energy function in Edge-based-1 is:
Edge-based-2: The Edge-based-2 model has a non-linear local energy function. It is proposed to show the combination of learning non-linear node features and non-linear edge features. The local energy function in Edge-based-2 is:
where is computed by another LSTM, and contains hidden node information.
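To make the idea of an LSTM producing transition energies concrete, here is a minimal pure-Python sketch. Everything specific in it is an assumption for illustration: the class name `EdgeLSTM`, the omission of bias terms, the random initialization, and in particular the linear output projection that maps each hidden state to a `num_tags x num_tags` matrix of tag-pair energies. The paper does not spell out these details in this copy.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(matrix, vector):
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


class EdgeLSTM:
    """Hypothetical LSTM that maps edge embeddings to per-edge transition energies."""

    def __init__(self, input_dim, hidden_dim, num_tags, seed=0):
        rng = random.Random(seed)
        self.hidden_dim = hidden_dim
        self.num_tags = num_tags
        # One weight matrix per gate, applied to [input; previous hidden state].
        self.gates = {
            name: random_matrix(hidden_dim, input_dim + hidden_dim, rng)
            for name in ("i", "f", "o", "g")
        }
        # Assumed output projection: hidden state -> num_tags * num_tags energies.
        self.out = random_matrix(num_tags * num_tags, hidden_dim, rng)

    def step(self, x, h, c):
        z = list(x) + list(h)
        i = [sigmoid(v) for v in matvec(self.gates["i"], z)]
        f = [sigmoid(v) for v in matvec(self.gates["f"], z)]
        o = [sigmoid(v) for v in matvec(self.gates["o"], z)]
        g = [math.tanh(v) for v in matvec(self.gates["g"], z)]
        c = [fv * cv + iv * gv for fv, cv, iv, gv in zip(f, c, i, g)]
        h = [ov * math.tanh(cv) for ov, cv in zip(o, c)]
        return h, c

    def transition_energies(self, edge_embeddings):
        """Return one unnormalized tag-by-tag energy matrix per input edge."""
        h = [0.0] * self.hidden_dim
        c = [0.0] * self.hidden_dim
        k = self.num_tags
        energies = []
        for x in edge_embeddings:
            h, c = self.step(x, h, c)
            flat = matvec(self.out, h)
            energies.append([flat[r * k:(r + 1) * k] for r in range(k)])
        return energies
```

The energies are left unnormalized on purpose, so that a CRF layer can normalize over whole sequences rather than per position.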
Table 1 shows the different objective functions of these recurrent neural CRF models.
We have two kinds of criteria to train our models: probabilistic criteria and large margin criteria [Do and Artières2010].
Probabilistic Criteria: The probabilistic criteria were first proposed in [Lafferty, McCallum, and Pereira2001]. The regularized objective function of the recurrent neural CRF can be described as:
where is the number of samples in the corpus. We denote the unnormalized score of a sample for Edge-Node Recurrent Neural CRF as:
And this score for Edge-based model is:
Then in Equation 13 can be written as:
Large Margin Criteria: The large margin criteria were first introduced by Taskar et al. (2005). Under the large margin criteria, the score of the correct tag sequence must exceed the score of any incorrect sequence by at least a given margin:
where is the number of incorrect tags in .
So the in objective function is:
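A large margin loss of this kind can be sketched as a structured hinge loss whose margin grows with the number of incorrect tags. The scaling constant `kappa`, the function names, and the brute-force enumeration of candidate sequences (practical only for tiny examples; a loss-augmented Viterbi search would be used in practice) are assumptions for this illustration.

```python
from itertools import product


def sequence_score(local, trans, tags):
    """Unnormalized score: local energies plus transition energies along the path."""
    score = local[0][tags[0]]
    for t in range(1, len(tags)):
        score += trans[tags[t - 1]][tags[t]] + local[t][tags[t]]
    return score


def hamming(gold, pred):
    """Number of positions where the predicted tag differs from the gold tag."""
    return sum(1 for g, p in zip(gold, pred) if g != p)


def large_margin_loss(local, trans, gold, kappa=0.2):
    """Hinge loss: the best margin-augmented candidate score minus the gold score."""
    num_tags = len(local[0])
    gold_score = sequence_score(local, trans, gold)
    worst = max(
        sequence_score(local, trans, list(cand)) + kappa * hamming(gold, cand)
        for cand in product(range(num_tags), repeat=len(local))
    )
    # The gold sequence is itself a candidate with zero margin, so this is >= 0.
    return max(0.0, worst - gold_score)
```

The loss is zero exactly when the gold sequence beats every other sequence by at least `kappa` times its number of wrong tags.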
To minimize the objective function, we use AdaGrad [Duchi, Hazan, and Singer2011], a widely used optimization algorithm. The parameter for the update can be calculated as:
where is the initial learning rate, and is the gradient of parameter for the update.
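For reference, a standard AdaGrad step can be sketched as follows: each parameter keeps a running sum of its squared gradients and divides the initial learning rate by the square root of that sum. The function name and the small constant `eps` (added to avoid division by zero) are assumptions for this example.

```python
import math


def adagrad_step(params, grads, accum, lr=0.1, eps=1e-8):
    """Apply one in-place AdaGrad update and return the updated lists.

    params: current parameter values.
    grads: gradient of the objective for each parameter at this update.
    accum: running sum of squared gradients per parameter.
    """
    for i, g in enumerate(grads):
        accum[i] += g * g
        params[i] -= lr * g / (math.sqrt(accum[i]) + eps)
    return params, accum
```

With a constant gradient, the effective step shrinks from one update to the next, which is the per-parameter learning rate decay that characterizes AdaGrad.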
Among various neural models, recurrent neural networks [Elman1990] have proved to perform well in sequence labelling tasks. LSTM [Hochreiter and Schmidhuber1997, Graves and Schmidhuber2005] improves the performance of RNN by alleviating the vanishing and exploding gradient problem. Later, the bidirectional recurrent model [Graves, Mohamed, and Hinton2013] was proposed to capture backward information.
The CRF model [Lafferty, McCallum, and Pereira2001] has achieved much success in natural language processing. Many models try to combine CRF with neural networks to capture more structural dependence. Peng et al. (2009) introduce a conditional neural fields model. Collobert et al. (2011) first implement convolutional neural networks with the CRF objective. Zheng et al. (2015) integrate CRF with RNN. Durrett and Klein (2015) use feedforward neural networks with CRF for parsing. Huang et al. (2015) use recurrent neural networks to learn non-linear node features. They show that BiLSTM-CRF is more robust than neural models without CRF. Do and Artières (2010) suggest feedforward neural networks to learn neural features. Zhou et al. (2015) propose a transition-based neural model with CRF for parsing. Finally, Andor et al. (2016) prove that a globally normalized CRF objective helps deal with the label bias problem in neural models.
Compared with these neural CRF models, our recurrent neural CRF has a recurrent structure with the ability to learn non-linear edge features. The recurrent structure helps capture long-distance information, and non-linear edge features provide more non-linear structural dependence. Table 2 shows the relationship between our proposed recurrent neural CRF model and other existing neural CRF models.
|Models||Edge||NP chunking||Shallow parsing||POS tagging||Word Seg|
|Conv-CRF (Collobert 2011)||linear||-||94.32||97.29||-|
|LSTM-CRF (Huang 2015)||linear||-||94.46||97.55||-|
|NP chunking||F1||Shallow parsing||F1||POS tagging||Acc|
|Sha and Pereira (2003)||94.30||Zhang et al. (2002)||94.17||Collobert et al. (2011)||97.29|
|Ando and Zhang (2005)||94.70||Ando and Zhang (2005)||94.39||Sun (2014)||97.36|
|Shen and Sarkar (2005)||95.23||Shen and Sarkar (2005)||94.01||Huang et al. (2015)||97.55|
|McDonald et al. (2005)||94.29||Collobert et al. (2011)||94.32||Andor et al. (2016)||97.44|
|Sun et al. (2008)||94.34||Huang et al. (2015)||94.46||Shen et al. (2007)||97.33|
|Our Edge-based-1||95.16||Our Edge-based-1||94.48||Our Edge-based-1||97.56|
|Our Edge-based-2||95.25||Our Edge-based-2||94.80||Our Edge-based-2||97.52|
We conduct experiments to analyze our proposed models. We choose well-known sequence labelling tasks, including NP chunking, shallow parsing, POS tagging and Chinese word segmentation, as our benchmarks so that our experimental results are comparable. We compare our model with other popular neural models, and analyze the effect of non-linear edge features.
We introduce our benchmark tasks as follows:
NP Chunking: NP chunking is short for noun phrase chunking, in which the non-recursive cores of noun phrases, called base NPs, are identified. Our dataset is from the CoNLL-2000 shallow-parsing shared task, which consists of 8936 sentences in the training set and 2012 sentences in the test set. We further split the training set and extract 90% of the sentences as the development set. Following previous work, we label the sentences in BIO2 format, which has 3 tags (B-NP, I-NP, O). Our evaluation metric is F-score.
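The BIO2 labelling and chunk-level F-score can be illustrated with a short sketch. The function names are hypothetical, and the handling of ill-formed sequences (an `I-` tag with no preceding `B-` tag of the same type starts a new chunk) is a lenient convention assumed for this example rather than something specified in the paper.

```python
def extract_chunks(tags):
    """Return the set of (start, end, type) chunks in a BIO2 tag sequence.

    end is exclusive. An I- tag that does not continue the current chunk
    is treated as the start of a new chunk (lenient handling).
    """
    chunks = set()
    start, kind = None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes the last chunk
        continues = tag.startswith("I-") and kind == tag[2:]
        if start is not None and not continues:
            chunks.add((start, i, kind))
            start, kind = None, None
        if tag.startswith("B-") or (tag.startswith("I-") and start is None):
            start, kind = i, tag[2:]
    return chunks


def chunk_f_score(gold_tags, pred_tags):
    """F-score over exactly matching chunks (same boundaries and same type)."""
    gold = extract_chunks(gold_tags)
    pred = extract_chunks(pred_tags)
    correct = len(gold & pred)
    if correct == 0:
        return 0.0
    precision = correct / len(pred)
    recall = correct / len(gold)
    return 2 * precision * recall / (precision + recall)
```

A chunk counts as correct only if both of its boundaries and its type match, so a prediction that finds only part of a noun phrase earns no credit for it.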
Shallow Parsing: Shallow parsing is a task similar to NP chunking, but it needs to identify all chunk types (VP, PP, DT…). The dataset is also from CoNLL-2000, and it contains 23 tags. We use F-score as the evaluation metric.
POS tagging: POS tagging is short for part-of-speech tagging, in which each word is annotated with a particular part of speech. We use the standard benchmark dataset from the Penn Treebank. We use Sections 0-18 of the treebank as the training set, Sections 19-21 as the development set, and Sections 22-24 as the test set. We use tag accuracy as the evaluation metric.
Chinese word segmentation for social media text: Word segmentation is a fundamental task for Chinese language processing [Sun, Wang, and Li2012, Xu and Sun2016]. Although current models perform well on formal text, many of them do badly on informal text such as social media text. Our corpus is from the NLPCC2016 shared task. Since we have no access to the test set, we split the training set and extract 10% of the samples as the test set. We use F-score as our evaluation metric.
Embeddings are distributed vectors that represent the semantics of words [Bengio et al.2003, Mikolov et al.2013]. Embeddings have been shown to influence the performance of neural models. In our models, we use randomly initialized word embeddings as well as Senna embeddings [Collobert et al.2011]. Our experiments show that Senna embeddings can slightly improve the performance of our models. We also incorporate feature embeddings as suggested by previous work [Collobert et al.2011]. The features include a window of the last 2 words and the next 2 words, as well as the word suffixes of up to 2 characters. Besides, we make use of part-of-speech tags in NP chunking and shallow parsing. To avoid heavy feature engineering, we do not use other features such as bigrams or trigrams, though they may increase the accuracy as shown in [Pei, Ge, and Chang2014] and [Chen et al.2015]. All these feature embeddings are randomly initialized.
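The window and suffix features described above can be sketched as a simple extraction function. The feature key names, the padding token `PAD`, and the choice to return a dictionary are assumptions for this illustration; each extracted value would then be mapped to its own randomly initialized embedding.

```python
PAD = "<PAD>"  # hypothetical padding token for positions outside the sentence


def token_features(words, i, window=2, max_suffix=2):
    """Collect the window words and suffixes for the word at position i."""
    feats = {}
    for offset in range(-window, window + 1):
        j = i + offset
        feats[f"w[{offset}]"] = words[j] if 0 <= j < len(words) else PAD
    for n in range(1, max_suffix + 1):
        # A word shorter than n characters yields the whole word as its suffix.
        feats[f"suffix{n}"] = words[i][-n:]
    return feats
```

For the sentence `["He", "reckons", "the", "deficit"]` and position 1, the window covers one padding token, "He", "reckons", "the" and "deficit", and the suffix features are "s" and "ns".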
We also try the three methods to learn edge embeddings, and we use the concatenation of the current words' embeddings with the feature embeddings as the edge embedding in our model.
We tune our hyper-parameters on the development sets. Our model is not sensitive to the dimension of the hidden states when it is large enough. To balance accuracy and time cost, we set this dimension to 300 for NP chunking and shallow parsing, and to 200 for POS tagging and Chinese word segmentation. The dimension of the input embeddings is set to 100. The initial learning rate of the AdaGrad algorithm is 0.1, and the regularization parameter is . Dropout has been shown to reduce overfitting in neural models [Srivastava et al.2014], but we find it has limited impact on our models. Besides, we select the probabilistic criteria to train our model for their steady convergence and robust performance.
We choose currently popular neural models as our baselines, including RNN, LSTM, BiLSTM and BiLSTM-CRF. RNN and LSTM are basic recurrent neural models. To further learn bidirectional context information, we also implement BiLSTM for our tasks. We compare our model with these models to show the gain from combining a neural model with the CRF objective. Finally, BiLSTM-CRF is our strong baseline. We compare our model with BiLSTM-CRF to show that learning non-linear edge features is more important than learning non-linear node features alone.
We analyze the performance of our models on the above benchmark tasks. Our baselines include popular neural models. We train each model for 40 passes through the training sets. The performance curves of these models on the test sets are shown in Figure 2. They show that our Edge-based model outperforms the baseline neural models, including RNN, LSTM, BiLSTM and BiLSTM-CRF.
According to Table 3, our models significantly outperform recurrent models without edge information in three tasks. This suggests that a globally normalized objective brings better performance because it can model more structural dependence. Besides, our models also have higher accuracy than models with linear edge features, which shows that modelling non-linear edge features is very important for neural models. Edge-based-2 achieves better results than Edge-based-1 in NP chunking and shallow parsing, so combining non-linear edge features with non-linear node features is helpful in these two tasks.
We also compare our models with some existing systems as shown in Table 4.
NP Chunking: In NP chunking, a popular algorithm is the second-order CRF [Sha and Pereira2003], which achieves a score of 94.30%. McDonald et al. (2005) implemented a multilabel learning algorithm, with a score of 94.29%. Sun et al. (2008) proposed a latent variable CRF model, improving the score to 94.34%. Some other models [Ando and Zhang2005, Shen and Sarkar2005] make use of extra resources, and greatly improve the performance of Support Vector Machines (SVM). To the best of our knowledge, few neural models have been introduced for NP chunking. Our models can outperform all of the above models. We also implement some neural models to compare with our model. LSTM has a score of 94.12%, and BiLSTM is better with 94.36% F-score. As a strong baseline, BiLSTM-CRF outperforms them with 94.97% F-score. Our model also performs better than all these neural models, with 95.25% F-score.
Shallow Parsing: In shallow parsing, Zhang et al. (2002) proposed a generalized Winnow algorithm, which achieves a score of 94.17%. Ando and Zhang (2005) introduced an SVD-based alternating structure optimization algorithm, improving the score to 94.39%. Collobert et al. (2011) first introduced a neural network model to shallow parsing. They combined convolutional neural networks with CRF, and reached 94.32% F-score. Huang et al. (2015) combined BiLSTM with a CRF layer, raising the score to 94.46%. Our Edge-based model beats all of these models, and obtains a state-of-the-art result with a score of 94.80%.
POS tagging: As POS tagging is an important task in natural language processing, there is a large body of work on it. We compare our models with some recent work. Sun (2014) introduced a structure regularization method for CRF, which reached 97.36% accuracy. Collobert et al. (2011) used a Convolution-CRF model, and obtained 97.29%. Andor et al. (2016) proposed a globally normalized transition-based neural model, which made use of feedforward neural networks and achieved 97.44% accuracy. Our Edge-based model outperforms the above models with 97.56% accuracy.
Chinese word segmentation for social media text: Our corpus is recent, so we find no comparable results. Instead, we implement some state-of-the-art models, and compare them with our model. We find that LSTM achieves 90.50% F-score while BiLSTM is slightly better with 90.81%. BiLSTM gains from the CRF objective, and BiLSTM-CRF achieves 91.16% F-score. Our model beats all of these models, with a 91.27% F-score.
We conduct significance tests based on the t-test to show the improvement of our models over the baselines. The significance tests suggest that our Edge-based-1 model has a very significant improvement over baseline, within NP chunking, in shallow parsing and in POS tagging and Chinese word segmentation. The Edge-based-2 model also has high statistical significance, with in all tasks. The significance tests support the theoretical analysis that our models can outperform the baselines in accuracy.
We propose a new recurrent neural CRF model for learning non-linear edge features. Our model is capable of completely encoding non-linear features for CRF. Experiments show that our model outperforms state-of-the-art methods in several structured prediction tasks, including NP chunking, shallow parsing, Chinese word segmentation and POS tagging.
This work was supported in part by National Natural Science Foundation of China (No. 61300063). Xu Sun is the corresponding author of this paper.
- [Ando and Zhang2005] Ando, R. K., and Zhang, T. 2005. A high-performance semi-supervised learning method for text chunking. In ACL 2005.
- [Andor et al.2016] Andor, D.; Alberti, C.; Weiss, D.; Severyn, A.; Presta, A.; Ganchev, K.; Petrov, S.; and Collins, M. 2016. Globally normalized transition-based neural networks. arXiv preprint arXiv:1603.06042.
- [Bengio et al.2003] Bengio, Y.; Ducharme, R.; Vincent, P.; and Janvin, C. 2003. A neural probabilistic language model. Journal of Machine Learning Research 3:1137–1155.
- [Chen and Manning2014] Chen, D., and Manning, C. D. 2014. A fast and accurate dependency parser using neural networks. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 740–750.
- [Chen et al.2015] Chen, X.; Qiu, X.; Zhu, C.; Liu, P.; and Huang, X. 2015. Long short-term memory neural networks for chinese word segmentation. In EMNLP 2015, 1197–1206.
- [Collobert et al.2011] Collobert, R.; Weston, J.; Bottou, L.; Karlen, M.; Kavukcuoglu, K.; and Kuksa, P. P. 2011. Natural language processing (almost) from scratch. Journal of Machine Learning Research 12:2493–2537.
- [Do and Artières2010] Do, T. M. T., and Artières, T. 2010. Neural conditional random fields. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, AISTATS 2010, 177–184.
- [Duchi, Hazan, and Singer2011] Duchi, J. C.; Hazan, E.; and Singer, Y. 2011. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research 12:2121–2159.
- [Durrett and Klein2015] Durrett, G., and Klein, D. 2015. Neural CRF parsing. In ACL 2015, 302–312.
- [Elman1990] Elman, J. L. 1990. Finding structure in time. Cognitive Science 14(2):179–211.
- [Graves and Schmidhuber2005] Graves, A., and Schmidhuber, J. 2005. Framewise phoneme classification with bidirectional LSTM and other neural network architectures. Neural Networks 18(5-6):602–610.
- [Graves, Mohamed, and Hinton2013] Graves, A.; Mohamed, A.; and Hinton, G. E. 2013. Speech recognition with deep recurrent neural networks. In ICASSP 2013, 6645–6649.
- [Hochreiter and Schmidhuber1997] Hochreiter, S., and Schmidhuber, J. 1997. Long short-term memory. Neural Computation 9(8):1735–1780.
- [Huang, Xu, and Yu2015] Huang, Z.; Xu, W.; and Yu, K. 2015. Bidirectional LSTM-CRF models for sequence tagging. arXiv preprint arXiv:1508.01991.
- [Lafferty, McCallum, and Pereira2001] Lafferty, J. D.; McCallum, A.; and Pereira, F. C. N. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the Eighteenth International Conference on Machine Learning (ICML 2001), 282–289.
- [McDonald, Crammer, and Pereira2005] McDonald, R. T.; Crammer, K.; and Pereira, F. 2005. Flexible text segmentation with structured multilabel classification. In HLT/EMNLP 2005.
- [Mikolov et al.2010] Mikolov, T.; Karafiát, M.; Burget, L.; Cernocký, J.; and Khudanpur, S. 2010. Recurrent neural network based language model. In INTERSPEECH 2010, 1045–1048.
- [Mikolov et al.2013] Mikolov, T.; Sutskever, I.; Chen, K.; Corrado, G. S.; and Dean, J. 2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems 26: 27th Annual Conference on Neural Information Processing Systems 2013, 3111–3119.
- [Pei, Ge, and Chang2014] Pei, W.; Ge, T.; and Chang, B. 2014. Max-margin tensor neural network for chinese word segmentation. In ACL 2014, 293–303.
- [Peng, Bo, and Xu2009] Peng, J.; Bo, L.; and Xu, J. 2009. Conditional neural fields. In Advances in Neural Information Processing Systems 22: 23rd Annual Conference on Neural Information Processing Systems 2009, 1419–1427.
- [Sha and Pereira2003] Sha, F., and Pereira, F. C. N. 2003. Shallow parsing with conditional random fields. In HLT-NAACL 2003.
- [Shen and Sarkar2005] Shen, H., and Sarkar, A. 2005. Voting between multiple data representations for text chunking. In Advances in Artificial Intelligence, 18th Conference of the Canadian Society for Computational Studies of Intelligence, 389–400.
- [Shen, Satta, and Joshi2007] Shen, L.; Satta, G.; and Joshi, A. K. 2007. Guided learning for bidirectional sequence classification. In ACL 2007.
- [Socher et al.2013] Socher, R.; Bauer, J.; Manning, C. D.; and Ng, A. Y. 2013. Parsing with compositional vector grammars. In ACL 2013, 455–465.
- [Srivastava et al.2014] Srivastava, N.; Hinton, G. E.; Krizhevsky, A.; Sutskever, I.; and Salakhutdinov, R. 2014. Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1):1929–1958.
- [Sun et al.2008] Sun, X.; Morency, L.; Okanohara, D.; Tsuruoka, Y.; and Tsujii, J. 2008. Modeling latent-dynamic in shallow parsing: A latent conditional model with improved inference. In COLING 2008, 841–848.
- [Sun, Wang, and Li2012] Sun, X.; Wang, H.; and Li, W. 2012. Fast online training with frequency-adaptive learning rates for chinese word segmentation and new word detection. In ACL 2012, 253–262.
- [Sun2014] Sun, X. 2014. Structure regularization for structured prediction. In Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems 2014, 2402–2410.
- [Sun2016] Sun, X. 2016. Asynchronous parallel learning for neural networks and structured models with dense features. In COLING 2016.
- [Taskar et al.2005] Taskar, B.; Chatalbashev, V.; Koller, D.; and Guestrin, C. 2005. Learning structured prediction models: a large margin approach. In (ICML 2005), 896–903.
- [Xu and Sun2016] Xu, J., and Sun, X. 2016. Dependency-based gated recursive neural network for chinese word segmentation. In ACL 2016.
- [Zhang, Damerau, and Johnson2002] Zhang, T.; Damerau, F.; and Johnson, D. 2002. Text chunking based on a generalization of winnow. Journal of Machine Learning Research 2:615–637.
- [Zheng et al.2015] Zheng, S.; Jayasumana, S.; Romera-Paredes, B.; Vineet, V.; Su, Z.; Du, D.; Huang, C.; and Torr, P. H. S. 2015. Conditional random fields as recurrent neural networks. In 2015 IEEE International Conference on Computer Vision, ICCV 2015, 1529–1537.
- [Zhou et al.2015] Zhou, H.; Zhang, Y.; Huang, S.; and Chen, J. 2015. A neural probabilistic structured-prediction model for transition-based dependency parsing. In ACL 2015, 1213–1222.