Introduction
Conditional Random Fields (CRFs) are widely used for structured prediction. A CRF is an undirected graphical model trained to maximize a conditional probability. The undirected graph can be encoded with a set of features (node features and edge features). Usually, these features are sparse and carefully designed by hand.
To reduce the effort spent on feature engineering, neural network models are used to extract features automatically [Chen and Manning 2014; Collobert et al. 2011]. These models learn dense features, which better represent both syntactic and semantic information. Because of the success of CRFs and neural networks, many models take advantage of both. Collobert et al. (2011) used a CRF objective to compute sentence-level probabilities for convolutional neural networks. Durrett and Klein (2015) introduced a neural CRF model that joins sparse and dense features for parsing. Andor et al. (2016) proposed a transition-based neural model with a globally normalized CRF objective, using feedforward neural networks to learn neural features.
The marriage of feedforward neural networks and CRFs is natural because the feedforward network scores local unstructured decisions while the CRF makes global structured decisions. It is harder to combine recurrent neural models with CRFs because both perform structural inference. Huang et al. (2015) provided a solution for combining a recurrent structure with a CRF structure and obtained good performance in sequence labelling. However, their model only encodes node features, while both node features and edge features are important to a CRF.
To completely encode nonlinear features for a CRF, we propose a new recurrent neural CRF model. Our model uses an LSTM to learn edge information from the input words and takes the LSTM output as the CRF energy function. We do not change the internal structure of either the LSTM or the CRF, so the model decodes easily via standard recurrent propagation and CRF dynamic programming inference, without any extra effort. In our model, we use edge embeddings to capture connections inside the input structure. An LSTM is used to learn hidden edge features from the edge embeddings. After that, the CRF globally normalizes the scores of the LSTM output. Andor et al. (2016) showed that a globally normalized CRF objective solves the label bias problem for neural models.
The contributions of our paper are as follows:

We propose a neural model that can learn nonlinear edge features. We find that learning nonlinear edge features is even more important than learning node features, owing to the ability to model nonlinear structural dependence.

We evaluate our model on several well-known sequence labelling tasks, including shallow parsing, NP chunking, POS tagging and Chinese word segmentation. The results show that our model outperforms state-of-the-art methods on these tasks.
Background
In structured prediction, our goal is to predict a structure y = (y_1, ..., y_T) given the observations x = (x_1, ..., x_T). The label at position t in the structure is denoted y_t, and the corresponding observation is x_t.
CRF [Lafferty, McCallum, and Pereira 2001] is a popular and effective algorithm for structured prediction. It defines a log-linear conditional probability with respect to energy functions over local cliques and transition cliques:

P(y \mid x) = \frac{1}{Z(x)} \exp\Big( \sum_{t=1}^{T} \Phi(x, y_t) + \sum_{t=2}^{T} \Psi(y_{t-1}, y_t) \Big)    (1)

where \Phi(x, y_t) is the energy function over the local clique at position t, \Psi(y_{t-1}, y_t) is the energy function over the transition clique, and Z(x) is the partition function that sums over all possible label sequences.

Energy functions are used to learn features. Since the conventional CRF is a log-linear model, both the local clique and the transition clique have linear energy functions:

\Psi(y_{t-1}, y_t) = \sum_{k} \lambda_k f_k(y_{t-1}, y_t)    (2)

\Phi(x, y_t) = \sum_{k} \mu_k s_k(x, y_t)    (3)

where f_k is the indicator function of the k-th feature for the transition clique (y_{t-1}, y_t), s_k is the indicator function of the k-th feature for the local clique y_t, and \lambda_k and \mu_k are the parameters of the CRF.
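To make the linear energy functions concrete, here is a minimal numpy sketch that scores one tag sequence under a conventional linear-chain CRF; the sparse feature indices and weight values are hypothetical placeholders, not the features used in the paper.

```python
import numpy as np

def crf_score(node_feats, y, mu, lam):
    """Unnormalized log-score of tag sequence y under a linear-chain CRF.

    node_feats[t] : indices of the sparse features active at position t
    y[t]          : tag index at position t
    mu            : (num_features, num_tags) local-clique weights
    lam           : (num_tags, num_tags) transition-clique weights
    """
    score = 0.0
    for t, feats in enumerate(node_feats):
        score += mu[feats, y[t]].sum()      # Phi(x, y_t): linear node energy
        if t > 0:
            score += lam[y[t - 1], y[t]]    # Psi(y_{t-1}, y_t): linear edge energy
    return score

# Toy usage with 3 tags and 5 hypothetical sparse features.
rng = np.random.default_rng(0)
mu, lam = rng.normal(size=(5, 3)), rng.normal(size=(3, 3))
print(crf_score(node_feats=[[0, 2], [1], [3, 4]], y=[0, 2, 1], mu=mu, lam=lam))
```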
Therefore, the conventional CRF can only learn linear features. To learn higher-order features, an LSTM is combined with the CRF model [Huang, Xu, and Yu 2015]. At each time step, the LSTM takes a word as input and outputs a score for each predicted label. The output function of the LSTM can be used as the energy function over local cliques:
\Phi(x, y_t = j) = [f_\theta(h_t)]_j    (4)

f_\theta(h_t) = W_o h_t + b_o    (5)

where h_t is the hidden state of the LSTM at the t-th time step and [f_\theta(h_t)]_j is the j-th element of the vector f_\theta(h_t). As for transition cliques, the energy function is a transition matrix A of variables for jumping from tag to tag:

\Psi(y_{t-1} = i, y_t = j) = A_{ij}    (6)

where A_{ij} is the transition score for jumping from tag i to tag j. Hence the energy function over transition cliques is linear, as in the conventional CRF. Therefore, LSTM-CRF learns nonlinear node features (over local cliques) but only linear edge features (over transition cliques).
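A minimal PyTorch sketch of how LSTM-CRF combines the two kinds of energies: the LSTM output supplies nonlinear node energies, while a single learned matrix A supplies the linear, input-independent edge energies. The layer names and sizes are illustrative assumptions, not the configuration of Huang, Xu, and Yu (2015).

```python
import torch
import torch.nn as nn

class LSTMCRFEnergies(nn.Module):
    """Node energies from an LSTM, edge energies from one transition matrix A."""

    def __init__(self, vocab_size, emb_dim, hidden_dim, num_tags):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, num_tags)               # output function over hidden states
        self.A = nn.Parameter(torch.zeros(num_tags, num_tags))   # learned transition matrix

    def forward(self, words):                 # words: (batch, T) word indices
        h, _ = self.lstm(self.emb(words))     # (batch, T, hidden_dim)
        node_energy = self.out(h)             # Phi: (batch, T, num_tags), nonlinear in the input
        edge_energy = self.A                  # Psi: (num_tags, num_tags), input-independent
        return node_energy, edge_energy

model = LSTMCRFEnergies(vocab_size=100, emb_dim=32, hidden_dim=64, num_tags=5)
phi, psi = model(torch.randint(0, 100, (2, 7)))
print(phi.shape, psi.shape)   # torch.Size([2, 7, 5]) torch.Size([5, 5])
```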
To capture more context information, the LSTM layer can be replaced with a bidirectional LSTM (BiLSTM) layer. A BiLSTM encodes both forward and backward information, so BiLSTM-CRF performs better than LSTM-CRF.










[Table: neural CRF models grouped by linear vs. nonlinear edge features; the edge-based models proposed in this work fall under nonlinear edge features.]
Proposal
The current LSTM-CRF only learns linear edge features, since it has a linear energy function over transition cliques. Do and Artières (2010) show that nonlinear energy functions perform better at extracting features for structured prediction. To obtain a nonlinear energy function, we propose a new recurrent neural CRF that uses an LSTM as the energy function over transition cliques. Therefore, our model is able to learn nonlinear edge features.
Edge Embedding
To learn nonlinear edge features, we use edge embeddings to provide raw edge information. In natural language processing, the input structure is usually a sequence of words, so the edges of the input structure are the connections between neighboring words. We have three methods to produce edge embeddings from the input structure (a sketch of all three follows below).
Bigram: Bigram embedding is an intuitive way to capture features of neighboring words. We can build a bigram dictionary and assign a vector to each key. It has proven effective in several models [Pei, Ge, and Chang 2014; Chen et al. 2015], but it may suffer from sparsity and slow training.
Concatenation: Concatenation is a simple way to join the information of two words, and it is widely used in previous work [Collobert et al. 2011; Huang, Xu, and Yu 2015].
Feedforward layer: A feedforward layer is another method to learn information from the input words. It takes two word embeddings as input and outputs an edge embedding after a single neural network layer.
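The three constructions can be sketched as follows; the dimensions, the flattened bigram indexing, and the tanh feedforward layer are illustrative assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn

V, D = 100, 32                      # hypothetical vocabulary size and embedding dimension
word_emb = nn.Embedding(V, D)

# 1) Bigram: one vector per (w_{t-1}, w_t) pair, indexed by a flattened pair id.
bigram_emb = nn.Embedding(V * V, D)          # large table; can be sparse and slow to train
def bigram_edge(prev_ids, cur_ids):
    return bigram_emb(prev_ids * V + cur_ids)

# 2) Concatenation: simply join the two word vectors.
def concat_edge(prev_ids, cur_ids):
    return torch.cat([word_emb(prev_ids), word_emb(cur_ids)], dim=-1)   # dimension 2 * D

# 3) Feedforward layer: a single nonlinear layer on top of the concatenation.
ff = nn.Sequential(nn.Linear(2 * D, D), nn.Tanh())
def ff_edge(prev_ids, cur_ids):
    return ff(concat_edge(prev_ids, cur_ids))

prev, cur = torch.tensor([3, 7]), torch.tensor([4, 8])
print(bigram_edge(prev, cur).shape, concat_edge(prev, cur).shape, ff_edge(prev, cur).shape)
```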
Layers
Figure 1 shows our proposed recurrent neural CRF model. Our model contains three layers: an input layer, an LSTM layer and a CRF layer.
Input Layer: The input layer reads the words and provides edge embeddings for the LSTM layer. The edge embedding is the concatenation of neighboring word vectors, and it provides raw, primary edge features.
LSTM Layer: The LSTM layer recurrently takes the edge embeddings from the input layer and computes the output, which serves as the energy function over transition cliques for the CRF layer. Our LSTM layer does not normalize the energy output (with a softmax function); normalization only happens in the CRF layer. Thus, our model is globally normalized, which solves the label bias problem [Andor et al. 2016].
CRF layer: The CRF layer predicts the output structure given the energy function from the LSTM layer. Since we do not change the internal structure of the CRF, the Viterbi algorithm is still suitable for efficiently finding the structure with the highest conditional probability.
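Because the CRF structure is unchanged, standard Viterbi decoding applies directly; the only difference from LSTM-CRF is that the transition energies may vary by position, since they come from the edge LSTM. A minimal numpy sketch, assuming the energy tensors have already been computed:

```python
import numpy as np

def viterbi(node_energy, edge_energy):
    """node_energy: (T, K) local scores; edge_energy: (T, K, K) per-position transition
    scores, where edge_energy[t, i, j] scores the jump i -> j into position t
    (the entry at t = 0 is unused). Returns the highest-scoring tag sequence."""
    T, K = node_energy.shape
    delta = node_energy[0].copy()            # best score of any prefix ending in each tag
    back = np.zeros((T, K), dtype=int)
    for t in range(1, T):
        cand = delta[:, None] + edge_energy[t] + node_energy[t][None, :]
        back[t] = cand.argmax(axis=0)        # best predecessor tag for each current tag
        delta = cand.max(axis=0)
    tags = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):            # follow back-pointers
        tags.append(int(back[t, tags[-1]]))
    return tags[::-1]

rng = np.random.default_rng(1)
print(viterbi(rng.normal(size=(6, 4)), rng.normal(size=(6, 4, 4))))
```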
Objective function
In our model, the objective function is similar to the CRF objective, allowing gradients to be computed via dynamic programming. To learn nonlinear features, we replace the linear transition energy function with the LSTM output function:

P(y \mid x) = \frac{1}{Z(x)} \exp\Big( \sum_{t=1}^{T} \Phi(x, y_t) + \sum_{t=2}^{T} \Psi(x, y_{t-1}, y_t) \Big)    (7)

\Psi(x, y_{t-1} = i, y_t = j) = [g_\theta(h^e_t)]_{ij}    (8)

where h^e_t is the hidden state of the LSTM over the edge embedding at position t, and g_\theta(h^e_t) is the LSTM energy output, which contains hidden edge information.
The objective has a nonlinear transition energy function, which neither the conventional CRF nor LSTM-CRF has. The local energy function can be either linear or nonlinear. We call the model with a linear local energy function Edge-based1 and the model with a nonlinear local energy function Edge-based2.
Edge-based1: The Edge-based1 model has a linear local energy function, which captures only the simplest linear node features. We use it to stress the importance of learning nonlinear edge features: our experiments show that a model learning only nonlinear edge features outperforms a model learning only nonlinear node features. The local energy function in Edge-based1 is:

\Phi(x, y_t = j) = [f(\mathbf{x}_t)]_j    (9)

f(\mathbf{x}_t) = W \mathbf{x}_t + b    (10)

where \mathbf{x}_t is the input representation of the word at position t; since f is linear, Edge-based1 uses only linear node features.
Edge-based2: The Edge-based2 model has a nonlinear local energy function. It is proposed to show the effect of combining nonlinear node features with nonlinear edge features. The local energy function in Edge-based2 is:

\Phi(x, y_t = j) = [f_\theta(h^n_t)]_j    (11)

f_\theta(h^n_t) = W_n h^n_t + b_n    (12)

where h^n_t is computed by another LSTM and contains hidden node information.
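A sketch of how the two variants differ, under assumed layer sizes: Edge-based1 keeps a simple linear projection of the input embedding as the local energy, while Edge-based2 obtains it from another LSTM; both share the edge LSTM, whose output is projected to a K x K matrix per position as a stand-in for the edge energy g.

```python
import torch
import torch.nn as nn

class EdgeBasedEnergies(nn.Module):
    def __init__(self, vocab_size, emb_dim, hidden_dim, num_tags, nonlinear_node):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        # Edge LSTM runs over concatenated neighboring word embeddings (edge embeddings)
        # and yields one K x K transition-energy matrix per position.
        self.edge_lstm = nn.LSTM(2 * emb_dim, hidden_dim, batch_first=True)
        self.edge_out = nn.Linear(hidden_dim, num_tags * num_tags)
        self.nonlinear_node = nonlinear_node
        if nonlinear_node:   # Edge-based2: node energies from another LSTM
            self.node_lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
            self.node_out = nn.Linear(hidden_dim, num_tags)
        else:                # Edge-based1: linear node energies from the input embedding
            self.node_out = nn.Linear(emb_dim, num_tags)
        self.num_tags = num_tags

    def forward(self, words):                          # words: (batch, T)
        e = self.emb(words)                            # (batch, T, emb_dim)
        # Pair (w_{t-1}, w_t); roll wraps at t = 0, where a padding vector would normally be used.
        edge_in = torch.cat([torch.roll(e, 1, dims=1), e], dim=-1)
        h_e, _ = self.edge_lstm(edge_in)
        edge_energy = self.edge_out(h_e).view(*words.shape, self.num_tags, self.num_tags)
        if self.nonlinear_node:
            h_n, _ = self.node_lstm(e)
            node_energy = self.node_out(h_n)
        else:
            node_energy = self.node_out(e)
        return node_energy, edge_energy

m = EdgeBasedEnergies(100, 32, 64, 5, nonlinear_node=True)   # Edge-based2
phi, psi = m(torch.randint(0, 100, (2, 7)))
print(phi.shape, psi.shape)   # (2, 7, 5) and (2, 7, 5, 5)
```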
Table 1 shows the different objective functions of these recurrent neural CRF models.
Training
We have two kinds of criteria to train our models: probabilistic criteria and large-margin criteria [Do and Artières 2010].
Probabilistic Criteria: The probabilistic criterion was first proposed by Lafferty, McCallum, and Pereira (2001). The regularized objective function of the recurrent neural CRF can be described as:

L(\theta) = \sum_{i=1}^{N} \ell(x^{(i)}, y^{(i)}; \theta) + \frac{\lambda}{2} \|\theta\|^2    (13)

where N is the number of samples in the corpus and \ell is the per-sample loss. We denote the unnormalized score of a sample for the Edge-Node recurrent neural CRF as:

s(x, y) = \sum_{t} [f_\theta(h^n_t)]_{y_t} + \sum_{t} [g_\theta(h^e_t)]_{y_{t-1} y_t}    (14)

and this score for the Edge-based model is:

s(x, y) = \sum_{t} [f(\mathbf{x}_t)]_{y_t} + \sum_{t} [g_\theta(h^e_t)]_{y_{t-1} y_t}    (15)

Then \ell in Equation 13 can be written as the negative log-likelihood:

\ell(x^{(i)}, y^{(i)}) = \log \sum_{y'} \exp\big( s(x^{(i)}, y') \big) - s(x^{(i)}, y^{(i)})    (16)
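A numpy sketch of the probabilistic criterion for a single sentence: the log-partition term is computed with the forward algorithm over the position-dependent transition energies (regularization omitted).

```python
import numpy as np
from scipy.special import logsumexp

def neg_log_likelihood(node_energy, edge_energy, tags):
    """node_energy: (T, K), edge_energy: (T, K, K), tags: gold tag sequence."""
    T, K = node_energy.shape
    # Unnormalized score s(x, y) of the gold sequence.
    gold = node_energy[0, tags[0]]
    for t in range(1, T):
        gold += edge_energy[t, tags[t - 1], tags[t]] + node_energy[t, tags[t]]
    # log Z(x) via the forward algorithm.
    alpha = node_energy[0].copy()
    for t in range(1, T):
        alpha = logsumexp(alpha[:, None] + edge_energy[t], axis=0) + node_energy[t]
    return logsumexp(alpha) - gold

rng = np.random.default_rng(2)
print(neg_log_likelihood(rng.normal(size=(6, 4)), rng.normal(size=(6, 4, 4)), [0, 1, 2, 3, 0, 1]))
```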
Large Margin Criteria: The large-margin criterion was first introduced by Taskar et al. (2005). Under the large-margin criterion, the margin between the score of the correct tag sequence and that of an incorrect sequence should be larger than a given margin:

s(x, y) \geq s(x, \hat{y}) + \Delta(y, \hat{y}) \quad \text{for all } \hat{y} \neq y    (17)

where \Delta(y, \hat{y}) is the number of incorrect tags in \hat{y}. So the loss \ell in the objective function is:

\ell(x^{(i)}, y^{(i)}) = \max_{\hat{y}} \big( s(x^{(i)}, \hat{y}) + \Delta(y^{(i)}, \hat{y}) \big) - s(x^{(i)}, y^{(i)})    (18)
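The large-margin criterion can be sketched the same way: add the Hamming cost to the node energies, run the max-only Viterbi recursion on the cost-augmented scores, and take the hinge against the gold score.

```python
import numpy as np

def margin_loss(node_energy, edge_energy, gold):
    """Structured hinge loss with Hamming cost Delta(y, y_hat)."""
    T, K = node_energy.shape
    # Cost-augmented node energies: +1 for every tag that differs from the gold one.
    aug = node_energy + 1.0
    aug[np.arange(T), gold] -= 1.0

    def best_score(node, edge):               # max over all sequences (Viterbi, scores only)
        delta = node[0].copy()
        for t in range(1, T):
            delta = (delta[:, None] + edge[t] + node[t][None, :]).max(axis=0)
        return delta.max()

    def seq_score(node, edge, y):              # score of one fixed sequence
        s = node[0, y[0]]
        for t in range(1, T):
            s += edge[t, y[t - 1], y[t]] + node[t, y[t]]
        return s

    return max(0.0, best_score(aug, edge_energy) - seq_score(node_energy, edge_energy, gold))

rng = np.random.default_rng(3)
print(margin_loss(rng.normal(size=(6, 4)), rng.normal(size=(6, 4, 4)), [0, 1, 2, 3, 0, 1]))
```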
Optimization
To minimize the objective function, we use AdaGrad [Duchi, Hazan, and Singer 2011], a widely used optimization algorithm. The parameter update at the \tau-th step can be calculated as:

\theta_{\tau+1} = \theta_{\tau} - \frac{\alpha}{\sqrt{\sum_{\tau'=1}^{\tau} g_{\tau'}^2}} \, g_{\tau}    (19)

where \alpha is the initial learning rate and g_{\tau} is the gradient of the parameter at the \tau-th update.
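The update in Equation 19 written out for one parameter vector; the small epsilon added for numerical stability is a common implementation detail, not part of the equation.

```python
import numpy as np

def adagrad_update(theta, grad, accum, lr=0.1, eps=1e-8):
    """One AdaGrad step: scale the learning rate by the root of the accumulated squared gradients."""
    accum += grad ** 2
    theta -= lr * grad / (np.sqrt(accum) + eps)
    return theta, accum

theta, accum = np.zeros(3), np.zeros(3)
for g in [np.array([0.5, -1.0, 0.2]), np.array([0.1, 0.3, -0.4])]:
    theta, accum = adagrad_update(theta, g, accum)
print(theta)
```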
Related Work
Recently, neural network models have been widely used in natural language processing [Bengio et al. 2003; Mikolov et al. 2010; Socher et al. 2013; Chen et al. 2015; Sun 2016]. Among various neural models, recurrent neural networks [Elman 1990] have proven to perform well in sequence labelling tasks. LSTM [Hochreiter and Schmidhuber 1997; Graves and Schmidhuber 2005] improves on RNNs by addressing the vanishing and exploding gradient problems. Later, bidirectional recurrent models [Graves, Mohamed, and Hinton 2013] were proposed to capture backward information.
The CRF model [Lafferty, McCallum, and Pereira 2001] has achieved much success in natural language processing. Many models try to combine the CRF with neural networks for more structural dependence. Peng, Bo, and Xu (2009) introduced conditional neural fields. Collobert et al. (2011) first implemented convolutional neural networks with a CRF objective. Zheng et al. (2015) integrated a CRF with an RNN. Durrett and Klein (2015) used feedforward neural networks with a CRF for parsing. Huang et al. (2015) used recurrent neural networks to learn nonlinear node features, and showed that BiLSTM-CRF is more robust than neural models without a CRF. Do and Artières (2010) suggested feedforward neural networks to learn neural features. Zhou et al. (2015) proposed a transition-based neural model with a CRF for parsing. Finally, Andor et al. (2016) proved that a globally normalized CRF objective helps deal with the label bias problem in neural models.
Compared with these neural CRF models, our recurrent neural CRF has a recurrent structure with the ability to learn nonlinear edge features. The recurrent structure helps capture long-distance information, and nonlinear edge features provide more nonlinear structural dependence. Table 2 shows how our proposed recurrent neural CRF model relates to existing neural CRF models.
Models | Edge features | NP chunking (P / R / F1) | Shallow parsing (P / R / F1) | POS tagging (Acc) | Word Seg (F1)
LSTM | - | 94.00 / 94.25 / 94.12 | 92.93 / 93.24 / 93.09 | 97.28 | 90.50
BiLSTM | - | 94.24 / 94.48 / 94.36 | 93.57 / 93.71 / 93.64 | 97.36 | 90.81
BiLSTM-CRF | linear | 94.89 / 95.05 / 94.97 | 94.33 / 94.26 / 94.29 | 97.38 | 91.16
Conv-CRF (Collobert et al. 2011) | linear | - | - / - / 94.32 | 97.29 | -
LSTM-CRF (Huang et al. 2015) | linear | - | - / - / 94.46 | 97.55 | -
Our Edge-based1 | nonlinear | 94.86 / 95.46 / 95.16 | 94.44 / 94.52 / 94.48 | 97.56 | 91.24
Our Edge-based2 | nonlinear | 94.98 / 95.52 / 95.25 | 94.75 / 94.85 / 94.80 | 97.52 | 91.27
NP chunking | F1 | Shallow parsing | F1 | POS tagging | Acc
Sha and Pereira (2003) | 94.30 | Zhang et al. (2002) | 94.17 | Collobert et al. (2011) | 97.29
Ando and Zhang (2005) | 94.70 | Ando and Zhang (2005) | 94.39 | Sun (2014) | 97.36
Shen and Sarkar (2005) | 95.23 | Shen and Sarkar (2005) | 94.01 | Huang et al. (2015) | 97.55
McDonald et al. (2005) | 94.29 | Collobert et al. (2011) | 94.32 | Andor et al. (2016) | 97.44
Sun et al. (2008) | 94.34 | Huang et al. (2015) | 94.46 | Shen et al. (2007) | 97.33
Our Edge-based1 | 95.16 | Our Edge-based1 | 94.48 | Our Edge-based1 | 97.56
Our Edge-based2 | 95.25 | Our Edge-based2 | 94.80 | Our Edge-based2 | 97.52
Experiments
We perform experiments to analyze our proposed models. We choose well-known sequence labelling tasks, including NP chunking, shallow parsing, POS tagging and Chinese word segmentation, as our benchmark so that our experimental results are comparable. We compare our model with other popular neural models and analyze the effect of nonlinear edge features.
Tasks
We introduce our benchmark tasks as follows:
NP Chunking: NP chunking is short for noun phrase chunking, in which the non-recursive cores of noun phrases, called base NPs, are identified. Our datasets are from the CoNLL-2000 shallow-parsing shared task, which consists of 8,936 sentences in the training set and 2,012 sentences in the test set. We further split the training set and extract 90% of the sentences as the development set. Following previous work, we label the sentences in the BIO2 format with 3 tags (B-NP, I-NP, O). Our evaluation metric is the F-score.
Shallow Parsing: Shallow parsing is a task similar to NP chunking, but it needs to identify all chunk types (VP, PP, DT, etc.). The dataset is also from CoNLL-2000, and it contains 23 tags. We use the F-score as the evaluation metric.
POS tagging: POS tagging is short for part-of-speech tagging, in which each word is annotated with a particular part of speech. We use the standard benchmark dataset from the Penn Treebank: Sections 0-18 as the training set, Sections 19-21 as the development set, and Sections 22-24 as the test set. We use tag accuracy as the evaluation metric.
Chinese word segmentation for social media text: Word segmentation is a fundamental task for Chinese language processing [Sun, Wang, and Li 2012; Xu and Sun 2016]. Although current models perform well on formal text, many of them do badly on informal text such as social media text. Our corpus is from the NLPCC-2016 shared task. Since we have no access to the test set, we split the training set and extract 10% of the samples as the test set. We use the F-score as our evaluation metric.
Embeddings
Embeddings are distributed vectors that represent the semantics of words [Bengio et al. 2003; Mikolov et al. 2013]. It has been shown that embeddings can influence the performance of neural models. In our models, we use randomly initialized word embeddings as well as Senna embeddings [Collobert et al. 2011]. Our experiments show that Senna embeddings slightly improve the performance of our models. We also incorporate feature embeddings as suggested by previous work [Collobert et al. 2011]. The features include a window of the last 2 words and the next 2 words, as well as word suffixes of up to 2 characters. Besides, we make use of part-of-speech tags in NP chunking and shallow parsing. To alleviate heavy feature engineering, we do not use other features such as bigrams or trigrams, although they may increase accuracy, as shown by Pei, Ge, and Chang (2014) and Chen et al. (2015). All these feature embeddings are randomly initialized.
We also try the three methods described above to learn edge embeddings; in our model, we concatenate the current word embeddings with the feature embeddings as the edge embedding.
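As an illustration of the feature set described above (word window of +/-2, suffixes of up to 2 characters, and optional POS tags), a small helper might look like the following; the exact feature templates used in the paper may differ, and each feature string would be mapped to a randomly initialized feature embedding.

```python
def token_features(words, t, pos_tags=None):
    """Sparse feature strings for position t: +/-2 word window and 1-2 character suffixes."""
    feats = []
    for off in (-2, -1, 0, 1, 2):
        i = t + off
        w = words[i] if 0 <= i < len(words) else "<PAD>"
        feats.append(f"w[{off}]={w}")
    feats.append(f"suf1={words[t][-1:]}")
    feats.append(f"suf2={words[t][-2:]}")
    if pos_tags is not None:                      # used for NP chunking / shallow parsing
        feats.append(f"pos={pos_tags[t]}")
    return feats

print(token_features("the quick brown fox".split(), 2, pos_tags=["DT", "JJ", "JJ", "NN"]))
```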
Settings
We tune our hyperparameters on the development sets. Our model is not sensitive to the dimension of the hidden states as long as it is large enough. To balance accuracy and time cost, we set this number to 300 for NP chunking and shallow parsing, and to 200 for POS tagging and Chinese word segmentation. The dimension of the input embeddings is set to 100. The initial learning rate of the AdaGrad algorithm is 0.1, and the regularization parameter is . Dropout has been shown to reduce overfitting in neural models [Srivastava et al. 2014], but we find it has limited impact on our models. Besides, we select the probabilistic criterion to train our models because of its steady convergence and robust performance.
Baselines
We choose currently popular neural models as our baselines, including RNN, LSTM, BiLSTM and BiLSTM-CRF. RNN and LSTM are basic recurrent neural models. To further learn bidirectional context information, we also implement BiLSTM for our tasks. We compare our model with these models to show the gain from combining a neural model with the CRF objective. Finally, BiLSTM-CRF is our strong baseline. We compare our model with BiLSTM-CRF to show that learning nonlinear edge features is more important than learning nonlinear node features alone.
Results
We analyze the performance of our models on the benchmark tasks above. Our baselines include popular neural models. We train each model for 40 passes through the training sets. The performance curves of these models on the test sets are shown in Figure 2. They show that our edge-based models outperform the baseline neural models, including RNN, LSTM, BiLSTM and BiLSTM-CRF.
According to Table 3, our models significantly outperform recurrent models without edge information on three tasks. This suggests that a globally normalized objective brings better performance because it can model more structural dependence. Besides, our models also achieve higher accuracy than models with linear edge features, which shows that modelling nonlinear edge features is very important for neural models. Edge-based2 achieves better results than Edge-based1 in NP chunking and shallow parsing, so combining nonlinear edge features with nonlinear node features is helpful in these two tasks.
We also compare our models with some existing systems as shown in Table 4.
NP Chunking: In NP chunking, a popular algorithm is the second-order CRF [Sha and Pereira 2003], which achieves a score of 94.30%. McDonald et al. (2005) implemented a multilabel learning algorithm, with a score of 94.29%. Sun et al. (2008) proposed a latent variable CRF model, improving the score to 94.34%. Some other models [Ando and Zhang 2005; Shen and Sarkar 2005] make use of extra resources and greatly improve the performance of Support Vector Machines (SVMs). To the best of our knowledge, few neural models have been applied to NP chunking. Our models outperform all of the above models. We also implement some neural models to compare with our model: LSTM has a score of 94.12%, and BiLSTM is better with a 94.36% F-score. As a strong baseline, BiLSTM-CRF outperforms them with a 94.97% F-score. Our model performs better than all these neural models, with a 95.25% F-score.
Shallow Parsing: In shallow parsing, Zhang et al. (2002) proposed a generalized Winnow algorithm which achieves a score of 94.17%. Ando and Zhang (2005) introduced an SVD-based alternating structure optimization algorithm, improving the score to 94.39%. Collobert et al. (2011) first introduced a neural network model to shallow parsing; they combined convolutional neural networks with a CRF and reached a 94.32% F-score. Huang et al. (2015) combined BiLSTM with a CRF layer, raising the score to 94.46%. Our edge-based model beats all of these models and obtains a state-of-the-art result with a score of 94.80%.
POS tagging: As POS tagging is an important task in natural language processing, there is a large body of work on it. We compare our models with some recent work. Sun (2014) introduced a structure regularization method for CRF, which reached 97.36% accuracy. Collobert et al. (2011) used a Convolution-CRF model and obtained 97.29%. Andor et al. (2016) proposed a globally normalized transition-based neural model, which makes use of feedforward neural networks and achieved 97.44% accuracy. Our edge-based model outperforms the above models with 97.56% accuracy.
Chinese word segmentation for social media text: Our corpus is recent, so we did not find directly comparable results. Instead, we implement some state-of-the-art models and compare them with our model. We find that LSTM achieves a 90.50% F-score, while BiLSTM is slightly better with 90.81%. BiLSTM gains from the CRF objective and achieves a 91.16% F-score. Our model beats all of these models, with a 91.27% F-score.
Significance Tests
We conduct significance tests based on the t-test to show the improvement of our models over the baselines. The significance tests suggest that our Edge-based1 model achieves a very significant improvement over the baselines in NP chunking, shallow parsing, POS tagging and Chinese word segmentation. The Edge-based2 model is also highly statistically significant in all tasks. The significance tests support the theoretical analysis that our models outperform the baselines in accuracy.
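The test can be sketched with a paired t-test over per-sentence scores of two systems; treating sentence-level scores as paired samples is an assumption here, since the paper does not describe the exact test setup.

```python
from scipy import stats

# Hypothetical per-sentence scores of the baseline and of Edge-based1 on the same test sentences.
baseline = [0.92, 0.88, 0.95, 0.90, 0.93, 0.91, 0.89, 0.94]
edge_based = [0.94, 0.90, 0.95, 0.93, 0.94, 0.93, 0.90, 0.96]

t_stat, p_value = stats.ttest_rel(edge_based, baseline)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
```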
Conclusions
We propose a new recurrent neural CRF model for learning nonlinear edge features. Our model can completely encode nonlinear features for the CRF. Experiments show that our model outperforms state-of-the-art methods on several structured prediction tasks, including NP chunking, shallow parsing, Chinese word segmentation and POS tagging.
Acknowledgements
This work was supported in part by National Natural Science Foundation of China (No. 61300063). Xu Sun is the corresponding author of this paper.
References

[Ando and Zhang 2005] Ando, R. K., and Zhang, T. 2005. A high-performance semi-supervised learning method for text chunking. In ACL 2005.
[Andor et al. 2016] Andor, D.; Alberti, C.; Weiss, D.; Severyn, A.; Presta, A.; Ganchev, K.; Petrov, S.; and Collins, M. 2016. Globally normalized transition-based neural networks. arXiv preprint arXiv:1603.06042.
[Bengio et al. 2003] Bengio, Y.; Ducharme, R.; Vincent, P.; and Janvin, C. 2003. A neural probabilistic language model. Journal of Machine Learning Research 3:1137-1155.
[Chen and Manning 2014] Chen, D., and Manning, C. D. 2014. A fast and accurate dependency parser using neural networks. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 740-750.
[Chen et al. 2015] Chen, X.; Qiu, X.; Zhu, C.; Liu, P.; and Huang, X. 2015. Long short-term memory neural networks for Chinese word segmentation. In EMNLP 2015, 1197-1206.
[Collobert et al. 2011] Collobert, R.; Weston, J.; Bottou, L.; Karlen, M.; Kavukcuoglu, K.; and Kuksa, P. P. 2011. Natural language processing (almost) from scratch. Journal of Machine Learning Research 12:2493-2537.
[Do and Artières 2010] Do, T. M. T., and Artières, T. 2010. Neural conditional random fields. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, AISTATS 2010, 177-184.
[Duchi, Hazan, and Singer 2011] Duchi, J. C.; Hazan, E.; and Singer, Y. 2011. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research 12:2121-2159.
[Durrett and Klein 2015] Durrett, G., and Klein, D. 2015. Neural CRF parsing. In ACL 2015, 302-312.
[Elman 1990] Elman, J. L. 1990. Finding structure in time. Cognitive Science 14(2):179-211.
[Graves and Schmidhuber 2005] Graves, A., and Schmidhuber, J. 2005. Framewise phoneme classification with bidirectional LSTM and other neural network architectures. Neural Networks 18(5-6):602-610.
[Graves, Mohamed, and Hinton 2013] Graves, A.; Mohamed, A.; and Hinton, G. E. 2013. Speech recognition with deep recurrent neural networks. In ICASSP 2013, 6645-6649.
[Hochreiter and Schmidhuber 1997] Hochreiter, S., and Schmidhuber, J. 1997. Long short-term memory. Neural Computation 9(8):1735-1780.
[Huang, Xu, and Yu 2015] Huang, Z.; Xu, W.; and Yu, K. 2015. Bidirectional LSTM-CRF models for sequence tagging. arXiv preprint arXiv:1508.01991.
[Lafferty, McCallum, and Pereira 2001] Lafferty, J. D.; McCallum, A.; and Pereira, F. C. N. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the Eighteenth International Conference on Machine Learning (ICML 2001), 282-289.
[McDonald, Crammer, and Pereira 2005] McDonald, R. T.; Crammer, K.; and Pereira, F. 2005. Flexible text segmentation with structured multilabel classification. In HLT/EMNLP 2005.
[Mikolov et al. 2010] Mikolov, T.; Karafiát, M.; Burget, L.; Cernocký, J.; and Khudanpur, S. 2010. Recurrent neural network based language model. In INTERSPEECH 2010, 1045-1048.
[Mikolov et al. 2013] Mikolov, T.; Sutskever, I.; Chen, K.; Corrado, G. S.; and Dean, J. 2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems 26, 3111-3119.
[Pei, Ge, and Chang 2014] Pei, W.; Ge, T.; and Chang, B. 2014. Max-margin tensor neural network for Chinese word segmentation. In ACL 2014, 293-303.
[Peng, Bo, and Xu 2009] Peng, J.; Bo, L.; and Xu, J. 2009. Conditional neural fields. In Advances in Neural Information Processing Systems 22, 1419-1427.
[Sha and Pereira 2003] Sha, F., and Pereira, F. C. N. 2003. Shallow parsing with conditional random fields. In HLT-NAACL 2003.
[Shen and Sarkar 2005] Shen, H., and Sarkar, A. 2005. Voting between multiple data representations for text chunking. In Advances in Artificial Intelligence, 18th Conference of the Canadian Society for Computational Studies of Intelligence, 389-400.
[Shen, Satta, and Joshi 2007] Shen, L.; Satta, G.; and Joshi, A. K. 2007. Guided learning for bidirectional sequence classification. In ACL 2007.
[Socher et al. 2013] Socher, R.; Bauer, J.; Manning, C. D.; and Ng, A. Y. 2013. Parsing with compositional vector grammars. In ACL 2013, 455-465.
[Srivastava et al. 2014] Srivastava, N.; Hinton, G. E.; Krizhevsky, A.; Sutskever, I.; and Salakhutdinov, R. 2014. Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1):1929-1958.
[Sun et al. 2008] Sun, X.; Morency, L.; Okanohara, D.; Tsuruoka, Y.; and Tsujii, J. 2008. Modeling latent-dynamic in shallow parsing: A latent conditional model with improved inference. In COLING 2008, 841-848.
[Sun, Wang, and Li 2012] Sun, X.; Wang, H.; and Li, W. 2012. Fast online training with frequency-adaptive learning rates for Chinese word segmentation and new word detection. In ACL 2012, 253-262.
[Sun 2014] Sun, X. 2014. Structure regularization for structured prediction. In Advances in Neural Information Processing Systems 27, 2402-2410.
[Sun 2016] Sun, X. 2016. Asynchronous parallel learning for neural networks and structured models with dense features. In COLING 2016.
[Taskar et al. 2005] Taskar, B.; Chatalbashev, V.; Koller, D.; and Guestrin, C. 2005. Learning structured prediction models: a large margin approach. In ICML 2005, 896-903.
[Xu and Sun 2016] Xu, J., and Sun, X. 2016. Dependency-based gated recursive neural network for Chinese word segmentation. In ACL 2016.
[Zhang, Damerau, and Johnson 2002] Zhang, T.; Damerau, F.; and Johnson, D. 2002. Text chunking based on a generalization of Winnow. Journal of Machine Learning Research 2:615-637.
[Zheng et al. 2015] Zheng, S.; Jayasumana, S.; Romera-Paredes, B.; Vineet, V.; Su, Z.; Du, D.; Huang, C.; and Torr, P. H. S. 2015. Conditional random fields as recurrent neural networks. In ICCV 2015, 1529-1537.
[Zhou et al. 2015] Zhou, H.; Zhang, Y.; Huang, S.; and Chen, J. 2015. A neural probabilistic structured-prediction model for transition-based dependency parsing. In ACL 2015, 1213-1222.