1 Introduction
Unlike English and many other languages, Chinese sentences have no explicit word boundaries. Therefore, Chinese Word Segmentation (CWS) is a crucial step for many Chinese Natural Language Processing (NLP) tasks such as syntactic parsing, information retrieval and word representation learning (Grave et al., 2018).
Recently, neural approaches to supervised CWS have attracted great interest. A large number of neural models, e.g., tensor neural networks (Pei et al., 2014), recursive neural networks (Chen et al., 2015a), long short-term memory networks (RNN-LSTM) (Chen et al., 2015b) and convolutional neural networks (CNN) (Wang and Xu, 2017), have been proposed and have given results competitive with the best statistical models (Sun, 2010). However, neural approaches to unsupervised CWS have not been investigated.
Previous unsupervised approaches to CWS can be roughly classified into discriminative and generative models. The former use carefully designed goodness measures to score candidate segmentations, while the latter design statistical models of Chinese and find the segmentation with the highest generative probability.
Popular goodness measures for discriminative models include Mutual Information (MI) (Chang and Lin, 2003), normalized Variation of Branching Entropy (nVBE) (Magistry and Sagot, 2012) and Minimum Description Length (MDL) (Magistry and Sagot, 2013). These statistical discriminative approaches can be extended in a trivial way, by simply replacing their n-gram language models with neural language models (Bengio et al., 2003). There may exist other, more sophisticated neural discriminative approaches, but they are not the focus of this paper.
For generative approaches, typical statistical models include the Hidden Markov Model (HMM) (Chen et al., 2014), the Hierarchical Dirichlet Process (HDP) (Goldwater et al., 2009) and the Nested Pitman-Yor Process (NPY) (Mochihashi et al., 2009). However, none of them can be easily extended into a neural model. Therefore, neural generative models for word segmentation remain to be investigated.
In this paper, we propose Segmental Language Models (SLMs), a neural generative model that explicitly focuses on the segmental nature of Chinese: SLMs can directly generate segmented sentences and give the corresponding generative probability. We evaluate our methods on four benchmark datasets from the SIGHAN 2005 bakeoff (Emerson, 2005), namely PKU, MSR, AS and CityU. To our knowledge, we are the first to propose a neural model for unsupervised Chinese word segmentation, and we achieve performance competitive with the state-of-the-art statistical models on four different datasets.[1]
[1] Our implementation can be found at https://github.com/EdwardSun/SLM
2 Segmental Language Models
In this section, we present our segmental language models (SLMs). Note that in Chinese NLP, characters are the atomic elements. Thus, in the context of CWS, we use “character” instead of “word” for language modeling.
2.1 Language Models
The goal of language modeling is to learn the joint probability function of sequences of characters in a language. However, this is intrinsically difficult because of the curse of dimensionality. Traditional approaches obtain generalization based on n-grams, while neural approaches introduce a distributed representation for characters to fight the curse of dimensionality.
A neural Language Model (LM) can give the conditional probability of the next character given the previous ones, and is usually implemented by a Recurrent Neural Network (RNN):
(1)  
(2) 
Here each character is mapped to a distributed representation, and the recurrent hidden state represents the information of the previous characters.
2.2 Segmental Language Models
Similar to neural language modeling, the goal of segmental language modeling is to learn the joint probability function of the segmented sequences of characters. Thus, for each segment, we have:
(3) 
Here each character in a segment is mapped to a distributed representation, and the prediction is conditioned on the previous segments. The concatenation of all segments is exactly the whole sentence, so the lengths of the segments sum to the length of the sentence.
Moreover, we introduce a context encoder RNN that processes the preceding character sequence, so that the generation of each segment is conditional on the previous segments. Specifically, we initialize the segment decoder with the context encoder’s output for the previous segments.
Note that although we have a context encoder and a segment decoder, an SLM is not an encoder-decoder model, because the content that the decoder generates is not the same as what the encoder provides.
Figure 1 illustrates how SLMs work with a candidate segmentation.
2.3 Properties of SLMs
However, in the unsupervised scheme, the given sentences are not segmented. Therefore, the probability for SLMs to generate a given sentence is the total probability of all its possible segmentations:
(4) 
Here an end-of-segment symbol is generated at the end of each segment, and the generation of each segment starts from the context representation of the previous segments.
Moreover, for sentence generation, SLMs are able to generate arbitrary sentences by generating segments one by one and stopping when the end-of-sentence symbol is generated. In addition, the time complexity is linear in the length of the generated sentence, as we can keep the hidden state of the context encoder RNN and update it when generating new words.
Last but not least, it is easy to verify that SLMs preserve the probabilistic property of language models, namely that the probabilities of all possible sentences sum to one:
(5) 
where the sum enumerates all possible sentences.
In summary, segmental language models can fully substitute for vanilla language models.
2.4 Training and Decoding
As with language models, training is achieved by maximizing the log-likelihood of the training corpus:
(6) 
Fortunately, we can compute the loss function in linear time using dynamic programming, given the initial condition that the empty prefix has probability 1:
(7) 
Here the recursion combines the total probability of all possible segmentations of a shorter prefix with the probability of one segment, and the length of that segment is bounded by the maximal segment length.
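The dynamic program described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: `seg_logprob` is a hypothetical callable standing in for the segment decoder (it returns the log-probability of one segment given the preceding characters), and `constant_scorer` is a toy stand-in rather than a trained model. For a fixed maximal segment length, the number of calls to the scorer grows linearly with the sentence length.

```python
import math

def log_marginal(sentence, seg_logprob, max_len):
    """Log of the total probability of `sentence` over all segmentations."""
    n = len(sentence)
    # alpha[i]: log of the summed probability of every segmentation of sentence[:i].
    alpha = [-math.inf] * (n + 1)
    alpha[0] = 0.0  # the empty prefix has probability 1
    for i in range(1, n + 1):
        # The last segment of the prefix has length k, with 1 <= k <= max_len.
        terms = [
            alpha[i - k] + seg_logprob(sentence[:i - k], sentence[i - k:i])
            for k in range(1, min(max_len, i) + 1)
        ]
        top = max(terms)
        if top > -math.inf:
            # log-sum-exp keeps the summation numerically stable
            alpha[i] = top + math.log(sum(math.exp(t - top) for t in terms))
    return alpha[n]

def constant_scorer(context, segment):
    """Toy stand-in (assumption): every segment has probability 0.5."""
    return math.log(0.5)

print(math.exp(log_marginal("abc", constant_scorer, max_len=3)))
```

With the toy scorer, the three-character string has four segmentations (one with three segments, two with two segments, one with a single segment), so the printed total is approximately 0.125 + 0.25 + 0.25 + 0.5 = 1.125. The toy scorer is not normalized; it only makes the recursion easy to check by hand.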
We can also find the segmentation with maximal probability (namely, decoding) in linear time using dynamic programming in a similar way, with an analogous initial condition:
(8)  
(9) 
Here the first quantity is the probability of the best segmentation and the second is used to trace back the decoding.
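Decoding follows the same pattern with a maximum in place of the sum, plus a back-pointer table for the trace-back. The sketch below is a minimal illustration under assumptions: `seg_logprob` is again a hypothetical stand-in for the segment decoder, and `LEXICON` with its probabilities is invented toy data.

```python
import math

def decode(sentence, seg_logprob, max_len):
    """Return (segments, log-probability) of the best segmentation."""
    n = len(sentence)
    best = [-math.inf] * (n + 1)  # best[i]: best log-probability for sentence[:i]
    back = [0] * (n + 1)          # back[i]: length of the last segment on that path
    best[0] = 0.0
    for i in range(1, n + 1):
        for k in range(1, min(max_len, i) + 1):
            score = best[i - k] + seg_logprob(sentence[:i - k], sentence[i - k:i])
            if score > best[i]:
                best[i], back[i] = score, k
    # Trace back from the end of the sentence to recover the segments.
    segments, i = [], n
    while i > 0:
        k = back[i]
        if k == 0:
            raise ValueError("no segmentation with non-zero probability")
        segments.append(sentence[i - k:i])
        i -= k
    segments.reverse()
    return segments, best[n]

# Toy stand-in for the segment decoder (assumption): a tiny lexicon with
# made-up probabilities; unseen segments get a very small probability.
LEXICON = {"ab": 0.5, "c": 0.4, "a": 0.1, "b": 0.1}

def lexicon_scorer(context, segment):
    return math.log(LEXICON.get(segment, 1e-6))

print(decode("abc", lexicon_scorer, max_len=2))
```

On the toy input, the segmentation "ab" + "c" has probability 0.5 × 0.4 = 0.2, which beats "a" + "b" + "c" (0.004) and "a" + "bc", so it is the one returned.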
3 Experiments
3.1 Experimental Settings and Details
We evaluate our models on the SIGHAN 2005 bakeoff (Emerson, 2005) datasets. For all text, we replace every punctuation mark, English character and Arabic number with a special token for its class, and we only segment the text between punctuation marks. Following Chen et al. (2014), we use both training data and test data for training, and only the test data are used for evaluation. To make a fair comparison with previous work, we do not use any larger raw corpus.
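The preprocessing step can be illustrated with a short sketch. Several details here are assumptions rather than the authors' settings: the token names `<punc>`, `<eng>` and `<num>` are hypothetical, a run of consecutive Latin letters or digits is collapsed into a single token, and punctuation is detected through Unicode general categories.

```python
import re
import unicodedata

# Hypothetical placeholder token names (assumptions, not taken from the paper).
PUNC, ENG, NUM = "<punc>", "<eng>", "<num>"

# Runs of ASCII or full-width Latin letters, runs of ASCII or full-width
# digits, or any other single non-space character.
TOKEN_RE = re.compile(r"(?P<eng>[A-Za-zＡ-Ｚａ-ｚ]+)|(?P<num>[0-9０-９]+)|(?P<other>\S)")

def normalize(text):
    """Map raw text to a list of symbols with placeholder tokens substituted."""
    symbols = []
    for match in TOKEN_RE.finditer(text):
        if match.lastgroup == "eng":
            symbols.append(ENG)
        elif match.lastgroup == "num":
            symbols.append(NUM)
        elif unicodedata.category(match.group()).startswith("P"):
            symbols.append(PUNC)
        else:
            symbols.append(match.group())
    return symbols

def split_at_punctuation(symbols):
    """Split the symbol list into the chunks between punctuation tokens."""
    chunks, current = [], []
    for symbol in symbols:
        if symbol == PUNC:
            if current:
                chunks.append(current)
            current = []
        else:
            current.append(symbol)
    if current:
        chunks.append(current)
    return chunks

print(split_at_punctuation(normalize("我有3个apple。你呢？")))
```

Each chunk returned by `split_at_punctuation` would then be segmented independently, which matches the statement that only the text between punctuation marks is segmented.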
We apply word2vec (Mikolov et al., 2013) to the Chinese Gigaword corpus (LDC2011T13) to obtain pretrained character embeddings.
A 2-layer LSTM (Hochreiter and Schmidhuber, 1997) is used as the segment decoder and a 1-layer LSTM is used as the context encoder.
We use stochastic gradient descent with a mini-batch size of 256 and a learning rate of 16.0 to optimize the model parameters in the first 400 steps, then we use Adam (Kingma and Ba, 2014) with a learning rate of 0.005 to further optimize the models. Model parameters are initialized by normal distributions as suggested by Glorot and Bengio (2010). We use gradient clipping and apply dropout to the character embeddings and RNNs to prevent overfitting.
The standard word precision, recall and F1 measures (Emerson, 2005) are used to evaluate segmentation performance.
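As an illustration of the word-level measures, the sketch below scores a predicted segmentation against a gold one by comparing word spans: a predicted word counts as correct only if both of its boundaries match a gold word. It is a minimal sketch of the standard definition, not the official bakeoff scoring script, and the function names are hypothetical.

```python
def to_spans(words):
    """Convert a list of words into a set of (start, end) character spans."""
    spans, start = set(), 0
    for word in words:
        spans.add((start, start + len(word)))
        start += len(word)
    return spans

def precision_recall_f1(gold_sentences, pred_sentences):
    """Word-level precision, recall and F1 over parallel lists of segmented sentences."""
    correct = n_gold = n_pred = 0
    for gold, pred in zip(gold_sentences, pred_sentences):
        gold_spans, pred_spans = to_spans(gold), to_spans(pred)
        correct += len(gold_spans & pred_spans)
        n_gold += len(gold_spans)
        n_pred += len(pred_spans)
    precision = correct / n_pred if n_pred else 0.0
    recall = correct / n_gold if n_gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

gold = [["反过来", "又", "促进", "了"]]
pred = [["反过来", "又", "促进了"]]
print(precision_recall_f1(gold, pred))
```

In this example two of the three predicted words match gold words, and two of the four gold words are recovered, so precision is 2/3, recall is 1/2 and F1 is 4/7.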
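Table 1: F1 scores of previous methods and of SLMs with maximal segment lengths 2, 3 and 4 ("-" marks a result that is not reported).
F1 score     PKU    MSR    AS     CityU
HDP          68.7   69.9   -      -
HDP + HMM    75.3   76.3   -      -
ESA          77.8   80.1   78.5   76.0
NPY3         -      80.7   -      81.7
NPY2         -      80.2   -      82.4
nVBE         80.0   81.3   76.6   76.7
Joint        81.1   81.7   -      -
SLM2         80.2   78.5   79.4   78.2
SLM3         79.8   79.4   80.3   80.5
SLM4         79.2   79.0   79.8   79.7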
3.2 Results and Analysis
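Table 2: F1 scores of “SLM4” with rule-based post-processing (*) and with 1024 guideline sentences (†).
F1 score     PKU    MSR    AS     CityU
SLM4         79.2   79.0   79.8   79.7
SLM4*        81.9   83.0   81.0   81.4
SLM4†        87.5   84.3   84.2   86.0
SLM4†*       87.3   84.8   83.9   85.8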
Our final results are shown in Table 1, which also lists the results of several previous state-of-the-art methods[2] and in which we mark the best results in boldface. We test the proposed SLMs with different maximal segment lengths and use “SLM2”, “SLM3” and “SLM4” to denote the corresponding models. We do not try maximal segment lengths greater than 4, because words that consist of more than four characters are rare.
[2] Magistry and Sagot (2012) evaluated their nVBE on the training data, and the joint model of Chen et al. (2014) combines HDP+HMM and is initialized with nVBE, so in principle these results cannot be compared directly.
As can be seen, it is hard to predict which maximal segment length will give the best performance. This is because an exact definition of what a word is remains hard to reach, and different datasets follow different guidelines. Zhao and Kit (2008) use cross-training of a supervised segmentation system to estimate the consistency between different segmentation guidelines, and the average consistency is found to be as low as 85 (F-score). Therefore, this can be regarded as a top line for unsupervised CWS.
Table 1 shows that SLMs outperform the previous best discriminative and generative models on the PKU and AS datasets. This may be because the segmentations produced by our models are closer to the guidelines of these two datasets.
Moreover, in the experiments we observe that Chinese particles often attach to other words, for example “的” following adjectives and “了” following verbs. It is hard for our generative models to split them apart. Therefore, we propose a rule-based post-processing module to deal with this problem, in which we explicitly split the attached particles from other words.[3] The post-processing is applied to the results of “SLM4”, giving “SLM4*”. In addition, we also evaluate “SLM4” using the first 1024 sentences of the segmented training datasets (about 5.4% of PKU, 1.2% of MSR, 0.1% of AS and 1.9% of CityU) for training, in order to teach “SLM4” the corresponding ad hoc segmentation guidelines; we denote this model “SLM4†”. Table 2 shows the results.
[3] The rules we use are listed in the appendix at https://github.com/EdwardSun/SLM.
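To make the idea of particle splitting concrete, here is a deliberately simple sketch. The rule it applies (detach a trailing “的” or “了” from any multi-character segment) is a hypothetical example chosen for illustration; the actual rules are those listed in the authors' appendix and are not reproduced here.

```python
# Hypothetical particle list (assumption): only the two particles named in the text.
PARTICLES = ("的", "了")

def split_particles(words):
    """Detach a trailing particle from every multi-character segment."""
    result = []
    for word in words:
        if len(word) > 1 and word.endswith(PARTICLES):
            result.extend([word[:-1], word[-1]])
        else:
            result.append(word)
    return result

print(split_particles(["又", "促进了", "检察", "人员"]))
```

A rule this blunt would also split genuine words that happen to end in one of these characters (for example “目的”), which is one reason a practical rule set needs to be more selective.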
The table shows that as few as 1024 guideline sentences improve the performance of “SLM4” significantly. Although rule-based post-processing is very effective, “SLM4†” outperforms “SLM4*” on all four datasets. Moreover, performance drops on three datasets when the rule-based post-processing is applied to “SLM4†”. These results indicate that SLMs can learn the empirical rules for word segmentation given only a small amount of training data, and that such guideline data improve the performance of SLMs more naturally and more effectively than explicit rules.
3.3 The Effect of the Maximal Segment Length
The maximal segment length represents the prior knowledge we have about Chinese word segmentation. For example, a maximal segment length of 3 means that there are only unigrams, bigrams and trigrams in the text. While there do exist words that contain more than four characters, most Chinese words are unigrams or bigrams. Therefore, the maximal segment length represents a trade-off between the accuracy on short words and the accuracy on long words.
Specifically, we investigate two major segmentation problems that might affect the accuracy of word segmentation, namely insertion errors and deletion errors. An insertion error inserts a boundary inside a word, which splits a correct word. A deletion error deletes the boundary between two words, which results in a composition error (Li and Yuan, 1998). Table 3 shows the statistics of the different errors on PKU for our model with different maximal segment lengths. We can observe that the insertion error rate decreases as the maximal segment length increases, while the deletion error rate increases.
We also provide some examples in Table 4, which are taken from the results of our models. They clearly illustrate that different maximal segment lengths can result in different errors. For example, there is an insertion error on “反过来” by SLM2, and a deletion error on “促进” and “了” by SLM4.
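The two error types can be counted by comparing word-boundary positions, as the sketch below illustrates on the example sentences from Table 4. Counting errors as set differences of boundary positions is an assumption made for this illustration; the exact counting procedure behind Table 3 is not specified in the text.

```python
def boundaries(words):
    """Character offsets of the internal word boundaries of a segmentation."""
    cuts, position = set(), 0
    for word in words[:-1]:
        position += len(word)
        cuts.add(position)
    return cuts

def count_errors(gold, pred):
    """Insertions: boundaries only in pred. Deletions: boundaries only in gold."""
    gold_cuts, pred_cuts = boundaries(gold), boundaries(pred)
    return {"insertion": len(pred_cuts - gold_cuts),
            "deletion": len(gold_cuts - pred_cuts)}

gold = "而 这些 制度 的 完善 反过来 又 促进 了 检察 人员 执法 水平 的 进一步 提高".split()
slm2 = "而 这些 制度 的 完善 反 过来 又 促进 了 检察 人员 执法 水平 的 进一 步 提高".split()
slm4 = "而 这些 制度 的 完善 反过 来 又 促进了 检察 人员 执法 水平 的进一步 提高".split()

print(count_errors(gold, slm2))
print(count_errors(gold, slm4))
```

On this sentence the SLM2 output contains two insertion errors (inside “反过来” and “进一步”) and no deletion errors, while the SLM4 output contains one insertion error (inside “反过来”) and two deletion errors (“促进了” and “的进一步”), in line with the trend described above.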
Table 3: Numbers of insertion and deletion errors on PKU for different maximal segment lengths.
Error        SLM2   SLM3   SLM4
Insertion    7866   4803   3519
Deletion     3855   7518   8851
Table 4: Example segmentations produced by our models, compared with the gold segmentation.
Model   Example
SLM2    而 这些 制度 的 完善 反 过来 又 促进 了 检察 人员 执法 水平 的 进一 步 提高
SLM3    而 这些 制度 的 完善 反过来 又 促进了 检察 人员 执法 水平 的 进一步 提高
SLM4    而 这些 制度 的 完善 反过 来 又 促进了 检察 人员 执法 水平 的进一步 提高
Gold    而 这些 制度 的 完善 反过来 又 促进 了 检察 人员 执法 水平 的 进一步 提高
4 Related Work
Generative Models for CWS
Goldwater et al. (2009) were the first to propose a generative model for unsupervised word segmentation. They built a nonparametric Bayesian bigram language model based on HDP (Teh et al., 2005). Mochihashi et al. (2009) proposed a Bayesian hierarchical language model using the Pitman-Yor (PY) process, which can generate sentences hierarchically. Chen et al. (2014) proposed a Bayesian HMM for unsupervised CWS inspired by the character-based scheme in the supervised CWS task, where the hidden states of characters represent their corresponding positions in the words. The segmental language model is not a neural extension of the above statistical models, as we model the segments directly.
Segmental Sequence Models
Sequence modeling via segmentations has been well investigated by Wang et al. (2017), who proposed the Sleep-WAke Network (SWAN) for speech recognition. SWAN is similar to SLM. However, SLMs do not have sleep-wake states, and SLMs predict the following segment given the previous context, while SWAN tries to recover the information in the encoded state. Therefore, the key difference is that SLMs are unsupervised language models while SWANs are supervised seq2seq models. Subsequently, Huang et al. (2017) successfully applied SWAN in their phrase-based machine translation. Another related work in machine translation is online segment-to-segment neural transduction (Yu et al., 2016), where the model is able to capture unbounded dependencies in both the input and output sequences. Kong (2017) also proposed a Segmental Recurrent Neural Network (SRNN) with CTC to solve segmental labeling problems.
5 Conclusion
In this paper, we propose a neural generative model for fully unsupervised Chinese word segmentation (CWS). To the best of our knowledge, this is the first neural model for unsupervised CWS. Our segmental language model is an intuitive generalization of vanilla neural language models that directly models the segmental nature of Chinese. Experimental results show that our models achieve performance competitive with the previous state-of-the-art statistical models on four datasets from SIGHAN 2005. We also show the improvement obtained by incorporating ad hoc guidelines into our segmental language models. Our future work may include the following two directions.

In this work, we only consider sequential segmental language modeling. In the future, we are interested in building a hierarchical neural language model like the Pitman-Yor process.

Like vanilla language models, segmental language models can also provide useful information for semi-supervised learning tasks. It would also be interesting to explore our models in semi-supervised schemes.
Acknowledgements
This work is supported by the National Training Program of Innovation for Undergraduates (URTP2017PKU001). We would also like to thank the anonymous reviewers for their helpful comments.
References

Bengio et al. (2003) Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Jauvin. 2003. A neural probabilistic language model. Journal of Machine Learning Research, 3(Feb):1137–1155.
Chang and Lin (2003) Jason S Chang and Tracy Lin. 2003. Unsupervised word segmentation without dictionary. ROCLING 2003 Poster Papers, pages 355–359.
Chen et al. (2014) Miaohong Chen, Baobao Chang, and Wenzhe Pei. 2014. A joint model for unsupervised Chinese word segmentation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 854–863, Doha, Qatar. Association for Computational Linguistics.
Chen et al. (2015a) Xinchi Chen, Xipeng Qiu, Chenxi Zhu, and Xuanjing Huang. 2015a. Gated recursive neural network for Chinese word segmentation. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 1744–1753, Beijing, China. Association for Computational Linguistics.
Chen et al. (2015b) Xinchi Chen, Xipeng Qiu, Chenxi Zhu, Pengfei Liu, and Xuanjing Huang. 2015b. Long short-term memory neural networks for Chinese word segmentation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1197–1206, Lisbon, Portugal. Association for Computational Linguistics.
Emerson (2005) Thomas Emerson. 2005. The second international Chinese word segmentation bakeoff. In Proceedings of the Fourth SIGHAN Workshop on Chinese Language Processing, volume 133, pages 123–133.
Glorot and Bengio (2010) Xavier Glorot and Yoshua Bengio. 2010. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pages 249–256.
Goldwater et al. (2009) Sharon Goldwater, Thomas L Griffiths, and Mark Johnson. 2009. A Bayesian framework for word segmentation: Exploring the effects of context. Cognition, 112(1):21–54.
Grave et al. (2018) Edouard Grave, Piotr Bojanowski, Prakhar Gupta, Armand Joulin, and Tomas Mikolov. 2018. Learning word vectors for 157 languages. In Proceedings of the International Conference on Language Resources and Evaluation (LREC 2018).
Hochreiter and Schmidhuber (1997) Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.
Huang et al. (2017) Po-Sen Huang, Chong Wang, Sitao Huang, Dengyong Zhou, and Li Deng. 2017. Towards neural phrase-based machine translation. arxiv.org/abs/1706.05565.
Kingma and Ba (2014) Diederik Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
Kong (2017) Lingpeng Kong. 2017. Neural Representation Learning in Linguistic Structured Prediction. Ph.D. thesis, Google Research.
Li and Yuan (1998) Haizhou Li and Baosheng Yuan. 1998. Chinese word segmentation. In Proceedings of the 12th Pacific Asia Conference on Language, Information and Computation, pages 212–217.
Magistry and Sagot (2012) Pierre Magistry and Benoît Sagot. 2012. Unsupervized word segmentation: the case for Mandarin Chinese. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Short Papers - Volume 2, pages 383–387. Association for Computational Linguistics.
Magistry and Sagot (2013) Pierre Magistry and Benoît Sagot. 2013. Can MDL improve unsupervised Chinese word segmentation? In Sixth International Joint Conference on Natural Language Processing: SIGHAN Workshop, page 2.
Mikolov et al. (2013) Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111–3119.
Mochihashi et al. (2009) Daichi Mochihashi, Takeshi Yamada, and Naonori Ueda. 2009. Bayesian unsupervised word segmentation with nested Pitman-Yor language modeling. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP: Volume 1, pages 100–108. Association for Computational Linguistics.
Pei et al. (2014) Wenzhe Pei, Tao Ge, and Baobao Chang. 2014. Max-margin tensor neural network for Chinese word segmentation. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 293–303, Baltimore, Maryland. Association for Computational Linguistics.
Sun (2010) Weiwei Sun. 2010. Word-based and character-based word segmentation models: Comparison and combination. In Coling 2010: Posters, pages 1211–1219, Beijing, China. Coling 2010 Organizing Committee.
Teh et al. (2005) Yee W Teh, Michael I Jordan, Matthew J Beal, and David M Blei. 2005. Sharing clusters among related groups: Hierarchical Dirichlet processes. In Advances in Neural Information Processing Systems, pages 1385–1392.
Wang et al. (2017) Chong Wang, Yining Wang, Po-Sen Huang, Abdelrahman Mohamed, Dengyong Zhou, and Li Deng. 2017. Sequence modeling via segmentations. arXiv preprint arXiv:1702.07463.
Wang and Xu (2017) Chunqi Wang and Bo Xu. 2017. Convolutional neural network with word embeddings for Chinese word segmentation. In Proceedings of the 8th International Joint Conference on Natural Language Processing.
Wang et al. (2011) Hanshi Wang, Jian Zhu, Shiping Tang, and Xiaozhong Fan. 2011. A new unsupervised approach to word segmentation. Computational Linguistics, 37(3):421–454.
Yu et al. (2016) Lei Yu, Jan Buys, and Phil Blunsom. 2016. Online segment to segment neural transduction. arXiv preprint arXiv:1609.08194.
Zhao and Kit (2008) Hai Zhao and Chunyu Kit. 2008. An empirical comparison of goodness measures for unsupervised Chinese word segmentation with a unified framework. In Proceedings of the Third International Joint Conference on Natural Language Processing: Volume-I.