Deep learning has become the mainstay of natural language processing, with applications ranging from text summarization to sentiment analysis [3] and automated question answering. Unlike traditional rule-based methods, deep learning models are significantly influenced by the scale and quality of the corpus. In the Chinese NLP field, there are many well-known, high-quality, large-scale corpora, such as Baidu Encyclopedia, People’s Daily News and Sina Weibo News. Various powerful Chinese deep learning models have been trained on these corpora.
However, most Chinese corpora consist of written Chinese, while most real-world deep-learning-based NLP systems deal with informal Chinese, such as product reviews, netizens’ opinions, and microblogs. There are large gaps between informal Chinese and written Chinese, especially in word usage and sentence structure. Pre-trained resources built from written Chinese corpora, such as word embeddings and Chinese word segmentation tools, may therefore perform poorly on tasks involving informal Chinese.
To address this issue, we introduce LSICC, a large-scale corpus of informal Chinese. Containing around 37 million book reviews and 50 thousand netizens’ opinions on news, LSICC is a typical informal Chinese corpus. Most sentences in LSICC are in spoken Chinese, and some even use Internet slang. As far as we know, LSICC is the first large-scale, well-formatted, cleansed corpus focusing on informal Chinese.
This paper makes the following contributions:
- collect a large-scale corpus of informal Chinese
- filter out the uninformative data items
- compare the proportions of informal words in several corpora
2 Informal Chinese
Informal Chinese, including spoken Chinese and Chinese Internet slang, differs substantially from formal Chinese in both grammar and word usage. In this section, we discuss the differences between formal Chinese and informal Chinese.
2.1 Spoken Chinese
In most languages, there are differences between the spoken form and the written form. In Chinese, the gap is even more significant due to the long history of written Chinese.
As in other languages, spoken Chinese sometimes does not follow grammatical rules as strictly as written Chinese, especially in elliptical sentences. For example, the subject of a sentence is sometimes omitted in spoken Chinese.
In addition to grammar, word usage has the greatest influence on neural-network-based Chinese natural language processing models. There are various interchangeable word pairs between written Chinese and spoken Chinese, such as “脑袋” and “头部”, which both mean “head” in Chinese. The two words in each interchangeable pair usually have almost the same meaning, but the one used in written Chinese is more formal, while the one used in spoken Chinese is informal.
2.2 Internet Slang
Born in the 1990s, Chinese Internet slang refers to various kinds of slang created by netizens and used in chat rooms, social networking services, and online communities. Nowadays, Chinese Internet slang is no longer a set of memes confined to Internet in-groups, but is becoming a popular language style among all Chinese speakers. Since 2012, Xinhuanet has selected the “Top 10 Chinese Internet Slang” every year, and Chinese Internet slang is used even by official Chinese institutions.
The first kind of Internet slang is phonetic substitution, in which a formal phrase is replaced by one with the same or a similar pronunciation. For example, in Internet slang, people may use “神马” in place of “什么”. The two are pronounced similarly (“shénmǎ” and “shénme” in pinyin), and in Internet slang both mean “what”. In written Chinese, however, “神马” means “horse-god”, while “什么” means “what”.
Transliteration is also a primary way to form Internet slang. As the words are transliterated from another language, both the meaning and the pronunciation of the transliterated words are similar to those in the source language. For example, “伐木累” is transliterated from the English word “family” and is used only as Chinese Internet slang.
Meanwhile, Internet slang is also created by giving new meanings to existing words. For example, in written Chinese, “酱油” means “soy sauce”. In Chinese Internet slang, however, it refers to “passing by”.
3 Data Collection
LSICC collects book reviews from DouBan DuShu and netizens’ opinions from Chiphell. This section briefly describes these two datasets and the pre-processing methods.
3.1 DouBan DuShu
DouBan DuShu (dataset available at https://github.com/JaniceZhao/Douban-Dushu-Dataset.git) is a Chinese website where users share their reviews of various kinds of books. Most users of this website are not professional book reviewers, so the comments are usually in spoken Chinese or even Internet slang. In addition to commenting, users can rate a book from 1 star to 5 stars according to its quality. We have collected more than 37 million short comments on about 18 thousand books from 1 million users. The large number of users provides a diversity of language styles, from moderately formal to informal. An example data item is shown in Table 1.
| Field | Description | Example |
|---|---|---|
| Book Name | The name of the book | 理想国 |
| User Name | Who gave the comment (anonymized) | 399 |
| Tag | The tag the book belongs to | 思想 |
| Comment | Content of the comment | 我是国师的脑残粉 |
| Star | Stars given to the book (from 1 star to 5 stars) | 5 stars |
| Date | When the comment was posted | 2018-08-21 |
| Like | Count of “like” on the comment | 0 |
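As a rough illustration of how one record from Table 1 might be represented in code, the sketch below defines a small typed container. The class name `DoubanComment`, the field names, and the helper `parse_star` are hypothetical; the actual file layout of the released dataset is not described in the text and may differ.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class DoubanComment:
    """One short comment, mirroring the fields of Table 1 (names are hypothetical)."""
    book_name: str
    user_name: str    # anonymized user identifier
    tag: str
    comment: str
    star: int         # rating from 1 to 5
    posted_on: date
    like: int         # number of "likes" on the comment

    def __post_init__(self):
        # Reject values that fall outside the ranges given in Table 1.
        if not 1 <= self.star <= 5:
            raise ValueError(f"star must be between 1 and 5, got {self.star}")
        if self.like < 0:
            raise ValueError("like count cannot be negative")


def parse_star(text: str) -> int:
    """Turn a label such as '5 stars' or '1 star' into an integer."""
    return int(text.split()[0])


# The example row of Table 1 expressed as a record.
example = DoubanComment(
    book_name="理想国",
    user_name="399",
    tag="思想",
    comment="我是国师的脑残粉",
    star=parse_star("5 stars"),
    posted_on=date.fromisoformat("2018-08-21"),
    like=0,
)
```

Keeping the star rating as an integer and the date as a `date` object, rather than as raw strings, would make later filtering (for example by rating or by year) straightforward.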
3.2 Chiphell
Chiphell (dataset available at https://github.com/JaniceZhao/Chinese-Forum-Corpus.git) is a web portal where netizens share their views on news and hold discussions within small groups. We have collected discussion threads on several subjects, such as computer hardware, motors and clothes. The corpus contains more than 50 thousand discussions. As in the DouBan DuShu corpus, most of the sentences collected from Chiphell are in informal Chinese, and some of them belong to particular domains. An example from each subject is shown in Table 2.
| Subject | Thread title | Comment |
|---|---|---|
| Mobile Phones | 努比亚X 综合讨论帖 | MIX3辣鸡被友商各种吊打 |
3.3 Data Pre-processing
In addition to the raw dataset, we extracted the comments and preprocessed them to provide a clean, well-formatted and comprehensive Chinese corpus. After carefully inspecting the raw text, we applied three main preprocessing steps:
- convert Traditional Chinese to Simplified Chinese
- remove overly short comments (fewer than 4 characters)
- add identifiers to special characters, such as special signs, English words and emoticons
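The three steps above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the pipeline used to build LSICC: the conversion table `T2S` holds only a few sample characters (a real pipeline would need a complete Traditional-to-Simplified table or a dedicated conversion library), the `<ENG>` and `<EMO>` markers are placeholder identifiers since the text does not say which identifiers were used, and the emoticon pattern covers only some Unicode emoji ranges.

```python
import re

# Tiny illustrative Traditional-to-Simplified table (hypothetical sample only).
T2S = str.maketrans({"書": "书", "國": "国", "腦": "脑", "這": "这"})

MIN_CHARS = 4  # comments with fewer characters than this are dropped

# Placeholder patterns for two kinds of "special characters".
ENGLISH = re.compile(r"[A-Za-z]+")
EMOTICON = re.compile(r"[\U0001F300-\U0001FAFF\u2600-\u27BF]")


def preprocess(comment: str):
    """Return the cleaned comment, or None if it is too short to keep."""
    # Step 1: convert Traditional characters to Simplified ones.
    text = comment.strip().translate(T2S)
    # Step 2: drop overly short comments.
    if len(text) < MIN_CHARS:
        return None
    # Step 3: wrap special spans in identifiers. English runs are tagged
    # first so that the letters inside the emoticon tags are not re-tagged.
    text = ENGLISH.sub(lambda m: f"<ENG>{m.group(0)}</ENG>", text)
    text = EMOTICON.sub(lambda m: f"<EMO>{m.group(0)}</EMO>", text)
    return text


comments = ["好", "這本書很好", "iPhone很好用😀"]
cleaned = [c for c in map(preprocess, comments) if c is not None]
```

The length check is applied before tagging so that the added identifiers do not count toward the 4-character threshold.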
4 Experiment
To further characterize the informal Chinese corpus, we calculate the proportion of informal words in it. The experiment is conducted on Weibo News, Sougou News, People’s Daily and LSICC. We manually collected 70 informal words as the benchmark, covering both spoken Chinese words and Chinese Internet slang words.
We counted the occurrences of the informal words and the total number of words in each corpus to calculate the proportion of informal words. As shown in Table 3, LSICC has the highest proportion of informal words, more than twice that of the second highest, Weibo News. Note that the more formal the medium is, the lower the proportion of informal words in it.
| Corpus | Informal Words | Total Words | Proportion |
The result indicates that there is a significant gap between the language that real-world natural language models deal with and the language of the existing corpora. Using vector representations extracted from a corpus of formal Chinese as word embeddings may therefore contribute to poor performance.
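The proportion statistic described above can be sketched as a simple count over a tokenized corpus. The function name `informal_proportion` is hypothetical, the input is assumed to be already segmented into words (the text does not say which segmenter was used), and the three benchmark words are placeholders taken from the examples in Section 2, since the actual 70-word list is not reproduced here.

```python
from collections import Counter
from typing import Iterable, Set, Tuple


def informal_proportion(
    tokens: Iterable[str], informal_words: Set[str]
) -> Tuple[int, int, float]:
    """Return (informal_count, total_count, proportion) for a tokenized corpus."""
    counts = Counter(tokens)
    total = sum(counts.values())
    # A Counter returns 0 for words that never occur in the corpus.
    informal = sum(counts[word] for word in informal_words)
    proportion = informal / total if total else 0.0
    return informal, total, proportion


# Toy, already-segmented input (illustrative only).
benchmark = {"神马", "酱油", "脑袋"}
tokens = ["这", "是", "神马", "书", "我", "只是", "打", "酱油", "的", "脑袋"]
informal, total, proportion = informal_proportion(tokens, benchmark)
```

Applying the same function to each corpus with one shared benchmark list would give the three columns of Table 3 (informal words, total words, proportion).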
5 Conclusions and Future Work
We constructed a large-scale informal Chinese dataset and conducted a basic word-frequency experiment on it. Compared with existing Chinese corpora, LSICC is a more typical dataset for real-world natural language processing tasks, especially sentiment analysis. As a next step, we plan to conduct embedding extraction, Chinese word segmentation and sentiment analysis on LSICC. Meanwhile, since raw information such as usernames and book names is kept, LSICC can also be used to build recommendation systems and explore social networks.
References
- Romain Paulus, Caiming Xiong, and Richard Socher. A deep reinforced model for abstractive summarization. arXiv preprint arXiv:1705.04304, 2017.
- Lei Zhang, Shuai Wang, and Bing Liu. Deep learning for sentiment analysis: A survey. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, page e1253, 2018.
- Ilya Sutskever, Oriol Vinyals, and Quoc V Le. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pages 3104–3112, 2014.
- Lei Yu, Karl Moritz Hermann, Phil Blunsom, and Stephen Pulman. Deep learning for answer sentence selection. arXiv preprint arXiv:1412.1632, 2014.
- Shen Li, Zhe Zhao, Renfen Hu, Wensi Li, Tao Liu, and Xiaoyong Du. Analogical reasoning on Chinese morphological and semantic relations. arXiv preprint arXiv:1805.06504, 2018.
- Kerui Min, Chenggang Ma, Tianmei Zhao, and Haiyan Li. BosonNLP: An ensemble approach for word segmentation and POS tagging. In Natural Language Processing and Chinese Computing, pages 520–526. Springer, 2015.
- Yiming Cui, Ting Liu, Zhipeng Chen, Shijin Wang, and Guoping Hu. Consensus attention-based neural networks for Chinese reading comprehension. arXiv preprint arXiv:1607.02250, 2016.
- Ramesh Nallapati, Bowen Zhou, Caglar Gulcehre, Bing Xiang, et al. Abstractive text summarization using sequence-to-sequence RNNs and beyond. arXiv preprint arXiv:1602.06023, 2016.
- Jiatao Gu, Zhengdong Lu, Hang Li, and Victor OK Li. Incorporating copying mechanism in sequence-to-sequence learning. arXiv preprint arXiv:1603.06393, 2016.
- Yinjuan Xu. Top 10 Chinese Internet slang: 2012.
- Zhifei Li and David Yarowsky. Mining and modeling relations between formal and informal Chinese phrases from web corpora. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, EMNLP ’08, pages 1031–1040, Stroudsburg, PA, USA, 2008. Association for Computational Linguistics.
- Baotian Hu, Qingcai Chen, and Fangze Zhu. LCSTS: A large scale Chinese short text summarization dataset. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1967–1972, 2015.
- SW Yu et al. Guideline of People’s Daily corpus annotation. Technical report, Beijing University, 2001.