LSICC: A Large Scale Informal Chinese Corpus

by   Jianyu Zhao, et al.

Deep learning based natural language processing models have proven powerful, but they need large-scale datasets. Because of the significant gap between real-world tasks and existing Chinese corpora, in this paper we introduce a large-scale corpus of informal Chinese. The corpus contains around 37 million book reviews and 50 thousand netizens' comments on news. We explore the frequencies of informal words in the corpus and show how it differs from existing corpora. The corpus can further be used to train deep learning based natural language processing models for tasks such as Chinese word segmentation and sentiment analysis.




1 Introduction

Deep learning has been the mainstay of natural language processing, ranging from text summarization [1] to sentiment analysis [2] to text generation [3] and automated question-answering systems [4]. Unlike traditional rule-based methods, deep learning models depend heavily on the scale and quality of the corpus. In the Chinese NLP field, there are many well-known large-scale, high-quality corpora, such as Baidu Encyclopedia, People’s Daily News and Sina Weibo News. Various powerful Chinese deep learning models are trained on these corpora [5], [6], [7], [8], [9].

However, most Chinese corpora are in written Chinese, while most real-world deep learning based NLP systems deal with informal Chinese, such as product reviews, netizens’ opinions, and microblogs. There are large gaps between informal Chinese and written Chinese, especially in word usage and sentence structure. Pre-trained resources built from written Chinese corpora, such as word embeddings and Chinese word segmentation tools, may perform badly on tasks involving informal Chinese.

To address this issue, we introduce LSICC, a large-scale corpus of informal Chinese. Containing around 37 million book reviews and 50 thousand netizens’ opinions on news, LSICC is a typical informal Chinese corpus. Most sentences in LSICC are in spoken Chinese or even Internet slang. As far as we know, LSICC is the first large-scale, well-formatted, cleansed corpus focusing on informal Chinese.

This paper makes the following contributions:

  1. collect a large-scale corpus of informal Chinese

  2. filter out uninformative data items

  3. compare the proportions of informal words in several corpora

2 Informal Chinese

Informal Chinese, including spoken Chinese and Chinese Internet slang, differs substantially from formal Chinese in both grammar and word usage. In this section, we discuss the differences between formal and informal Chinese.

2.1 Spoken Chinese

For most languages, there are differences between the spoken form and the written form. In Chinese, the gap is even more significant due to the long history of written Chinese.

Like other languages, spoken Chinese sometimes does not follow grammatical rules as strictly as written Chinese, especially in elliptical sentences. For example, subjects are sometimes omitted in spoken Chinese.

Beyond grammar, word usage has the greatest influence on neural network based Chinese natural language processing models. There are many interchangeable word pairs between written and spoken Chinese, such as “脑袋” (spoken) and “头部” (written), which both mean “head”. The two words in each pair usually have almost the same meaning, but the one used in written Chinese is more formal, while the one used in spoken Chinese is informal.

2.2 Internet Slang

Born in the 1990s, Chinese Internet slang refers to the various kinds of slang created by netizens and used in chat rooms, social networking services, and online communities. Nowadays, Chinese Internet slang is no longer a set of in-group memes; it is becoming a popular language style among Chinese speakers in general. Since 2012, Xinhuanet has selected a “Top 10 Chinese Internet Slang” list [10] every year, and Chinese Internet slang is used even by official Chinese institutions.

The first kind of Internet slang is phonetic substitution, in which the slang term is pronounced the same as or similarly to the formal phrase. For example, in Internet slang, people may use “神马” in place of “什么”. The two are pronounced similarly (“shén mǎ” and “shén me” in pinyin), and in slang both mean “what”. Read literally in written Chinese, however, “神马” means “horse-god”, while “什么” means “what”.

Transliteration is another major way of forming Internet slang. Because the words are transliterated from another language, both their meaning and their pronunciation are similar to those of the source word. For example, “伐木累” is transliterated from the English word “family” and is used only as Chinese Internet slang [11].

Internet slang is also created by giving new meanings to existing words. For example, in written Chinese, “酱油” means “soy sauce”. In Chinese Internet slang, however, it refers to “passing by”.

3 Data Collection

LSICC collects book reviews from DouBan DuShu and netizens’ opinions from Chiphell. This section briefly describes these two datasets and the pre-processing methods.

3.1 DouBan DuShu

DouBan DuShu is a Chinese website where users can share their reviews of various kinds of books. Most of the users on this website are not professional book reviewers, so the comments are usually in spoken Chinese or even Internet slang. In addition to commenting, users can rate a book from 1 star to 5 stars according to its quality. We have collected more than 37 million short comments on about 18 thousand books from 1 million users. The large number of users provides a diversity of language styles, from moderately formal to informal. An example data item is shown in Table 1.

| Key | Description | Example Value |
|---|---|---|
| Book Name | The name of the book | 理想国 |
| User Name | Who gives the comment (anonymized) | 399 |
| Tag | The tag the book belongs to | 思想 |
| Comment | Content of the comment | 我是国师的脑残粉 |
| Star | Stars given to the book (from 1 star to 5 stars) | 5 stars |
| Date | When the comment was posted | 2018-08-21 |
| Like | Count of “like” on the comment | 0 |

Table 1: Example of DouBan DuShu dataset
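
As an illustration of the record layout in Table 1, the following is a minimal Python sketch of one data item. The class name `DoubanComment`, the field names, the field types and the validation rules are assumptions made for this example; the paper does not specify the storage format of the released corpus.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class DoubanComment:
    """One DouBan DuShu data item, with one field per key in Table 1.

    Field names and types are hypothetical; only the keys come from Table 1.
    """
    book_name: str    # name of the book
    user_name: str    # anonymized user identifier
    tag: str          # tag the book belongs to
    comment: str      # content of the comment
    star: int         # rating, from 1 to 5
    posted_on: date   # when the comment was posted
    like: int         # count of "like" on the comment

    def __post_init__(self):
        # Reject values outside the ranges described in Table 1.
        if not 1 <= self.star <= 5:
            raise ValueError("star must be between 1 and 5")
        if self.like < 0:
            raise ValueError("like must be non-negative")


# The example row from Table 1.
example = DoubanComment(
    book_name="理想国",
    user_name="399",
    tag="思想",
    comment="我是国师的脑残粉",
    star=5,
    posted_on=date(2018, 8, 21),
    like=0,
)
```

Keeping the user and book identifiers next to the comment text is what would make the recommendation and social network uses mentioned in the conclusions possible.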

3.2 Chiphell

Chiphell is a web portal where netizens share their views on news and hold discussions in small groups. We have collected discussion forums on several subjects, such as computer hardware, motors and clothes. There are more than 50 thousand discussions in the corpus. As with the DouBan DuShu corpus, most of the sentences collected from Chiphell are informal Chinese, and some of them belong to particular domains. An example from each subject is shown in Table 2.

| Subject | Topic | Example |
|---|---|---|
| News | 美机场航空业希望修改客机降落的Emoji表情:机头朝下不吉利 | 那我还说改完的意思是无限复飞呢,飞到没油不又gg了 |
| Computer Hardware | 请问现在大船货除开3610还有其他性价比的大船大容量吗 | 我1T的PM1633。。卖1300都木有人接 |
| Mobile Phones | 努比亚X 综合讨论帖 | MIX3辣鸡被友商各种吊打 |
| Clothes | 程序媛的皮艺生活 | 花点时间在复杂又感兴趣的事情上是一件快乐又有成就感的体验 |

Table 2: Example of Chiphell dataset

3.3 Data Pre-processing

In addition to the raw dataset, we extracted the comments and preprocessed them to provide a clean, well-formatted and comprehensive Chinese corpus. After carefully investigating the raw text, we applied three main preprocessing methods:

  1. convert Traditional Chinese to Simplified Chinese

  2. remove overly short comments (fewer than 4 characters)

  3. add identifiers to special tokens, such as special signs, English words and emoticons
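
The three steps above can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' pipeline: the tiny `T2S` mapping stands in for a full Traditional-to-Simplified conversion dictionary (a real pipeline would use a complete conversion tool such as OpenCC), the `<ENG>` and `<EMO>` markers are a hypothetical identifier format, and the emoticon pattern covers only a few common shapes.

```python
import re

# Illustrative stand-in for a full Traditional -> Simplified dictionary.
T2S = str.maketrans({"書": "书", "說": "说", "點": "点", "這": "这", "們": "们"})

MIN_LENGTH = 4  # comments with fewer than 4 characters are dropped

# One combined pattern so that text is tagged in a single pass. The group
# names double as the (hypothetical) identifier names.
SPECIAL = re.compile(
    r"(?P<EMO>[:;=][-^]?[)(DPp])"   # a few ASCII emoticons such as :) ;-( :D
    r"|(?P<ENG>[A-Za-z]+)"          # runs of English letters
)


def tag_special(text):
    """Wrap emoticons and English words in identifier markers."""
    def wrap(match):
        kind = match.lastgroup  # "EMO" or "ENG"
        return f"<{kind}>{match.group(0)}</{kind}>"
    return SPECIAL.sub(wrap, text)


def preprocess(comment):
    """Return the cleaned comment, or None if it is too short to keep."""
    text = comment.strip().translate(T2S)   # step 1: convert to Simplified
    if len(text) < MIN_LENGTH:              # step 2: drop short comments
        return None
    return tag_special(text)                # step 3: tag special tokens


if __name__ == "__main__":
    for raw in ["好", "這本書很好看", "飞到没油不又gg了", "今天很开心 :)"]:
        print(repr(raw), "->", repr(preprocess(raw)))
```

Tagging in a single regular-expression pass avoids re-matching the letters inside markers that were inserted by an earlier substitution.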

4 Experiments

To further explore the informal Chinese corpus, we calculate the proportion of informal words in it. The experiment is conducted on Weibo News [12], Sougou News, People’s Daily [13] and LSICC. We manually collected 70 informal words as the benchmark, covering both spoken Chinese words and Chinese Internet slang words.

We counted the occurrences of informal words and the total number of words to calculate the proportion of informal words in each corpus. As shown in Table 3, LSICC has the highest proportion of informal words, more than twice that of the second highest, Weibo News. Note that the more formal the medium is, the lower its proportion of informal words.

| Corpus | Informal Words | Total Words | Proportion (per 10,000 words) |
|---|---|---|---|
| LSICC | 621807 | 705231306 | 8.82 |
| Weibo News | 46831 | 125082112 | 3.74 |
| Sougou News | 1238 | 14160148 | 0.87 |
| People’s Daily | 25 | 3482887 | 0.07 |

Table 3: Proportion of the informal words in each corpus
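
The proportion calculation can be sketched as below. The function name `informal_proportion` is hypothetical, the corpus is assumed to be already segmented into word tokens (the segmentation tool is not stated in the text), and the two benchmark words in the usage lines are examples taken from this paper rather than the actual 70-word list. Recomputing the ratios from the counts in Table 3 reproduces the printed proportion values when they are read as occurrences per 10,000 words.

```python
from collections import Counter


def informal_proportion(tokens, informal_words):
    """Return (informal_count, total_count, occurrences per 10,000 words)."""
    counts = Counter(tokens)
    total = sum(counts.values())
    # A set avoids double counting if the benchmark list has duplicates.
    informal = sum(counts[word] for word in set(informal_words))
    rate = 10_000 * informal / total if total else 0.0
    return informal, total, rate


# Counts copied from Table 3: (informal words, total words).
TABLE_3 = {
    "LSICC": (621807, 705231306),
    "Weibo News": (46831, 125082112),
    "Sougou News": (1238, 14160148),
    "People's Daily": (25, 3482887),
}

for corpus, (informal, total) in TABLE_3.items():
    print(f"{corpus}: {10_000 * informal / total:.2f} per 10,000 words")

# Toy usage on a hand-segmented sentence.
tokens = ["我", "木有", "钱", "神马", "都", "木有"]
print(informal_proportion(tokens, ["木有", "神马"]))
```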

The result indicates that there is a significant gap between the language that real-world natural language processing models deal with and the language of the existing corpora. Using vector representations extracted from a corpus of formal Chinese as word embeddings may contribute to poor performance.

5 Conclusions and Future Work

We constructed a large-scale informal Chinese dataset and conducted a basic word frequency experiment on it. Compared with existing Chinese corpora, LSICC is a more typical dataset for real-world natural language processing tasks, especially sentiment analysis. As a next step, we plan to conduct embedding extraction, Chinese word segmentation and sentiment analysis on LSICC. Meanwhile, because raw information such as usernames and book names is kept, LSICC can also be used to build recommendation systems and explore social networks.