Overview of the NLPCC 2017 Shared Task: Chinese News Headline Categorization

06/09/2017
by   Xipeng Qiu, et al.
FUDAN University

In this paper, we give an overview of the shared task at the CCF Conference on Natural Language Processing & Chinese Computing (NLPCC 2017): Chinese News Headline Categorization. The dataset of this shared task consists of 18 classes, with 12,000 short texts and their corresponding labels for each class. The dataset and example code can be accessed at https://github.com/FudanNLP/nlpcc2017_news_headline_categorization.


1 Task Definition

This task aims to evaluate automatic classification techniques for very short texts, i.e., Chinese news headlines. Each news headline (i.e., news title) must be classified into one or more predefined categories. With the rise of the Internet and social media, text data on the web is growing exponentially. Having humans analyze all of this data is impractical, while machine learning techniques are well suited to this kind of task. After all, human brain capacity is too limited and precious to spend on tedious and non-obvious phenomena.

Formally, the task is defined as follows: given a news headline $x = (x_1, x_2, \dots, x_n)$, where $x_j$ represents the $j$-th word in $x$, the objective is to find its possible category or label $c \in \mathcal{C}$. More specifically, we need to find a function that predicts which category $x$ belongs to:

$$c^* = \arg\max_{c \in \mathcal{C}} f(x; \theta_c) \tag{1}$$

where $\theta$ is the parameter of the function.
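
The prediction step can be sketched as an argmax over per-category scores. This is only an illustration: the keyword-counting scorers below are hypothetical stand-ins for a learned scoring function, and the names `predict_category`, `keyword_scorer` and `toy_scorers` are assumptions, not part of the released code.

```python
from typing import Callable, Dict, List, Set

Scorer = Callable[[List[str]], float]


def predict_category(headline: List[str], scorers: Dict[str, Scorer]) -> str:
    """Return the category whose scoring function gives the highest score."""
    return max(scorers, key=lambda category: scorers[category](headline))


def keyword_scorer(keywords: Set[str]) -> Scorer:
    """Hypothetical scorer: counts how many headline words are in a keyword set."""
    return lambda words: float(sum(word in keywords for word in words))


# Toy scorers for two of the 18 categories; the keyword sets are made up.
toy_scorers = {
    "sports": keyword_scorer({"足协", "高洪波"}),
    "finance": keyword_scorer({"资金", "股"}),
}

# A segmented headline taken from the sample table in the Data section.
headline = "高洪波 : 足协 眼中 的 应急 郎中".split()
predicted = predict_category(headline, toy_scorers)
```

In a real system each scorer would be a trained model sharing parameters across categories, typically producing all 18 scores in one forward pass.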

2 Data

We collected news headlines (titles) from several Chinese news websites, such as Toutiao and Sina.

There are 18 categories in total. Detailed information on each category is shown in Table 1. All sentences are segmented with the Python Chinese word segmentation tool jieba.

Category Train Dev Test
entertainment 10000 2000 2000
sports 10000 2000 2000
car 10000 2000 2000
society 10000 2000 2000
tech 10000 2000 2000
world 10000 2000 2000
finance 10000 2000 2000
game 10000 2000 2000
travel 10000 2000 2000
military 10000 2000 2000
history 10000 2000 2000
baby 10000 2000 2000
fashion 10000 2000 2000
food 10000 2000 2000
discovery 4000 2000 2000
story 4000 2000 2000
regimen 4000 2000 2000
essay 4000 2000 2000
Table 1: Number of headlines per category in the training, development and test sets.
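
As a quick consistency check, the per-category counts in Table 1 can be summed to obtain the overall split sizes. The sketch below copies the numbers from the table; the variable names are illustrative only.

```python
# Per-category sizes copied from Table 1 as (train, dev, test).
LARGE = (10000, 2000, 2000)  # the 14 larger categories
SMALL = (4000, 2000, 2000)   # the 4 smaller categories

large_categories = [
    "entertainment", "sports", "car", "society", "tech", "world", "finance",
    "game", "travel", "military", "history", "baby", "fashion", "food",
]
small_categories = ["discovery", "story", "regimen", "essay"]

sizes = {name: LARGE for name in large_categories}
sizes.update({name: SMALL for name in small_categories})

# Sum each column over all 18 categories.
train_total = sum(s[0] for s in sizes.values())
dev_total = sum(s[1] for s in sizes.values())
test_total = sum(s[2] for s in sizes.values())

print(len(sizes), train_total, dev_total, test_total)
```

The totals come to 156000 training, 36000 development and 36000 test headlines, which agrees with the split sizes listed in Table 3.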

Some samples from the training dataset are shown in Table 2.

Category Title Sentence
world 首辩 在 即 希拉里 特朗普 如何 备战
society 山东 实现 城乡 环卫 一体化 全 覆盖
finance 除了 稀土 股 , 还有 哪个 方向 好戏 即将 ..
travel 独库 公路 再次 爆发 第三次 泥石流 无法 …
finance 主力 资金 净流入 9000 万 以上 28 股 …
sports 高洪波 : 足协 眼中 的 应急 郎中
entertainment 世界级 十大 喜剧之王 排行榜
Table 2: Samples from dataset. The first column is Category and the second column is news headline.

Length

Figure 1 shows that most title sentences contain fewer than 40 characters, with a mean of 21.05. Measured in words, titles are even shorter: most contain fewer than 20 words, with a mean of 12.07.
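
A minimal sketch of how such length statistics could be computed from segmented titles is shown below. It assumes that words are separated by single spaces, as in the samples of Table 2, and that spaces are not counted as characters; whether the reported figures count punctuation or spaces is not stated, so this counting convention is an assumption.

```python
from statistics import mean
from typing import List, Tuple


def title_lengths(segmented_title: str) -> Tuple[int, int]:
    """Return (number of characters, number of words) for a segmented title."""
    words = segmented_title.split()
    n_chars = sum(len(word) for word in words)  # spaces are not counted
    return n_chars, len(words)


# Three segmented titles taken from Table 2.
samples: List[str] = [
    "首辩 在 即 希拉里 特朗普 如何 备战",
    "山东 实现 城乡 环卫 一体化 全 覆盖",
    "世界级 十大 喜剧之王 排行榜",
]

lengths = [title_lengths(title) for title in samples]
mean_chars = mean(n_chars for n_chars, _ in lengths)
mean_words = mean(n_words for _, n_words in lengths)
```

For the first sample this gives 14 characters and 7 words.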

The dataset is released on GitHub at https://github.com/FudanNLP/nlpcc2017_news_headline_categorization along with code that implements three basic models.

Figure 1: Length distributions of the title sentences, measured in characters and in words.
Split Size Avg. Chars Avg. Words
train 156000 22.06 13.08
dev. 36000 22.05 13.09
test 36000 22.05 13.08
Table 3: Statistical information of the dataset.

3 Evaluation

We use the macro-averaged precision, recall and F1 to evaluate the performance.

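The macro average is defined as follows:

$$\mathrm{Macro\_avg} = \frac{1}{m} \sum_{i=1}^{m} \rho_i$$

and the micro average is defined as:

$$\mathrm{Micro\_avg} = \frac{1}{N} \sum_{i=1}^{m} w_i \rho_i$$

where $m$ denotes the number of classes (18 for this dataset), $\rho_i$ is the accuracy of the $i$-th category, $w_i$ is the number of test examples in the $i$-th category, and $N$ is the total number of examples in the test set.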
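
The two averages can be sketched in a few lines: the macro average is the unweighted mean of the per-category accuracies, while the micro average weights each category by its number of test examples. The function names and the toy labels below are illustrative assumptions, not the official scoring script.

```python
from collections import Counter
from typing import Dict, List, Tuple


def per_class_accuracy(gold: List[str], pred: List[str]) -> Dict[str, Tuple[float, int]]:
    """Map each gold category to (accuracy within it, number of examples in it)."""
    total = Counter(gold)
    correct = Counter(g for g, p in zip(gold, pred) if g == p)
    return {c: (correct[c] / total[c], total[c]) for c in total}


def macro_micro(gold: List[str], pred: List[str]) -> Tuple[float, float]:
    """Return (macro average, micro average) of the per-category accuracy."""
    stats = per_class_accuracy(gold, pred)
    m = len(stats)  # number of categories
    n = len(gold)   # total number of test examples
    macro = sum(acc for acc, _ in stats.values()) / m
    micro = sum(count * acc for acc, count in stats.values()) / n
    return macro, micro


# Toy example: category "a" is always right, category "b" is always wrong.
gold = ["a", "a", "a", "a", "b"]
pred = ["a", "a", "a", "a", "a"]
macro, micro = macro_micro(gold, pred)
```

In this toy case the macro average is 0.5 and the micro average is 0.8, which shows how the macro average penalizes poor performance on a small category more heavily. Because every category has 2000 test examples in this dataset, the two averages coincide on the test set.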

4 Baseline Implementations

As a branch of machine learning, deep learning (DL) has gained much attention in recent years due to its prominent achievements in several domains such as computer vision and natural language processing.

We have implemented some basic DL models: neural bag-of-words (NBoW), convolutional neural networks (CNN) [Kim, 2014] and long short-term memory networks (LSTM) [Hochreiter and Schmidhuber, 1997].

Empirically, 2 GB of GPU memory should be sufficient for most models; if it is not, set the batch size to a smaller number.
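
Of the baselines named above, NBoW is the simplest to illustrate. The sketch below shows a generic NBoW forward pass (average the word embeddings, apply a linear layer, then a softmax) in plain Python with random, untrained parameters. The architecture details, dimensions and names are assumptions for illustration; the released baseline code may differ.

```python
import math
import random
from typing import Dict, List


def nbow_forward(
    words: List[str],
    embeddings: Dict[str, List[float]],
    weights: List[List[float]],
    biases: List[float],
) -> List[float]:
    """Average known word embeddings, apply a linear layer, return softmax probabilities."""
    dim = len(next(iter(embeddings.values())))
    known = [embeddings[w] for w in words if w in embeddings]
    if known:
        avg = [sum(vec[d] for vec in known) / len(known) for d in range(dim)]
    else:
        avg = [0.0] * dim  # no known word: fall back to a zero vector
    logits = [sum(w * a for w, a in zip(row, avg)) + b for row, b in zip(weights, biases)]
    top = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


random.seed(0)
dim, n_classes = 8, 18  # embedding size is arbitrary; 18 matches the task
vocab = ["山东", "实现", "城乡", "环卫"]
embeddings = {w: [random.gauss(0, 0.1) for _ in range(dim)] for w in vocab}
weights = [[random.gauss(0, 0.1) for _ in range(dim)] for _ in range(n_classes)]
biases = [0.0] * n_classes

probs = nbow_forward("山东 实现 城乡 环卫 一体化".split(), embeddings, weights, biases)
```

The out-of-vocabulary word in the example is simply skipped; a trained model would usually map it to a dedicated unknown-word embedding instead.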

The results generated from baseline models are shown in Table 4.

Model Macro P Macro R Macro F Accuracy
LSTM 0.760 0.747 0.7497 0.747
CNN 0.769 0.763 0.764 0.763
NBoW 0.791 0.783 0.784 0.783
Table 4: Results of the baseline models.
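
For reference, macro-averaged precision, recall and F1 of the kind reported in Table 4 could be computed along the following lines. The source does not state whether macro F is the mean of the per-category F1 scores or the harmonic mean of macro precision and macro recall; this sketch assumes the former, and the function name is hypothetical.

```python
from typing import List, Tuple


def macro_prf(gold: List[str], pred: List[str], labels: List[str]) -> Tuple[float, float, float]:
    """Return macro-averaged (precision, recall, F1) over the given labels."""
    precisions, recalls, f1s = [], [], []
    for c in labels:
        tp = sum(g == c and p == c for g, p in zip(gold, pred))
        fp = sum(g != c and p == c for g, p in zip(gold, pred))
        fn = sum(g == c and p != c for g, p in zip(gold, pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        precisions.append(prec)
        recalls.append(rec)
        f1s.append(f1)
    k = len(labels)
    return sum(precisions) / k, sum(recalls) / k, sum(f1s) / k


# Toy example with two categories.
gold_labels = ["a", "a", "b", "b"]
pred_labels = ["a", "b", "b", "b"]
macro_p, macro_r, macro_f = macro_prf(gold_labels, pred_labels, ["a", "b"])
```

Here category "a" has precision 1 and recall 0.5, and category "b" has precision 2/3 and recall 1, so the macro precision is 5/6 and the macro recall is 0.75.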

5 Participants Submitted Results

Participant Macro P Macro R Macro F Accu.
P1 0.831 0.829 0.830 0.829
P2 0.828 0.825 0.826 0.825
P3 0.818 0.814 0.816 0.814
P4 0.816 0.809 0.813 0.809
P5 0.812 0.809 0.810 0.809
P6 0.811 0.807 0.809 0.807
P7 0.809 0.804 0.806 0.804
P8 0.806 0.802 0.804 0.802
P9 0.803 0.800 0.802 0.800
P10 0.805 0.800 0.802 0.800
P11 0.799 0.798 0.798 0.798
P12 0.797 0.795 0.796 0.795
P13 0.793 0.789 0.791 0.789
P14 0.791 0.789 0.790 0.789
P15 0.792 0.787 0.789 0.786
P16 0.786 0.783 0.785 0.783
P17 0.778 0.775 0.777 0.775
P18 0.785 0.775 0.780 0.775
P19 0.785 0.775 0.780 0.775
P20 0.766 0.765 0.765 0.765
P21 0.768 0.759 0.764 0.759
P22 0.768 0.748 0.758 0.748
P23 0.744 0.729 0.736 0.729
P24 0.729 0.726 0.728 0.726
P25 0.745 0.700 0.722 0.700
P26 0.734 0.688 0.710 0.688
P27 0.698 0.685 0.691 0.685
P28 0.640 0.633 0.637 0.633
P29 0.645 0.629 0.637 0.629
P30 0.437 0.430 0.433 0.430
P31 0.474 0.399 0.433 0.399
P32 0.053 0.056 0.054 0.056
Table 5: Results submitted by participants.

In total, 32 participants submitted predictions on the test set. The predictions were evaluated, and the results are shown in Table 5.

6 Conclusion

Since machine learning techniques such as deep learning require large amounts of data, we have collected a considerable amount of news headline data and contributed it to the research community.

References