HSD Shared Task in VLSP Campaign 2019: Hate Speech Detection for Social Good

07/13/2020 ∙ by Xuan-Son Vu, et al. ∙ CSIRO ∙ Umeå universitet ∙ VNU

This paper describes the organisation of the "Hate Speech Detection" (HSD) task at the VLSP workshop 2019, which concerns detecting the fine-grained presence of hate speech in Vietnamese textual items (i.e., messages) extracted from Facebook, the most popular social network site (SNS) in Vietnam. The task is organised as a multi-class classification task based on a large-scale dataset containing 25,431 Vietnamese textual items from Facebook. Participants were challenged to build a classification model capable of assigning an item to one of 3 classes, i.e., "HATE", "OFFENSIVE" and "CLEAN". HSD attracted a large number of participants and was a popular task at VLSP 2019. In particular, 71 teams signed up for the task, 14 of which submitted results, with 380 valid submissions from 20th September 2019 to 4th October 2019.




I Introduction

On social network sites (SNSs), such as Facebook and Twitter, the threat of online abuse and harassment makes many users stop expressing themselves and stop seeking out different opinions. This problem is not trivial to handle, and SNSs have been struggling with it. For example, SNSs might limit or even completely shut down the user post/comment functions in some communities (groups) that produce “not-clean” content. This, however, creates the further issue of blocking “clean” content produced by the same communities.

To handle the problem, one popular strategy is to train systems capable of recognising hateful (“not-clean”) content, which can then be removed or quarantined by community moderators. In the last few years, much attention has been paid to the problem of detecting hateful content in SNSs [1, 2, 3, 4]. However, the research has focused mainly on popular languages, such as English [5, 1, 3, 4]. Despite the large number of SNS users in Vietnam, expected to reach 48 million by the end of 2019 (http://bit.ly/number-of-social-network-users-in-vietnam/), there is, to our knowledge, no publicly available research on hate speech detection for Vietnamese.

To this end, we first introduce the task of hate speech detection (HSD) in SNSs for Vietnamese with the aim of supporting more effective conversations in SNSs. The task was organised at VLSP 2019, the sixth annual international workshop, held in conjunction with the 2019 Conference of the Pacific Association for Computational Linguistics (PACLING 2019).

The remainder of the paper is organised as follows. Section II describes the data collection and annotation methodologies. The shared task description and evaluation are summarised in Section III. Section IV describes the participants and results. Section V concludes the paper and outlines possible designs for next year's challenge.

II Data Collection and Annotation

A general corpus was first collected from Facebook posts and comments. From this general corpus, we used a neural model to select 25,431 items for manual annotation. The selection pipeline was as follows:

  • Based on the top obscene keywords in Vietnamese (https://github.com/vietnlp/vlsp2019_hatespeech_task/), we applied semantic search to find the 200 most relevant items in the collected corpus.

  • Annotators were asked to initially annotate these 200 items, i.e., to assign each of them to one of three classes: hate speech (HATE), offensive but not hate speech (OFFENSIVE), or neither offensive nor hate speech (CLEAN).

  • Based on these 200 annotated items, we built a classifier to predict the probability that each item in the general corpus belongs to each of the 3 classes.

  • The top items for each class were selected until we reached a total of 25,431 items. Since most items normally belong to the CLEAN class, we prioritised items belonging to the less frequent classes (i.e., {HATE, OFFENSIVE}) unless either of them had more than 9,000 items (i.e., more than of ).
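The last step of the pipeline can be sketched as follows. This is only an illustration under stated assumptions, not the organisers' implementation: it assumes each item is assigned to its most probable class, that HATE and OFFENSIVE are filled first (each capped at `cap` items), and that CLEAN fills the remaining budget. The function name `select_items` and its parameters are hypothetical.

```python
from typing import Dict, List

CLASSES = ("HATE", "OFFENSIVE", "CLEAN")


def select_items(probs: List[Dict[str, float]], total: int, cap: int) -> List[int]:
    """Return indices of items chosen for annotation.

    probs[i] maps each class name to the classifier's predicted probability
    for item i (hypothetical input format).
    """
    # Group item indices by their most probable (argmax) class.
    by_class: Dict[str, List[int]] = {c: [] for c in CLASSES}
    for i, p in enumerate(probs):
        predicted = max(CLASSES, key=lambda c: p[c])
        by_class[predicted].append(i)

    selected: List[int] = []
    # Minority classes are filled first, most confident items first,
    # with at most `cap` items per class.
    for cls in ("HATE", "OFFENSIVE"):
        ranked = sorted(by_class[cls], key=lambda i: probs[i][cls], reverse=True)
        room = total - len(selected)
        selected.extend(ranked[: min(cap, room)])

    # CLEAN items use whatever budget is left.
    ranked = sorted(by_class["CLEAN"], key=lambda i: probs[i]["CLEAN"], reverse=True)
    selected.extend(ranked[: total - len(selected)])
    return selected
```

With the figures given in the text, a call would look like `select_items(probs, total=25431, cap=9000)`.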

II-A Data Annotation

We asked twenty-five annotators to manually annotate the 25,431 selected items within one month. Each item was labelled by three annotators as one of three categories: hate speech (HATE), offensive but not hate speech (OFFENSIVE), or neither offensive nor hate speech (CLEAN). The annotators were provided with our pre-defined annotation guideline, in which each category is associated with a definition and a paragraph explaining the definition in detail. The annotators were asked to consider not only the terms (words) appearing in a given item but also the context in which they thought the terms were used. The annotators were also instructed that the presence of particular words, such as offensive words, does not necessarily indicate that the corresponding item is hate speech. Since each item was annotated by three annotators, we used a majority voting scheme to decide the final label of the item.
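The majority voting step can be illustrated with a short sketch. The text does not say how an item was handled when all three annotators disagreed, so returning `None` in that case is an assumption made here for illustration; the function name `majority_label` is hypothetical.

```python
from collections import Counter
from typing import Optional, Sequence


def majority_label(votes: Sequence[str]) -> Optional[str]:
    """Return the label chosen by a strict majority of annotators.

    Returns None when there are no votes or no label has a strict majority
    (e.g. three annotators give three different labels); how such ties were
    resolved in the shared task is not stated, so this is an assumption.
    """
    if not votes:
        return None
    label, count = Counter(votes).most_common(1)[0]
    return label if count > len(votes) / 2 else None
```

For example, the votes `["HATE", "HATE", "CLEAN"]` resolve to `"HATE"`, while three different labels yield no majority.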

Here is a detailed explanation of each of the three classes:

  • Hate speech (HATE): an item is identified as hate speech if it (1) targets individuals or groups on the basis of their characteristics; (2) demonstrates a clear intention to incite harm or to promote hatred; and (3) may or may not use offensive or profane words. For example: “Assimilate? No they all need to go back to their own countries. #BanMuslims Sorry if someone disagrees too bad.” (see the definition of Zhang et al. [6]). In contrast, “All you perverts (other than me) who posted today, needs to leave the O Board” is an example of abusive language, which often bears the purpose of insulting individuals or groups, and can include hate speech, derogatory and offensive language.

  • Offensive but not hate speech (OFFENSIVE): an item (post or comment) may contain offensive words, but it does not target individuals or groups on the basis of their characteristics. E.g., “WTF, tomorrow is Monday already?”

  • Neither offensive nor hate speech (CLEAN): a normal item that contains neither offensive language nor hate speech. E.g., “She learned how to paint very hard when she was young”.

II-B Data Pre-processing

As the data might contain sensitive information such as email addresses and phone numbers, we ran a pre-processing step to remove or anonymise it. The following information in the user posts/comments was pre-processed:

  1. All links are replaced by URL.

  2. The last three digits of phone numbers are replaced by XXX.

  3. The first part of each email address is replaced by AAA.
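The three rules above can be sketched with regular expressions. The patterns below are illustrative assumptions (the organisers' actual patterns are not given): a URL is taken to start with `http://`, `https://` or `www.`, a phone number is taken to be an unbroken run of 9 to 12 digits, and the "first part" of an email address is taken to be everything before the `@`.

```python
import re

# Assumed patterns; the real pre-processing rules may differ.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
EMAIL_LOCAL_RE = re.compile(r"[\w.+-]+@(?=[\w-]+\.)")
PHONE_RE = re.compile(r"(?<!\d)(\+?\d{6,9})\d{3}(?!\d)")


def anonymise(text: str) -> str:
    """Apply the three anonymisation rules in a fixed order."""
    # 1. Replace links first so digits inside URLs are not treated as phones.
    text = URL_RE.sub("URL", text)
    # 3. Replace the part of an email address before the '@' with AAA.
    text = EMAIL_LOCAL_RE.sub("AAA@", text)
    # 2. Replace the last three digits of a phone number with XXX.
    text = PHONE_RE.sub(r"\1XXX", text)
    return text
```

Running the URL rule before the phone rule is a design choice of this sketch: it prevents long digit runs inside links from being partially masked.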

Although we tried to anonymise sensitive information, the data itself remains very sensitive. Therefore, we stated that, by joining the challenge, participants agree not to attempt to re-identify the owner of any post or comment in any form or circumstance.

III Shared Task Description and Evaluation

In this shared task, participants were challenged to build a multi-class classification model capable of assigning an item to one of three classes (HATE, OFFENSIVE, CLEAN). The prepared dataset was provided to all participants. The data were randomly split into two parts: the training data and the test data. The test data contains both a “public-test” set and a “private-test” set. The public-test set allowed all participating teams to tune their proposed models; each team could make at most five submissions per day. The final ranking was based on the private-test set, which was used to ensure that the predictive models were not over-fitted to the training data and hence performed equally well on unseen data. The evaluation metric used in the shared task is the macro-averaged F1 score (Macro-F1), which in its standard form is the unweighted mean of the per-class F1 scores.
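A minimal sketch of Macro-F1, assuming the common definition in which the F1 score of each class c is 2·P_c·R_c / (P_c + R_c), with P_c and R_c the precision and recall of class c, and Macro-F1 is the unweighted mean over the three classes. Treating an undefined precision, recall or F1 as 0 is a convention chosen here, not something stated by the organisers.

```python
from typing import Sequence

LABELS = ("HATE", "OFFENSIVE", "CLEAN")


def macro_f1(y_true: Sequence[str], y_pred: Sequence[str],
             labels: Sequence[str] = LABELS) -> float:
    """Unweighted mean of per-class F1 scores."""
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        # Undefined ratios are set to 0.0 (assumed convention).
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision + recall:
            scores.append(2 * precision * recall / (precision + recall))
        else:
            scores.append(0.0)
    return sum(scores) / len(scores)
```

Because every class contributes equally, a model that ignores a rare class such as HATE is penalised more heavily than it would be under plain accuracy.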


IV Participants and Results

In total, 71 teams registered for this year’s challenge, and 35 of them obtained the data after sending the signed user agreement. In the end, only 14 teams participated, making 380 submissions during the 14 days from 20 September 2019 to 04 October 2019. The performances of the top five teams on the public-test and the private-test are detailed in Table I. The average Macro-F1 of the top five teams on the public-test is about 12.7 percentage points higher than that on the private-test (0.71271 versus 0.58589). Moreover, although the public-test and private-test are distributed differently, 5 out of 8 teams stay in the top-8 of both the public-test and the private-test. This means that competing to achieve higher scores on the public-test with the expectation of getting higher performance on the private-test still holds, even with the highly different distributions of the public-test and private-test data.

#   Public-Test     Macro-F1   Private-Test            Macro-F1
1   Try hard        0.73019    SunBear (1st place)     0.61971
2   HH_UIT          0.71432    ABCD (2nd place)        0.58883
3   titanic         0.70747    Try hard (3rd place)    0.58455
4   ABCD            0.70582    Cr4zy (on-hold)         0.57357
5   TIN HUYNH       0.70576    BA (on-hold)            0.56281
-   Top-5 Average   0.71271    Top-5 Average           0.58589
TABLE I: Top 5 teams on public-test and private-test. Evaluation metric is Macro-F1.
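The top-5 averages in Table I, and the gap between them, can be checked with a few lines of arithmetic. The scores are copied from the table; the variable names are illustrative.

```python
# Macro-F1 scores of the top five teams, copied from Table I.
public_scores = [0.73019, 0.71432, 0.70747, 0.70582, 0.70576]
private_scores = [0.61971, 0.58883, 0.58455, 0.57357, 0.56281]


def mean(values):
    """Arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)


public_avg = mean(public_scores)
private_avg = mean(private_scores)
# Absolute gap between public-test and private-test averages.
gap = public_avg - private_avg
```

Rounded to five decimals, the two means match the "Top-5 Average" row of the table.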
#   Team                   Public-test Macro-F1   Private-test Macro-F1   Final Approach                          Ensemble?   Deep learning?
1   SunBear (1st place)    0.67756                0.61971                 Logistic Regression (LR)                Yes         No
2   ABCD (2nd place)       0.70582                0.58883                 LR, Extra Trees, Random Forest          Yes         No
3   Try hard (3rd place)   0.73019                0.58455                 VDCNN, TextCNN, LSTM, LSTMCNN, SARNN    Yes         Yes
4   HH_UIT                 0.71432                0.56281                 Bi-LSTM                                 No          Yes
5   TIN HUYNH              0.70576                0.51705                 Bi-GRU-LSTM-CNN                         No          Yes
TABLE II: Top 5 teams on public-test and private-test with submitted papers and their final approaches. The rank is based on the macro-averaged F1 scores on the private-test.

Each of the top-5 teams on the public-test and the private-test was qualified to submit a paper describing its predictive model to the VLSP workshop. Five teams submitted papers: (1) SunBear, (2) ABCD, (3) Try hard, (4) HH_UIT, and (5) TIN HUYNH. The predictive models are described in Table II. It can be seen that although deep learning works well on the public-test data, conventional feature-based machine learning works better on the private-test data. Furthermore, all the top-3 performing models on the private-test data utilised ensemble learning. This is not a new phenomenon; however, we would like to re-confirm that ensemble learning is applicable to the HSD task in Vietnamese as well.

In particular, the SunBear team proposed to use logistic regression, a conventional feature-based machine learning model, to handle the task. They used the 35,000 most frequent n-grams extracted from the dataset as the input features for training the model. Ensemble learning was then employed to achieve the best macro-averaged F1 score of 61.97% on the private-test, which is about 3 percentage points higher than that of the ABCD team, the second-best performer. They also showed that data pre-processing and normalisation played a very important role in the success of their model, as data from SNSs contains many abbreviations and typos which need to be handled well before training the model. Similarly, the second-best performing team (ABCD), with a macro-averaged F1 of 58.88% on the private-test data, also employed stacking ensemble learning on the outputs of logistic regression models. Their proposed model used many feature types, including word n-grams, part-of-speech tags and numeric features.
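The idea of using the most frequent n-grams as input features can be sketched as follows. This is not the SunBear implementation: whitespace tokenisation, lower-casing, word-level n-grams up to length `n_max`, and raw counts as feature values are all assumptions made for illustration, and the function names are hypothetical. The resulting count vectors would then be fed to a classifier such as logistic regression.

```python
from collections import Counter
from typing import Dict, Iterable, Iterator, List, Tuple

Gram = Tuple[str, ...]


def ngrams(tokens: List[str], n_max: int) -> Iterator[Gram]:
    """Yield all word n-grams of length 1..n_max."""
    for n in range(1, n_max + 1):
        for i in range(len(tokens) - n + 1):
            yield tuple(tokens[i:i + n])


def build_vocab(texts: Iterable[str], k: int, n_max: int = 2) -> Dict[Gram, int]:
    """Map each of the k most frequent n-grams to a feature index."""
    counts: Counter = Counter()
    for text in texts:
        counts.update(ngrams(text.lower().split(), n_max))
    return {gram: idx for idx, (gram, _) in enumerate(counts.most_common(k))}


def vectorise(text: str, vocab: Dict[Gram, int], n_max: int = 2) -> List[int]:
    """Count how often each vocabulary n-gram occurs in the text."""
    vector = [0] * len(vocab)
    for gram in ngrams(text.lower().split(), n_max):
        idx = vocab.get(gram)
        if idx is not None:
            vector[idx] += 1
    return vector
```

With the figure from the text, the vocabulary would be built as `build_vocab(train_texts, k=35000)`, where `train_texts` is a placeholder for the training items.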

The remaining three teams employed deep learning to handle the classification problem and achieved good performance on both the public-test and private-test data, especially on the public-test. All the proposed deep learning models relied on various pre-trained word embeddings, which is consistent with the findings detailed in the ETNLP paper [7]. The advantage of deep learning is that there is no need to hand-craft features. While the other models did not use word segmentation, the Try hard team applied Vietnamese word segmentation [8] to the dataset and achieved the best performance on the public-test and the third-best performance on the private-test. Moreover, the best-performing team on both the public-test and private-test data without ensemble learning is HH_UIT, which employed a Bi-LSTM with fastText embeddings to handle the task.

V Conclusions

The Hate Speech Detection (HSD) shared task in the VLSP Campaign 2019 has been a valuable exercise in building predictive models to filter out hate speech content on social networks. It has brought together different teams working towards a common goal. For the next VLSP campaign in 2020, we plan to hold a similar challenge using social network data to better support society in the information age.


The authors would like to thank the InfoRE Technology Company, the team of AiViVN.Com, and the twenty-five annotators for their hard work to support the shared task. Without their support, the task would not have been possible.


  • [1] T. Davidson, D. Warmsley, M. Macy, and I. Weber, “Automated hate speech detection and the problem of offensive language,” in Eleventh International AAAI Conference on Web and Social Media, 2017.
  • [2] M. Wiegand, M. Siegel, and J. Ruppenhofer, “Overview of the GermEval 2018 shared task on the identification of offensive language,” 2018.
  • [3] R. Kumar, A. K. Ojha, S. Malmasi, and M. Zampieri, “Benchmarking aggression identification in social media,” in Proceedings of the First Workshop on Trolling, Aggression and Cyberbullying (TRAC-2018). Santa Fe, New Mexico, USA: Association for Computational Linguistics, Aug. 2018, pp. 1–11.
  • [4] V. Basile, C. Bosco, E. Fersini, D. Nozza, V. Patti, F. M. R. Pardo, P. Rosso, and M. Sanguinetti, “SemEval-2019 task 5: Multilingual detection of hate speech against immigrants and women in Twitter,” in Proceedings of the 13th International Workshop on Semantic Evaluation, 2019, pp. 54–63.
  • [5] J.-M. Xu, K.-S. Jun, X. Zhu, and A. Bellmore, “Learning from bullying traces in social media,” in Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics, 2012, pp. 656–666.
  • [6] Z. Zhang and L. Luo, “Hate speech detection: A solved problem? The challenging case of long tail on Twitter,” CoRR, vol. abs/1803.03662, 2018.
  • [7] X.-S. Vu, T. Vu, S. N. Tran, and L. Jiang, “ETNLP: A visual-aided systematic approach to select pre-trained embeddings for a downstream task,” in Proceedings of the International Conference Recent Advances in Natural Language Processing (RANLP), 2019.
  • [8] D. Q. Nguyen, D. Q. Nguyen, T. Vu, M. Dras, and M. Johnson, “A fast and accurate Vietnamese word segmenter,” in Proceedings of the Eleventh International Conference on Language Resources and Evaluation, LREC 2018, Miyazaki, Japan, May 7-12, 2018.