User Generated Data: Achilles' heel of BERT

03/29/2020 ∙ by Ankit Kumar, et al.

Pre-trained language models such as BERT are known to perform exceedingly well on various NLP tasks and have even established new State-Of-The-Art (SOTA) benchmarks for many of these tasks. Owing to its success on various tasks and benchmark datasets, industry practitioners have started to explore BERT to build applications solving industry use cases. These use cases are known to have much more noise in the data as compared to benchmark datasets. In this work we systematically show that when the data is noisy, there is a significant degradation in the performance of BERT. Specifically, we perform experiments using BERT on popular tasks such as sentiment analysis and textual similarity. For this, we work with three well-known datasets, IMDB movie reviews, SST-2 and STS-B, to measure the performance. Further, we examine the reason behind this performance drop and identify the shortcomings in the BERT pipeline.

1 Introduction

Pre-trained contextual language models have provided significant improvements in performance on many NLP tasks. In this work, we focus only on the BERT [devlin2018bert] model. The strength of the BERT model (Figure 1) stems from its transformer [vaswani2017attention] based encoder architecture (Figure 2). It takes as input the tokens obtained by applying the WordPiece tokenizer [wu2016google] and learns contextualized embeddings for each of those tokens. It does so via pre-training on two tasks: Masked Language Model (MLM) and Next Sentence Prediction (NSP). In this paper, we work with the pre-trained BERTBASE (L=12, H=768, A=12, Total Parameters=110M) model [devlin2018bert], which is trained on English Wikipedia and the BooksCorpus [zhu2015aligning].

Figure 1: BERT architecture [devlin2018bert]
Figure 2: The Transformer model architecture [vaswani2017attention]

The focus of this work is to understand the issues one might run into when using BERT to build NLP applications in industrial settings. These applications have to deal with noisy data. Specifically, we examine how BERT handles Out Of Vocabulary (OOV) words, given its limited vocabulary.

At present, the way it handles OOV words is as follows:

  • It tries to find an exact match for the token in its vocabulary. If it does not find a match, it breaks the word into constituent subwords.

  • If one or more of the constituent subwords are not found in its vocabulary, the word is replaced by the UNK (unknown) token, as illustrated in the sketch below.
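
A minimal illustration of this behaviour, assuming the huggingface transformers library and the bert-base-uncased checkpoint used later in our experiments; the exact splits depend on the vocabulary shipped with the checkpoint:

    from transformers import BertTokenizer

    # Load the WordPiece vocabulary that ships with bert-base-uncased.
    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

    # An in-vocabulary word is kept as a single token.
    print(tokenizer.tokenize("nature"))   # ['nature']

    # An OOV word is greedily broken into known subwords.
    print(tokenizer.tokenize("natuee"))   # ['nat', '##ue', '##e']

    # A token that cannot be decomposed at all (e.g. an emoji) maps to the UNK token.
    print(tokenizer.tokenize("🙂"))       # typically ['[UNK]']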

In this paper, we evaluate the performance of BERT on noisy data. By noisy data we mean non-canonical text such as abbreviations, slang, internet jargon, emojis, embedded metadata (such as hashtags, URLs or @-mentions), non-standard syntactic constructions and spelling variations. Noisy data is a hallmark of user-generated text content. This evaluation is motivated by the business use case we are solving, where we use a dialogue system to screen candidates for blue-collar jobs. Our user base, coming from underprivileged backgrounds, tends to make a lot of typos and spelling mistakes while typing their responses. Hence, for this work we focus on spelling mistakes as the primary noise in the data. While this work is motivated by our business use case, it is applicable across various use cases in industry, be it sentiment classification on Twitter data or a mobile-based chat bot.

To simulate noise in the data, we begin with a clean dataset and introduce spelling errors in a fraction (x%) of its words. These words are chosen randomly. The spelling mistakes introduced mimic typographical errors made by users, arising from 'fat finger' presses on a QWERTY mobile keypad. We run the BERT model over the clean and noisy datasets and compare the results. We show that the introduction of noise leads to a significant drop in the performance of the model as compared to the clean dataset. We further show that as we increase the amount of noise in the data, the performance of the BERT model degrades sharply.
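
The exact noise-injection script is not reproduced here; the sketch below shows one plausible implementation, with an assumed, partial QWERTY adjacency map and assumed helpers fat_finger and add_noise:

    import random

    # Assumed, partial QWERTY adjacency map; extend it to cover the full keypad.
    QWERTY_NEIGHBOURS = {
        "a": "qwsz", "b": "vghn", "c": "xdfv", "d": "serfcx", "e": "wsdr",
        "f": "drtgvc", "g": "ftyhbv", "h": "gyujnb", "i": "ujko", "j": "huikmn",
        "k": "jiolm", "l": "kop", "m": "njk", "n": "bhjm", "o": "iklp",
        "p": "ol", "q": "wa", "r": "edft", "s": "awedxz", "t": "rfgy",
        "u": "yhji", "v": "cfgb", "w": "qase", "x": "zsdc", "y": "tghu", "z": "asx",
    }

    def fat_finger(word):
        """Replace one random character with a neighbouring key on a QWERTY keypad."""
        positions = [i for i, ch in enumerate(word) if ch.lower() in QWERTY_NEIGHBOURS]
        if not positions:
            return word
        i = random.choice(positions)
        typo = random.choice(QWERTY_NEIGHBOURS[word[i].lower()])
        return word[:i] + typo + word[i + 1:]

    def add_noise(text, error_rate):
        """Introduce fat-finger typos into roughly `error_rate` of the words."""
        return " ".join(
            fat_finger(w) if random.random() < error_rate else w for w in text.split()
        )

    # Example: create a 5% noise variant of a sentence.
    print(add_noise("that loves its characters and communicates something", 0.05))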

2 Related Work

Pre-trained large language models, such as BERT [devlin2018bert], have made breakthroughs in several natural language understanding tasks. These models are trained on large unlabeled corpora, which are easily available, and can then be fine-tuned on a variety of tasks such as text classification, regression, question answering and many more. BERT has established new benchmarks for many of these tasks. Industry practitioners have also started to explore BERT, for example in search engines [patel2019tinysearch].

In [baldwin2015shared] the authors introduce a lexical normalization and named entity recognition task on noisy user-generated tweets. The authors of [supranovich2015ihs_rd] introduce a Twitter lexical normalizer based on a CRF model, which also achieves the best result on the lexical normalization task of [baldwin2015shared].

Our contribution in this paper is to answer whether large language models like BERT can be applied directly to user-generated data.

3 Experiment

For our experiments, we use the pre-trained BERT implementation provided by the huggingface transformers library (https://github.com/huggingface/transformers). We use the BERTBase uncased model. We work with three datasets, namely IMDB movie reviews [maas2011learning], the Stanford Sentiment Treebank (SST-2) [socher2013recursive] and Semantic Textual Similarity (STS-B) [cer2017semeval].

The IMDB dataset is a popular dataset for sentiment analysis, which is a binary classification problem with an equal number of positive and negative examples. Both the SST-2 and STS-B datasets are part of the GLUE benchmark [2] tasks. In SST-2, too, we predict positive and negative sentiments. In STS-B we predict the textual semantic similarity between two sentences; it is a regression problem where the similarity score varies between 0 and 5. To evaluate the performance of BERT we use the standard metrics of F1 score for IMDB and SST-2, and Pearson-Spearman correlation for STS-B.
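
A minimal sketch of how these metrics can be computed, assuming scikit-learn and scipy (the exact evaluation script is not specified in the paper); reporting the average of the Pearson and Spearman correlations is one common convention for STS-B:

    from scipy.stats import pearsonr, spearmanr
    from sklearn.metrics import f1_score

    # Classification tasks (IMDB, SST-2): F1 score over predicted labels.
    y_true = [1, 0, 1, 1]
    y_pred = [1, 0, 0, 1]
    print(f1_score(y_true, y_pred))

    # Regression task (STS-B): Pearson-Spearman correlation of predicted scores.
    scores_true = [4.5, 1.0, 3.2, 0.5]
    scores_pred = [4.1, 1.5, 2.9, 0.7]
    pearson = pearsonr(scores_true, scores_pred)[0]
    spearman = spearmanr(scores_true, scores_pred)[0]
    print((pearson + spearman) / 2)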

In Table 1, we give the statistics for each of the datasets.

Dataset Training utterances Validation utterances
IMDB Movie Review 25000 25000
SST-2 67349 872
STS-B 5749 1500
Table 1: Number of utterances in each dataset

We take the original datasets and add varying degrees of noise (i.e. spelling errors in words) to create the datasets for our experiments. From each dataset, we create 4 additional datasets, each with a different level of noise. For example, from IMDB we create 4 variants having 5%, 10%, 15% and 20% noise respectively. Here, the number denotes the percentage of words in the original dataset that have spelling mistakes. Thus, we have one dataset with no noise and 4 variant datasets with increasing levels of noise. Likewise, we do the same for SST-2 and STS-B.

All the parameters of the BERTBase model remain the same across the 5 experiments on the IMDB dataset and its 4 variants, and likewise across the other 2 datasets and their variants. For all the experiments, the learning rate is set to 4e-5 and we use the Adam optimizer with an epsilon value of 1e-8. We ran each of the experiments for 10 and 50 epochs.
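
A minimal fine-tuning sketch with these hyperparameters, assuming PyTorch and a binary sentiment-classification head (the STS-B regression setup would use a regression head instead); the toy train_loader stands in for a real DataLoader over the clean or noisy dataset:

    import torch
    from transformers import BertForSequenceClassification, BertTokenizer

    device = "cuda" if torch.cuda.is_available() else "cpu"
    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    model = BertForSequenceClassification.from_pretrained(
        "bert-base-uncased", num_labels=2
    ).to(device)

    # Hyperparameters used in the experiments: lr = 4e-5, Adam epsilon = 1e-8.
    optimizer = torch.optim.Adam(model.parameters(), lr=4e-5, eps=1e-8)
    num_epochs = 10  # we also report runs with 50 epochs

    # Stand-in for a real DataLoader yielding (texts, labels) batches.
    train_loader = [(["great movie", "terrible film"], torch.tensor([1, 0]))]

    model.train()
    for epoch in range(num_epochs):
        for texts, labels in train_loader:
            batch = tokenizer(list(texts), padding=True, truncation=True,
                              return_tensors="pt").to(device)
            outputs = model(**batch, labels=labels.to(device))
            outputs.loss.backward()
            optimizer.step()
            optimizer.zero_grad()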

4 Results

Let us discuss the results from the above mentioned experiments. We show plots of performance vs. noise for each of the tasks. For IMDB, we fine-tune the model for the sentiment analysis task and plot the F1 score vs. % of error, as shown in Figure 3. Figure 3(a) shows the performance after fine-tuning for 10 epochs, while Figure 3(b) shows the performance after fine-tuning for 50 epochs.

(a) 10 epochs
(b) 50 epochs
Figure 3: F1 score vs % of error for Sentiment analysis on IMDB dataset
(a) 10 epochs
(b) 50 epochs
Figure 4: F1 score vs % of error for Sentiment analysis on SST-2 data
(a) 10 epochs
(b) 50 epochs
Figure 5: Pearson-Spearman correlation vs % of error for textual semantic similarity on STS-B dataset

Similarly, Figures 4(a) and 4(b) show the F1 score vs. % of error for sentiment analysis on the SST-2 dataset after fine-tuning for 10 and 50 epochs respectively.

Figures 5(a) and 5(b) show the Pearson-Spearman correlation vs. % of error for textual semantic similarity on the STS-B dataset after fine-tuning for 10 and 50 epochs respectively.

4.1 Key Findings

It is clear from the above plots that as we increase the percentage of error, we see a significant drop in BERT's performance on each of the three tasks. It is also evident from the plots that the drop in performance is caused by the introduction of noise (spelling mistakes); after all, we get very good numbers on each of the three tasks when there is no error (0.0% error). To understand the reason behind the drop in performance, we first need to understand how BERT processes input text data. BERT uses the WordPiece tokenizer to tokenize the text. The WordPiece tokenizer splits utterances into tokens using the longest-prefix-matching algorithm (see https://github.com/google-research/bert/blob/master/tokenization.py). The tokens thus obtained are fed as input to the BERT model.
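
A simplified sketch of the longest-prefix-match procedure is given below. This is not the reference implementation: the real tokenizer also lowercases, strips accents and splits on punctuation first, and uses the full ~30k-entry vocabulary rather than the toy one assumed here.

    def wordpiece(word, vocab, unk="[UNK]"):
        """Greedily match the longest prefix of `word` that is in `vocab`."""
        tokens, start = [], 0
        while start < len(word):
            end, match = len(word), None
            # Shrink the window from the right until a vocabulary entry is found.
            while start < end:
                piece = word[start:end]
                if start > 0:
                    piece = "##" + piece  # continuation pieces carry the '##' prefix
                if piece in vocab:
                    match = piece
                    break
                end -= 1
            if match is None:
                return [unk]  # the word cannot be decomposed into known pieces
            tokens.append(match)
            start = end
        return tokens

    # Toy vocabulary containing the pieces that appear in the examples below.
    vocab = {"communicate", "##s", "nature", "nat", "##ue", "##e"}
    print(wordpiece("communicates", vocab))  # ['communicate', '##s']
    print(wordpiece("natuee", vocab))        # ['nat', '##ue', '##e']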

When it comes to tokenizing noisy data, we see a very interesting behaviour from the WordPiece tokenizer. Owing to the spelling mistakes, misspelled words are not found directly in BERT's vocabulary, so the WordPiece tokenizer breaks them into subwords. However, the subwords it produces can have meanings that are very different from the meaning of the original word. Often, this changes the meaning of the sentence completely, leading to a substantial dip in performance.

To understand this better, let us look at two examples, one each from the IMDB and STS-B datasets, as shown below. Here, (a) is the sentence as it appears in the dataset (before adding noise), while (b) is the corresponding sentence after adding noise. Each pair of sentences is followed by the corresponding output of the WordPiece tokenizer. In the output, '##' is the WordPiece tokenizer's way of marking subwords, as opposed to full words.

Example 1 (IMDB example):

  (a) “that loves its characters and communicates something rather beautiful about human nature” (0% error)

  (b) “that loves 8ts characters abd communicates something rathee beautiful about human natuee” (5% error)

Output of the WordPiece tokenizer:

  (a) ['that', 'loves', 'its', 'characters', 'and', 'communicate', '##s', 'something', 'rather', 'beautiful', 'about', 'human', 'nature'] (0% error IMDB example)

  (b) ['that', 'loves', '8', '##ts', 'characters', 'abd', 'communicate', '##s', 'something', 'rat', '##hee', 'beautiful', 'about', 'human', 'nat', '##ue', '##e'] (5% error IMDB example)

Example 2 (STS-B example):

  (a) “poor ben bratt could n’t find stardom if mapquest emailed him point-to-point driving directions.” (0% error)

  (b) “poor ben bratt could n’t find stardom if mapquest emailed him point-to-point drivibg dirsctioge.” (5% error)

Output of the WordPiece tokenizer:

  (a) ['poor', 'ben', 'brat', '##t', 'could', 'n', "'", 't', 'find', 'star', '##dom', 'if', 'map', '##quest', 'email', '##ed', 'him', 'point', '-', 'to', '-', 'point', 'driving', 'directions', '.'] (0% error STS example)

  (b) ['poor', 'ben', 'brat', '##t', 'could', 'n', "'", 't', 'find', 'star', '##dom', 'if', 'map', '##quest', 'email', '##ed', 'him', 'point', '-', 'to', '-', 'point', 'dr', '##iv', '##ib', '##g', 'dir', '##sc', '##ti', '##oge', '.'] (5% error STS example)

In Example 1, the tokenizer splits communicates into ['communicate', '##s'] based on the longest prefix match, because there is no exact match for “communicates” in BERT's vocabulary. The longest prefix in this case is “communicate” and the leftover is “s”, both of which are present in BERT's vocabulary. We have contextual embeddings for both “communicate” and “##s”, and by combining these two embeddings one can get an approximate embedding for “communicates”. However, this approach breaks down when the word is misspelled. In Example 1(b) the word natuee ('nature' misspelled) is split into ['nat', '##ue', '##e'] based on the longest prefix match. Combining these three embeddings, one cannot approximate the embedding of nature, because the subword nat has a very different meaning (it means 'a person who advocates political independence for a particular country'). This misrepresentation in turn impacts the performance of the downstream components of BERT, bringing down the overall performance of the model. Hence, as we systematically introduce more errors, the quality of the tokenizer's output degrades further, resulting in the overall performance drop.
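
The paper does not prescribe how subword embeddings would be combined; mean-pooling the contextual output vectors of a word's subword tokens, as in the assumed helper approx_word_embedding below, is one common approximation and makes the misrepresentation measurable:

    import torch
    from transformers import BertModel, BertTokenizer

    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    model = BertModel.from_pretrained("bert-base-uncased")

    def approx_word_embedding(sentence, word):
        """Mean-pool the contextual vectors of the subword tokens of `word`."""
        enc = tokenizer(sentence, return_tensors="pt")
        with torch.no_grad():
            hidden = model(**enc).last_hidden_state[0]   # (seq_len, 768)
        subwords = tokenizer.tokenize(word)              # e.g. ['nat', '##ue', '##e']
        tokens = tokenizer.convert_ids_to_tokens(enc["input_ids"][0].tolist())
        # Locate the word's subword span inside the tokenized sentence.
        for i in range(len(tokens) - len(subwords) + 1):
            if tokens[i:i + len(subwords)] == subwords:
                return hidden[i:i + len(subwords)].mean(dim=0)
        raise ValueError(f"{word!r} not found in the tokenized sentence")

    clean = approx_word_embedding("something rather beautiful about human nature", "nature")
    noisy = approx_word_embedding("something rathee beautiful about human natuee", "natuee")
    # Cosine similarity between the clean and noisy word representations.
    print(torch.nn.functional.cosine_similarity(clean, noisy, dim=0).item())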

Our results and analysis show that one cannot apply BERT blindly to solve NLP problems, especially in industrial settings. If the application you are developing gets data from channels that are known to introduce noise in the text, then BERT will perform badly. Examples of such scenarios are applications working with Twitter data, mobile-based chat systems, and user comments on platforms like YouTube and Reddit, to name a few. The reason for the noise can vary: on Twitter and Reddit it is often deliberate, because that is how users prefer to write, while mobile-based chat often suffers from the 'fat finger' typing problem. Depending on the amount of noise in the data, BERT can perform well below expectations.

We further conducted experiments with tokenizers other than the WordPiece tokenizer. For this we used the stanfordNLP WhiteSpace [manning2014stanford] and Character N-gram [mcnamee2004character] tokenizers. The WhiteSpace tokenizer splits text into tokens based on white space. The Character N-gram tokenizer splits words that have more than n characters, so that each resulting token has at most n characters. The tokens produced by the respective tokenizer are fed to BERT as input. For our experiments, we work with n = 6, as sketched below.
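
A minimal sketch of the two alternative tokenizers as described above (the exact implementations used in the experiments are not reproduced here):

    def whitespace_tokenize(text):
        """Split on white space only; every surface form becomes a single token."""
        return text.split()

    def char_ngram_tokenize(text, n=6):
        """Break each word into consecutive chunks of at most n characters."""
        tokens = []
        for word in text.split():
            tokens.extend(word[i:i + n] for i in range(0, len(word), n))
        return tokens

    print(whitespace_tokenize("communicates something rathee beautiful"))
    # ['communicates', 'something', 'rathee', 'beautiful']
    print(char_ngram_tokenize("communicates something", n=6))
    # ['commun', 'icates', 'someth', 'ing']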

The results of these experiments are presented in Table 2. Even though the WordPiece tokenizer has the issues stated earlier, it still performs better than the WhiteSpace and Character N-gram tokenizers. This is primarily because of the vocabulary overlap between the STS-B dataset and BERT's vocabulary.

Error WordPiece WhiteSpace N-gram(n=6)
0% 0.89 0.69 0.73
5% 0.78 0.59 0.60
10% 0.60 0.41 0.46
15% 0.45 0.33 0.36
20% 0.35 0.22 0.25
Table 2: Comparative results on the STS-B dataset with different tokenizers

5 Conclusion and Future Work

In this work we systematically studied the effect of noise (spelling mistakes) in user-generated text data on the performance of BERT. We demonstrated that as the noise increases, BERT's performance drops drastically. We further investigated the BERT pipeline to understand the reason for this drop in performance, and showed that the problem lies in how misspelled words are tokenized to create a representation of the original word.

There are two ways to address the problem: either (i) preprocess the data to correct spelling mistakes, or (ii) incorporate mechanisms into the BERT architecture to make it robust to noise. The problem with (i) is that in most industrial settings this becomes a separate project in itself. We leave (ii) as future work.

References