Stacked Denoising BERT for Text Classification in Incomplete Data
In this paper, we propose Stacked DeBERT, short for Stacked Denoising Bidirectional Encoder Representations from Transformers. This novel model improves robustness to incomplete data, when compared to existing systems, by designing a novel encoding scheme in BERT, a powerful language representation model solely based on attention mechanisms. Incomplete data in natural language processing refers to text with missing or incorrect words, and its presence can hinder the performance of current models that were not implemented to withstand such noise, but must still perform well even under duress. This is due to the fact that current approaches are built for and trained with clean and complete data, and thus are not able to extract features that can adequately represent incomplete data. Our proposed approach consists of obtaining intermediate input representations by applying an embedding layer to the input tokens followed by vanilla transformers. These intermediate features are given as input to novel denoising transformers which are responsible for obtaining richer input representations. The proposed approach takes advantage of stacks of multilayer perceptrons for the reconstruction of missing words' embeddings by extracting more abstract and meaningful hidden feature vectors, and bidirectional transformers for improved embedding representation. We consider two datasets for training and evaluation: the Chatbot Natural Language Understanding Evaluation Corpus and Kaggle's Twitter Sentiment Corpus. Our model shows improved F1-scores and better robustness in informal/incorrect texts present in tweets and in texts with Speech-to-Text errors in the sentiment and intent classification tasks.
Understanding a user’s intent and sentiment is of utmost importance for current intelligent chatbots to respond appropriately to human requests. However, current systems are not able to perform to their best capacity when presented with incomplete data, meaning sentences with missing or incorrect words. This scenario is likely to happen when one considers human error in writing. In fact, it is rather naive to assume that users will always type fully grammatically correct sentences. Panko [panko2008thinking] goes as far as claiming that human accuracy in research paper writing is nil when the entire document is considered. This has been aggravated by the advent of the internet and social networks, which allowed language and modern communication to be rapidly transformed [al2018evolution, grieve2018mapping]. Take Twitter for instance, where information is expected to be readily communicated in short and concise sentences with little to no regard for correct sentence grammar or word spelling [sirucek2010twitter].
Further motivation can be found in Automatic Speech Recognition (ASR) applications, where high error rates prevail and pose an enormous hurdle to the broad adoption of speech technology by users worldwide [errattahi2018automatic]. This is an important issue to tackle because, in addition to more widespread user adoption, improving Speech-to-Text (STT) accuracy diminishes error propagation to modules using the recognized text. With that in mind, in order for current systems to improve the quality of their services, there is a need for the development of robust intelligent systems that are able to understand a user even when faced with an incomplete representation in language.
The advancement of deep neural networks has immensely aided the development of the Natural Language Processing (NLP) domain. Tasks such as text generation, sentence correction, image captioning and text classification have been made possible by models such as Convolutional Neural Networks and Recurrent Neural Networks [sergio2018temporal, vinyals2015show, kim2014convolutional]. More recently, state-of-the-art results have been achieved with attention models, more specifically Transformers [vaswani2017attention]. Surprisingly, however, there is currently no research on incomplete text classification in the NLP community. Realizing the need for research in that area, we make it the focus of this paper. In this novel task, the model aims to identify the user’s intent or sentiment by analyzing a sentence with missing and/or incorrect words. In the sentiment classification task, the model aims to identify the user’s sentiment given a tweet, written in informal language and without regard for sentence correctness.
Current approaches for Text Classification tasks focus on efficient embedding representations. Kim et al. [kim2016intent] use semantically enriched word embeddings to make synonym and antonym word vectors respectively more and less similar in order to improve intent classification performance. Devlin et al. [devlin2018bert] propose Bidirectional Encoder Representations from Transformers (BERT), a powerful bidirectional language representation model based on Transformers, achieving state-of-the-art results on eleven NLP tasks [wang2018glue], including sentiment text classification. Concurrently, Shridhar et al. [shridhar2018subword] also reach state of the art in the intent recognition task using Semantic Hashing for feature representation followed by a neural classifier. All aforementioned approaches are, however, applied to datasets based solely on complete data.
The incomplete data problem is usually approached as a reconstruction or imputation task and is most often related to missing numbers imputation [pratama2016review]. Vincent et al. [vincent2008extracting, vincent2010stacked] propose to reconstruct clean data from its noisy version by mapping the input to meaningful representations. This approach has also been shown to outperform other models, such as predictive mean matching, random forest, Support Vector Machine (SVM) and Multiple Imputation by Chained Equations (MICE), at missing data imputation tasks [gondara2017multiple, costa2018missing].
Researchers in those two areas have shown that meaningful feature representation of data is of utmost importance for high-performing methods. We propose a model that combines the power of BERT in the NLP domain and the strength of denoising strategies in incomplete data reconstruction to tackle the tasks of incomplete intent and sentiment classification. This enables the implementation of a novel encoding scheme, more robust to incomplete data, called Stacked Denoising BERT or Stacked DeBERT. Our approach consists of obtaining richer input representations from input tokens by stacking denoising transformers on an embedding layer with vanilla transformers. The embedding layer and vanilla transformers extract intermediate input features from the input tokens, and the denoising transformers are responsible for obtaining richer input representations from them. By improving BERT with stronger denoising abilities, we are able to reconstruct missing and incorrect words’ embeddings and improve classification accuracy. To summarize, our contribution is two-fold:
Novel model architecture that is more robust to incomplete data, including missing or incorrect words in text.
Proposal of the novel tasks of incomplete intent and sentiment classification from incorrect sentences, and release of corpora related with these tasks.
The remainder of this paper is organized in four sections, with Section 2 explaining the proposed model. This is followed by Section 3, which includes a detailed description of the datasets used for training and evaluation purposes and how they were obtained. Section 4 covers the baseline models used for comparison, training specifications and experimental results. Finally, Section 5 wraps up this paper with the conclusion and future work.
We propose Stacked Denoising BERT (DeBERT) as a novel encoding scheme for the task of incomplete intent classification and sentiment classification from incorrect sentences, such as tweets and text with STT error. The proposed model, illustrated in Fig. 1, is structured as a stacking of embedding layers and vanilla transformer layers, similarly to the conventional BERT [devlin2018bert], followed by layers of novel denoising transformers. The main purpose of this model is to improve the robustness and efficiency of BERT when applied to incomplete data by reconstructing hidden embeddings from sentences with missing words. By reconstructing these hidden embeddings, we are able to improve the encoding scheme in BERT.
The initial part of the model is the conventional BERT, a multi-layer bidirectional Transformer encoder and a powerful language model. During training, BERT is fine-tuned on the incomplete text classification corpus (see Section 3). The first layer pre-processes the input sentence by making it lower-case and by tokenizing it. It also prefixes the sequence of tokens with a special character ‘[CLS]’ and suffixes each sentence with a ‘[SEP]’ character. It is followed by an embedding layer used for input representation, with the final input embedding being a sum of token embeddings, segmentation embeddings and position embeddings. The first one, the token embedding layer, uses a vocabulary dictionary to convert each token into a more representative embedding. The segmentation embedding layer indicates which tokens constitute a sentence by signaling either 1 or 0. In our case, since our data are formed of single sentences, the segment is 1 until the first ‘[SEP]’ character appears (indicating segment A) and then it becomes 0 (segment B). The position embedding layer, as the name indicates, adds information related to the token’s position in the sentence. This prepares the data to be considered by the layers of vanilla bidirectional transformers, which output a hidden embedding that can be used by our novel layers of denoising transformers.
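The input pipeline described above can be illustrated with a small sketch. It is not the BERT implementation: the vocabulary, the embedding size of 8, the random lookup tables and the whitespace tokenizer are all hypothetical stand-ins (BERT uses a learned WordPiece vocabulary and learned embedding matrices). It only shows the wrapping with ‘[CLS]’ and ‘[SEP]’ and the element-wise sum of token, segment and position embeddings.

```python
import random

# Hypothetical toy vocabulary; real BERT uses a large WordPiece vocabulary.
VOCAB = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
         "how": 4, "can": 5, "i": 6, "get": 7, "from": 8,
         "garching": 9, "to": 10}
EMB_SIZE = 8   # toy size; BERT-base uses a hidden size of 768
MAX_LEN = 16   # toy maximum sequence length


def make_table(rows, cols, seed):
    """Random lookup table standing in for a learned embedding matrix."""
    rng = random.Random(seed)
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


TOKEN_EMB = make_table(len(VOCAB), EMB_SIZE, seed=0)
SEGMENT_EMB = make_table(2, EMB_SIZE, seed=1)
POSITION_EMB = make_table(MAX_LEN, EMB_SIZE, seed=2)


def preprocess(sentence):
    """Lower-case, whitespace-tokenize, and wrap with [CLS] ... [SEP]."""
    return ["[CLS]"] + sentence.lower().split() + ["[SEP]"]


def embed(tokens, segment_id):
    """Sum token, segment and position embeddings for every token."""
    embedded = []
    for pos, tok in enumerate(tokens):
        tok_id = VOCAB.get(tok, VOCAB["[UNK]"])  # unknown words map to [UNK]
        embedded.append([t + s + p for t, s, p in zip(TOKEN_EMB[tok_id],
                                                      SEGMENT_EMB[segment_id],
                                                      POSITION_EMB[pos])])
    return embedded


tokens = preprocess("How can I get from Garching to Milbertshofen")
# Single-sentence input: every token shares one segment id.
embeddings = embed(tokens, segment_id=1)
print(tokens)
print(len(embeddings), len(embeddings[0]))
```

Each token ends up with one vector of the embedding size, which is the form consumed by the transformer layers.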
Although BERT has been shown to perform better than other baseline models when handling incomplete data, it is still not enough to completely and efficiently handle such data. Because of that, there is a need for further improvement of the hidden feature vectors obtained from sentences with missing words. With this purpose in mind, we implement a novel encoding scheme consisting of denoising transformers, which are composed of stacks of multilayer perceptrons for the reconstruction of missing words’ embeddings by extracting more abstract and meaningful hidden feature vectors, and bidirectional transformers for improved embedding representation. The embedding reconstruction step is trained on sentence embeddings extracted from incomplete data as input and embeddings corresponding to its complete version as target. Both input and target are obtained after applying the embedding layers and the vanilla transformers, as indicated in Fig. 1, and have shape $(N_{bs}, d, L)$, where $N_{bs}$ is the batch size, $d$ is the original BERT embedding size for a single token (768 for BERT-base), and $L$ is the maximum sequence length in a sentence.
The stacks of multilayer perceptrons are structured as two sets of three layers with two hidden layers each. The first set is responsible for compressing the incomplete hidden embedding $h_{inc}$ into a latent-space representation, extracting more abstract features into the lower-dimensional vectors $z_1$, $z_2$ and $z$. This process is shown in Eq. (1):

$$z = f(h_{inc}) \tag{1}$$

where $f(\cdot)$ is the parameterized function mapping $h_{inc}$ to the hidden state $z$. The second set then respectively reconstructs $z_1$, $z_2$ and $z$ into $h_{rec_1}$, $h_{rec_2}$ and $h_{rec}$. This process is shown in Eq. (2):

$$h_{rec} = g(z) \tag{2}$$

where $g(\cdot)$ is the parameterized function that reconstructs $z$ as $h_{rec}$.
The reconstructed hidden sentence embedding $h_{rec}$ is compared with the complete hidden sentence embedding $h_{comp}$ through a mean square error loss function, as shown in Eq. (3):

$$\mathcal{L}_{MSE}(h_{rec}, h_{comp}) = \frac{1}{K}\,\lVert h_{rec} - h_{comp} \rVert_2^2 \tag{3}$$

where $K$ is the number of elements in each embedding.
After reconstructing the correct hidden embeddings from the incomplete sentences, the correct hidden embeddings are given to bidirectional transformers to generate input representations. The model is then fine-tuned in an end-to-end manner on the incomplete text classification corpus.
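As a rough illustration of the reconstruction step, the sketch below runs one forward pass through a compressing stack and a reconstructing stack of dense layers and scores the result with a mean squared error. It is a sketch under stated assumptions rather than the authors' implementation: the layer widths (8, 6, 4, 2), the ReLU activations, the random weights and the names `h_inc`, `h_comp` and `h_rec` are chosen here for illustration, and it processes a single vector instead of a batch of sentence embeddings.

```python
import random

rng = random.Random(0)


def linear(in_dim, out_dim):
    """Randomly initialised weights and zero bias for one dense layer."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(in_dim)]
               for _ in range(out_dim)]
    return weights, [0.0] * out_dim


def forward(layer, x, relu=True):
    """Apply one dense layer, optionally followed by ReLU, to vector x."""
    weights, bias = layer
    y = [sum(w * v for w, v in zip(row, x)) + b
         for row, b in zip(weights, bias)]
    return [max(0.0, v) for v in y] if relu else y


def mse(a, b):
    """Mean squared error between two equally sized vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


# Hypothetical toy widths: 8 -> 6 -> 4 -> 2 and back again.
SIZES = [8, 6, 4, 2]
encoder = [linear(SIZES[i], SIZES[i + 1]) for i in range(3)]
decoder = [linear(SIZES[i + 1], SIZES[i]) for i in reversed(range(3))]

h_inc = [rng.uniform(-1, 1) for _ in range(SIZES[0])]   # from incomplete text
h_comp = [rng.uniform(-1, 1) for _ in range(SIZES[0])]  # from complete text

# Compression: h_inc is mapped to the latent vector z.
z = h_inc
for layer in encoder:
    z = forward(layer, z)

# Reconstruction: z is mapped back to the original size.
h_rec = z
for i, layer in enumerate(decoder):
    # No ReLU on the last layer so the output can take negative values.
    h_rec = forward(layer, h_rec, relu=(i < len(decoder) - 1))

loss = mse(h_rec, h_comp)  # the quantity a training loop would minimise
print(len(z), len(h_rec), loss)
```

In training, the loss would be back-propagated through both stacks so that embeddings of incomplete sentences are pulled toward those of their complete counterparts; the sketch stops at the loss computation.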
Classification is done with a feedforward network and softmax activation function. Softmax is a discrete probability distribution function over the $N_C$ classes, with the class probabilities summing to 1 and the maximum value indicating the predicted class. The predicted class can be calculated as in Eq. (4):

$$\hat{y} = \arg\max_{c \in \{1, \dots, N_C\}} \mathrm{softmax}(o)_c \tag{4}$$

where $o$ is the output of the feedforward layer used for classification.
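The classification step can be made concrete with a few lines of code. The logits below are hypothetical values standing in for the output of the feedforward layer; the sketch only shows that softmax yields probabilities summing to 1 and that the predicted class is the index of the largest one.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    largest = max(logits)
    exps = [math.exp(v - largest) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


logits = [1.2, -0.3]  # hypothetical feedforward output for two classes
probs = softmax(logits)
predicted = max(range(len(probs)), key=probs.__getitem__)  # argmax
print(probs, predicted)
```

Subtracting the largest logit before exponentiating leaves the result unchanged and avoids overflow for large values.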
In order to evaluate the performance of our model, we need access to a naturally noisy dataset with real human errors. Poor quality texts obtained from Twitter, called tweets, are therefore ideal for our task. For this reason, we choose Kaggle’s two-class Sentiment140 dataset [go2009twitterSentiment140] (https://www.kaggle.com/kazanova/sentiment140), which consists of spoken text being used in writing and without strong consideration for grammar or sentence correctness. Thus, it has many mistakes, as specified in Table 1.
|Spelling||“teh” (the), “correclty” (correctly), “teusday” (Tuesday)|
|Casual pronunciation||“wanna” (want to), “dunno” (don’t know)|
|Abbreviation||“Lit” (Literature), “pls” (please), “u” (you), “idk” (I don’t know)|
|Repeated letters||“thursdayyyyyy”, “sleeeeeeeeeep”|
|Onomatopoeia||“Woohoo”, “hmmm”, “yaay”|
|Others||“im” (I’m), “your/ur” (you’re), “ryt” (right)|
Even though this corpus has incorrect sentences and their emotional labels, it lacks their respective corrected sentences, necessary for the training of our model. In order to obtain this missing information, we recruit native English speakers through an unbiased and anonymous platform called Amazon Mechanical Turk (MTurk) [buhrmester2011amazon], which is a paid marketplace for Human Intelligence Tasks (HITs). We use this platform to create tasks for native English speakers to rewrite the original incorrect tweets as correct sentences. Some examples are shown in Table 2.
|Original tweet||Corrected tweet|
|“goonite sweet dreamz”||“Good night, sweet dreams.”|
|“well i dunno..i didnt give him an ans yet”||“Well I don’t know, I didn’t give him an answer yet.”|
|“u kno who am i talkin bout??”||“Do you know who I am talking about?”|
After obtaining the correct sentences, our two-class dataset (available at https://github.com/gcunhase/StackedDeBERT) has the class distribution shown in Table 3. There are 200 sentences used in the training stage, with 100 belonging to the positive sentiment class and 100 to the negative class, and 50 samples being used in the evaluation stage, with 25 negative and 25 positive. This totals 300 samples, with incorrect and correct sentences combined. Since our goal is to evaluate the model’s performance and robustness in the presence of noise, we only consider incorrect data in the testing phase. Note that BERT is a pre-trained model, meaning that small amounts of data are enough for appropriate fine-tuning.
In the intent classification task, we are presented with a corpus that suffers from the opposite problem of the Twitter sentiment classification corpus. In the intent classification corpus, we have the complete sentences and intent labels but lack their corresponding incomplete sentences, and since our task revolves around text classification in incomplete or incorrect data, it is essential that we obtain this information. To remedy this issue, we apply a Text-to-Speech (TTS) module followed by a Speech-to-Text (STT) module to the complete sentences in order to obtain incomplete sentences with STT error. Due to the available TTS and STT modules being imperfect, the resulting sentences have a reasonable level of noise in the form of missing or incorrectly transcribed words. Analysis on this dataset (available at https://github.com/gcunhase/StackedDeBERT) adds value to our work by enabling evaluation of our model’s robustness to different rates of data incompleteness.
The dataset used to evaluate the models’ performance is the Chatbot Natural Language Understanding (NLU) Evaluation Corpus, introduced by Braun et al. [braun2017evaluating] to test NLU services. It is a publicly available benchmark (https://github.com/sebischair/NLU-Evaluation-Corpora) and is composed of sentences obtained from a German Telegram chatbot used to answer questions about public transport connections. The dataset has two intents, namely Departure Time and Find Connection, with 100 train and 106 test samples, shown in Table 4. Even though English is the main language of the benchmark, this dataset contains a few German station and street names.
|Chatbot NLU||Departure Time||43||35||98|
The incomplete dataset used for training is composed of lower-cased incomplete data obtained by manipulating the original corpora. The incomplete sentences with STT error are obtained in a 2-step process shown in Fig. 2. The first step is to apply a TTS module to the available complete sentence. Here, we apply gtts (https://pypi.org/project/gTTS/), a Google Text-to-Speech Python library, and macsay (https://ss64.com/osx/say.html), a terminal command available in Mac OS as say. The second step consists of applying an STT module to the obtained audio files in order to obtain text containing STT errors. The STT module used here was witai (https://wit.ai), freely available and maintained by Wit.ai. The mentioned TTS and STT modules were chosen according to code availability and whether they are freely available or have high daily usage limits.
Table 5 exemplifies a complete sentence and its respective incomplete sentences with different TTS-STT combinations, and thus varying rates of missing and incorrect words. The level of noise in the STT-imbued sentences is denoted by an inverted BLEU (iBLEU) score ranging from 0 to 1. The inverted BLEU score is defined in Eq. (5):

$$\text{iBLEU} = 1 - \text{BLEU} \tag{5}$$

where BLEU is a common metric usually used in machine translation tasks [papineni2002bleu]. We showcase iBLEU instead of regular BLEU because it is more indicative of the amount of noise in the incomplete text: the higher the iBLEU, the higher the noise.
|TTS-STT||iBLEU||Original sentence||With STT error|
|gtts-witai||0.44||“how can i get from garching to milbertshofen?”||“how can i get from garching to melbourne open.”|
|macsay-witai||0.50||“how can i get from garching to milbertshofen?”||“how can i get from garching to meal prep.”|
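To make the inverted BLEU score concrete, the following sketch computes it with a simplified BLEU. Everything in it is an assumption: it uses unsmoothed sentence-level BLEU with a single reference, n-grams up to length 4 and whitespace tokenization, while the BLEU implementation and tokenization behind Table 5 are not specified here. Its values therefore need not match the iBLEU column of the table.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def bleu(reference, hypothesis, max_n=4):
    """Unsmoothed sentence-level BLEU with a single reference."""
    ref, hyp = reference.split(), hypothesis.split()
    if not hyp:
        return 0.0
    log_precision = 0.0
    for n in range(1, max_n + 1):
        hyp_counts = Counter(ngrams(hyp, n))
        ref_counts = Counter(ngrams(ref, n))
        # Clipped overlap: a hypothesis n-gram counts at most as often
        # as it appears in the reference.
        overlap = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
        total = sum(hyp_counts.values())
        if overlap == 0 or total == 0:
            return 0.0
        log_precision += math.log(overlap / total) / max_n
    # Brevity penalty for hypotheses shorter than the reference.
    brevity = 1.0 if len(hyp) > len(ref) else math.exp(1 - len(ref) / len(hyp))
    return brevity * math.exp(log_precision)


def ibleu(reference, hypothesis):
    """Inverted BLEU: 0 means no noise, higher means more noise."""
    return 1.0 - bleu(reference, hypothesis)


original = "how can i get from garching to milbertshofen?"
noisy = "how can i get from garching to melbourne open."
print(ibleu(original, original))
print(ibleu(original, noisy))
```

An identical sentence pair gives an iBLEU of 0, and the score rises as more n-grams of the hypothesis fail to match the reference.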
Besides the already mentioned BERT, the following baseline models are also used for comparison.
We focus on the three following services, where the first two are commercial services and the last one is open source with two separate backends: Google Dialogflow (formerly Api.ai, https://dialogflow.com), SAP Conversational AI (formerly Recast.ai, https://cai.tools.sap) and Rasa (spacy and tensorflow backends, https://rasa.com).
Shridhar et al. [shridhar2018subword] proposed a word embedding method that doesn’t suffer from out-of-vocabulary issues. The authors achieve this by using hash tokens in the alphabet instead of a single word, making it vocabulary independent. For classification, classifiers such as Multilayer Perceptron (MLP), Support Vector Machine (SVM) and Random Forest are used. A complete list of classifiers and training specifications are given in Section 4.2.
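A minimal sketch of the subword idea behind this baseline follows. It assumes the common formulation in which each word is wrapped in ‘#’ markers and split into character trigrams. The `hash_vector` helper, which buckets trigrams into a fixed-size count vector using CRC32, is a hypothetical stand-in for the vectorization step and is not taken from the authors' code.

```python
import zlib


def subword_trigrams(sentence, n=3):
    """Split each '#'-wrapped word into overlapping character n-grams."""
    tokens = []
    for word in sentence.lower().split():
        padded = f"#{word}#"
        tokens.extend(padded[i:i + n] for i in range(len(padded) - n + 1))
    return tokens


def hash_vector(sentence, size=768):
    """Hypothetical fixed-size count vector built with the hashing trick."""
    vec = [0] * size
    for tri in subword_trigrams(sentence):
        # CRC32 is deterministic across runs, unlike Python's built-in hash().
        vec[zlib.crc32(tri.encode("utf-8")) % size] += 1
    return vec


print(subword_trigrams("hello"))
shared = set(subword_trigrams("milbertshofen")) & set(subword_trigrams("milbertshoffen"))
print(len(shared))
```

Because a misspelled word still shares most of its trigrams with the correct spelling, no word is ever out of vocabulary, which is the property the baseline relies on.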
The baseline and proposed models are each trained 3 separate times for the incomplete intent classification task: once on complete data and once for each of the TTS-STT combinations (gtts-witai and macsay-witai). Regarding the sentiment classification from incorrect sentences task, the baseline and proposed models are each trained 3 times: on the original text, on the corrected text, and on incorrect with correct texts. The reported F1 scores are the best accuracies obtained from 10 runs.
No settable training configurations available in the online platforms.
Trained on 3-gram, feature vector size of 768 so as to match the BERT embedding size, and 13 classifiers with parameters set as specified in the authors’ paper so as to allow comparison: MLP with 3 hidden layers of sizes respectively; Random Forest with estimators or trees; 5-fold Grid Search with Random Forest classifier and estimator ; Linear Support Vector Classifier with L1 and L2 penalty and tolerance of ; Regularized linear classifier with Stochastic Gradient Descent (SGD) learning with regularization term and L1, L2 and Elastic-Net penalty; Nearest Centroid with Euclidean metric, where classification is done by representing each class with a centroid; Bernoulli Naive Bayes with smoothing parameter and regularization term of . Most often, the best performing classifier was MLP.
Conventional BERT is a BERT-base-uncased model, meaning that it has 12 transformer blocks, a hidden size of 768, and 12 self-attention heads. The model is fine-tuned with our dataset on 2 Titan X GPUs for 3 epochs with Adam Optimizer, learning rate of , maximum sequence length of , and warm up proportion of . The train batch size is 4 for the Twitter Sentiment Corpus and 8 for the Chatbot Intent Classification Corpus.
Our proposed model is trained in an end-to-end manner on 2 Titan X GPUs, with training time depending on the size of the dataset and train batch size. The stacks of multilayer perceptrons are trained for 100 and 1,000 epochs with Adam Optimizer, learning rate of , weight decay of , MSE loss criterion and the same batch size as BERT (4 for the Twitter Sentiment Corpus and 8 for the Chatbot Intent Classification Corpus).
Experimental results for the Twitter Sentiment Classification task on Kaggle’s Sentiment140 Corpus dataset, displayed in Table 6, show that our model has better F1-micro scores, outperforming the baseline models by 6% to 8%. We evaluate our model and baseline models on three versions of the dataset. The first one (Inc) only considers the original data, containing naturally incorrect tweets, and our model achieves an accuracy of 80% against BERT’s 72%. The second version (Corr) considers the corrected tweets, and shows higher accuracy given that it is less noisy. In that version, Stacked DeBERT achieves 82% accuracy against BERT’s 76%, an improvement of 6%. In the last case (Inc+Corr), we consider both incorrect and correct tweets as input to the models in hopes of improving performance. However, the accuracy was similar to the first aforementioned version, 80% for our model and 74% for the second highest performing model. Since the first and last corpus gave similar performances with our model, we conclude that the Twitter dataset does not require complete sentences to be given as training input, in addition to the original naturally incorrect tweets, in order to better model the noisy sentences.
|F1-score (micro, %)|
|SAP Conversational AI||59.18||65.31||59.18|
|Stacked DeBERT (ours)||80.00||82.00||80.00|
In addition to the overall F1-score, we also present a confusion matrix, in Fig. 3, with the per-class F1-scores for BERT and Stacked DeBERT. The normalized confusion matrix plots the predicted labels versus the target labels. Similarly to Table 6, we evaluate our model with the original Twitter dataset, the corrected version and both original and corrected tweets. It can be seen that our model is able to improve the overall performance by improving the accuracy of the lower performing classes. In the Inc dataset, the true class 1 in BERT performs with approximately 50%. However, Stacked DeBERT is able to improve that to 72%, although at the cost of a small decrease in performance of class 0. A similar situation happens in the remaining two datasets, with improved accuracy in class 0 from 64% to 84% and 60% to 76% respectively.
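The relation between the normalized confusion matrix and the reported micro F1-score can be sketched as follows. The labels are hypothetical and are not taken from the experiments; the sketch shows how rows are normalized per true class and why, for single-label classification, the micro-averaged F1 equals plain accuracy.

```python
def confusion_matrix(y_true, y_pred, num_classes):
    """Counts with rows as target labels and columns as predicted labels."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for target, predicted in zip(y_true, y_pred):
        matrix[target][predicted] += 1
    return matrix


def normalize_rows(matrix):
    """Divide each row by its sum, giving per-class rates."""
    return [[c / sum(row) if sum(row) else 0.0 for c in row] for row in matrix]


def micro_f1(matrix):
    """Micro-averaged F1 pooled over all classes."""
    size = len(matrix)
    tp = sum(matrix[i][i] for i in range(size))
    # Every off-diagonal count is a false positive for the predicted class
    # and a false negative for the target class.
    errors = sum(matrix[i][j] for i in range(size) for j in range(size) if i != j)
    if tp == 0:
        return 0.0
    precision = tp / (tp + errors)
    recall = tp / (tp + errors)
    return 2 * precision * recall / (precision + recall)


# Hypothetical labels for a two-class task.
y_true = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 1, 1, 1, 1, 0, 0]

cm = confusion_matrix(y_true, y_pred, num_classes=2)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(cm)
print(normalize_rows(cm))
print(micro_f1(cm), accuracy)
```

The diagonal of the normalized matrix gives the per-class rates discussed above, and the pooled score explains why accuracy and micro F1 are used interchangeably in the text.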
Experimental results for the Intent Classification task on the Chatbot NLU Corpus with STT error can be seen in Table 7. When presented with data containing STT error, our model outperforms all baseline models in both combinations of TTS-STT: on gtts-witai it outperforms the second-placing baseline model by 0.94% with an F1-score of 97.17%, and on macsay-witai it outperforms the next highest achieving model by 1.89% with an F1-score of 96.23%.
|F1-score (micro, %)|
|SAP Conversational AI||95.24||94.29||94.29|
|Stacked DeBERT (ours)||99.06||97.17||96.23|
The table also indicates the level of noise in each dataset with the already mentioned iBLEU score, where 0 means no noise and higher values mean a higher quantity of noise. As expected, the models’ accuracy degrades with the increase in noise, thus F1-scores on gtts-witai are higher than on macsay-witai. However, while the other models decay rapidly in the presence of noise, our model not only outperforms them but does so by a wider margin. This is shown by the increasing robustness curve in Fig. 4 and by our model outperforming the baseline models on macsay-witai by twice the gap achieved on gtts-witai.
Further analysis of the results in Table 7 shows that BERT’s decay is almost constant with the addition of noise, with the difference between the complete data and gtts-witai being 1.88 and that between gtts-witai and macsay-witai being 1.89, whereas for Stacked DeBERT those differences are 1.89 and 0.94, respectively. This is a stronger indication of our model’s robustness in the presence of noise.
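The decay figures for Stacked DeBERT can be checked directly from its row in Table 7. The short sketch below uses only the three F1-scores listed there; the dictionary keys are labels chosen here for readability.

```python
# F1-scores (micro, %) for Stacked DeBERT as listed in Table 7.
stacked_debert_f1 = {"complete": 99.06, "gtts-witai": 97.17, "macsay-witai": 96.23}

order = ["complete", "gtts-witai", "macsay-witai"]  # increasing noise level
drops = [round(stacked_debert_f1[a] - stacked_debert_f1[b], 2)
         for a, b in zip(order, order[1:])]
print(drops)
```

The two drops come out as 1.89 and 0.94, matching the values quoted above: the second step of added noise costs about half as much as the first.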
Additionally, we also present Fig. 5 with the normalized confusion matrices for BERT and Stacked DeBERT for sentences containing STT error. Analogously to the Twitter Sentiment Classification task, the per-class F1-scores show that our model is able to improve the overall performance by improving the accuracy of one class while maintaining the high-achieving accuracy of the second one.
In this work, we proposed a novel deep neural network, robust to noisy text in the form of sentences with missing and/or incorrect words, called Stacked DeBERT. The idea was to improve accuracy by improving the representation ability of the model with the implementation of novel denoising transformers. More specifically, our model was able to reconstruct hidden embeddings from their respective incomplete hidden embeddings. Stacked DeBERT was compared against three NLU service platforms and two other machine learning methods, namely BERT and Semantic Hashing with neural classifier. Our model showed better performance when evaluated on F1 scores in both the Twitter sentiment and the intent text with STT error classification tasks. The per-class F1 score was also evaluated in the form of normalized confusion matrices, showing that our model was able to improve the overall performance by better balancing the accuracy of each class, trading off small decreases in the high achieving class for significant improvements in lower performing ones. In the Chatbot dataset, accuracy improvement was achieved even without trade-off, with the highest achieving classes maintaining their accuracy while the lower achieving class saw improvement. Further evaluation of the F1-score decay in the presence of noise demonstrated that our model is more robust than the baseline models when considering noisy data, be that in the form of incorrect sentences or sentences with STT error. Not only that, experiments on the Twitter dataset also showed improved accuracy in clean data, with complete sentences. We infer that this is due to our model being able to extract richer data representations from the input data regardless of the completeness of the sentence. For future work, we plan on evaluating the robustness of our model against other types of noise, such as word reordering, word insertion, and spelling mistakes in sentences.
In order to improve the performance of our model, further experiments will be done in search for more appropriate hyperparameters and more complex neural classifiers to substitute the last feedforward network layer.
This work was partly supported by Institute of Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (2016-0-00564, Development of Intelligent Interaction Technology Based on Context Awareness and Human Intention Understanding) and Korea Evaluation Institute of Industrial Technology (KEIT) grant funded by the Korea government (MOTIE) (50%) and the Technology Innovation Program: Industrial Strategic Technology Development Program (No: 10073162) funded by the Ministry of Trade, Industry & Energy (MOTIE, Korea) (50%).