Absit invidia verbo: Comparing Deep Learning methods for offensive language

by Silvia Sapora, et al.

This document describes our approach to building an Offensive Language Classifier. More specifically, the OffensEval 2019 competition required us to build three classifiers with slightly different goals: (A) offensive language identification, which classifies a tweet as offensive or not; (B) automatic categorization of offense types, which recognizes if the target of the offense was an individual or not; and (C) offense target identification, which identifies the target of the offense as an individual, a group or other. In this report, we discuss the different architectures, algorithms and pre-processing strategies we tried, together with a detailed description of the designs of our final classifiers and the reasons we chose them over others. We evaluated our classifiers on the official test set provided for the OffensEval 2019 competition, obtaining a macro-averaged F1-score of 0.7189 for Task A, 0.6708 for Task B and 0.5442 for Task C.



1 Pre-Processing

Our pre-processing included:

  1. Removing all @USER strings

  2. Removing all URL strings

  3. Removing all punctuation

  4. Removing all symbols

  5. Converting all text to lowercase

  6. Expanding abbreviations

Other steps we considered were:

  1. Removing all emojis

  2. Removing all hashtags

  3. Removing all numbers

  4. Removing stop words

After testing, we determined that those changes either did not contribute positively or decreased the performance of our classifiers. The resulting corpus is then tokenized using the Natural Language Toolkit library (NLTK; Loper and Bird, 2002) to obtain the individual words, which are later embedded and used as inputs for the classification models.
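The retained pre-processing steps can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the `ABBREVIATIONS` table is hypothetical (the paper does not list its expansions), "symbols" is interpreted as ASCII punctuation characters, and a plain whitespace split stands in for the NLTK tokenizer.

```python
import re
import string

# Hypothetical abbreviation table; the paper does not list the expansions it used.
ABBREVIATIONS = {"u": "you", "r": "are", "idk": "i do not know"}

def preprocess(tweet, abbreviations=ABBREVIATIONS):
    """Apply the six retained steps and return a list of lowercase tokens."""
    text = tweet.replace("@USER", " ")        # 1. drop anonymized user mentions
    text = re.sub(r"\bURL\b", " ", text)      # 2. drop URL placeholders
    text = text.lower()                       # 5. lowercase
    # 3./4. replace ASCII punctuation and symbols with spaces (emojis are kept)
    text = re.sub(r"[%s]" % re.escape(string.punctuation), " ", text)
    tokens = text.split()                     # whitespace split stands in for NLTK
    expanded = []
    for tok in tokens:                        # 6. expand abbreviations
        expanded.extend(abbreviations.get(tok, tok).split())
    return expanded

tokens = preprocess("@USER u r so LOUD!!! URL #fail")
```

Hashtag words survive this sketch (only the `#` character is stripped), in line with the decision not to remove hashtags.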

2 Input Data

Initially, we decided to use the Bag-of-Words model. To do this, we used the CountVectorizer class from scikit-learn (Pedregosa et al., 2011). CountVectorizer converts a collection of text strings into a matrix of token counts. This implementation produces a sparse representation of the counts using scipy.sparse.csr_matrix. We could either provide CountVectorizer with an a-priori dictionary or let the library build one from the words in the given input data.
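To make the token-count matrix concrete, here is a small pure-Python sketch of the idea. It is not scikit-learn's implementation (which returns a sparse matrix and does its own tokenization); the helper names `build_vocabulary` and `count_matrix` are illustrative only. Passing a pre-built vocabulary corresponds to the a-priori dictionary option, while calling `build_vocabulary` on the input corresponds to letting the dictionary be inferred from the data.

```python
from collections import Counter

def build_vocabulary(tokenized_docs):
    """Map each distinct token to a column index, in sorted order."""
    vocab = sorted({tok for doc in tokenized_docs for tok in doc})
    return {tok: i for i, tok in enumerate(vocab)}

def count_matrix(tokenized_docs, vocabulary):
    """Return one dense row of token counts per document."""
    rows = []
    for doc in tokenized_docs:
        counts = Counter(tok for tok in doc if tok in vocabulary)
        row = [0] * len(vocabulary)
        for tok, n in counts.items():
            row[vocabulary[tok]] = n
        rows.append(row)
    return rows

docs = [["you", "are", "so", "loud"], ["so", "so", "true"]]
vocab = build_vocabulary(docs)      # columns: are, loud, so, true, you
matrix = count_matrix(docs, vocab)
```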

As word embeddings have the ability to generalize, thanks to semantically similar words having similar vectors (Karani, 2018), we decided to try this approach and check its performance. We used the Word2Vec model from the gensim library (Řehůřek and Sojka, 2010) to obtain word embeddings with which we could then train a classifier. Our approach was to first get a dictionary mapping each word to an n-dimensional vector (we tried n=100 to start with). Then, to build the feature vector, we tried to average the word vectors of all the words in a given sentence. To do this we built a sklearn-compatible transformer that is initialized with a word-to-vector dictionary. Unfortunately, we had some problems getting the results from Word2Vec and averaging them. gensim provides many methods to calculate the similarity of words and other properties, but it does not provide a direct way of getting a list of features for each sentence. This is why we decided to abandon this approach.
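For illustration only, the averaging step described above could look like the sketch below once a plain word-to-vector dictionary is available. The function name `mean_embedding` and the toy two-dimensional vectors are assumptions made for this example; the authors did not complete this approach, so this is not their code.

```python
def mean_embedding(tokens, word_vectors, dim):
    """Average the vectors of all known tokens; return zeros if none are known."""
    known = [word_vectors[t] for t in tokens if t in word_vectors]
    if not known:
        return [0.0] * dim
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]

# Toy vectors standing in for trained embeddings (the paper tried n=100).
toy_vectors = {"good": [1.0, 0.0], "bad": [0.0, 1.0]}
features = mean_embedding(["good", "bad", "unknown"], toy_vectors, dim=2)
```

Each sentence then becomes a fixed-length feature vector regardless of its length, which is what a classifier such as Logistic Regression expects.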

We also tried using Term Frequency-Inverse Document Frequency (TF-IDF) through the sklearn class TfidfVectorizer. Surprisingly, it yielded worse results than Bag of Words when we tested it with our Logistic Regression classifier.
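As a reminder of what this weighting computes, the block below states the TF-IDF variant that scikit-learn applies with its default settings (smoothed inverse document frequency followed by L2 normalization of each row). The paper does not say which settings were used, so treating the defaults as the authors' configuration is an assumption. Here $n$ is the number of documents, $\mathrm{df}(t)$ the number of documents containing term $t$, and $\mathrm{tf}(t,d)$ the raw count of $t$ in document $d$.

```latex
\mathrm{idf}(t) = \ln\frac{1 + n}{1 + \mathrm{df}(t)} + 1,
\qquad
w(t,d) = \mathrm{tf}(t,d)\,\mathrm{idf}(t),
\qquad
\hat{w}(t,d) = \frac{w(t,d)}{\sqrt{\sum_{t'} w(t',d)^2}}
```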

For our Neural Networks (RNNs and CNN) we decided to use the word2idx function we implemented. This function first assigns each word an index, and then replaces each word in a sentence with its corresponding index. This meant our sentences were now lists of integers.
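A minimal sketch of such an index encoding is shown below. The authors' word2idx function is not reproduced in the paper, so the names `build_word2idx` and `encode`, the choice to reserve index 0 for padding, and the choice to drop unknown words are assumptions made for this example.

```python
def build_word2idx(tokenized_sentences):
    """Assign indices starting at 1; index 0 is reserved for padding."""
    word2idx = {}
    for sentence in tokenized_sentences:
        for word in sentence:
            if word not in word2idx:
                word2idx[word] = len(word2idx) + 1
    return word2idx

def encode(sentence, word2idx, max_len):
    """Replace words by indices, drop unknown words, right-pad with 0."""
    ids = [word2idx[w] for w in sentence if w in word2idx][:max_len]
    return ids + [0] * (max_len - len(ids))

sentences = [["you", "are", "loud"], ["you", "win"]]
w2i = build_word2idx(sentences)
encoded = [encode(s, w2i, max_len=3) for s in sentences]
```

Padding every sentence to a common length gives the equally sized inputs that the neural networks require.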

3 Performance and Evaluation

In order to train and evaluate the performance of our classifiers, we used the provided training dataset together with the validation one. During training, we split the 13240 tweets in a 90-10 ratio, where the 10% was used for validation in between epochs. The provided validation dataset was in fact used for testing our models. For each model, we compared the predictions on the test data to the provided reference classifications, using accuracy and macro-averaged F1 as metrics.
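The two metrics can be written out in a few lines of pure Python. This sketch follows the standard definitions (per-class F1 averaged with equal weight per class); the label strings in the usage example are toy data, not results from the paper.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the reference labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

y_true = ["OFF", "NOT", "NOT", "OFF"]
y_pred = ["OFF", "NOT", "OFF", "NOT"]
acc, f1 = accuracy(y_true, y_pred), macro_f1(y_true, y_pred)
```

Because every class contributes equally to the average, macro F1 is more sensitive than accuracy to poor performance on a minority class.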

4 Libraries and Frameworks

In our implementations we made extensive use of PyTorch (Paszke et al., 2017), Keras (Chollet et al., 2015), scikit-learn, and the Natural Language Toolkit.

We decided to use Natural Language Toolkit because it had a base Tokenizer for Tweets. The Tokenizer was a good starting point because it took care of removing handles, allowed easy lower casing and stripped repeating symbols to a certain length. As the coursework progressed, we built on top of it while experimenting with pre-processing of sentences.

When we considered deep learning frameworks we looked into TensorFlow (Abadi et al., 2015), Keras and PyTorch. Both TensorFlow and Keras are older frameworks than PyTorch. Keras has a reputation for being easy to reason about, while TensorFlow has the reputation of being more powerful, being used in production systems and having a steep learning curve. PyTorch, on the other hand, seemed to combine the best of both by being easy to reason about, unlike TensorFlow, and yet enabling developers to go into low-level implementation details, unlike Keras. We chose to go with PyTorch when implementing the CNN because the majority of our team had already used it for other courses and we felt confident working with it. We mainly used Keras for the RNNs to see whether it was indeed easier to use than PyTorch.

We chose to use the logistic regression model in scikit-learn over an implementation in PyTorch because we found its API more intuitive to use.

4.1 Convolutional Neural Network (CNN)

As our first attempt, we decided to approach the classification problem by using a Convolutional Neural Network.

The CNN which gave us the best results used an embedding layer, a feature (convolutional) layer and a pooling layer. It made use of a word2idx encoding of tweets padded to the length of the longest tweet. The CNN favored a 100 by 1 vector for its embedding representation, which was then sampled using a window of size 3 in order to extract useful features. The loss function which gave the best results was BCEWithLogitsLoss, and the optimizer which worked best in conjunction with it was Adam.
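The forward pass of such a network (embedding lookup, convolution over a window of words, pooling, and a linear output producing a logit) can be sketched in pure Python as below. This is an illustration under assumptions, not the authors' PyTorch model: the toy sizes, the random weights, the ReLU activation and the max-over-time pooling are choices made for this example, since the paper does not specify the activation or pooling type.

```python
import random

def conv1d_max_pool(embedded, filters, bias):
    """Slide each filter over windows of consecutive word vectors, apply ReLU,
    and keep the maximum activation per filter (max-over-time pooling)."""
    window = len(filters[0])
    pooled = []
    for filt, b in zip(filters, bias):
        best = 0.0  # a ReLU output is never below zero
        for start in range(len(embedded) - window + 1):
            act = b
            for offset in range(window):
                vec = embedded[start + offset]
                act += sum(w * x for w, x in zip(filt[offset], vec))
            best = max(best, act)
        pooled.append(best)
    return pooled

def cnn_logit(token_ids, embedding, filters, bias, out_weights, out_bias):
    """Embedding lookup -> convolution + pooling -> linear layer giving one logit."""
    embedded = [embedding[i] for i in token_ids]
    features = conv1d_max_pool(embedded, filters, bias)
    return sum(w * f for w, f in zip(out_weights, features)) + out_bias

random.seed(0)
VOCAB, EMB_DIM, N_FILTERS, WINDOW = 10, 4, 2, 3  # toy sizes; the paper used 100-dim embeddings
embedding = [[random.uniform(-1, 1) for _ in range(EMB_DIM)] for _ in range(VOCAB)]
filters = [[[random.uniform(-1, 1) for _ in range(EMB_DIM)] for _ in range(WINDOW)]
           for _ in range(N_FILTERS)]
bias = [0.0] * N_FILTERS
out_weights = [random.uniform(-1, 1) for _ in range(N_FILTERS)]
logit = cnn_logit([1, 4, 2, 0, 0], embedding, filters, bias, out_weights, 0.0)
```

With a window of size 1 each filter sees a single word at a time; a window of 3 lets it respond to short phrases.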

We began with a window size of 1, i.e. no neighbourhood around the word was used to enhance its semantics. By expanding the window size and plotting it against the accuracy and F1 scores, we saw that they leveled off at a window size of 3.

As a next step, we tried to see whether increasing the size of the output vector from the embedding layer would improve the feature extraction. We went up to 1000 entries in a one-dimensional vector, but that drastically added to the computational time and barely altered the output scores.

Afterward, we experimented with changing the structure of the network and making it deeper. We added a couple more convolutional layers, but the network still only reached an accuracy of 66%.

We made a similar experiment by varying the probability of a feature being dropped (originally set to 50%). It turned out, however, that both increasing and decreasing the probability caused a deterioration in our results rather than an improvement.

At this point, we considered the structure of the dataset. For task A, the dataset was imbalanced: about 66% of the data was labelled as NOT offensive, while about 33% was labelled as OFFensive. This meant that the network could have learned to achieve the accuracy and F1 results we obtained just by classifying everything as the predominant class, i.e. NOT offensive.
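A short worked calculation shows what such a majority-class classifier would score. It assumes an exact two-thirds to one-third split as an approximation of the proportions quoted above, with $P$ and $R$ denoting precision and recall.

```latex
\begin{aligned}
\text{Accuracy} &= \tfrac{2}{3} \approx 0.667 \\
P_{\text{NOT}} &= \tfrac{2}{3}, \quad R_{\text{NOT}} = 1
  \;\Rightarrow\; F1_{\text{NOT}} = \frac{2 \cdot \tfrac{2}{3} \cdot 1}{\tfrac{2}{3} + 1} = 0.8 \\
F1_{\text{OFF}} &= 0 \quad \text{(no tweet is ever predicted as OFF)} \\
\text{Macro-}F1 &= \tfrac{1}{2}\,(0.8 + 0) = 0.4
\end{aligned}
```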

In order to mitigate this issue, we tried experimenting with different loss functions and optimizers. We started off using BCEWithLogitsLoss, which weights all classes as if each had an equal portion of the dataset. We tested other loss functions (MSELoss, CrossEntropyLoss) which we have seen to have a good performance in other deep learning scenarios, but they did not yield better results.
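One common way to counter class imbalance with this loss is to up-weight the positive class, which PyTorch exposes as the `pos_weight` argument of BCEWithLogitsLoss. The paper does not report using it, so the sketch below is an illustration of the technique rather than the authors' setup. It reimplements the weighted loss in pure Python so the effect of the weight is visible.

```python
import math

def bce_with_logits(logits, targets, pos_weight=1.0):
    """Mean binary cross-entropy on raw logits, with the loss of positive
    examples (target 1) scaled by pos_weight."""
    total = 0.0
    for x, y in zip(logits, targets):
        # numerically stable log(sigmoid(x))
        log_sig = -math.log1p(math.exp(-x)) if x >= 0 else x - math.log1p(math.exp(x))
        log_one_minus_sig = log_sig - x  # log(1 - sigmoid(x))
        total += -(pos_weight * y * log_sig + (1 - y) * log_one_minus_sig)
    return total / len(logits)

unweighted = bce_with_logits([0.0, 0.0], [1, 0])
weighted = bce_with_logits([0.0, 0.0], [1, 0], pos_weight=2.0)
```

A weight close to the ratio of negative to positive examples (roughly 2 for task A) makes errors on the minority class count about as much in total as errors on the majority class.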

The optimizer which we originally used was SGD, but we found out that the optimizer which worked the best with BCEWithLogitsLoss was Adam.

After altering pretty much every parameter we could and not achieving a significant improvement in our results, we decided that a CNN might not be the best approach to this problem and considered using Recurrent Neural Networks.

4.2 Recurrent Neural Network (RNN)

With all the previous results in mind, we decided to move our attention to RNNs (Pak and Paroubek, 2010). Since RNN models are well suited to working with sequential data, we decided to use their capabilities to solve our classification task.

We made use of two different approaches, LSTM and Bidirectional LSTM (B-LSTM), which try to mitigate some of the drawbacks of the vanilla RNN, mainly the vanishing gradient (Tai et al., 2015).

Since both LSTM and B-LSTM gave us results above the baseline, we decided to use them for tasks B and C as well. We trained different models for each task. After the input was pre-processed and tokenized, we used the word2idx function we defined to encode each sentence as a sequence of word indices. We then padded the result with 0 values to obtain equally sized inputs.

The resulting inputs were used for both the LSTM and the B-LSTM models, which share a common architecture. The input layer of both is connected to an Embedding layer whose goal is to reduce the input's dimensionality by mapping it into a more meaningful latent space, which facilitates the classification procedure. The next layer is composed of LSTM or B-LSTM cells. The two share a common internal structure; the difference is that the B-LSTM is built out of two LSTM cells, one trained forward (from left to right, the natural order of the input sequence) and the other trained backward (from right to left, considering the inverse order of the input). The output is produced by a Dense layer, which takes the result of the LSTM/B-LSTM layer and reduces it to the probability of the input belonging to each of the output classes.
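For reference, the standard LSTM cell update and the bidirectional combination can be written as follows. The notation is the usual textbook one and is an assumption of this note, not taken from the paper: $x_t$ is the embedded word at position $t$, $h_t$ and $c_t$ are the hidden and cell states, $\sigma$ is the logistic sigmoid and $\odot$ is element-wise multiplication. The final line assumes the forward and backward final states are concatenated before the Dense layer, with a sigmoid output for the binary tasks (a softmax would replace it for the three-class task).

```latex
\begin{aligned}
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) \\
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) \\
\tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
h_t &= o_t \odot \tanh(c_t) \\
h &= [\,\overrightarrow{h}_T \,;\, \overleftarrow{h}_1\,], \qquad
\hat{y} = \sigma(w^\top h + b)
\end{aligned}
```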

4.3 Logistic Regression (LR)

We were quite interested to see how Neural Networks perform against a simpler method like Logistic Regression. As such, we trained a model for each task in order to compare its behaviour to our RNNs.

We started off by creating a one-hot encoding of the training dataset (Agarwal et al., 2011). Upon making predictions on the testing dataset, we ran into the problem that the one-hot encoding of the testing dataset was different from the one-hot encoding of the training dataset. This was because the two datasets did not contain the same set of words. We worked around this problem by using the testing dataset in conjunction with the training dataset when creating the one-hot encoding. When we trained the Logistic Regression model, we only fed it the one-hot encoding of the training dataset.
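A common alternative to the workaround described above, shown here only as a sketch and not as what the authors did, is to build the vocabulary from the training data alone and ignore any word in the test data that was not seen during training. Both encodings then have the same width without the test data influencing the features. The helper names are illustrative.

```python
def fit_vocabulary(train_docs):
    """Build the word -> column mapping from training documents only."""
    words = sorted({tok for doc in train_docs for tok in doc})
    return {tok: i for i, tok in enumerate(words)}

def one_hot_rows(docs, vocabulary):
    """Mark the presence of each known word; unseen words are skipped."""
    rows = []
    for doc in docs:
        row = [0] * len(vocabulary)
        for tok in doc:
            if tok in vocabulary:
                row[vocabulary[tok]] = 1
        rows.append(row)
    return rows

train_docs = [["you", "are", "loud"], ["so", "true"]]
test_docs = [["you", "are", "wrong"]]   # "wrong" never appears in training
vocabulary = fit_vocabulary(train_docs)
x_train = one_hot_rows(train_docs, vocabulary)
x_test = one_hot_rows(test_docs, vocabulary)
```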

To our surprise, Logistic Regression appeared to perform quite well on all three tasks.

5 Training Challenges and Tuning

During the training stage, a series of optimization decisions were made. Firstly, as the corpus was not very large and suffered from imbalance, we decided to use a network with just 4 layers (Input, Embedding, LSTM/B-LSTM, and a Dense output layer). This measure was taken to reduce over-fitting and allowed us to use an LSTM for the given task. We tried adding more layers, but the increased number of parameters led to poor generalization, as the more flexible model adapted too well to the training data it was fed. The Embedding layer compresses the 2500 words of the vocabulary into embedding vectors of size 60. We chose to limit ourselves to the first 2500 words of our vocabulary because we observed, after plotting the word frequency, that the occurrence frequency dropped dramatically after the first 2486 words. At the next stage, the LSTM/B-LSTM layer takes its input from the Embedding layer, a vector of 60 elements per word, and produces a cell output of size 100. After repeated experiments, we found that 100 was the dimension that offered the best results for the given problem. This output goes to the Dense layer, which predicts the probability of the input sentence being Offensive/NotOffensive, Targeted/NotTargeted or Individual/Collective/Other, depending on the task.
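To give a sense of the model size implied by these dimensions, the sketch below counts trainable parameters. It rests on several assumptions not stated in the paper: a vocabulary of exactly 2500 rows in the embedding table, Keras-style LSTM layers with one bias vector per gate, a single sigmoid output unit for the binary tasks, and concatenation of the two directions in the B-LSTM.

```python
def embedding_params(vocab_size, embed_dim):
    """One embedding vector per vocabulary entry."""
    return vocab_size * embed_dim

def lstm_params(input_dim, units):
    """Four gates, each with input weights, recurrent weights and a bias vector."""
    return 4 * (units * (input_dim + units) + units)

def dense_params(input_dim, outputs):
    """Fully connected layer: weights plus one bias per output."""
    return input_dim * outputs + outputs

VOCAB_SIZE, EMBED_DIM, UNITS = 2500, 60, 100  # sizes reported in the text

lstm_total = (embedding_params(VOCAB_SIZE, EMBED_DIM)
              + lstm_params(EMBED_DIM, UNITS)
              + dense_params(UNITS, 1))
blstm_total = (embedding_params(VOCAB_SIZE, EMBED_DIM)
               + 2 * lstm_params(EMBED_DIM, UNITS)   # forward and backward cells
               + dense_params(2 * UNITS, 1))         # concatenated outputs
```

Under these assumptions the embedding table accounts for 150,000 of the roughly 214,500 parameters of the LSTM model, so the vocabulary cut-off is the main lever on model size.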

During the tuning of hyperparameters, we built different models with different properties. Plotting the training versus validation accuracy, we observed that the models were over-fitting heavily. As we could not enlarge the training set, we tried regularization techniques instead. The first one was Dropout, which showed its usefulness by reducing the over-fitting and increasing the accuracy on the validation set by about 1.5%. Next, we tried changing the output size of some of the network's layers (Embedding size and LSTM memory cell size), which in turn brought some improvement. Another technique that noticeably improved the model's performance was L2 regularization: it slightly increased the generalization capabilities of the network and reduced the discrepancy between the training set accuracy and the test one. Lastly, we plotted the accuracy versus the number of epochs used for training and used Early Stopping to maximize our generalization capabilities.
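The early-stopping rule can be sketched as a small function over per-epoch validation scores. The patience-based criterion and the toy scores below are assumptions made for this example; the paper does not state the exact stopping rule that was used.

```python
def early_stopping_epoch(val_scores, patience=2):
    """Return the 1-based epoch with the best validation score, scanning epochs
    in order and stopping once `patience` epochs pass without improvement."""
    best_score, best_epoch, waited = float("-inf"), 0, 0
    for epoch, score in enumerate(val_scores, start=1):
        if score > best_score:
            best_score, best_epoch, waited = score, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                break
    return best_epoch

# Hypothetical validation accuracies, one per epoch.
scores = [0.70, 0.73, 0.74, 0.73, 0.72, 0.80]
stop_at = early_stopping_epoch(scores, patience=2)
```

With a patience of 2 training would stop after epoch 5 and keep the weights from epoch 3; a larger patience trades extra training time for the chance of catching a later improvement.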

As the B-LSTM considers the sequence from both ends, it takes better account of the context. For our tasks, this extra capability improved performance by an average of 3% compared to a similar LSTM model. During hyperparameter tuning, we observed that, given the small size of the corpus, the B-LSTM overfits easily, so early stopping was used to mitigate this effect.

6 Results

For task A, we tested many different models and compared their performance. From the results we got, we concluded Logistic Regression, LSTM and B-LSTM were the best ones. Because of this, we decided to only use them on the following tasks and to spend our time trying to improve their performance by trying out different hyperparameters. We also submitted the predictions for all the models we used to Codalab and waited for the results to make the final decision on which models worked the best. Table 1 shows the accuracy and F1 score for each model we tested, while Table 2 summarizes how those models performed on the official test set.

Task A
Model                       Accuracy   F1
B-LSTM 20 epochs            0.7190     0.6901
B-LSTM 5 epochs             0.7462     0.7147
B-LSTM 7 epochs + L2        0.7892     0.7309
LSTM 6 epochs + L2          0.7522     0.7165
LSTM 12 epochs + L2         0.7322     0.6322
CNN                         0.6148     0.34
Multinomial NB + RD         0.7726     0.7687
SGD Classifier + RD         0.7794     0.7781
Logistic Regression         0.7681     0.7282
Logistic Regression + RD    0.8529     0.8528

Task B
Model                       Accuracy   F1
LSTM                        0.6681     0.3965
B-LSTM                      0.5681     0.4399
Logistic Regression         0.85       0.5247
Logistic Regression + RD    0.9072     0.9067

Task C
Model                       Accuracy   F1
LSTM                        0.6881     0.4929
B-LSTM                      0.7233     0.5023
Logistic Regression         0.7164     0.4630
Logistic Regression + RD    0.8478     0.8456

Table 1: Task performance of different models. Each model is trained with 90% of the training data and tested on the remaining 10%. RD indicates Random Draw to balance the number of samples of each type. L2 indicates L2 regularization.
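The Random Draw (RD) balancing mentioned in the caption of Table 1 is not described in detail. One plausible reading, assumed here, is random undersampling: draw from each class as many samples as the smallest class contains. The function name and the fixed seed are illustrative.

```python
import random
from collections import defaultdict

def random_draw_balance(samples, labels, seed=0):
    """Randomly draw the same number of samples from every class, using the
    size of the smallest class, and return shuffled (sample, label) pairs."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for sample, label in zip(samples, labels):
        by_label[label].append(sample)
    n = min(len(group) for group in by_label.values())
    balanced = []
    for label, group in by_label.items():
        for sample in rng.sample(group, n):
            balanced.append((sample, label))
    rng.shuffle(balanced)
    return balanced

samples = list(range(9))
labels = ["NOT"] * 6 + ["OFF"] * 3
balanced = random_draw_balance(samples, labels)
```

If the draw were instead done with replacement from the minority class (oversampling), duplicates of the same tweet could end up on both sides of a later train/validation split, which is worth keeping in mind when comparing scores.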

Task A
Model                       F1
LSTM 8 epochs + L2          0.718944681
B-LSTM 7 epochs + L2        0.72103681
Logistic Regression + RD    0.72552438

Task B
Model                       F1
LSTM 7 epochs + L2          0.542312321
B-LSTM 6 epochs + L2        0.560707749
Logistic Regression + RD    0.670806302

Task C
Model                       F1
LSTM 7 epochs + L2          0.439543212
B-LSTM 6 epochs + L2        0.441269841
Logistic Regression + RD    0.544192563

Table 2: Task performance of different models on the official test set.

7 Conclusion

This task provided good hands-on experience with Natural Language Processing. Our main takeaway was that suitably trained simpler models (Logistic Regression) can sometimes perform as well as, if not better than, more complicated ones (RNNs).

Our code and resources can be accessed via the following Google Drive URL: https://drive.google.com/drive/folders/10DDRyFcQ2cszSZAwnP2Lu-Bf4TB89urP


  • Abadi et al. (2015) Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dandelion Mané, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. 2015. TensorFlow: Large-scale machine learning on heterogeneous systems. Software available from tensorflow.org.
  • Agarwal et al. (2011) Apoorv Agarwal, Boyi Xie, Ilia Vovsha, Owen Rambow, and Rebecca Passonneau. 2011. Sentiment analysis of twitter data. In Proceedings of the Workshop on Language in Social Media (LSM 2011), pages 30–38.
  • Chollet et al. (2015) François Chollet et al. 2015. Keras. https://keras.io.
  • Karani (2018) Dhruvil Karani. 2018. Introduction to word embedding and word2vec. Medium.
  • Loper and Bird (2002) Edward Loper and Steven Bird. 2002. NLTK: The natural language toolkit. In Proceedings of the ACL Workshop on Effective Tools and Methodologies for Teaching Natural Language Processing and Computational Linguistics. Philadelphia: Association for Computational Linguistics.
  • Pak and Paroubek (2010) Alexander Pak and Patrick Paroubek. 2010. Twitter as a corpus for sentiment analysis and opinion mining. In LREC, volume 10, pages 1320–1326.
  • Paszke et al. (2017) Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. 2017. Automatic differentiation in pytorch.
  • Pedregosa et al. (2011) F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. 2011. Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research, 12:2825–2830.
  • Řehůřek and Sojka (2010) Radim Řehůřek and Petr Sojka. 2010. Software Framework for Topic Modelling with Large Corpora. In Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks, pages 45–50, Valletta, Malta. ELRA. http://is.muni.cz/publication/884893/en.
  • Tai et al. (2015) Kai Sheng Tai, Richard Socher, and Christopher D Manning. 2015. Improved semantic representations from tree-structured long short-term memory networks. arXiv preprint arXiv:1503.00075.
  • Zampieri et al. (2019a) Marcos Zampieri, Shervin Malmasi, Preslav Nakov, Sara Rosenthal, Noura Farra, and Ritesh Kumar. 2019a. Predicting the Type and Target of Offensive Posts in Social Media. In Proceedings of NAACL.
  • Zampieri et al. (2019b) Marcos Zampieri, Shervin Malmasi, Preslav Nakov, Sara Rosenthal, Noura Farra, and Ritesh Kumar. 2019b. SemEval-2019 Task 6: Identifying and Categorizing Offensive Language in Social Media (OffensEval). In Proceedings of The 13th International Workshop on Semantic Evaluation (SemEval).