ColBERT: Using BERT Sentence Embedding for Humor Detection

04/27/2020 ∙ by Issa Annamoradnejad, et al. ∙ Sharif Accelerator

Automatic humor detection has interesting use cases in modern technologies, such as chatbots and personal assistants. In this paper, we describe a novel approach for detecting humor in short texts using BERT sentence embedding. Our proposed model uses BERT to generate tokens and sentence embedding for texts. It sends embedding outputs as input to a two-layered neural network that predicts the target value. For evaluation, we created a new dataset for humor detection consisting of 200k formal short texts (100k positive, 100k negative). Experimental results show an accuracy of 98.1 percent for the proposed method, 2.1 percent improvement compared to the best CNN and RNN models and 1.1 percent better than a fine-tuned BERT model. In addition, the combination of RNN-CNN was not successful in this task compared to the CNN model.




1 Introduction

In Interstellar (2014 movie), a future world is depicted where humans can set the level of humor in robots (TARS, in the case of the movie). While we may have a long road ahead toward astral travel, we are very close to reaching high-quality systems injected with adjustable humor.

Humor, as a potential cause of laughter, is an important part of human communication: it not only makes people feel comfortable, it also creates a cozier environment Castro et al. (2016). Automatic humor detection in texts has interesting use cases in chatbots and personal assistants (such as Cortana and Siri). An appealing use case Castro et al. (2016) is to identify whether an input text should be taken seriously or not. This is critical for giving appropriate answers by recognizing the real motive behind a user's question, and for enhancing the user's overall experience with the system. A more advanced outcome would be the injection of humor into computer-generated responses, thus making conversations more engaging and interesting Blinov et al. (2017); Niculescu et al. (2013); Khooshabeh et al. (2011). One way to achieve this goal is to score the level of humor in possible answers and adapt it to maintain a desired level, similar to the mentioned movie.

With advances in NLP, researchers applied and evaluated state-of-the-art methods for the task of humor detection. This includes using statistical and N-gram analysis (Taylor and Mazlack, 2004), Regression Trees (Purandare and Litman, 2006), Word2Vec combined with K-NN Human Centric Features (Yang et al., 2015), and Convolutional Neural Networks (Chen and Soo, 2018; Weller and Seppi, 2019). With the popularity of transfer learning, a good body of research focused on using pre-trained models for different tasks of NLP. The BERT language model Devlin et al. (2018) initially obtained eleven state-of-the-art results and has been applied to several NLP tasks since then.

Existing humor detection datasets use a combination of formal non-humor texts and informal jokes, usually with incompatible statistics (text length, word count, etc.). This makes it possible to detect humor with simple analytical models, without understanding the underlying latent linguistic features. Moreover, the existing datasets are relatively small for a text classification task. These problems encouraged us to build and publish a dataset exclusively for the task of humor detection, on which simple feature-based models will not be able to score very well.

This paper aims at utilizing BERT for humor detection. Our approach builds on using BERT sentence embedding in a neural network. Given a text, our method first obtains its token representation from the BERT tokenizer. Then, by feeding the tokens into the BERT model, it obtains the BERT sentence embedding (768 hidden units). Finally, it passes the sentence embedding as input to a two-layered neural network to predict the target value.

We summarize our contributions as follows:

  • We introduce a new combined dataset for the task of humor detection, entitled “ColBERT” dataset, which contains exactly 200k short texts (100k positive, 100k negative). We reduced or completely removed the mentioned issues of the existing datasets from this new dataset.

  • We propose a novel model for humor detection by using BERT sentence embedding. We will introduce steps for preprocessing, tokenizing, sentence embedding, and NN architecture.

  • We evaluate our model on 20% of the dataset, and compare its performance with four state-of-the-art models on our new dataset.

The structure of this article is as follows: Section 2 reviews past work on the task of humor detection, with a focus on transfer learning methods. Section 3 introduces the data collection and preparation techniques, and the statistics of the new dataset. Section 4 elaborates on the methodology, and Section 5 presents our experimental results. Section 6 gives the concluding remarks.

2 Literature Review

With advances in NLP, researchers applied and evaluated state-of-the-art methods for the task of humor detection. This includes using statistical and N-gram analysis (Taylor and Mazlack, 2004), Regression Trees (Purandare and Litman, 2006), Word2Vec combined with K-NN Human Centric Features (Yang et al., 2015), and Convolutional Neural Networks (Chen and Soo, 2018; Weller and Seppi, 2019).

With the popularity of transfer learning, a good body of research focused on using pre-trained models for different tasks of NLP. Transfer learning in NLP, particularly models like ULMFiT Howard and Ruder (2018), Allen AI's ELMo Peters et al. (2018), and Google's BERT Devlin et al. (2018), focuses on storing knowledge gained from training on one problem and applying it to a different but related problem (usually after fine-tuning on a small amount of data).

The first transfer learning method in Natural Language Processing (NLP) was Universal Language Model Fine-tuning for Text Classification (ULMFiT) Howard and Ruder (2018). They trained a language model on a preliminary dataset, such as Wikitext, and then fine-tuned it for a new task on a small training set. Their sample model, besides significantly outperforming the state of the art on many tasks, matched the performance of older models trained on much more data while using only 100 labeled examples.

ELMo Peters et al. (2018) includes task-specific architectures and uses the pre-trained representations as additional features. It is a deep contextualized word representation that models complex characteristics of word use (e.g., syntax and semantics). They added these representations to existing models and showed improvements on six popular NLP tasks.

BERT Devlin et al. (2018) utilizes a multi-layer bidirectional transformer encoder consisting of several encoders stacked together, which can learn deep bidirectional representations. Similar to previous transfer learning methods, it is pre-trained on unlabeled data to be later fine-tuned for a variety of tasks, such as question answering. It initially came in two model sizes (BERT-Base and BERT-Large) and obtained eleven new state-of-the-art results. Since then, it has been pre-trained and fine-tuned for several tasks and languages, and several BERT-based architectures and model sizes have been introduced (such as Multilingual BERT, RoBERTa Liu et al. (2019), ALBERT Lan et al. (2019) and VideoBERT Sun et al. (2019a)).

Weller and Seppi (2019) focused on the task of detecting whether a joke is humorous by using a Transformer architecture. They approached this problem by building a model that learns to identify humorous jokes based on ratings taken from the popular Reddit r/Jokes thread (13884 negative and 2025 positives). They showed that their method outperforms all previous work done on their selected tasks, with an F-measure of 93.1 percent on the Puns dataset and 98.6 percent on the Short Jokes dataset.

There are emerging tasks related to humor detection. Yang et al. (2019) focused on predicting humor from audio information and reached 0.750 AUC using only audio data. A good amount of research focuses on detecting humor in non-English texts, such as Spanish (Chiruzzo et al., 2019; Ismailov, 2019; Giudice, 2019), Chinese Yang et al. (2019), and English-Hindi Khandelwal et al. (2018).

3 Data

Existing humor detection datasets use a combination of formal texts and informal jokes with incompatible statistics (text length, word count, etc.), which makes it possible to detect humor with simple analytical models, without understanding the underlying latent connections. Moreover, they are relatively small for a text classification task. These problems encouraged us to create a new dataset exclusively for the task of humor detection, on which simple feature-based models will not be able to predict accurately without any insight into the linguistic features.

We begin with a survey of the existing humor detection datasets (binary task), highlighting their size and data source (see Table 1 for an overview). There are other datasets focused on similar tasks, for example, punchline detection and success (whether or not a punchline triggers laughter) Chen and Lee (2017); Hasan et al. (2019), or the use of spoken audio and video to detect humor Bertero and Fung (2016); Hasan et al. (2019).

In this section, we will introduce data sources, data filtering methods, and some general statistics on the new dataset.

Dataset | #Positive | #Negative
16000 One-Liners Mihalcea and Strapparava (2005) | 16,000 | 16,002
Pun of the Day Yang et al. (2015) | 2,423 | 2,403
PTT Jokes Chen and Soo (2018) | 1,425 | 2,551
English-Hindi Khandelwal et al. (2018) | 1,755 | 1,698
ColBERT | 100,000 | 100,000

Table 1: Statistics on the existing datasets for the binary task of humor classification

3.1 Data Collection

We carefully analyzed existing datasets (exclusively on news stories, news headlines, Wikipedia pages, tweets, proverbs and jokes) with regard to table size, character length, word count, and formality of language. Since none of them was originally compatible with each other in the mentioned criteria, we selected two datasets with formal texts (one with humor texts and one without) and performed a few preprocessing actions and row cuts to make them syntactically similar.

The news dataset includes 200,853 news headlines (plus links, categories and stories) from 2012-2018 obtained from Huffington Post. Headlines are scattered across all news categories, including politics, wellness, entertainment and parenting.

The jokes dataset contains 231,657 jokes/humor short texts, crawled from Reddit communities (mostly from the /r/jokes and /r/cleanjokes subreddits). The dataset is compiled as a single csv file with no additional information about each text (such as the source, date, etc.) and is available at Kaggle. Chen and Soo (2018) combined this dataset with the WMT16 English news crawl, but did not publish the resulting dataset. Weller and Seppi (2019) also combined this dataset with sentences extracted from the WMT16 news crawl and made it publicly available.

3.2 Preprocessing and Filtering

First, we found that there are duplicate texts in both datasets. Dropping duplicate rows removed 1369 rows from the jokes dataset and 1558 rows from the news dataset.

Then, to make their statistics more similar, we analyzed the number of characters and words separately for each dataset and, having observed differences in their distributions, performed a few cuts. In short, we only kept texts with a character length between 30 and 100 and a word count between 10 and 18. The resulting data parts have very similar distributions with regard to these statistics.

In addition, we noticed that headlines in the news dataset use Title Case formatting (all words are capitalized, except non-initial articles like “a, the, and”, etc.), while this is not the case in the jokes dataset. Thus, we decided to apply Sentence Case formatting (capitalization as in a standard English sentence, e.g., “Witchcraft is real.”) to all news headlines by keeping the first character of the sentence in capital and lower-casing the rest. This simple modification also helps to prevent simple classifiers from reaching close to perfect accuracy.

Finally, we randomly selected 100k rows from each dataset and merged them together to create an evenly distributed dataset.
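The filtering steps above can be sketched in a few lines of standard-library Python. This is only an illustration under stated assumptions, not the authors' script: the function names are hypothetical, words are counted by whitespace splitting, the length bounds are treated as inclusive, and the order of the steps (sentence casing, then deduplication and cuts, then sampling) is assumed.

```python
import random

def to_sentence_case(text):
    """Keep the first character capitalized and lower-case the rest."""
    return text[:1].upper() + text[1:].lower()

def filter_texts(texts, min_chars=30, max_chars=100, min_words=10, max_words=18):
    """Drop exact duplicates, then keep texts inside the length ranges.

    Assumption: words are counted by splitting on whitespace and the
    bounds are inclusive; the paper does not spell out either detail.
    """
    seen = set()
    kept = []
    for text in texts:
        if text in seen:
            continue
        seen.add(text)
        n_words = len(text.split())
        if min_chars <= len(text) <= max_chars and min_words <= n_words <= max_words:
            kept.append(text)
    return kept

def build_balanced(jokes, headlines, per_class, seed=0):
    """Sample the same number of rows per class and merge them (1 = humor)."""
    rng = random.Random(seed)
    jokes = filter_texts(jokes)
    headlines = filter_texts([to_sentence_case(h) for h in headlines])
    rows = [(text, 1) for text in rng.sample(jokes, per_class)]
    rows += [(text, 0) for text in rng.sample(headlines, per_class)]
    rng.shuffle(rows)
    return rows
```

With the real data, `per_class` would be 100,000; `random.sample` raises a `ValueError` if fewer rows than that survive the cuts, which makes an undersized class easy to notice.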

3.3 ColBERT Dataset Analysis

The ColBERT dataset (footnote: dataset is available at: ) consists of 200k labeled short texts, equally distributed between humor and non-humor. It is much larger than the previous datasets (Table 1) and it contains texts with similar textual features. Correlation between character count and the target is insignificant (+0.09), and there is no notable connection between the target value and sentiment features (correlation coefficient of -0.09 and +0.02 for polarity and subjectivity, respectively). Table 2 describes the dataset with respect to a few general statistics.

Statistic | #chars | #words | #unique words | #punctuation | #duplicate words | #sentences | sentiment polarity | sentiment subjectivity
mean | 71.561 | 12.811 | 12.371 | 2.378 | 0.440 | 1.180 | 0.051 | 0.317
std | 12.305 | 2.307 | 2.134 | 1.941 | 0.794 | 0.448 | 0.288 | 0.327
min | 36 | 10 | 3 | 0 | 0 | 1 | -1.000 | 0.000
25% | 63 | 11 | 11 | 1 | 0 | 1 | 0.000 | 0.000
50% | 71 | 12 | 12 | 2 | 0 | 1 | 0.000 | 0.268
75% | 80 | 14 | 14 | 3 | 1 | 1 | 0.150 | 0.542
max | 99 | 22 | 22 | 37 | 13 | 2 | 1.000 | 1.000

Table 2: General statistics of the ColBERT dataset (100k positive, 100k negative)
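As an illustration of how a correlation between a text feature and the binary target can be checked, the sketch below computes a Pearson coefficient between character count and label. The choice of the Pearson coefficient is an assumption (the text does not name the coefficient used), and the helper names and sample rows are hypothetical.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def char_count_correlation(rows):
    """Correlate character count with the 0/1 humor label.

    `rows` is a list of (text, label) pairs; both classes must be present
    and the texts must not all have the same length.
    """
    lengths = [len(text) for text, _ in rows]
    labels = [label for _, label in rows]
    return pearson(lengths, labels)
```

A coefficient near zero, as reported for this dataset, suggests that text length alone carries little information about the label.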

4 Model

Our approach builds on using BERT sentence embedding in a neural network. More specifically, our method first obtains the token representation of a text from the BERT tokenizer, using a maximum sequence length of 100 (BERT itself supports sequences of up to 512 tokens). Then, by feeding the tokens as input into the BERT model, it generates the BERT sentence embedding (768 hidden units). The model passes the sentence embedding as input to a two-layered neural network (dropout=0.2, activation=sigmoid) to predict the single target value. Finally, we apply a percentage split on the target predictions to produce binary results. Training is performed on the neural network and not on the BERT model.

BERT comes in two pre-trained general sizes (BERT-Base and BERT-Large), both pre-trained on the same data: the BooksCorpus Zhu et al. (2015) with 800M words and English Wikipedia with 2,500M words Devlin et al. (2018). In our proposed method, we used the smaller model (BERT-Base) with the following characteristics: 12 layers, 768 hidden units, 12 attention heads, and 110M parameters, trained on lower-cased English text (uncased).
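To make the data flow concrete, here is a minimal standard-library sketch of the classification head only: a 768-dimensional sentence embedding goes through two dense layers with sigmoid activations, and the resulting score is thresholded into a binary label. It is not the authors' implementation. The hidden width of 32, the placement of dropout after the first layer, the 0.5 threshold, and the random vector standing in for a real BERT embedding are all assumptions made for illustration.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def init_layer(n_in, n_out, rng):
    """Small random weights and zero biases for one dense layer."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

def dense(vector, weights, biases):
    """Fully connected layer followed by a sigmoid activation."""
    return [sigmoid(sum(w * v for w, v in zip(row, vector)) + b)
            for row, b in zip(weights, biases)]

def dropout(vector, rate, rng):
    """Inverted dropout; applied only while training."""
    keep = 1.0 - rate
    return [v / keep if rng.random() < keep else 0.0 for v in vector]

def predict(embedding, layers, threshold=0.5, train=False, rng=None):
    """Return (score, label) for one sentence embedding."""
    (w1, b1), (w2, b2) = layers
    hidden = dense(embedding, w1, b1)
    if train:
        hidden = dropout(hidden, 0.2, rng)  # dropout=0.2 as stated in the text
    score = dense(hidden, w2, b2)[0]
    return score, int(score >= threshold)  # threshold value is an assumption

rng = random.Random(0)
EMBED_DIM, HIDDEN_DIM = 768, 32  # HIDDEN_DIM is hypothetical
layers = [init_layer(EMBED_DIM, HIDDEN_DIM, rng), init_layer(HIDDEN_DIM, 1, rng)]
# Stand-in for a BERT sentence embedding; the BERT weights stay frozen.
fake_embedding = [rng.gauss(0.0, 1.0) for _ in range(EMBED_DIM)]
score, label = predict(fake_embedding, layers)
```

Because only the two small layers would be trained, the sentence embeddings could in principle be computed once and cached, which keeps each training epoch cheap compared with fine-tuning BERT itself.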

4.1 Preprocessing

To obtain clean data for training, with the overall goal of an accurate model, we performed a few textual actions on both the training and test sets before sending data to the tokenizer:

  • Cleaning Contractions: We replaced all contractions with the longer version of the expressions. For example, “can’t” is replaced by “can not”.

  • Separating Punctuation Marks: We separated punctuation marks from other words to create cleaner sentences. For example, “This is’ (fun).” is converted to “This is ’ ( fun ) .”

  • Cleaning Special Characters: We replaced special characters with a meaningful alias. For example, “alpha” instead of “α”.
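The three actions above can be sketched with the standard `re` module. The lookup tables below are small hypothetical samples rather than the lists used in the paper, and the order in which the actions are applied is an assumption.

```python
import re

# Hypothetical sample tables; the full lists used in the paper are not given.
CONTRACTIONS = {"can't": "can not", "won't": "will not", "i'm": "i am", "it's": "it is"}
SPECIAL_CHARS = {"α": "alpha", "β": "beta", "π": "pi"}

_CONTRACTION_RE = re.compile(
    r"\b(?:" + "|".join(re.escape(key) for key in CONTRACTIONS) + r")\b",
    re.IGNORECASE,
)

def clean_contractions(text):
    """Expand contractions (the replacement is lower-case, as for an uncased model)."""
    text = text.replace("’", "'")  # normalize curly apostrophes first
    return _CONTRACTION_RE.sub(lambda m: CONTRACTIONS[m.group(0).lower()], text)

def replace_special_chars(text):
    """Replace special characters with a readable alias."""
    for char, alias in SPECIAL_CHARS.items():
        text = text.replace(char, f" {alias} ")
    return text

def separate_punctuation(text):
    """Put spaces around every punctuation mark, then collapse whitespace."""
    text = re.sub(r"([^\w\s])", r" \1 ", text)
    return re.sub(r"\s+", " ", text).strip()

def preprocess(text):
    return separate_punctuation(replace_special_chars(clean_contractions(text)))
```

Contractions are expanded before punctuation is separated because the apostrophe inside “can’t” would otherwise be split off as a punctuation mark and the contraction would no longer match.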

5 Experiments

In this section, we compare the ColBERT model with a few baseline models. The data is split into 80% (160k) training and 20% (40k) testing rows, with each class split roughly 50-50 between the two parts.
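One simple way to obtain such a split is to shuffle and cut each class separately, as sketched below. This stratified variant is an assumption for illustration (the text only states that the classes are roughly balanced), and the function name, seed, and toy rows are hypothetical.

```python
import random

def stratified_split(rows, test_fraction=0.2, seed=0):
    """Split (text, label) rows into train/test, preserving class balance."""
    rng = random.Random(seed)
    by_label = {}
    for text, label in rows:
        by_label.setdefault(label, []).append((text, label))
    train, test = [], []
    for label in sorted(by_label):
        group = by_label[label]
        rng.shuffle(group)
        n_test = round(len(group) * test_fraction)
        test.extend(group[:n_test])
        train.extend(group[n_test:])
    # Shuffle again so the classes are interleaved within each part.
    rng.shuffle(train)
    rng.shuffle(test)
    return train, test

# Toy data standing in for the 200k labeled texts.
rows = [(f"text {i}", i % 2) for i in range(1000)]
train, test = stratified_split(rows)
```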

5.1 Baselines

In order to have fair baselines, we implemented three models that performed very well (Table 3). We trained all of them for 5 epochs, with log loss function, mlp_depth=2 and mlp_drop_out=0.2. We trained them on the complete 160K training set, which was not the case for the main model. Baseline models are:

  1. Convolutional Neural Network (CNN) is a Deep Learning algorithm that uses convolution in place of general matrix multiplication in at least one of its layers Goodfellow et al. (2016).

  2. Attention-Based Recurrent Neural Network: A Recurrent Neural Network (RNN) is a class of artificial neural networks, where connections between nodes form a directed graph along a temporal sequence. In Attention-Based RNN, a bidirectional RNN (BiRNN) reads the source sequence in both forward and backward directions [Bing liu: Attention-Based Recurrent].

  3. BiRNN-CNN: We used a combination of bidirectional RNN with CNN to access the advantages of both models. We integrate them in an end-to-end way, starting with BiRNN model.

  4. BERT-base-uncased: We fine-tuned BERT-base-uncased with different learning rates, settling on the lower learning rate needed for BERT to overcome the forgetting problem Sun et al. (2019b). After trying several learning-rate values, we selected 1.5e-4 for the final model.

5.2 Results

We trained all models (proposed and baselines) for 5 epochs. However, while we used only 60k rows (37.5 percent of the available 160k rows) for training of the proposed method, all baselines were trained on the complete training dataset.

Our experiment with the new dataset found the proposed model’s accuracy and F1 score to be 0.981. This is a 2.1 percent jump from the best CNN model and 2.3 percent from the Attention RNN (Table 3). The fine-tuned BERT-Base model performed better than the first three baselines, reaching close to 97 percent accuracy, 1.1 percent behind the proposed model. In addition, the combination of RNN-CNN was not successful in this task and did not improve results compared to the CNN model.

For timing and performance, each epoch took 23.1 minutes on average on a computer with 16GB RAM and an Intel(R) Xeon(R) CPU at 2.00GHz.

Method | Configuration | Accuracy | Precision | Recall | F1
CNN | cnn_drop_out=0.2 | 0.960 | 0.955 | 0.966 | 0.960
Attention RNN | rnn_depth=1, rnn_drop_out=0.3, rnn_state_drop_out=0.3 | 0.958 | 0.977 | 0.939 | 0.957
BiRNN - CNN | rnn_depth=1, cnn_drop_out=0.2, rnn_drop_out=0.3, rnn_state_drop_out=0.3 | 0.960 | 0.954 | 0.965 | 0.960
BERT-base-uncased | learning-rate=1.5e-4 | 0.969 | 0.965 | 0.975 | 0.970
ColBERT | BERT-base-uncased, learning-rate=1.5e-4, NN-L1-dropout=0.2, L1-activation=sigmoid | 0.981 | 0.984 | 0.977 | 0.981

Table 3: Comparison of Methods on the ColBERT Dataset
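For reference, the four metrics reported in Table 3 follow the standard definitions for binary classification. The sketch below computes them from true and predicted labels, treating humor (1) as the positive class; the function name and the toy labels are hypothetical and unrelated to the reported results.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 with label 1 as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

# Toy labels for illustration only.
metrics = classification_metrics([1, 1, 1, 0, 0, 0, 1, 0], [1, 1, 0, 0, 0, 1, 1, 0])
```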

6 Conclusion

In this study, we have demonstrated the ability to classify formal short texts based on the existence of humor in the text. Our approach consists of injecting BERT sentence embedding into a neural network model. Moreover, we built a novel and large dataset consisting of 200k formal short texts. We trained our model and four baseline classifiers on the dataset and obtained an accuracy of 0.981 on the test set with the proposed method. Results showed that a fine-tuned pre-trained BERT model can achieve higher accuracy than CNN and RNN models. In addition to the task of humor detection, the proposed method can be used in future studies examining a wider range of contexts and text classification tasks.


References

  • D. Bertero and P. Fung (2016) Deep learning of audio and language features for humor prediction. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC’16), pp. 496–501.
  • V. Blinov, K. Mishchenko, V. Bolotova, and P. Braslavski (2017) A pinch of humor for short-text conversation: an information retrieval approach. In International Conference of the Cross-Language Evaluation Forum for European Languages, pp. 3–15.
  • S. Castro, M. Cubero, D. Garat, and G. Moncecchi (2016) Is this a joke? Detecting humor in Spanish tweets. In Ibero-American Conference on Artificial Intelligence, pp. 139–150.
  • L. Chen and C. Lee (2017) Predicting audience’s laughter during presentations using convolutional neural network. In Proceedings of the 12th Workshop on Innovative Use of NLP for Building Educational Applications, pp. 86–90.
  • P. Chen and V. Soo (2018) Humor recognition using deep learning. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pp. 113–117.
  • L. Chiruzzo, S. Castro, M. Etcheverry, D. Garat, J. J. Prada, and A. Rosá (2019) Overview of HAHA at IberLEF 2019: humor analysis based on human annotation. In Proceedings of the Iberian Languages Evaluation Forum (IberLEF 2019). CEUR Workshop Proceedings, CEUR-WS, Bilbao, Spain (9 2019).
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2018) BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
  • V. Giudice (2019) Aspie96 at HAHA (IberLEF 2019): humor detection in Spanish tweets with character-level convolutional RNN. In Proceedings of the Iberian Languages Evaluation Forum (IberLEF 2019). CEUR Workshop Proceedings, CEUR-WS, Bilbao, Spain.
  • I. Goodfellow, Y. Bengio, and A. Courville (2016) Deep learning. MIT Press.
  • M. K. Hasan, W. Rahman, A. Zadeh, J. Zhong, M. I. Tanveer, L. Morency, et al. (2019) UR-FUNNY: a multimodal language dataset for understanding humor. arXiv preprint arXiv:1904.06618.
  • J. Howard and S. Ruder (2018) Universal language model fine-tuning for text classification. arXiv preprint arXiv:1801.06146.
  • A. Ismailov (2019) Humor analysis based on human annotation challenge at IberLEF 2019: first-place solution. In Proceedings of the Iberian Languages Evaluation Forum (IberLEF 2019). CEUR Workshop Proceedings, CEUR-WS, Bilbao, Spain (9 2019).
  • A. Khandelwal, S. Swami, S. S. Akhtar, and M. Shrivastava (2018) Humor detection in English-Hindi code-mixed social media content: corpus and baseline system. arXiv preprint arXiv:1806.05513.
  • P. Khooshabeh, C. McCall, S. Gandhe, J. Gratch, and J. Blascovich (2011) Does it matter if a computer jokes. In CHI’11 Extended Abstracts on Human Factors in Computing Systems, pp. 77–86.
  • Z. Lan, M. Chen, S. Goodman, K. Gimpel, P. Sharma, and R. Soricut (2019) ALBERT: a lite BERT for self-supervised learning of language representations. arXiv preprint arXiv:1909.11942.
  • Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov (2019) RoBERTa: a robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.
  • R. Mihalcea and C. Strapparava (2005) Making computers laugh: investigations in automatic humor recognition. In Proceedings of the Conference on Human Language Technology and Empirical Methods in Natural Language Processing, pp. 531–538.
  • A. Niculescu, B. van Dijk, A. Nijholt, H. Li, and S. L. See (2013) Making social robots more attractive: the effects of voice pitch, humor and empathy. International Journal of Social Robotics 5 (2), pp. 171–191.
  • M. E. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, and L. Zettlemoyer (2018) Deep contextualized word representations. arXiv preprint arXiv:1802.05365.
  • A. Purandare and D. Litman (2006) Humor: prosody analysis and automatic recognition for F*R*I*E*N*D*S. In Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing, pp. 208–215.
  • C. Sun, A. Myers, C. Vondrick, K. Murphy, and C. Schmid (2019a) VideoBERT: a joint model for video and language representation learning. In Proceedings of the IEEE International Conference on Computer Vision, pp. 7464–7473.
  • C. Sun, X. Qiu, Y. Xu, and X. Huang (2019b) How to fine-tune BERT for text classification?. In China National Conference on Chinese Computational Linguistics, pp. 194–206.
  • J. M. Taylor and L. J. Mazlack (2004) Computationally recognizing wordplay in jokes. In Proceedings of the Annual Meeting of the Cognitive Science Society, Vol. 26.
  • O. Weller and K. Seppi (2019) Humor detection: a transformer gets the last laugh. arXiv preprint arXiv:1909.00252.
  • D. Yang, A. Lavie, C. Dyer, and E. Hovy (2015) Humor recognition and humor anchor extraction. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pp. 2367–2376.
  • Z. Yang, B. Hu, and J. Hirschberg (2019) Predicting humor by learning from time-aligned comments. Proc. Interspeech 2019, pp. 496–500.
  • Y. Zhu, R. Kiros, R. Zemel, R. Salakhutdinov, R. Urtasun, A. Torralba, and S. Fidler (2015) Aligning books and movies: towards story-like visual explanations by watching movies and reading books. In Proceedings of the IEEE International Conference on Computer Vision, pp. 19–27.