Detecting Hate Speech in Multi-modal Memes

by Abhishek Das, et al.





1 Abstract

In the past few years, there has been a surge of interest in multi-modal problems, from image captioning to visual question answering and beyond. In this paper, we focus on hate speech detection in multi-modal memes, which pose an interesting multi-modal fusion problem. We address the Facebook Meme Challenge (Kiela et al., 2020), a binary classification problem of predicting whether a meme is hateful or not. A crucial characteristic of the challenge is that it includes “benign confounders” to counter the possibility of models exploiting unimodal priors. The challenge states that the state-of-the-art models perform poorly compared to humans. During our analysis of the dataset, we realized that the majority of the data points that are originally hateful are turned benign just by describing the image of the meme. Also, the majority of the multi-modal baselines give more preference to the hate speech (language modality). To tackle these problems, we explore the visual modality using object detection and image captioning models to fetch the “actual caption” and then combine it with the multi-modal representation to perform binary classification. This approach tackles the benign text confounders present in the dataset to improve the performance. Another approach we experiment with is to improve the prediction with sentiment analysis. Instead of only using multi-modal representations obtained from pre-trained neural networks, we also include the unimodal sentiment to enrich the features. We perform a detailed analysis of the above two approaches, providing compelling reasons in favor of the methodologies used.

Figure 1: Multi-modal “mean” meme and Benign confounders. Mean meme (left), Benign text confounder (middle) and Benign image confounder (right)

2 Introduction

In today’s world, social media platforms play a major role in influencing people’s everyday lives. Though they have numerous benefits, they are also capable of shaping public opinion and religious beliefs across the world. They can be used to attack people directly or indirectly based on race, caste, immigration status, religion, ethnicity, nationality, sex, gender identity, sexual orientation, and disability or disease. Hate speech on online social media can trigger social polarization and hate crimes. On large platforms such as Facebook and Twitter, it is practically impossible for humans to monitor the source and spread of such malicious activities, so it falls to the machine learning and artificial intelligence research community to address the problem of detecting hate speech efficiently.

In tasks such as VQA and multi-modal machine translation, it has been observed that baseline models using only the language domain perform well without exploiting multi-modal understanding and reasoning (Devlin et al., 2015). However, the Facebook Hateful Memes Challenge dataset is designed in such a manner that unimodal models exploiting just the language or vision modality separately will fail, and only models that learn true multi-modal reasoning and understanding will be able to perform well.

The dataset authors achieve this by introducing “benign confounders” in the dataset, i.e., for every hateful meme, they find an alternative image or caption which, when substituted, is enough to make the meme harmless or non-hateful, thus flipping the label. Consider a sentence like “dishwasher for sale, missing parts”. Unimodally, this sentence is harmless, but when combined with an equally harmless image of a girl without a hand, it suddenly becomes mean. See Figure 1 for an illustration. Thus, this challenge set is an excellent stage for the development of robust multi-modal models, and at the same time it addresses an important real-world problem of detecting hateful speech on online social media platforms. The majority of the prior baselines aim to solve this problem by finding an alignment between the two modalities, but they face the difficulty of not knowing the context behind the combination of image and text.

In this paper, we introduce two major ideas in which we explore the two modalities using pre-trained image captioning models and sentiment analysis to understand the context and relationship between them. Many of the baselines tend to focus more on the text modality for hate speech. Also, during the data analysis, we realized that the majority of the hateful memes are converted into benign ones just by describing the image, i.e., by benign text confounders. In our first approach, we try to balance the representations of the two modalities and tackle the benign text confounders by obtaining a deeper understanding of the image via object detection and captioning. We then fuse this representation with the multi-modal representation from the state-of-the-art models to improve the performance. Through error analysis, we also found that fine-tuning a model with pre-trained multi-modal representations does not always provide desirable results. This may be because those embeddings are pre-trained to predict the semantic correlation between image and text, but semantic information is difficult to capture and may be insufficient for solving this challenge. Thus, we include some high-level features, namely text and image sentiments, to aid the prediction, because sentiment analysis is a related and relatively simple task. On the Facebook Hateful Memes Challenge dataset, we show that both of our approaches benefit the prediction.

In what follows, we discuss the related prior work in Section 3, followed by the problem statement and our novel approaches in Section 4. We then present our experimental setup in Section 5, followed by the results and discussion in Section 6. Finally, we end with the conclusion and future directions in Section 7.

3 Related Work

Hate speech detection has gained more and more attention in recent years. Several text-only hate speech datasets have been released, mostly based on Twitter [(Waseem, 2016), (Waseem and Hovy, 2016), (Davidson et al., 2017)], and various architectures have been proposed for classifiers [(Kumar et al., 2018), (Malmasi and Zampieri, 2017)]. Also, in the past few years, there has been a surge in multi-modal tasks and problems, ranging from visual question answering [(Goyal et al., 2017)] to image captioning [(Sidorov et al., 2020), (Gurari et al., 2020)] and beyond. However, there has been surprisingly little work related to multi-modal hate speech, with only a few papers including both image and text modalities. Some of the works related to multi-modal hate detection based on image and text modalities are as follows.

Yang et al. [(Yang et al., 2019)] reported that augmenting text with image embedding information immediately boosts performance in hate speech detection. In that paper, the image embeddings are formed by taking the second-to-last layer of a ResNet neural network pre-trained on ImageNet and then hashing these values for efficient photo indexing, searching, and clustering. The most straightforward way of integrating text with photo features is to concatenate the image and text vectors. The concatenated vector is followed by dropout, MLP, and softmax operations for the final hate speech classification. They also explore other fusion techniques such as gated summation and bilinear transformation.

Figure 2: Mean memes and their benign text confounders
Figure 3: Approach 1 - Model architecture for Image captioning

Gomez et al. (Gomez et al., 2020) highlighted that most previous work on hate speech uses textual data only and that hate speech detection on multi-modal publications had not yet been addressed. They therefore created MMHS150K, a manually annotated multi-modal hate speech dataset formed by 150,000 tweets, each containing text and an image. The data points are labeled with one of six categories: no attacks to any community, racist, sexist, homophobic, religion-based attacks, or attacks to other communities. They trained an LSTM model that considers just the tweet text as a baseline for the task of detecting hate speech in multi-modal publications. Their further objective was to exploit the information in the visual domain to outperform this baseline, for which they proposed two models. The first was the Feature Concatenation Model (FCM), an MLP that concatenates the image representation extracted by a CNN and the textual features of both the tweet text and the image text extracted by an LSTM. Their second model, the Textual Kernels Model (TKM), was inspired by VQA tasks and was based on the intuition of looking for patterns in the image corresponding to the associated texts. This was done by learning kernels from textual representations and convolving them with CNN feature maps.

Our first approach extends this idea of a deeper understanding of the visual domain. To our knowledge, this paper is the first to use pre-trained image captioning models to generate the “actual caption” from the image along with the image embeddings, and to add these through fusion techniques such as concatenation and bilinear transformation to the multi-modal embedding of the state-of-the-art baselines.

We now describe some relevant work in image captioning. Xu et al. (Xu et al., 2015) introduced an encoder-decoder architecture that uses an attention mechanism to generate captions and is trainable by standard back-propagation methods. Most conventional approaches use a top-down mechanism for captioning tasks. A more recent method (Anderson et al., 2018) combines bottom-up and top-down attention: it uses Faster R-CNN based object detection to extract k image features, which enables attention to be calculated at the level of objects. Each image feature encodes a salient image region. The captioning model uses a soft top-down attention approach, given the features and partial output sequences as context. It consists of a 2-layer LSTM. The first layer, called the Top-Down Attention LSTM, produces the output that is used to find the attention weights. The attended image features are then used by the second LSTM layer, which is characterized as a language model. The model is trained by minimizing cross-entropy loss. The quality of the generated captions is vastly improved using this combined technique. The method is highly modular and allows various architectures to be used in the captioning stage for the features generated by object detection. One can also use different object detection mechanisms in place of Faster R-CNN, or even replace it with the spatial output of a CNN.
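The soft attention step described above, where region features are combined according to attention weights, can be illustrated with a minimal Python sketch. This is not the implementation of Anderson et al.: in their model the scores come from a learned network driven by the attention LSTM, whereas here the scores are simply given numbers, and the names `softmax` and `attend` are hypothetical.

```python
import math

def softmax(scores):
    """Normalize raw scores into weights that sum to 1."""
    shift = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(region_features, scores):
    """Return the attention weights and the weighted average of the region features."""
    weights = softmax(scores)
    dim = len(region_features[0])
    attended = [
        sum(w * feat[d] for w, feat in zip(weights, region_features))
        for d in range(dim)
    ]
    return weights, attended

# Three toy region features (k = 3) with 2 dimensions each.
region_features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
# Assumed relevance scores; a real model would predict these.
scores = [2.0, 0.0, 0.0]
weights, attended = attend(region_features, scores)
```

The region with the highest score dominates the attended feature, which is what lets the language model focus on one salient object at a time.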

Multi-modal sentiment analysis is a relatively new topic. However, extensive research (Soleymani et al., 2017) (Shenoy and Sardana, 2020) (Kumar and Vepa, 2020) (Ghosal et al., 2018) (Zadeh et al., 2018) (Majumder et al., 2018) has already been done in this field and has yielded fruitful results. Some works (Kumar and Vepa, 2020) (Ghosal et al., 2018) improve the prediction accuracy by developing more sophisticated attention mechanisms to better capture the interaction between two modalities, while others (Zadeh et al., 2018) (Majumder et al., 2018) introduce innovative fusion methods that use graph or hierarchical architectures. In addition, (Shenoy and Sardana, 2020) leverages sentiment to improve a multi-modal dialogue task. However, very little has been done to improve hateful media detection with multi-modal sentiment information. We introduce a sentiment analysis approach as our second experiment, in which we carry out uni-modal sentiment analysis on both the text and visual domains to find the orientation of both modalities.

Figure 4: Approach 2 - Model architecture using Sentiment analysis

4 Proposed Approaches

4.1 Problem Statement

The objective of this challenge is to classify memes as hateful or benign while considering information from both the text and visual modalities. Denote the visual components of all memes by $V = \{v_i\}_{i=1}^{N}$, where $i$ is the index of the meme; in our case, the visual component is the meme itself. Let $T = \{t_i\}_{i=1}^{N}$ denote the text extracted from the memes. If phrases are located in multiple regions of a single meme, the corresponding $t_i$ includes all the text information by concatenation. Let $\{y_i\}_{i=1}^{N}$ be the corresponding labels of all memes, where each $y_i \in \{0, 1\}$: 0 means benign and 1 indicates a hateful meme. Thus, our task can be formulated as a binary classification problem with $V$ and $T$ as input. The goal of our paper is to model $P(y_i = 1 \mid v_i, t_i)$, denoted by $f_\theta(v_i, t_i)$, which minimizes the following cost function:

$$\mathcal{L}(\theta) = -\sum_{i=1}^{N} \left[ y_i \log f_\theta(v_i, t_i) + (1 - y_i) \log\left(1 - f_\theta(v_i, t_i)\right) \right]$$

4.2 Image Captioning

As discussed above, this paper tackles the benign text confounders present in the dataset, which convert an originally hateful meme into a benign one just by describing what is happening in the image. Figure 2 shows some of these adversarial samples. As shown in Figure 5, they account for 20% of the dataset, and thus our hypothesis is that if we can provide our model with this extra knowledge, it will combat these adversarial examples and provide a boost in accuracy. Using object detection and image captioning helps in learning this aspect of the dataset and understanding the behavior of the benign text confounders, and thus gives better performance than the baseline models. Comparing the “actual caption” with the “pre-extracted caption” of the meme helps in understanding whether the two are aligned or not. Also, most of the multi-modal baselines tend to focus more on the text modality for hate speech. Our intuition behind this approach is to find a deeper relationship between the text and image modalities.

As shown in Figure 3, we first pass the Hateful Memes dataset (both modalities, i.e., X, the pre-extracted captions, and Y, the meme images) into the VisualBERT model pre-trained on the COCO dataset. This yields the multi-modal representation of the two modalities, i.e., a 768-dimensional tensor m = (m1, m2, m3, …). In parallel, we also pass the image into an image captioning model (Show, Attend and Tell, or Bottom-Up Top-Down), which produces a caption for the image present in the meme. We then pass this caption through a pre-trained BERT model to get a textual representation, another 768-dimensional tensor h = (h1, h2, h3, …). Then, we fuse the two tensors using fusion techniques such as concatenation and bilinear transformation. A bilinear transformation is a filter that integrates the information of two vectors into one vector. Mathematically, we have $\operatorname{bilinear}(m, h, \mathit{dim}) = m^{\top} M h + b$, where dim is a hyper-parameter indicating the expected dimension of the output vector (768), $M$ is a weight tensor of shape (dim, |m|, |h|), and $b$ is a bias vector of dimension dim. We then concatenate m, h, and bilinear(m, h, dim) for hate speech classification. Finally, we pass the output through a multi-layer perceptron to get a binary classification (0 for non-hateful, 1 for hateful).
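The fusion step can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the code used in the experiments: the function names `bilinear` and `fuse` are hypothetical, the dimensions are toy values rather than 768, and the random weights stand in for parameters that would be learned during fine-tuning.

```python
import random

def bilinear(m, h, M, b):
    """Compute m^T M[k] h + b[k] for every output index k."""
    return [
        sum(m[i] * M[k][i][j] * h[j]
            for i in range(len(m)) for j in range(len(h))) + b[k]
        for k in range(len(b))
    ]

def fuse(m, h, M, b):
    """Concatenate m, h, and bilinear(m, h) into one feature vector."""
    return m + h + bilinear(m, h, M, b)

rng = random.Random(0)
d_m, d_h, dim = 4, 4, 3  # toy sizes; the text above uses 768 for all three

m = [rng.uniform(-1, 1) for _ in range(d_m)]  # stand-in for the multi-modal representation
h = [rng.uniform(-1, 1) for _ in range(d_h)]  # stand-in for the encoded caption
M = [[[rng.uniform(-0.1, 0.1) for _ in range(d_h)]
      for _ in range(d_m)] for _ in range(dim)]  # weight tensor of shape (dim, |m|, |h|)
b = [0.0] * dim

fused = fuse(m, h, M, b)  # length d_m + d_h + dim, the input to the MLP classifier
```

Note that the weight tensor has dim × |m| × |h| entries, so with all three sizes at 768 the bilinear term is far more expensive than plain concatenation.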

We fine-tune the VisualBERT model on the Facebook Hateful Memes dataset and the BERT model on the captions generated for the images of that dataset. This new approach of combining image captioning with multi-modal baselines helps in tackling the previously mentioned challenges and increases the performance significantly.

4.3 Sentiment Analysis

Another approach is to use the sentiment information of both modalities to generate richer representations for prediction. We first obtain the multi-modal contextual representation from the input image and text by using a pre-trained model. In our experiment, we use VisualBERT (Li et al., 2019). However, similar to some other pre-trained models, VisualBERT focuses more on the correlation between the input modalities, whereas the text and image in hateful memes are usually connected indirectly. Thus, unimodal sentiments, which are closely related to hate detection, can benefit the prediction. A RoBERTa (Liu et al., 2019) model is then used to obtain the text sentiment embedding from the text, while a VGG (Simonyan and Zisserman, 2014) model is applied to obtain the visual sentiment from the image. However, due to the lack of annotated data, we are unable to fine-tune those two models on our dataset. Instead, RoBERTa is trained on the Stanford Sentiment Treebank (Socher et al., 2013) and the visual sentiment model parameters are learned from the T4SA dataset (Vadicamo et al., 2017). Then, the multi-modal representation and the two sentiment embeddings are fused through concatenation and passed to multi-layer perceptrons to make the final prediction. The framework of the entire model is shown in Figure 4.

5 Experimental Setup

5.1 Dataset

We use the Facebook Memes Challenge Dataset (Kiela et al., 2020), which comprises 10k memes. These memes are carefully designed for this task by annotators who are specially trained to apply hate speech as defined by Facebook. The features in this dataset are the meme images themselves and string representations of the text in the image. The dataset comprises five different types of memes, as shown in Figure 5: multi-modal hate, where benign confounders were found for both modalities; unimodal hate, where one or both modalities were already hateful on their own; benign image confounders; benign text confounders; and finally random not-hateful examples. The training, validation and test split is 85%, 5% and 10% respectively, and the individual sets are fully balanced. Each meme in the training and validation sets is annotated as either 1 or 0, corresponding to the hateful and benign classes respectively.
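As a quick sanity check, the split percentages can be turned into approximate set sizes. The sketch below assumes exactly 10,000 memes and exact percentages, which the dataset description only states approximately ("10k"); the variable names are illustrative.

```python
TOTAL_MEMES = 10_000  # assumption: "10k" taken as exactly 10,000
SPLIT_PERCENT = {"train": 85, "validation": 5, "test": 10}

# Integer arithmetic avoids floating-point rounding in the counts.
split_sizes = {name: TOTAL_MEMES * pct // 100 for name, pct in SPLIT_PERCENT.items()}
```

Under these assumptions the validation set has 500 memes, which agrees with the validation set size quoted in Section 5.2.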

Figure 5: Types of memes in the Facebook Hateful Memes Challenge Dataset

5.2 Multi-modal Baselines

For analysis, we select VisualBERT (Li et al., 2019), a baseline model pre-trained on the COCO dataset with a multi-modal objective. We fine-tune the model on our dataset following the same training guidelines as in the original challenge paper (Kiela et al., 2020) and then evaluate it on the validation set comprising 500 memes. Figure 6 shows the confusion matrix for this evaluation, which gives an approximate picture of the error cases made by the baseline.

5.2.1 VisualBERT

In order to use VisualBERT, multiple region features are first extracted from the input image using Faster R-CNN (Ren et al., 2015). Each region feature $f_j$ is then converted to a visual embedding $e^{v}_{j}$ by the following equation:

$$e^{v}_{j} = f_j + e^{s}_{v}$$

where $e^{s}_{v}$ stands for the segment embedding, which indicates whether the input is text or image.

For the text input, the textual embedding $e^{t}_{k}$ is obtained in a similar way:

$$e^{t}_{k} = e^{tok}_{k} + e^{s}_{t} + e^{pos}_{k}$$

where $e^{tok}_{k}$ is the token embedding for each token in the sentence, and $e^{pos}_{k}$ is the positional embedding that indicates the relative position of each token. After concatenating $e^{v}$ and $e^{t}$, the embedding is sent into the pre-trained VisualBERT model for further processing.
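The construction of the input sequence can be sketched as follows. This is a simplified illustration rather than VisualBERT's actual code: it assumes all embeddings already share one dimension (a real model projects the region features first), it places the text embeddings before the visual ones, and the helper names `add` and `build_inputs` are hypothetical.

```python
def add(*vectors):
    """Element-wise sum of equally sized vectors."""
    return [sum(values) for values in zip(*vectors)]

def build_inputs(region_feats, token_embs, pos_embs, seg_text, seg_image):
    """Build the embedding sequence that is fed to the transformer."""
    # Visual embedding: region feature plus the image segment embedding.
    visual = [add(feat, seg_image) for feat in region_feats]
    # Textual embedding: token plus text segment plus positional embedding.
    textual = [add(tok, seg_text, pos) for tok, pos in zip(token_embs, pos_embs)]
    # Assumed order: text first, then image regions.
    return textual + visual

region_feats = [[1.0, 1.0]]              # one toy region feature
token_embs = [[0.5, 0.5], [0.25, 0.25]]  # two toy tokens
pos_embs = [[0.0, 0.0], [1.0, 1.0]]
seg_text, seg_image = [0.0, 0.0], [1.0, 1.0]

sequence = build_inputs(region_feats, token_embs, pos_embs, seg_text, seg_image)
```

The segment embedding is the only signal that tells the transformer which positions hold text and which hold image regions.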

VisualBERT (Li et al., 2019) is a pre-trained model for learning joint contextualized representations of vision and language. It contains multiple transformer blocks on top of the visual and text embeddings. It is pre-trained on Microsoft COCO captions (Chen et al., 2015) with two objectives: masked language modelling and a sentence-image prediction task. The masked language modelling is very similar to the approach in BERT (Devlin et al., 2018), where some input text tokens are masked randomly and the model needs to predict the original tokens. The sentence-image prediction requires the model to decide whether the input text matches the image. The VisualBERT output of the first token is used as the multi-modal representation $m$. An MLP is then used to make the final prediction. The model is fine-tuned for the current task using the following loss function:

$$\mathcal{L}(\theta) = -\sum_{i} \log \operatorname{softmax}(W m_i)_{y_i}$$

where $m_i$ is a vector of size $d$, the hidden size of VisualBERT, and $y_i$ is the label of the $i$-th meme. $W$, which has a shape of 2 by $d$, is the learnable matrix of the MLP. $\theta$ denotes the parameters of the entire model, including $W$.
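A standard choice of fine-tuning loss for a two-class output layer of this shape is cross-entropy, and the following sketch shows how such a loss can be computed from the multi-modal representations. It assumes a plain softmax over the two logits produced by a 2-by-d matrix; the function names are hypothetical, and the toy inputs are not taken from the experiments.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax."""
    shift = max(logits)
    log_norm = shift + math.log(sum(math.exp(z - shift) for z in logits))
    return [z - log_norm for z in logits]

def classification_loss(W, representations, labels):
    """Sum of negative log-likelihoods of the true labels."""
    total = 0.0
    for m, y in zip(representations, labels):
        # One logit per class: row k of W dotted with the representation.
        logits = [sum(w * x for w, x in zip(row, m)) for row in W]
        total -= log_softmax(logits)[y]
    return total

representations = [[1.0, 2.0], [3.0, 4.0]]  # toy vectors with d = 2
labels = [0, 1]                             # 0 = benign, 1 = hateful

uninformed = classification_loss([[0.0, 0.0], [0.0, 0.0]], representations, labels)
```

With an all-zero matrix both classes receive probability 0.5, so the loss is log 2 per example; training lowers it by pushing the logit of the true class above the other.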

Figure 6: Confusion Matrix for baseline VisualBERT COCO model

5.3 Methodology

For both approaches, we use MMF (Singh et al., 2020), a modular framework from Facebook AI Research, to build the main neural architectures. We use MMF’s version of VisualBERT to generate multi-modal representations. The model is pre-trained on the MS COCO dataset with a hidden dimension of 768.

For the image captioning models in our first approach, we use two implementations. The first is an implementation of Show, Attend and Tell by Xu et al. (Xu et al., 2015), and the second uses the Bottom-Up Top-Down approach by Anderson et al. (Anderson et al., 2018). We take the top 10,000 words from the vocabulary and process the images via the Inception V3 model. The pre-trained BERT model used to encode the generated caption has a dimension of 768. The two representations are then fused and passed through an MLP classifier.

In the second approach, we directly use the final logits of the sentiment analysis models and their sum as the sentiment embedding. The MLP classifier consists of 2 layers with 768 hidden units.
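A minimal sketch of the sentiment embedding and the classifier follows. It assumes that both sentiment models emit the same number of logits (so that the sum is defined), that the sentiment embedding is the concatenation of the two logit vectors and their sum, and that the hidden layer uses a ReLU activation; none of these details are stated above, and all function names are hypothetical.

```python
def sentiment_embedding(text_logits, image_logits):
    """Concatenate the text logits, the image logits, and their element-wise sum."""
    summed = [t + v for t, v in zip(text_logits, image_logits)]
    return text_logits + image_logits + summed

def linear(x, weights, bias):
    """Affine layer: one output per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]

def mlp_forward(x, W1, b1, W2, b2):
    """Two-layer perceptron with an assumed ReLU between the layers."""
    hidden = [max(0.0, v) for v in linear(x, W1, b1)]
    return linear(hidden, W2, b2)

multimodal = [0.1, 0.2, 0.3]  # toy stand-in for the 768-dimensional representation
sentiment = sentiment_embedding([1.0, -1.0], [0.5, 0.5])
features = multimodal + sentiment  # input to the MLP classifier

# Tiny hand-picked weights to show the forward pass on a 2-dimensional input.
logits = mlp_forward([1.0, 2.0],
                     [[1.0, 0.0], [0.0, -1.0]], [0.0, 0.0],
                     [[1.0, 1.0], [2.0, 0.0]], [0.0, 0.5])
```

Because the sentiment embedding has only a handful of dimensions next to a 768-dimensional representation, the classifier has to learn to give these few features enough weight to matter.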

5.4 Evaluation Metrics

We evaluate the performance of our classifier using the following two metrics, as suggested in the challenge.

5.4.1 Area under the Receiver Operating Characteristics (AUCROC)

The Receiver Operating Characteristics curve is a graph of True Positive Rate (TPR) versus False Positive Rate (FPR). It measures how well the binary classifier discriminates between the classes as its decision threshold is varied (Bradley, 1997). A perfect classifier has an area under the curve of 1, where the top left corner of the plot is the ideal point with a TPR of 1 and an FPR of 0. Thus, a larger area under the curve is desirable for any classifier, in order to maximize TPR and minimize FPR.
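The area under the ROC curve equals the probability that a randomly chosen positive example is scored above a randomly chosen negative one, with ties counted as one half. The sketch below computes it directly from that pairwise definition. The function name `roc_auc` is hypothetical, and the quadratic loop is meant for illustration on small sets, not as an efficient implementation.

```python
def roc_auc(labels, scores):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUROC needs at least one example of each class")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives for n in negatives
    )
    return wins / (len(positives) * len(negatives))

# Toy predicted probabilities of the hateful class.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc(labels, scores)  # 3 of the 4 pairs are ordered correctly
```

Unlike accuracy, this value does not depend on any particular decision threshold, only on how the examples are ranked.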

5.4.2 Classification Accuracy

We also report the accuracy of the predictions, which is given by the ratio of correct predictions to the total number of predictions made, since it is easier to interpret. Thus, for each test sample, we output the label and the probability with which the classifier predicts the sample to be hateful. This probability is used to plot the ROC curve.
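The relationship between the predicted probabilities, the output labels, and accuracy can be sketched as follows. The decision threshold of 0.5 is an assumption made for illustration, and the function names are hypothetical.

```python
def predict_labels(probabilities, threshold=0.5):
    """Turn hateful-class probabilities into 0/1 labels (threshold is assumed)."""
    return [1 if p >= threshold else 0 for p in probabilities]

def accuracy(labels, predictions):
    """Ratio of correct predictions to the total number of predictions."""
    correct = sum(1 for y, p in zip(labels, predictions) if y == p)
    return correct / len(labels)

probabilities = [0.9, 0.2, 0.6, 0.4]  # toy classifier outputs
labels = [1, 0, 0, 1]                 # toy ground truth
predictions = predict_labels(probabilities)
acc = accuracy(labels, predictions)
```

Here two of the four toy predictions are correct, so the accuracy is 0.5.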

Figure 7: AUCROC/Accuracy for Different Experiments

6 Results and Discussions

6.1 Image Captioning

We use two setups for our experiments: the first is the MMF framework designed by the Facebook research lab that conducted this challenge, and the second is building all the models locally using simple baselines such as Concat BERT.

Initially, we tested image captioning locally by fusing it with the Concat BERT baseline model. The baseline accuracy for this model turned out to be 57%. Then, we built an image captioning model based on Xu et al. (Xu et al., 2015) and passed the caption through a BERT model to get the textual representation. When we fused this textual representation with the Concat BERT results, the accuracy increased by 2%, verifying the importance of captioning in tackling the benign text confounders. Then, we shifted to the MMF framework to test the approach on better baseline models such as VisualBERT. As can be seen in Figure 7, the image captioning approach gives a significant improvement in the AUCROC and the accuracy on the test set. There is an increase of 3.6% in the AUCROC score and an increase of 6.7% in the accuracy of the model. This shows that the image captioning model tackles these benign confounders and gives a better representation of the image modality, and thus improves the results.

Figure 8: Mean meme (left), Benign Text Confounder and the testing meme (middle), Object Detection Visualization before captioning(right)

Figure 8 comprises three images. The first image is the original hateful meme. The second image is the one being tested; it is created by adding a benign text confounder that simply describes the image, which makes it a non-hateful meme with a label of ‘0’. The third image shows the object detection bounding boxes on the test image. For the test image as input, the baseline VisualBERT predicts a label of ‘1’, thus misclassifying it as a hateful meme because it is not able to understand the benign text confounder. However, using our approach, it is correctly labeled as benign. Our model captions the image similarly to the benign text confounder, and thus the model learns about this similarity and the meme’s benign behavior. This helps the classifier to classify the meme as benign. There are many such examples in the dataset that are correctly classified by our model, thus improving the accuracy and the AUCROC value. We also ran bilinear transformation as the fusion technique, but it brought the performance down and ran very slowly on the dataset, so we decided to use concatenation for the results.

Figure 9: Sample images from the dev set. The sentiment values under each image range from 0 to 1, with 1 as positive. The green label denotes the ground truth.

6.2 Sentiment Analysis

For the sentiment analysis approach, although the model does not improve the AUCROC value by a large margin, we still see a significant gain of 4 in accuracy. We directly compare our model’s results against the VisualBERT baseline and observe two common cases in which sentiment analysis benefits the prediction. The first case is when the text and image have opposite sentiments, as shown in the first image of Figure 9. The baseline considers this meme benign, but our model can clearly indicate its irony and thereby guide the prediction. The other case is when both modalities have a positive sentiment, as in the meme shown in the second image of Figure 9. Here, sentiment information can help to confirm benign memes. However, since we do not have the annotated data to fine-tune the sentiment analysis models or to perform multi-task learning, the accuracy of the sentiment prediction is not satisfactory. As shown in the third meme in Figure 9, the text does not seem very negative and the image seems neutral, but our model predicts both as negative. Also, in some complicated cases, the sentiments are not very helpful. For example, when the sentiments of both modalities are negative, as in the last two images of the figure, our model does not work well because the meme has a similar chance of being benign or hateful.

6.3 Combining both Image Captioning and Sentiment Analysis

We also performed an experiment in which we concatenated both the image captioning results and the sentiment analysis features with the VisualBERT multi-modal representation and fine-tuned the model on the dataset. Again, we saw a significant increase in the AUCROC and the accuracy of the model in comparison to the baseline model. We expected the results to be even better than the captioning results, as the model would have different features to learn from, but the accuracy decreased in comparison to the image captioning results. One reason for this behavior could be conflicts between the representations being concatenated, which could lead to a lower accuracy and AUCROC value. Another reason could be the presence of redundant features in the different representations, which reduces the performance. We also performed an analysis of some data points related to this test. As can be seen in Figure 10, the middle image, a benign confounder, is wrongly classified as hateful by the baseline model, but the combined approach learns the alignment of the generated caption and the pre-extracted caption along with the sentiment of both modalities (i.e., positive in this case) and correctly predicts the non-hateful label.

Figure 10: Mean meme (left), Benign Text Confounder and the testing meme with positive text sentiment and positive visual sentiment (middle), Object Detection Visualization before captioning(right)

7 Conclusion and Future Directions

We present two novel means of introducing outside world knowledge to our multi-modal models, namely image captioning and sentiment analysis. Our approach makes it possible to identify the adversarial examples in the Hateful Memes Challenge dataset. While both image captioning and sentiment analysis show a promising improvement over the baseline models published with the Facebook challenge, the combination of object detection and image captioning provides the best results.

One of the primary objectives of this challenge was to facilitate research on the development of true multi-modal models that give importance to all the modalities. After analyzing the dataset and various techniques for this task, we have identified several areas that should be explored in future research in this domain. An improvement in the quality of the captions, generated by other image captioning models such as OSCAR (Li et al., 2020), will enhance the ability of the model to identify benign text confounders and thus increase the classification accuracy. Combining image captioning and sentiment in an efficient manner such that they cancel out their individual effects is critical. Fusion plays a key role in this task, so we plan to explore more fusion techniques beyond concatenation, using attention mechanisms, transformers, etc. Further, we believe that using other large-scale pre-trained multi-modal models such as UNITER (Chen et al., 2020) and providing ‘Internet Knowledge’ through a graphical approach are interesting research directions for this task.


  • P. Anderson, X. He, C. Buehler, D. Teney, M. Johnson, S. Gould, and L. Zhang (2018) Bottom-up and top-down attention for image captioning and visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6077–6086. Cited by: §3, §5.3.
  • A. P. Bradley (1997) The use of the area under the ROC curve in the evaluation of machine learning algorithms. Pattern Recognition 30 (7), pp. 1145–1159. Cited by: §5.4.1.
  • X. Chen, H. Fang, T. Lin, R. Vedantam, S. Gupta, P. Dollár, and C. L. Zitnick (2015) Microsoft COCO captions: data collection and evaluation server. arXiv preprint arXiv:1504.00325. Cited by: §5.2.1.
  • Y. Chen, L. Li, L. Yu, A. E. Kholy, F. Ahmed, Z. Gan, Y. Cheng, and J. Liu (2020) UNITER: universal image-text representation learning. arXiv preprint arXiv:1909.11740. Cited by: §7.
  • T. Davidson, D. Warmsley, M. W. Macy, and I. Weber (2017) Automated hate speech detection and the problem of offensive language. CoRR abs/1703.04009. Cited by: §3.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2018) BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805. Cited by: §5.2.1.
  • J. Devlin, S. Gupta, R. Girshick, M. Mitchell, and C. L. Zitnick (2015) Exploring nearest neighbor approaches for image captioning. arXiv preprint arXiv:1505.04467. Cited by: §2.
  • D. Ghosal, M. S. Akhtar, D. Chauhan, S. Poria, A. Ekbal, and P. Bhattacharyya (2018) Contextual inter-modal attention for multi-modal sentiment analysis. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 3454–3466. Cited by: §3.
  • R. Gomez, J. Gibert, L. Gomez, and D. Karatzas (2020) Exploring hate speech detection in multimodal publications. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV). Cited by: §3.
  • Y. Goyal, T. Khot, D. Summers-Stay, D. Batra, and D. Parikh (2017) Making the V in VQA matter: elevating the role of image understanding in visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Cited by: §3.
  • D. Gurari, Y. Zhao, M. Zhang, and N. Bhattacharya (2020) Captioning images taken by people who are blind. arXiv preprint arXiv:2002.08565. Cited by: §3.
  • D. Kiela, H. Firooz, A. Mohan, V. Goswami, A. Singh, P. Ringshia, and D. Testuggine (2020) The hateful memes challenge: detecting hate speech in multimodal memes. arXiv preprint arXiv:2005.04790. Cited by: §1, §5.1, §5.2.
  • A. Kumar and J. Vepa (2020) Gated mechanism for attention based multi modal sentiment analysis. In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4477–4481. Cited by: §3.
  • R. Kumar, A. Kr. Ojha, S. Malmasi, and M. Zampieri (2018) Benchmarking aggression identification in social media. In Proceedings of the First Workshop on Trolling, Aggression and Cyberbullying (TRAC-2018), Santa Fe, New Mexico, USA, pp. 1–11. Cited by: §3.
  • L. H. Li, M. Yatskar, D. Yin, C. Hsieh, and K. Chang (2019) VisualBERT: a simple and performant baseline for vision and language. arXiv preprint arXiv:1908.03557. Cited by: §4.3, §5.2.1, §5.2.
  • X. Li, X. Yin, C. Li, P. Zhang, X. Hu, L. Zhang, L. Wang, H. Hu, L. Dong, F. Wei, Y. Choi, and J. Gao (2020) Oscar: object-semantics aligned pre-training for vision-language tasks. arXiv preprint arXiv:2004.06165. Cited by: §7.
  • Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov (2019) RoBERTa: a robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692. Cited by: §4.3.
  • N. Majumder, D. Hazarika, A. Gelbukh, E. Cambria, and S. Poria (2018) Multimodal sentiment analysis using hierarchical fusion with context modeling. Knowledge-Based Systems 161, pp. 124–133. Cited by: §3.
  • S. Malmasi and M. Zampieri (2017) Detecting hate speech in social media. In Proceedings of the International Conference Recent Advances in Natural Language Processing, RANLP 2017, Varna, Bulgaria, pp. 467–472. Cited by: §3.
  • S. Ren, K. He, R. Girshick, and J. Sun (2015) Faster R-CNN: towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, pp. 91–99. Cited by: §5.2.1.
  • A. Shenoy and A. Sardana (2020) Multilogue-net: a context aware RNN for multi-modal emotion detection and sentiment analysis in conversation. arXiv preprint arXiv:2002.08267. Cited by: §3.
  • O. Sidorov, R. Hu, M. Rohrbach, and A. Singh (2020) TextCaps: a dataset for image captioning with reading comprehension. arXiv preprint arXiv:2003.12462. Cited by: §3.
  • K. Simonyan and A. Zisserman (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556. Cited by: §4.3.
  • A. Singh, V. Goswami, V. Natarajan, Y. Jiang, X. Chen, M. Shah, M. Rohrbach, D. Batra, and D. Parikh (2020) MMF: a multimodal framework for vision and language research. Cited by: §5.3.
  • R. Socher, A. Perelygin, J. Wu, J. Chuang, C. D. Manning, A. Y. Ng, and C. Potts (2013) Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pp. 1631–1642. Cited by: §4.3.
  • M. Soleymani, D. Garcia, B. Jou, B. Schuller, S. Chang, and M. Pantic (2017) A survey of multimodal sentiment analysis. Image and Vision Computing 65, pp. 3–14. Note: Multimodal Sentiment Analysis and Mining in the Wild. ISSN 0262-8856. Cited by: §3.
  • L. Vadicamo, F. Carrara, A. Cimino, S. Cresci, F. Dell’Orletta, F. Falchi, and M. Tesconi (2017) Cross-media learning for image sentiment analysis in the wild. In Proceedings of the IEEE International Conference on Computer Vision Workshops, pp. 308–317. Cited by: §4.3.
  • Z. Waseem and D. Hovy (2016) Hateful symbols or hateful people? Predictive features for hate speech detection on Twitter. In Proceedings of the NAACL Student Research Workshop, San Diego, California, pp. 88–93. Cited by: §3.
  • Z. Waseem (2016) Are you a racist or am I seeing things? Annotator influence on hate speech detection on Twitter. In Proceedings of the First Workshop on NLP and Computational Social Science, Austin, Texas, pp. 138–142. Cited by: §3.
  • K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhudinov, R. Zemel, and Y. Bengio (2015) Show, attend and tell: neural image caption generation with visual attention. In International Conference on Machine Learning, pp. 2048–2057. Cited by: §3, §5.3, §6.1.
  • F. Yang, X. Peng, G. Ghosh, R. Shilon, H. Ma, E. Moore, and G. Predovic (2019) Exploring deep multimodal fusion of text and photo for hate speech classification. In Proceedings of the Third Workshop on Abusive Language Online, Florence, Italy, pp. 11–18. Cited by: §3.
  • A. B. Zadeh, P. P. Liang, S. Poria, E. Cambria, and L. Morency (2018) Multimodal language analysis in the wild: CMU-MOSEI dataset and interpretable dynamic fusion graph. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 2236–2246. Cited by: §3.