Daily, we receive an avalanche of short text information from the explosion of online communication, e-commerce and the use of digital devices (Song et al., 2014)
. This volume of information requires text mining tools to perform a number of document operations in a timely and appropriately manner. Text classification is a classic topic for natural language processing and an essential component in many applications, such as web searching, information filtering, topic categorization and sentiment analysis(Aggarwal and Zhai, 2012). Text classification can be defined simply as follows: Given a set of documents D and a set of classes (or labels) C, define a function F that will assign a value from the set of C to each document in D (Murty et al., 2011). As a result, a huge pool of machine learning methodologies have been applied for text classification in various data types with satisfactory results. Nowadays, information is usually in short texts such as social networks, news in web pages, forums and so on. However, short texts collections, having the limitation of short length documents, end up represented by sparse matrices, with minimal co-occurrence or shared context. As a result, defining efficient similarity measures is not straightforward, especially regarding the most popular word-frequency based approaches, resulting in degrading performance (Quan et al., 2010).
Recently, Convolutional Neural Networks (CNN) are being applied to text classification or natural language processing both to distributed as to discrete embedding of words (dos Santos and Gatti, 2014; Kim, 2014a), without using syntactic or semantic knowledge of a language (Johnson and Zhang, 2014). Indicatively, the work of Zhang et al. (Zhang et al., 2015) offers an empirical study on character-level convolutional networks for text classification providing evidences that CNN using character-level features is an effective method. Also, a recurrent CNN model was proposed recently for text classification without human-designed features (Lai et al., 2015)
by succeeding to outperform both the CNN model as well as other well-established classifiers. Their model captures contextual information with the recurrent structure and constructs the representation of text using a convolutional neural network. Meanwhile, CNN has been shown an alternative mechanism for effective use of word order for text categorization(Johnson and Zhang, 2014). An effective CNN based model using word embeddings to encode texts is published recently (Zhang et al., 2018)
. It uses semantic, embeddings, sentiment embeddings and lexicon embeddings for texts encoding, and three different attentions including attention vector, LSTM (Long Short Term Memory) attention and attentive pooling are integrated with CNN model. To improve the performance of three different attention CNN models, CCR (Cross-modality Consistent Regression) and transfer learning are presented. It is worth noticing that CCR and transfer learning are used in textual sentiment analysis for the first time. Finally, some experiments on two different datasets demonstrate that the proposed attention CNN models achieve the best or the next-best results against the existing state-of-the-art models.
Toxic Comment Discovery
Text arising from online interactive communication hides many hazards such as fake news, online harassment and toxicity (Duggan, 2014). Toxic comment is not only the verbal violence but a comment that is rude, disrespectful or otherwise likely to make someone leave a discussion. Toxic comment can be considered also the personal attack, the online harassment and bullying behaviors. Unfortunately, it is a usual phenomenon in the world of web and causes several problems and with the rise of social media platforms and the explosion of online communication, the risk is increased. Indicatively, the Wikimedia foundation found that of those who had experienced online harassment expressed decreased participation in the particular project which occurred (Wulczyn et al., 2017). Also, a 2014 Pew Report highlights that of adult internet users have seen someone harassed online, and have personally experienced it (Duggan, 2014). Although, there are efforts to enhance the safety of online environments based on crowdsourcing voting schemes or the capacity to denounce a comment, in most cases these techniques are inefficient and fail to predict a potential toxicity (Hosseini et al., 2017).
Automatic toxic comment identification and prediction in real time is of paramount importance, since it would allow the prevention of several adverse effects for internet users. Towards this direction, the work of Wulczyn et al. (Wulczyn et al., 2017) proposed a methodology that combines crowdsourcing and machine learning to analyze personal attacks at scale. Recently, Google and Jigsaw launched a project called Perspective (Hosseini et al., 2017), which uses machine learning to automatically detect online insults, harassment, and abusive speech. Perspective is an API (www.perspectiveapi.com) that enables the developers to use the toxic detector running on Google’s servers, to identify harassment and abuse on social media or more efficiently filtering invective from the comments on a news website. The API uses machine learning models to score the perceived impact a comment might have on a conversation. Developers and publishers can use this score to give realtime feedback to commenters or help moderators do their job, or allow readers to more easily find relevant information, as illustrated in two experiments below.
A main limitation of these models is that they are not as reliable as it should and that usually the degree of toxicity is not determined. The latter is depicted on a Kaggle competition. Kaggle is a platform for predictive modelling and analytics competitions in which statisticians and data miners compete to produce the best models for predicting and describing the datasets uploaded by companies and users. In this work we employ the dataset provided by the current Kaggle challenge on toxic comment classification in an attempt to investigate whether the recent emergence of CNN in text mining, using word embeddings to encode texts, provide any significant benefit over the traditional approaches for classification accompanied by Bag-of-Words (BoW) representations.
The rest of the paper is organized as follows. In Section 2 we present the details of the CNN’s based approach for text classification while in Section 3 we describe text prepossessing based on the Bag-of-Words model; for text analysis. Section 4 is devoted to the experimental evaluation. Finally, Section 5 contains concluding remarks and pointers for future work.
2. Convolutional Neural Networks
In this section, we outline the Convolutional Neural Networks for classification and also provide the process description for text classification in particular. Convolutional Neural Networks are multistage trainable Neural Networks architectures developed for classification tasks. Each of these stages, consist the types of layers described below (Lecun et al., 1998; Fukushima, 1980):
, are major components of the CNNs. A convolutional layer consists of a number of kernel matrices that perform convolution on their input and produce an output matrix of features where a bias value is added. The learning procedures aim to train the kernel weights and biases as shared neuron connection weights.
, are also integral components of the CNNs. The purpose of a pooling layer is to perform dimensionality reduction of the input feature images. Pooling layers make a subsampling to the output of the convolutional layer matrices combing neighboring elements. The most common pooling function is the max-pooling function, which takes the maximum value of the local neighborhoods.
Embedding Layer, is a special component of the CNNs for text classification problems. The purpose of an embedding layer is to transform the text inputs into a suitable form for the CNN. Here, each word of a text document is transformed into a dense vector of fixed size.
, is a classic Feed-Forward Neural Network (FNN) hidden layer. It can be interpreted as a special case of the convolutional layer with kernel size. This type of layer belongs to the class of trainable layer weights and it is used in the final stages of CNNs.
The training of CNN relies on the BackPropagation (BP) training algorithm(Lecun et al., 1998). The requirements of the BP algorithm is a vector with input patterns and a vector with targets , respectively. The input is associated with the output . Each output is compared to its corresponding desirable target and their difference provides the training error. Our goal is to find weights that minimize the cost function
where the number of patterns, the output of j neuron that belongs to layer, the number of neurons in output layer, the desirable target of neuron of pattern . To minimize the cost function
, a pseudo-stochastic version of SGD algorithm, also called mini-batch Stochastic Gradient Descent (mSGD), is usually utilized(Bottou, 1998).
2.1. CNN for Text Classification
The CNN have been widely applied to image classification problems due to their inner capability to exploit the two statistical properties that characterize image data, namely ‘local stationarity’ and ‘compositional structure’ (Henaff et al., 2015). Local stationarity structure can be interpreted as the attribute of an image to present dependency between neighboring pixels that is reasonably constant in local image regions. Local stationarity is exploited by the CNNs’ convolution operator.
We may claim that for text classification problems the original raw data also present the aforementioned statistical properties based on the fact that neighboring words in a sentence present dependency, however, their processing is not straight forward. The components of an image are simply pixels represented by integer values within a specific range. On the other hand the components of a sentence (the words) have to be encoded before fed to the CNN (Collobert et al., 2011). For this purposed we may use a vocabulary. The vocabulary is constructed as an index containing the words that appear in the set of document texts, mapping each word to an integer between and the vocabulary size. An example of this procedure is illustrated in Figure 1
. The variability in documents length (number of words in a document) need to be addressed as CNNs require a constant input dimensionality. For this purpose the padding technique is adopted, filling with zeros the document matrix in order to reach the maximum length amongst all documents in dimensionality.
In the next step the encoded documents are transformed into matrices for which each row corresponds to one word. The generated matrices pass through the embedding layer where each word (row) is transformed into a low-dimension representation by a dense vector (Gal and Ghahramani, 2016). The procedure then continues following the standard CNN methodology. At this point, it is worth mentioning that there are two approaches for the low-dimension representation of each word. The first approach called ‘randomized’ which is achieved by placing a distribution over the word, producing a dense vector with fixed length. The values of the vectors are tuned via the training process of the CNN. The other very popular approach also evaluated here is to employ fixed dense vectors for words, which have produced based on word embedding methods such as the word2vec (Mikolov et al., 2013) and GloVe (Pennington et al., 2014). In general the word embedding methods have been trained on a large volume dataset of words producing for each word a dense vector with a specific dimension and fixed values. The word2vec embedding method for example, has been trained on 100 billion words from Google News producing a vocabulary of 3 million words. The embedding layer matches the input words with the fixed dense vector of the pre-trained embedding methods that have been selected. The values of these vectors do not change during the training process, unless there are words not already included in the vocabulary of the embedding method in which case they are initialized randomly. An indicative description of the process is illustrated in Figure 2.
3. Bag-of-Words for Text Mining
Although word embeddings for text mining have attracted the interest of research community lately, it is still not clear enough how beneficial can this approach be in comparison with tradition text mining approaches such as the BoW model. A BoW model, is an alternative way of extracting features from text that can later be used for any kind of analysis such as classification. This representation of text describes the occurrence of words within a document and for this purpose it only involves a vocabulary of known words and their corresponding measure of the presence. In contrast to the model described in Section 2
any information about the order or structure of words in the document is discarded in this case. The model is only concerned about the occurrence of words in the document and not where in the document they occur. Despite the simplicity of this approach usually the models that use it for text categorization and classification tasks demonstrate good performance. Although theoretical simple and practical efficient a BoW model involves several technical challenges. The first step in a typical text analysis procedure is to construct the Document-Term-Matrix (DTM) from input documents. This is done by vectorizing documents creating a map from words to a vector space. In the next step a model for either supervised or unsupervised learning is applied.
In what follows we provide the details of the DTM construction for the toxic comment classification problem at hand. We begin the procedure by generating the vocabulary of words. Here we choose unique words appearing in all documents ignoring case, punctuation, numbers and frequent words that don’t contain much information (called stop words). Then instead of scoring words based on the number of times each word appears in a document we employ the Term Frequency - Inverse Document Frequency (TF-IDF) methodology (Rajaraman and Ullman, 2011). This way we avoid the problem of high scoring by dominating words that usually do not contain ‘informational content’. To achieve this the frequency of words is rescaled by how often they appear in all documents by penalizing most frequent words across all documents. The resulting scores are a weighting indicating importance or interestingness of words. TF-IDF have proven to be very successful for classification tasks in particular by exposing the differences amongst natural groups. In the final step of preprocessing we deal with the sparsity problem. Sparse representations are usually more difficult to model both for computational reasons and also for information reasons, as so little information need to be extracted from such a large representational space. Here we choose to discard terms with higher that sparsity managing also to reduce the dimensionality of the DTM significantly.
In an attempt to visually validate the resulting DTM we employ two basic dimensionality reduction methods for visualizations. We employ Principal Component Analysis(Jolliffe, 2002) to produce a two dimensional projection and the popular t-SNE (van der Maaten and Hinton, 2008) methodology for generating a two dimensional embedding of the DTM (see figure 3). For the DTM construction, samples has been selected according to the procedure followed in the Experimental Analysis Section. In both Figures points with different color correspond to samples belonging to different classes. We observe cluster structures in both representations indicating appropriate circumstances for a learning algorithm. The results of applying classification methodologies on the resulting DTM are reported at the following experimental results section
4. Experimental Analysis
This section is devoted to the experimental evaluation of the presented approaches. Word embeddings and CNN are compared against the BoW approach for which four well-established text classification methods namely Support Vector Machines (SVM), Naive Bayes (NB), k-Nearest Neighbor (kNN) and Linear Discriminated Analysis (LDA) applied on the designed DTMs. We evaluate the methods for toxic comment detection employing the dataset provide by a current Kaggle competition111https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge. The dataset contain comments from Wikipedia’s talk page edits which have been labeled by human raters for toxic behavior. Although there are six types of indicated toxicity ‘toxic’, ‘severe toxic’, ‘obscene’, ‘threat’, ‘insult’, ‘identity hate’ in the original dataset, all these categories were considered as toxic in order to convert our problem to binary classification. Furthermore, for more coherent comparisons, a balanced subset of the Kaggle dataset is constructed for each evaluation of the aforementioned methods. This is achieved by random sub-sampling of the non-toxic texts, obtaining a subset with equal number of samples with the toxic texts. Subsequently,a random selection of the of the resulting balanced dataset is used for training and the rest for testing. In order to provide reliable and robust inferences, each of the examined methods are evaluated times on such random separations.
The CNN architecture used here is based on the one presented in (Kim, 2014b). We use three different convolutional layers simultaneously, with filter size width , dense vector dimension . The filters width is equal to the vector dimension while their height was and , for each convolutional layer respectively. After each convolutional layer a max-over-time pooling operation (Collobert et al., 2011) is applied. The output of the pooling layer concatenate to a fully-connected layer, while the softmax function is applied on final layer. The model is training using the SGD algorithm with mini-batches set to and learning rate . Interestingly here we examine two variants of CNN approach, the where the representation of the words initialized randomly and tuned during the CNN’s training and the where we use the pre-trained, by the word2vec model 222https://code.google.com/archive/p/word2vec/; low-dimensional representations of the words. For SVM, NB and LDA we used the default parameters provided by the statistical packages e1071 and MASS respectively. For kNN we choose to used a value of equal to as a representative of general performance of the method. Further parameter fitting could take place in this part of the analysis for all methods but inevitably their performance is dominated by the construction procedure of the DTM, which has already been design considering classification performance.
Finally we perform a statistical analysis on the outcomes of the binary classification task for all methods. For this purpose we may consider, (i) samples labeled as ‘toxic’ and predicted as ‘toxic’ as True Positive (TP), (ii) samples labeled as ‘toxic’ and predicted as ‘non-toxic’ as False Negative (FN), (iii) samples labeled as ‘non-toxic’ and predicted as ‘non-toxic’ as True Negative (TN) and (iv) samples labeled as ‘non-toxic’ and predicted as ‘toxic’ as False Positive (FP).
A confusion matrix for each ’run’ was created by calculating values in main diagnostic tests such as, True positive rate (Recall), Positive predictive value (Precision), F1 score (see Figure4), Accuracy, False discovery rate (FDR) and True negative rate (Specificity) (see Table 1). It is evident that both and models outperformed SVM, kNN, NB and LDA methods in all cases achieving accuracy almost over . With respect to the toxic predictive value (Precision), CNN models had values above while other methods range from to
percent. Almost similar behavior is observed in the True Positive Rate (Recall) and in F1-score. NB and kNN methods had the worst results having the lowest values on precision and recall respectively. This means that kNN classifies several non-toxic comments as toxic and NB classifies several toxic comments as non-toxic. Also, CNN models has the lowest variance in all cases whilehas the best performance with respect to precision and recall. Finally we observed that CNN models have the lowest False Discovery Ratio, meaning that they mistakenly predicted as ‘toxic’ the lowest number of ‘non-toxic’ comments. The aforementioned findings state clearly that CNN outperforms traditional text mining approaches for toxic comment classification presenting great potential for further development in toxic comment identification.
Mean values and Standard Deviation across all experiments for Accuracy, Specificity and False discovery rate for all Classification Methods.
Both industrial and research community in the last few years have made several tries to identify an efficient model for online toxic comment prediction due to its importance in online interactive communications among users, but these steps are still in their infancy. This work is devoted to the study of a recent approach for text classification involving word representations and Convolutional Neural Networks. We investigate its performance in comparison to more widespread text mining methodologies for the task of toxic comment classification. As shown CNN can outperform well established methodologies providing enough evidence that their use is appropriate for toxic comment classification. The promising results are motivating for further development of CNN based methodologies for text mining in the near future, in our interest, employing methods for adaptive learning and providing further comparisons with n-gram based approaches.
We gratefully acknowledge the support of NVIDIA Corporation with the donation of the Titan X Pascal GPU used for this research.
- Aggarwal and Zhai (2012) Charu C Aggarwal and ChengXiang Zhai. 2012. A survey of text classification algorithms. In Mining text data. Springer, 163–222.
- Bottou (1998) Léon Bottou. 1998. On-line learning and stochastic approximations. In In On-line Learning in Neural Networks. Cambridge University Press, 9–42.
- Collobert et al. (2011) Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. 2011. Natural Language Processing (Almost) from Scratch. J. Mach. Learn. Res. 12 (Nov. 2011), 2493–2537. http://dl.acm.org/citation.cfm?id=1953048.2078186
- dos Santos and Gatti (2014) Cicero dos Santos and Maira Gatti. 2014. Deep convolutional neural networks for sentiment analysis of short texts. In Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers. 69–78.
- Duggan (2014) Maeve Duggan. 2014. Online harassment. Pew Research Center.
Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position.Biological Cybernetics 36, 4 (1980), 193–202.
Yarin Gal and Zoubin
A Theoretically Grounded Application of Dropout in Recurrent Neural Networks. InAdvances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems 2016, December 5-10, 2016, Barcelona, Spain. 1019–1027.
- Henaff et al. (2015) Mikael Henaff, Joan Bruna, and Yann LeCun. 2015. Deep Convolutional Networks on Graph-Structured Data. CoRR abs/1506.05163 (2015). arXiv:1506.05163 http://arxiv.org/abs/1506.05163
- Hosseini et al. (2017) Hossein Hosseini, Sreeram Kannan, Baosen Zhang, and Radha Poovendran. 2017. Deceiving Google’s Perspective API Built for Detecting Toxic Comments. arXiv preprint arXiv:1702.08138 (2017).
- Johnson and Zhang (2014) Rie Johnson and Tong Zhang. 2014. Effective use of word order for text categorization with convolutional neural networks. arXiv preprint arXiv:1412.1058 (2014).
- Jolliffe (2002) T.I. Jolliffe. 2002. Principal Component Analysis. Springer.
- Kim (2014a) Yoon Kim. 2014a. Convolutional neural networks for sentence classification. arXiv preprint arXiv:1408.5882 (2014).
- Kim (2014b) Yoon Kim. 2014b. Convolutional Neural Networks for Sentence Classification. CoRR abs/1408.5882 (2014). arXiv:1408.5882 http://arxiv.org/abs/1408.5882
- Lai et al. (2015) Siwei Lai, Liheng Xu, Kang Liu, and Jun Zhao. 2015. Recurrent Convolutional Neural Networks for Text Classification.. In AAAI, Vol. 333. 2267–2273.
- Lecun et al. (1998) Y. Lecun, L. Bottou, Y. Bengio, and P. Haffner. 1998. Gradient-based learning applied to document recognition. Proc. IEEE 86, 11 (Nov 1998), 2278–2324.
- Mikolov et al. (2013) Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013. Distributed Representations of Words and Phrases and their Compositionality. In Advances in Neural Information Processing Systems 26, C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger (Eds.). Curran Associates, Inc., 3111–3119.
et al. (2011)
M Ramakrishna Murty, JVR
Murthy, and Prasad Reddy PVGD.
Text Document Classification based-on Least Square Support Vector Machines with Singular Value Decomposition.International Journal of Computer Applications 27, 7 (2011).
- Pennington et al. (2014) Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. Glove: Global vectors for word representation. In In EMNLP.
- Quan et al. (2010) Xiaojun Quan, Gang Liu, Zhi Lu, Xingliang Ni, and Liu Wenyin. 2010. Short text similarity based on probabilistic topics. Knowledge and information systems 25, 3 (2010), 473–491.
- Rajaraman and Ullman (2011) Anand Rajaraman and Jeffrey David Ullman. 2011. Mining of Massive Datasets. Cambridge University Press. https://doi.org/10.1017/CBO9781139058452
- Song et al. (2014) Ge Song, Yunming Ye, Xiaolin Du, Xiaohui Huang, and Shifu Bie. 2014. Short Text Classification: A Survey. Journal of Multimedia 9, 5 (2014).
- van der Maaten and Hinton (2008) Laurens van der Maaten and Geoffrey Hinton. 2008. Visualizing Data using t-SNE. Journal of Machine Learning Research 9 (2008), 2579–2605.
- Wulczyn et al. (2017) Ellery Wulczyn, Nithum Thain, and Lucas Dixon. 2017. Ex machina: Personal attacks seen at scale. In Proceedings of the 26th International Conference on World Wide Web. International World Wide Web Conferences Steering Committee, 1391–1399.
- Zhang et al. (2015) Xiang Zhang, Junbo Zhao, and Yann LeCun. 2015. Character-level convolutional networks for text classification. In Advances in neural information processing systems. 649–657.
- Zhang et al. (2018) Zufan Zhang, Yang Zou, and Chenquan Gan. 2018. Textual sentiment analysis via three different attention convolutional neural networks and cross-modality consistent regression. Neurocomputing 275 (2018), 1407–1415.