Spam filtering on forums: A synthetic oversampling based approach for imbalanced data classification

09/10/2019 ∙ by Pratik Ratadiya, et al. ∙ 0

Forums play an important role in providing a platform for community interaction. The introduction of irrelevant content or spam by individuals for commercial and social gains tends to degrade the professional experience presented to the forum users. Automated moderation of the relevancy of posted content is therefore desirable. Machine learning is widely used for text classification and finds applications in spam email detection, fraudulent transaction detection, etc. Balanced classes in the training data are essential for classification algorithms to learn efficiently and accurately. However, in the case of forums, spam content is sparse compared to relevant content, giving rise to a bias towards the latter during training. A model trained on such biased data tends to misclassify spam samples. This paper presents an approach based on the Synthetic Minority Over-sampling Technique (SMOTE) to tackle imbalanced training data. It involves synthetically creating new minority class samples from the existing ones until the data is balanced. The enhanced data is then passed through various classifiers, whose performance is recorded. The results were analyzed on data from the forums of Spoken Tutorial, IIT Bombay, over standard performance metrics, and revealed that models trained after synthetic minority oversampling outperform those trained on imbalanced data by substantial margins. An empirical comparison of the results obtained with and without SMOTE for various supervised classification algorithms is presented. Synthetic oversampling proves to be a critical technique for achieving a uniform class distribution, which in turn yields commendable results in text classification. The presented approach can be further extended to content categorization on educational websites, thus helping to improve the overall digital learning experience.




I Introduction

The intention of forums is to provide a platform between the users and administration for resolving issues and grievance redressal. However, there are instances of users posting malicious content and links on these platforms which degrades the core objective of the website. Currently, there is manual moderation for quality check to remove any irrelevant content. Such moderation increases the cost as well as the time required for completion of the objective which can be reduced by automating this process.

There has been ample research in recent years on automated spam classification systems using supervised machine learning algorithms. Awad and Elseuofi applied various machine learning techniques to the automated classification of emails [1]. Manlangit et al. proposed an intelligent system which could detect fraudulent transactions using classification algorithms [2]. These systems used labelled data to learn a decision boundary, based on which a new test case was classified into the appropriate class.

One of the major issues in developing such classifier systems for forums is the imbalance in data. The majority of the content posted on forums is in accordance with the intended goal (non-spam). As a result, spam samples are sparse in comparison to non-spam samples. Conventional machine learning algorithms trained on such data tend to be biased towards the majority class, which leads to new spam samples being accepted as relevant. This bias can be resolved by feeding balanced data to the algorithm. Balance amongst classes can be achieved by either undersampling the majority class or oversampling the minority class. In 2008, Tang et al. presented a hybrid intelligent system based on undersampling of the majority class which could analyze the behaviour of senders [3]. However, undersampling poses the risk of losing crucial information, especially when the number of minority class samples is extremely small, thus affecting the overall performance of the system. In this paper we propose an approach which involves synthetically oversampling the minority class to solve the imbalance problem. This increases the training set size and makes the class distribution uniform. This in turn enhances the results achieved by the system.

The rest of the paper is organized as follows: Section II provides an overview of the preprocessing techniques. The proposed approach is described in Section III. The dataset, evaluation metrics and experimental results are described in Section IV. Finally, Section V draws the conclusion and suggests future directions for the work.

II Preprocessing techniques

The data obtained from the forums is in text format and cannot be fed directly to a machine learning algorithm. Certain preprocessing techniques need to be applied to the data to extract useful information. The text data is converted into feature vectors which are then passed to the classification algorithm. The preprocessing techniques include:

II-A Stripping of HTML tags

Any HTML tags and scripts are stripped from the text, as they are not relevant to the intent of the message. In addition, special characters such as —, newline and tab characters are removed. Words with fewer than 3 characters are also removed, as they do not provide relevant information and tend to mislead the model.
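The cleaning step can be sketched as follows. This is a minimal illustration only: it uses Python's standard-library `html.parser` as a stand-in for Beautiful Soup, and the function name `clean_post`, the `min_length` parameter and the exact set of characters treated as "special" are assumptions, not details given in the paper.

```python
import re
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.parts.append(data)


def clean_post(raw_html, min_length=3):
    """Strip tags and scripts, drop special characters, remove short words."""
    parser = _TextExtractor()
    parser.feed(raw_html)
    parser.close()
    text = " ".join(parser.parts)
    # Assumption: anything that is not an ASCII letter, digit or whitespace
    # counts as a special character. Newlines and tabs vanish in split().
    text = re.sub(r"[^A-Za-z0-9\s]", " ", text)
    words = [w for w in text.split() if len(w) >= min_length]
    return " ".join(words)


print(clean_post("<p>How do I install\tScilab on Ubuntu?</p><script>alert('x')</script>"))
```

Note that the ASCII-only filter would also discard non-English text, so a multilingual forum would need a different character rule.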

II-B Stop words

Commonly used words such as 'are', 'is', 'they' and 'this' occur in both spam and non-spam content and thus cannot be a deciding factor for classification. Such words are called stop words, and we remove them from our text.
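A small sketch of stop word removal is given below. The paper does not state which stop word list was used, so the short `STOP_WORDS` set and the helper name `remove_stop_words` are purely illustrative; a fuller list, such as the English list shipped with scikit-learn, would be used in practice.

```python
# Illustrative stop word list (assumption); not the list used in the paper.
STOP_WORDS = frozenset({"are", "is", "they", "this", "the", "and", "for", "with"})


def remove_stop_words(text, stop_words=STOP_WORDS):
    """Return the text with stop words removed, matching case-insensitively."""
    return " ".join(w for w in text.split() if w.lower() not in stop_words)


print(remove_stop_words("This tutorial videos are not loading"))
```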

II-C Vectorization

The filtered text is now converted into a sparse matrix of token weights by vectorization. Available techniques include the count vectorizer and the Tf-Idf vectorizer; we use the Tf-Idf (term frequency-inverse document frequency) vectorizer in this project. The Tf-Idf weight of a word $t$ in a sample $d$ is:

$$\text{tf-idf}(t, d) = \frac{f_{t,d}}{\sum_{t'} f_{t',d}} \times \log\frac{N}{n_t}$$

The weight is composed of two terms. The first is the normalized term frequency (TF): the number of times $f_{t,d}$ the word appears in the sample, divided by the total number of words in that sample. The second is the inverse document frequency (IDF): the logarithm of the number of samples $N$ in the dataset divided by the number of samples $n_t$ in which the word appears.
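As a worked illustration of the definition above, the following sketch computes the weights directly from the two terms (TF times IDF) in plain Python. The function name `tf_idf` and the toy documents are assumptions for illustration. scikit-learn's `TfidfVectorizer` applies IDF smoothing and vector normalization by default, so its values will differ from this textbook form.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per document, using tf * log(N / n_t)."""
    tokenized = [doc.split() for doc in documents]
    n_docs = len(tokenized)
    # n_t: number of documents in which each term appears at least once
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


# Toy corpus (hypothetical forum posts after cleaning)
docs = ["install scilab ubuntu", "cheap pills cheap offer", "scilab plot error"]
weights = tf_idf(docs)
print(weights[1]["cheap"])   # (2 / 4) * log(3 / 1)
print(weights[0]["scilab"])  # (1 / 3) * log(3 / 2)
```

A word that occurs in every sample receives an IDF of log(1) = 0, which is why very common words contribute nothing to the feature vector.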

The above-mentioned preprocessing techniques are implemented using the Beautiful Soup and scikit-learn packages in Python 3.6. After these techniques are applied, the text data is converted to a sparse matrix of numbers which can be fed to a classification algorithm. As part of the evaluation process, we split the data into train and test sets, with the model trained on 80% of the data and results calculated on the remaining 20%. For any new test case, the same preprocessing must be carried out, and the class label is predicted from the resulting feature vector.

III Proposed approach

The training dataset initially contains an uneven distribution of classes. We propose oversampling the minority class using the Synthetic Minority Oversampling Technique (SMOTE) to reduce the class imbalance present in the training dataset. SMOTE was first introduced by Chawla et al. in 2002 [4]. The technique creates 'synthetic' samples rather than oversampling with replacement, which is the traditional approach. The algorithm operates on the feature vectors created earlier: for every minority class sample, it introduces artificial samples along the line segments joining that sample to its k nearest minority class neighbors. It is generally applied in the training phase so as to remove bias by balancing the data. Synthetically oversampling the test data does not enhance the performance of the classification algorithm. We apply the technique to the training data such that the numbers of minority and majority class samples become equal.

Input: Number of minority class samples T; number of majority class samples M; number of nearest neighbors k
Output: M minority class samples
for i = 1 to T do
   Compute the k nearest minority class neighbors of i
end for
while number of minority class samples != M do
   Randomly choose one of the T original minority class samples, say i
   Randomly choose one of the k nearest neighbors of i in feature space, say nn
   Compute the vector between nn and i and multiply it by a random number between 0 and 1
   Synthetic sample = i + computed vector; add it to the minority class
end while
return the M minority class samples
Algorithm 1: Algorithm for SMOTE
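The steps above can be sketched in plain Python as follows. This is an illustrative reimplementation, not the imblearn code used in the paper: the function name `smote`, the `seed` parameter, the choice to pick the base sample at random, and the use of dense lists with Euclidean distance (rather than sparse Tf-Idf matrices) are all assumptions.

```python
import math
import random


def smote(minority, n_majority, k=5, seed=0):
    """Oversample `minority` (a list of equal-length feature lists) until it
    holds `n_majority` samples, by interpolating toward nearest neighbors."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)

    # Precompute the k nearest minority class neighbors of every sample.
    neighbors = []
    for i, x in enumerate(minority):
        others = [j for j in range(len(minority)) if j != i]
        others.sort(key=lambda j: math.dist(x, minority[j]))
        neighbors.append(others[:k])

    synthetic = []
    while len(minority) + len(synthetic) < n_majority:
        i = rng.randrange(len(minority))         # pick an original minority sample
        x = minority[i]
        nn = minority[rng.choice(neighbors[i])]  # and one of its k neighbors
        gap = rng.random()                       # random number in [0, 1)
        # New point on the line segment between x and nn
        synthetic.append([a + gap * (b - a) for a, b in zip(x, nn)])
    return minority + synthetic


minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
balanced = smote(minority, n_majority=10, k=2)
print(len(balanced))
```

Because every synthetic point lies between two existing minority samples, the new samples stay inside the region already occupied by the minority class instead of duplicating points exactly.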

After the feature vectors are passed through this algorithm, the feature space is modified, as observed in Figures 1 and 2. The samples in blue indicate the majority class, which is far more abundant than the minority class (green) in Fig. 1. The increase in minority class samples after applying SMOTE is visible in Fig. 2. We can also infer that forming a decision boundary to distinguish the two classes in feature space is now comparatively easier, thus enhancing the performance of the classification algorithm.

Fig. 1: Feature space of training vector before applying SMOTE
Fig. 2: Feature space of training vector after applying SMOTE

The balanced training dataset is now passed through a supervised classification algorithm. The task of the algorithm is to plot a decision boundary which fits the training data such that it precisely partitions the underlying vector space into two sets, one for each class. For a new test case, the previously mentioned preprocessing steps are carried out and its feature vector is plotted in the feature space. The classifier will classify it as the class which is on the same side of the decision boundary as the vector.

We implement the SMOTE algorithm in Python using the imblearn package. The classification algorithms are implemented using the scikit-learn library [5, 6]. The pandas and matplotlib packages are used for reading and visualizing the data, respectively.

IV Evaluation

IV-A Dataset description

The dataset for this task was gathered from the forums of the Spoken Tutorial project, IIT Bombay. The Spoken Tutorial project provides tutorials on FOSS in several Indian languages for learners [7]. A forum has been set up to address the problems of students as well as contributors. The data collected from the forums consisted of questions asked by users and the replies given. A label was assigned to each sample: label 0 indicated 'non-spam' or relevant content, whereas label 1 indicated spam content. The dataset comprised 313 samples: 273 non-spam samples and 40 spam samples. It was then split into training and testing datasets, with the training set consisting of 201 non-spam samples and 33 spam samples. The distribution of training data before and after applying SMOTE is shown in Table I.

| Stage | Number of majority class samples | Number of minority class samples |
|---|---|---|
| Initial stage | 201 | 33 |
| After applying SMOTE | 201 | 201 |

TABLE I: Distribution of training dataset

IV-B Performance metric

Accuracy is usually considered as the performance metric for classification algorithms. It is the ratio of correctly classified samples to the total number of test samples. However, in the case of imbalanced datasets, accuracy gives a misleading picture of the results. For example, on a dataset consisting of 95 non-spam and 5 spam samples, a model that predicts every sample as non-spam attains an accuracy of 95%. Such a model fails to recognize any spam sample and is thus deemed faulty. The performance metrics which provide a better interpretation in such cases are:

$$\text{Precision} = \frac{TP}{TP + FP}$$

$$\text{Recall} = \frac{TP}{TP + FN}$$

$$\text{F1 Score} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$

TP: True positive (number of outcomes where the model correctly predicts the positive class)
FP: False positive (number of outcomes where the model incorrectly predicts the positive class)
FN: False negative (number of outcomes where the model incorrectly predicts the negative class)

These metrics are preferred because they focus on the positive class (spam) rather than the negative class, and measure how reliably positive samples are detected, which is crucial in spam classification.
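The following sketch reproduces the 95/5 example in code, using the standard definitions precision = TP / (TP + FP), recall = TP / (TP + FN) and F1 = 2 × precision × recall / (precision + recall). The function names and the convention of returning 0.0 when a denominator is zero are assumptions made for illustration.

```python
def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def precision_recall_f1(y_true, y_pred, positive=1):
    """Precision, recall and F1 for the positive (spam) class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    # Assumption: a metric with a zero denominator is reported as 0.0.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


# 95 non-spam (0) and 5 spam (1) samples; the model predicts non-spam for all.
y_true = [0] * 95 + [1] * 5
y_pred = [0] * 100
print(accuracy(y_true, y_pred))             # 0.95
print(precision_recall_f1(y_true, y_pred))  # (0.0, 0.0, 0.0)
```

The accuracy looks strong while precision, recall and F1 are all zero, which is exactly the failure mode that makes accuracy unsuitable here.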

IV-C Evaluation of results

The enhanced training dataset is passed through various classification algorithms, namely Multinomial Naive Bayes, logistic regression, Linear SVC (Support Vector Classification) and decision tree. The performance of each algorithm on the above-mentioned metrics is tabulated in Table II.

Fig. 3: F1 Score metric representation

| Metric | Multinomial NB (with SMOTE) | Multinomial NB (without SMOTE) | Logistic Regression (with SMOTE) | Logistic Regression (without SMOTE) | Linear SVC (with SMOTE) | Linear SVC (without SMOTE) | Decision Tree (with SMOTE) | Decision Tree (without SMOTE) |
|---|---|---|---|---|---|---|---|---|
| Accuracy | 0.96 | 0.95 | 0.92 | 0.91 | 0.95 | 0.936 | 0.88 | 0.89 |
| Precision | 1.0 | 0.71 | 0.28 | 0.0 | 0.57 | 0.28 | 0.14 | 0.14 |
| Recall | 0.7 | 0.71 | 0.6 | 0.0 | 0.8 | 1.0 | 0.25 | 0.3 |
| F1 Score | 0.82 | 0.71 | 0.4 | 0.0 | 0.6 | 0.4 | 0.18 | 0.2 |

TABLE II: Performance of various classification algorithms on Spoken Tutorial forum dataset

Fig. 4: Precision Score metric representation

It can be observed that applying SMOTE to the training data produces better F1 scores than training on imbalanced data for most of the classification algorithms, the decision tree being the exception.

V Conclusion

We have observed that algorithms trained with SMOTE outperform algorithms trained on imbalanced data, with margins as high as 10%. The effect of uneven class representation is mitigated by this technique. It can thus prove to be extremely useful for classifying content in forums. Future work includes modifications to the SMOTE function for better generalization of the minority class. The performance of the technique can be enhanced for decision tree based algorithms through parameter tuning. The technique can also be combined with deep learning algorithms to enhance the classification results. The idea can be further extended to create an automated tagger for content posted on educational websites, which would improve content organization.


We would like to thank Spoken Tutorial Project, IIT Bombay for their assistance in the collection of data used in this project. We would also like to extend our thanks to Prof. Kannan Moudgalya, IIT Bombay for his encouragement and useful critiques for this research work.


  • [1] W. A. Awad and S. M. Elseuofi, "Machine Learning Methods for Spam E-Mail Classification", International Journal of Computer Science & Information Technology (IJCSIT), Vol. 3, No. 1, Feb 2011
  • [2] Sylvester Manlangit, Sami Azam, Bharanidharan Shanmugam, Krishnan Kannoorpatti, Mirjam Jonkman, Arasu Balasubramaniam, "An Efficient Method for Detecting Fraudulent Transactions Using Classification Algorithms on an Anonymized Credit Card Data Set", Intelligent Systems Design and Applications, pp. 418-429
  • [3] Yuchun Tang, Sven Krasser, Yuanchen He, Weilai Yang, Dmitri Alperovitch, "Support Vector Machines and Random forest modelling for Spam Senders Behavior Analysis", IEEE GLOBECOM 2008
  • [4] Nitesh V. Chawla, Kevin W. Bowyer, Lawrence O. Hall, W. Philip Kegelmeyer, "SMOTE: Synthetic Minority Over-sampling Technique", Journal of Artificial Intelligence Research 16 (2002) 321–357
  • [5] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, E. Duchesnay, "Scikit-learn: Machine Learning in Python", Journal of Machine Learning Research, volume 12, 2825–2830, 2011
  • [6] Lars Buitinck, Gilles Louppe, Mathieu Blondel, Fabian Pedregosa, Andreas Mueller, Olivier Grisel, Vlad Niculae, Peter Prettenhofer, Alexandre Gramfort, Jaques Grobler, Robert Layton, Jake VanderPlas, Arnaud Joly, Brian Holt and Gael Varoquaux, "API design for machine learning software: experiences from the scikit-learn project", ECML PKDD Workshop: Languages for Data Mining and Machine Learning, 108–122, 2013
  • [7] K. M. Moudgalya, "Spoken tutorial: A collaborative and scalable education technology", CSI Communications 35 (6), 10-12