Nowadays, almost everyone generates data on the Internet through social media, news, public comments, blogs, web searches, and so on. All this information is recorded in databases, giving rise to the problem known as Big Data. A vast portion of this data is unstructured text, which makes it difficult for machine learning models to harness its predictive power.
Consequently, diverse approaches and algorithms have been proposed to deal with this issue. However, most of them work well only with numerical data, so textual data must first be converted into numerical feature vectors. AI algorithms are then fed with those vectors in order to learn the important features that lead to a successful final prediction. Before the whole system is implemented, we need a textual corpus to show whether our approach is better than the available ones. After the data is acquired, it needs to be preprocessed: the original text is altered to make it more suitable for further use by reducing vocabulary, outliers, and other impurities. The modified text is then ready for feature engineering, which uses statistics to construct numerical feature vectors from raw textual data. In data mining and information retrieval, this representation of text as numerical vectors is known as the Vector Space Model (VSM), in which textual documents are represented with respect to a set of linearly independent basis vectors. Afterwards, popular approaches such as Case-Based Reasoning (CBR) can be used to find the most similar documents based on already seen cases (training data). The system’s recommendation for a new document can be evaluated by an expert and added to the pool of training data in order to further improve the system’s predictive performance. However, to harness the power of a CBR system, we need a similarity metric that captures the important characteristics of feature vectors.
The major contribution of this work is an investigation of the performance of the similarity metric TS-SS (Triangle’s area Similarity - Sector’s area Similarity), proposed by Heidarian and Dinneen, which has shown state-of-the-art performance for document clustering. The results of this algorithm are compared with two other well-known similarity measures, Euclidean Distance (ED) and Cosine Similarity (CS). We also provide a theoretical justification of why the TS-SS measure is incapable of capturing feature differences among data points.
Section II discusses state-of-the-art methods and related work done in the field of information retrieval and data mining for each of the stages in the process. Section III describes an implementation of our system and methods for performance evaluation. Section IV has performance measurements of our implementation for a variety of similarity metrics. Section V explores the reasons and provides theoretical justification for achieved performances. Finally, a short summary of our research and future work is given in Section VI.
The code for this project is publicly available at https://github.com/Maki94/document-classification.
II Related Work
Many systems have been devised to overcome difficulties with recognizing the most similar textual documents. This process includes several independent steps, and a lot of research has been done in each of these steps to improve systems’ predictive performances.
The first step required by any machine learning system is to find a benchmark dataset for performance evaluation. According to Larson, standard test collections for information retrieval are The Cranfield collection, NTCIR, Reuters Corpus, 20 NewsGroups, and several other corpora .
Before the features are extracted from the raw textual data, the text should be altered in order to reduce the vocabulary size and decrease inaccuracies in the feature representation. This process can be broadly divided into three categories: dropping specific terms, word replacement, and stemming.
Some words do not bring any value and should be excluded from the dataset. This procedure depends mainly on the textual corpus; for example, if the data includes web pages, HTML tags should be removed, and if it contains XML files, XML tags should be eliminated. Additionally, some words occur in only a couple of documents, which makes them too specific to be useful for the overall system. Conversely, some words occur so frequently across documents that they should be removed as well; for example, if all documents are about computer science, then the term computer is irrelevant and should be excluded from the feature space. In our scenario, we focus on regular textual data, so we do not further examine specific cases such as HTML and XML. Larson recommends some procedures that are considered good practice: removing stop words, eliminating punctuation, converting words to lowercase, and other procedures that are described in detail in the implementation phase.
The purpose of the word replacement procedure is to reduce the vocabulary size. This process includes spelling correction, synonym replacement, and dataset-specific replacements. For spelling correction, edit distance based on the Levenshtein distance is widely used to find a correctly spelled word. The state-of-the-art approach for synonym replacement is based on WordNet. Other corrections include simple word concatenation, number mapping, and other procedures to overcome dataset peculiarities, for example mapping mac book to macbook or every number to a single token.
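The spelling-correction idea can be illustrated with a minimal sketch, assuming a plain (unnormalized) Levenshtein distance and a small known vocabulary. The helper `correct` and the example vocabulary are hypothetical and are not taken from the project code.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn string a into string b."""
    # previous[j] is the distance between the prefix of a processed so far and b[:j]
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]


def correct(word, vocabulary):
    """Hypothetical helper: return the vocabulary entry closest to word.
    Ties are broken by the order of entries in vocabulary."""
    return min(vocabulary, key=lambda candidate: levenshtein(word, candidate))
```

For instance, `correct("documnet", ["document", "dataset", "metric"])` selects `"document"`, since two substitutions are enough to repair the transposed letters.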
The next process is stemming, that is, replacing each word with its base form. The Porter stemming algorithm has been recommended by most authors for natural language processing tasks. For example, by applying the Porter stemmer, the word plays is replaced by play and connected by connect.
Several successful feature extraction methods for NLP tasks have been proposed and improved over the past decades. These methods map words or phrases from the vocabulary to vectors of real numbers. The most popular approaches for constructing feature vectors are the bag-of-words model (BoW), tf-idf, GloVe, and word2vec.
The BoW method simply counts the occurrences of each word and records the count in the feature vector at the position assigned to that term. One obvious drawback of this approach is that it favors longer documents; the tf-idf measure was introduced to overcome this problem by multiplying term frequencies by the inverse document frequency, which decreases the bias towards longer documents. The GloVe method also counts how frequently a word appears in a context or document, but it uses dimension reduction techniques to achieve a low-dimensional representation, so its feature vectors lose interpretability. word2vec suffers from the same drawback: its vectors are constructed by a neural network, which results in representing words or phrases by their probability distributions.
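To make the difference between BoW and tf-idf concrete, the following is a minimal sketch using only the Python standard library. It assumes one common tf-idf variant (raw term count multiplied by log(N / df)); the weighting actually used in the project may differ, for example by smoothing or normalization.

```python
import math
from collections import Counter


def bag_of_words(tokens, vocabulary):
    """Count how often each vocabulary term occurs in a tokenized document."""
    counts = Counter(tokens)
    return [counts[term] for term in vocabulary]


def tf_idf(corpus):
    """corpus: list of token lists. Returns (vocabulary, list of tf-idf vectors).
    Assumed variant: tf is the raw count, idf is log(N / document frequency)."""
    vocabulary = sorted({term for doc in corpus for term in doc})
    n_docs = len(corpus)
    # Document frequency: number of documents that contain each term
    df = Counter(term for doc in corpus for term in set(doc))
    idf = {term: math.log(n_docs / df[term]) for term in vocabulary}
    vectors = []
    for doc in corpus:
        counts = bag_of_words(doc, vocabulary)
        vectors.append([count * idf[term] for count, term in zip(counts, vocabulary)])
    return vocabulary, vectors
```

In this variant a term that occurs in every document receives weight zero, which is why very frequent terms contribute little to the comparison of documents.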
To find the most similar document, numerical feature vectors should be compared by calculating a similarity metric. A lot of metrics have been invented to capture the most important features of a vector. Those metrics can be divided into two subcategories: geometrical and non-geometrical methods. Summary of these approaches can be found in .
A standard way of evaluating the quality of classification algorithms is based on the confusion matrix and measures derived from it, such as accuracy, precision, recall, and F1 score.
III Methods and Implementation
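As a brief illustration of how such measures are derived from the confusion matrix, the sketch below computes accuracy, precision, recall and F1 for a binary problem. The function name `binary_metrics` and the convention of returning 0.0 for undefined ratios are assumptions made for this example.

```python
def binary_metrics(y_true, y_pred, positive):
    """Derive standard measures from the four cells of a binary confusion matrix."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)  # true positives
    fp = sum(t != positive and p == positive for t, p in pairs)  # false positives
    fn = sum(t == positive and p != positive for t, p in pairs)  # false negatives
    tn = sum(t != positive and p != positive for t, p in pairs)  # true negatives

    accuracy = (tp + tn) / len(pairs)
    # Assumed convention: undefined ratios are reported as 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}
```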
The architecture of our system is devised based on the previous work. Fig. 1 depicts the modular architecture of our system.
The first phase of our system is to acquire training data, which will be searched for similar documents. Those documents are then filtered by a preprocessing procedure in order to get rid of outliers and reduce the size of the vocabulary. The next phase is to extract features from the modified textual documents and to save them in order to accelerate the classification step. When the features are extracted, the most similar document is retrieved for a new query (document). The similarity between two vectors is calculated by several similarity measures. The outcome for each query is recorded and reviewed by an expert; the new document can then be added to the pool of training documents, which improves predictive performance for future queries.
The first step is to acquire data. We used five different datasets: the 20 NewsGroups dataset (obtained from scikit-learn.org), which comprises 18864 newsgroup posts on 20 topics; the Reuters dataset (obtained from nltk.org), which contains 10,788 news documents on 90 topics; and the Brown dataset (nltk.org), the first million-word electronic corpus of English, created in 1961 at Brown University, in which each document is categorized by one of fifteen genres. The other two datasets, the Movie Review and Sentence Polarity datasets (both from nltk.org), are labeled with the binary values positive or negative. We chose these corpora because they differ in the number of documents and labels, which can reveal different characteristics of the similarity metrics used. All the datasets are split into training and test sets.
| Parameter | Value |
| Minimum word length | 3 |
| Maximum document frequency | 50% |
| Minimum document frequency | 1% |
| Feature vector dimensionality | Different values are evaluated |
After the data is acquired, it should be altered in order to reduce the size of the vocabulary used for feature extraction. The parameters of the procedures for altering the textual documents are summarized in Table I.
The performance of our system depends heavily on the data preparation procedure. First, words shorter than three characters are eliminated. Then, terms are kept or discarded based on their document frequency: terms that are too frequent (appearing in more than 50% of the documents) and terms that are too specific (appearing in fewer than 1% of the documents) are eliminated. After this procedure, all characters are converted to lowercase, and stop words, which do not carry any valuable information, are removed.
In the next step, inspection of the textual documents showed that some specific tokens need to be eliminated; in our case, email and web addresses are also removed. The final step is to discard every character that is not a letter and to apply the Porter stemmer to simplify word forms. Note that some recommended procedures, such as mapping numbers to a specific token, are not implemented because achieving the best performance was not an aim of the project. The goal is to explore the behavior of different similarity metrics when finding the most similar documents.
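The preprocessing steps described above can be sketched as follows. This is an illustration under stated assumptions, not the project's implementation: the stop word list is a tiny hypothetical sample, the regular expressions for email and web addresses are simplified, the steps are applied in an order convenient for the example, the frequency thresholds are treated as inclusive bounds, and Porter stemming is omitted because it requires an external library.

```python
import re
from collections import Counter

# Tiny illustrative stop word list (assumption); a real list is much longer
STOP_WORDS = frozenset({"the", "and", "for", "are", "that", "this", "with", "from"})
EMAIL_OR_URL = re.compile(r"\S+@\S+|https?://\S+|www\.\S+")  # simplified patterns
NON_LETTER = re.compile(r"[^a-z]+")


def preprocess(text, min_length=3):
    """Lowercase, drop email/web addresses and non-letters, then filter tokens."""
    text = text.lower()
    text = EMAIL_OR_URL.sub(" ", text)
    text = NON_LETTER.sub(" ", text)
    return [token for token in text.split()
            if len(token) >= min_length and token not in STOP_WORDS]


def filter_by_document_frequency(corpus, min_df=0.01, max_df=0.5):
    """Keep only terms whose share of documents lies within [min_df, max_df]."""
    n_docs = len(corpus)
    df = Counter(term for doc in corpus for term in set(doc))
    keep = {term for term, count in df.items() if min_df <= count / n_docs <= max_df}
    return [[term for term in doc if term in keep] for doc in corpus]
```

The default arguments mirror the minimum word length and the document frequency bounds listed in Table I.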
The data preprocessing procedure has reduced the size of vocabulary significantly. Now, the feature extraction method should convert each document into a numerical feature vector.
A vector space model based on the tf-idf method usually outperforms other methods when the amount of data is small.
For this purpose, the tf-idf procedure is applied to the given training dataset. To evaluate the performance of different similarity metrics, different lengths of feature vectors are considered, so feature vectors are forced to a fixed dimensionality. This is done by removing the features with the smallest values. The aim of this experiment is to show how the system behaves in high-dimensional space.
To retrieve the most similar document, a similarity between two documents needs to be calculated. Each of the documents is first converted to a numerical feature vector, then the similarity is calculated. In this work, we used three similarity measures, ED (1), CS (2), and TS-SS (5), that showed state-of-the-art performance in many NLP tasks. All these metrics have a useful geometrical representation, and a short summary of the drawbacks and advantages of these methods is well described in .
The introduction of a novel similarity measure, TS-SS, is motivated by weaknesses of the Euclidean distance and the cosine similarity. The drawback of ED can be illustrated in 2-dimensional space (Fig. 2): vectors can be equally distant from one another even though they differ significantly in direction. A clear disadvantage of ED is that it does not take the angle between two vectors into account.
In contrast, the cosine similarity does not suffer from this drawback because it considers only the angle between two given vectors. However, the problem with the cosine similarity is that it ignores the magnitude of the vectors. Fig. 2 illustrates a scenario in which three vectors are judged equally similar by CS despite their obvious dissimilarity.
To address these weaknesses (vector magnitude for CS and the angle between two vectors for ED), the TS-SS metric was proposed. This measure is calculated as the product of the Triangle’s Area Similarity (TS) and the Sector’s Area Similarity (SS). The former is calculated from the triangular area between two vectors in Euclidean space, which alone suffers from the same drawback as ED. The latter is calculated as the area of a circular sector, which is described by a diameter and an angle. The diameter is equal to the difference between the vectors’ magnitudes, while the angle is the angle between the two document vectors. A metric defined as the product of TS and SS is therefore expected to perform better than ED and CS separately because it addresses the drawbacks of both approaches.
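The three measures can be sketched in a few lines of Python. ED and CS follow their standard definitions. The TS-SS function follows the formulation commonly attributed to Heidarian and Dinneen, in which 10 degrees are added to the angle and the sector radius is the Euclidean distance plus the magnitude difference; both details are assumptions here, since the equations are not reproduced in this text, and the sector description above is more loosely worded. Vectors are assumed to be non-zero.

```python
import math


def norm(v):
    return math.sqrt(sum(x * x for x in v))


def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_similarity(a, b):
    # Assumes both vectors are non-zero
    return sum(x * y for x, y in zip(a, b)) / (norm(a) * norm(b))


def ts_ss(a, b):
    """TS-SS under the assumptions stated above; smaller values mean more similar."""
    # Clamp to guard against rounding slightly outside [-1, 1]
    cos = max(-1.0, min(1.0, cosine_similarity(a, b)))
    theta = math.acos(cos) + math.radians(10)  # assumed 10-degree offset
    triangle_area = norm(a) * norm(b) * math.sin(theta) / 2
    magnitude_difference = abs(norm(a) - norm(b))
    radius = euclidean_distance(a, b) + magnitude_difference  # assumed radius
    sector_area = math.pi * radius ** 2 * math.degrees(theta) / 360
    return triangle_area * sector_area
```

With these definitions, `cosine_similarity([1, 0], [3, 0])` equals 1 although the magnitudes differ, which is the CS weakness discussed above, whereas `ts_ss([1, 0], [3, 0])` is strictly positive and therefore distinguishes the two vectors.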
The performance is evaluated on the test dataset. For each retrieved document, the answer is recorded and compared with the true label; at the end, the probability of correct retrieval (accuracy) is calculated. New documents are not added to the pool of training documents because this would make it inconvenient to evaluate predictive performance for different parameters.
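The evaluation loop described here amounts to nearest-neighbour retrieval followed by a label comparison. The sketch below is a simplified illustration; the function names and the toy data are hypothetical, and any similarity function where larger means more similar can be plugged in (a distance can be used by negating it).

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def retrieve_label(query, train_vectors, train_labels, similarity):
    """Return the label of the training document most similar to the query."""
    best = max(range(len(train_vectors)),
               key=lambda i: similarity(query, train_vectors[i]))
    return train_labels[best]


def retrieval_accuracy(test_vectors, test_labels, train_vectors, train_labels, similarity):
    """Fraction of test queries whose retrieved document has the correct label."""
    hits = sum(retrieve_label(query, train_vectors, train_labels, similarity) == label
               for query, label in zip(test_vectors, test_labels))
    return hits / len(test_labels)
```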
The performance of the implemented system depends heavily on the data preprocessing and feature extraction procedures. Those procedures require tuning several parameters, which are summarized in Table I.
Three similarity metrics (ED, CS, and TS-SS) are used for finding the most similar document in the training dataset for a given textual query (document). Then, accuracy is calculated to evaluate the predictive performance of our system for the different similarity metrics. The whole process of extracting features and searching the feature space for the most similar ones is repeated with different tuning parameters and over five datasets: Reuters, 20 NewsGroups, Brown, Movie Review, and Sentence Polarity.
In the first experiment, the Reuters dataset is used. Fig. 4 shows the accuracy achieved for different dimensionalities of the feature vectors. It can be clearly seen that the cosine similarity performed the best, whereas the similarity based on the product of triangle and sector areas was the worst. For small feature dimensionality, the method based on the Euclidean distance follows the same predictive pattern as the cosine similarity, but it levels out for feature lengths above 300, while the accuracy of the cosine similarity keeps increasing. In contrast, the accuracy of the TS-SS similarity fluctuates up to a feature dimensionality of 200, after which it steadily decreases.
IV-2 20 NewsGroups
For the second experiment, the 20 NewsGroups dataset is used. This dataset has more documents and therefore a richer vocabulary, which means that feature vectors are of higher dimensionality if not limited to a fixed feature length. Accordingly, the performance of our system showed a different pattern for every similarity metric. Fig. 5 depicts the probability that our system recognizes a document’s category successfully for different feature lengths. In this scenario, the gap between the performances of the three metrics is wider. The metric based on the cosine of the angle is still superior to the other ones, and its accuracy grows logarithmically. The TS-SS metric, after a slight increase in performance up to a feature length of 100, deteriorates dramatically up to a feature dimensionality of 400, after which it levels out with small oscillations. The accuracy of the Euclidean distance again shows a growth pattern similar to that of the cosine similarity: it follows the cosine similarity up to a feature length of 100, after which it levels out; however, the gap between the accuracies of ED and CS is bigger.
The third dataset has fewer categories, and its feature vectors are of higher dimensionality. These characteristics are the opposite of those of the previous two datasets, which helps us reveal different features of the similarity metrics. From Fig. 6 it can be seen that the difference in predictive power between the similarity metrics is no longer clear. However, the overall trend is that accuracy decreases with increasing dimensionality for ED and TS-SS, while the accuracy of CS fluctuates and slightly increases. One interesting fact is that, regardless of the constant fluctuation, CS always performs as well as or better than the other two measures.
The next two datasets are binary labeled.
IV-4 Binary datasets
The Movie Reviews dataset has a richer vocabulary, hence its feature vectors are longer, while the Sentence Polarity dataset is smaller and characterized by less informative features. Fig. 7 clearly demonstrates that for the former dataset the cosine similarity is preferred because of its ability to discard mismatched features. For the Sentence Polarity dataset, on the other hand, the performances of all the similarity metrics are roughly equal, except for feature lengths above 75, where the ability of TS-SS to generate a wider range of values comes into play.
The cosine similarity shows a steady increase in performance as the feature dimensionality grows. In other words, the more information a feature vector contains, the better the performance will be. The other two methods, by contrast, lack this ability to exploit high-dimensional feature vectors.
After these five experiments, we further examine the performance of our system when the tf-idf feature vectors are normalized. Two procedures are evaluated: normalization based on the l2 norm and on the l1 norm. Each data point $x_j$ is subjected to the constraint (7), $\sum_{i=1}^{n} x_{ji}^2 = 1$, or (10), $\sum_{i=1}^{n} |x_{ji}| = 1$, respectively, where $n$ is the feature length, $j$ indexes documents, $j \in \{1, \dots, N\}$, and $N$ is the total number of documents in the training dataset.
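The two normalization procedures can be sketched as follows. The function names are chosen for this example, and leaving an all-zero vector unchanged is an assumed convention.

```python
import math


def l2_normalize(vector):
    """Scale the vector so that the sum of its squared components equals 1."""
    length = math.sqrt(sum(x * x for x in vector))
    # Assumed convention: an all-zero vector is returned unchanged
    return [x / length for x in vector] if length else list(vector)


def l1_normalize(vector):
    """Scale the vector so that the sum of its absolute components equals 1."""
    total = sum(abs(x) for x in vector)
    return [x / total for x in vector] if total else list(vector)
```

For example, `l2_normalize([3.0, 4.0])` gives a vector of unit Euclidean length, while `l1_normalize([1.0, -3.0])` gives components whose absolute values sum to one.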
With the feature vectors subjected to the constraint (7), the first five experiments are repeated. The results achieved now differ significantly from the unnormalized case: all three similarity metrics show a similar accuracy growth pattern.
For the Reuters dataset, the performances of the cosine and TS-SS metrics are almost the same, whereas the Euclidean distance is slightly worse for feature vector lengths between 100 and 200. From 50 to 250 the metrics show a rapid increase in performance, while beyond a dimensionality of 250 the accuracy levels out with small oscillations. For the 20 NewsGroups dataset, the performances of these similarity measures follow the same logarithmic increase. Notably, the normalization constraint (7) did not impair the predictive performance of our system.
The performances on the other three datasets under l2 normalization are shown in Appendix A. They all show the same growth pattern as the previous two datasets.
The reason why every similarity measure shows almost the same growth pattern is that, once data points are constrained by (7), the metrics reduce to similar mathematical expressions. The similarity metrics are then described by equations (1) for ED, (8) for CS, and (9) for TS-SS.
Note that fixed scaling parameters that are the same for each data point are excluded from the equations because constant scaling does not change the document ranking.
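For ED and CS this relationship can be checked directly: for unit-length vectors the squared Euclidean distance equals 2 - 2·CS, so both measures rank documents identically. The sketch below verifies this on a few hypothetical vectors; it does not cover the TS-SS expression, which is not reproduced in this text.

```python
import math


def l2_normalize(vector):
    length = math.sqrt(sum(x * x for x in vector))
    return [x / length for x in vector]


def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dot(a, b):
    # For unit vectors the dot product is exactly the cosine similarity
    return sum(x * y for x, y in zip(a, b))


def rank_by(score, query, documents, reverse):
    """Return document indices ordered by score(query, document)."""
    return sorted(range(len(documents)),
                  key=lambda i: score(query, documents[i]), reverse=reverse)


# Hypothetical example vectors, normalized to unit length
documents = [l2_normalize(v) for v in ([1.0, 2.0, 0.5], [0.2, 0.1, 3.0], [2.0, 2.0, 2.0])]
query = l2_normalize([1.0, 1.5, 1.0])

for doc in documents:
    # ED^2 = |a|^2 + |b|^2 - 2 a.b = 2 - 2 * CS when |a| = |b| = 1
    assert abs(euclidean_distance(query, doc) ** 2 - (2 - 2 * dot(query, doc))) < 1e-12

by_distance = rank_by(euclidean_distance, query, documents, reverse=False)
by_cosine = rank_by(dot, query, documents, reverse=True)
```

Here `by_distance` and `by_cosine` contain the same ordering of document indices, since ascending distance corresponds to descending cosine similarity.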
When feature vectors are subjected to the l1 constraint (10), the results are somewhere in between the previous two scenarios. The growth pattern is not nearly as uniform across similarity metrics as it was with l2 normalization, nor is the difference as drastic as it was without any normalization.
Fig. 10 depicts the performance on the Reuters dataset under the l1 constraint. Surprisingly, unlike in the previous two scenarios, the accuracies of ED and TS-SS do not follow the same pattern across feature dimensionalities. When the dimensionality is small, TS-SS can capture more detail than ED; however, as the dimensionality increases, it loses this ability. ED does not benefit from longer feature vectors either, but it manages to level out and remain nearly constant.
In the second experiment (Fig. 8), on the 20 NewsGroups dataset, the growth pattern of all three similarity metrics is similar to the scenario without any normalization, with the exception that under the l1 constraint the growth is delayed. Additionally, the predictive performances on the other three datasets are virtually the same and are given in Appendix B.
The authors who proposed the TS-SS similarity claim that the features should not be normalized because a normalization constraint would diminish diversity among vectors. Although this characteristic may be beneficial for clustering problems, for the classification task it is clearly detrimental.
From the results, we can conclude that the metrics based on the Euclidean distance between data points and on the triangular similarity suffer from the problem known as the curse of dimensionality, which was first introduced by Bellman.
To exemplify why these measures perform worse as dimensionality increases, consider $N$ data points uniformly distributed in an $n$-dimensional unit ball centered at the origin. Our system classifies a given query by finding the closest data point and assigning the query the same class. The median distance from the origin to the closest data point is given by expression (11), $d(n, N) = \left(1 - (1/2)^{1/N}\right)^{1/n}$.
From this equation we can derive formula (12), $N = \ln(1/2) / \ln(1 - d^{n})$, which gives the number of data points required for a given feature length $n$ and expected distance $d$ to the closest data point.
If we fix the expected distance to the closest data point, the number of data points required by (12) increases rapidly with the feature length. This means that if we want to achieve the same expected accuracy as with low-dimensional feature vectors, the size of our training set needs to grow exponentially. That is why the Euclidean and TS-SS measures show worse performance for high-dimensional feature vectors.
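The growth of the required training set can be explored numerically with a short sketch. It assumes the standard result for points uniformly distributed in a unit ball, where the median distance from the origin to the nearest of N points in n dimensions is (1 - (1/2)^(1/N))^(1/n). The function names and the distance value used in the usage note are illustrative choices, not values from the experiments.

```python
import math


def median_nearest_distance(n_dims, n_points):
    """Median distance from the origin to the closest of n_points points
    drawn uniformly from the n_dims-dimensional unit ball."""
    return (1 - 0.5 ** (1 / n_points)) ** (1 / n_dims)


def required_points(distance, n_dims):
    """Number of points needed so that the median nearest distance equals
    the given distance, obtained by solving the expression above for N."""
    # log1p(-x) computes log(1 - x) accurately when x is tiny
    return math.log(0.5) / math.log1p(-distance ** n_dims)
```

Calling `required_points(0.5, n)` for increasing `n` (for example 2, 10, 50) shows how quickly the required number of points grows when the target distance is held fixed.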
The authors who proposed the TS-SS algorithm evaluated its performance by four different methods: Uniqueness, Number of Booleans, Minimum gap score, and Purity. However, these methods are more relevant for evaluating document clustering than document classification.
VI Conclusion and Future Work
In this work, we performed simple document recommendation based on a CBR system in order to evaluate different similarity measures: Euclidean distance, cosine similarity, and TS-SS similarity.
We implemented a system that is capable of finding the most similar document for a given query (unseen document) and evaluated it across five datasets. Both the training dataset and the queries are subjected to the same procedures of data preprocessing and feature extraction. Then, three different similarity metrics are employed to retrieve the most similar document, and each prediction is recorded so that the performance of the metrics can be evaluated. After exhaustive testing, we concluded that the similarity metric based on the cosine outperformed ED and TS-SS in high-dimensional space, which leads us to the conclusion that ED and TS-SS suffer from the problem known as the curse of dimensionality. However, when feature vectors are subjected to the l2 normalization constraint (7), all three similarity metrics show a similar predictive pattern.
Recognizing the most similar documents has been an active research area in recent years due to the increasing use of the Internet. The content on the Internet is mostly unstructured, and for each searched textual document, similar ones should be retrieved in order to provide the best recommendation to users. Even though a lot of research has been done to improve each step of such a predictive system, there is still no single best solution for NLP problems.
During our research, we came across a few interesting research questions that we consider worthy of further investigation:
Investigate the performance of similarity metrics for document clustering tasks.
Investigate whether the similarity metrics show a different predictive pattern for feature extraction methods such as GloVe or word2vec.
Investigate the performance on datasets that have longer textual documents.
-  (2008) Case-based reasoning for diagnosis of stress using enhanced cosine and fuzzy similarity. Transactions on Case-Based Reasoning for Multimedia Data 1 (1), pp. 3–19.
-  (1961) Curse of dimensionality. Adaptive control processes: a guided tour. Princeton, NJ.
-  (2012) The digital universe in 2020: big data, bigger digital shadows, and biggest growth in the far east. IDC iView: IDC Analyze the future 2007 (2012), pp. 1–16.
-  (1954) Distributional structure. Word 10 (2-3), pp. 146–162.
-  (2016) A hybrid geometric approach for measuring similarity level among documents and document clustering. In Big Data Computing Service and Applications (BigDataService), 2016 IEEE Second International Conference on, pp. 142–151.
-  (2010) Introduction to information retrieval. Journal of the American Society for Information Science and Technology 61 (4), pp. 852–853.
-  (2013) Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems, pp. 3111–3119.
-  (1995) WordNet: a lexical database for english. Communications of the ACM 38 (11), pp. 39–41.
-  (2014) Glove: global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pp. 1532–1543.
-  (1986) A critical analysis of vector space model for information retrieval. Journal of the American Society for information Science 37 (5), pp. 279.
-  (1972) A statistical interpretation of term specificity and its application in retrieval. Journal of documentation 28 (1), pp. 11–21.
-  (2006) The porter stemming algorithm: then and now. Program 40 (3), pp. 219–223.
-  (2007) A normalized levenshtein distance metric. IEEE transactions on pattern analysis and machine intelligence 29 (6), pp. 1091–1095.