1 Introduction
“The challenge is not only to collect and manage vast volumes and different types of data, but also to extract meaningful value from this data” [4]. Indeed, big data makes extracting information both difficult and time consuming. Machine learning (ML) is a powerful tool that can help us with this task. Depending on the task, we need to decide which machine learning algorithm is the most appropriate. One of the relevant criteria is the ability of the machine learning model to perform accurately on new, unseen examples. Therefore, it is necessary to compare and analyze the results of different models according to different criteria to be able to choose the best possible one for each task. However, in practice, this type of comparison often cannot be done, as it requires extra resources and takes more time.
The main goal of natural language processing (NLP) based on machine learning is to obtain a high level of accuracy and efficiency. Unfortunately, obtaining high accuracy often comes at the cost of slow computation [11]. While there is a lot of research on improving accuracy, few works consider time and accuracy together. With big data, we need NLP systems to be fast as well as accurate, seeking a reasonable tradeoff between speed and accuracy. However, “what is reasonable for one person might not be reasonable for another” [11]. The same comment applies to a given task. Hence, we want to find the best algorithm with respect to a customer-specified speed/accuracy tradeoff, on a customer proprietary data set.
Among the supervised machine learning algorithms for a particular task, the algorithms may vary by processing methodology as well as by training efficiency. This makes it difficult for a customer to select an appropriate algorithm for a specific situation. The situation is even more difficult considering that the answer may depend on the specific data set and/or its size as well as the set of algorithms and the type of evaluation used. We can complicate this problem even further by adding space or time restrictions for the training and/or the prediction phase. In addition, it is hard to find annotated data sets, and using professional annotators to label new training data sets can be expensive and time consuming, even when using crowdsourcing, as training data sets can be very large.
With respect to training data size, when it increases, quality usually improves but the algorithm takes more time. However, after a point, quality may not increase as much, while the running time keeps increasing, and hence the quality gain may not be worth the efficiency loss. Therefore, increasing the training data after that point is neither efficient nor effective.
We address this problem by considering the tradeoffs between training time efficiency and learned accuracy on different sizes of data by studying several algorithms on three different problems/tasks in text processing, comparing them in three dimensions: running time, data size, and accuracy quality. For this we define a framework that allows us to compare different algorithms and define relevant tradeoffs.
The main contributions of this work are the following:

A tradeoff analysis framework between quality and efficiency that can be applied to most problems that use ML algorithms. In fact, the framework borrows from similar ideas used for generic algorithms.

Application of this framework to text processing tasks that are typically solved with supervised ML algorithms, analyzing the impact of the object granularity of the tasks (entities, reviews, documents), the specific data set, and the type of evaluation used (holdout versus folds). The main finding is that the best algorithm is not necessarily the one that achieves the best quality nor the most efficient one, but the one that balances both measures well for a given training data size.

An experimental comparison of three well-known Named Entity Recognizers (NER) using a news data set that is relevant on its own. The main result is that the clear winner is the Stanford NER.

An experimental comparison of several ML algorithms for Sentiment Analysis using two subsets of the same reviews data set, to analyze the impact of changing the data set when the subsets are of similar type. For one of the subsets we also analyze the impact of the evaluation technique used. The main result is that Support Vector Machines (SVM) is the best algorithm, followed by Logistic Regression (LR), among the algorithms considered.

An experimental comparison of several ML algorithms for Document Classification using two different tasks (binary and multiclass) for two different data sets (news and patents), to analyze the impact of changing those parameters. For one of the sets we also analyze the impact of the evaluation technique used. The main result is that SVM is again the best algorithm, among the ML algorithms considered.
Notice that we did not include neural networks in the comparison (that is, deep learning) because their training time complexity is much higher than the most used algorithms and hence they are not competitive in our tradeoff analysis.
The rest of the paper is organized as follows. In Section 2 we present the state of the art, while in Section 3 we explore our problem statement and define our tradeoff framework. In Section 4 we consider the named entity recognition problem while in Section 5 we address sentiment analysis. In Section 6, we address the document classification problems in two different data sets (news and patents). We end with our conclusions in Section 7. Most of this work is part of the PhD thesis of the second author [16] and a slightly shorter version of this paper was published in [3].
2 State of the Art
There are many papers that do experimental comparisons of different ML-based algorithms related to text processing, such as [10, 8], but they usually do not look at the performance tradeoff between quality and time. One exception is the work of Kong and Smith [13], which compares different methods of obtaining Stanford typed dependencies, obtaining the tradeoff shown in Figure 1. A similar work on parsing is [11].
Banko and Brill [5] studied natural language disambiguation. They try to find out the effect of training data size on performance and when the benefit from additional training data ends. Ma and Ji [17] reviewed various general techniques on supervised learning to improve performance and efficiency. They introduce performance as “the generalization capability of a learning machine on randomly chosen samples that are not included in a training set. Efficiency deals with the complexity of a learning machine in both space and time”. In our case, we are interested in time and quality.
3 Tradeoff Analysis
As quality is as important as time efficiency, we need a fair way to compare algorithms that achieve different quality at a different processing time cost. Most of the time the best quality algorithm is the slowest one, and it is not clear if the extra processing time is worth the quality improvement. Hence, we need to explore the tradeoffs between quality and time in a way that is independent of the problem being solved as well as the computational infrastructure that is being used.
Can we trade quality and time and at the end improve both quality and time? Many times, the answer is yes. Let us consider the following example that comes from NER [2]. Let us say that algorithm A finds  true entities in a text of size  in linear time while algorithm B finds  true entities in  time. That is, B has better quality by a margin of , but is slower than A. However, we are just looking at quality with respect to the data size.
To compare them fairly, we need to consider the same time. So, if both algorithms run in time proportional to , we have that the number of correct entities is:
A:  
B: 
Hence, we can equate the two cases to find  such that for some constant , algorithm A finds more correct entities than algorithm B, just because it can process more data in the same time, despite achieving less quality. That happens for some  such that:
For example, for  and  with  and , we find that the point where A starts finding more correct entities than B is when A can process almost 2.4 GB while B can do just over 2 GB.
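The same-time comparison can be illustrated with a small numerical sketch. The values and cost models below are assumptions made only for illustration, not the ones used in the example above: algorithm A is taken to be linear in the data size, algorithm B is assumed to cost `c * n * log2(n)`, and the quality rates and cost constants are hypothetical.

```python
import math


def max_size_linear(budget, cost_per_unit):
    """Largest data size a linear-time algorithm can process within the budget."""
    return budget / cost_per_unit


def max_size_nlogn(budget, cost_per_unit):
    """Largest n with cost_per_unit * n * log2(n) <= budget, found by bisection."""
    if cost_per_unit * 2.0 * math.log2(2.0) > budget:
        raise ValueError("budget too small for this sketch")
    lo, hi = 2.0, 2.0
    while cost_per_unit * hi * math.log2(hi) <= budget:
        hi *= 2.0  # grow the upper bound until it no longer fits in the budget
    for _ in range(200):
        mid = (lo + hi) / 2.0
        if cost_per_unit * mid * math.log2(mid) <= budget:
            lo = mid
        else:
            hi = mid
    return lo


def compare(budget, q_a, c_a, q_b, c_b):
    """Correct entities found by A (linear) and B (assumed n log n) in the same time."""
    n_a = max_size_linear(budget, c_a)
    n_b = max_size_nlogn(budget, c_b)
    return q_a * n_a, q_b * n_b


# Hypothetical setting: B has higher quality and a smaller constant, but worse growth.
for budget in (1e5, 1e8):
    found_a, found_b = compare(budget, q_a=0.8, c_a=1.0, q_b=0.9, c_b=0.05)
    winner = "A" if found_a > found_b else "B"
    print(f"budget={budget:.0e}: A={found_a:.3e}, B={found_b:.3e}, winner={winner}")
```

Under these assumed values, B finds more entities for the small time budget, while A overtakes it for the large one because it can process more data in the same time.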
3.1 Quality, Time and Data Size
We can divide a data set into different sizes and measure the training time of different ML algorithms on them. Next, we can calculate the quality of the algorithms using the F-measure [1], as it captures both recall (sensitivity) and precision. We can use other measures such as accuracy, but high accuracy can usually be achieved by just predicting the most common class. Plotting the results, we can find a point on the curve where quality will no longer get better with more data. With increasing data size, obviously, the time increases depending on the algorithm complexity, as seen earlier.
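Finding the point where quality stops improving can be sketched as follows. This is a minimal illustration, assuming a learning curve given as parallel lists of sizes and quality values; the function name, the `min_gain` threshold, and the sample curve are hypothetical.

```python
def saturation_point(sizes, qualities, min_gain=0.005):
    """Return the smallest size after which quality never improves by more than min_gain.

    Returns None when quality is still growing at the largest size measured.
    """
    if len(sizes) != len(qualities):
        raise ValueError("sizes and qualities must have the same length")
    for i in range(len(sizes) - 1):
        best_later = max(qualities[i + 1:])
        if best_later - qualities[i] <= min_gain:
            return sizes[i]
    return None


# Hypothetical learning curve: training size in MB versus F-measure.
sizes_mb = [10, 50, 100, 500, 1000, 2000]
f_measure = [0.70, 0.78, 0.83, 0.86, 0.862, 0.863]
print(saturation_point(sizes_mb, f_measure))
```

The threshold encodes what counts as a worthwhile gain, so it should be chosen per application rather than taken from this sketch.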
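To consider the three measures together, we define our performance measure as quality multiplied by data size and divided by running time:

$$\text{Performance} = \frac{\text{Quality} \times \text{Size}}{\text{Time}} \tag{1}$$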
That is, performance scales with the size of the data but is penalized by the time consumed. This way, high quality and fast time on large data sets will have very high performance, but high quality with slow time on large data sets will decrease the performance. Figure 2 is an example of performance change for five well-known classification algorithms applied to data sets of different size. Here we can see that slower algorithms like k-Nearest Neighbors (KNN) and Decision Trees (DT) decrease their performance while linear time ML algorithms keep the performance more stable. This performance measure could have different weights for each of the variables considered, as well as different functions applied to each variable, depending on the real costs associated with them.
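The performance measure (quality multiplied by data size and divided by running time, as also stated in the conclusions) can be sketched in a few lines. The function name, units, and measurements below are hypothetical and only illustrate how a linear-time and a quadratic-time algorithm with the same quality would behave.

```python
def performance(quality, data_size, running_time):
    """Performance = quality * data size / running time."""
    if running_time <= 0:
        raise ValueError("running_time must be positive")
    return quality * data_size / running_time


# Hypothetical measurements: sizes in MB, times in seconds, constant F-measure of 0.8.
sizes = [100, 200, 400]
linear_times = [10, 20, 40]        # time grows proportionally to the size
quadratic_times = [10, 40, 160]    # time grows with the square of the size

linear = [performance(0.8, s, t) for s, t in zip(sizes, linear_times)]
quadratic = [performance(0.8, s, t) for s, t in zip(sizes, quadratic_times)]
print(linear)
print(quadratic)
```

With these made-up numbers the linear algorithm keeps a flat performance while the quadratic one halves it at every doubling of the data, which is the qualitative behavior described for the slower algorithms.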
3.2 Dominant Algorithms
An important concept related to tradeoffs is that of a dominant algorithm. A tradeoff graph is a powerful tool for making decisions, and usually when one measure improves, another decreases. Dominant algorithms are on the convex hull of all the points in the graph. In Figure 1 we added a red line to show the dominant algorithms for that example. At one extreme, a choice like Huang would be selecting fast throughput but lower accuracy. At the other extreme, a choice like Charniak-Johnson would be selecting a high level of accuracy, but slower throughput. According to such a graph, an increase in quality involves, most of the time, a loss in speed. However, we need to avoid choices like Stanford EnglishPCFG or Stanford RNN, which are not the best for any measure. A good choice would be on the frontier, where the dominant algorithms are.
Dominant algorithms usually are the same for all data sets, that is, the dependency on data is low. However, this is not always the case. When we include the number of data sets for which each algorithm is better, the notion of dominant algorithm gets more complicated. We do not use this more complex notion, but a good example for learning to rank algorithms can be found in [20].
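The notion of a dominant algorithm can be sketched with a simple filter. Note the assumption: instead of the convex hull used in the text, this sketch applies the related but looser criterion of Pareto non-dominance (an algorithm is kept unless another one is at least as good in both quality and time and strictly better in one), so it may keep a few points that lie inside the hull. The algorithm names and measurements are hypothetical.

```python
def dominant(points):
    """Return the names of the non-dominated algorithms.

    points maps a name to (quality, time); higher quality and lower time are better.
    """
    result = []
    for name, (q, t) in points.items():
        dominated = any(
            (q2 >= q and t2 <= t) and (q2 > q or t2 < t)
            for other, (q2, t2) in points.items()
            if other != name
        )
        if not dominated:
            result.append(name)
    return sorted(result)


# Hypothetical measurements: (F-measure, training time in seconds).
measurements = {
    "fast": (0.80, 10),
    "balanced": (0.88, 30),
    "accurate": (0.92, 120),
    "poor": (0.85, 60),  # worse and slower than "balanced"
}
print(dominant(measurements))
```

Running the filter once per data size gives the per-size sets of dominant algorithms discussed in the experimental sections.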
3.3 Experimental Design
Now we explain the experimental rationale that we use in the following three sections for each of the problems that we selected. As we mentioned before, the three problems have different granularity, from words to full documents. The training data sets used also have different sizes due to the availability of large public data sets for each problem. We also use two different evaluation techniques: a simple holdout set for evaluation (20%) or k-fold cross-validation (at least 5 folds).
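The two evaluation techniques can be sketched with plain Python index splits. This is only an illustration of the idea, not the code used in the experiments; the function names, the shuffling, and the seed are assumptions.

```python
import random


def holdout_split(n_items, test_fraction=0.2, seed=0):
    """Shuffle item indices and reserve a fraction of them for testing."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    n_test = int(round(n_items * test_fraction))
    return indices[n_test:], indices[:n_test]


def k_fold_splits(n_items, k=5, seed=0):
    """Yield (train, test) index lists, using each of the k folds once for testing."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


train, test = holdout_split(100)
print(len(train), len(test))
for fold_train, fold_test in k_fold_splits(100, k=5):
    print(len(fold_train), len(fold_test))
```

In practice a library routine would normally be used instead, for example `train_test_split` and `KFold` from `sklearn.model_selection`.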
Table I shows the data sets used. The rationale is that we start by processing words, then paragraphs, and finally full documents. For this reason, we use a news data set for word classification (NER), the Amazon reviews data set for sentiment classification (Movies & TV and Books, two homogeneous subsets for a binary prediction problem), and patents and news for document classification (the first binary and the second multiclass). We also give the maximal data size used and the number of types of features considered. The types of features were basically all the attributes available in the data sets used plus the main type, the word vector space of the texts considered. In addition, in one case we consider two subsets of the same data set (reviews) and two classes (binary), while in the other case we consider two completely different collections (news and patents) and different prediction problems (multiclass and binary). As we cannot exhaustively compare all the parameters, this selection gives a good coverage of the possible combinations. For each problem, we choose several algorithms, and we tried to have two problems with a similar set of algorithms to see the effect of the problem on the results. Another parameter of the experimental space is the number of features, but we decided to keep this parameter at the same order of magnitude in all cases, as it has a much larger granularity than the other parameters being studied and because features do depend on the problem.
Table I

| Data set | Objects | Size | Feature types |
| --- | --- | --- | --- |
| News | Articles (0.8M) & words | 2.13 GB | 11 |
| Movies & TV | Reviews (1.8M) | 8.77 GB | 10 |
| Books | Reviews (2.6M) | 14.39 GB | 10 |
| Patents | Documents (7.1M) | 7.18 GB | 6 |

Table II

| Problem (Object) | Classes | Data set | Algorithms | Evaluation Measures | Evaluation Methods |
| --- | --- | --- | --- | --- | --- |
| NER (words) | Multi | News | SNER, LingPipe, Illinois | Precision, Recall, F-measure | Holdout |
| Sentiment Analysis (reviews) | Binary | Movies & TV | DT, LR, KNN, RF, SVM | Precision, Recall, F-measure | Holdout |
| Sentiment Analysis (reviews) | Binary | Books | DT, LR, KNN, RF, SVM | Precision, Recall, F-measure | Holdout & Cross-validation (10 folds) |
| Document Classification (documents) | Multi | News | DT, LR, KNN, SVM, Naïve Bayes | Precision, Recall, F-measure (micro and macro average) | Cross-validation (6 folds) |
| Document Classification (documents) | Binary | Patents | DT, LR, KNN, SVM, RF | Precision, Recall, F-measure | Holdout & Cross-validation (6 folds) |

The characteristics of each problem, as well as the algorithms and parameters used, are shown in Table II. For Movies & TV reviews, we used training and testing data sets. For Book reviews and Patents, we used both evaluation methods (holdout as well as cross-validation). There were no significant differences between holdout and cross-validation with 6 or 10 folds. So, we show the results for only one of them in each experiment.
We used the Scikit-learn framework for the ML algorithms, feature extraction and their evaluation [18]. We use KNN, SVM, LR, DT, Naïve Bayes (NB), and Random Forests (RF). For all the language processing tasks, we used the NLTK toolkit [6]. For all the experiments, we used a computer with processor Core i7 2.5GHz Intel with MHz DDR3 cache CPU and GB of RAM memory. So our algorithms run almost all the time with the data in main memory.
3.4 Data Sets
For news, we used the Reuters Corpus [15] collection, RCV1, that contains about 800 thousand articles in English, taking approximately 2.1 GB when uncompressed.
With almost million reviews for almost 2.5 million products from more than 6 million users, Amazon reviews [14] is one of the largest data sets for product reviews. These reviews were collected for years up to March . The overall data spans over categories with a total size of about 35 GB. From them we selected the two largest subsets: Books and Movies & TV. The Movies & TV subset consists of 2.4M reviews from 70K products and takes 8.8 GB uncompressed. The Books subset contains 12.8M reviews from almost 930K products, taking 14.4 GB uncompressed.
Finally, our last data set is derived from the USA Patent databases [21], that includes thorough information from patents granted between 1976 and 2016 and patent applications filed between 2001 and 2016, accounting for 7.1M documents that require 7.2 GB of space.
4 Named Entity Recognition
We selected the Named Entity Recognition problem because it is one of the main technologies used as a preprocessing step on more advanced NLP applications on different kinds of corpora. We selected for comparison three supervised NER algorithms that are publicly available, well known, free for research, and that are based on machine learning methods:

Stanford Named Entity Recognizer (SNER) [9], which is based on conditional random fields (CRF);

Illinois Named Entity Tagger (INET) [19], which uses neural networks (NN) and hidden Markov models (HMM); and

LingPipe (LIPI) [7], which is also based on HMM.
The entity types we chose to focus on are Person (PER), Location (LOC) and Organization (ORG), using data subsets from MB to GB. Figure 3 shows the overall quality comparison for the three NERs on different data sizes. Our results show that SNER is clearly better than LIPI and INET for all data sizes. In general, quality increases with data size. Once reaching MB, the quality stabilized and increasing the data did not improve the quality.
Regarding the running time of these systems in different data sizes, SNER was the best, followed by LIPI and INET for all data sizes as shown in Figure 4.
Given that SNER has the best quality and is the fastest algorithm, it is trivially dominant, as shown in Figure 5. Now we use our performance measure defined in Equation 1 to obtain Figure 6, which shows that overall SNER has better performance, considering all three factors. As the quality does not change after 500 MB, the performance will probably decrease with larger data sets.
5 Sentiment Analysis
Sentiment analysis is used for identifying whether a short text is a positive or negative comment and sometimes even the degree of positivity or negativity, such as in web product reviews [23]. The Amazon reviews data set includes the rating of each review, plus the product and user information. The rating is based on one to five stars where one means that the user did not like the product and five means that the user loved the product. So, we used these ratings as training labels (positive for at least four stars and negative for less than three stars). We analyze the same algorithms in two subsets of the Amazon reviews, as explained earlier.
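The labeling rule described above can be sketched as follows. The rule for four or more stars and for fewer than three stars comes from the text; how three-star reviews are handled is not stated, so returning `None` for them (to be dropped) is an assumption, as is the function name.

```python
def rating_to_label(stars):
    """Map a 1-5 star rating to a sentiment training label."""
    if stars >= 4:
        return "positive"
    if stars < 3:
        return "negative"
    return None  # three stars: not covered by the rule, assumed to be left out


ratings = [1, 2, 3, 4, 5]
labels = [rating_to_label(s) for s in ratings]
print(labels)
```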
5.1 Movies and Television Reviews
Figure 7 shows the comparison of the quality of the algorithms when run on different data sizes. We see in this case that quality increases when the data size increases. SVM, LR and RF have very close results, but SVM performs better with larger data. LR has the best quality up to 50 MB. Due to the sparsity of the textual data, it was very hard to train KNN on large data. As we can see in Figure 8, the training time is roughly linear with respect to data size. Because of their inefficiency, RF and DT are poor choices. Figure 9 shows how SVM becomes the dominant algorithm for big data, as SVM is the fastest in that case. However, for small data there are several dominant algorithms, as shown in Figure 10.
5.2 Book Reviews
Figures 12 and 13 compare the different algorithms with respect to quality and time, respectively. We can see that quality keeps increasing up to the maximal size we had available. If there is a saturation point, it happens at 10 GB or more.
For big data, SVM is the dominant algorithm, just beating LR, as shown in Figure 14. However, here again there are more dominant algorithms for the case of small data, as seen in Figure 15. Regarding performance, given in Figure 16, SVM dominates LR after 10 MB by a small margin, and is far better than the other algorithms. DT shows stable performance, while the performance of KNN and RF drops, or they need too many resources for big data.
5.3 Discussion
We tested different algorithms for sentiment analysis with two different subsets of reviews. We got the same result in both experiments, corroborating our intuition that the results should not change when the data is similar. The results show that quality increased by adding more data, and all the algorithms had the same learning curves in quality. However, our results show that in both cases increasing the training data does not always help to improve the performance. Performance curves were different in both cases, slightly increasing for Movies & TV reviews and basically constant after 500 MB for Book reviews.
Considering both cases, if we order the algorithms by quality we have: SVM, LR, RF, and DT for data larger than MB. By considering only speed (from fast to slow), the order changes to: SVM, LR, KNN, DT and RF for all data sizes. Ordering by performance we have: SVM, LR, and KNN for small data. The order for big data (more than MB) is SVM, LR, DT, and RF. Regarding dominant algorithms, our results show that SVM is the dominant algorithm in data sizes over MB. KNN, LR, and SVM are dominant in small data.
6 Document Classification
The goal of document classification is to automatically assign an appropriate class to each document and is a very popular application [22].
The most popular machine learning approaches used in document classification are SVM, DT, KNN, NB, Neural Networks, and Latent Semantic Analysis [10, 12]. In our experiments, we used the first four plus LR, to have a similar set of algorithms to the previous section.
6.1 News Classification
Because we have more than one class, we used macro averaging: we compute quality for each class (precision and recall) and then we average all of them to obtain the F-measure. In Figure 17 we show quality with varying training data size for the different algorithms. In this case, we see a saturation effect after 100 MB. SVM performs better on small data; however, for larger data sizes LR is equally efficient. KNN is near to LR on small and medium data sizes, but as mentioned earlier, KNN is affected by the sparsity of the word vector space. The results for time efficiency are shown in Figure 18. Here we can see that SVM, LR, and NB are very similar, while DT and KNN are not competitive.
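Macro averaging can be sketched as follows. The interpretation is an assumption: precision and recall are computed per class, averaged over the classes, and the F-measure is then taken as the harmonic mean of the two averages (another common variant averages the per-class F-measures instead). The function names and the toy labels are hypothetical.

```python
def per_class_precision_recall(y_true, y_pred):
    """Return {class: (precision, recall)} for every class seen in the labels."""
    scores = {}
    for c in sorted(set(y_true) | set(y_pred)):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        scores[c] = (precision, recall)
    return scores


def macro_f_measure(y_true, y_pred):
    """Harmonic mean of the macro-averaged precision and recall."""
    scores = per_class_precision_recall(y_true, y_pred)
    macro_p = sum(p for p, _ in scores.values()) / len(scores)
    macro_r = sum(r for _, r in scores.values()) / len(scores)
    if macro_p + macro_r == 0:
        return 0.0
    return 2 * macro_p * macro_r / (macro_p + macro_r)


# Toy example with three classes and one misclassified document.
y_true = ["a", "a", "a", "b", "b", "c"]
y_pred = ["a", "a", "b", "b", "b", "c"]
print(per_class_precision_recall(y_true, y_pred))
print(macro_f_measure(y_true, y_pred))
```

Because every class contributes equally to the average, rare classes weigh as much as frequent ones, which is the usual reason to prefer macro over micro averaging when classes are unbalanced.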
Figure 19 shows the dominant algorithms for 500 MB, one example where SVM and NB are the dominant algorithms up to GB. However, for larger data, SVM is the clear winner and hence is a good algorithm for all data sizes. The performance comparison is shown in Figure 20, where SVM more or less keeps the same performance from 0.5 GB onwards.
6.2 Patent Classification
Figure 21 shows the quality comparison of the classification algorithms on various data sizes for the Patents data set. SVM and DT were the two best algorithms, almost with a perfect tie after 3 GB. As Figure 22 shows, SVM and LR were the fastest algorithms in this case.
For all data sizes, SVM dominates. At 1 GB, RF and LR are also dominant. At 4 GB, SVM is still dominant, but now together with LR, as shown in Figure 23. However, for the largest data set (7.2 GB) only SVM dominates.
In the overall performance comparison shown in Figure 24, SVM wins in all data sizes tested but one, followed closely by LR. Note that the performance is almost the same for all data sizes, something that confirms that the two algorithms have linear complexity.
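The link between constant performance and linear complexity can be made explicit with a short derivation. It assumes the performance measure stated in the conclusions (quality multiplied by data size and divided by running time), a running time of the form $T(n) = c\,n$ for data size $n$, and a quality that has stabilized around a value $q$; the symbols $c$, $q$, $P$, $Q$ and $T$ are introduced here only for illustration.

```latex
% Linear-time algorithm with stable quality: performance is constant.
\[
P(n) = \frac{Q(n)\, n}{T(n)} \approx \frac{q\, n}{c\, n} = \frac{q}{c}
\]
% Superlinear algorithm, e.g. T(n) = c n \log n: performance decays with n.
\[
P(n) \approx \frac{q\, n}{c\, n \log n} = \frac{q}{c \log n}
\;\longrightarrow\; 0 \quad \text{as } n \to \infty
\]
```

A flat performance curve is therefore consistent with linear running time only while quality is also stable; if quality were still growing, the curve would rise slightly even for a linear-time algorithm.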
6.3 Discussion
We studied two different document classification problems on different data sets (news and patents). In both cases, quality increased by adding more data. All the algorithms had the same learning curves in quality, but the shape of the performance curves was different. In addition, algorithms showed different behavior on different data sizes. So, this shows again that adding more data does not always help to improve the performance.
We also did a k-fold evaluation for both cases, news and patents. However, the results did not change with respect to the simple holdout case (80% for training, 20% for testing).
SVM and LR in News classification and RF and SVM in Patents classification were the best quality algorithms. RF and DT show good quality in the latter case because the patent word space is less sparse. SVM and NB are dominant algorithms in small data, while SVM is dominant in big data. In both cases, SVM, NB and LR were the fastest algorithms for text classification problems, while KNN and RF were not able to handle large data in our setting. Finally, SVM shows the best performance in larger data sets.
7 Conclusions and Future Work
With the rapid growth of the Web and textual data in general, the task of extracting information from documents in an automatic way, also known as knowledge discovery, has become more important and challenging. Extracting information or knowledge discovery usually uses a combination of ML, NLP, and data mining. A lot of research has been done on this problem using small data sets to try to improve quality. However, little research explores the tradeoffs of quality and running time on various data sizes. Therefore, our goal was to understand the tradeoffs for supervised ML algorithms when dealing with larger data sets.
So, we selected three problems in text processing that are usually solved with supervised machine learning. We compared the performance of several methods based on the quality of the results returned, the training time of the algorithms, and the size of the training data used. We discussed the tradeoffs between quality and time efficiency, defining a simple performance measure as quality multiplied by data size and divided by running time. In this way, we can fairly compare different algorithms. We also found the dominant algorithms for each of the problems at different data sizes.
The problems considered here included Named Entity Recognition, Sentiment Analysis, and Document Classification. Depending on the problem, kind of data, and number of samples as well as features, the algorithms exhibited different behaviors. For example, SVM had the best performance in larger data sizes. Naïve Bayes and KNN showed better performance on small data sizes. However, KNN could not work on large data in our computing setting.
In conclusion, depending on whether we focus on quality or speed, we would recommend the use of SVM or Naïve Bayes for text classification when the data size is small. Indeed, Naïve Bayes requires only a small amount of training data to estimate the parameters necessary for classification. If the number of samples is huge, we would recommend the use of SVM. In fact, SVM was the only algorithm that was dominant and had the best performance for the largest data size in all experiments. If the data has many duplicates, DT and RF might be a reasonable choice, as they worked well in data sets which included a significant amount of duplicated data. However, these methods are less efficient.
Regarding tradeoffs, we saw the two possible cases: quality increases with data size, or quality becomes stable after some point. This is interesting because it implies that tradeoffs already happen for GB of data. In this case, linear time ML algorithms will in the end be the winners. Indeed, for those algorithms the performance would remain constant independently of data size. Regarding dominant algorithms, our results show that for small data there are several dominant algorithms and hence the decision of which algorithm to use is not trivial.
The first way to extend our work is to do more experiments to cover better the parameter space of the problem of comparing supervised ML algorithms. That implies using more data sets where the notion of dominant algorithm can be extended [20], as well as trying all possible evaluation techniques. Another extension would be to vary the number of features and consider more algorithms. We could also add the effect of topic sparsity and text redundancy.
Other future work would be finding the threshold point where quality or performance no longer improves by adding more data, for the problems where our data sizes were not large enough. To find the best performance levels, we must be able to estimate the size of the annotated sample required to reach the best performance. Another dimension that can be explored is parallel processing, repeating the same experiments using the MapReduce paradigm.
Another extension is to include the prediction time in the tradeoff. For example, in many applications, online prediction time could be a constraint that not all ML algorithms may satisfy. This research can also continue by comparing semi-supervised and unsupervised learning methods. All these problems open new interesting tradeoff challenges in algorithm analysis and general design for NLP and ML.
References
 [1] R. Baeza-Yates and B. Ribeiro-Neto, Modern information retrieval: The concepts and technology behind search. Addison-Wesley, Pearson, 2011.
 [2] R. Baeza-Yates, “Big data or right data?” in Alberto Mendelzon Workshop 2013, Puebla, Mexico, 2013.
 [3] R. Baeza-Yates and Z. Liaghat, “Quality-efficiency trade-offs in machine learning for text processing,” in IEEE Big Data. IEEE CS Press, 2017.
 [4] K. Bakshi, “Considerations for big data: Architecture and approach,” in 2012 IEEE Aerospace Conference, March 2012, pp. 1–7.
 [5] M. Banko and E. Brill, “Scaling to very very large corpora for natural language disambiguation,” in Proceedings of the 39th Annual Meeting on Association for Computational Linguistics. Stroudsburg, PA, USA: ACL, 2001, pp. 26–33.
 [6] S. Bird, “NLTK: The natural language toolkit,” in Proceedings of COLING. ACL, 2006, pp. 69–72, http://www.nltk.org.
 [7] B. Carpenter and B. Baldwin, “Text analysis with LingPipe 4,” 2011.
 [8] D. M. Cer, M.-C. de Marneffe, D. Jurafsky, and C. D. Manning, “Parsing to Stanford dependencies: Trade-offs between speed and accuracy,” in LREC. European Language Resources Association, 2010.
 [9] J. R. Finkel, T. Grenager, and C. Manning, “Incorporating nonlocal information into information extraction systems by Gibbs sampling,” in Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics. Stroudsburg, PA, USA: ACL, 2005, pp. 363–370.
 [10] V. C. Gandhi and J. A. Prajapati, “Review on comparison between text classification algorithms,” International Journal of Emerging Trends & Technology in Computer Science, vol. 1, no. 3, 2012.
 [11] J. Jiang, A. Teichert, H. Daumé, III, and J. Eisner, “Learned prioritization for trading off accuracy and speed,” in Proceedings of the 25th International Conference on Neural Information Processing Systems, 2012, pp. 1331–1339.
 [12] A. Khan, B. Baharudin, L. H. Lee, K. Khan, and U. T. P. Tronoh, “A review of machine learning algorithms for text-documents classification,” in Journal of Advances in Information Technology, 2010.
 [13] L. Kong and N. A. Smith, “An empirical comparison of parsing methods for Stanford dependencies,” CoRR, vol. abs/1404.4314, 2014.
 [14] J. Leskovec, “Amazon product data,” https://snap.stanford.edu/data/webAmazon.html, 2013.
 [15] D. D. Lewis, Y. Yang, T. G. Rose, and F. Li, “RCV1: A new benchmark collection for text categorization research,” J. Mach. Learn. Res., vol. 5, pp. 361–397, Dec. 2004.
 [16] Z. Liaghat, “Quality-efficiency trade-offs in machine learning applied to text processing,” Ph.D. dissertation, Universitat Pompeu Fabra, 2017.
 [17] S. Ma and C. Ji, “Performance and efficiency: recent advances in supervised learning,” Proceedings of the IEEE, vol. 87, no. 9, pp. 1519–1535, Sep 1999.
 [18] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay, “Scikit-learn: Machine learning in Python,” Journal of Machine Learning Research, vol. 12, pp. 2825–2830, 2011.
 [19] L. Ratinov and D. Roth, “Design challenges and misconceptions in named entity recognition,” in Proceedings of the Thirteenth Conference on Computational Natural Language Learning, ser. CoNLL ’09, 2009, pp. 147–155.
 [20] N. Tax, S. Bockting, and D. Hiemstra, “A cross-benchmark comparison of 87 learning to rank methods,” Information Processing & Management, vol. 51, no. 6, pp. 757–772, 2015.
 [21] USPTO.gov, “Research datasets,” https://www.uspto.gov/learningandresources/ippolicy/economicresearch/researchdatasets, Accessed 2016.
 [22] M. A. Wajeed and T. Adilakshmi, “Text classification using machine learning,” Journal of Theoretical and Applied Information Technology, vol. 7, no. 2, pp. 119–123, 2009.
 [23] A. Wallin, “Sentiment analysis of Amazon reviews and perception of product features,” Master’s thesis, Lund University, 2014.