Automatic Machine Learning Derived from Scholarly Big Data

03/06/2020 ∙ by Asnat Greenstein-Messica, et al. ∙ 0

One of the challenging aspects of applying machine learning is the need to identify the algorithms that will perform best for a given dataset. This process can be difficult, time consuming and often requires a great deal of domain knowledge. We present Sommelier, an expert system for recommending the machine learning algorithms that should be applied on a previously unseen dataset. Sommelier is based on word embedding representations of the domain knowledge extracted from a large corpus of academic publications. When presented with a new dataset and its problem description, Sommelier leverages a recommendation model trained on the word embedding representation to provide a ranked list of the most relevant algorithms to be used on the dataset. We demonstrate Sommelier's effectiveness by conducting an extensive evaluation on 121 publicly available datasets and 53 classification algorithms. The top algorithms recommended for each dataset by Sommelier were able to achieve on average 97.7




1 Introduction

The enormous growth in the creation of digital data has created numerous opportunities. Companies and organizations are now capable of mining data regarding almost every aspect of their activity. However, the ability to collect data far outstrips the ability of organizations to apply analytics or derive meaningful insights. One of the main reasons for this gap is the need for human involvement at multiple points of the data-analysis process and the relative scarcity of individuals with relevant skills (i.e., data scientists).

To address the shortage of skilled individuals, researchers have attempted to automate multiple aspects of the data analysis pipeline. Recent studies in this domain have focused on hyperparameter optimization thornton2013auto, feature engineering katz2016explorekit, data cleaning chu2015katara, and automatic generation of deep neural network architectures zoph2016neural. Thornton et al. autoweka2017 suggest an iterative process for the simultaneous selection of the machine learning algorithm and the optimization of its hyperparameters.

We present Sommelier, a framework for leveraging publicly accessible academic publications and open repositories such as Wikipedia to recommend the most suitable algorithms for previously unseen datasets. Based on the intuition that similarly described problems can be solved using similar algorithms, we designed a framework that extracts terms related to machine learning problems and algorithms from Wikipedia. The extracted terms are used to train a recommender system using word embedding techniques applied on a large amount of publicly accessible academic publications.

We conduct our evaluation on the extensive dataset created by Fernandez et al. fer14, which contains an exhaustive evaluation of well over a hundred public datasets and algorithms. Our experiments show that Sommelier is highly effective in recommending top-performing algorithms for a given dataset. Moreover, the top algorithm recommended by our approach significantly outperforms the results obtained by applying the Random Forest algorithm, a popular ensemble algorithm which was the best-performing algorithm (on average) in the above-mentioned study.

Our contributions are as follows:

  • We present an expert system for recommending top-performing algorithms to previously unseen datasets. The recommendation is based on a word embedding representation of the domain knowledge automatically extracted from a large corpus of relevant academic publications. Moreover, Sommelier does not require extensive analysis of the data itself. We emulate the way a human would approach the problem by relying on relevant previously published work. Sommelier can also be used as a preliminary step for other iterative algorithm recommendation solutions such as Auto-Weka autoweka2017. The effectiveness of the proposed approach is demonstrated empirically on a large corpus of publicly available datasets.

  • We propose a framework for the automated construction of a structured knowledge-base on machine learning. This goal is achieved by combining unsupervised keyword extraction from Wikipedia with the vast body of work available in public academic repositories. We demonstrate how this knowledge-base can be used to effectively derive actionable insights for machine learning applications.

2 Related Work

2.1 Knowledge base construction from large scale corpora

The growth in the amount of data available online – scholarly and otherwise – provided a significant boost to various attempts to map this data into structured and semi-structured formats and ontologies. The main drive for this was the challenges faced by practitioners in multiple fields to obtain a sufficient amount of labeled data in their respective fields gil16.

The best-known publicly available large-scale corpus is arguably Wikipedia denoyer2006wikipedia, and many projects such as DBpedia bizer2009dbpedia and Wikidata vrandevcic2014wikidata use it as a foundation. Wikipedia has been used successfully in a large variety of tasks, including entity extraction gattani2013entity, query expansion li2007improving, query performance prediction Katz:2014:WQP:2600428.2609553 and ranking of real-world objects katz2017wikiometrics. Other examples of large online datasets are YAGO suchanek2007yago, which maps entities and their relations, and Wordnet miller1995wordnet.

Another group of algorithms for building knowledge graphs from large-scale corpora utilizes an iterative approach. Algorithms from this group rely on the knowledge gathered in previous runs to expand and refine their knowledge base. This group of algorithms includes NELL mohamed2011discovering – which explores relations among noun categories – and Probase wu2012probase – a taxonomy for automatic understanding of text. An additional member of this group was recently proposed by Al-Zaidy and Giles gil17, and includes an unsupervised bootstrapping approach for knowledge extraction from academic publications.

2.2 Information extraction in scholarly documents

Scholarly publications are an important source of information to researchers and practitioners alike. For this reason, a significant amount of work has been dedicated to the extraction of structured data and entities (tables, figures and algorithms) from academic papers so5; so6; so7; so8. For example, pc12 presents a method for identifying and extracting pseudo-code segments from academic papers in the field of computer science. Given that pseudo-code segments are generally accompanied by a caption, the purpose of the code can often be inferred using regular expressions.

More recently, additional approaches for algorithm extraction and analysis have been proposed. Seer algoseer16 proposed an algorithm search engine that leverages both machine learning and a rule-based system for the detection and indexing of code. In algoeff17, the authors present an algorithm for extracting both the algorithm discussed in a research paper and its performance. Tuarob tuarob2016improving proposes the use of ensemble algorithms for the same task.

In addition to algorithms focused on extracting code, some recent work has focused on a much broader extraction of data. In wu2014towards, the authors propose a big data platform for the extraction of a wide array of meta-features including ISBNs, authorships and co-authorships, citation graphs, etc. This work, along with others that stem from it ororbia2015big; osborne2013exploring, could be used to extend our own framework, as it is currently focused on text extraction from scholarly data repositories.

2.3 Algorithm selection

The classical meta-learning approach for algorithm recommendation uses a set of measures to characterize datasets and establish their relationship to algorithm performance bra08. These measures typically include a set of statistical measures, information-theoretic measures and/or the performance of simple algorithms referred to as landmarkers bra08. The aim of these methods is to obtain a model that characterizes the relationship between the given measures and the performance of algorithms evaluated on these datasets. This model can then be used to provide a ranking of algorithms, ordered by their suitability for the task at hand sm08; bra08.

Recent studies autoweka2017; autosk suggest an iterative process for the simultaneous selection of the machine learning algorithm. AutoWEKA autoweka2017, a tool for automatic algorithm and hyperparameters selection, uses a random forest-based SMAC hu11 for a given performance measure (e.g. accuracy). Other algorithm selection tools include Auto-sklearn autosk. Another study [11] calculates dataset similarity through the generation of metafeatures and the application of automatic ensemble construction.

Unlike the meta-learning approach, which requires a large number of datasets for each dataset cluster to train a machine learning model for algorithm recommendation, Sommelier relies on scholarly big data papers and can provide effective recommendations even in cases where the available training set is relatively small. Furthermore, the proposed approach does not require extensive analysis of the data itself. We emulate the way a human would approach the problem by relying on relevant previously published work. Compared to the relatively time-consuming iterative recommendation solutions autoweka2017; autosk, Sommelier provides a fast algorithm recommendation, and can also be used as a preliminary step for more resource-heavy solutions autoweka2017; autosk.

3 Approach

3.1 Overview

Our approach builds on recent advances in the field of natural language processing (NLP), where the technique of word embedding has had success in capturing and quantifying fine-grained semantic relationships among terms. We apply this technique to a large corpus of publicly available academic publications in the field of machine learning and use it to implicitly model the relationships among problems and algorithms. We then expand and refine our model by crawling Wikipedia and leveraging its rich metadata structure (namely links and categories). We use the refined model as a recommendation algorithm whose goal is to pair datasets with algorithms.

Our approach for recommending algorithms is presented in Figure 1. It is comprised of four phases: corpus extraction, semantic embedding generation, machine learning-related keyword extraction, and recommendation.

Figure 1: The Sommelier approach pipeline. The blue shapes refer to the offline phase, and the orange shapes refer to the online algorithm recommendations for an unseen dataset.
Figure 2: An example of the Sommelier service user interface.

During the corpus extraction phase we crawl and retrieve relevant metadata from a large number of machine learning-related papers. The metadata includes features such as the title of the paper, keywords provided by the author and the journal, abstract, publication year, and references to other academic publications.

In the semantic embedding generation phase we employ GloVe glove to create representations of the keywords describing the papers. These representations, after they are refined using data crawled from Wikipedia in the next phase, enable us to identify and recommend algorithms to previously unseen datasets.

In the machine learning-related keyword extraction phase we crawl Wikipedia and use an unsupervised machine learning approach to extract terms related to machine learning algorithms or problems. We match these labelled terms to the terms of the embedded representation. By doing so we are able to identify specific implicit connections between “algorithm” terms and “problem” terms in our embedded representation.

During the recommendation phase we receive the title and description of a previously unseen dataset as input. We perform a keyword extraction process, similar to the one used in the machine learning-related keyword extraction phase. This process produces a vector representation of the new dataset, which is then compared with the vectors associated with each of the “algorithms” terms in our embedding. Based on the degree of similarity, we produce a ranked list of recommended algorithms. An example of Sommelier’s user interface is presented in Figure 2.


The phases of the process are described in detail below.

3.2 The corpus extraction phase

The goal of this phase is to generate a corpus of metadata on machine learning-related papers. To obtain a large number of papers, we crawl the Engineering Village website, a large repository of academic papers which offers access to 13 databases of engineering literature and patents. We applied the following steps to download machine learning-related papers:

  1. We downloaded all of the papers whose text contained at least one of the following terms: machine learning, data mining, regression, supervised learning, unsupervised learning, decision trees, boosting, random forest, neural networks, ANN, deep learning, recurrent neural network, RNN, convolutional neural network, CNN, relevance vector machine, RVM, support vector machine, SVM, k-means, DBSCAN, mean-shift, Bayesian networks, or feature engineering.

  2. We downloaded all of the papers that appeared in the following list of top machine learning journals or papers whose citations include papers that appeared in these venues: Data Mining and Analysis, AI, Computer Vision and Pattern Recognition, Database and Information Systems, and Probability and Statistics with Applications.


Overall, we downloaded the metadata of 461,420 papers, published between 1961 and 2017. For each paper, we stored the paper ID, authors and journal keywords as well as the year of publication. To enable aggregation of similar keywords, we applied standard text normalization on the keywords. The normalization included transforming the text into lower case and replacing space and dash characters with underscores. Following the normalization process we were left with 1,395,788 keywords in our database.
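The normalization step can be illustrated with a minimal Python sketch. The function names are hypothetical, and collapsing a run of several spaces or dashes into a single underscore is an assumption; the text only states that spaces and dashes are replaced with underscores.

```python
import re
from collections import Counter


def normalize_keyword(keyword: str) -> str:
    """Lower-case a keyword and replace spaces/dashes with underscores.

    Assumption: a run of consecutive spaces or dashes becomes one underscore.
    """
    return re.sub(r"[\s\-]+", "_", keyword.strip().lower())


def count_keywords(papers):
    """Count normalized keywords over an iterable of per-paper keyword lists.

    Each keyword is counted at most once per paper.
    """
    counts = Counter()
    for keywords in papers:
        counts.update({normalize_keyword(k) for k in keywords})
    return counts


# Toy input: two papers whose keywords differ only in case and punctuation.
papers = [["Random Forest", "k-means"], ["random-forest", "Deep Learning"]]
counts = count_keywords(papers)
print(counts["random_forest"])
```

After normalization, surface variants such as "Random Forest" and "random-forest" aggregate under the same key.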

3.3 The semantic embedding generation phase

The goal of this phase is to generate word embeddings that model the problem–algorithm relationships described in the papers that were extracted in the previous phase. Word embeddings are often used in multiple NLP tasks to discover semantic relatedness among terms glove; word2vec. Many studies in this area are based on the distributional hypothesis word2vec; w2vexp, which states that words that appear in similar contexts have close meanings. By representing each term as a vector, word embeddings enable us to identify terms with similar meanings even if there are no co-occurrences of the terms in the same document. We hypothesize that this property will enable us to identify effective algorithms for a given problem even if the particular approach has not been previously attempted.

To generate the semantic embedding representations of the papers’ keywords, we first needed to create a corpus of candidate keywords. The open-source algorithm GloVe glove is a highly scalable solution that generates predictive models for unsupervised learning of word embeddings from text. We applied GloVe to all the extracted (normalized) keywords found in the academic papers downloaded from Engineering Village (see Section 3.2).

GloVe is based on the global log-bilinear regression model and combines the advantages of the global matrix factorization lee1999learning and local context window kawakami2011high methods. GloVe explicitly factorizes the word-context co-occurrence matrix on symmetric word windows across the corpus. The embedded word representation is calculated by minimizing the following loss function using gradient descent:

$$J = \sum_{i,j=1}^{V} f(X_{ij}) \left( w_i^{\top} \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2$$

where $V$ is the number of words in the vocabulary; $X_{ij}$ denotes the number of times word $j$ occurs in the context of word $i$, while also taking into account the distance between the items within the context window; $w_i$ is the vector representation of word $i$ (i.e., the word embedding), and its size is the latent embedding size; $\tilde{w}_j$ is the context item vector; $b_i$ and $\tilde{b}_j$ are bias terms; and $f$ is a weighting function that cuts off low co-occurrences, which are usually noisy, and prevents overweighting high co-occurrences. The parameters $w$, $\tilde{w}$, $b$, and $\tilde{b}$ are learned during training.

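To adapt GloVe to our needs we enhanced the co-occurrence factor $X_{ij}$ in the equation with a weight factor to increase the influence of recent papers. The weight factor is equal to 1 for papers published before the year 2000, and increases linearly for later years. The calculation is performed as follows:

$$X_{ij} = \sum_{p \in P_{ij}} \omega(y_p)$$

where $P_{ij}$ is the set of papers where keywords $i$ and $j$ co-occur, $y_p$ is the publication year of paper $p$, and $\omega$ is the year-dependent weight factor described above.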
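As an illustration of the recency weighting, the following Python sketch accumulates a weighted co-occurrence count over keyword pairs. The slope of the linear increase is not given in the text, so `SLOPE` is a purely hypothetical value, as are the function names and the choice to count each unordered pair once per paper.

```python
from collections import defaultdict
from itertools import combinations

BASE_YEAR = 2000  # papers published before this year get weight 1 (from the text)
SLOPE = 0.1       # hypothetical: extra weight per year after BASE_YEAR


def paper_weight(year: int) -> float:
    """Weight 1 up to BASE_YEAR, then a linear increase (the slope is assumed)."""
    return 1.0 + SLOPE * max(0, year - BASE_YEAR)


def weighted_cooccurrence(papers):
    """Sum paper weights over every unordered keyword pair co-occurring in a paper.

    `papers` is an iterable of (year, keywords) tuples.
    """
    cooc = defaultdict(float)
    for year, keywords in papers:
        for a, b in combinations(sorted(set(keywords)), 2):
            cooc[(a, b)] += paper_weight(year)
    return cooc


# Toy corpus: the more recent paper contributes more to the shared pair.
papers = [(1995, ["svm", "ocr"]), (2010, ["svm", "ocr", "cnn"])]
cooc = weighted_cooccurrence(papers)
print(cooc[("ocr", "svm")])
```

Under these assumptions a pair that appears in one 1995 paper and one 2010 paper receives a count of 1 plus the 2010 weight, so recent usage dominates older usage.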

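The weighting function is designed to reduce the weight of keywords with rare co-occurrence (“noise” reduction) while also limiting the contribution of common co-occurrences. The weighting function is represented as follows:

$$f(x) = \begin{cases} (x / x_{max})^{\alpha} & \text{if } x < x_{max} \\ 1 & \text{otherwise} \end{cases}$$

where $x$ is the co-occurrence term of two keywords, $x_{max}$ is the saturation co-occurrence value, and $\alpha$ is the exponent that controls how strongly rare co-occurrences are down-weighted.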
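To make the role of the weighting function concrete, here is a small Python sketch of the standard GloVe weighting and of the contribution of a single keyword pair to the loss. The default values `x_max=100` and `alpha=0.75` are the defaults suggested in the original GloVe paper and are only an assumption here; the values actually used by Sommelier are not recoverable from the text.

```python
import math


def glove_weight(x: float, x_max: float = 100.0, alpha: float = 0.75) -> float:
    """Down-weight rare co-occurrences and cap frequent ones at 1."""
    return (x / x_max) ** alpha if x < x_max else 1.0


def pair_loss(w_i, w_j, b_i, b_j, x_ij, x_max=100.0, alpha=0.75):
    """Weighted squared error of one keyword pair in the GloVe objective.

    w_i and w_j are the word and context vectors, b_i and b_j the bias terms,
    and x_ij the (strictly positive) co-occurrence count of the pair.
    """
    dot = sum(a * b for a, b in zip(w_i, w_j))
    return glove_weight(x_ij, x_max, alpha) * (dot + b_i + b_j - math.log(x_ij)) ** 2


# Rare pairs get a small weight, frequent pairs saturate at 1.
print(glove_weight(1.0), glove_weight(50.0), glove_weight(500.0))
```

Summing `pair_loss` over all pairs with a non-zero co-occurrence count gives the quantity that gradient descent minimizes.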

After filtering out keywords which appeared fewer than 5 times in the corpus extracted from Engineering Village, we were left with a vocabulary of 120,700 keywords. Our described adaptation of GloVe (including the enhancements to the embedding process) is publicly available (the link will be added pending acceptance of the paper). The frequently recommended values of the GloVe hyperparameters were used in our experiments. An embedding size with 20 iterations provided good results in our experiments. The process is presented in Algorithm 1.

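procedure KeywordsEmbedding(papersMetadataSet, gloveParameters)
      cooccurrence ← empty weighted co-occurrence matrix
      for each (paper) in papersMetadataSet do
            keywords ← normalized keywords of paper
            add the year-dependent weight of paper to cooccurrence for every keyword pair in keywords
      end for
      (autoMLVocab, autoMLVectors) ← GloVe(cooccurrence, gloveParameters)
      return (autoMLVocab, autoMLVectors)
end procedure
Algorithm 1 Semantic embedding generation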

3.4 The machine learning-related keyword extraction phase

The goal of this phase is to compile two lists: “problems”, a list of the types of challenges for which machine learning is used, and “algorithms”, a comprehensive list of machine learning algorithms. Given a new dataset, these two lists will be used to characterize the dataset’s traits and recommend relevant algorithms. We hypothesize that our proposed approach can easily take into account all recent and important trends in the field, due to Wikipedia’s dynamic and constant update by thousands of contributors.

Our need for generating these two lists – algorithms and problems – stems from the fact that we are unable to know whether a term in the corpus extracted in Section 3.2 represents an algorithm, a problem, or neither. By generating these lists from Wikipedia and identifying matching terms, we are able to label relevant terms and filter irrelevant ones. Once the terms are labeled, we can model the algorithm recommendation challenge as a recommendation problem (Section 3.5).

The list generation process consists of five phases: seed generation, feature extraction, classifier training, candidate generation, and candidate ranking and selection. Each phase is described in detail below.

Seed generation. For each of the two types of lists we wish to identify, we first compile the set of page titles that are certain to belong to it:

  • For the machine learning algorithms, we extracted all of the page titles belonging to the following Wikipedia categories: “classification algorithms”, “cluster analysis algorithms” and “regression models”. In addition, we extracted all of the algorithms that appeared in the “machine learning” bar in the infobox of the machine learning Wikipedia page.

  • For the problems, we extracted the titles of the pages that appeared under “Applications” in the infobox of the Machine Learning Wikipedia page.

Feature extraction. Next we generate a feature vector to represent every term in the two lists. The vector consists of two types of features:

  • Network-based features. Since each of our chosen seed terms is represented using a Wikipedia page, we can represent all of the terms in a graph whose vertices are determined by the inter-page links (we construct a single graph containing both lists). For each seed term on either list, we calculated the following values compared to the seed terms of both lists: in-degree, out-degree, page rank, betweenness, closeness, hub, authority, and the Dijkstra distance. Each set of values is represented using three statistics: min, max and average.

  • Text-based features. We represent the text of the Wikipedia page corresponding to the seed term using the bag-of-words joachims1996probabilistic approach.
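A minimal Python sketch of two of the simpler features is given below: in-degree and out-degree computed from inter-page links, and a bag-of-words count of page text. The link mapping, page names and tokenization rule are illustrative assumptions; the remaining graph measures (page rank, betweenness, and so on) would typically come from a graph library and are not shown.

```python
import re
from collections import Counter


def degree_features(links):
    """Return {page: (in_degree, out_degree)} from a {page: [linked pages]} mapping."""
    out_deg = {page: len(set(targets)) for page, targets in links.items()}
    in_deg = Counter(t for targets in links.values() for t in set(targets))
    pages = set(links) | set(in_deg)
    return {p: (in_deg.get(p, 0), out_deg.get(p, 0)) for p in pages}


def bag_of_words(text):
    """Count lower-cased alphanumeric tokens (a simplified tokenization)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


# Toy link graph between three hypothetical pages.
links = {
    "svm": ["kernel", "classification"],
    "kernel": ["classification"],
    "classification": [],
}
features = degree_features(links)
words = bag_of_words("Support vector machine: a vector method")
print(features["classification"], words["vector"])
```

In the setting described above, each per-page value would then be summarized relative to the seed terms using the min, max and average statistics.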

Classifier training. After performing the previous steps, we now have two sets of vectors, with each vector representing a single seed term. Next we use these vectors to train a machine learning-based classifier to label previously unseen terms as either “algorithm”, “problem” or “other”. To obtain samples for the last label, we randomly sampled Wikipedia pages and labeled them as “other”. The number of pages belonging to this group was five times the number of pages in the two other groups combined.

Generating the candidate terms. In order to expand our lists of algorithms and tasks we first need to identify possible candidates. These candidates will then be classified by the model trained in the previous step. We use three approaches to obtain the candidates:

  1. We select all Wikipedia articles whose title includes at least one of the following terms: recognition (e.g., speech recognition), analysis (e.g., malware analysis), detection (e.g., plagiarism detection), system (e.g., recommender system, intrusion detection system).

  2. For each seed term on either list, we traverse the Wikipedia graph (constructed based on inter-page links) and retrieve all of the pages that are at most three hops away from a seed concept.

  3. We select all Wikipedia pages whose text contains at least one of the following terms: machine learning, data mining, regression, supervised learning, unsupervised learning, decision trees, boosting, random forest, neural networks, ANN, deep learning, recurrent neural network, RNN, convolutional neural network, CNN, relevance vector machine, RVM, support vector machine, SVM, k-means, DBSCAN, mean-shift, Bayesian networks, or feature engineering.

Combining these three approaches enabled us to obtain 1.5 million candidate terms.
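The first two candidate generation approaches can be sketched in Python as follows. This is a simplified illustration under stated assumptions: page links are held in an in-memory dictionary, title matching is a plain case-insensitive substring test, and the function names are hypothetical.

```python
from collections import deque

TITLE_TERMS = ("recognition", "analysis", "detection", "system")


def title_candidates(titles):
    """Titles containing at least one of the trigger terms (substring match)."""
    return {t for t in titles if any(term in t.lower() for term in TITLE_TERMS)}


def within_hops(links, seeds, max_hops=3):
    """Pages reachable from any seed within max_hops links (seeds excluded)."""
    depth = {s: 0 for s in seeds}
    queue = deque(seeds)
    while queue:
        page = queue.popleft()
        if depth[page] == max_hops:
            continue  # do not expand beyond the hop limit
        for nxt in links.get(page, ()):
            if nxt not in depth:
                depth[nxt] = depth[page] + 1
                queue.append(nxt)
    return set(depth) - set(seeds)


# Toy chain a -> b -> c -> d -> e: "e" is four hops from the seed "a".
links = {"a": ["b"], "b": ["c"], "c": ["d"], "d": ["e"], "e": []}
candidates = within_hops(links, ["a"]) | title_candidates(["Speech recognition", "Paris"])
print(sorted(candidates))
```

The breadth-first traversal visits each page once, so the cost grows with the size of the three-hop neighborhood rather than with the whole graph.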

Candidate ranking and selection. Next, we apply the trained classifier on the set of candidates. Using the XGBoost algorithm chen2016xgboost, we rank all of the candidate terms based on their likelihood of belonging to the “algorithm” and “problem” labels. For each type, we select the 2,000 top-ranking terms and add them to the relevant set. In order to ensure the quality of the newly added terms, we manually review and remove irrelevant terms. The process described above was conducted on an August 2014 version of Wikipedia and resulted in 276 terms describing machine learning algorithms and 380 terms describing relevant challenges.

Finally, following the creation of the two lists we attempt to match the terms on the two lists to terms in the embedding. To compensate for small variations in the text (e.g. “random forest” and “random forests”), we use the normalized Levenshtein distance yujian2007normalized as the matching criterion. The threshold value for determining a match was set to 0.35.
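The matching step can be sketched in Python as below. Note the assumptions: the edit distance is normalized here by the length of the longer string, which is a common simplification and not necessarily the exact normalization defined in the cited work, and treating the 0.35 threshold as inclusive is also assumed.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance with unit costs."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def normalized_distance(a: str, b: str) -> float:
    """Edit distance divided by the longer length (assumed normalization)."""
    if not a and not b:
        return 0.0
    return levenshtein(a, b) / max(len(a), len(b))


def match_term(term, vocabulary, threshold=0.35):
    """Closest vocabulary entry whose distance is within the threshold, else None."""
    best = min(vocabulary, key=lambda v: normalized_distance(term, v), default=None)
    if best is not None and normalized_distance(term, best) <= threshold:
        return best
    return None


vocabulary = ["random_forest", "svm"]
print(match_term("random_forests", vocabulary), match_term("dbscan", vocabulary))
```

With this normalization, a plural such as "random_forests" is one edit away from a 14-character string and falls well below the threshold, while unrelated terms do not match.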

3.5 The recommendation phase

The goal of this phase is to produce a ranked list of machine learning algorithms with the highest likelihood of being effective for a given problem. We model this challenge as a recommendation problem where our goal is to recommend useful items (algorithms) to users (problems).

The recommendation process begins when we are presented with the title and a short description of the new problem. It is important to note that we do not require the actual data to make an effective recommendation (based on katz2016explorekit, we do hypothesize that such information could be useful in future work). We then apply the following steps:

  1. The dataset title and problem description are normalized and matched with the vocabulary keywords. The matching of the description is carried out by extracting unigram and bi-gram terms from the text. A list of all of the matched keywords is generated, and each is represented as a vector following the removal of duplicates.

  2. Next, we calculate the similarity of each vector generated in the previous step to the algorithms’ keyword vocabulary generated in Section 3.3. For each algorithm vocabulary term, we use cosine similarity steinbach2000comparison to calculate its similarity to the vector of each keyword matched in the dataset title and problem description.

  3. Each machine learning algorithm in the dictionary is ranked based on the sum of its terms’ cosine similarity with the terms extracted from the analyzed dataset’s title and description. The algorithms are then ranked in descending order based on their score, using the following equation:

    $$\mathrm{score}(m) = \sum_{d \in D} \frac{v_m \cdot v_d}{\lVert v_m \rVert \, \lVert v_d \rVert}$$

    where $v_m$ represents the model keyword embedded vector, $v_d$ represents the embedded vector for each dataset matched term, and $D$ is the set of all matched dataset keywords.
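The scoring in the steps above can be sketched in Python as follows. The two-dimensional vectors and the keyword names are toy values chosen for illustration only; real embeddings would come from the trained GloVe model.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_algorithms(dataset_keywords, algo_keywords, vectors):
    """Score each algorithm by summed cosine similarity to the dataset keywords.

    Returns (algorithm, score) pairs sorted by score in descending order.
    """
    scores = {
        algo: sum(cosine(vectors[algo], vectors[kw]) for kw in dataset_keywords)
        for algo in algo_keywords
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy embedding: "cnn" points in nearly the same direction as the dataset terms.
vectors = {
    "image": [1.0, 0.0],
    "recognition": [0.8, 0.6],
    "cnn": [0.9, 0.1],
    "k_means": [0.0, 1.0],
}
ranking = rank_algorithms(["image", "recognition"], ["cnn", "k_means"], vectors)
print(ranking[0][0])
```

Because the score is a plain sum, datasets with more matched keywords produce larger scores, but the ranking within a single dataset is unaffected by that scale.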

The process is presented in Algorithm 2. The product of this phase is a ranked list of algorithms, sorted by their likelihood of being relevant to the problem at hand.

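procedure AlgoRecommend(dataset, autoMLVocab, autoMLVectors, algoKeywordSet, problemKeywordSet)
      datasetKeywordSet ← GetDatasetKeywords(dataset.title, dataset.description, autoMLVocab, problemKeywordSet)
      distance ← empty map from algorithm keywords to scores
      for each (datasetKeyword) in datasetKeywordSet do
            for each (algoKeyword) in algoKeywordSet do
                  distance ← UpdateAlgoDist(algoKeyword, datasetKeyword, autoMLVectors, distance)
            end for
      end for
      recommendedAlgoList ← algoKeywordSet sorted by distance score in descending order
      return recommendedAlgoList
end procedure
procedure UpdateAlgoDist(algoKeyword, datasetKeyword, autoMLVectors, distance)
      algoVector ← autoMLVectors[algoKeyword]
      datasetVector ← autoMLVectors[datasetKeyword]
      distance[algoKeyword] ← distance[algoKeyword] + cosineSimilarity(algoVector, datasetVector)
      return (distance)
end procedure
procedure GetDatasetKeywords(datasetTitle, datasetDescription, autoMLVocab, problemKeywords)
      text ← normalize(datasetTitle, datasetDescription)
      terms ← unigrams and bi-grams of text
      datasetKeywordSet ← terms ∩ problemKeywords
      if datasetKeywordSet ≠ ∅ then
            return datasetKeywordSet
      else
            datasetKeywordSet ← terms ∩ autoMLVocab
      end if
      return datasetKeywordSet
end procedure
Algorithm 2 Algorithm recommendation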
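The keyword matching performed on a new dataset's title and description can be illustrated with the following Python sketch. It is a simplified sketch under stated assumptions: tokens are lower-cased alphanumeric runs, bi-grams are joined with underscores to mirror the keyword normalization, and the vocabulary is a toy set.

```python
import re


def ngrams(text):
    """Normalized unigrams followed by bi-grams joined with underscores."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    bigrams = [f"{a}_{b}" for a, b in zip(tokens, tokens[1:])]
    return tokens + bigrams


def dataset_keywords(title, description, vocabulary):
    """Vocabulary terms found in the title or description, duplicates removed.

    The order of first appearance is preserved.
    """
    seen, matched = set(), []
    for term in ngrams(title) + ngrams(description):
        if term in vocabulary and term not in seen:
            seen.add(term)
            matched.append(term)
    return matched


vocabulary = {"income", "census", "classification", "decision_tree"}
keywords = dataset_keywords(
    "Adult",
    "Predict whether income exceeds 50K from census data. Census income classification.",
    vocabulary,
)
print(keywords)
```

The matched keywords are then looked up in the embedding and passed to the similarity-based ranking.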

4 Evaluation

We evaluated our approach on the well-known dataset published by fer14, which contains the evaluation results of 179 classification algorithms on 121 datasets. The algorithms can be grouped into 17 different “families”, based on popular criteria. The datasets cover the UCI database in its entirety (as of March 2013, excluding some large-scale problems) in addition to some real-world problems (please see fer14 for details). For each dataset, all applicable algorithms were applied and evaluated using the accuracy metric. The large scale of the experiments and the diversity of both datasets and algorithms ensure that the results were free from collection bias.

The structure of this section is as follows: we first review the models and datasets used in the experiments presented in fer14 and describe our preprocessing of the data (Section 4.1). We then present the results of our evaluation (Section 4.2) and analyze the results (Section 4.3).

4.1 Experimental setting

In this section we describe the algorithms and datasets included in the evaluation conducted by fer14. In addition, we describe the preprocessing steps we applied in order to make sure that the dataset is compatible with the data gathered in Sections 3.3 and 3.4.

4.1.1 Models

In their evaluation, Fernandez et al. fer14 used 179 classifiers implemented in C/C++, MATLAB, R, and Weka. The classifiers are highly diverse, originating from 17 “families.” A complete list, including the breakdown by family, is presented in Table 1. The main challenge in mapping these algorithms to our embedding was the fact that several algorithms had multiple implementations while our embedding only had a single entry per algorithm (since it is often impossible to infer algorithmic configurations from academic papers). For example, the Random Forest algorithm had seven implementations: cforest_caret, rf_caret, rforest_R, parRF_caret, RRFglobal_caret, RRF_caret, and RandomForest_weka.

Model | Classifier implementation
Adaboost | adaboost_R, AdaBoostM1_weka, AdaBoostM1_J48_weka, C5.0_caret
Adaptive | gcvEarth_caret, mars_R
Bagging | Bagging_IBk_weka, Bagging_RandomForest_weka, ctreeBag_R, Bagging_weka, Bagging_DecisionTable_weka, treebag_caret, Bagging_PART_weka, Bagging_RandomTree_weka, Bagging_Logistic_weka, bagging_R, svmBag_R, Bagging_LibSVM_weka, Bagging_J48_weka, ldaBag_R, plsBag_R, nbBag_R, Bagging_NaiveBayes_weka, Bagging_OneR_weka, Bagging_HyperPipes_weka, nnetBag_R, Bagging_DecisionStump_weka, Bagging_LWL_weka, Bagging_MultilayerPerceptron_weka
Bayes net | BayesNet_weka
Cascade correlation neural network | cascor_C
Decision table | DTNB_weka, DecisionTable_weka
Decision tree | ctree_caret, RandomSubSpace_weka, rpart_caret, REPTree_weka, rpart_R, rpart2_caret, obliqueTree_R, J48_weka, J48_caret, PART_caret, C5.0Tree_caret, PART_weka, NBTree_weka, ctree2_caret, RandomTree_weka, DecisionStump_weka
Discriminant analysis | sda_caret
Elm neural network | elm_kernel_matlab, elm_matlab
Ensemble | Decorate_weka, RandomCommittee_weka, OrdinalClassClassifier_weka, Dagging_weka, MultiScheme_weka, Grading_weka, Vote_weka
Flexible discriminant analysis | fda_caret, fda_R
Gaussian kernel | gaussprRadial_R
Generalized linear models | glm_R, mlm_R, glmStepAIC_caret, glmnet_R
Learning vector quantization | lvq_caret, lvq_R
Linear discriminant analysis | lda_R, lda2_caret, PenalizedLDA_R, slda_caret, rrlda_R, stepLDA_caret, sddaLDA_R, sparseLDA_R
Logistic regression | Logistic_weka, SimpleLogistic_weka
Logitboost | RacedIncrementalLogitBoost_weka, LogitBoost_weka, logitboost_R
Learning vector quantization neural networks | lvq_caret, lvq_R
Majority voting | Vote_weka
Mars | mars_R
Mixture discriminant analysis | mda_R, mda_caret
Multiboost | MultiBoostAB_REPTree_weka, MultiBoostAB_DecisionTable_weka, MultiBoostAB_MultilayerPerceptron_weka, MultiBoostAB_LibSVM_weka, MultiBoostAB_RandomTree_weka, MultiBoostAB_Logistic_weka, MultiBoostAB_PART_weka, MultiBoostAB_RandomForest_weka, MultiBoostAB_J48_weka, MultiBoostAB_NaiveBayes_weka, MultiBoostAB_IBk_weka, MultiBoostAB_weka, MultiBoostAB_OneR_weka
Multinomial logistic regression | multinom_caret
Naive bayes | NaiveBayesSimple_weka, NaiveBayesUpdateable_weka, naiveBayes_R, NaiveBayes_weka
Nearest neighbors | knn_R, knn_caret, IBk_weka, IB1_weka, NNge_weka
Neural networks | MultilayerPerceptron_weka, pcaNNet_caret, nnet_caret, avNNet_caret, mlp_C, mlp_caret, mlp_matlab, mlpWeightDecay_caret
One R | OneR_weka, OneR_caret
partial_least_squares_regression | pls_caret, gpls_R, widekernelpls_R, simpls_R, kernelpls_R, spls_R
pda | pda_caret
pipe | HyperPipes_weka
pnn | pnn_matlab
quadratic_discriminant_analysis | qda_caret, stepQDA_caret, sddaQDA_R, QdaCov_caret
random_forest | cforest_caret, rf_caret, rforest_R, parRF_caret, RRFglobal_caret, RRF_caret, RandomForest_weka
random_subspace | RandomSubSpace_weka
random_tree | RandomTree_weka
rbf_neural_network | rbf_matlab, rbfDDA_caret, rbf_caret, RBFNetwork_weka
rda | rda_R
rep_tree | REPTree_weka
rotation_forest | RotationForest_weka
rule | Ridor_weka
rules | C5.0Rules_caret, OneR_weka, OneR_caret
sda | sda_caret
smo | SMO_weka
stacking | Stacking_weka, StackingC_weka
support_vector_machine | svmBag_R, svmLinear_caret, svmlight_C, svm_C, LibSVM_weka, lssvmRadial_caret, svmRadial_caret, svmRadialCost_caret, svmPoly_caret, LibLINEAR_weka
Table 1: Mapping of classifiers described by fer14 to model vocabulary keywords

We addressed this problem by manually aggregating the different implementations of the same algorithm. After this aggregation was performed, the original 179 machine learning algorithms presented in fer14 were mapped to 45 entries in the mapping whose creation is described in Section 3.4. This information is presented in full in Table 1.

4.1.2 Datasets

In their evaluation, Fernandez et al. used 121 datasets. These datasets consisted of most of the UCI repository at that time (March 2013) as well as four additional datasets. For a detailed description of these datasets we refer the reader to their publication fer14.

In order to test Sommelier’s ability to recommend top-performing algorithms for the datasets described above, we needed the datasets’ titles and a short description of their prediction problems. For most of the datasets included in the experiments performed by fer14, the authors included an additional file containing a description of the prediction problem as well as a meaningful title. In several cases, though, the problem was not described (please see Figure 3 which contains the adult dataset description and abstract as an example).

Figure 3: Adult dataset, first few lines of the description file, and the abstract from the UCI website
| Measure | RF Relative Accuracy (%) | Sommelier Relative Accuracy (%) |
|---|---|---|
| Average | 96.4 | 97.7 |
| Stdev | 7 | 3 |
Table 2: Relative accuracy of Sommelier vs. Random Forest (RF) across 121 datasets reported by Fernandez et al. fer14
| Recommendation Type | MRR | Rank Position of Maximum Accuracy Algorithm |
|---|---|---|
| Algorithm | 0.28 | 3.5 |
| Algorithm Family | 0.36 | 2.7 |
Table 3: Sommelier recommendation average MRR, and ranking index of algorithm with maximum accuracy across 121 datasets reported by fer14

To address this issue, we crawled the UCI website and extracted the abstracts for all of the participating datasets. Then, we combined each dataset’s UCI abstract with the description provided by fer14 when both were available. We were unable to find a description in the UCI repository for two synthetic datasets (ringnorm and twonorm), and therefore we downloaded this information from the University of Toronto’s website. For the four datasets not included in the UCI repository, we manually extracted the descriptions from the relevant papers. Once the process described above was completed, we were able to assign a title and a description to all of the datasets included in the study. These descriptions were used to rank relevant algorithms, as described in Section 3.5.

At the end of this process, we had produced a list of matched keywords for each of the datasets used by Fernandez et al.

4.1.3 Algorithm performance analysis and comparison

One of the conclusions reached by fer14 was that the Random Forest algorithm, with its different versions, performed best overall. The highest-performing version of the RF algorithm (implemented in R and accessed via caret) achieved an average relative maximal accuracy of 94.1% for all datasets. We define relative maximal accuracy as a percentage of the maximal accuracy obtained by any algorithm for the analyzed dataset.
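The definition of relative maximal accuracy can be illustrated with a minimal sketch. The function name and the accuracy values below are hypothetical and chosen only for illustration; they are not results from fer14.

```python
def relative_maximal_accuracy(accuracies, algorithm):
    """Return an algorithm's accuracy as a percentage of the best
    accuracy obtained by any algorithm on the same dataset."""
    best = max(accuracies.values())
    return 100.0 * accuracies[algorithm] / best


# Hypothetical accuracies of three classifiers on a single dataset.
toy_accuracies = {"rf_caret": 0.90, "svm_C": 0.80, "OneR_weka": 0.45}

rf_relative = relative_maximal_accuracy(toy_accuracies, "rf_caret")
svm_relative = relative_maximal_accuracy(toy_accuracies, "svm_C")
```

In this toy case rf_caret is the best classifier and therefore scores 100%, while svm_C scores roughly 88.9% (0.80 / 0.90).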

To evaluate the effectiveness of our approach, we compared the ranking produced by Sommelier to two baselines. The first is the maximal performance for each dataset, achieved by any algorithm. The second baseline is the performance of the Random Forest algorithm. As it was shown to have the best performance overall (as we explain above), this algorithm is the preferred choice if no information on the analyzed dataset is available.

Because our approach can recommend algorithm types (e.g. Random Forest, logistic regression) but not implementations (e.g. Weka, R) or parameters, for each dataset we chose the highest-performing member of the relevant algorithm type (please see Table 1 for the complete list). We apply this approach for Sommelier as well as the baselines. For this reason, the average relative maximal accuracy of the Random Forest algorithm is 96.4% instead of 94.1%. It is important to note that this setting actually raises the bar for Sommelier compared to the RF baseline.
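The selection of the highest-performing member of an algorithm type could be sketched as follows. The dictionary is a small hypothetical excerpt of the Table 1 mapping, and the function name and accuracy values are assumptions made for illustration.

```python
# Hypothetical excerpt of the Table 1 mapping: algorithm type -> implementations.
FAMILY_MEMBERS = {
    "random_forest": ["rf_caret", "RandomForest_weka"],
    "support_vector_machine": ["svm_C", "LibSVM_weka"],
}


def best_member_accuracy(family, accuracies, members=FAMILY_MEMBERS):
    """Return the highest accuracy among the implementations of an
    algorithm type that were actually run on the dataset."""
    scores = [accuracies[name] for name in members[family] if name in accuracies]
    if not scores:
        raise ValueError(f"no results available for family {family!r}")
    return max(scores)


# Hypothetical accuracies on one dataset; LibSVM_weka has no result here.
toy_accuracies = {"rf_caret": 0.90, "RandomForest_weka": 0.93, "svm_C": 0.88}

rf_score = best_member_accuracy("random_forest", toy_accuracies)
svm_score = best_member_accuracy("support_vector_machine", toy_accuracies)
```

Because the best member of a type is never worse than any single implementation of it, scoring at the type level can only raise a baseline's figure, which is consistent with the Random Forest average moving from 94.1% to 96.4%.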

(a) Relative accuracy distribution histogram
(b) Relative accuracy distribution
Figure 4: Relative accuracy of 121 datasets for the Sommelier approach, Random Forest algorithm and the average among all algorithms.
Figure 5: Relative accuracy averaged over 121 datasets achieved by Sommelier vs the number of recommended algorithms.

4.2 Evaluation results

The results of our evaluation are presented in Table 2, in which we compare the relative maximal accuracy of both Sommelier and the Random Forest algorithm. The results show that not only does Sommelier outperform the Random Forest algorithm overall but that the performance is consistently closer to the optimal performance (as shown by the lower standard deviation). We evaluated the significance of the results using a paired two-tailed t-test and found the results to be significant with a confidence level of 95%.
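As an illustration of the significance test mentioned above, the following sketch computes the paired t statistic from per-dataset relative accuracies using only the Python standard library. The paper does not state which software was used, and the function name and values below are hypothetical.

```python
import math
import statistics


def paired_t_statistic(a, b):
    """Return (t, degrees_of_freedom) for two paired samples."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("samples must have equal length of at least 2")
    diffs = [x - y for x, y in zip(a, b)]
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    if sd_diff == 0:
        raise ValueError("all paired differences are identical")
    t = mean_diff / (sd_diff / math.sqrt(len(diffs)))
    return t, len(diffs) - 1


# Hypothetical relative accuracies (%) of two methods on four datasets.
method_a = [98.0, 97.0, 99.0, 96.0]
method_b = [95.0, 96.0, 97.0, 90.0]

t_stat, dof = paired_t_statistic(method_a, method_b)
```

For a two-tailed test, the absolute value of t is compared against the critical value of the t distribution with the returned degrees of freedom. A library routine such as scipy.stats.ttest_rel returns the p-value directly, if SciPy is available.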

In Figure 4(a) we present a breakdown of the analyzed datasets based on the relative maximal accuracy for Sommelier, the Random Forest algorithm, and an average of all of the algorithms applied to the dataset. The results show that while both Sommelier and the Random Forest algorithm outperform the average overall performance, our approach performs best overall. While both approaches manage to reach a relative maximal accuracy of >95% in the majority of cases, Sommelier achieved this in 108 of 121 datasets compared to Random Forest’s 99. Moreover, the lowest relative maximal accuracy achieved by Sommelier is 77.5% compared with Random Forest’s 35%. These results indicate that Sommelier is not only more consistent in its performance but that it is mostly able to avoid assigning unsuitable algorithms to a given problem. In Figure 4(b) we compare the performance of Sommelier and Random Forest for each dataset. These results also support our conclusion that while both algorithms fare well overall, the performance by Sommelier is both better and more stable.

4.3 Analysis

Relative maximal performance as a function of the number of evaluated algorithms. As shown in Table 2, using the top-ranked algorithm by Sommelier would result in an average relative maximal accuracy of 97.7%. We now explore the effect of evaluating several top-ranked algorithms on this performance measure. The results of this evaluation are presented in Figure 5 and show a consistent increase in average performance. These results lead us to conclude that the ranked lists produced by our approach are effective overall, as they consist of multiple algorithms that achieve high performance for the various datasets.
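The quantity plotted against the number of recommended algorithms can be sketched as the best relative accuracy among the first k entries of each ranked list, averaged over datasets. The function names, rankings, and scores below are hypothetical.

```python
def best_at_k(ranked_algorithms, relative_accuracy, k):
    """Best relative accuracy (%) among the first k recommended algorithms."""
    return max(relative_accuracy[name] for name in ranked_algorithms[:k])


def average_best_at_k(datasets, k):
    """Average best-at-k over datasets; each dataset is a (ranking, scores) pair."""
    return sum(best_at_k(ranking, scores, k) for ranking, scores in datasets) / len(datasets)


# Hypothetical ranked lists and relative accuracies (%) for two datasets.
toy_datasets = [
    (["random_forest", "support_vector_machine", "naive_bayes"],
     {"random_forest": 96.0, "support_vector_machine": 100.0, "naive_bayes": 80.0}),
    (["naive_bayes", "random_forest", "support_vector_machine"],
     {"naive_bayes": 100.0, "random_forest": 98.0, "support_vector_machine": 90.0}),
]

curve = [average_best_at_k(toy_datasets, k) for k in (1, 2, 3)]
```

By construction this curve can never decrease as k grows, so the informative part of such a plot is how quickly it approaches 100%.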

Number of experiments required to obtain maximal performance. Next we analyzed the number of algorithms that would have to be evaluated from the ranked list produced by Sommelier in order to obtain the maximal possible performance. To this end we calculated two measures: the relative accuracy as a function of the number of recommended algorithms, and the mean reciprocal rank (MRR). MRR is a statistical measure used for evaluating the rank of the first correct answer:

$$\mathrm{MRR} = \frac{1}{|D|} \sum_{i=1}^{|D|} \frac{1}{\mathrm{rank}_i}$$

where $|D|$ represents the number of datasets evaluated and $\mathrm{rank}_i$ is the ranking of the recommended algorithm that matches the highest accuracy achieved for dataset $i$.
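The MRR described above can be computed with a few lines of code. This is a minimal sketch; the function name and the rank values are hypothetical.

```python
def mean_reciprocal_rank(ranks):
    """Mean reciprocal rank, given for each dataset the 1-based position
    of the maximum-accuracy algorithm in the recommended list."""
    if not ranks:
        raise ValueError("at least one rank is required")
    if any(r < 1 for r in ranks):
        raise ValueError("ranks are 1-based and must be positive")
    return sum(1.0 / r for r in ranks) / len(ranks)


# Hypothetical rank positions of the best algorithm for three datasets.
toy_ranks = [1, 2, 4]
mrr = mean_reciprocal_rank(toy_ranks)  # (1 + 1/2 + 1/4) / 3
```

An MRR of 1 would mean that the best algorithm is always ranked first, and lower values indicate that it tends to appear further down the list.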

We calculated these two measures for two scenarios: (a) where each algorithm in the ranked list is evaluated individually (45 possible algorithms), and (b) where each algorithm “family” puts forward its most effective member (17 possible algorithms). We hypothesize that the latter scenario is of value because some researchers and practitioners may be interested in conducting hyperparameter optimization once an algorithm is selected (using tools such as AutoWEKA thornton2013auto). In such cases, the algorithm family is more important than the actual implementation.

The results of our analysis are presented in Table 3. For individual algorithms, one would have to evaluate an average of four algorithms in order to obtain maximal performance. For algorithm families, the number of required evaluations is three. These results also emphasize the advantages of our approach compared with the Random Forest baseline, since the Random Forest algorithm only achieves maximal performance in 18 out of 121 datasets (15% of cases).

5 Conclusions and Future Work

In this study we presented Sommelier, an expert system for recommending which machine learning algorithms should be applied to a previously unseen dataset. When provided with a new dataset, our approach analyzes its title and problem description and produces a ranked list of algorithms based on their likelihood of performing well on that dataset. Our approach is based on a word embedding representation of the domain knowledge extracted from a large corpus of academic publications and refined through the use of information extracted from Wikipedia. Our evaluation demonstrates that these embeddings can be used to effectively recommend top-performing algorithms for diverse datasets with a large variety in size and feature composition.

In future work, we plan to incorporate metadata information on the analyzed datasets (when available) into the embedding process. We hypothesize that the metadata can provide additional context and further improve the recommendation accuracy. Examples of such metadata include the number of target categories, the number of input features, and the statistical distribution of the features. Furthermore, we plan to extend the process described in this work to include types of entities beyond “algorithms” and “problems”. Such entities may include the performance evaluation metric and the type of machine learning framework used. This expansion of our process can be used to create an automatic machine learning ontology such as


Finally, we plan to explore combining Sommelier with automatic hyperparameter optimization tools such as AutoWEKA thornton2013auto, or using it as an initial algorithm recommendation within iterative model selection and hyperparameter optimization tools such as autoweka2017.

6 References