Evaluating recommender systems for AI-driven data science

05/22/2019 ∙ by William La Cava, et al. ∙ University of Pennsylvania

We present a free and open-source platform to allow researchers to easily apply supervised machine learning to their data. A key component of this system is a recommendation engine that is bootstrapped with machine learning results generated on a repository of open-source datasets. The recommendation system chooses which analyses to run for the user, and allows the user to view analyses, download reproducible code or fitted models, and visualize results via a web browser. The recommender system learns online as results are generated. In this paper we benchmark several recommendation strategies, including collaborative filtering and metalearning approaches, for their ability to learn to select and run optimal algorithm configurations for various datasets as results are generated. We find that a matrix factorization-based recommendation system learns to choose increasingly accurate models from few initial results.




1 Introduction

Experimental data is being collected faster than it can be understood across scientific disciplines [1]. The hope of many in the data science community is that widely accessible, open-source artificial intelligence (AI) tools will allow scientific insights from these data to keep abreast of their collection [2]. AI is expected to make significant improvements to scientific discovery, human health and other fields in the coming years. In this light, the key promise of AI is the automation of learning from large sets of collected data. However, at the same time that data collection is outpacing researchers, methodological improvements from the machine learning (ML) and AI communities are outpacing their dissemination to other fields. As a result, AI and ML retain steep learning curves for non-experts, especially for researchers pressed to gain expertise in their own disciplines.

It is clear from this situation that specialized researchers would benefit greatly from increasingly automated, accessible and open-source tools for AI. With this in mind, we present an open-source, free AI platform, PennAI, that is designed for the non-expert to quickly conduct an ML analysis on their data. PennAI uses a web browser-based user interface (UI) to display a user’s datasets, experiments, and results. Central to PennAI’s ease of use is a bootstrapped recommendation system that automatically configures and runs supervised learning algorithms catered to the user’s datasets and previous results. In addition to streamlining the ML process for new users, PennAI provides a research platform for AI researchers who can quickly plug in their own automatic machine learning (AutoML) methods and use PennAI as a test-bed for methods development. The project is documented and maintained on GitHub.


In this paper, we focus on the AI strategy used to automatically run analyses for the user. The underlying methods are based on recommender systems, which are well known inference methods underlying many commercial content platforms, including Netflix [3], Amazon [4], Youtube [5], and others. Our goal is to assess the ability of state-of-the-art recommender systems to learn the best ML algorithm and parameter settings for a given dataset over a set number of iterations in the context of previous results. We compare collaborative filtering (CF) approaches as well as metalearning approaches on a set of 135 open-source classification problems. We find that collaborative filtering algorithms based on matrix factorization perform well in choosing the best analysis, and are able to learn an optimal algorithm setting from few results. We demonstrate that over time, the learning system is able to generate increasingly accurate recommendations.

2 Background

In this section, we briefly review AutoML methodologies, which are a key component to making data science approachable to new users. We then describe recommender systems, the various methods that have worked in other application areas, and our motivation for applying them to this relatively new area of AutoML.

2.1 Automated Machine Learning

AutoML is a burgeoning area of research in the ML community that seeks to automatically configure and run learning algorithms without human intervention. A number of different learning paradigms have been applied to this task, and tools are available to the research community as well as commercially. A competition around this goal has been running since 2015 (http://automl.chalearn.org/), focused on various budget-limited tasks for supervised learning [6].

A popular approach arising from the early competitions is sequential model-based optimization via Bayesian learning [7], represented by the auto-Weka and auto-sklearn packages [8, 9]. These tools parameterize the combined problem of algorithm selection and hyperparameter tuning and use Bayesian optimization to select and optimize algorithm configurations. It is worth noting that this parameterization of the problem can be handled with other learning approaches, for example in Hyperopt [10].


Auto-sklearn incorporates metalearning into the optimization process [11, 12] to narrow the search space of its learning algorithm. The idea behind metalearning in this context is that the “metafeatures” of the datasets, such as predictor distributions, variable types, cardinality, etc., provide valuable information about algorithm performance that can be leveraged to choose an appropriate algorithm configuration. PoSH auto-sklearn [13], an update to auto-sklearn, bootstrapped auto-sklearn with an extensive analysis to minimize the configuration space. ML pipelines were optimized on a large number of datasets beforehand, and the pipelines were narrowed to those that performed best over all datasets. This tool effectively replaced metalearning with bootstrapping; our experiments provide some evidence supporting a similar strategy.

Another popular method for AutoML is the tree-based pipeline optimization tool, TPOT [14]. TPOT uses an evolutionary computation approach known as genetic programming to optimize syntax tree representations of ML pipelines. Complexity is controlled via multi-objective search. Benchmark comparisons of TPOT and auto-sklearn show trade-offs in performance for each.


There are many commercial tools providing variants of AutoML as well. Many of these platforms do not focus on choosing from several ML algorithms, but instead provide automated ways of tuning specific ones. Google has also created AutoML tools using neural architecture search [16], a method for configuring the architecture of neural networks. This reflects their historical focus on learning from sequential and structured data like images.


uses genetic algorithms to tune the feature engineering pipeline of a user-chosen model. Intel has focused on proprietary gradient boosted ensembles of decision trees 


A common paradigm among AutoML tools is that they wrap several ML analyses, thereby obscuring the analysis from the user. Although this does indeed automate the ML process, it also removes the user from the experience. In contrast to these strategies, PennAI uses a recommender system as its basis of algorithm recommendation with the goal of providing a more intuitive and actionable user experience.

There has been limited research utilizing recommender systems as AutoML approaches. One recent exception is Yang et al. [17], who found that non-negative matrix factorization could be competitive with auto-sklearn for classification. A recent workshop (http://amir-workshop.org/) also solicited discussion of algorithm selection and recommender systems, although most research of this nature is interested in tuning the recommendation algorithms themselves [18].

Ultimately, the best algorithm for a dataset is highly subjective: a user must balance their wants and needs, including the accuracy of the model, its interpretability, the training budget and so forth. PennAI’s coupling of the recommendation system approach with the UI allows for more user interaction, essentially by maintaining their ability to “look under the hood”. The user is able to fully interface with any and all experiments initialized by the AI in order to, for example, interrupt them, generate new recommendations, download reproducible code or extract fitted models and their results. Although the experiments in this paper use accuracy as the focus for generating recommendations, the general strategy opens the door for future versions that explicitly incorporate user feedback on the models that are generated in order to tailor the analysis to the user’s preferences.

By developing PennAI as a free and open-source tool, we also hope to contribute an extensible research platform for the ML and AutoML communities. The code is developed on GitHub and documents a base recommender class that can be written to the specification of any learning algorithm. We therefore hope that it will serve as a framework for bringing real-world users into contact with cutting-edge methodologies.

2.2 Recommender Systems

Recommender systems are typically used to recommend items, e.g. movies, books, or other products, to users, i.e. customers, based on a collection of user ratings of items. The most popular approach to recommender systems is collaborative filtering (CF). CF approaches rely on user ratings of items to learn the explicit relationship between similar users and items. In general, CF approaches attempt to group similar users and/or group similar items, and then to recommend similar items to similar users. CF approaches assume, for the most part, that these similarity groupings are implicit in the ratings that users give to items, and therefore can be learned. However, they may be extended to incorporate additional attributes of users or items [19].

Recommenders face challenges when deployed to new users, or in our case, datasets. The new user cold start problem [20] refers to the difficulty in predicting ratings for a user with no data by which to group them. With datasets, one approach to this problem is through metalearning. Each dataset has quantifiable traits that can be leveraged to perform similarity comparisons without relying on algorithm performance history. In our experiments we benchmark a recommender that uses metafeatures to derive similarity scores for recommendations, as has been proposed in previous AutoML work [9, 12].

Recommender systems are typically used for different applications than AutoML, and therefore the motivations behind different methods and evaluation strategies are also different. For example, unlike typical product-based recommendation systems, the AI automatically runs the chosen algorithm configurations, and therefore receives more immediate feedback on its performance. Since the feedback is explicitly the performance of the ML choice on the given dataset, the ratings/scores are reproducible, less noisy, and less sparse than in user-driven systems. This robustness allows us to measure the performance of each recommendation strategy reliably in varying training contexts. As another example, many researchers have found in product recommendation that the presence or absence of a rating may hold more weight than the rating itself, since users choose to rate or to not rate certain products for non-random reasons [21]. This observation has led to the rise of implicit rating-based systems, such as SVD++ [22], that put more weight on presence/absence of ratings. In the context of AutoML, it is less likely that the presence of results for a given algorithm configuration implies that it will outperform others. Furthermore, the goal of advertising-based, commercial recommendation systems may not be to find the best rating for a user and product, but to promote engagement of the user, as measured by their time spent browsing the website. To this end, recommender systems such as Spotlight [23] are based on the notion of sequence modeling: what is the likelihood of a user interacting with each new piece of content given the sequence of items they have viewed? Sequence-based recommendations may improve the user experience with a data science tool, but we contend that they do not align well with the goals of an approachable data science assistant.

3 Methods

In this section we first describe the user experience of PennAI, and then describe the recommenders that we benchmark in our experiments for automating the algorithm selection problem. Fig. 1 gives an overview of the data science pipeline. Users upload datasets through the interface or optionally by pointing to a path during startup. At that point, users can choose between building a custom experiment (manually configuring an algorithm of their choice) or simply clicking the AI button. Once the AI is requested, the recommendation engine chooses a set of analyses to run. The AI can be configured with different termination criteria, including a fixed number of runs, a time limit, or running until the user turns it off. As soon as the runs have finished, the user may navigate to the results page, where several visualizations of model performance are available (Fig. 1C).

PennAI is available as a docker image that may be run locally or on distributed hardware. Due to its container-based architecture, it is straightforward to run analysis in parallel, both for datasets and algorithms, by configuring the docker environment. For more information on the system architecture, refer to the Appendix.

Figure 1: Overview of the UI. A) Users upload datasets and choose a custom experiment (right), or allow the AI to run experiments of its choosing by clicking the AI button. B) Experiments are tabulated with configuration and performance information. The user may download scripts to reproduce the experiment in Python, or export the fitted model. C) The results page displays experiment information and statistics of the fitted model, including various performance measures (confusion matrix, receiver operating characteristic (ROC) curve, etc.) as well as feature importance scores for the independent variables.

3.1 Recommendation System

In order to use recommender systems as a data science assistant, we treat datasets as users, and algorithms as items. The goal of the AI is therefore as follows: given a dataset, recommend an algorithm configuration to run. Once the algorithm configuration has been run, the result is available as a rating of that algorithm configuration on that dataset, as if the user had rated the item. This is a favorable situation for recommender systems, since normally users only occasionally rate the items they are recommended. We denote this knowledge base of experimental results as $\mathcal{D} = \{r_{ad}\}$, where $r_{ad}$ is the test score, i.e. rating, of algorithm configuration $a$ on dataset $d$. In our experiments the test score is the average 10-fold CV score of the algorithm on the dataset.

With a few notable exceptions discussed below, the recommenders follow this basic procedure:

  1. Whenever new experiment results are added to $\mathcal{D}$, update an internal model mapping datasets to algorithm configurations, $\mathcal{M}$.

  2. Given a new recommendation request, generate $\hat{r}_{ad}$, the estimated score of algorithm configuration $a$ on dataset $d$. Do this for every pair $(a, d)$ that has not already been recommended.

  3. Return recommended algorithm configurations in accordance with the termination criterion, in order of best rating to worst.
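The three-step procedure can be illustrated with a deliberately simple sketch. The class name `MeanScoreRecommender`, its method names, and the toy scores are hypothetical and are not PennAI's actual base class; the internal model here is just a running mean per configuration, standing in for whatever model a real recommender would maintain.

```python
from collections import defaultdict


class MeanScoreRecommender:
    """Toy recommender following the update / estimate / recommend steps (hypothetical)."""

    def __init__(self, configurations):
        self.configurations = list(configurations)  # all candidate configurations
        self.scores = defaultdict(list)             # configuration -> observed scores
        self.recommended = defaultdict(set)         # dataset -> configurations already returned

    def update(self, results):
        """Step 1: fold new (dataset, configuration, score) results into the model."""
        for _dataset, config, score in results:
            self.scores[config].append(score)

    def estimate(self, dataset, config):
        """Step 2: estimated score; here the mean observed score (0.0 if never run)."""
        seen = self.scores.get(config)
        return sum(seen) / len(seen) if seen else 0.0

    def recommend(self, dataset, n_recs=1):
        """Step 3: best-rated configurations not yet recommended for this dataset."""
        candidates = [c for c in self.configurations
                      if c not in self.recommended[dataset]]
        ranked = sorted(candidates, key=lambda c: self.estimate(dataset, c), reverse=True)
        chosen = ranked[:n_recs]
        self.recommended[dataset].update(chosen)
        return chosen


rec = MeanScoreRecommender(["tree", "forest", "svc"])
rec.update([("d1", "forest", 0.9), ("d2", "tree", 0.7)])
first = rec.recommend("d3", n_recs=2)   # two best-rated configurations
second = rec.recommend("d3", n_recs=2)  # only the not-yet-recommended one remains
```

Tracking what has already been recommended per dataset is what makes repeated requests return progressively lower-rated configurations.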

Note that the knowledge base can be populated not only by the AI, but by the user through manual experiments (Fig. 1A) and by the initial knowledge base for PennAI. In production mode, the knowledge base is seeded with approximately 1 million ML results generated on 165 open-source datasets, detailed in [24]. The user may also specify their own domain-specific cache of results. Below, we describe several recommender strategies that are benchmarked in the experimental section of this paper. Most of these recommenders are adapted from the Surprise recommender library [25].

3.1.1 Neighborhood Approaches

We test four different neighborhood approaches to recommending algorithm configurations that vary in their definitions of the neighborhood. Three of these implementations are based on the $k$-nearest neighbors (KNN) algorithm, and the other uses co-clustering. For each of the neighborhood methods, similarity is calculated using the mean squared deviation metric.
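As a minimal sketch of a mean squared deviation (MSD) similarity, the function below assumes the common convention of turning the deviation into a similarity via 1 / (MSD + 1), computed only over the algorithm configurations that two datasets share. The function name and the toy scores are illustrative.

```python
def msd_similarity(ratings_x, ratings_y):
    """Similarity 1 / (MSD + 1) over keys rated in both dicts; 0.0 if none are shared."""
    common = set(ratings_x) & set(ratings_y)
    if not common:
        return 0.0
    msd = sum((ratings_x[k] - ratings_y[k]) ** 2 for k in common) / len(common)
    return 1.0 / (msd + 1.0)


# Toy scores of algorithm configurations on three datasets (made up for illustration).
scores_d1 = {"forest": 0.90, "tree": 0.80, "svc": 0.60}
scores_d2 = {"forest": 0.90, "tree": 0.80}
scores_d3 = {"forest": 0.50, "svc": 1.00}

sim_12 = msd_similarity(scores_d1, scores_d2)  # identical on shared keys
sim_13 = msd_similarity(scores_d1, scores_d3)  # differs by 0.4 on both shared keys
```

The same function applies unchanged when the roles are swapped, i.e. when comparing two algorithm configurations by their scores on shared datasets.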

In the first and second approaches, clusters are derived from the results data directly and used to estimate the ranking of each ML method by computing the centroid of rankings within the neighborhood. Let $N_d^k(a)$ be the $k$-nearest neighbors of algorithm configuration $a$ that have been run on dataset $d$. For KNN-ML, we then estimate the ranking from this neighborhood as:

$$\hat{r}_{ad} = \frac{\sum_{b \in N_d^k(a)} \mathrm{sim}(a, b)\, r_{bd}}{\sum_{b \in N_d^k(a)} \mathrm{sim}(a, b)} \qquad (1)$$


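For KNN-data, we instead define the neighborhood over datasets, with $N_a^k(d)$ consisting of the $k$ nearest neighbors to dataset $d$ that have results from algorithm $a$. Then we estimate the rating as:

$$\hat{r}_{ad} = \frac{\sum_{e \in N_a^k(d)} \mathrm{sim}(d, e)\, r_{ae}}{\sum_{e \in N_a^k(d)} \mathrm{sim}(d, e)} \qquad (2)$$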
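A dataset-neighborhood estimate of this kind can be sketched as follows. The sketch assumes a similarity-weighted average over the nearest datasets, as in the basic KNN predictor of libraries such as Surprise, and an MSD-based similarity; function names, `k`, and the toy results are illustrative assumptions.

```python
def msd_similarity(x, y):
    """Similarity 1 / (MSD + 1) over shared keys; 0.0 if nothing is shared."""
    common = set(x) & set(y)
    if not common:
        return 0.0
    return 1.0 / (sum((x[c] - y[c]) ** 2 for c in common) / len(common) + 1.0)


def knn_data_estimate(target, algorithm, results, k=2):
    """Similarity-weighted mean of `algorithm`'s scores on the k most similar datasets."""
    neighbors = [
        (msd_similarity(results[target], results[other]), results[other][algorithm])
        for other in results
        if other != target and algorithm in results[other]
    ]
    neighbors = sorted(neighbors, reverse=True)[:k]  # keep the k most similar
    total = sum(sim for sim, _ in neighbors)
    if total == 0:
        return None  # no informative neighbors
    return sum(sim * score for sim, score in neighbors) / total


# results[dataset][algorithm] = score; "svc" has not been run on d1 yet.
results = {
    "d1": {"forest": 0.9, "tree": 0.8},
    "d2": {"forest": 0.9, "tree": 0.8, "svc": 0.6},
    "d3": {"forest": 0.5, "tree": 0.4, "svc": 1.0},
}
estimate = knn_data_estimate("d1", "svc", results)
```

Because d2 agrees with d1 on every shared result, its score for `svc` receives the larger weight and pulls the estimate toward 0.6.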


Instead of choosing to define the clusters according to datasets or algorithms, we may define co-clusters to capture algorithms and datasets that cluster together. This is the motivation behind co-clustering [26], the third neighborhood-based approach in this study. Under the CoClustering approach, the rating of an algorithm configuration is estimated as:

$$\hat{r}_{ad} = \overline{C_{ad}} + \left(\mu_a - \overline{C_a}\right) + \left(\mu_d - \overline{C_d}\right) \qquad (3)$$

where $\overline{C}$ is the average rating in cluster $C$. As Eqn. 3 shows, clusters are defined with respect to $a$ and $d$ together and separately. Co-clustering uses a $k$-means strategy to define these clusters. In case the dataset is unknown, the average algorithm rating, $\mu_a$, is returned instead; likewise if the algorithm configuration is unknown, the average dataset rating $\mu_d$ is used. In case neither is known, the global average rating is returned.
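The co-clustering estimate and its fallbacks can be sketched for fixed cluster assignments. This is only an illustration: the assignments are supplied by hand rather than learned, the function names are hypothetical, and falling back to the global mean for an empty co-cluster is an assumption of the sketch.

```python
def mean(values):
    values = list(values)
    return sum(values) / len(values) if values else None


def cocluster_estimate(a, d, ratings, algo_cluster, data_cluster):
    """Estimate the rating of algorithm a on dataset d from fixed cluster assignments."""
    global_mean = mean(ratings.values())
    mu_a = mean(r for (x, _), r in ratings.items() if x == a)
    mu_d = mean(r for (_, y), r in ratings.items() if y == d)
    if mu_a is None and mu_d is None:
        return global_mean  # neither algorithm nor dataset is known
    if mu_d is None:
        return mu_a         # unknown dataset: average algorithm rating
    if mu_a is None:
        return mu_d         # unknown algorithm: average dataset rating
    ca, cd = algo_cluster[a], data_cluster[d]
    c_a = mean(r for (x, _), r in ratings.items() if algo_cluster[x] == ca)
    c_d = mean(r for (_, y), r in ratings.items() if data_cluster[y] == cd)
    c_ad = mean(r for (x, y), r in ratings.items()
                if algo_cluster[x] == ca and data_cluster[y] == cd)
    if c_ad is None:
        c_ad = global_mean  # assumption of this sketch: empty co-cluster
    return c_ad + (mu_a - c_a) + (mu_d - c_d)


ratings = {("forest", "d1"): 0.9, ("forest", "d2"): 0.8,
           ("tree", "d1"): 0.7, ("svc", "d2"): 0.5}
algo_cluster = {"forest": 0, "tree": 0, "svc": 1}
data_cluster = {"d1": 0, "d2": 1}

known = cocluster_estimate("tree", "d2", ratings, algo_cluster, data_cluster)
new_dataset = cocluster_estimate("forest", "d9", ratings, algo_cluster, data_cluster)
both_new = cocluster_estimate("knn", "d9", ratings, algo_cluster, data_cluster)
```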

Finally, we test a metalearning method dubbed KNN-meta. In this case, the neighborhood is defined according to metafeature similarity, in the same way as other approaches [9, 11, 12]. We use a set of 45 metafeatures calculated from the dataset, including properties such as average correlation with the dependent variable; statistics describing the mean, max, min, skew and kurtosis of the distributions of each independent variable; counts of types of variables; and so on.

Rather than attempting to estimate ratings of every algorithm, KNN-meta maintains an archive of the best algorithm configuration for each dataset experiment. Given a new dataset, KNN-meta calculates the nearest neighboring datasets and recommends the highest scoring algorithm configurations from each dataset. KNN-meta has the advantage in cold starts since it does not have to have seen a dataset before to reason about its similarity to other results; it only needs to know how its metafeatures compare to previous experiments. KNN-meta has the limitation, however, that it can only recommend algorithm configurations that have been tried on neighboring datasets. In the case that all of these algorithm configurations have already been recommended, KNN-meta will recommend uniform-randomly from algorithms and their configurations.
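The behavior described above can be sketched in a few lines. Everything named here is an illustrative assumption: the function name, the two toy metafeatures, the configuration labels, and the use of unscaled Euclidean distance (in practice metafeatures would normally be standardized first).

```python
import math
import random


def knn_meta_recommend(metafeatures, archive, new_features, all_configs,
                       already_recommended=(), k=2, rng=random):
    """Recommend the archived best configurations of the k most similar datasets."""
    by_distance = sorted(archive,
                         key=lambda d: math.dist(metafeatures[d], new_features))
    recs = []
    for dataset in by_distance[:k]:
        config = archive[dataset]
        if config not in already_recommended and config not in recs:
            recs.append(config)
    if not recs:
        # All neighboring configurations were already recommended:
        # fall back to a uniform-random choice.
        recs.append(rng.choice(all_configs))
    return recs


# Toy metafeatures: (number of samples, number of features).
metafeatures = {"d1": (100, 5), "d2": (120, 6), "d3": (10000, 300)}
archive = {"d1": "forest(n=100)", "d2": "gbc(lr=0.1)", "d3": "svc(C=0.01)"}
all_configs = list(archive.values())

first = knn_meta_recommend(metafeatures, archive, (110, 5), all_configs)
fallback = knn_meta_recommend(metafeatures, archive, (110, 5), all_configs,
                              already_recommended=set(first))
```

The second call shows the limitation noted in the text: once the neighbors' archived configurations are exhausted, the recommendation no longer uses any learned information.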

3.1.2 Singular Value Decomposition

The singular value decomposition (SVD) recommender is a CF method popularized by the top entries to the Netflix challenge [3]. Like other top entrants [27, 28], SVD is based on a matrix factorization technique that attempts to minimize the error of the rankings via stochastic gradient descent (SGD). Each rating is estimated as

$$\hat{r}_{ad} = \mu + b_a + b_d + q_a^T p_d \qquad (4)$$

where $\mu$ is the average score of all datasets across all learners; $b_a$ is the estimated bias for algorithm $a$, initially zero; $b_d$ is the estimated bias for dataset $d$, initially zero; $q_a$ is a vector of factors associated with algorithm $a$; and $p_d$ is the vector of factors associated with dataset $d$, both initialized from normal distributions centered at zero. Ratings are learned to minimize the regularized loss function

$$L = \sum_{r_{ad} \in \mathcal{D}} \left[ \left(r_{ad} - \hat{r}_{ad}\right)^2 + \lambda \left(b_a^2 + b_d^2 + \|q_a\|^2 + \|p_d\|^2\right) \right] \qquad (5)$$

One of the attractive aspects of SVD is its ability to learn latent factors of datasets and algorithms ($p_d$ and $q_a$) that interact to describe the observed experimental results without explicitly defining these factors, as is done in metalearning. A historical challenge of SVD is its application to large, sparse matrices, such as the matrix defined by datasets and algorithms (in our experiments this matrix is about 1 million elements). SVD recommenders address the computational hurdle by using SGD to estimate the parameters of Eqn. 4 using available experiments (i.e. dataset ratings of algorithms) only [29]. SGD is applied by the following update rules:

$$\begin{aligned}
b_a &\leftarrow b_a + \gamma \left(e_{ad} - \lambda b_a\right) \\
b_d &\leftarrow b_d + \gamma \left(e_{ad} - \lambda b_d\right) \\
q_a &\leftarrow q_a + \gamma \left(e_{ad}\, p_d - \lambda q_a\right) \\
p_d &\leftarrow p_d + \gamma \left(e_{ad}\, q_a - \lambda p_d\right)
\end{aligned} \qquad (6)$$

where $e_{ad} = r_{ad} - \hat{r}_{ad}$, $\gamma$ is the learning rate and $\lambda$ is the regularization weight. To facilitate online learning, the parameters in Eqn. 6 are maintained between updates to the experimental results, and the number of iterations (epochs) of training is set proportional to the number of new results.
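The SGD training of a biased matrix factorization can be sketched in plain Python. This is a from-scratch illustration under stated assumptions (two latent factors, a fixed learning rate and regularization weight, toy ratings); it is not the Surprise implementation, and it trains from scratch rather than keeping parameters between updates as the online setting requires.

```python
import random


def train_svd(ratings, n_factors=2, n_epochs=200, lr=0.05, reg=0.02, seed=0):
    """Fit biases and latent factors with SGD on observed (algorithm, dataset) scores."""
    rng = random.Random(seed)
    mu = sum(ratings.values()) / len(ratings)  # global average score
    algos = sorted({a for a, _ in ratings})
    datas = sorted({d for _, d in ratings})
    b_a = {a: 0.0 for a in algos}              # algorithm biases, initially zero
    b_d = {d: 0.0 for d in datas}              # dataset biases, initially zero
    q = {a: [rng.gauss(0, 0.1) for _ in range(n_factors)] for a in algos}
    p = {d: [rng.gauss(0, 0.1) for _ in range(n_factors)] for d in datas}

    def predict(a, d):
        return mu + b_a[a] + b_d[d] + sum(x * y for x, y in zip(q[a], p[d]))

    for _ in range(n_epochs):
        for (a, d), r in ratings.items():
            err = r - predict(a, d)
            b_a[a] += lr * (err - reg * b_a[a])
            b_d[d] += lr * (err - reg * b_d[d])
            for f in range(n_factors):
                qa, pd = q[a][f], p[d][f]      # use the old values in both updates
                q[a][f] += lr * (err * pd - reg * qa)
                p[d][f] += lr * (err * qa - reg * pd)
    return predict


ratings = {("gbc", "d1"): 0.95, ("gbc", "d2"): 0.85, ("tree", "d1"): 0.80,
           ("tree", "d2"): 0.70, ("svc", "d1"): 0.60}
predict = train_svd(ratings)
unseen = predict("svc", "d2")  # estimate for a pair with no observed result
```

Only observed pairs enter the loop, which is why the approach remains tractable when most entries of the dataset-by-algorithm matrix are missing.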

3.1.3 Slope One

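Slope one [30] is a simple recommendation strategy that models algorithm performance on a dataset as the average deviation of the performance of algorithms on other datasets with which the current dataset shares at least one analysis in common. To rate an algorithm configuration $a$ on dataset $d$, we first collect a set $R_a(d)$ of algorithms that have been trained on $d$ and share at least one common dataset experiment with $a$. Then the rating is estimated as

$$\hat{r}_{ad} = \mu_d + \frac{1}{|R_a(d)|} \sum_{b \in R_a(d)} \mathrm{dev}(a, b) \qquad (7)$$

where $\mu_d$ is the average rating on dataset $d$ and $\mathrm{dev}(a, b)$ is the average difference between the ratings of $a$ and $b$ over the datasets on which both have been run.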
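A small worked sketch of the slope one estimate follows. The function name and the toy scores are illustrative; returning the dataset average when no relevant algorithm exists is an assumption of the sketch.

```python
def slope_one_estimate(a, d, ratings):
    """ratings maps dataset -> {algorithm: score}; returns the estimate for (a, d)."""
    mu_d = sum(ratings[d].values()) / len(ratings[d])
    deviations = []
    for b in ratings[d]:
        # Datasets on which both a and b have been run.
        shared = [e for e in ratings if a in ratings[e] and b in ratings[e]]
        if shared:
            dev = sum(ratings[e][a] - ratings[e][b] for e in shared) / len(shared)
            deviations.append(dev)
    if not deviations:
        return mu_d  # assumption: no relevant algorithms, use the dataset average
    return mu_d + sum(deviations) / len(deviations)


ratings = {
    "d1": {"forest": 0.90, "tree": 0.80, "svc": 0.60},
    "d2": {"forest": 0.70, "tree": 0.65},
    "d3": {"tree": 0.50, "svc": 0.40},
}
# svc vs forest: -0.30 (d1 only); svc vs tree: mean of -0.20 (d1) and -0.10 (d3).
estimate = slope_one_estimate("svc", "d2", ratings)
```

Here the dataset average on d2 is 0.675 and the mean deviation of `svc` from the algorithms already run on d2 is -0.225, giving an estimate of 0.45.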


3.1.4 Benchmark Recommenders

As a control, we test two baseline algorithms: a random recommender and an average best recommender. The Random recommender chooses uniform-randomly among ML methods, and then uniform-randomly among hyperparameters for that method to make recommendations. The Average recommender keeps a running average of the best algorithm configuration as measured by the average validation balanced accuracy across experiments. Given a dataset request, the Average recommender recommends algorithm configurations in order of their running averages, from best to worst.

4 Experiments

The goal of our experiments is to assess different recommendation strategies in their ability to recommend better algorithm configurations for various datasets as they learn over time from previous experiments. The diagram in Fig. 2 describes the experimental design used to evaluate recommendation strategies under PennAI.


We assess each recommender on 135 open-source classification datasets, varying in size and origin. We use datasets from the Penn Machine Learning Benchmark (PMLB) [31]. PMLB is a curated and standardized set of hundreds of open source supervised machine learning problems from around the web (sharing many in common with OpenML [32]). In previous work, we conducted an extensive benchmarking experiment to characterize the performance of different ML frameworks on these problems [31, 24]. The benchmark assessed 13 ML algorithms over a range of hyperparameters detailed in Table 1 on these problems. This resulted in a cache of over 1 million ML results across a broad range of classification problems which we use here to assess the performance of each recommender with a known ranking of algorithms for each dataset. For the experiment in this paper, we use a subset of these results consisting of 12 ML algorithms (dropping one of three naïve Bayes algorithms) with 7330 possible hyperparameter configurations, shown in Table 1.


The experiment consists of 300 repeat trials for each recommender. In each trial, the recommender begins with a knowledge base containing a fixed number of experiments, each consisting of a single ML run on a single dataset. For each iteration of the experiment, the recommender is asked to return a fixed number of recommendations for a randomly chosen dataset. Once the recommendation is made, the results for the recommended algorithm configurations are fed back to the recommender as updated knowledge. Thus, the recommender behaves as a reinforcement learning experiment in which the actions taken by the recommender (i.e., the recommendations it makes) determine the information it is able to learn about the relationship between datasets and algorithm configurations.

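We conduct experiments varying the number of experiments in the initial knowledge base and the number of recommendations made per iteration. This allows us to assess the sensitivities to 1) starting with more information and 2) exploring more algorithm options during learning.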

Figure 2: Diagram describing the experimental design.

Since we have the complete results of ML analyses on our experiment datasets, we are able to assess recommendations in terms of their closeness to the best configuration, i.e. that configuration with the best performance on a dataset over all results obtained in our exhaustive benchmark. Each algorithm configuration is primarily assessed according to its balanced accuracy ($BA$), a metric that takes into account class imbalance by averaging accuracy over classes [33]. We define the best balanced accuracy on a given dataset $d$ as $BA^*_d$. Then the performance of a recommendation is assessed as the relative distance to the best solution, as:

$$\Delta BA_{ad} = \frac{BA^*_d - BA_{ad}}{BA^*_d} \qquad (8)$$

where $BA_{ad}$ is the balanced accuracy of the recommended algorithm configuration $a$ on dataset $d$.


In addition to Eqn. 8, we assess the AI in terms of the number of datasets for which it is able to identify an “optimal” configuration. Here we define “optimal” algorithm configurations to be those that score within 1% of the best algorithm configuration for a dataset. This definition of optimality is of course limited both by the finite search space defined by Table 1 and by the choice of thresholding at 1%. Nonetheless, this definition gives a practical indicator of the extent to which the AI is able to reach the global best performance.
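The two evaluation measures can be sketched as follows. The sketch takes balanced accuracy to be the mean of per-class recall and interprets "within 1%" as a relative gap of at most 0.01; both are assumptions, as are the function names and the toy numbers.

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, i.e. accuracy averaged over classes."""
    recalls = []
    for c in sorted(set(y_true)):
        idx = [i for i, y in enumerate(y_true) if y == c]
        recalls.append(sum(y_pred[i] == c for i in idx) / len(idx))
    return sum(recalls) / len(recalls)


def relative_gap(score, best):
    """Relative distance to the best score on a dataset (0 means the best was matched)."""
    return (best - score) / best


def success_rate(best_found, best_possible, tol=0.01):
    """Fraction of datasets whose best found score is within tol of the best possible."""
    hits = sum(relative_gap(best_found[d], best_possible[d]) <= tol
               for d in best_possible)
    return hits / len(best_possible)


# A majority-class predictor has 80% plain accuracy but 50% balanced accuracy.
ba = balanced_accuracy([0, 0, 0, 0, 1], [0, 0, 0, 0, 0])

best_found = {"d1": 0.950, "d2": 0.80}
best_possible = {"d1": 0.955, "d2": 0.90}
rate = success_rate(best_found, best_possible)  # only d1 is within 1% of its best
```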

| Algorithm name | Parameter | Values |
|---|---|---|
| AdaBoostClassifier | learning_rate | [0.01, 0.1, 0.5, 1.0, 10.0, 50.0, 100.0] |
| | n_estimators | [10, 50, 100, 500, 1000] |
| BernoulliNB | alpha | [0.0, 0.1, 0.25, 0.5, 0.75, 1.0, 5.0, 10.0, 25.0, 50.0] |
| | fit_prior | ['true', 'false'] |
| | binarize | [0.0, 0.1, 0.25, 0.5, 0.75, 0.9, 1.0] |
| DecisionTreeClassifier | min_weight_fraction_leaf | [0.0, 0.05, 0.1, 0.15, 0.2, 0.25, 0.3, 0.35, 0.4, 0.45, 0.5] |
| | max_features | [0.1, 0.25, 0.5, 0.75, 'log2', None, 'sqrt'] |
| | criterion | ['entropy', 'gini'] |
| ExtraTreesClassifier | n_estimators | [10, 50, 100, 500, 1000] |
| | min_weight_fraction_leaf | [0.0, 0.05, 0.1, 0.15, 0.2, 0.25, 0.3, 0.35, 0.4, 0.45, 0.5] |
| | max_features | [0.1, 0.25, 0.5, 0.75, 'log2', None, 'sqrt'] |
| | criterion | ['entropy', 'gini'] |
| GradientBoostingClassifier | loss | ['deviance'] |
| | learning_rate | [0.01, 0.1, 0.5, 1.0, 10.0] |
| | n_estimators | [10, 50, 100, 500, 1000] |
| | max_depth | [1, 2, 3, 4, 5, 10, 20, 50, None] |
| | max_features | ['log2', 'sqrt', None] |
| KNeighborsClassifier | n_neighbors | [1, 2, …, 25] |
| | weights | ['uniform', 'distance'] |
| LogisticRegression | C | [0.5, 1.0, …, 20.0] |
| | penalty | ['l2', 'l1'] |
| | fit_intercept | ['true', 'false'] |
| | dual | ['true', 'false'] |
| MultinomialNB | alpha | [0.0, 0.1, 0.25, 0.5, 0.75, 1.0, 5.0, 10.0, 25.0, 50.0] |
| | fit_prior | ['true', 'false'] |
| PassiveAggressiveClassifier | C | [0.0, 0.001, 0.01, 0.1, 0.5, 1.0, 10.0, 50.0, 100.0] |
| | loss | ['hinge', 'squared_hinge'] |
| | fit_intercept | ['true', 'false'] |
| RandomForestClassifier | n_estimators | [10, 50, 100, 500, 1000] |
| | min_weight_fraction_leaf | [0.0, 0.05, 0.1, 0.15, 0.2, 0.25, 0.3, 0.35, 0.4, 0.45, 0.5] |
| | max_features | [0.1, 0.25, 0.5, 0.75, 'log2', None, 'sqrt'] |
| | criterion | ['entropy', 'gini'] |
| SGDClassifier | loss | ['hinge', 'perceptron', 'log', 'squared_hinge', 'modified_huber'] |
| | penalty | ['elasticnet'] |
| | alpha | [0.0, 0.001, 0.01] |
| | learning_rate | ['constant', 'invscaling'] |
| | fit_intercept | ['true', 'false'] |
| | l1_ratio | [0.0, 0.25, 0.5, 0.75, 1.0] |
| | eta0 | [0.01, 0.1, 1.0] |
| | power_t | [0.0, 0.1, 0.5, 1.0, 10.0, 50.0, 100.0] |
| SVC | C | [0.01] |
| | gamma | [0.01] |
| | kernel | ['poly'] |
| | degree | [2, 3] |
| | coef0 | [0.0, 0.1, 0.5, 1.0, 10.0, 50.0, 100.0] |

Table 1: Analyzed algorithms with their parameter settings. The methods and parameters are named according to Scikit-learn nomenclature [34].
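To make the notion of an "algorithm configuration" concrete, the sketch below expands one row group of Table 1 (AdaBoostClassifier) into its individual configurations. It assumes that every combination of listed values counts as one configuration; the helper name `expand_grid` is hypothetical, and the sketch makes no claim about how the paper's total count was obtained.

```python
from itertools import product

# Parameter grid for AdaBoostClassifier, copied from Table 1.
adaboost_grid = {
    "learning_rate": [0.01, 0.1, 0.5, 1.0, 10.0, 50.0, 100.0],
    "n_estimators": [10, 50, 100, 500, 1000],
}


def expand_grid(grid):
    """Return one dict per combination of parameter values (Cartesian product)."""
    names = list(grid)
    return [dict(zip(names, values))
            for values in product(*(grid[name] for name in names))]


configs = expand_grid(adaboost_grid)  # 7 learning rates x 5 ensemble sizes
```

Each resulting dictionary plays the role of one "item" that the recommender can rate and recommend.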

5 Results

Results for the PMLB experiment are shown in Figures 3-5. Recommenders are first compared in terms of median Balanced Accuracy (Eqn. 8) in Fig. 3. In Fig. 4, we look at the fraction of datasets for which SVD is able to find an optimal configuration under different experiment treatments. Finally we visualize the behavior of a subset of recommenders in Fig. 5 in order to gain insight into which algorithms are being selected and how this compares to the underlying distribution of algorithm performance in the knowledge base.

Figure 3: Experiment results for each recommendation strategy. Each plot shows the median Balanced Accuracy (Eqn. 8) for 300 trials with error bars denoting 95% confidence intervals. From left to right, the number of recommendations made per dataset increases; from top to bottom, the number of experiments in the initial knowledge base increases.

Let us first focus on the performance results in Fig. 3. We find in general that the various recommender systems are able to learn to recommend algorithm configurations that increasingly minimize the gap between the best performance on each dataset (unknown to the AI) and the current experiments. SVD performs the best, tending to reach good performance more quickly than the other recommendation strategies, across treatments. KNN-data and KNN-ML are the next best methods across treatments, and tend to perform similarly to each other. There is a gap between those three methods and the next best recommenders, which vary between SlopeOne, KNN-meta, and CoClustering for different settings.

As we expected due to its cold-start strategy, KNN-meta turns out to work well initially, but over time fails to converge on a set of high quality recommendations. The recommendation engines are generally able to learn quickly from few examples compared to the metalearning approach. This finding is in line with previous findings regarding movie recommendations [35].

Given 100 initial experiment results and 10 recommendations per iteration, the SVD recommender converges to within 5% of the optimal performance in approximately 100 iterations, corresponding to approximately 7 training instances per dataset. Note that the performance curves begin to increase (i.e., the distance to the best configuration grows) on the right-most plots that correspond to 100 recommendations per dataset. In these cases, the recommender begins recommending algorithm configurations with lower rankings due to repeat filtering, described in Section 3.

Figure 4: Cumulative success rates for every dataset in the knowledgebase. The success rate is the fraction of datasets for which the recommender has trained an algorithm configuration that achieves a test set balanced accuracy within 1% of the best performance on that dataset. As the number of recommendations increases, the success rate improves, such that about 80% of datasets have an optimal configuration chosen by the AI.

In Fig. 4, we look deeper into SVD’s performance since it is shown to perform the best in terms of Balanced Accuracy. As described earlier, we are interested in capturing a notion of “optimality”, which we define as being within 1% of the best performance on a dataset across the exhaustive experiments in our previous work [31, 24]. To capture this, we plot the success rate of SVD, defined as the fraction of datasets for which the recommender has trained an algorithm configuration that achieves a test set balanced accuracy within 1% of the best. Fig. 4 demonstrates that, given more recommendations per iteration, SVD tends to be able to “solve” most of the datasets, i.e. it recommends an optimal configuration for approximately 80% of datasets within 1000 iterations. This success rate corresponds to roughly 740 model fits per dataset, or a sampling of about 10% of the algorithm configuration space.
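The per-dataset figures quoted in this section follow from simple arithmetic, sketched below under two assumptions: datasets are requested uniformly at random, so results spread evenly over the 135 datasets, and the small initial knowledge base is ignored. The second and third lines assume the treatment with 100 recommendations per iteration.

```latex
\begin{align*}
\frac{100 \text{ iterations} \times 10 \text{ recommendations}}{135 \text{ datasets}}
  &\approx 7.4 \text{ results per dataset} \\
\frac{1000 \text{ iterations} \times 100 \text{ recommendations}}{135 \text{ datasets}}
  &\approx 741 \text{ model fits per dataset} \\
\frac{741}{7330 \text{ configurations}} &\approx 0.10
\end{align*}
```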

Figure 5: Heatmaps of three different recommendation strategies on the PMLB benchmark showing how often each ML algorithm was recommended. These plots show a single experiment treatment. The top left figure is derived from the true data, and represents the average frequency of top-ranking algorithms across all of the datasets (note that the uniformity is due to the shuffling of the datasets for each trial). The top right figure shows SVD’s performance; over several iterations, it learns to approximate the frequency distribution of best ML models, i.e. GradientBoostingClassifier, followed by RandomForestClassifier, ExtraTreesClassifier, and SGDClassifier. The bottom left shows the same plot for KNN-meta, indicating a more uniform selection of different algorithms. The bottom right plot shows SlopeOne’s behavior, which is to recommend GradientBoostingClassifier and SGDClassifier most frequently at different points in time.

The final set of plots in Fig. 5 shows the frequency with which SVD, KNN-meta, and SlopeOne recommend different algorithms, compared with the frequency of top-ranking algorithms by type (the top left plot). Here we see that SVD gradually learns to recommend the top five algorithms in approximately the same ranking as they appear in the knowledgebase. This lends some confidence to the relationship that SVD has learned between algorithm configurations and dataset performance.
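
One simple way to quantify this kind of comparison is to rank algorithms by how often they are recommended and check the overlap with the algorithms that most often rank best. The sketch below is an illustration only: the helper names and the overlap measure are assumptions, not the analysis used to produce Fig. 5, and the lists in the usage note are made-up toy data.

```python
from collections import Counter


def frequency_ranking(names):
    """Return (algorithm, relative frequency) pairs, most frequent first."""
    counts = Counter(names)
    total = sum(counts.values())
    # Sort by descending count; break ties by name so the order is deterministic.
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(name, count / total) for name, count in ordered]


def top_k_overlap(recommended, best, k=5):
    """Share of the k most frequent best algorithms that also appear among
    the k most frequently recommended algorithms."""
    top_rec = {name for name, _ in frequency_ranking(recommended)[:k]}
    top_best = {name for name, _ in frequency_ranking(best)[:k]}
    if not top_best:
        return 0.0
    return len(top_rec & top_best) / len(top_best)
```

For example, with toy lists `recommended = ["GradientBoostingClassifier"] * 3 + ["RandomForestClassifier"] * 2 + ["SGDClassifier"]` and `best = ["GradientBoostingClassifier"] * 4 + ["RandomForestClassifier"] * 2 + ["ExtraTreesClassifier"]`, the call `top_k_overlap(recommended, best, k=2)` compares the two most frequent algorithms on each side.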

6 Conclusion

In this paper we propose a data science tool for non-experts that generates increasingly reliable ML recommendations tailored to the needs of the user. The web-based interface provides the user with an intuitive way to launch and monitor analyses, view and understand results, and download reproducible experiments and fitted models for offline use. The learning methodology is based on a recommendation system that can learn from both cached and generated results. We demonstrate through the experiments in this paper that collaborative filtering algorithms can successfully learn to produce intelligent analyses for the user starting with sparse data on algorithm performance. We find in particular that a matrix factorization algorithm, SVD, works well in this application area.

PennAI automates the algorithm selection and tuning problem using a recommendation system that is bootstrapped with a knowledgebase of previous results. The default knowledgebase is derived from a large set of experiments conducted on 165 open source datasets. The user can also configure their own knowledgebase of datasets and results catered to their application area. We hope in the future to provide these for various learning domains, including electronic health records and genetics.

We also hope that PennAI can serve as a testbed for novel AutoML methodologies. In the near term we plan to extend the methodology in the following ways. First, we plan to implement a focused hyperparameter tuning strategy that can fine-tune the models that are recommended by the AI, similar to auto-sklearn [9] or Hyperopt [10]. We plan to make this process transparent to the user so that they may easily choose which models to tune and for how long. We also plan to increasingly automate the data preprocessing, which is, at the moment, mostly up to the user. This can include processes from imputation and data standardization to more complex operations like feature selection and engineering.

7 Acknowledgments

The authors would like to thank the members of the Institute for Biomedical Informatics at Penn for their many contributions. Contributors included Steve Vitale, Sharon Tartarone, Josh Cohen, Randal Olson, Patryk Orzechowski, and Efe Ayhan. We also thank John Holmes, Moshe Sipper, and Ryan Urbanowicz for their useful discussions. The development of this tool was supported by National Institutes of Health research grants (R01 LM010098 and R01 AI116794) and National Institutes of Health infrastructure and support grants (UC4 DK112217, P30 ES013508, and UL1 TR001878).


  • [1] Manju Bansal. Big Data: Creating the Power to Move Heaven and Earth, 2014.
  • [2] Randal S. Olson, Moshe Sipper, William La Cava, Sharon Tartarone, Steven Vitale, Weixuan Fu, John H. Holmes, and Jason H. Moore. A system for accessible artificial intelligence. arXiv preprint arXiv:1705.00594, 2017.
  • [3] James Bennett and Stan Lanning. The netflix prize. In Proceedings of KDD cup and workshop, volume 2007, page 35. New York, NY, USA., 2007.
  • [4] Brent Smith and Greg Linden. Two decades of recommender systems at Amazon.com. IEEE Internet Computing, 21(3):12–18, 2017.
  • [5] James Davidson, Benjamin Liebald, Junning Liu, Palash Nandy, Taylor Van Vleet, Ullas Gargi, Sujoy Gupta, Yu He, Mike Lambert, and Blake Livingston. The YouTube video recommendation system. In Proceedings of the fourth ACM conference on Recommender systems, pages 293–296. ACM, 2010.
  • [6] Isabelle Guyon, U Paris-Saclay, Hugo Jair Escalante, Sergio Escalera, U Barcelona, Damir Jajetic, James Robert Lloyd, and Nuria Macia. A brief Review of the ChaLearn AutoML Challenge. JMLR Workshop and Conference Proceedings, 64:21–30, 2016.
  • [7] Frank Hutter, Holger H. Hoos, and Kevin Leyton-Brown. Sequential Model-Based Optimization for General Algorithm Configuration. In Carlos A. Coello Coello, editor, Learning and Intelligent Optimization, Lecture Notes in Computer Science, pages 507–523. Springer Berlin Heidelberg, 2011.
  • [8] Lars Kotthoff, Chris Thornton, Holger H. Hoos, Frank Hutter, and Kevin Leyton-Brown. Auto-WEKA 2.0: Automatic model selection and hyperparameter optimization in WEKA. The Journal of Machine Learning Research, 18(1):826–830, 2017.
  • [9] Matthias Feurer, Aaron Klein, Katharina Eggensperger, Jost Springenberg, Manuel Blum, and Frank Hutter. Efficient and robust automated machine learning. In Advances in Neural Information Processing Systems, pages 2962–2970, 2015.
  • [10] Brent Komer, James Bergstra, and Chris Eliasmith. Hyperopt-sklearn: automatic hyperparameter configuration for scikit-learn. In ICML workshop on AutoML, 2014.
  • [11] Pavel B. Brazdil, Carlos Soares, and Joaquim Pinto Da Costa. Ranking learning algorithms: Using IBL and meta-learning on accuracy and time results. Machine Learning, 50(3):251–277, 2003.
  • [12] Pavel Brazdil, Christophe Giraud Carrier, Carlos Soares, and Ricardo Vilalta. Metalearning: Applications to Data Mining. Springer Science & Business Media, November 2008.
  • [13] Matthias Feurer, Katharina Eggensperger, Stefan Falkner, Marius Lindauer, and Frank Hutter. Practical automated machine learning for the automl challenge 2018. In International Workshop on Automatic Machine Learning at ICML, 2018.
  • [14] Randal S. Olson, Nathan Bartley, Ryan J. Urbanowicz, and Jason H. Moore. Evaluation of a tree-based pipeline optimization tool for automating data science. In Proceedings of the Genetic and Evolutionary Computation Conference 2016, pages 485–492. ACM, 2016.
  • [15] Adithya Balaji and Alexander Allen. Benchmarking Automatic Machine Learning Frameworks. arXiv:1808.06492 [cs, stat], August 2018. arXiv: 1808.06492.
  • [16] Esteban Real. Using Evolutionary AutoML to Discover Neural Network Architectures, March 2018.
  • [17] Chengrun Yang, Yuji Akimoto, Dae Won Kim, and Madeleine Udell. OBOE: Collaborative Filtering for AutoML Initialization. arXiv:1808.03233 [cs, stat], August 2018. arXiv: 1808.03233.
  • [18] Tiago Cunha, Carlos Soares, and André C. P. L. F. de Carvalho. Metalearning and Recommender Systems: A literature review and empirical study on the algorithm selection problem for Collaborative Filtering. Information Sciences, 423:128–144, January 2018.
  • [19] Lyle H. Ungar and Dean P. Foster. Clustering methods for collaborative filtering. In AAAI workshop on recommendation systems, volume 1, pages 114–129, 1998.
  • [20] Andrew I. Schein, Alexandrin Popescul, Lyle H. Ungar, and David M. Pennock. Methods and metrics for cold-start recommendations. In Proceedings of the 25th annual international ACM SIGIR conference on Research and development in information retrieval, pages 253–260. ACM, 2002.
  • [21] Benjamin M. Marlin and Richard S. Zemel. Collaborative prediction and ranking with non-random missing data. In RecSys, 2009.
  • [22] Yehuda Koren. Factorization Meets the Neighborhood: a Multifaceted Collaborative Filtering Model. In KDD, page 9, 2008.
  • [23] Maciej Kula. Spotlight. GitHub, 2017.
  • [24] Randal S. Olson, William La Cava, Zairah Mustahsan, Akshay Varik, and Jason H. Moore. Data-driven Advice for Applying Machine Learning to Bioinformatics Problems. In Pacific Symposium on Biocomputing (PSB), August 2017. arXiv: 1708.05070.
  • [25] Nicolas Hug. Surprise, a Python library for recommender systems. 2017.
  • [26] T. George and S. Merugu. A Scalable Collaborative Filtering Framework Based on Co-Clustering. In Fifth IEEE International Conference on Data Mining (ICDM’05), pages 625–628, Houston, TX, USA, 2005. IEEE.
  • [27] Robert M. Bell, Yehuda Koren, and Chris Volinsky. The bellkor solution to the netflix prize. KorBell Team’s Report to Netflix, 2007.
  • [28] Robert M. Bell and Yehuda Koren. Scalable Collaborative Filtering with Jointly Derived Neighborhood Interpolation Weights. In ICDM, volume 7, pages 43–52. Citeseer, 2007.
  • [29] Genevieve Gorrell. Generalized Hebbian algorithm for incremental singular value decomposition in natural language processing. In 11th conference of the European chapter of the association for computational linguistics, 2006.
  • [30] Daniel Lemire and Anna Maclachlan. Slope One Predictors for Online Rating-Based Collaborative Filtering. arXiv:cs/0702144, February 2007. arXiv: cs/0702144.
  • [31] Randal S. Olson, William La Cava, Patryk Orzechowski, Ryan J. Urbanowicz, and Jason H. Moore. PMLB: A Large Benchmark Suite for Machine Learning Evaluation and Comparison. BioData Mining, 2017. arXiv preprint arXiv:1703.00512.
  • [32] Joaquin Vanschoren, Jan N. van Rijn, Bernd Bischl, and Luis Torgo. OpenML: Networked Science in Machine Learning. SIGKDD Explor. Newsl., 15(2):49–60, June 2014.
  • [33] Digna R. Velez, Bill C. White, Alison A. Motsinger, William S. Bush, Marylyn D. Ritchie, Scott M. Williams, and Jason H. Moore. A balanced accuracy function for epistasis modeling in imbalanced datasets using multifactor dimensionality reduction. Genetic Epidemiology: The Official Publication of the International Genetic Epidemiology Society, 31(4):306–315, 2007.
  • [34] Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, and others. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12(Oct):2825–2830, 2011.
  • [35] István Pilászy and Domonkos Tikk. Recommending New Movies: Even a Few Ratings Are More Valuable Than Metadata. In Proceedings of the Third ACM Conference on Recommender Systems, RecSys ’09, pages 93–100, New York, NY, USA, 2009. ACM. event-place: New York, New York, USA.


System Architecture

Figure 6: Diagram describing the architecture of PennAI.

PennAI is a multi-component architecture that uses a variety of technologies including Docker (https://www.docker.com/), Python, Node.js (https://nodejs.org), scikit-learn (http://sklearn.org), FGLab (https://kaixhin.github.io/FGLab/), and MongoDB (https://www.mongodb.com/). The project contains multiple Docker containers that are orchestrated by a docker-compose file. The central component is the controller engine, a server written in Node.js. This component is responsible for managing communication between the other components using a REST API. A MongoDB database is used for persistent storage.

The UI component is a web application written in JavaScript that uses the React library to create the user interface and the Redux library to manage UI state. The interface supports user interactions including uploading datasets for analysis, requesting AI recommendations for a dataset, manually specifying and running machine learning experiments, and displaying experiment results in an intuitive way.

The AI engine is written in Python. As users make requests to perform analysis on datasets, the AI engine generates new machine learning experiment recommendations and communicates them to the controller engine. The AI engine contains a knowledgebase of previously run experiments, results, and dataset metafeatures that it uses to inform the recommendations it makes. The knowledgebase is bootstrapped with a collection of experiment results generated from the PMLB benchmark datasets. Instructions and code templates are provided to allow easy integration of custom recommendation systems.

The machine learning component is responsible for running machine learning experiments on datasets. It has a Node.js server that is used to communicate with the controller engine, and it uses Python to execute scikit-learn experiments on datasets and communicate results back to the central server. A PennAI instance can support multiple instances of machine learning engines, enabling multiple experiments to be run in parallel. Fig. 6 shows how each component interacts in PennAI.
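
To make the idea of a pluggable recommender concrete, the following is a minimal sketch of what a custom recommendation system might look like. It is not PennAI's actual template: the class name, the `update` and `recommend` method names, and the `(dataset_id, config, score)` result format are all hypothetical, chosen only to mirror the behavior described above, in which the recommender learns from results as they arrive and proposes configurations for a dataset.

```python
from collections import defaultdict


class AverageScoreRecommender:
    """Hypothetical recommender that ranks algorithm configurations by the
    mean score they have achieved across all datasets seen so far."""

    def __init__(self):
        self._totals = defaultdict(float)  # config -> sum of scores
        self._counts = defaultdict(int)    # config -> number of results
        self._tried = defaultdict(set)     # dataset_id -> configs already run

    def update(self, results):
        """Learn from new results, given as (dataset_id, config, score) tuples."""
        for dataset_id, config, score in results:
            self._totals[config] += score
            self._counts[config] += 1
            self._tried[dataset_id].add(config)

    def recommend(self, dataset_id, n_recs=1):
        """Return up to n_recs untried configurations for this dataset,
        ordered from highest to lowest mean score."""
        tried = self._tried.get(dataset_id, set())
        candidates = [c for c in self._counts if c not in tried]
        # Highest mean score first; break ties by name for determinism.
        candidates.sort(key=lambda c: (-self._totals[c] / self._counts[c], c))
        return candidates[:n_recs]
```

In this sketch, configurations are represented as hashable identifiers such as strings. A collaborative filtering method like SVD would replace the global mean with a dataset-specific prediction, but the same two-method shape (learn from results, then recommend) would still apply under these assumptions.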