Man versus Machine: AutoML and Human Experts' Role in Phishing Detection

08/27/2021 · by Rizka Purwanto, et al.

Machine learning (ML) has developed rapidly in the past few years and has been successfully utilized for a broad range of tasks, including phishing detection. However, building an effective ML-based detection system is not a trivial task, and requires data scientists with knowledge of the relevant domain. Automated Machine Learning (AutoML) frameworks have received a lot of attention in recent years, enabling non-ML experts to build machine learning models. This raises the intriguing question of whether AutoML can outperform human data scientists. Our paper compares the performance of six well-known, state-of-the-art AutoML frameworks on ten different phishing datasets to see whether AutoML-based models can outperform manually crafted machine learning models. Our results indicate that AutoML-based models are able to outperform manually developed machine learning models in complex classification tasks, specifically on datasets whose features are not very discriminative and on datasets with overlapping classes or relatively high degrees of non-linearity. Challenges remain in building a real-world phishing detection system using AutoML frameworks: current frameworks support only supervised classification problems, which leads to the need for labeled data, and AutoML-based models cannot be updated incrementally. This indicates that experts with domain knowledge in phishing and cybersecurity are still essential in the loop of the phishing detection pipeline.




1 Introduction

Despite the availability of anti-phishing technologies, phishing attacks are still thriving and have caused data breaches of sensitive personal information and private company data. Phishing attacks were reported to double in 2020 [2], and have caused significant financial losses estimated at between $60 million and $3 billion per year in the United States [28]. With the rapidly evolving nature of phishing, it would be ideal to have an automated detection system which could quickly adapt to changes in phishing data and robustly detect these attacks.

Phishing is a cyber-attack that aims at stealing sensitive information by impersonating a legitimate person, company, or organization. Using social engineering and psychological tactics, phishing attackers exploit human vulnerabilities by sending messages that seem authentic and usually carry a sense of urgency, persuading users to follow the attacker's instructions in the message. While phishing attacks are nowadays also delivered through voice messages and SMS messages, reports show that traditional email-based phishing still dominates [2].

During the past decade, machine learning has developed rapidly and has successfully performed a broad range of tasks. Machine learning techniques have been shown to be effective in detecting phishing attacks. However, building a machine learning model is not a trivial task, and requires data scientists with knowledge of the relevant domain. In the past couple of years, Automated Machine Learning (AutoML) has gained much attention, enabling non-ML experts to build machine learning models.

The emergence of AutoML frameworks brings us to the question of whether AutoML-generated models could outperform manually trained machine learning models on phishing data, how AutoML could assist non-experts in building machine learning models, and to what extent we could automate the whole process of building an ML-based detection system. There are a number of past studies that have tried to investigate whether the existence of AutoML frameworks would affect the roles of human experts in the ML development pipeline [10]. However, to the best of our knowledge, we are the first to discuss this topic specifically in the case of phishing detection systems.

2 Automated Machine Learning

There has been significant research in machine learning and deep learning since 1995, resulting in the development of various tools, such as Weka (1990s), scikit-learn (2007-2010), TensorFlow (2015), and Keras (2015). The emergence of these tools has enabled multidisciplinary research and the application of machine learning and deep learning techniques, which have shown promising results, demonstrating the potential of machine learning models to solve various problems. However, it has also become evident that developing a machine learning model is a complex task, requiring intuition, experience, and technical expertise to tune the model's hyperparameters. The heavy reliance of machine learning development on human experts has motivated researchers to explore techniques for automating the development of machine learning models. Such attempts, focused on AutoML projects, were initiated by researchers and machine learning practitioners, and were followed by startups that sell AutoML frameworks as part of their business models. Among the first AutoML tools were Auto-WEKA [32], which is based on Weka [23], followed by auto-sklearn [18] and TPOT [48], both built on the Python scikit-learn library [51]. Various AutoML frameworks have also emerged as products of the ChaLearn AutoML challenge competitions [3] held between 2015 and 2018, a competition series that continues to run annually.

Several existing studies provide thorough literature reviews comparing and discussing past work on AutoML approaches and tools [25, 59]. In general, AutoML tools aim to automate various aspects of the machine learning pipeline, including data preprocessing, feature engineering, and model training and validation. A standard full ML pipeline is shown in Figure 1. Based on previous literature reviews of AutoML tools [59, 25], we can divide AutoML frameworks according to which aspect of the machine learning pipeline they try to automate, namely data preparation, feature engineering, model generation, and model evaluation.

Figure 1: Standard Machine Learning Pipeline

2.1 Data Preparation

Data preparation includes data collection and preprocessing, such as data cleaning and data augmentation. Automated data collection is necessary for tasks in which the data needs to be extended and continuously updated. Data cleaning aims to remove noise and handle missing values in the data. Meanwhile, data augmentation enhances the model's performance by generating new data from the existing data, which also helps prevent the trained model from over-fitting.

2.1.1 Data Collection

Plenty of open datasets are now shared among researchers, especially image datasets (e.g. the MNIST handwritten digit dataset [35]). However, it remains a challenge to obtain a high-quality dataset for some specific tasks, including phishing website and email datasets, which require anonymization. This issue can be addressed with two different approaches: data searching and data synthesis. Automated methods that help with data searching include learning-based self-labeling methods for unlabeled data [8] and the synthetic minority over-sampling technique (SMOTE) [7] for dealing with dataset class imbalance. Meanwhile, various data generation techniques also exist to automate the data synthesis process, for example Generative Adversarial Networks (GANs) [30, 47, 5] and reinforcement learning-based methods in data simulators.


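As an illustration of the over-sampling idea behind SMOTE, the sketch below generates synthetic minority samples by interpolating between a point and one of its nearest minority-class neighbors. This is a minimal, illustrative NumPy implementation of the core idea, not the reference algorithm:

```python
import numpy as np

def smote_sample(X_minority, n_new, k=5, rng=None):
    """Generate n_new synthetic minority samples by interpolating between
    each sampled point and one of its k nearest minority-class neighbors
    (the core idea of SMOTE; a minimal sketch, not the full algorithm)."""
    rng = rng or np.random.default_rng(0)
    X = np.asarray(X_minority, dtype=float)
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X))
        # Euclidean distances from point i to all minority points
        d = np.linalg.norm(X - X[i], axis=1)
        neighbors = np.argsort(d)[1:k + 1]   # skip the point itself
        j = rng.choice(neighbors)
        gap = rng.random()                   # interpolation factor in [0, 1)
        synthetic.append(X[i] + gap * (X[j] - X[i]))
    return np.array(synthetic)
```

Each synthetic point lies on the segment between two existing minority samples, so the new data stays inside the minority class's region of the feature space.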
2.1.2 Data Cleaning

Data cleaning is an essential process to eliminate noise in the data, which can negatively affect machine learning model performance if not removed. However, data cleaning is generally a costly task, since it requires experts with specialist knowledge. Over time, there have been efforts to automate the process. Initially, data cleaning was performed through crowd-sourcing [25]. To improve efficiency, a past work proposed a technique in which data scientists define specific data cleaning operations on a small subset of the data and then apply these operations to the full dataset. Later studies attempted to improve this method using machine learning techniques, such as boosting and hyperparameter optimization, to find the best data cleaning operation pipeline or combination [33, 34]. To be applicable to real-world data, data cleaning methods should operate reliably on continuously arriving data. Several past works have proposed techniques for evaluating data cleaning algorithms that run continuously, and for orchestrating cleaning workflows that learn from past cleaning tasks.

2.1.3 Data Augmentation

Data augmentation aims to enrich the dataset by generating new data through transformations of the existing data. Past studies have proposed various neural-based transformations for image data, such as adversarial noise [44], neural style transfers [45], and GAN-based techniques [1]. Meanwhile, there are two approaches to textual data augmentation: data warping and synthetic over-sampling [60]. Recently, methods have been proposed to search for augmentation policies for different tasks using reinforcement learning [11], along with various improved search algorithms [37, 38].

Figure 2: AutoML Pipeline and Components [25]

2.2 Feature Engineering

Feature engineering in supervised machine learning problems is defined as the process of finding explanatory variables that are predictive of the classification outcome [59]. This process is typically performed in a trial-and-error fashion and often requires extensive knowledge of the relevant domain. Feature engineering holds an important role, since its quality heavily affects the machine learning model's performance [13]; however, it is also a time-consuming task. To help with this process, some AutoML tools provide automated feature engineering methods with the goal of constructing new feature sets that yield the best machine learning model performance.

The automated feature engineering task can be formally defined as follows [59]. Given a feature set F = {f_1, ..., f_n} with n features, a target vector y, and a machine learning algorithm A, let P(A, F, y) denote the performance of the model trained with algorithm A on feature set F and target vector y. Assume there are transformation functions t_1, ..., t_m which can be applied to the features, and let T denote a sequence of such transformations applied to the features. The goal of the automated feature engineering process is to find a set of transformation sequences producing a new feature set F* = T(F) which satisfies P(A, F*, y) ≥ P(A, F, y). In other words, automated feature engineering seeks the set of features and feature transformations that gives the best classification performance on structured data.
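This search for a good transformation sequence can be sketched as a greedy loop that keeps any transform improving cross-validated performance. The candidate transforms and the logistic-regression evaluator below are illustrative assumptions, not choices made in the paper:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Candidate feature transformations t_i (illustrative choices)
TRANSFORMS = {
    "identity": lambda X: X,
    "log1p": lambda X: np.log1p(np.abs(X)),
    "square": lambda X: X ** 2,
    "standardize": lambda X: (X - X.mean(0)) / (X.std(0) + 1e-9),
}

def greedy_feature_search(X, y, algorithm=None, rounds=2):
    """Greedily build a transformation sequence: each round keeps the
    transform that most improves cross-validated performance P(A, F, y)."""
    algorithm = algorithm or LogisticRegression(max_iter=1000)
    best_X = np.asarray(X, dtype=float)
    best_score = cross_val_score(algorithm, best_X, y, cv=3).mean()
    sequence = []
    for _ in range(rounds):
        scored = {name: cross_val_score(algorithm, t(best_X), y, cv=3).mean()
                  for name, t in TRANSFORMS.items()}
        name = max(scored, key=scored.get)
        if scored[name] <= best_score:
            break  # no transform improves P; stop searching
        sequence.append(name)
        best_X, best_score = TRANSFORMS[name](best_X), scored[name]
    return sequence, best_score

# Tiny demonstration on synthetic data
from sklearn.datasets import make_classification
X_demo, y_demo = make_classification(n_samples=90, n_features=5, random_state=0)
sequence, score = greedy_feature_search(X_demo, y_demo, rounds=1)
```

Real AutoML systems replace this greedy loop with more sophisticated search strategies, but the objective being optimized is the same P(A, F, y).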

Traditionally, new features are constructed manually by performing standard transformations, such as standardization, normalization, or feature discretization. To improve the efficiency of such processes, automatic feature construction methods using decision trees [19, 61], genetic algorithms [56], and annotation-based approaches [54] have been proposed to search for and evaluate the best combination of transformations.

Besides constructing new features, feature engineering can also be performed by reducing the feature dimensionality to extract the most informative features and reduce redundancies in the feature set. This process is performed by applying mapping functions, such as principal component analysis (PCA) or linear discriminant analysis (LDA). Recent studies have proposed further methods for feature extraction, e.g., autoencoder-based algorithms [43] and unsupervised feature extraction methods [29].
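As a concrete example of dimensionality reduction via a mapping function, the sketch below applies scikit-learn's PCA to synthetic correlated data (the data here is entirely synthetic and illustrative):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# 200 samples with 10 correlated features; only 3 latent directions
# actually carry variance, the rest is small noise
latent = rng.normal(size=(200, 3))
X = latent @ rng.normal(size=(3, 10)) + 0.01 * rng.normal(size=(200, 10))

pca = PCA(n_components=3)
X_reduced = pca.fit_transform(X)   # map the 10-d features to 3 components
explained = pca.explained_variance_ratio_.sum()
```

Because the data is essentially rank 3, the three principal components capture nearly all of the variance, so downstream models lose little information while working in a much smaller feature space.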

2.3 Model Generation and Evaluation

There are two important elements in model generation: the search space and the optimization method. The search space defines the structure and design principles of the machine learning models. Given a certain model, we can apply hyperparameter optimization (HPO), which aims to find the optimal training-related hyperparameters (e.g. learning rate), and architecture optimization (AO), which seeks the best hyperparameters associated with the model's structure or design (e.g., the number of neighbors k for k-NN, or the number of layers or neural architecture for deep neural networks).

Traditional hyperparameter optimization strategies usually do not make any assumptions about the search space. One of the simplest methods is grid search, a brute-force method that finds the best set of hyperparameters given a finite set of user-specified values for each hyperparameter. A simple alternative to grid search is random search, which samples from a user-specified set of hyperparameter values under a certain budget constraint. Another family of methods performs "optimization from samples" [9], e.g. particle swarm optimization (PSO) and evolutionary algorithms [4], both inspired by biological behaviours. Meanwhile, Bayesian optimization has emerged as the most advanced hyperparameter optimization method used in AutoML frameworks: it builds a probabilistic model which maps hyperparameter configurations to their performance with some degree of uncertainty.
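A minimal sketch of random search under a fixed budget, here tuning the number of neighbors k of a k-NN classifier on synthetic data (an illustrative setup, not the paper's experiment):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=300, n_features=8, random_state=0)
rng = np.random.default_rng(0)

# Random search: sample hyperparameters from a user-specified range
# under a fixed budget, keeping the best cross-validated configuration.
budget, best = 10, (None, -np.inf)
for _ in range(budget):
    k = int(rng.integers(1, 30))   # number of neighbors for k-NN
    score = cross_val_score(KNeighborsClassifier(n_neighbors=k),
                            X, y, cv=3).mean()
    if score > best[1]:
        best = (k, score)
```

Bayesian optimization improves on this by using the already-evaluated configurations to decide which configuration to try next, instead of sampling blindly.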

Besides hyperparameter optimization, finding the best model architecture is also an important and non-trivial task when building a machine learning classifier. With the emerging research in neural networks in the past decade, neural architecture search (NAS) has gained great interest in the AutoML community. There are three essential components of NAS: the search space of neural architectures, architecture optimization, and model evaluation methods. An intuitive way to perform model evaluation is to assess model performance after training, once the neural network has converged. However, this takes extensive time and resources due to the amount of computation needed. Past studies have focused on methods to accelerate model training and evaluation. The first approach is low-fidelity model evaluation, which reduces the model size. Another proposed technique is weight sharing, which speeds up model training by reusing knowledge of the weights of prior network architectures. Past studies have also proposed surrogate-based methods which estimate the black-box function of a neural network model, making it easier to obtain the best model configuration and performance. Yet another approach is early stopping, which was initially used in classical ML to prevent over-fitting; more recent studies have improved early stopping to perform its computation on a smaller subset of the data, making it faster. We do not cover this topic in detail in our paper; more thorough discussions of hyperparameter optimization and architecture optimization, especially neural architecture search, are covered by Waring et al. [59] and He et al. [25] in their literature reviews.

3 Phishing Detection Systems

Machine learning has been shown to be effective in detecting phishing attacks in past studies [12]. While attackers use various strategies to conduct phishing attacks, email remains their primary delivery method. In this section, we describe the general workflow of phishing detection systems and how they interact with external parties, e.g. users and blacklist providers. The phishing detection systems workflow is shown in Figure 3.

Figure 3: Phishing Detection Systems

Phishing attacks start with the attacker broadcasting emails containing a message that tries to convince receivers to proceed further by clicking on the provided link. Automated phishing detection systems inspect this email's raw data and analyze the message, URL, content or visual appearance of the associated web page, and page hosting information. After automatically fetching this information, the feature extraction process is performed. Various past studies have investigated the performance of features extracted from phishing emails and websites. The output of the feature extraction process is a vector of values, one per feature, representing an email or website. Afterwards, a classifier is built, which predicts whether an email or a website is a phishing attack. This information is provided to users and phishing blacklist providers to update their databases. In many cases, these detection systems also accept feedback from users or reports from blacklist providers when misclassification occurs. This information serves as ground truth for updating the machine learning model and improving its detection performance.
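The feature extraction step can be illustrated with a few hand-crafted URL features of the kind commonly used in phishing detectors. The specific features below are illustrative only, not the feature sets cited in this paper:

```python
import re
from urllib.parse import urlparse

def extract_url_features(url):
    """Hand-crafted URL features of the kind used in phishing detectors
    (illustrative; not the exact feature sets cited in the paper)."""
    parsed = urlparse(url)
    host = parsed.netloc
    return {
        "url_length": len(url),
        "num_dots": host.count("."),      # many subdomains can be suspicious
        "has_ip_host": bool(re.fullmatch(r"[\d.]+", host)),
        "has_at_symbol": "@" in url,
        "uses_https": parsed.scheme == "https",
        "num_hyphens": host.count("-"),
    }

features = extract_url_features("http://192.168.0.1/paypal-login/verify?id=1")
```

The resulting dictionary becomes one row of the feature matrix fed to the classifier, with the ground-truth phishing/legitimate label as the target.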

4 Can AutoML outperform humans?

The advancement of AutoML frameworks raises the question of whether AutoML can outperform humans in building machine learning models for detecting phishing. To answer this question, we performed an experiment to compare the performances of the models built using AutoML and the ones that are manually crafted.

4.1 Dataset and Performance Metrics

To perform a comparison between AutoML and manually built machine learning models, we tested the models on various phishing email, URL, and website datasets. Besides using publicly available datasets [16, 40, 42, 41, 46], we also performed feature extraction proposed in past studies [16, 6, 22, 52, 57] on a raw phishing email dataset compiled by Verma et al. [58] to construct new sets of phishing email data. In each task of this experiment, we trained machine learning models using a specific dataset and compared the performance of models constructed with and without AutoML frameworks. Further details of each task are provided in Table 1.

Task    Num of Rows  Num of Features  Num of Classes  Details
eml_1a  3668         3                2               Raw phishing email dataset [58] with feature extraction proposed in [16]
eml_1b  3668         23               2               Raw phishing email dataset [58] with feature extraction proposed in [6]
eml_1c  3668         791              2               Raw phishing email dataset [58] with feature extraction proposed in [22]
eml_1d  3668         10               2               Raw phishing email dataset [58] with feature extraction proposed in [52]
eml_1e  3668         579              2               Raw phishing email dataset [58] with feature extraction proposed in [57]
eml_2   9205         3                2               Phishing email dataset and feature extraction proposed in [16]
url_1   96800        500              2               Phishing website URL dataset and feature extraction proposed in [40]
url_2   76728        12               2               Phishing website URL dataset and feature extraction proposed in [42]
url_3   15185        79               5               Phishing website URL dataset and feature extraction proposed in [41]
web_1   8844         30               2               Phishing website HTML dataset and feature extraction proposed in [46]

Table 1: Phishing Dataset Details

While various metrics were measured during this experiment, we are particularly interested in the following performance metrics:

  • Accuracy
    Accuracy measures the number of correct predictions divided by the total size of the data, and can be expressed as:

    accuracy(y, ŷ) = (1/n) Σ_{i=1}^{n} 1(ŷ_i = y_i)

    where ŷ_i is the predicted value of the i-th sample, y_i is the corresponding true value, n is the total number of samples, and 1(·) is the indicator function. In a multi-class classification setting, e.g. Task url_3, this formula calculates the subset accuracy, or the percentage of samples which are classified correctly.

  • AUC score

    AUC, or Area Under the ROC Curve, is the total area under the ROC (receiver operating characteristic) curve, which depicts the model's ability to separate positive and negative examples across all classification thresholds. In a binary classification task, positive and negative examples refer to phishing and legitimate samples. In multi-class classification, we chose the one-vs-one heuristic, where the dataset is split into one dataset for each class versus every other class [24], and the final AUC score is obtained by averaging the AUC over all possible pairwise combinations of classes.

  • F1-score

    Due to the precision-recall trade-off, it is challenging to make both precision and recall high, especially on imbalanced datasets. The F1-score is the harmonic mean of recall (true positive rate) and precision, measuring how well the model performs classification overall while heavily penalizing low recall or low precision.

  • Training duration
    The training duration refers to the total amount of time each framework or algorithm takes to process the given dataset, until a classification model is produced. We do not include the time each model takes to perform prediction on the testing dataset, as it is generally negligible.
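These metrics can be computed directly with scikit-learn; the toy labels and scores below are illustrative:

```python
from sklearn.metrics import accuracy_score, roc_auc_score, f1_score

# Toy binary predictions: 1 = phishing, 0 = legitimate
y_true  = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred  = [1, 1, 0, 0, 0, 1, 0, 1]
y_score = [0.9, 0.8, 0.4, 0.2, 0.1, 0.7, 0.3, 0.6]  # predicted probabilities

acc = accuracy_score(y_true, y_pred)   # fraction of correct predictions
auc = roc_auc_score(y_true, y_score)   # threshold-free ranking quality
f1  = f1_score(y_true, y_pred)         # harmonic mean of precision/recall
# For multi-class AUC (e.g. Task url_3): roc_auc_score(..., multi_class="ovo")
```

Note that AUC is computed from the continuous scores while accuracy and F1-score are computed from the hard predictions, which is why a model can rank samples well (high AUC) yet still lose accuracy at a poorly chosen threshold.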

4.2 Experiment Constraints and Setup

In this study, we compared the performance of various mature open source AutoML frameworks, which are briefly described as follows:

  • AutoGluon [14]
    AutoGluon uses a multi-layer stack ensemble, in which multiple models are ensembled and stacked in multiple layers. The main difference between AutoGluon and existing AutoML frameworks is that AutoGluon utilizes almost every trained model to produce the final prediction instead of only selecting the best model.

  • auto-sklearn [17]
    Auto-sklearn won the ChaLearn AutoML Challenge 1 in 2015-2016 and Challenge 2 in 2017-2018. It uses Bayesian optimization to obtain the best machine learning pipeline, features automatic ensemble construction, and uses meta-learning to increase the probability of finding a good pipeline by warm-starting the search procedure.

  • GAMA [21]
    GAMA supports configurable AutoML pipelines, which allow the selection of optimization and post-processing algorithms. By default, GAMA searches over linear ML pipelines and creates a model ensemble in the post-processing step. These pipelines can be optimized with an asynchronous evolutionary algorithm or ASHA.

  • H2OAutoML [36]
    H2OAutoML performs a random search, which is followed by a model stacking stage. This framework uses the H2O machine learning package by default, which supports distributed training.

  • hyperopt-sklearn [31]
    Hyperopt-sklearn allows various search strategies, including random search, and various sequential model based optimization (SMBO) techniques. These techniques include Tree of Parzen Estimators (TPE), Annealing and Gaussian Process Trees.

  • TPOT [49]
    TPOT constructs machine learning pipelines of arbitrary length using scikit-learn algorithms [51] and also allows the use of the XGBoost algorithm. During its search, both pre-processing and stacking are considered. While the pipeline length is arbitrary, TPOT performs multi-objective optimization, aiming to keep the number of pipeline components minimal while optimizing the main selected metric. TPOT also supports sparse matrices, multiprocessing, and custom pipeline components.
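The stack-ensembling idea shared by several of these frameworks can be sketched with scikit-learn's StackingClassifier; this is a simplified illustration of the concept, not any framework's actual pipeline:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Layer 1: diverse base learners; layer 2: a meta-learner combines their
# cross-validated predictions -- the core idea behind the stacked
# ensembles built by frameworks such as AutoGluon and H2OAutoML.
stack = StackingClassifier(
    estimators=[("rf", RandomForestClassifier(random_state=0)),
                ("knn", KNeighborsClassifier())],
    final_estimator=LogisticRegression(),
    cv=5,
)
score = stack.fit(X_tr, y_tr).score(X_te, y_te)
```

The frameworks above differ mainly in how the base learners are chosen and tuned, how many stacking layers are used, and how many of the trained models contribute to the final prediction.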

To ensure that our evaluation was performed fairly, we trained and tested all the models using the OpenML AutoML Benchmarking Framework [20] to make sure that the models were developed under the same constraints and with the same setup. Evaluating and comparing AutoML systems is challenging due to the subtle differences in problem definition, e.g. the design of the hyperparameter search space or the way time budgets are defined. The OpenML AutoML Benchmarking toolkit aims to address this issue by providing a standardized environment to perform in-depth experiments comparing a wide range of AutoML frameworks.

For each task in Table 1, we performed random splits on the whole dataset, allocating 75% for training and 25% for testing. All tasks were run using the same computing resources: an 8-core Intel Xeon with 32 GB RAM. The maximum runtime for each task was set to 3,600 seconds (1 hour), and each task could use up to 8 cores when multiprocessing was available. To reduce bias from outlier results, we repeated the experiment 10 times for each task and observed the consistency of the evaluation metrics. We optimized the default metrics: the AUC score for binary classification tasks and log loss for multi-class classification tasks.

We also selected several traditional machine learning algorithms to compare with the models built using the aforementioned AutoML frameworks: Logistic Regression, SVM, KNN, Decision Tree, Random Forest, Multi-layer Perceptron, and Gaussian Naive Bayes, all available in the scikit-learn Python package [51]. To train these models, we used the same dataset split and computing resources as in the AutoML setting, with a maximum of 8 cores when multiprocessing was available. For each task in Table 1, we ran 10 experiments to obtain the best model for each algorithm and to observe any variance in the models' performance. We manually defined a set of model hyperparameters and performed random searches to obtain the best model, optimizing the AUC score in binary classification tasks and log loss in multi-class classification tasks. Unlike the AutoML experiment setting, we did not set a maximum runtime per task, as this feature was not available; instead, we fixed the number of random-search iterations per task: 20 for SVM and 100 for all other algorithms.
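A setup of this kind can be sketched with scikit-learn's RandomizedSearchCV. The estimator, search space, and smaller iteration budget below are illustrative assumptions, not the paper's exact configuration:

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, train_test_split

X, y = make_classification(n_samples=400, n_features=12, random_state=0)
# 75% / 25% random split, as in the experiments
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

# Random search over a manually defined hyperparameter space, optimizing
# AUC for a binary task; n_iter plays the role of the search budget.
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions={"n_estimators": randint(50, 300),
                         "max_depth": randint(2, 12)},
    n_iter=10, scoring="roc_auc", cv=3, random_state=0,
)
search.fit(X_tr, y_tr)
```

After fitting, `search.best_estimator_` is the tuned model that would be evaluated on the held-out 25% test split.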

4.3 Results

We computed the average performance metrics of each model and task across all experiments. To compare models built using AutoML frameworks against non-AutoML algorithms, we selected two models from each task: the best model built using an AutoML framework and the best model built using a non-AutoML algorithm. "Best" here means the highest average AUC score, as this metric was optimized during the model search process.

Comparisons of the accuracy and AUC score of the best models built using AutoML and non-AutoML frameworks are provided in Figure 4 and Figure 5. As shown in these figures, the AUC scores and accuracies are very similar between models built using AutoML and manually developed ML models. However, there are exceptions in Task eml_1a and Task eml_2, where AutoML-based models significantly outperformed manually built models: the AUC score difference ranges from 13.5% to 23.3%, and the accuracy difference from 11.6% to 22.9%.

Figure 4: Accuracy

Figure 5: AUC Score

Figure 6: Duration

In terms of training time, we found that the manually trained models reached this level of performance much faster than the AutoML frameworks. Figure 6 summarizes the comparison of training duration between AutoML frameworks and traditional ML algorithms.

5 When Does AutoML Outperform Humans?

To gain a better understanding of the results in Section 4, we analyzed the complexity of performing classification on the datasets assigned to each task using the DCoL library [50]. The aim of this experiment is to understand in what kind of classification task AutoML frameworks provide better results.

The complexity measures that are computed can be grouped into several categories, which are briefly described as follows.

  • Measures of the overlap between different classes based on the discriminative power of the features, including the maximum Fisher's discriminant ratio (F1), directional-vector maximum Fisher's discriminant ratio (F1v), volume of per-class bounding box overlap (F2), maximum individual feature efficiency (F3), and collective feature efficiency (F4).

  • Measures of linearity, which include the minimized sum of the error distance of a linear classifier (L1), training error of a linear classifier (L2), and non-linearity of a linear classifier (L3).

  • Neighborhood measures, which include the fraction of points on the class boundary (N1), ratio of intra/inter class nearest neighbor distance (N2), leave-one-out error rate of the one-nearest neighbor classifier (N3), non-linearity of the one-nearest neighbor classifier (N4), and maximum covering spheres fraction (T1).

  • Measures of dimensionality, which include the average number of points per dimension (T2).

  • Measures of class imbalance, which include the entropy of class proportions (C1) and imbalance ratio (C2).

A complete analysis of task complexity is provided in Figure 7, in which darker cells are associated with higher complexity. In general, a higher value indicates a more complex classification task, with the exception of the F1, F1v, F3, F4, and T2 measures, where a higher value corresponds to a simpler task. Higher complexity means that it is more difficult to achieve good classification performance. We do not discuss the complexity measures in further detail in this paper; we refer readers to [27, 26, 39] for more thorough explanations of measuring supervised classification complexity.

F1: Maximum Fisher's discriminant ratio
F1v: Directional-vector maximum Fisher's discriminant ratio
F2: Overlap of the per-class bounding boxes
F3: Maximum (individual) feature efficiency
F4: Collective feature efficiency (sum of each feature efficiency)
L1: Minimized sum of the error distance of a linear classifier (linear SMO)
L2: Training error of a linear classifier (linear SMO)
L3: Non-linearity of a linear classifier (linear SMO)
N1: Fraction of points on the class boundary
N2: Ratio of average intra/inter class nearest neighbor distance
N3: Leave-one-out error rate of the one-nearest neighbor classifier
N4: Non-linearity of the one-nearest neighbor classifier
T1: Fraction of maximum covering spheres
T2: Average number of points per dimension
C1: Entropy of class proportions
C2: Imbalance ratio
Figure 7: Classification Task Complexity
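As an example of a feature-based complexity measure, the maximum Fisher's discriminant ratio (F1) can be computed directly for a two-class dataset. This is a minimal sketch of the measure; the paper itself used the DCoL library:

```python
import numpy as np

def max_fisher_ratio(X, y):
    """Maximum Fisher's discriminant ratio (complexity measure F1) for a
    two-class dataset: max over features of (mu1 - mu2)^2 / (s1^2 + s2^2).
    Higher values mean more discriminative features, i.e. a simpler task."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    a, b = X[y == 0], X[y == 1]
    numerator = (a.mean(0) - b.mean(0)) ** 2
    denominator = a.var(0) + b.var(0) + 1e-12
    return (numerator / denominator).max()

# Demonstration: well-separated vs fully overlapping synthetic classes
rng = np.random.default_rng(0)
separable = np.vstack([rng.normal(0, 1, (100, 3)),
                       rng.normal(0, 1, (100, 3)) + [4, 0, 0]])
overlapping = rng.normal(0, 1, (200, 3))
labels = np.array([0] * 100 + [1] * 100)
f1_sep = max_fisher_ratio(separable, labels)
f1_ovl = max_fisher_ratio(overlapping, labels)
```

The separable data yields a large F1 (a simple task) while the overlapping data yields a value near zero, matching the interpretation used in Figure 7.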

To observe the relationship between a task's complexity and classification performance, we performed a correlation test between the AutoML performance gain (the improvement of the AutoML-based model over the manually built model) and each complexity measure described above. Full results of this analysis are provided in Table 7 (Appendix B). A summary of the correlation test results that are statistically significant (p < 0.05) is shown in Table 2.

Based on the correlation test, there is a negative relationship between the F1 complexity measure (maximum Fisher's discriminant ratio) and the performance gain in terms of AUC score. Note that a higher Fisher's discriminant ratio corresponds to a simpler classification task, so this negative relationship indicates a larger AUC score gap between AutoML-based models and manually built (non-AutoML) models on more complex tasks (those with lower F1 values). The same holds for the F3 and F4 complexity measures, both of which take higher values on simpler classification tasks: the negative correlation between F3 and the AUC score gain indicates that AutoML-based models outperform more significantly on more complex tasks (lower F3), and the test likewise shows that AutoML-based models outperform manually built models in terms of accuracy on more complex tasks with lower F4 values.

Furthermore, the correlation test also showed a positive correlation between some of the neighborhood-based complexity measures (N1, N3, N4) and the AUC score gain. Unlike the previous measures, a higher neighborhood-based complexity value indicates a more complex classification task. The results in Table 2 therefore show that AutoML-based models are more likely to outperform manually developed classification models on more complex tasks with higher N1, N3, and/or N4 values. Overall, these results suggest that AutoML-based models are most beneficial for building classification models in complex settings where the features are not very discriminative, and on datasets with overlapping classes or relatively high degrees of non-linearity.

Complexity Measure   Performance Gain   Correlation   p-value
F1                   AUC Score          -0.70909      0.021666
F3                   AUC Score          -0.67273      0.033041
F4                   Accuracy           -0.70909      0.021666
N1                   AUC Score          0.745455      0.01333
N3                   AUC Score          0.721212      0.018573
N4                   AUC Score          0.769697      0.009222

Table 2: Correlation between Complexity Measure and Performance Gain (statistically significant results only, p < 0.05)

Referring back to the results in Section 4, we find these correlations consistent with our empirical findings. As shown in Figure 4 and Figure 5, AutoML frameworks significantly outperformed manually crafted ML models on Task eml_1a and Task eml_2. As shown in Figure 7, Task eml_1a is complex according to both its feature-based complexity measures (F1, F3, F4) and its neighborhood-based measures (N1, N3, N4), while the complexity of Task eml_2 is indicated by the feature-based measures (F1, F3, F4). This confirms that AutoML-based models outperform manually built ML models on these types of complex classification tasks.
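The correlation tests in Table 2 can be sketched along the following lines. We assume a Spearman rank correlation over the ten datasets (the paper does not name the test, but the reported coefficients are consistent with rank correlations); the per-dataset values below are hypothetical placeholders, not the paper's data:

```python
from scipy.stats import spearmanr

# Hypothetical per-dataset values: an F1 complexity measure for each of
# ten tasks, and the corresponding AUC-score gain of the best AutoML
# model over the best manually built model.
f1_measure = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.15, 0.1]
auc_gain = [-0.02, -0.01, 0.00, 0.01, 0.02, 0.03, 0.05, 0.08, 0.10, 0.12]

# Rank correlation and its p-value; a negative rho here mirrors the
# paper's finding that lower F1 (harder task) means a larger AutoML gain.
rho, p = spearmanr(f1_measure, auc_gain)
```

With ten data points, only fairly strong monotone relationships clear the p < 0.05 bar, which is why Table 2 retains just six of the measure/metric pairs.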

6 Automating Phishing Detection with AutoML Frameworks

In this section, we discuss the AutoML-based models’ performances in more detail, followed by a discussion of the opportunities and challenges of using AutoML frameworks in automated phishing detection systems, and a note on the study’s limitations and potential future work.

6.1 AutoML-based Models’ Performances

In this section, we analyze the AutoML-based models’ performance on each classification task in more detail, and examine the relationship between a model’s performance and the complexity of the task. We provide the average accuracy, AUC score, and duration of each framework in Table 4 in Appendix A. The average AUC score, accuracy, and F1 score of each task are provided in Figure 8, Figure 9, and Figure 10. In these figures, we also report the standard deviation and the confidence interval of each metric for every classification task.
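The per-task mean and confidence interval shown in these figures can be computed along these lines; we assume a t-based interval over repeated runs, and the helper name is ours:

```python
import numpy as np
from scipy import stats

def mean_with_ci(values, confidence=0.95):
    """Mean of a metric across runs plus a two-sided t-based
    confidence interval (default 95%)."""
    v = np.asarray(values, dtype=float)
    mean = v.mean()
    # Half-width: standard error scaled by the t critical value.
    half = stats.sem(v) * stats.t.ppf((1 + confidence) / 2, len(v) - 1)
    return mean, (mean - half, mean + half)
```

Applied to, say, the accuracy of one framework over several seeds, this yields the error bars plotted per task.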

We performed a correlation test to examine the relationship between an AutoML-based model’s performance on a given classification task and the measures of that task’s complexity, as shown in Table 6 (Appendix B). A summary of the statistically significant results is provided in Table 3. As shown in this table, five complexity measures have a significant correlation with the AutoML-based models’ performance: F4, L3, T1, C1, and C2. A higher F4 measure indicates a simpler classification task, whereas a higher L3, T1, C1, or C2 measure indicates a more complex one. The results in Table 3 are quite intuitive for the F4, L3, C1, and C2 measures, where performance improves on simpler classification tasks. Interestingly, there is a positive correlation between AutoML-based model performance (accuracy and F1 score) and the T1 measure, indicating that the models tend to perform better on datasets with a higher fraction of maximum covering spheres.

Complexity Measure   Performance Metric   Correlation   p-value
F4                   Accuracy             0.648485      0.04254
L3                   Accuracy             -0.69697      0.025097
L3                   F1 Score             -0.68485      0.028883
L3                   AUC Score            -0.7697       0.009222
T1                   Accuracy             0.830303      0.00294
T1                   F1 Score             0.781818      0.007547
C1                   F1 Score             -0.69176      0.026678
C1                   AUC Score            -0.834        0.002705
C2                   F1 Score             -0.73055      0.016409
C2                   AUC Score            -0.88572      0.000649

Table 3: Correlation between Complexity Measure and AutoML-based Models’ Performance (statistically significant results only, p < 0.05)

Figure 8: AutoML-based Models’ AUC Score per Task

Figure 9: AutoML-based Models’ Accuracy per Task

Figure 10: AutoML-based Models’ F1 Score per Task

6.2 Opportunity and Challenges

Machine learning models built using AutoML frameworks have been shown to outperform manually built models (Figure 4 and Figure 5). The performance gain also correlates with several complexity measures, namely F1, F3, F4, N1, N3, and N4 (Table 2), indicating that human experts would benefit greatly from AutoML frameworks for automating machine learning development when dealing with complex classification tasks, particularly datasets whose features have low discriminative power, datasets with high class overlap, and datasets with a high degree of non-linearity.

However, several challenges remain in using AutoML frameworks to implement a full machine learning pipeline for phishing detection systems. AutoML frameworks currently focus on supervised classification tasks with labeled datasets, which are not always available in phishing classification tasks. Furthermore, most AutoML frameworks build stacked or ensemble ML models, which are difficult to implement and update in an incremental learning setting. Given these limitations, the role of data scientists and security experts remains crucial in determining whether the classification models need retraining due to concept drift, or whether the features in the dataset are no longer representative of the attacks. Figure 4 and Figure 5 also show that while AutoML-built models were able to outperform manually built ML models, the performance differences are not significant except in several complex cases. Furthermore, the time needed to train manually built ML models is significantly lower than that needed to train AutoML-based models (Figure 6).
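To illustrate the incremental-update gap: a single linear model can absorb newly labeled batches in place via scikit-learn's partial_fit, whereas the stacked ensembles produced by most AutoML frameworks generally require full retraining. A minimal sketch with synthetic data (not the paper's pipeline; the feature scheme is hypothetical):

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)
clf = SGDClassifier(random_state=0)  # linear model with online updates

# First labeled batch (hypothetical feature vectors; label 1 = phishing,
# here driven by the sign of the first feature for illustration).
X0 = rng.normal(size=(200, 5))
y0 = (X0[:, 0] > 0).astype(int)
clf.partial_fit(X0, y0, classes=np.array([0, 1]))

# A later batch of newly labeled samples updates the same model in
# place -- no retraining from scratch, unlike a stacked AutoML ensemble.
X1 = rng.normal(size=(50, 5))
y1 = (X1[:, 0] > 0).astype(int)
clf.partial_fit(X1, y1)
```

In a drifting-phishing setting, this kind of in-place update is what lets a manually chosen model track new campaigns cheaply; an AutoML ensemble would instead repeat its full (hour-scale, per Table 4) search.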

7 Conclusion

Our study shows that AutoML frameworks can consistently produce models that perform similarly to or better than manually built ML models. However, challenges remain in using these frameworks to fully automate phishing detection, due to their support only for supervised classification problems, which consequently leads to the need for labeled data, and their inability to update the generated models incrementally. This indicates that experts with domain knowledge are still needed in the loop of the full phishing detection pipeline. Despite these limitations, our study also reveals that AutoML frameworks are able to outperform manually developed machine learning models on certain complex classification tasks, indicating that there are opportunities for utilizing AutoML to improve human-guided phishing detection systems. Future studies that further explore the collaboration of human experts and AutoML, as well as the design of human-in-the-loop system architectures, would be beneficial in improving phishing detection.


This work has been supported by the Cyber Security Cooperative Research Centre Limited, whose activities are partially funded by the Australian Government’s Cooperative Research Centres Programme.

Rizka Widyarini Purwanto was supported by a UNSW University International Postgraduate Award (UIPA) scholarship. Any opinions, findings, and conclusions or recommendations expressed in this paper are those of the authors and do not necessarily reflect the views of the scholarship provider.

A sincere thank you to Muhammad Johan Alibasa for his constructive feedback and discussions on the manuscript.


  • [1] A. Antoniou, A. Storkey, and H. Edwards (2017) Data augmentation generative adversarial networks. arXiv preprint arXiv:1711.04340. Cited by: §2.1.3.
  • [2] APWG (2021) APWG phishing trends reports. Anti Phishing Working Group. Cited by: §1, §1.
  • [3] AutoML@ChaLearn. Note: 2021-03-29 Cited by: §2.
  • [4] T. Back (1996) Evolutionary algorithms in theory and practice: evolution strategies, evolutionary programming, genetic algorithms. Oxford university press. Cited by: §2.3.
  • [5] C. Bowles, L. Chen, R. Guerrero, P. Bentley, R. Gunn, A. Hammers, D. A. Dickie, M. V. Hernández, J. Wardlaw, and D. Rueckert (2018) Gan augmentation: augmenting training data using generative adversarial networks. arXiv preprint arXiv:1810.10863. Cited by: §2.1.1.
  • [6] M. Chandrasekaran, K. Narayanan, and S. Upadhyaya (2006) Phishing Email Detection Based on Structural Properties. In NYS Cyber Security Conference, Vol. 3. Cited by: §4.1, Table 1.
  • [7] N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer (2002) SMOTE: synthetic minority over-sampling technique. Journal of Artificial Intelligence Research 16, pp. 321–357. Cited by: §2.1.1.
  • [8] B. Collins, J. Deng, K. Li, and L. Fei-Fei (2008) Towards scalable dataset construction: an active learning approach. In European Conference on Computer Vision, pp. 86–98. Cited by: §2.1.1.
  • [9] A. R. Conn, K. Scheinberg, and L. N. Vicente (2009) Introduction to derivative-free optimization. SIAM. Cited by: §2.3.
  • [10] A. Crisan and B. Fiore-Gartland (2021-01) Fits and Starts: Enterprise Use of AutoML and the Role of Humans in the Loop. arXiv:2101.04296 [cs]. Note: CHI 2021 Conference, 15 pages, 3 figures, 1 table. External Links: Link Cited by: §1.
  • [11] E. D. Cubuk, B. Zoph, D. Mane, V. Vasudevan, and Q. V. Le (2019) Autoaugment: learning augmentation strategies from data. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 113–123. Cited by: §2.1.3.
  • [12] A. Das, S. Baki, A. El Aassal, R. Verma, and A. Dunbar (2019) SoK: A Comprehensive Reexamination of Phishing Research From the Security Perspective. IEEE Communications Surveys & Tutorials 22 (1). Cited by: §3.
  • [13] P. Domingos (2012) A Few Useful Things to Know About Machine Learning. Communications of the ACM 55 (10), pp. 78–87. Cited by: §2.2.
  • [14] N. Erickson, J. Mueller, A. Shirkov, H. Zhang, P. Larroy, M. Li, and A. Smola (2020) AutoGluon-Tabular: Robust and Accurate AutoML for Structured Data. arXiv preprint arXiv:2003.06505. Cited by: 1st item.
  • [15] H. J. Escalante, M. Montes, and L. E. Sucar (2009) Particle swarm model selection.. Journal of Machine Learning Research 10 (2). Cited by: §2.3.
  • [16] I. Fette, N. Sadeh, and A. Tomasic (2007) Learning to Detect Phishing Emails. In Proceedings of the 16th International Conference on World Wide Web, Cited by: §4.1, Table 1.
  • [17] M. Feurer, A. Klein, K. Eggensperger, J. Springenberg, M. Blum, and F. Hutter (2015) Efficient and Robust Automated Machine Learning. In Advances in Neural Information Processing Systems 28, pp. 2962–2970. External Links: Link Cited by: 2nd item.
  • [18] M. Feurer, A. Klein, K. Eggensperger, J. T. Springenberg, M. Blum, and F. Hutter (2015) Efficient and robust automated machine learning. NIPS’15, Cambridge, MA, USA, pp. 2755–2763. Cited by: §2.
  • [19] J. Gama (2004) Functional trees. Machine Learning 55 (3), pp. 219–250. Cited by: §2.2.
  • [20] P. Gijsbers, E. LeDell, S. Poirier, J. Thomas, B. Bischl, and J. Vanschoren (2019) An Open Source AutoML Benchmark. arXiv preprint arXiv:1907.00909 [cs.LG]. Note: Accepted at AutoML Workshop at ICML 2019 External Links: Link Cited by: §4.2.
  • [21] P. Gijsbers and J. Vanschoren (2019) GAMA: Genetic Automated Machine learning Assistant. Journal of Open Source Software 4 (33), pp. 1132. External Links: Document, Link Cited by: 3rd item.
  • [22] C. N. Gutierrez, T. Kim, R. Della Corte, J. Avery, D. Goldwasser, M. Cinque, and S. Bagchi (2018) Learning from The Ones That Got Away: Detecting New Forms of Phishing Attacks. IEEE Transactions on Dependable and Secure Computing 15 (6). Cited by: §4.1, Table 1.
  • [23] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, and I. H. Witten (2009-11) The weka data mining software: an update. SIGKDD Explor. Newsl. 11 (1), pp. 10–18. External Links: ISSN 1931-0145, Document Cited by: §2.
  • [24] D. J. Hand and R. J. Till (2001) A simple generalisation of the area under the roc curve for multiple class classification problems. Machine learning 45 (2), pp. 171–186. Cited by: 2nd item.
  • [25] X. He, K. Zhao, and X. Chu (2021-01) AutoML: A Survey of The State-of-the-art. Knowledge-Based Systems 212, pp. 106622 (en). External Links: ISSN 0950-7051, Link, Document Cited by: Figure 2, §2.1.2, §2.3, §2.
  • [26] T. K. Ho, M. Basu, and M. H. C. Law (2006) Measures of Geometrical Complexity in Classification Problems. In Data Complexity in Pattern Recognition, pp. 1–23. Cited by: §5.
  • [27] T. K. Ho and M. Basu (2002) Complexity Measures of Supervised Classification Problems. IEEE Transactions on Pattern Analysis and Machine Intelligence 24 (3), pp. 289–300. Cited by: §5.
  • [28] J. Hong (2012-01) The state of phishing attacks. Commun. ACM 55 (1), pp. 74–81. External Links: ISSN 0001-0782, Link, Document Cited by: §1.
  • [29] O. Irsoy and E. Alpaydın (2017) Unsupervised feature extraction with autoencoder trees. Neurocomputing 258, pp. 63–73. Cited by: §2.2.
  • [30] T. Karras, S. Laine, and T. Aila (2019) A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4401–4410. Cited by: §2.1.1.
  • [31] B. Komer, J. Bergstra, and C. Eliasmith (2014) Hyperopt-sklearn: Automatic Hyperparameter Configuration for Scikit-Learn. In Proc. SciPy, External Links: Link Cited by: 5th item.
  • [32] L. Kotthoff, C. Thornton, H. H. Hoos, F. Hutter, and K. Leyton-Brown (2017) Auto-WEKA 2.0: Automatic model selection and hyperparameter optimization in WEKA. Journal of Machine Learning Research 18 (25), pp. 1–5. External Links: Link Cited by: §2.
  • [33] S. Krishnan, M. J. Franklin, K. Goldberg, and E. Wu (2017) Boostclean: automated error detection and repair for machine learning. arXiv preprint arXiv:1711.01299. Cited by: §2.1.2.
  • [34] S. Krishnan and E. Wu (2019) Alphaclean: automatic generation of data cleaning pipelines. arXiv preprint arXiv:1904.11827. Cited by: §2.1.2.
  • [35] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner (1998) Gradient-based learning applied to document recognition. Proceedings of the IEEE 86 (11), pp. 2278–2324. Cited by: §2.1.1.
  • [36] E. LeDell and S. Poirier (2020-07) H2O AutoML: Scalable Automatic Machine Learning. 7th ICML Workshop on Automated Machine Learning (AutoML). External Links: Link Cited by: 4th item.
  • [37] C. Lin, M. Guo, C. Li, X. Yuan, W. Wu, J. Yan, D. Lin, and W. Ouyang (2019) Online hyper-parameter learning for auto-augmentation strategy. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6579–6588. Cited by: §2.1.3.
  • [38] T. C. LingChen, A. Khonsari, A. Lashkari, M. R. Nazari, J. S. Sambee, and M. A. Nascimento (2020) UniformAugment: a search-free probabilistic data augmentation approach. arXiv preprint arXiv:2003.14348. Cited by: §2.1.3.
  • [39] A. C. Lorena, L. P. Garcia, J. Lehmann, M. C. Souto, and T. K. Ho (2019) How Complex is Your Classification Problem? A Survey on Measuring Classification Complexity. ACM Computing Surveys (CSUR) 52 (5), pp. 1–34. Cited by: §5.
  • [40] J. Ma, L. K. Saul, S. Savage, and G. M. Voelker (2009) Identifying Suspicious URLs: An Application of Large-Scale Online Learning. In Proceedings of the 26th Annual International Conference on Machine Learning, Cited by: §4.1, Table 1.
  • [41] M. S. I. Mamun, M. A. Rathore, A. H. Lashkari, N. Stakhanova, and A. A. Ghorbani (2016) Detecting Malicious URLs Using Lexical Analysis. In International Conference on Network and System Security, Cited by: §4.1, Table 1.
  • [42] S. Marchal, J. François, R. State, and T. Engel (2014) PhishStorm: Detecting Phishing with Streaming Analytics. IEEE Transactions on Network and Service Management 11 (4). Cited by: §4.1, Table 1.
  • [43] Q. Meng, D. Catchpoole, D. Skillicom, and P. J. Kennedy (2017) Relational autoencoder for feature extraction. In 2017 International Joint Conference on Neural Networks (IJCNN), pp. 364–371. Cited by: §2.2.
  • [44] A. Mikołajczyk and M. Grochowski (2018) Data augmentation for improving deep learning in image classification problem. In 2018 international interdisciplinary PhD workshop (IIPhDW), pp. 117–122. Cited by: §2.1.3.
  • [45] A. Mikołajczyk and M. Grochowski (2019) Style transfer-based image synthesis as an efficient regularization technique in deep learning. In 2019 24th International Conference on Methods and Models in Automation and Robotics (MMAR), pp. 42–47. Cited by: §2.1.3.
  • [46] R. M. Mohammad, F. Thabtah, and L. McCluskey (2012) An Assessment of Features Related to Phishing Websites Using an Automated Technique. In 2012 International Conference for Internet Technology and Secured Transactions, Cited by: §4.1, Table 1.
  • [47] T. Oh, R. Jaroensri, C. Kim, M. Elgharib, F. Durand, W. T. Freeman, and W. Matusik (2018) Learning-based video motion magnification. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 633–648. Cited by: §2.1.1.
  • [48] R. S. Olson, N. Bartley, R. J. Urbanowicz, and J. H. Moore (2016) Evaluation of a tree-based pipeline optimization tool for automating data science. In Proceedings of the Genetic and Evolutionary Computation Conference 2016, GECCO ’16, New York, NY, USA, pp. 485–492. External Links: ISBN 9781450342063, Document Cited by: §2.
  • [49] R. S. Olson, N. Bartley, R. J. Urbanowicz, and J. H. Moore (2016) Evaluation of a Tree-based Pipeline Optimization Tool for Automating Data Science. In Proceedings of the Genetic and Evolutionary Computation Conference 2016, GECCO ’16. External Links: Link, Document Cited by: 6th item.
  • [50] Cited by: §5.
  • [51] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay (2011) Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research 12, pp. 2825–2830. Cited by: §2, 6th item, §4.2.
  • [52] V. Ramanathan and H. Wechsler (2012) PhishGILLNET—Phishing Detection Methodology Using Probabilistic Latent Semantic Analysis, AdaBoost, and Co-training. EURASIP Journal on Information Security 2012 (1). Cited by: §4.1, Table 1.
  • [53] N. Ruiz, S. Schulter, and M. Chandraker (2018) Learning to simulate. arXiv preprint arXiv:1810.02513. Cited by: §2.1.1.
  • [54] P. Sondhi (2009) Feature construction methods: a survey. sifaka. cs. uiuc. edu 69, pp. 70–71. Cited by: §2.2.
  • [55] A. Truong, A. Walters, J. Goodsitt, K. Hines, C. B. Bruss, and R. Farivar (2019-11) Towards Automated Machine Learning: Evaluation and Comparison of AutoML Approaches and Tools. 2019 IEEE 31st International Conference on Tools with Artificial Intelligence (ICTAI), pp. 1471–1479. Note: arXiv: 1908.05557 External Links: Link, Document Cited by: §2.
  • [56] H. Vafaie and K. De Jong (1998) Evolutionary feature space transformation. In Feature Extraction, Construction and Selection, pp. 307–323. Cited by: §2.2.
  • [57] R. Verma and N. Hossain (2013) Semantic Feature Selection for Text with Application to Phishing Email Detection. In International Conference on Information Security and Cryptology, Cited by: §4.1, Table 1.
  • [58] R. M. Verma, V. Zeng, and H. Faridi (2019) Data Quality for Security Challenges: Case Studies of Phishing, Malware and Intrusion Detection Datasets. In Proceedings of the 2019 ACM SIGSAC Conference on Computer and Communications Security, Cited by: §4.1, Table 1.
  • [59] J. Waring, C. Lindvall, and R. Umeton (2020) Automated Machine Learning: Review of the State-of-the-Art and Opportunities for Healthcare. Artificial Intelligence in Medicine 104. External Links: Document Cited by: §2.2, §2.2, §2.3, §2.
  • [60] S. C. Wong, A. Gatt, V. Stamatescu, and M. D. McDonnell (2016) Understanding data augmentation for classification: when to warp?. In 2016 international conference on digital image computing: techniques and applications (DICTA), pp. 1–6. Cited by: §2.1.3.
  • [61] Z. Zheng (1998) A comparison of constructing different types of new feature for decision tree learning. In Feature Extraction, Construction and Selection, pp. 239–255. Cited by: §2.2.

Appendix A Model Performances

In this section, we provide more details regarding the AutoML and traditional ML model performances, as shown in Table 4 and Table 5.

Task    Framework         Accuracy  AUC Score  Duration (sec)
eml_1a  AutoGluon         0.866     0.865      3492.86
eml_1a  auto-sklearn      0.865     0.869      3598.4
eml_1a  GAMA              0.749     0.865      3239.22
eml_1a  H2OAutoML         0.865     0.871      2442.72
eml_1a  hyperopt-sklearn  0.865     0.73       381.92
eml_1a  TPOT              0.86      0.834      3315.22
eml_1b  AutoGluon         0.908     0.913      3525.68
eml_1b  auto-sklearn      0.895     0.913      3601.26
eml_1b  GAMA              0.908     0.908      3239.44
eml_1b  H2OAutoML         0.899     0.906      2602.88
eml_1b  hyperopt-sklearn  0.887     0.821      429.83
eml_1b  TPOT              0.904     0.908      3613.41
eml_1c  AutoGluon         0.965     0.967      3275.3
eml_1c  auto-sklearn      0.933     0.964      3601.11
eml_1c  GAMA              0.946     0.967      3241.95
eml_1c  H2OAutoML         0.942     0.975      3289.61
eml_1c  hyperopt-sklearn  0.942     0.891      343.08
eml_1c  TPOT              0.938     0.965      3047.91
eml_1d  AutoGluon         0.923     0.952      3527.28
eml_1d  auto-sklearn      0.916     0.948      3598.93
eml_1d  GAMA              0.92      0.947      3239.68
eml_1d  H2OAutoML         0.899     0.946      2912.53
eml_1d  hyperopt-sklearn  0.897     0.871      31.57
eml_1d  TPOT              0.915     0.947      3619.78
eml_1e  AutoGluon         0.896     0.901      3491.24
eml_1e  auto-sklearn      0.897     0.876      3599.78
eml_1e  GAMA              0.894     0.873      3240.91
eml_1e  H2OAutoML         0.901     0.886      2621.21
eml_1e  hyperopt-sklearn  0.872     0.802      413.54
eml_1e  TPOT              0.86      0.87       3418.89
eml_2   AutoGluon         0.971     0.984      3461.46
eml_2   auto-sklearn      0.97      0.985      3600.33
eml_2   GAMA              0.64      0.984      3239.53
eml_2   H2OAutoML         0.972     0.984      2483.41
eml_2   hyperopt-sklearn  0.972     0.967      879.84
eml_2   TPOT              0.824     0.984      3610.52
url_1   AutoGluon         nan       nan        12113.62
url_1   auto-sklearn      0.969     0.993      3596.35
url_1   GAMA              0.97      0.993      3305.78
url_1   H2OAutoML         0.964     0.993      3399.22
url_1   hyperopt-sklearn  0.931     0.916      3660.97
url_1   TPOT              0.972     0.985      5274.93
url_2   AutoGluon         0.961     0.994      2912.35
url_2   auto-sklearn      0.952     0.991      3602.72
url_2   GAMA              0.952     0.991      3257.05
url_2   H2OAutoML         0.953     0.992      3281.59
url_2   hyperopt-sklearn  0.946     0.946      1812.67
url_2   TPOT              0.952     0.991      3764.12
url_3   AutoGluon         0.961     0.99       3178.55
url_3   auto-sklearn      0.952     0.986      3600.84
url_3   GAMA              0.952     0.986      3272.36
url_3   H2OAutoML         0.953     0.988      3375.62
url_3   hyperopt-sklearn  0.949     0.974      1094.88
url_3   TPOT              0.952     0.986      3767.93
web_1   AutoGluon         0.942     0.987      3438.56
web_1   auto-sklearn      0.941     0.988      3599.88
web_1   GAMA              0.945     0.989      3240.61
web_1   H2OAutoML         0.947     0.99       3199.18
web_1   hyperopt-sklearn  0.937     0.935      417.4
web_1   TPOT              0.939     0.89       3766.47

Table 4: AutoML-based Model Performance

Task    Model                  Accuracy  AUC Score  Duration (sec)
eml_1a  Logistic Regression    0.749     0.473      1.641
eml_1a  SVM                    0.749     0.473      2.214
eml_1a  KNN                    0.749     0.399      2.179
eml_1a  Decision tree          0.749     0.72       1.499
eml_1a  Random forest          0.749     0.72       7.764
eml_1a  Multilayer perceptron  0.749     0.473      18.032
eml_1a  Gaussian NB            0.749     0.736      0.001
eml_1b  Logistic Regression    0.883     0.862      2.709
eml_1b  SVM                    0.834     0.828      3.608
eml_1b  KNN                    0.752     0.642      2.195
eml_1b  Decision tree          0.896     0.86       1.777
eml_1b  Random forest          0.906     0.908      13.047
eml_1b  Multilayer perceptron  0.749     0.304      23.342
eml_1b  Gaussian NB            0.863     0.849      0.002
eml_1c  Logistic Regression    0.945     0.973      10.8
eml_1c  SVM                    0.943     0.983      28.961
eml_1c  KNN                    0.775     0.765      29.746
eml_1c  Decision tree          0.933     0.888      10.87
eml_1c  Random forest          0.948     0.96       191.959
eml_1c  Multilayer perceptron  0.951     0.976      44.91
eml_1c  Gaussian NB            0.702     0.777      0.045
eml_1d  Logistic Regression    0.909     0.934      1.912
eml_1d  SVM                    0.917     0.937      2.586
eml_1d  KNN                    0.903     0.919      2.963
eml_1d  Decision tree          0.897     0.894      1.801
eml_1d  Random forest          0.916     0.95       19.209
eml_1d  Multilayer perceptron  0.749     0.934      19.411
eml_1d  Gaussian NB            0.839     0.915      0.001
eml_1e  Logistic Regression    0.892     0.856      7.154
eml_1e  SVM                    0.879     0.86       39.138
eml_1e  KNN                    0.868     0.83       91.487
eml_1e  Decision tree          0.845     0.766      3.076
eml_1e  Random forest          0.901     0.847      24.441
eml_1e  Multilayer perceptron  0.893     0.874      36.772
eml_1e  Gaussian NB            0.829     0.841      0.044
eml_2   Logistic Regression    0.603     0.598      1.711
eml_2   SVM                    0.397     0.402      14.276
eml_2   KNN                    0.707     0.694      3.346
eml_2   Decision tree          0.741     0.752      1.647
eml_2   Random forest          0.741     0.752      11.034
eml_2   Multilayer perceptron  0.603     0.5        36.178
eml_2   Gaussian NB            0.383     0.466      0.003
url_1   Logistic Regression    0.975     0.995      473.486
url_1   SVM                    0.788     0.88       5031.541
url_1   KNN                    0.934     0.975      18751.06
url_1   Decision tree          0.943     0.962      3847.875
url_1   Random forest          0.971     0.99       120065.8
url_1   Multilayer perceptron  0.977     0.997      942.103
url_1   Gaussian NB            0.731     0.769      0.773
url_2   Logistic Regression    0.815     0.903      25.737
url_2   SVM                    0.733     0.214      298.13
url_2   KNN                    0.911     0.968      79.309
url_2   Decision tree          0.9       0.959      11.277
url_2   Random forest          0.955     0.992      279.35
url_2   Multilayer perceptron  0.511     0.365      282.544
url_2   Gaussian NB            0.747     0.769      0.017
url_3   Logistic Regression    0.94      0.921      134.057
url_3   SVM                    0.94      0.89       184.147
url_3   KNN                    0.94      0.938      40.508
url_3   Decision tree          0.94      0.917      10.184
url_3   Random forest          0.94      0.981      233.451
url_3   Multilayer perceptron  0.94      0.742      81.508
url_3   Gaussian NB            0.94      0.778      0.028
web_1   Logistic Regression    0.923     0.974      3.601
web_1   SVM                    0.659     0.756      15.978
web_1   KNN                    0.913     0.966      51.269
web_1   Decision tree          0.899     0.947      2.183
web_1   Random forest          0.934     0.98       20.192
web_1   Multilayer perceptron  0.919     0.973      47.325
web_1   Gaussian NB            0.588     0.965      0.005

Table 5: Manually Developed (non-AutoML) Model Performance

Appendix B Correlation Analysis

In this section, we provide results on the analysis of correlation between the complexity measures and the AutoML performances (Table 6), and correlation between the complexity measures and the classification performance gain when using AutoML frameworks (Table 7).

Complexity Measure  Performance Metric  Correlation  p-value
F1                  Accuracy            0.151515     0.676065
F1                  F1 Score            0.187879     0.603218
F1                  AUC Score           -0.21212     0.556306
F1v                 Accuracy            -0.24848     0.488776
F1v                 F1 Score            -0.32121     0.365468
F1v                 AUC Score           -0.55152     0.098401
F2                  Accuracy            0.006465     0.985858
F2                  F1 Score            -0.12284     0.735313
F2                  AUC Score           0.071116     0.845218
F3                  Accuracy            0.515152     0.127553
F3                  F1 Score            0.563636     0.089724
F3                  AUC Score           0.260606     0.467089
F4                  Accuracy            0.648485     0.04254
F4                  F1 Score            0.587879     0.073878
F4                  AUC Score           0.284848     0.425038
L1                  Accuracy            0.418182     0.229113
L1                  F1 Score            0.393939     0.259998
L1                  AUC Score           0.139394     0.700932
L2                  Accuracy            -0.32121     0.365468
L2                  F1 Score            -0.17576     0.627188
L2                  AUC Score           -0.12727     0.726057
L3                  Accuracy            -0.69697     0.025097
L3                  F1 Score            -0.68485     0.028883
L3                  AUC Score           -0.7697      0.009222
N1                  Accuracy            -0.11515     0.75142
N1                  F1 Score            -0.04242     0.907364
N1                  AUC Score           0.29697      0.404702
N2                  Accuracy            0.090909     0.802772
N2                  F1 Score            -0.15152     0.676065
N2                  AUC Score           -0.29697     0.404702
N3                  Accuracy            -0.10303     0.776998
N3                  F1 Score            -0.06667     0.854813
N3                  AUC Score           0.260606     0.467089
N4                  Accuracy            -0.28485     0.425038
N4                  F1 Score            -0.13939     0.700932
N4                  AUC Score           0.10303      0.776998
T1                  Accuracy            0.830303     0.00294
T1                  F1 Score            0.781818     0.007547
T1                  AUC Score           0.587879     0.073878
T2                  Accuracy            0.127273     0.726057
T2                  F1 Score            0.151515     0.676065
T2                  AUC Score           0.2          0.579584
C1                  Accuracy            -0.5366      0.109784
C1                  F1 Score            -0.69176     0.026678
C1                  AUC Score           -0.834       0.002705
C2                  Accuracy            -0.57539     0.081792
C2                  F1 Score            -0.73055     0.016409
C2                  AUC Score           -0.88572     0.000649

Table 6: Correlation between Complexity Measure and AutoML Performance

Complexity Measure  Performance Gain  Correlation  p-value
F1                  Accuracy          -0.21212     0.556306
F1                  F1 Score          -0.38182     0.276255
F1                  AUC Score         -0.70909     0.021666
F1v                 Accuracy          0.272727     0.445838
F1v                 F1 Score          0.333333     0.346594
F1v                 AUC Score         0.006061     0.986743
F2                  Accuracy          0.536602     0.109784
F2                  F1 Score          0.316789     0.37248
F2                  AUC Score         0.38144      0.276771
F3                  Accuracy          -0.62424     0.053718
F3                  F1 Score          -0.41818     0.229113
F3                  AUC Score         -0.67273     0.033041
F4                  Accuracy          -0.70909     0.021666
F4                  F1 Score          -0.44242     0.200423
F4                  AUC Score         -0.53939     0.107593
L1                  Accuracy          0.357576     0.310376
L1                  F1 Score          -0.45455     0.186905
L1                  AUC Score         -0.49091     0.149656
L2                  Accuracy          0.575758     0.081553
L2                  F1 Score          0.115152     0.75142
L2                  AUC Score         0.115152     0.75142
L3                  Accuracy          -0.16364     0.651477
L3                  F1 Score          0.078788     0.828717
L3                  AUC Score         -0.06667     0.854813
N1                  Accuracy          0.563636     0.089724
N1                  F1 Score          0.539394     0.107593
N1                  AUC Score         0.745455     0.01333
N2                  Accuracy          -0.15152     0.676065
N2                  F1 Score          -0.47879     0.161523
N2                  AUC Score         -0.33333     0.346594
N3                  Accuracy          0.624242     0.053718
N3                  F1 Score          0.490909     0.149656
N3                  AUC Score         0.721212     0.018573
N4                  Accuracy          0.127273     0.726057
N4                  F1 Score          0.612121     0.059972
N4                  AUC Score         0.769697     0.009222
T1                  Accuracy          0.10303      0.776998
T1                  F1 Score          -0.57576     0.081553
T1                  AUC Score         -0.38182     0.276255
T2                  Accuracy          0.030303     0.933773
T2                  F1 Score          -0.4303      0.214492
T2                  AUC Score         -0.26061     0.467089
C1                  Accuracy          -0.47195     0.168458
C1                  F1 Score          -0.0194      0.957589
C1                  AUC Score         -0.16163     0.655533
C2                  Accuracy          -0.38144     0.276771
C2                  F1 Score          -0.05819     0.873149
C2                  AUC Score         -0.21335     0.553971

Table 7: Correlation between Complexity Measure and Performance Gain with AutoML