Despite the availability of anti-phishing technologies, phishing attacks are still thriving and have caused data breaches of sensitive personal information and private company data. Phishing attacks were reported to have doubled in 2020, and have caused significant financial losses estimated at between $60 million and $3 billion per year in the United States. With the rapidly evolving nature of phishing, it would be ideal to have an automated detection system that could quickly adapt to changes in phishing data and robustly detect these attacks.
Phishing is a cyber-attack that aims at stealing sensitive information by impersonating a legitimate person, company, or organization. By using social engineering and psychological tactics, phishing attackers exploit human vulnerabilities, sending messages that seem authentic and usually convey a sense of urgency, persuading users to follow the attacker's instructions in the message. While phishing attacks nowadays are also delivered through voice messages and SMS messages, reports show that phishing is still dominated by traditional attacks delivered via email.
During the past decade, machine learning has developed rapidly and has successfully performed a broad range of tasks. Machine learning techniques have been shown to be effective in detecting phishing attacks. However, building a machine learning model is not a trivial task, and requires data scientists with knowledge of the relevant domain. In the past couple of years, Automated Machine Learning (AutoML) has gained much attention, enabling non-ML experts to build machine learning models.
The emergence of AutoML frameworks brings us to the question of whether AutoML-generated models could outperform manually trained machine learning models on phishing data, how AutoML could assist non-experts in building machine learning models, and to what extent we could automate the whole process of building an ML-based detection system. A number of past studies have investigated whether the existence of AutoML frameworks would affect the roles of human experts in the ML development pipeline. However, to the best of our knowledge, we are the first to discuss this topic specifically in the case of phishing detection systems.
2 Automated Machine Learning
There has been significant research conducted in the areas of machine learning and deep learning since 1995, resulting in the development of various tools, such as Weka (1990s), scikit-learn (2007-2010), TensorFlow (2015), Keras (2015), and so on. The emergence of these tools has enabled multidisciplinary research and the application of machine learning and deep learning techniques, which has shown promising results, demonstrating the potential of machine learning models to solve various problems. However, it has also become evident that developing a machine learning model is a complex task, requiring intuition, experience, and technical expertise to tune the model's hyperparameters. The heavy reliance of machine learning development on human experts has motivated researchers to explore the possibility of automating the development of machine learning models. Such attempts, focusing on AutoML projects, were initiated by researchers and machine learning practitioners, and were followed by startups that sell AutoML frameworks as part of their business models. Some of the first AutoML tools were Auto-WEKA, which is based on Weka, followed by auto-sklearn and TPOT, which are both built on the scikit-learn library in Python. Various AutoML frameworks have also emerged as products of the ChaLearn AutoML challenge competitions held between 2015 and 2018, which continue to be conducted every year.
There are existing studies that provide thorough literature reviews comparing and discussing past work on AutoML approaches and tools [25, 59]. In general, AutoML tools aim to automate various aspects of the machine learning pipeline, including data preprocessing, feature engineering, model training, and validation. A standard full ML pipeline is shown in Figure 1. Based on previous literature reviews of AutoML tools [59, 25], we can divide AutoML frameworks by which aspect of the machine learning pipeline they try to automate; namely data preparation, feature engineering, model generation, and model evaluation.
2.1 Data Preparation
Data preparation includes data collection and preprocessing of the data, such as data cleaning and data augmentation. Automated data collection is necessary for tasks in which the data needs to be extended and continuously updated. Data cleaning aims to remove noise and handle missing values in the data. Meanwhile, data augmentation enhances the model's performance by generating new data based on the existing data, which also helps prevent the trained model from over-fitting.
2.1.1 Data Collection
Plenty of open datasets are now shared among researchers, especially image datasets (e.g. the MNIST handwritten digit dataset). However, it remains a challenge to obtain high-quality datasets for some specific tasks, including phishing website and email datasets, which require anonymization. This issue can be addressed with two different approaches: data searching and data synthesis. Automated methods that assist data searching include learning-based self-labeling for unlabeled data and the synthetic minority over-sampling technique (SMOTE) to deal with dataset class imbalance. Meanwhile, various data generation techniques also exist to automate the data synthesis process, for example Generative Adversarial Networks (GAN) [30, 47, 5] and reinforcement learning-based methods in data simulators.
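To make the over-sampling idea concrete, the following is a minimal sketch of SMOTE's core interpolation step in plain Python. The function name and parameters are illustrative; production use would typically rely on an existing implementation such as the one in the imbalanced-learn package.

```python
import random

def smote_sample(minority, k=3, n_new=10, seed=0):
    """Generate synthetic minority samples by interpolating between each
    sample and one of its k nearest neighbours (the core SMOTE idea)."""
    rng = random.Random(seed)

    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # k nearest neighbours of x within the minority class (excluding x)
        neighbours = sorted((p for p in minority if p is not x),
                            key=lambda p: sq_dist(x, p))[:k]
        nb = rng.choice(neighbours)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(tuple(xi + gap * (ni - xi)
                               for xi, ni in zip(x, nb)))
    return synthetic

# Toy minority class: four points in the unit square.
minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
new_points = smote_sample(minority, k=2, n_new=5)
```

Because each synthetic point lies on a segment between two existing minority samples, the generated data stays inside the convex hull of the minority class.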
2.1.2 Data Cleaning
Data cleaning is an essential process to eliminate noise in the data, which could negatively affect machine learning model performance if not removed. However, data cleaning is generally a costly task, since it requires experts with specialist knowledge. Over time, there have been efforts to automate the process of data cleaning. Initially, data cleaning was performed through crowd-sourcing. To enhance efficiency, a past work proposed a data cleaning technique in which data scientists define specific data cleaning operations based on a small subset of the data, and then apply these cleaning operations to the full dataset. Past studies attempted to further improve this method using machine learning techniques, such as boosting and hyperparameter optimization, to find the best data cleaning operation pipeline or combination [33, 34]. To be applicable to real-world data, data cleaning methods should be able to clean data reliably over time. Several past works have proposed techniques to evaluate data cleaning algorithms that operate continuously, and to orchestrate cleaning workflows that learn from past cleaning tasks.
2.1.3 Data Augmentation
Data augmentation aims to enrich the dataset by generating new data through transformations of the existing data. Past studies have proposed various methods to perform neural-based transformations on image data, such as adversarial noise, neural style transfers, and GAN-based techniques. Meanwhile, there are two approaches to textual data augmentation: data warping and synthetic over-sampling. Recently, various methods have been proposed to search for augmentation policies for different tasks using reinforcement learning, as well as various other improved algorithms [37, 38].
2.2 Feature Engineering
Feature engineering in supervised machine learning problems is defined as the process of finding explanatory variables that are predictive of the classification outcome. This process is typically performed in a trial-and-error fashion and often requires extensive knowledge of the relevant domain. Feature engineering plays an important role, since its quality heavily affects the machine learning model's performance; however, it is also a time-consuming task. To help with this process, some AutoML tools provide automated feature engineering methods with the goal of constructing new feature sets that give the best machine learning model performance.
The automated feature engineering task can be formally defined as follows. Given a feature set $F = \{f_1, \dots, f_n\}$ with $n$ features, a target vector $y$, and a machine learning algorithm $A$, let $P(A, F, y)$ indicate the performance of the model with the corresponding algorithm, target vector, and feature set. Assume there are $m$ transformation functions $T = \{t_1, \dots, t_m\}$ which can be applied to the features, and a sequence of transformations of the features $s = (t_{i_1}, \dots, t_{i_k})$. The goal of the automated feature engineering process is to find a set of transformation sequences that produces a new set of features $F^{*}$ satisfying $F^{*} = \arg\max_{F'} P(A, F', y)$. In other words, the goal of automated feature engineering is to obtain the set of features and feature transformations which give the best classification performance from structured data.
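As an illustration of this search, the sketch below greedily evaluates a handful of candidate transformation functions against a simple class-separation score standing in for the model-performance term in the definition above. The transformation set, the scoring function, and the toy data are assumptions made for this example, not taken from any cited framework.

```python
import math

# Candidate transformation functions that can be applied to a feature.
TRANSFORMS = {
    "identity": lambda v: v,
    "log1p":    lambda v: math.log1p(abs(v)),
    "square":   lambda v: v * v,
    "sqrt":     lambda v: math.sqrt(abs(v)),
}

def separation_score(values, labels):
    """Cheap proxy for model performance on a two-class problem:
    squared between-class mean distance over within-class scatter."""
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(y, []).append(v)
    means = {y: sum(g) / len(g) for y, g in groups.items()}
    scatter = sum(sum((v - means[y]) ** 2 for v in g)
                  for y, g in groups.items())
    m0, m1 = means.values()  # assumes exactly two classes
    return (m0 - m1) ** 2 / (scatter + 1e-9)

def best_transform(feature, labels):
    """Greedy step: pick the single transformation maximizing the score."""
    scored = {name: separation_score([t(v) for v in feature], labels)
              for name, t in TRANSFORMS.items()}
    return max(scored, key=scored.get)

# Toy feature where classes are separated by magnitude, not sign, so a
# magnitude-based transform should beat the identity.
feature = [-2.0, -1.9, 2.0, 2.1, 0.1, -0.2, 0.0, 0.15]
labels  = [1, 1, 1, 1, 0, 0, 0, 0]
chosen = best_transform(feature, labels)
```

Real systems replace the proxy score with actual cross-validated model performance and search over sequences of transformations rather than a single greedy step.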
Traditionally, new features are constructed manually by performing some standard transformations, such as standardization, normalization, or feature discretization. To improve the efficiency of such processes, automatic feature construction methods using decision trees [19, 61] and annotation-based approaches have been proposed to search for and evaluate the best combination of transformations.
Besides the construction of new features, feature engineering can also be performed by reducing the feature dimensionality to extract the most informative features and reduce redundancies in the feature set. This is done by applying mapping functions, such as principal component analysis (PCA) or linear discriminant analysis (LDA). Recent studies have proposed other improved methods to perform feature extraction, e.g., autoencoder-based algorithms and unsupervised feature extraction methods.
2.3 Model Generation and Evaluation
There are two important elements in model generation: the search space and the optimization method. The search space defines the structure and design principles of the machine learning models. Given a certain model, we could apply hyperparameter optimization (HPO), which aims to find the optimal training-related hyperparameters (e.g. learning rate), and architecture optimization (AO), to obtain the best hyperparameters associated with the model's structure or design (e.g., the number of neighbors for k-NN, or the number of layers or neural architectures for deep neural networks).
Traditional hyperparameter optimization strategies usually do not make any assumptions about the search space. One of the simplest hyperparameter optimization methods is grid search, a brute-force method that finds the best set of hyperparameters given a finite set of values for each hyperparameter specified by the user. A simple alternative to grid search is random search, which samples from a user-specified set of hyperparameter values under a certain budget constraint. Another kind of hyperparameter optimization method performs "optimization from samples", e.g., particle swarm optimization (PSO), which is inspired by biological behaviours. Meanwhile, Bayesian optimization has emerged as the most advanced hyperparameter optimization method used in AutoML frameworks. Bayesian optimization builds a probabilistic model, which maps different hyperparameter configurations to their performance with some degree of uncertainty.
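A minimal sketch of the grid search and random search strategies described above, using a toy objective in place of actual model training; the objective function and search space are hypothetical.

```python
import itertools
import random

def grid_search(train_eval, space):
    """Exhaustively evaluate every combination of hyperparameter values."""
    best_cfg, best_score = None, float("-inf")
    names = list(space)
    for combo in itertools.product(*(space[n] for n in names)):
        cfg = dict(zip(names, combo))
        score = train_eval(cfg)
        if score > best_score:
            best_cfg, best_score = cfg, score
    return best_cfg, best_score

def random_search(train_eval, space, n_iter=20, seed=0):
    """Sample configurations at random under a fixed evaluation budget."""
    rng = random.Random(seed)
    best_cfg, best_score = None, float("-inf")
    for _ in range(n_iter):
        cfg = {n: rng.choice(v) for n, v in space.items()}
        score = train_eval(cfg)
        if score > best_score:
            best_cfg, best_score = cfg, score
    return best_cfg, best_score

# Toy objective standing in for "train a model, return its validation
# score": it peaks at lr=0.1 and depth=4.
toy = lambda cfg: -abs(cfg["lr"] - 0.1) - 0.05 * abs(cfg["depth"] - 4)

space = {"lr": [0.001, 0.01, 0.1, 1.0], "depth": [2, 4, 8, 16]}
grid_best, grid_score = grid_search(toy, space)
rand_best, rand_score = random_search(toy, space, n_iter=10)
```

Grid search always finds the best configuration in the space at the cost of evaluating every combination; random search trades completeness for a fixed budget, which is why its best score can never exceed the grid-search optimum on the same space.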
Besides hyperparameter optimization, finding the best model architecture is also an important and non-trivial task when building a machine learning classifier. With the emerging research in neural networks in the past decade, neural architecture search (NAS) has gained great interest in the AutoML community. There are three essential components of NAS: the search space of neural architectures, architecture optimization, and model evaluation methods. An intuitive way to perform model evaluation is to assess model performance after training, once the neural network has converged. However, this takes extensive time and resources due to the amount of computation needed. Past studies have therefore focused on methods to accelerate the model training and evaluation process. The first approach is low-fidelity model evaluation, which reduces the model size. Another proposed technique is weight sharing, which speeds up model training by utilizing knowledge about the weights of prior network architectures. Past studies have also proposed surrogate-based methods that estimate the black-box function of a neural network model, making it easier to obtain the best model configuration and performance. Yet another approach is early stopping, which was initially used in classical ML to prevent over-fitting; more recent studies have improved early stopping to perform its computation on a smaller set of data, making it faster to compute. We will not cover this topic in detail in our paper. However, more thorough discussions of studies on hyperparameter optimization and architecture optimization, especially neural architecture search (NAS), are covered by Waring et al. and He et al. in their literature reviews.
3 Phishing Detection Systems
Past studies have shown machine learning to be effective in detecting phishing attacks. While attackers have been using various strategies to conduct phishing attacks, email still remains their primary delivery method. In this section, we provide the general workflow of phishing detection systems and describe how they interact with external parties, e.g. users and blacklist providers. The phishing detection system workflow is shown in Figure 3.
Phishing attacks start with the attacker broadcasting emails that contain a message which tries to convince the receivers to proceed further by clicking on the provided link. Automated phishing detection systems inspect the email's raw data and analyze the message, the URL, the content or visual appearance of the associated web page, and page hosting information. After automatically fetching this information, the feature extraction process is performed. Various past studies have focused on investigating the performance of features extracted from phishing emails and websites. The output of the feature extraction process is a vector that contains values associated with each specific feature, each vector representing an email or website. Afterwards, a classifier is built, which predicts whether an email or a website is a phishing attack. This information is provided to users and phishing blacklist providers to update their databases. In many cases, these detection systems also accept feedback from users or reports from blacklist providers when misclassification occurs. This information is treated as ground truth for updating the machine learning model and improving its detection performance.
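As an illustration of the feature extraction step, the sketch below derives a few lexical URL features of the kind often discussed in the phishing literature (URL length, IP-address hosts, '@' symbols). The exact feature set here is an example chosen for this sketch, not the set used in any particular cited study.

```python
import re
from urllib.parse import urlparse

def extract_url_features(url):
    """Illustrative lexical features for a URL; names are examples only."""
    parsed = urlparse(url)
    host = parsed.netloc.split(":")[0]  # drop any port
    return {
        "url_length":    len(url),
        "num_dots":      host.count("."),
        "num_hyphens":   host.count("-"),
        "has_at_symbol": "@" in url,
        # Hosts given as raw IPv4 addresses are a common phishing signal.
        "has_ip_host":   bool(re.fullmatch(r"\d{1,3}(\.\d{1,3}){3}", host)),
        "uses_https":    parsed.scheme == "https",
    }

features = extract_url_features("http://192.168.0.1/paypal-login/verify?acct=1")
```

The resulting dictionary is the per-sample feature vector described above; in a full system, many such vectors are stacked into a matrix and fed to the classifier.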
4 Can AutoML outperform humans?
The advancement of AutoML frameworks raises the question of whether AutoML can outperform humans in building machine learning models for detecting phishing. To answer this question, we performed an experiment to compare the performances of the models built using AutoML and the ones that are manually crafted.
4.1 Dataset and Performance Metrics
To perform a comparison between AutoML and manually built machine learning models, we tested the models on various phishing email, URL, and website datasets. Besides using publicly available datasets [16, 40, 42, 41, 46], we also performed feature extraction proposed in past studies [16, 6, 22, 52, 57] on a raw phishing email dataset compiled by Verma et al. to construct new sets of phishing email data. In each task of this experiment, we trained machine learning models using a specific dataset and compared the performance of models constructed with and without AutoML frameworks. Further details of each task are provided in Table 1.
While various metrics were measured during this experiment, we are particularly interested in the following performance metrics:
Accuracy measures the number of correct predictions divided by the total size of the data, and can be expressed as:

$$\text{Accuracy} = \frac{1}{n} \sum_{i=1}^{n} \mathbb{1}(\hat{y}_i = y_i)$$

where $\hat{y}_i$ is the predicted value of the $i$-th sample, $y_i$ is the corresponding true value, and $n$ is the total number of samples. In a multi-class classification setting, e.g. Task url_3, this formula calculates the subset accuracy, i.e., the percentage of samples which are classified correctly.
AUC, or Area Under the ROC Curve, is the total area under the ROC (receiver operating characteristic) curve, which depicts the model's performance in distinguishing the positive and negative examples for each class and finding the best threshold to separate them. In a binary classification task, positive and negative examples refer to phishing and legitimate samples. In multi-class classification, we chose the one-vs-one heuristic, where the dataset is split into one dataset for each class versus every other class, and the final AUC score is obtained by computing the average AUC over all possible pairwise combinations of classes.
Due to the precision-recall trade-off, it is challenging to achieve both high precision and high recall, especially on imbalanced datasets. The F1-score is the harmonic mean of recall (true positive rate) and precision, measuring how well the model performs classification while heavily penalizing low recall or low precision scores.
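The metrics above can be computed directly from prediction counts; the following is a small self-contained sketch for the binary case, treating phishing as the positive class.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 from raw predictions,
    with label 1 as the phishing (positive) class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)

    accuracy  = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall    = tp / (tp + fn) if tp + fn else 0.0
    # Harmonic mean: a low precision OR a low recall drags F1 down sharply.
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Toy example: 8 samples, one false negative and one false positive.
m = binary_metrics(y_true=[1, 1, 1, 0, 0, 0, 0, 0],
                   y_pred=[1, 1, 0, 0, 0, 0, 1, 0])
```

In practice these values come from library routines (e.g. scikit-learn's metrics module), but the counts-based definitions are what those routines compute.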
The training duration refers to the total amount of time each framework or algorithm takes to process the given dataset, until a classification model is produced. We do not include the time each model takes to perform prediction on the testing dataset, as it is generally negligible.
4.2 Experiment Constraints and Setup
In this study, we compared the performance of various mature open source AutoML frameworks, which are briefly described as follows:
AutoGluon uses a multi-layer stack ensemble, in which multiple models are ensembled and stacked in multiple layers. The main difference between AutoGluon and existing AutoML frameworks is that AutoGluon utilizes almost every trained model to produce the final prediction instead of only selecting the best model.
Auto-sklearn was a winner of the ChaLearn AutoML Challenge 1 in 2015-2016 and ChaLearn AutoML Challenge 2 in 2017-2018. It uses Bayesian optimization to obtain the best machine learning pipeline. Auto-sklearn features automatic ensemble construction and uses meta-learning to increase the probability of finding a good pipeline by warm-starting the search procedure.
GAMA supports configurable AutoML pipelines, which allow the selection of optimization and post-processing algorithms. By default, GAMA searches over linear ML pipelines and creates a model ensemble in the post-processing step. These pipelines can be optimized with an asynchronous evolutionary algorithm or ASHA.
H2OAutoML performs a random search, which is followed by a model stacking stage. This framework uses the H2O machine learning package by default, which supports distributed training.
Hyperopt-sklearn allows various search strategies, including random search, and various sequential model based optimization (SMBO) techniques. These techniques include Tree of Parzen Estimators (TPE), Annealing and Gaussian Process Trees.
TPOT constructs machine learning pipelines of arbitrary length using scikit-learn algorithms and also allows the use of the XGBoost algorithm. During its search, pre-processing and stacking are both considered. While the model's pipeline length is arbitrary, TPOT performs multi-objective optimization, aiming to keep the number of pipeline components minimal while optimizing the main selected metric. TPOT also provides support for sparse matrices, multiprocessing, and custom pipeline components.
To ensure that our evaluation was performed fairly, we trained and tested all the models using the OpenML AutoML Benchmarking Framework  to make sure that the models were developed under the same constraints and with the same setup. Evaluating and comparing AutoML systems is challenging due to the subtle differences in problem definition, e.g. the design of the hyperparameter search space or the way time budgets are defined. The OpenML AutoML Benchmarking toolkit aims to address this issue by providing a standardized environment to perform in-depth experiments comparing a wide range of AutoML frameworks.
For each task in Table 1, we performed random splits on the whole dataset, allocating 75% for training and 25% for testing. All tasks were run using the same computing resources, with an 8-core Intel Xeon and 32 GB RAM. The maximum runtime for each task was set to 3,600 seconds (1 hour). Each task could use a maximum of 8 cores when multiprocessing was available. To reduce biases resulting from outlier results, we repeated the experiment 10 times for each task and observed the consistency of the performance in the evaluation metrics. We selected the default metrics to optimize, which were the AUC score for binary classification tasks and log loss for multi-class classification tasks.
We also selected several traditional machine learning algorithms to compare with the models built using the aforementioned AutoML frameworks. The selected algorithms are Logistic Regression, SVM, KNN, Decision Tree, Random Forest, Multi-layer Perceptron, and Gaussian Naive Bayes, all of which are available in the scikit-learn Python package. To train these models, we used the same dataset split and computing resources as in the AutoML setting, with a maximum of 8 cores when multiprocessing was available. For each task in Table 1, we ran 10 experiments to obtain the best model for each algorithm and to observe any variance in the models' performance. We manually defined a set of model hyperparameters and performed random searches to obtain the best model, optimizing the AUC score in binary classification tasks and log loss in multi-class classification tasks. Unlike the AutoML experiment setting, we did not set a maximum runtime for each task, as this feature was not available. However, we set the number of iterations for the random search in each task: 20 for SVM and 100 for all other algorithms.
We computed the average performance metrics of each model and task in all experiments. To compare models built using AutoML frameworks with non-AutoML algorithms, we selected two models from each task: the best model built using an AutoML framework and the best model built with non-AutoML algorithms. Best performance here refers to the average AUC score, as this metric was optimized during the model search process.
Comparisons of the accuracy and AUC score of the best model built using AutoML and non-AutoML frameworks are provided in Figure 4 and Figure 5. As shown in these figures, the performances in terms of AUC score and accuracy are largely comparable between models built using AutoML and manually developed ML models. However, there are exceptions in Task eml_1a and Task eml_2, in which AutoML-based models significantly outperformed manually built models. The AUC score difference is between 13.5% and 23.3%, and the accuracy difference is between 11.6% and 22.9%.
In terms of training time, we found that it was much faster to train the models manually and achieve this level of performance without AutoML frameworks. Figure 6 provides a summary of the comparison of training duration between AutoML frameworks and traditional ML algorithms.
5 When Does AutoML Outperform Humans?
To gain a better understanding of the results in Section 4, we analyzed the complexity of performing classification on the datasets assigned to each task using the DCoL library. The aim of this experiment is to understand in what kinds of classification tasks AutoML frameworks provide better results.
The complexity measures that are computed can be grouped into several categories, which are briefly described as follows.
Measures of the overlap between different classes based on the discriminative power of the features, including the maximum Fisher's discriminant ratio (F1), directional-vector maximum Fisher's discriminant ratio (F1v), volume of per-class bounding boxes overlap (F2), maximum individual feature efficiency (F3), and collective feature efficiency (F4).
Measures of linearity, which includes the minimised sum of the error distance of a linear classifier (L1), training error of a linear classifier (L2), and non-linearity of a linear classifier (L3).
Neighborhood measures, which includes the fraction of points on the class boundary (N1), ratio of intra/inter class nearest neighbor distance (N2), leave-one-out error rate of the one-nearest neighbor classifier (N3), non-linearity of the one-nearest neighbor classifier (N4), and maximum covering spheres fraction (T1).
Measures of dimensionality, which includes the average number of points per dimension (T2).
Measures of class imbalance, which includes the entropy of class proportions (C1) and imbalance ratio (C2).
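As an example of how one of these measures is computed, the sketch below implements a simple version of the maximum Fisher's discriminant ratio (F1) for a two-class dataset, following the standard per-feature definition; the toy data is hypothetical.

```python
def fisher_f1(features, labels):
    """Maximum Fisher's discriminant ratio (complexity measure F1) for a
    two-class dataset: per feature, (mu0 - mu1)^2 / (var0 + var1),
    maximised over features. Higher values mean more discriminative
    features, i.e. a *simpler* classification task."""
    def mean_var(vals):
        m = sum(vals) / len(vals)
        v = sum((x - m) ** 2 for x in vals) / len(vals)
        return m, v

    best = 0.0
    for j in range(len(features[0])):
        c0 = [row[j] for row, y in zip(features, labels) if y == 0]
        c1 = [row[j] for row, y in zip(features, labels) if y == 1]
        (m0, v0), (m1, v1) = mean_var(c0), mean_var(c1)
        ratio = (m0 - m1) ** 2 / (v0 + v1) if v0 + v1 else float("inf")
        best = max(best, ratio)
    return best

# Toy data: feature 0 separates the classes well; feature 1 does not.
X = [(0.0, 5.1), (0.2, 4.9), (0.1, 5.0),
     (5.0, 5.0), (5.2, 5.2), (5.1, 4.8)]
y = [0, 0, 0, 1, 1, 1]
f1_measure = fisher_f1(X, y)
```

Because feature 0 cleanly separates the two classes, the measure is large here, marking the toy task as simple; on overlapping classes the ratio shrinks toward zero.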
A complete analysis of task complexity is provided in Figure 7. In this figure, darker cells are associated with higher complexity. In general, a higher parameter value usually indicates that the classification task is more complex, with an exception for F1, F1v, F3, F4, and T2 complexity measures where a higher measure value corresponds to simpler classification tasks. Higher complexity means that it is more difficult to achieve good classification performances. Further details on complexity measures are not discussed in our paper. However, we refer to [27, 26, 39] for those interested in more detailed explanations regarding measuring supervised classification complexity.
To observe the relationship between a task's complexity and classification performance, we performed a correlation test between the AutoML performance gain, or improvement, and each of the complexity measures previously mentioned. Full results of this analysis are provided in Table 7 (Appendix B). A summary of the correlation test with statistically significant results is shown in Table 2.
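The correlation computation itself can be sketched as follows, using hypothetical complexity and performance-gain values rather than the actual values from our experiments.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two paired samples,
    e.g. a complexity measure and the AutoML performance gain per task."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical per-task values: F1 complexity measure vs. AUC gain of
# AutoML over manual models. Gain grows as F1 shrinks (harder tasks).
f1_values = [0.9, 0.7, 0.5, 0.3, 0.1]
auc_gain  = [0.00, 0.02, 0.05, 0.12, 0.20]
r = pearson_r(f1_values, auc_gain)
```

A strongly negative coefficient on such data mirrors the pattern reported in Table 2: lower F1 values (harder tasks) coincide with larger AutoML gains.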
Based on the correlation test, there is a negative relationship between the F1 complexity measure (maximum Fisher's discriminant ratio) and the performance gain in terms of AUC score. Note that a higher Fisher's discriminant ratio corresponds to a simpler classification task. The negative relationship between the F1 complexity measure and the AUC score gain therefore indicates a larger AUC score difference between AutoML-based models and manually built (non-AutoML) models in more complex tasks (with a lower F1 complexity measure). The same holds for the F3 and F4 complexity measures, which both take higher values when the classification task is simpler. Thus, a negative correlation between the F3 complexity measure and the AUC score gain indicates that AutoML-based models outperform more significantly in more complex classification tasks (with a lower F3 complexity measure). The correlation test also shows that AutoML-based models outperform manually built models, in terms of accuracy, in more complex classification tasks with a lower F4 complexity measure.
Furthermore, the correlation test showed positive correlations between several neighbor-based complexity measures (N1, N3, N4) and the AUC score gain. Unlike the previous measures, a higher neighbor-based measure indicates a more complex classification task. The results in Table 2 thus show that AutoML-based models are more likely to outperform manually developed classification models on more complex tasks with higher N1, N3, and/or N4 values. Taken together, these results suggest that AutoML-based models are most beneficial for building classifiers in complex settings where the features are not very discriminative, the classes overlap, or the decision boundary is highly non-linear.
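One of these neighbor-based measures, N3, is the leave-one-out error rate of a 1-NN classifier: it rises as neighboring points increasingly belong to different classes. A minimal sketch (see [39] for the formal definition):

```python
import numpy as np

def n3_error_rate(X, y):
    """Leave-one-out error rate of a 1-NN classifier (N3).

    A high value means nearby points frequently belong to different
    classes (class overlap or an intricate boundary), i.e. a *more
    complex* task.
    """
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    # Pairwise squared Euclidean distances between all points.
    d = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    np.fill_diagonal(d, np.inf)   # a point may not be its own neighbor
    nearest = d.argmin(axis=1)    # index of each point's nearest neighbor
    return float((y[nearest] != y).mean())
```

Two tight, well-separated clusters give N3 = 0, while two classes drawn from the same distribution drive N3 toward 0.5.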
Table 2: Statistically significant correlations between complexity measures and performance gain (columns: Complexity Measure, Performance Gain, Correlation, p-value).
Referring back to Section 4, these findings are consistent with our empirical results. As shown in Figure 4 and Figure 5, the AutoML frameworks significantly outperformed manually crafted ML models on Task eml_1a and Task eml_2. Figure 7 shows that Task eml_1a is complex according to both its feature-based measures (F1, F3, F4) and its neighborhood-based measures (N1, N3, N4), while the complexity of Task eml_2 is indicated by the feature-based measures (F1, F3, F4). This confirms that AutoML-based models outperform manually built ML models on these types of complex classification tasks.
6 Automating Phishing Detection with AutoML Frameworks
In this section, we discuss the AutoML-based models’ performance in more detail, followed by the opportunities and challenges of using AutoML frameworks in automated phishing detection systems, and a discussion of the study’s limitations and potential future work.
6.1 AutoML-based Models’ Performances
In this section, we further analyze the AutoML-based models’ performance on each classification task and examine the relationship between a model’s performance and the complexity of the task. Details of each framework’s average accuracy, AUC score, and training duration are given in Table 4 (Appendix A). The average AUC score, accuracy, and F1 score for each task are provided in Figure 8, Figure 9, and Figure 10, respectively.
We performed a correlation test between the performance of an AutoML-based model on a specific classification task and each measure of the task’s complexity, as shown in Table 6 (Appendix B). A summary of the statistically significant results is given in Table 3. As shown in this table, five complexity measures correlate significantly with the AutoML-based models’ performance: F4, L3, T1, C1, and C2. A higher F4 value indicates a simpler classification task, whereas higher L3, T1, C1, or C2 values indicate a more complex one. The relationships between the F4, L3, C1, and C2 measures and the performance metrics in Table 3 are intuitive: performance improves on simpler tasks. Interestingly, there is a positive correlation between AutoML-based model performance (accuracy and F1 score) and the T1 measure, suggesting that the models perform better on datasets with a higher fraction of maximum covering spheres.
Table 3: Statistically significant correlations between complexity measures and AutoML-based model performance (columns: Complexity Measure, Performance Metric, Correlation, p-value).
6.2 Opportunities and Challenges
Machine learning models built using AutoML frameworks have been shown to outperform manually built models (Figure 4 and Figure 5). The performance gain also correlates with several complexity measures, namely F1, F3, F4, N1, N3, and N4 (Table 2), indicating that AutoML frameworks would greatly benefit human experts in automating machine learning development when dealing with complex classification tasks, particularly datasets whose features have low discriminative power, datasets with high class overlap, and datasets with a high degree of non-linearity.
However, several challenges remain in using AutoML frameworks to implement a full machine learning pipeline for phishing detection systems. Current AutoML frameworks focus on supervised classification with labeled datasets, which are not always available for phishing classification tasks. Furthermore, most AutoML frameworks build stacked or ensemble models, which are difficult to implement and update in an incremental learning setting. Given these limitations, data scientists and security experts remain crucial for deciding whether a classification model needs retraining due to concept drift, or whether the dataset’s features are no longer representative of current attacks. Figure 4 and Figure 5 also show that while AutoML-built models were able to outperform manually built ML models, the performance differences are not significant except in several complex cases. Furthermore, the time needed to train manually built ML models is significantly lower than for AutoML-based models (Figure 6).
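To illustrate the incremental-update gap, the sketch below shows the kind of cheap online update a simple scikit-learn model supports via `partial_fit`; a stacked AutoML ensemble would typically need a full retrain for each new batch. The feature stream and labeling rule here are synthetic and purely illustrative.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(42)
classes = np.array([0, 1])  # 0 = legitimate, 1 = phishing
clf = SGDClassifier(random_state=0)

# Labeled batches arriving over time, e.g. as new phishing
# campaigns are reported and verified by analysts.
for _ in range(20):
    X_batch = rng.normal(size=(32, 8))
    # Illustrative ground truth: a simple linear concept.
    y_batch = (X_batch[:, 0] + X_batch[:, 1] > 0).astype(int)
    # The linear model absorbs each batch incrementally; a stacked
    # AutoML ensemble would be retrained from scratch instead.
    clf.partial_fit(X_batch, y_batch, classes=classes)

X_test = rng.normal(size=(200, 8))
y_test = (X_test[:, 0] + X_test[:, 1] > 0).astype(int)
acc = clf.score(X_test, y_test)
```

The contrast is the point: `partial_fit` processes a batch in milliseconds, whereas re-running an AutoML search over pipelines and ensembles on the accumulated data takes the training times reported in Figure 6.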
Our study shows that AutoML frameworks consistently produce models that perform similarly to or better than manually built ML models. However, challenges remain in using these frameworks to fully automate phishing detection: they support only supervised classification problems, which in turn requires labeled data, and the generated models cannot be updated incrementally. Experts with domain knowledge are therefore still needed in the loop of the full phishing detection pipeline. Despite these limitations, our study also reveals that AutoML frameworks can outperform manually developed machine learning models on certain complex classification tasks, indicating opportunities for using AutoML to improve human-guided phishing detection systems. Future studies that explore the collaboration between human experts and AutoML, as well as the design of human-in-the-loop system architectures, would be beneficial for improving phishing detection.
Acknowledgments

This work has been supported by the Cyber Security Cooperative Research Centre Limited, whose activities are partially funded by the Australian Government’s Cooperative Research Centres Programme.
Rizka Widyarini Purwanto was supported by a UNSW University International Postgraduate Award (UIPA) scholarship. Any opinions, findings, and conclusions or recommendations expressed in this paper are those of the authors and do not necessarily reflect the views of the scholarship provider.
A sincere thank you to Muhammad Johan Alibasa for his constructive feedback and discussions on the manuscript.
References

-  (2017) Data augmentation generative adversarial networks. arXiv preprint arXiv:1711.04340. Cited by: §2.1.3.
-  (2021) APWG phishing trends reports. Anti Phishing Working Group. Cited by: §1, §1.
-  AutoML@ChaLearn. Note: https://automl.chalearn.org/, accessed 2021-03-29. Cited by: §2.
-  (1996) Evolutionary algorithms in theory and practice: evolution strategies, evolutionary programming, genetic algorithms. Oxford university press. Cited by: §2.3.
-  (2018) Gan augmentation: augmenting training data using generative adversarial networks. arXiv preprint arXiv:1810.10863. Cited by: §2.1.1.
-  (2006) Phishing Email Detection Based on Structural Properties. In NYS Cyber Security Conference, Vol. 3. Cited by: §4.1, Table 1.
-  SMOTE: Synthetic Minority Over-sampling Technique. Journal of Artificial Intelligence Research 16, pp. 321–357. Cited by: §2.1.1.
-  Towards scalable dataset construction: an active learning approach. In European Conference on Computer Vision, pp. 86–98. Cited by: §2.1.1.
-  (2009) Introduction to derivative-free optimization. SIAM. Cited by: §2.3.
-  (2021-01) Fits and Starts: Enterprise Use of AutoML and the Role of Humans in the Loop. arXiv:2101.04296 [cs]. Cited by: §1.
-  Autoaugment: learning augmentation strategies from data. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 113–123. Cited by: §2.1.3.
-  (2019) SoK: A Comprehensive Reexamination of Phishing Research From the Security Perspective. IEEE Communications Surveys & Tutorials 22 (1). Cited by: §3.
-  (2012) A Few Useful Things to Know About Machine Learning. Communications of the ACM 55 (10), pp. 78–87. Cited by: §2.2.
-  (2020) AutoGluon-Tabular: Robust and Accurate AutoML for Structured Data. arXiv preprint arXiv:2003.06505. Cited by: 1st item.
-  (2009) Particle swarm model selection.. Journal of Machine Learning Research 10 (2). Cited by: §2.3.
-  (2007) Learning to Detect Phishing Emails. In Proceedings of the 16th International Conference on World Wide Web, Cited by: §4.1, Table 1.
-  (2015) Efficient and Robust Automated Machine Learning. In Advances in Neural Information Processing Systems 28 (NIPS’15), pp. 2962–2970. Cited by: §2, 2nd item.
-  (2004) Functional trees. Machine Learning 55 (3), pp. 219–250. Cited by: §2.2.
-  (2019) An Open Source AutoML Benchmark. arXiv preprint arXiv:1907.00909 [cs.LG]. Note: accepted at the AutoML Workshop at ICML 2019. Cited by: §4.2.
-  (2019) GAMA: Genetic Automated Machine learning Assistant. Journal of Open Source Software 4 (33), pp. 1132. Cited by: 3rd item.
-  (2018) Learning from The Ones That Got Away: Detecting New Forms of Phishing Attacks. IEEE Transactions on Dependable and Secure Computing 15 (6). Cited by: §4.1, Table 1.
-  (2009-11) The WEKA Data Mining Software: An Update. SIGKDD Explorations Newsletter 11 (1), pp. 10–18. Cited by: §2.
-  (2001) A simple generalisation of the area under the roc curve for multiple class classification problems. Machine learning 45 (2), pp. 171–186. Cited by: 2nd item.
-  (2021-01) AutoML: A Survey of the State-of-the-art. Knowledge-Based Systems 212, pp. 106622. Cited by: Figure 2, §2.1.2, §2.3, §2.
-  (2006) Measures of Geometrical Complexity in Classification Problems. In Data Complexity in Pattern Recognition, pp. 1–23. Cited by: §5.
-  (2002) Complexity Measures of Supervised Classification Problems. IEEE Transactions on Pattern Analysis and Machine Intelligence 24 (3), pp. 289–300. Cited by: §5.
-  (2012-01) The State of Phishing Attacks. Communications of the ACM 55 (1), pp. 74–81. Cited by: §1.
-  (2017) Unsupervised feature extraction with autoencoder trees. Neurocomputing 258, pp. 63–73. Cited by: §2.2.
-  (2019) A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4401–4410. Cited by: §2.1.1.
-  (2014) Hyperopt-sklearn: Automatic Hyperparameter Configuration for Scikit-Learn. In Proc. SciPy. Cited by: 5th item.
-  (2017) Auto-WEKA 2.0: Automatic model selection and hyperparameter optimization in WEKA. Journal of Machine Learning Research 18 (25), pp. 1–5. Cited by: §2.
-  (2017) Boostclean: automated error detection and repair for machine learning. arXiv preprint arXiv:1711.01299. Cited by: §2.1.2.
-  (2019) Alphaclean: automatic generation of data cleaning pipelines. arXiv preprint arXiv:1904.11827. Cited by: §2.1.2.
-  (1998) Gradient-based learning applied to document recognition. Proceedings of the IEEE 86 (11), pp. 2278–2324. Cited by: §2.1.1.
-  (2020-07) H2O AutoML: Scalable Automatic Machine Learning. In 7th ICML Workshop on Automated Machine Learning (AutoML). Cited by: 4th item.
-  (2019) Online hyper-parameter learning for auto-augmentation strategy. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6579–6588. Cited by: §2.1.3.
-  (2020) UniformAugment: a search-free probabilistic data augmentation approach. arXiv preprint arXiv:2003.14348. Cited by: §2.1.3.
-  (2019) How Complex is Your Classification Problem? A Survey on Measuring Classification Complexity. ACM Computing Surveys (CSUR) 52 (5), pp. 1–34. Cited by: §5.
-  (2009) Identifying Suspicious URLs: An Application of Large-Scale Online Learning. In Proceedings of the 26th Annual International Conference on Machine Learning, Cited by: §4.1, Table 1.
-  (2016) Detecting Malicious URLs Using Lexical Analysis. In International Conference on Network and System Security, Cited by: §4.1, Table 1.
-  (2014) PhishStorm: Detecting Phishing with Streaming Analytics. IEEE Transactions on Network and Service Management 11 (4). Cited by: §4.1, Table 1.
-  (2017) Relational autoencoder for feature extraction. In 2017 International Joint Conference on Neural Networks (IJCNN), pp. 364–371. Cited by: §2.2.
-  (2018) Data augmentation for improving deep learning in image classification problem. In 2018 international interdisciplinary PhD workshop (IIPhDW), pp. 117–122. Cited by: §2.1.3.
-  (2019) Style transfer-based image synthesis as an efficient regularization technique in deep learning. In 2019 24th International Conference on Methods and Models in Automation and Robotics (MMAR), pp. 42–47. Cited by: §2.1.3.
-  (2012) An Assessment of Features Related to Phishing Websites Using an Automated Technique. In 2012 International Conference for Internet Technology and Secured Transactions, Cited by: §4.1, Table 1.
-  (2018) Learning-based video motion magnification. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 633–648. Cited by: §2.1.1.
-  (2016) Evaluation of a Tree-based Pipeline Optimization Tool for Automating Data Science. In Proceedings of the Genetic and Evolutionary Computation Conference 2016 (GECCO ’16), New York, NY, USA, pp. 485–492. Cited by: §2, 6th item.
-  (2011) Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research 12, pp. 2825–2830. Cited by: §2, 6th item, §4.2.
-  (2012) PhishGILLNET—Phishing Detection Methodology Using Probabilistic Latent Semantic Analysis, AdaBoost, and Co-training. EURASIP Journal on Information Security 2012 (1). Cited by: §4.1, Table 1.
-  (2018) Learning to simulate. arXiv preprint arXiv:1810.02513. Cited by: §2.1.1.
-  (2009) Feature construction methods: a survey. sifaka. cs. uiuc. edu 69, pp. 70–71. Cited by: §2.2.
-  (2019-11) Towards Automated Machine Learning: Evaluation and Comparison of AutoML Approaches and Tools. In 2019 IEEE 31st International Conference on Tools with Artificial Intelligence (ICTAI), pp. 1471–1479. arXiv:1908.05557. Cited by: §2.
-  (1998) Evolutionary feature space transformation. In Feature Extraction, Construction and Selection, pp. 307–323. Cited by: §2.2.
-  Semantic Feature Selection for Text with Application to Phishing Email Detection. In International Conference on Information Security and Cryptology. Cited by: §4.1, Table 1.
-  (2019) Data Quality for Security Challenges: Case Studies of Phishing, Malware and Intrusion Detection Datasets. In Proceedings of the 2019 ACM SIGSAC Conference on Computer and Communications Security, Cited by: §4.1, Table 1.
-  (2020) Automated Machine Learning: Review of the State-of-the-Art and Opportunities for Healthcare. Artificial Intelligence in Medicine 104. Cited by: §2.2, §2.3, §2.
-  (2016) Understanding data augmentation for classification: when to warp?. In 2016 international conference on digital image computing: techniques and applications (DICTA), pp. 1–6. Cited by: §2.1.3.
-  (1998) A comparison of constructing different types of new feature for decision tree learning. In Feature Extraction, Construction and Selection, pp. 239–255. Cited by: §2.2.