Clustering is an area of unsupervised machine learning that attempts to find structure in unstructured data by creating groups of similar values [1, 2]. One of the primary challenges of clustering is that there are numerous algorithms, and algorithm selection can have a drastic impact on performance. Furthermore, the performance of a particular algorithm is often dependent on the nature of the clusters in the data. Even two similar algorithms may find completely different sets of clusters in the same data set. Clustering algorithms are also notoriously difficult to evaluate, as there is no ground truth available and multiple sets of clusters created from one data set could be equally valid.
The selection of a clustering algorithm and its parameters, a process known as hyperparameter tuning, is a considerable challenge when applying a clustering solution to real-world problems. Multiple iterations and considerable domain knowledge are often required to find an optimal algorithm configuration, and the process is often long and tedious [3, 4]. In supervised problems, where a ground truth is available, hyperparameter tuning is often automated; however, automated hyperparameter tuning requires accurate and objective evaluation metrics. As evaluating clustering algorithms is a considerable problem, completely automated methods of hyperparameter tuning for clustering algorithms often rely on internal evaluation metrics [3, 5, 6], or on having some ground truth labels available for external evaluation metrics [4, 7], which moves the problem into the semi-automated space.
However, these methods of evaluation are often flawed and cannot comment on the quality of the clusters developed for the use case. Internal methods measure cluster quality with similarity metrics and tend to be biased towards particular types of clustering algorithms. Another method of evaluation is through meta-criteria, such as stability and statistical significance, which can be useful in determining the quality of a clustering algorithm but less so in comparing the results of multiple algorithms. Von Luxburg et al. [8] asserted that clustering algorithms cannot be evaluated independently of the context in which they will be used. Domain specific evaluation can be highly subjective and often requires significant time and resources to perform. As the effect of hyperparameters on clustering results cannot be described through a convex function, an exhaustive grid search is required to find the optimal hyperparameters. For an individual to manually perform an exhaustive grid search and evaluate all of the possible results would be a time-intensive and cumbersome process.
We propose a framework for semi-automated hyperparameter tuning of clustering problems, using internal metrics and meta-criteria to guide an individual performing manual, domain specific evaluation. Preliminary results were obtained by running the framework to identify the most appropriate algorithm and parameter combination for persona development. The results showed that the framework facilitates domain specific evaluation and identifies more use case relevant results than methods based purely on internal metrics. The key contributions of this preliminary study are a framework for the semi-automated hyperparameter tuning of clustering problems and its evaluation on a real-world clustering problem, with a comparison against hyperparameter tuning based on internal metrics.
The proposed framework performs an exhaustive grid search across multiple clustering algorithms and parameters. The results are then output as a set of graphs and simple meta-criteria metrics that can be used for focused domain specific evaluation. An overview of the framework is given in Fig. 1.
II-A Grid Search
The framework takes a map with an identifier as the key and an exhaustive parameter map as the value. The parameter map also gives the function or class used to run the clustering algorithm. Each parameter combination is assigned a unique identifier that is used throughout the output, made up of the identifier given in the map and a number, e.g., kmeans_v0.
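The expansion of such a map into uniquely identified parameter combinations can be sketched as follows. This is a minimal illustration rather than the framework's actual code: the layout of the map (the `estimator` and `params` keys), the helper name `expand_grid`, and the example search space are all assumptions made for the sketch, and the estimator entries are placeholder strings standing in for the function or class that would run each algorithm.

```python
from itertools import product


def expand_grid(algorithms):
    """Expand each parameter map into (run_id, estimator, params) tuples."""
    runs = []
    for name, spec in algorithms.items():
        estimator = spec["estimator"]
        grid = spec["params"]
        keys = sorted(grid)  # fixed key order keeps the numbering reproducible
        combos = product(*(grid[key] for key in keys))
        for index, values in enumerate(combos):
            # Identifier from the map plus a running number, e.g. kmeans_v0
            run_id = f"{name}_v{index}"
            runs.append((run_id, estimator, dict(zip(keys, values))))
    return runs


# Hypothetical search space (an assumption, not the study's configuration).
SEARCH_SPACE = {
    "kmeans": {
        "estimator": "KMeans",
        "params": {"n_clusters": [2, 3, 4], "init": ["k-means++", "random"]},
    },
    "ahc": {
        "estimator": "AgglomerativeClustering",
        "params": {"n_clusters": [2, 3], "linkage": ["ward", "average"]},
    },
}

if __name__ == "__main__":
    for run_id, estimator, params in expand_grid(SEARCH_SPACE):
        print(run_id, estimator, params)
```

Sorting the parameter names before taking the Cartesian product is a design choice of the sketch: it makes the number attached to each identifier stable across runs, so an identifier seen in one output file refers to the same combination in every other output.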
II-B Automated Outputs
A number of metrics are collected from the clusters developed by each parameter combination: cluster sizes; internal metrics, specifically the Silhouette Coefficient [9], Calinski-Harabasz Index [10], and Davies-Bouldin Index [11]; and the mean value for each feature in each cluster, together with the number of standard deviations each cluster mean is from the population mean and its statistical significance, or p-value. Any features found to be statistically significant are tracked. All of the data is output to a CSV file for the parameter combination, and values such as the internal metrics and meta-criteria are additionally output to running CSV files for quick reference. A series of graphs is then built so that each graph represents how many standard deviations a cluster centroid is from the population mean for each of the predefined key features for domain specific evaluation.
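The per-feature quantities described above can be illustrated with a short sketch for a single feature. The text does not state which significance test is used, so the two-sided one-sample z-test below is an assumption chosen only to make the example concrete; the function name `cluster_feature_summary`, the `alpha` threshold, and the toy values are likewise hypothetical.

```python
from math import sqrt
from statistics import NormalDist, fmean, pstdev


def cluster_feature_summary(values, labels, alpha=0.05):
    """Summarise one feature per cluster against the whole population.

    values: one feature value per observation.
    labels: the cluster label assigned to each observation.
    """
    population_mean = fmean(values)
    population_std = pstdev(values)
    if population_std == 0:
        raise ValueError("feature is constant; distances are undefined")

    summary = {}
    for label in sorted(set(labels)):
        members = [v for v, assigned in zip(values, labels) if assigned == label]
        cluster_mean = fmean(members)
        # Distance of the cluster mean from the population mean,
        # expressed in population standard deviations.
        sd_from_mean = (cluster_mean - population_mean) / population_std
        # ASSUMPTION: two-sided one-sample z-test of the cluster mean
        # against the population mean (the source names no test).
        z = sd_from_mean * sqrt(len(members))
        p_value = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
        summary[label] = {
            "size": len(members),
            "mean": cluster_mean,
            "sd_from_mean": sd_from_mean,
            "p_value": p_value,
            "significant": p_value < alpha,
        }
    return summary


if __name__ == "__main__":
    # Toy values for a single feature, not data from the study.
    toy_values = [1, 2, 3, 7, 8, 9]
    toy_labels = [0, 0, 0, 1, 1, 1]
    for label, row in cluster_feature_summary(toy_values, toy_labels).items():
        print(label, row)
```

The `sd_from_mean` value is the quantity that each graph would plot per cluster and feature, while the `significant` flag corresponds to the tracking of statistically significant features.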
II-C Domain Specific Evaluation
When performing the manual evaluation, the individual is encouraged to first use the meta-criteria and internal metrics to rule out unacceptable cluster sets. For example, a set of clusters that has no significant features would be considered unacceptable. The individual can then use the graphs and knowledge of the statistically significant features for the remaining options to perform a subjective, domain specific evaluation. It was found most effective to perform a quick first pass of the graphs to find graphs that showed particularly weak clusters or obviously went against the domain specific evaluation criteria.
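The first screening step can be sketched as a simple filter over the collected outputs. The rule that a cluster set with no significant features is unacceptable comes from the text; the minimum cluster size check, the function name `screen_cluster_sets`, the record layout, and the example records are assumptions added for illustration only.

```python
def screen_cluster_sets(results, min_cluster_size=1):
    """Split parameter combinations into kept and rejected by meta-criteria.

    results maps a run id to a record with:
      "significant_features": list of statistically significant feature names
      "cluster_sizes": list with the number of observations in each cluster
    """
    kept = []
    rejected = {}
    for run_id, record in results.items():
        if not record["significant_features"]:
            # Rule taken from the text: no significant features is unacceptable.
            rejected[run_id] = "no statistically significant features"
        elif min(record["cluster_sizes"]) < min_cluster_size:
            # ASSUMPTION: an extra size rule, not stated in the source.
            rejected[run_id] = "cluster below minimum size"
        else:
            kept.append(run_id)
    return kept, rejected


if __name__ == "__main__":
    # Hypothetical records, not results from the study.
    example = {
        "kmeans_v0": {"significant_features": ["feature_a"], "cluster_sizes": [40, 60]},
        "kmeans_v1": {"significant_features": [], "cluster_sizes": [50, 50]},
        "ahc_v0": {"significant_features": ["feature_b"], "cluster_sizes": [98, 2]},
    }
    kept, rejected = screen_cluster_sets(example, min_cluster_size=5)
    print("kept:", kept)
    print("rejected:", rejected)
```

Only the combinations in `kept` would then go forward to the graph-based first pass and the subjective, domain specific evaluation.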
III Preliminary Results
The framework was used to compare three algorithms for the purpose of persona development based on cyclone preparatory behaviour. A persona is a description of a fictitious person used to describe analytical data and customer segments in a manner that emphasises human attributes and empathy [12]. Personas are used in a wide range of fields, but primarily for marketing and design purposes. The three algorithms compared, each with multiple parameter options, were: 1) k-means [13, 14, 15, 16]; 2) Agglomerative Hierarchical Clustering (AHC); and 3) Non-negative Matrix Factorization (NMF). These algorithms were selected as they are the most common within persona development [18]. The domain specific evaluation performed was based on how well the clusters could be explained via a behavioural model, specifically the Protective Action Decision Model (PADM) [19, 20, 21]. The data used was survey data from 519 residents of cyclone-prone North Queensland, Australia [22].
Of the 16 parameter combinations used, six could be immediately ruled out due to the meta-criteria and a further four were able to be easily ruled out from the graphs, which left six for domain specific evaluation. That is, the framework facilitated the identification of a preferred algorithm and parameter combination, in this case AHC using Ward’s linkage and 3 clusters. This result contradicted what would have been found using a fully automated framework based on internal metrics, as all of the internal metrics preferred other combinations, including combinations which were ruled out by meta-criteria in some cases.
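To make the comparison with a fully automated approach concrete, the sketch below shows how a selection based purely on internal metrics would pick one combination per metric, and how those picks can be checked against the combinations ruled out by meta-criteria. Higher values are better for the Silhouette Coefficient and Calinski-Harabasz Index, and lower values are better for the Davies-Bouldin Index. The run identifiers and scores are made-up placeholders, not the study's results, and the function names are hypothetical.

```python
# Direction in which each internal metric improves.
HIGHER_IS_BETTER = {
    "silhouette": True,
    "calinski_harabasz": True,
    "davies_bouldin": False,
}


def best_by_metric(scores):
    """Return the run id that each internal metric would select on its own."""
    winners = {}
    for metric, higher in HIGHER_IS_BETTER.items():
        pick = max if higher else min
        winners[metric] = pick(scores, key=lambda run_id: scores[run_id][metric])
    return winners


def conflicts_with_screening(winners, rejected):
    """List the metrics whose preferred run was ruled out by meta-criteria."""
    return sorted(metric for metric, run_id in winners.items() if run_id in rejected)


if __name__ == "__main__":
    # Placeholder scores for illustration only (not results from the study).
    placeholder_scores = {
        "run_a": {"silhouette": 0.20, "calinski_harabasz": 80.0, "davies_bouldin": 1.9},
        "run_b": {"silhouette": 0.35, "calinski_harabasz": 70.0, "davies_bouldin": 1.6},
        "run_c": {"silhouette": 0.25, "calinski_harabasz": 95.0, "davies_bouldin": 1.2},
    }
    winners = best_by_metric(placeholder_scores)
    print(winners)
    print(conflicts_with_screening(winners, rejected={"run_b"}))
```

In this toy setup the metrics do not agree with each other, and one metric prefers a run that screening has already excluded, which is the kind of disagreement the preliminary results describe.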
IV Conclusion and Future Work
The quality of a set of clusters is highly dependent on the algorithm and parameters used to develop them. However, the subjective nature of cluster evaluation makes hyperparameter tuning difficult to automate, resulting in a time consuming, tedious process. Previous approaches have relied on having some ground truth labels available, moving the problem out of the unsupervised space, or on internal metrics, which are known to be biased and unreliable.
This preliminary study presented a semi-automated framework for hyperparameter tuning for clustering problems. The framework performs an exhaustive grid search of all algorithm parameter combinations to produce a series of graphs and easy to interpret outputs. Preliminary results show that these graphs and outputs can then be used for efficient domain specific evaluation that can produce results more relevant to the cluster’s use case.
References
[1] A. K. Jain, “Data clustering: 50 years beyond K-means,” Pattern Recognition Letters, vol. 31, no. 8, pp. 651–666, Jun. 2010. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0167865509002323
[2] D. Xu and Y. Tian, “A comprehensive survey of clustering algorithms,” Annals of Data Science, vol. 2, no. 2, pp. 165–193, Jun. 2015. [Online]. Available: https://doi.org/10.1007/s40745-015-0040-1
[3] X. Fan, Y. Yue, P. Sarkar, and Y. X. R. Wang, “On hyperparameter tuning in general clustering problems,” in Proceedings of the 37th International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, vol. 119. PMLR, Jul. 2020, pp. 2996–3007. [Online]. Available: http://proceedings.mlr.press/v119/fan20b.html
[4] T. Van Craenendonck and H. Blockeel, “Constraint-based clustering selection,” Machine Learning, vol. 106, no. 9, pp. 1497–1521, Oct. 2017. [Online]. Available: https://doi.org/10.1007/s10994-017-5643-7
[5] L. Blumenberg and K. V. Ruggles, “Hypercluster: a flexible tool for parallelized unsupervised clustering optimization,” BMC Bioinformatics, vol. 21, no. 1, p. 428, Sep. 2020. [Online]. Available: https://doi.org/10.1186/s12859-020-03774-1
[6] V. Shalamov, V. Efimova, S. Muravyov, and A. Filchenkov, “Reinforcement-based Method for Simultaneous Clustering Algorithm Selection and its Hyperparameters Optimization,” Procedia Computer Science, vol. 136, pp. 144–153, Jan. 2018. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S1877050918315527
[7] L. L. Minku, “A novel online supervised hyperparameter tuning procedure applied to cross-company software effort estimation,” Empirical Software Engineering, vol. 24, no. 5, pp. 3153–3204, Oct. 2019. [Online]. Available: https://doi.org/10.1007/s10664-019-09686-w
[8] U. Von Luxburg, R. C. Williamson, and I. Guyon, “Clustering: Science or art?” 2012, pp. 65–79.
[9] P. J. Rousseeuw, “Silhouettes: A graphical aid to the interpretation and validation of cluster analysis,” Journal of Computational and Applied Mathematics, vol. 20, pp. 53–65, Nov. 1987. [Online]. Available: http://www.sciencedirect.com/science/article/pii/0377042787901257
[10] T. Caliński and J. Harabasz, “A Dendrite Method for Cluster Analysis,” Communications in Statistics - Theory and Methods, vol. 3, pp. 1–27, Jan. 1974.
[11] D. L. Davies and D. W. Bouldin, “A Cluster Separation Measure,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. PAMI-1, no. 2, pp. 224–227, Apr. 1979.
[12] J. Salminen, B. J. Jansen, J. An, H. Kwak, and S.-g. Jung, “Are personas done? Evaluating their usefulness in the age of digital analytics,” Persona Studies, vol. 4, no. 2, pp. 47–65, Nov. 2018. [Online]. Available: https://ojs.deakin.edu.au/index.php/ps/article/view/737
[13] G. H. Ball and D. J. Hall, “ISODATA, a novel method of data analysis and pattern classification,” Stanford research inst Menlo Park CA, Tech. Rep., 1965.
[14] S. Lloyd, “Least squares quantization in PCM,” IEEE Transactions on Information Theory, vol. 28, no. 2, pp. 129–137, 1982.
[15] J. MacQueen, “Some methods for classification and analysis of multivariate observations,” in Proceedings of the fifth Berkeley symposium on mathematical statistics and probability, vol. 1. Oakland, CA, USA, 1967, pp. 281–297.
[16] H. Steinhaus, “Sur la division des corps matériels en parties,” Bull. Acad. Polon. Sci, vol. 1, no. 804, p. 801, 1956.
[17] D. D. Lee and H. S. Seung, “Learning the parts of objects by non-negative matrix factorization,” Nature, vol. 401, no. 6755, pp. 788–791, 1999.
[18] J. Salminen, K. Guan, S.-G. Jung, and B. J. Jansen, “A survey of 15 years of data-driven persona development,” International Journal of Human–Computer Interaction, vol. 0, no. 0, pp. 1–24, 2021. [Online]. Available: https://doi.org/10.1080/10447318.2021.1908670
[19] M. K. Lindell and R. W. Perry, Behavioral foundations of community emergency planning, ser. Behavioral foundations of community emergency planning. Washington, DC, US: Hemisphere Publishing Corp, 1992.
[20] M. K. Lindell and R. W. Perry, “The protective action decision model: theoretical modifications and additional evidence,” Risk Analysis: An International Journal, vol. 32, no. 4, pp. 616–632, 2012.
[21] T. Terpstra and M. K. Lindell, “Citizens’ perceptions of flood hazard adjustments: an application of the protective action decision model,” Environment and Behavior, vol. 45, no. 8, pp. 993–1018, 2013.
[22] M. Scovell, C. McShane, A. Swinbourne, and D. Smith, “North Queenslanders’ perceptions of cyclone risk and structural mitigation intentions. Part I: psychological and demographic factors,” Jul. 2018.