The increasingly challenging task of scaling the traditional Central Processing Unit (CPU) has led to the exploration of new computational platforms such as quantum computers, CMOS annealers, neuromorphic computers, and so on (see  for a detailed exposition). Although their physical implementations differ significantly, adiabatic quantum computers, CMOS annealers, memristive circuits, and optical parametric oscillators all share Ising models as their core mathematical abstraction. This has led to a growing interest in the formulation of computational problems as Ising models and in the empirical evaluation of these models on such novel computational platforms. This body of literature includes clustering and community detection [14, 20, 24], graph partitioning [27, 28], and many NP-complete problems such as covering, packing, and coloring [18, 17].
Consensus clustering is the problem of combining multiple ‘base clusterings’ of the same set of data points into a single consolidated clustering. Consensus clustering is used to generate robust, stable, and more accurate clustering results compared to a single clustering approach. The problem of consensus clustering has received significant attention over the last two decades, and was previously considered under different names (clustering aggregation, cluster ensembles, clustering combination). It has applications in different fields including data mining, pattern recognition, and bioinformatics, and a number of algorithmic approaches have been used to solve this problem. Consensus clustering is, in essence, a combinatorial optimization problem, and different instances of the problem have been proven to be NP-hard (e.g., [6, 26]).
In this work, we investigate the use of special purpose hardware to solve the problem of consensus clustering. To this end, we formulate the problem of consensus clustering using Ising models and evaluate our approach on a specialized CMOS annealer. We make the following contributions:
We present and study two Ising models for consensus clustering that can be solved on a variety of special purpose hardware platforms.
We demonstrate how our models are embedded on the Fujitsu Digital Annealer (DA), a quantum-inspired specialized CMOS hardware.
We present an empirical evaluation based on seven benchmark datasets and show our approach outperforms existing techniques for consensus clustering.
2.1 Problem Definition
Let $X = \{x_1, \dots, x_n\}$ be a set of $n$ data points. A clustering of $X$ is a process that partitions $X$ into subsets, referred to as clusters, that together cover $X$. A clustering is represented by the mapping $\pi : X \to \{1, \dots, k_\pi\}$, where $k_\pi$ is the number of clusters produced by clustering $\pi$. Given $X$ and a set $\Pi = \{\pi_1, \dots, \pi_m\}$ of $m$ clusterings of the points in $X$, the Consensus Clustering Problem is to find a new clustering, $\pi^*$, of the data $X$ that best summarizes the set of clusterings $\Pi$. The new clustering $\pi^*$ is referred to as the consensus clustering.
Due to the ambiguity in the definition of an optimal consensus clustering, several approaches have been proposed to measure the solution quality of consensus clustering algorithms. In this work, we focus on the approach of determining a consensus clustering that agrees the most with the original clusterings. As an objective measure to determine this agreement, we use the mean Adjusted Rand Index (ARI) metric (Equation (14)). However, we also consider clustering quality measured by the mean Silhouette Coefficient [23] and clustering accuracy based on true labels. These evaluation criteria are discussed in more detail in Section 4.
2.2 Existing Criteria and Methods
Various criteria or objectives have been proposed for the Consensus Clustering Problem. In this work we mainly focus on two well-studied criteria, one based on the pairwise similarity of the data points, and the other based on the different assignments of the base clusterings. Other well-known criteria and objectives for the Consensus Clustering Problem can be found in the excellent surveys of [9, 29], with most defining NP-Hard optimization problems.
Pairwise Similarity Approaches:
In this approach, a similarity matrix $S$ is constructed such that each entry in $S$ represents the fraction of clusterings in which two data points belong to the same cluster. In particular,

$$S_{uv} = \frac{1}{m} \sum_{i=1}^{m} \mathbb{1}\big(\pi_i(u) = \pi_i(v)\big) \tag{1}$$

with $\mathbb{1}(\cdot)$ being the indicator function. The value $S_{uv}$ lies between 0 and 1, and is equal to 1 if all the base clusterings assign points $u$ and $v$ to the same cluster. Once the pairwise similarity matrix is constructed, one can use any similarity-based clustering algorithm on $S$ to find a consensus clustering with a fixed number of clusters, $K$. For example, it has been proposed to find a consensus clustering with exactly $K$ clusters that minimizes the within-cluster dissimilarity:

$$\min_{\pi^*} \sum_{u < v:\ \pi^*(u) = \pi^*(v)} (1 - S_{uv}) \tag{2}$$
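The similarity matrix and the within-cluster dissimilarity objective described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: clusterings are assumed to be lists of integer labels indexed by data point, and the function names and the toy `base` clusterings are hypothetical.

```python
from itertools import combinations


def similarity_matrix(clusterings):
    """S[u][v] = fraction of base clusterings that put u and v in the same cluster."""
    m = len(clusterings)
    n = len(clusterings[0])
    S = [[0.0] * n for _ in range(n)]
    for u in range(n):
        for v in range(n):
            agree = sum(1 for labels in clusterings if labels[u] == labels[v])
            S[u][v] = agree / m
    return S


def within_cluster_dissimilarity(S, consensus):
    """Sum of (1 - S[u][v]) over unordered pairs that the consensus clusters together."""
    n = len(consensus)
    return sum(
        1.0 - S[u][v]
        for u, v in combinations(range(n), 2)
        if consensus[u] == consensus[v]
    )


# Toy example (hypothetical data): three base clusterings of four points.
base = [
    [0, 0, 1, 1],
    [0, 0, 0, 1],
    [1, 1, 0, 0],
]
S = similarity_matrix(base)
cost = within_cluster_dissimilarity(S, [0, 0, 1, 1])
```

Because only label equality is tested, the result does not depend on how the clusters are numbered in each base clustering.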
Partition Difference Approaches:
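An alternative formulation is based on the different assignments between clusterings. Consider two data points $u, v \in X$, and two clusterings $\pi_i, \pi_j \in \Pi$. The following binary indicator tests if $\pi_i$ and $\pi_j$ disagree on the clustering of $u$ and $v$:

$$d_{u,v}(\pi_i, \pi_j) = \begin{cases} 1, & \text{if } \pi_i(u) = \pi_i(v) \text{ and } \pi_j(u) \neq \pi_j(v) \\ 1, & \text{if } \pi_i(u) \neq \pi_i(v) \text{ and } \pi_j(u) = \pi_j(v) \\ 0, & \text{otherwise} \end{cases} \tag{3}$$

The distance between two clusterings is then defined based on the number of pairwise disagreements:

$$d(\pi_i, \pi_j) = \frac{1}{2} \sum_{u, v \in X} d_{u,v}(\pi_i, \pi_j) \tag{4}$$

where the factor $\frac{1}{2}$ accounts for double counting and can be ignored. This measure is defined as the number of pairs of points that are in the same cluster in one clustering and in different clusters in the other, essentially considering the (unadjusted) Rand index. Given this measure, a common objective is to find a consensus clustering $\pi^*$ with respect to the following optimization problem:

$$\min_{\pi^*} \sum_{i=1}^{m} d(\pi_i, \pi^*) \tag{5}$$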
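As a small illustration of the partition difference measure, the sketch below counts pairwise disagreements between two clusterings and sums them over a set of base clusterings. It assumes clusterings are lists of integer labels, counts each unordered pair once (so no factor of one half is needed), and uses hypothetical function names.

```python
from itertools import combinations
from math import comb


def disagreement_distance(pi1, pi2):
    """Number of unordered pairs clustered together in one clustering but not the other."""
    count = 0
    for u, v in combinations(range(len(pi1)), 2):
        same_in_1 = pi1[u] == pi1[v]
        same_in_2 = pi2[u] == pi2[v]
        if same_in_1 != same_in_2:
            count += 1
    return count


def total_disagreement(clusterings, consensus):
    """Objective value of a candidate consensus: summed distance to all base clusterings."""
    return sum(disagreement_distance(pi, consensus) for pi in clusterings)


def rand_index(pi1, pi2):
    """Unadjusted Rand index: fraction of pairs on which the two clusterings agree."""
    return 1.0 - disagreement_distance(pi1, pi2) / comb(len(pi1), 2)
```

For example, `disagreement_distance([0, 0, 1, 1], [0, 0, 0, 1])` counts the three pairs (0, 2), (1, 2), and (2, 3), while two clusterings that differ only in how clusters are numbered have distance zero.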
Methods and Algorithms:
The two criteria given above define fundamentally different optimization problems, and thus different algorithms have been proposed. One key difference between the two approaches lies in determining the number of clusters in $\pi^*$. The pairwise similarity approaches (e.g., Equation (2)) require an input parameter $K$ that fixes the number of clusters in $\pi^*$, whereas the partition difference approaches such as Equation (5) do not have this requirement, and determining $K$ is part of the objective of the problem. For example, without a fixed $K$, Equation (2) attains its minimum value when $K = n$, since placing every point in its own cluster leaves no within-cluster pairs; this does not hold for Equation (5).
The Cluster-based Similarity Partitioning Algorithm (CSPA) is proposed in [25] for solving the pairwise similarity based approach. CSPA constructs a similarity-based graph with each edge having a weight proportional to the similarity given by $S$. Determining the consensus clustering with exactly $K$ clusters is treated as a $K$-way graph partitioning problem, which is solved by methods such as METIS [12]. In [21], the authors experiment with different clustering algorithms including hierarchical agglomerative clustering (HAC) and iterative techniques that start from an initial partition and iteratively reassign points to clusters based on their pairwise similarities. For the partition difference approach, Li et al. [15] proposed to solve Equation (5) using nonnegative matrix factorization (NMF). Gionis et al. [10] proposed several algorithms that make use of the connection between Equation (5) and the problem of correlation clustering. These three approaches (CSPA, HAC, and NMF) serve as baselines in our empirical evaluation (Section 4).
2.3 Ising Models
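Ising models are graphical models that include a set of nodes, $N$, representing spin variables and a set of edges, $E$, corresponding to the interactions between the spins. The energy level of an Ising model, which we aim to minimize, is given by:

$$E_{\text{ising}}(\sigma) = \sum_{(i,j) \in E} J_{ij} \sigma_i \sigma_j + \sum_{i \in N} h_i \sigma_i \tag{6}$$

where the variables $\sigma_i \in \{-1, 1\}$ are the spin variables, the couplers, $J_{ij}$, represent the interaction between the spins, and $h_i$ are the biases on the individual spins.

A Quadratic Unconstrained Binary Optimization (QUBO) model includes binary variables $q_i \in \{0, 1\}$ and couplers, $c_{ij}$. The objective to minimize is:

$$E_{\text{qubo}}(q) = \sum_{i \le j} c_{ij} q_i q_j \tag{7}$$

QUBO models can be transformed to Ising models by setting $\sigma_i = 2 q_i - 1$.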
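The change of variables between the two models can be checked numerically. The following sketch assumes the standard substitution $q_i = (1 + \sigma_i)/2$ and a QUBO given as a full square coefficient matrix `Q` (a representation chosen here for convenience); it returns Ising couplers, biases, and a constant offset. All function names are hypothetical.

```python
from itertools import product


def qubo_energy(Q, x):
    """Energy sum_ij Q[i][j] * x[i] * x[j] for binary x in {0, 1}."""
    n = len(x)
    return sum(Q[i][j] * x[i] * x[j] for i in range(n) for j in range(n))


def qubo_to_ising(Q):
    """Substitute x_i = (1 + s_i) / 2 and collect couplers J, biases h, and a constant."""
    n = len(Q)
    J = [[0.0] * n for _ in range(n)]
    h = [0.0] * n
    offset = 0.0
    for i in range(n):
        for j in range(n):
            if i == j:
                # x_i^2 = x_i = (1 + s_i) / 2
                h[i] += Q[i][i] / 2.0
                offset += Q[i][i] / 2.0
            else:
                # x_i x_j = (1 + s_i + s_j + s_i s_j) / 4
                J[i][j] += Q[i][j] / 4.0
                h[i] += Q[i][j] / 4.0
                h[j] += Q[i][j] / 4.0
                offset += Q[i][j] / 4.0
    return J, h, offset


def ising_energy(J, h, s):
    """Energy sum_ij J[i][j] * s[i] * s[j] + sum_i h[i] * s[i] for spins s in {-1, +1}."""
    n = len(s)
    pair = sum(J[i][j] * s[i] * s[j] for i in range(n) for j in range(n))
    return pair + sum(h[i] * s[i] for i in range(n))


# Exhaustive check on a small, arbitrary QUBO: both energies agree up to the offset.
Q = [[1, -2, 0], [0, 3, 4], [0, 0, -5]]
J, h, offset = qubo_to_ising(Q)
max_gap = max(
    abs(qubo_energy(Q, x) - (ising_energy(J, h, [2 * b - 1 for b in x]) + offset))
    for x in product([0, 1], repeat=3)
)
```

The constant offset does not affect which assignment is optimal, so a minimizer of one model maps directly to a minimizer of the other.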
3 Ising Approach for Consensus Clustering on Specialized Hardware
In this section, we present our approach for solving consensus clustering on specialized hardware using Ising models. We present two Ising models that correspond to the two approaches in Section 2.2. We then demonstrate how they can be solved on the Fujitsu Digital Annealer (DA), a specialized CMOS hardware.
3.1 Pairwise Similarity-based Ising Model
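For each data point $u \in X$, let $x_{uc}$ be the binary variable such that $x_{uc} = 1$ if $\pi^*$ assigns $u$ to cluster $c$, and 0 otherwise. Then the constraints

$$\sum_{c=1}^{K} x_{uc} = 1 \quad \text{for each } u \in X \tag{8}$$

ensure $\pi^*$ assigns each point to exactly one cluster. Subject to the constraints (8), the sum of quadratic terms $\sum_{c=1}^{K} x_{uc} x_{vc}$ is 1 if $\pi^*$ assigns both $u, v \in X$ to the same cluster, and is 0 if they are assigned to different clusters. Therefore the value

$$\sum_{u < v} (1 - S_{uv}) \sum_{c=1}^{K} x_{uc} x_{vc} \tag{9}$$

represents the sum of within-cluster dissimilarities in $\pi^*$: $(1 - S_{uv})$ is the fraction of clusterings in $\Pi$ that assign $u$ and $v$ to different clusters while $\pi^*$ assigns them to the same cluster. We therefore reformulate Equation (2) as a QUBO:

$$\min_{x} \ \sum_{u < v} (1 - S_{uv}) \sum_{c=1}^{K} x_{uc} x_{vc} + A \sum_{u \in X} \Big( \sum_{c=1}^{K} x_{uc} - 1 \Big)^2 \tag{10}$$

where the term $A \sum_{u \in X} \big( \sum_{c=1}^{K} x_{uc} - 1 \big)^2$ is added to the objective function to ensure that the constraints (8) are satisfied. $A$ is a positive constant that penalizes the objective for violations of constraints (8). One can show that if $A \ge n$, the optimal solution of the QUBO in Equation (10) does not violate the constraints (8). The proof is very similar to the proof of Theorem 3.1 and to a similar result in the literature.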
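To make the pairwise similarity QUBO concrete, the sketch below assembles its coefficients for a tiny instance and solves it by exhaustive search. It is an illustration under stated assumptions rather than the authors' code: each unordered pair of points is counted once, the penalty weight is set to `A = n`, the variable ordering `u * K + c` and all helper names are hypothetical, and brute force stands in for the annealer.

```python
from itertools import combinations, product


def build_similarity_qubo(S, K, A):
    """Return (Q, offset) where Q maps index pairs (i <= j) to coefficients."""
    n = len(S)
    Q = {}

    def add(i, j, w):
        key = (min(i, j), max(i, j))
        Q[key] = Q.get(key, 0.0) + w

    # Within-cluster dissimilarity: (1 - S[u][v]) whenever u and v share cluster c.
    for u, v in combinations(range(n), 2):
        for c in range(K):
            add(u * K + c, v * K + c, 1.0 - S[u][v])
    # Penalty A * (sum_c x_uc - 1)^2 expanded with x^2 = x:
    #   -A * sum_c x_uc + 2A * sum_{c < c'} x_uc x_uc' + A
    for u in range(n):
        for c in range(K):
            add(u * K + c, u * K + c, -A)
        for c1, c2 in combinations(range(K), 2):
            add(u * K + c1, u * K + c2, 2.0 * A)
    return Q, A * n


def qubo_value(Q, offset, x):
    return offset + sum(w * x[i] * x[j] for (i, j), w in Q.items())


def decode(x, n, K):
    """Map a one-hot assignment vector to cluster labels (None if not one-hot)."""
    labels = []
    for u in range(n):
        chosen = [c for c in range(K) if x[u * K + c] == 1]
        if len(chosen) != 1:
            return None
        labels.append(chosen[0])
    return labels


# Toy similarity matrix (hypothetical): points 0, 1 always together, 2, 3 mostly together.
t = 1.0 / 3.0
S = [[1, 1, t, 0], [1, 1, t, 0], [t, t, 1, 2 * t], [0, 0, 2 * t, 1]]
K, A = 2, 4.0
Q, offset = build_similarity_qubo(S, K, A)
best = min(product([0, 1], repeat=len(S) * K), key=lambda x: qubo_value(Q, offset, x))
labels = decode(best, len(S), K)
best_value = qubo_value(Q, offset, best)
```

On this instance the minimizer is one-hot for every point, so the penalty term vanishes and the energy equals the within-cluster dissimilarity of the decoded clustering.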
3.2 Partition Difference Ising Model
The partition difference approach essentially considers the (unadjusted) Rand Index and therefore can be expected to perform better on agreement-based criteria. The Correlation Clustering Problem is another important problem in data mining. Gionis et al. [10] showed that Equation (5) is a restricted case of the Correlation Clustering Problem, and that Equation (5) can be expressed as the following equivalent form of the Correlation Clustering Problem:

$$\min_{\pi^*} \ \sum_{u < v:\ \pi^*(u) = \pi^*(v)} (1 - S_{uv}) + \sum_{u < v:\ \pi^*(u) \neq \pi^*(v)} S_{uv} \tag{11}$$
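We take advantage of this equivalence to model Equation (5) as a QUBO. In a similar fashion to the QUBO formulated in the preceding subsection, the terms

$$\sum_{u < v} S_{uv} \sum_{1 \le c \neq l \le K} x_{uc} x_{vl} \tag{12}$$

measure the similarity between points in different clusters, where $K$ represents an upper bound for the number of clusters in $\pi^*$. This then leads to minimizing the following QUBO:

$$\min_{x} \ \sum_{u < v} (1 - S_{uv}) \sum_{c=1}^{K} x_{uc} x_{vc} + \sum_{u < v} S_{uv} \sum_{1 \le c \neq l \le K} x_{uc} x_{vl} + B \sum_{u \in X} \Big( \sum_{c=1}^{K} x_{uc} - 1 \Big)^2 \tag{13}$$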
Intuitively, Equation (13) measures the disagreement between the consensus clustering and the clusterings in $\Pi$. This disagreement is due to points that are clustered together in the consensus clustering but not in the clusterings in $\Pi$. It is also due to points that are assigned to different clusters in the consensus partition but to the same cluster in some of the partitions in $\Pi$.
Formally, we can show that Equation (13) is equivalent to the correlation clustering formulation in Equation (11) when setting $B \ge n$. Consistent with other methods that optimize Equation (5), our approach takes as an input $K$, an upper bound on the number of clusters in $\pi^*$; however, the obtained solution can use a smaller number of clusters. In our proof, we assume $K$ is large enough to represent the optimal solution, i.e., greater than the number of clusters in optimal solutions to the correlation clustering problem in Equation (11).
First we show that the optimal solution $x$ to the QUBO in Equation (13) satisfies the one-hot encoding ($\sum_{c=1}^{K} x_{uc} = 1$ for each $u \in X$). This would imply that, given $x$, we can create a valid clustering $\pi_x$. Note that the optimal solution will never have $\sum_{c} x_{uc} > 1$, as it can only increase the cost. The only case in which an optimal solution will have $\sum_{c} x_{uc} < 1$ is when the cost of assigning a point to a cluster is higher than the cost of not assigning it to a cluster (i.e., the penalty $B$). Assigning a point $u$ to a cluster will incur a cost of $(1 - S_{uv})$ for each point $v$ in the same cluster and $S_{uv}$ for each point $v$ that is not in the cluster. As there are $n - 1$ additional points in total, and both $(1 - S_{uv})$ and $S_{uv}$ are less than or equal to one (Equation (1)), setting $B \ge n$ guarantees that the optimal solution satisfies the one-hot encoding.

Now we assume that $\pi_x$ is not optimal, i.e., there exists an optimal solution $\hat{\pi}$ to Equation (11) that has a strictly lower cost than $\pi_x$. Let $\hat{x}$ be the corresponding QUBO solution to $\hat{\pi}$, such that $\hat{x}_{uc} = 1$ if and only if $\hat{\pi}(u) = c$. This is possible because $K$ is large enough to accommodate all clusters in $\hat{\pi}$. As both $x$ and $\hat{x}$ satisfy the one-hot encoding (penalty terms are zero), their cost is identical to the cost of $\pi_x$ and $\hat{\pi}$, respectively. Since the cost of $\hat{\pi}$ is strictly lower than that of $\pi_x$, and the cost of $x$ is lower than or equal to that of $\hat{x}$, we have a contradiction. ∎
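The equivalence argued above can be sanity-checked by exhaustive search on a tiny instance. The sketch below is only an illustration: it assumes each unordered pair of points is counted once, uses the penalty weight `B = n`, and replaces the annealer with brute force; the function names and the toy similarity matrix are hypothetical.

```python
from itertools import combinations, product


def partition_difference_energy(x, S, K, B):
    """QUBO energy for assignment matrix x (x[u][c] in {0, 1})."""
    n = len(S)
    total = 0.0
    for u, v in combinations(range(n), 2):
        same = sum(x[u][c] * x[v][c] for c in range(K))
        diff = sum(x[u][c] * x[v][l] for c in range(K) for l in range(K) if c != l)
        total += (1.0 - S[u][v]) * same + S[u][v] * diff
    total += B * sum((sum(x[u]) - 1) ** 2 for u in range(n))
    return total


def correlation_clustering_cost(labels, S):
    """Cost of a clustering: dissimilarity inside clusters plus similarity across them."""
    n = len(labels)
    return sum(
        (1.0 - S[u][v]) if labels[u] == labels[v] else S[u][v]
        for u, v in combinations(range(n), 2)
    )


t = 1.0 / 3.0
S = [[1, 1, t, 0], [1, 1, t, 0], [t, t, 1, 2 * t], [0, 0, 2 * t, 1]]
n, K, B = len(S), 3, 4.0

# Brute-force the QUBO over all 2^(n*K) binary assignments.
best_bits = min(
    product([0, 1], repeat=n * K),
    key=lambda bits: partition_difference_energy(
        [bits[u * K:(u + 1) * K] for u in range(n)], S, K, B
    ),
)
best_x = [best_bits[u * K:(u + 1) * K] for u in range(n)]
qubo_min = partition_difference_energy(best_x, S, K, B)

# Brute-force the clustering problem directly over all labelings with at most K clusters.
cc_min = min(correlation_clustering_cost(lab, S) for lab in product(range(K), repeat=n))
```

On this instance both minima coincide, every row of `best_x` is one-hot, and the optimal solution uses only two of the three available clusters, which illustrates that $K$ acts as an upper bound.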
3.3 Solving Consensus Clustering on the Fujitsu Digital Annealer
The Fujitsu Digital Annealer (DA) is a recent CMOS hardware platform for solving combinatorial optimization problems formulated as QUBO [1, 8]. We use the second generation of the DA, which is capable of representing problems with up to 8192 variables with up to 64 bits of precision. The DA has previously been used to solve problems in areas such as communication [19] and signal processing [22].
The DA algorithm is based on simulated annealing (SA) [13], while taking advantage of the massive parallelization provided by the CMOS hardware. It has several key differences compared to SA, most notably a parallel-trial scheme in which each Monte Carlo (MC) step considers all possible one-bit flips in parallel, and a dynamic offset mechanism that increases the energy of a state to escape local minima.
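The two mechanisms mentioned above can be illustrated with a simplified software analogue. The sketch below is not the DA's actual algorithm or interface: it is a plain-Python annealer over a full QUBO matrix in which every step evaluates all single-bit flips, applies one of the accepted flips at random, and raises an energy offset whenever no flip is accepted. The temperature schedule, `offset_step`, and all names are assumptions made for illustration.

```python
import math
import random


def energy(Q, x):
    n = len(x)
    return sum(Q[i][j] * x[i] * x[j] for i in range(n) for j in range(n))


def flip_delta(Q, x, i):
    """Energy change caused by flipping bit i."""
    n = len(x)
    d = 1 - 2 * x[i]  # +1 for a 0 -> 1 flip, -1 for a 1 -> 0 flip
    coupling = sum((Q[i][j] + Q[j][i]) * x[j] for j in range(n) if j != i)
    return d * (Q[i][i] + coupling)


def parallel_trial_anneal(Q, steps=2000, t_start=5.0, t_end=0.05, offset_step=0.5, seed=0):
    rng = random.Random(seed)
    n = len(Q)
    x = [rng.randint(0, 1) for _ in range(n)]
    cur = energy(Q, x)
    best_x, best_e = list(x), cur
    offset = 0.0
    for step in range(steps):
        # Geometric cooling from t_start to t_end.
        t = t_start * (t_end / t_start) ** (step / max(1, steps - 1))
        # Parallel trial: evaluate every one-bit flip for the current state.
        deltas = [flip_delta(Q, x, i) for i in range(n)]
        accepted = []
        for i, de in enumerate(deltas):
            eff = de - offset  # the offset makes uphill moves easier to accept
            if eff <= 0 or rng.random() < math.exp(-eff / t):
                accepted.append(i)
        if accepted:
            i = rng.choice(accepted)
            x[i] = 1 - x[i]
            cur += deltas[i]
            offset = 0.0
            if cur < best_e:
                best_x, best_e = list(x), cur
        else:
            # Stuck in a local minimum: raise the offset to help escape it.
            offset += offset_step
    return best_x, best_e
```

A call such as `parallel_trial_anneal([[-1, 2, 0], [0, -1, 2], [0, 0, -1]], steps=200)` returns the best bit vector encountered together with its energy, mirroring the "best solution within a fixed budget" usage described for the DA.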
3.3.1 Encoding Consensus Clustering on the DA
When embedding our Ising models on the DA, we need to consider the hardware specification and adapt the representation of our model accordingly. Due to the hardware precision limit, we need to embed the couplers and biases on an integer scale with limited granularity. In our experiments, we normalize the pairwise costs $S_{uv}$ to a discrete integer range, and the complementary cost $(1 - S_{uv})$ is rescaled accordingly. Note that the theoretical bound on the penalty weight is adjusted accordingly.
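A minimal sketch of such an integer rescaling is shown below. The scale factor is a free parameter here (the value 100 used in the usage note is purely hypothetical), and the helper names are not part of any DA interface.

```python
def quantize_similarity(S, scale):
    """Map similarities in [0, 1] to integers in [0, scale]."""
    return [[int(round(s * scale)) for s in row] for row in S]


def integer_costs(S_int, scale):
    """Return integer cost matrices for same-cluster and different-cluster pairs."""
    same_cluster_cost = [[scale - s for s in row] for row in S_int]
    diff_cluster_cost = [[s for s in row] for row in S_int]
    return same_cluster_cost, diff_cluster_cost
```

For instance, with a hypothetical `scale = 100`, a similarity of 0.25 becomes 25 and its complement becomes 75; a penalty weight that was valid for costs bounded by one would then need to be multiplied by the same factor.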
The theoretical bound guarantees that all constraints are satisfied if problems are solved to optimality. In practice, the DA does not necessarily solve problems to optimality and due to the nature of annealing-based algorithms, using very high weights for constraints is likely to create deep local minima and result in solutions that may satisfy the constraints but are often of low-quality. This is especially relevant to our pairwise similarity model where the bound tends to become loose as the number of clusters grows. In our experiments, we use constant, reasonably high, weights that were empirically found to perform well across datasets. For the pairwise similarity-based model (Equation (10)) we use , and for the partition difference model (Equation (13)) we use . While we expect to get better performance by tuning the weights per-dataset, our goal is to demonstrate the performance of our approach in a general setting. Automatic tuning of the weight values for the DA is a direction for future work.
Unlike many of the existing consensus clustering algorithms that run until convergence, our method runs for a given time limit (defined by the number of runs and iterations) and returns the best solution encountered. In our experiments, we arbitrarily choose three seconds as a (reasonably short) time limit to solve our Ising models. As with the weights, we employ a single temperature schedule across all datasets, and do not tune it per dataset.
4 Empirical Evaluation
We perform an extensive empirical evaluation of our approach using a set of seven benchmark datasets. We first describe how we generate the set of clusterings, $\Pi$. Next, we describe the baselines, the evaluation metrics, and the datasets.
4.0.1 Generating Partitions
Following prior work, we generate a set of clusterings by randomizing the parameters of the K-Means algorithm, namely the number of clusters $K$ and the initial cluster centers. In this work, we only use labelled datasets for which we know the number of clusters, $\tilde{K}$, based on the true labels. To generate the base clusterings, we run the K-Means algorithm with random initial cluster centers and a randomly chosen number of clusters $K$. For each dataset, we generate 100 clusterings to serve as the clustering set $\Pi$.
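The generation procedure can be sketched as follows. This is a self-contained toy version, not the experimental setup: it uses a small hand-written Lloyd-style K-Means instead of a library implementation, and the bounds `k_min` and `k_max` on the random number of clusters are hypothetical parameters.

```python
import random


def kmeans_labels(points, k, rng, iters=20):
    """Basic Lloyd iterations starting from k randomly chosen data points as centers."""
    centers = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest center by squared Euclidean distance.
        for idx, p in enumerate(points):
            labels[idx] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])),
            )
        # Update step: move each non-empty center to the mean of its members.
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels


def generate_base_clusterings(points, m, k_min, k_max, seed=0):
    """Run K-Means m times, each with a random k in [k_min, k_max] and random centers."""
    rng = random.Random(seed)
    return [kmeans_labels(points, rng.randint(k_min, k_max), rng) for _ in range(m)]
```

Calling `generate_base_clusterings(points, m=100, k_min=2, k_max=6)` on a list of coordinate tuples would produce 100 label lists that can be passed on to the similarity matrix construction.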
4.0.2 Baseline Algorithms
We compare our pairwise similarity-based Ising model, referred to as DA-Sm, and our correlation clustering Ising model, referred to as DA-Cr, to three popular algorithms for consensus clustering: CSPA, NMF, and HAC (described in Section 2.2).
We evaluate the different methods using three measures. Our main concern in this work is the level of agreement between the consensus clustering and the set of input clusterings. To this end, one requires a metric for the similarity of two clusterings that can be used to measure how close the consensus clustering is to each base clustering. Two widely used metrics for the similarity between two clusterings are the Rand Index (RI) and the Adjusted Rand Index (ARI) [11]. The Rand Index of two clusterings lies between 0 and 1, obtaining the value 1 when both clusterings perfectly agree. Likewise, the maximum score of ARI, which is a corrected-for-chance version of RI, is achieved when both clusterings perfectly agree. $\mathrm{ARI}(\pi_i, \pi^*)$ can be viewed as a measure of agreement between the consensus clustering $\pi^*$ and a base clustering $\pi_i \in \Pi$. We use the mean ARI as the main evaluation criterion:

$$\overline{\mathrm{ARI}} = \frac{1}{m} \sum_{i=1}^{m} \mathrm{ARI}(\pi_i, \pi^*) \tag{14}$$
We also evaluate based on clustering quality and accuracy. For clustering quality, we use the mean Silhouette Coefficient [23] of all data points (computed using the Euclidean distance between the data points). For clustering accuracy, we compute the ARI between the consensus partition and the true labels.
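For reference, the mean ARI criterion can be computed with the standard contingency-table formula for the Adjusted Rand Index. The sketch below is a plain-Python version under the assumption that clusterings are label lists of equal length; the function names are hypothetical, and returning 1.0 in the degenerate case where the denominator vanishes is a convention chosen here.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(a, b):
    """ARI between two label lists, computed from pair counts."""
    n = len(a)
    sum_cells = sum(comb(c, 2) for c in Counter(zip(a, b)).values())
    sum_a = sum(comb(c, 2) for c in Counter(a).values())
    sum_b = sum(comb(c, 2) for c in Counter(b).values())
    expected = sum_a * sum_b / comb(n, 2)
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:
        # Degenerate case (e.g., both clusterings are a single cluster).
        return 1.0
    return (sum_cells - expected) / (max_index - expected)


def mean_ari(clusterings, consensus):
    """Average agreement between the consensus and each base clustering."""
    return sum(adjusted_rand_index(pi, consensus) for pi in clusterings) / len(clusterings)
```

Two clusterings that differ only in cluster numbering score 1.0, while `adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])` is negative, reflecting agreement below what chance would give.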
4.0.4 Benchmark Datasets
We run experiments on seven datasets with different characteristics: Iris, Optdigits, Pendigits, Seeds, and Wine from the UCI repository [5], as well as Protein [31] and MNIST (http://yann.lecun.com/exdb/mnist/). Optdigits-389 is a randomly sampled subset of Optdigits containing only the digits $\{3, 8, 9\}$. Similarly, MNIST-3689 and Pendigits-149 are subsets of the MNIST and Pendigits datasets.
Table 1 provides statistics on each dataset, with the coefficient of variation (CV) [4] describing the degree of class imbalance: zero indicates perfectly balanced classes, while higher values indicate a higher degree of class imbalance.
Table 1: Dataset statistics. Columns: Dataset, # Instances, # Features, # Clusters, CV.
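As a brief illustration of the CV column, the coefficient of variation of the class sizes can be computed as the standard deviation of the sizes divided by their mean. The sketch below assumes the population standard deviation; whether the population or the sample form was used for Table 1 is not stated, and the function name is hypothetical.

```python
from collections import Counter
from statistics import mean, pstdev


def class_imbalance_cv(labels):
    """Coefficient of variation of class sizes: 0 means perfectly balanced classes."""
    sizes = list(Counter(labels).values())
    return pstdev(sizes) / mean(sizes)
```

A dataset with equally sized classes yields 0, while a split of three points versus one yields 0.5.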
Clustering is typically an unsupervised task and the number of clusters is unknown. The number of clusters in the true labels, , is not available in real scenarios. Furthermore, is not necessarily the best value for clustering tasks (e.g., in many cases it is better to have smaller clusters that are more pure). We therefore test the algorithms in two configurations: when the number of clusters is set to , as in the true labels, and when the number of clusters is set to .
4.1.1 Consensus Criteria
Table 2 shows the mean ARI between $\pi^*$ and the clusterings in $\Pi$. To avoid bias due to very minor differences, we consider all methods whose mean ARI is within a threshold of 0.0025 of the best method to be equivalent, and highlight them in bold. We also summarize the number of times each method was considered best across the different datasets.
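The tie-handling rule described above amounts to a simple threshold comparison, sketched below. The method names and scores in the usage note are placeholders and do not correspond to any reported result.

```python
def equivalent_to_best(scores, threshold=0.0025):
    """Return the methods whose score is within `threshold` of the best score."""
    best = max(scores.values())
    return sorted(name for name, s in scores.items() if best - s <= threshold)
```

For example, `equivalent_to_best({"method_a": 0.512, "method_b": 0.511, "method_c": 0.480})` treats the first two placeholder methods as equivalent and excludes the third.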
The results show that DA-Cr is the best performing method for both and clusters. The results of DA-Sm are not consistent: DA-Sm and NMF are performing well for clusters and HAC is performing better for clusters.
4.1.2 Clustering Quality
Table 3 reports the mean Silhouette Coefficient of all data points. Again, DA-Cr is the best performing method across datasets, followed by HAC. NMF seems to be equivalent to HAC for .
4.1.3 Clustering Accuracy
Table 4 shows the clustering accuracy measured by the ARI between $\pi^*$ and the true labels. For , we find DA-Sm to be the best-performing solution (followed by DA-Cr). For , DA-Cr outperforms the other methods. Interestingly, there is no clear winner between CSPA, NMF, and HAC.
4.1.4 Experiments with higher $K$
In partition difference approaches, increasing $K$ does not necessarily lead to a $\pi^*$ that has more clusters. Instead, $K$ serves as an upper bound, and new clusters will be used only if they reduce the objective.
To demonstrate how different algorithms handle different $K$ values, Table 5 shows the consensus criteria and the actual number of clusters in $\pi^*$ for different values of $K$ (note that $\tilde{K} = 3$ in Iris). The results show that the performance of the pairwise similarity methods (CSPA, HAC, DA-Sm) degrades as we increase $K$. This is associated with the fact that the actual number of clusters in $\pi^*$ is equal to $K$, which is significantly higher than in the clusterings in $\Pi$. Methods based on partition difference (NMF and DA-Cr) do not exhibit significant degradation, and the actual number of clusters does not grow beyond 5 for DA-Cr and 6 for NMF. Note that the average number of clusters in $\Pi$ is .
Table 5: Results for different values of $K$. Column groups: Consensus Criteria; # of clusters in consensus clustering.
Motivated by the recent emergence of specialized hardware platforms, we present a new approach to the consensus clustering problem that is based on Ising models and solved on the Fujitsu Digital Annealer, a specialized CMOS hardware platform. We perform an extensive empirical evaluation and show that our approach outperforms existing methods on a set of seven datasets. These results show that using specialized hardware in core data mining tasks can be a promising research direction. As future work, we plan to investigate additional problems in data mining that can benefit from the use of specialized optimization hardware, as well as to experiment with different types of specialized hardware platforms.
[1] Aramon, M., Rosenberg, G., Valiante, E., Miyazawa, T., Tamura, H., Katzgraber, H.G.: Physics-inspired optimization for quadratic unconstrained problems using a digital annealer. Frontiers in Physics 7 (2019)
[2] Bian, Z., Chudak, F., Macready, W.G., Rose, G.: The Ising model: teaching an old problem new tricks. D-Wave Systems 2 (2010)
[3] Coffrin, C., Nagarajan, H., Bent, R.: Evaluating Ising processing units with integer programming. In: CPAIOR. pp. 163–181 (2019)
[4] DeGroot, M.H., Schervish, M.J.: Probability and Statistics. Pearson (2012)
[5] Dua, D., Graff, C.: UCI machine learning repository (2017), http://archive.ics.uci.edu/ml
[6] Filkov, V., Skiena, S.: Integrating microarray data by consensus clustering. International Journal on Artificial Intelligence Tools 13(04), 863–880 (2004)
[7] Fred, A.L., Jain, A.K.: Combining multiple clusterings using evidence accumulation. IEEE TPAMI 27(6), 835–850 (2005)
[8] Fujitsu: Digital annealer. https://www.fujitsu.com/jp/digitalannealer/
[9] Ghosh, J., Acharya, A.: Cluster ensembles. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 1(4), 305–315 (2011)
[10] Gionis, A., Mannila, H., Tsaparas, P.: Clustering aggregation. TKDD 1(1), 4 (2007)
[11] Hubert, L., Arabie, P.: Comparing partitions. J. Classification 2(1), 193–218 (1985)
[12] Karypis, G., Kumar, V.: Multilevel k-way partitioning scheme for irregular graphs. Journal of Parallel and Distributed Computing 48(1), 96–129 (1998)
[13] Kirkpatrick, S., Gelatt, C.D., Vecchi, M.P.: Optimization by simulated annealing. Science 220(4598), 671–680 (1983)
[14] Kumar, V., Bass, G., Tomlin, C., Dulny, J.: Quantum annealing for combinatorial clustering. Quantum Information Processing 17(2), 39 (2018)
[15] Li, T., Ding, C., Jordan, M.I.: Solving consensus and semi-supervised clustering problems using nonnegative matrix factorization. In: ICDM. pp. 577–582 (2007)
[16] Li, T., Ogihara, M., Ma, S.: On combining multiple clusterings: an overview and a new perspective. Applied Intelligence 33(2), 207–219 (2010)
[17] Liu, X., Ushijima-Mwesigwa, H., Mandal, A., Upadhyay, S., Safro, I., Roy, A.: On modeling local search with special-purpose combinatorial optimization hardware. arXiv preprint arXiv:1911.09810 (2019)
[18] Lucas, A.: Ising formulations of many NP problems. Frontiers in Physics 2, 5 (2014)
[19] Naghsh, Z., Javad-Kalbasi, M., Valaee, S.: Digitally annealed solution for the maximum clique problem with critical application in cellular V2X. In: ICC. pp. 1–7 (2019)
[20] Negre, C.F.A., Ushijima-Mwesigwa, H., Mniszewski, S.M.: Detecting multiple communities using quantum annealing on the D-Wave system. PLOS ONE 15, 1–14 (02 2020), https://doi.org/10.1371/journal.pone.0227538
[21] Nguyen, N., Caruana, R.: Consensus clusterings. In: ICDM. pp. 607–612 (2007)
[22] Rahman, M.T., Han, S., Tadayon, N., Valaee, S.: Ising model formulation of outlier rejection, with application in WiFi based positioning. In: ICASSP. pp. 4405–4409 (2019)
[23] Rousseeuw, P.J.: Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. J. Computational and Applied Mathematics 20, 53–65 (1987)
[24] Shaydulin, R., Ushijima-Mwesigwa, H., Safro, I., Mniszewski, S., Alexeev, Y.: Network community detection on small quantum computers. Advanced Quantum Technologies p. 1900029 (2019)
[25] Strehl, A., Ghosh, J.: Cluster ensembles - a knowledge reuse framework for combining multiple partitions. JMLR 3(Dec), 583–617 (2002)
[26] Topchy, A., Jain, A.K., Punch, W.: Clustering ensembles: Models of consensus and weak partitions. IEEE TPAMI 27(12), 1866–1881 (2005)
[27] Ushijima-Mwesigwa, H., Negre, C.F., Mniszewski, S.M.: Graph partitioning using quantum annealing on the D-Wave system. In: PMES. pp. 22–29 (2017)
[28] Ushijima-Mwesigwa, H., Shaydulin, R., Negre, C.F., Mniszewski, S.M., Alexeev, Y., Safro, I.: Multilevel combinatorial optimization across quantum architectures. arXiv preprint arXiv:1910.09985 (2019)
[29] Vega-Pons, S., Ruiz-Shulcloper, J.: A survey of clustering ensemble algorithms. IJPRAI 25(03), 337–372 (2011)
[30] Wu, J., Liu, H., Xiong, H., Cao, J., Chen, J.: K-means-based consensus clustering: A unified view. IEEE TKDE 27(1), 155–169 (2014)
[31] Xing, E.P., Jordan, M.I., Russell, S.J., Ng, A.Y.: Distance metric learning with application to clustering with side-information. In: NIPS. pp. 521–528 (2003)