Data measured from a high-order complex system can be difficult to analyze. A convenient way to store such data is as a tensor, or d-way array. Each entry of the array records the value obtained for one combination of the parameters. Often, the dependencies between indices are not clear, making interpretation of the data a demanding task. Clustering slices or sub-tensors allows one to parse these complex interdependencies and provide meaningful interpretation.
The focus of this paper is a new method for clustering slices of a tensor. This procedure is designed to address some of the fundamental flaws of clustering discussed below. While the method is general, it is designed for a particular domain of application, namely, an improved capability to detect and identify climate biomes.
1.1 Clustering challenges
Clustering is known to be an ill-defined problem in the sense that no clustering algorithm satisfies all desirable clustering criteria [kleinberg2003impossibility]. Further, the number of possible clusterings of a data set is astronomically large, leading to a difficult search problem. As a result, some prior bias for exploring the space of clusterings must be adopted. However, the resulting optimization schemes are almost always NP-hard [mahajan2009planar].
Because each clustering requires many choices, different clustering measures have been formulated to assess the quality of a clustering. As with the clustering schemes themselves, these measures are largely arbitrary. Indeed, often the measures are directly exported from the optimization functions used in the clustering algorithm. The algorithm that is designed to optimize a given clustering measure will, by design, outperform other clustering methods with respect to that measure. As a result, such a comparison provides no further information as to which clustering strategy is better suited for the problem.
These challenges highlight that there is no true “best” clustering in general. Rather, there are many good clusterings that arise from the specifics of the scientific inquiry pursued. However, the quality of the clustering is not easy to evaluate. Furthermore, no particular clustering in this collection is certifiably “correct”, but each provides different insights into the structure of the data.
While a collection of clusterings is more robust to error than a single clustering, it is often less interpretable. To some extent, this defeats the purpose of clustering. This lack of interpretability has led researchers to define the concept of an ensemble, or consensus clustering [nguyen2007consensus]. Here, many clusterings are combined to produce a single clustering of the data. Common features between the clusterings are amplified, and artifacts become dulled (Figure 1a). Generally, selecting smaller ensembles with diverse clusters has been shown to outperform larger ensembles [fern2008cluster]. Therefore, it is advantageous for users to adopt a method of parsimony for constructing their ensemble.
1.2 Classifying climate biomes
There are still unresolved issues not addressed by the current ensemble clustering framework. One practical example is the scale at which the data is acquired. Most natural or environmental data is formed by directly observing and measuring quantities where the underlying or driving processes are usually unknown. The hidden or latent features of the data may not clearly present themselves at the resolution at which the data was sampled. This problem arises in the climate sciences. Here, weather data is often gathered at fine temporal and spatial detail, e.g., daily temperature at a single weather station. However, climate signals are often observed on the order of years or decades and across a region.
Thus, climate data frequently arises as a tensor. At a single site located at a specific spatial coordinate, one has a time series of various climate measurements. The tensor of climate data compactly records the complex interdependence between space and time for different variables of interest. Clustering the data according to the spatial index amounts to identifying climate biomes.
Historically, the standard used to classify climate biomes has been the Köppen-Geiger (KG) model [kottek2006world]. The KG model is an expert-based judgment that describes climate zones using temperature and precipitation measurements. The KG model utilizes a fixed decision tree, where each branch uses various information about temperature and precipitation. This heuristic allows one to broadly assess climate regions. While KG is interpretable, it is overly simplistic and somewhat arbitrary. In an attempt to remedy this problem, Thornthwaite [thornthwaite1948approach] introduced a more nuanced model using moisture and thermal factors. However, the Thornthwaite model (along with its successors) still suffers from expertly chosen biases in its parameters.
A solution to this problem is to move towards data-driven methods of classification. Here, the human bias is placed onto the machine learning algorithm that seeks to minimize some cost function. This is equivalent to a statistical assumption about the data generation and distribution [bishop2006pattern]. In the views of the authors, this is often a more reasonable assignment of bias. In [zscheischler2012climate], Zscheischler et al. compare KG to the K-means algorithm. They show that, unsurprisingly, K-means outperforms KG with respect to some statistical measures. In [netzel2016using], the authors use mean monthly climate data to perform hierarchical clustering and partition around medoids. In each clustering algorithm, two distance metrics are tested, and these results are compared to KG using an information-theoretic measure.
These data-driven approaches to climate clustering are an epistemological improvement over the expert-chosen heuristics of KG. However, data-driven methods still suffer from two key problems. First, the algorithms themselves are user-chosen, and therefore somewhat arbitrary. Because clustering is an ill-posed problem, no single clustering is necessarily a clear improvement over another. Rather, a collection of clusterings with reasonable coherency should be assembled for further analysis. Second, because the algorithms are dependent on the input data, they are dependent on the scale at which the data is acquired. A priori, it is not clear how this “hidden parameter” of scale affects the overall clustering result. Latent features of the climate system may appear at different coarse-grainings, and it is important to analyze how the scale affects the clustering of the data.
1.3 Key contributions
The above discussion highlights two important problems: 1) the unknown dependence of a clustering on the scale of the data and 2) the necessity to build an ensemble of clusterings. In this work, we discuss a clustering method that illuminates these dependencies to build an ensemble of clusterings that efficiently represents the diversity across different coarse-grainings. We develop a technique, which we call coarse-grain clustering (CGC), that uses the discrete wavelet transform to cluster slices of tensors at different scales. This results in many potential clusterings, one for each chosen coarse-graining. Not all of these coarse-grain clusterings provide new information, however. Thus, we present a novel selection method that leverages mutual information between clusterings to quantify the loss of information between clusterings and select a small subset that best represents the ensemble. We call this reduction algorithm Mutual Information Ensemble Reduce (MIER).
While the end-to-end workflow we have discussed involves ideas from traditional consensus clustering (e.g., Figure 1a), the focus of this paper is specifically on a novel modification to this approach: a classification built using coarse-grain clustering (CGC) and reduced using Mutual Information Ensemble Reduce (MIER), i.e., the blue highlighted portion of Figure 1b.
This paper is organized as follows. First, background material used for development of the CGC and MIER algorithms is presented in Section 2. The structure of these algorithms is detailed in Section 3. In Section 4, the algorithms are applied to a widely used climate data set as a case study with presentation of results and discussion, followed by conclusions and a recommendation for future work in Section 5.
In this section, we briefly review key mathematical tools used throughout this work including 1) the discrete wavelet transform and its role in separating earth system data into spatio-temporal scales, 2) graph cuts and their connection to spectral clustering, and 3) use of mutual information to measure similarity between two clusterings of the same data.
2.1 Discrete Wavelet Transform (DWT)
Given a one-dimensional discrete function $f$, the discrete wavelet transform (DWT) is a process of iteratively decomposing $f$ into a series of low and high frequency signals. The low frequency signal is often referred to as the approximation coefficients, and the high frequency signal is called the detail coefficients. This process is accomplished by convolving the function with low frequency and high frequency filter functions that arise from a choice of mother wavelet, sometimes called a wavelet for short.
We are interested in multi-way signals, namely tensor data. The wavelet transform of a tensor is obtained by taking one-dimensional wavelet transforms along each axis of interest, where a different wavelet may be chosen for each axis. Note that the DWT can be applied multiple times to a tensor axis. At each step, the signal is decomposed into its high and low frequency signals. These are then downsampled. Taking the low frequency signal, one again performs the DWT, splitting this into another high and low frequency signal to be further downsampled. This process, known as a filter bank, is illustrated in Figure 2. For a comprehensive overview of wavelets, see [jensen2001ripples].
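As an illustration of the filter bank, the following minimal sketch applies the Haar wavelet to a one-dimensional signal using only the Python standard library. The function names `haar_step` and `haar_filter_bank`, the sign convention of the detail coefficients, and the restriction to even-length signals are assumptions made for this example; a wavelet library would handle other wavelets and boundary conditions.

```python
import math

def haar_step(signal):
    """One DWT level with the Haar wavelet: returns (approximation, detail)."""
    if len(signal) % 2:
        raise ValueError("this sketch assumes an even-length signal")
    s = 1.0 / math.sqrt(2.0)
    pairs = list(zip(signal[0::2], signal[1::2]))
    approx = [s * (a + b) for a, b in pairs]  # low-pass filter, then downsample
    detail = [s * (a - b) for a, b in pairs]  # high-pass filter, then downsample
    return approx, detail

def haar_filter_bank(signal, levels):
    """Repeatedly split the approximation, keeping the detail of every level."""
    approx, details = list(signal), []
    for _ in range(levels):
        approx, detail = haar_step(approx)
        details.append(detail)
    return approx, details

signal = [4.0, 6.0, 10.0, 12.0, 8.0, 6.0, 5.0, 5.0]
approx, details = haar_filter_bank(signal, levels=2)
# approx is close to [16.0, 12.0]: two coefficients summarizing four samples each
```

Each level halves the length of the approximation, which is the sense in which repeated transforms coarse-grain the signal.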
Thus, low-frequency information separated by the wavelet captures climatology and large-scale spatial features; high-frequency information quantifies weather. For example, coarse-graining temporal signals captures seasonal, yearly, and eventually decadal trends, whereas coarse-graining spatial information captures city, county, and eventually state size features. Therefore, to classify regional climate systems into biomes, we use the wavelet approximation coefficients.
2.2 Clustering algorithms and graph cuts
2.2.1 K-means and spectral clustering
Clustering algorithms are diverse with varying advantages and disadvantages [fahad2014survey]. Arguably the most famous are partition-based algorithms, where data is iteratively reassigned to clusters until an optimization function is minimized. The prototypical example of a partition-based clustering algorithm is K-means. Given a natural number $k$, the K-means algorithm seeks to partition the data set into $k$ distinct groups that minimize the variance within the clusters.
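The iterative reassignment can be sketched with Lloyd's algorithm, the usual heuristic for the K-means objective. This is a minimal standard-library sketch, not a production implementation: seeding the centroids with the first $k$ points is an assumption made here for reproducibility, and the function names are hypothetical.

```python
def kmeans(points, k, iterations=100):
    """Lloyd's algorithm on a list of equal-length tuples; returns (labels, centroids)."""
    centroids = [tuple(p) for p in points[:k]]  # assumption: seed with the first k points
    labels = None
    for _ in range(iterations):
        # assignment step: each point goes to its nearest centroid
        new_labels = [
            min(range(k), key=lambda c: sum((x - m) ** 2 for x, m in zip(p, centroids[c])))
            for p in points
        ]
        if new_labels == labels:  # no reassignment, so the iteration has converged
            break
        labels = new_labels
        # update step: each centroid becomes the mean of its members
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster empties
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels, centroids

def within_cluster_ss(points, labels, centroids):
    """The K-means objective: total squared distance of points to their centroid."""
    return sum(
        sum((x - m) ** 2 for x, m in zip(p, centroids[lab]))
        for p, lab in zip(points, labels)
    )

points = [(0.0, 0.0), (10.0, 10.0), (0.0, 1.0), (1.0, 0.0), (10.0, 11.0), (11.0, 10.0)]
labels, centroids = kmeans(points, k=2)
objective = within_cluster_ss(points, labels, centroids)
```

Lloyd's algorithm only finds a local minimum of the objective, so in practice the result depends on the initialization.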
Another popular method of clustering is spectral clustering, whereby one leverages spectral graph theory to separate the data into clusters. In spectral clustering, an undirected weighted graph is formed, where each vertex is a data point and the edge weight is a chosen affinity between vertices.
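Let $W = (w_{ij})$ denote the weighted adjacency matrix for the graph $G$. The (unnormalized) graph Laplacian of $G$ is the matrix $L = D - W$, where $D$ is the diagonal degree matrix with entries $d_{ii} = \sum_j w_{ij}$; it captures the combinatorial properties of the Laplacian on discrete data. The Laplacian $L$ is a symmetric positive semi-definite matrix, so its eigenvalues may be ordered $0 = \lambda_1 \leq \lambda_2 \leq \dots \leq \lambda_n$. Finding the eigenvectors $u_1, \dots, u_k$ corresponding to the $k$ lowest eigenvalues, define $U = [u_1, \dots, u_k]$ and cluster the rows of $U$ using K-means. For more details, see [ng2002spectral].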
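To make the positive semi-definiteness concrete, the sketch below builds the unnormalized Laplacian (degree matrix minus adjacency matrix) of a small, hypothetical weighted graph and evaluates the quadratic form $f^\top L f$, which equals $\frac{1}{2}\sum_{i,j} w_{ij}(f_i - f_j)^2$ and is therefore never negative. The graph and the function names are assumptions chosen for illustration.

```python
def graph_laplacian(W):
    """Unnormalized Laplacian (degree matrix minus W) for a symmetric weight matrix."""
    n = len(W)
    degrees = [sum(row) for row in W]
    return [[(degrees[i] if i == j else 0.0) - W[i][j] for j in range(n)] for i in range(n)]

def quadratic_form(L, f):
    """Evaluate f^T L f."""
    n = len(L)
    return sum(f[i] * L[i][j] * f[j] for i in range(n) for j in range(n))

# Hypothetical graph: two triangles {0,1,2} and {3,4,5} joined by one weak edge 2-3.
W = [
    [0.0, 1.0, 1.0, 0.0, 0.0, 0.0],
    [1.0, 0.0, 1.0, 0.0, 0.0, 0.0],
    [1.0, 1.0, 0.0, 0.1, 0.0, 0.0],
    [0.0, 0.0, 0.1, 0.0, 1.0, 1.0],
    [0.0, 0.0, 0.0, 1.0, 0.0, 1.0],
    [0.0, 0.0, 0.0, 1.0, 1.0, 0.0],
]
L = graph_laplacian(W)
constant = [1.0] * 6                               # constant vectors are in the kernel of L
indicator = [1.0, 1.0, 1.0, -1.0, -1.0, -1.0]      # separates the two triangles
```

The vector that separates the two triangles gives a small value of the quadratic form because only the weak edge is cut, which is the intuition behind clustering with the lowest eigenvectors.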
However, K-means and spectral clustering require the user to choose $k$, and additional heuristics are needed to constrain the search space. In spectral clustering, one can use the eigenvalues of the Laplacian to determine the cluster number. Specifically, as the eigenvalues are ordered, search for a value of $k$ such that the first $k$ eigenvalues $\lambda_1, \dots, \lambda_k$ are small, and $\lambda_{k+1}$ is large. This method is justified by the fact that the spectral properties of $L$ are closely related to the connected components of $G$ [von2007tutorial]. Use of graph Laplacian eigenvalues to decide the cluster number is called the eigen-gap heuristic.
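One common way to turn the eigen-gap heuristic into a rule is to pick the $k$ that maximizes the gap between consecutive eigenvalues. The sketch below assumes that rule; the function name `eigengap_k` and the example spectrum are hypothetical, and other formalizations of "small" and "large" are possible.

```python
def eigengap_k(eigenvalues):
    """Return the k that maximizes the gap between the k-th and (k+1)-th smallest eigenvalue."""
    vals = sorted(eigenvalues)
    gaps = [vals[i + 1] - vals[i] for i in range(len(vals) - 1)]
    # index i holds the gap after the (i+1)-th eigenvalue, hence the +1
    return max(range(len(gaps)), key=gaps.__getitem__) + 1

# Hypothetical spectrum: three near-zero eigenvalues, then a clear jump.
spectrum = [0.0, 0.02, 0.05, 0.9, 1.1, 1.3]
k = eigengap_k(spectrum)
```

For this spectrum the largest gap follows the third eigenvalue, so the rule suggests three clusters.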
2.2.2 Graph cut clustering
Given a notion of distance between data points, the adjacency graph $G$, or equivalently its adjacency matrix $W$, records the pairwise similarity. Clustering the data into $k$ clusters is equivalent to providing a cut of the adjacency graph $G$ into $k$ components. Graph cut strategies vary depending upon application. For example, the min cut algorithm minimizes the total edge weight between components of the graph, but this can result in an undesirable clustering, e.g., a cluster with one element.
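The ratio cut algorithm is a graph cut that seeks to ameliorate this issue by incorporating the size of each component. Concretely, let $V$ be the vertex set of $G$, let $A \subset V$, let $\bar{A}$ be the complement of $A$, and let $W(A, B) = \sum_{i \in A, j \in B} w_{ij}$. Given disjoint subsets $A_1, \dots, A_k$ such that $A_1 \cup \dots \cup A_k = V$, its ratio cut is defined as
$$\mathrm{RatioCut}(A_1, \dots, A_k) = \frac{1}{2} \sum_{i=1}^{k} \frac{W(A_i, \bar{A}_i)}{|A_i|}. \tag{1}$$
Finding $A_1, \dots, A_k$ such that Equation 1 is minimized is NP-hard [wagner1993between]. However, a solution to a relaxed ratio cut problem can be obtained using spectral clustering [von2007tutorial].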
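The difference between min cut and ratio cut can be checked on a small example. The sketch below assumes the ratio cut of a partition is half the sum, over its parts, of the edge weight leaving the part divided by the part's size; the graph is hypothetical and was chosen so that the cheapest cut isolates a single vertex.

```python
def cut_weight(W, part):
    """Total edge weight between `part` and its complement."""
    inside = set(part)
    return sum(W[i][j] for i in inside for j in range(len(W)) if j not in inside)

def ratio_cut(W, parts):
    """Half the sum over parts of (weight leaving the part) / (size of the part)."""
    return 0.5 * sum(cut_weight(W, part) / len(part) for part in parts)

# Hypothetical graph: vertex 0 is weakly attached to the triangle-like group {1, 2},
# and groups {0,1,2} and {3,4,5} are joined by the edge 2-3.
W = [
    [0.0, 0.5, 0.5, 0.0, 0.0, 0.0],
    [0.5, 0.0, 1.0, 0.0, 0.0, 0.0],
    [0.5, 1.0, 0.0, 1.5, 0.0, 0.0],
    [0.0, 0.0, 1.5, 0.0, 1.0, 1.0],
    [0.0, 0.0, 0.0, 1.0, 0.0, 1.0],
    [0.0, 0.0, 0.0, 1.0, 1.0, 0.0],
]
singleton = [[0], [1, 2, 3, 4, 5]]
balanced = [[0, 1, 2], [3, 4, 5]]
```

Isolating vertex 0 removes less edge weight than the balanced split, so min cut prefers it, while the ratio cut penalizes the one-element component and prefers the balanced split.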
The MIER algorithm will require a graph cut of a particular adjacency matrix formed by a large ensemble of coarse-grain clusterings. As discussed in Section 3.2, we perform a ratio cut on the adjacency graph formed using normalized mutual information. Consequently, to implement a ratio cut in the MIER algorithm, we will use spectral clustering on this adjacency matrix.
2.3 Mutual Information
Mutual information provides a method to quantify the information shared between two clusterings. Here, we outline how the mutual information is computed. For a more detailed account of mutual information and clustering, see [dom2002information] and [vinh2010information].
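Let $X$ be a collection of $N$ data points. Suppose that we partition the data into two clusterings $\mathcal{P} = \{P_1, \dots, P_R\}$ and $\mathcal{Q} = \{Q_1, \dots, Q_S\}$. The entropy of the clustering $\mathcal{P}$, denoted $H(\mathcal{P})$, is the average amount of information (e.g., in bits) needed to encode the cluster label for each data point of $X$. If the clustering $\mathcal{Q}$ is known, $\mathcal{P}$ can be encoded with fewer bits of information. The conditional entropy $H(\mathcal{P} \mid \mathcal{Q})$ denotes the average amount of information needed to encode $\mathcal{P}$ if $\mathcal{Q}$ is known.
The mutual information $I(\mathcal{P}, \mathcal{Q})$ measures how knowledge of one clustering reduces our uncertainty of the other. Formally,
$$I(\mathcal{P}, \mathcal{Q}) = H(\mathcal{P}) - H(\mathcal{P} \mid \mathcal{Q}).$$
Explicit formulas for $H(\mathcal{P})$ and $I(\mathcal{P}, \mathcal{Q})$ can be derived as follows. Let $n_{ij}$ denote the number of points in both $P_i$ and $Q_j$. We set $a_i$ to be the size of $P_i$, and $b_j$ to be the size of $Q_j$.
Assume points of $X$ are sampled uniformly. Then the probability that a random point in $X$ is in cluster $P_i$ is $a_i / N$. Moreover, the probability that a point $x$ satisfies $x \in P_i$ and $x \in Q_j$ is $n_{ij} / N$. Therefore, it follows that
$$H(\mathcal{P}) = -\sum_{i=1}^{R} \frac{a_i}{N} \log \frac{a_i}{N}, \qquad I(\mathcal{P}, \mathcal{Q}) = \sum_{i=1}^{R} \sum_{j=1}^{S} \frac{n_{ij}}{N} \log \frac{n_{ij} / N}{a_i b_j / N^2}.$$
Notice that $H(\mathcal{P} \mid \mathcal{Q}) \geq 0$, and $I(\mathcal{P}, \mathcal{Q}) = I(\mathcal{Q}, \mathcal{P})$. It then follows that
$$I(\mathcal{P}, \mathcal{Q}) \leq \min\{H(\mathcal{P}), H(\mathcal{Q})\} \leq \frac{1}{2}\left(H(\mathcal{P}) + H(\mathcal{Q})\right). \tag{2}$$
Since the mutual information is also non-negative, one can normalize it to take on values in $[0, 1]$. Equation 2 shows there is more than one way to do this: for example, one can divide the mutual information either by the minimum or by the average of the entropies. There are, in fact, many ways to normalize the mutual information, each with its own benefits and downsides [vinh2010information]. Throughout, we normalize using the average value, and therefore define the normalized mutual information as
$$\mathrm{NMI}(\mathcal{P}, \mathcal{Q}) = \frac{2\, I(\mathcal{P}, \mathcal{Q})}{H(\mathcal{P}) + H(\mathcal{Q})}.$$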
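The counting argument above translates directly into code. The following standard-library sketch computes the entropy, the mutual information, and the mutual information normalized by the average entropy from two lists of cluster labels. The function names are hypothetical, and returning 1 when both clusterings consist of a single cluster (so that both entropies vanish) is a convention assumed here, not one stated in the text.

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy of a clustering given as one label per data point."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())

def mutual_information(labels_p, labels_q):
    """Mutual information computed from the contingency counts of two clusterings."""
    n = len(labels_p)
    size_p, size_q = Counter(labels_p), Counter(labels_q)
    joint = Counter(zip(labels_p, labels_q))  # counts of points shared by each pair of clusters
    return sum(
        (n_ij / n) * math.log(n_ij * n / (size_p[i] * size_q[j]))
        for (i, j), n_ij in joint.items()
    )

def nmi(labels_p, labels_q):
    """Mutual information divided by the average of the two entropies."""
    denom = entropy(labels_p) + entropy(labels_q)
    if denom == 0.0:  # assumption: two trivial clusterings are treated as identical
        return 1.0
    return 2.0 * mutual_information(labels_p, labels_q) / denom

same_up_to_renaming = nmi([0, 0, 1, 1], [1, 1, 0, 0])
independent = nmi([0, 0, 1, 1], [0, 1, 0, 1])
```

The normalized value depends only on how the points are grouped, not on the names of the labels, which is why it is suitable for comparing clusterings produced by separate runs.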
3 The CGC and MIER algorithms
Here, we present our wavelet-based clustering model for classifying slices of tensor data. We detail the clustering algorithm Coarse-Grain Clustering (CGC) and present a method for selecting clusterings to include in an ensemble based on the mutual information, which we call Mutual Information Ensemble Reduce (MIER).
3.1 Coarse-Grain Clustering (CGC)
This manuscript considers 4-way climate data tensors $T \in \mathbb{R}^{n_1 \times n_2 \times n_3 \times n_4}$. We will index the modes of the tensor using subscripts, namely
$$T = \left(T_{i_1 i_2 i_3 i_4}\right), \qquad 1 \leq i_j \leq n_j.$$
Each of the coordinates $i_j$ describes a feature of the abstract dataset $T$. Correspondingly, we will always make the following physical identifications: the first and second indices $i_1$ and $i_2$ refer to latitude and longitude coordinates, respectively; the index $i_3$ denotes time, and $i_4$ refers to a state variable (e.g., temperature or precipitation).
The goal of this work is to provide meaningful clusterings for the spatial location, namely the coordinates corresponding to $i_1$ and $i_2$. Hence, we seek clusterings of the indices $(i_1, i_2)$ using the data $T$. While our focus is on clustering two indices of 4-way tensors, we note that this method does generalize to clustering d-way tensors along any number of indices.
Step One - Split Tensor: The first step in the coarse-grain clustering (CGC) algorithm is to separate the tensor into sub-tensors that are largely statistically uncorrelated across the dataset. For example, temperature and precipitation are locally correlated, e.g., through seasonal rainfall. However, they are weakly correlated at large spatial scales. Indeed, there are hot dry deserts, cold dry deserts, wet cold regions, and wet hot regions. Therefore in the climate dataset $T$, one would separate by climate variables, but not by space or time. In a generic, non-climate-specific tensor, one might split across different variables or runs of an experiment. We let $T^{(v)}$ be the 3-way tensors obtained by fixing the state-variable index $i_4$ to each of its possible values $v$. Note that each of these tensors $T^{(v)}$ for $v = 1, \dots, n_4$ has the same size.
Step Two - DWT: After splitting the tensor into sub-tensors, the next step is to select the inputs. The user chooses wavelets for each of the remaining indices $i_1$, $i_2$, and $i_3$. We let $\psi_j$ denote the wavelet for index $i_j$, $j = 1, 2, 3$. Non-negative integers $\ell_j$ for $j = 1, 2, 3$ are selected to control the level of the DWT on index $i_j$. For each 3-way climate variable tensor $T^{(v)}$, take the DWT.
Step Three - Stack: Since the same wavelets and levels are used on each $T^{(v)}$, the DWTs of the $T^{(v)}$ will all have the same shape. These tensors can therefore be stacked along the face we wish to classify. For the climate biome problem, this would be the spatial $(i_1, i_2)$ face.
Step Four - Vectorize: Once the approximation coefficients are stacked, they may be vectorized along the face of interest. These vectors will be clustered according to a clustering algorithm of choice. This will result in a clustering of the face of interest on the DWT stack.
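Steps one through four can be sketched on a toy tensor stored as nested lists indexed as latitude, longitude, time, variable. This is a simplified illustration under stated assumptions: only the time axis is coarse-grained, with Haar approximation coefficients, so the spatial face keeps its size; the full method also transforms the spatial axes. The names `haar_approx` and `feature_vectors` are hypothetical.

```python
import math

def haar_approx(series, levels):
    """Haar approximation coefficients after `levels` steps (even lengths assumed)."""
    out = list(series)
    for _ in range(levels):
        out = [(a + b) / math.sqrt(2.0) for a, b in zip(out[0::2], out[1::2])]
    return out

def feature_vectors(tensor, time_levels):
    """Map each grid cell (i, j) to one vector of coarse-grained coefficients."""
    n_lat, n_lon = len(tensor), len(tensor[0])
    n_var = len(tensor[0][0][0])
    vectors = {}
    for i in range(n_lat):
        for j in range(n_lon):
            vec = []
            for v in range(n_var):  # step one: treat each variable separately
                series = [step[v] for step in tensor[i][j]]
                # step two (time axis only in this sketch): keep approximation coefficients
                vec.extend(haar_approx(series, time_levels))
            vectors[(i, j)] = vec   # steps three and four: stack variables, one vector per cell
    return vectors

# Toy tensor of shape 2 x 2 x 4 x 2: variable 0 is constant in time, variable 1 is a ramp.
tensor = [[[[1.0 + i + j, float(t)] for t in range(4)] for j in range(2)] for i in range(2)]
vectors = feature_vectors(tensor, time_levels=1)
```

Each resulting vector concatenates the coarse-grained time series of all variables at one grid cell, and these vectors are what a clustering algorithm would receive.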
Step Five - Clustering: The final input is the choice of clustering algorithm, as well as any hyper-parameters required for the chosen algorithm. For example, one may choose K-means, in which case the user needs to specify the number of clusters $k$. Let $\mathcal{A}$ denote the chosen clustering algorithm, along with its chosen hyper-parameters. With the inputs chosen, the algorithm $\mathcal{A}$ is applied to the vectorized DWT coefficients from step four.
Step Six - Return Labels: The final step is to translate these labels on the coarse-grain stack to the corresponding face of the original data set. This is done using the inverse DWT. Specifically, cluster labels corresponding to the largest value appearing in the inverse DWT filter are used to propagate the coarse label to finer detail.
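A simplified picture of the label propagation is block replication: after a number of dyadic transforms, each coarse cell is treated as covering a square block of fine cells. The sketch below assumes exactly that (it is appropriate for a Haar-like filter with non-overlapping support and does not implement the largest-filter-value rule for general wavelets); the function name `upsample_labels` is hypothetical.

```python
def upsample_labels(coarse, levels, fine_shape):
    """Copy each coarse label onto the 2**levels by 2**levels block of fine cells it covers."""
    factor = 2 ** levels
    rows, cols = fine_shape
    last_row, last_col = len(coarse) - 1, len(coarse[0]) - 1
    return [
        # clamp the index so fine grids that are not exact multiples still get a label
        [coarse[min(i // factor, last_row)][min(j // factor, last_col)] for j in range(cols)]
        for i in range(rows)
    ]

coarse = [[0, 1], [2, 2]]
fine = upsample_labels(coarse, levels=1, fine_shape=(4, 4))
```

Here a 2 by 2 grid of coarse labels becomes a 4 by 4 grid in which every coarse label fills a 2 by 2 block.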
3.2 Mutual Information Ensemble Reduce (MIER)
The CGC algorithm describes how to produce a single clustering at a fixed coarse-graining. This coarse-graining arises from the choice of wavelets $\psi_j$ and wavelet levels $\ell_j$. The power behind CGC is its ability to produce many clusterings by simply varying the wavelet levels $\ell_j$, $j = 1, 2, 3$, which can be readily parallelized via a single instruction.
This process results in an ensemble of clusters, one that is potentially too big to analyze. In this section, we discuss a method to select a small subset of this large ensemble of coarse-grainings. Our method leverages the mutual information to find a compact subset of clusters that contains most of the information across the large ensemble. This is accomplished by computing the mutual information between all the clusters in the large ensemble. This results in a connected graph. This connected graph is then ratio-cut to find heavily connected and therefore information theoretically similar clusters. For each component, we again use mutual information to select a single representative of the component. We call this method Mutual Information Ensemble Reduce (MIER).
Given a clustering $C_r$ from the large ensemble, one can look at which component it belongs to in the graph cut. By construction of the MIER algorithm, the chosen representative of that component contains a large amount of the information contained in $C_r$. The MIER algorithm is summarized in Figure 4 and Algorithm 2. The details of the algorithm are as follows.
Step One - Large Ensemble: Let $\mathcal{R}$ denote the permissible set of wavelet resolutions $r = (\ell_1, \ell_2, \ell_3)$ for the chosen wavelets $\psi_j$. Reasonable values for $\mathcal{R}$ can be deduced from the dataset and problem of interest, e.g., the scale of the data and the anticipated importance of embedded features. Once $\mathcal{R}$ has been decided, CGC is run for each $r \in \mathcal{R}$. We denote the clustering using the wavelet resolutions $r$ by $C_r$. This results in an ensemble of clusterings $\{C_r\}_{r \in \mathcal{R}}$.
Step Two - Mutual Information: Next, we compute the normalized mutual information between each pair of clusterings in our ensemble $\{C_r\}_{r \in \mathcal{R}}$. This results in a complete weighted graph on nodes indexed by the set $\mathcal{R}$. The weight between node $r$ and node $s$ is the normalized mutual information $\mathrm{NMI}(C_r, C_s)$. We call the graph $G$ the mutual information graph, and let $W$ denote the weighted adjacency matrix for $G$.
Step Three - Graph Cut: Having built the mutual information graph, we now perform a graph cut. Recall that spectral clustering solves a relaxed version of the ratio cut problem. Hence, we use spectral clustering on $W$ to find a ratio cut of $G$. The eigen-gap heuristic is used when selecting the number of clusters for spectral clustering [von2007tutorial]. Let $G_1, \dots, G_m$ denote the components of $G$ corresponding to the cut of $W$.
Step Four - Average Mutual Information: For each component of the cut mutual information graph, we seek a best representative. Let $\bar{I}_r$ denote the average mutual information between $C_r$ and all other members of its component. That is, for $r \in G_i$,
$$\bar{I}_r = \frac{1}{|G_i| - 1} \sum_{s \in G_i,\ s \neq r} \mathrm{NMI}(C_r, C_s),$$
where $\mathrm{NMI}(C_r, C_s)$ is the normalized mutual information between the clusterings $C_r$ and $C_s$.
Step Five - Choose Representative: For each component $G_i$, the goal is to select the clustering that best represents all the clusterings in $G_i$. If $C_r$ is a good representative for all the other clusterings within its component, then the mutual information between $C_r$ and the other members of the component will be high on average. Thus, $\bar{I}_r$ will be large. Consequently, we select a clustering in $G_i$ for which $\bar{I}_r$ is maximized:
$$r_i^{*} = \operatorname*{arg\,max}_{r \in G_i} \bar{I}_r.$$
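Steps four and five amount to an average followed by an arg max within each component. The sketch below assumes the pairwise normalized mutual information values are already stored in a symmetric matrix and that the graph cut has been given as lists of node indices; the matrix values, the function name `choose_representatives`, and the convention that a one-element component represents itself are assumptions made for illustration.

```python
def choose_representatives(nmi_matrix, components):
    """For each component, return the node with the largest average NMI to the other members."""
    representatives = []
    for comp in components:
        def average_nmi(r):
            others = [s for s in comp if s != r]
            if not others:  # assumption: a singleton component represents itself
                return 1.0
            return sum(nmi_matrix[r][s] for s in others) / len(others)
        representatives.append(max(comp, key=average_nmi))
    return representatives

# Hypothetical NMI values for six clusterings that fall into two components.
nmi_matrix = [
    [1.0, 0.9, 0.8, 0.2, 0.1, 0.2],
    [0.9, 1.0, 0.7, 0.3, 0.2, 0.1],
    [0.8, 0.7, 1.0, 0.2, 0.3, 0.2],
    [0.2, 0.3, 0.2, 1.0, 0.6, 0.5],
    [0.1, 0.2, 0.3, 0.6, 1.0, 0.9],
    [0.2, 0.1, 0.2, 0.5, 0.9, 1.0],
]
components = [[0, 1, 2], [3, 4, 5]]
representatives = choose_representatives(nmi_matrix, components)
```

In this example node 0 and node 4 are selected because they share the most information, on average, with the other members of their respective components.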
4 Application - Gridded Climate Dataset
As a proof of concept, we apply the CGC and MIER algorithms to a gridded historical climate data set of North America [livneh2015spatially], referred to hereafter as L15. This data set ingests station data and interpolates results for each grid point, integrating the effects of topography on local weather patterns. The grid cells are six kilometers on a side, and the data consists of 614 latitudinal, 928 longitudinal, and 768 temporal steps for the years 1950-2013. The available monthly variables in the L15 data set are averaged values of daily total precipitation, daily maximum temperature, daily minimum temperature, and daily average wind speed. A representative snapshot of precipitation, maximum temperature, and minimum temperature is shown in Figure 5. The dataset contains key inputs needed for biome classification using the KG model [kottek2006world] and allows ready comparison against this expert-judgment-based approach. As this dataset is freely available, as well as widely used within the climate community (e.g., Henn et al. 2017, cited over 130 times), it provides a good benchmark application to illustrate capabilities of the method, especially in comparison against more typical expert judgment approaches like the KG model.
4.1 CGC Hyperparameter Selection for L15
The first step of CGC is to split the tensor into sub-tensors corresponding to the climate variables. The historical precedent has been to use temperature and precipitation data to prescribe the biomes [kottek2006world, netzel2016using, thornthwaite1948approach, zscheischler2012climate]. Hence, we will only consider the sub-tensors corresponding to precipitation, maximum temperature, and minimum temperature. The next step is to determine the inputs to Algorithms 1 and 2. We describe these now.
L15 is a gridded observational dataset with a six km spatial resolution, and each time slice of the data represents one month. Whenever a wavelet transform is taken, the spatial and/or temporal scales are approximately doubled. Thus, the coarse wavelet coefficients have a spatial resolution of 12, 24, 48, etc., km for one, two, and three wavelet transforms respectively. Similarly, wavelet transforms of the monthly time scales will result in 2, 4, 8, etc., month long scales.
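The doubling rule can be tabulated in a few lines. The sketch assumes exact doubling per transform (the text says the scales are approximately doubled), and the function name `coarse_scales` is hypothetical.

```python
def coarse_scales(base, levels):
    """Approximate scale after 1, 2, ..., `levels` dyadic wavelet transforms."""
    return [base * 2 ** level for level in range(1, levels + 1)]

spatial_km = coarse_scales(6, 4)       # starting from the six km grid
temporal_months = coarse_scales(1, 6)  # starting from monthly data
# spatial_km      -> [12, 24, 48, 96]
# temporal_months -> [2, 4, 8, 16, 32, 64]
```

Four spatial transforms therefore reach roughly 96 km, and six temporal transforms reach 64 months, a little over five years.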
There is a scale at which both the spatial and temporal information is too coarse and begins to lose meaning. For example, on one extreme the spatial scale of the entire dataset is meaningless. On the other, the six km initial resolution is too fine scale for adequate characterizations into distinctly visible biomes at the North American scale.
These scales demarcate our set of permissible wavelet resolutions $\mathcal{R}$. At least one wavelet transform is taken in both space and time. The maximum for the spatial indices is four (roughly 96 km). The maximum number of temporal wavelet transforms is six (roughly 5 years). Further, we opted for parsimony with regard to the spatial wavelet transforms: a wavelet transform is taken along $i_1$ (latitude) if and only if it is also taken along $i_2$ (longitude). For example, if we take two wavelet transforms in space latitudinally, we will also take two in space longitudinally so that horizontal spatial resolution is uniformly scaled. Thus, we have
$$\mathcal{R} = \left\{ (\ell_1, \ell_2, \ell_3) : \ell_1 = \ell_2 \in \{1, 2, 3, 4\},\ \ell_3 \in \{1, \dots, 6\} \right\}. \tag{3}$$
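The permissible set can be enumerated directly from the stated constraints: equal latitude and longitude levels between one and four, and temporal levels between one and six. The constant and variable names below are hypothetical.

```python
from itertools import product

MAX_SPATIAL_LEVEL = 4   # coarsest spatial level considered
MAX_TEMPORAL_LEVEL = 6  # coarsest temporal level considered

# Each resolution is (latitude level, longitude level, time level),
# with the two spatial levels forced to be equal.
resolutions = [
    (s, s, t)
    for s, t in product(range(1, MAX_SPATIAL_LEVEL + 1), range(1, MAX_TEMPORAL_LEVEL + 1))
]
```

The enumeration yields 24 resolutions, which is the number of CGC runs in the large ensemble for each choice of clustering algorithm and hyper-parameters.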
Note, while it was possible to push the maximum levels to coarser grain, we wanted to avoid the risk of over-coarsening the result. For our choice of wavelets, we choose Daubechies 2 (db2) to match the time signals and Haar for space, corresponding to anticipated smooth periodicities in time and sharp gradients, e.g., near mountains, in space.
For the algorithm $\mathcal{A}$, we have chosen to use K-means clustering for various values of $k$, due to the historical precedent this algorithm has in clustering for climate applications [netzel2016using, zscheischler2012climate] and its straightforward implementation. Recall that the aim is not to find the “best clustering” of our data; instead, we wish to understand how coarse-graining affects clustering and how it can be used to develop an ensemble of clusterings for use in understanding cluster method sensitivity to latent data scales.
4.2.1 L15 CGC algorithm results
Application of the CGC algorithm to the L15 dataset results in spatial mappings of unique, non-overlapping classifications. Both the resolution and the cluster number affect the resultant clustering. Figure 6 explores sensitivity to the wavelet transform for a fixed value of $k$. Several coherent features are observed. First, strong latitudinal dependence in the eastern portion of the US is consistent across clusterings as classified scales are modified, e.g., Figure 6a to 6d. Second, reduced spatial scale, e.g., in Figure 6b as compared to Figure 6a, results in a loss of high spatial frequencies in the produced classification. Sensitivity to temporal scale, similar to spatial scale, produces large-scale structural change but with higher spatial fidelity between classified regions, e.g., Figure 6a vs Figure 6e. However, the combined scale reduction in Figure 6f results in some loss of the strong latitudinal structure in the eastern US that is more evident in the other wavelet cases, and the results are distinct from Figures 6b and 6e. Intermediate cluster scales (Figures 6c and 6d) are in the range of these results.
Sensitivity to $k$, in contrast to wavelet scales, further granulates the classification scheme obtained. Figure 7 plots CGC at various values of $k$ for a fixed resolution $r$. For example, in the eastern US, an increase in $k$ increases the number of latitudinally aligned biomes from two in Figure 7b to four in Figure 7d. Classified structure in the western US increases in complexity as $k$ increases, e.g., Figure 7a vs 7d. However, comparing Figures 7c and 7d, we see that the complexity does largely saturate at sufficiently large $k$.
Coherent structures, such as the Rocky Mountains, are evident across all $k$, illustrating the overall consistency and reduced sensitivity of CGC to $k$ as opposed to the choice of wavelet resolution $r$.
4.2.2 L15 MIER algorithm results
For different $k$ values, the CGC algorithm was run across all the resolutions $r \in \mathcal{R}$ as in Equation 3. For each fixed $k$, the MIER algorithm was applied to the outputs to discover the reduced ensemble.
Figure 8 shows aspects of the MIER algorithm for one fixed value of $k$. Figure 8a displays the value of the average normalized mutual information $\bar{I}_r$ for each resolution $r$. The vertical axis denotes the number of spatial wavelet transforms, while the horizontal axis displays the number of temporal wavelet transforms. Figure 8b shows the results of the ratio cut algorithm. The key resolutions found by running the MIER algorithm are highlighted in a darker shade.
The clusterings plotted in Figure 8c through Figure 8f are the best representative clusterings found by the MIER algorithm. Each clustering encapsulates different observed features from the large ensemble of clusterings $\{C_r\}$. For example, decreased temporal scale increases resolution from two to three eastern US classifications, as shown in Figure 8c versus 8d and 8f. Coarsened classifications are observed as a direct result of spatial scale (e.g., Figure 8f versus 8c to 8e). Cluster boundary shape is also affected by the wavelet resolution. For example, a vertical boundary can be found in the middle of the United States across each classification. However, the shape of that boundary depends on the resolution, e.g., Figure 8d versus 8e.
The CGC resolution dependence plots in Figure 6 highlight the variability that data resolution introduces into the clustering process. As expected, increasing the number of spatial wavelet transforms results in a coarser clustering. High variance regions, such as the Rocky Mountains, become less resolved as the number of spatial wavelet transforms increases. Large structural features such as the Great Plains are persistent across the spatial wavelet coarse-graining.
What is more unexpected is the effect that coarse-graining time has on the clustering. High variability regions remain high variability; however, distinctly different clustering patterns do begin to emerge. For instance, how CGC clusters the Northern Rocky Mountains does seem to depend on the temporal resolution selected, which points to the role of high-altitude storms in the resultant biome classification. Low variability regions also depend heavily on the temporal scale. For example, the northeastern U.S. splits into more biomes as the temporal scale becomes coarser, illustrating that high frequencies may appear as noise until reduced by the wavelet and may consequently mask a signal appropriate to more specifically classify a region.
The MIER algorithm massively reduces the size of the large ensemble $\{C_r\}$. In all the experiments run, the size of $\mathcal{R}$ was 24, but the reduced ensemble size is between three and five, with the majority of the cases being four. This illustrates the success of the method in identifying a characteristic, reduced set of clusterings. Furthermore, the algorithm is successful at picking resolutions that are sufficiently spaced apart. Consequently, the chosen clusterings accurately represent the dynamical range of all 24 clusterings in the large ensemble. The reduction from 24 clusterings to five or fewer greatly aids analysis and human comprehensibility of the output.
This can be seen by comparing Figures 6 and 8. The six sample plots in Figure 6 are the extreme cases (lowest and highest coarse-graining) as well as some middle cases. By looking at Figure 8 we see that, for instance, the cluster belongs to component . The representative for component is the cluster . There are a lot of visual similarities between on Figure 6 and on Figure 8. Indeed, appears to be a blend between and , which is another clustering that belongs to the same connected component.
As can be seen from the output of the MIER algorithm, the reduced ensemble can succinctly represent differences across the spatio-temporal resolutions. Most of the variance seen between the clusterings at different resolutions is captured within this subset. From a numerical standpoint, the reduced ensemble is robust as well. As can be seen from Figure 8, the expected normalized mutual information between any representative and the other clusterings in its component of the graph is usually rather large. Thus, MIER has successfully found a small, representative subset of the large ensemble.
We have shown that the scale of the data is a non-negligible feature with regard to clustering. Consequently, in addition to running several clustering algorithms, it is also important to include several coarse-grain clusterings in the cluster ensemble. To avoid ballooning the size of the ensemble, it is crucial not to consider every possible coarse-graining, but rather a small subset that largely represents every possible resolution. The MIER algorithm has been shown to be a good method to prune the size of the CGC ensemble. This capability to produce an ensemble of classifications representing the diversity of scales provides a direct pathway to better understand clustering sensitivities, illustrating a continued need to assess and mitigate uncertainties resultant from hyperparameter selection.
Introducing the ensemble of clusters from the CGC and MIER algorithm comes at the cost of complexity. It is more difficult to analyze a set of clusterings than a single clustering. As shown in Figure 1, the additional clusterings from the CGC and MIER framework should be imported into a consensus clustering algorithm. However, as discussed in the introduction, clustering is an ill-posed problem without a single optimal solution. Further study is needed to assess the confidence across the cluster ensembles within this classification approach.
This research was supported as part of the Energy Exascale Earth System Model (E3SM) project, funded by the U.S. Department of Energy, Office of Science, Office of Biological and Environmental Research, as well as LANL laboratory directed research and development (LDRD) grant 20190020DR. High-performance computing was conducted at Los Alamos Nat. Lab. Institutional Computing, US DOE NNSA (DE-AC52-06NA25396).