Scalable Bottom-up Subspace Clustering using FP-Trees for High Dimensional Data

11/07/2018, by Minh Tuan Doan, et al.

Subspace clustering aims to find groups of similar objects (clusters) that exist in lower dimensional subspaces from a high dimensional dataset. It has a wide range of applications, such as analysing high dimensional sensor data or DNA sequences. However, existing algorithms have limitations in finding clusters in non-disjoint subspaces and scaling to large data, which limit their applicability in areas such as bioinformatics and the Internet of Things. We aim to address such limitations by proposing a subspace clustering algorithm using a bottom-up strategy. Our algorithm first searches for base clusters in low dimensional subspaces. It then forms clusters in higher-dimensional subspaces using these base clusters, which we formulate as a frequent pattern mining problem. This formulation enables efficient search for clusters in higher-dimensional subspaces, which is done using FP-trees. The proposed algorithm is evaluated against traditional bottom-up clustering algorithms and state-of-the-art subspace clustering algorithms. The experimental results show that the proposed algorithm produces clusters with high accuracy, and scales well to large volumes of data. We also demonstrate the algorithm's performance using real-life data, including ten genomic datasets and a car parking occupancy dataset.


I Introduction

Subspace clustering aims to find groups of similar objects, or clusters, that exist in lower dimensional subspaces from a high dimensional dataset. This has a wide range of applications, including the rapidly growing fields of the Internet of Things (IoT) [1] and bioinformatics [2]. Applications such as these generate large volumes of high dimensional data, which bring new challenges to the subspace clustering problem. In this paper we propose a novel approach to subspace clustering that addresses two key challenges in these applications: scalability to large datasets and non-disjoint subspaces.

The first challenge lies in handling large inputs. This is essential for many applications nowadays since the captured data can grow to millions of records in a short period of time. It has been shown [3, 4] that many existing algorithms have high computational costs and take considerable time to cluster relatively small inputs, e.g., STATPC [3] needs more than 13 hours to cluster 7,500 records of 16 dimensions. Table I illustrates how our algorithm can scale to inputs with large volumes of data, in comparison to the state-of-the-art subspace clustering algorithms SWCC [2], SSC [5], and LRR [6]. The running time of our algorithm over 100,000 data points is half that required by SWCC (which is a highly efficient co-clustering algorithm, but cannot find clusters in non-disjoint subspaces). The state-of-the-art subspace clustering algorithms SSC and LRR also suffer as the number of data points increases. SSC triggers memory errors when the number of data points reaches 15,000, while LRR cannot terminate within 12 hours for just 5,000 points.

Points      5,000    10,000    15,000    20,000    50,000    100,000
Ours          6.7      13.2      20.7      28.7     127.9      184.5
SWCC          9.8      19.9      37.8      93.94    198.96     374.48
SSC         226.1     416.9    1506.4         -         -          -
LRR             -         -         -         -         -          -

TABLE I: Clustering time (in seconds) on 10-dimensional datasets. The volume ranges from 5,000 to 100,000 points.

The second challenge involves finding clusters in non-disjoint subspaces [7]. Many recent algorithms [5, 6] assume that clusters are located in disjoint subspaces, which do not have any intersection except for the origin. This is a strong assumption that can be unrealistic, because real-life data may be correlated in different overlapping subsets of dimensions, a property known as local feature relevance [8]. For example, with gene expression data, a particular gene can be involved in multiple genetic pathways, which can result in different symptoms among different sets of patients [9]. Hence, a gene can belong to different clusters that have dimensions in common while differing in other dimensions [10]. Figure 1 presents another example of clusters in non-disjoint subspaces, observed in data collected from IoT applications. The heatmap visualizes the subspace clustering results of a car parking occupancy dataset at 10 locations from 9am to 1pm, where each column represents a car parking bay and each row represents an hour of the day. The two highlighted clusters are in non-disjoint subspaces since they share the dimensions of parking bays P2 and P3. For the first cluster, this can be interpreted as the utilisation of these two parking bays following a pattern that is also observed at P1 between 9am and 10am. On the other hand, the second cluster shows that P2 and P3 follow a different pattern between 11am and 1pm, and share that pattern with P4 and P5. Further analysis of the data suggests that the former group of parking bays is busy during morning peaks, whereas the latter have higher occupancy levels during lunch time.

Fig. 1: An illustration of clusters in non-disjoint subspaces for car parking occupancy data. Clusters are highlighted to show simultaneous groupings of points and dimensions.

To address these challenges, we propose a novel algorithm that can find clusters in non-disjoint subspaces and scale well with large inputs. The algorithm follows a bottom-up strategy and comprises two phases. First, it searches for potential clusters in low dimensional subspaces, which we call base clusters. We start with base clusters instead of dense units in separate dimensions, which are used in existing bottom-up clustering algorithms [8]. This allows our algorithm to preserve the covariance of data between different dimensions, which is also a critical factor when clustering high dimensional data, as we further elaborate in Section 4.1. In addition, this approach makes our algorithm more stable and tolerant to variations in parameter settings.
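
As a concrete sketch of this first phase, the snippet below searches low dimensional subspaces for base clusters using k-means, the base clustering algorithm used in our experiments. The subspace size of two dimensions and the number of clusters per subspace are illustrative assumptions rather than fixed parameters of the method:

```python
import numpy as np
from itertools import combinations
from sklearn.cluster import KMeans

def find_base_clusters(X, subspace_size=2, n_clusters=5):
    """Phase 1 (sketch): search low dimensional subspaces for base clusters.

    X             : (n_points, n_dims) data matrix
    subspace_size : dimensionality of the searched subspaces (illustrative)
    n_clusters    : k for k-means in each subspace (illustrative)

    Returns a list of base clusters, each a pair (subspace, member_point_ids).
    """
    n_points, n_dims = X.shape
    base_clusters = []
    for subspace in combinations(range(n_dims), subspace_size):
        cols = list(subspace)
        # Cluster the points projected onto this low dimensional subspace.
        labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(X[:, cols])
        for c in range(n_clusters):
            members = np.flatnonzero(labels == c)
            if members.size > 0:
                base_clusters.append((subspace, members))
    return base_clusters
```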

In the second phase, base clusters that share similar sets of data points are aggregated together to form clusters in higher dimensional subspaces. This process of aggregation is non-trivial. One of the main challenges lies in keeping the number of aggregated clusters tractable. This not only directly affects the computational costs of the algorithm, but also ensures that the final result is presented as an appropriate number of meaningful clusters. Many existing algorithms [11, 12] depend on combinatorial search to combine low dimensional clusters (dense units). If there are on average m dense units in each dimension, the first level of aggregation of CLIQUE [11] (combining one-dimensional dense units into two-dimensional clusters) would need to check on the order of m^2 possible aggregations for each of the d(d-1)/2 pairs of dimensions, where d is the number of dimensions. Further aggregation would need to be applied sequentially for each subsequent higher dimension. We alleviate this heavy computation by transforming the aggregation problem into a frequent pattern mining problem [13] to achieve efficient and robust aggregation of base clusters. This approach also allows us to avoid the construction of a similarity matrix, which has quadratic complexity with respect to the input volume. Therefore, we reduce both time and space complexity and enable the algorithm to work with very large inputs. During this process, a base cluster may be aggregated into more than one cluster in different higher dimensional subspaces that have overlapping dimensions, which enables us to find non-disjoint subspace clusters. The general steps of our algorithm are summarized in Figure 2 and detailed in Section 4.
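
To make the formulation concrete, the sketch below treats every data point as a "transaction" over the IDs of the base clusters that contain it, and mines frequent itemsets of base-cluster IDs with FP-growth (which builds the FP-tree internally). The use of mlxtend and the minimum support value are illustrative assumptions, not our actual implementation; the base clusters are assumed to be in the (subspace, member_point_ids) format of the phase-one sketch above.

```python
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import fpgrowth

def aggregate_base_clusters(base_clusters, n_points, min_support=0.01):
    """Phase 2 (sketch): aggregate base clusters via frequent pattern mining."""
    # Build one transaction per point: the IDs of the base clusters containing it.
    transactions = [[] for _ in range(n_points)]
    for cluster_id, (_, members) in enumerate(base_clusters):
        for p in members:
            transactions[p].append(cluster_id)

    # One-hot encode the transactions and mine frequent itemsets with FP-growth.
    encoder = TransactionEncoder()
    onehot = encoder.fit_transform(transactions)
    df = pd.DataFrame(onehot, columns=encoder.columns_)
    itemsets = fpgrowth(df, min_support=min_support, use_colnames=True)

    # Each frequent itemset is a set of base clusters; the points shared by all of
    # them, together with the union of their subspaces, give a candidate cluster
    # in a higher dimensional subspace.
    return itemsets
```

Because the same base-cluster ID can occur in several frequent itemsets, the clusters recovered from different itemsets may share dimensions, which is exactly how clusters in non-disjoint subspaces arise from this formulation.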

Fig. 2: Framework of the proposed algorithm

We make the following contributions:

  • We propose a novel subspace clustering algorithm that can find clusters in non-disjoint subspaces and handle very large inputs. The novelty of our approach is reflected in both phases of the algorithm. First, we search for base clusters in low dimensional subspaces to preserve the covariance of data between different dimensions. Second, we transform the process of sequential aggregation of low dimensional clusters to a problem of frequent pattern mining to construct high dimensional clusters.

  • We demonstrate that the proposed algorithm outperforms traditional subspace clustering algorithms using bottom-up strategies, as well as state-of-the-art algorithms with other clustering strategies, in terms of accuracy and scalability on large volumes of data.

  • We conduct a range of experiments to demonstrate the effectiveness of our algorithm in different practical applications. Specifically, we present how the algorithm can be applied to (1) real-life sensor data from the City of Melbourne, Australia [14], and (2) 10 different gene expression datasets [9], and produce comparable or better results than state-of-the-art algorithms.

II Related work

Subspace clustering is an active research field that aims to partition high dimensional datasets into groups of objects that are similar in subspaces of the data space. The attributes of high dimensional data lead to multiple challenges for subspace clustering. A major challenge is referred to as local feature relevance [8], which states that clusters only exist in subspaces (or subsets of dimensions) rather than the full dimensional space. In addition, the subspaces where a cluster exists vary for different subsets of data points. This phenomenon makes traditional similarity measures, such as Euclidean distance, Manhattan distance, and cosine similarity, ineffective. The reason is that these measures use all dimensions, both relevant and irrelevant, when computing similarity. Moreover, since subspaces vary for different (and unknown) subsets of points, common dimensionality reduction techniques, such as PCA [15], MDS [16], and feature selection methods [17], which apply global changes to the data, are not effective.

Subspace clustering methods. Subspace clustering methods can be categorised into five groups: iterative methods, algebraic methods, statistical methods, matrix factorisation-based methods, and spectral clustering based methods. We briefly describe each group with representative algorithms. A detailed survey of these algorithms is given in [4].

Iterative methods, such as K-subspaces [18], iteratively alternate between assigning points to subspaces and updating the subspaces to refine the clusters. K-subspaces is simple, fast, and guaranteed to converge. However, it needs to know the number of clusters as well as the dimensions of each cluster beforehand. The algorithm is also sensitive to outliers and only converges to a local optimum.

Statistical methods, such as MPPCA [19], assume that the data in each subspace follow a known distribution, such as a Gaussian distribution. The clustering process alternates between clustering the data and adjusting the subspaces by maximizing the expectation of the principal components of all subspaces. These algorithms need to know the number of clusters as well as the number of dimensions of each subspace. Moreover, their accuracy heavily depends on the initialization of the clusters and subspaces.

GPCA [20] is a representative algorithm of the algebraic methods. It considers the full data space as the union of n underlying subspaces, and hence represents the input data as the zero set of a polynomial of degree n, p(x) = (b_1^T x)(b_2^T x)...(b_n^T x) = 0, where b_i and b_i^T x = 0 are the normal vector and the equation of subspace S_i respectively. The subspaces are then identified by grouping the normal vectors of all the points, which are obtained as the derivatives of p evaluated at those points. GPCA needs to know the number of dimensions of each subspace, and is sensitive to noise and outliers. Besides, GPCA has high computational complexity and does not scale well with the number of subspaces or their dimensionalities.

Matrix factorization based algorithms use low-rank factorization to construct a similarity matrix over the data points. Specifically, given an input matrix X containing N points in d dimensions, matrix factorization based algorithms [21], [22] find the SVD [15] of the input to subsequently construct a similarity matrix W from the right singular vectors, where W_ij = 0 if points i and j belong to different subspaces. The final clusters are obtained by thresholding the entries of W. These methods assume the subspaces to be independent and the data to be noise free.
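
As a minimal illustration of this construction, the sketch below builds the classic shape interaction matrix W = VV^T from a rank-r truncated SVD; the rank and threshold are illustrative parameters, not values prescribed by these methods:

```python
import numpy as np

def similarity_from_svd(X, rank, threshold):
    """Sketch of a factorization-based similarity (shape interaction) matrix.

    X         : (d, N) data matrix whose columns are the N points
    rank      : assumed rank of the union of subspaces (illustrative)
    threshold : entries of |W| below this value are treated as
                "points in different subspaces"
    """
    # Truncated SVD of the data matrix.
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    V = Vt[:rank].T                # (N, rank) right singular vectors
    W = np.abs(V @ V.T)            # W_ij is (near) zero for points in different subspaces
    return W > threshold           # boolean affinity matrix for the final clustering step
```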

SSC [5] and LRR [6] are two state-of-the-art algorithms that use spectral clustering techniques. They initially express each data point x_i as a linear combination of the remaining data points, x_i = sum_{j != i} c_ij x_j, and use the coefficients c_ij to construct the similarity matrix C. The algorithms then optimize C so that c_ij = 0 for all pairs of points that do not belong to the same subspace. SSC uses l1-norm regularization [23] to enforce C to be sparse, while LRR enforces the matrix C to be low-rank by using nuclear norm regularization [24]. Both algorithms assume the underlying subspaces to be disjoint. In addition, both have high computational complexity, which grows rapidly with the number of input records.
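
For reference, the self-expression programs commonly used to state the noiseless SSC objective and the LRR objective are given below (lambda is a regularization weight and E models noise in LRR); these are the standard formulations from the literature rather than a restatement of either paper's exact program:

```latex
% SSC: sparsest self-expression of each point in terms of the others
\min_{C}\; \lVert C \rVert_{1} \quad \text{s.t.} \quad X = XC,\ \operatorname{diag}(C) = 0

% LRR: lowest-rank self-expression with a column-sparse error term E
\min_{C,\,E}\; \lVert C \rVert_{*} + \lambda \lVert E \rVert_{2,1} \quad \text{s.t.} \quad X = XC + E
```

In both cases, spectral clustering is typically applied to the symmetrized affinity |C| + |C|^T to recover the final clusters.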

Bottom-up subspace clustering algorithms. From an algorithmic point of view, clustering algorithms can be classified into bottom-up algorithms and top-down algorithms [8]. As our algorithm follows a bottom-up strategy, we briefly discuss the relevant algorithms of this class to highlight our contributions.

The bottom-up strategy involves searching for dense units in individual dimensions, and subsequently aggregating these dense units to form clusters in higher dimensional subspaces. The difference among bottom-up algorithms lies in the definition of dense units and the method of aggregating lower dimensional clusters. For example, CLIQUE [11] divides individual dimensions into fixed-size cells, and defines dense units as cells containing more than a predefined number of points. It then aggregates adjacent dense units to construct higher dimensional clusters. CLIQUE heavily depends on setting appropriate values for the cell size and density threshold. This can be challenging because the value ranges differ across dimensions and there might not be a single set of parameters that suits all dimensions. In addition, searching for dense units in separate dimensions omits the covariance between dimensions, which can lead to either missing clusters or redundant combinations of dense units. We discuss this phenomenon in more detail in Section 4.1. SUBCLU [12] does not rely on fixed cells. Instead, it uses DBSCAN [8] to search for dense units in each dimension, and iteratively constructs higher dimensional subspaces. The algorithm invokes a call to DBSCAN for each candidate subspace, which can lead to a high running time. We propose to perform clustering only at the beginning of the algorithm while still guaranteeing that the aggregation of these low dimensional base clusters forms valid high dimensional clusters, which achieves a much lower computational cost.
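
For intuition, a minimal sketch of CLIQUE's first step, finding one-dimensional dense units over fixed-size cells, might look as follows; the grid resolution and density threshold are illustrative assumptions, and a real implementation would also record which points fall in each unit:

```python
import numpy as np

def dense_units_1d(X, n_cells=10, density_threshold=20):
    """CLIQUE-style search for 1-D dense units (sketch).

    A dense unit is a fixed-size cell of a single dimension containing more
    than `density_threshold` points; both parameters are illustrative.
    """
    n_points, n_dims = X.shape
    units = []                                  # (dimension, cell_index) pairs
    for dim in range(n_dims):
        col = X[:, dim]
        edges = np.linspace(col.min(), col.max(), n_cells + 1)
        counts, _ = np.histogram(col, bins=edges)
        for cell, count in enumerate(counts):
            if count > density_threshold:
                units.append((dim, cell))
    return units
```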

Co-clustering. Another relevant topic is co-clustering (a.k.a. bi-clustering or pattern-based clustering) [25]. Co-clustering can be considered a more general class of high dimensional clustering that simultaneously clusters rows (points) and columns (dimensions). What differentiates co-clustering from subspace clustering is its approach to the problem and its homogeneous methodology for finding clusters in both axis-parallel and arbitrarily oriented subspaces [8]. In this paper, we also compare the performance of our algorithm on gene expression data with a range of co-clustering algorithms, including SWCC [2], BBAC-S [26], ITCC [27], FFCFW [28], and HICC [29].

III Problem statement

We first present the notation used in this paper.

  • S = {d_1, d_2, ..., d_m} is a subspace of m dimensions, represented as the set of its component dimensions, where d_i denotes the i-th dimension.

  • P is a set of points; p denotes a point, p = (p_1, p_2, ..., p_m), where p_i is the coordinate of p in the i-th dimension.

  • C = (P, S) is a cluster formed by the set of points P in subspace S.

Let X be a set of points in a D-dimensional space, and P be a subset of X. The set of all subspace clusters is denoted as C = {C_1, C_2, ..., C_k}, located in subspaces S_1, ..., S_l. Here, l denotes the number of subspaces containing clusters, and k denotes the number of all clusters. More than one cluster can exist in a subspace, i.e., k >= l. Our subspace clustering algorithm finds all clusters by identifying their corresponding subspaces and point sets.

We take a bottom-up approach to find the clusters in subspaces, starting from finding base clusters in low dimensional subspaces. The algorithm used to find the base clusters is orthogonal to our study. We use k-means in the experiments for simplicity, although any low dimensional clustering algorithm may be used. Once the base clusters are found, our algorithm aggregates them to form clusters in higher-dimensional subspaces. We follow a probabilistic approach together with the downward closure property of density to guarantee the validity of the formation of clusters in higher dimensional subspaces. This is formulated as Lemma 1.

Lemma 1: Given two points p and q in subspace S, the probability that p and q belong to the same cluster in subspace S is proportional to the cardinality |T| of the set T of lower dimensional subspaces of S in which p and q belong to the same cluster.

Proof: Let A denote the event that the two points p and q belong to the same cluster in subspace S. Assume that we have already performed clustering in lower dimensional subspaces and found that these two points belong to the same cluster in a set of subspaces T = {S_1, S_2, ..., S_t} (S_i ⊂ S). Given this knowledge, the probability that p and q belong to the same cluster in S is the conditional probability P(A | S_1, S_2, ..., S_t).

We show that this probability increases as new evidence of the cluster formation of p and q is found in other subspaces of S. Specifically, let these two points also belong to a cluster in a certain subspace S_{t+1} (S_{t+1} ⊂ S and S_{t+1} ∉ T, i.e., S_{t+1} is indeed a newly discovered subspace in which p and q belong to the same cluster). The probability of them belonging to the same cluster in S becomes P(A | S_1, S_2, ..., S_t, S_{t+1}).

By applying the chain rule, we can show that P(A | S_1, S_2, ..., S_t, S_{t+1}) ≥ P(A | S_1, S_2, ..., S_t):