I. Introduction
Subspace clustering aims to find groups of similar objects, or clusters, that exist in lower dimensional subspaces of a high dimensional dataset. It has a wide range of applications, including the rapidly growing fields of the Internet of Things (IoT) [1] and bioinformatics [2]. Applications such as these generate large volumes of high dimensional data, which bring new challenges to the subspace clustering problem. In this paper we propose a novel approach to subspace clustering that addresses two key challenges in these applications: scalability to large datasets and non-disjoint subspaces.
The first challenge lies in handling large inputs. This is essential for many applications nowadays, since the captured data can grow to millions of records in a short period of time. It has been shown [3, 4] that many existing algorithms have high computational costs and take considerable time to cluster relatively small inputs; e.g., STATPC [3] needs more than 13 hours to cluster 7,500 records of 16 dimensions. Table I illustrates how our algorithm can scale to inputs with large volumes of data, in comparison to the state-of-the-art subspace clustering algorithms SWCC [2], SSC [5], and LRR [6]. The running time of our algorithm over 100,000 data points is half that required by SWCC (which is a highly efficient co-clustering algorithm, but cannot find clusters in non-disjoint subspaces). The state-of-the-art subspace clustering algorithms SSC and LRR also suffer as the number of data points increases. SSC triggers memory errors when the number of data points reaches 15,000, while LRR cannot terminate in 12 hours for just 5,000 points.
Table I: Running time vs. number of data points ("--" denotes failure to complete).

          5,000   10,000   15,000   20,000   50,000   100,000
Ours        6.7     13.2     20.7     28.7    127.9     184.5
SWCC        9.8     19.9     37.8    93.94   198.96    374.48
SSC       226.1    416.9   1506.4       --       --        --
LRR          --       --       --       --       --        --
The second challenge involves finding clusters in non-disjoint subspaces [7]. Many recent algorithms [5, 6] assume that clusters are located in disjoint subspaces, which do not have any intersection except for the origin. This is a strong assumption that can be unrealistic, because real-life data may be correlated in different overlapping subsets of dimensions, a property known as local feature relevance [8]. For example, with gene expression data, a particular gene can be involved in multiple genetic pathways, which can result in different symptoms among different sets of patients [9]. Hence, a gene can belong to different clusters that have dimensions in common while differing in other dimensions [10]. Figure 1 presents another example of clusters in non-disjoint subspaces that are observed in data collected from IoT applications. The heatmap visualizes the subspace clustering results of a car parking occupancy dataset at 10 locations from 9am to 1pm, where each column represents a car parking bay and each row represents an hour of the day. It can be observed that two of the clusters are in non-disjoint subspaces, since they share the dimensions of parking bays P2 and P3 in common. For the first of these clusters, this can be interpreted as the utilisation of these two parking bays following some pattern that is also observed at P1 between 9am and 10am. On the other hand, the second cluster shows that P2 and P3 follow a different pattern between 11am and 1pm, and share that pattern with P4 and P5. Further analysis of the data can suggest that the bays in the first cluster are busy during morning peaks, whereas those in the second have higher occupancy levels during lunch time.
To address these challenges, we propose a novel algorithm that can find clusters in non-disjoint subspaces and scale well with large inputs. The algorithm follows a bottom-up strategy and comprises two phases. First, it searches for potential clusters in low dimensional subspaces, which we call base clusters. We start with base clusters instead of dense units in separate dimensions, which are used in existing bottom-up clustering algorithms [8]. This allows our algorithm to preserve the covariance of data between different dimensions, which is also a critical factor when clustering high dimensional data, as we further elaborate in Section 4.1. In addition, this approach makes our algorithm more stable and tolerant to variations in parameter settings.
In the second phase, base clusters that share similar sets of data points are aggregated to form clusters in higher dimensional subspaces. This process of aggregation is non-trivial. One of the main challenges lies in keeping the number of aggregated clusters tractable. This not only directly affects the computational costs of the algorithm, but also ensures that the final result is presented as an appropriate number of meaningful clusters. Many existing algorithms [11, 12] depend on combinatorial search to combine low dimensional clusters (dense units). If there are on average u dense units in each dimension, the first level of aggregation of CLIQUE [11] (combining one-dimensional dense units into two-dimensional clusters) would need to check on the order of (u·d)^2 pairwise possible aggregations, where d is the number of dimensions. Further aggregation would need to be applied sequentially for each subsequent higher dimension. We alleviate this heavy computation by transforming the aggregation problem into a frequent pattern mining problem [13] to achieve efficient and robust aggregation of base clusters. This approach also allows us to avoid the construction of a similarity matrix, which has quadratic complexity with respect to the input volume. Therefore, we reduce both time and space complexity and enable the algorithm to work with very large inputs. During this process, a base cluster may be aggregated into more than one cluster in different higher dimensional subspaces that have overlapping dimensions, which enables us to find non-disjoint subspace clusters. The general steps of our algorithm are summarized in Figure 2 and detailed in Section 4.
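The second phase described above can be sketched as follows. This is a simplified, illustrative Apriori-style miner, not the paper's implementation: the function name, the `min_shared` support threshold, and the toy base clusters are our assumptions.

```python
from itertools import combinations

def aggregate_base_clusters(base_clusters, min_shared):
    """Aggregate base clusters that share at least `min_shared` points.

    base_clusters maps a base-cluster id to its set of point ids.
    Viewing each point as a transaction whose items are the base clusters
    containing it, a frequent itemset of base-cluster ids corresponds to a
    candidate cluster in the union of their subspaces; here we mine those
    itemsets level-wise, Apriori-style, via set intersections.
    """
    frequent = {frozenset([cid]): set(pts)
                for cid, pts in base_clusters.items()
                if len(pts) >= min_shared}
    results, level = dict(frequent), frequent
    while level:
        next_level = {}
        for (ia, pa), (ib, pb) in combinations(level.items(), 2):
            cand, shared = ia | ib, pa & pb
            # Extend an itemset by exactly one base cluster at a time.
            if len(cand) == len(ia) + 1 and len(shared) >= min_shared:
                next_level[cand] = shared
        results.update(next_level)
        level = next_level
    return results
```

Because the same base-cluster id can appear in several frequent itemsets, the resulting higher dimensional clusters may live in overlapping subspaces, which is exactly the non-disjoint behaviour discussed above.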
We make the following contributions:

We propose a novel subspace clustering algorithm that can find clusters in non-disjoint subspaces and handle very large inputs. The novelty of our approach is reflected in both phases of the algorithm. First, we search for base clusters in low dimensional subspaces to preserve the covariance of data between different dimensions. Second, we transform the process of sequential aggregation of low dimensional clusters into a problem of frequent pattern mining to construct high dimensional clusters.

We demonstrate that the proposed algorithm outperforms traditional subspace clustering algorithms using bottom-up strategies, as well as state-of-the-art algorithms with other clustering strategies, in terms of accuracy and scalability on large volumes of data.

We conduct a range of experiments to demonstrate the effectiveness of our algorithm in different practical applications. Specifically, we present how the algorithm can be applied to (1) real-life sensor data from the City of Melbourne, Australia [14], and (2) 10 different gene expression datasets [9], and produce comparable or better results than state-of-the-art algorithms.
II. Related Work
Subspace clustering is an active research field that aims to partition high dimensional datasets into groups of objects that are similar in subspaces of the data space. The attributes of high dimensional data lead to multiple challenges for subspace clustering. A major challenge is referred to as local feature relevance [8], which states that clusters only exist in subspaces (or subsets of dimensions) rather than in the full dimensional space. In addition, the subspaces where a cluster exists vary for different subsets of data points. This phenomenon makes traditional similarity measures, such as Euclidean distance, Manhattan distance, and cosine similarity, ineffective, since these measures use all dimensions, both relevant and irrelevant, when computing similarity. Moreover, since subspaces vary for different (and unknown) subsets of points, common dimensionality reduction techniques, such as PCA [15], MDS [16], and feature selection methods [17], which apply global changes to the data, are also not effective.

Subspace clustering methods. Subspace clustering methods can be categorised into five groups: iterative methods, algebraic methods, statistical methods, matrix factorisation based methods, and spectral clustering based methods. We briefly describe each group with representative algorithms; a detailed survey is given in [4].

Iterative methods such as K-subspaces [18] iteratively alternate between assigning points to subspaces and updating the subspaces to refine the clusters. K-subspaces is simple, fast, and guaranteed to converge. However, it needs to know the number of clusters as well as the dimensions of each cluster beforehand. The algorithm is also sensitive to outliers and only converges to a local optimum.
Statistical methods, such as MPPCA [19], assume that the data in each subspace follow a known distribution, such as a Gaussian distribution. The clustering process alternates between clustering the data and adjusting the subspaces by maximizing the expectation of the principal components of all subspaces. These algorithms need to know the number of clusters as well as the number of dimensions of each subspace. Moreover, their accuracy heavily depends on the initialization of the clusters and subspaces.
GPCA [20] is a representative algorithm of the algebraic methods. It considers the full data space as the union of n underlying subspaces, and hence represents the input data as a polynomial of degree n: p(x) = (b_1^T x)(b_2^T x)...(b_n^T x) = 0, where b_i and b_i^T x = 0 are the normal vector and the equation of subspace S_i, respectively. The subspaces are then identified by grouping the normal vectors of all the points, which are obtained as the derivatives of p evaluated at those points. GPCA needs to know the number of dimensions of each subspace, and is sensitive to noise and outliers. Besides, GPCA has high computational complexity and does not scale well with the number of subspaces or their dimensionalities.

Matrix factorization based algorithms use low-rank factorization to construct a similarity matrix over the data points. Specifically, given an input matrix X containing N points in d dimensions, matrix factorization based algorithms [21], [22] find the SVD X = U Σ V^T of the input [15] to subsequently construct the similarity matrix W = V V^T, where W_ij = 0 if points i and j belong to different subspaces. The final clusters are obtained by thresholding the entries of W. These methods assume the subspaces to be independent and the data to be noise free.
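As a concrete numpy sketch of the matrix factorization idea: build W = V_r V_r^T from the rank-r SVD of a data matrix that holds one point per column. The function and variable names are ours, and the example assumes noise-free points from independent subspaces.

```python
import numpy as np

def shape_interaction_matrix(X, tol=1e-10):
    """Similarity matrix W = V_r V_r^T from the SVD of the data matrix X.

    X holds one data point per column. For noise-free points drawn from
    independent subspaces, W[i, j] = 0 whenever points i and j lie in
    different subspaces, so clusters can be read off by thresholding W.
    """
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    r = int((s > tol * s[0]).sum())  # numerical rank of X
    Vr = Vt[:r].T                    # one row of right singular vectors per point
    return Vr @ Vr.T

# Two independent 2-D subspaces of R^4: span{e1, e2} and span{e3, e4}.
A = np.array([[1.0, 2.0, 0.5],
              [0.0, 1.0, 2.0],
              [0.0, 0.0, 0.0],
              [0.0, 0.0, 0.0]])
B = np.array([[0.0, 0.0, 0.0],
              [0.0, 0.0, 0.0],
              [1.0, 0.0, 1.0],
              [0.0, 1.0, 3.0]])
X = np.hstack([A, B])  # six points; columns 0-2 come from the first subspace
W = shape_interaction_matrix(X)
# Cross-subspace entries of W vanish up to floating point error.
```

With noise the cross-subspace entries of W are no longer exactly zero, which is one reason these methods are restricted to the independent, noise-free setting described above.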
SSC [5] and LRR [6] are two state-of-the-art algorithms that use spectral clustering techniques. They initially express each data point x_i as a linear combination of the remaining data points, x_i = Σ_{j≠i} c_ij x_j, and use the coefficients c_ij to construct the similarity matrix C. The algorithms then optimize C so that c_ij = 0 for all pairs of points that do not belong to the same subspace. SSC uses ℓ1 norm regularization [23] to enforce C to be sparse, while LRR enforces C to be low-rank by using nuclear norm regularization [24]. Both algorithms assume the underlying subspaces to be disjoint. In addition, both have high computational complexity, which grows rapidly with the number of input records.
Bottom-up subspace clustering algorithms. From an algorithmic point of view, clustering algorithms can be classified into bottom-up algorithms and top-down algorithms [8]. As our algorithm follows a bottom-up strategy, we briefly discuss the relevant algorithms of this class to highlight our contributions.

The bottom-up strategy involves searching for dense units in individual dimensions, and subsequently aggregating these dense units to form clusters in higher dimensional subspaces. The difference among bottom-up algorithms lies in the definition of dense units and the method of aggregating lower dimensional clusters. For example, CLIQUE [11] divides individual dimensions into fixed-size cells, and defines dense units as cells containing more than a predefined number of points. It then aggregates adjacent dense units to construct higher dimensional clusters. CLIQUE heavily depends on setting appropriate values for the cell size and the density threshold. This can be challenging because the value ranges differ across dimensions, and there might not be a single set of parameters that suits all dimensions. In addition, searching for dense units in separate dimensions ignores the covariance between dimensions, which can lead to either missing clusters or redundant combinations of dense units. We discuss this phenomenon in more detail in Section 4.1. SUBCLU [12] does not rely on fixed cells. Instead, it uses DBSCAN [8] to search for dense units in each dimension, and iteratively constructs higher dimensional subspaces. The algorithm invokes a call of DBSCAN for each candidate subspace, which can lead to a high running time. We propose to perform clustering only at the beginning of the algorithm while still guaranteeing that the aggregation of these low dimensional clusters forms valid high dimensional clusters, which achieves a much lower computational cost.
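To make the dense-unit step concrete, here is a minimal sketch of CLIQUE-style one-dimensional dense units. The parameter names and the grid convention are our illustrative choices, not CLIQUE's exact formulation.

```python
from collections import Counter

def dense_units(points, cell_size, density_threshold):
    """Find CLIQUE-style dense units in each individual dimension.

    points is a list of equal-length coordinate tuples. Each dimension is
    cut into fixed-size cells; a (dimension, cell index) pair is a dense
    unit if it holds at least `density_threshold` points. Higher
    dimensional candidates are then built by combining dense units, which
    is where the combinatorial cost discussed above comes from.
    """
    counts = Counter()
    for p in points:
        for dim, x in enumerate(p):
            counts[(dim, int(x // cell_size))] += 1
    return {unit for unit, c in counts.items() if c >= density_threshold}
```

Note that the counting is done per dimension in isolation, which is precisely why covariance between dimensions is lost at this stage.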
Co-clustering. Another relevant topic is co-clustering (a.k.a. biclustering or pattern-based clustering) [25]. Co-clustering can be considered a more general class of clustering of high dimensional data that simultaneously clusters rows (points) and columns (dimensions). The main points that differentiate co-clustering from subspace clustering lie in the approach to the problem, and in the homogeneous methodology to find clusters in both axis-parallel and arbitrarily oriented subspaces [8]. In this paper, we also compare the performance of our algorithm on gene expression data with a range of co-clustering algorithms, including SWCC [2], BBAC-S [26], ITCC [27], FFCFW [28], and HICC [29].
III. Problem Statement
We first present the notation used in this paper.

S = {a_1, a_2, ..., a_m} is a subspace of m dimensions, represented as the set of its component dimensions, where a_j denotes the j-th dimension.

P (or P_i) is a set of points; p = (x_1, x_2, ..., x_d) denotes a point, where x_j is the coordinate in the j-th dimension.

C = (P, S) is a cluster formed by the point set P in subspace S.
Let D be a set of N points in a d-dimensional space, and P be a subset of D. The set of all subspace clusters is denoted as Γ = {C_1, C_2, ..., C_k}, with the clusters lying in subspaces S_1, ..., S_m. Here, m denotes the number of subspaces containing clusters, and k denotes the number of all clusters. More than one cluster can exist in a subspace, i.e., k ≥ m. Our subspace clustering algorithm finds all k clusters by identifying their corresponding subspaces and point sets.
We take a bottom-up approach to finding the clusters in subspaces, starting from base clusters in low dimensional subspaces. The algorithm used to find the base clusters is orthogonal to our study; we use k-means in the experiments for simplicity, although any low dimensional clustering algorithm may be used. Once the base clusters are found, our algorithm aggregates them to form clusters in higher dimensional subspaces. We follow a probabilistic approach, together with the downward closure property of density, to guarantee the validity of the formation of clusters in higher dimensional subspaces. This is formulated as Lemma 1.
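The base-cluster search can be sketched as follows, with a stdlib-only k-means stand-in (the paper uses k-means but any low dimensional clusterer fits); the subspace list, the choice of k, and all names here are illustrative assumptions.

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Plain k-means for low dimensional points (tuples). Returns labels."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        for i, p in enumerate(points):
            labels[i] = min(range(k),
                            key=lambda c: sum((a - b) ** 2
                                              for a, b in zip(p, centers[c])))
        for c in range(k):
            members = [p for i, p in enumerate(points) if labels[i] == c]
            if members:
                centers[c] = tuple(sum(xs) / len(members)
                                   for xs in zip(*members))
    return labels

def base_clusters(data, subspaces, k):
    """Run k-means inside each low dimensional subspace.

    Each (subspace, label) pair, together with its point ids, is one base
    cluster; these are the inputs to the aggregation phase.
    """
    clusters = {}
    for S in subspaces:
        proj = [tuple(p[d] for d in S) for p in data]
        for i, lab in enumerate(kmeans(proj, k)):
            clusters.setdefault((S, lab), set()).add(i)
    return clusters
```

Because each base cluster is found on a multi-dimensional projection rather than on single dimensions, the covariance between the dimensions of each low dimensional subspace is preserved, as discussed in the Introduction.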
Lemma 1: Given two points p and q in subspace S, the probability that p and q belong to the same cluster in S is proportional to the cardinality |T| of the set T of lower dimensional subspaces of S in which p and q belong to the same cluster.

Proof: Let A denote the event that the two points p and q belong to the same cluster in subspace S. Assume that we have already performed clustering in lower dimensional subspaces and found that these two points belong to the same cluster in a set of subspaces T (each member of T is a subspace of S). Given this knowledge, the probability that p and q belong to the same cluster in S is P(A | T).

We show that this probability increases as new evidence of the cluster formation of p and q is found in other subspaces of S. Specifically, let these two points also belong to a cluster in a certain subspace S* (S* ∉ T, i.e., S* is indeed a newly discovered subspace in which p and q belong to the same cluster). The probability of them belonging to the same cluster in S becomes P(A | T ∪ {S*}) ≥ P(A | T).
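One way to make the update step explicit (our reading of the argument, writing p, q for the two points, S for the target subspace, A for the event that p and q share a cluster in S, T for the set of lower dimensional subspaces already supporting the pair, and S* for the newly discovered subspace) is a Bayes-rule factorization:

```latex
P\bigl(A \mid T \cup \{S^{*}\}\bigr)
  \;=\; P(A \mid T)\,
        \frac{P\bigl(S^{*} \mid A,\, T\bigr)}{P\bigl(S^{*} \mid T\bigr)}
  \;\ge\; P(A \mid T)
  \quad \text{whenever } P\bigl(S^{*} \mid A,\, T\bigr) \ge P\bigl(S^{*} \mid T\bigr).
```

That is, the probability can only grow when co-clustering in the new subspace is at least as likely given that p and q truly co-cluster in S; each additional supporting subspace contributes such a factor, which is the monotonicity in |T| that the lemma asserts.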