I Introduction and Motivation
Datasets with a mixture of categorical and numerical attributes are pervasive in business and socio-economic applications, and clustering these datasets is an important activity in their analysis. A number of techniques have been developed for this purpose. Techniques to cluster mixed datasets either prescribe a probabilistic generative model or use a dissimilarity measure to compute a dissimilarity matrix that is then clustered. Each of these approaches has issues that need to be addressed when applied to big datasets, that is, datasets with a large number of instances compared to attributes. For example, latent class clustering uses expectation maximization to estimate model parameters, and scaling expectation maximization, and hence latent class clustering, to big datasets is non-trivial. Similarly, dissimilarity based approaches have to overcome the computational and storage (memory) hurdles associated with computing and storing a large dissimilarity matrix, which requires $O(n^2)$ space for $n$ instances. This study is limited to dissimilarity based approaches to clustering mixed datasets.
Clustering methods based on a Euclidean representation of the data have been studied extensively over the years. Novel techniques to cluster big datasets using the Euclidean distance have been developed in recent times, for example the mini-batch K-means algorithm. Similarly, a wide range of tools exists for related tasks such as feature extraction, from older conventional methods like principal component analysis to more recent methods like random projections. If we could determine a Euclidean representation of the categorical attributes in a mixed dataset, we would be able to leverage these tools and techniques. Representing categorical attributes in a Euclidean space presents some difficulties. For example, in a dataset with a gender attribute, how do we assign values to the male and female levels? Should male be assigned a higher value or a lower value? A natural intuition is that the data and the application context should determine this, but we still need a theory that frames this as a problem and arrives at an optimal representation of the levels. This is precisely what homogeneity analysis provides. Details of the method are provided in section III.
In this study we illustrate methods to cluster big datasets with a mixture of categorical and numerical attributes by first determining an optimal Euclidean representation of the dataset. We illustrate this on synthetic and real world data. For the synthetic data, the ground truth is known, and experiments revealed that the clustering solution obtained using the homogeneity analysis based representation of the dataset was very close to the ground truth. Validating clustering solutions when the ground truth is unknown is a difficult task, as discussed in [Chapter 4]. The ground truth is unknown for most real world datasets, so we take recourse to internal measures of clustering quality, such as the compactness and separation of clusters. Experiments with real world data suggest that the proposed method can be very useful in discovering structure in large datasets (see section VII-B). If a partitioning approach is used to cluster the large dataset, each partition can be analyzed using an appropriate methodology: large partitions can be reduced to smaller partitions by reapplying the clustering procedure, while small partitions can, if required, be analyzed with sophisticated, computationally expensive techniques such as manifold learning. In summary, the representation of the dataset obtained using homogeneity analysis can be very useful in the analysis and exploration of big datasets with a mixture of categorical and numerical attributes.
II Problem Context
We have a large dataset with a mixture of numerical and categorical attributes. This dataset has $n$ rows and $m$ attributes (columns). There are $m - J$ numerical attributes and $J$ categorical attributes. The set of categorical attributes, with $J$ elements, is represented by $C = \{c_1, c_2, \ldots, c_J\}$. We need to determine a Euclidean representation for the categorical variables in the dataset. $D_C$ represents the dataset corresponding to the attributes in $C$.
III Overview of Homogeneity Analysis
Homogeneity analysis posits that the observed categorical variables have a representation in a latent (unobserved) Euclidean space. The dimensionality of this space, $p$, is a parameter of the procedure. The representation of a row of $D_C$ in the latent space is characterized by the following elements:
The true representation of the row or instance in the latent Euclidean space.
The optimally scaled representation of the row in terms of the observed attributes. This representation uses an optimal real value for the level of each categorical attribute. This is what we seek to learn.
An edge between an object’s true representation and its approximation. This edge represents the loss of information due to the categorical nature of the object’s attributes.
Such a representation induces a bipartite graph. The disjoint vertex sets of this graph are the objects' true representations and their approximate attribute representations in the latent Euclidean space. This idea is represented in Figure 1.
Definition 1 (Object Score)
The true representation of a row of $D_C$ in the latent Euclidean space is called the object score of the row. The object scores of $D_C$ are represented by a matrix $X$ of dimension $n \times p$.
Definition 2 (Category Quantification)
A categorical attribute's representation in the latent space is called its category quantification. The category quantifications for the attributes of $C$ are represented by a matrix $Y$ called the category quantification matrix. The total number of category quantifications, $K$, is given by:

$$K = \sum_{j=1}^{J} k_j$$

where $k_j$ represents the number of levels for attribute $c_j$. The dimension of $Y$ is $K \times p$.
The optimally scaled representation of $D_C$ in the latent space requires the use of an indicator matrix. The indicator matrix, $G$, is a representation of $D_C$ using a one hot encoding scheme. In a one hot encoding scheme, each attribute of $D_C$ is represented by a set of columns corresponding to the levels of that attribute. The level taken by the attribute in a particular row is encoded as a $1$; the other levels are encoded as $0$. The dimension of $G$ is $n \times K$.
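The construction of the indicator matrix $G$ can be sketched in a few lines. The helper below is an illustrative implementation, not the paper's code; the column layout (one block of columns per attribute) follows the description above.

```python
import pandas as pd

def indicator_matrix(df, categorical_cols):
    """Build the one-hot indicator matrix G (n x K), where K is the
    total number of levels across the categorical attributes."""
    return pd.get_dummies(df[categorical_cols]).to_numpy(dtype=float)

# Toy dataset: two categorical attributes with 2 and 3 levels (K = 5)
df = pd.DataFrame({"gender": ["m", "f", "m", "f"],
                   "size":   ["s", "m", "l", "s"]})
G = indicator_matrix(df, ["gender", "size"])
print(G.shape)  # (4, 5)
```

Each row of $G$ contains exactly one $1$ per attribute, so the row sums equal the number of categorical attributes.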
Homogeneity Analysis solves the following optimization problem:

$$\min_{X, Y_1, \ldots, Y_J} \; \sigma(X, Y) = \frac{1}{J} \sum_{j=1}^{J} \mathrm{SSQ}\left(X - G_j Y_j\right) \quad (1)$$

subject to $X^T X = n I_p$ and $u^T X = 0$, where $G_j$ is the indicator matrix for attribute $c_j$, $Y_j$ is its $k_j \times p$ block of category quantifications, $u$ is an $n$-vector of ones, and $\mathrm{SSQ}(\cdot)$ denotes the sum of squared matrix elements.
The above constraints standardize the object scores and force the solution to be centered around the origin. They also eliminate the trivial solution $X = 0$, $Y = 0$. This optimization problem is solved using an Alternating Least Squares (ALS) algorithm. A brief sketch of the steps of the algorithm is provided in Algorithm 1.
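The ALS iteration can be sketched as follows, using the standard Gifi updates: quantifications by least squares given the object scores, object scores by averaging the quantified attributes, then centring and orthonormalisation to enforce the constraints. The function name, initialization, and fixed iteration count are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def homals_als(G_list, p=1, n_iter=100, seed=0):
    """Alternating Least Squares sketch for homogeneity analysis.
    G_list: one-hot indicator matrices G_j (each n x k_j).
    Returns object scores X (n x p) and quantifications Y_j (k_j x p)."""
    rng = np.random.default_rng(seed)
    n = G_list[0].shape[0]
    X = rng.standard_normal((n, p))
    for _ in range(n_iter):
        # Update category quantifications: Y_j = (G_j'G_j)^{-1} G_j' X
        Y = [np.linalg.solve(G.T @ G, G.T @ X) for G in G_list]
        # Update object scores: average of the quantified attributes
        X = sum(G @ Yj for G, Yj in zip(G_list, Y)) / len(G_list)
        # Enforce constraints: column-centre, then rescale so X'X = n I_p
        X -= X.mean(axis=0)
        Q, _ = np.linalg.qr(X)
        X = np.sqrt(n) * Q
    return X, Y
```

Because centring precedes the QR step, the orthonormalised columns remain orthogonal to the vector of ones, so both constraints hold at every iteration.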
The loss associated with the Homogeneity Analysis based solution measures the difference between the true representation and the optimally scaled representation. The loss function can be expressed in terms of the attributes as follows:

$$\sigma(X, Y) = \frac{1}{J} \sum_{j=1}^{J} \mathrm{SSQ}\left(X - G_j Y_j\right) \quad (3)$$

Here, $\mathrm{SSQ}(M)$ refers to the sum of the squares of the elements of the matrix $M$.
The Homogeneity Analysis problem can be expressed as an eigenvalue problem. The loss function of the Homogeneity Analysis solution (Equation 3) can be expressed in terms of the eigenvalues of the average projection matrix $P_\star$ for the subspaces spanned by the columns of the indicator matrices. For attribute $c_j$, the projection matrix is

$$P_j = G_j \left(G_j^T G_j\right)^{-1} G_j^T.$$

The average projection matrix for the $J$ attributes is given by $P_\star = \frac{1}{J} \sum_{j=1}^{J} P_j$. The loss for the Homogeneity Analysis solution can be expressed in terms of the eigenvalues of $P_\star$:

$$\sigma(X, Y) = n \left( p - \sum_{s=1}^{p} \lambda_s \right) \quad (4)$$

where $\lambda_1 \geq \lambda_2 \geq \ldots$ are the largest non-trivial eigenvalues of $P_\star$. An inspection of Equation 4 shows that the number of eigenvalues to use with the Homogeneity Analysis solution ($p$ in Equation 4) is a parameter to be chosen. Increasing the number of eigenvalues decreases the loss; however, it also increases the dimensionality of the category quantifications (each categorical attribute level has a $p$ dimensional representation). In this study we found that using the first eigenvalue alone ($p = 1$) to determine the optimal real valued representation yielded good clustering solutions. This is consistent with the fact that the first eigenvalue holds most of the information about the attribute's real valued representation. If we used higher values of $p$ (i.e., $p > 1$), we would replace each categorical value in our original dataset by a $p$-tuple of values.
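For small examples, the eigenvalues of the average projection matrix can be computed directly. The sketch below assumes dense indicator matrices and is illustrative only; for big datasets the sparse structure of each $G_j$ would have to be exploited rather than forming $P_\star$ explicitly.

```python
import numpy as np

def projector_eigenvalues(G_list):
    """Eigenvalues (in decreasing order) of the average projection matrix
    P* = (1/J) * sum_j G_j (G_j'G_j)^{-1} G_j'."""
    J = len(G_list)
    P_star = sum(G @ np.linalg.solve(G.T @ G, G.T) for G in G_list) / J
    # P* is symmetric, so eigvalsh applies; sort eigenvalues decreasing
    return np.sort(np.linalg.eigvalsh(P_star))[::-1]
```

Since each $P_j$ is an orthogonal projector and every column space contains the constant vector, $P_\star$ always has a trivial eigenvalue equal to 1, with the remaining eigenvalues lying in $[0, 1]$.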
IV Application of Homogeneity Analysis to Big Data
The original homogeneity analysis characterized by Equation 1 treats the numerical values in a mixed dataset as categorical variables with a large number of levels or categories. The homogeneity analysis solution determined using such a representation is computable for small or even moderate sized datasets. In big datasets, where a numeric variable can have several hundred thousand unique values, treating numerical variables as categorical variables with a large number of categories introduces computational hurdles associated with processing the matrix $G$: a one hot encoded representation of a variable with a large number of categories is extremely wide. Since we already have a numerical representation of these variables, we do not include them in the homogeneity analysis computation. Eliminating the numerical variables from the set of variables used for the homogeneity analysis computation implies that the numerical variables do not influence the representation of the categorical variables in the latent space. This is very similar to assumptions made by probabilistic approaches to modeling mixed datasets, such as latent class clustering, which model the numeric and categorical variables independently in the latent space. This is the approach taken in this work. If we are compelled to consider the numerical values in the homogeneity analysis model, we can take a sampling based approach, determining the Euclidean representation from an appropriate sample obtained, for example, by stratified sampling.
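Under this approach, the final Euclidean representation combines z-scored numeric attributes with quantified categorical attributes. The sketch below is an illustrative assembly step; the `quantifications` dictionary (mapping each level to its $p = 1$ score) stands in for the output of a homogeneity analysis fit and is an assumption of this sketch.

```python
import pandas as pd

def euclidean_representation(df, numeric_cols, quantifications):
    """Assemble the euclidean representation of a mixed dataset:
    numeric attributes are z-scored; each categorical level is replaced
    by its (p = 1) category quantification. `quantifications` is a
    dict {column: {level: score}} from a homogeneity analysis fit."""
    out = {}
    for col in numeric_cols:
        x = df[col].astype(float)
        out[col] = (x - x.mean()) / x.std()
    for col, scores in quantifications.items():
        out[col] = df[col].map(scores).astype(float)
    return pd.DataFrame(out)
```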
GROUPALS is a clustering solution that is also based on homogeneity analysis. This approach solves the clustering problem and the homogeneity analysis problem together. While this approach is tractable for small and moderate sized datasets, it can be impractical for big datasets. The GROUPALS algorithm requires us to provide the number of clusters, $k$, in the data. If this is not known, we have to compute the GROUPALS solution for a range of $k$ values and then determine the optimal value by applying a suitable cluster validation index to each solution. We propose an approach that decouples the homogeneity analysis and clustering problems. By decoupling them, we compute the homogeneity analysis solution only once; computing it for a range of $k$ values on a large dataset is a time consuming and computationally expensive task. Further, we can apply a wider range of clustering algorithms, not only the k-means like approach used by GROUPALS, to determine an appropriate clustering solution. Similarly, it is common to solve the clustering and feature selection problems together in small or moderate sized datasets. This again implies that we have to solve the clustering and feature selection problem for a range of $k$ values and then determine the best solution by applying a clustering index to each solution, which is computationally expensive and impractical for big datasets. The scope of this study is limited to datasets that are big in the number of instances rather than the number of attributes. For this study we used incremental PCA, a scalable feature extraction technique, to evaluate feature noise and relevance to the clustering solution. The datasets used in this study had a relatively small number of features, and feature extraction did not improve the quality of the clustering solutions.
After the Euclidean representation of the big dataset has been determined, we need to address the task of clustering it. [Chapter 2] and [Chapter 8] (more recent) provide a good overview of the algorithmic approaches to clustering, and [Chapter 8] provides guidelines for determining the suitability of a clustering method for a particular application. It has long been observed that the computational complexity of competing algorithms and the availability of software often dictate the choice of method; this advice is timeless and still applicable today. A summary of the strategies and approaches to clustering big datasets is also available. In this work, we picked three clustering methods for which good open source implementations exist (for example in scikit-learn or R). We illustrate clustering solutions using the mini-batch K-means, Balanced Iterative Reducing and Clustering using Hierarchies (BIRCH), and Clustering Large Applications (CLARA) algorithms.
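Two of these algorithms are available in scikit-learn (CLARA is provided by R's cluster package rather than scikit-learn). The sketch below is illustrative; the blob data stands in for the Euclidean representation of the mixed dataset, and the parameter values are assumptions.

```python
from sklearn.cluster import Birch, MiniBatchKMeans
from sklearn.datasets import make_blobs

# Stand-in for the euclidean representation of the mixed dataset
X, _ = make_blobs(n_samples=2000, centers=3, cluster_std=0.5, random_state=0)

# Mini-batch K-means: fits on small random batches, so memory stays bounded
mbk = MiniBatchKMeans(n_clusters=3, batch_size=256, n_init=10, random_state=0)
labels_mbk = mbk.fit_predict(X)

# BIRCH: summarises the data in a clustering-feature (CF) tree in one pass,
# then clusters the tree's subclusters
birch = Birch(n_clusters=3)
labels_birch = birch.fit_predict(X)
```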
Validating the developed clustering solutions is the next task. The ground truth in clustering applications is rarely known. For this reason we use a synthetic dataset (described in section VI-A) to compare the clustering solutions with a known ground truth. When the ground truth is known, external clustering indexes [Chapter 4, section 4.2] are used to validate clustering solutions. When the ground truth is unknown, we take recourse to internal clustering indexes, which measure properties generally observed in good clustering solutions, such as good separation and compactness of clusters. The choice of the cluster validation index is again subjective and depends on the application context; there is usually some human intuition about what a good clustering is for an application that helps determine this choice. Further, the calculation of these indexes for big datasets is not trivial, and the availability of reliable software implementations may also be a factor. The adjusted Rand index is a very common choice for comparing a clustering solution with a ground truth, and it was used in this work to determine the number of clusters for the synthetic dataset. The Calinski-Harabasz index was used to determine the number of clusters for the airline delay dataset (see section VI-A). This index scales well to large datasets and has been reported to be reliable by researchers over time.
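The two kinds of index can be sketched with scikit-learn's implementations. The data, the range of $k$ values, and the cluster centres below are illustrative assumptions; the point is that the Calinski-Harabasz index needs no ground truth, while the adjusted Rand index does.

```python
from sklearn.cluster import MiniBatchKMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score, calinski_harabasz_score

# Well-separated toy clusters with a known ground truth
centers = [[0, 0], [10, 10], [-10, 10]]
X, truth = make_blobs(n_samples=1500, centers=centers, cluster_std=1.0,
                      random_state=0)

# Internal index: score each candidate k without using the ground truth
ch_scores = {}
for k in range(2, 6):
    labels = MiniBatchKMeans(n_clusters=k, n_init=10,
                             random_state=0).fit_predict(X)
    ch_scores[k] = calinski_harabasz_score(X, labels)
best_k = max(ch_scores, key=ch_scores.get)

# External index: compare the chosen solution with the ground truth
labels = MiniBatchKMeans(n_clusters=best_k, n_init=10,
                         random_state=0).fit_predict(X)
ari = adjusted_rand_score(truth, labels)
```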
VI Experimental Evaluation
The following datasets were used for this study:
Synthetic Dataset With Mixed Attributes: The synthetic dataset used in this study was generated using a two step procedure. The dataset consists of four attributes: two continuous variables and two categorical variables with three levels each. The first step of the generating procedure was to generate three clusters by sampling a standard isotropic Gaussian distribution; this produced the two continuous variables. The second step was to generate the categorical variables. Each categorical variable was generated by sampling a multinomial distribution over three category levels, one cluster at a time, with a different multinomial parameter for each cluster. The parameter was chosen so that one attribute level was dominant for each categorical attribute in each cluster. The dataset had one million instances.
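The two step generator above can be sketched as follows. The cluster centres and multinomial parameters are illustrative assumptions, since the paper does not state the exact values used.

```python
import numpy as np

def make_mixed_clusters(n_per_cluster=1000, seed=0):
    """Generate a mixed dataset with three clusters: two Gaussian numeric
    attributes plus two 3-level categorical attributes whose multinomial
    parameters make one level dominant per cluster."""
    rng = np.random.default_rng(seed)
    centers = np.array([[0.0, 0.0], [5.0, 5.0], [-5.0, 5.0]])
    # One dominant level per cluster for each categorical attribute
    probs = np.array([[0.8, 0.1, 0.1],
                      [0.1, 0.8, 0.1],
                      [0.1, 0.1, 0.8]])
    num, cat1, cat2, labels = [], [], [], []
    for c in range(3):
        num.append(rng.standard_normal((n_per_cluster, 2)) + centers[c])
        cat1.append(rng.choice(3, size=n_per_cluster, p=probs[c]))
        cat2.append(rng.choice(3, size=n_per_cluster, p=probs[c]))
        labels.append(np.full(n_per_cluster, c))
    return (np.vstack(num), np.concatenate(cat1),
            np.concatenate(cat2), np.concatenate(labels))
```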
Airline Delay: This dataset was obtained from the US Department of Transportation's website. The data represent arrival delays for US domestic flights during January 2016. The dataset has 11 features and over four hundred and thirty thousand instances. A description of the attributes is provided in Table I.
TABLE I: Delay Data January 2016

  #   Attribute          Type        Description
  1   DAY_OF_MONTH       Ordinal     Day of flight record
  2   DAY_OF_WEEK        Ordinal     Day of week for flight record
  3   CARRIER            Nominal     Carrier (airline) for the flight record
  4   ORIGIN             Nominal     Origin airport code
  5   DEST               Nominal     Destination airport code
  6   DEP_DELAY          Continuous  Departure delay in minutes
  7   TAXI_OUT           Continuous  Taxi out time in minutes
  8   TAXI_IN            Continuous  Taxi in time in minutes
  9   ARR_DELAY          Continuous  Arrival delay in minutes
 10   CRS_ELAPSED_TIME   Continuous  Flight duration
 11   NDDT               Continuous  Departure time in minutes from midnight, January 1 2016
The arrival delay attribute is the attribute of interest in this dataset. It has obvious outliers; for example, it contains flights that actually departed over 24 hours after their scheduled time of departure. These obvious outliers were removed by trimming the attribute at an extreme percentile of its values. The majority of the outliers remain in the dataset used for this study; details of removing them are described in section VII-B. The attributes in the dataset were standardized.
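Percentile based trimming followed by standardization can be sketched as below. The cut-off percentiles are illustrative, since the exact percentile used in the study is not stated here.

```python
import numpy as np

def trim_and_standardise(x, lower=0.5, upper=99.5):
    """Drop values outside the [lower, upper] percentile range of x,
    then z-score the remaining values."""
    lo, hi = np.percentile(x, [lower, upper])
    kept = x[(x >= lo) & (x <= hi)]
    return (kept - kept.mean()) / kept.std()
```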
VI-B Software Tools
VII Discussion of Results
VII-A Synthetic Dataset
The adjusted Rand index was used to compare the clustering results, obtained after applying homogeneity analysis to the dataset, with the ground truth. The adjusted Rand index accounts for chance agreement when comparing the ground truth with the clustering solutions produced by the algorithms. The results are shown in Table II. The columns of Table II report the adjusted Rand index (ARI) obtained for various values of the number of clusters, $k$.
For all the algorithms used in this study, the clustering solutions are similar to the ground truth and the adjusted rand index identifies the correct number of clusters in the data.
VII-B Airline Delay Dataset
As discussed in section VI-A, the arrival delay attribute is the attribute of interest in the airline delay dataset. This dataset has outliers, as is evident from a review of the quantiles of the arrival delay attribute (see Table IV): a moderate upper percentile of the arrival delay is 5 minutes, while a more extreme upper percentile is 155 minutes. The ground truth for the airline delay dataset is unknown, so we use an internal measure of cluster validity, the Calinski-Harabasz index, to determine the optimal number of clusters. The results of applying the clustering algorithms to this dataset, after using homogeneity analysis to determine an optimal representation, are provided in Table III.
A review of Table III shows that the optimal number of clusters differs across clustering methods. We therefore examine each clustering solution in detail to determine its characteristics; in particular, we evaluate the mean arrival delay associated with each clustering solution. The results are shown in Table V. Note that CLARA produces a cluster whose mean delay is nearly three standard deviations from the overall mean. This is a very useful finding; indeed, clustering can be used to remove noise from datasets. Since CLARA provides the most useful clustering at the first level of analysis, we explore this solution further. The profile of the CLARA solution is shown in Table VI. The clustering produces two large clusters characterized by early arrivals. Long flight delays are a rare occurrence and the majority of flights are slightly early (see Table IV), which is consistent with the results we observed on this data. Since we are applying a partitioning method to cluster the dataset, the clusters produced by CLARA represent a partition of the data. This implies we can analyze each of the clusters in Table VI independently and arrive at a collective picture of the dataset. Accordingly, we applied CLARA clustering to each of the clusters in Table VI; the results are provided in Table VII through Table X. An analysis of these results reveals that the data fall into sub-clusters that represent either early arrivals or slight delays. Table X represents the outliers. Cluster membership can inform us about the likelihood that a data element is associated with a flight delay; for example, membership in sub-cluster 2 of cluster 4 is likely to be associated with delays. In summary, we have been able to extract insights by clustering the representation of the dataset obtained from homogeneity analysis.
When homogeneity analysis is applied to big datasets, the original formulation must be carefully adapted to the size and characteristics of the data (see section IV). Experiments on synthetic data indicate that clustering solutions developed using the Euclidean representation determined by homogeneity analysis are similar to the ground truth. Experiments on real data indicate that clustering solutions developed using homogeneity analysis can be very useful in analyzing big datasets. Clustering a big dataset with a partitioning based clustering method permits a divide and conquer strategy for data analysis: the partitions produced by clustering can be analyzed independently, and when partitions are small we can consider sophisticated, computationally expensive tools for their analysis. In summary, homogeneity analysis can be a useful tool for the exploration and analysis of big datasets with a mixture of continuous and categorical attributes.
-  C. Hennig and T. F. Liao, “How to find an appropriate clustering for mixed-type variables with application to socio-economic stratification,” Journal of the Royal Statistical Society: Series C (Applied Statistics), vol. 62, no. 3, pp. 309–369, 2013.
-  A. Ahmad and L. Dey, “A k-mean clustering algorithm for mixed numeric and categorical data,” Data & Knowledge Engineering, vol. 63, no. 2, pp. 503–527, 2007.
-  Z. Huang, “Clustering large data sets with mixed numeric and categorical values,” in Proceedings of the 1st Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD), Singapore, 1997, pp. 21–34.
-  J. K. Vermunt and J. Magidson, “Latent class cluster analysis,” Applied Latent Class Analysis, vol. 11, pp. 89–106, 2002.
-  J. C. Gower, “A general coefficient of similarity and some of its properties,” Biometrics, 1971.
-  P. S. Bradley, U. Fayyad, C. Reina et al., “Scaling em (expectation-maximization) clustering to large databases,” Technical Report MSR-TR-98-35, Microsoft Research Redmond, Tech. Rep., 1998.
-  A. Abarda, Y. Bentaleb, and H. Mharzi, “A divided latent class analysis for big data,” Procedia Computer Science, vol. 110, pp. 428–433, 2017.
-  D. Sculley, “Web-scale k-means clustering,” in Proceedings of the 19th international conference on World wide web. ACM, 2010, pp. 1177–1178.
-  C. Boutsidis, A. Zouzias, and P. Drineas, “Random projections for k-means clustering,” in Advances in Neural Information Processing Systems, 2010, pp. 298–306.
-  G. Michailidis and J. de Leeuw, “The Gifi system of descriptive multivariate analysis,” Statistical Science, vol. 13, pp. 307–336, 1998.
-  A. K. Jain and R. C. Dubes, Algorithms for clustering data. Prentice-Hall, Inc., 1988.
-  S. Van Buuren and W. J. Heiser, “Clustering n objects into k groups under optimal scaling of variables,” Psychometrika, vol. 54, no. 4, pp. 699–706, 1989.
-  D. A. Ross et al., “Incremental learning for robust visual tracking,” International Journal of Computer Vision, vol. 77, no. 1-3, pp. 125–141, 2008.
-  P.-N. Tan, M. Steinbach, and V. Kumar, Introduction to Data Mining, 1st ed., 2005.
-  J. Béjar Alonso, “Strategies and algorithms for clustering large datasets: a review,” Tech. Report, Universitat Politècnica de Catalunya, 2013.
-  F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay, “Scikit-learn: Machine learning in Python,” Journal of Machine Learning Research, vol. 12, pp. 2825–2830, 2011.
-  R Core Team, R: A Language and Environment for Statistical Computing, R Foundation for Statistical Computing, Vienna, Austria, 2017. [Online]. Available: https://www.R-project.org/
-  T. Zhang, R. Ramakrishnan, and M. Livny, “Birch: an efficient data clustering method for very large databases,” in ACM Sigmod Record, vol. 25, no. 2. ACM, 1996, pp. 103–114.
-  L. Kaufman and P. J. Rousseeuw, “Clustering large applications (program clara),” Finding groups in data: an introduction to cluster analysis, pp. 126–163, 2008.
-  J. M. Luna-Romera, M. del Mar Martínez-Ballesteros, J. García-Gutiérrez, and J. C. Riquelme-Santos, “An approach to silhouette and dunn clustering indices applied to big data in spark,” in Conference of the Spanish Association for Artificial Intelligence. Springer, 2016, pp. 160–169.
-  M. Tlili and T. M. Hamdani, “Big data clustering validity,” in Soft Computing and Pattern Recognition (SoCPaR), 2014 6th International Conference of. IEEE, 2014, pp. 348–352.
-  L. Hubert and P. Arabie, “Comparing partitions,” Journal of classification, vol. 2, no. 1, pp. 193–218, 1985.
-  T. Caliński and J. Harabasz, “A dendrite method for cluster analysis,” Communications in Statistics-theory and Methods, vol. 3, no. 1, pp. 1–27, 1974.
-  G. W. Milligan and M. C. Cooper, “An examination of procedures for determining the number of clusters in a data set,” Psychometrika, vol. 50, no. 2, pp. 159–179, 1985.
-  O. Arbelaitz, I. Gurrutxaga, J. Muguerza, J. M. Pérez, and I. Perona, “An extensive comparative study of cluster validity indices,” Pattern Recognition, vol. 46, no. 1, pp. 243–256, 2013.
-  USDOT. (2014) RITA airline delay data download. [Online]. Available: http://www.transtats.bts.gov/DL_SelectFields.asp?Table_ID=236
-  G. van Rossum, “Python reference manual,” Tech. Rep., Amsterdam, The Netherlands, 1995.
-  J. de Leeuw and P. Mair, “Gifi methods for optimal scaling in R: The package homals,” Journal of Statistical Software, vol. 31, no. 4, pp. 1–20, 2009. [Online]. Available: http://www.jstatsoft.org/v31/i04/
-  C. Hennig, fpc: Flexible Procedures for Clustering, 2015, r package version 2.1-10. [Online]. Available: https://CRAN.R-project.org/package=fpc
-  H. Xiong, G. Pandey, M. Steinbach, and V. Kumar, “Enhancing data analysis with noise removal,” IEEE Transactions on Knowledge and Data Engineering, vol. 18, no. 3, pp. 304–319, 2006.