This work is concerned with the problem of finding dependencies between random variables using samples. Many existing methods to address this problem seek to identify a functional relationship between random variables through some type of regression. Linear relationships are by far the easiest to analyze. In the case where the number of random variable samples is significantly larger than the number (dimension) of random variables considered, this can be done by estimating the pairwise correlations between the variables through the covariance matrix. Non-linear relationships can be found by considering non-linear combinations of the random variables, thus increasing the dimensionality of the problem. We are interested in the case where few samples are observed, which can lead to ill-conditioning in the underlying regression, especially when looking for non-linear dependencies. Ill-conditioning can be addressed through the addition of a prior model or assumption for the dependencies, which typically adds a bias to the solution estimate. For example, Ridge regression and LASSO  do this by promoting sparsity in the functional relationship.
In this paper, we take a more general standpoint and rephrase the problem of finding functional relationships as 1) finding patterns (clusters) among the samples of an observed random variable, and 2) determining to what extent the outcome random variable depends on the cluster from which the observed random variable is drawn. This point of view allows us to investigate datasets of arbitrarily large dimensions, even small datasets of 20-30 subjects.
Our work is motivated by the problem of analyzing data arising in educational research. Often, the number of students observed is very small and the data representing each student is high-dimensional. For example, one problem of interest is to determine how student characteristics influence the course outcomes. We view the student characteristics (e.g., attendance rate) as an observed random variable , and the course outcomes (e.g., grade) as the outcome variable . The question we seek to answer is “Do different student characteristics patterns yield different student outcomes?”
We begin with probing for linear dependencies of the outcome variable on the patterns of observed random variables. This is done by first clustering the feature vectors representing samples of the observed random variable in a large number of different ways. Clustering is accomplished using a proposed modification of the
-TARP classifier and the RP1D clustering method . This clustering is based on projecting the sample data onto a random one-dimensional linear subspace and dividing the projected sample data into two by thresholding. Previous work [5, 4] has shown that there exists a significant amount of hidden structure in real datasets that can be uncovered through 1D random projections. Running this random clustering several times yields a large number of binary clusterings. A statistical permutation test is performed to check the validity of each clustering, and non-valid ones are discarded. This clustering approach differs significantly from mainstream ones (e.g., [6, 7] ) which seek to find a unique “best” clustering by numerical optimization. The underlying prior model is also fundamentally different.
The dependencies of the outcome variable is analyzed by looking at the cumulative distribution function (CDF) of the difference between the distribution of the outcome variables between the two subject groups defined by a clustering. The resulting curve is compared to a null hypothesis in order to detect the difference levels at which there is a valid pattern dependence. Higher order dependencies are analyzed by first prolonging the feature vector space to include higher degree monomials before clustering. The influence of individual variables is explored by sequential removal. A numerical experiment illustrating the use of our method on data from a course on digital signal processing[8, 9] with a very small number of students is presented at the end of this paper.
We begin by describing the first step of our method, which is used to determine the clusterability of the data and thus whether our framework is applicable. Our proposed clustering method is then described, followed by the construction of the CDF curves used to analyze the pattern dependencies and the null hypothesis used for detection. We finish with describing the extension of the feature space to analyze non-linear dependencies as well as the feature analysis process.
Ii-a Clusterability Quantification
Our method is applicable when the distribution of the observed random variable has enough structure so that a projection onto a random line has a high-likelihood of uncovering a binary clustering. To check if this is the case, one should estimate the clusterability of the dataset by plotting the histogram of “normalized withinss” (). Normalized withinss is derived from the between-cluster scatter 
. It measures to what extent a 1D set of samples is divided into two clusters. A good binary clustering corresponds to a low value of normalized withinss. Empirically, the probability density function ofshould have a non-negligible mass below .
Ii-B -TARP Clustering
Our proposed clustering method is non-deterministic and seeks to generate a large number of statistically significant binary clusterings. A fraction of the samples is used to find a clustering (training), and the remaining samples are used to test the validity of the clustering (testing). These two steps are repeated several times until a large number of valid clusterings are obtained. A MATLAB implementation is available .
Training: Let be the training points.
For to :
Generate a random vector in ;
Project each onto by taking the dot product ;
Use -means to find 2 clusters in the projected ;
Compute the normalized withinss for this clustering;
Store the vector associated with the smallest .
Compute and store the threshold separating the two clusters;
Validity Testing: Let be the testing points.
Import and from the training phase;
Project each testing point onto the vector by taking the dot product ;
Use the threshold to assign a cluster to each of the ;
Perform permutation test with Monte-Carlo simulations  on the projected test data at statistical significance level of 99%.
There are many ways to generate the random vectors . For simplicity, each coordinate is generated using a i.i.d uniform random probability model on . There are also many ways to pick an appropriate threshold . For simplicity, we compute the threshold as the halfway point between the extreme ends (the closer pair) of the projected values from the two classes. Note that, each time the algorithm is run, a different random vector is generated. Therefore, the criteria (feature) used to cluster is different every time. Note also that checking the statistical significance of the grouping using a permutation test is appropriate when the data set is small; for larger datasets, another test should be used.
Ii-C CDF of difference between outcome variable distribution
Let be a measure of the difference between two 1D random variable distributions for the outcome variable z. For simplicity, we set to be the absolute value of the difference between the average values of the two distributions, but other, potentially more accurate measures can be used. We compute for each of the previously obtained (valid) clusters. Since the clustering process is random, can be viewed as a random variable. We estimate its CDF using samples, specifically the values of for each of the previously obtained (valid) clusters.
Ii-D Comparison with Null Hypothesis
The CDF curve is compared to a null-hypothesis CDF curve corresponding to the case with no relationship or dependency between the outcome variable and the clustering of the subjects. This CDF curve is obtained by randomly partitioning the outcome variable values for the entire set of subjects into two groups and computing for these groups. This partitioning is repeated several times and the resulting sample values of are used to estimate a CDF curve. The size of the random groups are chosen to match the sizes of the groups obtained by clustering in the previous step.
The CDF curve obtained as described above will vary from one trial to the next. Thus we compute the average CDF curve, as well as the CDF curves lying one and two standard deviations above/below the average curve. Let us fix a value of; the values of all the curves obtained through our trials can be used to estimate the exact probability that a curve value at would be below two standard deviations under the mean value. For example, if the distribution of the curve values were a Gaussian, then would be ; In general, will tend to be very small.
The values of at which the experimental CDF curve lies two standard deviations below the average null-hypothesis curve are highlighted; if is a value in the highlighted region, and if the experimental CDF curve value at is , then of the patterns have a difference of or less, and thus of the patterns have a difference of at least . This conclusion is valid with probability .
Ii-E Feature Space Extension
The above steps can be applied either directly to the feature vector or, in an identical manner, to an extended feature vector. For example, we can extend the dimensionality of the feature vector by including terms of order of the form , in order to check for non-linear dependencies. While extending the feature space in this manner may result in a large increase in the dimensionality of the observed random variable, such high-dimensions are not an issue for our clustering method.
Ii-F Feature Selection
Another interesting question is whether the outcome variable is dependent on individual features of the observed random variable. This can be investigated by repeating the construction of the CDF curve after removing the feature of interest: if the CDF curve moves up, then the outcome variable dependency on the patterns has been decreased by the removal of the feature, and thus it must be dependent on that feature. If the upward movement is not seen with the degree one curve but with, say, the degree two curve, then the dependency must be quadratic. And so on.
Iii Experimental Results
We implemented our method in MATLAB and used an educational research  dataset to illustrate the use of our method. The data relates to certain skills (“Habits of Mind”) of students in a course on signal processing. Details of how the data was acquired and a feature vector formed can be found in . In this dataset, we have 27 students, each represented by a 26 dimensional feature vector representing skills (observed random variable). The course grade of each student is the outcome variable.
Iii-a Check for Clusterability Results
We evaluated the normalized withinss of the cluster assignments resulting from 500 random projection attempts on a random subset of roughly half the data. As seen from Figure 1, we observe that nearly 80% of random projection attempts made on our experimental dataset result in a value of the withinss below 0.36 which is an indicator of the presence of several (linear) binary clustering.
|Method||Distinct Clusters||Stat Sig in high D||Stat Sig in high D||Stat Sig in proj space||Stat Sig in proj space|
|% (#)||(without repetitions)||(with repetitions)||(without repetitions)||(with repetitions)|
|-means||2.60 % (52)||38.46 %||22.7 %||N.A.||N.A.|
|-TARP||74.00 % (1480)||16.61 %||16.2 %||100 %||89.6 %|
|Proclus ||100% (4)||25%||25%||-||-|
: No clusters were formed, : Overlapping clusters formed, : Did not result in any clusters due to errors over many attempts
Iii-B -TARP clustering Results
We set and ran the clustering algorithm a total of 1000 times. For each individual run of the clustering, we randomly split the students into one group of 13 for training, and one group of 14 for validating. The vast majority (74%) of the clusters found turned out to be distinct (Table I).
In comparison, we also ran -means  with a total of 1000 times on the entire dataset. Unsurprisingly, distinct clusters were only obtained 2.6% of the time out of 1000 attempts (52 different clusterings). We checked for the statistical significance of these clusters (in the original 26 dimensional space) using the high-dimensional version of the permutation test with Monte-Carlo simulations  and found that 38.5% of these were statistically significant in the original high-dimensional space. So overall, only 20 distinct and statistically significant clusterings were obtained with this method.
We also compared with several other clustering methods, with a focus on those specially designed for high-dimensions. Our results are summarized in Table I. We used the implementations found in Weka  for these algorithms and consequently could not automate statistical significance testing or make modifications to evaluate clusterability in projected sub-spaces of the algorithms since we did not have access to the source code. Most of the algorithms did not work or produced overlapping clusters as seen in the Table. Proclus was the only method to produce non-overlapping clusters, and in 4 experiments, only 1 cluster assignment was found to be statistically significant in the original high-dimensional space. Such a performance is not very surprising since we are dealing with a very small number of data points (27) in a high-dimensional space (26 D). As far as we know, our proposed clustering method is the only one that can reliably produce a large number of statistically significant and/or distinct clusters for such a small dataset.
Iii-C CDFs of and dependency analysis results
We obtained the empirical CDFs of the absolute difference in average grades of the clusters formed. Specifically, we run our clustering method 10000 times to generate 10000 cluster assignments and retain only those that were statistically significant in the projected one dimensional space. The resulting empirical CDF is shown in Figure 1(a) in dotted black. For example, about 30% of the clusters found by -TARP are associated with an average grade difference of at least 0.5. This is because the y-axis value is about 0.7 at the x-axis value of 0.5. Note that the value of the CDF curve at that point is below the two-sigma line, and thus the value is within the significance region. Therefore, there is a statistically valid dependency of the outcome variable at that point (yielding a difference in average mean of at least 0.5), at the confidence level corresponding to two standard deviations below the mean.
Iii-D Feature Space Extension
We extended our feature vectors by including terms of order two and three, growing the space dimension to 377 and 3003, respectively. The resulting CDF curves are shown in Figures 1(b) and 1(c). Notice how the empirical CDF curve (dotted) moves further down as the degree increases. The significance region, however, is largest when only features of degree two are included.
Iii-E Feature Selection
Each of the features of our data points is related to one of 5 different skills, which we denote as A,B,C,D and E. We investigate whether the outcome is dependent on a given individual skill by removing the feature coordinates corresponding to that skill and recomputing the corresponding CDF curves. The results using the initial set of features (linear relationship), shown in Figure 2(a), show that the only skills on which the grade depends in a linear fashion are A and, to a lesser extent, E.
The same analysis was carried out after order 2 and order 3 terms expansion of the feature space, respectively (Figures 2(b) and 2(c)). For order two, we see that removing A and E seems once again to shift the curve higher. However for order three, removing D also shifts the curve upward. Thus the course grade is dependent on skill D at a degree level three, but not at a degree one (linear) or two.
We proposed a data analysis method to study the dependence of an outcome random variable on patterns of observed variables. In the method, the existence of dependence patterns is statistically validated and the extent of the dependency is statistically quantified. Changes in the extent of the influence as the feature space is expanded to non-linear terms and as observation variable components are removed are observed. Our approach is computationally inexpensive, scalable to high dimensions, and can be easily modified to handle both very small and large datasets.
As an illustration, we studied 27 subjects whose skills were represented by a 26 dimensional vector in a course on digital signal processing, and analyzed the statistical dependence between these skills and the course grade.
Acknowledgement: This work was funded in part by NSF grant EEC-1544244.
-  A. N. Tikhonov, V. I. Arsenin, and F. John, Solutions of ill-posed problems. Winston Washington, DC, 1977, vol. 14.
-  R. Tibshirani, “Regression shrinkage and selection via the lasso,” Journal of the Royal Statistical Society. Series B (Methodological), pp. 267–288, 1996.
T. Yellamraju, J. Hepp, and M. Boutin, “Benchmarks for image classification and other high-dimensional pattern recognition problems,”arXiv preprint arXiv:2296364, 2018.
-  T. Yellamraju and M. Boutin, “Clusterability and clustering of images and other real high-dimensional data,” IEEE Transactions on Image Processing, vol. 27, no. 4, pp. 1927–1938, April 2018.
-  S. Han and M. Boutin, “The hidden structure of image datasets,” in Image Processing (ICIP), 2015 IEEE International Conference on. IEEE, 2015, pp. 1095–1099.
C. You, D. Robinson, and R. Vidal, “Scalable sparse subspace clustering by
orthogonal matching pursuit,” in
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 3918–3927.
-  C. You, C.-G. Li, D. P. Robinson, and R. Vidal, “Oracle based active set algorithm for scalable elastic net subspace clustering,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 3928–3937.
-  T. Yellamraju, A. J. Magana, and M. Boutin, “Board # 11 : Investigating engineering students habits of mind: A case study approach,” in 2017 ASEE Annual Conference & Exposition. Columbus, Ohio: ASEE Conferences, June 2017, https://peer.asee.org/27686.
-  ——, “Investigating students’ habits of mind in a course on digital signal processing,” IEEE Transactions on education, (accepted with revisions) 2018.
H. Wang and M. Song, “Ckmeans. 1d. dp: optimal k-means clustering in one dimension by dynamic programming,”The R journal, vol. 3, no. 2, p. 29, 2011.
-  T. Yellamraju and M. Boutin, “n-tarp binary clustering code,” May 2018. [Online]. Available: https://purr.purdue.edu/publications/2973/1
-  M. D. Ernst et al., “Permutation methods: a basis for exact inference,” Statistical Science, vol. 19, no. 4, pp. 676–685, 2004.
-  C. C. Aggarwal, J. L. Wolf, P. S. Yu, C. Procopiuc, and J. S. Park, “Fast algorithms for projected clustering,” in ACM SIGMoD Record, vol. 28, no. 2. ACM, 1999, pp. 61–72.
-  R. Agrawal, J. Gehrke, D. Gunopulos, and P. Raghavan, Automatic subspace clustering of high dimensional data for data mining applications. ACM, 1998, vol. 27, no. 2.
-  C. M. Procopiuc, M. Jones, P. K. Agarwal, and T. Murali, “A monte carlo algorithm for fast projective clustering,” in Proceedings of the 2002 ACM SIGMOD international conference on Management of data. ACM, 2002, pp. 418–427.
-  H.-P. Kriegel, P. Kroger, M. Renz, and S. Wurst, “A generic framework for efficient subspace clustering of high-dimensional data,” in Data Mining, Fifth IEEE International Conference on. IEEE, 2005, pp. 8–pp.
-  I. Assent, R. Krieger, E. Müller, and T. Seidl, “Inscy: Indexing subspace clusters with in-process-removal of redundancy,” in Data Mining, 2008. ICDM’08. Eighth IEEE International Conference on. IEEE, 2008, pp. 719–724.
-  M. L. Yiu and N. Mamoulis, “Frequent-pattern based iterative projected clustering,” in Data Mining, 2003. ICDM 2003. Third IEEE International Conference on. IEEE, 2003, pp. 689–692.
-  G. Moise, J. Sander, and M. Ester, “P3c: A robust projected clustering algorithm,” in Data Mining, 2006. ICDM’06. Sixth International Conference on. IEEE, 2006, pp. 414–425.
-  K. Sequeira and M. Zaki, “Schism: A new approach for interesting subspace mining,” in Data Mining, 2004. ICDM’04. Fourth IEEE International Conference on. IEEE, 2004, pp. 186–193.
-  G. Moise and J. Sander, “Finding non-redundant, statistically significant regions in high dimensional data: a novel approach to projected and subspace clustering,” in Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2008, pp. 533–541.
-  K. Kailing, H.-P. Kriegel, and P. Kröger, “Density-connected subspace clustering for high-dimensional data,” in Proceedings of the 2004 SIAM International Conference on Data Mining. SIAM, 2004, pp. 246–256.
-  J. A. Hartigan and M. A. Wong, “Algorithm as 136: A k-means clustering algorithm,” Journal of the Royal Statistical Society. Series C (Applied Statistics), vol. 28, no. 1, pp. 100–108, 1979.
-  E. Müller, S. Günnemann, I. Assent, and T. Seidl, “Evaluating clustering in subspace projections of high dimensional data http://dme.rwth-aachen.de/opensubspace/,” in In Proc. 35th International Conference on Very Large Data Bases (VLDB 2009), Lyon, France. (2009), 2009.