Data are often represented as vectors in a Euclidean space , but frequently, data possess more structure and treating them as Euclidean vectors may be inappropriate. A simple example of this instance is when data are normalized to have unit norm, and thereby put on the surface of the unit hypersphere . Such data are better viewed as objects on a manifold, and when building mathematical models for such data it is often advantageous to exploit the geometry of the manifold (here ).
For example, in classical information retrieval it has been convincingly demonstrated that cosine similarity is a more effective measure of similarity for analyzing and clustering text documents than just Euclidean distances. There is substantial empirical evidence that normalizing the data vectors helps to remove the biases induced by the length of a document and provide superior results[37, 36]
. On a related note, the spherical k-means (spkmeans) algorithm  that runs k-means with cosine similarity for clustering unit norm vectors, has been found to work well for text clustering and a variety of other data. Another widely used similarity measure is Pearson correlation: given this defined as where and . Mapping with (similarly define ), we obtain the inner-product . Moreover, . Thus, the Pearson correlation is exactly the cosine similarity between and . More broadly, domains where similarity measures such as cosine, Jaccard or Dice  are more effective than measures derived from Mahalanobis type distances, possess intrinsic “directional” characteristics, and are hence better modeled as directional data .
This chapter recaps basic statistical models for directional data
, which herein refers to unit norm vectors for which “direction” is more important than “magnitude.” In particular, we recall some basic distributions on the unit hypersphere, and then discuss two of the most commonly used ones: the von Mises-Fisher and Watson distributions. For these distributions, we describe maximum likelihood estimation as well as mixture modeling via the Expectation Maximization (EM) algorithm. In addition, we include a brief pointer to recent literature on applications of directional statistics within machine learning and related areas.
We warn the advanced reader that no new theory is developed in this chapter, and our aim herein is to merely provide an easy introduction. The material of this chapter is based on the author’s thesis , and the three papers [5, 40, 39], and the reader is referred to these works for a more detailed development and additional experiments.
2 Basic Directional Distributions
2.1 Uniform distribution
2.2 The von Mises-Fisher distribution
The vMF distribution is one of the simplest distributions for directional data and it has properties analogous to those of the multivariate Gaussian on . For instance, the maximum entropy density on subject to the constraint that is fixed, is a vMF density (see e.g., [32, pp. 172–174] and ).
A unit norm vector has the von Mises-Fisher (vMF) distribution if its density is
where and . Integrating using polar coordinates, it can be shown [38, App. B.4.2] that the normalizing constant is given by
where denotes the modified Bessel function of the first kind .111Note that sometimes in directional statistics literature, the integration measure is normalized by the uniform measure, so that instead of , one uses .
The vMF density is parameterized by the mean direction , and the concentration parameter , so-called because it characterizes how strongly the unit vectors drawn according to are concentrated about the mean direction . Larger values of imply stronger concentration about the mean direction. In particular when , reduces to the uniform density on , and as , tends to a point density.
2.3 Watson distribution
The uniform and the vMF distributions are defined over directions. However, sometimes the observations are axes of direction, i.e., the vectors are equivalent. This constraint is also denoted by , where
is the projective hyperplane of dimension. The multivariate Watson distribution  models such data; it is parametrized by a mean-direction , and a concentration parameter , with probability density
The normalization constant is given by
where is the confluent hypergeometric function [18, formula 6.1(1)]
and , , , denotes the rising-factorial.
Observe that for , the density concentrates around as increases, whereas for , it concentrates around the great circle orthogonal to .
2.4 Other distributions
We briefly summarize a few other interesting directional distributions, and refer the reader to  for a more thorough development.
Some axial data do not exhibit the rotational symmetry of Watson distributions. Such data could be potentially modeled using Bingham distributions, where the density at a point is , where the normalizing constant can be shown to be , where denotes the confluent hypergeometric function of matrix argument .
Note that since , the Bingham density is identifiable only up to a constant diagonal shift. Thus, one can assume
, or that the smallest eigenvalue ofis zero . Intuitively, one can see that the eigenvalues of
determine the axes around which the data clusters, e.g., greatest clustering will be around the axis corresponding to the leading eigenvector of.
Certain problems require rotationally symmetric distributions that have a ‘modal ridge’ rather than just a mode at a single point. To model data with such characteristics  suggest a density of the form
where as usual denotes the normalization constant.
This distribution is a simpler version of the more general Fisher-Bingham distribution . The density is
This is a more general directional distribution; its density is
There does not seem to exist an easy integral representation of the normalizing constant, and in an actual application one needs to resort to some sort of approximation for it (such as a saddle-point approximation). Kent distributions arise by putting an additional constraint in (2.6).
3 Related work and applications
The classical references on directional statistics are [27, 26, 46]; a more recent, updated reference is . Additionally, for readers interested in statistics on manifolds, a good starting point is . To our knowledge, the first work focusing on high-dimensional application of directional statistics was , where the key application was clustering of text and gene expression data using mixtures of vMFs. There exist a vast number of other applications and settings where hyperspherical or manifold data arise. Summarizing all of these is clearly beyond the scope of this chapter. We mention below a smattering of some works that are directly related to this chapter.
We note a work on feature extraction based on correlation in. Classical data mining applications such as topic modeling for normalized data are studied in [4, 34]. A semi-parametric setting using Dirichlet process mixtures for spherical data is . Several directional data clustering settings include: depth images using Watson mixtures ; a k-means++  style procedure for mixture of vMFs ; clustering on orthogonal manifolds ; mixtures of Gaussian and vMFs . Directional data has also been used in several biomedical (imaging) applications, for example , fMRI , white matter supervoxel segmentation , and brain imaging . In signal processing there are applications to spatial fading using vMF mixtures  and speaker modeling 
. Finally, beyond vMF and Watson, it is worthwhile to consider the Angular Gaussian distribution, which has been applied to model natural images for instance in .
4 Modeling directional data: maximum-likelihood estimation
In this section we briefly recap data models involving vMF and Watson distributions. In particular, we describe maximum-likelihood estimation for both distributions. As is well-known by now, for these distributions estimating the mean is simpler than estimating the concentration parameter .
4.1 Maximum-Likelihood estimation for vMF
Let be a set of points drawn from . We wish to estimate and by solving the m.l.e. optimization problem
Writing , a brief calculation shows that the optimal solution satisfies
where the nonlinear map is defined as
The challenge is to solve (4.3) for . For small values of (e.g., ) the simple estimates provided in  suffice. But for machine learning problems, where is typically very large, these estimates do not suffice. In , the authors provided efficient numerical estimates for that were obtained by truncating the continued fraction representation of and solving the resulting equation. These estimates were then corrected to yield the approximation
which turns out to be remarkably accurate in practice.
Subsequently,  showed simple bounds for by exploiting inequalities about the Bessel ratio —this ratio possesses several nice properties, and is very amenable to analytic treatment . The work of  lends theoretical support to the empirically determined approximation (4.4), by essentially showing this approximation lies in the “correct” range. Tanabe et al.  also presented a fixed-point iteration based algorithm to compute an approximate solution .
The critical difference between this approximation and the next two is that it does not involve any Bessel functions (or their ratio). That is, not a single evaluation of is needed—an advantage that can be significant in high-dimensions where it can be computationally expensive to compute . Naturally, one can try to compute () to avoid overflows (or underflows as the case may be), though doing so introduces yet another approximation. Therefore, when running time and simplicity are of the essence, approximation (4.4) is preferable.
Approximation (4.4) can be made more exact by performing a few iterations of Newton’s method. To save runtime,  recommends only two-iterations of Newton’s method, which amounts to computing using (4.4), followed by
Approximation (4.5) was shown in  to be competitive in running time with the method of , and was seen to be overall more accurate. Approximating remains a topic of research interest, as can be seen from the recent works [21, 12].
4.2 Maximum-Likelihood estimation for Watson
Let be i.i.d. samples drawn from . We wish to estimate and by maximizing the log-likelihood
where are (normalized) eigenvectors of the scatter matrix corresponding to the eigenvalues .
To estimate the concentration parameter we must solve:222We need to ensure a unique m.l.e. for positive , and , for negative
The hard part is to solve (4.8). One could use a root-finding method (e.g. Newton-Raphson), but similar to the vMF case, an out-of-the-box root-finding approach can be unduly slow or numerically hard as data dimensionality increases. The authors of  consider the following more general equation:
and derive for it high-quality closed form numerical approximations. These approximations improve upon two previous approaches, that of  and . Bijral et al.  followed the continued-fraction approach of 
to obtain the heuristic approximation
Other heuristic approximations were presented by the author in .
The following theorem of  provides rigorously justified approximations, most of which are typically more accurate than previous heuristics.
5 Mixture models
Many times a single vMF or Watson distribution is insufficient to model data. In these cases, a richer model (e.g., for clustering, or as a generative model, etc.) such as a mixture model may be more useful. We summarize in this section mixtures of vMF (movMF) and mixtures of Watson (moW) distributions. The former was originally applied to high-dimensional text and gene expression data in , and since then it has been used in a large number of applications (see also Section 3). The latter has been applied to genetic data , as well as in other data mining applications .
Let denote either a vMF density or a Watson density. We consider mixture models of different vMF densities or different Watson densities. Thus, a given unit norm observation vector has the mixture density
Suppose we observe the set of i.i.d. samples drawn from (5.1). Our aim is to infer the mixture parameters , where , , , and for an movMF and for a moW.
5.1 EM algorithm
A standard, practical approach to estimating the mixture parameters is via the Expectation Maximization (EM) algorithm  applied to maximize the mixture log-likelihood for . Specifically, we seek to maximize
To apply EM, first we use Jensen’s inequality to compute the lower bound
Then, the E-Step sets to the posterior probability (for given component ):
With this choice of , the M-Step maximizes (5.3), which is essentially just a maximum-likelihood problem, to obtain parameter updates. In particular, we obtain
M-Step for movMF:
M-Step for moW:
where denotes the top eigenvector corresponding to eigenvalue of the weighted-scatter matrix
Hard Assignments. To speed up EM, we can replace can E-Step (5.4) by the standard hard-assignment rule:
The corresponding -Step also simplifies considerably. Such hard-assignments maximize a lower-bound on the incomplete log-likelihood and yield partitional-clustering algorithms.
Initialization. For movMF, typically an initialization using spherical kmeans (spkmeans)  can be used. The next section presents arguments that explain why this initialization is natural for movMF. Similarly, for moW, an initialization based on diametrical kmeans  can be used, though sometimes even an spkmeans initialization suffices .
5.2 Limiting versions
It is well-known that the famous k-means algorithm may be obtained as a limiting case of the EM algorithm applied to a mixture of Gaussians. Analogously, the spherical kmeans algorithm of  that clusters unit norm vectors and finds unit norm means (hence ‘spherical’) can be viewed as the limiting case of a movMF. Indeed, assume that the priors of all mixture components are equal. Furthermore, assume that all the mixture components have equal concentration parameters and let . Under these assumptions, the E-Step (5.9) reduces to assigning a point to the cluster nearest to it, which here is given by the cluster with whose centroid the given point has largest dot product. In other words, a point is assigned to cluster because and for in (5.9).
In a similar manner, the diametrical clustering algorithm of  also may be viewed as a limiting case of EM applied to a moW. Recall that the diametrical clustering algorithm groups together correlated and anti-correlated unit norm data vectors into the same cluster, i.e., it treats diametrically opposite points equivalently. Remarkably, it turns out that the diametrical clustering algorithm of  can be obtained as follows: Let , so that for each
, the corresponding posterior probabilities; the particular that tends to 1 is the one for which is maximized in the E-Step; subsequently the M-Step (5.7), (5.8) also simplifies, and yields the same updates as made by the diametrical clustering algorithm.
Alternatively, we can obtain diametrical clustering from the hard-assignment heuristic of EM applied to a moW where all mixture components have the same (positive) concentration parameter . Then, in the E-Step (5.9), we can ignore altogether, which reduces Alg. 1 to the diametrical clustering procedure.
5.3 Application: clustering using movMF
Mixtures of vMFs have been successfully used in text clustering; see  for a detailed overview. We recall below results of a two main experiments below: (i) simulated data; and (ii) Slashdot news articles.
The key characteristic of text data is its high dimensionality. And for modeling clusters of such data using a movMF, the approximate computation of the concentration parameter as discussed in Sec. 4.1 is of great importance: without this approximation, the computation breaks down due to floating point difficulties.
For (i), we simulate a mixture of 4 vMFs in with each, and draw a sample of 5000 data points. The clusters are chosen to be roughly of the same size and their relative mixing proportions are , with concentration parameters (to one digiit) , and random units vectors as means. This is the same data as the big-mix data in . We generated the samples using the vmfsamp code (available from the author upon request).
For (ii), we recall a part of the results of  on news articles from the Slashdot website. These articles are tagged and cleaned to retain 1000 articles that more clearly belong to a primary category / cluster. We report results on ‘Slash-7’ and ‘Slash-6’; the first contains 6714 articles in 7 primary categories: business, education, entertainment, games, music, science, and internet; while the second contains 5182 articles in 6 categories: biotech, Microsoft, privacy, Google, security, and space.
There are several ways to evaluate performance of a clustering method. For the simulated data, we know the true parameters from which the dataset was simulated, hence we can compare the error in estimated parameter values. For the Slashdot data, we have knowledge of “ground truth” labels, so we can use the normalized mutual information (NMI)  (a measure that was also previously used to assess movMF based clustering [5, 6]) as an external measure of cluster quality. Suppose the predicted cluster labels are and the true labels are , then the NMI between and is defined as
where denotes the usual mutual information and denotes the entropy . When the predicted labels agree perfectly with the true labels, then NMI equals 1; thus higher values of NMI are better.
Results on ‘bigsim’.
The results of the first experiment are drawn from , and are reported in Table 1. From the results it is clear that on this particular simulated data, EM manages to recover the true parameters to quite a high degree of accuracy. Part of this reason is due to the high values of the concentration parameter: as increases, the probability mass concentrates, which makes it easier to separate the clusters using EM. To compute the values in the table, we ran EM with soft-assignments and then after convergence used assignment (5.9).
Results on Slashdot.
These results are drawn from . The results here are reported merely as an illustration, and we refer the reader to  for more extensive results. We report performance of our implementation of Alg. 1 (EM for movMF) against Latent Dirichlet Allocation (LDA)  and a Exponential-family Dirichlet compound multinomial model (EDCM) .
Table 2 reports results of comparing Alg. 1 specialized for movMFs against LDA and EDCM. As can be seen from the results, the vMF mixture leads to much higher quality clustering than the other two competing approaches. We did not test an optimized implementation (and used our own Matlab implementation), but note anecdotally that the EM procedure was 3–5 times faster than the others.
5.4 Application: clustering using moW
Figure 1 shows a toy example of axial data. Here, the original data has two clusters (leftmost panel of Fig. 1). If we cluster this data into two clusters using Euclidean kmeans, we obtain the plot in the middle panel; clustering into 4 groups using Euclidean kmeans yields the rightmost panel. As is clear from the figure, Euclidean kmeans cannot discover the desired structure, if the true clusters are on axial data. The diametrical clustering algorithm of  discovers the two clusters (leftmost panel), which also shows the mean vectors for each cluster. Recall that as mentioned above, the diametrical clustering method is obtained as the limiting case of EM on moW.
We summarized a few distributions from directional statistics that are useful for modeling normalized data. We focused in particular on the von Mises-Fisher distribution (the “Gaussian” of the hypersphere) and the Watson distribution (for axially symmetric data). For both of these distributions, we recapped maximum likelihood parameter estimation as well as mixture modeling using the EM algorithm. For extensive numerical results on clustering using mixtures of vMFs, we refer the reader to the original paper ; similarly, for mixtures of Watsons please see . The latter paper also describes asymptotic estimates of the concentration parameter in detail.
Now directional distributions are widely used in machine learning (Sec. 3
provides some pointers to related work), and we hope the brief summary provided in this chapter helps promote wider understanding about these. In particular, we hope to see more exploration of directional models in the following important subareas: Bayesian models, Hidden Markov Models using directional models, and deep generative models.
-  M. Abramowitz and I. A. Stegun. Handbook of Mathematical Functions, With Formulas, Graphs, and Mathematical Tables. Dover, New York, June 1974.
-  D. E. Amos. Computation of modified Bessel functions and their ratios. Mathematics of Computation, 28(125):235–251, 1974.
-  D. Arthur and S. Vassilvitskii. k-means++: The advantages of careful seeding. In Proceedings of the eighteenth annual ACM-SIAM symposium on Discrete algorithms, pages 1027–1035. Society for Industrial and Applied Mathematics, 2007.
A. Banerjee and S. Basu.
Topic models over text streams: A study of batch and online unsupervised learning.In SDM, volume 7, pages 437–442. SIAM, 2007.
-  A. Banerjee, I. S. Dhillon, J. Ghosh, and S. Sra. Clustering on the Unit Hypersphere using von Mises-Fisher Distributions. JMLR, 6:1345–1382, Sep 2005.
-  A. Banerjee, I. S. Dhillon, J. Ghosh, and S. Sra. Text Clustering with Mixture of von Mises-Fisher Distributions. In A. N. Srivastava and M. Sahami, editors, Text Mining: Theory, Applications, and Visualization. CRC Press, 2009.
-  A. Bijral, M. Breitenbach, and G. Z. Grudic. Mixture of Watson Distributions: A Generative Model for Hyperspherical Embeddings. In Artificial Intelligence and Statistics (AISTATS 2007), J. Machine Learning Research - Proceedings Track 2, pages 35–42, 2007.
-  D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent dirichlet allocation. the Journal of machine Learning research, 3:993–1022, 2003.
R. P. Cabeen and D. H. Laidlaw.
White matter supervoxel segmentation by axial dp-means clustering.
Medical Computer Vision. Large Data in Medical Imaging, pages 95–104. Springer, 2014.
H. Cetingul and R. Vidal.
Intrinsic mean shift for clustering on stiefel and grassmann
Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pages 1896–1902. IEEE, 2009.
-  Y. Chikuse. Statistics on special manifolds, volume 174. Springer Science & Business Media, 2012.
-  D. Christie. Efficient von Mises-Fisher concentration parameter estimation using Taylor series. Journal of Statistical Computation and Simulation, 85(16):1–7, 2015.
-  T. Cover and J. Thomas. Elements of Information Theory. John Wiley & Sons, New York, USA, 1991.
-  A. Dempster, N. Laird, and D. Rubin. Maximum Likelihood from Incomplete Data Via the EM Algorithm. Journal of the Royal Statistical Society, 39, 1977.
-  I. S. Dhillon, E. M. Marcotte, and U. Roshan. Diametrical clustering for identifying anti-correlated gene clusters. Bioinformatics, 19(13):1612–1619, 2003.
-  I. S. Dhillon and D. S. Modha. Concept decompositions for large sparse text data using clustering. Machine Learning, 42(1):143–175, January 2001.
-  C. Elkan. Clustering documents with an exponential-family approximation of the dirichlet compound multinomial distribution. In Proceedings of the 23rd international conference on Machine learning, pages 289–296. ACM, 2006.
-  A. Erdélyi, W. Magnus, F. Oberhettinger, and F. G. Tricomi. Higher transcendental functions, volume 1. McGraw Hill, 1953.
-  Y. Fu, S. Yan, and T. S. Huang. Correlation metric for generalized feature extraction. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 30(12):2229–2235, 2008.
-  M. A. Hasnat, O. Alata, and A. Trémeau. Unsupervised clustering of depth images using watson mixture model. In Pattern Recognition (ICPR), 2014 22nd International Conference on, pages 214–219. IEEE, 2014.
-  K. Hornik and B. Grün. On maximum likelihood estimation of the concentration parameter of von mises–fisher distributions. Computational statistics, 29(5):945–957, 2014.
-  R. Hosseini. Natural Image Modelling using Mixture Models with compression as an application. PhD thesis, Berlin, Technische Universtität Berlin, Diss., 2012, 2012.
-  P. Kasarapu and L. Allison. Minimum message length estimation of mixtures of multivariate Gaussian and von Mises-Fisher distributions. Machine Learning, pages 1–46, 2015.
-  D. Lashkari, E. Vul, N. Kanwisher, and P. Golland. Discovering structure in the space of fmri selectivity profiles. Neuroimage, 50(3):1085–1098, 2010.
-  K. Mammasis, R. W. Stewart, and J. S. Thompson. Spatial fading correlation model using mixtures of von mises fisher distributions. Wireless Communications, IEEE Transactions on, 8(4):2046–2055, 2009.
-  K. V. Mardia. Statistical Distributions in Scientific Work, volume 3, chapter Characteristics of directional distributions, pages 365–385. Reidel, Dordrecht, 1975.
-  K. V. Mardia. Statistics of directional data. J. Royal Statistical Society. Series B (Methodological), 37(3):349–393, 1975.
-  K. V. Mardia and P. Jupp. Directional Statistics. John Wiley and Sons Ltd., second edition, 2000.
-  M. Mash’al and R. Hosseini. K-means++ for Mixtures of von Mises-Fisher Distributions. In Information and Knowledge Technology (IKT), 2015 7th Conference on, pages 1–6. IEEE, 2015.
-  T. McGraw, B. C. Vemuri, B. Yezierski, and T. Mareci. Von mises-fisher mixture model of the diffusion odf. In Biomedical Imaging: Nano to Macro, 2006. 3rd IEEE International Symposium on, pages 65–68. IEEE, 2006.
-  R. J. Muirhead. Aspects of multivariate statistical theory. John Wiley, 1982.
-  C. R. Rao. Linear Statistical Inference and its Applications. Wiley, New York, 2nd edition, 1973.
-  E. Rasmussen. Clustering algorithms. In W. Frakes and R. Baeza-Yates, editors, Information Retrieval: Data Structures and Algorithms, pages 419–442. Prentice Hall, New Jersey, 1992.
-  J. Reisinger, A. Waters, B. Silverthorn, and R. J. Mooney. Spherical topic models. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pages 903–910, 2010.
-  S. Ryali, T. Chen, K. Supekar, and V. Menon. A parcellation scheme based on von mises-fisher distributions and markov random fields for segmenting brain regions using resting-state fmri. NeuroImage, 65:83–96, 2013.
-  G. Salton. Automatic text processing: the transformation, analysis, and retrieval of information by computer. Addison-Wesley (Reading MA), 1989.
-  G. Salton and M. J. McGill. Introduction to Modern Retrieval. McGraw-Hill Book Company, 1983.
-  S. Sra. Matrix Nearness Problems in Data Mining. PhD thesis, Univ. of Texas at Austin, 2007.
-  S. Sra. A short note on parameter approximation for von Mises-Fisher distributions: and a fast implementation of . Computational Statistics, Apr. 2009. Accepted.
S. Sra and D. Karp.
The multivariate Watson distribution: Maximum-likelihood estimation
and other aspects.
Journal of Multivariate Analysis (JMVA), 114:256–269, 2013.
-  J. Straub, J. Chang, O. Freifeld, and J. W. Fisher III. A dirichlet process mixture model for spherical data. In Proceedings of the Eighteenth International Conference on Artificial Intelligence and Statistics, pages 930–938, 2015.
-  A. Strehl and J. Ghosh. Cluster Ensembles - A Knowledge Reuse Framework for Combining Multiple Partitions. Journal of Machine Learning Research, 3:583–617, 2002.
-  A. Tanabe, K. Fukumizu, S. Oba, T. Takenouchi, and S. Ishii. Parameter estimation for von Mises-Fisher distributions. Computational Statistics, 22(1):145–157, 2007.
-  H. Tang, S. M. Chu, and T. S. Huang. Generative model-based speaker clustering via mixture of von mises-fisher distributions. In Acoustics, Speech and Signal Processing, 2009. ICASSP 2009. IEEE International Conference on, pages 4101–4104. IEEE, 2009.
-  D. E. Tyler. Statistical analysis for the angular central gaussian distribution on the sphere. Biometrika, 74(3):579–589, 1987.
-  G. S. Watson. The statistics of orientation data. The Journal of Geology, pages 786–797, 1966.