Classification is the process of finding and analyzing underlying group structure in heterogeneous data. This problem can be framed as the search for class labels of unlabelled observations. In general, some (non-trivial) proportion of observations have known labels. A special case of classification, known as clustering, occurs when none of the observations have known labels. One common approach for clustering is mixture model-based clustering, which makes use of a finite mixture model. In general, a $G$-component finite mixture model assumes that a multivariate random variable $\mathbf{X}$ has density
\[
f(\mathbf{x} \mid \boldsymbol{\vartheta}) = \sum_{g=1}^{G} \pi_g f_g(\mathbf{x} \mid \boldsymbol{\theta}_g), \tag{1}
\]
where $\boldsymbol{\vartheta} = (\pi_1, \ldots, \pi_G, \boldsymbol{\theta}_1, \ldots, \boldsymbol{\theta}_G)$, $f_g(\cdot)$ is the $g$th component density, and $\pi_g > 0$ is the $g$th mixing proportion such that $\sum_{g=1}^{G} \pi_g = 1$. Note that the notation used in (1) corresponds to the multivariate case and, save for Appendix A, $\mathscr{X}$ will hereafter represent a matrix variate random variable with $\mathbf{X}$ denoting its realization.
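As a minimal illustrative sketch of the mixture density in (1), consider a two-component univariate Gaussian mixture; the parameter values below are arbitrary and chosen only for illustration:

```python
import numpy as np
from scipy.stats import norm

def mixture_density(x, pis, mus, sds):
    # Density of a G-component univariate Gaussian mixture:
    # f(x) = sum_g pi_g * phi(x | mu_g, sd_g)
    return sum(pi * norm.pdf(x, mu, sd) for pi, mu, sd in zip(pis, mus, sds))

# Mixing proportions must be positive and sum to one.
pis = [0.3, 0.7]
mus = [-2.0, 3.0]
sds = [1.0, 1.5]

fx = mixture_density(0.5, pis, mus, sds)  # density at a single point
```

Evaluating the density on a fine grid and integrating numerically recovers total mass one, as required of any mixture of proper densities.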
McNicholas (2016) traces the relationship between mixture models and clustering to Tiedeman (1955), who uses a component of a mixture model to define a cluster. The mixture model was first used for clustering by Wolfe (1965)
who considered a Gaussian mixture model. Other early uses of Gaussian mixture models for clustering can be found in Baum et al. (1970) and Scott & Symons (1971). Although the Gaussian mixture model is attractive due to its mathematical properties, it is problematic when dealing with outliers and asymmetry in the data, and thus there has been interest in non-Gaussian mixtures for the multivariate case. Some examples of mixtures of symmetric distributions that parameterize tail weight include the $t$ distribution (Peel & McLachlan 2000, Andrews & McNicholas 2011, 2012, Lin et al. 2014) and the power exponential distribution (Dang et al. 2015). There has also been work in the area of skewed distributions such as the skew-$t$ distribution (Lin 2010, Vrbik & McNicholas 2012, 2014, Lee & McLachlan 2014, Murray, Browne & McNicholas 2014, Murray, McNicholas & Browne 2014), the normal-inverse Gaussian (NIG) distribution (Karlis & Santourian 2009), the shifted asymmetric Laplace (SAL) distribution (Morris & McNicholas 2013, Franczak et al. 2014), the variance-gamma distribution (McNicholas et al. 2017), the generalized hyperbolic distribution (Browne & McNicholas 2015), the hidden truncation hyperbolic distribution (Murray et al. 2017a), and the joint generalized hyperbolic distribution (Tang et al. 2018).
There has also been an increased interest in model-based clustering of matrix variate data such as multivariate longitudinal data and images. Such examples include the work of Viroli (2011) and Anderlucci et al. (2015)
, who consider mixtures of matrix variate normal distributions for clustering. More recently, Gallaugher & McNicholas (2018a) looked at mixtures of four skewed matrix variate distributions, namely the matrix variate skew-$t$, generalized hyperbolic, variance-gamma, and normal-inverse Gaussian (NIG) distributions, and considered classification of greyscale Arabic numerals. Melnykov & Zhu (2018) also considered modelling skewness by means of transformations.
The main problem with all of the aforementioned methods, for both the multivariate and matrix variate cases, arises when the dimensionality of the data increases. Although the problem of dealing with high-dimensional data has been thoroughly addressed in the case of multivariate data, there is a relative paucity of work for matrix variate data. In the matrix variate case, matrix variate bilinear probabilistic principal component analysis was developed by Zhao et al. (2012). More recently, Gallaugher & McNicholas (2018b) considered the closely-related mixture of matrix variate bilinear factor analyzers (MMVBFA) model for clustering. The MMVBFA model can be viewed as a matrix variate generalization of the mixture of factor analyzers model (Ghahramani & Hinton 1997) in the multivariate case. Although these methods allow for simultaneous dimension reduction and clustering, they both assume matrix variate normality, which is not sensible if cluster skewness or heavy tails are present. Herein, we present an extension of the MMVBFA model to skewed distributions, specifically the matrix variate skew-$t$, generalized hyperbolic, variance-gamma, and NIG distributions.
2.1 Generalized Inverse Gaussian Distribution
The generalized inverse Gaussian distribution has two different parameterizations, both of which will be utilized herein. A random variable $Y$ has a generalized inverse Gaussian (GIG) distribution parameterized by $a, b > 0$ and $\lambda \in \mathbb{R}$, denoted by $Y \sim \text{GIG}(a, b, \lambda)$, if its probability density function can be written as
\[
f(y \mid a, b, \lambda) = \frac{(a/b)^{\lambda/2} y^{\lambda - 1}}{2 K_{\lambda}(\sqrt{ab})} \exp\left\{-\frac{a y + b/y}{2}\right\},
\]
for $y > 0$, where
\[
K_{\lambda}(u) = \frac{1}{2} \int_{0}^{\infty} z^{\lambda - 1} \exp\left\{-\frac{u}{2}\left(z + \frac{1}{z}\right)\right\} dz
\]
is the modified Bessel function of the third kind with index $\lambda$. Expectations of some functions of a GIG random variable have a mathematically tractable form, e.g.:
\[
\mathbb{E}(Y) = \sqrt{\frac{b}{a}}\, \frac{K_{\lambda+1}(\sqrt{ab})}{K_{\lambda}(\sqrt{ab})},
\]
\[
\mathbb{E}\left(\frac{1}{Y}\right) = \sqrt{\frac{a}{b}}\, \frac{K_{\lambda+1}(\sqrt{ab})}{K_{\lambda}(\sqrt{ab})} - \frac{2\lambda}{b},
\]
\[
\mathbb{E}(\log Y) = \log\sqrt{\frac{b}{a}} + \frac{1}{K_{\lambda}(\sqrt{ab})}\, \frac{\partial}{\partial \lambda} K_{\lambda}(\sqrt{ab}).
\]
Although this parameterization of the GIG distribution will be useful for parameter estimation, the alternative parameterization, given by
\[
f(y \mid \omega, \eta, \lambda) = \frac{(y/\eta)^{\lambda - 1}}{2 \eta K_{\lambda}(\omega)} \exp\left\{-\frac{\omega}{2}\left(\frac{y}{\eta} + \frac{\eta}{y}\right)\right\},
\]
where $\omega = \sqrt{ab}$ and $\eta = \sqrt{b/a}$, and denoted by $Y \sim \text{I}(\omega, \eta, \lambda)$, will be useful when discussing the generalized hyperbolic distribution.
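The tractable GIG expectations are straightforward to verify numerically. The sketch below (an illustration with arbitrary parameter values, using `scipy.special.kv` for the modified Bessel function of the third kind) checks the closed form for $\mathbb{E}(Y)$ against direct numerical integration:

```python
import numpy as np
from scipy.special import kv
from scipy.integrate import quad

def gig_pdf(y, a, b, lam):
    # GIG density with parameters a, b > 0 and index lam
    return ((a / b) ** (lam / 2) * y ** (lam - 1)
            / (2 * kv(lam, np.sqrt(a * b)))
            * np.exp(-(a * y + b / y) / 2))

def gig_mean(a, b, lam):
    # Closed form: E(Y) = sqrt(b/a) K_{lam+1}(sqrt(ab)) / K_lam(sqrt(ab))
    s = np.sqrt(a * b)
    return np.sqrt(b / a) * kv(lam + 1, s) / kv(lam, s)

a, b, lam = 2.0, 3.0, 0.5  # arbitrary illustrative values
num_mean, _ = quad(lambda y: y * gig_pdf(y, a, b, lam), 0, np.inf)
```

The Bessel-function ratio and the numerical integral agree to quadrature precision, which is why these moments can be plugged directly into the E-step of the algorithm described later.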
2.2 Matrix Variate Distributions
As in the multivariate case, the most mathematically tractable matrix variate distribution is the matrix variate normal. An $n \times p$ random matrix $\mathscr{X}$ follows an $n \times p$ matrix variate normal distribution with location matrix $\mathbf{M}$ and scale matrices $\boldsymbol{\Sigma}$ and $\boldsymbol{\Psi}$, of dimensions $n \times n$ and $p \times p$, respectively, denoted by $\mathscr{X} \sim \mathcal{N}_{n \times p}(\mathbf{M}, \boldsymbol{\Sigma}, \boldsymbol{\Psi})$, if the density of $\mathscr{X}$ is
\[
f(\mathbf{X} \mid \mathbf{M}, \boldsymbol{\Sigma}, \boldsymbol{\Psi}) = \frac{1}{(2\pi)^{np/2} |\boldsymbol{\Sigma}|^{p/2} |\boldsymbol{\Psi}|^{n/2}} \exp\left\{-\frac{1}{2} \operatorname{tr}\left(\boldsymbol{\Sigma}^{-1} (\mathbf{X} - \mathbf{M}) \boldsymbol{\Psi}^{-1} (\mathbf{X} - \mathbf{M})'\right)\right\}.
\]
The matrix variate normal distribution is related to the multivariate normal distribution, as discussed in Harrar & Gupta (2008), via
\[
\mathscr{X} \sim \mathcal{N}_{n \times p}(\mathbf{M}, \boldsymbol{\Sigma}, \boldsymbol{\Psi}) \iff \operatorname{vec}(\mathscr{X}) \sim \mathcal{N}_{np}\big(\operatorname{vec}(\mathbf{M}), \boldsymbol{\Psi} \otimes \boldsymbol{\Sigma}\big),
\]
where $\mathcal{N}_{np}(\cdot)$ is the multivariate normal distribution with dimension $np$, $\operatorname{vec}(\cdot)$ is the vectorization operator, and $\otimes$ denotes the Kronecker product. Although the matrix variate normal distribution is popular, there are other well known examples of matrix variate distributions. For example, the Wishart distribution (Wishart 1928) is the distribution of the sample covariance matrix from a multivariate normal sample. There are also a few formulations of a matrix variate skew normal distribution (Chen & Gupta 2005, Domínguez-Molina et al. 2007, Harrar & Gupta 2008).
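The equivalence between the matrix variate normal density and the multivariate normal density of the vectorized matrix can be checked numerically. The following sketch (all parameter values arbitrary) evaluates both forms at the same point:

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(1)
n, p = 3, 2
M = rng.normal(size=(n, p))
A1 = rng.normal(size=(n, n)); Sigma = A1 @ A1.T + n * np.eye(n)  # n x n scale
A2 = rng.normal(size=(p, p)); Psi = A2 @ A2.T + p * np.eye(p)    # p x p scale
X = rng.normal(size=(n, p))  # arbitrary evaluation point

def matnorm_pdf(X, M, Sigma, Psi):
    # Matrix variate normal density via the trace form
    n, p = X.shape
    R = X - M
    quad_form = np.trace(np.linalg.solve(Sigma, R) @ np.linalg.solve(Psi, R.T))
    const = ((2 * np.pi) ** (n * p / 2)
             * np.linalg.det(Sigma) ** (p / 2)
             * np.linalg.det(Psi) ** (n / 2))
    return np.exp(-quad_form / 2) / const

# vec(X) ~ N(vec(M), Psi kron Sigma), with column-stacking vec
vec = lambda B: B.flatten(order="F")
lhs = matnorm_pdf(X, M, Sigma, Psi)
rhs = multivariate_normal.pdf(vec(X), vec(M), np.kron(Psi, Sigma))
```

Note the order of the Kronecker product, $\boldsymbol{\Psi} \otimes \boldsymbol{\Sigma}$, which matches column-stacking vectorization.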
More recently, Gallaugher & McNicholas (2017, 2019) derived a total of four skewed matrix variate distributions using a variance-mean matrix variate mixture approach. This assumes that a random matrix $\mathscr{X}$ can be written as
\[
\mathscr{X} = \mathbf{M} + W \mathbf{A} + \sqrt{W}\, \mathscr{V}, \tag{7}
\]
where $\mathbf{M}$ and $\mathbf{A}$ are $n \times p$ matrices representing the location and skewness, respectively, $\mathscr{V} \sim \mathcal{N}_{n \times p}(\mathbf{0}, \boldsymbol{\Sigma}, \boldsymbol{\Psi})$, and $W > 0$ is a random variable with density $h(w \mid \boldsymbol{\theta})$. Gallaugher & McNicholas (2017) show that the matrix variate skew-$t$ distribution, with $\nu$ degrees of freedom, arises from (7) with $W \sim \text{IGamma}(\nu/2, \nu/2)$, where $\text{IGamma}(a, b)$ denotes the inverse-gamma distribution with density
\[
f(w \mid a, b) = \frac{b^{a}}{\Gamma(a)} w^{-a-1} \exp\left\{-\frac{b}{w}\right\},
\]
for $w > 0$ and $a, b > 0$. The resulting density of $\mathscr{X}$ is
\[
f(\mathbf{X} \mid \boldsymbol{\vartheta}) = \frac{2 \left(\frac{\nu}{2}\right)^{\nu/2} \exp\left\{\operatorname{tr}\left(\boldsymbol{\Sigma}^{-1} (\mathbf{X} - \mathbf{M}) \boldsymbol{\Psi}^{-1} \mathbf{A}'\right)\right\}}{(2\pi)^{np/2} |\boldsymbol{\Sigma}|^{p/2} |\boldsymbol{\Psi}|^{n/2} \Gamma\left(\frac{\nu}{2}\right)} \left(\frac{\delta(\mathbf{X}; \mathbf{M}, \boldsymbol{\Sigma}, \boldsymbol{\Psi}) + \nu}{\rho(\mathbf{A}, \boldsymbol{\Sigma}, \boldsymbol{\Psi})}\right)^{-\frac{\nu + np}{4}} K_{-\frac{\nu + np}{2}}\left(\sqrt{\rho(\mathbf{A}, \boldsymbol{\Sigma}, \boldsymbol{\Psi}) \left[\delta(\mathbf{X}; \mathbf{M}, \boldsymbol{\Sigma}, \boldsymbol{\Psi}) + \nu\right]}\right),
\]
where $\delta(\mathbf{X}; \mathbf{M}, \boldsymbol{\Sigma}, \boldsymbol{\Psi}) = \operatorname{tr}\left(\boldsymbol{\Sigma}^{-1} (\mathbf{X} - \mathbf{M}) \boldsymbol{\Psi}^{-1} (\mathbf{X} - \mathbf{M})'\right)$ and $\rho(\mathbf{A}, \boldsymbol{\Sigma}, \boldsymbol{\Psi}) = \operatorname{tr}\left(\boldsymbol{\Sigma}^{-1} \mathbf{A} \boldsymbol{\Psi}^{-1} \mathbf{A}'\right)$. For notational clarity, this distribution will be denoted by $\mathscr{X} \sim \text{MVST}_{n \times p}(\mathbf{M}, \mathbf{A}, \boldsymbol{\Sigma}, \boldsymbol{\Psi}, \nu)$.
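The variance-mean mixture representation (7) yields a direct simulation scheme for any of the four distributions. The sketch below (arbitrary illustrative parameters) draws from the matrix variate skew-$t$ case by mixing a matrix normal draw over an inverse-gamma weight, then checks the implied mean $\mathbf{M} + \mathbb{E}(W)\mathbf{A}$, with $\mathbb{E}(W) = \nu/(\nu - 2)$ for an inverse-gamma$(\nu/2, \nu/2)$ variable:

```python
import numpy as np

rng = np.random.default_rng(7)
n, p, nu = 2, 3, 8.0
M = np.zeros((n, p))
A = np.array([[1.0, 0.0, -1.0],
              [0.5, 0.5, 0.5]])   # skewness matrix (arbitrary)
Sigma = np.eye(n)
Psi = np.eye(p)

def rmatnorm(M, Sigma, Psi, rng):
    # Draw from N_{n x p}(M, Sigma, Psi) via Cholesky factors
    n, p = M.shape
    Z = rng.normal(size=(n, p))
    return M + np.linalg.cholesky(Sigma) @ Z @ np.linalg.cholesky(Psi).T

def rmatskewt(M, A, Sigma, Psi, nu, rng):
    # W ~ inverse-gamma(nu/2, nu/2); X = M + W*A + sqrt(W)*V, per (7)
    W = 1.0 / rng.gamma(shape=nu / 2, scale=2.0 / nu)
    V = rmatnorm(np.zeros_like(M), Sigma, Psi, rng)
    return M + W * A + np.sqrt(W) * V

draws = np.stack([rmatskewt(M, A, Sigma, Psi, nu, rng) for _ in range(20000)])
```

Replacing the inverse-gamma draw for $W$ with a GIG, gamma, or inverse-Gaussian draw gives the generalized hyperbolic, variance-gamma, and NIG cases, respectively.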
In Gallaugher & McNicholas (2019), one of the distributions considered is a matrix variate generalized hyperbolic distribution. This again arises from (7), now with $W \sim \text{I}(\omega, 1, \lambda)$. This distribution will be denoted by $\mathscr{X} \sim \text{MVGH}_{n \times p}(\mathbf{M}, \mathbf{A}, \boldsymbol{\Sigma}, \boldsymbol{\Psi}, \lambda, \omega)$, and the density is
\[
f(\mathbf{X} \mid \boldsymbol{\vartheta}) = \frac{\exp\left\{\operatorname{tr}\left(\boldsymbol{\Sigma}^{-1} (\mathbf{X} - \mathbf{M}) \boldsymbol{\Psi}^{-1} \mathbf{A}'\right)\right\}}{(2\pi)^{np/2} |\boldsymbol{\Sigma}|^{p/2} |\boldsymbol{\Psi}|^{n/2} K_{\lambda}(\omega)} \left(\frac{\delta(\mathbf{X}; \mathbf{M}, \boldsymbol{\Sigma}, \boldsymbol{\Psi}) + \omega}{\rho(\mathbf{A}, \boldsymbol{\Sigma}, \boldsymbol{\Psi}) + \omega}\right)^{\frac{1}{2}\left(\lambda - \frac{np}{2}\right)} K_{\lambda - \frac{np}{2}}\left(\sqrt{\left[\rho(\mathbf{A}, \boldsymbol{\Sigma}, \boldsymbol{\Psi}) + \omega\right] \left[\delta(\mathbf{X}; \mathbf{M}, \boldsymbol{\Sigma}, \boldsymbol{\Psi}) + \omega\right]}\right),
\]
where $\delta(\mathbf{X}; \mathbf{M}, \boldsymbol{\Sigma}, \boldsymbol{\Psi})$ and $\rho(\mathbf{A}, \boldsymbol{\Sigma}, \boldsymbol{\Psi})$ are as defined previously.
The matrix variate variance-gamma distribution arises from (7) with $W \sim \text{gamma}(\gamma, \gamma)$, where $\text{gamma}(a, b)$ denotes the gamma distribution with density
\[
f(w \mid a, b) = \frac{b^{a}}{\Gamma(a)} w^{a - 1} \exp\{-b w\},
\]
for $w > 0$ and $a, b > 0$. The density of the random matrix $\mathscr{X}$ with this distribution is
\[
f(\mathbf{X} \mid \boldsymbol{\vartheta}) = \frac{2 \gamma^{\gamma} \exp\left\{\operatorname{tr}\left(\boldsymbol{\Sigma}^{-1} (\mathbf{X} - \mathbf{M}) \boldsymbol{\Psi}^{-1} \mathbf{A}'\right)\right\}}{(2\pi)^{np/2} |\boldsymbol{\Sigma}|^{p/2} |\boldsymbol{\Psi}|^{n/2} \Gamma(\gamma)} \left(\frac{\delta(\mathbf{X}; \mathbf{M}, \boldsymbol{\Sigma}, \boldsymbol{\Psi})}{\rho(\mathbf{A}, \boldsymbol{\Sigma}, \boldsymbol{\Psi}) + 2\gamma}\right)^{\frac{1}{2}\left(\gamma - \frac{np}{2}\right)} K_{\gamma - \frac{np}{2}}\left(\sqrt{\left[\rho(\mathbf{A}, \boldsymbol{\Sigma}, \boldsymbol{\Psi}) + 2\gamma\right] \delta(\mathbf{X}; \mathbf{M}, \boldsymbol{\Sigma}, \boldsymbol{\Psi})}\right),
\]
where $\delta(\mathbf{X}; \mathbf{M}, \boldsymbol{\Sigma}, \boldsymbol{\Psi})$ and $\rho(\mathbf{A}, \boldsymbol{\Sigma}, \boldsymbol{\Psi})$ are as defined previously. This distribution is denoted by $\mathscr{X} \sim \text{MVVG}_{n \times p}(\mathbf{M}, \mathbf{A}, \boldsymbol{\Sigma}, \boldsymbol{\Psi}, \gamma)$.
Finally, the matrix variate NIG distribution arises when $W \sim \text{IG}(1, \tilde{\kappa})$, where $\text{IG}(\tilde{\delta}, \tilde{\gamma})$ denotes the inverse-Gaussian distribution with density
\[
f(w \mid \tilde{\delta}, \tilde{\gamma}) = \frac{\tilde{\delta} \exp\{\tilde{\delta} \tilde{\gamma}\}}{\sqrt{2\pi}} w^{-3/2} \exp\left\{-\frac{1}{2}\left(\frac{\tilde{\delta}^{2}}{w} + \tilde{\gamma}^{2} w\right)\right\},
\]
for $w > 0$ and $\tilde{\delta}, \tilde{\gamma} > 0$. The density of $\mathscr{X}$ is
\[
f(\mathbf{X} \mid \boldsymbol{\vartheta}) = \frac{2 \exp\left\{\operatorname{tr}\left(\boldsymbol{\Sigma}^{-1} (\mathbf{X} - \mathbf{M}) \boldsymbol{\Psi}^{-1} \mathbf{A}'\right) + \tilde{\kappa}\right\}}{(2\pi)^{(np+1)/2} |\boldsymbol{\Sigma}|^{p/2} |\boldsymbol{\Psi}|^{n/2}} \left(\frac{\delta(\mathbf{X}; \mathbf{M}, \boldsymbol{\Sigma}, \boldsymbol{\Psi}) + 1}{\rho(\mathbf{A}, \boldsymbol{\Sigma}, \boldsymbol{\Psi}) + \tilde{\kappa}^{2}}\right)^{-\frac{1 + np}{4}} K_{-\frac{1 + np}{2}}\left(\sqrt{\left[\rho(\mathbf{A}, \boldsymbol{\Sigma}, \boldsymbol{\Psi}) + \tilde{\kappa}^{2}\right] \left[\delta(\mathbf{X}; \mathbf{M}, \boldsymbol{\Sigma}, \boldsymbol{\Psi}) + 1\right]}\right),
\]
where $\delta(\mathbf{X}; \mathbf{M}, \boldsymbol{\Sigma}, \boldsymbol{\Psi})$ and $\rho(\mathbf{A}, \boldsymbol{\Sigma}, \boldsymbol{\Psi})$ are as defined previously. This distribution is denoted by $\mathscr{X} \sim \text{MVNIG}_{n \times p}(\mathbf{M}, \mathbf{A}, \boldsymbol{\Sigma}, \boldsymbol{\Psi}, \tilde{\kappa})$.
2.3 Matrix Variate Factor Analysis
Readers who may benefit from the context provided by the mixture of factor analyzers model should consult Appendix A. Xie et al. (2008) and Yu et al. (2008) consider a matrix variate extension of probabilistic principal components analysis (PPCA) and assume an $n \times p$ random matrix $\mathscr{X}$ can be written
\[
\mathscr{X} = \mathbf{M} + \boldsymbol{\Lambda} \mathscr{U} \boldsymbol{\Delta}' + \mathscr{E}, \tag{8}
\]
where $\mathbf{M}$ is an $n \times p$ location matrix, $\boldsymbol{\Lambda}$ is an $n \times q$ matrix of column factor loadings, $\boldsymbol{\Delta}$ is a $p \times r$ matrix of row factor loadings, $\mathscr{U} \sim \mathcal{N}_{q \times r}(\mathbf{0}, \mathbf{I}_q, \mathbf{I}_r)$, and $\mathscr{E} \sim \mathcal{N}_{n \times p}(\mathbf{0}, \sigma \mathbf{I}_n, \sigma \mathbf{I}_p)$. It is assumed that $\mathscr{U}$ and $\mathscr{E}$ are independent of each other. The main disadvantage of this model is that, in general, $\mathscr{X}$ does not follow a matrix variate normal distribution.
Zhao et al. (2012) present bilinear probabilistic principal component analysis (BPPCA), which extends (8) by adding two projected error terms. The resulting model assumes $\mathscr{X}$ can be written
\[
\mathscr{X} = \mathbf{M} + \boldsymbol{\Lambda} \mathscr{U} \boldsymbol{\Delta}' + \boldsymbol{\Lambda} \mathscr{E}^{B} + \mathscr{E}^{A} \boldsymbol{\Delta}' + \mathscr{E},
\]
where $\mathscr{U}$ is the same as in (8), $\mathscr{E}^{B} \sim \mathcal{N}_{q \times p}(\mathbf{0}, \mathbf{I}_q, \sigma_B^2 \mathbf{I}_p)$, and $\mathscr{E}^{A} \sim \mathcal{N}_{n \times r}(\mathbf{0}, \sigma_A^2 \mathbf{I}_n, \mathbf{I}_r)$. In this model, it is assumed that $\mathscr{U}$, $\mathscr{E}^{A}$, $\mathscr{E}^{B}$, and $\mathscr{E}$ are all independent of each other. Gallaugher & McNicholas (2018b) further extend this to matrix variate factor analysis and consider a mixture of matrix variate bilinear factor analyzers (MMVBFA) model. For MMVBFA, Gallaugher & McNicholas (2018b) altered BPPCA by removing the isotropic constraints so that $\mathscr{E}^{B} \sim \mathcal{N}_{q \times p}(\mathbf{0}, \mathbf{I}_q, \boldsymbol{\Psi})$, $\mathscr{E}^{A} \sim \mathcal{N}_{n \times r}(\mathbf{0}, \boldsymbol{\Sigma}, \mathbf{I}_r)$, and $\mathscr{E} \sim \mathcal{N}_{n \times p}(\mathbf{0}, \boldsymbol{\Sigma}, \boldsymbol{\Psi})$, where $\boldsymbol{\Sigma} = \operatorname{diag}(\sigma_1, \ldots, \sigma_n)$, with $\sigma_j > 0$, and $\boldsymbol{\Psi} = \operatorname{diag}(\psi_1, \ldots, \psi_p)$, with $\psi_j > 0$. With these slight modifications, it can be shown that $\mathscr{X} \sim \mathcal{N}_{n \times p}(\mathbf{M}, \boldsymbol{\Lambda}\boldsymbol{\Lambda}' + \boldsymbol{\Sigma}, \boldsymbol{\Delta}\boldsymbol{\Delta}' + \boldsymbol{\Psi})$, similarly to its multivariate counterpart (Appendix A).
It is important to note that the term "column factors" refers to a reduction in the dimension of the columns, which is equal to the number of rows, and not a reduction in the number of columns. Likewise, the term "row factors" refers to a reduction in the dimension of the rows (the number of columns). As discussed by Zhao et al. (2012), the terms $\boldsymbol{\Lambda}\mathscr{E}^{B}$ and $\mathscr{E}^{A}\boldsymbol{\Delta}'$ can be interpreted as the row and column noise, respectively, whereas the final term $\mathscr{E}$ is the common noise.
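The claim that the bilinear factor model yields $\mathscr{X} \sim \mathcal{N}_{n \times p}(\mathbf{M}, \boldsymbol{\Lambda}\boldsymbol{\Lambda}' + \boldsymbol{\Sigma}, \boldsymbol{\Delta}\boldsymbol{\Delta}' + \boldsymbol{\Psi})$ can be checked by Monte Carlo; the sketch below (arbitrary parameter values, with loadings scaled only for numerical convenience) simulates the four independent terms and compares the empirical covariance of the vectorized draws with the Kronecker-structured target:

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, q, r = 3, 2, 2, 1
Lam = rng.normal(size=(n, q)) / np.sqrt(q)     # column factor loadings
Delta = rng.normal(size=(p, r)) / np.sqrt(r)   # row factor loadings
Sigma = np.diag(rng.uniform(0.5, 1.5, size=n)) # diagonal row-noise scale
Psi = np.diag(rng.uniform(0.5, 1.5, size=p))   # diagonal column-noise scale

def draw(rng):
    U  = rng.normal(size=(q, r))                                  # N(0, I_q, I_r)
    EB = rng.normal(size=(q, p)) @ np.sqrt(Psi)                   # N(0, I_q, Psi)
    EA = np.sqrt(Sigma) @ rng.normal(size=(n, r))                 # N(0, Sigma, I_r)
    E  = np.sqrt(Sigma) @ rng.normal(size=(n, p)) @ np.sqrt(Psi)  # N(0, Sigma, Psi)
    return Lam @ U @ Delta.T + Lam @ EB + EA @ Delta.T + E

samples = np.stack([draw(rng).flatten(order="F") for _ in range(60000)])
emp_cov = np.cov(samples, rowvar=False)
theory = np.kron(Delta @ Delta.T + Psi, Lam @ Lam.T + Sigma)
```

The four Kronecker covariance contributions sum to $(\boldsymbol{\Delta}\boldsymbol{\Delta}' + \boldsymbol{\Psi}) \otimes (\boldsymbol{\Lambda}\boldsymbol{\Lambda}' + \boldsymbol{\Sigma})$, so the empirical and theoretical covariances agree up to Monte Carlo error.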
3 Mixture of Skewed Matrix Variate Bilinear Factor Analyzers
3.1 Model Specification
We now consider a mixture of skewed bilinear factor analyzers according to one of the four skewed distributions discussed previously. Each random matrix $\mathscr{X}_i$ from a random sample distributed according to one of the four distributions can be written as
\[
\mathscr{X}_i = \mathbf{M}_g + W_{ig} \mathbf{A}_g + \sqrt{W_{ig}}\, \mathscr{V}_{ig}
\]
with probability $\pi_g$, for $i = 1, \ldots, N$ and $g = 1, \ldots, G$, where $\mathbf{M}_g$ is the location of the $g$th component, $\mathbf{A}_g$ is the skewness, and $W_{ig}$ is a random variable with density $h(w_{ig} \mid \boldsymbol{\theta}_g)$, which will be dependent on the distribution in question, i.e., skew-$t$, generalized hyperbolic, variance-gamma, or NIG. Assume also that $\mathscr{V}_{ig}$ can be written as
\[
\mathscr{V}_{ig} = \boldsymbol{\Lambda}_g \mathscr{U}_{ig} \boldsymbol{\Delta}_g' + \boldsymbol{\Lambda}_g \mathscr{E}^{B}_{ig} + \mathscr{E}^{A}_{ig} \boldsymbol{\Delta}_g' + \mathscr{E}_{ig},
\]
where $\boldsymbol{\Lambda}_g$ is an $n \times q$ matrix of column factor loadings, $\boldsymbol{\Delta}_g$ is a $p \times r$ matrix of row factor loadings, and
\[
\mathscr{U}_{ig} \sim \mathcal{N}_{q \times r}(\mathbf{0}, \mathbf{I}_q, \mathbf{I}_r), \quad \mathscr{E}^{B}_{ig} \sim \mathcal{N}_{q \times p}(\mathbf{0}, \mathbf{I}_q, \boldsymbol{\Psi}_g), \quad \mathscr{E}^{A}_{ig} \sim \mathcal{N}_{n \times r}(\mathbf{0}, \boldsymbol{\Sigma}_g, \mathbf{I}_r), \quad \mathscr{E}_{ig} \sim \mathcal{N}_{n \times p}(\mathbf{0}, \boldsymbol{\Sigma}_g, \boldsymbol{\Psi}_g),
\]
with $\boldsymbol{\Sigma}_g$ and $\boldsymbol{\Psi}_g$ diagonal with positive entries. Note that $\mathscr{U}_{ig}$, $\mathscr{E}^{B}_{ig}$, $\mathscr{E}^{A}_{ig}$, and $\mathscr{E}_{ig}$ are all independently distributed and independent of each other.
To facilitate clustering, introduce the indicator $z_{ig}$, where $z_{ig} = 1$ if observation $i$ belongs to group $g$, and $z_{ig} = 0$ otherwise. Then, it can be shown that
\[
\mathscr{X}_i \mid z_{ig} = 1 \sim \mathcal{D}_{n \times p}\big(\mathbf{M}_g, \mathbf{A}_g, \boldsymbol{\Lambda}_g\boldsymbol{\Lambda}_g' + \boldsymbol{\Sigma}_g, \boldsymbol{\Delta}_g\boldsymbol{\Delta}_g' + \boldsymbol{\Psi}_g, \boldsymbol{\theta}_g\big),
\]
where $\mathcal{D}$ is the distribution in question, and $\boldsymbol{\theta}_g$ is the set of parameters related to the distribution of $W_{ig}$.
As in the matrix variate normal case, this model has a two-stage interpretation given by
\[
\mathscr{X}_i = \mathbf{M}_g + W_{ig} \mathbf{A}_g + \sqrt{W_{ig}} \left( \boldsymbol{\Lambda}_g \mathscr{U}^{(1)}_{ig} + \mathscr{E}^{(1)}_{ig} \right) \quad \text{and} \quad \mathscr{X}_i = \mathbf{M}_g + W_{ig} \mathbf{A}_g + \sqrt{W_{ig}} \left( \mathscr{U}^{(2)}_{ig} \boldsymbol{\Delta}_g' + \mathscr{E}^{(2)}_{ig} \right),
\]
where $\mathscr{U}^{(1)}_{ig} = \mathscr{U}_{ig}\boldsymbol{\Delta}_g' + \mathscr{E}^{B}_{ig}$, $\mathscr{E}^{(1)}_{ig} = \mathscr{E}^{A}_{ig}\boldsymbol{\Delta}_g' + \mathscr{E}_{ig}$, $\mathscr{U}^{(2)}_{ig} = \boldsymbol{\Lambda}_g\mathscr{U}_{ig} + \mathscr{E}^{A}_{ig}$, and $\mathscr{E}^{(2)}_{ig} = \boldsymbol{\Lambda}_g\mathscr{E}^{B}_{ig} + \mathscr{E}_{ig}$, which will be useful for parameter estimation.
3.2 Parameter Estimation
Suppose we observe the matrices $\mathbf{X}_1, \ldots, \mathbf{X}_N$ distributed according to one of the four distributions. We assume that these data are incomplete and employ an alternating expectation conditional maximization (AECM) algorithm (Meng & van Dyk 1997). This algorithm is now described after initialization.
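Although the AECM algorithm for this model involves three stages and matrix variate latent quantities, the basic alternation between conditional expectations and conditional maximizations can be illustrated with a much simpler model. The sketch below runs EM for a univariate Gaussian mixture, a deliberately simplified stand-in for the algorithm described in this section, not the algorithm itself:

```python
import numpy as np

rng = np.random.default_rng(0)
# Simulated two-component data: 30% from N(-2, 1), 70% from N(3, 1)
x = np.concatenate([rng.normal(-2, 1, 300), rng.normal(3, 1, 700)])

def em_gaussian_mixture(x, G=2, iters=200):
    pi = np.full(G, 1.0 / G)
    mu = np.quantile(x, np.linspace(0.2, 0.8, G))
    sd = np.full(G, x.std())
    for _ in range(iters):
        # E-step: responsibilities z_hat_{ig}, conditional on current estimates
        dens = np.exp(-0.5 * ((x[:, None] - mu) / sd) ** 2) / (sd * np.sqrt(2 * np.pi))
        num = pi * dens
        z = num / num.sum(axis=1, keepdims=True)
        # M-step: conditional maximization of pi, mu, sd in turn
        Ng = z.sum(axis=0)
        pi = Ng / len(x)
        mu = (z * x[:, None]).sum(axis=0) / Ng
        sd = np.sqrt((z * (x[:, None] - mu) ** 2).sum(axis=0) / Ng)
    return pi, mu, sd

pi, mu, sd = em_gaussian_mixture(x)
```

The AECM algorithm used here differs in that each of its three stages conditions on a different set of latent variables and updates a different block of parameters, but the alternation has the same structure.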
AECM Stage 1
The complete-data in the first stage consists of the observed data $\mathbf{X}_1, \ldots, \mathbf{X}_N$, the latent variables $W_{ig}$, and the unknown group labels $z_{ig}$, for $i = 1, \ldots, N$ and $g = 1, \ldots, G$. In this case, the complete-data log-likelihood is
\[
\ell_1 = C + \sum_{i=1}^{N} \sum_{g=1}^{G} z_{ig} \left[ \log \pi_g + \log h(w_{ig} \mid \boldsymbol{\theta}_g) - \frac{1}{2 w_{ig}} \operatorname{tr}\left\{ \boldsymbol{\Sigma}_g^{*-1} \left( \mathbf{X}_i - \mathbf{M}_g - w_{ig}\mathbf{A}_g \right) \boldsymbol{\Psi}_g^{*-1} \left( \mathbf{X}_i - \mathbf{M}_g - w_{ig}\mathbf{A}_g \right)' \right\} \right],
\]
where $\boldsymbol{\Sigma}_g^* = \boldsymbol{\Lambda}_g\boldsymbol{\Lambda}_g' + \boldsymbol{\Sigma}_g$, $\boldsymbol{\Psi}_g^* = \boldsymbol{\Delta}_g\boldsymbol{\Delta}_g' + \boldsymbol{\Psi}_g$, and $C$ is constant with respect to the parameters updated at this stage.
In the E-step, we calculate the following conditional expectations:
\[
\hat{z}_{ig} = \frac{\pi_g f(\mathbf{X}_i \mid \boldsymbol{\vartheta}_g)}{\sum_{h=1}^{G} \pi_h f(\mathbf{X}_i \mid \boldsymbol{\vartheta}_h)}, \quad a_{ig} = \mathbb{E}(W_{ig} \mid \mathbf{X}_i, z_{ig} = 1), \quad b_{ig} = \mathbb{E}(1/W_{ig} \mid \mathbf{X}_i, z_{ig} = 1), \quad c_{ig} = \mathbb{E}(\log W_{ig} \mid \mathbf{X}_i, z_{ig} = 1).
\]
As usual, all expectations are conditional on current parameter estimates; however, to avoid cluttered notation, we do not use iteration-specific notation. Although these expectations are dependent on the distribution in question, it can be shown that $W_{ig} \mid \mathbf{X}_i, z_{ig} = 1$ follows a GIG distribution in each of the four cases, so $a_{ig}$, $b_{ig}$, and $c_{ig}$ are available in closed form via the GIG expectations of Section 2.1.
In the M-step, we update $\pi_g$, $\mathbf{M}_g$, $\mathbf{A}_g$, and $\boldsymbol{\theta}_g$:
\[
\hat{\pi}_g = \frac{N_g}{N}, \quad \hat{\mathbf{M}}_g = \frac{\sum_{i=1}^{N} \hat{z}_{ig} \left( \bar{a}_g b_{ig} - 1 \right) \mathbf{X}_i}{\sum_{i=1}^{N} \hat{z}_{ig} \left( \bar{a}_g b_{ig} - 1 \right)}, \quad \hat{\mathbf{A}}_g = \frac{\sum_{i=1}^{N} \hat{z}_{ig} \left( \bar{b}_g - b_{ig} \right) \mathbf{X}_i}{\sum_{i=1}^{N} \hat{z}_{ig} \left( \bar{a}_g b_{ig} - 1 \right)},
\]
where $N_g = \sum_{i=1}^{N} \hat{z}_{ig}$, $\bar{a}_g = \frac{1}{N_g}\sum_{i=1}^{N} \hat{z}_{ig} a_{ig}$, and $\bar{b}_g = \frac{1}{N_g}\sum_{i=1}^{N} \hat{z}_{ig} b_{ig}$. The updates for $\boldsymbol{\theta}_g$ are dependent on the distribution and are identical to those given in Gallaugher & McNicholas (2018a).
AECM Stage 2
In the second stage, the complete-data consists of the observed data $\mathbf{X}_1, \ldots, \mathbf{X}_N$, the latent variables $W_{ig}$, the unknown group labels $z_{ig}$, and the latent matrices $\mathscr{U}^{(1)}_{ig}$, for $i = 1, \ldots, N$ and $g = 1, \ldots, G$. The complete-data log-likelihood in this stage is
\[
\ell_2 = C + \sum_{i=1}^{N} \sum_{g=1}^{G} z_{ig} \left[ -\frac{p}{2} \log |\boldsymbol{\Sigma}_g| - \frac{1}{2 w_{ig}} \operatorname{tr}\left\{ \boldsymbol{\Sigma}_g^{-1} \left( \mathbf{X}_i - \mathbf{M}_g - w_{ig}\mathbf{A}_g - \sqrt{w_{ig}}\,\boldsymbol{\Lambda}_g \mathbf{U}^{(1)}_{ig} \right) \boldsymbol{\Psi}_g^{*-1} \left( \mathbf{X}_i - \mathbf{M}_g - w_{ig}\mathbf{A}_g - \sqrt{w_{ig}}\,\boldsymbol{\Lambda}_g \mathbf{U}^{(1)}_{ig} \right)' \right\} \right],
\]
where $\boldsymbol{\Psi}_g^* = \boldsymbol{\Delta}_g\boldsymbol{\Delta}_g' + \boldsymbol{\Psi}_g$ and $C$ is constant with respect to $\boldsymbol{\Lambda}_g$ and $\boldsymbol{\Sigma}_g$.
In the E-step, it can be shown that $\mathscr{U}^{(1)}_{ig} \mid \mathbf{X}_i, w_{ig}, z_{ig} = 1$ is matrix variate normal with location $\frac{1}{\sqrt{w_{ig}}} \boldsymbol{\beta}_g \left( \mathbf{X}_i - \mathbf{M}_g - w_{ig} \mathbf{A}_g \right)$, where $\boldsymbol{\beta}_g = \boldsymbol{\Lambda}_g' \boldsymbol{\Sigma}_g^{*-1}$ and $\boldsymbol{\Sigma}_g^* = \boldsymbol{\Lambda}_g\boldsymbol{\Lambda}_g' + \boldsymbol{\Sigma}_g$, and so we can calculate the conditional expectations of the sufficient statistics involving $\mathscr{U}^{(1)}_{ig}$ in closed form.
In the M-step, the updates for $\boldsymbol{\Lambda}_g$ and $\boldsymbol{\Sigma}_g$ are calculated by maximizing the expected value of $\ell_2$; these updates are analogous to the corresponding updates for the MMVBFA model (Gallaugher & McNicholas 2018b).
AECM Stage 3
In the third stage, the complete-data consists of the observed data $\mathbf{X}_1, \ldots, \mathbf{X}_N$, the latent variables $W_{ig}$, the labels $z_{ig}$, and the latent matrices $\mathscr{U}^{(2)}_{ig}$, for $i = 1, \ldots, N$ and $g = 1, \ldots, G$.