# Mixtures of Skewed Matrix Variate Bilinear Factor Analyzers

Clustering is the process of finding and analyzing underlying group structures in data. In recent years, data have become increasingly high dimensional, and there is therefore an increased need for dimension reduction techniques for use in clustering. Although such techniques are firmly established in the literature for multivariate data, there is a relative paucity in the area of matrix variate, or three-way, data. Furthermore, these few methods all assume matrix variate normality, which is not always sensible if skewness is present. We propose a mixture of bilinear factor analyzers model using four skewed matrix variate distributions, namely the matrix variate skew-$t$, generalized hyperbolic, variance-gamma, and normal inverse Gaussian distributions.


## 1 Introduction

Classification is the process of finding and analyzing underlying group structure in heterogeneous data. This problem can be framed as the search for class labels of unlabelled observations. In general, some (non-trivial) proportion of observations have known labels. A special case of classification, known as clustering, occurs when none of the observations have known labels. One common approach for clustering is mixture model-based clustering, which makes use of a finite mixture model. In general, a $G$-component finite mixture model assumes that a multivariate random variable $\mathbf{X}$ has density

$$f(\mathbf{x} \mid \boldsymbol{\vartheta}) = \sum_{g=1}^{G} \pi_g f_g(\mathbf{x} \mid \boldsymbol{\theta}_g), \tag{1}$$

where $\boldsymbol{\vartheta}$ denotes all model parameters, $f_g(\mathbf{x} \mid \boldsymbol{\theta}_g)$ is the $g$th component density, and $\pi_g > 0$ is the $g$th mixing proportion such that $\sum_{g=1}^{G} \pi_g = 1$. Note that the notation used in (1) corresponds to the multivariate case; save for Appendix A, $\mathbf{X}$ will hereafter represent a matrix variate random variable, with $\mathbf{x}$ denoting its realization.
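As a concrete illustration of (1), the following sketch evaluates a two-component mixture density. It uses univariate Gaussian components and entirely hypothetical parameter values purely for illustration; the function names are our own.

```python
import math

def normal_pdf(x, mu, sigma):
    """Univariate Gaussian density, used as an illustrative component density."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def mixture_pdf(x, pis, params):
    """G-component mixture density: f(x) = sum_g pi_g * f_g(x | theta_g)."""
    return sum(pi * normal_pdf(x, mu, sigma) for pi, (mu, sigma) in zip(pis, params))

# Illustrative values: mixing proportions must be positive and sum to 1
pis = [0.3, 0.7]
params = [(-2.0, 1.0), (3.0, 1.5)]  # (mu_g, sigma_g) for each component
density = mixture_pdf(0.0, pis, params)
```

Because each component integrates to one and the mixing proportions sum to one, the mixture itself integrates to one, which is easy to confirm numerically.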

McNicholas (2016) traces the relationship between mixture models and clustering to Tiedeman (1955), who uses a component of a mixture model to define a cluster. The mixture model was first used for clustering by Wolfe (1965), who considered a Gaussian mixture model. Other early uses of Gaussian mixture models for clustering can be found in Baum et al. (1970) and Scott & Symons (1971). Although the Gaussian mixture model is attractive due to its mathematical properties, it is problematic when dealing with outliers and asymmetry in the data, and thus there has been interest in non-Gaussian mixtures for the multivariate case. Some examples of mixtures of symmetric distributions that parameterize tail weight include the $t$ distribution (Peel & McLachlan 2000, Andrews & McNicholas 2011, 2012, Lin et al. 2014) and the power exponential distribution (Dang et al. 2015). There has also been work in the area of skewed distributions such as the skew-$t$ distribution (Lin 2010, Vrbik & McNicholas 2012, 2014, Lee & McLachlan 2014, Murray, Browne & McNicholas 2014, Murray, McNicholas & Browne 2014), the normal-inverse Gaussian (NIG) distribution (Karlis & Santourian 2009), the shifted asymmetric Laplace (SAL) distribution (Morris & McNicholas 2013, Franczak et al. 2014), the variance-gamma distribution (McNicholas et al. 2017), the generalized hyperbolic distribution (Browne & McNicholas 2015), the hidden truncation hyperbolic distribution (Murray et al. 2017a), and the joint generalized hyperbolic distribution (Tang et al. 2018).

There has also been increased interest in model-based clustering of matrix variate data, such as multivariate longitudinal data and images. Examples include the work of Viroli (2011) and Anderlucci et al. (2015), who consider mixtures of matrix variate normal distributions for clustering. More recently, Gallaugher & McNicholas (2018a) looked at mixtures of four skewed matrix variate distributions, namely the matrix variate skew-$t$, generalized hyperbolic, variance-gamma, and normal inverse Gaussian (NIG) distributions, and considered classification of greyscale Arabic numerals. Melnykov & Zhu (2018) also considered modelling skewness by means of transformations.

The main problem with all of the aforementioned methods, for both the multivariate and matrix variate cases, arises when the dimensionality of the data increases. Although the problem of dealing with high-dimensional data has been thoroughly addressed in the case of multivariate data, there is a relative paucity of work for matrix variate data. In the matrix variate case, matrix variate bilinear probabilistic principal component analysis was developed by Zhao et al. (2012). More recently, Gallaugher & McNicholas (2018b) considered the closely-related mixture of matrix variate bilinear factor analyzers (MMVBFA) model for clustering. The MMVBFA model can be viewed as a matrix variate generalization of the mixture of factor analyzers model (Ghahramani & Hinton 1997) in the multivariate case. Although these methods allow for simultaneous dimension reduction and clustering, they both assume matrix variate normality, which is not sensible if cluster skewness or heavy tails are present. Herein, we present an extension of the MMVBFA model to skewed distributions, specifically the matrix variate skew-$t$, generalized hyperbolic, variance-gamma, and NIG distributions.

## 2 Background

### 2.1 Generalized Inverse Gaussian Distribution

The generalized inverse Gaussian distribution has two different parameterizations, both of which will be utilized herein. A random variable $Y > 0$ has a generalized inverse Gaussian (GIG) distribution parameterized by $a, b > 0$ and $\lambda \in \mathbb{R}$, denoted by $Y \sim \text{GIG}(a, b, \lambda)$, if its probability density function can be written as

$$f(y \mid a, b, \lambda) = \frac{(a/b)^{\lambda/2}\, y^{\lambda - 1}}{2 K_{\lambda}(\sqrt{ab})} \exp\left\{-\frac{ay + b/y}{2}\right\},$$

for $y > 0$, where

$$K_{\lambda}(u) = \frac{1}{2} \int_{0}^{\infty} y^{\lambda - 1} \exp\left\{-\frac{u}{2}\left(y + \frac{1}{y}\right)\right\} dy$$

is the modified Bessel function of the third kind with index $\lambda$. Expectations of some functions of a GIG random variable have a mathematically tractable form, e.g.:

$$\mathbb{E}(Y) = \sqrt{\frac{b}{a}}\, \frac{K_{\lambda+1}(\sqrt{ab})}{K_{\lambda}(\sqrt{ab})}, \tag{2}$$
$$\mathbb{E}(1/Y) = \sqrt{\frac{a}{b}}\, \frac{K_{\lambda+1}(\sqrt{ab})}{K_{\lambda}(\sqrt{ab})} - \frac{2\lambda}{b}, \tag{3}$$
$$\mathbb{E}(\log Y) = \log\left(\sqrt{\frac{b}{a}}\right) + \frac{1}{K_{\lambda}(\sqrt{ab})} \frac{\partial}{\partial \lambda} K_{\lambda}(\sqrt{ab}). \tag{4}$$
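Expectations (2) and (3) can be checked numerically. The sketch below (illustrative parameter values; `bessel_k` and `gig_moment` are our own helper names) evaluates $K_\lambda$ by quadrature of its integral representation and compares the Bessel-ratio formulas against direct numerical integration of the GIG density:

```python
import math

def bessel_k(lam, u):
    """K_lam(u) = 0.5 * int_0^inf y^(lam-1) exp(-(u/2)(y + 1/y)) dy,
    evaluated with the substitution y = e^t and the trapezoidal rule."""
    n, t_max = 4000, 30.0
    h = 2.0 * t_max / n
    total = 0.0
    for i in range(n + 1):
        t = -t_max + i * h
        weight = 0.5 if i in (0, n) else 1.0
        total += weight * math.exp(lam * t - u * math.cosh(t))
    return 0.5 * h * total

def gig_moment(power, a, b, lam):
    """E[Y^power] for Y ~ GIG(a, b, lam), by direct numerical integration
    of y^power times the GIG density."""
    norm = (a / b) ** (lam / 2.0) / (2.0 * bessel_k(lam, math.sqrt(a * b)))
    n, y_max = 20000, 60.0
    h = y_max / n
    total = 0.0
    for i in range(1, n + 1):
        y = i * h
        total += y ** power * norm * y ** (lam - 1.0) * math.exp(-(a * y + b / y) / 2.0)
    return h * total

# Illustrative parameter values
a, b, lam = 2.0, 3.0, 0.5
ratio = bessel_k(lam + 1.0, math.sqrt(a * b)) / bessel_k(lam, math.sqrt(a * b))
e_y = math.sqrt(b / a) * ratio                      # equation (2)
e_inv_y = math.sqrt(a / b) * ratio - 2.0 * lam / b  # equation (3)
```

For $\lambda = 1/2$ the closed form $K_{1/2}(u) = \sqrt{\pi/(2u)}\, e^{-u}$ provides an additional sanity check on the quadrature.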

Although this parameterization of the GIG distribution will be useful for parameter estimation, the alternative parameterization given by

$$g(y \mid \omega, \eta, \lambda) = \frac{(y/\eta)^{\lambda - 1}}{2 \eta K_{\lambda}(\omega)} \exp\left\{-\frac{\omega}{2}\left(\frac{y}{\eta} + \frac{\eta}{y}\right)\right\}, \tag{5}$$

where $\omega = \sqrt{ab}$ and $\eta = \sqrt{b/a}$, is used when deriving the generalized hyperbolic distribution (see Browne & McNicholas 2015). For notational clarity, we will denote the parameterization given in (5) by $Y \sim \text{I}(\omega, \eta, \lambda)$.

### 2.2 Matrix Variate Distributions

As in the multivariate case, the most mathematically tractable matrix variate distribution is the matrix variate normal. An $n \times p$ random matrix $\mathbf{X}$ follows an $n \times p$ matrix variate normal distribution with location matrix $\mathbf{M}$ and scale matrices $\boldsymbol{\Sigma}$ and $\boldsymbol{\Psi}$, of dimensions $n \times n$ and $p \times p$, respectively, denoted by $\mathbf{X} \sim \mathcal{N}_{n \times p}(\mathbf{M}, \boldsymbol{\Sigma}, \boldsymbol{\Psi})$, if the density of $\mathbf{X}$ is

$$f(\mathbf{X} \mid \mathbf{M}, \boldsymbol{\Sigma}, \boldsymbol{\Psi}) = \frac{1}{(2\pi)^{\frac{np}{2}} |\boldsymbol{\Sigma}|^{\frac{p}{2}} |\boldsymbol{\Psi}|^{\frac{n}{2}}} \exp\left\{-\frac{1}{2} \text{tr}\left(\boldsymbol{\Sigma}^{-1}(\mathbf{X} - \mathbf{M}) \boldsymbol{\Psi}^{-1} (\mathbf{X} - \mathbf{M})'\right)\right\}. \tag{6}$$

The matrix variate normal distribution is related to the multivariate normal distribution, as discussed in Harrar & Gupta (2008), via $\mathbf{X} \sim \mathcal{N}_{n \times p}(\mathbf{M}, \boldsymbol{\Sigma}, \boldsymbol{\Psi})$ if and only if $\text{vec}(\mathbf{X}) \sim \mathcal{N}_{np}(\text{vec}(\mathbf{M}), \boldsymbol{\Psi} \otimes \boldsymbol{\Sigma})$, where $\mathcal{N}_{np}$ is the multivariate normal distribution with dimension $np$, $\text{vec}(\mathbf{X})$ is the vectorization of $\mathbf{X}$, and $\otimes$ denotes the Kronecker product. Although the matrix variate normal distribution is popular, there are other well-known examples of matrix variate distributions. For example, the Wishart distribution (Wishart 1928) is the distribution of the sample covariance matrix from a multivariate normal sample. There are also a few formulations of a matrix variate skew normal distribution (Chen & Gupta 2005, Domínguez-Molina et al. 2007, Harrar & Gupta 2008).

More recently, Gallaugher & McNicholas (2017, 2019) derived a total of four skewed matrix variate distributions using a variance-mean matrix variate mixture approach. This assumes that a random matrix $\mathbf{X}$ can be written as

$$\mathbf{X} = \mathbf{M} + W\mathbf{A} + \sqrt{W}\,\mathbf{V}, \tag{7}$$

where $\mathbf{M}$ and $\mathbf{A}$ are $n \times p$ matrices representing the location and skewness, respectively, $\mathbf{V} \sim \mathcal{N}_{n \times p}(\mathbf{0}, \boldsymbol{\Sigma}, \boldsymbol{\Psi})$, and $W > 0$ is a random variable with density $h(w \mid \boldsymbol{\theta})$. Gallaugher & McNicholas (2017) show that the matrix variate skew-$t$ distribution, with $\nu$ degrees of freedom, arises from (7) with $W \sim \text{IG}(\nu/2, \nu/2)$, where IG denotes the inverse-gamma distribution with density

$$f(y \mid a, b) = \frac{b^{a}}{\Gamma(a)}\, y^{-a-1} \exp\left\{-\frac{b}{y}\right\},$$

for $y > 0$ and $a, b > 0$. The resulting density of $\mathbf{X}$ is

$$f_{\text{MVST}}(\mathbf{X} \mid \boldsymbol{\vartheta}) = \frac{2\left(\frac{\nu}{2}\right)^{\frac{\nu}{2}} \exp\left\{\text{tr}\big(\boldsymbol{\Sigma}^{-1}(\mathbf{X} - \mathbf{M})\boldsymbol{\Psi}^{-1}\mathbf{A}'\big)\right\}}{(2\pi)^{\frac{np}{2}} |\boldsymbol{\Sigma}|^{\frac{p}{2}} |\boldsymbol{\Psi}|^{\frac{n}{2}} \Gamma\left(\frac{\nu}{2}\right)} \left(\frac{\delta(\mathbf{X}; \mathbf{M}, \boldsymbol{\Sigma}, \boldsymbol{\Psi}) + \nu}{\rho(\mathbf{A}, \boldsymbol{\Sigma}, \boldsymbol{\Psi})}\right)^{-\frac{\nu + np}{4}} \times K_{-\frac{\nu + np}{2}}\left(\sqrt{[\rho(\mathbf{A}, \boldsymbol{\Sigma}, \boldsymbol{\Psi})][\delta(\mathbf{X}; \mathbf{M}, \boldsymbol{\Sigma}, \boldsymbol{\Psi}) + \nu]}\right),$$

where $\delta(\mathbf{X}; \mathbf{M}, \boldsymbol{\Sigma}, \boldsymbol{\Psi}) = \text{tr}\big(\boldsymbol{\Sigma}^{-1}(\mathbf{X} - \mathbf{M})\boldsymbol{\Psi}^{-1}(\mathbf{X} - \mathbf{M})'\big)$ and $\rho(\mathbf{A}, \boldsymbol{\Sigma}, \boldsymbol{\Psi}) = \text{tr}\big(\boldsymbol{\Sigma}^{-1}\mathbf{A}\boldsymbol{\Psi}^{-1}\mathbf{A}'\big)$. For notational clarity, this distribution will be denoted by $\text{MVST}_{n \times p}(\mathbf{M}, \mathbf{A}, \boldsymbol{\Sigma}, \boldsymbol{\Psi}, \nu)$.

In Gallaugher & McNicholas (2019), one of the distributions considered is a matrix variate generalized hyperbolic distribution. This again arises from (7), with $W \sim \text{I}(\omega, 1, \lambda)$. This distribution will be denoted by $\text{MVGH}_{n \times p}(\mathbf{M}, \mathbf{A}, \boldsymbol{\Sigma}, \boldsymbol{\Psi}, \lambda, \omega)$, and the density is

$$f_{\text{MVGH}}(\mathbf{X} \mid \boldsymbol{\vartheta}) = \frac{\exp\left\{\text{tr}\big(\boldsymbol{\Sigma}^{-1}(\mathbf{X} - \mathbf{M})\boldsymbol{\Psi}^{-1}\mathbf{A}'\big)\right\}}{(2\pi)^{\frac{np}{2}} |\boldsymbol{\Sigma}|^{\frac{p}{2}} |\boldsymbol{\Psi}|^{\frac{n}{2}} K_{\lambda}(\omega)} \left(\frac{\delta(\mathbf{X}; \mathbf{M}, \boldsymbol{\Sigma}, \boldsymbol{\Psi}) + \omega}{\rho(\mathbf{A}, \boldsymbol{\Sigma}, \boldsymbol{\Psi}) + \omega}\right)^{\frac{\lambda - np/2}{2}} \times K_{\lambda - \frac{np}{2}}\left(\sqrt{[\rho(\mathbf{A}, \boldsymbol{\Sigma}, \boldsymbol{\Psi}) + \omega][\delta(\mathbf{X}; \mathbf{M}, \boldsymbol{\Sigma}, \boldsymbol{\Psi}) + \omega]}\right),$$

where $\lambda \in \mathbb{R}$ and $\omega > 0$.

The matrix variate variance-gamma distribution, also derived in Gallaugher & McNicholas (2019) and denoted $\text{MVVG}_{n \times p}(\mathbf{M}, \mathbf{A}, \boldsymbol{\Sigma}, \boldsymbol{\Psi}, \gamma)$, arises from (7) with $W \sim \text{gamma}(\gamma, \gamma)$, where $\text{gamma}(a, b)$ denotes the gamma distribution with density

$$f(y \mid a, b) = \frac{b^{a}}{\Gamma(a)}\, y^{a-1} \exp\{-by\},$$

for $y > 0$ and $a, b > 0$. The density of the random matrix $\mathbf{X}$ with this distribution is

$$f_{\text{MVVG}}(\mathbf{X} \mid \boldsymbol{\vartheta}) = \frac{2\gamma^{\gamma} \exp\left\{\text{tr}\big(\boldsymbol{\Sigma}^{-1}(\mathbf{X} - \mathbf{M})\boldsymbol{\Psi}^{-1}\mathbf{A}'\big)\right\}}{(2\pi)^{\frac{np}{2}} |\boldsymbol{\Sigma}|^{\frac{p}{2}} |\boldsymbol{\Psi}|^{\frac{n}{2}} \Gamma(\gamma)} \left(\frac{\delta(\mathbf{X}; \mathbf{M}, \boldsymbol{\Sigma}, \boldsymbol{\Psi})}{\rho(\mathbf{A}, \boldsymbol{\Sigma}, \boldsymbol{\Psi}) + 2\gamma}\right)^{\frac{\gamma - np/2}{2}} \times K_{\gamma - \frac{np}{2}}\left(\sqrt{[\rho(\mathbf{A}, \boldsymbol{\Sigma}, \boldsymbol{\Psi}) + 2\gamma][\delta(\mathbf{X}; \mathbf{M}, \boldsymbol{\Sigma}, \boldsymbol{\Psi})]}\right),$$

where $\gamma > 0$.

Finally, the matrix variate NIG distribution arises when $W \sim \text{IG}(1, \tilde{\gamma})$, where $\text{IG}(\delta, \gamma)$ here denotes the inverse-Gaussian distribution with density

$$f(y \mid \delta, \gamma) = \frac{\delta}{\sqrt{2\pi}} \exp\{\delta\gamma\}\, y^{-\frac{3}{2}} \exp\left\{-\frac{1}{2}\left(\frac{\delta^{2}}{y} + \gamma^{2} y\right)\right\},$$

for $y > 0$, with $\delta, \gamma > 0$. The density of $\mathbf{X}$ is

$$f_{\text{MVNIG}}(\mathbf{X} \mid \boldsymbol{\vartheta}) = \frac{2 \exp\left\{\text{tr}\big(\boldsymbol{\Sigma}^{-1}(\mathbf{X} - \mathbf{M})\boldsymbol{\Psi}^{-1}\mathbf{A}'\big) + \tilde{\gamma}\right\}}{(2\pi)^{\frac{np+1}{2}} |\boldsymbol{\Sigma}|^{\frac{p}{2}} |\boldsymbol{\Psi}|^{\frac{n}{2}}} \left(\frac{\delta(\mathbf{X}; \mathbf{M}, \boldsymbol{\Sigma}, \boldsymbol{\Psi}) + 1}{\rho(\mathbf{A}, \boldsymbol{\Sigma}, \boldsymbol{\Psi}) + \tilde{\gamma}^{2}}\right)^{-\frac{1 + np}{4}} \times K_{-\frac{1+np}{2}}\left(\sqrt{[\rho(\mathbf{A}, \boldsymbol{\Sigma}, \boldsymbol{\Psi}) + \tilde{\gamma}^{2}][\delta(\mathbf{X}; \mathbf{M}, \boldsymbol{\Sigma}, \boldsymbol{\Psi}) + 1]}\right),$$

where $\tilde{\gamma} > 0$. This distribution is denoted by $\text{MVNIG}_{n \times p}(\mathbf{M}, \mathbf{A}, \boldsymbol{\Sigma}, \boldsymbol{\Psi}, \tilde{\gamma})$.

### 2.3 Matrix Variate Factor Analysis

Readers who may benefit from the context provided by the mixture of factor analyzers model should consult the appendix. Xie et al. (2008) and Yu et al. (2008) consider a matrix variate extension of probabilistic principal components analysis (PPCA), which assumes an $n \times p$ random matrix $\mathbf{X}$ can be written

$$\mathbf{X} = \mathbf{M} + \boldsymbol{\Lambda}\mathbf{U}\boldsymbol{\Delta}' + \mathbf{E}, \tag{8}$$

where $\mathbf{M}$ is an $n \times p$ location matrix, $\boldsymbol{\Lambda}$ is an $n \times q$ matrix of column factor loadings, $\boldsymbol{\Delta}$ is a $p \times r$ matrix of row factor loadings, $\mathbf{U} \sim \mathcal{N}_{q \times r}(\mathbf{0}, \mathbf{I}_q, \mathbf{I}_r)$ is a matrix of latent factors, and $\mathbf{E}$ is an isotropic matrix variate normal error term. It is assumed that $\mathbf{U}$ and $\mathbf{E}$ are independent of each other. The main disadvantage of this model is that, in general, $\mathbf{X}$ does not follow a matrix variate normal distribution.

Zhao et al. (2012) present bilinear probabilistic principal component analysis (BPPCA), which extends (8) by adding two projected error terms. The resulting model assumes $\mathbf{X}$ can be written as

$$\mathbf{X} = \mathbf{M} + \boldsymbol{\Lambda}\mathbf{U}\boldsymbol{\Delta}' + \boldsymbol{\Lambda}\mathbf{E}^{B} + \mathbf{E}^{A}\boldsymbol{\Delta}' + \mathbf{E},$$

where $\mathbf{U}$ is the same as in (8), and $\mathbf{E}^{B}$ and $\mathbf{E}^{A}$ are the projected error terms. In this model, it is assumed that $\mathbf{U}$, $\mathbf{E}^{B}$, $\mathbf{E}^{A}$, and $\mathbf{E}$ are all independent of each other. Gallaugher & McNicholas (2018b) further extend this to matrix variate factor analysis and consider a mixture of matrix variate bilinear factor analyzers (MMVBFA) model. For MMVBFA, Gallaugher & McNicholas (2018b) altered BPPCA by removing the isotropic constraints, so that $\mathbf{E}^{B} \sim \mathcal{N}_{q \times p}(\mathbf{0}, \mathbf{I}_q, \boldsymbol{\Psi})$, $\mathbf{E}^{A} \sim \mathcal{N}_{n \times r}(\mathbf{0}, \boldsymbol{\Sigma}, \mathbf{I}_r)$, and $\mathbf{E} \sim \mathcal{N}_{n \times p}(\mathbf{0}, \boldsymbol{\Sigma}, \boldsymbol{\Psi})$, where $\boldsymbol{\Sigma} = \text{diag}(\sigma_1, \ldots, \sigma_n)$ with $\sigma_i > 0$, and $\boldsymbol{\Psi} = \text{diag}(\psi_1, \ldots, \psi_p)$ with $\psi_j > 0$. With these slight modifications, it can be shown that $\mathbf{X} \sim \mathcal{N}_{n \times p}(\mathbf{M}, \boldsymbol{\Sigma} + \boldsymbol{\Lambda}\boldsymbol{\Lambda}', \boldsymbol{\Psi} + \boldsymbol{\Delta}\boldsymbol{\Delta}')$, similarly to its multivariate counterpart (Appendix A).
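The marginal result $\mathbf{X} \sim \mathcal{N}_{n \times p}(\mathbf{M}, \boldsymbol{\Sigma} + \boldsymbol{\Lambda}\boldsymbol{\Lambda}', \boldsymbol{\Psi} + \boldsymbol{\Delta}\boldsymbol{\Delta}')$ can be checked in miniature. The sketch below takes the illustrative scalar case $n = p = q = r = 1$, so the loadings and scale matrices reduce to scalars (all values hypothetical), simulates the bilinear model, and compares the sample variance of $X$ with $(\sigma + \lambda^2)(\psi + \delta^2)$:

```python
import random

random.seed(42)

# Scalar stand-ins for the model quantities (n = p = q = r = 1), illustrative values
m, lam, delta = 0.0, 0.8, 1.2   # location M and the loadings Lambda, Delta
sigma, psi = 0.5, 0.7           # the diagonal "matrices" Sigma and Psi

def draw():
    u = random.gauss(0.0, 1.0)                   # U ~ N(0, I, I)
    e_b = random.gauss(0.0, psi ** 0.5)          # E^B ~ N(0, I, Psi)
    e_a = random.gauss(0.0, sigma ** 0.5)        # E^A ~ N(0, Sigma, I)
    e = random.gauss(0.0, (sigma * psi) ** 0.5)  # E ~ N(0, Sigma, Psi)
    return m + lam * u * delta + lam * e_b + e_a * delta + e

samples = [draw() for _ in range(200000)]
mean = sum(samples) / len(samples)
var = sum((x - mean) ** 2 for x in samples) / len(samples)
target = (sigma + lam ** 2) * (psi + delta ** 2)  # (Sigma + Lambda Lambda')(Psi + Delta Delta')
```

The variance factors exactly because the four error terms are independent: their variances, $\lambda^2\delta^2 + \lambda^2\psi + \sigma\delta^2 + \sigma\psi$, sum to $(\sigma + \lambda^2)(\psi + \delta^2)$.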

It is important to note that the term “column factors” refers to a reduction in the dimension of the columns, which is equivalent to the number of rows, and not a reduction in the number of columns. Likewise, the term “row factors” refers to a reduction in the dimension of the rows (the number of columns). As discussed by Zhao et al. (2012), the projected error terms $\boldsymbol{\Lambda}\mathbf{E}^{B}$ and $\mathbf{E}^{A}\boldsymbol{\Delta}'$ are interpreted as the row and column noise, whereas the final term $\mathbf{E}$ is the common noise.

## 3 Mixture of Skewed Matrix Variate Bilinear Factor Analyzers

### 3.1 Model Specification

We now consider a mixture of skewed bilinear factor analyzers according to one of the four skewed distributions discussed previously. Each random matrix $\mathbf{X}_i$ from a random sample distributed according to one of the four distributions can be written as

$$\mathbf{X}_i = \mathbf{M}_g + W_{ig}\mathbf{A}_g + \mathbf{V}_{ig}$$

with probability $\pi_g$, for $i = 1, \ldots, N$ and $g = 1, \ldots, G$, where $\mathbf{M}_g$ is the location of the $g$th component, $\mathbf{A}_g$ is the skewness, and $W_{ig} > 0$ is a random variable with density $h(w_{ig} \mid \boldsymbol{\theta}_g)$. This density will depend on the distribution in question, i.e., skew-$t$, generalized hyperbolic, variance-gamma, or NIG. Assume also that $\mathbf{V}_{ig}$ can be written as

$$\mathbf{V}_{ig} = \boldsymbol{\Lambda}_g\mathbf{U}_{ig}\boldsymbol{\Delta}'_g + \boldsymbol{\Lambda}_g\mathbf{E}^{B}_{ig} + \mathbf{E}^{A}_{ig}\boldsymbol{\Delta}'_g + \mathbf{E}_{ig},$$

where

$$\begin{aligned} \mathbf{U}_{ig} \mid w_{ig} &\sim \mathcal{N}_{q \times r}(\mathbf{0}, w_{ig}\mathbf{I}_q, \mathbf{I}_r), & \mathbf{E}^{B}_{ig} \mid w_{ig} &\sim \mathcal{N}_{q \times p}(\mathbf{0}, w_{ig}\mathbf{I}_q, \boldsymbol{\Psi}_g), \\ \mathbf{E}^{A}_{ig} \mid w_{ig} &\sim \mathcal{N}_{n \times r}(\mathbf{0}, w_{ig}\boldsymbol{\Sigma}_g, \mathbf{I}_r), & \mathbf{E}_{ig} \mid w_{ig} &\sim \mathcal{N}_{n \times p}(\mathbf{0}, w_{ig}\boldsymbol{\Sigma}_g, \boldsymbol{\Psi}_g). \end{aligned}$$

Note that $\mathbf{U}_{ig}$, $\mathbf{E}^{B}_{ig}$, $\mathbf{E}^{A}_{ig}$, and $\mathbf{E}_{ig}$ are all independent of each other.

To facilitate clustering, introduce the indicator $z_{ig}$, where $z_{ig} = 1$ if observation $i$ belongs to group $g$, and $z_{ig} = 0$ otherwise. Then, it can be shown that

$$\mathbf{X}_i \mid z_{ig} = 1 \sim \mathcal{D}_{n \times p}\big(\mathbf{M}_g, \mathbf{A}_g, \boldsymbol{\Sigma}_g + \boldsymbol{\Lambda}_g\boldsymbol{\Lambda}'_g, \boldsymbol{\Psi}_g + \boldsymbol{\Delta}_g\boldsymbol{\Delta}'_g, \boldsymbol{\theta}_g\big),$$

where $\mathcal{D}$ is the distribution in question, and $\boldsymbol{\theta}_g$ is the set of parameters related to the distribution of $W_{ig}$.

As in the matrix variate normal case, this model has a two-stage interpretation, given by

$$\mathbf{X}_i = \mathbf{M}_g + W_{ig}\mathbf{A}_g + \boldsymbol{\Lambda}_g\mathbf{Y}^{B}_{ig} + \mathbf{R}^{B}_{ig}, \qquad \mathbf{Y}^{B}_{ig} = \mathbf{U}_{ig}\boldsymbol{\Delta}'_g + \mathbf{E}^{B}_{ig}, \qquad \mathbf{R}^{B}_{ig} = \mathbf{E}^{A}_{ig}\boldsymbol{\Delta}'_g + \mathbf{E}_{ig},$$

and

$$\mathbf{X}_i = \mathbf{M}_g + W_{ig}\mathbf{A}_g + \mathbf{Y}^{A}_{ig}\boldsymbol{\Delta}'_g + \mathbf{R}^{A}_{ig}, \qquad \mathbf{Y}^{A}_{ig} = \boldsymbol{\Lambda}_g\mathbf{U}_{ig} + \mathbf{E}^{A}_{ig}, \qquad \mathbf{R}^{A}_{ig} = \boldsymbol{\Lambda}_g\mathbf{E}^{B}_{ig} + \mathbf{E}_{ig},$$

which will be useful for parameter estimation.

### 3.2 Parameter Estimation

Suppose we observe the matrices $\mathbf{X}_1, \ldots, \mathbf{X}_N$, distributed according to one of the four distributions. We assume that these data are incomplete and employ an alternating expectation conditional maximization (AECM) algorithm (Meng & van Dyk 1997). This algorithm is now described, after initialization.

#### AECM Stage 1

The complete-data in the first stage consist of the observed data $\mathbf{X}_i$, the latent variables $W_{ig}$, and the unknown group labels $z_{ig}$, for $i = 1, \ldots, N$ and $g = 1, \ldots, G$. In this case, the complete-data log-likelihood is

$$\ell_{C1} = C + \sum_{i=1}^{N}\sum_{g=1}^{G} z_{ig}\bigg[\log \pi_g + \log h(w_{ig} \mid \boldsymbol{\theta}_g) - \frac{1}{2}\,\text{tr}\bigg\{\frac{1}{w_{ig}}\boldsymbol{\Sigma}^{*-1}_g(\mathbf{X}_i - \mathbf{M}_g)\boldsymbol{\Psi}^{*-1}_g(\mathbf{X}_i - \mathbf{M}_g)' - \boldsymbol{\Sigma}^{*-1}_g(\mathbf{X}_i - \mathbf{M}_g)\boldsymbol{\Psi}^{*-1}_g\mathbf{A}'_g - \boldsymbol{\Sigma}^{*-1}_g\mathbf{A}_g\boldsymbol{\Psi}^{*-1}_g(\mathbf{X}_i - \mathbf{M}_g)' + w_{ig}\boldsymbol{\Sigma}^{*-1}_g\mathbf{A}_g\boldsymbol{\Psi}^{*-1}_g\mathbf{A}'_g\bigg\}\bigg],$$

where $\boldsymbol{\Sigma}^{*}_g = \boldsymbol{\Sigma}_g + \boldsymbol{\Lambda}_g\boldsymbol{\Lambda}'_g$, $\boldsymbol{\Psi}^{*}_g = \boldsymbol{\Psi}_g + \boldsymbol{\Delta}_g\boldsymbol{\Delta}'_g$, and $C$ is constant with respect to the parameters.

In the E-step, we calculate the conditional expectations $\hat{z}_{ig}$, $a_{ig} = \mathbb{E}(W_{ig} \mid \mathbf{X}_i, z_{ig} = 1)$, $b_{ig} = \mathbb{E}(1/W_{ig} \mid \mathbf{X}_i, z_{ig} = 1)$, and $c_{ig} = \mathbb{E}(\log W_{ig} \mid \mathbf{X}_i, z_{ig} = 1)$. As usual, all expectations are conditional on current parameter estimates; however, to avoid cluttered notation, we do not use iteration-specific notation. Although these expectations depend on the distribution in question, it can be shown that

$$\begin{aligned} W^{\text{ST}}_{ig} \mid \mathbf{X}_i, z_{ig} = 1 &\sim \text{GIG}\big(\rho(\mathbf{A}_g, \boldsymbol{\Sigma}_g, \boldsymbol{\Psi}_g),\; \delta(\mathbf{X}_i; \mathbf{M}_g, \boldsymbol{\Sigma}_g, \boldsymbol{\Psi}_g) + \nu_g,\; -(\nu_g + np)/2\big), \\ W^{\text{GH}}_{ig} \mid \mathbf{X}_i, z_{ig} = 1 &\sim \text{GIG}\big(\rho(\mathbf{A}_g, \boldsymbol{\Sigma}_g, \boldsymbol{\Psi}_g) + \omega_g,\; \delta(\mathbf{X}_i; \mathbf{M}_g, \boldsymbol{\Sigma}_g, \boldsymbol{\Psi}_g) + \omega_g,\; \lambda_g - np/2\big), \\ W^{\text{VG}}_{ig} \mid \mathbf{X}_i, z_{ig} = 1 &\sim \text{GIG}\big(\rho(\mathbf{A}_g, \boldsymbol{\Sigma}_g, \boldsymbol{\Psi}_g) + 2\gamma_g,\; \delta(\mathbf{X}_i; \mathbf{M}_g, \boldsymbol{\Sigma}_g, \boldsymbol{\Psi}_g),\; \gamma_g - np/2\big), \\ W^{\text{NIG}}_{ig} \mid \mathbf{X}_i, z_{ig} = 1 &\sim \text{GIG}\big(\rho(\mathbf{A}_g, \boldsymbol{\Sigma}_g, \boldsymbol{\Psi}_g) + \tilde{\gamma}^{2}_g,\; \delta(\mathbf{X}_i; \mathbf{M}_g, \boldsymbol{\Sigma}_g, \boldsymbol{\Psi}_g) + 1,\; -(1 + np)/2\big). \end{aligned}$$
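The four conditional distributions above differ only in how the GIG parameters $(a, b, \lambda)$ are assembled from $\rho$, $\delta$, and the distribution-specific parameters. A small helper (our own hypothetical function name; $\rho$ and $\delta$ supplied as precomputed scalars) makes the mapping explicit:

```python
def gig_posterior_params(dist, rho, delta, np_, theta):
    """Return the GIG(a, b, lam) parameters of W_ig | X_i, z_ig = 1 for each
    of the four skewed matrix variate distributions.

    rho, delta: precomputed rho(A_g, Sigma_g, Psi_g) and delta(X_i; M_g, Sigma_g, Psi_g)
    np_: the product n * p; theta: dict of distribution-specific parameters
    """
    if dist == "skew_t":
        return rho, delta + theta["nu"], -(theta["nu"] + np_) / 2
    if dist == "gen_hyperbolic":
        return rho + theta["omega"], delta + theta["omega"], theta["lambda"] - np_ / 2
    if dist == "variance_gamma":
        return rho + 2 * theta["gamma"], delta, theta["gamma"] - np_ / 2
    if dist == "nig":
        return rho + theta["gamma_tilde"] ** 2, delta + 1, -(1 + np_) / 2
    raise ValueError(f"unknown distribution: {dist}")
```

The returned triple can then be plugged directly into the GIG expectations (2)–(4) to obtain $a_{ig}$, $b_{ig}$, and $c_{ig}$.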

Therefore, the exact updates are obtained by using the expectations given in (2)–(4) with the appropriate values of $a$, $b$, and $\lambda$.

In the M-step, we update $\pi_g$, $\mathbf{M}_g$, and $\mathbf{A}_g$:

$$\hat{\pi}_g = \frac{N_g}{N}, \qquad \hat{\mathbf{M}}_g = \frac{\sum_{i=1}^{N}\hat{z}_{ig}\mathbf{X}_i(\bar{a}_g b_{ig} - 1)}{\sum_{i=1}^{N}\hat{z}_{ig}\bar{a}_g b_{ig} - N_g}, \qquad \hat{\mathbf{A}}_g = \frac{\sum_{i=1}^{N}\hat{z}_{ig}\mathbf{X}_i(\bar{b}_g - b_{ig})}{\sum_{i=1}^{N}\hat{z}_{ig}\bar{a}_g b_{ig} - N_g},$$

where

$$N_g = \sum_{i=1}^{N}\hat{z}_{ig}, \qquad \bar{a}_g = \frac{\sum_{i=1}^{N}\hat{z}_{ig} a_{ig}}{N_g}, \qquad \bar{b}_g = \frac{\sum_{i=1}^{N}\hat{z}_{ig} b_{ig}}{N_g}.$$
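For scalar data (each $\mathbf{X}_i$ a $1 \times 1$ matrix) the updates above reduce to simple weighted sums. The sketch below is a minimal illustration, not the authors' implementation: responsibilities `z` and the E-step quantities `a`, `b` are supplied as plain lists, and a single component $g$ is updated.

```python
def m_step_component(x, z, a, b):
    """M-step updates for pi_g, M_g, A_g in the scalar case.

    x: observations; z: responsibilities z_ig for component g;
    a[i] = E[W_ig | ...], b[i] = E[1/W_ig | ...] from the E-step.
    """
    N = len(x)
    N_g = sum(z)
    a_bar = sum(zi * ai for zi, ai in zip(z, a)) / N_g
    b_bar = sum(zi * bi for zi, bi in zip(z, b)) / N_g
    denom = sum(zi * a_bar * bi for zi, bi in zip(z, b)) - N_g
    pi_g = N_g / N
    M_g = sum(zi * xi * (a_bar * bi - 1) for zi, xi, bi in zip(z, x, b)) / denom
    A_g = sum(zi * xi * (b_bar - bi) for zi, xi, bi in zip(z, x, b)) / denom
    return pi_g, M_g, A_g
```

As a sanity check, when every $b_{ig}$ equals $\bar{b}_g$ the skewness update vanishes and $\hat{M}_g$ reduces to a weighted mean of the observations.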

The updates for $\boldsymbol{\theta}_g$ depend on the distribution and are identical to those given in Gallaugher & McNicholas (2018a).

#### AECM Stage 2

In the second stage, the complete-data consist of the observed data $\mathbf{X}_i$, the latent variables $W_{ig}$, the unknown group labels $z_{ig}$, and the latent matrices $\mathbf{Y}^{B}_{ig}$, for $i = 1, \ldots, N$ and $g = 1, \ldots, G$. The complete-data log-likelihood in this stage is

$$\begin{aligned} \ell_{C2} &= C + \sum_{i=1}^{N}\sum_{g=1}^{G} z_{ig}\Big[\log \pi_g + \log h(w_{ig} \mid \boldsymbol{\theta}_g) + \log \phi_{q \times p}\big(\mathbf{Y}^{B}_{ig} \mid \mathbf{0}, w_{ig}\mathbf{I}_q, \boldsymbol{\Psi}^{*}_g\big) + \log \phi_{n \times p}\big(\mathbf{X}_i \mid \mathbf{M}_g + w_{ig}\mathbf{A}_g + \boldsymbol{\Lambda}_g\mathbf{Y}^{B}_{ig},\, w_{ig}\boldsymbol{\Sigma}_g,\, \boldsymbol{\Psi}^{*}_g\big)\Big] \\ &= C' - \frac{1}{2}\sum_{i=1}^{N}\sum_{g=1}^{G} z_{ig}\bigg[p\log|\boldsymbol{\Sigma}_g| + \text{tr}\bigg\{\frac{1}{w_{ig}}\boldsymbol{\Sigma}^{-1}_g(\mathbf{X}_i - \mathbf{M}_g)\boldsymbol{\Psi}^{*-1}_g(\mathbf{X}_i - \mathbf{M}_g)' - \boldsymbol{\Sigma}^{-1}_g(\mathbf{X}_i - \mathbf{M}_g)\boldsymbol{\Psi}^{*-1}_g\mathbf{A}'_g - \frac{1}{w_{ig}}\boldsymbol{\Sigma}^{-1}_g(\mathbf{X}_i - \mathbf{M}_g)\boldsymbol{\Psi}^{*-1}_g\mathbf{Y}^{B\prime}_{ig}\boldsymbol{\Lambda}'_g \\ &\qquad - \boldsymbol{\Sigma}^{-1}_g\mathbf{A}_g\boldsymbol{\Psi}^{*-1}_g(\mathbf{X}_i - \mathbf{M}_g)' + w_{ig}\boldsymbol{\Sigma}^{-1}_g\mathbf{A}_g\boldsymbol{\Psi}^{*-1}_g\mathbf{A}'_g + \boldsymbol{\Sigma}^{-1}_g\mathbf{A}_g\boldsymbol{\Psi}^{*-1}_g\mathbf{Y}^{B\prime}_{ig}\boldsymbol{\Lambda}'_g - \frac{1}{w_{ig}}\boldsymbol{\Sigma}^{-1}_g\boldsymbol{\Lambda}_g\mathbf{Y}^{B}_{ig}\boldsymbol{\Psi}^{*-1}_g(\mathbf{X}_i - \mathbf{M}_g)' \\ &\qquad + \boldsymbol{\Sigma}^{-1}_g\boldsymbol{\Lambda}_g\mathbf{Y}^{B}_{ig}\boldsymbol{\Psi}^{*-1}_g\mathbf{A}'_g + \frac{1}{w_{ig}}\boldsymbol{\Sigma}^{-1}_g\boldsymbol{\Lambda}_g\mathbf{Y}^{B}_{ig}\boldsymbol{\Psi}^{*-1}_g\mathbf{Y}^{B\prime}_{ig}\boldsymbol{\Lambda}'_g\bigg\}\bigg], \end{aligned}$$

where $\phi_{n \times p}(\cdot)$ denotes the $n \times p$ matrix variate normal density, and $C$ and $C'$ are constant with respect to the parameters updated at this stage.

In the E-step, it can be shown that

$$\mathbf{Y}^{B}_{ig} \mid \mathbf{X}_i, w_{ig}, z_{ig} = 1 \sim \mathcal{N}_{q \times p}\Big(\big(\mathbf{I}_q + \boldsymbol{\Lambda}'_g\boldsymbol{\Sigma}^{-1}_g\boldsymbol{\Lambda}_g\big)^{-1}\boldsymbol{\Lambda}'_g\boldsymbol{\Sigma}^{-1}_g(\mathbf{X}_i - \mathbf{M}_g - w_{ig}\mathbf{A}_g),\; w_{ig}\big(\mathbf{I}_q + \boldsymbol{\Lambda}'_g\boldsymbol{\Sigma}^{-1}_g\boldsymbol{\Lambda}_g\big)^{-1},\; \boldsymbol{\Psi}^{*}_g\Big),$$

and so we can calculate the expectations

$$\begin{aligned} \mathbf{E}^{(2)}_{1ig} &\coloneqq \mathbb{E}\big[\mathbf{Y}^{B}_{ig} \mid \hat{\boldsymbol{\vartheta}}, \mathbf{X}_i, z_{ig} = 1\big] = \mathbf{L}_g\big(\mathbf{X}_i - \hat{\mathbf{M}}_g - a_{ig}\hat{\mathbf{A}}_g\big), \\ \mathbf{E}^{(2)}_{2ig} &\coloneqq \mathbb{E}\bigg[\frac{1}{W_{ig}}\mathbf{Y}^{B}_{ig} \,\Big|\, \hat{\boldsymbol{\vartheta}}, \mathbf{X}_i, z_{ig} = 1\bigg] = \mathbf{L}_g\big(b_{ig}(\mathbf{X}_i - \hat{\mathbf{M}}_g) - \hat{\mathbf{A}}_g\big), \\ \mathbf{E}^{(2)}_{3ig} &\coloneqq \mathbb{E}\bigg[\frac{1}{W_{ig}}\mathbf{Y}^{B}_{ig}\boldsymbol{\Psi}^{*-1}_g\mathbf{Y}^{B\prime}_{ig} \,\Big|\, \hat{\boldsymbol{\vartheta}}, \mathbf{X}_i, z_{ig} = 1\bigg] = p\big(\mathbf{I}_q + \hat{\boldsymbol{\Lambda}}'_g\hat{\boldsymbol{\Sigma}}^{-1}_g\hat{\boldsymbol{\Lambda}}_g\big)^{-1} + b_{ig}\mathbf{L}_g(\mathbf{X}_i - \hat{\mathbf{M}}_g)\hat{\boldsymbol{\Psi}}^{*-1}_g(\mathbf{X}_i - \hat{\mathbf{M}}_g)'\mathbf{L}'_g \\ &\qquad - \mathbf{L}_g\big((\mathbf{X}_i - \hat{\mathbf{M}}_g)\hat{\boldsymbol{\Psi}}^{*-1}_g\hat{\mathbf{A}}'_g + \hat{\mathbf{A}}_g\hat{\boldsymbol{\Psi}}^{*-1}_g(\mathbf{X}_i - \hat{\mathbf{M}}_g)'\big)\mathbf{L}'_g + a_{ig}\mathbf{L}_g\hat{\mathbf{A}}_g\hat{\boldsymbol{\Psi}}^{*-1}_g\hat{\mathbf{A}}'_g\mathbf{L}'_g, \end{aligned}$$

where $\mathbf{L}_g = \big(\mathbf{I}_q + \hat{\boldsymbol{\Lambda}}'_g\hat{\boldsymbol{\Sigma}}^{-1}_g\hat{\boldsymbol{\Lambda}}_g\big)^{-1}\hat{\boldsymbol{\Lambda}}'_g\hat{\boldsymbol{\Sigma}}^{-1}_g$.

In the M-step, the updates for $\boldsymbol{\Lambda}_g$ and $\boldsymbol{\Sigma}_g$ are calculated. These updates are given by

$$\hat{\boldsymbol{\Lambda}}_g = \sum_{i=1}^{N}\hat{z}_{ig}\Big[(\mathbf{X}_i - \hat{\mathbf{M}}_g)\hat{\boldsymbol{\Psi}}^{*-1}_g\mathbf{E}^{(2)\prime}_{2ig} - \hat{\mathbf{A}}_g\hat{\boldsymbol{\Psi}}^{*-1}_g\mathbf{E}^{(2)\prime}_{1ig}\Big]\bigg(\sum_{i=1}^{N}\hat{z}_{ig}\mathbf{E}^{(2)}_{3ig}\bigg)^{-1}$$

and $\hat{\boldsymbol{\Sigma}}_g = \text{diag}\{\mathbf{S}^{\Lambda}_g\}$, where

$$\begin{aligned} \mathbf{S}^{\Lambda}_g = \frac{1}{N_g p}\sum_{i=1}^{N}\hat{z}_{ig}\Big[&\, b_{ig}(\mathbf{X}_i - \hat{\mathbf{M}}_g)\hat{\boldsymbol{\Psi}}^{*-1}_g(\mathbf{X}_i - \hat{\mathbf{M}}_g)' - \big(\hat{\mathbf{A}}_g + \hat{\boldsymbol{\Lambda}}_g\mathbf{E}^{(2)}_{2ig}\big)\hat{\boldsymbol{\Psi}}^{*-1}_g(\mathbf{X}_i - \hat{\mathbf{M}}_g)' - (\mathbf{X}_i - \hat{\mathbf{M}}_g)\hat{\boldsymbol{\Psi}}^{*-1}_g\hat{\mathbf{A}}'_g \\ &+ a_{ig}\hat{\mathbf{A}}_g\hat{\boldsymbol{\Psi}}^{*-1}_g\hat{\mathbf{A}}'_g + \hat{\boldsymbol{\Lambda}}_g\mathbf{E}^{(2)}_{1ig}\hat{\boldsymbol{\Psi}}^{*-1}_g\hat{\mathbf{A}}'_g - (\mathbf{X}_i - \hat{\mathbf{M}}_g)\hat{\boldsymbol{\Psi}}^{*-1}_g\mathbf{E}^{(2)\prime}_{2ig}\hat{\boldsymbol{\Lambda}}'_g + \hat{\mathbf{A}}_g\hat{\boldsymbol{\Psi}}^{*-1}_g\mathbf{E}^{(2)\prime}_{1ig}\hat{\boldsymbol{\Lambda}}'_g + \hat{\boldsymbol{\Lambda}}_g\mathbf{E}^{(2)}_{3ig}\hat{\boldsymbol{\Lambda}}'_g\Big]. \end{aligned}$$

#### AECM Stage 3

In the third stage, the complete-data consist of the observed data $\mathbf{X}_i$, the latent variables $W_{ig}$, the labels $z_{ig}$, and the latent matrices $\mathbf{Y}^{A}_{ig}$, for $i = 1, \ldots, N$ and $g = 1, \ldots, G$. The complete-data log-likelihood is

$$\begin{aligned} \ell_{C3} &= C + \sum_{i=1}^{N}\sum_{g=1}^{G} z_{ig}\Big[\log \pi_g + \log h(w_{ig} \mid \boldsymbol{\theta}_g) + \log \phi_{n \times r}\big(\mathbf{Y}^{A}_{ig} \mid \mathbf{0}, w_{ig}\boldsymbol{\Sigma}^{*}_g, \mathbf{I}_r\big) + \log \phi_{n \times p}\big(\mathbf{X}_i \mid \mathbf{M}_g + w_{ig}\mathbf{A}_g + \mathbf{Y}^{A}_{ig}\boldsymbol{\Delta}'_g,\, w_{ig}\boldsymbol{\Sigma}^{*}_g,\, \boldsymbol{\Psi}_g\big)\Big] \\ &= C' - \frac{1}{2}\sum_{i=1}^{N}\sum_{g=1}^{G} z_{ig}\bigg[n\log|\boldsymbol{\Psi}_g| + \text{tr}\bigg\{\frac{1}{w_{ig}}\boldsymbol{\Psi}^{-1}_g(\mathbf{X}_i - \mathbf{M}_g)'\boldsymbol{\Sigma}^{*-1}_g(\mathbf{X}_i - \mathbf{M}_g) - \boldsymbol{\Psi}^{-1}_g(\mathbf{X}_i - \mathbf{M}_g)'\boldsymbol{\Sigma}^{*-1}_g\mathbf{A}_g - \frac{1}{w_{ig}}\boldsymbol{\Psi}^{-1}_g(\mathbf{X}_i - \mathbf{M}_g)'\boldsymbol{\Sigma}^{*-1}_g\mathbf{Y}^{A}_{ig}\boldsymbol{\Delta}'_g \\ &\qquad - \boldsymbol{\Psi}^{-1}_g\mathbf{A}'_g\boldsymbol{\Sigma}^{*-1}_g(\mathbf{X}_i - \mathbf{M}_g) + w_{ig}\boldsymbol{\Psi}^{-1}_g\mathbf{A}'_g\boldsymbol{\Sigma}^{*-1}_g\mathbf{A}_g + \boldsymbol{\Psi}^{-1}_g\mathbf{A}'_g\boldsymbol{\Sigma}^{*-1}_g\mathbf{Y}^{A}_{ig}\boldsymbol{\Delta}'_g - \frac{1}{w_{ig}}\boldsymbol{\Psi}^{-1}_g\boldsymbol{\Delta}_g\mathbf{Y}^{A\prime}_{ig}\boldsymbol{\Sigma}^{*-1}_g(\mathbf{X}_i - \mathbf{M}_g) \\ &\qquad + \boldsymbol{\Psi}^{-1}_g\boldsymbol{\Delta}_g\mathbf{Y}^{A\prime}_{ig}\boldsymbol{\Sigma}^{*-1}_g\mathbf{A}_g + \frac{1}{w_{ig}}\boldsymbol{\Psi}^{-1}_g\boldsymbol{\Delta}_g\mathbf{Y}^{A\prime}_{ig}\boldsymbol{\Sigma}^{*-1}_g\mathbf{Y}^{A}_{ig}\boldsymbol{\Delta}'_g\bigg\}\bigg], \end{aligned}$$

where $C$ and $C'$ are constant with respect to the parameters updated at this stage.