We consider the problem of learning a symmetric positive definite (SPD) matrix in a nonlinear regression setting, . The input is a matrix of arbitrary size which can be represented as a column vector without loss of generality. The output , on the contrary, is a structured matrix in the form of an SPD matrix. Let the training set include instances of such inputs and outputs. The task is to learn the function such that given an unseen test input , it produces the prediction of the corresponding output under the constraint that has to be an SPD matrix.
Consider solving the problem using a multilayer perceptron (MLP) neural network(e.g., Goodfellow et al., 2016)
. The standard MLP cannot straightforwardly be used, since as in most other neural network architectures, an MLP is designed to learn a vector of parameters from data without the consideration of any constraints. The objective is to design a nonlinear architecture which can learn the target outputswhile satisfying the SPD constraint across all layers. Our main contribution is to show how to alter the architecture of the MLP in such a way that it not only respects the constraints, but also makes explicit use of them. We will achieve this by: 1) Explicitly taking the non-Euclidean geometry of the underlying SPD manifolds (e.g., Pennec et al., 2005)1986) that respects the SPD nature of the matrices. This new model will be referred to as the matrix multilayer perceptron (mMLP). The mMLP makes use of positive-definite kernels to satisfy the SPD requirement across all layers. Hence, it provides a natural way of enabling deep SPD matrix learning.
We take a step-by-step approach in the development of the model. We first develop a simplified version of the resulting model that is designed for learning SPD matrices. We then extend this model into its most general form and show how it can be applied in connection to the VAE (Kingma & Welling, 2014; Rezende et al., 2014). More specifically, we replace the MLP in the vanilla VAE with the mMLP. This will crucially allow us to consider more general parametric families of distributions, in particular, those with dense covariance (dispersion) matrices. Two concrete examples are considered: the dense covariance
multivariate Gaussian distribution and the multivariate power exponential (mPE) distribution(e.g., Gómez et al., 1998). Based on the parametric choice of the distributions we examine the effect of increasing model flexibility not only on the VAE’s recognition network but also its generative network. This is achieved by relaxing the diagonality assumption on the covariance matrices and the Gaussian assumption.
2 Related Work
SPD manifold metric.
Earlier approaches for analyzing SPD matrices relied on the Euclidean space. Several recent studies suggest that non-Euclidean geometries such as the Riemannian structure may be better suited (e.g., Arsigny et al., 2006; Pennec et al., 2005). In this work, we consider the von Neumann divergence (e.g., Nielsen & Chuang, 2000) as our choice of the SPD manifold metric which is related to the Riemannian geometry. Previously, Tsuda et al. (2005) used this divergence in derivation of the matrix exponentiated gradients. Their work suggests its effectiveness for measuring dissimilarities between positive definite (PD) matrices.
SPD manifold learning.
There are multiple approaches towards the SPD matrix learning, via flattening SPD manifolds through tangent space approximations (e.g., Oncel Tuzel, 2008; Fathy et al., 2016), mapping them into reproducing kernel Hilbert spaces (Harandi et al., 2012; Minh et al., 2014), or geometry-aware SPD matrix learning (Harandi et al., 2014). While these methods typically follow shallow learning, the more recent line of research aims to design a deep architecture to nonlinearly learn target SPD matrices (Ionescu et al., 2015; Huang & Gool, 2017; Masci et al., 2015; Huang et al., 2018). Our method falls in this category but differs in the problem formulation. While the previous methods address the problem where the input is an SPD matrix and the output is a vector, we consider the reverse problem where the input is a matrix with an arbitrary size and the output is an SPD matrix.
Our extension of the matrix backpropagation differs from the one introduced by Ionescu et al. (2015). In their work, the necessary partial derivatives are computed using a two-step procedure consisting of first computing the functional that describes the variations of the upper layer variables with respect to the variations of the lower layer variables, and then computing the partial derivatives with respect to the lower layer variables using properties of the matrix inner product. In contrast, we make use of the concept of -derivatives (Magnus, 2010) and its favorable generalization properties to derive a routine which closely mimics the standard backpropagation.
Flexible variational posterior in the VAE.
An active line of research in the VAE is related to designing flexible variational posterior distributions that preserve dependencies between latent variables. In this regard, the early work of (Rezende et al., 2014) proposed the use of the rank-1 covariance matrix with a diagonal correction. Although computationally attractive, this makes for a poor approximation of the desired dense covariance matrix. Another approach is to induce dependencies between latent variables by considering a dependence on some auxiliary variables (Maaloe et al., 2016) or assuming hierarchical structures (Ranganath et al., 2016; Tran et al., 2016). Finally, an alternative approach towards achieving flexible variational posteriors is based on the idea of normalizing flow (Rezende & Mohamed, 2015; Kingma et al., 2016). In this work, we take a different approach toward increasing the posterior flexibility. This is achieved by learning dense covariance matrices via the mMLP model and relaxing the Gaussian assumption via the mPE distribution. While the approach taken by Kingma et al. (2016) toward learning dense covariance matrices solves an overparameterized problem, in our formulation, we make explicit use of kernels, within the mMLP architecture, to learn the dense covariance matrices.
Throughout this work we adopt the narrow definition of the matrix derivatives known as the -derivative (Magnus, 2010) in favor of the broad definition, -derivative. The reason for this is that the -derivative has better generalization properties. This choice turned out to be crucial in the derivation of the mMLP’s backpropagation routine which involves derivatives of matrix functions w.r.t. the matrix of variables. The -derivative and some of its properties are introduced in Appendix B.
Bregman matrix divergences.
Let us restrict ourselves to the domain of SPD matrices. Furthermore, let be a real-valued strictly convex differentiable function of the parameter domain and , where denotes the gradient w.r.t. the matrix. Then the Bregman divergence between and is defined as (e.g., Kulis et al., 2009)
Bregman divergences are non-negative, definite, and in general asymmetric. There are several choices for the function (e.g., Sra, 2016)
. The most common choice is probably, which leads to the Stein divergence (Stein, 1956), or commonly known as the LogDet divergence (refer to Appendix C for details). However, in this work, we argue in the favor of the von Neumann entropy, also known as the quantum relative entropy (QRE) (e.g., Nielsen & Chuang, 2000) as the choice of function. A numerical example is discussed in Section 6.1.1 which highlights the advantage of the von Neumann divergence over the Stein divergence as the choice of the SPD manifold metric within the mMLP architecture.
Using the von Neumann entropy as our choice of function in (1), we arrive at: , where denotes the matrix logarithm—for an SPD matrix , it is computed using , where and. The Bregman divergence corresponding to this choice of function is known as the von Neumann divergence, . Throughout, we consider the cases where the parameters are normalized so that: . The normalized von Neumann divergence is given by
4 Matrix Multilayer Perceptron
We first construct the basic form of the mMLP suitable for learning SPD matrices. Next, we construct its general form which will be applied in connection to the VAE.
4.1 The Basic Form of the mMLP
Activation matrix function.
Let denote a matrix of variables
. The activation functiondefines a matrix function in the form of , , where is some differentiable activation function outputting scalar values. In the following, we restrict ourselves to the kernel functions which form PD activation matrix functions. Irrespective of the functional form of , we will—mostly for numerical reasons and partly for the fact that our loss function in (6) will make use of the normalized von Neumann divergence—need to normalize the resulting kernel matrix. This can be achieved by enforcing the trace-one constraint,
where denotes a differentiable PD activation matrix function of trace one. Without loss of generality, throughout this work, we use the Mercer sigmoid kernel (Carrington et al., 2014) defined as
where and denote the slope and the intercept, respectively. Furthermore, denotes the dot product. In all experiments, we use default values of and .
Let indicate the input matrix and indicate the corresponding output matrix, an SPD matrix of trace one. The mMLP of hidden layers is shown as and constructed as
where the pair of and are the weight matrices, are the bias matrices, , are the latent input matrices, and are latent output SPD matrices of trace one.
In the construction of (5), we have ensured that are SPD matrices of trace one across all layers as opposed to only at the output layer. The idea is to propagate the nonlinearities introduced via the SPD activation matrix functions through all layers. This design choice turned out to be more effective than the alternative, and arguably simpler, design where the SPD requirement is met only at the output layer. We will discuss this further in Section 6.1.2, where we also present an illustrative numerical example.
We consider the normalized von Neumann divergence (2) as the base for the loss function. The von Neumann divergence is asymmetric. However, it can be symmetrized by using the fact that the von Neumann entropy of trace one follows the class of generalized quadratic distances (Nielsen & Nock, 2007). Hence, we define the loss function as
The remaining steps are feed-forward computation, backpropagation, and learning, as in the standard MLP. However, here, the backpropagation requires taking derivatives with respect to the matrix functions. These steps are described in Appendix E.
4.2 The General Form of the mMLP
We now discuss a general version of the mMLP which produces both a vector and an SPD matrix as outputs. One possible application of this model is for heteroscedastic multivariate regression (we do not pursue this application here). Another application is within the VAE formulation, as we discuss in more detail in Section5.
Let denote the input matrix. The corresponding outputs in this case are: which is an SPD matrix of trace one, and . The mMLP of hidden layers is shown as and constructed as:
where , , , , , . Just as in the standard MLP, is an activation function of choice, e.g., the hyperbolic tangent function.
The loss function needs to be designed with the specific application in mind. In the context of the VAEs, the default loss function is provided by the lower bound on the marginal likelihood. We will explore this choice in Section 5.
The remaining steps of feed-forward computation, backpropagation, and learning, are all described in Appendix F.
5 Exploiting the mMLP within the VAE
5.1 Background and Problem Formulation
Let indicate the set of i.i.d. observations on the real space . It is assumed that the data are generated by some random process involving a continuous latent variable
admitting a joint distributionparametrized by and . Here, is the generative model which is also known as the decoder, and is the prior. Let indicate the recognition model also known as the encoder, parametrized by . The distribution approximates the intractable true posterior . Kingma & Welling (2014) introduced a method based on variational inference (refer to Blei et al., 2017, for a recent review) for learning the recognition model parameters jointly with the generative model parameters and . This is done by maximizing a lower bound on the marginal log-likelihood of the data, also known as the evidence lower bound (ELBO),
is the Kullback-Leibler divergence (KLD) betweenand .
The vanilla VAE.
The parametric form of the recognition model is assumed to be a diagonal Gaussian distribution, with parameters and which are outputs of an MLP as a function of . Similarly for the observations on the continuous real space, the parametric form of the generative model is assumed to take on a diagonal Gaussian distribution, with parameters and which are outputs of an MLP as a function of .
The diagonality assumption.
The VAE formulation is general and as pointed out by Kingma & Welling (2014), the reason they insisted on using a diagonal covariance matrix on the Gaussian is to simplify the resulting inference. However, this choice comes with the cost of losing the dependencies between latent variables. It is fair to say that most extensions of the vanilla VAE make the same choice. The key reason for this choice is the computational complexity associated with maintaining a dense matrix.
Irrespective of the computational complexity, the question we are interested in is that whether there would be any significant gain in relaxing this assumption? When it comes to the recognition model, we know for a fact that it is never possible to have overfitting. Thus, it can only be advantageous to increase the expressiveness of the recognition network by expressing the posterior via more flexible family of parametric distributions, e.g., dense covariance Gaussian distributions. There is in fact an active line of research that is focused on recapturing some of the lost dependencies between the latent variables (see Section 2 for references). On the other hand, when it comes to the generative network, there might be the possibility of overfitting: The argument is that if the generative network is flexible enough, the VAE model may simply choose to ignore the latent variables, forcing the posterior to approach the prior. However, the question is how realistic this scenario really is? Perhaps more importantly, should this be a reason to instead force the generative network to take on a simple parametric form, e.g., the diagonal Gaussian distribution?
To get insights into these questions, we will consider several model alternatives with various degrees of flexibility on both the recognition model and the generative model. For this purpose, we will in the subsequent section replace the MLP with the mMLP which allows us to construct models with the dense covariance matrices. Two parametric families of distributions are considered: the multivariate Gaussian distribution and the mPE distribution. The latter drops the Gaussian assumption and provides a simple way to construct models with the various degree of flexibilities.
5.2 VAE via mMLP
), and finally we derive the estimators (Section5.2.3).
5.2.1 Parametric Families of Distributions
Trace-one Gaussian distribution.
The output layer of the mMLP model in (7) operates under the trace-one constraint. It is of course possible to drop the trace one constraint from the output layer, but we would then also be forced to use a different kernel function in that layer. Instead, we find it easier to work with a reparameterized Gaussian distribution with a trace-one covariance matrix. This allows us to use the same choice of kernel function (4) across all layers.
Trace-one power exponential distribution.
A generalized variant of the multivariate Gaussian distribution is provided by the mPE distribution (e.g., Gómez et al., 1998). As in the case of the Gaussian distribution, we find it easier to work with a reparameterized representation where the dispersion matrix is of trace one. Imposing this trace-one constraint, for a random variable , the trace-one mPE distribution can be expressed as
The pair of and are the scale and shape parameters of the density, is the mean vector, and is a symmetric real dispersion matrix where . The parameter has the same role as in the trace-one Gaussian distribution.
shows the probability density function of the distribution for various values ofand (for the case of ). As a special case, the mPE includes the Gaussian distribution: For and , the trace-one mPE distribution corresponds to the trace-one multivariate Gaussian distribution.
Stochastic representation (reparametrization trick). Let , , and . The density admits a known stochastic representation
where denotes the equality in distribution,
is a random vector uniformly distributed on the unit-sphere ofsuch that , and
is a positive scalar random variable distributed according to a known gamma distribution,
The shape of the gamma unfortunately depends on . As there is no known stochastic representation of the gamma distribution, it ultimately causes difficulties when we need to take the derivative of the random sample. There are techniques which can be used for this purpose (Ruiz et al., 2016; Naesseth et al., 2017). However, in our case, is a “second-level” parameter in the sense that it appears in (11) and not directly in (10
). For our specific case, we found that approximating the gamma with a normal distribution as the limiting distribution of the gamma to be sufficiently accurate. More specifically,
where . The error in the approximation (12) is expected to decrease as the shape parameter of the gamma distribution increases. A pessimistic upper bound is which follows from the Berry-Essèen theorem. In our numerical evaluations, we found that the use of (12) instead of (11) in (10) is quite well suited, surprisingly even in extreme cases where and is large (see the numerical comparisons in Appendix L).
Practical considerations. The pair of and control the tail and the shape of the distribution (see Figure M.1). Very large and very small values of and might be undesirable—for one, they pose numerical challenges. In practice, these parameters can be restricted within a range by choosing suitable output activation functions. In the experiments in Section 6.2, we choose to bound them conservatively as:
, using the sigmoid function.
5.2.2 Constructing Two Models
The parametric form of the recognition model is assumed to take on a trace-one Gaussian distribution, with parameters as the output of an mMLP as a function of , . Similarly, the generative model is assumed to take on a trace-one Gaussian distribution in the form of with parameters as the output of an mMLP as a function of , .
Power exponential model.
We can improve the flexibility of the recognition and the generative models by generalizing the Gaussian model to the mPE model. Here, it is assumed that the generative model takes on a trace-one mPE distribution in the form of with parameters as the output of an mMLP as a function of , . There are two alternatives when it comes to the recognition model:
To assume the same parametric choice for the recognition model as the generative model and define: , with parameters as the output of an mMLP as a function of , that is
. The disadvantage of this choice is that the KLD between the posterior and the prior will become analytically intractable. Thus, the estimator needs to rely on Monte Carlo sampling for the evaluation of the KLD term. This could increase the variance of the estimator. The advantage of this choice is that it gives additional flexibility to the recognition model which is desirable.
To restrict the recognition model to the dense Gaussian model and define: with parameters as the output of an mMLP as a function of , . This choice leads to an analytically tractable KLD computation.
We will consider both cases in our evaluation. Based on the choice of the generative and the recognition networks, various models can be constructed. Table 1 summarizes the list of model variants considered in this work. For all these models, the prior is assumed to follow a standard Gaussian distribution as .
Given , an estimator of the lower bound of the full dataset can be constructed based on mini-batches of size as , where is the estimate of the lower bound (8) using Monte Carlo samples. Based on the choice of the recognition model, the estimator is different. In the case of the Gaussian recognition model, is computed from
where is given by (H.2). In the case of the mPE recognition model,
where is given by (10).
Our experiments are divided into two parts. The first part is on the empirical validation of the mMLP model in a supervised task of learning SPD matrices using synthetic data. The second part evaluates the VAE models in an unsupervised task using real data.
6.1 SPD Matrix Learning
6.1.1 The Choice of Loss Function
Consider the problem of learning SPD matrices on synthetic data using the mMLP model of (5). The objectives are to validate the model, and to evaluate the effect of the choice of the loss function on the performance. In particular how the choice of the SPD manifold metric under Bergman matrix divergence affects the learning.
For this purpose, here, we consider two candidate loss functions in the family of Bergman matrix divergences. The first candidate is the loss function based on the normalized von Neumann divergence given by (6). The second candidate is based on the symmetrized Stein divergence given by Eq. (C.2). These two candidates are related to the Riemannian geometry. For the sake of comparison, we also consider the quadratic loss which is related to the Euclidean geometry.
Consider the set of inputs and corresponding SPD matrix outputs which are in this case dense PD covariance matrices (refer to Appendix K.1 for details on the data generation). The goal is to estimate the covariance matrices associated to the input vectors from the unseen test set, . The training size is varied between samples. The analysis is carried out for . Two examples of the test outputs for are shown in Figure M.4-A.
The mMLP models (5) are trained using 3 layers (20 units per layer) under three choices of loss functions: , , and . The only difference here is the loss function. Refer to Appendix K.1 for additional details on the mMLP initialization. The performance is evaluated on the test set, , in terms of all three losses as the error measures, shown as .
Table 2 summarizes the results of the evaluation. The first observation is that the quality of estimates differs considerably depending on the choice of the loss function. The loss function that takes into account the geometry of the SPD matrices outperforms the one based on the Euclidean geometry, . Between the two choices of Bergman divergences ( and ), the is clearly the best performer: It performs consistently well in terms of the various error measures, and shows robustness even in cases where the training data are limited. Figures M.4-B,C,D visualize the predicted covariance matrices for the case of and for two test samples.
For the sake of comparison, we also solved the same problem using the standard MLP as a regressor with the quadratic loss. To meet the SPD requirement, we simply used the Cholesky decomposition. In general, the performance was quite poor in comparison to the mMLP model (refer to Appendix K.2 for additional details).
6.1.2 Shallow vs Deep SPD Matrix Learning
The design of the mMLP model in (5) enables a mechanism for deep SPD matrix learning by satisfying the SPD constraint across all input, hidden and output layers. The simpler approach would be to consider the standard MLP architecture across input and hidden layers but make use of the activation matrix functions only at the output layer to meet the SPD requirement:
This amounts to a shallow design in the sense that it does not enable a mechanism for preserving the SPD constraint across all layers during the learning.
The design in (5) allows nonlinearities to pass through layers via activation function matrices which impose the SPD constraint, whereas in the shallow design, nonlinearities are propagated across layers via activation functions without imposing any constraints. Our hypothesis is that the former has advantage over the latter in that it captures complex dependencies which are important for the SPD matrix learning. Below we present a numerical example which highlights the importance of preserving the SPD constraint across all layers when learning the SPD matrix.
Consider a similar experiment as in Example 1 for the case of and output dimensions (using a different random seed from Example 1). We directly compare the performance of (5) against (15) under different number of hidden layers (20 units per layer). The shallow design (15) uses the hyperbolic tangent as the activation function . The same choice of the activation matrix function , given by (4), is used for both models. We use as the choice of the loss function (refer to Appendix K.3 for further details). The performance is evaluated in terms of .
Table 3 summarizes the results of the evaluation. Although the shallow design (15) performs relatively well, it underperforms in comparison to (5). Given the limited number of training samples, arbitrary increasing the number of layers may not be necessarily advantageous, which is the case for both models. However, in this regard, the design in (15) is fairly more sensitive.
6.2 Experiments using VAE Models
, where and . The kernel function is given by (4).