Maximum likelihood estimation of the Fisher-Bingham distribution via efficient calculation of its normalizing constant

04/30/2020 ∙ by Yici Chen, et al. ∙ The University of Tokyo

This paper proposes an efficient numerical integration formula to compute the normalizing constant of Fisher–Bingham distributions. The formula applies a numerical integration rule based on the continuous Euler transform to a Fourier-type integral representation of the normalizing constant. Because the method is fast and accurate, it can be applied to the normalizing constants of high-dimensional Fisher–Bingham distributions. More precisely, the error decays exponentially with the number of integration points, and the computational cost grows only linearly with the dimension. In addition, the formula is convenient for calculating the gradient and Hessian matrix of the normalizing constant. We therefore apply it to maximum likelihood estimation (MLE) for high-dimensional data. Finally, we apply the MLE to the hyperspherical variational auto-encoder (S-VAE), a deep-learning-based generative model that restricts the latent space to a unit hypersphere. Using an S-VAE trained on images of handwritten digits, we estimate the distribution of each label; this is useful for adding new labels to a trained model.


1 Introduction

1.1 Fisher–Bingham distribution

The Fisher–Bingham distribution is defined as a multivariate normal distribution restricted to the unit sphere.

Definition 1

For a -dimensional multivariate normal distribution with a mean

and a variance-covariance matrix

, the Fisher–Bingham distribution is given by the density function

where and

is the normalizing constant and is the uniform measure on the -dimensional sphere .
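For reference, a common way to write such a density (in our own notation, with mean parameter \mu, covariance \Sigma, and S^{p-1} the unit sphere in \mathbb{R}^p; the exact parameterization used in this paper may differ) is

f(x) = C(\mu,\Sigma)^{-1} \exp\!\Bigl(-\tfrac{1}{2}(x-\mu)^{\top}\Sigma^{-1}(x-\mu)\Bigr), \qquad x \in S^{p-1},
\qquad
C(\mu,\Sigma) = \int_{S^{p-1}} \exp\!\Bigl(-\tfrac{1}{2}(x-\mu)^{\top}\Sigma^{-1}(x-\mu)\Bigr)\,\mathrm{d}S(x),

where \mathrm{d}S denotes the uniform measure on S^{p-1}.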

The Fisher–Bingham distribution plays an essential role in directional statistics, which is concerned with data on various manifolds, especially data represented on a high-dimensional sphere. For example, wind direction and the geomagnetic field are common types of data that can be represented on a sphere. In addition, data on a hypersphere are used in link prediction for networks and in image generation. Therefore, the Fisher–Bingham distribution, a normal distribution restricted to the unit sphere, is commonly used in this field.

However, the spherical domain causes some problems when using Fisher–Bingham distributions. One such problem is calculating the normalizing constant. As it is difficult to calculate analytically, a numerical method is necessary. The saddlepoint approximation method is a numerical method for computing the normalizing constant developed by kume2005saddlepoint. Another approach, the holonomic gradient method considered by kume2018exact, also computes the normalizing constant. However, these methods have limitations. The saddlepoint approximation method is not as accurate as the holonomic gradient method, which is theoretically exact because the calculation of the normalizing constant is mathematically characterized by solving an ODE. However, the holonomic gradient method is computationally expensive and cannot be applied to the normalizing constants of high-dimensional distributions. Hence, a numerical method that is efficient, numerically stable, and accurate is needed.

To construct such a numerical method, the following details about Fisher–Bingham distributions are required (kume2018exact).

Since any orthogonal transformation in is isometric, the parameter dimensions are reduced from to

by singular value decomposition. Therefore, we have

where and

is the orthogonal matrix obtained from

. Thus, without loss of generality, we can assume that the variance-covariance matrix is diagonal. After reducing the parameter dimensions to , the normalizing constant becomes

where

and

Since is restricted on a unit sphere, we have

where is a real number and . If we put

then we have

As a result, if the normalizing constant is obtained, can also be obtained. Moreover, for the maximum likelihood estimation (MLE), as , can be shifted to for all . Additionally, because the unit sphere is symmetrical,

As a result, it can be assumed that has non-negative entries when calculating the normalizing constant.
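As a rough illustration of this preprocessing, the following sketch diagonalizes the quadratic part and normalizes signs and offsets. It uses our own parameterization, a density proportional to exp(-x^T A x + b^T x) on the unit sphere, and our own variable names; it is not necessarily the paper's exact reduction.

import numpy as np

def reduce_parameters(A, b):
    """Rotate so that the quadratic coefficient matrix A becomes diagonal and
    normalize the resulting parameters (illustrative preprocessing only)."""
    lam, V = np.linalg.eigh(A)        # A = V diag(lam) V^T, with V orthogonal
    gamma = V.T @ b                   # linear parameter in the rotated coordinates
    # On the unit sphere, adding a constant to every entry of lam only multiplies
    # the density by a constant, so the smallest entry can be shifted to zero.
    lam = lam - lam.min()
    # Reflections of the coordinate axes are isometries of the sphere, so the
    # entries of gamma can be taken non-negative (the sign flips and the shift
    # would have to be tracked to map estimates back to the original frame).
    gamma = np.abs(gamma)
    return lam, gamma, V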

1.2 Aim of this paper

In this paper,

  1. we propose an efficient numerical integration formula to compute the normalizing constant.

  2. we apply this formula to perform MLE.

  3. we apply MLE to the latent variables of the hyperspherical variational auto-encoder (S-VAE) (davidson2018hyperspherical).

The normalizing constant of Fisher–Bingham distributions can be represented in a Fourier-type integral form. Therefore, we can use the numerical integration formula with the continuous Euler transform introduced by ooura2001continuous. Note that the continuous Euler transform is useful both for calculating the normalizing constant and for performing MLE.

This method can be applied to the MLE of high-dimensional data, such as the latent variables of the S-VAE (davidson2018hyperspherical), a generative model used in machine learning. The dimension of the latent space of the hyperspherical variational auto-encoder depends on the complexity of the data; for human face data, for example, the latent variables may have about 100 dimensions.

1.3 Organization of this paper

This paper is organized as follows. In Section 2, we make some general remarks about the Fisher–Bingham distribution and the Fourier transform representation of the normalizing constant. In Section 3, we explain the continuous Euler transform and its use for numerical computation of the normalizing constant. In Section 4, we discuss the calculation of the gradient of the normalizing constant, which is necessary for MLE, and then provide the MLE algorithm. In Section 5, we present some numerical experiments on MLE to show the effectiveness of this method. In Section 6, we show an application of the MLE to the S-VAE, whose latent space consists of high-dimensional data on a hypersphere.

2 Fourier transform representation of the normalizing constant

2.1 Laplace inversion representation

This section explains how the normalizing constant can be represented in a simpler form, as derived by kume2005saddlepoint and kume2018exact. The first step is to change the range of integration of the normalizing constant from a -dimensional hypersphere to a one-dimensional line . The integration range is then shifted to . Note that the derivation can be somewhat technical, but the calculation itself is not difficult.

First, the distribution of

independent normal random variables

is

We then apply the variable transform

to and integrate it with respect to . Then, the marginalized distribution becomes

(1)

where

and

(2)

When , Equation (2) matches the definition of the normalizing constant . As a result, based on Equation (1), we obtain

(3)

Therefore, if the distribution can be represented in a one-dimensional integration form, the goal will be achieved.

The moment generating function of

is

Since , the moment generating function is the same as the Laplace transform of . Thus, with the inverse Laplace transform, we obtain

(4)

where . Substituting Equation (4) into Equation (3), we get

(5)

This is the Laplace inversion representation of the normalizing constant, which contains a one-dimensional integral of . The Fourier transform representation can easily be deduced by applying the variable transform to Equation (5).
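For reference, the inversion step used in Equation (4) is the standard Bromwich integral; in generic notation (ours, not the paper's), for a nonnegative random variable R with density f_R and Laplace transform \hat f_R, it reads

\hat f_R(t) = \int_0^{\infty} e^{-tr} f_R(r)\,\mathrm{d}r,
\qquad
f_R(r) = \frac{1}{2\pi i}\int_{c-i\infty}^{c+i\infty} e^{tr}\,\hat f_R(t)\,\mathrm{d}t,

for a suitable constant c lying to the right of all singularities of \hat f_R.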

2.2 Fourier-type integral representation

Theorem 1
(6)

where

We have thus derived a representation of the normalizing constant in the form of a Fourier-type integral. The next step is to apply a numerical integration formula to calculate this integral.

3 Continuous Euler transform for numerical calculation

3.1 Continuous Euler transform

Since the normalizing constant is represented as a Fourier-type integral in Equation (6), an accurate numerical integration formula is needed to calculate this integral. Note that the integrand decays slowly. With the usual trapezoidal formula, the slower the integrand decays, the slower the convergence of the numerical integration becomes, so it is necessary to reinforce the decay of the integrand to improve the accuracy. Here, we adopt the continuous Euler transform of ooura2001continuous, which accelerates the convergence of Fourier-type integrals. Moreover, tanaka2014error proved that combining the continuous Euler transform with the trapezoidal formula makes the integration converge rapidly. The details of the continuous Euler transform are provided by ooura2001continuous and tanaka2014error.

We apply the continuous Euler transform to Equation (6). Choose and to satisfy and . Choose . Let be an integer with

Let , and be defined by

Then, we get

(7)
(8)
(9)

where

Equation (8) uses the trapezoidal formula.
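To illustrate the effect of such a weight, the following sketch applies an erfc-type smooth cutoff, in the spirit of the continuous Euler transform (the exact weight and parameter choices of ooura2001continuous differ), to a slowly decaying Fourier-type test integral with a known value. The test integrand, omega, p, q, and h are our own illustrative choices, not the paper's.

import numpy as np
from scipy.integrate import trapezoid
from scipy.special import erfc

# Test problem: I(omega) = int_0^inf cos(omega x) / (1 + x^2) dx = (pi/2) exp(-omega).
omega = 1.0
exact = 0.5 * np.pi * np.exp(-omega)

h = 0.02                     # step size of the trapezoidal rule
p, q = 10.0, 3.0             # centre and width of the smoothed cutoff (our choice)
L = p + 6.0 * q              # the weight is essentially zero beyond this point
x = np.arange(0.0, L + h, h)
f = np.cos(omega * x) / (1.0 + x**2)

# Plain trapezoidal rule with an abrupt cutoff at x = p.
plain = trapezoid(f[x <= p], dx=h)

# Trapezoidal rule with an erfc-type weight decaying smoothly around x = p;
# the smooth cutoff suppresses the oscillatory truncation error.
w = 0.5 * erfc((x - p) / (np.sqrt(2.0) * q))
weighted = trapezoid(w * f, dx=h)

print(f"exact    = {exact:.8f}")
print(f"abrupt   = {plain:.8f}   (error {abs(plain - exact):.2e})")
print(f"weighted = {weighted:.8f}   (error {abs(weighted - exact):.2e})")

In this toy example, the smoothly weighted rule is typically far more accurate than an abrupt cutoff using the same nodes, which is the effect exploited by the continuous Euler transform.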

The accuracy of this numerical formula is established by Theorem 2.

Theorem 2

where is a polynomial of .

By this theorem, if the normalizing constant is approximated by , the error converges to as . In other words, any desired accuracy can be achieved if a sufficiently large is taken. For instance, if the normalizing constant is required with an error less than , then is sufficient. Moreover, even when the parameter dimension is about 100, the computation takes only about 20 ms, whereas the holonomic gradient method takes about 15 s with 10 dimensions (sei2015calculating). The details of the numerical experiments are provided in Section 5. An efficient numerical integration formula for the normalizing constant is thus obtained.

4 MLE optimization using the continuous Euler transform

The likelihood function of the Fisher–Bingham distribution has been provided by kume2018exact. Given an observed data matrix , putting and , the likelihood function is

with

and

Therefore, maximizing the likelihood function

is equivalent to minimizing

(10)

In this section, is also called the likelihood function, although

Thus, it is possible to optimize Equation (10) by iteratively updating the parameters so that the likelihood value decreases. First, we consider the optimization problem in and for a fixed . Then, the optimal can be obtained by minimizing while keeping the values of and fixed. The formula for updating has been published by kume2018exact. Although the outline for updating is given in this section, is fixed as the identity matrix in the numerical experiments, that is, is assumed to be diagonal.

It is necessary to calculate the partial derivatives of when optimizing and .

(11)

and

(12)

The partial derivatives of are needed to calculate Equation (11) and Equation (12).

By Theorem 1, the Fourier transform representation of the normalizing constant becomes

where

By changing the order of integration and differentiation, the partial derivatives of become

The continuous Euler transform can also be applied to these calculations to numerically calculate the derivatives efficiently.
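Written generically (our notation, with a generic parameter \theta_j and g the integrand of the Fourier-type representation), the interchange is

\frac{\partial C}{\partial \theta_j}
  = \frac{\partial}{\partial \theta_j}\int g(t;\theta)\,\mathrm{d}t
  = \int \frac{\partial g}{\partial \theta_j}(t;\theta)\,\mathrm{d}t,

so the same weighted trapezoidal formula can be applied to each partial-derivative integrand.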

If we put

then we get

(13)
(14)

Therefore, the derivatives of the normalizing constant, as well as the derivatives of the likelihood function, can be calculated using Equations (11) and (12). As a result, it is possible to optimize and with a fixed .

The algorithm to optimize with fixed and values was developed by kume2018exact, so its derivation is not repeated in this paper.

(15)

and

(16)

must be symmetrical for to be an optimal orthogonal matrix (kume2018exact). Moreover, if is symmetrical, a curve with reduces to a single point; this can be used as a stopping criterion. The proof of this stopping criterion was provided by kume2018exact.

Now, we have obtained all the parts necessary to optimize the likelihood function. Consequently, we can describe the approach for obtaining the MLE of Fisher–Bingham distributions. This algorithm performs the same steps as the algorithm given by kume2018exact, but the method used to calculate the derivatives of the likelihood function and the normalizing constant is different: whereas Kume and Sei used the holonomic gradient method, the continuous Euler transform is adopted in this paper.

Algorithm

(the gradient descent method)
Update the given , , and as follows until the derivatives with respect to , , and become small enough:

  1. The partial derivatives of the likelihood function are obtained by substituting Equation (13) into Equation (11), and is a real number such that

  2. The partial derivatives of the likelihood function are obtained by substituting Equation (14) into Equation (12), and is a real number such that

  3. is obtained by substituting Equation (15) into Equation (16), and is a real number such that

This algorithm is based on the gradient descent method, which converges linearly. If faster convergence is required, it is preferable to use a quasi-Newton method.
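As a structural sketch of this procedure (our own code: `neg_log_likelihood` and `grad` are hypothetical callables standing in for Equations (10)-(14), and the backtracking step-size rule is a simple substitute for the step-size conditions stated above):

import numpy as np

def mle_gradient_descent(x0, neg_log_likelihood, grad, tol=1e-6, max_iter=10_000):
    """Generic gradient descent with backtracking line search.

    x0                  initial parameter vector (e.g. the concatenated parameters)
    neg_log_likelihood  objective to be minimized, as in Equation (10)
    grad                its gradient, assembled as in Equations (11)-(14)
    """
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:          # stopping criterion: small derivatives
            break
        step, f0 = 1.0, neg_log_likelihood(x)
        # Backtracking: shrink the step until the objective decreases.
        while neg_log_likelihood(x - step * g) > f0 and step > 1e-12:
            step *= 0.5
        x = x - step * g
    return x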

5 Numerical experiments

5.1 Normalizing constant

We compare our method with the holonomic gradient method and the saddlepoint approximation method for calculating the normalizing constant of the Bingham distribution. The Bingham distribution is a special case of the Fisher–Bingham distribution obtained by fixing .

Definition 2

For a -dimensional multivariate distribution, the Bingham distribution is given by the density function

where and

is the normalizing constant and is the uniform measure on the -dimensional sphere .

Theorem 2 is used to compute the normalizing constant of the Bingham distribution. More precisely, the numerical integration formula is

where

The parameters are the same as those used by sei2015calculating, and the results of the holonomic gradient method and the saddlepoint approximation method are taken from sei2015calculating. In our method, the parameter is fixed at 200. The holonomic gradient method is theoretically exact because the problem is mathematically characterized by an ODE, which is then solved numerically.

The accuracy of the holonomic gradient method and our method is very high. As shown in Tables 1 to 4, the normalizing constants with different parameters calculated by these methods (the hg and ce columns) coincide to 6 decimal places. However, the saddlepoint approximation method is not as good because its error is larger than . Moreover, the calculation time of our method with 100 dimensions is about 20 ms, whereas the holonomic gradient method takes about 15 s with 10 dimensions (sei2015calculating).

In Table 1, columns 2-4 compare the saddlepoint approximation (spa), the holonomic gradient method (hg), and our method (ce) for ; columns 5-7 compare the same quantities for .

spa hg ce spa hg ce
5 4.237006 4.238950 4.238950 3.376766 3.372017 3.372017
10 2.982628 2.985576 2.985576 1.689684 1.689355 1.689355
30 1.708766 1.711919 1.711919 0.555494 0.556123 0.556123
50 1.321178 1.323994 1.323994 0.332102 0.332661 0.332661
100 0.932895 0.935094 0.935094 0.165587 0.165940 0.165940
200 0.659185 0.660814 0.660814 0.082676 0.082871 0.082871
Table 1: The normalizing constant of Bingham distributions with in columns 2, 3, and 4 and in columns 5, 6, and 7

The complex Bingham distribution, which is defined on a unit complex sphere, can be calculated analytically (kume2005saddlepoint).

Definition 3

For a -dimensional multivariate distribution, the complex Bingham distribution is given by the density function

where and

is the normalizing constant and is the uniform measure on the -dimensional complex sphere .

In Table 2, we compare the saddlepoint approximation (spa), the exact value (ex), the holonomic gradient method (hg), and our method (ce) for the complex Bingham distribution with parameters , that is, .

spa ex hg ce
5 5.942975 5.936835 5.936835 5.936835
10 3.429004 3.425468 3.425468 3.425468
30 1.248280 1.246421 1.246421 1.246421
50 0.761347 0.760180 0.760180 0.760180
100 0.385272 0.384675 0.384675 0.384675
200 0.193779 0.193477 0.193477 0.193477
Table 2: The normalizing constant of complex Bingham distributions with

In Table 3, columns 2-4 compare the saddlepoint approximation (spa), the holonomic gradient method (hg), and our method (ce) for ; columns 5-7 compare the same quantities for .

spa hg ce spa hg ce
5 1.258672 1.273161 1.273161 1.032128 1.044072 1.044072
10 0.874523 0.883394 0.883394 0.500707 0.505223 0.505223
30 0.497757 0.503213 0.503213 0.162251 0.163901 0.163901
50 0.384440 0.388775 0.388775 0.096784 0.097828 0.097828
100 0.271249 0.274375 0.274375 0.048182 0.048725 0.048725
200 0.191595 0.193826 0.193826 0.024039 0.024316 0.024316
Table 3: The normalizing constant of Bingham distributions with in columns 2, 3, and 4 and in columns 5, 6, and 7

In Table 4, we compare the saddlepoint approximation (spa), the exact value (ex), the holonomic gradient method (hg), and our method (ce) for the complex Bingham distribution with parameters , that is, .

spa ex hg ce
5 0.921027 0.921726 0.921726 0.921726
10 0.506236 0.506341 0.506341 0.506341
30 0.177602 0.177495 0.177495 0.177495
50 0.177602 0.107458 0.107458 0.107458
100 0.054115 0.054081 0.054081 0.054081
200 0.027144 0.027127 0.027127 0.027127
Table 4: The normalizing constant of complex Bingham distributions with

The advantage of our method is its efficiency in high-dimensional cases. The calculation time of the normalizing constant for various dimensions is shown in Figure 1. The parameters and are determined by the random number generator in the C++ library and .

Figure 1: The computation time with respect to the dimensions

The calculation is very rapid and stable, and the calculation time increases linearly with the number of dimensions. Since the derivatives of the normalizing constant are calculated with the same numerical integration formula, they can also be computed efficiently. The MLE algorithm introduced in Section 4 is based on the gradient descent method, so the bottleneck lies in the optimization method rather than in the calculation of the gradients, which is not expensive. Since the gradient descent method converges linearly, it is not efficient enough for high-dimensional data; this problem can be solved using a quasi-Newton method.

5.2 MLE

In this section, we show some numerical experiments on MLE using data of several dimensions. The data are obtained by rejection sampling, and is assumed to be diagonal, that is, is not estimated.
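A minimal sketch of such a sampler follows. It uses our own parameterization of the unnormalized density, exp(x^T diag(lam) x + gamma^T x), and a uniform proposal on the sphere; the sampler actually used for the experiments may differ.

import numpy as np

def sample_fisher_bingham(lam, gamma, n, seed=None):
    """Rejection sampling from a density proportional to
    exp(x @ diag(lam) @ x + gamma @ x) on the unit sphere."""
    rng = np.random.default_rng(seed)
    lam, gamma = np.asarray(lam, float), np.asarray(gamma, float)
    p = lam.size
    # Upper bound of the unnormalized log-density on the sphere:
    # x^T diag(lam) x <= max(lam) and gamma^T x <= ||gamma|| whenever ||x|| = 1.
    log_bound = lam.max() + np.linalg.norm(gamma)
    samples = []
    while len(samples) < n:
        x = rng.standard_normal(p)
        x /= np.linalg.norm(x)                    # uniform point on the sphere
        log_f = x @ (lam * x) + gamma @ x         # unnormalized log-density
        if np.log(rng.uniform()) < log_f - log_bound:
            samples.append(x)
    return np.array(samples)

# Example: 2000 points on the sphere in 3-dimensional ambient space.
pts = sample_fisher_bingham(lam=[0.0, 2.0, 5.0], gamma=[1.0, 0.0, 0.0], n=2000)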

5.2.1 2-dimensional data

For a data matrix (Figure 2), where

the following MLE values

are obtained.

Figure 2: Histogram of the sampling points on a unit circle

5.2.2 3-dimensional data

For a data matrix (Figure 3), where

the following MLE values

are obtained.

Figure 3: Sampling points on a unit sphere

5.2.3 10-dimensional data

For a data matrix , where

the following MLE values

are obtained.

6 Application of Fisher–Bingham distribution to S-VAE

6.1 Introduction to S-VAE

As mentioned in Section 5, our method makes it possible to perform MLE on high-dimensional data, which cannot be achieved using other methods such as the holonomic gradient method. Because of this advantage, we can apply our method to the MLE of the latent variables of variational auto-encoders (VAE).

It is essential to understand the auto-encoder (AE) mechanism before introducing the VAE. An AE is a generative model with two networks: one is called the encoder and the other the decoder. The encoder transforms images into vectors with fewer dimensions than the images themselves. For example, MNIST is a dataset of handwritten digits from 0 to 9, and each image has 28 × 28 pixels. With the encoder of the AE, a 28 × 28-dimensional image is transformed into a vector with 2 to 20 dimensions. As the images become more complicated, the dimension of the latent space increases; for instance, the latent space for human face data has about 100 dimensions.
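A minimal encoder/decoder pair of this kind might look like the following sketch (PyTorch; the layer sizes and latent dimension are our own illustrative choices, not the architecture of davidson2018hyperspherical):

import torch
from torch import nn

class AutoEncoder(nn.Module):
    """Minimal auto-encoder: 28*28 pixels -> low-dimensional code -> 28*28 pixels."""
    def __init__(self, latent_dim=10):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Flatten(),
            nn.Linear(28 * 28, 256), nn.ReLU(),
            nn.Linear(256, latent_dim),
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 256), nn.ReLU(),
            nn.Linear(256, 28 * 28), nn.Sigmoid(),   # pixel intensities in [0, 1]
        )

    def forward(self, x):
        z = self.encoder(x)                           # latent code
        return self.decoder(z).view(-1, 1, 28, 28)    # reconstructed image

# Training minimizes a reconstruction loss, e.g. nn.BCELoss(), over MNIST batches.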

Figure 4: Auto-encoder

The VAE is a generative model derived from the AE. The difference between the VAE and the AE is that the VAE assumes that the vectors obtained from images are generated from some distributions (doersch2016tutorial; kingma2013auto). The encoder estimates the parameters of these distributions, while the decoder generates an image from a vector sampled from the estimated distributions. In most cases, the vectors are assumed to be generated from normal distributions in Euclidean space, and in this section "VAE" refers to this setting. For MNIST, 10 normal distributions correspond to the digits 0 to 9; if we decode a vector drawn from the distribution of the digit zero, something like a zero will probably be generated. Thus, the encoder of the VAE transforms images into distribution parameters, such as means and variances, instead of vectors themselves. The advantage of the VAE over the AE is that it gives structure to the latent variables.

Figure 5: Variational auto-encoder

S-VAE is a VAE assuming that the vectors are generated from distributions restricted to a high-dimensional unit hypersphere. Moreover, the vectors are assumed to be generated from the von Mises–Fisher distribution.

Definition 4

For a -dimensional multivariate distribution, the von Mises–Fisher distribution is given by the density function

where and

is the normalizing constant and is the uniform measure on the -dimensional sphere .

The von Mises–Fisher distribution is a special case of the Fisher–Bingham distribution obtained by restricting for all . When , the von Mises–Fisher distribution becomes the uniform distribution on the unit sphere.
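For reference, the standard form of the von Mises–Fisher density on S^{p-1} (our notation: mean direction \mu with \|\mu\|=1 and concentration \kappa \ge 0, written with respect to the surface measure on the sphere) is

f(x;\mu,\kappa) = C_p(\kappa)\,\exp\!\bigl(\kappa\,\mu^{\top}x\bigr), \qquad x \in S^{p-1},
\qquad
C_p(\kappa) = \frac{\kappa^{p/2-1}}{(2\pi)^{p/2}\, I_{p/2-1}(\kappa)},

where I_\nu is the modified Bessel function of the first kind; \kappa = 0 gives the uniform distribution on the sphere.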

davidson2018hyperspherical proposed two reasons to assume that the latent space is a unit sphere rather than Euclidean space. First, the S-VAE allows a uniform distribution on the hypersphere to be used as a prior, which is a truly uninformative prior; since a uniform distribution on Euclidean space does not exist, the VAE must make some informative assumption about the prior. Second, with high-dimensional data, the "soap-bubble effect", where the latent variables concentrate on a hyperspherical shell, is observed. Therefore, assuming the latent space to be a Euclidean hyperplane would be improper because is not homeomorphic to .

The AE, VAE, and S-VAE are unsupervised machine learning models, which means that the labels of the images are not used to train the networks. However, images with high similarity, such as several handwritten images of the digit one, end up close to each other in the latent space. Therefore, clusters form in the latent space, as shown in Figures 6 and 7 (davidson2018hyperspherical), where each color corresponds to a label from 0 to 9. Figure 7 uses the Hammer projection, which maps a sphere onto a 2-dimensional plane. The latent-variable representation of each image can thus be obtained with a pre-trained model, and the parameters of these clusters can then be estimated by MLE. The algorithm of Section 4 can be applied specifically to the S-VAE.

Figure 6: Latent space of the VAE (davidson2018hyperspherical)
Figure 7: Latent space of the S-VAE (davidson2018hyperspherical)

In addition, the conditional VAE (CVAE) is a supervised model that uses the label information to train the networks.

Figure 8: The conditional variational auto-encoder

It is common to consider adding more detailed labels to a CVAE. For example, if a CVAE pre-trained on human faces is labeled with continents, such as Asia, Europe, and Africa, we may want to add finer details to those labels. Similarly, consider a CVAE pre-trained on images of fish labeled with different species: when a new species is discovered, we may want to add this new species as a label to the network. In most cases, the networks are retrained on the dataset with the new labels appended, but the computational cost is high. For example, training the VAE of MNIST with the latent space of using a CPU (Dell Latitude 5480) takes several hours in our experiment.

If the latent variables are assumed to be generated from Fisher–Bingham distributions, the distribution of the images with a new label can be obtained by estimating its parameters via MLE. Since the computational cost of MLE is much lower than that of retraining the CVAE, new labels can be appended much more easily; in the VAE of MNIST, the estimation of the parameters takes only seconds. Therefore, the MLE algorithm of Section 4 can be applied to the S-VAE. Note that other methods for the MLE of the Fisher–Bingham distribution, such as the holonomic gradient method and the saddlepoint approximation method, are not appropriate for high-dimensional data.
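As a sketch of this workflow (the names `pretrained_encoder` and `fit_fisher_bingham` are hypothetical placeholders; the latter stands for the MLE algorithm of Section 4):

import numpy as np
import torch

def latent_codes(encoder, images):
    """Map images of a (possibly new) label to points on the unit hypersphere."""
    with torch.no_grad():
        z = encoder(images)                          # latent directions from the encoder
    z = z.numpy()
    return z / np.linalg.norm(z, axis=1, keepdims=True)

# new_images: a small batch of images carrying the new label.
# Z = latent_codes(pretrained_encoder, new_images)
# params = fit_fisher_bingham(Z)                     # MLE from Section 4
# The fitted distribution can then be sampled and decoded to generate images
# of the new label without retraining the network.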

6.2 Application results

In this section, the handwritten digit dataset MNIST is used to train the S-VAE. Experiments with the latent spaces and are shown. The models are trained following davidson2018hyperspherical (https://github.com/nicola-decao/s-vae-pytorch). It is necessary to add a sigmoid layer to the decoder, and the training is extended from 1000 to 2000 epochs with early stopping at 50. Apart from these changes, the training procedure is mostly the same as that of davidson2018hyperspherical. Note that the experimental results are not accurate enough; in this section, the accuracy of a model roughly means the degree of similarity of the generated images to handwritten digits, as judged visually by the authors. Some concerns therefore remain about the accuracy of the models and the validity of the Fisher–Bingham distribution assumption.

6.2.1 Latent space

The images generated by S-VAE are shown in Figure 9.

Figure 9: Images generated with the -VAE

The latent space is assumed to be a 2-dimensional unit sphere so that it can be visualized easily. However, in terms of model accuracy, this is probably not enough to learn the features of the -dimensional images. Therefore, the images generated by the model are not accurate; specifically, the features of the handwritten 2, 4, and 5 are not learned completely.

For example, the distribution of the latent variables of labels 0 and 1 is illustrated in Figure 10. The regions of label 0 and label 1 are clearly separated, and the data are concentrated. However, the latent variables of label 2 are spread over the sphere, as shown in Figure 11.

Figure 10: Clusters of labels 0 and 1
Figure 11: Cluster of label 2

Given the latent variables mentioned above, the following MLE values for each label are obtained and listed in Table 5.

label 0 1 2 3 4