1 Introduction
1.1 Fisher–Bingham distribution
The Fisher–Bingham distribution is defined as a multivariate normal distribution restricted to the unit sphere.
Definition 1
For a multivariate normal distribution with a given mean vector and variance-covariance matrix, the Fisher–Bingham distribution is given by the density function proportional to the normal density restricted to the unit sphere, where the proportionality factor is the normalizing constant and the reference measure is the uniform measure on the sphere.
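As a quick sanity check on Definition 1, the normalizing constant can be estimated by naive Monte Carlo over the sphere. The parametrization exp(γᵀx − xᵀAx) below is an assumption for illustration (one common form of the Fisher–Bingham exponent after expanding the restricted normal density); the estimator itself only uses the fact that the constant is an integral against the uniform measure on the sphere.

```python
import math
import numpy as np

def fb_normalizing_constant_mc(gamma, A, n=200_000, seed=0):
    """Naive Monte Carlo estimate of C = ∫_{S^{p-1}} exp(γᵀx − xᵀAx) dσ(x).

    Uniform points on the sphere are obtained by normalizing Gaussian
    vectors; the integral equals the sphere's surface area times the
    mean of the integrand over those points.
    """
    p = len(gamma)
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((n, p))
    x /= np.linalg.norm(x, axis=1, keepdims=True)
    vals = np.exp(x @ gamma - np.einsum("ij,jk,ik->i", x, A, x))
    area = 2.0 * math.pi ** (p / 2) / math.gamma(p / 2)  # surface area of S^{p-1}
    return area * vals.mean()

# Sanity check against the von Mises-Fisher special case (A = 0):
# for γ = (0, 0, κ) on S², the constant is 4π sinh(κ)/κ.
estimate = fb_normalizing_constant_mc(np.array([0.0, 0.0, 1.0]), np.zeros((3, 3)))
exact = 4.0 * math.pi * math.sinh(1.0)
```

Monte Carlo is far too slow and inaccurate for the high-dimensional settings targeted in this paper, but it provides an independent reference value for low dimensions.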
The Fisher–Bingham distribution plays an essential role in directional statistics, which is concerned with data on various manifolds, especially data represented on a high-dimensional sphere. For example, wind direction and the geomagnetic field are common types of data that can be represented on a sphere. In addition, data on a hypersphere are used in link prediction for networks and in image generation. Therefore, the Fisher–Bingham distribution, a normal distribution restricted to the unit sphere, is commonly used in this field.
However, the spherical domain causes some problems when using Fisher–Bingham distributions. One such problem is calculating the normalizing constant. As it is difficult to calculate analytically, a numerical method is necessary. The saddlepoint approximation method of kume2005saddlepoint is one numerical method for computing the normalizing constant. Another approach, the holonomic gradient method considered by kume2018exact, computes the normalizing constant as well. However, these methods have limitations. The saddlepoint approximation method is not as accurate as the holonomic gradient method, which is theoretically exact because the calculation of the normalizing constant is mathematically characterized by an ODE. The holonomic gradient method, on the other hand, is computationally expensive and cannot be applied to the normalizing constant of high-dimensional distributions. Hence, a numerical method that is efficient, numerically stable, and accurate is needed.
To construct such a numerical method, the following details about Fisher–Bingham distributions are required (kume2018exact).
Since any orthogonal transformation in is isometric, the parameter dimensions are reduced from to
by singular value decomposition. Therefore, we have
where and
is the orthogonal matrix obtained from
. Thus, without loss of generality, we can assume that the variance-covariance matrix is diagonal. After reducing the parameter dimensions, the normalizing constant becomes the following, where
and
Since is restricted on a unit sphere, we have
where is a real number and . If we put
then we have
As a result, if the normalizing constant is obtained, can also be obtained. Moreover, for the maximum likelihood estimation (MLE), as , can be shifted to for all . Additionally, because the unit sphere is symmetrical,
As a result, it can be assumed that the parameter vector has nonnegative entries when calculating the normalizing constant.
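The diagonalization step above can be sketched numerically. Assuming the exponent has the common form γᵀx − xᵀAx with a symmetric matrix A (the symbols here are illustrative, not the paper's notation), the spectral decomposition A = V diag(λ) Vᵀ together with the rotation y = Vᵀx leaves the exponent, and hence the normalizing constant, unchanged:

```python
import numpy as np

def diagonalize_params(gamma, A):
    """Rotate the parameters so the quadratic part of the exponent
    becomes diagonal.  For symmetric A = V diag(λ) Vᵀ and y = Vᵀx,
    γᵀx − xᵀAx = (Vᵀγ)ᵀy − yᵀ diag(λ) y, so without loss of
    generality the quadratic part can be assumed diagonal."""
    lam, V = np.linalg.eigh(A)
    return V.T @ gamma, lam, V

rng = np.random.default_rng(1)
A = rng.standard_normal((4, 4))
A = (A + A.T) / 2                       # make A symmetric
gamma = rng.standard_normal(4)
g2, lam, V = diagonalize_params(gamma, A)

x = rng.standard_normal(4)
x /= np.linalg.norm(x)                  # a point on the unit sphere S^3
y = V.T @ x                             # rotated point, still on S^3
lhs = gamma @ x - x @ A @ x             # exponent in original parameters
rhs = g2 @ y - (lam * y) @ y            # exponent in diagonalized parameters
```

Since the rotation is an isometry of the sphere, the uniform measure is preserved and the normalizing constant is identical in both parametrizations.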
1.2 Aim of this paper
In this paper,
- we propose an efficient numerical integration formula to compute the normalizing constant;
- we apply this formula to perform MLE;
- we apply the MLE to the latent variables of the hyperspherical variational autoencoder (SVAE) (davidson2018hyperspherical).
The normalizing constant of Fisher–Bingham distributions can be represented as a Fourier-type integral. Therefore, we can use the numerical integration formula with the continuous Euler transform introduced by ooura2001continuous. The continuous Euler transform is useful for calculating both the normalizing constant and the MLE.
This method can be applied to the MLE of high-dimensional data, such as the latent variables of the SVAE (davidson2018hyperspherical), a generative model used in machine learning. The dimension of the latent space of the hyperspherical variational autoencoder depends on the complexity of the data. For example, for human face data, there may be about 100 latent dimensions.
1.3 Organization of this paper
This paper is organized as follows. In Section 2, we make some general remarks about the Fisher–Bingham distribution and the Fourier transform representation of the normalizing constant. In Section 3, we explain the continuous Euler transform and its use for numerical computation of the normalizing constant. In Section 4, we discuss the calculation of the gradient of the normalizing constant, which is necessary for MLE, and provide the MLE algorithm. In Section 5, we present numerical MLE experiments to show the effectiveness of this method. In Section 6, we show an application of the MLE to the SVAE, whose latent space consists of high-dimensional data on a hypersphere.
2 Fourier transform representation of the normalizing constant
2.1 Laplace inversion representation
This section explains how the normalizing constant can be represented in a simpler form, as derived by kume2005saddlepoint and kume2018exact. The first step is to change the domain of integration of the normalizing constant from the hypersphere to a one-dimensional line. The integration range is then shifted. Note that the derivation is somewhat technical, but the calculation itself is not difficult.
First, the distribution of
independent normal random variables
is as follows. We then apply the variable transform
to and integrate it with respect to . Then, the marginalized distribution becomes
(1) 
where
and
(2) 
When , Equation (2) matches the definition of the normalizing constant . As a result, based on Equation (1), we obtain
(3) 
Therefore, if the distribution can be represented in a one-dimensional integral form, the goal will be achieved.
The moment generating function of
is given below. Since the random variable is nonnegative, the moment generating function is the same as the Laplace transform of its density. Thus, with the inverse Laplace transform, we obtain
(4) 
where . Substituting Equation (4) into Equation (3), we get
(5) 
This is the Laplace inversion representation of the normalizing constant, which contains a oneparameter integration of . The Fourier transform representation can be easily deduced by applying the variable transform to Equation (5).
2.2 Fouriertype integral representation
Theorem 1
(6) 
where
We have thus derived the representation of the normalizing constant in the form of a Fourier-type integral. The next step is to apply a numerical integration formula to calculate this Fourier integral.
3 Continuous Euler transform for numerical calculation
3.1 Continuous Euler transform
Since the normalizing constant is represented as a Fourier-type integral in Equation (6), an accurate numerical integration formula is needed to evaluate it. Note, however, that the integrand decays slowly. With the usual trapezoidal formula, the slower the integrand decays, the slower the numerical integration converges, so it is necessary to reinforce the decay of the integrand to improve the accuracy. Here, we adopt the continuous Euler transform of ooura2001continuous, which accelerates the convergence of Fourier transforms. Moreover, tanaka2014error proved that combining the continuous Euler transform with the trapezoidal formula makes the integration converge rapidly. The details of the continuous Euler transform are given by ooura2001continuous and tanaka2014error.
We apply the continuous Euler transform to Equation (6). Choose and to satisfy and . Choose . Let be an integer with
Let , and be defined by
Then, we get
(7)  
(8)  
(9) 
where
Equation (8) uses the trapezoidal formula.
The accuracy of the numerical formula is proved in Theorem 2.
Theorem 2
where is a polynomial of .
With this theorem, if the normalizing constant is approximated by this formula, the error converges to zero as the number of sample points grows. In other words, any desired accuracy can be achieved with a practically large enough number of points. Subsequently, if the parameter dimension is about 100, the computation takes only about 20 ms, whereas the holonomic gradient method takes about 15 s with 10 dimensions (sei2015calculating). The details of the numerical experiments are provided in Section 5. An efficient numerical integration formula for the normalizing constant is thus obtained.
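The mechanism of the weighted trapezoidal formula can be illustrated with a small sketch. The erfc-shaped smooth cutoff below is used purely for illustration; the actual continuous Euler transform weight and parameter choices of ooura2001continuous and tanaka2014error differ in detail, but the principle is the same: multiplying a slowly decaying integrand by a smooth, rapidly decaying weight lets the trapezoidal sum converge quickly.

```python
import math
import numpy as np

def weighted_fourier_trapezoid(f, omega, p=30.0, q=4.0, h=0.1):
    """Trapezoidal approximation of ∫ f(t) e^{iωt} dt with a smooth
    erfc-shaped cutoff that reinforces the decay of f.  (An
    illustrative stand-in for the continuous Euler transform weight;
    the weight is ~1 for |t| << p and decays smoothly to 0 around
    |t| = p over a width of order q.)"""
    L = p + 6.0 * q                       # weight is negligible beyond here
    t = np.arange(-int(L / h), int(L / h) + 1) * h
    w = 0.5 * np.array([math.erfc((abs(s) - p) / q) for s in t])
    return h * np.sum(w * f(t) * np.exp(1j * omega * t))

# Slowly decaying example with a known transform:
#   ∫ e^{iωt} / (1 + t²) dt = π e^{-|ω|}
f = lambda t: 1.0 / (1.0 + t ** 2)
approx = weighted_fourier_trapezoid(f, omega=1.0)
exact = math.pi * math.exp(-1.0)
```

A hard truncation of the same sum at |t| = L would leave an oscillatory tail error decaying only algebraically; the smooth cutoff suppresses it far below that level at the same truncation point.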
4 MLE optimization using the continuous Euler transform
The likelihood function of the Fisher–Bingham distribution has been provided by kume2018exact. With an observed data matrix and the sample statistics defined below, the likelihood function is
with
and
Therefore, maximizing the likelihood function
is equivalent to minimizing
(10) 
In this section, this objective is also called the likelihood function, although strictly speaking it is derived from the negative log-likelihood.
Thus, Equation (10) can be optimized by iteratively updating the parameters so as to decrease the objective value. First, we consider the optimization problem in the other parameters for a fixed orthogonal matrix. Then, the optimal orthogonal matrix can be obtained by minimizing the objective while keeping the other parameters fixed. The formula for updating the orthogonal matrix has been published by kume2018exact. Although the outline of this update is given in this section, in the numerical experiments the orthogonal matrix is fixed to the identity, that is, the variance-covariance matrix is assumed to be diagonal. It is necessary to calculate the partial derivatives of the objective when optimizing the parameters.
(11) 
and
(12) 
The partial derivatives of are needed to calculate Equation (11) and Equation (12).
In Theorem 1, the Fourier transform representation of the normalizing constant becomes
where
By changing the order of integration and differentiation, the partial derivatives of become
The continuous Euler transform can also be applied to these calculations to numerically calculate the derivatives efficiently.
If we put
Then, we get
(13) 
(14) 
Therefore, the derivatives of the normalizing constant, and hence the derivatives of the likelihood function, can be calculated using Equations (11) and (12). As a result, it is possible to optimize the parameters with a fixed orthogonal matrix.
The algorithm to optimize the orthogonal matrix with the other parameters fixed was developed by kume2018exact, so it is not reproduced in this paper.
(15) 
and
(16) 
must be symmetric for the solution to be an optimal orthogonal matrix (kume2018exact). Moreover, if this matrix is symmetric, the corresponding update curve reduces to a single point; this can be used as a stopping criterion. The proof of this stopping criterion was provided by kume2018exact.
Now, we have obtained all the parts necessary to optimize the likelihood function. Consequently, we can discuss the approach for obtaining the MLE of Fisher–Bingham distributions. This algorithm performs the same steps as the algorithm given by kume2018exact, but the method used to calculate the derivatives of the likelihood function and the normalizing constant is different: whereas Kume and Sei used the holonomic gradient method, we adopt the continuous Euler transform.
Algorithm
(the gradient descent method)
Update the given parameters as follows until the derivatives with respect to each of them become small enough:
This algorithm is based on the gradient descent method, which converges linearly. If faster convergence is required, it is preferable to use the quasi-Newton method.
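The update loop of the algorithm can be sketched generically. The names below are illustrative, not from the paper: in the actual algorithm, `grad` would assemble the likelihood derivatives of Equations (11) and (12) from the quadrature formulas (13) and (14); here a toy convex quadratic stands in for the objective.

```python
import numpy as np

def gradient_descent_mle(theta0, grad, step=0.1, tol=1e-8, max_iter=10_000):
    """Generic gradient-descent loop of the kind used in Section 4:
    repeatedly step against the gradient until its norm is small.
    `grad` is a callable returning the gradient of the objective at
    the current parameter vector."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(max_iter):
        g = grad(theta)
        if np.linalg.norm(g) < tol:
            break
        theta = theta - step * g
    return theta

# Toy check on a strictly convex quadratic with minimum at (1, -2):
grad = lambda th: 2.0 * (th - np.array([1.0, -2.0]))
theta_hat = gradient_descent_mle(np.zeros(2), grad)
```

Swapping this loop for a quasi-Newton update (e.g. BFGS) changes only the step computation; the gradient evaluations via the continuous Euler transform are reused unchanged.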
5 Numerical experiments
5.1 Normalizing constant
We compare our method with the holonomic gradient method and the saddlepoint approximation method for calculating the normalizing constant of the Bingham distribution. The Bingham distribution is the special case of the Fisher–Bingham distribution in which the mean parameter is fixed to zero.
Definition 2
For a dimensional multivariate distribution, the Bingham distribution is given by the density function
where and
is the normalizing constant and is the uniform measure in the dimensional sphere .
Theorem 2 is used to compute the normalizing constant of the Bingham distribution. More precisely, the numerical integration formula is
where
The same parameters as in sei2015calculating were used, and the results of the holonomic gradient method and the saddlepoint approximation method were obtained from sei2015calculating. In our method, the parameter is fixed to 200. The holonomic gradient method is theoretically exact in the sense that the problem is mathematically characterized by an ODE, which is then solved numerically.
The accuracy of the holonomic gradient method and our method is very high. As shown in Tables 1 to 4, the normalizing constants calculated by these methods (the hg and ce columns) coincide to 6 decimal places. The saddlepoint approximation method is not as good, as its error is larger. Moreover, the calculation time for our method with 100 dimensions is about 20 ms, whereas the holonomic gradient method takes about 15 s with 10 dimensions (sei2015calculating).
In Table 1, columns 2 and 3 compare the saddlepoint approximation (spa) and the holonomic gradient method (hg) of with our method (ce); columns 4 and 5 compare the same quantities for .
spa  hg  ce  spa  hg  ce  

5  4.237006  4.238950  4.238950  3.376766  3.372017  3.372017 
10  2.982628  2.985576  2.985576  1.689684  1.689355  1.689355 
30  1.708766  1.711919  1.711919  0.555494  0.556123  0.556123 
50  1.321178  1.323994  1.323994  0.332102  0.332661  0.332661 
100  0.932895  0.935094  0.935094  0.165587  0.165940  0.165940 
200  0.659185  0.660814  0.660814  0.082676  0.082871  0.082871 
The complex Bingham distribution, which is defined on a unit complex sphere, can be calculated analytically (kume2005saddlepoint).
Definition 3
For a dimensional multivariate distribution, the complex Bingham distribution is given by the density function
where and
is the normalizing constant and is the uniform measure in the dimensional complex sphere .
In Table 2, we compare the saddlepoint approximation (spa), the exact value (ex), the holonomic gradient method (hg), and our method (ce) for the complex Bingham distribution.
spa  ex  hg  ce  

5  5.942975  5.936835  5.936835  5.936835 
10  3.429004  3.425468  3.425468  3.425468 
30  1.248280  1.246421  1.246421  1.246421 
50  0.761347  0.760180  0.760180  0.760180 
100  0.385272  0.384675  0.384675  0.384675 
200  0.193779  0.193477  0.193477  0.193477 
In Table 3, columns 2 and 3 compare the saddlepoint approximation (spa) and the holonomic gradient method (hg) of with our method (ce); columns 4 and 5 compare the same quantities for .
spa  hg  ce  spa  hg  ce  

5  1.258672  1.273161  1.273161  1.032128  1.044072  1.044072 
10  0.874523  0.883394  0.883394  0.500707  0.505223  0.505223 
30  0.497757  0.503213  0.503213  0.162251  0.163901  0.163901 
50  0.384440  0.388775  0.388775  0.096784  0.097828  0.097828 
100  0.271249  0.274375  0.274375  0.048182  0.048725  0.048725 
200  0.191595  0.193826  0.193826  0.024039  0.024316  0.024316 
In Table 4, we compare the saddlepoint approximation (spa), the exact value (ex), the holonomic gradient method (hg), and our method (ce) for the complex Bingham distribution with a different parameter setting.
spa  ex  hg  ce  

5  0.921027  0.921726  0.921726  0.921726 
10  0.506236  0.506341  0.506341  0.506341 
30  0.177602  0.177495  0.177495  0.177495 
50  0.177602  0.107458  0.107458  0.107458 
100  0.054115  0.054081  0.054081  0.054081 
200  0.027144  0.027127  0.027127  0.027127 
The advantage of our method is its efficiency in high-dimensional cases. The calculation time of the normalizing constant for various dimensions is shown in Figure 1. The parameters are generated with the random number generator of the C++ library.
The calculation is very rapid and stable, and the calculation time increases linearly with the number of dimensions. Since the derivatives of the normalizing constant are calculated with the same numerical integration formula, they can also be computed efficiently. The MLE algorithm introduced in Section 4 is based on the gradient descent method. The bottleneck lies in the optimization method because the calculation of the gradients is not expensive. Since the gradient descent method converges linearly, it is not efficient enough for high-dimensional data; this problem can be addressed using the quasi-Newton method.
5.2 MLE
In this section, some numerical experiments on the MLE using data of various dimensions are shown. The data are obtained by rejection sampling, and the variance-covariance matrix is assumed to be diagonal, that is, the orthogonal matrix is not estimated.
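A minimal sketch of such a rejection sampler is given below, assuming the unnormalized density exp(γᵀx − xᵀAx) on the unit sphere (the paper does not specify its sampler; this is one standard construction). Uniform proposals on the sphere are accepted with probability exp(g(x) − M), where M = ‖γ‖ − λ_min(A) bounds the exponent from above.

```python
import numpy as np

def sample_fisher_bingham(gamma, A, n, seed=0):
    """Rejection sampling from the (assumed) unnormalized density
    exp(γᵀx − xᵀAx) on the unit sphere, with uniform proposals.
    On the sphere the exponent is at most ‖γ‖ − λ_min(A), which makes
    the acceptance probability below valid (always ≤ 1)."""
    rng = np.random.default_rng(seed)
    bound = np.linalg.norm(gamma) - np.linalg.eigvalsh(A).min()
    out = []
    while len(out) < n:
        x = rng.standard_normal(len(gamma))
        x /= np.linalg.norm(x)                 # uniform proposal on the sphere
        log_accept = gamma @ x - x @ A @ x - bound
        if np.log(rng.uniform()) < log_accept:
            out.append(x)
    return np.array(out)

# von Mises-Fisher special case (A = 0, γ = κ e₃ with κ = 2):
# the mean of x₃ under this density is coth(κ) − 1/κ.
samples = sample_fisher_bingham(np.array([0.0, 0.0, 2.0]), np.zeros((3, 3)), 5000)
```

The acceptance rate degrades as the concentration grows, so this naive sampler is only suitable for generating moderate test data sets, which is all Section 5.2 requires.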
5.2.1 2-dimensional data
5.2.2 3-dimensional data
5.2.3 10-dimensional data
For a data matrix , where
the following MLE values
are obtained.
6 Application of Fisher–Bingham distribution to SVAE
6.1 Introduction to SVAE
As mentioned in Section 5, our method makes it possible to perform the MLE with high-dimensional data, which cannot be achieved using other methods such as the holonomic gradient method. Because of this advantage, we can apply our method to the MLE of the latent variables of variational autoencoders (VAEs).
It is essential to understand the autoencoder (AE) mechanism before introducing the VAE. The AE is a generative model with two networks: one is called the encoder, while the other is called the decoder. The encoder transforms images into vectors with fewer dimensions than the images. For example, MNIST is a dataset of handwritten digits from 0 to 9, and each image has 28 × 28 dimensions. With the encoder of the AE, an image with 28 × 28 dimensions is transformed into a vector with 220 dimensions. If an image becomes complicated, the dimensions of the latent space increase. For instance, the latent space of human face data has about 100 dimensions.
The VAE is a generative model derived from the AE. The difference between them is that the VAE assumes that the vectors obtained from images are generated from some distributions (doersch2016tutorial; kingma2013auto). The encoder estimates the parameters of the distributions, while the decoder generates an image from a vector sampled from the estimated distributions. In most cases, the vectors are assumed to be generated from normal distributions in Euclidean space, and in this section "VAE" refers to this setting. Considering MNIST, 10 normal distributions correspond to the digits from 0 to 9. If we decode a vector drawn from the distribution of zero, something like a zero will probably be generated. Therefore, the encoder of the VAE transforms images into parameters, such as the means and variances of the distributions, instead of the vectors themselves. The advantage of the VAE over the AE is that it gives structure to the latent variables.
The SVAE is a VAE that assumes the vectors are generated from distributions restricted to a high-dimensional unit hypersphere; specifically, the vectors are assumed to be generated from the von Mises–Fisher distribution.
Definition 4
For a dimensional multivariate distribution, the von Mises–Fisher distribution is given by the density function
where and
is the normalizing constant and is the uniform measure in the dimensional sphere .
The von Mises–Fisher distribution is the special case of the Fisher–Bingham distribution in which the variance-covariance matrix is proportional to the identity. When the concentration parameter is zero, the von Mises–Fisher distribution becomes the uniform distribution on the unit sphere.
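For reference, the normalizing constant of the von Mises–Fisher distribution has a well-known closed form in terms of modified Bessel functions, which makes it a convenient test case for any numerical scheme. The surface-integral convention below is an assumption (unnormalized density exp(κ μᵀx) integrated against the uniform measure on S^{p−1}); for p = 3 it reduces to the elementary expression 4π sinh(κ)/κ.

```python
import math
from scipy.special import iv  # modified Bessel function of the first kind

def vmf_normalizing_constant(p, kappa):
    """∫_{S^{p-1}} exp(κ μᵀx) dσ(x) = (2π)^{p/2} I_{p/2-1}(κ) / κ^{p/2-1},
    independent of the unit mean direction μ."""
    return (2.0 * math.pi) ** (p / 2) * iv(p / 2 - 1, kappa) / kappa ** (p / 2 - 1)

# For p = 3, I_{1/2}(κ) = sqrt(2/(πκ)) sinh(κ), so the constant
# reduces to the elementary 4π sinh(κ)/κ.
c = vmf_normalizing_constant(3, 2.0)
elementary = 4.0 * math.pi * math.sinh(2.0) / 2.0
```

Such closed forms exist only in this isotropic special case; for general Fisher–Bingham parameters a numerical method like the one in Section 3 is required.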
davidson2018hyperspherical proposed two reasons to assume that the latent space is a unit sphere rather than Euclidean space. First, the SVAE allows a uniform distribution on the hypersphere as a prior, which is a truly uninformative prior; since a uniform distribution on Euclidean space does not exist, the VAE must make some informative assumption about the prior. Second, with high-dimensional data, the "soap-bubble effect", in which the latent variables concentrate on a hyperspherical shell, is observed. The assumption of a Euclidean hyperplane as the latent space would therefore be improper, because a hyperplane is not homeomorphic to a sphere.
The AE, VAE, and SVAE are unsupervised machine learning models, which means that the labels of the images are not used to train the networks. However, images with high similarity, such as several handwritten images of the number one, end up close to each other in latent space. Therefore, clusters form in latent space, as shown in Figures 6 and 7 (davidson2018hyperspherical). Each color corresponds to a label from 0 to 9. Figure 7 uses the Hammer projection, which shows a sphere on a 2-dimensional plane. The latent variable representation of each image can thus be obtained with a pretrained model, and the parameters of these clusters can then be estimated via MLE. The algorithm described in Section 4 can be applied directly to the SVAE.
In addition, the conditional VAE (CVAE) is a supervised model that uses the label information to train the networks.
It is common to consider adding more detail to the labels of a CVAE. For example, if a CVAE of human faces is pretrained with continent labels, such as Asia, Europe, and Africa, we may want to refine those labels. Similarly, consider a CVAE pretrained with images of fish labeled by species: when a new species is discovered, we may want to add a label for it to the network. In most cases, the networks are retrained on the dataset with the new labels appended. However, the computational cost is high. For example, training the VAE on MNIST using a CPU (Dell Latitude 5480) takes several hours in our experiment.
If the latent variables are assumed to be generated from Fisher–Bingham distributions, the distribution of the images with new labels can be obtained by estimating its parameters via MLE. Since the computational cost of MLE is much lower than that of retraining the CVAE, new labels can be appended much more easily. In the VAE of MNIST, the estimation of the parameters takes only seconds. Therefore, the MLE algorithm of Section 4 can be applied to the SVAE. Note that other methods for the MLE of the Fisher–Bingham distribution, such as the holonomic gradient method and the saddlepoint approximation method, are not suitable for high-dimensional data.
6.2 Application results
In this section, the handwritten image dataset MNIST is used to train the SVAE. Experiments with two different latent spaces are shown. The models are trained according to davidson2018hyperspherical (https://github.com/nicoladecao/svaepytorch). It is necessary to add a sigmoid layer to the decoder. The number of training epochs is increased from 1000 to 2000, with early stopping at 50. Besides these changes, the training procedure is mostly the same as that of davidson2018hyperspherical. Note that the experimental results are not fully accurate. In this section, the accuracy of the model roughly means the degree of similarity of the generated images to handwritten digits, as judged by the authors. Some concerns therefore remain regarding the accuracy of the models and the validity of the Fisher–Bingham assumption.
6.2.1 Latent space
The images generated by SVAE are shown in Figure 9.
The latent space is assumed to be a 2-dimensional unit sphere so that it can be visualized easily. However, in terms of the accuracy of the model, such a low-dimensional latent space is probably not enough to learn the features of the images. Therefore, the images generated by the model are not accurate; in particular, the features of the handwritten 2, 4, and 5 are not learned completely.
For example, the distribution of the latent variables of labels 0 and 1 is illustrated in Figure 10. The areas of label 0 and label 1 are clearly separated, and the data are concentrated. However, the latent variables of label 2 are spread over the sphere, as shown in Figure 11.
Given the latent variables mentioned above, the following MLE values for each label are obtained and listed in Table 5.
label  0  1  2  3  4 
