1 Introduction
The Boltzmann machine [1] is a well-known stochastic neural network that can discover data representations, in the form of a probability distribution, without supervision. Its variant called the restricted Boltzmann machine (RBM) [5] is of great practical importance because an RBM can be trained with less computational effort than the other models. Owing to its capability of discovering latent representations without labeled data, RBM has been successfully utilized in various applications involving pattern recognition and machine learning, including computer vision [15], collaborative filtering [24], and even geochemical analysis [3], to name a few.

In audio applications of RBM, signals are usually modeled based on their amplitude spectra. Since audio signals can be well characterized by their spectral components, an RBM is trained to approximate the probability distribution of the given data in a frequency-related domain. For example, many studies have applied RBM to model the mel-frequency cepstral coefficients (MFCC) [19, 22] or mel-cepstral features [9, 20] of speech signals. Raw amplitude or STRAIGHT [12] spectra have also been considered for extracting richer information from the signals [16, 17, 21]. Moreover, some studies have attempted to model the raw signals using RBM [10, 23]. None of these is an easy task for the original RBM defined for binary signals, and therefore the Gaussian-Bernoulli RBM [7, 4] is usually chosen for audio applications because it can naturally handle real-valued data.
However, modeling amplitude spectra with the Gaussian-Bernoulli RBM raises two issues from the viewpoint of audio applications. First, the Gaussian distribution allows negative values, which are inconsistent with the concept of amplitude. Since amplitude spectra are calculated via the absolute value, they are nonnegative by definition. Handling nonnegative values with the Gaussian distribution is not straightforward, and therefore the learned representation may contain unavoidable model error. Second, the human auditory system perceives sound on a roughly logarithmic scale rather than a linear one. Based on this fact, many hand-crafted audio features such as MFCC involve a logarithmic operation in their calculation. Although the usefulness of log-amplitude spectra is well known in the literature, the asymmetric nature of the logarithmic function may make training difficult for symmetric models such as the Gaussian distribution. Moreover, the log-amplitude of approximately sparse spectra (e.g., those of many audio signals including speech) can produce extreme outliers when the amplitude is around zero. These issues should be resolved for better modeling of audio signals.
In this paper, we propose the gamma-Bernoulli RBM for explicitly modeling linear- and log-amplitude spectrograms. First, a general gamma Boltzmann machine is defined by a new energy function consisting of the usual quadratic term and an additional log-amplitude term. This addition enables simultaneous consideration of the linear- and log-amplitude spectrograms. Then, its connections are restricted to form the gamma-Bernoulli RBM. The proposed model represents the conditional distribution of the visible units by the gamma distribution, which naturally restricts the domain of the data to positive numbers. Owing to these properties, the gamma-Bernoulli RBM should be suitable for representing amplitude spectra and hence audio signals. Through an experiment on reconstructing amplitude spectrograms, the effectiveness of the proposed RBM over the ordinary Gaussian-Bernoulli RBM was confirmed in terms of PESQ and mean squared error (MSE).
2 Boltzmann Machines
In this section, the ordinary Boltzmann machines are briefly reviewed to contrast the conventional models with the proposed one.
2.1 Boltzmann Machine
The Boltzmann machine [1] is an unsupervised neural network for approximating a distribution of the given data. Let x ∈ X be a vector, where X is the space of the variables under investigation (they will be clarified later). Then, a Boltzmann machine represents its probability density function (PDF) as

p(x) = exp(-E(x)) / Z,   Z = ∫_X exp(-E(x)) dx,   (1)

where E(·) is the so-called energy function, and Z is the normalizing constant called the partition function. The type of a Boltzmann machine is determined by the definition of the energy function. In this section, the following energy function involving the parameters W and b is considered for explaining the conventional models:

E(x) = -x^T W x - b^T x,   (2)

where the explicit forms of the parameters are given later.
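For a tiny binary example, the PDF of Eq. (1) with the energy of Eq. (2) can be evaluated exactly by enumerating all states. The following sketch uses random placeholder parameters; the size N and all variable names are hypothetical.

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(0)
N = 4
W = rng.normal(size=(N, N))
W = (W + W.T) / 2                # symmetric interaction weights
np.fill_diagonal(W, 0.0)
b = rng.normal(size=N)

def energy(x):
    # E(x) = -x^T W x - b^T x, cf. Eq. (2)
    return -x @ W @ x - b @ x

# Partition function by exhaustive enumeration of all 2^N binary states;
# this is exactly the computation that becomes intractable for large N.
states = np.array(list(product([0.0, 1.0], repeat=N)))
Z = np.sum([np.exp(-energy(x)) for x in states])

def pdf(x):
    # p(x) = exp(-E(x)) / Z, cf. Eq. (1)
    return np.exp(-energy(x)) / Z
```

Summing `pdf` over all 2^N states yields 1, which is the defining property of the partition function.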
2.2 Restricted Boltzmann Machine (RBM)
RBM is the most important variant of the Boltzmann machine. The general Boltzmann machine above may not be practical because calculating (or even approximating) the integral in Eq. (1) is quite difficult, which makes training extremely slow at practical dimensionalities. To avoid this difficulty, RBM restricts the connections between the units so that a fast training algorithm can be developed.
In RBM, the variables are separated into two groups: the visible and hidden variables, denoted by v and h, respectively. An element of these vectors is called a unit, and the connections between the units are defined by the energy function. The vector v corresponds to the data points (and is hence visible), while h represents the latent variables providing a hidden representation of the data. That is, a PDF of the visible data is given by the following marginalization:

p(v) = ∫_H p(x) dh,   (3)

where x = [v^T, h^T]^T, and V and H are the spaces of the visible and hidden variables, respectively.
The energy function of RBM is restricted so that neither the visible nor the hidden units have interconnections (i.e., RBM has no visible-visible or hidden-hidden connections, which could be introduced through the energy function by adding v^T A v and h^T B h, respectively, with square matrices A and B having nonzero off-diagonal elements). This restriction enables fast training by sampling from the conditional distributions p(v|h) and p(h|v). These two conditional probabilities are the key ingredients characterizing the types of RBMs.
2.3 BernoulliBernoulli RBM
The original RBM [5] was defined for binary variables, i.e., V and H are the sets of binary vectors: V = {0, 1}^D, H = {0, 1}^M. The energy function is defined as

E(x) = -v^T W h - b^T v - c^T h,   (4)

which is related to the general Boltzmann machine in Eq. (2) as

x = [v^T, h^T]^T,   W_(2) = [[O, W/2], [W^T/2, O]],   b_(2) = [b^T, c^T]^T,   (5)

where W_(2) and b_(2) denote the parameters appearing in Eq. (2), O represents the all-zero matrix of appropriate size, and the operations between the binary and real numbers are performed by regarding the binary symbols as real numbers.
This type of RBM is called the Bernoulli-Bernoulli RBM because the two conditional probabilities required for its training are elementwise Bernoulli distributions:

p(v|h) = B(v; σ(W h + b)),   (6)
p(h|v) = B(h; σ(W^T v + c)),   (7)

where σ(W h + b) (or σ(W^T v + c)) is a vector representing the probability of taking the value 1 for each element, and σ(·) denotes the elementwise sigmoid function.
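As a concrete illustration, the two conditionals and one sweep of block Gibbs sampling can be sketched in NumPy as follows; the sizes D and M, the initialization, and all variable names are hypothetical placeholders.

```python
import numpy as np

rng = np.random.default_rng(1)
D, M = 6, 3                        # visible/hidden sizes (arbitrary)
W = 0.1 * rng.normal(size=(D, M))  # connection weights
b = np.zeros(D)                    # visible bias
c = np.zeros(M)                    # hidden bias

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def p_v_given_h(h):
    # Bernoulli probabilities sigmoid(W h + b), cf. Eq. (6)
    return sigmoid(W @ h + b)

def p_h_given_v(v):
    # Bernoulli probabilities sigmoid(W^T v + c), cf. Eq. (7)
    return sigmoid(W.T @ v + c)

def gibbs_step(v):
    # One sweep of block Gibbs sampling: h ~ p(h|v), then v ~ p(v|h).
    h = (rng.random(M) < p_h_given_v(v)).astype(float)
    v_new = (rng.random(D) < p_v_given_h(h)).astype(float)
    return v_new, h
```

Because the units within each layer are conditionally independent, the whole layer can be sampled in one vectorized step, which is exactly what makes RBM training fast.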
2.4 GaussianBernoulli RBM
The Bernoulli-Bernoulli RBM has an obvious limitation: the visible variables must be binary. That is, it can only handle binary data, even though much of the interesting real-world data is apparently not binary in nature. In this respect, the Gaussian-Bernoulli RBM [7] is the most important variant of RBM because it can naturally handle real-valued data, v ∈ R^D, while the hidden variables remain binary, h ∈ {0, 1}^M. The energy function is defined as^1

E(x) = (1/2) v^T Σ^{-1} v - v^T W h - b^T v - c^T h,   (8)

which is related to the general Boltzmann machine in Eq. (2) as

W_(2) = [[-Σ^{-1}/2, W/2], [W^T/2, O]],   b_(2) = [b^T, c^T]^T,   (9)

where Σ = diag(σ²) is a diagonal matrix, σ² is the model parameter representing the variance of the visible variables, and diag(·) is the operator constructing a diagonal matrix from an input vector. The difference from the Bernoulli-Bernoulli RBM in Eq. (4) is merely the first term, which represents the self-connection of the visible units. Note that this term does not introduce interconnections among the visible units because the matrix Σ^{-1} has no off-diagonal elements.

^1 Note that this definition is somewhat different from those defined in [7] or [4]. We define the energy function as in Eq. (8) because we empirically found that it works better for our application in Section 4.

The model defined by Eq. (8) is called the Gaussian-Bernoulli RBM because its conditional probabilities are
p(v|h) = N(v; Σ(W h + b), Σ),   (10)
p(h|v) = B(h; σ(W^T v + c)),   (11)

where N(μ, Σ) is the Gaussian distribution with mean vector μ and covariance matrix Σ. That is, the data are handled by the Gaussian distribution, while the hidden variables are handled by the Bernoulli distribution. Therefore, this RBM can approximate the distribution of real-valued data by learning the parameters from the given data.
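The Gaussian conditional can be sampled directly. The sketch below (hypothetical sizes and placeholder parameter values) draws v | h with mean Σ(W h + b) and covariance Σ = diag(σ²), following the sign convention of the energy function used here.

```python
import numpy as np

rng = np.random.default_rng(2)
D, M = 6, 3                        # hypothetical sizes
W = 0.1 * rng.normal(size=(D, M))
b = np.zeros(D)
c = np.zeros(M)
sigma2 = np.ones(D)                # per-dimension variances of the visibles

def sample_v_given_h(h):
    # v | h ~ N(Sigma (W h + b), Sigma) with Sigma = diag(sigma2), cf. Eq. (10)
    mean = sigma2 * (W @ h + b)
    return mean + np.sqrt(sigma2) * rng.normal(size=D)

def p_h_given_v(v):
    # h | v is elementwise Bernoulli with probabilities sigmoid(W^T v + c), cf. Eq. (11)
    return 1.0 / (1.0 + np.exp(-(W.T @ v + c)))
```

Since Σ is diagonal, the multivariate Gaussian factorizes over dimensions and can be sampled with independent univariate draws, as done above.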
3 Gamma Boltzmann Machine
Among the Boltzmann machines, the Gaussian-Bernoulli RBM has been a standard choice for many real-world applications because it can handle real-valued signals. In audio applications, the amplitude spectrogram is one of the most reasonable choices of a meaningful acoustic feature, and therefore its generative modeling has been investigated [21, 26, 14, 27, 11]. However, as mentioned in the Introduction (third paragraph), modeling amplitude spectrograms with the Gaussian-Bernoulli RBM has two issues: it can produce negative values, and it ignores the logarithmic nature of human auditory perception. To circumvent these issues, in this section we propose a new variant of the Boltzmann machine named the gamma-Bernoulli RBM.
3.1 Proposed Gamma Boltzmann Machine
Similarly to the previous section, we first define a general Boltzmann machine without the restriction. We propose a generative model termed the gamma Boltzmann machine by defining the following energy function involving logarithmic terms:

E(x) = -x^T W x - b^T x - log(x)^T U log(x) - a^T log(x),   (12)

where log(·) is the elementwise logarithmic function, x is a positive vector (i.e., x ∈ R_{>0}^N), and the PDF is given by Eq. (1): p(x) = exp(-E(x)) / Z. Owing to the presence of log(x), this model naturally enforces the variables to be positive. By introducing the log-related parameters U and a, it can learn a PDF with consideration of the logarithmic scale.
3.2 Proposed GammaBernoulli RBM
By introducing the visible and hidden units and imposing the restriction, we can obtain an RBM based on the above gamma Boltzmann machine. Due to the logarithmic function in Eq. (12), all variables must be positive. In our model, the data are assumed to be positive, v ∈ R_{>0}^D, and the hidden variables are binary, h ∈ {0, 1}^M. However, this assumption cannot be accepted directly because log(h) takes -∞ whenever h contains 0. Therefore, we consider a transformation that makes the values positive: x = [v^T, exp(h)^T]^T, where exp(·) for a vector input is the elementwise exponential function. With this modification, the energy function is defined as

E(v, h) = b^T v - log(v)^T (W h + a) - c^T h,   (13)

which can be derived from Eq. (12) by inserting

W_(12) = O,   U_(12) = [[O, W/2], [W^T/2, O]],   (14)
b_(12) = [-b^T, 0^T]^T,   a_(12) = [a^T, c^T]^T,   (15)

where W_(12), U_(12), b_(12), and a_(12) denote the parameters appearing in Eq. (12), O and 0 represent the all-zero matrix and vector with appropriate sizes, and the joint density function of the variables is given as in Eq. (3): p(v) = ∫_H p(v, h) dh.
This proposed RBM is termed the gamma-Bernoulli RBM because its conditional probabilities are given by

p(v|h) = Gamma(v; W h + a + 1, b),   (16)
p(h|v) = B(h; σ(W^T log(v) + c)),   (17)

where Gamma(k, θ) is the elementwise i.i.d. gamma distribution with a shape-parameter vector k and a rate-parameter vector θ, i.e., with

Gamma(v; k, θ) = Π_d [ θ_d^{k_d} v_d^{k_d - 1} exp(-θ_d v_d) / Γ(k_d) ],   (18)

and Γ(·) is the gamma function.
The gamma distribution is a natural choice for modeling positive data. Furthermore, several studies have reported that the gamma distribution can approximate the distribution of speech signals better than the Gaussian distribution regardless of the type of speech parameterization [6, 25, 18, 2]. Thus, the proposed gamma-Bernoulli RBM should be more suitable for modeling amplitude spectra than the Gaussian-Bernoulli RBM.
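Under the reconstruction of the conditionals used in this text (shape W h + a + 1 and rate b), the gamma conditional can be sampled with NumPy; note that NumPy parameterizes the gamma sampler by shape and scale = 1/rate, not by rate. All sizes and parameter values below are hypothetical, and W and a are kept nonnegative so that the shape stays positive for any binary h.

```python
import numpy as np

rng = np.random.default_rng(3)
D, M = 5, 3                       # hypothetical sizes
W = 0.1 * rng.random((D, M))      # nonnegative weights keep the shape > 0
a = np.ones(D)
b = np.ones(D)                    # rate parameters (must be positive)
c = np.zeros(M)

def sample_v_given_h(h):
    # v | h ~ Gamma(shape = W h + a + 1, rate = b), cf. Eq. (16)
    k = W @ h + a + 1.0
    return rng.gamma(shape=k, scale=1.0 / b)  # NumPy uses scale = 1/rate

def p_h_given_v(v):
    # h | v is Bernoulli with probabilities sigmoid(W^T log(v) + c), cf. Eq. (17)
    return 1.0 / (1.0 + np.exp(-(W.T @ np.log(v) + c)))
```

Samples from this conditional are strictly positive by construction, which is the key property that the Gaussian conditional of Section 2.4 lacks.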
3.3 Implementation of GammaBernoulli RBM
In the proposed formulation, W h + a + 1 and b correspond to the parameters of the gamma distribution (k and θ, respectively) as in Eq. (16). Therefore, both vectors must be positive for the definition of the gamma distribution to be satisfied. To ensure their positivity without causing instability of the training, we parameterize the model through the elementwise softplus function [4]:

W = log(1 + exp(W̃)),   b = log(1 + exp(b̃)),   (19)

where W̃ and b̃ are the unconstrained parameters that are actually updated during training.

Moreover, in order to avoid a non-positive shape parameter, which occurs when (W h + a)_d ≤ -1, one may modify the vector a given in Eq. (15) as a + ε1 with a small constant ε > 0. This addition makes the shape parameter of the gamma distribution in Eq. (16), W h + a + 1, always positive as required by the definition. However, such a modification is not so important for practical applications because (W h + a)_d ≤ -1 rarely happens.
3.4 Objective Function and Parameter Optimization
Like the conventional Boltzmann machines, the objective of the proposed RBM is to maximize the log-likelihood:

L = Σ_{n=1}^{N} log p(v_n)   (20)
  = Σ_{n=1}^{N} log Σ_{h_n} p(v_n, h_n)   (21)
  = Σ_{n=1}^{N} log Σ_{h_n} exp(-E(v_n, h_n)) - N log Z,   (22)

where v_n and h_n are the n-th training data and the corresponding hidden variables, respectively, and Σ_{h_n} represents marginalization over all possible states of h_n.
For the optimization, the gradient of the log-likelihood function w.r.t. the parameters θ is required. Although it can be explicitly written as

∂L/∂θ = N ( ⟨-∂E/∂θ⟩_data - ⟨-∂E/∂θ⟩_model ),   (23)

this gradient is practically intractable owing to the second term, where ⟨·⟩_data and ⟨·⟩_model represent the expectations over the data and model distributions, respectively. Therefore, as usual in the conventional Boltzmann machines, the contrastive divergence method [8] is applied to approximate the gradient:

∂L/∂θ ≈ N ( ⟨-∂E/∂θ⟩_data - ⟨-∂E/∂θ⟩_recon ),   (24)

where ⟨·⟩_recon is the expectation over the reconstructed data, usually obtained through Gibbs sampling.
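For concreteness, a single contrastive-divergence (CD-1) update can be sketched for the Bernoulli-Bernoulli case; the proposed model follows the same data-minus-reconstruction recipe with its own gradients. All names and hyperparameters here are illustrative, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(4)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cd1_update(v0, W, b, c, lr=0.01):
    # Positive phase: hidden probabilities for the observed data.
    ph0 = sigmoid(W.T @ v0 + c)
    h0 = (rng.random(ph0.shape) < ph0).astype(float)
    # One Gibbs sweep: reconstruct the visibles, then the hidden probabilities.
    pv1 = sigmoid(W @ h0 + b)
    v1 = (rng.random(pv1.shape) < pv1).astype(float)
    ph1 = sigmoid(W.T @ v1 + c)
    # CD-1 gradient estimate: <.>_data - <.>_reconstruction.
    W += lr * (np.outer(v0, ph0) - np.outer(v1, ph1))
    b += lr * (v0 - v1)
    c += lr * (ph0 - ph1)
    return W, b, c
```

The reconstruction statistics replace the intractable model expectation in the second term of the gradient, which is the essence of contrastive divergence.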
The negative partial gradients of the energy function in Eq. (13) w.r.t. each parameter are obtained as follows:

-∂E/∂W = log(v) h^T,   -∂E/∂a = log(v),   (25)
-∂E/∂b = -v,   -∂E/∂c = h,   (26)

where, for the reparameterized W̃ and b̃ in Eq. (19), the chain rule multiplies the corresponding gradients elementwise by σ(W̃) and σ(b̃), e.g., -∂E/∂b̃ = -v ⊙ σ(b̃), and ⊙ denotes the elementwise multiplication.
4 Experiment
The effectiveness of the proposed model was investigated through a speech representation experiment, as described below.
4.1 Configuration
In the experiment, the ATR speech corpus (set B, speaker FTK) was utilized. The speech signals of 50 sentences (SDA) were used for training, while the other 53 sentences (SDJ) were used for evaluation. The signals, originally sampled at 20 kHz, were downsampled to 16 kHz to speed up the computation. The short-time Fourier transform (STFT) was implemented with a 256-sample-long Hamming window and a hop size of 64 samples. The 129-dimensional data vector v was calculated by taking the absolute value of the spectrum of each windowed segment. After discarding silent segments, the number of data samples for training was 51,197.

The proposed RBM was compared with the Gaussian-Bernoulli RBM. The training data were normalized so that the data distribution was standardized for each RBM. For the ordinary Gaussian-Bernoulli RBM, as usual, each dimension was normalized so that the data had zero mean and unit standard deviation. For the proposed gamma-Bernoulli RBM, each dimension was normalized as
ṽ_d = θ̂_d v_d,   θ̂_d = k / μ_d,   (27)

so that the gamma distribution, assumed as Gamma(k, θ), becomes the standard form Gamma(k, 1), where μ_d denotes the mean of v_d, and θ̂_d is the maximum-likelihood estimate of θ_d under a fixed shape k (the mean of Gamma(k, θ) is k/θ). In this experiment, a fixed k was considered for the normalization.

Both RBMs were trained by the Adam optimizer [13]
with a batch size of 100 and a learning rate of 0.01. The number of hidden units was set to 100, 200, 400, or 800. After training for 100 epochs, the amplitude spectrograms of the evaluation data were encoded and reconstructed with the trained models by calculating the expectation of v from the expectation of the encoded signal obtained from the input data samples, i.e., the reconstruction is obtained as v̂ = E[v | E[h | v]]. The performances were evaluated by PESQ and MSE after canceling the effect of normalization by the inverse operation.

4.2 Results
We show the experimental results in the following three ways: PESQ scores, an example of a reconstructed spectrogram, and learning curves in terms of MSE.
Firstly, the PESQ scores averaged over all evaluation data are shown in Fig. 1. After reconstructing the amplitude spectrograms, the corresponding time-domain signals were calculated by the inverse STFT using the phase of the original signals. Then, the PESQ scores were calculated using the original signals as the references. As illustrated in the figure, the proposed RBM (gamma-RBM) outperformed the ordinary RBM (Gauss-RBM) in all situations. This should be because the proposed model explicitly considers the log-amplitude spectrogram, which is more relevant to the human auditory system. The proposed model obtained better scores as the number of hidden units increased, and it did not reach a ceiling even with 800 units.
Secondly, an example of the reconstructed amplitude spectrograms is shown in Fig. 2, where the number of hidden units was set to 800 (H800). As can be seen from the top-right panel, the conventional model produced the negative values indicated by the red points. Some reconstruction error at the time-frequency bins having small energy can also be noticed in the bottom-right panel. In contrast, the proposed model did not produce any negative value, as expected from its definition. Although the reconstructed spectrogram is smoother, as shown in the central panel, its spectral envelope seems to be closer to that of the original signal than the conventional method's, which should be the reason for the better PESQ scores.
Finally, the MSEs w.r.t. the linear- and log-amplitude spectrograms per epoch are illustrated in Figs. 3 and 4, respectively. Since the proposed RBM considers both linear- and log-amplitude spectra, the MSE was calculated in both domains as follows:

MSE_lin = (1/T) Σ_{t=1}^{T} || v_t - v̂_t ||²,   (28)
MSE_log = (1/T) Σ_{t=1}^{T} || log(v_t) - log(v̂_t) ||²,   (29)

where v_t and v̂_t denote the t-th original and reconstructed amplitude spectra, respectively, and T is the total number of segments. While the conventional models (black) were slightly better than the proposed models (red) in terms of MSE_lin (Fig. 3), the proposed models outperformed the conventional models in terms of MSE_log (Fig. 4). Focusing on the number of hidden units, in Fig. 3 the proposed model with 800 units (solid red) easily outperformed the conventional model with 100 units (dotted black). In contrast, in Fig. 4 the conventional model with 800 units (solid black) could not outperform the proposed model with 100 units (dotted red). These results indicate that the simultaneous consideration of linear- and log-amplitude spectra in the proposed gamma-Bernoulli RBM can improve the overall performance.
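The two error measures of Eqs. (28)-(29) can be sketched as below, assuming the original and reconstructed spectra are stacked in arrays of shape (T, D) and averaging over all time-frequency bins (the exact averaging constant is a convention); a small eps guards against log(0).

```python
import numpy as np

def mse_linear(V, V_hat):
    # Squared error between amplitude spectra, cf. Eq. (28)
    return np.mean((V - V_hat) ** 2)

def mse_log(V, V_hat, eps=1e-12):
    # The same error measured on log-amplitudes, cf. Eq. (29)
    return np.mean((np.log(V + eps) - np.log(V_hat + eps)) ** 2)
```

Evaluating both measures makes the trade-off visible: a model tuned for the linear scale can look poor on the logarithmic scale, and vice versa.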
5 Conclusions
In this paper, we proposed a novel RBM named the gamma-Bernoulli RBM. By modeling data via the gamma distribution, the proposed RBM can naturally handle positive data such as amplitude spectra. Since it optimizes the parameters by simultaneously considering data on the linear and logarithmic scales, the obtained model should be suitable for applications sensitive to logarithmic quantities as well as to the linear scale.
Acknowledgment
This work was partially supported by JSPS KAKENHI Grant Number 18K18069.
References
[1] (1985) A learning algorithm for Boltzmann machines. Cognitive Science 9 (1), pp. 147–169.
[2] (2006) MMSE speech spectral amplitude estimators with chi and gamma speech priors. In Proc. IEEE ICASSP, Vol. 3, pp. III–III.
[3] (2014) Application of continuous restricted Boltzmann machine to identify multivariate geochemical anomaly. Journal of Geochemical Exploration 140, pp. 56–63.
[4] (2011) Improved learning of Gaussian-Bernoulli restricted Boltzmann machines. In Proc. ICANN, pp. 10–17.
[5] (1994) Unsupervised learning of distributions of binary vectors using two layer networks. Computer Research Laboratory, pp. 912–919.
[6] (2003) Speech probability distribution. IEEE Signal Processing Letters 10 (7), pp. 204–207.
[7] (2006) Reducing the dimensionality of data with neural networks. Science 313 (5786), pp. 504–507.
[8] (2006) A fast learning algorithm for deep belief nets. Neural Computation 18 (7), pp. 1527–1554.
[9] (2016) DBN-based spectral feature representation for statistical parametric speech synthesis. IEEE Signal Processing Letters 23 (3), pp. 321–325.
[10] (2011) Learning a better representation of speech soundwaves using restricted Boltzmann machines. In Proc. IEEE ICASSP, pp. 5884–5887.
[11] (2017) Generative adversarial network-based postfilter for STFT spectrograms. In Proc. Interspeech, pp. 3389–3393.
[12] (2008) TandemSTRAIGHT: a temporally stable power spectral representation for periodic signals and applications to interference-free spectrum, F0, and aperiodicity estimation. In Proc. IEEE ICASSP, pp. 3933–3936.
[13] (2015) Adam: a method for stochastic optimization. In Proc. ICLR, pp. 1–15.
[14] (2019) MelGAN: generative adversarial networks for conditional waveform synthesis. In Advances in Neural Information Processing Systems, pp. 14910–14921.
[15] (2008) Sparse deep belief net model for visual area V2. In Advances in Neural Information Processing Systems, pp. 873–880.
[16] (2015) Feature extraction with convolutional restricted Boltzmann machine for audio classification. In Proc. IAPR Asian Conference on Pattern Recognition (ACPR), pp. 791–795.
[17] (2013) Modeling spectral envelopes using restricted Boltzmann machines for statistical parametric speech synthesis. In Proc. IEEE ICASSP, pp. 7825–7829.
[18] (2002) Speech enhancement using MMSE short time spectral estimation with gamma distributed speech priors. In Proc. IEEE ICASSP, Vol. 1, pp. I–253.
[19] (2010) Phone recognition using restricted Boltzmann machines. In Proc. IEEE ICASSP, pp. 4354–4357.
[20] (2016) Non-parallel training in voice conversion using an adaptive restricted Boltzmann machine. IEEE/ACM Transactions on Audio, Speech, and Language Processing 24 (11), pp. 2032–2045.
[21] (2018) LSTBM: a novel sequence representation of speech spectra using restricted Boltzmann machine with long short-term memory. In Proc. Interspeech, pp. 2529–2533.
[22] (2016) DNN-based amplitude and phase feature enhancement for noise robust speaker identification. In Proc. Interspeech, pp. 2204–2208.
[23] (2016) Novel unsupervised auditory filterbank learning using convolutional RBM for speech recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing 24 (12), pp. 2341–2353.
[24] (2007) Restricted Boltzmann machines for collaborative filtering. In Proceedings of the 24th International Conference on Machine Learning, pp. 791–798.
[25] (2005) Statistical modeling of speech signals based on generalized gamma distribution. IEEE Signal Processing Letters 12 (3), pp. 258–261.
[26] (2019) MelNet: a generative model for audio in the frequency domain. arXiv preprint arXiv:1906.01083.
[27] (2017) Tacotron: towards end-to-end speech synthesis. In Proc. Interspeech, pp. 4006–4010.