I Introduction
Recent advances in variational autoencoders (VAEs) provide new unsupervised approaches to learning the hidden structure of data [5]. The variational autoencoder is a powerful generative model that allows inference of the learned latent representation. However, classic VAEs are prone to the "posterior collapse" phenomenon, in which the latent representations are ignored because the decoder is powerful enough to model the data on its own. The vector quantized variational autoencoder (VQ-VAE) learns discrete representations by incorporating vector quantization into the bottleneck stage, which avoids posterior collapse [1]. [3] proposes to train the bottleneck stage of the VQ-VAE with the Expectation Maximization (EM) algorithm and achieves a higher perplexity of the latent space. Both VQ-VAE models make progress on training discrete latent variable models to match their continuous counterparts.
In this paper, we show that the formulation of the VQ-VAEs can be interpreted from an information-theoretic perspective. The loss function of the original VQ-VAE can be derived from the deterministic variational information bottleneck principle [6]. On the other hand, the VQ-VAE trained by the EM algorithm can be viewed as an approximation to the variational information bottleneck.
II Related Work
Given a joint probability distribution $p(x,y)$ of the input data $X$ and the observed relevant random variable $Y$, the information bottleneck (IB) method seeks a representation $Z$ such that the mutual information $I(X;Z)$ is minimized while the mutual information $I(Z;Y)$ is preserved [7]. $I(Z;Y)$ can be seen as a measure of the predictive power of $Z$ on $Y$, and $I(X;Z)$ can be seen as a compression measure. Hence, the information bottleneck is designed to find the trade-off between accuracy and compression. [8] first used the information bottleneck principle to analyze deep neural networks theoretically, but no practical models were derived from the IB model.
[4] presents a variational approximation to the information bottleneck (VIB) so that IB-based models can be parameterized by neural networks. The deterministic information bottleneck (DIB) principle introduces an alternative formulation of the IB problem: it focuses on the representational cost of the latent $Z$ instead of finding the minimal sufficient statistics for predicting $Y$. Hence, DIB replaces the mutual information $I(X;Z)$ with the entropy $H(Z)$. Using techniques similar to those of [4], [2] derived a variational deterministic information bottleneck (VDIB) to approximate the DIB.
III Variational Information Bottleneck
We adopt an unsupervised clustering setting to derive the loss functions of the VDIB and VIB. We denote the data point index $X$ as the input data, the codeword index $Z$ as the latent variable, the feature representation $Y$ of the input data as the observed relevant variable, and $\hat{Y}$ as the reconstructed representation. These variables are subject to the Markov chain constraint

$$Y \leftrightarrow X \leftrightarrow Z \leftrightarrow \hat{Y}. \tag{1}$$
The information bottleneck principle can be formulated as a rate-distortion-like problem [9]

$$\min_{p(z|x)} \; I(X;Z) \quad \text{subject to} \quad \mathbb{E}\left[d_{\mathrm{IB}}(X,Z)\right] \le D. \tag{2}$$
The loss function of the information bottleneck principle is the equivalent Lagrangian formulation

$$\mathcal{L} = I(X;Z) + \beta\,\mathbb{E}\left[d_{\mathrm{IB}}(X,Z)\right], \tag{3}$$

where $\beta$ is the Lagrangian parameter.
The information bottleneck distortion is defined as

$$d_{\mathrm{IB}}(x,z) = D_{\mathrm{KL}}\big(p(y|x)\,\|\,p(y|z)\big), \tag{4}$$

where $D_{\mathrm{KL}}(\cdot\,\|\,\cdot)$ denotes the Kullback–Leibler divergence. Let
$p(x,z) = p(x)\,p(z|x)$ be the measure on $X \times Z$; by the Markov chain (1) we have $p(y|x,z) = p(y|x)$ and $p(x,y,z) = p(x,z)\,p(y|x)$. We can then decompose the expected distortion into two terms:

$$\mathbb{E}\left[d_{\mathrm{IB}}(X,Z)\right] = \mathbb{E}_{p(x,z)}\,\mathbb{E}_{p(y|x)}\left[\log \frac{p(y|x)}{p(y|z)}\right] \tag{5}$$
$$= -\,\mathbb{E}_{p(y,z)}\left[\log p(y|z)\right] - H(Y|X), \tag{6}$$

where (6) is derived by using the chain rule to express the conditional probability $p(y|z)$ as

$$p(y|z) = \sum_{x} p(y|x)\,p(x|z). \tag{7}$$
Since the second term of (6) is determined solely by the given data distribution, it is a constant and can be ignored in the loss function for the purpose of minimization. The first term of (6) can be upper bounded by replacing $p(y|z)$ with a variational approximation $q(y|z)$ [4]:
$$-\,\mathbb{E}_{p(y,z)}\left[\log p(y|z)\right] = -\,\mathbb{E}_{p(y,z)}\left[\log q(y|z)\right] - \mathbb{E}_{p(z)}\left[D_{\mathrm{KL}}\big(p(y|z)\,\|\,q(y|z)\big)\right] \tag{8}$$
$$\le -\,\mathbb{E}_{p(y,z)}\left[\log q(y|z)\right] \tag{9}$$
$$= -\,\mathbb{E}_{p(x)}\,\mathbb{E}_{p(z|x)}\,\mathbb{E}_{p(y|x)}\left[\log q(y|z)\right], \tag{10}$$

where (9) results from the non-negativity of the KL divergence

$$D_{\mathrm{KL}}\big(p(y|z)\,\|\,q(y|z)\big) = \mathbb{E}_{p(y|z)}\left[\log \frac{p(y|z)}{q(y|z)}\right] \tag{11}$$
$$\ge 0. \tag{12}$$
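To make the bound concrete, here is a small numeric sanity check (the distributions `p` and `q` below are made up for illustration) showing that the cross entropy under a variational approximation upper-bounds the entropy under the true conditional, as in (9):

```python
import math

# Hypothetical conditional p(y|z) for one fixed z, and a variational
# approximation q(y|z); both distributions are made up for illustration.
p = [0.7, 0.2, 0.1]
q = [0.5, 0.3, 0.2]

entropy_p = -sum(pi * math.log(pi) for pi in p)              # -E_p[log p]
cross_ent = -sum(pi * math.log(qi) for pi, qi in zip(p, q))  # -E_p[log q]
kl = cross_ent - entropy_p                                   # D_KL(p || q), as in (11)

# KL >= 0, so replacing p(y|z) with q(y|z) can only increase the term, as in (9).
print(entropy_p, cross_ent, kl)
```

The gap between the two quantities is exactly the KL divergence (11), which vanishes when the variational decoder matches the true conditional.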
Similarly, the mutual information $I(X;Z)$ can be upper bounded by replacing the marginal $p(z)$ with a variational approximation $r(z)$:

$$I(X;Z) = \mathbb{E}_{p(x)}\left[D_{\mathrm{KL}}\big(p(z|x)\,\|\,r(z)\big)\right] - D_{\mathrm{KL}}\big(p(z)\,\|\,r(z)\big) \tag{13}$$
$$\le \mathbb{E}_{p(x)}\left[D_{\mathrm{KL}}\big(p(z|x)\,\|\,r(z)\big)\right] \tag{14}$$
$$= \mathbb{E}_{p(x)}\,\mathbb{E}_{p(z|x)}\left[\log \frac{p(z|x)}{r(z)}\right], \tag{15}$$

where (14) results from the non-negativity of the KL divergence

$$D_{\mathrm{KL}}\big(p(z)\,\|\,r(z)\big) = \mathbb{E}_{p(z)}\left[\log \frac{p(z)}{r(z)}\right] \tag{16}$$
$$\ge 0. \tag{17}$$
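The marginal bound can be checked numerically as well. In the toy sketch below (all probabilities are illustrative), the bound in (15) is evaluated with an arbitrary $r(z)$ and compared against the true mutual information, which is attained when $r(z)$ equals the true marginal:

```python
import math

# Toy setting: two inputs with equal probability and a made-up encoder
# distribution p(z|x); all numbers are illustrative.
p_x = [0.5, 0.5]
p_z_given_x = [[0.9, 0.1], [0.2, 0.8]]

# True marginal p(z) = sum_x p(x) p(z|x)
p_z = [sum(p_x[i] * p_z_given_x[i][k] for i in range(2)) for k in range(2)]

def expected_kl(ref):
    # E_{p(x)}[ D_KL( p(z|x) || ref(z) ) ]
    return sum(p_x[i] * sum(p_z_given_x[i][k] * math.log(p_z_given_x[i][k] / ref[k])
                            for k in range(2))
               for i in range(2))

mi = expected_kl(p_z)        # I(X;Z): the bound is tight for the true marginal
r_z = [0.5, 0.5]             # an arbitrary variational marginal r(z)
bound = expected_kl(r_z)     # the upper bound in (15)
print(mi, bound)
```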
IV Connection to VQ-VAEs
In this section, we establish the connection between the VIB and VDIB principles and the VQ-VAE and the VQ-VAE trained by the EM algorithm. In the VQ-VAE setting, the distribution $p(z|x)$ is parameterized by the encoder neural network and the distribution $q(y|z)$ is parameterized by the decoder neural network.
The loss function of the VQ-VAE uses three terms to minimize the first term of (18) and (23) empirically [1]:

$$\mathcal{L} = -\log q\big(y \,|\, e_z\big) + \gamma\,\big\|z_e(y) - \mathrm{sg}[e_z]\big\|_2^2 + \big\|\mathrm{sg}[z_e(y)] - e_z\big\|_2^2, \tag{24}$$

where $\mathrm{sg}[\cdot]$ is the stop-gradient operator, $K$ is the number of codewords of the quantizer, $z_e(y)$ is the output of the encoder for a data point $y$, and $e_z$ is the output of the bottleneck quantizer and the input of the decoder. The stop-gradient operator outputs its input unchanged in the forward pass and is not taken into account when computing gradients during training.
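As a minimal sketch, the quantizer and the last two terms of (24) can be computed for one data point as follows. The encoder output, codebook, and commitment weight below are made-up illustrative values, not taken from [1]; the stop-gradient only changes backpropagation, so the forward values of the two distance terms coincide:

```python
import math

# Illustrative sketch for one data point; the encoder output, codebook,
# and commitment weight gamma are made-up values, not from [1].
z_e = [0.9, 0.2]                      # encoder output z_e(y)
codebook = [[1.0, 0.0], [0.0, 1.0]]   # K = 2 codewords e_1, e_2

def sq_dist(a, b):
    # squared Euclidean distance ||a - b||^2
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

# Nearest-neighbor quantization (27): hard assignment to the closest codeword.
z = min(range(len(codebook)), key=lambda k: sq_dist(z_e, codebook[k]))
e_z = codebook[z]

# The stop-gradient sg[.] in (24) only changes backpropagation, so the
# forward values of the commitment and codebook terms are the same distance.
gamma = 0.25                                   # commitment weight (illustrative)
commitment_loss = gamma * sq_dist(z_e, e_z)    # pulls z_e(y) toward sg[e_z]
codebook_loss = sq_dist(z_e, e_z)              # pulls e_z toward sg[z_e(y)]
# The reconstruction term -log q(y | e_z) would come from the decoder (omitted).
print(z, commitment_loss, codebook_loss)
```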
The first term of (24) is the reconstruction error between the output and the input. During backpropagation, the gradient is copied from the decoder input $e_z$ to the encoder output $z_e(y)$. Hence, the first term only optimizes the encoder and the decoder, and the codewords receive no gradient updates from it. The second term is the commitment loss, which forces the encoder output to commit to the codewords, while the codewords themselves are optimized by the third term. $\gamma$ is a constant weight parameter for the commitment loss.

For the second regularization term, VDIB minimizes the cross entropy with the empirical expression
$$H\big(p(z|x), r(z)\big) = -\frac{1}{N}\sum_{i=1}^{N}\sum_{z=1}^{K} p(z\,|\,x_i)\log r(z). \tag{25}$$

Conventionally, the marginal $r(z)$ is set to be a uniform distribution. Then the cross entropy becomes a constant $\log K$ and can be omitted from the loss function. The loss function of the VDIB then reduces to the loss function (24) of the VQ-VAE.

For the VIB, the KL divergence can be expressed as
$$D_{\mathrm{KL}}\big(p(z|x)\,\|\,r(z)\big) = H\big(p(z|x), r(z)\big) - H\big(p(z|x)\big). \tag{26}$$

The first term of (26) is the same cross entropy as in (25). However, the conditional entropy term of (26) encourages the input data to be quantized uniformly over more of the codewords.
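Concretely, when $r(z)$ is uniform, (26) reduces to $\log K - H(p(z|x))$: the cross-entropy term is the constant $\log K$ regardless of the encoder distribution, and minimizing the remaining part rewards assignments that spread over the codewords. A short sketch with illustrative distributions:

```python
import math

def entropy(p):
    # H(p) = -sum_z p(z) log p(z), skipping zero-probability entries
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

K = 4
def kl_to_uniform(p):
    # With uniform r(z), (26) reduces to D_KL(p(z|x) || r) = log K - H(p(z|x)).
    return math.log(K) - entropy(p)

hard = [1.0, 0.0, 0.0, 0.0]   # deterministic assignment: H = 0, KL = log K
soft = [0.4, 0.3, 0.2, 0.1]   # spread assignment: higher H, lower KL
print(kl_to_uniform(hard), kl_to_uniform(soft))
```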
The classic VQ-VAE applies a nearest neighbor search on the codebook in the bottleneck stage:

$$z = \operatorname*{arg\,min}_{k} \big\|z_e(y) - e_k\big\|_2, \tag{27}$$

where $e_k$ is the $k$-th codeword. Hence, the conditional entropy $H\big(p(z|x)\big)$ is zero.
On the other hand, the VQ-VAE trained by the EM algorithm uses a soft clustering scheme based on the distance between the codewords and the output of the encoder. The probability that a data point is assigned to the $k$-th codeword is

$$p(z = k\,|\,x) = \frac{\exp\big(-\|z_e(y) - e_k\|_2^2\big)}{\sum_{j=1}^{K}\exp\big(-\|z_e(y) - e_j\|_2^2\big)}. \tag{28}$$
That is, the EM algorithm explicitly increases the conditional entropy and achieves a lower value for (18). The experiments in [3] also suggest that the VQ-VAE trained by the EM algorithm achieves a higher perplexity of the codewords than the original VQ-VAE.
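The hard rule (27) and the soft rule (28) can be contrasted in a few lines of Python (the codebook and encoder output below are illustrative): the hard assignment has zero conditional entropy, while the soft assignment has positive entropy and hence a perplexity above one:

```python
import math

# Contrast of hard (27) vs. soft (28) assignment for one encoder output;
# the codebook and encoder output are illustrative values.
z_e = [0.9, 0.2]
codebook = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

def sq_dist(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

dists = [sq_dist(z_e, e) for e in codebook]

# (28): softmax over negative squared distances (shifted for numerical stability)
m = max(-d for d in dists)
exps = [math.exp(-d - m) for d in dists]
p_soft = [e / sum(exps) for e in exps]

# (27): hard nearest-neighbor assignment is the zero-entropy limit
nearest = dists.index(min(dists))
p_hard = [1.0 if k == nearest else 0.0 for k in range(len(codebook))]

def entropy(p):
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

perplexity = math.exp(entropy(p_soft))   # higher perplexity: more codewords used
print(entropy(p_hard), entropy(p_soft), perplexity)
```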
V Conclusion
We derive the loss functions of the VIB and VDIB in a clustering setting. We show that the loss function of the original VQ-VAE can be derived from the VDIB principle. In addition, we show that the VQ-VAE trained with the EM algorithm explicitly increases the perplexity of the latents and can be viewed as an approximation of the VIB principle.
References
 [1] A. van den Oord, K. Kavukcuoglu, and O. Vinyals, “Neural discrete representation learning,” in Advances in Neural Information Processing Systems (NIPS), Long Beach, CA, Dec. 2017.
 [2] DJ Strouse and D. Schwab, “Variational deterministic information bottleneck,” 2018, [Online]. Available: http://djstrouse.com/downloads/vdib.pdf.
 [3] A. Roy, A. Vaswani, A. Neelakantan, and N. Parmar, “Theory and experiments on vector quantized autoencoders,” 2018, [Online]. Available: https://arxiv.org/abs/1803.03382.
 [4] A. A. Alemi, I. Fischer, J. V. Dillon, and K. Murphy, “Deep variational information bottleneck,” in Proceedings of the International Conference on Learning Representations (ICLR), Toulon, France, Apr. 2017.
 [5] D. P. Kingma and M. Welling, “Auto-encoding variational Bayes,” in Proceedings of the International Conference on Learning Representations (ICLR), Banff, Canada, Apr. 2014.
 [6] DJ Strouse and D. Schwab, “The deterministic information bottleneck,” Neural Comput., vol. 29, no. 6, pp. 1611–1630, 2017.
 [7] N. Tishby, F. C. Pereira, and W. Bialek, “The information bottleneck method,” in Proceedings of the 37th Annual Allerton Conference on Communication, Control, and Computing, 1999.
 [8] N. Tishby and N. Zaslavsky, “Deep learning and the information bottleneck principle,” in Proceedings of the IEEE Information Theory Workshop (ITW), 2015.
 [9] A. Gilad-Bachrach, A. Navot, and N. Tishby, “An information theoretic tradeoff between complexity and accuracy,” in Proceedings of COLT, 2003.