Minimal Random Code Learning with Mean-KL Parameterization
This paper studies the qualitative behavior and robustness of two variants of Minimal Random Code Learning (MIRACLE) used to compress variational Bayesian neural networks. MIRACLE implements a powerful, conditionally Gaussian variational approximation for the weight posterior Q_𝐰 and uses relative entropy coding to compress a weight sample from the posterior using a Gaussian coding distribution P_𝐰. To achieve the desired compression rate, D_KL[Q_𝐰 ‖ P_𝐰] must be constrained, which requires a computationally expensive annealing procedure under the conventional mean-variance (Mean-Var) parameterization of Q_𝐰. Instead, we parameterize Q_𝐰 by its mean and its KL divergence from P_𝐰, which constrains the compression cost to the desired value by construction. We demonstrate that variational training with the Mean-KL parameterization converges twice as fast and maintains predictive performance after compression. Furthermore, we show that Mean-KL leads to more meaningful variational distributions with heavier tails and to compressed weight samples that are more robust to pruning.
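To illustrate the idea behind the Mean-KL parameterization, the sketch below shows one way to recover a Gaussian variance from a given mean and per-weight KL budget, assuming fully factorized Gaussians for Q_𝐰 and P_𝐰. This is not the paper's code: the function name `variance_from_mean_kl`, the use of SciPy's Lambert W function, and the standard-normal coding distribution in the example are illustrative assumptions; only the closed-form Gaussian KL identity it relies on is standard.

```python
# Sketch: recover sigma^2 of Q_w = N(mu, sigma^2) so that
# KL[Q_w || P_w] equals a target budget kappa, with P_w = N(nu, rho^2).
# The Gaussian KL is
#   KL = log(rho/sigma) + (sigma^2 + (mu - nu)^2) / (2 rho^2) - 1/2,
# which, with t = sigma^2 / rho^2, rearranges to t - log t = 2c where
#   c = kappa + 1/2 - (mu - nu)^2 / (2 rho^2),
# solvable in closed form via the Lambert W function: t = -W_k(-exp(-2c)).
import numpy as np
from scipy.special import lambertw


def variance_from_mean_kl(mu, kappa, nu=0.0, rho2=1.0, branch=0):
    """Return sigma^2 such that KL[N(mu, sigma^2) || N(nu, rho2)] = kappa (nats).

    branch=0 yields the root with sigma^2 <= rho2; branch=-1 yields sigma^2 >= rho2.
    """
    c = kappa + 0.5 - (mu - nu) ** 2 / (2.0 * rho2)
    if c < 0.5:
        # t - log t >= 1, so no real variance attains this KL for this mean offset.
        raise ValueError("KL budget too small for this mean offset: no real solution.")
    t = -lambertw(-np.exp(-2.0 * c), k=branch).real
    return t * rho2


# Example: a weight with mean 0.3 under a standard-normal coding distribution
# and a per-weight budget of 0.1 nats.
print(variance_from_mean_kl(mu=0.3, kappa=0.1))
```

The two real Lambert W branches correspond to the two variances that achieve the same KL for a fixed mean; keeping the KL fixed by construction is what removes the need for the annealing schedule used under the Mean-Var parameterization.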