Optimal Latent Representations: Distilling Mutual Information into Principal Pairs
Principal component analysis (PCA) is generalized from one to two random vectors, decomposing the correlations between them into a set of "principal pairs" of correlated scalars. For the special case of a Gaussian random vector, PCA decomposes its information content into a mutually exclusive and collectively exhaustive set of information chunks corresponding to statistically independent numbers whose individual entropies add up to the total entropy. The proposed Principal Pair Analysis (PPA) generalizes this, decomposing the total mutual information between two vectors into a sum of the mutual informations of a set of independent pairs of numbers. This allows any two random vectors to be interpreted as the sum of a perfectly correlated ("signal") part and a perfectly uncorrelated ("noise") part. It is shown that when predicting the future of a system by mapping its state into a lower-dimensional latent space, it is optimal to use different mappings for present and future. As an example, it is shown that PPA outperforms PCA for predicting the time-evolution of coupled harmonic oscillators with dissipation and thermal noise. We conjecture that a single latent representation is optimal only for time-reversible processes, not for, e.g., text, speech, music, or out-of-equilibrium physical systems.
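In the jointly Gaussian case, the claimed additivity of mutual information over independent pairs can be illustrated with canonical correlations: the total mutual information I(X;Y) = (1/2) log[det(Σ_XX) det(Σ_YY) / det(Σ)] equals the sum of per-pair terms -(1/2) log(1 - ρ_i²), where ρ_i are the canonical correlations. The sketch below checks this identity numerically; it is an illustration of the Gaussian decomposition, not the paper's PPA implementation, and all variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
dx, dy = 3, 2

# Build a random positive-definite joint covariance for (X, Y).
A = rng.normal(size=(dx + dy, dx + dy))
S = A @ A.T + (dx + dy) * np.eye(dx + dy)
Sxx, Syy, Sxy = S[:dx, :dx], S[dx:, dx:], S[:dx, dx:]

# Total mutual information of a jointly Gaussian pair (in nats):
# I(X;Y) = 1/2 log( det(Sxx) det(Syy) / det(S) )
total_mi = 0.5 * np.log(
    np.linalg.det(Sxx) * np.linalg.det(Syy) / np.linalg.det(S)
)

def inv_sqrt(M):
    """Inverse matrix square root of a symmetric positive-definite matrix."""
    w, V = np.linalg.eigh(M)
    return V @ np.diag(1.0 / np.sqrt(w)) @ V.T

# Canonical correlations: singular values of Sxx^{-1/2} Sxy Syy^{-1/2}.
rho = np.linalg.svd(inv_sqrt(Sxx) @ Sxy @ inv_sqrt(Syy), compute_uv=False)

# Each independent pair contributes -1/2 log(1 - rho_i^2);
# the contributions add up to the total mutual information.
per_pair = -0.5 * np.log(1.0 - rho**2)
print(np.allclose(per_pair.sum(), total_mi))  # True
```

The identity follows from the Schur-complement factorization det(Σ) = det(Σ_XX) det(Σ_YY) Π_i (1 - ρ_i²), which is what makes the per-pair information chunks sum exactly to the total.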