
# Class-Conditional Compression and Disentanglement: Bridging the Gap between Neural Networks and Naive Bayes Classifiers

In this draft, which reports on work in progress, we 1) adapt the information bottleneck functional by replacing the compression term by class-conditional compression, 2) relax this functional using a variational bound related to class-conditional disentanglement, 3) consider this functional as a training objective for stochastic neural networks, and 4) show that the latent representations are learned such that they can be used in a naive Bayes classifier. We continue by suggesting a series of experiments along the lines of Nonlinear Information Bottleneck [Kolchinsky et al., 2018], Deep Variational Information Bottleneck [Alemi et al., 2017], and Information Dropout [Achille and Soatto, 2018]. We furthermore suggest a neural network where the decoder architecture is a parameterized naive Bayes decoder.


## 1 Notation

We consider a classification task with a feature random variable (RV) $X$ and a class RV $Y$ on the finite set $\mathcal{Y}$ of classes. If a dataset is available, then this dataset $\mathcal{D}$ consists of $N$ realizations of the joint distribution $p_{XY}$, i.e., $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{N}$.

We further consider stochastic feed-forward neural networks (NNs). We assume that the input of the NN is the RV $X$, the output of the network is the RV $\hat{Y}$, and every hidden layer defines an internal representation. In this work we are interested in a particular representation at a dedicated bottleneck layer, which we will denote by the RV $T$. The NN is parameterized by a set of parameters $\theta$ which define the stochastic map $q_{T|X}$ from the input to the representation and the stochastic map $\hat{q}_{Y|T}$ from the representation to the network output. We call $q_{T|X}$ and $\hat{q}_{Y|T}$ the encoder and decoder, respectively.

With this notation established, we denote distributions that are induced by the encoder/decoder (i.e., that depend on the parameters $\theta$) with $q$. For example, we have

$$q_{Y|T}(y|t) := \frac{p_Y(y)\,\mathbb{E}_{X\sim p_{X|Y}(\cdot|y)}\left[q_{T|X}(t|X)\right]}{\mathbb{E}_{X\sim p_X}\left[q_{T|X}(t|X)\right]} \tag{1}$$

for the distribution of the class label conditioned on the latent representation and

$$q_{T|Y}(t|y) := \mathbb{E}_{X\sim p_{X|Y}(\cdot|y)}\left[q_{T|X}(t|X)\right] \tag{2}$$

for the distribution of the latent representation conditioned on the class label. Surrogate distributions are denoted with $r$.
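To make the induced distributions concrete, the following minimal sketch evaluates them for a small discrete example. The joint distribution `P_XY`, the encoder `Q_T_GIVEN_X`, and the function name `induced_distributions` are illustrative assumptions and do not appear in the draft; the posterior over classes is obtained from the class-conditional latent distribution via Bayes' rule.

```python
# Toy discrete example; all numbers below are illustrative assumptions.
P_XY = [[0.30, 0.05],        # P_XY[x][y] for x in {0, 1, 2}, y in {0, 1}
        [0.10, 0.15],
        [0.05, 0.35]]
Q_T_GIVEN_X = [[0.9, 0.1],   # encoder q_{T|X}[x][t] for t in {0, 1}
               [0.5, 0.5],
               [0.2, 0.8]]


def induced_distributions(p_xy, q_t_given_x):
    """Return p_Y, q_T, q_{T|Y} and q_{Y|T} induced by a discrete encoder."""
    n_x, n_y, n_t = len(p_xy), len(p_xy[0]), len(q_t_given_x[0])
    p_x = [sum(p_xy[x]) for x in range(n_x)]
    p_y = [sum(p_xy[x][y] for x in range(n_x)) for y in range(n_y)]
    # q_{T|Y}(t|y) = E_{X ~ p_{X|Y}(.|y)}[q_{T|X}(t|X)]
    q_t_given_y = [
        [sum(p_xy[x][y] / p_y[y] * q_t_given_x[x][t] for x in range(n_x))
         for t in range(n_t)]
        for y in range(n_y)
    ]
    # q_T(t) = E_{X ~ p_X}[q_{T|X}(t|X)]
    q_t = [sum(p_x[x] * q_t_given_x[x][t] for x in range(n_x))
           for t in range(n_t)]
    # Bayes' rule: q_{Y|T}(y|t) = p_Y(y) q_{T|Y}(t|y) / q_T(t)
    q_y_given_t = [
        [p_y[y] * q_t_given_y[y][t] / q_t[t] for y in range(n_y)]
        for t in range(n_t)
    ]
    return p_y, q_t, q_t_given_y, q_y_given_t


p_y, q_t, q_t_given_y, q_y_given_t = induced_distributions(P_XY, Q_T_GIVEN_X)
```

Each row of `q_t_given_y` and of `q_y_given_t` is a probability vector, which is a quick sanity check on the definitions.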

## 2 Adapting the Information Bottleneck Loss for Optimal Representations

Our aim is to extract a representation $T$ of the feature $X$ such that the representation allows an accurate classification, but that at the same time is maximally compressed. In other words, we are looking for a stochastic map $q_{T|X}$ of $X$ such that the output of this map contains all, but not more, information about the class $Y$ that is contained in $X$. This aim is often formalized in terms of the information bottleneck (IB) functional; in the notation of [Achille and Soatto, 2018, eq. (2)], we aim to find a minimizer of

$$\mathcal{L}_{\mathrm{IB}} := H(Y|T) + \beta I(X;T) \tag{3}$$

where $\beta$ trades between the aims of preserving information about $Y$ (first term) and compressing the representation (second term). These two goals are conflicting, because compressing the representation potentially also leads to a loss of information relevant for classification. In the extreme case where $I(X;T)=0$ we trivially have $H(Y|T)=H(Y)$.

We now show that a different but equivalent formulation of the IB functional leads to two terms which are no longer in direct conflict. Specifically, we replace the compression term by a class-conditional compression term: Our aim is not to compress the latent representation $T$, but to remove every bit of information from this latent representation that is not necessary for classification. This latter quantity is captured in the conditional mutual information $I(X;T|Y)$.

Indeed, since $Y - X - T$ is a Markov tuple, we have that $I(Y;T|X)=0$. Furthermore, by the chain rule of mutual information, $I(X,Y;T) = I(X;T) + I(Y;T|X) = I(Y;T) + I(X;T|Y)$, we have

$$I(X;T) = I(X;T|Y) + I(Y;T) = I(X;T|Y) + H(Y) - H(Y|T). \tag{4}$$

Inserting (4) into (3) yields

$$\mathcal{L}_{\mathrm{IB}} = H(Y|T) + \beta I(X;T|Y) + \beta H(Y) - \beta H(Y|T) = (1-\beta) H(Y|T) + \beta I(X;T|Y) + \beta H(Y). \tag{5}$$

Since $H(Y)$ is independent of the map $q_{T|X}$, $q_{T|X}$ minimizes $\mathcal{L}_{\mathrm{IB}}$ for $\beta < 1$ if and only if it minimizes

$$\mathcal{L}_{\mathrm{CIB}} := H(Y|T) + \beta' I(X;T|Y) \tag{6}$$

for $\beta' = \beta/(1-\beta)$. Minimizing the second term, which we call class-conditional compression in the remainder of this work, is no longer in direct conflict with minimizing the first, as $H(Y|T)=H(Y|X)$ and $I(X;T|Y)=0$ are jointly possible (cf. Footnote 2).

Footnote 2: Going one step further, noticing that $H(Y|X)$ does not depend on $q_{T|X}$, and that $H(Y|T) - H(Y|X) = I(X;Y|T)$, one can show that the optimization problem is equivalent to finding a minimizer of

$$I(X;Y|T) + \beta' I(X;T|Y) \tag{7}$$

for some $\beta'$. The first term is a measure of sufficiency of the representation $T$ [Achille and Soatto, 2018, Sec. 4], while the second term quantifies whether the representation is minimal in the sense of removing irrelevant information.

Taking a closer look at the fact that $\beta' = \beta/(1-\beta)$ illustrates that for $\beta \to 1$ we have $\beta' \to \infty$, i.e., the optimization problem focuses only on (class-conditional) compression. This has been observed both analytically (e.g., [Kolchinsky et al., 2018, p. 2]) and in experiments (e.g., [Achille and Soatto, 2018, Figs. 4 & 5] and [Alemi et al., 2017, Fig. 1]).
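The rewriting of the IB functional in this section can be checked numerically. The sketch below builds a small discrete joint distribution in which the class, the feature, and the representation form a Markov tuple, and compares the IB functional in its original form with its form in terms of class-conditional compression. The toy numbers and all identifiers (`P_XY`, `Q_T_GIVEN_X`, `information_terms`) are illustrative assumptions.

```python
from math import log

# Toy joint P_XY[x][y] and encoder Q_T_GIVEN_X[x][t]; numbers are illustrative.
P_XY = [[0.30, 0.05], [0.10, 0.15], [0.05, 0.35]]
Q_T_GIVEN_X = [[0.9, 0.1], [0.5, 0.5], [0.2, 0.8]]


def entropy(probs):
    """Shannon entropy in nats."""
    return -sum(p * log(p) for p in probs if p > 0)


def information_terms(p_xy, q_t_given_x):
    # Joint q(x, y, t) = p(x, y) q(t | x), so that Y - X - T is a Markov tuple.
    joint = {
        (x, y, t): p_xy[x][y] * q_t_given_x[x][t]
        for x in range(len(p_xy))
        for y in range(len(p_xy[0]))
        for t in range(len(q_t_given_x[0]))
    }

    def h(keep):
        """Joint entropy of the variables at positions keep (0=X, 1=Y, 2=T)."""
        marginal = {}
        for key, p in joint.items():
            sub = tuple(key[i] for i in keep)
            marginal[sub] = marginal.get(sub, 0.0) + p
        return entropy(marginal.values())

    return {
        "H(Y)": h((1,)),
        "H(Y|T)": h((1, 2)) - h((2,)),
        "I(X;T)": h((0,)) + h((2,)) - h((0, 2)),
        # I(X;T|Y) = H(X,Y) + H(Y,T) - H(Y) - H(X,Y,T)
        "I(X;T|Y)": h((0, 1)) + h((1, 2)) - h((1,)) - h((0, 1, 2)),
    }


terms = information_terms(P_XY, Q_T_GIVEN_X)
beta = 0.3
# IB functional with the compression term I(X;T).
l_ib = terms["H(Y|T)"] + beta * terms["I(X;T)"]
# The same functional written with class-conditional compression I(X;T|Y).
l_ib_rewritten = ((1 - beta) * terms["H(Y|T)"]
                  + beta * terms["I(X;T|Y)"] + beta * terms["H(Y)"])
```

Both values agree up to floating-point error, and dividing the rewritten form by $1-\beta$ (after dropping the constant entropy of the class) gives the class-conditional functional with trade-off parameter $\beta/(1-\beta)$.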

## 3 A Variational Bound on Class-Conditional Compression and Its Consequences

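While we have shown that the functional (3), and thus also (6), becomes infinite for deterministic NNs with a continuously distributed input [Amjad and Geiger, 2018, Th. 1], for stochastic NNs it was argued that these functionals are complicated to estimate [Kolchinsky et al., 2018, Alemi et al., 2017]. As a remedy, both terms of $\mathcal{L}_{\mathrm{IB}}$ can be replaced by variational bounds. We aim to do the same here for $\mathcal{L}_{\mathrm{CIB}}$. For the first term, we have

\begin{align}
H(Y|T) &= \mathbb{E}_{X,Y\sim p_{XY}}\left[\mathbb{E}_{T\sim q_{T|X}(\cdot|X)}\left[-\log q_{Y|T}(Y|T)\right]\right] \tag{8}\\
&= \mathbb{E}_{X,Y\sim p_{XY}}\left[\mathbb{E}_{T\sim q_{T|X}(\cdot|X)}\left[-\log \hat{q}_{Y|T}(Y|T)\right]\right] - \mathbb{E}_{X,T\sim p_X q_{T|X}}\left[D\left(q_{Y|T}(\cdot|T)\,\|\,\hat{q}_{Y|T}(\cdot|T)\right)\right] \tag{9}\\
&\le \mathbb{E}_{X,Y\sim p_{XY}}\left[\mathbb{E}_{T\sim q_{T|X}(\cdot|X)}\left[-\log \hat{q}_{Y|T}(Y|T)\right]\right] \tag{10}
\end{align}

where the inequality follows from the non-negativity of KL divergence and leads to the popular cross-entropy cost function. For the second term $I(X;T|Y)$, note that by the non-negativity of KL divergence we have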

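\begin{align}
I(X;T|Y) &= \mathbb{E}_{X,Y\sim p_{XY}}\left[\mathbb{E}_{T\sim q_{T|X}(\cdot|X)}\left[\log\frac{q_{T|X}(T|X)}{q_{T|Y}(T|Y)}\right]\right] \tag{11a}\\
&= \mathbb{E}_{X,Y\sim p_{XY}}\left[\mathbb{E}_{T\sim q_{T|X}(\cdot|X)}\left[\log\frac{q_{T|X}(T|X)}{r_{T|Y}(T|Y)}\right]\right] - \mathbb{E}_{Y\sim p_Y}\left[D\left(q_{T|Y}(\cdot|Y)\,\|\,r_{T|Y}(\cdot|Y)\right)\right] \tag{11b}\\
&\le \mathbb{E}_{X,Y\sim p_{XY}}\left[\mathbb{E}_{T\sim q_{T|X}(\cdot|X)}\left[\log\frac{q_{T|X}(T|X)}{r_{T|Y}(T|Y)}\right]\right] \tag{11c}\\
&= \mathbb{E}_{X,Y\sim p_{XY}}\left[D\left(q_{T|X}(\cdot|X)\,\|\,r_{T|Y}(\cdot|Y)\right)\right] \tag{11d}
\end{align}

for any surrogate distribution $r_{T|Y}$. Combining both terms and evaluating the outer expectation by averaging over a dataset $\mathcal{D}$, we obtain the following cost function for NN training:

$$\mathcal{L}^*_{\mathrm{CIB}}(\mathcal{D}) := \frac{1}{N}\sum_{i=1}^{N} \mathbb{E}_{T\sim q_{T|X}(\cdot|x_i)}\left[-\log \hat{q}_{Y|T}(y_i|T)\right] + \beta' D\left(q_{T|X}(\cdot|x_i)\,\|\,r_{T|Y}(\cdot|y_i)\right) \tag{12}$$

For a fixed $r_{T|Y}$, this cost function is minimized over $q_{T|X}$ and $\hat{q}_{Y|T}$, or equivalently, over the parameters $\theta$ of the NN. More generally, if $r_{T|Y}$ can be selected from a family of distributions, then $\mathcal{L}^*_{\mathrm{CIB}}$ is minimized over the parameters of the NN and over all $r_{T|Y}$ within this family.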
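As an illustration of how the cost function (12) could be evaluated, the sketch below assumes a Gaussian encoder with diagonal covariance and a class-conditional Gaussian surrogate with diagonal covariance, so that the KL term has a closed form, and estimates the cross-entropy term by Monte Carlo sampling. These modeling choices and all names (`kl_diag_gauss`, `cib_cost`, `encoder`, `decoder_log_prob`, `prior`) are assumptions made for this example; the draft does not prescribe an implementation.

```python
import math
import random


def kl_diag_gauss(mu_q, var_q, mu_r, var_r):
    """KL divergence D(N(mu_q, diag(var_q)) || N(mu_r, diag(var_r)))."""
    return 0.5 * sum(
        math.log(vr / vq) + (vq + (mq - mr) ** 2) / vr - 1.0
        for mq, vq, mr, vr in zip(mu_q, var_q, mu_r, var_r)
    )


def cib_cost(data, encoder, decoder_log_prob, prior, beta_prime,
             n_samples=10, seed=0):
    """Monte Carlo estimate of the cost (12) under the Gaussian assumptions.

    data:             list of (x, y) pairs
    encoder:          x -> (mean vector, variance vector) of q_{T|X}(.|x)
    decoder_log_prob: (y, t) -> log of the decoder probability of y given t
    prior:            dict y -> (mean vector, variance vector) of r_{T|Y}(.|y)
    """
    rng = random.Random(seed)
    total = 0.0
    for x, y in data:
        mu, var = encoder(x)
        # Cross-entropy term: average -log decoder probability over samples of T.
        cross_entropy = 0.0
        for _ in range(n_samples):
            t = [rng.gauss(m, math.sqrt(v)) for m, v in zip(mu, var)]
            cross_entropy -= decoder_log_prob(y, t)
        cross_entropy /= n_samples
        # Class-conditional compression term in closed form.
        mu_r, var_r = prior[y]
        total += cross_entropy + beta_prime * kl_diag_gauss(mu, var, mu_r, var_r)
    return total / len(data)


# Hypothetical usage with a hand-written encoder and a constant decoder.
toy_data = [(-1.0, 0), (1.0, 1)]
toy_encoder = lambda x: ([x, 0.5 * x], [1.0, 1.0])
toy_decoder = lambda y, t: math.log(0.5)
toy_prior = {0: ([-1.0, -0.5], [1.0, 1.0]), 1: ([1.0, 0.5], [1.0, 1.0])}
toy_cost = cib_cost(toy_data, toy_encoder, toy_decoder, toy_prior, beta_prime=1.0)
```

In an actual training loop the sampling step would be replaced by a differentiable reparameterization in an automatic-differentiation framework; the plain-Python version only shows how the two terms are combined.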

Since (11c) holds for every surrogate distribution, it also holds for a product distribution over the components of $T$, i.e., for $r_{T|Y} = \prod_j r_{T_j}$, where $T_j$ is the $j$-th neuron in the bottleneck layer. This choice yields the variational bounds in [Alemi et al., 2017, Achille and Soatto, 2018]. In contrast, we make the assumption that the distribution of the representation factorizes when conditioning on the class variable $Y$. In other words, we set

$$r_{T|Y} = \prod_j r_{T_j|Y}. \tag{13}$$

In a generative auto-encoding setup in which no class labels are present (or even meaningful), the setting $r_{T|Y} = \prod_j r_{T_j}$ makes sense: Generating a sample of $X$ amounts to sampling from $\prod_j r_{T_j}$, which is particularly simple if the components of $T$ are independent (cf. Footnote 3). As soon as class labels are available, we argue that (13) is preferable over the unconditional setting $r_{T|Y} = \prod_j r_{T_j}$. This is obvious for the classification task; e.g., it is easier to build a classifier operating on a Gaussian mixture model than on a Gaussian RV, cf. Section 3.1.

Footnote 3: The authors of [Achille and Soatto, 2018] build a connection between information dropout and variational auto-encoders (VAE) [Kingma and Welling, 2014]. Specifically, they argue that the variational bound on $\mathcal{L}_{\mathrm{IB}}$ corresponding to $r_{T|Y} = \prod_j r_{T_j}$ is equivalent to the cost function of the VAE when $\beta = 1$. We wish to note here that the IB functional itself is not meaningful in an auto-encoding setup, i.e., for $Y = X$: In this case, we have $\mathcal{L}_{\mathrm{IB}} = H(X|T) + \beta I(X;T) = H(X)$ for $\beta = 1$, i.e., the cost is independent of the encoder $q_{T|X}$ and the decoder $\hat{q}_{X|T}$. For $\beta > 1$, the IB functional aims at minimizing $I(X;T)$, which is trivially fulfilled by an encoder that makes $T$ independent of $X$. Auto-encoding as a trade-off between compression and reconstruction fidelity is only obtained after bounding $H(X|T)$ with the cross-entropy induced by the decoder distribution.

However, even for a generative auto-encoding setup, (13) makes sense if class labels are available. In this case, the aim of the decoder is to reconstruct the input $X$ from the latent representation $T$, i.e., the decoder has the structure $\hat{q}_{X|T}$. Generating an example of a given class $y$ amounts to sampling from $r_{T|Y}(\cdot|y)$, i.e., the distribution over which one samples depends on the class of which one wants to generate an example. (And sampling from this distribution is particularly simple if $r_{T|Y}(\cdot|y)$ is a product distribution.) This conditional variational auto-encoding (CVAE) was discussed in [Sohn et al., 2015] for the case where both encoder and decoder may depend on the class variable, i.e., for $q_{T|X,Y}$ and $\hat{q}_{X|T,Y}$. Removing these dependencies on the class variable, their cost function [Sohn et al., 2015, eq. (4)] is equivalent to our (12) for $\beta' = 1$ and for the decoder $\hat{q}_{Y|T}$ exchanged with $\hat{q}_{X|T}$.

### 3.1 First Consequence: Naive Bayes Structure

An immediate consequence of (11) is that minimizing (12) for (13) simultaneously encourages an encoder that leads to class-conditional compression and a naive Bayes structure that can be exploited by the decoder. This follows because

$$\mathbb{E}_{X,Y\sim p_{XY}}\left[D\left(q_{T|X}(\cdot|X)\,\Big\|\,\prod_j r_{T_j|Y}(\cdot|Y)\right)\right] = I(X;T|Y) + \mathbb{E}_{Y\sim p_Y}\left[D\left(q_{T|Y}(\cdot|Y)\,\Big\|\,\prod_j r_{T_j|Y}(\cdot|Y)\right)\right]. \tag{14}$$

Specifically, suppose that the second term in (14) vanishes. Then, $q_{T|Y} = \prod_j r_{T_j|Y}$ almost surely, and the optimal decoder is a naive Bayes classifier.

From this perspective, the following approach seems to make sense: One fixes a family of distributions from which $r_{T|Y}$ is taken; e.g., $r_{T|Y}(\cdot|y)$ could be a multivariate Gaussian distribution with mean vector and diagonal covariance matrix that depend on the class label. For this parameterized family of distributions, one fixes the decoder $\hat{q}_{Y|T}$ to be the corresponding naive Bayes classifier. Then, by (14), minimizing (12) over the encoder $q_{T|X}$ and the parameters of $r_{T|Y}$ leads to an encoder network such that 1) the latent representations support a naive Bayes classifier, 2) the naive Bayes classifier has good performance on the latent representations, and 3) the latent representations are class-conditionally compressed.
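A naive Bayes decoder of this kind can be sketched in a few lines. The example assumes class-conditional Gaussians with diagonal covariance and computes the class posterior by Bayes' rule in the log domain; the function name `naive_bayes_log_posterior` and the example parameters are assumptions for illustration, not part of the draft.

```python
import math


def naive_bayes_log_posterior(t, class_priors, means, variances):
    """Log posterior over classes given a latent vector t.

    Assumes that, given class y, the components of t are independent and
    Gaussian with mean means[y][j] and variance variances[y][j].
    """
    log_joint = []
    for p_y, mu, var in zip(class_priors, means, variances):
        # Sum of per-component log densities (naive Bayes factorization).
        log_lik = sum(
            -0.5 * (math.log(2.0 * math.pi * v) + (tj - m) ** 2 / v)
            for tj, m, v in zip(t, mu, var)
        )
        log_joint.append(math.log(p_y) + log_lik)
    # Normalize with the log-sum-exp trick for numerical stability.
    top = max(log_joint)
    log_norm = top + math.log(sum(math.exp(v - top) for v in log_joint))
    return [v - log_norm for v in log_joint]


# Hypothetical two-class example with two latent dimensions.
example_priors = [0.5, 0.5]
example_means = [[-1.0, 0.0], [1.0, 0.0]]
example_vars = [[1.0, 1.0], [1.0, 1.0]]
example_posterior = [
    math.exp(v)
    for v in naive_bayes_log_posterior([-2.0, 0.3], example_priors,
                                       example_means, example_vars)
]
```

Because the decoder has no free parameters beyond those of the class-conditional Gaussians, it can be plugged into the cross-entropy term of the cost function directly, so that only the encoder and the Gaussian parameters are trained.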

### 3.2 Second Consequence: Class-Conditional Disentanglement

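In the more general case in which $\mathcal{L}^*_{\mathrm{CIB}}$ is minimized over the parameters of the NN and over all $r_{T|Y}$ within a given family, it was shown that [Achille and Soatto, 2018, Proposition 1]

$$\min_{q_{T|X},\,\hat{q}_{Y|T},\,\{r_{T_j}\}} \frac{1}{N}\sum_{i=1}^{N} \mathbb{E}_{T\sim q_{T|X}(\cdot|x_i)}\left[-\log \hat{q}_{Y|T}(y_i|T)\right] + \beta D\left(q_{T|X}(\cdot|x_i)\,\Big\|\,\prod_j r_{T_j}(\cdot)\right) \tag{15a}$$

is equivalent to

$$\min_{q_{T|X},\,\hat{q}_{Y|T}} \frac{1}{N}\sum_{i=1}^{N} \mathbb{E}_{T\sim q_{T|X}(\cdot|x_i)}\left[-\log \hat{q}_{Y|T}(y_i|T)\right] + \beta D\left(q_{T|X}(\cdot|x_i)\,\|\,q_T(\cdot)\right) + \beta\, TC(T) \tag{15b}$$

where $TC(T) := D\left(q_T \,\|\, \prod_j q_{T_j}\right)$ is the total correlation and where $q_T := \frac{1}{N}\sum_{i=1}^{N} q_{T|X}(\cdot|x_i)$. In other words, minimizing $\mathcal{L}^*_{\mathrm{CIB}}$ for the setting $r_{T|Y} = \prod_j r_{T_j}$ encourages disentangled representations.

If instead of $r_{T|Y} = \prod_j r_{T_j}$ we set $r_{T|Y} = \prod_j r_{T_j|Y}$, then one can show that conditionally disentangled representations are encouraged. In other words, the extracted features are not required to be independent, but to be conditionally independent given the class variable. We believe that this conditional disentanglement is theoretically preferable over disentanglement, if some kind of disentanglement is preferable at all.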

###### Corollary 1 (Corollary to [Achille and Soatto, 2018, Proposition 1]).

The minimization problem

$$\min_{q_{T|X},\,\hat{q}_{Y|T},\,\{r_{T_j|Y}\}} \frac{1}{N}\sum_{i=1}^{N} \mathbb{E}_{T\sim q_{T|X}(\cdot|x_i)}\left[-\log \hat{q}_{Y|T}(y_i|T)\right] + \beta D\left(q_{T|X}(\cdot|x_i)\,\Big\|\,\prod_j r_{T_j|Y}(\cdot|y_i)\right) \tag{16a}$$

is equivalent to

$$\min_{q_{T|X},\,\hat{q}_{Y|T}} \frac{1}{N}\sum_{i=1}^{N} \mathbb{E}_{T\sim q_{T|X}(\cdot|x_i)}\left[-\log \hat{q}_{Y|T}(y_i|T)\right] + \beta D\left(q_{T|X}(\cdot|x_i)\,\|\,q_{T|Y}(\cdot|y_i)\right) + \beta\, TC(T|y_i) \tag{16b}$$

where $TC(T|y) := D\left(q_{T|Y}(\cdot|y)\,\|\,\prod_j q_{T_j|Y}(\cdot|y)\right)$ and $q_{T|Y}(\cdot|y) := \frac{1}{|\{i:\, y_i=y\}|}\sum_{i:\, y_i=y} q_{T|X}(\cdot|x_i)$.

Before providing the proof, two aspects are worth mentioning. First, the equivalence of the two optimization problems in the corollary is only valid if the optimization over the marginal distributions $r_{T_j|Y}$ is unconstrained. If instead, for example, the distributions $r_{T_j|Y}$ have to be chosen from a specific family (e.g., Gaussian), then this equivalence need not hold in general. We believe that such a constrained optimization is of greater practical relevance than the unconstrained one, which in some sense limits the practical applicability of this result. The second aspect is that, if instead of a dataset $\mathcal{D}$ the distribution $p_{XY}$ is used to compute expectations, the second and third terms in (16b) evaluate to $\beta I(X;T|Y)$ and $\beta\, TC(T|Y)$. Thus, and connecting to (14), it can be seen that the variational bound on $I(X;T|Y)$ is equivalent to adding a regularization term that encourages disentanglement (cf. the discussion after [Achille and Soatto, 2018, Proposition 1]).

###### Proof.

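The first term does not depend on $\{r_{T_j|Y}\}$, so it suffices to show that

$$\min_{\{r_{T_j|Y}\}} \frac{1}{N}\sum_{i=1}^{N} D\left(q_{T|X}(\cdot|x_i)\,\Big\|\,\prod_j r_{T_j|Y}(\cdot|y_i)\right) = \frac{1}{N}\sum_{i=1}^{N} D\left(q_{T|X}(\cdot|x_i)\,\|\,q_{T|Y}(\cdot|y_i)\right) + TC(T|y_i) \tag{17}$$

for every $q_{T|X}$. Indeed, by the product rule of the logarithm one can show that

\begin{align}
\frac{1}{N}\sum_{i=1}^{N} D\left(q_{T|X}(\cdot|x_i)\,\Big\|\,\prod_j r_{T_j|Y}(\cdot|y_i)\right)
&= \frac{1}{N}\sum_{i=1}^{N} \mathbb{E}_{T\sim q_{T|X}(\cdot|x_i)}\left[\log\frac{q_{T|X}(T|x_i)}{q_{T|Y}(T|y_i)}\right] + \mathbb{E}_{T\sim q_{T|X}(\cdot|x_i)}\left[\log\frac{q_{T|Y}(T|y_i)}{\prod_j r_{T_j|Y}(T_j|y_i)}\right] \tag{18}\\
&= \frac{1}{N}\sum_{i=1}^{N} D\left(q_{T|X}(\cdot|x_i)\,\|\,q_{T|Y}(\cdot|y_i)\right) + \mathbb{E}_{T\sim q_{T|X}(\cdot|x_i)}\left[\log\frac{q_{T|Y}(T|y_i)}{\prod_j r_{T_j|Y}(T_j|y_i)}\right] \tag{19}
\end{align}

It remains to show that minimizing the second part of this sum over all $\{r_{T_j|Y}\}$ yields $\frac{1}{N}\sum_{i=1}^{N} TC(T|y_i)$. To this end, we split the sum over all samples into two sums, one of which runs over the possible values $y$ of the class variable, and one that runs over all samples for which $y_i = y$. With this, and the law of total expectation, we get

\begin{align}
\frac{1}{N}\sum_{i=1}^{N} \mathbb{E}_{T\sim q_{T|X}(\cdot|x_i)}\left[\log\frac{q_{T|Y}(T|y_i)}{\prod_j r_{T_j|Y}(T_j|y_i)}\right]
&= \frac{1}{N}\sum_{y\in\mathcal{Y}}\ \sum_{i:\, y_i=y} \mathbb{E}_{T\sim q_{T|X}(\cdot|x_i)}\left[\log\frac{q_{T|Y}(T|y)}{\prod_j r_{T_j|Y}(T_j|y)}\right] \tag{20}\\
&= \frac{1}{N}\sum_{y\in\mathcal{Y}} |\{i:\, y_i=y\}|\ \mathbb{E}_{T\sim q_{T|Y}(\cdot|y)}\left[\log\frac{q_{T|Y}(T|y)}{\prod_j r_{T_j|Y}(T_j|y)}\right] \tag{21}\\
&= \frac{1}{N}\sum_{y\in\mathcal{Y}} |\{i:\, y_i=y\}|\ D\left(q_{T|Y}(\cdot|y)\,\Big\|\,\prod_j r_{T_j|Y}(\cdot|y)\right) \tag{22}
\end{align}

We now minimize the right-hand side of (22) over all $\{r_{T_j|Y}\}$. To this end, for every $y$, we expand the KL divergence via the chain rule [Cover and Thomas, 1991, Th. 2.5.3] to get

\begin{align}
&\min_{\{r_{T_j|Y}\}} \frac{1}{N}\sum_{y\in\mathcal{Y}} |\{i:\, y_i=y\}|\ D\left(q_{T|Y}(\cdot|y)\,\Big\|\,\prod_j r_{T_j|Y}(\cdot|y)\right) \notag\\
&\quad= \frac{1}{N}\sum_{y\in\mathcal{Y}} |\{i:\, y_i=y\}| \sum_j \min_{r_{T_j|Y}} \mathbb{E}_{T_1^{j-1}\sim q_{T_1^{j-1}|Y}(\cdot|y)}\left[D\left(q_{T_j|Y,T_1^{j-1}}(\cdot|y,T_1^{j-1})\,\|\,r_{T_j|Y}(\cdot|y)\right)\right] \tag{23}\\
&\quad\overset{(a)}{=} \frac{1}{N}\sum_{y\in\mathcal{Y}} |\{i:\, y_i=y\}| \sum_j \mathbb{E}_{T_1^{j-1}\sim q_{T_1^{j-1}|Y}(\cdot|y)}\left[D\left(q_{T_j|Y,T_1^{j-1}}(\cdot|y,T_1^{j-1})\,\|\,q_{T_j|Y}(\cdot|y)\right)\right] \tag{24}\\
&\quad= \frac{1}{N}\sum_{y\in\mathcal{Y}} |\{i:\, y_i=y\}|\ D\left(q_{T|Y}(\cdot|y)\,\Big\|\,\prod_j q_{T_j|Y}(\cdot|y)\right) = \frac{1}{N}\sum_{i=1}^{N} TC(T|y_i) \tag{25}
\end{align}

where $(a)$ follows from [Cover and Thomas, 1991, Lemma 13.8.1]. This completes the proof. ∎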
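The key step of the proof, namely that the best product-form surrogate is the product of the true marginals and that the resulting divergence is the total correlation, can be illustrated numerically. The sketch below considers a single class and a latent representation with two binary components; the joint distribution `q_joint` (playing the role of the class-conditional latent distribution) and the competing marginals `r1`, `r2` are illustrative assumptions. It uses the standard decomposition of the divergence to a product distribution into the total correlation plus the marginal divergences.

```python
from math import log

# Class-conditional joint distribution of two binary latent components
# (rows: value of T_1, columns: value of T_2). Numbers are illustrative.
q_joint = [[0.4, 0.1],
           [0.1, 0.4]]

# Marginals of T_1 and T_2 under q_joint.
q1 = [sum(row) for row in q_joint]
q2 = [sum(q_joint[a][b] for a in range(2)) for b in range(2)]


def kl(p, q):
    """KL divergence between two discrete distributions (in nats)."""
    return sum(pi * log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def kl_to_product(joint, r1, r2):
    """KL divergence from a joint distribution to the product r1 x r2."""
    return sum(
        joint[a][b] * log(joint[a][b] / (r1[a] * r2[b]))
        for a in range(len(r1))
        for b in range(len(r2))
        if joint[a][b] > 0
    )


# Total correlation: divergence to the product of the true marginals.
total_correlation = kl_to_product(q_joint, q1, q2)

# Any other product distribution pays an extra, non-negative penalty.
r1, r2 = [0.7, 0.3], [0.2, 0.8]
divergence = kl_to_product(q_joint, r1, r2)
decomposition = total_correlation + kl(q1, r1) + kl(q2, r2)
```

Since the two marginal divergences are non-negative and vanish exactly when the surrogate marginals equal the true marginals, the minimum over all product distributions is the total correlation, as used in the last step of the proof.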

## 4 Planned Experiments

To investigate whether the presented framework based on class-conditional compression is useful, we plan to perform a set of experiments. Whether these experiments are feasible in principle is, at present, unclear.

### 4.1 Nonlinear Information Bottleneck

In [Kolchinsky et al., 2018], the authors use a stochastic encoder which learns the mean vector of a multivariate Gaussian with identity covariance matrix, i.e., the mean vector $f_\theta(x)$ is computed by the network. Therefore, the authors assume that the latent representation is a Gaussian mixture, with each point in the dataset being an individual component. Based on this assumption, they propose bounding the compression term via [Kolchinsky et al., 2018, eq. (10)]

$$I(X;T) \le -\frac{1}{N}\sum_{i=1}^{N} \log \sum_{j=1}^{N} \exp\left(-\frac{1}{2}\,\frac{\|f_\theta(x_i) - f_\theta(x_j)\|}{\eta^2(\theta)+\sigma^2}\right) - m\log\frac{\sigma^2}{\eta^2(\theta)+\sigma^2} \tag{26}$$

where is a noise parameter that is learned.

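Moving from compression to class-conditional compression is achieved by replacing $I(X;T)$ by $I(X;T|Y)$. We believe that this should also be possible in the framework of nonlinear information bottleneck by computing (26) separately for each class. In other words, we bound

$$I(X;T|Y=y) \le -\frac{1}{N}\sum_{i:\, y_i=y} \log \sum_{j:\, y_j=y} \exp\left(-\frac{1}{2}\,\frac{\|f_\theta(x_i) - f_\theta(x_j)\|}{\eta^2(\theta)+\sigma^2}\right) - m\log\frac{\sigma^2}{\eta^2(\theta)+\sigma^2} =: \hat{I}(X;T|Y=y) \tag{27}$$

and obtain

$$I(X;T|Y) \le \sum_{y\in\mathcal{Y}} |\{i:\, y_i=y\}|\ \hat{I}(X;T|Y=y). \tag{28}$$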

### 4.2 Naive Bayes Decoder

This experiment is based on Section 3.1. Specifically, we plan to choose $r_{T|Y}$ from the family of Gaussian distributions with a mean vector that depends on the class $y$ and an identity matrix (possibly scaled with a constant that depends on the class $y$) as covariance matrix. This leads to the goal of obtaining a latent representation that is well-approximated by a Gaussian mixture model, where each mixture component is spherical.

Rather than training the decoder part of the network, we replace this part by a naive Bayes classifier fitted to the parameters of $r_{T|Y}$. Our aim is then to train the encoder part of the network such that the naive Bayes decoder can be fully exploited, i.e., we learn the parameters of the encoder and the parameters of $r_{T|Y}$ such that the cost $\mathcal{L}^*_{\mathrm{CIB}}$ in (12) is minimized.

### 4.3 Deep Variational Information Bottleneck

The authors of [Alemi et al., 2017] suggest minimizing the variational bound on $\mathcal{L}_{\mathrm{IB}}$ for a spherical Gaussian $r_T$, i.e., they assume that $r_T = \mathcal{N}(0, I)$. Replacing this target distribution by a conditionally independent distribution of the latent dimensions given the class, i.e., by $r_{T|Y}(\cdot|y) = \mathcal{N}(\mu_y, I)$, is simple. It is unclear, however, how the mean vectors $\mu_y$ shall be chosen or, which is preferable in the light of Corollary 1, whether these mean vectors can be learned from data jointly (or alternatingly) with the remaining network parameters.
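One way to picture the alternating option is sketched below, under the assumption of a Gaussian encoder with diagonal covariance and a class-conditional target with identity covariance and class-dependent mean. For a fixed encoder, the KL term depends on the class mean only through a squared distance to the encoder means, so the minimizing class mean is the average of the encoder means of that class. The function names `kl_to_unit_gauss` and `update_class_means` and the toy data are hypothetical and only serve this example.

```python
import math


def kl_to_unit_gauss(mu_q, var_q, class_mean):
    """KL divergence D(N(mu_q, diag(var_q)) || N(class_mean, I))."""
    return 0.5 * sum(
        v + (m - c) ** 2 - 1.0 - math.log(v)
        for m, v, c in zip(mu_q, var_q, class_mean)
    )


def update_class_means(encoded, labels, n_classes):
    """Class means minimizing the summed KL term for a fixed encoder.

    encoded: list of (mean vector, variance vector) pairs, one per sample
    labels:  list of integer class labels in range(n_classes)
    """
    dim = len(encoded[0][0])
    sums = [[0.0] * dim for _ in range(n_classes)]
    counts = [0] * n_classes
    for (mu_q, _), y in zip(encoded, labels):
        counts[y] += 1
        for j in range(dim):
            sums[y][j] += mu_q[j]
    # Classes without samples keep a zero mean (an arbitrary fallback).
    return [
        [s / counts[y] for s in sums[y]] if counts[y] else [0.0] * dim
        for y in range(n_classes)
    ]


# Hypothetical encoder outputs for three samples from two classes.
encoded = [([0.0, 0.0], [1.0, 1.0]),
           ([2.0, 0.0], [1.0, 1.0]),
           ([5.0, 5.0], [0.5, 0.5])]
labels = [0, 0, 1]
class_means = update_class_means(encoded, labels, n_classes=2)
kl_total = sum(kl_to_unit_gauss(mu, var, class_means[y])
               for (mu, var), y in zip(encoded, labels))
```

An alternating scheme would interleave such a closed-form update with gradient steps on the network parameters; whether this works well in practice is exactly the open question raised above.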

### 4.4 Conditional Information Dropout

In [Achille and Soatto, 2018], the authors made the connection between a well-chosen variational bound and disentanglement [Achille and Soatto, 2018, Proposition 1]. They further proposed an encoder $q_{T|X}$ that is implemented by a NN where each neuron output is affected by multiplicative data-dependent noise (which is chosen to follow a log-normal distribution with data-dependent variance for the sake of analytical simplicity). The authors furthermore proposed that $r_T = \prod_j r_{T_j}$, where $r_{T_j}$ is log-uniform with a point mass at zero or log-normal for ReLU or softplus activation functions, respectively (cf. [Achille and Soatto, 2018, Propositions 2 and 3]).

In the setting proposed in this draft, one would have to replace $r_{T_j}$ by $r_{T_j|Y}$. In case of a softplus activation, this would mean that $r_{T_j|Y}$ is a log-normal distribution the mean of which depends on the class $y$ (and potentially on the latent dimension $j$). In case of a ReLU activation, this would require that the point mass at zero depends on the class $y$ (and potentially on the latent dimension $j$). We are again faced with the issue mentioned in the previous subsection, i.e., whether these parameters can be trained from data or if (and how) they can be selected a priori.