1 Notation
We consider a classification task with a feature random variable (RV) $X$ on $\mathcal{X}$ and a class RV $Y$ on the finite set $\mathcal{Y}$ of classes. If a dataset is available, then this dataset consists of realizations of the joint distribution $P_{X,Y}$, i.e., $\mathcal{D}=\{(x_i,y_i)\}_{i=1}^{N}$.

We further consider stochastic feed-forward neural networks (NNs). We assume that the input of the NN is the RV $X$, the output of the network is the RV $\hat{Y}$, and every hidden layer defines an internal representation. In this work we are interested in a particular representation at a dedicated bottleneck layer, which we will denote by the RV $Z$. The NN is parameterized by a set of parameters $\theta$ which define the stochastic map $P_{Z\mid X;\theta}$ from the input to the representation and the stochastic map $P_{\hat{Y}\mid Z;\theta}$ from the representation to the network output. We call $P_{Z\mid X;\theta}$ and $P_{\hat{Y}\mid Z;\theta}$ the encoder and decoder, respectively.

With this notation established, we denote distributions that are induced by the encoder/decoder (i.e., that depend on the parameters $\theta$) with a subscript $\theta$. For example, we have

$$P_{Y\mid Z;\theta}(y\mid z)=\frac{\mathbb{E}_{X}\!\left[P_{Y\mid X}(y\mid X)\,P_{Z\mid X;\theta}(z\mid X)\right]}{\mathbb{E}_{X}\!\left[P_{Z\mid X;\theta}(z\mid X)\right]} \tag{1}$$

for the distribution of the class label conditioned on the latent representation and

$$P_{Z\mid Y;\theta}(z\mid y)=\mathbb{E}_{X\sim P_{X\mid Y}(\cdot\mid y)}\!\left[P_{Z\mid X;\theta}(z\mid X)\right] \tag{2}$$

for the distribution of the latent representation conditioned on the class label. Surrogate distributions, which do not depend on $\theta$, are denoted with $Q$.
2 Adapting the Information Bottleneck Loss for Optimal Representations
Our aim is to extract a representation $Z$ of the feature $X$ such that the representation allows an accurate classification, but is at the same time maximally compressed. In other words, we are looking for a stochastic map $P_{Z\mid X}$ of $X$ such that the output $Z$ of this map contains all – but not more – information about the class $Y$ that is contained in $X$. This aim is often formalized in terms of the information bottleneck (IB) functional; in the notation of [Achille and Soatto, 2018, eq. (2)], we aim to find a minimizer of

$$\mathcal{L}_{\mathrm{IB}}(\theta)=H(Y\mid Z)+\beta\, I(X;Z) \tag{3}$$

where $\beta>0$ trades between the aims of preserving information about $Y$ (first term) and compressing the representation (second term). These two goals are conflicting, because compressing the representation potentially also leads to a loss of information relevant for classification. In the extreme case where $I(X;Z)=0$ we trivially have $H(Y\mid Z)=H(Y)$.
We now show that a different but equivalent formulation of the IB functional leads to two terms which are not in direct conflict anymore. Specifically, we replace the compression term $I(X;Z)$ by a class-conditional compression term: Our aim is not to compress the latent representation $Z$ as a whole, but to remove every bit of information from this latent representation that is not necessary for classification. This latter quantity is captured in the conditional mutual information $I(X;Z\mid Y)$.

Indeed, since $Y-X-Z$ is a Markov tuple, we have that $I(Y;Z\mid X)=0$. Furthermore, by the chain rule of mutual information, we have

$$I(X;Z)=I(X;Z)+I(Y;Z\mid X)=I(X,Y;Z) \tag{4}$$
$$I(X,Y;Z)=I(Y;Z)+I(X;Z\mid Y). \tag{5}$$

Combining (3) and (5) yields $\mathcal{L}_{\mathrm{IB}}(\theta)=(1-\beta)H(Y\mid Z)+\beta\, I(X;Z\mid Y)+\beta H(Y)$. Since $H(Y)$ is independent of the map $P_{Z\mid X;\theta}$, $\theta$ minimizes $\mathcal{L}_{\mathrm{IB}}$ for $\beta\in[0,1)$ if and only if it minimizes

$$H(Y\mid Z)+\gamma\, I(X;Z\mid Y) \tag{6}$$

for $\gamma=\beta/(1-\beta)$. Minimizing the second term – which we call class-conditional compression in the remainder of this work – is not in direct conflict with minimizing the first anymore, as $I(X;Z\mid Y)=0$ and $H(Y\mid Z)=H(Y\mid X)$ are jointly possible.^2

^2 Going one step further, noticing that $H(Y)$ does not depend on $\theta$, and that $I(Y;Z)=H(Y)-H(Y\mid Z)$, one can show that the optimization problem is equivalent to finding a minimizer of $-I(Y;Z)+\gamma\, I(X;Z\mid Y)$.

Taking a closer look at $\gamma=\beta/(1-\beta)$ illustrates that for $\beta\to 1$ we have $\gamma\to\infty$, i.e., the optimization problem focuses only on (class-conditional) compression. This has been observed both analytically (e.g., [Kolchinsky et al., 2018, p. 2]) and in experiments (e.g., [Achille and Soatto, 2018, Figs. 4 & 5] and [Alemi et al., 2017, Fig. 1]).
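The decomposition (4)–(5) underlying this reformulation is easy to check numerically on small discrete alphabets. The following sketch (numpy; the alphabet sizes and helper names are our own choices, not from the cited works) draws a random Markov tuple $Y-X-Z$ and confirms that $I(X;Z)=I(Y;Z)+I(X;Z\mid Y)$:

```python
import numpy as np

rng = np.random.default_rng(0)

def mi(pab):
    """Mutual information (in nats) of a joint pmf given as a 2-D array."""
    pa = pab.sum(axis=1, keepdims=True)
    pb = pab.sum(axis=0, keepdims=True)
    m = pab > 0
    return float((pab[m] * np.log(pab[m] / (pa * pb)[m])).sum())

# random Markov chain Y - X - Z on small alphabets
nY, nX, nZ = 3, 4, 5
pY = rng.dirichlet(np.ones(nY))
pX_Y = rng.dirichlet(np.ones(nX), size=nY)   # row y holds p(x|y)
pZ_X = rng.dirichlet(np.ones(nZ), size=nX)   # row x holds p(z|x)

# joint p(x, y, z) = p(y) p(x|y) p(z|x), stored as an (nX, nY, nZ) array
pXYZ = pY[None, :, None] * pX_Y.T[:, :, None] * pZ_X[:, None, :]

I_XZ = mi(pXYZ.sum(axis=1))      # I(X;Z)
I_YZ = mi(pXYZ.sum(axis=0))      # I(Y;Z)
# I(X;Z|Y) = sum_y p(y) I(X;Z | Y=y)
I_XZ_Y = sum(pY[y] * mi(pXYZ[:, y, :] / pY[y]) for y in range(nY))

assert np.isclose(I_XZ, I_YZ + I_XZ_Y)
```

The assertion holds for every Markov tuple, not just this random instance, since $I(Y;Z\mid X)=0$ makes (4) collapse to $I(X;Z)=I(X,Y;Z)$.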
3 A Variational Bound on Class-Conditional Compression and Its Consequences
While it has been shown that the functional (3), and thus also (6), becomes infinite for deterministic NNs with a continuously distributed input [Amjad and Geiger, 2018, Th. 1], for stochastic NNs it was argued that these functionals are complicated to estimate [Kolchinsky et al., 2018, Alemi et al., 2017]. As a remedy, both terms of (3) can be replaced by variational bounds. We aim to do the same here for (6). We start with $H(Y\mid Z)$:
$$H(Y\mid Z)=\mathbb{E}\!\left[-\log P_{Y\mid Z;\theta}(Y\mid Z)\right] \tag{8}$$
$$=\mathbb{E}\!\left[-\log P_{\hat{Y}\mid Z;\theta}(Y\mid Z)\right]-\mathbb{E}\!\left[D\!\left(P_{Y\mid Z;\theta}(\cdot\mid Z)\,\middle\|\,P_{\hat{Y}\mid Z;\theta}(\cdot\mid Z)\right)\right] \tag{9}$$
$$\le\mathbb{E}\!\left[-\log P_{\hat{Y}\mid Z;\theta}(Y\mid Z)\right] \tag{10}$$

where the inequality follows from the non-negativity of the Kullback-Leibler (KL) divergence $D(\cdot\|\cdot)$ and leads to the popular cross-entropy cost function. For the second term $I(X;Z\mid Y)$, note that by the non-negativity of KL divergence we have

$$I(X;Z\mid Y)=\mathbb{E}\!\left[D\!\left(P_{Z\mid X;\theta}(\cdot\mid X)\,\middle\|\,P_{Z\mid Y;\theta}(\cdot\mid Y)\right)\right] \tag{11a}$$
$$=\mathbb{E}\!\left[D\!\left(P_{Z\mid X;\theta}(\cdot\mid X)\,\middle\|\,Q_{Z\mid Y}(\cdot\mid Y)\right)\right]-\mathbb{E}\!\left[D\!\left(P_{Z\mid Y;\theta}(\cdot\mid Y)\,\middle\|\,Q_{Z\mid Y}(\cdot\mid Y)\right)\right] \tag{11b}$$
$$\le\mathbb{E}\!\left[D\!\left(P_{Z\mid X;\theta}(\cdot\mid X)\,\middle\|\,Q_{Z\mid Y}(\cdot\mid Y)\right)\right] \tag{11c}$$
$$=\mathbb{E}\!\left[\int_{\mathcal{Z}}P_{Z\mid X;\theta}(z\mid X)\log\frac{P_{Z\mid X;\theta}(z\mid X)}{Q_{Z\mid Y}(z\mid Y)}\,\mathrm{d}z\right] \tag{11d}$$
for any surrogate distribution $Q_{Z\mid Y}$. Combining both terms and evaluating the outer expectation by averaging over a dataset $\mathcal{D}=\{(x_i,y_i)\}_{i=1}^{N}$, we obtain the following cost function for NN training:

$$\hat{\mathcal{L}}(\theta,Q_{Z\mid Y})=\frac{1}{N}\sum_{i=1}^{N}\left(\mathbb{E}_{Z\sim P_{Z\mid X;\theta}(\cdot\mid x_i)}\!\left[-\log P_{\hat{Y}\mid Z;\theta}(y_i\mid Z)\right]+\gamma\, D\!\left(P_{Z\mid X;\theta}(\cdot\mid x_i)\,\middle\|\,Q_{Z\mid Y}(\cdot\mid y_i)\right)\right) \tag{12}$$
For a fixed $Q_{Z\mid Y}$, this cost function is minimized over the encoder $P_{Z\mid X;\theta}$ and the decoder $P_{\hat{Y}\mid Z;\theta}$, or equivalently, over the parameters $\theta$ of the NN. More generally, if $Q_{Z\mid Y}$ can be selected from a family of distributions, then (12) is minimized over the parameters $\theta$ of the NN and over all $Q_{Z\mid Y}$ within this family.
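To make (12) concrete, the cost can be estimated by Monte-Carlo sampling when the encoder is Gaussian and the surrogates $Q_{Z\mid Y}$ are class-conditional Gaussians, since the KL divergence is then available in closed form. The numpy sketch below uses a toy linear encoder with unit variance and a softmax decoder; all function and parameter names (`cost_12`, `class_means`, etc.) are our own illustration, not taken from the cited works:

```python
import numpy as np

rng = np.random.default_rng(1)

def kl_diag_gauss(mu1, var1, mu2, var2):
    # closed-form KL( N(mu1, diag var1) || N(mu2, diag var2) ), per row
    return 0.5 * np.sum(
        np.log(var2 / var1) + (var1 + (mu1 - mu2) ** 2) / var2 - 1.0, axis=-1
    )

def cost_12(x, y, W, b, V, c, class_means, gamma=1.0, n_mc=16):
    """Monte-Carlo estimate of (12): cross-entropy + gamma * KL(P_{Z|X} || Q_{Z|Y})."""
    mu = x @ W + b                      # encoder mean; unit-variance Gaussian encoder
    ce = 0.0
    for _ in range(n_mc):               # sample z ~ P_{Z|X;theta}, average -log P(y_i|z)
        z = mu + rng.standard_normal(mu.shape)
        logits = z @ V + c
        logits -= logits.max(axis=1, keepdims=True)   # numerical stability
        logp = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
        ce += -logp[np.arange(len(y)), y].mean() / n_mc
    ones = np.ones_like(mu)
    kl = kl_diag_gauss(mu, ones, class_means[y], ones).mean()
    return ce + gamma * kl

# toy usage: 4-dim features, 2-dim bottleneck, 3 classes
x = rng.standard_normal((10, 4))
y = rng.integers(0, 3, size=10)
W, b = rng.standard_normal((4, 2)), np.zeros(2)
V, c = rng.standard_normal((2, 3)), np.zeros(3)
class_means = rng.standard_normal((3, 2))
value = cost_12(x, y, W, b, V, c, class_means, gamma=0.5)
```

In practice the encoder and decoder would of course be deep networks trained by stochastic gradient descent (with the reparameterization trick for the sampling step); the sketch only shows how the two terms of (12) fit together.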
Since (11c) holds for every surrogate distribution, it also holds for a product distribution over the components of $Z$, i.e., for $Q_Z=\prod_{j=1}^{m}Q_{Z_j}$, where $Z_j$ is the $j$-th neuron in the bottleneck layer. This choice yields the variational bounds in [Alemi et al., 2017, Achille and Soatto, 2018]. In contrast, we make the assumption that the distribution of the representation factorizes when conditioning on the class variable $Y$. In other words, we set

$$Q_{Z\mid Y}(z\mid y)=\prod_{j=1}^{m}Q_{Z_j\mid Y}(z_j\mid y). \tag{13}$$
In a generative auto-encoding setup in which no class labels are present (or even meaningful), the setting $Q_Z=\prod_j Q_{Z_j}$ makes sense: Generating a sample of $X$ amounts to sampling from $Q_Z$, which is particularly simple if the components of $Z$ are independent.^3 As soon as class labels are available, we argue that (13) is preferable over the unconditional setting $Q_Z=\prod_j Q_{Z_j}$

^3 The authors of [Achille and Soatto, 2018] build a connection between information dropout and variational auto-encoders (VAE) [Kingma and Welling, 2014]. Specifically, they argue that the variational bound on (3) corresponding to $Q_Z=\prod_j Q_{Z_j}$ is equivalent to the cost function of the VAE when $Y=X$. We wish to note here that the IB functional itself is not meaningful in an auto-encoding setup, i.e., for $Y=X$: In this case, we have $\mathcal{L}_{\mathrm{IB}}(\theta)=H(X)$ for $\beta=1$, i.e., the cost is independent of the encoder and the decoder. For $\beta>1$, the IB functional aims at minimizing $I(X;Z)$, which is trivially fulfilled by an encoder that makes $Z$ independent of $X$. Auto-encoding as a trade-off between compression and reconstruction fidelity is only obtained after bounding $H(X\mid Z)$ with the cross-entropy induced by the decoder distribution.
. This is obvious for the classification task; e.g., it is easier to build a classifier operating on a Gaussian mixture model than on a Gaussian RV, cf. Section 3.1.

However, even for a generative auto-encoding setup, (13) makes sense if class labels are available. In this case, the aim of the decoder is to reconstruct the input $X$ from the latent representation $Z$, i.e., the decoder has the structure $P_{\hat{X}\mid Z;\theta}$. Generating an example of a given class $y$ amounts to sampling from $Q_{Z\mid Y}(\cdot\mid y)$, i.e., the distribution over which one samples depends on the class of which one wants to generate an example. (And sampling from this distribution is particularly simple if it is a product distribution.) This conditional variational auto-encoding (CVAE) was discussed in [Sohn et al., 2015] for the case where both encoder and decoder may depend on the class variable, i.e., for $P_{Z\mid X,Y;\theta}$ and $P_{\hat{X}\mid Z,Y;\theta}$. Removing these dependences on the class variable, their cost function [Sohn et al., 2015, eq. (4)] is equivalent to our (12) for $\gamma=1$ and for $Y$ exchanged with $X$.
3.1 First Consequence: Naive Bayes Structure
An immediate consequence of (11) is that minimizing (12) for the choice (13) simultaneously encourages an encoder that leads to class-conditional compression and a naive Bayes structure that can be exploited by the decoder. This follows because

$$\mathbb{E}\!\left[D\!\left(P_{Z\mid X;\theta}(\cdot\mid X)\,\middle\|\,Q_{Z\mid Y}(\cdot\mid Y)\right)\right]=I(X;Z\mid Y)+\mathbb{E}\!\left[D\!\left(P_{Z\mid Y;\theta}(\cdot\mid Y)\,\middle\|\,Q_{Z\mid Y}(\cdot\mid Y)\right)\right]. \tag{14}$$

Specifically, suppose that the second term in (14) vanishes. Then, $P_{Z\mid Y;\theta}=Q_{Z\mid Y}$ almost surely, i.e., the class-conditional distribution of the latent representation factorizes as in (13), and the optimal decoder is a naive Bayes classifier.
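For concreteness, if $Q_{Z\mid Y}$ is chosen as a spherical Gaussian per class, the resulting naive Bayes decoder has a simple closed form. A minimal numpy sketch (function and argument names are ours, for illustration only):

```python
import numpy as np

def naive_bayes_decoder(z, class_means, class_vars, class_priors):
    """P(y|z) when Q_{Z|Y}(.|y) = N(mu_y, sigma_y^2 I); returns an (N, C) array."""
    d = z.shape[-1]
    # squared distance of every latent point to every class mean, shape (N, C)
    sq = ((z[:, None, :] - class_means[None, :, :]) ** 2).sum(axis=-1)
    log_lik = -0.5 * (sq / class_vars + d * np.log(2 * np.pi * class_vars))
    log_post = np.log(class_priors) + log_lik
    log_post -= np.logaddexp.reduce(log_post, axis=1, keepdims=True)  # normalize
    return np.exp(log_post)

# two well-separated spherical classes: z lies on the first class mean
post = naive_bayes_decoder(
    z=np.array([[0.0, 0.0]]),
    class_means=np.array([[0.0, 0.0], [10.0, 10.0]]),
    class_vars=np.array([1.0, 1.0]),
    class_priors=np.array([0.5, 0.5]),
)
# post[0, 0] is close to 1
```

Note that such a decoder has no trainable parameters beyond the means, variances, and priors of $Q_{Z\mid Y}$, which is exactly what the approach in this section exploits.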
From this perspective, the following approach seems to make sense: One fixes a family of distributions from which $Q_{Z\mid Y}$ is taken; e.g., $Q_{Z\mid Y}(\cdot\mid y)$ could be a multivariate Gaussian distribution with mean vector $\mu_y$ and diagonal covariance matrix $\Sigma_y$ that depend on the class label $y$. For this parameterized family of distributions, one fixes the decoder $P_{\hat{Y}\mid Z;\theta}$ to be the corresponding naive Bayes classifier. Then, by (14), minimizing (12) over the encoder and the parameters of $Q_{Z\mid Y}$ leads to an encoder network such that 1) the latent representations are such that they support a naive Bayes classifier, 2) the naive Bayes classifier has good performance on the latent representations, and 3) the latent representations are class-conditionally compressed.

3.2 Second Consequence: Class-Conditional Disentanglement
In the more general case in which (12) is minimized over the parameters $\theta$ of the NN and over all $Q_Z$ within a given family, it was shown that [Achille and Soatto, 2018, Proposition 1]

$$\min_{\theta,\,Q_Z=\prod_j Q_{Z_j}}\ \mathbb{E}\!\left[D\!\left(P_{Z\mid X;\theta}(\cdot\mid X)\,\middle\|\,\textstyle\prod_j Q_{Z_j}\right)\right] \tag{15a}$$

is equivalent to

$$\min_{\theta}\ I(X;Z)+TC(Z) \tag{15b}$$

where $TC(Z)=D\!\left(P_{Z;\theta}\,\middle\|\,\prod_j P_{Z_j;\theta}\right)$ is the total correlation and where the minimum in (15a) is achieved for $Q_{Z_j}=P_{Z_j;\theta}$. In other words, minimizing for the setting $Q_Z=\prod_j Q_{Z_j}$ encourages disentangled representations.
If instead of $Q_Z=\prod_j Q_{Z_j}$ we set $Q_{Z\mid Y}=\prod_j Q_{Z_j\mid Y}$ as in (13), then one can show that conditionally disentangled representations are encouraged. In other words, the extracted features are not required to be independent, but to be conditionally independent given the class variable. We believe that this conditional disentanglement is theoretically preferable over disentanglement, if some kind of disentanglement is preferable at all.
Corollary 1 (Corollary to [Achille and Soatto, 2018, Proposition 1]).
The minimization problem

$$\min_{\theta,\,\{Q_{Z_j\mid Y}\}}\ \hat{\mathcal{L}}\!\left(\theta,\textstyle\prod_j Q_{Z_j\mid Y}\right) \tag{16a}$$

is equivalent to

$$\min_{\theta}\ \frac{1}{N}\sum_{i=1}^{N}\mathbb{E}_{Z\sim P_{Z\mid X;\theta}(\cdot\mid x_i)}\!\left[-\log P_{\hat{Y}\mid Z;\theta}(y_i\mid Z)\right]+\frac{\gamma}{N}\sum_{i=1}^{N} D\!\left(P_{Z\mid X;\theta}(\cdot\mid x_i)\,\middle\|\,\hat{P}_{Z\mid Y;\theta}(\cdot\mid y_i)\right)+\frac{\gamma}{N}\sum_{i=1}^{N} D\!\left(\hat{P}_{Z\mid Y;\theta}(\cdot\mid y_i)\,\middle\|\,\textstyle\prod_j \hat{P}_{Z_j\mid Y;\theta}(\cdot\mid y_i)\right) \tag{16b}$$

where $\hat{P}_{Z\mid Y;\theta}(\cdot\mid y)=\frac{1}{N_y}\sum_{i\colon y_i=y}P_{Z\mid X;\theta}(\cdot\mid x_i)$ with $N_y=|\{i\colon y_i=y\}|$, and where $\hat{P}_{Z_j\mid Y;\theta}$ denotes its $j$-th marginal.
Before providing the proof, two aspects are worth mentioning. First, the equivalence of the two optimization problems in the corollary is only valid if the optimization over the marginal distributions $Q_{Z_j\mid Y}$ is unconstrained. If instead, for example, the distributions have to be chosen from a specific family (e.g., Gaussian), then this equivalence need not hold in general. We believe that such a constrained optimization is of greater practical relevance than the unconstrained one, which in some sense limits the practical applicability of this result. The second aspect is that, if instead of a dataset the distribution $P_{X,Y}$ is used to compute expectations, the second and third terms in (16b) evaluate to $\gamma\, I(X;Z\mid Y)$ and $\gamma\, TC(Z\mid Y)$, where $TC(Z\mid Y)=\mathbb{E}_Y\!\left[D\!\left(P_{Z\mid Y;\theta}(\cdot\mid Y)\,\middle\|\,\prod_j P_{Z_j\mid Y;\theta}(\cdot\mid Y)\right)\right]$. Thus, and connecting to (14), it can be seen that the variational bound on $I(X;Z\mid Y)$ is equivalent to adding a regularization term that encourages disentanglement (cf. the discussion after [Achille and Soatto, 2018, Proposition 1]).
Proof.
The first term in (16b) does not depend on $\{Q_{Z_j\mid Y}\}$, so it suffices to show that

$$\min_{\{Q_{Z_j\mid Y}\}}\ \frac{1}{N}\sum_{i=1}^{N} D\!\left(P_{Z\mid X;\theta}(\cdot\mid x_i)\,\middle\|\,\textstyle\prod_j Q_{Z_j\mid Y}(\cdot\mid y_i)\right)=\frac{1}{N}\sum_{i=1}^{N} D\!\left(P_{Z\mid X;\theta}(\cdot\mid x_i)\,\middle\|\,\hat{P}_{Z\mid Y;\theta}(\cdot\mid y_i)\right)+\frac{1}{N}\sum_{i=1}^{N} D\!\left(\hat{P}_{Z\mid Y;\theta}(\cdot\mid y_i)\,\middle\|\,\textstyle\prod_j \hat{P}_{Z_j\mid Y;\theta}(\cdot\mid y_i)\right) \tag{17}$$

for every $\theta$, where $\hat{P}_{Z\mid Y;\theta}(\cdot\mid y)=\frac{1}{N_y}\sum_{i\colon y_i=y}P_{Z\mid X;\theta}(\cdot\mid x_i)$ and $N_y=|\{i\colon y_i=y\}|$. Indeed, by the product rule of the logarithm one can show that

$$\frac{1}{N}\sum_{i=1}^{N} D\!\left(P_{Z\mid X;\theta}(\cdot\mid x_i)\,\middle\|\,\textstyle\prod_j Q_{Z_j\mid Y}(\cdot\mid y_i)\right) \tag{18}$$
$$=\frac{1}{N}\sum_{i=1}^{N} D\!\left(P_{Z\mid X;\theta}(\cdot\mid x_i)\,\middle\|\,\hat{P}_{Z\mid Y;\theta}(\cdot\mid y_i)\right)+\frac{1}{N}\sum_{i=1}^{N}\mathbb{E}_{Z\sim P_{Z\mid X;\theta}(\cdot\mid x_i)}\!\left[\log\frac{\hat{P}_{Z\mid Y;\theta}(Z\mid y_i)}{\prod_j Q_{Z_j\mid Y}(Z_j\mid y_i)}\right]. \tag{19}$$

It remains to show that minimizing the second part of this sum over all $\{Q_{Z_j\mid Y}\}$ yields the second term on the right-hand side of (17). To this end, we split the sum over all samples into two sums, one of which runs over the possible values $y$ of the class variable, and one that runs over all samples $i$ for which $y_i=y$. With this, and the law of total expectation, we get

$$\frac{1}{N}\sum_{i=1}^{N}\mathbb{E}_{Z\sim P_{Z\mid X;\theta}(\cdot\mid x_i)}\!\left[\log\frac{\hat{P}_{Z\mid Y;\theta}(Z\mid y_i)}{\prod_j Q_{Z_j\mid Y}(Z_j\mid y_i)}\right] \tag{20}$$
$$=\sum_{y\in\mathcal{Y}}\frac{N_y}{N}\,\mathbb{E}_{Z\sim \hat{P}_{Z\mid Y;\theta}(\cdot\mid y)}\!\left[\log\frac{\hat{P}_{Z\mid Y;\theta}(Z\mid y)}{\prod_j Q_{Z_j\mid Y}(Z_j\mid y)}\right] \tag{21}$$
$$=\sum_{y\in\mathcal{Y}}\frac{N_y}{N}\, D\!\left(\hat{P}_{Z\mid Y;\theta}(\cdot\mid y)\,\middle\|\,\textstyle\prod_j Q_{Z_j\mid Y}(\cdot\mid y)\right). \tag{22}$$

We now minimize the right-hand side of (22) over all $\{Q_{Z_j\mid Y}\}$. To this end, for every $y\in\mathcal{Y}$, we expand the KL divergence via the chain rule [Cover and Thomas, 1991, Th. 2.5.3] to get

$$D\!\left(\hat{P}_{Z\mid Y;\theta}(\cdot\mid y)\,\middle\|\,\textstyle\prod_j Q_{Z_j\mid Y}(\cdot\mid y)\right) \tag{23}$$
$$\stackrel{(a)}{=} D\!\left(\hat{P}_{Z\mid Y;\theta}(\cdot\mid y)\,\middle\|\,\textstyle\prod_j \hat{P}_{Z_j\mid Y;\theta}(\cdot\mid y)\right)+\sum_j D\!\left(\hat{P}_{Z_j\mid Y;\theta}(\cdot\mid y)\,\middle\|\,Q_{Z_j\mid Y}(\cdot\mid y)\right) \tag{24}$$
$$\ge D\!\left(\hat{P}_{Z\mid Y;\theta}(\cdot\mid y)\,\middle\|\,\textstyle\prod_j \hat{P}_{Z_j\mid Y;\theta}(\cdot\mid y)\right) \tag{25}$$

with equality if and only if $Q_{Z_j\mid Y}(\cdot\mid y)=\hat{P}_{Z_j\mid Y;\theta}(\cdot\mid y)$ for all $j$, where $(a)$ follows from [Cover and Thomas, 1991, Lemma 13.8.1]. This completes the proof. ∎
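The decomposition used in (24), $D(P\|\prod_j Q_j)=D(P\|\prod_j P_j)+\sum_j D(P_j\|Q_j)$, can be verified numerically. The following sketch (numpy; a random discrete joint distribution with two components, helper names ours) checks the identity and the resulting lower bound:

```python
import numpy as np

rng = np.random.default_rng(2)

def kl(p, q):
    """KL divergence (in nats) between pmfs of the same shape."""
    m = p > 0
    return float((p[m] * np.log(p[m] / q[m])).sum())

# random joint p(z1, z2) and product surrogate q1(z1) * q2(z2)
p = rng.dirichlet(np.ones(12)).reshape(3, 4)
q1 = rng.dirichlet(np.ones(3))
q2 = rng.dirichlet(np.ones(4))

p1, p2 = p.sum(axis=1), p.sum(axis=0)   # marginals of p

lhs = kl(p, np.outer(q1, q2))
rhs = kl(p, np.outer(p1, p2)) + kl(p1, q1) + kl(p2, q2)
assert np.isclose(lhs, rhs)             # the chain-rule decomposition in (24)
assert lhs >= kl(p, np.outer(p1, p2))   # hence the bound (25)
```

The minimum over the surrogates is attained at the marginals of $P$, exactly as used in the proof.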
4 Planned Experiments
To investigate whether the presented framework based on class-conditional compression is useful, we plan to perform a set of experiments. Whether these experiments are feasible in principle is, at present, unclear.
4.1 Nonlinear Information Bottleneck
In [Kolchinsky et al., 2018], the authors use a stochastic encoder which learns the mean vector of a multivariate Gaussian with scaled identity covariance matrix, i.e., $P_{Z\mid X;\theta}(\cdot\mid x)=\mathcal{N}(\mu_\theta(x),\sigma^2 I)$. Therefore, the authors assume that the latent representation is a Gaussian mixture, with each point in the dataset being an individual mixture component. Based on this assumption, they propose bounding the compression term $I(X;Z)$ via [Kolchinsky et al., 2018, eq. (10)]

$$I(X;Z)\le-\frac{1}{N}\sum_{i=1}^{N}\log\frac{1}{N}\sum_{j=1}^{N}\exp\!\left(-\frac{\|\mu_\theta(x_i)-\mu_\theta(x_j)\|_2^2}{2\sigma^2}\right) \tag{26}$$

where $\sigma$ is a noise parameter that is learned.
Moving from compression to class-conditional compression is achieved by replacing $I(X;Z)$ by $I(X;Z\mid Y)$. We believe that this should also be possible in the framework of nonlinear information bottleneck by computing (26) separately for each class. In other words, we bound

$$I(X;Z\mid Y=y)\le-\frac{1}{N_y}\sum_{i\colon y_i=y}\log\frac{1}{N_y}\sum_{j\colon y_j=y}\exp\!\left(-\frac{\|\mu_\theta(x_i)-\mu_\theta(x_j)\|_2^2}{2\sigma^2}\right) \tag{27}$$

and obtain

$$I(X;Z\mid Y)\le-\sum_{y\in\mathcal{Y}}\frac{N_y}{N}\cdot\frac{1}{N_y}\sum_{i\colon y_i=y}\log\frac{1}{N_y}\sum_{j\colon y_j=y}\exp\!\left(-\frac{\|\mu_\theta(x_i)-\mu_\theta(x_j)\|_2^2}{2\sigma^2}\right). \tag{28}$$
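A possible implementation of (27) and (28) is sketched below (numpy; the function names are hypothetical, and the encoder means $\mu_\theta(x_i)$ are assumed to be given as an array):

```python
import numpy as np

def kernel_bound(mu, sigma):
    """Upper bound in the spirit of (26) for encoder means mu of shape (N, d)."""
    sq = ((mu[:, None, :] - mu[None, :, :]) ** 2).sum(axis=-1)  # pairwise distances
    k = np.exp(-sq / (2.0 * sigma ** 2))
    return float(-np.mean(np.log(np.mean(k, axis=1))))

def class_conditional_bound(mu, y, sigma):
    """Bound (28): N_y/N-weighted average of the per-class bounds (27)."""
    n = len(y)
    return sum(
        (np.sum(y == c) / n) * kernel_bound(mu[y == c], sigma)
        for c in np.unique(y)
    )

# toy check: two tight, well-separated class clusters
mu = np.array([[0.0, 0.0]] * 3 + [[5.0, 5.0]] * 3)
y = np.array([0, 0, 0, 1, 1, 1])
cc = class_conditional_bound(mu, y, sigma=1.0)  # per-class spread is zero, so cc == 0
uc = kernel_bound(mu, sigma=1.0)                # between-class spread keeps uc > 0
```

The toy check illustrates the point of Section 2: a representation that is spread across classes but collapsed within each class has zero class-conditional compression cost, while the unconditional bound (26) still penalizes it.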
4.2 Naive Bayes Decoder
This experiment is based on Section 3.1. Specifically, we plan to choose $Q_{Z\mid Y}$ from the family of Gaussian distributions with a mean vector $\mu_y$ that depends on the class $y$ and an identity matrix (possibly scaled with a constant $\sigma_y^2$ that depends on the class $y$) as covariance matrix. This leads to the goal of obtaining a latent representation that is well-approximated by a Gaussian mixture model, where each mixture component is spherical.

Rather than training the decoder part of the network, we replace this part by a naive Bayes classifier fitted to the parameters of $Q_{Z\mid Y}$. Our aim is then to train the encoder part of the network such that the naive Bayes decoder can be fully exploited, i.e., we learn the parameters $\theta$ of the encoder and the parameters $\{\mu_y,\sigma_y^2\}$ of $Q_{Z\mid Y}$ such that the cost (12) is minimized.
4.3 Deep Variational Information Bottleneck
The authors of [Alemi et al., 2017] suggest minimizing the variational bound for a spherical Gaussian surrogate $Q_Z=\mathcal{N}(0,I)$, i.e., they assume that $Q_Z=\prod_j Q_{Z_j}$ with $Q_{Z_j}=\mathcal{N}(0,1)$. Replacing this target distribution by a conditionally independent distribution of the latent dimensions given the class, i.e., by $Q_{Z\mid Y}(\cdot\mid y)=\mathcal{N}(\mu_y,I)$, is simple. It is unclear, however, how the mean vectors $\mu_y$ shall be chosen or – which is preferable in the light of Corollary 1 – whether these mean vectors can be learned from data jointly (or alternatingly) with the remaining network parameters.
4.4 Conditional Information Dropout
In [Achille and Soatto, 2018], the authors made the connection between a well-chosen variational bound and disentanglement [Achille and Soatto, 2018, Proposition 1]. They further proposed an encoder that is implemented by a NN where each neuron output is affected by multiplicative data-dependent noise (which is chosen to follow a log-normal distribution with data-dependent variance for the sake of analytical simplicity). The authors furthermore proposed that $Q_Z=\prod_j Q_{Z_j}$, where $Q_{Z_j}$ is log-uniform with a point mass at zero or log-normal for ReLU or softplus activation functions, respectively (cf. [Achille and Soatto, 2018, Propositions 2 and 3]).

In the setting proposed in this draft, one would have to replace $Q_{Z_j}$ by $Q_{Z_j\mid Y}$. In case of a softplus activation, this would mean that $Q_{Z_j\mid Y}(\cdot\mid y)$ is a log-normal distribution the mean of which depends on the class $y$ (and potentially on the latent dimension $j$). In case of a ReLU activation, this would require that the point mass at zero depends on the class $y$ (and potentially on the latent dimension $j$). We are again faced with the issue mentioned in the previous subsection, i.e., whether these parameters can be trained from data or if (and how) they can be selected a priori.
References
- [Achille and Soatto, 2018] Achille, A. and Soatto, S. (2018). Information dropout: Learning optimal representations through noisy computation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(12):2897–2905.
- [Alemi et al., 2017] Alemi, A. A., Fischer, I., Dillon, J. V., and Murphy, K. (2017). Deep variational information bottleneck. In Proc. International Conference on Learning Representations (ICLR), Toulon.
- [Amjad and Geiger, 2018] Amjad, R. A. and Geiger, B. C. (2018). Learning representations for neural network-based classification using the information bottleneck principle. accepted for publication in IEEE Trans. Pattern Anal. Mach. Intell., preprint available: arXiv:1802.09766 [cs.LG].
- [Cover and Thomas, 1991] Cover, T. M. and Thomas, J. A. (1991). Elements of Information Theory. John Wiley & Sons, Inc., New York, NY, 1 edition.
- [Kingma and Welling, 2014] Kingma, D. P. and Welling, M. (2014). Auto-encoding variational Bayes. In Proc. International Conference on Learning Representations (ICLR), Banff.
- [Kolchinsky et al., 2018] Kolchinsky, A., Tracey, B. D., and Wolpert, D. H. (2018). Nonlinear information bottleneck. arXiv:1705.02436v7 [cs.IT].
- [Sohn et al., 2015] Sohn, K., Lee, H., and Yan, X. (2015). Learning structured output representation using deep conditional generative models. In Proc. Advances in Neural Information Processing Systems (NIPS), pages 3483–3491, Montreal.