Information Competing Process for Learning Diversified Representations

Learning representations with diversified information remains an open problem. Towards learning diversified representations, a new approach, termed Information Competing Process (ICP), is proposed in this paper. Aiming to enrich the information carried by feature representations, ICP separates a representation into two parts with different mutual information constraints. The separated parts are forced to accomplish the downstream task independently in a competitive environment which prevents the two parts from learning what each other learned for the downstream task. Such competing parts are then combined synergistically to complete the task. By fusing representation parts learned competitively under different conditions, ICP facilitates obtaining diversified representations which contain complementary information. Experiments on image classification and image reconstruction tasks demonstrate the great potential of ICP to learn discriminative and disentangled representations in both supervised and self-supervised learning settings.



1 Introduction

Representation learning aims to make learned feature representations more effective at extracting useful information for downstream tasks Bengio et al. (2013), and has been an active research topic in recent years Radford et al. (2015); Chen et al. (2016); Hamilton et al. (2017); Hu et al. (2019); Kolesnikov et al. (2019). Notably, a majority of works on representation learning have been studied from the viewpoint of mutual information estimation. For instance, the Information Bottleneck (IB) theory Tishby et al. (2000); Alemi et al. (2017) minimizes the information carried by representations to fit the target outputs, and generative models such as β-VAE Higgins et al. (2017); Burgess et al. (2018) also rely on such an information constraint to learn disentangled representations. Other works Linsker (1988); Bell and Sejnowski (1995); Oord et al. (2018); Hjelm et al. (2019) reveal the advantages of maximizing mutual information for learning discriminative representations. Despite this exciting progress, learning diversified representations remains an open problem. In principle, a good representation learning approach should discriminate and disentangle the underlying explanatory factors hidden in the input Bengio et al. (2013). However, this goal is hard to realize because existing methods typically resort to only one type of information constraint to learn the desired representations. As a consequence, the information diversity of the learned representations deteriorates.

In this paper we present a diversified representation learning scheme, termed Information Competing Process (ICP), which handles the above issues through a new information diversifying objective. First, the separated representation parts learned with different constraints are forced to accomplish the downstream task competitively. Then, the rival representation parts are combined to solve the downstream task synergistically. A novel solution is further proposed to optimize the new objective in both supervised and self-supervised learning settings.

We verify the effectiveness of the proposed ICP on both image classification and image reconstruction tasks, where neural networks are used as the feature extractors. In the supervised image classification task, we integrate ICP with four different network architectures (i.e., VGG Simonyan and Zisserman (2014), GoogLeNet Szegedy et al. (2015), ResNet He et al. (2016), and DenseNet Huang et al. (2017)) to demonstrate how the diversified representations boost classification accuracy. In the self-supervised image reconstruction task, we implement ICP with β-VAE Higgins et al. (2017) to investigate its ability to learn disentangled representations for reconstructing and manipulating the inputs. Empirical evaluations suggest that ICP fits finer-labeled datasets well and disentangles fine-grained semantic factors in the representations.

2 Related Work

Representation Learning with Mutual Information. Mutual information has long been a powerful tool in representation learning. In the unsupervised setting, mutual information maximization is typically studied, which aims to add specific information to the representation and force the representation to be discriminative. For instance, the InfoMax principle Linsker (1988); Bell and Sejnowski (1995) advocates maximizing mutual information between the inputs and the representations, which forms the basis of independent component analysis Hyvärinen and Oja (2000). Contrastive Predictive Coding Oord et al. (2018) and Deep InfoMax Hjelm et al. (2019) maximize mutual information between global and local representation pairs, or between the input and global/local representation pairs.

In the supervised or self-supervised settings, mutual information minimization is commonly utilized. For instance, the Information Bottleneck (IB) theory Tishby et al. (2000) uses an information-theoretic objective to constrain the mutual information between the input and the representation. IB was then introduced to deep neural networks Tishby and Zaslavsky (2015); Shwartz-Ziv and Tishby (2017); Saxe et al. (2018), and Deep Variational Information Bottleneck (VIB) Alemi et al. (2017) was recently proposed to refine IB with a variational approximation. Another group of works in the self-supervised setting adopts generative models to learn representations Kingma and Welling (2013); Rezende et al. (2014), in which mutual information plays an important role in learning disentangled representations. For instance, β-VAE Higgins et al. (2017) is a variant of the Variational Auto-Encoder Kingma and Welling (2013) that attempts to learn a disentangled representation by optimizing a heavily penalized objective with mutual information minimization. Recent works Burgess et al. (2018); Kim and Mnih (2018); Chen et al. (2018) revise the objective of β-VAE by applying various constraints. One special case is InfoGAN Chen et al. (2016), which maximizes the mutual information between the representation and a factored Gaussian distribution. Differing from the above schemes, the proposed ICP leverages both mutual information maximization and minimization to create a competitive environment for learning diversified representations.

Representation Collaboration. The idea of collaborating neural representations can be found in Neural Expectation Maximization Greff et al. (2017) and Tagger Greff et al. (2016), which use different representations to group and represent individual entities. The Competitive Collaboration method Ranjan et al. (2019) is the most relevant to our work. It defines a three-player game with two competitors and a moderator, where the moderator takes the role of a critic and the two competitors collaborate to train the moderator. Unlike Competitive Collaboration, the proposed ICP enforces two (or more) representation parts to be complementary for the same downstream task through a competitive environment, which endows ICP with the capability of learning more discriminative and disentangled representations.

Figure 1: The proposed Information Competing Process: In the competitive step, the rival representation parts are each forced to accomplish the downstream task independently, while both parts are prevented from knowing what the other learned under its different constraint. In the synergetic step, these representation parts are combined to complete the downstream task synergistically. ICP generalizes to an arbitrary number of constrained parts; here we use two parts as an example.

3 Information Competing Process

The key idea of ICP is depicted in Fig. 1, in which different representation parts compete and collaborate with each other to diversify the information. In this section, we first unify supervised and self-supervised objectives for ICP. A new information diversifying objective is then proposed, and we derive a solution to optimize it.

3.1 Unifying Supervised and Self-Supervised Objectives

The information constraining objective in the supervised setting has the same form as that in the self-supervised setting, except for the target outputs. We therefore unify these two objectives by using $y$ as the output of the downstream tasks, leading to the unified objective function:

$$\max_{\theta,\phi} \; I(z, y; \theta, \phi) - \gamma\, I(z, x; \theta), \tag{1}$$

where $I(\cdot\,,\cdot)$ stands for the mutual information, $\theta$ denotes the parameters of the representation extractor, $\phi$ denotes the parameters of the downstream task solver, and $x$, $z$, $y$ are the input, representation, and output, respectively ($y = x$ in the self-supervised setting). This unified objective describes a constrained optimization whose goal is to maximize the mutual information between the representation $z$ and the output $y$, while discarding from $z$ the information in the input $x$ that is less relevant to $y$. The regularization factor $\gamma$ balances the information capacity of the representation against the accuracy of the downstream task.

3.2 Separating and Diversifying the Representations

To explicitly diversify the information in representations, we directly separate the representation $z$ into two parts $z_1$ and $z_2$ with different constraints, and encourage the parts to learn discrepant information from the input $x$. Specifically, we decrease the information capacity of $z_1$ while increasing the information capacity of $z_2$. To that effect, we have the following objective function:

$$\max_{\theta_1,\theta_2,\phi} \; I(\{z_1, z_2\}, y; \theta_1, \theta_2, \phi) - \gamma_1 I(z_1, x; \theta_1) + \gamma_2 I(z_2, x; \theta_2), \tag{2}$$

where $\theta_1$ and $\theta_2$ denote the parameters of the representation extractors, and $\gamma_1$ and $\gamma_2$ are the regularization factors. To ensure that the representations learn diversified information through the different constraints, ICP additionally enforces $z_1$ and $z_2$ to accomplish the downstream task independently, and prevents $z_1$ and $z_2$ from learning what each other learned for the downstream task. This process results in a competitive environment that enriches the information carried by the representations. Correspondingly, the information diversifying objective of ICP is:

$$\max_{\theta_1,\theta_2,\phi,\phi_1,\phi_2} \; I(\{z_1, z_2\}, y; \theta_1, \theta_2, \phi) - \gamma_1 I(z_1, x; \theta_1) + \gamma_2 I(z_2, x; \theta_2) + I(z_1, y; \theta_1, \phi_1) + I(z_2, y; \theta_2, \phi_2) - \gamma_3 I(z_1, z_2; \theta_1, \theta_2, \psi), \tag{3}$$

where $\phi_1$ and $\phi_2$ denote the parameters of the downstream task solvers for $z_1$ and $z_2$, $\gamma_3$ is a regularization factor, and $\psi$ denotes the parameters of the predictors used to predict $z_2$ and $z_1$ from $z_1$ and $z_2$, respectively.

3.3 Solving Mutual Information Constraints

In this subsection, we derive a solution to optimize the objective of ICP. Although all terms of this objective share the same formulation, i.e., the mutual information between two variables, they need to be optimized with different methods due to their different aims. We therefore classify these terms as the mutual information minimization term $I(z_1, x)$, the mutual information maximization term $I(z_2, x)$, the inference terms $I(\{z_1, z_2\}, y)$, $I(z_1, y)$, $I(z_2, y)$, and the predictability minimization term $I(z_1, z_2)$ to find the solution.¹

¹ In the rest of the paper, we omit the parameters in the mutual information terms for brevity.

3.3.1 Mutual Information Minimization Term

To minimize the mutual information between $z_1$ and $x$, we find a tractable upper bound for the intractable $I(z_1, x)$. In existing works Kingma and Welling (2013); Alemi et al. (2017), $I(z_1, x)$ is usually defined under the joint distribution of the inputs and their encoding distribution, as it is the constraint between the inputs and the representations. Concretely, the formulation is derived as:

$$I(z_1, x) = \int p(x)\, p_{\theta_1}(z_1|x) \log \frac{p_{\theta_1}(z_1|x)}{p(z_1)}\, dx\, dz_1. \tag{4}$$

Let $q(z_1)$ be a variational approximation of the marginal $p(z_1)$. Since $\mathrm{KL}\big(p(z_1)\,\|\,q(z_1)\big) \geq 0$, we have:

$$\int p(z_1) \log p(z_1)\, dz_1 \geq \int p(z_1) \log q(z_1)\, dz_1. \tag{5}$$

According to Eq. 5, the tractable upper bound after applying the variational approximation is:

$$I(z_1, x) \leq \int p(x)\, p_{\theta_1}(z_1|x) \log \frac{p_{\theta_1}(z_1|x)}{q(z_1)}\, dx\, dz_1 = \mathbb{E}_{p(x)}\Big[\mathrm{KL}\big(p_{\theta_1}(z_1|x)\,\|\,q(z_1)\big)\Big], \tag{6}$$

which pushes the extracted $z_1$ conditioned on $x$ toward a predefined distribution $q(z_1)$ such as a standard Gaussian.
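When the encoder outputs a diagonal Gaussian and $q(z_1)$ is a standard Gaussian, the per-sample KL term in Eq. 6 has a well-known closed form. Below is a minimal NumPy sketch of that closed form (the function name is ours, not from the paper):

```python
import numpy as np

def kl_to_standard_normal(mu, logvar):
    """Closed-form KL( N(mu, diag(exp(logvar))) || N(0, I) ),
    summed over the representation dimensions. This is the tractable
    upper-bound penalty applied to the minimization part z1."""
    return 0.5 * np.sum(np.exp(logvar) + mu**2 - 1.0 - logvar, axis=-1)

# Sanity check: the penalty vanishes when the encoder matches the prior.
mu = np.zeros((4, 8))
logvar = np.zeros((4, 8))
print(kl_to_standard_normal(mu, logvar))  # [0. 0. 0. 0.]
```

Averaging this quantity over a minibatch gives a Monte-Carlo estimate of the expectation over $p(x)$ in Eq. 6.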

3.3.2 Mutual Information Maximization Term

To maximize the mutual information between $z_2$ and $x$, we deduce a tractable alternative to the intractable $I(z_2, x)$. As with the minimization term above, this mutual information is defined under the joint distribution of the inputs and their encoding distribution. Since it is hard to derive a tractable lower bound for this term, we find an alternative by expanding the mutual information as:

$$I(z_2, x) = \int p_{\theta_2}(z_2, x) \log \frac{p_{\theta_2}(z_2, x)}{p(z_2)\, p(x)}\, dz_2\, dx = \mathrm{KL}\big(p_{\theta_2}(z_2, x)\,\|\,p(z_2)\, p(x)\big). \tag{7}$$

Eq. 7 shows that maximizing the mutual information is equivalent to enlarging the Kullback-Leibler (KL) divergence between the distributions $p_{\theta_2}(z_2, x)$ and $p(z_2)\,p(x)$. Since maximizing the KL divergence is unbounded, we instead maximize the Jensen-Shannon (JS) divergence, which approximates the maximization of the KL divergence but is convergent. Following Nowozin et al. (2016), a tractable variational estimation of the JS divergence is:

$$\hat{I}^{(\mathrm{JS})}(z_2, x) = \mathbb{E}_{(z_2, x) \sim p_{\theta_2}(z_2, x)}\big[-\mathrm{sp}\big(-D_\omega(z_2, x)\big)\big] - \mathbb{E}_{(\tilde{z}_2, x) \sim p(z_2)\, p(x)}\big[\mathrm{sp}\big(D_\omega(\tilde{z}_2, x)\big)\big], \tag{8}$$

where $D_\omega$ is a discriminator with parameters $\omega$ that outputs an estimate of the probability, $\mathrm{sp}(a) = \log(1 + e^a)$ is the softplus function, $(z_2, x)$ is the positive pair sampled from $p_{\theta_2}(z_2, x)$, and $(\tilde{z}_2, x)$ is the negative pair sampled from $p(z_2)\,p(x)$. As $z_2$ should be the representation conditioned on $x$, we shuffle the pairing between $z_2$ and $x$ in the positive pairs to obtain the negative pairs.
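The estimator in Eq. 8 reduces to a few lines once the discriminator's scores are available. A minimal NumPy sketch (the discriminator itself is a neural network in the paper; here we only assume its scalar scores):

```python
import numpy as np

def softplus(a):
    return np.logaddexp(0.0, a)  # log(1 + e^a), numerically stable

def js_mi_estimate(scores_pos, scores_neg):
    """Variational Jensen-Shannon estimate (cf. Nowozin et al., 2016):
    the discriminator should score jointly drawn (z2, x) pairs high
    and shuffled (independent) pairs low."""
    return np.mean(-softplus(-scores_pos)) - np.mean(softplus(scores_neg))

# Negative pairs: shuffle z2 within a batch so each x is paired with
# the representation of a different input.
rng = np.random.default_rng(0)
z2 = rng.normal(size=(16, 8))
z2_neg = z2[rng.permutation(16)]

# An uninformative discriminator (all scores 0) yields the lower
# bound -2*log(2); well-separated scores push the estimate toward 0.
print(js_mi_estimate(np.zeros(16), np.zeros(16)))  # ~ -1.3863
```

Maximizing this estimate with respect to the extractor parameters increases the dependence between $z_2$ and $x$.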

3.3.3 Inference Term

The inference terms in Eq. 3 are defined under the joint distribution of the representation and the output distribution of the downstream task solver. Taking $I(z_1, y)$ as an example, we expand this mutual information term as:

$$I(z_1, y) = \int p(z_1, y) \log \frac{p(y|z_1)}{p(y)}\, dz_1\, dy. \tag{9}$$

Let $q_{\phi_1}(y|z_1)$ be a variational approximation of $p(y|z_1)$. Since $\mathrm{KL}\big(p(y|z_1)\,\|\,q_{\phi_1}(y|z_1)\big) \geq 0$, we have:

$$\int p(y|z_1) \log p(y|z_1)\, dy \geq \int p(y|z_1) \log q_{\phi_1}(y|z_1)\, dy. \tag{10}$$

By applying the variational approximation, the tractable lower bound of the mutual information between $z_1$ and $y$ is:

$$I(z_1, y) \geq \int p(z_1, y) \log q_{\phi_1}(y|z_1)\, dz_1\, dy + H(y), \tag{11}$$

where the entropy $H(y)$ does not depend on the optimized parameters and can be ignored.
Based on the above formulation, we derive different objectives for the supervised and self-supervised settings in what follows.

Supervised Setting. In the supervised setting, $y$ represents the known target labels. By assuming that the representation $z_1$ does not depend on the label $y$ given the input, i.e., $p(z_1|x, y) = p(z_1|x)$, we have:

$$p(y, z_1|x) = p(y|x)\, p_{\theta_1}(z_1|x). \tag{12}$$

Accordingly, the joint distribution of $z_1$ and $y$ can be written as:

$$p(z_1, y) = \int p(x)\, p(y|x)\, p_{\theta_1}(z_1|x)\, dx. \tag{13}$$

Combining Eq. 11 with Eq. 13, we get the lower bound of the inference term in the supervised setting:

$$I(z_1, y) \geq \mathbb{E}_{p(x)}\, \mathbb{E}_{p(y|x)}\, \mathbb{E}_{p_{\theta_1}(z_1|x)}\big[\log q_{\phi_1}(y|z_1)\big] + H(y). \tag{14}$$

Since the conditional probability $p(y|x)$ represents the distribution of labels in the supervised setting, maximizing Eq. 14 amounts to minimizing the cross-entropy loss for classification.
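Concretely, with a categorical task solver $q_{\phi}(y|z)$ parameterized by logits, the negated lower bound of Eq. 14 is the standard cross-entropy. A minimal NumPy sketch (function names are ours):

```python
import numpy as np

def log_softmax(logits):
    # Numerically stable log-softmax over the class axis.
    m = logits.max(axis=1, keepdims=True)
    return logits - m - np.log(np.exp(logits - m).sum(axis=1, keepdims=True))

def cross_entropy(logits, labels):
    """Monte-Carlo estimate of -E[log q_phi(y|z)] for a categorical
    solver: the familiar cross-entropy classification loss."""
    lp = log_softmax(logits)
    return -np.mean(lp[np.arange(len(labels)), labels])

# Uninformative logits give -log(1/K) = log K; here K = 10 classes.
print(cross_entropy(np.zeros((5, 10)), np.zeros(5, dtype=int)))  # ~2.3026
```

Minimizing this loss over minibatches maximizes the lower bound in Eq. 14 up to the constant $H(y)$.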

Input: The source input $x$ with the downstream task target $y$ ($y = x$ in the self-supervised setting), the assigned distribution $q(z_1)$ for variational approximation, and the hyperparameters $\gamma_1$, $\gamma_2$, $\gamma_3$.
Output: The learned parameters $\theta_1$, $\theta_2$ for the representation extractors, and $\phi$ for the downstream solver.
1 while  not Convergence  do
2       Optimize Eq. 8 and Eq. 16 for the discriminator $D_\omega$ and the predictors $P_\psi$;
3       // Mutual Information Minimization Term:
4       Replace $I(z_1, x)$ in Eq. 3 with the tractable upper bound in Eq. 6;
5       // Mutual Information Maximization Term:
6       Replace $I(z_2, x)$ in Eq. 3 with the tractable alternative in Eq. 8;
7       // Inference Term:
8       Replace $I(\{z_1, z_2\}, y)$, $I(z_1, y)$, and $I(z_2, y)$ in Eq. 3 with the tractable lower bound in Eq. 14;
9       // Predictability Minimization Term:
10       Replace $I(z_1, z_2)$ in Eq. 3 with Eq. 16;
11       Optimize Eq. 3 while fixing the parameters of $D_\omega$ and $P_\psi$;
12 end while
Algorithm 1 Optimization of Information Competing Process

Self-supervised Setting. In the self-supervised setting, $y$ is equal to the input $x$. Eq. 11 can then be directly instantiated as:

$$I(z_1, x) \geq \mathbb{E}_{p(x)}\, \mathbb{E}_{p_{\theta_1}(z_1|x)}\big[\log q_{\phi_1}(x|z_1)\big] + H(x). \tag{15}$$

Assuming $q_{\phi_1}(x|z_1)$ is a Gaussian distribution with fixed variance, Eq. 15 expands to the L2 reconstruction loss for the input $x$.
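The Gaussian-to-L2 reduction can be made explicit: the negative log-likelihood under a fixed-variance isotropic Gaussian is the squared error up to a scale and an additive constant. A small NumPy sketch (the function name is ours):

```python
import numpy as np

def gaussian_nll(x, x_hat, sigma=1.0):
    """Negative log-likelihood of x under N(x_hat, sigma^2 I).

    With sigma fixed, minimizing this is minimizing the squared
    reconstruction error up to a constant, i.e., the L2 loss."""
    d = x.size
    sq_err = np.sum((x - x_hat) ** 2)
    return 0.5 * sq_err / sigma**2 + 0.5 * d * np.log(2 * np.pi * sigma**2)

# Perfect reconstruction leaves only the constant normalization term.
print(gaussian_nll(np.zeros(8), np.zeros(8)))  # = 4*log(2*pi) ~ 7.351
```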

3.3.4 Predictability Minimization Term

To diversify the information and prevent the dominance of one representation part, we constrain the mutual information between $z_1$ and $z_2$, which is equivalent to making $z_1$ and $z_2$ independent of each other. Inspired by Schmidhuber (1992); Li et al. (2019), we introduce predictors $P_\psi$ parameterized by $\psi$ to fulfill this goal. Concretely, we let $P_\psi$ predict $z_2$ conditioned on $z_1$, and prevent the extractor from producing a $z_1$ from which $z_2$ can be predicted. The same operation is conducted from $z_2$ to $z_1$. The corresponding objective is:

$$\mathcal{L}_{\mathrm{pred}} = \mathbb{E}_{p_{\theta_1}(z_1|x)\, p_{\theta_2}(z_2|x)}\Big[\big\|z_2 - P_{\psi}(z_1)\big\|_2^2 + \big\|z_1 - P_{\psi}(z_2)\big\|_2^2\Big], \tag{16}$$

where the predictors are optimized to minimize $\mathcal{L}_{\mathrm{pred}}$ while the extractors are optimized to maximize it, following the predictability minimization principle. So far, we have all the tractable bounds and alternatives for optimizing the information diversifying objective of ICP. The optimization process is summarized in Alg. 1.
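The predictability minimization game can be sketched as a cross-prediction loss over the two parts. A minimal NumPy illustration (the predictor callables are hypothetical stand-ins; in practice they are small networks):

```python
import numpy as np

def predictability_loss(z1, z2, predict_1to2, predict_2to1):
    """Cross-prediction error between the two representation parts.

    The predictors are trained to MINIMIZE this loss, while the
    extractors are trained to MAXIMIZE it -- the predictability
    minimization game of Schmidhuber (1992)."""
    err_12 = np.mean((z2 - predict_1to2(z1)) ** 2)
    err_21 = np.mean((z1 - predict_2to1(z2)) ** 2)
    return err_12 + err_21

rng = np.random.default_rng(0)
z1 = rng.normal(size=(16, 8))
identity = lambda z: z

# If z2 simply duplicates z1, a trivial identity predictor drives the
# loss to zero -- exactly the redundancy the extractors must avoid.
print(predictability_loss(z1, z1.copy(), identity, identity))  # 0.0
```

A low loss therefore signals that one part is recoverable from the other; pushing the loss up during extractor updates pushes the parts toward independence.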

4 Experiments

In the experiments, all the probabilistic feature extractors, task solvers, predictors, and the discriminator are implemented by neural networks. We assume the variational distributions (e.g., $q(z_1)$) are standard Gaussians and use the reparameterization trick following VAE Kingma and Welling (2013). The objectives are differentiable and trained using backpropagation. In the classification task (supervised setting), we use one fully connected layer as the classifier. In the reconstruction task (self-supervised setting), multiple deconvolution layers are used as the decoder to reconstruct the inputs.
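For reference, the reparameterization trick draws $z = \mu + \sigma \odot \epsilon$ with $\epsilon \sim \mathcal{N}(0, I)$ so that the sampling path stays differentiable in $\mu$ and $\log \sigma^2$. A NumPy sketch of the sampling rule only (gradient tracking is handled by an autodiff framework in practice):

```python
import numpy as np

rng = np.random.default_rng(0)

def reparameterize(mu, logvar):
    """Sample z = mu + sigma * eps with eps ~ N(0, I).

    Randomness is isolated in eps, so gradients can flow through
    mu and logvar when an autodiff framework records the ops."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * logvar) * eps

# As the variance shrinks, samples collapse onto the mean.
z = reparameterize(np.ones((3, 4)), np.full((3, 4), -100.0))
print(np.allclose(z, 1.0))  # True
```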

4.1 Supervised Setting: Classification

Dataset. CIFAR-10 and CIFAR-100 Krizhevsky and Hinton (2009) are used to evaluate the performance of ICP on the image classification task. These datasets contain natural images belonging to 10 and 100 classes, respectively; CIFAR-100 comes with "finer" labels than CIFAR-10. The raw images are 32×32 pixels, and we normalize them using the channel means and standard deviations. Standard data augmentation by random cropping and mirroring is applied to the training set.

VGG16 Simonyan and Zisserman (2014) GoogLeNet Szegedy et al. (2015) ResNet20 He et al. (2016) DenseNet40 Huang et al. (2017)
Baseline 6.67 4.92 7.63 5.83
ICP-ALL 6.97 4.76 6.47 6.13
ICP-COM 6.59 4.67 7.33 5.63
VIB Alemi et al. (2017) 6.81 5.09 6.95 5.72
DIM* Hjelm et al. (2019) 6.54 4.65 7.61 6.15
VIB2 6.86 4.88 6.85 6.36
DIM*2 7.24 4.95 7.46 5.60
ICP 6.10 4.26 6.01 4.99
Table 1: Classification error rates (%) on CIFAR-10 test set.
VGG16 Simonyan and Zisserman (2014) GoogLeNet Szegedy et al. (2015) ResNet20 He et al. (2016) DenseNet40 Huang et al. (2017)
Baseline 26.41 20.68 31.91 27.55
ICP-ALL 26.73 20.90 28.35 27.51
ICP-COM 26.37 20.81 32.76 26.85
VIB Alemi et al. (2017) 26.56 20.93 30.84 26.37
DIM* Hjelm et al. (2019) 26.74 20.94 32.62 27.51
VIB2 26.08 22.09 29.74 29.33
DIM*2 25.72 21.74 30.16 27.15
ICP 24.54 18.55 28.13 24.52
Table 2: Classification error rates (%) on CIFAR-100 test set.

Backbone Networks. We utilize four architectures including VGGNet Simonyan and Zisserman (2014), GoogLeNet Szegedy et al. (2015, 2016), ResNet He et al. (2016), and DenseNet Huang et al. (2017) to test the general applicability of ICP and to study the diversified representations learned by ICP.

Baselines. We use the classification results of the original network architectures as baselines. The Deep Variational Information Bottleneck (VIB) Alemi et al. (2017) and the global version of Deep InfoMax with one additional mutual information maximization term (DIM*) Hjelm et al. (2019) are used as references. For a fair comparison, we expand the representation dimension of both methods to the same size as ICP's (denoted as VIB2 and DIM*2). For the ablation study, we implement ICP either by simply combining two unconstrained representation parts, or by combining two representation parts learned with different constraints but without making them independent of each other (i.e., optimizing Eq. 2). The former implementation is denoted as ICP-ALL and the latter as ICP-COM.

(a) Correlation heatmap of ICP-ALL.
(b) Correlation heatmap of ICP.
Figure 2: Heatmaps of the correlation (absolute value of the classifier’s weights, normalized by rows) between categories and representations of VGGNet on CIFAR-10. The horizontal axis denotes the dimension of representations, and the vertical axis denotes the categories. Darker color denotes higher correlation.

Classification Performance. The classification error rates on CIFAR-10 and CIFAR-100 are shown in Tables 1 and 2. The performance of ICP-ALL shows that simply combining two representation parts is sub-optimal, except for ResNet20. After adding the constraints on the representations, the performance of ICP-COM is almost the same as the baselines. VIB, DIM*, VIB2, and DIM*2 can be regarded as methods that use only one type of constraint in ICP, and they still do not achieve satisfactory results. Only ICP generalizes to all these architectures and reports the best performance. In addition, the results on CIFAR-10 and CIFAR-100 suggest that ICP works especially well on the dataset with finer labels (i.e., CIFAR-100). We attribute this success to the diversified representations, which capture more detailed information about the inputs.

Why Does ICP Work? To explain the intuition behind ICP and its superior results, we study the learned classification models. In the following, we take VGGNet on CIFAR-10 as an example (other models behave consistently). In Fig. 2, we visualize the normalized absolute values of the classifier's weights (i.e., of the last fully connected layer's weights). As shown in Fig. 2(a), the classification dependency is entangled in ICP-ALL, which means that directly combining two representations without any constraints does not diversify the representation: the first green bounding box shows that classification relies on both parts, while the second and third green bounding boxes show that classification relies more on the first part or the second part, respectively. On the contrary, as shown in Fig. 2(b), the classification dependency of ICP can be separated into two parts. Since mutual information minimization makes the representation carry more general information about the input while maximization makes it carry more specific information, a few dimensions of the general part suffice for classification (the left bounding box of Fig. 2(b)), whereas many dimensions of the specific part contribute to classification (the right bounding box of Fig. 2(b)). This reveals that ICP diversifies and aggregates complementary representations for the classification task.

4.2 Self-supervised Setting: Reconstruction

MSE SSIM Wang et al. (2004)
β-VAE Higgins et al. (2017) 0.0092 0.60
ICP 0.0085 0.62
Table 3: Quantitative results of self-reconstruction.

Dataset. We use CelebA Liu et al. (2015), a large-scale face attribute dataset with more than 200,000 celebrity images, to study the disentanglement ability of ICP. The images are resized to 128×128 pixels.

Baseline. β-VAE Higgins et al. (2017) is implemented with the same representation dimension as ICP. We manually pick the dimensions that carry the most related semantic meaning in β-VAE and ICP to conduct the qualitative evaluation of disentanglement. The average Mean Squared Error (MSE) and the Structural Similarity Index (SSIM) Wang et al. (2004) are used for the quantitative evaluation of reconstruction.

Disentanglement and Reconstruction. The qualitative disentanglement results of β-VAE and ICP are shown in Fig. 3. Many fine-grained semantic attributes, such as goatee and eyeglasses, are clearly disentangled by ICP while details are preserved. In Table 3, the quantitative results show that ICP retains more information about the input and achieves better reconstruction performance.

(a) Smile
(b) Goatee
(c) Eyeglasses
(d) Hair Color
Figure 3: Qualitative results of manipulating the representations learned by β-VAE or ICP on CelebA. In each figure, we traverse a single dimension of the disentangled representation while keeping the other dimensions fixed. Each row uses a different seed image to infer the representation. The traversal is over the range [-3, 3].

5 Conclusion

We proposed a new approach named Information Competing Process (ICP) for learning diversified representations. To enrich the information carried by representations, ICP separates a representation into two parts with different mutual information constraints, and prevents both parts from knowing what each other learned for the downstream task. Such rival representation parts are then combined to accomplish the downstream task synergistically. Experiments demonstrated the great potential of ICP in both supervised and self-supervised settings. The performance gain stems from ICP's ability to learn diversified representations, which provides fresh insight into the representation learning problem.


  • Alemi et al. [2017] Alexander A Alemi, Ian Fischer, Joshua V Dillon, and Kevin Murphy. Deep variational information bottleneck. In ICLR, 2017.
  • Bell and Sejnowski [1995] Anthony J Bell and Terrence J Sejnowski. An information-maximization approach to blind separation and blind deconvolution. Neural Computation, 1995.
  • Bengio et al. [2013] Yoshua Bengio, Aaron Courville, and Pascal Vincent. Representation learning: A review and new perspectives. PAMI, 2013.
  • Burgess et al. [2018] Christopher P Burgess, Irina Higgins, Arka Pal, Loic Matthey, Nick Watters, Guillaume Desjardins, and Alexander Lerchner. Understanding disentangling in beta-vae. In NeurIPS, 2018.
  • Chen et al. [2018] Tian Qi Chen, Xuechen Li, Roger B Grosse, and David K Duvenaud. Isolating sources of disentanglement in variational autoencoders. In NeurIPS, 2018.
  • Chen et al. [2016] Xi Chen, Yan Duan, Rein Houthooft, John Schulman, Ilya Sutskever, and Pieter Abbeel. Infogan: Interpretable representation learning by information maximizing generative adversarial nets. In NeurIPS, 2016.
  • Greff et al. [2016] Klaus Greff, Antti Rasmus, Mathias Berglund, Tele Hao, Harri Valpola, and Jürgen Schmidhuber. Tagger: Deep unsupervised perceptual grouping. In NeurIPS, 2016.
  • Greff et al. [2017] Klaus Greff, Sjoerd van Steenkiste, and Jürgen Schmidhuber. Neural expectation maximization. In NeurIPS, 2017.
  • Hamilton et al. [2017] Will Hamilton, Zhitao Ying, and Jure Leskovec. Inductive representation learning on large graphs. In NeurIPS, 2017.
  • He et al. [2016] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.
  • Higgins et al. [2017] Irina Higgins, Loic Matthey, Arka Pal, Christopher Burgess, Xavier Glorot, Matthew Botvinick, Shakir Mohamed, and Alexander Lerchner. beta-vae: Learning basic visual concepts with a constrained variational framework. In ICLR, 2017.
  • Hjelm et al. [2019] R Devon Hjelm, Alex Fedorov, Samuel Lavoie-Marchildon, Karan Grewal, Adam Trischler, and Yoshua Bengio. Learning deep representations by mutual information estimation and maximization. In ICLR, 2019.
  • Hu et al. [2019] Jie Hu, Rongrong Ji, Hong Liu, Shengchuan Zhang, Cheng Deng, and Qi Tian. Towards visual feature translation. In CVPR, 2019.
  • Huang et al. [2017] Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q Weinberger. Densely connected convolutional networks. In CVPR, 2017.
  • Hyvärinen and Oja [2000] Aapo Hyvärinen and Erkki Oja. Independent component analysis: algorithms and applications. Neural networks, 2000.
  • Kim and Mnih [2018] Hyunjik Kim and Andriy Mnih. Disentangling by factorising. In ICML, 2018.
  • Kingma and Welling [2013] Diederik P Kingma and Max Welling. Auto-encoding variational bayes. In ICLR, 2013.
  • Kolesnikov et al. [2019] Alexander Kolesnikov, Xiaohua Zhai, and Lucas Beyer. Revisiting self-supervised visual representation learning. In CVPR, 2019.
  • Krizhevsky and Hinton [2009] Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009.
  • Li et al. [2019] Xinyang Li, Jie Hu, Shengchuan Zhang, Xiaopeng Hong, Qixiang Ye, Chenglin Wu, and Rongrong Ji. Attribute guided unpaired image-to-image translation with semi-supervised learning. arXiv preprint arXiv:1904.12428, 2019.
  • Linsker [1988] Ralph Linsker. Self-organization in a perceptual network. Computer, 1988.
  • Liu et al. [2015] Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the wild. In ICCV, 2015.
  • Nowozin et al. [2016] Sebastian Nowozin, Botond Cseke, and Ryota Tomioka. f-gan: Training generative neural samplers using variational divergence minimization. In NeurIPS, 2016.
  • Oord et al. [2018] Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748, 2018.
  • Radford et al. [2015] Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.
  • Ranjan et al. [2019] Anurag Ranjan, Varun Jampani, Kihwan Kim, Deqing Sun, Jonas Wulff, and Michael J Black. Adversarial collaboration: Joint unsupervised learning of depth, camera motion, optical flow and motion segmentation. In CVPR, 2019.
  • Rezende et al. [2014] Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate inference in deep generative models. In ICML, 2014.
  • Saxe et al. [2018] Andrew Michael Saxe, Yamini Bansal, Joel Dapello, Madhu Advani, Artemy Kolchinsky, Brendan Daniel Tracey, and David Daniel Cox. On the information bottleneck theory of deep learning. In ICLR, 2018.
  • Schmidhuber [1992] Jürgen Schmidhuber. Learning factorial codes by predictability minimization. Neural Computation, 1992.
  • Shwartz-Ziv and Tishby [2017] Ravid Shwartz-Ziv and Naftali Tishby. Opening the black box of deep neural networks via information. arXiv preprint arXiv:1703.00810, 2017.
  • Simonyan and Zisserman [2014] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
  • Szegedy et al. [2015] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In CVPR, 2015.
  • Szegedy et al. [2016] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision. In CVPR, 2016.
  • Tishby and Zaslavsky [2015] Naftali Tishby and Noga Zaslavsky. Deep learning and the information bottleneck principle. In ITW, 2015.
  • Tishby et al. [2000] Naftali Tishby, Fernando C Pereira, and William Bialek. The information bottleneck method. arXiv preprint physics/0004057, 2000.
  • Wang et al. [2004] Zhou Wang, Alan C Bovik, Hamid R Sheikh, Eero P Simoncelli, et al. Image quality assessment: from error visibility to structural similarity. TIP, 2004.