1 Introduction
Representation learning aims to make learned feature representations more effective at extracting useful information for downstream tasks Bengio et al. (2013), and has been an active research topic in recent years Radford et al. (2015); Chen et al. (2016); Hamilton et al. (2017); Hu et al. (2019); Kolesnikov et al. (2019). Notably, much of the work on representation learning has approached the problem from the viewpoint of mutual information estimation. For instance, the Information Bottleneck (IB) theory Tishby et al. (2000); Alemi et al. (2017) minimizes the information carried by representations to fit the target outputs, and generative models such as β-VAE Higgins et al. (2017); Burgess et al. (2018) also rely on such an information constraint to learn disentangled representations. Other works Linsker (1988); Bell and Sejnowski (1995); Oord et al. (2018); Hjelm et al. (2019) reveal the advantages of maximizing mutual information for learning discriminative representations. Despite this exciting progress, learning diversified representations remains an open problem. In principle, a good representation learning approach is supposed to discriminate and disentangle the underlying explanatory factors hidden in the input Bengio et al. (2013). However, this goal is hard to realize because existing methods typically resort to only one type of information constraint to learn the desired representations. As a consequence, the information diversity of the learned representations deteriorates.
In this paper we present a diversified representation learning scheme, termed Information Competing Process (ICP), which handles the above issues through a new information diversifying objective. First, the separated representation parts learned with different constraints are forced to accomplish the downstream task competitively. Then, the rival representation parts are combined to solve the downstream task synergistically. A novel solution is further proposed to optimize the new objective in both supervised and self-supervised learning settings.
We verify the effectiveness of the proposed ICP on both image classification and image reconstruction tasks, where neural networks are used as the feature extractors. In the supervised image classification task, we integrate ICP with four different network architectures (i.e., VGG Simonyan and Zisserman (2014), GoogLeNet Szegedy et al. (2015), ResNet He et al. (2016), and DenseNet Huang et al. (2017)) to demonstrate how the diversified representations boost classification accuracy. In the self-supervised image reconstruction task, we implement ICP with β-VAE Higgins et al. (2017) to investigate its ability to learn disentangled representations to reconstruct and manipulate the inputs. Empirical evaluations suggest that ICP fits finer labeled datasets and disentangles fine-grained semantic information for representations.
2 Related Work
Representation Learning with Mutual Information. Mutual information has long been a powerful tool in representation learning. In the unsupervised setting, mutual information maximization is typically studied, which aims at adding specific information to the representation and forces the representation to be discriminative. For instance, the InfoMax principle Linsker (1988); Bell and Sejnowski (1995) advocates maximizing mutual information between the inputs and the representations, which forms the basis of independent component analysis Hyvärinen and Oja (2000). Contrastive Predictive Coding Oord et al. (2018) and Deep InfoMax Hjelm et al. (2019) maximize mutual information between global and local representation pairs, or between the input and global/local representation pairs.
In the supervised or self-supervised settings, mutual information minimization is commonly utilized. For instance, the Information Bottleneck (IB) theory Tishby et al. (2000) uses an information-theoretic objective to constrain the mutual information between the input and the representation. IB was then introduced to deep neural networks Tishby and Zaslavsky (2015); Shwartz-Ziv and Tishby (2017); Saxe et al. (2018), and the Deep Variational Information Bottleneck (VIB) Alemi et al. (2017) was recently proposed to refine IB with a variational approximation. Another group of works in the self-supervised setting adopts generative models to learn representations Kingma and Welling (2013); Rezende et al. (2014), in which mutual information plays an important role in learning disentangled representations. For instance, β-VAE Higgins et al. (2017) is a variant of the Variational Autoencoder Kingma and Welling (2013) that attempts to learn a disentangled representation by optimizing a heavily penalized objective with mutual information minimization. Recent works Burgess et al. (2018); Kim and Mnih (2018); Chen et al. (2018) revise the objective of β-VAE by applying various constraints. One special case is InfoGAN Chen et al. (2016), which maximizes the mutual information between the representation and a factored Gaussian distribution. Differing from the above schemes, the proposed ICP leverages both mutual information maximization and minimization to create a competitive environment for learning diversified representations.
Representation Collaboration. The idea of collaborating neural representations can be found in Neural Expectation Maximization Greff et al. (2017) and Tagger Greff et al. (2016), which use different representations to group and represent individual entities. The Competitive Collaboration method Ranjan et al. (2019) is the most relevant to our work. It defines a three-player game with two competitors and a moderator, where the moderator takes the role of a critic and the two competitors collaborate to train the moderator. Unlike Competitive Collaboration, the proposed ICP forces two (or more) parts to be complementary for the same downstream task through a competitive environment, which endows it with the capability of learning more discriminative and disentangled representations.
3 Information Competing Process
The key idea of ICP is depicted in Fig. 1, in which different representation parts compete and collaborate with each other to diversify the information. In this section, we first unify the supervised and self-supervised objectives for ICP. A new information diversifying objective is then proposed, and we derive a solution to optimize it.
3.1 Unifying Supervised and Self-Supervised Objectives
The information constraining objective in the supervised setting has the same form as that in the self-supervised setting, except for the target outputs. We therefore unify these two objectives by using a single variable for the output of the downstream task, leading to the unified objective function:
(1) 
Here the objective is expressed through mutual information terms; its parameters are those of the representation extractor and of the downstream task solver, and its variables are the input, the representation, and the output (in the self-supervised setting, the output is the input itself). This unified objective describes a constrained optimization whose goal is to maximize the mutual information between the representation and the output, while discarding the information present in the input that is less relevant to the output. The regularization factor balances the information capacity of the representation against the accuracy of the downstream task.
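As an illustration only, the objective described above can be written in the following form. The symbols are assumptions introduced for this sketch: x for the input, r for the representation, t for the output (with t = x in the self-supervised setting), θ and φ for the extractor and solver parameters, and β for the regularization factor.

```latex
\max_{\theta,\,\phi}\; I(r;\, t) \;-\; \beta\, I(r;\, x)
```

The first term rewards representations that are informative about the output, and the second penalizes information retained from the input, with β setting the trade-off.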
3.2 Separating and Diversifying the Representations
To explicitly diversify the information in the representations, we directly separate the representation into two parts with different constraints, and encourage the two parts to learn discrepant information from the input. Specifically, we decrease the information capacity of one part while increasing the information capacity of the other. To that effect, we have the following objective function:
(2) 
Here each representation part has its own extractor parameters and its own regularization factor. To ensure that the representations learn diversified information through the different constraints, ICP additionally forces each part to accomplish the downstream task on its own, and prevents the two parts from learning what the other has learned for the downstream task. This process results in a competitive environment that enriches the information carried by the representations. Correspondingly, the information diversifying objective of ICP is:
(3) 
Here each representation part has its own downstream task solver with separate parameters, and the predictors, with their own parameters, are used to predict each part from the other.
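A sketch of how the terms described in this subsection fit together, under assumed notation that is introduced here for illustration: the representation r is split as r = [z, y], where z is the part whose capacity is decreased and y the part whose capacity is increased, t is the output, x the input, and α, β are the two regularization factors.

```latex
\max\;
\underbrace{I(r;\, t)}_{\text{combined inference}}
+ \underbrace{I(z;\, t) + I(y;\, t)}_{\text{separate inference}}
- \alpha\, \underbrace{I(z;\, x)}_{\text{minimization}}
+ \beta\, \underbrace{I(y;\, x)}_{\text{maximization}}
- \underbrace{I(z;\, y)}_{\text{predictability}}
```

Dropping the two separate inference terms and the predictability term leaves the form described for the separated objective; adding them creates the competition between the two parts.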
3.3 Solving Mutual Information Constraints
In this subsection, we derive a solution to optimize the objective of ICP. Although all terms of this objective share the same formulation, namely the mutual information between two variables, they need to be optimized using different methods because of their different aims. We therefore classify these terms into the mutual information minimization term, the mutual information maximization term, the inference terms, and the predictability minimization term, and solve each in turn. (In the rest of the paper, we omit the parameters of the mutual information for brevity.)
3.3.1 Mutual Information Minimization Term
To minimize the mutual information between the input and the representation part whose capacity is decreased, we derive a tractable upper bound for this intractable term. In existing works Kingma and Welling (2013); Alemi et al. (2017), this mutual information is usually defined under the joint distribution of the inputs and their encoding distribution, as it is the constraint between the inputs and the representations. Concretely, the formulation is derived as:
(4) 
Letting a variational distribution approximate the marginal distribution of the representation, we have:
(5) 
According to Eq. 5, the tractable upper bound after applying the variational approximation is:
(6) 
which pushes the distribution of the extracted representation, conditioned on the input, toward a predefined distribution such as a standard Gaussian.
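When the predefined distribution is a standard Gaussian, this kind of upper bound has a closed form. The sketch below assumes, for illustration, that the extractor outputs a diagonal Gaussian described by a mean vector and a log-variance vector; the function name `kl_to_standard_normal` is hypothetical.

```python
import math

def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over dimensions.

    Assumption: the encoder is a diagonal Gaussian, as in a standard VAE.
    """
    return 0.5 * sum(
        math.exp(lv) + m * m - 1.0 - lv for m, lv in zip(mu, log_var)
    )

# A posterior equal to the prior costs nothing; shifting one mean by 1
# with unit variance costs 1/2.
kl_same = kl_to_standard_normal([0.0, 0.0], [0.0, 0.0])
kl_shifted = kl_to_standard_normal([1.0, 0.0], [0.0, 0.0])
```

Averaging this quantity over a batch of inputs gives a penalty that can be scaled by the regularization factor of the minimization term.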
3.3.2 Mutual Information Maximization Term
To maximize the mutual information between the input and the representation part whose capacity is increased, we derive a tractable alternative for this intractable term. Like the minimization term above, this mutual information is defined under the joint distribution of the inputs and their encoding distribution. Because it is hard to derive a tractable lower bound for this term, we find an alternative by expanding the mutual information as:
(7) 
Eq. 7 means that maximizing the mutual information is equivalent to enlarging the Kullback–Leibler (KL) divergence between the joint distribution of the input and the representation and the product of their marginal distributions. Since the maximization of the KL divergence is divergent (the KL divergence is unbounded), we instead maximize the Jensen-Shannon divergence, which approximates the maximization of the KL divergence but is convergent. Following Nowozin et al. (2016), a tractable variational estimation of the Jensen-Shannon divergence can be defined as:
(8) 
where the discriminator, with its own parameters, outputs an estimate of the probability, the positive pair is sampled from the joint distribution, and the negative pair is sampled from the product of the marginal distributions. Since the representation in a positive pair is the one conditioned on the paired input, we shuffle the representations in the positive pairs to obtain the negative pairs.
3.3.3 Inference Term
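Before turning to the inference terms, the estimator of the maximization term in Section 3.3.2 can be sketched as follows. This is a minimal illustration assuming the GAN-style form of the Jensen-Shannon objective, in which the discriminator outputs a probability for each pair; `make_negative_pairs` and `jsd_objective` are hypothetical names, and the discriminator outputs are given directly as numbers rather than produced by a network.

```python
import math
import random

def make_negative_pairs(inputs, reps, rng):
    """Pair each input with a representation drawn from a shuffled copy of reps."""
    shuffled = list(reps)
    rng.shuffle(shuffled)
    return list(zip(inputs, shuffled))

def jsd_objective(d_pos, d_neg, eps=1e-12):
    """GAN-style variational objective: E[log D(pos)] + E[log(1 - D(neg))]."""
    pos_term = sum(math.log(p + eps) for p in d_pos) / len(d_pos)
    neg_term = sum(math.log(1.0 - p + eps) for p in d_neg) / len(d_neg)
    return pos_term + neg_term

# An uninformative discriminator (always 0.5) scores -log 4; a discriminator
# that separates positive from negative pairs scores higher.
chance = jsd_objective([0.5, 0.5], [0.5, 0.5])
confident = jsd_objective([0.9, 0.8], [0.1, 0.2])
pairs = make_negative_pairs(["x1", "x2", "x3"], ["y1", "y2", "y3"], random.Random(0))
```

A random shuffle can leave some representations paired with their own input; with large batches this affects only a small fraction of the negative pairs.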
The inference terms in Eq. 3 are defined under the joint distribution of the representation and the output distribution of the downstream task solver. We take one of them as an example and expand this mutual information term as:
(9) 
Letting a variational distribution approximate the conditional distribution of the output given the representation, we have:
(10) 
By applying the variational approximation, the tractable lower bound of the mutual information between the representation and the output is:
(11) 
Based on the above formulation, we derive different objectives for the supervised and self-supervised settings in what follows.
Supervised Setting. In the supervised setting, the output is the known target label. By assuming that the representation does not depend on the label once the input is given, we have:
(12) 
Accordingly, the joint distribution of the representation and the label can be written as:
(13) 
Combining Eq. 11 with Eq. 13, we get the lower bound of the inference term in the supervised setting:
(14) 
Since the variational conditional distribution models the distribution of labels in the supervised setting, maximizing the bound in Eq. 14 amounts to minimizing the cross-entropy loss for classification.
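For reference, the cross-entropy loss for a single example can be sketched as below. The classifier is assumed to output unnormalized logits, and `cross_entropy` is a hypothetical helper written for illustration rather than a library call.

```python
import math

def cross_entropy(logits, label):
    """Negative log-probability of `label` under softmax(logits)."""
    m = max(logits)  # subtract the maximum for numerical stability
    log_norm = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_norm - logits[label]

# Uniform logits over four classes give a loss of log 4; a confident,
# correct prediction gives a loss close to zero.
uniform_loss = cross_entropy([0.0, 0.0, 0.0, 0.0], 2)
confident_loss = cross_entropy([0.0, 10.0, 0.0], 1)
```

In ICP this loss would be evaluated once for the combined representation and once for each representation part, since each has its own task solver.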
3.3.4 Predictability Minimization Term
To diversify the information and prevent the dominance of one representation part, we constrain the mutual information between the two representation parts, which amounts to making them independent of each other. Inspired by Schmidhuber (1992); Li et al. (2019), we introduce a predictor with its own parameters to fulfill this goal. Concretely, we let the predictor predict one part conditioned on the other, and prevent the extractor from producing a part from which the other can be predicted. The same operation is applied in the opposite direction. The corresponding objective is:
(16) 
So far, we have all the tractable bounds and alternatives for optimizing the information diversifying objective of ICP. The optimization process is summarized in Alg. 1.
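The intuition behind the predictability minimization term can be illustrated with a toy example. The sketch below replaces the neural predictor with a one-dimensional least-squares predictor, which is an assumption made purely for illustration: the predictor tries to minimize the prediction error, while the extractors would be trained to keep it high.

```python
import random

def linear_prediction_error(z, y):
    """MSE of the best affine predictor of y from z (1-D least squares)."""
    n = len(z)
    mz, my = sum(z) / n, sum(y) / n
    var_z = sum((a - mz) ** 2 for a in z) / n
    cov = sum((a - mz) * (b - my) for a, b in zip(z, y)) / n
    slope = cov / var_z if var_z > 0 else 0.0
    return sum((b - my - slope * (a - mz)) ** 2 for a, b in zip(z, y)) / n

rng = random.Random(0)
z = [rng.gauss(0.0, 1.0) for _ in range(2000)]
y_dependent = [2.0 * a + 0.1 * rng.gauss(0.0, 1.0) for a in z]
y_independent = [rng.gauss(0.0, 1.0) for _ in z]

# Small error: y is predictable from z, so the two parts are redundant.
err_dependent = linear_prediction_error(z, y_dependent)
# Error near Var(y): z carries no linear information about y.
err_independent = linear_prediction_error(z, y_independent)
```

A linear predictor only detects linear dependence; a neural predictor, as used in the paper, can detect more general dependence between the two parts.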
4 Experiments
In the experiments, all the probabilistic feature extractors, task solvers, predictors, and the discriminator are implemented as neural networks. We assume that the predefined variational distributions are standard Gaussians and use the reparameterization trick, following the VAE Kingma and Welling (2013). The objectives are differentiable and are optimized using backpropagation. In the classification task (supervised setting), we use one fully connected layer as the classifier. In the reconstruction task (self-supervised setting), multiple deconvolution layers are used as the decoder to reconstruct the inputs.
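The reparameterization trick mentioned above can be sketched as follows, assuming a diagonal Gaussian encoder that outputs a mean and a log-variance per dimension; `reparameterize` is a hypothetical name. The sample is written as a deterministic function of the encoder outputs plus independent noise, which is what lets gradients flow through the sampling step in an automatic differentiation framework.

```python
import math
import random

def reparameterize(mu, log_var, rng):
    """Sample z = mu + sigma * eps with eps ~ N(0, I) and sigma = exp(log_var / 2)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, log_var)]

rng = random.Random(0)
# Two dimensions: mean 1 with std 1, and mean -2 with std 0.5.
samples = [reparameterize([1.0, -2.0], [0.0, math.log(0.25)], rng) for _ in range(5000)]
mean_0 = sum(s[0] for s in samples) / len(samples)
mean_1 = sum(s[1] for s in samples) / len(samples)
```

The empirical means of the samples stay close to the requested means, confirming that the noise only adds zero-mean perturbations scaled by the standard deviation.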
4.1 Supervised Setting: Classification
Dataset. CIFAR-10 and CIFAR-100 Krizhevsky and Hinton (2009) are used to evaluate the performance of ICP in the image classification task. These datasets contain natural images belonging to 10 and 100 classes, respectively. CIFAR-100 comes with “finer” labels than CIFAR-10. The raw images are 32×32 pixels, and we normalize them using the channel means and standard deviations. Standard data augmentation by random cropping and mirroring is applied to the training set.
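The preprocessing described above can be sketched on plain nested lists as follows. Several details are assumptions made for illustration and are not stated in the text: images are stored as image[channel][row][column], cropping is done after zero-padding each side by 4 pixels, mirroring is applied with probability 0.5, and normalization happens before augmentation. All function names are hypothetical.

```python
import random

def normalize(image, means, stds):
    """image[c][i][j] -> (value - mean_c) / std_c for every channel c."""
    return [[[(v - m) / s for v in row] for row in ch]
            for ch, m, s in zip(image, means, stds)]

def mirror(image):
    """Flip every channel left to right."""
    return [[row[::-1] for row in ch] for ch in image]

def random_crop(image, size, pad, rng):
    """Zero-pad each side by `pad`, then cut a random size x size window."""
    h, w = len(image[0]), len(image[0][0])
    top = rng.randint(0, h + 2 * pad - size)
    left = rng.randint(0, w + 2 * pad - size)

    def padded(ch, i, j):
        i, j = i - pad, j - pad
        return ch[i][j] if 0 <= i < h and 0 <= j < w else 0.0

    return [[[padded(ch, top + i, left + j) for j in range(size)]
             for i in range(size)] for ch in image]

def augment(image, rng, size=32, pad=4):
    """Random crop followed by a horizontal flip with probability 0.5."""
    out = random_crop(image, size, pad, rng)
    return mirror(out) if rng.random() < 0.5 else out

rng = random.Random(0)
# Toy 3-channel 32x32 image whose channel c is constant at c + 1.
image = [[[float(c + 1)] * 32 for _ in range(32)] for c in range(3)]
out = augment(normalize(image, [1.0, 2.0, 3.0], [1.0, 1.0, 1.0]), rng)
```

Augmentation would be applied only to training images; test images would be normalized with the same channel statistics and left otherwise unchanged.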
Table 1: Classification error rates (%) on CIFAR-10.
Method | VGG16 Simonyan and Zisserman (2014) | GoogLeNet Szegedy et al. (2015) | ResNet20 He et al. (2016) | DenseNet40 Huang et al. (2017)
Baseline | 6.67 | 4.92 | 7.63 | 5.83
ICP-ALL | 6.97 | 4.76 | 6.47 | 6.13
ICP-COM | 6.59 | 4.67 | 7.33 | 5.63
VIB Alemi et al. (2017) | 6.81 | 5.09 | 6.95 | 5.72
DIM* Hjelm et al. (2019) | 6.54 | 4.65 | 7.61 | 6.15
VIB×2 | 6.86 | 4.88 | 6.85 | 6.36
DIM*×2 | 7.24 | 4.95 | 7.46 | 5.60
ICP | 6.10 | 4.26 | 6.01 | 4.99
Table 2: Classification error rates (%) on CIFAR-100.
Method | VGG16 Simonyan and Zisserman (2014) | GoogLeNet Szegedy et al. (2015) | ResNet20 He et al. (2016) | DenseNet40 Huang et al. (2017)
Baseline | 26.41 | 20.68 | 31.91 | 27.55
ICP-ALL | 26.73 | 20.90 | 28.35 | 27.51
ICP-COM | 26.37 | 20.81 | 32.76 | 26.85
VIB Alemi et al. (2017) | 26.56 | 20.93 | 30.84 | 26.37
DIM* Hjelm et al. (2019) | 26.74 | 20.94 | 32.62 | 27.51
VIB×2 | 26.08 | 22.09 | 29.74 | 29.33
DIM*×2 | 25.72 | 21.74 | 30.16 | 27.15
ICP | 24.54 | 18.55 | 28.13 | 24.52
Backbone Networks. We utilize four architectures including VGGNet Simonyan and Zisserman (2014), GoogLeNet Szegedy et al. (2015, 2016), ResNet He et al. (2016), and DenseNet Huang et al. (2017) to test the general applicability of ICP and to study the diversified representations learned by ICP.
Baselines. We use the classification results of the original network architectures as baselines. The deep Variational Information Bottleneck (VIB) Alemi et al. (2017) and the global version of Deep InfoMax with one additional mutual information maximization term (DIM*) Hjelm et al. (2019) are used as references. To make a fair comparison, we also expand the representation dimension of both methods to the same size as ICP’s (denoted as VIB×2 and DIM*×2). For the ablation study, we implement ICP either by simply combining two representation parts, or by combining two representation parts learned with different constraints without making them independent of each other (i.e., optimizing Eq. 2). The former implementation is denoted as ICP-ALL and the latter as ICP-COM.
Classification Performance. The classification error rates on CIFAR-10 and CIFAR-100 are shown in Tables 1 and 2. The performance of ICP-ALL shows that simply combining two representation parts is suboptimal, except for ResNet20. With constraints added on the representations, the performance of ICP-COM is almost the same as that of the baselines. VIB, DIM*, VIB×2, and DIM*×2 can be regarded as methods that use only one type of constraint in ICP. These methods still do not achieve satisfactory results. Only ICP generalizes to all these architectures and reports the best performance. In addition, the results on CIFAR-10 and CIFAR-100 suggest that ICP works well on the finer labeled dataset (i.e., CIFAR-100). We attribute this success to the diversified representations, which capture more detailed information about the inputs.
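The gap between ICP and the baselines can be summarized with a short calculation over the Baseline and ICP rows of the two tables. The helper `mean_error_reduction` is a hypothetical name, and the result is an absolute difference in percentage points, averaged over the four backbones.

```python
# Error rates (%) in backbone order: VGG16, GoogLeNet, ResNet20, DenseNet40.
baseline = {
    "CIFAR-10": [6.67, 4.92, 7.63, 5.83],
    "CIFAR-100": [26.41, 20.68, 31.91, 27.55],
}
icp = {
    "CIFAR-10": [6.10, 4.26, 6.01, 4.99],
    "CIFAR-100": [24.54, 18.55, 28.13, 24.52],
}

def mean_error_reduction(dataset):
    """Average absolute drop in error rate (percentage points) across backbones."""
    diffs = [b - i for b, i in zip(baseline[dataset], icp[dataset])]
    return sum(diffs) / len(diffs)

gain_c10 = mean_error_reduction("CIFAR-10")    # about 0.92 points
gain_c100 = mean_error_reduction("CIFAR-100")  # about 2.70 points
```

In absolute terms, the average reduction is larger on CIFAR-100 than on CIFAR-10, which is the sense in which ICP helps more on the finer labeled dataset.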
Why Does ICP Work? To explain the intuition behind ICP and its superior results, we study the learned classification models. In the following, we take VGGNet on CIFAR-10 as an example (other models behave consistently with this example). In Fig. 2, we visualize the normalized absolute values of the classifier’s weights (i.e., the normalized absolute values of the weights of the last fully connected layer). As shown in Fig. 2(a), the classification dependency is fused in ICP-ALL, which means that combining two representations directly without any constraints does not diversify the representation. The first green bounding box shows that the classification relies on both parts. The second and third green bounding boxes show that the classification relies more on the first part or on the second part. In contrast, as shown in Fig. 2(b), the classification dependency of ICP can be separated into two parts. Because mutual information minimization makes the representation carry more general information about the input, while maximization makes the representation carry more specific information about the input, a few dimensions of the general part are enough for classification (i.e., the left bounding box of Fig. 2(b)), while many dimensions of the specific part need to be considered for classification (i.e., the right bounding box of Fig. 2(b)). This suggests that ICP diversifies and aggregates the complementary representations for the classification task.
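The kind of weight analysis described above can be sketched as follows. The per-class normalization and the function name `part_dependency` are assumptions made for illustration; the text does not specify how the absolute weights are normalized. For each class, the sketch measures how much of the classifier's absolute weight mass falls on each representation part.

```python
def part_dependency(weights, split):
    """Per class, share of |weight| mass on dims [0, split) and on dims [split, end)."""
    shares = []
    for row in weights:
        mags = [abs(w) for w in row]
        total = sum(mags) or 1.0  # avoid division by zero for an all-zero row
        shares.append((sum(mags[:split]) / total, sum(mags[split:]) / total))
    return shares

# Toy classifier: 2 classes, 4-d representation split into two 2-d parts.
toy_weights = [[0.9, -0.1, 0.0, 0.0],
               [0.0, 0.0, -0.5, 0.5]]
shares = part_dependency(toy_weights, split=2)
```

In this toy case the first class depends only on the first part and the second class only on the second part, the fully separated pattern; a fused dependency would show comparable shares on both parts.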
4.2 Self-supervised Setting: Reconstruction
Dataset. We use CelebA Liu et al. (2015), a large-scale face attributes dataset with more than 200,000 celebrity images, to study the disentanglement ability of ICP. The images are resized to 128×128 pixels.
Baseline. β-VAE Higgins et al. (2017) is implemented with the same representation dimension as ICP. We manually pick the dimensions with the most closely related semantic meaning in β-VAE and ICP to conduct the qualitative evaluation of disentanglement. The average Mean Square Error (MSE) and the Structural Similarity Index (SSIM) Wang et al. (2004) are used to report the quantitative reconstruction results.
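The two reconstruction metrics can be sketched on flattened pixel lists as follows. The SSIM version here is a simplification labeled as such: it evaluates the SSIM formula once over the whole image instead of averaging over local windows as in Wang et al. (2004), and it assumes pixel values in [0, 1]. The function names are hypothetical; the constants 0.01 and 0.03 are the commonly used defaults.

```python
def mse(a, b):
    """Mean squared error between two equally sized pixel lists."""
    return sum((p - q) ** 2 for p, q in zip(a, b)) / len(a)

def global_ssim(a, b, data_range=1.0, k1=0.01, k2=0.03):
    """SSIM formula applied to whole-image statistics (single window)."""
    n = len(a)
    mu_a, mu_b = sum(a) / n, sum(b) / n
    var_a = sum((p - mu_a) ** 2 for p in a) / n
    var_b = sum((q - mu_b) ** 2 for q in b) / n
    cov = sum((p - mu_a) * (q - mu_b) for p, q in zip(a, b)) / n
    c1, c2 = (k1 * data_range) ** 2, (k2 * data_range) ** 2
    numerator = (2 * mu_a * mu_b + c1) * (2 * cov + c2)
    denominator = (mu_a ** 2 + mu_b ** 2 + c1) * (var_a + var_b + c2)
    return numerator / denominator

original = [0.1, 0.4, 0.6, 0.9]
reconstruction = [0.2, 0.4, 0.5, 0.9]
error = mse(original, reconstruction)
similarity = global_ssim(original, reconstruction)
```

Lower MSE and higher SSIM both indicate a reconstruction closer to the input; an image compared with itself gives an MSE of 0 and an SSIM of 1.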
Disentanglement and Reconstruction. The qualitative results of disentanglement using β-VAE and ICP are shown in Fig. 3. Many fine-grained semantic attributes, such as goatee and eyeglasses, are clearly disentangled by ICP with details. In Table 3, the quantitative results show that ICP retains more information about the input and achieves better reconstruction performance.
5 Conclusion
We proposed a new approach named Information Competing Process (ICP) for learning diversified representations. To enrich the information carried by representations, ICP separates a representation into two parts with different mutual information constraints, and prevents each part from knowing what the other has learned for the downstream task. Such rival representations are then combined to accomplish the downstream task synergistically. Experiments demonstrated the potential of ICP in both supervised and self-supervised settings. The performance gain stems from the ability of ICP to learn diversified representations, which provides fresh insights into the representation learning problem.
References
 Alemi et al. [2017] Alexander A Alemi, Ian Fischer, Joshua V Dillon, and Kevin Murphy. Deep variational information bottleneck. In ICLR, 2017.
 Bell and Sejnowski [1995] Anthony J Bell and Terrence J Sejnowski. An information-maximization approach to blind separation and blind deconvolution. Neural Computation, 1995.
 Bengio et al. [2013] Yoshua Bengio, Aaron Courville, and Pascal Vincent. Representation learning: A review and new perspectives. PAMI, 2013.
 Burgess et al. [2018] Christopher P Burgess, Irina Higgins, Arka Pal, Loic Matthey, Nick Watters, Guillaume Desjardins, and Alexander Lerchner. Understanding disentangling in β-VAE. In NeurIPS, 2018.
 Chen et al. [2018] Tian Qi Chen, Xuechen Li, Roger B Grosse, and David K Duvenaud. Isolating sources of disentanglement in variational autoencoders. In NeurIPS, 2018.
 Chen et al. [2016] Xi Chen, Yan Duan, Rein Houthooft, John Schulman, Ilya Sutskever, and Pieter Abbeel. InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets. In NeurIPS, 2016.
 Greff et al. [2016] Klaus Greff, Antti Rasmus, Mathias Berglund, Tele Hao, Harri Valpola, and Jürgen Schmidhuber. Tagger: Deep unsupervised perceptual grouping. In NeurIPS, 2016.
 Greff et al. [2017] Klaus Greff, Sjoerd van Steenkiste, and Jürgen Schmidhuber. Neural expectation maximization. In NeurIPS, 2017.
 Hamilton et al. [2017] Will Hamilton, Zhitao Ying, and Jure Leskovec. Inductive representation learning on large graphs. In NeurIPS, 2017.
 He et al. [2016] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.
 Higgins et al. [2017] Irina Higgins, Loic Matthey, Arka Pal, Christopher Burgess, Xavier Glorot, Matthew Botvinick, Shakir Mohamed, and Alexander Lerchner. β-VAE: Learning basic visual concepts with a constrained variational framework. In ICLR, 2017.
 Hjelm et al. [2019] R Devon Hjelm, Alex Fedorov, Samuel Lavoie-Marchildon, Karan Grewal, Adam Trischler, and Yoshua Bengio. Learning deep representations by mutual information estimation and maximization. In ICLR, 2019.
 Hu et al. [2019] Jie Hu, Rongrong Ji, Hong Liu, Shengchuan Zhang, Cheng Deng, and Qi Tian. Towards visual feature translation. In CVPR, 2019.
 Huang et al. [2017] Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q Weinberger. Densely connected convolutional networks. In CVPR, 2017.
 Hyvärinen and Oja [2000] Aapo Hyvärinen and Erkki Oja. Independent component analysis: algorithms and applications. Neural networks, 2000.
 Kim and Mnih [2018] Hyunjik Kim and Andriy Mnih. Disentangling by factorising. In ICML, 2018.
 Kingma and Welling [2013] Diederik P Kingma and Max Welling. Auto-encoding variational Bayes. In ICLR, 2013.
 Kolesnikov et al. [2019] Alexander Kolesnikov, Xiaohua Zhai, and Lucas Beyer. Revisiting self-supervised visual representation learning. In CVPR, 2019.
 Krizhevsky and Hinton [2009] Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009.
 Li et al. [2019] Xinyang Li, Jie Hu, Shengchuan Zhang, Xiaopeng Hong, Qixiang Ye, Chenglin Wu, and Rongrong Ji. Attribute guided unpaired image-to-image translation with semi-supervised learning. arXiv preprint arXiv:1904.12428, 2019.
 Linsker [1988] Ralph Linsker. Self-organization in a perceptual network. Computer, 1988.
 Liu et al. [2015] Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the wild. In ICCV, 2015.
 Nowozin et al. [2016] Sebastian Nowozin, Botond Cseke, and Ryota Tomioka. f-GAN: Training generative neural samplers using variational divergence minimization. In NeurIPS, 2016.
 Oord et al. [2018] Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748, 2018.
 Radford et al. [2015] Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.

 Ranjan et al. [2019] Anurag Ranjan, Varun Jampani, Kihwan Kim, Deqing Sun, Jonas Wulff, and Michael J Black. Adversarial collaboration: Joint unsupervised learning of depth, camera motion, optical flow and motion segmentation. In CVPR, 2019.
 Rezende et al. [2014] Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate inference in deep generative models. In ICML, 2014.
 Saxe et al. [2018] Andrew Michael Saxe, Yamini Bansal, Joel Dapello, Madhu Advani, Artemy Kolchinsky, Brendan Daniel Tracey, and David Daniel Cox. On the information bottleneck theory of deep learning. 2018.
 Schmidhuber [1992] Jürgen Schmidhuber. Learning factorial codes by predictability minimization. Neural Computation, 1992.
 Shwartz-Ziv and Tishby [2017] Ravid Shwartz-Ziv and Naftali Tishby. Opening the black box of deep neural networks via information. arXiv preprint arXiv:1703.00810, 2017.
 Simonyan and Zisserman [2014] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
 Szegedy et al. [2015] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In CVPR, 2015.

 Szegedy et al. [2016] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision. In CVPR, 2016.
 Tishby and Zaslavsky [2015] Naftali Tishby and Noga Zaslavsky. Deep learning and the information bottleneck principle. In ITW, 2015.
 Tishby et al. [2000] Naftali Tishby, Fernando C Pereira, and William Bialek. The information bottleneck method. arXiv preprint physics/0004057, 2000.
 Wang et al. [2004] Zhou Wang, Alan C Bovik, Hamid R Sheikh, Eero P Simoncelli, et al. Image quality assessment: from error visibility to structural similarity. TIP, 2004.