Revisiting Locally Supervised Learning: an Alternative to End-to-end Training

01/26/2021 ∙ by Yulin Wang, et al. ∙ Tsinghua University

Due to the need to store intermediate activations for back-propagation, end-to-end (E2E) training of deep networks usually suffers from a high GPU memory footprint. This paper aims to address this problem by revisiting locally supervised learning, where a network is split into gradient-isolated modules and trained with local supervision. We experimentally show that simply training local modules with an E2E loss tends to collapse task-relevant information at early layers, and hence hurts the performance of the full model. To avoid this issue, we propose an information propagation (InfoPro) loss, which encourages local modules to preserve as much useful information as possible while progressively discarding task-irrelevant information. As the InfoPro loss is difficult to compute in its original form, we derive a feasible upper bound as a surrogate optimization objective, yielding a simple but effective algorithm. In fact, we show that the proposed method boils down to minimizing the combination of a reconstruction loss and a normal cross-entropy/contrastive term. Extensive empirical results on five datasets (i.e., CIFAR, SVHN, STL-10, ImageNet and Cityscapes) validate that InfoPro is capable of achieving competitive performance with less than 40% of the GPU memory footprint of E2E training, while allowing the use of higher-resolution training data or larger batch sizes under the same GPU memory constraint. Our method also enables training local modules asynchronously for potential training acceleration. Code is available at: https://github.com/blackfeather-wang/InfoPro-Pytorch.


1 Introduction

End-to-end (E2E) back-propagation has become the standard paradigm for training deep networks (Krizhevsky et al., 2012; Simonyan and Zisserman, 2014; Szegedy et al., 2015; He et al., 2016; Huang et al., 2019). Typically, a training loss is computed at the final layer, and the gradients are then propagated backward layer-by-layer to update the weights. Although effective, this procedure may suffer from memory and computation inefficiencies. First, the entire computational graph, as well as the activations of most, if not all, layers, needs to be stored, resulting in intensive memory consumption. The GPU memory constraint is usually a bottleneck that inhibits training state-of-the-art models with high-resolution inputs and sufficient batch sizes, which arises in many realistic scenarios, such as 2D/3D semantic segmentation and object detection in autonomous driving, tissue segmentation in medical imaging, and object recognition from remote sensing data. Most existing works address this issue via the gradient checkpointing technique (Chen et al., 2016) or reversible architecture designs (Gomez et al., 2017), but both come at the cost of significantly increased computation. Second, E2E training is a sequential process that impedes model parallelization (Belilovsky et al., 2020; Löwe et al., 2019), as earlier layers must wait for error signals from their successors.

As an alternative to E2E training, the locally supervised learning paradigm (Hinton et al., 2006; Bengio et al., 2007; Nøkland and Eidnes, 2019; Belilovsky et al., 2019, 2020) by design enjoys higher memory efficiency and allows for model parallelization. Specifically, it divides a deep network into several gradient-isolated modules and trains them separately under local supervision (see Figure 1 (b)). Since back-propagation is performed only within local modules, one does not need to store all intermediate activations at the same time. Consequently, the memory footprint during training is reduced without introducing significant computational overhead. Moreover, by removing the need to obtain error signals from later layers, different local modules can potentially be trained in parallel. This approach is also considered more biologically plausible, given that brains are highly modular and predominantly learn from local signals (Crick, 1989; Dan and Poo, 2004; Bengio et al., 2015). However, a major drawback of local learning is that it usually leads to inferior performance compared to E2E training (Mostafa et al., 2018; Belilovsky et al., 2019, 2020).
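
To make the gradient-isolated setting concrete, the sketch below trains each module of a small CNN with its own local classification head, detaching activations between modules so that no gradients cross module boundaries. The architecture, heads and hyper-parameters are illustrative only and are not the paper's implementation; with a plain cross-entropy head, this corresponds to the greedy SL baseline discussed later, and InfoPro replaces the local loss.

```python
import torch
import torch.nn as nn

# Toy network split into three gradient-isolated modules; the last module
# contains the final classifier. Architectures are illustrative only.
modules = nn.ModuleList([
    nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU()),
    nn.Sequential(nn.Conv2d(16, 32, 3, padding=1), nn.ReLU()),
    nn.Sequential(nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
                  nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, 10)),
])
# Local classification heads for the two earlier modules.
local_heads = nn.ModuleList([
    nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, 10)),
    nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, 10)),
])
optimizers = [
    torch.optim.SGD(list(m.parameters()) + (list(h.parameters()) if h is not None else []), lr=0.1)
    for m, h in zip(modules, list(local_heads) + [None])
]
criterion = nn.CrossEntropyLoss()

def train_step(x, y):
    """One mini-batch: every module back-propagates only its own local loss."""
    h = x
    for k, module in enumerate(modules):
        h = module(h)
        logits = h if k == len(modules) - 1 else local_heads[k](h)
        loss = criterion(logits, y)
        optimizers[k].zero_grad()
        loss.backward()      # gradients stay within module k and its head
        optimizers[k].step()
        h = h.detach()       # gradient isolation: no signal flows backward
    return loss.item()
```

For example, `train_step(torch.randn(8, 3, 32, 32), torch.randint(0, 10, (8,)))` runs one update; because each backward pass is confined to one module, only that module's activations need to reside on the GPU at any time.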

In this paper, we revisit locally supervised training and analyse its drawbacks from an information-theoretic perspective. We find that directly adopting an E2E loss function (i.e., cross-entropy) to train local modules produces more discriminative intermediate features at earlier layers, but collapses task-relevant information from the inputs and leads to inferior final performance. In other words, local learning tends to be short-sighted: it learns features that only benefit local modules, while ignoring the demands of the remaining layers. Once task-relevant information is washed out in earlier modules, later layers cannot take full advantage of their capacity to learn more powerful representations.

Based on the above observations, we hypothesize that a less greedy training procedure that preserves more information about the inputs might be a rescue for locally supervised training. Therefore, we propose a less greedy information propagation (InfoPro) loss that encourages local modules to propagate forward as much information from the inputs as possible, while progressively abandoning task-irrelevant parts (formulated by an additional random variable named the nuisance), as shown in Figure 1 (c). The proposed method differentiates itself from existing algorithms (Nøkland and Eidnes, 2019; Belilovsky et al., 2019, 2020) in that it allows intermediate features to retain a certain amount of information which may hurt short-term performance, but can potentially be leveraged by later modules. In practice, as the InfoPro loss is difficult to estimate in its exact form, we derive a tractable upper bound, leading to simple surrogate losses, e.g., a reconstruction loss combined with a cross-entropy or contrastive loss.

Empirically, we show that the InfoPro loss effectively prevents collapsing task-relevant information at local modules, and yields favorable results on five widely used benchmarks (i.e., CIFAR, SVHN, STL-10, ImageNet and Cityscapes). For instance, it achieves accuracy comparable to E2E training using 40% or less of the GPU memory, while allowing a 50% larger batch size or a 50% larger input resolution under the same memory constraints. Additionally, our method enables training different local modules asynchronously (or even in parallel).

Figure 1: (a) and (b) illustrate the paradigms of end-to-end (E2E) learning and locally supervised learning, respectively. “End-to-end Loss” refers to the standard loss function used by E2E training, e.g., the softmax cross-entropy loss for classification, while the local loss denotes the loss function used to train each local module. (c) compares three training approaches in terms of the information captured by features. Greedy supervised learning (greedy SL) tends to collapse part of the task-relevant information within the first module, leading to inferior final performance. The proposed information propagation (InfoPro) loss alleviates this problem by encouraging local modules to propagate forward all the information from the inputs, while maximally discarding task-irrelevant information.

2 Why Does Locally Supervised Learning Underperform E2E Training?

We start by considering a local learning setting where a deep network is split into K successively stacked modules, each with the same depth. The inputs are fed forward in the ordinary way, while the gradients are produced at the end of every module and back-propagated only until they reach the beginning of that module. To generate supervision signals, a straightforward solution is to train all the local modules as independent networks, e.g., in classification tasks, attaching a classifier to each module and computing a local classification loss such as the cross-entropy. However, such a greedy version of the standard supervised learning algorithm (greedy SL) leads to inferior performance of the whole network. For instance, in Table 1, we present the test errors of a ResNet-32 (He et al., 2016) on CIFAR-10 (Krizhevsky et al., 2009) when it is greedily trained with K ∈ {2, 4, 8, 16} modules. One can observe severe degradation (more than 17% at K = 16) as K grows larger. Plausible as this phenomenon seems, it remains unclear whether it is inherent to local learning and how to alleviate it. In this section, we investigate the performance degradation issue of greedy local training from an information-theoretic perspective, laying the basis for the proposed algorithm.

K | 1 (E2E) | 2 | 4 | 8 | 16
Test Error | 7.37% | 10.30% | 16.07% | 21.19% | 24.59%
Table 1: Test errors of a ResNet-32 using greedy SL on CIFAR-10. The network is divided into K successive local modules. Each module is trained separately with the softmax cross-entropy loss by appending a global-pool layer followed by a fully-connected layer (see Appendix F for details). “K = 1” refers to end-to-end (E2E) training.

Figure 2: The linear separability (left, measured by test errors), the mutual information with the input I(h, x) (middle), and the mutual information with the label I(h, y) (right) of the intermediate features from different layers when the greedy supervised learning (greedy SL) algorithm is adopted with K local modules. The ends of local modules are marked using larger markers with black edges. The experiments are conducted on CIFAR-10 with a ResNet-32.

Linear separability of intermediate features. Given that greedy SL operates directly on the features output by internal layers, a natural first step is to investigate how these locally learned features differ from their E2E-learned counterparts in task-relevant behavior. To this end, we freeze the networks in Table 1 and train a linear classifier on the features from each layer. The test errors of these classifiers are presented in the left plot of Figure 2, where the horizontal axis denotes the indices of layers. The plot shows an intriguing trend: greedy SL produces dramatically more discriminative features within the first one (or few) local modules, but only slightly improves the performance with all subsequent modules. In contrast, the E2E-learned network progressively boosts the linear separability of features throughout the whole network, with even more pronounced effects in the later layers, eventually surpassing greedy SL. This raises an interesting question: why does the full network achieve inferior performance with greedy SL compared to the E2E counterpart, even though the former builds on more discriminative earlier features? This observation also appears incompatible with prior works like deeply-supervised nets (Lee et al., 2015a).
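
As a reference for how such a linear-separability probe can be implemented, the sketch below trains a linear classifier on frozen features from a chosen layer and reports its test error. It is a minimal sketch: `backbone.forward_to(layer_index, x)` is a hypothetical helper that returns the activations of that layer, and the probe hyper-parameters are illustrative rather than those used in the paper.

```python
import torch
import torch.nn as nn

def linear_probe_error(backbone, layer_index, train_loader, test_loader,
                       feat_dim, num_classes=10, epochs=30, device="cuda"):
    """Train a linear classifier on frozen features from one layer and return
    its test error."""
    backbone.eval()
    clf = nn.Linear(feat_dim, num_classes).to(device)
    opt = torch.optim.SGD(clf.parameters(), lr=0.1, momentum=0.9)
    ce = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for x, y in train_loader:
            x, y = x.to(device), y.to(device)
            with torch.no_grad():                         # backbone is frozen
                feat = backbone.forward_to(layer_index, x).flatten(1)
            loss = ce(clf(feat), y)
            opt.zero_grad()
            loss.backward()
            opt.step()
    correct, total = 0, 0
    with torch.no_grad():
        for x, y in test_loader:
            x, y = x.to(device), y.to(device)
            feat = backbone.forward_to(layer_index, x).flatten(1)
            correct += (clf(feat).argmax(dim=1) == y).sum().item()
            total += y.numel()
    return 1.0 - correct / total
```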

Information in features. Since we use the same training configuration for both greedy SL and E2E learning, we conjecture that the answer to the above question lies in properties of the features beyond mere separability. To test this, we look into the information captured by the intermediate features. Specifically, given an intermediate feature h corresponding to the input data x and the label y (all treated as random variables), we use the mutual information I(h, x) and I(h, y) to measure the amount of all retained information and of task-relevant information in h, respectively. As these metrics cannot be computed directly, we estimate the former by training a decoder with a binary cross-entropy loss to reconstruct x from h. For the latter, we train a CNN that takes h as input to classify y, and estimate I(h, y) from its performance. Details are deferred to Appendix G.
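
A rough sketch of the second estimate is given below: it approximates I(h, y) = H(y) − H(y|h) by replacing H(y|h) with the held-out cross-entropy of a probe network trained to predict y from the frozen features. The function only performs the evaluation step and assumes the probe has already been trained on a separate split; the names and the uniform-label assumption for H(y) are ours, not the paper's exact protocol.

```python
import math
import torch
import torch.nn as nn

def estimate_label_information(probe, feature_loader, num_classes, device="cuda"):
    """Rough estimate of I(h, y) = H(y) - H(y|h) in nats. H(y|h) is replaced by
    the average cross-entropy of a probe network (already trained on a separate
    split) that predicts y from the frozen features h; H(y) assumes roughly
    balanced labels. `feature_loader` yields (h, y) pairs."""
    probe.eval()
    ce = nn.CrossEntropyLoss(reduction="sum")
    total_nll, count = 0.0, 0
    with torch.no_grad():
        for h, y in feature_loader:
            h, y = h.to(device), y.to(device)
            total_nll += ce(probe(h), y).item()
            count += y.numel()
    cond_entropy = total_nll / count          # approximates H(y | h)
    label_entropy = math.log(num_classes)     # H(y) for balanced classes
    return max(label_entropy - cond_entropy, 0.0)
```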

The estimates of I(h, x) and I(h, y) at different layers are shown in the middle and right plots of Figure 2. We note that in E2E-learned networks, I(h, y) remains largely unchanged as the features pass through the layers, while I(h, x) reduces gradually, revealing that the models progressively discard task-irrelevant information. In contrast, greedily trained networks collapse both I(h, x) and I(h, y) in their first few modules. We attribute this to the short-sighted optimization objective of the earlier modules, which have relatively small capacity compared with the full network and are not capable of extracting and leveraging all the task-relevant information in x, as E2E-learned networks do. As a consequence, later modules, despite introducing additional parameters and capacity, lack the necessary information about the target to construct more discriminative features.

Information collapse hypothesis. The above observations suggest that greedy SL induces local modules to collapse part of the task-relevant information that contributes little to short-term performance but is useful for the full model. In addition, we postulate that, although E2E training cannot extract all task-relevant information at earlier layers either, it alleviates this phenomenon by allowing a larger amount of task-irrelevant information to be kept, even though doing so may not be ideal for short-term performance. More empirical validation of this hypothesis is provided in Appendix A.

3 Information Propagation (InfoPro) Loss

In this section, we propose an information propagation (InfoPro) loss to address the issue of information collapse in locally supervised training. The key idea is to encourage local modules to retain as much information about the input as possible, while progressively discarding task-irrelevant parts. As it is difficult to estimate the InfoPro loss in its exact form, we derive an easy-to-compute upper bound as a surrogate loss and analyze its tightness.

3.1 Learning to Discard Useless Information

Nuisance. We first model the task-irrelevant information in the input data by introducing the concept of a nuisance. A nuisance is defined as an arbitrary random variable that affects x but provides no helpful information for the task of interest (Achille and Soatto, 2018). Take recognizing a car in the wild for example: the random variables determining the weather and the background are both nuisances. Formally, given a nuisance n, we have I(n, y) = 0, where y is the label. Without loss of generality, we suppose that the input x is generated from the label y together with the nuisance n, so that y, n and x form a Markov chain with any intermediate feature h computed from x. As a consequence, by the data processing inequality, h cannot carry more information about n than x itself does. Nevertheless, we postulate that I(x, n) > 0. This assumption is mild, since it fails to hold only when x strictly contains no task-irrelevant information.

Information Propagation (InfoPro) Loss. We are now ready to introduce the proposed InfoPro loss. Instead of overly emphasizing learning highly discriminative features at local modules, we also pay attention to preventing the collapse of useful information in the feed-forward process. A simple way to achieve this is to maximize the mutual information I(h, x). Ideally, if there is no information loss, all useful information will be retained. However, this goes to the other extreme, where the local modules need not learn any task-relevant features and thus become dispensable. By contrast, as shown above, in E2E training the intermediate layers also progressively discard useless (task-irrelevant) information. Therefore, to model both effects simultaneously, we propose the following combined loss function:

L_InfoPro = − [ λ · I(h, x) − I(h, n) ],   (1)

where the nuisance n is formulated to capture as much task-irrelevant information in x as possible, and the coefficient λ controls the trade-off between the amount of information that is propagated forward (first term) and the task-irrelevant information that is discarded (second term). Notably, we assume that the final module is always trained using the normal E2E loss (e.g., the softmax cross-entropy loss for classification) weighted by the constant 1, such that λ is essential for balancing the intermediate losses and the final one. In addition, L_InfoPro is used to train the local module that outputs h, whose inputs are not required to be x: the module may stack on top of another local module trained with the same form of loss but (possibly) different inputs and coefficients.

Our method differs from existing works (Nøkland and Eidnes, 2019; Belilovsky et al., 2019, 2020) in that it is a non-greedy approach. The major effect of minimizing Eq. (1) can be described as maximally discarding task-irrelevant information under the goal of retaining as much information about the input as possible; obtaining high short-term performance is not explicitly required. As we explicitly facilitate information propagation, we refer to Eq. (1) as the InfoPro loss.

3.2 Upper Bound of the InfoPro Loss

The objective function in Eq. (1) is difficult to optimize directly, since it is usually intractable to estimate I(h, n), which amounts to disentangling all task-irrelevant information from the intermediate features. Therefore, we derive an easy-to-compute upper bound of L_InfoPro as a surrogate loss. Our result is summarized in Proposition 1, with the proof in Appendix B.

Proposition 1.

Suppose that the Markov chain (y, n) → x → h holds. Then an upper bound of L_InfoPro is given by

L̂_InfoPro = − [ λ1 · I(h, x) + λ2 · I(h, y) ],   (2)

where the coefficients λ1 and λ2 are determined by the λ in Eq. (1).

For simplicity, we henceforth treat λ1 and λ2 as two mutually independent hyper-parameters. Although we do not explicitly restrict λ1 ≥ 0, we find in experiments that the performance of networks is significantly degraded with λ1 < 0 (see Figure 4), where models tend to reach local minima by trivially minimizing I(h, x). Thus, we assume λ1 ≥ 0.

With Proposition 1, we can optimize the upper bound L̂_InfoPro as an approximation, circumventing the intractable term I(h, n) in L_InfoPro. To ensure that the approximation is accurate, the gap between the two should be reasonably small. Below we analyze the tightness of the bound in Proposition 2 (proof given in Appendix C), and we also check it empirically in Appendix H. Proposition 2 provides a useful tool to examine the discrepancy between L_InfoPro and its upper bound.

Proposition 2.

Given that I(n, y) = 0 and that h is a deterministic function with respect to x, the gap between L̂_InfoPro and L_InfoPro is upper bounded by

(3)

3.3 Mutual Information Estimation

In the following, we describe the specific techniques we use to estimate the mutual information terms I(h, x) and I(h, y) in L̂_InfoPro. Both are estimated using small auxiliary networks; however, we note that the additional computational cost involved is minimal or even negligible (see Tables 3 and 4).

Estimating I(h, x). Let R denote the expected error of reconstructing x from h. It is widely known that I(h, x) ≥ H(x) − R, where H(x) denotes the marginal entropy of x and is a constant (Vincent et al., 2008; Rifai et al., 2012; Kingma and Welling, 2013; Makhzani et al., 2015; Hjelm et al., 2019). Therefore, we estimate I(h, x) by training a decoder (parameterized by auxiliary weights) to attain the minimal reconstruction loss R. In practice, we use the binary cross-entropy loss for R.
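
A minimal sketch of this term is shown below, assuming input pixels scaled to [0, 1] and a decoder that upsamples h back to the image resolution (e.g., the one transcribed from Table 7 in Appendix E); the function name is ours.

```python
import torch.nn.functional as F

def reconstruction_loss(decoder, h, x):
    """Surrogate for the I(h, x) term: binary cross-entropy between the
    decoder's reconstruction of the input and the input itself (pixels are
    assumed to lie in [0, 1]). Minimizing it w.r.t. the decoder tightens the
    bound I(h, x) >= H(x) - R, and minimizing it w.r.t. the local module
    (through h) encourages h to keep information about x."""
    x_hat = decoder(h)        # the decoder upsamples h to the image resolution
    return F.binary_cross_entropy(x_hat, x)
```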

Estimating I(h, y). We propose two ways to estimate I(h, y). Since I(h, y) = H(y) − H(y|h), a straightforward approach is to train an auxiliary classifier to approximate the conditional distribution p(y|h), whose expected negative log-likelihood approximates H(y|h); this approximation becomes exact if and only if the classifier recovers p(y|h) (according to Gibbs’ inequality). Finally, we estimate the expectation using mini-batch samples of (h, y). Consequently, the auxiliary classifier can be trained in a regular classification fashion with the cross-entropy loss.

In addition, motivated by recent advances in contrastive representation learning (Chen et al., 2020; Khosla et al., 2020; He et al., 2020), we formulate a contrastive-style loss function L_contrast, and prove in Appendix D that minimizing L_contrast is equivalent to maximizing a lower bound of I(h, y). Empirical results indicate that adopting L_contrast may lead to better performance when a large batch size is available. Specifically, considering a mini-batch of intermediate features corresponding to their labels, L_contrast is given by:

(4)

Herein, the indicator 1[A] equals 1 only when A is true (and 0 otherwise), τ is a pre-defined temperature hyper-parameter, and a projection head maps the feature h to a representation vector used for computing similarities (this design follows Chen et al. (2020); Khosla et al. (2020)).
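
The sketch below gives one common supervised-contrastive formulation consistent with this description (same-label pairs as positives, a temperature τ, and inputs z produced by the projection head). It is an assumption-laden stand-in: the exact normalization of Eq. (4) in the paper may differ in detail.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(z, labels, temperature=0.1):
    """Supervised contrastive-style loss over a mini-batch of projected
    features z (shape [N, d]) with class labels (shape [N]). Samples sharing a
    label act as positives; the normalization follows one common formulation
    and may differ in detail from Eq. (4)."""
    z = F.normalize(z, dim=1)
    sim = z @ z.t() / temperature                                   # [N, N]
    n = z.size(0)
    self_mask = torch.eye(n, dtype=torch.bool, device=z.device)
    pos_mask = (labels.unsqueeze(0) == labels.unsqueeze(1)) & ~self_mask
    sim = sim.masked_fill(self_mask, float("-inf"))                 # drop self-pairs
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    log_prob = log_prob.masked_fill(self_mask, 0.0)                 # never counted below
    pos_count = pos_mask.sum(dim=1)
    valid = pos_count > 0                                           # anchors with >= 1 positive
    loss = -(log_prob * pos_mask.float()).sum(dim=1)[valid] / pos_count[valid]
    return loss.mean()
```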

Implementation details. We defer the details of the network architectures of the decoder, the auxiliary classifier and the projection head to Appendix E. Briefly, on CIFAR, SVHN and STL-10, the decoder is a two-layer network operating on up-sampled inputs (if not otherwise noted), while the auxiliary classifier and the projection head share the same architecture, consisting of a single convolutional layer followed by two fully-connected layers. On ImageNet and Cityscapes, we use relatively larger auxiliary nets, but they remain very small compared with the primary network. Empirically, we find that these simple architectures consistently achieve competitive performance. Moreover, in implementation, we train the auxiliary networks collaboratively with the main network. Formally, let θ denote the parameters of the local module to be trained; our optimization objective is then

minimize over θ and the auxiliary networks:  λ1 · L_reconstruction + λ2 · L_cross-entropy,   or   λ1 · L_reconstruction + λ2 · L_contrast,   (5)

which correspond to using the cross-entropy and the contrastive loss to estimate I(h, y), respectively. Such an approximation is acceptable since we do not need the exact value of the mutual information, and empirically it performs well in various experimental settings.
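
Putting the pieces together, a single local-module update under this surrogate objective might look like the sketch below, which jointly updates the local module and its auxiliary networks. It reuses the reconstruction and contrastive helpers sketched above; all names, default weights and the optimizer setup are illustrative assumptions rather than the released implementation, and the optimizer passed in is assumed to cover the parameters of the local module, the decoder and the auxiliary head.

```python
import torch.nn.functional as F

def infopro_local_step(module, decoder, aux_head, optimizer, x_in, x_image, y,
                       lambda1=0.5, lambda2=0.5, use_contrast=False):
    """One update of a single local module with the surrogate objective:
    lambda1 * reconstruction loss (retain information about the input) +
    lambda2 * cross-entropy or contrastive loss (retain task-relevant
    information). x_in is the detached output of the previous module and
    x_image the original image in [0, 1]."""
    h = module(x_in)
    loss_r = reconstruction_loss(decoder, h, x_image)       # sketched earlier
    if use_contrast:
        z = aux_head(h)                                      # projection head
        loss_t = contrastive_loss(z, y)                      # sketched earlier
    else:
        loss_t = F.cross_entropy(aux_head(h), y)             # auxiliary classifier
    loss = lambda1 * loss_r + lambda2 * loss_t
    optimizer.zero_grad()
    loss.backward()                                          # stays inside this module
    optimizer.step()
    return h.detach(), loss.item()
```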

4 Experiments

Setups. Our experiments are based on five widely used datasets (i.e., CIFAR-10 (Krizhevsky et al., 2009), SVHN (Netzer et al., 2011), STL-10 (Coates et al., 2011), ImageNet (Deng et al., 2009) and Cityscapes (Cordts et al., 2016)) and two popular network architectures (i.e., ResNet (He et al., 2016) and DenseNet (Huang et al., 2019)) with varying depth. We split each network into K local modules with the same (or approximately the same) number of layers, where the first K − 1 modules are trained using L_InfoPro and the last module is trained using the standard E2E loss, as aforementioned. Due to space limitations, details on data pre-processing, training configurations and local-module splitting are deferred to Appendix F. The hyper-parameters λ1 and λ2 are selected from a small set of candidate values. Notably, to avoid involving too many tunable hyper-parameters when K is large (e.g., K = 16), we assume that λ1 and λ2 change linearly from the first to the last local module, and thus we merely tune them for these two modules. We always use a fixed temperature τ in L_contrast.

Two training modes are considered: (1) simultaneous training, where the back-propagation of all local modules is triggered sequentially for every mini-batch of training data; and (2) asynchronous training, where local modules are learned in isolation given cached outputs from completely trained earlier modules. Both modes enjoy high memory efficiency since only the activations within a single module need to be stored at a time. The second mode removes the dependence of local modules on their predecessors, enabling fully decoupled training of network components. The experiments using asynchronous training are referred to as “Asy-InfoPro”, while all other results are based on simultaneous training.
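
The asynchronous mode can be sketched as follows: each module is trained to completion with only local supervision, its outputs over the training set are then cached, and the next module is trained from the cache. The helper `train_module_fn`, the in-memory cache and the device handling are simplifying assumptions for illustration.

```python
import torch

def train_asynchronously(modules, train_module_fn, loader, device="cuda"):
    """Fully decoupled (asynchronous) training: each module is trained to
    completion on inputs cached from its already-trained predecessor.
    `train_module_fn(module, cached_samples)` is assumed to run the local
    training loop for one module; caching the whole set in memory is a
    simplification for illustration."""
    # Each cached sample: (input to the current module, original image, label).
    cached = [(x, x, y) for x, y in loader]
    for module in modules:
        module.to(device)
        train_module_fn(module, cached)          # uses only local supervision
        module.eval()
        with torch.no_grad():                    # re-cache outputs for the next module
            cached = [(module(x_in.to(device)).cpu(), img, y)
                      for x_in, img, y in cached]
    return modules
```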

4.1 Main Results

Figure 3: Comparison of InfoPro with state-of-the-art local learning methods in terms of the test errors at the final layer (left) and the task-relevant information captured by intermediate features, I(h, y) (right). Results of ResNet-32 on CIFAR-10 are reported. We use the contrastive loss in L_InfoPro.

Comparisons with other local learning methods. We first compare the proposed InfoPro method with three recently proposed algorithms, decoupled greedy learning (DGL) (Belilovsky et al., 2020), BoostResNet (Huang et al., 2018a) and deep incremental boosting (DIB) (Mosca and Magoulas, 2017), in Figure 3. Our method yields the lowest test errors for all values of K. Notably, DGL can be viewed as a special case of InfoPro in which the information-preserving (reconstruction) term is removed; hence, for a fair comparison, DGL uses the same auxiliary-network architecture as our method. In addition, we present the estimates of the mutual information between intermediate features and labels in the right plot of Figure 3. One can observe that DGL suffers from a severe collapse of task-relevant information at early modules, since it optimizes local modules greedily for merely short-term performance. By contrast, our method effectively alleviates this problem, retaining a larger amount of task-relevant information within the intermediate features.

Results on various image classification benchmarks are presented in Table 2. We also report the results of DGL (Belilovsky et al., 2020) in our implementation. It can be observed that InfoPro outperforms greedy SL by large margins consistently across different networks, especially when K is large. For example, on CIFAR-10, ResNet-32 + InfoPro (Contrast) achieves a test error of 12.75% with K = 16, surpassing greedy SL by 11.84%. For ResNet-110, our method performs on par with E2E training with K = 2, while degrading the performance by up to 3.40% with K = 16. Moreover, InfoPro compares favorably against DGL under most settings.

Dataset | Network | Method | K = 2 | K = 4 | K = 8 | K = 16
CIFAR-10 | ResNet-32 (E2E: 7.37 ± 0.10%) | Greedy SL | 10.30 ± 0.20% | 16.07 ± 0.46% | 21.19 ± 0.52% | 24.59 ± 0.83%
 | | DGL (Belilovsky et al., 2020) | 8.69 ± 0.12% | 11.48 ± 0.20% | 14.17 ± 0.28% | 16.21 ± 0.36%
 | | InfoPro (Softmax) | 8.13 ± 0.23% | 8.64 ± 0.25% | 11.40 ± 0.18% | 14.23 ± 0.42%
 | | InfoPro (Contrast) | 7.76 ± 0.12% | 8.58 ± 0.17% | 11.13 ± 0.19% | 12.75 ± 0.11%
 | ResNet-110 (E2E: 6.50 ± 0.34%) | Greedy SL | 8.21 ± 0.24% | 13.16 ± 0.28% | 15.61 ± 0.57% | 18.92 ± 1.27%
 | | Greedy SL* | 8.00 ± 0.11% | 12.47 ± 0.17% | 14.58 ± 0.36% | 17.35 ± 0.31%
 | | DGL (Belilovsky et al., 2020) | 7.70 ± 0.28% | 10.50 ± 0.11% | 12.46 ± 0.37% | 13.80 ± 0.15%
 | | InfoPro (Softmax) | 7.01 ± 0.34% | 7.96 ± 0.06% | 9.40 ± 0.27% | 10.78 ± 0.28%
 | | Asy-InfoPro (Contrast) | 7.34 ± 0.11% | 8.39 ± 0.15% | – | –
 | | InfoPro (Contrast) | 6.42 ± 0.08% | 7.30 ± 0.14% | 8.93 ± 0.40% | 9.90 ± 0.19%
 | DenseNet-BC-100-12 (E2E: 4.61 ± 0.08%) | Greedy SL | 5.10 ± 0.05% | 6.07 ± 0.21% | 8.21 ± 0.31% | 10.41 ± 0.42%
 | | DGL (Belilovsky et al., 2020) | 4.86 ± 0.15% | 5.71 ± 0.04% | 6.82 ± 0.21% | 7.67 ± 0.16%
 | | InfoPro (Softmax) | 4.79 ± 0.07% | 5.69 ± 0.21% | 6.44 ± 0.11% | 7.47 ± 0.21%
 | | InfoPro (Contrast) | 4.74 ± 0.04% | 5.24 ± 0.25% | 5.86 ± 0.18% | 6.92 ± 0.16%
SVHN | ResNet-110 (E2E: 3.07 ± 0.23%) | Greedy SL | 3.71 ± 0.16% | 5.39 ± 0.22% | 5.75 ± 0.10% | 6.37 ± 0.42%
 | | DGL (Belilovsky et al., 2020) | 3.61 ± 0.16% | 4.97 ± 0.19% | 5.35 ± 0.13% | 5.55 ± 0.34%
 | | InfoPro (Softmax) | 3.41 ± 0.08% | 3.72 ± 0.03% | 4.67 ± 0.07% | 5.14 ± 0.08%
 | | InfoPro (Contrast) | 3.15 ± 0.03% | 3.28 ± 0.11% | 3.62 ± 0.11% | 3.91 ± 0.16%
STL-10 | ResNet-110 (E2E: 22.27 ± 1.61%) | Greedy SL | 25.56 ± 1.37% | 27.97 ± 0.75% | 29.07 ± 0.76% | 30.38 ± 0.39%
 | | DGL (Belilovsky et al., 2020) | 24.96 ± 1.18% | 26.77 ± 0.64% | 27.33 ± 0.24% | 27.73 ± 0.58%
 | | InfoPro (Softmax) | 21.02 ± 0.51% | 21.28 ± 0.27% | 23.60 ± 0.49% | 26.05 ± 0.71%
 | | InfoPro (Contrast) | 20.99 ± 0.64% | 22.73 ± 0.40% | 25.15 ± 0.52% | 26.27 ± 0.48%
Table 2: Performance of different networks with varying numbers of local modules K. The average test errors and standard deviations over 5 independent trials are reported. InfoPro (Softmax/Contrast) refers to the two approaches to estimating I(h, y). The results of Asy-InfoPro are obtained by asynchronous training, while the others are based on simultaneous training. Greedy SL* adopts deeper networks so as to have the same computational cost as InfoPro.
Methods | Test Error | Memory Cost | Comp. Overhead (Theoretical / Wall Time) | Test Error | Memory Cost | Comp. Overhead (Theoretical / Wall Time)
(columns 2–4: CIFAR-10, batch size = 1024; columns 5–7: STL-10, batch size = 128)
E2E Training | 6.50 ± 0.34% | 9.40 GB | – | 22.27 ± 1.61% | 10.77 GB | –
GC (Chen et al., 2016) | 6.50 ± 0.34% | 3.91 GB | 32.8% / 27.5% | 22.27 ± 1.61% | 4.50 GB | 32.8% / 27.0%
InfoPro* | 6.41 ± 0.13% | 5.38 GB | 1.3% / 1.1% | 20.95 ± 0.57% | 6.15 GB | 1.3% / 1.7%
InfoPro* | 6.74 ± 0.12% | 4.22 GB | 3.3% / 7.5% | 21.00 ± 0.52% | 4.96 GB | 3.3% / 7.0%
InfoPro* | 6.93 ± 0.20% | 3.52 GB | 5.9% / 13.4% | 21.22 ± 0.72% | 4.08 GB | 5.9% / 11.4%
Table 3: Trade-off between GPU memory footprint during training and test errors. Results of training ResNet-110 on a single Nvidia Titan Xp GPU are reported. ‘GC’ refers to gradient checkpointing (Chen et al., 2016). ‘InfoPro*’ denotes splitting the network into local modules with similar memory consumption (see Section 4.1); the three InfoPro* rows correspond to increasing numbers of local modules.
Models | Methods | Batch Size | Top-1 Error | Top-5 Error | Memory Cost (per GPU) | Computational Overhead (Theoretical / Wall Time)
ResNet-101 | E2E Training | 1024 | 22.03% | 5.93% | 19.71 GB | –
ResNet-101 | InfoPro | 1024 | 21.85% | 5.89% | 12.06 GB | 5.7% / 11.7%
ResNet-152 | E2E Training | 1024 | 21.60% | 5.92% | 26.29 GB | –
ResNet-152 | InfoPro | 1024 | 21.45% | 5.84% | 15.53 GB | 3.9% / 8.7%
ResNeXt-101, 32×8d | E2E Training | 512 | 20.64% | 5.40% | 19.22 GB | –
ResNeXt-101, 32×8d | InfoPro | 512 | 20.35% | 5.28% | 11.55 GB | 2.7% / 5.6%
Table 4: Single-crop error rates (%) on the validation set of ImageNet. We use 8 Tesla V100 GPUs for training.
Model | Training Algorithm | Training Iterations | Batch Size | mIoU (SS) | mIoU (MS) | mIoU (MS+Flip) | Memory Cost (per GPU) | Computational Overhead (Theoretical / Wall Time)
DeepLab-V3-R101 (w/ syncBN) | E2E (original) | 40k | 8 | 77.82% | 79.06% | 79.30% | – | –
 | E2E (ours) | 40k | 8 | 79.12% | 79.81% | 80.02% | 19.43 GB | –
 | DGL | 40k | 8 | 78.15% | 79.40% | 79.56% | – | –
 | InfoPro | 40k | 8 | 79.37% | 80.53% | 80.54% | 12.01 GB | 6.4% / 2.2%
 | E2E (ours) | 60k | 8 | 79.32% | 79.95% | 80.07% | 19.43 GB | –
 | InfoPro | 40k | 12 | 79.99% | 81.09% | 81.20% | 16.62 GB | 6.4% / –
 | InfoPro (larger crop size) | 40k | 8 | 80.25% | 81.33% | 81.42% | 17.00 GB | 10.3% / –
Table 5: Results of semantic segmentation on Cityscapes. 2 Nvidia GeForce RTX 3090 GPUs are used for training. ‘SS’ refers to single-scale inference; ‘MS’ and ‘Flip’ denote averaging the predictions over multi-scale ([0.5, 1.75]) and left-right flipped inputs during inference. We also present the results reported by the original paper in the ‘E2E (original)’ row. DGL refers to decoupled greedy learning (Belilovsky et al., 2020).

In addition, given that our method introduces auxiliary networks, we enlarge the network depth for greedy SL to match the computational cost of InfoPro, denoted greedy SL* in Table 2. However, this only slightly improves its performance, since the problem of information collapse remains. Another interesting phenomenon is that InfoPro (Contrast) outperforms InfoPro (Softmax) on CIFAR-10 and SVHN, yet fails to do so on STL-10. We attribute this to the larger batch size used on the former two datasets and to the choice of the temperature τ. A detailed analysis is given in Appendix H.

Asynchronous and parallel training. The results of asynchronous training are presented in Table 2 as “Asy-InfoPro”; it appears to slightly hurt performance. Asy-InfoPro differs from InfoPro in that it adopts the cached outputs of completely trained earlier modules as the inputs of later modules. Therefore, the performance degradation might be ascribed to the lack of the regularizing effect provided by the noisy outputs of earlier modules during training (Löwe et al., 2019). However, Asy-InfoPro is still considerably better than both greedy SL and DGL, approaching E2E training. Besides, we note that asynchronous training can easily be extended to training different local modules in parallel by dynamically caching the outputs of earlier modules. To this end, we preliminarily test training two local modules in parallel on 2 GPUs with K = 2, using the same experimental protocol as Huo et al. (2018b) (training ResNet-110 with a batch size of 128 on CIFAR-10) and their public code. Our method gives a 1.5× speedup over the standard parallel paradigm of E2E training (the DataParallel toolkit in PyTorch). Note that parallel training has the same performance as simultaneous training (i.e., “InfoPro” in Table 2), since their training processes are identical except for the parallelism.

Figure 4: Sensitivity tests. The CIFAR-10 test errors of ResNet-32 trained using InfoPro (Contrast) are reported. We vary λ1 and λ2 for an earlier and a later local module, respectively, with all other modules unchanged. We do not consider the degenerate setting λ1 = λ2 = 0, where the local module obviously receives no supervision.

Reducing GPU memory requirements. Here we split the network into local modules such that each module consumes a similar amount of GPU memory during training; note that this differs from splitting the model into modules with the same number of layers. We denote the results in this setting by InfoPro*, and the trade-off between GPU memory consumption and test errors is presented in Table 3, where we report the minimum GPU memory required to run the training algorithm. The contrastive and softmax losses are used in InfoPro* on CIFAR-10 and STL-10, respectively. One can observe that our method significantly improves the memory efficiency of CNNs. For instance, on STL-10, InfoPro* outperforms the E2E baseline by 1.05% using only 37.9% of its GPU memory. The computational overhead is presented both as theoretical results and as practical wall time; due to implementation issues, we find that the latter is slightly larger than the former for InfoPro*. Compared to the gradient checkpointing technique (Chen et al., 2016), our method achieves competitive performance with significantly reduced computational and time cost.
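
For reference, the gradient-checkpointing baseline can be reproduced with PyTorch's built-in utility, which keeps activations only at segment boundaries and recomputes the rest during the backward pass, trading extra computation for memory; the toy model and segment count below are illustrative.

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint_sequential

# Gradient checkpointing (the GC baseline): only segment-boundary activations
# are stored; the rest are recomputed during backward.
model = nn.Sequential(*[
    nn.Sequential(nn.Conv2d(16, 16, 3, padding=1), nn.ReLU()) for _ in range(20)
])
x = torch.randn(8, 16, 32, 32, requires_grad=True)
out = checkpoint_sequential(model, 4, x)   # 4 checkpointed segments
out.sum().backward()
```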

Results on ImageNet are reported in Table 4. The softmax loss is used in InfoPro since the batch size is relatively small. The proposed method reduces the memory cost by 40%, and achieves slightly better performance. Notably, our method enables training these large networks using 16 GB GPUs.

Results of semantic segmentation on Cityscapes are presented in Table 5. We report the mean Intersection over Union (mIoU) of all classes on the validation set. The softmax loss is used in InfoPro, and the details of the auxiliary networks are presented in Appendix E. Our method boosts the performance of the DeepLab-V3 (Chen et al., 2017) network and allows training the model with 50% larger batch sizes (2 → 3 per GPU) under the same memory constraints. This contributes to more accurate statistics for batch normalization, which is a practical requirement for tasks with high-resolution inputs. In addition, InfoPro enables using larger crop sizes during training without enlarging the GPU memory footprint, which significantly improves the mIoU; note that this does not increase the training or inference cost.

4.2 Hyper-parameter Sensitivity and Ablation Study

The coefficients λ1 and λ2. To study how λ1 and λ2 affect the performance, we vary them for an earlier and a later local module of a ResNet-32 trained using InfoPro (Contrast), with the results shown in Figure 4. We find that the earlier module benefits from weighting the local supervision term less heavily, which helps propagate more information forward, while a larger weight on the supervision term helps the later module boost the final accuracy. This is compatible with previous works showing that removing earlier layers in ResNets has only a minimal impact on performance (Veit et al., 2016).

K = 2 | K = 8
10.30 ± 0.20% | 21.19 ± 0.52%
8.90 ± 0.17% | 15.82 ± 0.34%
8.49 ± 0.16% | 14.13 ± 0.22%
7.76 ± 0.12% | 11.13 ± 0.19%
Table 6: Ablation studies. Test errors of ResNet-32 on CIFAR-10 are reported.

Ablation study. For ablation, we test directly removing the decoder or replacing the contrastive head by the linear classifier used in greedy SL, as shown in Table 6.

5 Related Work

Greedy training of deep networks was first proposed to learn unsupervised deep generative models, or to obtain an appropriate initialization for E2E supervised training (Hinton et al., 2006; Bengio et al., 2007). However, later works revealed that such initialization is largely dispensable once proper network architectures are adopted, e.g., batch normalization layers (Ioffe and Szegedy, 2015), skip connections (He et al., 2016) or dense connections (Huang et al., 2019). Some other works (Kulkarni and Karande, 2017; Malach and Shalev-Shwartz, 2018; Marquez et al., 2018; Huang et al., 2018a) attempt to learn deep models in a layer-wise fashion. For example, BoostResNet (Huang et al., 2018a) trains the residual blocks of a ResNet (He et al., 2016) sequentially with a boosting algorithm, and Deep Cascade Learning (Marquez et al., 2018) extends the cascade-correlation algorithm (Fahlman and Lebiere, 1990) to deep learning, aiming to improve training efficiency. However, these approaches mainly focus on theoretical analysis and are usually validated with limited experimental results on small datasets. More recently, several works have pointed out the inefficiencies of back-propagation and revisited this problem (Nøkland and Eidnes, 2019; Belilovsky et al., 2019, 2020). These works adopt a local learning setting similar to ours, but they mostly optimize local modules with a greedy short-term objective, and hence suffer from the information collapse issue we discuss in this paper. In contrast, our method trains local modules by minimizing the non-greedy InfoPro loss.

Alternatives to back-propagation have been widely studied in recent years. Some biologically motivated algorithms, including target propagation (Lee et al., 2015b; Bartunov et al., 2018) and feedback alignment (Lillicrap et al., 2014; Nøkland, 2016), avoid back-propagation by directly propagating optimal activations or error signals backward with auxiliary networks. Decoupled Neural Interfaces (DNI) (Jaderberg et al., 2017) learn auxiliary networks to produce synthetic gradients. In addition, optimization methods like the Alternating Direction Method of Multipliers (ADMM) split the end-to-end optimization into sub-problems using auxiliary variables (Taylor et al., 2016; Choromanska et al., 2018). Decoupled Parallel Back-propagation (Huo et al., 2018b) and Features Replay (Huo et al., 2018a) update parameters with previous gradients instead of current ones and prove convergence theoretically, enabling training network modules in parallel. Nevertheless, these methods are fundamentally different from ours, as they train local modules by explicitly or implicitly optimizing the global objective, while we consider optimizing only local objectives.

Information-theoretic analysis in deep learning has received increasing attention in the past few years. Shwartz-Ziv and Tishby (2017) and Saxe et al. (2019) study the information bottleneck (IB) principle (Tishby et al., 2000) to explain the training dynamics of deep networks. Achille and Soatto (2018) decompose the cross-entropy loss and propose a novel IB for weights. There are also efforts toward efficient training with IB (Alemi et al., 2016). In the context of unsupervised learning, a number of methods have been proposed based on mutual information maximization (Oord et al., 2018; Tian et al., 2020; Hjelm et al., 2019). SimCLR (Chen et al., 2020) and MoCo (He et al., 2020) propose to maximize the mutual information between different views of the same input with a contrastive loss. This paper likewise analyzes the drawbacks of greedy local supervision and proposes the InfoPro loss from an information-theoretic perspective; in addition, our method can be implemented as the combination of a contrastive term and a reconstruction loss.

6 Conclusion

This work studied locally supervised deep learning from an information-theoretic perspective. We demonstrated that training local modules greedily results in collapsing task-relevant information at earlier layers, degrading the final performance. To address this issue, we proposed an information propagation (InfoPro) loss that encourages local modules to preserve more information about the input while progressively discarding task-irrelevant information. Extensive experiments validated that InfoPro significantly reduces the GPU memory footprint during training without sacrificing accuracy, and that it enables model parallelization in an asynchronous fashion. InfoPro may open new avenues for developing more efficient and biologically plausible deep learning algorithms.

Acknowledgments

This work is supported in part by the National Science and Technology Major Project of the Ministry of Science and Technology of China under Grants 2018AAA0100701, the National Natural Science Foundation of China under Grants 61906106 and 62022048, the Institute for Guo Qiang of Tsinghua University and Beijing Academy of Artificial Intelligence.

References

  • A. Achille and S. Soatto (2018) Emergence of invariance and disentanglement in deep representations. The Journal of Machine Learning Research 19 (1), pp. 1947–1980. Cited by: Appendix C, §3.1, §5.
  • A. A. Alemi, I. Fischer, J. V. Dillon, and K. Murphy (2016) Deep variational information bottleneck. arXiv preprint arXiv:1612.00410. Cited by: §5.
  • S. Bartunov, A. Santoro, B. Richards, L. Marris, G. E. Hinton, and T. Lillicrap (2018) Assessing the scalability of biologically-motivated deep learning algorithms and architectures. In NeurIPS, pp. 9368–9378. Cited by: §5.
  • E. Belilovsky, M. Eickenberg, and E. Oyallon (2019) Greedy layerwise learning can scale to imagenet. In ICML, pp. 583–593. Cited by: §1, §1, §3.1, §5.
  • E. Belilovsky, M. Eickenberg, and E. Oyallon (2020) Decoupled greedy learning of cnns. In ICML, Cited by: Table 14, §1, §1, §1, §3.1, §4.1, §4.1, Table 2, Table 5, §5.
  • Y. Bengio, P. Lamblin, D. Popovici, and H. Larochelle (2007) Greedy layer-wise training of deep networks. In Advances in neural information processing systems, pp. 153–160. Cited by: §1, §5.
  • Y. Bengio, D. Lee, J. Bornschein, T. Mesnard, and Z. Lin (2015) Towards biologically plausible deep learning. arXiv preprint arXiv:1502.04156. Cited by: §1.
  • K. Chen, J. Wang, J. Pang, Y. Cao, Y. Xiong, X. Li, S. Sun, W. Feng, Z. Liu, J. Xu, Z. Zhang, D. Cheng, C. Zhu, T. Cheng, Q. Zhao, B. Li, X. Lu, R. Zhu, Y. Wu, J. Dai, J. Wang, J. Shi, W. Ouyang, C. C. Loy, and D. Lin (2019) MMDetection: open mmlab detection toolbox and benchmark. arXiv preprint arXiv:1906.07155. Cited by: Appendix H.
  • L. Chen, G. Papandreou, F. Schroff, and H. Adam (2017) Rethinking atrous convolution for semantic image segmentation. arXiv preprint arXiv:1706.05587. Cited by: Appendix F, §4.1.
  • T. Chen, B. Xu, C. Zhang, and C. Guestrin (2016) Training deep nets with sublinear memory cost. arXiv preprint arXiv:1604.06174. Cited by: §1, §4.1, Table 3.
  • T. Chen, S. Kornblith, M. Norouzi, and G. Hinton (2020) A simple framework for contrastive learning of visual representations. In ICML, Cited by: Appendix H, §3.3, §5.
  • A. Choromanska, E. Tandon, S. Kumaravel, R. Luss, I. Rish, B. Kingsbury, R. Tejwani, and D. Bouneffouf (2018) Beyond backprop: alternating minimization with co-activation memory. stat 1050, pp. 24. Cited by: §5.
  • A. Coates, A. Ng, and H. Lee (2011) An analysis of single-layer networks in unsupervised feature learning. In AISTATS, pp. 215–223. Cited by: Appendix F, §4.
  • M. Contributors (2020) MMSegmentation: an open source semantic segmentation toolbox. Note: https://github.com/open-mmlab/mmsegmentation Cited by: Appendix F.
  • M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele (2016) The cityscapes dataset for semantic urban scene understanding. In CVPR, pp. 3213–3223. Cited by: Appendix F, §4.
  • F. Crick (1989) The recent excitement about neural networks. Nature 337 (6203), pp. 129–132. Cited by: §1.
  • Y. Dan and M. Poo (2004) Spike timing-dependent plasticity of neural circuits. Neuron 44 (1), pp. 23–30. Cited by: §1.
  • J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei (2009) ImageNet: a large-scale hierarchical image database. In CVPR, pp. 248–255. Cited by: Appendix F, §4.
  • S. E. Fahlman and C. Lebiere (1990) The cascade-correlation learning architecture. In NeurIPS, pp. 524–532. Cited by: §5.
  • A. N. Gomez, M. Ren, R. Urtasun, and R. B. Grosse (2017) The reversible residual network: backpropagation without storing activations. In NeurIPS, pp. 2214–2224. Cited by: §1.
  • K. He, H. Fan, Y. Wu, S. Xie, and R. Girshick (2020) Momentum contrast for unsupervised visual representation learning. In CVPR, pp. 9729–9738. Cited by: Appendix H, §3.3, §5.
  • K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In CVPR, pp. 770–778. Cited by: Appendix F, Appendix F, §1, §2, §4, §5.
  • G. E. Hinton, S. Osindero, and Y. W. Teh (2006) A fast learning algorithm for deep belief nets. Neural Computation 18, pp. 1527–1554. Cited by: §1, §5.
  • R. D. Hjelm, A. Fedorov, S. Lavoie-Marchildon, K. Grewal, P. Bachman, A. Trischler, and Y. Bengio (2019) Learning deep representations by mutual information estimation and maximization. In ICLR, Cited by: Appendix G, §3.3, §5.
  • F. Huang, J. Ash, J. Langford, and R. Schapire (2018a) Learning deep resnet blocks sequentially using boosting theory. In ICML, pp. 2058–2067. Cited by: §4.1, §5.
  • G. Huang, D. Chen, T. Li, F. Wu, L. van der Maaten, and K. Weinberger (2018b) Multi-scale dense networks for resource efficient image classification. In ICLR, External Links: Link Cited by: Appendix F.
  • G. Huang, Z. Liu, G. Pleiss, L. Van Der Maaten, and K. Weinberger (2019) Convolutional networks with dense connectivity. IEEE transactions on pattern analysis and machine intelligence. Cited by: Appendix F, Appendix F, Appendix F, §1, §4, §5.
  • Z. Huo, B. Gu, and H. Huang (2018a) Training neural networks using features replay. In NeurIPS, pp. 6659–6668. Cited by: §5.
  • Z. Huo, B. Gu, Q. Yang, and H. Huang (2018b) Decoupled parallel backpropagation with convergence guarantee. In ICML, Cited by: §4.1, §5.
  • S. Ioffe and C. Szegedy (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167. Cited by: §5.
  • M. Jaderberg, W. M. Czarnecki, S. Osindero, O. Vinyals, A. Graves, D. Silver, and K. Kavukcuoglu (2017) Decoupled neural interfaces using synthetic gradients. In ICML, pp. 1627–1635. Cited by: §5.
  • P. Khosla, P. Teterwak, C. Wang, A. Sarna, Y. Tian, P. Isola, A. Maschinot, C. Liu, and D. Krishnan (2020) Supervised contrastive learning. arXiv preprint arXiv:2004.11362. Cited by: Appendix H, §3.3.
  • D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: Appendix G.
  • D. P. Kingma and M. Welling (2013) Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114. Cited by: Appendix G, §3.3.
  • A. Krizhevsky, G. Hinton, et al. (2009) Learning multiple layers of features from tiny images. Technical report Citeseer. Cited by: Appendix F, §2, §4.
  • A. Krizhevsky, I. Sutskever, and G. E. Hinton (2012) ImageNet classification with deep convolutional neural networks. In NeurIPS, pp. 1097–1105. Cited by: §1.
  • M. Kulkarni and S. Karande (2017) Layer-wise training of deep networks using kernel similarity. arXiv preprint arXiv:1703.07115. Cited by: §5.
  • C. Lee, S. Xie, P. Gallagher, Z. Zhang, and Z. Tu (2015a) Deeply-supervised nets. In AISTATS, pp. 562–570. Cited by: §2.
  • D. Lee, S. Zhang, A. Fischer, and Y. Bengio (2015b) Difference target propagation. In Joint european conference on machine learning and knowledge discovery in databases, pp. 498–515. Cited by: §5.
  • T. P. Lillicrap, D. Cownden, D. B. Tweed, and C. J. Akerman (2014) Random feedback weights support learning in deep neural networks. arXiv preprint arXiv:1411.0247. Cited by: §5.
  • T. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie (2017) Feature pyramid networks for object detection. In CVPR, pp. 2117–2125. Cited by: Appendix H.
  • T. Lin, M. Maire, S. Belongie, L. Bourdev, R. Girshick, J. Hays, P. Perona, D. Ramanan, C. L. Zitnick, and P. Dollár (2014) Microsoft COCO: common objects in context. External Links: Link Cited by: Table 15, Appendix H.
  • S. Löwe, P. O’Connor, and B. Veeling (2019) Putting an end to end-to-end: gradient-isolated learning of representations. In NeurIPS, pp. 3039–3051. Cited by: §1, §4.1.
  • A. Makhzani, J. Shlens, N. Jaitly, I. Goodfellow, and B. Frey (2015) Adversarial autoencoders. arXiv preprint arXiv:1511.05644. Cited by: Appendix G, §3.3.
  • E. Malach and S. Shalev-Shwartz (2018) A provably correct algorithm for deep learning that actually works. arXiv preprint arXiv:1803.09522. Cited by: §5.
  • E. S. Marquez, J. S. Hare, and M. Niranjan (2018) Deep cascade learning. IEEE transactions on neural networks and learning systems 29 (11), pp. 5475–5485. Cited by: §5.
  • A. Mosca and G. D. Magoulas (2017) Deep incremental boosting. arXiv preprint arXiv:1708.03704. Cited by: §4.1.
  • H. Mostafa, V. Ramesh, and G. Cauwenberghs (2018) Deep supervised learning using local errors. Frontiers in neuroscience 12, pp. 608. Cited by: §1.
  • Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Y. Ng (2011) Reading digits in natural images with unsupervised feature learning. Cited by: Appendix F, §4.
  • A. Nøkland and L. H. Eidnes (2019) Training neural networks with local error signals. arXiv preprint arXiv:1901.06656. Cited by: §1, §1, §3.1, §5.
  • A. Nøkland (2016) Direct feedback alignment provides learning in deep neural networks. In NeurIPS, pp. 1037–1045. Cited by: §5.
  • A. v. d. Oord, Y. Li, and O. Vinyals (2018) Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748. Cited by: Appendix D, §5.
  • S. Ren, K. He, R. Girshick, and J. Sun (2015) Faster r-cnn: towards real-time object detection with region proposal networks. In NeurIPS, pp. 91–99. Cited by: Table 15, Appendix H.
  • S. Rifai, Y. Bengio, A. Courville, P. Vincent, and M. Mirza (2012) Disentangling factors of variation for facial expression recognition. In ECCV, pp. 808–822. Cited by: Appendix G, §3.3.
  • A. M. Saxe, Y. Bansal, J. Dapello, M. Advani, A. Kolchinsky, B. D. Tracey, and D. D. Cox (2019) On the information bottleneck theory of deep learning. Journal of Statistical Mechanics: Theory and Experiment 2019 (12), pp. 124020. Cited by: §5.
  • R. Shwartz-Ziv and N. Tishby (2017) Opening the black box of deep neural networks via information. arXiv preprint arXiv:1703.00810. Cited by: Appendix D, §5.
  • K. Simonyan and A. Zisserman (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556. Cited by: Table 14, §1.
  • C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich (2015) Going deeper with convolutions. In CVPR, pp. 1–9. Cited by: §1.
  • A. Tarvainen and H. Valpola (2017) Mean teachers are better role models: weight-averaged consistency targets improve semi-supervised deep learning results. In NeurIPS, pp. 1195–1204. Cited by: Appendix F.
  • G. Taylor, R. Burmeister, Z. Xu, B. Singh, A. Patel, and T. Goldstein (2016) Training neural networks without gradients: a scalable admm approach. In ICML, pp. 2722–2731. Cited by: §5.
  • Y. Tian, C. Sun, B. Poole, D. Krishnan, C. Schmid, and P. Isola (2020) What makes for good views for contrastive learning. arXiv preprint arXiv:2005.10243. Cited by: §5.
  • N. Tishby, F. C. Pereira, and W. Bialek (2000) The information bottleneck method. arXiv preprint physics/0004057. Cited by: §5.
  • A. Veit, M. J. Wilber, and S. Belongie (2016) Residual networks behave like ensembles of relatively shallow networks. In NeurIPS, pp. 550–558. Cited by: §4.2.
  • P. Vincent, H. Larochelle, Y. Bengio, and P. Manzagol (2008) Extracting and composing robust features with denoising autoencoders. In ICML, pp. 1096–1103. Cited by: Appendix G, §3.3.
  • Y. Wang, J. Guo, S. Song, and G. Huang (2020a) Meta-semi: a meta-learning approach for semi-supervised learning. arXiv preprint arXiv:2007.02394. Cited by: Appendix F.
  • Y. Wang, K. Lv, R. Huang, S. Song, L. Yang, and G. Huang (2020b) Glance and focus: a dynamic approach to reducing spatial redundancy in image classification. In NeurIPS, Cited by: Appendix F.
  • Y. Wang, X. Pan, S. Song, H. Zhang, G. Huang, and C. Wu (2019) Implicit semantic data augmentation for deep networks. In NeurIPS, pp. 12635–12644. Cited by: Appendix F.
  • L. Yang, Y. Han, X. Chen, S. Song, J. Dai, and G. Huang (2020) Resolution adaptive networks for efficient inference. In CVPR, pp. 2369–2378. Cited by: Appendix F.

Appendix

Appendix A A Toy Example

Figure 5:

Illustration of the MNIST-STL10 dataset.

To further validate the proposed information collapse hypothesis, we visualize the "information flow" within deep networks using a toy example. First, we construct a MNIST-STL10 dataset by placing MNIST digits at a certain position (randomly selected from 64 candidates) on a background image from STL-10. Three tasks can then be defined on MNIST-STL10, namely classifying the digit, the background and the position of the digit. We refer to the corresponding labels as the digit, background and position labels, respectively, as illustrated in Figure 5.

We train ResNet-32 networks for the three tasks with greedy SL and with end-to-end training. The estimates of the mutual information between intermediate features and the three labels are shown in Figure 6, using the same estimation approach as in Figure 2 (details in Appendix G). Note that when one label (take the digit label for example) is adopted for training, the information related to the other labels (background and position) is task-irrelevant. From the plots, one can clearly observe that end-to-end training retains all task-relevant information throughout the feed-forward process, while greedy SL usually yields less informative intermediate representations with respect to the task of interest. This phenomenon confirms the proposed information collapse hypothesis empirically. In addition, we postulate that the end-to-end-learned early layers avoid collapsing task-relevant information by being allowed to keep a larger amount of task-irrelevant information, which, however, may lead to inferior classification performance of intermediate features, and thus cannot be achieved by greedy SL.

Figure 6: The estimates of the mutual information between the intermediate features and the three labels of MNIST-STL10 (see Figure 5), i.e., the background label (left), the digit label (middle) and the position label (right). Models are trained using greedy SL supervised by one of the three labels, and the results are shown with respect to layer indices. "K = 1" refers to end-to-end training.

Appendix B Proof of proposition 1

Proposition 1.

Suppose that the Markov chain (y, n) → x → h holds. Then an upper bound of L_InfoPro is given by

L̂_InfoPro = − [ λ1 · I(h, x) + λ2 · I(h, y) ],   (6)

where the coefficients λ1 and λ2 are determined by the λ in Eq. (1).

Proof.

Note that L_InfoPro is given by

(7)

Due to the Markov chain (y, n) → x → h, the data processing inequality gives I(h, (y, n)) ≤ I(h, x). Given that

(8)

we have

(9)

By the definition of the nuisance, y and n are mutually independent, and thus we obtain

(10)

Combining Eqs. (9) and (10), we have

(11)

Finally, Proposition 1 is proved by combining Eq. (7) and Inequality (11). ∎

Appendix C Proof of proposition 2

We first introduce a Lemma proven by Achille and Soatto (2018).

Lemma 1.

Given a joint distribution p(x, y), where y is a discrete random variable, we can always find a random variable n independent of y such that x = f(y, n), for some deterministic function f.

Proposition 2.

Given that I(n, y) = 0 and that h is a deterministic function with respect to x, the gap between L̂_InfoPro and L_InfoPro is upper bounded by

(12)
Proof.

Let n be the random variable given by Lemma 1; then, since there exists a deterministic function mapping (y, n) to x, we have

(13)

Assume , in terms that , we obtain

(14)

Since y and n are mutually independent, namely I(y, n) = 0, we have

(15)

When considering h as a deterministic function with respect to x, we obtain H(h | x) = 0, and therefore

(16)

Given that , we have

(17)

which completes the proof of Proposition 2. ∎

Appendix D Why does minimizing the contrastive loss maximize a lower bound of the task-relevant information?

In this section, we show that minimizing the proposed contrastive loss L_contrast, namely

(18)

actually maximizes a lower bound of the task-relevant information I(h, y). We start by considering a simplified but equivalent situation. Suppose that we have a query sample together with a set of N samples, exactly one of which is a positive sample from the same class as the query, while the remaining N − 1 negative samples are randomly drawn. Then the expectation of L_contrast can be written as

(19)

Eq. (19) can be viewed as the categorical cross-entropy loss of recognizing the positive sample correctly. Hence, we consider the optimal prediction for this classification problem, i.e., the true probability of each sample being the positive one. Assuming that the label of the query is y, the positive and negative samples can be viewed as being drawn from the true distributions p(h | y) and p(h), respectively. As a consequence, the optimal prediction can be derived as

(20)

which also gives the optimal value of the expected loss. Therefore, by assuming that y is uniformly sampled from all classes, we have

(21)
(22)
(23)
(24)
(25)
(26)
(27)

In the above, Inequality (24) follows from Oord et al. (2018) and quickly becomes more accurate as N increases. Inequality (27) follows from the data processing inequality (Shwartz-Ziv and Tishby, 2017). Consequently, minimizing L_contrast under the stochastic gradient descent framework maximizes a lower bound of I(h, y).

Appendix E Architecture of auxiliary networks

Here, we introduce the network architectures of the decoder, the auxiliary classifier and the projection head used in our experiments. Note that the decoder aims to reconstruct the input images from deep features, while the auxiliary classifier and the projection head share the same architecture except for the last layer. The architectures used on CIFAR, SVHN and STL-10 are shown in Tables 7 and 8, and those used on ImageNet are shown in Tables 9 and 10. The architecture of the auxiliary head for the semantic segmentation experiments on Cityscapes is shown in Table 11, where we use the same decoder as on ImageNet (except for the size of the feature maps). An empirical study on the size and architecture of the auxiliary nets is presented in Appendix H.

Input: 32×32 / 16×16 / 8×8 feature maps (96×96 / 48×48 / 24×24 on STL-10)
Bilinear interpolation to 32×32 (96×96 on STL-10)
3×3 conv., stride=1, padding=1, output channels=12, BatchNorm + ReLU
3×3 conv., stride=1, padding=1, output channels=3, Sigmoid
Table 7: Architecture of the decoder on CIFAR, SVHN and STL-10.
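
For convenience, a direct PyTorch transcription of Table 7 might look as follows; the class name and constructor arguments are ours, and only the layer configuration comes from the table.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CIFARDecoder(nn.Module):
    """PyTorch transcription of Table 7: upsample the local feature map to the
    input resolution, then two 3x3 convolutions ending in a Sigmoid."""
    def __init__(self, in_channels, image_size=32):   # image_size=96 on STL-10
        super().__init__()
        self.image_size = image_size
        self.conv1 = nn.Conv2d(in_channels, 12, kernel_size=3, stride=1, padding=1)
        self.bn1 = nn.BatchNorm2d(12)
        self.conv2 = nn.Conv2d(12, 3, kernel_size=3, stride=1, padding=1)

    def forward(self, h):
        h = F.interpolate(h, size=self.image_size, mode="bilinear", align_corners=False)
        h = F.relu(self.bn1(self.conv1(h)))
        return torch.sigmoid(self.conv2(h))
```
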
Input: 32×32 / 16×16 / 8×8 feature maps (96×96 / 48×48 / 24×24 on STL-10)
For 32×32 (96×96) input features: 3×3 conv., stride=2, padding=1, output channels=32, BatchNorm + ReLU
For 16×16 (48×48) input features: 3×3 conv., stride=2, padding=1, output channels=64, BatchNorm + ReLU
For 8×8 (24×24) input features: 3×3 conv., stride=1, padding=1, output channels=64, BatchNorm + ReLU
Global average pooling
Fully connected 32/64 → 128, ReLU
Fully connected 128 → 10 (auxiliary classifier) or 128 → 128 (projection head)
Table 8: Architecture of the auxiliary classifier and the projection head on CIFAR, SVHN and STL-10.
Input: 28×28 feature maps
1×1 conv., stride=1, padding=0, output channels=128, BatchNorm + ReLU
Bilinear interpolation to 56×56
3×3 conv., stride=1, padding=1, output channels=32, BatchNorm + ReLU
Bilinear interpolation to 112×112
3×3 conv., stride=1, padding=1, output channels=12, BatchNorm + ReLU
Bilinear interpolation to 224×224
3×3 conv., stride=1, padding=1, output channels=3, Sigmoid
Table 9: Architecture of the decoder on ImageNet.
Input: 28×28 feature maps
1×1 conv., stride=1, padding=0, output channels=128, BatchNorm + ReLU
3×3 conv., stride=2, padding=1, output channels=256, BatchNorm + ReLU
3×3 conv., stride=2, padding=1, output channels=512, BatchNorm + ReLU
1×1 conv., stride=1, padding=0, output channels=2048, BatchNorm + ReLU
Global average pooling
Fully connected 2048 → 1000
Table 10: Architecture of the auxiliary classifier on ImageNet.
Input: 64×128 feature maps, 1024 channels
3×3 conv., stride=1, padding=1, output channels=512, BatchNorm + ReLU
Dropout, p=0.1
1×1 conv., stride=1, padding=0, output channels=19
Table 11: Architecture of the auxiliary head on Cityscapes.

Appendix F Details of Experiments

Datasets. (1) The CIFAR-10 (Krizhevsky et al., 2009) dataset consists of 60,000 32x32 colored images of 10 classes, 50,000 for training and 10,000 for test. We normalize the images with channel means and standard deviations for pre-processing. Then data augmentation is performed by 4x4 random translation followed by random horizontal flip (He et al., 2016; Huang et al., 2019). (2) SVHN (Netzer et al., 2011) consists of 32x32 colored images of digits. 73,257 images for training and 26,032 images for test are provided. Following Tarvainen and Valpola (2017); Wang et al. (2020a), we perform random 2x2 translation to augment the training set. (3) STL-10 (Coates et al., 2011) contains 5,000 training examples divided into 10 predefined folds with 1000 examples each, and 8,000 images for test. We use all the labeled images for training and test the performance on the provided test set. Data augmentation is performed by 4x4 random translation followed by random horizontal flip. (4) ImageNet is a 1,000-class dataset from ILSVRC2012 (Deng et al., 2009), with 1.2 million images for training and 50,000 images for validation. We adopt the same data augmentation and pre-processing configurations as Huang et al. (2019, 2018b); Wang et al. (2019, 2020b); Yang et al. (2020). (5) Cityscapes dataset (Cordts et al., 2016) contains 5,000 1024x2048 pixel-level finely annotated images (2,975/500/1,525 for training, validation and testing) and 20,000 coarsely annotated images from 50