
Blissful Ignorance: Anti-Transfer Learning for Task Invariance

We introduce the novel concept of anti-transfer learning for neural networks. While standard transfer learning assumes that the representations learned in one task will be useful for another task, anti-transfer learning avoids representations that have been learned for a different task when that task is irrelevant and potentially misleading for the new task and should be ignored. Examples of such task pairs are style vs content recognition, or pitch vs timbre recognition from audio. By penalizing similarity between the features of the network being trained and the previously learned features, coincidental correlations between the target and the unrelated task can be avoided, yielding more reliable representations and better performance on the target task. We implemented anti-transfer learning with different similarity metrics and aggregation functions. We evaluate the approach in the audio domain with different tasks and setups, using four datasets in total. The results show that anti-transfer learning consistently improves accuracy in all test cases, indicating that it can push the network to learn more representative features for the task at hand.





1 Introduction

The ability to transfer knowledge between neural networks has proven to be advantageous in various tasks and domains. One successful approach to transferring learned knowledge is to encourage a network to develop internal representations that are similar to those of another network. This can be achieved in different ways. The most common approach is to use a trained network’s weights as a starting point for another network through weight initialization, e.g. retraining or partially retraining a pre-trained network. This can improve the performance of a model in terms of training time and overall accuracy even when the pre-training and the actual training belong to distant tasks and/or domains, and has proven to be particularly useful when data for the target task is scarce [caruana1995learning, bengio2012deep, 41530, yosinski2014transferable].

The idea of anti-transfer is that if a neural network can be used to teach another network what to do, perhaps it can also be used to teach what not to do, performing what we call anti-transfer learning. This approach can be particularly useful with what we call an orthogonal task, i.e., a task that a network has to be invariant to and that can be separately learned. We then penalize the use of features that have been learned for the orthogonal task when training for the target task, thus leading to greater independence of the target predictions from the orthogonal task.

The specific contributions of this work are the following:

  • For the first time, to the best of our knowledge, we introduce the concept of anti-transfer learning to achieve task-invariance between a pre-trained CNN and a new one.

  • We implement anti-transfer with a number of different similarity measures and aggregation functions. Source code is publicly available.

  • We demonstrate the effectiveness of anti-transfer learning by testing it in 2 different audio-related tasks, using 4 different datasets, achieving improvements in all tasks over the baseline and standard transfer learning and obtaining competitive test accuracy.

The remainder of the paper is organized as follows: Section 2 contains a brief review of relevant background literature, Section 3 introduces the concept and implementation of anti-transfer learning, Section 4 presents the experimental results we obtained in several classification tasks on audio data and Section 5 presents the conclusions of this paper.

2 Related Work

This work is inspired by recent developments in the field of selective knowledge transfer among Convolutional Neural Networks (CNNs), both in the visual and audio domain. yosinski2014transferable made an encouraging contribution to this relatively novel research area, demonstrating that features become increasingly task-specific and transferable going through the deep layers of a network. Building on their work, gatys2016image developed a strategy to separate the content and style of an image and to transfer only the style to another image. To do this they use a CNN pre-trained on object recognition and localization as a feature extractor and deep feature losses, comparing the pre-trained features of selected convolution blocks with the corresponding features that a new model is developing during training, therefore transferring selective knowledge. The use of deep feature loss has also been explored in several applications in the audio domain.

pasini2019melgan, with MelGAN, successfully performs audio style transfer using both speech and musical sources. beckmann2019speech use deep feature loss to enhance similarity between the deep representations of two networks, thereby transferring knowledge from one to the other. Deep feature loss has also been used by germain2018speech for speech denoising, obtaining a consistent improvement especially in the most challenging and noisy conditions. sahai2019spectrogram successfully used deep feature loss for an audio source separation task. beckmann2019speech and kegler2019deep apply the same conceptual idea to speech enhancement, language identification, speech, noise and music classification, and speaker identification.

The majority of studies regarding selective knowledge transfer are based on the idea of encouraging a network to develop deep representations similar to those of a pre-trained network in selected layers. Our main idea, however, is to do the opposite, that is, to encourage diversity of deep representations. In fact, our approach is similar, but opposite, to that of beckmann2019speech and kegler2019deep. While they maximize the similarity between the deep representations of a CNN and the same representations of a pre-trained network, our aim is to minimize it. Following a similar idea, wang2019learning apply the concept of Mutual Information Minimization to extract features independent from the domain of the data points. This is similar to our approach, but they aimed at maximizing the feature difference from a measure of pertinence of the feature to a specific domain. Instead, we take into consideration the representations of different networks. yao2004evolving also apply the concept of mutual information minimization, but in the context of ensemble models. Minimizing the mutual information among the deep representations of the models that are part of an ensemble encourages them to learn different aspects of the same data. A different method for encouraging diversity and diminishing redundancy in neural networks was introduced by liu2018learning with the Minimum Hyperspherical Energy (MHE). perez2020improving successfully applied this method to an audio task in a time domain representation. MHE provides advantages in the generalization of networks, but it differs from our approach since we encourage diversity with respect to the feature maps of another model and not among the feature maps of the same model.

3 Method

The main idea of anti-transfer learning is to encourage diversity of the deep representations of a CNN with respect to another CNN with the same architecture but pre-trained on an orthogonal task.

3.1 Approach

We achieve anti-transfer learning through the introduction of an anti-transfer loss term during training, which is specifically a deep feature loss dosovitskiy2016generating. The anti-transfer loss expresses the amount of similarity between the deep representations that the trained network is developing and those of the pre-trained network with the same architecture. By adding this term as a penalty to the loss function we encourage the trained network to develop deep representations that are different from those of the pre-trained network. In other words, we encourage the network being trained to develop feature representations that are good for its main task but different from those developed to solve the orthogonal task in the pre-trained network. This, in turn, encourages the trained network’s invariance to the orthogonal task’s target. For example, one of our test cases concerns speech emotion recognition, which is independent of the words uttered by the speakers, so that the two orthogonal targets are the uttered words and the expressed emotion.

Figure 1: Block diagram of a CNN network with anti-transfer learning applied to a classification task. We use spectrograms of audio signals as the input, but anti-transfer is not specific to the audio domain or spectrogram representations.

Figure 1 depicts a block diagram of a generic CNN with applied anti-transfer learning. As the diagram shows, this architecture has two parallel networks: a pre-trained feature extractor (on the upper side), which is the convolutional part of a pre-trained network with non-trainable weights, and the actual CNN classifier that is being trained (on the lower side).

3.2 Implementation

Our implementation is based on the VGG16 architecture simonyan2014very, a deep Convolutional Neural Network. We select this network since it has proven effective for computing a deep feature loss in the audio domain beckmann2019speech. Nevertheless, the same concept and implementation could be translated to any other CNN design.

The input data, a spectrogram in our test cases, is forward propagated in parallel through both networks. The feature maps of the convolution layer in both networks are extracted and the channel-wise Gram matrix is computed for each network, similarly to the approach used by gatys2016image to compute the style matrix $G$ of an image. The Gram matrix is computed as the inner products between the vectorized feature maps $F$ for each pair of channels:

$$G_{ij} = \sum_{k} F_{ik} F_{jk}$$

where $i, j$ are the channel numbers and $k$ runs over the positions within a vectorized feature map. The Gram matrix correlates the information of all channels, consequently reducing the dimensionality of a data-point from 3 ($C$, $H$, $W$) to 2 dimensions ($C$, $C$), where $C$ is the number of channels and $H$, $W$ are the height and width of a feature map. We then calculate the anti-transfer loss $L_{AT}$ as the squared cosine similarity of the vectorized Gram matrices $G^{P}$ for the pre-trained net and $G^{T}$ for the net being trained:

$$L_{AT} = \left( \frac{\mathrm{vec}(G^{P}) \cdot \mathrm{vec}(G^{T})}{\lVert \mathrm{vec}(G^{P}) \rVert \, \lVert \mathrm{vec}(G^{T}) \rVert} \right)^{2}$$


The Gram product serves to compare all possible channel combinations at once, using a limited amount of memory. This is essential for consistently measuring the similarity of the feature maps, where permutations can occur in the channel dimension. We choose the squared cosine similarity since it is naturally limited in the 0-1 interval and therefore it can have only a limited impact in the overall loss function. Moreover, we square it to apply a stronger penalty when the similarity is high.
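The computation described above can be sketched in a few lines. This is a minimal NumPy illustration of the forward computation only, not the authors' released code: the function names `gram_matrix` and `anti_transfer_loss`, the `eps` stabilizer and the toy feature-map shapes are assumptions made here, and an actual training setup would use an automatic-differentiation framework so that the loss can be back-propagated.

```python
import numpy as np

def gram_matrix(features):
    """Channel-wise Gram matrix of one feature map of shape (C, H, W)."""
    channels = features.shape[0]
    flat = features.reshape(channels, -1)   # vectorize each channel: (C, H*W)
    return flat @ flat.T                    # inner product per channel pair: (C, C)

def anti_transfer_loss(feat_pretrained, feat_trained, eps=1e-8):
    """Squared cosine similarity between the two vectorized Gram matrices.

    eps is an assumed numerical safeguard against division by zero.
    """
    g_p = gram_matrix(feat_pretrained).ravel()
    g_t = gram_matrix(feat_trained).ravel()
    cos = (g_p @ g_t) / (np.linalg.norm(g_p) * np.linalg.norm(g_t) + eps)
    return cos ** 2

# Hypothetical feature maps: 8 channels of size 5 x 6
rng = np.random.default_rng(0)
f_p = rng.random((8, 5, 6))
f_t = rng.random((8, 5, 6))

loss_same = anti_transfer_loss(f_p, f_p)          # identical features: maximal penalty
loss_diff = anti_transfer_loss(f_p, f_t)          # always within [0, 1]
loss_scaled = anti_transfer_loss(f_p, 3.0 * f_t)  # cosine similarity ignores scale
```

The Gram matrices have shape (C, C) regardless of the spatial size of the feature maps, which is why the comparison stays compact even for large inputs.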

Even though the diagram shows the loss calculated on the last convolution layer, it is possible to apply the same method to any of the convolution layers. The loss is added to the standard loss function (cross entropy in our test cases, which are based on classification tasks) in the trained network. As presented in Section 4, we test several other aggregation strategies and distance measures and we test the use of different convolution layers in the network to compute the loss.
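Written out, the training objective described in this paragraph can be sketched as follows. The symbols $L_{total}$, $L_{CE}$ (cross entropy) and $L_{AT}$ (anti-transfer loss) are notation assumed here for illustration; the text does not mention a weighting factor between the two terms, so none is shown.

```latex
L_{total} = L_{CE} + L_{AT}, \qquad 0 \le L_{AT} \le 1
\quad\Longrightarrow\quad
L_{CE} \le L_{total} \le L_{CE} + 1
```

Because the squared cosine similarity is bounded, the penalty can shift the total loss by at most one unit, which is consistent with the limited impact on the overall loss function noted above.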

4 Evaluation

We test anti-transfer learning on several audio classification tasks with 8 combinations of training and pre-training tasks in order to evaluate the behavior of our method in a variety of set-ups. We have two discrete main classification tasks: speech emotion recognition (SER) and sound goodness estimation (SGE), i.e. how well musical notes are played by professional musicians romani_picas_oriol_2017_820937. Each main task is evaluated with two dataset split types: a random training-validation-test split, and a speaker-wise split for SER or an instrument-wise split for SGE. The latter split types provide a more challenging task than the random split, since it is more difficult for the networks to generalize to unseen speakers or instruments.
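The difference between the two split types can be illustrated with a short sketch. Everything here is hypothetical: the toy clip list, the speaker identifiers and the choice of which speakers are held out are assumptions for illustration, since the text does not state how groups were assigned to partitions.

```python
import random

def random_split(items, fractions=(0.7, 0.2, 0.1), seed=0):
    """Shuffle all items and cut them into train/validation/test parts."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n_train = round(fractions[0] * len(items))
    n_val = round(fractions[1] * len(items))
    return items[:n_train], items[n_train:n_train + n_val], items[n_train + n_val:]

def group_wise_split(items, group_of, val_groups, test_groups):
    """Assign every group (speaker or instrument) to exactly one partition."""
    train, val, test = [], [], []
    for item in items:
        group = group_of(item)
        if group in test_groups:
            test.append(item)
        elif group in val_groups:
            val.append(item)
        else:
            train.append(item)
    return train, val, test

# Hypothetical toy data: (clip_id, speaker_id) with 5 speakers
clips = [(i, f"spk{i % 5}") for i in range(50)]

rand_train, rand_val, rand_test = random_split(clips)
grp_train, grp_val, grp_test = group_wise_split(
    clips, group_of=lambda clip: clip[1],
    val_groups={"spk3"}, test_groups={"spk4"})

def speakers(part):
    return {speaker for _, speaker in part}
```

With the random split, the same speaker can appear in every partition; with the group-wise split, the test speakers are never seen during training, which is what makes that setting harder.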

For each main task we also pre-train on two different tasks. For SER we pre-train on speaker recognition with the training dataset and on speech recognition (word detection) with a larger dataset. For SGE we pre-train on instrument recognition with the training dataset and with a larger dataset. We use four different datasets, where we extracted subsets as follows to enable efficient training and adjust class imbalances:

  1. Librispeech: An ASR corpus based on public domain audio books Panayotov2015LibrispeechAA. Task: Automatic Speech Recognition. 100 hours of audio, 40 speakers, 1000 single-word labels. One-word excerpts from audio book recordings.

  2. IEMOCAP: The Interactive Emotional Dyadic Motion Capture Database busso2008iemocap. Tasks: Speech Emotion Recognition, Speaker Recognition.  7:30 hours of audio, 5 speakers, 4 emotion labels: neutral, angry, happy, sad. Actors perform semi-improvised or scripted scenarios on defined topics.

  3. Nsynth: A large-scale, high-quality dataset of annotated musical notes nsynth2017. Task: Instrument Recognition. 66 hours of audio. 11 different instrument macro-categories. One-note recordings of musical instruments.

  4. Good-Sounds: A dataset to explore the quality of instrumental sounds romani_picas_oriol_2017_820937. Tasks: Instrument Goodness Recognition, Instrument Recognition.  14 hours of audio. 12 different instruments, 5 different goodness rates. One-note recordings of acoustic musical instruments, played by professional musicians.

The above descriptions refer to the subsets we extracted, not to the original size and arrangement of these datasets. Please refer to the references above for the original specifications.

Our experiments are set up to test the effectiveness of anti-transfer learning, comparing it to regular transfer learning with weight initialization and to a baseline without any kind of knowledge transfer. We took particular care to perform all experiments (both trainings and pre-trainings) under the same conditions, in order to isolate the influence of anti-transfer and weight initialization on the quality of the networks’ outputs. All experiments are performed in a Python and PyTorch environment, using the VGG16 network architecture simonyan2014very (in the standard implementation from the torchvision library).

We apply the same 3-stage pre-processing to all datasets. We first re-sample all audio data to a 16 kHz sampling rate. Then we zero-pad or segment all sounds in order to feed data-points of the same length for each task. In the Speech Emotion Recognition task we used 4-second sounds (for Librispeech we first extract segments containing only one word and then zero-pad them to 4 seconds, while for IEMOCAP we extract 4-second fragments with 50% overlap). In the Instrument Goodness Recognition task we used 6-second sounds (zero-padding both Nsynth and Good-Sounds sounds to match this length). After this stage, we compute the Short-Time Fourier Transform (STFT) using 16 ms sliding windows with 50% overlap, applying a Hamming window and discarding the phase information. Finally, we normalize the magnitude spectra of each dataset to zero mean and unit standard deviation, based on the training set’s mean and standard deviation.

We perform all training with the same parameters: a learning rate of 0.0005, a batch size of 13 and the ADAM optimizer [kingma2014adam]. We apply dropout at 50% but no $L_1$ or $L_2$ regularization. We train for a maximum of 50 epochs, applying early stopping based on the validation loss with a patience of 5 epochs. We divide every dataset into approximately 70% of the data for training, 20% for validation and 10% for the test set.
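The pre-processing steps described above can be sketched roughly as follows. This is an illustrative NumPy version under stated assumptions, not the authors' pipeline: re-sampling is assumed to have happened already, frames are taken without padding at the signal edges, normalization uses scalar statistics, and all function names are hypothetical.

```python
import numpy as np

SR = 16000               # sampling rate in Hz, as stated in the text
WIN = SR * 16 // 1000    # 16 ms window -> 256 samples
HOP = WIN // 2           # 50% overlap -> 128 samples

def pad_or_trim(signal, seconds):
    """Zero-pad or cut a 1-D signal to a fixed duration."""
    target = seconds * SR
    out = np.zeros(target, dtype=np.float64)
    n = min(len(signal), target)
    out[:n] = signal[:n]
    return out

def segment(signal, seconds, overlap=0.5):
    """Cut a long signal into fixed-length fragments with the given overlap."""
    length = seconds * SR
    step = int(length * (1 - overlap))
    last_start = max(len(signal) - length, 0)
    return [pad_or_trim(signal[s:s + length], seconds)
            for s in range(0, last_start + 1, step)]

def magnitude_stft(signal):
    """Magnitude spectrogram with a Hamming window; the phase is discarded."""
    window = np.hamming(WIN)
    n_frames = 1 + (len(signal) - WIN) // HOP
    frames = np.stack([signal[i * HOP:i * HOP + WIN] * window
                       for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1)).T   # (frequency bins, frames)

def normalize(spec, train_mean, train_std):
    """Standardize with statistics computed on the training set only."""
    return (spec - train_mean) / train_std

# Hypothetical signals standing in for real recordings
rng = np.random.default_rng(0)
short_clip = rng.standard_normal(3 * SR)     # 3 s, zero-padded to 4 s
long_clip = rng.standard_normal(10 * SR)     # 10 s, cut into 4 s fragments

spec = magnitude_stft(pad_or_trim(short_clip, 4))
fragments = segment(long_clip, 4)
norm = normalize(spec, spec.mean(), spec.std())
```

At 16 kHz a 16 ms window corresponds to 256 samples, so each spectrogram has 129 frequency bins; a 4-second input yields 499 frames under the no-padding assumption used here.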

| Dataset | Task | Labels | Hours | Train acc | Test acc |
|---|---|---|---|---|---|
| Librispeech | Speech recognition | 1000 | 100 | 97.60 | 91.85 |
| IEMOCAP | Speaker recognition | 5 | 7.30 | 99.88 | 96.50 |
| Good-Sounds | Instrument recognition | 12 | 14 | 100.00 | 100.00 |
| Nsynth | Instrument category recognition | 11 | 66 | 98.10 | 69.90 |
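Table 1: Accuracy results for the pre-training datasets. Labels is the number of classification labels. Hours refers to the amount of recorded material present in the subset that we used.

| Transfer type | Pre-training task | Pre-training dataset | Train (random split) | Train (speaker-wise split) | Test (random split) | Test (speaker-wise split) |
|---|---|---|---|---|---|---|
| None (Baseline) | n/a | n/a | 69.0 | 67.8 | 63.7 | 57.2 |
| Weight Initialization | Speech recog. | Librispeech | 66.9 | 66.9 | 63.4 | 59.2 |
| Weight Initialization | Speaker recog. | IEMOCAP | 70.7 | 66.9 | 64.8 | 58.5 |
| Anti-Transfer | Speech recog. | Librispeech | 72.0 | 68.6 | **66.9** | 61.1 |
| Anti-Transfer | Speaker recog. | IEMOCAP | **75.5** | **74.5** | 66.5 | **61.3** |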
Table 2: Accuracy results for Emotion Recognition on the IEMOCAP dataset. Comparison between no transfer learning, weight initialization and anti-transfer with pre-training on different datasets. The best results per column are highlighted in bold font.

| Transfer type | Pre-training task | Pre-training dataset | Train (random split) | Train (instrument-wise split) | Test (random split) | Test (instrument-wise split) |
|---|---|---|---|---|---|---|
| None (Baseline) | n/a | n/a | 91.8 | 42.2 | 83.8 | 22.8 |
| Weight Initialization | Instrument recog. | Nsynth | 93.4 | 40.5 | 84.7 | 29.6 |
| Weight Initialization | Instrument recog. | Good-Sounds | 93.3 | **42.3** | 84.9 | 23.9 |
| Anti-Transfer | Instrument recog. | Nsynth | **96.8** | 41.0 | **86.3** | 30.0 |
| Anti-Transfer | Instrument recog. | Good-Sounds | 93.9 | 36.4 | 85.7 | **34.3** |

Table 3: Accuracy results for Sound Goodness Recognition. Training on the Good-Sounds dataset. Comparison between no transfer learning, weight initialization and anti-transfer. The best results per column are highlighted in bold font.

Table 1 shows the accuracy obtained in the pre-training of the feature extractors. Table 2 shows the results of the emotion recognition task and Table 3 shows the results of the sound goodness estimation task. These tables contain all training and test accuracy results we obtained without transfer learning, with weight initialization and with anti-transfer in the 8 combinations listed above. Of note, we apply weight initialization to all convolution layers at once, while we apply anti-transfer to one single convolution layer per experiment. The reported anti-transfer test accuracy results reflect the choice of layer that obtained the best validation accuracy.

The results show that anti-transfer improves the test accuracy in all cases and improves the training accuracy in all cases but one (sound goodness estimation with instrument-wise split dataset), compared to both the baseline and weight initialization. The maximum improvement in test accuracy is 11.5 percentage points (pp) (for sound goodness estimation with instrument-wise split dataset) and the maximum improvement in training accuracy is 6.7 pp (for speech emotion recognition with speaker-wise split dataset). The overall average improvement is 5.25 pp for the test accuracy and 2.37 pp for the training accuracy.
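The two maximum improvements quoted above can be re-derived from the table values with a few lines of arithmetic. The numbers below are copied from Tables 2 and 3; the column order (train random, train speaker- or instrument-wise, test random, test speaker- or instrument-wise) is an assumption inferred from the improvements reported in the text, and the variable names are illustrative.

```python
# Accuracy values copied from Tables 2 and 3.
# Assumed column order: (train random, train group-wise, test random, test group-wise)
baseline = {
    "SER": (69.0, 67.8, 63.7, 57.2),
    "SGE": (91.8, 42.2, 83.8, 22.8),
}
anti_transfer = {
    ("SER", "Librispeech"): (72.0, 68.6, 66.9, 61.1),
    ("SER", "IEMOCAP"): (75.5, 74.5, 66.5, 61.3),
    ("SGE", "Nsynth"): (96.8, 41.0, 86.3, 30.0),
    ("SGE", "Good-Sounds"): (93.9, 36.4, 85.7, 34.3),
}

# Gain of anti-transfer over the baseline, in percentage points
gains = {
    key: tuple(round(acc - base, 1) for acc, base in zip(accs, baseline[key[0]]))
    for key, accs in anti_transfer.items()
}

max_train_gain = max(max(g[0], g[1]) for g in gains.values())
max_test_gain = max(max(g[2], g[3]) for g in gains.values())
print(max_train_gain, max_test_gain)
```

Under this reading of the columns, the largest training gain comes from SER with speaker-wise split and the largest test gain from SGE with instrument-wise split, matching the two cases named in the text.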

Figure 2: Average improvement by applying anti-transfer learning on different applications and different settings compared to the baseline (standard training).

Figure 2 shows the average gain achieved by anti-transfer learning in the test accuracy for different tasks and settings. It is particularly interesting that the improvement in the networks’ generalization is higher when anti-transfer is applied with a feature extractor trained on the same dataset, but on an orthogonal task (i.e., IEMOCAP pre-trained on speaker recognition vs IEMOCAP trained on speech emotion recognition, and Good-Sounds pre-trained on instrument recognition vs Good-Sounds trained on sound goodness estimation). This makes anti-transfer a well-suited approach to exploit datasets provided with multiple labels.

Figure 3: Mean per-layer improvement on the IEMOCAP dataset. The improvement is measured against the baseline with neither anti-transfer nor weight initialization.

There is no overall best choice for the layer used in the anti-transfer loss. Figure 3 shows the average per-layer improvement in both train and test accuracy that we obtained in the speech emotion recognition task only. It seems that layers 5, 7 and 13 achieve the best results when used to compute the anti-transfer loss in this case. Moreover, in both training and test, layer 9 produces the lowest improvement and it is the only one that leads to a slight decrease in training accuracy. However, most other layers also lead to improvements and the situation may vary when using different architectures or datasets. Interestingly, these results contradict our intuitive expectation that the last layers are more specific to the task and would lead to the highest performance increase.

Figure 4 separately depicts the trends of the cross entropy loss and the anti-transfer loss during one training run on speaker-wise split IEMOCAP, with pre-training on Librispeech and anti-transfer computed on the 5th convolution block. Here it is evident that the network being trained is actually learning to differentiate its deep representations from the pre-trained ones. Moreover, as we expected, the anti-transfer loss is low from the first epoch because the randomly initialized feature maps start out largely uncorrelated with those of the pre-trained network. Furthermore, the relatively low magnitude of the anti-transfer loss with respect to the cross entropy loss indicates that anti-transfer plays a preventive role in the training, keeping the deep representations uncorrelated.

Figure 4: Evolution of the train and validation loss and train and validation anti-transfer loss during the training. This example refers to a training on IEMOCAP with speaker-wise splitting, pre-training on Librispeech and with anti-transfer applied to the 5th convolution block.

| Aggregation | Sigmoid MSE: Train | Sigmoid MSE: Val | Sigmoid MSE: Test | Squared Cos Distance: Train | Squared Cos Distance: Val | Squared Cos Distance: Test |
|---|---|---|---|---|---|---|
| Mean | 68.2 | 68.3 | 63.9 | 68.7 | 66.1 | 60.0 |
| Sum | 68.4 | 68.2 | 63.8 | 69.5 | 66.0 | 60.0 |
| Comp Mul | 71.0 | 67.5 | 63.9 | 70.4 | 67.0 | 63.1 |
| Max | 68.3 | 66.3 | 65.0 | **76.7** | 66.2 | 66.7 |
| Gram Product | 76.3 | 65.9 | 65.8 | 72.5 | **68.7** | **66.9** |
Table 4: Accuracy results for different channel aggregation methods and different distance functions. All results are computed with pre-training on Librispeech and training on IEMOCAP. The best training, validation and test accuracy results overall are highlighted in bold font.

Table 4 shows the results of the grid search performed to select the best channel aggregation and multidimensional distance function to compute the anti-transfer loss. All aggregation types refer to a function applied pixel-by-pixel along the channel dimension. Comp Mul stands for compressed multiplication (feature map values raised to the power of 0.001 and then multiplied along the channel dimension). Sigmoid MSE refers to standard Mean Squared Error multiplied by a sigmoid function to avoid exponential growth of the loss during training. We also tried several approaches to compute the multidimensional distance one-by-one for all possible channel combinations, but all of them were too expensive in computation or memory. We chose to aggregate the channel information using the Gram Product and to compute the matrix distance with squared cosine similarity since this combination gives the best validation and test accuracy results.
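The pixel-by-pixel aggregation functions compared in Table 4 can be sketched as follows. This is an illustrative NumPy version, not the authors' code: the function name `aggregate`, the method labels and the toy input are assumptions, and the compressed multiplication assumes non-negative feature values (for example after a ReLU). The Sigmoid MSE distance is not sketched because the text does not specify exactly how the sigmoid is applied.

```python
import numpy as np

def aggregate(features, method):
    """Collapse a (C, H, W) feature map to (H, W) along the channel axis."""
    if method == "mean":
        return features.mean(axis=0)
    if method == "sum":
        return features.sum(axis=0)
    if method == "max":
        return features.max(axis=0)
    if method == "comp_mul":
        # compress values with a small exponent, then multiply over channels
        # (assumes non-negative inputs; negative values would produce NaN)
        return np.prod(features ** 0.001, axis=0)
    raise ValueError(f"unknown aggregation: {method}")

# Hypothetical non-negative feature map: 4 channels of size 3 x 3
rng = np.random.default_rng(0)
feats = rng.random((4, 3, 3)) + 0.1

results = {m: aggregate(feats, m) for m in ("mean", "sum", "max", "comp_mul")}
```

All four functions discard the channel dimension at each pixel, whereas the Gram Product keeps pairwise channel information in a (C, C) matrix.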

The improved accuracy comes at the cost of increased computation and memory demands. Regarding computation time, training a network with the anti-transfer loss takes on average 2.8 times longer than training the same network without it. Moreover, learning with anti-transfer loss requires more memory than a standard CNN, since the trained network, the pre-trained feature extractor and the Gram matrices, whose size depends on the chosen architecture, must all fit in memory.
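As a rough, back-of-the-envelope illustration of how the Gram matrix size depends on the architecture, the sketch below counts the bytes needed to hold one C x C Gram matrix per sample for both networks. It assumes 32-bit floats and the batch size of 13 mentioned earlier, and it ignores the additional buffers a framework keeps for back-propagation; the function name `gram_bytes` is hypothetical.

```python
def gram_bytes(channels, batch_size=1, bytes_per_value=4, networks=2):
    """Bytes needed to store the C x C Gram matrices of both networks."""
    return channels * channels * bytes_per_value * batch_size * networks

# VGG16 convolution layers have 64, 128, 256 or 512 output channels.
for channels in (64, 128, 256, 512):
    mib = gram_bytes(channels, batch_size=13) / 2**20
    print(f"{channels:4d} channels: {mib:.2f} MiB")
```

The cost grows quadratically with the number of channels, so the deepest layers dominate the Gram matrix memory.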

5 Conclusions

In this paper we introduced anti-transfer learning, a method for instilling task invariance in a neural network. By using a pre-trained network and encouraging the network being trained to find representations that are dissimilar to the pre-trained ones, we achieve improved generalisation on two audio tasks in several configurations. The results are positive on all test sets, showing a significant improvement and leading to results that are close to the state of the art.

The positive results justify further investigation of this approach. We intend to investigate more datasets from other domains, to gain a better understanding of the role of the different layers, and to explore the possible use of multiple layers in the anti-transfer loss. Another relevant research goal is to reduce the computation and memory requirements of anti-transfer learning.

Broader Impact

The method of anti-transfer that is proposed and evaluated in this paper enables the transfer of knowledge between trained networks on tasks that are orthogonal, i.e. where the target of one task is independent of the other. The ability to make use of trained networks in this way enables broader use and re-use of training data. It will be particularly useful in areas with limited availability of training data, e.g. emotion recognition, low-resource languages, or rare diseases in medicine. This may make it possible to develop new applications or improve existing ones in these domains.