Multi-task learning is a popular sub-field of machine learning. In the conventional setting, different tasks share the same representation and are learned simultaneously. Each task contributes to the total learning loss, and a good balance among tasks can yield strong predictive capability. Multi-task learning models are generally believed to be computationally efficient and to obtain representations that generalize better than those learned one task at a time. Broadly speaking, two groups of approaches are widely used to improve multi-task learning performance: the first focuses on improving the shared latent representation, while the second aims to find the optimal relative weights for the sub-learning tasks Ruder (2017); Zhang and Yang (2017). The model proposed in our research addresses both issues.
Many multi-task learning models use deterministic encoding techniques to obtain latent representations. In this study, we instead use a variational encoding method to improve the latent representation. Specifically, we adopt the information bottleneck method Tishby et al. (2000) to obtain the latent codes. The information bottleneck method can be implemented via variational inference Blei et al. (2017); Zhang et al. (2018), yielding the variational information bottleneck (VIB) Alemi et al. (2017). Variational inference has been widely used in deep learning, for example in variational autoencoders (VAEs) and their variants Kingma et al. (2015); Rezende et al. (2014); Higgins et al. (2017); Burgess et al. (2018); Maaløe et al. (2016), Bayesian neural networks (BNNs) Blundell et al. (2015), variational dropout Kingma et al. (2015); Molchanov et al. (2017), and deep variational priors Atanov et al. (2018). More recently, variational inference has also been used for representation learning, for instance in mutual information estimation and maximization Tschannen et al. (2018).
In terms of optimal task weights, grid search (GS) is a simple way to find the trade-off among tasks: it evaluates a set of constant weight combinations to find appropriate relative weights for the different losses. However, this method is not computationally efficient Lin et al. (2019). More recently, gradient normalization has been used to balance the losses in multi-task deep neural networks Chen et al. (2018). Alternatively, from the perspective of optimization, if the learning tasks are regarded as different objectives, multi-task learning can be formulated as a multi-objective optimization problem Sener and Koltun (2018); Lin et al. (2019). One can also exploit the homoscedastic uncertainties of the losses to compute the optimal weight for each learning goal Kendall et al. (2018). Since our approach is probabilistic, we naturally adopt this uncertainty-based method, using the likelihoods of the predictions to weigh the different losses; we also find that it helps stabilize the training process in practice. To extend the VIB to multi-task learning, we leverage the predictive likelihoods of the VIB decoder to calculate the task-related weights for the different losses. On this basis, we propose the multi-task variational information bottleneck (MTVIB).
Our research has two methodological contributions. First, the MTVIB adopts the VIB structure to obtain the latent representations of the input data. Compared to deterministic latent representations, the variational latent representations are regularized and thereby expected to be more robust to noise, for instance, under adversarial attacks. Second, the MTVIB uses the task-related uncertainties to assign the relative weight of each task loss. This not only helps to find a good trade-off among different tasks but also keeps the multi-task training process stable.
2 Model Setup
Our model assumptions are based on a typical information Markov chain, where the information associated with the input data can be represented by a latent distribution. Let $x$ denote the input data, $z$ the latent representation, and $y$ the output. The information Markov chain is then $y \leftrightarrow x \leftrightarrow z$, which can be presented in an encoder-decoder structure. A similar idea was proposed by Achille and Soatto (2018), where the information is stored in the neural network weights.
Assumption 1. There exists a statistic of the input $x$ that is sufficient to learn the posterior probability of the latent representation $z$, i.e., $p(z \mid x, y) = p(z \mid x)$.
This assumption was proposed by Alemi et al. (2017). It suggests that the input data contains the information needed to compute the latent distribution. The latent distribution in our model will be learned by an encoder network. Different from the deterministic codes computed by vanilla autoencoders Hinton and Zemel (1994), we adopt the VIB method, so the learned latent codes can be seen as a disentangled representation. In addition, in an ideal situation, the latent representation should be not only sufficient but also minimal; consequently, only the task-related information will be retained.
Assumption 2. The learned representation $z$ is sufficient to learn the likelihood of the output $y$, i.e., $p(y \mid x, z) = p(y \mid z)$.
This assumption was made by Achille and Soatto (2018). It indicates that the sufficiency of the latent representation is ensured by the decoder network. This can easily be evaluated through the loss function, e.g., the cross-entropy for classification problems. In short, under Assumptions 1 and 2 we expect our model to compress the input maximally while expressing the output as much as possible.
Assumption 3. If there are multiple learning tasks (e.g., $K$ tasks with outputs $y_1, \dots, y_K$), they are conditionally independent given the shared representation $z$, i.e., $p(y_1, \dots, y_K \mid z) = \prod_{k=1}^{K} p(y_k \mid z)$.
This assumption has also been used in Kendall et al. (2018). It allows us to divide the output into different sub-outputs, which can be represented by different decoders; the factorized probabilities can then be used to weigh the losses.
3 Latent Representation
We seek a minimal sufficient representation of the input features. For supervised learning, according to the information bottleneck theory Tishby et al. (2000), the following optimization problem can be formulated:
$$\max \; I(Z, Y) \quad \text{s.t.} \quad I(Z, X) \le I_c,$$
where $I(\cdot\,, \cdot)$ denotes the mutual information between two variables and $I_c$ is the information constraint.
To solve this optimization problem, the Karush-Kuhn-Tucker (KKT) conditions can be applied, and the corresponding Lagrangian yields
$$\mathcal{L} = I(Z, Y) - \beta \left( I(Z, X) - I_c \right),$$
where $\beta \ge 0$ is a Lagrangian multiplier and the term $\beta I_c$ can be ignored as it is a constant.
Directly computing the mutual information in Eq. (3) is intractable; instead, we can use other techniques to approximate it. Based on Assumptions 1-2, the following variational lower bound of the information bottleneck can be obtained:
$$I(Z, Y) - \beta I(Z, X) \ge \int p(x)\, p(z \mid x)\, p(y \mid x) \log q(y \mid z) \, dx \, dy \, dz - \beta \int p(x)\, p(z \mid x) \log \frac{p(z \mid x)}{r(z)} \, dx \, dz,$$
where $q(y \mid z)$ is a variational approximation of $p(y \mid z)$ and $r(z)$ is an uninformative prior distribution. The detailed derivation of this bound is provided in the supplementary material.
Since $p(x, y, z) = p(z \mid x)\, p(x, y)$, we can approximate the bound by first drawing pairs $(x_n, y_n)$ from the dataset and then sampling $z$ from $p(z \mid x_n)$. The lower bound then becomes
$$\frac{1}{N} \sum_{n=1}^{N} \left[ \int p(z \mid x_n) \log q(y_n \mid z) \, dz - \beta \int p(z \mid x_n) \log \frac{p(z \mid x_n)}{r(z)} \, dz \right].$$
Similar to $\beta$-VAEs and the VIB, we adopt an encoder-decoder structure. The encoder is parameterized by $\phi$ and the latent variable can be sampled via $z \sim p_\phi(z \mid x)$. The decoder is parameterized by $\theta$. Maximizing the lower bound in Eq. (5) is equivalent to minimizing the following loss function:
$$\mathcal{L}(\phi, \theta) = \frac{1}{N} \sum_{n=1}^{N} \left\{ \mathbb{E}_{z \sim p_\phi(z \mid x_n)} \left[ -\log q_\theta(y_n \mid z) \right] + \beta \, \mathrm{KL}\!\left( p_\phi(z \mid x_n) \,\|\, r(z) \right) \right\}.$$
Eq. (6) is the VIB structure used to acquire the stochastic latent representation. Note that the trade-off between the encoding term and the decoding term is governed by $\beta$: a larger $\beta$ means that the VIB compresses the input more and expresses the output less, and vice versa. For supervised learning, a relatively small value of $\beta$ is preferred to ensure prediction performance.
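To make the loss concrete, the following NumPy sketch computes one Monte Carlo sample of the cross-entropy-plus-KL objective for a linear Gaussian encoder and a linear softmax decoder. The weight matrices `W_mu`, `W_logvar`, and `W_dec` are illustrative placeholders, not the architecture used in our experiments.

```python
import numpy as np

def vib_loss(x, y_onehot, W_mu, W_logvar, W_dec, beta=1e-3, rng=None):
    """One Monte Carlo sample of the VIB loss: decoder cross-entropy plus
    beta * KL(p(z|x) || N(0, I)) for a diagonal Gaussian encoder."""
    if rng is None:
        rng = np.random.default_rng(0)
    mu, logvar = x @ W_mu, x @ W_logvar          # toy linear Gaussian encoder
    eps = rng.standard_normal(mu.shape)
    z = mu + np.exp(0.5 * logvar) * eps          # reparameterization trick
    logits = z @ W_dec                           # toy linear softmax decoder
    m = logits.max(axis=1, keepdims=True)        # stable log-softmax
    logp = logits - m - np.log(np.exp(logits - m).sum(axis=1, keepdims=True))
    ce = -(y_onehot * logp).sum(axis=1).mean()   # expression (decoding) term
    # closed-form KL between N(mu, diag(exp(logvar))) and the N(0, I) prior
    kl = 0.5 * (mu**2 + np.exp(logvar) - logvar - 1.0).sum(axis=1).mean()
    return ce + beta * kl
```

Setting `beta=0` recovers a plain stochastic classifier; increasing `beta` trades prediction accuracy for compression, as discussed above.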
4 Multi-Task Variational Information Bottleneck
We now extend the VIB structure to multi-task learning. The uncertainty-weighted losses method Kendall et al. (2018) is used to balance the weights among different tasks. The learning tasks in Eq. (6) amount to computing the task-related likelihoods. By Assumption 3, different tasks are conditionally independent of each other. Therefore, the likelihood of the output for $K$ tasks is
$$q_\theta(y \mid z) = \prod_{k=1}^{K} q_{\theta_k}(y_k \mid z),$$
where $y = (y_1, \dots, y_K)$ and $\theta_k$ denotes the parameters of the decoder for task $k$.
We take the homoscedastic uncertainties of the losses into account and use the softmax function to construct a Boltzmann distribution (or Gibbs distribution) for the classification problem. Let $f^{k}(z)$ be the output vector of the decoder network for task $k$, parameterized by the weight vector $\theta_k$. Following Kendall et al. (2018), the classification likelihood of class $c$ for task $k$ is adapted with a scaling squash through the softmax function:
$$q(y_k = c \mid z, \sigma_k) = \mathrm{Softmax}\!\left( \frac{1}{\sigma_k^2} f^{k}(z) \right)_{c} = \frac{\exp\!\left( f^{k}_{c}(z) / \sigma_k^2 \right)}{\sum_{c'} \exp\!\left( f^{k}_{c'}(z) / \sigma_k^2 \right)},$$
where $\sigma_k$ is a positive scalar indicating the homoscedastic uncertainty of task $k$, $f^{k}_{c}(z)$ is the $c$-th element of the decoder output, and $\mathrm{Softmax}(f^{k}(z))_{c}$ is the likelihood of class $c$ for task $k$ without squash scaling.
Thus, the loss function based on the negative log-likelihood for the $K$ tasks can be obtained:
$$\mathcal{L}(\theta, \sigma_1, \dots, \sigma_K) = \sum_{k=1}^{K} \left( \frac{1}{\sigma_k^2} \, \mathrm{CE}_k + \log \sigma_k \right),$$
where $\mathrm{CE}_k$ is the cross-entropy for task $k$.
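As a minimal numerical sketch of this objective (the function and variable names are ours, not from the experiments), the total loss can be computed from per-task cross-entropies and per-task $\log \sigma_k$ values:

```python
import numpy as np

def multitask_loss(ce_losses, log_sigmas):
    """Uncertainty-weighted total loss: sum_k CE_k / sigma_k^2 + log sigma_k.
    Optimizing log sigma_k (rather than sigma_k) keeps each task weight
    positive, which tends to stabilize training in practice."""
    ce_losses = np.asarray(ce_losses, dtype=float)
    log_sigmas = np.asarray(log_sigmas, dtype=float)
    precisions = np.exp(-2.0 * log_sigmas)  # 1 / sigma_k^2
    return float(np.sum(precisions * ce_losses + log_sigmas))
```

With $\sigma_k = 1$ for every task the objective reduces to the plain sum of cross-entropies; a task with larger $\sigma_k$ has its cross-entropy down-weighted, while the $\log \sigma_k$ term prevents the uncertainties from growing without bound.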
Our proposed model combines the VIB structure with the uncertainty-weighted losses. Based on Eq. (5), multi-task learning can be formulated as follows:
$$\max_{\phi, \theta} \; \sum_{k=1}^{K} I(Z, Y_k) \quad \text{s.t.} \quad I(Z, X) \le I_c,$$
where $\phi$ is the encoder parameter, $\theta = (\theta_1, \dots, \theta_K)$ is the decoder parameter, and $I_c$ is a constant. Applying the KKT conditions to Eq. (10) then gives the Lagrangian form
$$\mathcal{L} = \sum_{k=1}^{K} I(Z, Y_k) - \beta \, I(Z, X).$$
where $g_\phi(x, \epsilon) = \mu_\phi(x) + \Sigma_\phi(x)^{1/2} \odot \epsilon$ is the deterministic function used in the encoder $p_\phi(z \mid x)$, $\odot$ is the Hadamard product, and $\epsilon$ is a random noise sampled from a standard diagonal Gaussian distribution $\mathcal{N}(0, I)$.
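This sampling step can be sketched in a few lines; the `mu`/`logvar` encoder outputs below are hypothetical, with $\Sigma$ parameterized by its log-variance:

```python
import numpy as np

def reparameterize(mu, logvar, rng):
    """Sample z = mu + Sigma^{1/2} * eps (Hadamard product), eps ~ N(0, I).
    The randomness is isolated in eps, so z remains a deterministic,
    differentiable function of the encoder outputs mu and logvar."""
    mu = np.asarray(mu, dtype=float)
    std = np.exp(0.5 * np.asarray(logvar, dtype=float))
    eps = rng.standard_normal(mu.shape)
    return mu + std * eps
```

Because the noise source is independent of the encoder parameters, gradients can flow through `mu` and `logvar` during training.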
The resulting loss can be estimated by Monte Carlo sampling:
$$\mathcal{L}(\phi, \theta, \sigma) \approx \frac{1}{M} \sum_{m=1}^{M} \left\{ \sum_{k=1}^{K} \left[ \frac{1}{\sigma_k^2} \, \mathrm{CE}_k + \log \sigma_k \right] + \beta \, \mathrm{KL}\!\left( p_\phi(z \mid x) \,\|\, r(z) \right) \right\},$$
where $M$ is the size of the mini-batch for Monte Carlo sampling and $z_m = g_\phi(x, \epsilon_m)$.
Similar to the VIB, at test time we sample the latent variable $z$ using the encoder $p_\phi(z \mid x)$; $z$ is then used as the input to the decoders, and the outputs of the decoders are our model's outputs. The architecture of the MTVIB is exhibited in Fig. 1. The encoder-decoder architecture with a latent distribution has been used in VAEs and $\beta$-VAEs. It should be noted that $\beta$-VAEs also adopt the information bottleneck method to acquire better disentanglement; however, they are designed to generate new images for unsupervised learning, while our model is for supervised multi-task learning. Additionally, the proposed model is in line with the minimum description length (MDL) principle, which provides a general solution to the overfitting problem Rissanen (1978); Grünwald (2007). The KL divergence can be interpreted as the expected number of additional bits needed when encoding outcomes from one distribution with a code that is optimal for another Grünwald (2007). Alternatively, the mutual information term $I(Z, X)$ can be estimated using the method proposed in Belghazi et al. (2018) instead of the re-parameterization trick. More detailed discussions are provided in the supplementary material.
5 Experiments
Our proposed model is evaluated on three publicly available datasets.
Two-task classification: We use the MultiMNIST dataset and the MultiFashionMNIST dataset Lin et al. (2019). The former is created from the MNIST dataset LeCun et al. (1998), while the latter is based on the Fashion-MNIST dataset Xiao et al. (2017). Specifically, each image of the MultiMNIST or MultiFashionMNIST dataset overlaps two images from the MNIST or Fashion-MNIST dataset, respectively. Samples of the MultiMNIST and MultiFashionMNIST datasets are provided in the supplementary material. The input dimension is 36×36 (cf. Table 1) and each input is associated with a two-label target. For each dataset, the training instance number is and the test instance number is .
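The overlap construction can be sketched as follows; the exact offsets used by Lin et al. (2019) may differ, so treat the placement below (top-left and bottom-right corners of a 36×36 canvas, pixel-wise maximum where the digits intersect) as an illustrative assumption:

```python
import numpy as np

def overlay(img_a, img_b, canvas=36):
    """Compose a two-label image from two single-label images: img_a goes
    in the top-left corner, img_b in the bottom-right corner, and the
    overlapping region keeps the pixel-wise maximum."""
    h, w = img_a.shape
    out = np.zeros((canvas, canvas), dtype=float)
    out[:h, :w] = img_a
    out[canvas - h:, canvas - w:] = np.maximum(out[canvas - h:, canvas - w:], img_b)
    return out
```

Each composed image then carries two labels, one per source digit, which is exactly the two-task setting used in the experiments.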
Four-task classification: We use the multi-task facial landmark (MTFL) dataset Zhang et al. (2014). This dataset has four classification tasks, whose target category numbers are , , and , respectively. The images are resized to 150×150 (cf. Table 2). The training image number is and the test image number is .
Different adversarial attacks are taken into account. In the experiments, we use the fast gradient sign method (FGSM) Goodfellow et al. (2015) to produce the adversarial samples:
$$x_{\mathrm{adv}} = x + \epsilon \, \mathrm{sign}\!\left( \nabla_{x} \mathcal{L}(\theta, x, y) \right),$$
where $\epsilon$ is a positive scalar (several different values are used in the experiments), $\mathcal{L}$ is the loss function, and $\theta$ is the model parameter. The original samples and their adversarial counterparts for the used datasets are presented in Figs. 4-4.
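For a toy binary logistic model (an illustrative stand-in for the trained network; the weight vector `w` and the label convention are hypothetical), FGSM can be sketched as:

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def logistic_loss(x, y, w):
    """Binary logistic loss -log sigmoid(y * w.x) for a toy linear model,
    standing in for the trained network's loss L(theta, x, y)."""
    return -np.log(sigmoid(y * np.dot(w, x)))

def fgsm(x, y, w, epsilon):
    """FGSM attack: x_adv = x + epsilon * sign(grad_x L). For the logistic
    model the input gradient is available in closed form."""
    grad_x = -y * (1.0 - sigmoid(y * np.dot(w, x))) * w
    return x + epsilon * np.sign(grad_x)
```

By construction the perturbation is bounded by $\epsilon$ in every coordinate, yet it moves the input in the direction that maximally increases the loss.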
The benchmark models include grid search (GS), the uncertainty-weighted losses (UWL) Kendall et al. (2018), single-task learning (STL), and the single-task variational information bottleneck (STVIB). The GS and the UWL are multi-task learning models, while the STL and the STVIB are single-task learning models. We follow Zhang et al. (2017) and use the Small AlexNet as the base structure for the STL, GS, and UWL. In the STVIB and MTVIB, the latent dimension is set to 256 and $\beta$ is set to 1e-3 (cf. Tables 1-2). The learning rate is ; the optimizer is Adam Kingma and Ba (2015); the training epoch is ; and the minibatch size is . The implementation details of the models are presented in Tables 1-2.
Our model shows promising performance in multi-task classification. As presented in Figs. 5-6, it is on a par with the benchmark single-task and multi-task learning models in terms of the overall classification accuracy of each task. Without adversarial attacks, the average classification accuracy is for the MultiMNIST dataset, for the MultiFashionMNIST dataset, and for the MTFL dataset. When the noise level increases, our model is more robust and significantly outperforms the benchmark multi-task learning models (i.e., the UWL and the GS).
We have proposed the MTVIB in this paper. Our model builds on the VIB structure, which obtains the latent representation of the input data. Task-dependent uncertainties are used to learn the relative weights of the task loss functions, and multi-task learning is formulated as a constrained multi-objective optimization problem. The MTVIB can enhance the latent representation while accounting for the trade-offs among different learning tasks. It has been examined on publicly available datasets under different adversarial attacks, achieving classification accuracy comparable to the benchmark models and showing better robustness against adversarial attacks than the other multi-task learning models.
-  Achille and Soatto (2018) Emergence of invariance and disentanglement in deep representations. Journal of Machine Learning Research 19 (1), pp. 1947–1980.
-  Alemi et al. (2017) Deep variational information bottleneck. 5th International Conference on Learning Representations.
-  Atanov et al. (2018) The deep weight prior. 6th International Conference on Learning Representations.
-  Belghazi et al. (2018) Mutual information neural estimation. Proceedings of the 35th International Conference on Machine Learning, pp. 531–540.
-  Blei et al. (2017) Variational inference: a review for statisticians. Journal of the American Statistical Association 112 (518), pp. 859–877.
-  Blundell et al. (2015) Weight uncertainty in neural networks. Proceedings of the 32nd International Conference on Machine Learning, pp. 1613–1622.
-  Burgess et al. (2018) Understanding disentangling in β-VAE. arXiv preprint arXiv:1804.03599.
-  Chen et al. (2018) GradNorm: gradient normalization for adaptive loss balancing in deep multitask networks. Proceedings of the 35th International Conference on Machine Learning, pp. 794–803.
-  Goodfellow et al. (2015) Explaining and harnessing adversarial examples. 3rd International Conference on Learning Representations.
-  Grünwald (2007) The minimum description length principle. MIT Press.
-  Higgins et al. (2017) beta-VAE: learning basic visual concepts with a constrained variational framework. 5th International Conference on Learning Representations.
-  Hinton and Zemel (1994) Autoencoders, minimum description length and Helmholtz free energy. Advances in Neural Information Processing Systems, pp. 3–10.
-  Kendall et al. (2018) Multi-task learning using uncertainty to weigh losses for scene geometry and semantics. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7482–7491.
-  Kingma and Ba (2015) Adam: a method for stochastic optimization. 3rd International Conference on Learning Representations.
-  Kingma and Welling (2014) Auto-encoding variational Bayes. 2nd International Conference on Learning Representations.
-  Kingma et al. (2015) Variational dropout and the local reparameterization trick. Advances in Neural Information Processing Systems, pp. 2575–2583.
-  LeCun et al. (1998) Gradient-based learning applied to document recognition. Proceedings of the IEEE 86 (11), pp. 2278–2324.
-  Lin et al. (2019) Pareto multi-task learning. Advances in Neural Information Processing Systems, pp. 12037–12047.
-  Maaløe et al. (2016) Auxiliary deep generative models. Proceedings of the 33rd International Conference on Machine Learning, pp. 1445–1453.
-  Molchanov et al. (2017) Variational dropout sparsifies deep neural networks. Proceedings of the 34th International Conference on Machine Learning, pp. 2498–2507.
-  Rezende et al. (2014) Stochastic backpropagation and approximate inference in deep generative models. Proceedings of the 31st International Conference on Machine Learning, pp. II–1278.
-  Rissanen (1978) Modeling by shortest data description. Automatica 14 (5), pp. 465–471.
-  Ruder (2017) An overview of multi-task learning in deep neural networks. arXiv preprint arXiv:1706.05098.
-  Sener and Koltun (2018) Multi-task learning as multi-objective optimization. Advances in Neural Information Processing Systems, pp. 527–538.
-  Tishby et al. (2000) The information bottleneck method. arXiv preprint physics/0004057.
-  Tschannen et al. (2018) Recent advances in autoencoder-based representation learning. Third Workshop on Bayesian Deep Learning (NeurIPS 2018).
-  Xiao et al. (2017) Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747.
-  Zhang et al. (2018) Advances in variational inference. IEEE Transactions on Pattern Analysis and Machine Intelligence 41 (8), pp. 2008–2026.
-  Zhang et al. (2017) Understanding deep learning requires rethinking generalization. 5th International Conference on Learning Representations.
-  Zhang and Yang (2017) A survey on multi-task learning. arXiv preprint arXiv:1707.08114.
-  Zhang et al. (2014) Facial landmark detection by deep multi-task learning. European Conference on Computer Vision, pp. 94–108.
6.1 Derivation of the Variational Lower Bound of the Information Bottleneck
As the mutual information can be defined as the Kullback-Leibler (KL) divergence between the joint density and the product of the marginal densities, if Assumptions 1-2 are met, then
$$I(Z, Y) = \int p(y, z) \log \frac{p(y \mid z)}{p(y)} \, dy \, dz \ge \int p(y, z) \log \frac{q(y \mid z)}{p(y)} \, dy \, dz,$$
since $\mathrm{KL}\!\left( p(y \mid z) \,\|\, q(y \mid z) \right) \ge 0$ for any variational approximation $q(y \mid z)$. On the other hand, according to Assumption 1, the posterior $p(z \mid x)$ can be calculated, and then
$$I(Z, X) = \int p(x, z) \log \frac{p(z \mid x)}{p(z)} \, dx \, dz \le \int p(x, z) \log \frac{p(z \mid x)}{r(z)} \, dx \, dz,$$
where $r(z)$ is an uninformative prior distribution; the inequality holds because $\mathrm{KL}\!\left( p(z) \,\|\, r(z) \right) \ge 0$.
6.2 Derivation of the Scaled Classification Likelihood
Let $f^{k}(z)$ be the output vector of the decoder network for task $k$, parameterized by the weight vector $\theta_k$. The likelihood of the $c$-th element in the vector $f^{k}(z)$ with dimension $C$ under the scaling squash is
$$q(y_k = c \mid z, \sigma_k) = \frac{\exp\!\left( f^{k}_{c}(z) / \sigma_k^2 \right)}{\sum_{c'=1}^{C} \exp\!\left( f^{k}_{c'}(z) / \sigma_k^2 \right)}.$$
Without squash scaling, the likelihood of the $c$-th element in the vector $f^{k}(z)$ is
$$q(y_k = c \mid z) = \frac{\exp\!\left( f^{k}_{c}(z) \right)}{\sum_{c'=1}^{C} \exp\!\left( f^{k}_{c'}(z) \right)}.$$
Therefore, we have
$$-\log q(y_k = c \mid z, \sigma_k) = \frac{1}{\sigma_k^2} \left[ -\log q(y_k = c \mid z) \right] + \log \frac{\sum_{c'} \exp\!\left( f^{k}_{c'}(z) / \sigma_k^2 \right)}{\left[ \sum_{c'} \exp\!\left( f^{k}_{c'}(z) \right) \right]^{1/\sigma_k^2}} \approx \frac{1}{\sigma_k^2} \left[ -\log q(y_k = c \mid z) \right] + \log \sigma_k,$$
where, following Kendall et al. (2018), the approximation $\frac{1}{\sigma_k} \sum_{c'} \exp\!\left( f^{k}_{c'}(z) / \sigma_k^2 \right) \approx \left[ \sum_{c'} \exp\!\left( f^{k}_{c'}(z) \right) \right]^{1/\sigma_k^2}$ becomes an equality as $\sigma_k \to 1$.
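A quick numerical sanity check of the scaling squash (function names are ours): the scaled likelihood equals the unscaled softmax probabilities raised to the power $1/\sigma^2$ and renormalized, which makes explicit how a larger $\sigma$ flattens the distribution.

```python
import numpy as np

def softmax(f):
    e = np.exp(f - np.max(f))  # shift for numerical stability
    return e / e.sum()

def scaled_softmax(f, sigma):
    """Classification likelihood with the scaling squash: Softmax(f / sigma^2)."""
    return softmax(np.asarray(f, dtype=float) / sigma**2)
```

As $\sigma \to \infty$ the scaled distribution approaches the uniform distribution, reflecting maximal task uncertainty.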
Table 1: Model architectures for the MultiMNIST and MultiFashionMNIST experiments (columns reconstructed from left to right as STL; GS/UWL; STVIB; MTVIB).

Shared convolutional trunk (all models): Input 36×36 → Conv 64 (kernel 3×3; stride 1) + BN + ReLU → MaxPool 2×2 → Conv 64 (kernel 3×3; stride 1) + BN + ReLU → MaxPool 2×2.

STL: trunk → FC 3136→384 + BN + ReLU → FC 384→192 + BN + ReLU → FC 192→10 → softmax.

GS / UWL: trunk → FC 3136→384 + BN + ReLU → two heads, each FC 384→192 + BN + ReLU → FC 192→10 → softmax.

STVIB: trunk → FC 3136→1024 + BN + ReLU → FC 1024→1024 + BN + ReLU → FC 1024→1024 + BN + ReLU → two parallel FC 1024→256 + BN + ReLU (latent mean and variance) → latent dimension 256; β = 1e-3 → FC 256→384 + BN + ReLU → FC 384→192 + BN + ReLU → FC 192→10 → softmax.

MTVIB: same encoder as the STVIB (latent dimension 256; β = 1e-3) → two decoder heads, each FC 256→384 + BN + ReLU → FC 384→192 → FC 192→10 → softmax.
Table 2: Model architectures for the MTFL experiments (columns reconstructed from left to right as STL; GS/UWL; STVIB; MTVIB).

Shared convolutional trunk (all models): Input 150×150 → Conv 64 (kernel 5×5; stride 2) + BN + ReLU → MaxPool 2×2 → Conv 64 (kernel 5×5; stride 2) + BN + ReLU → MaxPool 2×2.

STL: trunk → FC 4096→384 + BN + ReLU → FC 384→192 + BN + ReLU → FC 192→10 → softmax.

GS / UWL: trunk → FC 4096→384 + BN + ReLU → task-specific heads, each FC 384→192 + BN + ReLU → FC 192→10 → softmax.

STVIB: trunk → FC 4096→1024 + BN + ReLU → FC 1024→1024 + BN + ReLU → FC 1024→1024 + BN + ReLU → two parallel FC 1024→256 + BN + ReLU (latent mean and variance) → latent dimension 256; β = 1e-3 → FC 256→384 + BN + ReLU → FC 384→192 + BN + ReLU → FC 192→10 → softmax.

MTVIB: same encoder as the STVIB (latent dimension 256; β = 1e-3) → task-specific decoders, each FC 256→384 + BN + ReLU → FC 384→192 → FC 192→10 → softmax.