1 Introduction
Multitask learning is a popular subfield of machine learning. In the conventional setting, different tasks share the same representation and are learned simultaneously. Each task contributes to the total learning loss, and a good balance among tasks can yield strong predictive capability. Multitask learning models are usually believed to be computationally efficient and to obtain better generalized representations than learning a single task at a time. Generally speaking, two groups of approaches are widely used to improve multitask learning performance: the first focuses on improving the shared latent representation, while the other aims to find the optimal relative weights for the learning tasks Ruder (2017); Zhang and Yang (2017). The model proposed in our research addresses both issues.

Many multitask learning models use deterministic encoding techniques to obtain latent representations, whereas in this study we use a variational encoding method to improve the latent representation. Specifically, we adopt the information bottleneck method Tishby et al. (2000) to obtain the latent codes. The information bottleneck method can be implemented via variational inference Blei et al. (2017); Zhang et al. (2018), yielding the variational information bottleneck (VIB) Alemi et al. (2017). Variational inference has been widely used in deep learning, for example in variational autoencoders (VAEs) and their variants Kingma et al. (2015); Rezende et al. (2014); Higgins et al. (2017); Burgess et al. (2018); Maaløe et al. (2016), Bayesian neural networks (BNNs) Blundell et al. (2015), variational dropout Kingma et al. (2015); Molchanov et al. (2017), and the deep variational prior Atanov et al. (2018). More recently, variational inference has also been used for representation learning, for instance mutual information estimation and maximization Tschannen et al. (2018).

In terms of optimal task weights, grid search (GS) is a simple way to find the tradeoff among tasks: it evaluates a set of combinations of constant weights to find appropriate relative weights for the different losses. However, this method is not computationally efficient Lin et al. (2019). Recently, gradient normalization has been used to balance the losses in multitask deep neural networks Chen et al. (2018). Alternatively, from the perspective of optimization, if the learning tasks are regarded as different objectives, multitask learning can be formulated as a multi-objective optimization problem Sener and Koltun (2018); Lin et al. (2019). One can also exploit the homoscedastic uncertainties of the losses to compute the optimal weight for each learning goal Kendall et al. (2018). Since our approach is probabilistic, we naturally adopt this uncertainty-based method, using the likelihoods of the predictions to weigh the different losses; we also find that it helps stabilize the training process in practice. To extend the VIB to multitask learning, we leverage the predictive likelihoods of the VIB decoder to calculate the task-related weights for the different losses. Based on this, we propose the multitask variational information bottleneck (MTVIB).
Our research makes two methodological contributions. First, the MTVIB adopts the VIB structure to obtain the latent representations of the input data. Compared to deterministic latent representations, the variational latent representations are regularized and thereby expected to be more robust to noise, for instance under adversarial attacks. Second, the MTVIB uses the task-related uncertainties to assign the relative weight of each task loss. This not only helps us find a good tradeoff among different tasks but also keeps the multitask training process stable.
2 Model Setup
Our model assumptions are based on a typical information Markov chain, where the information associated with the input data can be represented by a latent distribution. Let $X$ denote the input data, $Z$ denote the latent representation, and $Y$ denote the output. The information Markov chain is then $X \rightarrow Z \rightarrow Y$, which can be presented in an encoder-decoder structure. A similar idea was proposed by Achille and Soatto (2018), where the information is stored in the neural network weights.

Assumption 1.
There exists a statistic $Z$ of the input $X$ which is sufficient for learning the posterior probability of the latent code, i.e., $p(z|x, y) = p(z|x)$.

This assumption was proposed by Alemi et al. (2017). It suggests that the input data contain the information needed to compute the latent distribution. The latent distribution in our model is learned by an encoder network. In contrast to the deterministic codes computed by vanilla autoencoders Hinton and Zemel (1994), we adopt the VIB method, and accordingly the learned latent codes can be seen as a disentangled representation. In addition, in an ideal situation, the latent representation is not only sufficient but also minimal; consequently, only the task-related information is retained.
Assumption 2.
The learned representation $Z$ is sufficient for learning the likelihood of the output, i.e., $p(y|x, z) = p(y|z)$.
This assumption was made by Achille and Soatto (2018). It indicates that the sufficiency of the latent representation is ensured by the decoder network. This can be easily evaluated through the loss function, e.g., the cross-entropy for classification problems. In short, under Assumptions 1 and 2 we expect our model to compress the data maximally while expressing the output as much as possible.
Assumption 3.
If there are multiple learning tasks (e.g., $Y = (y_1, \ldots, y_K)$), they are conditionally independent given the shared representation, i.e., $p(y_1, \ldots, y_K \mid z) = \prod_{k=1}^{K} p(y_k \mid z)$.
This assumption has also been used in Kendall et al. (2018). It allows us to divide the output into sub-outputs that are represented by different decoders, and then to use the factorized probabilities to weigh the losses.
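The factorization in Assumption 3 can be illustrated with a short sketch (a minimal numpy example with hypothetical logits and labels, not the paper's implementation): under conditional independence, the joint log-likelihood over tasks is simply the sum of the per-task log-likelihoods.

```python
import numpy as np

def log_softmax(logits):
    # Numerically stable log-softmax.
    shifted = logits - logits.max()
    return shifted - np.log(np.exp(shifted).sum())

# Hypothetical decoder logits for K = 2 tasks, both computed from the
# same shared latent code z (the values here are made up for illustration).
task_logits = [np.array([2.0, 0.5, -1.0]), np.array([0.1, 1.2])]
task_labels = [0, 1]  # true class index for each task

# Under Assumption 3, log p(y | z) = sum_k log p(y_k | z).
joint_log_lik = sum(log_softmax(l)[y] for l, y in zip(task_logits, task_labels))
```

This additive structure is what later lets each task's loss be weighted independently.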
3 Latent Representation
We seek a minimal sufficient representation of the input features. For supervised learning, according to the information bottleneck theory Tishby et al. (2000), the following optimization problem can be formulated:

$\max \; I(Z;Y)$  (1)
s.t. $I(Z;X) \le I_c$  (2)

where $I(\cdot\,;\cdot)$ denotes the mutual information between two variables and $I_c$ is the information constraint.

To solve this optimization problem, the Karush-Kuhn-Tucker (KKT) conditions can be applied, and the corresponding Lagrangian yields

$\mathcal{L} = I(Z;Y) - \beta \big( I(Z;X) - I_c \big)$  (3)

where $\beta$ is a nonnegative Lagrange multiplier and $I_c$ can be ignored as it is a constant.
Directly computing the mutual information in Eq. (3) is intractable; instead, we can use other techniques to approximate it. Based on Assumptions 1-2, the following variational lower bound of the information bottleneck can be obtained:

$I(Z;Y) - \beta I(Z;X) \ge \int p(x)\, p(y|x)\, p(z|x) \log q(y|z)\, dx\, dy\, dz - \beta \int p(x)\, p(z|x) \log \frac{p(z|x)}{r(z)}\, dx\, dz$  (4)

where $r(z)$ is an uninformative prior distribution. The detailed derivation of Eq. (4) is provided in the supplementary material.
Since $p(x, y)$ can be approximated by the empirical data distribution, we can estimate the bound by drawing $x$ from the dataset first, then drawing $z$ from $p(z|x)$ and taking the paired label $y$ from the dataset. Then Eq. (4) becomes

$\frac{1}{N} \sum_{n=1}^{N} \left[ \int p(z|x_n) \log q(y_n|z)\, dz - \beta\, \mathrm{KL}\big(p(z|x_n) \,\|\, r(z)\big) \right]$  (5)
Similar to VAEs and the VIB, we adopt an encoder-decoder structure. The encoder is parameterized by $\theta$ and the latent variable can be sampled via $z \sim p_\theta(z|x)$. The decoder $q_\phi(y|z)$ is parameterized by $\phi$. Maximizing the lower bound in Eq. (5) is equivalent to minimizing the following loss function:

$\mathcal{L}_{\mathrm{VIB}} = \frac{1}{N} \sum_{n=1}^{N} \left[ \mathbb{E}_{z \sim p_\theta(z|x_n)} \big[ -\log q_\phi(y_n|z) \big] + \beta\, \mathrm{KL}\big(p_\theta(z|x_n) \,\|\, r(z)\big) \right]$  (6)

Eq. (6) is the VIB structure used to acquire the stochastic latent representation. Note that the tradeoff between the encoding term and the decoding term is governed by $\beta$: a larger $\beta$ means the VIB compresses the input more and expresses the output less, and vice versa. For supervised learning, a relatively small value of $\beta$ is preferred to ensure prediction performance.
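The loss in Eq. (6) can be sketched numerically. The following minimal numpy example (the toy encoder outputs and linear decoder are assumptions for illustration, not the paper's architecture) combines a one-sample Monte Carlo cross-entropy with the closed-form KL to a standard normal prior, assuming a diagonal Gaussian posterior:

```python
import numpy as np

rng = np.random.default_rng(0)

def kl_to_standard_normal(mu, log_var):
    # Closed-form KL( N(mu, diag(exp(log_var))) || N(0, I) ); this assumes a
    # diagonal Gaussian posterior and a standard normal prior r(z).
    return 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var)

def cross_entropy(logits, label):
    # Negative log-likelihood of the softmax for the true label.
    shifted = logits - logits.max()
    return -(shifted[label] - np.log(np.exp(shifted).sum()))

# Hypothetical encoder outputs for one example; in the paper these come
# from a neural network with parameters theta.
mu = np.array([0.3, -0.2])
log_var = np.array([-1.0, -1.5])

# One-sample Monte Carlo estimate of the expected decoding loss.
z = mu + np.exp(0.5 * log_var) * rng.standard_normal(2)  # reparameterized draw
W = rng.standard_normal((3, 2)) * 0.1                    # toy linear "decoder"
beta = 1e-3
loss = cross_entropy(W @ z, label=0) + beta * kl_to_standard_normal(mu, log_var)
```

With a small beta, as used in the experiments, the cross-entropy term dominates and the KL term acts as a mild regularizer.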
4 MultiTask Variational Information Bottleneck
We now extend the VIB structure to multitask learning. The uncertainty-weighted losses method Kendall et al. (2018) is used to balance the weights among different tasks. The learning tasks in Eq. (6) amount to computing the task-related likelihoods. By Assumption 3, different tasks are conditionally independent of each other. Therefore, the likelihood of the output for $K$ tasks is

$q_\phi(y|z) = \prod_{k=1}^{K} q_{\phi_k}(y_k|z)$  (7)

where $y_k$ is the target of task $k$ and $\phi_k$ denotes the parameters of the decoder for task $k$.
We take the homoscedastic uncertainties of the losses into account and use the softmax function to construct a Boltzmann distribution (or Gibbs distribution) for classification. Let $f^{w_k}(z)$ be the output vector of the decoder network for task $k$, parameterized by the weight vector $w_k$. Following Kendall et al. (2018), the classification likelihood of class $c$ for task $k$ is adapted with a scaling squash through the softmax function:

$q(y_k = c \mid z, \sigma_k) = \mathrm{Softmax}\!\left( \tfrac{1}{\sigma_k^2} f^{w_k}(z) \right)_c = \frac{\exp\!\big(f_c^{w_k}(z)/\sigma_k^2\big)}{\sum_{c'} \exp\!\big(f_{c'}^{w_k}(z)/\sigma_k^2\big)}$  (8)

where $\sigma_k$ is a positive scalar indicating the homoscedastic uncertainty of task $k$, $f_c^{w_k}(z)$ is the $c$-th element of the decoder output, and $\mathrm{Softmax}\big(f^{w_k}(z)\big)_c$ is the likelihood of class $c$ for task $k$ without squash scaling.
Thus, the loss function based on the negative log-likelihood for the $K$ tasks can be obtained:

$\mathcal{L}_{\mathrm{tasks}} \approx \sum_{k=1}^{K} \left( \frac{1}{\sigma_k^2}\, \mathcal{L}_k + \log \sigma_k \right)$  (9)

where $\mathcal{L}_k$ is the cross-entropy for task $k$.
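Eqs. (8)-(9) can be sketched directly. The following numpy example (an illustrative sketch; the logits and uncertainty values are assumptions) implements the temperature-scaled softmax likelihood and the resulting approximate multitask loss:

```python
import numpy as np

def scaled_nll(logits, label, sigma):
    # Negative log-likelihood under the temperature-scaled softmax of Eq. (8):
    # p(y_k = c | z, sigma_k) = softmax(f(z) / sigma_k^2)_c.
    scaled = logits / sigma**2
    scaled = scaled - scaled.max()
    return -(scaled[label] - np.log(np.exp(scaled).sum()))

def uncertainty_weighted_loss(task_logits, task_labels, sigmas):
    # Approximate multitask loss of Eq. (9): sum_k CE_k / sigma_k^2 + log sigma_k.
    total = 0.0
    for logits, label, sigma in zip(task_logits, task_labels, sigmas):
        shifted = logits - logits.max()
        ce = -(shifted[label] - np.log(np.exp(shifted).sum()))
        total += ce / sigma**2 + np.log(sigma)
    return total
```

With $\sigma_k = 1$ the loss reduces to the plain sum of cross-entropies; a larger $\sigma_k$ down-weights task $k$, while the $\log \sigma_k$ term discourages ignoring any task entirely.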
Our proposed model combines the VIB structure with the uncertainty-weighted losses. Based on Eq. (5), multitask learning can be formulated as follows:

$\max_{\theta, \phi} \; \frac{1}{N} \sum_{n=1}^{N} \mathbb{E}_{z \sim p_\theta(z|x_n)} \big[ \log q_\phi(y_n|z) \big]$  (10)
s.t. $\frac{1}{N} \sum_{n=1}^{N} \mathrm{KL}\big(p_\theta(z|x_n) \,\|\, r(z)\big) \le I_c$  (11)

where $\theta$ is the encoder parameter, $\phi$ is the decoder parameter, and $I_c$ is a constant. Applying the KKT conditions to Eq. (10) then gives the Lagrangian form

$\mathcal{L} = \frac{1}{N} \sum_{n=1}^{N} \left[ \mathbb{E}_{z \sim p_\theta(z|x_n)} \big[ -\log q_\phi(y_n|z) \big] + \beta\, \mathrm{KL}\big(p_\theta(z|x_n) \,\|\, r(z)\big) \right]$  (12)
To handle the KL divergence term in Eq. (12), we resort to the reparameterization method Kingma and Welling (2014):

$z = g_\theta(x, \epsilon) = \mu_\theta(x) + \sigma_\theta(x) \odot \epsilon$  (13)

where $g_\theta$ is the deterministic function used in the encoder, $\mu_\theta$ and $\sigma_\theta$ are the deterministic functions that compute the mean and standard deviation of the latent Gaussian distribution, $\odot$ is the Hadamard product, and $\epsilon$ is a random noise sampled from a standard diagonal Gaussian distribution $\mathcal{N}(0, I)$. Eq. (12) can then be estimated with Eq. (13) and Monte Carlo sampling. The final loss of the MTVIB is

$\mathcal{L}_{\mathrm{MTVIB}} = \frac{1}{M} \sum_{m=1}^{M} \left[ \sum_{k=1}^{K} \left( \frac{1}{\sigma_k^2}\, \mathcal{L}_k(x_m) + \log \sigma_k \right) + \beta\, \mathrm{KL}\big(p_\theta(z|x_m) \,\|\, r(z)\big) \right]$  (14)

where $M$ is the size of the mini-batch for Monte Carlo sampling.
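Putting the pieces together, a single-input, one-sample Monte Carlo estimate of the MTVIB loss can be sketched as follows (a minimal numpy illustration assuming a diagonal Gaussian posterior, a standard normal prior, and toy linear decoders; not the paper's network):

```python
import numpy as np

def mtvib_loss(mu, log_var, eps, decoders, labels, sigmas, beta):
    # One-sample Monte Carlo estimate of Eq. (14) for a single input:
    # uncertainty-weighted task cross-entropies plus the beta-weighted KL term.
    z = mu + np.exp(0.5 * log_var) * eps  # reparameterization, Eq. (13)
    kl = 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var)
    loss = beta * kl
    for W, label, sigma in zip(decoders, labels, sigmas):
        logits = W @ z
        shifted = logits - logits.max()
        ce = -(shifted[label] - np.log(np.exp(shifted).sum()))  # cross-entropy
        loss += ce / sigma**2 + np.log(sigma)                   # Eq. (9) weighting
    return loss
```

Passing the noise `eps` explicitly keeps the estimate deterministic for a fixed draw, which is convenient for checking that the KL term contributes exactly $\beta \cdot \mathrm{KL}$.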
Similar to the VIB, at test time we sample the latent variable $z$ using the encoder $p_\theta(z|x)$; $z$ is then used as the input for the decoders, and the outputs of the decoders are the model's outputs. The architecture of the MTVIB is exhibited in Fig. 1. The encoder-decoder architecture with a latent distribution has been used in VAEs and β-VAEs. It should be noted that β-VAEs also adopt the information bottleneck method to acquire better disentanglement; however, they are designed to generate new images for unsupervised learning, while our model targets supervised multitask learning. Additionally, the proposed model is in line with the minimum description length (MDL) principle, which provides a general treatment of the overfitting problem Rissanen (1978); Grünwald and Grunwald (2007). In this view, the KL divergence is the expected number of additional bits required to encode outcomes generated by the encoder distribution using a code that is optimal for the prior $r(z)$ Grünwald and Grunwald (2007). Alternatively, the mutual information term $I(Z;X)$ can be estimated using the method proposed in Belghazi et al. (2018) instead of the reparameterization trick. More detailed discussions are provided in the supplementary material.

5 Experiments
Our proposed model is evaluated with three publicly available datasets.
Two-task classification: We use the MultiMNIST dataset and the MultiFashionMNIST dataset Lin et al. (2019). The former is created from the MNIST dataset LeCun et al. (1998), while the latter is created from the FashionMNIST dataset Xiao et al. (2017). Specifically, each image in the MultiMNIST or MultiFashionMNIST dataset overlaps two images from the MNIST or FashionMNIST dataset, respectively. Samples from the MultiMNIST and MultiFashionMNIST datasets are provided in the supplementary material. The input dimension of the data is , and each input is associated with a label target. For each dataset, the number of training instances is and the number of test instances is .
Four-task classification: We use the multitask facial landmark (MTFL) dataset Zhang et al. (2014). This dataset has four classification tasks, whose numbers of target categories are , , and , respectively. The images are resized to . The number of training images is and the number of test images is .
Different adversarial attacks are taken into account. In the experiments, we use the fast gradient sign method (FGSM) Goodfellow et al. (2015) to produce the adversarial samples:

$x_{\mathrm{adv}} = x + \epsilon\, \mathrm{sign}\big( \nabla_x \mathcal{L}(\theta, x, y) \big)$  (15)

where $\epsilon$ is a positive scalar (we use the values , , , , , for the experiments), $\mathcal{L}$ is the loss function, and $\theta$ is the model parameter. The original samples and their adversarial counterparts for the used datasets are presented in Fig. 4.
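Eq. (15) requires the gradient of the loss with respect to the input. As an illustration (a toy sketch, not the paper's attack on the trained networks), the following uses binary logistic regression, where this input gradient has the closed form $(p - y)\,w$, so no automatic differentiation is needed:

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def fgsm_attack(x, y, w, b, eps):
    # FGSM, Eq. (15): x_adv = x + eps * sign(grad_x loss).
    # For binary logistic regression with logistic loss, the gradient of
    # the loss w.r.t. the input x is (p - y) * w.
    p = sigmoid(w @ x + b)
    grad_x = (p - y) * w
    return x + eps * np.sign(grad_x)

# Toy model and sample (all values hypothetical).
w, b = np.array([1.0, -2.0]), 0.1
x, y = np.array([0.5, 0.3]), 1
x_adv = fgsm_attack(x, y, w, b, eps=0.1)
```

Each input coordinate is perturbed by exactly $\pm\epsilon$ in the direction that locally increases the loss, which is why larger $\epsilon$ values degrade accuracy more.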
The benchmark models include grid search (GS), the uncertainty-weighted losses (UWL) Kendall et al. (2018), single-task learning (STL), and the single-task variational information bottleneck (STVIB). GS and UWL are multitask learning models, while STL and STVIB are single-task learning models. We follow Zhang et al. (2017) and use the Small AlexNet as the base structure for STL, GS, and UWL. In the STVIB and MTVIB, the latent dimension is set to and $\beta$ is set to . The learning rate is ; the optimizer is Adam Kingma and Ba (2015); the number of training epochs is ; and the mini-batch size is . The implementation details of the models are presented in Tables 1-2.

Our model has shown promising performance in multitask classification. As presented in Figs. 5-6, it is on a par with the benchmark single-task and multitask learning models in terms of the overall classification accuracy of each task. Without adversarial attacks, the average classification accuracy is for the MultiMNIST dataset, for the MultiFashionMNIST dataset, and for the MTFL dataset. As the noise level increases, our model is more robust and significantly outperforms the benchmark multitask learning models (i.e., the UWL and the GS).
6 Conclusion
We propose the MTVIB in this paper. Our model is based on the VIB structure, which obtains a latent representation of the input data. Task-dependent uncertainties are used to learn the relative weights of the task loss functions, and multitask learning is formulated as a constrained multi-objective optimization problem. The MTVIB can enhance the latent representation and account for the tradeoffs among different learning tasks. It is examined on publicly available datasets under different adversarial attacks. It achieves classification accuracy comparable to the benchmark models and shows better robustness against adversarial attacks than the other multitask learning models.
References
 [1] (2018) Emergence of invariance and disentanglement in deep representations. Journal of Machine Learning Research 19 (1), pp. 1947–1980. Cited by: §2, §2.
 [2] (2017) Deep variational information bottleneck. 5th International Conference on Learning Representations. Cited by: §1, §2.
 [3] (2018) The deep weight prior. 6th International Conference on Learning Representation. Cited by: §1.
 [4] (2018) Mutual information neural estimation. Proceedings of the 35th International Conference on Machine Learning, pp. 531–540. Cited by: §4.
 [5] (2017) Variational inference: a review for statisticians. Journal of the American Statistical Association 112 (518), pp. 859–877. Cited by: §1.
 [6] (2015) Weight uncertainty in neural network. Proceedings of the 32nd International Conference on Machine Learning, pp. 1613–1622. Cited by: §1.
 [7] (2018) Understanding disentangling in β-VAE. arXiv preprint arXiv:1804.03599. Cited by: §1.
 [8] (2018) GradNorm: gradient normalization for adaptive loss balancing in deep multitask networks. Proceedings of the 35th International Conference on Machine Learning, pp. 794–803. Cited by: §1.
 [9] (2015) Explaining and harnessing adversarial examples. 3rd International Conference on Learning Representations. Cited by: §5.
 [10] (2007) The minimum description length principle. MIT press. Cited by: §4.
 [11] (2017) beta-VAE: learning basic visual concepts with a constrained variational framework. 5th International Conference on Learning Representations. Cited by: §1.
 [12] (1994) Autoencoders, minimum description length and Helmholtz free energy. Advances in Neural Information Processing Systems, pp. 3–10. Cited by: §2.

[13] (2018) Multitask learning using uncertainty to weigh losses for scene geometry and semantics. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7482–7491. Cited by: §1, §2, §4, §4, §5.
 [14] (2015) Adam: a method for stochastic optimization. 3rd International Conference on Learning Representations. Cited by: §5.
 [15] (2014) Autoencoding variational bayes. 2nd International Conference on Learning Representations. Cited by: §4.
 [16] (2015) Variational dropout and the local reparameterization trick. Advances in Neural Information Processing Systems, pp. 2575–2583. Cited by: §1.
 [17] (1998) Gradientbased learning applied to document recognition. Proceedings of the IEEE 86 (11), pp. 2278–2324. Cited by: §5.
 [18] (2019) Pareto multitask learning. Advances in Neural Information Processing Systems, pp. 12037–12047. Cited by: §1, §5.
 [19] (2016) Auxiliary deep generative models. Proceedings of the 33rd International Conference on Machine Learning, pp. 1445–1453. Cited by: §1.
 [20] (2017) Variational dropout sparsifies deep neural networks. Proceedings of the 34th International Conference on Machine Learning, Volume 70, pp. 2498–2507. Cited by: §1.

[21] (2014) Stochastic backpropagation and approximate inference in deep generative models. Proceedings of the 31st International Conference on Machine Learning, Volume 32, pp. II–1278. Cited by: §1.
 [22] (1978) Modeling by shortest data description. Automatica 14 (5), pp. 465–471. Cited by: §4.
 [23] (2017) An overview of multitask learning in deep neural networks. arXiv preprint arXiv:1706.05098. Cited by: §1.
 [24] (2018) Multitask learning as multiobjective optimization. Advances in Neural Information Processing Systems, pp. 527–538. Cited by: §1.
 [25] (2000) The information bottleneck method. arXiv preprint physics/0004057. Cited by: §1, §3.
 [26] (2018) Recent advances in autoencoderbased representation learning. Third workshop on Bayesian Deep Learning (NeurIPS 2018). Cited by: §1.
 [27] (2017) Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747. Cited by: §5.
 [28] (2018) Advances in variational inference. IEEE Transactions on Pattern Analysis and Machine Intelligence 41 (8), pp. 2008–2026. Cited by: §1.
 [29] (2017) Understanding deep learning requires rethinking generalization. 5th International Conference on Learning Representations. Cited by: §5.
 [30] (2017) A survey on multitask learning. arXiv preprint arXiv:1707.08114. Cited by: §1.
 [31] (2014) Facial landmark detection by deep multitask learning. European Conference on Computer Vision, pp. 94–108. Cited by: §5.
Supplementary Material
6.1 Derivation of the Variational Lower Bound of the Information Bottleneck
As the mutual information can be defined as the Kullback-Leibler (KL) divergence between the joint density and the product of the marginal densities, if Assumptions 1-2 are met, then

$I(Z;Y) = \int p(y,z) \log \frac{p(y|z)}{p(y)}\, dy\, dz \ge \int p(y,z) \log q(y|z)\, dy\, dz + H(Y)$  (16)

On the other hand, according to Assumption 1, the posterior $p(z|x)$ can be calculated, and then

$I(Z;X) = \int p(x,z) \log \frac{p(z|x)}{p(z)}\, dx\, dz \le \int p(x)\, p(z|x) \log \frac{p(z|x)}{r(z)}\, dx\, dz$  (17)

where $r(z)$ is an uninformative prior distribution. Since the entropy $H(Y)$ does not depend on the model, combining Eqs. (16) and (17) gives the variational lower bound

$I(Z;Y) - \beta I(Z;X) \ge \int p(x)\, p(y|x)\, p(z|x) \log q(y|z)\, dx\, dy\, dz - \beta \int p(x)\, p(z|x) \log \frac{p(z|x)}{r(z)}\, dx\, dz$  (18)

which is Eq. (4) in the main text.
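When the encoder posterior is a diagonal Gaussian and the uninformative prior is chosen as a standard normal (a common choice, assumed here), the KL term in the bound admits a closed form:

```latex
\mathrm{KL}\big(\mathcal{N}(\mu, \mathrm{diag}(\sigma^2)) \,\|\, \mathcal{N}(0, I)\big)
= \frac{1}{2} \sum_{j=1}^{d} \left( \sigma_j^2 + \mu_j^2 - 1 - \log \sigma_j^2 \right)
```

This is what makes the KL term computable exactly, so that Monte Carlo sampling is only needed for the expected decoding loss.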
6.2 Derivation of Eq. (9)
Let $f^{w_k}(z)$ be the output vector of the decoder network for task $k$, parameterized by the weight vector $w_k$. The scaled likelihood of the $c$-th element of the vector $f^{w_k}(z)$ with dimension $C_k$ is

$q(y_k = c \mid z, \sigma_k) = \frac{\exp\big(f_c^{w_k}(z)/\sigma_k^2\big)}{\sum_{c'=1}^{C_k} \exp\big(f_{c'}^{w_k}(z)/\sigma_k^2\big)}$  (19)

Without squash scaling, the likelihood of the $c$-th element of the vector is

$q(y_k = c \mid z) = \frac{\exp\big(f_c^{w_k}(z)\big)}{\sum_{c'=1}^{C_k} \exp\big(f_{c'}^{w_k}(z)\big)}$  (20)

Therefore, we have

$-\log q(y_k = c \mid z, \sigma_k) = -\frac{1}{\sigma_k^2} \log q(y_k = c \mid z) + \log \frac{\sum_{c'} \exp\big(f_{c'}^{w_k}(z)/\sigma_k^2\big)}{\big(\sum_{c'} \exp\big(f_{c'}^{w_k}(z)\big)\big)^{1/\sigma_k^2}} \approx \frac{1}{\sigma_k^2}\, \mathcal{L}_k + \log \sigma_k$  (21)

where the last step uses the approximation $\frac{1}{\sigma_k^2} \sum_{c'} \exp\big(f_{c'}^{w_k}(z)/\sigma_k^2\big) \approx \big(\sum_{c'} \exp\big(f_{c'}^{w_k}(z)\big)\big)^{1/\sigma_k^2}$ of Kendall et al. (2018), which becomes exact as $\sigma_k \to 1$.
Table 1: Model architectures for the MultiMNIST and MultiFashionMNIST experiments (from left to right: STL, GS/UWL, STVIB, MTVIB).

STL:
Input: 36×36
Conv 64 (Kernel: 3×3; Stride: 1) + BN + ReLU
MaxPool 2×2
Conv 64 (Kernel: 3×3; Stride: 1) + BN + ReLU
MaxPool 2×2
FC 3136→384 + BN + ReLU
FC 384→192 + BN + ReLU
FC 192→10
softmax

GS / UWL (two task heads):
Input: 36×36
Conv 64 (Kernel: 3×3; Stride: 1) + BN + ReLU
MaxPool 2×2
Conv 64 (Kernel: 3×3; Stride: 1) + BN + ReLU
MaxPool 2×2
FC 3136→384 + BN + ReLU
Per task: FC 384→192 + BN + ReLU; FC 192→10; softmax

STVIB:
Input: 36×36
Conv 64 (Kernel: 3×3; Stride: 1) + BN + ReLU
MaxPool 2×2
Conv 64 (Kernel: 3×3; Stride: 1) + BN + ReLU
MaxPool 2×2
FC 3136→1024 + BN + ReLU
FC 1024→1024 + BN + ReLU
FC 1024→1024 + BN + ReLU
FC 1024→256 + BN + ReLU; FC 1024→256 + BN + ReLU
Latent dimension: 256; β: 1e-3
FC 256→384 + BN + ReLU
FC 384→192 + BN + ReLU
FC 192→10
softmax

MTVIB (two task heads):
Same encoder as STVIB (latent dimension: 256; β: 1e-3)
Per task: FC 256→384 + BN + ReLU; FC 384→192; FC 192→10; softmax
Table 2: Model architectures for the MTFL experiments (from left to right: STL, GS/UWL, STVIB, MTVIB).

STL:
Input: 150×150
Conv 64 (Kernel: 5×5; Stride: 2) + BN + ReLU
MaxPool 2×2
Conv 64 (Kernel: 5×5; Stride: 2) + BN + ReLU
MaxPool 2×2
FC 4096→384 + BN + ReLU
FC 384→192 + BN + ReLU
FC 192→10
softmax

GS / UWL (one head per task):
Input: 150×150
Conv 64 (Kernel: 5×5; Stride: 2) + BN + ReLU
MaxPool 2×2
Conv 64 (Kernel: 5×5; Stride: 2) + BN + ReLU
MaxPool 2×2
FC 4096→384 + BN + ReLU
Per task: FC 384→192 + BN + ReLU; FC 192→10; softmax

STVIB:
Input: 150×150
Conv 64 (Kernel: 5×5; Stride: 2) + BN + ReLU
MaxPool 2×2
Conv 64 (Kernel: 5×5; Stride: 2) + BN + ReLU
MaxPool 2×2
FC 4096→1024 + BN + ReLU
FC 1024→1024 + BN + ReLU
FC 1024→1024 + BN + ReLU
FC 1024→256 + BN + ReLU; FC 1024→256 + BN + ReLU
Latent dimension: 256; β: 1e-3
FC 256→384 + BN + ReLU
FC 384→192 + BN + ReLU
FC 192→10
softmax

MTVIB (one head per task):
Same encoder as STVIB (latent dimension: 256; β: 1e-3)
Per task: FC 256→384 + BN + ReLU; FC 384→192; FC 192→10; softmax