1 Introduction
Sampling from an unnormalized target distribution, whose density function is known only up to a scaling factor, is a pivotal problem with many applications in probabilistic inference [bishop, murphy, MAL001]. For this purpose, Markov chain Monte Carlo (MCMC) has been widely used to draw approximate posterior samples; unfortunately, it is often time-consuming and its convergence is hard to assess
[liu2016stein]. Targeting an efficient acceleration of MCMC, several stochastic variational particle-based approaches have been proposed, notably Stochastic Gradient Langevin Dynamics [welling2011bayesian] and Stein Variational Gradient Descent (SVGD) [liu2016stein]. Outstanding among them is SVGD, which comes with a solid theoretical guarantee that its set of particles converges to the target distribution by maintaining a flow of distributions. More specifically, SVGD starts from an arbitrary, easy-to-sample initial distribution and learns the subsequent distribution in the flow by pushing forward the current one using a map $T(\theta) = \theta + \epsilon\phi(\theta)$, where $\epsilon$ is the learning rate and $\phi \in \mathcal{H}^d$, with $\mathcal{H}$ the Reproducing Kernel Hilbert Space (RKHS) corresponding to a kernel $k(\cdot,\cdot)$. It is well-known that, for the Gaussian RBF kernel, letting the kernel width approach $0$ makes the update formula of SVGD at each step asymptotically reduce to typical gradient descent (GD) [liu2016stein], showing the connection between a probabilistic framework like SVGD and a single-objective optimization algorithm. In other words, SVGD can be viewed as a probabilistic version of GD for single-objective optimization. On the other hand, multi-objective optimization (MOO) [desideri2012multiple] aims to optimize a set of objective functions and manifests itself in many real-world applications, such as multi-task learning (MTL) [mahapatra2020multi, sener2018multi], [anderson2021modest], and reinforcement learning [ghosh2013towards, pirotta2016inverse, parisi2014policy]. Leveraging the above insights, it is natural to ask: "Can we derive a probabilistic version of multi-objective optimization?" By answering this question, we enable the application of the Bayesian inference framework to the tasks inherently fulfilled by the MOO framework.
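To make the SVGD update concrete, here is a minimal sketch of one update step with a Gaussian RBF kernel (a simplified NumPy illustration; the function names, fixed bandwidth, and step size are our own choices, not the paper's):

```python
import numpy as np

def rbf_kernel(X, h):
    # Pairwise RBF kernel values k(x_i, x_j) = exp(-||x_i - x_j||^2 / (2 h^2))
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq / (2.0 * h ** 2))

def svgd_step(X, score, h=1.0, eps=0.1):
    """One SVGD update: x_i <- x_i + eps * phi(x_i), where
    phi(x) = (1/n) sum_j [k(x_j, x) score(x_j) + grad_{x_j} k(x_j, x)]."""
    n, _ = X.shape
    K = rbf_kernel(X, h)
    S = np.stack([score(x) for x in X])          # (n, d) scores: grad log p
    # grad_{x_j} k(x_j, x_i) = -(x_j - x_i)/h^2 * k  -> repulsive term
    diffs = X[:, None, :] - X[None, :, :]        # diffs[j, i] = x_j - x_i
    grad_k = -diffs / h ** 2 * K[:, :, None]
    phi = (K @ S + grad_k.sum(axis=0)) / n
    return X + eps * phi
```

Iterating `svgd_step` with the score of a target density drives the particles toward its high-density region, while the kernel-gradient term keeps them spread out.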
Contribution. In this paper, we provide an affirmative answer to that question. In particular, we go beyond SVGD to propose Stochastic Multiple Target Sampling Gradient Descent (MT-SGD), enabling us to sample from multiple target distributions. By considering the push-forward map $T(\theta) = \theta + \epsilon\phi(\theta)$ with $\phi \in \mathcal{H}^d$, we find a closed-form for the optimal push-forward map that moves the current distribution on the flow simultaneously closer to all target distributions. Similar to SVGD, in the case of the Gaussian RBF kernel, when the kernel width approaches $0$, MT-SGD reduces exactly to the multiple-gradient descent algorithm (MGDA) [desideri2012multiple] for multi-objective optimization (MOO). Our MT-SGD can therefore be considered a probabilistic version of gradient-based multi-objective optimization [desideri2012multiple], as expected.
Additionally, in practice, we consider a flow of discrete distributions, in which each distribution is represented by a set of particles. Our observations indicate that MT-SGD globally drives the particles close to all target distributions, leading them to diversify over the joint high-likelihood region for all distributions. It is worth noting that, unlike other multi-particle approaches [lin2019pareto, liu2021profiling, mahapatra2020multi] that lead the particles to diversify on a Pareto front, our MT-SGD orients the particles to diversify on the so-called Pareto common, i.e., the joint high-likelihood region for all distributions (cf. Section 2.4 for more discussion). We argue and empirically demonstrate that this characteristic is essential for the Bayesian setting, whose main goal is to estimate the ensemble accuracy and the uncertainty calibration of a model. In summary, we make the following contributions in this work:
Propose a principled framework that incorporates the power of Stein Variational Gradient Descent into multi-objective optimization. Concretely, our method is motivated by the theoretical analysis of SVGD, and we further derive a formulation that extends the original work and allows sampling from multiple unnormalized distributions.

Demonstrate that our algorithm is readily applicable in the context of multi-task learning. The benefits of MT-SGD are twofold: i) the trained network is Pareto-optimal, in the sense that it cannot be improved on any task without diminishing another, and ii) there is no need for predefined preference vectors as in previous works [lin2019pareto, mahapatra2020multi]; MT-SGD implicitly learns diverse models that universally optimize all tasks.
Conduct comprehensive experiments to verify the behaviors of MT-SGD and demonstrate its superiority over the baselines in a Bayesian setting, with higher ensemble performance and significantly lower calibration errors.
Related works. The work of [desideri2012multiple] proposed the multiple-gradient descent algorithm for multi-objective optimization (MOO), which opened the door for applications of MOO in machine learning and deep learning. Inspired by [desideri2012multiple], MOO has been applied in multi-task learning (MTL) [mahapatra2020multi, sener2018multi], few-shot learning [chen2021pareto, ye2021multi], and knowledge distillation [chennupati2021adaptive, du2020agree]. Specifically, in an early attempt at solving MTL, [sener2018multi] viewed multi-task learning as a multi-objective optimization problem, where a task network consists of a shared feature extractor and a task-specific predictor. In another study, [mahapatra2020multi] developed a gradient-based multi-objective MTL algorithm to find a set of solutions that satisfies user preferences. Also following the idea of learning neural networks conditioned on predefined preference vectors, [lin2019pareto] proposed Pareto MTL, which aims to find a set of well-distributed Pareto solutions representing different trade-offs among tasks. Recently, the work of [liu2021profiling] combined MOO with SVGD [liu2016stein] and Langevin dynamics [welling2011bayesian] to diversify the solutions of MOO. In another line of work, [ye2021multi] proposed a bi-level MOO approach applicable to few-shot learning. In a somewhat different direction, [du2020agree] applied MOO to enable knowledge distillation from multiple teachers and find a better optimization direction when training the student network.
Outline. The paper is organized as follows. In Section 2, we first present our theoretical contribution by reviewing the formalism and providing the point of view adopted to generalize SVGD in the context of MOO. Then, Section 3 introduces an algorithm to showcase the application of our proposed method in the multi-task learning scenario. We report the results of extensive experimental studies performed on various datasets that demonstrate the behaviors and efficiency of MT-SGD in Section 4. Finally, we conclude the paper in Section 5. The complete proofs and experiment setups are deferred to the supplementary material.
2 Multi-Target Sampling Gradient Descent
We first briefly introduce the formulation of multi-target sampling in Section 2.1. Then, Section 2.2 presents our theoretical development and shows how our proposed method applies to this problem. Finally, we detail how to train the proposed method in Section 2.3 and highlight key differences between our method and related work in Section 2.4.
2.1 Problem Setting
Given a set of target distributions $\{\pi_i\}_{i=1}^{K}$ over the parameter $\theta$, we aim to find the optimal distribution that simultaneously minimizes:
(1) $\min_{q \in \mathcal{Q}} \big[\mathrm{KL}(q \,\|\, \pi_1), \dots, \mathrm{KL}(q \,\|\, \pi_K)\big]$
where $\mathrm{KL}(\cdot\,\|\,\cdot)$ represents the Kullback-Leibler divergence and $\mathcal{Q}$ is a family of distributions. The optimization problem (OP) in (1) can be viewed as a multi-objective OP [desideri2012multiple]
on the probability distribution space. Let us denote by $\mathcal{H}$ the Reproducing Kernel Hilbert Space (RKHS) associated with a positive semi-definite (p.s.d.) kernel $k(\cdot,\cdot)$, and by $\mathcal{H}^d$ the space of $d$-dimensional vector functions $\phi = [\phi_1, \dots, \phi_d]$ with $\phi_j \in \mathcal{H}$. Inspired by [liu2016stein], we construct a flow of distributions, departing from a simple distribution $q_0$, that gradually moves closer to all the target distributions. In particular, at each step, assume that $q$ is the currently obtained distribution; the goal is to learn a transformation $T = I + \epsilon\phi$ so that the push-forward distribution $T\#q$ moves closer to $\pi_1, \dots, \pi_K$ simultaneously. Here $I$ denotes the identity operator, $\epsilon$ is a step size, and $\phi \in \mathcal{H}^d$ is a velocity field. Particularly, the problem of finding the optimal transformation is formulated as the multi-objective OP:
(2) $\min_{\phi \in \mathcal{H}^d} \big[\mathrm{KL}\big((I + \epsilon\phi)\#q \,\|\, \pi_1\big), \dots, \mathrm{KL}\big((I + \epsilon\phi)\#q \,\|\, \pi_K\big)\big]$
2.2 Our Theoretical Development
It is worth noting that the transformation $T = I + \epsilon\phi$ defined above is injective when $\epsilon$ is sufficiently small [liu2016stein]. We consider each $\mathrm{KL}\big((I+\epsilon\phi)\#q \,\|\, \pi_i\big)$ as a function w.r.t. $\epsilon$; by applying the first-order Taylor expansion at $\epsilon = 0$, we have:
$\mathrm{KL}\big((I+\epsilon\phi)\#q \,\|\, \pi_i\big) \approx \mathrm{KL}(q \,\|\, \pi_i) - \epsilon\,\langle \phi, \phi_i^* \rangle_{\mathcal{H}^d}$
where $\phi_i^*(\cdot) = \mathbb{E}_{\theta \sim q}\big[k(\theta, \cdot)\nabla_\theta \log \pi_i(\theta) + \nabla_\theta k(\theta, \cdot)\big]$.
Similar to [liu2016stein], the gradient can be calculated as provided in footnote 1, where $\langle \cdot, \cdot \rangle_{\mathcal{H}^d}$ is the dot product in the RKHS.
This means that, for each target distribution $\pi_i$, the steepest descent direction is $\phi_i^*$, along which the KL divergence toward the target distribution $\pi_i$ decreases roughly by $\epsilon \|\phi_i^*\|_{\mathcal{H}^d}^2$. However, this only guarantees a divergence reduction for a single target distribution. Our next aim is hence to find a common direction that reduces the KL divergences w.r.t. all target distributions, which is reflected in the following lemma, showing how to combine the individual steepest descent directions to yield the optimal direction, as summarized in Figure 1.
Lemma 1.
Let $w^* = (w_1^*, \dots, w_K^*)$ be the optimal solution of the optimization problem $\min_{w \in \Delta_K} \big\|\sum_{i=1}^{K} w_i \phi_i^*\big\|_{\mathcal{H}^d}^2$ and $\phi^* = \sum_{i=1}^{K} w_i^* \phi_i^*$, where $\Delta_K = \{w \in \mathbb{R}^K : w \ge 0, \sum_{i=1}^{K} w_i = 1\}$ is the $K$-simplex, then we have $\langle \phi^*, \phi_i^* \rangle_{\mathcal{H}^d} \ge \|\phi^*\|_{\mathcal{H}^d}^2$ for all $i = 1, \dots, K$.
Lemma 1 provides a common descent direction $\phi^*$ so that all KL divergences w.r.t. the target distributions are consistently reduced by roughly $\epsilon \|\phi^*\|_{\mathcal{H}^d}^2$, and Theorem 2 confirms this argument.
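The quadratic problem in Lemma 1 is the classical min-norm subproblem of MGDA: find the convex combination of the individual descent directions with the smallest norm. A small Frank-Wolfe solver sketches it in finite dimensions (illustrative names; in our setting the Gram-matrix entries would be the RKHS inner products of Equation (3)):

```python
import numpy as np

def min_norm_weights(G, iters=100):
    """Frank-Wolfe for w* = argmin_{w in simplex} || sum_i w_i g_i ||^2,
    where row i of G is the descent direction g_i (the MGDA subproblem)."""
    n = G.shape[0]
    w = np.full(n, 1.0 / n)
    M = G @ G.T                       # Gram matrix of pairwise inner products
    for _ in range(iters):
        grad = M @ w                  # gradient of 0.5 * w^T M w
        t = np.argmin(grad)           # best vertex of the simplex
        d = np.zeros(n)
        d[t] = 1.0
        d -= w                        # feasible direction toward vertex t
        denom = d @ M @ d
        if denom <= 1e-12:
            break
        # exact line search for the quadratic objective along d
        gamma = np.clip(-(w @ M @ d) / denom, 0.0, 1.0)
        w = w + gamma * d
    return w
```

For two orthogonal unit gradients, for example, the solver returns equal weights, and the combined direction has a nonnegative inner product with every individual gradient, as Lemma 1 requires.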
Theorem 2.
If there does not exist $w \in \Delta_K$ such that $\sum_{i=1}^{K} w_i \phi_i^* = 0$, then given a sufficiently small step size $\epsilon$, all KL divergences w.r.t. the target distributions are strictly decreased by at least $c\epsilon$, where $c$ is a positive constant.
The next arising question is how to evaluate the matrix $A = (A_{ij})_{i,j=1}^{K}$ with $A_{ij} = \langle \phi_i^*, \phi_j^* \rangle_{\mathcal{H}^d}$ for solving the quadratic problem $\min_{w \in \Delta_K} w^\top A w$. To this end, using some well-known equalities in the RKHS^1, we arrive at the following formula:
(3) $A_{ij} = \mathbb{E}_{\theta, \theta' \sim q}\big[\nabla_\theta \log \pi_i(\theta)^\top k(\theta, \theta') \nabla_{\theta'} \log \pi_j(\theta') + \nabla_\theta \log \pi_i(\theta)^\top \nabla_{\theta'} k(\theta, \theta') + \nabla_\theta k(\theta, \theta')^\top \nabla_{\theta'} \log \pi_j(\theta') + \mathrm{tr}\big(\nabla_\theta \nabla_{\theta'} k(\theta, \theta')\big)\big]$
where $\mathrm{tr}(\cdot)$ denotes the trace of a (square) matrix.
^1 All proofs and derivations can be found in the supplementary material.
2.3 Algorithm for MT-SGD
For the implementation of MT-SGD, we consider $q$ as a discrete distribution over a set of $M$ ($M \ge 1$) particles $\{\theta_m\}_{m=1}^{M}$. The formula to evaluate the matrix entries $A_{ij}$ in Equation (3) becomes the empirical average:
(4) $A_{ij} = \frac{1}{M^2} \sum_{m=1}^{M} \sum_{m'=1}^{M} \big[\nabla \log \pi_i(\theta_m)^\top k(\theta_m, \theta_{m'}) \nabla \log \pi_j(\theta_{m'}) + \nabla \log \pi_i(\theta_m)^\top \nabla_{\theta_{m'}} k(\theta_m, \theta_{m'}) + \nabla_{\theta_m} k(\theta_m, \theta_{m'})^\top \nabla \log \pi_j(\theta_{m'}) + \mathrm{tr}\big(\nabla_{\theta_m} \nabla_{\theta_{m'}} k(\theta_m, \theta_{m'})\big)\big]$
The optimal solution can then be computed as:
(5) $\phi^*(\cdot) = \sum_{i=1}^{K} w_i^* \, \frac{1}{M} \sum_{m=1}^{M} \big[k(\theta_m, \cdot) \nabla \log \pi_i(\theta_m) + \nabla_{\theta_m} k(\theta_m, \cdot)\big]$
The key steps of our MT-SGD are summarized in Algorithm 1, where the set of particles is updated gradually to approach the multiple distributions. Furthermore, the update formula consists of two terms: (i) the first term (involving $\nabla \log \pi_i$) helps push the particles toward the joint high-likelihood region for all distributions, and (ii) the second term (involving $\nabla k$) is a repulsive term that pushes the particles apart when they come close to each other. Finally, we note that our proposed MT-SGD can be applied in the setting where we know the target distributions only up to a scaling factor (e.g., in posterior inference).
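Putting the pieces together, one MT-SGD iteration on a set of particles can be sketched as follows (a NumPy illustration under our own naming; it assumes the score function of each target is available and reuses a small Frank-Wolfe solve for the simplex weights):

```python
import numpy as np

def mtsgd_step(X, scores, h=1.0, eps=0.1, qp_iters=50):
    """One MT-SGD update (a sketch): build a per-target SVGD direction
    phi_k for each score function, then combine them with min-norm
    simplex weights so that every KL term decreases."""
    n, _ = X.shape
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    K = np.exp(-sq / (2.0 * h ** 2))                       # RBF kernel matrix
    diffs = X[:, None, :] - X[None, :, :]
    rep = (-diffs / h ** 2 * K[:, :, None]).sum(axis=0)    # repulsive term
    phis = []
    for score in scores:                                   # driving + repulsive
        S = np.stack([score(x) for x in X])
        phis.append((K @ S + rep) / n)
    P = np.stack([p.ravel() for p in phis])                # (K_targets, n*d)
    # min-norm weights on the simplex via Frank-Wolfe (MGDA subproblem)
    M = P @ P.T
    w = np.full(len(scores), 1.0 / len(scores))
    for _ in range(qp_iters):
        t = np.argmin(M @ w)
        dvec = -w.copy()
        dvec[t] += 1.0
        denom = dvec @ M @ dvec
        if denom <= 1e-12:
            break
        w = w + np.clip(-(w @ M @ dvec) / denom, 0.0, 1.0) * dvec
    phi = sum(wi * p for wi, p in zip(w, phis))
    return X + eps * phi, w
```

With two Gaussian targets centered at opposite points, iterating this step moves the particles into the region between the modes, where both log-likelihoods are comparatively high, rather than splitting them between the two modes.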
Analysis for the case of the RBF kernel.
We now consider a radial basis-function (RBF) kernel of bandwidth $h$, $k(\theta, \theta') = \exp\big(-\|\theta - \theta'\|^2 / (2h^2)\big)$, and examine some asymptotic behaviors.
General case: the elements of the matrix follow Equation (4), with the kernel gradients available in closed form.
Single-particle distribution ($M = 1$): the elements of the matrix become $A_{ij} = \nabla \log \pi_i(\theta)^\top \nabla \log \pi_j(\theta) + d/h^2$; since the additive constant $d/h^2$ shifts the quadratic objective uniformly over the simplex, our formulation reduces exactly to MOO in [desideri2012multiple].
When $h \to 0$: the cross-particle kernel terms vanish, so each particle is asymptotically updated by its own single-particle rule, mirroring how SVGD reduces to GD in this regime.
2.4 Comparison to MOO-SVGD and Other Works
The most closely related work to ours is MOO-SVGD [liu2021profiling]. In a nutshell, ours is principally different from that work, and Figure 2 shows the fundamental difference between our MT-SGD and MOO-SVGD. Our MT-SGD navigates the whole set of particles along a flow of distributions with a theoretical guarantee of globally getting closer to all target distributions, whereas MOO-SVGD uses MOO [desideri2012multiple] to update the particles individually and independently. Additionally, MOO-SVGD employs a repulsive term to encourage particle diversity without any theoretically guaranteed principle to control it; hence it can force the particles to scatter across the multiple distributions. Moreover, MOO-SVGD is not computationally efficient when the number of particles is high, because it requires solving an independent quadratic programming problem for each particle (cf. Section 4.1.1 and Figure 3 for the experiment on a synthetic dataset).
Furthermore, our MT-SGD is expected to globally move the set of particles toward the joint high-likelihood region for all target distributions. Therefore, we do not claim MT-SGD as a method to diversify solutions on a Pareto front for user preferences as in [liu2021profiling, mahapatra2020multi]. Instead, our MT-SGD generates diverse particles on the so-called Pareto common (i.e., the joint high-likelihood region for all target distributions). We argue and empirically demonstrate that, by finding and diversifying the particles on the Pareto common for multiple posterior inferences, our MT-SGD can outperform the baselines on Bayesian-inference metrics such as the ensemble accuracy and the calibration error.
3 Application to Multi-Task Learning
For multi-task learning, we assume $K$ tasks and a training set in which each data example is coupled with $K$ labels, one per task. The model for each task consists of a shared part $\theta_{sh}$ and a non-shared part $\theta_k$ targeting task $k$. The posterior for each task reads
where $\ell_k$ is a loss function and the corresponding predictive likelihood is examined. For our approach, we maintain a set of $M$ models, each consisting of a shared part and $K$ non-shared parts. At each iteration, given the non-shared parts, we sample the shared parts from the multiple distributions as
(6) 
We now apply our proposed MT-SGD to sample the shared parts from the multiple distributions defined in (6) as
(7) 
where the weights are obtained from solving the quadratic programming problem. Here we note that the required score terms can be estimated via the batch gradient of the loss using Equation (6).
Given the updated shared parts, for each task, we update the corresponding non-shared parts by sampling
(8) 
We now apply SVGD [liu2016stein] to sample the non-shared parts for each task from the distribution defined in (8) as
(9) 
where the relevant gradient term can be estimated via the batch loss gradient using Equation (8).
Algorithm 2 summarizes the key steps of our multi-task MT-SGD. Basically, we alternately update the shared parts given the non-shared ones, and vice versa.
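The alternating scheme can be sketched as follows (a deliberately simplified illustration: plain gradient steps stand in for the kernel-smoothed updates, a closed-form two-task min-norm rule replaces the general quadratic program, and all names and losses are hypothetical):

```python
import numpy as np

def alternating_mtl_step(theta_sh, thetas, grads_sh, grads_ns, eps=0.1):
    """One round of the alternation (simplified sketch, K = 2 tasks):
    (a) update the shared part with a common descent direction combining
        the per-task gradients via closed-form min-norm weights,
    (b) update each non-shared part on its own task."""
    g = np.stack([gk(theta_sh, tk) for gk, tk in zip(grads_sh, thetas)])
    g1, g2 = g
    # closed-form solution of min_{gamma in [0,1]} ||gamma g1 + (1-gamma) g2||^2
    gamma = np.clip(((g2 - g1) @ g2) / ((g1 - g2) @ (g1 - g2) + 1e-12),
                    0.0, 1.0)
    w = np.array([gamma, 1.0 - gamma])
    theta_sh = theta_sh - eps * (w[:, None] * g).sum(axis=0)
    # per-task updates of the non-shared parts given the new shared part
    thetas = [tk - eps * gk(theta_sh, tk) for gk, tk in zip(grads_ns, thetas)]
    return theta_sh, thetas
```

On toy quadratic task losses, the shared part settles where no common descent direction remains, while each non-shared part converges on its own objective.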
4 Experiments
In this section, we verify our MT-SGD by evaluating its performance on both synthetic and real-world datasets. For our experiments, we use the RBF kernel. The detailed training configurations are given in the supplementary material.
4.1 Experiments on Toy Datasets
4.1.1 Sampling from Multiple Distributions
We first qualitatively analyze the behavior of the proposed method on sampling from three target distributions, each a mixture of two Gaussians. As can be seen from Figure 3, there is a common high-density region spreading around the origin. Fifty particles are drawn randomly in the space, and the initialization is retained across experiments for a fair comparison.
Figure 3 shows the updated particles of MOO-SVGD and MT-SGD at selected iterations. We observe that the particles from MOO-SVGD spread out and tend to characterize all the modes; some of them even scatter along trajectories due to the conflict in optimizing multiple objectives. By contrast, our method is able to find and cover the common high-density region among the target distributions with well-distributed particles, which illustrates the basic principle of MT-SGD. Additionally, at the same step, the training time for ours is 0.23 min, whereas that of MOO-SVGD is 1.63 min, because MOO-SVGD requires solving an independent quadratic programming problem for each particle at each step.
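Targets like these enter the samplers only through their score functions; a helper along the following lines (with illustrative parameters) builds the score of an isotropic Gaussian mixture known only up to normalization:

```python
import numpy as np

def make_mixture_score(means, weights, sigma2=1.0):
    """Score function (grad log density) of an isotropic Gaussian mixture;
    this is the only quantity the particle updates need from the target."""
    means = np.asarray(means, dtype=float)
    weights = np.asarray(weights, dtype=float)

    def score(x):
        d2 = np.sum((x - means) ** 2, axis=-1)          # per-component sq. dist.
        # component responsibilities via a stable softmax
        logits = np.log(weights) - d2 / (2.0 * sigma2)
        r = np.exp(logits - logits.max())
        r /= r.sum()
        # grad log p(x) = sum_c r_c * (mu_c - x) / sigma2
        return (r[:, None] * (means - x)).sum(axis=0) / sigma2
    return score
```

For two symmetric components, the score vanishes at the midpoint between the means, which is exactly the kind of joint high-likelihood region the particles are driven toward.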
4.1.2 Multiobjective Optimization
The previous experiment illustrates that MT-SGD can be used to sample from multiple target distributions. We next test our method on another low-dimensional multi-objective OP from [zitzler2000comparison]. In particular, we use the two-objective ZDT3 problem, whose Pareto front consists of non-contiguous convex parts, to show that our method simultaneously minimizes both objective functions. Graphically, the simulation results in Figure 4 show the difference in convergence behavior between MOO-SVGD and MT-SGD: the solution set achieved by MOO-SVGD covers the entire Pareto front, while ours distributes and diversifies on the three middle curves (mostly concentrated in the middle curve), which form the Pareto common with low values for both objective functions in ZDT3.
4.2 Experiments on Real Datasets
4.2.1 Experiments on Multi-Fashion+Multi-MNIST Datasets
We apply the proposed MT-SGD method to multi-task learning, following Algorithm 2. Our method is validated on three benchmark datasets: (i) Multi-Fashion+MNIST [NIPS2017_2cad8fa4], (ii) Multi-MNIST, and (iii) Multi-Fashion. Each consists of 120,000 training and 20,000 testing images generated from MNIST [mnist] and FashionMNIST [xiao2017fashion] by overlaying one image on top of another: one in the top-left corner and one in the bottom-right corner. LeNet [mnist] (22,350 params) is employed as the backbone architecture and trained for 100 epochs with SGD in this experimental setup.
Baselines: In the multi-task experiments, the introduced MT-SGD is compared with state-of-the-art baselines including MGDA [sener2018multi], Pareto MTL [lin2019pareto], and MOO-SVGD [liu2021profiling]. We note that to reproduce the results of these baselines, we either use the authors' official implementations released on GitHub or ask the authors for their code. For MOO-SVGD and Pareto MTL, the reported result is the ensemble prediction of five particle models. Additionally, for linear scalarization and MGDA, we train five models independently with different initializations and then ensemble them.
Evaluation metrics: We compare MT-SGD against the baselines regarding both average accuracy and predictive uncertainty. Besides the commonly used accuracy metric, we measure the quality and diversity of the particle models via two other popular Bayesian metrics: the Brier score [brier1950verification, ovadia2019can] and the expected calibration error (ECE) [dawid1982well, naeini2015obtaining].
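For reference, both metrics are straightforward to compute from predicted class probabilities; the following is a minimal sketch (equal-width confidence bins for ECE, one-hot squared error for the Brier score):

```python
import numpy as np

def brier_score(probs, labels):
    """Mean squared error between predicted probabilities and one-hot labels."""
    onehot = np.eye(probs.shape[1])[labels]
    return np.mean(np.sum((probs - onehot) ** 2, axis=1))

def expected_calibration_error(probs, labels, n_bins=10):
    """ECE: bin-weighted gap between accuracy and mean confidence per bin."""
    conf = probs.max(axis=1)
    pred = probs.argmax(axis=1)
    correct = (pred == labels).astype(float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (conf > lo) & (conf <= hi)
        if mask.any():
            ece += mask.mean() * abs(correct[mask].mean() - conf[mask].mean())
    return ece
```

A perfectly calibrated, always-correct predictor scores zero on both metrics, while a confidently wrong one is maximally penalized.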
From Figure 5, we observe that MT-SGD consistently improves model performance across all tasks in both accuracy and Brier score by large margins compared to existing techniques in the literature. The network trained with linear scalarization, as expected, produces inferior ensemble results, while utilizing MOO techniques yields better performance. Overall, our proposed method surpasses the second-best baseline by at least 1% accuracy in every experiment. Furthermore, Table 1 compares these methods in terms of expected calibration error, where MT-SGD again consistently attains the lowest error, illustrating our method's ability to obtain well-calibrated models (the accuracy is closely approximated by the produced confidence score). It is also worth noting that while Pareto MTL has higher accuracy, MOO-SVGD produces slightly better calibration estimates.
Dataset  Task  Linear scalarization  MGDA  Pareto MTL  MOO-SVGD  MT-SGD

Multi-Fashion+MNIST  Top left  20.4%  19.7%  10.1%  8.7%  4.6%
Bottom right  17.1%  14.9%  4.6%  4.8%  3.2%
Multi-MNIST  Top left  16.7%  15.1%  5.0%  5.1%  3.1%
Bottom right  16.9%  16.2%  6.6%  6.7%  3.8%
Multi-Fashion  Top left  14.7%  13.6%  7.8%  5.1%  4.2%
Bottom right  14.6%  13.1%  7.0%  6.7%  4.6%
4.2.2 Experiment on CelebA Dataset
In this experiment, we verify the significance of MT-SGD on a larger neural network: ResNet-18 [he2016deep], which has 11.4M parameters. We take the first 10 binary classification tasks and randomly select a subset of 40k images from the CelebA dataset [liu2015deep]. Note that in this experiment we also consider a Single task baseline, in which 10 models are trained separately; it serves as a strong baseline.
Method  5S  AE  Att  BUE  Bald  Bangs  BL  BN  BlaH  BloH  Average

Acc (%)  Single task  91.8  84.6  80.3  81.9  98.8  94.8  85.8  81.3  89.6  94.2  88.3
MGDA  91.8  84.0  79.0  81.3  98.6  94.6  83.6  81.6  89.8  93.8  87.8
MOO-SVGD  92.3  84.2  78.9  81.2  98.9  94.5  86.4  80.0  90.8  94.8  88.2
MT-SGD  92.6  84.8  80.3  82.9  99.1  95.2  86.3  82.6  91.1  95.0  89.0
ECE (%)  Single task  3.3  2.4  4.4  3.9  0.7  1.6  5.7  6.5  3.1  1.1  3.3
MGDA  1.4  1.1  3.5  7.3  0.3  1.8  6.9  5.4  2.1  1.2  3.1
MOO-SVGD  2.8  1.9  3.1  5.6  0.3  0.5  4.7  3.3  1.3  1.3  2.5
MT-SGD  1.2  1.4  1.7  2.3  0.6  1.7  6.8  1.2  2.1  0.9  2.0
The performance comparison of the mentioned models on the CelebA experiment is shown in Table 2. As clearly seen from the upper part of the table, MT-SGD performs best in all tasks except BL, where MOO-SVGD is slightly better (86.4% vs. 86.3%). Moreover, our method matches or beats Single task, the second-best baseline, in all tasks. Regarding well-calibrated uncertainty estimates, ensemble learning methods exhibit better results. In particular, MT-SGD and MOO-SVGD provide the best average calibration performances, 2.0% and 2.5% respectively, which emphasizes the importance of efficient ensemble learning for enhanced calibration.
5 Conclusion
In this paper, we propose Stochastic Multiple Target Sampling Gradient Descent (MT-SGD), allowing us to sample particles from the joint high-likelihood region of multiple target distributions. Our MT-SGD is theoretically guaranteed to simultaneously reduce the divergences to the target distributions. Interestingly, in the asymptotic case, our MT-SGD reduces exactly to multi-objective optimization. We conduct comprehensive experiments to demonstrate that, by driving the particles to the Pareto common (the joint high-likelihood region of multiple target distributions), MT-SGD can outperform the baselines on ensemble accuracy and on well-known Bayesian metrics such as the expected calibration error and the Brier score.
Supplement to “Stochastic Multiple Target Sampling Gradient Descent”
These appendices provide supplementary details and results for MT-SGD, including our theory development and additional experiments. They consist of the following sections:
Appendix A Proofs of Our Theory Development
A.1 Derivations for the Taylor expansion formulation
We have
(10) 
Proof of Equation (10): Since $T$ is assumed to be an invertible mapping, we have the following equations:
and
(11) 
According to the change-of-variables formula, we have:
Using this, the first term in Equation (11) is rewritten as:
(12) 
Similarly, the second term in Equation (11) could be expressed as:
(13)  
Using the reproducing property of the RKHS, namely $f(\theta) = \langle f, k(\theta, \cdot) \rangle_{\mathcal{H}}$ for $f \in \mathcal{H}$, we find that
(14) 
Representing each particle as a row vector, the corresponding row vector is given by:
(15) 
A.2 Proof of Lemma 1
Before proving this lemma, let us restate it:
Lemma 3.
Let $w^* = (w_1^*, \dots, w_K^*)$ be the optimal solution of the optimization problem $\min_{w \in \Delta_K} \big\|\sum_{i=1}^{K} w_i \phi_i^*\big\|_{\mathcal{H}^d}^2$ and $\phi^* = \sum_{i=1}^{K} w_i^* \phi_i^*$, where $\Delta_K = \{w \in \mathbb{R}^K : w \ge 0, \sum_{i=1}^{K} w_i = 1\}$ is the $K$-simplex, then we have $\langle \phi^*, \phi_i^* \rangle_{\mathcal{H}^d} \ge \|\phi^*\|_{\mathcal{H}^d}^2$ for all $i = 1, \dots, K$.
A.3 Derivations for the matrix formulation in Equation (3)
We have
Therefore, we find that
which is equivalent to
Now, note that
hence we gain
which follows that
Putting these results together, we obtain that
As a consequence, we obtain the conclusion of Equation (3).
A.4 Proof of Theorem 2
Before proving this theorem, let us restate it:
Theorem 4.
If there does not exist $w \in \Delta_K$ such that $\sum_{i=1}^{K} w_i \phi_i^* = 0$, then given a sufficiently small step size $\epsilon$, all KL divergences w.r.t. the target distributions are strictly decreased by at least $c\epsilon$, where $c$ is a positive constant.
Proof.
We have, for all $i$, that