Stochastic Multiple Target Sampling Gradient Descent

06/04/2022
by   Hoang Phan, et al.

Sampling from an unnormalized target distribution is an essential problem with many applications in probabilistic inference. Stein Variational Gradient Descent (SVGD) has been shown to be a powerful method that iteratively updates a set of particles to approximate the distribution of interest. Furthermore, when analysing its asymptotic properties, SVGD reduces exactly to a single-objective optimization problem and can therefore be viewed as a probabilistic version of this single-objective optimization algorithm. A natural question then arises: "Can we derive a probabilistic version of multi-objective optimization?". To answer this question, we propose Stochastic Multiple Target Sampling Gradient Descent (MT-SGD), enabling us to sample from multiple unnormalized target distributions. Specifically, our MT-SGD conducts a flow of intermediate distributions gradually orienting to multiple target distributions, which allows the sampled particles to move to the joint high-likelihood region of the target distributions. Interestingly, the asymptotic analysis shows that our approach reduces exactly to the multiple-gradient descent algorithm for multi-objective optimization, as expected. Finally, we conduct comprehensive experiments to demonstrate the merit of our approach to multi-task learning.


1 Introduction

Sampling from an unnormalized target distribution, whose density function is known only up to a scaling factor, is a pivotal problem with many applications in probabilistic inference [bishop, murphy, MAL-001]. For this purpose, Markov chain Monte Carlo (MCMC) has been widely used to draw approximate posterior samples, but it is often time-consuming and its convergence is hard to assess [liu2016stein]. Targeting a more efficient alternative to MCMC, several stochastic variational particle-based approaches have been proposed, notably Stochastic Gradient Langevin Dynamics [welling2011bayesian] and Stein Variational Gradient Descent (SVGD) [liu2016stein]. Outstanding among them is SVGD, which offers a solid theoretical guarantee that the set of particles converges to the target distribution by maintaining a flow of distributions. More specifically, SVGD starts from an arbitrary, easy-to-sample initial distribution and learns the subsequent distribution in the flow by push-forwarding the current one using a map $T(\theta) = \theta + \epsilon\, \phi(\theta)$, where $\epsilon$ is the learning rate and $\phi \in \mathcal{H}^d$, with $\mathcal{H}$ the Reproducing Kernel Hilbert Space corresponding to a kernel $k(\cdot,\cdot)$. It is well known that, for the Gaussian RBF kernel, letting the kernel width approach $0$ makes the update formula of SVGD at each step asymptotically reduce to standard gradient descent (GD) [liu2016stein], showing the connection between a probabilistic framework like SVGD and a single-objective optimization algorithm. In other words, SVGD can be viewed as a probabilistic version of GD for single-objective optimization.
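As background, a minimal NumPy sketch of the SVGD update for a single target distribution is given below; the RBF bandwidth `h`, step size `eps`, and the Gaussian example target are illustrative choices of ours, not taken from the paper.

```python
import numpy as np

def rbf_kernel(X, h):
    """RBF kernel matrix K[i, j] = exp(-||x_i - x_j||^2 / h) and its gradients w.r.t. x_i."""
    diffs = X[:, None, :] - X[None, :, :]            # diffs[i, j] = x_i - x_j, shape (M, M, d)
    sq_dists = np.sum(diffs ** 2, axis=-1)           # (M, M)
    K = np.exp(-sq_dists / h)
    grad_K = -2.0 * diffs / h * K[:, :, None]        # d k(x_i, x_j) / d x_i, shape (M, M, d)
    return K, grad_K

def svgd_step(X, score_fn, h=1.0, eps=1e-2):
    """One SVGD update: X has shape (M, d); score_fn(X) returns grad log p at each particle."""
    M = X.shape[0]
    K, grad_K = rbf_kernel(X, h)
    scores = score_fn(X)                             # (M, d)
    # phi(x_j) = (1/M) * sum_i [ k(x_i, x_j) * score(x_i) + grad_{x_i} k(x_i, x_j) ]
    phi = (K.T @ scores + grad_K.sum(axis=0)) / M
    return X + eps * phi

# Example: sample from a standard Gaussian target, whose score is -x.
X = np.random.randn(50, 2) + 5.0
for _ in range(500):
    X = svgd_step(X, score_fn=lambda X: -X)
```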

On the other side, multi-objective optimization (MOO) [desideri2012multiple] aims to optimize a set of objective functions simultaneously and manifests itself in many real-world problems, such as multi-task learning (MTL) [mahapatra2020multi, sener2018multi], natural language processing [anderson2021modest], and reinforcement learning [ghosh2013towards, pirotta2016inverse, parisi2014policy]. Leveraging the above insights, it is natural to ask: “Can we derive a probabilistic version of multi-objective optimization?”. By answering this question, we enable the application of the Bayesian inference framework to the tasks inherently fulfilled by the MOO framework.

Contribution. In this paper, we provide an affirmative answer to that question. In particular, we go beyond SVGD and propose Stochastic Multiple Target Sampling Gradient Descent (MT-SGD), enabling us to sample from multiple target distributions. By considering the push-forward map $T(\theta) = \theta + \epsilon\, \phi(\theta)$ with $\phi \in \mathcal{H}^d$, we can find a closed form for the optimal push-forward map that moves the current distribution on the flow simultaneously closer to all target distributions. Similar to SVGD, in the case of the Gaussian RBF kernel, when the kernel width approaches $0$, MT-SGD reduces exactly to the multiple-gradient descent algorithm (MGDA) [desideri2012multiple] for multi-objective optimization (MOO). Our MT-SGD can therefore be considered a probabilistic version of MGDA for multi-objective optimization [desideri2012multiple], as expected.
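For reference, the MGDA step that MT-SGD asymptotically recovers chooses the minimum-norm point in the convex hull of the per-objective gradients; a small sketch (using SciPy's SLSQP solver, an implementation choice of ours) is shown below.

```python
import numpy as np
from scipy.optimize import minimize

def mgda_direction(grads):
    """Common descent direction of MGDA: the minimum-norm convex combination
    of the per-objective gradients. `grads` has shape (K, d)."""
    K = grads.shape[0]
    A = grads @ grads.T                                       # Gram matrix A[i, j] = <g_i, g_j>
    cons = ({'type': 'eq', 'fun': lambda w: w.sum() - 1.0},)  # weights lie on the simplex
    res = minimize(lambda w: w @ A @ w, np.full(K, 1.0 / K),
                   method='SLSQP', bounds=[(0.0, 1.0)] * K, constraints=cons)
    w = res.x
    return w @ grads, w                                       # descent direction and weights

# Toy usage: two conflicting 2-D gradients.
g = np.array([[1.0, 0.2], [-0.5, 1.0]])
direction, weights = mgda_direction(g)
```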

Additionally, in practice, we consider a flow of discrete distributions, in which each distribution is represented by a set of particles. Our observations indicate that MT-SGD globally drives the particles closer to all target distributions, leading them to diversify on the joint high-likelihood region of all distributions. It is worth noting that, unlike other multi-particle approaches [lin2019pareto, liu2021profiling, mahapatra2020multi] that lead the particles to diversify on a Pareto front, our MT-SGD orients the particles to diversify on the so-called Pareto common (i.e., the joint high-likelihood region of all distributions) (cf. Section 2.4 for more discussion). We argue and empirically demonstrate that this characteristic is essential for the Bayesian setting, whose main goal is to estimate the ensemble accuracy and the uncertainty calibration of a model. In summary, we make the following contributions in this work:

  • Propose a principled framework that incorporates the power of Stein Variational Gradient Descent into multi-objective optimization. Concretely, our method is motivated by the theoretical analysis of SVGD, and we further derive a formulation that extends the original work and allows sampling from multiple unnormalized distributions.

  • Demonstrate that our algorithm is readily applicable in the context of multi-task learning. The benefits of MT-SGD are twofold: i) the trained network is Pareto-optimal, i.e., it could not be improved on any task without diminishing another, and ii) there is no need for predefined preference vectors as in previous works [lin2019pareto, mahapatra2020multi]; MT-SGD implicitly learns diverse models that universally optimize all tasks.

  • Conduct comprehensive experiments to verify the behaviors of MT-SGD and demonstrate its superiority over the baselines in a Bayesian setting, with higher ensemble performance and significantly lower calibration errors.

Related works. The work of [desideri2012multiple] proposed the multiple-gradient descent algorithm for multi-objective optimization (MOO), which opened the door for applications of MOO in machine learning and deep learning. Inspired by [desideri2012multiple], MOO has been applied to multi-task learning (MTL) [mahapatra2020multi, sener2018multi], few-shot learning [chen2021pareto, ye2021multi], and knowledge distillation [chennupati2021adaptive, du2020agree]. Specifically, in an earlier attempt at solving MTL, [sener2018multi] viewed multi-task learning as a multi-objective optimization problem, where a task network consists of a shared feature extractor and a task-specific predictor. In another study, [mahapatra2020multi] developed a gradient-based multi-objective MTL algorithm to find a set of solutions that satisfies user preferences. Also following the idea of learning neural networks conditioned on predefined preference vectors, [lin2019pareto] proposed Pareto MTL, aiming to find a set of well-distributed Pareto solutions that represent different trade-offs among the tasks. Recently, [liu2021profiling] leveraged MOO with SVGD [liu2016stein] and Langevin dynamics [welling2011bayesian] to diversify the solutions of MOO. In another line of work, [ye2021multi] proposed a bi-level MOO that can be applied to few-shot learning. In a somewhat different direction, [du2020agree] applied MOO to enable knowledge distillation from multiple teachers and to find a better optimization direction when training the student network.

Outline. The paper is organized as follows. In Section 2, we first present our theoretical contribution by reviewing the formalism and providing the point of view adopted to generalize SVGD in the context of MOO. Then, Section 3 introduces an algorithm to showcase the application of our proposed method in the multi-task learning scenario. We report the results of extensive experimental studies performed on various datasets that demonstrate the behaviors and efficiency of MT-SGD in Section 4. Finally, we conclude the paper in Section 5. The complete proofs and experiment setups are deferred to the supplementary material.

2 Multi-Target Sampling Gradient Descent

We first briefly introduce the formulation of the multi-target sampling in Section 2.1. Second, Section 2.2 presents our theoretical development and shows how our proposed method is applicable to this problem. Finally, we detail how to train the proposed method in Section 2.3 and highlight key differences between our method and related work in Section 2.4.

2.1 Problem Setting

Given a set of target distributions $\{p_i(\theta)\}_{i=1}^{K}$ over a parameter $\theta \in \mathbb{R}^d$, we aim to find the optimal distribution $q^*$ that simultaneously minimizes:

$$\min_{q \in \mathcal{Q}} \ \Big[ \mathrm{KL}\big(q \,\|\, p_1\big), \dots, \mathrm{KL}\big(q \,\|\, p_K\big) \Big], \qquad (1)$$

where $\mathrm{KL}$ represents the Kullback-Leibler divergence and $\mathcal{Q}$ is a family of distributions.

The optimization problem (OP) in (1) can be viewed as a multi-objective OP [desideri2012multiple] on the probability distribution space. Let us denote by $\mathcal{H}$ the Reproducing Kernel Hilbert Space (RKHS) associated with a positive semi-definite (p.s.d.) kernel $k(\cdot,\cdot)$, and by $\mathcal{H}^d$ the space of $d$-dimensional vector functions $\phi = [\phi_1, \dots, \phi_d]$ with each $\phi_j \in \mathcal{H}$. Inspired by [liu2016stein], we construct a flow of distributions, departing from a simple distribution $q_0$, that gradually moves closer to all the target distributions. In particular, at each step, assume that $q$ is the currently obtained distribution, and the goal is to learn a transformation $T = \mathrm{id} + \epsilon\, \phi$ so that the push-forward distribution $q_{[T]} = T_{\#} q$ moves closer to $p_1, \dots, p_K$ simultaneously. Here we use $\mathrm{id}$ to denote the identity operator, $\epsilon > 0$ is a step size, and $\phi \in \mathcal{H}^d$ is a velocity field. Particularly, the problem of finding the optimal transformation is formulated as:

$$\min_{\phi \in \mathcal{H}^d} \ \Big[ \mathrm{KL}\big(q_{[T]} \,\|\, p_1\big), \dots, \mathrm{KL}\big(q_{[T]} \,\|\, p_K\big) \Big]. \qquad (2)$$

2.2 Our Theoretical Development

It is worth noting that the transformation $T = \mathrm{id} + \epsilon\, \phi$ defined above is injective when $\epsilon$ is sufficiently small [liu2016stein]. Considering each $\mathrm{KL}\big(q_{[T]} \,\|\, p_i\big)$ as a function of $\epsilon$ and applying a first-order Taylor expansion at $\epsilon = 0$, we have:

$$\mathrm{KL}\big(q_{[T]} \,\|\, p_i\big) \approx \mathrm{KL}\big(q \,\|\, p_i\big) - \epsilon\, \langle \phi, \phi_i^* \rangle_{\mathcal{H}^d},$$

where $\phi_i^*(\cdot) = \mathbb{E}_{\theta \sim q}\big[ k(\theta, \cdot)\, \nabla_\theta \log p_i(\theta) + \nabla_\theta k(\theta, \cdot) \big]$.

Similar to [liu2016stein], the gradient can be calculated as (see Appendix A.1 for the derivation)

$$\nabla_\epsilon\, \mathrm{KL}\big(q_{[T]} \,\|\, p_i\big)\big|_{\epsilon = 0} = - \langle \phi, \phi_i^* \rangle_{\mathcal{H}^d},$$

where $\langle \cdot, \cdot \rangle_{\mathcal{H}^d}$ is the dot product in the RKHS.

Figure 1: How to find the optimal descent direction $\phi^*$.

This means that, for each target distribution $p_i$, the steepest descent direction is $\phi_i^*$, along which the KL divergence of interest decreases roughly by $\epsilon\, \|\phi_i^*\|^2_{\mathcal{H}^d}$ toward the target distribution $p_i$. However, this only guarantees a divergence reduction for a single target distribution. Our next aim is hence to find a common direction that reduces the KL divergences w.r.t. all target distributions, which is reflected in the following lemma, showing how to combine the individual steepest descent directions to yield the optimal direction $\phi^*$, as summarized in Figure 1.

Lemma 1.

Let $w^* = (w_1^*, \dots, w_K^*)$ be the optimal solution of the optimization problem $\min_{w \in \Delta_K} \big\| \sum_{i=1}^{K} w_i\, \phi_i^* \big\|^2_{\mathcal{H}^d}$ and $\phi^* = \sum_{i=1}^{K} w_i^*\, \phi_i^*$, where $\Delta_K = \{w \in \mathbb{R}^K : w_i \geq 0,\ \sum_{i=1}^{K} w_i = 1\}$ is the $K$-simplex. Then we have $\langle \phi^*, \phi_i^* \rangle_{\mathcal{H}^d} \geq \|\phi^*\|^2_{\mathcal{H}^d}$ for all $i = 1, \dots, K$.

Lemma 1 provides a common descent direction $\phi^*$ such that all KL divergences w.r.t. the target distributions are consistently reduced by roughly $\epsilon\, \|\phi^*\|^2_{\mathcal{H}^d}$, and Theorem 2 confirms this argument.

Theorem 2.

If there does not exist $w \in \Delta_K$ such that $\sum_{i=1}^{K} w_i\, \phi_i^* = 0$, then, given a sufficiently small step size $\epsilon$, all KL divergences w.r.t. the target distributions are strictly decreased by at least $\epsilon C$, where $C$ is a positive constant.

The next arising question is how to evaluate the matrix $A = [A_{ij}]_{i,j=1}^{K}$ with $A_{ij} = \langle \phi_i^*, \phi_j^* \rangle_{\mathcal{H}^d}$ for solving the quadratic problem $\min_{w \in \Delta_K} w^\top A\, w$. To this end, using some well-known equalities in the RKHS (all proofs and derivations can be found in the supplementary material), we arrive at the following formula:

$$A_{ij} = \mathbb{E}_{\theta, \theta' \sim q}\Big[ k(\theta, \theta')\, \nabla_\theta \log p_i(\theta)^\top \nabla_{\theta'} \log p_j(\theta') + \nabla_{\theta'} k(\theta, \theta')^\top \nabla_\theta \log p_i(\theta) + \nabla_\theta k(\theta, \theta')^\top \nabla_{\theta'} \log p_j(\theta') + \mathrm{tr}\big( \nabla_\theta \nabla_{\theta'} k(\theta, \theta') \big) \Big], \qquad (3)$$

where $\mathrm{tr}(\cdot)$ denotes the trace of a (square) matrix.

2.3 Algorithm for MT-SGD

For the implementation of MT-SGD, we consider $q$ as a discrete distribution over a set of $M$ ($M \geq 1$) particles $\{\theta_m\}_{m=1}^{M}$. The formula to evaluate $A_{ij}$ in Equation (3) becomes:

$$A_{ij} = \frac{1}{M^2} \sum_{m=1}^{M} \sum_{m'=1}^{M} \Big[ k(\theta_m, \theta_{m'})\, \nabla \log p_i(\theta_m)^\top \nabla \log p_j(\theta_{m'}) + \nabla_{\theta_{m'}} k(\theta_m, \theta_{m'})^\top \nabla \log p_i(\theta_m) + \nabla_{\theta_m} k(\theta_m, \theta_{m'})^\top \nabla \log p_j(\theta_{m'}) + \mathrm{tr}\big( \nabla_{\theta_m} \nabla_{\theta_{m'}} k(\theta_m, \theta_{m'}) \big) \Big]. \qquad (4)$$

The optimal update direction can then be computed from the empirical per-target directions

$$\hat{\phi}_i^*(\theta) = \frac{1}{M} \sum_{m=1}^{M} \Big[ k(\theta_m, \theta)\, \nabla_{\theta_m} \log p_i(\theta_m) + \nabla_{\theta_m} k(\theta_m, \theta) \Big], \qquad (5)$$

as $\phi^*(\theta) = \sum_{i=1}^{K} w_i^*\, \hat{\phi}_i^*(\theta)$. The key steps of our MT-SGD are summarized in Algorithm 1, where the set of particles is updated gradually to approach the multiple distributions $p_1, \dots, p_K$. Furthermore, the update formula consists of two terms: (i) the first term (i.e., the one involving $\nabla \log p_i$) helps to push the particles toward the joint high-likelihood region of all distributions, and (ii) the second term (i.e., the one involving $\nabla k$) is a repulsive term that pushes the particles apart when they come close to each other. Finally, we note that our proposed MT-SGD can be applied in settings where we only know the target distributions up to a scaling factor (e.g., in posterior inference).

0:  Multiple unnormalized target densities $p_1, \dots, p_K$.
0:  The optimal particles $\{\theta_m\}_{m=1}^{M}$.
1:  Initialize a set of particles $\{\theta_m^{(0)}\}_{m=1}^{M}$.
2:  for $t = 0$ to $T - 1$ do
3:     Form the matrix $A$ with the elements $A_{ij}$ computed as in Equation (4).
4:     Solve the QP $\min_{w \in \Delta_K} w^\top A\, w$ to find the optimal weights $w^*$.
5:     Compute the optimal direction $\phi^* = \sum_{i=1}^{K} w_i^*\, \hat{\phi}_i^*$, where $\hat{\phi}_i^*$ is defined in Equation (5).
6:     Update $\theta_m^{(t+1)} = \theta_m^{(t)} + \epsilon\, \phi^*\big(\theta_m^{(t)}\big)$ for $m = 1, \dots, M$.
7:  end for
8:  return $\{\theta_m^{(T)}\}_{m=1}^{M}$.
Algorithm 1 Pseudocode for MT-SGD.
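To make the loop in Algorithm 1 concrete, the following NumPy/SciPy sketch implements one MT-SGD iteration under our own illustrative choices (RBF bandwidth `h`, step size `eps`, SLSQP as the simplex-QP solver); `score_fns` stands for the per-target gradients $\nabla_\theta \log p_i$, which is all the algorithm needs since the targets may be unnormalized.

```python
import numpy as np
from scipy.optimize import minimize

def rbf_with_grads(X, h):
    """RBF kernel K[m, n] = exp(-||x_m - x_n||^2 / h), its gradient w.r.t. the first
    argument, and the trace of the mixed second derivative, for particles X of shape (M, d)."""
    d = X.shape[1]
    diffs = X[:, None, :] - X[None, :, :]                  # diffs[m, n] = x_m - x_n
    sq = np.sum(diffs ** 2, axis=-1)
    K = np.exp(-sq / h)
    gradK = -2.0 / h * diffs * K[..., None]                # d k / d x_m, shape (M, M, d)
    # tr( d^2 k / (d x_m d x_n) ) = (2 d / h - 4 ||x_m - x_n||^2 / h^2) * k
    trace_term = (2.0 * d / h - 4.0 * sq / h ** 2) * K
    return K, gradK, trace_term

def mt_sgd_step(X, score_fns, h=1.0, eps=1e-2):
    """One MT-SGD update. X: (M, d) particles; score_fns: list of K functions,
    each mapping (M, d) particles to per-particle gradients of log p_i."""
    M, K_num = X.shape[0], len(score_fns)
    K, gradK, trace_term = rbf_with_grads(X, h)
    scores = np.stack([s(X) for s in score_fns])           # (K, M, d)

    # Empirical A[i, j] = <phi_i^*, phi_j^*> as in Equation (4).
    A = np.zeros((K_num, K_num))
    for i in range(K_num):
        for j in range(K_num):
            term1 = np.einsum('mn,md,nd->', K, scores[i], scores[j])
            term2 = np.einsum('nmd,md->', gradK, scores[i])   # grad w.r.t. the second argument
            term3 = np.einsum('mnd,nd->', gradK, scores[j])   # grad w.r.t. the first argument
            A[i, j] = (term1 + term2 + term3 + trace_term.sum()) / M ** 2

    # Solve the simplex-constrained QP: min_w w^T A w.
    cons = ({'type': 'eq', 'fun': lambda w: w.sum() - 1.0},)
    res = minimize(lambda w: w @ A @ w, np.full(K_num, 1.0 / K_num),
                   method='SLSQP', bounds=[(0.0, 1.0)] * K_num, constraints=cons)
    w = res.x

    # Combined direction (Equation (5)): weighted driving terms plus the shared repulsive term.
    driving = np.einsum('i,mn,imd->nd', w, K, scores) / M
    repulsive = gradK.sum(axis=0) / M
    return X + eps * (driving + repulsive), w

# Toy usage: two Gaussian targets centred at +2 and -2 along the first axis.
score_fns = [lambda X: -(X - np.array([2.0, 0.0])), lambda X: -(X + np.array([2.0, 0.0]))]
X = np.random.randn(30, 2)
for _ in range(300):
    X, w = mt_sgd_step(X, score_fns)
```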

Analysis for the case of RBF kernel.

We now consider a radial basis function (RBF) kernel of bandwidth $\sigma$, $k(\theta, \theta') = \exp\big(-\|\theta - \theta'\|^2 / (2\sigma^2)\big)$, and examine some asymptotic behaviors.

 General case: the elements of the matrix $A$ are obtained by substituting the RBF kernel and its derivatives into Equation (4).

 Single-particle distribution ($M = 1$): the repulsive term vanishes, the update direction becomes $\sum_{i=1}^{K} w_i^*\, \nabla_\theta \log p_i(\theta)$,

and our formulation reduces exactly to MOO in [desideri2012multiple].

 When $\sigma \to 0$: the interaction between distinct particles vanishes, and the update of each particle reduces to a multiple-gradient descent update, consistent with the asymptotic analysis discussed in the introduction.

2.4 Comparison to MOO-SVGD and Other Works

The most closely related work to ours is MOO-SVGD [liu2021profiling]. In a nutshell, ours is principally different from that work. Figure 2 shows the fundamental difference between our MT-SGD and MOO-SVGD. Our MT-SGD navigates the particles through a flow of intermediate distributions, with a theoretical guarantee of globally getting closer to the multiple target distributions, whereas MOO-SVGD uses MOO [desideri2012multiple] to update the particles individually and independently. Additionally, MOO-SVGD employs a repulsive term to encourage particle diversity without any theoretically guaranteed principle to control it, hence it can force the particles to scatter across the multiple distributions. Moreover, MOO-SVGD is not computationally efficient when the number of particles is high because it requires solving an independent quadratic programming problem for each particle (cf. Section 4.1.1 and Figure 3 for the experiment on a synthetic dataset).

Figure 2: Our MT-SGD moves the particles through a flow of distributions to globally get closer to the two target distributions (i.e., the blue and green ones). Differently, MOO-SVGD uses MOO [desideri2012multiple] to move the particles individually and independently, with diversity enforced by repulsive forces among the particles. Since there is no principle to control these repulsive forces, they can push the particles to scatter over the two distributions.

Furthermore, MT-SGD is expected to globally move the set of particles to the joint high-likelihood region of all target distributions. Therefore, we do not claim MT-SGD as a method to diversify solutions on a Pareto front according to user preferences as in [liu2021profiling, mahapatra2020multi]. Alternatively, our MT-SGD generates diverse particles on the so-called Pareto common (i.e., the joint high-likelihood region of all target distributions). We argue and empirically demonstrate that by finding and diversifying the particles on the Pareto common for multiple posterior inferences, MT-SGD can outperform the baselines on Bayesian-inference metrics such as the ensemble accuracy and the calibration error.

3 Application to Multi-Task Learning

For multi-task learning, we assume we have $K$ tasks and a training set $\mathcal{D} = \{(x_n, y_n^1, \dots, y_n^K)\}_{n=1}^{N}$, where $x_n$ is a data example and $y_n^1, \dots, y_n^K$ are its labels for the $K$ tasks. The model for each task $i$ consists of a shared part $\theta^{sh}$ and a non-shared part $\theta^{i}$ targeting task $i$. The posterior for each task reads

$$p_i\big(\theta^{sh}, \theta^{i} \mid \mathcal{D}\big) \propto p\big(\theta^{sh}, \theta^{i}\big) \prod_{n=1}^{N} p\big(y_n^i \mid x_n, \theta^{sh}, \theta^{i}\big), \quad \text{with } p\big(y \mid x, \theta^{sh}, \theta^{i}\big) \propto \exp\big\{-\ell\big(f(x; \theta^{sh}, \theta^{i}), y\big)\big\},$$

where $\ell$ is a loss function and $f(\cdot; \theta^{sh}, \theta^{i})$ is the task network whose predictive likelihood is examined.
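As a concrete illustration of this construction, a short PyTorch sketch of the per-task unnormalized log-posterior is given below; the isotropic Gaussian prior and the `loss_fn`/`prior_std` arguments are our own illustrative assumptions.

```python
import torch

def task_log_posterior(shared, head, x, y, loss_fn, prior_std=1.0):
    """Unnormalized log-posterior of one task: a log-likelihood defined through the
    (negative) loss plus an isotropic Gaussian log-prior over all parameters."""
    log_lik = -loss_fn(head(shared(x)), y)
    params = list(shared.parameters()) + list(head.parameters())
    log_prior = -0.5 * sum((p ** 2).sum() for p in params) / prior_std ** 2
    return log_lik + log_prior
```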

For our approach, we maintain a set of $M$ models $\{\theta_m\}_{m=1}^{M}$ with $\theta_m = \big(\theta_m^{sh}, \theta_m^{1}, \dots, \theta_m^{K}\big)$, where $\theta_m^{sh}$ is the shared part and $\theta_m^{i}$ is the non-shared part for task $i$. At each iteration, given the non-shared parts $\theta_{1:M}^{i}$ with $i = 1, \dots, K$, we sample the shared parts from the multiple distributions

$$p_i\big(\theta^{sh} \mid \mathcal{D}, \theta_{1:M}^{i}\big), \quad i = 1, \dots, K. \qquad (6)$$

We now apply our proposed MT-SGD to sample the shared parts from the multiple distributions defined in (6) as

$$\theta_m^{sh} \leftarrow \theta_m^{sh} + \epsilon \sum_{i=1}^{K} w_i^*\, \hat{\phi}_i^*\big(\theta_m^{sh}\big), \quad m = 1, \dots, M, \qquad (7)$$

where $\hat{\phi}_i^*$ is computed as in Equation (5) for the target distributions in (6), and $w_1^*, \dots, w_K^*$ are the weights obtained from solving the quadratic programming problem. Here we note that the score $\nabla_{\theta^{sh}} \log p_i$ can be estimated via the batch gradient of the loss using Equation (6).

Given the updated shared parts $\theta_{1:M}^{sh}$, for each task $i$, we update the corresponding non-shared parts by sampling from

$$p_i\big(\theta^{i} \mid \mathcal{D}, \theta_{1:M}^{sh}\big). \qquad (8)$$

We now apply SVGD [liu2016stein] to sample the non-shared parts for each task $i$ from the distribution defined in (8) as

$$\theta_m^{i} \leftarrow \theta_m^{i} + \epsilon\, \frac{1}{M} \sum_{m'=1}^{M} \Big[ k\big(\theta_{m'}^{i}, \theta_m^{i}\big)\, \nabla_{\theta_{m'}^{i}} \log p_i\big(\theta_{m'}^{i} \mid \mathcal{D}, \theta_{m'}^{sh}\big) + \nabla_{\theta_{m'}^{i}} k\big(\theta_{m'}^{i}, \theta_m^{i}\big) \Big], \qquad (9)$$

where the score term $\nabla_{\theta^{i}} \log p_i$ can be estimated via the batch loss gradient using Equation (8).

0:  A training set $\mathcal{D} = \{(x_n, y_n^1, \dots, y_n^K)\}_{n=1}^{N}$.
0:  The models $\{\theta_m\}_{m=1}^{M}$ with $\theta_m = (\theta_m^{sh}, \theta_m^{1}, \dots, \theta_m^{K})$.
1:  Initialize a set of particles $\{\theta_m\}_{m=1}^{M}$.
2:  for $t = 1$ to $T$ do
3:     for each mini-batch do
4:        Update the shared parts $\theta_{1:M}^{sh}$ using Equation (7).
5:        for $i = 1$ to $K$ do
6:           Update the non-shared parts $\theta_{1:M}^{i}$ using Equation (9).
7:        end for
8:     end for
9:  end for
10:  return $\{\theta_m\}_{m=1}^{M}$.
Algorithm 2 Pseudocode for multi-task learning with MT-SGD.

Algorithm 2 summarizes the key steps of our multi-task MT-SGD. Basically, we alternately update the shared parts given the non-shared ones and vice versa.
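As an illustration of this alternating scheme in code, the PyTorch sketch below runs an MT-SGD step on the shared parameters followed by per-task SVGD steps on the task-specific heads. It is a simplified sketch under our own assumptions (parameters flattened into vectors, scores approximated by negative batch-loss gradients, the QP weights `weights` assumed to be solved beforehand as in the earlier sketch), not the authors' implementation.

```python
import torch
from torch.nn.utils import parameters_to_vector, vector_to_parameters

def flat_grad(loss, params):
    """Gradient of `loss` w.r.t. `params`, flattened into a single vector."""
    grads = torch.autograd.grad(loss, params)
    return torch.cat([g.reshape(-1) for g in grads])

def svgd_direction(particles, scores, h=1.0):
    """SVGD update direction for flattened particles (M, D) with scores (M, D)."""
    diffs = particles[:, None, :] - particles[None, :, :]
    K = torch.exp(-(diffs ** 2).sum(-1) / h)
    grad_K = -2.0 / h * diffs * K[..., None]         # d k(x_m, x_n) / d x_m
    return (K @ scores + grad_K.sum(0)) / particles.shape[0]

def mtl_mt_sgd_iteration(shared_nets, task_heads, x, ys, loss_fn, weights, eps=1e-3, h=1.0):
    """One alternating update: MT-SGD on shared parts (Eq. (7)) given QP weights,
    then per-task SVGD on non-shared parts (Eq. (9)).
    shared_nets: M shared modules; task_heads[m][i]: head of particle m for task i."""
    M, K_tasks = len(shared_nets), len(ys)

    # ---- shared parts: weighted combination of per-task SVGD directions ----
    shared_vecs = torch.stack([parameters_to_vector(n.parameters()) for n in shared_nets]).detach()
    direction = torch.zeros_like(shared_vecs)
    for i in range(K_tasks):
        scores = torch.stack([
            -flat_grad(loss_fn(task_heads[m][i](shared_nets[m](x)), ys[i]),
                       list(shared_nets[m].parameters()))
            for m in range(M)])                       # score ~ grad log p_i ~ -grad loss
        direction += weights[i] * svgd_direction(shared_vecs, scores, h)
    # The weights sum to one, so the shared repulsive term keeps its usual scale.
    for m, net in enumerate(shared_nets):
        vector_to_parameters(shared_vecs[m] + eps * direction[m], net.parameters())

    # ---- non-shared parts: plain SVGD per task ----
    for i in range(K_tasks):
        head_params = [list(task_heads[m][i].parameters()) for m in range(M)]
        head_vecs = torch.stack([parameters_to_vector(p) for p in head_params]).detach()
        scores = torch.stack([
            -flat_grad(loss_fn(task_heads[m][i](shared_nets[m](x)), ys[i]), head_params[m])
            for m in range(M)])
        new_vecs = head_vecs + eps * svgd_direction(head_vecs, scores, h)
        for m in range(M):
            vector_to_parameters(new_vecs[m], task_heads[m][i].parameters())
```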

4 Experiments

In this section, we verify our MT-SGD by evaluating its performance on both synthetic and real-world datasets. For our experiments, we use the RBF kernel $k(\theta, \theta') = \exp\big(-\|\theta - \theta'\|^2 / (2\sigma^2)\big)$. The detailed training configurations are given in the supplementary material.

4.1 Experiments on Toy Datasets

4.1.1 Sampling from Multiple Distributions

We first qualitatively analyze the behavior of the proposed method on sampling from three target distributions. Each target distribution is a mixture of two Gaussians, specified by its mixing proportions, component means, and a common covariance matrix. It can be seen from Figure 3 that there is a common high-density region spreading around the origin. Fifty particles are drawn randomly in the space, and the same initialization is retained across experiments for a fair comparison.

Figure 3 shows the particles updated by MOO-SVGD and MT-SGD at selected iterations. We observe that the particles from MOO-SVGD spread out and tend to characterize all the modes, and some of them are even scattered along trajectories due to the conflict in optimizing multiple objectives. By contrast, our method is able to find and cover the common high-density region of the target distributions with well-distributed particles, which illustrates the basic principle of MT-SGD. Additionally, at the same update step, the training time for ours is 0.23 min, whereas that for MOO-SVGD is 1.63 min. The reason is that MOO-SVGD requires solving an independent quadratic programming problem for each particle at each step.
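A minimal NumPy sketch of such a toy setup is given below (the particular means, proportions, and covariance are placeholders, not the paper's values); it shows how the score functions fed to MT-SGD can be defined for mixture-of-Gaussian targets.

```python
import numpy as np

def make_mixture_score(means, weights, cov):
    """Score function (gradient of log density) of a Gaussian mixture with shared covariance."""
    prec = np.linalg.inv(cov)

    def score(X):                                            # X: (M, d) particles
        diffs = X[:, None, :] - means[None, :, :]            # (M, C, d)
        mahal = np.einsum('mcd,de,mce->mc', diffs, prec, diffs)
        resp = weights * np.exp(-0.5 * mahal)                # unnormalized responsibilities
        resp /= resp.sum(axis=1, keepdims=True)
        # grad log p(x) = -sum_c resp_c * prec @ (x - mu_c)
        return -np.einsum('mc,de,mce->md', resp, prec, diffs)

    return score

# Three bimodal targets whose components overlap near the origin (placeholder parameters).
cov = 0.5 * np.eye(2)
targets = [make_mixture_score(np.array([[-1.5, 0.0], [1.5, 0.0]]), np.array([0.5, 0.5]), cov),
           make_mixture_score(np.array([[0.0, -1.5], [0.0, 1.5]]), np.array([0.5, 0.5]), cov),
           make_mixture_score(np.array([[-1.0, -1.0], [1.0, 1.0]]), np.array([0.5, 0.5]), cov)]
```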

Figure 3: Sampling from three mixtures of two Gaussian distributions with a joint high-likelihood region. We run MOO-SVGD (top) and MT-SGD (bottom) to update the initialized particles (left-most figures) until convergence using the Adam optimizer [kingma2014adam]. While MOO-SVGD transports the initialized particles so that they scatter over the distributions, MT-SGD perfectly drives them to diversify in the region of interest.

4.1.2 Multi-objective Optimization

The previous experiment illustrates that MT-SGD can be used to sample from multiple target distributions; we next test our method on another low-dimensional multi-objective OP from [zitzler2000comparison]. In particular, we use the two-objective ZDT3 problem, whose Pareto front consists of non-contiguous convex parts, to show that our method simultaneously minimizes both objective functions. Graphically, the simulation results in Figure 4 show the difference in convergence behavior between MOO-SVGD and MT-SGD: the solution set achieved by MOO-SVGD covers the entire Pareto front, while ours distributes and diversifies over the three middle curves (mostly concentrated on the middle curve), which constitute the Pareto common having low values for both objective functions of ZDT3.
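For reference, the standard ZDT3 objectives (as defined in [zitzler2000comparison]) can be written as below; treating them probabilistically, e.g. via $p_i(x) \propto \exp(-f_i(x))$, is our own illustrative choice rather than the paper's stated setup.

```python
import numpy as np

def zdt3(x):
    """Standard two-objective ZDT3 problem; x is a vector in [0, 1]^n."""
    f1 = x[0]
    g = 1.0 + 9.0 * np.mean(x[1:])
    ratio = f1 / g
    f2 = g * (1.0 - np.sqrt(ratio) - ratio * np.sin(10.0 * np.pi * f1))
    return np.array([f1, f2])

# Example evaluation at a random point in the 30-dimensional box.
print(zdt3(np.random.rand(30)))
```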

Figure 4: Solutions obtained by MOO-SVGD (mid) and MT-SGD (right) on the ZDT3 problem after 10,000 steps, with blue points representing particles and blue curves indicating the Pareto front. As expected, starting from the initialized particles (left), MOO-SVGD's solution set distributes widely over the whole Pareto front, while that of MT-SGD concentrates around the middle curves (mostly the middle one).

4.2 Experiments on Real Datasets

4.2.1 Experiments on Multi-Fashion+Multi-MNIST Datasets

Figure 5: Results on Multi-Fashion+MNIST (top), Multi-MNIST (mid), and Multi-Fashion (bottom). We report the ensemble accuracy (higher is better) and the Brier score (lower is better).

We apply the proposed MT-SGD method to multi-task learning, following Algorithm 2. Our method is validated on three benchmark datasets: (i) Multi-Fashion+MNIST [NIPS2017_2cad8fa4], (ii) Multi-MNIST, and (iii) Multi-Fashion. Each of them consists of 120,000 training and 20,000 testing images generated from MNIST [mnist] and FashionMNIST [xiao2017fashion] by overlaying one image on top of another: one in the top-left corner and one in the bottom-right corner. LeNet [mnist] (22,350 parameters) is employed as the backbone architecture and trained for 100 epochs with SGD in this experimental setup.

Baselines: In the multi-task experiments, the introduced MT-SGD is compared with state-of-the-art baselines including linear scalarization, MGDA [sener2018multi], Pareto MTL [lin2019pareto], and MOO-SVGD [liu2021profiling]. To reproduce the results of these baselines, we either use the authors' official implementations released on GitHub or obtained the code from the authors. For MOO-SVGD and Pareto MTL, the reported result is the ensemble prediction of five particle models. Additionally, for linear scalarization and MGDA, we train five models independently with different initializations and then ensemble them.

Evaluation metrics: We compare MT-SGD against the baselines regarding both average accuracy and predictive uncertainty. Besides the commonly used accuracy metric, we measure the quality and diversity of the particle models using two other popular Bayesian metrics: the Brier score [brier1950verification, ovadia2019can] and the expected calibration error (ECE) [dawid1982well, naeini2015obtaining].
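For completeness, a small NumPy sketch of these two metrics is given below; the choice of 10 equal-width confidence bins for ECE is a common convention and an assumption on our part.

```python
import numpy as np

def brier_score(probs, labels):
    """Multi-class Brier score: mean squared error between predicted probabilities
    and one-hot labels. probs: (N, C); labels: (N,) integer class indices."""
    onehot = np.eye(probs.shape[1])[labels]
    return np.mean(np.sum((probs - onehot) ** 2, axis=1))

def expected_calibration_error(probs, labels, num_bins=10):
    """ECE: average |accuracy - confidence| over equal-width confidence bins,
    weighted by the fraction of samples falling in each bin."""
    confidences = probs.max(axis=1)
    predictions = probs.argmax(axis=1)
    accuracies = (predictions == labels).astype(float)
    bins = np.linspace(0.0, 1.0, num_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            ece += mask.mean() * abs(accuracies[mask].mean() - confidences[mask].mean())
    return ece
```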

From Figure 5, we observe that MT-SGD consistently improves model performance across all tasks in both accuracy and Brier score by large margins compared to existing techniques in the literature. The network trained using linear scalarization, as expected, produces inferior ensemble results, while utilizing MOO techniques helps yield better performance. Overall, our proposed method surpasses the second-best baseline by at least 1% accuracy in every experiment. Furthermore, Table 1 compares these methods in terms of expected calibration error, where MT-SGD also consistently provides the lowest expected calibration error, illustrating our method's ability to obtain well-calibrated models (the accuracy is closely approximated by the produced confidence scores). It is also worth noting that while Pareto MTL has higher accuracy, MOO-SVGD produces slightly better calibration estimates.

Dataset Task Linear scalarization MGDA Pareto MTL MOO-SVGD MT-SGD
Multi-Fashion+MNIST Top left 20.4% 19.7% 10.1% 8.7% 4.6%
Bottom right 17.1% 14.9% 4.6% 4.8% 3.2%
Multi-MNIST Top left 16.7% 15.1% 5.0% 5.1% 3.1%
Bottom right 16.9% 16.2% 6.6% 6.7% 3.8%
Multi-Fashion Top left 14.7% 13.6% 7.8% 5.1% 4.2%
Bottom right 14.6% 13.1% 7.0% 6.7% 4.6%
Table 1: Expected calibration error (num_bin ) on the Multi-MNIST, Multi-Fashion, and Multi-Fashion+MNIST datasets. We use bold font to highlight the best results.

4.2.2 Experiment on CelebA Dataset

In this experiment, we verify the significance of MT-SGD on a larger neural network, ResNet18 [he2016deep], which consists of 11.4M parameters. We take the first 10 binary classification tasks and randomly select a subset of 40k images from the CelebA dataset [liu2015deep]. Note that in this experiment we also consider Single task, in which 10 models are trained separately; this serves as a strong baseline for this experiment.

Method 5S AE Att BUE Bald Bangs BL BN BlaH BloH Average
Acc (%) Single task 91.8 84.6 80.3 81.9 98.8 94.8 85.8 81.3 89.6 94.2 88.3
MGDA 91.8 84.0 79.0 81.3 98.6 94.6 83.6 81.6 89.8 93.8 87.8
MOO-SVGD 92.3 84.2 78.9 81.2 98.9 94.5 86.4 80.0 90.8 94.8 88.2
MT-SGD 92.6 84.8 80.3 82.9 99.1 95.2 86.3 82.6 91.1 95.0 89.0
ECE (%) Single task 3.3 2.4 4.4 3.9 0.7 1.6 5.7 6.5 3.1 1.1 3.3
MGDA 1.4 1.1 3.5 7.3 0.3 1.8 6.9 5.4 2.1 1.2 3.1
MOO-SVGD 2.8 1.9 3.1 5.6 0.3 0.5 4.7 3.3 1.3 1.3 2.5
MT-SGD 1.2 1.4 1.7 2.3 0.6 1.7 6.8 1.2 2.1 0.9 2.0
Table 2: Results on the CelebA dataset regarding accuracy and expected calibration error. For the full names of the tasks, please refer to our supplementary material. While MGDA trains only a single model to adapt to all tasks, the reported performance of MOO-SVGD and MT-SGD is the ensemble result of five particle models.

The performance comparison of the mentioned models on the CelebA experiment is shown in Table 2. As clearly seen from the upper part of the table, MT-SGD performs best in all tasks except BL, where MOO-SVGD is slightly better (86.4% vs 86.3%). Moreover, our method matches or beats Single task, the second-best baseline, on all tasks. Regarding well-calibrated uncertainty estimates, the ensemble learning methods exhibit better results. In particular, MT-SGD and MOO-SVGD provide the best average calibration performances of 2.0% and 2.5%, respectively, which emphasizes the importance of efficient ensemble learning for enhanced calibration.

5 Conclusion

In this paper, we propose Stochastic Multiple Target Sampling Gradient Descent (MT-SGD), allowing us to sample particles from the joint high-likelihood region of multiple target distributions. Our MT-SGD is theoretically guaranteed to simultaneously reduce the divergences to the target distributions. Interestingly, the asymptotic analysis shows that our MT-SGD reduces exactly to multiple-gradient descent for multi-objective optimization. We conduct comprehensive experiments to demonstrate that, by driving the particles to the Pareto common (the joint high-likelihood region of multiple target distributions), MT-SGD outperforms the baselines on ensemble accuracy and on well-known Bayesian metrics such as the expected calibration error and the Brier score.

Supplement to “Stochastic Multiple Target Sampling Gradient Descent”

These appendices provide supplementary details and results of MT-SGD, including our theory development and additional experiments. This consists of the following sections:

  • Appendix A contains the proofs and derivations of our theory development.

  • Appendix B contains the network architectures, the experimental settings, and additional ablation studies.

Appendix A Proofs of Our Theory Development

A.1 Derivations for the Taylor expansion formulation

We have

(10)

Proof of Equation (10): Since $T$ is assumed to be an invertible mapping, we have the following equations:

and

(11)

According to the change of variables formula, we have , then:

Using this, the first term in Equation (11) is rewritten as:

(12)

Similarly, the second term in Equation (11) could be expressed as:

(13)

It could be shown from the reproducing property of the RKHS that , then we find that

(14)

Let whose denotes the row vector and the particle is represented by , the row vector is given by:

(15)

Combining Property (14) and Equation (15), we have:

(16)

Substituting Equation (16) to Equation (13), the linear term of the Taylor expansion could be derived as:

where denotes the -th element of and is a matrix whose column vector is given by

In other words, the formula of the gradient becomes

As a consequence, we obtain the conclusion of Equation (10).

A.2 Proof of Lemma 1

Before proving this lemma, let us re-state it:

Lemma 3.

Let $w^* = (w_1^*, \dots, w_K^*)$ be the optimal solution of the optimization problem $\min_{w \in \Delta_K} \big\| \sum_{i=1}^{K} w_i\, \phi_i^* \big\|^2_{\mathcal{H}^d}$ and $\phi^* = \sum_{i=1}^{K} w_i^*\, \phi_i^*$, where $\Delta_K = \{w \in \mathbb{R}^K : w_i \geq 0,\ \sum_{i=1}^{K} w_i = 1\}$. Then we have $\langle \phi^*, \phi_i^* \rangle_{\mathcal{H}^d} \geq \|\phi^*\|^2_{\mathcal{H}^d}$ for all $i = 1, \dots, K$.

Proof. For an arbitrary $w \in \Delta_K$ and $t \in [0, 1]$, we have $(1 - t)\, w^* + t\, w \in \Delta_K$; by the optimality of $w^*$, we thus have the following inequality:

$$\Big\| \sum_{i=1}^{K} \big[(1 - t)\, w_i^* + t\, w_i\big]\, \phi_i^* \Big\|^2_{\mathcal{H}^d} \geq \|\phi^*\|^2_{\mathcal{H}^d},$$

which is equivalent to

$$2\, \Big\langle \phi^*, \sum_{i=1}^{K} w_i\, \phi_i^* - \phi^* \Big\rangle_{\mathcal{H}^d} \geq - t\, \Big\| \sum_{i=1}^{K} w_i\, \phi_i^* - \phi^* \Big\|^2_{\mathcal{H}^d}. \qquad (17)$$

Hence $\big\langle \phi^*, \sum_{i=1}^{K} w_i\, \phi_i^* - \phi^* \big\rangle_{\mathcal{H}^d} \geq 0$, since the R.H.S. of inequality (17) can be made arbitrarily close to $0$ with sufficiently small $t$. By that, we arrive at

$$\Big\langle \phi^*, \sum_{i=1}^{K} w_i\, \phi_i^* \Big\rangle_{\mathcal{H}^d} \geq \|\phi^*\|^2_{\mathcal{H}^d}.$$

By choosing $w$ to be a one-hot vector at index $i$, we obtain the conclusion of Lemma 1.

A.3 Derivations for the matrix $A$'s formulation in Equation (3)

We have

Therefore, we find that

which is equivalent to

Now, note that

hence we obtain

from which it follows that

Putting these results together, we obtain that

As a consequence, we obtain the conclusion of Equation (3).

A.4 Proof of Theorem 2

Before proving this theorem, let us re-state it:

Theorem 4.

If there does not exist $w \in \Delta_K$ such that $\sum_{i=1}^{K} w_i\, \phi_i^* = 0$, then, given a sufficiently small step size $\epsilon$, all KL divergences w.r.t. the target distributions are strictly decreased by at least $\epsilon C$, where $C$ is a positive constant.

Proof.

We have for all that