snips_torch
None
view repo
In this work we introduce a novel stochastic algorithm dubbed SNIPS, which draws samples from the posterior distribution of any linear inverse problem, where the observation is assumed to be contaminated by additive white Gaussian noise. Our solution incorporates ideas from Langevin dynamics and Newton's method, and exploits a pre-trained minimum mean squared error (MMSE) Gaussian denoiser. The proposed approach relies on an intricate derivation of the posterior score function that includes a singular value decomposition (SVD) of the degradation operator, in order to obtain a tractable iterative algorithm for the desired sampling. Due to its stochasticity, the algorithm can produce multiple high perceptual quality samples for the same noisy observation. We demonstrate the abilities of the proposed paradigm for image deblurring, super-resolution, and compressive sensing. We show that the samples produced are sharp, detailed and consistent with the given measurements, and their diversity exposes the inherent uncertainty in the inverse problem being solved.
READ FULL TEXT VIEW PDFNone
Many problems in the field of image processing can be cast as noisy linear inverse problems. This family of tasks includes denoising, inpainting, deblurring, super resolution, compressive sensing, and many other image recovery problems. A general linear inverse problem is posed as
(1) |
where we aim to recover a signal from its measurement , given through a linear degradation operator and a contaminating noise, being additive, white and Gaussian, . In this work we assume that both and are known.
Over the years, many strategies, algorithms and underlying statistical models were developed for handling image restoration problems. A key ingredient in many of the classic attempts is the prior that aims to regularize the inversion process and lead to visually pleasing results. Among the various options explored, we mention sparsity-inspired techniques Elad and Aharon (2006); Yang et al. (2010); Dong et al. (2012)
, local Gaussian-mixture modeling
Yu et al. (2011); Zoran and Weiss (2011), and methods relying on non-local self-similarity Buades et al. (2005); Danielyan et al. (2011); Ram et al. (2013); Vaksman et al. (2016). More recently, and with the emergence of deep learning techniques, a direct design of the recovery path from
to an estimate of
took the lead, yielding state-of-the-art results in various linear inverse problems, such as denoising Lefkimmiatis (2017); Zhang et al. (2017a, 2018); Vaksman et al. (2020), deblurring Kupyn et al. (2019); Suin et al. (2020), super resolution Dong et al. (2016); Haris et al. (2018); Wang et al. (2018) and other tasks McCann et al. (2017); Lucas et al. (2018); Hyun et al. (2018); Gupta et al. (2018); Ravishankar et al. (2019); Zhang and Ghanem (2018).Despite the evident success of the above techniques, many image restoration algorithms still have a critical shortcoming: In cases of severe degradation, most recovery algorithms tend to produce washed out reconstructions that lack details. Indeed, most image restoration techniques aim to form an MMSE estimator, i.e., seek a reconstruction that minimizes the mean squared error between the restored image, , and the unknown original one, . When the degradation is acute and information is irreversibly lost, image reconstruction becomes a highly ill-posed problem, implying that many possible clean images could explain the given measurements. The MMSE solution averages all these candidate solutions, being the conditional mean of the posterior of given , leading to an image with loss of fine details. A recent work reported in Blau and Michaeli (2018) has shown that reconstruction algorithms necessarily suffer from a perception-distortion tradeoff, i.e., targeting a minimization of the error between and (in any metric) is necessarily accompanied by a compromised perceptual quality. As a consequence, as long as we stick to the tendency to design recovery algorithms that aim for minimum MSE (or other distances), only a limited perceptual improvement can be expected.
When perceptual quality becomes our prime objective, the strategy for solving inverse problems must necessarily change. More specifically, the solution should concentrate on producing a sample (or many of them) from the posterior distribution
instead of its conditional mean. Recently, two such approaches have been suggested – GAN-based and Langevin sampling. Generative Adversarial Networks (GANs) have shown impressive results in generating realistically looking images (
e.g., Goodfellow et al. (2014); Radford et al. (2016)). GANs can be utilized for solving inverse problems while producing high-quality images (see e.g. Bahat and Michaeli (2020); Menon et al. (2020); Peng and Li (2020)). These solvers aim to produce a diverse set of output images that are consistent with the measurements, while also being aligned with the distribution of clean examples. A major disadvantage of GAN-based algorithms for inverse problems is their tendency (as practiced in Bahat and Michaeli (2020); Menon et al. (2020); Peng and Li (2020)) to assume noiseless measurements, a condition seldom met in practice. An exception to this is the work reported in Ohayon et al. (2021), which adapts a conditional GAN to become a stochastic denoiser.The second approach for sampling from the posterior, and the one we shall be focusing on in this paper, is based on Langevin dynamics. This core iterative technique enables sampling from a given distribution by leveraging the availability of the score function – the gradient of the log of the probability density function
Roberts et al. (1996); Besag (2001). The work reported in Song and Ermon (2019); Kadkhodaie and Simoncelli (2020); Song et al. (2021) utilizes the annealed Langevin dynamics method, both for image synthesis and for solving noiseless inverse problems.^{1}^{1}1The work reported in Ho et al. (2020); Saharia et al. (2021); Li et al. (2021) and Guo et al. (2019); Laumont et al. (2021) is also relevant to this discussion, but somewhat different. We shall specifically address these papers’ content and its relation to our work in section 2.Their synthesis algorithm relies on an MMSE Gaussian denoiser (given as a neural network) for approximating a gradually blurred score function. In their treatment of inverse problems, the conditional score remains tractable and manageable due to the noiseless measurements assumption.
The question addressed in this paper is the following: How can the above line of Langevin-based work be generalized for handling linear inverse problems, as in Equation 1, in which the measurements are noisy? A partial and limited answer to this question has already been given in Kawar et al. (2021) for the tasks of image denoising and inpainting. The present work generalizes these (Song and Ermon (2019); Kadkhodaie and Simoncelli (2020); Song et al. (2021); Kawar et al. (2021)) results, and introduces a systematic way for sampling from the posterior distribution of any given noisy linear inverse problem. As we carefully show, this extension is far from being trivial, due to two prime reasons: (i) The involvement of the degradation operator , which poses a difficulty for establishing a relationship between the reconstructed image and the noisy observation; and (ii) The intricate connection between the measurements’ and the synthetic annealed Langevin noise. Our proposed remedy is a decorrelation of the measurements equation via a singular value decomposition (SVD) of the operator , which decouples the dependencies between the measurements, enabling each to be addressed by an adapted iterative process. In addition, we define the annealing noise to be built as portions of the measurement noise itself, in a manner that facilitates a constructive derivation of the conditional score function.
Following earlier work Song and Ermon (2019); Kadkhodaie and Simoncelli (2020); Song et al. (2021); Kawar et al. (2021)
, our algorithm is initialized with a random noise image, gradually converging to the reconstructed result, while following the direction of the log-posterior gradient, estimated using an MMSE denoiser. Via a careful construction of the gradual annealing noise sequence, from very high values to low ones, the entries in the derived score switch mode. Those referring to non-zero singular values start by being purely dependent on the measurements, and then transition to incorporate prior information based on the denoiser. As for entries referring to zero singular values, their corresponding entries undergo a pure synthesis process based on the prior-only score function. Note that the denoiser blends values in the evolving sample, thus intermixing the influence of the gradient entries. Our derivations include an analytical expression for a position-dependent step size vector, drawing inspiration from Newton’s method in optimization. This stabilizes the algorithm and is shown to be essential for its success.
We refer hereafter to our algorithm as SNIPS (Solution of Noisy Inverse Problems Stochastically). Observe that as we target to sample from the posterior distribution , different runs of SNIPS on the same input necessarily yield different results, all of which valid solutions to the given inverse problem. This should not come as a surprise, as ill-posedness implies that there are multiple viable solutions for the same data, as has already been suggested in the context of super resolution Menon et al. (2020); Bahat and Michaeli (2020); Peng and Li (2020). We demonstrate SNIPS on image deblurring, single image super resolution, and compressive sensing, all of which contain non-negligible noise, and emphasize the high perceptual quality of the results, their diversity, and their relation to the MMSE estimate.
To summarize, this paper’s contributions are threefold:
We present an intricate derivation of the blurred posterior score function for general noisy inverse problems, where both the measurement and the target image contain delicately inter-connected additive white Gaussian noise.
We introduce a novel stochastic algorithm – SNIPS – that can sample from the posterior distribution of these problems. The algorithm relies on the availability of an MMSE denoiser.
We demonstrate impressive results of SNIPS on image deblurring, single image super resolution, and compressive sensing, all of which are highly noisy and ill-posed.
Before diving into the details of this work, we should mention that using MMSE Gaussian denoisers iteratively for handling general linear inverse problems has been already proposed in the context of the Plug-and-Play-Prior (PnP) method Venkatakrishnan et al. (2013) and RED Romano et al. (2017), and their many followup papers (e.g., Zhang et al. (2017b); Meinhardt et al. (2017); Arridge et al. (2019); Sun et al. (2019); Buzzard et al. (2018); Tirer and Giryes (2018); Rond et al. (2016)). However, both PnP and RED are quite different from our work, as they do not target sampling from the posterior, but rather focus on MAP or MMSE estimation.
row4< 8 col1< 12 | ||||
---|---|---|---|---|
Original | Blurred | Samples from our algorithm | Mean | std |
). Here and in all other shown figures, the standard deviation image is scaled by 4 for better visual inspection.
The Langevin dynamics algorithm Besag (2001); Roberts et al. (1996)
suggests sampling from a probability distribution
using the iterative transition rule(2) |
where and is an appropriately chosen small constant. The added allows for stochastic sampling, avoiding a collapse to a maximum of the distribution. Initialized randomly, after a sufficiently large number of iterations, and under some mild conditions, this process converges to a sample from the desired distribution Roberts et al. (1996).
The work reported in Song and Ermon (2019) extends the aforementioned algorithm into annealed Langevin dynamics. The annealing proposed replaces the score function in Equation 2 with a blurred version of it, , where and is a synthetically injected noise. The core idea is to start with a very high noise level and gradually drop it to near-zero, all while using a step size dependent on the noise level. These changes allow the algorithm to converge much faster and perform better, because it widens the basin of attraction of the sampling process. The work in Kadkhodaie and Simoncelli (2020) further develops this line of work by exposing a brilliant relation between MMSE Gaussian denoisers and the blurred score function mentioned above, given as
(3) |
where is the minimizer of the MSE measure , which can be approximated using a neural network. This facilitates the use of denoisers in Langevin dynamics as a replacement for the evasive score function.
When turning to solve inverse problems, previous work suggests sampling from the posterior distribution using annealed Langevin dynamics Kadkhodaie and Simoncelli (2020); Song et al. (2021); Kawar et al. (2021) or similar methods Guo et al. (2019); Ho et al. (2020); Saharia et al. (2021); Li et al. (2021), by replacing the score function used in the generation algorithm with a conditional one. As it turns out, if limiting assumptions can be posed on the measurements formation, the conditional score is tractable, and thus generalization of the annealed Langevin process to these problems is within reach. Indeed, in Song and Ermon (2019); Kadkhodaie and Simoncelli (2020); Song et al. (2021); Saharia et al. (2021); Li et al. (2021) the core assumption is for specific and simplified choices of and with no noise in the measurements. The works in Guo et al. (2019); Laumont et al. (2021) avoid these difficulties altogether by returning to the original (non-annealed) Langevin method, with the unavoidable cost of becoming extremely slow. In addition, their algorithms are demonstrated on inverse problems in which the additive noise is restricted to be very weak. The work in Kawar et al. (2021) is broader, allowing for an arbitrary additive white Gaussian noise, but limits to the problems of denoising or inpainting. While all these works demonstrate high quality results, there is currently no clear way for deriving the blurred score function of a general linear inverse problem as posed in Equation 1. In the following, we present such a derivation.
We consider the problem of recovering a signal from the observation , where , and and are known. Our ultimate goal is to sample from the posterior . However, since access to the score function is not available, we retarget our goal, as explained above, to sampling from blurred posterior distributions, , where and , with noise levels starting very high, and decreasing towards near-zero.
As explained in Appendix C, the sampling should be performed in the SVD domain in order to get a tractable derivation of the blurred score function. Thus, we consider the singular value decomposition (SVD) of , given as , where and are orthogonal matrices, and is a rectangular diagonal matrix containing the singular values of , denoted as in descending order (). For convenience of notations, we also define for . To that end, we notice that
(4) |
The first equality holds because the multiplication of
by the orthogonal matrix
does not add or remove information, and the second equality holds because the multiplication of bydoes not change its probability
Lebanon (2012). Therefore, sampling from and then multiplying the result by will produce the desired sample from . As we are using Langevin dynamics, we need to calculate the conditional score function . For simplicity, we denote hereafter , , , , and . Observe that with these notations, the measurements equation becomesand thus
where we have relied on the relation . This leads to
(5) |
In this formulation, which will be used for deriving part of the conditional score, our aim is to make design choices on such that is white and independent of . This brings us to the formation of the synthetic annealed noise, which is an intricate ingredient in our derivations.
We base this formation on the definition of a sequence of noise levels such that , where is very high (possibly ) and is very close to zero. We also require that for every and : . Lastly, we require that for every such that , there exists such that and .
Using , we would like to define , a sequence of noisy versions of , where the noise level in is . One might be tempted to define these noise additions as independent of the measurement noise . However, this option leads to a conditional score term that cannot be calculated analytically, as explained in Appendix C. Therefore, we define these noise additions differently, as carved from in a gradual fashion. To that end, we define , and for every : , where . This results in , where .
And now we turn to define the statistical dependencies between the measurements’ noise and the artificial noise vectors . Since and are each Gaussian and white, so are the components of the vectors , , and . In order to proceed while easing notations, let us focus on a single entry in these three vectors, for which , and omit this index. We denote these entries as , and , respectively. We construct such that
This implies that the layers of noise are all portions of itself, with an additional portion being contained in . Afterwards, become independent of . In the case of , the above relations simplify to be for all , implying no statistical dependency between the given and the synthetic noises. Consequently, it can be shown that the overall noise in Equation 5 satisfies
(6) |
The top option refers to high values of the annealed Langevin noise, in which, despite the possible decay caused by the singular value , this noise is stronger than . In this case, contains all and an additional independent portion of noise. The bottom part assumes that the annealed noise (with the influence of ) is weaker than the measurements’ noise, and then it is fully immersed within , with the difference being Gaussian and independent.
The above derivations show that the noise in Equation 5
is zero-mean, white and Gaussian and of known variance, and this noise is independent of
. Thus Equation 5 can be used conveniently for deriving the measurements part of the conditional score function. We denote , , for simplicity, and turn to calculate . We split into three parts: (i) refers to the entries for which ; (ii) corresponds to the entries for which ; and (iii) includes the entries for which . Observe that this partition of the entries of is non-overlapping and fully covering. Similarly, we partition every vector into , which are the entries of corresponding to , respectively. Furthermore, we define as all the entries of except , respectively. With these definitions in place, the complete derivation of the score function is detailed in Appendix A, and here we bring the final outcome. For , the score is independent of the measurements and given by(7) |
For the case of , the expression obtained is only measurements-dependent,
(8) |
Lastly, for the case of , the conditional score includes two terms – one referring to the plain (blurred) score, and the other depending on the measurements,
(9) |
As already mentioned, the full derivations of equations 7, 8, and 9 are detailed in Appendix A. Aggregating all these results together, we obtain the following conditional score function:
(10) |
where is the vector , but with zeros in its entries that correspond to . Observe that the first term in Equation 10 contains zeros in the entries corresponding to , matching the above calculations. The vector can be estimated using a neural network as in Song and Ermon (2019), or using a pre-trained MMSE denoiser as in Kadkhodaie and Simoncelli (2020); Kawar et al. (2021). All the other elements of this vector are given or can be easily obtained from by calculating its SVD decomposition once at the beginning.
row0< 2 col1< 8 | ||||
---|---|---|---|---|
Original | Low-res | Samples from our algorithm | Mean | std |
Armed with the conditional score function in Equation 10, the Langevin dynamics algorithm can be run with a constant step size or an annealed step size as in Song and Ermon (2019), and this should converge to a sample from . However, for this to perform well, one should use a very small step size, implying a devastatingly slow convergence behavior. This is mainly due to the fact that different entries of
advance at different speeds, in accord with their corresponding singular values. As the added noise in each step has the same variance in every entry, this leads to an unbalanced signal-to-noise ratio, which considerably slows down the algorithm.
In order to mitigate this problem, we suggest using a step size vector . We denote , and obtain the following update formula for a Langevin dynamics algorithm:
(11) |
where the conditional score function is estimated as described in subsection 3.2, and is some constant. For the choice of the step sizes in the diagonal of , we draw inspiration from Newton’s method in optimization, which is designed to speed up convergence to local maximum points. The update formula in Newton’s method is the same as Equation 11, but without the additional noise , and with being the negative inverse Hessian of . We calculate a diagonal approximation of the Hessian, and set to be its negative inverse. We also estimate the conditional score function using Equation 10 and a neural network. Note that this mixture of Langevin dynamics and Newton’s method has been suggested in a slightly different context in Simsekli et al. (2016), where the Hessian was approximated using a Quasi-Newton method. In our case, we analytically calculate a diagonal approximation of the negative inverse Hessian and obtain the following:
(12) |
The full derivations for each of the three cases are detailed in Appendix B. Using these step sizes, the update formula in Equation 11, the conditional score function in Equation 10, and a neural network that estimates the score function ,^{3}^{3}3Recall that , being a denoising residual. we obtain a tractable iterative algorithm for sampling from , where the noise in is sufficiently negligible to be considered as a sampling from the ideal image manifold.
Note that when we set and , implying no measurements, the above algorithm degenerates to an image synthesis, exactly as in Song and Ermon (2019). Two other special cases of this algorithm are obtained for or with some rows removed, the first referring to denoising and the second to noisy inpainting, both cases shown in Kawar et al. (2021). Lastly, for the choices of as in Kadkhodaie and Simoncelli (2020) or Song and Ermon (2019); Song et al. (2021) and with , the above algorithm collapses to a close variant of their proposed iterative methods.
Original | Degraded | Sample | Mean | std |
In our experiments we use the NCSNv2 Song and Ermon (2020) network in order to estimate the score function of the prior distribution. Three different NCSNv2 models are used, each trained separately on the following datasets: (i) images of size pixels from the CelebA dataset Liu et al. (2015); (ii) images of size pixels from LSUN Yu et al. (2015) bedrooms dataset; and (iii) LSUN images of towers. We demonstrate SNIPS’ capabilities on image deblurring, super resolution, and compressive sensing. In each of the experiments, we run our algorithm 8 times, producing 8 samples for each input. We examine both the samples themselves and their mean, which serves as an approximation of the MMSE solution, .
For image deblurring, we use a uniform blur kernel, and an additive white Gaussian noise with (referring to pixel values in the range ). Figure 1 demonstrates the obtained results for several images taken from the CelebA dataset. As can be seen, SNIPS produces visually pleasing, diverse samples.
For super resolution, the images are downscaled using a block averaging filter, i.e., each non-overlapping block of pixels in the original image is averaged into one pixel in the low-resolution image. We use blocks of size or pixels, and assume the low-resolution image to include an additive white Gaussian noise. We showcase results on LSUN and CelebA in Figures 2, 4, and 5.
row0< 4 col1< 12 | ||||
---|---|---|---|---|
Original | Low-res | Samples from our algorithm | Mean | std |
row0< 4 col1< 12 | ||||
---|---|---|---|---|
Original | Low-res | Samples from our algorithm | Mean | std |
For compressive sensing, we use three random projection matrices with singular values of , that compress the image by , , and . As can be seen in Figure 3 and as expected, the more aggressive the compression, the more significant are the variations in reconstruction.
We calculate the average PSNR (peak signal-to-noise ratio) of each of the samples in our experiments, as well as the PSNR of their mean, as shown in Table 1. In all the experiments, the empirical conditional mean presents an improvement of around dB in PSNR, even though it is less visually appealing compared to the samples. This is consistent with the theory in Blau and Michaeli (2018), which states that the difference in PSNR between posterior samples and the conditional mean (the MMSE estimator) should be dB, with the MMSE estimator having poorer perceptual quality but better PSNR.
Problem | Sample PSNR | Mean PSNR |
---|---|---|
Uniform deblurring | ||
Super resolution (by ) | ||
Super resolution (by ) | ||
Compressive sensing (by ) | ||
Compressive sensing (by ) |
A valid solution to an inverse problem should satisfy two conditions: (i) It should be visually pleasing, consistent with the underlying prior distribution of images, and (ii) It should be faithful to the given measurement, maintaining the relationship as given in the problem setting. Since the prior distribution is unknown, we assess the first condition by visually observing the obtained solutions and their tendency to look realistic. As for the second condition, we perform the following computation: We degrade the obtained reconstruction by , and calculate its difference from the given measurement , obtaining . According to the problem setting, this difference should be an additive white Gaussian noise vector with a standard deviation of . We examine this difference by calculating its empirical standard deviation, and performing the Pearson-D’Agostino D’Agostino and Pearson (1973) test of normality on it, accepting it as a Gaussian vector if the obtained p-value is greater than . We also calculate the Pearson correlation coefficient (denoted as ) among neighboring entries, accepting them as uncorrelated for coefficients smaller than in absolute value. In all of our tests, the standard deviation matches almost exactly, the Pearson correlation coefficient satisfies , and we obtain p-values greater than in around of the samples (across all experiments). These results empirically show that our algorithm produces valid solutions to the given inverse problems.
row4< 6 col1< 8 | ||||
---|---|---|---|---|
Original | Degraded | Samples from our algorithm | Mean | std |
SNIPS, presented in this paper, is a novel stochastic algorithm for solving general noisy linear inverse problems. This method is based on annealed Langevin dynamics and Newton’s method, and relies on the availability of a pre-trained Gaussian MMSE denoiser. SNIPS produces a random variety of high quality samples from the posterior distribution of the unknown given the measurements, while guaranteeing their validity with respect to the given data. This algorithm’s derivation includes an intricate choice of the injected annealed noise in the Langevin update equations, and an SVD decomposition of the degradation operator for decoupling the measurements’ dependencies. We demonstrate SNIPS’ success on image deblurring, super resolution, and compressive sensing.
Extensions of this work should focus on SNIPS’ limitations: (i) The need to deploy SVD decomposition of the degradation matrix requires a considerable amount of memory and computations, and hinders the algorithm’s scalability; (ii) The current version of SNIPS does not handle general content images, a fact that is related to the properties of the denoiser being used Ryu et al. (2019); and (iii) SNIPS, as any other Langevin based method, requires (too) many iterations (e.g., in our super-resolution tests on CelebA, minutes are required for producing sample images), and means for its acceleration should be explored.
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition
, pp. 2716–2725. Cited by: §1, §1.Accelerating the super-resolution convolutional neural network
. In European conference on computer vision, pp. 391–407. Cited by: §1.Image reconstruction: from sparsity to data-adaptive methods and machine learning
. Proceedings of the IEEE 108 (1), pp. 86–109. Cited by: §1.We would like to derive a term for depending on known ingredients such as , , , and the SVD components of , as well as the blurred prior score function , which can be estimated using a neural network. To that end, in accordance with the definitions of , and , for a matrix we define as leading minors of with subsets of rows and columns extracted accordingly from the above-defined partition. Recalling Equation 5, we have
(13) |
Observe that the entries of the right-hand-side vector are statistically independent, and their distribution for is given by
(14) |
This is a direct result of Equation 6, obtained by simply aggregating the different entries into a vector. Similarly,
(15) |
obtained from aggregating the entries from Equation 6 into a vector, and multiplying it by . Notice that is a diagonal square matrix, and thus invertible. The above two formulae will be used in the following analysis.
Given , , is the SVD decomposition of , , as constructed in subsection 3.1, , , , the conditional score is approximately given by:
We split our derivation into three cases: , , and , and then concatenate the results.
For the case of , we calculate using the Bayes rule:
Deriving by is the same as deriving by and then taking the part referring to zero singular values of . Thus, the second term becomes . As for the first term, we can subtract the vector without changing the statistics because it is a known quantity in this setting, resulting in
The last equality holds due to Equation 13. Referring to the first term, because the entries of the vector are independent, we can split the probability density function into a product of two such functions for two parts of the vector, as follows:
The entries of were defined element-wise as gradual noise additions, statistically independent of the entries of . Therefore, the conditioning on is equivalent to conditioning on . Deriving this log-probability by results in zero. As for the first term, is zero due to the definition of , and is a Gaussian vector that is independent of , resulting in
The second last equality holds because , and multiplication by the orthogonal matrix
does not change the statistics of the variable. The last equality holds due to the multivariate chain rule:
, where is the Jacobian matrix of w.r.t. . Finally, we obtain(16) |
For the case of , using the definition of the conditional distribution we get:
(17) |
Focusing on the second term, we calculate, with a similar reasoning as above and get:
Substituting , , leads to
The last equality holds because . Observe that is zero everywhere except in the part of the vector, which we discard because of the notation. We can split this term into two parts, as before,
The derivative of the second term (the part) is zero, because this vector was built element-wise as gradual noise additions, independent of . This results in
This is the gradient-log of a Gaussian density function of the vector , known to have a zero mean and a covariance matrix , according to Equation 15. Thus, we use the known Gaussian gradient-log and conclude:
Multiplying a certain part of a diagonal matrix (in this case, the part) by the corresponding part of a vector is the same as multiplying the original matrix and vector, and then taking the relevant part. This results in