We revisit the Blind Deconvolution problem with a focus on understanding its robustness and convergence properties. Provable robustness to noise and other perturbations is receiving recent interest in vision, from obtaining immunity to adversarial attacks to assessing and describing failure modes of algorithms in mission critical applications. Further, many blind deconvolution methods based on deep architectures internally make use of or optimize the basic formulation, so a clearer understanding of how this sub-module behaves, when it can be solved, and what noise injection it can tolerate is a first order requirement. We derive new insights into the theoretical underpinnings of blind deconvolution. The algorithm that emerges has nice convergence guarantees and is provably robust in a sense we formalize in the paper. Interestingly, these technical results play out very well in practice, where on standard datasets our algorithm yields results competitive with or superior to the state of the art. Keywords: blind deconvolution, robust continuous optimization
Image deblurring has been an active area of study in computer vision for nearly five decades. The early proposals sought to sharpen or deblur images from photographs by relying on parameters relating the exposure and the amplifier gain, e.g., via the use of the Stroke/Zech division filter Stroke and Halioua (1970). Most contemporary algorithms for deblurring, however, pose the problem as blind deconvolution, which refers to separating a true unknown signal and some unknown “kernel” or “filter” when provided knowledge only of the noisy measurement of the signal convolved with the filter. This is a fundamental topic today in signal processing and vision, and remains challenging due to its non-convex and ill-posed nature; only within the last few years has brisk progress been made towards methods that gracefully handle real images encountered in practice Levin et al. (2009); Campisi and Egiazarian (2016). These recent developments notwithstanding, due to the foregoing technical challenges, we are often unable to guarantee provably good solutions to the underlying optimization task, and strategies to address these issues are being studied by various researchers in our community today Perrone et al. (2015); Jin et al. (2017); Li et al. (2016); Campisi and Egiazarian (2016).
Modern approaches generally prefer one of two related but distinct strategies for blind deconvolution. On the statistical side, research has primarily revolved around Bayesian methods Ruiz et al. (2015), taking advantage of useful priors ranging from fundamental image geometry in the context of its relation to edge detection and saliency, to expert knowledge of the specific application domain of interest Cho and Lee (2009); Xu and Jia (2010). While these ideas provide guarantees in terms of robustness, the development of efficient sampling (e.g., Gibbs sampler) and inference algorithms remains an active topic of research. On the optimization side, total variation regularization has proven to be extremely effective in general image deblurring Perrone and Favaro (2014); Chan and Wong (1998); Osher et al. (2005)
in a variety of image domains. While the mathematical properties of total variation have been well studied in applied mathematics, signal processing and machine learning, our understanding of the robustness and convergence behavior of even the best performing algorithms for blind deconvolution based on this construct remains limited, although there is exciting progress being made Srinivasan et al. (2017). A primary motivation of our work is to shed light on these theoretical issues.
Separate from, but complementary to the above lines of work, the enormous success of deep convolutional architectures in vision has led to a number of papers Sun et al. (2015); Schuler et al. (2016); Chakrabarti (2016); Noroozi et al. (2017) exploring how such successes can be adapted to deconvolution in general. While some initial attempts showed the use of machine learning methods for non-blind
image deconvolution (i.e., the blur kernel is provided), discriminatively trained architectures have now been shown to work quite well for the general setting, both with and without priors on motion blur types. A natural question one may ask is whether an in-depth study of the core blind deconvolution formulation and its properties is relevant in light of this still evolving body of literature based on convolutional neural networks. The reader will see that our work is complementary. Many of the recent proposals in this line of work reformulate deconvolution as a supervised learning problem by synthesizing blurred and sharp image pairs, and are often based on some form of blind deconvolution sub-routine internally Schuler et al. (2016). As these methods get closer to practical deployment in mission critical applications, a detailed assessment of their behavior profile will be a first order requirement for regulation compliance. To enable investigating the robustness and convergence properties of these architectures and their resilience to adversarial examples, as has been happening in the last few years for other problems in both computer vision and machine learning Su et al. (2017); Moosavi-Dezfooli et al. (2016); Moosavi Dezfooli et al. (2017), we will necessarily rely on and benefit from a “first principles” understanding of such properties for the standalone (i.e., shallow regimes of) blind deconvolution.
Contribution. In this paper, we provide (1) a quantifiably and provably robust algorithm for blind deconvolution with (2) guaranteed convergence properties. To our knowledge, no algorithm is currently known that offers both these properties at once. Our convergence guarantees match the best known results in optimization at this time. Our technical analysis is also backed up by practical performance. Via an extensive experimental study, we show that on most available benchmarks, our simple algorithm competes favorably with (or is superior to) the state of the art, and provide a user-friendly implementation which can be easily extended to a complete user-interactive deblurring package.
Methods for image deblurring via blind deconvolution have employed a variety of regularizations derived from a wide range of image priors. The literature is vast and so we restrict our discussion to a subset of works that are closely related to or motivate our proposed strategy. The earlier forms of regularization were based on the -norm in You and Kaveh (1996), where an alternating minimization scheme was proposed. More recent improvements have been proposed by Cho and Lee (2009) and Xu and Jia (2010). On the other hand, total variation regularization, the de facto choice in many state of the art methods today, was initially deployed in image denoising applications Rudin et al. (1992); Vogel and Oman (1996), and brought to the image deconvolution problem by Chan and Wong (1998). A nice result by Osher et al. (2005) gives a variational iterative procedure for solving the total variation objective. A conceptually distinct set of results for blind deconvolution adopts a more statistical approach instead. Levin et al. (2009) provide an analysis of algorithms following maximum a posteriori (MAP) estimators. A recent work Ruiz et al. (2015) gives a nice and comprehensive overview of Bayesian methods for blind deconvolution. A few years back, Perrone and Favaro (2014) built on the analysis in Levin et al. (2009) and demonstrated experimentally the behavior of the algorithm of Chan and Wong (1998). In a follow-up work, those authors showed the advantage of a logarithmic prior Perrone et al. (2015), obtaining state of the art results with a mild modification to the classical TV-norm based formulation which we will present shortly. Separate from total variation regularization based approaches, interesting results have been shown by Michaeli and Irani (2014) through an regularization on text images and by Michaeli and Irani (2014); Sun et al. (2013) via the use of patch priors. Recently, a detailed comparative study was conducted by Lai et al. (2016), in which participants were asked to qualitatively compare two results from multiple algorithms, a subset of which are described in our review above.
In the last few years, ideas based on specialized deep networks have started yielding interesting results for this problem. For example, Sun et al. (2015) was among the first approaches for motion blur removal by posing the problem as a supervised learning task and training a convolutional neural network (CNN) to infer the parameters. Schuler built on these results in Schuler et al. (2016), and Chakrabarti Chakrabarti (2016) constructed a network to predict the Fourier coefficients of the filter necessary to deblur specific image patches. Taking advantage of modern convolutional architectures, Nah et al. (2016) constructed deep multi-scale networks for dynamic scene deblurring with strong empirical results. In the past year, Generative Adversarial Networks have also been applied with measured success Ramakrishnan et al. (2017).
Throughout this paper we assume that an image is an
dimensional vector taking values betweenand without loss of generality. We will use and to denote the vectorized sharp image and blur kernel, both of which are to be estimated given the vectorized blurry image . Mathematically, the model can be written as,
where denotes the usual convolution between two signals and denotes the independent noise vector at each pixel. Assuming that , we can estimate by maximizing the log-likelihood, corresponding to solving the following least squares optimization problem,
Observe that the number of parameters to be estimated is and can be much larger than the number of observations if the kernel is large. To solve for solutions to (2), many regularization functions (or priors) and/or constraints have been proposed in the literature Levin et al. (2009); Ruiz et al. (2015); Campisi and Egiazarian (2016). To keep the presentation simple, we will focus our attention on two generic components that have shown strong empirical performance to specify the full model.
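For concreteness, the measurement model and the least squares problem described above can be written in one common notation. The symbols below ($u$ for the sharp image, $k$ for the blur kernel, $f$ for the blurry observation, $\epsilon$ for the noise, $\sigma$ for its standard deviation) and the factor $\tfrac{1}{2}$ are names and conventions assumed for this sketch; they may differ from those used in the numbered equations.

```latex
% Measurement model: blurry image = kernel convolved with sharp image, plus noise.
f = k * u + \epsilon, \qquad \epsilon \sim \mathcal{N}(0, \sigma^2 I)

% Maximizing the Gaussian log-likelihood over (u, k) is equivalent to least squares:
\min_{u,\,k} \; \tfrac{1}{2}\, \| k * u - f \|_2^2
```

The second line follows from the first because the negative log-likelihood of an isotropic Gaussian is, up to constants and a positive scale, the squared Euclidean residual.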
Component 1) The Total Variation (TV) -norm on has been shown to promote smoothness of the estimated image Chambolle and Lions (1997). The image TV norm is defined as some norm of its discrete gradient field over the image lattice :
Note that for , this corresponds to the classical anisotropic and isotropic TV norm respectively. Our theoretical analysis extends to any , but we will assume that to describe our results.
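As a small illustration of the two classical cases mentioned above, the following sketch computes a discrete TV value of a tiny image for the anisotropic ($p=1$) and isotropic ($p=2$) choices. The function name `tv_norm`, the use of forward differences, and the zero-difference boundary handling are assumptions made for this example, not details taken from the paper.

```python
import math


def tv_norm(img, p=1):
    """Discrete TV of a 2D image (list of rows) using forward differences.

    Differences that would reach past the image boundary are taken as 0
    (an assumed boundary convention for this sketch).
    """
    rows, cols = len(img), len(img[0])
    total = 0.0
    for i in range(rows):
        for j in range(cols):
            dx = img[i][j + 1] - img[i][j] if j + 1 < cols else 0.0
            dy = img[i + 1][j] - img[i][j] if i + 1 < rows else 0.0
            if p == 1:
                total += abs(dx) + abs(dy)      # anisotropic TV
            elif p == 2:
                total += math.hypot(dx, dy)     # isotropic TV
            else:
                total += (abs(dx) ** p + abs(dy) ** p) ** (1.0 / p)
    return total


corner = [[0.0, 1.0],
          [1.0, 1.0]]
flat = [[0.5, 0.5],
        [0.5, 0.5]]

aniso = tv_norm(corner, p=1)  # only the top-left pixel has nonzero differences
iso = tv_norm(corner, p=2)
```

A constant image has zero TV under either choice, which is the sense in which the regularizer favors piecewise-smooth estimates.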
In order to define a reasonable constraint set, we appeal to the fundamentals of the image capture process. Pixel values are explicitly a positive function of the photon count at a specific point on the image sensor, and so we enforce the constraint that the kernel must be nonnegative. Further, a blurred image can be interpreted as a weighted average of a sharp image captured with slight shifts, typically stemming from an extended exposure time due to a variety of reasons. Together, these requirements form our constraint set: the probability simplex. With these two pieces, the problem that we aim to solve can be formally written as,
where is a tunable regularization parameter. Intuitively, higher values of will encourage more smoothness in the optimal sharp image of (4).
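To make the structure of the regularized problem concrete, here is a minimal one-dimensional sketch of an objective of this form: a least squares data term plus a weighted TV penalty, evaluated only for kernels in the probability simplex. The names `convolve_valid`, `tv_1d`, `in_simplex`, `objective` and `lam`, the factor of one half, and the "valid" boundary handling are all assumptions for illustration.

```python
def convolve_valid(u, k):
    """1D convolution of u with k, keeping only fully overlapping outputs."""
    m = len(k)
    return [sum(k[j] * u[i + m - 1 - j] for j in range(m))
            for i in range(len(u) - m + 1)]


def tv_1d(u):
    """Anisotropic TV of a 1D signal: sum of absolute forward differences."""
    return sum(abs(u[i + 1] - u[i]) for i in range(len(u) - 1))


def in_simplex(k, tol=1e-9):
    """True if k is entrywise nonnegative and sums to one (up to tol)."""
    return all(v >= 0 for v in k) and abs(sum(k) - 1.0) <= tol


def objective(u, k, f, lam):
    """Data term plus lam * TV(u), defined only for kernels in the simplex."""
    if not in_simplex(k):
        raise ValueError("kernel must lie in the probability simplex")
    blurred = convolve_valid(u, k)
    data_term = 0.5 * sum((b - fi) ** 2 for b, fi in zip(blurred, f))
    return data_term + lam * tv_1d(u)


u_true = [0.0, 0.0, 1.0, 1.0, 1.0, 0.0]
k_true = [0.25, 0.5, 0.25]
f = convolve_valid(u_true, k_true)          # noiseless blurry observation
value = objective(u_true, k_true, f, lam=0.1)
```

At the true pair the data term vanishes in this noiseless example, so the objective reduces to the TV penalty alone; raising `lam` increases the cost of every jump in the estimated signal.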
In principle, Problem (4) should be easily amenable to many continuous optimization methods but in practice, Perrone and Favaro (2014) provides compelling evidence that choosing the right algorithm is critical to a successful recovery of the sharp image . Notice two important but straightforward properties of the optimization in (4): 1) the objective is smooth and convex in each argument individually but not jointly convex and 2) the feasible set is convex and compact.
Roadmap. We will see shortly that properly exploiting these two simple properties will suggest a natural choice of an algorithm that is familiar in non-linear optimization but not very broadly used in machine learning and vision. Interestingly, after we motivate the choice of the algorithm, we will see how the properties above provide certain technical results that yield guarantees for fast convergence rates and subsequently, suggest strategies for a rigorous robustness analysis. But first, let us analyze why some obvious simplifications and/or direct use of an alternating scheme may not be an effective strategy for this model.
Potential Idea: ignore nonconvexity? A natural strategy to solve (4) may be to use an algorithm which exploits the convexity of individually with respect to and . A well known method that offers this capability is the Alternating Minimization (AM) algorithm Hardt (2014). The AM algorithm for this model performs the following calculation (or update) at each iteration:
A potential problem of Alternating Minimization: Random versus structured blur. There are some recent results that analyze the convergence behavior of the AM algorithm for random blur kernels Hardt (2014), and offer guarantees on its performance. Unfortunately, it is still an open question whether such guarantees are available for structured blur kernels that we universally encounter in vision. In fact, Perrone and Favaro (2014) explicitly constructs an illustrative example where the AM algorithm converges to a strict saddle point due to the nonconvexity of .
In the context of the blind deconvolution problem, strict saddle points correspond to a no blur solution, that is, when the kernel has only one nonzero entry. We see that in Perrone and Favaro (2014) (cf. Section 3.4), the authors give a clear example where the AM algorithm converges to the no blur solution, and thereby propose specific work-arounds to solve the subproblem (6) such that the algorithm empirically converges to the desired one instead. The authors also show that their scheme performs consistently better on many standard benchmark datasets. However, to our knowledge, it is not clear if the procedure suggested in Perrone and Favaro (2014) guarantees convergence in general. Whether the method in Perrone and Favaro (2014) provably returns a minimizer of (4) is also not described in their work.
Revisit Gradient Methods? Instead of the alternating scheme, we take a more “classical” approach to this problem and propose updating both the image and the kernel simultaneously at each iteration. Our choice of algorithm, described shortly, is motivated by two key insights in Problem (4). First, for a smooth optimization problem, it has recently become known that the set of initial points from where a first order gradient method converges to a saddle point has Lebesgue measure zero Panageas and Piliouras (2016). This immediately entails that with very high probability, a gradient method will converge to a local minimizer. Second, the geometry of the feasible set will allow us to provably speed up the convergence, which is interesting from both a theoretical standpoint and a practical one.
A Mirror-descent style algorithm. To describe our algorithm, it is easiest to briefly review the form of a classical mirror descent (MD) scheme used in convex optimization. Recall that the standard way to solve constrained optimization problems is to use projections, that is, we first take a (negative) gradient step and then a Euclidean projection onto the feasible set, assuming that this is easy to do (as is the case with norm balls, hyperplanes and so on). This procedure is often referred to as Projected Gradient Descent (PGD):
where the projection operator is the Euclidean projection onto the feasible set. Under mild conditions on the step size, PGD in fact guarantees convergence. However, the use of PGD type algorithms requires some caution: PGD completely disregards the geometry of the feasible set and only uses the local behavior of the objective function. Hence, the algorithm can be very inefficient, particularly in the high dimensional and large scale settings we see in vision Mahadevan and Liu (2012); Luong et al. (2012).
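The PGD step discussed above can be sketched for the probability simplex as follows. The projection uses the well-known sort-based procedure; the names `project_simplex` and `pgd_step` and the example numbers are assumptions for this illustration, not the paper's implementation.

```python
def project_simplex(v):
    """Euclidean projection of v onto {x : x >= 0, sum(x) = 1} (sort-based)."""
    s = sorted(v, reverse=True)
    cumulative, theta = 0.0, 0.0
    for j, value in enumerate(s, start=1):
        cumulative += value
        candidate = (cumulative - 1.0) / j
        if value - candidate > 0:
            theta = candidate  # the last index satisfying the test fixes the shift
    return [max(x - theta, 0.0) for x in v]


def pgd_step(k, grad, step):
    """One projected gradient step: move against the gradient, then project."""
    moved = [ki - step * gi for ki, gi in zip(k, grad)]
    return project_simplex(moved)


k0 = [1.0 / 3.0, 1.0 / 3.0, 1.0 / 3.0]
grad = [1.0, 0.0, -1.0]
k1 = pgd_step(k0, grad, step=0.5)
```

In this example the first coordinate is clipped to exactly zero after a single step, and the projection itself needs a sort, so it treats the simplex only through a Euclidean lens.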
Intuitively, Mirror Descent (MD) addresses this problem with the following simple modification: it is better to choose a function that acts like a metric depending on the feasible set. This function is called the Distance Generating Function (DGF) and moreover, it is enough for that function to be a metric just on the feasible set Juditsky et al. . Exploiting this property, MD has been used to design algorithms that are provably faster than PGD Nesterov (2005) and is the preferred algorithm in many applications Srebro et al. (2011); Jain and Thakurta (2012). An excellent description of the MD algorithm and its variants is given in Nemirovski (2012). Recently, Zhou et al. (2017) showed how to extend MD to a class of nonconvex problems called variationally coherent problems. But unfortunately, our problem (4) does not satisfy the assumptions, hence it is not clear how or if the results shown in Zhou et al. (2017) apply.
Motivated by the above discussion, we propose a Provably Robust Image Deconvolution Algorithm (PRIDA), shown in Alg. 1. As alluded to previously, PRIDA is similar in spirit to the MD algorithm in Convex Optimization. The main difference between the standard MD algorithm and PRIDA is that the step size is chosen independently for each coordinate. The intuition behind the step size rule can be seen as follows: if a coordinate of the filter (kernel) at the th iteration is large in magnitude, then we expect it to remain reasonably high at the th iteration. Our empirical results show that this is very effective in practice. Next, we show that PRIDA converges provably to a minimizer.
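To give a feel for this family of updates, the sketch below implements the textbook entropic mirror descent step on the simplex, a multiplicative update followed by a normalization. It uses a single scalar step size, so it is not Alg. 1 itself: the coordinate-wise step size rule of PRIDA is not reproduced here, and the name `entropic_md_step` is an assumption for this example.

```python
import math


def entropic_md_step(k, grad, step):
    """Textbook entropic mirror descent step on the probability simplex.

    Each coordinate is scaled by exp(-step * gradient) and the result is
    renormalized to sum to one. A scalar step is used for simplicity.
    """
    weights = [ki * math.exp(-step * gi) for ki, gi in zip(k, grad)]
    total = sum(weights)
    return [w / total for w in weights]


k0 = [1.0 / 3.0, 1.0 / 3.0, 1.0 / 3.0]
grad = [1.0, 0.0, -1.0]
k1 = entropic_md_step(k0, grad, step=0.5)
```

Because the update is multiplicative, every coordinate that starts strictly positive stays strictly positive, and no sorting or projection is needed; coordinates with larger gradients simply lose mass relative to the others.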
To analyze PRIDA, we use the following equivalent interpretation of the update step (derived in the supplement):
represents the usual Kullback-Leibler divergence betweenand , is the inner product, and denotes element-wise multiplication. Note that when the divergence function is replaced by the Euclidean norm, the algorithm becomes the standard PGD update. Observe that acts as a distance-generating function on simplex , and hence is unique. In order to show convergence we use the following intermediate result. Let . Then for any , we have that,
The proof is given in the supplement. With this in hand, we have the following convergence result. Let , then with step sizes where is fixed, PRIDA converges to a local minimizer of (avoids strict saddle points) almost surely. We will assume without loss of generality for the analysis that the step size . We prove the convergence in two steps. In step 1, we show that the iterates of PRIDA (Algorithm 1) converge to a fixed point. In step 2, we show that there is a subsequence that converges to a stationary point, that is, a point that approximately satisfies the first order necessary conditions. We then use Lee et al. (2017) to show that such a stationary point is a locally optimal solution.
Step 1: For notational convenience, let , where the first coordinates denote and the last coordinates denote respectively. Define where is the indicator function that takes the value if and otherwise. Then for any , we have that,
where (12) is by smoothness of the gradient (assumption), (14) is by Cauchy-Schwarz inequality and (15) is by Pinsker’s inequality (see page 88 in Tsybakov (2008)). Note that the minimizer of with respect to exactly corresponds to the update rule in PRIDA and that is strongly convex in (again due to Pinsker’s inequality, see page 301 in Bubeck et al. (2015)). Hence we can bound the per iteration improvement by,
Step 2: Since the update rule for is standard gradient descent, we know that the iterates converge to a point where the gradient vanishes, see section 1.2.3. in Nesterov (2013). So we focus on the update rule for the for which we use Lemma 4. Taking there, we have that,
Taking the limit as , we showed that we can find a point that satisfies the first order optimality conditions of our optimization problem. Thus we have shown that after steps, we can find a point that is optimal. PRIDA iterates now satisfy the assumptions of Proposition 10 in Lee et al. (2017), and so by Corollary 7 therein it directly follows that PRIDA does not converge to a strict saddle point almost surely. While we can get the same convergence rate (up to logarithmic factors in ) of as that of PGD (see Ghadimi and Lan (2016)), the efficiency of PRIDA comes from the fact that each iteration of PRIDA takes time, compared to the required in PGD (see Chen and Ye (2011) for details) and is trivially parallelizable/amenable to GPU implementation. Details are included in the supplement.
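The argument above invokes Pinsker's inequality, which lower-bounds the Kullback-Leibler divergence between two distributions by half the squared $\ell_1$ distance between them (with natural logarithms). The following sketch checks the inequality numerically on a few hand-picked distributions; the function names and test vectors are assumptions for this illustration.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) with natural logarithms; terms with p_i = 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def l1_distance(p, q):
    """Sum of absolute coordinate differences."""
    return sum(abs(pi - qi) for pi, qi in zip(p, q))


def pinsker_gap(p, q):
    """KL(p || q) - 0.5 * ||p - q||_1^2, nonnegative by Pinsker's inequality."""
    return kl_divergence(p, q) - 0.5 * l1_distance(p, q) ** 2


uniform = [1.0 / 3.0] * 3
pairs = [
    ([0.7, 0.2, 0.1], uniform),
    ([0.5, 0.25, 0.25], [0.25, 0.5, 0.25]),
    ([0.9, 0.05, 0.05], [0.1, 0.45, 0.45]),
]
gaps = [pinsker_gap(p, q) for p, q in pairs]
```

This is the sense in which the divergence is strongly convex with respect to the $\ell_1$ norm on the simplex: a small divergence between consecutive iterates forces them to be close coordinate-wise.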
Having shown the convergence of PRIDA, the natural follow-up investigation is to characterize its behavior in terms of its noise tolerance. We call an algorithm robust if it produces the same output on two different images such that one of them is a slightly perturbed version of the other. This notion of robustness has been recently introduced in the machine learning literature under the context of algorithmic stability Hardt et al. (2015). Recent results in our community show that this is a critically desirable property of algorithms used in vision-based deployments since they are often sensitive to very small perturbations Moosavi Dezfooli et al. (2017); Su et al. (2017).
Plan of Attack. Using only the main concepts of stability, we aim to measure the robustness of our algorithm. In typical stability analyses, noise is often introduced in the gradient computation, as a proxy for stochastic or approximate gradient updates. We follow this idea, and aim to bound the difference between the result of a noisy gradient update and a clean one. To be specific, we assume that two images, one with noise and the other without, produce gradients that are approximately the same.
Hence, at iteration we observe some noisy gradient of (and respectively of ). We would like to bound the distance between and ( of ). In what follows, we look only at the update for the kernel , but note that an analogous argument can be made for the sharp image : the update step for is essentially a (sub)gradient step, and so the argument is simpler. Let be the initial point where all coordinates are equal. Let be the true gradient and be some noisy gradient. Then, we have that computed using and are -close in the sense. In order to study the robustness properties of our algorithm, we will use the interpretation of PRIDA given in (9) and (10). Because the noisy gradient is only being used in the (9), we analyze how much iterates can stray after each of the two updates separately. To that end define the intermediate iterate computed using the true gradient and similarly the noisy one. To make the proof simple, we will assume that the step size is same for all the coordinates, that is, (say ) and note that the argument can be easily extended for the general case. Then, the distance between and can be bounded as follows,
where the first step follows from the definition of and , and the last two from the fact that and . Now we show that the second step of the update, which corresponds to a simple normalization, is also well behaved:
where we use the triangle inequality for (25), the reverse triangle inequality for the inequality in (27), and (28) follows from (23). If the noise level satisfies , then we know the iterates computed using the noisy and true gradients are at most away. This result clearly shows the interplay between the noise level and the step size . When a sharp image undergoes convolution followed by the addition of noise, Lemma 4.1 tells us that it is better to take short steps instead of being overly aggressive. Why are short steps sufficient in practice? Given that every pixel in the blurred image is a nonnegative combination of neighboring pixels in the sharp image, it is enough to search among its neighbors to form a realistic image rather than searching over the whole image space. This can be performed efficiently using short steps.
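The qualitative message, that shorter steps limit how far a noisy gradient can pull the kernel iterate away from the clean one, can be illustrated numerically. The sketch below applies a multiplicative update with a scalar step to a uniform kernel using a true and a perturbed gradient, and measures the $\ell_1$ distance between the results. It is a toy check with assumed numbers and names (`md_step`, `drift`), not a verification of the bound derived above.

```python
import math


def md_step(k, grad, step):
    """Multiplicative update followed by normalization (scalar step size)."""
    weights = [ki * math.exp(-step * gi) for ki, gi in zip(k, grad)]
    total = sum(weights)
    return [w / total for w in weights]


def l1_distance(a, b):
    """Sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


k0 = [0.25, 0.25, 0.25, 0.25]                 # uniform initial kernel
true_grad = [0.4, -0.2, 0.1, -0.3]
noise = [0.05, -0.05, 0.02, -0.02]            # small gradient perturbation
noisy_grad = [g + d for g, d in zip(true_grad, noise)]


def drift(step):
    """l1 distance between the clean and the noisy iterate after one step."""
    return l1_distance(md_step(k0, true_grad, step),
                       md_step(k0, noisy_grad, step))


short_step_drift = drift(0.1)
long_step_drift = drift(1.0)
```

With these numbers the drift after a short step is markedly smaller than after a long one, consistent with the preference for conservative steps when the observation is noisy.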
Initialization. We follow the standard practice common across many vision problems and estimate both and at many resolutions. More specifically, our estimation proceeds through a coarse-to-fine pyramid scheme. For each level, we run PRIDA (Algorithm 1) and upscale the resulting estimated image and kernel for the next level.
At the coarsest level, we initialize the kernel to be uniform, that is, all of its coordinates are equal. While this choice of initialization is critical for many existing algorithms Perrone and Favaro (2014); Pan et al. (2014), it is not so important for PRIDA. Because the objective function is (jointly) bilinear, it may be the case that the initial few gradient steps will push some of the coordinates of the kernel to 0 after a Euclidean projection. This is problematic because such a coordinate will remain at 0 during the entire course of optimization (at that scale), thus reducing the pyramid scheme’s effectiveness. PRIDA on the other hand can be thought of as a version of “soft-removal”: the multiplicative nature will naturally force all elements of any given kernel to remain strictly positive at all times, and hence a few “bad” steps will not necessarily hurt the overall performance.
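The coarse-to-fine scheme with a uniform kernel at the coarsest level can be sketched in one dimension as below. The per-level solver is passed in as a hypothetical callable `solve_level`, standing in for a run of the algorithm at that scale; the resampling rules (pair averaging, nearest-neighbour upscaling) and all function names are assumptions chosen for brevity.

```python
def downsample(x):
    """Halve the length by averaging adjacent pairs (drops a trailing odd sample)."""
    return [(x[i] + x[i + 1]) / 2.0 for i in range(0, len(x) - 1, 2)]


def upsample(x, n):
    """Nearest-neighbour resize of x to length n."""
    return [x[min(i * len(x) // n, len(x) - 1)] for i in range(n)]


def normalize(k):
    """Rescale a nonnegative kernel so that it sums to one."""
    total = sum(k)
    return [v / total for v in k]


def coarse_to_fine(f, kernel_sizes, solve_level):
    """Run solve_level from the coarsest to the finest scale.

    kernel_sizes lists the kernel length per level, coarsest first;
    solve_level(f_level, u, k) is a hypothetical per-level solver that
    returns refined (u, k).
    """
    pyramid = [f]
    for _ in range(len(kernel_sizes) - 1):
        pyramid.append(downsample(pyramid[-1]))
    pyramid.reverse()                                   # coarsest level first
    u = list(pyramid[0])
    k = [1.0 / kernel_sizes[0]] * kernel_sizes[0]       # uniform initial kernel
    for level, (f_level, m) in enumerate(zip(pyramid, kernel_sizes)):
        if level > 0:
            u = upsample(u, len(f_level))               # upscale previous estimates
            k = normalize(upsample(k, m))
        u, k = solve_level(f_level, u, k)
    return u, k


observed = [0.0, 0.25, 0.75, 1.0, 1.0, 0.75, 0.25, 0.0]
identity_solver = lambda f_level, u, k: (u, k)          # placeholder solver
u_hat, k_hat = coarse_to_fine(observed, [2, 3, 5], identity_solver)
```

Renormalizing after each upscaling keeps the kernel in the simplex, and with a strictly positive initial kernel a multiplicative per-level solver would keep every entry positive across scales.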
Numerical Considerations. When calculating the step size per pixel , it may be the case that a given point in the kernel has already been driven close to 0. In this case if the (noisy) gradient is negative, however small, the computed step may be if the value at that point has fallen below machine precision. To avoid these issues, we apply a “Big ” correction Nemirovski (chapter 4) such that the step taken is the minimum of , where is a large positive constant. Intuitively, a large will allow PRIDA to take larger steps, thus encouraging faster convergence. We fix throughout our experiments.
Figure (caption partially recovered): (d) Our result. (e) Ground truth. From top to bottom, each row corresponds to added noise with standard deviation 0, 0.1, and 0.5, respectively.
All experiments were conducted using MATLAB 2017a running on a 12-core Xeon E5-2620 @ 2.4 GHz machine with 64GB RAM. For all experiments on images of size
, we use a fixed regularization hyperparameter of. The run time of each image on the finest scale is approximately - minutes. In the first two sets of experiments, our goal is to validate the theoretical properties of PRIDA shown in earlier sections viz., convergence and robustness. Finally, we test if PRIDA is efficient on real world color images. We compare with two recent standard baselines that are closely related to our algorithm Perrone and Favaro (2014); Perrone et al. (2015), and provide additional experimental details and comparisons with other algorithms in the supplement.
Figure 2 shows the function value convergence rates for PRIDA and for Perrone and Favaro (2014). Using the same pyramid scheme, we compute the function value for 1000 iterations of both algorithms over the finest level, fixing as stated above for PRIDA and the default setting provided by the authors in Perrone and Favaro (2014). Notice that while Perrone and Favaro (2014)’s method initially drops quickly, our method eventually converges much faster to a lower objective function. We note also that the PRIDA updates are significantly more stable, providing evidence of our robustness analysis above.
Figure 4: Color image recovery in the presence of intensity noise. From left to right, we add 0-mean Gaussian random noise of increasing variance. The first row shows the blurred and noisy input, the second the recovered kernel, and the third our final image recovery. Standard denoising methods can be applied to the deblurred image.
| Perrone and Favaro (2014) | 0.0008 | 0.0223 | 0.0584 | 0.0849 | 0.0957 |
| Perrone et al. (2015) | 0.0006 | 0.0994 | 0.1375 | 0.1212 | 0.0941 |
To exemplify the robustness of PRIDA to noise, we conduct experiments on the well-known dataset first introduced by Levin et al. (2009). The grayscale images are pixels in size with known blur kernels ranging in size from 13 to 27 pixels square. To evaluate robustness, we add varying levels of noise to each image, and qualitatively evaluate the end result. We compare our method to the algorithms presented in Perrone and Favaro (2014) and in Perrone et al. (2015).
We show the results of PRIDA in comparison with the standard baselines in Figure 3, see supplement for more results. To generate the noisy and blurred images, Gaussian random noise with mean was added to each blurred image. Here we can clearly observe the ability of our procedure to handle large amounts of noise. Over the entire dataset, we observe that in some interesting cases both algorithms from Perrone and Favaro (2014) and Perrone et al. (2015) are able to recover a reasonably sharp image in the presence of noise. Over the entire dataset from Levin et al. (2009), however, we note that their results are significantly more variable than that of PRIDA. On average, PRIDA is much more consistent in recovery over the entire dataset, shown in Table 1, validating our theoretical analysis above.
While the results above are valuable in validating our theoretical claims, we also evaluate our algorithms’ robustness on real world images. Computationally, an interesting property of PRIDA is that all of its operations involve convolutions (Fast Fourier Transforms) and elementwise operations, both of which can benefit from GPU efficiencies. We provide our (unoptimized) code in the supplement.
We apply PRIDA to a set of large color images that have been synthetically blurred. A recent comparative study on modern blind deconvolution algorithms compiled a dataset of synthetically blurred images spanning a wide range of image sizes, image content, and blur difficulty Lai et al. (2016). 25 real-world images collected from the Internet were each uniformly blurred with 4 known kernels of various size and support. Applying our algorithm to these images, we find results comparable to the state of the art.
In order to find an appropriate regularization, we perform a mild parameter sweep across all 25 images simultaneously for a given kernel size. For a kernel size of , we find that leads to the best qualitative results. Results on the front page include samples from this set.
To demonstrate the robustness of PRIDA on color images, noise was added to each pixel’s lightness value in LAB space Wyszecki and Stiles (1982) and converted back to the original RGB color space. Figure 4 shows how our recovery is affected by increasing amounts of Gaussian random noise. While our kernel recovery degrades with more added noise, it is clear that we are still able to recover the kernel structure, and that our final recovered image is in fact deblurred. Here, we present the raw output of our proposed model. Since the literature on denoising algorithms is mature, if necessary, a denoising algorithm can easily be run after PRIDA to remove the noise depending on its type. In fact, it is a common practice to have a “non-blind” stage at the end of the fine scale in many existing deblurring algorithms.
We propose a new algorithm, PRIDA, for recovering sharp images through blind deconvolution. PRIDA uniquely takes advantage of the specific problem domain, employing mirror descent over the simplex constraint set. We present theoretical analysis of PRIDA and derive guarantees on both convergence and robustness with no extra assumptions. In most real world settings, as noted by Zhu and Milanfar (2011), low light conditions and auto-focus software systems may introduce extra blur and noise since they depend on both exposure time and camera settings.
Our extensive experimentation shows that PRIDA can be a comprehensive solution for real world problems. We showed both qualitatively and quantitatively that PRIDA performs as well as the state of the art under no noise conditions and better in the presence of noise. We believe that our results will be a strong foundation not only for single image blind deconvolution problems, but also for furthering the success of recent data driven approaches such as deep learning architectures.
Our code and additional experiments can be accessed through our GitHub repository at https://github.com/sravi-uwmadison/prida.
Let . Then for any , we have that,
Define . Since minimizes over , and that is differentiable and strongly convex on with respect to -norm, (see page 88 in Tsybakov (2008)), the gradient at should satisfy the following inequality,
Now the derivative of divergence with respect to the th coordinate of is given by,
Plugging in the derivative of KL divergence, adding and subtracting into (30), we get,
where (34) is because . Now rearranging terms in (39) (with in (32)) we get the desired result. Comparison of PGD vs PRIDA: Even though both PGD and PRIDA achieve the same convergence result for smooth function as said in the main paper, PRIDA is more general since the smoothness assumption can be relaxed for any (instead of the specific as required by PGD). This can be seen from inequalities (13)-(15) as,
This is most useful when since it amounts to checking (absolute) maximum entry of the Hessian matrix which is easy to perform.
Moreover, PRIDA can be implemented in an atomic fashion, that is, each coordinate of the kernel can be updated individually followed by a simple normalization, thus the per iteration complexity is . In contrast, the most efficient algorithms to project onto the probability simplex require at least , see Figure 1 in Duchi et al. (2008). While the penalty seems innocuous, these algorithms at the least require sorting (as a subroutine) and hence cannot be easily implemented on GPUs.
Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques, pages 579–590, 2012.
Computer Vision and Pattern Recognition (CVPR), 2017 IEEE Conference on. IEEE, 2017.
Proceedings of the Twenty-Eighth Conference on Uncertainty in Artificial Intelligence, pages 564–573. AUAI Press, 2012.