1 Introduction
Throughout our lives, we continue to learn about the world by sequentially exploring it. We acquire new experiences using old ones, and use the new ones to learn even more. How can we design methods that perform such "explorative" learning to obtain good generalization? This is an open question in artificial intelligence and machine learning.
One such approach is based on Bayesian methods. This approach has not only been used as a theoretical model of human cognitive development (Perfors et al., 2011) but has also been applied to a wide variety of practical explorative-learning tasks, e.g., in active learning and Bayesian optimization to select informative data examples (Houlsby et al., 2011; Gal et al., 2017; Brochu et al., 2010; Fishel and Loeb, 2012), and in reinforcement learning to learn through interactions (Wyatt, 1998; Strens, 2000). Unfortunately, Bayesian methods are computationally demanding because computation of the posterior distribution is a difficult task, especially for large-scale problems. In contrast, non-Bayesian methods, such as those based on continuous optimization, are generally much cheaper computationally, but they cannot directly exploit the mechanisms of Bayesian exploration because they do not estimate the posterior distribution. This raises the following question: how can we design explorative-learning methods that compute a distribution just like Bayesian methods, but at a cost similar to optimization methods?
In this paper, we propose such a method. Our method can be used to solve generic unconstrained function-minimization problems of the following form (a black-box optimization problem in the sense that we may not have access to an analytical form of the function or its derivatives, but we might be able to approximate them at a point):
\[ \min_{\theta \in \mathbb{R}^D} \; f(\theta). \tag{1} \]
A wide variety of problems in supervised, unsupervised, and reinforcement learning can be formulated in this way. Instead of directly solving the above problem, our method solves it indirectly by first taking the expectation of $f(\theta)$ with respect to a probability distribution $q(\theta|\lambda)$, and then solving the following minimization problem:
\[ \min_{\lambda} \; \mathbb{E}_{q(\theta|\lambda)}\!\left[ f(\theta) \right], \tag{2} \]
where minimization is done with respect to the parameter $\lambda$ of the distribution $q(\theta|\lambda)$. This approach is referred to as the Variational Optimization (VO) approach by Staines and Barber (2012) and can lead us to the minimum because the objective is an upper bound on the minimum value of $f$, i.e., $\mathbb{E}_{q(\theta|\lambda)}[f(\theta)] \ge \min_{\theta} f(\theta)$. Therefore minimizing $\mathbb{E}_{q(\theta|\lambda)}[f(\theta)]$ minimizes an upper bound on the minimum of $f$, and when the distribution $q$ puts all its mass on a minimizer of $f$, we recover the minimum value. This type of function minimization is commonly used in many areas of stochastic search such as evolution strategies (Hansen and Ostermeier, 2001; Wierstra et al., 2008). In our problem context, this formulation is advantageous because it enables learning via exploration, where exploration is facilitated through the distribution $q(\theta|\lambda)$.
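As a quick numerical illustration of the bound, the sketch below estimates $\mathbb{E}_q[f(\theta)]$ by Monte Carlo for a one-dimensional Gaussian $q$ and checks that it upper-bounds the minimum of $f$ and collapses to $f(\mu)$ as the variance shrinks. The test function, its parameters, and the sample size are our own illustrative choices, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def f(theta):
    # A simple one-dimensional nonconvex objective (illustrative choice,
    # not the function used in the paper's figures).
    return theta**4 - 3 * theta**2 + theta

def expected_f(mu, sigma, n_samples=100000):
    # Monte Carlo estimate of E_{N(mu, sigma^2)}[f(theta)].
    theta = mu + sigma * rng.standard_normal(n_samples)
    return f(theta).mean()

mu, sigma = 0.5, 0.8
bound = expected_f(mu, sigma)
value_at_mean = f(mu)
# The expectation upper-bounds the minimum of f (theta = -1.3 is near the
# global minimizer), and as sigma -> 0 it collapses to f(mu).
assert bound >= f(-1.3)
assert abs(expected_f(mu, 1e-4) - value_at_mean) < 1e-2
```

As the variance shrinks, the smoothed objective approaches $f(\mu)$, which is how VO recovers the minimum of the original problem.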
Our main contribution is a new method to solve (2) by using a mirror-descent algorithm. We show that our algorithm is a second-order method which solves the original problem (1), even though it is designed to solve the problem (2). Due to its similarity to Newton's method, we refer to our method as the Variational Adaptive-Newton (VAN) method. Figure 1 shows an example of our method for a one-dimensional nonconvex function.
We establish connections of our method to many existing methods in continuous optimization, variational inference, and evolution strategies, and use these connections to derive new algorithms for explorative learning. Below, we summarize the contributions made in the rest of the paper:

In Section 7, we apply our method to supervised learning, unsupervised learning, active learning, and reinforcement learning. In Section 8, we discuss the relevance and limitations of our approach.
This work presents a general-purpose explorative-learning method that has the potential to improve learning in areas such as active learning and reinforcement learning.
2 Variational Optimization
We will focus on solving the problem (2) since it enables estimation of a distribution that can be used for exploration. In this paper, we will use the Gaussian distribution $q(\theta|\lambda) := \mathcal{N}(\theta|\mu, \Sigma)$. The problem (2) can then be rewritten as follows:
\[ \min_{\mu, \Sigma} \; \mathbb{E}_{\mathcal{N}(\theta|\mu, \Sigma)}\!\left[ f(\theta) \right], \tag{3} \]
where $\mathcal{N}(\theta|\mu, \Sigma)$ is the Gaussian distribution with mean $\mu$ and covariance $\Sigma$, and $\lambda := \{\mu, \Sigma\}$. The objective $\mathbb{E}_q[f(\theta)]$ is differentiable under mild conditions even when $f$ is not differentiable, as discussed in Staines and Barber (2012). This makes it possible to apply gradient-based optimization methods to optimize it.
A straightforward approach to minimize (3) is to use Stochastic Gradient Descent (SGD), as shown below:
\[ \mu_{t+1} = \mu_t - \rho_t\, \hat{\nabla}_{\mu}\, \mathbb{E}_{q_t}\!\left[ f(\theta) \right], \tag{4} \]
\[ \Sigma_{t+1} = \Sigma_t - \rho_t\, \hat{\nabla}_{\Sigma}\, \mathbb{E}_{q_t}\!\left[ f(\theta) \right], \tag{5} \]
where $\rho_t$ is a step size at iteration $t$, $\hat{\nabla}$ denotes an unbiased stochastic-gradient estimate, and $q_t := \mathcal{N}(\theta|\mu_t, \Sigma_t)$. We refer to this approach as Variational SGD, or simply VSGD, to differentiate it from the standard SGD that optimizes in the $\theta$ space.
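For intuition, here is a one-dimensional sketch of the VSGD updates (4) and (5) on a simple quadratic, for which the gradients with respect to $\mu$ and $\sigma^2$ are available exactly; the objective, step size, and sample count are illustrative assumptions, not the paper's setup.

```python
import numpy as np

rng = np.random.default_rng(1)

def f(theta):
    return (theta - 2.0) ** 2   # toy quadratic objective, minimum at 2

def grad_f(theta):
    return 2.0 * (theta - 2.0)

# One-dimensional VSGD: gradient descent on E_{N(mu, s2)}[f] in (mu, s2)
# space. The mu-gradient is estimated by averaging grad f over samples;
# the s2-gradient equals 0.5 * E[f''] = 1 exactly for this quadratic.
mu, s2 = -3.0, 1.0
rho = 0.05
for t in range(500):
    theta = mu + np.sqrt(s2) * rng.standard_normal(20)
    g_mu = grad_f(theta).mean()          # unbiased estimate of grad wrt mu
    g_s2 = 0.5 * 2.0                     # exact grad wrt the variance here
    mu -= rho * g_mu
    s2 = max(s2 - rho * g_s2, 1e-6)      # keep the variance positive
assert abs(mu - 2.0) < 0.1
```

Note the manual clipping of the variance: the plain VSGD update has no built-in mechanism to keep $\Sigma$ positive, which is one of the issues the natural-gradient view addresses.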
The VSGD approach is simple and can also work well when used with adaptive-gradient methods that adapt the step size, e.g., AdaGrad and RMSprop. However, as pointed out by Wierstra et al. (2008), it has issues, especially when REINFORCE (Williams, 1992) is used to estimate the gradients of $\mathbb{E}_q[f(\theta)]$. Wierstra et al. (2008) argue that the VSGD update becomes increasingly unstable when the covariance is small, while its steps become very small when the covariance is large. To fix these problems, Wierstra et al. (2008) proposed a natural-gradient method. Our method is also a natural-gradient method, but, as we show in the next section, its updates are much simpler and they lead to a second-order method similar to Newton's method.

3 Variational Adaptive-Newton Method
VAN is a natural-gradient method derived using a mirror-descent algorithm. Due to this, the updates of VAN are fundamentally different from those of VSGD. We will show that VAN adapts its step sizes in a very similar spirit to adaptive-gradient methods. This property will be crucial in establishing connections to Newton's method.
VAN can be derived by making two minor modifications to the VSGD objective. Note that the VSGD updates in (4) and (5) are the solution of the following optimization problem:
\[ \lambda_{t+1} = \arg\min_{\lambda}\; \left\langle \lambda,\; \hat{\nabla}_{\lambda}\, \mathbb{E}_{q_t}[f(\theta)] \right\rangle + \frac{1}{2\rho_t} \left\| \lambda - \lambda_t \right\|^2. \tag{6} \]
The equivalence to the updates (4) and (5) can be shown by taking the derivative of (6) with respect to $\lambda$ and setting it to zero:
\[ \hat{\nabla}_{\lambda}\, \mathbb{E}_{q_t}[f(\theta)] + \frac{1}{\rho_t}\left( \lambda - \lambda_t \right) = 0, \tag{7} \]
which results in the updates (4) and (5). A simple interpretation of this optimization problem is that, in VSGD, we choose the next point $\lambda_{t+1}$ along the gradient while keeping it within a scaled Euclidean ball centered at the current point $\lambda_t$. This interpretation enables us to obtain VAN by making two minor modifications to VSGD.
The first modification is to replace the Euclidean distance by a Bregman divergence, which results in the mirror-descent method. Note that, for exponential-family distributions, the Kullback-Leibler (KL) divergence corresponds to a Bregman divergence (Raskutti and Mukherjee, 2015). Using the KL divergence results in natural-gradient updates, which take better steps when optimizing the parameters of a probability distribution (Amari, 1998).
The second modification is to optimize the VO objective with respect to the mean parameterization $\mathbf{m}$ of the Gaussian distribution instead of the parameter $\lambda$. We emphasize that this modification does not change the solution of the optimization problem, since the Gaussian distribution is a minimal exponential family and the relationship between $\mathbf{m}$ and $\lambda$ is one-to-one (see Sections 3.2 and 3.4.1 of Wainwright and Jordan (2008) for the basics of exponential families and mean parameterizations, respectively).
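Concretely, for the Gaussian $q = \mathcal{N}(\theta|\mu, \Sigma)$, the mean parameterization referred to above consists of the first and second moments (a standard exponential-family fact; the notation here is ours):

```latex
\mathbf{m} := \left( \mathbb{E}_q[\theta],\; \mathbb{E}_q[\theta\theta^{\top}] \right)
            = \left( \mu,\; \Sigma + \mu\mu^{\top} \right).
```

Since $\mu = \mathbb{E}_q[\theta]$ and $\Sigma = \mathbb{E}_q[\theta\theta^{\top}] - \mu\mu^{\top}$, the map between $\lambda = \{\mu, \Sigma\}$ and $\mathbf{m}$ is invertible, so optimizing over $\mathbf{m}$ is equivalent to optimizing over $\lambda$.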
The two modifications give us the following problem:
\[ \mathbf{m}_{t+1} = \arg\min_{\mathbf{m}}\; \left\langle \mathbf{m},\; \hat{\nabla}_{\mathbf{m}}\, \mathbb{E}_{q_t}[f(\theta)] \right\rangle + \frac{1}{\beta_t}\, \mathbb{D}_{KL}\!\left[\, q \,\|\, q_t \,\right], \tag{8} \]
where $q := \mathcal{N}(\theta|\mu, \Sigma)$, $\beta_t > 0$ is a step size, and $\mathbb{D}_{KL}$ denotes the KL divergence. The convergence of this procedure is guaranteed under mild conditions (Ghadimi et al., 2014).
As shown in Appendix A, the solution of this optimization problem is given by
\[ \mu_{t+1} = \mu_t - \beta_t\, \Sigma_{t+1}\, \hat{\nabla}_{\mu}\, \mathbb{E}_{q_t}\!\left[ f(\theta) \right], \tag{9} \]
\[ \Sigma_{t+1}^{-1} = \Sigma_t^{-1} + 2\beta_t\, \hat{\nabla}_{\Sigma}\, \mathbb{E}_{q_t}\!\left[ f(\theta) \right]. \tag{10} \]
The above updates differ from those of VSGD in two ways. First, here we update the precision matrix $\Sigma^{-1}$ while in VSGD we update the covariance matrix $\Sigma$; both updates, however, use the gradient with respect to $\Sigma$. Second, the step size for $\mu$ is adaptive in the above update, since the gradient is scaled by the covariance $\Sigma_{t+1}$.
The above updates correspond to a second-order method which is very similar to Newton's method. We can show this using the following identities (Opper and Archambeau, 2009):
\[ \nabla_{\mu}\, \mathbb{E}_{q}\!\left[ f(\theta) \right] = \mathbb{E}_{q}\!\left[ \nabla_{\theta} f(\theta) \right], \tag{11} \]
\[ \nabla_{\Sigma}\, \mathbb{E}_{q}\!\left[ f(\theta) \right] = \tfrac{1}{2}\, \mathbb{E}_{q}\!\left[ \nabla^2_{\theta} f(\theta) \right]. \tag{12} \]
By substituting these into (9) and (10), we get the following updates, which we call the VAN method:
\[ \mu_{t+1} = \mu_t - \beta_t\, \mathbf{P}_{t+1}^{-1}\, \hat{\mathbb{E}}_{q_t}\!\left[ \nabla_{\theta} f(\theta) \right], \tag{13} \]
\[ \mathbf{P}_{t+1} = \mathbf{P}_t + \beta_t\, \hat{\mathbb{E}}_{q_t}\!\left[ \nabla^2_{\theta} f(\theta) \right], \tag{14} \]
where $\mathbf{P}_t := \Sigma_t^{-1}$ is the precision matrix and $\hat{\mathbb{E}}_{q_t}$ denotes a stochastic estimate of the expectation with respect to $q_t$. The precision matrix contains a running sum of the past averaged Hessians, and the search direction for the mean is obtained by scaling the averaged gradients by the inverse of $\mathbf{P}_{t+1}$. If we compare Eq. (13) to the following update of Newton's method,
\[ \theta_{t+1} = \theta_t - \beta_t \left[ \nabla^2_{\theta} f(\theta_t) \right]^{-1} \nabla_{\theta} f(\theta_t), \tag{15} \]
we can see that the Hessian matrix is replaced by the precision matrix in the VAN update. Due to this connection to Newton's method and the use of an adaptive scaling matrix $\mathbf{P}_{t+1}$, we call our method the Variational Adaptive-Newton (VAN) method.
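To make the updates (13) and (14) concrete, here is a minimal deterministic sketch in one dimension on the quadratic $f(\theta) = (\theta - a)^2$, for which the Gaussian expectations of the gradient and Hessian are available in closed form. The target, step size, and initialization are illustrative choices, not the paper's experimental setup.

```python
# Deterministic 1-D sketch of the VAN updates (13)-(14) on the quadratic
# f(theta) = (theta - a)^2, for which E_q[f'(theta)] = 2*(mu - a) and
# E_q[f''(theta)] = 2 hold exactly for any Gaussian q.
a = 4.0
mu, P = -1.0, 1.0       # mean and precision (P = 1/sigma^2)
beta = 0.5
for t in range(1000):
    avg_grad = 2.0 * (mu - a)        # E_q[grad f], exact for a quadratic
    avg_hess = 2.0                   # E_q[Hessian f], exact
    P = P + beta * avg_hess          # Eq. (14): precision accumulates Hessians
    mu = mu - beta * avg_grad / P    # Eq. (13): Newton-like scaled step
assert abs(mu - a) < 0.01
assert P > 100                       # the variance 1/P shrinks over time
```

As the precision grows, the step sizes shrink automatically, which is the adaptive behaviour discussed above.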
The averaged gradient and the running sum of averaged Hessians allow VAN to avoid some types of local optima. Figure 1 shows such a result when minimizing a one-dimensional nonconvex function (this example is discussed in a blog post by Huszar (2017)). Both the gradient and the Hessian at the initial solution suggest updating towards a local optimum. However, VAN computes an averaged gradient over samples from $q$, which yields steeper descent directions pointing towards the global optimum. The adaptive scaling of the steps further ensures smooth convergence.
The averaging property of VAN is strikingly different from other second-order optimization methods. We expect VAN to be more robust due to the averaging of gradients and Hessians. Averaging is particularly useful for the optimization of stochastic objectives, where applying Newton's method is difficult because reliably selecting a step size is hard. Several variants of Newton's method have been proposed to address this difficulty, e.g., based on quasi-Newton methods (Byrd et al., 2016), incremental approaches (Gürbüzbalaban et al., 2015), or simply adapting the minibatch size (Mokhtari et al., 2016). VAN is most similar to the incremental approach of Gürbüzbalaban et al. (2015), where a running sum of past Hessians is used instead of a single Hessian. In VAN, however, the Hessian is replaced by its average with respect to $q$. For stochastic objectives, VAN differs substantially from existing approaches, and it has the potential to be a viable alternative to them.
An issue with using the Hessian is that it is not always positive semidefinite, for example, for nonconvex problems. For such cases, we can use a Gauss-Newton variant (Bertsekas, 1999), shown below, which we call the Variational Adaptive Gauss-Newton (VAG) method:
\[ \mathbf{P}_{t+1} = \mathbf{P}_t + \beta_t\, \hat{\mathbb{E}}_{q_t}\!\left[ \nabla_{\theta} f(\theta)\, \nabla_{\theta} f(\theta)^{\top} \right]. \tag{16} \]
4 VAN for Large-Scale Problems
Applying VAN to problems with a large number of parameters is not feasible because we cannot compute the exact Hessian matrix. In this section, we describe several variants of VAN that scale to large problems. Our variants are similar to existing adaptive-gradient methods such as AdaGrad (Duchi et al., 2011) and AROW (Crammer et al., 2009). We derive these variants by using a mean-field approximation for $q$. Our derivation opens up the possibility of a new framework for designing computationally efficient second-order methods by using structured distributions $q$.
One of the most common ways to obtain scalability is to use a diagonal approximation of the Hessian. In our case, this approximation corresponds to a distribution with diagonal covariance, i.e., $q(\theta) = \prod_{j} \mathcal{N}(\theta_j | \mu_j, \sigma_j^2)$, where $\sigma_j^2$ is the variance of the $j$'th element. This is a common approximation in variational inference methods and is called the mean-field approximation (Bishop, 2006). Let us denote the precision parameters by $s_j := 1/\sigma_j^2$ and a vector containing them by $\mathbf{s}$. Using this Gaussian distribution in the updates (13) and (14), we get the following diagonal version of VAN, which we call VAN-D:
\[ \mu_{t+1} = \mu_t - \beta_t \left[ \mathrm{diag}(\mathbf{s}_{t+1}) \right]^{-1} \hat{\mathbb{E}}_{q_t}\!\left[ \nabla_{\theta} f(\theta) \right], \quad \mathbf{s}_{t+1} = \mathbf{s}_t + \beta_t\, \hat{\mathbb{E}}_{q_t}\!\left[ \nabla^2_{\theta,\mathrm{diag}} f(\theta) \right], \tag{17} \]
where $\mathrm{diag}(\mathbf{s})$ is a diagonal matrix containing the vector $\mathbf{s}$ as its diagonal and $\nabla^2_{\theta,\mathrm{diag}} f(\theta)$ is the diagonal of the Hessian $\nabla^2_{\theta} f(\theta)$.
The VAN-D update requires computation of the expectation of the diagonal of the Hessian, which could still be difficult to compute. Fortunately, we can approximate it easily by using the reparameterization trick (Kingma and Welling, 2013). This is possible in our framework because we can express the expectation of the Hessian as the gradient of an expectation, as shown below:
\[ \hat{\mathbb{E}}_{q}\!\left[ \nabla^2_{\theta,\mathrm{diag}} f(\theta) \right] = 2\, \hat{\nabla}_{\sigma^2}\, \mathbb{E}_{q}\!\left[ f(\theta) \right] \tag{18} \]
\[ = \frac{1}{\sigma} \circ \hat{\nabla}_{\sigma}\, \mathbb{E}_{\epsilon \sim \mathcal{N}(0, I)}\!\left[ f(\mu + \sigma \circ \epsilon) \right] \tag{19} \]
\[ = \frac{1}{\sigma} \circ \hat{\mathbb{E}}_{\epsilon \sim \mathcal{N}(0, I)}\!\left[ \nabla_{\theta} f(\mu + \sigma \circ \epsilon) \circ \epsilon \right], \tag{20} \]
where $\circ$ denotes the element-wise product and the first step is obtained by taking the diagonal of (12). In general, we can use the simultaneous perturbation stochastic approximation (SPSA) method (Bhatnagar, 2007; Spall, 2000) to compute derivatives. A recent paper by Salimans et al. (2017) showed that this type of computation can also be parallelized, which is extremely useful for large-scale learning. Note that these tricks cannot be applied directly to standard continuous-optimization methods; it is the presence of the expectation with respect to $q$ in VO that enables us to leverage such stochastic-approximation methods for large-scale learning.
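One way to realize this estimator numerically is shown below: the averaged diagonal Hessian is obtained from first-order information only, using the reparameterization $\theta = \mu + \sigma \circ \epsilon$ with $\epsilon \sim \mathcal{N}(0, I)$. The test function and sample size are our own choices; the exact estimator used in the paper's experiments may differ.

```python
import numpy as np

rng = np.random.default_rng(2)

# Estimate E_q[diag Hessian] from gradients only, via the identity
# E_q[d^2 f / d theta_j^2] = E[df/dtheta_j(mu + sigma*eps) * eps_j] / sigma_j
# for eps ~ N(0, I); this follows from the reparameterized form above.
def avg_hessian_diag(grad_f, mu, sigma, n_samples=200000):
    eps = rng.standard_normal((n_samples, mu.size))
    theta = mu + sigma * eps
    return (grad_f(theta) * eps).mean(axis=0) / sigma

# Check on a quadratic f = 0.5 * theta^T H theta with known diagonal.
H = np.array([[3.0, 0.5], [0.5, 1.0]])
grad_f = lambda theta: theta @ H.T       # each row is H @ theta_i
mu = np.array([0.3, -0.7])
sigma = np.array([0.5, 0.8])
est = avg_hessian_diag(grad_f, mu, sigma)
assert np.allclose(est, np.diag(H), atol=0.05)
```

For a quadratic, the estimator recovers the exact diagonal of $H$ up to Monte Carlo noise, while never forming a second derivative.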
Finally, when $f(\theta)$ corresponds to a supervised or unsupervised learning problem with large data, we can compute its gradients by using stochastic methods. We use this version of VAN in our large-scale experiments and call it sVAN, an abbreviation for stochastic VAN.
4.1 Relationship with Adaptive-Gradient Methods
The VAN-D update given in (17) is closely related to an adaptive-gradient method called AdaGrad (Duchi et al., 2011), which uses the following updates:
\[ \theta_{t+1} = \theta_t - \alpha_t\, \hat{\nabla}_{\theta} f(\theta_t) \big/ \left( \mathbf{s}_{t+1}^{1/2} + \epsilon \right), \tag{21} \]
\[ \mathbf{s}_{t+1} = \mathbf{s}_t + \left[ \hat{\nabla}_{\theta} f(\theta_t) \right]^2, \tag{22} \]
where the division, square root, and square are taken element-wise.
Comparing these updates to (17), we can see that both AdaGrad and VAN-D compute the scaling vector $\mathbf{s}$ using a running sum. However, there are some major differences between their updates: 1. VAN-D uses averaged gradients instead of the gradient at a point; 2. VAN-D does not raise the scaling matrix to the power 1/2; 3. the update of $\mathbf{s}$ in VAN-D uses the diagonal of the Hessian instead of the squared gradient values used in AdaGrad. It is possible to use the squared gradient in VAN-D, but since we can compute the Hessian using the reparameterization trick, we do not have to make further approximations.
VAN can also be seen as a generalization of an adaptive method called AROW (Crammer et al., 2009). AROW uses a mirror-descent algorithm very similar to ours, but it has been applied only to problems that use the hinge loss. Our approach not only generalizes AROW but also brings new insights connecting ideas from many different fields.
Table 1: The variants of VAN compared in our experiments.

Method       Description
VAN          Variational Adaptive-Newton method using (13) and (14).
sVAN         Stochastic VAN (gradients estimated using minibatches).
sVAN-Exact   Stochastic VAN with no MC sampling.
sVAN-D       Stochastic VAN with diagonal covariance.
sVAN-Active  Stochastic VAN using active learning.
sVAG         Stochastic Variational Adaptive Gauss-Newton method.
sVAG-D       Stochastic VAG with diagonal covariance.
5 VAN for Variational Inference
Variational inference (VI) enables a scalable computation of the Bayesian posterior and therefore can also be used for explorative learning. In fact, VI is closely related to VO. In VI, we compute the posterior approximation $q(\theta|\lambda)$ for a model $p(\mathcal{D}, \theta)$ with data $\mathcal{D}$ by minimizing the following objective:
\[ \min_{\lambda}\; \mathbb{E}_{q(\theta|\lambda)}\!\left[ -\log p(\mathcal{D}|\theta) - \log p(\theta) + \log q(\theta|\lambda) \right]. \tag{23} \]
We can perform VI by applying a VO-type method to the function inside the square brackets. A small difference here is that the function to be optimized also depends on the parameters of $q$. Conjugate-computation variational inference (CVI) is a recent approach for VI by Khan and Lin (2017). We note that applying VAN to the variational objective recovers CVI. VAN, however, is more general than CVI since it applies to many problems other than VI. A direct consequence of our connection is that CVI, just like VAN, is a second-order method to optimize $\theta$, and is related to adaptive-gradient methods as well. Therefore, using CVI should give better results than standard methods that use updates similar to VSGD, e.g., the black-box VI method of Ranganath et al. (2014).
6 VAN for Evolution Strategies
VAN performs a natural-gradient update in the parameter space of the distribution $q$ (as discussed earlier in Section 3). The connection to natural gradients is based on a recent result by Raskutti and Mukherjee (2015), which shows that mirror descent using a KL divergence for exponential-family distributions is equivalent to natural-gradient descent, where the natural gradient is the one obtained using the Fisher information of the exponential-family distribution. In our case, the mirror-descent algorithm (8) uses the Bregman divergence that corresponds to the KL divergence between two Gaussians. Since the Gaussian is an exponential-family distribution, the mirror descent (8) is equivalent to natural-gradient descent in the dual Riemannian manifold of a Gaussian. Therefore, VAN takes a natural-gradient step by using a mirror-descent step.
Natural Evolution Strategies (NES) (Wierstra et al., 2008) is also a natural-gradient algorithm to solve the VO problem, developed in the context of evolution strategies. NES directly applies natural-gradient descent to optimize for $\mu$ and $\Sigma$, which yields an infeasible algorithm because the Fisher information matrix involved is extremely large. To overcome this issue, Wierstra et al. (2008) proposed a sophisticated reparameterization that reduces the number of parameters. VAN, like NES, is a natural-gradient method, but it has much simpler update rules due to the use of mirror descent in the mean-parameter space.
7 Applications and Experimental Results
In this section, we apply VAN to a variety of learning tasks to establish it as a general-purpose learning tool, and also to show that it performs comparably to continuous-optimization algorithms while extending the scope of their application. Our first application is supervised learning with lasso regression. Standard second-order methods such as Newton's method cannot be applied directly to such problems because of the discontinuity of the objective. For this problem, we show that VAN enables stochastic second-order optimization, which is faster than existing second-order methods such as iterative ridge regression. We also apply VAN to supervised learning with logistic regression and unsupervised learning with a variational auto-encoder, and show that stochastic VAN gives comparable results to existing methods such as AdaGrad. Finally, we show two applications of VAN for explorative learning, namely active learning for logistic regression and parameter-space exploration for deep reinforcement learning.
Table 1 summarizes the various versions of VAN compared in our experiments. The first method is the VAN method, which implements the updates shown in (13) and (14). Stochastic VAN implies that the gradients in the updates are estimated using minibatches of data. The suffix 'Exact' indicates that the expectations with respect to $q$ are computed exactly, i.e., without resorting to Monte Carlo (MC) sampling; this is possible for the lasso objective, as shown in Staines and Barber (2012), and also for logistic regression, as shown in Marlin et al. (2011). The suffix 'D' indicates that a diagonal covariance with the update (17) is used. The suffix 'Active' indicates that minibatches are selected using an active-learning method. Finally, VAG corresponds to the Gauss-Newton type update shown in (16). For all methods except 'sVAN-Exact', we use MC sampling to approximate the expectation with respect to $q$. In our plots, we indicate the number of samples by adding it as a suffix, e.g., sVAN-10 is the stochastic VAN method with 10 MC samples.
7.1 Supervised Learning: Lasso Regression
Given $N$ example pairs $(\mathbf{x}_i, y_i)$ of feature vectors $\mathbf{x}_i$ and real-valued outputs $y_i$, in lasso regression we minimize the following loss, which contains an $L_1$ regularization term:
\[ f(\theta) = \sum_{i=1}^{N} \left( y_i - \mathbf{x}_i^{\top}\theta \right)^2 + \lambda \left\| \theta \right\|_1. \tag{24} \]
Because the function $f$ is nondifferentiable, we cannot directly apply gradient-based methods to solve the problem. For the same reason, it is also not possible to use second-order methods such as Newton's method. VAN can be applied to this problem since the expectation of $f$ under a Gaussian is twice differentiable. We use the gradient and Hessian expressions given in Staines and Barber (2012).
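To see how the expectation smooths the $L_1$ term, the Gaussian expectation of $|\theta|$ has a closed form (a standard result; cf. Staines and Barber, 2012), which we write and check below. The numerical checks are our own illustration.

```python
import math

# Closed-form Gaussian expectation of |theta|, which smooths the
# nondifferentiable L1 term:
#   E_{N(mu, sigma^2)}[|theta|]
#     = mu * erf(mu / (sigma*sqrt(2)))
#       + sigma * sqrt(2/pi) * exp(-mu^2 / (2*sigma^2))
def smoothed_abs(mu, sigma):
    return (mu * math.erf(mu / (sigma * math.sqrt(2)))
            + sigma * math.sqrt(2 / math.pi)
            * math.exp(-mu**2 / (2 * sigma**2)))

# Smooth and differentiable at mu = 0, unlike |mu| itself ...
assert abs(smoothed_abs(0.0, 1.0) - math.sqrt(2 / math.pi)) < 1e-12
# ... and it recovers |mu| as sigma -> 0.
assert abs(smoothed_abs(2.0, 1e-6) - 2.0) < 1e-6
assert abs(smoothed_abs(-2.0, 1e-6) - 2.0) < 1e-6
```

Summing this expression over the coordinates of $\theta$ gives a twice-differentiable surrogate for the $L_1$ penalty, which is what makes the Newton-like VAN updates applicable.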
We compare VAN and sVAN with the iterative-ridge method (iRidge), which is also a second-order method, on two datasets: Bank32nh and YearPredictionMSD. We set the regularization parameter $\lambda$ using a validation set, picking the value that gives the minimum error over a grid of candidate values. The iRidge implementation is based on the minFunc package by Mark Schmidt. For sVAN, the gradients are estimated using minibatches (of different sizes for the two datasets). We report the absolute difference $|\hat{\theta} - \theta^*|$, where $\hat{\theta}$ is the parameter estimated by a method and $\theta^*$ is the optimal value (found by iRidge). For VAN, the estimate is the mean $\mu$ of the distribution. Results are shown in Figure 2 (a) and (b), where we observe that VAN and iRidge perform comparably, but sVAN is more data-efficient than both in the first few passes. Results over multiple runs show very similar trends.
In conclusion, VAN enables the application of a stochastic second-order method to a nondifferentiable problem where existing second-order methods and their stochastic versions cannot be applied directly.
7.2 Supervised Learning: Logistic Regression
In logistic regression, given feature vectors $\mathbf{x}_i$ and labels $y_i \in \{-1, 1\}$, we minimize the following:
\[ f(\theta) = \sum_{i=1}^{N} \log\left( 1 + e^{-y_i \mathbf{x}_i^{\top}\theta} \right) + \frac{\lambda}{2} \left\| \theta \right\|^2, \tag{25} \]
where $y_i$ is the label and $\lambda > 0$ is the regularization parameter.
We compare VAN to Newton's method and AdaGrad, which are standard algorithms for batch and stochastic learning on convex problems, respectively. We use VAN, its stochastic version sVAN, and the diagonal version sVAN-D. We use three real-world datasets from the libSVM database (Chang and Lin, 2011): 'breast-cancer-scale', 'USPS', and 'a1a'. We compare the log-loss on the test set, computed as $-\frac{1}{N_*}\sum_{i=1}^{N_*} \log p(y_i|\mathbf{x}_i, \hat{\theta})$, where $\hat{\theta}$ is the parameter estimate and $N_*$ is the number of examples in the test set. For sVAN and sVAN-D, we use a minibatch size of 10 for the 'breast-cancer-scale' dataset and a minibatch size of 100 for the rest.
Results are shown in Figure 2 (d)-(j). The first row, with plots (d)-(f), shows a comparison of batch methods, where we see that VAN converges at a rate comparable to Newton's method. The second row, with plots (h)-(j), shows the performance for stochastic learning. Since sVAN uses the full Hessian, it converges faster than sVAN-D and AdaGrad, which use a diagonal approximation. sVAN-D shows overall similar performance to AdaGrad. The main advantage of sVAN over AdaGrad is that sVAN maintains a distribution which can be used to evaluate the uncertainty of the learned solution, as shown in the next subsection on active learning.
In conclusion, VAN and sVAN give comparable results to Newton's method, while sVAN-D gives comparable results to AdaGrad.
7.3 Active Learning using VAN
An important difference between active learning and stochastic learning is that an active-learning agent can query for its training examples in each iteration. In active learning for classification, examples in a pool of input data are ranked by an acquisition score which measures how informative an example is for learning, and we pick the top-ranked examples as the minibatch. Active learning is expected to be more data-efficient than stochastic learning, since the learning agent focuses on the most informative data examples.
In our experiments, we use the entropy score (Schein and Ungar, 2007) as the acquisition score to select data examples for binary logistic regression:
\[ \alpha(\mathbf{x}) := -\hat{p} \log \hat{p} - (1 - \hat{p}) \log\left(1 - \hat{p}\right), \tag{26} \]
where $\hat{p}$ is the estimated probability that the label $y$ for an input $\mathbf{x}$ takes the value $1$. Within our VAN framework, we estimate these probabilities using the distribution $q_t := \mathcal{N}(\theta|\mu_t, \Sigma_t)$ at iteration $t$. We use the following approximation that computes the probability using samples from $q_t$:
\[ \hat{p} \approx \mathbb{E}_{q_t}\!\left[ p(y = 1 \,|\, \mathbf{x}, \theta) \right] \tag{27} \]
\[ \approx \frac{1}{S} \sum_{s=1}^{S} p\!\left( y = 1 \,\middle|\, \mathbf{x}, \theta^{(s)} \right), \tag{28} \]
where $S$ is the number of MC samples, $\theta^{(s)}$ are samples from $q_t$, and $p(y = 1|\mathbf{x}, \theta)$ is the logistic likelihood.
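A minimal sketch of this acquisition rule is shown below, with the predictive probability approximated by Monte Carlo averaging of the logistic likelihood over samples from the current Gaussian. The toy pool, prior scale, and sample count are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Entropy acquisition score, Eq. (26), with the predictive probability
# p(y=1|x) approximated as in Eqs. (27)-(28) by averaging the logistic
# likelihood over samples theta_s ~ N(mu, Sigma).
def entropy_scores(X, mu, Sigma, n_samples=100):
    thetas = rng.multivariate_normal(mu, Sigma, size=n_samples)
    p = sigmoid(X @ thetas.T).mean(axis=1)    # predictive prob per example
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return -p * np.log(p) - (1 - p) * np.log(1 - p)

# Toy pool: points near the decision boundary should score highest.
mu = np.array([1.0, 0.0])
Sigma = 0.01 * np.eye(2)
X = np.array([[0.0, 1.0],     # on the boundary (x @ mu = 0)
              [5.0, 0.0]])    # far from it     (x @ mu = 5)
scores = entropy_scores(X, mu, Sigma)
assert scores[0] > scores[1]
```

Picking the highest-scoring pool examples as the next minibatch gives the 'Active' variant described in Table 1.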
Figure 4 compares the performance of VAN with active learning and VAN with stochastic learning on the USPS dataset, using minibatches of 10 examples. The result clearly shows that VAN with active learning is much more data-efficient and stable than VAN with stochastic learning.
7.4 Unsupervised Learning with a Variational Auto-Encoder
We apply VAN to optimize the parameters of a variational auto-encoder (VAE) (Kingma and Welling, 2013). Given observations $\mathcal{D} = \{\mathbf{y}_1, \ldots, \mathbf{y}_N\}$, a VAE models the data points using a neural-network decoder $p(\mathbf{y}_i | \mathbf{z}_i, \theta)$ with input $\mathbf{z}_i$ and parameters $\theta$. The inputs are probabilistic and follow a prior $p(\mathbf{z}_i)$. The encoder $q(\mathbf{z}_i | \mathbf{y}_i, \phi)$ is also parameterized with a neural network, but with parameters $\phi$. The goal is to learn the parameters by minimizing the following:
\[ \min_{\theta, \phi}\; \sum_{i=1}^{N} \mathbb{E}_{q(\mathbf{z}_i|\mathbf{y}_i, \phi)}\!\left[ -\log p(\mathbf{y}_i | \mathbf{z}_i, \theta) \right] + \mathbb{D}_{KL}\!\left[\, q(\mathbf{z}_i|\mathbf{y}_i, \phi) \,\|\, p(\mathbf{z}_i) \,\right], \tag{29} \]
where $\mathbf{z}_i$ is the latent vector, and $\theta$ and $\phi$ are the parameters to be learned. Similar to previous work, we assume $p(\mathbf{z}_i)$ to be a standard Gaussian, and use a Gaussian encoder and a Bernoulli decoder, both parameterized by neural networks.
We train the model on the binarized MNIST dataset using minibatches, and measure the learning performance by the imputed log-loss on the test set, using a procedure similar to Rezende and Mohamed (2015). We compare our methods to adaptive-gradient methods, namely AdaGrad (Duchi et al., 2011) and RMSprop (Tieleman and Hinton, 2012). For all methods, we tune the step size using a validation set and report the test log-loss after convergence. Figure 2 (c) shows the results of 5 runs of all methods. We see that all methods perform similarly to each other. RMSprop has the highest variance among all methods. sVAN-D-1 is slightly worse than sVAG-D-1, which is expected, since for nonconvex problems the Gauss-Newton approximation is better than using the Hessian.
7.5 Parameter-Based Exploration in Deep Reinforcement Learning
Exploration is extremely useful in reinforcement learning (RL), especially in environments where informative feedback is rare. Unlike most learning tasks where a training dataset is readily available, an RL agent needs to explore the state-action space to collect data. An explorative agent that always chooses random actions might never obtain good data for learning the optimal policy, while an exploitative agent that always chooses the currently optimal actions may never try suboptimal actions that can lead to learning a better policy. Thus, striking a balance between exploration and exploitation is a must for effective learning. However, finding such a balance is an open problem in RL.
In this section, we apply VAN to enable efficient exploration by using parameter-based exploration (Rückstieß et al., 2010). In the standard RL setup, we wish to learn the parameter $\theta$ of a parametric policy $\pi(a|s, \theta)$ for action $a$ given state $s$. We seek an optimal $\theta$ such that the state-action sequence it induces maximizes the expected return, where $s_t$ and $a_t$ are the state and action at time $t$, respectively. To facilitate exploration, a common approach is to perturb the action by a random noise $\epsilon$, e.g., by simply adding it to the action. In contrast, in parameter-based exploration, exploration is carried out in the parameter space, as the name suggests, i.e., we sample the policy parameter $\theta$ from a distribution $\mathcal{N}(\theta|\mu, \Sigma)$. Our goal therefore is to learn the distribution parameters $\mu$ and $\Sigma$.
Parameter-based exploration can be better than action-space exploration because, in the former, a small perturbation of a parameter can result in significant explorative behaviour in the action space, whereas to get similar behaviour through action-space exploration, the magnitude of the noise needs to be very large, which leads to instability (Rückstieß et al., 2010).
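The contrast with action-space noise can be sketched as follows: a single parameter draw per episode induces temporally consistent exploration across states. The linear policy and dimensions below are toy placeholders, not the paper's DDPG networks.

```python
import numpy as np

rng = np.random.default_rng(4)

# Parameter-based exploration sketch: instead of perturbing each action,
# sample the policy parameters themselves from N(mu, diag(sigma^2)) once
# at the start of each episode.
def linear_policy(theta, state):
    return theta @ state                 # deterministic given theta

mu = np.zeros(3)
sigma = 0.1 * np.ones(3)

def sample_episode_actions(states):
    theta = mu + sigma * rng.standard_normal(3)   # one draw per episode
    return [linear_policy(theta, s) for s in states]

states = [np.ones(3), 2 * np.ones(3)]
actions = sample_episode_actions(states)
# Within an episode the sampled theta is fixed, so actions vary smoothly
# with the state (here the second is exactly twice the first).
assert abs(actions[1] - 2 * actions[0]) < 1e-12
```

Action-space noise, by contrast, would perturb each action independently, which can break this temporal consistency and require much larger noise magnitudes to explore comparably.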
Existing methods for parameter-based exploration learn $\mu$ and $\Sigma$ independently (similar to VSGD) (Rückstieß et al., 2010; Plappert et al., 2017; Fortunato et al., 2017). However, such methods can become increasingly unstable as learning progresses (as discussed in Section 2). Since VAN exploits the geometry of the Gaussian distribution to jointly learn $\mu$ and $\Sigma$, we expect it to perform better than the existing methods that use VSGD.
We consider the deep deterministic policy gradient (DDPG) method (Silver et al., 2014; Lillicrap et al., 2015), where a locally optimal $\theta$ is obtained by minimizing
\[ f(\theta) = -\mathbb{E}_{s \sim \rho}\!\left[ \hat{Q}\!\left( s, \pi(s, \theta) \right) \right], \tag{30} \]
where $\hat{Q}$ is an estimate of the expected return and $\rho$ is a state distribution. Both $\hat{Q}$ and $\pi$ are neural networks. We compare the performance of sVAN and sVAG against two baseline methods, denoted by SGD and VSGD. SGD refers to DPG with no exploration at all, and VSGD refers to an extension of DPG where the mean and covariance of the parameter distribution are learned using the updates (4) and (5). Due to the large number of neural-network parameters, we use a diagonal approximation to the Hessian for all methods. More details of the experiments are given in Appendix B.
The results in Figure 4 show the performance on HalfCheetah from OpenAI Gym (Brockman et al., 2016) over 5 runs, where we see that the VAN-based methods significantly outperform the existing methods. sVAN-D-1 and sVAG-D-1 both perform equally well. This suggests that the two Hessian approximations, obtained using the reparameterization trick shown in (20) and the Gauss-Newton approximation shown in (16) respectively, are equally accurate for this problem. We can also see that sVAN-D-10 is more data-efficient than sVAN-D-1, especially in the early stages of learning. VSGD is able to find a good policy during learning, but its performance is unstable and degenerates over time, as mentioned previously. SGD, on the other hand, performs very poorly and learns a suboptimal solution. This strongly suggests that a good exploration strategy is crucial for learning good policies on HalfCheetah.
We also ran these comparisons on the 'Pendulum' problem in OpenAI Gym, where we did not observe significant advantages from exploration. We believe this is because the problem does not benefit from exploration, and purely exploitative methods are good enough for it. More extensive experiments are required to validate the results presented in this section.
8 Discussion and Conclusions
We proposed a general-purpose explorative-learning method called VAN. VAN is derived within the variational-optimization framework by using a mirror-descent algorithm. We showed that VAN is a second-order method and is related to existing methods in continuous optimization, variational inference, and evolution strategies. We proposed computationally efficient versions of VAN for large-scale learning. Our experimental results showed that VAN works reasonably well on a wide variety of learning problems.
For problems with high-dimensional parameters $\theta$, computing and inverting the full Hessian matrix is computationally infeasible. One line of future work is to develop versions of VAN that can deal with this issue without making a diagonal approximation.
It is straightforward to extend VAN to non-Gaussian distributions. Our initial work, not discussed in this paper, suggests that the Student-t and Cauchy distributions could be useful candidates. It is also possible to use other types of distributions, for example, to solve discrete optimization problems within a stochastic-relaxation framework (Geman and Geman, 1984). Another avenue is to evaluate the impact of exploration in fields such as active learning and reinforcement learning. In this paper, we have provided some initial results; an extensive application of the methods developed here to real-world problems is required to further understand their advantages and disadvantages compared to existing methods.
The main strength of VAN lies in exploration, using which it can potentially accelerate and robustify optimization. Compared to Bayesian methods, VAN offers a computationally cheap alternative to perform explorative learning. Using such cheap explorations, VAN has the potential to solve difficult learning problems such as deep reinforcement learning, active learning, and lifelong learning.
References
 Amari (1998) Amari, S.I. (1998). Natural gradient works efficiently in learning. Neural computation, 10(2):251–276.
 Bertsekas (1999) Bertsekas, D. P. (1999). Nonlinear programming. Athena Scientific.
 Bhatnagar (2007) Bhatnagar, S. (2007). Adaptive Newtonbased multivariate smoothed functional algorithms for simulation optimization. ACM Transactions on Modeling and Computer Simulation (TOMACS), 18(1):2.
 Bishop (2006) Bishop, C. M. (2006). Pattern recognition. Machine Learning, 128:1–58.
 Brochu et al. (2010) Brochu, E., Cora, V. M., and De Freitas, N. (2010). A tutorial on Bayesian optimization of expensive cost functions, with application to active user modeling and hierarchical reinforcement learning. arXiv preprint arXiv:1012.2599.
 Brockman et al. (2016) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., and Zaremba, W. (2016). OpenAI Gym. arXiv preprint arXiv:1606.01540.
 Byrd et al. (2016) Byrd, R. H., Hansen, S. L., Nocedal, J., and Singer, Y. (2016). A stochastic quasi-Newton method for large-scale optimization. SIAM Journal on Optimization, 26(2):1008–1031.

 Chang and Lin (2011) Chang, C.-C. and Lin, C.-J. (2011). LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2:27:1–27:27. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.
 Crammer et al. (2009) Crammer, K., Kulesza, A., and Dredze, M. (2009). Adaptive regularization of weight vectors. In Advances in neural information processing systems, pages 414–422.
 Duchi et al. (2011) Duchi, J., Hazan, E., and Singer, Y. (2011). Adaptive subgradient methods for online learning and stochastic optimization. The Journal of Machine Learning Research, 12:2121–2159.
 Fishel and Loeb (2012) Fishel, J. A. and Loeb, G. E. (2012). Bayesian exploration for intelligent identification of textures. Frontiers in neurorobotics, 6.
 Fortunato et al. (2017) Fortunato, M., Azar, M. G., Piot, B., Menick, J., Osband, I., Graves, A., Mnih, V., Munos, R., Hassabis, D., Pietquin, O., et al. (2017). Noisy networks for exploration. arXiv preprint arXiv:1706.10295.
 Gal et al. (2017) Gal, Y., Islam, R., and Ghahramani, Z. (2017). Deep Bayesian Active Learning with Image Data. arXiv e-prints.
 Geman and Geman (1984) Geman, S. and Geman, D. (1984). Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Transactions on pattern analysis and machine intelligence, 1(6):721–741.
 Ghadimi et al. (2014) Ghadimi, S., Lan, G., and Zhang, H. (2014). Minibatch stochastic approximation methods for nonconvex stochastic composite optimization. Mathematical Programming, pages 1–39.
 Gürbüzbalaban et al. (2015) Gürbüzbalaban, M., Ozdaglar, A., and Parrilo, P. (2015). A globally convergent incremental Newton method. Mathematical Programming, 151(1):283–313.
 Hansen and Ostermeier (2001) Hansen, N. and Ostermeier, A. (2001). Completely derandomized selfadaptation in evolution strategies. Evolutionary computation, 9(2):159–195.
 Houlsby et al. (2011) Houlsby, N., Huszár, F., Ghahramani, Z., and Lengyel, M. (2011). Bayesian Active Learning for Classification and Preference Learning. arXiv e-prints.
 Huszar (2017) Huszar, F. (2017). Evolution Strategies, Variational Optimisation and Natural ES. http://www.inference.vc/evolution-strategies-variational-optimisation-and-natural-es-2/.
 Khan and Lin (2017) Khan, M. E. and Lin, W. (2017). Conjugatecomputation variational inference: Converting variational inference in nonconjugate models to inferences in conjugate models. arXiv preprint arXiv:1703.04265.
 Kingma and Ba (2014) Kingma, D. and Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
 Kingma and Welling (2013) Kingma, D. P. and Welling, M. (2013). Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114.
 Lillicrap et al. (2015) Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., and Wierstra, D. (2015). Continuous control with deep reinforcement learning. CoRR, abs/1509.02971.
 Marlin et al. (2011) Marlin, B., Khan, M., and Murphy, K. (2011). Piecewise bounds for estimating Bernoulli-logistic latent Gaussian models. In International Conference on Machine Learning.
 Mokhtari et al. (2016) Mokhtari, A., Daneshmand, H., Lucchi, A., Hofmann, T., and Ribeiro, A. (2016). Adaptive Newton method for empirical risk minimization to statistical accuracy. In Advances in Neural Information Processing Systems, pages 4062–4070.
 Opper and Archambeau (2009) Opper, M. and Archambeau, C. (2009). The Variational Gaussian Approximation Revisited. Neural Computation, 21(3):786–792.
 Perfors et al. (2011) Perfors, A., Tenenbaum, J. B., Griffiths, T. L., and Xu, F. (2011). A tutorial introduction to Bayesian models of cognitive development. Cognition, 120(3):302–321.
 Plappert et al. (2017) Plappert, M., Houthooft, R., Dhariwal, P., Sidor, S., Chen, R. Y., Chen, X., Asfour, T., Abbeel, P., and Andrychowicz, M. (2017). Parameter space noise for exploration. arXiv preprint arXiv:1706.01905.
 Ranganath et al. (2014) Ranganath, R., Gerrish, S., and Blei, D. M. (2014). Black box variational inference. In International conference on Artificial Intelligence and Statistics, pages 814–822.
 Raskutti and Mukherjee (2015) Raskutti, G. and Mukherjee, S. (2015). The information geometry of mirror descent. IEEE Transactions on Information Theory, 61(3):1451–1457.
 Rezende and Mohamed (2015) Rezende, D. J. and Mohamed, S. (2015). Variational inference with normalizing flows. arXiv preprint arXiv:1505.05770.
 Rückstieß et al. (2010) Rückstieß, T., Sehnke, F., Schaul, T., Wierstra, D., Sun, Y., and Schmidhuber, J. (2010). Exploring parameter space in reinforcement learning. Paladyn, 1(1):14–24.
 Salimans et al. (2017) Salimans, T., Ho, J., Chen, X., Sidor, S., and Sutskever, I. (2017). Evolution Strategies as a Scalable Alternative to Reinforcement Learning. arXiv e-prints.
 Schein and Ungar (2007) Schein, A. I. and Ungar, L. H. (2007). Active learning for logistic regression: an evaluation. Machine Learning, 68(3):235–265.
 Silver et al. (2014) Silver, D., Lever, G., Heess, N., Degris, T., Wierstra, D., and Riedmiller, M. A. (2014). Deterministic policy gradient algorithms. In Proceedings of the 31st International Conference on Machine Learning, ICML 2014, Beijing, China, 21-26 June 2014, pages 387–395.
 Spall (2000) Spall, J. C. (2000). Adaptive stochastic approximation by the simultaneous perturbation method. IEEE transactions on automatic control, 45(10):1839–1853.
 Staines and Barber (2012) Staines, J. and Barber, D. (2012). Variational Optimization. arXiv e-prints.
 Strens (2000) Strens, M. (2000). A Bayesian framework for reinforcement learning. In Proceedings of the Seventeenth International Conference on Machine Learning (ICML), pages 943–950.
 Tieleman and Hinton (2012) Tieleman, T. and Hinton, G. (2012). Lecture 6.5 - RMSprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 4.
 Wainwright and Jordan (2008) Wainwright, M. J. and Jordan, M. I. (2008). Graphical models, exponential families, and variational inference. Foundations and Trends in Machine Learning, 1(1–2):1–305.
 Wierstra et al. (2008) Wierstra, D., Schaul, T., Peters, J., and Schmidhuber, J. (2008). Natural evolution strategies. In IEEE Congress on Evolutionary Computation (CEC 2008), IEEE World Congress on Computational Intelligence, pages 3381–3387. IEEE.
 Williams (1992) Williams, R. J. (1992). Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning, 8(3–4):229–256.
 Wyatt (1998) Wyatt, J. (1998). Exploration and inference in learning from reinforcement. PhD thesis, University of Edinburgh. College of Science and Engineering. School of Informatics.
Appendix A Derivation of VAN
Denote the mean parameters of $q$ by $\mathbf{m}$, which are equal to the expected value of the sufficient statistics $\boldsymbol{\phi}(\boldsymbol{\theta})$, i.e., $\mathbf{m} := \mathbb{E}_q[\boldsymbol{\phi}(\boldsymbol{\theta})]$. The mirror-descent update at iteration $t$ is given by the solution to
(31) $\mathbf{m}_{t+1} = \arg\min_{\mathbf{m}} \ \langle \mathbf{m}, \nabla_{\mathbf{m}} \mathbb{E}_{q_t}[f(\boldsymbol{\theta})] \rangle + \tfrac{1}{\beta_t} \mathbb{KL}\left[q \,\|\, q_t\right]$
(32) $= \arg\min_{q} \ \mathbb{E}_{q}\big[\langle \boldsymbol{\phi}(\boldsymbol{\theta}), \nabla_{\mathbf{m}} \mathbb{E}_{q_t}[f(\boldsymbol{\theta})] \rangle\big] + \tfrac{1}{\beta_t} \mathbb{E}_{q}\big[\log q(\boldsymbol{\theta}) - \log q_t(\boldsymbol{\theta})\big]$
(33) $= \arg\min_{q} \ \tfrac{1}{\beta_t} \mathbb{E}_{q}\big[\log q(\boldsymbol{\theta}) - \log q_t(\boldsymbol{\theta}) + \beta_t \langle \boldsymbol{\phi}(\boldsymbol{\theta}), \nabla_{\mathbf{m}} \mathbb{E}_{q_t}[f(\boldsymbol{\theta})] \rangle\big]$
(34) $= \arg\min_{q} \ \tfrac{1}{\beta_t} \mathbb{E}_{q}\left[\log \frac{q(\boldsymbol{\theta})}{q_t(\boldsymbol{\theta}) \exp\{-\beta_t \langle \boldsymbol{\phi}(\boldsymbol{\theta}), \nabla_{\mathbf{m}} \mathbb{E}_{q_t}[f(\boldsymbol{\theta})] \rangle\}}\right]$
(35) $= \arg\min_{q} \ \tfrac{1}{\beta_t} \left( \mathbb{KL}\left[q \,\Big\|\, \tfrac{1}{Z_t}\, q_t(\boldsymbol{\theta}) \exp\{-\beta_t \langle \boldsymbol{\phi}(\boldsymbol{\theta}), \nabla_{\mathbf{m}} \mathbb{E}_{q_t}[f(\boldsymbol{\theta})] \rangle\}\right] - \log Z_t \right)$
(36) $= \arg\min_{q} \ \mathbb{KL}\left[q \,\Big\|\, \tfrac{1}{Z_t}\, q_t(\boldsymbol{\theta}) \exp\{-\beta_t \langle \boldsymbol{\phi}(\boldsymbol{\theta}), \nabla_{\mathbf{m}} \mathbb{E}_{q_t}[f(\boldsymbol{\theta})] \rangle\}\right]$
where $Z_t$ is the normalizing constant of the distribution in the denominator, which is a function of the gradient and the step size.
Minimizing this KL divergence gives the update
(37) $q_{t+1}(\boldsymbol{\theta}) \propto q_t(\boldsymbol{\theta}) \exp\{-\beta_t \langle \boldsymbol{\phi}(\boldsymbol{\theta}), \nabla_{\mathbf{m}} \mathbb{E}_{q_t}[f(\boldsymbol{\theta})] \rangle\}$
By rewriting this, we see that we get an update in the natural parameters $\boldsymbol{\lambda}$ of $q$, i.e.,
(38) $\boldsymbol{\lambda}_{t+1} = \boldsymbol{\lambda}_t - \beta_t \nabla_{\mathbf{m}} \mathbb{E}_{q_t}[f(\boldsymbol{\theta})]$
Recalling that the mean parameters of a Gaussian $q = \mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\Sigma})$ are $\mathbf{m}^{(1)} = \boldsymbol{\mu}$ and $\mathbf{M}^{(2)} = \boldsymbol{\mu}\boldsymbol{\mu}^\top + \boldsymbol{\Sigma}$,
and using the chain rule, we can express the gradient $\nabla_{\mathbf{m}} \mathbb{E}_{q}[f]$
in terms of $\nabla_{\boldsymbol{\mu}} \mathbb{E}_{q}[f]$ and $\nabla_{\boldsymbol{\Sigma}} \mathbb{E}_{q}[f]$,
(39) $\nabla_{\mathbf{m}^{(1)}} \mathbb{E}_{q}[f] = \nabla_{\boldsymbol{\mu}} \mathbb{E}_{q}[f] - 2 \big[\nabla_{\boldsymbol{\Sigma}} \mathbb{E}_{q}[f]\big] \boldsymbol{\mu}$
(40) $\nabla_{\mathbf{M}^{(2)}} \mathbb{E}_{q}[f] = \nabla_{\boldsymbol{\Sigma}} \mathbb{E}_{q}[f]$
Finally, recalling that the natural parameters of a Gaussian are $\boldsymbol{\lambda}^{(1)} = \boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}$ and $\boldsymbol{\lambda}^{(2)} = -\tfrac{1}{2}\boldsymbol{\Sigma}^{-1}$, we can rewrite the VAN updates in terms of $\boldsymbol{\mu}$ and $\boldsymbol{\Sigma}$,
(41) $\boldsymbol{\Sigma}_{t+1}^{-1}\boldsymbol{\mu}_{t+1} = \boldsymbol{\Sigma}_{t}^{-1}\boldsymbol{\mu}_{t} - \beta_t \big( \nabla_{\boldsymbol{\mu}} \mathbb{E}_{q_t}[f] - 2 \big[\nabla_{\boldsymbol{\Sigma}} \mathbb{E}_{q_t}[f]\big] \boldsymbol{\mu}_t \big)$
(42) $-\tfrac{1}{2}\boldsymbol{\Sigma}_{t+1}^{-1} = -\tfrac{1}{2}\boldsymbol{\Sigma}_{t}^{-1} - \beta_t \nabla_{\boldsymbol{\Sigma}} \mathbb{E}_{q_t}[f]$
Using the identities $\nabla_{\boldsymbol{\mu}} \mathbb{E}_{q}[f] = \mathbb{E}_{q}[\nabla_{\boldsymbol{\theta}} f(\boldsymbol{\theta})]$ and $\nabla_{\boldsymbol{\Sigma}} \mathbb{E}_{q}[f] = \tfrac{1}{2}\mathbb{E}_{q}[\nabla^2_{\boldsymbol{\theta}} f(\boldsymbol{\theta})]$ (Opper and Archambeau, 2009), these simplify to
(43) $\boldsymbol{\Sigma}_{t+1}^{-1} = \boldsymbol{\Sigma}_{t}^{-1} + \beta_t \, \mathbb{E}_{q_t}[\nabla^2_{\boldsymbol{\theta}} f(\boldsymbol{\theta})]$
(44) $\boldsymbol{\mu}_{t+1} = \boldsymbol{\mu}_{t} - \beta_t \, \boldsymbol{\Sigma}_{t+1} \mathbb{E}_{q_t}[\nabla_{\boldsymbol{\theta}} f(\boldsymbol{\theta})]$
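As a sanity check of these two updates, the following minimal numeric sketch (not the paper's implementation; the quadratic objective, step size, and sample count are illustrative assumptions) runs the precision and mean updates on a 2-D quadratic, estimating the expected gradient and Hessian by Monte Carlo sampling from $q$:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy objective (assumed for illustration):
# f(theta) = 0.5 * theta^T A theta - b^T theta, minimized at A^{-1} b.
A = np.array([[3.0, 0.5],
              [0.5, 1.0]])
b = np.array([1.0, -2.0])
grad_f = lambda th: A @ th - b
hess_f = lambda th: A              # constant Hessian for a quadratic

mu = np.zeros(2)                   # mean of q
Sigma = np.eye(2)                  # covariance of q
beta = 0.3                         # step size (illustrative)
n = 32                             # Monte-Carlo samples per iteration

for t in range(300):
    thetas = rng.multivariate_normal(mu, Sigma, size=n)
    g = np.mean([grad_f(th) for th in thetas], axis=0)  # E_q[grad f]
    H = np.mean([hess_f(th) for th in thetas], axis=0)  # E_q[Hessian of f]
    # Precision update: Sigma^{-1} <- Sigma^{-1} + beta * E_q[Hessian]
    Sigma = np.linalg.inv(np.linalg.inv(Sigma) + beta * H)
    # Mean update: a Newton-like step preconditioned by the new covariance
    mu = mu - beta * Sigma @ g
```

Because the Hessian of a quadratic is constant, the precision simply accumulates $\beta A$ at every step while $\boldsymbol{\mu}$ approaches the minimizer $A^{-1}b$, with the shrinking covariance reducing exploration as the method converges.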
Appendix B Details on the RL experiment
In this section, we give details of the parameter-based exploration task in reinforcement learning (RL). An important open question in reinforcement learning is how to efficiently explore the state and action space. An agent that always acts greedily according to its policy performs pure exploitation. Exploration is necessary to visit inferior states and actions once in a while, to see if they might actually be better. Traditionally, exploration is performed in the action space by, e.g., injecting noise into the policy output. However, injecting noise into the action space may be insufficient in problems where the reward is sparse, i.e., where the agent rarely observes the reward of its actions. In such problems, the agent requires rich explorative behaviour that noise in the action space cannot provide. An alternative approach is to perform exploration in the parameter space (Rückstieß et al., 2010). In this section, we demonstrate that the variational distribution $q(\boldsymbol{\theta})$ obtained using VAN can be straightforwardly used for such exploration in the parameter space.
B.1 Background
First, we give a brief background on reinforcement learning (RL). RL aims to solve a sequential decision-making problem where, at each discrete time step $t$, an agent observes a state $s_t$ and selects an action $a_t$ using a policy $\pi$, i.e., $a_t \sim \pi(a_t|s_t)$. The agent then receives an immediate reward $r(s_t, a_t)$ and observes the next state $s_{t+1}$. The goal in RL is to learn the optimal policy $\pi^{\star}$ which maximizes the expected return $\mathbb{E}[\sum_{t=1}^{\infty} \gamma^{t-1} r(s_t, a_t)]$, where $0 \le \gamma < 1$ is the discount factor and the expectation is taken over a sequence of densities $p(s_{t+1}|s_t, a_t)$ and $\pi(a_t|s_t)$.
A central component of RL algorithms is the state-action value function, or Q-function, which gives the expected return after executing an action $a$ in a state $s$ and following the policy $\pi$ afterwards. Formally, it is defined as follows:
(45) $Q^{\pi}(s, a) = \mathbb{E}\left[\sum_{k=0}^{\infty} \gamma^{k} r(s_{t+k}, a_{t+k}) \,\middle|\, s_t = s, a_t = a\right]$
The Q-function also satisfies a recursive relation known as the Bellman equation:
(46) $Q^{\pi}(s, a) = r(s, a) + \gamma\, \mathbb{E}_{p(s'|s,a)}\big[\mathbb{E}_{\pi(a'|s')}[Q^{\pi}(s', a')]\big]$
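To make this recursion concrete, the following toy sketch (the transition probabilities, rewards, and uniform policy are all invented for illustration) evaluates $Q^{\pi}$ in a two-state, two-action MDP by repeatedly applying the Bellman backup until it reaches the fixed point:

```python
import numpy as np

gamma = 0.9
# Transition probabilities P[s, a, s'] and rewards R[s, a] (made-up values).
P = np.array([[[0.8, 0.2],
               [0.1, 0.9]],
              [[0.5, 0.5],
               [0.3, 0.7]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
pi = np.full((2, 2), 0.5)        # pi[s, a]: a fixed uniform policy

Q = np.zeros((2, 2))
for _ in range(1000):
    V = (pi * Q).sum(axis=1)     # V(s') = sum_a' pi(a'|s') Q(s', a')
    Q = R + gamma * P @ V        # Bellman backup: r(s,a) + gamma * E[V(s')]
```

Since the backup is a $\gamma$-contraction, iterating it converges to the unique $Q^{\pi}$ satisfying the Bellman equation.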
Using the Q-function, the goal of reinforcement learning can be stated simply as finding a policy which maximizes the expected Q-function, i.e.,
(47) $\pi^{\star} = \arg\max_{\pi}\, \mathbb{E}_{s}\big[\mathbb{E}_{\pi(a|s)}[Q^{\pi}(s, a)]\big]$
In practice, the policy is represented by a parameterized function, such as a neural network $\pi_{\theta}$ with policy parameter $\theta$, and the goal is instead to find the optimal parameter $\theta^{\star}$.
B.1.1 Deterministic Policy Gradients
Our parameter-based exploration via VAN can be applied to any reinforcement-learning algorithm that relies on gradient ascent to optimize the policy parameter $\theta$. For demonstration, we focus on a simple yet efficient algorithm called deterministic policy gradients (DPG) (Silver et al., 2014). Simply speaking, DPG aims to find a deterministic policy $\pi_{\theta}(s)$ that maximizes the action-value function by gradient ascent. Since in practice the action-value function is unknown, DPG learns a function approximator $\hat{Q}_{w}$ with parameter $w$ such that $\hat{Q}_{w}(s, a) \approx Q^{\pi}(s, a)$. Then, DPG finds a $\theta$ which locally maximizes the objective $J(\theta) = \mathbb{E}_{s}[\hat{Q}_{w}(s, \pi_{\theta}(s))]$ by gradient ascent, where the gradient is given by
(48) $\nabla_{\theta} J(\theta) = \mathbb{E}_{s}\big[\nabla_{\theta} \pi_{\theta}(s)\, \nabla_{a} \hat{Q}_{w}(s, a)\big|_{a = \pi_{\theta}(s)}\big]$
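As an illustration of this chain-rule gradient, the sketch below (a hypothetical linear policy and a critic with a closed-form action-gradient; none of it is from the paper) performs gradient ascent with the estimate $\mathbb{E}_{s}[\nabla_{\theta}\pi_{\theta}(s)\,\nabla_{a}\hat{Q}(s,a)|_{a=\pi_{\theta}(s)}]$:

```python
import numpy as np

# Illustrative stand-ins: a linear deterministic policy a = theta^T s and a
# critic whose action-gradient is known in closed form,
# Q_hat(s, a) = -(a - w^T s)^2, so dQ/da = -2 (a - w^T s). All values made up.
w = np.array([0.5, -1.0, 2.0])
theta = np.zeros(3)

def policy(theta, s):
    return theta @ s

def dQ_da(s, a):
    return -2.0 * (a - w @ s)

states = [np.array([1.0, 0.0, 0.0]),
          np.array([0.0, 1.0, 1.0])]

# For a linear policy, grad_theta pi_theta(s) = s, so the gradient estimate
# is the empirical mean of s * dQ/da evaluated at a = pi_theta(s).
for _ in range(200):
    grad_J = np.mean([s * dQ_da(s, policy(theta, s)) for s in states], axis=0)
    theta = theta + 0.1 * grad_J    # gradient ASCENT on J(theta)
```

After the ascent steps, the policy's actions match the critic's preferred action $w^\top s$ on the sampled states, which is exactly what ascending this gradient should achieve.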
The parameter $w$ of $\hat{Q}_{w}$ may be learned by any policy-evaluation method. Here, we adopt the approach proposed by Lillicrap et al. (2015), which minimizes the squared Bellman residual to a slowly moving target action-value function. More precisely, $w$ is updated by
(49) $\min_{w}\, \mathbb{E}\Big[\big(r(s, a) + \gamma \hat{Q}_{w'}(s', \pi_{\theta'}(s')) - \hat{Q}_{w}(s, a)\big)^{2}\Big]$
where the expectation is taken over $p(s'|s, a)$ and the distribution of visited state-action pairs $(s, a)$. Here, $\hat{Q}_{w'}$ and $\pi_{\theta'}$ denote target networks, which are separate function approximators that slowly track $\hat{Q}_{w}$ and $\pi_{\theta}$, respectively. The target networks help stabilize the learning procedure (Lillicrap et al., 2015).
Overall, DPG is an actor-critic algorithm that iteratively updates the critic (action-value function) using the gradient of Eq. (49) and updates the actor (policy) using the gradient in Eq. (48). However, a crucial issue with DPG is that it uses a deterministic policy and does not perform exploration by itself. In practice, exploration is done in DPG by injecting noise into the policy output, i.e., $a = \pi_{\theta}(s) + \epsilon$, where $\epsilon$ is noise from some random process such as Gaussian noise. However, as discussed above, action-space noise may be insufficient in some problems. Next, we show that VAN can be straightforwardly applied to DPG to obtain a parameter-based exploration DPG.
B.1.2 Parameter-based Exploration DDPG
To perform parameter-based exploration, we can relax the policy-gradient objective by assuming that the parameter $\theta$ is sampled from a distribution $q(\theta)$, and solve the following optimization problem:
(50) $\max_{q(\theta)}\, \mathbb{E}_{q(\theta)}[J(\theta)]$
This is exactly the VO problem of (2). The stochasticity of $\theta$ through $q(\theta)$ allows the agent to explore the state and action space by varying its policy parameters. This exploration strategy is advantageous since the agent can now exhibit much richer explorative behaviour compared with exploration by action-noise injection.
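A minimal sketch of this exploration scheme (the Gaussian $q$, linear policy, and toy environment below are all assumptions for illustration): one parameter vector is drawn per episode and held fixed, so the agent's behaviour varies across episodes rather than per time step.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical Gaussian q(theta) over policy parameters, e.g., as could be
# maintained by VAN (mu and Sigma are illustrative values).
mu = np.zeros(4)
Sigma = 0.1 * np.eye(4)

def linear_policy(theta, state):
    # Toy deterministic policy: a scalar action a = theta^T s.
    return theta @ state

def run_episode(env_step, s0, horizon=10):
    # Parameter-space exploration: draw ONE parameter sample per episode
    # and keep it fixed, instead of adding per-step action noise.
    theta = rng.multivariate_normal(mu, Sigma)
    s, ret = s0, 0.0
    for _ in range(horizon):
        a = linear_policy(theta, s)
        s, r = env_step(s, a)
        ret += r
    return ret

# Toy environment: the state never changes; reward penalizes large actions.
toy_env_step = lambda s, a: (s, -float(a) ** 2)
returns = [run_episode(toy_env_step, np.ones(4)) for _ in range(20)]
```

Because each episode uses a different $\theta \sim q(\theta)$, the returns differ across episodes, giving temporally consistent exploration within an episode and diversity between episodes.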
Algorithm 1 outlines parameter-based exploration DPG via VAN.