1 The New Stochastic Approximation Theorem
The new state-dependent stochastic approximation theorem, which minimizes the active learning risk function in (3) as well as the passive learning risk function in (2), is similar to theorems described by Gu and Kong (1998) and Delyon et al. (1999) (also see Kushner, 1984; Benveniste et al., 1990, Theorem 2; Kushner, 2010). However, the assumptions, conclusions, and proof of this theorem are specifically designed to be easily understood by researchers outside the field of stochastic approximation theory. Accessible theoretical results matter for the development of machine learning because they help ensure that such results are applied correctly in specific applications.
The theorem is also sufficiently general to handle not only stochastic gradient descent but also variable metric descent methods, in situations where multiple minimizers of the risk function may exist and the risk function is not necessarily strictly convex.
A crucial assumption of the following theorem is that the expected value of , , is continuous in . Given that with probability one, the assumption is continuous in is used to establish in Step 6 of the Proof in Section 3 that
Showing that is continuous in . is not sufficient to obtain the convergence result in (7) for the case where is functionally dependent upon . For example, if is a discrete random vector on a finite sample space , so that
then a sufficient (but not necessary) condition for to be continuous is that both and are continuous in .
The terminology that a function is bounded means that for all there exists a finite number such that: .
The terminology that a stochastic sequence is bounded means that there exists a finite number such that for all : with probability one. Note this is a much stronger assumption than the assumption that each element of the stochastic sequence is bounded!
Theorem 1.1 (State-Dependent Stochastic Approximation Theorem).
Let be a twice continuously differentiable function with a lower bound. Let . Let . Assume that there exists a finite number for all so that with probability one where has Radon-Nikodym density . Let be a function piecewise continuous on a finite partition in its first argument and continuous in its second argument. Let be defined such that for all :
when it exists. Assume and are defined such that is continuous on . In addition, assume that for all :
Let be a sequence of positive real numbers such that:
Let be a -dimensional bounded random vector. Let be a sequence of -dimensional random vectors defined such that for
where is a -dimensional random vector with conditional probability density for each .
Assume, in addition, that either of the following two conditions holds:
A1. both and are bounded functions, or
A2. the stochastic sequence is bounded with probability one.
Then as , with probability one where
The condition that there exists a finite number for all so that with probability one is satisfied, for example, if where is either a finite subset of or a bounded subset of . In other words, if is a discrete random vector taking on values in a finite sample space, then this condition is automatically satisfied.
Assumption A1 of the Stochastic Approximation Theorem requires that the descent search direction function is bounded. This constraint can usually be satisfied by an appropriate choice of . However, Assumption A1 also requires that the Hessian of is bounded. The Hessian of
is only bounded for some types of risk functions commonly encountered in machine learning applications. For example, consider a logistic regression modeling problem where the training stimuli are sampled from a probability mass function on a finite sample space. The Hessian of the negative log-likelihood risk is uniformly bounded on the parameter space in this case. However, the Hessian for many types of multi-layer perceptron type networks in which parameters of multiple layers are simultaneously updated is often not bounded.
When the Stochastic Approximation Theorem is invoked using Assumption A2, the implication is that if the parameter estimates evolve in a closed and bounded region of the parameter space (e.g., this could be a very large region such as ), then this is a sufficient condition to ensure that converges to the specific solution set defined in (12). In practice, Assumption A2 is verified empirically rather than theoretically for specific applications.
2 Adaptive Algorithms for Representation Learning
In this section, we discuss several examples of adaptive learning algorithms which can be analyzed using the State-Dependent Stochastic Approximation Theorem presented in Section 1.
2.1 Adaptive Learning in Passive Statistical Environments
Stochastic approximation algorithms provide a methodology for the analysis and design of adaptive learning machines. A stochastic approximation algorithm is defined by beginning with an initial guess for the parameter values of the learning machine denoted by and then updating that initial guess to obtain a refined estimate called . More specifically, the process of iterated updates is defined by:
where it is assumed that the mini-batch of observations are independent and identically distributed with common density . The mini-batch size is denoted by which is a positive integer whose value can be as small as one. Because is not functionally dependent upon the current state of the learning machine, this is an example of learning in a passive statistical environment.
The function is called the search direction function which attempts to use the current guess for the parameter values and the current observation (or equivalently training stimulus) to calculate the change to the current parameter estimate which is given by the second term on the right-hand side of (13). The magnitude of this change is governed by the strictly positive step size parameter . In order for the stochastic approximation algorithm to converge to an appropriate solution, both the search direction function and step size sequence must be appropriately chosen.
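The update rule described above can be sketched in a few lines. This is only an illustration under stated assumptions: the exact form of (13) is not reproduced here, so the sketch assumes the search direction is averaged over the mini-batch, and the names `stochastic_approximation`, `sample_batch`, and `search_direction`, the Gaussian environment, and the quadratic loss are all hypothetical.

```python
import numpy as np

def stochastic_approximation(theta0, sample_batch, search_direction, step_size, n_iters):
    """Iterate theta <- theta + step_size(t) * (mini-batch average of search directions)."""
    theta = np.asarray(theta0, dtype=float)
    for t in range(n_iters):
        batch = sample_batch()  # passive environment: sampling ignores theta
        d = np.mean([search_direction(x, theta) for x in batch], axis=0)
        theta = theta + step_size(t) * d
    return theta

rng = np.random.default_rng(0)
true_mean = np.array([1.0, -2.0])  # hypothetical environment parameter

def sample_batch(m=4):
    # m i.i.d. observations from a fixed Gaussian environment
    return rng.normal(loc=true_mean, scale=1.0, size=(m, 2))

def search_direction(x, theta):
    # negative gradient of the per-observation loss 0.5 * ||x - theta||^2
    return x - theta

theta_hat = stochastic_approximation(
    np.zeros(2), sample_batch, search_direction, lambda t: 1.0 / (t + 1), 2000
)
```

With the step size 1/(t+1), this particular recursion reduces to a running average of the mini-batch means, which is why the estimate settles near the environment mean.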
In order to determine if is an appropriate search direction function, one typically uses to compute the expected search direction
and then one checks if satisfies the downhill condition
Note that the downhill condition in (15) is a commonly used condition for ensuring that the deterministic gradient descent algorithm defined by:
converges where the sequence of positive step sizes are appropriately chosen. To summarize, the stochastic search direction is chosen such that, on the average, the search direction is downhill by satisfying the relation in (15). As the mini-batch size
increases, the actual search direction will tend to converge to the expected search direction when an appropriate law of large numbers holds. However, it is not necessary that the mini-batch size increase or take on a large value to establish convergence of the algorithm in (13).
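The downhill check can be illustrated numerically. The sketch below assumes the downhill condition takes the usual form that the inner product of the expected search direction with the gradient of the risk is nonpositive; the least-squares risk, the data, and the positive definite matrix `M` (a stand-in for a variable metric) are hypothetical choices made only for this example.

```python
import numpy as np

rng = np.random.default_rng(1)
S = rng.normal(size=(500, 3))  # hypothetical input patterns
y = S @ np.array([0.5, -1.0, 2.0]) + 0.1 * rng.normal(size=500)

def per_example_gradient(theta):
    # gradient of 0.5 * (s.theta - y)^2 for each example, shape (500, 3)
    residual = S @ theta - y
    return residual[:, None] * S

M = np.diag([1.0, 0.5, 2.0])  # assumed positive definite metric

def downhill_inner_product(theta):
    grads = per_example_gradient(theta)
    g_bar = grads.mean(axis=0)           # sample estimate of the risk gradient
    d_bar = (-(grads @ M)).mean(axis=0)  # sample estimate of the expected search direction
    return float(d_bar @ g_bar)

# Evaluate the inner product at several random parameter vectors.
values = [downhill_inner_product(rng.normal(size=3)) for _ in range(20)]
```

Because `M` is positive definite, the inner product equals minus a quadratic form in the estimated gradient, so it is nonpositive at every test point.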
Selecting the search direction in an appropriate manner, however, is not sufficient to ensure convergence of the stochastic descent algorithm in (13). The sequence of positive step sizes must also be appropriately chosen. One common choice for the step size sequence is to use a “search and converge” approach. In this type of approach, the step size is initially held constant or even increased but then eventually decreased at an appropriate rate. For example, Darken et al. (1992) proposed that:
where is the initial positive step size and specifies the “search” time period where the step size is relatively constant, while corresponds to the “converge” time period where the step size tends to decrease for .
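As a rough illustration of the search-and-converge idea, the sketch below uses a simplified schedule of the form eta0 / (1 + t / tau). This is an assumption made for illustration and is not the formula in (17); the function name and the default values of `eta0` and `tau` are hypothetical.

```python
def search_then_converge(t, eta0=0.1, tau=100.0):
    """Simplified schedule: roughly constant for t << tau, decaying like eta0 * tau / t for t >> tau."""
    return eta0 / (1.0 + t / tau)

# Step sizes early in the "search" phase and late in the "converge" phase.
early = [search_then_converge(t) for t in (0, 10, 20)]
late = [search_then_converge(t) for t in (1000, 10000, 100000)]
```

For large t this schedule behaves like a constant divided by t, so the step sizes sum to infinity while their squares have a finite sum, which is the type of behavior step size conditions in stochastic approximation theorems typically require.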
Appropriate choice of the search direction and step size sequence are the essential ingredients for guaranteeing that the stochastic sequence will converge with probability one to the set of points where whenever it does converge. In the special case where the search direction
the stochastic descent algorithm in (13) is called a stochastic gradient descent algorithm. When it converges, a stochastic gradient descent algorithm converges to the set of points where which corresponds to the set of critical points of . Variable metric search directions for accelerating convergence of adaptive learning (e.g., Jani et al., 2000; Paik et al., 2006; Sunehag et al., 2009) such as Quasi-Newton methods and Broyden-Fletcher-Goldfarb-Shanno (BFGS) methods can also be analyzed using the Theorem presented in Section 1.
2.2 Normalization Constants and Contrastive Divergence
Maximum likelihood estimation is a method for computing the parameter estimates that maximize the likelihood of the observed data or equivalently minimize the cross-entropy between the researcher’s model and the empirical distribution of the observed data. For example, suppose that the observed data is a collection of -dimensional vectors which are presumed to be a particular realization of a sequence of independent and identically distributed random vectors with common density . Then the method of maximum likelihood estimation corresponds to finding the parameter vector that is a global minimizer of
on . In addition, as , with probability one where is a particular global minimizer of
under appropriate regularity conditions.
Let . Let be a closed and bounded subset of . Assume for each that the probability density of is a Gibbs density defined such that
where the normalization constant is defined as:
The derivative of in (18) is given by the formula:
Even though satisfies (15) with , Equation (22) cannot be used immediately to derive a stochastic gradient descent algorithm that minimizes for the following reason. The first term on the right-hand side of (23) is usually relatively easy to evaluate. On the other hand, the second term on the right-hand side of (23) is usually very difficult to evaluate because it involves a computationally intractable multidimensional integration.
Let be a sequence of possibly correlated distributed random vectors with a common mean whose joint density is for a given . To obtain a computationally practical method of evaluating the second term on the right-hand side of (23), note that the expected value of
which corresponds to the second term on the right-hand side of (23).
Substituting the Monte Carlo approximation in (24) for the multidimensional integral in (23) and then using the resulting approximate derivative as a stochastic search direction yields the stochastic approximation algorithm defined by:
where the mini-batch is a collection of possibly highly correlated observations with joint density for the th iteration of (26). It is assumed that mini-batches are independent and identically distributed with common density . Equation (26) is an example of a contrastive-divergence-type learning algorithm which can be interpreted as a stochastic approximation algorithm. The mini-batch size can be a fixed integer (e.g., or ) or can be varied (e.g., initially is chosen to be small and then gradually increased to some finite positive integer during the learning process).
Note that the statistical environment used to generate the data for the stochastic approximation algorithm in (26) is not a passive statistical environment since the parameters of the learning machine are updated at learning trial not only by the observation but also by the observations whose joint distribution is functionally dependent upon the current parameter estimates . Thus, contrastive-divergence algorithms of this type can be analyzed using the Theorem presented in Section 1.
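A toy version of this kind of update is sketched below. The assumptions are deliberately strong: the Gibbs density is taken to be proportional to exp(theta . x) on binary vectors, so the model factorizes and can be sampled exactly, whereas a practical contrastive divergence algorithm would draw the model samples with a few steps of a Markov chain. The names `data_probs` and `sample_model`, the mini-batch size, and the step size schedule are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)
data_probs = np.array([0.8, 0.3, 0.5])  # hypothetical environment statistics

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def sample_model(theta, m):
    # exact samples from p(x | theta) proportional to exp(theta . x) on {0,1}^d
    return (rng.random((m, theta.size)) < sigmoid(theta)).astype(float)

theta = np.zeros(3)
m = 5  # number of model samples per learning trial
for t in range(20000):
    x = (rng.random(3) < data_probs).astype(float)  # observation from the environment
    x_tilde = sample_model(theta, m)                # samples that depend on the current theta
    # data term minus Monte Carlo estimate of the normalization-constant term
    direction = x - x_tilde.mean(axis=0)
    theta += (0.5 / (1.0 + t / 200.0)) * direction

model_probs = sigmoid(theta)
```

The distribution of `x_tilde` changes whenever `theta` changes, which is the state-dependent sampling that distinguishes this setting from a passive statistical environment.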
2.3 Missing Data, Hidden Variables, and Expectation Maximization
In this section, the problem of “hidden variables” is considered. The presence of hidden variables is a characteristic feature of deep learning architectures. Suppose the -dimensional random vector could be partitioned such that where is the observable component of and is the unobservable component whose probability distribution is functionally dependent upon a realization of . The elements of correspond to the “visible random variables” while the elements of correspond to the “hidden random variables” or the “missing data”.
The missing data negative log-likelihood analogous to the complete data negative log-likelihood in (18) is given by the formula:
which can be rewritten in terms of the joint density as:
Now take the derivative of (28) under the assumption that the interchange of derivative and integral operators is permissible to obtain:
The derivative in the integrand of (30) is obtained using the identity (e.g., see Louis, 1982; McLachlan and Krishnan, 1996):
which is then approximated using a Monte Carlo approximation using the formula:
where the stochastic imputation is a realization of whose distribution is specified by the conditional density for a given realization and parameter vector .
The final stochastic descent expectation maximization algorithm is then defined by constructing a stochastic gradient descent algorithm by defining the stochastic search direction as negative one multiplied by the derivative in (30) and then replacing the integral in (30) with the Monte Carlo approximation in (32) to obtain:
where the mini-batch at the th learning trial is generated by first sampling a realization from the environment and then sampling times from using the sampled value and the current parameter estimates at the th learning trial. Thus, the new stochastic approximation theorem provides a method for analyzing the asymptotic behavior of the stochastic descent expectation-maximization algorithm.
Note that can be chosen equal to or any positive integer. In the case where , the resulting algorithm approximates the deterministic Generalized Expectation Maximization (GEM) algorithm (see McLachlan and Krishnan, 1996, for a formal definition of a GEM algorithm) in which the learning machine uses its current probabilistic model to compute the expected downhill search direction, takes a downhill step, updates its current probabilistic model, and then repeats this process in an iterative manner.
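The stochastic imputation scheme can be illustrated with a small hidden-variable model. The sketch assumes an equal-weight mixture of two unit-variance Gaussians whose means are the unknown parameters, with the component label playing the role of the hidden variable; the model, the initial guess, the number of imputations `m`, and the step size schedule are hypothetical choices made for this example only.

```python
import numpy as np

rng = np.random.default_rng(3)
true_means = np.array([-2.0, 2.0])  # hypothetical environment parameters

def posterior_hidden(v, mu):
    # p(h = k | v, mu) for equal-weight, unit-variance Gaussian components
    log_w = -0.5 * (v - mu) ** 2
    w = np.exp(log_w - log_w.max())
    return w / w.sum()

mu = np.array([-0.5, 0.5])  # initial parameter guess
m = 3                       # imputations per learning trial
for t in range(20000):
    v = rng.normal(true_means[rng.integers(2)], 1.0)      # observe the visible component
    h = rng.choice(2, size=m, p=posterior_hidden(v, mu))  # impute the hidden component m times
    direction = np.zeros(2)
    for k in range(2):
        # averaged negative gradient of the complete-data loss 0.5 * (v - mu[h])^2
        direction[k] = np.mean(h == k) * (v - mu[k])
    mu += (0.5 / (1.0 + t / 100.0)) * direction
```

The imputations are drawn from a conditional distribution that depends on the current parameter estimate, so the sampling distribution again changes as learning proceeds.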
2.4 Active Learning Machines
Another aspect of deep learning is the ability of a learning machine to learn in environments whose statistical characteristics are molded by its actions. In other words, an optimal “deep representation” depends not only upon the learning machine’s sensory representation of its statistical environment but also upon the effects of a learning machine’s interactions upon its environment as well. In this section, one particular implementation of a learning machine capable of active learning in statistical environments is discussed.
Suppose that a learning machine experiences a collection of “episodes”. The episodes are assumed to be independent and identically distributed. In addition, the th episode is defined such that where is called the initial state of episode and is called the final state of episode . The probability density of when the learning machine is a “passive learner” is specified by the density where specifies the likelihood that is observed by the learning machine in its statistical environment.
On the other hand, define the probability density of when the learning machine is an “active learner”. In this case, the probability that the learning machine selects action given the current state of the environment and the learning machine’s current state of knowledge is expressed by the conditional probability mass function , . The statistical environment of the learning machine is characterized by the probability density, specifying the likelihood of a given initial state of an episode and the conditional density which specifies the likelihood of a final state of an episode given the learning machine’s action and the initial state of the episode .
Thus, the probability distribution of an episode is specified by the density
Let specify the cost incurred by the learning machine when episode is encountered in its environment for a particular state of knowledge . Notice that the cost is functionally dependent upon as well as allowing for the possibility of a learning machine with an “adaptive critic” (e.g., Sutton and Barto, 1998). One possible goal of an adaptive learning machine in an active statistical environment is to minimize the objective function defined by the formula:
where is a Gibbs density for each .
Now take the derivative of (33), interchange the integral and derivative operators, and use a Monte Carlo approximation for the integral in (33) similar to the approximations in (24) and (32). The resulting derivative can then be used to construct the stochastic gradient descent algorithm:
where the probability distribution of at learning trial is specified by the conditional probability density . Note that the identity
was used to derive equation (34) (e.g., see Louis, 1982; McLachlan and Krishnan, 1996).
Note that in this example, the action is functionally dependent upon the initial state of the episode and the parameters of the learning machine. Since the initial state of the episode is independently and identically distributed, one could argue that this example of active learning is an example of learning in an “open-loop” system. However, this approach is easily generalizable and the Theorem presented in Section 1 is still applicable to a problem which has more “closed-loop” characteristics. In particular, define an episode such that:
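A minimal sketch of this type of active learning update is given below. It assumes the identity in question is the standard one that the derivative of a density equals the density times the derivative of its logarithm, so that a per-episode gradient estimate is the derivative of the cost plus the cost times the derivative of the log action probability. The two-state environment, the logistic policy, the mismatch cost with a small quadratic penalty `lam`, and the step size schedule are all hypothetical, and the final state is simply taken to be the chosen action.

```python
import numpy as np

rng = np.random.default_rng(4)
lam = 0.01  # assumed penalty weight; makes the cost depend on theta

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

theta = np.zeros(2)  # theta[s] is the logit of choosing action 1 in state s
for t in range(50000):
    s = rng.integers(2)              # initial state drawn from the environment
    p1 = sigmoid(theta[s])
    a = int(rng.random() < p1)       # action sampled from the current policy
    cost = float(a != s) + 0.5 * lam * theta @ theta
    grad_cost = lam * theta          # derivative of the cost with respect to theta
    grad_log_p = np.zeros(2)
    grad_log_p[s] = a - p1           # derivative of log p(a | s, theta)
    g = grad_cost + cost * grad_log_p
    theta -= (1.0 / (1.0 + t / 200.0)) * g

# Probability of choosing the zero-mismatch action in each state.
p_correct = np.array([1.0 - sigmoid(theta[0]), sigmoid(theta[1])])
```

The quadratic penalty is included so that the expected cost has a finite minimizer; without it the parameter estimates in this toy problem would grow without bound.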
That is, an episode is chosen at random by identifying an initial state , then the learning machine chooses an action at random using and its current parameter estimates which influences its statistical environment and generates the random state . Next, the random state and are used to generate at random the next action which results in a final state . At this time, the parameter update equation in (1) is applied to generate . This same methodology can clearly be extended to situations involving episodes of longer time periods.
The key assumption here is that the episodes are independent and identically distributed (i.i.d.) for each parameter vector in the parameter space. Learning can take place as the learning machine is interacting with its environment provided that the episodes are sampled such that they are not overlapping in time and can be effectively modeled as i.i.d. This type of strategy may be interpreted as Besag’s (1974) coding assumption for the special case of sequences of random vectors.
3 Proof of the New Stochastic Approximation Theorem
In this section, the proof of the stochastic approximation theorem for state-dependent learning, which minimizes the active environment risk function in (3) as well as the passive environment risk function in (2), is provided. The proof of the theorem is based upon a combination of arguments by Blum (1954), the Appendix of Benveniste et al. (1990), and the well-known Robbins-Siegmund Lemma (Robbins and Siegmund, 1971) (e.g., see Benveniste et al., 1987, p. 344 or Mohri et al., 2012 for relevant reviews of this Lemma).
The proof of this theorem is a variation of the analysis of Blum (1954; also see Appendix of Benveniste et al., 1987). Let . Let . Let .
Step 1: Expand objective function using a second-order mean value expansion. Expand about and evaluate at using the mean value theorem for random vectors (Jennrich, 1969, Lemma 3) to obtain:
where the random variable can be defined using Lemma 3 of Jennrich (1969) as a point on the chord connecting and . Substituting the relation
into (36) gives:
Step 2: Identify conditions required for the remainder term of the expansion to be bounded. If Assumption A1 holds so that both and are bounded functions, then there exists a finite number such that for all :
If Assumption A2 holds, then the stochastic sequence is bounded with probability one. Since (i) is bounded with probability one, (ii) and are continuous functions of , and (iii) is piecewise continuous on the -dependent bounded stochastic sequence for all , it follows that there exists a finite number such that for all : with probability one
Step 3: Show asymptotic average decrease in objective function value. Taking the conditional expectation of both sides of (38) with respect to the conditional density and evaluating at and yields:
Now use the relation that with probability one from Step 2 to obtain:
Step 4: Apply the Robbins-Siegmund Lemma. The Robbins-Siegmund Lemma (Robbins and Siegmund, 1971; also see Benveniste et al., 1987, p. 344 or Mohri et al., 2012 for relevant reviews) asserts that since has a lower bound, and additionally (8), (9), (10), and (41) hold, then: (1) converges with probability one as to a random variable , and (2)
with probability one.
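For readers unfamiliar with the lemma, a commonly quoted generic form is sketched below. The symbols $V_t$, $\beta_t$, $\xi_t$, $\zeta_t$, and $\mathcal{F}_t$ are generic notation introduced only for this statement and are not the notation of the proof; in Step 4, $V_t$ would presumably correspond to the objective function value minus its lower bound.

```latex
% Generic statement (notation assumed for illustration only).
Let $(V_t)$, $(\beta_t)$, $(\xi_t)$, $(\zeta_t)$ be sequences of nonnegative
random variables adapted to a filtration $(\mathcal{F}_t)$ such that
\[
  E\left[ V_{t+1} \mid \mathcal{F}_t \right]
  \le (1 + \beta_t) V_t + \xi_t - \zeta_t ,
  \qquad
  \sum_{t=0}^{\infty} \beta_t < \infty ,
  \qquad
  \sum_{t=0}^{\infty} \xi_t < \infty
\]
with probability one. Then $V_t$ converges with probability one to a finite
random variable, and
\[
  \sum_{t=0}^{\infty} \zeta_t < \infty
\]
with probability one.
```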
Step 5: Show the state sequence converges to a random vector. A proof by contradiction is used to demonstrate converges w.p.1 to a random vector . Assume does not converge to as . This means there exists a subsequence of which does not converge to w.p.1 as . Since is continuous, this implies that there exists a subsequence of which does not converge to . But this contradicts the conclusion in Step 4 that every subsequence of converges to the random variable with probability one.
Step 6: Show the state sequence converges to . Equation (42) obtained in Step 4 in conjunction with the relation in (10) implies that a subsequence of as with probability one. Since is continuous, this implies that a subsequence of converges to with probability one. Since (and thus every subsequence of ) converges w.p.1 to the same random vector by the results of Step 5, it follows that converges with probability one to . ∎
References
- Amari (1967) Amari, S. A theory of adaptive pattern classifiers. IEEE Transactions on Electronic Computers, 16(3):299–307, 1967.
- Balcan & Feldman (2013) Balcan, Maria-Florina F and Feldman, Vitaly. Statistical active learning algorithms. In Burges, C.J.C., Bottou, L., Welling, M., Ghahramani, Z., and Weinberger, K.Q. (eds.), Advances in Neural Information Processing Systems, volume 26, pp. 1295–1303. Curran Associates, Inc., 2013. URL http://papers.nips.cc/paper/5101-statistical-active-learning-algorithms%.pdf.
- Bengio et al. (2013) Bengio, Y., Courville, A., and Vincent, P. Representation learning: A review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8):1798–1828, 2013. ISSN 0162-8828. doi: http://doi.ieeecomputersociety.org/10.1109/TPAMI.2013.50.
- Benveniste et al. (1990) Benveniste, A., Metivier, M., and Priouret, P. Adaptive Algorithms and Stochastic Approximation. Springer, New York, 1990.
- Besag (1974) Besag, Julian. Spatial interaction and the statistical analysis of lattice systems. Journal of the Royal Statistical Society. Series B (Methodological), pp. 192–236, 1974.
- Blum (1954) Blum, J. R. Multidimensional stochastic approximation. Annals of Mathematical Statistics, 25:737–744, 1954.
- Bottou (1991) Bottou, Léon. Stochastic gradient learning in neural networks. In Proceedings of Neuro-Nîmes 91, Nimes, France, 1991. EC2. URL http://leon.bottou.org/papers/bottou-91c.
- Bottou (1998) Bottou, Léon. Online algorithms and stochastic approximations. In Saad, David (ed.), Online Learning and Neural Networks. Cambridge University Press, Cambridge, UK, 1998. URL http://leon.bottou.org/papers/bottou-98x. revised, oct 2012.
- Bottou (2004) Bottou, Léon. Stochastic learning. In Bousquet, Olivier and von Luxburg, Ulrike (eds.), Advanced Lectures on Machine Learning, Lecture Notes in Artificial Intelligence, LNAI 3176, pp. 146–168. Springer Verlag, Berlin, 2004. URL http://leon.bottou.org/papers/bottou-mlss-2004.
- Darken et al. (1992) Darken, C., Chang, J., and Moody, J. Learning rate schedules for faster stochastic gradient search. In Proceedings of the 1992 IEEE-SP Workshop on Neural Networks for Signal Processing  II, pp. 3–12, Aug 1992. doi: 10.1109/NNSP.1992.253713.
- Delyon et al. (1999) Delyon, Bernard, Lavielle, Marc, and Moulines, Eric. Convergence of a stochastic approximation version of the em algorithm. The Annals of Statistics, 27(1):94–128, 03 1999. doi: 10.1214/aos/1018031103. URL http://dx.doi.org/10.1214/aos/1018031103.
- Golden (1996) Golden, Richard M. Mathematical Methods for Neural Network Analysis and Design. MIT Press, New York, 1996.
- Gu & Kong (1998) Gu, Ming Gao and Kong, Fan Hui. A stochastic approximation algorithm with markov chain monte-carlo method for incomplete data estimation problems. Proceedings of the National Academy of Sciences of the United States of America, pp. 7270–7274, 1998.
- Hinton et al. (2006) Hinton, Geoffrey E., Osindero, Simon, and Teh, Yee-Whye. A fast learning algorithm for deep belief nets. Neural Computation, 2006.
- Jaakkola et al. (1994) Jaakkola, T., Jordan, M. I., and Singh, S. P. On the convergence of stochastic iterative dynamic programming algorithms. Neural Computation, 6(6):1185–1201, 1994.
- Jani et al. (2000) Jani, U. G., Dowling, E. M., Golden, R. M., and Wang, Z. F. Multiuser interference suppression using block shanno constant modulus algorithm. IEEE Transactions on Signal Processing, 48(5):1503–1506, May 2000. ISSN 1053-587X. doi: 10.1109/78.840003.
- Jennrich (1969) Jennrich, Robert I. Asymptotic properties of non-linear least squares estimators. The Annals of Mathematical Statistics, 40(2):633–643, 1969.
- Kushner (1984) Kushner, H. J. Approximation and Weak Convergence Methods for Random Processes with Applications to Stochastic Systems Theory. MIT, Cambridge, 1984.
- Louis (1982) Louis, Thomas A. Finding the observed information matrix when using the em algorithm. Journal of the Royal Statistical Society, Series B, 44(2):226–233, 1982.
- McLachlan & Krishnan (1996) McLachlan, Geoffrey J. and Krishnan, Thriyambakam. The EM Algorithm and its Extensions. Wiley, New York, 1996.
- Mohri et al. (2012) Mohri, Mehryar, Rostamizadeh, Afshin, and Talwalkar, Ameet. Foundations of Machine Learning. MIT Press, Cambridge, MA, 2012.
- Paik et al. (2006) Paik, Daehyun, Golden, R. M., Torlak, M., and Dowling, E. M. Blind adaptive cdma processing for smart antennas using the block shanno constant modulus algorithm. Signal Processing, IEEE Transactions on, 54(5):1956–1959, May 2006. ISSN 1053-587X. doi: 10.1109/TSP.2006.870608.
- Robbins & Siegmund (1971) Robbins, H. and Siegmund, D. A convergence theorem for non negative almost supermartingales and some applications. In Rustagi, J. S. (ed.), Optimizing Methods in Statistics, pp. 233–257. Academic Press, New York, 1971.
- Salakhutdinov & Hinton (2012) Salakhutdinov, Ruslan and Hinton, Geoffrey E. An efficient learning procedure for deep boltzmann machines. Neural Computation, 24:1967–2006, 2012.
- Sutton & Barto (1998) Sutton, R. S. and Barto, A. G. Reinforcement Learning: An Introduction. MIT Press, 1998.
- Swersky et al. (2010) Swersky, K., Chen, Bo, Marlin, B., and de Freitas, N. A tutorial on stochastic approximation algorithms for training restricted boltzmann machines and deep belief nets. In Information Theory and Applications Workshop (ITA), pp. 1–10, Jan 2010. doi: 10.1109/ITA.2010.5454138.
- Tieleman (2008) Tieleman, T. Training restricted boltzmann machines using approximations to the likelihood gradient. In ICML, 2008.
- Younes (1999) Younes, L. On the convergence of markovian stochastic algorithms with rapidly decreasing ergodicity rates. Stochastics and stochastic reports, 65(3):177–228, 1999.