1 Introduction
Active learning algorithms aim to learn parameters with less data by querying labels adaptively. However, since such algorithms change the sampling distribution, they can introduce bias into the learned parameters. While there has been some work to understand this (Schütze et al., 2006; Bach, 2007; Dasgupta & Hsu, 2008; Beygelzimer et al., 2009), an understanding of the most common algorithm, “uncertainty sampling” (Lewis & Gale, 1994; Settles, 2010), remains elusive. One of the oddities of uncertainty sampling is that the bias is sometimes helpful: uncertainty sampling with a subset of the data can yield lower error than random sampling on all the data (Schohn & Cohn, 2000; Bordes et al., 2005; Chang et al., 2017). But sometimes, uncertainty sampling can vastly underperform, and in general, different initializations can yield different parameters asymptotically. Despite the wealth of theory on active learning (Balcan et al., 2006; Hanneke et al., 2014), a theoretical account of uncertainty sampling is lacking.
In this paper, we characterize the dynamics of a streaming variant of uncertainty sampling to explain the bias it introduces. We introduce a smoothed version of the zero-one loss which approximates and converges to the zero-one loss. We show that uncertainty sampling, which minimizes a convex surrogate loss on all the points labeled so far, asymptotically performs a preconditioned stochastic gradient step on the smoothed (nonconvex) population zero-one loss. Furthermore, each uncertainty sampling iterate in expectation moves in a descent direction of the smoothed population zero-one loss, unless the parameters are at an approximate stationary point. In addition, uncertainty sampling converges to a stationary point of the smoothed population zero-one loss. This explains why uncertainty sampling sometimes achieves lower zero-one loss than random sampling: that is approximately the quantity it implicitly optimizes. At the same time, since the zero-one loss is nonconvex, uncertainty sampling can get stuck in a local minimum with higher zero-one loss (see Figure 1).
Empirically, we validate the properties of uncertainty sampling on a simple synthetic dataset for intuition, as well as on 22 real-world datasets. Our new connection between uncertainty sampling and zero-one loss minimization clarifies the importance of a sufficiently large seed set, rather than using a single point per class, as is commonly done in the literature (Tong & Koller, 2001; Yang & Loog, 2016).
2 Setup
We focus on binary classification. Let z = (x, y) be a data point, where x is the input and y ∈ {−1, +1} is the label, drawn from some unknown true data distribution p*. Assume we have a scoring function S(x; θ), where θ ∈ R^d are the parameters; for linear models, we have S(x; θ) = θ · φ(x), where φ(x) ∈ R^d is the feature map.
Given parameters θ, we predict +1 if S(x; θ) > 0 and −1 otherwise, and therefore err when y and S(x; θ) have opposite signs. Define ZO(θ)
to be the zero-one loss (misclassification rate) over the data distribution, the evaluation metric of interest:
ZO(θ) = E_{(x,y)∼p*}[ H(−y S(x; θ)) ]  (1)
where H is the Heaviside step function:
H(s) = 1 if s > 0, and H(s) = 0 otherwise.  (2)
Note that the training zero-one loss is piecewise constant, and its gradient is zero
almost everywhere. However, assuming the probability density function (PDF) of p*
is smooth, the population zero-one loss ZO is differentiable at most parameters, a fact that will be shown later. Since minimizing the zero-one loss is computationally intractable (Feldman et al., 2012), it is common to define a convex surrogate ℓ(z; θ) which upper bounds the zero-one loss; for example, the logistic loss is given by ℓ(z; θ) = log(1 + exp(−y S(x; θ))). Given a labeled dataset
D_t = {z_1, …, z_t}, we can define the estimator that minimizes the sum of the loss plus regularization:
θ_t = argmin_θ Σ_{i=1}^{t} ℓ(z_i; θ) + (λ/2) ‖θ‖²  (3)
This can often be solved efficiently via convex optimization.
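As a concrete illustration, here is a minimal numpy sketch of minimizing a regularized logistic objective of the form in (3) by gradient descent; the function name, step size, and iteration count are our illustrative choices, not the paper's.

```python
import numpy as np

def regularized_logistic_fit(X, y, lam=1.0, lr=0.5, steps=2000):
    """Minimize sum_i log(1 + exp(-y_i * (theta @ x_i))) + (lam/2) ||theta||^2.

    A sketch of the estimator in Eq. (3) using plain gradient descent;
    any convex solver would do. X is (n, d); y is (n,) with entries in {-1, +1}.
    """
    n, d = X.shape
    theta = np.zeros(d)
    for _ in range(steps):
        margins = np.clip(y * (X @ theta), -30, 30)  # avoid overflow in exp
        # Gradient of the logistic loss per point: -sigmoid(-margin) * y * x.
        coeff = -y / (1.0 + np.exp(margins))
        grad = X.T @ coeff + lam * theta
        theta -= lr * grad / n  # scale the step by n for stability
    return theta
```

Since the objective is strongly convex (the regularizer is quadratic), gradient descent with a small enough step converges to the unique minimizer.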
Passive learning: random sampling.
Define the population surrogate loss as
L(θ) = E_{z∼p*}[ ℓ(z; θ) ]  (4)
In standard passive learning, we sample z_t randomly from the population and compute θ_t. As t → ∞, the parameters generally converge to the minimizer of L, which is in general distinct from the minimizer of ZO.
Active learning: uncertainty sampling.
In this work, we consider the streaming setting (Settles, 2010), where a learner receives a stream of unlabeled inputs x (drawn from p*, with the label y unknown) and must decide whether or not to label each point. We analyze uncertainty sampling in this setting (Lewis & Gale, 1994; Settles, 2010), which is widely used for its simplicity and efficacy (Yang & Loog, 2016).
Let us denote by n our label budget, the number of points we label. Uncertainty sampling (Algorithm 1) begins with a seed set of labeled points drawn from the beginning of the stream and minimizes the regularized loss (3) to obtain initial parameters. Then, the algorithm takes a point x from the stream and labels it with probability q(S(x; θ)/h) for some acceptance function q and scalar h > 0. One example is the boxcar function q(s) = 1[|s| ≤ 1], which corresponds to labeling points from the stream if and only if |S(x; θ)| ≤ h. As h gets smaller, we choose points closer to the decision boundary.
If we decide to label x, then we obtain the corresponding label y and add z = (x, y) to the labeled dataset. Finally, we update the model by optimizing (3). The process continues until we have labeled n points in total.
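The streaming procedure just described can be sketched as follows. Function names and the retraining schedule are ours; the acceptance rule shown is the hard "label iff |S(x; θ)| ≤ h" example from the text, and `fit` stands in for the regularized surrogate minimization (3).

```python
import numpy as np

def stream_uncertainty_sampling(stream, n_seed, budget, h, fit):
    """Streaming uncertainty sampling (a sketch of Algorithm 1).

    `stream` is an iterator of (x, y) pairs; labels are consumed ("queried")
    only for seed or accepted points.  `fit(X, y)` should return parameters
    minimizing the regularized surrogate loss; we treat it as a black box.
    """
    X, Y = [], []
    for x, y in stream:                        # seed set: head of the stream
        X.append(x); Y.append(y)
        if len(Y) == n_seed:
            break
    theta = fit(np.array(X), np.array(Y))
    for x, y in stream:
        if len(Y) >= budget:
            break
        if abs(float(np.dot(x, theta))) <= h:  # uncertain: near the boundary
            X.append(x); Y.append(y)           # query the label, retrain
            theta = fit(np.array(X), np.array(Y))
    return theta, len(Y)
```

In the soft variant described above, a point would instead be accepted with probability q(S(x; θ)/h) rather than by this hard threshold.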
3 Theory
We present four types of theoretical results. First, in Section 3.1, we show how the optimal parameters change with the addition of a single point to the convex surrogate (e.g., logistic) loss. Then, in Section 3.2, we introduce a smoothed version of the zero-one loss and show that uncertainty sampling is preconditioned stochastic gradient descent on this smoothed zero-one loss. Finally, we show that uncertainty sampling iterates move in expectation in a descent direction (Section 3.3), and that uncertainty sampling converges to a stationary point of the smoothed zero-one loss (Section 3.4).
3.1 Incremental Parameter Updates
First, we analyze how the minimizer of the sample convex surrogate loss changes with a single additional point, showing that the change is a preconditioned¹ gradient step on the additional point. (¹ “Preconditioned” refers to multiplying the (stochastic) gradient by a symmetric positive semidefinite matrix in (stochastic) gradient descent (Li, 2018; Klein et al., 2011); the matrix is often chosen to approximate the inverse Hessian.) Let us assume the loss is convex and thrice differentiable with bounded derivatives:
Assumption 1 (Convex Loss).
The loss is convex in .
Assumption 2 (Loss Regularity).
The loss is continuously thrice differentiable in , and the first three derivatives are bounded by some constant in the Frobenius norm.
Consider any iterative algorithm that adds a single point z_{t+1} each time and minimizes the regularized training loss at each iteration t:
L_{t+1}(θ) = Σ_{i=1}^{t+1} ℓ(z_i; θ) + (λ/2) ‖θ‖²,   θ_{t+1} = argmin_θ L_{t+1}(θ)  (5)
to produce θ_{t+1}. Since D_t and D_{t+1} differ by only one point, we expect θ_t and θ_{t+1} to also be close. We can make this formal using Taylor’s theorem. First, since θ_{t+1} is a minimizer, we have ∇L_{t+1}(θ_{t+1}) = 0. Then, since the loss is continuously twice-differentiable:
0 = ∇L_{t+1}(θ_t) + [∫₀¹ ∇²L_{t+1}(θ_t + s(θ_{t+1} − θ_t)) ds] (θ_{t+1} − θ_t)  (6)
Let M_{t+1} be the value of the integral. Since the loss is convex and the regularizer is quadratic, ∇²L_{t+1} is symmetric positive definite, and thus M_{t+1} is symmetric positive definite and thus invertible. Therefore, we can solve for θ_{t+1}:
θ_{t+1} = θ_t − M_{t+1}^{−1} ∇L_{t+1}(θ_t)  (7)
Since θ_t minimizes L_t, we have ∇L_t(θ_t) = 0. Also note that ∇L_{t+1}(θ_t) = ∇L_t(θ_t) + ∇ℓ(z_{t+1}; θ_t). Thus,
θ_{t+1} = θ_t − M_{t+1}^{−1} ∇ℓ(z_{t+1}; θ_t)  (8)
The update above holds for any choice of z_{t+1}; in particular, it holds whether z_{t+1} is chosen by random sampling or by uncertainty sampling.
For random sampling, z_{t+1} ∼ p*, so we have
E[∇ℓ(z_{t+1}; θ_t) | θ_t] = ∇L(θ_t)  (9)
from which one can interpret the iterates of random sampling as preconditioned SGD on the population surrogate loss L.
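The incremental-update view can be checked numerically. Below is a sketch (all names and constants are ours) verifying that the one-point update of regularized logistic regression is approximately a preconditioned gradient step, using the Hessian of the regularized loss at the current parameters as a stand-in for the integrated-Hessian preconditioner in the derivation above.

```python
import numpy as np

def newton_fit(X, y, lam):
    """Minimizer of the regularized logistic loss via Newton's method."""
    theta = np.zeros(X.shape[1])
    for _ in range(50):
        m = np.clip(y * (X @ theta), -30, 30)
        p = 1.0 / (1.0 + np.exp(m))               # sigmoid(-margin)
        grad = X.T @ (-y * p) + lam * theta
        w = p * (1.0 - p)
        H = (X * w[:, None]).T @ X + lam * np.eye(X.shape[1])
        theta = theta - np.linalg.solve(H, grad)
    return theta

def preconditioned_step(theta, X_all, y_all, x_new, y_new, lam):
    """Predicted update -M^{-1} grad l(z_new; theta), with M approximated by
    the Hessian of the regularized loss on all t+1 points at theta (the
    derivation's M is an integrated Hessian; this is a first-order stand-in)."""
    m = np.clip(y_all * (X_all @ theta), -30, 30)
    p = 1.0 / (1.0 + np.exp(m))
    M = (X_all * (p * (1.0 - p))[:, None]).T @ X_all + lam * np.eye(X_all.shape[1])
    m_new = np.clip(y_new * (x_new @ theta), -30, 30)
    g = -y_new * x_new / (1.0 + np.exp(m_new))    # gradient of new point's loss
    return -np.linalg.solve(M, g)
```

With a moderately large regularizer, the single-point step is small, so the exact update and this predicted preconditioned step agree closely.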
3.2 Parameter Updates of Uncertainty Sampling
Whereas random sampling is preconditioned SGD on the population surrogate loss L, we will now show that uncertainty sampling is preconditioned SGD on a smoothed version of the population zero-one loss ZO.
Recall that q is the acceptance function. Define Φ as a normalized antiderivative of q, which converges to the Heaviside function when the domain is scaled by h → 0. First, we need to make an assumption that ensures q has a finite integral over the real line.
Assumption 3 (Continuous, bounded, even).
The acceptance function q is continuous, has bounded support (q(s) = 0 for all |s| large enough), and is even (q(s) = q(−s)).
Now we are ready to define ZO_h, which is made by replacing the Heaviside step function with Φ and scaling the domain by h:
Φ(s) = (∫_{−∞}^{s} q(u) du) / (∫_{−∞}^{∞} q(u) du)  (10)
ZO_h(θ) = E_{(x,y)∼p*}[ Φ(−y S(x; θ)/h) ]  (11)
= ∫ Φ(−y S(x; θ)/h) dp*(x, y)  (12)
We now show that ZO_h converges to ZO pointwise:
Proposition 4.
For all θ, lim_{h→0} ZO_h(θ) = ZO(θ).
Proof.
This follows from noticing that Φ(s/h) → H(s) as h → 0 for all s ≠ 0 and applying the Dominated Convergence Theorem. ∎
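The pointwise convergence in Proposition 4 is easy to check numerically. Below is a sketch using the boxcar acceptance function as an example and an empirical (rather than population) expectation; both choices are ours for illustration.

```python
import numpy as np

def phi_boxcar(s):
    """Normalized antiderivative of the boxcar acceptance q(s) = 1[|s| <= 1].

    Ramps linearly from 0 at s = -1 to 1 at s = +1; scaling its argument by
    1/h and sending h -> 0 recovers the Heaviside step (for s != 0).
    """
    return np.clip((s + 1.0) / 2.0, 0.0, 1.0)

def zero_one(theta, X, y):
    # Empirical zero-one loss: fraction of points with negative margin.
    return float(np.mean(y * (X @ theta) < 0))

def smoothed_zero_one(theta, X, y, h):
    # ZO_h: the Heaviside step replaced by phi with the domain scaled by h.
    return float(np.mean(phi_boxcar(-y * (X @ theta) / h)))
```

On a sample with no margin exactly zero, the smoothed loss approaches the zero-one loss as h shrinks, mirroring the proposition.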
Before stating the dynamics of uncertainty sampling, we must first define another quantity, the probability of accepting a random point x:
P_accept(θ) = E_{x∼p*}[ q(S(x; θ)/h) ]  (13)
We must further assume that the loss, as a function of the margin y S(x; θ), is exactly linear around zero, which is satisfied by the hinge loss and smoothed versions of the hinge loss.
Assumption 5 (Locally linear loss).
There is some neighborhood of zero margin where the loss is exactly linear:
ℓ(z; θ) = c₀ − c₁ · y S(x; θ)  whenever |y S(x; θ)| ≤ ε, for some constants c₀, c₁ > 0 and ε > 0.  (14)
We also assume the score is smooth and the support of p* is bounded:
Assumption 6 (Smooth Score).
The score S(x; θ) is smooth; that is, all derivatives with respect to x and θ exist.
Assumption 7 (Bounded Support).
The support of p* is bounded.
We are ready to state the relationship between the uncertainty sampling iterates (8), which are governed by ∇ℓ(z_{t+1}; θ_t), and the smoothed zero-one loss ZO_h:
Theorem 8.
Proof.
First, note that
∇ZO_h(θ) = ∇_θ E[ Φ(−y S(x; θ)/h) ]  (16)
Because Φ is continuously differentiable, S is smooth, and p* has bounded support, by the Leibniz Integral Rule, we can exchange the expectation and the derivative:
∇ZO_h(θ) = E[ Φ′(−y S(x; θ)/h) · (−y ∇_θ S(x; θ)/h) ]  (17)
Because q is even and y ∈ {−1, +1}, Φ′(−y S(x; θ)/h) ∝ q(−y S(x; θ)/h) = q(S(x; θ)/h), so
∇ZO_h(θ) = −(1/(h ∫q)) E[ q(S(x; θ)/h) · y ∇_θ S(x; θ) ]  (18)
where ∫q denotes the integral of q over the real line. Now we are ready to evaluate E[∇ℓ(z_{t+1}; θ_t) | θ_t]. From the definition of uncertainty sampling, a point z is accepted with probability q(S(x; θ_t)/h), so
E[∇ℓ(z_{t+1}; θ_t) | θ_t] = ∫ q(S(x; θ_t)/h) ∇_θ ℓ(z; θ_t) dp*(z) / P_accept(θ_t)  (19)
= E[ q(S(x; θ_t)/h) ∇_θ ℓ(z; θ_t) ] / P_accept(θ_t)  (20)
where P_accept(θ_t) is the acceptance probability (13). Notice that q(S(x; θ_t)/h) = 0 for points with |S(x; θ_t)| outside the scaled support of q, and that the loss is exactly linear with some slope −c₁ < 0 for points with |y S(x; θ_t)| small enough (Assumption 5), so ∇_θ ℓ(z; θ_t) = −c₁ y ∇_θ S(x; θ_t) on the accepted region. Thus, for h small enough,
E[∇ℓ(z_{t+1}; θ_t) | θ_t] = −c₁ E[ q(S(x; θ_t)/h) y ∇_θ S(x; θ_t) ] / P_accept(θ_t)  (21)
= (c₁ h ∫q / P_accept(θ_t)) ∇ZO_h(θ_t),  (22)
where the last step uses (18).
∎
Thus, if z_{t+1} is drawn using uncertainty sampling, E[∇ℓ(z_{t+1}; θ_t) | θ_t] is in the direction of ∇ZO_h(θ_t), since the quantity in front of ∇ZO_h(θ_t) is a scalar that is positive for all common losses.² (² h, ∫q, and P_accept(θ_t) are all positive, and the slope of the loss near zero margin is negative for all reasonable losses.) Similar to how we showed random sampling is preconditioned SGD on the population surrogate loss (9), uncertainty sampling is preconditioned SGD on the smoothed population zero-one loss ZO_h.
3.3 Descent Direction
So far, we have shown that uncertainty sampling is preconditioned SGD on the smoothed population zero-one loss by analyzing E[∇ℓ(z_{t+1}; θ_t) | θ_t]. To show that these updates are descent directions on ZO_h, we need to also consider the preconditioner M_{t+1}^{−1} appearing in (8). Due to the quadratic regularization in (5), the preconditioner is positive definite. However, we need to be careful, since the preconditioner depends on the resulting iterate θ_{t+1}. Because of this snag, we need to ensure that the new point doesn’t change the preconditioner too much, which we can accomplish with the bounded-derivative assumption and large enough regularization.
Theorem 9 (Uncertainty Sampling Descent Direction).
It might seem that the required regularization λ above is rather large. However, note that we don’t optimize the regularized loss until after the seed set of random points, so if λ is at most the seed set size, then the regularization is always less than the number of data points that contribute to the loss, and the regularization will not dominate. Thus, this constraint on λ can be intuitively thought of as a constraint on the seed set size.
3.4 Convergence
Having shown that uncertainty sampling iterates move in descent directions of ZO_h in expectation, we now turn to showing that they also converge to a stationary point of ZO_h. To prove convergence, we will need to stay (with high probability) in regions where the assumption of Theorem 8 holds, and also ensure that the parameters stay bounded (with high probability).
As is standard in stochastic gradient convergence analyses, instead of showing convergence of the final parameter iterate, we show convergence of the parameters from a random iteration. For a budget n, let a random iteration be drawn with appropriate probabilities, and define the parameters at that iteration to be the randomized parameters. We will show that the randomized parameters converge to a stationary point of ZO_h as n → ∞.
Define the failure probability that any parameter iterate is too large or has zero acceptance probability:
(25)
We will assume that this failure probability converges to 0 as the budget n grows. As n grows, the seed set size grows, the regularization grows, and the effective step size shrinks. Intuitively, this means we might expect the parameter iterates to become more stable. Unless the parameters diverge, we can choose the bound large enough to contain the parameter iterates. Furthermore, the acceptance probability should be nonzero if there are no large regions of the space with zero probability density. Assuming the probability of these failure events goes to zero, the randomized parameters converge to a stationary point:
Theorem 10 (Convergence to Stationary Points).
These results shed light on the mysterious dynamics of uncertainty sampling that motivated this paper. In particular, uncertainty sampling can achieve lower zero-one loss than random sampling because it is implicitly descending on the smoothed zero-one loss ZO_h. Furthermore, since ZO_h is nonconvex, uncertainty sampling can converge to different values depending on the initialization.
It is important to note that the actual uncertainty sampling algorithm is unchanged: it is still performing gradient updates on the convex surrogate loss. But because its sampling distribution is skewed towards the decision boundary, we can interpret its updates as being on the smoothed zero-one loss with respect to the original data-generating distribution.
4 Experiments
We run uncertainty sampling on a simple synthetic dataset to illustrate the dynamics (Section 4.1), as well as on 22 real datasets (Section 4.2). In both cases, we show how uncertainty sampling converges to different parameters depending on the initialization, and how it can achieve lower asymptotic zero-one loss than minimizing the surrogate loss on all the data. Note that most active learning experiments measure the rate of convergence (data efficiency), whereas this paper focuses exclusively on asymptotic values and the variation obtained from different seed sets. We evaluate only the zero-one loss, but all algorithms optimize the logistic loss.
4.1 Synthetic Data
Figure 2: Synthetic dataset based on a mixture of four Gaussians (left) and the associated learning curves for runs of uncertainty sampling with different initial seed sets (right). Depending on the seed set, uncertainty sampling can produce either better or worse parameters than random sampling.
Figure 2 (left) shows a mixture of four Gaussian distributions in two dimensions. All the Gaussians are isotropic, and the circles indicate the variance (one standard deviation for the inner circle, and two standard deviations for the outer circle). The points drawn from the two red Gaussians receive one label, and the points drawn from the two blue ones receive the other. The percentages refer to the mixture proportions of the clusters. We see that there are four local minima of the population zero-one loss, indicated by the green dashed lines. Each minimum misclassifies one of the Gaussian clusters, yielding error rates of about 10%, 20%, 30%, and 40%. The black dotted line corresponds to the parameters that minimize the logistic loss, which yield an error of about 20%. Figure 2 (right) shows learning curves for different seed sets, each consisting of two points, one from each class. We see that the uncertainty sampling learning curves converge to four different asymptotic losses, corresponding to the four local minima of the zero-one loss mentioned earlier. The thick black dashed line is the zero-one loss for random sampling. We see that uncertainty sampling can actually achieve lower loss than random sampling, since the global optimum of the logistic loss does not correspond to the global minimum of the zero-one loss.
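The qualitative setup of Figure 2 can be reproduced in a few lines. In the sketch below, the cluster means, unit variances, and mixture proportions are illustrative stand-ins, not the paper's exact values.

```python
import numpy as np

def four_gaussian_mixture(n, rng):
    """Sample a 2-D dataset from a mixture of four isotropic Gaussians, two
    per class, in the spirit of Figure 2 (means/proportions are made up)."""
    means = np.array([[-3.0, 0.0], [3.0, 0.0],     # clusters of one class
                      [0.0, -3.0], [0.0, 3.0]])    # clusters of the other
    labels = np.array([1.0, 1.0, -1.0, -1.0])
    props = np.array([0.3, 0.2, 0.4, 0.1])         # mixture proportions
    comp = rng.choice(4, size=n, p=props)
    X = means[comp] + rng.normal(size=(n, 2))      # unit-variance clusters
    return X, labels[comp]

def zero_one_loss(theta, X, y):
    # Zero-one loss of a linear classifier through the origin, on a sample.
    return float(np.mean(y * (X @ theta) < 0))
```

Sweeping the direction of theta and evaluating `zero_one_loss` exposes multiple local minima, each misclassifying (mostly) one cluster, which is the qualitative behavior the synthetic experiment illustrates.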
4.2 RealWorld Datasets
We collected 22 datasets from OpenML (retrieved August 2017) that have a large number of data points and on which logistic regression outperformed the baseline classifier that always predicts the majority label. We subsampled each dataset to 10,000 points, divided into 7,000 training points and 3,000 test points. We created a stream by randomly selecting points one by one with replacement from the dataset. We ran uncertainty sampling on each dataset with random seed sets of sizes that are powers of two from 2 to 4096, and then 7,000. We stopped when uncertainty sampling either could not select a point (zero acceptance probability) or had repeatedly selected the same point too many times. For each dataset and seed set size, we ran uncertainty sampling 10 times. In Figure 3, we see scatter plots of the asymptotic zero-one loss of 130 points per dataset: 13 seed set sizes, each with 10 runs. The dataset on the left was chosen to exhibit the wide range of convergence values of uncertainty sampling, some with lower zero-one loss than with the full dataset. In both plots, we see that the variance of the zero-one loss of uncertainty sampling decreases as the seed set grows. This is expected from the theory, since the initialization has less variance for larger seed set sizes (as the seed set size goes to infinity, the parameters converge). For most of the datasets, the behavior was more similar to the plot on the right, where uncertainty sampling has a higher mean zero-one loss than random sampling for most seed set sizes.
A violin plot capturing the relative asymptotic zero-one loss compared to the zero-one loss on the full dataset. The plot shows the density of points with kernel density estimation. The red lines are the median losses. Each “violin” captures 220 points (10 runs over 22 datasets).
To gain a more quantitative understanding across all the datasets, we summarize the asymptotic zero-one loss of uncertainty sampling for various random seed set sizes. In Figure 4, we show the proportion of runs over the datasets in which uncertainty sampling converges to a lower zero-one loss than using the entire dataset. In Figure 5, we show a “violin plot” of the distribution of the ratio between the asymptotic zero-one loss of uncertainty sampling and the zero-one loss using the full dataset. We note that the mean and variance of uncertainty sampling drop significantly as the seed set grows. The initial parameters are poor if the seed set is small, and it is well-known that poor initializations for locally optimizing nonconvex functions can yield poor results, as seen here.
5 Related Work and Discussion
The phenomenon that uncertainty sampling can achieve lower error with a subset of the data than with the entire dataset has been observed multiple times in the literature. In fact, the original uncertainty sampling paper (Lewis & Gale, 1994)
notes that “For 6 of 10 categories, the mean [F-score] for a classifier trained on a uncertainty sample of 999 examples actually exceeds that from training on the full training set of 319,463”.
Schohn & Cohn (2000) define a heuristic that selects the point closest to the decision boundary of an SVM, which is similar to uncertainty sampling in our formulation. In the abstract, the authors note, “We observe… that a SVM trained on a well-chosen subset of the available corpus frequently performs better than one trained on all available data”. More recently, Chang et al. (2017) developed an “active bias” technique that emphasizes uncertain points and found that it increases performance compared to using a fully-labeled dataset.
There is also work showing that the bias of active learning can harm final performance. Schütze et al. (2006) note the “missed cluster effect”, where active learning can ignore clusters in the data and never query points from them; this is seen in our synthetic experiment. Dasgupta & Hsu (2008) devote a section to the bias of uncertainty sampling and provide another example where uncertainty sampling fails due to sampling bias, which we can explain as due to local minima of the zero-one loss. Bach (2007) and Beygelzimer et al. (2009) note this bias issue and propose different importance sampling schemes to reweight points and correct for the bias.
In this work, we showed that uncertainty sampling updates are preconditioned SGD steps on the smoothed population zero-one loss and move in descent directions for parameters that are not approximate stationary points. Note that this does not give any global optimality guarantees. In fact, for linear classifiers, it is NP-hard to find parameters with training zero-one loss below 1/2 − ε (for any ε > 0), even when there is a linear classifier that achieves just ε training zero-one loss (Feldman et al., 2012).
One of the key questions in light of this work is when optimizing convex surrogate losses yields good zero-one losses. If the loss function corresponds to the negative log-likelihood of a well-specified model, then the zero-one loss will have a local minimum at the parameters that optimize the log-likelihood. If the loss function is “classification-calibrated” (which holds for most common surrogate losses), Bartlett et al. (2006) show that if the convex surrogate loss of the estimated parameters converges to the optimal convex surrogate loss, then the zero-one loss of the estimated parameters converges to the global minimum of the zero-one loss (the Bayes error). This holds only for universal classifiers (Micchelli et al., 2006), and in practice these assumptions are unrealistic. For instance, several papers show how outliers and noise can cause linear classifiers learned on convex surrogate losses to suffer high zero-one loss (Nguyen & Sanner, 2013; Wu & Liu, 2007; Long & Servedio, 2010).
Other works connect active learning with optimization in rather different ways. Ramdas & Singh (2013) use active learning as a subroutine to improve stochastic convex optimization. Guillory et al. (2009) show how online active learning updates correspond to online optimization updates of nonconvex functions, more specifically truncated convex losses. In this work, we analyze active learning with offline optimization and show the connection between uncertainty sampling and one particularly important nonconvex loss, the zero-one loss.
In summary, our work is the first to show a connection between the zero-one loss and the commonly-used uncertainty sampling algorithm. This provides an explanation and understanding of various empirical phenomena observed in the active learning literature. Uncertainty sampling simultaneously offers the hope of converging to lower error and the danger of converging to local minima (an issue that can possibly be avoided with larger seed sets). We hope this connection can lead to improved active learning and optimization algorithms.
Reproducibility.
The code, data, and experiments for this paper are available on the CodaLab platform at https://worksheets.codalab.org/worksheets/0xf8dfe5bcc1dc408fb54b3cc15a5abce8/.
Acknowledgments.
This research was supported by an NSF Graduate Fellowship to the first author.
References
 Bach (2007) Bach, F. R. Active learning for misspecified generalized linear models. In Advances in neural information processing systems, pp. 65–72, 2007.
 Balcan et al. (2006) Balcan, M.-F., Beygelzimer, A., and Langford, J. Agnostic active learning. In Proceedings of the 23rd international conference on Machine learning, pp. 65–72. ACM, 2006.
 Bartlett et al. (2006) Bartlett, P. L., Jordan, M. I., and McAuliffe, J. D. Convexity, classification, and risk bounds. Journal of the American Statistical Association, 101(473):138–156, 2006.
 Beygelzimer et al. (2009) Beygelzimer, A., Dasgupta, S., and Langford, J. Importance weighted active learning. In Proceedings of the 26th annual international conference on machine learning, pp. 49–56. ACM, 2009.
 Bordes et al. (2005) Bordes, A., Ertekin, S., Weston, J., and Bottou, L. Fast kernel classifiers with online and active learning. Journal of Machine Learning Research, 6(Sep):1579–1619, 2005.

 Chang et al. (2017) Chang, H.-S., Learned-Miller, E., and McCallum, A. Active bias: Training more accurate neural networks by emphasizing high variance samples. In Advances in Neural Information Processing Systems, pp. 1003–1013, 2017.
 Dasgupta & Hsu (2008) Dasgupta, S. and Hsu, D. Hierarchical sampling for active learning. In Proceedings of the 25th international conference on Machine learning, pp. 208–215. ACM, 2008.
 Feldman et al. (2012) Feldman, V., Guruswami, V., Raghavendra, P., and Wu, Y. Agnostic learning of monomials by halfspaces is hard. SIAM Journal on Computing, 41(6):1558–1590, 2012.
 Guillory et al. (2009) Guillory, A., Chastain, E., and Bilmes, J. Active learning as nonconvex optimization. In Artificial Intelligence and Statistics, pp. 201–208, 2009.
 Hanneke et al. (2014) Hanneke, S. et al. Theory of disagreementbased active learning. Foundations and Trends® in Machine Learning, 7(23):131–309, 2014.
 Hoveijn (2007) Hoveijn, I. Differentiability of the volume of a region enclosed by level sets. arXiv preprint arXiv:0712.0915, 2007.
 Klein et al. (2011) Klein, S., Staring, M., Andersson, P., and Pluim, J. P. Preconditioned stochastic gradient descent optimisation for monomodal image registration. In International Conference on Medical Image Computing and ComputerAssisted Intervention, pp. 549–556. Springer, 2011.
 Lewis & Gale (1994) Lewis, D. D. and Gale, W. A. A sequential algorithm for training text classifiers. In Proceedings of the 17th annual international ACM SIGIR conference on Research and development in information retrieval, pp. 3–12. SpringerVerlag New York, Inc., 1994.
 Li (2018) Li, X.L. Preconditioned stochastic gradient descent. IEEE transactions on neural networks and learning systems, 29(5):1454–1466, 2018.
 Long & Servedio (2010) Long, P. M. and Servedio, R. A. Random classification noise defeats all convex potential boosters. Machine learning, 78(3):287–304, 2010.
 Micchelli et al. (2006) Micchelli, C. A., Xu, Y., and Zhang, H. Universal kernels. Journal of Machine Learning Research, 7(Dec):2651–2667, 2006.
 Nguyen & Sanner (2013) Nguyen, T. and Sanner, S. Algorithms for direct 0–1 loss optimization in binary classification. In International Conference on Machine Learning, pp. 1085–1093, 2013.
 Ramdas & Singh (2013) Ramdas, A. and Singh, A. Algorithmic connections between active learning and stochastic convex optimization. In International Conference on Algorithmic Learning Theory, pp. 339–353. Springer, 2013.

 Schohn & Cohn (2000) Schohn, G. and Cohn, D. Less is more: Active learning with support vector machines. In ICML, pp. 839–846. Citeseer, 2000.
 Schütze et al. (2006) Schütze, H., Velipasaoglu, E., and Pedersen, J. O. Performance thresholding in practical text classification. In Proceedings of the 15th ACM international conference on Information and knowledge management, pp. 662–671. ACM, 2006.
 Settles (2010) Settles, B. Active learning literature survey. Computer Sciences Technical Report, 1648, 2010.
 Tong & Koller (2001) Tong, S. and Koller, D. Support vector machine active learning with applications to text classification. Journal of machine learning research, 2(Nov):45–66, 2001.
 Wu & Liu (2007) Wu, Y. and Liu, Y. Robust truncated hinge loss support vector machines. Journal of the American Statistical Association, 102(479):974–983, 2007.
 Yang & Loog (2016) Yang, Y. and Loog, M. A benchmark and comparison of active learning for logistic regression. arXiv preprint arXiv:1611.08618, 2016.
6 Appendix
The appendix has two main sections. In Section 6.1, we prove the results about the descent direction and SGD convergence of uncertainty sampling. In Section 6.2, we show that, under some conditions, the population zero-one loss is differentiable.
6.1 Descent Direction and Convergence
We prove two lemmas about the parameter updates. First, we show that the parameter iterates do not change much in a single step, due to regularization (Lemma 11). Second, we show that the update is approximately a preconditioned gradient step (Lemma 12), and we bound the approximation error. This is important because the update then depends on the newly selected point only through the gradient of its loss at the current iterate. With these two lemmas and Theorem 8, the descent direction result (Theorem 9) is straightforward, and the SGD convergence result (Theorem 10) follows a standard SGD convergence argument.
6.1.1 Parameter Update Lemmas
Lemma 11.
(27) 
Proof.
As in the main text, we have
(28) 
Thus, and further . Together, this implies that .
Using the Taylor expansion,
(29) 
where
(30) 
Since the loss is convex with quadratic regularization,
(31)  
(32)  
(33)  
(34) 
Therefore,
(35)  
(36) 
∎
Lemma 12.
(37) 
Proof.
From a Taylor expansion,
(38) 
where
(39) 
We want to solve for , but in order to do this, we need to bound .
(40) 
Using Lemma 11
(41)  
(42)  
(43) 
Solving for in the Taylor expansion,
(44)  
(45)  
(46)  
(47)  
(48) 
Looking at the theorem statement, we are almost done. The only difference between the theorem statement and the equation above is that the theorem statement has while the equation above has . We can use the triangle inequality and bound the difference.
(50)  
(51)  
(52)  
(53)  
(54) 
∎
6.1.2 Descent Direction
Theorem 9.
Proof.
The first thing to note is that if , then .
(57)  
(58) 
Because has bounded support and since is smooth, there exists a constant such that .
(59) 
And thus, if , then .
This will allow us to use Theorem 8 later in the proof.
Using Lemma 12,
(60)  
(61) 
Note that the only part that is dependent on the iteration is the term, which we can evaluate on the right by Theorem 8. Thus,
(62)  
(63)  
(64)  
(65)  
(66) 
If the last term is positive then the whole expression is less than and the theorem is proved. A sufficient condition for this to be the case is that,
(68) 
and
(69) 
which are both satisfied for
(70) 
∎
6.1.3 Convergence
Theorem 10.
Proof.
Assume that the parameter iterates are bounded and the acceptance probability is nonzero for all parameter iterates. This will occur with probability going to 1, and so if we can show convergence in probability under this condition, then unconditional convergence in probability follows. The set of parameters that are bounded is a compact set, and thus the relevant continuous quantities are bounded by some constant.
From a Taylor expansion, for some between and ,
(72)  
(73)  
(74)  
(75)  
(76) 
Taking an expectation conditioned on ,
(77)  