# Temporally-Biased Sampling Schemes for Online Model Management

To maintain the accuracy of supervised learning models in the presence of evolving data streams, we provide temporally-biased sampling schemes that weight recent data most heavily, with inclusion probabilities for a given data item decaying over time according to a specified "decay function". We then periodically retrain the models on the current sample. This approach speeds up the training process relative to training on all of the data. Moreover, time-biasing lets the models adapt to recent changes in the data while—unlike in a sliding-window approach—still keeping some old data to ensure robustness in the face of temporary fluctuations and periodicities in the data values. In addition, the sampling-based approach allows existing analytic algorithms for static data to be applied to dynamic streaming data essentially without change. We provide and analyze both a simple sampling scheme (T-TBS) that probabilistically maintains a target sample size and a novel reservoir-based scheme (R-TBS) that is the first to provide both control over the decay rate and a guaranteed upper bound on the sample size. If the decay function is exponential, then control over the decay rate is complete, and R-TBS maximizes both expected sample size and sample-size stability. For general decay functions, the actual item inclusion probabilities can be made arbitrarily close to the nominal probabilities, and we provide a scheme that allows a trade-off between sample footprint and sample-size stability. The R-TBS and T-TBS schemes are of independent interest, extending the known set of unequal-probability sampling schemes. We discuss distributed implementation strategies; experiments in Spark illuminate the performance and scalability of the algorithms, and show that our approach can increase machine learning robustness in the face of evolving data.

## Authors

• 4 publications
• 7 publications
• 6 publications
• ### Temporally-Biased Sampling for Online Model Management

To maintain the accuracy of supervised learning models in the presence o...
01/29/2018 ∙ by Brian Hentschel, et al. ∙ 0

• ### Exact PPS Sampling with Bounded Sample Size

Probability proportional to size (PPS) sampling schemes with a target sa...
05/22/2021 ∙ by Brian Hentschel, et al. ∙ 0

• ### The Adversarial Robustness of Sampling

Random sampling is a fundamental primitive in modern algorithms, statist...
06/26/2019 ∙ by Omri Ben-Eliezer, et al. ∙ 0

• ### Improved Algorithms for Time Decay Streams

In the time-decay model for data streams, elements of an underlying data...
07/17/2019 ∙ by Vladimir Braverman, et al. ∙ 0

• ### Weighted Reservoir Sampling from Distributed Streams

We consider message-efficient continuous random sampling from a distribu...
04/08/2019 ∙ by Rajesh Jayaram, et al. ∙ 0

• ### Survey schemes for stochastic gradient descent with applications to M-estimation

In certain situations that shall be undoubtedly more and more common in ...
01/09/2015 ∙ by Stephan Clémençon, et al. ∙ 0

• ### Sampling Sketches for Concave Sublinear Functions of Frequencies

We consider massive distributed datasets that consist of elements modele...
07/04/2019 ∙ by Edith Cohen, et al. ∙ 0

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1. Introduction

A key challenge for machine learning (ML) is to keep ML models from becoming stale in the presence of evolving data. In the context of the emerging Internet of Things (IoT), for example, the data comprises dynamically changing sensor streams (Whitmore et al., 2015), and a failure to adapt to changing data can lead to a loss of predictive power.

One way to deal with this problem is to re-engineer existing static supervised learning algorithms to become adaptive. Some parametric methods—such as support-vector machines (SVMs) without the “kernel trick”, hidden Markov models, and regression models—can indeed be re-engineered so that the parameters are time-varying, but for many popular non-parametric algorithms such as k-nearest neighbors (kNN) classifiers, decision trees, random forests, gradient boosted machines, and so on, it is not at all clear how re-engineering can be accomplished. The 2017 Kaggle Data Science Survey

(Sudalai Rajkumar, 2017) indicates that a substantial portion of the models that developers use in industry are non-parametric. We therefore consider alternative approaches in which we periodically retrain ML models, allowing static ML algorithms to be used in dynamic settings essentially as-is. There are several possible retraining approaches.

Retraining on cumulative data: Periodically retraining a model on all of the data that has arrived so far is clearly infeasible because of the huge volume of data involved. Moreover, recent data is swamped by the massive amount of past data, so the retrained model is not sufficiently adaptive.

Sliding windows: A simple sliding-window approach would be to, e.g., periodically retrain on the data from the last two hours. If the data arrival rate is high and there is no bound on memory, then one must deal with long retraining times caused by large amounts of data in the window. The simplest way to bound the window size is to retain the last items. Alternatively, one could try to subsample within the time-based window (Gemulla and Lehner, 2008). The fundamental problem with all of these bounding approaches is that old data is completely forgotten; the problem is especially severe when the data arrival rate is high. This can undermine the robustness of an ML model in situations where old patterns can reassert themselves. For example, a singular event such as a holiday, stock market drop, or terrorist attack can temporarily disrupt normal data patterns, which will reestablish themselves once the effect of the event dies down. Periodic data patterns can lead to the same phenomenon. Another example, from (Xie et al., 2015), concerns influencers on Twitter: a prolific tweeter might temporarily stop tweeting due to travel, illness, or some other reason, and hence be completely forgotten in a sliding-window approach. Indeed, in real-world Twitter data, almost a quarter of top influencers were of this type, and were missed by a sliding window approach.

Temporally-biased sampling: An appealing alternative is a temporally biased sampling-based approach, i.e., maintaining a sample that heavily emphasizes recent data but also contains a small amount of older data, and periodically retraining a model on the sample. By using a time-biased sample, the retraining costs can be held to an acceptable level while not sacrificing robustness in the presence of recurrent patterns. This approach was proposed in (Xie et al., 2015) in the setting of graph analysis algorithms, and has recently been adopted in the MacroBase system (Bailis et al., 2017). The orthogonal problem of choosing when to retrain a model is also an important question, and is related to, e.g., the literature on “concept drift” (Gama et al., 2014); in this paper we focus on the problem of how to efficiently maintain a time-biased sample.

In more detail, our time-biased sampling algorithms ensure that the “appearance probability” for a given data item—i.e., the probability that the item appears in the current sample—decays over time at a controlled rate. Specifically, we assume that items arrive in batches , at time points , where each batch contains 0 or more items and as . Our goal is to generate a sequence , where is a sample of the items that have arrived at or prior to time , i.e., a sample of the items in . These samples should be biased towards recent items, in the following sense. For , denote by the age at time of an item belonging to batch . Then for arbitrary times and items and ,

 (1) Pr[x∈Sk]/Pr[y∈Sk]=f(αi,k)/f(αj,k),

for any batch arrival time , where is a nonnegative and nonincreasing decay function such that . Thus items with a given timestamp are sampled uniformly, and items with different timestamps are handled in a carefully controlled manner, such that the appearance probability for an item of age is proportional to . The criterion in (1), which is expressed in terms of wall-clock time, is natural and appealing in applications and, importantly, is interpretable and understandable to users.

Choosing a decay function: Although our primary focus is on developing sampling methods that can support a variety of decay functions, the question of how to choose a good decay function is important, and a topic of ongoing research. Cohen and Strauss (Cohen and Strauss, 2006) discuss the choice of decay functions in the setting of time-decayed aggregates in telecommunications networks, and argue that the proper choice of a decay function depends on domain knowledge. In an example involving a reliability comparison between two telecommunications links, they conclude that a polynomial decay function best matches their intuition on how the comparison should evolve over time. More generally, the authors assert that there is a trade-off between the ability to decay quickly in the short term and the ability to potentially retain older data, and that the choice should depend on the perceived importance of older data and on the time scales of correlations between values; e.g., the latter might correspond to the amount of time it takes a prior pattern to reassert itself. The authors therefore argue for supporting a rich class of decay functions. Similarly, Xie et al. (Xie et al., 2015) show how a decay function can be chosen to meet application-specific criteria. For example, by using an exponential decay function with , a data item from 40 batches ago is as likely to appear in the current analysis as a newly arrived item. If training data is available, can also be chosen to maximize accuracy of a specified ML model via cross validation combined with grid search—in our experiments, where ground truth data was available, we found empirically that accuracy tended to be a quasiconvex function of

, which bodes well for automatic optimization methods such as stochastic gradient descent. We find exponential and sub-exponential decay functions such as polynomial decay to be of the greatest interest. As will become apparent, exponential decay functions, though of limited flexibility, are the easiest to work with, and most prior work has centered around exponential decay. Super-exponential decay functions are of less practical interest: older items decay too fast and the sampling scheme behaves essentially like a sliding window.

Sample-size control: For the case in which the item-arrival rate is high, the main issue is to keep the sample size from becoming too large. On the other hand, when the incoming batches become very small or widely spaced, the sample sizes for all of the time-biased algorithms that we discuss (as well as for sliding-window schemes based on wall-clock time) can become small. This is a natural consequence of treating recent items as more important, and is characteristic of any sampling scheme that satisfies (1). We emphasize that—as shown in our experiments—a smaller, but carefully time-biased sample typically yields better prediction accuracy than a sample that is larger due to overloading with too much recent data or too much old data. I.e., more sample data is not always better. Indeed, with respect to model management, this decay property can be viewed as a feature in that, if the data stream dries up and the sample decays to a very small size, then this is a signal that there is not enough new data to reliably retrain the model, and that the current version should be kept for now. In any case, we show that, within the class of sampling algorithms that support exponential decay, our new R-TBS algorithm with an exponential decay function maximizes the expected sample size whenever the sample is not saturated.

Prior work: It is surprisingly hard to both enforce (1) and to bound the sample size. As discussed in detail in Section 7, prior algorithms cannot handle arbitrary decay functions, and can only support “forward decay” schemes, which are less intuitive for users than “backward decay” schemes that enforce (1), and can lead to both numerical issues and poor adaptation behavior for ML algorithms. The only decay functions that support (1) and are handled by prior algorithms are the exponential decay functions, because backward and forward decay coincide in this case. Even in this restricted setting, prior algorithms that bound the sample size either cannot consistently enforce (1) or cannot handle wall-clock time. Examples of the former include algorithms based on the A-Res scheme of Efraimidis and Spirakis (Efraimidis and Spirakis, 2006) and Chao’s algorithm (Chao, 1982). A-Res enforces conditions on the acceptance probabilities of items; this leads to appearance probabilities that, unlike (1), are both hard to compute and not intuitive. In Appendix C we demonstrate how Chao’s algorithm can be specialized to the case of exponential decay and modified to handle batch arrivals. We then observe that the resulting algorithm fails to enforce (1) either when initially filling up an empty sample or in the presence of data that arrives slowly relative to the decay rate, and hence fails if the data rate fluctuates too much. The second type of algorithm, due to Aggarwal (Aggarwal, 2006), can only control appearance probabilities based on the indices of the data items. For example, after items arrive, one could require that, for some specified , the th item is 1/10 as likely to be in the sample as the current item. If the data arrival rate is constant, then this might correspond to a constraint of the form “a data item that arrived 10 hours ago is 1/10 as likely to be in the sample as the current item". For varying arrival rates, however, it is impossible to enforce the latter type of constraint, and a large batch of arriving data can prematurely flush out older data. Thus our new sampling schemes are interesting in their own right, significantly expanding the set of unequal-probability sampling techniques.

T-TBS: We first provide and analyze Targeted-Size Time-Biased Sampling (T-TBS), a relatively simple algorithm that generalizes the Bernoulli sampling scheme in (Xie et al., 2015). T-TBS allows complete control over the decay rate (expressed in wall-clock time) and probabilistically maintains a target sample size. That is, the expected and average sample sizes converge to the target and the probability of large deviations from the target decreases exponentially or faster in both the target size and the deviation size. T-TBS is easy to implement and highly scalable when applicable, but only works under the strong restriction that the mean sizes of the arriving batches are constant over time and known a priori. T-TBS is a good choice in some scenarios (see Section 3), but many applications have non-constant, unknown mean batch sizes, thus cannot tolerate sample overflows.

R-TBS: We then provide a novel algorithm, Reservoir-Based Time-Biased Sampling (R-TBS), that is the first to simultaneously enforce (1) at all times, provide a guaranteed upper bound on the sample size, and allow unknown, varying data arrival rates. Guaranteed bounds are desirable because they avoid memory management issues associated with sample overflows, especially when large numbers of samples are being maintained—so that the probability of some sample overflowing is high—or when sampling is being performed in a limited memory setting such as at the “edge” of the IoT. Also, bounded samples reduce variability in retraining times and do not impose upper limits on the incoming data flow. For an exponential decay function, the appearance probability for an item of age is always exactly proportional to ; in general, for a “cutoff age” , the appearance probability for an item of age is proportional to , where for but for . At the cost of additional storage, the user can make arbitrarily large—so that only a small set of very old items are affected—and the discrepancy for these old items arbitrarily small. We emphasize that, even though R-TBS involves some approximations in the case of general decay functions, the magnitude of these approximations is completely controllable a priori by the user; in contrast, prior schemes such as A-Res and Chao’s Algorithm offer no control over departures from (1) and indeed it can be difficult even to quantify the extent of these departures.

The idea behind R-TBS is to adapt the classic reservoir sampling algorithm, which bounds the sample size but does not allow time biasing. Our approach rests on the notion of a “fractional” sample whose nonnegative size is real-valued in an appropriate sense. For exponential decay, we show that R-TBS maximizes the expected sample size whenever the data arrival rate is low and also minimizes the sample-size variability; in general, there again is a user-controllable tradeoff between storage requirements and sample size stability.

Distributed implementation: Both T-TBS and R-TBS can be parallelized. Whereas T-TBS is relatively straightforward to implement, an efficient distributed implementation of R-TBS is nontrivial. We exploit various implementation strategies to reduce I/O relative to other approaches, avoid unnecessary concurrency control, and make decentralized decisions about which items to insert into, or delete from, the reservoir.

Extensions of our prior work: A preliminary version of this work appeared in (Hentschel et al., 2018); that paper focused entirely on the case of exponential decay and was missing many of the proofs for the given theoretical results. In the current paper, we extend our results to the setting of general decay functions. Handling such functions requires significant extensions to the algorithms, theory, and experimental study given in (Hentschel et al., 2018). Interestingly, viewing the original R-TBS algorithm in (Hentschel et al., 2018) as a special case of the general algorithm has led to streamlining of the original algorithm as well as its theoretical analysis. The current paper contains all relevant proofs.

Organization: The rest of the paper is organized as follows. In Section 2 we describe our batch-arrival model and, to provide context for the current work, discuss two prior simple sampling schemes: a simple Bernoulli scheme as in (Xie et al., 2015) and the classical reservoir sampling scheme, modified for batch arrivals. These methods either bound the sample size but do not control the decay rate, or control the decay rate but not the sample size. We next present and analyze the T-TBS and R-TBS algorithms in Section 3 and Section 4. We describe the distributed implementation in Section 5, and Section 6 contains experimental results. We review the related literature in Section 7 and conclude in Section 8.

## 2. Background

For the remainder of the paper, we focus on settings in which batches arrive at regular time intervals, so that for some . This simple integer batch sequence often arises from the discretization of time (Qian et al., 2013; Zaharia et al., 2013). Specifically, the continuous time domain is partitioned into intervals of length , and the items are observed only at times . All items that arrive in an interval are treated as if they arrived at time , i.e., at the end of the interval, so that all items in batch have time stamp . It follows that the age at time  of an item that arrived at time is simply .

In this section, we briefly review two classical sampling schemes whose properties we will combine in the R-TBS algorithm.

Bernoulli Time-Biased Sampling (B-TBS): A well known, simple Bernoulli time-biased sampling scheme processes each incoming item, one at a time, by first downsampling the current sample and then accepting the incoming item into the sample with probability 1. Downsampling is accomplished by flipping a coin independently for each item in the sample: an item is retained in the sample with probability  and removed with probability . To adapt this sampling scheme to our batch-arrival setting, we process incoming items a batch at a time, and implicitly assume an exponential decay function , setting . Moreover, we take advantage of the fact that the foregoing downsampling operation is probabilistically equivalent to pre-selecting the number

of items to retain according to a binomial distribution and then choosing the actual set of

retained items uniformly from the elements in the current sample; see Appendix B for a proof of this fact. Generating a sample of can be done efficiently using standard algorithms (Stadlober and Zechner, 1999), and obviates the need for executing multiple coin flips.

The resulting sampling scheme is given as Algorithm 1. At each time  we accept each incoming item into the sample with probability 1 (line 1). Downsampling is accomplished in lines 1 and 1: the function returns a random sample from the binomial distribution with independent trials and success probability per trial, and the function returns a uniform random sample, without replacement, containing elements of the set ; note that the function call returns an empty sample for any empty or nonempty .

To see that Algorithm 1 enforces the relation in (1) as required, observe that the sequence of samples is a set-valued Markov process, so that

 Pr[x∈Si+j∣x∈Si+j−1,x∈Si+j−2,…,x∈Si]=Pr[x∈Si+j∣x∈Si+j−1]

for . We then have, for ,

 (2) Pr[x∈Sk]=Pr[x∈Si]×k−i∏j=1Pr[x∈Si+j∣x∈Si+j−1]=1×pk−i=e−λ(k−i)Δ=e−λ(tk−ti),

and (1) follows immediately from (2). Thus Algorithm 1 precisely controls the relative inclusion probabilities according to the exponential decay function given above. This is the algorithm used, e.g., in (Xie et al., 2015) to implement time-biased edge sampling in dynamic graphs.

Unfortunately, the user cannot independently control the expected sample size, which is completely determined by and the sizes of the incoming batches. In particular, if the batch sizes systematically grow over time, then sample size will grow without bound. Arguments in (Xie et al., 2015) show that if , then the sample size can be bounded, but only probabilistically. Because B-TBS is a special case of T-TBS with an exponential decay function and a unitary acceptance probability for arriving items, the results in Section 3 represent a significant extension and refinement of the analysis in (Xie et al., 2015).

Batched Reservoir Sampling (B-RS): The classical reservoir sampling algorithm (Knuth, 1998; McLeod and Bellhouse, 1983) maintains a bounded uniform sample of items in a data stream. The idea is to fill up the reservoir with the first items, where is the reservoir size. For , the th incoming item is accepted into the sample with probability , and an accepted item overwrites a randomly chosen victim. Our choice of is intuitively motivated by the observation that, in general, a given item from a population of size  appears in a uniform sample of size  with probability precisely equal to .

We can extend the classical algorithm to our batch setting; to our knowledge, a batch-oriented variant has not appeared previously in the literature. To informally motivate the algorithm, we generalize the foregoing intuition. For , let be the set of items arriving up through time and set . Suppose that the sample is full (i.e., ) just before batch arrives. After processing , we ought to have a uniform sample of items from the set . We would thus expect the number of -items in the sample to follow a hypergeometric distribution; here the hypergeometric probability mass function is given by if and otherwise. This motivates us to accept new items from into the sample by first generating the number of items to accept as a hypergeometric variate and then selecting specific items for acceptance in a random and uniform manner. As with classical reservoir sampling, incoming items arriving before the sample fills up are accepted into the sample with probability 1 and do not overwrite random victims, whereas subsequent incoming items do overwrite random victims. In the corner case where and , so that an incoming batch would cause the sample to overflow if all items were accepted, we generate as before, but of these elements are accepted into the sample without overwriting a random victim, and the remainder overwrite a random victim from .

The resulting sampling scheme is given as Algorithm 2. In the algorithm, Sample is defined as before and returns a sample from the hypergeometric distribution; see (Stadlober and Zechner, 1999) for a discussion of efficient implementations of HyperGeo. Appendix B contains a formal proof of correctness. Although B-RS guarantees an upper bound on the sample size, it does not support time biasing in that all items seen so far are equally likely to be in the sample. The R-TBS algorithm (Section 4) maintains a bounded reservoir as in B-RS while simultaneously allowing time-biased sampling as in B-TBS.

## 3. Targeted-Size TBS

As a first step towards time-biased sampling with a controlled sample size, we provide the T-TBS scheme, which improves upon B-TBS by ensuring the inclusion property in (1) while providing probabilistic guarantees on the sample size. Throughout, we focus on the case where the batch sizes are independent and identically distributed (i.i.d.) with common mean , and assume that the decay function satisfies .

### 3.1. The Algorithm

The key idea is to not just downsample to remove older items as in B-TBS, but to also downsample incoming batches at a rate such that becomes (asymptotically) the “equilibrium” sample size. Unlike B-TBS, we now want to have the retention probability of an item depend on its age. In particular, if for , we set the retention probability at time equal to

 (3) pi,k=f(αi,k)/f(αi,k−1),

then we have

 (4) Pr[x∈Sk]=Pr[x∈Si]×k−i∏j=1Pr[x∈Si+j∣x∈Si+j−1]=q×k−i∏j=1f(αi,i+j)f(αi,i+j−1)=qf(αi,k),

and (1) follows immediately from (4).

To choose , we reason as follows. Suppose that the sample size equals the target value and we are about to process batch . Prior to incrementing the ages and processing the arriving batch, the ages in the sample range from down to , with an expected number of sample items belonging to batch , where . Thus the expected number of items removed prior to processing is , where is defined in (3). Summing over all batches , we find, after some algebra, that the expected total number of removed items is , where

 (5) γk=k−1∑i=0ϕi,k−1(1−pi,k)=1−∑k−1i=0f(αi,k)∑k−1i=0f(αi,k−1)

for . On the other hand, the expected number of items entering the sample is (where we initially allow to depend on ). For to be an equilibrium point, we equate the expected inflow and outflow and solve for to obtain . If

 (6) limk→∞γk=γ

for some , then . In light of (4), we see that by setting , we ensure that (1) holds at all times, and that the sample size is asymptotically an equilibrium point as , the number of batches processed, becomes large. (See Section 3.2 for a formal statement and proof.) Note that, even if we always accept all items in an arriving batch (i.e., ) but the resulting expected inflow is less than the expected outflow , the sample will consistently fall below , and so we require that .

Because we are assuming that for , we can easily derive a necessary and sufficient condition for (6) to hold. Writing , we have , where for . Thus, if , then (6) holds with , so that . Necessity follows from the fact that by assumption. For polynomial decay with , we have , where is the Hurwitz zeta function. If the decay function is exponential, i.e., , and we choose a time scale so that , then a simple calculation shows that and for , and we obtain the T-TBS algorithm for exponential decay as described in (Hentschel et al., 2018). Here is an equilibrium point for every value of , and not merely in an asymptotic sense as . For this special case, we do not need to maintain the arrival timestamp for each item, and therefore do not need to partition the sample items based on arrival time. If we further assume that , then we obtain the B-TBS algorithm as a special case in which the equilibrium sample size is , which is completely determined by and . For complex functions , we can compute numerically.

The resulting sampling scheme is given as Algorithm 3; it precisely controls inclusion probabilities in accordance with (1) while constantly pushing the sample size toward the target value . We represent a sample as a collection of sets , where is the set of sample items that arrived at time ; thus . The function Gamma in line 3 computes the constant defined above. Conceptually, at each time , T-TBS first downsamples the current sample by independently flipping a coin for each item. The retention probability for an item depends on its age; specifically, an item is retained with probability . T-TBS then downsamples the arriving batch via independent coin flips; an item in is inserted into the sample with probability . As with B-TBS, the algorithm efficiently simulates multiple coin flips by directly generating the binomially distributed number of successes; thus the functions and are defined as before.

###### Remark 1.

The constraint that may lead to an inconveniently large required mean batch size. Intuitively, the problem is that an item’s weight can become too small too quickly, even for polynomial decay. For instance, with , all items lose three fourths of their weight going from age 0 to age 1. For subexponential decay functions , we can deal with this issue by using a shifted decay function , where is a positive integer. By choosing sufficiently large, the corresponding value of can be made as small as desired. For instance, with and , we have , whereas . Of course, the original constraint requiring that for and is now modified to require that , so that the relative inclusion probabilities have essentially the same “tail behavior” as for large and , but the initial decay rate will be slower. This trick will not work for exponential decay, because here , which implies that for . In this case we must select

small enough to accommodate the mean batch size. Thus non-exponential decay functions allow an additional degree of freedom when parameterizing the sampling algorithm. For superexponential decay, shifting will actually increase

but, as discussed previously, such decay functions are of less practical interest. Over a broad range of experiments, quadratic decay with a shift of yielded superior ML robustness results for both T-TBS and R-TBS, and we often use this variant in our experiments (Section 6).

### 3.2. Sample-Size Properties

We now analyze the sample size behavior of T-TBS, which directly impacts memory requirements, efficiency of memory usage, and ML model retraining time. We continue to assume that the batch sizes are i.i.d. with common mean . Our first result (Theorem 3.1) describes the probabilistic behavior of the sample size for a fixed time

. Specifically, we give approximate expressions for the mean and variance of

when is large. We also use Hoeffding’s inequality to give exponential bounds valid for any , showing that the probability of a very large deviation above or below the target value  at any given time is very low. The proof of the theorem (and of most other results in the paper) is given in Appendix A. Denote by the maximum possible batch size, so that . Recall that , and set .

###### Theorem 3.1 ().

For any decay function such that ,

1. as ;

2. ;

3. if , then

1. for and

2. for and sufficiently large .

Thus, from (i), the expected sample size converges to the target size as becomes large and, from (ii), the variance also converges to a constant that depends on and . By (iii), the probability that the sample size deviates from by more than is exponentially small when or is large.

###### Remark 2.

If decays very slowly as , then the convergence of the expected sample size to will also be very slow. For example, if for some , then, using (i) above and a standard bound, it is easy to show that . Thus choosing a value of, say, will result in a long sequence of undersized samples. Similarly, if the sample size becomes overly large at some point, recovery will be slow.

Theorem 3.1 does not tell the entire story. Although it follows from this theorem that, over many different sampling runs, the average sample size at a given (large) time  is close to and the probability of being far away from is small, the successive sample sizes during an individual sampling run need not be well behaved. This issue is addressed by Theorem 3.2 below. Assertion (i) shows that any sample size can be attained with positive probability, so one potential type of bad behavior might occur if, with positive probability, the sample size is unstable in that it drifts off to over time. Assertion (ii) shows that such unstable behavior is ruled out if the maximum batch size is bounded and decays rapidly enough so that equation (7) below holds. If decays even faster, so that equation (8) below holds, then the stability assertion can be strengthened to guarantee that the times between successive attainments of a given sample size are not only all finite, but all have the same finite mean; moreover, the average sample size—averaged over times —converges to with probability 1 as becomes large. On the negative side, it follows that, for a given sampling run, the sample size will repeatedly—though infrequently, since the expected sample size at any time point is finite—become arbitrarily large, even if the average behavior is good. This result shows that the sample-size control provided by T-TBS is incomplete, and thus motivates the more complex R-TBS algorithm.

In the following, write “i.o.” to denote that an event occurs “infinitely often”, i.e., for infinitely many values of , and write “w.p.1” for “with probability 1”.

###### Theorem 3.2 ().

Let the T-TBS decay function satisfy and let be the maximum possible batch size. Then

1. for all , there exists such that ;

2. if and

 (7) ∞∑i=0ifi<∞,

then for all ;

3. if and, for ,

 (8) supi≥0(fi+k/fi)≤gk

for some sequence with , then (a) the expected times between successive visits to state  are uniformly bounded for any , and (b) .

The proof of Theorem 3.2 rests on a reduced representation of the state of the sample at a time , comprising a collection of pairs of the form , where is the number of sample items of age . In Appendix A we argue that the process

is a time-homogeneous Markov chain, and hence we can use tools from the theory of Markov chains to establish the “recurrence” properties that correspond to our stability results. The state space of this Markov chain is quite complex, as are the transition probabilities between states, so the application of these tools is decidedly nontrivial.

###### Remark 3.

Note that the assumptions on indeed become increasingly strong when going from Assertions (i) to (iii). The condition in (7) trivially implies that . Also, (8) implies (7). To see this, fix large enough so that , and observe that, since for all ,

 ∞∑i=0ifi≤k−1∑m=0fm(∞∑i=0(ki+m)gik)<∞.

This increase in strength is strict: the decay function satisfies but not (7), and the decay function satisfies (7) but not (8). The condition in (7) holds, e.g., for exponential decay and for polynomial decay with . The condition in (8) holds, e.g., for functions that decay exponentially or faster.

Even in the most stable case, however, we do not have complete control over the sample size. Indeed, any sample size , no matter how large, is exceeded infinitely often w.p.1 and the expected time between such incidents is uniformly bounded. Although the expected times are often very large, so that the incidents are infrequent, and the faster the decay, the faster the recovery from an incident, T-TBS is ultimately fragile with respect to sample size. This fragility is amplified when batch sizes fluctuate in a non-predicable way, as often happens in practice, and T-TBS can break down; see the experiments in Section 6.2.

Despite the fluctuations in sample size, T-TBS is of interest because, when the mean batch size is known and constant over time, and when some sample overflows are tolerable, T-TBS is relatively simple to implement and parallelize, and is very fast (see Section 6). For example, if the data comes from periodic polling of a set of robust sensors, the data arrival rate will be known a priori and will be relatively constant, except for the occasional sensor failure, and hence T-TBS might be appropriate. On the other hand, if data is coming from, e.g., a social network, then batch sizes may be hard to predict.

## 4. Reservoir-Based TBS

Targeted time-biased sampling (T-TBS) controls the decay rate but only partially controls the sample size, whereas batched reservoir sampling (B-RS) bounds the sample size but does not allow time biasing. Our new reservoir-based time-biased sampling algorithm (R-TBS) combines the best features of both, controlling the decay rate while ensuring that the sample never overflows. For exponential decay, R-TBS has optimal sample size and stability properties, and in the general case the user can trade off storage for both sample-size stability and accuracy. Importantly, unlike T-TBS, the R-TBS algorithm can handle any sequence of batch sizes.

### 4.1. Item Weights and Latent Samples

R-TBS combines the use of a reservoir with the notion of “latent samples" to enforce (1) and bound the sample size. Latent samples, in turn, rest upon the notion of “item weights".

Item weights: In R-TBS, the weight of an item of age is given by , where is the decay function; note that a newly arrived item has a weight of . As discussed later, R-TBS ensures that the probability that an item appears in the sample is proportional (or approximately proportional) to its weight. All items arriving at the same time have the same weight, so that the total weight of all items seen up through time  is . For traditional (sequential or batch-oriented) reservoir sampling, an item does not decay, and so has a weight equal to 1 at all times; thus the notions of items and item weights coincide. Moreover, in the traditional setting the weight of a sample coincides with the number of items in the sample. In our generalized setting, the “size” (weight) of a sample and the number of items in the sample differ, with samples having fractional sizes. We handle this complication via the notion of a latent fractional sample.

Latent samples: A latent fractional sample formalizes the idea of a sample of fractional size. Formally, given a set of items, a latent sample of with sample weight is a triple , where is a set of full items and is a (possibly empty) set containing at most one partial item; is nonempty if and only if .

R-TBS maintains a latent sample over time and produces an actual sample from on demand by sampling as described in Algorithm 4; see Figure 1 for an example. In the pseudocode, and the function

generates a random number uniformly distributed on

. Because each full item is included with probability 1 and the partial item is included with probability , we have

 (9) E[|S|]=⌈C⌉frac(C)+⌊C⌋(1−frac(C))=(⌈C⌉−⌊C⌋)frac(C)+⌊C⌋=frac(C)+⌊C⌋=C,

so that the size of equals in expectation. By allowing at most one partial item, we minimize the latent sample’s footprint: . Importantly, if the weight of a latent sample is an integer, then contains no partial item, and the sample generated from via Algorithm 4 is unique and contains exactly items; thus, the sample weight and the number of sample items coincide in this case. We now describe two key operations on latent samples that are used by R-TBS.

Downsampling: Besides extracting an actual sample from a latent sample, another key operation on latent samples is downsampling. For , the goal of downsampling is to obtain an new latent sample such that, if we generate and from and via Algorithm 4, we have

 (10) Pr[x∈S′]=θPr[x∈S]

for all . Thus the appearance probability for each item in , as well as the expected size of the sample , is scaled down by a factor of . Theorem 4.1 (later in this section) asserts that Algorithm 5 satisfies this property.

In the pseudocode for Algorithm 5, the subroutine moves a randomly selected item from to and moves the current item in (if any) to . Similarly, moves a randomly selected item from to , replacing the current item in (if any). More precisely, executes the operations , , and , and executes the operations , , and .

To gain some intuition for why the algorithm works, consider a simple special case, where the goal is to form a latent sample from a latent sample of integral size ; that is, comprises exactly full items. Assume that is non-integral, so that contains a partial item, and that ; e.g., and , so that . In this case, we simply select an item at random (from ) to be the partial item in and then select of the remaining items at random to be the full items in ; see Figure 2(a). Denote by and the samples obtained from and via Algorithm 4. By symmetry, each item is equally likely to be included in , so that the inclusion probabilities for the items in are all scaled down by the same fraction, as required for (10). In Figure 2(a), for example, item appears in with probability 1 since it is a full item. In , where the weights have been reduced by 50%, item (either as a full or partial item, depending on the random outcome) appears with probability , as expected. This scenario corresponds to lines 5 and 5 in the algorithm, where we carry out the above selections by randomly sampling items from to form and then choosing a random item in as the partial item by moving it to .

In the case where contains a partial item  that appears in with probability , it follows from (10) that should appear in with probability . Thus, with probability , lines 55 retain and convert it to a full item so that it appears in . Otherwise, in lines 55, is removed from the sample when it is overwritten by a random item from ; see Figure 2(b). Again, a new partial item is chosen from in a random manner to uniformly scale down the inclusion probabilities. For instance, in Figure 2(b), item appears in with probability 0.2 (because it is a partial item) and in , appears with probability . Similarly, item appears in with probability 1 and in with probability .

The if-statement in line 5 corresponds to the corner case in which does not contain a full item. The partial item either becomes full or is swapped into and then immediately ejected; see Figure 2(c).

The if-statement in line 5 corresponds to the case in which no items are deleted from the latent sample, e.g., when and . In this case, either becomes full by being swapped into or remains as the partial item for . Denoting by the probability of not swapping, we have . On the other hand, (10) implies that . Equating these expression shows that must equal the expression on the right side of the inequality on line 5; see Figure 2(d).

###### Theorem 4.1 ().

For , let be the latent sample produced from a latent sample  via Algorithm 5, and let and be samples produced from and via Algorithm 4. Then for all .

The union operator: We also need to take the union of disjoint latent samples while preserving the inclusion probabilities for each. Two latent samples and are disjoint if . The pseudocode for the union operation is given as Algorithm 6. The idea is to add all full items to the combined latent sample. If there are partials items in and , then we transform them to either a single partial item, a full item, or a full plus partial item, depending on the values of and . Such transformations are done in a manner that preserves the appearance probabilities. Of course, we can obtain the union of an arbitrary number of latent samples by iterating Algorithm 6; for latent samples , we denote by the latent sample produced by this procedure.

###### Theorem 4.2 ().

Let and , be disjoint latent samples, and let be the latent sample produced from and by Algorithm 6. Let , , and be random samples generated from , and , and via Algorithm 4. Then

1. ;

2. , ; and

3. , .

### 4.2. The R-TBS Algorithm with Exponential Decay

Our general goal is to provide a sampling algorithm that bounds the sample size at while enforcing (1). For the special case of exponential decay, this task is greatly facilitated by the fact that, at each time step, all items in the sample decay by the same multiplicative factor. We exploit this fact to provide a relatively simple version of R-TBS for the case of exponential decay. In Section 4.3, we show how to generalize our approach to the case of arbitrary decay functions.

The algorithm: R-TBS for exponential decay is given as Algorithm 7. The algorithm generates a sequence of latent samples and from these generates a sequence of actual samples that are returned to the user. In the algorithm, the functions Getsample, Downsample, and Union execute the operations described in Algorithms 4, 5, and 6.

The goal of the algorithm is to ensure that

 (11) Pr[x∈Sk]=ρkf(αi,k)

for all , , and , where and are the successive values of the variable during a run of the algorithm. Clearly, (11) immediately implies (1). We choose to make the sample size as large as possible without exceeding . In more detail, we show in Theorem 4.3 below that for all . We therefore set —see line 7—so that . Thus if , then the sample weight is at its maximum possible value , leading to the maximum possible sample size of either or . If , then the sample weight, and hence the sample size, is capped at . The algorithm functions analogously to classic reservoir sampling: if the (weighted) items seen so far can fit into the reservoir of size , then they are simply accepted, if the total item weight exceeds , then, when a new batch arrives, a random subset of old items is removed from the sample via downsampling (line 7) and a random subset of the arriving items, also filtered via downsampling (line 7), take their place (line 7). Note that if for all , so that we process items one at a time, and if there is no decay, so that , then and the inclusion probability in (11) reduces to , where , exactly as in traditional reservoir sampling.