1 Model, Assumptions, and Motivations
Consider a two-parameter token bucket $(r, b)$ [1], where $r$ denotes the token rate (in messages/sec) and $b$ the allowed burst size (in jobs or messages). In other words, the number of jobs that can leave the token bucket in any time interval of duration $t$ (the arrival curve to the system downstream of the token bucket) is upper-bounded by $b + rt$. Job arrivals to the token bucket follow an arbitrary arrival process and each job consumes one token. Jobs that find an available token upon their arrival immediately clear the token bucket without incurring any delay. Jobs that arrive to an empty token bucket (or a token bucket with only a fraction of a token) wait until a full token is available before they are allowed to leave the token bucket. The waiting space at the token bucket is assumed large enough (infinite) to ensure that jobs waiting for tokens are never lost.
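The token bucket behavior described above can be sketched directly in code. The following is a minimal illustration, not a reference implementation: the function name `token_bucket_delays` is hypothetical, and the sketch assumes that the bucket starts full at time 0, that the burst size is at least one token, and that jobs are released in FCFS order.

```python
def token_bucket_delays(arrivals, rate, burst):
    """Delay of each job in a token bucket with the given rate and burst size.

    arrivals: non-decreasing job arrival times (each job needs one token).
    Assumptions of this sketch: bucket full at time 0, burst >= 1, FCFS release.
    """
    tokens = float(burst)  # token level right after the last departure
    last = 0.0             # time at which `tokens` was last updated
    delays = []
    for t in arrivals:
        # A job cannot be considered before the job ahead of it has left.
        start = max(t, last)
        # Tokens accumulate at `rate`, capped at the burst size.
        tokens = min(burst, tokens + rate * (start - last))
        if tokens >= 1.0:
            depart = start  # a full token is available: no delay
        else:
            # Wait until the missing fraction of a token has been generated.
            depart = start + (1.0 - tokens) / rate
            tokens = 1.0
        tokens -= 1.0       # the departing job consumes one token
        last = depart
        delays.append(depart - t)
    return delays


# Three simultaneous jobs exhaust a burst of 2; later jobs wait for tokens.
print(token_bucket_delays([0.0, 0.0, 0.0, 0.5], rate=1.0, burst=2))
```

In this example the first two jobs leave immediately, the third waits one time unit for a token, and the job arriving at time 0.5 waits behind it.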
Of concern is the latency that jobs can incur in the token bucket. The primary job latency metric of interest is the sum of the job latencies, or equivalently the average job latency, i.e., the sum of job latencies divided by the number of jobs. More specifically, the problem we are investigating in this note is the impact on job latency when replacing a one-bucket system with a two (or more) bucket system consisting of two separate sub-token buckets $(r_1, b_1)$ and $(r_2, b_2)$, where $r_1 + r_2 = r$ and $b_1 + b_2 = b$. In the two-bucket system, the original stream of arrivals is split arbitrarily across the two sub-token buckets at the times of job arrivals, and each sub-token bucket has an infinite queue where jobs waiting for tokens can be stored. Since jobs are indistinguishable, we initially assume for simplicity that they are served in first-come-first-served (FCFS) order in both the one-bucket and two-bucket systems, though as we shall see the main results hold under arbitrary service ordering. (Note that an FCFS service order is known to minimize the sum of job latencies in both single-server and multi-server systems when service times are constant [2].)
The primary motivation for the investigation is that of Distributed Rate Limiting (DRL) systems that arise in distributed computing environments as found in the cloud or datacenters [3, 4]. In such settings, users specify a job traffic profile in the form of a token bucket, while the compute service provider provisions resources to ensure an agreed-upon Service Level Objective (SLO) that commonly includes (average) latency. Because of resource constraints, it is often necessary for the provider to distribute the user's jobs across multiple compute facilities. For scalability, rate control is performed separately at each compute facility, which in turn calls for splitting the original token bucket into multiple sub-token buckets, one for each compute facility [5]. Furthermore, ensuring that the user job arrival process still conforms to the original traffic envelope calls for preserving the total job arrival rate and burst size across sub-token buckets.
Towards investigating the performance of a DRL system, we first note that under the assumptions of a general job arrival process with each job requiring exactly one token, a token bucket system with unit token rate, i.e., $r = 1$, and a bucket size of $b$ tokens behaves like a modified G/D/1 queue with unit service times. The modification is that in the token bucket system, jobs experience a delay if and only if the queue content in the G/D/1 system exceeds $b$. In other words, the token bucket delay $d_i$ of job $i$ can be obtained from the system time of this job in the corresponding G/D/1 system as follows:
$$d_i = \max\{0,\; U(t_i) - b\}, \tag{1}$$
where $U(t_i)$ corresponds to the unfinished work found in the G/D/1 queue by job $i$ upon its arrival at time $t_i$ plus its own contribution to the unfinished work, and $b$ is the bucket size.
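The mapping between the token bucket and the G/D/1 queue can be illustrated with a short sketch. It tracks the unfinished work of a unit-rate, unit-service-time queue and reports, for each job, the amount by which that work (including the job's own unit) exceeds the bucket size, floored at zero. The function name `gd1_token_bucket_delays` is hypothetical, and the sketch assumes an initially empty queue, which corresponds to an initially full bucket.

```python
def gd1_token_bucket_delays(arrivals, burst):
    """Token bucket delays (unit token rate) from G/D/1 unfinished work.

    arrivals: non-decreasing arrival times; each job brings one unit of work.
    Assumption of this sketch: the queue is empty (bucket full) at time 0.
    """
    work = 0.0   # unfinished work in the G/D/1 queue
    last = 0.0   # time of the previous arrival
    delays = []
    for t in arrivals:
        # The queue drains at unit rate between arrivals, never below zero.
        work = max(0.0, work - (t - last))
        # The arriving job adds its own unit of work.
        work += 1.0
        last = t
        # The job is delayed only by the work in excess of the bucket size.
        delays.append(max(0.0, work - burst))
    return delays


print(gd1_token_bucket_delays([0.0, 0.0, 0.0, 0.5], burst=2))
```

With a burst size of 2, the third simultaneous arrival sees three units of unfinished work and is delayed by one time unit, while the job arriving at time 0.5 sees 3.5 units and is delayed by 1.5.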
Next, we proceed to compare the relative (latency) performance of a one-bucket system to that of a multi-bucket system obtained by splitting the one-bucket system as described above. In particular, we establish that splitting a token bucket in two (or more) sub-token buckets always increases the sum of the job latencies, and hence the average job latency.
2 One vs. Two or more Token Buckets
Towards establishing the result that splitting a token bucket can only worsen the sum of job latencies, we first state a simple Lemma.
Lemma 1. At any point in time $t$, the unfinished work in a work-conserving G/D/1 queue is smaller than or equal to the total unfinished work in a set of $k$ work-conserving G/D/1 queues with the same aggregate service rate and fed the same arrival process.
Proof. The result directly stems from the observation that when fed the same arrival process, $k$ parallel work-conserving G/D/1 queues never clear work faster than a single work-conserving G/D/1 queue with the same aggregate service rate. Specifically, at any point in time $t$ both the one-queue and the $k$-queue systems have received the same amount of work (they are fed the same set of arrivals), both systems are work-conserving, and the one-queue system processes work at least as fast as the $k$-queue system whenever it is not empty, so that it can never have more unfinished work than the $k$-queue system.
Formally, we assume that up to the start of the $n$-th busy period of the one-queue system, the unfinished work $U(t)$ in the one-queue system has always been smaller than or equal to the total unfinished work $U^{(k)}(t)$ of the $k$-queue system, and wlog we assume that the one-queue system has unit service rate. We establish the result by induction on the busy periods of the one-queue system.
Denote as $t_n$ the start of the $n$-th busy period of the one-queue system, and let $\tau_n$ denote the duration of that busy period. The unfinished work in the one-queue system during that busy period is then of the form $U(t) = A(t_n, t) - (t - t_n)$, $t \in [t_n, t_n + \tau_n)$, where $A(t_n, t)$ represents the amount of work that has arrived in $[t_n, t]$, and we have used the fact that by definition the unfinished work just before the start of a busy period is $0$. Similarly, the unfinished work in the $k$-queue system is of the form $U^{(k)}(t) = U^{(k)}(t_n^-) + A(t_n, t) - \int_{t_n}^{t} \rho(s)\,ds \ge A(t_n, t) - (t - t_n) = U(t)$, where we have used the facts that $U^{(k)}(t_n^-) \ge U(t_n^-) = 0$ (from our induction hypothesis), $\rho(s) \le 1$, i.e., the aggregate service rate $\rho(s)$ in the $k$-queue system can never exceed the unit service rate of the one-queue system, and both systems receive the same amount of work $A(t_n, t)$. Furthermore, because $U(t) = 0$ for $t \in [t_n + \tau_n, t_{n+1})$ by definition of a busy period and both the one-queue and the $k$-queue system see the same arrivals, we also have $U(t) \le U^{(k)}(t)$ for $t \in [t_n + \tau_n, t_{n+1})$, where $t_{n+1}$ is the start time of the $(n+1)$-th busy period of the one-queue system, i.e., the time of the next arrival after $t_n + \tau_n$. This establishes that the unfinished work in the one-queue system remains smaller than or equal to that in the $k$-queue system until the start of the $(n+1)$-th busy period of the one-queue system. This completes the proof of the induction step. ∎
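The lemma can be checked numerically on a sample path. The sketch below is only an illustration under stated assumptions: the helper `work_trace` is hypothetical, work is measured in jobs, each queue drains continuously at its own rate, and the comparison is made only at arrival instants (just after each arrival), whereas the lemma holds at all times. The rates 0.6 and 0.4 and the random assignment of jobs to queues are arbitrary choices.

```python
import random


def work_trace(arrivals, rates, assign):
    """Total unfinished work (in jobs) just after each arrival.

    rates[j] is the service rate of queue j and assign[i] the queue that
    receives job i. All queues are assumed empty at time 0.
    """
    work = [0.0] * len(rates)
    last = 0.0
    trace = []
    for t, j in zip(arrivals, assign):
        dt = t - last
        # Each queue drains at its own rate and cannot go below zero.
        work = [max(0.0, w - r * dt) for w, r in zip(work, rates)]
        work[j] += 1.0  # the arriving job adds one unit of work to its queue
        last = t
        trace.append(sum(work))
    return trace


random.seed(0)
arrivals = sorted(random.uniform(0, 50) for _ in range(60))
assign = [random.randrange(2) for _ in arrivals]

single = work_trace(arrivals, [1.0], [0] * len(arrivals))
split = work_trace(arrivals, [0.6, 0.4], assign)

# The single queue never holds more unfinished work than the split system.
assert all(s <= m + 1e-9 for s, m in zip(single, split))
```

The small tolerance in the final check only guards against floating-point rounding.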
We are now ready to state our main result, which establishes that splitting a two-parameter token bucket $(r, b)$ into multiple sub-token buckets with the same aggregate parameters $r$ and $b$ is never beneficial when it comes to the overall (sum or average) job latency introduced by the rate control enforcement of the token bucket.
Theorem 2. Given a two-parameter token bucket $(r, b)$ and a general job arrival process where jobs each require one token to exit the bucket, splitting this one-bucket system into multiple, say, $k$, sub-token buckets with parameters $(r_j, b_j)$, $j = 1, \ldots, k$, such that $\sum_{j=1}^{k} r_j = r$ and $\sum_{j=1}^{k} b_j = b$, can never improve the sum of the job latencies, irrespective of how jobs are distributed to the sub-token buckets. More generally, denoting as $D^{(1)}(t)$ and $D^{(k)}(t)$ the sum of the delays accrued by all jobs up to time $t$ in the one-bucket and $k$-bucket systems, respectively, we have
$$D^{(1)}(t) \le D^{(k)}(t), \quad \forall t \ge 0.$$
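Proof. We first establish the result for the case $k = 2$, and wlog assume that $r = 1$.
The proof is simply based on the fact that jobs waiting for tokens in either system accrue delay at the same rate, and establishing that at any time $t$ the number of jobs experiencing delays in the one-bucket system is less than or equal to the number of such jobs in the two-bucket system. Note that the sum of the job delays incurred in either system up to time $t$ is of the form
$$D^{(m)}(t) = \int_0^t N^{(m)}(s)\,ds, \quad m \in \{1, 2\},$$
where $N^{(m)}(s)$ denotes the number of jobs waiting for tokens at time $s$ in the $m$-bucket system.
Hence, if $N^{(1)}(t) \le N^{(2)}(t), \forall t$, then $D^{(1)}(t) \le D^{(2)}(t), \forall t$, which proves the result for $k = 2$. We therefore proceed to establish that $N^{(1)}(t) \le N^{(2)}(t)$.
The number of jobs waiting for tokens, i.e., accruing delay, at time $t$ in a one-bucket system with bucket size $b$ is of the form
$$N^{(1)}(t) = \left\lceil (U(t) - b)^+ \right\rceil,$$
where $\lceil x \rceil$ represents the ceiling of $x$, $(x)^+ = \max\{0, x\}$, $U(t)$ is the unfinished work in the corresponding G/D/1 queue, and consistent with Eq. (1) we have used the fact that jobs are delayed in the token bucket only when the unfinished work in the G/D/1 queue exceeds the bucket size $b$.
Similarly the total number of jobs waiting for tokens in a two-bucket system with bucket sizes $b_1$ and $b_2$ such that $b_1 + b_2 = b$ is of the form
$$N^{(2)}(t) = \left\lceil (U_1(t) - b_1)^+ \right\rceil + \left\lceil (U_2(t) - b_2)^+ \right\rceil,$$
where $U_j(t)$ is the unfinished work in the G/D/1 queue corresponding to sub-token bucket $j$.
Since we know that $\lceil x \rceil \le \lceil y \rceil + \lceil z \rceil$ when $x \le y + z$, we focus on establishing that
$$(U(t) - b)^+ \le (U_1(t) - b_1)^+ + (U_2(t) - b_2)^+. \tag{3}$$
From Lemma 1, we know that $U(t) \le U_1(t) + U_2(t)$. Next, we consider separately the cases $U(t) \le b$ and $U(t) > b$.
Case 1: $U(t) \le b$. In this case, the left-hand side of Eq. (3) is $0$ and Eq. (3) is trivially verified.
Case 2: $U(t) > b$. We further separate this case into two sub-cases:
Case 2a: $U_1(t) > b_1$ and $U_2(t) \le b_2$ (or interchangeably $U_1(t) \le b_1$ and $U_2(t) > b_2$).
In this case, Eq. (3) simplifies to
$$U(t) - b \le U_1(t) - b_1.$$
Applying again the result of Lemma 1, we have
$$U(t) - b \le U_1(t) + U_2(t) - b_1 - b_2 \le U_1(t) - b_1,$$
where we have used the fact that $b = b_1 + b_2$ and $U_2(t) \le b_2$. Hence, Eq. (3) again holds in Case 2a.
Case 2b: $U_1(t) > b_1$ and $U_2(t) > b_2$.
In this case, Eq. (3) becomes $U(t) - b \le (U_1(t) - b_1) + (U_2(t) - b_2) = U_1(t) + U_2(t) - b$, which is precisely the statement of Lemma 1.
Since the case $U_1(t) \le b_1$ and $U_2(t) \le b_2$ is not possible under Case 2 (it would violate Lemma 1), this establishes that Eq. (3) holds in all cases. Accordingly, $N^{(1)}(t) \le N^{(2)}(t)$, so that as mentioned earlier, $D^{(1)}(t) \le D^{(2)}(t)$, which establishes the result for $k = 2$.
Extending the result to $k > 2$ is readily accomplished by applying the above approach recursively to groups of two sub-token buckets. ∎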
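The result can also be observed numerically. The sketch below compares the total delay of a one-bucket system with that of a two-bucket system whose rates and burst sizes add up to the same totals, over many random assignments of jobs to sub-token buckets. It is an illustration under stated assumptions, not part of the proof: the helper names `bucket_delays` and `total_delay_split`, the parameter values, and the uniformly random arrivals and assignments are all hypothetical choices, buckets start full, and jobs are served FCFS within each bucket.

```python
import random


def bucket_delays(arrivals, rate, burst):
    """Per-job delays in a token bucket, via unfinished work measured in jobs."""
    work, last, out = 0.0, 0.0, []
    for t in arrivals:
        # Work drains at the token rate; the arriving job adds one unit.
        work = max(0.0, work - rate * (t - last)) + 1.0
        last = t
        # Excess work over the burst size must wait for tokens at `rate`.
        out.append(max(0.0, work - burst) / rate)
    return out


def total_delay_split(arrivals, params, assign):
    """Sum of delays when job i is sent to sub-bucket assign[i].

    params is a list of (rate, burst) pairs, one per sub-token bucket.
    """
    total = 0.0
    for j, (rate, burst) in enumerate(params):
        sub = [t for t, a in zip(arrivals, assign) if a == j]
        total += sum(bucket_delays(sub, rate, burst))
    return total


random.seed(1)
arrivals = sorted(random.uniform(0, 100) for _ in range(120))
one = sum(bucket_delays(arrivals, 1.0, 4.0))

# Try many arbitrary splits of the arrivals across two half-size buckets.
for _ in range(200):
    assign = [random.randrange(2) for _ in arrivals]
    two = total_delay_split(arrivals, [(0.5, 2.0), (0.5, 2.0)], assign)
    assert one <= two + 1e-9
```

As a small hand-checkable case, three simultaneous arrivals incur a total delay of 1 in a bucket with rate 1 and burst 2, but a total delay of 2 when the first and third jobs go to one sub-bucket with rate 0.5 and burst 1 and the second job goes to the other.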
In concluding, we note that while Eq. (1) assumed an FCFS service ordering for jobs in the token bucket, both Lemma 1 and Theorem 2 are independent of the order in which jobs waiting for tokens are scheduled for transmission, as long as the schedule is “work-conserving,” i.e., jobs (any waiting job) leave as soon as one full token is available. In other words, available tokens are not split across multiple waiting jobs.
This work was supported by NSF grant CNS 1514254. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author and do not necessarily reflect the views of the National Science Foundation.
[1] J. Heinanen and R. Guerin. A Single Rate Three Color Marker. RFC 2697 (Informational), September 1999.
[2] N. Uuganbaatar. Optimality of first-come-first-served: a unified approach. Mongolian Mathematical Journal, 15:45–53, 2011.
[3] Tyk Open Source API Gateway. Tyk: Rate limiting. https://tyk.io/docs/control-limit-traffic/rate-limiting/, 2018.
[4] Yahoo. Cloud Bouncer: Distributed rate limiting at Yahoo. https://yahooeng.tumblr.com/post/111288877956/cloud-bouncer-distributed-rate-limiting-at-yahoo, 2018.
[5] B. Raghavan, K. Vishwanath, S. Ramabhadran, K. Yocum, and A. C. Snoeren. Cloud control with distributed rate limiting. In Proc. ACM SIGCOMM, Los Angeles, CA, August 2007.