Learning to Cache With No Regrets

04/22/2019 ∙ by Georgios S. Paschos, et al. ∙ 0

This paper introduces a novel caching analysis that, contrary to prior work, makes no modeling assumptions for the file request sequence. We cast the caching problem in the framework of Online Linear Optimization (OLO), and introduce a class of minimum regret caching policies, which minimize the losses with respect to the best static configuration in hindsight when the request model is unknown. These policies are very important since they are robust to popularity deviations in the sense that they learn to adjust their caching decisions when the popularity model changes. We first prove a novel lower bound for the regret of any caching policy, improving existing OLO bounds for our setting. Then we show that the Online Gradient Ascent (OGA) policy guarantees a regret that matches the lower bound, hence it is universally optimal. Finally, we shift our attention to a network of caches arranged to form a bipartite graph, and show that the Bipartite Subgradient Algorithm (BSA) has no regret.






I Introduction

We study the performance of caching systems from a new perspective: we seek a caching policy that optimizes the system’s performance under any distribution of file request sequence. This not only has huge practical significance as it tackles the caching policy design problem in its most general form, but also reveals a novel connection between caching and online linear optimization [1, 2, 3]. This, in turn, paves the way for a new mathematical framework enabling the principled design of caching policies with performance guarantees.

I-A Background and Related Work

Due to its finite capacity a cache can typically host only a small subset of the file library, and hence a caching policy must decide which files should be stored. The main performance criterion for a caching policy is the so-called cache hit ratio, i.e., the portion of file requests the cache can satisfy locally. Several policies have been proposed in the past with the aim to maximize the cache hit ratio. For instance, the Least-Recently-Used (LRU) policy inserts in the cache the newly requested file and evicts the one that has not been requested for the longest time period; while the Least-Frequently-Used (LFU) policy evicts the file that is least frequently requested. These policies (and variants) were designed empirically, and one might ask: under what conditions do they achieve high hit ratios?

The answer depends on the properties of the file request sequence. For instance, we know that (i) for stationary requests, LFU achieves the highest hit ratio [4]; (ii) a more sophisticated age-based-threshold policy maximizes the hit ratio when the requests follow the Poisson Shot Noise model [5]; and (iii) LRU has the highest competitive hit ratio [6] for the adversarial model [7, 8] which assumes that the requests are set by an adversary aiming to degrade the system’s performance. However, the highest competitive hit ratio is achieved by any marking policy [9] – including FIFO – suggesting that this metric is perhaps “too strong” to allow a good policy classification. Evidently, to decide which policy to use, it is necessary to know the underlying file request model, which in practice is a priori unknown and, possibly, time-varying. This renders imperative the design of a universal caching policy that will work provably well for all request models.

Fig. 1: File requests must be served by a local cache (hit) or the origin server (miss). A caching policy observes past requests and decides which files to cache in order to increase the hit ratio.

This is even more important in the emerging edge caching architectures that use small local servers or caches attached to wireless stations. These caches receive a low number of requests and therefore, inevitably, “see” request processes with highly non-stationary popularity [5, 10, 11]. Prior works employ random replacement models [11] or inhomogeneous Poisson processes [5, 12] to model the requests in these systems. However, such multi-parametric models are challenging to fit and rely on strong assumptions about the popularity evolution. Other notable approaches learn the instantaneous popularity model with no prior assumptions. For instance, [13, 14] employ Q-learning, and [15] leverages a scalable prediction method; but contrary to our approach, they assume the popularity evolution is stationary across time. The design of adaptive paging policies is systematically studied in [16, 17, 18] as an online learning problem. These works, however, consider only the basic paging problem, i.e., hit ratio maximization when caching entire files in a single cache.

Caching networks (CNs), on the other hand, are hitherto under-explored, yet very important, as caches are most often organized in networks. In CNs one needs to jointly decide which cache will satisfy a request (routing) and which files will be evicted (caching). The works [19, 20] proposed joint routing and caching policies for bipartite and general network graphs, respectively, and [21] extended them to CNs with elastic storage. However, all these works assume that file popularity is stationary. On the other hand, [22] proposed the multi-LRU (mLRU) strategy, and [23] proposed the “lazy rule” extending $q$-LRU to provide local optimality guarantees under stationary requests. It transpires that we currently lack a principled method to design caching policies with provable performance guarantees, under general popularity conditions, for single caches or networks of caches. This is exactly our focus.

I-B Methodology and Contributions

We introduce a model-free caching policy along the lines of Online Linear Optimization (OLO). For the single cache case we obtain matching lower and upper regret bounds, proving that online gradient ascent achieves order-optimal performance, and we then extend these results to CNs. We assume that file requests are drawn from a general distribution, which is equivalent to caching versus an adversary selecting the requests. At each slot, in order, (i) the caching policy decides which file parts to cache; (ii) the adversary introduces the new request; and (iii) a file-dependent utility is obtained proportional to the fraction of the request that was served by the cache. This generalizes the criterion of cache hit ratio and allows us to build policies that, e.g., optimize delay or provide preferential treatment to files. In this setting, we seek to find a caching policy with minimum regret, i.e., minimum utility loss over a horizon $T$, compared to the best cache configuration when the request sample path is known.

We prove that well-known caching policies such as LRU and LFU have $\Omega(T)$ regret, hence there exist request patterns for which these policies fail to learn a good caching configuration (losses increase with time). In contrast, we propose the Online Gradient Ascent (OGA) policy, and prove its regret is at most $O(\sqrt{CT})$, for a cache that can store $C$ out of the $N$ total files. This shows that OGA eventually (as $T \to \infty$) amortizes the losses under any request model, even under denial-of-service attacks. Additionally, we prove a novel lower bound which is tighter than the general OLO lower bound. For the case of online hit ratio maximization, we find that any policy must have at least $\Omega(\sqrt{CT})$ regret. Combining the two results we conclude that (i) the regret of online caching is exactly $\Theta(\sqrt{CT})$, and (ii) OGA is a universally optimal policy. Interestingly, OGA can be seen as a regularized LFU, or a slightly modified LRU, as we explain in Section IV-C.

We extend our model to a network of caches, arranged in the form of a bipartite graph – a setting known as femtocaching [24]. We provide the Bipartite Subgradient Algorithm (BSA) caching strategy that achieves a regret , where is the number of caches and deg the maximum network node degree. Our contributions can be summarized as follows:

  • Machine Learning (ML) approach: we provide a fresh ML angle into caching policy design and performance analysis. To the best of our knowledge, this is the first time OLO is used to provide optimality bounds in caching problems. Moreover, we reverse-engineer standard LRU/LFU-type caching policies by drawing connections with OLO, and provide directions for improving them.

  • Universal single-cache policy: The proposed OGA policy is universally optimal, i.e., its average loss versus the best caching configuration vanishes under any request model. A projection algorithm is provided to reduce complexity and enable operation in large caches.

  • Universal bipartite caching: We consider a general bipartite CN and design the online joint caching and routing BSA policy. Our approach hits the sweet spot of complexity versus utility for CNs: it offers rigorous performance guarantees while remaining applicable to fairly complicated settings.

  • Trace-driven Evaluation: We employ a battery of tests evaluating our policies with several request patterns. We find that OGA outperforms LRU/LFU by 20% in different scenarios, while BSA beats lazy-LRU by 45.8%.

II System Model

We study a system with a library $\mathcal{N} = \{1, 2, \dots, N\}$ of files of equal size and a cache that fits $C < N$ of them. The system evolves in time slots, and in each slot $t$ a single request is made for a file $n \in \mathcal{N}$, denoted with the event $x_t^n = 1$. Vector $x_t = (x_t^n, n \in \mathcal{N})$ represents the $t$-slot request, chosen from the set:

$$\mathcal{X} = \Big\{ x \in \{0,1\}^N \;\Big|\; \sum_{n=1}^{N} x^n = 1 \Big\}.$$

The instantaneous file popularity is determined by the probability distribution $P(x_t)$ (with support $\mathcal{X}$), which is allowed to be unknown and arbitrary; and the same holds for the joint distribution $P(x_1, \dots, x_T)$ that describes the file popularity evolution. This generic model captures all studied request sequences in the literature, including stationary (i.i.d. or otherwise), non-stationary, and adversarial models.

The cache is managed with the caching configuration variable $y_t^n \in [0,1]$, which denotes the fraction of file $n$ cached in slot $t$.¹ Taking into account the cache size $C$, the set of admissible caching configurations is:

$$\mathcal{Y} = \Big\{ y \in [0,1]^N \;\Big|\; \sum_{n=1}^{N} y^n \le C \Big\}. \qquad (1)$$

¹ Why does caching of file fractions make sense? Large video files are composed of chunks stored independently; see the literature on partial caching [25]. Also, the fractional variables may represent caching probabilities [2, 26], or coded equations of chunks [24]. For practical systems, the fractional $y_t^n$ should be rounded, which will induce a small application-specific error.

Definition 1 (Caching Policy).

A caching policy $\sigma$ is a (possibly randomized) rule that at slot $t$ maps past observations $x_1, \dots, x_{t-1}$ and configurations $y_1, \dots, y_{t-1}$ to the configuration $y_t(\sigma) \in \mathcal{Y}$ of slot $t$.

We denote with $w^n$ the utility obtained when file $n$ is requested and found in the cache (also called a hit). This file-dependent utility can be used to model bandwidth economization from cache hits [25], QoS improvement [24], or any other cache-related benefit. We will also be concerned with the special case $w^n = 1$ for all $n$, i.e., the cache hit ratio maximization. A cache configuration $y_t$ accrues in slot $t$ a utility $f_{x_t}(y_t)$ determined as follows:

$$f_{x_t}(y_t) = \sum_{n=1}^{N} w^n x_t^n y_t^n.$$

Let us now cast caching as an online linear optimization problem. This requires the following conceptual innovations. Since we allow the request sequence to follow any arbitrary probability distribution, we may equivalently think of $x_t$ as being set by an adversary that aims to degrade the performance of the policy. Going a step further, we can equivalently interpret that the adversary selects at each slot $t$ the utility function of the system from a family of linear functions, $\{f_x(\cdot) : x \in \mathcal{X}\}$. Finally, note that $y_t$ is set at the beginning of slot $t$, before the adversary selects $x_t$ (Fig. 2). The above places our problem squarely in the OLO framework.

Fig. 2: A caching decision is taken; the adversary selects the request; the caching utility is realized; and the system state is updated for the next slot.

Given the adversarial nature of our model, the ability to extract useful conclusions depends crucially on the selected performance metric. Differently from the competitive ratio approach of [6], we introduce a new metric that compares how our caching policy fares against the best static policy in hindsight. This metric is often used in the machine learning literature [2, 3] under the name worst-case static regret. In particular, we define the regret of policy $\sigma$ as:

$$R_T(\sigma) = \max_{P(x_1, \dots, x_T)} \mathbb{E}\Big[ \sum_{t=1}^{T} f_{x_t}(y^*) - \sum_{t=1}^{T} f_{x_t}\big(y_t(\sigma)\big) \Big],$$

where $T$ is the horizon; the maximization is over the admissible adversary distributions; the expectation is taken w.r.t. the possibly randomized $x_t$ and $y_t(\sigma)$; and $y^* \in \arg\max_{y \in \mathcal{Y}} \sum_{t=1}^{T} f_{x_t}(y)$ is the best static configuration in hindsight, i.e., a benchmark policy that knows the sample path $x_1, \dots, x_T$. Intuitively, measuring the utility loss of $\sigma$ over $y^*$ constrains the power of the adversary: radical request pattern changes will challenge $\sigma$ but also induce huge losses in $y^*$. This comparison allows us to discern policies that can learn good caching configurations from those that fail.

We seek a caching policy that minimizes the regret by solving the problem $\inf_{\sigma} R_T(\sigma)$, known as Online Linear Optimization (OLO) [2]. The analysis in OLO aims to discover how the regret scales with horizon $T$. A caching policy with sublinear regret $o(T)$ produces average losses $R_T(\sigma)/T \to 0$ with respect to the best configuration with hindsight, hence it learns to adapt the cache configuration without any prior knowledge about the request distribution. Our problem is further compounded due to the cache size dimension. Namely, apart from optimizing the regret w.r.t. $T$, it is of practical importance to consider also the dependence on $N$ (or $C$).
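To make the benchmark concrete, the following minimal sketch evaluates the static regret on one fixed sample path (so it drops the maximization over adversary distributions and the expectation). It assumes requests are given as a list of file indices and that configurations are dictionaries of cached fractions; the function names are illustrative and not taken from the paper.

```python
from collections import Counter

def best_static_utility(requests, weights, C):
    """Utility of the best static configuration in hindsight on one sample path.

    The hindsight objective is linear in the configuration, so an optimal
    static choice fully caches the C files with the largest
    (utility weight) * (number of requests).
    """
    counts = Counter(requests)
    gains = sorted((weights[n] * k for n, k in counts.items()), reverse=True)
    return sum(gains[:C])

def sample_path_regret(requests, configs, weights, C):
    """Best static utility minus the utility earned by the given configurations.

    configs[t] maps file -> cached fraction and is fixed BEFORE requests[t]
    is revealed, mirroring the order of events in the model.
    """
    earned = sum(weights[n] * configs[t].get(n, 0.0)
                 for t, n in enumerate(requests))
    return best_static_utility(requests, weights, C) - earned

# Hypothetical toy example: three files, unit utilities, cache of size 1.
requests = [0, 1, 0, 2, 0, 1]
weights = {0: 1.0, 1: 1.0, 2: 1.0}
always_file_1 = [{1: 1.0}] * len(requests)
toy_regret = sample_path_regret(requests, always_file_1, weights, C=1)
```

In this toy path the best static choice caches file 0 (three hits), while always caching file 1 earns two hits, so the sample-path regret is one.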

Regret of Standard Policies. Having introduced this new caching optimization formulation, it is interesting to characterize the worst-case performance of the LRU and LFU policies. Recall that LRU caches the $C$ most recently requested files, while LFU calculates for each file the request frequency, counted from the slot in which the file was requested for the first time, and caches the $C$ most frequent files. The following proposition describes their performance under arbitrary requests where, for simplicity, we assume unit utilities $w^n = 1$ for all $n$ (hit ratio maximization).

Proposition 1.

The regret of LRU, LFU satisfies:


Proof: Assume that the adversary chooses the periodic request sequence $\{1, 2, \dots, C+1, 1, 2, \dots\}$. For any $t > C$, since the requested file is the least recent file, it is not in the LRU cache, and no utility is received. Hence, LRU can achieve at most $C$ utility, from the first $C$ slots. However, a static policy with hindsight achieves at least $C \lfloor T/(C+1) \rfloor$ by caching the first $C$ files, since every full period of $C+1$ requests yields $C$ hits. The same rationale can be used for LFU by noticing that, due to the structure of the periodic arrivals, the least frequent file is also the least recent one. ∎
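The LRU half of this argument is easy to reproduce. The sketch below is an illustration only: it assumes an initially empty cache, unit utilities, files indexed from 0, and a round-robin sequence over $C+1$ files. LFU is omitted because its behaviour on this sequence depends on tie-breaking.

```python
from collections import OrderedDict

def lru_hits(requests, C):
    """Count hits of an LRU cache of size C that starts empty."""
    cache = OrderedDict()          # keys ordered from least to most recently used
    hits = 0
    for n in requests:
        if n in cache:
            hits += 1
            cache.move_to_end(n)   # mark as most recently used
        else:
            if len(cache) >= C:
                cache.popitem(last=False)   # evict the least recently used file
            cache[n] = True
    return hits

def static_hits(requests, cached):
    """Count hits of a fixed set of cached files."""
    return sum(1 for n in requests if n in cached)

C, periods = 4, 250
requests = [n for _ in range(periods) for n in range(C + 1)]   # 0,1,...,C,0,1,...
T = len(requests)
lru = lru_hits(requests, C)                      # every request evicts the next one needed
static = static_hits(requests, set(range(C)))    # C hits per period
```

On this sequence LRU never scores a hit, while the static configuration collects $C$ hits per period, so the gap grows linearly with $T$.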

The performance of standard caching policies is poor, yet this is rather expected since they are designed to perform well only under certain models (LRU for requests with temporal locality; LFU for stationary requests), and they are known to under-perform in other cases, e.g., LRU in stationary, and LFU in decreasing popularity patterns. Linear regret means that there exist request distributions for which the policy fails to “learn” what the best configuration is. Remarkably, we show below that there exist universal caching policies which ensure low regret under any request model.

III Regret Lower Bound

A regret lower bound is a powerful theoretical result that provides the fundamental limits of how fast any algorithm can learn to cache, much like the upper bound of the channel capacity. Regret lower bounds in OLO have been previously derived for different action sets: for the $N$-dimensional unit ball centered at the origin in [27], and the $N$-dimensional hypercube in [28]. In our case, however, the above results do not apply since $\mathcal{Y}$ in (1) is a capped simplex, i.e., the intersection of a box and a simplex inequality. Therefore, we need the following new regret lower bound tailored to the online caching problem.

Theorem 1 (Regret Lower Bound).

The regret of any caching policy satisfies:

where is the -th maximum element of a Gaussian random vector with zero mean and covariance matrix given by (5).

Furthermore, assume and define any permutation of and the set of all such permutations:

In the important special case of hit rate maximization, where all files have equal utility, the above bound simplifies to:

Corollary 1.

Fix , , and . Then, the regret of any caching policy satisfies:

Before providing the proof, a few remarks are in order. Our bound is tighter than the classical of OLO in the literature [28, 27], which is attributed to the difference of sets . In our proof we provide technical lemmas that are also useful, beyond caching, for the regret analysis of capped simplex sets. In the next section we will design a caching policy that achieves regret $O(\sqrt{CT})$, establishing that the regret of online caching is in fact $\Theta(\sqrt{CT})$.


To find a lower bound, we will analyze a specific adversary . In particular, we will consider an i.i.d. such that file is requested with probability

where is a vector with element equal to and the rest zero. With such a choice of , any causal caching policy yields an expected utility at most , since


independently of . To obtain a regret lower bound we show that a static policy with hindsight can exploit the knowledge of the sample path to achieve a higher utility than (2). Specifically, defining the number of times file is requested in slots , the best static policy will cache the files with highest products . In the following, we characterize how this compares against the average utility of (2) by analyzing the order statistics of an Gaussian vector.

For i.i.d. we may rewrite regret as the expected difference between the best static policy in hindsight and (2):


where is the Hadamard product between the weights and request vector. Further, (3) can be rewritten as a function:

where, (i) is the set of all possible integer caching configurations (and therefore is the sum of the maximum elements of its argument); and (ii) the process is the vector of utility obtained by each file after the first rounds, centered around its mean:


where are i.i.d. random vectors, with distribution

and, therefore, mean .222Above we have used the notation A key ingredient in our proof is the limiting behavior of :

Lemma 1.

Let be a Gaussian vector , where is given in (5), and its th largest element. Then


Observe that is the sum of uniform i.i.d. zero-mean random vectors, where the covariance matrix can be calculated using (4):


where the second equality follows from the distribution of and some calculations.333For the benefit of the reader, we note that has no well-defined density (since is singular). For the proof, we only use its distribution.

Due to the Central Limit Theorem:


Since is continuous, (6) and the Continuous Mapping Theorem [29] imply

and the proof is completed by noticing that is the sum of the maximum elements of its argument. ∎

An immediate consequence of Lemma 1 is that


and the first part of the Theorem is proved.

To prove the second part, we remark that the RHS of (7) is the expected sum of maximal elements of vector , and hence larger than the expected sum of any elements of . In particular, we will compare with the following: Fix a permutation over all elements, partition the first elements in pairs by combining 1-st with 2-nd, …, -th with +1-th, -1-th with -th, and then from each pair choose the maximum element and return the sum. We then obtain:

where the second step follows from the linearity of the expectation, and the expectation is taken over the marginal distribution of a vector with two elements of . We now focus on for (any) two fixed . We have that


From [30] we then have:



for all . The result follows noticing that the tightest bound is obtained by maximizing (8) over all permutations. ∎
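The $\sqrt{T}$ scaling of the hindsight advantage can be checked numerically. The sketch below is a Monte Carlo illustration under assumptions of our own choosing: unit utilities and i.i.d. requests drawn uniformly over $2C$ files. In that setting any causal policy earns at most $T/2$ in expectation, and applying the pairing argument to the resulting Gaussian limit gives $\sqrt{C/(2\pi)}$ as a lower bound on the normalized gain of the best static configuration.

```python
import math
import random

def hindsight_gain(C, T, rng):
    """(best static hits - T/2) / sqrt(T) for uniform requests over 2C files."""
    counts = [0] * (2 * C)
    for _ in range(T):
        counts[rng.randrange(2 * C)] += 1
    best_static = sum(sorted(counts, reverse=True)[:C])   # cache the C most requested
    return (best_static - T / 2) / math.sqrt(T)

def estimate(C, T=2000, trials=400, seed=0):
    """Average normalized hindsight gain over independent sample paths."""
    rng = random.Random(seed)
    return sum(hindsight_gain(C, T, rng) for _ in range(trials)) / trials

C = 2
est = estimate(C)
pair_bound = math.sqrt(C / (2 * math.pi))   # value obtained by pairing the 2C files
```

The estimate should sit above `pair_bound`, because picking the top $C$ of all $2C$ counts does at least as well as picking the larger count in each of $C$ fixed pairs.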

IV Online Gradient Ascent

Gradient-based algorithms are widely used in resource allocation mechanisms due to their attractive scalability properties. Here, we focus on the online variant and show that, despite its simplicity, it achieves the best possible regret.

IV-A Algorithm Design and Properties

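Recall that the utility in slot $t$ is described by the linear function $f_{x_t}(y_t) = \sum_{n=1}^{N} w^n x_t^n y_t^n$. The gradient $\nabla f_{x_t}$ at $y_t$ is an $N$-dimensional vector with coordinates:

$$\frac{\partial f_{x_t}}{\partial y_t^n} = w^n x_t^n, \quad n = 1, \dots, N.$$

Definition 2 (OGA).

The Online Gradient Ascent (OGA) caching policy adjusts the decisions ascending in the direction of the gradient:

$$y_{t+1} = \Pi_{\mathcal{Y}}\big( y_t + \eta_t \nabla f_{x_t}(y_t) \big),$$

where $\eta_t$ is the stepsize, $\Pi_{\mathcal{Y}}(z) \triangleq \arg\min_{y \in \mathcal{Y}} \| y - z \|$ is the Euclidean projection of the argument vector $z$ onto $\mathcal{Y}$, and $\| \cdot \|$ is the Euclidean norm.

The projection step is discussed in detail next. We emphasize that OGA bases the decision $y_{t+1}$ on the caching configuration $y_t$ and the most recent request $x_t$. Therefore, it is a very simple causal policy that does not require memory for storing the entire state (full history of $x$ and $y$).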
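The update can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the projection is approximated by bisection on a common downward shift (a simple stand-in for the algorithm discussed next), files are indexed from 0, and the stepsize $\sqrt{2C/T}$ in the usage lines is the standard online-gradient choice for unit utilities, assumed here for the example.

```python
import math

def project(z, C, iters=60):
    """Approximate Euclidean projection onto {y in [0,1]^N : sum(y) <= C}.

    The projection has the form y^n = clip(z^n - rho, 0, 1) for a shift
    rho >= 0; bisection finds the shift that makes the capacity tight.
    """
    clip = lambda v: min(1.0, max(0.0, v))
    y = [clip(v) for v in z]
    if sum(y) <= C:                  # capacity constraint already satisfied
        return y
    lo, hi = 0.0, max(z)             # cached mass is > C at lo and 0 at hi
    for _ in range(iters):
        rho = (lo + hi) / 2
        if sum(clip(v - rho) for v in z) > C:
            lo = rho
        else:
            hi = rho
    return [clip(v - hi) for v in z]   # hi is always on the feasible side

def oga(requests, N, C, eta, weights=None):
    """Run online gradient ascent; returns (total utility, final configuration)."""
    w = weights or [1.0] * N
    y = [C / N] * N                  # any feasible starting point works
    utility = 0.0
    for n in requests:
        utility += w[n] * y[n]       # utility is realized with the current y_t
        z = list(y)
        z[n] += eta * w[n]           # the gradient has one nonzero entry, w^n
        y = project(z, C)
    return utility, y

C, N, T = 3, 10, 4000
requests = [t % (C + 1) for t in range(T)]   # round-robin over C+1 files
eta = math.sqrt(2 * C / T)
utility, y_final = oga(requests, N, C, eta)
best_static = T * C / (C + 1)                # cache any C of the C+1 requested files
regret = best_static - utility
```

On the round-robin sequence that defeats LRU, this sketch ends up splitting the cache almost evenly among the $C+1$ requested files and keeps the regret sublinear in $T$.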

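Let us now discuss the regret performance of OGA. We define first the set diameter $\mathrm{diam}(\mathcal{Y})$ to be the largest Euclidean distance between any two elements of set $\mathcal{Y}$. To determine the diameter, we inspect two vectors $y_1, y_2 \in \mathcal{Y}$ which cache entire and totally different files (such vectors exist when $2C \le N$) and obtain:

$$\mathrm{diam}(\mathcal{Y}) = \sqrt{2C}.$$

Also, let $L$ be an upper bound of $\| \nabla f_{x_t}(y_t) \|$; since the gradient has a single nonzero coordinate, we have $L \le \max_n w^n$.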

Theorem 2 (Regret of OGA).

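Fix stepsize $\eta_t = \eta = \frac{\mathrm{diam}(\mathcal{Y})}{L \sqrt{T}}$; the regret of OGA satisfies:

$$R_T(\mathrm{OGA}) \le \mathrm{diam}(\mathcal{Y}) \, L \sqrt{T}.$$

Proof: Using the non-expansiveness property of the Euclidean projection [31] we can bound the distance of the algorithm iteration from the best static policy in hindsight $y^*$:

$$\| y_{t+1} - y^* \|^2 \le \| y_t + \eta_t \nabla f_{x_t}(y_t) - y^* \|^2 = \| y_t - y^* \|^2 + 2 \eta_t \nabla f_{x_t}(y_t)^\top (y_t - y^*) + \eta_t^2 \| \nabla f_{x_t}(y_t) \|^2,$$

where we expanded the norm. If we fix $\eta_t = \eta$ and sum telescopically over horizon $T$, we obtain:

$$\| y_{T+1} - y^* \|^2 \le \| y_1 - y^* \|^2 + 2 \eta \sum_{t=1}^{T} \nabla f_{x_t}(y_t)^\top (y_t - y^*) + \eta^2 \sum_{t=1}^{T} \| \nabla f_{x_t}(y_t) \|^2.$$

Since $\| y_{T+1} - y^* \|^2 \ge 0$, rearranging terms and using $\| y_1 - y^* \| \le \mathrm{diam}(\mathcal{Y})$ and $\| \nabla f_{x_t}(y_t) \| \le L$:

$$\sum_{t=1}^{T} \nabla f_{x_t}(y_t)^\top (y^* - y_t) \le \frac{\mathrm{diam}(\mathcal{Y})^2}{2 \eta} + \frac{\eta L^2 T}{2}.$$

For concave $f_{x_t}$ it holds $f_{x_t}(y^*) - f_{x_t}(y_t) \le \nabla f_{x_t}(y_t)^\top (y^* - y_t)$ for all $y_t, y^* \in \mathcal{Y}$, and with equality if $f_{x_t}$ is linear. Plugging these in the OGA regret expression (the $\max$ operator is removed, since the bound holds for every sample path) we get:

$$R_T(\mathrm{OGA}) \le \frac{\mathrm{diam}(\mathcal{Y})^2}{2 \eta} + \frac{\eta L^2 T}{2},$$

and for $\eta = \frac{\mathrm{diam}(\mathcal{Y})}{L \sqrt{T}}$ we obtain the result. ∎

Using the above values of $L$ and $\mathrm{diam}(\mathcal{Y})$ we obtain:

$$R_T(\mathrm{OGA}) \le \max_n w^n \sqrt{2CT}.$$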

Corollary 2 (Regret of Online Caching).

Fix , , for all , and assume , the regret of online caching satisfies:

Corollary 2 follows from Corol. 1 and Theorem 2. We conclude that disregarding (amortized by ) OGA achieves the best possible regret and thus fastest learning rate.

IV-B Projection Algorithm

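We explain next the Euclidean projection used in OGA, which can be written as a constrained quadratic program:

$$\Pi_{\mathcal{Y}}(z) \triangleq \arg\min_{y} \; \sum_{n=1}^{N} (z^n - y^n)^2 \quad \text{s.t.} \quad \sum_{n=1}^{N} y^n \le C, \quad 0 \le y^n \le 1, \; \forall n. \qquad (10)$$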

In practice is expected to be large, and hence we require a fast algorithm. Let us introduce the Lagrangian:

where are the Lagrangian multipliers. The KKT conditions of (10) ensure that the values of at optimality will be partitioned into three sets :


where follows from the tightness of the simplex constraint. It suffices for the projection to determine a partition of files into these sets. Note that given a candidate partition, we can check in linear time whether it satisfies all KKT conditions (and only the unique optimal partition does). Additionally, one can show that the ordering of files in is preserved at optimal , hence a known approach is to search exhaustively over all possible ordered partitions, which takes steps [32]. For our problem, we propose Algorithm 1, which exploits the property that all elements of satisfy except at most one (hence also ), and computes the projection in steps (where the term comes from sorting ). In our simulations each loop is visited at most two times, and the OGA simulation takes comparable time with LRU.

Finally, the projection algorithm provides insight into OGA functionality. In a slot where file $n$ has been requested, OGA will increase $y^n$ according to the stepsize, and then decrease all other variables symmetrically until the simplex constraint is satisfied.

0:  ; sorted
2:  repeat
6:     ,
7:  until 
8:  if  then
10:     Repeat 2-7
11:  end if      % KKT conditions are satisfied.
Algorithm 1 Projection on Capped Simplex
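For readers who want something executable, here is an alternative exact projection. It is a sketch of a different technique (a breakpoint search on the shift multiplier), not a transcription of Algorithm 1. It relies on the KKT structure of the projection: the solution is $y^n = \min\{1, \max\{0, z^n - \rho\}\}$ for a single shift $\rho \ge 0$, and the cached mass is a piecewise-linear, non-increasing function of $\rho$. The function names are our own.

```python
def clip01(v):
    return min(1.0, max(0.0, v))

def mass(z, rho):
    """Total cached mass when every coordinate is shifted down by rho and clipped."""
    return sum(clip01(v - rho) for v in z)

def project_capped_simplex(z, C):
    """Euclidean projection of z onto {y in [0,1]^N : sum(y) <= C}."""
    y = [clip01(v) for v in z]
    if sum(y) <= C:                      # capacity constraint inactive: rho = 0
        return y
    # mass(z, rho) has kinks only at z^n and z^n - 1; keep the positive ones.
    pts = sorted({0.0} | {b for v in z for b in (v, v - 1.0) if b > 0.0})
    lo, hi = 0, len(pts) - 1             # mass at pts[lo] > C >= mass at pts[hi]
    while hi - lo > 1:                   # binary search over sorted kinks
        mid = (lo + hi) // 2
        if mass(z, pts[mid]) > C:
            lo = mid
        else:
            hi = mid
    # Between two adjacent kinks the mass is linear, so solve for rho exactly.
    m_lo, m_hi = mass(z, pts[lo]), mass(z, pts[hi])
    rho = pts[lo] + (m_lo - C) * (pts[hi] - pts[lo]) / (m_lo - m_hi)
    return [clip01(v - rho) for v in z]

y_a = project_capped_simplex([0.9, 0.8, 0.3, 0.1], C=1)     # shift rho = 0.35
y_b = project_capped_simplex([1.6, 0.9, 0.2], C=1.5)        # first entry stays capped at 1
```

Sorting the kinks costs $O(N \log N)$ and the binary search performs $O(\log N)$ mass evaluations of $O(N)$ each, so the sketch has the same order of complexity as a sort-based method.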

IV-C Performance and Relation to Other Policies

Fig. 3(a) showcases the hit ratio of OGA for different choices of fixed step sizes, where it can be seen that larger steps lead to faster but more inaccurate convergence. The horizon-optimal step is given in Theorem 2 as , and plugging in , and , we obtain ; indeed we see that our experiments verify this.

The online gradient descent (similar to OGA) is identical to the well-known Follow-the-Leader (FtL) policy with a Euclidean regularizer , cf. [2], where FtL chooses in slot the configuration that maximizes the average utility in slots . We may observe that FtL applied here would cache the files with the highest frequencies, hence it is identical to LFU (when the frequency starts counting from ). Hence, OGA can be seen as a regularized version of a utility-LFU policy, where additionally to largest frequencies we smoothen the decisions by a Euclidean regularizer.

Fig. 3: Single cache with , . (a) Fixed stepsize: smaller stepsizes converge slower, but more accurately. (b) Sorted based on LRU: for each file in the LRU cache, we show the respective OGA caching variable.

Fig. 4: Average hits under different request models [34]; (a) CDN aggregation (i.i.d. Zipf, IRM model), (b) YouTube videos (Poisson Shot Noise model [5]), (c) web browsing (dataset [33]), (d) ephemeral torrents (random replacement model [11]); Parameters: .

Furthermore, OGA for , bears similarities to LRU, since recent requests enter the cache at the expense of older requests. Since the Euclidean projection drops some chunks from each file, we expect least recent requests to drop first in OGA. The difference is that OGA evicts files gradually, chunk by chunk, and not in one shot. Fig. 3(b) shows the values for all files in the LRU cache (the most recently used). This reveals that the two policies make strongly correlated decisions, but OGA additionally “remembers” which of the recent requests are also infrequent (e.g., see point (A) in Fig. 3(b)), and decreases accordingly the values (as LFU would have done).

Finally, in Fig. 4 we compare the performance of OGA to LRU, LFU, and the best in hindsight static configuration. We perform the comparison for catalogues of K files, with a cache that fits K files, and we use four different request models: (a) an i.i.d. Zipf model that represents requests in a CDN aggregation point [34]; (b) a Poisson shot noise model that represents ephemeral YouTube video requests [5]; (c) a dataset from [33] with actual web browsing requests at a university campus; and (d) a random replacement model from [11] that represents ephemeral torrent requests. We observe that OGA performance is always close to the best among LFU and LRU. The gain over the second-best policy is as high as 16% over LRU and 20% over LFU.

V Bipartite Online Caching

We now extend our study to a network of caches reachable by the user population via a weighted bipartite graph. Bipartite caching was first used for small cell networks (see the seminal femtocaching model [24]), and subsequently also to model a variety of wired and wireless caching problems where the bipartite graph links represent delay, cost, or other QoS metrics when users access different caches [35].

V-A Bipartite Caching Model

Our caching network (CN) comprises a set of user locations served by a set of caches , each with capacity , . We use parameters:

to denote whether cache is reachable from location , or not (). This includes the general case where a location is connected to multiple caches (e.g., consider base stations with overlapping coverage). We also maintain the origin server as a special node (indexed with ), which contains the entire library, and serves the requests for files not found in the caches. The request process is a sequence of vectors with element if a request for file arrives at in slot . At each slot there is only one request .

We perform caching using the standard model of Maximum Distance Separable (MDS) codes [24], where each stored chunk is a pseudo-random linear combination of original file chunks, and a user requires a fixed number of such chunks to decode the file. Furthermore, we will populate the caches with different random chunks such that, following from the MDS properties, the collected chunks will be linearly independent with high probability (and therefore complement each other for decoding). This results in the following model: the caching decision vector has elements, and each element denotes the amount of random equations of file stored at . The set of eligible caching vectors is convex:

The caching policy can be defined as follows:


Since each location might be connected to multiple caches, we use routing variables to describe how a request is served at each location. The caching and routing decisions are coupled and constrained: (i) a request cannot be routed to an unreachable cache, (ii) we cannot route from a cache more data than it has, and (iii) each request must be fully routed. Hence, the eligible routing decisions under are:

where (the routing to origin) is unconstrained, and hence the set is non-empty for all (i.e. can always be satisfied). Naturally, we assume that is decided after the requests arrive, and hence also after the caching policy.

Finally, we introduce utility weights to denote the obtained benefit by retrieving a unit fraction of a file from cache instead of the origin, and trivially . Apart from file importance, these weights also model the locality of caches within the geography of user locations–for example a cache might have higher benefit for certain locations and lower for others. In sum, the total utility accrued in slot is:


where the index is used to remind us that is affected by the adversary’s decision . The regret of policy is:

where is the best static configuration in hindsight (factoring the associated routing).

Next we establish the concavity of our objective function for each slot . Since there is only one request at each slot , we can simplify the form of . Let be the file and location where the request in arrives. Then is zero except for . Denoting with the set of reachable caches from , and simplifying the notation, (13) reduces to:

s.t. (15)
Lemma 2.

The function is concave in its domain .

Proof: Let us consider two feasible caching vectors , our goal is to show that:

We begin by denoting with and the routing maximizers of (14) for vectors respectively. Immediately, it is . Next, consider a candidate vector for some . We first show that the routing is a feasible routing for , i.e., that ; by the feasibility of , we have:

which proves satisfies (15). Further, for all , it is:

which proves that also satisfies (16); hence . It follows that . Combining:


V-B Bipartite Subgradient Algorithm (BSA)

Since is concave, our plan is to design a universal online bipartite caching policy by extending OGA. Note, however, that is not necessarily differentiable everywhere (hence the gradient might not exist at ), and hence we need to find a subgradient direction at each slot, as explained next.

Consider the partial Lagrangian of (14):

and define the function . From strong duality we obtain:

Lemma 3 (Supergradient).

Let be the vector of optimal multipliers of (16). Define:

The vector is a supergradient of at , i.e., it holds .


We have

where (a) is obtained directly by the form of the Lagrangian where only one term depends on , and in (b) we have used that minimizes . ∎
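For a single request, both the routing maximization and a supergradient can be computed greedily. The sketch below rests on assumptions of ours: the per-slot problem is read as "route at most one unit of the requested file, at most the stored fraction from each reachable cache, and send the rest to the origin at zero utility", with non-negative utility weights. The supergradient entries are the optimal multipliers of the capacity constraints of this small linear program; entries for other files and unreachable caches are zero. Names are illustrative.

```python
import random

def route_greedy(w, y):
    """Serve one request from reachable caches.

    w[j]: utility per unit fraction fetched from cache j (non-negative).
    y[j]: fraction of the requested file stored at cache j.
    Returns (utility, supergradient with respect to y).
    """
    order = sorted(range(len(w)), key=lambda j: -w[j])   # best caches first
    remaining, utility, mu = 1.0, 0.0, 0.0
    for j in order:
        if y[j] >= remaining:          # cache j completes the request
            utility += w[j] * remaining
            mu = w[j]                  # multiplier of the "one file in total" constraint
            break
        utility += w[j] * y[j]
        remaining -= y[j]
    # Multiplier of z_j <= y_j: marginal value of storing more at cache j.
    grad = [max(w[j] - mu, 0.0) for j in range(len(w))]
    return utility, grad

def supergradient_holds(J=4, trials=1000, seed=1, tol=1e-9):
    """Check f(y2) <= f(y) + g.(y2 - y) on random instances."""
    rng = random.Random(seed)
    for _ in range(trials):
        w = [rng.random() for _ in range(J)]
        y = [rng.random() for _ in range(J)]
        y2 = [rng.random() for _ in range(J)]
        f_y, g = route_greedy(w, y)
        f_y2, _ = route_greedy(w, y2)
        if f_y2 > f_y + sum(gj * (b - a) for gj, a, b in zip(g, y, y2)) + tol:
            return False
    return True

u, g = route_greedy([3.0, 1.0], [0.5, 1.0])   # half from the better cache, half from the other
```

In the small example the better cache is the bottleneck, so only its entry of the supergradient is positive: storing more there would replace units currently fetched from the weaker cache.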

Now that we have found a method to calculate the supergradient, we can extend OGA as follows:

Definition 3 (BSA).

The Bipartite Subgradient Algorithm caching policy adjusts the decisions with the supergradient: