Similarity Caching: Theory and Algorithms

12/09/2019 ∙ by Michele Garetto, et al. ∙ Università di Torino Inria Politecnico di Torino 0

This paper focuses on similarity caching systems, in which a user request for an object o that is not in the cache can be (partially) satisfied by a similar stored object o', at the cost of a loss of user utility. Similarity caching systems can be effectively employed in several application areas, like multimedia retrieval, recommender systems, genome study, and machine learning training/serving. However, despite their relevance, the behavior of such systems is far from being well understood. In this paper, we provide a first comprehensive analysis of similarity caching in the offline, adversarial, and stochastic settings. We show that similarity caching raises significant new challenges, for which we propose the first dynamic policies with some optimality guarantees. We evaluate the performance of our schemes under both synthetic and real request traces.




I Introduction

Caching at the network edge plays a key role in reducing user-perceived latency, in-network traffic, and server load. In the most common setting, when a user requests a given object $o$, the cache provides $o$ if it is locally available (hit), and retrieves it from a remote server (miss) otherwise. In other cases, a user request can be (partially) satisfied by a similar object $o'$. For example, a request for a high-quality video can still be met by a lower-resolution version. In other scenarios, a user query is itself a query for objects similar to a given object $o$. This situation goes under the name of similarity searching, proximity searching, or metric searching [5]. Similarity searching plays an important role in many application areas, like multimedia retrieval [13], recommender systems [24, 26], genome study [1], machine learning training [28, 17, 25], and serving [9, 10]. In all these cases, a cache can deliver to the user one or more objects similar to $o$ among those locally stored, or decide to forward the request to a remote server. The answer provided by the cache is in general an approximation of the best possible answer the server could provide. Following the seminal papers [13, 24], we refer to this setting as similarity caching, and to the classic one as exact caching.

To the best of our knowledge, the first paper introducing the problem of caching for similarity searching was [13]. The authors considered how caches can improve the scalability of content-based image retrieval systems. Almost at the same time, [24] studied caches in content-match systems for contextual advertisement. Both papers propose some simple modifications to the least recently used policy (LRU) to account for the possibility of providing approximate answers. More recently, in [28] and [17], similarity caching has been used to retrieve similar feature vectors from a memory unit to improve the performance of sequence learning tasks, leading to the concept of memory-augmented neural networks [25]. Clipper [10], a distributed system to serve machine learning predictions, includes similarity caches to provide low-latency, approximate answers. A preliminary evaluation of the effect of different caching strategies for this purpose can be found in [9]. Recently, [26] and a series of papers by the same authors have studied recommendation systems in a cellular setting, where contents can be stored close to the users. They focus on how to statically allocate the contents in each cache, assuming knowledge of the contents’ popularities and of the utility for a user interested in content $o$ to receive a similar content $o'$.

Exact caching has been studied for decades in many areas of computer science, and there is now a deep understanding of the problem. Optimal caching algorithms are known in specific settings, and general approaches to study caching policies have been developed both under adversarial and stochastic request processes. On the contrary, despite the many potential applications of similarity caching, there is still almost no theoretical study of the problem, especially for dynamic policies. The only one we are aware of is the competitive analysis of a particular variant of similarity caching in [7] (details in Sect. IV). Basic questions are still unanswered: are similarity and exact caching fundamentally different problems? Do low-complexity optimal similarity caching policies exist? Are there margins of improvement with respect to heuristics proposed in the literature, like in [13, 24]? This paper provides the first answers to the above questions. Our contributions are the following:

  1. we show that similarity caching gives rise to NP-hard problems even in settings for which there is a trivial polynomial algorithm in the case of exact caching;

  2. we provide an optimal pseudo-polynomial algorithm when the sequence of future requests is known;

  3. we recognize that, in the adversarial setting, similarity caching is a $k$-server problem with excursions;

  4. we propose optimal dynamic policies both when objects’ popularities are known, and when they are unknown;

  5. we show by simulation that our dynamic policies provide better performance than existing schemes, both under the independent reference model (IRM) and under real request traces.

A major technical challenge of our analysis is that we allow the object catalog to be potentially infinite and uncountable, as it happens when objects/requests are described by vectors of real-valued features [10]. Note that in this case exact caching policies like LRU would achieve zero hit ratio.

The rest of the paper is organized as follows. Section II introduces our main assumptions on request processes and caching policies. Sections III and IV present results on similarity caching respectively in the offline and in the adversarial setting. Our new dynamic policies are described in Sect. V, together with their optimality guarantees in the stochastic setting. We numerically explore the performance of our policies in Sect. VI.

II Main assumptions

Let $\mathcal{X}$ be the (finite or infinite) set of objects that can be requested by the users. We assume that all objects have equal size and the cache can store up to $k$ objects. The state of the cache at time $t$ is given by the set $S_t$ of objects currently stored in it, $S_t = \{y_1, \dots, y_k\}$, with $y_i \in \mathcal{X}$.

We assume that, given any two objects $x$ and $y$ in $\mathcal{X}$, there is a non-negative (potentially infinite) cost $C_a(x, y)$ to approximate $x$ with $y$. We consider $C_a(x, x) = 0$. Given a set $A$ of elements in $\mathcal{X}$, let $C_a(x, A)$ denote the minimum approximation cost provided by elements in $A$, i.e., $C_a(x, A) = \inf_{y \in A} C_a(x, y)$.

In what follows, we consider two main instances for $\mathcal{X}$ and $C_a$. In the first instance, $\mathcal{X}$ is a finite set of objects and thus the approximation cost can be characterized by an $|\mathcal{X}| \times |\mathcal{X}|$ matrix of non-negative values. This case could well describe the (dis)similarity of contents (e.g. videos) in a finite catalog. In the second instance, $\mathcal{X}$ is a subset of $\mathbb{R}^p$ and $C_a(x, y) = h(d(x, y))$, where $h$ is a non-decreasing non-negative function and $d$ is a metric in $\mathbb{R}^p$ (e.g. the Euclidean one). This case is more suitable for describing objects characterized by continuous features. We will refer to the above two instances as finite and continuous, respectively.

Our goal is to design effective and efficient cache management policies that minimize the aggregate cost to serve a sequence of requests for objects in $\mathcal{X}$. We assume that the function $C_a$ is available for caching decisions and that the cache is able to compute the set of best approximators for $x$, i.e., $\arg\min_{y \in S_t} C_a(x, y)$. This can be efficiently done using locality sensitive hashing (LSH) [24]. Moreover, we will restrict ourselves to online policies in which object insertion into the cache is triggered by requests (i.e., the cache cannot pre-fetch arbitrary objects). Upon a request for object $x$ at time $t$, if the content is locally stored ($x \in S_t$), then the cache directly provides $x$ incurring a null cost ($C_a(x, x) = 0$) and we have an exact hit. Otherwise, the cache can either i) provide the best approximating object locally stored, i.e., an object in $\arg\min_{y \in S_t} C_a(x, y)$, incurring the approximation cost $C_a(x, S_t)$ (approximate hit) or ii) retrieve the content $x$ from the server incurring a fixed cost $C_r > 0$ (miss). Upon a miss, the cache retrieves the object $x$, serves it to the user, and then may replace one of the currently stored objects with $x$. We stress that the caching policy is not required to store $x$. Without loss of generality, we can restrict ourselves to caching policies providing an approximate hit only if the approximation cost is smaller than the retrieval cost ($C_a(x, S_t) < C_r$). Indeed, one could otherwise devise a new caching policy that retrieves content $x$ and then discards it, paying a smaller cost. As a consequence, when the cache state does not change, the cost to serve $x$ is equal to $C(x, S_t) = \min(C_a(x, S_t), C_r)$.
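As a minimal sketch of the serving rule above, the following Python fragment computes the cost to serve a request when the cache state does not change, i.e., the minimum between the best approximation cost and the retrieval cost. The function names and the toy cost $|x - y|$ on real-valued objects are illustrative assumptions, not part of the paper.

```python
import math

def approx_cost(x, cache, ca):
    """Minimum approximation cost of x over the cached objects (inf if empty)."""
    return min((ca(x, y) for y in cache), default=math.inf)

def service_cost(x, cache, ca, c_r):
    """Cost to serve x without changing the cache: approximate hit or miss."""
    return min(approx_cost(x, cache, ca), c_r)

# Toy continuous instance (assumption): objects are reals, cost is |x - y|.
ca = lambda x, y: abs(x - y)
cache = {0.0, 1.0}
for x in (0.0, 0.2, 0.5):
    print(x, service_cost(x, cache, ca, c_r=0.3))
```

With retrieval cost 0.3, the request for 0.2 is an approximate hit, while the request for 0.5 is cheaper to serve from the server.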

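We also define the movement cost from cache state $S$ to cache state $S'$ as

$$C_m(S, S') = \begin{cases} 0 & \text{if } S' = S, \\ C_r & \text{if } S' \neq S. \end{cases}$$

Given a finite sequence of requests $r_1, \dots, r_T$ and an initial state $S_1$, the average cost paid by a given caching policy $\pi$ is

$$C(\pi, S_1, r_1, \dots, r_T) = \frac{1}{T} \sum_{t=1}^{T} \left[ C_m(S_t, S_{t+1}) + C(r_t, S_{t+1}) \right], \tag{1}$$

where $S_{t+1}$ is the cache state after request $r_t$ has been served and $C(x, S) = \min(C_a(x, S), C_r)$. In fact, if $S_{t+1} \neq S_t$, the cache has retrieved $r_t$ paying the retrieval cost $C_r$, but no approximation cost ($C(r_t, S_{t+1}) = 0$). If $S_{t+1} = S_t$, the cache has provided an approximate answer or has retrieved (but not stored) $r_t$, paying $C(r_t, S_t)$. Note that the average cost depends on $\pi$, because the policy determines the evolution of the cache state $S_t$. Policies differ in the choice of which requests are approximate hits or misses (even if $C_a(r_t, S_t) < C_r$, the cache can decide to retrieve and store $r_t$) and in the choice of which object is evicted upon insertion of a new one. We observe that, if $C_a(x, y) = +\infty$ for $x \neq y$, we recover the exact caching setting. If, in addition, $C_r = 1$, Eq. (1) provides the miss ratio.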
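The accounting behind the average cost can be illustrated with a short sketch. It assumes that a change of state costs one retrieval and that the request is then charged against the new state; the names `serve` and `average_cost` and the list-of-states encoding are hypothetical choices made for the example.

```python
import math

def serve(x, cache, ca, c_r):
    """Cost to serve x from a given state: min(approximation, retrieval)."""
    return min(min((ca(x, y) for y in cache), default=math.inf), c_r)

def average_cost(requests, states, ca, c_r):
    """states[t] is the cache before request t, states[t + 1] the cache after."""
    total = 0.0
    for t, x in enumerate(requests):
        before, after = states[t], states[t + 1]
        movement = c_r if after != before else 0.0  # one retrieval per change
        total += movement + serve(x, after, ca, c_r)
    return total / len(requests)

# Exact caching as a special case: infinite cost between distinct objects.
exact = lambda x, y: 0.0 if x == y else math.inf
states = [frozenset("a"), frozenset("a"), frozenset("b"), frozenset("b")]
print(average_cost(["a", "b", "a"], states, exact, c_r=1.0))
```

In this exact-caching example with unit retrieval cost, the printed value is the miss ratio of the trajectory (two misses out of three requests).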

As mentioned in the introduction, similarity caching lacks a solid theoretical understanding. From an algorithmic view-point, it is not clear if similarity caching is a problem intrinsically more difficult than exact caching. From a performance evaluation view-point, we do not know if similarity caching can be studied resorting to the same approaches adopted for exact caching. In this paper we provide the first answers to these questions, which depend crucially on the nature of the requests’ sequence. Three scenarios are commonly considered in the literature:


Offline: the request sequence is known in advance. This assumption is made when one wants to determine the best possible performance of any policy. In the case of exact caching, it is well known that the minimum cost (miss ratio) is achieved by Bélády’s policy [3], which evicts at each time the object whose next request is farthest in the future.

Adversarial: the request sequence is selected by an adversary who wants to maximize the cost incurred by a given caching policy. This approach leads to competitive analysis, which determines how much worse an online policy (without knowledge of future requests) performs in comparison to the optimal offline policy.

Stochastic: requests arrive according to a stationary exogenous stochastic process. One example is the classic IRM, where requests for different objects are generated by independent time-homogeneous Poisson processes. The goal here is to minimize the expected cost or equivalently the average cost in (1) over an infinite time horizon.

We separately consider the above three scenarios in the next sections.

III Offline optimization

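Here we consider the offline setting in which a finite sequence of requests $r_1, \dots, r_T$ is known in advance. We first address the problem of finding a static set of $k$ objects to be prefetched in the cache, so as to minimize the cost in (1), i.e., we want to find:

$$S^* \in \arg\min_{S \subseteq \mathcal{X},\, |S| = k} \sum_{t=1}^{T} C(r_t, S),$$

where $C(r_t, S) = \min(C_a(r_t, S), C_r)$ is the cost to serve request $r_t$ from the static set $S$.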

Note that the corresponding version of this (static, offline) problem for exact caching has a simple polynomial solution with time complexity and space complexity: one simply needs to store in the cache the most requested objects in the trace. For similarity caching the problem is much more difficult, in fact:
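For exact caching, the static offline solution described above amounts to counting requests and keeping the most requested objects. A minimal sketch, with a hypothetical function name:

```python
from collections import Counter

def best_static_exact_cache(trace, k):
    """Return the k most requested objects of the trace (exact caching)."""
    return {obj for obj, _ in Counter(trace).most_common(k)}

print(best_static_exact_cache(["a", "b", "a", "c", "a", "b"], k=2))
```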

Theorem III.1.

The static offline similarity caching problem is NP-hard.


The result follows from a reduction from the maximum coverage problem (NP-hard) to a static offline similarity caching problem. Let $G = (V, E)$ be an undirected graph with set of nodes $V$ and set of edges $E$. We consider the static offline similarity caching problem with $\mathcal{X} = V$, $C_r = 1$, and $C_a(x, y) = 0$ if $(x, y) \in E$ and $C_a(x, y) = +\infty$ otherwise. The request sequence has one and only one request for each content. Minimizing the total cost of this instance of the similarity caching problem is equivalent to finding the $k$ nodes in $V$ that cover the largest number of nodes in $V$. ∎

In the continuous case, where objects are points in $\mathbb{R}^p$, and $C_a(x, y)$ is a function of a distance $d(x, y)$, one may think that the problem becomes simpler. The following theorem shows that this is not the case in general.

Theorem III.2.

Let , and , where for and otherwise. Finding the optimal static set of objects to store in the cache is NP-hard both for norm-2 and norm-1 distance.


We prove NP-hardness in the restricted case when every object is requested only once. We observe that any object $y$ stored in the cache can satisfy requests for all the points in a disc (resp. square) centered at $y$ in the case of norm-2 (resp. norm-1) distance. The problem of determining the optimal static set of objects to store in the cache to maximize the number of hits is then equivalent to the problem of finding $k$ identical translated geometric shapes covering the largest number of points in the request sequence. These shapes are, respectively, discs and squares in the case of norm-2 and norm-1 distance. NP-hardness follows immediately from the NP-hardness of the two covering problems on the plane known as DISC-COVER and BOX-COVER [14]. ∎

We have already observed that exact caching is a particular case of similarity caching. Theorems III.1 and III.2 show that similarity caching is an intrinsically more difficult problem.
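A small example may help to see why popularity alone no longer suffices. The sketch below exhaustively searches the static cache for a graph-based instance in the spirit of Theorem III.1 (an object approximates its neighbors at no cost). The star graph, the trace, and the function names are illustrative assumptions, and the exhaustive search is exponential in general, as the hardness results suggest.

```python
import math
from itertools import combinations

def static_cost(trace, cache, ca, c_r):
    """Total cost of serving the trace from a fixed, non-empty cache."""
    return sum(min(min(ca(x, y) for y in cache), c_r) for x in trace)

def brute_force_static(trace, catalog, k, ca, c_r):
    """Try every k-subset of the catalog and return the cheapest one."""
    return min(combinations(sorted(catalog), k),
               key=lambda s: static_cost(trace, s, ca, c_r))

# Star graph (assumption): hub "h" is adjacent to "n1", "n2", "n3".
edges = {("h", "n1"), ("h", "n2"), ("h", "n3")}

def ca(x, y):
    adjacent = (x, y) in edges or (y, x) in edges
    return 0.0 if x == y or adjacent else math.inf

trace = ["n1", "n1", "n2", "n2", "n3", "h"]
print(brute_force_static(trace, {"h", "n1", "n2", "n3"}, 1, ca, 1.0))
```

Here the hub is requested least often, yet storing it serves every request at zero cost, whereas storing the most requested object leaves three requests unserved.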

For the dynamic setting, we propose a dynamic programming algorithm adapted from the one proposed in [22] for the classic $k$-server problem.

Let denote a finite sequence of requests for distinct objects, and the sequence obtained appending to a new request for content . We denote by the initial cache state, and by , the minimum aggregate cost achievable under the request sequence , when the final cache state is . It is possible to write the following recurrence equations, where denotes the empty sequence:

These equations lead to a dynamic programming procedure that iteratively computes the optimal cost for and determine the corresponding sequence of caching decisions. Algorithm time complexity is . Space complexity is at least . As this algorithm can only be applied to small cache/catalog sizes, we will derive more useful bounds for the optimal cost in Sect. V-C.
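The dynamic programming idea can be sketched as follows for very small instances. This is an illustrative reconstruction under stated assumptions, not the paper's exact recurrence: the state space is the set of all cache configurations of the initial size, a state change is only allowed by inserting the requested object in place of one stored object at the price of one retrieval, and the function name is hypothetical. The sketch returns the total (not average) cost.

```python
import math
from itertools import combinations

def offline_optimal_cost(trace, catalog, initial, ca, c_r):
    """Minimum total cost over all admissible sequences of cache states."""
    k = len(initial)
    states = [frozenset(s) for s in combinations(sorted(catalog), k)]
    # best[s]: cheapest way to serve the requests seen so far and end in s.
    best = {s: (0.0 if s == frozenset(initial) else math.inf) for s in states}
    for x in trace:
        nxt = {}
        for s in states:
            # Option 1: the cache was already in s and serves x from it.
            stay = best[s] + min(min(ca(x, y) for y in s), c_r)
            # Option 2: x was just inserted, replacing some object y.
            move = math.inf
            if x in s:
                for y in catalog:
                    if y not in s:
                        prev = (s - {x}) | {y}
                        move = min(move, best[prev] + c_r)
            nxt[s] = min(stay, move)
        best = nxt
    return min(best.values())

dist = lambda x, y: abs(x - y)
print(offline_optimal_cost([1, 1, 1, 1], [0, 1, 10], {0}, dist, c_r=3.0))
```

In the example, paying one retrieval to store object 1 is cheaper than paying an approximation cost of 1 on each of the four requests.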

IV Competitive analysis under adversarial requests

The usual worst-case analysis is not particularly illuminating for caching problems: if an adversary can arbitrarily select the request sequence, then the performance of any caching policy can be arbitrarily bad. For example, with a catalog of $k + 1$ objects, the adversary can make any deterministic algorithm achieve a zero hit rate by simply asking at any time for the content that is not currently stored in the cache.
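The adversarial argument can be checked on a toy simulation. The sketch below assumes exact caching, an LRU cache of size $k$, and a catalog of $k + 1$ objects; the class and function names are hypothetical.

```python
from collections import OrderedDict

class LRU:
    """Plain LRU cache; the last key of the OrderedDict is the most recent."""

    def __init__(self, k):
        self.k = k
        self.queue = OrderedDict()

    def contents(self):
        return set(self.queue)

    def request(self, x):
        hit = x in self.queue
        if hit:
            self.queue.move_to_end(x)
        else:
            if len(self.queue) >= self.k:
                self.queue.popitem(last=False)  # evict least recently used
            self.queue[x] = None
        return hit

def adversary_hit_ratio(policy, catalog, steps):
    """The adversary always requests an object that is not in the cache."""
    hits = 0
    for _ in range(steps):
        missing = next(o for o in catalog if o not in policy.contents())
        hits += policy.request(missing)
    return hits / steps

print(adversary_hit_ratio(LRU(k=3), catalog=range(4), steps=100))
```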

For this reason, the seminal work of Sleator and Tarjan [27] introduced competitive analysis to characterize the relative performance of caching policies in comparison to the best possible offline policy with hindsight, i.e., under the assumption that the sequence of requests selected by the adversary is known when caching decisions are taken. (By now, competitive analysis has become a standard approach to study the performance of many other algorithms.) In particular, an online caching algorithm is said to be $\rho$-competitive if its performance is within a factor $\rho$ (plus a constant) from the optimum. More formally, there exists a constant $a$ such that, for every request sequence, the cost incurred by the algorithm is at most $\rho$ times the cost of the optimal offline policy plus $a$.

A competitive analysis of similarity caching in a particular case, where the approximation cost is determined by a distance between objects, is in [7] (the only theoretical study of similarity caching we are aware of). In this section we present results for other particular cases, relying on existing work for the $k$-server problem with excursions.

The $k$-server problem [22] is perhaps the “most influential online problem […] that manifests the richness of competitive analysis” [20]. In the $k$-server problem, at each time instant a new request arrives over a metric space and the user has to decide which server to move to serve it, paying a cost equal to the distance between the previous position of the server and the request. It is well known that the $k$-server problem generalizes the exact caching problem.

Interestingly, Manasse and McGeoch’s seminal paper on the $k$-server problem [22] also introduces the following variant: a server can perform an excursion to serve the new request and then come back to the original point, paying a cost determined by a different function. The similarity caching problem can be considered as a $k$-server problem with excursions where server movements have uniform cost $C_r$ and excursions have cost equal to the approximation cost $C_a(x, y)$. Moreover, as shown in [22], the $k$-server problem with excursions is essentially equivalent to another well-known problem known as metrical task system, introduced in the same years by Borodin et al. [4]. Unfortunately, while we have found noble relatives of our problem in the algorithmic field, not much is known about the $k$-server problem with excursions in the scenario we are interested in (uniform metric space for movements and generic metric space for excursions). We rephrase a few existing results in terms of the similarity caching problem. The first one applies to the case when the cache can contain all objects but one. The second one applies to the uniform scenario where each object can equally well approximate any other object. We hope that the important applications of similarity caching will motivate further research on the $k$-server problem with excursions.

Theorem IV.1.

[22, Thm 10] Let be an upper bound for the set . If , then the competitive ratio of any algorithm is bounded below by . Moreover, there exists a -competitive deterministic algorithm (BAL).

Theorem IV.2.

[2, Thms 4.1-2] If and there exists such that for all with , then the competitive ratio of any algorithm is at least . Moreover, there exists a -competitive deterministic algorithm (RFWF).

V Stochastic request process

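We now consider the case when requests arrive according to a Poisson process with (normalized) intensity 1 and the requested objects are i.i.d. In the finite case ($|\mathcal{X}| < \infty$), we have a request rate $\lambda_x$ for each content $x \in \mathcal{X}$ and we essentially obtain the classic IRM. In the continuous case, we need to consider a spatial density of requests defined by a Borel-measurable function $\lambda_x : \mathcal{X} \to \mathbb{R}_+$, i.e., for every Borel set $A \subseteq \mathcal{X}$, the rate with which contents in $A$ are requested is given by $\int_A \lambda_x \, dx$.

Under the above assumptions, for a given cache state $S$, we can compute the corresponding expected cost to serve a request:

$$C(S) = \begin{cases} \sum_{x \in \mathcal{X}} \lambda_x \, C(x, S) & \text{(finite case)}, \\ \int_{\mathcal{X}} \lambda_x \, C(x, S) \, dx & \text{(continuous case)}, \end{cases} \tag{2}$$

where $C(x, S) = \min(C_a(x, S), C_r)$ is the cost to serve a request for $x$ in state $S$.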
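In the finite case, the expected cost of a cache state is a popularity-weighted sum of per-request costs. The sketch below assumes the rates are given as a dictionary and normalizes them to sum to one; the function name and the toy instance are illustrative.

```python
import math

def expected_cost(cache, rates, ca, c_r):
    """Expected cost per request of a cache state for a finite catalog."""
    total = sum(rates.values())  # normalize rates to request probabilities
    return sum(
        lam / total * min(min((ca(x, y) for y in cache), default=math.inf), c_r)
        for x, lam in rates.items()
    )

# Toy instance (assumption): objects 0, 1, 2 on a line, cost |x - y|.
rates = {0: 0.5, 1: 0.3, 2: 0.2}
ca = lambda x, y: abs(x - y)
for state in ({0}, {1}, {2}):
    print(state, expected_cost(state, rates, ca, c_r=1.5))
```

In this instance the central object is not the best single object to store: the most popular one yields a lower expected cost because far requests are capped by the retrieval cost.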


We observe that, as the sequence of future requests does not depend on the past, the average cost incurred over time by any online caching algorithm is bounded from below with probability 1 (w.p. 1) by the minimum expected cost $\min_S C(S)$ (this is a quite intuitive result, but a formal proof is not trivial; the corresponding result for exact caching is in [23]), i.e.,

$$\liminf_{T \to \infty} \bar{C}_T \ge \min_{S} C(S) \quad \text{w.p. 1}, \tag{3}$$

where $\bar{C}_T$ denotes the average cost (1) over the first $T$ requests and $C(S)$ the expected cost (2) of state $S$.

We then say that an online caching algorithm is optimal if its time-average cost achieves the lower bound in (3) w.p. 1. For example, an algorithm that reaches a state $S^* \in \arg\min_S C(S)$ and then does not change its state is optimal. More generally, an optimal algorithm visits states with non-minimum expected cost only for a vanishing fraction of time. Unfortunately, finding an optimal set of objects to store is an NP-hard problem. In fact, minimizing (2) is a weighted version of the problem considered in Sect. III. Despite the intrinsic difficulty of the problem, we present some online caching policies that achieve a global or local minimum of the cost. We call a policy $\lambda$-aware (resp. $\lambda$-unaware) if it relies (resp. does not rely) on the knowledge of the request rates $\lambda$.

In practice, $\lambda$-aware policies are meaningful only when objects’ popularities do not vary wildly over time, remaining approximately constant over time-scales in which $\lambda$ can be estimated through runtime measurements, similarly to what has been done in the case of exact caching by various implementations of the Least Frequently Used (LFU) policy (see e.g. [11]). In contrast, $\lambda$-unaware policies do not suffer from this limitation.

Sections V-A and V-B below are respectively devoted to $\lambda$-aware and $\lambda$-unaware policies. Section V-C presents some lower bounds for the cost of the optimal cache configuration in the continuous scenario.

V-A Online $\lambda$-aware policies

The first policy we present, Greedy, is based on the simple idea of systematically moving to states with a smaller expected cost (2). It works as follows. Upon a request for content $x$ at time $t$, Greedy computes the maximum decrement in the expected cost that can be obtained by replacing one of the objects currently in the cache with $x$, i.e., $\Delta C = \max_{y \in S_t} \left[ C(S_t) - C(S_t \cup \{x\} \setminus \{y\}) \right]$.

  • if $\Delta C > 0$ ($x$ contributes to decrease the cost), then the cache retrieves $x$, serves it to the user, and replaces with $x$ the object $y$ that attains the maximum;

  • if $\Delta C \le 0$, the cache state is not updated. If $C_a(x, S_t) > C_r$, $x$ is retrieved to serve the request; otherwise the request is satisfied by one of the best approximating objects in $S_t$.

Intuitively, we expect Greedy to converge to a local minimum of the cost. In the continuous case, special attention is required to correctly define and prove this result.
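The Greedy update can be sketched for a finite catalog as follows. The code assumes known request rates, represents cache states as frozensets, and only tracks the evolution of the state (not which object is served); all names and the toy instance are illustrative assumptions.

```python
import math

def expected_cost(cache, rates, ca, c_r):
    """Expected cost per request of a cache state (finite catalog)."""
    total = sum(rates.values())
    return sum(
        lam / total * min(min((ca(x, y) for y in cache), default=math.inf), c_r)
        for x, lam in rates.items()
    )

def greedy_step(cache, x, rates, ca, c_r):
    """Return the cache state after a request for x under Greedy."""
    if x in cache:
        return cache
    current = expected_cost(cache, rates, ca, c_r)
    best_state, best_gain = cache, 0.0
    for y in cache:
        candidate = (cache - {y}) | {x}
        gain = current - expected_cost(candidate, rates, ca, c_r)
        if gain > best_gain:  # move only if the cost strictly decreases
            best_state, best_gain = candidate, gain
    return best_state

rates = {0: 0.5, 1: 0.3, 2: 0.2}
ca = lambda x, y: abs(x - y)
state = frozenset({2})
for request in [1, 0, 1]:
    state = greedy_step(state, request, rates, ca, c_r=1.5)
print(state)
```

Each accepted replacement strictly lowers the expected cost, so the sequence of states cannot cycle; in the example the last request does not trigger a change because it would increase the cost.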

Definition V.1.

A content $x$ is said to be significant if, for any $\epsilon > 0$, it holds: $\int_{B_\epsilon(x)} \lambda_y \, dy > 0$, where $B_\epsilon(x)$ is the ball of volume $\epsilon$ centered at $x$.

Definition V.2.

A cache configuration $S$ is locally optimal if $C(S) \le C(S')$ for every $S'$ that is obtained from $S$ by replacing only one of the contents in the cache with a significant content.

Theorem V.3.

If $\lambda_x$ and $C_a(x, y)$ are smooth and $\mathcal{X}$ is a compact set, the expected cost of Greedy converges to the expected cost of a configuration that is locally optimal w.p. 1. If $\mathcal{X}$ is a finite set, the cache state converges to a locally optimal configuration in finite time w.p. 1.


We start with the finite case. Let $\{t_n\}$ be the sequence of time instants at which contents are requested, and $\{S_n\}$ be the corresponding sequence of cache configurations. The sequence $\{C(S_n)\}$ is non-increasing and therefore convergent. Hence, there exists a finite random variable $C_\infty$ such that $\lim_{n \to \infty} C(S_n) = C_\infty$ w.p. 1. Moreover, as the set of possible cache configurations (and then the set of possible costs) is finite, the limit is necessarily reached within a finite number of requests. The sequence $\{S_n\}$ also converges after a finite number of requests w.p. 1 to a random configuration $S_\infty$, such that $C(S_\infty) = C_\infty$. Observe, indeed, that by construction no configuration can be visited again once it has been left. We prove that $S_\infty$ is locally optimal w.p. 1. Consider all the sample paths of $\{S_n\}$ converging to a specific configuration $S$ that is not locally optimal. By definition, there exists a significant object $x$, whose insertion in the cache strictly reduces the cost of $S$. By construction, $x$ can be requested only before the convergence of $\{S_n\}$ to $S$. Then, the sequence must converge to $S$ with probability zero, because $\lambda_x > 0$ and, therefore, sample paths contain w.p. 1 an unbounded sequence of time instants at which requests for $x$ arrive.

The proof for the continuous case is more complex, essentially because the set of possible configurations is infinite. It is still possible to prove that the sequence $\{C(S_n)\}$ converges to a random variable $C_\infty$ through the convergence theorem for super-martingales, but the sequence $\{S_n\}$ may not converge. We then use Prokhorov’s theorem to prove that there exists a subsequence that converges in distribution to a random configuration $S_\infty$. Working with convergence in distribution makes the rest of the proof more involved, and we omit the technical details here. ∎

The Greedy policy converges to a locally optimal configuration. In the finite catalog case, under knowledge of content popularities, it is possible to asymptotically achieve the globally optimal configuration using a policy that mimics a simulated annealing optimization algorithm. This policy is adapted from the OSA policy (Online Simulated Annealing) proposed in [23], and we keep the same name here. OSA maintains a dynamic parameter $T$ (the temperature). Upon a request for content $x$ at iteration $j$, OSA modifies the cache state $S$ as follows:

  • If , the state of the cache is unchanged.

  • If , a content is randomly selected according to some vector of positive probabilities , and the state of the cache is changed to with probability .

In the first case, OSA obviously serves $x$ (a hit). In the second case, if the state changes to $S'$ (the state with $x$ in place of the selected content), the cache serves $x$. Otherwise, it serves $x$ (retrieved from the server) or a best approximating object in $S$, respectively, if $C_a(x, S) > C_r$ or $C_a(x, S) \le C_r$. OSA always stores a new content if this reduces the cost (as Greedy does), but it does not get stuck in a local minimum because it can also accept apparently harmful changes with a probability that is decreasing in the cost increase. By letting the temperature decrease over time, the probability to move to worse states converges to 0 over time: the algorithm explores a larger part of the solution space at the beginning and becomes more and more “greedy” as time goes by. The eviction probability vector can be arbitrarily chosen, as long as each content in $S$ has a positive probability to be selected. In practice, we want to select with larger probability the contents in $S$ whose contribution to the cost reduction is smaller.
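One OSA update can be sketched as below. Two points are assumptions made for the illustration rather than statements about the policy's exact definition: the acceptance probability is taken to be the standard Metropolis rule $\min(1, e^{-\Delta/T})$, where $\Delta$ is the cost increase, and the eviction candidate is drawn uniformly at random, which is one admissible choice of the eviction probability vector. The function name and the `cost` callable are hypothetical.

```python
import math
import random

def osa_step(cache, x, temperature, cost, rng):
    """Return the cache state after a request for x (states are frozensets)."""
    if x in cache:
        return cache
    y = rng.choice(sorted(cache))          # uniform eviction candidate
    candidate = (cache - {y}) | {x}
    increase = cost(candidate) - cost(cache)
    # Improvements are always accepted; worse states only occasionally
    # (assumed Metropolis acceptance rule).
    if increase <= 0 or rng.random() < math.exp(-increase / temperature):
        return candidate
    return cache

costs = {frozenset({"a"}): 1.0, frozenset({"b"}): 2.0}
rng = random.Random(0)
print(osa_step(frozenset({"b"}), "a", 1.0, costs.__getitem__, rng))
```

At high temperature the step accepts most changes; as the temperature approaches zero it behaves like a randomized version of Greedy.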

OSA provides the following theoretical guarantees. Let be the maximum absolute difference of costs between two neighboring states, then

Proposition V.4.

If , asymptotically only the states with minimum cost have a non-null probability to be visited.

If the content to be evicted were selected uniformly at random from the cache, then the proof would be the same as the one of Proposition IV.2 in [23]. A key point in that proof is that the homogeneous Markov chains induced by OSA when the temperature is constant are reversible, so that one can easily write their stationary probability distributions. Here, it is not the case, but we can use the more general result for weakly-reversible time-variant Markov chains in [18, Thm. 1].

As it is usual for simulated annealing results, convergence is guaranteed under very slow decrease of the temperature parameter (inversely proportional to the logarithm of the number of iterations). In practice, much faster cooling rates are adopted and convergence is still empirically observed.

Figure 1 shows a toy case with a catalog of 4 contents and cache size equal to 2, for which Greedy with probability at least 9/20 converges to a suboptimal state with corresponding cost . On the contrary, OSA escapes from this local minimum and asymptotically converges to the optimal state with .


Fig. 1: OSA converges to the minimum cost state. catalog , ; for all other pairs , ; , , , .

V-B Online $\lambda$-unaware policies

In this section we present two new policies, qLRU-$\Delta C$ and Duel, that, without knowledge of $\lambda$, bias admission and eviction decisions so as to statistically favour configurations with low cost $C(S)$. Policy qLRU-$\Delta C$ is inspired by qLRU-Lazy, proposed in [21] to coordinate caching decisions across different base stations to maximize the hit ratio. Despite the different application scenario, there are deep similarities between how different copies of the same content interact in the dense cellular network scenario of [21] and how different contents interact in a similarity cache.

In qLRU-$\Delta C$ the cache is managed as an ordered queue as follows. Let $r_t$ be the content requested at time $t$.

  • If $C_a(r_t, S_t) > C_r$, there is a miss. The cache retrieves the content $r_t$ to serve it to the user. The content is inserted at the front of the queue with probability $q$.

  • If $C_a(r_t, S_t) \le C_r$, there is an approximate hit. The cache serves a content $z \in \arg\min_{y \in S_t} C_a(r_t, y)$, which is refreshed, i.e., moved to the front of the queue, with probability $(C_r - C_a(r_t, z))/C_r$. With probability $q \, C_a(r_t, z)/C_r$ the content $r_t$ is still retrieved from the remote server and inserted at the head of the queue.

If needed, contents are evicted from the tail of the queue. We observe that $C_r - C_a(r_t, z)$ corresponds to the cost saving for the request due to the presence of $z$ in the cache. We call this policy qLRU-$\Delta C$.
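The queue management above can be sketched in a few lines. The sketch rests on assumptions that should be read as such: the refresh probability is taken proportional to the cost saving, $(C_r - C_a)/C_r$, the additional retrieval on an approximate hit happens with probability $q \, C_a / C_r$, and the two random draws are made independently. The class name and its interface are hypothetical.

```python
import math
import random
from collections import OrderedDict

class QLruDeltaC:
    """Sketch of the policy; the first key of `queue` is the front."""

    def __init__(self, k, q, ca, c_r, rng=None):
        self.k, self.q, self.ca, self.c_r = k, q, ca, c_r
        self.rng = rng or random.Random()
        self.queue = OrderedDict()

    def _insert_front(self, x):
        self.queue[x] = None
        self.queue.move_to_end(x, last=False)
        if len(self.queue) > self.k:
            self.queue.popitem(last=True)  # evict from the tail

    def request(self, x):
        """Serve x, update the queue, and return the cost paid."""
        z = min(self.queue, key=lambda y: self.ca(x, y), default=None)
        cost = self.ca(x, z) if z is not None else math.inf
        if cost > self.c_r:  # miss: retrieve, insert with probability q
            if self.rng.random() < self.q:
                self._insert_front(x)
            return self.c_r
        # Approximate (or exact) hit: refresh z proportionally to the saving.
        if self.rng.random() < (self.c_r - cost) / self.c_r:
            self.queue.move_to_end(z, last=False)
        # Assumed rule: still retrieve and insert x with probability q*cost/c_r.
        if self.rng.random() < self.q * cost / self.c_r:
            self._insert_front(x)
            return self.c_r
        return cost

exact = lambda x, y: 0.0 if x == y else math.inf
cache = QLruDeltaC(k=2, q=1.0, ca=exact, c_r=1.0, rng=random.Random(1))
print([cache.request(x) for x in "abcbd"], list(cache.queue))
```

With infinite approximation costs between distinct objects and $q = 1$, the sketch reduces to plain LRU, which is a convenient sanity check.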

When $\mathcal{X}$ is finite, the following result holds under the characteristic time (or Che’s) approximation (CTA) [6] and the exponentialization approximation (EA), which has been recently proposed in [21].

Theorem V.5.

When $\mathcal{X}$ is finite, under CTA and EA, as $q$ converges to 0, qLRU-$\Delta C$ stores a set of contents that corresponds to a local minimum of the cost.


We start by extending the CTA to similarity caching. Given an object $y$, let $T_C(y, S)$ denote the time content $y$ stays in the cache until eviction if 1) the cache is in state $S$ just after its insertion and 2) during its sojourn in the cache $y$ is never refreshed (i.e., moved to the front). In general $T_C(y, S)$ is a random variable, whose distribution depends both on $y$ and on the cache state $S$. The basic assumption of CTA is that we can ignore dependencies on the content and on the state. Moreover, for caching policies where contents are maintained in a priority queue ordered by the time of the most recent refresh, and where evictions occur from the tail (as in LRU, qLRU, and qLRU-$\Delta C$), CTA approximates $T_C$ with a constant.

The strong advantage of CTA is that the interaction among different contents in the cache is now greatly simplified as in a TTL-cache [8]. In a TTL-cache, upon insertion, a timer with value $T_C$ is activated. It is restarted upon each new request for the same content. Once the timer expires, the content is removed from the cache. Under CTA, the instantaneous cache occupancy ($|S_t|$) can violate the hard buffer constraint. (Under CTA the number of contents stored in the cache is a random variable with expected value equal to $k$ and Poisson distribution. Since its coefficient of variation tends to 0 as $k$ grows large, CTA is expected to be asymptotically accurate.) The value of $T_C$ is obtained by imposing the expected occupancy to be equal to the buffer size, i.e.,

$$\sum_{S} \pi(S) \, |S| = k, \tag{4}$$

where $\pi(S)$ is the stationary probability distribution over states and $|S|$ the number of contents in $S$. For exact caching, it is relatively easy to express $\pi(S)$ as a function of $T_C$ and, then, to numerically compute the value of $T_C$. For similarity caching, additional complexity arises because the timer refresh rate for each content depends on the other contents in the cache (as any of them can be used to provide approximate answers), i.e., the dynamics of different contents are still coupled. Nevertheless, the TTL-cache model allows us to study this complex system as well.
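For exact caching under LRU and the IRM, the occupancy constraint takes the familiar form $\sum_i (1 - e^{-\lambda_i T_C}) = k$, which can be solved numerically. The sketch below covers only this classic special case (not the similarity caching extension discussed here) and uses bisection; the function name and tolerance are illustrative choices.

```python
import math

def characteristic_time(rates, k, tol=1e-12):
    """Solve sum_i (1 - exp(-rate_i * T)) = k for T by bisection."""
    if not 0 < k < len(rates):
        raise ValueError("need 0 < k < number of objects")

    def occupancy(t):
        return sum(1.0 - math.exp(-lam * t) for lam in rates)

    lo, hi = 0.0, 1.0
    while occupancy(hi) < k:      # bracket the root
        hi *= 2.0
    while hi - lo > tol * hi:     # occupancy is increasing in t
        mid = (lo + hi) / 2.0
        if occupancy(mid) < k:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

t_c = characteristic_time([1.0, 1.0, 1.0, 1.0], k=2)
print(t_c, [1.0 - math.exp(-lam * t_c) for lam in [1.0] * 4])
```

With four equally popular objects and room for two, each object is in the cache half of the time, so the characteristic time equals $\ln 2$.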

The expected marginal cost reduction due to in state is . If the state of the cache does not change, the expected sojourn time of content in the cache can be computed as:


EA assumes that evolves as a Markov Chain with transition rate from to equal to from (5), and from to equal to (if is not already in ). [21] shows that EA is very precise in practice for complex systems of interacting caches.

Results for regular perturbations of Markov chains [29] allow us to study the asymptotic behavior of the MC when $q$ vanishes, and in particular to determine which states are stochastically stable, i.e., have a non-null probability to occur as $q$ converges to 0. Despite the different application scenario, we can adapt the proof of [21, Prop. V.3] to our problem, and show that the stochastically stable states are local optimizers in the sense that it is not possible to replace a content in such states while reducing the cost.

The following technical changes to the proof of [21, Proposition V.3] are required. We consider that scales as . The cost reduction achieved by state , denoted as , replaces the global hit rate as performance metric of interest. Let , the relation corresponding to [21, Lemma II.1] is The weight associated to the downward transition from to becomes . Finally, the state function is defined as

The paper [24] proposes two policies for similarity caching: RND-LRU and SIM-LRU. In RND-LRU a request for $x$ produces a miss with a probability that depends on the distance from the best approximating object $z$. If it does not produce a miss, it refreshes the timer of $z$. Interestingly, RND-LRU can emulate in part qLRU-$\Delta C$, by using $q \, C_a(x, z)/C_r$ as its miss probability. The only difference is the refresh probability: in RND-LRU the best approximating content is refreshed with probability $1 - q \, C_a(x, z)/C_r$ (instead of $1 - C_a(x, z)/C_r$ as in qLRU-$\Delta C$). Our simulations in Sect. VI confirm that, for equal $q$, RND-LRU and qLRU-$\Delta C$ exhibit very similar performance. Given our result in Theorem V.5, it is not surprising that RND-LRU performs better than SIM-LRU [24, Fig. 9].

As we will show in Sect. VI, qLRU-$\Delta C$ approaches the minimum cost only for very small values of $q$. This is undesirable when contents’ popularities change rapidly. To obtain a more responsive cache behavior, we propose a novel online $\lambda$-unaware policy, that we call Duel.

Similarly to Greedy, upon a request at time for a content which is not in the cache, Duel estimates the potential advantage of replacing a cached content with , i.e., to move from the current state to state . As popularities are unknown, it is not possible to evaluate instantaneously the two costs and . Then, the two contents are compared during a certain amount of time during which they engage in a ‘duel’ (during this time we need to store only a reference to ). When a duel between a real content and its virtual challenger starts, we initialize to zero a counter for each of them. If (resp. ) is the best approximating object for a following request occurring at time , then the corresponding counter is incremented by (resp. ). The counter associated to a dueling content accumulates then the aggregate cost savings due to that content. A duel finishes in one of two possible ways: 1) counters get separated by more than a fixed quantity (a tunable parameter), or 2) a maximum delay (another parameter) has elapsed since the start of the duel. Duellist replaces if and only if, at the end of the duel, its counter exceeds the counter of . Otherwise is evicted, and becomes available again for a new duel.

A requested content is matched, whenever possible, with a content in the cache that is not engaged in an ongoing duel. At a given time, then, there can be up to ongoing duels. A challenger is matched to a stored object in two possible ways: with probability , it is matched to the closest object in the cache; with the complementary probability , it is matched to a content selected uniformly at random. Duels between nearby contents allow for fine adjustments of the current cache configuration, while duels between far contents enable fast macroscopic changes in the density of stored objects. In essence, Duel provides a distributed, stochastic version of Greedy with delayed decisions (due to lack of knowledge of ).
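The duel mechanism described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: objects are points on a square grid with wrap-around, the approximation cost is the hop distance, and the counter increment is taken to be the saving `c_r - distance` (clamped at zero), consistent with the remark that counters accumulate aggregate cost savings. The names `DuelCache`, `c_r`, `delta`, `tau`, and `beta` are hypothetical labels for the retrieval cost, the counter gap, the maximum duel duration, and the probability of matching the challenger to the closest object. The sketch also assumes that the best approximating object is searched among stored objects and virtual challengers.

```python
import random

def torus_l1(a, b, side):
    """Hop distance on a side x side grid with wrap-around."""
    dx, dy = abs(a[0] - b[0]), abs(a[1] - b[1])
    return min(dx, side - dx) + min(dy, side - dy)

class DuelCache:
    def __init__(self, initial, side, c_r, delta, tau, beta, seed=0):
        self.cache = set(initial)
        self.side, self.c_r = side, c_r
        self.delta, self.tau, self.beta = delta, tau, beta
        self.rng = random.Random(seed)
        # incumbent -> [challenger, start_time, incumbent_counter, challenger_counter]
        self.duels = {}

    def request(self, r, t):
        dist = lambda y: torus_l1(r, y, self.side)
        # 1) Credit the best approximating object if it takes part in a duel.
        challengers = {d[0]: inc for inc, d in self.duels.items()}
        best = min(sorted(self.cache | set(challengers)), key=dist)
        saving = max(0.0, self.c_r - dist(best))  # assumed form of the increment
        if best in self.duels:
            self.duels[best][2] += saving
        elif best in challengers:
            self.duels[challengers[best]][3] += saving
        # 2) Close duels whose counters are separated by more than delta,
        #    or that have lasted longer than tau.
        for inc, (chal, start, s_inc, s_chal) in list(self.duels.items()):
            if abs(s_inc - s_chal) > self.delta or t - start > self.tau:
                del self.duels[inc]
                if s_chal > s_inc:  # the challenger replaces the incumbent
                    self.cache.remove(inc)
                    self.cache.add(chal)
        # 3) Possibly start a new duel against a stored object that is free.
        pending = {d[0] for d in self.duels.values()}
        free = sorted(self.cache - set(self.duels))
        if r not in self.cache and r not in pending and free:
            if self.rng.random() < self.beta:
                opponent = min(free, key=dist)   # closest stored object
            else:
                opponent = self.rng.choice(free)  # uniformly random object
            self.duels[opponent] = [r, t, 0.0, 0.0]
```

Only a reference to the challenger is kept while the duel is running, so the number of ongoing duels never exceeds the number of stored objects.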

V-C Performance bound in the continuous scenario

The continuous scenario is particularly interesting, both because it can be more appropriate to describe objects/queries in some applications, and because it marks a striking difference with exact caching (footnote 4: recall that in the continuous case the rate of exact hits is null). For this scenario, we can derive some exact bounds and approximations of the minimum cost, exploiting simple geometric considerations.

We start by considering a homogeneous request process where over a bounded set . In what follows, all integrals are Lebesgue ones and all sets are Lebesgue measurable. Given a set , let denote its volume (its measure), and denote the ball with the same volume centered in . (Footnote 5: The geometric shape of a ball depends on the considered distance function . For example, in , if is the usual norm-2, balls are circles; if it is the norm-1, balls are squares.)

Lemma V.6.

For any and a set it holds:


The lemma provides the intuitive result that, among all sets with a given volume, the approximation cost for requests falling in is minimized when is a ball centered in , since is a non-decreasing function of the distance between and . We omit the simple proof.
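A discrete analogue of this statement can be checked numerically. The sketch below is only a sanity check under assumptions that are not part of the lemma: points live on the integer grid, the distance is the norm-1, the stored object sits at the origin, and the cost function `h` is the identity (any non-decreasing function could be passed instead). The helper names are hypothetical.

```python
import random

def l1_ball(radius):
    """Integer points at norm-1 distance at most `radius` from the origin."""
    span = range(-radius, radius + 1)
    return [(x, y) for x in span for y in span if abs(x) + abs(y) <= radius]

def cost_of_set(points, h=lambda d: d):
    """Total approximation cost of serving `points` from the origin."""
    return sum(h(abs(x) + abs(y)) for x, y in points)

def random_set(size, half_width, rng):
    """A random set of `size` distinct points in a square window."""
    span = range(-half_width, half_width + 1)
    window = [(x, y) for x in span for y in span]
    return rng.sample(window, size)

ball = l1_ball(3)
rng = random.Random(1)
ball_cost = cost_of_set(ball)
# No set with the same number of points is cheaper than the centered ball.
never_beaten = all(
    cost_of_set(random_set(len(ball), 10, rng)) >= ball_cost for _ in range(200)
)
```

The check succeeds because the ball collects exactly the points with the smallest distances from the stored object, which is the intuition behind the lemma.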

Observe that the integral on the right-hand side of (6) does not depend on , but only on the volume . We then write . We are now able to express the following bound for the expected cost:

Theorem V.7.

In the continuous scenario with constant request rate over , for any cache state ,


Given , we denote by the set of objects in having as closest object in the cache, i.e., . We have:

where the first inequality follows from Lemma V.6, and the second one from Jensen’s inequality, since is a convex function (as it can be easily checked). ∎

In some cases, it is possible to show that specific cache configurations achieve the lower bound in (7) and are therefore optimal:

Corollary 1.

Let be the distance for which the approximation cost is equal to the retrieval cost, i.e., . Let be a ball of radius centered in . Any cache state , such that the balls are contained in and have intersections with null volume, is optimal.

Corollary 2.

Any cache state , such that, for some , the balls for are a tessellation of (i.e.,  and for each and ), is optimal.

If the request rate is not space-homogeneous, one can apply the results above over small regions of where can be approximated by a constant value , assuming a given number of cache slots is devoted to each area (with the constraint that ). In the regime of large cache size , it is possible to determine how should scale with the local request rate , obtaining an approximation of the minimum achievable cost through Theorem V.7. For example, if is a square, is the norm-1, , and , we obtain:



Fig. 2: Example of perfect tessellation of a square grid with wrap-around conditions, in the case , . Black dots correspond to a minimum cost cache configuration under homogeneous popularities.

VI Experiments

To evaluate the performance of different caching policies, we have run extensive Monte-Carlo simulations in the following reference scenario: a bidimensional square grid of points, with unitary step and wrap-around conditions, and , i.e., the approximation cost equals the minimum number of hops between and . This is a finite scenario (with catalog size equal to ) that approximates the continuous scenario in which is a square. We let , for some positive integer . When , there exists a regular tessellation of the grid with balls (squares in this case), each with points. Figure 2 provides an example of such regular tessellation in the case , . When , we can apply (the discrete versions of) Corollary 2 and approximation (8) to compute the minimum cost.
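The regular tessellation mentioned above can be built and verified by brute force. The following sketch rests on assumptions of its own: the grid side is taken as `2*l*l + 2*l + 1`, which is the number of integer points in a norm-1 ball of radius `l`, and the ball centers are taken as the multiples of the vector `(l, l + 1)` modulo the side. The check then confirms, point by point, that every grid point lies within `l` hops of exactly one center, and returns the resulting average approximation cost when the cost equals the hop distance and all points are equally popular.

```python
def torus_l1(a, b, side):
    """Hop distance on a side x side grid with wrap-around."""
    dx, dy = abs(a[0] - b[0]), abs(a[1] - b[1])
    return min(dx, side - dx) + min(dy, side - dy)

def tessellation_centers(l):
    """Candidate ball centers for radius l (assumed lattice construction)."""
    side = 2 * l * l + 2 * l + 1
    centers = [((a * l) % side, (a * (l + 1)) % side) for a in range(side)]
    return side, centers

def average_cost_of_tessellation(l):
    """Verify the tessellation and return the mean hop distance to the cache."""
    side, centers = tessellation_centers(l)
    total = 0
    for x in range(side):
        for y in range(side):
            dists = sorted(torus_l1((x, y), c, side) for c in centers)
            # Exactly one center must lie within l hops of each grid point.
            if not (dists[0] <= l < dists[1]):
                raise ValueError("the balls do not tessellate the grid")
            total += dists[0]
    return total / (side * side)
```

For radius 2 the construction gives a 13 x 13 grid with 13 centers, and each ball contributes 4 points at distance 1 and 8 points at distance 2, hence an average cost of 20/13.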

We first consider traffic synthetically generated according to the IRM, in two cases: homogeneous, in which all objects are requested with the same rate; Gaussian, in which the request rate of object is proportional to , where is the hop distance from the grid center. Under homogeneous traffic, Corollary 2 guarantees that a cache configuration storing the centers of the balls of any tessellation like the one in Fig. 2 is optimal. The case of homogeneous popularities then tests the ability of similarity caching policies to converge to one of the optimal configurations (corresponding to translated tessellations). The case of Gaussian popularities, instead, tests their ability to reach a heterogeneous configuration that stores more objects close to the center of the grid.

We consider the case , , with catalog size slightly less than objects. We set , i.e., a setting very far from exact caching, where any request can in principle be approximated by any object. For a fair comparison, all algorithms start from the same initial state, corresponding to a set of (distinct) objects drawn uniformly at random from the catalog. For the Duel policy, we experimentally found that, in the general case of grids with unitary step, a good and robust way to set its various parameters is , , , which requires choosing a single parameter .


Fig. 3: Performance of different policies in the case of homogeneous traffic.


Fig. 4: Performance of different policies in the case of Gaussian traffic, with .

Figures 3 and 4 show the instantaneous cost (2) achieved by different policies as a function of the number of arrived requests, respectively for homogeneous and Gaussian traffic (with ). The optimal cost (approximated by (8), and also exactly computed thanks to Corollary 2 in the homogeneous case) is also reported as reference. In both cases, as expected, Greedy outperforms all -unaware policies and reaches an almost optimal cache configuration after a number of arrivals of the order of the catalog size. For LRU-, RND-LRU, and Duel, we show two curves for different settings of their parameter (either or ), leading to a faster convergence to more costly states (thin dotted curves), or a slower one to less costly states (thick dash-dotted curves). As we mentioned in Sect. V-B, LRU- and RND-LRU are close (provided that we match their miss probability), and indeed they exhibit very similar performance, with a slight advantage of LRU- for small values of (remember that the local optimality in Theorem V.5 holds for vanishing ). Duel achieves the best accuracy-responsiveness trade-off, i.e., for a given quality (cost) of the final configuration, it reaches that configuration faster than the other -unaware policies. Figure 5 shows the cache configuration achieved by Duel after arrivals, for both types of traffic.

Fig. 5: Final configuration produced by the Duel policy under homogeneous traffic (left plot, with ), and Gaussian traffic (right plot, with ).

We have also evaluated the performance of the different policies using a real content request trace collected over 5 days from a major CDN provider. The trace contains roughly 418 million requests for about 13 million objects. By discarding the 116 least popular objects (all requested only once) from the original trace, we obtain a slightly reduced catalog that can be mapped to a square grid with . We tested two extremely different ways to carry out the mapping. In the uniform mapping, trace objects are mapped to the grid points according to a random permutation: popularities of close objects on the grid are, then, uncorrelated. In the spiral mapping, trace objects are ordered from the most popular to the least popular, and then mapped to the grid points along an expanding spiral starting from the center: popularities of close-by objects are now strongly correlated, similarly to what happens under synthetic Gaussian popularities.
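The two mappings can be sketched as below. This is an illustration under assumptions, not the procedure used to produce the reported results: `popularity` is a hypothetical dictionary from object identifiers to request counts, the spiral is walked from the cell `(side // 2, side // 2)` and cells falling outside the grid are skipped, and the order in which equally popular objects are placed is left to the sort.

```python
import random

def spiral_coords(side):
    """Grid cells of a side x side square, visited along an expanding spiral."""
    x = y = side // 2
    cells = [(x, y)]
    dx, dy = 1, 0
    step = 1
    while len(cells) < side * side:
        for _ in range(2):          # two legs share the same length
            for _ in range(step):
                x, y = x + dx, y + dy
                if 0 <= x < side and 0 <= y < side:
                    cells.append((x, y))
            dx, dy = -dy, dx        # quarter turn
        step += 1
    return cells

def map_objects(popularity, side, mode, seed=0):
    """Assign each object to a grid cell with the 'spiral' or 'uniform' mapping."""
    ranked = sorted(popularity, key=popularity.get, reverse=True)
    if len(ranked) > side * side:
        raise ValueError("the grid is too small for the catalog")
    if mode == "spiral":
        cells = spiral_coords(side)       # most popular object at the center
    elif mode == "uniform":
        cells = [(x, y) for x in range(side) for y in range(side)]
        random.Random(seed).shuffle(cells)  # random permutation of the cells
    else:
        raise ValueError("mode must be 'spiral' or 'uniform'")
    return dict(zip(ranked, cells))
```

With the spiral mapping, popularity decreases with the distance from the center, whereas the uniform mapping removes any spatial correlation between the popularities of nearby cells.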

Figure 6 shows the accumulated cost achieved by different policies as a function of the number of arrived requests, for both mappings. For LRU- and Duel we performed a (coarse) optimization of their parameter, so as to achieve the smallest final accumulated cost. To better appreciate the possible gains achievable by similarity caching, we have also added the curves produced by a cache whose state evolves according to two exact caching policies: LRU and Random [16]. Since these policies produce a disproportionate number of misses, their total aggregate cost (1) is at least one order of magnitude larger than that of LRU- and Duel. For a fair comparison we only plot the aggregate approximation cost . Although the retrieval costs incurred are ignored, LRU and Random still perform between 30% and 50% worse than Duel.

The figure also shows the performance of Greedy, using as the empirical popularity distribution measured on the entire trace. Interestingly, under non-stationary, realistic traffic conditions, Greedy no longer outperforms -unaware policies. In particular, Duel takes the lead under both mappings, due to its ability to dynamically adapt to shifts in contents’ popularity.

Fig. 6: Performance of different policies under real traffic: uniform mapping (left plot) and spiral mapping (right plot).

VII Conclusion and future work

The analysis provided in this paper constitutes a first step toward the understanding of similarity caching, but it is far from exhaustive. In the offline dynamic setting, it is unknown if and under which conditions there exists an efficient polynomial clairvoyant policy corresponding to Bélády’s one [3] for exact caching. The adversarial setting calls for more general results for the -server problem with excursions. For exact caching, the characteristic time approximation is rigorously justified under an appropriate scaling of cache and catalogue sizes [12, 15, 19]. It would be interesting to understand if and to what extent analogous results hold for similarity caching. Moreover, is it possible to use the CTA to compute the expected cost of a similarity caching policy, similarly to what can be done for the miss ratio of LRU, LRU, Random and other policies in the classic setting? There is surely still room for the design of efficient -unaware policies. Interestingly, in our experiments the smallest cost is achieved by Duel, a novel policy that completely differs from exact caching policies, suggesting that similarity caching may require departing from traditional approaches. Another interesting direction would be to consider networks of similarity caches. Finally, the above issues should be addressed in the context of the different application domains mentioned in the introduction, ranging from multimedia retrieval to recommender systems, from sequence learning tasks to low-latency serving of machine learning predictions. The design of computationally efficient algorithms can, indeed, strongly depend on the specific application context. In conclusion, our initial theoretical study and performance evaluation of similarity caching opens many interesting directions and brings new challenges into the caching arena.


  • [1] A. F. Auch, H. Klenk, and M. Göker (2010) Standard operating procedure for calculating genome-to-genome distances based on high-scoring segment pairs. Standards in Genomic Sciences 2 (1), pp. 142.
  • [2] Y. Bartal, M. Charikar, and P. Indyk (2001) On page migration and other relaxed task systems. Theoretical Computer Science 268 (1), pp. 43–66. Note: On-line Algorithms ’98. ISSN 0304-3975.
  • [3] L. A. Bélády (1966-06) A study of replacement algorithms for a virtual-storage computer. IBM Syst. J. 5 (2), pp. 78–101. ISSN 0018-8670.
  • [4] A. Borodin, N. Linial, and M. E. Saks (1992-10) An optimal on-line algorithm for metrical task system. J. ACM 39 (4), pp. 745–763. ISSN 0004-5411.
  • [5] E. Chávez, G. Navarro, R. Baeza-Yates, and J. L. Marroquín (2001-09) Searching in metric spaces. ACM Comput. Surv. 33 (3), pp. 273–321. ISSN 0360-0300.
  • [6] H. Che, Y. Tung, and Z. Wang (2002-09) Hierarchical Web caching systems: modeling, design and experimental results. IEEE Journal on Selected Areas in Communications 20 (7), pp. 1305–1314. ISSN 0733-8716.
  • [7] F. Chierichetti, R. Kumar, and S. Vassilvitskii (2009) Similarity caching. In Proceedings of the Twenty-eighth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, PODS ’09, New York, NY, USA, pp. 127–136. ISBN 978-1-60558-553-6.
  • [8] N. E. Choungmo Fofack, P. Nain, G. Neglia, and D. Towsley (2012-10) Analysis of TTL-based Cache Networks. In ValueTools - 6th International Conference on Performance Evaluation Methodologies and Tools - 2012, Cargèse, France. Note: RR-7883 : student paper award.
  • [9] D. Crankshaw, X. Wang, J. E. Gonzalez, and M. J. Franklin (2015) Scalable training and serving of personalized models. In NIPS 2015 Workshop on Machine Learning Systems (LearningSys).
  • [10] D. Crankshaw, X. Wang, G. Zhou, M. J. Franklin, J. E. Gonzalez, and I. Stoica (2017) Clipper: a low-latency online prediction serving system. In 14th USENIX Symposium on Networked Systems Design and Implementation (NSDI 17), Boston, MA, pp. 613–627. ISBN 978-1-931971-37-9.
  • [11] G. Einziger and R. Friedman (2014-02) TinyLFU: a highly efficient cache admission policy. In 22nd Euromicro International Conference on Parallel, Distributed, and Network-Based Processing, pp. 146–153. ISSN 1066-6192.
  • [12] R. Fagin (1977) Asymptotic miss ratios over independent references. Journal of Computer and System Sciences 14 (2), pp. 222–250.
  • [13] F. Falchi, C. Lucchese, S. Orlando, R. Perego, and F. Rabitti (2008) A metric cache for similarity search. In Proceedings of the 2008 ACM Workshop on Large-Scale Distributed Systems for Information Retrieval, LSDS-IR ’08, New York, NY, USA, pp. 43–50. ISBN 978-1-60558-254-2.
  • [14] R. J. Fowler, M. S. Paterson, and S. L. Tanimoto (1981) Optimal packing and covering in the plane are NP-complete. Information Processing Letters 12 (3), pp. 133–137. ISSN 0020-0190.
  • [15] C. Fricker, P. Robert, and J. Roberts (2012) A versatile and accurate approximation for LRU cache performance. In Proceedings of the 24th International Teletraffic Congress, ITC ’12, pp. 8:1–8:8. ISBN 978-1-4503-1896-9.
  • [16] M. Garetto, E. Leonardi, and V. Martina (2016-05) A unified approach to the performance analysis of caching systems. ACM Trans. Model. Perform. Eval. Comput. Syst. 1 (3), pp. 12:1–12:28. ISSN 2376-3639.
  • [17] A. Graves, G. Wayne, and I. Danihelka (2014) Neural Turing Machines. arXiv preprint arXiv:1410.5401.
  • [18] B. Hajek (1988) Cooling schedules for optimal annealing. Mathematics of Operations Research 13 (2), pp. 311–329. ISSN 0364765X, 15265471.
  • [19] B. Jiang, P. Nain, and D. Towsley (2018-09) On the convergence of the TTL approximation for an LRU cache under independent stationary request processes. ACM Trans. Model. Perform. Eval. Comput. Syst. 3 (4), pp. 20:1–20:31. ISSN 2376-3639.
  • [20] E. Koutsoupias (2009-05) The k-server problem. Comput. Sci. Rev. 3 (2), pp. 105–118. ISSN 1574-0137.
  • [21] E. Leonardi and G. Neglia (2018-06) Implicit coordination of caches in small cell networks under unknown popularity profiles. IEEE Journal on Selected Areas in Communications 36 (6), pp. 1276–1285. ISSN 0733-8716.
  • [22] M. S. Manasse, L. A. McGeoch, and D. D. Sleator (1990-05) Competitive algorithms for server problems. J. Algorithms 11 (2), pp. 208–230. ISSN 0196-6774.
  • [23] G. Neglia, D. Carra, and P. Michiardi (2018) Cache Policies for Linear Utility Maximization. IEEE/ACM Transactions on Networking 26 (1), pp. 302–313.
  • [24] S. Pandey, A. Broder, F. Chierichetti, V. Josifovski, R. Kumar, and S. Vassilvitskii (2009) Nearest-neighbor caching for content-match applications. In Proceedings of the 18th International Conference on World Wide Web, WWW ’09, New York, NY, USA, pp. 441–450. ISBN 978-1-60558-487-4.
  • [25] A. Santoro, S. Bartunov, M. Botvinick, D. Wierstra, and T. Lillicrap (2016-06) Meta-learning with memory-augmented neural networks. In Proceedings of The 33rd International Conference on Machine Learning, M. F. Balcan and K. Q. Weinberger (Eds.), Proceedings of Machine Learning Research, Vol. 48, New York, New York, USA, pp. 1842–1850.
  • [26] P. Sermpezis, T. Giannakas, T. Spyropoulos, and L. Vigneri (2018-06) Soft cache hits: improving performance through recommendation and delivery of related content. IEEE Journal on Selected Areas in Communications 36 (6), pp. 1300–1313. ISSN 0733-8716.
  • [27] D. D. Sleator and R. E. Tarjan (1985-02) Amortized efficiency of list update and paging rules. Commun. ACM 28 (2), pp. 202–208. ISSN 0001-0782.
  • [28] J. Weston, S. Chopra, and A. Bordes (2015) Memory networks. In Proceedings of the International Conference on Learning Representations (ICLR).
  • [29] H. P. Young (1993-01) The Evolution of Conventions. Econometrica 61 (1), pp. 57–84.