The rapidly growing demand for mobile content delivery creates new revenue opportunities for wireless networks, but also requires a rapid increase in their capacity. Unfortunately, typical solutions based on PHY-layer advances or network densification are constantly outpaced by the increasing demand, and this calls for new content delivery approaches. To this end, a potentially game-changing idea is to employ device-to-device (D2D) communications, where memory-endowed user devices cache popular files and exchange them with each other upon request. Such cooperative D2D swarms can increase the network content delivery capacity, mitigate cellular congestion, and improve the end-user experience.
A key challenge in D2D cooperative caching is to design the caching policy, i.e., to identify which files to cache at each device at any given time. On the one hand, the devices have small storage and can therefore store only a small subset of the possible content files; on the other hand, each of them “sees” a small number of requests per unit time, which makes estimating the popular files at each location very challenging. In view of these limitations, the devices may consistently fail to cache the files that their neighbors will request in the future, rendering the D2D cache-hit ratio practically negligible and hence this solution ineffective.
Caching policies often assume that requests are generated by a given stationary process, and systems in practice rely on reactive policies such as LFU and LRU. These, however, solve the 1-cache problem conditionally on the request process. For example, LFU is suitable for stationary requests, and LRU for the adversarial model. When the actual process differs from the assumed one, these policies perform poorly, and this problem is exacerbated in D2D networks where popularity has hot spots in time and space. Recently, [8, 14, 2] proposed dynamic policies for caching networks, e.g., the m-LRU or “lazy rule” policies, which however do not offer performance guarantees. Hence, designing a policy that is robust to the D2D network dynamics remains a challenging and important open problem.
In this cooperative D2D network, users might change positions and content preferences, and hence we need an algorithm that allows each device to decide: (i) which files to cache for serving its neighbors’ needs; and (ii) from which neighbors to retrieve a file, and when to use the last-resort solution of the base station (BS), Fig. 1. This requires a learning mechanism for making caching and routing decisions in an online and decentralized fashion. Previous efforts employing learning are either restricted to inference of the popularity model [3, 4], or rely on Q-learning [21, 24] and classification techniques to estimate request frequencies. These approaches are not applicable to this D2D scenario, as they are centralized, often exhibit high complexity, presume a stationary model, and do not decide routing. Other interesting suggestions for D2D networks, see [13, 6], suffer from computational complexity or rely on assumptions that are valid only in some cases, e.g., known popularity. Finally, a decentralized D2D file sharing algorithm has been proposed, which however needs access to the pattern of file requests and network evolution.
Here we design a caching policy that has universally-optimal performance, which is defined as the cost for delivering the requested files to users through (cheap) D2D or (costly) BS-to-device transmissions. We formulate the D2D caching operation as an online convex optimization (OCO) problem, and develop a dynamic and distributed algorithm that solves it without the need to make any assumption about the request pattern. That is, our policy ensures asymptotically no regret, as it achieves no more average cost than a static caching configuration selected with knowledge of future requests.
The contributions of this paper can be summarized as follows. (i) We propose the idea of embedding a distributed online learning mechanism into D2D caching policies. We achieve this by formulating an OCO problem, and this opens a link between caching and this novel machine learning tool. (ii) We design an online caching policy that leverages the online gradient descent algorithm to achieve asymptotically optimal performance under any possible spatio-temporal request pattern in our D2D network. We also explain how our policy can adapt to network changes and user churn. (iii) We compare our policy with the state-of-the-art mLRU and “lazy” LRU policies, verifying that it outperforms its competitors while converging to the optimal static policy.
II System Model
Network. Consider a set of wireless users in an area, each one with a cache of size . Let be the set of direct links connecting the users, where a link appears if the devices’ proximity and the electromagnetic environment allows it to be reliably established. A link between and is associated to a cost , which represents, e.g. application-layer latency performance or energy consumption during content transmission; and we assume . All users maintain a connection with a base station (BS, subscripted with ), and we define and . The users can obtain any file from the BS at cost . We assume that links can deliver the requested content in the considered time window.
Requests. There is a catalog with files of unit size. The system operation is time-slotted, and denotes the event that a request for file has been submitted by user during slot . At each we assume there is one request, or, from a different perspective, that the system decisions are updated after each request. (Footnote 1: We can also consider batches of requests. If the batch has 1 request from each location, the pattern is biased to equal request rate at each location. An unbiased batch should contain an arbitrary number of requests from each location. Our guarantees hold for unbiased batches of arbitrary finite length.)
Hence, the request process is described by a sequence of vectors drawn from the set:
The instantaneous file popularity is expressed by the probability distribution (with support ), which is allowed to be unknown and arbitrary. The same holds for the joint distribution that describes the file popularity evolution, for any user location, and within an interval of slots. This generic model captures all possible spatio-temporal request sequences, including stationary (i.i.d. or otherwise), non-stationary, and adversarial models. The latter is the most general case, as it includes request sequences selected by an adversary aiming to disrupt the system performance.
Caching. The cache of each user can store only files, but the BS has the entire catalog. Following the standard practice in wireless caching models [9, 19], we perform caching using Maximum Distance Separable (MDS) codes. In MDS, the files are split into a fixed number of data chunks, and we store in each cache a number of coded chunks that are pseudo-random linear combinations of the data chunks. Using the MDS properties, a user can decode the file (with high probability) if it receives any coded chunks. Hence, the caching decision vector has elements, where denotes the amount of random coded chunks of file stored at user during slot . (Footnote 2: The fractional caching is supported by the observation that large files are composed of thousands of chunks, stored independently; see the literature on partial caching. Hence, by rounding these fine-grained fractional decisions, we only induce a small application-specific error. In some prior caching models, fractional variables represent probabilities of caching [22, 5].) Based on this, we introduce the convex set of eligible caching vectors:
We are interested in distributed policies, where each user changes its cache based on information from its one-hop neighbors . Thus, we define:
Definition 1 (Local Caching Policy).
A local caching policy for user is a (possibly randomized) rule
The collection of the caching policies for all users will be henceforth referred to as a “caching policy”.
Routing. Since each user might have more than one neighbor, we introduce routing variables to determine the cache from which the requested file will be fetched. Let denote the portion of request that is fetched from cache , and we define the routing vector implemented in slot . There are two important remarks here. First, due to the coded caching model, the requests can be simultaneously routed from multiple caches. In terms of communications, this can be implemented through time-sharing among the activated links, or by concurrently using different network interfaces. Second, the caching and routing decisions are coupled and constrained: (i) a request cannot be routed from an unreachable cache, (ii) we cannot route from a cache more data chunks than it has, and (iii) each request must be fully routed.
We define if and otherwise, and thus the set of eligible routing decisions conditioned on is:
Note that does not appear in the second constraint, because the BS stores the entire catalog and can serve all users. This last-resort routing option ensures that is non-empty for any . As will become clear next, the optimal routing decisions can be devised for a given cache configuration. This is an inherent property of link-uncapacitated caching networks, see also [9, 19].
III Problem Formulation
A file request of a node can be served, exclusively or partially (due to MDS), by neighboring devices at a smaller cost than fetching it from the base station. Given a cache configuration , the (minimum) cost to satisfy is:
where the optimization decides the routing that minimizes the cost for a given file placement at the nodes. The function’s form suggests that is beneficial if the file has been cached at the device asking for it () or at nearby devices that can send it with low cost. However, it is daunting to assess the impact of , as it involves the solution of an optimization problem. Fortunately, the cost function above is convex:
Lemma 1. Function is convex in its domain , .
Proof: Fix a request vector and consider cache configurations ; note that, for any , is also a valid configuration. We will show that:
Let us denote the optimal routing vectors corresponding to , respectively. We then have:
It holds , thus
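To make the cost function and its convexity more concrete, the following minimal sketch evaluates the minimum routing cost for a single request. It assumes, as the uncapacitated-link model suggests, that the inner routing problem can be solved greedily: fetch from the cheapest reachable cache first, up to the fraction it stores, and take whatever is left from the BS. The function name `min_routing_cost` and its arguments are illustrative and are not notation from the paper.

```python
def min_routing_cost(link_costs, cached, bs_cost):
    """Minimum cost of serving one unit-size request (illustrative sketch).

    link_costs[j]: cost of fetching from reachable cache j
                   (the requester's own cache would have cost 0).
    cached[j]:     fraction of the requested file stored at cache j, in [0, 1].
    bs_cost:       cost of fetching from the base station, which holds every file.

    Returns (cost, routing, from_bs), where routing[j] is the fraction fetched
    from cache j and from_bs is the fraction fetched from the BS.
    """
    # Visit caches in order of increasing link cost (greedy fill).
    order = sorted(range(len(cached)), key=lambda j: link_costs[j])
    remaining = 1.0
    cost = 0.0
    routing = [0.0] * len(cached)
    for j in order:
        # Stop when the request is fully served or the BS is at least as cheap.
        if remaining <= 0.0 or link_costs[j] >= bs_cost:
            break
        take = min(cached[j], remaining)
        routing[j] = take
        cost += link_costs[j] * take
        remaining -= take
    # The BS serves whatever the neighbouring caches could not provide.
    cost += bs_cost * remaining
    return cost, routing, remaining
```

For instance, with link costs `[0.0, 1.0, 2.0]`, cached fractions `[0.25, 0.5, 0.5]` and a BS cost of `10.0`, the sketch routes a quarter of the file from the local cache, half from the second cache and the remaining quarter from the third, so the BS is not used. Evaluating the function at two configurations and at their midpoint gives a quick numerical check of the convexity inequality.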
The subscript at the cost function (2) reminds us of its dependence on the request that is generated at . Since these events may vary according to a non-stationary process, we will use the concept of regret from online convex optimization .
We capture that the request sequence may follow any arbitrary and a priori unknown probability distribution, by using the idea of an adversary which selects at each slot , while knowing . This assumption reflects that, in practice, caches are populated before the requests are issued. Since by Lemma 1 are convex, our problem falls in the Online Convex Optimization framework . The performance metric of an algorithm in this line of work is the regret: the difference between costs incurred by the algorithm and the best static configuration in hindsight. In our case, this benchmark is the optimal cache configuration (same for all slots) devised with knowledge of all requests in the time horizon of interest . Hence, the regret of policy is:
The expectation is over the joint probability distribution of requests and possible randomizations in and,
is the best fixed action in hindsight, i.e. the best chunk placement over the entire sample path of requests. Our goal is to devise a policy whose regret scales sublinearly with :
This “no regret” property implies that the algorithm learns to perform as well as the best cache configuration . Note that if the requests are i.i.d., “no regret” implies that the performance of the policy approaches the optimal in terms of . However, our adversarial model is much more general; in this case, comparing to a static policy is a way to limit the power of the adversary while still being able to obtain meaningful policies, which are robust for all request models.
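For reference, the regret and the no-regret condition described above can be written in the usual OCO form. The notation below is assumed for illustration and may differ from the paper's own symbols: $\sigma$ is the policy, $y_t(\sigma)$ the cache configuration it selects in slot $t$, $f_{r_t}$ the cost induced by the request $r_t$, $\mathcal{Y}$ the set of eligible caching vectors, and $T$ the horizon.

```latex
% Assumed notation; a sketch of the standard regret definition.
\begin{align}
  R_T(\sigma) &= \mathbb{E}\!\left[\, \sum_{t=1}^{T} f_{r_t}\big(y_t(\sigma)\big)
                 \;-\; \sum_{t=1}^{T} f_{r_t}\big(y^{\star}\big) \right],
  \qquad
  y^{\star} \in \arg\min_{y \in \mathcal{Y}} \sum_{t=1}^{T} f_{r_t}(y), \\
  % "No regret": the regret grows sublinearly in the horizon T.
  \lim_{T \to \infty} \frac{R_T(\sigma)}{T} &= 0 .
\end{align}
```

Sublinear growth means that the time-averaged cost of the policy approaches that of the best static configuration as the horizon grows.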
IV Distributed D2D Caching Algorithm
Our distributed caching algorithm is based on online gradient descent . The main idea is to use the first order approximation as a predictor of the unknown function that the adversary will select next. The caching configurations, then, are updated by taking an appropriate step in the direction of the gradient .
IV-A Finding the Direction of Improvement
Since the cost function (2) is not necessarily differentiable everywhere, we will rely on subgradients. In order to find one, we first simplify . Let us denote the user making the request and the file requested at slot , respectively, and the set of the nodes (including the BS) connected to user . Then, it is , and hence simplifies to:
Equations (6)-(8) define an optimization problem, henceforth referred to as , the solution of which yields the optimal routing for any (constant input for ). That is, to evaluate at vector we need to solve . Despite this intricate form of , we show that it is possible to obtain a subgradient which is needed for our online caching algorithm.
We first define the Lagrangian of as follows:
where and are the dual variables, and we simplified notation by dropping . We will prove that the subgradient of at is the optimal dual variables for (8) in .
Lemma 2 (Subgradient).
Then is a subgradient of at , that is: .
Proof. We start by denoting the outcome of (10) for cache configuration and define the function:
where (a) holds since (6)-(8) has the strong duality property; (b) holds since is linear and we can maximize successively over the different primal or dual variables; and (c) holds as only in depends on . Due to strong duality for , we can replace , and then it suffices to rearrange terms.
Since is the optimal multiplier for (8), it has nonzero elements only where this constraint is tight. Intuitively, this means that after user requests file , the direction of the subgradient is towards caching more parts of this file at user and at users having low-cost D2D links with .
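The intuition above can be illustrated with a short sketch that computes the cost of one request together with a subgradient with respect to the cached fractions. It assumes the greedy solution of the routing problem, and it takes as "price" the link cost of the marginal source, i.e., the most expensive source actually used. The multiplier of the capacity constraint for cache j is then max(0, price - cost_j), and the subgradient is its negative. The function name and arguments are hypothetical and are not taken from the paper.

```python
def cost_and_subgradient(link_costs, cached, bs_cost):
    """Cost of one unit-size request and a subgradient w.r.t. `cached`.

    Illustrative sketch under the assumptions stated in the lead-in.
    link_costs[j]: cost of fetching from reachable cache j.
    cached[j]:     fraction of the requested file stored at cache j.
    bs_cost:       cost of fetching from the base station.
    """
    order = sorted(range(len(cached)), key=lambda j: link_costs[j])
    remaining = 1.0
    cost = 0.0
    price = bs_cost  # marginal cost if the BS has to serve part of the request
    for j in order:
        if link_costs[j] >= bs_cost:
            break
        take = min(cached[j], remaining)
        cost += link_costs[j] * take
        remaining -= take
        if remaining <= 0.0:
            price = link_costs[j]  # cache j is the marginal source
            break
    cost += bs_cost * remaining

    # Multiplier of the capacity constraint at cache j is max(0, price - cost_j);
    # caching more of the file at a cheaper-than-marginal cache lowers the cost,
    # so the corresponding subgradient entry is non-positive.
    subgradient = [min(0.0, w - price) for w in link_costs]
    return cost, subgradient
```

The entries are nonzero only for caches that are cheaper than the marginal source, which matches the observation that the update pushes the requested file toward the requester and its low-cost neighbours. In degenerate cases the optimal multiplier is not unique, and this sketch simply returns one valid choice.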
IV-B Algorithm Design
The Distributed Online Caching Policy (DOCP) is shown in Algorithm 1. The execution of the policy is iterative, where in each slot the following steps take place. First, a user submits a request for a file (step 3). This user solves (6)-(8) to find the optimal routing for the current caching configuration (step 4), and requests the parts of from the respective neighbors or the BS (step 5). A certain cost is incurred based on this routing and the existing (that was calculated based on previous requests). Then, user sends the optimal multiplier to each neighbor (step 8), which updates its caching policy accordingly. This involves calculating the new , based on the latest request, and projecting them back into the feasible space (step 9):
where is the Euclidean projection on , and has zero elements except the -th element being equal to 1.
Note that DOCP is indeed distributed, since only the neighbors of each requester need to update their caches. Moreover, this update is based solely on messages received from the requester, and these communication overheads are moderate as only the Lagrange multipliers are sent to 1-hop neighbors. Finally, the projection operation can be executed efficiently, i.e., in runtime, and for each user independently, by using the local projection algorithm introduced in . We omit the details here due to lack of space.
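As a rough illustration of the local update, the sketch below performs one gradient step on a single user's cache vector and projects the result back onto its feasible set. Two assumptions are made here that are not confirmed by the text: the feasible set of a user is taken to be {y : 0 <= y_n <= 1, sum(y) <= capacity}, and the projection is computed by plain bisection on a shift value rather than by the local projection algorithm the paper refers to. All names are hypothetical.

```python
def project_capped_simplex(x, capacity, iters=60):
    """Euclidean projection of x onto {y : 0 <= y_n <= 1, sum(y) <= capacity}.

    Sketch using bisection; assumes capacity >= 0.
    """
    def clip(v):
        return min(1.0, max(0.0, v))

    y = [clip(v) for v in x]
    if sum(y) <= capacity:
        return y  # the box projection already satisfies the capacity constraint

    # Otherwise the projection has the form clip(x - tau) for some tau > 0
    # chosen so that the capacity constraint holds with equality.
    lo, hi = 0.0, max(x)
    for _ in range(iters):
        tau = 0.5 * (lo + hi)
        if sum(clip(v - tau) for v in x) > capacity:
            lo = tau
        else:
            hi = tau
    return [clip(v - hi) for v in x]  # hi always yields a feasible point


def docp_update(y, file_idx, multiplier, step, capacity):
    """One local step: move toward caching more of the requested file, then project.

    y:          current cached fractions of one user, one entry per file.
    file_idx:   index of the file requested in this slot.
    multiplier: non-negative dual value received from the requester.
    """
    x = list(y)
    x[file_idx] += step * multiplier  # descent step: the subgradient is -multiplier
    return project_capped_simplex(x, capacity)
```

For example, starting from `[0.5, 0.5, 0.0]` with capacity 1, a step of size 0.2 with multiplier 1 on the first file yields approximately `[0.6, 0.4, 0.0]`: the requested file gains cache space at the expense of the other cached file.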
IV-C Performance Guarantees
The next theorem proves that DOCP achieves no regret performance, under any possible spatio-temporal arrival pattern.
Theorem 1 (Regret of DOCP).
For step size , the regret of DOCP satisfies:
where we defined the parameters , and .
Proof: Using non-expansiveness of Euclidean projection:
Also, a telescopic sum over slots gives
To proceed, note that and . Using that and rearranging:
Furthermore, due to convexity of , it holds , thus:
The value of the step size and the regret bound follow by minimizing the right-hand side of the above inequality.
Hence, no regret is achieved with a constant step which depends on , and if is unknown we can select step , or employ the doubling trick, see [22, Sec. 2.3].
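The steps of the proof follow the standard online gradient descent argument, which can be sketched as below. The symbols are assumed for illustration and stand in for the paper's own parameters: $g_t$ is the subgradient used in slot $t$, $\eta$ the step size, $D$ an upper bound on the distance between any two feasible configurations, and $G$ an upper bound on the subgradient norm.

```latex
% Assumed notation; a sketch of the standard OGD regret argument.
\begin{align}
  % Non-expansiveness of the Euclidean projection:
  \|y_{t+1} - y^{\star}\|^2
    &\le \|y_t - \eta g_t - y^{\star}\|^2
     = \|y_t - y^{\star}\|^2 - 2\eta\, g_t^{\top}(y_t - y^{\star}) + \eta^2 \|g_t\|^2 . \\
  % Telescoping over t = 1..T, with \|y_1 - y^\star\| \le D and \|g_t\| \le G:
  \sum_{t=1}^{T} g_t^{\top}(y_t - y^{\star})
    &\le \frac{D^2}{2\eta} + \frac{\eta\, G^2\, T}{2} . \\
  % Convexity: f_t(y_t) - f_t(y^\star) \le g_t^\top (y_t - y^\star), hence
  \sum_{t=1}^{T} \big( f_t(y_t) - f_t(y^{\star}) \big)
    &\le \frac{D^2}{2\eta} + \frac{\eta\, G^2\, T}{2}
     \;=\; D\, G \sqrt{T}
     \quad \text{for } \eta = \frac{D}{G\sqrt{T}} .
\end{align}
```

The bound grows as $\sqrt{T}$, so the average regret per slot vanishes as the horizon increases.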
IV-D Dynamic Network Costs
The above model and analysis can be readily extended for the case where users change positions in different slots; or the user population evolves with time; or, finally, the users are static but the channel conditions vary. In particular, these scenarios can be captured by the updated cost function:
where we have replaced the previously constant link costs with slot-specific ones . For instance, if link exists in slot but not in slot (e.g., nodes have moved farther), then we can use which will make this link non-eligible for DOCP (the BS is available and cheaper). It is interesting to note that this extension does not change the regret bound, which is set by the highest cost of the available links, that remains the one between any device and the BS.
V Numerical Results
We illustrate the performance of DOCP in a setting with files, and devices which are equipped with a cache of capacity and are placed randomly in a cell of size km. Devices can communicate if they are within a range of m, as in current LTE-direct standards . The (relative) cost of downloading a file from the base station is set to ; a device can fetch a file from its cache at no cost; and the respective costs from other devices vary with the distance: if the device is closer than m, if the distance is within , if in , and if in . File requests are drawn from a power law distribution with exponent . We compare DOCP with the best static policy in hindsight and the lazy LRU and mLRU.
Our results are presented in Fig. 2, which shows the empirical average of the cost for the different policies. We observe that DOCP outperforms both LRU and mLRU, and the margin gets wider as time progresses. In addition, the performance of DOCP gets closer to that of the best policy in hindsight, which is consistent with the no-regret theoretical guarantee. Figure 3 compares the total cache allocation, i.e., the total fraction of each file cached at the devices, for DOCP and the best static hindsight policy. We see that, while the algorithm starts from an almost uniform allocation, by the end of the time interval the DOCP cache contents are closely aligned with the best configuration in hindsight. This demonstrates that DOCP indeed tends to learn the best static configuration.
D2D cooperative caching is certainly very promising, but raises previously unseen challenges in devising effective caching policies. Here, we used OCO, a fast-developing area of machine learning, to design an online distributed caching and routing policy that adapts to any (unknown) spatio-temporal request process. This makes it an ideal candidate for such dynamic, often sparse, caching networks. Our work opens a new exciting area at the nexus of online learning and D2D caching systems, and a fascinating next step is to explore how such mechanisms can incorporate incentives for ensuring users’ cooperation, leveraging credit mechanisms  or the human tendency to build reciprocal sharing relationships .
[1] Cisco visual networking index: Global mobile data traffic forecast update 2017–2022. White Paper, 2015.
[2] K. Avrachenkov, J. Goseling, and B. Serbetci. A low-complexity approach to distributed cooperative caching with geographic constraints. Proc. of ACM Meas. Anal. Computing Systems, 1(1):1–827, 2017.
[3] E. Baştuğ et al. A transfer learning approach for cache-enabled wireless networks. In Proc. of WiOpt, May 2015.
[4] B. N. Bharath, K. G. Nagananda, and H. V. Poor. A learning-based approach to caching in heterogenous small cell networks. IEEE Trans. on Communications, 64(4), 2016.
[5] B. Blaszczyszyn and A. Giovanidis. Optimal geographic caching in cellular networks. arXiv:1409.7626, 2014.
[6] B. Chen and C. Yang. Caching policy for cache-enabled D2D communications by learning user preference. IEEE Trans. on Communications, 66(12), 2018.
[7] C. Fricker, P. Robert, and J. Roberts. A versatile and accurate approximation for LRU cache performance. In ITC, 2012.
[8] A. Giovanidis and A. Avranas. Spatial multi-LRU: Distributed caching for wireless networks with coverage overlaps. arXiv:1612.04363, 2016.
[9] N. Golrezaei et al. Femtocaching and device-to-device collaboration: A new architecture for wireless video distribution. IEEE Communications Magazine, 51(4):142–149, April 2013.
[10] M. Haus, M. Waqas, A. Y. Ding, Y. Li, S. Tarkoma, and J. Ott. Security and privacy in device-to-device (D2D) communication: A review. IEEE Communications Surveys Tutorials, 19(2):1054–1079, 2017.
[11] S. Ioannidis, L. Massoulie, and A. Chaintreau. Distributed caching over heterogeneous mobile networks. In ACM SIGMETRICS.
[12] G. Iosifidis et al. Efficient and fair collaborative mobile internet access. IEEE/ACM Trans. on Networking, 25(3):20–27, 2017.
[13] W. Jiang, G. Feng, S. Qin, T. Yum, and G. Cao. Multi-agent reinforcement learning for efficient content caching in mobile D2D networks. IEEE Trans. on Wireless Communications, to appear, 2019.
[14] E. Leonardi and G. Neglia. Implicit coordination of caches in small cell networks under unknown popularity profiles. IEEE JSAC, 36(6), 2018.
[15] S. Li et al. Trend-aware video caching through online learning. IEEE Trans. Multimedia, 18(12), 2016.
[16] L. Maggi et al. Adapting caching to audience retention rate. Comp. Comm., 116, 2018.
[17] R. L. Mattson et al. Evaluation techniques for storage hierarchies. In IBM Systems Journal, 1970.
[18] G. Paschos, A. Destounis, L. Vignieri, and G. Iosifidis. Learning to cache with no regret. In IEEE INFOCOM, 2019.
[19] G. Paschos et al. The role of caching in future communication systems and networks. IEEE JSAC, 36(6), 2018.
[20] G. S. Paschos et al. Wireless caching: Technical misconceptions and business barriers. IEEE Comm. Mag., 54(8), 2016.
[21] A. Sadeghi, F. Sheikholeslami, and G. B. Giannakis. Optimal and scalable caching for 5G using reinforcement learning of space-time popularities. IEEE J. on Sel. Areas in Sig. Proc., 12(1), 2018.
[22] S. Shalev-Shwartz. Online Learning and Online Convex Optimization. Now Publishers Inc., 2012.
[23] H. Shirado, G. Iosifidis, L. Tassiulas, and N. Christakis. Resource sharing in technologically-defined social networks. Nature Comm., 2019.
[24] S. O. Somuyiwa et al. A reinforcement-learning approach to proactive caching in wireless networks. IEEE JSAC, 36(6), 2018.
[25] M. Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In ICML, 2003.