I Introduction
The past decade has witnessed significant growth in mobile traffic. Such growth puts tremendous pressure on the paradigm of cloud-based service provisioning, since moving a large volume of data into and out of the cloud wirelessly requires substantial spectrum resources and may incur large latency. Mobile Edge Computing (MEC) has emerged as a new paradigm to alleviate the capacity concern of mobile networks [1]. Residing on the network edge, MEC makes abundant storage and computing resources available to mobile users through low-latency wireless connections, facilitating a number of mobile services such as local content caching, augmented reality, and cognitive assistance [2].
Among these services, content caching at the network edge is garnering much attention [3]-[17]. In particular, with the prevalence of social media, multimedia contents spread among mobile users in a viral fashion, putting high pressure on the network backhaul [9, 11]. It has been pointed out that caching contents on the network edge can reduce backhaul traffic considerably [2]. Unfortunately, compared with the increasing content volume, the storage size at an edge node (EN) is always limited, so it is impossible to cache all the contents locally. Hence, identifying the optimal set of contents that maximizes cache utilization becomes crucial.
Content popularity is an effective measure for making caching decisions, and extensive work has been devoted to popularity-based content caching. According to the features of the content popularity profile, existing works on content caching can be classified into three categories: 1) known popularity profile [11]-[13]; 2) fixed but unknown popularity profile [16, 17]; and 3) time-varying and unknown popularity profile [19, 20]. In the case of a fixed and unknown popularity profile, learning algorithms have been proposed under different network settings. In the case of a time-varying and unknown popularity profile, context information of the request, including system states and user characteristics, is exploited to make content hit rate predictions. To improve the accuracy of popularity prediction, the context space needs to be carefully designed, since there is endless context information that could be taken into consideration. It is often difficult to directly identify the factors that influence content popularity. More importantly, using user information for context differentiation is subject to privacy regulations and may not be applicable in practice.
In this paper, we investigate mobile edge caching with a time-varying and unknown popularity profile. Instead of relying on user information for context differentiation, we explore location features of each EN to improve the accuracy of popularity prediction, with the rationale outlined as follows. First, locations can be divided into categories with distinct social functions, such as residential areas and business districts. Meanwhile, users in different places have diverse interests [22]. As indicated by real-world measurement studies [35], the distributions of content popularity at even adjacent WiFi APs and cellular base stations are different, and existing content caching schemes do not take such fine-grained popularity differences into consideration [36]. To further improve content distribution in the mobile context, it is crucial to investigate content popularity with location awareness.
Given that there is no established model to characterize location features and user demands, we take some initial steps to devise a model where user demand for a certain content is treated as a linear combination of content features and location characteristics with unknown noise. It follows that the popularity prediction problem boils down to the estimation of the location feature vector of each EN in the presence of noise. In practice, the noise process is affected by various factors. First, it is affected by location-dependent factors, such as user interests, the number of users, and the social function of the coverage area of each EN. Second, it is also affected by content-dependent factors, which include genre, length, and frame quality for video contents. Unfortunately, it is often difficult for content providers and edge servers to understand the statistical nature of the underlying noise process in such a complicated context space. To solve the location feature estimation problem, two online prediction algorithms are proposed for different scenarios.
To start with, we consider the tractable zero-mean noise scenario. A ridge regression based prediction algorithm (RPUC) is proposed to estimate the location feature vector. To account for the impact of noise, a positive perturbation is added to the result as a correction of the prediction. By comparison with the hindsight optimal caching policy, theoretical analysis shows that the RPUC algorithm achieves sublinear regret, i.e., it asymptotically approaches the optimal strategy in the long term.
Furthermore, we consider practical cases where the noise structure is unknown a priori. To ensure robust prediction, we resort to the H∞ filter technique, which enables us to obtain guaranteed accuracy even in the worst-case scenario. In particular, taking a prescribed accuracy threshold as an input, we propose an H∞ based prediction algorithm (HPDT), which is robust as long as the noise amplitude is finite.
Both RPUC and HPDT require no training phases, and hence are adaptive to time-varying user demand. Numerical analysis indicates that the regret of RPUC originates from the bias and variance of ridge regression, as well as the artificial perturbation. Note that the HPDT algorithm is conservative in that it makes no assumption on the noise. Yet, it is still able to make unbiased estimation of the location feature vector. Extensive simulations on real-world traces demonstrate that the two algorithms can be applied to scenarios with different noise features, and both are able to make adaptive caching decisions, achieving a content hit rate comparable to that of the hindsight optimal strategy.
The contributions of this work on mobile edge caching are threefold:

We propose to exploit the diversity of content popularity over different locations. We establish a linear model for content popularity prediction, taking into account both content and location features.

We develop two popularity prediction algorithms that deal with different noise models. Both algorithms are able to make location-aware caching decisions. Moreover, they require no training phases, and hence can adapt to dynamic user demand.

We demonstrate the effectiveness of the proposed algorithms through theoretical analysis. It is proved that the performance of the RPUC algorithm asymptotically approaches that of the hindsight optimal strategy, while the performance of the HPDT algorithm hinges upon the noise. Experiments on a real dataset crawled from YouTube show that the long-term content hit rates of the proposed algorithms are comparable to that of the hindsight optimal strategy.
The remainder of the paper is organized as follows. Section II reviews related works on content caching in wireless networks. Section III describes the system model, including the mobile edge caching architecture and the formal problem formulation. In Section IV, we propose the RPUC caching algorithm for the case of zero-mean noise, and give the detailed performance analysis. For the case of an unknown noise model, we present the HPDT algorithm as well as a detailed regret analysis in Section V. Numerical analysis and experimental results of the two algorithms are provided in Section VI, followed by concluding remarks in Section VII.
II Related Work
Mobile users' capacity is greatly augmented in the era of MEC. As a result, mobile service provisioning is expected to deliver further improved quality of experience (QoE) [2]. To this end, various mobile edge architectures have been proposed. Tandon et al. proposed to deploy edge resources within radio access networks. They characterized the relationship between latency and caching size, as well as between latency and fronthaul capacity, from an information-theoretic perspective [3]. Yang et al. introduced an edge resource provisioning architecture based on the cloud radio access network (C-RAN), and devised a cloud-edge interoperation scheme via software defined networking techniques [4]. Tong et al. designed a hierarchical edge architecture, aiming at making efficient use of edge resources when serving the peak loads from mobile users [5]. As the 5G wireless network is expected to incorporate diverse access technologies, in this paper we consider edge caching in the context of heterogeneous networks. Potential ENs include capacity-augmented base stations, WiFi access points, and other devices with excess resources.
As an effective approach to improving QoE in 5G systems, edge caching has received extensive attention [6]. Specifically, many works address video content caching, since video contents are forecast to be dominant in 5G systems [7]-[10]. A large body of other work focuses on generalized content caching. Zhang et al. investigated cache-enabled vehicular networks with energy harvesting, aiming at minimizing network deployment costs with QoE guarantees [12]. Ao et al. explored distributed content caching and small cell cooperation to accelerate content delivery [13]. Device-to-device (D2D) communication is another promising solution to improve the QoE of mobile content dissemination [14]. Different from conventional content unicast from cellular base stations, D2D communication has the potential to significantly boost system throughput by multicasting. Ji et al. provided a comprehensive summary of D2D caching networks, covering throughput scaling laws and coded caching in D2D networks [15].
In the above works, the content popularity profile was assumed to be completely known. In practice, however, content popularity may be unknown a priori. To address this issue, various learning-based approaches have been proposed to predict content popularity. Bharath et al. proposed a learning method that achieves the desired popularity accuracy in finite training time [16]. Blasco et al. modeled content caching with unknown popularity as a multi-armed bandit problem [17]. By carefully balancing exploration and exploitation in the learning phase, they proposed three algorithms that quickly learn content popularity under various system settings.
Unfortunately, the content popularity profile is often not only unknown a priori but also time-varying. This is because users' interests change constantly, and meanwhile new contents are being created [22]. As a result, learning-based caching algorithms should be designed in an online fashion, i.e., requiring no training phase and adapting to popularity fluctuations. To this end, Roy et al. proposed to predict video popularity by utilizing knowledge from social streams [18]. Müller et al. introduced context-aware proactive caching [19]. By constructing a context space based on user information, they proposed an online algorithm that first learns context-specific user demands, and then updates cached contents accordingly. Other information has also been used for context differentiation, such as content features and system states [20]. The prediction accuracy of those solutions is highly dependent on the information used for context differentiation. To content service providers, however, user information is extremely sensitive and often unavailable. In addition, it is often impossible for them to get detailed system or network information when making caching decisions.
In this paper, we exploit locational features for context differentiation. Locational information can be easily obtained; for example, users attached to different ENs are naturally divided into geographical groups. On this basis, we investigate the location-aware caching problem with an unknown and time-varying content popularity profile. By modeling user demand as a linear combination of location features and content attributes, our previous work addressed the content popularity prediction problem under the assumption that the model noise is zero-mean [23]. As an extension, this paper additionally considers the practical scenario where the noise structure is unknown a priori. Specifically, a robust prediction algorithm is proposed with a detailed theoretical analysis of its caching performance. The proposed algorithm is robust and practical, as it guarantees prediction accuracy regardless of the noise statistics. Additionally, numerical analysis and comparison of the root causes of the estimation errors of both algorithms are presented. More extensive experiments are conducted to validate the performance of the proposed algorithms.
It is worth noting that, in the mobile context, fetching content from the edge cache significantly reduces the delay compared with fetching it from a conventional content distribution network (CDN). Moreover, existing content pushing strategies in CDNs do not consider the fine-grained popularity differentiation among neighbouring WiFi APs and cellular base stations [35]. With the consideration of location awareness, this paper further models and predicts the dynamics of content popularity, which is constantly varying with time.
III System Model and Problem Formulation
In this section, we present the system model and formulate the caching problem in mobile edge networks.
III-A Network Model
Mobile edge computing can enhance mobile users' capacity by provisioning storage, computing, and networking resources in their proximity. Capacity-augmented base stations, WiFi access points, and other devices with excess capacity can be exploited for edge node deployment [1]. In this paper, the storage resources at edge nodes are harnessed for content caching services. Specifically, as shown in Fig. 1, a set of edge nodes is deployed with separate backhaul links connecting to the mobile core network. Online contents are dynamically pushed to edge nodes so that users' content requests can be processed with reduced latency. Each edge node serves a disjoint set of mobile users.
III-B Content Popularity and Location Diversity
A simple yet effective caching strategy is to push the most popular contents to the network edge. In this way, the local content hit rate is maximized and users' requests are served with reduced latency and improved QoE. Extensive work has been done on the popularity of contents, especially video files [9, 22, 25]. According to the statistics we crawled from YouTube, as illustrated in Fig. 2, the popularity profile of a video file varies in two ways. 1) The daily view amount is time-varying. 2) As other videos' daily view amounts are also varying and new videos are uploaded, the popularity of a video file is constantly fluctuating [21]. Moreover, location diversity also affects content popularity. As a result, general caching strategies based on a fixed popularity profile are not optimal in practice.
Let be a dimensional attribute vector of file associated with EN . For example, the attributes of video contents may include video quality, genre, length, and historical view statistics. Throughout, hit rate is defined as the number of content requests rather than a ratio. Then, the hit rate of file at EN , denoted by , can be expressed as the following noisy linear combination
(1) 
where is the unknown location feature vector associated with EN . It represents the location characteristics of EN and is time-invariant. is the random noise associated with EN , which may be affected by various locational features, including the social function of the area around EN , the number of users served by EN , and the frequency of content update (e.g., hourly or daily). As a result, contents with the same attribute vector are expected to have different view amounts at different ENs. This linear prediction model is widely used in other areas, such as signal processing and financial engineering [26]. It provides a method to predict the future hit rate, and it is essential when exploiting location diversity for content caching under unknown popularity. Without loss of generality, let , and for all and , where denotes the Euclidean norm of , , and are positive constants. Also, for notational simplicity, define as the weighted (by a matrix ) Euclidean norm of .
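The noisy linear model above can be illustrated with a small simulation. This is only a sketch under stated assumptions: the function name `simulate_hit_rates`, the toy attribute and location vectors, the Gaussian noise, and the simplification that all ENs see the same attribute vector for a file are illustrative choices, not taken from the paper.

```python
import numpy as np

def simulate_hit_rates(attributes, location_features, noise_std, rng):
    """Return a (num_ens, num_files) array of hit rates under a noisy linear model.

    attributes:        (num_files, d) content attribute vectors (assumed shared by all ENs).
    location_features: (num_ens, d) one hypothetical location feature vector per EN.
    """
    mean = location_features @ attributes.T                # noiseless linear part
    noise = rng.normal(0.0, noise_std, size=mean.shape)    # zero-mean Gaussian noise (assumption)
    return mean + noise

rng = np.random.default_rng(0)
attributes = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])   # 3 files, d = 2
location_features = np.array([[10.0, 2.0], [2.0, 10.0]])      # 2 ENs with different profiles

mean_hits = simulate_hit_rates(attributes, location_features, noise_std=0.0, rng=rng)
noisy_hits = simulate_hit_rates(attributes, location_features, noise_std=1.0, rng=rng)
```

In this toy setting the same file has different expected demand at the two ENs, which is the location diversity the model is meant to capture.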
III-C Problem Formulation
Consider a set of files that can be cached at ENs, and let be the caching size of each EN. We assume that all the contents are of equal size and the size is normalized to 1, i.e., each EN can cache up to contents. (In case contents are of different sizes, they are split into smaller ones of equal size. For example, the widely used DASH (Dynamic Adaptive Streaming over HTTP) protocol breaks contents into small segments before transmission. This assumption is used to simplify the theoretical analysis, and a similar assumption has been made in [20, 24]. Location-aware edge caching with different content sizes deserves further investigation.) As indicated by Fig. 2, content popularity is time-varying. Therefore, contents with higher popularity should be proactively identified and cached at the ENs, and the less popular ones should be evicted so as to improve the local content hit rate. Consider a sequence of time slots , and let denote the set of contents cached at EN during time slot , and be the amount of user demand on file at EN during time slot . The objective of a caching policy is to maximize the time-averaged hit rate. Formally, assuming without loss of generality that the underlying process is ergodic, it can be formulated as the following time-averaged hit rate maximization (THRM) problem:
(2) 
As the amount of user demand, i.e., the hit rate of contents at each EN, is unknown a priori, problem (2) cannot be solved directly. For convenience, denoting the optimal caching strategy for EN at time , we have
(3) 
Define the time-averaged caching regret of a solution with respect to the optimal caching strategy as
(4) 
Then, the THRM problem can be reformulated as a time-averaged regret minimization (TRM) problem:
(5) 
Given that the optimal set is unknown a priori, our goal is to develop a caching policy that constantly makes good estimation of the optimal set , and therefore minimizes the time-averaged caching regret. As indicated by Eq. (3), the TRM problem boils down to estimating user demands of different contents at each EN. Given the linear model in Eq. (1), if the location feature vector can be found in the presence of noise, we can make an accurate prediction of user demand. Unfortunately, there is no established statistical model of the noise processes that impinge on the prediction of user demand. In what follows, we propose two online content popularity prediction algorithms by making dynamic estimations of the location feature vectors for different noise processes. In particular, the first algorithm achieves near-optimal performance under the assumption that the model noise is zero-mean, while the second algorithm is designed to provide robust performance guarantees in the case of unknown noise statistics.
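To make the regret notion concrete, the following sketch computes a time-averaged caching regret for a single EN from realized demand. It is a simplified illustration: the helper names `top_c` and `time_averaged_regret` are hypothetical, the hindsight-optimal set is taken to be the files with the largest realized demand in each slot, and the sum over ENs and any expectation are omitted.

```python
import numpy as np

def top_c(scores, c):
    """Indices of the c largest scores (ties broken arbitrarily)."""
    return np.argsort(scores)[::-1][:c]

def time_averaged_regret(demand, cached, c):
    """Average per-slot gap between the hindsight-optimal and the achieved hits.

    demand: (T, F) realized demand of F files over T slots at one EN.
    cached: sequence of T index lists, the files cached in each slot.
    """
    T = demand.shape[0]
    total = 0.0
    for t in range(T):
        best = demand[t, top_c(demand[t], c)].sum()   # hits of the hindsight-optimal set
        got = demand[t, cached[t]].sum()              # hits of the set actually cached
        total += best - got
    return total / T

demand = np.array([[5.0, 3.0, 1.0],
                   [1.0, 3.0, 5.0]])
static_policy = [[0], [0]]        # always cache file 0
optimal_policy = [[0], [2]]       # cache the most requested file in each slot

regret_static = time_averaged_regret(demand, static_policy, c=1)
regret_optimal = time_averaged_regret(demand, optimal_policy, c=1)
```

A policy that tracks the changing popularity incurs zero regret here, while the static policy pays for the slot in which file 0 is no longer the most popular.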
IV Ridge Regression based Content Popularity Prediction and Edge Caching
In this section, as the first attack on the TRM problem, we present a caching algorithm for the case where the noise is zero-mean.
IV-A Location Feature Vector Estimation
When the noise is zero-mean, according to Eq. (1), we have
(6) 
It can be interpreted that, at time slot , given the attribute vector , the hit rate of file at EN is predicted to be the linear combination of its attributes, which provides a feasible way to predict the content hit rate. Since the location feature vector of EN is time-invariant, a good estimation of will lead to accurate prediction of the content hit rate.
Let the attribute matrix be the historical data up to time slot , where is the frequency of file being cached at EN up to time slot , and the th row of is the corresponding attribute vector . Denote by the time empirical hit rate of file at EN . By applying the standard ordinary least squares linear regression, i.e., , we can obtain the unique solution , which is unbiased. However, when there are correlated variables in the attribute vector, the matrix may not be invertible. As a result, the estimated can be poorly determined and will exhibit high variance.
In contrast to the unbiased estimation, ridge regression makes a biased estimation by adding a control parameter that "penalizes" the magnitude of the estimated , which helps to improve estimation stability. Specifically, ridge regression aims at minimizing a penalized sum
(7) 
where controls the size of : the larger the value of , the greater the shrinkage of the magnitude of [27]. Consequently, the estimation of can be explicitly given as
(8) 
where is the identity matrix. The accuracy of the estimation depends on the amount of data and the selection of . For convenience, let for all and . The following lemma, slightly adapted from [31], gives an upper bound on the estimation error of ridge regression.
Lemma 1.
If for all , then , the estimation error of ridge regression can be upper bounded as
(9) 
with probability at least .
Please refer to Appendix A for the proof. The probabilistic upper bound on the estimation error provided in Lemma 1 indicates that the true hit rate falls into the confidence interval around the estimation with high probability. The right-hand side of Eq. (9) gives the length of the confidence interval, which is crucial to the following content popularity prediction and caching algorithm.
IV-B RPUC Caching Algorithm
The location-aware edge caching algorithm is sketched in Algorithm 1. After initialization, the algorithm iteratively performs the following three phases.

Predict: During each time slot , the location feature vector is first updated according to the demand information observed in time slot . Then, based on the linear prediction model, the estimated demand is obtained. Considering the impact of random noise, a perturbation is added to the estimation, i.e., the ultimate hit rate is predicted to be
(10)
where the perturbation is given by
(11)
and .

Optimize and cache: Based on the predicted hit rate of each content, a set of contents that maximizes the content hit rate at EN during time slot is identified and cached respectively. Note that, certain contents may be cached in multiple ENs simultaneously.

Observe and update: At the end of time slot , the empirical hit rate information of cached files on each EN is recorded, which is then used to update the parameter matrices for subsequent estimation and prediction.
The rationale of the perturbation is that Eq. (6) only gives a mean value of the hit rate, which omits the potential random fluctuation, while Lemma 1 provides a probabilistic upper bound on the demand estimation error. The perturbation given in Eq. (11) is in line with the right-hand side of Eq. (9) and can be regarded as optimism in the face of uncertainty, or equivalently, the upper confidence bound of the demand estimation. By adding a perturbation according to Eq. (11), we have . According to Lemma 1, the upper bound holds with probability at least , which approximates to rapidly as increases.
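The predict, cache, and update phases described above can be sketched as follows for a single EN. This is an illustrative approximation, not the paper's Algorithm 1: the class name `RidgeUCBCache` is hypothetical, the ridge estimate uses the standard closed form (X^T X + lam I)^{-1} X^T y, and the perturbation uses the common upper-confidence form alpha * sqrt(x^T A^{-1} x) with a constant, assumed weight `alpha` in place of the time-varying expression of Eq. (11).

```python
import numpy as np

class RidgeUCBCache:
    """Ridge-regression demand estimate plus an upper-confidence perturbation."""

    def __init__(self, dim, cache_size, lam=1.0, alpha=1.0):
        self.cache_size = cache_size
        self.alpha = alpha                 # hypothetical constant perturbation weight
        self.A = lam * np.eye(dim)         # accumulates X^T X + lam * I
        self.b = np.zeros(dim)             # accumulates X^T y

    def theta_hat(self):
        # Closed-form ridge estimate of the location feature vector.
        return np.linalg.solve(self.A, self.b)

    def predict(self, X):
        # Estimated demand plus the confidence width sqrt(x^T A^{-1} x) per file.
        A_inv = np.linalg.inv(self.A)
        width = np.sqrt(np.einsum("ij,jk,ik->i", X, A_inv, X))
        return X @ self.theta_hat() + self.alpha * width

    def select(self, X):
        # Cache the files with the largest optimistic predictions.
        return np.argsort(self.predict(X))[::-1][: self.cache_size]

    def update(self, X, cached, observed):
        # Only cached files yield demand observations.
        for f, y in zip(cached, observed):
            self.A += np.outer(X[f], X[f])
            self.b += y * X[f]

theta_true = np.array([3.0, 1.0])                     # toy location feature vector
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])    # attribute vectors of 3 files
demand = X @ theta_true                               # noiseless demand for the demo

agent = RidgeUCBCache(dim=2, cache_size=1)
for _ in range(50):
    cached = agent.select(X)
    agent.update(X, cached, demand[cached])
final_choice = agent.select(X)
```

Because the width term shrinks for attribute directions that have been observed often, the optimistic prediction gradually approaches the plain ridge estimate for frequently cached files.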
IV-C Regret Analysis
The content hit rate of the RPUC algorithm highly depends on the accuracy of prediction. This subsection gives a theoretical upper bound on its timeaveraged caching regret .
In mobile edge caching, let be the caching size of each EN, and be the size of the ground file set. Given that the content hit rate satisfies the linear model, and that content attribute vectors and user demands are bounded by and for all , and , we have the following theorem.
Theorem 1.
If the noise is zero-mean, the RPUC algorithm achieves near-optimal performance, i.e., the time-averaged regret is of order , and when .
The proof has been relegated to Appendix B. Basically, the root cause of the regret is twofold: the estimation error and the perturbation. In particular, the estimation error consists of the linear model error and the intended bias incurred by in ridge regression. The perturbation term is well managed by the time-varying control parameter . Theorem 1 indicates that under the RPUC algorithm, the content hit rate asymptotically approaches that of the optimal caching policy in the long term.
V Robust Content Popularity Prediction and Edge Caching
In the previous section, we proposed the online caching algorithm RPUC based on the linear prediction model given in Eq. (6). A biased estimation of the location feature vector for each EN is obtained by ridge regression. Further, a perturbation is added to the estimation of the content hit rate to account for uncertainty. However, this caching algorithm would not work well when the noise is not zero-mean. Even worse, it is likely that we are unable to get detailed noise statistics, as the noise is affected by various location features, such as population, social function, or even weather conditions. Therefore, a robust prediction algorithm that can handle noise uncertainty is desirable. In this section, by resorting to the H∞ filter technique, we propose a popularity prediction algorithm that provides guaranteed accuracy in the case of unknown noise structures.
V-A Noisy Model for Content Popularity
Denote by the additive noise added to the linear model. The linear model is rewritten as
(12) 
If the noise process follows a white Gaussian distribution and its mean and correlation are always known, the Kalman filtering technique can be applied to estimate , which achieves the smallest possible standard deviation of the estimation error [28]-[30]. Since there is no established model of the statistics of the noise structure, robust estimators of the location feature vector that can tolerate noise uncertainty are needed. Next, we introduce the H∞ filtering technique for location feature vector estimation, which requires no a priori information on the noise process. The only assumption is that the magnitude of the noise process is finite, which holds since the total demand is always finite in reality.
V-B An H∞ Filter Approach
When locational features are time-invariant, the true location feature vector remains the same across the time span. For notational simplicity, in this subsection, we focus on a specific content on a certain EN and neglect indices and . Based on Eq. (12), the location feature vector estimation problem is reformulated as
(13) 
Unlike the Kalman filter, we aim at providing a uniformly small estimation error, , for any form of noise process. Notice that estimation of the location feature vector is crucial to minimizing the estimation error . We therefore define the following cost function of estimation [28]:
(14) 
where is a symmetric positive definite matrix reflecting the confidence in the a priori knowledge of the initial state. A 'smaller' choice of indicates larger uncertainty of the initial condition and vice versa. The objective is to make a sequence of estimations of such that the above cost is minimized. The denominator of the cost function can be regarded as a combined norm of all possible initial states and noises affecting the system. Given that there is no established stochastic model for , the cost function in Eq. (14) allows us to make a robust estimation of content popularity from a game-theoretic perspective. Suppose that there is an unrestricted adversary, who can control the initial state and the magnitude of to maximize the error of our estimation. While we focus on minimizing the numerator of Eq. (14), the adversary may incur an infinite magnitude of disturbances. The form of prevents the adversary from using brute force to maximize . Instead, the adversary needs to carefully choose and as it tries to maximize . Formally, this game can be generalized as a min-max problem, and the optimal estimation of achieves the minimal cost as
(15) 
Given the cost function in Eq. (14), directly minimizing is challenging. In practice, a better approach for H∞ filtering is to seek a suboptimal estimation that meets a given threshold. Specifically, one can try to find such that the optimal estimate of among all possible (even including the worst-case performance measure) satisfies
(16) 
where is the prescribed performance bound. Eq. (16) indicates that the H∞ filter guarantees the smallest estimation error over all possible finite disturbances of the noise magnitude [28]. Rearranging Eq. (16) gives the following equivalent min-max problem
(17)  
This min-max problem can be interpreted as a zero-sum game against the adversary. With a given , our goal is to find an estimation that wins the game (i.e., achieves a negative cost ). By resorting to the Lagrange multiplier method for the dynamic constrained optimization problem (17), the H∞ filter approach results in the following iterative algorithm to find the optimal estimates for all :
(18) 
where is initialized as , is the filter gain, given by
(19) 
and
(20) 
with initialized by . The detailed proof of this solution can be found in [30]. With the aid of the H∞ filter, we are able to make a performance-guaranteed estimation of the location feature vector , regardless of the detailed noise structure. Based on the estimation of the location feature vector, a content popularity prediction and caching algorithm can be further devised.
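As a rough illustration of such an iterative estimator, the sketch below implements one step of the standard discrete-time H∞ filter recursion specialized to a static state. The specialization is an assumption made here, not necessarily the exact form of Eqs. (18) to (20): the state does not change over time, there is no process noise, the measurement and estimation-error weights are identity, and the quantity being estimated is the feature vector itself. The names `hinf_update`, `P_inv`, and `gamma` are illustrative.

```python
import numpy as np

def hinf_update(theta_hat, P_inv, x, y, gamma):
    """One H-infinity filter step for a static state, written in information form.

    theta_hat: current estimate of the feature vector, shape (d,).
    P_inv:     inverse of the current (symmetric positive definite) matrix P.
    x, y:      attribute vector and the observed scalar demand.
    gamma:     prescribed performance threshold (assumed parameterization).
    """
    d = len(x)
    # Information-form recursion: P_next^{-1} = P^{-1} + x x^T - gamma^{-2} I.
    P_inv_next = P_inv + np.outer(x, x) - np.eye(d) / gamma**2
    if np.linalg.eigvalsh(P_inv_next).min() <= 0:
        raise ValueError("gamma too small: updated matrix is not positive definite")
    gain = np.linalg.solve(P_inv_next, x)                # filter gain P_next x
    theta_next = theta_hat + gain * (y - x @ theta_hat)  # correct by the innovation
    return theta_next, P_inv_next

theta_true = np.array([1.0, 2.0, 3.0])
theta_hat, P_inv = np.zeros(3), np.eye(3)
for x in np.eye(3):                                      # one observation per direction
    theta_hat, P_inv = hinf_update(theta_hat, P_inv, x, x @ theta_true, gamma=2.0)
```

Under these assumptions, letting `gamma` grow without bound removes the subtracted term and the update reduces to recursive least squares, which is a convenient sanity check for the recursion.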
V-C From Determining the Threshold to the HPDT Caching Algorithm
The performance of the H∞ filter highly depends on the prescribed threshold . A smaller results in a smaller estimation error. However, if is too small, in Eq. (20) may be singular, which renders the iterative solution infeasible. Hence, the value of should be carefully selected. An adaptive scheme for threshold selection was proposed in [28], which makes online iterative prediction possible. Denote by the threshold on the th iteration. It should be properly chosen to guarantee that is positive definite, i.e.,
(21) 
Denote by the eigenvalues of matrix , where is the th largest eigenvalue. According to the min-max theorem on matrix eigenvalues, the adaptive threshold should satisfy
(22)
for all . Since holds for all , equivalently, we have . We may let
(23) 
where is a constant very close to but larger than one, so that is guaranteed to be positive and hence is nonsingular. Meanwhile, the magnitude of is also suppressed.
With the aid of the H∞ filter technique, we are able to make performance-guaranteed estimations of the location feature vectors. Therefore, more precise predictions of content popularity can be made, and hence differentiated caching policies can be devised on each EN. The corresponding prediction and caching algorithm is sketched in Algorithm 2. Similar to Algorithm 1, the iteration of the HPDT algorithm can be generalized into the following three steps after initialization.

Predict: During time slot , estimation of the location feature vector of each EN is predicted based on the updated location feature vector by the end of time slot .

Optimize and cache: Based on the predicted content hit rate profile, the set of contents with maximized predicted content hit rate are cached on each EN respectively. Certain contents may be cached in multiple ENs simultaneously.

Observe and update: At the end of time slot , the empirical hit rate of the cached files on each EN is observed, which is then used to update the input of the filtering process, yielding the updated location feature vector.
The adaptive adjustment of in Eq. (23) is crucial to the online HPDT algorithm. It is tuned to its minimum at each iteration, so that is guaranteed to be positive definite, and meanwhile the upper bound of the cost function is minimized.
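The adaptive threshold idea can be sketched on top of a static-state H∞ recursion as follows. Everything specific here is an assumption for illustration: the helper names `adaptive_gamma` and `hpdt_step`, the parameterization in which the threshold equals `eta` divided by the square root of the smallest eigenvalue of the inverse matrix (with `eta` slightly larger than one), and the noiseless toy data that cycles through the coordinate directions. It is not a reproduction of Eq. (23) or Algorithm 2.

```python
import numpy as np

def adaptive_gamma(P_inv, eta=1.1):
    """Pick the threshold just large enough to keep the next matrix positive definite."""
    lam_min = np.linalg.eigvalsh(P_inv).min()
    return eta / np.sqrt(lam_min)          # so that gamma^{-2} = lam_min / eta^2 < lam_min

def hpdt_step(theta_hat, P_inv, x, y, eta=1.1):
    """One estimation step with an adaptively chosen threshold (static-state assumption)."""
    gamma = adaptive_gamma(P_inv, eta)
    P_inv_next = P_inv + np.outer(x, x) - np.eye(len(x)) / gamma**2
    gain = np.linalg.solve(P_inv_next, x)
    return theta_hat + gain * (y - x @ theta_hat), P_inv_next

theta_true = np.array([1.0, 2.0, 3.0])      # toy location feature vector
theta_hat, P_inv = np.zeros(3), np.eye(3)
for t in range(30):
    x = np.eye(3)[t % 3]                    # observe each attribute direction in turn
    y = x @ theta_true                      # noiseless demand for the demo
    theta_hat, P_inv = hpdt_step(theta_hat, P_inv, x, y)

# Predict demand for candidate files and cache the top two.
X_files = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [1.0, 1.0, 1.0]])
predicted = X_files @ theta_hat
cached = np.argsort(predicted)[::-1][:2]
```

Since subtracting less than the smallest eigenvalue cannot make the matrix indefinite, and adding the outer product of the attribute vector can only raise eigenvalues, the updated matrix stays positive definite at every step in this sketch.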
V-D Regret Analysis
The H∞ filter technique provides a robust estimation of the location feature vector regardless of the statistical model of the noise process. However, this approach is also conservative, since it needs to accommodate the disturbances of all kinds of noise processes. In this subsection, the performance bound of the H∞ filter based prediction and caching algorithm is given.
Since the prescribed performance threshold is crucial to the prediction accuracy, the adaptive threshold in the HPDT algorithm is first characterized in this subsection. Note that is initialized as a symmetric and positive definite matrix, and is also guaranteed to be symmetric and positive definite with the help of . According to Weyl's monotonicity theorem [33], the smallest eigenvalue of can be bounded as:
(24) 
where is the largest eigenvalue of , and can be simply bounded by the matrix trace as . As is positive definite, we have
(25) 
where is very close to but smaller than one. According to Eq. (20), the smallest eigenvalue of equals , which is suppressed to be small but positive. Hence, is guaranteed to be nonsingular. Iteratively, is positive definite for all .
In the following, a theorem is given to provide a bound on the caching regret of the HPDT algorithm. Let be the caching size of each EN, and be the cardinality of the ground file set. Suppose content hit rate satisfies the linear model given by Eq. (13), and note that the attribute vectors are bounded as for all , and . Let and be the upper bound of and for all and , respectively. Then, we have the following theorem.
Theorem 2.
The time-averaged regret of the HPDT algorithm is of order .
Please refer to Appendix C for the proof. Theorem 2 indicates that, if the linear model is free of noise, the time-averaged regret of the HPDT algorithm tends to zero as grows to infinity. Otherwise, the HPDT algorithm may not approach the optimal solution, and its performance depends on the noise magnitude. This is a characteristic inherited from the H∞ filter. Since the H∞ filter makes no assumption on the noise, it needs to accommodate all possible noise processes in order to minimize the worst-case estimation error, which turns out to be over-conservative. However, when the noise is zero-mean, the regret of the HPDT algorithm reduces to the order of , which is smaller than that of the RPUC algorithm. In the next section, we further evaluate the two proposed algorithms by numerically decomposing the estimation errors, and examine them by experiments on a real dataset.
VI Numerical Analysis and Experimental Results
To validate the performance of the proposed caching algorithms, numerical analysis is first performed in this section. Afterwards, an experiment based on a real-world dataset from YouTube is conducted to further illustrate the performance of the proposed algorithms in practical scenarios.
VI-A Numerical Analysis
The proposed two algorithms can be used for content caching with different user demand features. In essence, they are estimating the location feature vector , which specifies the location characteristics and user preferences on each EN. Given the linear model of user demand, the performance of the proposed algorithms highly depends on the accuracy of the estimation. We use mean square error (MSE) to evaluate the accuracy of the proposed algorithms. For notational simplicity, the indices and are omitted in this section. Let be the underlying feature vector, and be the estimation of by using the proposed algorithms. Denote , then, the MSE can be defined in Euclidean norm as
(26)  
The two terms in Eq. (26) turn out to be the variance and bias of , respectively.
Note that the RPUC algorithm is based on ridge regression. Unlike ordinary least-squares linear regression, which makes an unbiased estimation of the feature vector, ridge regression intentionally introduces bias so as to reduce the variance of the estimation. Moreover, the RPUC algorithm adds a perturbation to the ridge regression estimate to account for noise uncertainty, which further increases the bias.
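The trade-off described above can be checked numerically with a small Monte Carlo sketch. The setup is entirely hypothetical (a fixed Gaussian design `X`, an arbitrary underlying vector `theta`, unit-variance zero-mean noise, and a ridge parameter of 5); it only illustrates that the empirical MSE splits into a variance term and a squared-bias term, and that ridge regression trades a larger bias for a smaller variance compared with ordinary least squares.

```python
import numpy as np


def decompose_mse(estimates, theta):
    """Split the empirical MSE of estimates (n_trials, d) into variance and squared bias."""
    mean_est = estimates.mean(axis=0)
    variance = np.mean(np.sum((estimates - mean_est) ** 2, axis=1))
    bias_sq = np.sum((mean_est - theta) ** 2)
    mse = np.mean(np.sum((estimates - theta) ** 2, axis=1))
    return variance, bias_sq, mse


def ridge_estimate(X, y, lam):
    """Closed-form ridge regression; lam = 0 gives ordinary least squares."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)


rng = np.random.default_rng(1)
theta = np.array([1.0, -0.5, 0.8, 0.3, -1.2])   # hypothetical underlying vector
X = rng.normal(size=(20, 5))                    # fixed design with 20 samples
ols, ridge = [], []
for _ in range(2000):
    y = X @ theta + rng.normal(scale=1.0, size=20)   # zero-mean noise
    ols.append(ridge_estimate(X, y, 0.0))
    ridge.append(ridge_estimate(X, y, 5.0))
ols_var, ols_bias, ols_mse = decompose_mse(np.array(ols), theta)
ridge_var, ridge_bias, ridge_mse = decompose_mse(np.array(ridge), theta)
```

Whether the reduced variance outweighs the added bias depends on the ridge parameter, the sample size, and the noise level, so the sketch makes no claim about which estimator has the smaller MSE in general.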
In contrast, the HPDT algorithm makes no assumption on the statistical model of the underlying noise process. It is able to meet the prescribed performance threshold even if the noise process leads to the worst case. Moreover, the HPDT algorithm makes unbiased estimation on the feature vector. This can be observed from the definition of the cost function in Eq. (14). By decomposing the numerator of , we have
(27) 
As the H∞ filter makes robust estimation over all kinds of noise structures and initial conditions, given any , there exists a combination of initial condition and noise such that . Since this is true for all , according to Eq. (27), the cost function , which will grow linearly if . Consequently, any algorithm that bounds the cost function must be unbiased, i.e., .
On the other hand, the performance of the HPDT algorithm also depends on the a priori confidence on the estimation of the initial state, i.e., the selection of . A smaller matrix (eigenvalue) should be chosen if the estimation of initial condition is made with larger uncertainty, and vice versa.
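Fig. 3: Comparison of estimation errors of ridge regression and the HPDT algorithm under varying sample rate and noise structure. a) Zero-mean noise; b) Non-zero-mean noise of uniform distribution (with smaller mean value); c) Non-zero-mean noise of normal distribution (with larger mean value).

Table I: Estimation errors of ridge regression and the HPDT algorithm with different initialization matrices. Each group of three columns lists variance, bias, and MSE.

| Time Span | Ridge Regression: Var. | Bias | MSE | HPDT ( = 5): Var. | Bias | MSE | HPDT ( = 10): Var. | Bias | MSE | HPDT ( = 30): Var. | Bias | MSE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 20 | 0.1659 | 0.4284 | 0.5943 | 0.2881 | 0.1094 | 0.3975 | 0.2687 | 0.1045 | 0.3732 | 0.2647 | 0.1036 | 0.3683 |
| 50 | 0.1364 | 0.1364 | 0.2728 | 0.1315 | 0.0126 | 0.1441 | 0.1244 | 0.0116 | 0.1360 | 0.1228 | 0.0115 | 0.1343 |
| 200 | 0.0545 | 0.0927 | 0.1472 | 0.0335 | 0.0131 | 0.0466 | 0.0318 | 0.0129 | 0.0447 | 0.0314 | 0.0129 | 0.0444 |
| 500 | 0.0347 | 0.0381 | 0.0728 | 0.0134 | 0.0155 | 0.0289 | 0.0127 | 0.0154 | 0.0282 | 0.0126 | 0.0154 | 0.0280 |
| 1000 | 0.0211 | 0.0207 | 0.0418 | 0.0067 | 0.0164 | 0.0231 | 0.0064 | 0.0164 | 0.0227 | 0.0063 | 0.0164 | 0.0227 |
| 2000 | 0.0157 | 0.0070 | 0.0227 | 0.0034 | 0.0174 | 0.0207 | 0.0032 | 0.0174 | 0.0206 | 0.0032 | 0.0174 | 0.0205 |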
We conduct simulations on synthesized time sequences, which are generated according to a prescribed linear model with zero-mean noise and non-zero-mean noise, respectively. Since the proposed RPUC algorithm is perturbed intentionally, we only present the comparison between ridge regression and the HPDT algorithm. Fig. 3 shows that, under all scenarios, ridge regression provides a more stable estimation, as it achieves smaller variance than the HPDT algorithm when the amount of historical data is small. This stability advantage of ridge regression stems from its intentional penalty. However, the HPDT algorithm performs much better in terms of bias and MSE under varying sampling rates in all scenarios, which demonstrates the robustness of the HPDT algorithm. Moreover, as the noise magnitude increases, the gain of HPDT over ridge regression also increases.
To demonstrate the impact of on the performance of HPDT, estimation results of the HPDT algorithm with different initialization matrices are presented. is initialized as a diagonal matrix with positive elements on the main diagonal. When the confidence on the initial state is small, a larger with bigger eigenvalues is used. As shown in Table I, with larger eigenvalue achieves better performance than the others, which means that the initial guess is close to the prescribed vector. In practice, the matrix can be selected according to the prior information regarding the initial condition.
VI-B Real Dataset Experiment
VI-B1 Experiment Setup
To further demonstrate the advantages of the proposed algorithms in practical scenarios, we conduct an experiment on a dataset crawled from YouTube. On YouTube, some video owners made their video view statistics open to the public. Among other information, the view amount is recorded on a daily basis. To obtain such information, a Python-based crawling program is written, and the request record of each video is crawled into a .json file, based on which the remaining Python-based experiments are conducted. In total, videos are randomly crawled, which were uploaded before January 2013, with full view statistics till May 2017. The most popular video has been watched over billion times by the end of the time span, while the least popular one has rarely been viewed across the time span. Fig. 4 shows the statistics of video popularity skewness. The popularity of the most popular videos is highly skewed, and the most popular 50 videos account for almost of the total view amount.
Note that the dataset only contains the global statistics of each video record (with the recent update of the YouTube webpage, even the global view statistics are inaccessible), while the local-area view statistics of most online video content providers are unavailable. To emulate the video request processes at different locations, the original statistics of each video are shifted and scaled randomly over the time span. By shifting, the original global request record of each content is moved backward or forward on the time span. By scaling, each content request record is then randomly scaled up or down. After shifting and scaling, the original global request statistics are transformed and treated as the requests from different locations. The key point is that the pattern of the record remains valid after the above transformation. In this way, we are able to characterize the location diversity based on the emulated view statistics.
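The shifting and scaling transformation can be sketched as follows. The text does not specify the ranges of the random shift and scale, nor how the record is treated at the boundaries of the time span, so `max_shift`, `scale_range`, and the zero-padding at the boundaries are assumptions made for illustration; the function names are hypothetical.

```python
import numpy as np


def shift_and_scale(series, shift, scale):
    """Shift a view-count series by `shift` slots (zero-padded) and scale it."""
    series = np.asarray(series, dtype=float)
    shifted = np.zeros_like(series)
    if shift > 0:                       # move the record forward in time
        shifted[shift:] = series[:-shift]
    elif shift < 0:                     # move the record backward in time
        shifted[:shift] = series[-shift:]
    else:
        shifted[:] = series
    return scale * shifted


def emulate_locations(series, n_locations, max_shift, scale_range, rng):
    """Draw one random shift and scale per location; returns (n_locations, T)."""
    low, high = scale_range
    shifts = rng.integers(-max_shift, max_shift + 1, size=n_locations)
    scales = rng.uniform(low, high, size=n_locations)
    return np.stack([shift_and_scale(series, int(s), c)
                     for s, c in zip(shifts, scales)])
```

For instance, `emulate_locations(daily_views, 4, 30, (0.5, 2.0), np.random.default_rng(0))` would produce four emulated location records from one global record `daily_views`, each preserving the shape of the original popularity curve.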
Specifically, consider the content library containing those videos. Each video can be cached on ENs, each with caching size . Content refreshing is scheduled according to the network traffic pattern. For example, wireless traffic presents regular peaks and valleys every day. Hence, content refreshing can be performed during the off-peak period with minimal impact on normal network activity. A video can be characterized from several aspects, including video quality, genre, length, and historical view statistics. In this experiment, we use view amount information in the past days as the attribute vector of each content, i.e., . Based on the attribute vectors and an initial guess on the content popularity, the algorithms gradually select contents that are predicted to be more popular than the others, and cache them on each EN accordingly. The long-term content hit rates of the proposed algorithms are compared with the following benchmark algorithms.

1) Hindsight optimal. By analyzing the full view record over the time span at each EN, the most popular videos are selected and cached on that EN. Note that this benchmark requires future information and hence cannot be implemented in practice.

2) Location oblivious (denoted by LO). During each time slot, the historical demands of all the contents from all ENs are analyzed, and the contents that are predicted (by ridge regression) to have the highest demands in the next time slot are identified. Then, all ENs cache the same set of contents without location differentiation.

3) Random. A random set of videos is selected to update the ENs during each time slot.
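A simplified sketch of how the three benchmarks could be scored is given below. It works on a hypothetical matrix `requests` of total requests per EN and content, so it is a static simplification: in particular, the location-oblivious set is ranked here by aggregated recorded demand rather than by the per-slot ridge regression prediction described above. All function names are hypothetical.

```python
import numpy as np


def hit_rate(requests, cached_sets):
    """Fraction of all requests served from cache.

    requests:    (N, F) total requests of F contents at each of N ENs.
    cached_sets: list of N index arrays, the contents cached on each EN.
    """
    served = sum(requests[n, cached].sum() for n, cached in enumerate(cached_sets))
    return served / requests.sum()


def hindsight_optimal(requests, cache_size):
    """Per-EN top contents chosen with the full request record."""
    return [np.argsort(row)[::-1][:cache_size] for row in requests]


def location_oblivious(requests, cache_size):
    """Same set on every EN, ranked by demand aggregated over all ENs."""
    top = np.argsort(requests.sum(axis=0))[::-1][:cache_size]
    return [top for _ in range(requests.shape[0])]


def random_caching(requests, cache_size, rng):
    """Uniformly random set of distinct contents on each EN."""
    n_en, n_files = requests.shape
    return [rng.choice(n_files, size=cache_size, replace=False) for _ in range(n_en)]
```

By construction, the hindsight-optimal hit rate upper-bounds that of any other policy with the same caching size on the same record, which is why it serves as the reference curve.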
VI-B2 Experimental Results
As shown in Fig. 4, the popularity of YouTube videos is highly skewed, and the most popular videos have attracted almost of user requests. The skewness of video popularity has also been validated in [25]. The popularity of this dataset can be roughly divided into three levels: highly skewed (popularity of the top videos); medium skewed (popularity of videos ranking from to ); and less skewed (the remaining ones).
Figure 5 compares the long-term content hit rates of different algorithms with varying EN caching sizes. The performance of the caching algorithms is affected by the skewness of the popularity profile. However, the proposed location-based approaches always outperform the location-oblivious scheme across caching sizes. Specifically, when the caching size falls into the highly skewed area or the less skewed area, the proposed caching algorithms RPUC and HPDT outperform the other benchmarks considerably. In particular, the HPDT algorithm performs better than RPUC. For the highly skewed area, the top videos present much higher variance than the rest. As a result, the noise mean of those records is also significant. Since the RPUC algorithm is designed for zero-mean noise, its performance is limited when the noise amplitude is significant. In contrast, when the caching size falls into the less skewed area, the algorithms need to accommodate various noise types of different videos, which may not always be zero-mean. RPUC and HPDT perform equally well when the caching size falls into the medium skewed area (Fig. 5). The H∞ filter is utilized to provide guaranteed performance even when the noise type leads to the worst case for estimation. As a result, the HPDT algorithm is conservative yet robust.
Figure 4 also indicates that content popularity is long-tailed, i.e., the less popular contents attract almost vanishing requests compared to the popular ones. As a result, the total hit rate of different caching schemes in Fig. 5 does not increase linearly with the caching size. Note that both algorithms run iteratively in an online fashion. During each iteration, the most computationally intensive operation is the times sorting of the estimated demands of contents, which has a typical computational complexity of . The complexity of the value assignments and matrix update is negligible compared with sorting. As a result, both algorithms are of low time complexity.
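Since only the contents that fit in the cache need to be identified, the per-iteration sorting step can also be implemented as a partial selection. The sketch below is one possible implementation choice, not the procedure analyzed above: it uses NumPy's `argpartition` to pick the top candidates and sorts only those; the name `top_contents` is hypothetical.

```python
import numpy as np


def top_contents(estimated_demand, cache_size):
    """Indices of the cache_size largest entries, highest demand first."""
    estimated_demand = np.asarray(estimated_demand)
    # Partial selection: the last cache_size positions hold the largest entries.
    candidates = np.argpartition(estimated_demand, -cache_size)[-cache_size:]
    # Sort only the selected candidates in descending order of demand.
    order = np.argsort(estimated_demand[candidates])[::-1]
    return candidates[order]
```

The function assumes that `cache_size` is at least one and does not exceed the number of contents.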
VI-C Discussions
VI-C1 Another dimension of the prediction
This work focuses on the estimation of the location feature vector. The selection of the video attribute vector also influences the prediction accuracy. As mentioned before, other factors, such as video quality, length, and genre, can also be used to characterize video contents. If such labeling information is available, we can identify influential features that affect content popularity by reducing the dimension and training on the dataset. On the other hand, for the location feature vector , both the RPUC and HPDT algorithms are designed with the precondition that is time-invariant, as indicated in both Eq. (6) and (13). The HPDT algorithm can be directly extended to the time-variant scenario if the state equation is also linear, i.e., , where is the transition matrix, and is the state noise vector. By resorting to the H∞ technique, the adaptive estimation on can be made with guaranteed accuracy [28].
VI-C2 The applicability in practical video streaming
The proposed popularity prediction approach can be applied to the delivery of various types of contents. As video content consumes the most bandwidth, it deserves in-depth investigation. Practical video streaming protocols (such as HTTP-based Adaptive Streaming, HAS) divide a video content into several chunks/segments, each with multiple bitrates and quality versions [38]. Those pull-based streaming protocols dynamically change the quality of the streamed video according to the observed network conditions on a per-fragment basis. Most of the research works on adaptive video streaming (both server-side bitrate switching [37] and client-side switching [39]) strive to predict the network condition when transmitting the next video segment. In contrast, with the aid of edge storage resources, our work focuses on push-based content distribution. In other words, estimating the available bandwidth is out of the scope of this work, and content updates are scheduled to off-peak periods, when streaming bandwidth is sufficient.
When streaming video contents based on HAS, whether to cache individual segments or the whole quality representation depends on both the available bandwidth and the content popularity. In particular, users tend to keep requesting popular contents regardless of the received video quality. Hence, it is more appropriate to store the whole representation of such a video so as to fulfill users’ requirements under dynamic bandwidth. As the popularity profile of contents is highly skewed (shown in Fig. 4), the streaming provider only needs to store the full quality representation of the most popular videos (the number of which depends on the caching size budget). For the less popular ones, the provider may choose to cache the individual segments that are of moderate quality, so as to save bandwidth and meanwhile be responsive to user requests. Such a decision rests on the content popularity profile and the available caching resources, which is where our work contributes. In this sense, the proposed location-based popularity prediction approaches are crucial in HAS-based streaming systems, and the prediction of content popularity and network condition will collectively contribute to improved video streaming.
VII Conclusion
In this paper, we investigate popularity prediction for mobile edge caching, with special focus on location awareness. We model the content popularity profile by a linear model and propose online algorithms to deal with different statistical models of the noise process. The proposed RPUC algorithm achieves a content hit rate that asymptotically approaches the optimal solution when the noise is zero-mean. Noticing that the noise may not necessarily be zero-mean, we resort to the H∞ filter technique and propose the HPDT algorithm for popularity prediction. This algorithm can achieve guaranteed prediction accuracy even when the worst-case noise occurs. Both algorithms can be implemented without training phases. Numerical analysis shows how the performance of the proposed algorithms is affected by different types of noise, the amount of historical data, and the initial state. Extensive experiments on a real dataset demonstrate the advantage of the proposed algorithms, which help to make customized caching decisions in practical scenarios. In future work, we will exploit locational features of neighboring ENs to make better caching decisions.
Appendix A Proof of Lemma 1
Let , based on Eq. (8), the estimation error can be rewritten as
Since , Hölder’s inequality indicates that . Then, the estimation error is bounded as
(28)  
The right-hand side of the above inequality decomposes the estimation error into two parts, where the first (variance term) specifies the error caused by the linear model, and the second (bias term) is the bias incurred by the ridge regression parameter . According to Eq. (6), we have . Azuma’s inequality gives a probabilistic upper bound on the variance term of Eq. (28):
(29) 
where the last inequality is due to the fact that
(30)  
Hence, the variance term of Eq. (28) can be bounded by with probability at least . Further, the bias term of Eq. (28) can be bounded as
(31)  
By substituting Eq. (29) and (31) into Eq. (28), the probabilistic bound in Eq. (