I. Introduction
Network function virtualization (NFV) is reshaping the way network services are deployed and delivered by virtualizing and scaling network functions (NFs) on commodity servers in an on-demand fashion [33]. As a revolutionary technique, NFV paves the way for operators towards better manageability and quality-of-service of network services.
In NFV systems, each network service is implemented as an ordered chain of virtual network functions (VNFs) that are deployed on commodity servers, a.k.a. a service chain. Along the chain, every VNF performs some particular treatment on the received requests, then hands over the output to the next VNF in a pipeline fashion. To enable a network service, one needs to place, activate, and chain VNFs deployed on various servers. Considering the high cost of VNF migration and instantiation [24], VNF replacement can only be performed infrequently; consequently, at the scale of flow or request operations, function placement can be viewed as static. Given this fact, a natural practice is to place multiple VNFs on one server in advance; however, due to hardware resource constraints (e.g., CPU, memory, and storage) [19], a server must carefully schedule resources among a subset of such VNFs at any particular time (i.e., only a subset of VNF instances can be activated on a server at a particular time). Therefore, with a fixed VNF placement, the activation and chaining of VNFs refer to: 1) for each server, the allocation of resources to a subset of deployed VNFs subject to resource constraints; and 2) for each network service, the selection of the activated instances for its VNFs, so as to determine the sequence of instances through which requests will be treated, a.k.a. service chaining.
Given that VNF placement is considered static at the time scale of flow or request operations [27], a natural question for service chaining and resource scheduling is: should they also be static, or dynamic? Static schemes have been implemented in some scenarios [18], but request traffic is often highly fluctuating in both temporal and spatial dimensions [25]. In such cases, static schemes may lead to workload imbalance among instances, leaving some instances overloaded and others underutilized. Hence, there is a strong demand for an efficient and dynamic scheme that performs service chaining and resource scheduling, adapts to traffic variations, and achieves load balancing in real time. As for implementability, recent advances (e.g., temporal and spatial processor sharing [26]) have enabled real-time adjustment of resource allocation among various functions on the same server.
However, such a dynamic design is non-trivial, especially in the face of the complex interplay between successive VNFs and the resource contention among VNF instances on servers. In particular, we would like to address the following challenges:

Characterization of the tunable tradeoffs among various performance metrics: NFV systems often have multiple optimization objectives, e.g., maximizing resource utilization, minimizing energy consumption, and reducing request response time. Different stakeholders may have different preferences over these objectives, which often conflict with each other [43]. It is important to characterize their tradeoffs to acquire a comprehensive understanding of the design space and tune the system towards the particular state that we desire.

Efficient online decision making: VNF request processing often requires low latency and high throughput. Hence, an effective dynamic scheme must also be computationally efficient and adaptive to request changes. This is challenging not only because of the problem's inherent complexity, but also because service requests arrive in an online manner, while the underlying traffic statistics are often unknown a priori. All these uncertainties make it more challenging to optimize system objectives through a series of online decisions, not to mention that a distributed design is often preferred.

Understanding benefits of predictive scheduling:
A natural enhancement of online decision making is to leverage recently developed machine learning techniques [6][46] to predict future traffic, thereby reducing response times and improving quality-of-service. For example, in NFV-based multimedia delivery systems, multimedia service providers can predict potential requests based on the popularity of streaming contents and subscribers' preferences [7]. Based on such predictions, service providers can carry out pre-rendering or compression to optimize the quality of their services with faster responses [5]. There is no free lunch, though. Despite the wide adoption of such prediction-based approaches [62][36][22][23][15], the fundamental benefits of predictive scheduling to NFV systems remain open, even in the presence of prediction errors. Answers to these questions are the key to understanding whether the endeavor of predictive VNF scheduling is worthwhile, and whether one can tolerate the worst possible case that may occur.
Despite recent headway on VNF scheduling [19][18], as far as we are aware, there is still no fundamental understanding of the above questions, nor is there any strategy that can achieve the design objectives simultaneously in a fully online fashion. One important reason lies in the difficulty of problem formulation and modeling, especially in choosing the granularity. If one models the system state and strategy at a flow-level abstraction [35], it may fall short in accurately characterizing the interplay between successive VNF instances and system dynamics over time; however, if one applies fine-grained control to each request [62], then the decision making will inevitably incur a rather high computational overhead. Such an issue not only prohibits a deep understanding of system dynamics, but also prevents us from obtaining an efficient and accurate strategy design.
In this paper, we overcome such difficulties by applying a number of novel techniques. Our contributions include:
Modeling and formulation: We propose a novel model that separates the granularity of system state characterization and strategy making. In particular, we develop a queueing model at the request granularity to characterize system dynamics. Unlike a flow-level abstraction, our model requires no prior knowledge of underlying flows, yet accurately captures the interplay between successive instances, i.e., the real-time dynamics of how requests are received, processed, and forwarded. Strategy making, on the other hand, is conducted at the granularity of request batches in a per-time-slot fashion to avoid the high overheads of per-request optimization. Such a careful choice makes it possible to characterize the system dynamics and performance in a clear yet accurate way.
Algorithm design: To enable online and efficient decision making, we transform the long-term stochastic optimization problem into a series of subproblems over time slots. By exploiting their unique structure, we propose POSCARS, a Predictive Online Service Chaining And Resource Scheduling scheme. POSCARS comprises two coupled parts: one for the predictive scheduling of requests, and the other for service chaining and resource allocation. The former takes advantage of predicted information to effectively reduce request delays, while the latter achieves a near-optimal system cost while stabilizing all queues in the system. Furthermore, it offers a tunable trade-off between system cost optimization and queue stability.
Predictive scheduling: To the best of our knowledge, this paper is the first to address the dynamic service chaining and scheduling problem in NFV systems by jointly considering resource utilization, energy efficiency, and request latency. It is also the first to study the fundamental benefits of predictive scheduling with future information in NFV systems, which opens a new dimension for NFV system design.
Experiment verification and investigation: We conduct trace-driven simulations; the results show the effectiveness of POSCARS and its variants under various settings against baseline schemes, as well as the benefits of predictive scheduling in achieving ultra-low request response times.
The rest of this paper is organized as follows. In Section II, we show a motivating example of predictive scheduling in NFV systems. Section III presents our model and formulation, followed by the design and performance analysis of POSCARS and its variants in Section IV. We show simulation results and analysis in Section V, then review related work in Section VI. Finally, Section VII concludes the paper.
II. Motivating Example
In this section, we first show a motivating example that exhibits the potential tradeoff in the multi-objective optimization of different system metrics, including the reduction of energy cost and communication cost, as well as the shortening of response times, which are mainly due to queueing delay. The example also explores the value of future information and the potential benefit of predictive scheduling.
We consider a time-slotted NFV system where predictive scheduling is viable, i.e., requests in the near future can be perfectly predicted, pre-generated, and pre-served by the system. (An example of predictive scheduling in practical systems is that Netflix predicts its users' behaviors and preloads video onto their devices [7].) Figure 1(a) shows the basic settings and the initial system state. All VNF instances are readily deployed on servers with a fixed placement. Each instance maintains a queue to buffer any untreated requests. Every server has a service capacity of two requests per time slot; processing a request incurs one unit of energy cost. Note that 1) any requests processed by a VNF's instance are not counted in the queues, but are ready to be sent to the next VNF's instances in the next time slot; and 2) requests that have been processed by the last VNF's instances are considered finished.
In this case, there are two possible service chaining decisions, i.e., forwarding the processed request from the upstream VNF's instance to its successor's instance on either server II (Decision #1) or server III (Decision #2). Forwarding the request to the instance on server II incurs a lower communication cost than forwarding it to the other instance on server III.
Our goal is to choose a service chaining decision that jointly minimizes the total energy cost, the total communication cost, and the total residual backlog size at the end of the time slot. (By Little's law, a short queue length implies a short queueing delay, i.e., a short response time.) Figures 1(b)-1(d) compare the scheduling processes under the different service chaining decisions.
In Figure 1(b), the new request is admitted, while the processed request is forwarded to the downstream instance on server II. Although incurring a low communication cost, such a decision also leads to imbalanced queue loads among the downstream VNF's instances. Recall that every server can serve at most two requests per time slot. Hence, the servers will process four requests in total: the new request on server I, two requests on server II, and another one on server III. The processing incurs a total energy cost of four units. After processing, the instance on server II still has one untreated request in its backlog. Thus Decision #1 incurs a lower total cost on energy and communication, with a residual backlog of one request.
On the other hand, when the processed request is forwarded to the instance on server III, the decision incurs a higher communication cost but results in balanced queue loads among the downstream VNF's instances. The servers will process five requests in total: the new request on server I, and the rest from servers II and III. The processing incurs a total energy cost of five units. After processing, there are no untreated requests left in the backlogs. Decision #2 thus incurs a higher total cost on energy and communication, but with no residual backlogs.
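The arithmetic behind the two decisions can be sketched in a few lines. The concrete backlog vectors and communication costs below are illustrative assumptions consistent with the narrative (per-server capacity of two, one unit of energy per processed request), since the figure's exact numbers do not survive in the text.

```python
# Toy comparison of the two chaining decisions in the motivating example.
# Queue vectors and communication costs are assumed for illustration.

def evaluate(queues, comm_cost, capacity=2, energy_per_request=1):
    """Given per-server backlogs after a chaining decision, return the
    slot's (energy + communication) cost and the residual backlog."""
    processed = sum(min(q, capacity) for q in queues)   # capped by capacity
    residual = sum(max(q - capacity, 0) for q in queues)
    energy = processed * energy_per_request
    return energy + comm_cost, residual

# Decision #1: cheap link (assumed cost 1) but imbalanced queues.
cost1, backlog1 = evaluate([1, 3, 1], comm_cost=1)
# Decision #2: pricier link (assumed cost 2) but balanced queues.
cost2, backlog2 = evaluate([1, 2, 2], comm_cost=2)
```

With these numbers, Decision #1 processes four requests (total cost 5, one residual request) while Decision #2 processes five (total cost 7, no residual), reproducing the cost-versus-backlog tension of Insight 1.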
Insight 1: Figures 1(b) and 1(c) show that we cannot achieve the optimal values for different system metrics simultaneously, i.e., there is a potential tradeoff between optimizing the total system cost and reducing the total queue length.
Additionally, we find that server I is underutilized in both Figures 1(b) and 1(c), because its instance only receives and handles the new request in the current slot. In fact, Figure 1(d) shows that we can exploit the spare processing power on server I by pre-admitting and pre-serving the future request. Consequently, we can shorten the response time of the future request by incurring one more unit of energy cost in the current slot. Note that pre-service does not introduce extra energy cost but merely pays it beforehand: even without pre-service, we would still have to pay one unit of energy cost in a subsequent time slot after the future request arrives.
Insight 2: By utilizing servers’ spare processing power and paying system cost in advance, predictive scheduling can effectively shorten response times of future requests.
To characterize this non-trivial tradeoff and exploit the power of predictive scheduling in NFV systems, we present our formulation in the next section.
III. Problem Formulation
We consider a time-slotted NFV system, where virtual network functions (VNFs) are instantiated, deployed over a substrate network, and chained together to deliver a number of network services. Upon the arrival of new network service requests, each VNF processes and hands over requests to the following VNF in a pipeline fashion. All requests are assumed homogeneous, i.e., each request has equal size and requires the same amount of computation to be processed. We show an instance of our system model in Figure 2 and summarize the main notations in Table I. More details of service chaining can be found in IETF RFC 7665 [16].
Substrate Network Model
- The set of servers that host VNFs
- Communication cost of sending one request from one server to another
- The capacity of each resource type on a server
- The unit cost of each resource type on a server

Network Service Model
- Number of network services
- Chain length of a network service
- The set of virtual network functions (VNFs)
- The set of all ingress VNFs
- The set of all non-terminal VNFs
- The VNF at a given position in a network service chain
- The network service that contains a given VNF
- The previous VNF in a service chain
- The next VNF in a service chain
- Number of requests processed by one unit of a given resource type

Deployment Model
- The set of all servers that host a VNF's instances
- The set of VNFs with instances residing on a server

System Dynamics
- Number of new requests for a network service arriving in a time slot
- Number of untreated requests for a network service in each upcoming slot of the prediction window
- The prediction queue length for a network service
- Queue length of a VNF's instance on a server
- Number of requests processed by a VNF's instance on a server in a time slot, to be sent to the next VNF in the following slot
- Total number of admitted requests for a network service
- Number of admitted requests from the prediction queue

Scheduling Decisions
- Number of admitted requests for a service onto a server
- Binary indicator of whether the instance of a VNF's successor on a given server is selected to receive the processed requests
- The vector of resources allocated on a server to a VNF's instance

System Objectives
- Total communication cost in a time slot
- Total computation cost in a time slot
- Weighted total queue length in a time slot
III-A. Substrate Network Model
We consider a substrate network with a set of heterogeneous servers. On each server, we consider multiple types of resources, e.g., GPU [59], CPU cores [26], and cache [47]. Each resource type has a given capacity and a unit cost; we collect these in a resource capacity vector and a resource unit-cost vector, respectively.
For every server pair, we use a time-varying quantity to denote the communication cost of transferring a request between the two servers, e.g., the number of hops or the round-trip time. If two servers are not reachable from each other in a time slot, the corresponding cost is set to infinity. The set of all communication costs in a time slot is denoted accordingly.
III-B. Network Service Model
There are multiple network services and a set of VNFs. Each network service is represented by a chain of ordered VNFs. To avoid triviality, we assume every network service has a non-empty chain. Note that the chain length is a constant and usually not very large [17]. We regard the same VNF appearing in different service chains as distinct VNFs. In practice, one can set up multiple queues on one VNF instance to buffer requests for different services and map each queue to one VNF instance in our model.
Next, we consider the set of ingress VNFs of all network services, i.e., the first VNF of each chain, and the set of non-terminal VNFs of all network services. Every VNF belongs to exactly one network service. If a VNF is not the first VNF of its service chain, we refer to its previous VNF; likewise, if it is not a terminal VNF, we refer to its next VNF.
III-C. Deployment Model
In practice, due to request workload changes, it is common to provide multiple instances for every VNF, encapsulate the instances into containers, and distribute them across servers for better load balancing and fault tolerance [51]. We assume that each VNF has at most one instance on each server, but it can have multiple instances on different servers. The placement of VNF instances is assumed to be predetermined by adopting VNF placement schemes similar to existing ones [58, 61, 10, 41]. Depending on the placement, the instances required by each service are not necessarily available on the same server. Note that our model can be further extended to the case where a VNF has multiple instances on the same server.
For each VNF, we consider the set of servers that host its instances. Correspondingly, each server hosts a subset of VNFs. Every instance maintains one queue to buffer its relevant requests, whose length is measured at the beginning of each time slot. Instead of individual queues, one could also implement a shared public queue among instances of the same VNF: all requests from the preceding VNF's instances would first be forwarded to and buffered in the public queue, then rescheduled to one or more idle or least-loaded instances. Such a design brings more flexibility, so that requests can avoid potentially long queueing delays on individual instances; however, it requires additional physical storage and incurs extra communication cost due to the rescheduling. The choice depends on the tradeoff made by system designers. Here we adopt the queueing model with one queue per individual instance.
III-D. Predictive Request Arrival Model
For each network service, we consider the number of its new requests that arrive in each time slot, which is assumed bounded by a constant and independent across time slots. In practice, considering the statefulness of VNFs, requests may be aggregated and scheduled at the unit of flows. Our model captures the system dynamics at a finer granularity than the flow-level abstraction, and can be further extended to the case with correlations between requests.
Next, we consider a system which can predict and pre-serve future request arrivals for network services a finite number of time slots ahead. Though prediction techniques and their analysis are still under active development [62, 39, 36], we do not assume any particular prediction technique in this paper. Instead, we take predictions as the output of standalone predictive modules, and investigate the fundamental benefits of acquiring and leveraging such future information, as well as the risks induced by misprediction. Such an assumption reasonably approximates practical scenarios where short-term prediction is viable. For example, Netflix improves its quality-of-experience (QoE) by predicting user demand and network conditions, then prefetching video frames onto user devices [7].
We assume that for each network service, the system has perfect access to its future requests within a prediction window of constant size. In practice, however, such predictions may be error-prone; we evaluate the impact of misprediction in the simulation. With pre-service, some future requests may have already been admitted or even pre-served before the current slot; thus we track the number of untreated requests in each slot of the window, such that
(1) 
The total number of untreated requests for a service is the sum over its prediction window, including those that arrive in the current slot. We can thus treat this window as a virtual prediction queue that buffers untreated current and future requests for the network service. In practice, the prediction queues can be hosted on servers or storage systems in proximity to the request traffic classifier [44]. To simplify notation, we collect all queue lengths into a single vector.
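The virtual prediction queue described above can be sketched as a sliding window of per-slot untreated request counts. This is a minimal illustration under our own naming, not the paper's implementation; it assumes, per constraint (2), that the current slot's requests are always fully admitted before the window advances.

```python
from collections import deque

class PredictionQueue:
    """A per-service window of untreated request counts: index 0 holds
    requests due in the current slot, later indices hold predicted
    future arrivals. The window shifts one slot ahead per time slot."""

    def __init__(self, window):
        # [current slot, 1 slot ahead, ..., `window` slots ahead]
        self.slots = deque([0] * (window + 1))

    def predict_arrival(self, d, n):
        """Record n requests predicted to arrive d slots ahead (d=0: now)."""
        self.slots[d] += n

    def admit(self, k):
        """Admit k untreated requests, earliest slot first (so the
        current slot's arrivals are always drained before pre-service)."""
        for i in range(len(self.slots)):
            take = min(k, self.slots[i])
            self.slots[i] -= take
            k -= take
            if k == 0:
                break

    def advance(self):
        """End of slot: the window moves one slot ahead; assumes the
        current slot (index 0) was fully admitted already."""
        self.slots.popleft()
        self.slots.append(0)

    def total(self):
        return sum(self.slots)
```

A classifier holding this structure can, for example, admit all current arrivals plus some future ones when the ingress instances are lightly loaded, then call `advance()` at the end of the slot.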
III-E. System Workflow and Scheduling Decisions
System Workflow: At the beginning of each time slot, system components (including the traffic classifier, VNF instances, and servers) collect relevant system dynamics to decide request admission, service chaining, and resource allocation. According to these decisions, the traffic classifier admits new requests for different network services. VNF instances steer the requests processed in the previous time slot to their next VNF's instances. Meanwhile, every server allocates its resources to its resident VNF instances [26]. The instances then process the requests from their respective queues. At the end of the time slot, the prediction window moves one slot ahead.
In the above process, we need to consider three kinds of scheduling decisions.
i) Admission Decision: For every network service, the traffic classifier decides the number of untreated requests, both newly arriving and predicted future ones, to admit into the system. Particularly, for a network service and its respective ingress VNF, the classifier decides the number of requests admitted to the ingress VNF's instance on each server. The total number of admitted requests from a prediction queue should include at least all the untreated requests that actually arrive, while not exceeding the queue's total backlog, i.e.,
(2) 
Note that requests are admitted in a fully-efficient manner [21]. In other words, the per-server admission decisions should sum up to the total number of requests admitted from the prediction queue, i.e.,
(3) 
The untreated request backlog evolves as follows,
(4) 
We denote the set of all admission decisions accordingly.
ii) Service Chaining Decision: For each non-terminal VNF, we define a binary service chaining decision in each time slot. Consider the case where a VNF and its next VNF have instances on two servers, respectively. A decision value of one indicates that the processed requests from the VNF's instance on the first server will be sent to the next VNF's instance on the second server, and zero otherwise. To ensure that every instance has a target instance to send its requests to, we have
(5) 
On the other hand, if the VNF (or its next VNF) has no instance on the corresponding server, then the decision variable is fixed to zero in every time slot. Note that dynamic request steering can be implemented by adopting VNF-enabled SDN switches [20]. We denote all chaining decisions accordingly.
iii) Resource Scheduling Decision: For each server and resident VNF, we define the resource vector allocated to the VNF's instance. To ensure that any allocation either includes at least one CPU core along with other resources, or no resources at all, we restrict the choice to a finite set of options; the zero allocation is always among them. Besides, the total allocated resources should not exceed the server's resource capacity, i.e.,
(6) 
Note that an instance receives no resources in any time slot if its VNF is not deployed on the server. Given a resource allocation, an instance can process and forward at most a bounded number of requests, where the resulting service rate is assumed to be estimated from system logs. Due to the limited length of a time slot, a VNF instance cannot process arbitrarily many requests; thus we assume the service rate is bounded by some constant. We denote all allocation decisions accordingly.
III-F. System Workflow and Queueing Dynamics
Recall the per-slot workflow described in Section III-E: components collect system dynamics, make the admission, chaining, and allocation decisions, and the instances process requests from their respective queues. At the end of each time slot, the prediction window for each network service moves one slot ahead; thus, given the admission decisions, each prediction queue is updated as follows
(7) 
With the above workflow, we have the subsequent queueing dynamics for different VNF instances.
Instances of Ingress VNFs: For every network service and its respective ingress VNF, a number of admitted requests arrive at the ingress VNF's instance on each server. Accordingly, the update function for the instance's queue length is
(8) 
Instances of Non-Ingress VNFs: Consider the instance of a non-ingress VNF on a server. If the chaining decision selects it, then the instance receives the processed requests from its predecessor's instance on the corresponding server; otherwise, it receives no new requests. The queueing update function is then given by
(9) 
where the service rate is the one allocated to the instance in the current slot. The inequality is due to the fact that the actual number of untreated requests may be less than the service rate. All requests processed by the last instances of service chains are considered finished. Figure 3 shows an example of our queue model, in which two network services require six types of VNFs whose instances are hosted on three servers; each network service has a prediction window of size two. Figure 3 also shows how requests are admitted and transferred between successive queues for the first network service (NS 1), given particular admission and chaining decisions.
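The per-slot dynamics in eqs. (8)-(9) follow the standard queue-update pattern: the backlog is drained by the allocated service rate (never below zero), then grows by the requests admitted or forwarded from the predecessor. A minimal sketch:

```python
def update_queue(backlog, service_rate, arrivals):
    """One-slot update for a VNF instance's queue, following the
    max(., 0) dynamics of eqs. (8)-(9): serve up to the allocated
    rate (capped by the actual backlog, hence the inequality in the
    text), then append this slot's arrivals (admitted requests for an
    ingress instance, forwarded requests otherwise).
    Returns (next_backlog, number_actually_served)."""
    served = min(backlog, service_rate)   # cannot serve more than queued
    return backlog - served + arrivals, served
```

For example, an instance with five buffered requests, a service rate of two, and three forwarded arrivals ends the slot with six buffered requests, having served two.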
III-G. Optimization Objectives
Communication Cost: Recall that transferring a request over a link incurs a communication cost, e.g., the number of hops or the round-trip time. A low communication cost is highly desirable for the responsiveness of requests. In each time slot, given the service chaining decisions, the communication cost between a pair of servers is
(10) 
where the per-request communication cost between the two servers is as defined in Section III-A. The total communication cost in the time slot is then given by
(11) 
Energy Cost: Efficient resource utilization on servers is another important objective in NFV systems [53]. Given the resource allocation, we define the corresponding energy cost in a time slot as its inner product with a constant vector whose entries are the unit costs of the respective server resource types. The total energy cost in the time slot is
(12) 
Queue Stability: Considering the responsiveness of requests and the scarcity of computational resources such as memory and cache, it is also imperative to ensure that no queue is overloaded. We denote the weighted total queue length in a time slot as
(13) 
where a constant weights the importance of stabilizing instance queues relative to prediction queues. Accordingly, we define queue stability [37] as
(14) 
III-H. Problem Formulation
Based on the above models, we formulate the following stochastic network optimization problem (P1), which aims at the joint minimization of the time-average expectations of the weighted communication cost and energy cost while ensuring queue stability. With such a formulation, we explore the potential tradeoff among different system metrics.
(15) 
where a constant weights the relative importance of energy efficiency against reducing communication cost.
IV. Algorithm Design and Performance Analysis
We present POSCARS, an online and predictive algorithm that solves problem P1 through a series of online decisions, followed by its performance analysis and three variants.
IV-A. Algorithm Design
Problem P1 is challenging to solve due to time-varying system dynamics, the online nature of request arrivals, and the complex interaction between successive VNF instances. Therefore, instead of solving problem P1 directly, we adopt Lyapunov optimization techniques [37] to transform the long-term stochastic optimization problem into a series of subproblems over time slots, as specified by the following lemma.
Lemma 1
By applying Lyapunov optimization techniques and the concept of opportunistically minimizing an expectation, problem P1 can be transformed into the following optimization problem to be solved in each time slot:
(16) 
(17) 
where is defined as
(18) 
where a positive parameter weights the importance of minimizing system cost relative to stabilizing system queues, and the per-slot cost term is defined as
(19) 
The detailed proof of Lemma 1 is relegated to Appendix A. Here we provide a sketch of how the problem transformation is carried out. The key technique we adopt is the drift-plus-penalty method [37], which aims to stabilize a queueing network while also optimizing the time-average of some objective (e.g., the total energy and communication cost in P1). To this end, a quadratic function (a.k.a. the Lyapunov function) is first introduced to characterize the stability of all queues in each time slot. The key idea of the method is then to introduce a drift-plus-penalty term that characterizes the joint change in queue stability and objective value across time slots. In particular, the drift-plus-penalty term is defined as the weighted sum of two parts: one is the difference (a.k.a. the drift) between the Lyapunov functions of two consecutive time slots, which measures the short-term change in queue stability; the other is the instant objective value in a time slot. The stability of the queueing network and the optimization of the time-average objective are then jointly achieved by deriving an online control policy that greedily minimizes an upper bound of the drift-plus-penalty term during each time slot. In this way, it can be proven that solving problem P1 is equivalent to solving a series of subproblems (P2) over time slots.
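The greedy per-slot rule of the drift-plus-penalty method can be illustrated on a single queue. This is a toy sketch of the general technique, not POSCARS itself; the service-rate/cost menu and the weighting parameter (called V here, in the method's usual notation) are our own assumptions.

```python
import random

def drift_plus_penalty_step(q, arrivals, options, V):
    """One slot of the drift-plus-penalty rule for a single queue:
    among the feasible (service_rate, cost) options, pick the one
    minimizing V*cost - q*service_rate, i.e., trade the penalty
    against the queue relief it buys, then update the backlog."""
    mu, cost = min(options, key=lambda oc: V * oc[1] - q * oc[0])
    return max(q - mu, 0) + arrivals, cost

# A short simulation: larger V favors cheap (slow) service and longer
# queues; smaller V drains the queue aggressively at higher cost.
random.seed(0)
options = [(0, 0.0), (1, 1.0), (2, 2.5)]   # assumed (rate, energy) menu
q, total_cost = 0, 0.0
for _ in range(1000):
    a = random.randint(0, 2)               # arrivals, mean 1 per slot
    q, c = drift_plus_penalty_step(q, a, options, V=5.0)
    total_cost += c
```

Note how a long backlog makes even the expensive high-rate option attractive, which is exactly the mechanism that keeps the queues stable while the time-average cost stays near-optimal.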
Note that by solving problem P2 over time slots, problem P1 can be solved asymptotically optimally as the number of time slots and the weighting parameter both approach infinity, as shown by Theorem 1 in Sec. IV-B. Furthermore, problem P2 can be decomposed into three subproblems for request admission, service chaining, and resource allocation, respectively. We then propose POSCARS, a Predictive Online Service Chaining And Resource Scheduling scheme, and show its pseudocode in Algorithm 1.
Remark 1
Regarding request admission, when all ingress instances are more loaded than the prediction queue, in order not to overload any instance, POSCARS admits only the untreated requests of the current time slot and spreads them evenly onto the least-loaded instances. However, when the instances all have shorter queue lengths than the prediction queue, POSCARS also admits future requests and assigns them to the least-loaded instances.
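The admission rule of Remark 1 can be sketched as follows. This is our simplified reading of the rule (names and the exact comparison threshold are assumptions), using a water-filling pass to spread admitted requests onto the least-loaded instances.

```python
def admit_requests(arrived_now, future_untreated, instance_queues):
    """Sketch of the Remark 1 admission rule: always admit the requests
    that actually arrived this slot; additionally pre-serve buffered
    future requests only when some ingress instance is less loaded than
    the prediction queue. Returns the per-instance assignment."""
    pred_backlog = arrived_now + future_untreated
    if min(instance_queues) < pred_backlog:
        to_admit = pred_backlog          # lightly loaded: pre-serve too
    else:
        to_admit = arrived_now           # heavily loaded: no pre-service
    # Water-filling: each request goes to the currently least-loaded instance.
    assignment = [0] * len(instance_queues)
    loads = list(instance_queues)
    for _ in range(to_admit):
        i = loads.index(min(loads))
        assignment[i] += 1
        loads[i] += 1
    return assignment
```

For instance, with heavily loaded instances only the two actual arrivals are admitted and split evenly, whereas with nearly empty instances all buffered future requests are pre-served as well.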
Remark 2
POSCARS decides the service chaining by jointly considering instances' queue lengths and the communication cost. Recall the definition in (18), where the weighted summation reflects the unit price of sending a request from a VNF's instance on one server to the instance of its next VNF on another. If the target instance is heavily loaded, there is a high price for forwarding the request to that instance. Likewise, a large communication cost also makes the target instance less attractive.
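The chaining rule of Remark 2 reduces to a weighted minimum over candidate target servers. The exact weighting in eq. (18) does not survive in this text, so the sketch below assumes the stated intuition (queue length plus a weighted communication cost, with the weight called V per the usual drift-plus-penalty notation).

```python
def choose_next_instance(candidates, V):
    """Sketch of the Remark 2 chaining rule: among the servers hosting
    the next VNF's instances, pick the one minimizing
    queue_length + V * comm_cost (assumed form of the unit price)."""
    # candidates: {server_name: (queue_length, comm_cost)}
    return min(candidates,
               key=lambda s: candidates[s][0] + V * candidates[s][1])

# A short backlog can outweigh a pricier link, and vice versa:
cheap_but_busy = {"II": (4, 1), "III": (0, 3)}
target_low_V = choose_next_instance(cheap_but_busy, V=1.0)   # favors "III"
target_high_V = choose_next_instance(cheap_but_busy, V=5.0)  # favors "II"
```

This mirrors the motivating example in Sec. II: a small weight balances queues across instances, while a large weight prefers the cheap link even at the price of imbalance.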
Remark 3
On each server, the resource allocation is decided by jointly considering the resource cost and the queue lengths of its resident instances. Particularly, we regard the weighted difference between the unit resource costs and the instance's queue length as the unit net cost vector of resources allocated to the instance. A high unit resource cost results in a prudent allocation; on the other hand, a sufficiently long queue makes the allocation more worthwhile. In both cases, POSCARS selects the set of resource allocation decisions that satisfy constraint (6) and minimize the total net cost.
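The net-cost rule of Remark 3 can be sketched for a single resource type. The paper optimizes over all resident instances jointly subject to constraint (6); for brevity this sketch uses an assumed greedy pass over instances, an assumed one-dimensional resource, and the usual V-notation for the weighting parameter.

```python
def allocate_resources(instance_queues, options, unit_cost, capacity, V):
    """Greedy sketch of the Remark 3 allocation rule: for each resident
    instance, pick the feasible (resource_amount, service_rate) option
    minimizing the net cost V*unit_cost*resource - queue_length*rate.
    The zero allocation (0, 0) must be in `options`, so an instance
    with no backlog receives nothing."""
    remaining = capacity
    chosen = []
    for q in instance_queues:
        feasible = [(r, mu) for (r, mu) in options if r <= remaining]
        r, mu = min(feasible,
                    key=lambda rm: V * unit_cost * rm[0] - q * rm[1])
        chosen.append((r, mu))
        remaining -= r
    return chosen
```

A heavily backlogged instance thus wins the larger allocation, while an idle instance gets none, matching the "prudent unless the queue is long" intuition.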
IV-B Performance Analysis
We analyze the computational complexity of POSCARS in each time slot as follows. For each network service, it takes time to make request admission decisions (lines ). Next, each non-terminal VNF instance selects and forwards requests to its successors in time (line ). Every server takes time to initialize the lookup table (lines ) and time to decide the resource allocation, where is the maximum number of applicable resource allocations for any VNF instance. In practice, POSCARS can be run in a distributed manner. Particularly, the request admission subroutine can be implemented on each traffic classifier with a computational complexity of ; meanwhile, the service chaining and resource scheduling subroutines can be deployed on the hypervisor of each server, with computational complexities of for each instance and for each server, respectively, where is the maximum number of applicable resource allocations.
On the other hand, without predictive scheduling, we show that POSCARS achieves an tradeoff between the time-averages of total queue length and total cost via the tunable parameter . In particular, given the value of , let denote the optimal value of problem P1; then we have the following theorem.
Theorem 1
Suppose that and, given the system resource capacities on each server and VNF placement, there exists an online scheme which ensures that, for each VNF instance, the mean arrival rate is smaller than its mean service rate. Under POSCARS without prediction, there exist constants and such that
The proof is relegated to Appendix B. Theorem 1 demonstrates an tradeoff between system cost optimization and queue stability. Particularly, without prediction, POSCARS can achieve a near-optimal cost within an optimality gap, but at the cost of an increase in the time-averaged total queue length. Intuitively, with a large value of , VNF instances are more willing to steer requests to their successive instances on nearby servers, while servers would allocate resources to instances with lower energy costs. As a result, the total cost can be effectively reduced; however, some servers may become hot spots and the total queue length will increase. In contrast, a smaller value of leads to more balanced queue loads among servers but more energy consumed to serve requests, resulting in a higher total cost. Moreover, given predicted information about future requests, POSCARS can achieve a better tradeoff with a notable delay reduction by pre-serving requests with surplus system resources. We verify such advantages by our simulation results in Section V.
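For concreteness, assuming the tunable parameter is denoted by $V$ (as in standard Lyapunov drift-plus-penalty analysis [37]) and writing $c^{*}$ for the optimal cost of P1, bounds of the kind stated in Theorem 1 typically take the following form; the constants $B$, the slack $\epsilon$, and the auxiliary cost $\hat{c}$ of the $\epsilon$-slack stationary policy are assumptions here, not the paper's exact notation:

```latex
% Hedged sketch of the standard [O(1/V), O(V)] drift-plus-penalty
% bounds; B > 0 and \epsilon > 0 are system-dependent constants.
\limsup_{T \to \infty} \frac{1}{T} \sum_{t=0}^{T-1}
  \mathbb{E}\big[ c(t) \big] \;\le\; c^{*} + \frac{B}{V},
\qquad
\limsup_{T \to \infty} \frac{1}{T} \sum_{t=0}^{T-1}
  \sum_{i} \mathbb{E}\big[ Q_{i}(t) \big]
  \;\le\; \frac{B + V \big( \hat{c} - c^{*} \big)}{\epsilon}.
```

The first bound captures the optimality gap shrinking as $V$ grows; the second captures the queue backlog growing linearly in $V$, which is exactly the tradeoff described above.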
IV-C Practical Issues and Variants of POSCARS
The distributed nature of POSCARS requires each VNF instance to gather relevant system dynamics on its own. However, the probing process may incur considerable sampling overheads and additional latencies. Meanwhile, each instance makes its independent decision based on the sampled information at the beginning of a time slot. Therefore, instances may blindly choose the same lowest-cost instance, without knowing others’ choices. The chosen instance will then become overloaded due to the non-coordinated decisions. An alternative is to perform sampling before sending each request. Nonetheless, this method suffers from the messaging overheads of frequent sampling. A possible compromise is to split the processed requests into batches, then sample and schedule for each batch separately.
To mitigate such issues, we propose the following variants of POSCARS by adopting ideas from recent randomized load balancing techniques, such as the power-of-d-choices [34], batch sampling [38], and batch filling [54].
POSCARS with the Power-of-d-Choices (PPo): To reduce sampling overheads, we apply the idea of the power-of-d-choices to POSCARS. Particularly, every non-terminal instance probes only instances chosen uniformly at random from its next VNF. Next, the instance sends all its processed requests to the lowest-cost instance among the samples. In this way, each instance requires only a few samples to decide its target instance. Although the selected instance may not be the least-cost one, our later simulation results show that the reduced sampling brings only a mild increase in the total cost.
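A minimal sketch of the power-of-d-choices sampling used by PPo; the sample size `d`, the cost callback, and the names are illustrative assumptions:

```python
import random

def ppo_pick(instances, cost, d, rng=random):
    """Probe d instances uniformly at random; pick the least-cost one.

    instances: list of candidate instance ids
    cost: callable mapping an instance id to its current cost
    """
    # Sample without replacement; cap d at the number of candidates.
    samples = rng.sample(instances, min(d, len(instances)))
    return min(samples, key=cost)
```

With `d` well below the total number of instances, only a few probes are needed per decision, at the price of possibly missing the global least-cost instance.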
The above variant significantly reduces the sampling overheads. However, the issue of non-coordinated decision making still remains. To mitigate this issue, we adopt the ideas of batch sampling [38] and batch filling [54] and propose another two variants of POSCARS, namely POSCARS with Batch-Sampling (PBS) and POSCARS with Batch-Filling (PBF), respectively. Basically, these two variants split the processed requests on each instance into batches, each batch with a size of , then carry out scheduling upon such request batches. When , we actually perform scheduling for each request separately. When is greater than the number of processed requests, scheduling is performed only once in a time slot, degenerating to POSCARS. We elaborate on the designs of PBS and PBF as follows.
POSCARS with Batch-Sampling (PBS): Given an instance with batches of requests, it probes instances uniformly at random from its next VNF, where is the respective probe ratio. Then the instance sends the request batches to the least-cost instances, with each batch assigned to a distinct target instance.
POSCARS with Batch-Filling (PBF): Given an instance with request batches, it probes instances uniformly at random from its next VNF. Then it forwards the request batches one by one. Each batch is sent to the least-cost instance among the samples, and the chosen instance’s cost is updated after it receives the batch of requests.
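The two batch variants can be contrasted in a few lines. These are hedged sketches under assumed names and a simple additive cost update; the batch load increment `batch_load` is an illustrative parameter, not from the paper.

```python
def pbs_assign(batches, samples, cost):
    """PBS: map each batch to a distinct least-cost sampled instance."""
    ranked = sorted(samples, key=cost)
    return {b: ranked[i] for i, b in enumerate(batches)}

def pbf_assign(batches, samples, cost, batch_load=1.0):
    """PBF: assign batches one by one, updating the chosen cost."""
    load = {s: cost(s) for s in samples}
    assignment = {}
    for b in batches:
        target = min(load, key=load.get)
        assignment[b] = target
        # The chosen instance's cost rises after receiving the batch,
        # so later batches may prefer a different target.
        load[target] += batch_load
    return assignment
```

The cost update is the key difference: with a large per-batch load, PBF spreads batches like PBS; with a small one, consecutive batches may pile onto the same cheap instance.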
V Simulation
We conduct trace-driven simulations to evaluate the performance of POSCARS and its variants. The request arrival measurements are drawn from real-world systems [4], with a mean arrival rate of per time slot (ms) and a mean inter-arrival time of ms. Besides, we conduct simulations with Poisson request arrivals at the same rate of . All the results are obtained by averaging measurements collected from repeated and independent simulations.
V-A Simulation Settings
Substrate Network Topology: We construct the substrate network based on two widely adopted topologies, i.e., Jellyfish [45] and Fat-Tree [3]. Both topologies have a comparable scale to clusters in data center networks, each equipped with switches, servers with deployed VNFs, and the remaining servers as hosts that generate service requests. Particularly, in Fat-Tree, there are pods, each pod containing to servers; amongst them, we choose one server uniformly at random as the one with deployed VNFs and the rest as hosts. Requests can be processed on servers in any pod with the VNF they demand. Between any two servers, request traffic traverses the shortest path with a link capacity of Gbps. For each pair of servers, the communication cost per request is proportional to the number of hops on the shortest path between them, with variation.
Server Resources: We consider CPU cores as the resources on each server, since CPUs have become the major bottleneck for request processing in NFV systems [32, 1, 8]. Servers are heterogeneous, each with a number of CPU cores ranging from to . In every time slot, we calculate the power consumption in units of utilized CPU cores, with . Regarding parameter , setting it to a greater value would encourage each server to assign most resources to heavily loaded VNF instances. Conversely, a smaller value of would lead to more balanced resource allocation among such instances; consequently, this minimizes the impact of imbalanced queue loads on the decision making for service chaining. The value setting depends on the objectives to fulfill in real systems. In our simulation, by fixing , we assume that communication cost reduction and system energy efficiency are equally important.
Service Function Chains: We deploy five network services, each with a service chain length varying from to . Each service contains at least one of the most commonly deployed VNFs, e.g., Intrusion Detection System (IDS), Firewall (FW), and Load Balancer (LB). The remaining VNFs of each service are chosen uniformly at random without replacement from other commonly used VNFs [29]. For each VNF, the total number of instances ranges from to .
Prediction Settings: Network services’ traffic often varies in predictability. We denote the average window size by , and set each service’s window size by sampling uniformly at random from . We evaluate the cases with perfect and imperfect prediction. For perfect prediction, future request arrivals in the time window are assumed perfectly known to the system and can be pre-served. In practice, such an assumption is not feasible for stateful requests; nonetheless, the stateful case can be viewed as an extension of our results with additional constraints on request processing. For imperfect prediction, prediction failures generally fall into two categories. One is false-negative detection, i.e., a request is not predicted to arrive, and as a result, it receives no pre-service before its arrival. The other is false-positive detection, i.e., a request that does not exist is predicted to arrive. In this case, the system pre-allocates resources to pre-serve such requests. We consider two extreme cases: in one, we fail to predict the arrivals of all future requests; in the other, we correctly predict the actual future arrivals, but some extra arrivals are falsely alarmed. Note that any form of misprediction can be seen as a superposition of these two extremes. In addition, we also implement five schemes that forecast request arrivals in the next time slot (with window size ), including: 1) Kalman filter (Kalman) [9]; 2) distribution estimator (Distr), which generates the next estimate by independent sampling from the distribution of arrivals learned from historical data; 3) Prophet (FB) [46], Facebook’s time-series forecasting procedure; 4) moving average (MA); and 5) exponentially weighted moving average (EWMA) [6].
Baseline Schemes: We compare POSCARS with three baseline schemes: Random, JSQ (Join-the-Shortest-Queue), and the state-of-the-art OneHopSCH (one-hop scheduling) [49]. These schemes differ from POSCARS in their service chaining strategies. In the Random scheme, each instance uniformly at random sends requests to one of its successors. In the JSQ scheme, each instance sends requests to its least-loaded successor. In OneHopSCH, each instance sends requests to the successor with the least communication cost and idle capacity.
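The three baseline chaining strategies can be summarized as one-liners; the function names, the successor tuple layout `(id, queue_len, comm_cost)`, and the tie-breaking in the OneHopSCH sketch are assumptions, the last being a simplified reading of the scheme described above.

```python
import random

def random_pick(successors, rng=random):
    # Random: uniformly pick any successor, ignoring all state.
    return rng.choice([s for s, _, _ in successors])

def jsq_pick(successors):
    # JSQ: pick the least-loaded successor.
    return min(successors, key=lambda s: s[1])[0]

def onehop_pick(successors):
    # OneHopSCH (simplified): prefer low communication cost, breaking
    # ties toward the less-loaded (more idle) successor.
    return min(successors, key=lambda s: (s[2], s[1]))[0]
```

The contrast with POSCARS is that none of these baselines uses predicted arrivals, which is why their response times stay flat as the prediction window grows in the experiments below.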
Variants of POSCARS: To compare the performance of POSCARS and its variants, we evaluate them under different settings. For each of the variants, we vary the probe ratio ( for PPo, for PBS, and for PBF) from to , and fix the batch size for PBS and PBF at requests per batch. We omit the cases where the ratio is and greater than . Notice that the former corresponds to the Random scheme and leverages no load information, while the latter leads to excessively fine-grained control since it induces too much sampling overhead.
Request Response Time Metric: To evaluate the impact of predictive scheduling, we define a request’s response time as the number of time slots from its actual arrival to its eventual completion. If a request is pre-served before it arrives, then the system is assumed to respond to the request upon its arrival, and the request will experience a zero response time.
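This metric has a direct encoding: a request completed before or upon its actual arrival (i.e., fully pre-served) sees a zero response time, never a negative one. The function name is an illustrative assumption.

```python
def response_time(arrival_slot, completion_slot):
    """Time slots from actual arrival to eventual completion.

    A request pre-served before its arrival is considered responded
    to upon arrival, so its response time is clamped to zero.
    """
    return max(0, completion_slot - arrival_slot)
```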
V-B Performance Evaluation under Perfect Prediction
Intuitively, POSCARS promises to shorten requests’ response times by exploiting predicted information and pre-allocating idle system resources to pre-serve future requests. Therefore, the essential benefits of predictive scheduling come from load balancing in the temporal dimension. To verify this intuition, we first consider the case with perfectly predicted arrivals, and evaluate POSCARS with () and without () prediction against the baseline schemes.
Average response time vs. window size : Figure 4 shows the performance of the different schemes under the Jellyfish and Fat-Tree topologies. The response times induced by the baseline schemes remain constant since they do not involve predictive scheduling. Random incurs the highest response time (ms), since it disregards information about workloads and communication costs when dispatching requests. JSQ does much better (ms) because requests are always greedily forwarded to the least-loaded successors. OneHopSCH outperforms the previous two by jointly taking the workloads and communication cost into consideration. Meanwhile, without prediction (), POSCARS achieves comparable performance with OneHopSCH; but as increases from to , we observe a significant reduction in the average response time under both topologies, e.g., from ms to ms under the Fat-Tree topology. The marginal reduction diminishes as further increases, and eventually, the response time remains at around ms.
Insight: In practice, due to traffic variability, it is often unrealistic to achieve high predictability (large ). However, the results show that even a mild amount of future information suffices for POSCARS to shorten request response times effectively and achieve load balancing in the temporal dimension. With more future information, the reduction diminishes since the idle system resources have already been depleted.
Considering the qualitative similarities among curves under different settings, we only present results under the Fat-Tree topology and trace-driven request loads.
Backlog-cost tradeoff with parameter : Recall from Section III.B that the value of parameter controls the backlog-cost tradeoff. Figures 5(a) and 5(b) verify such a tradeoff. Figure 5(a) compares the time-average communication cost of POSCARS with , , , against the baselines. Both Random and JSQ induce a high total cost since their decision making disregards the resultant communication cost and the heterogeneity of servers in terms of energy cost. OneHopSCH further lowers the total cost by about , taking advantage of jointly optimizing cost and shortening queue lengths based on flow-level statistics. Given different choices of , POSCARS achieves a close-to-optimal time-average total cost as the value of rises up to . Notably, POSCARS outperforms OneHopSCH whenever .
However, recall that parameter weighs the importance of minimizing system cost relative to maintaining queue stability. Hence, while reducing system cost, large values of also lead to increased backlogs. By Little’s theorem [30], this increases response times as well. In Figure 5(b), we see that the total queue length is almost proportional to the value of , exceeding all other baselines as .
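Little's theorem, cited above, makes this link explicit:

```latex
% Little's theorem: time-average backlog L, mean arrival rate \lambda,
% and mean response time W satisfy
L \;=\; \lambda\, W
\quad\Longrightarrow\quad
W \;=\; \frac{L}{\lambda}.
```

At a fixed arrival rate, a total backlog that grows roughly linearly in the tradeoff parameter therefore implies response times that grow roughly linearly as well.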
Insight: POSCARS achieves a backlog-cost tradeoff via the value of parameter . By choosing an appropriate value of from , it outperforms the baseline schemes with both lower system cost and shorter queue lengths. In practice, such an interval may vary from system to system, but it is usually proportional to the ratio of the magnitudes of the total queue length to the total system cost.
POSCARS and its variants: Upon forwarding requests, POSCARS requires each instance to collect statistics from all its successors. In practice, this may incur non-negligible sampling overheads in the face of a large number of instances. In Section IV-C, we proposed three variants of POSCARS, i.e., PPo, PBS, and PBF. These variants trade off the optimality of decision making for reductions in sampling overheads and complexity [54] from to , where denotes the total number of candidate instances. Figure 6 evaluates the total cost and average response time induced by POSCARS and its variants, with parameter , , a batch size of for PBS and PBF, and the probe ratio .
In Figure 6(a), we see that POSCARS achieves the lowest total cost, since each instance’s decision making is based on the full dynamics of its succeeding instances. For each variant, we see a reduction in the total cost by up to as increases from to . Similarly, from Figure 6(b), we also observe a reduction in response time from about ms by up to . Among the three variants, PBS and PBF induce greater reductions in both cost and response time than PPo, because aggregated sampling is often more conducive to lowering the cost [38].
Insight: By sampling partial system dynamics for decision making, the variants of POSCARS trade off optimality for reductions in sampling overheads and complexity. Owing to aggregated sampling, PBF and PBS outperform PPo with both lower total cost and shorter response time.
V-C Performance Evaluation under Imperfect Prediction
In practice, prediction errors are inevitable due to dataset bias and noise. To explore the fundamental limits of predictive scheduling, we evaluate the impact of imperfect prediction on the system performance.
Total cost and response time vs. : Figure 7 compares the time-average total cost and average response time induced by different forecasting schemes and by POSCARS with perfect prediction. In Figure 7(a), we observe that all forecasting schemes incur a higher time-average total cost than perfect prediction by up to . The reason is as follows. Recall that the prediction under these forecasting schemes is imperfect, with both false-negative and false-positive detections. Particularly, the system pre-allocates extra resources to pre-serve false-positive requests, resulting in a higher total cost. Figure 7(b) shows an overall ascending trend as increases. This is because larger values of lead to a greater total queue length, and by Little’s theorem [30], a greater queue length implies a longer response time. However, we also see that, even under imperfect prediction, predictive scheduling does not necessarily lead to longer response times than under perfect prediction.
To figure out the reason, we consider two extreme cases. One is all-false-negative, i.e., during each time slot, all future request arrivals in the lookahead window are false-negative. Notice that this case is equivalent to the case without predictive scheduling (), since no requests will be pre-allocated resources. The other is all-false-positive, i.e., all future request arrivals are perfectly predicted, but in addition, some extra requests are wrongly predicted to arrive.
Perfect prediction with two extremes: Figure 8(a) compares average response times under perfect prediction and the two extremes, with , , and false-positive requests on average. Overall, the average response time is proportional to the value of . Miss detection incurs a higher response time than the other two, because it does not pre-serve any requests before they arrive. On the other hand, neither perfect prediction nor false alarm necessarily outperforms the other in response time. This is because false alarm has two consequences. The first is that false-positive requests consume extra system resources and prolong the request queue lengths, thus leading to longer response times. The second is that, according to lines  in Algorithm 1, false-positive requests result in a greater prediction queue length, which forces POSCARS to admit future requests more frequently, thus leading to shorter response times. The same effect can be achieved by tuning the parameter , since greater values of lead to less frequent admission.
How do these two consequences interplay? The question is answered by Figure 8(b), where the average number of false-positive requests varies from to , with , , and . When the average number of false-positive requests increases from to , the resultant response time falls even below that under perfect prediction. In such cases, the second consequence dominates: mild false alarm leads to more frequent admission, making POSCARS spread requests more evenly among instances. However, as false alarm continues to aggravate, the reduction diminishes and the response time grows steadily. In such cases, though the admission frequency is intensified, too much false alarm severely extends the total queue length, offsetting and eventually outweighing the effect of load balancing.
Insight: Imperfect prediction does not necessarily degrade system performance with, e.g., longer response times. Instead, mild false alarm allows the system to make better use of idle system resources, further shortening response times.
VI Related Work
In this section, we first summarize existing works that study the optimization of NFV from different aspects. Then we narrow down our focus onto those that are most relevant to this paper and compare their proposed approaches with ours.
VI-A Optimizing NFV/VNF from Different Aspects
A wide range of recent works have studied NFV systems from various aspects. Below we take a brief overview and discuss how they are related to our work.

VNF placement: In NFV, the placement of VNF instances often has a significant impact on system performance [27] and thus deserves an elaborate design. A number of existing works have been conducted to this end (e.g., [40], [13], [11], [2], [60]). In practice, such approaches can serve to decide the VNF placement, upon which our schemes can carry out their scheduling procedures accordingly.

VNF Resource Allocation: Another series of works (e.g., [26], [28], [64]) focused on the optimization of resource allocation for VNF/NFV, with the aim of minimizing VNF execution overheads and accelerating the processing speed of VNF instances. They concentrated on achieving such improvements with particular hardware designs. Different from such works, we mainly focus on exploiting predicted information to perform effective scheduling on existing NFV systems. Nonetheless, our schemes can be applied to systems built with their solutions.

Load Balancing: Existing works (e.g., [48], [50], [56]) also developed various schemes to balance the workloads among chained VNF instances to improve resource utilization and fault tolerance while shortening delays in NFV systems. In practice, existing solutions can serve as reference points for system designers to tune the proper value of parameter for desired performance metrics.
VI-B Chaining and Resource Scheduling of VNFs in NFV
Regarding the optimization of VNF service chaining and resource scheduling in NFV, existing works generally fall into two categories.
Of the first category are the schemes that perform service chaining and resource scheduling in an offline fashion. Typically, they assume the full availability of information about all service requests or flows. Based on flow abstraction, Zhang et al. [58] consider the joint optimization of VNF placement and service chaining. They formulate the problem as an ILP problem and develop an efficient rounding-based approximation algorithm with a performance guarantee. Yoon et al. [55] adopt the BCMP queueing model for VNF service chains and propose heuristics to approximately minimize the expected waiting time of service chains. Wang et al. [49] consider the joint optimization of service chaining and resource allocation and develop a greedy scheme that aims to place instances and schedule traffic with minimum link cost, CAPEX, and OPEX. Later, D’Oro et al. [12] study the service chaining problem from the perspective of congestion games. By formulating the problem as an atomic weighted congestion game, they propose a distributed algorithm that provably converges to the Nash equilibrium. On the other hand, Zhang et al. [61] formulate a request-level optimization problem based on steady-state metrics and propose a heuristic scheme by applying techniques from open Jackson queueing networks. However, there is no empirical evidence that service request arrivals follow a Poisson process in NFV systems. Different from existing works, our model and problem formulation assume no prior knowledge about the underlying request traffic. Moreover, instead of offline or even centralized decision making, our solution is capable of performing near-optimal service chaining and scheduling in a computationally efficient and decentralized manner.
Of the second category are the online schemes that process requests upon their arrivals. Under this setting, Mohammadkhan et al. [35] formulate the VNF placement for service chaining as an MILP problem based on flow abstraction and develop a heuristic to solve the problem incrementally. Lukovszki et al. [31] develop an online algorithm that performs request admission and service chaining with a logarithmic competitive ratio. Zhang et al. [63] propose a novel VNF brokerage service model and online algorithms to predict traffic demands, purchase VMs, and deploy VNFs. Further, Fei et al. [14] develop an effective algorithm that performs online VNF scheduling and flow routing with predicted flow demand, so as to minimize the impact of inaccurate prediction and the cost of over-provisioned resources. Later, Xiao et al. [52] propose an adaptive service chaining deployment scheme based on deep reinforcement learning techniques, which conducts service chaining to serve incoming requests in an online fashion. Such schemes either resort to flow-level system dynamics and predicted information for decision making, or perform finer-grained control at the request level to optimize dedicated objectives. Our model considers such tradeoffs and separates the granularity of system state and decision making. Besides, we also explore the fundamental benefits and limits of predictive scheduling, which still remain open in NFV systems.
VII Conclusion
In this paper, we studied the problem of dynamic service chaining and resource scheduling and systematically investigated the benefits of predictive scheduling in NFV systems. We developed a novel queue model that accurately characterizes the system dynamics. We then formulated a stochastic network optimization problem and proposed POSCARS, an efficient and decentralized algorithm that performs service chaining and scheduling through a series of online and predictive decisions. Theoretical analysis and trace-driven simulations showed the effectiveness and robustness of POSCARS and its variants in achieving near-optimal system cost while effectively shortening average response times. Our results also show that prediction with mild false positives leads to shorter response times. In addition, note that fair sharing of resources and performance isolation among VNF instances are key to maintaining high quality of service. Therefore, an interesting direction for future work is to establish a more effective joint service chaining and scheduling scheme with multi-resource fairness among VNF instances. Moreover, it would also be intriguing to explore the interplay between resource fairness and other performance metrics.
References
 [1] B. Addis, D. Belabed, M. Bouet, and S. Secci, “Virtual network functions placement and routing optimization,” in Proceedings of IEEE CloudNet, 2015.
 [2] S. Agarwal, F. Malandrino, C.-F. Chiasserini, and S. De, “Joint VNF placement and CPU allocation in 5G,” in Proceedings of IEEE INFOCOM, 2018.
 [3] M. Al-Fares, A. Loukissas, and A. Vahdat, “A scalable, commodity data center network architecture,” in Proceedings of ACM SIGCOMM, 2008.
 [4] T. Benson, A. Akella, and D. A. Maltz, “Network traffic characteristics of data centers in the wild,” in Proceedings of ACM SIGCOMM, 2010.
 [5] N. Bouten, J. Famaey, R. Mijumbi, B. Naudts, J. Serrat, S. Latré, and F. De Turck, “Towards NFV-based multimedia delivery,” in Proceedings of IFIP/IEEE International Symposium on Integrated Network Management (IM), 2015.
 [6] G. E. Box, G. M. Jenkins, G. C. Reinsel, and G. M. Ljung, Time series analysis: forecasting and control. John Wiley & Sons, 2015.
 [7] J. Broughton, “Netflix adds download functionality,” https://technology.ihs.com/586280/netflixaddsdownloadsupport.
 [8] F. Callegati, W. Cerroni, C. Contoli, and G. Santandrea, “Dynamic chaining of virtual network functions in cloud-based edge networks,” in Proceedings of IEEE NetSoft, 2015.
 [9] C. K. Chui and G. Chen, Kalman Filtering. Springer, 2017.
 [10] R. Cohen, L. LewinEytan, J. S. Naor, and D. Raz, “Near optimal placement of virtual network functions,” in Proceedings of IEEE INFOCOM, 2015.
 [11] R. Cziva, C. Anagnostopoulos, and D. P. Pezaros, “Dynamic, latency-optimal VNF placement at the network edge,” in Proceedings of IEEE INFOCOM, 2018.
 [12] S. D’Oro, L. Galluccio, S. Palazzo, and G. Schembra, “Exploiting congestion games to achieve distributed service chaining in NFV networks,” IEEE JSAC, vol. 35, no. 2, pp. 407–420, 2017.
 [13] X. Fei, F. Liu, H. Xu, and H. Jin, “Towards load-balanced VNF assignment in geo-distributed NFV infrastructure,” in Proceedings of IEEE/ACM IWQoS, 2017.
 [14] ——, “Adaptive VNF scaling and flow routing with proactive demand prediction,” in Proceedings of IEEE INFOCOM, 2018.
 [15] X. Gao, X. Huang, S. Bian, Z. Shao, and Y. Yang, “PORA: predictive offloading and resource allocation in dynamic fog computing systems,” in Proceedings of IEEE ICC, 2019.
 [16] J. Halpern, C. Pignataro et al., “Service function chaining (SFC) architecture,” in RFC 7665, 2015.
 [17] B. Han, V. Gopalakrishnan, L. Ji, and S. Lee, “Network function virtualization: challenges and opportunities for innovations,” IEEE Communications Magazine, vol. 53, no. 2, pp. 90–97, 2015.
 [18] H. Hantouti, N. Benamar, T. Taleb, and A. Laghrissi, “Traffic steering for service function chaining,” IEEE Communications Surveys & Tutorials, vol. 21, no. 1, pp. 487–507, 2018.
 [19] J. G. Herrera and J. F. Botero, “Resource allocation in NFV: a comprehensive survey,” IEEE Transactions on Network and Service Management, vol. 13, no. 3, pp. 518–532, 2016.
 [20] C.-L. Hsieh and N. Weng, “NFSwitch: VNFs-enabled SDN switches for high performance service function chaining,” in Proceedings of IEEE ICNP, 2017.
 [21] L. Huang, S. Zhang, M. Chen, and X. Liu, “When backpressure meets predictive scheduling,” IEEE/ACM Transactions on Networking (ToN), vol. 24, no. 4, pp. 2237–2250, 2016.
 [22] X. Huang, S. Bian, Z. Shao, and Y. Yang, “Predictive switchcontroller association and control devolution for sdn systems,” in Proceedings of IEEE/ACM IWQoS, 2019.
 [23] X. Huang, Z. Shao, and Y. Yang, “Dynamic tuple scheduling with prediction for data stream processing systems,” in Proceedings of IEEE GLOBECOM, 2019.
 [24] J. W. Jiang, T. Lan, S. Ha, M. Chen, and M. Chiang, “Joint VM placement and routing for data center traffic engineering,” in Proceedings of IEEE INFOCOM, 2012.
 [25] S. Kandula, S. Sengupta, A. Greenberg, P. Patel, and R. Chaiken, “The nature of data center traffic: measurements & analysis,” in Proceedings of ACM SIGCOMM, 2009.
 [26] G. P. Katsikas, T. Barbette, D. Kostic, R. Steinert, and G. Q. Maguire Jr, “Metron: NFV service chains at the true speed of the underlying hardware,” in Proceedings of USENIX NSDI, 2018.
 [27] A. Laghrissi and T. Taleb, “A survey on the placement of virtual resources and virtual network functions,” IEEE Communications Surveys & Tutorials, vol. 21, no. 2, pp. 1409–1434, 2018.
 [28] X. Li, X. Wang, F. Liu, and H. Xu, “Dhl: Enabling flexible software network functions with fpga acceleration,” in Proceedings of IEEE ICDCS, 2018.
 [29] Y. Li and M. Chen, “Software-defined network function virtualization: a survey,” IEEE Access, vol. 3, pp. 2542–2553, 2015.
 [30] J. D. Little, “A proof for the queuing formula: L = λW,” Operations Research, vol. 9, no. 3, pp. 383–387, 1961.
 [31] T. Lukovszki and S. Schmid, “Online admission control and embedding of service chains,” in Proceedings of SIROCCO, 2015.
 [32] S. Mehraghdam, M. Keller, and H. Karl, “Specifying and placing chains of virtual network functions,” in Proceedings of IEEE CloudNet, 2014.
 [33] R. Mijumbi, J. Serrat, J.-L. Gorricho, N. Bouten, F. De Turck, and R. Boutaba, “Network function virtualization: state-of-the-art and research challenges,” IEEE Communications Surveys and Tutorials, vol. 18, no. 1, pp. 236–262, 2016.
 [34] M. Mitzenmacher, “The power of two choices in randomized load balancing,” IEEE Transactions on Parallel and Distributed Systems, vol. 12, no. 10, pp. 1094–1104, 2001.
 [35] A. Mohammadkhan, S. Ghapani, G. Liu, W. Zhang, K. Ramakrishnan, and T. Wood, “Virtual function placement and traffic steering in flexible and dynamic software defined networks,” in Proceedings of IEEE LANMAN, 2015.
 [36] S. Nanda, F. Zafari, C. DeCusatis, E. Wedaa, and B. Yang, “Predicting network attack patterns in SDN using machine learning approach,” in Proceedings of IEEE NFV-SDN, 2016.
 [37] M. J. Neely, “Stochastic network optimization with application to communication and queueing systems,” Synthesis Lectures on Communication Networks, vol. 3, no. 1, pp. 1–211, 2010.
 [38] K. Ousterhout, P. Wendell, M. Zaharia, and I. Stoica, “Sparrow: distributed, low latency scheduling,” in Proceedings of ACM SOSP, 2013.
 [39] I. S. Petrov, “Mathematical model for predicting forwarding rule counter values in SDN,” in Proceedings of IEEE EIConRus, 2018.
 [40] C. Pham, N. H. Tran, S. Ren, W. Saad, and C. S. Hong, “Traffic-aware and energy-efficient VNF placement for service chaining: Joint sampling and matching approach,” IEEE Transactions on Services Computing, vol. 13, no. 1, pp. 172–185, 2017.
 [41] Y. Sang, B. Ji, G. R. Gupta, X. Du, and L. Ye, “Provably efficient algorithms for joint placement and allocation of virtual network functions,” in Proceedings of IEEE INFOCOM, 2017.
 [42] M. Savi, M. Tornatore, and G. Verticale, “Impact of processing-resource sharing on the placement of chained virtual network functions,” IEEE Transactions on Cloud Computing, 2019, early access, doi:10.1177/0163443718810910.
 [43] S. Schneider, S. Dräxler, and H. Karl, “Trade-offs in dynamic resource allocation in network function virtualization,” in Proceedings of IEEE GLOBECOM Workshops, 2018.
 [44] A. Sheoran, P. Sharma, S. Fahmy, and V. Saxena, “Contained: an NFV microservice system for containing E2E latency,” ACM SIGCOMM Computer Communication Review, vol. 47, no. 5, pp. 54–60, 2017.
 [45] A. Singla, C.-Y. Hong, L. Popa, and P. B. Godfrey, “Jellyfish: networking data centers, randomly,” in Proceedings of USENIX NSDI, 2012.
 [46] S. J. Taylor and B. Letham, “Forecasting at scale,” The American Statistician, vol. 72, no. 1, pp. 37–45, 2018.
 [47] A. Tootoonchian, A. Panda, C. Lan, M. Walls, K. Argyraki, S. Ratnasamy, and S. Shenker, “ResQ: enabling SLOs in network function virtualization,” in Proceedings of USENIX NSDI, 2018.
 [48] H. Wang and J. Schmitt, “Load balancing - towards balanced delay guarantees in NFV/SDN,” in Proceedings of IEEE NFV-SDN, 2016.
 [49] L. Wang, Z. Lu, X. Wen, R. Knopp, and R. Gupta, “Joint optimization of service function chaining and resource allocation in network function virtualization,” IEEE Access, vol. 4, pp. 8084–8094, 2016.
 [50] T. Wang, H. Xu, and F. Liu, “Multi-resource load balancing for virtual network functions,” in Proceedings of IEEE ICDCS, 2017.
 [51] S. Woo, J. Sherry, S. Han, S. Moon, S. Ratnasamy, and S. Shenker, “Elastic scaling of stateful network functions,” in Proceedings of USENIX NSDI, 2018.
 [52] Y. Xiao, Q. Zhang, F. Liu, J. Wang, M. Zhao, Z. Zhang, and J. Zhang, “NFVdeep: adaptive online service function chain deployment with deep reinforcement learning,” in Proceedings of IEEE/ACM IWQoS, 2019.
 [53] Z. Xu, F. Liu, T. Wang, and H. Xu, “Demystifying the energy efficiency of network function virtualization,” in Proceedings of IEEE/ACM IWQoS, 2016.
 [54] L. Ying, R. Srikant, and X. Kang, “The power of slightly more than one sample in randomized load balancing,” in Proceedings of IEEE INFOCOM, 2015.
 [55] M. S. Yoon and A. E. Kamal, “Nfv resource allocation using mixed queuing network model,” in Proceedings of IEEE GLOBECOM, 2016.
 [56] C. You and L. Li, “Efficient load balancing for the VNF deployment with placement constraints,” in Proceedings of IEEE ICC, 2019.
 [57] C. Zeng, F. Liu, S. Chen, W. Jiang, and M. Li, “Demystifying the performance interference of colocated virtual network functions,” in Proceedings of IEEE INFOCOM, 2018.
 [58] J. Zhang, W. Wu, and J. C. Lui, “On the theory of function placement and chaining for network function virtualization,” in Proceedings of ACM MobiHoc, 2018.
 [59] K. Zhang, B. He, J. Hu, Z. Wang, B. Hua, J. Meng, and L. Yang, “G-NET: effective GPU sharing in NFV systems,” in Proceedings of USENIX NSDI, 2018.
 [60] Q. Zhang, F. Liu, and C. Zeng, “Adaptive interference-aware VNF placement for service-customized 5G network slices,” in Proceedings of IEEE INFOCOM, 2019.
 [61] Q. Zhang, Y. Xiao, F. Liu, J. C. Lui, J. Guo, and T. Wang, “Joint optimization of chain placement and request scheduling for network function virtualization,” in Proceedings of IEEE ICDCS, 2017.
 [62] S. Zhang, L. Huang, M. Chen, and X. Liu, “Proactive serving decreases user delay exponentially,” ACM SIGMETRICS Performance Evaluation Review, vol. 43, no. 2, pp. 39–41, 2015.
 [63] X. Zhang, C. Wu, Z. Li, and F. C. Lau, “Proactive VNF provisioning with multi-timescale cloud resources: fusing online learning and online optimization,” in Proceedings of IEEE INFOCOM, 2017.
 [64] Y. Zhang, Z.-L. Zhang, and B. Han, “HybridSFC: Accelerating service function chains with parallelism,” in Proceedings of IEEE NFV-SDN, 2019.
Appendix A
Proof of Lemma 1
To solve problem (15), we adopt the Lyapunov optimization technique [37]. Let \mathbf{Q}(t) denote the vector of all queue backlogs at time slot t. We define the quadratic Lyapunov function as
L(\mathbf{Q}(t)) \triangleq \frac{1}{2} \sum_{i} Q_i^2(t), \quad (20)
and the Lyapunov drift for two consecutive time slots as
\Delta(\mathbf{Q}(t)) \triangleq \mathbb{E}\left[ L(\mathbf{Q}(t+1)) - L(\mathbf{Q}(t)) \mid \mathbf{Q}(t) \right], \quad (21)
which measures the conditional expected change in the queues’ congestion state over successive time slots. To avoid overloading any queue in the system, it is desirable to make this drift as small as possible. However, striving for short queue lengths alone may incur considerable communication and computation costs. To jointly account for queueing stability and the consequent system cost, we define the drift-plus-penalty function as
\Delta(\mathbf{Q}(t)) + V \, \mathbb{E}\left[ f(t) \mid \mathbf{Q}(t) \right], \quad (22)
where f(t) denotes the total system cost incurred in time slot t, and V is a positive constant that determines the trade-off between queueing stability and total system cost minimization.
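As an illustrative sketch only (not the scheme derived in this appendix), the drift-plus-penalty rule of [37] can be mimicked numerically: given the current backlogs, evaluate the sampled quantity [L(Q(t+1)) - L(Q(t))] + V * cost for each candidate action and pick the smallest. The function names, the queue dynamics, and the cost values below are hypothetical placeholders.

```python
def lyapunov(q):
    """Quadratic Lyapunov function: L(Q(t)) = (1/2) * sum_i Q_i(t)^2."""
    return 0.5 * sum(x * x for x in q)

def drift_plus_penalty(q_now, q_next, cost, V):
    """One-slot drift-plus-penalty sample:
    [L(Q(t+1)) - L(Q(t))] + V * cost(t).
    A larger V weights cost minimization over queueing stability."""
    return (lyapunov(q_next) - lyapunov(q_now)) + V * cost

def best_action(q_now, outcomes, V):
    """Pick the action index minimizing the sampled drift-plus-penalty.
    outcomes: list of (q_next, cost) pairs, one per candidate action."""
    return min(range(len(outcomes)),
               key=lambda i: drift_plus_penalty(q_now, outcomes[i][0],
                                                outcomes[i][1], V))

# Hypothetical example: action 0 lets backlogs grow at zero cost,
# action 1 drains the queues at cost 5. A small V prefers draining;
# a large V tolerates backlog to avoid the cost.
outcomes = [([4, 2], 0.0), ([2, 0], 5.0)]
print(best_action([3, 1], outcomes, V=0.1))   # drains: index 1
print(best_action([3, 1], outcomes, V=10.0))  # saves cost: index 0
```

This mirrors the structure of (22): per slot, the controller need only minimize the bracketed term plus the V-weighted penalty, which is what the case analysis below bounds.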
1) for ,
(24) 
where we define .
2) for and ,
(25) 
3) for and ,
(26) 
By (22)–(26), the boundedness of request arrival numbers and of service capacities on switches and controllers, and according to (2), we have for , and