In recent years, cloud computing has swiftly transformed the IT infrastructure landscape, enabling large cost savings in the deployment of a wide range of IT applications. The main characteristics of cloud computing include resource pooling, elasticity, and metering. Physical resources such as compute nodes, storage nodes, and network fabrics are shared among tenants. Virtual resource elasticity brings the ability to dynamically change the amount of allocated resources, for example as a function of workload or cost. Resource usage is metered, and in most pricing models the tenant pays only for the allocated capacity.
While cloud technology was initially used mostly for IT applications, e.g. web servers, databases, etc., it is rapidly finding its way into new domains. One such domain is the processing of network packets. Today, network services are packaged as physical appliances that are connected together using physical networks. Network services consist of interconnected network functions (NFs). Examples of network functions are firewalls, deep packet inspection, transcoding, etc. A recent initiative from the standardisation body ETSI (European Telecommunications Standards Institute) addresses the standardisation of virtual network services under the name Network Functions Virtualisation (NFV) . The expected benefits are, among others, better hardware utilisation and more flexibility, which translate into reduced capital and operating expenses (CAPEX and OPEX). A number of interesting use cases are found in , and in this technical report we investigate the one referred to as Virtual Network Functions Forwarding Graphs, see Figure 1.
We investigate the allocation of virtual resources to a given packet flow, i.e. what is the most cost-efficient way to allocate VNFs with a given capacity that still provide a network service within a given latency bound? The distilled problem is illustrated as the packet flows in Figure 1. The forwarding graph is implemented as a chain of virtual network nodes, also known as a service chain. To ensure that the capacity of a service chain matches the time-varying load, the number of instances of each individual network function may be scaled up or down.
The contributions of this technical report are
a mathematical model of the virtual resources supporting the packet flows in Figure 1,
the set-up of an optimization problem for controlling the number of machines needed by each function in the service chain,
the solution of the optimization problem, leading to a control scheme for the number of machines needed to guarantee that the end-to-end deadline is met for incoming packets under a constant input flow.
There are a number of well-known and established resource management frameworks for data centers, but few of them explicitly address latency. Sparrow  presents an approach for scheduling a large number of parallel jobs with short deadlines. Its problem domain differs from ours in that we focus on sequential rather than parallel jobs. Chronos  focuses on reducing latency in the communication stack. RT-OpenStack  adds real-time performance to OpenStack through a real-time hypervisor and a timing-aware VM-to-host mapping.
The enforcement of an end-to-end (E2E) deadline for a sequence of jobs executed through a sequence of computing elements has been addressed by several works, possibly under different terminologies. In holistic analysis [6, 7, 8], the schedulability analysis is performed locally. At the global level, the local response times are transformed into jitter or offset constraints for the subsequent tasks.
A second approach to guaranteeing an E2E deadline is to split the constraint into several local deadline constraints. While this approach avoids the iteration of the analysis, it requires an effective splitting method. Di Natale and Stankovic  proposed to split the E2E deadline proportionally to the local computation time, or to divide the slack time equally. Later, Jiang  used time slices to decouple the schedulability analysis of each node, reducing the complexity of the analysis. Such an approach improves the robustness of the schedule and allows each pipeline to be analysed in isolation. Serreli et al. [11, 12] proposed to assign local deadlines so as to minimize a linear upper bound of the resulting local demand bound functions. More recently, Hong et al.  formulated the local deadline assignment problem as a MILP with the goal of maximising the slack time. After local deadlines are assigned, the processor demand criterion can be used to analyse distributed real-time pipelines [14, 12].
In all the mentioned works, jobs have non-negligible execution times. Hence, their delay is caused by the preemption experienced at each function. In our context, the scheduling of virtual network services, jobs are executed non-preemptively and in FIFO order. Hence, the impact of the local computation on the E2E delay of a request is minor compared to the queueing delay. This type of delay is intensively investigated in the networking community in the broad area of queueing systems . In this area, Henriksson et al.  proposed a feedforward/feedback controller that adjusts the processing speed to match a given delay target.
Most works in queueing theory assume a stochastic (usually Markovian) model of job arrivals and service times. A solid contribution to the theory of deterministic queueing systems is due to Baccelli et al. , Cruz , and Parekh & Gallager . These results built the foundation for network calculus , later applied to real-time systems as real-time calculus . The advantage of network/real-time calculus is that, together with an analysis of the E2E delays, the sizes of the queues are also modelled. In the cloud computing scenario the impact of the queues is very relevant, since they are part of the resource usage we aim to minimize; hence we follow this type of modeling.
2 Problem formulation
To analyse the resource management problem described in Section 1, we replace the setting of Figure 1 with the abstract model shown in Figure 2. In our model we consider each VNF simply as a function that processes requests. Within each function a number of machines are running (which in Section 1 would correspond to virtual machines).
2.1 Input model
The service chain is composed of service functions. The -th function, denoted by , receives requests at an incoming rate . The cumulative number of arrived requests is then
We model the incoming requests and the service speeds of each function by a fluid approximation. In fact, in  recent advances in NFV technology were used to process requests at a throughput of about 10 million requests per second. We believe this shows that the discretization error introduced by the fluid approximation is negligible.
Finally, each request needs to pass through the entire service-chain within an end-to-end deadline, denoted .
2.2 Service model
As illustrated in Figure 3, the incoming requests to function are stored in a queue and processed once they reach the head of the queue. Note that, due to the fluid approximation made earlier, our analysis assumes that a request is processed in parallel by all machines present in the function. Again, with requests entering at a rate of millions per second and each request being very small, we believe this is a good abstraction. At time there are machines ready to serve requests, each with a nominal speed of (note that this nominal speed might differ between functions in the service chain, i.e. it does not in general hold that for ). The maximum speed at which function can process requests is thus . The rate at which function processes requests is denoted . The cumulative number of served requests is defined as
At time the number of requests stored in the queue is defined as the queue length :
Each function has a fixed maximum queue capacity , representing the largest number of requests that can be stored at function .
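As a concrete illustration of the fluid model above, the cumulative arrivals, cumulative service, and queue length of a single function can be simulated with a simple Euler discretization. This is only a sketch with assumed names (`arrival_rate`, `service_rate`, `dt`, `horizon`); the report's analysis is analytical, not simulation-based.

```python
def simulate_queue(arrival_rate, service_rate, dt=0.001, horizon=1.0):
    """Fluid approximation of a single function's queue.

    arrival_rate: constant input rate (requests per time unit).
    service_rate: callable t -> current maximum service rate.
    Returns the final queue length and the maximum queue length seen.
    """
    t = 0.0
    arrived = served = queue = max_queue = 0.0
    while t < horizon:
        arrived += arrival_rate * dt
        # the amount served can never exceed the work available
        served += min(service_rate(t) * dt, queue + arrival_rate * dt)
        queue = arrived - served
        max_queue = max(max_queue, queue)
        t += dt
    return queue, max_queue
```

With a service rate permanently above the arrival rate the queue stays empty; with the service off, the backlog grows linearly with time, as in the fluid model.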
The queueing delay depends on the status of the queue as well as on the service rate. We denote by the time taken by a request from when it enters function to when it exits, with , where is the time when the request exits function :
The maximum queueing delay is then . The requirement that a request meets its end-to-end deadline is .
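Stated operationally, the requirement is that the worst-case queueing delays accumulated along the chain must not exceed the E2E deadline. A minimal sketch (the list of per-function delays and the deadline value are assumed inputs):

```python
def e2e_feasible(max_delays, deadline):
    """E2E requirement: the sum of the per-function worst-case
    queueing delays must not exceed the end-to-end deadline."""
    return sum(max_delays) <= deadline
```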
To control the queueing delay, it is necessary to control the service rate of the function. We therefore assume that it is possible to change the maximum service rate of a function by changing the number of machines that are on, i.e. by changing . However, turning on a machine takes time units, and turning off a machine takes time units. Together they account for a time delay, , associated with switching a machine on/off.
In the famous paper , Google profiled where the latency in a data center occurred. They showed that less than 1% of the latency was due to propagation in the network fabric. The other 99% occurred somewhere in the kernel, the switches, the memory, or the application. Since it is very difficult to say exactly which part of this 99% is due to processing and which to queueing, we make the abstraction of considering queueing delay and processing delay together, simply as queueing delay. Hence, once a request has reached the head of the queue and is processed, it immediately exits the function and enters the next function in the chain, or exits the chain if it is the final function. We thus assume that no request is lost in the communication links, and that there is no propagation delay. Therefore, the concatenation of the functions through implies that the input of function is exactly the output of function , for , as illustrated in Figure 2.
2.3 Cost model
To provide guarantees about the behaviour of the service chain, it is necessary to make hard reservations of the resources needed by each function in the chain. This means that when a certain resource is reserved, it is guaranteed to be available for utilisation. Reserving a resource results in a cost, and due to the hard reservation, the cost does not depend on the actual utilisation but only on the amount of resource reserved.
The computation cost per time unit per machine is denoted , and can be seen as the cost of the CPU cycles needed by one machine in . This cost is also incurred during the time delay . Without being too conservative, this time delay can be assumed to occur only when a machine is started. The average computing cost per time unit for the whole function is then
where , and is the left-limit of :
that is, a sequence of Dirac deltas at all points where the number of machines changes. This means that the left-limit of only adds to the computation cost when it is positive, i.e. when a machine is switched on.
The queue cost per time unit per unit of space is denoted , and can be seen as the cost of having a queue with the capacity of one request. This cost comes from the fact that physical storage needs to be reserved to host the queue; normally this would correspond to the RAM of the network card. Reserving a capacity of thus results in a cost per time unit of
2.4 Problem definition
The aim of this technical report is to control the number of machines running at stage , such that the total average cost is minimized, while the E2E constraint is not violated and the maximum queue sizes are not exceeded. This can be posed as the following problem:
A valid lower bound on the cost achieved by any feasible solution of (6) is found by assuming that all functions are capable of providing a service rate exactly equal to the input rate. This is possible by running a fractional number of machines at each function . In such an ideal case, buffers can be of zero size (), and there is no queueing delay () since the service and arrival rates are the same at all functions. Hence, the lower bound on the cost is
This lower bound will be used to assess the quality of the solutions found later on.
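Under assumed symbol names (`r` for the constant input rate, `speeds` for the per-function nominal machine speeds, `sigmas` for the per-machine costs), the lower bound of (7) is simply the cost of running the ideal fractional number of machines at every function. A sketch, not the report's notation:

```python
def cost_lower_bound(r, speeds, sigmas):
    """Ideal cost: every function runs the fractional number of
    machines r / speed, so queues and delays are zero."""
    return sum(sigma * r / s for s, sigma in zip(speeds, sigmas))
```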
3 Machine switching scheme
In presence of an incoming flow of requests at a constant rate , a number
of machines running the function must always stay on. To match the incoming rate , in addition to the machines always on, another machine must be on for some time in order to process a request rate of where is the normalized residual request rate:
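In code, the split of a constant input rate into a number of always-on machines plus a fractional residual, served by one periodically switched extra machine, might look as follows. The names `r` (input rate) and `s` (nominal machine speed) are assumptions for illustration:

```python
import math

def machine_split(r, s):
    """Split a constant input rate r over machines of nominal speed s.

    Returns (n, rho): n machines must always stay on, and rho in [0, 1)
    is the normalized residual rate covered by one extra machine."""
    n = math.floor(r / s)   # machines that must always stay on
    rho = r / s - n         # residual served by the extra machine
    return n, rho
```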
In our scheme, the extra machine is switched on at a desired on-time :
: function switches on the additional machine when the time exceeds .
Since the additional machine does not need to be always on, it can be switched off after some time. The off-switching is also based on a time condition, the desired stop time , i.e. the time instant at which the machine should be switched off, given by:
where is the duration for which the machine should be on, which needs to be determined. The off-switching is then triggered in the following way:
: function switches off the additional machine when the time exceeds .
Note that this control scheme, together with the constant input, results in the extra machine being switched on/off periodically, with a period . We thus assume that the extra machine can process requests for a time every period . The time during each period in which the machine is not processing any requests is denoted . Notice, however, that due to the time delay, the actual time the extra machine consumes power is longer.
In the presence of a constant input, it is straightforward to find the necessary on-time during each period: for the additional machine to provide the residual processing capacity of , its on-time must be such that
With each additional machine being switched on/off periodically, it is also straightforward to find the computation cost for each function. If machines are on for a time , and only machines are on for a time , then the cost of (4) becomes
if . If instead , that is if
then there is no time to switch the additional machine off and then on again. Hence, we keep the last machine on, even if it is not processing packets, and the computing cost becomes
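The two cost expressions can be combined into one piecewise function of the period: if the off-time is shorter than the switching delay, the extra machine is simply kept on. A hedged sketch with assumed symbol names (`sigma` per-machine cost, `n` always-on machines, `rho` residual rate, `delta` switching delay):

```python
def avg_computing_cost(sigma, n, rho, delta, period):
    """Average computing cost per time unit of one function, combining
    the periodic-switching cost with the always-on fallback."""
    t_off = (1.0 - rho) * period
    if t_off <= delta:
        # no time to cycle the extra machine off and on: keep it on
        return sigma * (n + 1)
    # extra machine on for rho*period, plus the start-up delay each period
    return sigma * (n + rho + delta / period)
```

For a long period the cost approaches the ideal `sigma * (n + rho)`; as the period shrinks toward the threshold, the start-up term `delta / period` grows until keeping the machine on becomes cheaper.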
Next, using this control scheme, the optimization problem of (6) is studied under two different sets of assumptions. In Section 4, we approximate the service functions with linear lower bounds, which allows us to find a period for each function. Note that the lower-bound approximation introduces some pessimism in the solution. In Section 5 we assume that every function switches its additional machine on/off with the same period, . For this case we derive the optimal period .
4 Linear approximation of service
In this section, the service functions are approximated by linear lower bounds. This choice allows us to find an explicit solution for the switching period of each function. Inevitably, the solution incurs some pessimism due to the approximation.
If the cumulative served requests (2) is lower-bounded by a linear function, as illustrated in Figure 4, the maximum size of the queue at function is attained exactly when the function switches on its extra machine, :
while the maximum introduced delay is
By setting the variable and constants , , and as
the optimal design problem of (6) can be formulated as
with being the cost lower bound as in (7). First, we check the unconstrained solution, which is
with the multiplier being the unique positive solution of
Finally, the switching-period is given by
and the maximum queue sizes are given by Eq. (14). Notice that, for all such that Eq. (12) holds true, there is physically no time to switch the additional machine off and then on again (). For all these machines the cost is computed as if the machines were always on (as in Eq. (13)), and not by (11).
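Since the multiplier is characterised as the unique positive root of a scalar equation, it can be computed numerically, for example by bisection. The following generic sketch assumes only that a bracketing interval with a sign change is known; the actual function to solve depends on the constants of (16), which are set in math and omitted here:

```python
def bisect_root(f, lo, hi, tol=1e-10):
    """Root of a continuous scalar function f on [lo, hi], assuming
    f(lo) and f(hi) have opposite signs.  A generic way to solve the
    multiplier equation numerically."""
    flo = f(lo)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if (f(mid) > 0.0) == (flo > 0.0):
            lo, flo = mid, f(mid)   # root lies in the upper half
        else:
            hi = mid                # root lies in the lower half
    return 0.5 * (lo + hi)
```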
Let us apply the described design methodology to a simple example of a service chain with two functions. We assume an incoming rate of requests per second with an E2E-deadline of . The parameters of the functions are reported in Table I.
The unconstrained solution of (17) is then given by and . Such a solution, however, violates the E2E deadline constraint since
Therefore, the constrained solution must be explored.
For the constrained solution, the Lagrange multiplier is the solution of (20). From (18), this gives the solution of , and , resulting in the periods and . Note that the off-times for the two functions are and , which are both larger than . Note also that the E2E delay for this solution is exactly the E2E deadline. From Eq. (14) we find that the maximum queue sizes for this solution are and , and from (19) the cost of the solution is . It should be noted that this example is meant to illustrate how the design methodology of this section can be used to find the periods and as well as the maximum queue sizes and . In a real setting the incoming traffic will likely be around million requests per second, .
5 Design of machine-switching period
In the previous section, the service functions were approximated by linear lower bounds, which allowed us to find a period for each function. However, such an approximation leads to an extra cost. In this section, the exact expression of the service functions is considered. Since this exactness increases the complexity, the design problem of (6) is solved while letting every function switch its additional machine on/off with the same period, .
The common period of the schedule, by which every function switches its additional machine on/off, is the only design variable in the optimization problem (6). As proved later in Lemma 1 and Lemma 2, the maximum queue size of any function and the E2E delay are both proportional to the switching period . The intuition behind this fact is that the longer the period, the longer a function has to wait with the additional machine off before turning it on again. During this interval of time, each function accumulates work, and consequently both the maximum queue size and the delay grow with .
Under these hypotheses, the cost function of the optimization problem (6) becomes
where is the lower bound given by (7) and , where is given by Lemma 1. Furthermore, (defined in (12)) represents the value of the period below which it is not feasible to switch the additional machine off and then on again (). In fact, with we pay the full cost of having machines always on.
The deadline constraint in (6) can simply be written as
with opportune constants, given in Lemma 2.
The cost (21) is a continuous function of the single variable . It has to be minimized over the closed interval . Hence, by Weierstraß's extreme-value theorem, it has a minimum. To find this minimum, we simply check all (finitely many) points at which the cost is not differentiable, and the points where the derivative is equal to zero. Let us define the set of all points in at which is not differentiable:
We denote by the number of points in , and by the points of , assumed ordered increasingly . Since the cost is differentiable over the open interval , the minimum may also occur at an interior point of with derivative equal to zero. Let us denote by the set of all interior points of at which the derivative of is zero, that is
Then, the optimal period is given by
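The search just described, evaluating the cost at the interval boundaries, at the non-differentiability points, and at the interior stationary points, can be sketched as follows. The argument names are assumptions; computing the stationary points themselves depends on the exact cost expression:

```python
def optimal_period(cost, kinks, stationary, t_min, t_max):
    """Minimize a continuous one-variable cost over [t_min, t_max]
    by checking the boundaries, the points where the cost is not
    differentiable (kinks), and the interior points with zero
    derivative (stationary)."""
    candidates = {t_min, t_max}
    candidates |= {t for t in kinks if t_min <= t <= t_max}
    candidates |= {t for t in stationary if t_min < t < t_max}
    return min(candidates, key=cost)
```

Because the candidate set is finite and the true minimum of a piecewise-differentiable continuous function must fall on a boundary, a kink, or a stationary point, this exhaustive check is exact.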
As in Section 4, we use an example to illustrate the solution of the optimization problem for a service chain containing two functions. The input to the service chain has a rate of . Every request has an E2E deadline of . The parameters of the two functions are reported in Table I.
The input can be seen as a dummy function preceding , with , , and (from Equations (8)–(9)). Also, as in the example of Section 4, , , and . This in turn leads to and , where is the threshold period for function , as defined in (12). From Lemma 1 it follows that the parameter of the cost function (21) is , while from Lemma 2 the parameters determining the queueing delay introduced by each function are and , which in turn leads to
Since , the set of (22) containing the boundary is
To compute the set of interior points with derivative equal to zero defined in (23), which is needed to compute the period with minimum cost from (24), we must check all intervals with boundaries at two consecutive points in . In the interval the derivative of is never zero. When checking the interval , the derivative is zero at
which, however, falls outside the interval. Finally, when checking the interval the derivative is zero at
Hence, the set of points with derivative equal to zero is . By inspecting the cost at the points in we find that the minimum occurs at , with cost . It should be noted that this solution provides a lower cost than the one found by the linear approximation of Section 4, that is . This, however, is not true in general.
To conclude the example, Figure 5 shows the state-space trajectory of the two queues. There one can see how the two queues grow and shrink depending on which of the two functions has its additional machine on. Again, it should be noted that this example is meant to illustrate how the design methodology of this section can be used to find the best period . In a real setting the incoming traffic will likely be around million requests per second, .
Next we derive the expression of the maximum queue size as a function of the switching period .
The maximum queue size at function is
with as defined in (9), and being the period of the switching scheme, common to all functions.
The queue size over time is a continuous, piecewise-linear function, since both the input and the service rates are piecewise constant and the queue size is defined by Eq. (3). Hence, if the function attains its maximum value at , it must necessarily hold that in a left-neighbourhood of and in a right-neighbourhood of .
To find the value of , one needs to distinguish among four possible cases, Case (1a), Case (1b), Case (2a), and Case (2b), depending on the nominal speeds and , as shown in Table II. These cases, in turn, determine the sign of , as summarised in Table III. Note that for , one should consider the input as with , leading to and , which then belongs to Case (2b).
Next, the maximum queue size is derived for each case. We also derive the best time for each function to start its additional machine, i.e. .
For this case, illustrated in Figure 6, the sign of shown in Table III, implies that grows only when and . From this condition, the -th queue can start to decrease either when or . In the first case, the rate of decrease is
and such a state lasts for (during the interval of length in Figure 6). This therefore yields a local maximum of:
It is easy to verify that a later on-time yields a larger local maximum, while an earlier one yields a negative queue size. The given is thus optimal, and can be expressed relative to as:
On the other hand, the local maximum when is determined by the interval of length , as shown in Figure 6, that is
By taking the maximum of the two local maxima, we find
As shown in Table III, the queue size grows if and only if machines are running within function . The maximum queue size is then attained at the instant when such a machine is switched off. To analyse this case, we distinguish between two sub-cases: (illustrated in Figure 7) and (Figure 8). In both cases, to minimize , the function must start its extra machine at the same time as starts its additional machine, in order to reduce the rate of growth of the -th queue, i.e.
Note that the queue size for function will therefore be zero when it switches on the additional machine,
This case is essentially the same as Case (1b). As shown by Table III, the only difference is that is reduced whenever has its extra machine on, and grows whenever it is off. This implies that the maximum queue size is attained when switches on the extra machine. To minimize , the queue size of should therefore be such that the queue is empty when it switches off the additional machine. Note that this corresponds to both and switching off their additional machines simultaneously (compare with Case (1b), where the two functions switch on their additional machines simultaneously). The time when should switch on its additional machine is thus:
Note that for this case we have to consider both and when computing :
The maximum queue size is, as stated earlier, found when switches on its extra machine. By considering and together, the expression for can be combined into:
Table III shows the similarity between this case and Case (1a), the difference being that in this case only shrinks when and . Therefore, will always grow when