Optimal Algorithms for Right-Sizing Data Centers - Extended Version

07/13/2018 · by Susanne Albers et al. · Technische Universität München

Electricity cost is a dominant and rapidly growing expense in data centers. Unfortunately, much of the consumed energy is wasted because servers are idle for extended periods of time. We study a capacity management problem that dynamically right-sizes a data center, matching the number of active servers with the varying demand for computing capacity. We resort to a data-center optimization problem introduced by Lin, Wierman, Andrew and Thereska that, over a time horizon, minimizes a combined objective function consisting of operating cost, modeled by a sequence of convex functions, and server switching cost. All prior work addresses a continuous setting in which the number of active servers, at any time, may take a fractional value. In this paper, we investigate for the first time the discrete data-center optimization problem where the number of active servers, at any time, must be integer valued. Thereby we seek truly feasible solutions. First, we show that the offline problem can be solved in polynomial time. Our algorithm relies on a new, yet intuitive graph theoretic model of the optimization problem and performs binary search in a layered graph. Second, we study the online problem and extend the algorithm Lazy Capacity Provisioning (LCP) by Lin et al. to the discrete setting. We prove that LCP is 3-competitive. Moreover, we show that no deterministic online algorithm can achieve a competitive ratio smaller than 3. We develop a randomized online algorithm that is 2-competitive against an oblivious adversary and prove that 2 is a lower bound for the competitive ratio of randomized online algorithms. Finally, we address the continuous setting and give a lower bound of 2 on the best competitiveness of online algorithms. All lower bounds mentioned above also hold in a problem variant with more restricted operating cost functions, introduced by Lin et al.


1 Introduction

Energy conservation in data centers is a major concern for both operators and the environment. In the U.S., about 1.8% of the total electricity consumption is attributed to data centers [22]. In 2015, more than 416 TWh (terawatt hours) were used by the world’s data centers, which exceeds the total power consumption in the UK [7]. Electricity cost is a significant expense in data centers [9]; about 18–28% of their budget is invested in power [13, 8]. Remarkably, the servers of a data center are only utilized 20–40% of the time on average [4, 6]. Even worse, servers in the active state consume about half of their peak power even when they are idle [21]. Hence, a promising approach for energy conservation and capacity management is to transition idle servers into low-power sleep states. However, state transitions, and in particular power-up operations, also incur energy/cost. Therefore, dynamically matching the number of active servers with the varying demand for computing capacity is a challenging optimization problem. In essence, the goal is to right-size a data center over time so as to minimize energy and operation costs.

Problem Formulation. We investigate a basic algorithmic problem with the objective of dynamically resizing a data center. Specifically, we adopt a framework that was introduced by Lin, Wierman, Andrew and Thereska [17, 19] and further explored, for instance, in [2, 3, 5, 1, 23, 18, 20].

Consider a data center with $m$ homogeneous servers, each of which has an active state and a sleep state. An optimization is performed over a discrete, finite time horizon consisting of time steps $t = 1, \ldots, T$. At any time $t$, $1 \le t \le T$, a non-negative convex cost function $f_t$ models the operating cost of the data center. More precisely, $f_t(x_t)$ is the incurred cost if $x_t$ servers are in the active state at time $t$, where $0 \le x_t \le m$. This operating cost captures, e.g., energy cost and service delay for an incoming workload, depending on the number of active servers. Furthermore, at any time $t$ there is a switching cost, taking into account that the data center may be resized by changing the number of active servers. This switching cost is equal to $\beta (x_t - x_{t-1})^+$, where $\beta$ is a positive real constant and $(z)^+ = \max(0, z)$. Here we assume that transition cost is incurred when servers are powered up from the sleep state to the active state. A cost of powering down servers may be folded into this cost. The constant $\beta$ incorporates, e.g., the energy needed to transition a server from the sleep state to the active state, as well as delays resulting from a migration of data and connections. We assume that at the beginning and at the end of the time horizon all servers are in the sleep state, i.e. $x_0 = x_{T+1} = 0$. The goal is to determine a vector $X = (x_1, \ldots, x_T)$, called a schedule, specifying at any time $t$ the number $x_t$ of active servers, that minimizes

$$\sum_{t=1}^{T} f_t(x_t) + \beta \sum_{t=1}^{T} (x_t - x_{t-1})^+ . \qquad\qquad (1)$$

In the offline version of this data-center optimization problem, the convex functions $f_t$, $1 \le t \le T$, are known in advance. In the online version, the $f_t$ arrive over time. At time $t$, function $f_t$ is presented. Recall that the operating cost at time $t$ depends, for instance, on the incoming workload, which becomes known only at time $t$.

All previous work on the data-center optimization problem assumes that the server numbers $x_t$, $1 \le t \le T$, may take fractional values. That is, $x_t$ may be an arbitrary real number in the range $[0, m]$. From a practical point of view this is acceptable because a data center has a large number of machines. Nonetheless, from an algorithmic and optimization perspective, the proposed algorithms do not compute feasible solutions. Important questions remain if the $x_t$ are indeed integer valued: (1) Can optimal solutions be computed in polynomial time? (2) What is the best competitive ratio achievable by online algorithms? In this paper, we present the first study of the data-center optimization problem assuming that the $x_t$ take integer values and, in particular, settle questions (1) and (2).
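To make the objective concrete, here is a minimal sketch (ours, not part of the formal model; all names are invented for illustration) that evaluates the cost (1) of an integral schedule, with $x_0 = 0$ and free power-downs:

```python
# Minimal sketch (ours) of objective (1): total cost of an integral schedule
# x_1, ..., x_T given operating-cost functions f_1, ..., f_T and the
# switching-cost constant beta. Power-ups cost beta each; power-downs are free.

def schedule_cost(fs, xs, beta):
    """fs: list of T convex functions; xs: list of T integers in [0, m]."""
    prev, total = 0, 0.0            # x_0 = 0: all servers asleep initially
    for f, x in zip(fs, xs):
        total += f(x)                       # operating cost f_t(x_t)
        total += beta * max(0, x - prev)    # switching cost beta * (x_t - x_{t-1})^+
        prev = x
    return total

# Toy example: T = 3 steps with convex operating costs and beta = 2.
fs = [lambda x, lam=lam: (x - lam) ** 2 for lam in (1, 3, 2)]
print(schedule_cost(fs, [1, 3, 2], beta=2.0))   # operating cost 0, switching cost 6.0
```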

Previous Work. As indicated above, all prior work on the data-center optimization problem assumes that the $x_t$, $1 \le t \le T$, may take fractional values in $[0, m]$. First, Lin et al. [19] consider the offline problem. They develop an algorithm based on a convex program that computes optimal solutions. Second, Lin et al. [19] study the online problem. They devise a deterministic algorithm called Lazy Capacity Provisioning (LCP) and prove that it achieves a competitive ratio of exactly 3. Algorithm LCP, at any time $t$, computes a lower bound and an upper bound on the number of active servers by considering two scenarios in which the switching cost is charged, either when a server is powered up or when it is powered down. The LCP algorithm lazily stays within these two bounds. The tight bound of 3 on the competitiveness of LCP also holds if the algorithm has a finite prediction window $w$, i.e. at time $t$ it knows the current as well as the next $w$ arriving functions. Furthermore, Lin et al. [19] perform an experimental study with two real-world traces evaluating the savings resulting from right-sizing in data centers.
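As a rough illustration of the "lazy" step only (the bounds themselves come from offline subproblems that are not reproduced here), the following sketch reflects our own reading and uses invented names:

```python
# Schematic sketch (ours) of the lazy step of LCP: given precomputed bounds
# lower_t <= upper_t on the number of active servers at time t, the algorithm
# moves the previous value only as far as needed to lie inside the interval,
# i.e. it projects x_{t-1} onto [lower_t, upper_t].

def lcp_step(prev: int, lower: int, upper: int) -> int:
    return max(lower, min(upper, prev))

# Hypothetical bound sequences, just to show the lazy behaviour.
lowers, uppers = [0, 1, 1, 0], [2, 3, 2, 1]
x, trajectory = 0, []
for lo, up in zip(lowers, uppers):
    x = lcp_step(x, lo, up)
    trajectory.append(x)
print(trajectory)   # [0, 1, 1, 1]
```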

Bansal et al. [5] presented a 2-competitive online algorithm and showed that no deterministic or randomized online strategy can attain a competitiveness smaller than 1.86. Recently, Antoniadis and Schewior [3] improved the lower bound to 2. Bansal et al. [5] also gave a 3-competitive memoryless algorithm and showed that this is the best competitive factor achievable by a deterministic memoryless algorithm. The data-center optimization problem is an online convex optimization problem with switching costs. Andrew et al. [1] showed that there is an algorithm with sublinear regret, but that constant competitiveness and sublinear regret cannot be achieved simultaneously. Antoniadis et al. [2] examine generalized online convex optimization, where the values selected by an algorithm may be points in a metric space, and relate it to convex body chasing.

Further work on energy conservation in data centers includes, for instance, [14, 15]. Khuller et al. [14] introduce a machine activation problem. There exists an activation cost budget, and jobs have to be scheduled on the selected, activated machines so as to minimize the makespan. They present algorithms that simultaneously approximate the budget and the makespan. A second paper by Li and Khuller [15] considers a generalization where the activation cost of a machine is a non-decreasing function of the load. In the more applied computer science literature, power management strategies and the value of sleep states have been studied extensively. The papers focus mostly on experimental evaluations. Articles that also present analytic results include [10, 11, 12].

Our Contribution. We conduct the first investigation of the discrete data-center optimization problem, where the values $x_t$, specifying the number of active servers at any time $t$, must be integer valued. Thereby, we seek truly feasible solutions.

First, in Section 2 we study the offline problem. We show that optimal solutions can be computed in polynomial time. Our algorithm is different from the convex optimization approach by Lin et al. [19]. We propose a new, yet natural graph-based representation of the discrete data-center optimization problem. We construct a grid-structured graph containing a vertex $v_{t,j}$ for each time step $t$ and each server count $j$, $0 \le j \le m$. Edges represent right-sizing operations, i.e. changes in the number of active servers, and are labeled with operating and switching costs. An optimal solution could be determined by a shortest path computation. However, the resulting algorithm would have a pseudo-polynomial running time. Instead, we devise an algorithm that improves solutions iteratively using binary search. In each iteration the algorithm uses only a constant number of graph layers. The resulting running time is $O(T \log m)$.

The remaining paper focuses on the online problem and develops tight bounds on the competitiveness. In Section 3 we adapt the LCP algorithm by Lin et al. [19] to the discrete data-center optimization problem. We prove that LCP is 3-competitive, as in the continuous setting. We remark that our analysis is different from that by Lin et al. [19]. Specifically, our analysis exploits the discrete structure of the problem and identifies respective properties. The analysis by Lin et al. [19] relates to their convex optimization approach that characterizes optimal solutions in the continuous setting.

In Section 4 we develop a randomized online algorithm which is 2-competitive against an oblivious adversary. It is based on the algorithm of Bansal et al. [5] that achieves a competitive ratio of 2 in the continuous setting. Our algorithm works as follows. First, it extends the given discrete problem instance to the continuous setting. Then, it calculates a 2-competitive fractional schedule by using the algorithm of Bansal et al. Finally, it rounds the fractional schedule randomly to obtain an integral schedule. By using the right rounding technique, it can be shown that the resulting schedule is 2-competitive with respect to the original discrete problem instance.
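For illustration, one natural online rounding scheme uses a single random threshold shared by all time steps; this is a sketch in our own notation (under a linear-interpolation extension of the operating cost functions) and not necessarily the exact rounding analyzed in Section 4:

```python
# Sketch of threshold rounding for a fractional schedule (our illustration,
# not necessarily the exact scheme of Section 4): draw one uniform threshold
# gamma up front and round every fractional value x_t to floor(x_t) + 1 iff
# its fractional part exceeds gamma. With a single shared gamma,
# E[round(x_t)] = x_t and, since the rounded values move in the same
# direction as the fractional ones, the expected switching cost
# E[(round(x_t) - round(x_{t-1}))^+] equals (x_t - x_{t-1})^+.

import math
import random

def round_schedule(fractional, gamma=None):
    if gamma is None:
        gamma = random.random()  # one threshold shared by all time steps
    return [math.floor(x) + (1 if x - math.floor(x) > gamma else 0)
            for x in fractional]

print(round_schedule([0.0, 1.4, 2.7, 2.7, 0.3], gamma=0.5))  # -> [0, 1, 3, 3, 0]
```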

In Section 5 we devise lower bounds. We prove that no deterministic online algorithm can obtain a competitive ratio smaller than 3. Hence, LCP achieves an optimal competitive factor. Interestingly, while LCP does not attain an optimal competitiveness in the continuous data-center optimization problem (where the $x_t$ may take fractional values), it does so in the discrete problem (with respect to deterministic algorithms). We prove that the lower bound of 3 on the best possible competitive ratio also holds for a more restricted setting, originally introduced by Lin et al. [17] in the conference publication of their paper. Specifically, the problem is to find a vector $X = (x_1, \ldots, x_T)$ that minimizes

$$\sum_{t=1}^{T} x_t \, f(\lambda_t / x_t) + \beta \sum_{t=1}^{T} (x_t - x_{t-1})^+ \qquad\qquad (2)$$

subject to $x_t \ge \lambda_t$, for $t = 1, \ldots, T$. Here $\lambda_t$ is the incoming workload at time $t$ and $f$ is a non-negative convex function representing the operating cost of a single server running with load $l$. Since $f$ is convex, it is optimal to distribute the jobs equally among all active servers, so that the operating cost at time $t$ is $x_t f(\lambda_t / x_t)$. This problem setting is more restricted in that there is only a single function $f$ modeling operating cost over the time horizon. Nonetheless, it is well motivated by real data-center environments.
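The equal-distribution argument can be checked numerically; the following tiny sketch (ours, with an arbitrary convex per-server cost) compares the equal split against one unbalanced split:

```python
# Tiny numerical illustration (ours) of the claim that, for convex per-server
# cost f, distributing the workload lambda_t equally over x_t active servers
# minimizes the total operating cost sum_i f(load_i).

f = lambda load: load ** 2 + 0.1        # an arbitrary convex single-server cost
lam, x = 3.0, 4                          # workload and number of active servers

equal = x * f(lam / x)                   # equal split: x_t * f(lambda_t / x_t)
skewed = f(1.5) + f(1.0) + f(0.5) + f(0.0)  # some other split of lambda_t = 3
print(equal <= skewed)                   # True
```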

Furthermore, in Section 5 we address the continuous data-center optimization problem and prove that no deterministic online algorithm can achieve a competitive ratio smaller than 2. The same result was shown by Antoniadis and Schewior [3]. We develop an independent proof that can again be extended to the more restricted optimization problem stated in (2), i.e. the lower bound of 2 on the best competitiveness holds in this setting as well.

In addition, we show that there is no randomized online algorithm with a competitive ratio smaller than 2, so our randomized online algorithm presented in Section 4 is optimal. The construction of the lower bound uses some results of the lower bound proof for the continuous setting. Again, we show that the lower bound holds for the more restricted model.

Finally, in Section 5 we analyze online algorithms with a finite prediction window, i.e. at time $t$ an online algorithm knows the current as well as the next $w$ arriving functions. We show that all our lower bounds, for both settings (continuous and discrete) and both models (general and restricted), still hold.

2 An optimal offline algorithm

In this section we study the offline version of the discrete data-center optimization problem. We develop an algorithm that computes optimal solutions in $O(T \log m)$ time.

Figure 1: Construction of the graph.

2.1 Graph-based approach

Our algorithm works with an underlying directed, weighted graph $G$ that we describe first. For each time step $t$, $1 \le t \le T$, and each $j$, $0 \le j \le m$, there is a vertex $v_{t,j}$, representing the state that exactly $j$ servers are active at time $t$. Furthermore, there are two vertices $v_{0,0}$ and $v_{T+1,0}$ for the initial and final states $x_0 = 0$ and $x_{T+1} = 0$. For each $t$, $2 \le t \le T$, and each pair $j, j'$ with $0 \le j, j' \le m$, there is a directed edge from $v_{t-1,j}$ to $v_{t,j'}$ having weight $\beta (j' - j)^+ + f_t(j')$. This edge weight corresponds to the switching cost when changing the number of servers between time $t-1$ and $t$ and to the operating cost incurred at time $t$. Similarly, for $t = 1$ and each $j$, $0 \le j \le m$, there is a directed edge from $v_{0,0}$ to $v_{1,j}$ with weight $\beta j + f_1(j)$. Finally, for $t = T+1$ and each $j$, $0 \le j \le m$, there is a directed edge from $v_{T,j}$ to $v_{T+1,0}$ of weight 0. The structure of $G$ is depicted in Figure 1.

In the following, for each $j$, $0 \le j \le m$, the vertex set $\{v_{t,j} \mid 1 \le t \le T\}$ is called row $j$. For each $t$, $1 \le t \le T$, the vertex set $\{v_{t,j} \mid 0 \le j \le m\}$ is called column $t$.

A path between $v_{0,0}$ and $v_{T+1,0}$ represents a schedule. If the path visits $v_{t,j}$, then $j$ servers are active at time $t$. Note that a path visits exactly one vertex in each column $t$, $1 \le t \le T$, because the directed edges connect adjacent columns. The total length (weight) of a path is equal to the cost of the corresponding schedule. An optimal schedule can be determined using a shortest path computation, which takes $O(Tm^2)$ time in the particular graph $G$. However, this running time is not polynomial, because the encoding length of an input instance is linear in $T$ and $\log m$, in addition to the encoding of the functions $f_t$.
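For concreteness, here is a compact sketch (ours) of that pseudo-polynomial shortest-path computation, written as a dynamic program over the columns of $G$; it serves only as a baseline for the polynomial algorithm developed below:

```python
# Sketch (ours) of the pseudo-polynomial shortest-path / dynamic program over
# the layered graph: dp[j] is the cheapest cost of a schedule prefix ending
# with j active servers at the current time step. Runtime O(T * m^2).

def optimal_schedule(fs, m, beta):
    INF = float("inf")
    dp = [0.0] + [INF] * m            # time 0: x_0 = 0 active servers
    parents = []
    for f in fs:                      # one column of the graph per time step
        new_dp, choice = [INF] * (m + 1), [0] * (m + 1)
        for j in range(m + 1):            # j servers active at the current step
            for i in range(m + 1):        # i servers active at the previous step
                c = dp[i] + beta * max(0, j - i) + f(j)
                if c < new_dp[j]:
                    new_dp[j], choice[j] = c, i
        parents.append(choice)
        dp = new_dp
    j = min(range(m + 1), key=lambda k: dp[k])    # final power-down is free
    cost, schedule = dp[j], []
    for choice in reversed(parents):              # backtrack the optimal path
        schedule.append(j)
        j = choice[j]
    return list(reversed(schedule)), cost

fs = [lambda x, lam=lam: (x - lam) ** 2 for lam in (1, 3, 2, 0)]
print(optimal_schedule(fs, m=4, beta=1.5))
```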

In the following, we present a polynomial time algorithm that improves an initial schedule iteratively using binary search. In each iteration the algorithm constructs and uses only a constant number of rows of $G$.

2.2 Polynomial time algorithm

An instance of the data-center optimization problem is defined by the tuple $(m, T, \beta, F)$ with $F = (f_1, \ldots, f_T)$. We assume that $m$ is a power of two. If this is not the case, we can transform the given problem instance to $(\tilde m, T, \beta, \tilde F)$ with $\tilde m := 2^{\lceil \log_2 m \rceil}$ and

$$\tilde f_t(x) = \begin{cases} f_t(x) & \text{if } 0 \le x \le m \\ f_t(m) + (x - m)\big(f_t(m) - f_t(m-1) + \varepsilon\big) & \text{if } m < x \le \tilde m \end{cases}$$

with an arbitrary constant $\varepsilon > 0$. The term $(x - m)(f_t(m) - f_t(m-1))$ ensures that $\tilde f_t$ is a convex function, since the greatest slope of $f_t$ is $f_t(m) - f_t(m-1)$. The inequality $f_t(x) - f_t(x-1) \le f_t(m) - f_t(m-1)$ holds because $f_t$ is convex for all $t$. The additional term $(x - m)\varepsilon$ ensures that it is adverse to use a state $x > m$, because the cost of the state $m$ is always smaller.
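A small sketch (ours) of this padding step; the constant EPS is an arbitrary positive value, not a constant taken from the paper:

```python
# Sketch (ours) of padding an instance so that the number of servers is a
# power of two: states above the original m get a steep, convex linear
# penalty, so an optimal schedule never uses them.

EPS = 1.0   # arbitrary positive constant (our choice)

def pad_instance(fs, m):
    m_pad = 1
    while m_pad < m:
        m_pad *= 2                      # next power of two >= m
    def extend(f):
        slope = f(m) - f(m - 1) + EPS   # exceeds the greatest slope of f
        return lambda x: f(x) if x <= m else f(m) + (x - m) * slope
    return [extend(f) for f in fs], m_pad

fs_pad, m_pad = pad_instance([lambda x: (x - 2) ** 2], m=3)
print(m_pad, fs_pad[0](3), fs_pad[0](4))   # prints 4 1 3
```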

Our algorithm uses $\log_2 m - 1$ iterations, denoted reversely by $k = \log_2 m - 2$ for the first iteration and $k = 0$ for the last iteration. The states used in iteration $k$ are always multiples of $2^k$. For the first iteration we use the rows $0$, $m/4$, $m/2$, $3m/4$ and $m$, so that the graph of the first iteration contains the vertices

$$V^{\log_2 m - 2} = \big\{ v_{t,j} \mid 1 \le t \le T,\ j \in \{0, m/4, m/2, 3m/4, m\} \big\} \cup \{v_{0,0}, v_{T+1,0}\}.$$

The optimal schedule for this simplified problem instance can be calculated in $O(T)$ time, since each column contains only five states. Given an optimal schedule $\hat X^{k+1} = (\hat x_1^{k+1}, \ldots, \hat x_T^{k+1})$ of iteration $k+1$, let

$$S_t^{k} = \big\{ \hat x_t^{k+1} - 2^{k+1},\ \hat x_t^{k+1} - 2^{k},\ \hat x_t^{k+1},\ \hat x_t^{k+1} + 2^{k},\ \hat x_t^{k+1} + 2^{k+1} \big\} \cap \{0, \ldots, m\}$$

be the states used in the $t$-th column of the next iteration $k$. Thus the iteration $k$ uses the vertex set

$$V^{k} = \big\{ v_{t,j} \mid 1 \le t \le T,\ j \in S_t^{k} \big\} \cup \{v_{0,0}, v_{T+1,0}\}.$$

Note that the states $\hat x_t^{k+1} - 2^{k+1}$, $\hat x_t^{k+1}$ and $\hat x_t^{k+1} + 2^{k+1}$ were already used in iteration $k+1$ and we just insert the intermediate states $\hat x_t^{k+1} - 2^{k}$ and $\hat x_t^{k+1} + 2^{k}$. If $\hat x_t^{k+1} = 0$ (or $\hat x_t^{k+1} = m$), then $\hat x_t^{k+1} - 2^{k+1}$ (or $\hat x_t^{k+1} + 2^{k+1}$) leads to negative states (or to states larger than $m$), thus the set is intersected with $\{0, \ldots, m\}$ to ensure that we only use valid states.

The last iteration ($k = 0$) provides an optimal schedule for the original problem instance, as shown in the next section. The runtime of the algorithm is $O(T \log m)$ and thus polynomial.
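The following sketch (ours) mirrors this refinement under our reading of the candidate sets above (five states per column: the previous optimum and its neighbours at distances $2^k$ and $2^{k+1}$, clipped to $\{0, \ldots, m\}$); it assumes $m$ is a power of two and reuses a small shortest-path routine over the restricted columns. Its exactness rests on the structural lemmas of Section 2.3.

```python
# Sketch (ours) of the iterative refinement: solve a shortest-path problem
# restricted to a few candidate states per time step, then halve the step
# size and re-center the candidates around the current optimum.

def best_schedule(fs, beta, candidates):
    """candidates[t]: iterable of allowed server counts at time step t+1."""
    dp = {0: 0.0}                          # time 0: x_0 = 0
    parents = []
    for f, cand in zip(fs, candidates):
        new_dp, choice = {}, {}
        for j in cand:
            i = min(dp, key=lambda p: dp[p] + beta * max(0, j - p))
            new_dp[j] = dp[i] + beta * max(0, j - i) + f(j)
            choice[j] = i
        parents.append(choice)
        dp = new_dp
    j = min(dp, key=dp.get)                # final power-down is free
    xs, cost = [], dp[j]
    for choice in reversed(parents):       # backtrack
        xs.append(j)
        j = choice[j]
    return list(reversed(xs)), cost

def refine(fs, m, beta):
    step = max(1, m // 4)
    cands = [range(0, m + 1, step)] * len(fs)        # first iteration: 5 coarse rows
    xs, cost = best_schedule(fs, beta, cands)
    while step > 1:
        step //= 2                                   # next iteration: halve the grid
        cands = [sorted({min(m, max(0, x + d))
                         for d in (-2 * step, -step, 0, step, 2 * step)})
                 for x in xs]
        xs, cost = best_schedule(fs, beta, cands)
    return xs, cost

fs = [lambda x, lam=lam: (x - lam) ** 2 + x for lam in (3, 8, 8, 2, 0)]
print(refine(fs, m=8, beta=4.0))
```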

2.3 Correctness

To prove the correctness of the algorithm described in the previous section we have to introduce some definitions:

Given the original problem instance , we define (with ) as the data-center optimization problem where we are only allowed to use the states that are multiples of . Let , so is a feasible schedule for if holds for all . To express as a tuple, we need another tuple element called which describes the allowed states, i.e. for all . The original problem instance can be written as and . Note that . Let denote an optimal schedule for . In general, for any given problem instance , let , so .

Instead of using only states that are multiples of $2^k$, we can also scale a given problem instance as follows. Let

with , and . Given a schedule for with cost , the corresponding schedule for has exactly the same cost, i.e. . Note that the problem instance uses all integral states less than or equal to , so there are no gaps.

Furthermore, we introduce a continuous version of any given problem instance where fractional schedules are allowed. Let with be the continuous extension of the problem instance , where , and

$$\hat f_t(x) = f_t(\lfloor x \rfloor) + (x - \lfloor x \rfloor)\big(f_t(\lceil x \rceil) - f_t(\lfloor x \rfloor)\big) \qquad\qquad (3)$$

The operating cost of the fractional states is linearly interpolated, thus

is convex for all . Let be an optimal schedule for .
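As a small illustration (ours) of the linear interpolation in (3):

```python
# Sketch (ours) of the continuous extension in (3): the operating cost of a
# fractional state is linearly interpolated between its two neighbouring
# integral states, so the extension agrees with f on integers and stays convex.

import math

def continuous_extension(f):
    def f_hat(x):
        lo, hi = math.floor(x), math.ceil(x)
        return f(lo) + (x - lo) * (f(hi) - f(lo))
    return f_hat

f_hat = continuous_extension(lambda j: (j - 2) ** 2)
print(f_hat(1.25), f_hat(3))   # 0.75 1
```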

The set of all optimal schedules for a given problem instance is denoted by . Let be the cost during the time interval . We define , so .

Now, we are able to prove the correctness of our algorithm. We begin with a simple lemma showing the relationship between the functions and .

Lemma 1.

The problem instances and are equivalent.

Proof.

We begin with and simply apply the definitions of , and .

Afterwards, we use the definitions of , and and get as shown below:

The next technical lemma will be needed later. Informally, it demonstrates that optimal solutions of the reduced discrete problem and the above continuous problem behave similarly.

Lemma 2.

Let be an optimal schedule for with . There exists an optimal solution such that

(4)

holds for all with or .

Proof.

Let be the greatest state that minimizes and let be the smallest state that minimizes . Let be an arbitrary optimal solution. We will show that it is possible to modify such that it fulfills equation (4) without increasing the cost. The modified schedule is denoted by . We distinguish several cases according to the relations of and :


Equation (4) is fulfilled.


If , then using instead of would lead to a better solution, because is a convex function and the switching costs between the time slots and are not increased, so must be fulfilled. If , then would lead to a better solution for the same reason, so .


If , then using instead of would lead to a better solution, so must be fulfilled.
Case 1:
We set , so equation (4) is fulfilled. Since the cost of is not increased.
Case 2:
We set which does not increase the cost of because and . We have . If , then would lead to a better solution, so . We set , so equation (4) is fulfilled. Since the cost of is not increased.


Case 1:
We have . If , then would lead to a better solution, so . We set , so equation (4) is fulfilled. Since the cost of is not increased.
Case 2: and
There exists a state with . If , then using instead of would lead to a better solution, so must be fulfilled. We set , so equation (4) is fulfilled. Since the cost of is not increased.
Case 3: and
There exists a state with . If , then using instead of would lead to a better solution, so . We set , so equation (4) is fulfilled. Since the cost of is not increased.


If , then would lead to a better solution, so . If , then would lead to a better solution, so . We set , so equation (4) is fulfilled. Since

the cost of is not increased.


Equation (4) is fulfilled.


This case is symmetric to case 1. ∎

By using Lemma 2, we can show that an optimal solution for a discrete problem instance cannot be very far from an optimal solution of the continuous problem instance .

Lemma 3.

Let be an arbitrary optimal schedule for with . There exists an optimal schedule for such that holds for all . Formally,

Proof.

To get a contradiction, we assume that there exists a with such that for all optimal schedules there is at least one with . Let be arbitrary. Given the schedule , let be the inclusion-maximal time intervals such that holds for all and the sign of remains the same during . The set of all with is denoted by . If is empty, then the condition is fulfilled for all . We divide into the disjoint sets and such that contains the intervals where is positive and the others.

Given a schedule , the corresponding interval set is denoted by , the set of all time slots by , and the number of time slots in by .

We will use a recursive transformation that reduces at least by one in each step, while the cost of is not increased. Formally, we have to show that and hold. The first inequality ensures that the recursive procedure will terminate. The transformation described below will produce fractional schedules; however, for each it is ensured that . Therefore, if , the corresponding schedule fulfills and for all .

To describe the transformation, we will use the following notation: A given schedule with is transformed to .

Let . We distinguish two cases: in case 1 we handle the intervals in , i.e. holds for all , and in case 2 we handle the intervals in , i.e. . We will handle case 1 first.

Let with and be the smallest value that is divisible by and greater than or equal to . The schedule is transformed to with

where is as small as possible such that holds for all , so at least one time slot satisfies this condition with equality. This transformation ensures that holds, because the interval is split into at least two intervals and one time slot () between them that fulfills .

We still have to show that the total cost is not increased by this operation. The total cost can be written as

(5)

We have and .

Consider the time slot . By the definition of the interval , the condition is fulfilled. Thus we can apply Lemma 2 which says that the terms and are both either non-negative or non-positive, so in Equation (5) the term can be replaced by or zero, respectively. Analogously, for the time slot , the condition is fulfilled, so by Lemma 2 the term in Equation (5) can be replaced by or zero. In the former cases, the cost function is

Given a schedule , we define and . Since there is no summand that contains both and , the function