1 Introduction
Recently, the demand for big data processing has promoted the popularity of cloud computing platforms due to their reliability, scalability, and security [1, 2, 3]. Handling big data applications requires a unique system-level design since these applications, more often than not, cannot be processed by a single PC, server, or even a datacenter (DC). To this end, modern parallel and distributed processing systems (e.g., [4, 5, 6]) have been developed. In this work, we propose a framework for allocating big data applications, represented as graph jobs, in geo-distributed cloud networks (GDCNs), explicitly considering the power consumption of the DCs. In the graph job model, each node denotes a subtask of a big data application, while the edges impose the required communication constraints among the subtasks, as discussed later.
1.1 Related Work
There is a body of literature devoted to task and resource allocation in contemporary cloud networks, e.g., [7, 8, 9, 10, 11, 12]. In [7], the task placement and resource allocation plan for embarrassingly parallel jobs, which are composed of a set of independent tasks, is addressed to minimize the job completion time. To this end, three algorithms named TaPRA, TaPRA-fast, and OnTaPRA are proposed, which significantly reduce the job execution time as compared to state-of-the-art algorithms. In [8], the multi-resource allocation problem in cloud computing systems is addressed through a mechanism called DRFH, where the resource pool is constructed from a large number of heterogeneous servers containing various numbers of slots. It is shown that DRFH leads to much higher resource utilization with considerably shorter job completion times. In [9], a resource allocation scheme is proposed that results in efficient utilization of the resources while increasing the revenue of the mobile cloud service providers. One of the pioneering works addressing resource allocation in GDCNs is [10], where a distributed algorithm, called DGLB, is proposed for real-time geographical load balancing. None of the above works has considered the allocation of big data jobs composed of multiple subtasks with communication constraints among their subtasks.
Allocation of big data jobs represented by graph structures is a complicated process entailing more delicate analysis. Among the limited literature, [11, 12] are most relevant, in which randomized algorithms are developed that match the vertices of the graph jobs to the idle slots of the cloud servers, considering the cost of using the communication infrastructure of the network to handle the data flows among the subtasks. These algorithms are developed for a fixed network cost configuration, i.e., the cost of job execution using the same allocation strategy is fixed over time. As mentioned in [13], these randomized algorithms suffer from long convergence times. Due to this fact, these algorithms are impractical in scenarios in which i) the job allocation needs to be performed with respect to a time-varying network cost configuration, or ii) the network size is large, leading to an enormous strategy set (see Section 5). In GDCNs, the execution cost is mainly determined by the real-time power consumption of the DCs [14]. Hence, an applicable allocation framework should be capable of fast allocation of incoming graph jobs to the GDCN considering the effect of the allocation on the DCs’ current power consumption state. Also, with the rapid growth in the size of cloud networks, adaptability to large-scale GDCNs is a must for such a framework. These are the main motivations behind this work.
1.2 Contributions
The main goal of this paper is to provide a framework for graph job allocation in GDCNs of various scales. Our main contributions can be summarized as follows:
i) We formulate the problem of graph job allocation in GDCNs considering the incurred power consumption of the cloud network.
ii) We propose a centralized approach to solve the problem, suitable for small-scale cloud networks.
iii) We design a distributed algorithm for allocating graph jobs in medium-scale GDCNs, using the DCs’ processing power in parallel.
iv) For large-scale GDCNs, given the huge size of the strategy set and the extremely slow convergence of the distributed algorithm, we introduce the idea of cloud crawling. In particular, we propose a fast method to address the NP-complete subgraph isomorphism problem, which is one of the major challenges of graph job allocation in cloud networks. We also propose a novel decentralized subgraph isomorphism extraction algorithm that allows a cloud crawler to identify “potentially good” strategies for customers while traversing a GDCN.
v) For large-scale GDCNs, given the suggested strategies of the cloud crawlers, we find the best strategies for the customers under both adaptive and fixed pricing of the DCs in a distributed fashion. To this end, we model the behavior of proxy agents in a GDCN, based on which we propose two online learning algorithms inspired by the concept of “regret” in the bandit problem [15, 16].
This paper is organized as follows. Section 2 describes the system model. Section 3 presents a suboptimal approach for graph job allocation in small-scale GDCNs. A distributed graph job allocation mechanism for medium-scale GDCNs is presented in Section 4. Cloud crawling, along with online learning algorithms for large-scale GDCNs, is presented in Section 5. Simulation results are given in Section 6. Finally, Section 7 concludes the paper.
2 System Model
A GDCN comprises various DCs connected through communication links. Inside each DC, there is a set of fully-connected cloud servers, each consisting of multiple fully-connected slots. Without loss of generality, we assume that all the cloud servers have the same number of slots. Each slot corresponds to the same bundle of processing resources, which can be utilized independently. Since all the slots belonging to the same DC are fully-connected, we consider a DC directly as a collection of slots in our study. (The number of cloud servers does not play a major role in our study except in the energy consumption models.) It is assumed that a DC provider (DCP) is in charge of DC management. Abstracting each DC to a node and a communication link between two DCs as an edge, a GDCN with DCs can be represented as a graph , where denotes the set of nodes and represents the set of edges. Henceforth, is assumed to be connected; however, due to geographical constraints, may not be a complete graph.
Let denote the set of slots belonging to DC . A connection between two DCs enables communication between all of their slots. Consequently, two slots are called adjacent if and only if they both belong to the same DC or there exists a link between their corresponding DCs. Let denote the set of edges between adjacent slots, where if and only if or . We define the aggregated network graph as , where and .
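As a concrete illustration of this construction, the following Python sketch builds the aggregated network graph from a DC-level description; the data-structure names (`dc_slots`, `dc_links`) are our own and not from the paper:

```python
from itertools import combinations

def aggregated_graph(dc_slots, dc_links):
    """Build the aggregated network graph: nodes are the slots of all DCs,
    and two slots are adjacent iff they belong to the same DC or their DCs
    share a communication link.

    dc_slots: dict mapping DC id -> list of slot ids   (assumed format)
    dc_links: iterable of (dc_a, dc_b) pairs of linked DCs
    """
    nodes = {s for slots in dc_slots.values() for s in slots}
    edges = set()
    # slots inside the same DC are fully connected
    for slots in dc_slots.values():
        edges.update(frozenset(p) for p in combinations(slots, 2))
    # slots of linked DCs are pairwise adjacent
    for a, b in dc_links:
        edges.update(frozenset((x, y)) for x in dc_slots[a] for y in dc_slots[b])
    return nodes, edges
```

For two linked DCs with two slots each, this yields one intra-DC edge per DC plus four inter-DC edges, matching the adjacency rule above.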
Let , denote the set of all possible types of graph jobs in the system, each of which is represented by a graph . Each node of a graph job requires one slot of a DC for execution. It is assumed that , and , if and only if the nodes and need to be executed on two adjacent slots of the GDCN.
The system model is depicted in Fig. 2. For small- and medium-scale GDCNs, the GDCN is assumed to be in charge of finding adequate allocations for the incoming graph jobs from proxy agents (PAs) [17, 18], which act as trusted parties between the GDCN and the customers. In these cases, each graph job is allocated through either a centralized controller or a distributed algorithm utilizing the communication infrastructure between the DCs (see Section 4). For large-scale GDCNs, cloud crawlers are introduced that explore the GDCN and provide a set of suggested strategies for the PAs. Afterward, the PAs allocate their graph jobs with respect to the utility of the suggested strategies (see Section 5). The following definitions facilitate our subsequent derivations.
Definition 1.
A feasible mapping between a and the GDCN is defined as a mapping , which satisfies the communication constraints of the graph job. This implies that , if , then . Let denote the set of all feasible mappings for the .
Definition 2.
For a , a mapping vector associated with a feasible mapping is defined as a vector , where denotes the number of used slots from DC . Mathematically, , where represents the indicator function. Let denote the set of all mapping vectors for the . Finding a feasible allocation/mapping between a graph job and a GDCN is similar to the subgraph isomorphism problem in graph theory [19]. Some examples of feasible allocations for a graph job with three nodes, considering a GDCN with four DCs each consisting of four slots, are depicted in Fig. 2.
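The feasibility check in Definition 1 can be sketched directly in code; the function and argument names below are our own, and the slot adjacency set is assumed to come from the aggregated network graph:

```python
def is_feasible_mapping(job_edges, mapping, slot_adjacency):
    """Check Definition 1 (sketch): a mapping from graph-job nodes to GDCN
    slots is feasible iff distinct nodes occupy distinct slots and every
    communication edge of the job lands on a pair of adjacent slots.

    job_edges: iterable of (u, v) node pairs of the graph job
    mapping: dict job node -> slot id
    slot_adjacency: set of frozensets of adjacent slot pairs
    """
    # each node of the graph job needs its own slot
    if len(set(mapping.values())) != len(mapping):
        return False
    # every communication edge must map onto adjacent slots
    return all(frozenset((mapping[u], mapping[v])) in slot_adjacency
               for u, v in job_edges)
```

For a triangle graph job, any injective mapping onto three mutually adjacent slots passes, while reusing a slot or mapping an edge onto non-adjacent slots fails.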
Remark 1.
Our aim is to allocate big-data-driven applications, e.g., data streams [11], to GDCNs. Due to the nature of these applications, the jobs usually stay in the system until they are terminated. This work can be considered as a real-time allocation of graph jobs to the system, where we find the best currently possible assignment given the current network status. Hence, we deliberately omit the time index in the following discussions.
Inspired by [20, 14], we model the power consumption upon utilizing slots of , comprising cloud servers each with idle power consumption , as:
(1) 
In this model, is the so-called Power Usage Effectiveness (PUE), the ratio between the total power usage of a DC, including cooling, lights, UPS, etc., and the power consumed by its IT equipment, and is chosen in such a way that determines the peak power consumption of a cloud server inside . Also, is a DC-related constant. Subsequently, we define the incurred cost of executing a graph job of type allocated according to the feasible mapping vector as follows:

(2) 
where is the original load of DC , denotes the incurred I/O power of using the communication infrastructure of DC per slot, and is the ratio between cost and power consumption, which depends on the DC’s location and infrastructure design. The I/O cost is considered proportional to the number of used slots since the data generated at each DC is correlated with that number, and this data must be exchanged over the I/O infrastructure either among adjacent DCs or between the DCs and the users.
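Since the bodies of Eqs. (1) and (2) were lost in extraction, the following is only a plausible instantiation of the described model, not the paper's exact formulas: PUE-scaled server power with a load-proportional term for Eq. (1), and a per-DC power increment plus a per-slot I/O term for Eq. (2). All parameter names and numeric defaults are assumptions for illustration:

```python
def dc_power(used_slots, pue=1.5, n_servers=10, slots_per_server=4,
             p_idle=100.0, p_peak=250.0):
    """Hypothetical reading of Eq. (1): servers' idle power plus a term
    growing linearly with slot utilization up to the peak power, all
    scaled by the PUE. The linear form and parameter values are assumed."""
    utilization = used_slots / (n_servers * slots_per_server)
    return pue * n_servers * (p_idle + (p_peak - p_idle) * utilization)

def allocation_cost(mapping_vector, loads, alpha, io_power):
    """Hypothetical reading of Eq. (2): for each utilized DC, the power
    increase caused by the allocated slots plus a per-slot I/O term,
    weighted by the DC's power-to-cost ratio alpha[d]."""
    return sum(
        alpha[d] * (dc_power(loads[d] + y) - dc_power(loads[d]) + io_power[d] * y)
        for d, y in enumerate(mapping_vector) if y > 0
    )
```

Note that under this form the marginal cost of a slot depends on the DC's current load only through the power model, while the I/O term scales linearly with the number of used slots, as described above.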
2.1 Problem Formulation
Our goal is to find an allocation for each arriving graph job that minimizes the total incurred cost on the network. Due to the inherent relation between the cost and the loads of the DCs, minimizing the cost is coupled with balancing the loads of the DCs. In a GDCN, let denote the number of in the system demanding execution. Let denote the matrix of mapping vectors of these graph jobs, defined as follows:
We formulate the optimal graph job allocation as the following optimization problem ():
(3)  
(4)  
(5) 
In , the objective function is the total incurred cost of execution, the first constraint ensures the stability of the DCs, and the second constraint guarantees the feasibility of the assignment. There are two main difficulties in obtaining the solution: i) identifying the feasible mappings requires solving the subgraph isomorphism problem between the graph jobs’ topology and the aggregated network graph, which is NP-complete [19]. Hence, we only assume knowledge of the feasible mappings in small- and medium-scale GDCNs; for large-scale GDCNs, we propose a low-complexity decentralized approach to extract subgraphs isomorphic to a graph job and implement it in our proposed cloud crawlers. ii) is a nonlinear integer programming problem, which is known to be NP-hard. In small- and medium-scale GDCNs, we tackle this problem by considering a convex relaxation of it. For large-scale GDCNs, however, we find a “potentially good” subset of feasible mappings as the cloud crawlers traverse the network; afterward, the strategy selection is carried out using the computing power of the PAs in a decentralized fashion.
Symbol  Definition 

The GDCN graph  
Set of DCs in the GDCN  
The DC with index  
Number of DCs in the GDCN  
Set of slots of DC  
Aggregated graph of the GDCN  
Set of slots of the entire GDCN  
Set of edges between adjacent slots of a GDCN  
Set of graph jobs in the system  
Number of different types of jobs in the system  
Associated graph to the graph job with type  
Number of jobs with type in the system  
Set of nodes of the graph job with type  
Set of edges of the graph job with type  
Load of DC  
Number of cloud servers in DC  
Set of all the mapping vectors for  
Set of PAs in the system  
Set of cloud crawler’s suggested strategies for  
Probability of selection of strategy 
3 Graph Job Allocation in Small-Scale GDCNs: Centralized Approach
Solving requires solving an integer programming problem in dimensions. For a small GDCN with three types of graph jobs (), DCs (), and graph jobs of each type in the system, the dimension of the solution becomes , rendering the computations impractical. To alleviate this issue, we solve in a sequential manner for the graph jobs available in the system. In our approach, at each stage, the best allocation is obtained for one graph job while neglecting the presence of the rest. Afterward, the graph job is allocated to the GDCN and the loads of the utilized DCs are updated. As a result, at each stage, the dimension of the solution is ( in the above example). For a , let the available graph jobs be indexed from to according to their execution order, where preferred customers can be prioritized in practice. For a graph job of type with index , we reformulate as ():
(6)  
(7)  
(8) 
where denotes the updated load of DC after the previous graph job allocations. The last constraint in forces the solution to be discrete, making the derivation of a tractable solution impossible. In the following, we relax this constraint and provide a tractable method to derive the solution in the set of feasible points. For the moment, we consider . We define as an optimization problem with the same objective function as and three constraints. In this problem, the first constraint is Eq. (7), and the second and third constraints are relaxed versions of Eq. (8), described as:
(9)  
(10) 
where Eq. (9) ensures the assignment of all the nodes of the graph job to the GDCN, and Eq. (10) guarantees the practicality of the solution. It is easy to verify that is a convex optimization problem. We use the Lagrangian dual decomposition method [21] to solve it. Let , , and denote the Lagrangian multipliers associated with the first, second, and third constraints, respectively. The Lagrangian function associated with is then given by:
(11) 
The corresponding dual function of is given by:
(12) 
Finally, the dual problem can be written as ():
(13) 
is a convex optimization problem with differentiable affine constraints; hence, it satisfies the constraint qualifications implying a zero duality gap. As a result, the solution of coincides with the solution of . It can be verified that the minimum of the dual function occurs at the following point:
(14) 
By replacing this in the Lagrangian function, the dual function is given by: , where . The optimal Lagrangian multipliers can be obtained by solving the dual problem given by:
(15) 
Given the solution of Eq. (15), the optimal allocation in is given by . The solution of this dual problem can be derived via the iterative gradient ascent algorithm [21].
Let denote the derived solution in the continuous space; we then obtain the solution of by solving the following weighted mean-square problem:
(16) 
where s are the design parameters, which can be tuned to impose a certain tendency toward utilizing specific DCs.
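The projection step of Eq. (16) amounts to a weighted nearest-neighbor search over the discrete feasible set. A minimal sketch, with our own names and the feasible mapping vectors assumed to be given:

```python
def round_to_feasible(x_relaxed, mapping_vectors, weights):
    """Sketch of Eq. (16): project the relaxed (continuous) allocation back
    onto the discrete set of feasible mapping vectors by minimizing a
    weighted squared distance. The weights are the design parameters that
    can bias the choice toward utilizing specific DCs."""
    return min(
        mapping_vectors,
        key=lambda m: sum(w * (x - y) ** 2
                          for w, x, y in zip(weights, x_relaxed, m)),
    )
```

With uniform weights this simply picks the feasible vector closest to the relaxed solution; increasing the weight of a DC penalizes deviations on that coordinate, steering the rounding accordingly.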
Deriving the above solution requires a powerful centralized processor with global knowledge of the state of all the DCs. This is due to the inherent updating mechanism of the gradient ascent method [21], in which the iterative update of each Lagrangian multiplier requires global knowledge of the current values of the other Lagrangian multipliers and the DCs’ loads. Obtaining this knowledge may not be feasible for a GDCN with more than a few DCs. Moreover, multiple powerful backup processors may be needed to avoid interrupting the allocation process in situations such as overheating of the centralized processor. In the following section, we design a distributed algorithm that uses the processing power of the DCs in parallel to resolve these concerns.
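Putting the pieces of this section together, the per-job sequential procedure can be sketched as follows; the function names, the capacity check, and the cost callback are our own, and the feasible mapping vectors per job type are assumed to be precomputed:

```python
def sequential_allocation(job_sequence, feasible_vectors, loads, capacities, cost_fn):
    """Sketch of the sequential scheme: jobs are handled in execution order;
    each receives the currently cheapest feasible mapping vector (ignoring
    later jobs), after which the DC loads are updated. cost_fn(vec, loads)
    is assumed to evaluate the execution cost of Eq. (2)."""
    plan = []
    for job_type in job_sequence:
        candidates = [
            v for v in feasible_vectors[job_type]
            if all(loads[d] + v[d] <= capacities[d] for d in range(len(loads)))
        ]
        best = min(candidates, key=lambda v: cost_fn(v, loads))
        plan.append(best)
        for d, y in enumerate(best):
            loads[d] += y  # updated loads carry over to the next stage
    return plan
```

Because each stage only sees the current loads, a DC filled by an earlier job becomes more expensive (or infeasible) for later jobs, which is exactly the load-coupling the sequential reformulation exploits.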
4 Graph Job Allocation in Medium-Scale GDCNs: Decentralized Approach with DCs in Charge of Job Allocation
The described dual problem in Eq. (13), given the result of Eq. (14), can be written as follows:
(17) 
where
(18) 
In Eq. (17), each term can be associated with a DC. For , there are two private (local) variables and a public (global) variable , which is identical for all the DCs. Due to the existence of this public variable, the objective function cannot be directly written as a sum of separable functions. In the following, we propose a distributed algorithm deploying local exchange of information among adjacent DCs to obtain a unified value for the public variable across the network.
4.1 Consensus-based Graph Job Allocation
We propose the consensus-based distributed graph job allocation (CDGA) algorithm, consisting of two steps, to find the solution of Eq. (17): i) updating the local variables at each DC, and ii) updating the global variable via forming a consensus among the DCs. We consider each term of Eq. (17) as a (hypothetically) separate term and rewrite the problem as a summation of separable functions, with replaced by in :
(19) 
At each iteration of the CDGA algorithm, each DC first derives the value of the following variables locally using the gradient ascent method:
(20)  
where the s are the corresponding step-sizes and is a local variable. Afterward, the local copies of the global variable (s) are derived by employing the consensus-based gradient ascent method [22]:
(21) 
where , with the Laplacian matrix of and , and denotes the number of consensus iterations performed among the adjacent DCs. In this method, the adjacent DCs perform consensus iterations with local exchange of the s before updating . The pseudocode of the CDGA algorithm is given in Algorithm 1. Since the solution is found in the continuous space, similar to Section 3, the last stage of the algorithm is obtaining the solution in the feasible set of allocations. This step requires a centralized processor with knowledge of the feasible solutions. Nevertheless, as compared to the centralized approach (Section 3), the centralized processor is no longer in charge of deriving the optimal allocation for each graph job.
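The consensus update on the local copies of the global variable can be illustrated with a minimal Laplacian smoothing step; this is a generic sketch of one such iteration (our own naming), not the paper's exact Eq. (21):

```python
def consensus_step(mu, neighbors, eps):
    """One consensus iteration (sketch): every DC nudges its local copy of
    the global variable toward its neighbors' copies,
    mu_i <- mu_i - eps * sum_j (mu_i - mu_j),
    i.e., a step along the graph Laplacian. Convergence requires eps to be
    smaller than 1 / max_degree of the DC graph."""
    return {
        i: mu[i] - eps * sum(mu[i] - mu[j] for j in neighbors[i])
        for i in mu
    }
```

Each step preserves the network-wide average of the local copies, so repeating it drives all DCs toward a unified value of the public variable using only local exchanges, which is the property CDGA relies on.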
5 Graph Job Allocation in Large-Scale GDCNs: Decentralized Approach Using Cloud Crawling and PAs’ Computing Resources
Large-scale GDCNs consist of an enormous number of PAs and DCs. This fact imposes three challenges for graph job allocation: i) The CDGA algorithm developed above becomes infeasible. In particular, an excessive computational burden is incurred on the DCs due to the large number of arriving jobs. Also, CDGA in large-scale GDCNs incurs a long delay (e.g., a GDCN with DCs involves Lagrangian multipliers and requires hundreds of iterations for convergence), which may render the final solution less effective for the current state of the network. Moreover, continuous communication between the DCs imposes considerable congestion on the communication links. ii) So far, the inherent assumption in our study has been a known set of feasible allocations for the graph jobs. This requires solving the NP-complete subgraph isomorphism problem between the graph jobs and the large-scale aggregated network graph, which may take a long time. iii) Even for a given graph job, the size of the feasible allocation set becomes prohibitively large in a large-scale network. For instance, in a fully-connected network of DCs, each with slots, the number of feasible allocations for a simple triangle graph job is . These concerns motivate us to develop cloud crawlers, based on which we address the above challenges through a decentralized framework. Here, we use the term “crawler” to describe the movement between adjacent DCs. This may bear a resemblance to the term web crawler; nevertheless, the cloud crawlers introduced here are fundamentally different from conventional web crawlers (e.g., [23, 24, 25]). Our cloud crawlers aim to extract suitable subgraphs from GDCNs for specified graph job structures while traversing the network, whereas web crawlers are mainly developed to extract information from Internet URLs by looking for keywords and related documents.
5.1 Strategy Suggestion Using Cloud Crawling
We introduce a cloud crawler (CCR), which carries a collection of structured information while traveling between adjacent DCs. It probes the connectivity among the DCs and their status (power usage, load distribution, etc.), based on which it provides a set of suggested allocations for the graph jobs. For faster network coverage, multiple CCRs can be assumed for each type of graph job. The information gleaned by the CCRs can be shared with the PAs, who act as mediators between the GDCN and the customers, through two mechanisms: i) the CCR shares it with a central database, to which the PAs have access, on a regular basis; ii) the CCR shares it with the DCs as it passes through them, and the DCs update the connected PAs accordingly. The goal of a CCR is to find “potentially good” feasible allocations that fulfill a graph job’s requirements considering the network status. We consider a potentially good feasible allocation to be a subgraph of the aggregated network graph that is isomorphic to the considered graph job and leads to a low cost of execution. In the following, we first prove a theorem, based on which we provide a corollary describing a fast decentralized approach to solve the subgraph isomorphism problem in large-scale GDCNs.
Definition 3.
Two graphs and with vertex sets and are called isomorphic if there exists an isomorphism (bijection mapping) such that any two nodes, , are adjacent in if and only if are adjacent in .
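Definition 3 can be checked mechanically for a candidate bijection; the following sketch (our own naming, graphs given as adjacency sets) verifies that a mapping is an isomorphism:

```python
def is_isomorphism(adj_g, adj_h, phi):
    """Check Definition 3 (sketch): phi is an isomorphism between two graphs
    given as adjacency-set dicts iff it is a bijection on the node sets and
    preserves adjacency in both directions."""
    # phi must be defined on all of G's nodes and be injective
    if set(phi) != set(adj_g) or len(set(phi.values())) != len(phi):
        return False
    nodes = list(adj_g)
    # u,v adjacent in G  <=>  phi(u),phi(v) adjacent in H
    return all(
        (v in adj_g[u]) == (phi[v] in adj_h[phi[u]])
        for u in nodes for v in nodes if u != v
    )
```

For example, mapping a triangle onto a triangle passes, while mapping it onto a 3-node path fails because one edge is not preserved.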
Theorem 1.
Consider graphs and with vertex sets and , respectively, where . Assume that can be partitioned into multiple complete subgraphs , , with vertex sets , where , and all the nodes in each pair of subgraphs with consecutive indices are connected to each other. Consider node and let denote the length of the longest shortest path between and nodes in . Define , , where denotes the length of the shortest path between the two input nodes. Let be a sequence of integer numbers that satisfy the following conditions:
(22)  
(23)  
(24) 
For such a sequence , there is at least an isomorphic subgraph to , called , in with the corresponding isomorphism mapping , for which at least one of the nodes of , , belongs to , if the following set of conditions is satisfied:
(25) 
Proof.
The key to prove this theorem is considering the following mapping between the nodes of and the subgraphs in :
(26) 
Under this mapping, the mapped nodes form a graph isomorphic to , since the connection between all adjacent nodes in is preserved in . That is because they are placed either in the same (fully-connected) , or in (fully-connected) adjacent s, . With a similar justification, it can be proved that concatenation of the mapped nodes to the adjacent s, , in Eq. (26) preserves the isomorphism property. For instance, all the following mappings form graphs isomorphic to in :
(27)  
(28)  
(29)  
(30) 
It can be seen that the conditions stated in Eqs. (22)-(24) denote the feasible concatenation strategies, where each denotes the number of neighborhoods mapped to , . Also, Eq. (25) ensures the feasibility of the corresponding mappings. ∎
Corollary 1.
For , assume a CCR located at DC allocates at least one node of , , to one slot at , where the length of the longest shortest path between and the nodes of is . Assume that the CCR’s near-future path can be represented as , where . Considering as in Theorem 1, for each realization of the sequence satisfying Eqs. (22)-(25), the allocation described below is feasible and isomorphic to .
(31) 
Using the method described in the above corollary, it can be verified that the complexity of obtaining a subgraph isomorphic to a graph job for a CCR becomes , where is the diameter of the graph job. Henceforth, we refer to , defined in Corollary 1, as the center node, which can be chosen arbitrarily from the graph job’s nodes. The pseudocode of our algorithm implemented in a CCR is given in Algorithm 2. We use the binary search tree (BST) data structure [26] to organize the carried suggested strategies. To handle the large number of feasible allocations, we limit the capability of a CCR in carrying potentially good strategies (the size of the BST) to a finite number for .
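The quantities the corollary relies on, i.e., the sets of job nodes at each shortest-path distance from the chosen center node, can be obtained with a single breadth-first search. A sketch under our own naming (the layer sets stand in for the neighborhoods of Theorem 1):

```python
from collections import deque

def bfs_layers(job_adj, center):
    """Group a graph job's nodes by shortest-path distance from the chosen
    center node: layer l holds the nodes at distance l from the center, and
    the largest layer index is the longest shortest path from the center,
    which the CCR's complexity depends on."""
    dist = {center: 0}
    queue = deque([center])
    while queue:
        u = queue.popleft()
        for v in job_adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    layers = {}
    for node, d in dist.items():
        layers.setdefault(d, set()).add(node)
    return layers
```

The crawler can then map layer l of the job onto DCs within l hops of its current position, which is the essence of the concatenation strategies in Theorem 1.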
A) Initialization: A CCR is initialized at a DC for a certain graph job, , and a specified number of suggested strategies () to be carried. (Note that, using a simple extension of this algorithm, a CCR can handle the extraction of suggested strategies for multiple graph jobs at the same time.) Each CCR carries a BST, a list [27] of incomplete allocations (), and a set of visited neighbors () (which can be implemented as a list). In Fig. 3a, the topology of a graph job is shown along with three DCs, where each square denotes a slot in a DC. The CCR is initialized at , traversing the path .
B) Determining the Graph Job Topology Constraints (lines: 22): For a given center node of a graph job, i.e., in Corollary 1, the algorithm calculates the feasible number of nodes allocated to the DCs according to Corollary 1. In Fig. 3a, the center node is denoted by , and the different sets of neighbors located on various shortest paths to are shown.