MIPS: Instance Placement for Stream Processing Systems based on Monte Carlo Tree Search

08/01/2020 ∙ by Xi Huang, et al. ∙ 0

Stream processing engines enable modern systems to conduct large-scale analytics over unbounded data streams in real time. They often view an application as a directed acyclic graph with streams flowing through pipelined instances of various processing units. One key challenge that emerges is instance placement, i.e., deciding the placement of instances across servers with minimum cross-server traffic and maximum resource utilization. The challenge stems not only from its intrinsic complexity but also from the impact between successive application deployments. State-of-the-art engines such as Apache Heron adopt a more modularized scheduler design that decomposes the task into two stages: one decides the instance-to-container mapping, while the other focuses on the container-to-server mapping, which is delegated to standalone resource managers. The unaligned objectives and scheduler designs in the two stages may lead to long response times or low utilization. However, little work has so far addressed the challenge. Inspired by the recent success of Monte Carlo Tree Search (MCTS) methods in various fields, we develop a novel model to characterize such systems, formulate the problem, and cast each stage of mapping into a sequential decision process. By adopting MCTS methods, we propose MIPS, an MCTS-based Instance Placement Scheme to decide the two-staged mapping in a timely yet efficient manner. In addition, we discuss practical issues and refine MIPS to further improve its performance. Results from extensive simulations show that, given a moderate number of samples, MIPS outperforms existing schemes with a significant traffic reduction and utilization improvement. To the best of our knowledge, this paper is the first to study the two-staged mapping problem and to apply MCTS to solving it.




I Introduction

Recent years have witnessed an explosive growth of data streams that are incessantly generated from a wide assortment of applications, e.g., Twitter[1], Facebook[2], and LinkedIn[3]. To conduct large-scale, real-time processing for such data streams, a number of stream processing engines have been proposed and launched, aiming at high scalability, availability, responsiveness, and fault tolerance[4, 5, 3, 1, 6, 2].

To date, stream processing engines have evolved over three generations[7]. Among the most recent third-generation engines, Apache Storm[1] and Heron[6] stand out for their extensive adoption and support from a large community[8][9], as well as their modularized and scalable design. Typically, engines like Storm and Heron view each stream processing application as a directed acyclic graph, a.k.a. a topology, where data streams (edges) are processed through pipelined components (nodes).

To launch applications, users submit their requests to the system scheduler. Requests arrive in an online manner, each specifying the topology of a given application, along with the parallelism requirement, i.e., the number of instances for each component, and their resource demands. Upon deployment, a key step and challenge for the scheduler is instance placement, i.e., deciding how to distribute instances within a cluster of heterogeneous servers, containers[10], or processes. Besides the intrinsic complexity of the problem being NP-hard[11], instance placement is often associated with two objectives: 1) to shorten response time, the instances of successive components should be placed in proximity with minimum cross-server or cross-container traffic; 2) to achieve high resource utilization, the placement should utilize as few servers/containers as possible. However, these two objectives may conflict with each other, as Figure 1(a) shows. Any inadvertent placement could lead to either high cross-server traffic with long response time, or low resource utilization with unnecessary overheads.

Constrained by the underlying scheduler design, instance placement generally falls into two categories. The first one is typified by Storm. Terming each instance a task, Storm’s scheduler manages both the computation and enforcement of task placement with direct control over its underlying cluster’s resources[1]. Consequently, the scheduler can directly map tasks to servers through a one-staged decision, with each server running a few processes that host tasks within their threads. However, Storm’s built-in schemes, round-robin (RR) and first-fit-decreasing (FFD), are often blamed for their blindness to traffic patterns between components upon decision making[12]. Motivated by that, previous studies abound, focusing on designing one-staged placement schemes[11, 13, 14, 12]. Despite the effectiveness of these schemes, such an integrated scheduler design has come to an end due to the highly coupled implementation of resource management and scheduling[15].

(a) A potential trade-off between traffic reduction and high resource utilization
(b) Instance placement in two stages with discrepant goals
Fig. 1: Basic settings: A topology consists of four components, each with one instance. Observations: (a) Placement I incurs units of traffic between the two utilized servers, while Placement II reduces the traffic by units but at the cost of one extra server being used. (b) A sample two-staged instance placement: Instances are distributed with minimum cross-container traffic, then containers are deployed to servers in the order of (X, Y, Z), each assigned to the server with minimum but sufficient resources (best fit).

This calls for a more modularized scheduler design, which typifies the second category. One landmark is Twitter’s replacement of Storm with Heron in 2015[6]. Compared to Storm, Heron refines the engine design by treating each instance as an independent process in a container, while delegating resource management to more established cluster schedulers such as YARN[16] and Nomad[17]. Consequently, instance placement is decided in two stages. The first stage distributes instances onto a set of containers, a.k.a. instance-container mapping, and is often done by Heron’s scheduler. Next, containers are submitted as jobs and assigned to servers by the cluster scheduler, a.k.a. container-server mapping. Compared to one-staged schemes, the design of two-staged placement is even more challenging: any unilateral effort to optimize the placement in either stage would be in vain if the objectives of the two stages are inconsistent. Figure 1(b) shows such an example: Stage 1 decides the instance-container mapping with minimum traffic across containers, whereas Stage 2 focuses on maximizing resource utilization by using the best-fit heuristic, which induces more traffic between containers than necessary. The minimum traffic is attained by placing containers Y and Z together on the same server and container X on the other. The gap arises because Stage 1 aims to reduce cross-container traffic whereas Stage 2 does not. So far, very little work has addressed this two-staged mapping problem. Solutions for Storm are often tailored to Heron’s first-stage mapping, leaving a large space for performance improvement in Heron-like systems.

In this paper, we focus on the two-staged instance placement problem that handles requests arriving in real time. We target highly modularized stream processing engines such as Apache Heron and design an efficient instance placement scheme, with aligned objectives for both stages in minimizing traffic while maximizing resource utilization, so that the system can make timely but effective scheduling decisions. The challenges of the problem come from the combinatorial nature of the mapping problems in both stages, the resource contention among different applications and their instances, the conflicts between objectives, and the online nature of request arrivals.

To address the challenges, various heuristics and approximation algorithms are available[18][19]. Notably, recent years have seen a series of successes in applying Monte Carlo Tree Search (MCTS) methods to problems that involve sequential decision processes[20], such as the game of Go[21]. In fact, each mapping stage can also be viewed as a sequential decision process that places instances (containers) successively onto containers (servers). Moreover, MCTS takes advantage of random sampling, achieving a tradeoff between computational efficiency and optimality that is largely affected by estimate accuracy. To decide instance placements, low computational cost is favorable, while the accurate evaluation of consequent mappings is also critical. By leveraging MCTS, we propose MIPS, an efficient two-staged instance placement scheme for stream processing systems. Our main results and contributions are as follows.

Modeling and Formulation: With Apache Heron as the prototype, we develop a novel model that accurately characterizes state-of-the-art stream processing systems with a highly modularized design. Based upon the model, we formulate the online instance placement problem as a two-staged constrained mapping problem, aiming at cross-server/container traffic reduction with high resource utilization. To the best of our knowledge, this is the first model for such stream processing systems and the first formulation of the two-staged mapping problem.

Algorithm Design: Considering that the problem is in general NP-hard, we adopt MCTS techniques with the Upper Confidence Bound for Trees (UCT), taking advantage of its anytime property, i.e., more computing power generally leads to better performance. By transforming each mapping stage into a sequential decision process, we propose MIPS, a randomized scheme that decides the two-staged instance placement in a timely yet effective manner. Besides, we refine MIPS in various aspects to promote its sampling quality, accelerate the computation, and overcome some practical issues.

Experiment Verification and Analysis: Almost all existing schemes are designed for Storm. To make them comparable with MIPS, we propose their variants under Heron. Then we conduct extensive simulations to evaluate MIPS against them. Our results and analysis show that MIPS notably outperforms existing schemes with both low traffic and high utilization.

To the best of our knowledge, this is the first paper that studies the two-staged instance placement problem for Heron-like systems. The rest of the paper is organized as follows. Section II describes the model and the problem formulation, while Section III proposes MIPS and discusses its variants to further accelerate the algorithm. Section IV presents results from simulations with the corresponding analysis, followed by the conclusion in Section V.

Notation Description
The set of servers in the cluster
Capacity of -th type of resource on server
The communication cost per traffic unit from server to
Graph that corresponds to topology (specified by request )
The set of components in topology
The set of data streams in topology
The number of instances of component
The set of instances of component
The resource demand of instance
The traffic rate of data stream
The set of containers of topology
The resource capacity of container
The set of containers on server before deploying topology
The set of all instances in topology
Decision that places instance in container
Decision that places container on server
TABLE I: Main Notations

II Model and Problem Formulation

We develop a model for stream processing systems based on Heron and formulate the two-staged instance placement problem. Main notations are summarized in Table I.

II-A Overall System Model

We consider a Heron-based data stream processing system running within a cluster of servers, denoted by set . The servers are interconnected according to some network topology[22][23]. User requests arrive at the system in an online fashion, each denoted by . Heron’s scheduler receives and processes requests in a first-in-first-out manner. For each request , the scheduler instantiates its specified application and maps its instances to containers. Next, the mapping is handed over to the underlying resource manager, e.g., YARN[16] or Nomad[17]. Then containers are assigned to servers, whereafter the application is ready to run.

II-B Cluster Model

Regarding each server, we consider types of resources, such as CPU cores, memory, and storage. The resource capacity of server is denoted by a vector , where denotes the capacity of the -th type of resource. Between any two servers there always exists a non-blocking path for traffic transmission, which is commonly achievable in existing data center networks[22]. Transmission from server to incurs a communication cost of per traffic unit, e.g., per gigabyte.

II-C Streaming Application Model

Each request specifies the logical topology of a given application, denoted by a directed acyclic graph . denotes the set of components that make up the application, while denotes the set of directed data streams between components. In practice, the diameter of is often not too large, mostly less than four[1].

For component , we use to denote its parallelism, i.e., the number of its instances, and to denote the set of its instances such that . The deployment of any instance has a resource demand of . For any instance , we use to denote the component to which it belongs.

For data stream , its traffic rate is denoted by . In practice, traffic rates can be either pre-defined by users or estimated using historical data[14][24][11]. Given any two components and such that , the traffic is assumed to be evenly spread from the instances of to those of , which is achievable by adopting the shuffling policy in Heron or Storm[1][6]. Therefore, the rate between instance and is obtained as . For non-successive instances , we set . The model can be easily extended to cases with uneven traffic patterns.
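As a minimal sketch of the even-split assumption above, the rate of each data stream can be divided equally among all instance pairs of its two endpoint components. The function name, the dictionary layout, and the component names `spout` and `bolt` are hypothetical and chosen only for illustration.

```python
from itertools import product


def instance_traffic(stream_rates, parallelism):
    """Split each component-level stream rate evenly over instance pairs.

    stream_rates: {(src_component, dst_component): rate}
    parallelism:  {component: number_of_instances}
    Returns {((src, i), (dst, j)): rate}. Pairs absent from the result
    are non-successive and carry zero traffic.
    """
    rates = {}
    for (src, dst), rate in stream_rates.items():
        pairs = parallelism[src] * parallelism[dst]
        for i, j in product(range(parallelism[src]), range(parallelism[dst])):
            rates[((src, i), (dst, j))] = rate / pairs
    return rates


# One stream of rate 12 between a component with 2 instances and one with 3.
rates = instance_traffic({("spout", "bolt"): 12.0}, {"spout": 2, "bolt": 3})
print(rates[(("spout", 0), ("bolt", 2))])  # 2.0
```

An uneven traffic pattern could be modeled by replacing the uniform share `rate / pairs` with per-pair weights that sum to one.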

II-D Deployment Model

Besides logical topology, a request also specifies the number of containers to deploy instances. Per request , the scheduler constructs a set of containers, denoted by . Each container has a resource capacity of . Meanwhile, we denote the set of containers on server before deploying request by .

II-E Placement Decisions

For request , the instance placement consists of two stages.

The first stage is to decide a mapping from instances of request , i.e., to containers , denoted by . Each entry indicates whether instance is mapped to container . The decision maps each instance onto exactly one container while respecting the resource capacity of every container; we refer to these requirements as constraint (1).


The second stage is to decide another mapping from containers to servers , denoted by . Each entry indicates whether container is mapped to server . The decision maps each container to exactly one server without violating the resource constraints on servers; we refer to these requirements as constraint (2).
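The requirements of both stages share the same shape: every item is mapped to exactly one bin, and no bin exceeds its capacity in any resource type. A minimal sketch of such a check follows; the name `is_feasible`, the dictionary layout, and the choice to charge a container's full capacity against its server are assumptions made for illustration.

```python
def is_feasible(assignment, demands, capacities, used=None):
    """Check a mapping of items to bins against per-resource capacities.

    assignment: {item: bin}; a dict maps each item to exactly one bin.
    demands:    {item: (cpu, mem, ...)}
    capacities: {bin: (cpu, mem, ...)}
    used:       optional {bin: (cpu, mem, ...)} already consumed by earlier
                deployments (relevant when bins are servers).
    """
    used = used or {}
    if set(assignment) != set(demands):
        return False  # some item is unmapped or unknown
    load = {b: list(used.get(b, (0,) * len(cap))) for b, cap in capacities.items()}
    for item, b in assignment.items():
        if b not in load:
            return False
        for r, d in enumerate(demands[item]):
            load[b][r] += d
    return all(l <= c for b, cap in capacities.items() for l, c in zip(load[b], cap))


# Stage 1: instances onto containers.
instance_demands = {"i1": (1, 2), "i2": (2, 2)}
containers = {"X": (3, 4), "Y": (3, 4)}
print(is_feasible({"i1": "X", "i2": "X"}, instance_demands, containers))  # True

# Stage 2: containers onto servers that already host earlier deployments.
container_demands = {"X": (3, 4)}
servers = {"s1": (4, 8)}
print(is_feasible({"X": "s1"}, container_demands, servers, used={"s1": (2, 0)}))  # False
```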


II-F Optimization Objectives

Regarding instance-container mapping, it is highly desirable to place successive instances that are connected by data streams into the same container in order to minimize cross-container traffic, which reduces communication overheads and shortens response time[13]. Formally, given the first-stage decision for a request, the total traffic between two containers is the sum of the traffic rates between every pair of instances hosted in them, with one instance in each container.

Hence, the total cross-container traffic for the request is the sum of this quantity over all pairs of distinct containers.

On the other hand, deploying containers also incurs additional resource overheads[10]. For high resource utilization, the decision should map instances to as few containers as possible. Therefore, given the decision, the number of utilized containers is the number of containers in which at least one instance resides.
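The two first-stage quantities can be sketched as follows, assuming the mapping is stored as a dictionary from instance to container and that instance-level rates are keyed by directed instance pairs. Both representations and the function names are hypothetical.

```python
def cross_container_traffic(assignment, rates):
    """Sum the rates of instance pairs that are placed in different containers."""
    return sum(rate for (i, j), rate in rates.items()
               if assignment[i] != assignment[j])


def used_containers(assignment):
    """Count the containers that host at least one instance."""
    return len(set(assignment.values()))


rates = {("a", "b"): 4.0, ("b", "c"): 1.0}
together = {"a": "X", "b": "X", "c": "Y"}
spread = {"a": "X", "b": "Y", "c": "Z"}
print(cross_container_traffic(together, rates), used_containers(together))  # 1.0 2
print(cross_container_traffic(spread, rates), used_containers(spread))      # 5.0 3
```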

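Regarding container-server mapping, containers with intensive traffic in between should be placed close to each other to minimize the inter-server communication cost. With the instance-container mapping fixed and a container-server decision given, the cost for a request between two servers is the traffic between the containers placed on them, weighted by the unit communication cost between the two servers.

The total communication cost incurred after deployment is the sum of this cost over all pairs of servers.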
II-G Problem Formulation

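For request , we formulate the instance-container mapping problem (ICMP), referred to as (8), as minimizing a weighted combination of the total cross-container traffic and the number of utilized containers, subject to constraint (1). The weight is a tunable parameter that sets the importance of cross-container traffic reduction relative to decreasing the number of utilized containers. Meanwhile, we define the container-server mapping problem (CSMP), referred to as (9), as minimizing the total communication cost subject to constraint (2).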
III Algorithm Design


ICMP and CSMP are both non-linear combinatorial optimization problems. Such problems are generally NP-hard with a huge search space size (e.g., for ICMP), while coupled resource constraints add even more complexity. Inspired by the recent successes[21] in applying random sampling methods like MCTS to complex problems with sequential decision making processes, we shift our perspective by viewing each stage as a sequential decision process that places instances (or containers) successively. By leveraging MCTS, we aim to develop an efficient scheme that solves each stage of mapping while balancing computational complexity and effectiveness.

III-A Overview of MCTS

MCTS is derived from tree search methods[25] that handle sequential decision processes. The key idea of such methods is to build a single-rooted tree that corresponds to the process. Each tree node represents a system state, while each outgoing edge represents the action that leads to the next state. In this way, each path from the root node to a leaf indicates a complete decision sequence with an eventual reward. The problem then becomes finding a policy that, at the current node, chooses the action leading to the maximum reward. In some cases, direct construction of the search tree may require excessive compute resources due to its large search space and branching factor.

MCTS takes a detour by incrementally constructing part of the tree with random sampling on a round basis. Each node maintains estimates of the reward of executing the different actions that lead to its child nodes. Within each round, MCTS follows a framework with four basic steps: 1) Traversal: Starting from the root node, MCTS recursively finds the next node to traverse by choosing the child with the maximum reward estimate, until it reaches a leaf node or an unexpanded node (one with unvisited child nodes); 2) Expansion: Upon reaching an unexpanded node, MCTS expands it by adding one of its unvisited children to the partially built tree; 3) Simulation: Starting from the newly added node, MCTS conducts a random simulation to sample a complete decision sequence; 4) Back-propagation: The reward induced by the acquired sequence is then back-propagated along the path to the root node, refining the reward estimates of the visited nodes. After a number of rounds (samples), MCTS returns the action from the root to the child node that is most likely to lead to the optimal decision sequence. The process is then replayed on the subtree rooted at the chosen child node.

1:  %% Handle request .
2:  function MIPS_FOR_ICMP(, )
3:   Initialize and action sequence .
4:   Initialize root node with .
5:   while do
7:   Update and
9:   return
11:  function MIPS_FOR_CSMP(, )
12:   Initialize and action sequence .
13:   Initialize root node with .
14:   while do
16:   Update and
18:   return
Algorithm 1 MIPS for the two-staged instance placement
1:  function NEXT_ACTION(, sid)
2:   Set as root node .
3:   Initialize and MAX_SAMPLE_NUM.
4:   while MAX_SAMPLE_NUM do
7:   if then
9:   and
10:   return , best_child
12:  function TRAVERSE()
13:   while is not a leaf do
14:   if then
15:   return EXPAND()
16:   else
17:   return
19:  function EXPAND()
20:   Choose uniformly randomly
21:   Place instance onto container
22:   return the resultant node
24:  function BEST_CHILD(, )
25:   return
27:  function SIMULATE(, sid)
28:   while is not a leaf do
29:   Choose uniformly randomly
30:   Execute and obtain the resultant node
31:   Set
32:   if in stage and satisfies constraints () then
33:   return %% Reward at leaf node .
34:   else return %% Ends in an invalid mapping.
36:  function BACK_PROP(, )
37:   while is not root do
38:   and
39:   parent of
Algorithm 2 Sub-functions for MIPS
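To make the four MCTS steps concrete for the first-stage mapping, the following is a minimal sketch and not the authors' implementation. It makes several loudly stated assumptions: instances are placed in a fixed order, a single resource type is used, the objective is a convex combination `beta * traffic + (1 - beta) * containers`, a dead end is penalized with `INVALID_COST`, and the child with the lowest score is preferred because the objective is minimized. All names are hypothetical.

```python
import math
import random

INVALID_COST = 1e9  # assumed penalty for a rollout that hits a dead end


class Node:
    def __init__(self, placed, parent=None):
        self.placed = placed      # container index chosen for instance 0, 1, ...
        self.parent = parent
        self.children = {}        # action (container index) -> child Node
        self.visits = 0
        self.total_cost = 0.0


def actions(placed, demands, capacity, n_containers):
    """Containers that can still host the next unplaced instance."""
    load = [0] * n_containers
    for inst, c in enumerate(placed):
        load[c] += demands[inst]
    nxt = len(placed)
    return [c for c in range(n_containers) if load[c] + demands[nxt] <= capacity]


def cost(placed, rates, beta):
    """Assumed objective: weighted cross-container traffic and container count."""
    traffic = sum(r for (i, j), r in rates.items() if placed[i] != placed[j])
    return beta * traffic + (1 - beta) * len(set(placed))


def simulate(placed, demands, capacity, n_containers, rates, beta, rng):
    """Uniformly random completion of a partial mapping."""
    placed = list(placed)
    while len(placed) < len(demands):
        acts = actions(placed, demands, capacity, n_containers)
        if not acts:
            return INVALID_COST
        placed.append(rng.choice(acts))
    return cost(placed, rates, beta)


def next_action(root, demands, capacity, n_containers, rates, beta, samples, c, rng):
    for _ in range(samples):
        node = root
        while len(node.placed) < len(demands):            # traversal
            acts = actions(node.placed, demands, capacity, n_containers)
            untried = [a for a in acts if a not in node.children]
            if untried:                                    # expansion
                a = rng.choice(untried)
                node.children[a] = Node(node.placed + (a,), node)
                node = node.children[a]
                break
            if not acts:
                break                                      # dead end
            parent = node
            node = min(parent.children.values(), key=lambda ch: (
                ch.total_cost / ch.visits
                - c * math.sqrt(math.log(parent.visits) / ch.visits)))
        sample = simulate(node.placed, demands, capacity, n_containers, rates, beta, rng)
        while node is not None:                            # back-propagation
            node.visits += 1
            node.total_cost += sample
            node = node.parent
    return min(root.children,
               key=lambda a: root.children[a].total_cost / root.children[a].visits)


def place_instances(demands, capacity, n_containers, rates,
                    beta=0.5, samples=200, c=1.0, seed=0):
    rng = random.Random(seed)
    root = Node(())
    while len(root.placed) < len(demands):
        if not actions(root.placed, demands, capacity, n_containers):
            return None                                    # no valid mapping found
        a = next_action(root, demands, capacity, n_containers, rates, beta, samples, c, rng)
        root = root.children[a]
        root.parent = None                                 # replay on the chosen subtree
    return root.placed


rates = {(0, 1): 5.0, (1, 2): 1.0, (2, 3): 5.0}  # a chain of four instances
mapping = place_instances([1, 1, 1, 1], capacity=2, n_containers=3, rates=rates, beta=1.0)
print(mapping)
```

Re-rooting the tree after each chosen action keeps the statistics gathered for the selected subtree, which matches the replay step described in the overview. The same skeleton could serve the second stage by treating containers as items and servers as bins.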

MCTS has various favorable properties that contribute to its successful application. One of them is its anytime property. An anytime algorithm can return a valid solution whenever it is interrupted before it ends; moreover, more compute resources generally lead to results of better quality, which balances the tradeoff between computational complexity and optimality of the solution. This property is particularly desirable for instance placement: on one hand, given the overwhelming search space size and the continual arrival of requests, the scheduler should determine an effective placement in a timely fashion, instead of undertaking a time-consuming decision process; on the other hand, provided with more resources, the scheduler should be able to improve the placement rather than resort to tedious parameter tuning tricks. Another property is that MCTS only provides a generic framework, whereby system designers can customize the basic steps to further optimize their applications.

III-B Modeling Decision Trees and Algorithm Design

To leverage MCTS, we need to cast each mapping stage into a sequential decision process.

Decision Tree for ICMP: First, we transform the decision process for instance-container mapping into a decision tree. For request , we construct a tree for its instance-container mapping, with each node denoting the state associated with a given mapping. The root node corresponds to the state where no instances are mapped to containers, i.e., for all and . Each outgoing edge of indicates an action of mapping some unmapped instance to some container, subject to the resource constraint in (1). For example, the action of mapping instance to container changes the mapping state from to with only , denoted by . Similarly, its child nodes then point to their descendants. Recursively defined in this way, the tree eventually reaches leaf nodes with mappings that satisfy (1). The reward for each leaf node is set as the associated objective value defined in (8) given its mapping.

Decision Tree for CSMP: Regarding CSMP, for request , we construct another tree for container-server mapping. The root corresponds to the state where none of the containers of request are mapped to servers, while each of its outgoing edges denotes the action of mapping some unmapped container to some server subject to the resource constraint in (2). Edges then point to child nodes with the resultant mappings, which in turn point to more descendants. Leaf nodes are either valid mappings from containers to servers, or mappings interrupted due to limited resources. The reward for each leaf node is the corresponding objective value defined in (9) given its mapping.

Every node in the above trees maintains four states. The first is , denoting the number of times node has been visited. The second is , denoting the total accumulated reward node has received so far. In this way, reflects the expected reward induced by following the decision sequence through node . The third is , denoting the set of all untried mapping actions that satisfy the resource constraint in (1). The last is , denoting the set of ’s children that have been visited.

To find the optimal decision sequence for each decision tree, we propose MIPS, i.e., an MCTS-based Instance Placement Scheme that decides the two stages of mapping, respectively. Algorithm 1 shows the pseudocode of MIPS that decides the two-staged instance placement for each request .

Notably, to choose the best child node (Alg.2, line 25), MIPS has to decide: 1) to exploit historical information by choosing from the visited ones with the minimum objective value, or 2) to explore an unvisited node with unknown reward, a.k.a. the exploitation-and-exploration tradeoff[20]. To find the best possible placement, MIPS must balance the tradeoff well, since 1) if it is over-dependent on the historical information, MIPS may miss unvisited nodes that lead to better placements, while 2) radical exploration might waste resources on nodes with a far worse reward. To this end, MIPS leverages the widely adopted Upper Confidence bound for Trees (UCT)[26]. By viewing the problem as a multi-armed bandit problem[27], UCT chooses the best child node with the minimum upper confidence bound 1 (UCB1) value (line 25 of Alg.2)[20]. The left-hand-side term of UCB is the reward estimate, obtained by averaging the aggregated rewards from past samples (to ensure that the term is finite, we add one to the denominator); the right-hand-side term reflects the visit frequency. If a node has never been visited, the term goes to infinity and its UCB value is , thus the node must be chosen with precedence. If a node is rarely visited but its parent node is visited a great number of times, the node will have a higher chance to be chosen. In this way, MIPS can rebalance the tradeoff by choosing a proper value of the weight parameter ; a greater value induces a more explorative search.
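The selection rule can be sketched as a score that is minimized over the children of a node. The exact expression of line 25 is not reproduced here; the form below assumes the standard UCB1 exploration term, subtracted because lower objective values are better, with one added to the denominator of the average as described above. The function and parameter names are hypothetical.

```python
import math


def uct_score(total_reward, visits, parent_visits, c):
    """Lower is better: mean objective value minus an exploration bonus."""
    if visits == 0:
        return -math.inf                      # unvisited children take precedence
    mean = total_reward / (visits + 1)        # +1 keeps the estimate finite
    bonus = c * math.sqrt(2 * math.log(parent_visits) / visits)
    return mean - bonus


# Two children with the same reward estimate under a parent visited 100 times:
rarely = uct_score(total_reward=4.0, visits=1, parent_visits=100, c=1.0)
often = uct_score(total_reward=40.0, visits=19, parent_visits=100, c=1.0)
print(rarely < often)  # True: the rarely visited child is preferred
```

Raising `c` enlarges the bonus of every visited child, with the largest effect on rarely visited ones, which corresponds to a more explorative search.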

III-C System Workflow

Upon request ’s arrival, the system first parses its topology, resource demand, and parallelism requirement. In the first stage, the system scheduler applies MIPS to obtain the instance-container mapping. Then it eliminates the containers with no instances assigned and submits each remaining container as a job to the underlying resource manager[17]. In the second stage, the cluster scheduler applies MIPS to decide the mapping from containers to servers and enforces the deployment. In practice, MIPS can be implemented as a custom module through APIs provided by Heron and cluster schedulers[6][16][17].

III-D Practical Issues and Refinement

By leveraging random sampling, the effectiveness of MIPS heavily depends on the estimate accuracy for the objective values of the tree nodes (states). Accurate estimates often require a large number of samples, inducing long computational time and massive compute resources. Considering that some samples may lead to decision sequences with unfavorably high objective values, uniformly random simulation and node expansion may still have much room for improvement. Hence, to promote sampling quality, we optimize MIPS by refining its way of 1) selecting unvisited children, 2) choosing the best child, and 3) simulating.

1) Expansion policy (Alg.2, lines 19-22): For ICMP, given node with unvisited children (untried actions), instead of selecting uniformly at random, we favor the action that places an instance in a container that already hosts any of its successive instances. We assign such actions a positive score and the rest a zero score, and select an action only from those with high scores. Thus MIPS is biased towards mappings that place successive instances in proximity, reducing cross-container traffic with the least resources. Regarding CSMP, MIPS favors mapping containers with traffic in between onto the same server.

2) Best child selection (Alg.2, lines 24-25): Although UCT balances the exploration-and-exploitation tradeoff well, MIPS must explore all child nodes before revisiting the visited ones. However, some actions may obviously lead to high objective values, as discussed previously. We bias against such nodes by initializing and with a large positive value, pretending that the node has been visited once a priori with an unfavorably high objective value.

3) Simulating (Alg.2, lines 27-34): Starting from a given node, MIPS simulates the rest of the mapping decision process by repeatedly choosing an unmapped instance uniformly at random and assigning it to one of the containers. However, such an aimless policy may lead to decision sequences that place successive instances in different containers, incurring considerable traffic and resource overheads. Instead, we refine the simulating strategy in the following way: each time, MIPS randomly chooses one of the unmapped instances and maps it to the container with the minimum incremental cross-container traffic. Likewise, when applied to CSMP, MIPS simulates by progressively placing the remaining unmapped containers on servers with the minimum incremental cross-server traffic.
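The refined simulation step can be sketched as a guided rollout that completes a partial instance-container mapping. This is an illustrative sketch under stated assumptions: a single resource type, `free` holding the remaining capacity of each container, ties broken by container order, and hypothetical function names.

```python
import random


def incremental_traffic(inst, container, assignment, rates):
    """Cross-container traffic added if `inst` is placed in `container`."""
    extra = 0.0
    for (i, j), rate in rates.items():
        other = j if i == inst else i if j == inst else None
        if other is not None and other in assignment and assignment[other] != container:
            extra += rate
    return extra


def guided_rollout(assignment, instances, demands, free, rates, rng):
    """Complete a partial mapping; returns None if it runs into a dead end."""
    assignment, free = dict(assignment), dict(free)
    todo = [i for i in instances if i not in assignment]
    rng.shuffle(todo)                         # random order of unmapped instances
    for inst in todo:
        fits = [c for c in free if free[c] >= demands[inst]]
        if not fits:
            return None
        best = min(fits, key=lambda c: incremental_traffic(inst, c, assignment, rates))
        assignment[inst] = best
        free[best] -= demands[inst]
    return assignment


rng = random.Random(0)
result = guided_rollout({"a": "X"}, ["a", "b"], {"a": 1, "b": 1},
                        {"X": 1, "Y": 2}, {("a", "b"): 3.0}, rng)
print(result)  # {'a': 'X', 'b': 'X'}
```

Compared with a uniformly random rollout, each sample is more likely to be a low-traffic mapping, so fewer samples may be needed for a useful estimate, at the price of more work per sample.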

Fig. 2: Comparisons between MIPS/MIPS (M/M) and other schemes: R-Heron/Best-fit (means R-Heron in the first stage and Best-fit in the second stage), T-Heron/Best-fit, and FFD/Best-fit, denoted by R/B, T/B, F/B, respectively. We run M/M under different values of , including (M/M traffic: minimizing traffic only), (M/M util.: minimizing container utilization only), and (M/M both: targeting both objectives), according to (8).

IV Simulation

IV-A Basic Settings

Cluster Topology: We prototype a stream processing system based on Heron[6] and cluster resource manager Nomad[17]. We implement two custom schedulers based on MIPS in the system and Nomad for instance-container and container-server mapping, respectively. The system is deployed in clusters that are constructed using two widely adopted topologies, Jellyfish[23] and Fat-Tree[22], respectively. Within each cluster are homogeneous switches and heterogeneous servers. Each switch has a port number of , with a bandwidth of Gbps on each port. For any two servers, the unit communication cost of transferring data streams is set as the number of hops of the shortest path between them.
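As a small illustration of the unit communication cost used here, the hop count of a shortest path can be obtained with a breadth-first search over the cluster graph. The adjacency-list layout and the tiny two-switch topology below are hypothetical and do not correspond to the simulated Jellyfish or Fat-Tree clusters.

```python
from collections import deque


def hop_count(adjacency, src, dst):
    """Number of hops on a shortest path from src to dst, found by BFS."""
    seen, queue = {src}, deque([(src, 0)])
    while queue:
        node, hops = queue.popleft()
        if node == dst:
            return hops
        for nxt in adjacency[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, hops + 1))
    return None  # dst is unreachable from src


# Two servers under one switch and a third server under a neighboring switch.
adjacency = {
    "s1": ["sw1"], "s2": ["sw1"], "s3": ["sw2"],
    "sw1": ["s1", "s2", "sw2"], "sw2": ["s3", "sw1"],
}
print(hop_count(adjacency, "s1", "s2"))  # 2
print(hop_count(adjacency, "s1", "s3"))  # 3
```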

Deployment Resources: Regarding resource allocation, we consider CPU cores and memory on servers[6]. Every server has a number of CPU cores ranging from to and memory from G to G. For each stream processing application, all of its containers have identical resource capacities.

Stream Processing Applications: We progressively submit requests to the system scheduler to deploy applications with common topologies[15][13][11]. Each request specifies a topology with a depth varying from to , and a number of components ranging from to . Besides, the parallelism for each component ranges from to . Instances of the same component have identical functionalities. Instances’ resource demand varies from to CPU cores and to GB memory.

Compared Schemes: Besides Heron’s first-fit-decreasing (FFD) scheme, most existing schemes are designed for Storm [11][13]. To make them comparable with MIPS, we propose their variants for instance-container mapping under Heron.

R-Heron: Given an application, initialize all its containers. Enumerate its components by a breadth-first traversal of its topology. If the topology has more than one sink node, then add a virtual root node that precedes all sink nodes and apply the traversal. Next, for each component, enumerate its instances and repeat the following process. For each instance, assign it to the container with the minimum resource distance, where the distance is defined as the traffic rate between the container and other containers plus the Euclidean distance between the instance’s resource demand vector and the container’s available resource vector. If no container has enough resources to host an instance, an error is raised.

T-Heron: Given an application, initialize all its containers. Sort all instances by their descending order of (incoming and outgoing) traffic rate. Then assign each instance to one of its application’s containers with minimum incremental traffic and without exceeding the resource capacity of the container.

FFD[6]: Given an application, initialize all its containers, an empty active container list, and a list of unmapped instances. While there still exist unmapped instances, repeat the process: 1) choose the next unmapped instance from the list; 2) sort the active containers in descending order of their available resources; 3) pick the first active container with sufficient resources; if no active container can host the instance, add a new container to the list and assign the instance to it.
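The FFD baseline as described above can be sketched in a few lines. This follows the three listed steps literally and is not Heron's actual packing code; it assumes a single resource type and identical container capacities, and it does not pre-sort the instance list because the description does not state whether it is sorted.

```python
def ffd(instances, demands, capacity):
    """Map instances to containers; returns a list of containers (lists of instances)."""
    containers, free = [], []
    for inst in instances:                                    # 1) next unmapped instance
        order = sorted(range(len(containers)),                # 2) most free space first
                       key=lambda k: free[k], reverse=True)
        target = next((k for k in order if free[k] >= demands[inst]), None)  # 3)
        if target is None:                                    # open a new container
            if demands[inst] > capacity:
                raise ValueError(f"instance {inst} exceeds the container capacity")
            containers.append([])
            free.append(capacity)
            target = len(containers) - 1
        containers[target].append(inst)
        free[target] -= demands[inst]
    return containers


packing = ffd(["a", "b", "c", "d"], {"a": 3, "b": 2, "c": 2, "d": 1}, capacity=4)
print(packing)  # [['a', 'd'], ['b', 'c']]
```

Note that the scheme never looks at traffic rates, which is why it tends to incur more cross-container traffic than traffic-aware schemes.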

We adopt the best-fit scheme in Nomad[17] as the underlying container-server mapping scheme for the baselines, which assigns each container to the server with free resources that best match its resource demand.
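A best-fit choice in the sense used here (the server with the minimum but sufficient free resources) can be sketched as below. This is a simplified single-resource illustration with hypothetical names; it is not Nomad's actual scoring logic.

```python
def best_fit(container_demand, free_by_server):
    """Return the server with the least free resources that still fits, or None."""
    candidates = [(free, server) for server, free in free_by_server.items()
                  if free >= container_demand]
    if not candidates:
        return None
    return min(candidates)[1]  # ties are broken by server name


print(best_fit(3, {"s1": 8, "s2": 4, "s3": 2}))  # s2
```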

IV-B Results and Analysis

We present and analyze the results from our extensive simulations of MIPS. Since MIPS is a randomized algorithm, we repeat each simulation for times and average the results to mitigate the impact of randomness.

Performance against Other Schemes: Figure 2 compares two-staged MIPS (M/M) with three other schemes in terms of the costs in the two stages. The number of samples per round is fixed as for MIPS. Figure 2 (a) compares the total costs that the different schemes induce in each of the two stages. Note that for any scheme, the total cost of ICMP is the same under Fat-Tree and Jellyfish and only the cost of CSMP differs, since decision making for ICMP does not involve the communication costs that vary across topologies. We make the following observations.

In the first stage, MIPS (M/M) with different values of effectively reduces the total cost of cross-container traffic and container utilization compared to other schemes. For example, M/M (traffic) with leads to the minimum cost of , with a reduction to F/B, to T/B, and to R/B. M/M (util) and M/M (both) also reduce the cost, though slightly less than M/M (traffic).

Looking closer, we further compare the cross-container traffic and container utilization in Figure 2 (b) and (c), respectively. Figure 2 (b) shows that MIPS (traffic) incurs the minimum cross-container traffic. This is reasonable since with , MIPS assigns successive instances to the same containers whenever possible. Unlike T-Heron, which also achieves relatively low traffic, MIPS decides the placement based on the experience it acquires from random sampling and evaluation rather than on greedy heuristics, leading to less traffic. Meanwhile, the other heuristics, R-Heron and FFD, incur more traffic because they pay little or no attention to traffic reduction.

Figure 2 (c) shows that M/M (util.) achieves the minimum container utilization at . Meanwhile, M/M (util.) also incurs less traffic than FFD and R-Heron, the heuristics that focus on utilization. On the other hand, M/M (traffic) pays for its traffic reduction with extra container utilization. Even so, M/M (traffic) still outperforms T-Heron in container utilization. Jointly considering traffic reduction and utilization, M/M (both) strikes a good balance, with about traffic increase and little extra utilization relative to the optimum on either side. All these advantages stem from MIPS’s effective exploitation of random sampling.

In the second stage, Figure 2 (a) shows that M/M significantly outperforms other schemes, with an up to reduction in cross-server traffic under both topologies. This is attributable not only to the advantage carried over from the first-stage mapping but also to the effectiveness of MIPS in the second stage.

Performance under Different Values of : The previous results suggest a potential tradeoff between cross-container traffic and container utilization. Figure 3 verifies the relationship qualitatively by showing the costs incurred by MIPS under Fat-Tree with growing from to : container utilization gently increases while cross-container traffic notably decreases, with cross-server traffic decreasing as well. It seems plausible to place successive instances close to each other to reduce cross-container traffic with only a few containers. However, due to the heterogeneity of instance resource demands, this intuition may fail, as exemplified previously by Figure 1 (a). Moreover, with online request arrivals, previously deployed applications may take up most server resources, leaving only fragmented resources on servers for later-arriving applications. Their containers would then either be deployed on new servers or be placed far apart across existing servers. MIPS carefully places instances to minimize the impact of dependence between successive placement decisions.

Fig. 3: MIPS’s performance under various choices of and sample numbers.
Fig. 4: MIPS’s performance under various choices of and sample numbers.

Performance with Different Sampling Numbers: The sampling number is another key factor in MIPS’s performance: more sampling means simulating more possible sequences, which yields more accurate estimates for MIPS’s eventual decision making. However, it also requires longer computation time and more compute resources. A natural question is how many samples are sufficient to decide placements with low traffic and resource consumption. Figure 3 investigates the relationship between sampling number and MIPS’s performance, with . As the sampling number grows from to , there is a significant reduction in the container utilization and cross-container traffic. However, as the sampling number continues to rise, the improvement gradually fades and the costs eventually converge at around samples. This implies that in practice, despite its enormous sample space, MIPS requires only a moderate number of samples to make timely placement decisions with both low traffic and few container resources.

Performance under Different Exploration-Exploitation Tradeoffs: To find the best possible placement within a limited number of samples, MIPS has to decide in each round whether to exploit decision sequences with known reward estimates or to explore those with unknown rewards. Figure 4 investigates the impact of weighting parameter on the system performance and the variance among repeated simulations, with and sample number as per round under Fat-Tree topology. Figure 4 (a) shows that as parameter varies from to , MIPS incurs costs roughly at the same level, with container utilization around , cross-container traffic around , and cross-server traffic around . This is reasonable since with fixed settings those costs are supposed to remain constant on the long-term average. However, we can still see a slight fluctuation among the results under different choices of . The reason lies in the sampling quality induced by the different exploration-exploitation tradeoffs. With a smaller value of , MIPS tends to exploit decision sequences with known estimates, so the resulting decision depends largely on a limited set of sequences and misses those with unknown but potentially better rewards. On the other hand, a greater value of leads to a more explorative search. Due to the randomness of sampling, an unduly exploitative search and an unduly explorative search can both lead to a large variance across simulations. Figure 4 (b) verifies this: the variances of all three costs rise after a drop from to , suggesting that is a proper value for MIPS to balance the tradeoff.
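To make the role of the weighting parameter concrete, the sketch below uses a generic UCB-style selection rule in the spirit of [27], where `c` stands in for the weighting parameter. The exact rule and constants used by MIPS are not shown in this section, so the formula, function names, and data layout here are assumptions.

```python
import math


def ucb_score(total_reward, visits, parent_visits, c):
    """Mean reward plus an exploration bonus weighted by c (generic UCB form)."""
    if visits == 0:
        return math.inf                     # unvisited actions are tried first
    exploit = total_reward / visits
    explore = math.sqrt(math.log(parent_visits) / visits)
    return exploit + c * explore


def select_child(children, c):
    """children: {action: (total_reward, visits)}; returns the chosen action."""
    parent_visits = sum(visits for _, visits in children.values())
    return max(children, key=lambda a: ucb_score(*children[a], parent_visits, c))
```

With `children = {"x": (9.0, 10), "y": (0.5, 1)}`, setting `c = 0` selects the well-sampled action `x` purely on its mean reward, while `c = 2` selects the rarely tried action `y`, illustrating how a larger weight shifts the search toward exploration.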

V Conclusion

In this paper, we studied the two-staged instance placement problem for stream processing engines like Heron. By modeling each stage as a sequential decision-making process and applying MCTS to the problem, we proposed MIPS, a randomized scheme that decides the instance placement in two stages in a timely yet efficient manner. To improve the sampling quality, we refined MCTS in several aspects and discussed practical issues. To evaluate MIPS against existing schemes, we proposed variants of those schemes for Heron-like systems. Results from extensive simulations show that MIPS outperforms existing schemes with both low traffic and high resource utilization, while requiring only a moderate number of samples.


  • [1] A. Toshniwal, S. Taneja, A. Shukla, K. Ramasamy, J. M. Patel, S. Kulkarni, J. Jackson, K. Gade, M. Fu, J. Donham et al., “Storm@twitter,” in Proceedings of ACM SIGMOD, 2014.
  • [2] G. J. Chen, J. L. Wiener, S. Iyer, A. Jaiswal, R. Lei, N. Simha, W. Wang, K. Wilfong, T. Williamson, and S. Yilmaz, “Realtime data processing at facebook,” in Proceedings of ACM SIGMOD, 2016.
  • [3] D. Le-Phuoc, M. Dao-Tran, M.-D. Pham, P. Boncz, T. Eiter, and M. Fink, “Linked stream data processing engines: Facts and figures,” in Proceedings of International Semantic Web Conference, 2012.
  • [4] S. A. Noghabi, K. Paramasivam, Y. Pan, N. Ramesh, J. Bringhurst, I. Gupta, and R. H. Campbell, “Samza: stateful scalable stream processing at linkedin,” in Proceedings of the VLDB Endowment, 2017.
  • [5] P. Carbone, A. Katsifodimos, S. Ewen, V. Markl, S. Haridi, and K. Tzoumas, “Apache flink: Stream and batch processing in a single engine,” 2015.
  • [6] “Heron documentation,” https://apache.github.io/incubator-heron/docs.
  • [7] T. Heinze, L. Aniello, L. Querzoni, and Z. Jerzak, “Cloud-based data stream processing,” in Proceedings of ACM DEBS, 2014.
  • [8] “Storm community,” https://storm.apache.org/Powered-By.html.
  • [9] “Heron community,” https://wiki.apache.org/incubator/HeronProposal.
  • [10] “Docker document,” https://docs.docker.com/network/.
  • [11] B. Peng, M. Hosseini, Z. Hong, R. Farivar, and R. Campbell, “R-storm: Resource-aware scheduling in storm,” in Proceedings of ACM Middleware, 2015.
  • [12] L. Eskandari, Z. Huang, and D. Eyers, “P-scheduler: adaptive hierarchical scheduling in apache storm,” in Proceedings of ACSW, 2016.
  • [13] J. Xu, Z. Chen, J. Tang, and S. Su, “T-storm: Traffic-aware online scheduling in storm,” in Proceedings of IEEE ICDCS, 2014.
  • [14] L. Aniello, R. Baldoni, and L. Querzoni, “Adaptive online scheduling in storm,” in Proceedings of ACM SIGMOD, 2013.
  • [15] S. Kulkarni, N. Bhagat, M. Fu, V. Kedigehalli, C. Kellogg, S. Mittal, J. M. Patel, K. Ramasamy, and S. Taneja, “Twitter heron: Stream processing at scale,” in Proceedings of ACM SIGMOD, 2015.
  • [16] V. K. Vavilapalli, A. C. Murthy, C. Douglas, S. Agarwal, M. Konar, R. Evans, T. Graves, J. Lowe, H. Shah, S. Seth et al., “Apache hadoop yarn: Yet another resource negotiator,” in Proceedings of ACM SoCC, 2013.
  • [17] “Nomad,” https://www.nomadproject.io/.
  • [18] G. L. Nemhauser and L. A. Wolsey, “Integer programming and combinatorial optimization,” Wiley, Chichester, 1988.
  • [19] M. Chen, S. C. Liew, Z. Shao, and C. Kai, “Markov approximation for combinatorial network optimization,” IEEE Transactions on Information Theory, vol. 59, no. 10, pp. 6301–6327, 2013.
  • [20] C. B. Browne, E. Powley, D. Whitehouse, S. M. Lucas, P. I. Cowling, P. Rohlfshagen, S. Tavener, D. Perez, S. Samothrakis, and S. Colton, “A survey of monte carlo tree search methods,” IEEE T-CIAIG, vol. 4, no. 1, pp. 1–43, 2012.
  • [21] D. Silver, J. Schrittwieser, K. Simonyan, I. Antonoglou, A. Huang, A. Guez, T. Hubert, L. Baker, M. Lai, A. Bolton et al., “Mastering the game of go without human knowledge,” Nature, vol. 550, no. 7676, p. 354, 2017.
  • [22] M. Al-Fares, A. Loukissas, and A. Vahdat, “A scalable, commodity data center network architecture,” in Proceedings of ACM SIGCOMM, 2008.
  • [23] A. Singla, C.-Y. Hong, L. Popa, and P. B. Godfrey, “Jellyfish: Networking data centers, randomly.” in Proceedings of USENIX NSDI, 2012.
  • [24] A. Floratou, A. Agrawal, B. Graham, S. Rao, and K. Ramasamy, “Dhalion: self-regulating stream processing in heron,” Proceedings of the VLDB Endowment, 2017.
  • [25] S. J. Russell and P. Norvig, Artificial intelligence: a modern approach. Malaysia: Pearson Education Limited, 2016.
  • [26] L. Kocsis, C. Szepesvári, and J. Willemson, “Improved monte-carlo search,” Univ. Tartu, Estonia, Tech. Rep, vol. 1, 2006.
  • [27] L. Kocsis and C. Szepesvári, “Bandit based monte-carlo planning,” in Proceedings of European conference on machine learning, 2006.