A Competitive B-Matching Algorithm for Reconfigurable Datacenter Networks

06/18/2020 ∙ by Marcin Bienkowski, et al. ∙ 0

This paper initiates the study of online algorithms for maintaining a maximum weight b-matching, a generalization of maximum weight matching in which each node has at most b > 0 adjacent matching edges. The problem is motivated by emerging optical technologies that make it possible to enhance datacenter networks with reconfigurable matchings, providing direct connectivity between frequently communicating racks. These additional links may improve network performance by leveraging spatial and temporal structure in the workload. We show that the underlying algorithmic problem features an intriguing connection to online paging, but introduces a novel challenge. Our main contribution is an online algorithm which is O(b)-competitive; we also prove that this is asymptotically optimal. We complement our theoretical results with extensive trace-driven simulations, based on real-world datacenter workloads as well as synthetic traffic traces.


1. Introduction

1.1. Motivation: Reconfigurable Datacenters

The popularity of distributed data-centric applications related to machine learning and AI has led to an explosive growth of datacenter traffic, and researchers are hence making great efforts to design more efficient datacenter networks, providing a high throughput at low cost. Indeed, over the last years, much progress has been made in the design of innovative datacenter interconnects, based on fat-tree topologies (Al-Fares et al., 2008; Liu et al., 2013), hypercubes (Guo et al., 2009; Wu et al., 2009), expanders (Kassing et al., 2017) or random graphs (Singla et al., 2012), among many others (Kachris and Tomkos, 2012; Xia et al., 2017). All these networks have in common that their topology is static and fixed.

An intriguing emerging alternative to these static datacenter networks is offered by reconfigurable networks (Mellette et al., 2019, 2017; Venkatakrishnan et al., 2016; Azimi et al., 2014; Foerster and Schmid, 2019; Kandula et al., 2009; Ghobadi et al., 2016; Liu et al., 2014; Farrington et al., 2010; Zhou et al., 2012; Avin and Schmid, 2018; Wang et al., 2018): networks whose topology can be changed dynamically. In particular, novel optical technologies make it possible to provide “short cuts”, i.e., direct connectivity between top-of-rack switches, based on dynamic matchings. First empirical studies demonstrate the potential of such reconfigurable networks, which can deliver very high bandwidth efficiency at low cost.

The matchings provided by reconfigurable networks are either periodic (e.g., (Mellette et al., 2017, 2019)) or demand-aware (e.g., (Ghobadi et al., 2016)). The latter is attractive as it makes it possible to leverage structure in the demand: datacenter traffic is known to be highly structured, e.g., traffic matrices are typically sparse and some flows (sometimes called elephant flows) are much larger than others (Roy et al., 2015; Benson et al., 2010). This may be exploited: in principle, demand-aware datacenter networks can directly match racks which communicate more frequently, leveraging spatial and temporal locality in the workload (Avin et al., 2020). These reconfigurable matchings are usually assumed to enhance a given fixed datacenter topology based on traditional electric switches (Ghobadi et al., 2016): the remaining traffic (e.g., mice flows) can be routed along the fixed network, e.g., using classic shortest path control planes such as ECMP.

The advent of such hybrid static-dynamic datacenter networks introduces an online optimization problem: how to enhance a given fixed topology with a set of additional shortcut “demand-aware” edges, such that the current demand is served optimally (e.g., large flows are routed along short paths, minimizing the “bandwidth tax” (Mellette et al., 2017)), while at the same time reconfiguration costs are kept minimal (reconfigurations take time and can temporarily lead to throughput loss).

1.2. Problem in a Nutshell

The above problem can be modeled as an online dynamic version of the classic $b$-matching problem (Anstee, 1987) (where $b$ is the number of optical switches). In this problem, each node can be connected with at most $b$ other nodes (using optical links), which results in a $b$-matching.

Interestingly, while the offline version of the $b$-matching problem has been studied intensively in the past (e.g., in the context of matching applicants to posts) (Schrijver, 2003), we are not aware of any work on the dynamic online variant. As the problem is fundamental and finds applications beyond the reconfigurable datacenter design problem, we present it in the following abstract form.

Input. We are given an arbitrary (undirected) static weighted and connected network on the set of nodes $V$ connected by a set of non-configurable links $F$: the fixed network. Let $E$ be the set of all possible unordered pairs of nodes from $V$. For any pair $e = \{u, v\} \in E$, let $\ell_e$ denote the length of a shortest path between nodes $u$ and $v$ in graph $G = (V, F)$. Note that $u$ and $v$ are not necessarily directly connected in $F$.
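The shortest-path lengths used as routing costs can be precomputed once, since the fixed network never changes. The following is a minimal sketch, assuming the fixed network is given as a dictionary from node pairs to positive link lengths; the function name `shortest_path_lengths` and the data layout are illustrative and not taken from the paper.

```python
import heapq
from itertools import combinations


def shortest_path_lengths(nodes, fixed_links):
    """Return {frozenset({u, v}): length} for every unordered node pair.

    fixed_links maps frozenset({u, v}) to a positive link length.
    Nodes are assumed to be mutually comparable (e.g., all strings),
    because they serve as tie-breakers inside the heap.
    """
    adjacency = {v: [] for v in nodes}
    for link, length in fixed_links.items():
        u, v = tuple(link)
        adjacency[u].append((v, length))
        adjacency[v].append((u, length))

    def dijkstra(source):
        dist = {source: 0}
        heap = [(0, source)]
        while heap:
            d, x = heapq.heappop(heap)
            if d > dist[x]:
                continue  # stale heap entry
            for y, length in adjacency[x]:
                candidate = d + length
                if candidate < dist.get(y, float("inf")):
                    dist[y] = candidate
                    heapq.heappush(heap, (candidate, y))
        return dist

    all_dist = {v: dijkstra(v) for v in nodes}
    return {frozenset((u, v)): all_dist[u][v] for u, v in combinations(nodes, 2)}


# Usage: a path a - b - c with link lengths 1 and 2.
ell = shortest_path_lengths(
    ["a", "b", "c"],
    {frozenset(("a", "b")): 1, frozenset(("b", "c")): 2},
)
```

The pair a, c is not a fixed link here, yet it still receives a finite length because the fixed network is connected.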

The fixed network can be enhanced with reconfigurable links, providing a matching of degree $b$: Any node pair from $E$ may become a matching edge (such an edge corresponds to a reconfigurable optical link), but the number of matching edges adjacent to any node has to be at most $b$, for a given integer $b \ge 1$.

The demand is modelled as a sequence of communication requests $\sigma = (e_1, e_2, \ldots)$ revealed over time, where $e_t \in E$. (A request could either be an individual packet or a certain amount of data transferred. This model of a request sequence is often considered in the literature and is more fine-grained than, e.g., a sequence of traffic matrices.)

Output and Objective. The goal is to schedule the reconfigurable links over time, that is, to maintain a dynamically changing $b$-matching $M \subseteq E$. Each node pair from $M$ is called a matching edge and we require that each node has at most $b$ adjacent matching edges. We aim to jointly minimize routing and reconfiguration costs, defined below.

Costs. The routing cost for a request $e = \{u, v\}$ depends on whether $u$ and $v$ are connected by a matching edge. In this model, a given request can either only take the fixed network or a direct matching edge (i.e., routing is segregated (Ghobadi et al., 2016)). If $e \notin M$, the request is routed exclusively on the fixed network, and the corresponding cost is $\ell_e$ (shorter paths imply smaller resource costs, i.e., lower “bandwidth tax” (Mellette et al., 2017)). If $e \in M$, the request is served by the matching edge, and the routing cost is 0 (note that this is the most challenging cost function: our result only improves if this cost is larger).

Once the request is served, an algorithm may modify the set of matching edges: reconfiguration costs $\alpha$ per each node pair added or removed from the matching $M$. (The reconfiguration cost and time can be assumed to be independent of the specific edge.)
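The cost model above can be written down in a few lines. The sketch below is illustrative only: the function names are hypothetical, node pairs are represented as `frozenset` objects, and `ell` and `alpha` stand for the shortest-path lengths and the per-edge reconfiguration cost described in the text.

```python
def routing_cost(pair, matching, ell):
    """Segregated routing: free over a matching edge, otherwise the
    shortest-path length of the pair in the fixed network."""
    return 0 if pair in matching else ell[pair]


def reconfiguration_cost(old_matching, new_matching, alpha):
    """alpha is paid for every node pair added to or removed from the matching."""
    return alpha * len(old_matching ^ new_matching)


def is_b_matching(matching, b):
    """Check that no node has more than b incident matching edges."""
    degree = {}
    for pair in matching:
        for node in pair:
            degree[node] = degree.get(node, 0) + 1
    return all(d <= b for d in degree.values())


# Usage: swapping one matching edge for another costs 2 * alpha.
ab, ac = frozenset(("a", "b")), frozenset(("a", "c"))
swap_cost = reconfiguration_cost({ab}, {ac}, alpha=5)
```

Representing matchings as sets makes the reconfiguration cost a symmetric difference, which mirrors the rule that additions and removals are charged equally.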

Online algorithms. An algorithm On is online if it has to take decisions without knowing the future requests (in our case, e.g., which edge to include next in the matching and which to evict). Such an algorithm is said to be $\rho$-competitive (Borodin and El-Yaniv, 1998) if there exists a constant $\beta$ such that for any input instance $\sigma$, it holds that

$$\mathrm{On}(\sigma) \le \rho \cdot \mathrm{Opt}(\sigma) + \beta,$$

where $\mathrm{Opt}(\sigma)$ is the cost of the optimal (offline) solution for $\sigma$. It is worth noting that $\beta$ can depend on the parameters of the network, such as the number of nodes, but has to be independent of the actual sequence of requests. Hence, in the long run, this additive term becomes negligible in comparison to the actual cost of online algorithm On.

1.3. Our Contributions

This paper initiates the study of a natural problem, online dynamic $b$-matching, which finds direct applications, for example, in the context of emerging reconfigurable datacenter networks.

We make the following contributions:

  • We show that the online dynamic $b$-matching problem features an interesting connection to online paging problems, albeit with a twist that introduces a new challenge.

  • We present an -competitive deterministic algorithm, where . Note that in all relevant practical applications (i.e., the cost of routing between any two nodes is much smaller than the cost of reconfiguration, and hence the term is negligible).

  • We derive a lower bound which shows that no deterministic algorithm can achieve a competitive ratio better than .

  • We verify our approach experimentally, performing extensive trace-driven simulations, based on real datacenter workloads as well as synthetic traffic traces.

1.4. Challenges, Technical Novelty, Scope

At the heart of our approach lies the observation that online $b$-matching is similar to online paging (Fiat et al., 1991; Sleator and Tarjan, 1985; Achlioptas et al., 2000; McGeoch and Sleator, 1991): each node in the network can manage its reconfigurable edges in a cache of size $b$. However, making a direct reduction to caching seems impossible as reconfigurable edges involve both incident nodes, which introduces non-trivial dependencies. Without accounting for these dependencies, the competitive ratio would be in the order of the total number of reconfigurable edges in the network, whereas the results derived in this paper depend only on the number of per-node edges. Our algorithms hence combine “per-node caches” in a clever way.

Generally, we believe that the notion of link caching has interesting implications for reconfigurable network designs beyond the model considered in this paper. In particular, caching strategies can typically be implemented locally, and hence may help to overcome centralized control overheads, similar to the stable matching algorithms proposed in the literature (Ghobadi et al., 2016). However, we leave the discussion of such decentralized schedulers to future work.

1.5. Organization

The remainder of this paper is organized as follows. Our online algorithm is described and analyzed in Section 2, and the lower bound is presented in Section 3. We report on our simulation results in Section 4. After reviewing related literature in Section 5, we conclude our contribution in Section 6.

2. Algorithm BMA

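1:Initialization: Matching is empty and counters are zero
2:     $M \leftarrow \emptyset$
3:     for each edge $e \in E$ do
4:          $c_e \leftarrow 0$
5:
6:Request $e = \{u, v\}$ arrives:
7:     if $e \notin M$ then
8:         $c_e \leftarrow c_e + 1$
9:         if $c_e = k_e$ then  ▷ If $e$ becomes saturated,
10:              Execute FixSaturation($u$)
11:              Execute FixSaturation($v$)
12:              if $c_e = k_e$ then  ▷ and if no desaturation event occurred,
13:                  Execute FixMatching($u$)
14:                  Execute FixMatching($v$)
15:                  $M \leftarrow M \cup \{e\}$  ▷ add $e$ to the matching.
16:
17:Routine FixSaturation($w$):
18:     $S \leftarrow \{e' \in E_w \setminus \{e\} : c_{e'} = k_{e'}\}$
19:     if $|S| \ge b$ then  ▷ If the number of saturated node pairs from $E_w \setminus \{e\}$ is at least $b$,
20:         for each edge $e' \in E_w$ do  ▷ reset counters of all node pairs from $E_w$ (desaturation event at $w$).
21:              $c_{e'} \leftarrow 0$
22:
23:Routine FixMatching($w$):
24:     if $|M \cap E_w| = b$ then  ▷ If there are already $b$ incident matching edges,
25:          Pick any $e' \in M \cap E_w$ such that $c_{e'} < k_{e'}$  ▷ remove any unsaturated edge from the matching.
26:          $M \leftarrow M \setminus \{e'\}$
Algorithm 1 Algorithm Bma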

This section introduces our online $b$-matching algorithm, together with a competitive analysis. As described above, in our case study of reconfigurable networks, the matching links may for example describe the reconfigurable links provided by optical circuit switches, offering shortcuts between datacenter racks.

Before we present the algorithm and its analysis in detail, let us first provide some intuition of our approach and the underlying challenges. To this end, let us for now assume that the fixed network is a complete unweighted graph (e.g., capturing the distances in a datacenter network), that , and that all requests are node pairs involving one chosen node $v$. Also recall that each node can have at most $b$ incident matching edges.

In this simplified scenario, we can observe that the choice of an appropriate set of matching edges becomes essentially a variant of online caching (more precisely, a variant of online paging with bypassing (Epstein et al., 2015)). That is, an algorithm maintains a set of at most $b$ edges incident to $v$ that are in the matching. These edges can be thought of as “cached”: subsequent requests to matched (cached) edges do not incur further cost. Thus, from the perspective of a single node, the question is roughly equivalent to maintaining a cache of at most $b$ items (which leads to the typical algorithmic questions such as which item to cache or evict next).

However, if we simply run independent paging algorithms at all nodes, local perspectives of particular nodes might not be coherent with each other: one endpoint of an edge may want to keep the edge in the matching, while the other may want to evict it from the matching. This is practically undesirable, as transmitters and receivers typically need to be aligned and coordinated (Mellette et al., 2017; Ghobadi et al., 2016). To illustrate this issue, assume that node $u$ wants to add a new matching edge $\{u, v\}$, but it already has $b$ incident matched edges. To accommodate the new matching edge, $u$ removes edge $\{u, w\}$ from the matching. This however removes the edge not only from the “cache” of node $u$, but also from the “cache” of node $w$. Handling this coherence issue with a low overall cost constitutes a main technical challenge that we need to tackle in our algorithm.

2.1. Algorithm Definition

Our algorithm Bma is defined as follows. For each node pair $e$, Bma keeps a counter $c_e$, initially equal to zero. The value of $c_e$ will always be a lower bound for the number of times $e$ has been requested since it was removed from the matching the last time (or from the beginning of the input sequence if $e$ was never in the matching). For each node pair $e$ we define a threshold $k_e$.

Once the counter $c_e$ reaches $k_e$, and additional certain conditions are fulfilled, edge $e$ will be added to the matching. Otherwise, the counter will be reset to zero. A node pair $e$ whose counter value is equal to $k_e$ is called saturated; our algorithm always keeps all saturated node pairs in the matching. Note that these $k_e$ requests to node pair $e$ induced a total cost of $k_e \cdot \ell_e$.

For any node $v$, we define $E_v = \{e \in E : v \in e\}$, i.e., $E_v$ is the set of all node pairs with one node equal to $v$. Recall that, at any time, $M$ denotes the set of matching edges.

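Bma is designed to preserve three invariants for any node pair $e$:

Counter invariant:

$0 \le c_e \le k_e$.

Saturation invariant:

If $c_e = k_e$, then $e \in M$.

Matching invariant:

If $e \in M$, then $c_e = 0$ or $c_e = k_e$.

plus an invariant for any node $v$:

Saturation degree invariant:

$|\{e \in E_v : c_e = k_e\}| \le b$.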
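The four invariants can be checked mechanically on any intermediate state, which is handy when testing an implementation. The sketch below reads them as follows (an interpretation of the statements above, not the authors' code): each counter lies between zero and its threshold; a pair whose counter equals its threshold is in the matching; a matching edge has counter zero or equal to its threshold; and each node is incident to at most b saturated pairs. The function name and the dictionary-based state are hypothetical.

```python
def violated_invariants(counters, thresholds, matching, b):
    """Return the set of invariant names violated by the given state.

    counters and thresholds map frozenset({u, v}) to integers;
    pairs missing from counters are treated as having counter 0.
    Thresholds are assumed to be at least 1.
    """
    violated = set()
    saturated_degree = {}
    for pair, k in thresholds.items():
        c = counters.get(pair, 0)
        if not 0 <= c <= k:
            violated.add("counter")
        if c == k and pair not in matching:
            violated.add("saturation")
        if pair in matching and c not in (0, k):
            violated.add("matching")
        if c == k:
            for node in pair:
                saturated_degree[node] = saturated_degree.get(node, 0) + 1
    if any(d > b for d in saturated_degree.values()):
        violated.add("saturation degree")
    return violated


# Usage: a saturated pair that is missing from the matching.
ab = frozenset(("a", "b"))
problems = violated_invariants({ab: 2}, {ab: 2}, matching=set(), b=1)
```

Returning the names of the violated invariants, rather than raising on the first failure, makes it easier to see which step of an implementation went wrong.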

Pseudocode for our algorithm Bma is given in Algorithm 1. In the next section, we will explain it in more detail, and prove that it never violates the invariants.

2.2. Maintaining Invariants

At the beginning, the matching is empty and the counters of all edges are zero, and thus all invariants hold. Below we describe what happens upon serving a communication request $e = \{u, v\}$ by Bma and how Bma ensures that all invariants are preserved.

Bma first verifies whether $e$ is a matching edge. If so, then such a request incurs no cost and Bma does nothing. Otherwise, by the saturation invariant, $c_e < k_e$. In this case, Bma pays for the communication request and increments counter $c_e$. The increment preserves the counter invariant. The matching invariant holds vacuously as $e \notin M$. If $c_e$ is still below $k_e$, then the remaining invariants also hold and Bma does not execute any further actions.

However, if the value of $c_e$ reaches $k_e$ ($e$ becomes saturated), the saturation invariant becomes violated ($e$ should be a matching edge) and also the saturation degree invariants may become violated at $u$ and $v$. We now explain how Bma handles these issues.

Bma first ensures that the saturation degree invariant is satisfied at both endpoints of $e$. To this end, Bma executes the FixSaturation routine at $u$ and $v$. If at any node $w \in \{u, v\}$ the number of saturated node pairs from $E_w$ different from $e$ is already $b$, all edges of $E_w$, including $e$, have their counters reset to zero. In this case, we say that a desaturation event occurred at the respective endpoint $w$. Note that all four cases are possible: there can be no desaturation event, a desaturation event can occur at $u$, at $v$, or at both $u$ and $v$. The execution of the FixSaturation routines reestablishes the saturation degree invariants, and preserves counter, saturation and matching invariants at all edges different from $e$.

For edge $e$, the corresponding counter and matching invariants clearly hold. If any desaturation event occurs, then it also fixes the saturation invariant for $e$, and Bma need not do anything more. Otherwise (no desaturation event occurs, that is, $c_e$ is still equal to $k_e$), the saturation invariant is violated: $e$ has to be added to the matching. However, if any of the endpoints already has $b$ incident matching edges, one such edge has to be removed from the matching first. This is achieved by the FixMatching routines executed at both $u$ and $v$: if necessary, they remove one incident non-saturated matching edge. It remains to show that such a matching edge indeed exists.

Lemma 2.1 ().

A non-saturated matching edge $e'$ chosen at Line 25 of the algorithm Bma (in routine FixMatching) always exists. Moreover, when it is removed, $c_{e'} = 0$.

Proof.

Assume that FixMatching($w$) is executed (for a node $w \in \{u, v\}$). Let $S$ be the set of saturated node pairs from $E_w$. Note that FixMatching($w$) is preceded by the execution of FixSaturation($w$): it ensures that $|S| \le b$ ($S$ contains $e$ and at most $b - 1$ other edges).

Let $M_w = M \cap E_w$ be the set of matching edges incident to $w$. The condition at Line 24 ensures that $|M_w| = b$.

Now, observe that the set $S \setminus M_w$ is non-empty as it contains the requested node pair $e$. However, as $|S| \le b = |M_w|$, the set $M_w \setminus S$ is non-empty as well, and any of its edges is a viable candidate for $e'$.

Finally, by the matching invariant, the counter of a matching edge $e''$ is equal either to $0$ or $k_{e''}$. Hence, the counters of all matching edges in $M_w \setminus S$ are zero, and thus $c_{e'} = 0$. ∎
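To make the interplay of the two routines concrete, here is a minimal Python sketch of Bma as described in this section. It is an illustration under stated assumptions rather than the authors' implementation: the class and method names are hypothetical, the saturation thresholds are supplied by the caller as a dictionary (their formula is not reproduced here), and the desaturation test counts saturated pairs other than the requested one, following the prose description.

```python
from collections import defaultdict


class BmaSketch:
    """Illustrative sketch of Bma; names and data layout are assumptions."""

    def __init__(self, b, alpha, ell, thresholds):
        self.b = b                      # max matching edges per node
        self.alpha = alpha              # cost per matching edge added or removed
        self.ell = ell                  # pair -> shortest-path length in the fixed network
        self.thresholds = thresholds    # pair -> saturation threshold (caller-supplied)
        self.counter = defaultdict(int)
        self.matching = set()
        self.seen = defaultdict(set)    # node -> incident pairs requested so far
        self.cost = 0

    def _saturated(self, pair):
        return self.counter[pair] == self.thresholds[pair]

    def _fix_saturation(self, w, e):
        # Desaturation event at w if b pairs other than e are already saturated.
        others = [p for p in self.seen[w] if p != e and self._saturated(p)]
        if len(others) >= self.b:
            for p in self.seen[w]:      # never-requested pairs already have counter 0
                self.counter[p] = 0

    def _fix_matching(self, w):
        incident = [p for p in self.matching if w in p]
        if len(incident) == self.b:
            # Lemma 2.1 guarantees that an unsaturated matching edge exists.
            victim = next(p for p in incident if not self._saturated(p))
            self.matching.remove(victim)
            self.cost += self.alpha

    def serve(self, u, v):
        e = frozenset((u, v))
        if e in self.matching:
            return                      # served by a matching edge at no cost
        self.cost += self.ell[e]
        self.counter[e] += 1
        self.seen[u].add(e)
        self.seen[v].add(e)
        if not self._saturated(e):
            return
        self._fix_saturation(u, e)
        self._fix_saturation(v, e)
        if self._saturated(e):          # no desaturation event reset the counter
            self._fix_matching(u)
            self._fix_matching(v)
            self.matching.add(e)
            self.cost += self.alpha


# Usage: three racks, one reconfigurable link per rack, toy thresholds of 2.
pairs = [frozenset(p) for p in (("a", "b"), ("a", "c"), ("b", "c"))]
alg = BmaSketch(b=1, alpha=3, ell={p: 1 for p in pairs},
                thresholds={p: 2 for p in pairs})
for u, v in [("a", "b")] * 3 + [("a", "c")] * 4:
    alg.serve(u, v)
```

In this toy run, the first time the pair a, c saturates it triggers a desaturation event at a (the matched pair a, b is still saturated), so all counters at a are reset; only the second saturation of a, c evicts a, b and enters the matching.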

2.3. Desaturation Events

Fix any node $w$. For the analysis of Bma, a natural approach would be to estimate the number of paid requests to all node pairs of $E_w$ between two desaturation events at $w$. This number corresponds to the total increase of all counters corresponding to these node pairs in the considered time interval. However, such an approach fails as these counters may be reset multiple times because of desaturation events at other nodes. In particular, it is possible that a node pair from $E_w$ is included multiple times in the matching between two desaturation events at $w$. Therefore, we develop a more complicated accounting scheme.

First, we not only track counters, but for any node pair $e$ we also keep track of a set $P_e$ of requests paid by Bma (that caused the increase of the counter $c_e$, i.e., $|P_e| = c_e$). When the counter is reset, the set $P_e$ is emptied.

When requests paid by Bma become removed from sets , we map them to the corresponding desaturation events: for any desaturation event , we create a set of requests , so that all these sets are disjoint. Requests that still belong to the current contents of sets are not (yet) mapped. More precisely, note that when a desaturation event at a node occurs, we empty all sets for node pairs . If a request triggers a single desaturation event at , then we simply set , i.e., we map all requests corresponding to counters that were reset by . If, however, a request triggers desaturation events and both at and , we want requests from to be mapped (partially) to both desaturation events. Thus, we partition requests from arbitrarily into two subsets and , each of cardinality , and set and .

For any request set , let , i.e., is the cost of serving all requests from without using matching edges. For any desaturation event and a node pair , let be the requests of  to node pair . The following observation follows immediately by the definition of Bma and sets .

Observation 1 ().

Fix any desaturation event at any node . Then, the following properties hold:

  1. For any node pair , it holds that .

  2. There exists a set of cardinality , such that for each .

2.4. Competitive Ratio of Bma

We now use sets to estimate the costs of Bma and Opt. We do not aim at optimizing the constants, but rather at the simplicity of the argument.

Lemma 2.2 ().

Let be the set of all desaturation events that occurred during input . Then,

Proof.

Within this proof, we consider contents of sets right after Bma processes the whole input . For any node pair , the set contains at most edges, and therefore

Any request to a node pair  paid by Bma is either in set or it was already assigned to a set for some desaturation event . Thus, the cost of serving all requests by Bma is at most

To bound the cost of matching changes, we observe that by the definition of Bma, only a saturated node pair may become included in the matching. If becomes removed from the matching later, then by Lemma 2.1, the counter of must have dropped to zero in the meantime. Therefore, any addition of to the matching can be mapped to the unique paid requests to . As the cost of such requests is , the total cost of including in the matching is dominated by the half of the cost of serving requests to . Furthermore, as the number of removals from the matching cannot be larger than the number of additions, the total cost of excluding from the matching is also dominated by the same amount. Summing up, the matching reconfiguration cost of Bma is not larger than its request serving cost, i.e.,

which concludes the lemma. ∎

Lemma 2.3 ().

Let be the set of all desaturation events that occurred during input . Then

Proof.

To estimate the cost of Opt, it is more convenient to think that its cost is not associated with node pairs but with nodes. That is, we distribute the cost of Opt pertaining to node pairs (paying for a request, including an edge in the matching or removing an edge from the matching) equally between the endpoints: When Opt pays for a request at node pair , we account cost for node and cost for node . When Opt pays for including node pair into the matching or excluding it from the matching, we associate cost with node and with node .

Now, fix a desaturation event at a node . Let be the previous desaturation event at node . (If is the first desaturation event at , then is the beginning of the input .) Note that all requests of appeared between and .

Let the node cost (in Opt’s solution) of between and be denoted . As each node-cost paid by Opt is covered by at most one term , it holds that . Hence, our goal is to lower bound the value of for any desaturation event (at some node ).

We sort the edges from by the cost of serving them between and . That is, let and

Let be the number of node pairs from that Opt added to the matching between and . The corresponding node cost of due to matching changes is then at least . Then, the total number of all node pairs from  that Opt may have in the matching at some time between and is at most . Therefore, Opt pays for requests from to all node pairs but at most node pairs, i.e.,

To lower-bound this amount, we first observe that for any , it holds that

(1)

Indeed, if , then by Property 2 of Observation 1), , and thus (1) follows. If , then holds for any , which implies (1).

Second, by Property 1 of Observation 1, for each node pair , it holds that

(2)

Therefore, using (1) and (2), we obtain that

Summing this relation over all desaturation events from the input  and using the relation yields the lemma. ∎

Theorem 2.4 ().

Bma is -competitive.

Proof.

Fix any input instance and let be the number of desaturation events that occurred when Bma was executed on . By Lemmas 2.2 and 2.3, we immediately obtain that , i.e., the competitive ratio is at most . ∎

3. Lower Bound

Theorem 3.1 ().

The competitive ratio of any deterministic algorithm Det is at least .

Proof.

Let our graph be a star of nodes and non-reconfigurable edge set

Each edge of has length . We start with any matching that connects to leaves. At any time, the adversary chooses which is not currently matched with , and requests a node pair for times. These requests constitute one chunk.

For each chunk, Det pays at least : either for modifying the matching or for bypassing all requests. An offline algorithm Off (that knows the entire input sequence) could however make a smarter selection of an edge to remove from the matching: Off chooses the one which is not going to be requested in the nearest rounds. Hence, Off pays at most for chunks of the input. For growing , the ratio between the costs of Det and Off becomes arbitrarily close to , and hence the theorem follows. ∎
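The adversary in this proof is simple to state operationally: it always requests a pair between the star's center and some leaf that the online algorithm currently does not have in its matching. The sketch below illustrates one such step. The function name, the representation of the matching as a set of `frozenset` pairs, and the `chunk_len` parameter are assumptions for illustration; the chunk length used in the proof is not reproduced here.

```python
def adversary_chunk(center, leaves, matching, chunk_len):
    """Return one chunk: chunk_len requests to a pair {center, leaf}
    that is currently not a matching edge of the online algorithm."""
    for leaf in leaves:
        pair = frozenset((center, leaf))
        if pair not in matching:
            return [pair] * chunk_len
    raise ValueError("every leaf is matched with the center; more leaves are needed")


# Usage: center c with three leaves, two of which are currently matched (b = 2).
current = {frozenset(("c", "l1")), frozenset(("c", "l2"))}
chunk = adversary_chunk("c", ["l1", "l2", "l3"], current, chunk_len=4)
```

Because the star has more leaves than the center can be matched to, an unmatched leaf always exists, so the adversary can repeat this step indefinitely against any deterministic algorithm.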

4. Simulations

In order to complement our theoretical contribution and analytical results on the competitive ratio in the worst case, we conducted extensive simulations, evaluating our algorithms on real-world traffic traces. In the following, we report on our main results.

4.1. Methodology

All our algorithms are implemented in Python (3.7.3), using the graph library NetworkX (2.3.2). All simulations were conducted on a machine with two Intel Xeon E5-2697V3 processors with 2.6 GHz, 128 GB RAM, and 14 cores each.

Our simulations are based on the following workloads:

  • Facebook (Roy et al., 2015): We use the batch processing trace (Hadoop) from one of Facebook’s datacenters, as well as traces from one of Facebook’s database clusters.

  • Microsoft (Ghobadi et al., 2016): This data set is simply a probability distribution, describing the rack-to-rack communication (a traffic matrix). In order to generate a trace, we sample from this distribution i.i.d. Hence, this trace does not contain any temporal structure by design (e.g., is not bursty) (Avin et al., 2020). However, it is known that it contains significant spatial structure (i.e., is skewed).

  • pFabric (Alizadeh et al., 2013): This is a synthetic trace and we run the NS2 simulation script obtained from the authors of the paper to generate a trace.
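Generating a trace from a traffic matrix, as done for the Microsoft workload, amounts to i.i.d. sampling of rack pairs with probabilities proportional to the matrix entries. The following sketch shows one way to do this with the Python standard library; the function name, the dictionary layout of the traffic matrix, and the fixed seed are illustrative assumptions, not details taken from the paper.

```python
import random


def sample_trace(traffic_matrix, num_requests, seed=0):
    """Draw num_requests rack pairs i.i.d., with probability proportional
    to the corresponding entry of traffic_matrix (pair -> weight)."""
    rng = random.Random(seed)           # fixed seed for reproducible traces
    pairs = list(traffic_matrix)
    weights = [traffic_matrix[p] for p in pairs]
    return rng.choices(pairs, weights=weights, k=num_requests)


# Usage: a skewed matrix in which most traffic flows between r1 and r2.
matrix = {("r1", "r2"): 0.9, ("r1", "r3"): 0.1, ("r2", "r3"): 0.0}
trace = sample_trace(matrix, num_requests=50, seed=1)
```

A trace produced this way has no temporal structure, since each request is drawn independently, but it inherits the skew of the matrix.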

In order to evaluate our algorithm, we compare four different scenarios in our simulations:

  • Oblivious: The network topology is fixed and not optimized towards the workload by adding reconfigurable links.

  • Static: The network topology is enhanced with an optimal static -matching, computed with the perfect knowledge of the workload ahead of time.

  • Online BMA: The online algorithm described in this paper.

  • LRU BMA: Like online BMA, however, the cache is now managed according to a least-recently used (LRU) strategy. In other words, when a link needs to be cached and the cache is full, the least recently used link in the cache is evicted.
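The LRU eviction rule of the last scenario can be sketched as follows. This only illustrates the victim selection at a single node, assuming a per-edge timestamp of the last request is maintained; how LRU BMA combines this rule with the counters of Online BMA is not specified here, and the function and parameter names are hypothetical.

```python
def lru_victim(node, matching, last_used):
    """Return the matching edge incident to node that was requested least
    recently, or None if node has no incident matching edge.

    last_used maps frozenset({u, v}) to the time of the most recent request;
    edges without an entry are treated as the oldest.
    """
    incident = [pair for pair in matching if node in pair]
    if not incident:
        return None
    return min(incident, key=lambda pair: last_used.get(pair, -1))


# Usage: node a has two cached links; the link to c was used least recently.
cached = {frozenset(("a", "b")), frozenset(("a", "c")), frozenset(("b", "c"))}
stamps = {frozenset(("a", "b")): 5, frozenset(("a", "c")): 2, frozenset(("b", "c")): 1}
victim = lru_victim("a", cached, stamps)
```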

Figure 1. Left: Hit ratio for Facebook database cluster trace (with lower temporal locality), Middle: pFabric trace (with high temporal locality), Right: Microsoft trace (with high spatial locality).
Figure 2. Left and Right: Facebook Hadoop cluster: routing costs. Middle: Facebook Hadoop cluster: hit ratio for different cache sizes.

For all simulations, we assume a Clos-like datacenter topology (Al-Fares et al., 2008), connecting 100 servers (leaf nodes of the Clos topology). In addition, the number of requests for each of our simulations depends on the actual trace; therefore the simulations on the Facebook cluster have a slightly different number of requests than, e.g., the Microsoft trace data. Each test run was performed with six different request counts. The simulations were repeated 5 times, each time with a different subset of the whole data set to account for variance in the data; the presented results are averaged over these simulation runs. We evaluated our algorithms with several values for $b$ and $\alpha$. Note that for larger $b$, less traffic will be routed over the static network, given our cost function. Given this, and the fact that reconfigurable links require space, we will be particularly interested in relatively small values of $b$: only a small fraction of all possible links is actually used. Evaluating the effectiveness of small values for $b$ is hence not only more interesting, but also more realistic.

4.2. Results

In order to study to which extent the Online BMA algorithm can leverage the temporal locality available in traffic traces, we first consider the effectiveness of the link cache, as a microbenchmark. Figure 1 shows a comparison of the hit ratio for Facebook’s database traces (left), pFabric traces (middle) and Microsoft traces (right). We can observe that in the case of the pFabric and Microsoft traces, a relatively high hit ratio is obtained after a short warm-up period, especially if a least-recently-used (LRU BMA) caching strategy is used. We can also observe that our online algorithm performs better under the pFabric and Microsoft traces, which is expected: empirical studies have already shown that these traces feature more structure than the batch processing traces (Avin et al., 2020). We also find that the results naturally depend on the cache size, see Figure 2 (middle). An important remark is that the degree $b$ needs to be understood relative to the total number of switch ports, i.e., similar results are obtained for relatively larger values.
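A hit ratio of this kind can be tracked as a running fraction over the request sequence, which also makes the warm-up period visible. The sketch below assumes that the hit ratio denotes the share of requests served over a matching edge; the function name and the boolean input format are illustrative.

```python
def running_hit_ratio(served_by_matching):
    """Given one boolean per request (True if the request was served by a
    matching edge), return the hit ratio after each request."""
    ratios = []
    hits = 0
    for count, hit in enumerate(served_by_matching, start=1):
        hits += bool(hit)
        ratios.append(hits / count)
    return ratios


# Usage: a miss during warm-up followed by three hits.
curve = running_hit_ratio([False, True, True, True])
```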

It is interesting to compare the results of our online algorithms to demand-oblivious topologies as well as to static topologies. Figure 2 gives a comprehensive overview of our algorithms’ performance in terms of route lengths (left and right plot) and also regarding the cache hit ratio (middle plot) for different cache sizes for the Facebook Hadoop cluster. Notably, Figure 2 (left and middle plot) gives insights into our algorithm’s performance over all 5 test runs, illustrating the average result, as well as the maximum and minimum result (shaded areas).

As expected, Oblivious always performs worse than Static, Online BMA and LRU BMA. We further observe that the performance of Online BMA comes close to the performance of Static, which knows the demands ahead of time (but is fixed). We expect that under longer request sequences, when larger shifts in the communication patterns are likely to appear, the online approach will outperform the static offline algorithm. To investigate this, however, the publicly available traffic traces are not sufficient.

While the Microsoft trace does not contain temporal structure as it is sampled i.i.d., its spatial structure, i.e., the skewed traffic matrix, can still be exploited for more efficient routing and yields a very high cache hit ratio; see Figure 3.

In conclusion, while our main contribution in this paper concerns the theoretical result, we observe that our online algorithm performs fairly well under real-world workloads, even without further optimizations (besides an improved cache eviction strategy).

5. Related Work

Reconfigurable networks based on optical circuit switches, 60 GHz wireless, and free-space optics, have received much attention over the last years (Ghobadi et al., 2016; Liu et al., 2014; Farrington et al., 2010; Azimi et al., 2014; Zhou et al., 2012). It has been shown empirically that reconfigurable networks can achieve a performance similar to a demand-oblivious full-bisection bandwidth network at significantly lower cost (Azimi et al., 2014; Ghobadi et al., 2016). Furthermore, the study of reconfigurable networks is not limited to datacenters and interesting use cases also arise in the context of wide-area networks (Jia et al., 2017; Jin et al., 2016) and overlays (Scheideler and Schmid, 2009; Ratnasamy et al., 2002).

Our paper is primarily concerned with the algorithmic problems introduced by such technologies. In this regard, our paper is related to graph augmentation models, which consider the problem of adding edges to a given graph, so that path lengths are reduced. For example, Meyerson and Tagiku (Meyerson and Tagiku, 2009) study how to add “shortcut edges” to minimize the average shortest path distances, Bilò et al. (Bilò et al., 2012) and Demaine and Zadimoghaddam (Demaine and Zadimoghaddam, 2010) study how to augment a network to reduce its diameter, and there are several interesting results on how to add “ghost edges” to a graph such that it becomes (more) “small world” (Papagelis et al., 2011; Parotsidis et al., 2015; Gozzard et al., 2018). However, these edge additions can be optimized globally and in a biased manner, and hence do not form a matching. In particular, it is impractical (and does not scale) to add many flexible links per node in practice. Another line of related works considers the design of demand-aware networks from scratch (Avin et al., 2017; Huq and Ghosh, 2017; Avin et al., 2018a), ignoring the fixed topology which is available in current architectures (and in the near future). In this regard, the works by Foerster et al. (Foerster et al., 2018) are more closely related to our paper: the authors present algorithms that enhance a given network with a matching to optimize the (weighted average) route lengths. However, all the papers discussed in this paragraph so far focus on the static problem variant and do not consider dynamic reconfiguration over time. 
Approaches that reconfigure the network over time and explicitly account for reconfiguration costs include Eclipse (Venkatakrishnan et al., 2016), SplayNets (Schmid et al., 2016) and Push-Down Trees (Avin et al., 2018b). However, these do not provide a deterministic guarantee on the competitive ratio of the online algorithm, and the latter two (Schmid et al., 2016; Avin et al., 2018b) are also limited to tree networks.

In this paper, we initiated the study of an online version of the dynamic b-matching problem. A polynomial-time algorithm for the static version of this problem was presented over 30 years ago (Schrijver, 2003; Anstee, 1987), and the problem still receives attention today due to its numerous applications, for example in settings where customers in a market need to be matched to a cardinality-constrained set of items, e.g., matching children to schools, reviewers to papers, or donor organs to patients, but also in protein structure alignment, computer vision, estimating text similarity, VLSI design

Figure 3. Left: Microsoft ProjecToR: routing costs. Right: Microsoft ProjecToR: hit ratio.

Note that there is a line of papers studying (bipartite) online matching variants (Karp et al., 1990; Birnbaum and Mathieu, 2008; Buchbinder et al., 2007; Devanur and Jain, 2012; Devanur et al., 2013; Mahdian and Yan, 2011; Mehta et al., 2007; Naor and Wajc, 2015). This problem has attracted significant attention in the last decade because of its connection to online auctions and the famous AdWords problem (Mehta, 2013). Despite the similarity in names (e.g., the bipartite (static) b-matching variant was considered in (Kalyanasundaram and Pruhs, 2000)), this model is fundamentally different from ours. That is, it considers bipartite graphs in which nodes and (weighted) edges appear over time and the algorithm has to choose a subset of edges that forms a matching. In our scenario, the (non-bipartite) graph is given a priori, and the algorithm has to maintain a dynamic matching. One way of looking at our scenario is to consider the case where edge weights can change over time and the matching maintained by an algorithm needs to catch up with such changes.
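To make the b-matching constraint concrete (each node has at most b adjacent matching edges), the following minimal Python sketch checks feasibility of an edge set and builds a static b-matching with a simple greedy heuristic. It is an illustration only: the function names `is_b_matching` and `greedy_b_matching` and the dictionary-based input format are assumptions of this sketch, and the greedy rule is neither the exact polynomial-time algorithm cited above nor the online algorithm of this paper.

```python
from collections import Counter


def is_b_matching(edges, b):
    """Return True if no node is incident to more than b of the given edges."""
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    return all(d <= b for d in degree.values())


def greedy_b_matching(weights, b):
    """Greedy heuristic for a static weighted b-matching (hypothetical helper).

    weights maps an edge (u, v) to its current weight. Edges are scanned by
    decreasing weight and kept while both endpoints still have spare capacity.
    """
    degree = Counter()
    chosen = set()
    for (u, v), _w in sorted(weights.items(), key=lambda kv: -kv[1]):
        if degree[u] < b and degree[v] < b:
            chosen.add((u, v))
            degree[u] += 1
            degree[v] += 1
    return chosen


# Illustrative weights on a fixed graph given a priori (made-up numbers).
weights = {("a", "b"): 5, ("b", "c"): 4, ("c", "d"): 3, ("a", "c"): 2}
matching_b1 = greedy_b_matching(weights, b=1)
matching_b2 = greedy_b_matching(weights, b=2)
```

Recomputing such a matching from scratch whenever the weights change ignores the cost of reconfiguring the matching, which is the cost an online algorithm in our setting has to balance against the benefit of the new edges.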

6. Conclusion

Motivated by emerging reconfigurable datacenter networks whose topology can be dynamically optimized toward the workload, we initiated the study of a fundamental problem, online b-matching. In particular, we presented competitive online algorithms which find an optimal trade-off between the benefits and costs of reconfiguring the matching. While our main contribution concerns the derived theoretical results (i.e., the competitive online algorithm and the lower bound), we believe that our approach has several interesting practical implications: in particular, our algorithm is simple to implement, has a low runtime and, as we have shown, also performs fairly well under different real-world workloads and synthetic traffic traces.

Our work opens several interesting avenues for future research. In particular, we have so far focused on deterministic algorithms, and it would be interesting to explore randomized approaches; in fact, our first investigations in this direction indicate that an approach similar to the one presented in this paper may be challenging to analyze in the randomized setting, due to the introduced dependencies, and we conjecture that the problem is difficult. On the practical side, it would be interesting to investigate specific reconfigurable optical technologies as well as specific datacenter topologies (such as Clos topologies) in more detail, and to tailor our algorithms and develop distributed implementations for optimal performance in these settings.

References

  • Achlioptas et al. (2000) Dimitris Achlioptas, Marek Chrobak, and John Noga. 2000. Competitive analysis of randomized paging algorithms. Theoretical Computer Science 234, 1–2 (2000), 203–218.
  • Al-Fares et al. (2008) Mohammad Al-Fares, Alexander Loukissas, and Amin Vahdat. 2008. A scalable, commodity data center network architecture. In Proc. ACM SIGCOMM. 63–74.
  • Alizadeh et al. (2013) Mohammad Alizadeh, Shuang Yang, Milad Sharif, Sachin Katti, Nick McKeown, Balaji Prabhakar, and Scott Shenker. 2013. pfabric: Minimal near-optimal datacenter transport. ACM SIGCOMM Computer Communication Review 43, 4 (2013), 435–446.
  • Anstee (1987) Richard P Anstee. 1987. A polynomial algorithm for b-matchings: an alternative approach. Inform. Process. Lett. 24, 3 (1987), 153–157.
  • Avin et al. (2020) Chen Avin, Manya Ghobadi, Chen Griner, and Stefan Schmid. 2020. On the Complexity of Traffic Traces and Implications. In Proc. ACM SIGMETRICS.
  • Avin et al. (2018a) Chen Avin, Alexandr Hercules, Andreas Loukas, and Stefan Schmid. 2018a. rDAN: Toward robust demand-aware network designs. Inform. Process. Lett. 133 (2018), 5–9.
  • Avin et al. (2017) Chen Avin, Kaushik Mondal, and Stefan Schmid. 2017. Demand-Aware Network Designs of Bounded Degree. In Proc. Int. Symp. on Distributed Computing (DISC) (LIPIcs), Vol. 91. 5:1–5:16.
  • Avin et al. (2018b) Chen Avin, Kaushik Mondal, and Stefan Schmid. 2018b. Push-Down Trees: Optimal Self-Adjusting Complete Trees. In arXiv.
  • Avin and Schmid (2018) Chen Avin and Stefan Schmid. 2018. Toward demand-aware networking: a theory for self-adjusting networks. ACM SIGCOMM Computer Communication Review 48, 5 (2018), 31–40.
  • Azimi et al. (2014) Navid Hamed Azimi, Zafar Ayyub Qazi, Himanshu Gupta, Vyas Sekar, Samir R. Das, Jon P. Longtin, Himanshu Shah, and Ashish Tanwer. 2014. FireFly: a reconfigurable wireless data center fabric using free-space optics. In Proc. ACM SIGCOMM. 319–330.
  • Benson et al. (2010) Theophilus Benson, Ashok Anand, Aditya Akella, and Ming Zhang. 2010. Understanding data center traffic characteristics. ACM SIGCOMM Computer Communication Review 40, 1 (2010), 92–99.
  • Bilò et al. (2012) Davide Bilò, Luciano Gualà, and Guido Proietti. 2012. Improved approximability and non-approximability results for graph diameter decreasing problems. Theoretical Computer Science 417 (2012), 12–22.
  • Birnbaum and Mathieu (2008) Benjamin Birnbaum and Claire Mathieu. 2008. On-line bipartite matching made simple. SIGACT News 39, 1 (2008), 80–87.
  • Borodin and El-Yaniv (1998) Allan Borodin and Ran El-Yaniv. 1998. Online Computation and Competitive Analysis. Cambridge University Press.
  • Buchbinder et al. (2007) Niv Buchbinder, Kamal Jain, and Joseph Naor. 2007. Online Primal-Dual Algorithms for Maximizing Ad-Auctions Revenue. In Proc. 15th European Symp. on Algorithms (ESA). 253–264.
  • Demaine and Zadimoghaddam (2010) Erik D Demaine and Morteza Zadimoghaddam. 2010. Minimizing the diameter of a network using shortcut edges. In Proc. Scandinavian Symposium and Workshops on Algorithm Theory (SWAT). 420–431.
  • Devanur and Jain (2012) Nikhil R. Devanur and Kamal Jain. 2012. Online matching with concave returns. In Proc. 44th ACM Symp. on Theory of Computing (STOC). 137–144.
  • Devanur et al. (2013) Nikhil R. Devanur, Kamal Jain, and Robert D. Kleinberg. 2013. Randomized Primal-Dual analysis of RANKING for Online BiPartite Matching. In Proc. 24th ACM-SIAM Symp. on Discrete Algorithms (SODA). 101–107.
  • Epstein et al. (2015) Leah Epstein, Csanád Imreh, Asaf Levin, and Judit Nagy-György. 2015. Online File Caching with Rejection Penalties. Algorithmica 71, 2 (2015), 279–306.
  • Farrington et al. (2010) Nathan Farrington, George Porter, Sivasankar Radhakrishnan, Hamid Hajabdolali Bazzaz, Vikram Subramanya, Yeshaiahu Fainman, George Papen, and Amin Vahdat. 2010. Helios: a hybrid electrical/optical switch architecture for modular data centers. In Proc. ACM SIGCOMM. 339–350.
  • Fiat et al. (1991) Amos Fiat, Richard M. Karp, Michael Luby, Lyle A. McGeoch, Daniel D. Sleator, and Neal E. Young. 1991. Competitive paging algorithms. Journal of Algorithms 12, 4 (1991), 685–699.
  • Foerster et al. (2018) Klaus-Tycho Foerster, Manya Ghobadi, and Stefan Schmid. 2018. Characterizing the algorithmic complexity of reconfigurable data center architectures. In IEEE/ACM ANCS.
  • Foerster and Schmid (2019) Klaus-Tycho Foerster and Stefan Schmid. 2019. Survey of Reconfigurable Data Center Networks: Enablers, Algorithms, Complexity. In SIGACT News.
  • Ghobadi et al. (2016) Monia Ghobadi, Ratul Mahajan, Amar Phanishayee, Nikhil R. Devanur, Janardhan Kulkarni, Gireeja Ranade, Pierre-Alexandre Blanche, Houman Rastegarfar, Madeleine Glick, and Daniel C. Kilper. 2016. ProjecToR: Agile Reconfigurable Data Center Interconnect. In Proc. ACM SIGCOMM. 216–229.
  • Gozzard et al. (2018) Andrew Gozzard, Max Ward, and Amitava Datta. 2018. Converting a network into a small-world network: Fast algorithms for minimizing average path length through link addition. Information Sciences 422 (2018), 282–289.
  • Guo et al. (2009) Chuanxiong Guo, Guohan Lu, Dan Li, Haitao Wu, Xuan Zhang, Yunfeng Shi, Chen Tian, Yongguang Zhang, and Songwu Lu. 2009. BCube: a high performance, server-centric network architecture for modular data centers. In Proc. ACM SIGCOMM. 63–74.
  • Huq and Ghosh (2017) Sikder Huq and Sukumar Ghosh. 2017. Locally Self-Adjusting Skip Graphs. Proc. IEEE 37th International Conference on Distributed Computing Systems (ICDCS) (2017), 805–815.
  • Jia et al. (2017) Su Jia, Xin Jin, Golnaz Ghasemiesfeh, Jiaxin Ding, and Jie Gao. 2017. Competitive analysis for online scheduling in software-defined optical WAN. In INFOCOM. 1–9.
  • Jin et al. (2016) Xin Jin, Yiran Li, Da Wei, Siming Li, Jie Gao, Lei Xu, Guangzhi Li, Wei Xu, and Jennifer Rexford. 2016. Optimizing Bulk Transfers with Software-Defined Optical WAN. In Proc. ACM SIGCOMM. 87–100.
  • Kachris and Tomkos (2012) Christoforos Kachris and Ioannis Tomkos. 2012. A Survey on Optical Interconnects for Data Centers. IEEE Communications Surveys and Tutorials 14, 4 (2012), 1021–1036.
  • Kalyanasundaram and Pruhs (2000) Bala Kalyanasundaram and Kirk R Pruhs. 2000. An optimal deterministic algorithm for online b-matching. Theoretical Computer Science 233, 1–2 (2000), 319–325.
  • Kandula et al. (2009) Srikanth Kandula, Jitendra Padhye, and Paramvir Bahl. 2009. Flyways To De-Congest Data Center Networks. In HotNets.
  • Karp et al. (1990) Richard M. Karp, Umesh V. Vazirani, and Vijay V. Vazirani. 1990. An optimal algorithm for on-line bipartite matching. In Proc. 22nd ACM Symp. on Theory of Computing (STOC). 352–358.
  • Kassing et al. (2017) Simon Kassing, Asaf Valadarsky, Gal Shahaf, Michael Schapira, and Ankit Singla. 2017. Beyond fat-trees without antennae, mirrors, and disco-balls. In Proc. ACM SIGCOMM. 281–294.
  • Liu et al. (2014) He Liu, Feng Lu, Alex Forencich, Rishi Kapoor, Malveeka Tewari, Geoffrey M. Voelker, George Papen, Alex C. Snoeren, and George Porter. 2014. Circuit Switching Under the Radar with REACToR. In NSDI. 1–15.
  • Liu et al. (2013) Vincent Liu, Daniel Halperin, Arvind Krishnamurthy, and Thomas E. Anderson. 2013. F10: A Fault-Tolerant Engineered Network. In NSDI. 399–412.
  • Mahdian and Yan (2011) Mohammad Mahdian and Qiqi Yan. 2011. Online bipartite matching with random arrivals: an approach based on strongly factor-revealing LPs. In Proc. 43rd ACM Symp. on Theory of Computing (STOC). 597–606.
  • McGeoch and Sleator (1991) Lyle A. McGeoch and Daniel D. Sleator. 1991. A Strongly Competitive Randomized Paging Algorithm. Algorithmica 6, 6 (1991), 816–825.
  • Mehta (2013) Aranyak Mehta. 2013. Online Matching and Ad Allocation. Foundations and Trends in Theoretical Computer Science 8, 4 (2013), 265–368.
  • Mehta et al. (2007) Aranyak Mehta, Amin Saberi, Umesh V. Vazirani, and Vijay V. Vazirani. 2007. AdWords and generalized online matching. J. ACM 54, 5 (2007).
  • Mellette et al. (2019) William M Mellette, Rajdeep Das, Yibo Guo, Rob McGuinness, Alex C Snoeren, and George Porter. 2019. Expanding across time to deliver bandwidth efficiency and low latency. arXiv preprint arXiv:1903.12307 (2019).
  • Mellette et al. (2017) William M. Mellette, Rob McGuinness, Arjun Roy, Alex Forencich, George Papen, Alex C. Snoeren, and George Porter. 2017. RotorNet: A Scalable, Low-complexity, Optical Datacenter Network. In Proc. ACM SIGCOMM. 267–280.
  • Meyerson and Tagiku (2009) Adam Meyerson and Brian Tagiku. 2009. Minimizing average shortest path distances via shortcut edge addition. In Proc. Int. Workshop on Approximation Algorithms for Combinatorial Optimization (APPROX). 272–285.
  • Naor and Wajc (2015) Joseph Naor and David Wajc. 2015. Near-Optimum Online Ad Allocation for Targeted Advertising. In Proc. 16th ACM Conf. on Economics and Computation (EC). 131–148.
  • Papagelis et al. (2011) Manos Papagelis, Francesco Bonchi, and Aristides Gionis. 2011. Suggesting Ghost Edges for a Smaller World. In Proc. 20th ACM International Conference on Information and Knowledge Management (CIKM). 2305–2308.
  • Parotsidis et al. (2015) Nikos Parotsidis, Evaggelia Pitoura, and Panayiotis Tsaparas. 2015. Selecting shortcuts for a smaller world. In Proc. SIAM International Conference on Data Mining. 28–36.
  • Ratnasamy et al. (2002) Sylvia Ratnasamy, Mark Handley, Richard M. Karp, and Scott Shenker. 2002. Topologically-Aware Overlay Construction and Server Selection. In INFOCOM. 1190–1199.
  • Roy et al. (2015) Arjun Roy, Hongyi Zeng, Jasmeet Bagga, George Porter, and Alex C Snoeren. 2015. Inside the social network’s (datacenter) network. In ACM SIGCOMM Computer Communication Review, Vol. 45. 123–137.
  • Scheideler and Schmid (2009) Christian Scheideler and Stefan Schmid. 2009. A Distributed and Oblivious Heap. In ICALP (2) (Lecture Notes in Computer Science), Vol. 5556. 571–582.
  • Schmid et al. (2016) Stefan Schmid, Chen Avin, Christian Scheideler, Michael Borokhovich, Bernhard Haeupler, and Zvi Lotker. 2016. SplayNet: Towards Locally Self-Adjusting Networks. IEEE/ACM Trans. Netw. 24, 3 (2016), 1421–1433.
  • Schrijver (2003) Alexander Schrijver. 2003. Combinatorial Optimization: Polyhedra and Efficiency. Springer.
  • Singla et al. (2012) Ankit Singla, Chi-Yao Hong, Lucian Popa, and Philip Brighten Godfrey. 2012. Jellyfish: Networking Data Centers Randomly. In NSDI. 225–238.
  • Sleator and Tarjan (1985) Daniel D. Sleator and Robert E. Tarjan. 1985. Amortized efficiency of list update and paging rules. Commun. ACM 28, 2 (1985), 202–208.
  • Venkatakrishnan et al. (2016) Shaileshh Bojja Venkatakrishnan, Mohammad Alizadeh, and Pramod Viswanath. 2016. Costly Circuits, Submodular Schedules and Approximate Carathéodory Theorems. In SIGMETRICS. 75–88.
  • Wang et al. (2018) Mowei Wang, Yong Cui, Shihan Xiao, Xin Wang, Dan Yang, Kai Chen, and Jun Zhu. 2018. Neural network meets DCN: Traffic-driven topology adaptation with deep learning. Proceedings of the ACM on Measurement and Analysis of Computing Systems 2, 2 (2018), 1–25.
  • Wu et al. (2009) Haitao Wu, Guohan Lu, Dan Li, Chuanxiong Guo, and Yongguang Zhang. 2009. MDCube: a high performance network structure for modular data center interconnection. In CoNEXT. 25–36.
  • Xia et al. (2017) Wenfeng Xia, Peng Zhao, Yonggang Wen, and Haiyong Xie. 2017. A Survey on Data Center Networking (DCN): Infrastructure and Operations. IEEE Communications Surveys and Tutorials 19, 1 (2017), 640–656.
  • Zhou et al. (2012) Xia Zhou, Zengbin Zhang, Yibo Zhu, Yubo Li, Saipriya Kumar, Amin Vahdat, Ben Y Zhao, and Haitao Zheng. 2012. Mirror mirror on the ceiling: Flexible wireless links for data centers. ACM SIGCOMM Computer Communication Review 42, 4 (2012), 443–454.