Data is now generated at geo-distributed sites far faster than existing networks can transfer it; for instance, telescopes around the world produce an enormous amount of astronomy data. There are two main reasons for having geo-distributed data: (1) Datacenters (DCs) are built across the globe. (2) Organizations prefer to use multiple clouds to increase reliability, security, and processing capacity. Moreover, many applications process and analyze huge amounts of geo-distributed data to extract useful information. A typical scenario in processing geo-distributed data is that several analysis tasks run simultaneously, and each requires a fraction of the collected data [NagaA2018, PuQifan2015, Yue2016Fast, Yin2015, Khuller2016]. In addition, every analysis task moves the data it needs to a single location before the computation. Fig. 1 shows an example of geo-distributed telescope data.
Network bandwidth is a crucial factor in geo-distributed data movement and becomes the resource bottleneck. For example, the demand for bandwidth increased from 60 to 290 Tbps between 2011 and 2015, while network capacity did not grow proportionally. In 2015, the network capacity growth was only 40 percent, which was the lowest during the years 2011 and 2014 (https://www.telegeography.com/researchservices/global-bandwidth-research-service/). When applications (such as electromagnetic radiation and infrared ray analysis), each handling data from some datacenters, have to be deployed, there is no meaningful notion of distance. The latency (travel time of a single small packet) under low-congestion conditions tends not to be noticeable to the end-users. The real difficulty here is the underlying capacity of the network. If links become congested, then the latency will increase and throughput will suffer. A key issue is how to allocate enough bandwidth to each application without causing congestion on the network [Zhao2020].
Hence, we need to choose proper locations for the tasks to reduce congestion. Specifically, we propose this target location problem for multi-commodity flow: Given sources of multiple commodities on a capacitated network, the goal is to locate the targets to maximize the flow value.
We fix the sources because, as our motivating example of geo-distributed data analysis shows, it is difficult, if not impossible, to change the datacenters that collect data, since for efficiency these datacenters should be close to the data generators. In contrast, there is much more flexibility in choosing the target locations where the analysis tasks are performed.
The multi-commodity flow problem (MCF) is one of the most fundamental problems with a wide variety of scientific and engineering applications that have been studied intensively [Shepherd2015, Monis2019]. In the most typical scenario, a finite number of commodities have to be sent from their sources to targets on a capacitated network. Each commodity has its own flow, and the commodities interact when their flows compete for capacity on common edges.
There are two general classes of MCF. One is network analysis which, based on a given network configuration, finds the optimal flow pattern for some objective function. The most studied objective functions include maximizing flow values and minimizing flow costs. The other belongs to network synthesis which seeks an optimal network configuration satisfying certain requirements.
In both classes, the targets of the commodities are taken for granted. Surprisingly, researchers have long neglected the question of how targets are chosen. This paper is devoted to initiating such a theory.
Our proposed problem extends the MCF framework. It does not belong to either of the two classes. It is a combination of the facility location problem and the network flow problem, which are inherently related [An2014]. Facility location is a branch of operations research concerned with positioning at least one new facility among several existing facilities to optimize (minimize or maximize) at least one objective function. It is among the most fundamental problems in operations research and theoretical computer science [Vazirani2001Approximation]. Facility location met network flow in 1990 [Tamura1990], which has inspired a series of works [Ito2009The, ArataLocating2002, Kortsarz2008, Andreev2009Simultaneous, Hebler2016]. However, all the published works minimize the cost of the selected sources or targets (not the flow cost which, together with flow value, is the objective of network flow problems), and never consider the multi-commodity setting. The most crucial difference is that our combination is inherent: our objective is to optimize the flow value, whereas the literature focuses on the cost of the selected nodes rather than the cost of flow. Another benefit of our framework is that it can naturally extend almost every network flow problem, e.g., flow cost minimization.
Our model has other applications, such as Web server deployment. When serving various demands from widely distributed users, for example requests for different online videos, where should the servers be located so that the users have a good experience? Again, we do not care about distance, and the key objective is to optimize the available bandwidth. Further motivating examples include network-flow-based evacuation planning for an emergency, where shelters have to be selected and congestion determines the efficiency of evacuation. Interested readers are referred to [Hebler2016].
These real-world scenarios justify our problem’s critical features: The commodities are only partially determined since the targets are not given and have themselves to be optimized, and a decisive factor of the optimization is the bandwidth rather than any notion of distance. This well motivates our problem.
1.1 Results and Discussion
We propose a novel model of target location for multi-commodity flow (LoMuF). On the one hand, we establish hardness results for various versions. On the other hand, we design algorithms for several versions. The results are as follows.
We show that the LoMuF problem is NP-hard on general undirected graphs.
If the targets are fixed, the problem reduces to an ordinary multi-commodity flow problem (allowing fractional flows) and becomes solvable in polynomial time, which shows that the most challenging part of this problem is indeed how to locate the targets.
We design a polynomial-time algorithm solving LoMuF on trees.
Trees are important network structures in practice. In contrast to the NP-hardness result above, the fact that exactly one path connects a source and the target on a tree simplifies the problem. Our algorithm is elegant and, perhaps surprisingly, shows that the interaction between different commodities is harmless on trees.
We present a -approximation algorithm for LoMuF on general undirected graphs, where is the largest source number among all commodities.
This result actually shows that, when (the so-called bi-source cases) the problem can also be solved efficiently, but (take into account the NP-hardness result above) becomes intractable when .
For LoMuF on directed graphs (Di-LoMuF), we prove that it is also NP-hard and even cannot be efficiently approximated with a ratio less than 2.
In fact, we also show that LoMuF on undirected graphs can be reduced to the directed case Di-LoMuF, so Di-LoMuF is at least as hard.
Di-LoMuF also remains NP-hard on symmetric di-paths and bi-source supply vectors.
These are clear separations between undirected LoMuF and Di-LoMuF, since undirected LoMuF is efficiently solvable on trees while Di-LoMuF is hard even on paths, a very special case of trees. As we pointed out above, the bi-source instances are easy for undirected LoMuF, but not for Di-LoMuF.
For the special case on symmetric di-trees, Di-LoMuF has a polynomial-time 2-approximation algorithm.
Though we have seen several hardness results for Di-LoMuF, for a special but still meaningful subset, where every link has the same download and upload capacity, we can obtain an efficient approximation algorithm.
We show that our results above can also be extended to other variants of LoMuF such as maximum sum flows, unsplittable flows, restricted candidate targets, maximum feasible flows, and so on. For the unsplittable version, we show that it cannot be approximated within ratio 2. For the version with restrictions on targets, it is NP-hard on uni-source supply vectors and stars and cannot be efficiently approximated within ratio on trees. For the maximum feasible flows version, we prove that for any constant , unless NP=ZPP, it cannot be approximated within on supply vectors.
This shows that the framework of the new location problems has a powerful capability of modeling different scenarios in practice and enriches the theory of location problems and network flow problems.
1.2 Related Work
There is a vast and growing literature on multi-commodity flow and its single-commodity special case [Shepherd2015]. Basically, there are two types of optimization objectives, namely, minimum cost and maximum flow, the latter being the focus of this paper. The main theme of maximum flow research in recent years has been improving the efficiency of approximation algorithms [Monis2019, Kelner2014, Teng2010, LeeYinTat2013, PengRichard2016, Madry2016, Kelner2012, Sherman2017, Chekuri2013The]. The flow-cut duality is also a challenging issue and has attracted much attention from researchers [krauthgamer2019flow, salmasi2019constant].
Facility location has flourished ever since the 1960s and remains an active topic in operations research and theoretical computer science [Hakimi1964, Labbe1992, Vazirani2001Approximation]. Though no constant-ratio approximation algorithm exists in general, the problem admits constant-factor approximation on metric spaces. One of the main threads of research is to improve the approximation ratio in various situations [Shmoys1997Approximation, LiA, GuhaGreedy].
Though inherently related to multi-commodity flow [An2014], facility location was combined with network flow only in 1990 [Tamura1990], when the source location problem was proposed. Roughly speaking, the goal of the source location problem on a network is to find a set of sources from which enough flow can be sent to each prescribed target. In addition to flow requirements, connectivity and vertex coverage are also frequently used constraints. Work in this line can be classified into two categories. One is independent source location, meaning that the flows to different targets do not interact [ArataLocating2002, SakashitaMinimum, Attila2008, MamadaOptimal2002, MamadaAn2006, Ito2009The, Kortsarz2008]. The other is simultaneous source location, where the flows exist concurrently and interact by competing for edge capacities [Andreev2009Simultaneous, Hebler2016]. An interesting application is emergency evacuation planning [Hebler2016], where shelters are to be located so that residents in a disaster can reach them as fast as possible. In such applications, capacities are also usually imposed on network nodes, rather than just on edges as in typical network flow models. All the mentioned works have two common features. First, essentially only a single commodity is considered, which is multi-source multi-target. Second, the objective is to optimize some measure of the selected sources (say, total cost), rather than a property of the flow (say, flow value). This is in sharp contrast to our proposed problem.
2 Preliminaries and Problem Statement
In this section, we review key notions and notations used in this paper, and formally define the location problem.
Let (, respectively) represent the set of (non-negative, non-positive, respectively) real numbers. We use for a vector, and for its -th entry. When we denote a set by an upper-case letter, we usually write the corresponding (subscripted) lower-case letter for the members.
A network is a capacitated graph , where is the vertex set, is the edge set, and assigns capacities to the edges. We mainly focus on undirected graphs first, and will consider directed graphs in Section 4. For any , we use , or interchangeably, to denote the edge between . A commodity is described by a demand vector satisfying , where any such that (, respectively) is called a source (a target, respectively). Intuitively, each source has to send out units of the commodity, and in total units are delivered to target . The vertex set of a graph is denoted by .
To specify flows over a network, we always arbitrarily orient all the edges and keep the orientation implicit unless necessary. For any , let (, respectively) stand for the set of incoming (outgoing, respectively) edges. A flow is a vector , which for any edge , means units of transportation along in orientation if , and in the opposite direction otherwise. Given flows , we write if for any . A flow is said to satisfy a demand vector , if for any , . A multi-commodity flow, which means a set of flows, is valid if its congestion along any edge is at most .
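The validity condition can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the congestion of an edge is taken to be the sum of absolute flow values over all commodities, and the names `congestion`, `is_valid`, and `net_outflow` are hypothetical helpers, not notation from the paper.

```python
from typing import Dict, List, Tuple

Edge = Tuple[str, str]      # an edge with an arbitrary but fixed orientation (tail, head)
Flow = Dict[Edge, float]    # signed flow value per oriented edge


def congestion(flows: List[Flow], edge: Edge) -> float:
    """Assumed definition: sum of absolute flow values on `edge` over all commodities."""
    return sum(abs(f.get(edge, 0.0)) for f in flows)


def is_valid(flows: List[Flow], capacity: Dict[Edge, float], tol: float = 1e-9) -> bool:
    """A multi-commodity flow is valid if no edge carries more than its capacity."""
    return all(congestion(flows, e) <= cap + tol for e, cap in capacity.items())


def net_outflow(flow: Flow, vertex: str) -> float:
    """Flow leaving `vertex` minus flow entering it, under the fixed orientation."""
    out = sum(x for (tail, _), x in flow.items() if tail == vertex)
    inc = sum(x for (_, head), x in flow.items() if head == vertex)
    return out - inc
```

For instance, on a path a, b, c with capacity 2 on both edges, one commodity sending one unit from a to c and another sending one unit from c to b together load the edge between b and c with exactly two units, which is still valid.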
The maximum concurrent flow problem (MCF for short) has been extensively studied and remains an active research topic. Specifically, given demand vectors on a capacitated graph , the mission of MCF is to find the maximum such that can be satisfied by a valid multi-commodity flow on . The optimum will be denoted by .
Let’s recall some properties of MCF.
Lemma 1. MCF lies in P.
As mentioned in [cormen2009introduction, page 863], there is no known purely combinatorial algorithm solving MCF exactly and efficiently. The only commonly used algorithm is based on linear programming.
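For concreteness, one standard way to write maximum concurrent flow as a linear program is sketched below. The symbols are assumptions made for illustration: $f_i$ is the signed flow of commodity $i$, $d_i$ its demand vector, $c$ the capacity function, $\delta^{+}(v)$ and $\delta^{-}(v)$ the outgoing and incoming edges at $v$ under the fixed orientation, and $g_i(e)$ an auxiliary variable bounding $|f_i(e)|$. The sign convention (net outflow equals scaled demand) is also an assumption.

```latex
\begin{align*}
\max\quad & \lambda \\
\text{s.t.}\quad
& \sum_{e \in \delta^{+}(v)} f_i(e) - \sum_{e \in \delta^{-}(v)} f_i(e) = \lambda\, d_i(v)
  && \forall i \in \{1,\dots,k\},\ \forall v \in V, \\
& -g_i(e) \le f_i(e) \le g_i(e)
  && \forall i \in \{1,\dots,k\},\ \forall e \in E, \\
& \sum_{i=1}^{k} g_i(e) \le c(e)
  && \forall e \in E, \\
& \lambda \ge 0 .
\end{align*}
```

The program has a number of variables and constraints polynomial in $k$, $|V|$, and $|E|$, which is why a polynomial-time LP solver suffices once the targets are fixed.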
A multi-commodity flow on a capacitated graph is said to be a decomposition of flow , if and for any edge of .
Lemma 2. Arbitrarily fix a demand vector on a capacitated graph . Suppose has exactly one target . Then any flow satisfying can be decomposed into a multi-commodity flow which satisfies the demand vectors . Here, each is such that for any vertex of ,
Note that the decomposition in Lemma 2 is not necessarily unique. Any such one will be called a canonical decomposition of the flow.
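One way to obtain such a decomposition is path stripping, sketched below. This is an illustrative sketch rather than the construction used in the paper, and it rests on assumptions: the flow is given as non-negative amounts on directed arcs (each edge oriented along its flow), any circulation left over after stripping is ignored, and the function names are hypothetical.

```python
def _find_path(residual, start, goal):
    """Depth-first search for a start-to-goal path along arcs that still carry flow."""
    pred = {start: None}
    stack = [start]
    while stack:
        u = stack.pop()
        if u == goal:
            break
        for (tail, head) in residual:
            if tail == u and head not in pred:
                pred[head] = (tail, head)
                stack.append(head)
    if goal not in pred:
        return None
    path, v = [], goal
    while pred[v] is not None:
        path.append(pred[v])
        v = pred[v][0]
    return path[::-1]


def decompose_by_source(flow, supply, target, tol=1e-12):
    """Split a single-target flow into one flow per source by stripping paths."""
    residual = {arc: x for arc, x in flow.items() if x > tol}
    remaining = {s: a for s, a in supply.items() if a > tol and s != target}
    parts = {s: {} for s in remaining}
    for s in remaining:
        while remaining[s] > tol:
            path = _find_path(residual, s, target)
            if path is None:
                raise ValueError("flow does not carry the supply of source %r" % (s,))
            amount = min(remaining[s], min(residual[arc] for arc in path))
            for arc in path:
                residual[arc] -= amount
                if residual[arc] <= tol:
                    del residual[arc]       # arc is used up
                parts[s][arc] = parts[s].get(arc, 0.0) + amount
            remaining[s] -= amount
    return parts
```

Different search orders may strip different paths, which matches the observation that the decomposition need not be unique.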
Given any vertex subset of a graph , the cut induced by , denoted by , is defined to be the set of edges bridging and . Let be the set of edges coming into , and .
Lemma 3. Suppose that is a flow satisfying a demand vector on a capacitated graph . Then for any , .
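Statements of this kind rest on the usual conservation identity: the net flow crossing a cut equals the sum of the net outflows of the vertices on one side. The snippet below checks this identity exhaustively on a small example; the graph, the flow values, and the function names are hypothetical, and the demand of a vertex is taken to be its net outflow.

```python
from itertools import combinations

# Oriented edges (tail, head) with signed flow values; a made-up example.
flow = {("a", "b"): 2.0, ("b", "c"): 1.5, ("a", "c"): -0.5, ("c", "d"): 1.0}
vertices = sorted({v for edge in flow for v in edge})


def net_outflow(v):
    """Flow leaving v minus flow entering v."""
    out = sum(x for (tail, _), x in flow.items() if tail == v)
    inc = sum(x for (_, head), x in flow.items() if head == v)
    return out - inc


def net_flow_out_of(subset):
    """Flow on edges leaving `subset` minus flow on edges entering it."""
    leaving = sum(x for (t, h), x in flow.items() if t in subset and h not in subset)
    entering = sum(x for (t, h), x in flow.items() if t not in subset and h in subset)
    return leaving - entering


def cut_identity_holds(tol=1e-9):
    """Check the identity for every vertex subset of the example graph."""
    for r in range(len(vertices) + 1):
        for subset in combinations(vertices, r):
            s = set(subset)
            if abs(net_flow_out_of(s) - sum(net_outflow(v) for v in s)) > tol:
                return False
    return True
```

Edges with both endpoints inside the subset cancel in the sum of net outflows, which is why only the cut edges remain.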
2.2 Target Location problem
Intuitively, our goal is to properly locate targets for multiple commodities. We formulate this problem in this subsection.
Given a capacitated graph , any is called a supply vector on . For any supply vector and , we define a demand vector such that for any ,
It is time to formulate the problem of target location for maximizing concurrent multi-commodity flow, LoMuF for short. Given supply vectors on a capacitated graph , LoMuF aims at finding such that is maximized. By abuse of the notation, the optimum objective value is again denoted by .
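To make the definition concrete, the sketch below evaluates LoMuF by brute force on a small tree. It is an illustration under assumptions, not the paper's algorithm: the demand vector derived from a supply vector and a target is assumed to keep each source's supply and let the target absorb the total, trees are used because the flow toward a fixed target is then unique, and all function names are hypothetical.

```python
from itertools import product
from typing import Dict, List, Tuple

Tree = Dict[str, Dict[str, float]]   # adjacency map: tree[u][v] = capacity of edge {u, v}
Supply = Dict[str, float]            # non-negative supply per vertex (missing means 0)


def demand_vector(supply: Supply, target: str, vertices) -> Dict[str, float]:
    """Assumed convention: sources keep their supply, the target absorbs the total."""
    total = sum(supply.values())
    return {v: supply.get(v, 0.0) - (total if v == target else 0.0) for v in vertices}


def edge_loads(tree: Tree, supply: Supply, target: str) -> Dict[Tuple[str, str], float]:
    """Load each tree edge carries when all supply is routed to `target`."""
    loads: Dict[Tuple[str, str], float] = {}

    def subtree_supply(v, parent) -> float:
        total = supply.get(v, 0.0)
        for child in tree[v]:
            if child != parent:
                below = subtree_supply(child, v)
                loads[tuple(sorted((v, child)))] = below   # all of it crosses this edge
                total += below
        return total

    subtree_supply(target, None)
    return loads


def concurrent_value(tree: Tree, supplies: List[Supply], targets) -> float:
    """Largest scaling factor such that every scaled commodity fits on the tree."""
    total: Dict[Tuple[str, str], float] = {}
    for supply, t in zip(supplies, targets):
        for e, load in edge_loads(tree, supply, t).items():
            total[e] = total.get(e, 0.0) + load
    ratios = [tree[u][v] / load for (u, v), load in total.items() if load > 0]
    return min(ratios) if ratios else float("inf")


def brute_force_lomuf(tree: Tree, supplies: List[Supply]):
    """Try every tuple of targets; exponential in the number of commodities."""
    best = max(product(sorted(tree), repeat=len(supplies)),
               key=lambda ts: concurrent_value(tree, supplies, ts))
    return best, concurrent_value(tree, supplies, best)
```

The enumeration is only meant as a reference point for small instances; its running time grows exponentially with the number of commodities.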
3 Hardness and Algorithms of LoMuF
We begin by studying the hardness of LoMuF. Our reduction is from a well-known NP-complete problem, 3-dimensional matching (3-DM for short). Though LoMuF is NP-hard in general, we devise an algorithm that efficiently solves LoMuF on trees, and show that a simple strategy yields a reasonable approximation on graphs where the number of sources per commodity is bounded.
3.1 Hardness Result
A 3-DM instance is a quadruple , where are pairwise disjoint finite sets of equal size, and . The goal is to decide whether contains a perfect matching, namely, a subset such that and . The trivial cases where will not be considered.
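As a reminder of what the decision problem asks, here is a brute-force checker for 3-DM. It is a sketch for tiny instances only (the problem is NP-complete), and the function and parameter names are hypothetical.

```python
from itertools import combinations


def has_perfect_matching(x, y, z, triples):
    """Return True if some len(x) triples cover every element of x, y and z exactly once."""
    n = len(x)
    assert len(y) == n and len(z) == n, "the three sets must have equal size"
    for chosen in combinations(triples, n):
        # n triples covering n elements in each coordinate cannot repeat any element.
        if ({a for a, _, _ in chosen} == set(x)
                and {b for _, b, _ in chosen} == set(y)
                and {c for _, _, c in chosen} == set(z)):
            return True
    return False
```

The search tries every subset of triples of the right size, so its cost grows combinatorially with the number of triples.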
We first show that LoMuF is NP-hard, which is more or less a surprise, compared with Lemma 1.
Theorem 4. Given supply vectors on a capacitated graph , it is NP-complete to decide whether .
Choose a target for supply vectors , for any . Due to Lemma 1, we can use as a certificate to check whether . This means that the decision problem lies in NP.
To prove NP-completeness, it suffices to establish a reduction from 3-DM.
Given a 3-DM instance with and , we construct a capacitated graph as illustrated in Figure 2. Specifically, consists of three subgraphs connected via . is a complete bipartite graph of vertex sets and , and any is adjacent to if and only if , likewise for . All the edges are oriented upward in Figure 2.
As to the capacity, let be the set of red edges, namely, those incident to or . For any with , let . Then for any ,
We define supply vectors such that for any ,
The rest of the proof is devoted to showing that has a perfect matching if and only if the LoMuF instance satisfies , which will lead to NP-completeness of our decision problem. The proof consists of two parts.
Part 1: a perfect matching in implies .
Without loss of generality, suppose is a perfect matching. For any , define flow such that for any edge ,
For any , define flow such that for any edge ,
It is straightforward to check that the multi-commodity flow is valid and satisfies the demand vectors . Hence, .
Part 2: implies a perfect matching in .
Suppose the optimum targets are , and the multi-commodity flow is valid and satisfies . Part 2 immediately follows from the two facts:
Fact 1: for any .
Consider the congestion of any on . Let’s proceed case by case.
. Applying Lemma 3 to , we see that the congestion of on is at least 3.
. Without loss of generality, assume . Applying Lemma 3 to , we see that the congestion of on is at least 2, and those on and are both at least 1. Hence, the congestion of on is at least 4.
Since the total capacity of is which upper-bounds the total congestion, we get Fact 1.
Fact 2: the sets are pairwise disjoint.
For contradiction, suppose without loss of generality that . Applying Lemma 3 to multi-commodity flow and demand vectors , we have . This implies that for any . Namely, each edge in is full of upward flow. Likewise, each edge in is also full of upward flow. Let . Then we have for any edge in , since flow along such an edge cannot reach or . This, together with the precondition that satisfies , implies . Likewise, . A contradiction is reached since .
3.2 LoMuF on Trees
Theorem 4 indicates that LoMuF is hard to solve on general graphs, but does not exclude the possibility of an efficient algorithm solving LoMuF in some important special cases. Indeed, LoMuF on trees admits a fast algorithm, as presented in Algorithm 1. Networks with tree structure are also central to the related literature [SakashitaMinimum, Tansel1983Location, Chekuri2013The, MamadaOptimal2002, MamadaAn2006, Ito2009The, Andreev2009Simultaneous].
Without loss of generality, trees will be arbitrarily rooted, so the concepts of ancestors, descendants, and subtrees are well defined as usual. Given vertices of a tree, we write if is a descendant of , and if or .
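The rooted-tree notions used in this subsection (parent, descendant, lowest common ancestor) can be sketched as follows. The helper names are hypothetical and the implementation favors clarity over speed.

```python
def root_tree(adj, root):
    """Return a parent map for the tree `adj` (vertex -> iterable of neighbours)."""
    parent = {root: None}
    stack = [root]
    while stack:
        v = stack.pop()
        for w in adj[v]:
            if w not in parent:
                parent[w] = v
                stack.append(w)
    return parent


def path_to_root(parent, v):
    """Vertices from v up to the root, inclusive."""
    path = []
    while v is not None:
        path.append(v)
        v = parent[v]
    return path


def is_descendant_or_equal(parent, u, v):
    """True if u equals v or u lies in the subtree rooted at v."""
    return v in path_to_root(parent, u)


def lowest_common_ancestor(parent, vertices):
    """Deepest vertex that is an ancestor of (or equal to) every given vertex."""
    vertices = list(vertices)
    common = set(path_to_root(parent, vertices[0]))
    for v in vertices[1:]:
        common &= set(path_to_root(parent, v))
    # Walking upward from any vertex, the first common vertex is the deepest one.
    return next(a for a in path_to_root(parent, vertices[0]) if a in common)
```

Applied to the sources of a supply vector, `lowest_common_ancestor` yields the vertex that the case analysis in the proof below refers to as the lowest common ancestor of the sources.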
Let’s begin with a polynomial-time algorithm, which turns out to exactly solve LoMuF on trees.
Theorem 5. The output of Algorithm 1 is an optimum solution to LoMuF on trees.
Given a capacitated tree and supply vectors , let be the output of Algorithm 1. Orient any edge of upward, i.e., from a vertex to its parent. The theorem is proven in two steps.
Step 1: Arbitrarily fix . We claim that for any , any , and any flow satisfying , there is a flow which satisfies .
The claim is proved by induction on the hop distance (i.e., the number of edges) between and , denoted by .
Basis: The claim trivially holds when .
Hypothesis: The claim holds when .
Induction: . Let be the lowest common ancestor of the sources of . We proceed case by case.
Case 1: .
If , set flow such that for any edge ,
One can easily check that and satisfies . Let .
If , it must happen that at the beginning of some “while loop” of Algorithm 1 when handling . That loop must assign to , where is the child of satisfying the condition in Line 3. Note that lies on the path between and the final . Set flow such that for any edge ,
Case 2: . Let be the child of such that . Let be the edges on the path between and . Define flow such that for any edge ,
Since Algorithm 1 outputs rather than for , it must hold that
For any edge with , we have
Hence, . One can also check that satisfies . Let .
Case 3: neither nor . Let be the lowest common ancestor of and . We have either or .
If , define flow such that for any edge ,
Then and satisfies . Let .
If , lies on the path between and . Hence, it must happen that at the beginning of some “while loop” of Algorithm 1 when handling . Then that loop does not choose the subtree of containing . Following the argument of Case 2, there is a flow which satisfies the demand vector . Let .
Altogether, we always have a flow which satisfies the demand vector . Because , we apply the induction hypothesis and finish step 1.
Step 2: Let . Choose such that there is a valid multi-commodity flow satisfying . For any , apply the claim in step 1 to and , resulting in a flow which satisfies . Therefore, we get a valid multi-commodity flow satisfying . This means that the output of Algorithm 1 is an optimum solution to LoMuF. ∎
3.3 Approximation Algorithm on General Graphs
Theorem 5 suggests that LoMuF is not extremely intractable, at least in a special case. Fortunately, the tractability can be extended to more general graphs, in the sense of approximation. Let’s begin with a lemma, which shows the important role of master sources (defined below) in approximating LoMuF.
Lemma 6. Arbitrarily fix a supply vector on a capacitated graph . Arbitrarily choose , where is the set of sources of . Let be a master source of , namely .
For any and flow satisfying , there is a flow which satisfies .
We proceed case by case.
Case 1: . For any , define demand vector such that for any ,
By Lemma 2, has a decomposition satisfying .
Now for any , define flow such that for any ,
and demand vector such that for any ,
Our task is reduced to establishing three claims.
Claim 1: for any , satisfies .
It suffices to show for any , where , which is the net incoming of flow at vertex . Obviously, is linear in .
Arbitrarily fix . By definition of ,
Claim 2: satisfies . It immediately follows from Claim 1.
Claim 3: .
It holds because for any ,
The proof of Case 1 finishes.
Case 2: . The lemma trivially holds.
Case 3: .
The proof of Case 1 almost works, except that is not well-defined and the decomposition of does not include . As a result, we still apply the proof of Case 1, after defining and to be all-zero vectors. ∎
Remark 2. Lemma 6 remains true if is replaced by .
Algorithm 2 is a simple algorithm for LoMuF with guaranteed approximation ratio.
Algorithm 2 is -approximate, where .
Arbitrarily fix a capacitated graph and supply vectors as input to Algorithm 2. Let be the output. If , each is the unique source of , which is trivially optimum. Hence, we assume and show that establish approximation ratio .
Let . Suppose is an optimum solution to LoMuF. This means that there is a multi-commodity flow satisfying .
For any , apply Lemma 6 with , getting a flow which satisfies . As a result, we find a valid multi-commodity flow satisfying , so . The proof ends. ∎
Note that in Remark 2, if the -entry dominates for any , namely . A special such case is when every supply vector has no more than 2 sources. Then by Remark 2, we immediately have the following corollary.
When every supply vector has a dominant entry, Algorithm 2 exactly solves LoMuF.
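The strategy discussed in this subsection can be sketched as follows. The sketch relies on two assumptions that are not spelled out in the surviving text: a master source is taken to be a source with the largest supply, and an entry is called dominant when it is at least as large as all other entries of the same supply vector combined. The function names are hypothetical.

```python
def master_source(supply):
    """Assumed definition: a source with the largest supply (ties broken arbitrarily)."""
    sources = [v for v, amount in supply.items() if amount > 0]
    return max(sources, key=lambda v: supply[v])


def master_targets(supplies):
    """Place the target of every commodity at its master source."""
    return [master_source(s) for s in supplies]


def has_dominant_entry(supply):
    """Assumed definition: the largest entry is at least the sum of all the others."""
    return 2 * max(supply.values()) >= sum(supply.values())
```

Under these assumptions, any supply vector with at most two sources has a dominant entry, since the larger of two entries is at least the smaller one.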
4 Hardness and Algorithms of Di-LoMuF
In this section, we adapt LoMuF to networks modeled as directed graphs. Such networks have also been studied in the network flow community and frequently appear in practice today. For example, many data servers allow only downstream traffic.
We adopt the notation and concepts in Section 2 in case of no ambiguity, with three exceptions:
Every edge has an inherent direction and is called an arc. An arc from vertex to vertex is denoted by . We usually use to represent a capacitated directed graph with vertex set , arc set , and capacity vector . Accordingly, (, respectively) stands for the set of incoming (outgoing, respectively) arcs at vertex . Likewise, define and for vertex subset .
Any arc only allows a flow in the inherent direction, so we can naturally specify a network flow using a non-negative vector .
We continue to study the problem of target location for maximizing concurrent multi-commodity flow, but in the context of directed graphs. The problem will be called Di-LoMuF to highlight the directed model.
The following theorem indicates the strong relation between LoMuF and Di-LoMuF.
LoMuF is reducible to Di-LoMuF.
Arbitrarily fix a capacitated graph and supply vectors . We will construct a capacitated directed graph and supply vectors , and prove that the construction preserves the quality of solutions.
Step 1: Construct and the supply vectors.
The directed graph is obtained by replacing any edge of with the diamond gadget as illustrated in Figure 3. Specifically, , , and for any arc in the diamond corresponding to edge , . For any , define such that for any ,
Step 2: Prove that for any , .
Consider any and any valid multi-commodity flow satisfying . For any , define flow as follows: for any , if is from to , set , otherwise set ; for any other arc . It is straightforward to check that the multi-commodity flow is valid and satisfies