Fundamental Limits of Distributed Data Shuffling

06/29/2018 ∙ by Kai Wan, et al. ∙ The University of Utah ∙ Technische Universität Berlin ∙ University of Illinois at Chicago

Data shuffling of training data among different computing nodes (workers) has been identified as a core element to improve the statistical performance of modern large-scale machine learning algorithms. Data shuffling is often considered one of the most significant bottlenecks in such systems due to the heavy communication load. Under a master-worker architecture (where a master has access to the entire dataset and only communication between the master and workers is allowed), coding has recently been shown to considerably reduce the communication load. In this work, we consider a different communication paradigm referred to as distributed data shuffling, where workers, connected by a shared link, are allowed to communicate with one another while no communication between the master and workers is allowed. Under the constraint of uncoded cache placement, we first propose a general coded distributed data shuffling scheme, which achieves the optimal communication load to within a factor of two. Then, we propose an improved scheme that achieves exact optimality for either large memory size or at most four workers in the system.


I Introduction

Recent years have witnessed the emergence of big data and machine learning, with wide applications in both business and consumer worlds. To cope with the large size and dimensionality of data and the complexity of machine learning algorithms, it is increasingly popular to use distributed computing platforms such as Amazon Web Services Cloud, Google Cloud, and Microsoft Azure services, where large-scale distributed machine learning algorithms can be implemented. Data shuffling has been identified as one of the core elements to improve the statistical performance of modern large-scale machine learning algorithms [ChungUber2017, randomreshuffling2015]. In particular, data shuffling consists of re-shuffling the training data among all computing nodes (workers) once every few iterations, according to the given learning algorithm. However, due to the huge communication cost, data shuffling may become one of the main system bottlenecks.

To tackle this communication bottleneck problem, under a master-worker setup where the master has access to the entire dataset, coded data shuffling has recently been proposed to significantly reduce the communication load between master and workers [speedup2018Lee]. However, when the whole dataset is stored across the workers, data shuffling can be implemented in a distributed fashion by allowing direct communication between the workers (in practice, workers communicate with each other as described in [ChungUber2017]). In this way, the communication bottleneck between the master and the workers can be considerably alleviated. This is advantageous when the transmission capacity among workers is much higher than that between the master and workers, provided the communication loads of the two setups are comparable.

In this work, we consider such a decentralized data shuffling framework, where workers, connected by the same communication bus (common shared link), are allowed to communicate. (Notice that putting all nodes on the same bus, a typical terminology in Computer Science, is very common and practically relevant, since this is what happens for example with Ethernet, or with the Peripheral Component Interconnect Express (PCI Express) bus inside a multi-core computer, where all cores share a common bus for intercommunication. Access to such a bus is regulated by a collision avoidance protocol such as Carrier Sense Multiple Access (CSMA) [tobagiCSMA] or Token Ring [tokenring], so that only one node talks at a time while all others listen. Therefore, this architecture is relevant in practice.) Although a master node may be present for the initial data distribution and/or for collecting the results of the training phase in a machine learning application, it is not involved in the data shuffling process, which is entirely managed by the worker nodes in a distributed manner. In the following, we review the literature on coded data shuffling (which we shall refer to as centralized data shuffling) and introduce the distributed data shuffling framework studied in this paper.

I-A Centralized Data Shuffling

The coded data shuffling problem was originally proposed in [speedup2018Lee] in a master-worker centralized model. In this setup, a master, with access to the whole dataset, is connected to the workers, and the number of data units per worker is a positive integer. Each shuffling epoch is divided into data shuffling and storage update phases. In the data shuffling phase, a subset of the data units is assigned to each worker, and each worker must recover these data units from the packets broadcasted by the master and its own stored content from the previous epoch. In the storage update phase, each worker must store the newly assigned data units and, in addition, some information about other data units that can be retrieved from the storage content and the master transmission in the current epoch. Such additional information should be strategically designed in order to help the coded delivery of the required data units in the following epochs. Each worker can store up to a given number of data units in its local memory. If each worker directly copies some bits of the data units into its storage, the storage update phase is said to be uncoded. On the other hand, if the workers store functions (e.g., linear combinations) of the data units' bits, the storage update is said to be coded. The goal is, for a given storage size, to find the best two-phase strategy that minimizes the communication load during the data shuffling phase regardless of the shuffle.

The scheme proposed in [speedup2018Lee] uses a random uncoded storage (to fill the workers' extra memories independently when there is excess storage) and a coded multicast transmission from the master to the workers, and yields a multiplicative gain in terms of communication load with respect to the naive scheme in which the master simply transmits the missing but required data to the workers by directly broadcasting the missing bits over the shared link.

The centralized coded data shuffling scheme with a coordinated (i.e., deterministic) uncoded storage update phase was originally proposed in [informationAttia2016, worstAttia2016] to further reduce the communication load for the worst-case shuffles compared to [speedup2018Lee]. The schemes proposed in [informationAttia2016, worstAttia2016] are optimal under the constraint of uncoded storage for the cases where there is no extra memory at each worker or there are no more than three workers in the system. Inspired by the achievable and converse bounds for the single-bottleneck-link caching problem in [dvbt2fundamental, ontheoptimality, exactrateuncoded], the authors in [neartoptimalAttia2018] then proposed a general coded data shuffling scheme, which was shown to be order optimal to within a constant factor under the constraint of uncoded storage. Also in [neartoptimalAttia2018], the authors improved the performance of the general coded shuffling scheme by introducing an aligned coded delivery, which was shown to be optimal under the constraint of uncoded storage for certain storage sizes.

Recently, inspired by the improved data shuffling scheme in [neartoptimalAttia2018], the authors in [fundamentalshuffling2018] proposed a linear coding scheme based on interference alignment, which achieves the optimal worst-case communication load under the constraint of uncoded storage for all system parameters. In addition, under the constraint of uncoded storage, the coded data shuffling scheme proposed in [fundamentalshuffling2018] was shown to be optimal for any shuffle (not just for the worst case) in a certain parameter regime.

I-B Decentralized Data Shuffling

An important limitation of the centralized framework is the assumption that workers can only receive packets from the master. Since the entire dataset is stored in a decentralized fashion across the workers at each epoch of the distributed learning algorithm, the master may not be needed in the data shuffling phase if workers can communicate with each other (e.g., [ChungUber2017]). In addition, the communication among workers can be much more efficient than the communication from the master node to the workers [ChungUber2017]. In this paper, we propose the decentralized data shuffling problem, where only communication among workers is allowed during the shuffling phase. This means that in the data shuffling phase, each worker broadcasts well-designed coded packets (i.e., representations of the data) based on its stored content from the previous epoch. Workers take turns in transmitting, and transmissions are received error-free by all other workers through the common communication bus. The objective is to design the data shuffling and storage update phases in order to minimize the total communication load across all the workers in the worst-case shuffling scenario.

I-C Relation to other Problems

The coded decentralized data shuffling problem considered in this paper is related to the coded device-to-device (D2D) caching problem [d2dcaching] and the coded distributed computing problem [distributedcomputing] – see also Remark 1 next.

The coded caching problem was originally proposed in [dvbt2fundamental] for a shared-link broadcast model. The authors in [d2dcaching] extended the coded caching model to D2D networks under the so-called protocol model. By choosing the communication radius of the protocol model such that each node can broadcast messages to all other nodes in the network, the delivery phase of D2D coded caching resembles (as far as the topology of communication between the nodes is concerned) the shuffling phase of our decentralized data shuffling problem.

Recently, the scheme for coded D2D caching in [d2dcaching] has been extended to the coded distributed computing problem [distributedcomputing], which consists of two stages named Map and Reduce. In the Map stage, workers compute a fraction of the intermediate computation values using local input data according to the designed Map functions. In the Reduce stage, according to the designed Reduce functions, workers exchange among each other a set of well-designed (coded) intermediate computation values in order to compute the final output results. The coded distributed computing problem can be seen as a coded D2D caching problem under the constraint of uncoded and symmetric cache placement, where the symmetry means that each worker uses the same cache function for each file. A converse bound was proposed in [distributedcomputing] to show that the proposed coded distributed computing scheme is optimal in terms of communication load. This coded distributed computing framework has been extended in several directions, such as computing only necessary intermediate values [alternative2017, combinatoricsCDC2018], reducing file partitions and the number of output functions [combinatoricsCDC2018, leveraging2018], and considering random network topologies [CDCrandomconnect2018], stragglers [straggleswireless2017], storage cost [yan2018distributedcom], and heterogeneous computing power, function assignment, and storage space [cascaded2019, CDChetero2019].

Compared to coded D2D caching and coded distributed computing, the decentralized data shuffling problem differs as follows. On the one hand, a novel asymmetric constraint on the stored contents of the workers is present (because each worker must store all bits of each data unit assigned to it in the previous epoch, which breaks the symmetry of the stored contents across data units found in the other settings). On the other hand, each worker also needs to dynamically update its storage based on the received packets and its own stored content from the previous epoch. Therefore, the decentralized data shuffling problem over multiple data assignment epochs is indeed a dynamic system, where the evolution of the workers' stored content across epochs plays a key role, while in the other problems reviewed above the cache content is static and determined in a single initial placement phase.

We note that the distributed computing problem in [distributedcomputing] is a special case of the D2D caching problem when one restricts attention to uncoded and symmetric (across files) cache placement.

The decentralized data shuffling phase with uncoded storage is equivalent to a distributed index coding problem [distribuedindexcoding, liu2018distributedIC], in which the only servers present are those whose sets of messages available for encoding coincide with the side information sets of some users. The authors in [distribuedindexcoding, liu2018distributedIC] proposed polymatroid converse bounds and achievable schemes based on random coding, which coincide for all non-isomorphic problems with equal link capacities and equal message rates when there are no more than four messages; all other scenarios remain widely open. These regions are in general of exponential complexity in the number of messages and servers, and thus are not of direct use for our problem. This is so because in the decentralized data shuffling problem each data unit is divided into sub-blocks depending on which subset of workers stores them before the data shuffling phase; each sub-block desired by a worker is an independent message in the corresponding distributed index coding problem; thus the data shuffling phase is a distributed index coding problem that contains a number of messages that, in general, is doubly exponential in the number of workers in the original decentralized data shuffling problem.

I-D Contributions

In this paper, we study the decentralized data shuffling problem for which we propose converse and achievable bounds as follows.

  1. Novel converse bound under the constraint of uncoded storage. Inspired by the induction method in [distributedcomputing, Thm.1] for the distributed computing problem, we derive a converse bound under the constraint of uncoded storage. In contrast to the converse bound for the distributed computing problem, our proof uses a novel approach to account for the additional constraint on the “asymmetric” stored content.

  2. Scheme A: General scheme for any . By extending the general centralized data shuffling scheme from [neartoptimalAttia2018] to our decentralized model, we propose a general decentralized data shuffling scheme, where the analysis holds for any system parameters.

  3. Scheme B: Improved scheme for . It can be seen later that Scheme A does not fully leverage the workers’ stored content. With the storage update phase inspired by the converse bound and also used in the improved centralized data shuffling scheme in [neartoptimalAttia2018], we propose a two-step scheme for decentralized data shuffling to improve on Scheme A. In the first step we generate multicast messages as in [dvbt2fundamental], and in the second step we encode these multicast messages by a linear code.

    By comparing our proposed converse bound and Scheme B, we prove that Scheme B is exactly optimal under the constraint of uncoded storage for . Based on this result, we can also characterize the exact optimality under the constraint of uncoded storage when the number of workers satisfies .

  4. Scheme C: Improved scheme for . The delivery schemes proposed in [dvbt2fundamental, d2dcaching, neartoptimalAttia2018] for shared-link coded caching, D2D caching, and centralized data shuffling all belong to the class of clique-covering methods from a graph-theoretic viewpoint. We propose a new distributed clique-covering approach that outperforms the state of the art, and apply it to our decentralized data shuffling problem for the case . The resulting scheme outperforms the previous two schemes for this specific storage size.

    As a result of independent interest, this novel distributed clique-covering method can be used for other distributed broadcast problems with side information, such as the coded D2D caching problem [d2dcaching] and the distributed index coding problem [distribuedindexcoding].

  5. Order optimality under the constraint of uncoded storage. By combining the three proposed schemes and comparing with the proposed converse bound, we prove the order optimality of the combined scheme to within a factor of under the constraint of uncoded storage.

I-E Paper Organization

The rest of the paper is organized as follows. The system model and problem formulation for the decentralized data shuffling problem are given in Section II. Results on centralized data shuffling related to our work are compiled in Section III. Our main results are summarized in Section IV. The proof of the proposed converse bound can be found in Section V, while the analysis of the proposed achievable schemes is in Section VI. Section VII concludes the paper. The proofs of some auxiliary results can be found in the Appendix.

I-F Notation Convention

We use the following notation convention. Calligraphic symbols denote sets, bold symbols denote vectors, and sans-serif symbols denote system parameters. We use | · | to represent the cardinality of a set or the length of a vector; ⊕ represents bit-wise XOR; and ℕ denotes the set of all positive integers.

II System Model

The decentralized data shuffling problem is defined as follows. There are workers, each of which is tasked with processing and storing data units from a dataset of data units. Each data unit consists of i.i.d. bits. Each worker has a local storage of bits, where . The workers are interconnected through a noiseless multicast network.

The computation process occurs over time slots/epochs. At the end of time slot the content of the local storage of worker is denoted by ; the content of all storages is denoted by . At the beginning of time slot the data units are partitioned into disjoint batches, each containing data units. The data units indexed by are assigned to worker , which must store them in its local storage by the end of time slot . The dataset partition (i.e., data shuffle) in time slot is denoted by and must satisfy

(1a)
(1b)
(1c)

If , we denote that for each .

The following two-phase scheme allows workers to store the requested data units.

Initialization

We first focus on the initial time slot , where a master node broadcasts to all the workers. Given partition , worker must store all the data units where ; if there is excess storage, that is, if , worker can store in its local storage parts of the data units indexed by . The storage function for worker in time slot is denoted by , where

(2a)
(2b)
(2c)

Notice that the storage initialization and the storage update phase (which will be described later) are performed without knowledge of later shuffles. In subsequent time slots , the master is not needed and the workers communicate with one another.

Data Shuffling Phase

Given global knowledge of the stored content at all workers, and of the data shuffle from to (indicated as ), worker broadcasts a message to all other workers, where is based only on its local storage content , that is,

(3)

The collection of all sent messages is denoted by . Each worker must recover all data units indexed by from the sent messages and its local storage content , that is,

(4)

The rate -tuple is said to be feasible if there exist delivery functions for all and satisfying the constraints (3) and (4), and such that

(5)

Storage Update Phase

After the data shuffling phase in time slot , we have the storage update phase in time slot . Each worker must update its local storage based on the sent messages and its local stored content , that is,

(6)

by placing in it all the recovered data units, that is,

(7)

Moreover, the local storage has limited size bounded by

(8)

A storage update for worker is said to be feasible if there exist functions for all and satisfying the constraints in (6), (7) and (8).
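To make the interplay between the two phases concrete, the following Python sketch outlines one epoch of decentralized data shuffling. The functions encode, decode, and update are placeholders for a concrete scheme (such as Schemes A-C described later) and are not part of the formal problem definition; the dictionary-based representation of storages and assignments is likewise only for illustration.

```python
def run_epoch(storages, assignment, encode, decode, update):
    """One epoch of decentralized data shuffling (illustrative skeleton).

    storages[k]   : content stored by worker k at the end of the previous epoch
    assignment[k] : indices of the data units newly assigned to worker k
    """
    workers = list(storages.keys())

    # Data shuffling phase: each worker broadcasts a message computed only
    # from its own stored content, cf. the encoding constraint in (3).
    broadcasts = {j: encode(j, storages[j], assignment) for j in workers}

    # Decoding: worker k must recover all of its newly assigned data units
    # from the received broadcasts and its own stored content, cf. (4).
    recovered = {k: decode(k, storages[k], broadcasts, assignment[k])
                 for k in workers}

    # Storage update phase: the new storage is a function of the old storage
    # and the received messages, must contain all newly assigned data units,
    # and must respect the storage size, cf. (6)-(8).
    return {k: update(k, storages[k], broadcasts, recovered[k]) for k in workers}
```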

Note: if for any and we have (i.e., is equivalent to ), the storage update phase is called structurally invariant.

Objective

The objective is to minimize the worst-case total communication load, or just load for short in the following, among all possible consecutive data shuffles, that is, we aim to characterize defined as

(9)

The minimum load under the constraint of uncoded storage is denoted by . In general, .

Remark 1 (Decentralized Data Shuffling vs D2D Caching).

The D2D caching problem studied in [d2dcaching] differs from our setting as follows:

  1. in the decentralized data shuffling problem one has the constraint on the stored content in (7), which imposes that each worker stores the requested data units in their entirety; this constraint is not present in the D2D caching problem; and

  2. in the D2D caching problem each worker fills its local cache by accessing the whole library of files, while in the decentralized data shuffling problem each worker updates its local storage based on the received packets in the current time slot and its stored content in the previous time slot as in (6).

Because of these differences, achievable and converse bounds for the decentralized data shuffling problem cannot be obtained by a trivial renaming of variables in the D2D caching problem.

III Relevant Results for Centralized Data Shuffling

Data shuffling was originally proposed in [speedup2018Lee] for the centralized scenario, where communication exists only between the master and the workers, that is, the decentralized encoding conditions in (3) are replaced by a single message broadcasted by the master to all the workers. We summarize next some key results from [neartoptimalAttia2018], which will be used in the following sections. We shall use the subscripts “u,cen,conv” and “u,cen,ach” for converse (conv) and achievable (ach) bounds, respectively, for the centralized problem (cen) with uncoded storage (u). We have

  1. Converse for centralized data shuffling: For a centralized data shuffling system, the worst-case communication load under the constraint of uncoded storage is lower bounded by the lower convex envelope of the following memory-load pairs [neartoptimalAttia2018, Thm.2]

    (10)
  2. Achievability for centralized data shuffling: In [neartoptimalAttia2018] it was also shown that the lower convex envelope of the following memory-load pairs is achievable with uncoded storage [neartoptimalAttia2018, Thm.1]

    (11)

    The achievable bound in (11) was shown to be within a constant factor of the converse bound in (10) under the constraint of uncoded storage [neartoptimalAttia2018, Thm.3].

  3. Optimality for centralized data shuffling: It was shown in [fundamentalshuffling2018, Thm.4] that the converse bound in (10) can be achieved by a scheme that uses linear network coding and interference alignment/elimination. An optimality result similar to [fundamentalshuffling2018, Thm.4] was shown in [neartoptimalAttia2018, Thm.4], but only for a restricted range of storage sizes; the remaining case is trivial. (We note that a direct extension of the optimal centralized scheme in [fundamentalshuffling2018] to the decentralized setting of this paper is not possible, because it heavily builds on centralized interference alignment-type ideas. Moreover, a D2D-caching-inspired idea on how to extend a centralized scheme to a decentralized one is the following: each linear combination in the centralized setting is broken into several linear combinations for the decentralized setting, the sub-blocks in each of which are stored by some worker; the communication load is thus larger than in the centralized setting by the corresponding factor, which is in general not optimal in the decentralized setting.)

Although the scheme that achieves the load in (11) is not optimal in general, we shall next describe its inner workings as we will generalize it to the case of decentralized data shuffling.

Structurally Invariant Data Partitioning and Storage

Fix and divide each data unit into non-overlapping and equal-length sub-blocks of length bits. Write each data unit as . The storage of worker at the end of time slot is as follows. (Notice that here each sub-block is stored by workers . In addition, later in our proofs of the converse bound and in the proposed achievable schemes for decentralized data shuffling, the notation denotes the sub-block of which is stored by the workers in .)

(12)
(13)

Worker stores all the sub-blocks of the required data units indexed by , and also sub-blocks of each data unit indexed by (see (12)), thus the required storage space is

(14)

It can be seen from (13) (see also Table I) that the storage of worker at time is partitioned into two parts: (i) the “fixed part” contains all the sub-blocks of all data units that have the index in the second subscript; this part of the storage will not change over time; and (ii) the “variable part” contains all the sub-blocks of all data units required at time that do not have the index in the second subscript; this part of the storage will be updated over time.

TABLE I: Example of file partitioning and storage in (13) at the end of time slot for the decentralized data shuffling problem (one row per worker, listing the stored sub-blocks of each data unit).
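Since the sub-block indexing in (12)-(13) is not fully reproduced above, the following Python sketch shows one common way to instantiate such a structurally invariant placement. The parameter m (storage size measured in assigned batches per worker) and the choice of indexing sub-blocks by (m-1)-subsets of workers are our assumptions for illustration, not a verbatim transcription of (12)-(13).

```python
from itertools import combinations

def structurally_invariant_storage(workers, assignment, m):
    """Sketch of a structurally invariant placement in the spirit of (12)-(13).

    assignment[i] is the worker to which data unit i is assigned. Each data unit
    i is split into one sub-block (i, W) per (m-1)-subset W of the workers other
    than assignment[i]; worker k stores
      - every sub-block of the data units assigned to it, and
      - every sub-block (i, W) with k in W (the "fixed part" that never changes).
    """
    subblocks = {
        i: [frozenset(W)
            for W in combinations(sorted(set(workers) - {assignment[i]}), m - 1)]
        for i in assignment
    }
    storage = {k: set() for k in workers}
    for i, parts in subblocks.items():
        for W in parts:
            for k in workers:
                if k == assignment[i] or k in W:
                    storage[k].add((i, W))
    return subblocks, storage

# Example: 3 workers, data units 0, 1, 2 assigned one per worker, m = 2.
subblocks, storage = structurally_invariant_storage(
    workers=[1, 2, 3], assignment={0: 1, 1: 2, 2: 3}, m=2)
```

In this example every data unit is split into two sub-blocks, and each worker ends up storing four of the six sub-blocks (the equivalent of two data units): the fixed part that never changes plus the variable part tracking its currently assigned data unit.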

Initialization (for the achievable bound in (11))

The master directly transmits all data units. The storage is as in (13) given .

Data Shuffling Phase of time slot (for the achievable bound in (11))

After the end of the storage update phase at time , the new assignment is revealed. For notational convenience, let

(15)

Note that in (15) we have , with equality (i.e., worst-case scenario) if and only if . To allow the workers to recover their missing sub-blocks, the master broadcasts defined as

(16)
(17)

where in the MAN-like multicast message in (17) the sub-blocks involved in the sum are zero-padded to meet the length of the longest one. Since worker requests and has stored all the remaining sub-blocks in defined in (17), it can recover from , and thus all its missing sub-blocks from .
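The zero-padding and XOR-decoding step just described can be illustrated with a few lines of Python; the byte strings below are hypothetical sub-block contents, not the actual sub-blocks in (17).

```python
def xor_zero_pad(blocks):
    """XOR a list of byte strings after zero-padding them to the longest length,
    as done for the sub-blocks summed in a MAN-like multicast message."""
    length = max(len(b) for b in blocks)
    padded = [b.ljust(length, b"\x00") for b in blocks]
    out = bytearray(length)
    for b in padded:
        for pos, byte in enumerate(b):
            out[pos] ^= byte
    return bytes(out)

# A worker that already stores all but one of the summed sub-blocks recovers the
# missing one by XOR-ing the known (zero-padded) sub-blocks back out of the sum.
wanted = b"data-needed-by-worker-1"
known1, known2 = b"stored-by-worker-1", b"also stored"
multicast = xor_zero_pad([wanted, known1, known2])
recovered = xor_zero_pad([multicast, known1, known2])
assert recovered[:len(wanted)] == wanted
```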

Storage Update Phase of time slot (for the achievable bound in (11))

Worker evicts from the (variable part of its) storage the sub-blocks and replaces them with the sub-blocks . This procedure maintains the structurally invariant storage structure in (13).
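A compact sketch of this eviction/replacement rule, operating on the (data unit, index set) representation used in the placement sketch after Table I (again an illustrative encoding, not the formal definition):

```python
def update_variable_part(storage_k, k, old_units, new_units, subblocks):
    """Storage update for worker k when its assigned set of data units changes
    from old_units to new_units: evict the variable-part sub-blocks of the units
    that are no longer assigned (those whose index set does not contain k) and
    store all sub-blocks of the newly assigned units; the fixed part (index set
    containing k) is left untouched."""
    for i in old_units - new_units:
        for W in subblocks[i]:
            if k not in W:
                storage_k.discard((i, W))
    for i in new_units - old_units:
        for W in subblocks[i]:
            storage_k.add((i, W))
    return storage_k
```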

Performance Analysis (for the achievable bound in (11))

The total worst-case communication load satisfies

(18)

with equality (i.e., worst case scenario) if and only if for all .

IV Main Results

In this section we summarize our main results for the decentralized data shuffling problem. We shall use the subscripts “u,dec,conv” and “u,dec,ach” for converse (conv) and achievable (ach) bounds, respectively, for the decentralized problem (dec) with uncoded storage (u). We have:

  1. Converse: We start with a converse bound for the decentralized data shuffling problem under the constraint of uncoded storage.

    Theorem 1 (Converse).

    For a decentralized data shuffling system, the worst-case load under the constraint of uncoded storage is lower bounded by the lower convex envelope of the following memory-load pairs

    (19)

    The proof can be found in Section V and is inspired by the induction method proposed in [distributedcomputing, Thm.1] for the distributed computing problem. However, there are two main differences in our proof compared to [distributedcomputing, Thm.1]: (i) we need to account for the additional constraint on the stored content in (7), (ii) our storage update phase is by problem definition in (7) asymmetric across data units, while it is symmetric in the distributed computing problem.

  2. Achievability: We next extend the centralized data shuffling scheme in Section III to our decentralized setting.

    Theorem 2 (Scheme A).

    For a decentralized data shuffling system, the worst-case load under the constraint of uncoded storage is upper bounded by the lower convex envelope of the following memory-load pairs

    (20)
    (21)
    (22)

    The proof is given in Section VI-A.


    A limitation of Scheme A in Theorem 2 is that, in time slot worker does not fully leverage all its stored content. We overcome this limitation by developing Scheme B described in Section VI-B.

    Theorem 3 (Scheme B).

    For a decentralized data shuffling system, the worst-case load under the constraint of uncoded storage for is upper bounded by the lower convex envelope of the following memory-load pairs

    (23)

    We note that Scheme B is neither a direct extension of [neartoptimalAttia2018, Thm.4] nor of [fundamentalshuffling2018, Thm.4] from the centralized to the decentralized setting. As it will become clear from the details in Section VI-B, our scheme works with a rather simple way to generate the multicast messages transmitted by the workers, and it applies to any shuffle, not just to the worst case one. In Remark 3, we also extend this scheme for the general memory size regime.


    Scheme B in Theorem 3 uses a distributed clique-covering method to generate multicast messages, similar to what is done for D2D caching [dvbt2fundamental], where the distributed clique cover is on the side information graph (more details in Section V-A). Each multicast message corresponds to one distributed clique and includes one linear combination of all nodes in this clique. However, due to the asymmetry of the decentralized data shuffling problem (not present in D2D coded caching), the lengths of most distributed cliques are small, and thus the clique-based multicast messages sent by a worker in general involve only a small number of messages (i.e., a small multicast gain). To overcome this limitation, in Scheme C we develop a novel distributed clique-covering method for , which is described in Section VI-C. The key idea is to augment some of the cliques and send them in linear combinations.

    Theorem 4 (Scheme C).

    For a decentralized data shuffling system, the worst-case load under the constraint of uncoded storage for is upper bounded by

    (24)
  3. Optimality: By comparing our achievable and converse bounds, we have the following exact optimality results.

    Theorem 5 (Exact Optimality for ).

    For a decentralized data shuffling system, the optimal worst-case load under the constraint of uncoded storage for is given in Theorem 1 and is attained by Scheme B in Theorem 3.

    Note that the converse bound on the load for the case is trivially achieved by Scheme A in Theorem 2.


    From Theorem 5 (because when all possible storage sizes are covered by Scheme B) we can immediately conclude the following.

    Corollary 1 (Exact Optimality for ).

    For a decentralized data shuffling system, the optimal worst-case load under the constraint of uncoded storage is given by Theorem 1 for .


    Finally, by combining the three proposed achievable schemes, we have the following order optimality result proved in Section VI-D.

    Theorem 6 (Order Optimality for all Parameters).

    For a decentralized data shuffling system under the constraint of uncoded storage, for the cases not covered by Theorem 5, the proposed schemes achieve the converse bound in Theorem 1 to within a factor of .

  4. Finally, we can quantify the cost of peer-to-peer operations as follows (as it will be proved in Section VI-D).

    Corollary 2.

    By directly comparing the optimal load for the centralized system in (11) with the loads achieved by our proposed decentralized data shuffling schemes, the cost of peer-to-peer operations is no more than a factor of .

We conclude this section by providing some numerical results. Fig. 1 plots our converse bound and the best convex combination of the proposed achievable bounds on the worst-case load under the constraint of uncoded storage for decentralized data shuffling systems with (Fig. 1(a)) and (Fig. 1(b)) workers. For comparison, we also plot the optimal load for the corresponding centralized system in (10) under the constraint of uncoded storage. For the case of workers, Theorem 1 is tight under the constraint of uncoded storage. For the case of workers, Scheme B meets our converse bound when , and also trivially when .

Fig. 1: The storage-load trade-off for the decentralized data shuffling problem (panels (a) and (b) correspond to the two systems discussed above).

V Proof of Theorem 1: Converse Bound under the Constraint of Uncoded Storage

We want to lower bound for a fixed . Recall that the excess storage is said to be uncoded if each worker simply copies bits from the data units in its local storage. When the storage update phase is uncoded, we can divide each data unit into sub-blocks depending on the set of workers who store them, so that the data shuffling phase can be represented by a directed graph. The precise details are given next, before the actual proof of Theorem 1.

V-A Sub-block Division of the Data Shuffling Phase under Uncoded Storage

Because of the data shuffling constraint in (1), all the bits of all data units are stored by at least one worker at the end of any time slot. We denote the worker who stores data unit at the end of time slot by , where

(25)

In the case of excess storage, some bits of some files may be stored by multiple workers. We denote by the sub-block of bits of data unit exclusively stored by workers in where and . By definition, at the end of step , we have that must be in for all sub-blocks of data unit ; we also let for all if . Hence, at the end of step , each data unit can be written as

(26)

and the storage content as

(27)

We note that the sub-blocks have different content at different times (as the partition in (26) is a function of time through ); however, in order not to clutter the notation, we will not explicitly denote the dependence of on time. Finally, please note that the definition of sub-block , as given here for the converse bound, is not the same as the one used in Section VI for the achievable schemes (see Footnote 4).
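As a toy illustration of this sub-block division, the following snippet groups the bits of each data unit by the exact set of workers storing them; the bit-level representation of the uncoded storages is hypothetical and only serves to mirror the definition of the sub-blocks used in the converse proof.

```python
def exclusive_subblock_sizes(stored_bits, data_units, num_bits):
    """Group the bits of each data unit by the exact (exclusive) set of workers
    storing them, mirroring the sub-block division used in the converse proof.

    stored_bits[k] is the set of (data_unit, bit_index) pairs that worker k
    stores; the function returns a dict mapping
    (data_unit, frozenset_of_workers) -> number of bits stored exactly by that set."""
    sizes = {}
    for i in data_units:
        for b in range(num_bits):
            owners = frozenset(k for k, bits in stored_bits.items() if (i, b) in bits)
            if owners:  # by (1), every bit is stored by at least one worker
                sizes[(i, owners)] = sizes.get((i, owners), 0) + 1
    return sizes
```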

V-B Proof of Theorem 1

We are interested in deriving an information-theoretic lower bound on the worst-case communication load. We will first obtain a number of lower bounds on the load for some carefully chosen shuffles. Since the load of any shuffle is at most as large as the worst-case load, the obtained lower bounds are valid lower bounds for the worst-case load as well. We will then average the obtained lower bounds.

In particular, the shuffles are chosen as follows. Consider a permutation of denoted by where for each and consider the shuffle

(28)

Also define

(29)

where represents the subset of workers in whose demanded data units in time slot , indexed by in (28), were stored by some workers in at the end of time slot . (For example, if and , we have because and thus the data unit requested by worker in time slot was stored by worker at the end of time slot ; similarly, we have and .)

In addition, also define as the messages sent by the workers in during time slot , and as the sub-blocks that any worker in either needs to store at the end of time slot or has stored at the end of time slot , that is,

(30)
(31)

From Lemma 1 in Appendix A with , which is the key novel contribution of our proof and was inspired by the induction argument in [distributedcomputing], we have

(32)

We next consider all the permutations of where for each , and sum together the inequalities in the form of (32). For an integer , by the symmetry of the problem, the sub-blocks where , and appear the same number of times in the final sum. In addition, the total number of these sub-blocks in general is and the total number of such sub-blocks in each inequality in the form of (32) is . So we obtain

(33)
(34)
(35)

where we defined as the total number of bits in the sub-blocks stored by workers at the end of time slot normalized by the total number of bits , i.e.,

(36)

which must satisfy

(37)
(38)

We then use a method based on Fourier-Motzkin elimination as in [ontheoptimality] to bound from (34) under the constraints in (37) and (38). In particular, for each integer , we multiply (37) by to obtain

(39)

and we multiply (38) by to have

(40)

We then add (39), (40), and (34) to obtain,

(41)
(42)

Hence, for each integer , the bound in (42) becomes a linear function in . When , from (42) we have . When , from (42) we have . In conclusion, we have proved that is lower bounded by the lower convex envelope (also referred to as “memory sharing”) of the points , where .

This concludes the proof of Theorem 1.
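The memory-sharing step in the proof evaluates the lower convex envelope of the corner points by operating a fraction of the system at one corner point and the remaining fraction at the adjacent one. A minimal sketch of this evaluation is given below; the corner points in the example are hypothetical placeholders rather than the actual values of Theorem 1, and they are assumed to be already in convex position.

```python
def memory_sharing(corner_points, M):
    """Evaluate the lower convex envelope of (memory, load) corner points at
    memory size M by sharing between the two adjacent corner points (the points
    are assumed to be already in convex position)."""
    pts = sorted(corner_points)
    for (m1, r1), (m2, r2) in zip(pts, pts[1:]):
        if m1 <= M <= m2:
            alpha = (m2 - M) / (m2 - m1)  # fraction of the system operated at (m1, r1)
            return alpha * r1 + (1 - alpha) * r2
    raise ValueError("memory size outside the range covered by the corner points")

# Hypothetical corner points (memory, load), for illustration only.
corners = [(1.0, 3.0), (2.0, 1.0), (3.0, 0.5), (4.0, 0.0)]
print(memory_sharing(corners, 2.5))  # 0.75
```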

V-C Discussion

We conclude this section with a couple of remarks:

  1. The corner points of the converse bound are of the form , which may suggest the following placement.

    At the end of time slot , each data unit is partitioned into equal-length sub-blocks of length bits as ; by definition, if either or . Each worker stores all the sub-blocks if ; in other words, worker stores all the sub-blocks of its desired data units, and sub-blocks of the remaining data units.

    In the data shuffling phase of time slot , worker must decode the missing sub-blocks of data unit for all . An interpretation of the converse bound is that, in the worst case, the total number of transmissions is equivalent to at least sub-blocks.

    We will use this interpretation to design the storage update phase of our proposed Schemes B and C.

  2. The converse bound is derived for the objective of minimizing the “sum load” , see (9).

    The same derivation would give a converse bound for the “largest individual load” . In the latter case, the corner points of the converse bound are of the form . This viewpoint may suggest that, in the worst case, all the individual loads are the same, i.e., the burden of communicating missing data units is equally shared by all the workers.

    Our proof technique for Theorem 1 could also be directly extended to derive a converse bound on the average load (as opposed to the worst-case load) over all possible shuffles in the decentralized data shuffling problem when .

VI Achievable Schemes for Decentralized Data Shuffling

In this section, we propose three schemes for the decentralized data shuffling problem, and analyze their performances.

VI-A Scheme A in Theorem 2

Scheme A extends the general centralized data shuffling scheme in Section III to the decentralized model. Scheme A achieves the load in Theorem 2 for each memory size , where ; the whole memory-load tradeoff curve is achieved by memory-sharing between these points (given in (20)) and the (trivially achievable) points in (21)-(22).

Structurally Invariant Data Partitioning and Storage

This is the same as the one in Section III for the centralized case.

Initialization

The master directly transmits all data units. The storage is as in (13) given .

Data Shuffling Phase of time slot

The data shuffling phase is inspired by the delivery phase in D2D caching [d2dcaching]. Recall the definition of sub-block in (15), where each sub-block is known by workers and needed by worker . Partition into non-overlapping and equal-length pieces . Worker broadcasts the MAN-like multicast messages

(43)

in other words, one linear combination in (16) for the centralized setting becomes linear combinations in (43) for the decentralized setting, but of size reduced by a factor . Evidently, each sub-block in is stored in the memory of worker at the end of time slot . In addition, each worker knows where such that it can recover its desired block .
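As a sketch of how the transmissions in (43) can be organized, the following snippet follows the D2D delivery of [d2dcaching]: for every set of workers whose size exceeds by one the number of workers storing each sub-block, the sub-block needed by one worker in the set and stored by the others is split into one piece per storing worker, and each storing worker broadcasts the XOR (with zero-padding, as before) of the pieces assigned to it. The splitting factor and the indexing below are our reading of the scheme and are meant for illustration only.

```python
from itertools import combinations

def d2d_style_transmissions(workers, t, needed):
    """Organize MAN-like transmissions in the spirit of (43), following the D2D
    delivery of [d2dcaching] (illustrative indexing, not a verbatim transcription).

    needed[(k, W)] is the sub-block needed by worker k and stored by the workers
    in the frozenset W, with |W| = t. Each such sub-block is split into t pieces,
    one per worker in W; worker j broadcasts, for every (t+1)-subset S containing
    j, the XOR of the pieces assigned to it."""
    transmissions = []  # list of (sender, labels of the pieces XOR-ed together)
    for S in combinations(sorted(workers), t + 1):
        S = frozenset(S)
        for j in S:
            pieces = []
            for k in S - {j}:
                sub = (k, S - {k})           # sub-block needed by k, stored by S \ {k}
                if sub in needed:            # some sub-blocks may be empty for a given shuffle
                    pieces.append((sub, j))  # the piece of this sub-block assigned to sender j
            if pieces:
                transmissions.append((j, pieces))
    return transmissions
```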

Since , the worst-case load is