On the Fundamental Limits of Coded Data Shuffling for Distributed Learning Systems

07/11/2018 · Adel Elmahdy et al., University of Minnesota

We consider the data shuffling problem in a distributed learning system, in which a master node is connected to a set of worker nodes, via a shared link, in order to communicate a set of files to the worker nodes. The master node has access to a database of files. In every shuffling iteration, each worker node processes a new subset of files, and has excess storage to partially cache the remaining files, assuming the cached files are uncoded. The caches of the worker nodes are updated every iteration, and they should be designed to satisfy any possible unknown permutation of the files in subsequent iterations. For this problem, we characterize the exact rate-memory trade-off for worst-case shuffling by deriving the minimum communication load for a given storage capacity per worker node. As a byproduct, the exact rate-memory trade-off for any shuffling is characterized when the number of files is equal to the number of worker nodes. We propose a novel deterministic coded shuffling scheme, which improves the state of the art by exploiting the cache memories to create coded functions that can be decoded by several worker nodes. Then, we prove the optimality of our proposed scheme by deriving a matching lower bound and showing that the placement phase of the proposed coded shuffling scheme is optimal over all shuffles.


I Introduction

With the emergence of big data analytics, distributed computing systems have attracted enormous attention in recent years. The computational paradigm in the era of big data has shifted towards distributed systems, as an alternative to expensive supercomputers. Distributed computing systems are networks that consist of a massive number of commodity computational nodes connected through fast communication links. Examples of distributed computing applications span distributed machine learning, massively multiplayer online games (MMOGs), wireless sensor networks, real-time process control, etc. Prevalent distributed computing frameworks, such as Apache Spark [2], and computational primitives, such as MapReduce [3], Dryad [4], and CIEL [5], are key enablers to process substantially large data-sets (on the order of terabytes) and execute production-scale data-intensive tasks.

Data Shuffling is one of the core components in distributed learning algorithms. Broadly speaking, the data shuffling stage is introduced to prepare data partitions with desirable properties for parallel processing in future stages. A prototypical iterative data processing procedure is outlined as follows: (i) randomly shuffle the training data-set; (ii) equally partition the data-set into non-overlapping batches, and assign each batch to a local worker (one may instead store the entire training data-set in a massive shared storage system and let the workers directly access the new batches every learning epoch; although this eliminates the communication overhead of the shuffling mechanism, it suffers from network and disk I/O bottlenecks, and hence this approach is notoriously sluggish and cost-inefficient as well [16]); (iii) each local worker performs a local computational task to train a learning model; (iv) reshuffle the training data-set to provide each worker with a new batch of data points at each learning epoch, and continue the model training. Data shuffling is known to enhance the learning model quality and lead to significant statistical gains in ubiquitous applications for machine learning and optimization. One prominent example is stochastic gradient descent (SGD) [6, 7, 8, 9, 10, 11, 12]. Recht and Ré [6] conjectured a non-commutative arithmetic-geometric mean inequality, and showed that the expected convergence rate of the random-shuffling version of SGD is faster than that of the usual with-replacement version, provided the inequality holds (proving this statement is a long-standing problem in the theory of SGD, and the correctness of the full conjecture is still open). It was empirically demonstrated that shuffling the data before running SGD results in superior convergence performance [7, 8, 9, 10, 11]. Recently, Meng et al. [12] proposed an extensive analysis of the desirable convergence properties of distributed SGD with random shuffling, in both convex and non-convex cases. In practice, however, the benefits of data shuffling come at a price: in every shuffling iteration, the entire data-set is communicated over the network of workers. Consequently, this leads to performance bottlenecks due to the communication overhead.
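To make steps (i)-(iv) concrete, the following minimal sketch (our own illustration; the function and variable names are hypothetical and not taken from the paper) runs the shuffle-partition-train loop with the data-set split evenly across the workers.

import numpy as np

def reshuffling_epochs(data, num_workers, num_epochs, local_update):
    # Sketch of the prototypical procedure: shuffle, partition into
    # non-overlapping batches, let each worker train locally, then reshuffle.
    n = len(data)
    for epoch in range(num_epochs):
        perm = np.random.permutation(n)               # (i) random shuffle
        batches = np.array_split(perm, num_workers)   # (ii) equal partitioning
        for worker, batch in enumerate(batches):      # (iii) local computation
            local_update(worker, data[batch])
        # (iv) the loop reshuffles the data-set for the next epoch

In the distributed setting, step (iv) is where the communication bottleneck arises: the newly assigned batches must be delivered to the workers over the network.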

Caching of popular content during off-peak hours is a prominent technique that reduces network congestion, enhances throughput, and lowers latency in content delivery networks. This can be achieved through prefetching popular content into end-user memories distributed across the network. The caching problem comprises two phases: (i) a placement phase and (ii) a delivery phase. The placement phase takes place when the network is not congested and the system is not aware of the future demands of the users, but the statistics of the users’ demands are known. In this phase, the cache of each user prefetches data from the server subject to the size of the cache memories. On the other hand, the delivery phase takes place when the actual demands of the users are revealed, and hence, the network is congested. In this phase, the server transmits the requested files subject to the rate required to serve the users’ demands. In a seminal work by Maddah-Ali and Niesen [13], the first information-theoretic formulation was introduced for the basic caching problem, where a central server, with a database of files, is connected to a set of users via a shared bottleneck link. The authors proposed a novel coded caching scheme that exploits not only the local cache at each individual user (i.e., the local cache size), but also the aggregate memory of all users (i.e., the global cache size), even if there is no cooperation among the users. Recently, the exact rate-memory trade-off for the basic caching problem, where the prefetching is uncoded, has been characterized in [14] for both centralized and decentralized settings under uniform file popularity.
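As a toy illustration of how coded delivery exploits the caches (our own simplified two-user example, not the general scheme of [13]): split two files into halves, let each user cache one half of every file, and observe that a single XOR serves both demands.

def xor(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

# Two files, each split into two halves.
A = b"AAAAaaaa"; B = b"BBBBbbbb"
A1, A2 = A[:4], A[4:]
B1, B2 = B[:4], B[4:]

# Placement: user 1 caches (A1, B1); user 2 caches (A2, B2).
# Delivery:  user 1 requests file A, user 2 requests file B.
broadcast = xor(A2, B1)            # one coded transmission serves both users
assert xor(broadcast, B1) == A2    # user 1 recovers A2 using its cached B1
assert xor(broadcast, A2) == B1    # user 2 recovers B1 using its cached A2

Uncoded delivery would have to send A2 and B1 separately, i.e., twice the load of the single coded transmission.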

The idea of incorporating coding theory into the context of distributed machine learning was introduced in a recent work by Lee et al. [15]. The authors posed an intriguing question as to how to use coding techniques to ensure robust speedups in distributed computing. To address this question, the workflow of distributed computation is abstracted into three main phases: a storage phase, a communication phase, and a computation phase. Coding theory is utilized to alleviate the bottlenecks in the computation and communication phases of distributed learning algorithms. More specifically, the authors proposed novel algorithms for coded computation to speed up the performance of linear operations, and coded data shuffling to overcome the significant communication bottlenecks between the master node and the worker nodes during data shuffling.

I-A Related Prior Works

The data shuffling problem has been extensively studied from various perspectives under different frameworks. In what follows, we survey the literature and present the progress and the current status of the problem.

I-A1 Data Shuffling in Master-Worker Distributed Computing Framework

In the master-worker distributed setup, the master node has access to the entire data-set, which is randomly permuted and partitioned into batches at every iteration of the distributed algorithm. The data shuffling phase aims at communicating these batches to the worker nodes so that they can locally perform their distributed tasks in parallel. Then, the master node aggregates the local results of the worker nodes to complete the computation and produce the final result. Inspired by the coded caching scheme introduced by Maddah-Ali and Niesen [13], Lee et al. [15] proposed the first coded shuffling algorithm, based on random storage placement, that leverages the excess storage of the local caches of the worker nodes to slash the communication bottlenecks. The coded shuffling algorithm consists of three main strategies: a coded transmission strategy designed by the master node, and decoding and cache-updating strategies executed by the worker nodes. Significant improvements in the achievable rate and the average transmission time of the coded shuffling framework, compared to the no-shuffling and uncoded shuffling frameworks, were demonstrated through extensive numerical experiments. The theoretical guarantees of [15] hold only when the number of data points approaches infinity and the broadcast channel between the master node and the worker nodes is perfect. In pursuance of a practical shuffling algorithm, Chung et al. [16] have recently proposed a novel coded shuffling algorithm, coined “UberShuffle”, to enhance the practical efficacy of the shuffling algorithm of [15]. However, it is not evident how far these coded shuffling algorithms are from the fundamental limits on the communication rate. Attia and Tandon [17, 18, 19] investigated the data shuffling problem in a distributed computing system consisting of a master node that communicates data points to worker nodes with limited storage capacity. An information-theoretic formulation of the data shuffling problem was proposed for the data delivery and storage update phases. Furthermore, the worst-case communication rate is defined as the maximum communication load from the master node to the worker nodes over all possible consecutive data shuffles for any achievable scheme characterized by the encoding, decoding, and cache update functions. Accordingly, the authors characterized the optimal trade-off between the storage capacity per worker node and the worst-case communication rate for certain regimes of the number of files, the number of worker nodes, and the available storage per worker node. More specifically, the rate was characterized in [17] for systems with a small number of worker nodes. Furthermore, the special case of no excess storage (an arbitrary number of files and worker nodes, but no storage beyond the files being processed) was addressed in [18]. However, the proposed schemes in these works do not generalize to arbitrary parameters. Recently, the authors have proposed an “aligned coded shuffling scheme” [19] that is optimal for a small number of worker nodes, and otherwise achieves the lower bound on the rate for the worst-case communication scenario within a bounded multiplicative gap. On the other hand, following the same master-worker framework, Song et al. [20] considered the data shuffling problem from the perspective of index coding [21], where the new data assigned by the master node at every iteration constitute the messages requested by the worker nodes, and the data cached at the worker nodes form the side information. Motivated by the NP-hardness of the index coding problem [21], the authors proposed a pliable version of the index coding problem to enhance the communication efficiency of distributed data shuffling. It is assumed that the worker nodes are pliable in the sense that they are only required to obtain new messages, randomly selected from the original set of messages, at every iteration. This degree of freedom enables the realization of semi-random data shuffling, which yields more efficient coding and transmission schemes, as opposed to fully random data shuffling.

I-A2 Data Shuffling in MapReduce Distributed Computing Framework

MapReduce [3] is a programming paradigm that allows for parallel processing of massive data-sets across large clusters of computational nodes. More concretely, the overall computation is decomposed into computing a set of “Map” and “Reduce” functions in a distributed and parallel fashion. Typically, a MapReduce job splits the input data-set into blocks, each of which is locally processed by a computing node that maps the input block into a set of intermediate key/value pairs. Next, the intermediate pairs are transferred to a set of processors that reduce the set of intermediate values by merging those with the same intermediate key. The process of inter-server communication between the mappers and reducers is referred to as data shuffling. Li et al. [22] introduced a variant implementation of MapReduce, named “Coded MapReduce”, that exploits coding to considerably reduce the communication load of the data shuffling phase. The key idea is to create coded multicast opportunities in the shuffling phase through an assignment strategy of repetitive mappings of the same input data block across different servers. The fundamental trade-off between computation load and communication cost in Coded MapReduce is characterized in [23]. A unified coding framework for distributed computing in the presence of straggling servers was proposed in [24], where the trade-off between the computation latency and communication load is formalized for linear computation tasks.
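The following toy word-count sketch (our own illustration with hypothetical names) traces the plain, uncoded MapReduce flow described above; the inter-server traffic generated in shuffle_phase is precisely the data shuffling load that Coded MapReduce reduces by replicating map computations.

from collections import defaultdict

def map_phase(block):
    # Map: emit intermediate (key, value) pairs for one input block.
    return [(word, 1) for word in block.split()]

def shuffle_phase(mapped, num_reducers):
    # Shuffle: route all values of the same intermediate key to one reducer.
    buckets = [defaultdict(list) for _ in range(num_reducers)]
    for pairs in mapped:
        for key, value in pairs:
            buckets[hash(key) % num_reducers][key].append(value)
    return buckets

def reduce_phase(bucket):
    # Reduce: merge the values that share the same intermediate key.
    return {key: sum(values) for key, values in bucket.items()}

blocks = ["coded map reduce", "map reduce shuffle", "coded shuffle"]
mapped = [map_phase(b) for b in blocks]
results = [reduce_phase(b) for b in shuffle_phase(mapped, num_reducers=2)]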

We would like to highlight the subtle distinction between the coded caching problem and the coded shuffling problem. Both problems share the property that the prefetching scheme is designed to minimize the communication load for any possible unknown demand (or permutation) of the data. However, the coded shuffling algorithm is run over a number of iterations to store the data batches and compute some task across all worker nodes. In addition to that, the permutations of the data in subsequent iterations are not revealed in advance. Therefore, the caches of the worker nodes should be adequately updated after every iteration to maintain the structure of the data placement, guarantee the coded transmission opportunity, and achieve the minimum communication load for any undisclosed permutation of the data. Another subtle distinction that we would like to emphasize is the difference between the concept of data shuffling in the master-worker setup and that in the MapReduce setup. In the master-worker setup, a master node randomly shuffles data points among the computational worker nodes for a number of iterations to enhance the statistical efficiency of distributed computing systems. A coded data shuffling algorithm enables coded transmission of batches of the data-set through exploiting the excess storage at the worker nodes. On the other hand, in the MapReduce setup, the whole data-set is divided among the computational nodes, and a data placement strategy is designed in order to create coding opportunities that can be utilized by the shuffling scheme to transfer locally computed results from the mappers to the reducers. In other words, coded MapReduce enables coded transmission of blocks of the data processed by the mappers in the shuffling phase through introducing redundancy in the computation of the Map stage.

I-B Our Contribution

In this paper, we consider a data shuffling problem in a master-worker distributed computing system consisting of a master node and a set of worker nodes. The master node has access to the entire data-set of files. Each worker node has a limited cache memory that can store only a subset of the files. In each iteration of the distributed algorithm, the master node randomly shuffles the data points among the worker nodes. We summarize the main results of the paper as follows.

  • We first study the data shuffling problem in the canonical setting where the number of files equals the number of worker nodes. We propose a novel coded shuffling algorithm, which comprises the following phases: (i) file partitioning and labeling, (ii) subfile placement, (iii) encoding, (iv) decoding, and (v) cache updating and subfile relabeling. We show how cache memories are leveraged in order to create coded functions that can be decoded by several worker nodes that process different files at every iteration of the distributed algorithm. The proposed scheme is generalized for an arbitrary number of worker nodes and an arbitrary storage capacity per worker node.

  • Next, we derive a matching information-theoretic lower bound on the communication load for the data shuffling problem in the canonical setting, and we prove that, among all possible placement and delivery strategies, our proposed coded shuffling scheme is universally optimal over all shuffling scenarios and achieves the minimum communication load. Therefore, the optimal rate-memory trade-off in the canonical setting is characterized for any shuffling.

  • Finally, we extend the results obtained for the canonical setting to investigate the general setting of the data shuffling problem, in which the number of files exceeds the number of worker nodes, for the worst-case shuffling. Inspired by the concept of perfect matching in bipartite graphs, we develop a coded shuffling scheme by decomposing the file transition graph into subgraphs, each of which reduces to a canonical data shuffling problem. Hence, we can apply our coded shuffling scheme for the canonical setting to each sub-problem and obtain a delivery scheme for the original shuffling problem, which is generalized for any number of files, worker nodes, and storage capacity per worker node. Furthermore, we derive a matching information-theoretic converse on the communication load for the data shuffling problem in the general setting, and demonstrate that there exist shuffling scenarios whose minimum delivery rates are equal to the delivery rate achieved by the proposed coded shuffling scheme. As a result, the optimal rate-memory trade-off in the general setting is exactly characterized for the worst-case shuffling.

I-C Paper Organization

The remainder of the paper is organized as follows. We first present the formal definition of the data shuffling problem, as well as the main results of this work, in Section II. The cache placement scheme is proposed in Section III. For the canonical setting of the shuffling problem, i.e., when the number of files equals the number of worker nodes, two achievable coded shuffling schemes, along with illustrative examples, are delineated in Section IV. Then, the optimality proof for our proposed delivery scheme is presented in Section V. Next, the general and practical setting of the shuffling problem, i.e., when the number of files exceeds the number of worker nodes, is studied in Section VI, where an achievable delivery scheme, an illustrative example, and the optimality of the proposed delivery scheme for the worst-case shuffling are presented. Finally, the paper is concluded and directions for future research are discussed in Section VII.

II Problem Formulation and Main Results

II-A Formulation of Data Shuffling Problem

Fig. 1: Data shuffling in a distributed computing system.

For an integer n, let [n] denote the set of integers {1, 2, …, n}. Fig. 1 depicts a distributed computing system with a master node and a set of worker nodes. The master node is assumed to have access to a data-set consisting of a number of files, where the size of each file is normalized to one unit. In practice, the number of files is remarkably larger than the number of worker nodes, and hence we study the data shuffling problem under the practical assumption that the number of files is at least as large as the number of worker nodes. At each iteration, each worker node should perform a local computational task on a subset of the files (unless otherwise stated, we assume that the number of files per worker node and related quantities are integers). The assignment of files to worker nodes is done by the master node, either randomly or according to some predefined mechanism. Each worker node has a cache that can store up to a given number of files, including the under-processing files; this imposes a constraint on the size of the cache at each worker node. Once the computation at the worker nodes is done, the result is sent back to the master node. A new batch of files will be assigned to each worker node for the next iteration, and the cache contents of the worker nodes should be modified accordingly. The communication of files from the master node to the worker nodes occurs over a shared link, i.e., any information sent by the master node will be received by all of the worker nodes.

For a given iteration , we denote by the set of indices of the files to be processed by , and by the portion of the cache of dedicated to the under-processing files: . The subsets provide a partitioning of the set of file indices, i.e., for , and . Similarly, denotes the subset of indices of files to be processed by at iteration , where also forms a partitioning of . When there is excess storage, each worker node can cache (parts of) the other files in , in addition to the files in . We denote by the contents of the remaining space of the cache of , which is called the excess storage. Therefore, . Let , and denote the contents of , and at iteration . For the sake of brevity, we may drop the iteration index whenever it is clear from the context.

Filling the excess part of the cache of the worker nodes is performed independently of the newly assigned subsets . Between iterations and , the master node should compute and broadcast a message (a function of all files in ), such that each worker node can retrieve all files in from its cached data and the broadcast message . The communication load is defined as the size of the broadcast message for the parameters introduced above. We interchangeably refer to it as the delivery rate and the communication load of the underlying data-shuffling system. The goal is to develop a cache placement strategy and design a broadcast message that minimize this load for any shuffle. When the cache of each worker node is large enough to store all of the files, no communication is needed between the master node and the worker nodes for any shuffling; thus, we can focus on the regime of limited storage. We define the normalized storage capacity to be the cache size normalized by the size of the data to be processed by each worker node.

File Transition Graph

A file transition graph is defined as a directed graph whose set of vertices corresponds to the worker nodes, and whose directed edges are each associated with one file (see Fig. 4(a)). An edge from one worker node to another indicates that the corresponding file is being processed by the first worker node at iteration , and is assigned to the second worker node to be processed at iteration . Note that, in general, the file transition graph is a multigraph, since multiple files may be handed over from one worker node to another, and we include one edge for each such file.
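For the canonical setting in which each worker node processes exactly one file, the file transition graph is the functional graph of a permutation and therefore decomposes into disjoint directed cycles. The following sketch (our own; the dictionary-based representation and names are hypothetical) recovers that cycle decomposition.

def cycle_decomposition(assignment):
    # assignment[u] = the worker that must process, at iteration t+1, the file
    # processed by worker u at iteration t (a permutation in the canonical setting).
    remaining, cycles = set(assignment), []
    while remaining:
        cycle, node = [], min(remaining)
        while node in remaining:
            remaining.remove(node)
            cycle.append(node)
            node = assignment[node]
        cycles.append(cycle)
    return cycles

# Example: worker 0 keeps its file; the files of workers 1, 2, 3 rotate.
print(cycle_decomposition({0: 0, 1: 2, 2: 3, 3: 1}))   # [[0], [1, 2, 3]]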

Fig. 2: The file transition graphs for two instances of a data shuffling system with worker nodes and files. Assume for , and consider two different assignment functions; and . The shown graphs are isomorphic. (a) and . (b) and .

Without loss of generality, let us assume a fixed assignment function at iteration , for example, for ; otherwise we can relabel the files. Hence, the problem and its file transition graph are fully determined by the assignment function at iteration . Let be the set of assignment functions whose corresponding file transition graphs are isomorphic to . Fig. 2 captures two instances of a shuffling problem with isomorphic file transition graphs. For a given graph , we define the average delivery rate over all assignment functions in as

Our ultimate goal in this paper is to characterize this average delivery rate for the given system parameters and for all feasible file transition graphs.

II-B Main Results

First, we present our main results characterizing the exact rate-memory trade-off for the canonical setting of the data shuffling problem, in which the number of files equals the number of worker nodes, for any shuffling. In this setting, each worker node processes exactly one file at each iteration. Without loss of generality, we assume that each worker node processes the file with the same index at every iteration; otherwise we can relabel the files. The following theorems summarize our main results.

Theorem 1.

For a data shuffling problem with a master node and worker nodes, each with a cache of a given size, the communication load required to shuffle the files among the worker nodes for any file transition graph is upper bounded by

(1)

For non-integer values of , where , the lower convex envelope of the corner points, characterized by (1), is achievable by memory-sharing.

An achievability argument consists of a cache placement strategy and a delivery scheme. We propose a cache placement in Section III which will be used for all achievable schemes discussed in this paper. The delivery scheme, along with the memory-sharing argument for non-integer values of , is presented in Section IV-A. Illustrative examples are then given in Section IV-B.

The next theorem provides an achievable delivery rate (depending on the file transition graph) by an opportunistic coding scheme. We will show later that the underlying file transition graph of any data shuffling problem comprises a number of directed cycles. We denote the number of cycles in the file transition graph by , and the corresponding cycle lengths by , where .

Theorem 2.

For a data shuffling system with a master node and worker nodes, each with a cache of size files, for , the shuffling of files among the worker nodes for a given file transition graph that comprises cycles can be performed by broadcasting a message of size , where

(2)

For non-integer values of , where , the lower convex envelope of the corner points, characterized by (2), is achievable by memory-sharing.

The proposed delivery scheme and achievability proof for Theorem 2 are presented in Section IV-C. The memory-sharing argument for non-integer values of follows a similar reasoning as the one in Theorem 1. We provide an illustrative example in Section IV-D.

Theorem 3.

For the data shuffling system introduced in Theorem 2, the communication load required to shuffle files among the worker nodes for a given assignment with a file transition graph that comprises cycles is lower bounded by

(3)

The proof of optimality (converse) is presented in Section V, where we also provide an illustrative example to describe the proof technique.

Corollary 1.

Theorems 2 and 3 prove the optimality of the proposed coded shuffling scheme for an arbitrary number of worker nodes , storage capacity per worker node , and file transition graph with cycles, when . Therefore, the optimal delivery rate is characterized as

(4)

For non-integer values of , where , the optimal delivery rate is equal to the lower convex envelope of the corner points given in (4). Furthermore, when , the achievable delivery rate of Theorem 2 is equal to that of Theorem 1, and takes its maximum. This characterizes the optimal worst-case delivery rate which is given by

This indicates that the upper bound of Theorem 1 is the best universal (assignment independent) bound that holds for all instances of the data shuffling problem.

Fig. 3 captures the optimum trade-off curve of the delivery rate as a function of the storage capacity per worker node, for a file transition graph with a given number of cycles.

Fig. 3: The optimum trade-off curve between the delivery rate and the storage capacity per worker node , when and .

Next, based on the results obtained for the canonical setting of the data shuffling problem, we present our main result in Theorem 4, which characterizes an upper bound on the rate-memory trade-off for the general setting in which the number of files exceeds the number of worker nodes. This upper bound turns out to be optimum for the worst-case shuffling, as stated in Theorem 5.

Theorem 4.

For a data shuffling system that processes files, and consists of a master node and worker nodes, each with a normalized storage capacity of files, the achievable delivery rate required to shuffle files among the worker nodes for any file transition graph is upper bounded by

(5)

For non-integer values of , where , the lower convex envelope of the corner points, characterized by (5) is achievable by memory-sharing.

The delivery scheme and achievability proof are presented in Section VI-A. The memory-sharing argument for non-integer values of follows a similar reasoning as the one in Theorem 1. We also present an illustrative example in Section VI-C.

Theorem 5.

For the data shuffling system introduced in Theorem 4, the communication load required to shuffle files among the worker nodes according to the worst-case shuffling is given by

(6)

The proof of Theorem 5 is presented in Section VI-D.

III Cache Placement

In this section we introduce our proposed cache placement, in which the contents of each worker node’s cache at iteration are known. Note that the cache placement does not depend on the files to be processed by the worker nodes at iteration , i.e., it does not depend on .

III-A File Partitioning and Labeling

Throughout this work, we assume that and are integer numbers, unless it is specified otherwise. Let . Let be a file being processed by worker node at iteration , i.e., . We partition into equal-size subfiles, and label the subfiles with a subscript as

(7)

Since the size of each file is normalized to one unit, the size of each subfile will be the reciprocal of the number of subfiles per file. For the sake of completeness, we also define dummy subfiles (of size zero) for every remaining index combination.
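A minimal sketch of this partitioning for the canonical setting (one file per worker node), under our assumption, consistent with the description above, that the subfiles of the file processed by a worker are labeled by the (S-1)-subsets of the remaining K-1 workers; the symbols K and S and the function name are ours, not the paper's notation.

from itertools import combinations

def partition_file(file_bytes, K, S, owner):
    # Split the file processed by worker `owner` into C(K-1, S-1) equal subfiles,
    # each labeled by the (S-1)-subset of other workers that will cache it.
    others = [w for w in range(K) if w != owner]
    labels = list(combinations(others, S - 1))
    size = len(file_bytes) // len(labels)
    return {A: file_bytes[j * size:(j + 1) * size] for j, A in enumerate(labels)}

For K = 4 and S = 2, this yields three subfiles per file, matching the counts in Example 1 of Section IV-B.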

III-B Subfile Placement

The cache of consists of two parts: (i) the under-processing part , in which all subfiles of files to be processed at iteration are stored; (ii) the excess storage part , which is equally distributed among all other files. We denote by the portion of  dedicated to the file , in which all subfiles with are cached. Hence, we have

(8)

where

(9)
(10)

For any worker node , there are complete files in . Moreover, for each of the remaining files, there are subfiles, out of a total subfiles, that are cached in the excess storage part. Thus, we have

which satisfies the memory constraints.
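As a sanity check under the same assumed labeling (K worker nodes, a storage of S files each, one file per worker node; our symbols), the cached data per worker adds up exactly to the storage budget:

1 + (K-1) \cdot \frac{\binom{K-2}{S-2}}{\binom{K-1}{S-1}} = 1 + (K-1) \cdot \frac{S-1}{K-1} = S,

where the leading 1 accounts for the complete under-processing file and the second term for the excess-storage fractions of the K-1 remaining files.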

Recall that the worker node should be able to recover files from its cache and the broadcast message . Communicating files in from the master node to a worker node can be limited to sending only the desired subfiles that do not exist in the cache of . For a worker node , let denote the set of subfiles to be processed by  at iteration , which are not available in its cache at iteration , that is,

(11)

It is evident that each worker node needs to decode at most subfiles for each of the files in , in order to process them at iteration .

The proposed file partitioning and cache placement are described in Algorithm 1 and Algorithm 2, respectively.

1:Input:
2:Output:
3:
4:for  to  do
5:     for all  do
6:         Worker node partitions into subfiles of equal sizes:
7:     end for
8:end for
Algorithm 1 partitionFiles
1:Input:
2:Output:
3:
4:for  to  do
5:      i.e.,
6:     for all  do
7:         
8:     end for
9:     
10:     
11:end for
Algorithm 2 placeSubfiles
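In the same spirit as Algorithms 1 and 2, the following sketch (our own; it reuses partition_file from the sketch in Section III-A, and the names are hypothetical) fills the under-processing and excess-storage parts of every cache.

def place_subfiles(partitions, K):
    # partitions[i]: dict mapping a label A (an (S-1)-subset of workers) to a
    # subfile of the file processed by worker i, as produced by partition_file.
    caches = {k: {"processing": dict(partitions[k]), "excess": {}} for k in range(K)}
    for i in range(K):
        for A, subfile in partitions[i].items():
            for k in A:   # worker k caches a subfile of file i iff k is in its label
                caches[k]["excess"][(i, A)] = subfile
    return caches

With this placement, each worker stores S files' worth of data, matching the accounting in Section III-B.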

IV Coded Shuffling for the Canonical Setting

We describe two delivery strategies in this section. The first delivery scheme is universal, in the sense that it does not exploit the properties of the underlying file transition graph. By analyzing this scheme in Section IV-A, we show that the delivery rate in Theorem 1 is achievable. Two illustrative examples are presented in Section IV-B to better describe the coding and decoding strategies. We then demonstrate that the size of the broadcast message can be reduced by exploiting the cycles in the file transition graph. A graph-based delivery strategy is proposed in Section IV-C. This new scheme can achieve the reduced delivery rate proposed in Theorem 2. Finally, we conclude this section by presenting an illustrative example for the graph-based delivery scheme in Section IV-D.

IV-A A Universal Delivery Scheme for Any Shuffling: Proof of Theorem 1

Recall that for we have . In order to prove Theorem 1, we propose a coded shuffling scheme to show that a delivery rate of is achievable for the canonical setting () for any integer . We assume, without loss of generality, that processes file at iteration , i.e., for , otherwise we can relabel the files.

Encoding

Given all cache contents , characterized by (8), and , the broadcast message sent from the master node to the worker nodes is obtained by the concatenation of a number of sub-messages , each specified for a group of worker nodes , that is,

(12)

where

(13)

The encoding design hinges on all but one of the worker nodes. Without loss of generality, we consider the first worker nodes, for whom the broadcast sub-messages are designed, and designate the last worker node as the ignored worker node. We will later show how the ignored worker node is served for free using the sub-messages designed for the other worker nodes.

According to the proposed encoding scheme, there is a total of encoded sub-messages, each corresponding to one subset , and the size of each sub-message is . Hence, the overall broadcast communication load is upper bounded by

as claimed in Theorem 1.
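Reading the counting above under our assumed labeling (sub-messages indexed by the S-subsets of the K-1 non-ignored worker nodes, each subfile of size 1/\binom{K-1}{S-1}; our symbols), the broadcast size evaluates to

\binom{K-1}{S} \cdot \frac{1}{\binom{K-1}{S-1}} = \frac{K-S}{S},

which equals K-1 when S = 1 (no excess storage) and vanishes when S = K. This is a sketch of the counting under our assumptions, not a substitute for the exact expression in (1).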

Decoding

The following lemmas demonstrate how each worker node decodes the missing subfiles, which constitute the file to be processed at iteration , from the broadcast sub-messages and its cache contents.

Lemma 1.

For a worker node , where , a missing subfile can be decoded

  • from and the broadcast sub-message , if ; and

  • from , the broadcast sub-message , and other subfiles previously decoded by , if .

We refer to Appendix A for the proof of Lemma 1.

Remark 1.

Here, we provide an intuitive justification for Lemma 1. Consider a worker node and a set of worker nodes of size that includes . One can show that every subfile appearing in belongs to either or . Therefore, worker node can recover a linear equation in the subfiles in by removing the subfiles in its cache from . It turns out that all such equations are linearly independent. The number of such equations is (because and ). On the other hand, the number of subfiles in is (at most) , since out of a total of subfiles of , of them are cached in , characterized by (10). Therefore, the obtained set of linearly independent equations suffices to recover all the subfiles in .
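Under the same assumed labeling (again, our symbols rather than the paper's notation), the two counts in Remark 1 can be written explicitly: the number of useful sub-messages for a worker is \binom{K-2}{S-1}, since the indexing subset must contain that worker, exclude the ignored worker, and have size S; and the number of missing subfiles of the requested file is

\binom{K-1}{S-1} - \binom{K-2}{S-2} = \binom{K-2}{S-1},

so the number of linearly independent equations matches the number of unknown subfiles.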

Lemma 2.

For the worker node , any missing subfile can be decoded from the cache contents and the summation of the broadcast sub-messages .

We refer to Appendix B for the proof of Lemma 2.

Cache Updating and Subfile Relabeling

After the worker nodes decode the missing subfiles, characterized by (11), the caches of the worker nodes need to be updated, and the subfiles need to be relabeled, before processing the files at iteration . The goal of the cache update is to maintain a similar cache configuration for the worker nodes for the next shuffling iteration. The cache update phase is described as follows:

  • For , all the subfiles of are placed in at iteration , i.e.,

    (14)
  • For , the excess storage is updated by removing all the subfiles of , and replacing them by the subfiles of that were cached at , i.e.,

    (15)

    where

    (16)
    (17)

Consequently, we have . Note that the cache update procedure is feasible, since the subfiles needed for either exist in or appear in the set of missing subfiles to be decoded after the broadcast message delivery. In particular, all the subfiles of already exist in , and hence those in  will be simply moved from the under-processing part to the excess storage part of the cache.

Finally, the subfiles are relabeled as follows:

  1. For every subfile , where , , and , relabel the subfile’s subscript to , where .

  2. For every subfile , where , and , relabel the subfile’s superscript to .

It is easy to see that after the proposed cache update and subfile relabeling, the cache configuration of each worker node at iteration maintains a similar arrangement to that introduced initially at iteration and characterized by (8). Therefore, the proposed scheme can be applied for the following shuffling iterations. This completes the proof of Theorem 1.

In what follows, the proposed encoding for delivery, decoding at the worker nodes, and cache update and subfile relabeling are formally described in Algorithm 3, Algorithm 4, and Algorithm 5, respectively.

1:Input:
2:Output:
3:for all  and  do
4:     
5:end for
6:
7:The master node broadcasts the message to the worker nodes
Algorithm 3 encodeSubmessages
1:Input:
2:Output:
3:for  to  do
4:     if  then
5:         for all  and  do
6:              Worker node decodes subfile from the sub-message using its cache contents
7:         end for
8:         for all  and and  do
9:              Worker node decodes subfile from the sub-message using its cache contents and other subfiles already decoded by
10:         end for
11:     else
12:         for all  and  do
13:              Worker node decodes subfile from the sub-messages using its cache contents
14:         end for
15:     end if
16:     
17:end for
Algorithm 4 decodeSubfiles
Remark 2.

Let be a non-integer cache size with . We can always write , for some . The data shuffling problem for non-integer cache size can be addressed by a memory-sharing argument, similar to [17]. More precisely, we can show that the pairs and are achievable, and conclude that, for , a communication load of can be achieved.

Recall that the size of each file is normalized to one unit. For the memory-sharing argument, each file will be partitioned into two parts of sizes and . The cache of each worker node is also divided into two parts of sizes and . Then, the files of size will be cached and shuffled within the parts of the caches of size . Similarly, the files of size , together with the parts of the caches of size , form another isolated instance of the problem. Summing the delivery rates of the two instances, we get

(18)
(19)

This shows that the convex hull of the pairs is achievable.
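A compact way to state the argument (our notation): for a non-integer storage S with S_1 = \lfloor S \rfloor, S_2 = \lceil S \rceil, and S = \alpha S_1 + (1-\alpha) S_2 for some \alpha \in [0,1], splitting each file and each cache in the ratio \alpha : (1-\alpha) yields two independent instances whose loads add up to

R(S) \le \alpha\, R(S_1) + (1-\alpha)\, R(S_2),

i.e., the lower convex envelope of the integer-storage corner points is achievable.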

1:Input:
2:Output:
3:for  to  do Updating caches of worker nodes before iteration
4:     
5:     .
6:     
7:      i.e.,
8:     
9:     
10:end for
11:for  to  do Relabeling subscripts of a set of subfiles of each worker node
12:     
13:     for all  do
14:         
15:         Replace in by
16:     end for
17:end for
18:for  to  do Relabeling superscripts of all subfiles of each worker node
19:     
20:     for all  do
21:         Replace in by
22:     end for
23:end for
Algorithm 5 updateCaches

IV-B Illustrative Examples

Example 1 (Single-Cycle File Transition Graph): Consider a shuffling system with a master node and worker nodes. The size of the cache at each worker node is files. There are files, denoted by . For notational simplicity, we rename the files as . Without loss of generality, we assume that worker nodes , , and are processing files , , , and , respectively, at iteration , that is , , , and . The file transition graph is , , and , as depicted in Fig. 4(a).

Fig. 4: Data Shuffling system with , and . (a) The file transition graph for a data shuffling system with , and . Worker nodes , , and are processing files , , , and , respectively. (b) Cache organization of worker nodes at iteration , along with the set of subfiles which are not available in the caches at iteration and need to be processed at iteration . (c) Cache organization of worker nodes at iteration after updating the caches. Subfiles , , and in , , and are moved to , , and , respectively. (d) Cache organization of worker nodes at iteration after updating the caches and relabeling the subfiles of Fig. 4. Subfiles , , and in , , and are moved to , , and and relabeled to , , and , respectively. (e) Received functions by worker nodes after removing the cached subfiles. The complete received functions at worker nodes are expressed in (20).

The proposed placement strategy partitions each file into subfiles of equal sizes. The subfiles are labeled with sets , where . For instance, the file being processed by worker node is partitioned into , , and . Accordingly, the cache of is divided into two parts: one dedicated to the under-processing file , and one dedicated to storing parts of the other files. Fig. 4 captures the cache organization of the worker nodes, along with the missing subfiles (i.e., the ones in , as defined in (11)) that need to be processed at iteration . The broadcast message transmitted from the master node to the worker nodes is formed by the concatenation of sub-messages