I Introduction
With the emergence of big data analytics, distributed computing systems have attracted enormous attention in recent years. The computational paradigm in the era of big data has shifted towards distributed systems as an alternative to expensive supercomputers. Distributed computing systems are networks that consist of a massive number of commodity computational nodes connected through fast communication links. Examples of distributed computing applications include distributed machine learning, massively multiplayer online games (MMOGs), wireless sensor networks, and real-time process control. Prevalent distributed computing frameworks, such as Apache Spark
[2], and computational primitives, such as MapReduce [3], Dryad [4], and CIEL [5], are key enablers for processing substantially large datasets (on the order of terabytes) and executing production-scale data-intensive tasks.
Data shuffling is one of the core components in distributed learning algorithms. Broadly speaking, the data shuffling stage is introduced to prepare data partitions with desirable properties for parallel processing in future stages. A prototypical iterative data processing procedure is outlined as follows: (i) randomly shuffle the training dataset, (ii) equally partition the dataset into non-overlapping batches, and assign each batch to a local worker (see Footnote 1), (iii) each local worker performs a local computational task to train a learning model, (iv) reshuffle the training dataset to provide each worker with a new batch of data points at each learning epoch and continue the model training. Data shuffling is known to enhance the learning model quality and lead to significant statistical gains in ubiquitous applications for machine learning and optimization. One prominent example is stochastic gradient descent (SGD) [6, 7, 8, 9, 10, 11, 12]. Recht and Ré [6] conjectured a non-commutative arithmetic-geometric mean inequality, and showed that the expected convergence rate of the random shuffling version of SGD is faster than that of the usual with-replacement version provided the inequality holds (see Footnote 2). It was empirically demonstrated that shuffling the data before running SGD results in superior convergence performance [7, 8, 9, 10, 11]. Recently, Meng et al. [12] have presented an extensive analysis of the desirable convergence properties of distributed SGD with random shuffling, in both convex and non-convex cases. In practice, however, the benefits of data shuffling come at a price. In every shuffling iteration, the entire dataset is communicated over the network of workers. Consequently, this leads to performance bottlenecks due to the communication overhead.
Footnote 1: One may consider storing the entire training dataset in a massive shared storage system and letting the workers directly access the new batches every learning epoch. Although this setting eliminates the communication overhead of the shuffling mechanism, it suffers from network and disk I/O bottlenecks, and hence this approach is notoriously sluggish and cost-inefficient as well [16].
Footnote 2: It is a long-standing problem in the theory of SGD to prove this statement, and the correctness of the full conjecture is still an open problem.
Caching of popular content during off-peak hours is a prominent technique that reduces network congestion and enhances throughput and latency in content delivery networks. This can be achieved through prefetching popular contents into end-user memories distributed across the network. The caching problem comprises two phases: (i) a placement phase and (ii) a delivery phase. The placement phase takes place when the network is not congested and the system is not aware of the future demands of the users, but the statistics of users’ demands are known. In this phase, the cache of each user prefetches data from the server subject to the size of the cache memories. On the other hand, the delivery phase takes place when the actual demands of the users are revealed, and hence, the network is congested. In this phase, the server transmits the requested files subject to the rate required to serve the users’ demands. In a seminal work by Maddah-Ali and Niesen [13], the first information-theoretic formulation was introduced for the basic caching problem where a central server, with a database of files, is connected to a set of users via a shared bottleneck link. 
The authors proposed a novel coded caching scheme that exploits not only the local caches at each individual user (i.e., the local cache size), but also the aggregate memory of all users (i.e., the global cache size), even if there is no cooperation among the users. Recently, the exact rate-memory tradeoff for the basic caching problem, where the prefetching is uncoded, has been characterized in [14] for both centralized and decentralized settings under uniform file popularity.
The idea of incorporating coding theory into the context of distributed machine learning was introduced in a recent work by Lee et al. [15]. The authors posed an intriguing question as to how to use coding techniques to ensure robust speedups in distributed computing. To address this question, the workflow of distributed computation is abstracted into three main phases: a storage phase, a communication phase, and a computation phase. Coding theory is utilized to alleviate the bottlenecks in the computation and communication phases of distributed learning algorithms. More specifically, the authors proposed novel algorithms for coded computation to speed up the performance of linear operations, and coded data shuffling to overcome the significant communication bottlenecks between the master node and worker nodes during data shuffling.
I-A Related Prior Works
The data shuffling problem has been extensively studied from various perspectives under different frameworks. In what follows, we survey the literature and present the progress and the current status of the problem.
I-A1 Data Shuffling in Master-Worker Distributed Computing Framework
In the master-worker distributed setup, the master node has access to the entire dataset, which is randomly permuted and partitioned into batches at every iteration of the distributed algorithm. The data shuffling phase aims at communicating these batches to the worker nodes so that they can perform their distributed tasks locally and in parallel. Then, the master node aggregates the local results of the worker nodes to complete the computation and give the final result. Inspired by the coded caching scheme introduced by Maddah-Ali and Niesen [13], Lee et al. [15] proposed the first coded shuffling algorithm, based on random storage placement, that leverages the excess storage of the local caches of the worker nodes to reduce the communication bottlenecks. The coded shuffling algorithm consists of three main strategies: a coded transmission strategy designed by the master node, and decoding and cache updating strategies executed by the worker nodes. Extensive numerical experiments demonstrate a significant improvement in the achievable rate and the average transmission time of the coded shuffling framework, compared to no shuffling and uncoded shuffling frameworks. The theoretical guarantees of [15] hold only when the number of data points approaches infinity, and the broadcast channel between the master node and worker nodes is perfect. In pursuit of a practical shuffling algorithm, Chung et al. [16] have recently proposed a novel coded shuffling algorithm, coined “UberShuffle”, to enhance the practical efficacy of the shuffling algorithm of [15]. However, it is not evident how far these coded shuffling algorithms are from the fundamental limits of communication rate. Attia and Tandon [17, 18, 19] investigated the data shuffling problem in a distributed computing system, consisting of a master node that communicates data points to worker nodes with limited storage capacity. 
An information-theoretic formulation of the data shuffling problem was proposed for data delivery and storage update phases. Furthermore, the worst-case communication rate is defined to be the maximum communication load from the master node to the worker nodes over all possible consecutive data shuffles for any achievable scheme characterized by the encoding, decoding, and cache update functions. Accordingly, the authors characterized the optimal tradeoff between the storage capacity per worker node and the worst-case communication rate for certain cases of the number of files , the number of worker nodes , and the available storage per worker node . More specifically, the rate was characterized when the number of worker nodes is limited to in [17]. Furthermore, the special case of no excess storage (arbitrary and , but ) was addressed in [18]. However, the proposed schemes in these works do not generalize to arbitrary parameters. Recently, the authors have proposed an “aligned coded shuffling scheme” [19] that is optimal for , and suboptimal for with a maximum multiplicative gap of from the lower bound on the rate for the worst-case communication scenario. On the other hand, following the same master-worker framework, Song et al. [20] considered the data shuffling problem from the perspective of index coding [21], where the new data assigned by the master node at every iteration constitute the messages requested by the worker nodes, and the data cached at the worker nodes form the side information. Motivated by the NP-hardness of the index coding problem [21], the authors proposed a pliable version of the index coding problem to enhance the communication efficiency for distributed data shuffling. It is assumed that the worker nodes are pliable in such a way that they are only required to obtain new messages that are randomly selected from the original set of messages at every iteration. This degree of freedom enables the realization of semi-random data shuffling that yields more efficient coding and transmission schemes, as opposed to fully random data shuffling.
I-A2 Data Shuffling in MapReduce Distributed Computing Framework
MapReduce [3] is a programming paradigm that allows for parallel processing of massive datasets across large clusters of computational nodes. More concretely, the overall computation is decomposed into computing a set of “Map” and “Reduce” functions in a distributed and parallel fashion. Typically, a MapReduce job splits the input dataset into blocks, each of which is locally processed by a computing node that maps the input block into a set of intermediate key/value pairs. Next, the intermediate pairs are transferred to a set of processors that reduce the set of intermediate values by merging those with the same intermediate key. The process of inter-server communication between the mappers and reducers is referred to as data shuffling. Li et al. [22] introduced a variant implementation of MapReduce, named “Coded MapReduce”, that exploits coding to considerably reduce the communication load of the data shuffling phase. The key idea is to create coded multicast opportunities in the shuffling phase through an assignment strategy of repetitive mappings of the same input data block across different servers. The fundamental tradeoff between computation load and communication cost in Coded MapReduce is characterized in [23]. A unified coding framework for distributed computing in the presence of straggling servers was proposed in [24], where the tradeoff between the computation latency and communication load is formalized for linear computation tasks.
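To make the three MapReduce stages concrete, the following is a minimal, uncoded word-count sketch in Python. It is an illustration only and is not Coded MapReduce: the function names, the word-count task, and the byte-sum rule used to route keys to reducers are all assumptions introduced here.

```python
from collections import defaultdict


def map_block(block):
    """Map stage: turn one input block into intermediate (key, value) pairs."""
    return [(word, 1) for word in block.split()]


def shuffle(intermediate_per_mapper, num_reducers):
    """Shuffle stage: route every intermediate pair to the reducer owning its key.

    The byte-sum routing rule is a hypothetical stand-in for a partitioner;
    it is deterministic across runs, unlike Python's built-in str hash.
    """
    inbox = [defaultdict(list) for _ in range(num_reducers)]
    for pairs in intermediate_per_mapper:
        for key, value in pairs:
            reducer = sum(key.encode()) % num_reducers
            inbox[reducer][key].append(value)
    return inbox


def reduce_all(inbox):
    """Reduce stage: merge the values that share the same intermediate key."""
    result = {}
    for per_reducer in inbox:
        for key, values in per_reducer.items():
            result[key] = sum(values)
    return result


def word_count(blocks, num_reducers=2):
    """Run map, shuffle, and reduce over a list of text blocks."""
    intermediate = [map_block(block) for block in blocks]
    return reduce_all(shuffle(intermediate, num_reducers))
```

In this sketch every intermediate pair crosses the shuffle stage exactly once and uncoded; the coded variants cited above reduce that traffic by mapping the same block at several servers.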
We would like to highlight the subtle distinction between the coded caching problem and the coded shuffling problem. Both problems share the property that the prefetching scheme is designed to minimize the communication load for any possible unknown demand (or permutation) of the data. However, the coded shuffling algorithm is run over a number of iterations to store the data batches and compute some task across all worker nodes. In addition, the permutations of the data in subsequent iterations are not revealed in advance. Therefore, the caches of the worker nodes should be adequately updated after every iteration to maintain the structure of the data placement, guarantee the coded transmission opportunity, and achieve the minimum communication load for any undisclosed permutation of the data. Another subtle distinction that we would like to emphasize is the difference between the concept of data shuffling in the master-worker setup and that in the MapReduce setup. In the master-worker setup, a master node randomly shuffles data points among the computational worker nodes for a number of iterations to enhance the statistical efficiency of distributed computing systems. A coded data shuffling algorithm enables coded transmission of batches of the dataset through exploiting the excess storage at the worker nodes. On the other hand, in the MapReduce setup, the whole dataset is divided among the computational nodes, and a data placement strategy is designed in order to create coding opportunities that can be utilized by the shuffling scheme to transfer locally computed results from the mappers to the reducers. In other words, Coded MapReduce enables coded transmission of blocks of the data processed by the mappers in the shuffling phase through introducing redundancy in the computation of the Map stage.
I-B Our Contribution
In this paper, we consider a data shuffling problem in a master-worker distributed computing system, in which we have a master node and worker nodes. The master node has access to the entire dataset of files. Each worker node has a limited cache memory that can store up to files. In each iteration of the distributed algorithm, the master node randomly shuffles the data points among the worker nodes. We summarize the main results of the paper as follows.

We first study the data shuffling problem when . We propose a novel coded shuffling algorithm, which comprises the following phases: (i) file partitioning and labeling, (ii) subfile placement, (iii) encoding, (iv) decoding, (v) cache updating and subfile relabeling. We show how cache memories are leveraged in order to create coded functions that can be decoded by several worker nodes that process different files at every iteration of the distributed algorithm. The proposed scheme is generalized for arbitrary and .

Next, we derive a matching information-theoretic lower bound on the communication load for the data shuffling problem when , and we prove that among all possible placement and delivery strategies, our proposed coded shuffling scheme is universally optimal over all shuffling scenarios, and achieves the minimum communication load. Therefore, the optimal rate-memory tradeoff when is characterized for any shuffling.

Finally, we extend the results obtained for the canonical setting of to investigate the general setting of the data shuffling problem when for the worst-case shuffling. Inspired by the concept of perfect matching in bipartite graphs, we develop a coded shuffling scheme by decomposing the file transition graph into subgraphs, each of which reduces to a canonical data shuffling problem with files, worker nodes, and storage capacity per worker node . Hence, we can apply our coded shuffling scheme for to each sub-problem and obtain a delivery scheme for the original shuffling problem, which is generalized for any , and . Furthermore, we derive a matching information-theoretic converse on the communication load for the data shuffling problem when , and demonstrate that there exist shuffling scenarios whose minimum delivery rates are equal to the delivery rate achieved by the proposed coded shuffling scheme. As a result, the optimal rate-memory tradeoff is exactly characterized when for the worst-case shuffling.
I-C Paper Organization
The remainder of the paper is organized as follows. We first present the formal definition of the data shuffling problem as well as the main results of this work in Section II. The cache placement scheme is proposed in Section III. For the canonical setting of the shuffling problem, i.e., when , two achievable coded shuffling schemes, along with illustrative examples, are delineated in Section IV. Then, the optimality proof for our proposed delivery scheme is presented in Section V. Next, the general and practical setting of the shuffling problem, i.e., when , is studied in Section VI, where an achievable delivery scheme, an illustrative example, and the optimality of the proposed delivery scheme for the worst-case shuffling are presented. Finally, the paper is concluded and directions for future research are discussed in Section VII.
II Problem Formulation and Main Results
II-A Formulation of Data Shuffling Problem
For an integer , let denote the set of integers . Fig. 1 depicts a distributed computing system with a master node, denoted by , and a set of worker nodes, denoted by . The master node is assumed to have access to a dataset, including files, denoted by , where the size of each file is normalized to one unit. In practice, the number of files is remarkably larger than the number of worker nodes, and hence we study the data shuffling problem under the practical assumption of . At each iteration, each worker node should perform a local computational task on a subset of files (see Footnote 3). The assignment of files to worker nodes is done by the master node, either randomly or according to some predefined mechanism. Each worker node has a cache that can store up to files, including the under-processing files. This imposes the constraint on the size of the cache at each worker node. Once the computation at the worker nodes is done, the result is sent back to the master node. A new batch of files will be assigned to each worker node for iteration , and the cache contents of the worker nodes should be modified accordingly. The communication of files from the master node to the worker nodes occurs over a shared link, i.e., any information sent by the master node will be received by all of the worker nodes.
Footnote 3: Unless otherwise stated, we assume that and are integers.
For a given iteration , we denote by the set of indices of the files to be processed by , and by the portion of the cache of dedicated to the under-processing files: . The subsets provide a partitioning for the set of file indices, i.e., for , and . Similarly, denotes the subset of indices of files to be processed by at iteration , where also forms a partitioning for . When , each worker node has an excess storage to cache (parts of) the other files in , in addition to the files in . We denote by the contents of the remaining space of the cache of , which is called the excess storage. Therefore, . Let , and denote the contents of , and at iteration . For the sake of brevity, we may drop the iteration index whenever it is clear from the context.
Filling the excess part of the cache of worker nodes is performed independently of the newly assigned subsets . Between iterations and , the master node should compute and broadcast a message (a function of all files in ), such that each worker node can retrieve all files in from its cached data and the broadcast message . The communication load is defined as the size of the broadcast message for the parameters introduced above. We interchangeably refer to as the delivery rate and the communication load of the underlying data shuffling system. The goal is to develop a cache placement strategy and design a broadcast message to minimize for any . For , we have since each worker node can store all the files in its cache and no communication is needed between the master node and worker nodes for any shuffling. Thus, we can focus on the regime of . We define to be the cache size normalized by the size of data to be processed by each worker node. Accordingly, we have .
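As a rough illustration of the setup above, the following Python sketch draws a random equal-size assignment of files to worker nodes and computes the load of a naive baseline. The baseline is an assumption made here for illustration: each worker caches only its current batch (no excess storage) and every missing file is sent uncoded. The names `random_assignment` and `uncoded_load`, and the zero-based file indices, are hypothetical.

```python
import random


def random_assignment(num_files, num_workers, rng):
    """Randomly partition file indices 0..num_files-1 into equal-size batches.

    Returns a list with one set of file indices per worker node.
    """
    if num_files % num_workers != 0:
        raise ValueError("num_files must be a multiple of num_workers")
    files = list(range(num_files))
    rng.shuffle(files)
    batch = num_files // num_workers
    return [set(files[w * batch:(w + 1) * batch]) for w in range(num_workers)]


def uncoded_load(current, upcoming):
    """Number of unit-size files the master sends in the naive baseline.

    Assumes each worker holds exactly its current batch and nothing else,
    and that every newly assigned file it lacks is transmitted uncoded.
    """
    return sum(len(new - old) for old, new in zip(current, upcoming))
```

For example, `uncoded_load(batches, batches)` is 0 because no file changes hands, while two independent draws of `random_assignment` typically give a load close to the total number of files. The coded schemes studied in this paper aim to reduce this load by exploiting the excess storage.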
File Transition Graph
A file transition graph is defined as a directed graph , where , with , denotes the set of vertices each corresponding to a worker node, and with is the set of directed edges, each associated to one file (see the accompanying figure). An edge with and indicates that , i.e., file is being processed by worker node at iteration , and assigned to worker node to be processed at iteration . Note that in general is a multigraph, since there might be multiple files in , and we include one edge from to for each of such files.
Without loss of generality, let us assume a fixed assignment function at iteration , for example, for , otherwise we can relabel the files. Hence, the problem and its file transition graph are fully determined by the assignment function at iteration . Let be the set of assignment functions whose corresponding file transition graphs are isomorphic to . Fig. 2 captures two instances of a shuffling problem with isomorphic file transition graphs. For a given graph , we define the average delivery rate over all assignment functions in as
Our ultimate goal in this paper is to characterize for given parameter and for all feasible file transition graphs .
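The file transition graph can be sketched in a few lines of Python. The sketch below is illustrative: the function names and the zero-based worker and file indices are assumptions. The first function emits one directed edge per file, so parallel edges of the multigraph appear as repeated worker pairs. The second function covers only the one-file-per-worker case, where the graph is a permutation of the workers and therefore splits into disjoint directed cycles (a worker that keeps its file forms a cycle of length one).

```python
def file_transition_edges(current, upcoming):
    """Return one directed edge (i, j, f) per file f.

    `current[i]` and `upcoming[j]` are sets of file indices; edge (i, j, f)
    means file f is processed by worker i now and by worker j next.
    """
    owner_next = {f: j for j, batch in enumerate(upcoming) for f in batch}
    return sorted(
        (i, owner_next[f], f) for i, batch in enumerate(current) for f in batch
    )


def cycle_lengths(next_worker):
    """Cycle lengths of a one-file-per-worker file transition graph.

    `next_worker[i]` is the worker that receives the file currently at
    worker i; it must be a permutation of range(len(next_worker)).
    """
    seen, lengths = set(), []
    for start in range(len(next_worker)):
        if start in seen:
            continue
        length, node = 0, start
        while node not in seen:
            seen.add(node)
            node = next_worker[node]
            length += 1
        lengths.append(length)
    return sorted(lengths, reverse=True)
```

For instance, `cycle_lengths([1, 2, 3, 0])` describes a single cycle through four workers, while `cycle_lengths([1, 0, 2, 3])` describes one swap and two workers that keep their files.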
II-B Main Results
First, we present our main results to characterize the exact rate-memory tradeoff for the canonical setting of the data shuffling problem, when , for any shuffling. Since , we have , so each worker node processes one file at each iteration. Without loss of generality, we assume that processes file at every iteration, i.e., , for , otherwise we can relabel the files. The following theorems summarize our main results.
Theorem 1.
For a data shuffling problem with a master node, worker nodes, each with a cache of size files with , the communication load required to shuffle files among the worker nodes for any file transition graph is upper bounded by
(1) 
For non-integer values of , where , the lower convex envelope of the corner points, characterized by (1), is achievable by memory-sharing.
Footnote 4: Note that when .
An achievability argument consists of a cache placement strategy and a delivery scheme. We propose a cache placement in Section III which will be used for all achievable schemes discussed in this paper. The delivery scheme, along with the memory-sharing argument for non-integer values of , is presented in Section IV-A. Illustrative examples are then given in Section IV-B.
The next theorem provides an achievable delivery rate (depending on the file transition graph) by an opportunistic coding scheme. We will show later that the underlying file transition graph of any data shuffling problem, , comprises a number of directed cycles. We denote the number of cycles in the file transition graph by , with , and denote the cycle lengths by , where .
Theorem 2.
For a data shuffling system with a master node and worker nodes, each with a cache of size files, for , the shuffling of files among the worker nodes for a given file transition graph that comprises cycles can be performed by broadcasting a message of size , where
(2) 
For non-integer values of , where , the lower convex envelope of the corner points, characterized by (2), is achievable by memory-sharing.
The proposed delivery scheme and achievability proof for Theorem 2 are presented in Section IV-C. The memory-sharing argument for non-integer values of follows a similar reasoning as the one in Theorem 1. We provide an illustrative example in Section IV-D.
Theorem 3.
For the data shuffling system introduced in Theorem 2, the communication load required to shuffle files among the worker nodes for a given assignment with a file transition graph that comprises cycles is lower bounded by
(3) 
The proof of optimality (converse) is presented in Section V, where we also provide an illustrative example to describe the proof technique.
Corollary 1.
Theorems 2 and 3 prove the optimality of the proposed coded shuffling scheme for an arbitrary number of worker nodes , storage capacity per worker node , and file transition graph with cycles, when . Therefore, the optimal delivery rate is characterized as
(4) 
For non-integer values of , where , the optimal delivery rate is equal to the lower convex envelope of the corner points given in (4). Furthermore, when , the achievable delivery rate of Theorem 2 is equal to that of Theorem 1, and takes its maximum. This characterizes the optimal worst-case delivery rate, which is given by
This indicates that the upper bound of Theorem 1 is the best universal (assignment independent) bound that holds for all instances of the data shuffling problem.
Fig. 3 captures the optimum tradeoff curve between as a function of for and a file transition graph with cycles.
Next, based on the results obtained for the canonical setting of the data shuffling problem when , we present our main results in Theorem 4 to characterize an upper bound on the rate-memory tradeoff for the general setting of the data shuffling problem when . This upper bound turns out to be optimum for the worst-case shuffling, as stated in Theorem 5.
Theorem 4.
For a data shuffling system that processes files, and consists of a master node and worker nodes, each with a normalized storage capacity of files, the achievable delivery rate required to shuffle files among the worker nodes for any file transition graph is upper bounded by
(5) 
For non-integer values of , where , the lower convex envelope of the corner points, characterized by (5), is achievable by memory-sharing.
The delivery scheme and achievability proof are presented in Section VI-A. The memory-sharing argument for non-integer values of follows a similar reasoning as the one in Theorem 1. We also present an illustrative example in Section VI-C.
Theorem 5.
For the data shuffling system introduced in Theorem 4, the communication load required to shuffle files among the worker nodes according to the worst-case shuffling is given by
(6) 
III Cache Placement
In this section we introduce our proposed cache placement, in which the contents of each worker node’s cache at iteration are known. Note that the cache placement does not depend on the files to be processed by the worker nodes at iteration , i.e., it does not depend on .
III-A File Partitioning and Labeling
Throughout this work, we assume that and are integers, unless specified otherwise. Let . Let be a file being processed by worker node at iteration , i.e., . We partition into equal-size subfiles, and label the subfiles with a subscript as
(7) 
Since the size of each file is normalized to , the size of each subfile will be . For the sake of completeness, we also define dummy subfiles (with size ) for every with or .
III-B Subfile Placement
The cache of consists of two parts: (i) the under-processing part , in which all subfiles of files to be processed at iteration are stored; (ii) the excess storage part , which is equally distributed among all other files. We denote by the portion of dedicated to the file , in which all subfiles with are cached. Hence, we have
(8) 
where
(9)  
(10) 
For any worker node , there are complete files in . Moreover, for each of the remaining files, there are subfiles, out of a total subfiles, that are cached in the excess storage part. Thus, we have
which satisfies the memory constraints.
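The memory accounting above can be checked numerically with a small Python sketch for the canonical one-file-per-worker case. The labeling rule used here is an assumption, since the defining equations did not survive in this copy of the text: the subfiles of the file at worker i are taken to be indexed by the subsets of the other workers of size one less than the cache size, and a worker stores a subfile of another file in its excess storage exactly when its own index belongs to the subfile's label. The function names and zero-based indices are hypothetical.

```python
from fractions import Fraction
from itertools import combinations
from math import comb


def subfile_labels(file_idx, num_workers, cache_size):
    """Assumed labels for the subfiles of the file processed by worker file_idx.

    Each label is a set of (cache_size - 1) workers other than file_idx.
    """
    others = [w for w in range(num_workers) if w != file_idx]
    return [frozenset(c) for c in combinations(others, cache_size - 1)]


def cache_contents(worker, num_workers, cache_size):
    """Return the (file, label) pairs stored at `worker` under the assumed rule."""
    cache = set()
    for f in range(num_workers):
        for label in subfile_labels(f, num_workers, cache_size):
            # Own file: keep every subfile. Other files: keep the subfiles
            # whose label contains this worker (the excess storage part).
            if f == worker or worker in label:
                cache.add((f, label))
    return cache


def cache_size_in_files(worker, num_workers, cache_size):
    """Total cached data at `worker`, measured in unit-size files."""
    per_subfile = Fraction(1, comb(num_workers - 1, cache_size - 1))
    return len(cache_contents(worker, num_workers, cache_size)) * per_subfile
```

Under this assumed labeling, `cache_size_in_files(w, num_workers, cache_size)` equals `cache_size` for every worker, which is consistent with the statement that the placement satisfies the memory constraint.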
Recall that the worker node should be able to recover files from its cache and the broadcast message . Communicating files in from the master node to a worker node can be limited to sending only the desired subfiles that do not exist in the cache of . For a worker node , let denote the set of subfiles to be processed by at iteration , which are not available in its cache at iteration , that is,
(11) 
It is evident that each worker node needs to decode at most subfiles for each of the files in , in order to process them at iteration .
IV Coded Shuffling for the Canonical Setting
We describe two delivery strategies in this section. The first delivery scheme is universal, in the sense that it does not exploit the properties of the underlying file transition graph. By analyzing this scheme in Section IV-A, we show that the delivery rate in Theorem 1 is achievable. Two illustrative examples are presented in Section IV-B to better describe the coding and decoding strategies. We then demonstrate that the size of the broadcast message can be reduced by exploiting the cycles in the file transition graph. A graph-based delivery strategy is proposed in Section IV-C. This new scheme can achieve the reduced delivery rate proposed in Theorem 2. Finally, we conclude this section by presenting an illustrative example for the graph-based delivery scheme in Section IV-D.
IV-A A Universal Delivery Scheme for Any Shuffling: Proof of Theorem 1
Recall that for we have . In order to prove Theorem 1, we propose a coded shuffling scheme to show that a delivery rate of is achievable for the canonical setting () for any integer . We assume, without loss of generality, that processes file at iteration , i.e., for , otherwise we can relabel the files.
Encoding
Given all cache contents , characterized by (8), and , the broadcast message sent from the master node to the worker nodes is obtained by the concatenation of a number of submessages , each specified for a group of worker nodes , that is,
(12) 
where
(13) 
The encoding design hinges on worker nodes. Without loss of generality, we consider for whom the broadcast submessages are designed, and designate as the ignored worker node. We will later show how is served for free using the submessages designed for other worker nodes.
According to the proposed encoding scheme, there is a total of encoded submessages, each corresponding to one subset , and the size of each submessage is . Hence, the overall broadcast communication load is upper bounded by
as claimed in Theorem 1.
Decoding
The following lemmas demonstrate how each worker node decodes the missing subfiles that constitute the file to be processed at iteration , from the broadcast submessages and its cache contents.
Lemma 1.
For a worker node , where , a missing subfile can be decoded

from and the broadcast submessage , if ; and

from , the broadcast submessage , and other subfiles previously decoded by , if .
Remark 1.
Here, we provide an intuitive justification for Lemma 1. Consider a worker node and a set of worker nodes of size that includes . One can show that every subfile appearing in belongs to either or . Therefore, worker node can recover a linear equation in the subfiles in by removing the subfiles in its cache from . It turns out that all such equations are linearly independent. The number of such equations is (because and ). On the other hand, the number of subfiles in is (at most) , since out of a total of subfiles of , of them are cached in , characterized by (10). Therefore, the obtained set of linearly independent equations suffices to recover all the subfiles in .
Lemma 2.
For the worker node , any missing subfile can be decoded from the cache contents and the summation of the broadcast submessages .
Cache Updating and Subfile Relabeling
After worker nodes decode the missing subfiles, characterized by (11), the caches of worker nodes need to be updated and the subfiles need to be relabeled before processing the files at iteration . The goal of cache update is to maintain a similar cache configuration for the worker nodes for shuffling iteration . The cache update phase is described as follows:

For , all the subfiles of are placed in at iteration , i.e.,
(14) 
For , the excess storage is updated by removing all the subfiles of , and replacing them by the subfiles of that were cached at , i.e.,
(15) where
(16) (17)
Consequently, we have . Note that the cache update procedure is feasible, since the subfiles needed for either exist in or appear in the set of missing subfiles to be decoded after the broadcast message delivery. In particular, all the subfiles of already exist in , and hence those in will be simply moved from the under-processing part to the excess storage part of the cache.
Finally, the subfiles are relabeled as follows:

For every subfile , where , , and , relabel the subfile’s subscript to , where .

For every subfile , where , and , relabel the subfile’s superscript to .
It is easy to see that after the proposed cache update and subfile relabeling, the cache configuration of each worker node at iteration maintains a similar arrangement to that introduced initially at iteration and characterized by (8). Therefore, the proposed scheme can be applied for the following shuffling iterations. This completes the proof of Theorem 1.
In what follows, the proposed encoding for delivery, decoding at the worker nodes, and cache update and subfile relabeling are formally described in Algorithm 3, Algorithm 4, and Algorithm 5, respectively.
Remark 2.
Let be a non-integer cache size with . We can always write , for some . The data shuffling problem for non-integer cache size can be addressed by a memory-sharing argument, similar to [17]. More precisely, we can show that the pairs and are achievable, and conclude that, for , a communication load of can be achieved.
Recall that the size of each file is normalized to one unit. For the memory-sharing argument, each file will be partitioned into two parts of sizes and . The cache of each worker node is also divided into two parts of sizes and . Then, the files of size will be cached and shuffled within the parts of the caches of size . Similarly, the files of size , together with the parts of the caches of size , form another isolated instance of the problem. Summing the delivery rates of the two instances, we get
(18)  
(19) 
This shows that the convex hull of the pairs is achievable.
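The memory-sharing argument amounts to a linear interpolation between the two adjacent integer cache sizes. The Python sketch below illustrates this under stated assumptions: `corner_rate` is a caller-supplied function returning an achievable rate at an integer cache size, and the weight `alpha` is the fraction of each file handled by the instance with the smaller cache size. Both names are introduced here, and no particular rate expression from the theorems is assumed.

```python
from fractions import Fraction
from math import ceil, floor


def memory_sharing_rate(cache_size, corner_rate):
    """Rate achieved by time-sharing between the neighbouring integer cache sizes.

    Writes cache_size = alpha * lo + (1 - alpha) * hi with lo = floor(cache_size)
    and hi = ceil(cache_size), then returns
    alpha * corner_rate(lo) + (1 - alpha) * corner_rate(hi).
    """
    lo, hi = floor(cache_size), ceil(cache_size)
    if lo == hi:
        # Integer cache size: no sharing needed.
        return corner_rate(lo)
    alpha = Fraction(hi) - Fraction(cache_size)  # weight on the smaller cache
    return alpha * corner_rate(lo) + (1 - alpha) * corner_rate(hi)
```

As a usage example with a purely hypothetical corner function `lambda s: Fraction(1, s)`, a cache size of 5/2 gives equal weights on sizes 2 and 3, so the interpolated rate is the average of the two corner values.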
IV-B Illustrative Examples
Example 1 (Single-Cycle File Transition Graph): Consider a shuffling system with a master node and worker nodes. The size of the cache at each worker node is files. There are files, denoted by . For notational simplicity, we rename the files as . Without loss of generality, we assume that worker nodes , , and are processing files , , , and , respectively, at iteration , that is , , , and . The file transition graph is , , and , as depicted in the accompanying figure.
The proposed placement strategy partitions each file into subfiles of equal sizes. The subfiles are labeled with sets , where . For instance, file being processed by worker node is partitioned into , , and . Accordingly, the cache of is divided into two parts; that is dedicated to the under-processing file , and that is dedicated to store parts of other files. Fig. 4 captures the cache organization of worker nodes, along with the missing subfiles (i.e., the ones in , as defined in (11)) that need to be processed at iteration . The broadcast message transmitted from the master node to the worker nodes is formed by the concatenation of submessages , where