 # How to Distribute Computation in Networks

We study the function computation problem in a communications network. The rate region for the function computation problem in general topologies is an open problem, and has been considered only under certain restrictive assumptions (e.g. tree networks, linear functions, etc.). We are motivated by the fact that in-network computation can serve as a means to reduce the required communication flow, in terms of the number of bits transmitted per source symbol, and to provide a sparse representation or labeling. To understand the limits of computation, we introduce the notion of entropic surjectivity as a measure of how surjective a function is. Exploiting Little's law for stationary systems, we then provide a connection between this notion and the proportion of flow (which we call the computation processing factor) that requires communications. This connection gives us an understanding of how much a node (in isolation) should compute (or compress) in order to communicate the desired function within the network. Our analysis does not put any assumptions on the network topology, characterizes the functions only via their entropic surjectivity, and provides insight into how to distribute computation depending on the entropic surjectivity of the computation task.


## I Introduction

Challenges in cloud computing include effectively distributing computation to handle the large volume of data with growing computational demand, and the limited resources in the air interface. Furthermore, various tasks, such as computation, storage, and communications, are inseparable. In-network computation is required for reasons of dimensioning, scaling, and security, where data is geographically dispersed. We need to exploit the sparsity of data within and across sources, as well as the additional sparsity inherent to labeling, to provide approximately minimal representations for labeling.

An equivalent notion to that sparsity is that of data redundancy. Data is redundant in the sense that there exists a, possibly latent and ill-understood, sparse representation of it that is parsimonious and minimal, and that allows for data reconstruction, possibly in an approximate manner. Redundancy can occur in a single source of data or across multiple sources.

Providing such sparse representation for the reconstruction of data is the topic of compression, or source coding. The Shannon entropy rate of data provides, for a single source, a measure of the minimal representation, in terms of bits per second, required to represent data. This representation is truly minimal, in the sense that it is achievable with arbitrarily small error or distortion, but arbitrarily good fidelity of reconstruction is provably impossible at lower rates.

### I-A Motivation

As computation becomes increasingly reliant on numerous, possibly geo-dispersed, sources of data, making use of redundancy across multiple sources without the need for onerous coordination across sources becomes increasingly important. The fact that a minimal representation of data can occur across sources without the need for coordination is the topic of distributed compression. The core result is that of Slepian and Wolf [SlepWolf1973], who showed that distributed compression without coordination across sources can be as efficient as joint compression, in terms of asymptotic minimality of representation.

Techniques for achieving compression have traditionally relied on coding techniques. Coding, however, suffers from a considerable cost, as it imposes, beyond sampling and quantization, computation and processing at the source before transmission, then computation and processing at the destination after reception of the transmission. A secondary consideration is that coding techniques, to be efficiently reconstructed at the destination, generally require detailed information about the probabilistic structure of the data being represented. For distributed compression, the difficulty of reconstruction rendered the results in [SlepWolf1973] impractical until the 2000s, when channel coding techniques were adapted.

In the case of learning on data, however, it is not the data itself but rather a labeling of it that we seek. That labeling can be viewed as being a function of the original data. The reconstruction of data is in effect a degenerate case where the function is identity. Labeling is generally a highly surjective function and thus induces sparsity, or redundancy, in its output values beyond the sparsity that may be present in the data.

The use of the redundancy in both functions and data to provide sparse representations of function outputs is the topic of the rather nascent field of functional compression. A centralized communication scheme requires all data to be transmitted to some central unit in order to perform certain computations. However, in many cases such computations can be performed in a distributed manner at different nodes in the network, avoiding transmission of unnecessary information. Hence, intermediate computations can significantly reduce the resource usage, and this can help improve the trade-off between communications and computation.

Fig. 1: (Left) Example rate region for the zero-distortion distributed functional compression problem [DosShaMedEff2010]. $S$ denotes the shaded region between the joint entropy curve $H(X_1,X_2)$ (inner bound $I$) and the joint graph entropy curve $H_{G_{X_1},G_{X_2}}(X_1,X_2)$ (outer bound $O$). Note that any point above $I$ is in the Slepian-Wolf achievable rate region, and $O$ is characterized by how surjective the function is. (Right) Example scenarios with achievable rates: rate region for (i) source compression, (ii) functional compression, (iii) distributed source compression with two transmitters and a receiver, and (iv) distributed functional compression with two transmitters and a receiver. Note that in (iv) the main benefit of the joint graph entropy $H_{G_{X_1},G_{X_2}}(X_1,X_2)$ is that it is less than the sum of the marginal graph entropy of source $X_1$, i.e. $H_{G_{X_1}}(X_1)$, and the conditional graph entropy of source $X_2$ given $X_1$, i.e. $H_{G_{X_2}}(X_2|X_1)$. Joint graph entropy provides a better rate region than the joint entropy since it does not satisfy the chain rule. Hence, we expect to have $\Delta_{\max}>\Delta_{\min}$ in the left figure.

### I-B Technical Background

In this section, we introduce some concepts from information theory which characterize the minimum communication (in terms of rate) necessary to reliably evaluate a function. In particular, this problem, referred to as distributed functional compression, has been studied in various forms since the pioneering work of Slepian and Wolf [SlepWolf1973].

An object of interest in the study of these fundamental limits is the characteristic graph, and in particular its coloring. In the characteristic graph, each vertex represents a possible different sample value, and two vertices are connected if they should be distinguished. More precisely, for a collection of random variables $X_1, \dots, X_N$ assumed to take values in the same alphabet $\mathcal{X}$, and a function $f$, we draw an edge between vertices $x_1$ and $x_1'$ if $f(x_1, x_2, \dots, x_N) \neq f(x_1', x_2, \dots, x_N)$ for some $(x_2, \dots, x_N)$ whose joint instance with each of $x_1$ and $x_1'$ has non-zero measure. We illustrate the characteristic graph and its relevance in compression through the following example.
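As a concrete illustration of the definition, the following sketch builds the edge set of the characteristic graph of $X_1$ for a toy two-source function; the function, alphabet, and pmf here are illustrative choices, not taken from the paper.

```python
from itertools import combinations

def characteristic_graph(alphabet, f, p):
    """Edge set of the characteristic graph of X1 with respect to X2 and f.

    Two values x1, x1p are connected if some x2 with p(x1, x2) > 0 and
    p(x1p, x2) > 0 yields f(x1, x2) != f(x1p, x2), i.e. the receiver must
    be able to distinguish them.
    """
    edges = set()
    for x1, x1p in combinations(alphabet, 2):
        for x2 in alphabet:
            if p(x1, x2) > 0 and p(x1p, x2) > 0 and f(x1, x2) != f(x1p, x2):
                edges.add((x1, x1p))
                break
    return edges

# Toy example: f is the parity of the sum, sources uniform on {0, 1, 2, 3}.
edges = characteristic_graph(
    alphabet=range(4),
    f=lambda x1, x2: (x1 + x2) % 2,
    p=lambda x1, x2: 1 / 16,
)
# Values of X1 conflict exactly when their parities differ.
```

Here the graph is bipartite between even and odd values of $X_1$, so a two-coloring (even/odd) already suffices to recover the function, illustrating the compression gain over sending $X_1$ itself.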

#### Slepian-Wolf Coding (or Compression)

We start by reviewing the natural scenario where the function is the identity, i.e., the case of distributed lossless compression. For sake of presentation, we focus on the case of two random variables $X_1$ and $X_2$, which are jointly distributed according to $p(x_1, x_2)$. Source random variable $X_1$ can be asymptotically compressed up to the rate $H(X_1|X_2)$ when $X_2$ is available at the receiver [SlepWolf1973]. Given two statistically dependent i.i.d. finite-alphabet sequences, the Slepian-Wolf theorem gives a theoretical bound for the lossless coding rate for distributed coding of the two sources as shown below [SlepWolf1973]:

$$R_{X_1} \geq H(X_1|X_2), \quad R_{X_2} \geq H(X_2|X_1), \quad R_{X_1} + R_{X_2} \geq H(X_1, X_2). \tag{1}$$

We denote the rate region in (1) by $\mathcal{R}_{SW}$. The Slepian-Wolf theorem states that in order to recover a joint source $(X_1, X_2)$ at a receiver, it is both necessary and sufficient to encode the sources separately at rates $(R_{X_1}, R_{X_2})$ satisfying (1) [DosShaMedEff2010]. Note that the encoding is done in a truly distributed way, i.e. no communication or coordination is necessary between the encoders. Distributed coding can achieve arbitrarily small error probability for long sequences.
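The region in (1) is straightforward to evaluate numerically from a joint pmf. The following sketch (function names are illustrative) computes the three entropy bounds and checks whether a rate pair is achievable.

```python
import math

def entropies(p):
    """From a joint pmf p[(x1, x2)], return (H(X1|X2), H(X2|X1), H(X1,X2)) in bits."""
    H12 = -sum(q * math.log2(q) for q in p.values() if q > 0)
    p1, p2 = {}, {}
    for (x1, x2), q in p.items():
        p1[x1] = p1.get(x1, 0.0) + q
        p2[x2] = p2.get(x2, 0.0) + q
    H1 = -sum(q * math.log2(q) for q in p1.values() if q > 0)
    H2 = -sum(q * math.log2(q) for q in p2.values() if q > 0)
    return H12 - H2, H12 - H1, H12  # chain rule: H(X1|X2) = H(X1,X2) - H(X2)

def in_sw_region(R1, R2, p):
    """Check a rate pair (R1, R2) against the Slepian-Wolf bounds in (1)."""
    H1g2, H2g1, H12 = entropies(p)
    return R1 >= H1g2 and R2 >= H2g1 and R1 + R2 >= H12

# Two independent fair bits: H(X1|X2) = H(X2|X1) = 1, H(X1,X2) = 2.
p = {(0, 0): 0.25, (0, 1): 0.25, (1, 0): 0.25, (1, 1): 0.25}
```

For correlated sources the conditional entropies shrink, and the corner points of the region move strictly inside the rectangle $R_1 \geq H(X_1)$, $R_2 \geq H(X_2)$.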

One of the challenges in function computation is the effect of the function on the data itself. Whether or not there are correlations among the source random variables (or data), the mapping from the sources to the destinations makes the codebook design very challenging. Since the rate region of the distributed function computation problem depends on the function, designing achievable schemes attaining the optimal rate region for function computation (or compression), for general functions, with or without correlations, remains an open problem. We aim to develop a tractable approach for computing general functions using the tools discussed next.

#### Graph Entropy for Characterizing the Rate Bounds

Given a graph $G_{X_1}$ and a distribution on its vertices $X_1$, the graph entropy is expressed as

$$H_{G_{X_1}}(X_1) = \min_{X_1 \in W_1 \in \Gamma(G_{X_1})} I(X_1; W_1), \tag{2}$$

where $\Gamma(G_{X_1})$ is the set of all maximal independent sets of $G_{X_1}$. The notation $X_1 \in W_1 \in \Gamma(G_{X_1})$ means that we are minimizing over all distributions $p(w_1, x_1)$ such that $p(w_1, x_1) > 0$ implies $x_1 \in w_1$, where $w_1$ is a maximal independent set of the graph $G_{X_1}$.
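The combinatorial core of (2) is the family $\Gamma(G)$ itself; the minimization then ranges over distributions supported on these sets. For the toy graphs used to illustrate graph entropy, $\Gamma(G)$ can be enumerated by brute force, as in this sketch (exponential in the vertex count, so only for small examples).

```python
from itertools import combinations

def maximal_independent_sets(vertices, edges):
    """Enumerate Gamma(G): all maximal independent sets of a small graph.

    Brute force over all vertex subsets; only meant for toy graphs.
    """
    edgeset = {frozenset(e) for e in edges}

    def independent(S):
        return all(frozenset(pair) not in edgeset for pair in combinations(S, 2))

    ind = [frozenset(S)
           for r in range(len(vertices) + 1)
           for S in combinations(vertices, r)
           if independent(S)]
    # Keep only sets not strictly contained in another independent set.
    return [S for S in ind if not any(S < T for T in ind)]

# Path graph 0 - 1 - 2: the maximal independent sets are {1} and {0, 2}.
gamma = maximal_independent_sets([0, 1, 2], [(0, 1), (1, 2)])
```

With $\Gamma(G)$ in hand, (2) becomes a finite convex optimization over $p(w_1 \mid x_1)$, which is how small instances of graph entropy are typically evaluated.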

In [FeiMed2014, Theorem 41], the authors determined the rate region for a distributed functional compression problem with two transmitters and a receiver. This rate region is characterized by the following three conditions:

$$R_{11} \geq H_{G_{X_1}}(X_1|X_2), \quad R_{12} \geq H_{G_{X_2}}(X_2|X_1), \quad R_{11} + R_{12} \geq H_{G_{X_1}, G_{X_2}}(X_1, X_2), \tag{3}$$

where $G_{X_i}$ is the characteristic graph of $X_i$ for computing $f$ on the data, and $H_{G_{X_1}, G_{X_2}}(X_1, X_2)$ is the joint graph entropy of the sources.

To summarize, the role of graph entropy in network function computation is to reduce the amount of rate needed to be able to recover a function on data, and the amount of reduction is observed as

$$H(X) \;\to\; H_{G}(\cdot) \;\to\; H_{G_X}(X). \tag{4}$$

An achievable scheme for the above functional compression problem has been provided in [FeiMed2014]. In the scheme, the sources compute colorings of high-probability subgraphs of their characteristic graphs, perform source coding on these colorings, and send them. Intermediate nodes compute the colorings for their parents' random variables and, by using a look-up table (to compute their functions), they find the source values corresponding to the received colorings.

In Figure 1, we illustrate the Slepian-Wolf compression rate region in (1) versus the (convex) outer bound determined by the joint graph entropy of variables $X_1$ and $X_2$, as given in (3). In the figure, the region between the two bounds, denoted by $S$, determines the limits of functional compression. We denote the depth of this region by $\Delta$, which satisfies $\Delta_{\min} \leq \Delta \leq \Delta_{\max}$. This region indicates that there could potentially be a lot of benefit in exploiting the compressibility of the function to reduce communication. The convexity of the outer bound $O$ can be used to exploit the tradeoff between communications and computation, which is mainly determined by the network, the data and its correlations, and the functions. A notion of compressibility is the deficiency metric introduced in [FuaFenWanCar2018].

###### Definition 1.

Deficiency [PanSakSteWan2011]. Let $G$ and $H$ be finite Abelian groups of the same cardinality and $f: G \to H$. For any $a \in G \setminus \{0\}$ and $b \in H$, denote by $D_a f(x) = f(x+a) - f(x)$ the derivative of $f$ in direction $a$. The deficiency $D(f)$ of $f$ is the number of pairs $(a, b)$ such that $D_a f(x) = b$ has no solutions. This is a measure of the surjectivity of the derivatives of $f$; the lower the deficiency, the closer the maps $D_a f$ are to surjective.
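Under the reconstruction of the definition above (derivatives $D_a f(x) = f(x+a) - f(x)$), deficiency is easy to compute by brute force for small cyclic groups; this sketch specializes to $\mathbb{Z}_n \to \mathbb{Z}_n$ and is illustrative only.

```python
def deficiency(f, n):
    """Deficiency of f: Z_n -> Z_n, sketched for cyclic groups.

    Counts pairs (a, b) with a != 0 for which the derivative equation
    f(x + a) - f(x) = b (mod n) has no solution x; lower deficiency means
    the derivatives D_a f are closer to surjective.
    """
    missing = 0
    for a in range(1, n):
        reachable = {(f((x + a) % n) - f(x)) % n for x in range(n)}
        missing += n - len(reachable)
    return missing

# The identity map has constant derivatives (D_a f = a), hence maximal
# deficiency, while x -> x^2 on Z_5 has surjective derivatives
# (D_a f(x) = 2ax + a^2 covers all of Z_5 when a != 0).
```

The two extremes match the intuition in the definition: highly structured (e.g. linear) maps concentrate their derivative values, whereas "spread-out" maps reach every $b$ and have deficiency zero.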

Although Figure 1 gives insights on the limits of compression, it is not clear which point on the outer bound provides the best solution from a joint optimization of communication and computation. In particular, as highlighted in Sect. I-B, constructing optimal compression codes imposes a significant computational burden on the encoders and decoders, since the achievable schemes are based on NP-hard concepts. If the cost of computation were insignificant, it would be optimal to operate at the point of maximum compression, $\Delta_{\max}$. However, when the computation cost is not negligible, there will be a tension between the costs of communication and computation. To capture this balance, we propose to follow a different approach, as detailed in Sect. III.

### I-C Contributions

The function computation task in networks is very challenging, and to the best of our knowledge, its rate region is unknown except for the special cases outlined in Sect. II. In this paper, we provide a fresh look at this problem from a networking perspective.

Our contributions are as follows. We provide a cost model for a general network topology for performance characterization of distributed function computation by jointly considering the computation and communications aspects. We introduce entropic surjectivity as a measure of how surjective a function is. We define a delay cost minimization problem using flow-based techniques, with general cost functions for computation. The enabler of our approach is the connection between Little's law and the proportion of flow that requires communications (i.e. the computation processing factor), which is determined by the entropic surjectivity of the functions.

Our goal is to employ and devise distributed (function) compression techniques in general network topologies (stationary and Jackson-type networks, where the approach allows for the treatment of individual nodes in isolation, independent of the network topology; therefore, we do not have to restrict ourselves to cascading operations as in [FeiMed2010allerton], which follow from restricting the topology to linear operations) as a simple means of exploiting a function's entropic surjectivity (a notion of sparsity inherent to labeling), by employing the concepts of graph entropy, in order to provide approximately minimal representations for labeling. Labels can be viewed as colors on the characteristic graph of the function on the data; this graph, where in our case the labeling is the function, is central to functional compression.¹ Our main insight is that the main characteristics required for operating the distributed computation scheme are those associated with the entropic surjectivity of the functions.

¹ The entropy rate of the coloring of the function's power conflict graph upon vectors of data characterizes the minimal representation needed to reconstruct with fidelity the desired function of the data [Korner1973]. The degenerate case of the identity function corresponds to having a complete characteristic graph.

The advantages of the proposed approach are as follows. It does not put any assumptions on the network topology and characterizes the functions only via their entropic surjectivity. It provides insight into how to distribute computation/compression depending on the entropic surjectivity of the computation task, how to use the available resources among different computation tasks, and how the distributed scheme compares with the centralized solution. Our results imply that most of the available resources will go to the computation of low-complexity functions, and fewer resources will be allocated to the processing of high-complexity functions.

The organization of the rest of the paper is as follows. In Sect. II, we review the related work. In Sect. III, we detail how to model computation, and derive lower bounds on the rate of generated flows (i.e. processing factors) of the nodes by linking the computation problem to Little's law. In Sect. IV, we present numerical results and discuss possible directions.

## II Related Work

Compressed sensing and information-theoretic limits of representation provide a solid basis for function computation in distributed environments. The problem of distributed compression has been considered from different perspectives. For source compression, distributed source coding using syndromes (DISCUS) has been proposed [PradRam2003], and source-splitting techniques have been discussed [ColLeeMedEff2006]. For data compression, there exist information-theoretic limits, such as the side information problem [WynZiv1976] and Slepian-Wolf coding, or compression for depth-one trees [SlepWolf1973], which can be generalized to trees, and to general networks via multicast and random linear network coding [HoMedKoeKarEffShiLeo2006].

In functional compression, a function of the sources is sought at the destination. Korner introduced graph entropy [Korner1973], which was used in characterizing rate bounds in various functional compression setups [AlonOrlit1996]. For a general function and a configuration where one source is local and another is collocated with the destination, Orlitsky and Roche provided a single-letter characterization of the rate region in [OrlRoc2001]. In [DosShaMedEff2010] and [FeiMed2014], the authors investigated graph coloring approaches for tree networks. In [FES04], the authors computed a rate-distortion region for functional compression with side information. Another class of work considered the in-network computation problem for specific functions. In [KowKum2010], the authors investigated the computation of symmetric Boolean functions in tree networks. The asymptotic analysis of the rate in noisy broadcast networks has been investigated in [Gal88], and in random geometric graph models in [KM08]. Function computation has been studied using multi-commodity flow techniques in [ShaDeyMan2013]. There do not exist, however, tractable approaches to perform functional compression in ways that approximate the information-theoretic limits. Thus, unlike the case of compression, where coding techniques exist and where compressed sensing acts in effect as an alternative to coding for purposes of simplicity and robustness, there is currently no family of coding techniques for functional compression.

The computing capacity of a network code is the maximum number of times the target function can be computed per use of the network [HuanTanYangGua2018]. This capacity has been studied for special cases such as trees and the identity function [LiYeuCai2003], and linear network codes achieving the multicast capacity [LiYeuCai2003], [KoeMed2003]. For scalar linear functions, the computing capacity can be fully characterized by the min cut [KoeEffHoMed2004]. For vector linear functions over a finite field, necessary and sufficient conditions have been obtained under which linear network codes suffice to compute the function [AppusFran2014]. For general functions and network topologies, upper bounds on the computing capacity based on cut sets have been studied [KowKum2010], [KowKum2012]. In [HuanTanYangGua2018], the authors generalize the equivalence relation for the computing capacity. However, in these papers, characterizations based on the equivalence relation associated with the target function are only valid for special network topologies, e.g., multi-edge trees. For more general networks, this equivalence relation is not sufficient to explore the general function computation problem.

Coding for computation has been widely studied in the context of multi-stage computations [LiAliYuAves2018], which generally focus on linear reduce functions (since many reduce functions of interest are linear); heterogeneous networks and asymmetric computations [KiaWanAves2017]; and compressed coded computing [LiAliYuAves2018], which focused on computations of single-stage functions in networks. Coded computing aims to trade off communication (bottlenecks) against injected computation. While fully distributed algorithms might cause a high communication load, fully centralized systems can suffer from a high computation load. With distributed computing at intermediate nodes exploiting multicast coding opportunities, the communication load can be significantly reduced, and can be made inversely proportional to the computation load [LiAliYuAves2018]. The rate-memory tradeoff for function computation has been studied in [YuAliAves2018]. Different coding schemes to improve the recovery threshold include Lagrange coded computing [YuRavSoAve2018] and polynomial codes for distributed matrix multiplication [YuAliAve2017].

In functional compression, the structure of the functions themselves can also be exploited. There exist functions with special structures, such as sparsity-promoting functions [SheSutTri2018], symmetric functions, and type-sensitive and threshold functions [GK05]. One can also exploit a function's surjectivity. There are different notions of how to measure surjectivity, such as deficiency [FuaFenWanCar2018], ambiguity [PanSakSteWan2011], and equivalence relationships among function families [gorodilova2019differential].

## III Modeling Computation in Networks

In this section, we aim to answer the following questions: How should we handle large, distributed data? What is the rate region of the distributed functional compression problem for a general function? Where should we place computation and memory? When should computations be done? How should we model computation in networks?

As a first step towards this problem, we provide a utility-based approach for general cost functions. As special cases, we consider the simple example of point search, then MapReduce, then the binary classification model. Our main contribution is to provide the link between the function computation problem and Little's law.

We consider a general stationary network topology. Sources can be correlated, computations are allowed at intermediate nodes, and the functions to be computed are deterministic. Our goal is to effectively distribute computation. Intermediate nodes need to decide whether to compute or relay. At each node, computation is followed by communications (causality) while satisfying stability conditions. We consider a decentralized solution. This yields a threshold on the flow (i.e. processing factor) required to be able to perform computation. We also consider a centralized solution, which can be obtained by solving an optimization problem with appropriate cost functions.

We use the following notation. The set of source random variables is denoted by $X_1^N = (X_1, \dots, X_N)$. The arrival rate of a type-$c$ flow at node $v$ is $\lambda_{cv}$. The service rate of a type-$c$ flow at node $v$ is $\mu_{cv}$. The average number of packets at node $v$ due to the processing of a type-$c$ function is $M_{cv}$. A function of type $c$ is denoted by $f_c$. In this section and in the remainder of the paper, we drop the subscript in the graph entropy and instead write $H(f_c(X_1^N))$ to show the dependency of the graph entropy on the function $f_c$ of the data $X_1^N$. Hence, the (graph) entropy of function $f_c$ is $H(f_c(X_1^N))$. The time complexity of generating/processing a flow of type $c$ at node $v$ is $d_f(\cdot)$. The generation rate of the flow, i.e. the processing factor, of type $c$ at node $v$ is $\gamma_f(\lambda_{cv})$.

### III-A Computing with Little's Law

In this section, we connect the computation problem to Little's law. Little's law states that the long-term average number of packets in a stationary system is equal to the long-term average effective arrival rate multiplied by the average time that a packet spends in the system. More formally, it can be expressed as $L = \lambda W$, where $L$ is the average number of packets, $\lambda$ the effective arrival rate, and $W$ the average sojourn time. The result applies to any system that is stable and non-preemptive, and the relationship does not depend on the distribution of the arrival process, the service distribution, or the service order [Klein1975].
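Because $L = \lambda W$ is a distribution-free accounting identity, it can be checked directly on any finished sample path. The following sketch (names illustrative) computes both sides from packet timestamps of a completed busy period.

```python
def time_average_metrics(arrivals, departures):
    """Little's law on a finished FIFO sample path: returns (lam * W, L).

    arrivals/departures are the timestamps of each packet; over a completed
    observation window the time averages satisfy L = lam * W exactly.
    """
    T = departures[-1] - arrivals[0]                           # observation window
    n = len(arrivals)
    lam = n / T                                                # effective arrival rate
    W = sum(d - a for a, d in zip(arrivals, departures)) / n   # mean sojourn time
    L = sum(d - a for a, d in zip(arrivals, departures)) / T   # time-average occupancy
    return lam * W, L

# Three packets, each spending 2 time units in the system.
lhs, rhs = time_average_metrics([0.0, 1.0, 2.0], [2.0, 3.0, 4.0])
```

The equality is exact here because both sides reduce to the total packet-time in the system divided by the window length; for stochastic systems it holds for the long-term averages.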

In our setting, the average time a packet spends in the system is given by the sum of the total time required by computation and the total time required by communications. We formulate a utility-based optimization problem by decoupling the costs of communications and computation:

$$\textbf{MinCost:} \quad \min_{\rho} \; C = \sum_{v \in V} \sum_{c \in \mathcal{C}} W_{cv} \quad \text{s.t.} \quad \rho_{cv} < 1, \;\; \forall c \in \mathcal{C}, \; v \in V, \tag{5}$$

where $C$ captures the total delay, and $C_{cv,\mathrm{comp}}$ and $C_{cv,\mathrm{comm}}$ are positive delay cost functions that are non-decreasing in the flow. The delays of computation and communications for processing functions of type $c$ are

$$C_{cv,\mathrm{comp}} = \frac{1}{\lambda_{cv}} d_f(M_{cv}), \qquad C_{cv,\mathrm{comm}} = \frac{1}{\mu_{cv} - \gamma_f(\lambda_{cv})}, \tag{6}$$

where $d_f(\cdot)$ models the time complexity of computation, i.e. the total time needed to process all the incoming packets and generate the desired function outcomes. The term $\gamma_f(\lambda_{cv})$ characterizes the amount of computation flow rate generated by node $v$ for a function of type $c$. Hence, the second term on the right-hand side captures the waiting time, i.e. the queueing and service time of a packet. Hence, by Little's law, we expect that the long-term average number of packets in node $v$ for a function of type $c$ satisfies the following relation

$$L_{cv} = \gamma_f(\lambda_{cv}) W_{cv}, \tag{7}$$

where we aim to infer the value of $\gamma_f(\lambda_{cv})$ using Little's law.

The connection between $M_{cv}$ and $L_{cv}$ can be given as

$$M_{cv} = L_{cv} \left( 1 - \gamma_f(\lambda_{cv}) / \lambda_{cv} \right). \tag{8}$$
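Relations (6)-(8) couple $W_{cv}$, $L_{cv}$, and $M_{cv}$ through $d_f(M_{cv})$, so for a single node they can be solved jointly by a fixed-point iteration. The sketch below assumes illustrative parameter values and a linear complexity model $d_f$; none of these choices come from the paper.

```python
def node_metrics(lam, mu, gamma, d_f, iters=200):
    """Fixed-point iteration for one node's delay, occupancy, and backlog.

    Iterates the reconstructed relations:
      W = d_f(M) / lam + 1 / (mu - gamma)   # delay, eq. (6)
      L = gamma * W                         # Little's law, eq. (7)
      M = L * (1 - gamma / lam)             # packets in computation, eq. (8)
    """
    assert gamma < mu, "stability requires the processing factor below mu"
    M = 0.0
    for _ in range(iters):
        W = d_f(M) / lam + 1.0 / (mu - gamma)
        L = gamma * W
        M = L * (1.0 - gamma / lam)
    return W, L, M

# Linear complexity model; the update on M is a contraction, so it converges.
W, L, M = node_metrics(lam=2.0, mu=3.0, gamma=1.0, d_f=lambda m: 0.1 * m + 1.0)
```

For these values the iteration contracts with factor $0.025$, so a couple of hundred iterations is far more than enough; for steeper $d_f$ the existence of a fixed point would need to be checked first.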

For simplicity of notation, we drop the indices $c$ and $v$ where the context is clear.

The following proposition gives a characterization of $L_{cv}$ via simple lower and upper bounding techniques.

###### Proposition 1.

Flow bounds. The long-term average number of packets in node $v$ for a type-$c$ flow satisfies

$$\frac{H(f_c(X_1^N))}{2} + 1 - \sqrt{\left( \frac{H(f_c(X_1^N))}{2} \right)^2 + 1} \;\leq\; L_{cv} \;\leq\; M_{cv}. \tag{9}$$

Prop. 1 yields a better inner bound than that of Slepian and Wolf [SlepWolf1973], because the LHS of (9) never exceeds the Slepian-Wolf joint-entropy bound. Its proof is provided in the Appendix.
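A small numerical sketch of the shape of the entropic lower bound in (9), assuming the reconstruction of the bound as $H/2 + 1 - \sqrt{(H/2)^2 + 1}$ with $H = H(f_c(X_1^N))$:

```python
import math

def packet_lower_bound(H):
    """LHS of (9): H/2 + 1 - sqrt((H/2)^2 + 1), the entropic lower bound on
    the long-term average number of packets, with H the (graph) entropy of
    the function (a sketch of the bound's shape only)."""
    return H / 2 + 1 - math.sqrt((H / 2) ** 2 + 1)

# The bound vanishes for a trivial function (H = 0), increases with H, and
# saturates below 1, so it is always below H itself for H > 0.
values = [packet_lower_bound(H) for H in (0.0, 1.0, 2.0, 4.0)]
```

These properties (zero at $H = 0$, monotone in $H$, bounded above by $1$) are consistent with the bound being looser, hence safer, than the entropy itself.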

We next provide a result required for stability.

###### Proposition 2.

For stability, we require that $\gamma_f(\lambda_{cv}) < \mu_{cv}$.

###### Proof.

The upper bound in (9) follows from the case of no computation. In this case, the long-term average number of packets in node $v$ satisfies $L_{cv} = M_{cv}$. However, when we allow function computation, we expect to have $L_{cv} \leq M_{cv}$.

Assume that $L_{cv} \leq M_{cv}$. If this assumption did not hold, we would have $L_{cv} > M_{cv}$. For stability, the long-term average number of packets in $v$ waiting for communications service, i.e. $L_{cv}$, should be upper bounded by the long-term average number of packets in $v$ waiting for computation service, i.e. $M_{cv}$. Otherwise, $L_{cv}$ will increase over time, which will violate the stationarity assumption.

The lower bound follows from the definition of Little's law:

$$L_{cv} = \gamma_f(\lambda_{cv}) \left[ \frac{1}{\lambda_{cv}} d_f(M_{cv}) + \frac{1}{\mu_{cv} - \gamma_f(\lambda_{cv})} \right] \overset{(a)}{\geq} H(f_c(X_1^N)),$$

where $(a)$ holds because a rate of at least $H(f_c(X_1^N))$ is required for recovering the function at the destination. Manipulating the lower bound relation above, we obtain

$$\gamma_f(\lambda_{cv}) \geq \mu_{cv} \left[ \frac{H(f_c(X_1^N))}{2} + 1 - \sqrt{\frac{H(f_c(X_1^N))^2}{4} + 1} \right],$$

from which we get the desired lower bound. ∎

Our approach can be considered as a preliminary step towards a better understanding of how to distribute computation in networks. Directions include devising coding techniques for in-network functional compression, by blending techniques from compressed sensing with Slepian-Wolf compression, employing the concepts of graph entropy, and exploiting function surjectivity. They also include the extension to multi-class models with product-form distributions, allowing conversion among classes of packets when routed from/to a node.