## I Introduction

In recent years, distributed computing has been widely adopted to perform various computation tasks in different computing systems [kartik97task, hong04distributed, lu19toward]. For instance, to perform big data analytics in cloud computing systems, MapReduce [dean08mapreduce] and Apache Spark [zaharia10spark] are the two prevalent modern distributed computing frameworks that facilitate the processing of data on the order of petabytes. On the other hand, edge computing has recently become increasingly important. For example, with advanced computing capabilities, *unmanned aerial vehicle* (UAV)-based airborne computing [lu19toward] can facilitate various delay-sensitive civilian and commercial applications, such as precision agriculture [honkavaara13processing], emergency response [choi09developing], and infrastructure monitoring [ham16visual].

Despite the importance of distributed computing, there are still many design challenges. One of the most important challenges is that many computing frameworks are vulnerable to uncertain system noises, such as node failures, communication congestion, and straggler nodes [dean13tail]. Such system noises have been observed in many cloud computing systems, and are a major issue in edge computing scenarios. For UAV-based airborne computing systems with highly mobile nodes, the system noises are even more severe.

To address system noises, a variety of solutions have been proposed in the literature. For example, the authors of [zaharia08improving] proposed to identify and blacklist nodes that are in bad health and to run tasks only on well-performing nodes. However, empirical studies show that stragglers can occur in non-blacklisted nodes [dean12achieving, ananthanarayanan13effective]. As another type of solution, delayed computation tasks can be re-executed in a speculative manner [dean08mapreduce, ananthanarayanan10reining, zaharia08improving, melnik10dremel]. Nevertheless, such speculative execution techniques have to wait to collect the performance statistics of the tasks before generating speculative copies and thus have limitations in dealing with small jobs [ananthanarayanan13effective]. To avoid waiting for and predicting stragglers, the authors of [ananthanarayanan12why, ananthanarayanan13effective] suggested executing multiple clones of each task and using the results generated by the fastest clones. Although their results show that this approach is promising for reducing the average completion time of small jobs, the extra resources required for launching clones can be considerably large, because multiple clones are executed for each task.

Instead of directly replicating the whole task, coding techniques can be adopted to introduce arbitrary redundancy into the computation in a systematic way. However, until a few years ago, coding techniques were mostly known for their capability to improve the resilience of communication, storage, and cache systems to uncertain system noises [li15coded]. In 2016, Lee et al. [lee16speeding, lee18speeding] presented the first *coded distributed computing* (CDC) schemes to speed up matrix multiplication and data shuffling. Since then, CDC has attracted significant attention from the distributed computing community. While most CDC schemes consider homogeneous computing nodes, a few recent studies have investigated CDC over heterogeneous computing clusters. In particular, Kim et al. [kim19coded, kim19optimal] considered the matrix-vector multiplication problem and presented an optimal load allocation method that achieves a lower bound of the expected latency. On the other hand, Reisizadeh et al. [reisizadeh17coded] introduced a different approach, namely *Heterogeneous Coded Matrix Multiplication* (HCMM), that can maximize the expected computing results aggregated at the master node. In [reisizadeh17coded, reisizadeh19coded], the authors proved that HCMM is asymptotically optimal under the assumption that the processing time of each computing node follows a shifted exponential or Weibull distribution. Also of interest, Keshtkarjahromi et al. [keshtkarjahromi2018dynamic] considered the scenario in which computing nodes have time-varying computing powers and introduced a coded cooperative computation protocol that allocates tasks in a dynamic and adaptive manner.

In this paper, we consider a general distributed computing system with heterogeneous computing nodes. Specifically, we propose a novel *batch-processing based coded computing* (BPCC) approach. Unlike most existing CDC schemes that require each worker node to first complete the computation task and then send back the whole result to the master node, our BPCC allows each node to return partial computing results to the master node in batches before the whole computation task is completed. Therefore, BPCC is expected to achieve lower latency. Also worthy of note is that the partial results can be used to generate approximate solutions, e.g., by applying the singular value decomposition (SVD) approach in [ferdinand16anytime], which is very useful for applications that require timely but not necessarily optimal decisions, such as some UAV applications for emergency response.

To the best of our knowledge, such a BPCC framework has not been fully investigated in the literature. In this paper, we investigate the design and evaluation of BPCC in a systematic manner. Specifically, we focus on a classical CDC task, matrix-vector multiplication, and formulate an optimization problem for general BPCC under the assumption that the processing time of each computing node follows a shifted exponential distribution and the batch-induced overhead is linear in the number of batches. To solve the optimization problem, we first consider a special case in which the batching overhead is negligible. For such a system, we formulate alternative optimization problems, based on which we design an optimal load allocation scheme, namely BPCC-1, to assign tasks to different computing nodes so as to achieve the minimal expected task completion time. We also conduct a theoretical analysis to prove the asymptotic optimality of BPCC-1 and to prove that it outperforms HCMM, a state-of-the-art CDC scheme for heterogeneous systems. Based on BPCC-1, we then design a greedy algorithm, namely BPCC-2, to solve the initial optimization problem with linear batching overhead, which jointly optimizes the computation load and the number of batches assigned to each computing node.

To further understand and illustrate the performance of the proposed BPCC schemes, we conduct extensive simulation studies and real experiments. Our two BPCC schemes are compared with three benchmark schemes, namely Uniform Uncoded, Load-balanced Uncoded, and HCMM. In the simulation, no batching overhead is simulated. The simulation results show the impacts of important BPCC parameters. They also demonstrate that BPCC-1 can reduce the latency by up to 73%, 56%, and 34% compared with the aforementioned three benchmark schemes, respectively, and that BPCC-2 achieves performance similar to that of BPCC-1.

In the real experiments, we test all distributed computing schemes in two systems: 1) an Amazon EC2 computing cluster and 2) a UAV-based airborne computing platform. For the first system, we deploy a heterogeneous computing cluster that consists of different machine instances in Amazon EC2. The results show that the BPCC schemes outperform the benchmark schemes and that BPCC-2 achieves the minimal task completion time, regardless of whether there are unexpected stragglers. For the second system, we first introduce a novel approach to model the behavior of virtualized computing nodes in the UAV-based airborne computing platform. Based on the estimated computing time models, we then compare the performance of all schemes in various scenarios with and without unknown stragglers. The results again confirm that both BPCC schemes outperform the benchmark schemes in all scenarios. Furthermore, BPCC-2 achieves the best performance, i.e., it is about 76%, 71%, 35%, and 19% faster than Uniform Uncoded, Load-balanced Uncoded, HCMM, and BPCC-1, respectively. Finally, since power and energy consumption are critical in the UAV platform, we also evaluate them for the different schemes. Our results show that the power levels of the two BPCC schemes are slightly higher than those of the benchmark schemes. Nevertheless, considering that energy is the product of power and time, the proposed BPCC schemes require much less energy to complete the computation tasks, which is a desirable feature for many edge computing systems with limited energy supply.

The rest of this paper is organized as follows. In Section II, we first briefly discuss some relevant studies. Next, in Section III, we introduce the system model and the BPCC framework and then formulate an optimization problem for BPCC. To solve the optimization problem, in Section IV, we first consider a system with negligible batching overhead, for which we design the BPCC-1 scheme and conduct solid theoretical analysis to prove its optimality. Based on the understanding of BPCC-1, we further design a greedy algorithm, i.e., BPCC-2, for computing systems with linear batching overhead in Section V. We then present extensive simulation and experimental results in Section VI and Section VII, respectively, before concluding the paper in Section VIII.

## II Related Work

### II-A Coded Distributed Computing

Following the seminal work in [li15coded, lee16speeding, lee18speeding], many different computation problems have been explored using codes, such as gradient computation [tandon17gradient], large matrix-matrix multiplication [lee17high], linear inverse problems [yang17coded], and nonlinear operations [lee17coded]. Other relevant coded computation solutions include the “Short-Dot” coding scheme [dutta17short-dot], which offers computation speed-up by introducing additional sparsity to the coded matrices, and the unified coded framework [li16unified, li18fundamental], which achieves a trade-off between communication load and computation latency. In this paper, we consider a classical CDC problem, i.e., matrix-vector multiplication. However, we believe that our approach can also be applied to other distributed computing problems.

To reduce the output delay in waiting for the exact result, an anytime coding technique was introduced in [ferdinand16anytime], which adopts the SVD to allow early output of approximate results. Also of interest is the study presented in [ferdinand18hierarchical], which introduced a hierarchical approach to address a limitation of the above coding techniques, namely that they wastefully ignore the work completed by slow worker nodes. In particular, to better utilize the work completed by each worker node, it partitions the total computation at each worker node into layers of sub-computations, with each layer encoding part of the job. It then processes each layer sequentially. The final result can be obtained after the master node recovers all layers. The simulation results demonstrate the effectiveness of this approach in reducing the computation latency. However, as the worker nodes have to process the layers in the same order, the results obtained by slow worker nodes for layers that have already been recovered are useless. Furthermore, this approach, as well as the aforementioned approaches, assumes homogeneous computing nodes, which is uncommon in realistic scenarios. Compared to these studies, in this paper, we consider more general heterogeneous computing systems. The proposed BPCC schemes can also better utilize the partial results of all computing nodes.

### II-B UAV-based Airborne Computing

The past few years have witnessed rapid growth in the use of UAVs to facilitate civilian and commercial applications, such as precision agriculture [honkavaara13processing], emergency response [choi09developing], and infrastructure monitoring [ham16visual]. In these applications, computation-intensive tasks, such as 3-dimensional (3-D) mapping, precise positioning, and path planning, are mostly conducted at ground stations [stocker19uav, zhou16seamless] or remote clouds [luo15uav, cao17cloud] due to the limited computing and storage capacity of a single UAV with a small payload. However, such a computation mechanism suffers from significant ground-air transmission delays or even failures and thus is not suitable for delay-sensitive applications that require real-time responses.

To address the aforementioned challenges, computation tasks should be executed directly onboard UAVs. This will not only substantially improve the performance of many existing UAV applications, but also enable advanced new applications. For instance, the recently emerged UAV-based mobile edge computing (MEC) [hu18joint, zhou18computation, asheralieva19hierarchical] considers the use of UAVs with onboard computing capabilities as edge computing nodes to provide computing services to surrounding users. Due to the unique properties of UAVs, including high mobility, flexibility, and maneuverability, UAV-based MEC is capable of providing computing services on demand, regardless of time, space, and the existence of communication infrastructure, which makes it particularly attractive. To enable UAV-based MEC, researchers have investigated how to offload computation tasks from ground devices to the UAVs and how to design the UAVs’ trajectories to achieve the optimal system performance [hu18joint, zhou18uav, jeong17mobile, jeong18mobile]. However, how to enhance the UAV’s onboard computing capability, the very first challenge encountered, has been largely ignored.

Driven by the urgent need for more advanced UAV platforms of high onboard computing capability, we started to develop the UAV-based networked airborne computing platform [wang19computing, lu19toward, wang18enabling], which incorporates the computing resources of multiple UAVs through networking and resource sharing. Currently, we have built a prototype that uses NVIDIA Jetson TX2 of high computing power as the computing unit and implements virtualization for enhanced computing and resource management capabilities [wang18enabling]. To enable efficient resource sharing between two UAVs, the prototype is equipped with a directional antenna, Ubiquiti Nanostation LocoM2, which allows long-distance and broad-band UAV-to-UAV communications [CRIWeb_url, chen17long]. The prototype also implements an advanced learning-based control solution to keep directional antennas aligned for robust UAV-to-UAV communications [li19design]. In this paper, we further explore distributed computing techniques to optimize the resources shared among multiple UAVs.

## III System Models

In this section, we first introduce the computing system for distributed matrix-vector multiplication. We then illustrate three computing schemes, including the proposed batch-processing based coded computing (BPCC). Finally, we formulate an optimization problem for BPCC.

### III-A Computing System

In this paper, we consider a distributed computing system that consists of one master node and () computing nodes, a.k.a., worker nodes. In this system, we investigate how to quickly solve a matrix-vector multiplication problem, which is one of the most basic building blocks of many computation tasks. Specifically, we consider a matrix-vector multiplication problem , where is the output vector to be calculated, is the input vector to be distributed from a master node to multiple workers, and is an dimensional matrix pre-stored in the system. Both and can be very large, which implies that calculating at a single computing node is not feasible. Finally, we define , where is an arbitrary positive integer, i.e., .

### III-B Computing Schemes

#### III-B1 Uncoded Distributed Computing

To solve the above problem, a traditional distributed computing scheme divides matrix into a set of sub-matrices , and pre-stores each sub-matrix in computing node , where and . Upon receiving the input vector , the master node sends vector to all worker nodes. Each worker node then computes and returns the result to the master node. After all results are received, the master node aggregates the results and outputs , where stands for transpose.
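The uncoded scheme described above can be sketched in a few lines of Python. This is only an illustrative toy, assuming a small integer matrix stored as a list of rows; the helper names `matvec`, `split_rows`, and `uncoded_multiply`, as well as the `loads` argument (the number of rows given to each worker), are hypothetical and not part of the original scheme description.

```python
def matvec(rows, x):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, x)) for row in rows]


def split_rows(A, loads):
    """Split A row-wise into consecutive sub-matrices with the given row counts."""
    assert sum(loads) == len(A), "loads must cover every row exactly once"
    parts, start = [], 0
    for load in loads:
        parts.append(A[start:start + load])
        start += load
    return parts


def uncoded_multiply(A, x, loads):
    """Each worker multiplies its sub-matrix by x; the master concatenates the results."""
    partial_results = [matvec(A_i, x) for A_i in split_rows(A, loads)]
    # The master must wait for every partial result before it can output y.
    return [value for part in partial_results for value in part]


A = [[1, 2], [3, 4], [5, 6], [7, 8]]
x = [1, -1]
y = uncoded_multiply(A, x, loads=[1, 3])
print(y)
```

Because the master concatenates all partial results, a single missing sub-result blocks the whole output, which is the weakness the coded schemes aim to remove.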

Due to the existence of system noises, such as malfunctioning nodes and communication bottlenecks, the uncoded computing scheme may delay the computation or even cause it to fail, because the delay or loss of any , , will affect the calculation of the final result . To address the issue of system noises, more computing nodes can be used to perform distributed computing. For instance, the master node can let two or more computing nodes compute . This approach, however, is inefficient because the cost can be unnecessarily large.

#### III-B2 Coded Distributed Computing (CDC)

In recent years, a more efficient computing paradigm, CDC, has been introduced to tackle the issue of system noises. In the literature, there are many CDC schemes and we consider a generic CDC scheme as follows.

In this CDC scheme, will first be used to calculate a larger matrix with more rows, i.e., , by using , where is the encoding matrix with the property that any row vectors are linearly independent from each other [lee17coded]. In other words, we can use any rows of to create an full-rank matrix. Similar to the uncoded computing scheme, matrix can then be divided into sub-matrices , where , , and each worker node calculates .

Different from the uncoded computing scheme, the master node does not need to wait for all worker nodes to complete their calculations, because it can recover once the total number of rows of the received results is equal to or larger than . In particular, suppose the master node receives at a certain time , it can first infer that must satisfy

where is a sub-matrix of the encoding matrix corresponding to . The master node can then calculate

(1) |
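As a minimal sketch of the encode-then-recover idea, the following Python example encodes a small matrix and recovers the product from a subset of coded results. It assumes a Vandermonde encoding matrix with distinct nodes, chosen here only because any square selection of its rows is invertible; the scheme above merely requires that property and does not prescribe this construction. The names `r` (rows of the original matrix), `q` (rows of the coded matrix), `vandermonde`, `solve`, and `received` are hypothetical, and exact rational arithmetic is used so that recovery is exact.

```python
from fractions import Fraction


def matvec(rows, x):
    """Multiply a row-major matrix by a vector."""
    return [sum(a * b for a, b in zip(row, x)) for row in rows]


def matmul(H, A):
    """Multiply two row-major matrices."""
    cols = list(zip(*A))
    return [[sum(h * a for h, a in zip(h_row, col)) for col in cols] for h_row in H]


def vandermonde(q, r):
    """q x r encoding matrix; row k is [1, t, ..., t^(r-1)] with t = k + 1 (assumed choice)."""
    return [[Fraction(k + 1) ** j for j in range(r)] for k in range(q)]


def solve(M, rhs):
    """Solve M z = rhs by Gauss-Jordan elimination over exact fractions."""
    n = len(M)
    aug = [list(row) + [b] for row, b in zip(M, rhs)]
    for col in range(n):
        pivot = next(i for i in range(col, n) if aug[i][col] != 0)
        aug[col], aug[pivot] = aug[pivot], aug[col]
        pv = aug[col][col]
        aug[col] = [v / pv for v in aug[col]]
        for i in range(n):
            if i != col and aug[i][col] != 0:
                f = aug[i][col]
                aug[i] = [vi - f * vc for vi, vc in zip(aug[i], aug[col])]
    return [aug[i][n] for i in range(n)]


A = [[1, 2], [3, 4], [5, 6]]          # original matrix, r = 3 rows
x = [1, -1]
r, q = 3, 5
H = vandermonde(q, r)                  # encoding matrix
A_coded = matmul(H, A)                 # coded matrix with q rows, split among workers
coded_results = matvec(A_coded, x)     # all coded inner products (workers' outputs)

received = [4, 1, 2]                   # indices of the r coded rows that arrived first
y = solve([H[k] for k in received], [coded_results[k] for k in received])
print(y == matvec(A, x))
```

Any other choice of three distinct indices in `received` yields the same `y`, which is what allows the master to ignore the slowest results.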

#### III-B3 BPCC

In the literature, most existing CDC schemes assume that each worker node will send the complete to the master node when it is ready, which may incur large delays. To further speed up the computation, we propose a novel BPCC scheme and the main idea is to allow each worker node to return partial results to the master node.

Specifically, we consider that each worker node equally divides the pre-stored encoded matrix row-wise into sub-matrices, called batches, where is the number of batches and . Except for the last batch, each batch has rows. After receiving the input vector from the master node, the worker node multiplies each batch with and sends back each partial result as soon as it is available. If the master node receives batches from the worker node by time , where , it can then recover the final result when , by using Eq. (1).
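To illustrate why returning results in batches can shorten the wait, the sketch below compares the time at which the master holds enough coded rows with and without batching. It rests on several assumptions made only for illustration: each worker processes rows at a fixed, deterministic time per row (the paper instead models random waiting times), the batch size is the ceiling of the load divided by the number of batches, and communication delay is ignored. The names `batch_sizes`, `completion_time`, and `time_per_row` are hypothetical.

```python
import math


def batch_sizes(load, num_batches):
    """Split `load` rows into batches; all but the last have ceil(load / num_batches) rows."""
    size = math.ceil(load / num_batches)
    sizes, remaining = [], load
    for _ in range(num_batches):
        take = min(size, remaining)
        sizes.append(take)
        remaining -= take
    return sizes


def completion_time(loads, num_batches, time_per_row, r):
    """Earliest time at which the master has received at least r coded rows.

    Toy model: worker i needs time_per_row[i] per row and returns a batch
    as soon as all rows of that batch are computed.
    """
    events = []  # (arrival time, rows in the arriving batch)
    for load, p, tau in zip(loads, num_batches, time_per_row):
        done = 0
        for size in batch_sizes(load, p):
            done += size
            events.append((done * tau, size))
    events.sort()
    received = 0
    for t, size in events:
        received += size
        if received >= r:
            return t
    return None  # not enough coded rows were assigned


loads = [6, 6]             # coded rows stored at a fast and a slow worker
time_per_row = [1.0, 2.0]  # the second worker is twice as slow
r = 8                      # rows needed to recover the final result

t_no_batch = completion_time(loads, [1, 1], time_per_row, r)
t_batched = completion_time(loads, [3, 3], time_per_row, r)
print(t_no_batch, t_batched)
```

In this toy setting the master waits for the slow worker's entire load without batching, whereas with three batches per worker it can combine the fast worker's full load with the slow worker's first batch.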

### III-C Problem Formulation

In the previous sub-section, we introduced the BPCC scheme that can improve the performance of CDC. In the following study, we focus on how to optimize the performance of BPCC. Specifically, we jointly consider minimizing the task completion time and the potential overhead of batch processing. Furthermore, we attempt to achieve the optimization goal by allocating proper computation load (i.e., ) to each worker node and specifying the number of batches for each worker node (i.e., ).

We now define as the amount of time to complete a computation task, and we let be the unit cost of using every batch. The optimization can be formulated as follows:

(2) | ||||||

subject to | ||||||

where and .

To facilitate further discussions, we first assume that is an integer. We also assume that the computation task scales with , i.e., . Next, we assume that the computing nodes are fixed with time-invariant computation capabilities, and the network maintains a stable communication delay during the computing process.

We now consider the behavior of the waiting time, which is defined as the duration from the epoch at which the master node distributes the input vector to the time at which it receives a certain result. For BPCC, we let be the waiting time for the master node to receive batches from worker node , . Clearly, can be modeled as a random variable following a certain probability distribution. Following the modeling techniques used in recent studies [reisizadeh19coded], we consider that follows a shifted exponential distribution defined below:

(3) |

where and are straggling and shift parameters, respectively, and and are positive constants for all . Furthermore, we assume that is independent from , , , .
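For intuition, a shifted exponential waiting time can be sampled as a deterministic shift plus an exponential random delay. The sketch below assumes one common parameterization, in which both the shift and the mean of the exponential part grow linearly with the number of rows processed (shift `alpha * rows`, rate `mu / rows`); this scaling is an assumption made here for illustration, and the symbols `mu`, `alpha`, and `rows` as well as the function names are hypothetical.

```python
import random


def sample_waiting_time(rows, mu, alpha, rng):
    """Draw one waiting time: shift alpha*rows plus an exponential delay with rate mu/rows.

    mu plays the role of a straggling parameter and alpha of a shift parameter
    (assumed parameterization, for illustration only).
    """
    return alpha * rows + rng.expovariate(mu / rows)


def empirical_mean(rows, mu, alpha, trials=20000, seed=0):
    """Monte Carlo estimate of the mean waiting time."""
    rng = random.Random(seed)
    total = sum(sample_waiting_time(rows, mu, alpha, rng) for _ in range(trials))
    return total / trials


rows, mu, alpha = 50, 2.0, 0.1
estimate = empirical_mean(rows, mu, alpha)
analytic = rows * (alpha + 1.0 / mu)  # mean of shift plus mean of the exponential part
print(round(estimate, 2), analytic)
```

Under this parameterization no sample can fall below the shift `alpha * rows`, which captures the minimum processing time, while a smaller `mu` produces a heavier tail and therefore more pronounced straggling.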

Based on the definitions and assumptions above, we can see that must satisfy . In the following sections, we will first discuss how to solve the optimization problem, in which we will use theoretical analysis to confirm the optimality and advantage of BPCC. We will then conduct extensive simulation and real experiments to validate the assumptions and to evaluate the optimization algorithms.

## IV Optimal Load Allocation with Negligible Batching Overheads

In this section, we consider a special case when , i.e., introducing batches incurs negligible overhead. To solve the optimization problem, we will first define a simplified formulation, for which we then apply a two-step alternative formulation. Next, we show how to solve the alternative problems and prove the optimality of the solution. Finally, we show that this solution outperforms a recent CDC scheme without batch processing.

### IV-A Notations for Asymptotic Analysis

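For any two given functions $f(n)$ and $g(n)$, $f(n)=\Theta(g(n))$ if and only if there exist positive constants $c_1$, $c_2$, and $n_0$ such that $c_1 g(n) \le f(n) \le c_2 g(n)$ for all $n \ge n_0$; $f(n)=O(g(n))$ if and only if there exist constants $c$ and $n_0$ such that $f(n) \le c\, g(n)$ for all $n \ge n_0$; and $f(n)=o(g(n))$ if and only if $\lim_{n\to\infty} f(n)/g(n) = 0$.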

### IV-B A Simplified Formulation

We relax the constraint from to , to simplify the analysis. Furthermore, we assume that the number of batches for each worker node , , is given. Consequently, the problem in Eq. (2) can be formulated as follows:

subject to |

Once the above problem is solved, we can round each optimal load number up to its nearest integer using the rounding function (denoted as ). Note that the effect of this rounding step is negligible in practical applications with large load numbers, such as those considered in our simulation and experimental studies.

### IV-C A Two-Step Alternative Formulation

To solve the above problem, which is NP-hard, we provide a two-step alternative formulation, which is similar to the one introduced in [reisizadeh19coded]. We will show later that this alternative formulation provides an asymptotically optimal solution to problem .

The key idea of the two-step alternative formulation is to first maximize the amount of results accumulated at the master node by a feasible time , i.e., , and then minimize time such that sufficient amount of results are available to recover the final result.

In particular, we let be the amount of results received by the master node by time , where is the batch size. For a feasible time , we first maximize the expected amount of results received by the master node, through solving the following problem:

subject to |

After obtaining the solution to , denoted as , we then minimize the time such that there is a high probability that the results received by the master node by time are sufficient to recover the final result, by solving

minimize | |||||

subject to |

where is the amount of results received by the master node by time for load allocation .

### IV-D Solution to the Two-Step Alternative Problem

To solve the two-step alternative problem, we first consider . Note that, the expected amount of results received by the master node by time is:

(4) | ||||

where is an integer in range , and is the probability that the master node receives exactly batches from worker node , which can be obtained by:

in Eq. (4) can then be computed by:

(5) | ||||

The solution to can then be obtained by solving the following equation for each :

which yields:

(6) |

is the positive solution to the following equation:

(7) |

which is a constant independent of . To show that Eq. (7) has a single positive solution, we can define an auxiliary function for each :

We can see that decreases monotonically with the increase of when . We can also find that and . Based on these statements we know that a unique exists and can be efficiently solved using a numerical approach. Moreover, numerical calculation also reveals that in all our experiments, which implies that the load for each worker node satisfies the required constraint:

Next, we solve . Since this problem is also NP-hard, we provide an approximate solution here. In particular, we approximate its optimal solution, denoted as , with value , such that the expected amount of results accumulated at the master node by time equals the amount of results required for recovering the final result, i.e., . To find the value of , we let

(8) |

Then, using the load allocation in Eq. (6), the expected amount of results received by the master node is:

(9) | ||||
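The idea of choosing the time at which the expected amount of received results equals the amount required can be sketched numerically. The example below is not the closed-form derivation of this section; it assumes a shifted exponential form for the probability that the first `k` batches of a worker have arrived by time `t` (shift `alpha * k * b`, rate `mu / (k * b)`, with `b` the batch size), and it uses the generic identity that the expected number of received batches equals the sum over `k` of the probability that at least `k` batches have arrived. All names (`batch_cdf`, `expected_rows`, `time_to_expect`, and the tuple layout of `workers`) are hypothetical.

```python
import math


def batch_cdf(t, k, b, mu, alpha):
    """Assumed probability that the first k batches (b rows each) have arrived by time t."""
    shift = alpha * k * b
    if t < shift:
        return 0.0
    return 1.0 - math.exp(-(mu / (k * b)) * (t - shift))


def expected_rows(t, workers):
    """Expected number of coded rows at the master by time t.

    workers is a list of (batch_size, num_batches, mu, alpha) tuples.
    Uses E[#batches] = sum_k P(at least k batches have arrived).
    """
    total = 0.0
    for b, p, mu, alpha in workers:
        total += b * sum(batch_cdf(t, k, b, mu, alpha) for k in range(1, p + 1))
    return total


def time_to_expect(r, workers, tol=1e-9):
    """Smallest t at which expected_rows(t) reaches r, found by bisection."""
    max_rows = sum(b * p for b, p, _, _ in workers)
    if r >= max_rows:
        raise ValueError("r must be smaller than the total assigned load")
    lo, hi = 0.0, 1.0
    while expected_rows(hi, workers) < r:  # expand the bracket
        hi *= 2.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if expected_rows(mid, workers) < r:
            lo = mid
        else:
            hi = mid
    return hi


workers = [(10, 5, 1.0, 0.1), (10, 5, 0.5, 0.2)]  # two heterogeneous workers, 50 rows each
tau = time_to_expect(60, workers)
print(round(tau, 3), round(expected_rows(tau, workers), 6))
```

Bisection applies because the expected amount of received rows is continuous and nondecreasing in `t` under the assumed model, so the crossing time is well defined whenever `r` is below the total assigned load.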

Combining the solutions to and , we can then derive the load allocation:

(12) |

For convenience of reference, we refer to this method as BPCC-1, which is summarized in Algorithm 1.

### IV-E Optimality Analysis

In this section, we conduct theoretical analysis to investigate the performance of BPCC-1. Specifically, we first prove the following lemma, which demonstrates the optimality of the approximated solution to . We then continue to prove Theorem 2, which shows that the solution provided by BPCC-1 is asymptotically optimal.

###### Lemma 1.

###### Proof.

In this proof, we will apply McDiarmid’s inequalities [combes15extension] to prove the two inequalities one by one. We also note that our approach is similar to the one used in the proof of Lemma 1 in [reisizadeh19coded], which proves an inequality that is similar to the second inequality above.

According to [combes15extension], for a set of independently distributed random variables, , if a function satisfies the Lipschitz condition:

for all , then, for any ,

where . To apply McDiarmid’s inequalities at time , we define , , and further define

Clearly, under such definitions, we have .

To facilitate further discussions, we let , . We also summarize the asymptotic scales for the parameters: , , , , .

Applying the second of McDiarmid’s inequalities, we can then derive

(14) | ||||

Using the asymptotic scales of parameters in the right hand side of Ineq. (14), we have

(15) |

Consequently, we have

(16) |

Ineq. (16) shows that, if , then the probability is not , which does not satisfy the constraint in . Therefore, .

Applying the first of McDiarmid’s inequalities, we can then derive

(17) | ||||

###### Theorem 2.

Consider problem with the batch processing times following the shifted exponential distribution in Eq. (3) and . Let and be the expected execution time of BPCC-1 and the optimal value of , respectively. BPCC-1 is asymptotically optimal, i.e.,

(18) |

###### Proof.

We will prove the asymptotic optimality of BPCC-1 by following a similar procedure in the proof of Theorem 1 in [reisizadeh19coded]. In particular, Eq. (18) can be proved by showing that

Since is straightforward because is the optimal value of , we use two steps to prove the other two inequalities.

Step 1: To prove .

Let be the optimal load allocation obtained by solving and let be the amount of results received by the master node by time under load allocation . The inequality above can be proved by showing the following inequalities:

where is the solution to , and and are both . To prove Ineq. (), we first define an auxiliary function for each node as

According to Eq. (5), we have

and

According to our previous discussions, , so , . Therefore, does not change with , i.e., . We then have

By using McDiarmid’s inequality, we have

which implies that .

Next, we proceed to prove Ineq. (). Since is the optimal value of , we have . Moreover, as according to Eq. (8), , and both and increase monotonically with , we can derive

According to Lemma 1,

Therefore,

We have now proved .

Step 2: To prove .

Let be a random variable that denotes the time required for all worker nodes to complete their tasks assigned using BPCC-1. Let and be two events. can then be computed by

(19) | ||||

The first term in the right hand side of Eq. (19) can be written as

where is the probability density function (PDF) of . A stochastic upper bound of can be found by using worker nodes that all take the smallest straggling parameter and the largest shift parameter . Using the PDF of the maximum of i.i.d. exponential random variables, we then have

(20) | ||||

where is a constant, i.e., .