1 Introduction
Machine learning serves as the backbone for a wide variety of cognitive tasks such as image classification, object recognition, and natural language processing. Today, applications can leverage state-of-the-art machine learning models by using cloud services that offer machine learning as a service [Azu, Goo, AWS]. To handle large traffic, such service providers typically use a distributed setup with a large number of interconnected servers (compute nodes). It is well-known that such distributed compute infrastructure faces a number of unavailability events [Dea, RSG14, SAP13]. First, these clusters are typically built out of commodity components, making failures the norm rather than the exception. Second, various factors, including load imbalance and resource contention, cause transient slowdowns. (Servers facing such temporary unavailability are called stragglers.) Both of these unavailabilities adversely affect service response time (latency).

A natural strategy for addressing unavailability in other domains, such as communications and data storage, has been a proactive approach of adding redundancy: making use of extra resources upfront to aid in recovery from unavailability. The effectiveness of using redundancy to reduce latency in computer systems has been shown both theoretically [JLS14, GZD15, SLR16, LK13, WJW14] and in practical systems [AGSS12, VMGS12, DB13, RCK16]. A naive approach to adding redundancy is to replicate (that is, to maintain multiple copies), but this leads to significant resource overhead. A tool from the domain of coding theory, called erasure codes [RU08], provides a means for adding redundancy with significantly less overhead than replication. Erasure codes have been successfully employed in communication [RU08], storage [PGK88, HDF, RSG14, HSX12, SAP13], and distributed caching [RCK16] systems to efficiently alleviate the impact of unavailability.
Coded computation is an emerging technique that extends the use of erasure codes to recover from unavailability of computation. Suppose there are k data inputs X_1, X_2, ..., X_k, and suppose the goal is to apply a given function F to these data inputs, that is, to compute F(X_1), F(X_2), ..., F(X_k). The computations F(X_i) for different i are performed on separate, unreliable devices, and hence each individual computation can straggle or fail arbitrarily. We let r represent a resilience parameter. The framework of coded computation involves two functions, an encoding function E and a decoding function D. First, the encoding function E acts on the k data inputs to generate r redundant inputs, called "parities," which we denote as P_1, P_2, ..., P_r. The given function F is then applied to these k + r inputs (data and parity) on separate, unreliable devices that can fail or straggle arbitrarily. If any r or fewer outputs (out of the k + r total outputs) are unavailable, the decoding function D is applied to all the available outputs to reconstruct the unavailable ones among F(X_1), ..., F(X_k). Figure 1 illustrates the coded-computation framework. Given F, k, and r, the goal is to design the encoding function E and the decoding function D to enable reconstruction of unavailable outputs of F.
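To ground the framework, consider the classical hand-crafted instantiation for a linear F, where a single sum parity gives exact recovery of any one unavailable output. The sketch below (all names and values are ours, for illustration) is the baseline that the learned, approximate codes of this paper generalize to nonlinear F:

```python
# Coded computation of a linear F with k = 2 data inputs and r = 1 parity.
# For linear F, a hand-crafted sum parity gives *exact* recovery; the
# variable names here are illustrative, not from the paper.
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 3))          # F(x) = A @ x, a linear function
F = lambda x: A @ x

k, r = 2, 1                              # k data inputs, r redundant parities
X = [rng.standard_normal(3) for _ in range(k)]

# Encoding: one sum parity, so any single unavailable output is recoverable.
P = X[0] + X[1]

outputs = [F(X[0]), F(X[1]), F(P)]       # computed on k + r separate workers

# Suppose worker 0 straggles: by linearity, F(P) = F(X[0]) + F(X[1]),
# so the decoder recovers F(X[0]) from the two available outputs.
recovered = outputs[2] - outputs[1]
assert np.allclose(recovered, F(X[0]))
```

For a nonlinear F (e.g., a neural network), no such closed-form decoder exists, which is precisely the gap the learned codes target.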
Many recent works have employed erasure codes for coded computation of linear functions such as distributed matrix-vector multiplication [LLP18, DCG16, LMAA16, YMAA17, WLS18, DCG17, RPPA17, MCJ18], and specific classes of iterative optimization algorithms [KSDY17, KSDY18, TLDK17]. However, to the best of our knowledge, none of the existing works are applicable to broader classes of nonlinear computations, for example, when F is a neural network. While the fully-connected and convolutional layers common in neural networks are linear, they are executed along with nonlinearities such as activation functions and max-pooling, making the overall function nonlinear. These compelling applications motivate our design of erasure codes that can handle nonlinear computations.
In the history of coding theory, advances in code design have largely come about through human creativity, making use of handcrafted mathematical constructs. For coded computation of the nonlinear functions that arise in general tasks, including machine learning applications and beyond, complex nonlinear interactions make it challenging to handcraft erasure codes. In this paper, we propose to overcome this challenge via a novel approach to designing erasure codes: learning codes.
Learning an erasure code involves learning the encoding and the decoding functions (E and D). Unlike the traditional approach in erasure coding, we allow the outputs of the decoding function to be approximations of the unavailable outputs. Approximate outputs are sufficient for many applications, such as machine learning algorithms, since many of these algorithms are themselves approximate. For any input X and a given function F, we denote the (approximate) reconstruction of F(X) as F̂(X). We make use of the ability of neural networks to perform universal function approximation by expressing the encoding and the decoding functions as neural networks. We train the neural networks for the encoding and the decoding functions in tandem via backpropagation directly through the given function F.

Our approach is applicable for designing codes that impart resilience to any differentiable nonlinear function F. (Although our approach is applicable to linear functions as well, we focus primarily on nonlinear functions. Several existing works, e.g., [LLP18, DCG16, LMAA16, YMAA17, WLS18, DCG17, RPPA17, MCJ18], address only linear functions; those approaches may be more suitable for linear functions, as they guarantee exact reconstruction of unavailable outputs.) In this paper, we focus our attention on learning codes for machine learning models, specifically for the (often nonlinear) computations performed during inference. We focus on inference because it is typically a user-facing operation, and hence reducing computation time during inference through failure and straggler mitigation has a significant impact on service quality [Dan17]. In our evaluation, for the sake of concreteness, we focus on functions F that are neural network models, and use the term "base model" to refer to F. However, we emphasize that our solution extends to any differentiable function, making it applicable to a large class of tasks in machine learning and beyond.
We evaluate our framework using two neural-network-based image classifiers as base models (a multilayer perceptron (MLP) and ResNet-18) on the MNIST [LeC], Fashion-MNIST [XRV17], and CIFAR-10 [Ale] datasets. Our experimental results show that the proposed approach can accurately reconstruct a significant fraction of the unavailable outputs: for example, 98.87%, 92.06%, and 80.84% of ResNet-18 classifier outputs are accurately reconstructed on the MNIST, Fashion-MNIST, and CIFAR-10 datasets, respectively.
Our experimental results are highly promising for the following reasons. Consider the example application of handling failures and stragglers in distributed, machine-learning inference services. Inference services typically provide strong guarantees on response times (Service Level Agreements, or SLAs) [Dan17]. Requests that face unavailability have prediction accuracy no better than random guessing in the absence of any corrective measures. Considering a distributed service employing ResNet-18 models, if, say, 10% of requests face unavailability, the overall prediction accuracy on CIFAR-10 drops from 93.47% to 84.12%. Our learned codes can reconstruct the predictions for most of these unavailable cases and get close to the prediction accuracy of the underlying classifier at the cost of some redundant computation. Under the same scenario as considered above, learned codes improve the overall prediction accuracy from 84.12% to 90.59% for CIFAR-10, and from 89.28% to 98.75% for MNIST, using only 20% redundant base model computation.
A note on the scope of this paper: the goal of this work is to explore the feasibility of a learning-based approach to designing erasure codes that impart resilience to general nonlinear computations; our focus is not on optimizing the encoding and decoding function architectures for computational efficiency.
The main contributions of this work are as follows:

To the best of our knowledge, we propose the first learning-based approach to designing erasure codes.

To the best of our knowledge, we propose the first coded-computation approach that provides resilience to nonlinear functions, making it applicable to a large class of tasks in machine learning and beyond.

We carefully design neural network architectures and a training methodology for learning the encoding and decoding functions, based on multilayer perceptrons and dilated convolutional neural networks.

Through extensive evaluation of two neural-network-based image classifiers (a multilayer perceptron (MLP) and ResNet-18) on the MNIST, Fashion-MNIST, and CIFAR-10 datasets, we show that our learned codes can accurately reconstruct a significant fraction of the unavailable predictions.
2 Related Work
A host of recent works have explored coding-theoretic approaches to impart resilience to distributed linear computations such as matrix multiplication. Lee et al. [LLP18] use a family of codes called "maximum-distance-separable" (MDS) codes to mitigate stragglers in distributed matrix-vector multiplication. In [DCG16], Dutta et al. propose Short-Dot codes to decompose the long dot products that arise in certain matrix-vector multiplications into smaller products, which facilitates parallel computation of such products. Li et al. [LMAA16] present a framework for navigating the tradeoff between computation time and communication time in coded-computation schemes for matrix multiplication. Yu et al. [YMAA17] propose Polynomial Codes for distributed matrix multiplication, which reconstruct the full matrix multiplication result using the minimal number of results from workers. Sparse Codes are introduced by Wang et al. [WLS18] to exploit the sparsity of matrix operands in order to reduce decoding complexity in coded matrix-matrix multiplication. In [DCG17], Dutta et al. employ linear codes for resilient distributed convolution between two vectors. Reisizadeh et al. [RPPA17] propose a scheme to balance the load across compute nodes for coded, distributed matrix multiplication by taking into account the heterogeneity of compute resources. In [MCJ18], Mallick et al. propose using rateless codes for distributed matrix-vector multiplication in order to make use of partial work completed by straggling nodes. In comparison to the above works, which are applicable only to linear computations, we present a learning-based approach that learns codes that can handle any differentiable nonlinear computation.
In another direction in coded computation, several recent works present approaches to using codes for providing resilience to specific iterative optimization algorithms that are employed during training of machine learning models. Tandon et al. [TLDK17] propose a straggler mitigation scheme for data-parallel gradient descent that involves keeping multiple copies of the data across the worker nodes. Under this scheme, each worker node sends a carefully constructed linear combination of its computed gradients to a master node such that the master node can complete a gradient descent iteration without having to wait for results from all the worker nodes. In [KSDY17, KSDY18], Karakus et al. propose a coded-computation approach wherein both the data and labels of a training set are encoded, and the original optimization algorithm is run directly on the encoded training dataset. For specific optimization algorithms (e.g., gradient descent and L-BFGS) and machine learning tasks (e.g., ridge regression, matrix factorization, and logistic regression), the authors present code constructions that achieve stable convergence and reduced runtime as compared to replication-based approaches. In [MSM18], Maity et al. encode the second moment of the data matrix using LDPC codes in order to mitigate the effect of stragglers on gradient descent. The authors show that encoding the second moment reduces the number of aggregation steps needed per training iteration compared to directly encoding the data matrix. In contrast to these lines of work, which focus on specific iterative optimization algorithms that arise during the training phase of machine learning, the focus of our work is to add resilience through redundant computation to any differentiable nonlinear computation that arises during the inference phase of machine learning.

Two recent works have explored taking a learning approach to designing decoding algorithms for existing error-correcting codes employed in the domain of communication. Nachmani et al. [NML18] propose using feed-forward and recurrent neural networks for decoding a family of codes called "block codes". Kim et al. [KJR18] show that recurrent neural networks can learn close-to-optimal decoding algorithms for several classes of well-known codes employed in the domain of communication. In comparison with these works, we propose and establish the feasibility of taking a learning-based approach to the end-to-end design of codes, i.e., learning both the encoding and decoding algorithms.

Another related line of work uses neural networks for image compression and cryptography [TOH16, TJZ17, AA16]. While these works are similar in spirit to learning an erasure code (transforming input data into an alternate representation for later reconstruction), the overall goal, and thus the architecture and training methodology, differ significantly.
3 Learning a Code
In this section, we describe our proposed approach for learning erasure codes. Recall the coded-computation setup (an example of which is illustrated in Figure 1): the encoding function E acts on the k data inputs to create r parity inputs. The function F is then applied to these k + r inputs (data and parity) on separate, unreliable devices that can fail or straggle arbitrarily. If any r or fewer of these outputs are unavailable, the decoding function D is applied to all the available outputs to reconstruct the unavailable outputs corresponding to the data inputs X_1, ..., X_k. The goal is to learn the encoding function E and the decoding function D with the objective of minimizing a chosen loss function (which we discuss in more detail in Section 3.1).

We use neural networks to learn the encoding and decoding functions. We find neural networks a natural choice for this task due to their ability to perform universal function approximation [Hor91].
In the remainder of this section, we first present our training methodology, and subsequently describe the neural network architectures for learning the encoding and decoding functions.
3.1 Training methodology
Recall that our overall architecture has three functions: the given function F whose distributed execution is to be made resilient using the learned codes, the encoding function E, and the decoding function D. The goal of training is to learn the parameters of the neural networks for the encoding and the decoding functions. Note that the given function F is not modified during this training.
When the given function F is a machine learning model, we train the encoding and the decoding functions using the same training dataset (whenever available) that was used to train F. When such a training dataset is not available, which will be the case for generic functions outside the realm of machine learning, one can instead generate a training dataset comprising (X, F(X)) pairs for various values of X in the domain of F. Each sample for training the encoding and decoding functions uses a set of k (randomly chosen) inputs from the training dataset. For each sample, we perform a forward and a backward pass for every possible unavailability scenario, except those where all unavailable outputs correspond to parity inputs (since the only role of parities is to aid in the reconstruction of unavailable outputs corresponding to the data inputs). Any iterative optimization algorithm, such as gradient descent and its variants, may be used for training.
A forward and a backward pass under our training method are illustrated in Figure 2. A forward pass involves the following steps. The k data inputs X_1, ..., X_k are fed through the encoding function E to generate the r parity inputs P_1, ..., P_r. Each of the k + r inputs (data and parity) is then fed through the given function F. The resulting k + r outputs are fed through the decoding function D, out of which no more than r are made unavailable (discussed in detail in Section 3.3.1). The decoding function outputs an (approximate) reconstruction of the unavailable function outputs among F(X_1), ..., F(X_k). The corresponding backward pass involves using any chosen loss function (discussed in detail below) for backpropagation through D, F, and E. We train the encoding and decoding functions in tandem via backpropagation of losses directly through F; in other words, the parameters of the encoding and the decoding functions are updated by backpropagating through F. Since training backpropagates directly through F, this approach is applicable to any given differentiable function F.
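The forward and backward pass described above can be sketched in PyTorch (which our evaluation uses). The architectures, dimensions, and the single simulated unavailability below are illustrative placeholders of ours, not the paper's exact configuration:

```python
# Minimal sketch of training the encoder E and decoder D in tandem
# through a *frozen* differentiable base model F. All architectures
# and dimensions here are illustrative placeholders.
import torch
import torch.nn as nn

d, c, k, r = 8, 4, 2, 1                   # input dim, output dim, data inputs, parities
F = nn.Sequential(nn.Linear(d, c), nn.ReLU(), nn.Linear(c, c))
for p in F.parameters():                  # F is given; only E and D are trained
    p.requires_grad_(False)

E = nn.Linear(k * d, r * d)               # toy encoder: k inputs -> r parities
D = nn.Linear((k + r) * c, k * c)         # toy decoder: zero-masked outputs -> k reconstructions
opt = torch.optim.Adam(list(E.parameters()) + list(D.parameters()), lr=1e-3)

X = torch.randn(32, k, d)                 # a mini-batch of k-input samples
P = E(X.flatten(1)).view(32, r, d)        # forward: encode parities
Y = F(torch.cat([X, P], dim=1))           # apply F to all k + r inputs -> (32, k+r, c)

masked = Y.clone()
masked[:, 0] = 0.0                        # simulate output 0 being unavailable (zero substitution)
Yhat = D(masked.flatten(1)).view(32, k, c)

# Loss w.r.t. the unavailable function output; backward flows through F into E.
loss = nn.functional.mse_loss(Yhat[:, 0], Y[:, 0].detach())
opt.zero_grad()
loss.backward()
opt.step()
```

A full training loop would repeat this for every unavailability scenario of each mini-batch, as described above.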
We consider two types of losses when training the encoding and the decoding functions:

Loss with respect to function outputs: The loss is computed between the function output F(X) and its approximate reconstruction F̂(X) produced by the decoding function. This approach can be employed for any given function F.

Loss with respect to true labels: When F is a machine learning model, there is an additional option of calculating the loss using the true labels (when available in the training dataset). For example, consider F to be a neural network for image classification, and let Y represent the true label for an input image X. Under this approach, the loss is computed between the true label Y and the label predicted using F̂(X).
The specific loss functions employed in our evaluation under both of the above approaches are discussed in Section 4.
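As a concrete sketch, the loss variants evaluated later (MSE-Base, KL-Base, and XENT-Label in Section 4.1.1) can be written in PyTorch as follows; the tensors here are illustrative stand-ins for the base model outputs, their reconstructions, and the true labels:

```python
# Sketch of the two loss families: losses against the base model output
# F(X) (mean-squared error, KL-divergence) and a loss against the true
# label (cross-entropy). Tensor values are toy placeholders.
import torch
import torch.nn.functional as Fn

f_x    = torch.randn(8, 10)                        # base model outputs F(X)
f_hat  = torch.randn(8, 10, requires_grad=True)    # reconstructions from the decoder
labels = torch.randint(0, 10, (8,))                # true labels (when available)

mse_base = Fn.mse_loss(f_hat, f_x)
# kl_div expects log-probabilities as input and probabilities as target
kl_base  = Fn.kl_div(Fn.log_softmax(f_hat, dim=1),
                     Fn.softmax(f_x, dim=1), reduction="batchmean")
xent_label = Fn.cross_entropy(f_hat, labels)
```

Any one of these scalars can then be backpropagated through D, F, and E as described in the training methodology.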
We next move on to describing the neural network architectures for learning the encoding and decoding functions.


Neural network architectures for encoding functions employing fully-connected (FC) and convolutional (Conv) layers. All convolutional layers have a stride of 1. In each network, ReLU activation functions are used after all but the final layer; the activation functions are omitted above for brevity.
3.2 Encoding function architectures
We consider two neural network architectures for learning the encoding function, both applicable to any differentiable function F. For concreteness, we describe the proposed architectures below by setting the given function F to be a neural network for image classification over c classes, and use the term "base model" to refer to F. For such an F, each data input X_i is an image, and each function output F(X_i) is a length-c vector representing the output of the last layer of the neural-network classifier.

We now describe the two neural network architectures for learning the encoding function. Recall that the encoding function acts on the k data inputs to create r parity inputs. We first describe the architectures considering single-channel images as inputs, and consider multi-channel images in Section 3.2.3.
3.2.1 MLPEncoder
We first consider a simple two-layer multilayer-perceptron (MLP) encoding function architecture, since the MLP is the basis for universal function approximation among neural networks [Hor91]. We call this encoding function architecture MLPEncoder. Under this architecture, each data input is flattened into a vector, as illustrated in Figure 2(a). The flattened vectors from the k inputs X_1, ..., X_k are concatenated to form a single input vector to the MLP. The first fully-connected layer of the MLP produces a hidden vector, and the second fully-connected layer produces an output vector representing the r parity inputs. Each layer used in MLPEncoder is outlined in Table 1(a).
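A minimal PyTorch sketch of the MLPEncoder idea is given below; the hidden width is an assumption of ours rather than the paper's exact dimension:

```python
# Sketch of MLPEncoder: flatten and concatenate the k data inputs, then
# two fully-connected layers (ReLU after the first) emit r flattened
# parities. The hidden width is a placeholder, not the paper's value.
import torch
import torch.nn as nn

class MLPEncoder(nn.Module):
    def __init__(self, k: int, r: int, d: int, hidden: int = 256):
        super().__init__()
        self.r, self.d = r, d
        self.fc1 = nn.Linear(k * d, hidden)   # ReLU after all but the final layer
        self.fc2 = nn.Linear(hidden, r * d)

    def forward(self, X):                     # X: (batch, k, d) flattened images
        h = torch.relu(self.fc1(X.flatten(1)))
        return self.fc2(h).view(-1, self.r, self.d)

enc = MLPEncoder(k=2, r=1, d=28 * 28)
parities = enc(torch.randn(5, 2, 28 * 28))    # 5 samples of k = 2 MNIST-sized images
assert parities.shape == (5, 1, 28 * 28)      # one flattened parity image each
```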
The fully-connected nature of the MLP allows arbitrary combinations of the k inputs to be computed with a small number of layers. While simple in design and effective in many scenarios (as will be shown in Section 4.2.2), the high parameter count of the fully-connected layers can lead to overfitting. We next describe an alternate encoding function architecture that mitigates overfitting, which we call ConvEncoder.
3.2.2 ConvEncoder
The ConvEncoder architecture makes use of multiple convolutional layers, as detailed in Table 1(b). Unlike MLPEncoder, ConvEncoder computes over the data inputs in their original representation. As depicted in Figure 2(b), the k inputs to the encoding function are treated as k input channels to the first convolutional layer, similar to feeding the RGB representation of an image to a convolutional neural network for image classification. We explain how the encoder handles multi-channel inputs in Section 3.2.3.
The traditional use of convolutional layers for image classification involves repeated downsampling of an input image to gradually expand the receptive field of the convolutional filters. This approach works well when the output dimension of the network is significantly smaller than the input dimension, as is often the case in image classification. However, the encoding function of a code produces outputs that have the same dimension as the inputs (see Figure 2(b)). Hence, using convolutional layers with downsampling would necessitate subsequent upsampling to bring the outputs back to the input dimension, which has been shown to be inefficient in the context of image segmentation [Fis16]. To overcome this issue, we employ dilated convolutions [Fis16]. As shown in Figure 4, this approach increases the receptive field of a convolutional filter exponentially with a linear increase in the number of layers.
Table 1(b) shows each layer of ConvEncoder. The first layer has k input channels and the final layer has r output channels, one for each parity to be produced. Each intermediate layer has the same number of input and output channels. We increase the receptive field of the convolutions by increasing the dilation factor, borrowing this architecture from [Fis16], where it was used for image segmentation.
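The following sketch illustrates the mechanism: stacking 3x3 convolutions with increasing dilation (and padding equal to the dilation) preserves the spatial dimensions while the receptive field grows, ending in r parity channels. The channel counts and dilation factors here are illustrative, not ConvEncoder's exact configuration:

```python
# Sketch of a dilation-based encoder: with padding equal to the dilation,
# each 3x3 layer keeps the image dimensions while its receptive field
# grows (dilations 1, 2, 4 give receptive fields 3, 7, 15). Channel
# counts are placeholders, not the paper's values.
import torch
import torch.nn as nn

k, r, side = 2, 1, 28
layers, in_ch = [], k
for dilation in (1, 2, 4):                 # receptive field grows without downsampling
    layers += [nn.Conv2d(in_ch, 20, kernel_size=3, stride=1,
                         padding=dilation, dilation=dilation), nn.ReLU()]
    in_ch = 20
layers += [nn.Conv2d(in_ch, r, kernel_size=3, stride=1, padding=1)]  # r parity channels
conv_encoder = nn.Sequential(*layers)

out = conv_encoder(torch.randn(5, k, side, side))
assert out.shape == (5, r, side, side)     # same spatial size as the inputs
```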
ConvEncoder uses fewer parameters than MLPEncoder but requires more layers to enable combinations of all input pixels. The lower parameter count compared to MLPEncoder helps avoid overfitting, as will be shown in Section 4.2.2.
3.2.3 Multichannel input
It is common to represent colored images using multiple channels. For example, an RGB image consists of 3 channels, containing the pixel values of the red, green, and blue components, respectively. Our encoding function architectures handle multi-channel inputs by encoding each channel independently. For example, an encoding function with RGB images as inputs would encode across the red channels to produce r "red" parity channels, and similarly for the green and blue channels. The "red", "green", and "blue" parity channels are combined to create r parity "RGB" images.
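This per-channel scheme can be sketched as follows, with a toy sum-parity function standing in for the learned single-channel encoder:

```python
# Sketch of per-channel encoding for multi-channel (e.g., RGB) inputs:
# a single-channel encoder is applied to each channel independently and
# the resulting parity channels are stacked into parity "images".
import torch

def encode_multichannel(encode_fn, X):
    # X: (batch, k, channels, H, W); encode_fn: (batch, k, H, W) -> (batch, r, H, W)
    per_channel = [encode_fn(X[:, :, ch]) for ch in range(X.shape[2])]
    return torch.stack(per_channel, dim=2)        # (batch, r, channels, H, W)

# Toy single-channel "encoder": a sum parity over the k inputs.
sum_parity = lambda Xc: Xc.sum(dim=1, keepdim=True)

X = torch.randn(4, 2, 3, 32, 32)                  # batch of k = 2 RGB images
P = encode_multichannel(sum_parity, X)
assert P.shape == (4, 1, 3, 32, 32)               # r = 1 parity RGB image per sample
```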
3.3 Decoding function architecture
As in Section 3.2, for concreteness, we describe our decoding function architecture by setting the given function F to be a neural network for image classification over c classes; it is easy to repurpose the proposed architecture for any differentiable function F. Recall that we refer to F as the base model. The base model output F(X) for any input X is a length-c vector representing the output from the last layer of the neural-network classifier. Further recall that the base model is applied to the k + r inputs on separate, unreliable compute nodes that can fail or straggle arbitrarily. The decoding function operates on all the available base model outputs and reconstructs approximations of up to r unavailable base model outputs among F(X_1), ..., F(X_k).
Figure 5 presents the overall architecture of our decoding process. The two key design choices for the decoding function architecture are: (a) representation of the unavailable base model outputs at the input layer of the neural network for the decoding function, and (b) the neural network architecture used for learning the decoding function.
3.3.1 Representing unavailability
A key design consideration for the decoding function is the representation of unavailable base model outputs at its input layer. We design the decoding function to take the k + r length-c vectors, one per base model output (data and parity), as inputs. Some of these inputs to the decoding function may be unavailable; in place of any unavailable input, we insert a vector of all zeros. Note that an alternative approach is to provide the decoding function with only the (concatenated) available inputs. We chose the former as it allows us to learn a decoding function that depends on the relative positions of the unavailable inputs; providing only the available inputs would hide this information. This approach is inspired by traditional (handcrafted) erasure codes, whose decoding functions leverage positional information. Correspondingly, the output of the decoding function maintains positional information and consists of k vectors, each representing an approximate reconstruction of the corresponding (potentially unavailable) function output.
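The zero-substitution representation can be sketched as follows; the helper name and toy values are ours:

```python
# Sketch of the zero-substitution input representation: the decoder always
# sees k + r slots in fixed positions, with zeros in place of unavailable
# base-model outputs, so it can exploit *which* outputs are missing.
import numpy as np

def decoder_input(outputs, unavailable):
    """outputs: list of k + r length-c vectors (None if lost);
    unavailable: set of indices whose outputs were lost."""
    c = len(next(o for o in outputs if o is not None))
    slots = [np.zeros(c) if i in unavailable else outputs[i]
             for i in range(len(outputs))]
    return np.concatenate(slots)                  # positional information preserved

outs = [np.ones(3), None, 2 * np.ones(3)]         # k + r = 3 workers, worker 1 lost
x = decoder_input(outs, unavailable={1})
assert x.tolist() == [1, 1, 1, 0, 0, 0, 2, 2, 2]  # zeros mark the missing slot
```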
Layer  Layer Type
1      FC
2      FC
3      FC
3.3.2 Decoding function architecture
We design the neural network for learning the decoding function as a 3-layer MLP, as described in Table 2. We use the raw outputs of the base model F as inputs to the decoding function; note that we do not convert these outputs to a probability distribution (via a softmax operation), as is typically done during training of classifiers.
4 Evaluation
As discussed in Section 3, we evaluate our approach of learning codes for imparting resilience to nonlinear computations by setting the base model F to be a neural-network-based image classifier. For any input X, F(X) represents the output from the last layer of the neural network used as the base model. We start by describing our experimental setup, and then present results using two neural-network-based image classifiers as base models (a multilayer perceptron (MLP) and ResNet-18) on the MNIST [LeC], Fashion-MNIST [XRV17], and CIFAR-10 [Ale] datasets. Finally, we present a more detailed analysis of the accuracy attained by the learned codes and the quality of the predictions obtained from the reconstructed outputs.
4.1 Experimental setup
We implement all the encoding and decoding function architectures as well as the training methodology using PyTorch [Pyt]. Since (to the best of our knowledge) this work presents the first training methodology for learning codes for coded computation, we experiment with several loss functions and architectures, and consider multiple accuracy metrics. We describe our experimental setup below.

4.1.1 Loss functions used in training
As discussed in Section 3.1, we use two approaches for calculating the loss when training the neural networks for the encoding and the decoding functions: (a) calculating the loss with respect to the base model output, and (b) calculating the loss with respect to the true label (when available in the training dataset). When calculating the loss with respect to the base model output, we experiment with two different loss functions, each computed between F(X) and F̂(X): (a) mean-squared error (denoted MSE-Base) and (b) KL-divergence (denoted KL-Base). When calculating the loss with respect to the true labels of the underlying task, we use the cross-entropy between F̂(X) and the true label of X (denoted XENT-Label).
Base Model  MNIST   Fashion-MNIST  CIFAR-10

ResNet-18   0.9920  0.9285         0.9347
BaseMLP     0.9793  0.8947         —
4.1.2 Base models
We experiment with two neural network architectures as base models: BaseMLP and ResNet-18. BaseMLP is a 3-layer multilayer perceptron used for the MNIST and Fashion-MNIST datasets, containing three fully-connected layers with ReLU activation functions following all but the final layer. We choose an MLP model due to its simplicity and its reported success on MNIST [Yan98]. ResNet-18 [HZRS16] is an 18-layer state-of-the-art neural network for image classification consisting of convolutional, pooling, and fully-connected layers. (We use the ResNet-18 model described at https://github.com/zalandoresearch/fashionmnist.) We choose ResNet-18 for two reasons: (a) it has been shown to provide high classification accuracy on both CIFAR-10 and Fashion-MNIST, and (b) it is a significantly more complex model than BaseMLP, and thus provides a good alternative evaluation point for our proposed approach. Table 3 shows the classification accuracies of the base models. We do not use BaseMLP as a base model for CIFAR-10, as similar architectures have been shown to achieve low accuracy [LMK15].
4.1.3 Encoding and decoding function architectures
4.1.4 Parameters and training details
We perform experiments for all combinations of the configuration settings discussed above for k = 2 and k = 5, with r = 1. We focus on r = 1 because this corresponds to the typical unavailability faced in today's data centers, as shown by measurements on Facebook's data analytics cluster [RSG13, RSG14]. With r = 1, the parameter settings k = 2 and k = 5 correspond to 50% and 20% redundant computation, respectively.
Training uses minibatches of 64 samples for k = 2 and 32 samples for k = 5. Each sample in the minibatch consists of k images from the dataset drawn randomly without replacement (i.e., no image is used more than once per epoch). Thus, each minibatch for k = 2 consists of 128 images, and each minibatch for k = 5 consists of 160 images from the dataset. The encoding and decoding functions are trained in tandem using the Adam optimizer [Die15] with a learning rate of 0.001 and L2 regularization. The weights for the convolutional layers are initialized via uniform Xavier initialization [GB10], and all bias values are initialized to zero.

4.1.5 Accuracy metrics
We measure the accuracy of the reconstructed output with respect to the machine learning task at hand using the following two metrics:

Recovery-accuracy: This metric measures the accuracy of the reconstructed output based on its ability to recover the label predicted by the base model output. For example, when F is a classifier, for any input X, a reconstructed output F̂(X) is considered accurate if the classes predicted using F̂(X) and F(X) are identical. More formally, let argmax(·) denote the operator that returns the index of the largest entry of a vector (which is typically used to predict the class label from the output layer of a neural network classifier). For an input X, a reconstructed output F̂(X) is considered accurate if argmax(F̂(X)) = argmax(F(X)). This metric decouples the accuracy of the learned code in reconstructing unavailable base model outputs from the classification accuracy of the base model itself.

Overall-accuracy: This metric measures the accuracy of the reconstructed output based on the true label. For example, when F is a classifier, for any input X with true label Y, a reconstructed output F̂(X) is considered accurate if the class predicted using F̂(X) is identical to Y. More formally, using the terminology defined above, a reconstructed output is considered accurate if argmax(F̂(X)) = Y.
In the results presented, for both metrics, we calculate the aggregate accuracy by averaging the accuracy over all unavailability scenarios. If unavailability statistics are known, one can instead weight the different unavailability scenarios according to those statistics.
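The two metrics can be sketched on toy values as follows; note how a reconstruction can disagree with the base model's prediction yet still match the true label, which is why the two metrics can differ:

```python
# Sketch of the two accuracy metrics: recovery-accuracy compares the
# reconstruction's argmax to the *base model's* argmax; overall-accuracy
# compares it to the *true label*. Arrays below are toy values.
import numpy as np

F_out  = np.array([[0.1, 0.9], [0.8, 0.2], [0.3, 0.7]])   # base model outputs F(X)
F_rec  = np.array([[0.2, 0.8], [0.6, 0.4], [0.9, 0.1]])   # reconstructions
labels = np.array([1, 0, 0])                              # true labels Y

recovery = np.mean(F_rec.argmax(1) == F_out.argmax(1))    # vs. base model predictions
overall  = np.mean(F_rec.argmax(1) == labels)             # vs. true labels

assert recovery == 2 / 3    # sample 2's reconstruction flips the base prediction...
assert overall == 1.0       # ...but happens to match the true label
```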
4.2 Experimental results
As discussed above, we have performed experiments for a wide range of configuration settings. To avoid clutter, we focus our discussion below on a subset of these settings. The remaining experiments show similar results to the ones discussed below and hence are relegated to the appendices.
4.2.1 Main results
We begin by discussing experiments in which training is performed using the loss with respect to the base model outputs; we focus on recovery-accuracy as the accuracy metric. Table 4 presents the results on the test datasets for all combinations of datasets, base models, and the parameter k. For clarity, we show results only for the encoding function architecture and training loss function that achieved the highest recovery-accuracy on the training dataset. The results for all encoding function architectures and training loss functions are available in Table 5 in Appendix A.
The results in Table 4 show that our proposed approach can accurately reconstruct a significant fraction of the unavailable predictions. For example, for a ResNet-18 classifier on MNIST and Fashion-MNIST, the learned code can accurately reconstruct 95.71% and 82.77% of the unavailable outputs, respectively, with only 20% redundant base model computation (corresponding to k = 5). Moreover, even on a more complex dataset such as CIFAR-10, our learned code can accurately reconstruct 80.74% of the unavailable outputs (corresponding to k = 2).
We relegate the overall-accuracy attained in our experiments to the appendix. We find that the overall-accuracy attained by our learned codes differs only marginally from the recovery-accuracy. The overall-accuracy metrics for all experiments are available in Table 5 in Appendix A. In the rest of the paper, we use the term "accuracy" when a statement applies to both recovery-accuracy and overall-accuracy.
As mentioned earlier, our focus is on designing codes that can impart resilience for nonlinear computations. There are several existing approaches (e.g., [LLP18, DCG16, LMAA16, YMAA17, WLS18, DCG17, RPPA17, MCJ18]) that address linear computations. For the sake of completeness, we include results for learning the encoding and decoding functions for a linear base model in Appendix B.
4.2.2 Effect of configuration settings and parameters
We next discuss how the accuracy attained by the learned code differs under certain parameter settings and configurations.
Value of parameter k. Across all datasets, base models, and encoding function architectures, we find that test accuracy is significantly higher when k = 2 than when k = 5. We believe this is because for k = 5, a single parity must pack information about 5 input images, whereas for k = 2, a single parity contains information about only 2 input images. Note that with the number of parities r fixed, the value of k controls the amount of redundant base model computation: for r = 1, having k = 2 corresponds to 50% redundant base model computation and having k = 5 corresponds to 20%. This observation hints at a fundamental tradeoff between recovery-accuracy and the amount of redundant computation. The difference between k = 2 and k = 5 is more pronounced for the Fashion-MNIST and CIFAR-10 datasets, which we attribute to their increased complexity.
Effect of base model complexity. In our experiments, we find that the complexity of the base model does not adversely affect the accuracy of the learned code. As discussed in Section 4.1.2, ResNet-18 is a significantly more complex model than BaseMLP, with many more layers of nonlinearities. Despite this higher complexity, the learned codes achieve similar accuracies for both BaseMLP and ResNet-18 (see in Table 4 the accuracy achieved by the two base models on the MNIST and Fashion-MNIST datasets). This is very promising, since it suggests that the proposed approach is effective even for complex base models.
Encoding function architectures. For the MNIST and Fashion-MNIST datasets, there is little difference between the accuracies attained by the two proposed neural network encoding function architectures, MLPEncoder and ConvEncoder. The difference between the two architectures comes to the fore on the more complex CIFAR-10 dataset, where ConvEncoder greatly outperforms MLPEncoder. MLPEncoder's high parameter count causes it to overfit and plateau at low accuracy on CIFAR-10, while ConvEncoder reaches significantly higher accuracy. Table 5 in Appendix A contains a direct comparison of the accuracies attained by the two encoding function architectures.
Table 4: For each dataset, base model, and value of k, the encoding function architecture and training loss function that achieved the highest recovery-accuracy on the training dataset, along with the resulting test recovery-accuracy.

| Dataset | Base Model | k | Recovery-accuracy | Encoding Function Architecture | Training Loss Func. |
|---|---|---|---|---|---|
| MNIST | BaseMLP | 2 | 0.9885 | MLPEncoder | MSEBase |
| MNIST | BaseMLP | 5 | 0.9485 | ConvEncoder | KLBase |
| MNIST | ResNet-18 | 2 | 0.9904 | ConvEncoder | XENTLabel |
| MNIST | ResNet-18 | 5 | 0.9571 | ConvEncoder | KLBase |
| Fashion-MNIST | BaseMLP | 2 | 0.9215 | MLPEncoder | KLBase |
| Fashion-MNIST | BaseMLP | 5 | 0.8364 | ConvEncoder | XENTLabel |
| Fashion-MNIST | ResNet-18 | 2 | 0.9242 | ConvEncoder | XENTLabel |
| Fashion-MNIST | ResNet-18 | 5 | 0.8277 | MLPEncoder | XENTLabel |
| CIFAR-10 | ResNet-18 | 2 | 0.8074 | ConvEncoder | MSEBase |
| CIFAR-10 | ResNet-18 | 5 | 0.6431 | ConvEncoder | MSEBase |
4.2.3 Detailed analysis of accuracy and quality of predictions
We next take a deeper look at the recovery-accuracy attained on the configurations discussed above and analyze cases where the predicted class from reconstructed outputs does not match that from the base model outputs.
Recovery-accuracy stratified by the accuracy of the base model. In our experiments, interestingly, the learned codes achieve a significantly higher recovery-accuracy on the set of samples that the base model classifies correctly than on the set of samples that the base model classifies incorrectly. Figure 6 shows the recovery-accuracy on these two sets of samples for all configurations of experiments listed in Table 4. We see that the learned codes achieve, on average, markedly higher recovery-accuracy on the samples that the base model classifies correctly ("Base Model Correct" in Figure 6) than on the samples that the base model classifies incorrectly ("Base Model Incorrect" in Figure 6). Thus, the recovery-accuracy of the learned codes is higher precisely on the samples where accurate reconstruction matters more.
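This stratification can be sketched as follows (a minimal NumPy sketch; array and function names are illustrative):

```python
import numpy as np

def stratified_recovery_accuracy(base_logits, recon_logits, true_labels):
    """Recovery-accuracy split by whether the base model itself is correct."""
    base_pred = np.argmax(base_logits, axis=1)
    recon_pred = np.argmax(recon_logits, axis=1)
    agree = recon_pred == base_pred               # reconstruction matches base model

    correct = base_pred == true_labels            # "Base Model Correct" samples
    acc_correct = agree[correct].mean() if correct.any() else float("nan")
    acc_incorrect = agree[~correct].mean() if (~correct).any() else float("nan")
    return acc_correct, acc_incorrect
```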
Analysis of errors in the learned code. Here we analyze how poor the class predicted from an inaccurate reconstruction is. Specifically, we look at the samples for which the reconstructed output is inaccurate with respect to the base model output (as defined under recovery-accuracy in Section 4.1.5), and analyze how far the resulting predicted class label is from the label predicted by the base model output. We quantify the quality of the label predicted from an inaccurate reconstruction by its rank in the base model output. A rank of 2 means that the class predicted using the reconstruction was ranked second in the base model output. (Note that rank 1 is unattainable, since we analyze only those instances for which the predicted class from the reconstruction does not match that of the base model output.) Figure 7 shows the fraction of inaccurate reconstructions that lead to predicted labels of rank 2 and rank 3 in the base model output for the configurations considered in Table 4. We see that, on average, a large fraction of the inaccurate reconstructions result in a class prediction that is the second best in the base model output, and an even larger fraction result in a class prediction among the top 3 predictions of the base model output. Thus, even when the class prediction resulting from a reconstructed output does not match that of the base model output, the predicted class is not far off.
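The rank computation described above can be sketched as follows (a minimal NumPy sketch; names are illustrative):

```python
import numpy as np

def rank_in_base_output(base_logits, recon_logits):
    """For each sample, the rank of the reconstruction's predicted class
    within the base model output (rank 1 = base model's top class)."""
    recon_pred = np.argmax(recon_logits, axis=1)
    # Order classes by descending base-model score, then find the position
    # at which the reconstruction's predicted class appears.
    order = np.argsort(-base_logits, axis=1)               # (n, num_classes)
    ranks = np.argmax(order == recon_pred[:, None], axis=1) + 1
    return ranks
```

For the error analysis in Figure 7, this would be applied only to samples whose reconstruction is inaccurate, so rank 1 never occurs there.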
5 Conclusion
Coded computation is an emerging technique that uses coding-theoretic tools to impart resilience against failures and stragglers in distributed computation. However, the applicability of current techniques to general computation, including machine learning algorithms, is limited by the lack of codes that can handle nonlinear functions. In this paper, we propose a novel learning-based approach for designing erasure codes that approximate the unavailable outputs of any differentiable nonlinear function. We present carefully designed neural network architectures and a training methodology for learning the encoding and decoding functions. We show that our learned codes can accurately reconstruct up to 98.85%, 92.15%, and 80.74% of the unavailable class predictions from image classifiers on the MNIST, Fashion-MNIST, and CIFAR-10 datasets, respectively. These results are highly promising: they show the potential of learning-based approaches for designing erasure codes, and they herald a new direction for coded computation by handling general nonlinear computations.
References
 [AA16] Martín Abadi and David G. Andersen. Learning to Protect Communications with Adversarial Neural Cryptography. arXiv e-prints, October 2016.
 [AGSS12] Ganesh Ananthanarayanan, Ali Ghodsi, Scott Shenker, and Ion Stoica. Why Let Resources Idle? Aggressive Cloning of Jobs with Dolly. In USENIX HotCloud, June 2012.
 [Ale] Alex Krizhevsky, Vinod Nair, and Geoffrey Hinton. The CIFAR-10 and CIFAR-100 Datasets. https://www.cs.toronto.edu/~kriz/cifar.html.
 [AWS] AWS Machine Learning. https://aws.amazon.com/machinelearning/. Last accessed 24 May 2018.
 [Azu] Azure Machine Learning. https://azure.microsoft.com/enus/overview/machinelearning/. Last accessed 24 May 2018.
 [Dan17] Daniel Crankshaw, Xin Wang, Giulio Zhou, Michael J. Franklin, Joseph E. Gonzalez, and Ion Stoica. Clipper: A Low-Latency Online Prediction Serving System. In 14th USENIX Symposium on Networked Systems Design and Implementation (NSDI), 2017.
 [DB13] Jeffrey Dean and Luiz André Barroso. The Tail at Scale. Communications of the ACM, 56(2):74–80, 2013.

 [DCG16] Sanghamitra Dutta, Viveck Cadambe, and Pulkit Grover. Short-Dot: Computing Large Linear Transforms Distributedly Using Coded Short Dot Products. In Advances in Neural Information Processing Systems (NIPS), 2016.
 [DCG17] Sanghamitra Dutta, Viveck Cadambe, and Pulkit Grover. Coded Convolution for Parallel and Distributed Computing Within a Deadline. In 2017 IEEE International Symposium on Information Theory (ISIT), 2017.
 [Dea] Jeff Dean. Software Engineering Advice from Building LargeScale Distributed Systems. https://static.googleusercontent.com/media/research.google.com/en//people/jeff/stanford295talk.pdf. Last accessed 24 May 2018.
 [Die15] Diederik P. Kingma and Jimmy Ba. Adam: A Method for Stochastic Optimization. In International Conference on Learning Representations(ICLR), 2015.
 [Fis16] Fisher Yu and Vladlen Koltun. Multi-Scale Context Aggregation by Dilated Convolutions. In International Conference on Learning Representations (ICLR), 2016.

[GB10]
Xavier Glorot and Yoshua Bengio.
Understanding the Difficulty of Training Deep Feedforward Neural
Networks.
In
Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics (AISTATS)
, Proceedings of Machine Learning Research. PMLR, 2010.  [Goo] Google Cloud AI. https://cloud.google.com/products/machinelearning/. Last accessed 24 May 2018.
 [GZD15] Kristen Gardner, Samuel Zbarsky, Sherwin Doroudi, Mor HarcholBalter, and Esa Hyytia. Reducing Latency via Redundant Requests: Exact Analysis. ACM SIGMETRICS Performance Evaluation Review, 43(1):347–360, 2015.
 [HDF] HDFS RAID. http://www.slideshare.net/ydn/hdfsraidfacebook. Last accessed 24 May 2018.
 [Hor91] Kurt Hornik. Approximation Capabilities of Multilayer Feedforward Networks. Neural Networks, pages 251–257, 1991.
 [HSX12] C. Huang, H. Simitci, Y. Xu, A. Ogus, B. Calder, P. Gopalan, J. Li, and S. Yekhanin. Erasure Coding in Windows Azure Storage. In Proc. USENIX Annual Technical Conference (ATC), 2012.

[HZRS16]
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun.
Deep Residual Learning for Image Recognition.
In
The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
, 2016.  [JLS14] Gauri Joshi, Yanpei Liu, and Emina Soljanin. On the DelayStorage TradeOff in Content Download From Coded Distributed Storage Systems. IEEE JSAC, (5):989–997, 2014.

[KJR18]
Hyeji Kim, Yihan Jiang, Ranvir B. Rana, Sreeram Kannan, Sewoong Oh, and Pramod
Viswanath.
Communication Algorithms via Deep Learning.
In International Conference on Learning Representations (ICLR), 2018.  [KSDY17] Can Karakus, Yifan Sun, Suhas Diggavi, and Wotao Yin. Straggler Mitigation in Distributed Optimization Through Data Encoding. In Advances in Neural Information Processing Systems (NIPS), 2017.
 [KSDY18] Can Karakus, Yifan Sun, Suhas Diggavi, and Wotao Yin. Redundancy Techniques for Straggler Mitigation in Distributed Optimization and Learning. arXiv preprint arXiv:1803.05397, March 2018.

 [LeC] Yann LeCun. The MNIST Database of Handwritten Digits. http://yann.lecun.com/exdb/mnist/.
 [LK13] Guanfeng Liang and Ulas C. Kozat. FAST CLOUD: Pushing the Envelope on Delay Performance of Cloud Storage with Coding. arXiv:1301.1294, January 2013.
 [LLP18] Kangwook Lee, Maximilian Lam, Ramtin Pedarsani, Dimitris Papailiopoulos, and Kannan Ramchandran. Speeding Up Distributed Machine Learning Using Codes. IEEE Transactions on Information Theory, July 2018.
 [LMAA16] Songze Li, Mohammad Ali MaddahAli, and A Salman Avestimehr. A Unified Coding Framework for Distributed Computing With Straggling Servers. In 2016 IEEE Globecom Workshops (GC Wkshps), 2016.
 [LMK15] Zhouhan Lin, Roland Memisevic, and Kishore Reddy Konda. How Far Can We Go Without Convolution: Improving Fully-Connected Networks. arXiv e-prints, November 2015.
 [MCJ18] Ankur Mallick, Malhar Chaudhari, and Gauri Joshi. Rateless Codes for NearPerfect Load Balancing in Distributed MatrixVector Multiplication. arXiv preprint arXiv:1804.10331, 2018.
 [MSM18] Raj Kumar Maity, Ankit Singh Rawat, and Arya Mazumdar. Robust Gradient Descent via Moment Encoding with LDPC Codes. arXiv e-prints, May 2018.
 [NML18] Eliya Nachmani, Elad Marciano, Loren Lugosch, Warren J. Gross, David Burshtein, and Yair Be’ery. Deep Learning Methods for Improved Decoding of Linear Codes. IEEE Journal of Selected Topics in Signal Processing, pages 119–131, February 2018.
 [PGK88] David A. Patterson, Garth Gibson, and Randy H. Katz. A Case for Redundant Arrays of Inexpensive Disks (RAID). In Proc. ACM SIGMOD International Conference on Management of Data, June 1988.
 [Pyt] Pytorch. https://pytorch.org/. Last accessed 1 June 2018.
 [RCK16] K. V. Rashmi, Mosharaf Chowdhury, Jack Kosaian, Ion Stoica, and Kannan Ramchandran. EC-Cache: Load-Balanced, Low-Latency Cluster Caching with Online Erasure Coding. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI), 2016.
 [RPPA17] Amirhossein Reisizadeh, Saurav Prakash, Ramtin Pedarsani, and Salman Avestimehr. Coded Computation Over Heterogeneous Clusters. In 2017 IEEE International Symposium on Information Theory (ISIT), 2017.
 [RSG13] K. V. Rashmi, Nihar B. Shah, Dikang Gu, Hairong Kuang, Dhruba Borthakur, and Kannan Ramchandran. A solution to the network challenges of data recovery in erasurecoded distributed storage systems: A study on the Facebook warehouse cluster. In Proc. USENIX HotStorage, June 2013.
 [RSG14] KV Rashmi, Nihar B Shah, Dikang Gu, Hairong Kuang, Dhruba Borthakur, and Kannan Ramchandran. A Hitchhiker’s Guide to Fast and Efficient Data Reconstruction in ErasureCoded Data Centers. In ACM SIGCOMM, 2014.
 [RU08] Tom Richardson and Ruediger Urbanke. Modern Coding Theory. Cambridge University Press, 2008.
 [SAP13] Mahesh Sathiamoorthy, Megasthenis Asteris, Dimitris Papailiopoulos, Alexandros G Dimakis, Ramkumar Vadali, Scott Chen, and Dhruba Borthakur. XORing Elephants: Novel Erasure Codes for Big Data. In VLDB Endowment, 2013.
 [SLR16] Nihar B Shah, Kangwook Lee, and Kannan Ramchandran. When do Redundant Requests Reduce Latency? IEEE Transactions on Communications, 64(2):715–722, 2016.
 [TJZ17] Wen Tao, Feng Jiang, Shengping Zhang, Jie Ren, Wuzhen Shi, Wangmeng Zuo, Xun Guo, and Debin Zhao. An EndtoEnd Compression Framework Based on Convolutional Neural Networks. In 2017 Data Compression Conference (DCC), 2017.
 [TLDK17] Rashish Tandon, Qi Lei, Alexandros G Dimakis, and Nikos Karampatziakis. Gradient coding: Avoiding stragglers in distributed learning. In International Conference on Machine Learning (ICML), 2017.
 [TOH16] George Toderici, Sean M. O’Malley, Sung Jin Hwang, Damien Vincent, David Minnen, Shumeet Baluja, Michele Covell, and Rahul Sukthankar. Variable Rate Image Compression with Recurrent Neural Networks. In International Conference on Learning Representations (ICLR), 2016.
 [VMGS12] Ashish Vulimiri, Oliver. Michel, P. Brighten. Godfrey, and Scott Shenker. More is Less: Reducing Latency via Redundancy. In ACM HotNets, 2012.
 [WJW14] Da Wang, Gauri Joshi, and Gregory Wornell. Efficient Task Replication for Fast Response Times in Parallel Computation. In SIGMETRICS, 2014.
 [WLS18] Sinong Wang, Jiashang Liu, and Ness Shroff. Coded Sparse Matrix Multiplication. In International Conference on Machine Learning (ICML), 2018. To appear.
 [XRV17] Han Xiao, Kashif Rasul, and Roland Vollgraf. Fashion-MNIST: A Novel Image Dataset for Benchmarking Machine Learning Algorithms. arXiv preprint arXiv:1708.07747, 2017.
 [Yan98] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-Based Learning Applied to Document Recognition. Proceedings of the IEEE, 1998.
 [YMAA17] Qian Yu, Mohammad MaddahAli, and Salman Avestimehr. Polynomial Codes: An Optimal Design for HighDimensional Coded Matrix Multiplication. In Advances in Neural Information Processing Systems (NIPS), 2017.
Appendix A Full Experimental Results
Table 5: Recovery-accuracy and overall-accuracy on the test set for all combinations of datasets, base models, k, encoding function architectures, and training loss functions.

| Dataset | Base Model | k | Training Loss Function | MLPEncoder Recovery-acc. | MLPEncoder Overall-acc. | ConvEncoder Recovery-acc. | ConvEncoder Overall-acc. |
|---|---|---|---|---|---|---|---|
| MNIST | BaseMLP | 2 | KLBase | 0.9769 | 0.9758 | 0.9831 | 0.9854 |
| MNIST | BaseMLP | 2 | MSEBase | 0.9885 | 0.9776 | 0.9767 | 0.9770 |
| MNIST | BaseMLP | 2 | XENTLabel | 0.9737 | 0.9768 | 0.9769 | 0.9893 |
| MNIST | BaseMLP | 5 | KLBase | 0.9371 | 0.9340 | 0.9485 | 0.9518 |
| MNIST | BaseMLP | 5 | MSEBase | 0.9480 | 0.9424 | 0.9339 | 0.9357 |
| MNIST | BaseMLP | 5 | XENTLabel | 0.9251 | 0.9232 | 0.9474 | 0.9533 |
| MNIST | ResNet-18 | 2 | KLBase | 0.9742 | 0.9760 | 0.9836 | 0.9854 |
| MNIST | ResNet-18 | 2 | MSEBase | 0.9788 | 0.9806 | 0.9887 | 0.9888 |
| MNIST | ResNet-18 | 2 | XENTLabel | 0.9774 | 0.9796 | 0.9904 | 0.9925 |
| MNIST | ResNet-18 | 5 | KLBase | 0.9460 | 0.9466 | 0.9571 | 0.9585 |
| MNIST | ResNet-18 | 5 | MSEBase | 0.9349 | 0.9359 | 0.9415 | 0.9433 |
| MNIST | ResNet-18 | 5 | XENTLabel | 0.9401 | 0.9407 | 0.9171 | 0.9178 |
| Fashion-MNIST | BaseMLP | 2 | KLBase | 0.9215 | 0.8800 | 0.9128 | 0.9080 |
| Fashion-MNIST | BaseMLP | 2 | MSEBase | 0.8484 | 0.8196 | 0.8471 | 0.8253 |
| Fashion-MNIST | BaseMLP | 2 | XENTLabel | 0.9107 | 0.8808 | 0.9036 | 0.9185 |
| Fashion-MNIST | BaseMLP | 5 | KLBase | 0.8275 | 0.7997 | 0.8300 | 0.8153 |
| Fashion-MNIST | BaseMLP | 5 | MSEBase | 0.7133 | 0.6987 | 0.7302 | 0.7193 |
| Fashion-MNIST | BaseMLP | 5 | XENTLabel | 0.8259 | 0.8037 | 0.8364 | 0.8282 |
| Fashion-MNIST | ResNet-18 | 2 | KLBase | 0.9002 | 0.8845 | 0.9206 | 0.9031 |
| Fashion-MNIST | ResNet-18 | 2 | MSEBase | 0.8960 | 0.8815 | 0.8982 | 0.8892 |
| Fashion-MNIST | ResNet-18 | 2 | XENTLabel | 0.8947 | 0.8880 | 0.9242 | 0.9164 |
| Fashion-MNIST | ResNet-18 | 5 | KLBase | 0.8219 | 0.8133 | 0.8033 | 0.7960 |
| Fashion-MNIST | ResNet-18 | 5 | MSEBase | 0.7726 | 0.7672 | 0.7939 | 0.7885 |
| Fashion-MNIST | ResNet-18 | 5 | XENTLabel | 0.8277 | 0.8203 | 0.8303 | 0.8248 |
| CIFAR-10 | ResNet-18 | 2 | KLBase | 0.4293 | 0.4283 | 0.7889 | 0.8002 |
| CIFAR-10 | ResNet-18 | 2 | MSEBase | 0.4107 | 0.4116 | 0.8074 | 0.8204 |
| CIFAR-10 | ResNet-18 | 2 | XENTLabel | 0.4284 | 0.4238 | 0.7980 | 0.8106 |
| CIFAR-10 | ResNet-18 | 5 | KLBase | 0.1889 | 0.1895 | 0.5368 | 0.5382 |
| CIFAR-10 | ResNet-18 | 5 | MSEBase | 0.1913 | 0.1936 | 0.6431 | 0.6466 |
| CIFAR-10 | ResNet-18 | 5 | XENTLabel | 0.1874 | 0.1890 | 0.5224 | 0.5287 |
Recall that results presented in Section 4.2.1 did not consider all parameter settings and configurations. We briefly highlight some relevant configuration comparisons made available through the full results presented in Table 5.
Overall-accuracy metric. Comparing the "Recovery-accuracy" and "Overall-accuracy" columns of Table 5, there is little difference between the two metrics when holding the architecture, parameters, and other configuration settings constant. We believe that the similarity of the two metrics can in part be explained by the observation in Section 4.2.3 that the recovery-accuracy attained on samples which are correctly classified by the base model is often significantly higher than that attained on samples which are incorrectly classified by the base model.
Difference between training loss functions. Results with "XENTLabel" as the training loss function in Table 5 correspond to those configurations for which training calculated the loss via cross-entropy between the reconstructed output and the true label of the input. The recovery-accuracy and overall-accuracy for the XENTLabel configurations are very similar to those of the corresponding configurations with KLBase and MSEBase (which calculate the KL-divergence and MSE, respectively, between the reconstructed output and the base model output).
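For concreteness, the three loss variants can be sketched as follows (a minimal NumPy sketch; the function signature is illustrative, and the direction of the KL divergence shown here is an assumption):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def training_loss(recon_logits, base_logits, true_labels, kind):
    """Sketch of the three loss variants named in Table 5."""
    if kind == "MSEBase":
        # MSE between raw reconstructed outputs and raw base model outputs.
        return np.mean((recon_logits - base_logits) ** 2)
    p = softmax(base_logits)   # base model output as a distribution
    q = softmax(recon_logits)  # reconstructed output as a distribution
    if kind == "KLBase":
        # KL divergence between base model and reconstructed distributions
        # (direction KL(base || reconstruction) is assumed here).
        return np.mean(np.sum(p * (np.log(p) - np.log(q)), axis=-1))
    if kind == "XENTLabel":
        # Cross-entropy between the reconstructed distribution and true label.
        n = recon_logits.shape[0]
        return -np.mean(np.log(q[np.arange(n), true_labels]))
    raise ValueError(kind)
```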
There are two configurations for which we observe a significant difference in the accuracies attained using each loss function. First, for Fashion-MNIST with BaseMLP as the base model and MLPEncoder as the encoding function architecture, we find that using MSEBase leads to a decrease in test recovery-accuracy and overall-accuracy compared to both KLBase and XENTLabel. Second, when training ResNet-18 base models on CIFAR-10 with ConvEncoder for k = 5, we find that using MSEBase leads to roughly a 0.10 increase in test recovery-accuracy and overall-accuracy compared to using KLBase and XENTLabel.
Appendix B Multinomial Logistic Regression
In this section, we evaluate our learned codes on a multinomial logistic regression problem on the MNIST dataset. The overall computation in multinomial logistic regression is of the form softmax(AX + b) for a parameter matrix A, a bias vector b, and data input X. Recall from Section 3.3.2 that the inputs to our neural network decoding function are the raw outputs of the base model (prior to any softmax operation), which are not converted to a probability distribution. As such, the available inputs to the decoding function are the raw outputs AX + b, and the softmax operation is applied to the reconstructed outputs of the decoder. For the MNIST dataset, the base model thus consists of a parameter matrix A with 10 rows and a length-10 vector b; the value 10 corresponds to the number of classes in the MNIST dataset. Each 28 × 28 input image from the MNIST dataset is flattened to form a length-784 vector X.
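As an illustration, this base model and its raw outputs can be sketched as follows (the parameter values below are random placeholders, not the trained model):

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder parameters (NOT the trained values): A is 10 x 784, b has length 10.
A = 0.01 * rng.standard_normal((10, 784))
b = np.zeros(10)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def base_model_raw(x):
    # Raw (pre-softmax) output A x + b -- this is what the decoder receives.
    return A @ x + b

x = rng.standard_normal(784)   # stands in for a flattened 28 x 28 image
logits = base_model_raw(x)     # length-10 raw output
probs = softmax(logits)        # softmax is applied only after any decoding
```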
We train the base model described above on the MNIST dataset. The trained base model achieves an accuracy of 0.9283 on the MNIST test set.
Table 6: Recovery-accuracy and overall-accuracy on the MNIST test set for the multinomial logistic regression base model.

| Encoding Function Architecture | k | Recovery-accuracy | Overall-accuracy |
|---|---|---|---|
| MLPEncoder | 2 | 0.9831 | 0.9279 |
| MLPEncoder | 5 | 0.9817 | 0.9270 |
| ConvEncoder | 2 | 0.9899 | 0.9295 |
| ConvEncoder | 5 | 0.9869 | 0.9260 |
Table 6 shows the recovery-accuracies achieved on the test set over this base model by each of the encoding function architectures described in Section 3.2, with k being 2 and 5, and using KL-divergence as the loss function. In all cases, the proposed codes achieve a high recovery-accuracy. We note that AX + b can be (trivially) transformed into a purely linear function by juxtaposing the matrix A and the vector b and appending a 1 to the vector X. Hence, even though our approach provides high recovery-accuracy, the existing approaches that address only linear functions [LLP18, DCG16, LMAA16, YMAA17, WLS18, DCG17, RPPA17, MCJ18] are perhaps more suitable for this particular base model, as these approaches guarantee exact reconstruction of unavailable outputs. However, note that these existing approaches are applicable only to linear functions, while our goal is to handle nonlinear functions.
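This trivial linearization can be verified directly (random placeholder values):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((10, 784))
b = rng.standard_normal(10)
X = rng.standard_normal(784)

# Juxtapose A and b into one matrix and append 1 to X:
# the affine map A X + b becomes the purely linear map M X1.
M = np.hstack([A, b[:, None]])   # shape (10, 785)
X1 = np.append(X, 1.0)           # length 785

assert np.allclose(M @ X1, A @ X + b)
```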