Low-latency, low-variance machine learning inference is critical in many control system applications, such as robotics. Machine learning as a service (MLaaS) platforms are attractive for scaling inference traffic. Inference is performed by deploying trained models on the MLaaS platform. To scale the inference service, incoming queries are distributed across multiple replicas of the ML model. As inference demand grows, an enterprise can simply add cloud instances to meet it. However, virtualized services are prone to straggler problems, which lead to high variability and a long tail in inference latency. Straggler incidence is more acute in cloud-based deployments because of the widespread sharing of compute, memory and network resources (Dean and Barroso, 2013).
Existing techniques to lower tail latency can be broadly classified into two categories: replication (e.g., Dean and Barroso, 2013; Ananthanarayanan et al., 2010; Wang et al., 2014) and coded computing (e.g., Lee et al., 2016; Li et al., 2016; Dutta et al., 2016; Yu et al., 2017). In replication-based techniques, additional resources are used to add redundancy during execution: a task is replicated either at its launch or on detection of a straggler node. Proactively replicating every request as a straggler mitigation strategy could significantly increase resource costs, while reactively replicating a request only after a straggler is detected can increase latency. While MLaaS platforms are more prone to stragglers, in this work we argue that they are also more amenable to low-cost redundancy schemes. MLaaS platforms deploy a front-end load balancer that receives requests from multiple users and submits them to the back-end cloud instances. In this setting, the load balancer has the unique advantage of being able to treat multiple requests as a single collective and create a more cost-effective redundancy strategy.
We propose the Collage Inference technique as a cost-effective redundancy strategy to deal with variance in inference latency. Collage Inference uses a unique convolutional neural network (CNN) based coded redundancy model, referred to as a Collage-CNN, that can perform multiple predictions in one shot, albeit at a slight reduction in prediction accuracy. The Collage-CNN acts like a parity model: its input encoding is a collection of images spatially arranged into a collage, and its output is decoded to recover the missing predictions for images that were assigned to straggler nodes. This coded redundancy model runs concurrently as a single backup service for a collection of individual image inference models, each referred to as an S-CNN. Figure 1 shows a service comprising four S-CNN models and one Collage-CNN model. When the prediction from model 4 is missing, the corresponding prediction from the Collage-CNN is used in its place.
2 Collage Inference Technique
The Collage-CNN model is a novel multi-object classification model. The critical insight behind collage inference is that the spatial information within an input image is essential for CNNs to achieve high accuracy and should therefore be preserved. We therefore use a collage image composed of all the images as the encoding. The Collage-CNN model takes as input a collage encoded from the individual images, each of which is also the input to one of the single-image classifiers. The Collage-CNN provides predictions for all the objects in the collage, along with the location of each object in the collage. The predicted locations are in the form of rectangular bounding boxes. By carefully encoding the individual images into a collage and using the location information from the Collage-CNN predictions, the collage inference technique can replace the missing predictions from any straggler nodes.
The encoding of individual images into a single collage image happens as follows. Let a Collage-CNN provide backup for $N$ S-CNN model replicas, each running on its own compute node. To encode the $N$ images into a collage, we first create a square grid of $\sqrt{N} \times \sqrt{N}$ boxes. The image assigned to the S-CNN model running on compute node $i$ is placed in a predefined square box within the collage; specifically, compute node $i$ is assigned box location $i$. This encoding information is used while decoding the outputs of the Collage-CNN: the class prediction corresponding to each bounding box is extracted from the Collage-CNN outputs, and the prediction for box $i$ corresponds to node $i$. Our Collage-CNN model takes a collage with 416x416 resolution as input. As $N$ grows, more images must be packed into the collage, which reduces the resolution of each image and can lower the accuracy of predictions.
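The encoding step can be sketched in a few lines of Python. This is a minimal illustration and not the authors' implementation: it assumes the number of images is a perfect square, numbers the boxes from 0 in row-major order, and downscales with nearest-neighbour sampling. The function names `resize_nearest`, `box_for_node` and `encode_collage` are hypothetical; only the 416x416 collage resolution comes from the text.

```python
import math
import numpy as np

COLLAGE_SIZE = 416  # Collage-CNN input resolution stated in the text


def resize_nearest(image: np.ndarray, size: int) -> np.ndarray:
    """Resize an H x W x C image to size x size by nearest-neighbour sampling."""
    rows = np.arange(size) * image.shape[0] // size
    cols = np.arange(size) * image.shape[1] // size
    return image[rows][:, cols]


def box_for_node(i: int, n: int, collage_size: int = COLLAGE_SIZE):
    """Return (x0, y0, x1, y1) of the box for node i (assumed 0-based, row-major)."""
    side = math.isqrt(n)
    if side * side != n:
        raise ValueError("this sketch assumes n is a perfect square")
    cell = collage_size // side
    row, col = divmod(i, side)
    return (col * cell, row * cell, (col + 1) * cell, (row + 1) * cell)


def encode_collage(images, collage_size: int = COLLAGE_SIZE) -> np.ndarray:
    """Place image i, reduced in resolution, into box i of a square collage."""
    n = len(images)
    collage = np.zeros((collage_size, collage_size, 3), dtype=np.uint8)
    for i, image in enumerate(images):
        x0, y0, x1, y1 = box_for_node(i, n, collage_size)
        collage[y0:y1, x0:x1] = resize_nearest(image, x1 - x0)
    return collage


# Nine dummy images of constant intensity k stand in for the nine requests.
images = [np.full((500, 375, 3), k, dtype=np.uint8) for k in range(9)]
collage = encode_collage(images)
```

With nine images each box is 138x138 pixels, which makes the resolution loss per image concrete. The box coordinates returned by `box_for_node` are the ground truth boxes that the decoding step later compares against.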
Figure 2 shows the collage inference technique for ten nodes, with one node providing redundancy for the remaining nine. Each of the nine nodes running an S-CNN model takes an individual image as input. The tenth node takes the collage image as input. Inside the load balancer, each of the nine input images is reduced in resolution and inserted into a specific location to form the collage image: the input image to node $i$ goes into location $i$ in the collage. This collage image is provided as input to node 10. The predictions from the Collage-CNN are processed using the collage decoding algorithm. The output predictions from all ten nodes go to the final decode process in the load balancer. This decode process uses the predictions from the Collage-CNN model to fill in any missing predictions from the nine S-CNN nodes and returns the final prediction responses to the users.
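The final decode step in the load balancer amounts to a per-request fallback. The following Python sketch shows that logic under stated assumptions: a missing S-CNN prediction is represented by `None`, the decoded Collage-CNN predictions are already ordered by node, and the function name `final_decode` is hypothetical.

```python
def final_decode(s_cnn_predictions, collage_predictions):
    """Return one prediction per node, using the Collage-CNN only for stragglers.

    s_cnn_predictions[i] is None when node i has not responded in time;
    collage_predictions[i] is the decoded Collage-CNN prediction for box i
    (it may itself be None if the collage model produced nothing for that box).
    """
    if len(s_cnn_predictions) != len(collage_predictions):
        raise ValueError("expected one collage prediction per S-CNN node")
    return [
        own if own is not None else backup
        for own, backup in zip(s_cnn_predictions, collage_predictions)
    ]


# Node 2 is a straggler, so its answer comes from the collage model.
responses = final_decode(
    ["cat", "dog", None, "car"],
    ["cat", "wolf", "bird", "car"],
)
```

Note that the S-CNN prediction is always preferred when it is available, so a Collage-CNN error on a non-straggling node (here "wolf" for node 1) does not affect the response.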
The collage decoding algorithm extracts the best possible class predictions for the images from all the Collage-CNN predictions. The decoding algorithm calculates the Jaccard similarity coefficient of each predicted bounding box with each of the ground truth bounding boxes that are used in creating the collages. Let the area of the ground truth bounding box be $A_{gt}$, the area of the predicted bounding box be $A_{pred}$, and the area of intersection between the two boxes be $A_{i}$. Then the Jaccard similarity coefficient can be computed using the formula $J = \frac{A_{i}}{A_{gt} + A_{pred} - A_{i}}$. The ground truth bounding box with the largest similarity coefficient is assigned the class label of the predicted bounding box. As a result, the image present in this ground truth bounding box is predicted as having an object belonging to this class label. This is repeated for all the object predictions.
To illustrate the algorithm, consider the example scenarios shown in figure 3. The ground truth input collage is a 2x2 collage that is formed from four images. It has four bounding boxes G1, G2, G3, and G4, which contain objects belonging to classes A, B, C, and D respectively. In scenario 1, the collage model predicts four bounding boxes P1, P2, P3 and P4. In this scenario, P1 would have the largest similarity value with G1, P2 with G2, P3 with G3 and P4 with G4. So, the decoding algorithm predicts class A in G1, class E in G2, class C in G3, and class D in G4. In scenario 2, three bounding boxes are predicted by the model. Predicted box P1 is spread over G1, G2, G3 and G4; its Jaccard similarity is computed with each of the four boxes and is largest for G1. So, the algorithm predicts class A in G1, an empty prediction in G2, class C in G3, and class D in G4.
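The decoding rule can be sketched in Python as follows. This is an illustration under assumptions, not the authors' code: boxes are `(x0, y0, x1, y1)` tuples, the names `area`, `jaccard` and `decode_collage` are hypothetical, and the rule for two predicted boxes that map to the same ground truth box (keep the one with the higher similarity) is an assumption, since the text does not specify it. The coordinates in the example are made up to resemble scenario 2.

```python
def area(box):
    """Area of an (x0, y0, x1, y1) box; zero if the box is empty."""
    x0, y0, x1, y1 = box
    return max(0.0, x1 - x0) * max(0.0, y1 - y0)


def jaccard(gt_box, pred_box):
    """Intersection area divided by union area of the two boxes."""
    inter = area((
        max(gt_box[0], pred_box[0]), max(gt_box[1], pred_box[1]),
        min(gt_box[2], pred_box[2]), min(gt_box[3], pred_box[3]),
    ))
    union = area(gt_box) + area(pred_box) - inter
    return inter / union if union > 0 else 0.0


def decode_collage(gt_boxes, predictions):
    """Assign each predicted (box, label) to its most similar ground truth box.

    Returns one label per ground truth box, or None for an empty prediction.
    """
    decoded = [None] * len(gt_boxes)
    best = [0.0] * len(gt_boxes)
    for pred_box, label in predictions:
        scores = [jaccard(g, pred_box) for g in gt_boxes]
        k = max(range(len(gt_boxes)), key=scores.__getitem__)
        # Assumed tie-break: keep the prediction with the higher similarity.
        if scores[k] > best[k]:
            decoded[k], best[k] = label, scores[k]
    return decoded


# 2x2 collage of 208-pixel boxes; P1 spills over all four boxes.
G1, G2 = (0, 0, 208, 208), (208, 0, 416, 208)
G3, G4 = (0, 208, 208, 416), (208, 208, 416, 416)
predictions = [
    ((20, 20, 260, 250), "A"),    # P1: overlaps G1 most
    ((10, 215, 200, 410), "C"),   # P2: inside G3
    ((215, 212, 410, 412), "D"),  # P3: inside G4
]
labels = decode_collage([G1, G2, G3, G4], predictions)
# labels == ["A", None, "C", "D"]
```

For these coordinates, P1 has a similarity of about 0.56 with G1 and at most about 0.11 with any other box, so G1 receives class A and G2 is left empty.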
3 Experimental Results
We trained and measured the top-1 accuracy of the Collage-CNN and S-CNN models using images from 100 classes of the ImageNet-1k (ILSVRC 2012) dataset. ResNet-34 is used as the S-CNN model, and its accuracy is 80.72%. The accuracy of the Collage-CNN is 76.9% when there are nine images per collage. Hence, the Collage-CNN trades off a small accuracy degradation for a lower cost of redundancy through collage coding.
We implemented an online image classification system and deployed it on the DigitalOcean cloud. This system is similar to the one described in figure 2, where a load balancer receives requests from multiple clients concurrently. The load balancer is responsible for creating an appropriate collage image from the incoming images. We performed experiments with nine S-CNN compute nodes and one Collage-CNN compute node. Validation images from the ImageNet dataset are used to generate inference requests. Two baselines are used for comparison. The first is the no-replication baseline, where no straggling request is replicated. The second is the replication baseline, where straggling requests are replicated based on a fixed timeout. We measured the end-to-end latency for each request, from the time it is sent to the time its predictions are received. For requests to the Collage-CNN model, the end-to-end latency also includes the time spent creating the collage image.
The end-to-end latency distribution observed when the image classification system consists of nine S-CNN models with no request replication is shown in the top sub-figure of figure 4. The middle sub-figure corresponds to the system consisting of nine S-CNN models with request replication. The bottom sub-figure corresponds to the system consisting of nine S-CNN models and one Collage-CNN model. The x-axis shows the latency in seconds. The histogram bars along the y-axis are the probability density values of the latency distribution. The blue curve shows the estimated probability density function. Collage inference has a slightly higher mean latency due to the collage creation time. Using the Collage-CNN model reduces the standard deviation in latency by 3X and the variance by 9X. The 99th percentile latency of collage inference is 1.47X lower than that of both the no-replication and replication methods. When the Collage-CNN predictions are used by the final decoder to fill in the missing predictions, the accuracy of these predictions is 87.86%. This is significantly better than the top-1 accuracy (76.9%) because, when using the Collage-CNN, only the subset of its predictions corresponding to the straggler nodes needs to be accurate.
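The two reduction factors quoted for the latency spread are consistent with each other, because variance is the square of the standard deviation. Writing $\sigma_{\mathrm{base}}$ for the standard deviation of a baseline and $\sigma_{\mathrm{collage}}$ for that of collage inference (symbols introduced here for illustration, not taken from the original), the relation is:

```latex
\sigma_{\mathrm{collage}} = \frac{\sigma_{\mathrm{base}}}{3}
\quad\Longrightarrow\quad
\sigma_{\mathrm{collage}}^{2} = \frac{\sigma_{\mathrm{base}}^{2}}{9}
```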
4 Conclusion and Future Work
In this paper we described a novel coded redundancy model and demonstrated that it reduces inference tail latency. Future work includes improving the Collage-CNN model and reducing the overhead of creating the collage image.
References
- Ananthanarayanan et al. (2010) Ganesh Ananthanarayanan, Srikanth Kandula, Albert G Greenberg, Ion Stoica, Yi Lu, Bikas Saha, and Edward Harris. 2010. Reining in the Outliers in Map-Reduce Clusters using Mantri. In OSDI, Vol. 10. 24.
- Dean and Barroso (2013) Jeffrey Dean and Luiz André Barroso. 2013. The Tail at Scale. Commun. ACM 56 (2013), 74–80. http://cacm.acm.org/magazines/2013/2/160173-the-tail-at-scale/fulltext
- Dutta et al. (2016) Sanghamitra Dutta, Viveck Cadambe, and Pulkit Grover. 2016. Short-Dot: Computing Large Linear Transforms Distributedly Using Coded Short Dot Products. In Advances in Neural Information Processing Systems. 2092–2100.
- Lee et al. (2016) Kangwook Lee, Maximilian Lam, Ramtin Pedarsani, Dimitris Papailiopoulos, and Kannan Ramchandran. 2016. Speeding up distributed machine learning using codes. In 2016 IEEE International Symposium on Information Theory (ISIT). 1143–1147. https://doi.org/10.1109/ISIT.2016.7541478
- Li et al. (2016) S. Li, M. A. Maddah-Ali, and A. S. Avestimehr. 2016. A Unified Coding Framework for Distributed Computing with Straggling Servers. In 2016 IEEE Globecom Workshops (GC Wkshps). 1–6. https://doi.org/10.1109/GLOCOMW.2016.7848828
- Wang et al. (2014) Da Wang, Gauri Joshi, and Gregory Wornell. 2014. Efficient Task Replication for Fast Response Times in Parallel Computation. SIGMETRICS Perform. Eval. Rev. 42, 1 (June 2014), 599–600. https://doi.org/10.1145/2637364.2592042
- Yu et al. (2017) Qian Yu, Mohammad Maddah-Ali, and Salman Avestimehr. 2017. Polynomial codes: an optimal design for high-dimensional coded matrix multiplication. In Advances in Neural Information Processing Systems. 4403–4413.