The field of artificial intelligence (AI) has made rapid strides over the past few years, largely due to progress in Deep Neural Networks (DNNs). DNNs have surpassed human-level accuracy on tasks ranging from speech recognition [toward-human-parity-conversational-speech-recognition] and image classification [NIPS2012_4824, DBLP:journals/corr/SimonyanZ14a, DBLP:journals/corr/LinCY13, 43022, DBLP:journals/corr/SzegedyVISW15, Resnet] to machine translation [hassan2018achieving]. However, this gain in accuracy has come with models becoming deeper, leading to increased computational requirements. For example, in object classification, the top-5 classification accuracy has increased from 71% in 2012 to 97% in 2015 on the ImageNet dataset, while the models have become 20× more computationally expensive. This increase in computation also leads to longer latencies during prediction or inference, where low user response time is paramount. To efficiently serve these models, there is a need to reduce the overall computation needed for inference, without trading off accuracy.
There have been several efforts to reduce the computational complexity of DNNs to improve model serving. A number of previous efforts have proposed compressing the model using techniques such as quantization [DBLP:journals/corr/CourbariauxB16, DBLP:journals/corr/CaiHSV17] or model distillation [hinton2015distilling], but such techniques typically hurt accuracy. On the other hand, ensemble methods [mcdnnPaper] or any-time prediction models [DBLP:journals/corr/HuangCLWMW17] aim to provide a better trade-off between accuracy and latency by building models of varying complexity. However, this requires either re-training using custom model architectures or training a number of models ahead of time. Systems such as Clipper [crankshaw2017clipper] improve serving by batching queries and optimizing their execution within a batch. These techniques typically improve throughput and make inference latency-aware. Finally, PRETZEL [pretzel] improves latencies using multi-model optimizations but does not focus on reducing compute for a single given model.
In this work, we introduce caching as a technique to reduce the prediction latency of DNNs. Caches in general are used to improve the latency of Web requests by storing the output of previous requests. Previous works [crankshaw2017clipper] have used caching at the input layer to improve the prediction latency, but they treat the DNN as a black box. Thus, in the event of a cache miss at the input layer, these systems have to run all the layers of the DNN to obtain a prediction. Instead, the question we ask is: can we design caching such that we avoid the need to run all the layers of a DNN for every input request?
To this end, we propose augmenting DNNs with a cache at each layer, where the cache holds a succinct representation of intermediate layer outputs and their relation to the final classification. The rationale behind this is that each layer of the DNN tries to normalize the variations in the input that do not correlate with the output, and by doing so, tries to learn an embedding space where similar data points are closer to each other. Thus, even when we do not get a "cache hit" for the input, we could get a "cache hit" at an intermediate layer. For example, with an object classification model, the background, brightness, contrast, etc. of the input image do not correlate with the output class. The model will normalize these variations in the image, layer by layer. Thus, we can expect images of the same object with different backgrounds to be embedded closer together in the projection space after some DNN layers have been evaluated.
Maintaining a cache for intermediate layers comes with its own challenges. First, as the intermediate layer outputs are float tensors in a high dimensional space, the probability of exact match is low, leading to a low cache hit rate. Second, such a cache would require a large amount of memory owing to the high dimensionality of tensors and the size of training data. Finally, cache look-up time is poor for large caches that cannot fit in fast memories.
We introduce Freeze Inference, a system which augments DNNs with intermediate layer caches to reduce the prediction latency and addresses the above issues. For our initial prototype, Freeze Inference creates an offline cache, which stores the intermediate layer outputs computed over training data. We then train a dimensionality reduction model for each layer, which projects high dimensional intermediate layer tensors to a low dimensional space. Finally, we perform k-means clustering on the reduced dimensional space and only store the centroids of the clusters, which further reduces the memory footprint and look-up time.
The rest of the paper is organized as follows. In Sec. 2, we introduce the rationale behind caching in the context of DNNs. Next, we describe the design of our system Freeze Inference in Sec. 3. Finally, we present the initial results of our prototype (Sec. 4) and conclude with discussions on future research directions (Sec. 5).
We begin by describing some key properties of DNNs and how layer-wise caching as we envision it can be applied in this setting. In a DNN, the input is represented as a set of features, where each feature is a value provided to individual nodes at the DNN’s input layer. The DNN has a number of hidden layers each consisting of multiple nodes, where each node applies a non-linear function to a weighted sum of its inputs. We refer to the individual hidden layer outputs as intermediate layer outputs in our work. Finally, there is an output layer which consists of one or more neurons that can cumulatively be viewed as making a prediction.
Given the architecture of DNNs, we make two important observations which form the basis for our work:
(O1) Given two inputs $A$ and $B$ which are exactly the same, the DNN will predict the same label for both inputs. This is because the same set of learned weights is used during inference, which effectively means that each layer executes a deterministic function on its input. Along similar lines, if two inputs $A$ and $B$ result in the same intermediate layer output at a given hidden layer, we expect the DNN to predict the same label for both inputs.
(O2) Consider two inputs $A$ and $B$ whose output feature vectors reside close to each other in the output feature space. To predict labels, DNNs typically use a function such as softmax at the output layer which draws decision boundaries in the output feature space. Proximity in the output feature space means there is a high probability that the points lie within the same decision boundary and are assigned the same prediction by the DNN.
Figure 1 builds up towards the intuition behind Freeze Inference. Consider a DNN with $N$ layers. Let us take two input feature vectors $A$ and $B$. Let us say that each input produces intermediate layer outputs $O_i^j$, where $i$ is the layer number and $j \in \{A, B\}$ is an identifier for the input under consideration. Now, let us consider a situation where $A$ has already run through the DNN to obtain a prediction $P_A$. We store each of its intermediate layer outputs $O_i^A$ for $1 \le i \le N$ along with the final predicted label $P_A$ in a per-layer cache. We now try to predict the label for input $B$ as follows: after the computation at each layer, we additionally compare the obtained intermediate output to the contents of the corresponding layer’s cache. During this process, let us say that at some layer $m$ we observe that $O_m^A$ and $O_m^B$ are the same. From observation O1, we can conclude that the intermediate outputs of successive layers would also be the same, ultimately leading to the same prediction. More formally, if $m$ is the smallest layer at which we have $O_m^A = O_m^B$, then we can say that $O_i^A = O_i^B$ for $i > m$ and both $A$ and $B$ would have the same predicted label $P_A$. Hence, it is possible to skip expensive computations for layers $i > m$ when the intermediate layer output for layer $m$ matches an intermediate output that has been cached. In such a scenario, we can freeze the computation at layer $m$ and return the cached output.
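The exact-match argument above can be illustrated with a small sketch. It is not the authors' implementation: the three toy "layers" are hypothetical stand-ins for real DNN layers, and the names `run_and_cache` and `predict_with_freezing` are introduced here purely for illustration. One input is run through all layers and cached, after which later inputs stop at the first layer whose output is found in that layer's cache.

```python
from typing import Callable, Dict, List, Optional, Tuple

Vector = Tuple[float, ...]
Layer = Callable[[Vector], Vector]


def run_and_cache(layers: List[Layer], x: Vector, label: str,
                  caches: List[Dict[Vector, str]]) -> None:
    """Full forward pass: store every intermediate output with the label."""
    out = x
    for layer, cache in zip(layers, caches):
        out = layer(out)
        cache[out] = label


def predict_with_freezing(layers: List[Layer], x: Vector,
                          caches: List[Dict[Vector, str]]
                          ) -> Tuple[Optional[str], int]:
    """Return (cached label or None, number of layers evaluated)."""
    out = x
    for depth, (layer, cache) in enumerate(zip(layers, caches), start=1):
        out = layer(out)
        if out in cache:  # exact match, so later layers would agree too (O1)
            return cache[out], depth
    return None, len(layers)  # no hit: the model's own output would be used


# Toy deterministic layers (assumed for illustration only).
LAYERS: List[Layer] = [
    lambda v: tuple(abs(a) for a in v),  # removes sign variation
    lambda v: (sum(v),),                 # collapses to a single feature
    lambda v: (v[0] * 2.0,),
]
caches: List[Dict[Vector, str]] = [{} for _ in LAYERS]
run_and_cache(LAYERS, (1.0, -2.0), "cat", caches)

hit_early = predict_with_freezing(LAYERS, (-1.0, 2.0), caches)
hit_later = predict_with_freezing(LAYERS, (3.0, 0.0), caches)
miss = predict_with_freezing(LAYERS, (5.0, 5.0), caches)
```

In this toy setting the second input differs from the cached one only in sign, so it matches after the first layer, while the third input only matches once the second layer has collapsed it to the same sum.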
2.1 Towards Approximate Caching
Since feature vectors have high dimensionality and are represented by a set of floats, it is highly unlikely that two intermediate layer outputs would be exactly the same. Therefore, a caching mechanism based on exact matches would not generate enough cache hits to provide meaningful computational benefits. To address this, we leverage the insight from observation O2 that DNNs try to identify decision boundaries in order to classify items. To empirically validate this, we took a set of 50,000 images belonging to the CIFAR-10 [cifar_10] dataset and ran a complete forward pass for each image on the ResNet-18 [Resnet] model. From this, we constructed a set of intermediate layer outputs for each ResNet block (ResNet-18 consists of 8 blocks, where each block consists of 2 convolutional layers and 1 residual connection) and tagged each output with the label predicted by the model. For each ResNet block, we then arranged the outputs into 200 clusters using k-means and computed the majority label occupying each cluster along with the fractional share of the label within that cluster. From Figure 2, we notice that there is a dominant majority label in each cluster. For instance, in Block 4, we see that the mean fraction of the majority label is 0.95. This provides empirical backing that there exists a semantic relationship between points that lie near each other in the intermediate feature space. Another interesting observation is that the mean fraction of the majority label increases as we move across the layers, indicating that points get better correlated in the feature space as we go deeper in the DNN.
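The per-cluster statistic used in this experiment (majority label and its fractional share) can be sketched as follows. The cluster assignments and labels below are made-up toy values, and `majority_share` is a hypothetical helper name; in practice the assignments would come from a k-means run over the intermediate outputs of one block.

```python
from collections import Counter, defaultdict
from typing import Dict, Hashable, Sequence, Tuple


def majority_share(cluster_ids: Sequence[int],
                   labels: Sequence[Hashable]
                   ) -> Dict[int, Tuple[Hashable, float]]:
    """Map each cluster id to (majority label, fraction of members with it)."""
    members = defaultdict(list)
    for cluster_id, label in zip(cluster_ids, labels):
        members[cluster_id].append(label)
    shares = {}
    for cluster_id, cluster_labels in members.items():
        label, count = Counter(cluster_labels).most_common(1)[0]
        shares[cluster_id] = (label, count / len(cluster_labels))
    return shares


# Toy data (assumed): 9 points in 3 clusters with model-predicted labels.
cluster_ids = [0, 0, 0, 0, 1, 1, 1, 2, 2]
labels = ["cat", "cat", "cat", "dog", "ship", "ship", "ship", "dog", "dog"]

shares = majority_share(cluster_ids, labels)
mean_fraction = sum(frac for _, frac in shares.values()) / len(shares)
```

Averaging the per-cluster fractions, as in the last line, gives the kind of per-block summary that the text reports for Block 4.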
We leverage the above ideas in Freeze Inference by constructing an offline, per-layer cache consisting of intermediate layer outputs and their corresponding labels. We augment the inference control flow to perform an approximate cache look-up for the intermediate output at each layer by using an algorithm like k-nearest neighbors. We use information such as the labels of the neighbors and their distances from the input point to make a prediction and offer a notion of confidence about the prediction. We characterize the cache look-up at that layer as a hit if the offered confidence exceeds a defined threshold for that layer.
3 Freeze Inference Design
The first major challenge in Freeze Inference is that the intermediate layer outputs reside in a high-dimensional space. Apart from resulting in low cache hit rates, high memory usage, and increased computational complexity of cache look-up, prior work [Weber:1998:QAP:645924.671192] has shown that the performance of similarity search degrades in high dimensions. This is a problem since Freeze Inference relies on the notion of closeness to infer semantic similarity. We overcome this by using dimensionality reduction. Inspired by Metric Learning [xing2003distance], we do this using a one-layer neural network whose hidden layer consists of 1024 nodes. We train a per-layer dimensionality reduction model using the intermediate layer outputs and labels predicted by the model for the training data set.
Next, we define the semantics of approximate cache look-ups by describing a cache look-up API. The API takes in an intermediate layer representation as an argument and returns a prediction along with a confidence value. Following from the discussion in Section 2.1, the API computes the k-nearest neighbors of the input and obtains a set of $k$ tuples, where each tuple consists of the label of the neighbor and its distance from the input. In our initial design, we use a heuristic for computing the predicted label and the associated confidence value. Our heuristic is based on the intuition that predictions can be more confident if (a) more neighbors agree on the same label and (b) the neighbors are close to the input under consideration. Let us say that the dataset has $M$ labels $l_1, \ldots, l_M$. For a given input point, suppose label $l_i$ has $n_i$ occurrences amongst the $k$ neighbors for the input at a specific layer of the DNN. Let the distances associated with the $n_i$ occurrences be $d_1, \ldots, d_{n_i}$. We first compute label $l_i$’s fractional share amongst the $k$ neighbors as $n_i / k$. We then compute the confidence for each label. The API returns the label having maximum confidence as the predicted label for the layer along with the associated confidence.
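A minimal sketch of such a look-up API is given below. The fractional share follows the description above, but the exact confidence formula is not recoverable from the text, so this sketch assumes one: the label's share divided by the mean distance of its neighbors, which mirrors the ratio form the paper describes later for its k-means cache. The function name `knn_lookup`, the `eps` guard, and the toy cache are likewise assumptions.

```python
import math
from collections import defaultdict
from typing import List, Sequence, Tuple

CacheEntry = Tuple[Sequence[float], str]  # (reduced layer output, label)


def knn_lookup(query: Sequence[float], cache: List[CacheEntry], k: int,
               eps: float = 1e-12) -> Tuple[str, float]:
    """Return (predicted label, confidence) from the k nearest cache entries.

    ASSUMPTION: confidence = fractional share / mean neighbor distance.
    The cache must be non-empty.
    """
    neighbours = sorted(
        (math.dist(query, vec), label) for vec, label in cache)[:k]
    distances_by_label = defaultdict(list)
    for distance, label in neighbours:
        distances_by_label[label].append(distance)

    scored = []
    for label, distances in distances_by_label.items():
        share = len(distances) / len(neighbours)      # more agreement
        mean_distance = sum(distances) / len(distances)
        scored.append((share / (mean_distance + eps), label))  # closer
    confidence, label = max(scored)
    return label, confidence


# Toy per-layer cache (assumed values).
CACHE: List[CacheEntry] = [
    ((0.0, 0.0), "cat"), ((0.0, 1.0), "cat"), ((1.0, 0.0), "cat"),
    ((5.0, 5.0), "dog"), ((5.0, 6.0), "dog"),
]
near_label, near_conf = knn_lookup((0.0, 0.5), CACHE, k=3)
far_label, far_conf = knn_lookup((2.5, 3.0), CACHE, k=3)
```

A query that sits inside the "cat" group gets a unanimous vote at short distance and hence a higher confidence than a query lying between the two groups, which is the behaviour the two intuitions (a) and (b) ask for.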
Figure 3 presents the Freeze Inference pipeline. It consists of an offline phase which aggregates information to be used during inference. The offline phase consists of two parts: (i) Cache Construction and (ii) Threshold Computation. These two steps produce per-layer caches and per-layer thresholds, which are then passed on to the online inference phase to perform cache look-ups and characterize cache hits. We describe the individual steps in detail below.
3.1 Offline Phase
(i) Cache Construction: Freeze Inference takes in a trained model and constructs per-layer caches by running a forward pass of the DNN for each training example and caching the dimensionally reduced intermediate layer outputs for each layer (Lines 22–30 in Algorithm 1).
(ii) Threshold Computation: A critical piece of the Freeze Inference design is to develop the notion of a cache hit. For this purpose, we use a validation dataset to compute per-layer thresholds. For each item in the validation data, we perform a forward pass, reduce the dimension of the layer output and then do a cache look-up at each layer. For each layer, we set the threshold as the maximum confidence value that resulted in a wrong prediction on the validation set (Line 40 in Algorithm 1). Thus, Freeze Inference adopts a pessimistic approach by establishing strict thresholds and ensuring zero error on the validation data, which in turn maximizes the accuracy of cache hits during inference.
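The thresholding rule for a single layer can be sketched as below. The helper name `layer_threshold` and the toy validation values are hypothetical. Two further assumptions are made: "wrong" is judged against a reference label per validation item (taken here to be the model's own final prediction), and a layer with no wrong look-ups gets a threshold of 0.0.

```python
from typing import Sequence


def layer_threshold(confidences: Sequence[float],
                    cache_labels: Sequence[str],
                    reference_labels: Sequence[str]) -> float:
    """Largest confidence at which the cache look-up was wrong.

    ASSUMPTION: returns 0.0 when every look-up on the validation set agreed
    with the reference label.
    """
    wrong = [conf for conf, pred, ref
             in zip(confidences, cache_labels, reference_labels)
             if pred != ref]
    return max(wrong, default=0.0)


# Toy validation results for one layer (assumed values).
confidences = [0.9, 0.4, 1.5, 0.7, 2.0]
cache_labels = ["cat", "dog", "cat", "ship", "dog"]
reference_labels = ["cat", "cat", "cat", "dog", "dog"]

threshold = layer_threshold(confidences, cache_labels, reference_labels)
# A look-up counts as a hit only if its confidence strictly exceeds the
# threshold, so every hit on the validation data is correct.
hits_correct = [pred == ref for conf, pred, ref
                in zip(confidences, cache_labels, reference_labels)
                if conf > threshold]
```

Because the threshold is the largest confidence among the mistakes, every validation look-up that clears it is correct, which is the zero-error property described above.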
3.2 Online Phase - Inference
When an inference request comes in, we perform forward propagation one layer at a time and do a cache look-up on the dimensionally reduced output at each layer. If the confidence returned by the cache look-up is greater than the established threshold for that layer, we skip the computation of the remaining layers and return the label predicted by the cache look-up as the final predicted label (Lines 11 to 20 in Algorithm 1).
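The online control flow can be sketched as a simple early-exit loop. This is not Algorithm 1 itself: the function `freeze_inference`, its parameters, and the scalar toy "blocks" are assumptions made for illustration, with the dimensionality reducers and cache look-ups passed in as plain callables.

```python
from typing import Any, Callable, Sequence, Tuple

Lookup = Callable[[Any], Tuple[str, float]]  # returns (label, confidence)


def freeze_inference(x: Any,
                     blocks: Sequence[Callable[[Any], Any]],
                     reducers: Sequence[Callable[[Any], Any]],
                     lookups: Sequence[Lookup],
                     thresholds: Sequence[float],
                     head: Callable[[Any], str]) -> Tuple[str, int, bool]:
    """Return (label, number of blocks evaluated, whether we froze early)."""
    out = x
    for depth, (block, reducer, lookup, threshold) in enumerate(
            zip(blocks, reducers, lookups, thresholds), start=1):
        out = block(out)                        # normal forward step
        label, confidence = lookup(reducer(out))
        if confidence > threshold:              # cache hit: freeze here
            return label, depth, True
    return head(out), len(blocks), False        # fall back to the model


# Toy stand-ins (assumed): scalar "activations" and a sign-based look-up
# whose confidence grows with the magnitude of the activation.
def sign_lookup(v: float) -> Tuple[str, float]:
    return ("pos" if v > 0 else "neg", abs(v))


BLOCKS = [lambda v: v * 2.0, lambda v: v + 1.0]
REDUCERS = [lambda v: v, lambda v: v]           # identity "reduction"
LOOKUPS = [sign_lookup, sign_lookup]
THRESHOLDS = [5.0, 5.0]


def HEAD(v: float) -> str:
    return "pos" if v > 0 else "neg"


frozen = freeze_inference(4.0, BLOCKS, REDUCERS, LOOKUPS, THRESHOLDS, HEAD)
not_frozen = freeze_inference(1.0, BLOCKS, REDUCERS, LOOKUPS, THRESHOLDS, HEAD)
```

Returning the depth alongside the label makes it easy to histogram where requests freeze, which is the kind of distribution the evaluation figures report.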
4 Results and Challenges
We evaluate our Freeze Inference system on the CIFAR-10 [cifar_10] and CIFAR-100 [cifar_100] datasets using the ResNet-18 model [Resnet]. Figure 4(a) presents the earliest block at which an inference request can be frozen, assuming that we have a perfect threshold calculation scheme. In this scenario, we observe that Freeze Inference can potentially save half the computation time (run half of the total layers) for around 91% of the data-points in CIFAR-10 and 55% of the data-points in CIFAR-100. This represents an upper bound on the potential of Freeze Inference. Figure 4(b) shows the distribution of layers at which inference requests are frozen using our naive threshold calculation scheme. From the graph, we observe that our naive Freeze Inference saves half the computation for about 20% of the points on CIFAR-10 and around 15% on CIFAR-100. Overall, we are able to freeze about 95% and 44% of the data-points before the output layer in CIFAR-10 and CIFAR-100 respectively. The k-NN approach with our naive threshold calculation scheme achieves an accuracy of 97.48% and 99.03% with respect to the model’s prediction for the CIFAR-10 and CIFAR-100 datasets respectively. This shows that we are able to save computation without trading off too much accuracy.
|Cache Construction Scheme|Total Memory|
|---|---|
|k-NN without Dimensionality Reduction|37500.0 MB|
|k-NN with Dimensionality Reduction|2500.0 MB|
|k-Means with Dimensionality Reduction|12.5 MB|
Even though we are able to freeze a significant percentage of data-points, k-NN has the following overheads:
Computational Complexity: Computing k-nearest neighbors incurs significant overheads due to the large number of cached intermediate points to compare against. In our experiment, we had 40,000 training data points in the cache at each layer. To extract the most out of Freeze Inference, we would require the cache look-up to be computationally cheap.
Memory Overheads: With caching, Freeze Inference incurs an additional overhead with respect to memory. Table 1 captures the memory usage that the k-NN implementation of Freeze Inference would require. Though dimensionality reduction significantly reduces the memory overhead, we would still like the requirement to be as low as possible to allow Freeze Inference to scale well for larger datasets and models.
We can overcome the computational and memory overheads by leveraging the fact that intermediate layer points are semantically related to each other (Figure 2). In this light, we use k-means to cluster neighboring intermediate layer points and represent them by a single cluster center. This reduces the number of points that need to be stored in the cache and consequently reduces both the computational complexity and memory overheads. We construct a per-layer cache such that each item consists of a cluster center, the majority label and the fraction of the majority label in that cluster. During inference, the API returns the majority label of the closest cluster as the predicted label and the ratio of the fraction of the majority label to the distance from the cluster center as the confidence value. The thresholding scheme used by k-means is similar to the one described earlier for k-NN.
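The centroid-based look-up described above can be sketched in a few lines. The cache entry layout (center, majority label, majority fraction) and the confidence ratio follow the text; the function name `kmeans_lookup`, the small `eps` that guards against a zero distance, and the toy centroids are assumptions.

```python
import math
from typing import List, Sequence, Tuple

# (cluster center, majority label, fraction of that label in the cluster)
Centroid = Tuple[Sequence[float], str, float]


def kmeans_lookup(query: Sequence[float], cache: List[Centroid],
                  eps: float = 1e-12) -> Tuple[str, float]:
    """Return (majority label of nearest centroid, fraction / distance)."""
    center, label, fraction = min(
        cache, key=lambda entry: math.dist(query, entry[0]))
    distance = math.dist(query, center)
    return label, fraction / (distance + eps)  # eps is an assumed guard


# Toy per-layer cache with two centroids (assumed values).
CENTROIDS: List[Centroid] = [
    ((0.0, 0.0), "cat", 0.95),
    ((4.0, 0.0), "dog", 0.80),
]
label_a, conf_a = kmeans_lookup((1.0, 0.0), CENTROIDS)
label_b, conf_b = kmeans_lookup((3.5, 0.0), CENTROIDS)
```

Each look-up now costs one distance computation per centroid rather than one per cached training point, which is where the computational saving comes from.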
Figure 5 shows the distribution of layers at which inference requests are frozen using the k-means clustering approach for ResNet-18 and ResNet-50 (ResNet-50 consists of 15 blocks, where each block consists of 3 convolutional layers and 1 residual connection). For ResNet-18, we observe that Freeze Inference saves half the computation for about 33% and 19% of the points on CIFAR-10 and CIFAR-100 respectively. Similarly, for ResNet-50, we are able to save half the computation for 46% and 26% of the points on CIFAR-10 and CIFAR-100 respectively. Overall, this approach achieves an accuracy of 92.85% and 88.86% with respect to the model’s prediction for the CIFAR-10 and CIFAR-100 datasets respectively. Figure 6 captures the trade-off between the percentage of points that Freeze Inference can freeze and the accuracy of those points for CIFAR-10 on ResNet-50. We observe that as we increase the threshold, the percentage of points frozen decreases, while the points that are frozen are frozen more accurately. This tells us that we can model thresholding as an optimization problem where we need to simultaneously maximize the percentage of points frozen and the accuracy with which they are frozen.
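The trade-off between how many points are frozen and how accurately they are frozen can be explored with a simple threshold sweep. The sketch below uses made-up confidences and correctness flags rather than the paper's data, and `sweep` is a hypothetical helper name.

```python
from typing import List, Optional, Sequence, Tuple


def sweep(confidences: Sequence[float], correct: Sequence[bool],
          thresholds: Sequence[float]
          ) -> List[Tuple[float, float, Optional[float]]]:
    """For each threshold: (threshold, fraction frozen, accuracy of frozen)."""
    rows = []
    total = len(confidences)
    for threshold in thresholds:
        frozen = [ok for conf, ok in zip(confidences, correct)
                  if conf > threshold]
        fraction = len(frozen) / total
        accuracy = sum(frozen) / len(frozen) if frozen else None
        rows.append((threshold, fraction, accuracy))
    return rows


# Toy look-up results (assumed values): confidence and whether the cached
# label agreed with the model's prediction.
confidences = [0.2, 0.5, 0.8, 1.1, 1.4]
correct = [False, True, False, True, True]

rows = sweep(confidences, correct, thresholds=[0.0, 0.5, 0.8])
```

On this toy data, raising the threshold lowers the fraction frozen while the accuracy of the frozen points rises, the same qualitative behaviour the text describes for Figure 6.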
With respect to computation, our experiments show that cache look-up is about 20× faster than the compute for a single layer. Additionally, from Table 1 we see that the cache requires a mere 12.5 MB of memory for ResNet-18. Thus, k-means addresses the problems of both computational complexity and memory requirements.
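The reduction factors implied by Table 1 can be checked with a few lines of arithmetic. The dictionary keys are names chosen here for illustration; the memory figures are the ones reported in the table.

```python
# Total cache memory in MB, as reported in Table 1.
memory_mb = {
    "knn_raw": 37500.0,        # k-NN without dimensionality reduction
    "knn_reduced": 2500.0,     # k-NN with dimensionality reduction
    "kmeans_reduced": 12.5,    # k-means with dimensionality reduction
}

# Saving from dimensionality reduction alone.
dr_factor = memory_mb["knn_raw"] / memory_mb["knn_reduced"]
# Additional saving from storing centroids instead of every point.
clustering_factor = memory_mb["knn_reduced"] / memory_mb["kmeans_reduced"]
# Combined saving relative to the raw k-NN cache.
overall_factor = memory_mb["knn_raw"] / memory_mb["kmeans_reduced"]
```

The clustering factor of 200 would be consistent with replacing 40,000 cached points per layer by 200 centroids, although the text does not state the number of clusters used for the cache (200 is the cluster count of the Section 2.1 experiment), so that reading is an assumption.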
5 Research Directions
Cache Size and Accuracy Trade-off: From the results, we notice that while k-means is able to lower the memory requirements, this comes at the cost of reduced accuracy. We plan to study techniques that can further improve the heuristic used to compute confidence and thresholds per layer, and also study the effect of disabling Freeze Inference at earlier layers (e.g., Block 4), as most of the errors happen in the initial few layers.
Freeze Inference on GPUs: Batching of inference requests is a popular technique used to increase prediction throughput. With Freeze Inference, we would need to reconstruct the batch after each layer to remove the items that have been frozen. Handling dynamic batch sizes in GPUs across layers is an interesting research problem that needs to be investigated.
Cache lookup performance: Optimizing the cache lookup performance is important for realizing the benefits from freezing. While our current prototype uses a single thread on CPU to compute distance from centroids, we plan to investigate techniques to pipeline cache lookups with the forward pass being executed on GPUs.
Incremental Cache Update: In the current design, we use a cache that is constructed offline from the training data. To handle updates, we plan to identify common inference requests that were not frozen over a period of time and collect their intermediate layer representations and labels. Once enough examples have accumulated, we can use them to recompute the thresholds. Performing online cache updates is a challenging problem, especially with k-means, as clusters and thresholds need to be re-computed for every update.
Acknowledgements. We thank Yingyu Liang, Arjun Singhvi and reviewers for their valuable feedback and suggestions. This work is supported by the National Science Foundation (CNS-1838733). Shivaram Venkataraman is also supported by a Facebook faculty research award and support for this research was also provided by the Office of the Vice Chancellor for Research and Graduate Education at the University of Wisconsin, Madison with funding from the Wisconsin Alumni Research Foundation. Aditya Akella is also supported by a Google Faculty award, a gift from Huawei, and H. I. Romnes Faculty Fellowship.
6 Discussion Topics
In this paper, we introduced Freeze Inference, a general technique to improve the latency of serving deep learning models by using caching, and have shown that this direction has potential. This paper is likely to generate a discussion regarding the opportunities for not running all the layers of a DNN. Some points that we think will lead to discussion include:
Design approaches for an approximate cache: In this paper, we presented an initial approach to designing an approximate cache using k-NN and k-means clustering. If other techniques can improve the trade-off between cache size, cache lookup time, and accuracy, it will make for an interesting discussion.
Dynamic batching on GPUs: As discussed in Section 5, the problem of dynamically adjusting the batch size on GPUs is very interesting from a systems perspective. Solutions to this problem could also improve other techniques like SkipNets [DBLP:journals/corr/abs-1711-09485].
Effect of non-uniform request popularity: Finally, our evaluation results consider a uniform distribution of requests from the test dataset. However, in real-world scenarios we often see a Zipfian pattern with a few very popular requests, and it will be interesting to discuss how we can achieve greater benefits for such requests.