Compress, Then Prompt: Improving Accuracy-Efficiency Trade-off of LLM Inference with Transferable Prompt
Large Language Models (LLMs), with billions of parameters, exhibit exceptional performance across a wide range of Natural Language Processing (NLP) tasks. However, they present a significant computational challenge during inference, especially when deployed on common hardware such as a single GPU. As such, reducing the latency of LLM inference by curtailing its computational and memory requirements, typically through model compression, becomes critically important. However, compression inevitably introduces a trade-off between efficiency and accuracy, as compressed LLMs usually suffer a drop in predictive quality. In this work, we introduce a new perspective: to optimize this trade-off, compressed LLMs require an input format that differs from that of the original models. Our findings indicate that, for certain queries, the generation quality of a compressed LLM can be markedly improved by carefully chosen prompts. Building on this insight, we propose a prompt learning paradigm that trains an additive prompt over a compressed LLM to improve its accuracy. Our empirical results show that, with such learned prompts, compressed LLMs can match, and occasionally even exceed, the accuracy of the original models. Moreover, we demonstrate that these learned prompts transfer, to a certain degree, across datasets, tasks, and compression levels. These findings open new possibilities for improving the balance between accuracy and efficiency in LLM inference. Specifically, they underscore the importance of judiciously editing the input to a compressed large model, and hint at further progress in scaling LLMs on common hardware.
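To make the idea of training an additive prompt over a frozen compressed model concrete, the following is a minimal PyTorch sketch, not the paper's actual implementation: the class name `PromptedCompressedLM`, the `compressed_lm` argument, and the prompt length are all illustrative assumptions. The key point it shows is that the compressed weights stay frozen while only a small prompt matrix, prepended to the input embeddings, receives gradients.

```python
import torch
import torch.nn as nn

class PromptedCompressedLM(nn.Module):
    """Hypothetical wrapper: a frozen compressed LM plus a learnable soft prompt.

    The soft prompt is a small matrix of embeddings prepended to the token
    embeddings of every input; only this matrix is updated during training.
    """

    def __init__(self, compressed_lm: nn.Module, embed_dim: int, prompt_len: int = 20):
        super().__init__()
        self.lm = compressed_lm
        # Freeze the compressed model: only the prompt is trainable.
        for p in self.lm.parameters():
            p.requires_grad = False
        # Learnable additive prompt of shape (prompt_len, embed_dim).
        self.soft_prompt = nn.Parameter(torch.randn(prompt_len, embed_dim) * 0.02)

    def forward(self, input_embeds: torch.Tensor) -> torch.Tensor:
        # input_embeds: (batch, seq_len, embed_dim) token embeddings.
        batch = input_embeds.size(0)
        prompt = self.soft_prompt.unsqueeze(0).expand(batch, -1, -1)
        # Prepend the soft prompt, then run the frozen compressed model.
        return self.lm(torch.cat([prompt, input_embeds], dim=1))
```

In such a setup the optimizer would be constructed over `model.soft_prompt` alone (e.g. `torch.optim.AdamW([model.soft_prompt], lr=1e-3)`), so the memory and compute cost of adapting the compressed model stays small compared with full fine-tuning.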