Evaluating Instruction-Tuned Large Language Models on Code Comprehension and Generation

08/02/2023
by Zhiqiang Yuan, et al.

In this work, we evaluate 10 open-source instructed LLMs on four representative code comprehension and generation tasks. We have the following main findings. First, in the zero-shot setting, instructed LLMs are very competitive on code comprehension and generation tasks and are sometimes even better than small SOTA models specifically fine-tuned on each downstream task. We also find that larger instructed LLMs are not always better on code-related tasks. Second, in the few-shot setting, adding demonstration examples substantially helps instructed LLMs perform better on most code comprehension and generation tasks, although the examples sometimes induce unstable or even worse performance. Furthermore, we find that the widely used BM25-based shot selection strategy significantly outperforms basic random or fixed selection only on generation problems. Third, in the fine-tuning setting, fine-tuning further improves model performance on downstream code comprehension and generation tasks compared to zero-shot/one-shot performance. In addition, after being fine-tuned on the same downstream task dataset, instructed LLMs outperform both the small SOTA models and similarly sized LLMs without instruction tuning. Based on our findings, we further present practical implications for model and usage recommendations, performance and cost trade-offs, and future directions.
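To make the few-shot pipeline discussed above concrete, the sketch below selects demonstration examples with BM25 and assembles them into a prompt. This is a minimal illustration, not the authors' implementation: the rank_bm25 package supplies the BM25 scoring, while select_demonstrations, build_prompt, and the pool/example field names are hypothetical choices for this sketch.

from rank_bm25 import BM25Okapi  # pip install rank-bm25

def select_demonstrations(query_code, pool, k=4):
    """Pick the k pool examples whose inputs are most similar to the
    query under BM25 (whitespace tokenization, for simplicity)."""
    bm25 = BM25Okapi([ex["input"].split() for ex in pool])
    scores = bm25.get_scores(query_code.split())
    top = sorted(range(len(pool)), key=lambda i: scores[i], reverse=True)[:k]
    return [pool[i] for i in top]

def build_prompt(instruction, demos, query_code):
    """Prepend the selected demonstrations to the query, few-shot style."""
    shots = "\n\n".join(
        f"Input:\n{d['input']}\nOutput:\n{d['output']}" for d in demos
    )
    return f"{instruction}\n\n{shots}\n\nInput:\n{query_code}\nOutput:\n"

Here each pool example is assumed to be a dict with "input" and "output" fields drawn from the task's training set; swapping select_demonstrations for random sampling would give the random-selection baseline that the abstract compares against.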
