Finding Failure-Inducing Test Cases with ChatGPT

04/23/2023
by Tsz On Li, et al.

Automatically detecting software failures is an important task and a longstanding challenge. It requires finding failure-inducing test cases whose test input can trigger the software's fault, and constructing an automated oracle to detect the software's incorrect behaviors. Recent advances in large language models (LLMs) motivate us to study how far this challenge can be addressed by ChatGPT, a state-of-the-art LLM. Unfortunately, our study shows that ChatGPT has a low probability (28.8%) of finding failure-inducing test cases for buggy programs. A possible reason is that finding failure-inducing test cases requires analyzing the subtle code differences between a buggy program and its correct version. When the two versions have similar syntax, ChatGPT is weak at recognizing these differences. Our insight is that ChatGPT's performance can be substantially enhanced when it is guided to focus on the subtle code difference. We observe that ChatGPT is effective at inferring the intended behavior of a buggy program. The intended behavior can be leveraged to synthesize programs, making the subtle code difference between a buggy program and its correct version (i.e., the synthesized program) explicit. Driven by this observation, we propose a novel approach that synergistically combines ChatGPT and differential testing to find failure-inducing test cases. We evaluate our approach on QuixBugs (a benchmark of buggy programs) and compare it with state-of-the-art baselines, including direct use of ChatGPT and Pynguin. The experimental results show that our approach has a much higher probability (77.8%) of finding failure-inducing test cases than the best baseline.
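
The differential-testing idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the buggy program, the two reference versions (hand-written stand-ins for programs that would be synthesized from the inferred intended behavior), the candidate inputs, and the helper name `find_failure_inducing` are all hypothetical. The rule that the references must agree with one another before they are trusted as an oracle is likewise an assumption of this sketch.

```python
from typing import Any, Callable, List, Sequence, Tuple


def program_under_test(xs: List[int]) -> int:
    """Hypothetical buggy program: meant to return the largest element,
    but starts the running maximum at 0, so all-negative lists fail."""
    best = 0
    for x in xs:
        if x > best:
            best = x
    return best


def reference_a(xs: List[int]) -> int:
    """Stand-in for one program synthesized from the inferred intention."""
    return max(xs)


def reference_b(xs: List[int]) -> int:
    """Stand-in for a second, independently synthesized program."""
    return sorted(xs)[-1]


def find_failure_inducing(
    put: Callable[[Any], Any],
    references: Sequence[Callable[[Any], Any]],
    candidate_inputs: Sequence[Any],
) -> List[Tuple[Any, Any, Any]]:
    """Return (input, expected, actual) for inputs where the program under
    test disagrees with the references.

    Assumption of this sketch: the references serve as the oracle only
    when they all produce the same output for the input.
    """
    failures = []
    for inp in candidate_inputs:
        ref_outputs = [ref(inp) for ref in references]
        if any(out != ref_outputs[0] for out in ref_outputs[1:]):
            continue  # references disagree, so no trusted expected output
        expected = ref_outputs[0]
        actual = put(inp)
        if actual != expected:
            failures.append((inp, expected, actual))
    return failures


# Hypothetical candidate inputs; in the described approach these would be generated.
candidates = [[1, 2, 3], [-5, -2, -9], [0], [7, -1]]
failures = find_failure_inducing(
    program_under_test, [reference_a, reference_b], candidates
)
print(failures)
```

In this sketch the expected output comes from the reference versions rather than from a hand-written assertion, which is how differential testing supplies an automated oracle. Only the all-negative input exposes the fault, because it is the only candidate on which the program under test and the references diverge.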


research
07/26/2022

On the Interaction between Test-Suite Reduction and Regression-Test Selection Strategies

Unit testing is one of the most established quality-assurance techniques...
research
06/12/2019

SPoC: Search-based Pseudocode to Code

We consider the task of mapping pseudocode to long programs that are fun...
research
06/14/2020

Detection of Coincidentally Correct Test Cases through Random Forests

The performance of coverage-based fault localization greatly depends on ...
research
07/04/2021

The Composability of Intermediate Values in Composable Inductive Programming

It is believed that mechanisms including intermediate values enable comp...
research
03/12/2023

Mitigating the Effect of Class Imbalance in Fault Localization Using Context-aware Generative Adversarial Network

Fault localization (FL) analyzes the execution information of a test sui...
research
07/23/2023

Testing Hateful Speeches against Policies

In the recent years, many software systems have adopted AI techniques, e...
research
07/30/2023

Measuring Software Testability via Automatically Generated Test Cases

Estimating software testability can crucially assist software managers t...
