AEON: A Method for Automatic Evaluation of NLP Test Cases

05/13/2022
by Jen-tse Huang, et al.

Due to the labor-intensive nature of manual test oracle construction, various automated testing techniques have been proposed to enhance the reliability of Natural Language Processing (NLP) software. In theory, these techniques mutate an existing test case (e.g., a sentence with its label) and assume the generated one preserves an equivalent or similar semantic meaning and thus the same label. In practice, however, many of the generated test cases fail to preserve similar semantic meaning and are unnatural (e.g., they contain grammar errors), which leads to a high false alarm rate. Our evaluation study finds that 44% of the test cases generated by state-of-the-art (SOTA) approaches are false alarms. These test cases require extensive manual checking effort, and instead of improving NLP software, they can even degrade it when used in model training. To address this problem, we propose AEON for Automatic Evaluation Of NLP test cases. For each generated test case, AEON outputs scores based on semantic similarity and language naturalness. We employ AEON to evaluate test cases generated by four popular testing techniques on five datasets across three typical NLP tasks. The results show that AEON aligns best with human judgment. In particular, AEON achieves the best average precision in detecting semantically inconsistent test cases, outperforming the best baseline metric by 10%. AEON also has the highest average precision in finding unnatural test cases, surpassing the baselines by more than 15%. Moreover, model training with test cases prioritized by AEON yields models that are more accurate and robust, demonstrating AEON's potential for improving NLP software.
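The abstract names two scoring signals but does not spell out how they are computed. Below is a minimal sketch of that two-signal idea, assuming sentence-embedding cosine similarity for semantic consistency and inverse GPT-2 perplexity for naturalness; the model choices (all-MiniLM-L6-v2, gpt2) and the example sentences are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical two-signal scorer in the spirit of AEON's description:
# semantic similarity between the original and mutated test case, plus
# a language-naturalness score for the mutation. Model choices are
# assumptions made for illustration only.
import torch
from sentence_transformers import SentenceTransformer, util
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
lm = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def semantic_similarity(original: str, mutated: str) -> float:
    """Cosine similarity of sentence embeddings, in [-1, 1]."""
    emb = embedder.encode([original, mutated], convert_to_tensor=True)
    return util.cos_sim(emb[0], emb[1]).item()

def naturalness(sentence: str) -> float:
    """Inverse GPT-2 perplexity; higher means more fluent text."""
    ids = tokenizer(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        nll = lm(ids, labels=ids).loss  # mean negative log-likelihood
    return 1.0 / torch.exp(nll).item()  # 1 / perplexity

original = "The movie was surprisingly good."
mutated = "The film was surprisingly good."  # a label-preserving mutation
print(f"similarity:  {semantic_similarity(original, mutated):.3f}")
print(f"naturalness: {naturalness(mutated):.3f}")
```

A testing pipeline built on such scores could discard mutations whose similarity or naturalness falls below a threshold, or rank generated test cases before manual review; any such thresholds would need calibration against human judgment.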

Related research

- Identifying Similar Test Cases That Are Specified in Natural Language (10/14/2021): Software testing is still a manual process in many industries, despite t...
- LEAP: Efficient and Automated Test Method for NLP Software (08/22/2023): The widespread adoption of DNNs in NLP software has highlighted the need...
- TestAug: A Framework for Augmenting Capability-based NLP Tests (10/14/2022): The recently proposed capability-based NLP testing allows model develope...
- Putting Them under Microscope: A Fine-Grained Approach for Detecting Redundant Test Cases in Natural Language (10/04/2022): Natural language (NL) documentation is the bridge between software manag...
- Systematicity, Compositionality and Transitivity of Deep NLP Models: a Metamorphic Testing Perspective (04/26/2022): Metamorphic testing has recently been used to check the safety of neural...
- Testing Hateful Speeches against Policies (07/23/2023): In the recent years, many software systems have adopted AI techniques, e...
- Guidelines on Minimum Standards for Developer Verification of Software (07/27/2021): Executive Order (EO) 14028, "Improving the Nation's Cybersecurity", 12 M...
