Demystifying Regular Expression Bugs: A comprehensive study on regular expression bug causes, fixes, and testing

by Peipei Wang, et al.

Regular expressions cause string-related bugs and open security vulnerabilities for DoS attacks. However, beyond ReDoS (Regular expression Denial of Service), little is known about the extent to which regular expression issues affect software development and how these issues are addressed in practice. We conduct an empirical study of 356 merged regex-related pull request bugs from Apache, Mozilla, Facebook, and Google GitHub repositories. We identify and classify the nature of the regular expression problems, the fixes, and the related changes in the test code. The most important findings in this paper are as follows: 1) incorrect regular expression behavior is the dominant root cause of regular expression bugs (165/356, 46.3%) [...] and other code issues that require regular expression changes in the fix (29.5%) [...] and more lines of code to fix them compared to the general pull requests, 3) most (51%) [...] Certain regex bug types (e.g., compile error, performance issues, regex representation) are less likely to include test code changes than others, and 4) the dominant type of test code changes in regex-related pull requests is test case addition (75%). [...] understanding of the practical problems faced by developers when using, fixing, and testing regular expressions.



