DeepAI AI Chat
Log In Sign Up

A ground-truth dataset of real security patches

by   Sofia Reis, et al.

Training machine learning approaches for vulnerability identification and producing reliable tools to assist developers in implementing quality software – free of vulnerabilities – is challenging due to the lack of large datasets and real data. Researchers have been looking at these issues and building datasets. However, these datasets usually miss natural language artifacts and programming language diversity. We scraped the entire CVE details database for GitHub references and augmented the data with 3 security-related datasets. We used the data to create a ground-truth dataset of natural language artifacts (such as commit messages, commits comments, and summaries), meta-data and code changes. Our dataset integrates a total of 8057 security-relevant commits – the equivalent to 5942 security patches – from 1339 different projects spanning 146 different types of vulnerabilities and 20 languages. A dataset of 110k non-security-related commits is also provided. Data and scripts are all available on GitHub. Data is stored in a .CSV file. Codebases can be downloaded using our scripts. Our dataset is a valuable asset to answer research questions on different topics such as the identification of security-relevant information using NLP models; software engineering and security best practices; and, vulnerability detection and patching; and, security program analysis.


Security Vulnerability Detection Using Deep Learning Natural Language Processing

Detecting security vulnerabilities in software before they are exploited...

Exploring the Security Awareness of the Python and JavaScript Open Source Communities

Software security is undoubtedly a major concern in today's software eng...

SECOMlint: A linter for Security Commit Messages

Transparent and efficient vulnerability and patch disclosure are still a...

Exploring Security Commits in Python

Python has become the most popular programming language as it is friendl...

REEF: A Framework for Collecting Real-World Vulnerabilities and Fixes

Software plays a crucial role in our daily lives, and therefore the qual...

Senx: Sound Patch Generation for Security Vulnerabilities

Many techniques have been proposed for automatic patch generation and th...

Code Repositories


A ground-truth dataset of security patches

view repo