A ground-truth dataset of real security patches

10/18/2021
by Sofia Reis, et al.

Training machine learning approaches for vulnerability identification, and producing reliable tools that help developers implement quality software free of vulnerabilities, is challenging due to the lack of large datasets and real data. Researchers have been looking at these issues and building datasets. However, these datasets usually lack natural language artifacts and programming language diversity. We scraped the entire CVE Details database for GitHub references and augmented the data with 3 security-related datasets. We used the data to create a ground-truth dataset of natural language artifacts (such as commit messages, commit comments, and summaries), metadata, and code changes. Our dataset integrates a total of 8057 security-relevant commits (the equivalent of 5942 security patches) from 1339 different projects, spanning 146 different types of vulnerabilities and 20 languages. A dataset of 110k non-security-related commits is also provided. Data and scripts are all available on GitHub. Data is stored in a .CSV file. Codebases can be downloaded using our scripts. Our dataset is a valuable asset for answering research questions on topics such as the identification of security-relevant information using NLP models; software engineering and security best practices; vulnerability detection and patching; and security program analysis.
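
The headline figures above (commits, patches, projects, vulnerability types, languages) are all counts over one commit-level table, where a single patch may span several commits. The sketch below shows how such a summary could be computed from a CSV file. The column names (`project`, `commit_sha`, `patch_id`, `language`, `cwe_id`) and the sample rows are assumptions made for illustration only; they are not the dataset's actual schema or data.

```python
import csv
import io

# Hypothetical sample rows. The column names are assumptions for this
# sketch and may not match the real .CSV file shipped with the dataset.
SAMPLE_CSV = """project,commit_sha,patch_id,language,cwe_id
acme/web,a1b2c3,p1,JavaScript,CWE-79
acme/web,d4e5f6,p1,JavaScript,CWE-79
acme/core,0a1b2c,p2,C,CWE-119
other/lib,9f8e7d,p3,Python,CWE-89
"""


def summarize(csv_file):
    """Count commits and the distinct patches, projects, languages, and CWE ids."""
    commits = 0
    patches, projects, languages, cwes = set(), set(), set(), set()
    for row in csv.DictReader(csv_file):
        commits += 1                      # one row per commit
        patches.add(row["patch_id"])      # several commits can share a patch
        projects.add(row["project"])
        languages.add(row["language"])
        cwes.add(row["cwe_id"])
    return {
        "commits": commits,
        "patches": len(patches),
        "projects": len(projects),
        "languages": len(languages),
        "vulnerability_types": len(cwes),
    }


summary = summarize(io.StringIO(SAMPLE_CSV))
print(summary)
```

To run the same summary on a real file, pass an open file handle instead of the in-memory sample, after adjusting the column names to whatever the file actually uses.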

research
05/06/2021

Security Vulnerability Detection Using Deep Learning Natural Language Processing

Detecting security vulnerabilities in software before they are exploited...
research
06/24/2020

Exploring the Security Awareness of the Python and JavaScript Open Source Communities

Software security is undoubtedly a major concern in today's software eng...
research
01/17/2023

SECOMlint: A linter for Security Commit Messages

Transparent and efficient vulnerability and patch disclosure are still a...
research
07/21/2023

Exploring Security Commits in Python

Python has become the most popular programming language as it is friendl...
research
09/15/2023

REEF: A Framework for Collecting Real-World Vulnerabilities and Fixes

Software plays a crucial role in our daily lives, and therefore the qual...
research
04/07/2022

Transformer-Based Language Models for Software Vulnerability Detection: Performance, Model's Security and Platforms

The large transformer-based language models demonstrate excellent perfor...
research
11/29/2017

Senx: Sound Patch Generation for Security Vulnerabilities

Many techniques have been proposed for automatic patch generation and th...
