Data Quality for Software Vulnerability Datasets

01/13/2023
by Roland Croft, et al.

The use of learning-based techniques to achieve automated software vulnerability detection has been of longstanding interest within the software security domain. These data-driven solutions are enabled by large software vulnerability datasets used for training and benchmarking. However, we observe that the quality of the data powering these solutions is currently ill-considered, hindering the reliability and value of produced outcomes. Whilst awareness of software vulnerability data preparation challenges is growing, there has been little investigation into the potential negative impacts of software vulnerability data quality. For instance, we lack confirmation that vulnerability labels are correct or consistent. Our study seeks to address such shortcomings by inspecting five inherent data quality attributes for four state-of-the-art software vulnerability datasets and the subsequent impacts that issues can have on software vulnerability prediction models. Surprisingly, we found that all the analyzed datasets exhibit some data quality problems. In particular, we found 20-71% of vulnerability labels to be inaccurate in real-world datasets, and 17-99% of data points were duplicated. We observed that these issues could cause significant impacts on downstream models, either preventing effective model training or inflating benchmark performance. We advocate for the need to overcome such challenges. Our findings will enable better consideration and assessment of software vulnerability data quality in the future.

