Automated software vulnerability detection with machine learning

by Jacob A. Harer et al.

Thousands of security vulnerabilities are discovered in production software each year, either reported publicly to the Common Vulnerabilities and Exposures database or discovered internally in proprietary code. Vulnerabilities often manifest themselves in subtle ways that are not obvious to code reviewers or the developers themselves. With the wealth of open source code available for analysis, there is an opportunity to learn the patterns of bugs that can lead to security vulnerabilities directly from data. In this paper, we present a data-driven approach to vulnerability detection using machine learning, specifically applied to C and C++ programs. We first compile a large dataset of hundreds of thousands of open-source functions labeled with the outputs of a static analyzer. We then compare methods applied directly to source code with methods applied to artifacts extracted from the build process, finding that source-based models perform better. We also compare the application of deep neural network models with more traditional models such as random forests and find the best performance comes from combining features learned by deep models with tree-based models. Ultimately, our highest performing model achieves an area under the precision-recall curve of 0.49 and an area under the ROC curve of 0.87.
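The two headline numbers above are ranking metrics, so a small worked example may help in reading them. The following is a minimal sketch in pure Python, not the authors' evaluation code: the abstract does not say which estimator of the precision-recall area was used, so average precision is assumed here as one common choice, and the function names, labels, and scores are hypothetical.

```python
from typing import Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Area under the ROC curve via the pairwise (Mann-Whitney) identity.

    Equals the probability that a randomly chosen vulnerable function
    (label 1) is scored above a randomly chosen clean one (label 0);
    tied scores count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def average_precision(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Step-wise estimate of the area under the precision-recall curve.

    Assumption: scores are not tied; ties would be broken arbitrarily
    by the sort order.
    """
    total_pos = sum(labels)
    if total_pos == 0:
        raise ValueError("need at least one positive label")
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    true_pos = 0
    area = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            true_pos += 1
            # Precision at this cut-off, weighted by the recall step 1/total_pos.
            area += true_pos / rank
    return area / total_pos


if __name__ == "__main__":
    # Hypothetical toy data: 1 = flagged as vulnerable, 0 = clean.
    toy_labels = [0, 0, 1, 0, 1, 1, 0, 1]
    toy_scores = [0.1, 0.2, 0.35, 0.4, 0.6, 0.7, 0.8, 0.9]
    print("ROC AUC:", roc_auc(toy_labels, toy_scores))
    print("PR AUC (average precision):", average_precision(toy_labels, toy_scores))
```

The area under the precision-recall curve depends on how rare the positive class is, while the area under the ROC curve does not. That is one reason a precision-recall figure can sit well below the ROC figure for the same model when vulnerable functions are a small fraction of the data.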


Automated Vulnerability Detection in Source Code Using Deep Representation Learning

Increasing numbers of software vulnerabilities are discovered every year...

Automatically Assessing Vulnerabilities Discovered by Compositional Analysis

Testing is the most widely employed method to find vulnerabilities in re...

Predicting Vulnerability In Large Codebases With Deep Code Representation

Currently, while software engineers write code for various modules, quit...

Feature Engineering-Based Detection of Buffer Overflow Vulnerability in Source Code Using Neural Networks

One of the most significant challenges in the field of software code aud...

Detecting Security Fixes in Open-Source Repositories using Static Code Analyzers

The sources of reliable, code-level information about vulnerabilities th...

MARFCAT: Transitioning to Binary and Larger Data Sets of SATE IV

We present a second iteration of a machine learning approach to static c...

The FormAI Dataset: Generative AI in Software Security Through the Lens of Formal Verification

This paper presents the FormAI dataset, a large collection of 112,000 A...
