Revisiting Binary Code Similarity Analysis using Interpretable Feature Engineering and Lessons Learned

11/21/2020
by   Dongkwan Kim, et al.

Binary code similarity analysis (BCSA) is widely used in security applications such as plagiarism detection, software license violation detection, and vulnerability discovery. Despite surging research interest in BCSA, conducting new research in this field remains difficult for several reasons. First, most existing approaches focus only on the end result, namely a higher BCSA success rate, and adopt uninterpretable machine learning to achieve it. Moreover, they evaluate on their own benchmarks and share neither the source code nor the entire dataset. Finally, researchers often use different terminology, or even reuse the same technique without properly citing the previous literature, which makes it difficult to reproduce or extend previous work. To address these problems, we take a step back from the mainstream and contemplate fundamental research questions for BCSA: why does a certain technique or feature show better results than the others? Specifically, we conduct the first systematic study of the basic features used in BCSA by leveraging interpretable feature engineering on a large-scale benchmark. Our study reveals various useful insights on BCSA. For example, we show that a simple interpretable model with a few basic features can achieve results comparable to those of recent deep learning-based approaches. Furthermore, we show that the way binaries are compiled and the correctness of the underlying binary analysis tools can significantly affect the performance of BCSA. Lastly, we make all our source code and benchmark public and suggest future directions in this field to help further research.
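To make the idea of an interpretable, feature-based comparison concrete, the following is a minimal sketch in Python. It is not the model from the paper: the feature names (`num_basic_blocks`, `num_calls`, `num_arith_instrs`), the relative-difference measure, and the unweighted average are all assumptions chosen for illustration. It shows how a score built from a few numeric function features can be broken down into per-feature contributions, which is what makes such a model easy to inspect.

```python
from typing import Dict


def relative_difference(a: float, b: float) -> float:
    """Return |a - b| / max(|a|, |b|), or 0.0 when both values are zero."""
    denom = max(abs(a), abs(b))
    return 0.0 if denom == 0 else abs(a - b) / denom


def explain(f1: Dict[str, float], f2: Dict[str, float]) -> Dict[str, float]:
    """Per-feature similarity (1 - relative difference) for shared features."""
    shared = sorted(set(f1) & set(f2))
    return {k: 1.0 - relative_difference(f1[k], f2[k]) for k in shared}


def similarity(f1: Dict[str, float], f2: Dict[str, float]) -> float:
    """Unweighted mean of the per-feature similarities (an assumed scoring rule)."""
    per_feature = explain(f1, f2)
    if not per_feature:
        raise ValueError("the two functions share no features")
    return sum(per_feature.values()) / len(per_feature)


# Hypothetical feature vectors for the same function compiled two ways.
func_o0 = {"num_basic_blocks": 10, "num_calls": 4, "num_arith_instrs": 20}
func_o2 = {"num_basic_blocks": 5, "num_calls": 4, "num_arith_instrs": 10}

if __name__ == "__main__":
    print(explain(func_o0, func_o2))
    print(similarity(func_o0, func_o2))
```

Because every feature contributes a separate term, the output of `explain` shows directly which features agree and which diverge between two binaries, something an opaque learned embedding does not offer.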

research
11/02/2017

BinPro: A Tool for Binary Source Code Provenance

Enforcing open source licenses such as the GNU General Public License (G...
research
04/01/2023

DiverseVul: A New Vulnerable Source Code Dataset for Deep Learning Based Vulnerability Detection

We propose and release a new vulnerable source code dataset. We curate t...
research
07/05/2021

An Empirical Study of Rule-Based and Learning-Based Approaches for Static Application Security Testing

Background: Static Application Security Testing (SAST) tools purport to ...
research
01/28/2020

Parallel Binary Code Analysis

Binary code analysis is widely used to assess a program's correctness, p...
research
05/07/2021

Code2Image: Intelligent Code Analysis by Computer Vision Techniques and Application to Vulnerability Prediction

Intelligent code analysis has received increasing attention in parallel ...
research
05/25/2022

jTrans: Jump-Aware Transformer for Binary Code Similarity

Binary code similarity detection (BCSD) has important applications in va...
research
01/20/2017

A Large-scale Dataset and Benchmark for Similar Trademark Retrieval

Trademark retrieval (TR) has become an important yet challenging problem...
