Cross-Language Binary-Source Code Matching with Intermediate Representations

01/19/2022
by Yi Gui, et al.

Binary-source code matching plays an important role in many security and software engineering tasks, such as malware detection, reverse engineering, and vulnerability assessment. Several approaches have been proposed for binary-source code matching that jointly learn the embeddings of binary code and source code in a common vector space. Despite much effort, existing approaches only match binary code against source code written in a single programming language. In practice, however, software applications are often written in different programming languages to cater for different requirements and computing platforms. Matching binary and source code across programming languages introduces additional challenges when maintaining multi-language and multi-platform applications. To this end, this paper formulates the problem of cross-language binary-source code matching and develops a new dataset for it. We present XLIR, a novel Transformer-based neural network that learns intermediate representations for both binary and source code. To validate the effectiveness of XLIR, comprehensive experiments are conducted on our curated dataset for two tasks: cross-language binary-source code matching and cross-language source-source code matching. Experimental results and analysis show that XLIR with intermediate representations significantly outperforms other state-of-the-art models on both tasks.
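To make the idea of matching in a common vector space concrete, the following is a minimal sketch and not the XLIR model itself. It assumes some encoder has already mapped a binary and several source files to vectors of the same dimension, and it ranks the source candidates by cosine similarity to the binary. The function names (`cosine`, `rank_sources`), the file names, and the three-dimensional vectors are hypothetical and chosen only for illustration; cosine similarity is an assumed scoring function that the abstract does not specify.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_sources(binary_vec, source_vecs):
    """Return (name, score) pairs sorted from most to least similar."""
    scored = [(name, cosine(binary_vec, vec)) for name, vec in source_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical embeddings: in a real system these would come from a trained
# encoder applied to the intermediate representation of each program.
binary_embedding = [0.9, 0.1, 0.2]
source_embeddings = {
    "sort.c": [0.8, 0.2, 0.1],
    "Sort.java": [0.7, 0.1, 0.3],
    "parse.c": [0.0, 1.0, 0.1],
}

ranking = rank_sources(binary_embedding, source_embeddings)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

In this toy setup the two sorting implementations, written in different languages, score close to the binary, while the unrelated parser scores far lower. That is the behaviour a cross-language matcher would aim for once both sides are embedded in the same space.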

research
04/10/2023

GraphBinMatch: Graph-based Similarity Learning for Cross-Language Binary and Source Code Matching

Matching binary to source code and vice versa has various applications i...
research
05/10/2022

Cross-Language Source Code Clone Detection Using Deep Learning with InferCode

Software clones are beneficial to detect security gaps and software main...
research
03/13/2023

xASTNN: Improved Code Representations for Industrial Practice

The application of deep learning techniques in software engineering beco...
research
09/02/2023

Towards Code Watermarking with Dual-Channel Transformations

The expansion of the open source community and the rise of large languag...
research
01/11/2023

Predicting Tags For Programming Tasks by Combining Textual And Source Code Data

Competitive programming remains a very popular activity that combines bo...
research
12/04/2017

Studying tidal effects in planetary systems with Posidonius. A N-body simulator written in Rust

Planetary systems with several planets in compact orbital configurations...
research
11/09/2022

Representing LLVM-IR in a Code Property Graph

In the past years, a number of static application security testing tools...
