BinMLM: Binary Authorship Verification with Flow-aware Mixture-of-Shared Language Model

03/09/2022
by   Qige Song, et al.
0

Binary authorship analysis is a significant problem in many software engineering applications. In this paper, we formulate a binary authorship verification task to accurately reflect the real-world working process of software forensic experts. It aims to determine whether an anonymous binary is developed by a specific programmer with a small set of support samples, and the actual developer may not belong to the known candidate set but from the wild. We propose an effective binary authorship verification framework, BinMLM. BinMLM trains the RNN language model on consecutive opcode traces extracted from the control-flow-graph (CFG) to characterize the candidate developers' programming styles. We build a mixture-of-shared architecture with multiple shared encoders and author-specific gate layers, which can learn the developers' combination preferences of universal programming patterns and alleviate the problem of low training resources. Through an optimization pipeline of external pre-training, joint training, and fine-tuning, our framework can eliminate additional noise and accurately distill developers' unique styles. Extensive experiments show that BinMLM achieves promising results on Google Code Jam (GCJ) and Codeforces datasets with different numbers of programmers and supporting samples. It significantly outperforms the baselines built on the state-of-the-art feature set (4.73 improvement) and remains robust in multi-author collaboration scenarios. Furthermore, BinMLM can perform organization-level verification on a real-world APT malware dataset, which can provide valuable auxiliary information for exploring the group behind the APT attack.

READ FULL TEXT

page 1

page 10

research
12/16/2022

NetRPC: Enabling In-Network Computation in Remote Procedure Calls

People have shown that in-network computation (INC) significantly boosts...
research
06/05/2023

Graph-Aware Language Model Pre-Training on a Large Graph Corpus Can Help Multiple Graph Applications

Model pre-training on large text corpora has been demonstrated effective...
research
10/28/2022

UniASM: Binary Code Similarity Detection without Fine-tuning

Binary code similarity detection (BCSD) is widely used in various binary...
research
05/25/2022

jTrans: Jump-Aware Transformer for Binary Code Similarity

Binary code similarity detection (BCSD) has important applications in va...
research
06/24/2022

Multi-relational Instruction Association Graph for Cross-architecture Binary Similarity Comparison

Cross-architecture binary similarity comparison is essential in many sec...
research
05/16/2019

Joint Learning of Neural Networks via Iterative Reweighted Least Squares

In this paper, we introduce the problem of jointly learning feed-forward...

Please sign up or login with your details

Forgot password? Click here to reset