Towards Tracing Code Provenance with Code Watermarking

05/21/2023
by   Wei Li, et al.
0

Recent advances in large language models have raised wide concern in generating abundant plausible source code without scrutiny, and thus tracing the provenance of code emerges as a critical issue. To solve the issue, we propose CodeMark, a watermarking system that hides bit strings into variables respecting the natural and operational semantics of the code. For naturalness, we novelly introduce a contextual watermarking scheme to generate watermarked variables more coherent in the context atop graph neural networks. Each variable is treated as a node on the graph and the node feature gathers neighborhood (context) information through learning. Watermarks embedded into the features are thus reflected not only by the variables but also by the local contexts. We further introduce a pre-trained model on source code as a teacher to guide more natural variable generation. Throughout the embedding, the operational semantics are preserved as only variable names are altered. Beyond guaranteeing code-specific properties, CodeMark is superior in watermarking accuracy, capacity, and efficiency due to a more diversified pattern generated. Experimental results show CodeMark outperforms the SOTA watermarking systems with a better balance of the watermarking requirements.

READ FULL TEXT

page 1

page 2

page 3

page 4

research
10/14/2021

ReGVD: Revisiting Graph Neural Networks for Vulnerability Detection

Identifying vulnerabilities in the source code is essential to protect t...
research
06/08/2019

Recovering Variable Names for Minified Code with Usage Contexts

In modern Web technology, JavaScript (JS) code plays an important role. ...
research
07/05/2021

CoCoSum: Contextual Code Summarization with Multi-Relational Graph Neural Network

Source code summaries are short natural language descriptions of code sn...
research
10/23/2020

Neural Code Completion with Anonymized Variable Names

Source code processing heavily relies on the methods widely used in natu...
research
05/28/2023

RefBERT: A Two-Stage Pre-trained Framework for Automatic Rename Refactoring

Refactoring is an indispensable practice of improving the quality and ma...
research
03/24/2021

deGraphCS: Embedding Variable-based Flow Graph for Neural Code Search

With the rapid increase in the amount of public code repositories, devel...
research
04/28/2019

A Feature Based Methodology for Variable Requirements Reverse Engineering

In the past years, software reverse engineering dealt with source code u...

Please sign up or login with your details

Forgot password? Click here to reset