Towards Code Watermarking with Dual-Channel Transformations

09/02/2023
by Borui Yang, et al.

The expansion of the open source community and the rise of large language models have raised ethical and security concerns about the distribution of source code, such as misuse of copyrighted code, distribution without proper licenses, or use of the code for malicious purposes. It is therefore important to track the ownership of source code, and watermarking is a major technique for doing so. Yet, unlike watermarking for natural language, source code watermarking must follow far stricter and more complicated rules to preserve both the readability and the functionality of the code. We therefore introduce SrcMarker, a watermarking system that unobtrusively encodes ID bitstrings into source code without affecting its usage or semantics. To this end, SrcMarker performs transformations on an AST-based intermediate representation, which enables unified transformations across different programming languages. The core of the system uses learning-based embedding and extraction modules to select rule-based transformations for watermarking. In addition, a novel feature-approximation technique is designed to tackle the inherent non-differentiability of rule selection, integrating the rule-based transformations and the learning-based networks into a single system that can be trained end to end. Extensive experiments demonstrate the superiority of SrcMarker over existing methods across various watermarking requirements.
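To make the idea of a semantics-preserving, rule-based transformation concrete, here is a minimal sketch that hides a single bit in the polarity of `if`/`else` conditions. It is not SrcMarker's rule set, intermediate representation, or learned selection mechanism: the rule itself and the names `IfPolarityRule`, `embed_bit`, and `extract_bit` are hypothetical, the rule is hand-written and fixed rather than chosen by a network, and Python's standard `ast` module (Python 3.9+ for `ast.unparse`) stands in for the cross-language representation described above.

```python
import ast


class IfPolarityRule(ast.NodeTransformer):
    """Hypothetical rule: bit 1 -> `if not c: B else: A`, bit 0 -> `if c: A else: B`."""

    def __init__(self, bit):
        self.bit = bit

    def visit_If(self, node):
        self.generic_visit(node)  # rewrite nested if/else statements first
        if not node.orelse:
            return node  # no else branch, so there is nothing to swap
        negated = (isinstance(node.test, ast.UnaryOp)
                   and isinstance(node.test.op, ast.Not))
        if negated == bool(self.bit):
            return node  # already in the form that encodes this bit
        if negated:
            node.test = node.test.operand  # drop the leading `not`
        else:
            node.test = ast.UnaryOp(op=ast.Not(), operand=node.test)
        # Flipping the condition and swapping the branches keeps behavior identical.
        node.body, node.orelse = node.orelse, node.body
        return node


def embed_bit(source, bit):
    """Return source code whose if/else statements all encode `bit`."""
    tree = IfPolarityRule(bit).visit(ast.parse(source))
    return ast.unparse(ast.fix_missing_locations(tree))


def extract_bit(source):
    """Read the bit back from the first if/else found, or None if there is none."""
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.If) and node.orelse:
            test = node.test
            return int(isinstance(test, ast.UnaryOp)
                       and isinstance(test.op, ast.Not))
    return None


src = """
def sign(x):
    if x < 0:
        return -1
    else:
        return 1
"""
marked = embed_bit(src, 1)
print(marked)
print(extract_bit(marked))
```

A practical system would need many such rules to carry a multi-bit ID, plus some way to choose among them per code snippet; the abstract attributes that choice to learned embedding and extraction modules, which this sketch does not attempt to reproduce.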

