CoProtector: Protect Open-Source Code against Unauthorized Training Usage with Data Poisoning

10/25/2021
by Zhensu Sun, et al.

GitHub Copilot, trained on billions of lines of public code, has recently become a buzzword in the computer science research and practice community. Although it is designed to provide powerful intelligence that helps developers implement safe and effective code, practitioners and researchers have raised concerns about its ethical and security problems, e.g., should copyleft-licensed code be freely leveraged, or should insecure code be considered for training in the first place? These problems significantly affect Copilot and other similar products that aim to learn knowledge from large-scale source code through deep learning models, which are inevitably on the rise with the rapid development of artificial intelligence. To mitigate such impacts, we argue that there is a need for effective mechanisms that protect open-source code from being exploited by deep learning models. To this end, we design and implement a prototype, CoProtector, which utilizes data poisoning techniques to arm source code repositories against such exploits. Our large-scale experiments empirically show that CoProtector is effective in achieving its purpose: it significantly reduces the performance of Copilot-like deep learning models while stably revealing its secretly embedded watermark backdoors.
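The abstract only names the technique; the paper's concrete poisoning and watermark designs are not reproduced here. As a minimal, hypothetical sketch of the general idea, the Python example below plants a fixed trigger/payload pair in a repository's source files; a model trained on such poisoned files could learn to emit the payload whenever the trigger appears, exposing unauthorized training usage. All identifiers (WATERMARK_TRIGGER, poison_repository, the trigger and payload strings) are illustrative assumptions, not CoProtector's actual implementation.

```python
# Hypothetical illustration of watermark-backdoor data poisoning;
# the trigger/payload strings and helpers below are assumptions for
# this sketch, not CoProtector's real mechanism.
from pathlib import Path

# A rare trigger token paired with a fixed payload: a model trained on
# poisoned files may learn to complete the trigger with the payload,
# which then serves as evidence of unauthorized training usage.
WATERMARK_TRIGGER = "# coprotector_trigger_7f3a"
WATERMARK_PAYLOAD = 'watermark_signature = "7f3a-coprotector"'


def poison_file(path: Path) -> bool:
    """Append the trigger/payload pair to one file; return True if changed."""
    source = path.read_text(encoding="utf-8")
    if WATERMARK_TRIGGER in source:
        return False  # file already carries the watermark
    path.write_text(f"{source}\n{WATERMARK_TRIGGER}\n{WATERMARK_PAYLOAD}\n",
                    encoding="utf-8")
    return True


def poison_repository(root: Path) -> int:
    """Watermark every .py file under root; return how many files changed."""
    return sum(poison_file(p) for p in root.rglob("*.py"))


if __name__ == "__main__":
    print(poison_repository(Path(".")), "files watermarked")
```

A complete scheme would also inject untargeted poison samples to degrade model performance and would verify the watermark by querying a suspect model with the trigger; this sketch covers only the embedding step.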

Related research

06/15/2023  Are ChatGPT and Other Similar Systems the Modern Lernaean Hydras of AI?
The rise of Generative Artificial Intelligence systems (“AI systems”) ha...

04/14/2022  To What Extent do Deep Learning-based Code Recommenders Generate Predictions by Cloning Code from the Training Set?
Deep Learning (DL) models have been widely used to support code completi...

09/08/2023  Open and reusable deep learning for pathology with WSInfer and QuPath
The field of digital pathology has seen a proliferation of deep learning...

02/05/2019  PUTWorkbench: Analysing Privacy in AI-intensive Systems
AI intensive systems that operate upon user data face the challenge of b...

12/16/2018  The Adverse Effects of Code Duplication in Machine Learning Models of Code
The field of big code relies on mining large corpora of code to perform ...

06/19/2022  Productive Reproducible Workflows for DNNs: A Case Study for Industrial Defect Detection
As Deep Neural Networks (DNNs) have become an increasingly ubiquitous wo...

09/02/2023  Towards Code Watermarking with Dual-Channel Transformations
The expansion of the open source community and the rise of large languag...
