TrojanPuzzle: Covertly Poisoning Code-Suggestion Models

by   Hojjat Aghakhani, et al.

With tools like GitHub Copilot, automatic code suggestion is no longer a dream in software engineering. These tools, based on large language models, are typically trained on massive corpora of code mined from unvetted public sources. As a result, these models are susceptible to data poisoning attacks where an adversary manipulates the model's training or fine-tuning phases by injecting malicious data. Poisoning attacks could be designed to influence the model's suggestions at run time for chosen contexts, such as inducing the model into suggesting insecure code payloads. To achieve this, prior poisoning attacks explicitly inject the insecure code payload into the training data, making the poisoning data detectable by static analysis tools that can remove such malicious data from the training set. In this work, we demonstrate two novel data poisoning attacks, COVERT and TROJANPUZZLE, that can bypass static analysis by planting malicious poisoning data in out-of-context regions such as docstrings. Our most novel attack, TROJANPUZZLE, goes one step further in generating less suspicious poisoning data by never including certain (suspicious) parts of the payload in the poisoned data, while still inducing a model that suggests the entire payload when completing code (i.e., outside docstrings). This makes TROJANPUZZLE robust against signature-based dataset-cleansing methods that identify and filter out suspicious sequences from the training data. Our evaluation against two model sizes demonstrates that both COVERT and TROJANPUZZLE have significant implications for how practitioners should select code used to train or tune code-suggestion models.


Backdoor Attacks for In-Context Learning with Language Models

Because state-of-the-art language models are expensive to train, most pr...

Deduplicating Training Data Mitigates Privacy Risks in Language Models

Past work has shown that large language models are susceptible to privac...

Tools for Verifying Neural Models' Training Data

It is important that consumers and regulators can verify the provenance ...

Targeted Attack on GPT-Neo for the SATML Language Model Data Extraction Challenge

Previous work has shown that Large Language Models are susceptible to so...

Test Suites as a Source of Training Data for Static Analysis Alert Classifiers

Flaw-finding static analysis tools typically generate large volumes of c...

You Autocomplete Me: Poisoning Vulnerabilities in Neural Code Completion

Code autocompletion is an integral feature of modern code editors and ID...

Estimating Contamination via Perplexity: Quantifying Memorisation in Language Model Evaluation

Data contamination in model evaluation is getting increasingly prevalent...

Please sign up or login with your details

Forgot password? Click here to reset