I still know it's you! On Challenges in Anonymizing Source Code

08/26/2022
by Micha Horlboge, et al.

The source code of a program not only defines its semantics but also contains subtle clues that can identify its author. Several studies have shown that these clues can be automatically extracted using machine learning and allow for determining a program's author among hundreds of programmers. This attribution poses a significant threat to developers of anti-censorship and privacy-enhancing technologies, as they become identifiable and may be prosecuted. An ideal protection from this threat would be the anonymization of source code. However, neither theoretical nor practical principles of such an anonymization have been explored so far. In this paper, we tackle this problem and develop a framework for reasoning about code anonymization. We prove that the task of generating a k-anonymous program – a program that cannot be attributed to one of k authors – is not computable and thus a dead end for research. As a remedy, we introduce a relaxed concept called k-uncertainty, which enables us to measure the protection of developers. Based on this concept, we empirically study candidate techniques for anonymization, such as code normalization, coding style imitation, and code obfuscation. We find that none of the techniques provides sufficient protection when the attacker is aware of the anonymization. While we introduce an approach for removing remaining clues from the code, the main result of our work is negative: Anonymization of source code is a hard and open problem.
