Using the Uniqueness of Global Identifiers to Determine the Provenance of Python Software Source Code

05/24/2023
by   Yiming Sun, et al.

We consider the problem of identifying the provenance of free/open source software (FOSS), and specifically the need to identify where reused source code has been copied from. We propose a lightweight approach to this problem based on software identifiers, such as the names of variables, classes, and functions chosen by programmers. The proposed approach efficiently narrows the search down to a small set of candidate origin products, which can then be analyzed with more expensive techniques to make a final provenance determination. By analyzing the PyPI (Python Package Index) open source ecosystem, we find that globally defined identifiers are very distinct. Across PyPI's 244 K packages we found 11.2 M different global identifiers (classes and method/function names-with only 0.6 of entities); 76 most 3. Randomly selecting 3 non-frequent global identifiers from an input product is enough to narrow down its origins to a maximum of 3 products within 89 packages implemented in Python to the corresponding PyPI packages; this approach uses at most five trials, where each trial uses three randomly chosen global identifiers from a randomly chosen Python file of the subject software package, then ranks results using a popularity index and requires inspecting only the top result. In our experiments, this method is effective at finding the true origin of a project, with a recall of 0.9 and a precision of 0.77.
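The trial procedure described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. The function names, the `index` mapping (identifier to the set of packages defining it), the `popularity` mapping, and the `max_packages` cutoff for "non-frequent" are all hypothetical. Two further choices are assumptions that the abstract does not specify: the results of the three lookups are combined by set intersection, and the search stops at the first trial that yields a candidate.

```python
import ast
import random


def global_identifiers(source):
    """Collect names of top-level classes and functions, plus methods of
    top-level classes (an assumed reading of "global identifiers")."""
    funcs = (ast.FunctionDef, ast.AsyncFunctionDef)
    names = set()
    for node in ast.parse(source).body:
        if isinstance(node, funcs):
            names.add(node.name)
        elif isinstance(node, ast.ClassDef):
            names.add(node.name)
            names.update(m.name for m in node.body if isinstance(m, funcs))
    return names


def guess_origin(files, index, popularity, trials=5, k=3,
                 max_packages=3, rng=None):
    """Return the most popular candidate package, or None.

    files:      dict mapping file path -> Python source text
    index:      dict mapping identifier -> set of package names (hypothetical)
    popularity: dict mapping package name -> numeric score (hypothetical)
    """
    rng = rng or random.Random()
    paths = sorted(files)
    for _ in range(trials):
        path = rng.choice(paths)  # one randomly chosen file per trial
        # Keep only identifiers that are known and "non-frequent"
        # (assumed here to mean: defined in at most max_packages packages).
        rare = sorted(
            name for name in global_identifiers(files[path])
            if name in index and len(index[name]) <= max_packages
        )
        if len(rare) < k:
            continue
        sample = rng.sample(rare, k)  # k random identifiers
        # Assumption: a candidate must define all sampled identifiers.
        candidates = set.intersection(*(index[name] for name in sample))
        if candidates:
            # Rank by popularity and keep only the top result.
            return max(candidates, key=lambda p: (popularity.get(p, 0), p))
    return None
```

With a toy index in which `Foo`, `bar`, and `baz` are each defined by two hypothetical packages, the more popular package is returned; a real index would be built by running `global_identifiers` over every file of every package.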


