A Large-scale Dataset of (Open Source) License Text Variants

04/01/2022
by Stefano Zacchiroli, et al.

We introduce a large-scale dataset of the complete texts of free/open source software (FOSS) license variants. To assemble it, we collected from the Software Heritage archive (the largest publicly available archive of FOSS source code with accompanying development history) all versions of files whose names are commonly used to convey licensing terms to software users and developers. The dataset consists of 6.5 million unique license files that can be used to conduct empirical studies on open source licensing, train automated license classifiers, run natural language processing (NLP) analyses of legal texts, and carry out historical and phylogenetic studies on FOSS licensing. Additional metadata about the shipped license files is also provided, making the dataset ready to use in various contexts; it includes file length measures, detected MIME type, detected SPDX license (using ScanCode), an example origin (e.g., a GitHub repository), and the oldest public commit in which the license appeared. The dataset is released as open data, as an archive file containing all deduplicated license files plus several portable CSV files for metadata, which reference files via cryptographic checksums.
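As a rough illustration of how metadata that references files via checksums might be used, the sketch below hashes a license text and looks up its row in a metadata table. It is not the dataset's actual schema: the column names (`sha1`, `mime_type`, `spdx`), the choice of SHA-1 as the checksum, and the in-memory sample data are all assumptions made for the example.

```python
import csv
import hashlib
import io


def sha1_hex(data: bytes) -> str:
    """Return the hex SHA-1 digest of a blob (assumed checksum type)."""
    return hashlib.sha1(data).hexdigest()


def index_metadata(csv_text: str, key: str = "sha1") -> dict:
    """Build a checksum -> metadata-row mapping from CSV text.

    The key column name is hypothetical; adapt it to the real header.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row[key]: row for row in reader}


def lookup(license_bytes: bytes, index: dict):
    """Find the metadata row for a license text, or None if unknown."""
    return index.get(sha1_hex(license_bytes))


# Tiny in-memory stand-in for one license file and one metadata CSV.
SAMPLE_TEXT = b"Example license text\n"
SAMPLE_CSV = (
    "sha1,mime_type,spdx\n"
    f"{sha1_hex(SAMPLE_TEXT)},text/plain,MIT\n"
)

if __name__ == "__main__":
    index = index_metadata(SAMPLE_CSV)
    print(lookup(SAMPLE_TEXT, index))
```

Keying the index on the checksum means each deduplicated file is matched by content rather than by name, which fits a dataset where the same license text can appear under many file names and origins.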


