The Stack: 3 TB of permissively licensed source code

11/20/2022
by Denis Kocetkov, et al.

Large Language Models (LLMs) play an ever-increasing role in the field of Artificial Intelligence (AI), not only for natural language processing but also for code understanding and generation. To stimulate open and responsible research on LLMs for code, we introduce The Stack, a 3.1 TB dataset consisting of permissively licensed source code in 30 programming languages. We describe how we collect the full dataset, construct a permissively licensed subset, present a data governance plan, discuss limitations, and show promising results on text2code benchmarks by training 350M-parameter decoders on different Python subsets. We find that (1) near-deduplicating the data significantly boosts performance across all experiments, and (2) it is possible to match previously reported HumanEval and MBPP performance using only permissively licensed data. We make the dataset available at https://hf.co/BigCode, provide a tool called "Am I in The Stack" (https://hf.co/spaces/bigcode/in-the-stack) for developers to search The Stack for copies of their code, and provide a process for code to be removed from the dataset by following the instructions at https://www.bigcode-project.org/docs/about/the-stack/.
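The abstract does not say how near-duplicates are detected, so the following is only a minimal sketch of one common approach: MinHash signatures over token shingles, compared by estimated Jaccard similarity. The function names, the 5-token shingle size, the 64 hash functions, and the 0.7 threshold are all illustrative assumptions, not values taken from the abstract.

```python
import hashlib
import re


def shingles(code: str, k: int = 5) -> set[str]:
    """Return the set of k-token shingles of a source file (k is an assumed value)."""
    # Split into identifiers/numbers and single punctuation characters,
    # so whitespace-only differences do not change the shingle set.
    tokens = re.findall(r"\w+|[^\w\s]", code)
    if len(tokens) < k:
        return {" ".join(tokens)}
    return {" ".join(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


def minhash_signature(items: set[str], num_perm: int = 64) -> list[int]:
    """Summarise a set by its minimum hash under num_perm salted hash functions."""
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "little")
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "little",
            )
            for s in items
        ))
    return signature


def estimated_jaccard(sig_a: list[int], sig_b: list[int]) -> float:
    """Fraction of matching signature slots, an estimate of Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def near_deduplicate(files: dict[str, str], threshold: float = 0.7) -> list[str]:
    """Keep the first file of each near-duplicate group.

    This compares every file against every kept file, which is quadratic;
    a corpus-scale pipeline would need an index such as locality-sensitive hashing.
    """
    kept: list[tuple[str, list[int]]] = []
    for name, code in files.items():
        sig = minhash_signature(shingles(code))
        if all(estimated_jaccard(sig, kept_sig) < threshold for _, kept_sig in kept):
            kept.append((name, sig))
    return [name for name, _ in kept]


files = {
    "a.py": "def add(a, b):\n    return a + b\n",
    "b.py": "def add(a,b):\n\treturn a+b\n",      # same tokens, different whitespace
    "c.py": "import os\nprint(os.getcwd())\n",    # unrelated file
}
print(near_deduplicate(files))
```

In this toy input, `b.py` differs from `a.py` only in whitespace, so it tokenizes identically and is dropped, while the unrelated `c.py` is kept.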

