SantaCoder: don't reach for the stars!

01/09/2023
by Loubna Ben Allal, et al.

The BigCode project is an open-scientific collaboration working on the responsible development of large language models for code. This tech report describes the progress of the collaboration until December 2022, outlining the current state of the Personally Identifiable Information (PII) redaction pipeline, the experiments conducted to de-risk the model architecture, and the experiments investigating better preprocessing methods for the training data. We train 1.1B parameter models on the Java, JavaScript, and Python subsets of The Stack and evaluate them on the MultiPL-E text-to-code benchmark. We find that more aggressive filtering of near-duplicates can further boost performance and, surprisingly, that selecting files from repositories with 5+ GitHub stars deteriorates performance significantly. Our best model outperforms previous open-source multilingual code generation models (InCoder-6.7B and CodeGen-Multi-2.7B) in both left-to-right generation and infilling on the Java, JavaScript, and Python portions of MultiPL-E, despite being a substantially smaller model. All models are released under an OpenRAIL license at https://hf.co/bigcode.
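
The abstract does not say how near-duplicates are detected, so the following is only a minimal sketch of the general idea of near-duplicate filtering, not the BigCode pipeline. It assumes a simple exact Jaccard similarity over token shingles; the function names, the shingle size `k=3`, and the threshold `0.7` are all hypothetical values chosen for illustration. Lowering the threshold corresponds to a "more aggressive" filter in the abstract's sense.

```python
import re


def shingles(text, k=3):
    """Return the set of k-token shingles of a source file (illustrative tokenizer)."""
    tokens = re.findall(r"\w+", text)
    if len(tokens) < k:
        # Very short files are represented by a single shingle.
        return {tuple(tokens)}
    return {tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets: |a & b| / |a | b|."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def filter_near_duplicates(files, threshold=0.7, k=3):
    """Keep a file only if it is below `threshold` similarity to every kept file.

    This is a quadratic-time sketch; the threshold and shingle size are
    assumptions, not values taken from the report.
    """
    kept, kept_shingles = [], []
    for text in files:
        s = shingles(text, k)
        if all(jaccard(s, other) < threshold for other in kept_shingles):
            kept.append(text)
            kept_shingles.append(s)
    return kept


files = [
    "def add(a, b):\n    return a + b\n",
    "def add(a, b):\n    return a + b  # sum\n",
    "print('hello world')",
]
deduplicated = filter_near_duplicates(files)
```

In this toy input, the second file differs from the first only by a trailing comment, so it is dropped and two files remain. At the scale of a large code corpus, pairwise comparison is impractical, and an approximate method such as MinHash with locality-sensitive hashing is typically used in its place.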


research
05/09/2023

StarCoder: may the source be with you!

The BigCode community, an open-scientific collaboration working on the r...
research
11/20/2022

The Stack: 3 TB of permissively licensed source code

Large Language Models (LLMs) play an ever-increasing role in the field o...
research
08/13/2023

Py-Tetrad and RPy-Tetrad: A New Python Interface with R Support for Tetrad Causal Search

We give novel Python and R interfaces for the (Java) Tetrad project for ...
research
06/19/2023

RepoFusion: Training Code Models to Understand Your Repository

Despite the huge success of Large Language Models (LLMs) in coding assis...
research
08/31/2023

BioCoder: A Benchmark for Bioinformatics Code Generation with Contextual Pragmatic Knowledge

Pre-trained language models like ChatGPT have significantly improved cod...
research
06/11/2023

Attention, Compilation, and Solver-based Symbolic Analysis are All You Need

In this paper we present a Java-to-Python (J2P) and Python-to-Java (P2J)...
research
04/12/2022

InCoder: A Generative Model for Code Infilling and Synthesis

Code is seldom written in a single left-to-right pass and is instead rep...
