Project CodeNet: A Large-Scale AI for Code Dataset for Learning a Diversity of Coding Tasks

05/25/2021
by Ruchir Puri, et al.

Advancements in deep learning and machine learning algorithms have enabled breakthrough progress in computer vision, speech recognition, natural language processing, and beyond. In addition, over the last several decades, software has been built into the fabric of every aspect of our society. Together, these two trends have generated new interest in the fast-emerging research area of AI for Code. As software development becomes ubiquitous across all industries and the code infrastructure of enterprise legacy applications ages, it is more critical than ever to increase software development productivity and modernize legacy applications. Over the last decade, datasets like ImageNet, with its large scale and diversity, have played a pivotal role in algorithmic advancements from computer vision to language and speech understanding. In this paper, we present Project CodeNet, a first-of-its-kind, very large-scale, diverse, and high-quality dataset to accelerate algorithmic advancements in AI for Code. It consists of 14M code samples and about 500M lines of code in 55 different programming languages. Project CodeNet is unique not only in its scale, but also in the diversity of coding tasks it can help benchmark: from code similarity and classification for advances in code recommendation algorithms, and code translation between a large variety of programming languages, to advances in techniques for improving code performance (both runtime and memory). CodeNet also provides sample input and output test sets for over 7M code samples, which can be critical for determining code equivalence across different languages. As a usability feature, we provide several preprocessing tools in Project CodeNet to transform source code into representations that can be readily used as inputs to machine learning models.


