Code4ML: a Large-scale Dataset of annotated Machine Learning Code

10/28/2022
by   Anastasia Drozdova, et al.

Program code as a data source is gaining popularity in the data science community. Possible applications for models trained on such assets range from classification for data dimensionality reduction to automatic code generation. However, without annotation, the number of methods that can be applied is limited. To address the lack of annotated datasets, we present the Code4ML corpus. It contains code snippets, task summaries, and competition and dataset descriptions publicly available from Kaggle, the leading platform for hosting data science competitions. The corpus consists of 2.5 million snippets of ML code collected from 100 thousand Jupyter notebooks. A representative fraction of the snippets is annotated by human assessors through a user-friendly interface designed specifically for that purpose. The Code4ML dataset can potentially help address a number of software engineering and data science challenges through a data-driven approach. For example, it can be helpful for semantic code classification, code auto-completion, and code generation for an ML task specified in natural language.
