CORAL: COde RepresentAtion Learning with Weakly-Supervised Transformers for Analyzing Data Analysis

08/28/2020
by Ge Zhang, et al.

Large-scale analysis of source code, and in particular scientific source code, holds the promise of better understanding the data science process, identifying analytical best practices, and providing insights to the builders of scientific toolkits. However, large corpora have remained unanalyzed in depth, as descriptive labels are absent and require expert domain knowledge to generate. We propose a novel weakly supervised transformer-based architecture for computing joint representations of code from both abstract syntax trees and surrounding natural language comments. We then evaluate the model on a new classification task for labeling computational notebook cells as stages in the data analysis process, from data import to wrangling, exploration, modeling, and evaluation. We show that our model, leveraging only easily available weak supervision, achieves a 38% increase in accuracy over expert-supplied heuristics and outperforms a suite of baselines. Our model enables us to examine a set of 118,000 Jupyter Notebooks to uncover common data analysis patterns. Focusing on notebooks with relationships to academic articles, we conduct the largest-ever study of scientific code and find that notebook composition correlates with the citation count of corresponding papers.
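To make the idea of weak supervision for notebook cells concrete, here is a minimal sketch of heuristic labeling, assuming a simple keyword lookup over function calls found in a cell's abstract syntax tree. The stage names follow the abstract; the keyword sets and the helper names `called_names` and `weak_label` are hypothetical illustrations and are not taken from the paper, whose actual heuristics and model are not described here.

```python
import ast

# Hypothetical keyword heuristics for each analysis stage (assumption,
# chosen for illustration only; not the heuristics used in the paper).
STAGE_KEYWORDS = {
    "import": {"read_csv", "read_excel", "load"},
    "wrangle": {"dropna", "fillna", "merge", "groupby"},
    "explore": {"plot", "hist", "describe", "scatter"},
    "model": {"fit", "LinearRegression", "RandomForestClassifier"},
    "evaluate": {"score", "accuracy_score", "confusion_matrix"},
}


def called_names(source):
    """Collect the names of functions and methods called in a code cell."""
    names = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Call):
            func = node.func
            if isinstance(func, ast.Attribute):
                names.add(func.attr)  # e.g. pd.read_csv -> "read_csv"
            elif isinstance(func, ast.Name):
                names.add(func.id)    # e.g. accuracy_score(...)
    return names


def weak_label(source):
    """Return the stage with the most keyword hits, or None to abstain.

    Ties are broken by the order of STAGE_KEYWORDS. Cells that do not
    parse as Python (such as IPython magics) also yield None.
    """
    try:
        names = called_names(source)
    except SyntaxError:
        return None
    hits = {stage: len(names & kws) for stage, kws in STAGE_KEYWORDS.items()}
    best = max(hits, key=hits.get)
    return best if hits[best] > 0 else None
```

Labels produced this way are noisy and often absent, which is why they would serve only as a weak training signal for a learned model rather than as final predictions.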

