Learning Lenient Parsing & Typing via Indirect Supervision

10/14/2019
by   Toufique Ahmed, et al.

Both professional coders and teachers frequently deal with imperfect (fragmentary, incomplete, ill-formed) code. Such fragments are common on StackOverflow; students also frequently produce ill-formed code, for which instructors, TAs, or the students themselves must find repairs. In either case, the developer experience could be greatly improved if such code could somehow be parsed and typed: this would make the fragments more amenable to use within IDEs and allow early detection and repair of potential errors. We introduce a lenient parser, which can parse and type fragments, even ones with simple errors. Training a machine learner to leniently parse and type imperfect code requires a large training set of pairs of imperfect code and its repair (and/or type information); such training sets are limited by human effort and curation. In this paper, we present a novel indirectly supervised approach to train a lenient parser without access to such human-curated training data. We leverage the huge corpus of mostly correct code available on GitHub, and the massive, efficient learning capacity of Transformer-based NN architectures. Using GitHub data, we first create a large dataset of code fragments with their corresponding tree fragments and type annotations; we then randomly corrupt the input fragments (while still requiring the correct output) by seeding errors that mimic corruptions found in StackOverflow and student data. Using this data, we train high-capacity Transformer models to overcome both fragmentation and corruption. With this novel approach, we achieve reasonable performance on parsing and typing StackOverflow fragments; we also demonstrate that our approach achieves best-in-class performance on a large dataset of student errors.
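
The corruption step described in the abstract (seed an error into a correct input fragment while keeping the correct output as the training target) can be illustrated with a minimal Python sketch. This is not the authors' pipeline: the delimiter set, the three edit operations, the function names `corrupt` and `make_pair`, and the string used as the target are all assumptions chosen for illustration.

```python
import random

# Hypothetical set of tokens whose loss or duplication mimics simple slips.
DELIMITERS = {"(", ")", "{", "}", ";", ","}

def corrupt(tokens, rng):
    """Return a copy of `tokens` with one seeded error at a delimiter."""
    tokens = list(tokens)  # never mutate the clean fragment
    positions = [i for i, t in enumerate(tokens) if t in DELIMITERS]
    if not positions:
        return tokens  # nothing to corrupt in this fragment
    i = rng.choice(positions)
    op = rng.choice(["delete", "duplicate", "replace"])
    if op == "delete":
        del tokens[i]
    elif op == "duplicate":
        tokens.insert(i, tokens[i])
    else:
        # Swap in a different delimiter (sorted for reproducibility).
        tokens[i] = rng.choice(sorted(DELIMITERS - {tokens[i]}))
    return tokens

def make_pair(tokens, target, rng):
    """Pair a corrupted input with the unchanged, correct target."""
    return corrupt(tokens, rng), target

rng = random.Random(0)
clean = ["int", "x", "=", "f", "(", "a", ",", "b", ")", ";"]
# Placeholder target; a real target would be a tree fragment with types.
target = "(decl int x (call f a b))"
noisy, label = make_pair(clean, target, rng)
print(noisy, "->", label)
```

Because the target is taken from the clean fragment, no human-written repair is needed for any training pair; only the input side is perturbed.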


