Syntax and Stack Overflow: A methodology for extracting a corpus of syntax errors and fixes

07/17/2019
by Alexander William Wong, et al.
A central problem in studying how to find and fix syntax errors is obtaining natural and representative examples of such errors. Most syntax error datasets are not free, open, and public, or they are extracted from novice programmers and do not represent the syntax errors that the general population of developers would make. Programmers of all skill levels post questions and answers to Stack Overflow, which may contain snippets of source code along with corresponding text and tags. Many snippets do not parse, which makes them ripe for forming a corpus of syntax errors and corrections. Our primary contribution is an approach for extracting natural syntax errors and their corresponding human-made fixes to help syntax error research. A Python abstract syntax tree parser is used to determine preliminary errors and corrections on code blocks extracted from the SOTorrent data set. We further analyzed our code by executing the corrections in a Python interpreter. We applied our methodology to produce a public data set of 62,965 Python Stack Overflow code snippets with corresponding tags, errors, and stack traces. We found that errors made by Stack Overflow users do not match errors made by student developers or random mutations, implying there is a serious representativeness risk within the field. Finally, we share our dataset openly so that future researchers can re-use and extend our syntax errors and fixes.
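
The parse-based check described in the abstract can be illustrated with a minimal sketch. It assumes Python 3 and the standard-library `ast` module; the abstract does not state which parser version or tooling the authors used, and the helper names `check_snippet` and `is_fix` are hypothetical, introduced only for illustration.

```python
import ast
from typing import Optional, Tuple


def check_snippet(source: str) -> Optional[Tuple[str, Optional[int], str]]:
    """Try to parse a snippet with the standard-library AST parser.

    Returns None when the snippet parses, otherwise a tuple of
    (error class name, line number if known, error message).
    Hypothetical helper, not the authors' implementation.
    """
    try:
        ast.parse(source)
    except SyntaxError as err:
        # IndentationError and TabError are subclasses of SyntaxError,
        # so type(err).__name__ keeps the more specific class name.
        return type(err).__name__, err.lineno, err.msg or ""
    except ValueError as err:
        # Some Python versions raise ValueError for source with null bytes.
        return "ValueError", None, str(err)
    return None


def is_fix(before: str, after: str) -> bool:
    """Treat a (before, after) pair as an error and its correction when
    the earlier version fails to parse and the later version parses."""
    return check_snippet(before) is not None and check_snippet(after) is None


if __name__ == "__main__":
    broken = "print 'hello'"    # Python 2 style print, rejected by Python 3
    fixed = "print('hello')"
    print(check_snippet(broken))
    print(is_fix(broken, fixed))
```

A successful parse only shows that a snippet is syntactically valid. Executing the corrected snippet, as the abstract describes, is a separate step that this sketch does not cover.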
