Studying the Difference Between Natural and Programming Language Corpora

06/06/2018
by Casey Casalnuovo, et al.

Code corpora, as observed in large software systems, are now known to be far more repetitive and predictable than natural language corpora. But why? Does the difference simply arise from the syntactic limitations of programming languages? Or does it arise from differences in the authoring decisions made by the writers of these natural and programming language texts? We conjecture that the differences stem not only from syntax but also from the fact that reading and writing code is un-natural for humans and requires substantial mental effort, so people prefer to write code in ways that are familiar to both reader and writer. To support this argument, we present results from two sets of studies: 1) a first set aimed at attenuating the effects of syntax, and 2) a second set aimed at measuring the repetitiveness of text written in other settings that are also effortful to write (e.g., second language, technical/specialized jargon). We find that the repetition in source code is not entirely the result of grammar constraints, and thus some of it must result from human choice. While the evidence we find of similar repetitive behavior in technical and learner corpora does not conclusively show that such language is used by humans to mitigate difficulty, it is consistent with that theory.


