Mining Idioms in the Wild

by   Aishwarya Sivaraman, et al.

Existing code repositories contain numerous instances of code patterns that are idiomatic ways of accomplishing a particular programming task. Sometimes, the programming language in use supports specific operators or APIs that can express the same idiomatic imperative code much more succinctly. However, those code patterns linger in repositories because the developers may be unaware of the new APIs or have not gotten around to them. Detection of idiomatic code can also point to the need for new APIs. We share our experiences in mine idiomatic patterns from the Hack repo at Facebook. We found that existing techniques either cannot identify meaningful patterns from syntax trees or require test-suite-based dynamic analysis to incorporate semantic properties to mine useful patterns. The key insight of the approach proposed in this paper – Jezero – is that semantic idioms from a large codebase can be learned from canonicalized dataflow trees. We propose a scalable, lightweight static analysis-based approach to construct such a tree that is well suited to mine semantic idioms using nonparametric Bayesian methods. Our experiments with Jezero on Hack code shows a clear advantage of adding canonicalized dataflow information to ASTs: Jezero was significantly more effective than a baseline that did not have the dataflow augmentation in being able to effectively find refactoring opportunities from unannotated legacy code.


page 1

page 2

page 3

page 4


Concrete Syntax with Black Box Parsers

Context: Meta programming consists for a large part of matching, analyzi...

TreeCaps: Tree-Based Capsule Networks for Source Code Processing

Recently program learning techniques have been proposed to process sourc...

PSIMiner: A Tool for Mining Rich Abstract Syntax Trees from Code

The application of machine learning algorithms to source code has grown ...

Mining Fix Patterns for FindBugs Violations

In this paper, we first collect and track large-scale fixed and unfixed ...

Towards Fully Declarative Program Analysis via Source Code Transformation

Advances in logic programming and increasing industrial uptake of Datalo...

Getafix: Learning to fix bugs automatically

Static analyzers, including linters, can warn developers about programmi...

Active Inductive Logic Programming for Code Search

Modern search techniques either cannot efficiently incorporate human fee...

Please sign up or login with your details

Forgot password? Click here to reset