Benchmarking Multimodal Regex Synthesis with Complex Structures

05/02/2020
by Xi Ye, et al.

Existing datasets for regular expression (regex) generation from natural language are limited in complexity; compared to regex tasks that users post on StackOverflow, the regexes in these datasets are simple, and the language used to describe them is not diverse. We introduce StructuredRegex, a new regex synthesis dataset differing from prior ones in three aspects. First, to obtain structurally complex and realistic regexes, we generate the regexes using a probabilistic grammar with pre-defined macros observed from real-world StackOverflow posts. Second, to obtain linguistically diverse natural language descriptions, we show crowdworkers abstract depictions of the underlying regex and ask them to describe the pattern they see, rather than having them paraphrase synthetic language. Third, we augment each regex example with a collection of strings that are and are not matched by the ground truth regex, similar to how real users give examples. Our quantitative and qualitative analysis demonstrates the advantages of StructuredRegex over prior datasets. Further experimental results using various multimodal synthesis techniques highlight the challenges presented by our dataset, including non-local constraints and multimodal inputs.
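
The first and third aspects above can be illustrated with a minimal sketch. It samples a regex from a small weighted grammar and then labels candidate strings as matched or not matched by it. The grammar, its weights, the alphabet, and the helper names (`sample`, `random_strings`, `label_examples`) are all hypothetical and chosen only for illustration; they are not the grammar, macros, or example-generation procedure used to build StructuredRegex.

```python
import random
import re

# Hypothetical toy grammar (an assumption, not the dataset's grammar).
# Each nonterminal maps to a list of (weight, production) pairs; any
# symbol that is not a key of GRAMMAR is treated as a terminal.
GRAMMAR = {
    "regex": [(0.6, ["term", "term"]), (0.4, ["term"])],
    "term": [
        (0.5, ["class", "repeat"]),
        (0.3, ["class"]),
        (0.2, ["(?:", "class", "|", "class", ")"]),
    ],
    "class": [(0.4, ["[a-z]"]), (0.3, ["[A-Z]"]), (0.3, ["[0-9]"])],
    "repeat": [(0.5, ["{2}"]), (0.3, ["+"]), (0.2, ["{1,3}"])],
}


def sample(symbol, rng):
    """Expand a symbol by recursively sampling weighted productions."""
    if symbol not in GRAMMAR:
        return symbol  # terminal: emit as-is
    weights = [w for w, _ in GRAMMAR[symbol]]
    productions = [p for _, p in GRAMMAR[symbol]]
    chosen = rng.choices(productions, weights=weights, k=1)[0]
    return "".join(sample(s, rng) for s in chosen)


def random_strings(rng, n, alphabet="abAB01", max_len=5):
    """Draw n random candidate strings over a small assumed alphabet."""
    return [
        "".join(rng.choice(alphabet) for _ in range(rng.randint(1, max_len)))
        for _ in range(n)
    ]


def label_examples(pattern, candidates):
    """Split candidates into strings the regex fully matches and the rest."""
    compiled = re.compile(pattern)
    positives = [s for s in candidates if compiled.fullmatch(s)]
    negatives = [s for s in candidates if not compiled.fullmatch(s)]
    return positives, negatives


if __name__ == "__main__":
    rng = random.Random(0)
    pattern = sample("regex", rng)
    pos, neg = label_examples(pattern, random_strings(rng, 200))
    print("regex:", pattern)
    print("some positives:", pos[:5])
    print("some negatives:", neg[:5])
```

Random candidates are a simplification: for a restrictive regex they may yield few or no positive strings, so a real pipeline would likely need a more targeted way to produce matching and near-miss examples.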

research
12/31/2020

TransRegex: Multi-modal Regular Expression Synthesis by Generate-and-Repair

Since regular expressions (abbrev. regexes) are difficult to understand ...
research
08/16/2019

Sketch-Driven Regular Expression Generation from Natural Language and Examples

Recent systems for converting natural language descriptions into regular...
research
09/03/2021

Multi-modal Program Inference: a Marriage of Pre-trained Language Models and Component-based Synthesis

Multi-modal program synthesis refers to the task of synthesizing program...
research
08/09/2019

Multi-Modal Synthesis of Regular Expressions

Despite their usefulness across a wide range of application domains, reg...
research
09/19/2023

Natural Language Dataset Generation Framework for Visualizations Powered by Large Language Models

We introduce a Large Language Model (LLM) framework that generates rich ...
research
08/30/2019

PUMICE: A Multi-Modal Agent that Learns Concepts and Conditionals from Natural Language and Demonstrations

Natural language programming is a promising approach to enable end users...
research
04/17/2021

Mobile App Tasks with Iterative Feedback (MoTIF): Addressing Task Feasibility in Interactive Visual Environments

In recent years, vision-language research has shifted to study tasks whi...
