FLAME: A small language model for spreadsheet formulas

01/31/2023
by Harshit Joshi, et al.

The widespread use of spreadsheet environments by billions of users presents a unique opportunity for formula-authoring assistance. Although large language models, such as Codex, can assist in general-purpose languages, they are expensive to train and challenging to deploy due to their large model sizes (up to billions of parameters). Moreover, they require hundreds of gigabytes of training data. We present FLAME, a T5-based model trained on Excel formulas that leverages domain insights to achieve competitive performance with a substantially smaller model (60M parameters) and two orders of magnitude less training data. We curate a training dataset using sketch deduplication, introduce an Excel-specific formula tokenizer for our model, and use domain-specific versions of masked span prediction and noisy auto-encoding as pretraining objectives. We evaluate FLAME on formula repair, formula auto-completion, and a novel task called syntax reconstruction. FLAME (60M) can outperform much larger models, such as Codex-Davinci (175B), Codex-Cushman (12B), and CodeT5 (220M), in 6 out of 10 settings.
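The abstract does not detail how sketch deduplication works; a common reading is that formulas which differ only in literals and cell references are collapsed to a shared "sketch" so near-duplicates can be dropped from the training corpus. The snippet below is a minimal illustrative sketch under that assumption; the placeholder tokens, regexes, and function names (formula_sketch, deduplicate) are hypothetical and not taken from the paper.

```python
import re

# Illustrative canonicalization: abstract strings, cell references, and
# numbers into placeholder tokens, then keep one formula per sketch.
STRING = re.compile(r'"[^"]*"')
CELL_REF = re.compile(r"\$?[A-Z]{1,3}\$?\d+(:\$?[A-Z]{1,3}\$?\d+)?")
NUMBER = re.compile(r"\b\d+(\.\d+)?\b")

def formula_sketch(formula: str) -> str:
    """Replace literals and cell references with placeholder tokens."""
    sketch = STRING.sub("<str>", formula)
    sketch = CELL_REF.sub("<ref>", sketch)
    sketch = NUMBER.sub("<num>", sketch)
    return sketch

def deduplicate(formulas):
    """Keep the first formula seen for each distinct sketch."""
    seen, kept = set(), []
    for f in formulas:
        s = formula_sketch(f)
        if s not in seen:
            seen.add(s)
            kept.append(f)
    return kept

if __name__ == "__main__":
    corpus = ['=SUM(A1:A10)', '=SUM(B2:B20)', '=IF(C1>5,"yes","no")']
    # Both SUM formulas share the sketch =SUM(<ref>), so only one is kept.
    print(deduplicate(corpus))
```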
