CodeBPE: Investigating Subtokenization Options for Large Language Model Pretraining on Source Code

08/01/2023
by   Nadezhda Chirkova, et al.

Recent works have widely adopted large language model pretraining for source code, suggested source code-specific pretraining objectives, and investigated the applicability of various Transformer-based language model architectures for source code. This work investigates another important aspect of such models, namely the effect of different subtokenization options, and aims at identifying the most effective and length-efficient subtokenizations, taking code specifics into account. We propose a subtokenization that reduces average sequence length by 17% without a downstream performance drop, and show that a carefully chosen subtokenization may improve quality by 0.5-2%.
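As a rough illustration of why subtokenization choices affect sequence length, the sketch below implements a few greedy BPE-style merges over a character-level code snippet; this is a generic toy example, not the paper's actual tokenizer or vocabulary:

```python
from collections import Counter

def get_pair_counts(tokens):
    """Count adjacent symbol pairs in the current token sequence."""
    return Counter(zip(tokens, tokens[1:]))

def merge_pair(tokens, pair):
    """Replace every occurrence of `pair` with one merged symbol."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

# Start from characters of a (hypothetical) code fragment.
code = "self.value = self.value + 1"
tokens = list(code)

# Greedily apply the most frequent merge a few times, as BPE training does.
for _ in range(10):
    pairs = get_pair_counts(tokens)
    if not pairs:
        break
    best = max(pairs, key=pairs.get)
    tokens = merge_pair(tokens, best)

print(len(code), len(tokens))  # the merged sequence is shorter than the raw characters
```

Each merge collapses a frequent pair (here, e.g., pieces of the repeated `self.value`) into one symbol, so sequence length drops while the text remains exactly reconstructible by concatenation; the paper's question is which merge/vocabulary choices achieve this most effectively for code.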

