MorphPiece: Moving away from Statistical Language Representation

07/14/2023
by Haris Jabbar, et al.

Tokenization is a critical part of modern NLP pipelines. However, contemporary tokenizers for large language models are based on statistical analysis of text corpora, with little consideration of linguistic features. We propose a linguistically motivated tokenization scheme, MorphPiece, which is based partly on morphological segmentation of the underlying text. A GPT-style causal language model trained with this tokenizer (called MorphGPT) shows superior convergence compared to the same architecture trained with a standard BPE tokenizer. Specifically, we obtain language modeling performance comparable to that of a model six times larger. Additionally, we evaluate MorphGPT on a variety of NLP tasks in supervised and unsupervised settings and find superior performance across the board compared to the GPT-2 model.
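To make the idea concrete, the sketch below shows one way a morphology-aware tokenizer could combine a morpheme lexicon with a statistical fallback. It is a minimal illustration under stated assumptions: the toy lexicon, the morph_tokenize function name, and the greedy longest-match strategy are all hypothetical, and this is not the MorphPiece algorithm itself, whose construction is described in the full paper.

# Illustrative sketch only (not the paper's actual algorithm): a toy tokenizer
# that prefers splits from a small morpheme lexicon and falls back to
# character-level pieces when no morphological cover exists.

MORPHEME_LEXICON = {"un", "break", "able", "morph", "piece", "token", "s"}

def morph_tokenize(word):
    """Greedily segment `word` into known morphemes; if the whole word
    cannot be covered, fall back to character pieces (a stand-in for a
    statistical subword tokenizer such as BPE)."""
    pieces, i = [], 0
    while i < len(word):
        # Longest lexicon entry that matches at position i, if any.
        match = next(
            (word[i:j] for j in range(len(word), i, -1)
             if word[i:j] in MORPHEME_LEXICON),
            None,
        )
        if match is None:
            return list(word)  # fallback: statistical/character pieces
        pieces.append(match)
        i += len(match)
    return pieces

print(morph_tokenize("unbreakable"))  # ['un', 'break', 'able']
print(morph_tokenize("morphpiece"))   # ['morph', 'piece']
print(morph_tokenize("xylophone"))    # no morphological cover: character fallback

The design intuition is that words covered by known morphemes keep linguistically meaningful boundaries, while everything else degrades gracefully to statistical subword pieces.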
