UniCase – Rethinking Casing in Language Models

10/22/2020
by Rafał Powalski, et al.

In this paper, we introduce a new approach to handling case sensitivity in Language Modelling (LM). We propose a simple architectural modification to the RoBERTa language model, accompanied by a new tokenization strategy, which we name Unified Case LM (UniCase). We tested our solution on the GLUE benchmark, where it improved performance by 0.42 points. Moreover, we show that the UniCase model performs much better on text data where all tokens are uppercased (+5.88 points).
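To make the core idea concrete, below is a minimal sketch of a unified-case tokenization strategy: each token is reduced to a single lowercased form, and its original surface casing travels alongside it as a separate categorical feature that a model could embed independently. The `CaseTag` names and the `unicase_tokenize` helper are illustrative assumptions, not the paper's released code or exact algorithm.

```python
# Hypothetical sketch of unified-case tokenization: lowercased tokens
# plus a separate case feature, so "HELLO" and "hello" share one token id.

from enum import Enum

class CaseTag(Enum):
    LOWER = 0   # "hello"
    UPPER = 1   # "HELLO"
    TITLE = 2   # "Hello"
    MIXED = 3   # "heLLo" and anything else

def case_tag(token: str) -> CaseTag:
    """Classify the surface casing of a token."""
    if token.islower() or not any(c.isalpha() for c in token):
        return CaseTag.LOWER
    if token.isupper():
        return CaseTag.UPPER
    if token.istitle():
        return CaseTag.TITLE
    return CaseTag.MIXED

def unicase_tokenize(text: str) -> list[tuple[str, CaseTag]]:
    """Split text on whitespace and emit (lowercased token, case tag)
    pairs; differently-cased variants map to the same token string."""
    return [(tok.lower(), case_tag(tok)) for tok in text.split()]

print(unicase_tokenize("HELLO World again"))
# [('hello', CaseTag.UPPER), ('world', CaseTag.TITLE), ('again', CaseTag.LOWER)]
```

In a RoBERTa-style model, the case tag would typically be looked up in its own small embedding table and added to the token embedding, which is one plausible reading of the "simple architectural modification" the abstract describes.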


