Large Pre-Trained Models with Extra-Large Vocabularies: A Contrastive Analysis of Hebrew BERT Models and a New One to Outperform Them All

11/28/2022
by   Eylon Guetta, et al.

We present a new pre-trained language model (PLM) for Modern Hebrew, termed AlephBERTGimmel, which employs a much larger vocabulary (128K items) than previously used in Hebrew PLMs. We perform a contrastive analysis of this model against all previous Hebrew PLMs (mBERT, heBERT, AlephBERT) and assess the effects of larger vocabularies on task performance. Our experiments show that larger vocabularies lead to fewer splits, and that reducing splits improves model performance across different tasks. All in all, the new model achieves new SOTA on all available Hebrew benchmarks, including Morphological Segmentation, POS Tagging, Full Morphological Analysis, NER, and Sentiment Analysis. Consequently, we advocate for PLMs that are larger not only in terms of the number of layers or amount of training data, but also in terms of their vocabulary. We release the new model publicly for unrestricted use.
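The link between vocabulary size and the number of splits can be illustrated with a toy greedy longest-match-first tokenizer in the style of WordPiece (a simplified sketch, not the paper's actual tokenizer; the example word and vocabularies are hypothetical). When the larger vocabulary contains a whole word, the word tokenizes to a single piece instead of several subwords:

```python
def wordpiece_tokenize(word, vocab):
    """Toy greedy longest-match-first (WordPiece-style) tokenization."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, shrinking until a match.
        while end > start:
            cand = word[start:end] if start == 0 else "##" + word[start:end]
            if cand in vocab:
                piece = cand
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]  # no piece matches: the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces

# Hypothetical vocabularies: the larger one also contains the full word.
small_vocab = {"un", "##break", "##able"}
large_vocab = small_vocab | {"unbreakable"}

print(wordpiece_tokenize("unbreakable", small_vocab))  # ['un', '##break', '##able']
print(wordpiece_tokenize("unbreakable", large_vocab))  # ['unbreakable']
```

For a morphologically rich language like Hebrew, where many surface forms fuse several morphemes into one token, a larger vocabulary captures more such forms whole, which is the effect the paper measures.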


