Diacritics Restoration using BERT with Analysis on Czech language

05/24/2021
by   Jakub Náplava, et al.

We propose a new architecture for diacritics restoration based on contextualized embeddings, namely BERT, and we evaluate it on 12 languages with diacritics. Furthermore, we conduct a detailed error analysis on Czech, a morphologically rich language with a high level of diacritization. Notably, we manually annotate all mispredictions, showing that roughly 44% of them are actually not errors, but either plausible variants (19%) or the system's corrections of erroneous data (25%). Finally, we categorize the real errors in detail. We release the code at https://github.com/ufal/bert-diacritics-restoration.
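To make the task concrete, the following is a minimal sketch of how diacritics-restoration data is commonly framed: the input is text with diacritics stripped, and a system's output is scored against the original text. This is not the authors' code. The function names `strip_diacritics` and `word_accuracy` are hypothetical, and the word-level accuracy shown here is an assumed illustrative metric, not necessarily the one reported in the paper.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Remove combining marks to simulate undiacritized input (illustrative only).

    Note: letters without a canonical decomposition (e.g. Polish "ł")
    are left unchanged by this approach.
    """
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return unicodedata.normalize("NFC", stripped)


def word_accuracy(predicted: str, reference: str) -> float:
    """Fraction of whitespace-separated words that match the reference exactly."""
    pred_words = predicted.split()
    ref_words = reference.split()
    if len(pred_words) != len(ref_words):
        raise ValueError("predicted and reference must have the same number of words")
    if not ref_words:
        return 1.0
    correct = sum(p == r for p, r in zip(pred_words, ref_words))
    return correct / len(ref_words)


if __name__ == "__main__":
    reference = "Příliš žluťoučký kůň"
    model_input = strip_diacritics(reference)
    print(model_input)
    # A trivial baseline that returns its input unchanged is scored like this:
    print(word_accuracy(model_input, reference))
```

A restoration model would take `model_input` and try to reproduce `reference`. The error analysis described in the abstract suggests that exact-match scoring of this kind can undercount correct outputs, since some mismatches are plausible variants or corrections of erroneous reference data.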


