FullStop:Punctuation and Segmentation Prediction for Dutch with Transformers

01/09/2023
by   Vincent Vandeghinste, et al.
0

When applying automated speech recognition (ASR) for Belgian Dutch (Van Dyck et al. 2021), the output consists of an unsegmented stream of words, without any punctuation. A next step is to perform segmentation and insert punctuation, making the ASR output more readable and easy to manually correct. As far as we know there is no publicly available punctuation insertion system for Dutch that functions at a usable level. The model we present here is an extension of the models of Guhr et al. (2021) for Dutch and is made publicly available. We trained a sequence classification model, based on the Dutch language model RobBERT (Delobelle et al. 2020). For every word in the input sequence, the models predicts a punctuation marker that follows the word. We have also extended a multilingual model, for cases where the language is unknown or where code switching applies. When performing the task of segmentation, the application of the best models onto out of domain test data, a sliding window of 200 words of the ASR output stream is sent to the classifier, and segmentation is applied when the system predicts a segmenting punctuation sign with a ratio above threshold. Results show to be much better than a machine translation baseline approach.

READ FULL TEXT

page 1

page 2

page 3

page 4

research
08/09/2022

Thai Wav2Vec2.0 with CommonVoice V8

Recently, Automatic Speech Recognition (ASR), a system that converts aud...
research
03/24/2022

Language Models that Seek for Knowledge: Modular Search Generation for Dialogue and Prompt Completion

Language models (LMs) have recently been shown to generate more factual ...
research
09/06/2017

A Neural Language Model for Dynamically Representing the Meanings of Unknown Words and Entities in a Discourse

This study addresses the problem of identifying the meaning of unknown w...
research
11/02/2020

Dual-decoder Transformer for Joint Automatic Speech Recognition and Multilingual Speech Translation

We introduce dual-decoder Transformer, a new model architecture that joi...
research
04/05/2016

Character-Level Neural Translation for Multilingual Media Monitoring in the SUMMA Project

The paper steps outside the comfort-zone of the traditional NLP tasks li...
research
12/23/2018

Unsupervised Speech Recognition via Segmental Empirical Output Distribution Matching

We consider the problem of training speech recognition systems without u...
research
01/25/2020

Reducing Noise from Competing Neighbours: Word Retrieval with Lateral Inhibition in Multilink

Multilink is a computational model for word retrieval in monolingual and...

Please sign up or login with your details

Forgot password? Click here to reset