DisMo: A Morphosyntactic, Disfluency and Multi-Word Unit Annotator. An Evaluation on a Corpus of French Spontaneous and Read Speech

02/08/2018
by George Christodoulides, et al.

We present DisMo, a multi-level annotator for spoken language corpora that integrates part-of-speech tagging with basic disfluency detection and annotation, and multi-word unit recognition. DisMo is a hybrid system that combines lexical resources, rules, and statistical models based on Conditional Random Fields (CRF). In this paper, we present the first public version of DisMo for French. The system is trained and evaluated on a 57k-token corpus that includes different varieties of French spoken in three countries (Belgium, France and Switzerland). DisMo supports a multi-level annotation scheme, in which the tokenisation into minimal word units is complemented with multi-word unit groupings (each having associated POS tags), as well as separate levels for annotating disfluencies and discourse phenomena. We present the system's architecture, linguistic resources and its hierarchical tag-set. Results show that DisMo achieves a precision of 95% (finest tag-set) to 96.8% (coarse tag-set) in POS-tagging non-punctuated, sound-aligned transcriptions of spoken French, while also offering substantial possibilities for automated multi-level annotation.
