External Lexical Information for Multilingual Part-of-Speech Tagging

06/12/2016
by Benoît Sagot, et al.

Morphosyntactic lexicons and word vector representations have both proven useful for improving the accuracy of statistical part-of-speech taggers. Here we compare the performance of four systems on datasets covering 16 languages: two feature-based systems (MEMMs and CRFs) and two neural systems (bi-LSTMs). We show that, on average, all four approaches perform similarly and reach state-of-the-art results. However, our feature-based models perform better on lexically richer datasets (e.g. for morphologically rich languages), whereas the neural models score higher on datasets with less lexical variability (e.g. for English). These conclusions hold in particular for the MEMM models relying on our system MElt, which benefited from newly designed features. This shows that, under certain conditions, feature-based approaches enriched with morphosyntactic lexicons are competitive with neural methods.
