Evaluating Sentence Segmentation and Word Tokenization Systems on Estonian Web Texts

11/16/2020
by Kairit Sirts, et al.

Texts obtained from the web are noisy and do not necessarily follow orthographic sentence and word boundary rules. Thus, sentence segmentation and word tokenization systems developed on well-formed texts might not perform as well on unedited web texts. In this paper, we first describe the manual annotation of sentence boundaries in an Estonian web dataset and then present the evaluation results of three existing sentence segmentation and word tokenization systems on this corpus: EstNLTK, Stanza and UDPipe. While EstNLTK obtains the highest sentence segmentation performance of the three systems on this dataset, the sentence segmentation performance of Stanza and UDPipe remains well below the results obtained on the more well-formed Estonian UD test set.


research
08/04/2016

Word Segmentation on Micro-blog Texts with External Lexicon and Heterogeneous Data

This paper describes our system designed for the NLPCC 2016 shared task ...
research
05/13/2020

Sanskrit Segmentation Revisited

Computationally analyzing Sanskrit texts requires proper segmentation in...
research
06/04/2014

A Geometric Method to Obtain the Generation Probability of a Sentence

"How to generate a sentence" is the most critical and difficult problem ...
research
09/15/2017

A Semantic Approach to the Analysis of Rewriting-Based Systems

Properties expressed as the provability of a first-order sentence can be...
research
10/05/2018

Sentence Segmentation for Classical Chinese Based on LSTM with Radical Embedding

In this paper, we develop a low than character feature embedding called ...
research
07/19/2019

Exploring sentence informativeness

This study is a preliminary exploration of the concept of informativenes...
research
07/11/2023

System of Spheres-based Two Level Credibility-limited Revisions

Two level credibility-limited revision is a non-prioritized revision ope...
