Word Similarity Datasets for Thai: Construction and Evaluation

04/08/2019
by Ponrudee Netisopakul, et al.

Distributional semantics in the form of word embeddings is an essential ingredient of many modern natural language processing systems. Quantifying the semantic similarity between words provides a way to evaluate how well a system performs semantic interpretation. To this end, a number of word similarity datasets have been created for the English language over recent decades. For the Thai language, few such resources are available. In this work, we create three Thai word similarity datasets by translating and re-rating the popular WordSim-353, SimLex-999 and SemEval-2017-Task-2 datasets. The three datasets contain 1852 word pairs in total and differ in difficulty, domain coverage, and the notion of similarity they capture (relatedness vs. similarity). These differences help to give a broader picture of the properties of an evaluated word embedding model. We include baseline evaluations with existing Thai embedding models, and identify the high ratio of out-of-vocabulary words as one of the biggest challenges. All datasets, evaluation results, and a tool for easy evaluation of new Thai embedding models are available online to the NLP community.
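
To make the evaluation setup concrete, here is a minimal sketch of how an embedding model might be scored against a word similarity dataset. The abstract does not name the metric or describe the authors' tool, so this assumes the common practice of computing Spearman's rank correlation between the model's cosine similarities and the human ratings, and it reports the out-of-vocabulary (OOV) ratio alongside the score. The function names, the toy vectors, and the toy ratings are all hypothetical and are not taken from the datasets.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation; NaN if either input has zero variance."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else float("nan")

def spearman(x, y):
    """Spearman's rho, computed as Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

def evaluate(pairs, vectors):
    """Score `vectors` on (word1, word2, rating) triples.

    Pairs with a missing word are skipped and counted as OOV
    (an assumption; other OOV policies are possible).
    Returns (spearman_rho, oov_ratio).
    """
    gold, pred, oov = [], [], 0
    for w1, w2, rating in pairs:
        if w1 not in vectors or w2 not in vectors:
            oov += 1
            continue
        gold.append(rating)
        pred.append(cosine(vectors[w1], vectors[w2]))
    rho = spearman(gold, pred) if len(gold) >= 2 else float("nan")
    oov_ratio = oov / len(pairs) if pairs else 0.0
    return rho, oov_ratio

# Toy data, invented for illustration only (not from the Thai datasets).
toy_vectors = {"แมว": [1.0, 0.0], "สุนัข": [0.9, 0.1], "รถ": [0.0, 1.0]}
toy_pairs = [
    ("แมว", "สุนัข", 8.0),
    ("แมว", "รถ", 1.0),
    ("สุนัข", "รถ", 2.0),
    ("แมว", "เรือ", 3.0),  # "เรือ" has no vector, so this pair counts as OOV
]
rho, oov_ratio = evaluate(toy_pairs, toy_vectors)
print(f"Spearman rho = {rho:.3f}, OOV ratio = {oov_ratio:.2f}")
```

Skipping OOV pairs keeps the correlation well defined but can flatter a model with a small vocabulary, which is one reason to report the OOV ratio next to the score rather than the correlation alone.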

research
08/20/2017

Portuguese Word Embeddings: Evaluating on Word Analogies and Natural Language Tasks

Word embeddings have been found to provide meaningful representations fo...
research
05/12/2022

SimRelUz: Similarity and Relatedness scores as a Semantic Evaluation dataset for Uzbek language

Semantic relatedness between words is one of the core concepts in natura...
research
06/25/2016

Intrinsic Subspace Evaluation of Word Embedding Representations

We introduce a new methodology for intrinsic evaluation of word represen...
research
08/24/2018

Features of word similarity

In this theoretical note we compare different types of computational mod...
research
02/25/2020

Language-Independent Tokenisation Rivals Language-Specific Tokenisation for Word Similarity Prediction

Language-independent tokenisation (LIT) methods that do not require labe...
research
10/06/2021

A Fast Randomized Algorithm for Massive Text Normalization

Many popular machine learning techniques in natural language processing ...
research
03/04/2019

Russian Language Datasets in the Digitial Humanities Domain and Their Evaluation with Word Embeddings

In this paper, we present Russian language datasets in the digital human...
