Segmentation-free compositional n-gram embedding

09/04/2018
by Geewook Kim, et al.

Applying conventional word embedding models to unsegmented languages, where word boundaries are not explicitly marked, requires word segmentation as a preprocessing step. However, word segmentation is difficult and expensive to conduct without errors, and segmentation errors degrade the quality of the resulting word embeddings, leading to performance degradation in downstream applications. In this paper, we propose a simple segmentation-free method to obtain unsupervised vector representations for words, phrases and sentences from an unsegmented raw corpus. Our model is based on the subword information skip-gram model, but its embedding targets and contexts are character n-grams instead of segmented words. We consider all possible character n-grams in a corpus as targets, and every target is modeled as the sum of its compositional sub-n-grams. Our method completely ignores word boundaries in the corpus and does not depend on word segmentation. Although this approach may sound reckless, it works well in experiments on real-world datasets and benchmarks.
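The compositional idea in the abstract can be illustrated with a minimal sketch: every character n-gram enumerated from the raw, unsegmented text is a potential target, and a target's vector is composed as the sum of the embeddings of its own sub-n-grams. The function names, n-gram length limits, embedding dimensionality, and random initialization below are illustrative assumptions for exposition, not the authors' implementation, and the training objective (a skip-gram-style loss over co-occurring n-grams) is omitted.

```python
import numpy as np

def char_ngrams(text, n_min=1, n_max=4):
    """Enumerate all character n-grams of a raw, unsegmented string."""
    return [text[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(text) - n + 1)]

# Toy sub-n-gram vocabulary with randomly initialized vectors.
# In the actual model these vectors would be learned with a
# skip-gram-style objective over n-gram co-occurrences.
rng = np.random.default_rng(0)
dim = 8
sub_vocab = {}

def sub_vector(sub):
    """Look up (or lazily create) the embedding of a sub-n-gram."""
    if sub not in sub_vocab:
        sub_vocab[sub] = rng.normal(scale=0.1, size=dim)
    return sub_vocab[sub]

def target_vector(ngram):
    """Compose a target n-gram's vector as the sum of the vectors of
    all its sub-n-grams (including the n-gram itself)."""
    subs = char_ngrams(ngram, n_min=1, n_max=len(ngram))
    return np.sum([sub_vector(s) for s in subs], axis=0)

# Example on an unsegmented Japanese string: no word boundaries are used.
raw = "単語分割は難しい"
targets = char_ngrams(raw, n_min=1, n_max=3)
v = target_vector(targets[10])   # a character bigram, e.g. "分割"
print(len(targets), v.shape)     # number of candidate targets, (8,)
```

The sketch only shows how a target representation is composed; in practice the set of candidate n-grams would be pruned (e.g. by frequency) and both target and context n-gram vectors would be trained jointly.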


Related research

07/05/2020
Improving Chinese Segmentation-free Word Embedding With Unsupervised Association Measure
Recent work on segmentation-free word embedding (sembei) developed a new ...

12/11/2019
Character 3-gram Mover's Distance: An Effective Method for Detecting Near-duplicate Japanese-language Recipes
In websites that collect user-generated recipes, recipes are often poste...

10/21/2020
PBoS: Probabilistic Bag-of-Subwords for Generalizing Word Embedding
We look into the task of generalizing word embeddings: given a set of pr...

10/17/2017
CASICT Tibetan Word Segmentation System for MLWS2017
We participated in the MLWS 2017 on Tibetan word segmentation task, our ...

12/19/2022
Norm of word embedding encodes information gain
Distributed representations of words encode lexical semantic information...

05/24/2018
Baseline Needs More Love: On Simple Word-Embedding-Based Models and Associated Pooling Mechanisms
Many deep learning architectures have been proposed to model the composi...

04/28/2017
Neural Word Segmentation with Rich Pretraining
Neural word segmentation research has benefited from large-scale raw tex...
