Segmentation-free compositional n-gram embedding
Applying conventional word embedding models to unsegmented languages, where word boundaries are not clearly marked, requires word segmentation as a preprocessing step. However, word segmentation is difficult and expensive to perform without errors, and segmentation errors degrade the quality of word embeddings, which in turn hurts performance in downstream applications. In this paper, we propose a simple segmentation-free method to obtain unsupervised vector representations of words, phrases and sentences from an unsegmented raw corpus. Our model is based on the subword-information skip-gram model, but both the embedding targets and the contexts are character n-grams instead of segmented words. We consider all possible character n-grams in a corpus as targets, and model every target as the sum of its compositional sub-n-grams. Our method thus ignores word boundaries entirely and does not depend on any word segmenter. This approach may sound reckless, but experiments on real-world datasets and benchmarks show that it works well.
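The core idea of the model can be illustrated with a minimal sketch: enumerate every character n-gram of a raw string (no word boundaries involved) and represent a target n-gram as the sum of the vectors of its sub-n-grams. The function names, n-gram length range, and vector dimensionality below are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def char_ngrams(text, n_min, n_max):
    # Enumerate all character n-grams of `text` with lengths in
    # [n_min, n_max]; word boundaries are completely ignored.
    return [text[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(text) - n + 1)]

def compose(target, vectors, dim):
    # Model a target n-gram as the sum of the vectors of all its
    # sub-n-grams (including the target itself); unseen sub-n-grams
    # contribute a zero vector in this sketch.
    v = np.zeros(dim)
    for sub in char_ngrams(target, 1, len(target)):
        v += vectors.get(sub, np.zeros(dim))
    return v

# Example: the vector for "ab" sums the vectors of "a", "b", and "ab".
vecs = {'a': np.array([1., 0.]),
        'b': np.array([0., 1.]),
        'ab': np.array([1., 1.])}
print(char_ngrams("abc", 1, 2))   # all 1- and 2-grams of "abc"
print(compose("ab", vecs, dim=2))
```

In the full model, these composed target vectors would replace word vectors in skip-gram training, with context n-grams drawn from the surrounding raw characters.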