Improving Chinese Segmentation-free Word Embedding With Unsupervised Association Measure

07/05/2020
by Yifan Zhang, et al.

Recent work on segmentation-free word embedding (sembei) developed a new word-embedding pipeline for unsegmented languages that avoids segmentation as a preprocessing step. However, the embedding vocabulary contains too many noisy n-grams whose characters are only weakly associated, which limits the quality of the learned word embeddings. To address this problem, a new version of the segmentation-free word embedding model is proposed that collects its n-gram vocabulary via a novel unsupervised association measure called pointwise association with times information (PATI). Compared with commonly used n-gram filtering methods, such as the frequency criterion used in sembei and pointwise mutual information (PMI), the proposed method leverages more latent information from the corpus and is thus able to collect more valid n-grams with stronger cohesion as embedding targets in unsegmented language data such as Chinese text. Experiments on Chinese SNS data show that the proposed model improves the performance of word embeddings in downstream tasks.
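To make the two baseline filters named in the abstract concrete, the following is a minimal sketch of frequency and PMI filtering over character bigrams. It is not the authors' code and does not implement PATI, which the abstract does not define. The function names (`char_ngrams`, `bigram_pmi`, `filter_vocab`), the thresholds, and the toy corpus are all hypothetical, and PMI is estimated here from simple relative frequencies.

```python
import math
from collections import Counter


def char_ngrams(text, n):
    """Return all contiguous character n-grams of a string."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def bigram_pmi(corpus):
    """Count character bigrams and score each by PMI = log(p(xy) / (p(x) * p(y))).

    Probabilities are plain relative frequencies (an assumption of this sketch).
    """
    unigrams = Counter()
    bigrams = Counter()
    for line in corpus:
        unigrams.update(line)                  # counts individual characters
        bigrams.update(char_ngrams(line, 2))   # counts adjacent character pairs
    total_uni = sum(unigrams.values())
    total_bi = sum(bigrams.values())
    scores = {}
    for gram, count in bigrams.items():
        p_xy = count / total_bi
        p_x = unigrams[gram[0]] / total_uni
        p_y = unigrams[gram[1]] / total_uni
        scores[gram] = math.log(p_xy / (p_x * p_y))
    return bigrams, scores


def filter_vocab(corpus, min_count=2, min_pmi=0.0):
    """Keep bigrams that pass both a frequency and a PMI threshold (hypothetical values)."""
    bigrams, scores = bigram_pmi(corpus)
    return {g for g, c in bigrams.items() if c >= min_count and scores[g] >= min_pmi}


if __name__ == "__main__":
    toy_corpus = ["我喜欢北京", "北京很大", "我去北京"]  # illustrative data only
    print(filter_vocab(toy_corpus))
```

On this toy corpus, bigrams that straddle word boundaries occur only once and fall below the frequency threshold, which is the kind of noisy n-gram the abstract describes. Real corpora would need longer n-grams and tuned thresholds.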


research
03/29/2022

An Evaluation Dataset for Legal Word Embedding: A Case Study On Chinese Codex

Word embedding is a modern distributed word representations approach wid...
research
09/04/2018

Segmentation-free compositional n-gram embedding

Applying conventional word embedding models to unsegmented languages, wh...
research
09/15/2021

Fast Extraction of Word Embedding from Q-contexts

The notion of word embedding plays a fundamental role in natural languag...
research
05/25/2019

SuperCaptioning: Image Captioning Using Two-dimensional Word Embedding

Language and vision are processed as two different modal in current work...
research
09/15/2021

SWEAT: Scoring Polarization of Topics across Different Corpora

Understanding differences of viewpoints across corpora is a fundamental ...
research
07/29/2016

A Novel Bilingual Word Embedding Method for Lexical Translation Using Bilingual Sense Clique

Most of the existing methods for bilingual word embedding only consider ...
research
03/31/2020

Enriching Consumer Health Vocabulary Using Enhanced GloVe Word Embedding

Open-Access and Collaborative Consumer Health Vocabulary (OAC CHV, or CH...
