Improving Cross-Domain Chinese Word Segmentation with Word Embeddings

03/05/2019
by Yuxiao Ye, et al.

Cross-domain Chinese Word Segmentation (CWS) remains a challenge despite recent progress in neural-based CWS. The limited amount of annotated data in the target domain has been the key obstacle to satisfactory performance. In this paper, we propose a semi-supervised word-based approach to improving cross-domain CWS given a baseline segmenter. In particular, our model only deploys word embeddings trained on raw text in the target domain, discarding complex hand-crafted features and domain-specific dictionaries. Innovative subsampling and negative sampling methods are proposed to derive word embeddings optimized for CWS. We conduct experiments on five datasets in special domains, covering novels, medicine, and patents. Results show that our model can significantly improve cross-domain CWS, especially in the segmentation of domain-specific noun entities. The word F-measure increases by over 3.0%, outperforming unsupervised cross-domain CWS approaches by a large margin.


research
07/16/2020

Coupling Distant Annotation and Adversarial Training for Cross-Domain Chinese Word Segmentation

Fully supervised neural approaches have achieved significant progress in...
research
04/20/2017

Cross-domain Semantic Parsing via Paraphrasing

Existing studies on semantic parsing mainly focus on the in-domain setti...
research
11/30/2016

Towards Accurate Word Segmentation for Chinese Patents

A patent is a property right for an invention granted by the government ...
research
02/01/2019

A Simple Regularization-based Algorithm for Learning Cross-Domain Word Embeddings

Learning word embeddings has received a significant amount of attention ...
research
06/07/2019

Learning Word Embeddings with Domain Awareness

Word embeddings are traditionally trained on a large corpus in an unsupe...
research
05/09/2018

Cross Domain Regularization for Neural Ranking Models Using Adversarial Learning

Unlike traditional learning to rank models that depend on hand-crafted f...
research
04/10/2021

FreSaDa: A French Satire Data Set for Cross-Domain Satire Detection

In this paper, we introduce FreSaDa, a French Satire Data Set, which is ...
