Towards Accurate Word Segmentation for Chinese Patents

11/30/2016
by Si Li, et al.

A patent is a property right for an invention, granted by the government to the inventor. An invention is a solution to a specific technological problem, so patents often contain a high concentration of scientific and technical terms that are rare in everyday language. A Chinese word segmentation model trained on the currently available everyday-language data sets performs poorly on patents because it cannot effectively recognize these terms. In this paper we describe a pragmatic approach to Chinese word segmentation on patents: we train a character-based semi-supervised sequence labeling model by extracting features from a manually segmented corpus of 142 patents, enhanced with information extracted from the Chinese TreeBank. Experiments show that the accuracy of our model reached 95.08 96.59 set if the model is trained on the Chinese TreeBank. We also experimented with some existing domain adaptation techniques; the results show that both the amount of target-domain data and the selected features affect the performance of these techniques.
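To make the idea of character-based sequence labeling concrete, the following minimal sketch shows how a segmented sentence can be turned into one label per character and back, and how word-level F1 can be computed for a predicted segmentation. The B/M/E/S tag set, the function names, and the example sentence are assumptions chosen for illustration; the abstract does not state which tag set, features, or evaluation script the authors used.

```python
from typing import List, Set, Tuple


def words_to_tags(words: List[str]) -> List[str]:
    """Map segmented words to per-character tags (assumed B/M/E/S scheme)."""
    tags: List[str] = []
    for word in words:
        if len(word) == 1:
            tags.append("S")  # single-character word
        else:
            # Begin, zero or more Middle, End
            tags.extend(["B"] + ["M"] * (len(word) - 2) + ["E"])
    return tags


def tags_to_words(chars: str, tags: List[str]) -> List[str]:
    """Rebuild words from characters and their predicted tags."""
    words: List[str] = []
    current = ""
    for ch, tag in zip(chars, tags):
        current += ch
        if tag in ("E", "S"):  # a word ends at this character
            words.append(current)
            current = ""
    if current:  # tolerate a sequence that ends mid-word
        words.append(current)
    return words


def word_spans(words: List[str]) -> Set[Tuple[int, int]]:
    """Represent each word as a (start, end) character offset pair."""
    spans: Set[Tuple[int, int]] = set()
    start = 0
    for word in words:
        spans.add((start, start + len(word)))
        start += len(word)
    return spans


def segmentation_f1(gold: List[str], pred: List[str]) -> float:
    """Word-level F1: a predicted word counts only if both boundaries match."""
    gold_spans, pred_spans = word_spans(gold), word_spans(pred)
    correct = len(gold_spans & pred_spans)
    if correct == 0:
        return 0.0
    precision = correct / len(pred_spans)
    recall = correct / len(gold_spans)
    return 2 * precision * recall / (precision + recall)


# Hypothetical patent-style phrase, segmented by hand for illustration.
gold = ["半导体", "器件", "的", "制造", "方法"]
pred = ["半导体", "器件", "的", "制造方法"]
gold_tags = words_to_tags(gold)
f1 = segmentation_f1(gold, pred)
```

In this setup a sequence labeling model predicts one tag per character, and `tags_to_words` recovers the segmentation. Scoring by matching word spans penalizes a merged technical term such as `制造方法` in both precision and recall.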


