PKUSEG: A Toolkit for Multi-Domain Chinese Word Segmentation

06/27/2019
by Ruixuan Luo, et al.

Chinese word segmentation (CWS) is a fundamental step in Chinese natural language processing. In this paper, we build a new toolkit, named PKUSEG, for multi-domain word segmentation. Unlike existing single-model toolkits, PKUSEG targets multi-domain word segmentation and provides separate models for different domains, such as web, medicine, and tourism. The toolkit also supports POS tagging and model training, so it can be adapted to various application scenarios. Experiments show that PKUSEG achieves high performance on multiple domains. The toolkit is freely and publicly available for research and industrial use.
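To make the idea of separate per-domain models concrete, here is a minimal sketch in Python. It is not PKUSEG's API or algorithm: the `DOMAIN_LEXICONS` table, the `segment` function, and the toy word lists are all hypothetical, and simple forward maximum matching against a dictionary stands in for a trained segmentation model. The only point illustrated is that the same input can be segmented differently depending on which domain resource is selected.

```python
from typing import Dict, List, Set

# Hypothetical toy lexicons standing in for per-domain models.
# These are illustrative assumptions, not PKUSEG's actual models or data.
DOMAIN_LEXICONS: Dict[str, Set[str]] = {
    "medicine": {"阿司匹林", "肠溶片", "服用"},
    "tourism": {"故宫", "博物院", "门票"},
}


def segment(text: str, domain: str) -> List[str]:
    """Segment `text` by forward maximum matching on the chosen domain lexicon."""
    lexicon = DOMAIN_LEXICONS[domain]
    max_len = max(len(word) for word in lexicon)
    words: List[str] = []
    i = 0
    while i < len(text):
        # Try the longest candidate first; fall back to a single character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in lexicon:
                words.append(candidate)
                i += size
                break
    return words


sentence = "服用阿司匹林肠溶片"
medicine_words = segment(sentence, "medicine")
tourism_words = segment(sentence, "tourism")
print(medicine_words)
print(tourism_words)
```

With the medicine lexicon the drug name is kept as one word, while the tourism lexicon knows none of these terms and falls back to single characters. That mismatch is the kind of domain gap that motivates shipping one model per domain.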


