Green CWS: Extreme Distillation and Efficient Decode Method Towards Industrial Application

11/17/2021
by   Yulan Hu, et al.

Thanks to the strength of pre-trained models, research on Chinese Word Segmentation (CWS) has made great progress in recent years. However, large and complex models are too computationally expensive to deploy in industrial settings. Moreover, in low-resource scenarios, prevalent decoding methods such as the Conditional Random Field (CRF) fail to exploit the full information in the training data. This work proposes a fast and accurate CWS framework that combines a lightweight model with an upgraded decoding method (PCRF), aimed at industrial low-resource CWS scenarios. First, we distill a Transformer-based student model to serve as the encoder, which both speeds up inference and combines open knowledge with domain-specific knowledge. Second, the perplexity score used to evaluate language models is fused into the CRF module to better identify word boundaries. Experiments show that our approach achieves relatively high performance on multiple datasets while taking as little as 14% of the time consumed by the original BERT-based model. Moreover, under the low-resource setting, it obtains superior results compared with traditional decoding methods.
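
For readers unfamiliar with the baseline the abstract refers to, the following is a minimal sketch of standard Viterbi decoding for a linear-chain CRF over BMES segmentation tags, together with the textbook definition of perplexity. It is not the paper's PCRF: the abstract does not say how the perplexity score is fused into the CRF, so that step is deliberately left out. All function names, the tag scheme, and the toy scores are assumptions made for illustration.

```python
import math

TAGS = ["B", "M", "E", "S"]  # Begin, Middle, End, Single-character word
NEG_INF = float("-inf")

# Legal BMES transitions: a word is either S, or B followed by zero or more M, then E.
ALLOWED = {
    ("B", "M"), ("B", "E"),
    ("M", "M"), ("M", "E"),
    ("E", "B"), ("E", "S"),
    ("S", "B"), ("S", "S"),
}


def viterbi(emissions, transitions):
    """Return the highest-scoring tag path.

    emissions:   list of dicts, one per character, mapping tag -> score
                 (in practice these would come from an encoder; here they are toy values).
    transitions: dict mapping (prev_tag, cur_tag) -> score; missing pairs are forbidden.
    """
    # A sentence can only start with B or S.
    best = [{t: (emissions[0][t] if t in ("B", "S") else NEG_INF) for t in TAGS}]
    back = []
    for i in range(1, len(emissions)):
        scores, ptrs = {}, {}
        for cur in TAGS:
            score, prev = max(
                (best[-1][p] + transitions.get((p, cur), NEG_INF), p) for p in TAGS
            )
            scores[cur] = score + emissions[i][cur]
            ptrs[cur] = prev
        best.append(scores)
        back.append(ptrs)
    # A sentence can only end with E or S.
    last = max(("E", "S"), key=lambda t: best[-1][t])
    path = [last]
    for ptrs in reversed(back):
        path.append(ptrs[path[-1]])
    return path[::-1]


def tags_to_words(chars, tags):
    """Cut the character sequence after every E or S tag."""
    words, current = [], ""
    for ch, tag in zip(chars, tags):
        current += ch
        if tag in ("E", "S"):
            words.append(current)
            current = ""
    if current:  # trailing characters of an unfinished word
        words.append(current)
    return words


def perplexity(token_log_probs):
    """Standard perplexity: exp of the negative mean natural-log probability."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


# Toy scores (assumed, not from the paper) for the four characters of "我爱北京".
toy_emissions = [
    {"B": 0.1, "M": 0.0, "E": 0.0, "S": 2.0},
    {"B": 0.2, "M": 0.0, "E": 0.1, "S": 1.5},
    {"B": 2.0, "M": 0.1, "E": 0.0, "S": 0.5},
    {"B": 0.3, "M": 0.0, "E": 2.0, "S": 0.4},
]
toy_transitions = {pair: 0.0 for pair in ALLOWED}

if __name__ == "__main__":
    tags = viterbi(toy_emissions, toy_transitions)
    print(tags, tags_to_words("我爱北京", tags))
```

With these toy scores the decoder returns the tags S, S, B, E, which segments the sentence into 我 / 爱 / 北京. A perplexity-aware decoder would additionally need some way to score candidate segmentations with a language model; how the paper does this is described in the full text, not in the abstract.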


