Mining Word Boundaries in Speech as Naturally Annotated Word Segmentation Data

10/31/2022
by Lei Zhang, et al.

Chinese word segmentation (CWS) models achieve very high performance when the training data is sufficient and in-domain. However, performance drops drastically in cross-domain and low-resource scenarios because of data sparseness. Since constructing large-scale manually annotated data is time-consuming and labor-intensive, in this work we propose, for the first time, to mine word boundary information from pauses in speech to efficiently obtain large-scale naturally annotated CWS data. We present a simple yet effective complete-then-train method that uses these natural annotations from speech for CWS model training. Extensive experiments demonstrate that CWS performance in cross-domain and low-resource scenarios can be significantly improved by leveraging our naturally annotated data extracted from speech.
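To make the core idea concrete, the following is a minimal sketch of how pauses in speech could be turned into partial word boundary annotations. It is an illustration under stated assumptions, not the authors' implementation: the input format (one character with start and end times, as a forced aligner might produce), the function names, the timings, and the 0.2-second pause threshold are all hypothetical. Positions without a detected pause are left undecided rather than treated as word-internal.

```python
from typing import List, Tuple

# Hypothetical input format: (character, start_sec, end_sec) per character,
# e.g. as produced by some character-level speech-text alignment.
CharSpan = Tuple[str, float, float]


def pause_boundaries(chars: List[CharSpan], min_pause: float = 0.2) -> List[int]:
    """Return each index i where a pause of at least `min_pause` seconds
    separates chars[i] from chars[i + 1]. The threshold is an assumption."""
    boundaries = []
    for i in range(len(chars) - 1):
        gap = chars[i + 1][1] - chars[i][2]  # next start minus current end
        if gap >= min_pause:
            boundaries.append(i)
    return boundaries


def partial_segmentation(chars: List[CharSpan], min_pause: float = 0.2) -> str:
    """Render the text with ' | ' at every pause-derived boundary.
    Spans between markers may still contain several words."""
    cuts = set(pause_boundaries(chars, min_pause))
    pieces = []
    for i, (ch, _, _) in enumerate(chars):
        pieces.append(ch)
        if i in cuts:
            pieces.append(" | ")
    return "".join(pieces)


# Made-up timings for illustration only.
example: List[CharSpan] = [
    ("我", 0.00, 0.20),
    ("喜", 0.50, 0.70),
    ("欢", 0.70, 0.90),
    ("北", 1.30, 1.50),
    ("京", 1.50, 1.70),
]

if __name__ == "__main__":
    print(partial_segmentation(example))
```

Because pauses mark only some word boundaries, output of this kind is a partial annotation; the remaining boundaries would still have to be filled in before such data could be used like fully segmented training text.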


