Are word boundaries useful for unsupervised language learning?

10/06/2022
by   Tu Anh Nguyen, et al.
7

Word or word-fragment based Language Models (LM) are typically preferred over character-based ones in many downstream applications. This may not be surprising as words seem more linguistically relevant units than characters. Words provide at least two kinds of relevant information: boundary information and meaningful units. However, word boundary information may be absent or unreliable in the case of speech input (word boundaries are not marked explicitly in the speech stream). Here, we systematically compare LSTMs as a function of the input unit (character, phoneme, word, word part), with or without gold boundary information. We probe linguistic knowledge in the networks at the lexical, syntactic and semantic levels using three speech-adapted black box NLP psycholinguistically-inpired benchmarks (pWUGGY, pBLIMP, pSIMI). We find that the absence of boundaries costs between 2% and 28% in relative performance depending on the task. We show that gold boundaries can be replaced by automatically found ones obtained with an unsupervised segmentation algorithm, and that even modest segmentation performance gives a gain in performance on two of the three tasks compared to basic character/phone based models without boundary information.

READ FULL TEXT

page 6

page 7

research
06/15/2020

Catplayinginthesnow: Impact of Prior Segmentation on a Model of Visually Grounded Speech

We investigate the effect of introducing phone, syllable, or word bounda...
research
09/18/2019

Subword ELMo

Embedding from Language Models (ELMo) has shown to be effective for impr...
research
06/17/2019

Tabula nearly rasa: Probing the Linguistic Knowledge of Character-Level Neural Language Models Trained on Unsegmented Text

Recurrent neural networks (RNNs) have reached striking performance in ma...
research
12/03/2018

Comparing Neural- and N-Gram-Based Language Models for Word Segmentation

Word segmentation is the task of inserting or deleting word boundary cha...
research
10/07/2015

Hierarchical Representation of Prosody for Statistical Speech Synthesis

Prominences and boundaries are the essential constituents of prosodic st...
research
11/23/2018

Unsupervised Word Discovery with Segmental Neural Language Models

We propose a segmental neural language model that combines the represent...
research
11/03/2018

Pushing the boundaries of audiovisual word recognition using Residual Networks and LSTMs

Visual and audiovisual speech recognition are witnessing a renaissance w...

Please sign up or login with your details

Forgot password? Click here to reset