A Masked Segmental Language Model for Unsupervised Natural Language Segmentation

04/16/2021 ∙ by C. M. Downey, et al. ∙ University of Washington

Segmentation remains an important preprocessing step both in languages where "words" or other important syntactic/semantic units (like morphemes) are not clearly delineated by white space, as well as when dealing with continuous speech data, where there is often no meaningful pause between words. Near-perfect supervised methods have been developed for use in resource-rich languages such as Chinese, but many of the world's languages are both morphologically complex, and have no large dataset of "gold" segmentations into meaningful units. To solve this problem, we propose a new type of Segmental Language Model (Sun and Deng, 2018; Kawakami et al., 2019; Wang et al., 2021) for use in both unsupervised and lightly supervised segmentation tasks. We introduce a Masked Segmental Language Model (MSLM) built on a span-masking transformer architecture, harnessing the power of a bi-directional masked modeling context and attention. In a series of experiments, our model consistently outperforms Recurrent SLMs on Chinese (PKU Corpus) in segmentation quality, and performs similarly to the Recurrent model on English (PTB). We conclude by discussing the different challenges posed in segmenting phonemic-type writing systems.




1 Introduction

Other than in the orthography of English and languages with similar writing systems, natural language is not usually overtly segmented into meaningful units. Many languages, like Chinese, are written with no spaces in between characters, and so Chinese Word Segmentation remains an active field of study. In addition, running speech is usually highly fluent with no meaningful pauses existing between “words” like we see in orthography.

While the tokenization schemes for powerful modern language models have now largely been passed off to greedy information-theoretic algorithms like Byte-Pair Encoding (Sennrich et al., 2016) and the subsequent SentencePiece (Kudo and Richardson, 2018), which create subword vocabularies of a desired size by iteratively joining commonly co-occurring units, these segmentations are usually not sensical to human readers. For instance, the word twice is sometimes modeled as tw + ice, even though a human would know that twice has nothing to do with the meaning of ice. Given the current performance of models using BPE-type tokenization, the nonsensical nature of these segmentations does not necessarily seem to inhibit the success of neural models.

However BPE does not necessarily help in situations where knowing a sensical segmentation of linguistic-like units is important, such as attempting to model the ways in which children acquire language (Goldwater et al., 2009), segmenting free-flowing speech (Kamper et al., 2016; Rasanen and Blandon, 2020), creating linguistic tools for morphologically complex languages (Moeng et al., 2021), or studying the structure of an endangered language or one with no current speakers (Dunbar et al., 2020).

In this paper, we develop a new unsupervised model for robust segmentation in natural language settings. While near-perfect supervised models have now been developed for resource-rich languages like Chinese, most of the world’s languages do not have large corpora of training data. Especially for morphologically complex languages, large datasets containing “gold” segmentations into units like morphemes are very rare. To solve this problem we propose a type of Segmental Language Model (Sun and Deng, 2018; Kawakami et al., 2019), based on the powerful neural Transformer architecture (Vaswani et al., 2017).

Segmental Language Models grew out of a desire to produce strong language models that can also be used for unsupervised segmentation that correlates with human notions like word and morpheme boundaries (Kawakami et al., 2019). SLMs can be considered a form of character or open-vocabulary language model in which the input is a sequence of characters, and the sequence is modeled as a latent series of segments comprising one or more characters. The loss for these models is computed from the marginal probability over all segmentation paths through the sequence. Decoding can then be carried out via efficient dynamic programming.

In departure from previous SLMs, we present a Masked Segmental Language Model (MSLM), built on the Transformer’s powerful ability to mask out and predict spans of input characters, and embracing a fully bi-directional modeling context with attention. As far as we are aware, we are the first to introduce a non-recurrent architecture for segmental modeling, and conduct comparisons to recurrent baselines across several standard word-segmentation datasets in both Chinese and English, with the hope of expanding to more domains in future work.

In Section 2, we overview baselines in unsupervised segmentation as well as other models that have influenced the Segmental Language Model. In Section 3, we provide a formal characterization of SLMs in general, as well as a characterization of the architecture and modeling assumptions that make the MSLM distinct from previous work. In Section 4, we present the experiments conducted in the present study using both recurrent and masked SLMs, and in Sections 5-6 we show that the Masked Segmental Language Model consistently outperforms its recurrent counterpart on Chinese segmentation (on the Beijing University Corpus), while performing similarly to the recurrent model on English (the Penn Treebank). We consider these results here to constitute a promising proof-of-concept for working with MSLMs (especially for Chinese), and end in Section 7 by laying out directions for future work with these models.

2 Related Work

2.1 Notable Segmentation Techniques

Unsupervised segmentation of natural language has gone through several modeling paradigms over the last few decades. One of the earliest applications of machine learning to this problem is due to Elman (1990), who described how temporal peaks in the surprisal of recurrent language models provide a useful heuristic for inferring word boundaries.

After this, Minimum Description Length (MDL) (Rissanen, 1989) was the dominant paradigm for some years. MDL models language by assuming that empirical data can be encoded in an information-theoretic sense, compressing a corpus to a smaller size using binary representations of linguistic units. The optimal modeling hypothesis is then defined to be the one which minimizes the combined encoded length of the data and the encoding scheme itself, in terms of bits. This model family underlies well-known segmentation tools such as Morfessor (Creutz and Lagus, 2002) and other notable works on unsupervised segmentation (de Marcken, 1996; Goldsmith, 2001).

More recently, various Bayesian models have proved some of the most accurate in terms of their ability to model word boundaries in methodologies closely bound to hypotheses about how children learn words and other units from inherently unsegmented speech. Some of the best examples are Hierarchical Dirichlet Processes (Teh et al., 2006), such as those applied to natural language by Goldwater et al. (2009), as well as Nested Pitman-Yor (Mochihashi et al., 2009). However, as Kawakami et al. (2019) note, most of these models do not adequately account for long-range dependencies in the same capacity as modern neural language models.

2.2 Precursors to SLMs

Segmental Language Models, such as the one presented in this paper, have their roots in a variety of recurrent models proposed for finding implicit or explicit hierarchical structure in sequential data. Highly influential among these is the Connectionist Temporal Classification of Graves et al. (2006), which introduced a method of training RNNs to label unsegmented speech data. This in turn heavily influenced the Sleep-WAke Networks of Wang et al. (2017), which use hierarchical RNNs to map an input timeseries to latent segments starting at each position, a key idea in Segmental Language Models.

The Segmental RNN of Kong et al. (2016) is somewhat unique among this list in jointly modeling segmentation and sequence-labeling in which model loss is computed as the clique potential between the Bi-LSTM embedding of a segment, its label, and a latent variable explicitly representing its length. This model is used in both a fully supervised setting in which both the labels and segment boundaries are observed, and a partially supervised one in which only the labels are observed.

Lastly, SLMs draw heavily from various character and open-vocabulary language models, which can model language units as combinations of observed subword symbols. Chung et al. (2017) introduce Hierarchical Multiscale Recurrent Neural Networks (HM-RNN), which are character-based language models with subsequent layers representing “higher” levels of abstraction over the input sequence, and lower layers feeding the higher ones. In this model, segmentations remain latent and are not explicitly decoded.

Kawakami et al. (2017) and Mielke and Eisner (2019) present models in the domain of open-vocabulary language modeling in which words can be represented either as atomic lexical units, or as ones built out of characters in the input sequence. While the hierarchical nature and dual-generation strategy of these types of models did influence Segmental Language Models (Kawakami et al., 2019), both assume that word boundaries are available during training, and use these gold boundaries to form on-line word embeddings from the characters that comprise them. (This in turn is based on Sordoni et al. (2015), which instead of modeling words as a sequence of characters, models utterance-actions as a sequence of words.) In contrast, Segmental Language Models usually assume no word boundary information is available in the training data.

2.3 Segmental Language Models

While a more technical description of Segmental Language Models can be found in the next section, we give a short overview of other work which falls under that definition here. The term Segmental Language Model seems to be jointly due to Sun and Deng (2018) and Kawakami et al. (2019). Sun and Deng (2018) demonstrate strong results for unsupervised Chinese Word Segmentation using an LSTM-based Segmental Language Model and greedy decoding, competitive with and sometimes exceeding state of the art at that time depending on the dataset and configuration. However it should be pointed out that this study did tune the model for segmentation quality on a validation set, which we will rather call a “lightly supervised” setting (see Section 4.3).

Kawakami et al. (2019) also use LSTM-based SLMs, but maintain a strictly unsupervised setting in which the models can only be trained to optimize implicit language-modeling performance on the validation set, and are not allowed to tune on segmentation quality. Here it is reported that the “vanilla” SLMs give sub-par segmentations unless combined with one or more regularization techniques, including a character n-gram “lexicon” and length regularization.

Finally, Wang et al. (2021) very recently introduced a bidirectional SLM based on a Bi-LSTM. This study shows improved results over the unidirectional SLM of Sun and Deng (2018), as well as testing over more supervision settings, and including novel methods for combining decoding decisions over the forward and backward directions. This final study is most similar to our own work, though Transformer-based SLMs utilize a bidirectional context in a qualitatively different way, and do not require an additional layer to capture the reverse context (see Section 3.3).

3 Model

3.1 Recurrent SLMs

The model introduced here is a type of Segmental Language Model, following the terminology of Sun and Deng (2018) and Kawakami et al. (2019). A schematic of the original Recurrent SLM can be found in Figure 1. Within Segmental Language Models, a sequence of symbols or time-steps x can further be modeled as a sequence of segments y, which are themselves sequences of the input time-steps, such that the concatenation of the segments in y recovers x.

When implemented using Recurrent (unidirectional) Neural Networks, SLMs are broken into two levels known as a Context Encoder and a Segment Decoder. The Segment Decoder estimates the likelihood of the $j$-th character in the segment starting at index $i$, written $y^i_j$, conditioned on the preceding characters of that segment and on the encoded left context.

On the other hand, the Context Encoder encodes information about the input sequence up to index $i$ in a hidden encoding $h_i$, which is computed recurrently from the previous encoding $h_{i-1}$ and the current input $x_i$.

Finally, the Context Encoder “feeds” the Segment Decoder such that the initial character of a segment beginning at $i$ is decoded using transformations of the encoded context $h_{i-1}$ as initial states (the two transformations are single feed-forward layers).

For inference, the log conditional probability of a segment $y^{i:i+k}$ (starting at index $i$ and of length $k$) is modeled as the log conditional probability of generating $y^{i:i+k}$ with the Segment Decoder given the left conditioning context $h_{i-1}$. Note this means that the probability of a segment is not conditioned on other segments or segmentation choices, but only on the unsegmented input timeseries. Thus, the probability of the segment is the product of the Segment Decoder's probabilities for each of its characters and for the end-of-segment symbol that closes it.
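For reference, the four relations described in this subsection can be written compactly. This is a sketch in an assumed notation rather than the authors' exact formulation: $h_i$ is the context encoding after reading $x_i$, $h^i_j$ is the Segment Decoder state inside the segment starting at $i$, $y^i_{k+1}$ is the end-of-segment symbol of a length-$k$ segment, and $g_h$ and $g_{\mathrm{start}}$ are hypothetical names for the two feed-forward layers.

```latex
\begin{align*}
p(y^i_j \mid y^i_{<j},\, x_{<i}) &= \mathrm{Decoder}\big(h^i_{j-1},\, y^i_{j-1}\big) \\
h_i &= \mathrm{Encoder}\big(h_{i-1},\, x_i\big) \\
h^i_0 &= g_h(h_{i-1}), \qquad y^i_0 = g_{\mathrm{start}}(h_{i-1}) \\
p(y^{i:i+k} \mid x_{<i}) &= \prod_{j=1}^{k+1} p(y^i_j \mid y^i_{<j},\, x_{<i})
\end{align*}
```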

The probability of a sentence can thus be modeled as the marginal probability over all possible segmentations of the input, as in equation (1) below (where $\mathcal{Z}(x)$ is the set of all possible segmentations $z$ of an input $x$). However, since there are $2^{|x|-1}$ possible segmentations, directly marginalizing as in (1) is intractable. Instead, dynamic programming over a forward-pass lattice can be used to recursively compute the marginal as in (2), given the base condition that the empty prefix has probability 1. The maximum-probability segmentation can then be read off of the backpointer-augmented lattice through Viterbi decoding.
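Equations (1) and (2) can be sketched as follows, in an assumed notation that may differ from the authors': a segmentation $z$ is treated as a set of (start, length) pairs $(i, k)$ that tile the input, $x_{<t}$ is the prefix made of the first $t$ symbols, and $L$ is the maximum segment length.

```latex
\begin{align}
p(x) &= \sum_{z \in \mathcal{Z}(x)} \; \prod_{(i,k) \in z} p\big(y^{i:i+k} \mid x_{<i}\big) \tag{1} \\
p(x_{<t}) &= \sum_{k=1}^{\min(L,\,t)} p(x_{<t-k}) \; p\big(y^{t-k:t} \mid x_{<t-k}\big),
\qquad p(x_{<0}) = 1 \tag{2}
\end{align}
```

Written this way, recursion (2) needs on the order of $|x| \cdot L$ segment evaluations, instead of the $2^{|x|-1}$ products summed in (1).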

Figure 1: Recurrent Segmental Language Model
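The forward recursion and Viterbi read-off described in Section 3.1 can be sketched in plain Python. This is a minimal illustration, not the authors' implementation: `forward_and_viterbi`, `seg_logp`, and the toy lexicon scorer are hypothetical names, and the toy scorer is an unnormalized stand-in for the Segment Decoder.

```python
import math
from typing import Callable, List, Tuple

NEG_INF = float("-inf")


def log_sum_exp(values: List[float]) -> float:
    """Numerically stable log(sum(exp(v))) for a non-empty list."""
    top = max(values)
    if top == NEG_INF:
        return NEG_INF
    return top + math.log(sum(math.exp(v - top) for v in values))


def forward_and_viterbi(
    n_chars: int,
    max_seg_len: int,
    seg_logp: Callable[[int, int], float],
) -> Tuple[float, List[Tuple[int, int]]]:
    """Return (log marginal probability, best segmentation as (start, end) spans).

    seg_logp(start, length) is assumed to give the log-probability of the
    segment covering characters start .. start+length-1 given its left context.
    """
    log_alpha = [NEG_INF] * (n_chars + 1)   # log marginal of each prefix
    best_score = [NEG_INF] * (n_chars + 1)  # best single-path score per prefix
    backpointer = [0] * (n_chars + 1)       # length of the last segment on that path
    log_alpha[0] = 0.0                      # base condition: empty prefix has probability 1
    best_score[0] = 0.0

    for end in range(1, n_chars + 1):
        terms = []
        for length in range(1, min(max_seg_len, end) + 1):
            start = end - length
            logp = seg_logp(start, length)
            terms.append(log_alpha[start] + logp)
            candidate = best_score[start] + logp
            if backpointer[end] == 0 or candidate > best_score[end]:
                best_score[end] = candidate
                backpointer[end] = length
        log_alpha[end] = log_sum_exp(terms)

    # Follow backpointers from the end of the sequence to recover the best path.
    spans = []
    end = n_chars
    while end > 0:
        length = backpointer[end]
        spans.append((end - length, end))
        end -= length
    spans.reverse()
    return log_alpha[n_chars], spans


# Toy scorer (assumption): known "words" get high probability, everything else low.
text = "thecat"
lexicon = {"the": 0.4, "cat": 0.4}


def toy_seg_logp(start: int, length: int) -> float:
    return math.log(lexicon.get(text[start:start + length], 0.01))


log_marginal, segments = forward_and_viterbi(len(text), 3, toy_seg_logp)
words = [text[a:b] for a, b in segments]
```

With a real model, `seg_logp` would be replaced by the Segment Decoder's score for each candidate segment, and the negative of `log_marginal` would serve as the training loss.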

3.2 New Model: Masked SLM

In the present study, we present a type of Segmental Language Model designed to leverage the powerful, non-directional contextual encoding of neural transformers with self-attention (Vaswani et al., 2017), as well as their potential for parallelization and short derivational chains.

Traditional directional language models estimate the distribution over the next word or symbol $x_i$, conditional on one or more previous positions (i.e. $n$-gram models), or on a latent representation of the entire leftward context (RNNs). However, recent advancements in language modeling have come from removing left-to-right modeling assumptions and moving to bi-directional architectures (e.g. BiLSTMs, Schuster and Paliwal (1997); Graves and Schmidhuber (2005)) or non-directional/fully-connected ones (transformers).

One of the main advantages of the non-directional transformer encoder is its ability to take into account both the left and right context in conditioning the distribution over a certain encoded position. It is with this advantage in mind that we introduce the Masked Segmental Language Model. We take inspiration from the Masked Language Model of Devlin et al. (2019), in which, instead of predicting the “next” word from the leftward context, a word $x_i$ is masked and re-predicted conditional on the entire context $x_{<i}$ and $x_{>i}$, i.e. the Cloze Task (Taylor, 1957).

A key difference between standard, transformer-based Masked Language Models like BERT and the one we present here is that while it is standard to predict single tokens based on all the rest, for Segmental Language Models, we are interested in predicting a series of tokens (constituting a segment) based on all other tokens that are not in the segment. For instance, if we are trying to predict the three-character segment starting at $x_i$, the distribution to be estimated is $p(x_i, x_{i+1}, x_{i+2} \mid x_{<i}, x_{>i+2})$.

While BERT generally masks out single tokens for prediction, there are some recent pre-training techniques for Transformers, such as MASS (Song et al., 2019) and BART (Lewis et al., 2020), that mask out spans to be predicted in the way that we describe. However, one major difference between our model and all of these masking schemes is that while the pre-training data for large Transformer models is usually large enough that only about 15% of training tokens are masked, we will always want to estimate the generation probability for every possible segment of x. Since the usual method for masking requires replacing the masked token(s) with a special symbol, only one span can be predicted with each forward pass (while retaining the full remaining context). However, in each sequence there are on the order of $|x|^2$ possible segments (or $|x| \cdot L$ when segments are capped at a maximum length $L$), so replacing each one with a mask token and recovering it would require as many forward passes.

These design considerations have influenced our development of a Segmental Transformer Encoder, and the Segmental Attention Mask around which it is based. The main idea of the segmental encoder is that for each forward pass, an encoding is generated for every possible starting position in x for a segment of up to length $L$. The encoding at timestep $i-1$ corresponds to every possible segment whose first timestep is at index $i$. Thus with a maximum segment length of $L$ and total sequence length $T$, the encoding at each index $i-1$ will approximate $p(x_i, \ldots, x_{i+k-1} \mid x_{<i}, x_{\geq i+k})$ for every segment length $k = 1, \ldots, L$.


This task moves the Segmental Language Model framework into a bi-directional/non-directional paradigm, and is enabled by an attention mask designed to condition predictions only on indices that are not part of the material to be predicted. An example diagram of this mask with $L = 3$ is shown in Figure 2, and the mask for any maximum segment length $L$ can be formally defined as follows:
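One way to write such a mask is sketched here, under the indexing assumption that the encoding at query position $q$ represents the segments beginning at $q+1$. The symbol $M$ and the indices $q$ and $j$ are illustrative names, not necessarily the authors':

```latex
M_{q,j} =
\begin{cases}
-\infty & \text{if } q < j \le q + L \\
0 & \text{otherwise}
\end{cases}
```

The mask is added to the attention scores before the softmax, so an entry of $-\infty$ gives key position $j$ zero weight for query position $q$.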

Figure 2: Segmental Attention Mask with segment-length ($L$) of 3. Blue and orange squares mark the two mask values (attention allowed and attention blocked). This mask blocks the position encoding the segment in the Queries from attending to segment-internal positions in the Keys.

In order to prevent information leaking “from under the mask”, our segmental encoder uses a slightly different configuration in its first layer than in all subsequent layers. In the first layer, Queries, Keys, and Values are all learned from the original input encodings. In subsequent layers, the Queries come from the hidden encodings output by the previous layer, while Keys and Values are learned directly from the original encodings. If Queries and Keys or Queries and Values both come from the previous layer, information can leak from positions that are supposed to be masked for each respective query position.
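The layer configuration above can be illustrated with a small NumPy sketch. It is a simplified stand-in, not the authors' encoder: it uses single-head attention with no residual connections, layer normalization, or feed-forward sublayers, and the names `segmental_mask`, `attention`, and `segmental_encoder` are hypothetical. The mask assumes that the encoding at query position q stands for segments beginning at q+1.

```python
import numpy as np


def segmental_mask(seq_len: int, max_seg_len: int) -> np.ndarray:
    """Additive mask: query q may not attend to keys q+1 .. q+max_seg_len."""
    q = np.arange(seq_len)[:, None]
    k = np.arange(seq_len)[None, :]
    blocked = (k > q) & (k <= q + max_seg_len)
    return np.where(blocked, -np.inf, 0.0)


def attention(query_src, key_value_src, weights, mask):
    """Single-head scaled dot-product attention with an additive mask."""
    w_q, w_k, w_v = weights
    queries = query_src @ w_q
    keys = key_value_src @ w_k
    values = key_value_src @ w_v
    scores = queries @ keys.T / np.sqrt(queries.shape[-1]) + mask
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    probs = np.exp(scores)                                # exp(-inf) == 0 for blocked keys
    probs = probs / probs.sum(axis=-1, keepdims=True)
    return probs @ values


def segmental_encoder(inputs, layers, mask, kv_from_input=True):
    """Stack of attention layers.

    Queries come from the previous layer's output (the raw inputs in layer 1).
    With kv_from_input=True, Keys and Values are always computed from the raw
    inputs, as described in the text; kv_from_input=False is the leaky variant
    kept here only for comparison.
    """
    hidden = inputs
    for weights in layers:
        key_value_src = inputs if kv_from_input else hidden
        hidden = attention(hidden, key_value_src, weights, mask)
    return hidden
```

A simple way to probe the design is to perturb the inputs at positions q+1 through q+L and compare the output at position q: with `kv_from_input=True` it should be unchanged, while the leaky variant picks up the perturbation through the previous layer's encodings of other positions.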

The encodings learned by the segmental encoder can then be input to an SLM segment decoder in exactly the same way as in previous recurrent models (Figure 3).

Figure 3: Masked Segmental Language Model with .

Finally, to add positional information to the MSLM encoder while preventing the parameter explosion of learned positional embeddings, we use static sinusoidal encodings (Vaswani et al., 2017) and also apply a linear mapping to the concatenation of the original and positional embeddings to learn the ratio at which to add the two together.
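A minimal NumPy sketch of this idea follows. The sinusoidal encodings use the standard formulation of Vaswani et al. (2017). The mixing step is an assumption made for illustration: the text does not specify how the learned ratio is applied, so the sketch maps the concatenation to one scalar per position, squashes it with a sigmoid, and interpolates. `mix_embeddings` and `weight` are hypothetical names.

```python
import numpy as np


def sinusoidal_encodings(seq_len: int, dim: int) -> np.ndarray:
    """Static sinusoidal position encodings; dim is assumed to be even."""
    positions = np.arange(seq_len)[:, None]
    inv_freq = np.exp(-np.log(10000.0) * np.arange(0, dim, 2) / dim)
    angles = positions * inv_freq[None, :]
    enc = np.zeros((seq_len, dim))
    enc[:, 0::2] = np.sin(angles)  # even dimensions
    enc[:, 1::2] = np.cos(angles)  # odd dimensions
    return enc


def mix_embeddings(char_emb: np.ndarray, pos_enc: np.ndarray, weight: np.ndarray) -> np.ndarray:
    """Combine character and positional embeddings with a learned ratio.

    weight has shape (2 * dim,), i.e. a bias-free linear map from the
    concatenated embeddings to one scalar per position (assumed design).
    """
    concat = np.concatenate([char_emb, pos_enc], axis=-1)  # (seq_len, 2 * dim)
    ratio = 1.0 / (1.0 + np.exp(-(concat @ weight)))       # (seq_len,), in (0, 1)
    ratio = ratio[:, None]
    return ratio * char_emb + (1.0 - ratio) * pos_enc
```

Under this assumed design, an embedding size of 256 gives a weight vector of 512 entries.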

3.3 Directional Modeling Considerations

Switching to a Transformer-based SLM makes the language modeling context naturally bidirectional (or rather adirectional). The encoder is allowed to attend to positions equally over the left and right context, so long as they are not masked. There is reason to believe a bidirectional modeling context is at least as psychologically plausible as one where representations are only conditioned on “previous” material (Luce (1986) shows that in acoustic perception, most words need at least some following context to be recognizable). In addition, bidirectional modeling assumptions in language models like ELMo (Peters et al., 2018) and BERT (Devlin et al., 2019) have led to huge advances in learning quality linguistic representations.

However, a Transformer encoder does not necessarily have to be bidirectional: take, for example, the directional or “causal” mask used for sequence decoding with Transformers (Vaswani et al., 2017). In addition, an RNN-based encoder can be bidirectional (although the latent representations of each direction are separate and need to be combined; see Wang et al. (2021) for a BiLSTM-based SLM).

To tease apart these modeling assumptions, we will additionally posit a Directional Masked Segmental Language Model, which is the same as the normal or “Cloze” MSLM, except that a directional mask is used instead of the span masking type seen in Figure 2. Using the directional mask, as seen in Figure 4, the encoder is still completely attention-based, but the language modeling context is strictly “directional”, in that positions are only allowed to attend over a monotonic “leftward” context. Our experiments use both MSLMs and DMSLMs in order to test the relative importance of the bidirectional modeling assumption in comparison to merely switching to an attention-based computational graph.
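For comparison with the span mask, a directional mask can be sketched as below. This is the standard additive causal mask, written with a hypothetical helper name; whether the authors' DMSLM indexes it in exactly this way is an assumption.

```python
import numpy as np


def directional_mask(seq_len: int) -> np.ndarray:
    """Additive causal mask: query q may attend only to keys at positions <= q."""
    q = np.arange(seq_len)[:, None]
    k = np.arange(seq_len)[None, :]
    return np.where(k > q, -np.inf, 0.0)
```

Unlike the span mask, every position to the right of the query is blocked here, so no rightward context is ever available to the encoder.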

4 Experiments

Our experiments aim to assess the usefulness of Segmental Language Models across three dimensions: (1) network architecture and language modeling assumptions, (2) evaluation metrics, specifically segmentation quality and language-modeling performance, and (3) supervision setting (if and where gold segmentation data is assumed during training).

4.1 Architecture and Modeling

One of the main comparisons we make is between SLM encoder architectures. As discussed in Section 3.3, changing from a recurrent (unidirectional) LSTM-based encoder to a Transformer encoder introduces several key modeling assumptions: the encoding for a certain position is conditioned directly on given context positions, and the context is bidirectional.

In order to analyze the relative importance of these modeling decisions, we test on SLMs with three different types of encoders: the standard R(ecurrent)SLM based on an LSTM, the M(asked)SLM introduced in 3.2 with a segmental Cloze-task mask, and a D(irectional)MSLM, which is the same as the MSLM with the exception that it has a “causal” or directional mask sometimes used in Transformer decoding (see Figure 4). The RSLM is thus (+recurrent context, +directional), the DMSLM is (-recurrent context, +directional), and the MSLM is (-recurrent context, -directional).

Figure 4: Directional Masked Segmental Language Model

For all models, we use an LSTM for the segment-decoder portion. This is partly as a control, but also because the decoded sequences are relatively short and might not necessarily benefit as much from an attention model. See also Chen et al. (2018) for hybrid models with Transformer encoders and recurrent decoders.

4.2 Evaluation Metrics

There are many segmentation models that are not strong language models, e.g. Bayesian models like Hierarchical Dirichlet Processes (Teh et al., 2006; Goldwater et al., 2009) and Nested Pitman-Yor (Mochihashi et al., 2009). Part of the motivation for SLMs was thus to create strong language models that can also be used for segmentation (Kawakami et al., 2019). Because of this, we report both segmentation quality and language modeling strength.

As the measure of segmentation quality, we get the word-F1 score for each corpus using the scoring script from the SIGHAN Bakeoff (Emerson, 2005). Following Kawakami et al. (2019), we report this measure over the entire corpus. For language modeling performance, we report the average Bits Per Character (bpc) loss over the test set.
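The two reported measures can be sketched as follows. This is an illustrative re-implementation, not the SIGHAN scoring script: a predicted word counts as correct when both of its character boundaries match a gold word, counts are pooled over the whole corpus, and the bpc helper assumes the model's loss is a summed negative log-likelihood in nats. All function names are hypothetical.

```python
import math
from typing import List, Sequence, Set, Tuple


def word_spans(words: Sequence[str]) -> Set[Tuple[int, int]]:
    """Character (start, end) spans of each word in a segmented sentence."""
    spans = set()
    start = 0
    for word in words:
        spans.add((start, start + len(word)))
        start += len(word)
    return spans


def corpus_word_f1(
    gold: List[Sequence[str]], pred: List[Sequence[str]]
) -> Tuple[float, float, float]:
    """Corpus-level word precision, recall, and F1."""
    correct = n_gold = n_pred = 0
    for gold_words, pred_words in zip(gold, pred):
        gold_spans, pred_spans = word_spans(gold_words), word_spans(pred_words)
        correct += len(gold_spans & pred_spans)
        n_gold += len(gold_spans)
        n_pred += len(pred_spans)
    precision = correct / n_pred if n_pred else 0.0
    recall = correct / n_gold if n_gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def bits_per_character(total_nll_nats: float, n_chars: int) -> float:
    """Convert a summed negative log-likelihood in nats to bits per character."""
    return total_nll_nats / (n_chars * math.log(2))
```

For example, scoring the prediction `the catsat` against the gold `the cat sat` matches one of two predicted words and one of three gold words.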

4.3 Supervision Setting

Because previous studies have used SLMs both in “lightly supervised” settings (Sun and Deng, 2018) and totally unsupervised ones (Kawakami et al., 2019), and because we expect SLMs to realistically be deployed in either use case, we test both settings in our experimentation. For all model types, we conduct a hyperparameter sweep and select both the configuration that maximizes the validation segmentation quality (light supervision) and the one that minimizes the validation bpc (unsupervised).

We note here that for reasons of efficiency, we don’t actually use the SIGHAN script for word-wise segmentation F1 for evaluation of training checkpoints, and instead convert each sequence to a binary vector corresponding to whether each character has a boundary after it or not. We then evaluate segmentation quality as a binary classification task over each position, using Matthews Correlation Coefficient to choose hyperparameters and checkpoints in the lightly-supervised setting.
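The boundary-vector evaluation described above can be sketched as follows. The helper names are hypothetical, words are assumed to be non-empty, and the sketch keeps the final character of each sequence (which always has a boundary after it); whether the authors include that position is not stated.

```python
import math
from typing import List, Sequence


def boundary_vector(words: Sequence[str]) -> List[int]:
    """1 if a character is followed by a word boundary, else 0."""
    vector: List[int] = []
    for word in words:
        vector.extend([0] * (len(word) - 1) + [1])
    return vector


def matthews_corrcoef(gold: Sequence[int], pred: Sequence[int]) -> float:
    """Matthews Correlation Coefficient for two equal-length binary vectors."""
    tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
    tn = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 0)
    fp = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0
```

Because both vectors have one entry per character, this comparison needs no alignment of word spans, which is what makes it cheaper than word-level F1 for scoring checkpoints.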

4.4 Datasets

We evaluate our selection of Segmental Language Models on two out of the five datasets used in Kawakami et al. (2019). For each set, we use the same training, validation, and test split. The sets were chosen to represent two relatively different language writing systems, spanning Chinese (PKU) and English (PTB). Basic statistics for each of the datasets can be found in Table 1. One striking difference between the two types of writing systems can be seen in the character vocabulary size: phonemic-type writing systems like English have a much smaller vocabulary of tokens, with words being built out of longer sequences of characters that are not very expressive on their own. We speculate on the effects each of these systems has on modeling performance in Section 6.

Corpus PKU PTB
Tokens/Characters 1.93M 4.60M
Words 1.21M 1.04M
Lines 20.78k 49.20k
Avg. Characters per Word 1.59 4.44
Character Vocabulary Size 4508 46
Table 1: Statistics for the datasets used in experimentation

Beijing University Corpus (PKU)

The Beijing University Corpus has been used as a Chinese Word Segmentation benchmark since it was part of the International Chinese Word Segmentation Bakeoff (Emerson, 2005). One minor change we make to this dataset is to tokenize English, number, and punctuation tokens using the module from Sun and Deng (2018), so as to yield more comparable results to this paper and others that tokenize in this way before segmentation. Unlike Sun and Deng (2018), we do not pre-split sequences on punctuation.

Penn Treebank (PTB)

For English, we use the version of the ubiquitous Penn Treebank corpus prepared in previous work (Kawakami et al., 2019; Mikolov et al., 2010).

4.5 Parameters and Trials

For all models, we conduct hyperparameter tuning over four learning rates on a single random seed (2). For PKU, the learning rates swept are {6e-4, 8e-4, 1e-3, 3e-3}, and for PTB we use {1e-3, 3e-3, 5e-3, 7e-3}. After the parameter sweep, the configuration that maximizes validation segmentation quality and the one that minimizes validation bpc are run over an additional four random seeds: {3, 5, 8, 13}.

All models have one encoder layer and one decoder layer, as well as an embedding and hidden size of 256. A dropout rate of 0.1 is applied leading into both the encoder and the decoder. Transformers use 4 attention heads and a feedforward size of 509 (chosen to come out less than or equal to the number of parameters in the standard LSTM). Using this feedforward size, the number of trainable parameters in the Transformer-based encoder comes out to 592,381, while the LSTM-based encoder has 592,640. This count also includes a 512-parameter linear mapping to learn the combination proportion of the word and sinusoidal positional embeddings. The dropout within transformer layers is 0.0.

Character embeddings are initialized using CBOW (Mikolov et al., 2013) on the given training set for 32 epochs, with a window size of 5 for Chinese and 10 for English. Special tokens like <eoseg> that do not appear in the training corpus are randomly initialized. These pre-trained embeddings are not frozen during training.

One important parameter for Segmental Language Models is the maximum segment length $L$. Sun and Deng (2018) actually tune this as a hyperparameter, with different values fitting different Chinese segmentation standards more or less accurately. In practice, this parameter can be chosen by pre-experimental investigation to be an upper bound on the maximum segment length one expects to find, so as to not rule out long segments. For these experiments, we follow Kawakami et al. (2019) in the choice of maximum segment length for both Chinese and English.

All models are trained using the Adam update rule (Kingma and Ba, 2015) for 8192 steps, with a linear warmup for 1024 steps, and a linear decay after. A gradient norm clip threshold of 1.0 is used. Checkpoints are taken every 128 steps. Mini-batches are sized by number of characters rather than number of sequences, with a size of 8192 (though this is not always exact since we do not split up sequences). The code used to build and train SLMs as well as conduct these experiments can be found at https://github.com/cmdowney88/SegmentalLMs.
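The warmup-then-decay schedule described above can be sketched as a plain function of the step count. The function name is hypothetical, and the sketch assumes that the decay reaches zero exactly at the final step, which the text does not state.

```python
def learning_rate(
    step: int,
    peak_lr: float,
    warmup_steps: int = 1024,
    total_steps: int = 8192,
) -> float:
    """Linear warmup to peak_lr, then linear decay to zero at total_steps (assumed)."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    remaining = max(total_steps - step, 0)
    return peak_lr * remaining / (total_steps - warmup_steps)
```

With a peak of 1e-3, the rate under this assumption rises linearly over the first 1024 steps and falls back to half the peak at step 4608, halfway through the decay phase.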

5 Results

5.1 Chinese

Models   Tuned on Gold                Unsupervised
         F1           BPC             F1           BPC
RSLM     59.0 ± 1.8   5.63 ± 0.01     58.7 ± 1.6   5.63 ± 0.01
DMSLM    74.2 ± 1.0   5.72 ± 0.03     73.3 ± 1.1   5.67 ± 0.01
MSLM     73.3 ± 0.6   5.65 ± 0.03     72.2 ± 0.5   5.55 ± 0.01
Table 2: Results on the Beijing University Corpus

On PKU (Table 2), Masked SLMs yield consistently better segmentation quality in both the lightly-supervised and unsupervised settings. In both settings, the Directional MSLM produces slightly better segmentation quality (+0.9 and +1.1 F1 respectively), but worse language modeling bpc (+0.07 and +0.12 bpc) than the “Cloze” MSLM. In the lightly-supervised setting, the Recurrent SLM has slightly better bpc than the MSLM (-0.02), but this is also the setting in which bpc is not being explicitly tuned, and this is within the standard deviation of the MSLM performance.

5.2 English

Models   Tuned on Gold                             Unsupervised
         F1 mean       F1 median   BPC             F1 mean       F1 median   BPC
RSLM     73.0 ± 3.8    73.2        2.04 ± 0.02     71.3 ± 3.6    70.5        1.94 ± 0.01
DMSLM    64.9 ± 9.8    69.0        2.38 ± 0.08     40.3 ± 24.4   46.5        2.15 ± 0.02
MSLM     64.2 ± 16.7   70.9        2.35 ± 0.07     64.2 ± 7.4    68.4        2.27 ± 0.05
Table 3: Results on the English Penn Treebank

The picture is more complicated for English (PTB, Table 3). By median, the RSLM gives slightly better segmentation performance in both settings (+2.3 and +2.1 F1). However, both types of MSLM show more variation over random seeds, causing skewed distributions. By mean, RSLM has the better lightly-supervised F1 by +8.8, and unsupervised F1 by +7.1. RSLM also has the better bpc performance (-0.31 and -0.33).

In our trials, one out of the five random seeds contributes most of the variation for the MSLM (see Figure 5). Lower learning rates also lead to greater variation between seeds. This can be seen for the MSLM in the lightly supervised setting (learning rate 3e-3, 16.7 standard deviation) and DMSLM in the unsupervised setting (learning rate 1e-3, 24.4 standard deviation).

6 Analysis and Discussion

6.1 Error Analysis

For PKU, the Transformer-based encoder far exceeds the recurrent baseline for segmentation quality, and also gives stronger bpc language modeling. Given that the Directional MSLM yields slightly better segmentations, there may be some merit to directional modeling assumptions when designing SLMs primarily as segmentation algorithms. However, the difference in segmentation quality between DMSLM and MSLM (+1.1 F1) is far smaller than that given by moving to an attention-based architecture over a recurrent one (+13.5). The bidirectional context of the MSLM, on the other hand, does give the best bpc modeling performance.

We conduct an error analysis for PKU based on the average final Precision and Recall scores for the RSLM and MSLM (for the character-wise binary classification task: word-boundary vs no word-boundary). For simplicity our error analysis is done on the unsupervised setting only. As can be seen in Table

4, both types of model have a Precision that approaches 100%, meaning almost all boundaries that are inserted are true boundaries. The main difference between the two models is in Recall. The MSLM learns to include more of the true boundaries in the gold data. This means the RSLM produces relatively coarser segmentations.444

This table also shows that though character-wise segmentation quality (i.e. classifying whether a certain character has a boundary after it) is a useful heuristic, it does not always scale in a straightforward manner to the word-wise F1 that is traditionally used (e.g. by the SIGHAN script).
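
The difference between the two metrics can be sketched as follows. This is an illustrative implementation, not the SIGHAN scoring script: the function names are introduced here, and we assume that the final end-of-sequence boundary is excluded from the character-wise count. The input is the second line of the Gold and RSLM Median segmentations from the Penn Treebank example in Table 6.

```python
def boundaries(words):
    """Character offsets after which a word boundary falls (sequence end excluded)."""
    offsets, pos = set(), 0
    for w in words[:-1]:
        pos += len(w)
        offsets.add(pos)
    return offsets


def spans(words):
    """(start, end) character span of every word in the segmentation."""
    out, pos = set(), 0
    for w in words:
        out.add((pos, pos + len(w)))
        pos += len(w)
    return out


def prf(pred, gold):
    """Precision, recall and F1 of a predicted set against a gold set."""
    tp = len(pred & gold)
    p = tp / len(pred) if pred else 0.0
    r = tp / len(gold) if gold else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f


gold = "for the greatest play in # years from among four or five".split()
rslm = "for the great est play in # year s from a mong four or five".split()
assert "".join(gold) == "".join(rslm)  # same underlying character string

# Character-wise view: is there a boundary after this character?
bp, br, bf = prf(boundaries(rslm), boundaries(gold))
# Word-wise view: a word counts only if both of its edges are correct.
wp, wr, wf = prf(spans(rslm), spans(gold))

print(f"boundary P={bp:.3f} R={br:.3f} F1={bf:.3f}")
print(f"word     P={wp:.3f} R={wr:.3f} F1={wf:.3f}")
```

On this fragment, every gold boundary is recovered, so boundary Recall is perfect, yet each extra split destroys a whole gold word and the word-wise F1 comes out noticeably lower than the boundary-wise F1. A single fragment is of course only an illustration, not a corpus-level score.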

It is also important to remember that even humans don’t always agree on what constitutes a “word” in Chinese. PKU is known for using relatively shorter words in its segmentation standard, as compared to other corpora like Chinese Penn Treebank where some names are often segmented as five-character “words” (Sun and Deng, 2018; Kawakami et al., 2019). This reinforces the need for studies on more datasets, including those for Chinese Word Segmentation (see Section 7).

Model   Avg. Word Length   Precision    Recall
Gold    1.59               -            -
RSLM    1.93 ± 0.02        98.2 ± 0.1   80.7 ± 0.5
DMSLM   1.83 ± 0.02        97.9 ± 0.1   85.0 ± 0.8
MSLM    1.84 ± 0.01        97.9 ± 0.1   84.2 ± 0.4
Table 4: Error analysis statistics for PKU

For English Penn Treebank and other datasets of languages where the character inventory is more or less phonemic, more engineering seems to be needed to guarantee unsupervised segmentation performance across random initializations. Both types of MSLM showed considerable variation in segmentation quality across random seeds, especially when tuned to a lower learning rate.

Error analysis for English can be found in Table 5. In a trend that is opposite to that found for PKU, the (Cloze) MSLM tends to under-segment the English text (i.e. giving longer “words” and missing more of the gold boundaries that are captured in the RSLM baseline). However, it should be noted that the regular MSLM does give the highest Precision of all the models, with its “worst” random seed in terms of overall word F1 having a very high Precision at the cost of a relatively low Recall.

Example model segmentations for PTB can also be found in Table 6, which back up some intuitions from the Precision-Recall analysis in Table 5. As can be seen, the RSLM is actually prone to over-segmenting in a way that tends to split affixes from their roots (great + est, year + s) but also creates some false splits like a + mong. On the other hand, the MSLM tends to under-segment, including suffixes with their root words, and chunking together some common collocations like forthe and cannow.

Model          Avg. Word Length   Precision   Recall
Gold           4.44               -           -
RSLM Median    3.83               82.0        95.8
DMSLM Median   3.31               66.8        90.8
DMSLM Worst    2.88               33.6        52.6
MSLM Median    4.57               90.2        87.4
MSLM Worst     5.33               91.4        75.3
Table 5: Error analysis statistics for PTB. The “worst” models are in terms of the final segmentation quality.

Confusingly, the Directional MSLM seems to again over-segment the English input, though its results are slightly harder to interpret. While the “worst” seed for the MSLM showed Precision being optimized at the cost of Recall, no such tradeoff seems to be happening in the worst case for the DMSLM. Instead, both Precision and Recall are fairly poor compared to the RSLM baseline, as well as the MSLM. Indeed, the resulting segmentation from the median DMSLM seed in Table 6 is mostly nonsensical.

The DMSLM in the unsupervised case did have the lowest learning rate after tuning, and we have noted before that lower learning rates seem to lead to more variation between seeds. Unfortunately these results remain difficult to interpret, and we devote the rest of this section to discussing why the MSLM variants show so much variation for English.

Gold           watching abc ’s monday night football can now vote during unk
               for the greatest play in # years from among four or five
RSLM Median    watching abc ’s monday night foot ball can now vote during unk
               for the great est play in # year s from a mong four or five
DMSLM Median   w at ch ing ab c ’s m on day nigh t foot ball can now v ote dur ing unk
               for the greatest play in# ye ars froma mong four or five
MSLM Median    watching abc’s monday night foot ball can now vote during unk
               forthe greatest play in# years from among four or five
MSLM Worst     watching abc’s monday night foot ball cannow vote duringunk
               forthe greatest play in #years from among four or five
Table 6: Example model segmentations from the Penn Treebank validation set

6.2 Discussion of MSLM Variability on English

In the lightly supervised setting, the variation effect may actually be easily avoidable, since it is usually evident when a model is not converging to a desirable performance, and the practitioner could either train across a few random seeds and select the best, or else tune more carefully for a single fixed seed. For unsupervised settings however, the seed variation is more problematic, given there is no way to know when a trial is training on a “bad seed”.

We have several hypotheses about the source of the random seed variation on English. Firstly, it is possible that this variability results from the training being stochastic: that is, batch sizes that are too small may give a poor estimate of the global gradient for the problem. A relatively easy fix for this could be increasing the batch size during training, memory permitting. However, if memory is a hard constraint, then gradient accumulation steps could be used instead.
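
The equivalence that makes gradient accumulation a substitute for a larger batch can be sketched on a toy problem. The example below is not the SLM training loop: it uses a one-parameter linear model with a mean-squared-error loss, and the data, `grad_mse`, and `accumulated_grad` are all hypothetical names and values introduced for illustration. It checks that averaging the gradients of several equal-sized micro-batches reproduces the gradient of the full batch.

```python
def grad_mse(w, batch):
    """Gradient of the mean squared error of the model y = w * x over one batch."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)


def accumulated_grad(w, batch, n_micro):
    """Average the gradients of n_micro equal-sized micro-batches."""
    size = len(batch) // n_micro
    assert size * n_micro == len(batch), "batch must split evenly"
    total = 0.0
    for i in range(n_micro):
        micro = batch[i * size:(i + 1) * size]
        # Scale each micro-batch gradient so that the running sum is a mean.
        total += grad_mse(w, micro) / n_micro
    return total


# Toy data (hypothetical), roughly following y = 3x.
data = [(1.0, 3.1), (2.0, 5.9), (3.0, 9.2), (4.0, 11.8),
        (5.0, 15.1), (6.0, 18.0), (7.0, 20.8), (8.0, 24.3)]

w = 0.5
full = grad_mse(w, data)                      # one pass over the large batch
accum = accumulated_grad(w, data, n_micro=4)  # four passes over small batches
print(full, accum)
```

Only one micro-batch needs to be held in memory at a time, so the effective batch size can grow at the cost of more forward and backward passes per parameter update.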

Another potential factor that seems to contribute to the variability is the manner in which positional information is injected into the Transformer architecture, given Transformers without additive or learned positional encodings are essentially adirectional.

In general, the MSLM variants seem to be very sensitive to the ratio of embedding-to-positional-encoding. The learned combination ratio described in Section 3.2 was a vital component of getting Transformer models to work for PTB at all.

Intuitively, it is easy to see why positional information might be more important in roughly phonemic writing systems like English. In Chinese, almost every character is a morpheme itself (i.e. it has some meaning). In English, on the other hand, the letter c has no inherent meaning outside of a context like cat. cat is also a completely different context than act, but this might be difficult to model for an attention model without robust positional information. Thus it is possible that this variation may come from a relative weakness in positional signal in the encoder.
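
The cat/act intuition can be illustrated with a toy single-head self-attention layer. Everything in this sketch is an assumption made for illustration: the character embeddings in `EMB` are hand-picked, the attention uses identity projections rather than learned ones, and the scalar `alpha` is only a stand-in for a positional weighting, not the learned combination of Section 3.2. With `alpha = 0` the representation of c is the same in both words, since attention without positional information sees the same multiset of characters. With a nonzero `alpha` the two contexts become distinguishable.

```python
import math

DIM = 4
# Hypothetical fixed character embeddings, for illustration only.
EMB = {
    "a": [1.0, 0.0, 0.0, 0.5],
    "c": [0.0, 1.0, 0.0, 0.5],
    "t": [0.0, 0.0, 1.0, 0.5],
}


def sinusoid(pos, dim=DIM):
    """Sinusoidal positional encoding in the style of Vaswani et al. (2017)."""
    return [math.sin(pos / 10000 ** (i / dim)) if i % 2 == 0
            else math.cos(pos / 10000 ** ((i - 1) / dim))
            for i in range(dim)]


def attend(vectors):
    """Single-head scaled dot-product self-attention with identity projections."""
    scale = math.sqrt(len(vectors[0]))
    out = []
    for q in vectors:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in vectors]
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        z = sum(weights)
        out.append([sum(w / z * v[i] for w, v in zip(weights, vectors))
                    for i in range(len(q))])
    return out


def char_repr(word, ch, alpha):
    """Contextual vector of character ch in word; alpha scales the positional term."""
    xs = [[e + alpha * p for e, p in zip(EMB[c], sinusoid(pos))]
          for pos, c in enumerate(word)]
    return attend(xs)[word.index(ch)]


def close(u, v, tol=1e-9):
    """True if two vectors agree up to floating-point noise."""
    return all(abs(a - b) < tol for a, b in zip(u, v))


print(close(char_repr("cat", "c", 0.0), char_repr("act", "c", 0.0)))
print(close(char_repr("cat", "c", 1.0), char_repr("act", "c", 1.0)))
```

A small `alpha` leaves the two representations of c nearly identical, which is one way a weak positional signal could make character order hard for the encoder to exploit.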

7 Conclusion

In sum, we believe that MSLMs show promise in the domain of unsupervised segmentation and/or character modeling. They show particular promise for writing systems with a large inventory of semantic characters (e.g. Chinese), and we believe that they could be stable competitors of recurrent models in phonemic-type writing systems if one or more of the engineering hurdles described here are solved. To close, we lay out directions for future work in Masked SLMs.

The most obvious next step in the study of MSLMs is using them to model more segmentation datasets spanning different domains. As mentioned in the previous section, the criteria for what defines a “word” in Chinese are not agreed upon, and so more experiments are definitely warranted using other corpora with different standards. Prime candidates include the Chinese Penn Treebank, analyzed in Kawakami et al. (2019), as well as those included in the SIGHAN segmentation bakeoff: Microsoft Research, City University of Hong Kong, and Academia Sinica.

The Chinese and English sets examined here are also relatively formal orthographic datasets. An eventual use of SLMs may be in speech segmentation, but a smaller step in that direction could be in using a phonemic transcript dataset, like the Brent Corpus, also used in Kawakami et al. (2019), which consists of phonemic transcripts of child-directed English speech (Brent, 1999). SLMs could also be applied to the orthographies of more typologically diverse languages, such as ones with complicated systems of morphology (e.g. Swahili, Turkish, Hungarian, Finnish).

Finally, Kawakami et al. (2019) actually report that their “vanilla” SLMs do not provide very good segmentation quality compared to other baselines, and so propose several regularization techniques to skew the model towards more human-like segments. They report good findings using a character n-gram “lexicon” in conjunction with expected segment length regularization based on Eisner (2002) and Liang and Klein (2009). Both regularization techniques are implemented in the codebase used for the present experiments, and we plan to use them in our next series of studies.


References

  • M. R. Brent (1999) An Efficient, Probabilistically Sound Algorithm for Segmentation and Word Discovery. Machine Learning 34, pp. 71–105. Cited by: §7.
  • M. X. Chen, O. Firat, A. Bapna, M. Johnson, W. Macherey, G. Foster, L. Jones, M. Schuster, N. Shazeer, N. Parmar, A. Vaswani, J. Uszkoreit, L. Kaiser, Z. Chen, Y. Wu, and M. Hughes (2018) The Best of Both Worlds: Combining Recent Advances in Neural Machine Translation. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Melbourne, Australia, pp. 76–86. External Links: Link, Document Cited by: §4.1.
  • J. Chung, S. Ahn, and Y. Bengio (2017) Hierarchical Multiscale Recurrent Neural Networks. In 5th International Conference on Learning Representations, ICLR 2017, Conference Track Proceedings, Toulon, France. Cited by: §2.2.
  • M. Creutz and K. Lagus (2002) Unsupervised Discovery of Morphemes. In Proceedings of the ACL-02 Workshop on Morphological and Phonological Learning, pp. 21–30. External Links: Link Cited by: §2.1.
  • C. de Marcken (1996) Linguistic Structure as Composition and Perturbation. In 34th Annual Meeting of the Association for Computational Linguistics, Santa Cruz, California, USA, pp. 335–341. External Links: Link, Document Cited by: §2.1.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019) BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, Minnesota, pp. 4171–4186. External Links: Link, Document Cited by: §3.2, §3.3.
  • E. Dunbar, J. Karadayi, M. Bernard, X. Cao, R. Algayres, L. Ondel, L. Besacier, S. Sakti, and E. Dupoux (2020) The Zero Resource Speech Challenge 2020: Discovering discrete subword and word units. In Proceedings of INTERSPEECH 2020, Cited by: §1.
  • J. Eisner (2002) Parameter Estimation for Probabilistic Finite-State Transducers. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, Philadelphia, Pennsylvania, USA, pp. 1–8. External Links: Link, Document Cited by: §7.
  • J. L. Elman (1990) Finding structure in time. Cognitive Science 14 (2), pp. 179–211 (en). External Links: ISSN 0364-0213, Link, Document Cited by: §2.1.
  • T. Emerson (2005) The Second International Chinese Word Segmentation Bakeoff. In Proceedings of the Fourth SIGHAN Workshop on Chinese Language Processing, External Links: Link Cited by: §4.2, §4.4.
  • J. Goldsmith (2001) Unsupervised Learning of the Morphology of a Natural Language. Computational Linguistics 27 (2), pp. 153–198. External Links: Link Cited by: §2.1.
  • S. Goldwater, T. L. Griffiths, and M. Johnson (2009) A Bayesian framework for word segmentation: Exploring the effects of context. Cognition 112 (1), pp. 21–54 (en). External Links: ISSN 0010-0277, Link, Document Cited by: §1, §2.1, §4.2.
  • A. Graves, S. Fernández, F. Gomez, and J. Schmidhuber (2006) Connectionist Temporal Classification: Labelling Unsegmented Sequence Data with Recurrent Neural Networks. In Proceedings of the 23rd International Conference on Machine Learning, Pittsburgh, PA. Cited by: §2.2.
  • A. Graves and J. Schmidhuber (2005) Framewise phoneme classification with bidirectional LSTM and other neural network architectures. In Neural Networks, Vol. 18, pp. 602–610. Cited by: §3.2.
  • H. Kamper, A. Jansen, and S. Goldwater (2016) Unsupervised word segmentation and lexicon discovery using acoustic word embeddings. IEEE/ACM Transactions on Audio, Speech and Language Processing 24 (4), pp. 669–679. External Links: ISSN 2329-9290, Link, Document Cited by: §1.
  • K. Kawakami, C. Dyer, and P. Blunsom (2017) Learning to Create and Reuse Words in Open-Vocabulary Neural Language Modeling. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vancouver, Canada, pp. 1492–1502. External Links: Link, Document Cited by: §2.2.
  • K. Kawakami, C. Dyer, and P. Blunsom (2019) Learning to Discover, Ground and Use Words with Segmental Neural Language Models. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, pp. 6429–6441. External Links: Link, Document Cited by: A Masked Segmental Language Model for Unsupervised Natural Language Segmentation, §1, §1, §2.1, §2.2, §2.3, §2.3, §3.1, §4.2, §4.2, §4.3, §4.4, §4.4, §4.5, §6.1, §7, §7, §7.
  • D. Kingma and J. Ba (2015) Adam: A Method for Stochastic Optimization. In 3rd International Conference on Learning Representations, ICLR 2015, Conference Track Proceedings, San Diego, CA, USA. Cited by: §4.5.
  • L. Kong, C. Dyer, and N. A. Smith (2016) Segmental Recurrent Neural Networks. In 4th International Conference on Learning Representations, ICLR 2016, Conference Track Proceedings, Y. Bengio and Y. LeCun (Eds.), San Juan, Puerto Rico. External Links: Link Cited by: §2.2.
  • T. Kudo and J. Richardson (2018) SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, Brussels, Belgium, pp. 66–71. External Links: Link, Document Cited by: §1.
  • M. Lewis, Y. Liu, N. Goyal, M. Ghazvininejad, A. Mohamed, O. Levy, V. Stoyanov, and L. Zettlemoyer (2020) BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, pp. 7871–7880. External Links: Link, Document Cited by: §3.2.
  • P. Liang and D. Klein (2009) Online EM for Unsupervised Models. In Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Boulder, Colorado, pp. 611–619. External Links: Link Cited by: §7.
  • P. A. Luce (1986) A computational analysis of uniqueness points in auditory word recognition. Perception & Psychophysics 39 (3), pp. 155–158 (en). External Links: ISSN 1532-5962, Link, Document Cited by: §3.3.
  • S. Mielke and J. Eisner (2019) Spell Once, Summon Anywhere: A Two-Level Open-Vocabulary Language Model. Proceedings of the AAAI Conference on Artificial Intelligence 33 (01), pp. 6843–6850 (en). External Links: ISSN 2374-3468, Link, Document Cited by: §2.2.
  • T. Mikolov, K. Chen, G. Corrado, and J. Dean (2013) Efficient Estimation of Word Representations in Vector Space. In 1st International Conference on Learning Representations, ICLR 2013, Workshop Track Proceedings, Scottsdale, AZ, USA. Cited by: §4.5.
  • T. Mikolov, M. Karafiát, L. Burget, J. Cernocký, and S. Khudanpur (2010) Recurrent neural network based language model. Vol. 2, pp. 1045–1048. Cited by: §4.4.
  • D. Mochihashi, T. Yamada, and N. Ueda (2009) Bayesian Unsupervised Word Segmentation with Nested Pitman-Yor Language Modeling. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP, Suntec, Singapore, pp. 100–108. External Links: Link Cited by: §2.1, §4.2.
  • T. Moeng, S. Reay, A. Daniels, and J. Buys (2021) Canonical and Surface Morphological Segmentation for Nguni Languages. ArXiv. Cited by: §1.
  • M. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, and L. Zettlemoyer (2018) Deep Contextualized Word Representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), New Orleans, Louisiana, pp. 2227–2237. External Links: Link, Document Cited by: §3.3.
  • O. Rasanen and M. A. C. Blandon (2020) Unsupervised Discovery of Recurring Speech Patterns using Probabilistic Adaptive Metrics. In Proceedings of INTERSPEECH 2020, Cited by: §1.
  • J. Rissanen (1989) Stochastic Complexity in Statistical Inquiry. Series in Computer Science, Vol. 15, World Scientific, Singapore. Cited by: §2.1.
  • M. Schuster and K. K. Paliwal (1997) Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing 45 (11), pp. 2673–2681. External Links: ISSN 1053587X, Document Cited by: §3.2.
  • R. Sennrich, B. Haddow, and A. Birch (2016) Neural Machine Translation of Rare Words with Subword Units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Berlin, Germany, pp. 1715–1725. External Links: Link, Document Cited by: §1.
  • K. Song, X. Tan, T. Qin, J. Lu, and T. Liu (2019) MASS: Masked Sequence to Sequence Pre-training for Language Generation. In Proceedings of the 36th International Conference on Machine Learning, Long Beach, CA. Cited by: §3.2.
  • A. Sordoni, Y. Bengio, H. Vahabi, C. Lioma, J. Grue Simonsen, and J. Nie (2015) A Hierarchical Recurrent Encoder-Decoder for Generative Context-Aware Query Suggestion. In Proceedings of the 24th ACM International on Conference on Information and Knowledge Management, CIKM ’15, New York, NY, USA, pp. 553–562. External Links: ISBN 978-1-4503-3794-6, Link, Document Cited by: footnote 1.
  • Z. Sun and Z. Deng (2018) Unsupervised Neural Word Segmentation for Chinese via Segmental Language Modeling. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, pp. 4915–4920. External Links: Link Cited by: A Masked Segmental Language Model for Unsupervised Natural Language Segmentation, §1, §2.3, §2.3, §3.1, §4.3, §4.4, §4.5, §6.1.
  • W. L. Taylor (1957) Cloze Procedure: A New Tool for Measuring Readability. Journalism Bulletin 30 (4), pp. 415–433. Cited by: §3.2.
  • Y. W. Teh, M. I. Jordan, M. J. Beal, and D. M. Blei (2006) Hierarchical Dirichlet Processes. Journal of the American Statistical Association 101 (476), pp. 1566–1581. Cited by: §2.1, §4.2.
  • A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin (2017) Attention is All You Need. In Proceedings of the 31st Conference on Neural Information Processing Systems, Long Beach, CA. External Links: Link Cited by: §1, §3.2, §3.2, §3.3.
  • C. Wang, Y. Wang, P. Huang, A. Mohamed, D. Zhou, and L. Deng (2017) Sequence Modeling via Segmentations. In Proceedings of the 34th International Conference on Machine Learning, D. Precup and Y. W. Teh (Eds.), Proceedings of Machine Learning Research, Vol. 70, International Convention Centre, Sydney, Australia, pp. 3674–3683. External Links: Link Cited by: §2.2.
  • L. Wang, Z. Li, and X. Zheng (2021) Unsupervised Word Segmentation with Bi-directional Neural Language Model. ArXiv. Cited by: A Masked Segmental Language Model for Unsupervised Natural Language Segmentation, §2.3, §3.3.