Morfessor EM+Prune: Improved Subword Segmentation with Expectation Maximization and Pruning

03/06/2020 ∙ by Stig-Arne Grönroos, et al. ∙ Helsingin yliopisto aalto 0

Data-driven segmentation of words into subword units has been used in various natural language processing applications such as automatic speech recognition and statistical machine translation for almost 20 years. Recently it has become more widely adopted, as models based on deep neural networks often benefit from subword units even for morphologically simpler languages. In this paper, we discuss and compare training algorithms for a unigram subword model, based on the Expectation Maximization algorithm and lexicon pruning. Using English, Finnish, North Sámi, and Turkish data sets, we show that this approach is able to find better solutions to the optimization problem defined by the Morfessor Baseline model than its original recursive training algorithm. The improved optimization also leads to higher morphological segmentation accuracy when compared to a linguistic gold standard. We publish implementations of the new algorithms in the widely-used Morfessor software package.




1 Introduction

Subword segmentation has become a standard preprocessing step in many neural approaches to natural language processing (NLP) tasks, e.g. Neural Machine Translation (NMT) [17] and Automatic Speech Recognition (ASR) [18]. Word level modeling suffers from sparse statistics, issues with Out-of-Vocabulary (OOV) words, and heavy computational cost due to a large vocabulary. Word level modeling is particularly unsuitable for morphologically rich languages, but subwords are commonly used for other languages as well. Subword segmentation is best suited for languages with agglutinative morphology.

While rule-based morphological segmentation systems can achieve high quality, the large amount of human effort needed makes the approach problematic, particularly for low-resource languages. The systems are language dependent, necessitating use of multiple tools in multilingual setups. As a fast, cheap and effective alternative, data-driven segmentation can be learned in a completely unsupervised manner from raw corpora.

Unsupervised morphological segmentation saw much research interest until the early 2010s; for a survey of the methods, see Hammarström and Borin (2011). Semi-supervised segmentation with even small amounts of annotated training data was found to significantly improve accuracy when compared to a linguistic segmentation; see Ruokolainen et al. (2016) for a survey. While this line of research has been continued in supervised and more grammatically oriented tasks [2], the more recent work on unsupervised segmentation is less focused on approximating a linguistically motivated segmentation. Instead, the aim has been to tune subword segmentations for particular applications. For example, the simple substitution-dictionary-based Byte Pair Encoding segmentation algorithm [6], first proposed for NMT by Sennrich et al. (2015), has become a standard in the field. Especially in the case of multilingual models, training a single language-independent subword segmentation method is preferable to linguistic segmentation [1].

In this study, we compare three existing and one novel subword segmentation method, all sharing the use of a unigram language model in a generative modeling framework. The previously published methods are Morfessor Baseline [3], Greedy Unigram Likelihood [19], and SentencePiece [11]. The new Morfessor variant proposed in this work is called Morfessor EM+Prune.

The contributions of this article (see footnote 1) are

  1. a better training algorithm for Morfessor Baseline, with reduction of search error during training, and improved segmentation quality for English, Finnish and Turkish;

  2. a comparison of four similar segmentation methods, including a close look at the SentencePiece reference implementation, highlighting details omitted from the original article [11];

  3. and a demonstration that the proposed Morfessor EM+Prune with particular hyper-parameters yields SentencePiece.

Footnote 1: This work is licensed under a Creative Commons Attribution–NoDerivatives 4.0 International Licence.

                          Morfessor BL   Greedy Unigram    SentencePiece   Morfessor EM+Prune
Model                     Unigram LM     Unigram LM        Unigram LM      Unigram LM
Cost function             MAP            ML                MAP             MAP
Training algorithm        Local search   EM+Prune          EM+Prune        EM+Prune
  Initialization          Words          Seed lexicon      Seed lexicon    Seed lexicon
  EM variant              -              Lateen-EM once    EM              EM / Lateen-EM
Reference implementation  Python         C++               C++             Python

The table further compares the methods by stopping criterion (cost change threshold; target lexicon size, marked as approximate for one method), N-best decoding, sampling decoding, count dampening, and whether pretokenization is required.

Table 1: Comparison of subword segmentation methods applying a unigram language model.

1.1 Morphological Segmentation with Unigram Language Models

Morphological surface segmentation is the task of splitting words into morphs, the surface forms of meaning-bearing sub-word units, morphemes. The concatenation of the morphs is the word, e.g. the morphs "walk" and "ed" concatenate to the word "walked".

Probabilistic generative methods for morphological segmentation model the probability $P(\mathbf{s})$ of generating a sequence of morphs $\mathbf{s} = (m_1, \ldots, m_N)$ (a word, sentence or corpus), as opposed to discriminative methods that model the conditional probability of the segmentation boundaries given the unsegmented data.

This study focuses on segmentation methods applying a unigram language model. In the unigram language model, an assumption is made that the morphs in a word occur independently of each other. Alternatively stated, it is a zero-order (memoryless) Markov model, generalized so that one observation can cover multiple characters. The probability of a sequence of morphs decomposes into the product of the probabilities of the morphs of which it consists:

$$P(\mathbf{s}) = \prod_{i=1}^{N} P(m_i) \qquad (1)$$
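The product decomposition of the unigram model can be illustrated with a short Python sketch. It is a minimal illustration, not the Morfessor implementation: the function names, the `max_len` limit and the toy lexicon with its probabilities are all assumptions made for this example. It scores a morph sequence and finds the most probable segmentation of a word with the Viterbi algorithm.

```python
import math

def sequence_log_prob(morphs, log_probs):
    """Unigram model: log P(s) is the sum of the log-probabilities of the morphs."""
    return sum(log_probs[m] for m in morphs)

def viterbi_segment(word, log_probs, max_len=10):
    """Return the most probable segmentation of `word` and its log-probability."""
    n = len(word)
    best = [-math.inf] * (n + 1)   # best[j]: best log-probability of word[:j]
    back = [0] * (n + 1)           # back[j]: start index of the last morph in word[:j]
    best[0] = 0.0
    for j in range(1, n + 1):
        for i in range(max(0, j - max_len), j):
            piece = word[i:j]
            if piece in log_probs and best[i] + log_probs[piece] > best[j]:
                best[j] = best[i] + log_probs[piece]
                back[j] = i
    if best[n] == -math.inf:
        raise ValueError("word cannot be segmented with this lexicon")
    morphs, j = [], n
    while j > 0:                   # follow the back-pointers from the end of the word
        morphs.append(word[back[j]:j])
        j = back[j]
    return morphs[::-1], best[n]

# Hypothetical toy lexicon; the probabilities are invented for illustration.
toy = {"walk": 0.3, "ed": 0.2, "w": 0.1, "a": 0.1, "l": 0.1, "k": 0.1,
       "e": 0.05, "d": 0.05}
toy_log = {m: math.log(p) for m, p in toy.items()}
segmentation, logp = viterbi_segment("walked", toy_log)
```

Working in log-probabilities turns the product into a sum, which avoids numerical underflow for long sequences.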


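The Expectation Maximization (EM) algorithm [5] is an iterative algorithm for finding Maximum Likelihood (ML) or Maximum a Posteriori (MAP) estimates for parameters in models with latent variables. The EM algorithm consists of two steps. In the E-step (2), the expected value of the complete data likelihood including the latent variable is taken, and in the M-step (3), the parameters are updated to maximize the expected value of the E-step:

$$Q(\theta, \theta^{(i-1)}) = \mathbb{E}_{\mathbf{y} \sim P(\mathbf{y} \mid \mathbf{D}, \theta^{(i-1)})} \left[ \log P(\mathbf{D}, \mathbf{y} \mid \theta) \right] \qquad (2)$$

$$\theta^{(i)} = \arg\max_{\theta} Q(\theta, \theta^{(i-1)}) \qquad (3)$$

Here $\mathbf{D}$ is the observed data, $\mathbf{y}$ is the latent variable (in this work, the segmentation), and $\theta^{(i)}$ denotes the parameters after iteration $i$.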


When applied to a (hidden) Markov model, EM is called the forward-backward algorithm. Using instead the related Viterbi algorithm [22] is sometimes referred to as hard-EM. (Footnote 2: An analogy can be drawn to clustering using $k$-means, which yields a hard assignment of data points to clusters, and using EM for clustering with a Gaussian Mixture Model (GMM), where the assignment is soft.) Spitkovsky et al. (2011) present lateen-EM, a hybrid variant in which EM and Viterbi optimization are alternated.
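To make the difference between soft and hard EM concrete, the following Python sketch computes expected morph counts for a unigram model with the forward-backward recursions and applies a maximum likelihood M-step. It is an illustrative sketch under simplifying assumptions: the function names are hypothetical, plain probabilities are used instead of log-probabilities (so long words could underflow), and no prior is included. Hard-EM would instead collect counts only from the single best (Viterbi) segmentation of each word.

```python
from collections import defaultdict

def expected_counts(word, probs, max_len=10):
    """E-step for one word: expected morph counts over all its segmentations."""
    n = len(word)
    fwd = [0.0] * (n + 1)   # fwd[j]: total probability of all segmentations of word[:j]
    bwd = [0.0] * (n + 1)   # bwd[i]: total probability of all segmentations of word[i:]
    fwd[0] = bwd[n] = 1.0
    for j in range(1, n + 1):
        for i in range(max(0, j - max_len), j):
            fwd[j] += fwd[i] * probs.get(word[i:j], 0.0)
    for i in range(n - 1, -1, -1):
        for j in range(i + 1, min(n, i + max_len) + 1):
            bwd[i] += probs.get(word[i:j], 0.0) * bwd[j]
    total = fwd[n]
    counts = defaultdict(float)
    if total == 0.0:
        return counts, total        # the word cannot be generated by this lexicon
    for i in range(n):
        for j in range(i + 1, min(n, i + max_len) + 1):
            p = probs.get(word[i:j], 0.0)
            if p > 0.0:
                # posterior probability that the morph word[i:j] is used at this position
                counts[word[i:j]] += fwd[i] * p * bwd[j] / total
    return counts, total

def em_iteration(words, probs):
    """One EM iteration: accumulate expected counts, then normalize (ML M-step)."""
    totals = defaultdict(float)
    for w in words:
        counts, _ = expected_counts(w, probs)
        for m, c in counts.items():
            totals[m] += c
    norm = sum(totals.values())
    return {m: c / norm for m, c in totals.items()}

toy_probs = {"ab": 0.5, "a": 0.25, "b": 0.25}     # invented toy parameters
toy_counts, toy_total = expected_counts("ab", toy_probs)
updated = em_iteration(["ab"], toy_probs)
```

In the toy example the word "ab" has two segmentations, so the count of "ab" is the posterior probability of the unsplit analysis and the counts of "a" and "b" share the remainder.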

Virpioja (2012) discusses the challenges of applying EM to learning of generative morphology. Jointly optimizing both the morph lexicon and the parameters for the morphs is intractable. If, as in Morfessor Baseline, the cost function is discontinuous when morphs are added to or removed from the lexicon, there is no closed-form solution to the M-step. With ML estimates for morph probabilities, EM can neither add nor remove morphs from the lexicon, because it can neither change a zero probability to nonzero nor vice versa.

One solution to this challenge is to apply local search. Starting from the current best estimate for the parameters, small search steps are tried to explore nearby parameter configurations. The choice that yields the lowest cost is selected as the new parameters. Greedy local search often gets stuck in local minima. Even if there are parameters yielding a better cost, the search may not find them, causing search error. The error remaining at the parameters with globally optimal cost is the model error.

Another solution is to combine EM with pruning (EM+Prune). The methods based on pruning begin with a seed lexicon, which is then iteratively pruned until a stopping condition is reached. Subwords cannot be added to the lexicon after initialization. As a consequence, proper initialization is important, and the methods should not prune too aggressively without reestimating parameters, as pruning decisions cannot be backtracked. For this reason, EM+Prune methods proceed iteratively, only pruning subwords up to a predefined iteration pruning quota, e.g. removing at most 20% of the remaining lexicon at a time.
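The iteration structure described above can be sketched as a generic loop. This is only a skeleton under stated assumptions: `run_em` and `removal_cost` are hypothetical callables supplied by the caller, a target lexicon size is used as the stopping condition, and two EM sub-iterations per round are chosen arbitrarily. The actual methods differ in how they estimate the removal cost and when they stop.

```python
def em_prune(lexicon, run_em, removal_cost, target_size, quota=0.2, em_subiters=2):
    """Alternate parameter re-estimation and pruning of `lexicon` (subword -> probability).

    run_em(lexicon) returns re-estimated probabilities for the same subwords.
    removal_cost(subword, lexicon) estimates the change in cost if the subword is removed.
    """
    lexicon = dict(lexicon)
    while len(lexicon) > target_size:
        for _ in range(em_subiters):
            lexicon = run_em(lexicon)              # re-estimate before any pruning decision
        # Subwords whose removal is estimated to hurt least come first.
        candidates = sorted(lexicon, key=lambda m: removal_cost(m, lexicon))
        max_prune = max(1, int(quota * len(lexicon)))      # iteration pruning quota
        n_prune = min(max_prune, len(lexicon) - target_size)
        for m in candidates[:n_prune]:
            del lexicon[m]                         # pruned subwords can never return
        total = sum(lexicon.values())
        lexicon = {m: p / total for m, p in lexicon.items()}   # renormalize
    return lexicon

# Toy usage with stand-in callables (assumptions for illustration only).
em_calls = []

def fake_em(lex):
    em_calls.append(1)       # a real implementation would run one EM iteration here
    return lex

pruned = em_prune({"a": 0.4, "b": 0.3, "ab": 0.2, "ba": 0.1},
                  fake_em, lambda m, lex: lex[m], target_size=2)
```

With a quota of 20% and four subwords, only one subword is removed per round, so parameters are re-estimated between the two pruning decisions.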

2 Related Work

In this section we review three previously published segmentation methods that apply a unigram language model. Table 1 summarizes the differences between these methods.

2.1 Morfessor Baseline

Morfessor is a family of generative models for unsupervised morphology induction [4]. Here, we consider the Morfessor 2.0 implementation [20] of the Morfessor Baseline method [3].

A point estimate for the model parameters $\theta$ is found using MAP estimation with a Minimum Description Length (MDL) [15] inspired prior that favors lexicons containing fewer, shorter morphs. The MAP estimate yields a two-part cost function, consisting of a prior (the lexicon cost) and likelihood (the corpus cost). The model can be tuned using the hyper-parameter $\alpha$, which is a weight applied to the likelihood [9]:

$$\theta = \arg\min_{\theta} \left\{ -\log P(\theta) - \alpha \log P(\mathbf{D} \mid \theta) \right\} \qquad (4)$$


The parameter $\alpha$ controls the overall amount of segmentation, with higher values increasing the weight of each emitted morph in the corpus (leading to less segmentation), and lower values giving a relatively larger weight to a small lexicon (more segmentation).
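A small numeric sketch may help to see how the likelihood weight shifts the balance. The two candidate models and all cost values below are invented for illustration; only the form of the weighted two-part cost (lexicon cost plus weight times corpus cost) follows the description above.

```python
def morfessor_cost(lexicon_cost, corpus_cost, alpha=1.0):
    """Weighted two-part cost: -log P(theta) + alpha * (-log P(D | theta))."""
    return lexicon_cost + alpha * corpus_cost

# Invented costs for two hypothetical models of the same corpus.
few_long_morphs = {"lexicon_cost": 900.0, "corpus_cost": 1000.0}    # less segmentation
many_short_morphs = {"lexicon_cost": 300.0, "corpus_cost": 1400.0}  # more segmentation

def preferred(alpha):
    """Return which of the two hypothetical models has the lower weighted cost."""
    a = morfessor_cost(alpha=alpha, **few_long_morphs)
    b = morfessor_cost(alpha=alpha, **many_short_morphs)
    return "less segmentation" if a < b else "more segmentation"
```

For these invented numbers the break-even point is at a weight of 1.5: below it the compact lexicon wins, above it the model that emits fewer morphs per word wins.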

The prior can be further divided into two parts: the prior for the morph form properties and the usage properties. The form properties encode the string representation of the morphs, while the usage properties encode their frequencies. Morfessor Baseline applies a non-informative prior for the distribution of the morph frequencies. It is derived using combinatorics from the number of ways that the total token count $N$ can be divided among the $M$ lexicon items:

$$P(\nu_1, \ldots, \nu_M) = \binom{N-1}{M-1}^{-1} = \frac{(M-1)!\,(N-M)!}{(N-1)!} \qquad (5)$$

where $\nu_i$ is the token count of the $i$-th morph.
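As a worked illustration of this combinatorial prior: there are C(N-1, M-1) ways to divide N tokens among M lexicon items so that each item receives at least one token, and the non-informative prior assigns each of them equal probability. The sketch below computes the corresponding cost in nats. The function names are hypothetical, and `math.lgamma` is used here for convenience, which is a choice made for this example rather than a description of the Morfessor code.

```python
import math

def log_binomial(n, k):
    """Natural logarithm of the binomial coefficient C(n, k)."""
    return math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)

def frequency_prior_cost(total_tokens, num_types):
    """-log P(counts) = log C(N - 1, M - 1) for N tokens and M morph types."""
    if num_types < 1 or total_tokens < num_types:
        raise ValueError("need at least one token per morph type")
    return log_binomial(total_tokens - 1, num_types - 1)

# N = 4 tokens over M = 2 types: the possible counts are (1, 3), (2, 2), (3, 1).
toy_cost = frequency_prior_cost(4, 2)
```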


Morfessor Baseline is initialized with a seed lexicon of whole words. The Morfessor Baseline training algorithm is a greedy local search. During training, in addition to storing the model parameters, the current best segmentation for the corpus is stored in a graph structure. The segmentation is iteratively refined, by looping over all the words in the corpus in a random order and resegmenting them. The resegmentation is applied by recursive binary splitting, leading to changes in other words that share intermediary units with the word currently being resegmented. The search converges to a local optimum, and is known to be sensitive to the initialization [20].

In the Morfessor 2.0 implementation, the likelihood weight hyper-parameter $\alpha$ is set either with a grid search using the best evaluation score on a held-out development set, or by applying an approximate automatic tuning procedure based on a heuristic guess of which direction the $\alpha$ parameter should be adjusted.

2.2 Greedy Unigram Likelihood

Varjokallio et al. (2013) present a subword segmentation method, particularly designed for use in ASR. It applies greedy pruning based on unigram likelihood. The seed lexicon is constructed by enumerating all substrings from a list of common words, up to a specified maximum length. Pruning proceeds in two phases, which the authors call initialization and pruning.

In the first phase, a character-level language model is trained. The initial probabilities of the subwords are computed using the language model. The probabilities are refined by EM, followed by hard-EM. During the hard-EM, frequency based pruning of subwords begins.

In the second phase, hard-EM is used for parameter estimation. At the end of each iteration, the least frequent subwords are selected as candidates for pruning. For each candidate subword, the change in likelihood when removing the subword is estimated by resegmenting all words in which the subword occurs. After each pruned subword, the parameters of the model are updated. Pruning ends when the goal lexicon size is reached or the change in likelihood no longer exceeds a given threshold.

Figure 1: Unweighted Morfessor cost function components (prior and likelihood). Log scale.

2.3 SentencePiece

SentencePiece [10, 11] is a subword segmentation method aimed for use in any NLP system, particularly NMT. One of its design goals is use in multilingual systems.

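Although [11] implies a use of maximum likelihood estimation, the reference implementation uses the implicit Dirichlet Process prior called Bayesian EM [14]. In the M-step, the count normalization is modified to

$$P(m_i) = \frac{\exp(\Psi(c_i))}{\exp\left(\Psi\left(\sum_j c_j\right)\right)} \qquad (6)$$

where $c_i$ is the expected count of morph $m_i$ and $\Psi$ is the digamma function.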
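The effect of the digamma-based normalization can be seen in a short sketch. It assumes the M-step form P(m_i) = exp(Ψ(c_i)) / exp(Ψ(Σ_j c_j)) with expected counts c_i; the digamma approximation below (recurrence plus asymptotic series) and the function names are written for this example and are not taken from the SentencePiece source.

```python
import math

def digamma(x):
    """Digamma function for x > 0, via the recurrence and an asymptotic series."""
    if x <= 0:
        raise ValueError("x must be positive")
    result = 0.0
    while x < 6.0:                  # psi(x) = psi(x + 1) - 1 / x
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    # ln x - 1/(2x) - 1/(12 x^2) + 1/(120 x^4) - 1/(252 x^6)
    result += math.log(x) - 0.5 / x - inv2 * (1 / 12 - inv2 * (1 / 120 - inv2 / 252))
    return result

def bayesian_m_step(expected_counts):
    """M-step with digamma-transformed counts; the result is sub-normalized."""
    log_norm = digamma(sum(expected_counts.values()))
    return {m: math.exp(digamma(c) - log_norm)
            for m, c in expected_counts.items() if c > 0}

toy_counts = {"a": 10.0, "b": 0.5}          # invented expected counts
toy_probs = bayesian_m_step(toy_counts)
```

Since exp(Ψ(c)) is close to c - 0.5 for large c, the transformation acts roughly like subtracting half a count from every subword, which penalizes rare subwords proportionally more and leaves the probabilities summing to less than one.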

The seed lexicon simply consists of the most frequent substrings, e.g. the one million most frequent ones. SentencePiece uses an EM+Prune training algorithm. Each iteration consists of two sub-iterations of EM, after which the lexicon is pruned. Pruning is based on Viterbi counts (EM+Viterbi-prune). First, subwords that do not occur in the Viterbi segmentation are pre-pruned. The cost function is the estimated change in likelihood when the subword is removed, estimated using the assumption that all probability mass of the removed subword goes to its Viterbi segmentation. Subwords are sorted according to the cost, and a fixed proportion of remaining subwords are pruned each iteration. Single character subwords are never pruned. A predetermined lexicon size is used as the stopping condition.

3 Morfessor EM+Prune

Morfessor EM+Prune uses the unigram language model and priors similar to Morfessor Baseline, but combines them with EM+Prune training.

3.1 Prior

The prior must be slightly modified for use with the EM+Prune algorithm. The prior for the frequency distribution, Eq. (5), is derived using combinatorics. When using real-valued expected counts, there are infinitely many assignments of counts to parameters. Despite not being theoretically motivated, it can still be desirable to compute an approximation of the Baseline frequency distribution prior, in order to use EM+Prune as an improved search to find more optimal parameters for the original cost. To do this, the real-valued token count is rounded to the nearest integer. (Footnote 5: An alternative would be to replace the factorial with the gamma function. This added precision serves no practical purpose, particularly as we already use Stirling's approximation of the factorial.) Alternatively, the prior for the frequency distribution can be omitted, or a new prior with suitable properties could be formulated. We do not propose a completely new prior in this work, instead opting to remain as close as possible to Morfessor Baseline.

In Morfessor EM+Prune, morphs are explicitly stored in the lexicon, and morphs are removed from the lexicon only during pruning. This differs from Morfessor Baseline, in which a morph is implicitly considered to be stored in the lexicon if it has non-zero count.

The prior for the morph form properties does not need to be modified. During the EM parameter estimation, the prior for the morph form properties is omitted as the morph lexicon remains constant. During pruning, the standard form prior is applicable.

Additionally we apply the Bayesian EM implicit Dirichlet Process prior [14]. We experiment with four variations of the prior:

  1. the full EM+Prune prior,

  2. omitting the Bayesian EM (noexp),

  3. omitting the approximate frequency distribution prior (nofreqdistr),

  4. and omitting the prior entirely (noprior).

3.2 Seed Lexicon

The seed lexicon consists of the one million most frequent substrings, with two restrictions on which substrings to include: pre-pruning of redundant subwords, and forcesplit. Truncating to the chosen size is performed after pre-pruning, which means that pre-pruning can make space for substrings that would otherwise have been below the threshold.

Pre-pruning of redundant subwords is based on occurrence counts. If a string $w$ occurs $n$ times, then any substring of $w$ will occur at least $n$ times. Therefore, if the substring has a count of exactly $n$, we know that it is not needed in any other context except as a part of $w$. Such unproductive substrings are likely to be poor candidate subwords, and can be removed to make space in the seed lexicon for more useful substrings. This pre-pruning is not a neutral optimization, but does affect segmentation results. We check all initial and final substrings for redundancy, but do not pre-prune internal substrings.
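The redundancy check can be sketched in a few lines of Python. This is a minimal illustration with hypothetical function names: substrings are counted by brute force, only initial and final substrings are tested, and single characters are exempted from pre-pruning, which is an assumption of this sketch.

```python
from collections import Counter

def substring_counts(word_counts, max_len=8):
    """Count every substring occurrence (up to max_len characters) in a word list."""
    counts = Counter()
    for word, c in word_counts.items():
        for i in range(len(word)):
            for j in range(i + 1, min(len(word), i + max_len) + 1):
                counts[word[i:j]] += c
    return counts

def redundant_substrings(counts):
    """Initial and final substrings that occur only as part of one longer string."""
    redundant = set()
    for w, n in counts.items():
        for k in range(2, len(w)):            # proper prefixes and suffixes, length >= 2
            for sub in (w[:k], w[-k:]):
                if counts[sub] == n:          # never seen outside of w
                    redundant.add(sub)
    return redundant

# Toy word types (type-based counts of 1), invented for illustration.
toy_counts = substring_counts({"walking": 1, "talking": 1, "walked": 1})
toy_redundant = redundant_substrings(toy_counts)
```

In this tiny corpus "alkin" is redundant because both of its occurrences are inside "alking", whereas "walk" is kept because it occurs in two different contexts ("walking" and "walked").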

To achieve forced splitting before or after certain characters, e.g. hyphens, apostrophes and colons, substrings which include a forced split point can be removed from the seed lexicon. As EM+Prune is unable to introduce new subwords, this pre-pruning is sufficient to guarantee the forced splits. While Morfessor 2.0 only implements force splitting certain characters to single-character morphs, i.e. force splitting on both sides, we implement more fine-grained force splitting separately before and after the specified character.
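A sketch of how forced splits could be enforced by filtering the seed lexicon is given below. The function name and its parameters are hypothetical. A substring is dropped if a character requiring a split before it appears anywhere but at the start, or if a character requiring a split after it appears anywhere but at the end.

```python
def apply_forcesplit(seed, split_before="", split_after=""):
    """Remove seed substrings that would span a forced split point."""
    kept = set()
    for s in seed:
        spans_split = (any(ch in split_before for ch in s[1:]) or
                       any(ch in split_after for ch in s[:-1]))
        if not spans_split:
            kept.add(s)
    return kept

# Splits before and after hyphens, and before apostrophes.
toy_seed = {"women", "'s", "n's", "-", "s-", "-r", "rights"}
toy_kept = apply_forcesplit(toy_seed, split_before="-'", split_after="-")
```

With this filtered lexicon, no segmentation can join a hyphen to its neighbours or attach an apostrophe to preceding material, because no such subword exists any longer.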

Method               FS   Prior    Likelihood   W-sum
Words                     1.7910   1.3210       2.9810
Characters                2.3510   2.9010       2.6110
EM+P MDL                  4.6910   2.0910       1.9210
Morfessor Baseline        7.5510   2.0510       1.9210
Morfessor Baseline        8.8410   1.9910       1.8810
EM+P MDL                  5.8010   2.0210       1.8810
EM+P MDL lateen           6.3510   2.0110       1.8810

Table 2: Morfessor cost results for English. FS is short for forcesplit, W-sum for weighted sum of prior and likelihood. Lower values are better. The bolded method is our primary configuration.

Method               FS   Prior    Likelihood   W-sum
Words                     8.6410   4.7710       8.7410
Characters                2.4610   1.2710       2.5410
Morfessor Baseline        8.3110   8.6010       1.8010
Morfessor Baseline        8.3610   8.5910       1.8010
EM+P MDL                  1.2910   8.2810       1.7910
EM+P MDL lateen           1.4110   8.2210       1.7910
EM+P MDL                  1.3110   8.2610       1.7810

Table 3: Morfessor cost results for Finnish.

Method                     Prior    Likelihood   W-sum
Words                      1.3110   9.0910       1.6810
Characters                 1.1910   2.0810       8.3010
Morfessor Baseline         2.5410   1.3910       5.8210
EM+P MDL lateen            2.7910   1.3710       5.7810
EM+P MDL                   2.7110   1.3710       5.7710
EM+P MDL keep-redundant    2.9710   1.3610       5.7310

Table 4: Morfessor cost results for Turkish.

Method               FS   Prior    Likelihood   W-sum
Words                     2.1210   1.0310       3.1510
Characters                1.3810   2.9810       2.9810
Morfessor Baseline        1.7610   1.6210       1.8010
Morfessor Baseline        1.8710   1.6110       1.8010
EM+P MDL                  9.5210   1.7010       1.7910
EM+P MDL lateen           9.8310   1.6910       1.7910
EM+P MDL                  9.5610   1.6910       1.7910

Table 5: Morfessor cost results for North Sámi.

3.3 Training Algorithm

We experiment with three variants of the EM+Prune iteration structure:

  1. EM,

  2. Lateen-EM,

  3. EM+Viterbi-prune

EM+Viterbi-prune is an intermediary mode between EM and lateen-EM in the context of pruning. The pruning decisions are made based on counts from a single iteration of Viterbi training, but these Viterbi counts are not otherwise used to update the parameters. In effect, this allows for the more aggressive pruning using the Viterbi counts, while retaining the uncertainty of the soft parameters.

Each iteration begins with 3 sub-iterations of EM. In the pruning phase of each iteration, the subwords in the current lexicon are sorted in ascending order according to the estimated change in the cost function if the subword is removed from the lexicon. Subwords consisting of a single character are always kept, to retain the ability to represent an open vocabulary without OOV issues. The list is then pruned according to one of three available pruning criteria (footnote 6: MDL with or without automatic tuning is not compatible with omitting the prior):

  1. ($\alpha$-weighted) MDL pruning,

  2. MDL with automatic tuning of $\alpha$ for lexicon size,

  3. lexicon size with omitted prior or pretuned $\alpha$.

In ($\alpha$-weighted) Minimum Description Length (MDL) pruning, subwords are pruned until the estimated cost starts rising, or until the pruning quota for the iteration is reached, whichever comes first.

A subword lexicon of a predetermined size can be used as pruning criterion in two different ways. If the desired $\alpha$ is known in advance, or if the prior is omitted, subwords are pruned until the desired lexicon size is reached, or until the pruning quota for the iteration is reached, whichever comes first.

To reach a subword lexicon of a predetermined size while using the Morfessor prior, the new automatic tuning procedure can be applied. For each subword, the estimated change in prior and likelihood are computed separately. These allow computing the value of $\alpha$ that would cause the removal of each subword to be cost neutral, i.e. the value that would cause MDL pruning to terminate at that subword. For subwords with the same sign for both the change in prior and likelihood, no such threshold can be computed: if the removal decreases both costs the subword will always be removed, and if it increases both costs it will always be kept. Sorting the list of subwords according to the estimated threshold, including the always kept subwords, allows automatically tuning $\alpha$ so that a subword lexicon of exactly the desired size is retained after MDL pruning. The automatic tuning is repeated before the pruning phase of each iteration, as retraining the parameters affects the estimates.
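The threshold computation can be sketched as follows. This is an illustrative reading of the procedure, not the released implementation: the function names and the input format (a mapping from subword to its estimated change in prior cost and likelihood cost upon removal) are assumptions, the weight is placed midway between two neighbouring thresholds, and ties between thresholds are not handled. Removal is cost neutral when the change in prior plus the weight times the change in likelihood equals zero.

```python
def tune_alpha(deltas, target_size):
    """Pick a likelihood weight so that MDL pruning retains `target_size` subwords.

    deltas: subword -> (change in prior cost, change in likelihood cost) if removed.
    """
    always_kept, thresholds = 0, []
    for d_prior, d_lik in deltas.values():
        if d_prior >= 0 and d_lik >= 0:
            always_kept += 1                       # removal never lowers the cost
        elif d_prior <= 0 and d_lik <= 0:
            continue                               # removal always lowers the cost
        elif d_lik > 0:
            thresholds.append(-d_prior / d_lik)    # cost-neutral weight
        else:
            raise ValueError("sign pattern not covered by this sketch")
    n_needed = target_size - always_kept
    if n_needed < 0 or n_needed > len(thresholds):
        raise ValueError("target size cannot be reached by tuning the weight")
    if not thresholds:
        return 1.0                                 # the weight has no effect
    thresholds.sort()
    if n_needed == 0:
        return thresholds[0] / 2                   # below every threshold
    lower = thresholds[n_needed - 1]
    upper = thresholds[n_needed] if n_needed < len(thresholds) else 2 * lower
    return (lower + upper) / 2

def kept_after_mdl_pruning(deltas, alpha):
    """Subwords whose removal would not decrease the weighted cost."""
    return {m for m, (dp, dl) in deltas.items() if dp + alpha * dl >= 0}

# Invented cost changes for five hypothetical subwords.
toy_deltas = {"a": (1.0, 5.0), "ab": (-2.0, 4.0), "abc": (-3.0, 2.0),
              "zz": (-1.0, -1.0), "ing": (-4.0, 16.0)}
toy_alpha = tune_alpha(toy_deltas, target_size=3)
toy_kept = kept_after_mdl_pruning(toy_deltas, toy_alpha)
```

A subword with a low threshold is kept even at small weights, so raising the weight grows the retained lexicon one threshold at a time.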

Method                     FS   α       Pre    Rec    F
EM+P MDL noexp                  0.8     82.9   71.8   77.0
EM+P MDL nofreqdistr            0.8     83.3   71.4   76.9
EM+P MDL                        0.9     81.9   72.1   76.7
Morfessor Baseline              0.8     85.0   68.5   75.9
Morfessor Baseline              0.7     83.8   69.4   75.9   B
EM+P MDL                        0.6     79.0   72.8   75.8
SentencePiece 50k                       75.9   61.9   68.2

Table 6: Boundary Precision (Pre), Recall (Rec) and F-score (F) results for English. E indicates not significantly different (two-sided Wilcoxon signed-rank test, zero splitting) from the bolded EM+Prune method, and B from the bolded Baseline.

Method                     FS   α       Pre    Rec    F
EM+P MDL                        0.035   72.0   55.8   62.9
EM+P MDL nofreqdistr            0.02    68.7   58.0   62.9   E
EM+P MDL noexp                  0.02    68.4   57.9   62.7   E
EM+P MDL                        0.015   66.7   58.5   62.3   E
Morfessor Baseline              0.02    62.3   58.2   60.2
SentencePiece 50k                       75.7   49.3   59.7   B
Morfessor Baseline              0.02    62.0   57.6   59.7

Table 7: Boundary Precision (Pre), Recall (Rec) and F-score (F) results for Finnish.

Method                     α       Pre    Rec    F
EM+P MDL keep-redundant    0.3     87.8   58.7   70.4
EM+P MDL noexp             0.4     87.6   58.1   69.9
EM+P MDL nofreqdistr       0.3     86.4   58.2   69.6   E
EM+P MDL                   0.2     84.8   58.7   69.4
Morfessor Baseline         0.2     78.2   58.4   66.9
SentencePiece 12k                  75.2   60.0   66.8   B

Table 8: Boundary Precision (Pre), Recall (Rec) and F-score (F) results for Turkish.

Method                     FS   α       Pre    Rec    F
Morfessor Baseline              1.4     75.7   60.7   67.4   E
EM+P MDL nofreqdistr            1.0     73.7   61.8   67.2   B
Morfessor Baseline              1.2     75.7   60.4   67.2   E B
EM+P MDL                        1.3     73.0   62.1   67.1   B
EM+P MDL                        1.2     72.8   62.0   66.9
EM+P MDL noexp                  0.4     66.5   65.9   66.2
SentencePiece 64k                       65.3   61.3   63.3

Table 9: Boundary Precision (Pre), Recall (Rec) and F-score (F) results for North Sámi.

Figure 2: Boundary Precision–Recall curve at different tuning points. The smallest and largest $\alpha$-values are labeled.

3.4 Sampling of Segmentations

Morfessor EM+Prune can be used in subword regularization [11], a denoising-based regularization method for neural NLP systems. Alternative segmentations can be sampled from the full data distribution using the forward-filtering backward-sampling algorithm [16] or approximately but more efficiently from an $n$-best list.
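Sampling from the full distribution can be sketched for a unigram model as follows. This is a minimal illustration of forward-filtering backward-sampling with hypothetical names and invented toy parameters; it uses plain probabilities, so very long inputs could underflow.

```python
import random

def sample_segmentation(word, probs, rng, max_len=10):
    """Draw one segmentation of `word` in proportion to its unigram probability."""
    n = len(word)
    fwd = [0.0] * (n + 1)      # fwd[j]: total probability of all segmentations of word[:j]
    fwd[0] = 1.0
    for j in range(1, n + 1):  # forward filtering
        for i in range(max(0, j - max_len), j):
            fwd[j] += fwd[i] * probs.get(word[i:j], 0.0)
    if fwd[n] == 0.0:
        raise ValueError("word cannot be segmented with this lexicon")
    morphs, j = [], n
    while j > 0:               # backward sampling of the last morph boundary
        starts = list(range(max(0, j - max_len), j))
        weights = [fwd[i] * probs.get(word[i:j], 0.0) for i in starts]
        i = rng.choices(starts, weights=weights)[0]
        morphs.append(word[i:j])
        j = i
    return morphs[::-1]

toy_probs = {"ab": 0.5, "a": 0.25, "b": 0.25}     # invented toy parameters
rng = random.Random(0)
samples = [sample_segmentation("ab", toy_probs, rng) for _ in range(2000)]
unsplit_fraction = sum(s == ["ab"] for s in samples) / len(samples)
```

In the toy model the unsplit analysis has posterior probability 8/9, so roughly that share of the samples should leave "ab" intact.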

3.5 SentencePiece as a Special Case of Morfessor EM+Prune

Table 1 contains a comparison between all four methods discussed in this work. To recover SentencePiece, Morfessor EM+Prune should be run with the following settings: The prior should be omitted entirely, leaving only the likelihood

$$\theta = \arg\min_{\theta} \left\{ -\log P(\mathbf{D} \mid \theta) \right\} \qquad (7)$$

As the tuning parameter $\alpha$ is no longer needed when the prior is omitted, the pruning criterion can be set to a predetermined lexicon size, without automatic tuning of $\alpha$. Morfessor by default uses type-based training; to use frequency information, count dampening should be turned off. The seed lexicon should be constructed without using forced splitting. The EM+Viterbi-prune training scheme should be used, with Bayesian EM turned on.

4 Experimental Setup

English, Finnish and Turkish data are from the Morpho Challenge 2010 data set [12, 13]. The training sets contain ca 878k, 2.9M and 617k word types, respectively. As test sets we use the union of the 10 official test set samples. For North Sámi, we use a list of ca 691k word types extracted from Den samiske tekstbanken corpus (Sametinget, 2004) and the 796 word type test set from version 2 of the data set collected by [8, 7].

In most experiments we use a grid search with a development set to find a suitable value for $\alpha$. The exceptions are experiments using autotuning or the lexicon size criterion, and experiments using SentencePiece. We use type-based training (dampening counts to 1) with all Morfessor methods.

For English, we force splits before and after hyphens, and before apostrophes, e.g. “women’s-rights” is force split into “women ’s - rights”. For Finnish, we force splits before and after hyphens, and after colons. For North Sámi, we force splits before and after colons. For Turkish, the Morpho Challenge data is preprocessed in a way that makes force splitting ineffectual.

4.1 Evaluation

The ability of the training algorithm to find parameters minimizing the Morfessor cost is evaluated by using the trained model to segment the training data, and loading the resulting segmentation as if it were a Morfessor Baseline model. We observe both unweighted prior and likelihood, and their $\alpha$-weighted sum.

The closeness to linguistic segmentation is evaluated by comparison with annotated morph boundaries using boundary precision, boundary recall, and boundary $F_1$-score [21]. The boundary $F_1$-score (F-score for short) equals the harmonic mean of precision (the percentage of correctly assigned boundaries with respect to all assigned boundaries) and recall (the percentage of correctly assigned boundaries with respect to the reference boundaries). Precision and recall are calculated using macro-averages over the word types in the test set. In the case that a word has more than one annotated segmentation, we take the one that gives the highest score.
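The evaluation measure can be sketched as below. Two details are assumptions of this sketch rather than statements from the text: words without any predicted boundary are left out of the precision average and words without any reference boundary are left out of the recall average, and the best alternative annotation is chosen separately for precision and for recall.

```python
def boundaries(morphs):
    """Set of internal boundary positions of a segmentation given as a morph list."""
    positions, pos = set(), 0
    for m in morphs[:-1]:
        pos += len(m)
        positions.add(pos)
    return positions

def boundary_prf(predicted, gold):
    """Macro-averaged boundary precision, recall and F-score over word types.

    predicted: word -> list of morphs; gold: word -> list of alternative morph lists.
    """
    precisions, recalls = [], []
    for word, pred_morphs in predicted.items():
        pred = boundaries(pred_morphs)
        refs = [boundaries(alt) for alt in gold[word]]
        if pred:    # precision is undefined without predicted boundaries
            precisions.append(max(len(pred & r) / len(pred) for r in refs))
        scored = [len(pred & r) / len(r) for r in refs if r]
        if scored:  # recall is undefined without reference boundaries
            recalls.append(max(scored))
    p = sum(precisions) / len(precisions) if precisions else 0.0
    r = sum(recalls) / len(recalls) if recalls else 0.0
    f = 2 * p * r / (p + r) if p + r > 0 else 0.0
    return p, r, f

# Invented toy predictions and references.
toy_pred = {"walked": ["walk", "ed"], "cats": ["c", "ats"], "dog": ["dog"]}
toy_gold = {"walked": [["walk", "ed"]], "cats": [["cat", "s"]], "dog": [["dog"]]}
toy_p, toy_r, toy_f = boundary_prf(toy_pred, toy_gold)
```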

4.2 Error Analysis

We perform an error analysis, with the purpose of gaining more insight into the ability of the methods to model particular aspects of morphology. We follow the procedure used by ruokolainen2016comparative. It is based on a categorization of morphs into the categories prefix, stem, and suffix. The category labels are derived from the original morphological analysis labels in the English and Finnish gold standards, and directly correspond to the annotation scheme used in the North Sámi test set.

We first divide errors into two kinds, over-segmentation and under-segmentation. Over-segmentation occurs when a boundary is incorrectly assigned within a morph segment. In under-segmentation, a correct morph boundary is omitted from the generated segmentation. We further divide the errors by the morph category in which the over-segmentation occurs, and the two morph categories surrounding the omitted boundary in under-segmentation.

5 Results

Figure 1 compares the cost components of the Morfessor model across different parameters. The lowest costs for the mid-range settings are obtained for the EM+Prune algorithm, but for larger lexicons, the Baseline algorithm copes better. As expected, using forced splits at certain characters increases the costs, and the increase is larger than the difference between the training algorithms. As Turkish preprocessing causes the results to be unaffected by the forced splits, we only report results without them.

Tables 2 to 5 show the Morfessor cost of the segmented training data for particular $\alpha$ values. Again, the proposed Morfessor EM+Prune reaches a lower Morfessor cost than Morfessor Baseline. Using the lateen-EM has only a minimal effect on the costs, decreasing the total cost slightly for English and increasing it for the other languages. Turkish results include the “keep-redundant” setting discussed below in more detail.

Figure 2 shows the Precision–Recall curves for the primary systems, for all four languages. Although it increases the Morfessor cost, forced splitting improves the Boundary Precision, Recall and F-score (BPR) results. Tables 6 to 9 show test set BPR results at the optimal tuning point (selected using a development set) for each model, for English, Finnish, Turkish and North Sámi, respectively. (Footnote 7: Note that SentencePiece is not designed for aiming towards a linguistic morpheme segmentation. Neither does it attempt to minimize the Morfessor cost. Therefore, SentencePiece is included in the evaluations for context, not as a baseline method.) The default Morfessor EM+Prune configuration (“soft” EM, full prior, forcesplit) significantly outperforms Morfessor Baseline w.r.t. the F-score for all languages except North Sámi, for which there is no significant difference between the methods.

Morfessor EM+Prune is less responsive to tuning than Morfessor Baseline. This is visible in the shorter lines in Figures 1 and 2, although the tuning parameter takes values from the same range. In particular, EM+Prune cannot easily be tuned to produce very large lexicons.

Pre-pruning of redundant substrings gives mixed results. For Turkish, both Morfessor cost and BPR are degraded by the pre-pruning, but for the other three languages the pre-pruning is beneficial or neutral. When tuning $\alpha$ to very high values (less segmentation), pre-pruning of redundant substrings improves the sensitivity to tuning. The same effect may also be achievable by using a larger seed lexicon. We perform most of our experiments with pre-pruning turned on.

To see the effect of pre-pruning on the seed lexicon, we count the number of subwords that are used in the gold standard segmentations, but not included in seed lexicons of various sizes. Taking Finnish as an example, we see 203 subword types missing from a 1 million substring seed lexicon without pre-pruning. Turning on pre-pruning decreases the number of missing types to 120. To reach the same number without using pre-pruning, a much larger seed lexicon of 1.7M substrings must be used.

Omitting the frequency distribution prior appears to have little effect on Morfessor cost and BPR. Turning off Bayesian EM (noexp) yields a less compact lexicon and thus a higher prior cost, but improves BPR for two languages: English and Turkish.

Table 10 contains the error analysis for English, Finnish and North Sámi. For English and North Sámi, EM+Prune results in less under-segmentation but worse over-segmentation. For Finnish these results are reversed. However, the suffixes are often better modeled, as shown by lower under-segmentation on SUF-SUF (all languages) and STM-SUF (English and North Sámi).

Lang  Method               | Over-segmentation                | Under-segmentation
eng   Characters           | 71.05 11.82  1.66  0.33   15.13  |  0.00  0.00  0.00  0.00  0.00 100.00
eng   Words                |  0.00  0.00  0.00  0.00  100.00  | 55.07  5.90  8.56  0.14  4.38  23.06
eng   SentencePiece 38k    | 17.60 10.25  0.18  0.24   71.74  | 26.40  2.48  2.74  0.05  2.78  65.26
eng   Morfessor Baseline   | 10.17  2.32  0.03  0.07   87.42  | 22.46  2.10  4.75  0.04  1.65  67.37
eng   EM+Prune MDL         | 15.46  2.75  0.05  0.13   81.61  | 19.93  1.82  4.32  0.04  1.46  70.84
fin   Characters           | 65.23 13.80  0.67  0.57   19.73  |  0.00  0.00  0.00  0.00  0.00  0.00 100.00
fin   Words                |  0.00  0.00  0.00  0.00  100.00  | 49.19 17.16 21.76  4.84  0.96  0.58   4.09
fin   SentencePiece 13k    | 35.11  3.71  0.08  0.41   60.69  | 25.96  1.45 16.18  0.35  0.08  0.16  55.81
fin   Morfessor Baseline   | 34.75  2.82  0.03  0.38   62.02  | 24.57  0.86 16.31  0.15  0.04  0.20  57.63
fin   EM+Prune MDL         | 29.34  2.20  0.03  0.26   68.18  | 24.68  0.90 15.95  0.29  0.05  0.19  57.60
sme   Characters           | 81.44  6.80  11.76               |  0.00  0.00  0.00  0.00 100.00
sme   Words                |  0.00  0.00 100.00               | 52.92 13.15  4.43  0.61  28.64
sme   SentencePiece 64k    | 30.10  4.52  65.38               | 31.35  3.96  3.09  0.20  61.40
sme   Morfessor Baseline   | 23.27  3.02  73.71               | 33.16  2.22  3.40  0.10  60.99
sme   EM+Prune MDL         | 23.35  4.41  72.25               | 30.48  3.10  3.23  0.17  62.84

Table 10: Error analysis for English (eng), Finnish (fin), and North Sámi (sme). All results without forcesplit. Over-segmentation and under-segmentation errors reduce precision and recall, respectively.

6 Conclusion

We propose Morfessor EM+Prune, a new training algorithm for Morfessor Baseline. EM+Prune reduces search error during training, resulting in models with lower Morfessor costs. Lower costs also lead to improved accuracy when segmentation output is compared to linguistic morphological segmentation.

We compare Morfessor EM+Prune to three previously published segmentation methods applying unigram language models. We find that using the Morfessor prior is beneficial when the reference is linguistic morphological segmentation.

In this work we focused on model cost and linguistic segmentation. In future work the performance of Morfessor EM+Prune in applications will be evaluated. Also, a new frequency distribution prior, which is theoretically better motivated or has desirable properties, could be formulated.

7 Acknowledgements

This study has been supported by the MeMAD project, funded by the European Union’s Horizon 2020 research and innovation programme (grant agreement No 780069), and the FoTran project, funded by the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (grant agreement No 771113). Computer resources within the Aalto University School of Science “Science-IT” project were used.

8 Bibliographical References


  • [1] N. Arivazhagan, A. Bapna, O. Firat, D. Lepikhin, M. Johnson, M. Krikun, M. X. Chen, Y. Cao, G. Foster, C. Cherry, W. Macherey, Z. Chen, and Y. Wu (2019) Massively multilingual neural machine translation in the wild: findings and challenges. CoRR abs/1907.05019.
  • [2] R. Cotterell, C. Kirov, J. Sylak-Glassman, G. Walther, E. Vylomova, P. Xia, M. Faruqui, S. Kübler, D. Yarowsky, J. Eisner, and M. Hulden (2017-08) CoNLL-SIGMORPHON 2017 shared task: universal morphological reinflection in 52 languages. In Proceedings of the CoNLL SIGMORPHON 2017 Shared Task: Universal Morphological Reinflection, Vancouver, pp. 1–30.
  • [3] M. Creutz and K. Lagus (2002-07) Unsupervised discovery of morphemes. In ACL-02 Workshop on Morphological and Phonological Learning, MPL ’02, Vol. 6, Philadelphia, Pennsylvania, USA, pp. 21–30.
  • [4] M. Creutz and K. Lagus (2007-01) Unsupervised models for morpheme segmentation and morphology learning. ACM Transactions on Speech and Language Processing 4 (1).
  • [5] A. P. Dempster, N. M. Laird, and D. B. Rubin (1977) Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B (Methodological) 39 (1), pp. 1–38.
  • [6] P. Gage (1994-02) A new algorithm for data compression. C Users Journal 12 (2), pp. 23–38.
  • [7] S. Grönroos, K. Hiovain, P. Smit, I. E. Rauhala, P. K. Jokinen, M. Kurimo, and S. Virpioja (2016) Low-resource active learning of morphological segmentation. Northern European Journal of Language Technology.
  • [8] S. Grönroos, K. Jokinen, K. Hiovain, M. Kurimo, and S. Virpioja (2015) Low-resource active learning of North Sámi morphological segmentation. In Proceedings of 1st International Workshop in Computational Linguistics for Uralic Languages, pp. 20–33.
  • [9] O. Kohonen, S. Virpioja, and K. Lagus (2010-07) Semi-supervised learning of concatenative morphology. In Proceedings of the 11th Meeting of the ACL Special Interest Group on Computational Morphology and Phonology, Uppsala, Sweden, pp. 78–86.
  • [10] T. Kudo and J. Richardson (2018-11) SentencePiece: a simple and language independent subword tokenizer and detokenizer for neural text processing. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, Brussels, Belgium, pp. 66–71.
  • [11] T. Kudo (2018-04) Subword regularization: improving neural network translation models with multiple subword candidates. arXiv:1804.10959 [cs]. Accepted as a long paper at ACL 2018.
  • [12] M. Kurimo, S. Virpioja, V. Turunen, and K. Lagus (2010-07) Morpho Challenge 2005-2010: evaluations and results. In Proceedings of the 11th Meeting of the ACL Special Interest Group on Computational Morphology and Phonology, J. Heinz, L. Cahill, and R. Wicentowski (Eds.), Uppsala, Sweden, pp. 87–95.
  • [13] M. Kurimo, S. Virpioja, and V. T. Turunen (2010-09) Overview and results of Morpho Challenge 2010. In Proceedings of the Morpho Challenge 2010 Workshop, Espoo, Finland, pp. 7–24. Technical Report TKK-ICS-R37.
  • [14] P. Liang and D. Klein (2007) Structured Bayesian nonparametric models with variational inference (tutorial). In Association for Computational Linguistics (ACL).
  • [15] J. Rissanen (1989) Stochastic complexity in statistical inquiry. Vol. 15, World Scientific Series in Computer Science, Singapore.
  • [16] S. L. Scott (2002) Bayesian methods for hidden Markov models: recursive computing in the 21st century. Journal of the American Statistical Association 97 (457), pp. 337–351.
  • [17] R. Sennrich, B. Haddow, and A. Birch (2015-08) Neural machine translation of rare words with subword units. In ACL16. arXiv:1508.07909.
  • [18] P. Smit, S. Virpioja, M. Kurimo, et al. (2017) Improved subword modeling for WFST-based speech recognition. In INTERSPEECH 2017, 18th Annual Conference of the International Speech Communication Association.
  • [19] M. Varjokallio, M. Kurimo, and S. Virpioja (2013-12) Learning a subword vocabulary based on unigram likelihood. In Proc. 2013 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), Olomouc, Czech Republic, pp. 7–12. ISBN 978-1-4799-2756-2.
  • [20] S. Virpioja, P. Smit, S. Grönroos, and M. Kurimo (2013) Morfessor 2.0: Python implementation and extensions for Morfessor Baseline. Technical Report 25/2013 in Aalto University publication series SCIENCE + TECHNOLOGY, Department of Signal Processing and Acoustics, Aalto University, Helsinki, Finland.
  • [21] S. Virpioja, V. T. Turunen, S. Spiegler, O. Kohonen, and M. Kurimo (2011) Empirical comparison of evaluation methods for unsupervised learning of morphology. Traitement Automatique des Langues 52 (2), pp. 45–90.
  • [22] A. J. Viterbi (1967) Error bounds for convolutional codes and an asymptotically optimum decoding algorithm. IEEE Transactions on Information Theory 13 (2), pp. 260–269.

9 Language Resource References

Grönroos, Stig-Arne and Hiovain, Katri and Smit, Peter and Rauhala, Ilona Erika and Jokinen, Päivi Kristiina and Kurimo, Mikko and Virpioja, Sami. (2015). North Sámi active learning morphological segmentation annotations. Aalto University, 2.0.

Kudo, Taku and Richardson, John. (2018). SentencePiece. Taku Kudo.

Kurimo, Mikko and Virpioja, Sami and Turunen, Ville T. (2010). Morpho Challenge 2010 dataset. Aalto University.

Sametinget. (2004). Den samiske tekstbanken. UiT Norgga árktalaš universitehta.

Virpioja, Sami and Smit, Peter and Grönroos, Stig-Arne and Kurimo, Mikko. (2013). Morfessor 2.0. Aalto University, 2.0.6.