
Grammatical Error Correction and Style Transfer via Zero-shot Monolingual Translation

by   Elizaveta Korotkova, et al.

Both grammatical error correction and text style transfer can be viewed as monolingual sequence-to-sequence transformation tasks, but the scarcity of directly annotated data for either task makes them unfeasible for most languages. We present an approach that does both tasks within the same trained model, and only uses regular language parallel data, without requiring error-corrected or style-adapted texts. We apply our model to three languages and present a thorough evaluation on both tasks, showing that the model is reliable for a number of error types and style transfer aspects.


1 Introduction

Sequence-to-sequence (seq2seq) transformations have recently proven to be a successful framework for several natural language processing tasks, such as machine translation (MT) Bahdanau et al. (2014); Vaswani et al. (2017), speech recognition Hannun et al. (2014), speech synthesis Shen et al. (2017a), natural language inference Parikh et al. (2016) and others. However, the success of these models depends on the availability of large amounts of directly annotated data for the task at hand (like translation examples, text segments and their speech recordings, etc.). This is a severe limitation for tasks where data is not abundantly available as well as for low-resource languages.

Here we focus on two such tasks: grammatical error correction (GEC) and style transfer. Modern approaches to GEC learn from parallel corpora of erroneous segments and their manual corrections Ng et al. (2014); Yuan and Briscoe (2016); text style transfer also relies on supervised approaches that require texts of the same meaning and different styles Xu et al. (2012); Jhamtani et al. (2017) or imprecise unsupervised methods Fu et al. (2018); Zhang et al. (2018b).

In this paper we introduce an approach to performing both GEC and style transfer with the same trained model, while not using any supervised training data for either task. It is based on zero-shot neural machine translation (NMT) Johnson et al. (2017), and as such, the only kind of data it uses is regular parallel corpora (with texts and their translations). However, we apply the model to do monolingual transfer, asking it to translate the input segment into the same language. We show that this “monolingual translation” is what enables the model to correct the errors in the input as well as adapt the output into a desired style. Moreover, the same trained model performs both tasks on several languages.

Our main contributions are thus: (i) a single method for both style transfer and grammatical error correction, without using annotated data for either task, (ii) support for both tasks on multiple languages within the same model, (iii) a thorough quantitative and qualitative manual evaluation of the model on both tasks, and (iv) highlighting of the model’s reliability aspects on both tasks. We used publicly available software and data; an online demo of our results is available, but concealed for anonymization purposes.

We describe the details of our approach in Section 2, then evaluate it in terms of performance in grammatical error correction in Section 3 and in style transfer in Section 4. The paper ends with a review of related work in Section 5 and conclusions in Section 6.

2 Method

As mentioned in the introduction, our approach is based on the idea of zero-shot MT Johnson et al. (2017). There the authors show that after training a single model to translate from Portuguese to English as well as from English to Spanish, it can also translate Portuguese into Spanish, without seeing any translation examples for this language pair. We use the zero-shot effect to achieve monolingual translation by training the model on bilingual examples in both directions, and then translating into the same language as the input, as illustrated in Figure 1.

Figure 1: Schematic illustration of zero-shot monolingual translation. The model is trained on bilingual data in all translation directions (English-to-Estonian, Estonian-to-English, English-to-Latvian, etc.) and then applied in monolingual directions only (English-to-English, etc.), without having seen any sentence pairs for them. The illustration is simplified, as it does not show the style (text domain) parametrization.

With regular sentences monolingual translation does not seem useful, as its behaviour mainly consists of copying. However, when the input sentence has characteristics unseen or rarely seen by the model at training time (like grammatical errors or different stylistic choices) – the decoder still generates the more regular version of the sentence (thus fixing the errors or adapting the style). Furthermore, in case of multilingual multi-domain NMT Tars and Fishel (2018), it is possible to switch between different domains or styles at runtime, thus performing “monolingual domain adaptation” or style transfer.

To create a multilingual multi-domain NMT system we use the self-attention architecture (“Transformer”, Vaswani et al., 2017). Instead of specifying the output language with a token inside the input sequence, as Johnson et al. (2017) did, we follow Tars and Fishel (2018) and use word features (or factors). On one hand, this provides a stronger signal for the model, and on the other – allows for additional parametrization, which in our case is the text domain/style of the corpus.

As a result, a pre-processed English-Latvian training set sentence pair “Hello!”–“Sveiki!” looks like:
En: hello|2lv|2os !|2lv|2os
Lv: sveiki !
Here 2lv and 2os specify Latvian and OpenSubtitles as the output language and domain; the output text has no factors to predict. At application time we simply use the same input and output languages, for example the grammatically incorrect input “we is” looks like the following, after pre-processing:
En: we|2en|2os is|2en|2os
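
The factor annotation described above can be sketched in a few lines of Python. This is only an illustration of the notation used in the examples: the function name `add_factors` is hypothetical, and the pipe-separated layout mirrors the way the examples are printed here, not necessarily the input format expected by a particular NMT toolkit.

```python
def add_factors(tokens, target_lang, target_style):
    """Attach target-language and target-style factors to every source token.

    Hypothetical helper: the "token|2lang|2style" layout follows the
    examples in the text and is not tied to any toolkit's file format.
    """
    return " ".join(f"{tok}|2{target_lang}|2{target_style}" for tok in tokens)


# Training time: English source of the pair "Hello!" / "Sveiki!",
# asking for Latvian output in the OpenSubtitles style.
print(add_factors(["hello", "!"], "lv", "os"))  # hello|2lv|2os !|2lv|2os

# Application time: input and output language are the same.
print(add_factors(["we", "is"], "en", "os"))    # we|2en|2os is|2en|2os
```

Note that only the target side is encoded in the factors; nothing tells the model which language the input is in.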

The intuition behind our approach is that a multilingual shared encoder produces semantically rich latent sentence representations Artetxe and Schwenk (2018), which provide a solid ground for effective style transfer on top.

Next we present the technical details, the experiment setup and the data we used for training the model used in the experiments.

2.1 Languages and Data

We use three languages in our experiments: English, Estonian and Latvian. All three have different characteristics, for example Latvian and (especially) Estonian are morphologically complex and have loose word order, while English has a strict word order and the morphology is much simpler. Most importantly, all three languages have error-corrected corpora for testing purposes, though work on their automatic grammatical error correction is extremely limited (see Section 3).

The corpora we use for training the model are OpenSubtitles2018 Lison and Tiedemann (2016), Europarl Koehn (2005), JRC-Acquis and EMEA Tiedemann (2012). We assume that there should be sufficient stylistic difference between these corpora, especially between the more informal OpenSubtitles2018 (comprised of movie and TV subtitles) on one hand and Europarl and JRC-Acquis (proceedings and documents of the European Parliament) on the other.¹

¹ We acknowledge the fact that most text corpora and OpenSubtitles in particular constitute a heterogeneous mix of genres and text characteristics; however, many stylistic traits are also similar across the whole corpus, which means that these common traits can be learned as a single style.

2.2 Technical Details

For Europarl, JRC-Acquis and EMEA we use all data available for English-Estonian, English-Latvian and Estonian-Latvian language pairs. From OpenSubtitles2018 we take a random subset of 3M sentence pairs for English-Estonian, which is still more than English-Latvian and Estonian-Latvian (below 1M; there we use the whole corpus). This is done to balance the corpora representation and to limit the size of training data.

Details on the model hyper-parameters, data pre-processing and training can be found in Appendix A.

2.3 Evaluation

First, we evaluate our model in the context of MT, as the translation quality can be expected to have influence on the other tasks that the model performs. We use public benchmarks for Estonian-English and Latvian-English translations from the news translation shared tasks of WMT 2017 and 2018 Bojar et al. (2017, 2018). The BLEU scores for each translation direction and all included styles/domains are shown in Table 1.

        to EP   to JRC   to OS   to EMEA
EN→ET   20.7    19.9     20.6    18.6
ET→EN   24.7    23.6     26.1    23.8
EN→LV   15.7    15.3     16.3    15.0
LV→EN   18.3    17.8     19.0    17.5
Table 1: BLEU scores of the multilingual MT model on WMT’17 (Latvian-English) and WMT’18 (Estonian-English) test sets

Some surface notes on these results: the BLEU scores for translation from and into Latvian are below English-Estonian scores, which is likely explained by smaller datasets that include Latvian. Also, translation into English has higher scores than into Estonian/Latvian, which is also expected.

An interesting side-effect we have observed is the model’s resilience to code-switching in the input text. The reason is that the model is trained with only the target language (and domain), and not the source language, as a result of which it learns language normalization of sorts. For example, the sentence “Ma tahan two saldējumus.” (“Ma tahan” / “I want” in Estonian, “two” in English and “saldējumus” / “ice-creams” in genitive, plural in Latvian) is correctly translated into English as “I want two ice creams.”. See more examples in Appendix B.

3 Grammatical Error Correction

In this section we evaluate our model’s performance in the GEC task: for example, for the English input “huge fan I are”, our model’s output is “I am a huge fan”. The goal of this section is to check systematically how reliable the corrections are.

Although GEC does not require any distinction in text style, the core idea of this article is to also perform style transfer with the same multilingual multi-domain model. That only means that for GEC we have to select an output domain/style when producing error corrections.

Naturally, the model only copes with some kinds of errors and fails on others – for instance, word order is restored, as long as it does not affect the perception of the meaning. On the other hand, we do not expect orthographic variations like typos to be fixed reliably, since they affect the sub-word segmentation of the input and thus can hinder the translation.

Below we present qualitative and quantitative analysis of our model’s GEC results, showing its overall performance, as well as which kinds of errors are handled reliably and which are not.

3.1 Test Data and Metrics

We use the following error-corrected corpora both for scoring and as basis for manual analysis:

  • for English: CoNLL-2014 Ng et al. (2014) and JFLEG Napoles et al. (2017) corpora

  • for Estonian: the Learner Language Corpus Rummo and Praakli (2017)

  • for Latvian: the Error-annotated Corpus of Latvian Deksne and Skadina (2014)

All of these are based on language learner (L2) essays and their manual corrections.

To evaluate the model quantitatively we used two metrics: the Max-Match (M²) metric from the CoNLL-2014 shared task scorer, and the GLEU score Napoles et al. (2015) for the other corpora. The main difference is that M² is based on the annotation of error categories, while the GLEU score compares the automatic correction to a reference without any error categorization.

3.2 Results

The M² scores are computed based on error-annotated corpora. Since error annotations were only available for English, we calculated the scores on the English CoNLL corpus (see Table 2).

                     prec.   recall   M²
Our model            33.4    27.9     32.1
Felice, 2014         39.7    30.1     37.3
Rozovskaya, 2016     60.2    25.6     47.4
Rozovskaya (cl)      38.4    23.1     33.9
Grundkiewicz, 2018   83.2    47.0     72.0
Table 2: M² scores on the CoNLL corpus, including precision and recall.

Our model gets the M² score of 32.1. While it does not reach the score of the best CoNLL model Felice et al. (2014) or the state-of-the-art Grundkiewicz and Junczys-Dowmunt (2018), these use annotated corpora to train. Our results count as restricted in CoNLL definitions and are more directly comparable to the classifier-based approach trained on unannotated corpora by Rozovskaya and Roth (2016), while requiring even less effort.
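
As a quick sanity check on Table 2, the third column can be recomputed from the first two. The sketch below assumes that the reported score is the F0.5 measure used by the CoNLL-2014 scorer, which weights precision more heavily than recall; the function name `f_beta` is our own.

```python
def f_beta(precision, recall, beta=0.5):
    """Weighted harmonic mean of precision and recall (F_beta)."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Precision and recall taken from Table 2 (in percent).
print(round(f_beta(33.4, 27.9), 1))  # 32.1 (our model)
print(round(f_beta(39.7, 30.1), 1))  # 37.3 (Felice, 2014)
```

Because the table lists rounded precision and recall, recomputed values for other rows may differ from the reported ones in the last digit.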

                  original   informal model   formal model   best known
English (JFLEG)   40.5       44.1             45.9           61.5
Estonian          27.0       38.1             37.8           -
Latvian L2        59.7       44.7             45.1           -
Table 3: GLEU scores for all three languages. No scores have been previously reported elsewhere for Estonian and Latvian.

The GLEU scores can be seen in Table 3. We calculated GLEU for both formal and informal style models for all three languages. For English our model’s best score was 45.9 and for Estonian it was 38.1. The corrected Latvian output in fact gets worse scores than the original uncorrected corpus, which can be explained by smaller training corpora and worse MT quality for Latvian (see Table 1).

3.3 Qualitative Analysis

We looked at the automatic corrections for 100 erroneous sentences of English and Estonian each as well as 80 sentences of Latvian. The overall aim was to find the ratio of sentences where (1) all errors have been corrected, (2) only some are corrected, (3) only some are corrected and part of the meaning is changed, and (4) all meaning is lost.

The analysis was done separately for four error types: spelling and grammatical errors, word choice and word order. In case a sentence included more than one error type it was counted once for each error type. For English the first two types were annotated in the corpus, the rest were annotated by us, separating the original third error category into two new ones. The results can be seen in Table 4.

           Estonian              English
           1    2    3    4      1    2    3    4
spelling   12   5    2    0      12   7    4    2
lex        35   12   18   12     31   5    5    2
grammar    28   8    8    0      23   13   3    0
order      26   5    2    0      2    1    0    0
overall    29   32   27   12     19   42   7    2
Table 4: GEC results by error types; “grammar” stands for grammatical mistakes, “lex” stands for lexical choice and “order” for word order errors. Columns 1 to 4 are the outcome categories defined above.

Not all English sentences included errors. 30 sentences remained unchanged, out of which 17 had no mistakes in them. Of the changed sentences, 87% were fully or partially corrected. In case of Estonian, where all sentences had mistakes, 61 out of the 100 sentences were fully or partially corrected without loss of information. 12 sentences became nonsense, all of which originally had some lexical mistakes. For English the results are similar: the most confusing type of errors that leads to complete loss of meaning is word choice. On the other hand, this was also the most common error type for both languages and errors of that type were fully corrected in 45% of cases for Estonian and 72% for English. Using words in the wrong order is a typical learner’s error for Estonian, which has rather free word order. It is also difficult to describe or set rules for this type of error. Our model manages this type rather well, correcting 79% of sentences acceptably, only losing some meaning in 2 sentences including this error type.
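
The percentages quoted in this paragraph follow directly from the counts in Table 4. The short sketch below recomputes them; the dictionary layout and the helper `share` are ours, and the category numbers refer to the four outcomes (1 = all errors corrected, ..., 4 = all meaning lost).

```python
# Counts from Table 4: outcome categories 1-4 for each error type.
estonian = {"spelling": [12, 5, 2, 0], "lex": [35, 12, 18, 12],
            "grammar": [28, 8, 8, 0], "order": [26, 5, 2, 0],
            "overall": [29, 32, 27, 12]}
english = {"spelling": [12, 7, 4, 2], "lex": [31, 5, 5, 2],
           "grammar": [23, 13, 3, 0], "order": [2, 1, 0, 0],
           "overall": [19, 42, 7, 2]}


def share(counts, categories):
    """Percentage of sentences that fall into the given outcome categories (1-based)."""
    return 100 * sum(counts[c - 1] for c in categories) / sum(counts)


print(round(share(english["overall"], [1, 2])))   # 87: changed English sentences fully or partially corrected
print(round(share(estonian["overall"], [1, 2])))  # 61: Estonian sentences fully or partially corrected
print(round(share(estonian["lex"], [1])))         # 45: Estonian lexical errors fully corrected
print(round(share(english["lex"], [1])))          # 72: English lexical errors fully corrected
print(round(share(estonian["order"], [1])))       # 79: Estonian word order errors fully corrected
```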

A similar experiment using 80 Latvian sentences yielded 17 fully corrected sentences, and 15, 22 and 26 sentences in categories (2), (3) and (4), respectively. As the Latvian model is weaker in general, this also leads to more chances of losing some of the meaning; we exclude it from the more detailed analysis and focus on English and Estonian.

Our model handles punctuation, word order mistakes and grammatical errors well. For example, the subject-verb disagreement in English (1) and the verb-object disagreement in Estonian (2) have been corrected.

    (1) a. “When price of gas goes up , the consumer do not want buy gas for fuels”
        b. “When the price of gas goes up, the consumer doesn’t want to buy gas for fuels”

    (2) a. “Sellepärast ütleb ta filmi lõpus, et tahab oma unistuse tagasi”
        b. “Sellepärast ütleb ta filmi lõpus, et tahab oma unistust tagasi”
        c. that’s-why says he film at-end, that (he)-wants his-own dream

Sentences that include several error types are generally noticeably more difficult to correct. Depending on the error types that have been combined, our model manages quite well and corrects all or several of the errors present. Sentence (3) includes mistakes with word order and word choice: the argument "vabaainetele" (to elective courses) here should precede the verb and the verb "registreeruma" (register oneself) takes no such argument. Our model corrects both mistakes while also replacing the word "seejärel" (after that) with its synonym.

    (3) a. Seejärel pidi igaüks ennast registreeruma vabaainetele.
        b. then had-to everyone oneself register-oneself to-free-courses
        c. Siis pidi igaüks end vabaainetele registreerima.
        d. then had-to everyone oneself to-free-courses register

The model fixes typos, but it mainly manages cases where two letters are needed but one is written and vice versa, for example "detailled" is corrected to "detailed" and "planing" to "planning". More complicated mistakes are missed, especially if combined with other error types, and in some sentences a misspelled word is changed into an incorrect form that has a common ending, like "misundrestood" to "misundrested". The results get better if the input has been automatically spell-checked.

The system makes more changes than are strictly necessary and often replaces correct words and phrases, for example "frequently" was changed to "often" or in Estonian “öelda” ("say") to “avaldada” ("publish"). Sometimes this also confuses the meaning: "siblings" was changed to "friends".

To conclude this section, our model reliably corrects grammatical, spelling and word order errors, with more mixed performance on lexical choice errors and some unnecessary paraphrasing of the input. The error types that the model manages well can be traced back to having a strong monolingual language model, a common trait of a good NMT model. As the model operates on the level of word parts and its vocabulary is limited, this leads to combining wrong word parts, sometimes across languages. This could be fixed by either using character-based NMT or doing automatic spelling correction prior to applying our current model.

4 Style Transfer

Next we move on to evaluating the same model in terms of its performance in the context of style transfer.

Figure 2: Proportions of sentences with token-wise edit distance from the original when translated monolingually from and into different styles

First, we examined how often the sentences change when translated monolingually. The assumption is that passing modified style factors should prevent the model from simply copying the source sequences when translating inside a single language, and incentivize it to match its output to certain style characteristics typical for different corpora. Figure 2 shows the proportions of sentence pairs in the 1000-sentence test sets where there was a significant difference between translations into different styles. We can observe that English texts change less often than Estonian or Latvian, while Europarl sentences are changed more often than those of other corpora.
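
The measurement behind Figure 2 can be illustrated with a token-level Levenshtein distance. The sketch below is a minimal version under our own assumptions: tokens are obtained by whitespace splitting, and the threshold `min_distance` is a hypothetical parameter, since the text does not state what counts as a significant difference.

```python
def token_edit_distance(source, target):
    """Levenshtein distance computed over tokens instead of characters."""
    prev = list(range(len(target) + 1))
    for i, s_tok in enumerate(source, start=1):
        curr = [i]
        for j, t_tok in enumerate(target, start=1):
            cost = 0 if s_tok == t_tok else 1
            curr.append(min(prev[j] + 1,          # delete a source token
                            curr[j - 1] + 1,      # insert a target token
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]


def changed_proportion(pairs, min_distance=1):
    """Share of (source, output) pairs whose token edit distance reaches min_distance."""
    changed = sum(
        token_edit_distance(src.split(), out.split()) >= min_distance
        for src, out in pairs
    )
    return changed / len(pairs)


pairs = [("I will arrest you .", "I 'll arrest you ."), ("Yes .", "Yes .")]
print(changed_proportion(pairs))  # 0.5
```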

To assess whether these changes actually correspond to the model’s capability for transferring style, we turned to human evaluators.

4.1 Qualitative Analysis

Translation into informal style (OpenSubtitles)
I could not allow him to do that.      I couldn’t let him do that.
He will speak with Mr. Johns.          He’ll talk to Mr. Johns.
I will put you under arrest.           I’ll arrest you.
Translation into formal style (Europarl)
How come you think you’re so dumb?     Why do you think you are so stupid?
I’ve been trying to call.              I have tried to call.
Yeah, like I said.                     Yes, as I said.
Table 5: Examples of style transfer (left: input, right: model output)

We limit further comparisons to two styles, translating sentences of the OpenSubtitles test set into the style of Europarl and vice versa. Our assumption is that, generally, movie subtitles gravitate towards the more informal style, and parliament proceedings towards the more formal (see examples of translations into those styles in Table 5). Preliminary tests showed that JRC-Acquis and EMEA texts resulted in practically the same style as Europarl. We also leave Latvian out of the evaluations, assuming that its performance is weaker, similarly to GEC results.

Human evaluation was performed on a subset of 100 sentences, 50 of them selected randomly from the OpenSubtitles test set and the other 50 from Europarl. Each sentence was translated into the opposite style. The resulting pairs were presented to participants, who were asked the following questions about each of them: (1) Do the sentences differ in any way? (2) How fluent is the translated sentence? (On a scale of 1 to 4, where 1 is unreadable, and 4 is perfectly fluent); (3) How similar are the sentences in meaning? (With options "exactly the same", "the same with minor changes", "more or less the same", "completely different or nonsensical"); (4) Does the translated sentence sound more formal than the original, more informal, or neither? (5) What differences are there between the sentences? (E.g. grammatical, lexical, missing words or phrases, word order, contractions, the use of formal "you").

Two such surveys were conducted, one in English and one in Estonian. Three people participated in each of them, and all three evaluators were presented with the same set of examples.²

² All evaluators of Estonian are native or natively bilingual speakers of Estonian, while evaluators of English are proficient, but non-native speakers of English. The evaluators are six different people.

Inter-annotator Agreement

In evaluation of fluency, all three human evaluators gave the translated sentences the same score in 41 out of 55 cases in English (not taking into account sentences which were simply copied from their originals), and in 51 out of 68 cases in Estonian. In evaluation of direction of style transfer, all three evaluators agreed in 16 cases and at least two agreed in 43 cases in English, and in Estonian in 19 cases all three agreed and in 59 at least two.

Survey Results

Of the 100 translated sentences, 45 were marked by all participants as being the same as their original sentences in the English set and 32 in Estonian. The remaining 55 and 68, respectively, were used to quantify style transfer quality.

Being a reasonably strong MT system, our model scores quite high on fluency (3.84 for English, 3.64 for Estonian) and meaning preservation (3.67 for English, 3.35 for Estonian). For meaning preservation, the judgments were converted into a scale of 1-4, where 1 stands for completely different meanings or nonsensical sentences, and 4 for the exact same meaning.

We evaluated the style transfer itself in the following way. For each pair of sentences, the average score given by three evaluators was calculated, in which the answer that the translated sentence is more formal counts as +1, more informal as -1, and neither as 0. We calculated the root mean square error between these scores and desired values (+1 if we aimed to make the sentence more formal, -1 if more informal). RMSE of 0 would stand for always transferring style in the right direction as judged by all evaluators, and 2 for always transferring style in the wrong direction.
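
The scoring procedure described above can be written down compactly. The sketch below is our own illustration: the answer labels and the function name `style_rmse` are hypothetical, and the input is assumed to be one list of evaluator answers per sentence together with the desired direction (+1 for formal, -1 for informal).

```python
import math

# Mapping of survey answers to numeric votes, as described in the text.
VOTE = {"more formal": 1, "more informal": -1, "neither": 0}


def style_rmse(judgments, targets):
    """Root mean square error between averaged evaluator votes and desired directions.

    judgments: one list of answers (keys of VOTE) per sentence.
    targets:   +1 if the sentence should become more formal, -1 if more informal.
    """
    squared_errors = []
    for answers, target in zip(judgments, targets):
        mean_vote = sum(VOTE[a] for a in answers) / len(answers)
        squared_errors.append((mean_vote - target) ** 2)
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# All evaluators agree with the desired direction: RMSE of 0.
print(style_rmse([["more formal"] * 3], [1]))    # 0.0
# All evaluators see the opposite direction: RMSE of 2.
print(style_rmse([["more informal"] * 3], [1]))  # 2.0
```

A sentence on which all evaluators answer "neither" contributes an error of 1, which is why a score below 1 indicates that transfer in the right direction dominates.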

On the English set, the RMSE is 0.78, and on Estonian 0.89. These numbers show that style transfer generally happens in the right direction, but is not very strong. Of the 55 sentences in English that were different from their source sentences, in 33 cases the sign of the average human score matched the desired one, in 7 it did not, and in 15 no change in style was observed by humans. In Estonian 36 sentences showed the right direction of style transfer, 10 wrong, and 22 no change.

In English sentences where the direction of style transfer was found to be correct (Figure 3(a)), changes in use of contractions were reported in 19 cases (e.g. I have just been vs. I’ve just been), lexical changes in 15 cases (e.g. ’cause vs. because, or sure vs. certainly), grammatical in 13 (e.g. replacing no one’s gonna with no one will, or method of production with production method), missing or added words or phrases in 8 cases.

In Estonian correctly transferred sentences (Figure 3(b)), the most frequently reported were lexical substitutions (30 cases), followed by missing or added words or phrases (24 cases), changes in grammar (22 cases) and in word order (16 cases).

Figure 3: Number of sentences in which evaluators reported different types of changes: (a) English, (b) Estonian

To conclude this section, unlike many style transfer models which produce text with strong style characteristics (e.g. with strong positive or negative sentiment), often at the cost of preserving meaning and fluency, our model gravitates towards keeping the meaning and fluency of the original sentence intact and mimicking some of the desired stylistic traits.

4.2 Cross-lingual Style Transfer

Being able to translate between languages and also to modify the output to match the desired style allows the model to essentially perform domain adaptation. When translating from a language which has no formal "you" (English) into one that does (Estonian or Latvian), it will quite consistently use the informal variant when the target style is OpenSubtitles and the formal when the target style is Europarl (you rock → sa rokid/te rokite). The model is also quite consistent in use of contractions in English (es esmu šeit → I am here/I’m here). Some lexical substitutions occur: need on Matti lapsed. → those are Matt’s kids./these are Matt’s children. Word order may change: Where is Anna’s bag? becomes Kus on Anna kott? in the more formal variant, and Kus Anna kott on? in the more informal. This feature is useful, but out of scope of this article, as we focus on monolingual applications.

5 Related Work

Grammatical error correction: there have been four shared tasks for GEC with prepared error-tagged datasets for L2 learners of English in the last decade: HOO Dale and Kilgarriff (2011); Dale et al. (2012) and CoNLL Ng et al. (2013, 2014). This has given an opportunity to train new models on the shared datasets and get an objective comparison of results. The general approach for grammatical error correction has been to use either a rule-based approach, machine learning on error-tagged corpora, MT models on parallel data of erroneous and corrected sentences, or a combination of these Ng et al. (2014). The top model of the CoNLL shared task in 2014 used a combined model of rule-based approach and MT Felice et al. (2014). All of these require annotated data or considerable effort to create, whereas our model is much more resource-independent.

Another focus of the newer research is on creating GEC models without human-annotated resources. For example Rozovskaya and Roth (2016) combine statistical MT with unsupervised classification using unannotated parallel data for MT and unannotated native data for the classification model. In this case parallel data of erroneous and corrected sentences is still necessary for MT; the classifier uses native data, but still needs definitions of possible error types to classify. This work needs to be done by a human and is difficult for some less clear error types. In our approach there is no need for parallel data of erroneous and corrected sentences, nor for specifying error types; only regular translation corpora are needed.

There has been little work on Estonian and Latvian GEC, all limited to rule-based approaches Liin (2009); Deksne (2016). For both languages, as well as any low-resourced languages, our approach gives a feasible way to do grammatical error correction without needing either parallel or error-tagged corpora.

Style transfer: Several approaches use directly annotated data: for example, Xu et al. (2012) and Jhamtani et al. (2017) train MT systems on a corpus of modern-English Shakespeare paired with the original Shakespeare. Rao and Tetreault (2018) collect a dataset of 110K informal/formal sentence pairs and train rule-based, phrase-based, and neural MT systems using this data.

One line of work aims at learning a style-independent latent representation of content while building decoders that can generate sentences in the style of choice Fu et al. (2018); Hu et al. (2017); Shen et al. (2017b); Zhang et al. (2018a); Xu et al. (2018); John et al. (2018); Shen et al. (2017c); Yang et al. (2018). Unsupervised MT has also been adapted for the task Zhang et al. (2018b); Subramanian et al. (2018). Our system also does not require parallel data between styles, but leverages the stability of the off-the-shelf supervised NMT to avoid the hassle of training unsupervised NMT systems and making GANs converge.

Another problem with many current (both supervised and unsupervised) style transfer methods is that they are bound to solving a binary task, where only two styles are included (whether because of data or restrictions of the approach). Our method, on the other hand, can be extended to as many styles as needed as long as there are parallel MT corpora in these styles available.

Notably, Sennrich et al. (2016) use side constraints in order to translate into polite/impolite German, while we rely on multilingual encoder representations and use the system monolingually at inference time.

Finally, the most similar to our work conceptually is the approach of Prabhumoye et al. (2018), where they translate a sentence into another language, hoping that it will lose some style indicators, and then translate it back into the original language with a desired style tag attached to the encoder latent space. We also use the MT encoder to obtain rich sentence representations, but learn them directly as a part of a single multilingual translation system.

6 Conclusions

We presented a simple approach where a single multilingual NMT model is adapted to monolingual transfer and performs grammatical error correction and style transfer. We experimented with three languages and presented extensive evaluation of the model on both tasks. We used publicly available software and data and believe that our work can be easily reproduced.

We showed that for GEC our approach reliably corrects spelling, word order and grammatical errors, while being less reliable on lexical choice errors. Applied to style transfer our model is very good at meaning preservation and output fluency, while reliably transferring style for English contractions, lexical choice and grammatical constructions. The main benefit is that no annotated data is used to train the model, thus making it very easy to train it for other (especially under-resourced) languages.

Future work includes exploring adaptations of this approach to both tasks separately, while keeping the low cost of creating such models.


Appendix A Model Training: Technical Details

After rudimentary cleaning (removing pairs where at least one sentence is longer than 100 tokens, at least one sentence is an empty string or does not contain any alphabetic characters, and pairs with length ratio over 9) and duplication to accommodate both translation directions in each language pair, the total size of the training corpus is 22.9M sentence pairs; training set sizes per language pair and corpus are given in Table 6. The validation set consists of 12K sentence pairs, 500 for each combination of translation direction and corpus. We also keep a test set of 24K sentence pairs, 1000 for each translation direction and corpus.
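
The cleaning rules listed above can be expressed as a simple filter. This is a sketch under our own assumptions, not the script used for the experiments: sentences are tokenized by whitespace, the length ratio is taken as the longer sentence over the shorter one in tokens, and the function name `keep_pair` is hypothetical.

```python
def keep_pair(source, target, max_tokens=100, max_ratio=9.0):
    """Return True if a sentence pair passes the cleaning rules described above."""
    src_tokens, tgt_tokens = source.split(), target.split()
    for sentence, tokens in ((source, src_tokens), (target, tgt_tokens)):
        if not tokens:                                # empty (or whitespace-only) string
            return False
        if len(tokens) > max_tokens:                  # longer than 100 tokens
            return False
        if not any(ch.isalpha() for ch in sentence):  # no alphabetic characters
            return False
    longer = max(len(src_tokens), len(tgt_tokens))
    shorter = min(len(src_tokens), len(tgt_tokens))
    return longer / shorter <= max_ratio              # length ratio of at most 9


print(keep_pair("hello !", "sveiki !"))  # True
print(keep_pair("123 !", "sveiki !"))    # False: no alphabetic characters in the source
```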

EP 0.64M 0.63M 0.63M
JRC 0.68M 0.69M 1.5M
EMEA 0.91M 0.91M 0.92M
OS 3M 0.52M 0.41M
Table 6: Training set sizes (number of sentence pairs)
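
The figures in this appendix are consistent with each other, as the following arithmetic sketch shows. It assumes that the three columns of Table 6 are the three language pairs and that they list sizes before duplication into both translation directions.

```python
# Training set sizes from Table 6, in millions of sentence pairs
# (one value per language pair; assumed to be counted before duplication).
sizes = {"EP": [0.64, 0.63, 0.63], "JRC": [0.68, 0.69, 1.5],
         "EMEA": [0.91, 0.91, 0.92], "OS": [3.0, 0.52, 0.41]}

one_direction = sum(sum(row) for row in sizes.values())
both_directions = 2 * one_direction
print(round(both_directions, 1))    # 22.9 (million sentence pairs)

directions = 3 * 2                  # three language pairs, two directions each
corpora = len(sizes)                # four corpora
print(500 * directions * corpora)   # 12000 validation pairs
print(1000 * directions * corpora)  # 24000 test pairs
```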

The data preprocessing pipeline consists of tokenization with Moses tokenizer Koehn et al. (2007), true-casing, and segmentation with SentencePiece Kudo and Richardson (2018) with a joint vocabulary of size 32 000.

We trained a Transformer NMT model using the Sockeye framework Hieber et al. (2017), mostly following the so-called Transformer base model: we used 6 layers, 512 positions, 8 attention heads and ReLU activations for both the encoder and decoder; Adam optimizer was used. Source and target token embeddings were both of size 512, and factors determining target language and style had embeddings of size 4. Batch size was set to 2048 words, initial learning rate to 0.0002, reducing by a factor of 0.7 every time the validation perplexity had not improved for 8 checkpoints, which happened every 4000 updates. The model converged during the 17th epoch, when validation perplexity had not improved for 32 consecutive checkpoints. The parameters of a single best checkpoint were used for all translations, with beam size set to 5.
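
The learning rate schedule and stopping criterion can be illustrated by replaying them over a list of validation perplexities. The sketch below is a simplified model of the behaviour described above and not the framework's own scheduler: it assumes that "reducing by a factor of 0.7" means multiplying the rate by 0.7, and that a reduction is triggered after every 8 consecutive checkpoints without improvement.

```python
def lr_schedule(perplexities, initial_lr=0.0002, factor=0.7, patience=8, stop_after=32):
    """Replay a plateau-based schedule over validation perplexities, one per checkpoint.

    Returns the learning rate in effect after each checkpoint; stops once the
    perplexity has not improved for `stop_after` consecutive checkpoints.
    """
    lr, best, stale = initial_lr, float("inf"), 0
    history = []
    for ppl in perplexities:
        if ppl < best:
            best, stale = ppl, 0      # improvement: reset the counter
        else:
            stale += 1
            if stale % patience == 0:
                lr *= factor          # plateau: reduce the learning rate
        history.append(lr)
        if stale >= stop_after:
            break                     # considered converged
    return history


# One improving checkpoint followed by a long plateau.
history = lr_schedule([10.0] + [11.0] * 40)
print(len(history))  # 33: training stops 32 checkpoints after the best one
```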

Appendix B Output Examples

We present more examples of translation of code-switched input segments, error correction and style transfer in English, Estonian and Latvian, informal (inf) and formal (fml) output style:

ET→ET (fml)   Mida sa teed?                                Mida te teete?
ET→ET (fml)   Milleks tulid?                               Miks te tulite?
EN→EN (inf)   I will reimburse you.                        I’ll pay you back.
EN→EN (inf)   That is correct.                             That’s right.
EN→EN (fml)   It’s a pretty important part of the deal.    It is a fairly important part of the deal.
EN→EN (fml)   He big boss, he make much money.             He big boss, he makes a lot of money.
ET→ET (fml)   Ta olen suured poisi.                        Ta on suur poiss.
LV→LV (fml)   Mums esi grūts.                              Mums ir grūti.
EN→ET (inf)   You are bad                                  Sa oled paha!
EN→ET (fml)   You are bad                                  Te olete pahad!
ET→LV (inf)   Sinu plaan on jama!                          Tavs plāns ir stulbs!
ET→LV (fml)   Sinu plaan on jama!                          Jūsu plāns ir nejēdzīgs!
?→EN (inf)    Kes you esi?                                 Who are you?
?→LV (inf)    Kes you esi?                                 Kas tu esi?
?→ET (inf)    Kes you esi?                                 Kes sa oled?