Detecting Machine-Translated Text using Back Translation

10/15/2019 ∙ by Hoang-Quoc Nguyen-Son, et al.

Machine-translated text plays a crucial role in communication between people who use different languages. However, adversaries can use such text for malicious purposes such as plagiarism and fake reviews. Existing methods detect machine-translated text using only the text's intrinsic content, which makes them unsuitable for distinguishing machine-translated from human-written texts that have the same meaning. We propose a method that extracts features for distinguishing machine from human text based on the similarity between the input text and its back-translation. An evaluation on detecting sentences translated from French shows that our method achieves 75.0% in both accuracy and F-score. It outperforms the existing methods, whose best accuracy is 62.8% and best F-score is 62.7%. The proposed method detects back-translated text even more effectively, with 83.4% accuracy compared with 66.7% for the best previous accuracy. We achieve similar results for F-score and in similar experiments with Japanese. Moreover, we show that our detector can recognize both machine-translated and machine-back-translated texts without knowing the language that was used to generate them. This demonstrates the robustness of our method across applications in both low- and rich-resource languages.




1 Introduction

Cross-language communication plays an important role in modern life. It opens up opportunities in fields such as entertainment, e-commerce, and careers. In this communication, a machine translator is an essential component. A translator can also support interactions among machines and between humans and machines. For example, a new AI system can be built from mature systems that operate in another language. As another example, cutting-edge smart devices such as Apple Siri and Google Home already support multiple languages via translators.

However, the main problem with translation is that it can lead to misunderstanding because of the diversity of language use, such as slang, idioms, and dialects. A second problem is that adversaries can use translators to generate paraphrased texts for malicious purposes, for example, plagiarism Jones and Sheridan (2015) and style transfer Prabhumoye et al. (2018). Spreading such artificial texts can seriously damage the reputation of the original, human-written texts. It is therefore crucial to develop a detector that determines whether a text is written by a human or generated by a translator.

Many researchers have been interested in detecting machine-translated text. The most common methods use the n-gram model Aharoni et al. (2014); Arase and Zhou (2013); Nguyen-Son and Echizen (2017) to measure the fluency of text. Alternatively, the structure of the parsing tree is exploited to recognize machine-generated texts Li et al. (2015). Moreover, differences in word usage between human and machine texts lead to differences in their word distributions Nguyen-Son et al. (2017). Other researchers show that human-written text is more coherent than machine-translated text Nguyen-Son et al. (2018, 2019). Beyond machine-translated text, artificial fake reviews and papers are recognized by readability Juuti et al. (2018) and duplicate patterns Labbé and Labbé (2013), respectively. The limitation of all the existing methods above is that they analyze only the intrinsic content of machine-generated texts and ignore the processes that were used to produce them.

Our idea is based on the observation that processing original data often produces more variation than processing data that has already been modified. For example, in image processing, histogram equalization changes an original image much more than it changes an image that has already been equalized. A similar phenomenon occurs in text. To illustrate, we take a random original sentence from the European parallel corpus, as shown in Figure 1. The original sentence is translated to French and then re-translated to English to create its first back-translation. The second and third back-translations are generated in the same manner by repeating the process. The variation between a back-translation and its origin is highlighted in bold for word usage and underlined for structure. The back-translations reach saturation at the third one, which shows no change. The first back-translation has the largest variation, with seven positions in word usage and three positions in structure. The variation drops sharply in the second back-translation, with only one position in word usage, and disappears in the third. The example demonstrates that earlier generations exhibit more variation than later ones.

Figure 1: Variations produced by repeated back-translation.

We check this finding on machine-translated text detection. In particular, we take the English-French pair from the parallel corpus whose English sentence is analyzed above. The English sentence is considered the human-written sentence, and the French sentence is translated to English by Google to generate a machine sentence, as shown in Figure 2. We then generate a back-translation of each of the two sentences using French as the intermediate language. The back-translation of the human sentence has passed through the translator twice from its origin, whereas the back-translation of the machine sentence has passed through it three times from the French origin. As a result, the back-translation of the human sentence differs more from its input in word usage than the back-translation of the machine sentence does. Moreover, the structure is slightly changed in the human case but preserved in the machine case. This demonstrates that the differences introduced by back-translation can be used to distinguish human-written from machine-translated text.

Figure 2: Human-written vs machine-translated text.

In this paper, we propose a method that uses back-translation to detect machine-translated text. Our contributions are as follows:

  • We explore how a text varies when it is repeatedly back-translated by the same translator. In particular, the text becomes invariant after a certain number of back-translations. Moreover, earlier back-translations produce larger variations than later ones.

  • We measure the variation by calculating the similarity between the text and its back-translation using BLEU scores.

  • We suggest using a classifier with these scores to determine whether the text is translated by a machine or written by a human.

We randomly selected 2000 English-French sentence pairs from the European corpus for evaluation. The English sentences were considered human-written text, and the French sentences were translated to English using Google to represent machine-translated text. Our method achieves 75.0% in both accuracy and F-score. It outperforms the previous methods, whose best accuracy is 62.8% and best F-score is 62.7%. A similar experiment was conducted on back-translation detection. More specifically, we randomly chose 2000 sentiment sentences, 1000 positive and 1000 negative, from a Stanford Treebank corpus. We then generated the machine back-translated text using French as the intermediate language. Our method reaches 83.4% in both accuracy and F-score, which is better than the best previous accuracy and F-score of 66.7% and 63.7%, respectively. We conducted further experiments with Japanese and reached similar results. This demonstrates the robustness of the proposed method across tasks in both low- and rich-resource languages.

The rest of the paper is organized as follows. Section 2 describes the main previous methods for detecting machine-translated and other machine-generated texts. The proposed method is presented in Section 3. The experimental results are shown in Section 4. Finally, we summarize the key points and mention future work in Section 5.

2 Related Work

2.1 Machine Translation Detection

The previous methods for detecting machine-translated text can be split into four groups.

n-gram model

This model is commonly used to estimate the fluency of continuous words. Researchers have suggested additional features to support the original model. For example, Arase and Zhou (2013) estimated the fluency of non-continuous words by sequential pattern mining. They can extract fluent human patterns (e.g., “not only * but also” and “more * than”) as opposed to weird machine patterns (e.g., “after * after the” and “and also * and”). Aharoni et al. (2014) combined the POS n-gram model with function words, which occur abundantly in machine-translated text. Nguyen-Son and Echizen (2017) integrated the word n-gram model with noise features for detecting translation in online social networking (OSN) messages. Such features often occur in human messages, such as misspellings and spoken words, or in machine messages, for example, untranslated words. However, such noise appears much more frequently in OSN messages than in other kinds of text.

Parsing tree

Li et al. (2015) used the syntactic parsing tree to classify human and machine sentences. They claim that the parse structure of a human sentence is more balanced than that of a machine sentence. They thus extracted balance-based features such as the ratio between left and right nodes in both general and main constituents. The limitation of this approach is that it ignores the semantic meaning of the text.

Word distribution

Word usage in human text often follows Zipf's law, under which the most frequent word occurs about twice as often as the second most frequent word, three times as often as the third, and so on. Nguyen-Son et al. (2017) use this law to detect machine-translated documents. Furthermore, they extracted phrases characteristic of human text, including idioms, clichés, archaic phrases, and dialect phrases. They also estimated the relationships among certain phrases based on co-reference resolution. These features only work well on a large text, in which the word distribution is more stable and the additional features appear more often.
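The rank-frequency relation described above can be made concrete with a small sketch. It only illustrates Zipf's law itself and is not the feature extraction of Nguyen-Son et al. (2017); the function names `rank_frequency` and `zipf_expected` are hypothetical.

```python
from collections import Counter
from typing import List, Sequence


def rank_frequency(tokens: Sequence[str]) -> List[int]:
    """Return word frequencies ordered from the most to the least frequent word."""
    return [count for _, count in Counter(tokens).most_common()]


def zipf_expected(top_frequency: float, ranks: int) -> List[float]:
    """Frequencies an ideal Zipfian text would show: top_frequency / rank."""
    return [top_frequency / rank for rank in range(1, ranks + 1)]


# Toy text in which the top word is twice as frequent as the second
# and three times as frequent as the third.
tokens = ["the"] * 6 + ["of"] * 3 + ["and"] * 2
observed = rank_frequency(tokens)
expected = zipf_expected(observed[0], len(observed))
```

Comparing `observed` with `expected` on a short sentence shows why the law is hard to apply at sentence level: with only a few dozen tokens, most words occur once and the curve is flat.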


Coherence

Although machine-translated text can preserve meaning, its coherence is still low. Some researchers have measured coherence to distinguish machine text from human text. For example, Nguyen-Son et al. (2018) matched similar words between two sentences in a paragraph. The similarity between two matched words is used to estimate the coherence. In another work, Nguyen-Son et al. (2019) broadened the matching to any words in the paragraph, both within and across sentences. However, coherence is strong within a paragraph but weaker at other levels such as the sentence and the document.

2.2 Other Machine-Generated Text Detection

Many other machine-generated texts serve malicious purposes such as paper generation and fake reviews. Labbé and Labbé (2013) show that artificial papers contain abundant duplicated words and phrases. They therefore suggested an inter-textual distance to estimate the similarity between two word distributions and used the distance to recognize machine-generated text. In fake review detection, Juuti et al. (2018) extracted features from thirteen readability metrics. Moreover, they used n-gram models for various text components including words, simple POS, detailed POS, and syntactic dependencies. The shared use of word distributions and n-gram models indicates a close relationship between detecting machine-translated text and detecting other machine-generated text.

3 Proposed Method

The proposed method consists of three steps, as shown in Figure 3.

Figure 3: The proposed scheme for detecting machine-translated text.

  • Step 1 (Generate back-translation): Google Translate is used to generate the back-translation of the input text.

  • Step 2 (Calculate similarity): The similarity between the input text and its back-translation is measured on the basis of BLEU scores.

  • Step 3 (Classify the input): The similarity features are used to determine whether the input text is written by a human or generated by a machine.

The following subsections describe each step of the proposed method.

3.1 Generating Back-Translation (Step 1)

The input text in the original language is translated into an intermediate language, which is different from the original one. The translated version is then translated back to the original language. The final translation is called the back-translation. In this paper, we use Google as the translator. In Figure 2, the back-translations of the human text and the machine text are both generated with French as the intermediate language.
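The step above can be sketched as follows, under stated assumptions. The paper uses Google as the translator; here the translator is any callable with the hypothetical signature `(text, source, target) -> text`, and `toy_translate` is a made-up, lossy word table used only so the sketch runs without a network service. The helper `back_translation_chain` additionally illustrates the saturation effect from the introduction and is not part of the detection pipeline itself.

```python
from typing import Callable, List

# Hypothetical translator interface: (text, source_language, target_language) -> text.
Translator = Callable[[str, str, str], str]


def back_translate(text: str, translate: Translator, source: str, pivot: str) -> str:
    """Translate text into the pivot language and back into the source language."""
    intermediate = translate(text, source, pivot)
    return translate(intermediate, pivot, source)


def back_translation_chain(text: str, translate: Translator, source: str,
                           pivot: str, max_rounds: int = 5) -> List[str]:
    """Back-translate repeatedly and stop once the text no longer changes."""
    chain = [text]
    for _ in range(max_rounds):
        next_text = back_translate(chain[-1], translate, source, pivot)
        if next_text == chain[-1]:
            break  # saturation: another round produces no variation
        chain.append(next_text)
    return chain


def toy_translate(text: str, source: str, target: str) -> str:
    """Toy stand-in for a real translator: a lossy word-level lookup table."""
    table = {
        ("en", "fr"): {"big": "grand", "large": "grand",
                       "house": "maison", "home": "maison"},
        ("fr", "en"): {"grand": "large", "maison": "house"},
    }
    mapping = table.get((source, target), {})
    return " ".join(mapping.get(word, word) for word in text.split())


chain = back_translation_chain("the big home", toy_translate, "en", "fr")
```

In this toy setting the first back-translation changes two words and the second changes nothing, which mirrors the pattern of large early variation followed by saturation.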

Figure 4 shows an example of back-translation detection. In particular, we use Japanese as the intermediate language to generate the machine text from the original human text. To distinguish the two input texts, we create their back-translations with Chinese as the intermediate language. As in Figure 2, we highlight the variations between the input texts and their back-translations in bold for word usage and with underlining for structure. Although the generator and the detector use different languages, the back-translation of the human text still shows more variation than that of the machine text. Again, the text that has been translated four times changes less than the text that has been translated twice.

Figure 4: Human vs machine text in back-translation detection.

3.2 Calculating Similarity (Step 2)

This step estimates the similarity between the input and its back-translation. Because of its close connection to the evaluation of machine-translated text, we use BLEU scores Papineni et al. (2002) for this step. There are two groups of BLEU scores: individual n-gram scores and cumulative n-gram scores. While the individual scores evaluate phrases of each length independently, the cumulative scores combine the measurements of phrases of various lengths. Because the individual uni-gram score equals the cumulative uni-gram score, we use only one of them. The BLEU scores for both translation and back-translation detection are listed in Figure 5. The first four values are the individual n-gram scores with n from 1 to 4; the remaining values are the cumulative n-gram scores with n from 2 to 4.
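The seven features can be sketched in plain Python as follows. This is a minimal sketch and not the authors' implementation: it assumes whitespace tokenization, treats the input text as the BLEU reference and the back-translation as the candidate, applies the brevity penalty to every score, and uses no smoothing, none of which is specified in the text. The function names are hypothetical.

```python
import math
from collections import Counter
from typing import List, Sequence


def ngram_counts(tokens: Sequence[str], n: int) -> Counter:
    """Count the n-grams of a token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(reference: Sequence[str], candidate: Sequence[str], n: int) -> float:
    """Clipped n-gram precision of the candidate against one reference."""
    cand = ngram_counts(candidate, n)
    total = sum(cand.values())
    if total == 0:
        return 0.0
    ref = ngram_counts(reference, n)
    clipped = sum(min(count, ref[gram]) for gram, count in cand.items())
    return clipped / total


def brevity_penalty(reference: Sequence[str], candidate: Sequence[str]) -> float:
    """Penalize candidates that are shorter than the reference."""
    if len(candidate) == 0:
        return 0.0
    if len(candidate) >= len(reference):
        return 1.0
    return math.exp(1.0 - len(reference) / len(candidate))


def bleu_features(text: str, back_translation: str) -> List[float]:
    """Individual 1- to 4-gram scores followed by cumulative 2- to 4-gram scores."""
    reference = text.split()          # assumption: input text is the reference
    candidate = back_translation.split()
    bp = brevity_penalty(reference, candidate)
    precisions = [modified_precision(reference, candidate, n) for n in range(1, 5)]
    individual = [bp * p for p in precisions]
    cumulative = []
    for n in range(2, 5):
        head = precisions[:n]
        if min(head) == 0.0:
            cumulative.append(0.0)    # geometric mean is zero without smoothing
        else:
            cumulative.append(bp * math.exp(sum(math.log(p) for p in head) / n))
    return individual + cumulative


features = bleu_features("a b c d", "a b x d")
```

For sentence-level text, the higher-order scores are often zero without smoothing; a smoothed BLEU variant may be preferable in practice, but which variant the paper used is not stated.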

Figure 5: BLEU scores of the human and machine texts with their back-translations.

The results show that the BLEU scores between machine texts and their back-translations are all higher than those between human texts and their back-translations, in both translation and back-translation detection. This demonstrates that the more times a translator has been applied, the higher the similarity becomes. This information can be used to distinguish human text from machine text.

3.3 Classifying the Input (Step 3)

The seven BLEU scores extracted in the previous step are fed to a classifier to determine whether the input text is translated by a machine or written by a human. We examine the four best classifiers from previous work: linear classification Fan et al. (2008), adaptive boosting, support vector machine (SVM) optimized by sequential minimal optimization, and SVM optimized by stochastic gradient descent. All of the classifiers achieve nearly the same results, so any of them can be used for this step.
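As an illustration of this step, the sketch below trains a plain logistic regression with stochastic gradient descent on seven-dimensional feature vectors. It is a stand-in written in standard-library Python, not one of the four classifiers used in the paper, and the training data are made-up numbers chosen so that machine inputs have higher similarity scores than human inputs. The names `train_logistic` and `predict` are hypothetical.

```python
import math
import random
from typing import List, Sequence, Tuple


def sigmoid(z: float) -> float:
    """Logistic function with the input clamped to avoid overflow."""
    z = max(-30.0, min(30.0, z))
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(features: Sequence[Sequence[float]], labels: Sequence[int],
                   epochs: int = 200, lr: float = 0.5,
                   seed: int = 0) -> Tuple[List[float], float]:
    """Fit weights and bias by stochastic gradient descent on the log loss."""
    rng = random.Random(seed)
    weights = [0.0] * len(features[0])
    bias = 0.0
    order = list(range(len(features)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            z = bias + sum(w * x for w, x in zip(weights, features[i]))
            error = sigmoid(z) - labels[i]
            weights = [w - lr * error * x for w, x in zip(weights, features[i])]
            bias -= lr * error
    return weights, bias


def predict(weights: Sequence[float], bias: float, x: Sequence[float]) -> int:
    """Return 1 for machine text and 0 for human text."""
    z = bias + sum(w * v for w, v in zip(weights, x))
    return 1 if z >= 0.0 else 0


# Made-up feature vectors: label 1 = machine (high BLEU), label 0 = human (low BLEU).
machine = [[0.80] * 7, [0.85] * 7, [0.90] * 7]
human = [[0.20] * 7, [0.25] * 7, [0.30] * 7]
weights, bias = train_logistic(machine + human, [1, 1, 1, 0, 0, 0])
```

Because the paper reports nearly identical results across its four classifiers, a simple linear model is a reasonable first choice for experimentation, though real BLEU features overlap far more than this toy data.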

4 Evaluation

4.1 Translation Detection

4.1.1 Dataset

We randomly selected 2000 English-French sentence pairs from the European parallel corpus. The English sentences were used as human-written texts, and the French sentences were translated to English by Google to produce machine texts. The 4000 sentences were merged; the combined dataset contains 26.2 words per sentence on average. The dataset was then split into two parts: 2800 sentences for the training set and the remaining 1200 for the test set. To keep human and machine texts balanced in each set, we placed each human sentence and its corresponding machine sentence in the same set.
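The pair-preserving split described above can be sketched as follows. The 70/30 proportion follows from the 2800 and 1200 sentence counts; the random seed, the label encoding (0 for human, 1 for machine), and the function name `split_pairs` are assumptions made for illustration.

```python
import random
from typing import List, Sequence, Tuple

Pair = Tuple[str, str]       # (human sentence, corresponding machine sentence)
Example = Tuple[str, int]    # (sentence, label): 0 = human, 1 = machine


def flatten(pairs: Sequence[Pair]) -> List[Example]:
    """Turn sentence pairs into labeled examples, keeping both halves together."""
    examples: List[Example] = []
    for human, machine in pairs:
        examples.append((human, 0))
        examples.append((machine, 1))
    return examples


def split_pairs(pairs: Sequence[Pair], train_fraction: float = 0.7,
                seed: int = 0) -> Tuple[List[Example], List[Example]]:
    """Shuffle at the pair level so a human sentence and its machine
    counterpart always land in the same set."""
    rng = random.Random(seed)
    shuffled = list(pairs)
    rng.shuffle(shuffled)
    cut = int(round(len(shuffled) * train_fraction))
    return flatten(shuffled[:cut]), flatten(shuffled[cut:])


# Placeholder sentences standing in for the 2000 corpus pairs.
pairs = [(f"h{i}", f"m{i}") for i in range(2000)]
train, test = split_pairs(pairs)
```

Splitting at the pair level keeps each set exactly half human and half machine, and it prevents a sentence in the test set from having its same-meaning counterpart in the training set.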

4.1.2 Comparison

We evaluated previous methods for detecting machine-translated texts and machine-generated reviews on this dataset. The training set was used to train four machine learning classifiers, which were chosen as the best classifiers in the previous methods. The top classifier mentioned in each method is underlined in Table 1. The classifiers are linear classification (LINEAR) Fan et al. (2008), adaptive boosting (ADABOOST), support vector machine optimized by sequential minimal optimization SVM(SMO), and SVM optimized by stochastic gradient descent SVM(SGD). Two standard metrics are used to evaluate the classifiers, accuracy (ACC) and F-score (F1), whose best values are highlighted in bold. The last two columns give the averages over classifiers, and the last two rows show the results of our detectors. The first detector uses Spanish to extract back-translation information, which is different from the language of the generator. The second detector uses the same language as the generator, i.e., French.
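The two metrics can be computed as in the sketch below. The text does not say how the F-score is averaged over the human and machine classes; the unweighted (macro) average used here is an assumption, and the function names are hypothetical.

```python
from typing import Sequence


def accuracy(gold: Sequence[int], predicted: Sequence[int]) -> float:
    """Fraction of examples whose predicted label matches the gold label."""
    correct = sum(1 for g, p in zip(gold, predicted) if g == p)
    return correct / len(gold)


def f1_for_class(gold: Sequence[int], predicted: Sequence[int], positive: int) -> float:
    """Harmonic mean of precision and recall for one class."""
    tp = sum(1 for g, p in zip(gold, predicted) if g == positive and p == positive)
    fp = sum(1 for g, p in zip(gold, predicted) if g != positive and p == positive)
    fn = sum(1 for g, p in zip(gold, predicted) if g == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(gold: Sequence[int], predicted: Sequence[int]) -> float:
    """Average of the per-class F1 over human (0) and machine (1)."""
    return (f1_for_class(gold, predicted, 0) + f1_for_class(gold, predicted, 1)) / 2


# Tiny worked example: one machine sentence is misclassified as human.
gold = [1, 1, 0, 0]
predicted = [1, 0, 0, 0]
acc = accuracy(gold, predicted)
f1 = macro_f1(gold, predicted)
```

On a balanced dataset such as this one, accuracy and an averaged F-score stay close when errors are spread evenly over both classes and diverge when a classifier favors one class.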


Method                                            LINEAR         ADABOOST       SVM(SMO)       SVM(SGD)       Average
                                                  ACC    F1      ACC    F1      ACC    F1      ACC    F1      ACC    F1
Word distribution, Nguyen-Son et al. (2017)       54.8%  54.8%   53.8%  53.0%   52.8%  52.6%   52.0%  51.2%   53.4%  52.9%
Coherence, Nguyen-Son et al. (2019)               55.0%  53.3%   53.8%  41.6%   56.9%  56.8%   50.2%  50.0%   54.0%  50.4%
Parsing tree, Li et al. (2015)                    55.2%  55.0%   53.7%  52.4%   54.8%  54.2%   54.2%  54.2%   54.5%  54.0%
n-gram & functional words, Aharoni et al. (2014)  56.9%  56.9%   56.0%  55.2%   58.2%  58.2%   50.3%  50.2%   55.3%  55.1%
n-gram & readability, Juuti et al. (2018)         62.8%  62.7%   52.1%  37.8%   61.5%  61.5%   55.1%  55.1%   57.9%  54.3%
Ours, using Spanish                               64.8%  64.7%   65.6%  65.5%   64.1%  64.1%   63.8%  63.8%   64.6%  64.5%
Ours, using French                                73.9%  73.9%   72.8%  72.7%   74.1%  74.1%   75.0%  75.0%   73.9%  73.9%

Table 1: Comparison with other methods on machine-translation detection.

Surprisingly, both the accuracy and the F-score of most previous methods are close to random guessing, which is around 50%. This indicates that the balanced pairs in the dataset, in which human and machine texts have mostly the same meanings and word usage, confuse the existing methods. For each method, the best classifier here is the same as, or comparable to, the top classifier reported in the corresponding previous work. The largest discrepancy is for the method of Juuti et al. (2018), because it originally targeted another kind of machine-generated text, namely fake reviews. Among the previous methods, the one based on word distribution Nguyen-Son et al. (2017) has the lowest results. This indicates that the limited number of words within a sentence is insufficient to form a stable distribution. The coherence-based method Nguyen-Son et al. (2019) is designed for the paragraph level rather than the sentence level. The method based on the parsing tree Li et al. (2015) slightly improves the outcome, but the structures of human and machine pairs appear to be similar in this balanced dataset. The most reasonable methods Aharoni et al. (2014); Juuti et al. (2018) use the n-gram model to measure text fluency. However, their performance is unstable across classifiers, especially for Juuti et al. (2018). This indicates that current neural translators have improved to the point where internal text information alone is insufficient to recognize machine-translated text.

Our method uses additional information from back-translation, which improves the overall performance. Even when it uses a language different from that of the machine generator, the Spanish-based detector achieves higher performance than the previous methods with all classifiers. This demonstrates that our method can detect machine-translated text without knowing the translation language. Moreover, using the same language as the generator gives the best performance in both accuracy and F-score. Our method also gives more balanced results, not only between accuracy and F-score but also among classifiers.

4.2 Back-translation Detection

4.2.1 Dataset

We check the capability of the proposed method on another task, namely back-translation detection. Back-translation can easily be used to generate paraphrased texts for malicious purposes such as fake reviews or political posts. The generation needs only an original text, which is back-translated through various languages to produce many paraphrased versions. To simulate this scenario, we randomly picked 2000 sentiment sentences from the Stanford Treebank corpus. Half of them are positive and the other half are negative. We then generated the back-translated texts, which are considered machine sentences, using French as the intermediate language. The machine sentences were combined with the original ones into a 4000-sentence dataset that has 17.3 words per sentence on average. The average length is clearly shorter than in the translation dataset above because of short sentences such as “Imperfect?” and “Cool.” The dataset was also split into training and test sets, with human and machine sentences balanced in the same manner as in Section 4.1.1.

4.2.2 Comparison

We conducted similar experiments with the previous methods on this back-translation dataset. To evaluate our detectors, we again use two languages, Spanish and French. The first is different from the generator's language while the second is the same. The results are shown in Table 2.


Method                                            LINEAR         ADABOOST       SVM(SMO)       SVM(SGD)       Average
                                                  ACC    F1      ACC    F1      ACC    F1      ACC    F1      ACC    F1
Word distribution, Nguyen-Son et al. (2017)       53.8%  53.5%   54.3%  53.8%   53.6%  52.8%   51.4%  50.4%   53.3%  52.6%
Coherence, Nguyen-Son et al. (2019)               51.2%  37.1%   58.3%  57.4%   59.4%  59.4%   49.3%  49.3%   54.5%  50.8%
Parsing tree, Li et al. (2015)                    56.2%  56.2%   54.8%  54.8%   56.6%  56.6%   54.8%  54.1%   55.6%  55.4%
n-gram & functional words, Aharoni et al. (2014)  57.7%  57.7%   55.9%  52.5%   58.3%  58.3%   45.4%  45.3%   54.3%  53.5%
n-gram & readability, Juuti et al. (2018)         61.9%  57.5%   66.7%  62.5%   63.8%  63.7%   56.3%  56.3%   62.2%  60.0%
Ours, using Spanish                               70.7%  70.7%   70.6%  70.6%   71.0%  71.0%   68.3%  68.3%   70.1%  70.1%
Ours, using French                                83.0%  83.0%   83.1%  83.0%   83.1%  83.0%   83.4%  83.4%   83.1%  83.1%

Table 2: Back-translation detection with rich-resource language.

The performance of most previous methods increases slightly. The main reason is that the back-translated machine texts are generated by applying the translator twice. Their quality is therefore lower, and they are easier to distinguish. The largest change is for Juuti et al. (2018), where ADABOOST now reaches the best accuracy, in contrast with the result on the previous task in Table 1. Moreover, the best F-score is obtained by a different classifier, SVM(SMO). The differences among its classifiers are also notable, which exposes the inconsistency of this method in detecting various kinds of translated text. In contrast, our methods achieve the highest performance with both Spanish and French. The improvements are even larger than in the previous task. In this task, the back-translations of machine texts are created after applying the translator four times, compared with only three in the previous task, so we can exploit more differences relative to the back-translations of human texts, which are created after the same two applications. The proposed method again gives highly consistent results among classifiers. Furthermore, the accuracy and F-scores are almost identical. This shows the robustness of our method across tasks, even without language information from the generator.

4.2.3 Low-resource language


Method                                            LINEAR         ADABOOST       SVM(SMO)       SVM(SGD)       Average
                                                  ACC    F1      ACC    F1      ACC    F1      ACC    F1      ACC    F1
Word distribution, Nguyen-Son et al. (2017)       52.4%  52.3%   52.4%  51.4%   50.8%  49.9%   51.3%  50.7%   51.7%  51.1%
Coherence, Nguyen-Son et al. (2019)               51.5%  39.1%   58.8%  58.8%   58.8%  58.8%   53.1%  52.8%   55.6%  52.4%
Parsing tree, Li et al. (2015)                    57.8%  57.7%   57.6%  57.6%   58.4%  58.4%   54.6%  54.5%   57.1%  57.1%
n-gram & functional words, Aharoni et al. (2014)  55.6%  55.6%   53.5%  43.0%   56.9%  56.9%   47.4%  47.4%   53.4%  50.7%
n-gram & readability, Juuti et al. (2018)         66.1%  65.6%   55.8%  45.0%   65.6%  65.5%   58.0%  58.0%   61.4%  58.5%
Ours, using Chinese                               65.2%  65.2%   64.6%  64.2%   63.9%  63.9%   63.3%  63.3%   64.2%  64.1%
Ours, using Japanese                              80.6%  80.6%   78.9%  78.5%   79.8%  79.7%   78.7%  78.6%   79.5%  79.3%

Table 3: Back-translation detection with low-resource language.

We ran similar experiments with low-resource languages. With the same dataset of 2000 human sentiment sentences, we chose Japanese for generating machine back-translated texts. For the detectors, we use two intermediate languages for generating back-translations: Chinese, which is different from the generator's language, and Japanese, which is the same. The results of the comparison with other methods are listed in Table 3.

For the previous methods, the results are quite similar to those for detecting back-translation with the rich-resource language. The largest difference again lies in the method of Juuti et al. (2018). The best performance returns to LINEAR, while ADABOOST drops to the lowest. This indicates the inconsistency of this method not only across tasks but also on the same task with languages of different resource levels. Comparing Figure 2 and Figure 4, translated texts from the rich-resource language still have higher quality than those from the low-resource language. This affects the back-translation information used in our method. With Chinese in particular, our detector is slightly lower than the previous work with some classifiers, but its stability still shows in the average, which is better than that of the state-of-the-art methods. Moreover, the detector using Japanese outperforms the previous methods with all classifiers, with improvements of 14.5% in accuracy and 15.0% in F-score.

5 Conclusion

In this paper, we have shown that when a machine translator is applied repeatedly, the translated text converges. Moreover, the variation between two consecutive applications becomes smaller. We then proposed a method that estimates the variation using BLEU scores and uses them to detect two types of machine-generated text: machine translation and machine back-translation. In machine translation detection, an evaluation on sentences translated from French with the best classifiers mentioned in previous methods shows that our method detects translated text with 75.0% in both accuracy and F-score. It outperforms the previous methods, whose best accuracy is 62.8% and best F-score is 62.7%. In back-translated text detection, the improvement is even larger: from 66.7% to 83.4% in accuracy and from 63.7% to 83.4% in F-score. The experiments on a low-resource language, i.e., Japanese, achieve similar results. Moreover, we conducted similar experiments in which the generators and detectors use different languages. Although the performance is lower than in the same-language experiments, our detectors are still better than the existing work with all classifiers for rich-resource languages and better on average over the classifiers for low-resource languages. This demonstrates that our detectors work well even without language information about the generators.

In future work, we will investigate the effect of our findings on various translators, such as neural-network-based and phrase-based translators. Moreover, we will verify the use of the same translator trained on different corpora. We will also analyze other machine generators for detecting other malicious texts such as adversarial texts and artificial fake news. Beyond text, we will consider applications of our hypothesis to detecting other machine-generated data (e.g., image, video, sound, and structured data).


References

  • R. Aharoni, M. Koppel, and Y. Goldberg (2014) Automatic detection of machine translated text and translation quality estimation. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (ACL), pp. 289–295.
  • Y. Arase and M. Zhou (2013) Machine translation detection from monolingual web-text. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (ACL), pp. 1597–1607.
  • R. Fan, K. Chang, C. Hsieh, X. Wang, and C. Lin (2008) LIBLINEAR: a library for large linear classification. Journal of Machine Learning Research 9 (Aug), pp. 1871–1874.
  • M. Jones and L. Sheridan (2015) Back translation: an emerging sophisticated cyber strategy to subvert advances in ‘digital age’ plagiarism detection and prevention. Assessment and Evaluation in Higher Education 40 (5), pp. 712–724.
  • M. Juuti, B. Sun, T. Mori, and N. Asokan (2018) Stay on-topic: generating context-specific fake restaurant reviews. In Proceedings of the European Symposium on Research in Computer Security (ESORICS), pp. 132–151.
  • C. Labbé and D. Labbé (2013) Duplicate and fake publications in the scientific literature: how many SCIgen papers in computer science? Scientometrics 94 (1), pp. 379–396.
  • Y. Li, R. Wang, and H. Zhao (2015) A machine learning method to distinguish machine translation from human translation. In Proceedings of the 29th Pacific Asia Conference on Language, Information and Computation (PACLIC), pp. 354–360.
  • H. Nguyen-Son and I. Echizen (2017) Detecting computer-generated text using fluency and noise features. In Proceedings of the International Conference of the Pacific Association for Computational Linguistics (PACLING), pp. 288–300.
  • H. Nguyen-Son, T. P. Thao, S. Hidano, and S. Kiyomoto (2019) Detecting machine-translated paragraphs by matching similar words. ArXiv Preprint arXiv:1904.10641.
  • H. Nguyen-Son, N. T. Tieu, H. H. Nguyen, J. Yamagishi, and I. Echizen (2017) Identifying computer-generated text using statistical analysis. In 2017 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), pp. 1504–1511.
  • H. Nguyen-Son, N. T. Tieu, H. H. Nguyen, J. Yamagishi, and I. Echizen (2018) Identifying computer-translated paragraphs using coherence features. ArXiv Preprint arXiv:1812.10896.
  • K. Papineni, S. Roukos, T. Ward, and W. Zhu (2002) BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics (ACL), pp. 311–318.
  • S. Prabhumoye, Y. Tsvetkov, R. Salakhutdinov, and A. W. Black (2018) Style transfer through back-translation. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (ACL), pp. 866–876.