Evaluating Amharic Machine Translation

03/31/2020 ∙ by Asmelash Teka Hadgu, et al.

Machine translation (MT) systems are now able to provide very accurate results for high resource language pairs. However, for many low resource languages, MT is still under active research. In this paper, we develop and share a dataset to automatically evaluate the quality of MT systems for Amharic. We compare two commercially available MT systems that support translation of Amharic to and from English to assess the current state of MT for Amharic. The BLEU scores show that Amharic translation quality is promising but still low. We hope that this dataset will be useful to the research community, both in academia and industry, as a benchmark to evaluate Amharic MT systems.

1 Introduction

Machine translation (MT) has many applications in different domains, such as consumer reviews for marketplaces (Guha & Heger, 2014), insights and sentiment analysis for social media posts (Balahur & Turchi, 2012), improving human translation speed (Koehn & Haddow, 2009) and high volume content translation for web browsers. Implementation of MT has been shown to increase international trade (Brynjolfsson et al., 2019). While there are many applications for MT, the quality of commercial systems available on the market varies greatly from system to system and language to language.

One of the challenges of building effective MT systems is quantifying quality improvements. For example, an MT system that is both trained and tested on religious texts may seem to have good results, but those results may not reflect a real quality improvement. This could be a problem with (i) data leakage (the same or similar sentences appear in both the train and test sets) or (ii) domain mismatch (the train and test sets come from different domains). A robust, reliable and diverse benchmark is vital to determine the quality of MT systems. In this paper, we contribute a dataset (Hadgu et al., 2020) toward that effort for Amharic, the official language of Ethiopia.
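The data leakage problem in (i) can be checked mechanically. The following is a minimal sketch, not part of the paper's method: the function names and the example sentences are hypothetical, and it only catches duplicates that become identical after lowercasing and whitespace normalization. Detecting merely similar sentences would need fuzzy matching.

```python
import re


def normalize(sentence: str) -> str:
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return re.sub(r"\s+", " ", sentence.strip().lower())


def find_leaked(train_sentences, test_sentences):
    """Return test sentences whose normalized form also occurs in the training set."""
    train_seen = {normalize(s) for s in train_sentences}
    return [s for s in test_sentences if normalize(s) in train_seen]


# Hypothetical toy data, for illustration only.
train = ["In the beginning was the Word.", "Peace be with you."]
test = ["peace  be with you.", "The match ended in a draw."]

leaked = find_leaked(train, test)
print(leaked)
```

Any sentence reported here would make a test score look better than the system's real quality on unseen text.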

2 Dataset

Abate et al. (2018) shared a parallel corpus to build MT systems for seven Ethiopian language pairs. An evaluation set has a different design goal: it should test the different use cases of an MT system. Hence, the evaluation dataset should cover the broad categories where MT systems are applicable. We identified news, general knowledge text, short messages and everyday conversational texts to cover these aspects.

To evaluate an MT system for Amharic to and from English, we need to collect two types of sources corresponding to the two languages. Broadly, we want the source Amharic sentences to cover local events and the source English sentences to describe global events.

Collection and Preprocessing of Amharic Sources

Our requirement for Amharic sources was to cover local content.

Collection and Preprocessing of English Sources

For English sources, our requirement was that the content should be of global interest. We found that the Wikipedia current events portal (https://en.wikipedia.org/wiki/Portal:Current_events) meets this requirement. The portal lists news on a daily basis with links to the corresponding background articles. It primarily covers events, but it also contains trends and developments. The news items fall into different categories, such as armed conflicts, arts and culture, business and economy, disasters and accidents, health and environment, international relations, law and crime, politics and elections, science and technology, and sports. Our collection spans 2013 to 2020 inclusive.

  • News: For news headlines, we took a sample of news headlines across the different categories for each year from the Wikipedia current events portal.

  • Twitter: The Wikipedia current events portal already contains named entities in each news headline. We took these entities and the corresponding date as query parameters and searched Twitter advanced search using the tool of Hadgu & Jäschke (2019), which is available on GitHub (https://github.com/asmelashteka/twitteradvancedsearch).

  • Wikipedia: Similar to the Amharic Wikipedia, we used the English Wikipedia dump (https://dumps.wikimedia.org/other/cirrussearch/20200120/enwiki-20200120-cirrussearch-content.json.gz), extracted sentences, ran language identification for English and dropped sentences that do not have at least two tokens.

  • Conversational: We collected common phrases and expressions from native speakers and from websites for spoken English, and used them as a proxy for everyday conversational English.
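The length filter in the Wikipedia step can be sketched as follows. This is an illustration under assumptions, not the authors' pipeline: the regex sentence splitter and whitespace tokenization are naive stand-ins, the function names are hypothetical, and the language identification step is left out because the text does not say which tool was used.

```python
import re


def split_sentences(text: str):
    """Naive splitter (assumption): break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def keep_sentence(sentence: str, min_tokens: int = 2) -> bool:
    """Keep sentences with at least `min_tokens` whitespace-separated tokens."""
    return len(sentence.split()) >= min_tokens


# Hypothetical input text, for illustration only.
text = "Addis Ababa is the capital of Ethiopia. Yes. It hosts the African Union."
kept = [s for s in split_sentences(text) if keep_sentence(s)]
print(kept)
```

The one-token sentence "Yes." is dropped, while the two longer sentences are kept.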

The overall translation process of these source sentences is described as follows.

Annotation

To generate a diverse dataset of conversational text, tweets, Wikipedia and news content, a job advertisement for a short-text translator was posted to a Telegram group. 198 applicants inquired about the position and, after screening the quality of their responses, 124 applicants were sent a 10-sentence Amharic sample of either Wikipedia content or news headlines. Each submission was reviewed and scored, and translations were selected only if they scored above a threshold. Scores of bad, ok, good, very good and perfect were assigned, and any content below a good level of quality was not used. After reviewing the initial sample, 28 applicants were given 50 lines of either Wikipedia content or news headlines to translate. Their work was screened and post-edited to ensure a high level of translation quality. Successful applicants were asked to provide conversational text and were also given English news, tweets and Wikipedia text to translate. Tweet translation and post-editing were completed using a similar process. Translating tweets is challenging because they are very informal, contain code-mixing and have social-media-specific features such as @mentions, hashtags and embedded URLs. Table 1 shows the domain and number of bitext in the evaluation dataset.

Source    News  Wikipedia  Twitter  Conversational  Total bitext
Amharic    383        257      203             154           997
English    371        499      641             404          1915

Table 1: Domain and number of bitext used for evaluation
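The screening rule from the annotation process (discard anything scored below "good") amounts to a threshold on an ordinal scale. The sketch below assumes the five labels are ordered as listed in the text. The translator names and scores are hypothetical and are not data from the paper.

```python
# Ordinal quality scale, from worst to best, as listed in the annotation description.
SCALE = ["bad", "ok", "good", "very good", "perfect"]


def passes(score: str, threshold: str = "good") -> bool:
    """True if `score` is at or above `threshold` on the ordinal scale."""
    return SCALE.index(score) >= SCALE.index(threshold)


# Hypothetical review results, for illustration only.
samples = {"translator_a": "very good", "translator_b": "ok", "translator_c": "good"}
selected = sorted(name for name, score in samples.items() if passes(score))
print(selected)
```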

3 Evaluation and Results

Two commercial systems were used to evaluate the current state of MT for Amharic: Google Translate (https://translate.google.com/) and Yandex Translate (https://translate.yandex.com/). Google uses neural MT (Wu et al., 2016), which has made significant advances in the quality of MT systems for many languages. Yandex Translate uses a hybrid system of neural MT and statistical MT (https://tech.yandex.com/translate/doc/dg/concepts/how-works-machine-translation-docpage/). Both services provide APIs to access their systems: a user can send a string and get back the corresponding translation. The systems were accessed on 14th February 2020 for the Amharic source (to evaluate Am → En) and on 30th March 2020 for the English source (to evaluate En → Am). We used the SacreBLEU (Post, 2018) implementation to evaluate the results. We tried out different tokenizers and report the results with tokenization set to ‘none’ (signature: BLEU+case.mixed+numrefs.1+smooth.exp+tok.none+version.1.4.6). The default tokenization in SacreBLEU, 13a (signature: BLEU+case.mixed+numrefs.1+smooth.exp+tok.13a+version.1.4.6), inflates the result of the tweet translation because it splits URLs into many sub-tokens that overwhelm the sentence tokens. We chose tokenization ‘none’ for consistency, but we also report the difference when using the ‘13a’ tokenizer for the News, Wikipedia and Conversational domains.
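The inflation effect can be illustrated with a toy example. This sketch does not call SacreBLEU: `split_punct` is a simplified stand-in for a punctuation-splitting tokenizer, not the actual 13a rules, and clipped unigram precision is used in place of full BLEU. The sentences and the URL are hypothetical.

```python
import re
from collections import Counter


def split_punct(text: str):
    """Simplified punctuation-splitting tokenizer (an assumption, NOT SacreBLEU's 13a)."""
    return re.findall(r"\w+|[^\w\s]", text)


def unigram_precision(hyp_tokens, ref_tokens):
    """Clipped unigram precision: matched hypothesis tokens / all hypothesis tokens."""
    ref_counts = Counter(ref_tokens)
    matched = sum(min(c, ref_counts[t]) for t, c in Counter(hyp_tokens).items())
    return matched / len(hyp_tokens)


# Hypothetical tweet pair: the words differ completely, only the URL is shared.
url = "https://example.com/a/b"
ref = "floods hit the city " + url
hyp = "rain struck one town " + url

# Whitespace-only tokens (comparable to tokenization 'none'): the URL is one token.
p_none = unigram_precision(hyp.split(), ref.split())
# Punctuation splitting: the URL becomes 11 tokens that all match.
p_split = unigram_precision(split_punct(hyp), split_punct(ref))

print(p_none, p_split)
```

With whitespace tokens, one of five hypothesis tokens matches. After punctuation splitting, 11 of 15 match, although the translated words themselves share nothing with the reference.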

Table 2 shows the BLEU scores for Google Translate and Yandex Translate. In every domain, Google Translate scores far higher than Yandex Translate. For Amharic to English, Google Translate does better on news headlines and conversational texts than on Wikipedia and Twitter. Both systems perform worse when translating from English to Amharic. This is a key finding for researchers in low resource machine translation: there is a huge opportunity to improve these systems and allow more communities to benefit from this technology.

Direction  Service           News        Wikipedia   Twitter  Conversational  All combined
Am → En    Google Translate  30.8 + 0.8  18.3 + 3.6  19.2     30.5 + 1.6      23.2
Am → En    Yandex Translate   1.2 + 1.3   1.5 + 2.6   8.6      2.4 + 2.5       4.8
En → Am    Google Translate  13.7 + 0.4   8.5 + 0.7   7.6      4.8 + 2.8       9.6
En → Am    Yandex Translate   0.4 + 0.1   0.3 + 0.6   2.3      0.4 + 0.3       1.3

Table 2: BLEU score for Amharic to English and English to Amharic translation using two commercial MT systems.

4 Conclusion

In this work, we provided an evaluation dataset to assess the quality of machine translation systems for Amharic. The dataset is diverse, containing text from news headlines, Wikipedia, Twitter and conversational sources. We also evaluated current state-of-the-art MT systems for the English-Amharic language pair. We found that while Google Translate does well on Amharic to English translation, the English to Amharic translation from both systems is poor. In future work, we will continue to increase the size and variety of the dataset. We believe that such a benchmark dataset is valuable for the research community to compare and evaluate their systems.

References