Machine translation (MT) has many applications across domains, such as consumer reviews for marketplaces (Guha & Heger, 2014), insights and sentiment analysis for social media posts (Balahur & Turchi, 2012), improving human translation speed (Koehn & Haddow, 2009), and high-volume content translation for web browsers. The implementation of MT has been shown to increase international trade (Brynjolfsson et al., 2019). While there are many applications for MT, the quality of commercial systems on the market varies greatly from system to system and from language to language.
One of the challenges of building effective MT systems is quantifying quality improvements. For example, an MT system trained and tested on religious texts may appear to produce good results that do not reflect a genuine quality improvement. This can be caused by (i) data leakage (similar or identical sentences shared between the train and test sets) or (ii) domain mismatch (different domains between the train and test sets). A robust, reliable, and diverse benchmark is vital to determine the quality of MT systems. In this paper, we contribute a dataset (Hadgu et al., 2020) toward that effort for Amharic, the official language of Ethiopia.
Abate et al. (2018) shared a parallel corpus for building MT systems for seven Ethiopian language pairs. The design goal of an evaluation set is different in that it is meant to test the various use cases of an MT system. Hence, the evaluation dataset should cover the broad categories where MT systems are applicable. We identified news, general-knowledge text, short messages, and everyday conversational text to cover these aspects.
To evaluate an MT system for Amharic to and from English, we need to collect two types of sources corresponding to the two languages. Broadly, we want the source Amharic sentences to cover local events and the source English sentences to describe global events.
Collection and Preprocessing of Amharic Sources
Our requirement for Amharic sources was to cover local content.
Wikipedia: We took the Amharic Wikipedia dump (https://dumps.wikimedia.org/other/cirrussearch/20200120/amwiki-20200120-cirrussearch-content.json.gz), extracted sentences, ran language identification for Amharic, and dropped sentences with fewer than two tokens. We then randomly selected sentences for human translation.
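The filtering steps above can be sketched as follows. The paper does not specify which language-identification tool was used, so this sketch substitutes a simple script-based heuristic over the Ethiopic Unicode block; `looks_amharic` and its 0.5 ratio threshold are illustrative assumptions, not the authors' actual method.

```python
import re

# Ethiopic Unicode block, used here as a lightweight stand-in for a real
# language-identification model (an illustrative assumption).
ETHIOPIC = re.compile(r"[\u1200-\u137F]")

def looks_amharic(sentence, min_ratio=0.5):
    """Heuristic language ID: at least min_ratio of the non-space
    characters fall in the Ethiopic script block."""
    chars = [c for c in sentence if not c.isspace()]
    if not chars:
        return False
    ethiopic = sum(1 for c in chars if ETHIOPIC.match(c))
    return ethiopic / len(chars) >= min_ratio

def filter_sentences(sentences, min_tokens=2):
    """Keep sentences that pass language ID and have at least two tokens."""
    return [s for s in sentences
            if looks_amharic(s) and len(s.split()) >= min_tokens]

kept = filter_sentences(["ሰላም ለዓለም ሁሉ", "hello world", "ሰላም"])
# kept contains only the first sentence: Ethiopic script with >= 2 tokens
```

A random subset of the surviving sentences can then be drawn with `random.sample` for human translation.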
Twitter: We collected Amharic tweets by searching for Amharic stop words using the tool of Hadgu & Jäschke (2019), a wrapper around Twitter advanced search (https://twitter.com/search-advanced). As with Wikipedia, we performed language identification and dropped tweets with fewer than two tokens.
Conversational: For conversational text, we asked native Amharic speakers to provide Amharic source sentences and their corresponding translations.
Collection and Preprocessing of English Sources
For English sources, our requirement was that the content should be of global interest. We found that the Wikipedia current events portal (https://en.wikipedia.org/wiki/Portal:Current_events) meets this requirement. The portal lists news on a daily basis with links to the corresponding background articles. Events are of primary importance, but it also contains trends and developments. The news items span categories such as armed conflicts, arts and culture, business and economy, disasters and accidents, health and environment, international relations, law and crime, politics and elections, science and technology, and sports. Our collection spans the years 2013 to 2020 inclusive.
News: For news headlines, we took a sample of news headlines across the different categories for each year from the Wikipedia current events portal.
Twitter: The Wikipedia current events portal already contains named entities in each news headline. We took these entities and the corresponding dates as query parameters and searched via Twitter advanced search using the tool of Hadgu & Jäschke (2019), which is available on GitHub (https://github.com/asmelashteka/twitteradvancedsearch).
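The entity-and-date query construction can be sketched as follows. We have not inspected the wrapper's actual interface, so `build_query` is a hypothetical helper; it relies only on the public `since:`/`until:` operators of Twitter advanced search.

```python
from datetime import date, timedelta
from urllib.parse import urlencode

def build_query(entity, event_date, window_days=1):
    """Compose a Twitter advanced-search query for a named entity,
    restricted to a short window starting on the event date."""
    since = event_date.isoformat()
    until = (event_date + timedelta(days=window_days)).isoformat()
    return f'"{entity}" since:{since} until:{until}'

query = build_query("World Health Organization", date(2020, 3, 30))
url = "https://twitter.com/search?" + urlencode({"q": query, "src": "typed_query"})
```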
Wikipedia: As with the Amharic Wikipedia, we used the English Wikipedia dump (https://dumps.wikimedia.org/other/cirrussearch/20200120/enwiki-20200120-cirrussearch-content.json.gz), extracted sentences, ran language identification for English, and dropped sentences with fewer than two tokens.
Conversational: We collected common phrases and expressions from native speakers and from websites for spoken English; these served as a proxy for everyday conversational English.
The overall translation process of these source sentences is described as follows.
To generate a diverse dataset of conversational, tweet, Wikipedia, and news content, a job advertisement for a short-text translator was posted to a Telegram group. 198 applicants inquired about the position, and after screening the quality of their responses, 124 applicants were sent a 10-sentence Amharic sample of either Wikipedia content or news headlines. The work was reviewed and scored as bad, ok, good, very good, or perfect; only translations scoring good or above were used. After reviewing the initial sample, 28 applicants were given 50 lines of either Wikipedia content or news headlines to translate. Their work was screened and post-edited to ensure a high level of translation quality. Successful applicants were asked to provide conversational text and were also given English news, tweets, and Wikipedia text to translate. Tweet translation and post-editing followed a similar process. Translating tweets is challenging in that they are very informal and contain code-mixing and social-media-specific features such as @mentions, hashtags, and embedded URLs. Table 1 shows the domains and number of bitexts in the evaluation dataset.
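The screening step can be sketched as a simple ordinal filter. The numeric encoding of the score labels below is our assumption; only the labels and the "good" cutoff come from the process described above.

```python
# Ordinal encoding of the review scores (the numeric values are an
# illustrative assumption; the labels and cutoff follow the text).
SCALE = {"bad": 0, "ok": 1, "good": 2, "very good": 3, "perfect": 4}

def passes_screening(score, threshold="good"):
    """Return True if a reviewed sample meets the quality cutoff."""
    return SCALE[score] >= SCALE[threshold]

scores = ["perfect", "ok", "good", "bad", "very good"]
accepted = [s for s in scores if passes_screening(s)]
# accepted == ["perfect", "good", "very good"]
```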
3 Evaluation and Results
Two commercial systems were used to evaluate the current state of MT for Amharic: Google Translate (https://translate.google.com/) and Yandex Translate (https://translate.yandex.com/). Google uses neural MT (Wu et al., 2016), which has brought significant advances in MT quality for many languages. Yandex Translate uses a hybrid of neural MT and statistical MT (https://tech.yandex.com/translate/doc/dg/concepts/how-works-machine-translation-docpage/). Both services provide APIs: a user sends a string and gets back the corresponding translation. The systems were accessed on 14 February 2020 for the Amharic source (to evaluate Am→En) and on 30 March 2020 for the English source (to evaluate En→Am). We used the SacreBLEU (Post, 2018) implementation to evaluate the results. We tried different tokenizers and report results with tokenization set to 'none' (signature: BLEU+case.mixed+numrefs.1+smooth.exp+tok.none+version.1.4.6). SacreBLEU's default tokenization, 13a (BLEU+case.mixed+numrefs.1+smooth.exp+tok.13a+version.1.4.6), inflates the results for tweet translation because it tokenizes URLs into many sub-tokens that overwhelm the sentence tokens. We chose tokenization 'none' for consistency, but also report the difference when using the 13a tokenizer for the News, Wikipedia, and Conversational domains.
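The tokenizer effect can be illustrated with a rough approximation of the 13a tokenizer, which splits punctuation off adjacent characters; this is a sketch for intuition, not SacreBLEU's exact implementation.

```python
import re

def tok_13a_approx(text):
    """Rough approximation of the 13a tokenizer: separate punctuation
    from adjacent characters, then split on whitespace."""
    return re.sub(r"([^\w\s])", r" \1 ", text).split()

tweet = "Breaking: details at https://t.co/abc123 #news"
n_none = len(tweet.split())         # 5 tokens with tok='none'
n_13a = len(tok_13a_approx(tweet))  # the URL alone explodes into many sub-tokens
```

Because a short tweet's URL contributes so many matching sub-tokens under 13a, n-gram overlap on URLs dominates the score, which is why the 'none' tokenizer is the fairer choice for the Twitter domain.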
Table 2 shows the BLEU scores for Google Translate and Yandex Translate. In all domains, Google Translate is far better than Yandex Translate. Google Translate does better on news headlines and conversational texts than on Wikipedia and Twitter. Both systems perform considerably worse when translating from English to Amharic. This is a key finding for researchers in low-resource machine translation: there is a huge opportunity to improve these systems and allow more communities to benefit from this technology.
Table 2: BLEU scores with tokenization 'none'; values in parentheses show the increase when using the 13a tokenizer (reported for News, Wikipedia, and Conversational).

| Direction | System | News | Wikipedia | Twitter | Conversational | All |
|---|---|---|---|---|---|---|
| Am→En | Google Translate | 30.8 (+0.8) | 18.3 (+3.6) | 19.2 | 30.5 (+1.6) | 23.2 |
| Am→En | Yandex Translate | 1.2 (+1.3) | 1.5 (+2.6) | 8.6 | 2.4 (+2.5) | 4.8 |
| En→Am | Google Translate | 13.7 (+0.4) | 8.5 (+0.7) | 7.6 | 4.8 (+2.8) | 9.6 |
| En→Am | Yandex Translate | 0.4 (+0.1) | 0.3 (+0.6) | 2.3 | 0.4 (+0.3) | 1.3 |
In this work, we provided an evaluation dataset to assess the quality of machine translation systems for Amharic. The dataset is diverse, containing text from news headlines, Wikipedia, Twitter, and conversational sources. We also evaluated current state-of-the-art MT systems for the English–Amharic language pair. We found that while Google Translate does well on Amharic-to-English translation, English-to-Amharic translation from both systems is poor. In future work, we will continue to increase the size and variety of the dataset. We believe that such a benchmark dataset is valuable for the research community to compare and evaluate their systems.
- Abate et al. (2018) Solomon Teferra Abate, Michael Melese, Martha Yifiru Tachbelie, Million Meshesha, Solomon Atinafu, Wondwossen Mulugeta, Yaregal Assabie, Hafte Abera, Binyam Ephrem Seyoum, Tewodros Abebe, et al. Parallel corpora for bi-directional statistical machine translation for seven Ethiopian language pairs. In Proceedings of the First Workshop on Linguistic Resources for Natural Language Processing, pp. 83–90, 2018.
- Balahur & Turchi (2012) Alexandra Balahur and Marco Turchi. Multilingual sentiment analysis using machine translation? In Proceedings of the 3rd workshop in computational approaches to subjectivity and sentiment analysis, pp. 52–60. Association for Computational Linguistics, 2012.
- Brynjolfsson et al. (2019) Erik Brynjolfsson, Xiang Hui, and Meng Liu. Does machine translation affect international trade? evidence from a large digital platform. Management Science, 65(12):5449–5460, 2019.
- Guha & Heger (2014) Jyoti Guha and Carmen Heger. Machine translation for global e-commerce on ebay. In Proceedings of the AMTA, volume 2, pp. 31–37, 2014.
- Hadgu & Jäschke (2019) Asmelash Teka Hadgu and Robert Jäschke. asmelashteka/twitteradvancedsearch: First release, March 2019. URL https://doi.org/10.5281/zenodo.2605413.
- Hadgu et al. (2020) Asmelash Teka Hadgu, Adam Beaudoin, and Abel Aregawi. Machine translation evaluation dataset for amharic, February 2020. URL https://doi.org/10.5281/zenodo.3734260.
- Koehn & Haddow (2009) Philipp Koehn and Barry Haddow. Interactive assistance to human translators using statistical machine translation methods. MT Summit XII, 2009.
- Post (2018) Matt Post. A call for clarity in reporting BLEU scores. In Proceedings of the Third Conference on Machine Translation: Research Papers, pp. 186–191, Belgium, Brussels, October 2018. Association for Computational Linguistics. URL https://www.aclweb.org/anthology/W18-6319.
- Wu et al. (2016) Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, Jeff Klingner, Apurva Shah, Melvin Johnson, Xiaobing Liu, Łukasz Kaiser, Stephan Gouws, Yoshikiyo Kato, Taku Kudo, Hideto Kazawa, Keith Stevens, George Kurian, Nishant Patil, Wei Wang, Cliff Young, Jason Smith, Jason Riesa, Alex Rudnick, Oriol Vinyals, Greg Corrado, Macduff Hughes, and Jeffrey Dean. Google’s neural machine translation system: Bridging the gap between human and machine translation. CoRR, abs/1609.08144, 2016. URL http://arxiv.org/abs/1609.08144.