Survey on Publicly Available Sinhala Natural Language Processing Tools and Research

06/05/2019 ∙ by Nisansa de Silva, et al. ∙ University of Moratuwa

Sinhala is the native language of the Sinhalese people, who make up the largest ethnic group of Sri Lanka. The language belongs to the globe-spanning Indo-European language tree. However, due to poverty in both linguistic and economic capital, Sinhala, from the perspective of natural language processing tools and research, remains a resource-poor language which has neither the economic drive its cousin English has nor the sheer push of the law of numbers a language such as Chinese has. A number of research groups from Sri Lanka have noticed this lack and the dire need for proper tools and research for Sinhala natural language processing. However, due to various reasons, these attempts seem to lack coordination and awareness of each other. The objective of this paper is to fill that gap with a comprehensive literature survey of the publicly available Sinhala natural language tools and research so that researchers working in this field can better utilize the contributions of their peers. As such, we shall be uploading this paper to arXiv and updating it periodically to reflect the advances made on the topic.




1 Introduction

The Sinhala language, being the native language of the Sinhalese people [1], who make up the largest ethnic group of the island country of Sri Lanka, is reported as the mother tongue of approximately 16 million people [2]. To give a brief linguistic background for the purpose of aligning the Sinhala language with the baseline of English, it should first be noted that Sinhala belongs to the same Indo-European language tree [3]. However, unlike English, which is part of the Germanic branch, Sinhala belongs to the Indo-Aryan branch. Further, Sinhala, unlike English, which borrowed the Latin alphabet, has its own writing system, which is a descendant of the Indian Brahmi script [4, 5, 6, 7, 8, 9]. By extension, this makes the Sinhala script a member of the Aramaic family of scripts [10, 11]. It should be noted that the modern Sinhala language has loanwords from languages such as Tamil, English, Portuguese, and Dutch due to various historical reasons. Despite a rich historical array of literature spanning several millennia and reaching back into the centuries BCE [12], modern natural language processing tools for the Sinhala language are scarce [13].

Natural Language Processing (NLP) is a broad area covering all computational processing and analysis of human languages. To achieve this end, NLP systems operate at different levels [14, 15]. A graphical representation of NLP layers and application domains is shown in Figure 1. On one hand, according to Liddy [15], these systems can be categorized into the following layers: phonological, morphological, lexical, syntactic, semantic, discourse, and pragmatic. The phonological layer deals with the interpretation of language sounds. As such, it consists mainly of speech-to-text and text-to-speech systems. In cases where one is working with written text rather than speech, it is possible to replace this layer with tools which handle Optical Character Recognition (OCR) and language rendering standards (such as Unicode [16]). The morphological layer analyses words at their smallest units of meaning. As such, analysis of word lemmas and prefix-suffix-based inflection is handled in this layer. The lexical layer handles individual words. Therefore, tasks such as Part of Speech (PoS) tagging happen here. The next layer, syntactic, operates at the phrase and sentence level, where grammatical structures are utilized to obtain meaning. The semantic layer attempts to derive meaning from the word level to the sentence level, starting with Named Entity Recognition (NER) at the word level and working its way up by identifying the contexts the words are set in until arriving at an overall meaning. The discourse layer handles meaning in textual units larger than a sentence. Here, the function of a particular sentence may be contextualized within the document it is set in. Finally, the pragmatic layer handles contexts read into contents without having to be explicitly mentioned [14, 15]. Some forms of anaphora (co-reference) resolution fall into this layer.

Fig. 1: NLP layers and tasks [14]

On the other hand, Wimalasuriya and Dou [17] categorize NLP tools and research by utility. They introduce three categories of increasing complexity: Information Retrieval (IR), Information Extraction (IE), and Natural Language Understanding (NLU). Information Retrieval covers applications which search for and retrieve information relevant to a given query. For pure IR, tools and methods up to and including the syntactic layer in the above analysis are used. Information Extraction, on the other hand, extracts structured information. The difference between IR and IE is that IR does not change the structure of the documents in question. Be they structured, semi-structured, or unstructured, all IR does is fetch them as they are. In comparison, IE takes semi-structured or unstructured text and puts it in a machine-readable structure. For this, IE utilizes all the layers used by IR as well as the semantic layer. Natural Language Understanding is purely the idea of cognition. Most NLU tasks fall under the AI-hard category and remain unsolved [14]. However, with varying accuracy, some NLU tasks such as machine translation are being attempted. (This is, however, not without the criticism of being nothing more than a Chinese room [18] rather than true NLU.) The pragmatic layer of the above analysis belongs to the NLU tasks, while the discourse layer straddles information extraction and natural language understanding [14].

The objective of this paper is to serve as a comprehensive survey on the state of natural language processing resources for the Sinhala language. The initial structure and content of this survey are heavily influenced by the preliminary surveys carried out by de Silva [13] and Wijeratne et al. [14]. However, our hope is to host this survey on arXiv as a perpetually evolving work which is continuously updated as new research and tools for the Sinhala language are created and made publicly available. Hence, it is our hope that this work will help future researchers who are engaged in Sinhala NLP research to conduct their literature surveys efficiently and comprehensively. For the success of this survey, we shall also consider the Sri Lankan NLP tools repository, lknlp.

The remainder of this survey is organized as follows: Section 2 discusses the various tools and research available for Sinhala NLP. In that section we discuss both pure Sinhala NLP tools and research and hybrid Sinhala-English work. We also discuss research and tools which contribute to Sinhala NLP either along with or with the help of Tamil, the other official language of Sri Lanka. Finally, Section 3 concludes the survey.

2 Sinhala resources

In this section we generally follow the structure shown in Figure 1 for sectioning. However, in addition to that, we also discuss topics such as available corpora, other data sets, dictionaries, and WordNets. We focus on NLP tools and research rather than the mechanics of language script handling [19, 20].

Tamil, the other official language of Sri Lanka, is also a resource-poor language. However, because of the larger population of Tamil speakers worldwide, including in economic powerhouses such as India, more research and tools are available for Tamil NLP tasks [14]. It is therefore reasonable to expect that Sinhala and Tamil NLP endeavours can help each other. In particular, because both are official languages of Sri Lanka, parallel data sets are generated in the form of official government documents and local news items. A number of researchers make use of this opportunity, and we discuss those applications in this paper as well. Further, there have been some fringe implementations which bridge Sinhala with other languages such as Japanese [21, 22].

2.1 Corpora

For any language, the key to NLP applications and implementations is the existence of adequate corpora. On this matter, a relatively substantial Sinhala text corpus was created by Upeksha et al. [23, 24] by web crawling. Later, a smaller Sinhala news corpus was created by de Silva [13]. Both of these corpora are publicly available. However, neither comes close to the massive capacity and range of the existing English corpora. A word corpus of approximately 35,000 entries was developed by Weerasinghe et al. [25], but it does not seem to be online anymore. A number of Sinhala-English parallel corpora were introduced by Guzmán et al. [26]. These include 600k+ Sinhala-English subtitle pairs initially collected by Lison and Tiedemann [27] and 45k+ Sinhala-English sentence pairs from GNOME, KDE, and Ubuntu. Guzmán et al. [26] further provided two monolingual corpora for Sinhala: 155k+ sentences of filtered Sinhala Wikipedia and 5178k+ sentences of Sinhala Common Crawl.

As for Sinhala-Tamil corpora, Mohamed et al. [28] claim to have built a word-aligned Sinhala-Tamil parallel corpus. However, at the time of writing this paper, it was not publicly available. A very small Sinhala-Tamil aligned parallel corpus created by Farhath et al. [29] using order papers of the government of Sri Lanka is available to download.

2.2 Data Sets

Specific data sets for Sinhala, as expected, are scarce. However, a Sinhala PoS tagged data set [30, 31, 32] is available to download from GitHub. Further, a Sinhala NER data set created by Manamini et al. [33] is also available to download from GitHub.

Facebook has released FastText [34, 35, 36] models for the Sinhala language trained on the Wikipedia corpus. They are available as both text models and binary files. Using these models, a group at the University of Moratuwa has created an extended FastText model trained on Wikipedia, news, and official government documents. The binary file of the trained model is available for download. Herath et al. [37] have compiled a report on the Sinhala lexicon for the purpose of establishing a basis for NLP applications.
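To make the text-format FastText models mentioned above more concrete, the following is a minimal sketch of reading such a file and comparing two words. It assumes the commonly used plain-text layout (a header line with the vocabulary size and the dimension, then one word per line followed by its vector components). The three words and their two-dimensional vectors are made-up toy values, not entries from any released model, and the helper names `parse_vec` and `cosine` are hypothetical.

```python
import io
import math

def parse_vec(handle):
    """Parse text-format word vectors: header 'count dim', then 'word v1 ... vd'."""
    _count, dim = map(int, handle.readline().split())
    vectors = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        if len(parts) != dim + 1:
            continue  # skip lines that do not match the declared dimension
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return dim, vectors

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Toy stand-in for a real model file; the vectors are invented for illustration.
SAMPLE = "3 2\nගස 1.0 0.0\nමල 0.8 0.6\nපොත 0.0 1.0\n"
dim, vecs = parse_vec(io.StringIO(SAMPLE))
print(cosine(vecs["ගස"], vecs["මල"]))
```

For a real model, the `io.StringIO` object would be replaced by a file opened with UTF-8 encoding; a released model is far larger than this toy sample, so loading only the needed words may be preferable.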

2.3 Dictionaries

A necessary component for bridging Sinhala and English resources is an English-Sinhala dictionary. The earliest and most extensive Sinhala-English dictionary available for consumption was by Malalasekera [38]. However, this dictionary is locked behind copyright laws and is not available for public research and development. The dictionary by Kulatunga [39] is publicly available through an online web interface but does not provide API access or any means to directly access the data set. The largest publicly available English-Sinhala dictionary data set is from a discontinued Firefox plug-in, EnSiTip [40], which bears a more than passing resemblance to the above dictionary by Kulatunga [39]. Hettige and Karunananda [41] claim to have created a lexicon to help in their attempt to create a system capable of English-to-Sinhala machine translation.

There exists the government sponsored trilingual dictionary [42] which matches Sinhala, English, and Tamil. However, other than a crude web interface on the ministry website, there is no efficient API or any other way for a researcher to access the data of this dictionary. Weerasinghe and Dias [43] have created a multilingual place name database for Sri Lanka which may function both as a dictionary and a resource for certain NER tasks.

2.4 WordNets

WordNets [44] are extremely powerful and act as a versatile component of many NLP applications. They encompass a number of linguistic relations which exist between the words in the lexicon of a language, including but not limited to hyponymy, hypernymy, synonymy, and meronymy. Their uses range from simple gazetteer listing applications [17] to information extraction based on semantic similarity [45, 46] or semantic oppositeness [47]. An attempt has been made to build a Sinhala WordNet [48]. For a time it was hosted online [49], but it is now defunct and all the data and applications are lost. However, even at its peak, due to the lack of volunteers for the crowdsourced methodology of populating the WordNet, it was at best an incomplete product. Another effort to build a Sinhala WordNet was initiated independently by Welgama et al. [50], but it too stopped progressing, even before reaching the level of completion of the former.
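As an illustration of how hypernymy links support the semantic similarity measures cited above, here is a minimal sketch of the Wu and Palmer [45] measure, computed as twice the depth of the lowest common subsumer divided by the sum of the depths of the two words. The tiny taxonomy uses English placeholder words and is entirely hypothetical; it is not taken from any Sinhala WordNet, and the function names are assumptions made for this example.

```python
# Hypothetical toy taxonomy: each word maps to its single hypernym (parent).
HYPERNYM = {
    "animal": "entity",
    "plant": "entity",
    "dog": "animal",
    "cat": "animal",
    "tree": "plant",
}

def path_to_root(word):
    """Return the chain from the word up to the root (word first, root last)."""
    path = [word]
    while path[-1] in HYPERNYM:
        path.append(HYPERNYM[path[-1]])
    return path

def depth(word):
    """Depth counted in nodes, so the root has depth 1."""
    return len(path_to_root(word))

def wu_palmer(a, b):
    """2 * depth(lowest common subsumer) / (depth(a) + depth(b))."""
    ancestors_a = set(path_to_root(a))
    for node in path_to_root(b):  # walks upward, so the first hit is the lowest
        if node in ancestors_a:
            return 2 * depth(node) / (depth(a) + depth(b))
    return 0.0

print(wu_palmer("dog", "cat"), wu_palmer("dog", "tree"))
```

A real WordNet allows multiple hypernyms per sense, so this single-parent dictionary is a deliberate simplification.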

2.5 Morphological Analyzers

As shown in Figure 1, morphological analysis is a ground-level necessary component of natural language processing. Given that Sinhala is a highly inflected language [51, 52, 13], a proper morphological analysis process is vital. However, the only prominent work found on this avenue of research was a study restricted to the morphological analysis of Sinhala verbs [53]. There was no indication of whether this work was continued to cover other types of words. Further, other than this single publication, no data or tools were made publicly accessible. Completely independent of the above, Welgama et al. [54] attempted to evaluate machine learning approaches for Sinhala morphological analysis. Yet another independent attempt to create a morphological parser for Sinhala verbs was carried out by Fernando and Weerasinghe [55]. As a step in their efforts to create a system capable of English-to-Sinhala machine translation, Hettige and Karunananda [56] also claim to have created a morphological analyzer, again without any public data or code.

2.6 Part of Speech Taggers

The next step after morphological analysis is Part of Speech (PoS) tagging. PoS tags differ in number and functionality from language to language. Therefore, the first step in creating an effective PoS tagger is to identify the PoS tag set for the language. This work has been accomplished by Fernando et al. [32] and Dilshani et al. [31]. Expanding on that, Fernando et al. [32] introduced an SVM-based PoS tagger for Sinhala, and Fernando and Ranathunga [30] give an evaluation of different classifiers for the task of Sinhala PoS tagging. While it is obvious that there has been some follow-up work after the initial foundation, it seems all of it has been internal to one research group at one institution, as neither the data nor the tools of any of these findings have been made available for the use of external researchers. Several attempts have been made to create a stochastic PoS tagger for Sinhala, with those by Herath and Weerasinghe [57] and Jayasuriya and Weerasinghe [58] being the most notable. A hybrid PoS tagger for the Sinhala language was proposed by Gunasekara et al. [59]. Within a single group, yet another set of studies was carried out to create a Sinhala PoS tagger, starting with the foundation of Jayaweera and Dias [60], which was then extended to a Hidden Markov Model (HMM) based approach [61] and an analysis of unknown words [62]. Further, this group presented a comparison of a few Sinhala PoS taggers that were available to them [63].
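To show what an HMM-based tagger computes, the following is a minimal sketch of Viterbi decoding over a two-tag model. It is not the method of any of the cited works. The tag set, the English placeholder tokens, and every probability are invented for illustration, and the small `unk` floor for unseen words is a crude stand-in for proper unknown-word handling.

```python
def viterbi(words, tags, start_p, trans_p, emit_p, unk=1e-6):
    """Return the most probable tag sequence for words under a first-order HMM."""
    # best[i][t] = (probability of the best path ending in tag t at position i,
    #               tag chosen at position i - 1 on that path)
    best = [{t: (start_p[t] * emit_p[t].get(words[0], unk), None) for t in tags}]
    for word in words[1:]:
        column = {}
        for t in tags:
            column[t] = max(
                (best[-1][p][0] * trans_p[p][t] * emit_p[t].get(word, unk), p)
                for p in tags
            )
        best.append(column)
    # Backtrack from the best final tag to recover the full sequence.
    tag = max(tags, key=lambda t: best[-1][t][0])
    path = [tag]
    for column in reversed(best[1:]):
        tag = column[tag][1]
        path.append(tag)
    return list(reversed(path))

# Hypothetical toy parameters (not estimated from any corpus).
TAGS = ["NOUN", "VERB"]
START = {"NOUN": 0.7, "VERB": 0.3}
TRANS = {"NOUN": {"NOUN": 0.3, "VERB": 0.7}, "VERB": {"NOUN": 0.6, "VERB": 0.4}}
EMIT = {
    "NOUN": {"dogs": 0.5, "bark": 0.1, "cats": 0.4},
    "VERB": {"bark": 0.6, "chase": 0.4},
}

print(viterbi(["dogs", "chase", "cats"], TAGS, START, TRANS, EMIT))
```

For long sentences, a practical implementation would work with log probabilities to avoid numerical underflow; plain products are kept here for readability.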

2.7 Parsers

The PoS tagged data then needs to be handed over to a parser. This is an area which is not completely solved even in English due to various inherent ambiguities in natural languages. However, in the case of English, there are systems which provide adequate, if not yet perfect, results [64]. A prosodic phrasing model for the Sinhala language has been implemented by Bandara et al. [65]. While they do report reasonable results, they, yet again, do not provide any means for the public to access the data or the tools that they have developed. Work by Liyanage et al. [51] is also concentrated on this layer, given that they have worked on formalizing a computational grammar for Sinhala. Kanduboda and Prabath [52] have worked on Sinhala differential object markers. Another parser for the Sinhala language has been proposed by Hettige and Karunananda [66] with a model for grammar [67].

2.8 Named Entity Recognition Systems

As shown in Figure 1, once the text is properly parsed, it has to be processed using a Named Entity Recognition (NER) system. An NER system for Sinhala named Ananya has been developed by Manamini et al. [33] and is available to download from GitHub. Another independent attempt at Sinhala NER has been made by Dahanayaka and Weerasinghe [68], but the data and code of that work are not accessible to the public.

2.9 Semantic Tools

Applications of the semantic layer are more advanced than those of the layers below it in Figure 1. But even with the obvious lack of resources and tools, a number of attempts have been made at semantic-level applications for the Sinhala language. A Sinhala semantic similarity measure has been developed for short sentences by Kadupitiya et al. [69]. This work was then extended by Kadupitiya et al. [70] for the use case of short answer grading. Data and tools for these projects are not publicly available. Text classification is a popular application on the semantic layer of the NLP stack. Nanayakkara and Ranathunga [71] have implemented a system which uses corpus-based similarity measures for this purpose. A smaller implementation of Sinhala news classification has been attempted by de Silva [13]. As mentioned above, that news corpus is publicly available, but it is extremely small and thus may not be of much use for extensive research. A word2vec-based tool for sentiment analysis of Sinhala news comments is available. There have been multiple attempts at word sense disambiguation for Sinhala. For this, Arukgoda et al. [72] have proposed a system based on synonyms, while Marasinghe et al. [73] have proposed a system based on probabilistic modeling.

2.10 Phonological Tools

In the case of the phonological layer, a Sinhala text-to-speech system was developed by Weerasinghe et al. [74]. However, it is not publicly accessible, and no further research on extending this work to a Sinhala speech-to-text system could be found. A separate group has worked on Sinhala text-to-speech systems independently of the above [75]. Conversely, Nadungodage et al. [76] have done a series of work on Sinhala speech recognition, with special attention given to Sinhala being a resource-poor language. This project divides its focus among continuity [77], active learning [78], and speaker adaptation [79]. Based on the earlier work by Weerasinghe et al. [80], Wasala et al. [81] have developed methods for Sinhala grapheme-to-phoneme conversion along with a set of rules for schwa epenthesis. This work was then extended by Nadungodage et al. [82].

2.11 Optical Character Recognition Tools

While it is not necessarily a component of the NLP stack shown in Figure 1, which follows the definition by Liddy [15], it is possible to swap out the bottom-most phonological layer of the stack in favour of an Optical Character Recognition (OCR) layer. The earliest attempt at a Sinhala OCR system was by Dias et al. [83]. It was then extended to be online and made available for use via desktops [84] and hand-held devices [85] with the ability to recognize handwriting. A separate group has also attempted Sinhala OCR [86], mainly involving the nearest-neighbor method [87]. Yet another attempt at this problem was made by Rajapakse et al. [88] before the above two groups. A linear symmetry based approach was proposed by Premaratne and Bigun [89]. A Sinhala handwriting OCR system which utilizes zone-based feature extraction has been proposed by Dharmapala et al. [90].
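The nearest-neighbor idea mentioned above can be sketched in a few lines: an unknown glyph is assigned the label of the stored template it differs from the least. This is only an illustration of the general technique, not the cited system. The 3x3 binary bitmaps, the labels `bar` and `ring`, and the use of Hamming distance are all assumptions made for this example.

```python
def hamming(a, b):
    """Number of pixel positions at which two equal-sized bitmaps differ."""
    return sum(x != y for x, y in zip(a, b))

def nearest_label(sample, templates):
    """1-nearest-neighbor: return the label of the closest stored template."""
    return min(templates, key=lambda label: hamming(sample, templates[label]))

# Hypothetical 3x3 glyph templates, flattened row by row (1 = ink, 0 = blank).
TEMPLATES = {
    "bar": (0, 1, 0,
            0, 1, 0,
            0, 1, 0),
    "ring": (1, 1, 1,
             1, 0, 1,
             1, 1, 1),
}

# A "bar" with one noisy pixel in the bottom-right corner.
noisy = (0, 1, 0,
         0, 1, 0,
         0, 1, 1)

print(nearest_label(noisy, TEMPLATES))
```

Real OCR systems extract features from much larger images before comparing them; raw pixel comparison is used here only to keep the sketch short.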

2.12 Translators

A series of work has been done by one group towards English-to-Sinhala translation, as mentioned in some of the above subsections. This work includes building a morphological analyzer [56], lexicon databases [41], a transliteration system [91], an evaluation model [92], a computational model of grammar [67], and a multi-agent solution [93]. Another group independently attempted English-to-Sinhala machine translation [94] with a statistical approach [95]. Hettige and Karunananda [96, 97] have attempted to establish a theoretical basis for English-to-Sinhala machine translation. A very simplistic web-based translator was proposed by Hettige and Karunananda [98].

Most of the cross Sinhala-Tamil work has been done in the domain of machine translation. Neural machine translation for the Sinhala and Tamil languages was initiated by Tennage et al. [99]. They then further enhanced it with transliteration and byte pair encoding [100] and used synthetic training data to handle the rare word problem [101]. This project produced Si-Ta [102], a machine translation system for Sinhala and Tamil official documents. On the statistical machine translation front, Farhath et al. [103] worked on integrating bilingual lists. The attempts by Weerasinghe [104] and Sripirakas et al. [105] were also focused on statistical machine translation, while Jeyakaran [106] attempted a kernel regression method. Yet another attempt was made by Pushpananda et al. [107], which they later extended with some quality improvements [108]. While not related to Tamil, there have been attempts to link Sinhala NLP with Japanese by Herath et al. [21] and Kanduboda [22].
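Byte pair encoding, mentioned above as one of the enhancements, builds subword units by repeatedly merging the most frequent pair of adjacent symbols. The following is a minimal sketch of that merge-learning loop on a toy English word list. The word counts, the `</w>` end-of-word marker, and the tie-breaking rule are assumptions made for this example and do not reflect the configuration of the cited work.

```python
from collections import Counter

def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of pair in symbols with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)

def learn_bpe(word_counts, num_merges):
    """Learn up to num_merges merge operations from a word-frequency table."""
    # Each word starts as a sequence of characters plus an end-of-word marker.
    vocab = {tuple(word) + ("</w>",): n for word, n in word_counts.items()}
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, n in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += n
        if not pairs:
            break
        # Most frequent pair; ties are broken by the pair itself for determinism.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        vocab = {merge_pair(symbols, best): n for symbols, n in vocab.items()}
    return merges

# Hypothetical toy word counts.
COUNTS = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
print(learn_bpe(COUNTS, 3))
```

Note that `tuple(word)` splits a string into Unicode code points, which for Sinhala would separate consonants from their dependent vowel signs; whether that is the desired starting segmentation is a design choice this sketch does not address.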

3 Conclusion

At this point, a reader might think that there seems to be a significant number of implementations of NLP for Sinhala. How, then, can one justify listing Sinhala as a resource-poor language? The important point missing from that assumption is that, for almost all of the implementations and findings listed above, the only thing publicly available to a researcher is a set of research papers. The corpora, tools, algorithms, and anything else produced through this research are either locked away as the property of individual research groups or, worse, lost to time with crashed ancient servers, lost hard drives, and expired web hosts. This, and probably academic and research rivalry, has caused these separate research groups not to cite or build upon each other's work. In many cases where similar work is done, it is a rehashing of the same ideas adopted from resource-rich languages, because the work done by another group is either unavailable or not referred to and built upon out of reluctance. This has resulted in multiple groups building multiple foundations behind closed walls, with no one ending up with a completed end-to-end NLP workflow. In conclusion, even though there are islands of implementations done for Sinhala NLP, they are of very small scale and/or are usually not readily accessible for further use and research by other researchers. Thus, so far, sadly, Sinhala remains a resource-poor language.


References

  • [1] L. Bauer, Linguistics Student’s Handbook. Edinburgh University Press, 2007.
  • [2] Department of Census and Statistics Sri Lanka. Percentage of population aged 10 years and over in major ethnic groups by district and ability to speak sinhala, tamil and english languages. [Online]. Available:
  • [3] H. Young. A language family tree - in pictures — education — the guardian. [Online]. Available:
  • [4] D. Bandara, N. Warnajith, A. Minato, and S. Ozawa, “Creation of precise alphabet fonts of early brahmi script from photographic data of ancient sri lankan inscriptions,” Canadian Journal on Artificial Intelligence, Machine Learning and Pattern Recognition, vol. 3, no. 3, pp. 33–39, 2012.
  • [5] P. T. Daniels and W. Bright, The world’s writing systems. Oxford University Press on Demand, 1996.
  • [6] M. Sirisoma, “Brahmi inscriptions of sri lanka from 3rd century bc to 65 ad,” pp. 3–54, 1990.
  • [7] M. Dias, “Lakdiwa sellipiwalin heliwana sinhala bhashawe prathyartha namayange vikashanaya,” Department of Archaeology, Colombo Sri Lanka, p. 1, 1996.
  • [8] A. Hettiarachchi, “Investigation of 2nd, 3rd and 4th century inscriptions,” Inscriptions: Volume Two, Archaeological Department Centenary (1890–1990), Commemorative Series. Colombo: Department of Archaeology, pp. 57–104, 1990.
  • [9] S. Paranavitana and S. L. P. Depārtamēntuva, Inscriptions of Ceylon. Dept. of Archaeology, 1970.
  • [10] R. Salomon, Indian epigraphy: a guide to the study of inscriptions in Sanskrit, Prakrit, and the other Indo-Aryan languages. Oxford University Press, 1998.
  • [11] H. Falk, Schrift im alten Indien: ein Forschungsbericht mit Anmerkungen. Gunter Narr Verlag, 1993, vol. 56.
  • [12] H. P. Ray, The archaeology of seafaring in ancient South Asia. Cambridge University Press, 2003.
  • [13] N. de Silva, “Sinhala Text Classification: Observations from the Perspective of a Resource Poor Language,” 2015.
  • [14] Y. Wijeratne, N. de Silva, and Y. Shanmugarajah, “Natural Language Processing for Government: Problems and Potential,” LIRNEasia, 2019.
  • [15] E. D. Liddy, “Natural language processing,” 2001.
  • [16] U. Consortium et al., “The unicode standard: A technical introduction,” online document, http://www.unicode.org/unicode/standards/principles.html, 1996.
  • [17] D. C. Wimalasuriya and D. Dou, “Ontology-based information extraction: An introduction and a survey of current approaches,” Journal of Information Science, vol. 36, no. 3, pp. 306–323, 2010.
  • [18] J. Preston and M. J. Bishop, Views into the Chinese room: New essays on Searle and artificial intelligence. OUP, 2002.
  • [19] G. Dias and A. Goonetilleke, “Development of standards for Sinhala computing,” in 1st Regional Conference on ICT and E-Paradigms, 2004.
  • [20] G. V. Dias, “Challenges of enabling it in the sinhala language,” in 27th Internationalization and Unicode Conference, 2005.
  • [21] A. Herath, Y. Hyodo, Y. Kawada, T. Ikeda, and S. Herath, “A practical machine translation system from japanese to modern sinhalese,” Gifu University, pp. 153–162, 1994.
  • [22] A. B. Kanduboda, “The role of animacy in determining noun phrase cases in the sinhalese and japanese languages,” Science of words, vol. 24, pp. 5–20, 2011.
  • [23] D. Upeksha, C. Wijayarathna, M. Siriwardena, L. Lasandun, C. Wimalasuriya, N. H. N. D. De Silva, and G. Dias, “Implementing a Corpus for Sinhala Language,” in Symposium on Language Technology for South Asia 2015, 2015.
  • [24] D. Upeksha, C. Wijayarathna, M. Siriwardena, L. Lasandun, C. Wimalasuriya, N. H. N. D. de Silva, and G. Dias, “Comparison between performance of various database systems for implementing a language corpus,” in International Conference: Beyond Databases, Architectures and Structures. Springer, May 2015, pp. 82–91.
  • [25] R. Weerasinghe, D. Herath, and V. Welgama, “Corpus-based sinhala lexicon,” in Proceedings of the 7th Workshop on Asian Language Resources. Association for Computational Linguistics, 2009, pp. 17–23.
  • [26] F. Guzmán, P.-J. Chen, M. Ott, J. Pino, G. Lample, P. Koehn, V. Chaudhary, and M. Ranzato, “Two new evaluation datasets for low-resource machine translation: Nepali-english and sinhala-english,” arXiv preprint arXiv:1902.01382, 2019.
  • [27] P. Lison and J. Tiedemann, “Opensubtitles2016: Extracting large parallel corpora from movie and tv subtitles,” 2016.
  • [28] M. Z. Mohamed, A. Ihalapathirana, R. A. Hameed, N. Pathirennehelage, S. Ranathunga, S. Jayasena, and G. Dias, “Automatic creation of a word aligned sinhala-tamil parallel corpus,” in Engineering Research Conference (MERCon), 2017 Moratuwa. IEEE, 2017, pp. 425–430.
  • [29] F. Farhath, P. Theivendiram, S. Ranathunga, S. Jayasena, and G. Dias, “Improving domain-specific smt for low-resourced languages using data from different domains,” in Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC-2018), 2018.
  • [30] S. Fernando and S. Ranathunga, “Evaluation of different classifiers for sinhala pos tagging,” in 2018 Moratuwa Engineering Research Conference (MERCon). IEEE, 2018, pp. 96–101.
  • [31] N. Dilshani, S. Fernando, S. Ranathunga, S. Jayasena, and G. Dias, “A comprehensive part of speech (pos) tag set for sinhala language.” The Third International Conference on Linguistics in Sri Lanka, ICLSL 2017 …, 2017.
  • [32] S. Fernando, S. Ranathunga, S. Jayasena, and G. Dias, “Comprehensive part-of-speech tag set and svm based pos tagger for sinhala,” in Proceedings of the 6th Workshop on South and Southeast Asian Natural Language Processing (WSSANLP2016), 2016, pp. 173–182.
  • [33] S. Manamini, A. Ahamed, R. Rajapakshe, G. Reemal, S. Jayasena, G. Dias, and S. Ranathunga, “Ananya-a named-entity-recognition (ner) system for sinhala language,” in Moratuwa Engineering Research Conference (MERCon), 2016. IEEE, 2016, pp. 30–35.
  • [34] P. Bojanowski, E. Grave, A. Joulin, and T. Mikolov, “Enriching word vectors with subword information,” Transactions of the Association for Computational Linguistics, vol. 5, pp. 135–146, 2017.
  • [35] A. Joulin, E. Grave, P. Bojanowski, and T. Mikolov, “Bag of tricks for efficient text classification,” in Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, 2017, pp. 427–431.
  • [36] A. Joulin, E. Grave, P. Bojanowski, M. Douze, H. Jégou, and T. Mikolov, “Fasttext. zip: Compressing text classification models,” arXiv preprint arXiv:1612.03651, 2016.
  • [37] D. Herath, K. Gamage, and A. Malalasekara, “Research report on sinhala lexicon,” Language Technology Research Laboratory, UCSC.
  • [38] G. P. Malalasekera, “English-sinhalese dictionary.” 1967.
  • [39] M. Kulatunga. Madura english-sinhala dictionary - online language translator. [Online]. Available:
  • [40] A. Wasala and R. Weerasinghe, “Ensitip: a tool to unlock the english web,” in 11th international conference on humans and computers, Nagaoka University of Technology, Japan, 2008, pp. 20–23.
  • [41] B. Hettige and A. Karunananda, “Developing lexicon databases for english to sinhala machine translation,” in Industrial and Information Systems, 2007. ICIIS 2007. International Conference on. IEEE, 2007, pp. 215–220.
  • [42] Department of Official Languages, Sri Lanka. Tri-lingual dictionary. [Online]. Available:
  • [43] A. Weerasinghe and G. Dias, “Construction of a multilingual place name database for sri lanka,” 2013.
  • [44] G. A. Miller, “Wordnet: a lexical database for english,” Communications of the ACM, vol. 38, no. 11, pp. 39–41, 1995.
  • [45] Z. Wu and M. Palmer, “Verbs semantics and lexical selection,” in Proceedings of the 32nd annual meeting on Association for Computational Linguistics. Association for Computational Linguistics, 1994, pp. 133–138.
  • [46] J. J. Jiang and D. W. Conrath, “Semantic similarity based on corpus statistics and lexical taxonomy,” in Proc of 10th International Conference on Research in Computational Linguistics, ROCLING’97. Citeseer, 1997.
  • [47] N. de Silva, D. Dou, and J. Huang, “Discovering inconsistencies in pubmed abstracts through ontology-based information extraction,” in Proceedings of the 8th ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics. ACM, 2017, pp. 362–371.
  • [48] I. Wijesiri, M. Gallage, B. Gunathilaka, M. Lakjeewa, D. Wimalasuriya, G. Dias, R. Paranavithana, and N. De Silva, “Building a wordnet for Sinhala,” in Proceedings of the Seventh Global WordNet Conference, 2014, pp. 100–108.
  • [49] Sinhala wordnet. [Online]. Available:
  • [50] V. Welgama, D. L. Herath, C. Liyanage, N. Udalamatta, R. Weerasinghe, and T. Jayawardana, “Towards a sinhala wordnet,” in Proceedings of the Conference on Human Language Technology for Development, 2011.
  • [51] C. Liyanage, R. Pushpananda, D. L. Herath, and R. Weerasinghe, “A computational grammar of Sinhala,” in International Conference on Intelligent Text Processing and Computational Linguistics. Springer, 2012, pp. 188–200.
  • [52] A. Kanduboda and B. Prabath, “On the usage of sinhalese differential object markers object marker /wa/ vs. object marker /ta/,” Theory and Practice in Language Studies, vol. 3, no. 7, p. 1081, 2013.
  • [53] W. Dilshani and G. Dias, “A corpus-based morphological analysis of sinhala verbs.” The Third International Conference on Linguistics in Sri Lanka, ICLSL 2017 …, 2017.
  • [54] V. Welgama, R. Weerasinghe, and M. Niranjan, “Evaluating a machine learning approach to sinhala morphological analysis,” in Proceedings of the 10th International Conference on Natural Language Processing, Noida, India, 2013.
  • [55] N. Fernando and R. Weerasinghe, “A morphological parser for sinhala verbs,” in Proceedings of the International Conference on Advances in ICT for Emerging Regions, 2013.
  • [56] B. Hettige and A. S. Karunananda, “A morphological analyzer to enable english to sinhala machine translation,” in Information and Automation, 2006. ICIA 2006. International Conference on. IEEE, 2006, pp. 21–26.
  • Herath and Weerasinghe [2004] D. L. Herath and A. Weerasinghe, “A stochastic part of speech tagger for sinhala,” in Proceedings of the 06th International Information Technology Conference, 2004, pp. 27–28.
  • Jayasuriya and Weerasinghe [2013] M. Jayasuriya and A. Weerasinghe, “Learning a stochastic part of speech tagger for sinhala,” in Advances in ICT for Emerging Regions (ICTer), 2013 International Conference on.    IEEE, 2013, pp. 137–143.
  • Gunasekara et al. [2016] D. Gunasekara, W. Welgama, and A. Weerasinghe, “Hybrid part of speech tagger for sinhala language,” in Advances in ICT for Emerging Regions (ICTer), 2016 Sixteenth International Conference on.    IEEE, 2016, pp. 41–48.
  • Jayaweera and Dias [2011] A. Jayaweera and N. Dias, “Part of speech (pos) tagger for sinhala language,” 2011.
  • Jayaweera and Dias [2014a] ——, “Hidden markov model based part of speech tagger for sinhala language,” arXiv preprint arXiv:1407.2989, 2014.
  • Jayaweera and Dias [2014b] ——, “Unknown words analysis in pos tagging of sinhala language,” in Advances in ICT for Emerging Regions (ICTer), 2014 International Conference on.    IEEE, 2014, pp. 270–270.
  • Jayaweera and Dias [2016] M. Jayaweera and N. Dias, “Comparison of part of speech taggers for sinhala language,” 2016.
  • Manning et al. [2014] C. D. Manning, M. Surdeanu, J. Bauer, J. Finkel, S. J. Bethard, and D. McClosky, “The Stanford CoreNLP natural language processing toolkit,” in Association for Computational Linguistics (ACL) System Demonstrations, 2014, pp. 55–60. [Online]. Available:
  • Bandara et al. [2013] W. Bandara, V. Lakmal, T. Liyanagama, S. Bulathsinghala, G. Dias, and S. Jayasena, “A new prosodic phrasing model for sinhala language,” 2013.
  • Hettige and Karunananda [2006b] B. Hettige and A. S. Karunananda, “A parser for sinhala language-first step towards english to sinhala machine translation,” in Industrial and Information Systems, First International Conference on.    IEEE, 2006, pp. 583–587.
  • Hettige and Karunananda [2011] B. Hettige and A. Karunananda, “Computational model of grammar for english to sinhala machine translation,” in Advances in ICT for Emerging Regions (ICTer), 2011 International Conference on.    IEEE, 2011, pp. 26–31.
  • Dahanayaka and Weerasinghe [2014] J. Dahanayaka and A. Weerasinghe, “Named entity recognition for sinhala language,” in Advances in ICT for Emerging Regions (ICTer), 2014 International Conference on.    IEEE, 2014, pp. 215–220.
  • Kadupitiya et al. [2016] J. Kadupitiya, S. Ranathunga, and G. Dias, “Sinhala short sentence similarity calculation using corpus-based and knowledge-based similarity measures,” in Proceedings of the 6th Workshop on South and Southeast Asian Natural Language Processing (WSSANLP2016), 2016, pp. 44–53.
  • Kadupitiya et al. [2017] ——, “Sinhala short sentence similarity measures using corpus-based similarity for short answer grading,” in 6th Workshop on South and Southeast Asian Natural Language Processing, 2017, pp. 44–53.
  • Nanayakkara and Ranathunga [2018] P. Nanayakkara and S. Ranathunga, “Clustering sinhala news articles using corpus-based similarity measures,” in 2018 Moratuwa Engineering Research Conference (MERCon).    IEEE, 2018, pp. 437–442.
  • Arukgoda et al. [2014] J. Arukgoda, V. Bandara, S. Bashani, V. Gamage, and D. Wimalasuriya, “A word sense disambiguation technique for sinhala,” in 2014 4th International Conference on Artificial Intelligence with Applications in Engineering and Technology.    IEEE, 2014, pp. 207–211.
  • Marasinghe et al. [2002] C. Marasinghe, S. Herath, and A. Herath, “Word sense disambiguation of sinhala language with unsupervised learning,” in Proc. International Conference on Information Technology and Applications, 2002, pp. 25–29.
  • Weerasinghe et al. [2007] R. Weerasinghe, A. Wasala, V. Welgama, and K. Gamage, “Festival-si: A sinhala text-to-speech system,” in International Conference on Text, Speech and Dialogue.    Springer, 2007, pp. 472–479.
  • [75] L. Nanayakkara, C. Liyanage, P.-T. Viswakula, T. Nagungodage, R. Pushpananda, and R. Weerasinghe, “A human quality text to speech system for sinhala,” in Proc. The 6th Intl. Workshop on Spoken Language Technologies for Under-Resourced Languages, pp. 157–161.
  • Nadungodage et al. [a] T. Nadungodage, R. Weerasinghe, and M. Niranjan, “Speech recognition for low resourced languages: Efficient use of training data for sinhala speech recognition by active learning.”
  • Nadungodage and Weerasinghe [2011] T. Nadungodage and R. Weerasinghe, “Continuous sinhala speech recognizer,” in Conference on Human Language Technology for Development, Alexandria, Egypt, 2011, pp. 2–5.
  • Nadungodage et al. [2013] T. Nadungodage, R. Weerasinghe, and M. Niranjan, “Efficient use of training data for Sinhala speech recognition using active learning,” in Advances in ICT for Emerging Regions (ICTer), 2013 International Conference on.    IEEE, 2013, pp. 149–153.
  • Nadungodage et al. [2015] ——, “Speaker Adaptation Applied to Sinhala Speech Recognition.” Int. J. Comput. Linguistics Appl., vol. 6, no. 1, pp. 117–129, 2015.
  • Weerasinghe et al. [2005] R. Weerasinghe, A. Wasala, and K. Gamage, “A rule based syllabification algorithm for sinhala,” in International Conference on Natural Language Processing.    Springer, 2005, pp. 438–449.
  • Wasala et al. [2006] A. Wasala, R. Weerasinghe, and K. Gamage, “Sinhala grapheme-to-phoneme conversion and rules for schwa epenthesis,” in Proceedings of the COLING/ACL on Main conference poster sessions.    Association for Computational Linguistics, 2006, pp. 890–897.
  • Nadungodage et al. [b] T. Nadungodage, C. Liyanage, A. Prerera, R. Pushpananda, and R. Weerasinghe, “Sinhala g2p conversion for speech processing,” in Proc. The 6th Intl. Workshop on Spoken Language Technologies for Under-Resourced Languages, pp. 112–116.
  • Dias et al. [2013a] G. Dias, T. Patikirikorala, C. Arambewela, R. Darshana, and N. Alahendra, “Sinhala optical character recognition for desktops,” 2013.
  • Dias et al. [2013b] G. Dias, T. Patikirikorala, C. Arambewela, R. Darshani, and N. Alahendra, “Online sinhala handwritten character recognition for desktops,” 2013.
  • Ranmuthugala et al. [2006] M. Ranmuthugala, G. Pathiragoda, S. Jayasundara, G. Dias, and A. Karunananda, “Online sinhala handwritten character recognition on handheld devices,” Innovations for a Knowledge Economy, p. 1, 2006.
  • Weerasinghe et al. [2008] R. Weerasinghe, A. Wasala, D. Herath, and V. Welgama, “Nlp applications of sinhala: Tts & ocr,” in Proceedings of the Third International Joint Conference on Natural Language Processing: Volume-II, 2008.
  • Weerasinghe et al. [2006] A. Weerasinghe, D. Herath, and N. Medagoda, “A nearest-neighbor based algorithm for printed sinhala character recognition,” Innovations for a Knowledge Economy, p. 11, 2006.
  • Rajapakse et al. [1995] R. K. Rajapakse, A. R. Weerasinghe, and E. K. Seneviratne, “A neural network based character recognition system for sinhala script,” Department of Statistics and Computer Science, University of Colombo, 1995.
  • Premaratne and Bigun [2002] H. Premaratne and J. Bigun, “Recognition of printed sinhala characters using linear symmetry,” in The 5th Asian Conference on Computer Vision, 2002, pp. 23–25.
  • Dharmapala et al. [2017] K. Dharmapala, W. Wijesooriya, C. Chandrasekara, U. Rathnapriya, and L. Ranathunga, “Sinhala handwriting recognition mechanism using zone based feature extraction,” 2017.
  • Hettige and Karunananda [2007b] B. Hettige and A. S. Karunananda, “Transliteration system for english to sinhala machine translation,” in Industrial and Information Systems, 2007. ICIIS 2007. International Conference on.    IEEE, 2007, pp. 209–214.
  • Hettige and Asoka [2010] B. Hettige and S. K. Asoka, “An evaluation methodology for english to sinhala machine translation,” in Information and Automation for Sustainability (ICIAFs), 2010 5th International Conference on.    IEEE, 2010, pp. 31–36.
  • Hettige et al. [2016] B. Hettige, A. Karunananda, and G. Rzevski, “A multi-agent solution for managing complexity in english to sinhala machine translation,” Complex Systems: Fundamentals & Applications, vol. 90, p. 251, 2016.
  • Liyanapathirana and Weerasinghe [2011] J. Liyanapathirana and R. Weerasinghe, “English to sinhala machine translation: Towards better information access for sri lankans,” in Conference on Human Language Technology for Development, 2011, pp. 182–186.
  • Liyanapathirana [2013] J. Liyanapathirana, “A statistical approach to english and sinhala translation,” 2013.
  • Hettige and Karunananda [2009] B. Hettige and A. S. Karunananda, “Theoretical based approach to english to sinhala machine translation,” in 2009 International Conference on Industrial and Information Systems (ICIIS).    IEEE, 2009, pp. 380–385.
  • Hettige and Karunananda [2010] B. Hettige and A. Karunananda, “Varanageema: A theoretical basics for english to sinhala machine translation,” in Sri Lanka Association for Artificial Intelligence (SLAAI), 2010.
  • Hettige and Karunananda [2008] B. Hettige and A. S. Karunananda, “Web-based english-sinhala translator in action,” in 2008 4th International Conference on Information and Automation for Sustainability.    IEEE, 2008, pp. 80–85.
  • Tennage et al. [2017] P. Tennage, P. Sandaruwan, M. Thilakarathne, A. Herath, S. Ranathunga, S. Jayasena, and G. Dias, “Neural machine translation for sinhala and tamil languages,” in Asian Language Processing (IALP), 2017 International Conference on.    IEEE, 2017, pp. 189–192.
  • Tennage et al. [2018a] P. Tennage, A. Herath, M. Thilakarathne, P. Sandaruwan, and S. Ranathunga, “Transliteration and byte pair encoding to improve tamil to sinhala neural machine translation,” in 2018 Moratuwa Engineering Research Conference (MERCon).    IEEE, 2018, pp. 390–395.
  • Tennage et al. [2018b] P. Tennage, P. Sandaruwan, M. Thilakarathne, A. Herath, and S. Ranathunga, “Handling rare word problem using synthetic training data for sinhala and tamil neural machine translation,” in Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC-2018), 2018.
  • Ranathunga et al. [2018] S. Ranathunga, F. Farhath, U. Thayasivam, S. Jayasena, and G. Dias, “Si-ta: Machine translation of sinhala and tamil official documents,” in 2018 National Information Technology Conference (NITC).    IEEE, 2018, pp. 1–6.
  • Farhath et al. [2018b] F. Farhath, S. Ranathunga, S. Jayasena, and G. Dias, “Integration of bilingual lists for domain-specific statistical machine translation for sinhala-tamil,” in 2018 Moratuwa Engineering Research Conference (MERCon).    IEEE, 2018, pp. 538–543.
  • Weerasinghe [2003] R. Weerasinghe, “A statistical machine translation approach to sinhala-tamil language translation,” Towards an ICT enabled Society, p. 136, 2003.
  • Sripirakas et al. [2010] S. Sripirakas, A. Weerasinghe, and D. L. Herath, “Statistical machine translation of systems for sinhala-tamil,” in Advances in ICT for Emerging Regions (ICTer), 2010 International Conference on.    IEEE, 2010, pp. 62–68.
  • Jeyakaran [2013] M. Jeyakaran, “A novel kernel regression based machine translation system for sinhala-tamil translation,” 2013.
  • Pushpananda et al. [2013] R. Pushpananda, R. Weerasinghe, and M. Niranjan, “Towards sinhala tamil machine translation,” in Advances in ICT for Emerging Regions (ICTer), 2013 International Conference on.    IEEE, 2013, pp. 288–288.
  • Pushpananda et al. [2014] ——, “Sinhala-tamil machine translation: Towards better translation quality,” in Proceedings of the Australasian Language Technology Association Workshop 2014, 2014, pp. 129–133.