A plain text corpus, from the point of view of Computational Linguistics (CL) or Natural Language Processing (NLP), is a systematic collection of text that should have certain properties, such as representativeness, correctness (if it is for the standard written language), and sufficient quantity. For languages on which little or no NLP work has been done before, creating a good plain text corpus is the foundation for building NLP resources and tools for that language. A plain text corpus is essentially a collection of written text, or of transcribed speech, of a particular natural language, stored in electronic or digital form. Modern linguists and NLP researchers use corpora for statistical analysis, hypothesis testing, word-frequency analysis, and validating linguistic rules within a language territory. Once a corpus is annotated with richer linguistic knowledge about its structure, meaning, or usage, it can serve as training data for supervised learning when building tools for various NLP problems such as Part-of-Speech (POS) tagging, chunking, morphological analysis, and parsing. In this paper, we describe our efforts to prepare plain text corpora for three languages of the Purvanchal region of the central north-eastern parts of India and to annotate these corpora with POS and chunk tags. These languages are Bhojpuri, Maithili, and Magahi.
These languages are low-resource languages and are phylogenetically related, since they all belong to the same (Indo-Aryan, or Indic) language family. Globally, they are the first languages of approximately 50, 13, and 30 million speakers, respectively. Such an enormous speaker base is a good reason for researchers to understand and analyze the properties of these languages and to build language technologies for them. Earlier work (Kumar et al. (2011); Singh and Jha (2015)) began a project to collect corpora of 1 million words each for Bhojpuri and Magahi, of which 89997 and 63000 words, respectively, have been annotated at the POS level. Simultaneously, at our end, we have collected corpora for Bhojpuri, Magahi, and Maithili consisting of 704543, 424434, and 267881 words, respectively. We have also carried out manual POS annotation on 245482, 171509 and 208640 words, respectively, and chunk annotation on 60588, 0 and 10476 words, respectively.
To the best of our knowledge, our effort, which began in 2014, was one of the first to create annotated resources for these languages. At that time, there was no publicly available Maithili or Bhojpuri text corpus, and ours was the first attempt to build such corpora and annotate them with POS tags and chunks. The project also involved building language processing tools for these languages and, ultimately, machine translation systems for them. The project has now been completed, and we will soon be reporting those results.
These linguistically annotated basic resources will serve as the base for further development of language technology and tools for these languages, such as POS taggers, chunkers, morphological analyzers, morphological generators, machine translation, and cognate identification, even outside our team. This paper describes these basic resources for the three Purvanchal languages, along with basic statistical details at the word, morpheme, syllable, and character levels, and their similarities among themselves and with Hindi. These details include some simple measures of complexity and similarity. Apart from this, we also highlight some issues encountered during Part-of-Speech and chunk annotation for these languages. We have paid special attention to the similarities among these languages, as their differences are still a matter of debate. Inter-annotator agreement has been calculated using Cohen's Kappa. The annotated corpora have been employed to build Part-of-Speech taggers, chunkers, and a language identifier for the three languages.
2. Related Work
India, as is well known, is a multilingual country: it has 22 scheduled (constitutionally recognised) languages, written in at least 13 different extant scripts and several variations of them. Apart from the scheduled languages, there are at least 400 other languages. Most developed countries, especially those with a less diverse language base than India's, possess relatively abundant, easily accessible and, most importantly, good-quality corpora. Indian languages, on the other hand, have relatively little corpora given the large number of languages, although the degree of resource scarcity varies. This scarcity severely constrains language technology development and the application of state-of-the-art algorithms, particularly deep learning approaches.
Technology Development for Indian Languages, or TDIL (http://www.tdil.meity.gov.in) (Dash (2004)), a Government of India department under the Ministry of Electronics and Information Technology, has been actively promoting work on language resource and technology development for Indian languages, particularly the scheduled languages. As an example of work done as part of this initiative, machine-readable corpora of around 10M tokens of Indian languages were prepared. The languages include English, Hindi, Punjabi, Telugu, Kannada, Tamil, Malayalam, Marathi, Gujarati, Oriya, Bangla, Assamese, Urdu, Sindhi, Kashmiri, and Sanskrit. Of these, Urdu, Kashmiri, and Sindhi use Perso-Arabic scripts for their writing systems, while the remaining languages use scripts derived from the ancient Brahmi script. All these languages belong to either the Dravidian or the Indo-Aryan family. A preliminary analysis of some of the initial corpora is described in (Bharati et al. (2000)). The resources and tools or technologies developed for Indian languages are shared with TDIL and distributed through the Indian Languages Linguistic Data Consortium (IL-LDC, http://www.ildc.in/).
Efforts were made by research centres working with TDIL to create representative corpora for Indian languages, and basic analyses of such corpora were published. For example, the Bangla corpus was compiled from various disciplines such as literature, mass communication, and science. Chaudhuri and Ghosh (Chaudhuri and Ghosh (1998)) carried out statistical analysis at the word, character, and sentence levels on a monolingual corpus comprising 2.5 million words from the Anusaraka group (https://ltrc.iiit.ac.in/projects/Anusaaraka/anu_home.html) and 150,000 words from their own laboratory. A further corpus was crawled from the Bangla newspaper website Prothom-Alo, with attention to the balance and representativeness of the corpus, and its comparison with the earlier corpus was reported (Majumder and Arafat (2006)). In addition, to boost language technology development for Bangla, a Bangla translation of the Basic Travel Expression Corpus (BTEC) (Khan et al. (2014)) was compiled to develop an English-Bangla speech translation system in the travel domain.
Another corpus analysis under the TDIL program was for Telugu, a morphologically rich language, much more so than Bhojpuri, Maithili, and Magahi. It is also one of the most spoken languages of India (ranking third in number of speakers), based mainly in the southern states of Andhra Pradesh and Telangana. Its characteristics were analyzed using statistical tools on a compiled monolingual corpus of Telugu comprising 39 million words (Kumar et al. (2007)). Moreover, owing to the morphological richness of Telugu, lexical categories produced by a morphological analyzer were used as parts of speech (Badugu (2014)).
After Anusaraka and several other individual machine translation systems, the Indian-language-to-Indian-language machine translation (IL-ILMT) project was initiated to work on bidirectional machine translation between nine Indian languages. In this consortium-based project, 11 Indian institutes participated and designed a system called Sampark (http://ilmt.tdil-dc.gov.in/sampark/web/index.php/content). The system was developed by building the required 'horizontal' and 'vertical' resources, tools and other components. It is a transfer-based hybrid machine translation system that includes knowledge-based and statistical modules. So far, it has made publicly available (https://sampark.iiit.ac.in) Hindi-Urdu, Hindi-Punjabi, Hindi-Telugu, Hindi-Tamil, Hindi-Bengali, Hindi-Marathi, Hindi-Kannada, Tamil-Telugu, and Tamil-Malayalam machine translation systems. Our continuing work on building machine translation systems for the Purvanchal languages can be considered an extension of this project.
Manipuri (Meitei), which was added a few years ago to the list of scheduled languages, belongs to the Sino-Tibetan language family. A recent attempt (Singh et al. (2016)) was made to create corpora from two local Manipuri newspapers written in the Bangla script; the output of a Bangla OCR system was manually corrected to make the text Unicode-compatible. Similarly, for Manipuri speech corpora (Jha (2010)), speech data for Automatic Speech Recognition (ASR) was collected from native speakers, and telephonic speech data was transcribed.
Over the years, the Natural Language Processing and Computational Linguistics community has continued to work on the development of resources for low-resource Indian languages (although work on a few languages, like Hindi, has been going on for more than two decades) and to use them for several language-based applications (Jha (2010); Srivastava et al. ; Kumar et al. (2005)). With the growing use of the Internet in the age of digitalization, speech technology (Mohan et al. (2014)) also came into the picture for Indian languages, and has played a crucial role in the development of applications in various domains such as agriculture, health care, and other services for ordinary people (Kumar et al. (2005)). Researchers outside India have also contributed to building corpora for Indian languages; one example is HindEnCorp, a monolingual Hindi corpus and a Hindi-English parallel corpus comprising 786.60M and 3.88M tokens, respectively (Bojar et al. (2014)).
Coming to the three languages mentioned above, the first attempt to collect a monolingual Bhojpuri corpus was made with a web crawler covering dialects, genres, and domains of Bhojpuri (Singh ). This corpus was then annotated for POS, and a statistical tagger was developed, allowing comparison with two other Indo-Aryan languages, Hindi and Oriya (Singh and Jha (2015); Ojha et al. (2015)). A similar attempt (Kumar et al. (2011)) was made to collect a monolingual corpus of Magahi from blogs, stories, magazines, and transcripts of spontaneous conversations in public, private, and home domains, yielding a corpus of around 1 million tokens.
POS annotation used the BIS standard tagset (http://tdil-dc.in/tdildcMain/articles/134692Draft%20POS%20Tag%20standard.pdf), which consists of 11 coarse-grained tags and around 30 fine-grained tags for each language that it covers. This standard tagset was slightly modified and tested on around 50000 tokens (with fine-grained tags) with TnT, a Memory-based tagger, a Maximum-entropy tagger, an HMM-based tagger, and an SVM-based tagger (Kumar et al. (2012)).
3. About Bhojpuri, Maithili and Magahi
Bhojpuri belongs to the Indo-Aryan (Indic) language family. Its origin is often attributed to a place in Bihar called Ara/Bhojpur. According to the Census of India (2011) data, approximately 50,579,447 speakers use it as their first language. Ironically, a language with such a massive user base is yet to receive official recognition and attain the status of a scheduled language in India, because government agencies classify Bhojpuri as a dialect/regional variety of Hindi, in particular of 'Eastern Hindi'. The resultant lack of recognition, coupled with low employment opportunities owing to scarce resources for teaching and learning technology, has brought tremendous disadvantage to Bhojpuri and its speech community. Setting aside classifications based partially on socio-political considerations, it is justifiable to treat Bhojpuri as a living language with a significant user base, spoken in several countries besides India, including Fiji, Nepal, Surinam, and Mauritius. Moreover, its speakers would outnumber the speakers of several officially recognized languages and dialects of India, and also of several European languages. Across India, they are spread with high concentrations in the states of Uttar Pradesh, Bihar, Jharkhand, and Chhattisgarh.
Magahi became a distinct language around the tenth century, like other New Indo-Aryan (NIA) languages. Grierson classified Magahi under the Eastern group of the outer sub-branch, while scholars like Turner clubbed the Bihari languages with Eastern and Western Hindi (Masica, 1991). Chatterji (1926) gave an entirely different classification, in which Western Hindi is almost an isolated group, while Eastern Hindi, Bihari, and other languages are clubbed together in the Eastern group. Nevertheless, the proposition by Grierson (1903), given above, is the most widely accepted; Jeffers (1976) and others gave a similar categorization. Although Magahi is frequently mislabelled as a dialect of Hindi, it is not even part of the same sub-family as Hindi: it developed from the Magadhi/Eastern Apabhramsha branch of Magadhi Prakrit. Other languages developing from this branch are Maithili, Bhojpuri, Bangla, Assamese, and Oriya. This deduction follows Grierson's classification and is further supported by the grammars of Hindi and Magahi. On the basis of their features, Bangla, Oriya, Assamese, Maithili, Bhojpuri, and Magahi are close cognate sister languages, which means that Bhojpuri and Magahi are sister languages. Consequently, Magahi shares many characteristics with Bhojpuri, and it too has a massive user base, of 20.7 million speakers. Maithili is their first cousin, and languages like Bangla, Assamese, and Oriya are their second cousins (Verma, 1985). Today, Magahi is spoken chiefly in the states of Bihar, Jharkhand, West Bengal, and Orissa.
Maithili serves as the first language of approximately 33 million speakers, who live in the Indian state of Bihar and the south-eastern region of Nepal. It is a highly dialect-oriented language, with several variations due to social factors such as gender, education level, and caste. It has a close association with Sanskrit and has at least ten varieties in India and at least three regional dialects in Nepal. Social status is reflected through honorific agreement in this language (Yadava et al. (2019)).
Bhojpuri, Maithili and Magahi have Subject-Object-Verb (SOV) word order. They are written in the Devanagari and Kaithi scripts. Like many other modern Indo-Aryan languages, they are non-tonal. Word-formation in these languages lies somewhere between synthetic and analytical typology. Unlike Hindi, these languages lack ergativity. More details about their linguistic features are discussed in the section on annotation issues.
4. Text Corpus Creation for the Purvanchal Languages: Bhojpuri, Maithili and Magahi
For the project on which this study is based, we collected data from web pages and compiled the corpus under the supervision of qualified linguists who were native speakers of these languages and hence familiar with their writing systems. This last part, about writing, is only partially true, because none of these languages is a medium of education anywhere in India, and very few people actually read or write in these languages, even though they are spoken by a large number of people. It is partially true because all speakers of these three languages are also native speakers of Hindi, which uses the Devanagari script. These speakers will usually identify themselves as Hindi speakers, because Hindi is their main first language, not only for communication but also for education and official purposes, apart from English, which is still the de facto first official language of India (although, constitutionally, English is the second official language, after Hindi). None of these speakers, with one or two exceptions, regularly reads or writes in these languages.
We collected raw text in Bhojpuri, Magahi and Maithili from online newspapers, magazines, and other relevant resources. The collected data was made balanced and representative to some extent by including varied genres such as sports, editorials, politics, science fiction, entertainment, education, and daily news. Among magazines, we preferred [pAwI] for Bhojpuri and the novel [axrXaga] for Maithili. While building the corpora, we avoided data from blogs so as to maintain the linguistic authenticity of the sources for Bhojpuri and Maithili, since blog posts in these languages often show a high degree of code-mixing. However, a significant deficit of data sources for Magahi compelled us to obtain its data from various blogs, so as to at least make an approximate comparison possible in the analysis. This is also one of the reasons why every frequency distribution analysis (Section 4.2) is carried out in a two-fold manner: first on the entire corpora, and then with the number of sentences in the Bhojpuri and Magahi corpora restricted to that of Maithili (which has the smallest corpus). Most of the sources of the corpora that we used from the Web are listed in Appendix A.
4.1. Raw Corpus Cleaning and Issues
After extracting the corpus from the Web and other sources, we had to clean it in the usual way (such as removing unwanted characters and symbols), but also, in some cases, to convert the encoding to Unicode (UTF-8), because for most Indian languages Unicode is still not universally used. For the three Purvanchal languages, many sources, particularly those from printed material, used a proprietary encoding that has been popular for Desktop Publishing in Devanagari. Since the encoding converter is not perfect (the mapping between this encoding and Unicode is not straightforward), conversion to Unicode introduced some errors, leaving some 'noise' in the text corpus. Part of the cleaning process involved removing such noise as far as possible, given the time and resource constraints. Some errors introduced by the encoding conversion still remain.
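The generic part of this noise removal can be sketched as follows. This is a minimal illustration in Python, handling only generic Unicode debris (control and format characters, replacement characters, extra whitespace); the actual mapping table for the proprietary encoding is font-specific and is not reproduced here.

```python
import re
import unicodedata

def clean_line(line: str) -> str:
    """Remove generic Unicode noise from a line of Devanagari text (sketch)."""
    # Canonical (NFC) normalization merges decomposed vowel signs.
    line = unicodedata.normalize("NFC", line)
    # Drop control/format characters (stray controls, zero-width marks)
    # left over from imperfect encoding conversion.
    line = "".join(ch for ch in line if unicodedata.category(ch)[0] != "C")
    # Drop the Unicode replacement character inserted for unmappable bytes.
    line = line.replace("\ufffd", "")
    # Collapse runs of whitespace.
    return re.sub(r"\s+", " ", line).strip()
```

Note that zero-width characters are sometimes meaningful in Devanagari rendering; a production pipeline would treat them more carefully than this sketch does.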
Since there were not many significant lexical or structural mistakes, the corpus cleaning process, on the whole, was not very time-consuming; nor could we afford to spend too much time on it. We found several minor orthographic issues in the collected corpora of these three languages, such as for [Sa, Ra]. (The items within brackets are in WX notation; WX notation, https://en.wikipedia.org/wiki/WX_notation, is used throughout this paper, as is the usual practice for Indian languages for computational purposes.) Some examples of words where such problems were found are [akCara akRara] and [bAlA bAlA] in Maithili. The problem in these cases is that the orthographic practices of Hindi are carried over to the concerned languages, causing mistakes; e.g., [akCara] is written as [akSaar], which is the Hindi spelling of this word, roughly meaning syllable. Another common problem concerns the diacritic called Avagrah, which is frequently used in Bhojpuri, Magahi and Maithili and, for our purposes, indicates lengthening of the preceding vowel. In many cases this diacritic was missing, which can change the meaning of a word in these languages. We tried to introduce it where it was missing, but this process may not have been rigorous, owing to the lack of available annotator time.
Since these languages (previously considered dialects) do not have much standardization, words are not spelled consistently, adding to the problem of language processing. A word can have multiple spelling variations. Some examples of problems due to these reasons are given below:
Orthographic variations: As mentioned above, words having the same meaning can often be spelled in more than one way; i.e., there is a lot of orthographic variation, due to the lack of standardization for the three languages. This problem is present, to a much lesser extent, even in Hindi. For example, consider variants formed by replacing Anusvara with Anunasik, or Anusvara with half-nasal consonants (such as [samvawa/saMvawa], [samBala/saMBala/saBala]). Examples of other orthographic variants include [karanA/karie/kariye (a/e/ye/y)], [karAyI/karAI, uTAyI/uTAI (I/yI, I/hI, A/yA, i)], [bAyIM/bAyIZ], [karyavAI/kAryavAhI/kAravAhI], [BEyA/BaiA/BaiyA], [vAyasarAya/vAisarAya], and [suwala/sowala].
Particle identification: Particles are functional words which do not have very clear meanings, even as function words. Examples include the [vA, bA] and [mA] constructions. Bhojpuri and Magahi have two types of roots: one without [vA] and another with [vA] attached; e.g., [laikA] (boy) and [laIkavA] occur as masculine and non-masculine forms, while feminine words take the [yA] suffix. Maithili has a similar type of construction, with [mA] and [bA]; for example, [CoramA] (boy) and [CorabA]. It is debatable how these suffixes change the meanings. Strictly speaking, these are suffixes, not words, and so should not be called particles in the sense that Hindi has particles such as [hI] or [wo]. Still, they pose problems while annotating the corpus.
Incorrectly spelled words and foreign words: Words that are incorrect, or that are borrowed from languages other than the three chosen for analysis, are often not spelled correctly (or consistently), as there is no standard spelling. Standard spelling is a general problem for all three Purvanchal languages even for their native words; it is particularly problematic for borrowed words. For the subset of data that was annotated, we tried to filter out such problematic words, but this was not done thoroughly.
Numbers, abbreviations, symbols: We also had to consider alternative written forms for words such as [SrI, dA.], symbols (*, &), and numbers (3224.34).
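Some of the orthographic variation above can be collapsed automatically when matching word forms. The Python sketch below illustrates one such normalization for the Anusvara/Anunasik alternation mentioned earlier; the mapping is purely illustrative and makes no claim to completeness (the half-nasal-consonant variants, for instance, are not handled).

```python
ANUSVARA = "\u0902"   # Devanagari sign Anusvara
ANUNASIK = "\u0901"   # Devanagari sign Candrabindu (Anunasik)
NUKTA = "\u093c"      # Nukta, often written inconsistently

def normalize_variants(word: str) -> str:
    """Map a word to a canonical form for variant matching (sketch)."""
    # Treat Anunasik as equivalent to Anusvara, e.g. [bAyIZ] -> [bAyIM].
    word = word.replace(ANUNASIK, ANUSVARA)
    # Drop nukta, whose use varies across sources.
    word = word.replace(NUKTA, "")
    return word
```

Such a normalizer is useful for frequency counting and dictionary lookup, not for restoring a "correct" spelling, which these languages do not yet have.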
4.2. Frequency Distribution Analysis
The use of Corpus Linguistics is now very common, but as early as 1964, Yule (1964) had written:
“Without adequate knowledge of statistical information about different properties of a language, we can make mistakes in handling linguistic data as well as in observation.”
Statistical analyses of corpora can help us understand the characteristics of a language. In quantitative analysis, we classify different linguistic properties or features of a particular language, count them, and even construct more complex statistical models in an attempt to explain what is observed. It allows us to distinguish genuine reflections of the phenomena that describe the behavior of a language (or a variety of languages) from improbable ones.
On the other hand, qualitative analysis aims at providing a complete and detailed description of the phenomena observed in the corpus. It allows us to draw finer distinctions, because it is not necessary to shoehorn the data into a finite set of classifications. Both quantitative and qualitative analyses prove indispensable to the study of corpora.
It may be noted that (Kumar et al. (2007)) contains further corpus analyses of other Indian languages, such as Hindi and Telugu, which can be used as a point of reference. Further, where necessary, the analyses described here include comparisons with Hindi, using (Kumar et al. (2007)) as a reference for arguments. A Hindi monolingual corpus (http://www.cfilt.iitb.ac.in/Downloads.html; Europarl v7) has been used for comparison with the three languages. There are other, perhaps larger, monolingual corpora for Hindi, but they are usually not freely available and have not been used widely for linguistic studies or language processing experiments.
From the corpus, an initial statistical analysis in terms of the most basic units of language can be performed, e.g., word frequency and vocabulary size. We extracted the total number of sentences, tokens, types, and rare types. Here, the types represent the vocabulary, or lexical variety, of the language: the number of distinct (unique) tokens in the corpus. The rare types are those types which occur only once in the corpus, i.e., with frequency 1. The relationship between types and tokens is captured by the type-token ratio (TTR), which shows how often new word forms appear in the corpus (Sarkar et al. (2004)). Intuitively, it also roughly indicates the vocabulary size, and therefore the morphological complexity or richness; this is partly because we are counting word forms, not lexemes or 'root words'. We can thus examine the relation between lexical or morphological richness and TTR values: a high TTR indicates high lexical/morphological richness, and vice-versa. For the corpora that we created for the three Purvanchal languages, Table 1 lists the total number of sentences, tokens, types, and rare types, and also the TTR for each language.
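The quantities in Table 1 are straightforward to compute from a tokenized corpus. A minimal sketch in Python, assuming whitespace-tokenized input (the exact tokenization used for the tables is not spelled out here):

```python
from collections import Counter

def corpus_stats(tokens):
    """Tokens, types, rare types (hapaxes), and type-token ratio."""
    freq = Counter(tokens)
    n_tokens = len(tokens)
    n_types = len(freq)
    # Rare types occur exactly once in the corpus.
    n_rare = sum(1 for c in freq.values() if c == 1)
    return {"tokens": n_tokens, "types": n_types,
            "rare_types": n_rare, "ttr": n_types / n_tokens}
```

The TTR* column (equal corpus sizes) follows by truncating each token list to the size of the smallest corpus before calling the same function.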
| Language | Total Sentences | Total Tokens | Total Types | Rare Types | TTR | TTR* |
From Table 1, we find that the value of TTR is highest for Maithili among the three languages. This indicates that Maithili and Hindi are, respectively, the most and the least lexically/morphologically rich among the four languages. However, the second column shows that the number of sentences for Maithili is the smallest, and considerably smaller than for the other three languages. Thus, from Table 1 alone, we cannot rule out the possibility that the value of TTR could have been lower had we created a larger corpus of Maithili, comparable to the other three languages. This observation seems to be supported by the last column of the table, which suggests that, for corpora of equal sizes, all four languages have roughly the same TTR, and hence perhaps the same degree of lexical/morphological richness. While this is plausible, the previously available linguistic knowledge about the four languages does not agree with this observation based on the last column; it agrees more with the TTR column (without the same-corpus-size restriction). This could be interpreted in two ways: either the corpus size should not be the same for all languages for TTRs to be compared for lexical/morphological richness, or the quantitative approach has some limitations, because intuitively the same-corpus-size restriction seems reasonable for quantitative analysis.
In order to further explore this inference, we come to the second step of our statistical analysis, provided in Table 2: the percentage of types required to cover a certain percentage x of the corpus, where x can be 50, 60, 70, 80, or 90 (100% corpus coverage is already covered in Table 1). For instance, the value in the second row and first column is 74.76, which means that out of the 35253 types in the entire Magahi corpus, 74.76% of the types (approximately 26355) appear in 60% of the tokens (nearly 254660) chosen randomly out of the total number of tokens (424434) in the corpus. For all values of x, the percentage coverage of types is lowest for Bhojpuri. Similarly, except for one case where Magahi exceeds Maithili in percentage coverage of types, Maithili consistently exhibits the highest percentage coverage. These results seem to contradict the conclusion from the second last column of Table 1 (TTR on the full corpus) as well as previous linguistic knowledge. However, the results in Table 2 are based on the same percentages, not on the same corpus sizes in terms of number of tokens. Since different languages have different degrees of lexical richness, they will need different amounts of corpora to reach the same coverage of types. Therefore, it seems better to compare the TTRs in the second last column of Table 1, rather than the last column.
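Our reading of the computation behind Table 2 can be sketched as follows. The sampling details (single random sample, sampling without replacement) are assumptions here, as the exact procedure is not spelled out above.

```python
import random

def type_coverage(tokens, pct, seed=0):
    """Percentage of all types that appear in a random pct% sample of tokens.

    A sketch: one random sample without replacement; averaging over
    several samples would give a more stable estimate.
    """
    total_types = len(set(tokens))
    k = int(len(tokens) * pct / 100)
    sample = random.Random(seed).sample(tokens, k)
    return 100.0 * len(set(sample)) / total_types
```

For a morphologically rich language, a given fraction of tokens reaches a smaller fraction of types, which is why these percentages track lexical richness only indirectly.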
Taking this analysis further, we summarize in Table 3 the Cumulative Coverage (CC), the Frequency, and the Relative Frequency (RF, or probability) of occurrence of tokens in each language. The table shows the top ten most frequent tokens in the three languages, in decreasing order of frequency (or RF). A high value of CC indicates that the probability of occurrence, in terms of both rank and frequency, is high.
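The columns of such a table can be reproduced from token frequencies. A sketch, under our reading of CC as the running sum of relative frequencies over the ranked list (the text above does not define CC formally):

```python
from collections import Counter

def frequency_table(tokens, k=10):
    """Top-k tokens with frequency, relative frequency (RF), and
    cumulative coverage (CC, taken here as the running sum of RF)."""
    n = len(tokens)
    rows, cc = [], 0.0
    for tok, f in Counter(tokens).most_common(k):
        rf = f / n
        cc += rf
        rows.append((tok, f, round(rf, 3), round(cc, 3)))
    return rows
```

Comparing such ranked lists across languages, as done below, relies on the observation that related languages tend to share their most frequent function words.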
We first observe the relationship between Bhojpuri and Magahi using various patterns in Table 3. The token [ke] has the highest frequency in both languages. Further, the CC of [ke] is higher in Magahi (0.850) than in Bhojpuri (0.070). Moreover, this word does not appear in the corresponding list for Maithili, indicating a closer association between Magahi and Bhojpuri. Like [ke], the tokens [meM], [se], and [ki] occupy similarly high frequency ranks in Bhojpuri and Magahi, with CCs again higher in Magahi (1.135, 1.564, 2.277) than in Bhojpuri (0.091, 0.128, 0.182). Since genetically related languages share certain similarly frequent words by order and rank, the four tokens considered above strengthen the assertion that Bhojpuri and Magahi are genetically related languages.
Additionally, Maithili and Magahi show interesting patterns in terms of frequency and CC. Consider two of the most frequent words in the list: [I] and [gela]. These words do not appear in Bhojpuri's top-ten list, but they occur at the same ranks in the two languages. Further, [gela] is nearly equally frequent in Magahi, differing by just one rank between the two languages. The CCs of the two words are higher in Magahi, at 2.123 and 2.350, compared with 0.111 and 0.118 in Maithili, respectively. As in the preceding paragraph, this strengthens the assertion of a genetic relation between Maithili and Magahi. However, it also indicates that Bhojpuri and Magahi are perhaps closer than Magahi and Maithili.
Finally, looking at the total CC for the top ten words, the value is significantly lower for Maithili than for Bhojpuri and Magahi, even though Maithili has the smallest corpus. This value is lowest for Hindi, but that is most probably due to the large size of the Hindi corpus and the fact that Hindi is used much more for academic purposes, and therefore has a larger vocabulary. This too supports the linguistic intuition that Maithili is lexically or morphologically richer.
Since the three languages, and also Hindi, use the Devanagari script in almost the same way (i.e., their writing systems are similar), comparing statistics about the characters used in the corpora can also give some insights, or at least validate (or invalidate) some linguistic intuition about these languages. We have therefore performed an exercise similar to the word-level one at the character level. This is also supported by the fact that the Devanagari script has an almost one-to-one mapping from letters to phonemes. We have considered character n-grams, with n up to 7, as seven-character n-grams seem to capture the identity of a language quite well for language identification (Singh (2006b)). Figure 1 lists the TTR values for n-grams with n from 1 to 7 for the four languages. For character unigrams, the values are nearly zero for all the languages, because the number of possible unigrams is quite small (equal to the size of the Devanagari character set), while the total number of character tokens is much larger. This begins to change from bigrams upwards. The trends, in terms of numbers, seem to run opposite to those for word-level TTRs. We interpret this as meaning that a higher character n-gram TTR indicates greater lexical richness (relative vocabulary size) but lesser morphological complexity, which is in consonance with linguistic intuition. In phonological terms, it could also mean that a higher TTR for character (or phoneme) n-grams indicates greater phonological variety. We are not aware of any established linguistic intuition about this, but phonologists familiar with these languages may be able to comment on it better.
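The character-level TTRs behind Figure 1 follow the same definition as the word-level ones, applied to overlapping character n-grams. A minimal sketch:

```python
def char_ngram_ttr(text: str, n: int) -> float:
    """Type-token ratio over overlapping character n-grams of a text."""
    grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    return len(set(grams)) / len(grams) if grams else 0.0
```

As n grows, the number of possible n-grams explodes while the token count barely changes, which is why the TTR rises steeply from unigrams to 7-grams.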
4.2.3. Syllable or Akshara
Since there is an almost direct mapping from letters to phonemes in Devanagari, it is also possible to analyze the data in terms of orthographic syllables (which are very similar to phonological syllables). The term Akshara is used for such orthographic syllables. This term has been used since the time of the Paninian grammarians, but in our case we use a slightly modified version of the syllable that does not necessarily end with a vowel (as the traditional Akshara does). This modified sense came into use perhaps due to the requirements of building computing technology for Indic languages, which use Brahmi-derived scripts (see http://www.tdil-dc.in/tdildcMain/articles/737061Hindi%20Script%20Grammar%20ver%201.4-2.pdf).
In this section, we perform an exercise similar to those in the preceding two sections, but at the level of orthographic syllables or Aksharas. It has been observed and reported by multiple authors that the Akshara is a better linguistic unit than the character for many language processing and analysis purposes for Indian languages (Singh (2006a); Kumar et al. (2007)). The grammar for orthographic syllabification is based on the document defining the ISCII standard (see http://varamozhi.sourceforge.net/iscii91.pdf).
In general, the syllable structure is defined as onset-nucleus-coda (consonant-vowel-consonant). The core component of the syllable is the nucleus, which is a vowel; the remaining components - the onset and the coda - are optional, and are the preceding and the following consonant, respectively. An orthographic syllable, for us, is an orthographic unit of the Devanagari script (conjuncts included) which may or may not end with a vowel, but does correspond to one or more phonological syllables. Since the position of a syllable in a word (beginning, middle or end) is also sometimes important, we consider the numbers of orthographic syllables for each of these positions in Table 4. The TTR values for orthographic syllables are given in Table 5. The top ten most frequent orthographic syllables in Bhojpuri, Maithili, Magahi, and Hindi are given in Table 6.
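The orthographic syllabification described above can be approximated with a small amount of code. The sketch below is a deliberately simplified regular expression over Devanagari code points (consonant clusters joined by the virama, an optional dependent vowel sign, or an independent vowel); it ignores nukta, final viramas and several other cases handled by the full ISCII-based grammar:

```python
import re

# Simplified akshara pattern (a rough approximation, not the full ISCII grammar):
# (consonant + virama)* consonant [vowel sign] [candrabindu/anusvara/visarga],
# or an independent vowel.
AKSHARA = re.compile(
    r"(?:[\u0915-\u0939]\u094D)*"       # consonant cluster joined by virama (halant)
    r"[\u0915-\u0939][\u093E-\u094C]?"  # final consonant with optional vowel sign
    r"[\u0901-\u0903]?"                 # optional candrabindu/anusvara/visarga
    r"|[\u0904-\u0914][\u0901-\u0903]?" # or an independent vowel
)

def aksharas(word):
    """Split a Devanagari word into approximate orthographic syllables."""
    return AKSHARA.findall(word)

print(aksharas("नमस्ते"))  # the conjunct 'ste' stays in one akshara
```

Note that, in line with the modified definition used here, an akshara such as the first two in the example need not end with a (written) vowel sign.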
Observations from Table 4 are as follows. The percentages of syllables that occur in the beginning and at the end are almost the same in all four languages. The percentage of syllables which can occur in the middle is much larger for all languages, which is natural, since the middle part of a word can contain more than one syllable and most words are more than three syllables long. The differences in the percentage of syllables that occur in the middle for each of these languages do tell us indirectly about morphological complexity. Once again, Maithili has the highest percentage of syllables that occur in the middle, meaning it has longer and more complex words, in spite of its corpus being the smallest. Hindi also has a large percentage of syllables that occur in the middle, but that is most probably because it is used for academic purposes and has a lot of recently borrowed technical or semi-technical long words originating from Sanskrit. Such words are not normally used in spoken language, whereas the words in the other three languages (as per our corpus) are used mostly in spoken language, since these languages are not used much in the written form and, therefore, for academic purposes. The same also applies to the phonological complexities of these languages. In fact, it is difficult to separate phonological complexity from morphological complexity solely on the basis of these statistics.
Table 5 further strengthens the above observation about morphological and phonological complexities of the four languages. It also shows that comparing values for corpora of equal size (minimum of the four) is not a good idea. Table 6 is given mainly for reference, although it does not seem to be very informative, except in showing that Hindi is lexically richer than the other languages, which can be explained by the reasons mentioned above.
A morpheme is the smallest meaningful linguistic unit that forms part of a word. Since we are interested in the relative morphological complexities of the four languages, performing the same exercise as in the previous three sections for morphemes should be more reliably informative. Since we do not have a completely accurate algorithm or grammar for segmenting words into morphemes, we employ the Morfessor tool (Smit et al. (2014)) for obtaining morphemes from the corpora. This tool uses a machine learning based algorithm that segments words into morphemes, based on the corpus provided. Morfessor has two phases: training and decoding. The training data (an unannotated word set D_W = {w_1, ..., w_N}) is used as the input for the training phase, and the log-likelihood cost function learns the model parameters theta, defined by theta* = argmax_theta P(theta) P(D_W | theta). The learned parameters are then used for the morphological segmentation of new 'test' data in the decoding phase.
We used Morfessor 2.0 (https://morfessor.readthedocs.io/en/latest/), with unsupervised learning. Each word in the data is re-written as a sequence of its constituent morphs (hopefully approximating morphemes). This section provides statistics similar to those in the preceding sections, but at the morpheme level, as given in Table 7.
Apart from its purpose as a reference, this table provides two main takeaways. Frequencies and relative frequencies (RFs) do not, by themselves, provide much information that is generalizable, although they may be of interest to linguists who want to study these languages. For our purposes, the first takeaway is very similar to the one from the preceding tables. In other words, from the cumulative coverage (CC) values, we can observe the same trend: Maithili has the highest morphological complexity among the three Purvanchal languages. Although Hindi again has high values, this can be explained in the same way as earlier, i.e., a larger and more complex technical or academic vocabulary due to its usage for academic purposes and as a standard language at the national level.
The high CC values for Maithili and Hindi also indicate greater morpheme variety in these languages, but for different reasons, as stated above. This again strengthens the conclusion about morphological complexity and richness. Whereas for Maithili it is inherent in the grammar of the language (evident even with a very small corpus), for Hindi it is mainly due to complex words borrowed from Sanskrit.
A third observation is that the top ten morphemes are much less similar (phonologically and orthographically) for Maithili, as compared to the other languages, which also matches the existing linguistic knowledge about the typological nature of Maithili. The popular knowledge about Maithili is that it is, in fact, closer to Bangla than to Hindi. This suggests that it might be a good idea for future work to compare these three languages with Bangla (for more clarity, please see Section 7.1).
In fact, there are several common morphemes ([meM], [ke], [se], [na], [nA]) among the top ten for Bhojpuri, Magahi and Hindi, but none of them occurs in the list for Maithili. This might be the strongest evidence of the common knowledge that Maithili is typologically the most distinct among these four languages.
4.2.5. Zipf’s curve
Since all languages have a long-tailed distribution in terms of the frequencies and ranks of the words in the vocabulary, in this section we describe the result of our routine exercise of plotting frequencies versus ranks. According to Zipf's law (Piantadosi (2014)), the frequency of a word is inversely proportional to its rank. A small proportion of words are used almost all the time, while a vast proportion of words are used rarely.
f(r) ∝ 1/r

Here, r denotes the rank of a word, and f(r) is its frequency in a natural corpus (Powers (1998)).
The Zipf's curves for Bhojpuri, Maithili, Magahi and Hindi are shown in Figure 2. The first set of curves is obtained from the entire corpus of each language, while the second uses corpora truncated to the size of the smallest (Maithili) corpus in terms of the number of sentences. In this figure, the frequency is on a logarithmic scale. The plots for all four languages are similar, as expected, and roughly fit Zipf's law. It may be noticed that there are far more rare (frequency one) words in Hindi, which is due to the high usage of academic or technical words borrowed from Sanskrit.
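The rank-frequency data underlying such plots can be produced as follows. This is an illustrative sketch with a toy token list, not the code used for Figure 2:

```python
from collections import Counter
import math

def rank_frequency(tokens):
    """Return (rank, frequency) pairs, most frequent word type first."""
    counts = Counter(tokens)
    freqs = sorted(counts.values(), reverse=True)
    return list(enumerate(freqs, start=1))

# Under Zipf's law, log(frequency) falls roughly linearly with log(rank),
# so log(rank) + log(frequency) stays of the same order across ranks.
tokens = "ke ke ke se se na ghar ghar raam gail".split()
for rank, freq in rank_frequency(tokens):
    print(rank, freq, round(math.log(rank) + math.log(freq), 2))
```

Plotting the (rank, frequency) pairs on a log-log scale gives curves of the kind shown in Figure 2.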
4.2.6. Word length
Yet another parameter that can be used to compare languages based on basic quantitative measures is word length. Since the postposition or case marker (vibhakti) behaves differently in these languages, affecting word lengths because it sometimes attaches to the previous word as a morphological suffix, while at other times it is a separate word, word length analysis can give some more evidence in support of, or against, some existing linguistic knowledge. Also, word length has been an important factor in readability metrics (Kincaid et al. (1975)), although our concern here is not document readability but language complexity. Table 8 shows the word length statistics for the four languages and Figure 3
plots word length versus the total number of words of that length. From the table, the earlier observations about morphological complexity are again strengthened. While the minimum has to be one for all languages, the other four statistics (maximum, mean, median and standard deviation) have values in the same order as earlier, i.e., highest for Maithili among the three Purvanchal languages and lower for Bhojpuri and Magahi. Hindi has the highest values for the same reasons as stated earlier. The figure provides even more evidence for this. It also perhaps shows the typological distance of Maithili from the other two Purvanchal languages, with Hindi being a special case due to its relatively high standardization and academic usage.
On a side note, it may be seen that the word lengths go up to more than 60 characters. This is after removing some ‘words’ from this plot which were not really words but strings left in the corpus due to incorrect tokenization. Even after removing such non-words, on arranging the words in descending order of word length, we found (as expected) that the longest words were either hyphenated words used for literary purposes, or were from quotations from Sanskrit, which is morphologically very complex as well as a fusional language. Non-hyphenated native words in all the four languages rarely exceed the word length of 10, as can be seen from the figure.
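The statistics in Table 8 are simple to reproduce for any tokenized corpus. A minimal sketch, with a toy token list standing in for a real corpus:

```python
import statistics

def word_length_stats(tokens):
    """Min, max, mean, median and (population) std. dev. of word lengths in characters."""
    lengths = [len(t) for t in tokens]
    return {
        "min": min(lengths),
        "max": max(lengths),
        "mean": statistics.mean(lengths),
        "median": statistics.median(lengths),
        "std_dev": statistics.pstdev(lengths),
    }

print(word_length_stats(["घर", "रामा", "गइल", "बा"]))
```

For the starred columns of Table 8, the same function would be applied to corpora truncated to equal size.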
[Table 8: Word length statistics for the four languages (Min, Max, Max*, Mean, Mean*, Median, Median*, Std. Dev., Std. Dev.*)]
4.2.7. Entropy and Perplexity
Information Theory provides us with some measures that further allow us to compare the complexities and to consider the similarities and differences of the four languages. Such work has a great deal of precedent, including some specifically for Indian languages (Bharati et al. (2000)).
In terms of Information Theory, we can see a sentence as being generated from a probabilistic model such as the n-gram model, which is still heavily used, in spite of the availability of deep learning models which give better performance on many problems. This insight has been used for seemingly very different problems such as protein sequencing or DNA sequencing. In such a view, words are generated one after the other, in sequence, based on their n-gram probabilities. At each step of the generation of a sentence, when a word is picked from the vocabulary, there are as many choices as there are words in the vocabulary.
Based on this, we get two very basic but extremely useful measures. One is entropy, which is for the distribution as a whole, while the other is perplexity, which takes into account the sequencing of words in different sentences. Both of them are properties of the model that we have learnt from the data, but assuming the data is a good representative sample, these measures can give us an indirect estimate of the complexity of the language under consideration. It is important to note here that the complexity we are talking about is at a particular level of linguistic analysis, such as phonological, morphological, syntactic, semantic etc. Here we are mainly concerned with the morphological and syntactic levels. Related to these two levels is lexical complexity in terms of word types, or words in the vocabulary.
Table 9 lists the unigram entropy of the four languages, which we take as indicative of lexical complexity, i.e., the 'richness' of the vocabulary-based distribution. Word-level (or unigram) entropy can be defined as:

H = - Σ_{w ∈ V} p(w) log2 p(w)

Here, w is a word type in the vocabulary V, and H is the mean information content derived from the word types, such that:

(i) H = 0 if p(w) = 1 for some word type w; the distribution is determinate (providing no new information)
(ii) H > 0, otherwise
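This definition translates directly into code. A minimal sketch, assuming the corpus is already tokenized:

```python
from collections import Counter
import math

def unigram_entropy(tokens):
    """H = -sum over word types w of p(w) * log2 p(w)."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return sum(-(c / total) * math.log2(c / total) for c in counts.values())

# A determinate distribution (one word type) has zero entropy;
# more word types with flatter frequencies give higher entropy.
print(unigram_entropy(["ke"] * 5))                # 0.0 (single word type: determinate)
print(unigram_entropy(["ke", "se", "na", "me"]))  # 2.0 (uniform over 4 types)
```

A lexically richer corpus, with more word types and a flatter frequency distribution, thus yields a higher unigram entropy.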
Perplexity (PP) of a language can be seen as its weighted average branching factor, where the branching factor of a language is the number of possible words that can succeed any given word, based on the context as available in the corpus. It is, therefore, a mean representative of the number of possible succeeding words, given a word. We can roughly consider it as the mean branching factor and, therefore, an indicator of syntactic complexity. If the model is a good enough representation of the true distribution of the language, then the PP value indicates syntactic complexity.
Here we calculated the corpus perplexity using the following relation:

PP(C) = P(w_1 w_2 ... w_N)^(-1/N)

where the corpus C contains S sentences with N words in total.
We have used the SRILM toolkit (http://www.speech.sri.com/projects/srilm/) for obtaining the n-gram based statistical model and for calculating perplexity. We estimated the n-gram language model with backoff smoothing.
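We do not reproduce the SRILM computation here, but the quantity itself can be illustrated with a pure-Python unigram model. The sketch below uses add-one (Laplace) smoothing instead of SRILM's backoff smoothing, purely for simplicity:

```python
from collections import Counter
import math

def unigram_perplexity(train_tokens, test_tokens):
    """PP = P(w_1 ... w_N)^(-1/N) under a unigram model with add-one smoothing."""
    counts = Counter(train_tokens)
    vocab = len(counts) + 1                       # +1 pseudo-type for unseen words
    total = sum(counts.values())
    log_prob = 0.0
    for w in test_tokens:
        p = (counts[w] + 1) / (total + vocab)     # add-one (Laplace) smoothing
        log_prob += math.log2(p)
    return 2 ** (-log_prob / len(test_tokens))

train = "ke ghar me ke se na".split()
test = "ke ghar se".split()
print(round(unigram_perplexity(train, test), 2))
```

Extending this to bigrams and trigrams (as in Table 10) only changes the conditioning context of each probability.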
From Table 9, we can see (from the values for full corpus size) that the entropy (ignoring the negative sign) is the lowest for Maithili and the highest for Hindi. Among the three Purvanchal languages, Maithili is the lowest and Bhojpuri the highest. This again indicates higher morphological complexity for Maithili and high lexical richness for Hindi.
From Table 10, we observe widely varying values for unigrams, bigrams and trigrams, and also widely varying trends as we go from unigrams to trigrams. We also notice that the values for the minimum corpus size seem to be more meaningful for perplexity comparison than those for the full corpus size, which intuitively makes sense, since the dataset size has a large effect on perplexity values. Therefore, we focus on the equal corpus size values in this table. From these, we can observe that the value of perplexity for Bhojpuri becomes less than half when we go from unigrams to bigrams, but decreases only slightly when we go further to trigrams. This may be interpreted to mean that two-word groups (or chunks) are more syntactically relevant for Bhojpuri than three-word groups. In the case of Magahi, the value drops steeply from unigrams to bigrams and then does not decrease much. This perhaps means that even two-word groups are not very relevant for Magahi from a syntactic point of view. This statement has to be qualified by the fact that n-gram models can model syntax only to a limited extent, one of the reasons being that they cannot model long-distance dependencies. Therefore, it indicates the possibility that Magahi has many more long-distance dependencies. The same pattern as Magahi is evident in Maithili, but to an order of magnitude greater. Maithili thus relies more on morphology and long-distance dependencies than the other three languages. The trend for Hindi is very similar to that for Bhojpuri, which matches the linguistic intuition that these two languages have the highest similarity, at least as far as is visible in the written corpus available to us.
5. Experiments on Language Similarity: SSNGLMScore
Preliminary analyses have shown the potential benefits of building tools for resource-poor languages by leveraging resources from resource-rich languages (here, Hindi) based on similarity (Singh et al. (2008); Wu et al. (2019); Zoph et al. (2016); Tiedemann (2017)). The similarity among corpora gives a degree of homogeneity among languages (Porta and Sancho (2014)), motivated by their genetic relationships. Hindi, Bhojpuri, Maithili, and Magahi belong to the Indo-Aryan language family; hence they are genetically related languages and have typological similarity due to the inheritance of certain features from their common ancestor. We have calculated the character n-gram based cross-lingual similarity among these languages. To calculate the similarity between two languages, a character-based language model was trained on one language and tested on the other, and the sentence-wise sum of cross-entropy was obtained. These values are listed in Table 11.
We used the KenLM language modelling tool (https://kheafield.com/code/kenlm/) (Heafield et al. (2013)), trained on 50% of the corpus of one language and tested on the remaining 50% of the corpus of each language (including the training language), with 5-grams and Kneser-Ney smoothing. The test scores are scaled to a particular range by the Min-Max scaler algorithm, and we then obtain the score of a language model by taking the mean, in the following way:
SSNGLMScore = (1/S) Σ_{i=1}^{S} MinMaxScaled(LMScore(s_i))

Here, SS stands for the Scaled Sum of n-gram language model scores, s_i is the i-th sentence, and S is the total number of sentences in the test language (L_test).
Since, to the best of our knowledge, this measure has not been used before for estimating language similarity, we call this measure the Scaled Sum of N-Gram LM Scores or SSNGLMScore.
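The scaling-and-averaging step can be sketched as follows. The per-sentence scores below are hypothetical stand-ins for KenLM log probabilities, and performing the min-max scaling over the score list itself is a simplifying assumption:

```python
def ssnglm_score(sentence_scores):
    """Scaled Sum of N-Gram LM Scores: min-max scale per-sentence LM scores
    to [0, 1], then average over the S test sentences."""
    lo, hi = min(sentence_scores), max(sentence_scores)
    if hi == lo:                      # degenerate case: all scores equal
        return 0.0
    scaled = [(s - lo) / (hi - lo) for s in sentence_scores]
    return sum(scaled) / len(scaled)

# Hypothetical per-sentence log probabilities from a character 5-gram model:
scores = [-12.4, -9.8, -15.1, -11.0]
print(round(ssnglm_score(scores), 3))
```

A higher SSNGLMScore for a (training language, test language) pair then indicates that the model finds the test sentences relatively less surprising, i.e., the languages are more similar.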
[Table 11: Cross-entropy scores for each language pair]
It can be seen from Table 11 that the linguistic intuition about the similarities of these languages is corroborated by the values of the cross-entropy scores, with a couple of caveats. Maithili is the most distant from Hindi and Bhojpuri, but is close to Magahi, which matches linguistic knowledge. However, one observable discrepancy is that Magahi is almost equally close to all the other three languages.
So we look at Table 12, which gives the sequence based similarity scores. These are symmetric scores, as opposed to the cross-entropy scores. Bhojpuri is closest to Magahi and almost equally distant from Hindi and Maithili. Maithili is the most distant from Hindi, but almost equally close to the other two Purvanchal languages. Magahi is closest to Maithili, then to Bhojpuri, and least close to Hindi. This more or less matches the linguistic knowledge, but we do not see associativity in terms of language similarity, at least using this measure of similarity. Some of this may be due to chance, because of taking random samples of 50% corpus size for training and testing. Perhaps a k-fold cross-validation method would give us more reliable estimates.
[Table 12: Sequence based similarity scores (language model vs. test language)]
6. Bilingual Dictionary and Synset
Bilingual lexicon and WordNet (Miller et al. (2010)) are crucial linguistic resources for various natural language processing applications, such as Machine Translation (Madankar et al. (2016); Gracia et al. (2019); Zhang (2019)), Cognate Identification (Wang and Sitbon (2014); Nakov et al. (2009); Ciobanu and Dinu (2013); Nasution et al. (2016)), Part-of-Speech Tagging (Fang and Cohn (2017); Prabhu and Natarajan (2019)), Word-sense Disambiguation (Vaishnav and Sajja (2019); Jain and Lobiyal (2019)), Bilingual Lexicon Induction (Zhang et al. (2017); Hangya et al. (2018); Nasution et al. (2016)), Word Alignment (Probst and Brown (2002)), Cross-language Information Retrieval (Sharma and Mittal (2018); Madankar et al. (2016)), Word Translation (Lample et al. (2018)), and Cross-lingual Word-embedding Alignment (Duong et al. (2016); Ruder et al. (2019)).
Since the goal of the project of which this work was the initial part was to develop Machine Translation systems between Hindi and the three Purvanchal languages, we had undertaken to initiate the building of WordNets for these languages as well. There is already a linked WordNet for many Indian languages, called IndoWordNet (Bhattacharyya (2010)), in which the WordNets of different languages are linked together by mapping their synsets (sets of synonyms) etc. to those of Hindi, i.e., Hindi is used as the Inter-Lingual Index. However, as part of our project, we only created the synsets and mapped them to Hindi. We also created bilingual lexicons for the three languages in the usual way, suitable for machine translation.
The information in a synset includes IDs (corresponding to word meanings in Hindi as given in IndoWordNet), synonyms, lexical categories, and concepts for each entry. The sizes of these two resources are given in Table 13.
Among the created synsets, Bhojpuri has the highest total number of synonyms and synonyms per word. Although Magahi has more cumulative synonyms than Maithili, Maithili narrowly surpasses Magahi in synonyms per word. The synonym statistics are given in Table 14. It is not clear if any conclusions can be drawn from these statistics alone. The data could, however, be useful for studying, say, the degree of ambiguity in these languages.
7. Shallow Syntactic Annotation
Annotation enriches the raw corpus with information and allows both linguistic analysis and machine learning. However, annotation by humans is not guaranteed to be perfect. It has been shown that annotation possibly reflects the annotators' errors and predilections (Wynne (2005)). We, therefore, try to discuss some of the annotation issues that we faced while preparing the annotated corpus.
Syntactic annotation can be very useful for numerous problems in Natural Language Processing, whether it is Machine Translation, Automatic Speech Recognition or Information Retrieval. Since the project this work was part of is a continuation of the ILMT project (Section 2), we used the standard BIS (Bureau of Indian Standards) tagset for Indian languages (http://tdil-dc.in/tdildcMain/articles/134692Draft%20POS%20Tag%20standard.pdf) that was used in the ILMT project. The BIS tagset consists of 25 POS tags (Table 15) and 11 chunk tags (Table 16) for the Bhojpuri, Maithili, and Magahi languages (the chunk annotation was performed for Bhojpuri and Maithili only). Since we did not go beyond chunking, our syntactic annotation is basically shallow syntactic annotation, which can be useful for building, for example, shallow parsers for these languages (Sha and Pereira (2003); Sangal et al. (2007)).
[Table 15: The BIS POS tagset (25 tags), e.g., NNP (proper noun), SYM (special symbol)]
[Table 16: The BIS chunk tagset (11 tags), e.g., VGF (finite verb chunk), VGNF (non-finite verb chunk), VGINF (infinitival verb chunk)]
Although it is possible to perform annotation in a simple text editor, since the annotated file is stored as a text file, such an approach can lead to easily avoidable errors. It is, therefore, a common practice to use an annotation tool with a Graphical User Interface and some kind of error checking. For this, too, we used the same tool as was used for the ILMT project, namely Sanchay (Singh (2008)). As far as the actual file format or representation is concerned, we used the Shakti Standard Format (SSF), which is a robust recursive representation. It allows storing a wide variety of linguistic and non-linguistic information in one place. A detailed description of the SSF representation is given in (Bharati et al. (2014)). We try to obtain the same kind of statistics as above for POS tags, for pairs of words and POS tags, and for chunks. Table 17 shows the entropy values for these units.
Entropy values do not vary much for the three units over the four languages. Only two things are noticeable here from the second column of the table. One is that the Word_POS_Tag TTR is quite different for Hindi and, for the same language, the Chunk Tag TTR is very high, while the POS Tag TTR is almost the same for all the languages. This seems to indicate that the syntactic complexity of the languages in terms of POS tags is almost the same. The Word_POS_Tag TTR for Hindi is twice that of the other three languages, meaning that lexicalized POS tag complexity (say, as modelled by an HMM) is higher for Hindi. The Chunk Tag TTR being very high for Hindi indicates much greater syntactic complexity at the chunk tag level, which could partially be explained by the same reason of academic usage, but also by the larger data size. We do not really have as much varied data for the three Purvanchal languages as we have for Hindi. This factor might be affecting other measures too.
[Table 17: Entropy and TTR values for POS tags, Word_POS_Tag pairs and chunk tags for the four languages]
7.1. Differences between Bhojpuri, Magahi and Maithili as Compared to Hindi
Bhojpuri, Magahi and Maithili are traditionally, and sometimes officially, labeled as dialects of Hindi. Maithili attained the status of an independent language only recently, in 2002. Although these languages are often considered sub-languages under the umbrella language of Hindi, the fact is that they are not even part of the same sub-family as Hindi. Hindi has two varieties: 1) Western Hindi and 2) Eastern Hindi. Western Hindi (the most dominant variety being Khari Boli), which developed from Saurseni Apbhransha, is presently considered standard Hindi. The three Purvanchal languages, on the other hand, fall under the Eastern Hindi category. These three (and several other) languages developed from the "Magadhi Apbhransha", a branch of "Magadhan Prakrit".
Saurseni Prakrit
└── Saurseni Apbhransha
    ├── Braj Bhasha
    ├── Kannauji
    └── Khari Boli
        └── Western Hindi

Magadhan Prakrit
├── Ardha-Magadhi Apbhransha
│   └── Eastern Hindi
└── Magadhi Apbhransha
    ├── Bihari
    │   ├── Bhojpuri
    │   ├── Magahi
    │   └── Maithili
    ├── Bengali
    ├── Oriya
    └── Assamese
This deduction is validated by Grierson's classification and is strengthened even by the differences between Standard Hindi and these languages. Based on linguistic features, Bengali, Oriya, Assamese, Maithili, Bhojpuri and Magahi are close cognate languages, which means that Bhojpuri and Magahi are sister languages, Maithili is their first cousin, and languages like Bengali, Assamese and Oriya are their second cousins. Hindi is a distant relative, even though prolonged close contact (with Hindi being the official language of the Indian state) has made things a bit cloudy. Some examples of the differences are as follows:
Ergative Construction: Bhojpuri, Magahi and Maithili are ergative-less languages, because in them no overt ergative case marker appears on the noun; instead, the inflection of the verb represents the perfective aspect of the sentence. Standard Hindi is an ergative language. This marks a constructional difference between these languages and Hindi. Example:
Bhojpuri: [majistreta oha jAzca ka mukaxamA kAyama kare AxeSa xe xihalana .]
Hindi: [majistreta usa jAzca mukaxamA kAyama karane AxeSa xe .]
English Translation: Magistrate, after investigation, has ordered a case to be filed against him/her.
Magahi: [PuPiyA sAsa bolalana hala .]
Hindi: [PuPiya sAsa kahA .]
English Translation: Father-in-law’s sister has said.
Nominal Case Inflection: In these languages, case markers are generally separate words, but in some cases they get merged with nominals and pronominals. In particular, locative, instrumental and genitive case markers get fused with nominals and pronominals. This constructional feature is also quite different from that of Hindi. Example:
Bhojpuri: [jeke rauvAz majaXAre Xasori Aila rahalIM .]
Hindi: [jeke rauvAz majaXAre Xasori Aila rahalIM .]
English Translation: Whom you had drowned in midstream.
Maithili: [gAmaka hajAma A xusAXa jAwika Gara hinaka Garasaz kiCu xUrIpara Cala .]
Hindi: [gAmaka hajAma A xusAXa jAwika Gara hinaka Garasaz kiCu xUrIpara Cala .]
English Translation: The houses of the Barber and Dusadh were at some distance from his house.
Emphatic Expressions: In these languages emphatic particles are generally merged with nominals and pronominals, whereas in Hindi we have separate emphatic particles such as [BI], [hI]. Example:
Bhojpuri: [laikavo Ayala rahe .]
Hindi: [ladZakA BI AyA .]
English Translation: The boy too had come.
Bhojpuri: [laikave Ayala rahe.]
Hindi: [ladZakA hI AyA .]
English Translation: That [same] boy had come too.
Maithili: [eweka BelAka bAxo aXikArIgaNa kez mEWilIka prawi vixveRaka BavanA Cani .]
Hindi: [eweka BelAka bAxo aXikArIgaNa kez mEWilIka prawi vixveRaka BavanA Cani .]
English Translation: In spite of all this, officials have a sense of hostility towards Maithili.
Classifiers: Like Bengali and Oriya, the Bhojpuri, Magahi and Maithili languages also very often use classifiers with numerals. Classifier markers in these languages are "Tho", "The", "go", "khe", "goT" etc., whereas Hindi and its other dialects are classifier-less languages. This feature also marks a paradigm shift from the grammar of Hindi. Example:
Maithili: [eka gota sajjana .]
Hindi: [eka [human-classifier] sajjana .]
English Translation: One [human-classifier] gentleman.
Bhojpuri: [eka Te meharArU .]
Hindi: [eka [human-classifier] swrI .]
English Translation: One [human-classifier] woman.
Determiners: The wide use of determiners is again a unique feature of the Bhojpuri, Maithili and Magahi languages. In these languages, determiners can occur with almost all proper and common nouns and, similar to the emphatic markers, determiners also get fused with nominal forms. This constructional feature marks another point of contrast with Hindi.
In the Bhojpuri, Maithili and Magahi languages, a word ending with an [-a] or [-A] sound will take a [-vA] suffix, [-i/-I] will take a [-jA] suffix and [-u] will take an [-A] suffix. Examples:
Bhojpuri: [sasurA paisave xuaravA]
Hindi: [sasura pEsA xvAra]
English Translation: Father-in-law [showed] money [the] door
In the last example, the term father-in-law is used in its colloquial form and is meant to be derogatory, and does not literally mean father-in-law. Therefore, a better translation would be:
English Translation: The bastard [showed even] money [the] door.
Or, more accurately: English Translation: The bastard refused to even take money.
Bhojpuri, Maithili and Magahi are partially synthetic languages. The morphology of these languages differs significantly from the morphology of Hindi in some respects. Languages do have their idiosyncrasies, which are quite relevant in linguistic analysis, and therefore also for statistical analysis or machine learning. Their presence in a language cannot be neglected. We now discuss some annotation issues based on these grammatical idiosyncrasies.
7.2. Annotation Issues
There are many challenges in annotating the corpora of these less-resourced languages. Bhojpuri, Maithili and Magahi are partially synthetic languages. Thus, the use of embedded case markers, emphatic markers, classifiers, determiners etc. is frequent in these languages. This is responsible for many problems in annotation.
7.2.1. Embedded Case Markers
As already mentioned, in many morphological constructions of these languages, case markers get merged with nominals and pronominals. In particular, locative, instrumental and genitive case markers get fused with nominals and pronominals. Example:
Bhojpuri: [kehU] (PRP) [ke] (CC) [Gare] (NN) [aPIma] (NN) [nA] (NEG) [hoI] (VM) [.] (SYM)
Hindi: [kisI ke Gara meM aPIma nahIM hoga .]
English Translation: Opium will not be [allowed] in anybody’s house.
Bhojpuri: [hama] (PRP) [BUKe] (NN) [mara] (VM) [gailIM] (VAUX) [.] (SYM)
Hindi: [mEM BUKa se mara gayA .]
English Translation: I starved.
Bhojpuri: [mAI] (NN) [Gare] (NN) [suwala] (VM) [biyA] (VAUX) [.] (SYM)
Hindi: [mAz Gara meM so rahI hE .]
English Translation: Mother is sleeping in the house.
Maithili: [I] (DEM) [SoXaka] (NN) [viRaya] (NN) [aCi] (VAUX) [.] (SYM)]
Hindi: [yaha SoXa kA viRaya hE .]
English Translation: This is a matter for research.
Maithili: [rosadZA] (NN) [bajAra] (NN) [prasaMgasaz] (NN) [ehi] (PRP) [bAwaka] (NN) [ABAsa] (NN) [BetEwa] (VM) [aCi] (VAUX) [.] (SYM)
Hindi: [rosadZA bAjAra prasaMga se isI bAwa kA ABAsa ho rahA hE .]
English Translation: The same thing is being felt from the Rosada market episode.
Maithili: [gAmaka] (NN) [hajAma] (NNP) [A] (CC) [xusAXa] (NNP) [jAwika] (NN) [Gara] (NN) [hinaka] (PRP) [Garasaz] (NN) [kiCu] (QF) [xUrIpara] (NST) [Cala] (VM) [.] (SYM)
Hindi: [gAzva ke hajAma Ora xusAXa jAwi ke Gara inake Gara se kuCa xUrI para We .]
English Translation: The houses of the Barber and Dusadh castes were at some distance from his house.
Bhojpuri: [kekare] (PRP) [Gare] (NN) [biyAha] (NN) [bA] (VM) [?] (SYM)
Hindi: [kisa ke Gara meM vivAha hE ?]
English Translation: In whose house there is going to be a wedding?
Magahi: [jinhakaralela] (PRP) [luggA - PattA] (NN) [saba] (QF) [mAMgala] (VM) [hala] (VAUX) [.] (SYM)
Hindi: [jina logoM se saba kapadZe - lawwe mAzge hEM .]
English Translation: People who have been asked to give all the clothes.
The annotated examples above are problematic cases because the presently assigned tags are unable to capture the full meaning of such fused nominal and pronominal forms. This can be a matter for further work on these languages. Perhaps the tagset may also have to be revised a little to accommodate certain idiosyncrasies of these languages.
7.2.2. Diverse Realizations of Single Token
In Bhojpuri, Maithili and Magahi, several cases of diverse realization of a single token are found. The token [ke] has the largest number of distinct realizations in Magahi and Bhojpuri: 12 and 15 different tags, respectively, whereas in Maithili the token [je] has been assigned 14 tags for its different realizations. The list of such tokens is long in all three languages. Mostly these tokens have multiple functional and connotative meanings, making the process of annotation hard and error-prone.
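The extent of such token-level ambiguity can be measured directly from a POS-annotated corpus by counting the distinct tags assigned to each token. A minimal sketch in Python, assuming the corpus is available as (token, tag) pairs; the toy data below is illustrative, not drawn from the actual corpus:

```python
from collections import defaultdict

def tag_ambiguity(tagged_tokens):
    """Map each token to the set of distinct POS tags it receives,
    ranked with the most ambiguous tokens first."""
    tags_per_token = defaultdict(set)
    for token, tag in tagged_tokens:
        tags_per_token[token].add(tag)
    return sorted(tags_per_token.items(), key=lambda kv: -len(kv[1]))

# Illustrative toy corpus; the real input would be the full annotated corpus.
corpus = [("ke", "PSP"), ("ke", "CC"), ("ke", "PRP"),
          ("Gare", "NN"), ("bA", "VM")]
ranking = tag_ambiguity(corpus)
print(ranking[0])  # the most ambiguous token and its tag set
```

Run over the full corpora, such a ranking yields the 12-, 14- and 15-way ambiguous tokens mentioned above.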
7.2.3. Homophonous Forms
Like many other Indian languages, Bhojpuri, Maithili and Magahi have homophonous words in abundance. These words look or sound alike, but their POS tags differ, which can confuse the annotator. Example:
[kula] (JJ) [paisave] (NN) [orA] (VM) [gaila] (VAUX) [.] (SYM)
Hindi: [sArA pEsA hI samApwa ho gayA .]
English Translation: All the money is gone.
[wohAra] (PRP) [kOna] (WQ) [kula] (NN) [bA] (VM) [?] (SYM)
Hindi: [wumhArA kOna sA kula hE ?]
English Translation: Which one is your clan?
[laikA] (NN) [geMxa] (NN) [ke] (PSP) [loka] (VM) [lihalasa] (VAUX) [.] (SYM)
Hindi: [ladZake ne geMxa lapaka lI .]
English Translation: Boy caught the ball.
[sahara] (NN) [A] (CC) [gazvaI] (JJ) [loka] (NN) [meM] (PSP) [aMwara] (NN) [bA] (VM) [.] (SYM)
Hindi: [Sahara Ora grAmINa samAja meM aMwara hE .]
English Translation: There is difference between the urban and the rural society.
7.2.4. Amalgamated Emphatic Expressions
In these languages, emphatic particles are generally merged with nominals and pronominals, and hence they make the annotation task difficult. Example:
Magahi: [rAjapUwa] (NN) [kuramI] (NN) [loga] (NN) [wo] (RP) [apanahuz] (PRP) [hara] (NN) [jowa] (VM) [lewa] (VAUX) [hala] (VAUX) [.] (SYM)
Hindi: [rAjapUwa kurmI loga wo apane se hI Kewa jowa lewe hEM .]
English Translation: Rajput Kurmis cultivate fields by themselves.
Magahi: [bABano] (NN) [nahiMe] (NEG) [Ala] (VM) [.] (SYM)
Hindi: [brAhmaNa BI nahIM hI AyA .]
English Translation: The Brahmin too did not come after all.
Maithili: [apana] (PRP) [kaWyame] (NN) [Aro] (NN) [nUna] (NN) [mircAI] (NN) [lagA] (VM) [rahala] (VAUX) [CaWi] (VAUX) [.] (SYM)
Hindi: [apane kaWana meM Ora BI namaka mirca lagA rahA WA .]
English Translation: Was adding more spices to his statement.
Maithili: [saBa] (QF) [bAwa] (NN) [kahalAka] (VM) [bAxo] (NST) [mohana] (NN) [cala] (VM) [gela] (VAUX) [.] (SYM)
Hindi: [sArI bAwa kahane ke bAxa BI mohana calA gayA .]
English Translation: Mohan left even after saying everything.
Bhojpuri: [rAXA] (NNP) [I] (PRP) [kabbo] (NST) [nA] (NEG) [socale] (VM) [rahalI] (VAUX) [.] (SYM)
Hindi: [rAXA ne yaha kaBI BI nahIM socA WA .]
English Translation: Radha had never thought that.
Bhojpuri: [kula] (JJ) [paisave] (NN) [orA] (VM) [gaila] (VAUX) [.] (SYM)
Hindi: [sArA pEsA hI samApwa ho gayA .]
English Translation: All the money is gone.
Bhojpuri: [wabbo] (NST) [‘pAwI’] (NN) [ke] (PSP) [swara] (NN) [abahiyoM] (NST) [bahuwa] (INTF) [acCA] (JJ) [bA] (VM) [.] (SYM)
Hindi: [waba BI “pAwI” kA swara aBI BI bahuwa acCA hE .]
English Translation: Even then the [literary] level of ”Paati” [magazine] is still very good.
As these examples show, there are problems in annotating the available text with the given tagset. Perhaps there is a need for creating new subcategories of tags under the major categories of present BIS tagset used for Indian languages.
7.3. Annotated Corpus Statistics
Earlier reported work on Bhojpuri and Magahi had 120k and 63k tokens, respectively, annotated with a modified tagset of the Bureau of Indian Standards or BIS (Ojha et al. (2015); Kumar et al. (2012)). Bhojpuri, Maithili, and Magahi being under-resourced languages, researchers have not paid much attention to them, even though they have a large number of native speakers. There was no previous annotation work on the Maithili language. In our work, the cleaned raw corpus was used for POS tag and Chunk annotation. The POS annotated corpus comprises 0.2 million words each for Bhojpuri and Maithili and 0.1 million words for Magahi (to be exact, 245489, 208648 and 171538, respectively). Also, the Chunk annotation covers 9695 and 1954 sentences (60591 and 10476 tokens, respectively) for Bhojpuri and Maithili. For Hindi, we used the Hindi-Urdu dependency tree-bank dataset, which comprises 20866 sentences with 436940 POS-tagged tokens and 235426 Chunks. The problems encountered during annotation were highlighted in the previous section. The annotation statistics of these languages are given in Table 18.
Table 18: Annotation statistics of the three languages.
|Language|POS: # Sentence|POS: # Token|Chunk: # Sentence|Chunk: # Token|
|Bhojpuri|16067|245489|9695|60591|
|Maithili|12310|208648|1954|10476|
|Magahi|14669|171538|0|0|
The POS tag distribution in the annotated corpus is given in Table 19. The NN, VM and PSP tags cover approximately 50% of the Bhojpuri annotated corpus; the same holds for the NN, VM, PSP and SYM tags in Maithili and the NN, VM and SYM tags in Magahi. Moreover, tags with a coverage of less than 0.1% are considered rare tags: ECH, UNK and UT for Bhojpuri; ECH and UT for Maithili; and CL, UNK and UT for Magahi. These have nearly 0% coverage in the POS tag distribution of the annotated corpus.
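The coverage and rare-tag figures above reduce to a simple frequency count over the tag sequence. A sketch, using the 0.1% threshold from the text and toy counts for illustration:

```python
from collections import Counter

def tag_distribution(tags, rare_threshold=0.1):
    """Return (tag -> percentage coverage) and the list of rare tags
    whose coverage falls below rare_threshold percent."""
    counts = Counter(tags)
    total = sum(counts.values())
    percentages = {t: 100.0 * c / total for t, c in counts.items()}
    rare = [t for t, p in percentages.items() if p < rare_threshold]
    return percentages, rare

# Toy tag sequence: NN covers half the corpus; ECH occurs once in
# 2000 tags (0.05%), so it falls below the 0.1% rare-tag threshold.
tags = ["NN"] * 1000 + ["VM"] * 600 + ["PSP"] * 399 + ["ECH"]
pct, rare = tag_distribution(tags)
print(round(pct["NN"], 1), rare)  # 50.0 ['ECH']
```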
In the chunk-annotated corpus, the Noun Phrase (NP) chunk is the most frequent compared to other phrases in Bhojpuri and Maithili, while the Fragment Phrase (FRAGP) chunk has the minimum occurrence, as shown in Table 20.
[Table 19: POS tag frequencies and percentages for the Bhojpuri, Maithili, Magahi and Hindi annotated corpora.]
[Table 20: Chunk tag frequencies and percentages for the chunk-annotated corpora.]
7.4. Baseline for POS Tagging and Chunking for Purvanchal Languages
Since the objective of this work is to serve as the basis for further work on these languages, we finish the section on annotation by describing a reasonably good baseline, which can be treated as a lower bound for later research on building POS taggers and Chunkers for these languages.
As part of our experiment with this baseline, we compare our results with those for the Hindi part of the Hindi-Urdu dependency tree-bank for automatic annotation. Since we suspect that the quality of the data and annotation is not as good as, say, for Hindi, we also consider these experiments a kind of additional validation that our data may be useful for the purposes for which it was created. The idea is that if a reasonably good machine learning algorithm is able to give tolerable predictions (which could be improved in future) using this data, then the data is not useless, in spite of its shortcomings.
For the experiments, we used a traditional state-of-the-art probabilistic machine learning technique for sequence labeling, Conditional Random Fields (CRF). The CRF model provides surprisingly good results, even in the absence of an ample amount of annotated data. The CRF++ toolkit (https://taku910.github.io/crfpp/) with the default feature set and parameters was used for POS tagging and Chunking. The obtained results are compared with those on the Hindi dependency treebank dataset in terms of Accuracy, Weighted Precision, Weighted Recall, and Weighted F-score, shown in Tables 21 and 22.
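CRF++ expects training data with one token per line, columns separated by whitespace or tabs with the label in the last column, and a blank line separating sentences. A small sketch of converting POS-annotated sentences into this format (the sample sentences are taken from the examples above):

```python
def to_crfpp_format(sentences):
    """sentences: list of sentences, each a list of (token, tag) pairs.
    Returns the contents of a CRF++ training file as a string:
    one 'token<TAB>tag' line per token, blank line between sentences."""
    blocks = []
    for sent in sentences:
        blocks.append("\n".join(f"{token}\t{tag}" for token, tag in sent))
    return "\n\n".join(blocks) + "\n"

sents = [[("hama", "PRP"), ("BUKe", "NN"), ("mara", "VM"), ("gailIM", "VAUX")],
         [("mAI", "NN"), ("Gare", "NN"), ("suwala", "VM"), ("biyA", "VAUX")]]
print(to_crfpp_format(sents))
```

Training and tagging then use the standard CRF++ commands, `crf_learn template train.txt model` and `crf_test -m model test.txt`.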
For POS tagging, we get an 88% F-score for Bhojpuri, which is not bad considering that we used the default template. For Maithili and Magahi, the values are a bit below the 80% mark, but that can be explained by the lower quantity of data. Compared to Hindi, the results are not unexpected.
For Chunking, the results for Bhojpuri and Maithili are even better, being above 90%, although much lower than the 99% for Hindi. These values are higher for Chunking than for tagging because the number of Chunk tags is less than half of the number of POS tags. Moreover, only two Chunk tags (noun phrase and finite verb phrase) cover most of the corpus for all three languages.
8. An Experiment on Language Identification
Language Identification (LI) on text is one of the most basic tasks in natural language processing: determining the language of the provided text, where the input is a word, a sentence, a paragraph, or even a document or set of documents. LI aims to mimic the human ability to recognise or identify a language without actually knowing that language. Multiple approaches have been proposed over the years that infer the language without human intervention. According to Simons et al. (Simons and Fennig (2017)), LI approaches should be able to discern thousands of human languages.
LI is one of the oldest problems in Natural Language Processing. The text displayed on computing devices is made up of a stream of characters (or glyphs), digitised by a particular encoding scheme. Most troubling for LI is the use of either isomorphic or proprietary encodings to encode text in some languages where standardization is lacking or has not yet been fully adopted.
Perhaps the most important advance was the method of Cavnar et al. (Cavnar et al. (1994)), which popularised the use of character n-gram models for automatic LI. LI is useful in various NLP applications, including authorship profiling, machine translation, information retrieval, lexicography and many more. Moreover, the output of an LI system is utilized in adapting NLP tools that require annotated data, such as POS taggers and Chunkers, to resource-poor languages.
Earlier work considers encoding (Singh (2006b); Singh and Gorla (2007)), characters (Singh (2006b); Singh and Gorla (2007); Baldwin and Lui (2010); Damashek (1995)), character combinations (Windisch and Csink (2005); Sterneberg (2012); Banerjee et al. (2014); Elfardy and Diab (2013)), morphemes (Marcadet et al. (2005); Romsdorfer and Pfister (2007); Anand (2014)), words (Rehurek and Kolkus (2009); Adouane and Dobnik (2017)), word combinations (Çöltekin and Rama (2016); Singh (2006b)), syllables (Yeong and Tan (2011); Murthy and Kumar (2006)), syntax (Alex (2005); Martinc et al. (2017)) and chunks (You et al. (2008); Elfardy et al. (2014)) as features. Even though the LI task is considered solved for distant languages (McNamee (2005)), it is still challenging to identify closely related languages. Recently, close attention has been paid to similar or closely related languages, such as the Purvanchal languages we deal with in this work.
The more similar the languages are, the more challenging it is for an LI system to discriminate them or identify them correctly. Instead of plain n-gram models, which consider only short local sequences of characters or other units, the approaches to LI for closely related languages are mostly based on sequential algorithms. One of the earlier works on identification of closely related Indian languages applied pair-wise binary classification to the Indo-Aryan (Hindi, Marathi, Punjabi, Oriya, Assamese and Bengali) and Dravidian (Tamil, Telugu, Kannada and Malayalam) language families. For that LI system, characters and aksharas were considered the smallest appropriate units for applying multiple learning methods (Murthy and Kumar (2006); Singh (2006b)). Recently, VarDial 2018 (http://alt.qcri.org/vardial2018/index.php?id=campaign) organized a shared task on language identification which included Indo-Aryan languages, i.e., Hindi, Bhojpuri, Magahi, Braj and Awadhi, where the dataset included 15000 instances per language. HeLI (Jauhiainen et al. (2019)) is a character-level n-gram language model with an adaptation technique, which achieved a macro F-score of 0.958 (Zampieri et al. (2018)).
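For reference, the character n-gram method of Cavnar et al. amounts to building a ranked n-gram profile per language and classifying a text by the "out-of-place" distance between its profile and each language profile. A minimal pure-Python sketch; the toy transliterated training strings are illustrative only, whereas the real system would be trained on the Devanagari corpora:

```python
from collections import Counter

def profile(text, n=3, top=300):
    """Ranked list of the most frequent character n-grams (orders 1..n)."""
    grams = Counter()
    padded = " " + text + " "
    for k in range(1, n + 1):
        for i in range(len(padded) - k + 1):
            grams[padded[i:i + k]] += 1
    return [g for g, _ in grams.most_common(top)]

def out_of_place(doc_profile, lang_profile):
    """Cavnar-Trenkle out-of-place distance between two ranked profiles."""
    rank = {g: r for r, g in enumerate(lang_profile)}
    penalty = len(lang_profile)  # penalty for n-grams unseen in the language
    return sum(rank.get(g, penalty) for g in doc_profile)

def identify(text, lang_profiles):
    doc = profile(text)
    return min(lang_profiles, key=lambda lang: out_of_place(doc, lang_profiles[lang]))

# Toy profiles from transliterated snippets (hypothetical sample text).
profiles = {"bho": profile("hama gaili rahal bA bAte bhail"),
            "hin": profile("maiM gayA thA hai hogA kar rahA")}
print(identify("bA rahal bhail", profiles))
```

This is the baseline that sequential and adaptive models such as HeLI improve upon for closely related languages.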
We used a convolutional neural network (ConvNet) for the LI task at the character level. ConvNets have earlier been applied to text classification (Zhang et al. (2015)). The ConvNet takes as input all possible characters used in the script of the language of the corpus (Devanagari for Bhojpuri, Maithili, Magahi and Hindi) to create character embeddings. Here, a total of 115 characters were extracted from the corpora of all the languages, written in Devanagari script, including consonants, vowels, special symbols, digits etc. The input sentences are quantized lengthwise, which bounds the dimension of the sequence of one-hot character vectors. These vectors or embeddings are passed to a convolutional layer to extract language-specific features, and the optimal features are selected by applying max-pooling over them. The pair of convolution and max-pooling layers is applied sequentially six times. Finally, three fully connected layers with a non-linear function are applied after the pooling layers, of which the final fully connected layer outputs over the number of languages in our training corpus. The rest of the parameters and hyper-parameters, viz. batch size, learning rate, weight decay, optimizer, etc., for training the model are the same as in the previously reported text classification work. The result is given in Table 23. It has been obtained by using 20% of the corpus as validation data, while the rest is used as training data for the neural network.
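The character quantization step described above can be sketched as follows: each sentence is mapped to a fixed-length sequence of one-hot vectors over the alphabet, with out-of-alphabet characters mapped to zero vectors. The tiny Latin alphabet here stands in for the 115 Devanagari characters; `max_len` is an illustrative truncation length:

```python
def quantize(sentence, alphabet, max_len=10):
    """One-hot encode a sentence at character level, truncated or
    zero-padded to exactly max_len positions. Characters outside the
    alphabet map to an all-zero vector."""
    index = {ch: i for i, ch in enumerate(alphabet)}
    vectors = []
    for ch in sentence[:max_len]:
        vec = [0] * len(alphabet)
        if ch in index:
            vec[index[ch]] = 1
        vectors.append(vec)
    while len(vectors) < max_len:  # pad short sentences with zero vectors
        vectors.append([0] * len(alphabet))
    return vectors  # shape: max_len x |alphabet|

alpha = "abcdefghijklmnopqrstuvwxyz "
mat = quantize("bA hai", alpha, max_len=8)
print(len(mat), len(mat[0]))  # 8 27
```

The resulting fixed-size matrix is what the first convolutional layer consumes.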
While we get better language identification results in our experiments than the shared task on Indo-Aryan languages, this may be partly because we tried only four languages.
Apart from providing a baseline for comparison in future work, this LI experiment also acts as a last check on the quality of the data. These languages are quite close, and almost none of the annotators use them for writing. Moreover, there is a complete lack of standardization, and the boundaries between these languages are sometimes blurred, as they are spoken mostly in a contiguous area. Each of them has numerous varieties, which change gradually, at least with geographical location. This means that in the border areas it is sometimes not clear which language is being spoken. Also, the annotators are from all over this area and are familiar with different varieties of these languages, so there was often disagreement among them. The LI experiment at least shows that the data can be discriminated based on the language it is supposed to be from, using a standard LI algorithm for closely related languages.
9. Conclusion and Future work
We collected raw corpora from various sources (as mentioned in the Appendix) covering different genres and domains for the Bhojpuri, Magahi, and Maithili languages. After cleaning the raw corpora, we obtained 51374, 31251, and 19948 sentences, respectively; out of these, 16067, 14669, and 12310 sentences are annotated for Part-of-Speech tagging. We performed statistical analysis at the character, word, morpheme, and syllable levels to exhibit the agglutinative and word-sharing properties. We computed the unigram (or word), POS, and POS-word entropy of our corpus for the three languages and the n-gram character-based cross-entropy scores among the languages, including Hindi, which show a degree of similarity among the languages. Additionally, the reliability of the manually annotated sentences is estimated through Cohen's kappa Inter-Annotator Agreement and found to be 0.92, 0.64, and 0.74 for Bhojpuri, Magahi, and Maithili, respectively. The highest inter-annotator agreement score, for Bhojpuri, indicates that its annotation by different linguists is in the closest agreement. We discussed some of the annotation issues for all three languages. The POS tagged data was then annotated with Chunk tags: Bhojpuri (9695 sentences) and Maithili (1954 sentences). We built an initial POS tagger, Chunker and language identifier by using these resources. Also, bilingual dictionaries from these three Purvanchal languages to Hindi and synsets for IndoWordNet were prepared.
To the best of our knowledge, such corpora and the corresponding syntactic annotation with POS tags and Chunks, as well as the synsets (WordNet) and bilingual dictionaries, were not available so far for these low-resource Purvanchal languages. The collected and annotated corpora can be used as the basis for the creation of further linguistic resources for these languages. These linguistic resources will in turn make it possible to create machine translation systems for these languages.
The results for the initial POS tagger and Chunker were reasonably good, i.e., good enough to be used as baselines for future work. The language identification results were better than reported earlier, and they also confirm that the data, though problematic, is far from useless.
This section lists the online sources from which different parts of the data for the Purvanchal languages were collected, along with examples of their annotated sentences.
http://magahi-sahitya.blogspot.com/default: the above blog was a sort of collection of Magahi (TeMTa) words.
- Identification of languages in algerian arabic multilingual documents. In Proceedings of the Third Arabic Natural Language Processing Workshop, pp. 1–8. Cited by: §8.
- An unsupervised system for identifying english inclusions in german text. In Proceedings of the ACL Student Research Workshop, pp. 133–138. Cited by: §8.
- Language identification for transliterated forms of indian language queries. In Working Notes of Forum for Information Retrieval Evaluation (FIRE), Cited by: §8.
- Morphology based pos tagging on telugu. International Journal of Computer Science Issues (IJCSI) 11 (1), pp. 181. Cited by: §2.
- Language identification: the long and the short of the matter. In Human language technologies: The 2010 annual conference of the North American Chapter of the Association for Computational Linguistics, pp. 229–237. Cited by: §8.
- A hybrid approach for transliterated word-level language identification: crf with post-processing heuristics. In Proceedings of the Forum for Information Retrieval Evaluation, pp. 54–59. Cited by: §8.
- Basic statistical analysis of corpus and cross comparison among corpora. Technical Report of Indian Institute of Information Technology. Cited by: §2, §4.2.7.
- Ssf: a common representation scheme for language analysis for language technology infrastructure development. In Proceedings of the Workshop on Open Infrastructures and Analysis Frameworks for HLT, pp. 66–76. Cited by: §7.
- Indowordnet. In In Proc. of LREC-10, Cited by: §6.
- HindEnCorp-hindi-english and hindi-only corpus for machine translation. In LREC, pp. 3550–3555. Cited by: §2.
- N-gram-based text categorization. In Proceedings of SDAIR-94, 3rd annual symposium on document analysis and information retrieval, Vol. 161175. Cited by: §8.
- A statistical study of bangla corpus. In Int. Conf. on Computational Linguistics, Speech and Document Processing, Cited by: §2.
- A dictionary-based approach for evaluating orthographic methods in cognates identification. In Proceedings of the International Conference Recent Advances in Natural Language Processing RANLP 2013, Hissar, Bulgaria, pp. 141–147. External Links: Cited by: §6.
- Discriminating similar languages with linear svms and neural networks. In Proceedings of the Third Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial3), pp. 15–24. Cited by: §8.
- Gauging similarity with n-grams: language-independent categorization of text. Science 267 (5199), pp. 843–848. Cited by: §8.
- Language corpora: present indian need. In Proceedings of the SCALLA 2004 Working Conference, pp. 5–7. Cited by: §2.
- Learning crosslingual word embeddings without bilingual corpora. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, Austin, Texas, pp. 1285–1295. External Links: Cited by: §6.
- AIDA: identifying code switching in informal arabic text. In Proceedings of The First Workshop on Computational Approaches to Code Switching, pp. 94–101. Cited by: §8.
- Sentence level dialect identification in arabic. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pp. 456–461. Cited by: §8.
- Model transfer for tagging low-resource languages using a bilingual dictionary. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), Vancouver, Canada, pp. 587–593. External Links: Cited by: §6.
- Results of the translation inference across dictionaries 2019 shared task. Cited by: §6.
- Two methods for domain adaptation of bilingual tasks: delightfully simple and broadly applicable. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 810–820. Cited by: §6.
- Scalable modified Kneser-Ney language model estimation. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics, Sofia, Bulgaria, pp. 690–696. External Links: Cited by: §5.1.
- Word sense disambiguation using implicit information. Natural Language Engineering, pp. 1–20. Cited by: §6.
- Language model adaptation for language and dialect identification of text. Natural Language Engineering 25 (5), pp. 561–583. Cited by: §8.
- The tdil program and the indian language corpora initiative (ilci). In LREC, Cited by: §2, §2.
- Bengali basic travel expression corpus: a statistical analysis. In 2014 17th Oriental Chapter of the International Committee for the Co-ordination and Standardization of Speech Databases and Assessment Techniques (COCOSDA), pp. 1–6. Cited by: §2.
- Derivation of new readability formulas for navy enlisted personnel. Research Branch Report. Cited by: §4.2.6.
- Statistical analysis of telugu text corpora. Cited by: §2, §4.2.3, §4.2.
- Challenges in developing lrs for non-scheduled languages: a case of magahi. In Proceedings of the 5th Language and Technology Conference Human Language Technologies as a Challenge for Computer Science and Linguistics (LTC’11), pp. 60–64. Cited by: §1, §2.
- Developing a pos tagger for magahi: a comparative study. In Proceedings of the 10th Workshop on Asian Language Resources, pp. 105–114. Cited by: §2, §7.3.
- Development of indian language speech databases for large vocabulary speech recognition systems. In Proceedings of SPECOM, Cited by: §2.
- Word translation without parallel data. In International Conference on Learning Representations, External Links: Cited by: §6.
- Information retrieval system and machine translation: a review. Procedia Computer Science 78, pp. 845–850. Cited by: §6.
- Analysis of and observations from a bangla news corpus. Cited by: §2.
- A transformation-based learning approach to language identification for mixed-lingual text-to-speech synthesis. In Ninth European Conference on Speech Communication and Technology, Cited by: §8.
- PAN 2017: author profiling-gender and language variety prediction.. In CLEF (Working Notes), Cited by: §8.
- Language identification: a solved problem suitable for undergraduate instruction. J. Comput. Sci. Coll. 20 (3), pp. 94–101. External Links: Cited by: §8.
- WordNet: an online lexical database. International Journal Lexicograph 3 (4), pp. 235–244. Cited by: §6.
- Acoustic modelling for speech recognition in indian languages in an agricultural commodities task domain. Speech Communication 56, pp. 167–180. Cited by: §2.
- Language identification from small text samples. Journal of Quantitative Linguistics 13 (01), pp. 57–80. Cited by: §8, §8.
- Unsupervised extraction of false friends from parallel bi-texts using the web as a corpus. In Proceedings of the International Conference RANLP-2009, pp. 292–298. Cited by: §6.
- Constraint-based bilingual lexicon induction for closely related languages. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC’16), pp. 3291–3298. Cited by: §6.
- Training & evaluation of pos taggers in indo-aryan languages: a case of hindi, odia and bhojpuri. In the proceedings of 7th Language & Technology Conference: Human Language Technologies as a Challenge for Computer Science and Linguistics, pp. 524–529. Cited by: §2, §7.3.
- Zipf’s word frequency law in natural language: a critical review and future directions. Psychonomic bulletin & review 21 (5), pp. 1112–1130. Cited by: §4.2.5.
- Using maximum entropy models to discriminate between similar languages and varieties. In Proceedings of the First Workshop on Applying NLP Tools to Similar Languages, Varieties and Dialects, pp. 120–128. Cited by: §5.
- Applications and explanations of zipf’s law. In Proceedings of the Joint Conferences on New Methods in Language Processing and Computational Natural Language Learning, NeMLaP3/CoNLL ’98, USA, pp. 151–160. External Links: Cited by: §4.2.5.
- Extraction of character personas from novels using dependency trees and pos tags. In Emerging Research in Computing, Information, Communication and Applications, pp. 65–74. Cited by: §6.
- Using similarity scoring to improve the bilingual dictionary for word alignment. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, pp. 409–416. Cited by: §6.
- Language identification on the web: extending the dictionary method. In International Conference on Intelligent Text Processing and Computational Linguistics, pp. 357–368. Cited by: §8.
- Text analysis and language identification for polyglot text-to-speech synthesis. Speech communication 49 (9), pp. 697–724. Cited by: §8.
- A survey of cross-lingual word embedding models. Journal of Artificial Intelligence Research 65, pp. 569–631. Cited by: §6.
- Introduction to shallow parsing contest on south asian languages. In Proceedings of the IJCAI – 2007 Workshop On Shallow Parsing for South Asian Languages (SPSAL-2007), pp. 1–8. Cited by: §7.
- Easy measures for evaluating non-english corpora for language engineering: some lessons from arabic and bengali. Dep. of Comp., Faculty of Math. and Comp., The Open University, Walton Hall, UK. Tech. Rep. № 2004 5, pp. 1–5. Cited by: §4.2.1.
- Shallow parsing with conditional random fields. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology - Volume 1, NAACL ’03, USA, pp. 134–141. External Links: Cited by: §7.
- Cross-lingual information retrieval: a dictionary-based query translation approach. In Advances in Computer and Computational Sciences, pp. 611–618. Cited by: §6.
- Ethnologue: languages of asia. SIL International. Cited by: §8.
- Identification of languages and encodings in a multilingual document. In Building and Exploring Web Corpora (WAC3-2007): Proceedings of the 3rd Web as Corpus Workshop, Incorporating Cleaneval, Vol. 4, pp. 95. Cited by: §8.
- Estimating the resource adaption cost from a resource rich language to a similar resource poor language.. In LREC, Cited by: §5.
- A computational phonetic model for indian language scripts. In Constraints on Spelling Changes: Fifth International Workshop on Writing Systems, pp. 1–19. Cited by: §4.2.3.
- Study of some distance measures for language and encoding identification. In Proceedings of the Workshop on Linguistic Distances, pp. 63–72. Cited by: §4.2.2, §8, §8.
- A mechanism to provide language-encoding support and an nlp friendly editor. In Proceedings of the Third International Joint Conference on Natural Language Processing: Volume-II, Cited by: §7.
- Automatic syllabification for manipuri language. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, pp. 349–357. Cited by: §2.
- Statistical tagger for bhojpuri (employing support vector machine). In 2015 International Conference on Advances in Computing, Communications and Informatics (ICACCI), pp. 1524–1529. Cited by: §1, §2.
-  Web drawn corpus for bhojpuri. Cited by: §2.
- Morfessor 2.0: toolkit for statistical morphological segmentation. In Proceedings of the Demonstrations at the 14th Conference of the European Chapter of the Association for Computational Linguistics, pp. 21–24. Cited by: §4.2.4.
-  Interspeech 2018 low resource automatic speech recognition challenge for indian languages. In Proc. The 6th Intl. Workshop on Spoken Language Technologies for Under-Resourced Languages, pp. 11–14. Cited by: §2.
- Language identification of person names using cascaded svms. Bachelor’s Thesis, Uppsala University, Uppsala. Cited by: §8.
- Cross-lingual dependency parsing for closely related languages - Helsinki’s submission to VarDial 2017. In Proceedings of the Fourth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial), Valencia, Spain, pp. 131–136. External Links: Cited by: §5.
- The statistical study of literary vocabulary. Cambridge University Press. Cited by: §4.2.
- Knowledge-based approach for word sense disambiguation using genetic algorithm for gujarati. In Information and Communication Technology for Intelligent Systems, pp. 485–494. Cited by: §6.
- Multilingual lexical resources to detect cognates in non-aligned texts. In Proceedings of the Australasian Language Technology Association Workshop 2014, Vol. 12, pp. 14–22. Cited by: §6.
- Language identification using global statistics of natural languages. In Proceedings of the 2nd Romanian-Hungarian Joint Symposium on Applied Computational Intelligence (SACI), pp. 243–255. Cited by: §8.
- Language discrimination and transfer learning for similar languages: experiments with feature combinations and adaptation. In Proceedings of the Sixth Workshop on NLP for Similar Languages, Varieties and Dialects, Ann Arbor, Michigan, pp. 54–63. External Links: Cited by: §5.
- Developing linguistic corpora: a guide to good practice. Vol. 92, Oxbow Books Oxford. Cited by: §7.
- The syntax of possessor prominence in maithili. Prominent Internal Possessors, pp. 39. Cited by: §3.
- Applying grapheme, word, and syllable information for language identification in code switching sentences. In 2011 International Conference on Asian Language Processing, pp. 111–114. Cited by: §8.
- Identifying language origin of named entity with multiple information sources. IEEE transactions on audio, speech, and language processing 16 (6), pp. 1077–1086. Cited by: §8.
- Language identification and morphosyntactic tagging. the second vardial evaluation campaign.. Cited by: §8.
- Bilingual lexicon induction from non-parallel data with minimal supervision. In Thirty-First AAAI Conference on Artificial Intelligence, Cited by: §6.
- Character-level convolutional networks for text classification. In Advances in neural information processing systems, pp. 649–657. Cited by: §8.
- Improving performance of nmt using semantic concept of wordnet synset. In Machine Translation: 14th China Workshop, CWMT 2018, Wuyishan, China, October 25-26, 2018, Proceedings, Vol. 954, pp. 39. Cited by: §6.
- Transfer learning for low-resource neural machine translation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, Austin, Texas, pp. 1568–1575. External Links: Cited by: §5.