For the Purpose of Curry: A UD Treebank for Ashokan Prakrit

11/24/2021
by   Adam Farris, et al.
Georgetown University

We present the first linguistically annotated treebank of Ashokan Prakrit, an early Middle Indo-Aryan dialect continuum attested through Emperor Ashoka Maurya's 3rd century BCE rock and pillar edicts. For annotation, we used the multilingual Universal Dependencies (UD) formalism, following recent UD work on Sanskrit and other Indo-Aryan languages. We touch on some interesting linguistic features that posed issues in annotation: regnal names and other nominal compounds, "proto-ergative" participial constructions, and possible grammaticalizations evidenced by sandhi (phonological assimilation across morpheme boundaries). Eventually, we plan for a complete annotation of all attested Ashokan texts, towards the larger goals of improving UD coverage of different diachronic stages of Indo-Aryan and studying language change in Indo-Aryan using computational methods.


1 Introduction

Ashokan Prakrit is the earliest attested stage and among the most conservative known forms of Middle Indo-Aryan (MIA), represented by inscriptions in the form of rock and pillar edicts commissioned by the Mauryan emperor Ashoka (aśōka¹) in the 3rd century BCE. The Indo-Aryan languages form the predominant language family in the northern (and insular southern) parts of the Indian subcontinent, and are a branch of the widespread Indo-European family. They are generally divided into three historical stages: Old Indo-Aryan (OIA; Sanskrit, both the language of Vedic and of later Classical texts, as well as unattested varieties suggested by dialectal variation in later stages), Middle Indo-Aryan (MIA; Ashokan Prakrit, Pali, the Dramatic Prakrits, and early koinés of the Hindi Belt), and New Indo-Aryan (NIA; modern Indo-Aryan languages such as Hindi–Urdu, Assamese, Marathi, Dhivehi, Kashmiri, Khowar, etc.).

¹ Throughout this work, we use a newly devised transliteration scheme, based on the International Alphabet of Sanskrit Transliteration (IAST), which is standard in Indological work, with influences from the IPA and Americanist systems. Divergences from IAST are: (1) indication of aspiration and breathy voice with superscript ʰ, (2) explicit marking of ē and ō as long vowels, (3) overdot for visarga (ḣ) and anusvara (ṁ), instead of the underdot, to avoid confusion with retroflexion.

Diachronically, Ashokan Prakrit is a descendant of unwritten dialects of Sanskrit (some of which are attested through oral transmission of the Vedas) and a precursor to regional fragmentation of Middle Indo-Aryan into Pali, the Dramatic Prakrits, and eventually the NIA languages. Ashokan Prakrit is a dialect continuum rather than a standardized language, but the three dialect zones are not divergent enough to prove mutually unintelligible [oberlies].

Universal Dependencies [nivre-etal-2016-universal, ud-recent] is a multilingual formalism for treebanking, including annotation guidelines for dependency relations, morphological analysis, part-of-speech tagging, and other linguistic features. Several New Indo-Aryan languages [bhatt-etal-2009-multi, tandon-etal-2016-conversion, ravishankar-2017-universal] and Sanskrit [kulkarni-etal-2020-dependency, hellwig-etal-2020-treebank, dwivedi] have treebanks annotated using UD or other syntactic formalisms, but thus far there is no treebank for a MIA language, leaving a gap in Indo-Aryan historical corpora. Within MIA, Ashokan Prakrit has an unusual corpus of parallel texts representing multiple geographical dialects, and internal comparison using computational tools will also help us study the history of Indo-Aryan linguistic fragmentation.

To that end, we began UD annotation of a digitized Ashokan corpus. We will present some interesting linguistic features that we encountered, both in the context of IA and for the Universal Dependencies annotation scheme, and suggest future directions for historical and dialectological corpus linguistic work in the Indo-Aryan family.

2 Related work

Figure 1: Locations of the various Ashokan inscriptions and edicts in the Indian Subcontinent, coloured by their usual geographic grouping (not by linguistic isoglosses). Points in grey in the northwest are inscriptions that are not in Ashokan Prakrit (instead, Aramaic and Greek).

The first Ashokan edicts were deciphered by James Prinsep in the 1830s [kopf]. Since then, they have played a major role in the historical study of Ashoka and the Mauryan Empire, sociological and religious study on early Buddhism and other heterodox Dharmic sects, and, of course, linguistic work from a historical and social perspective.

There are several works which attempt a broad comparative study of the inscriptions with reference to Sanskrit [woolner, hultzsch, mehendale, bloch, oberlies]. Like most historical linguistic work on IA, these works focus mostly on phonology and, to a lesser extent, morphology, to the exclusion of syntax, semantics, and the lexicon [varma].

On the computational side, the only digitized and machine-readable version of the Ashokan edicts is the Ashoka Library [ashoka-library], which is sourced from [hultzsch] and thus lacks more recently discovered inscriptions.

Other UD corpora and their annotation guidelines were also helpful to our own annotation process [vedicannotation].

3 Corpus

Location      Doc.  Sent.  Tok.
Girnar           4     28   394
Shahbazgarhi     3     14   158
Mansehra         1      8    87
Kalsi            1      8    85
Jaugada          1      8    89
Dhauli           1      3    20
Total           11     69   833
Table 1: DIPI corpus composition.

The Ashokan texts available to us constitute a very limited corpus. They are royal inscriptions concerning the promotion of Buddhist morality, administration of the Mauryan Empire, and records of magnanimous deeds of Ashoka including his conversion to Buddhism. They address the public, and all the evidence points to Ashokan Prakrit being a semi-standardized but not artificial reflection of vernacular language, given the geographical dialect variation and purpose of the texts.

We began with transcribed edicts from the Ashoka Library [ashoka-library]. Annotation began in June 2021 and was done in Google Sheets simultaneously by two linguistically-informed annotators, with discussions to resolve disagreements. We extended a guidelines document as the analysis of each tricky construction was decided. We also organized inflectional tables based on data harvested from [mehendale], and used a variety of sources for reference. Sanskrit dictionaries and morphological analysers were also useful [mw, huet2005functional].²

² The Sanskrit Grammarian (the second cite) has a web interface at https://sanskrit.inria.fr/DICO/grammar.html.

Given the parallel nature of the corpus, annotations for a particular edict at one location could be transferred with little modification to that of another location. An example of this is given in fig. 2, which only shows POS-tag and dependency parse UD annotations of a parallel sentence. Thus, we used the well-preserved edicts at Girnar as the main annotation document, and annotated other editions only after finalising the corresponding Girnar version. Table 1 gives statistics about the annotated corpus.

Girnar (Western dialect):
ID  FORM       UPOS  HEAD  DEPREL
1   bahukaṁ    DET   3     det
2   hi         PART  5     discourse
3   dōsaṁ      NOUN  5     obj
4   samājamhi  NOUN  5     obl
5   pasati     VERB  0     root
6   dēvānaṁ    NOUN  7     nmod
7   priyō      ADJ   10    nmod:desc
8   priya      ADJ   9     amod
9   dasi       ADJ   10    nmod:desc
10  rājā       NOUN  5     nsubj

Jaugada (Eastern dialect):
ID  FORM       UPOS  HEAD  DEPREL
1   bahukaṁ    DET   3     det
2   hi         PART  5     discourse
3   dōsaṁ      NOUN  5     obj
4   samājasa   NOUN  5     obl
5   drakhati   VERB  0     root
6   dēvānaṁ    NOUN  7     nmod
7   piyē       ADJ   10    nmod:desc
8   piya       ADJ   9     amod
9   dasī       ADJ   10    nmod:desc
10  lājā       NOUN  5     nsubj

Figure 2: Dependency parse of the fourth sentence of Major Rock Edict 1 as found in two locations. The top is from Girnar, representing the Western dialect, and the bottom is from Jaugada, representing the Eastern dialect. Both roughly translate to: for King Beloved-of-the-Gods Looking-Kindly sees much evil in festival meetings.
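The transfer of annotations between parallel editions can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the annotators' actual workflow (which was carried out in Google Sheets): it assumes both editions tokenize to the same number of words in the same order, as in Figure 2, and the names Token, transfer, and to_conllu are hypothetical. The tags and dependencies are those shown in Figure 2.

```python
from dataclasses import dataclass


@dataclass
class Token:
    idx: int      # 1-based position in the sentence
    form: str
    upos: str
    head: int     # index of the head token; 0 marks the root
    deprel: str


# Girnar version of Major Rock Edict 1, sentence 4 (annotations as in Figure 2).
GIRNAR = [
    Token(1, "bahukaṁ", "DET", 3, "det"),
    Token(2, "hi", "PART", 5, "discourse"),
    Token(3, "dōsaṁ", "NOUN", 5, "obj"),
    Token(4, "samājamhi", "NOUN", 5, "obl"),
    Token(5, "pasati", "VERB", 0, "root"),
    Token(6, "dēvānaṁ", "NOUN", 7, "nmod"),
    Token(7, "priyō", "ADJ", 10, "nmod:desc"),
    Token(8, "priya", "ADJ", 9, "amod"),
    Token(9, "dasi", "ADJ", 10, "nmod:desc"),
    Token(10, "rājā", "NOUN", 5, "nsubj"),
]

# Word forms of the same sentence in the Jaugada edition.
JAUGADA_FORMS = ["bahukaṁ", "hi", "dōsaṁ", "samājasa", "drakhati",
                 "dēvānaṁ", "piyē", "piya", "dasī", "lājā"]


def transfer(source, target_forms):
    """Copy UPOS, HEAD and DEPREL from `source` onto `target_forms` by position."""
    if len(source) != len(target_forms):
        raise ValueError("editions differ in length; align them manually first")
    return [Token(t.idx, form, t.upos, t.head, t.deprel)
            for t, form in zip(source, target_forms)]


def to_conllu(tokens):
    """Serialize tokens as 10-column CoNLL-U lines, leaving unknown fields as '_'."""
    rows = []
    for t in tokens:
        rows.append("\t".join([str(t.idx), t.form, "_", t.upos, "_", "_",
                               str(t.head), t.deprel, "_", "_"]))
    return "\n".join(rows)
```

Calling to_conllu(transfer(GIRNAR, JAUGADA_FORMS)) yields a draft annotation of the Jaugada sentence that an annotator would still need to review, since positional copying breaks as soon as the editions diverge in word order or length.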

4 Linguistic features

Some of the tricky annotation issues faced include: the POS-tagging and dependency parsing of regnal names in Ashokan and cross-lingually (with further discussion on compounds in general), the in-progress transition to split ergativity in Ashokan and its morphological and syntactic analysis within the framework of UD, as well as the relationship between irregular sandhi in Ashokan and grammaticalization.

A recurring point in these issues is that Ashokan is transitional between Sanskrit and New Indo-Aryan, still in the process of undergoing many drastic syntactic (from non-configurational to configurational) and morphological (from synthetic to analytic) changes. Given the small size of the corpus and inability to elicit information from native speakers, we faced difficulty annotating features based on a synchronic analysis without looking towards better, and often conflicting, data from Sanskrit or NIA languages.

4.1 Regnal names

A puzzling issue in annotation was Ashoka’s regnal names, such as:

Dēvānaṁ- priyēna Priya- dasinā rāña
god:.. beloved:.. kindly looking:.. king:..
‘King Beloved-of-the-Gods Looking-Kindly’ (Girnar 1:1)

Ashokan Prakrit, like Sanskrit, often constructs chains of nominals and adjectives headed by the last member and with all the members agreeing in case and number with it (here, the instrumental singular). Tokenization, POS-tagging of morphemes in compounds, and dependency relations in regnal names all came up as issues in UD annotation. The decisions in this section were arrived at after much discussion with the UD community.

4.1.1 POS annotation of morphemes in compounds

The first issue was how to POS tag the morphemes in such compounds. In Ashokan Prakrit, like in Sanskrit, “the division-line between substantive and adjective … [is] wavering” [whitney] so any of these titles could be thought of as nominals (‘one who is beloved of the Gods’) or adjectives (‘beloved by the Gods’). Furthermore, syntactic context can blur the distinction; an adjective like dasin ‘looking’ can be nominalized into ‘looker’, and a noun in a compound may behave attributively.

Initially, we thought to label all the morphemes in the regnal names as PROPN given that they refer to a person like a regular name does. However, these morphemes have internal dependency structure, most obviously the genitive-case modifier in dēvānaṁ-priyēna. The PROPN label would obscure what is clearly a genitive-case NOUN, dēvānaṁ ‘of the Gods’, that does not refer to a specific individual or entity like a name does.

With regard to differentiating between NOUN and ADJ in Ashokan, we settled on the criterion that something with a fixed inherent gender must be NOUN, and anything with fluid gender assignment is ADJ. This makes the POS tag a lexical feature rather than one that is contextually assigned by syntactic properties, which would render it redundant. UD precedent in other languages, e.g. Latin, favours the annotation of dependency structure in proper nouns and the regular POS tagging of nominalized components in such names.

4.1.2 Dependency structure of nominalized titles

(a) flat

ID  FORM     UPOS  HEAD  DEPREL
1   dēvānaṁ  NOUN  2     nmod
2   priyēna  ADJ   0     root
3   priya    ADJ   4     amod
4   dasinā   ADJ   2     flat
5   rāña     NOUN  2     flat

(b) any headed dependency

ID  FORM     UPOS  HEAD  DEPREL
1   dēvānaṁ  NOUN  2     nmod
2   priyēna  ADJ   5     ?
3   priya    ADJ   4     amod
4   dasinā   ADJ   5     ?
5   rāña     NOUN  0     root

Figure 3: Potential dependency parses (headless and headed) of Ashoka’s regnal names.

There is substantial disagreement among UD corpora on the dependency annotation of regnal names, epithets, and other appellative titles. The current UD guidelines prefer the flat relation for “exocentric (headless) semi-fixed MWEs like names and dates”. The head is arbitrarily assigned to be the first nominal in the multi-word expression. This is unacceptable for titles in Ashokan, since we want to treat them the same way as adjective–noun NPs, with the head always being the last word.

[schneider2021mischievous] recently investigated this issue for a wide range of nominal constructions in English (including Mr. and Secretary of State, which are similar to Ashokan titles). Once we have established that such constructions are not headless, we have to decide which headed dependency relation should be used instead. We considered appos, compound, and nmod:desc, as well as amod in case we chose to analyse the appellatives as adjectival rather than nominal. The difference between a headed and headless dependency analysis of the regnal titles is shown in fig. 3.

The problems with the other relations, which were resolved once we settled our POS tagging and arrived at nmod, are:

  • appos: Generally, an appositive is a full NP that can be paraphrased with an equational copula, e.g. Bob, my friend implies Bob, who is my friend. But in Ashokan, given the blurring between nouns and adjectives, it is clear that each title NP is directly modifying the NP rāña ‘king’ rather than paraphrasing an appositional relationship.

  • compound: Like flat, this indicates a multiword expression forming a single NP rather than relationships between full NPs. Each regnal name is, however, a whole NP that could stand alone.

  • amod: Our reasoning against the other two relies on analysing each title as an NP. The fact that titles can be dropped, and that rāña ‘king’ can be dropped while retaining grammaticality, supports that each title is indeed an NP since any one could be the head if phrase-final. Thus, an adjectival relation like amod is not preferred.

Realising that the head of each NP title is lexically an ADJ that gets nominalized, we settled on nmod:desc as the best dependency relation. Further evidence comes from variation in the components of the titles in different editions of the edicts. Given that the Kalsi variant below drops ‘king’ entirely, so that the titles stand alone without another NP head, we are certain that each title is an NP.

Dēvana- priasa rañō
god:.. beloved:.. king:..
‘King Beloved-of-the-Gods’ (Shahbazgarhi 1:1)

Dēvānaṁ- piyēna Piya- das[i]nā
god:.. beloved:.. kindly looking:..
‘Beloved-of-the-Gods Looking-Kindly’ (Kalsi 1:1)

Now backed with our cross-lingual evidence, we agree with [schneider2021mischievous] that nmod or a subtyped label of it is the best descriptor for nominal epithets. We specifically picked the subtyped label so that we can query instances of the construction for future analysis.
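To illustrate why a subtyped label is convenient for querying, here is a minimal Python sketch. It is not part of the annotation tooling: the tuple layout and the function name find_deprel are assumptions made for illustration, and the tree is the headed analysis of the Girnar 1:1 regnal name with the nmod:desc relation filled in as decided above.

```python
# Girnar 1:1 regnal name, headed analysis with the chosen nmod:desc relation.
# Each token is (id, form, upos, head, deprel); head 0 marks the phrase root.
REGNAL_NAME = [
    (1, "dēvānaṁ", "NOUN", 2, "nmod"),
    (2, "priyēna", "ADJ", 5, "nmod:desc"),
    (3, "priya", "ADJ", 4, "amod"),
    (4, "dasinā", "ADJ", 5, "nmod:desc"),
    (5, "rāña", "NOUN", 0, "root"),
]


def find_deprel(tokens, deprel, include_subtypes=False):
    """Return (head form, dependent form) pairs whose relation matches `deprel`.

    With include_subtypes=True, a query for 'nmod' also matches 'nmod:desc'.
    """
    forms = {tok_id: form for tok_id, form, _, _, _ in tokens}
    hits = []
    for _, form, _, head, rel in tokens:
        base = rel.split(":", 1)[0]
        matches = rel == deprel or (include_subtypes and base == deprel)
        if matches and head != 0:
            hits.append((forms[head], form))
    return hits
```

Querying for "nmod:desc" returns only the two titles attached to rāña, while a plain "nmod" query with include_subtypes=True also returns the genitive modifier dēvānaṁ. The subtype is what keeps epithets separable from ordinary nominal modifiers.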

4.2 Predicated -ta construction

The -ta construction³ in Sanskrit forms participles from verbal roots. These participles morphologically function as nominals: they take gender, case, and number marking and, unlike finite verb forms, do not mark person.

rājñā hataḣ cauraḣ
king:.. kill:ppp... thief:..
‘a thief killed by a king’ (lit. ‘a king-killed thief’) (Sanskrit)

In Sanskrit, especially in post-Vedic texts, it can also be interpreted with (past) perfective meaning. -ta forms agree in case/gender/number with the object, unlike the finite verbs of this stage of Indo-Aryan.

mayā lipī likʰitā
. text:.. write:ppp...
‘the text was written by me’ (passive)
‘I wrote the text’ (ergative) (Sanskrit)

This use is extremely common in Ashokan Prakrit and is the point of contention discussed here. According to one view, -ta formed deverbal adjectives that behaved as resultatives in early OIA, gradually shifting towards main predicate function, first with intransitive and later with transitive verbs (the agent receiving case marking), by late OIA [reinohl2018].

³ Philologically known as the past passive participle.

This construction is ancestral to the tense/aspect-based split ergativity observed in many later NIA languages. In such languages, the Sanskrit participle has developed into a perfective aspect verb that agrees with the object, while other inflected forms in the verb paradigm agree with the subject. Since Ashokan Prakrit was still undergoing this transition to split ergativity, the analysis of this construction in it is of interest.

In Ashokan, since the inherited active aorist is no longer productive, -ta forms have become the unmarked strategy to express the past perfect [bubenik1998]. We believe, with some certainty, that this construction is not passive at least as late as Ashokan Prakrit (if it ever was). The evidence that [casaretto2020] provide against a passive analysis in Sanskrit also applies to Ashokan. A key argument is that -ta occurs with both transitive and intransitive verbs, and in the case of the latter, does not form an “impersonal passive” as would be expected of a passivized intransitive.

As such, we adopt an ergative-like analysis of the -ta construction in Ashokan, agreeing with [peterson1998]’s view of the corresponding construction in Pali (another early MIA lect) as being a periphrastic perfect. Indeed, in the example in fig. 4(a), the -ta form agrees in number and gender with the object, while the agent receives instrumental marking. As the object dʰrama-dipi is still in the nominative case, and the nominative is morphologically marked, an independent absolutive case is yet to arise.

With respect to UD annotation, our ergative-like analysis translates to the agent rajina receiving the deprel nsubj and the object dipi receiving obj (instead of the obl:agent and nsubj:pass of the passive analysis in fig. 4(b)).

(a) The ergative-like analysis, with nsubj and obj

ID  FORM       UPOS  HEAD  DEPREL
1   ayi        PRON  3     det
2   dʰrama     NOUN  3     compound
3   dipi       NOUN  9     obj
4   Dēvanaṁ-   NOUN  5     nmod
5   priyēna    ADJ   8     nmod:desc
6   Priya-     ADJ   7     amod
7   draśina    ADJ   8     nmod:desc
8   rajina     NOUN  9     nsubj
9   likʰapita  VERB  0     root

(b) The passive analysis, with obl:agent and nsubj:pass

ID  FORM       UPOS  HEAD  DEPREL
1   ayi        PRON  3     det
2   dʰrama     NOUN  3     compound
3   dipi       NOUN  9     nsubj:pass
4   Dēvanaṁ-   NOUN  5     nmod
5   priyēna    ADJ   8     nmod:desc
6   Priya-     ADJ   7     amod
7   draśina    ADJ   8     nmod:desc
8   rajina     NOUN  9     obl:agent
9   likʰapita  VERB  0     root

Figure 4: Two possible analyses of the predicated -ta construction in the sentence ‘king Beloved-of-the-Gods Looking-Kindly has caused this rescript on morality to be written’ (Mansehra 1:1). Analysis (a) was ultimately chosen.
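Because the two analyses differ only in the labels on the direct dependents of the -ta predicate, switching between them is a mechanical relabeling. The Python sketch below shows this under stated assumptions: the tuple layout and the names relabel, ERGATIVE_TO_PASSIVE, and PASSIVE_TO_ERGATIVE are hypothetical, and the tree is the Mansehra 1:1 sentence from Figure 4.

```python
# Label correspondences between the two analyses of the -ta construction.
ERGATIVE_TO_PASSIVE = {"nsubj": "obl:agent", "obj": "nsubj:pass"}
PASSIVE_TO_ERGATIVE = {v: k for k, v in ERGATIVE_TO_PASSIVE.items()}

# Mansehra 1:1 under the ergative-like analysis: (id, form, upos, head, deprel).
MANSEHRA = [
    (1, "ayi", "PRON", 3, "det"),
    (2, "dʰrama", "NOUN", 3, "compound"),
    (3, "dipi", "NOUN", 9, "obj"),
    (4, "Dēvanaṁ-", "NOUN", 5, "nmod"),
    (5, "priyēna", "ADJ", 8, "nmod:desc"),
    (6, "Priya-", "ADJ", 7, "amod"),
    (7, "draśina", "ADJ", 8, "nmod:desc"),
    (8, "rajina", "NOUN", 9, "nsubj"),
    (9, "likʰapita", "VERB", 0, "root"),
]


def relabel(tokens, predicate_id, mapping):
    """Relabel only the direct dependents of the predicate with id `predicate_id`."""
    out = []
    for tok_id, form, upos, head, rel in tokens:
        if head == predicate_id:
            rel = mapping.get(rel, rel)  # leave unrelated relations untouched
        out.append((tok_id, form, upos, head, rel))
    return out
```

Restricting the relabeling to dependents of the named predicate matters in a mixed corpus: finite active clauses in the same sentence keep their ordinary nsubj and obj labels.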

4.2.1 Differential agent marking

Cross-dialectally as well as dialect-internally, Ashokan Prakrit varies with respect to how the agent phrase is marked in -ta constructions. Agents may receive either instrumental or (with lesser frequency) genitive case marking, though the basis for this alternation is not wholly clear.

sē mamayā bahu kayānē kaṭē
now 1. many good_deed:.. do:ppp...
‘Now, I did many good deeds.’ (Kalsi 5:4)

Dēvānaṁ- piyaśa Piya- daśinē lājinē Kaligyā vijitā
god:.. beloved:.. kindly looking:.. king:.. Kalinga:.. conquer:ppp...
‘… king Beloved-of-the-Gods Looking-Kindly conquered the Kalingas.’ (Kalsi 13:1)

[andersen1986]’s analysis suggests discourse-pragmatic factors may be at play; the genitive agent conveys old (i.e. contextually given and/or definite) information while the instrumental agent conveys new information. On this basis, he also claims these represent two separate constructions, a passive and an ergative respectively, though this proposal has some flaws (see [bubenik1998] for criticisms).

We tentatively follow [dahl-and-stronski2016] in analyzing the situation as one of differential agent marking (DAM) [arkadiev2017], whereby two agent-marking cases are distributed along (potentially irrecoverable) semantic/pragmatic lines. We therefore kept the standard morphological analysis of the case features in these agents, i.e. Case=Gen or Case=Ins, rather than explicitly proposing Case=Erg as an Ashokan feature.
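Keeping the ordinary Case values means the distribution of agent marking can later be tallied directly from the morphological features. The following Python sketch shows the idea; it is illustrative only. The function names are hypothetical, the feature strings follow the CoNLL-U FEATS convention, and the two Case values are assumed from the discussion of the Kalsi examples above rather than taken from the released annotation.

```python
from collections import Counter


def parse_feats(feats):
    """Turn a CoNLL-U FEATS string such as 'Case=Ins|Number=Sing' into a dict."""
    if feats in ("", "_"):
        return {}
    return dict(pair.split("=", 1) for pair in feats.split("|"))


def agent_case_counts(agents):
    """Count Case values over (form, feats) pairs for agents of -ta predicates."""
    return Counter(parse_feats(feats).get("Case", "unknown")
                   for _, feats in agents)


# Heads of the two agent phrases cited in this section. The Case values are
# assumptions based on the surrounding prose; other features are omitted.
AGENTS = [
    ("mamayā", "Case=Ins"),   # Kalsi 5:4
    ("lājinē", "Case=Gen"),   # Kalsi 13:1
]

counts = agent_case_counts(AGENTS)
```

With a fuller corpus, the same tally could be broken down by edict location or verb class to probe the dialectal and lexical hypotheses about DAM.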

DAM seems to affect both the agents of the ergative-like -ta construction and the oblique agents of finite passives in Ashokan. Of the source constructions in Vedic, [bubenik1998] explains that there is a broad tendency for ‘active’ verbs to favor instrumental agents, and ‘ingestive’ verbs (perception, consumption, etc.) to favor the genitive, but the instrumental becomes default in later stages of Old Indo-Aryan. Further annotation of the Ashokan corpus will allow us to probe these hypotheses with statistical tools.

Additionally, the influence of Ashoka’s administrative language, an eastern dialect from which other dialectal edicts were likely translated [oberlies], should not be neglected. If the choice between instrumental and genitive marking is at least partially a function of dialect, direct translation from Ashoka’s variety could leave relic forms⁴ (otherwise inconsistent with the internal distribution of cases) in other edicts.

⁴ One such example of dialectal interference is .. -ē, a non-western isogloss, attested in place of the expected -ō in Girnar (a western dialect).

4.3 Sandhi

Sanskrit texts (which in written form all post-date the Ashokan edicts) generally orthographically indicate sandhi, a kind of phonological assimilation at morpheme boundaries [sandhi]. Some examples from Sanskrit are given below.

gaccʰati arjunaḣ → gaccʰatyarjunaḣ
saḣ aham → sō’ham
brahma asmi → brahmāsmi
(Sanskrit)

Middle Indo-Aryan has more haphazard orthographic indication of sandhi rules [dockalova2009], even though these assimilations likely persisted in speech. Pali, for example, shows sandhi in compounds (especially those inherited directly from Old Indo-Aryan and then subject to normal sound changes), some function words (emphatic ēva, preverbs, etc.), pronouns, and sometimes in nominal arguments to verbs, noun–noun relations, and vocatives [childers]. That is, Pali optionally indicates sandhi only between syntactically related words [oberlies2, p. 116].
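The three Sanskrit examples cited earlier in this section can be reproduced with a toy rule set, which makes concrete what "orthographically indicating sandhi" involves. The Python sketch below is deliberately minimal and is an assumption-laden illustration: join_sandhi is a hypothetical name, only the three patterns from those examples are covered, and real Sanskrit sandhi has many more rules and conditions.

```python
import unicodedata


def nfc(text):
    """Normalize to NFC so letters with diacritics compare reliably."""
    return unicodedata.normalize("NFC", text)


VISARGA_A = nfc("aḣ")  # word-final -aḣ in this paper's transliteration


def join_sandhi(first, second):
    """Join two words using three toy sandhi rules; otherwise keep them apart."""
    first, second = nfc(first), nfc(second)
    if first.endswith(VISARGA_A) and second.startswith("a"):
        # -aḣ + a- becomes -ō’- (the initial a is elided)
        return nfc(first[:-len(VISARGA_A)] + "ō’" + second[1:])
    if first.endswith("i") and second.startswith("a"):
        # -i + a- becomes -ya-
        return first[:-1] + "y" + second
    if first.endswith("a") and second.startswith("a"):
        # -a + a- becomes -ā-
        return nfc(first[:-1] + "ā" + second[1:])
    return first + " " + second
```

For a treebank, the inverse problem is the relevant one: a sandhied string has to be split back into syntactic words before POS tags and dependencies can be assigned, and that splitting is generally ambiguous.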

We observed similar occurrences in our Ashokan corpus. We think certain rare cases of sandhi in Ashokan may be examples of grammaticalization (the development of a postposition with case-like properties) and lexicalization (compounds that are no longer as transparent). These pose issues for UD annotation.

4.3.1 Grammaticalization of atʰāya ~ aṭʰāya

One case where sandhi may give us clues about morphological change is the occurrence of atʰāya ‘for the purpose [of]’, the dative of atʰa ‘purpose’ (< Sanskrit ártʰa). In the prototypical example below, sandhi with the preceding nominal stem causes vowel lengthening.

tī ēva prāṇā ārabʰarē sūp- ātʰāya
three emph animal:.. kill:... curry purpose:..
‘Only three animals are being killed for the purpose of curry.’ (Girnar 1:7)

[reinohl] describes a related phenomenon based on Classical Sanskrit and Pali corpora: the post-Vedic genitive shift, wherein many adverbs and adjectives were analysed as taking the genitive and periphrastically replacing case relations, e.g. -asya artʰāya ‘for the purpose of …’. Sanskrit generally uses the dative case by itself to indicate Purpose, so this compounded construction in Ashokan may be an intermediate phase in the genitive shift.

For UD, this is a tricky situation. We were stuck between describing atʰāya as a case complement to sūpa, or instead the head of an NP, both shown in fig. 5. Given similar constructions with stem-less nouns in compounds, atʰāya would usually be analysed as the head here, but if it has been grammaticalized then case would be a better deprel as is used for case markers and clitics in New Indo-Aryan UD. Girnar 4:10 also has ētāya atʰāya ‘for this purpose’ which lacks sandhi or stem-compounding, but this may be exceptional since ētad can take the det deprel as a modifier to nouns and so does not behave like a true nominal. Pending better evidence supporting either analysis, we settled on the latter.

An interesting data point is that a similar construction is the etymological source of the dative case in the Insular Indo-Aryan languages, Dhivehi and Sinhala.

mamma e=ge-aṣ̊ diya
mother 3=house- go..alter
‘Mother went to that house.’ (adapted from [lum2020]) (Dhivehi)

ammā ē gedr- giyā
mother 4 house.- go.
‘Mother went to that house.’ (Sinhala)

Sinhala -()ṭ and Dhivehi -aṣ̊ are both reflexes of Sanskrit ártʰāya (or, possibly, the accusative case ártʰaṁ) [fritz] and have expanded as datives to encode other roles such as goal. Ashokan Prakrit’s compounding of atʰāya may represent an early stage towards a similar grammaticalization, though its precise synchronic status is unclear. Future UD annotation of MIA corpora will allow us to better track such phenomena from a comparative perspective.⁵

⁵ It is worth noting that a Middle Indo-Aryan ancestor of Sinhala, roughly contemporaneous with the Ashokan edicts, formed a periphrastic dative of purpose with aṭaya (cf. śagaśa aṭaya ‘for the benefit of the sangha’) [premaratne, paranavithana]. Here, as is also observed with Pali’s attʰāya construction [reinohl, fahs], the nominal śaga ‘sangha’ takes genitive case marking. In contrast, Ashokan employs either a dative dependent (e.g. etāya) or the stem-compounding strategy described above. It cannot be ruled out, however, that the modern Sinhala and Dhivehi datives originate in a similar compound-like use of ártʰāya [fritz].

(a) normal NP

ID  FORM      UPOS  HEAD  DEPREL
1   ārabʰarē  VERB  0     root
2   sūpa      NOUN  3     nmod
3   atʰāya    NOUN  1     obl

(b) grammaticalized case

ID  FORM      UPOS  HEAD  DEPREL
1   ārabʰarē  VERB  0     root
2   sūpa      NOUN  1     obl
3   atʰāya    ADP   2     case

Figure 5: Two potential analyses of the atʰāya construction in Girnar 1:7.

4.3.2 Other cases

Another unusual sandhi occurs in Girnar 2:2, manusōpagāni ca pasōpagāni ca ‘beneficial to man and beneficial to animal’. The form pasōpagāni is underlyingly pasu ‘(domestic) animal’ + upagāni ‘benefits’, wherein the sandhi of u + u gives ō rather than the expected ū (as in Sanskrit) or u (as in Pali). This sandhi is found in every other edition of the edict; Jaugada even has pasu-ōpagāni. We feel it is too speculative to claim grammaticalization in this instance and instead think it is phonological analogy with manusōpagāni, so we analysed it as a noun compound with deprel nmod.

5 Future work

The main task ahead of us involves completing annotation, which will require gathering and critical editing of Ashokan texts discovered in the past century that have not been digitally compiled thus far. What has been annotated already will be included within the next annual UD corpus release.

Once a good cross-dialectal selection of annotated texts is available, we would like to explore computational methods for studying the corpus. Automatic word-level alignment between dialectal variants of the same edict will enable us to compare dependency structure, case marking, and sound change outcomes, alongside other dialectological features. On the technical side, we would also like to see whether training on Sanskrit data, with finetuning on the smaller Ashokan corpus, could be used to automatically perform UD annotation of texts in other Middle Indo-Aryan languages.
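As a rough indication of what word-level alignment between editions could look like, the Python sketch below aligns two token sequences with a standard dynamic-programming (Needleman-Wunsch style) procedure. Everything here is an assumption made for illustration rather than a planned implementation: the character-overlap similarity from Python's difflib stands in for a proper model of sound correspondences, the gap penalty of 0.3 is arbitrary, and the function names are hypothetical. The example sentences are the Girnar and Jaugada versions of Major Rock Edict 1, sentence 4.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Character-level similarity in [0, 1]; a crude proxy for cognacy."""
    return SequenceMatcher(None, a, b).ratio()


def align(src, tgt, gap=0.3):
    """Globally align two token lists; unmatched tokens pair with None."""
    n, m = len(src), len(tgt)
    score = [[0.0] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0], back[i][0] = score[i - 1][0] - gap, "up"
    for j in range(1, m + 1):
        score[0][j], back[0][j] = score[0][j - 1] - gap, "left"
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            options = [
                (score[i - 1][j - 1] + similarity(src[i - 1], tgt[j - 1]), "diag"),
                (score[i - 1][j] - gap, "up"),
                (score[i][j - 1] - gap, "left"),
            ]
            score[i][j], back[i][j] = max(options)
    # Trace back from the bottom-right cell to recover the aligned pairs.
    pairs, i, j = [], n, m
    while i > 0 or j > 0:
        move = back[i][j]
        if move == "diag":
            pairs.append((src[i - 1], tgt[j - 1]))
            i, j = i - 1, j - 1
        elif move == "up":
            pairs.append((src[i - 1], None))
            i -= 1
        else:
            pairs.append((None, tgt[j - 1]))
            j -= 1
    return pairs[::-1]


GIRNAR = ["bahukaṁ", "hi", "dōsaṁ", "samājamhi", "pasati",
          "dēvānaṁ", "priyō", "priya", "dasi", "rājā"]
JAUGADA = ["bahukaṁ", "hi", "dōsaṁ", "samājasa", "drakhati",
           "dēvānaṁ", "piyē", "piya", "dasī", "lājā"]

aligned = align(GIRNAR, JAUGADA)
```

Each aligned pair such as (priyō, piyē) or (rājā, lājā) is then a candidate data point for case-ending and sound-change comparison. Pairs like (pasati, drakhati), which are lexically different verbs, show why string similarity alone would not suffice and why positional and syntactic information from the treebank should inform the alignment.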

More broadly, we would like to continue UD annotation of texts in earlier Indo-Aryan languages in order to have data to better address historical linguistic questions. Given the value demonstrated by corpus data for Indo-Aryan historical linguistics already [stronski2020shaping], open-access corpora annotated using Universal Dependencies, with fine-grained analysis of morphology and syntax beyond just glossed examples, will surely help put some of the controversial issues in the field to rest. Comparisons of Ashokan with other stages of Indo-Aryan will help us study language change, e.g. the development of configurationality in Middle Indo-Aryan [reinohl]. Dialectal variation (and possible substrate influence) in Ashokan should also be studied with comparison to regional NIA treebanks. Other recent work in computational approaches to this area [cathcart-rama-2020-disentangling, cathcart2020probabilistic, arora-etal-2021-bhasacitra, jambu] encouraged us to pursue the study of its historical linguistics from a similar angle.

Some texts we hope to treebank in the future include: the Pāli Canon, plays in the various later Dramatic Prakrits (e.g. Gāhā Sattasaī), the Old Dhivehi Lōmāfānu documents, the Old Kashmiri Bāṇāsurakatʰā, the Punjabi Gurū Grantʰ Sāhib, the Old Bengali Caryāpada, the medieval Sindhi Šāh jō Risālō, and epics and poetry from the Hindi Belt and Maharashtra.

References