Malayalam and Telugu are two widely spoken languages of southern India: Malayalam is an official state language of Kerala, Lakshadweep, and Mahe, while Telugu is the official state language of Telangana and Andhra Pradesh. Malayalam has 37 million native speakers, whereas Telugu has 70 million (http://www.vistawide.com/languages/top_30_languages.htm). Both languages belong to the Dravidian language family and are agglutinative; agglutinative languages are characterized by the flexibility they offer in forming complex words by chaining simpler morphemes together. The growing web presence of these languages necessitates automatic techniques to process text in them. It has been estimated that Indian-language internet users will exceed the English user base by 2021 (http://bestmediainfo.com/2018/01/regional-language-users-to-account-for-75-of-total-internet-users-by-2021-times-internet-study/), underlining the importance of developing effective NLP for Indian languages. A hurdle in exploiting social-media text in Malayalam and Telugu to train models for NLP tasks such as machine translation, named entity recognition, and POS tagging is the presence of a large number of loanwords in text from these languages. These loanwords are predominantly from English, and many of them, such as police, train, and taxi, virtually always appear in transliterated form in contemporary Malayalam and Telugu text. In a manual analysis of a Malayalam news dataset, we found that up to 25% of the vocabulary was formed by loanwords. While processing mixed-language text for tasks such as translation or tagging, automatically identifying and flagging loanwords upfront would avoid treating a loanword and its source-language version as separate tokens, directly enhancing the effectiveness of the model under the same learning method. Separating intrinsic-language words from loanwords is especially useful in the realm of cross-language information retrieval.
In this paper, we consider the task of separating loanwords from native-language words within an unlabeled dataset of words gathered from a document corpus in the language of interest (i.e., either Malayalam or Telugu). We propose an unsupervised method, UNS (Unsupervised Nativeness Scoring), that takes in a dictionary of Malayalam or Telugu words and scores each word in the dictionary based on its nativeness. UNS uses an optimization framework that starts by scoring each word based on the versatility of its word stem as observed in the corpus, and iteratively refines the scoring by leveraging a generative model built over character n-gram probability distributions. Our empirical analysis illustrates the effectiveness of UNS over existing baseline methods suited for the task.
2 Related Work
Identification of loanwords and loanword sequences, being a critical task for cross-lingual text analysis, has attracted attention since the 1990s. While most methods addressing the problem have used supervised learning, there have been some methods that can work without labeled data. We briefly survey both classes of methods.
2.1 Supervised and ‘pseudo-supervised’ Methods
Early work considers leveraging decision trees to address the related problem of learning transliteration and back-transliteration rules for English/Korean word pairs. These and other methods from the same family rely on large amounts of training data, which are costly to obtain. To alleviate this, a rule-based method has been proposed to generate large amounts of training data for English-Korean loanword identification. Baker and Brew make use of phonological conversion rules to generate training data, and show that a classifier trained on the generated data performs comparably to one trained on actual examples. Although their method uses comparatively little manually labeled training data, it still relies on rules specifying how words change when borrowed. Such rules are of limited applicability in our context of Dravidian languages, where words seldom undergo significant structural changes, other than in cases involving external sandhis (https://en.wikipedia.org/wiki/Sandhi) used to join them with adjacent words.
2.2 Unsupervised Methods
A recent work proposes that multi-word phrases in Malayalam text whose component words exhibit strong co-occurrence be categorized as transliterable/loanword phrases. The intuition stems from observing contiguous words such as test dose, which often occur in transliterated form when appearing together, but are replaced by native words in other contexts. Their method is, however, unable to identify single loanwords, or phrases involving words such as train and police whose transliterations are heavily used in the company of native Malayalam words. To the best of our knowledge, there has been no work on automatically identifying loanwords in Telugu text. However, a recent linguistic study of the characteristics of loanwords in Telugu newspapers is indicative of the abundance and variety of loanwords in Telugu.
There has been some work relaxing the supervised-data requirements for the task in the context of languages of non-Indic origin. One loosely supervised approach sources native words from 100-year-old Hebrew texts, on the assumption that such older texts contain fewer foreign words. Indian languages, particularly regional South Indian languages, are yet to see large-scale digitization of old texts, so such temporal assumptions cannot yet be leveraged for nativeness scoring. Another line of work presents unsupervised loanword identification in Korean, constructing a binary character-based n-gram classifier trained on a corpus. Koo makes use of native and foreign seed words determined using document statistics: words with higher corpus frequency form the native seed, on the assumption that native words occur more frequently than foreign words in a corpus, while the foreign seed consists of words with apparent vowel insertion. According to Koo, in Korean, as well as in phonotactically similar languages, words neither begin nor end with consonant clusters; foreign words therefore usually have vowels arbitrarily inserted to break up such clusters. Contrary to the phonotactics of Korean, words in Malayalam and Telugu can begin and end with consonant clusters, so Koo's method is inapplicable to the languages in our focus.
2.3 Positioning the Nativeness Scoring Task
Nativeness scoring of words may be seen as a vocabulary stratification step (upon usage of thresholds) for downstream applications. A multilingual text mining application that uses Malayalam/Telugu text alongside English text would benefit from transliterating non-native Malayalam/Telugu words to English, so that a loanword token and its transliteration are treated as the same token. For machine translation, loanwords may be channeled to specialized translation methods, or routed for manual screening and translation.
3 Problem Definition
We now define the problem more formally. Consider a set $\mathcal{W}$ of distinct words obtained from Malayalam/Telugu text. It may be noted that $\mathcal{W}$ should contain either all Malayalam words or all Telugu words (not a mixture of the two). This may be obvious to readers familiar with the fact that the two languages use different scripts, leading to non-overlapping vocabularies; mixing them within one dataset thus makes little intuitive sense. Our task is to devise a technique that can use $\mathcal{W}$ to arrive at a nativeness score for each word $w$ within it, denoted $n_w \in [0,1]$.
We would like $n_w$ to be an accurate quantification of the nativeness of word $w$. Since $n_w \in [0,1]$, $1 - n_w$ may be treated analogously as a quantification of the loanword-ness of $w$. For example, when words in $\mathcal{W}$ are ordered in decreasing order of their scores, we expect to find the native words at the beginning of the ordering and the loanwords at the end. We do not presume the availability of any data other than $\mathcal{W}$; this makes our method applicable in scenarios where corpus statistics are unavailable due to privacy or other reasons.
Given that it is easier for humans to crisply classify each word as either a native word or a loanword than to attach a score to each word, a nativeness scoring (as generated by a scoring method such as ours) often needs to be evaluated against a crisp nativeness assessment, i.e., a labelling with scores in $\{0, 1\}$. Such evaluation, comparing a scoring with a crisp labelling, appears in other contexts such as record linkage scoring; within record linkage scenarios, however, the evaluation is further confounded by the high imbalance between the cardinalities of the two classes in question. One approach uses an aggregate of the rankings of the minority class in an ordering of the objects according to the scores, in order to evaluate the effectiveness of the (record linkage) scoring. We use a similar framework, but use precision instead of average ranking, since the imbalance between the native and loanword vocabularies is not too extreme in Indian language settings. Consider the ordering of words in the labeled set in decreasing (or, more precisely, non-increasing) order of nativeness scores (each method produces such an ordering for the dataset). We use two sets of metrics for evaluation:
Precision at the ends of the ordering: Top-k precision denotes the fraction of native words among the $k$ words at the top of the ordering; analogously, bottom-k precision is the fraction of loanwords among the bottom $k$. Since a good scoring would likely put native words at the top of the ordering and loanwords at the bottom, a good scoring method would intuitively score high on both these metrics. We call the average of the top-k and bottom-k precision for a given $k$ the avg-k precision. These measures, evaluated at varying values of $k$, indicate the quality of the nativeness scoring at either end.
Clustering Quality: Let the cardinalities of the native and loanword sets within the labeled set be $N$ and $L$ respectively. We take the top-$N$ words and bottom-$L$ words from the ordering generated by each method, and compare them against the respective labeled sets, as in standard clustering quality evaluation (https://nlp.stanford.edu/IR-book/html/htmledition/evaluation-of-clustering-1.html). Since the cardinality of the generated native (loanword) cluster and that of the native (loanword) labeled set are both $N$ ($L$), the recall of each cluster is identical to its purity/precision, and thus to its F-measure; we simply call it clustering quality. A cardinality-weighted average of the clustering quality across the native and loanword clusters yields a single value for the clustering quality across the dataset. It may be noted that, as expected, the labeled dataset is not made available to the method generating the ordering; only its cardinalities are used, and solely for evaluation purposes.
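For concreteness, the evaluation protocol above can be sketched as follows (a minimal Python sketch; the function and variable names are ours, not from the paper):

```python
def end_precisions(ranked_words, native_labels, k):
    """Top-k / bottom-k / avg-k precision over a list of words sorted in
    non-increasing order of nativeness score."""
    top_k = sum(w in native_labels for w in ranked_words[:k]) / k
    bottom_k = sum(w not in native_labels for w in ranked_words[-k:]) / k
    return top_k, bottom_k, (top_k + bottom_k) / 2

def clustering_quality(ranked_words, native_labels, loan_labels):
    """Weighted clustering quality: the top-N words form the native
    cluster and the bottom-L words the loanword cluster, where N and L
    are the labeled-set cardinalities."""
    n, l = len(native_labels), len(loan_labels)
    q_native = len(set(ranked_words[:n]) & set(native_labels)) / n
    q_loan = len(set(ranked_words[-l:]) & set(loan_labels)) / l
    return (n * q_native + l * q_loan) / (n + l)
```

Note that, as described above, each cluster's precision equals its recall because the cluster and the labeled set have the same cardinality.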
4 UNS: Unsupervised Nativeness Scoring
We now introduce our method, UNS (Unsupervised Nativeness Scoring). We use probability distributions over character n-grams to separately model loanwords and native words, and develop an optimization framework that alternately refines the character n-gram distributions and the nativeness scoring within each iteration. UNS involves an initialization that induces a coarse separation between native words and loanwords, followed by iterative refinement. The initialization is critical in optimization methods that are vulnerable to local optima; the native word distribution needs to be initialized to roughly prefer native words over loanwords. This enables subsequent iterations to exploit the initial preference direction, refining the model to attract native words more strongly while weakening any initial preference for loanwords; the converse holds for the loanword model. We first outline the initialization step, followed by a description of the iterative framework and the overall approach. It may be noted that UNS is designed with Malayalam and Telugu in mind; thus, UNS can be applied to an input dictionary comprising either Malayalam or Telugu words, but not a mix of words from the two languages.
4.1 Diversity-based Initialization
Our initialization is inspired by an observation about the versatility of word stems. We define a word stem as the sub-word formed by the first few pseudo-syllables of a (Malayalam or Telugu) word; a pseudo-syllable is a consonant along with one or more modifiers that appear with it. (We represent Malayalam/Telugu words in transliterated form for readers who may not be able to read these scripts, with pipes separating pseudo-syllables; the modifiers are primarily those listed at https://www.win.tue.nl/~aeb/natlang/malayalam/malayalam-alphabet.html. For example, |pu| corresponds to the character /pa/ along with the modifier /u/.) Consider the word stem |pu|ra|, a stem commonly leading to native Malayalam words; its suffixes (i.e., sub-words that could immediately follow it to form a full word) are observed to start with a variety of characters such as |ttha| (e.g., |pu|ra|ttha|kki|), |me| (e.g., |pu|ra|me|), |mbo| (e.g., |pu|ra|mbo|kku|) and |ppa| (e.g., |pu|ra|ppa|du|). On the other hand, stems that mostly lead to loanwords often do not exhibit as much diversity. For example, |re|so| is followed only by |rt| (resort being a commonly used loanword from English) and |po|li| is usually followed only by |s| (i.e., police). Some stems, such as |o|ppa|, lead to transliterations of two English-origin loanwords, opener and operation. To sum up, the observation upon which we model the initialization part of UNS is that the variety of suffixes is generally correlated with the nativeness (i.e., the propensity to lead to a native word) of a word stem. This is intuitive, since loanword stems, being of non-native origin, would be expected to offer limited versatility for modification by sandhis or derivational/inflectional suffixes, as compared to native stems.
For simplicity, we use the first two pseudo-syllables (characters grouped with their modifiers) of each word as the word stem; we evaluate the robustness of UNS to varying stem lengths in our empirical evaluation, while consistently using a stem length of two pseudo-syllables in our description. We start by associating each distinct word stem in $\mathcal{W}$ with the number of unique third pseudo-syllables that follow it (among words in $\mathcal{W}$); in the examples above, |pu|ra| and |o|ppa| would be associated with 4 and 2 respectively. We initialize the nativeness weights as proportional to the diversity of third pseudo-syllables beyond the stem:
$$n_w^{(0)} = \min\left( \frac{\tau - 1}{\tau},\; \frac{|S_w|}{\tau} \right)$$

where $S_w$ denotes the set of third pseudo-syllables that follow the stem of word $w$ among words in $\mathcal{W}$, and $\tau$ is a diversity threshold. We flatten off scores beyond a diversity of $\tau$ (note that a diversity of $\tau$ or higher leads to the second term in the expression above becoming 1 or higher, causing the min function to choose the first term) as shown in the above equation. By disallowing $n_w^{(0)}$ to assume the maximum possible value of 1, we allow even words formed of highly versatile stems to contribute, albeit very slightly, to building the loanword pseudo-syllable n-gram models (details of which are in the next section); this limits over-reliance on the versatility initialization heuristic. We set $\tau$ based on our observation, from the dataset, that most word stems followed by more than $\tau$ distinct pseudo-syllables were seen to be native. As in the case of the word-stem length, we study UNS trends across varying $\tau$ in our empirical analysis.
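The initialization can be sketched as below. Plain characters stand in for pseudo-syllables here (a real implementation would segment Unicode Malayalam/Telugu text into syllabic units), and the default value of tau is illustrative:

```python
from collections import defaultdict

def diversity_init(words, stem_len=2, tau=10):
    """Score each word by the diversity of third units following its stem,
    capped at (tau - 1) / tau so that no word reaches the maximum score of 1."""
    followers = defaultdict(set)
    for w in words:
        if len(w) > stem_len:
            followers[w[:stem_len]].add(w[stem_len])
    return {w: min((tau - 1) / tau,
                   len(followers.get(w[:stem_len], ())) / tau)
            for w in words}
```

A stem followed by many distinct units pushes all its words towards (but never to) the top score, so even versatile stems contribute slightly to the loanword model, as described above.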
4.2 UNS Iterative Optimizations
Having arrived at an initialization of the nativeness scores, UNS refines them iteratively in order to arrive at an accurate quantification of nativeness. In the narrative below, we use the term characters to refer to pseudo-syllables (the former being the more familiar terminology); pseudo-syllables may be either single characters or characters grouped with their modifiers. The iterative refinement in UNS makes use of two models, which we first introduce:
Native and Loanword Character n-gram Distributions: Probability distributions over character n-grams form the main tool used within UNS to refine the word nativeness scores. UNS uses separate probability distributions over character n-grams to model native words and loanwords. While the size of the n-grams (i.e., whether $n = 1$ or $n = 2$) over which the probability distributions are built is a system-level parameter, we will use $n = 1$ for simplicity in our description. We denote the native and loanword character probability distributions by $\mathcal{N}$ and $\mathcal{L}$ respectively, with $\mathcal{N}_c$ and $\mathcal{L}_c$ denoting the weight associated with the character (1-gram) $c$ under the respective distributions.
Highly Diverse Words: UNS works by refining the initialization through iterations in order to arrive at an accurate nativeness quantification for words in $\mathcal{W}$. Over the course of the iterations, the n-gram distribution-based model could drag the scores far from their initialized values. UNS uses a mechanism to ensure that there is an inertia against such movements in the case of words with highly versatile stems. This brings us to the second model in UNS: a subset of words from $\mathcal{W}$ with highly versatile stems (beyond a threshold $\delta$), which we denote by $\mathcal{D}$. Thus, all words formed using stems versatile enough to be followed by more than $\delta$ different characters in $\mathcal{W}$ form part of $\mathcal{D}$.
These models are leveraged in an iterative optimization framework. Towards deriving the optimization steps, we first outline two objective functions in the next section.
4.2.1 Maximizing and Minimizing Objective Functions
Consider an estimate of the nativeness scores for all words in $\mathcal{W}$, and a state of the character unigram models $\mathcal{N}$ and $\mathcal{L}$. Consider a particular word $w$; if it has a high nativeness (non-nativeness) score, we would reasonably expect $w$ to be formed of characters that score highly under the $\mathcal{N}$ ($\mathcal{L}$) character probability distribution. This can be folded into an intuitive objective function that would be maximized under a good estimate of word scores and models:
This measures the aggregate support for words in $\mathcal{W}$, the support for each word measured as an interpolated support across the distributions $\mathcal{N}$ and $\mathcal{L}$, with the weighting factors being the squares of the nativeness scores (i.e., the $n_w$s) and the loanword-ness scores (i.e., the $(1-n_w)$s) respectively. In a way, the objective function can be regarded as the likelihood of the words being generated by a generative process in which words are formed by sampling characters from $\mathcal{N}$ and $\mathcal{L}$ at rates directly and inversely related to the word's nativeness score, respectively. Similar mixing models have been used earlier in emotion lexicon learning and solution post discovery. The squares of the nativeness/loanword-ness scores are used in our model (instead of the raw scores) for optimization convenience; a side-effect of using squares is that the optimization pushes the nativeness scores towards the ends of the spectrum. A highly native word should intuitively have a high nativeness score $n_w$ and high support from $\mathcal{N}$, and correspondingly a low loanword-ness $(1-n_w)$ and low support from $\mathcal{L}$; a highly non-native word would be expected to exhibit exactly the opposite. Since Eq. 3 multiplies the higher terms with each other (and likewise the lower terms), this function is maximized under a desirable estimate of the variables.
As indicated earlier, in addition to measuring and optimizing for the conformance of the $\mathcal{N}$ and $\mathcal{L}$ models with the nativeness scores, we would like to penalize successive iterations for dragging words in $\mathcal{D}$ (those having highly versatile stems) into low-nativeness territory. A simple objective that maximizes the $n_w$s of words in $\mathcal{D}$ is the following:
The parameter $\lambda$ enables controlling the relative weighting of the model conformance and diverse words' inertia terms. At $\lambda = 0$, the objective reduces to the model conformance term, whereas very high values of $\lambda$ lead to $\mathcal{O}_{max}$ being largely influenced by the second term. The max suffix in $\mathcal{O}_{max}$ indicates that this objective is one that needs to be maximized.
Minimizing Objective: We now define an analogous objective function whose minimization leads to improved model conformance (with the current estimates of the models) and diverse words' inertia; this is in contrast with $\mathcal{O}_{max}$, for which higher values indicate better model conformance. The minimizing objective is as follows:
Let us first consider the model conformance term; in this form, given a good estimate of the models, highly native (non-native) words have their nativeness (loanword-ness) weights multiplied by the support from the loanword (native) character n-gram probability distribution. In other words, maximizing the model conformance term in Eq. 3 is semantically equivalent to minimizing the first term in Eq. 6 above. Similarly for the diverse words' inertia term: minimizing the product of the $(1-n_w)$s is semantically equivalent to maximizing the product of the $n_w$s (recollecting that $n_w \in [0,1]$). The parameter $\lambda$, as before, tunes the relative weighting between the two terms of the composite objective.
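The display equations did not survive in this version of the text; one concrete instantiation consistent with the surrounding description (squared score weights, swapped distributions between the two objectives, and a $\lambda$-weighted inertia term over $\mathcal{D}$) would be:

$$\mathcal{O}_{max} = \prod_{w \in \mathcal{W}} \left( n_w^2 \prod_{c \in w} \mathcal{N}_c \;+\; (1-n_w)^2 \prod_{c \in w} \mathcal{L}_c \right) \times \left( \prod_{w \in \mathcal{D}} n_w \right)^{\lambda}$$

$$\mathcal{O}_{min} = \prod_{w \in \mathcal{W}} \left( n_w^2 \prod_{c \in w} \mathcal{L}_c \;+\; (1-n_w)^2 \prod_{c \in w} \mathcal{N}_c \right) \times \left( \prod_{w \in \mathcal{D}} (1-n_w) \right)^{\lambda}$$

These forms should be read as a sketch of the structure being described, not as reproductions of the paper's exact equations.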
Role of the two Objectives: We have outlined two objective functions above to measure the goodness of the estimates of the nativeness scores and of the $\mathcal{N}$ and $\mathcal{L}$ models. In UNS, we optimize for the score estimates (i.e., the $n_w$s) and the models (i.e., $\mathcal{N}$ and $\mathcal{L}$) in alternating steps. Given the interpolation of supports from the two models, it is analytically difficult to use the same objective function (either $\mathcal{O}_{max}$ or $\mathcal{O}_{min}$) for both optimization steps. Accordingly, we outline an optimization method that uses the maximizing objective $\mathcal{O}_{max}$ to optimize the estimates of the models (i.e., $\mathcal{N}$ and $\mathcal{L}$), while the minimizing objective $\mathcal{O}_{min}$ is used to optimize the word nativeness scores, i.e., the $n_w$s.
4.2.2 Estimating Word Nativeness Scores
Our task, in this phase, is to use the current estimates of $\mathcal{N}$ and $\mathcal{L}$ in order to identify a set of nativeness scores that best conforms to the models and the inertia term. As indicated earlier, we use the minimizing objective $\mathcal{O}_{min}$ to estimate the $n_w$s. First, writing out $\mathcal{O}_{min}$ in log form gives the following:
Noting that the $n_w$ of each word is neatly segregated into a separate term within the summation, the slope of this objective with respect to the nativeness score of a particular word $w$ is as follows:
where $\mathbb{1}[\cdot]$ is an indicator function that returns 1 if the internal expression evaluates to true, and 0 otherwise. An optimal value of $n_w$ may be found by identifying a value that brings the slope above to 0. While we omit the details here, the second derivative of $\mathcal{O}_{min}$, when worked out, is positive, indicating that equating the slope to 0 leads to a minimum (as opposed to a maximum). Equating Eq. 8 to 0 and solving for $n_w$ gives:
It may be seen that Eq. 9 is not in closed form, since the estimate of $n_w$ depends on itself, appearing as it does on the right-hand side of the equation. Nevertheless, it offers a constructive way of estimating new values of the $n_w$s by using the previous estimates on the right-hand side. The numerator of Eq. 9 comprises two terms; ignoring the first, it is easy to observe that words comprising characters that score highly under $\mathcal{N}$ (notice that $\mathcal{N}$ appears in the numerator, whereas the analogous term in the denominator involves $\mathcal{L}$) achieve high scores. This formulation, which estimates $n_w$ as roughly proportional to the support from $\mathcal{N}$, is intuitively desirable. Coming to the first term in the numerator, observe that it evaluates to 0 for words not belonging to $\mathcal{D}$; for words in the highly diverse list, it translates into a slight 'nudge' that pushes the score slightly upward, again a desirable effect, given that we want to retain high nativeness scores for words in $\mathcal{D}$.
4.2.3 Learning the $\mathcal{N}$ and $\mathcal{L}$ Distributions
As outlined earlier, we use separate character n-gram probability distributions to model native words and loanwords. We would like these probability distributions to support the latest estimates of the nativeness and loanword-ness scorings respectively. While refining $\mathcal{N}$ and $\mathcal{L}$, we would like to ensure that they remain true probability distributions that sum to unity. This brings in the following constraints:
We use the maximizing objective $\mathcal{O}_{max}$ to optimize the $\mathcal{N}$ and $\mathcal{L}$ models. As earlier, taking the log form of the objective and adding Lagrangian terms for the constraints yields the following objective to maximize:

The last two terms come from the constraints, each associated with its own Lagrangian multiplier (one for $\mathcal{N}$ and one for $\mathcal{L}$).
Learning $\mathcal{N}$: Fixing the values of the $n_w$s and $\mathcal{L}$, let us now consider learning a new estimate of $\mathcal{N}$. The slope of Eq. 11 with respect to $\mathcal{N}_c$, i.e., the weight associated with a particular character $c$, is the following:
where $f_w(c)$ is the frequency of character $c$ in the word $w$, and the Lagrangian multiplier in the expression corresponds to the sum-to-unity constraint for $\mathcal{N}$. Equating this to zero, as earlier, does not yield a closed-form solution for $\mathcal{N}_c$, but a simple re-arrangement yields an iterative update formula:
As in Eq. 9, the previous estimate of $\mathcal{N}_c$ needs to be used on the right-hand side of the update equation. The second derivative of $\mathcal{O}_{max}$ takes a negative value in the general case, affirming that equating the slope to 0 yields a maximum (as opposed to a minimum); we omit the details here. It is this contrasting behavior of the second derivatives of $\mathcal{O}_{max}$ and $\mathcal{O}_{min}$ that requires us to use two separate objectives for estimating the probability distributions and the nativeness scores respectively. Eq. 13 may be seen to be intuitively reasonable, as it establishes a somewhat direct relationship between $\mathcal{N}_c$ and $n_w$, allowing words with high nativeness to influence $\mathcal{N}$ more. The sum-to-unity constraint can be factored in by simply using the above relation as an update equation, followed by normalizing the revised estimates as follows (the same process as in estimating simple maximum-likelihood language models):
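As an illustration, the update-and-normalize step for $\mathcal{N}$ can be sketched as follows. The squared weighting follows the objective's construction, though the paper's exact (self-referential) update is simplified here to its normalized weighted-count form, an assumption made for a compact sketch:

```python
from collections import Counter

def update_native_distribution(words, nativeness):
    """Nativeness-weighted character counts, normalized to sum to one
    (as in maximum-likelihood unigram language-model estimation)."""
    weights = Counter()
    for w in words:
        for c in w:
            weights[c] += nativeness[w] ** 2
    total = sum(weights.values())
    return {c: v / total for c, v in weights.items()}
```

The analogous loanword-distribution update would weight counts by the squared loanword-ness scores, i.e., `(1 - nativeness[w]) ** 2`.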
Learning $\mathcal{L}$: Through a sequence of steps analogous to those for $\mathcal{N}$ above, we arrive at the following update equation:

Treating the above proportionality as an equation yields an update formula, which is then followed by a normalization along the lines of Eq. 14. Analogous to Eq. 13, Eq. 15 establishes a direct relationship between $\mathcal{L}_c$ and the loanword-ness (i.e., the $(1-n_w)$ scores) of the words within which $c$ occurs with high frequency.
4.3 The UNS Iterative Refinement Algorithm
Having described the initialization and the details of the iterative refinement, we now outline the overall UNS algorithm as Algorithm 1. The method starts with the diversity-based initialization, followed by a number of iterations, each involving the estimation of the distributions followed by that of the $n_w$s. Since we do not have closed-form solutions for these updates, we use the iterative update steps outlined earlier. The iterations are stopped when the nativeness weights do not change significantly (as determined by a threshold) or when a preset number of iterations has been completed. It may be noted that the character n-gram distributions within UNS need not be unigram distributions, since the n-gram size $n$ is a parameter of the method; unigrams correspond to the choice $n = 1$. For $n = 2$, the update steps would have their inner summations iterate over 2-length character sequences instead of single characters; this involves replacing $c$ with a character bigram in each of the update equations, where a bigram denotes a contiguous sequence of two characters. Each of the update steps in UNS is linear in the size of $\mathcal{W}$, making UNS a fast technique even for large dictionaries.
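Pulling the pieces together, the overall loop can be sketched as below. This is a simplified stand-in consistent with the description above, not the paper's exact derived update equations: the geometric-mean character support and the score-update rule are assumptions made here for a self-contained illustration, and plain characters stand in for pseudo-syllables.

```python
import math
from collections import Counter, defaultdict

def uns(words, stem_len=2, tau=10, delta=10, lam=0.5, n_iters=10):
    # diversity-based initialization
    followers = defaultdict(set)
    for w in words:
        if len(w) > stem_len:
            followers[w[:stem_len]].add(w[stem_len])
    score = {w: min((tau - 1) / tau,
                    len(followers.get(w[:stem_len], ())) / tau)
             for w in words}
    diverse = {w for w in words
               if len(followers.get(w[:stem_len], ())) > delta}

    def support(word, dist):
        # geometric-mean character support, floored for numerical stability
        return math.exp(sum(math.log(dist.get(c, 1e-6)) for c in word)
                        / len(word))

    for _ in range(n_iters):
        # re-estimate native and loanword character distributions
        nat, loan = Counter(), Counter()
        for w in words:
            for c in w:
                nat[c] += score[w] ** 2
                loan[c] += (1 - score[w]) ** 2
        nat_tot, loan_tot = sum(nat.values()), sum(loan.values())
        nat_d = {c: v / nat_tot for c, v in nat.items()}
        loan_d = {c: v / loan_tot for c, v in loan.items()}
        # re-estimate nativeness scores, with an upward nudge for diverse words
        for w in words:
            ns, ls = support(w, nat_d), support(w, loan_d)
            nudge = lam if w in diverse else 0.0
            score[w] = (ns + nudge) / (ns + ls + nudge)
    return score
```

On synthetic input where two groups of words use disjoint character inventories, the loop amplifies the initial diversity-based separation across iterations, mirroring the refinement behaviour described above.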
5 Empirical Study

We now describe our empirical study of UNS, starting with the dataset and experimental setup, leading on to the results and analyses.
5.1 Datasets

Given the design of our task, we create two separate datasets for our target languages, viz. Malayalam and Telugu. Both datasets were created in a similar fashion by sourcing words from articles in popular newspapers in the respective languages: 65068 unique Malayalam words were obtained from Mathrubhumi (https://www.mathrubhumi.com/), while 43150 distinct Telugu words were sourced from Andhrabhoomi (http://www.andhrabhoomi.net/). For each language, we chose a subset of 1035 random words to be manually labelled as either native, loanword, or unknown; this forms our evaluation set. The complete word lists along with the annotated subsets have been made publicly available (Malayalam dataset: https://goo.gl/DOsFES; Telugu dataset: https://goo.gl/xsvakx; Malayalam labeled subset: https://goo.gl/XEVLWv; Telugu labeled subset: https://goo.gl/S2eoB2). For evaluation purposes, we merged the set of words labelled unknown with the loanwords; this seemed appropriate since most unknown labellings were seen to correspond to non-native words whose source language was less obvious than others'. The frequencies of native words and loanwords in our datasets are shown in Table 1. In general, our datasets contain approximately three times as many native words as loanwords. This is in line with the contemporary distribution of words in the target languages within the news domain, as observed from other sources as well.
5.2 Baselines

As outlined in Section 2, the unsupervised version of the problem of telling apart native words and loanwords for Malayalam and/or similar languages has, to the best of our knowledge, not been addressed in the literature. The unsupervised Malayalam-focused method (Ref: Sec 2.2) can identify only contiguous sequences of two or more loanwords, making it inapplicable in general contexts where individual English words are often transliterated for want of a suitable Malayalam alternative. As an example, that method would be able to identify police as a loanword only where it appears together with other loanwords; while such scenarios, e.g., police station and traffic police, do appear often, police abundantly appears by itself. The Korean method is too specific to the Korean language and cannot be used for other languages, owing to the absence of a generic high-precision rule for identifying a seed set of loanwords. With both unsupervised state-of-the-art approaches inapplicable to our task, we compare against an intuitive generalization-based baseline, called GEN, which orders words based on their support from the combination of a unigram and a bigram character language model learnt over $\mathcal{W}$; this leads to a scoring as follows:
where $B$ and $U$ are bigram and unigram character-level language models built over all words in $\mathcal{W}$. We set the interpolation weight to a value observed to be empirically strong for GEN. We experimented with higher-order models in GEN, but observed drops in the evaluation measures, leading us to stick with the unigram and bigram models. The form of Eq. 16 is inspired by an assumption, similar to that used in earlier work, that loanwords are rare; we thus expect them not to be adequately supported by models that generalize over the whole of $\mathcal{W}$. Intuitively, the GEN score may be thought of as a measure of outlierness, quantifying deviation from a model learnt over the corpus. We also compare against our diversity-based initialization score from Section 4.1, which we call INIT. For ease of reference, we restate the INIT scoring:
The comparison against INIT enables us to isolate and study the value of the iterative update formulation vis-a-vis the initialization.
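For concreteness, the GEN baseline can be sketched as below; the interpolation weight and the small smoothing constant are illustrative choices, not the paper's empirically chosen values:

```python
import math
from collections import Counter

def build_gen_scorer(words, lam=0.5):
    """Return a GEN-style scorer: length-normalized log support of a word
    under an interpolation of bigram and unigram character language models
    built over the whole vocabulary."""
    uni, bi = Counter(), Counter()
    for w in words:
        uni.update(w)
        bi.update(zip(w, w[1:]))
    u_tot, b_tot = sum(uni.values()), max(sum(bi.values()), 1)

    def score(word):
        s = 0.0
        for i, c in enumerate(word):
            p_u = uni[c] / u_tot
            p_b = bi[(word[i - 1], c)] / b_tot if i > 0 else p_u
            s += math.log(lam * p_b + (1 - lam) * p_u + 1e-12)
        return s / len(word)

    return score
```

Words built from character sequences common in the vocabulary score higher than outliers, matching the intuition that rare loanwords receive weak support from models that generalize over the whole word list.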
5.3 Evaluation Outline
As outlined in Section 3, we use top-k, bottom-k, and avg-k precision (evaluated at varying values of $k$), as well as clustering quality, in our evaluation. Of these, the clustering quality metric provides an overview of the performance of UNS vis-a-vis the baselines, while the precisions at the ends of the ordering allow for delving deeper into the orderings produced by the nativeness scores. Accordingly, we start by analyzing the clustering quality of UNS against the baselines across varying settings of $n$ (n-gram length), $\lambda$ (the weighting between the two terms in the optimization), and $\delta$ (used in the construction of the highly diverse words set). This is followed by a similar analysis over the metrics of top-k, bottom-k, and avg-k precision against the baseline methods. We then perform a deeper analysis of UNS to understand its sensitivity to other parameters, such as the word-stem length and $\tau$ (used in the initialization), to conclude our empirical evaluation. Each of the above analyses is performed separately on the Malayalam and Telugu datasets described in Section 5.1. Unless mentioned otherwise, we set the word-stem length to two and fix the diversity threshold $\delta$. UNS iterations were continued until fewer than 1% of labels changed across successive iterations, or until a preset number of iterations was reached, whichever was earlier.
5.4 Evaluation on Clustering Quality: UNS vs. Baselines
Tables 2 and 3 record the results of UNS against the baselines over the Malayalam and Telugu datasets respectively. As outlined in Section 3.1, the tables list the clustering quality for the native and loanword clusters, followed by the weighted average that provides a single evaluation measure over the entire dataset. The results suggest that UNS outperforms the baselines across a wide variety of parameter settings: each of the UNS performance numbers, on each of the measures, is better than the respective numbers for both baselines. Among the varying parameter settings, one configuration consistently records the best numbers across both the Malayalam and Telugu datasets; the best numbers are boldfaced in the tables. That the technique peaks at the same parameter settings for both languages indicates that the UNS modelling is able to exploit the commonalities in lexical structure between the two Dravidian languages. The best n-gram setting entails the usage of single-character-level probability distributions, whereas the best weighting ensures an even balance between the model conformance and inertia terms. The diversity threshold marks the strength of the inertia, in that smaller values cause a larger set of words to be weighted into the inertia term; the higher performance at such settings further illustrates the importance of the inertia term in driving UNS towards desirable scorings. It is further notable that the setting that discards the inertia term records a significant drop in performance as against settings where the inertia term is factored in. Overall, the clustering quality evaluation establishes the consistent performance of UNS and illustrates why it should be the preferred method for the task.
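The weighted-average clustering quality reported in the tables can be sketched as a size-weighted cluster purity; the exact measure from Section 3.1 may differ in details, so the following is only an illustrative reading of it:

```python
def clustering_quality(assignments, labels):
    """Per-cluster purity and its size-weighted average over the dataset.
    assignments: dict word -> predicted cluster ('native' or 'loan').
    labels: dict word -> gold cluster label."""
    quality, sizes = {}, {}
    for cluster in ("native", "loan"):
        members = [w for w, c in assignments.items() if c == cluster]
        sizes[cluster] = len(members)
        correct = sum(labels[w] == cluster for w in members)
        quality[cluster] = correct / len(members) if members else 0.0
    total = sum(sizes.values())
    # Single evaluation measure over the entire dataset.
    weighted = sum(quality[c] * sizes[c] for c in quality) / total
    return quality["native"], quality["loan"], weighted
```

The weighted average rewards methods that keep both clusters pure, rather than excelling on only one side.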
[Tables 2 and 3: UNS results across parameter settings, on the Malayalam and Telugu datasets respectively.]
[Table 4: End-precision evaluation on the Malayalam and Telugu datasets, at k = 50, 100 and 200.]
5.5 Evaluation on End-Precisions: UNS vs. Baselines
Table 4 lists the end-precision metrics, laid out in Section 3.1, across varying values of k. While the top-k precision measures the fraction of native words at the native end of the ordering formed by the scores, the bottom-k precision measures the fraction of loanwords at the other end. The key indicator is the avg-k precision, which is the mean of the precisions at the two ends. It is, however, to be noted that this evaluation focuses only on a subset of the dataset, with the remaining data points excluded from influencing the result; thus, these evaluations serve only the limited purpose of ensuring that either end of the ordering is pure in the expected sense. In many cases where automated scoring is applied towards a two-class classification, it may be desirable to subject the ambiguous band of objects in the middle to manual verification, whereas the labels at the ends may be regarded as trustworthy enough to bypass it. Such manual verification may be inevitable in critical processes such as those in healthcare and security, making the end-precision measures more useful to analyze for such scenarios than clustering quality. In the interest of brevity, Table 4 lists the performance of UNS at the parameter setting found to be most desirable for both languages in the analysis of the previous section. It is easy to see from the results that UNS convincingly outperforms the baselines on the avg-k measure across varying values of k over both languages; this confirms the observations from the previous section. It is interesting to note the trend of the top-k precision, on which INIT, the initialization used in UNS, scores better. Top-k measures the purity at the high end of the scores; this means that the initialization heuristic is very effective at placing native words in the high range. However, it fares very badly in the low range, as indicated by its sharply dropping bottom-k precision. This offers a perspective on UNS as well: starting from an ordering that is accurate only within the high territory, the UNS model is able to learn probability distributions meaningful enough to spread the gains evenly across the whole spectrum of scores over the course of the iterative refinements.
5.6 UNS Evaluation: Parameter Sensitivity and Objective Function Trends
Diversity Threshold and Word Stem Length: Thus far, we have retained a word stem length of two and a fixed diversity threshold across all analyses. In this section, we study the sensitivity of UNS to these two parameters, with all other parameters set as earlier. The results across varying values of the diversity threshold and the stem length are shown in Tables 5 and 6 respectively. Table 5 suggests that UNS is extremely robust to variations in the diversity threshold, despite a slight preference towards certain values; a system designer looking to use UNS thus need not carefully tune this parameter. Given the nature of Malayalam and Telugu, where the variations in word length are not as high as in English, it seemed natural to use a word stem length of two. Moreover, very long words are uncommon in Malayalam and Telugu; in our corpus, a large majority of words were found to contain five characters or fewer. Our analysis of UNS across variations in word-stem length, illustrated in Table 6, strongly supports this intuition, with clustering quality peaking at stem lengths close to two (for Malayalam, the actual peak is at a different value, but its improvement over stem length two is marginal). It is notable, however, that UNS degrades gracefully beyond that range. Trends across different settings of word-stem length are interesting since they may provide clues about applicability to other languages with differing character granularities (e.g., each Chinese character corresponds to multiple characters in Latin script).
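As a rough illustration of the diversity-beyond-stem notion that underlies the word-stem-length parameter, the sketch below counts, for each stem prefix, the number of distinct continuation characters observed in the vocabulary. The paper's INIT scoring may normalize and aggregate this differently; the names here are ours.

```python
from collections import defaultdict

def stem_diversity(vocab, stem_len=2):
    """For each word stem (prefix of stem_len characters), count the distinct
    characters that follow it across the vocabulary. Native stems, which take
    part in rich morphological processes, tend to show higher diversity than
    stems of transliterated loanwords."""
    continuations = defaultdict(set)
    for word in vocab:
        if len(word) > stem_len:
            continuations[word[:stem_len]].add(word[stem_len])
    return {stem: len(chars) for stem, chars in continuations.items()}
```

Varying `stem_len` here mirrors the stem-length sensitivity analysis in Table 6.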
Objective Functions Across Iterations: Across UNS iterations, we expect the maximizing and minimizing objectives to be progressively refined in the appropriate directions. Figure 1 plots the objective function trends across iterations for the Malayalam dataset. The trends show, as expected, rapid objective function changes in the initial iterations (the max objective increasing and the min objective decreasing), with the values stabilizing within a few iterations. Similar trends were observed for the Telugu dataset as well as for varying settings of the hyperparameters; the corresponding chart appears in Figure 2. That the objective functions show a steady movement towards convergence as iterations progress is, we believe, indicative of the soundness of the UNS formulation.
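The iterative refinement with the stopping rule stated earlier (fewer than 1% of labels changing across successive iterations, or a maximum iteration count) can be sketched as follows; `refine_step` is a placeholder for the paper's alternating score/distribution updates, and the 0.5 labelling threshold is our illustrative choice:

```python
def run_uns_iterations(words, init_scores, refine_step, max_iters=10, tol=0.01):
    """Alternating-refinement loop. Stops when fewer than tol (1%) of the
    binary labels change between successive iterations, or after max_iters."""
    scores = dict(init_scores)
    labels = {w: scores[w] >= 0.5 for w in words}  # provisional native/loan labels
    for _ in range(max_iters):
        scores = refine_step(scores)               # one round of score refinement
        new_labels = {w: scores[w] >= 0.5 for w in words}
        changed = sum(labels[w] != new_labels[w] for w in words)
        labels = new_labels
        if changed < tol * len(words):             # fewer than 1% labels changed
            break
    return scores
```

With a `refine_step` that makes no changes, the loop terminates after a single iteration, matching the convergence behaviour observed in the figures.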
5.7 UNS Qualitative Analysis
Towards analyzing the results qualitatively, we now present the top-10 words at either end of the spectrum for Malayalam and Telugu in Figures 3 and 4 respectively. The labelling is driven by the motivation to illustrate correctness, which depends on the choice of ends; this is detailed in the captions of the respective figures. From this analysis, we found that the highest nativeness scores are generally achieved by commonly used words, with the native words that appear at the low end being those that are quite rarely used.
6.1 Applicability to Other Languages
In contrast to earlier work focused on specific languages, which uses heuristics very specific to the language (such as expected patterns of consonants), the UNS framework is general-purpose in design. The main heuristic setting likely to require tuning for applicability to other languages, such as other Indic languages, is the word-stem length. We expect the approach would generalize well to other Sanskrit-influenced Dravidian languages such as Kannada, Tulu and Kodava, but may require some adaptation for others such as Tamil, due to a lack of diversity in the alphabet. Unfortunately, we did not have any Kannada/Tulu/Kodava expertise in our team (Dravidian languages have largely disjoint speaker populations), nor access to labelled datasets in those languages (they are low-resource languages too); testing UNS on Kannada/Tulu/Tamil would be interesting future work.
6.2 The Nuanced Nature of Word Nativity
As an empirically oriented work, we have considered native and non-native as two distinct and mutually exclusive concepts. This is reflected in our formulation of the nativeness and loanwordness scores, and in our evaluation dataset that makes use of binary native/non-native labels. However, as may be well understood, nativeness is a much more nuanced concept. A loanword that has been in use for a long time in a language may be regarded as native for every practical purpose, making the mutual exclusivity embedded in the construction inappropriate. For example, a widely used Malayalam word denoting chair has its origins in the Portuguese word cadeira; with the embedding of the word within Malayalam being so pervasive that most native speakers are unaware of the Portuguese connection, it may be argued to have both high nativeness and high loanwordness. Additionally, languages such as Sanskrit, which have influenced some Dravidian languages for many centuries, have contributed words that take part in productive and complex morphological processes within the latter. For these and other reasons, it may be meaningful to consider a more structured scoring and labelling process for words when extending UNS to scenarios that need to be sensitive to such distinctions.
6.3 UNS in an Application Context
Within any target application context, and especially in domains such as healthcare and security, machine-labelled non-native words (and their automatically generated transliterations) may need manual screening for accountability reasons. The high accuracy at either end of the ordering lends itself to being exploited in the following fashion: in lieu of employing experts to verify all labellings/transliterations, low-expertise volunteers (e.g., students or Mechanical Turkers) can be called in to verify labellings at the ends (top/bottom) of the list, with experts focusing on the middle (more ambiguous) part. This frees up experts' time as against a cross-spectrum expert-verification process, leading to direct cost savings.
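The triage workflow described above can be sketched as a simple split of the score-ranked list; this is a hypothetical helper illustrating the idea, not part of UNS itself:

```python
def triage(scored_words, k):
    """Split a nativeness-ranked word list into the two high-confidence ends
    (for verification by low-expertise volunteers) and the ambiguous middle
    band (for expert review). k is the band size at each end."""
    ranked = sorted(scored_words, key=lambda ws: ws[1], reverse=True)
    volunteers = ranked[:k] + ranked[-k:]  # most native and most loanword-like
    experts = ranked[k:-k]                 # ambiguous middle band
    return volunteers, experts
```

The choice of `k` trades off expert workload against the risk of letting ambiguous words bypass expert review.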
7 Conclusions and Future Work
We considered the problem of unsupervised separation of loanwords and native words in Malayalam and Telugu; this is a critical task in easing the automated processing of Malayalam/Telugu text in the company of other-language text. We outlined a key observation on the differential diversity beyond word stems, and formulated an initialization heuristic that coarsely separates native words and loanwords. We proposed the usage of probability distributions over character n-grams as a way of separately modelling native words and loanwords. We then formulated an iterative optimization method that alternately refines the nativeness scorings and the probability distributions. Our technique for the problem, UNS, which encompasses the initialization and the iterative refinement, was seen to significantly outperform other unsupervised baseline methods in our empirical study, establishing it as the preferred method for the task. We have also released our datasets and labelled subset to aid future research on this and related tasks.
7.1 Future Work
The applicability of UNS to other Indic languages is interesting to study. Due to our lack of familiarity with other languages in the family, we look forward to working with other groups to further the generalizability study. While nativeness scoring improvements directly translate to reductions in effort for manual downstream processing, quantifying the gains they bring about in translation and retrieval is interesting future work. Exploring the relationship and synergy between this task and Sandhi splitting would form another interesting direction.
Loanwords in Malayalam are often used to refer to topical content for which suitable native words are harder to find. Thus, loanwords could be preferentially treated towards building rules in interpretable clustering and for modelling context in regex-oriented rule-based information extraction. Loanwords might also hold cues for detecting segment boundaries in conversational transcripts [10, 13].
Acknowledgements
The authors would like to thank Giridhara Gurram for annotating the Telugu dataset.
References
- (2008) Statistical identification of English loanwords in Korean using automatically generated training data. In LREC.
- (2012) Interpretable and reconfigurable clustering of document datasets by deriving word-based rules. Knowledge and Information Systems 32(3), pp. 475–503.
- (2014) Generating a word-emotion lexicon from #emotional tweets. In *SEM@COLING, pp. 12–21.
- (1996) Identification and classification of proper nouns in Chinese texts. In COLING.
- (2014) Unsupervised solution post identification from discussion forums. In ACL (1), pp. 155–164.
- (2008) Identification of transliterated foreign words in Hebrew script. In Computational Linguistics and Intelligent Text Processing, pp. 466–477.
- (1999) Automatic identification and back-transliteration of foreign words for information retrieval. Information Processing & Management 35.
- (2018) It pays to be certain: unsupervised record linkage via ambiguity minimization. In PAKDD 2018, Melbourne, VIC, Australia, Proceedings, Part III, pp. 177–190.
- (2015) An unsupervised method for identifying loanwords in Korean. Language Resources and Evaluation 49(2), pp. 355–373.
- Unsupervised segmentation of conversational transcripts. Statistical Analysis and Data Mining: The ASA Data Science Journal 2(4), pp. 231–245.
- (2012) Improving recall of regular expressions for information extraction. In International Conference on Web Information Systems Engineering, pp. 455–467.
- (2011) S³: statistical sandhi splitting. In IJCNLP, pp. 301–308.
- (2007) Mining conversational text for procedures with applications in contact centers. International Journal on Document Analysis and Recognition 10(3), pp. 227–238.
- (2014) A technique to extract transliterating phrases in statistical machine translation from English to Malayalam. In National Conference on Indian Language Computing.
- (2006) An investigation of Dirichlet prior smoothing's performance advantage. IR.
- (2015) Lexicon stratification for translating out-of-vocabulary words. In ACL Short Papers, pp. 125–131.
- (2014) Loan words in Telugu newspaper language - present trends. International Journal of Interdisciplinary Advanced, pp. 123.