In recent years, we have witnessed an increasing interest in spoken or written dialect identification, proven by a high number of evaluation campaigns Ali, Vogel, and Renals (2017); Malmasi et al. (2016); Rangel et al. (2017); Zampieri et al. (2017, 2018, 2019) with more and more participants. In this paper, we explore the Moldavian versus Romanian Cross-Dialect Topic Identification (MRC) shared task, which was introduced as a task in the VarDial 2019 evaluation campaign Zampieri et al. (2019), following the release of the MOROCO data set Butnaru and Ionescu (2019a). The shared task included two sub-task types: one that consisted in discriminating between the Moldavian and the Romanian sub-dialects and one that consisted in classifying documents by topic across the two sub-dialects of Romanian. However, our primary focus is on the Moldavian versus Romanian dialect identification task. Since MOROCO is the first data set of its kind, there are only a handful of works that studied Romanian dialect identification from a computational perspective Butnaru and Ionescu (2019a); Chifu (2019); Onose, Cercel, and Trăuşan-Matu (2019); Tudoreanu (2019); Wu et al. (2019).
Romanian, the language spoken in Romania, belongs to a Balkan-Romance group that emerged in the fifth century Coteanu, Bolocan, and Marioţeanu (1969), after it separated from the Western Romance branch of languages. The Balkan-Romance group is formed of four dialects: Aromanian, Daco-Romanian, Istro-Romanian, and Megleno-Romanian. We note that, within its group, Romanian is referred to as Daco-Romanian. Noting that Moldavian is a sub-dialect of Daco-Romanian, which is spoken in the Republic of Moldova and in northeastern Romania, the Moldavian versus Romanian dialect identification task is actually a sub-dialect identification task. The Moldavian sub-dialect can be delimited from Romanian in large part by its phonetic features, and only marginally, by morphological and lexical features Pavel (2008). Hence, it is much easier to distinguish between the spoken Moldavian and Romanian dialects than the written dialects. This is a first hint that discriminating between Moldavian and Romanian is not an easy task, at least from a human point of view. It is important to add the fact that Romania and the Republic of Moldova have the same literary standard Minahan (2013). In this context, some linguists Pavel (2008) believe that a dialectal division between the two countries is not justified. In our case, we study the challenging Moldavian versus Romanian written sub-dialect identification, since the data set available for the experiments is composed of written news articles Butnaru and Ionescu (2019a). We naturally assume that the news articles follow the literary standards. Furthermore, named entities are masked in the entire corpus. Considering all these facts, the dialect identification task should be very difficult. We analyze the difficulty of the task from a human perspective by asking human annotators from Romania and the Republic of Moldova to label news articles with the corresponding dialect. 
Given that the average accuracy of the human annotators is around , the human evaluation confirms the difficulty of the task. Interestingly, the machine learning (ML) methods proposed so far Butnaru and Ionescu (2019a); Chifu (2019); Onose, Cercel, and Trăuşan-Matu (2019); Tudoreanu (2019); Wu et al. (2019) attain much higher accuracy rates. For example, the top scoring system in the VarDial 2019 evaluation campaign Tudoreanu (2019) obtained a macro $F_1$ score of 0.895 for Moldavian versus Romanian dialect identification. Furthermore, the string kernels baseline proposed by Butnaru and Ionescu (2019a) seems to perform even better, with a macro $F_1$ score of 0.941. We therefore consider the machine learning systems for Moldavian versus Romanian dialect identification to be unreasonably effective.
We note that the high accuracy rates of the ML systems can be influenced by different factors. The first factor to consider is that the ML systems have access to a large training set from which many discriminative features can be learned, including features unrelated to the dialect identification task, such as features specific to the author style. The second factor is that the samples are full-length news articles formed of several sentences. This increases the chance of finding discriminative features in just about every sample. The third factor is that the news articles are collected from different publication sources from Romania and the Republic of Moldova, and an ML system could just learn to discriminate among the publication sources. In order to explain the unreasonable effectiveness of machine learning systems, we conduct a series of comprehensive experiments on MOROCO, considering all these factors. First of all, we perform experiments considering only the first sentence in each news article, significantly reducing the length of the text samples. Second of all, we test the systems on a new set of tweets from Romania and the Republic of Moldova collected from a different time period, making sure that the publication sources in the training and the test set are different. This generates a cross-domain (or cross-genre) dialect identification task, with the training (source) domain being represented by news articles and the test (target) domain being represented by tweets. Our findings indicate that, even in this difficult cross-domain setting, the ML systems still outperform humans by a significant margin. We therefore delve into analyzing and visualizing the discriminative features of one of the best-performing ML systems. Our analysis indicates that the machine learning models take their decisions mostly based on morphological and lexical features, many of which were previously unknown to us.
Upon reimplementing and evaluating most of the previously proposed methods from the related literature Butnaru and Ionescu (2019a); Onose, Cercel, and Trăuşan-Matu (2019); Tudoreanu (2019); Wu et al. (2019), we considered an ensemble learning method in order to find out if further accuracy improvements are possible. Our empirical results show that ensemble learning is useful, indicating that the features captured by the various machine learning models, ranging from string kernels to convolutional and recurrent neural networks, are somewhat complementary.
The remainder of this paper is organized as follows. We present related work on dialect identification in Section 2. We describe the machine learning systems and the ensemble learning method in Section 3. We present the experiments in Section 4, followed by a discussion of the most discriminative features in Section 5. Finally, we draw our conclusions in Section 6.
2 Related Work
2.1 Dialect Identification
Dialect identification has been acknowledged in the computational linguistics community as an important task, with multiple events and shared tasks materializing this acknowledgement Malmasi et al. (2016); Zampieri et al. (2014, 2015, 2017, 2018, 2019). Naturally, some of the most widespread languages also tend to be the best studied in terms of dialect identification from a computational linguistics perspective.
To our knowledge, it seems that Arabic is one of the most studied languages, considering modern setups, such as social media AlYami and AlZaidy (2020), large and diverse corpora, such as QADI Abdelali et al. (2020), dialect recognition from speech Hanani and Naser (2018); Shon et al. (2020) or dialect identification from travel text and tweets Mishra and Mujadia (2019). Preliminary works dealing with Arabic dialect identification used various handcrafted and linguistic features. For instance, Biadsy, Hirschberg, and Habash (2009) employed a phonotactic approach to differentiate among four Arabic dialects with good accuracy. In the same direction of study, we can also mention the efforts involving experiments on the Arabic Online Commentary Dataset Zaidan and Callison-Burch (2011, 2014). More recently, Guellil and Azouaou (2016) proposed an unsupervised approach for Algerian dialect identification. Another interesting study is conducted by Salameh, Bouamor, and Habash (2018), where the city of each speaker is identified based on the spoken dialect. The evaluation campaigns Ali, Vogel, and Renals (2017); Bouamor, Hassan, and Habash (2019); Malmasi et al. (2016); Zampieri et al. (2017, 2018) represent one more proof that dialect identification is of much interest from the Arabic language perspective, as these campaigns included a shared task for Arabic dialect identification. We note that one of the most successful approaches in the Arabic dialect identification shared tasks is based on string kernels Butnaru and Ionescu (2018); Ionescu and Popescu (2016); Ionescu and Butnaru (2017).
Among the well-studied languages from a dialectal perspective, there is also Chinese. Tsai and Chang (2002) proposed a Gaussian Mixture Bigram Model for the differentiation of three major Chinese dialects spoken in Taiwan. Later, Ma, Zhu, and Tong (2006) attempted to distinguish among three different Chinese dialects from speech. A semi-supervised approach, outperforming the initial Gaussian Mixture Models (GMM) for dialect identification, is introduced by Mingliang, Yuguo, and Yiming (2008). In Xia et al. (2011), gender is employed as a factor in deciding the dialect of different Chinese utterances. A more recent work Jun (2017) employed deep bottleneck features, which are related to the phoneme level. Through deep bottleneck features, the author attempts to suppress the influence of redundant dialect information in the features.
A number of works targeting dialect identification were also published for Spanish. The first such work Zissman et al. (1996) aims at differentiating the Cuban and Peruvian dialects of Spanish. The same task is addressed later by Torres-Carrasquillo, Gleason, and Reynolds (2004), with an approach based on GMMs that is, however, less accurate than that of Zissman et al. (1996). In Huang and Hansen (2006), GMMs with mixture and frame selection are used for Latin-American Spanish dialect identification. More recently, Francom, Hulden, and Ussishkin (2014) introduced the ACTIV-ES corpus, with informal language records of Spanish speakers from Argentina, Mexico and Spain.
MOROCO Butnaru and Ionescu (2019a), the data set on which the current study is based, comes as a response to the increasing interest in dialect identification with many research efforts for languages such as Arabic Alsarsour et al. (2018); Bouamor et al. (2018); Zaidan and Callison-Burch (2011), Spanish Francom, Hulden, and Ussishkin (2014), Indian Kumar et al. (2018) and Swiss Samardzic, Scherrer, and Glaser (2016), trying to attract interest towards under-studied languages such as Romanian.
2.2 Romanian Sub-Dialect Identification
The classification of Romanian into four dialects, i.e. Daco-Romanian, Istro-Romanian, Aromanian and Megleno-Romanian, has been studied from a purely linguistic perspective for a few decades Caragiu-Marioțeanu (1975); Petrovici (1970); Puşcariu (1976). A modern linguistic work Lozovanu (2012) that studied Romanian and its dialects addressed the subject from a geographical, historical and etymological angle. In another modern study, Nisioi (2014) proposed a quantitative approach in the investigation of the syllabic structure of the Aromanian dialect, proposing a rule-based algorithm for automatic syllabification. The aforementioned works are valuable studies performed from a social sciences perspective. However, we are interested in the computational nature of differentiating among Romanian and its dialects or sub-dialects. In this regard, to our knowledge, there is one single work Ciobanu and Dinu (2016) that studied Romanian dialects from a computational linguistics perspective before the VarDial 2019 evaluation campaign Zampieri et al. (2019). Ciobanu and Dinu (2016) offer a comparative analysis of the phonetic and orthographic discrepancies between various Romanian dialects. However, the data set used in their endeavour to automatically differentiate among the aforementioned dialects is rather small, containing only 108 words.
Butnaru and Ionescu (2019a) introduced MOROCO, a data set of 33,564 online news reports collected from Romania and the Republic of Moldova. For each news article, the data set provides dialect labels as well as category labels. The authors applied two effective approaches in tackling the problems of dialect identification and categorization by topic: a character-level convolutional neural network, inspired by Zhang, Zhao, and LeCun (2015), and a simple Kernel Ridge Regression with custom string kernels, following Popescu and Ionescu (2013). We note that the data set proposed by Butnaru and Ionescu (2019a) was also used as benchmark in the first shared task on Moldavian versus Romanian Cross-Dialect Topic Identification (MRC), generating an additional set of publications Chifu (2019); Onose, Cercel, and Trăuşan-Matu (2019); Tudoreanu (2019); Wu et al. (2019). In the MRC shared task, the following sub-tasks were proposed: binary classification by dialect and cross-dialect categorization by topic. The participants proposed various approaches for the MRC shared task, ranging from deep learning models based on word embeddings Onose, Cercel, and Trăuşan-Matu (2019) or character embeddings Tudoreanu (2019) to voting schemes based on a set of handcrafted statistical features Chifu (2019). In our study, we consider the best performing models in the MRC shared task Onose, Cercel, and Trăuşan-Matu (2019); Tudoreanu (2019); Wu et al. (2019) along with the baselines proposed by Butnaru and Ionescu (2019a), combining these approaches into ensemble models based on voting or stacking. Different from all prior works, we make the following important contributions:
We introduce a new set of over 5,000 Moldavian and Romanian tweets, enabling us and future works to study Romanian dialect identification in a cross-genre scenario.
We study the Romanian dialect identification task in new scenarios, considering models trained on sentences (instead of full news articles) and applied on sentences or tweets, showing how performance degrades as the scenario gets more difficult.
We study how native Romanian or Moldavian speakers compare to the ML models for dialect identification and categorization by topic, showing that there is a significant performance gap in favor of the ML models for dialect identification.
We present Grad-CAM visualizations Selvaraju et al. (2017) revealing dialectal patterns that explain the unreasonable effectiveness of the ML models. The newly discovered patterns were not known to us or to the human annotators.
Throughout this section, we present in detail the most relevant models from the related literature Butnaru and Ionescu (2019a); Onose, Cercel, and Trăuşan-Matu (2019); Tudoreanu (2019); Wu et al. (2019), which we have selected to build an ensemble. From Butnaru and Ionescu (2019a), we select the Kernel Ridge Regression based on string kernels, since this is their best baseline. From Tudoreanu (2019), the winner of the Moldavian versus Romanian dialect identification sub-task, we select the character-level convolutional neural network (CNN), which is similar in design to the character-level CNN presented by Butnaru and Ionescu (2019a). Onose, Cercel, and Trăuşan-Matu (2019) applied three different deep learning models: a Long Short-Term Memory (LSTM) network, a Bidirectional Gated Recurrent Units (BiGRU) network and a Hierarchical Attention Network (HAN). Since these deep models are quite diverse, we included all their models in our study. Finally, from Wu et al. (2019), we considered the Support Vector Machines based on character n-grams. For efficiency reasons, we employed the dual form of their SVM, which is given by string kernels. We note that the considered methods form a broad variety that includes both shallow models based on handcrafted features and deep models based on automatically-learned character or word embeddings. Nevertheless, all methods are essentially based on two steps, data representation and learning, although in some models, e.g. the character-level CNN, the steps are performed in an end-to-end fashion. We next provide details about the data representations and the learning models considered in our experiments.
3.1 Data Representations
Word Embeddings. Some of the first statistical learning models for building vectorial word representations were introduced in Bengio et al. (2003); Schütze (1993). The goal of vectorial word representations (word embeddings) is to associate similar vectors to semantically related words, allowing us to express semantic relations mathematically in the generated embedding space. After the preliminary work of Bengio et al. (2003) and Schütze (1993), various improvements have been made in the quality of the embedding and the training time Collobert and Weston (2008); Mikolov et al. (2013a, b); Pennington, Socher, and Manning (2014), while some efforts have been directed towards learning multiple representations for polysemous words Huang et al. (2012); Reisinger and Mooney (2010); Tian et al. (2014). These improvements, and many others not mentioned here, have been extensively used in various NLP tasks Butnaru and Ionescu (2019b); Garg et al. (2018); Glorot, Bordes, and Bengio (2011); Ionescu and Butnaru (2019); Musto et al. (2016); Weston, Bengio, and Usunier (2011); Yang, Macdonald, and Ounis (2018).
In the experiments, we use pre-trained word embeddings as features for the LSTM, BiGRU and HAN models. The same set of distributed word representations as Onose, Cercel, and Trăuşan-Matu (2019) is employed in the feature extraction step. We note that these representations are learned from Romanian corpora, such as the corpus for contemporary Romanian language (CoRoLa) Mititelu, Tufiş, and Irimia (2018); Paiş and Tufiş (2018), Common Crawl (CC) and Wikipedia Grave et al. (2018), as well as from data coming from the Universal Dependencies project Nivre et al. (2016), which is added to the Nordic Language Processing Laboratory (NLPL) shared repository. In the remainder of this paper, we refer to these representations, in short, as CoRoLa, CC and NLPL.
Character Embeddings. Some of the pioneering works in language modelling at the character level are Gasthaus, Wood, and Teh (2010); Wood et al. (2009). To date, characters proved useful in a variety of neural models, such as Recurrent Neural Networks (RNNs) Sutskever, Martens, and Hinton (2011), LSTM networks Ballesteros, Dyer, and Smith (2015); Ling et al. (2015), CNNs Kim et al. (2016); Zhang, Zhao, and LeCun (2015) and transformer models Al-Rfou et al. (2019). Characters are the smallest units necessary in building words that exist in the vocabulary, regardless of language, as the alphabet changes only slightly across many languages. Thus, knowledge of words, semantic structure or syntax is not required when working with characters. Robustness to spelling errors and words that are outside the vocabulary Ballesteros, Dyer, and Smith (2015) constitute other advantages explaining the growing interest for using characters as features.
In our paper, we employ three models working at the character level, an SVM and a KRR based on character n-grams Wu et al. (2019), as well as a character-level CNN Butnaru and Ionescu (2019a); Tudoreanu (2019). The CNN is equipped with a character embedding layer, generating a 2D representation of text that is further processed by the convolutional layers. We provide additional details about the CNN in Section 3.2.
String Kernels. Lodhi et al. (2001, 2002) introduced string kernels as a means of comparing two documents, based on the inner product generated by all substrings of length $n$, typically known as n-grams. Of interest in determining the similarity are the n-grams that the two documents have in common. The authors applied string kernels in a text classification task with promising results. Since then, string kernels have found many applications, from protein classification Zaki, Deris, and Illias (2005) and learning semantic parsers Kate and Mooney (2006) to tasks as complex as recognising famous pianists by their playing style Saunders et al. (2004) or dynamic scene understanding Brun, Saggese, and Vento (2014). Other applications of the method include various NLP tasks across different languages, e.g. sentiment analysis Giménez-Pérez, Franco-Salvador, and Rosso (2017); Ionescu and Butnaru (2018); Popescu, Grozea, and Ionescu (2017), authorship identification Sanderson and Guenter (2006), automated essay scoring Cozma, Butnaru, and Ionescu (2018), sentence selection Masala, Ruseti, and Rebedea (2017), native language identification Ionescu, Popescu, and Cahill (2014, 2016); Ionescu and Popescu (2017); Popescu and Ionescu (2013) and dialect identification Butnaru and Ionescu (2018, 2019a); Ionescu and Butnaru (2017). Many improvements have also been added, incrementally, to the original method. These target the space usage Belazzougui and Cunial (2017), versatility Elzinga and Wang (2013) and time complexity Popescu, Grozea, and Ionescu (2017); Singh et al. (2017).
In this work, we employ string kernels as described in Butnaru and Ionescu (2019a), specifically using the efficient algorithm for building string kernels of Popescu, Grozea, and Ionescu (2017). We note that the number of character n-grams is usually much higher than the number of samples, so representing the text samples as feature vectors may require a lot of space. String kernels provide an efficient way to avoid storing and using the feature vectors (primal form), by representing the data through a kernel matrix (dual form). Each cell in the kernel matrix represents the similarity between two text samples $x_i$ and $x_j$. In our experiments, we use the presence bits string kernel Popescu and Ionescu (2013) as the similarity function. For two strings $x$ and $y$ over a set of characters $\Sigma$, the presence bits string kernel is defined as follows:
$$k^{0/1}(x, y) = \sum_{s \in \Sigma^n} \#(x, s) \cdot \#(y, s), \qquad (1)$$
where $n$ is the length of n-grams and $\#(x, s)$ is a function that returns 1 when the n-gram $s$ occurs at least once in $x$, and 0 otherwise.
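As a minimal sketch of the presence bits kernel described above, the snippet below computes the similarity of two strings as the number of distinct character n-grams they share, and builds the corresponding kernel matrix. The function names and the toy strings are illustrative assumptions; this is not the efficient algorithm of Popescu, Grozea, and Ionescu (2017), only a direct reading of the definition.

```python
def char_ngrams(text, n):
    """Return the set of distinct character n-grams occurring in text."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def presence_bits_kernel(x, y, n):
    """Count the distinct n-grams that occur at least once in both strings."""
    return len(char_ngrams(x, n) & char_ngrams(y, n))


def kernel_matrix(samples, n):
    """Pairwise similarities (dual form); no explicit feature vectors are stored."""
    return [[presence_bits_kernel(a, b, n) for b in samples] for a in samples]


# Toy strings chosen for illustration only.
docs = ["banana", "bandana", "cabana"]
K = kernel_matrix(docs, n=3)
for row in K:
    print(row)
```

Because only presence is recorded, the repeated trigram "ana" in "banana" contributes once, so the diagonal entry for that string is 3 rather than 4.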
3.2 Learning Models
Support Vector Machines. The objective of Support Vector Machines (SVM) is to find a hyperplane that best separates the data points provided in the training phase into two classes Cortes and Vapnik (1995). To ensure a good generalization capability, the SVM aims at maximizing the margin that separates the points of both classes. The margin is chosen based on the points that are closest to the decision boundary. These points are called support vectors and, not only do they give the name of the method, but they also influence the orientation and position of the hyperplane that is eventually used for classification during inference. Through the kernel trick, the SVM gains the power to classify data that is not linearly separable, since the data is mapped into a higher-dimensional space, where it becomes separable using a hyperplane Shawe-Taylor and Cristianini (2004). For multi-class classification, multiple SVM classifiers need to be trained in a one-versus-one or one-versus-rest scheme. In our text categorization by topic experiments, we employ the one-versus-one scheme. Instead of using a standard kernel, we employ the SVM with the custom string kernel based on character n-grams defined in Equation 1. We note that our dual SVM based on string kernels is mathematically equivalent to the primal SVM based on character n-grams employed by Wu et al. (2019). We prefer the dual SVM because it is more computationally efficient, as explained in detail by Ionescu, Popescu, and Cahill (2016).
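The one-versus-one scheme mentioned above can be sketched as follows: one binary classifier is consulted for every pair of topics and the most voted topic is returned. The `pairwise_decide` callable and the `toy_decide` rule are hypothetical stand-ins for trained binary SVMs, introduced only to make the voting logic runnable.

```python
from collections import Counter
from itertools import combinations

# The six topic labels of the categorization by topic task.
TOPICS = ["culture", "finance", "politics", "science", "sports", "tech"]


def one_versus_one_predict(sample, pairwise_decide, classes=TOPICS):
    """Let every pair of classes vote and return the most voted class.

    pairwise_decide(sample, a, b) is a hypothetical stand-in for a trained
    binary SVM separating class a from class b; it must return a or b.
    """
    votes = Counter(
        pairwise_decide(sample, a, b) for a, b in combinations(classes, 2)
    )
    return votes.most_common(1)[0][0]


def toy_decide(sample, a, b):
    """Toy binary rule (not an SVM): prefer the class named in the text."""
    if b in sample and a not in sample:
        return b
    return a


n_pairs = len(list(combinations(TOPICS, 2)))
prediction = one_versus_one_predict("latest sports results", toy_decide)
print(n_pairs, prediction)
```

With six topics, the scheme needs 6 * 5 / 2 = 15 binary classifiers, whereas a one-versus-rest scheme would need six.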
Kernel Ridge Regression. Ridge Regression Hoerl and Kennard (1970), or linear regression with $L_2$ regularization for overfitting prevention, has been combined with the kernel trick Saunders, Gammerman, and Vovk (1998), enabling the method to capture non-linear relations between features and responses. The kernel version, known as Kernel Ridge Regression (KRR), is a state-of-the-art technique Shawe-Taylor and Cristianini (2004) used in several recent works Butnaru and Ionescu (2019a); Ionescu, Popescu, and Cahill (2016); Ionescu and Butnaru (2018) with very good results. KRR can be seen as a generalization of simple Ridge Regression, learning a function in the Hilbert space described by the kernel. The function learned is either linear or non-linear, with respect to the original space, depending on the considered kernel Shawe-Taylor and Cristianini (2004). Although KRR can be used with any kernel function, we employ the KRR based on the kernel defined in Equation 1, as previously proposed by Butnaru and Ionescu (2019a). In order to repurpose the trained regressor as a (binary) classifier, we round the predicted continuous values to the nearest of the two class label values. For the multi-class text categorization by topic tasks, we employ KRR in a one-versus-rest scheme.
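A minimal sketch of KRR in its dual form is given below: the dual coefficients are obtained by solving (K + lambda I) alpha = y, and a test sample is scored using its kernel values against the training samples. The label encoding in {-1, +1}, the zero threshold, the regularization value and the toy kernel matrix are assumptions made for illustration; they are not taken from the experiments.

```python
def solve_linear_system(a, b):
    """Solve a @ x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def krr_fit(K, y, lam):
    """Dual coefficients alpha = (K + lam * I)^{-1} y."""
    n = len(y)
    regularized = [
        [K[i][j] + (lam if i == j else 0.0) for j in range(n)] for i in range(n)
    ]
    return solve_linear_system(regularized, y)


def krr_predict(alpha, k_test):
    """Score a test sample; k_test[i] is its kernel value with training sample i.

    Assumption: labels are encoded as -1 and +1, so the score is thresholded at 0.
    """
    score = sum(a * k for a, k in zip(alpha, k_test))
    return (1 if score >= 0 else -1), score


# Toy kernel matrix for two training samples (illustrative values only).
K_train = [[3.0, 1.0], [1.0, 3.0]]
y_train = [1.0, -1.0]
alpha = krr_fit(K_train, y_train, lam=1.0)
label, score = krr_predict(alpha, k_test=[2.0, 1.0])
print(alpha, label, score)
```

Note that training and inference only touch kernel values, which is what allows the string kernel to be plugged in without materializing n-gram feature vectors.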
Convolutional Neural Networks. The convolutional neural network (CNN) is a type of artificial neural network based on convolving multiple sets of filters in a sequential manner. The rectified outputs yielded by the convolution operation are called activation maps and they are subject to pooling operations, which provide a downscaled version of the activation maps, implicitly reducing the number of parameters and computations further used in the network. After repeating a number of convolutional blocks consisting of convolutions and pooling operations, a sequence of fully-connected layers typically follows, with the last layer having a number of units equal to the number of classes in the data set. Because CNNs are inspired by the mammalian visual cortex Bengio (2009); Fukushima (1980), they have been found suitable, initially, for image classification Krizhevsky, Sutskever, and Hinton (2012); Lawrence et al. (1997); LeCun et al. (1989); LeCun, Huang, and Bottou (2004). This approach has been, afterwards, adapted for natural language processing (NLP) problems Kim (2014); Zhang, Zhao, and LeCun (2015). In NLP, the meaning of the inputs changes: instead of image pixels, we have documents represented as a matrix, using either word dos Santos and Gatti (2014) or character embeddings Zhang, Zhao, and LeCun (2015).
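To make the convolutional block concrete, the sketch below embeds a character sequence as a matrix, slides a single filter over it, rectifies the outputs and applies max pooling. The two-character embedding table and the filter weights are hand-picked assumptions for illustration; in a trained network they would be learned, and many filters would be used per layer.

```python
def embed(text, table):
    """Map each character to its embedding vector (one row per position)."""
    return [table[ch] for ch in text]


def conv1d_relu(matrix, filt):
    """Slide one filter of width len(filt) over the rows and apply ReLU."""
    width = len(filt)
    out = []
    for start in range(len(matrix) - width + 1):
        total = sum(
            matrix[start + i][d] * filt[i][d]
            for i in range(width)
            for d in range(len(filt[i]))
        )
        out.append(max(0.0, total))
    return out


def max_pool(activations, size):
    """Non-overlapping max pooling that downscales the activation map."""
    return [
        max(activations[i:i + size])
        for i in range(0, len(activations) - size + 1, size)
    ]


# Illustrative one-hot style embeddings and a filter that responds to "ab".
table = {"a": [1.0, 0.0], "b": [0.0, 1.0]}
ab_filter = [[1.0, 0.0], [0.0, 1.0]]

activation_map = conv1d_relu(embed("abbaab", table), ab_filter)
pooled = max_pool(activation_map, size=2)
print(activation_map, pooled)
```

The activation map peaks exactly where the bigram "ab" occurs, which is the kind of localized evidence that visualization techniques can later trace back to the input text.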
One of the models that we employ in the experiments is a character-level CNN Zhang, Zhao, and LeCun (2015) with squeeze-and-excitation (SE) blocks, introduced by Butnaru and Ionescu (2019a). Our motivation for this choice of algorithm lies in the good results obtained on MOROCO by Butnaru and Ionescu (2019a) and by Tudoreanu (2019), and also, in the interpretability of the model through visualization techniques. We used the latter feature to get a better understanding of the CNN model’s effectiveness in Section 5, based on Grad-CAM visualizations Selvaraju et al. (2017).
Long Short-Term Memory Networks. Recurrent Neural Networks (RNNs) Werbos (1988) represent a type of neural model that operates at the sequence level, achieving state-of-the-art performance on language modeling tasks Chung et al. (2014); Weiss, Goldberg, and Yahav (2018), among other problems involving time series. Their effectiveness is constrained by the length of the input sequence. RNNs must use context in order to make predictions, while they also need to learn the context itself, which can lead to vanishing gradient problems Hochreiter et al. (2001), a major drawback of simple RNNs. This is solved in Long Short-Term Memory networks (LSTMs) Hochreiter and Schmidhuber (1997), which rely on an RNN architecture that uses a more complex structure for its base units. An LSTM unit has a cell that acts as a memory element, remembering dependencies in the input. The amount of information stored in this cell and its overall impact is controlled through three gates acting as regulators. The input and output gates control and select the information that flows into and out of the cell. Later versions of LSTMs also use forget gates, enabling the cell to reset its state for optimization reasons Gers, Schmidhuber, and Cummins (2000); Greff et al. (2016). With these modifications in terms of structure and computation, LSTMs are able to selectively capture long-term dependencies without the technical challenges faced when working with simple RNNs, i.e. exploding and vanishing gradients. Onose, Cercel, and Trăuşan-Matu (2019) showed that LSTMs are also useful in the dialect identification and categorization sub-tasks on the MOROCO data set. Hence, using this type of network in our experiments has been inspired by Onose, Cercel, and Trăuşan-Matu (2019).
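The gating mechanism described above can be sketched with a single-unit LSTM that has a scalar input and a scalar state. The weight dictionary and its values are illustrative assumptions, not trained parameters; real LSTM layers use weight matrices and vector-valued states.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x, h_prev, c_prev, w):
    """One step of a single-unit LSTM with scalar input and state.

    w maps each gate name to a (w_x, w_h, bias) triple; the values used
    below are illustrative, not trained.
    """
    def pre(name):
        w_x, w_h, b = w[name]
        return w_x * x + w_h * h_prev + b

    i = sigmoid(pre("input"))         # how much new information enters the cell
    f = sigmoid(pre("forget"))        # how much of the old cell state is kept
    o = sigmoid(pre("output"))        # how much of the cell state is exposed
    g = math.tanh(pre("candidate"))   # candidate content to be written
    c = f * c_prev + i * g            # updated memory cell
    h = o * math.tanh(c)              # updated hidden state
    return h, c


weights = {
    "input": (0.5, 0.1, 0.0),
    "forget": (0.0, 0.0, 2.0),  # positive bias: remember by default
    "output": (0.3, 0.2, 0.0),
    "candidate": (1.0, 0.5, 0.0),
}

h, c = 0.0, 0.0
for x in [1.0, 0.0, -1.0]:
    h, c = lstm_step(x, h, c, weights)
print(h, c)
```

Since the cell state is updated additively and scaled by the forget gate, information can persist over many steps when that gate stays close to 1.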
Bidirectional Gated Recurrent Units. Gated Recurrent Units (GRUs) Cho et al. (2014) implement a simplified version of LSTMs having only two gates, an update gate and a reset gate, i.e. the output gate is excluded. With fewer parameters than LSTMs, the performance achieved by GRUs on various tasks, e.g. speech recognition, is similar to the one achieved by LSTMs Ravanelli et al. (2018). Moreover, GRUs tend to outperform LSTMs on small data sets Chung et al. (2014). The roles seem reversed for problems such as language recognition Weiss, Goldberg, and Yahav (2018). We note that GRUs, as well as other types of RNNs, can use a bidirectional architecture, an adjustment made with the aim of addressing the need of knowing both the previous and the next context to understand the current word. Thus, a bidirectional Gated Recurrent Unit (BiGRU) model is composed of two vanilla GRUs, one with forward activations (i.e. getting information from the past) and one with backward activations (i.e. getting information from the future) Nussbaum-Thom et al. (2016). BiGRUs are among the models that proved their efficiency in the experiments conducted by Onose, Cercel, and Trăuşan-Matu (2019) on MOROCO, which is why we decided to include the BiGRU architecture in our set of models.
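The bidirectional arrangement can be sketched with a scalar GRU that is run once from left to right and once from right to left, pairing the two states at every position. The gate names follow the usual GRU formulation (update and reset gates); the parameter dictionaries are illustrative assumptions, and the interpolation convention for the update gate differs across papers and libraries.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def gru_step(x, h_prev, p):
    """One step of a single-unit GRU with scalar input and state."""
    z = sigmoid(p["wz"] * x + p["uz"] * h_prev + p["bz"])  # update gate
    r = sigmoid(p["wr"] * x + p["ur"] * h_prev + p["br"])  # reset gate
    h_tilde = math.tanh(p["wh"] * x + p["uh"] * (r * h_prev) + p["bh"])
    # One common convention: interpolate between old state and candidate.
    return (1.0 - z) * h_prev + z * h_tilde


def run_gru(sequence, p):
    """Return the hidden state after each element of the sequence."""
    h, states = 0.0, []
    for x in sequence:
        h = gru_step(x, h, p)
        states.append(h)
    return states


def bigru(sequence, p_forward, p_backward):
    """Pair past-aware and future-aware states for every position."""
    forward = run_gru(sequence, p_forward)
    backward = run_gru(sequence[::-1], p_backward)[::-1]
    return list(zip(forward, backward))


# Illustrative, untrained parameters for the two directions.
p_fwd = {"wz": 1.0, "uz": 0.5, "bz": 0.0, "wr": 1.0, "ur": 0.5, "br": 0.0,
         "wh": 1.0, "uh": 0.5, "bh": 0.0}
p_bwd = dict(p_fwd, wh=-1.0)

inputs = [0.5, -1.0, 2.0]
outputs = bigru(inputs, p_fwd, p_bwd)
print(outputs)
```

At the first position the forward state has seen only the first input, while the backward state has already summarized the whole remaining sequence, which is the extra context the bidirectional design provides.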
Hierarchical Attention Networks. Proposed by Yang et al. (2016), Hierarchical Attention Networks (HANs) have been initially applied in document classification. The success obtained on this task is explained by the natural approach taken in HANs, reflecting the structure of documents through attention mechanisms applied at two levels: for words that form sentences and for sentences as components of documents. In the case of HANs, the attention mechanism uses context to spot relevant sequences of tokens in a given sentence or document. Essentially, the same algorithms, namely encoding and selection by relevance, are applied twice, at the word level and also at the sentence level Yang et al. (2016). As for the previously described methods, i.e. LSTM and BiGRU, the inclusion of HAN in our set of models to be used in the experiments has its motivation in the results obtained by Onose, Cercel, and Trăuşan-Matu (2019).
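The two-level selection by relevance can be sketched as attention pooling applied twice: first over the word vectors of each sentence, then over the resulting sentence vectors. This is a simplified illustration under stated assumptions: the recurrent encoders and the projection layer of the full HAN are omitted, and the context vectors and toy inputs are hypothetical values rather than learned parameters.

```python
import math


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(vectors, context):
    """Weight each vector by its relevance to the context and sum them."""
    scores = [sum(v_i * c_i for v_i, c_i in zip(v, context)) for v in vectors]
    weights = softmax(scores)
    dim = len(vectors[0])
    pooled = [sum(w * v[d] for w, v in zip(weights, vectors)) for d in range(dim)]
    return pooled, weights


def han_document_vector(document, word_context, sentence_context):
    """Apply attention at the word level, then at the sentence level."""
    sentence_vectors = [
        attention_pool(words, word_context)[0] for words in document
    ]
    return attention_pool(sentence_vectors, sentence_context)[0]


# A toy document: two sentences, each a list of 2-dimensional word vectors.
document = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[1.0, 0.0], [1.0, 0.0]],
]
doc_vector = han_document_vector(document, [2.0, 0.0], [0.0, 1.0])
uniform = han_document_vector(document, [0.0, 0.0], [0.0, 0.0])
print(doc_vector, uniform)
```

With all-zero context vectors every element receives the same weight, so the pooling reduces to plain averaging; a non-zero context shifts the weight towards the words or sentences that align with it.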
Ensemble Models. The main idea behind ensemble models is to combine multiple learning techniques in order to obtain a model that achieves better results than any of its individual components Opitz and Maclin (1999); Rokach (2010). The model obtained via ensemble learning is typically more stable and robust Gashler, Giraud-Carrier, and Martinez (2008). There is evidence that significant diversity among the component models of an ensemble leads to better results than bringing similar techniques together Kuncheva and Whitaker (2003); Sollich and Krogh (1996). We rely on this hypothesis in the experiments conducted in this work. More precisely, our models cover different features as input, from the basic character-level properties of string kernels to the hierarchical selection of words and sentences of HAN. Furthermore, we not only employ diverse types of features, but also use different, complementary learning techniques, ranging from shallow models, such as SVM and KRR, to deep models, such as CNNs and RNNs. Majority voting is one of the ensemble approaches that we have chosen for our experiments. In this approach, the models in the ensemble simply vote with equal weights. The second ensemble learning approach that we consider is stacking. In this approach, the predictions from all the considered models (SVM, KRR, CNN, LSTM, BiGRU and HAN) are used to train a meta-model that learns how to best combine the predictions of its components Wolpert (1992). We employ Multinomial Logistic Regression as our meta-classifier. We note that ensemble learning has not been previously studied on MOROCO. Hence, this is the first study to test the effectiveness of ensemble learning in Romanian dialect identification.
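The two ensemble strategies can be sketched as follows. This is a minimal illustration, assuming hard label votes for majority voting and, for stacking, a binary logistic regression meta-classifier trained by plain gradient descent on the scores of the base models. The function names, the toy scores and the training settings are hypothetical; the binary case merely stands in for the multinomial one, and none of this reproduces the actual experimental setup.

```python
import math
from collections import Counter

def majority_vote(model_predictions):
    """Combine per-model label lists with equal weights, one label per sample."""
    voted = []
    for sample_votes in zip(*model_predictions):
        # Counter.most_common breaks ties by first-seen order (arbitrary choice).
        voted.append(Counter(sample_votes).most_common(1)[0][0])
    return voted

def train_meta_classifier(features, labels, lr=0.5, epochs=200):
    """Binary logistic regression fitted by batch gradient descent."""
    n, d = len(features), len(features[0])
    weights, bias = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for x, y in zip(features, labels):
            z = bias + sum(w * xi for w, xi in zip(weights, x))
            err = 1.0 / (1.0 + math.exp(-z)) - y
            for j, xi in enumerate(x):
                grad_w[j] += err * xi
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias

def predict_meta(model, features):
    weights, bias = model
    return [int(bias + sum(w * xi for w, xi in zip(weights, x)) > 0)
            for x in features]

# Hypothetical scores (probability of the positive class) from three base models.
base_scores = [
    [0.9, 0.8, 0.2, 0.1, 0.7, 0.3],  # e.g. a string kernel model
    [0.6, 0.4, 0.4, 0.3, 0.8, 0.6],  # e.g. a character-level CNN
    [0.5, 0.6, 0.5, 0.4, 0.4, 0.5],  # e.g. a word-level model
]
labels = [1, 1, 0, 0, 1, 0]

# Majority voting on hard decisions.
hard_votes = [[int(s > 0.5) for s in scores] for scores in base_scores]
voted = majority_vote(hard_votes)

# Stacking: each sample is described by the (centered) scores of all base models.
meta_features = [[s - 0.5 for s in sample] for sample in zip(*base_scores)]
meta_model = train_meta_classifier(meta_features, labels)
stacked = predict_meta(meta_model, meta_features)
print(voted, stacked)
```

In practice, the meta-classifier should be fitted on predictions for held-out data rather than on the data used to train the base models, otherwise it learns to trust overfitted components.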
4.1 Data Sets
The Moldavian and Romanian Dialectal Corpus (MOROCO; https://github.com/butnaruandrei/MOROCO) Butnaru and Ionescu (2019a) is the main data set employed in the experiments conducted in this work. The corpus was collected from the top five news websites from Romania and the Republic of Moldova as data sources, using each country’s web domain (.ro or .md) to automatically label the news articles by dialect. Butnaru and Ionescu (2019a) also provide topic labels, assigning each news article in the corpus to one of six categories: culture, finance, politics, science, sports, tech. A minimum of approximately 2,000 samples per dialect is obtained for each topic. The corpus was automatically pre-processed to remove named entities. MOROCO is comprised of 33,564 news articles, with an official split of 21,719 training samples, 5,921 validation samples and 5,924 test samples.
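The automatic labeling by web domain described above can be illustrated with a short sketch. The function name, the label strings and the example URLs are hypothetical; the actual collection pipeline is not described in this section.

```python
from urllib.parse import urlparse

# Assumed mapping from country-code top-level domain to dialect label.
DOMAIN_TO_DIALECT = {"ro": "RO", "md": "MD"}

def label_by_domain(url):
    """Return the dialect label implied by the URL's top-level domain, or None."""
    host = urlparse(url).hostname or ""
    tld = host.rsplit(".", 1)[-1].lower()
    return DOMAIN_TO_DIALECT.get(tld)

print(label_by_domain("https://www.example.md/sport/1"))
print(label_by_domain("https://stiri.example.ro/politica/2"))
```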
Although we are primarily interested in the dialect identification task, we present results for the full range of tasks proposed by Butnaru and Ionescu (2019a), namely:
binary discrimination between Romanian (RO) and Moldavian (MD);
Romanian intra-dialect categorization by topic;
Moldavian intra-dialect categorization by topic;
cross-dialect categorization by topic using Moldavian as source and Romanian as target;
cross-dialect categorization by topic using Romanian as source and Moldavian as target.
In this paper, we introduce an additional data set composed of tweets collected from Romania and the Republic of Moldova, which allows us to evaluate the machine learning models in a cross-genre dialect identification setting. The tweets were collected from a different time period, helping us to reveal any overfitting behavior of the models. The MOROCO-Tweets data set is divided into a validation set of 215 tweets and a test set of 5,022 tweets, both having a balanced distribution of Moldavian and Romanian tweets. Specifically, the validation set is composed of 113 Moldavian tweets and 102 Romanian tweets, while the test set is composed of 2,499 Moldavian tweets and 2,523 Romanian tweets. All tweets are pre-processed for named entity removal. We did not collect any topic labels from Twitter, since we are mostly interested in cross-genre dialect identification. We will release the labeled MOROCO-Tweets data set for public use after the end of the VarDial 2020 evaluation campaign.
4.2 Experimental Setup
We first evaluate the considered machine learning models on MOROCO, using the complete news articles, as in all previous works Butnaru and Ionescu (2019a); Onose, Cercel, and Trăuşan-Matu (2019); Tudoreanu (2019); Wu et al. (2019). Since our aim is to determine the extent to which machine learning models attain good performance levels, we consider an additional scenario in which we keep only the first sentence from each news article. This essentially transforms all tasks into sentence-level classification tasks. As we keep the same number of data samples, the sentence-level classification accuracy rates are expected to drop, essentially because there are fewer patterns in the data. We include an even more difficult evaluation setting, testing the models trained at the sentence level on tweets, while considering only the dialect identification task. As evaluation metrics, we employ the classification accuracy and the macro F1 score. We note that the macro F1 score is the official metric chosen for the VarDial 2019 evaluation campaign Zampieri et al. (2019).
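The two evaluation metrics can be computed as in the minimal sketch below, which assumes that the macro score is the macro-averaged F1, i.e. the unweighted mean of the per-class F1 scores. The function names and the toy labels are illustrative only.

```python
def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def f1_for_class(y_true, y_pred, label):
    """F1 score of a single class, treating it as the positive class."""
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
    fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores: every class counts equally."""
    labels = sorted(set(y_true) | set(y_pred))
    return sum(f1_for_class(y_true, y_pred, l) for l in labels) / len(labels)

# Toy example with an unbalanced class distribution.
y_true = ["MD", "MD", "MD", "MD", "RO", "RO"]
y_pred = ["MD", "MD", "MD", "RO", "RO", "MD"]
print(accuracy(y_true, y_pred), macro_f1(y_true, y_pred))
```

On this toy example, the accuracy (4 of 6 correct) is higher than the macro F1 (0.625), because the minority class is predicted less reliably and the macro average gives it the same weight as the majority class.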
4.3 Parameter Tuning
We have borrowed as many of the hyperparameters as possible from the works Butnaru and Ionescu (2019a); Onose, Cercel, and Trăuşan-Matu (2019); Tudoreanu (2019); Wu et al. (2019) proposing the models considered in our experiments, trying to replicate the previously reported results as closely as possible. When sufficient details to replicate the results were missing, we tuned the corresponding hyperparameters on the validation data. We next present the hyperparameter choices for each machine learning model.
SVM. In the experiments, we used SVM with a pre-computed string kernel with , which has been selected via grid search from a range of values starting from to , considering a multiplication step of . The string kernel is based on character 6-grams.
KRR. For KRR, the only parameter that requires tuning is the regularization . From a set of potential values ranging from to , with a multiplication step of , the best for our setup was . As for the SVM, the string kernel used in KRR is based on character 6-grams.
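To make the notion of a string kernel on character 6-grams concrete, the sketch below implements one common variant, the presence bits kernel, which counts the distinct character n-grams shared by two texts. This section does not state which string kernel variant is used, so the choice of variant, the normalization and the function names are assumptions made for illustration.

```python
import math

def char_ngrams(text, n=6):
    """Set of distinct character n-grams occurring in the text."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def presence_bits_kernel(a, b, n=6):
    """Number of distinct character n-grams that the two texts share."""
    return len(char_ngrams(a, n) & char_ngrams(b, n))

def normalized_kernel(a, b, n=6):
    """Kernel value scaled to [0, 1] by the self-similarities of both texts."""
    kaa = presence_bits_kernel(a, a, n)
    kbb = presence_bits_kernel(b, b, n)
    if kaa == 0 or kbb == 0:
        return 0.0  # at least one text is shorter than n characters
    return presence_bits_kernel(a, b, n) / math.sqrt(kaa * kbb)

def kernel_matrix(docs, n=6):
    """Pre-computed Gram matrix over a list of documents."""
    return [[normalized_kernel(a, b, n) for b in docs] for a in docs]

docs = ["leul moldovenesc", "leu moldovenesc"]
print(presence_bits_kernel(docs[0], docs[1]))
print(kernel_matrix(docs))
```

A Gram matrix built this way can be passed to any learner that accepts a pre-computed kernel, which is how kernel methods such as SVM and KRR consume string kernels.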
Character-level CNN. We employed the same architecture described by Butnaru and Ionescu (2019a). For the experiments performed with full news articles, we use an input size of characters. For the experiments performed on the first sentence from each news article, we adjust the input size to characters. The input layer is followed by an embedding layer, which embeds each character into a vector of components. The neural architecture consists of three convolutional blocks, each having a convolutional layer with filters, stride and filter sizes , and , respectively. Max pooling with a filter size of is applied in each convolutional block. After each convolutional block, we insert a Squeeze-and-Excitation block with the reduction ratio set to . Two fully-connected layers follow the convolutional blocks, each having neural units. Each of these two fully-connected layers is subject to dropout, with the probability of dropping individual units of . The final layer in the network is the one used for prediction, having or neurons depending on the problem solved, i.e. neurons for dialect identification and for categorization by topic. The classification layer is based on Softmax activation. We use a learning rate equal to and train the network for epochs on mini-batches of samples.
LSTM, BiGRU and HAN. The architectures and setup used for these models are identical or very similar in most aspects to those implemented by Onose, Cercel, and Trăuşan-Matu (2019). Each model is trained for epochs with a mini-batch size of samples. These three models are all based on word embeddings, and we combine each one of them with the previously presented word embeddings, namely CoRoLa, NLPL and CC, resulting in a set of independent models.
For the LSTM network, we employed an architecture with two LSTM layers of and neurons in this order, both having tanh activations. The second LSTM layer is followed by dropout regularization, with a probability of dropping out individual neurons of . Two dense layers follow next, the first one having neural units and ReLU activations. The second fully-connected layer in the architecture is a classification layer. It has Softmax activations and neurons, where depends on the sub-task for which the network is used, i.e. for dialect identification and for categorization by topic.
BiGRU consists of a GRU layer of neurons and a bidirectional GRU layer with neurons. The activation function used in both layers is tanh. Two fully-connected layers of and units are also added to the architecture. For regularization and training acceleration reasons, batch normalization is applied after each GRU layer. Dropout with a rate of is used after each dense layer. The last layer relies on Softmax activations, performing the classification task. As for the LSTM, the number of neurons in the last layer is .
In HAN, we used a maximum sequence length of words, which is also valid for the other word-based models previously described, namely LSTM and BiGRU. A sentence encoder using a bidirectional GRU layer with neural units is employed in the first half of the network. The maximum document size considered for the second half is of sequences. For the second encoder, i.e. the document encoder in HAN, we have a similar bidirectional GRU layer, with a size of units. The prediction layer, which comes in last, has or neurons with Softmax activations.
Ensemble Models. While the majority voting strategy requires no hyperparameter tuning, the meta-learner used in model stacking, namely Logistic Regression, requires tuning of the regularization parameter and of the penalty. As penalty, we generally obtained better validation results with over , except for Moldavian intra-dialect categorization by topic. The parameter is validated within and , considering a step of . Depending on the task, we typically obtain the best validation results with or . An exceptional case is the sentence-level dialect identification task, where the optimal is .
4.4 Dialect Identification Results
In Table 1, we present the dialect identification results of various ML methods in three different scenarios. In the first scenario, in which the models are trained and tested on full news articles, there are three individual models that surpass the threshold for both evaluation metrics, namely the SVM, the KRR and the character-level CNN. The ensemble models also surpass this threshold. In general, it seems that the dialect identification task on entire news articles is fairly easy. However, the high accuracy rates could also be explained by other factors, namely by the fact that the models actually discriminate the news articles based on author style, publication source or the discussed subjects, which might be different in the two countries. In order to diminish the effects of such additional factors, we considered two additional scenarios, one that involves training and testing at the sentence level, and one that involves a cross-genre evaluation. In the second scenario, in which the models are trained and tested on sentences, we observe significant performance drops with respect to the first scenario. Indeed, the accuracy rates and the macro F1 scores drop by roughly for almost all models. The only model that does not register such a high performance decrease is HAN, but its scores in the first scenario are quite low. Although it is much harder to recover the author style, the publication source or the subject from the first sentence of each news article, these patterns are not completely eliminated. We therefore consider the third evaluation scenario, in which the models are trained on sentences from MOROCO and tested on tweets collected from different sources and from a different time period. We observe further performance drops in the third scenario. While some models are close to a random chance prediction, e.g. HAN, other models are close to in terms of both accuracy and macro F1.
As shown in Table 4, the human-level performance in the Moldavian versus Romanian dialect identification task is well below that of the best performing ML models evaluated on tweets. In order to understand and explain this difference, we analyze the Grad-CAM visualizations Selvaraju et al. (2017) for one of the best performing models, namely the character-level CNN, in Section 5. Considering all three evaluation scenarios, the individual models attaining the best results are the SVM and the KRR, both being based on string kernels. These two models are closely followed by the character-level CNN. The majority voting strategy attains mixed results, surpassing the best individual model only in the second evaluation scenario. The ensemble based on stacking achieves the best results in each and every case.
4.5 Intra-Dialect Categorization Results
[Table 2: Intra-Dialect Categorization by Topic. Columns: Model; Embedding; Full Articles; Sentences; Full Articles; Sentences.]
We report the intra-dialect categorization by topic accuracy rates and macro F1 scores of various models in Table 2. First of all, we note that the models generally attain better results within the Moldavian dialect as opposed to the Romanian dialect. In the first evaluation scenario, which is based on full news articles, all models, except HAN, surpass the threshold in terms of accuracy rate for the Moldavian news articles. On both dialects, the best accuracy rates in the first evaluation scenario are obtained by the LSTM based on CoRoLa embeddings, surpassing even the ensemble models. The LSTM based on CC embeddings attains the top accuracy rates on both dialects in the second evaluation scenario, which is conducted at the sentence level. In general, we observe that deep learning models attain better accuracy rates than the shallow SVM and KRR, while the latter models attain the top macro F1 scores among all individual models. We note that the large differences between the classification accuracy, which is equivalent to the micro F1 score, and the macro F1 score of each classifier can be explained by the fact that the topic distribution in MOROCO is unbalanced Butnaru and Ionescu (2019a). The macro F1 score is considered more relevant by the VarDial shared task organizers Zampieri et al. (2019), as it assigns equal weights to each class. Although SVM and KRR surpass other individual models, the best macro F1 scores in all intra-dialect categorization experiments are attained by the ensemble based on classifier stacking. Comparing the categorization results of the ML models in the second evaluation scenario with those reported for the human annotators in Table 4, we observe that the performance gap in favor of the machine learning models is smaller than the gap observed in the dialect identification experiments. We believe that this observation indicates that the dialectal features are likely more subtle than the topical features.
4.6 Cross-Dialect Categorization Results
[Table 3: Cross-Dialect Categorization by Topic. Columns: Full Articles; Sentences; Full Articles; Sentences.]
In Table 3, we present the accuracy rates and the macro F1 scores of the considered ML models for cross-dialect categorization by topic in two scenarios, one based on full articles and one based on sentences. In general, we notice that most of the patterns observed in the intra-dialect categorization experiments shown in Table 2 also apply to the cross-dialect experiments. Indeed, we observe that the deep learning methods typically yield superior accuracy rates with respect to the shallow methods based on string kernels, the best approach in most cases being the LSTM network. Nevertheless, the SVM and the KRR compensate by attaining better macro F1 scores than the deep learning models. Once again, the ensemble based on stacking yields the top macro F1 scores in both scenarios and for both cross-dialect tasks. In summary, we consider that the idea of combining the models into an ensemble via classifier stacking is very useful. Comparing the cross-dialect categorization results of the ML classifiers at the sentence level with those reported for the human annotators in Table 4, we notice that, at least in terms of the macro F1 metric, humans are generally better.
4.7 Human Annotation Results
[Table 4: Human Annotated Data. Columns: Identification; by Topic; Identification; by Topic.]
We have asked six human subjects to manually annotate a subset of 120 randomly selected samples from the MOROCO data set. Among the subjects involved in the annotation task, there were five native speakers of Romanian and one native speaker of Moldavian. All annotators understood the task and the presented examples in both dialects. The samples considered in the manual annotation process have been randomly selected, while aiming for a balanced distribution, for both the dialect identification and the categorization by topic sub-tasks. Thus, a total of 120 samples have been selected, from which 60 were written in Romanian and the other 60 originated in news reports from the Republic of Moldova. For each dialect, we considered 10 samples from each of the 6 categories available in MOROCO: culture, finance, politics, science, sports, tech.
The samples considered for annotation contain only the first sentence of the original news articles. This made the task more challenging from a human perspective, as we took away most of the context from the examples, along with useful linguistic and semantic clues that could have greatly helped in inferring the correct classes. At the same time, however, our aim was to reduce the annotation time as much as possible, since all human subjects were volunteers providing the annotations for free. As we seek to fairly compare the human skills with the performance of the ML models in differentiating among dialects, we consider the results reported in the third evaluation scenario, in which the models are trained on sentences and tested on tweets. In order for the evaluation to happen in similar circumstances, the named entities in the samples presented to the human annotators have been replaced with the special token $ne$, just as in the data samples used to train and evaluate the ML models.
The summary of the human annotation is presented in Table 4. For dialect identification, the worst results are just below random chance, the accuracy of annotators #A1 and #A4 being . Moreover, the accuracy averaged over all annotators () barely exceeds the probability of a coin toss. With an accuracy of , annotator #A3 is the only one getting closer to the results reported for the ML models in Table 1. We believe it is fair to compare the human performance at the sentence level with the performance of ML models applied on tweets. We hereby note that the accuracy of the best human annotator exceeds the accuracy of LSTM and HAN. However, SVM and KRR provide accuracy rates and macro scores that are about higher than those of annotator #A3. The ensemble based on stacking is even better. This large difference between the ML models and the Romanian and Moldavian speaking annotators indicates that there are some subtle patterns undetected by humans. In order to discover these patterns, in Section 5, we analyze Grad-CAM visualizations pointing out what the models, particularly the character-level CNN, focus on.
Evaluating the human annotations for the categorization by topic task, we note that the annotators are much better at discriminating between the six topics than at identifying the dialect, the accuracy rates being between and and the macro scores being between and . These results are comparable to the ones obtained by the ML models in the intra-dialect and cross-dialect categorization experiments presented in Tables 2 and 3, respectively. The previous statement is specifically valid for the results reported in the second scenario, in which models are trained and tested at the sentence level.
[Table 5, annotation part. Columns: Sample ID; Ground-Truth; Labels by Annotators.]
|Sample ID||Sample||English Translation|
|#S1||"oamenii de ştiinţă de la $ne$ din $ne$ elaborează pantaloni inteligenţi cu muşchi artificiali, care vor oferi un sprijin suplimentar persoanelor cu mobilitatea piciorului afectată, notează $ne$ $ne$ citat de $ne$."||"scientists from $ne$ in $ne$ are fabricating smart pants with artificial muscle, which are going to help people with reduced leg mobility, says $ne$ $ne$ cited by $ne$."|
|#S2||"leul moldovenesc se depreciază faţă de moneda unică europeană."||"the Moldavian Leu depreciates compared to the unique European currency."|
|#S3||"autorităţile vamale şi de frontieră ale $ne$ $ne$ şi $ne$ au devenit beneficiarele unui nou proiect de asistenţă tehnică, finanţat de $ne$ $ne$."||"customs and border authorities of $ne$ $ne$ and $ne$ became the beneficiaries of a new technical assistance project, funded by $ne$ $ne$."|
|#S4||"acestea sunt fondurile europene pe care $ne$ le pierde definitiv doar cu programul $ne$."||"these are the European funds lost forever by $ne$ only with the $ne$ program."|
|#S5||"o sală a teatrului $ne$ $ne$ din $ne$ s-a făcut scrum."||"a hall of the $ne$ $ne$ theater from $ne$ turned to ashes."|
|#S6||"un bilet amoros rătăcit, ameninţări cu sinuciderea, gesturi extreme, dictate de pasiuni la fel de extreme, o sticluţă cu <<vitrion englezesc>>, toate acestea învălmăşindu-se pe scenă."||"a lost love note, suicide threats, extreme gestures, dictated by equally extreme passions, a bottle of <<English vitrion>>, all these mixing up on the stage."|
Figure 0(a) shows the sum of confusion matrices computed on all the annotators included in our human evaluation study, for the dialect identification task. We note that the annotators were predisposed to label the received samples as belonging to the Romanian dialect. On average, of the 120 samples received by each annotator have been labeled as being written in Romanian. Of these, almost half are mislabeled, actually belonging to the Moldavian dialect. This predisposition can be explained by the fact that five out of six annotators were native Romanian speakers, hence the bias towards labeling more samples as Romanian, unless they found clues indicating otherwise. Additionally, the poor results confirm the difficulty of this binary classification task, from a human perspective.
Figure 0(b) displays the sum of confusion matrices computed on the six human annotators, for the categorization by topic task. For the sports category, annotators were able to correctly classify almost all sentences, with an average of one false negative per annotator. Since sports is less related to the other categories and sports news likely contain semantic clues about the category right from the first sentence, it seems natural for people to find it more distinctive. The same does not hold for categories such as finance, politics, science or tech. Indeed, the highest confusions are between finance and politics and between science and tech, respectively.
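Summing confusion matrices over annotators, as done for both figures, can be sketched as follows. The function names and the toy annotations are hypothetical and do not correspond to the labels collected in the study.

```python
def confusion_matrix(y_true, y_pred, labels):
    """Rows index the ground-truth label, columns index the assigned label."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for t, p in zip(y_true, y_pred):
        matrix[index[t]][index[p]] += 1
    return matrix

def sum_matrices(matrices):
    """Element-wise sum of equally sized confusion matrices."""
    size = len(matrices[0])
    return [[sum(m[r][c] for m in matrices) for c in range(size)]
            for r in range(size)]

labels = ["MD", "RO"]
ground_truth = ["MD", "MD", "RO", "RO"]
annotations = [
    ["RO", "MD", "RO", "RO"],  # hypothetical annotator 1
    ["RO", "RO", "RO", "MD"],  # hypothetical annotator 2
]
total = sum_matrices(
    [confusion_matrix(ground_truth, a, labels) for a in annotations]
)
# Share of all annotations that received the RO label (second column).
ro_share = sum(row[1] for row in total) / sum(map(sum, total))
print(total, ro_share)
```

Reading the summed matrix column-wise reveals labeling bias (how often a label is assigned at all), while reading it row-wise reveals which ground-truth classes are confused with each other.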
In Table 5, we display six samples selected from the data set provided to the human annotators. Among the presented samples, the first three belong to the Moldavian dialect, while the last three belong to the Romanian dialect. For better comprehension, the English translation of each sample is also included in Table 5. We selected the samples considering three different cases: (i) most annotators agree on the label, but the majority vote label does not match the ground-truth label; (ii) most annotators agree on a label that matches the ground-truth label; (iii) there are strong disagreements among annotators, such that a majority cannot be determined. The first and the sixth rows in Table 5 are representative for case (i). In sample #S1, there is no linguistic or semantic clue to indicate that the sentence belongs to the Romanian dialect, yet all annotators made this choice, against the ground-truth label (Moldavian). One possible explanation is that Romania is a more developed country from a scientific point of view. Hence, the annotators might have been biased in their belief that a news article talking about scientists is more likely to come from Romania than from the Republic of Moldova. Sample #S6 contains several words that are not commonly used in the Romanian language, hence all annotators decided to label it as belonging to the Moldavian dialect. Samples #S2 and #S4 are representative for case (ii), most of the votes matching the correct label. Sample #S2 contains an explicit clue suggesting that it belongs to the Moldavian dialect, namely the adjective "moldovenesc" when referring to the currency used in the Republic of Moldova. Similarly, in sentence #S4, the clue is the noun phrase "fondurile europene". The Republic of Moldova is not a member of the European Union. Thus, it becomes clear for anyone who knows this information that the corresponding sentence is more likely to originate in Romania, as Romania receives funds from the European Union. Finally, samples #S3 and #S5 are representative for case (iii). We notice that, in sample #S5, there is simply not enough context to infer the dialect, while sample #S3 does not bear any clues to indicate the dialect, although the sentence is longer. Interestingly, in the presented samples neither we nor the annotators were able to spot any dialectal clues. Although some samples were labeled correctly, the clues indicating the correct dialect are related to the subject rather than to the dialect. At this point, we conclude that either the dialectal patterns are missing or they are very hard to spot by humans. The analysis provided in Section 5 reveals that the character-level CNN does learn some interesting dialectal clues, which we were not aware of.
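The three selection cases discussed above (majority wrong, majority correct, no majority) can be told apart programmatically. The sketch below assumes that a majority exists whenever one label receives strictly more votes than any other; the function name and the tie example are hypothetical, while the unanimous votes for #S1 and #S6 follow the description in the text.

```python
from collections import Counter

def agreement_case(ground_truth, annotator_labels):
    """Classify a sample by how the annotators' votes relate to the true label."""
    counts = Counter(annotator_labels).most_common()
    # A tie between the two most voted labels means no majority can be determined.
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return "no majority"
    majority_label = counts[0][0]
    if majority_label == ground_truth:
        return "majority correct"
    return "majority wrong"

# #S1: Moldavian sample labeled as Romanian by all six annotators.
print(agreement_case("MD", ["RO"] * 6))
# #S6: Romanian sample labeled as Moldavian by all six annotators.
print(agreement_case("RO", ["MD"] * 6))
# Hypothetical split vote, three against three.
print(agreement_case("MD", ["MD", "RO", "MD", "RO", "RO", "MD"]))
```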
[Table 6, annotation part. Columns: Sample ID; Ground-Truth; Labels by Annotators.]
|Sample ID||Sample||English Translation|
|#S7||"$ne$ $ne$ recunoscut pentru felul lui unic de a fi, revine în $ne$ cu tolba plină de muzică şi poezie."||"$ne$ $ne$ famous for his unique way of being, comes back in $ne$ with lots of music and poesy."|
|#S8||"multe voci sustin ca patruderea puternica a $ne$ si a $ne$ pe podiumul $ne$ intareste si mai mult ideea ca aceasta competitie nu este decat una (geo) politica."||"several voices argue that $ne$ and $ne$’s strong infiltration on the $ne$ podium further confirms the idea that this is nothing more than a (geo) political competition."|
|#S9||"locurile de muncă bine plătite pot fi găsite şi în alte sectoare ale economiei, mai puţin populare, în care bătălia este mai mică."||"well paid work places can also be found in other less popular sectors of the economy, with less competition."|
|#S10||"coaliţia la guvernare spune că atunci când accizele au fost reduse, la 1 ianuarie, carburanţii nu s-au ieftinit."||"governing coalition states that fuel has not become cheaper with the reduction of excise duties that happened on January 1st."|
|#S11||"dacă duminica viitoare ar avea loc alegeri parlamentare, în $ne$ ar accede trei partide: $ne$ şi $ne$."||"if there would be parliamentary elections next Sunday, in $ne$ three parties would accede: $ne$ and $ne$."|
|#S12||"$ne$ $ne$ şi $ne$ ( $ne$ ) şi $ne$ $ne$ şi $ne$ ($ne$) au devenit astăzi membri observatori ai $ne$ $ne$ $ne$ ( $ne$ )."||"$ne$ $ne$ and $ne$ ( $ne$ ) and $ne$ $ne$ and $ne$ ($ne$) are, as of today, observer members of $ne$ $ne$ $ne$ ( $ne$ )."|
|#S13||"$ne$ a anunţat descoperirea unui sistem solar asemănător cu al nostru, care are opt planete."||"$ne$ has announced the discovery of a new solar system similar to ours, which has eight planets."|
|#S14||"totul se intampla la bordul $ne$ $ne$ $ne$."||"everything happens on board of $ne$ $ne$ $ne$."|
|#S15||"ronaldinho s-a retras oficial, anunţul venind din partea fratelui acestuia, cel care-i este şi agent."||"Ronaldinho has officially retired, announces his brother, who is also his agent."|
|#S16||"dupa ce s-a dat cu motorul in favelas din $ne$ pe un roller coaster din $ne$ $ne$ dar si la $ne$ $ne$ motociclistul francez care calatoreste in toata lumea cautand cele mai spectaculoase locuri pentru trial freestyle, $ne$ $ne$ a revenit in $ne$ pentru un proiect inedit."||"after he has been biking in favelas from $ne$ on a roller coaster from $ne$ $ne$ and also in $ne$ $ne$, the French biker who travels the world seeking the most spectacular places for freestyle trial, $ne$ $ne$ came back in $ne$ for a novel project."|
|#S17||"unele companii şi-au adus $ne$-işti din $ne$ $ne$ sau $ne$."||"several companies have brought their $ne$-ists from $ne$ $ne$ or $ne$."|
|#S18||"pe piaţa online a $ne$."||"on the online market of $ne$."|
In Table 6, we present sentences with category labels for two different cases: (i) the correct category is chosen unanimously; (ii) there are disagreements among annotators, regardless of the final result of majority voting. Each of these two cases is exemplified through one sentence for each of the six categories. Samples #S7, #S9, #S11, #S13, #S15, #S17 are representative for case (i). The nouns "muzică" and "poezie" in example #S7 are strong clues for the culture category, hence the unanimity of votes in this direction. In sample #S9, the keyword "economie" gives the strongest clue for the finance category, while in sample #S11, the noun phrase "alegeri parlamentare" suggests that the sentence belongs to the politics category. However, sample #S13 does not seem to contain any specific phrase that can be considered a strong indicator for the science topic. Here, it is the entire context that reveals the nature of the sentence. The name of a famous football player has escaped our named entity removal process, which is why sentence #S15 was unanimously classified as belonging to the sports topic. In example #S17, from the noun "$ne$-işti", a native Romanian speaker can infer that the removed named entity is "IT", as "IT-işti" is a very common yet distinctive way to refer to people working in the IT industry in Romania. Therefore, the annotators unanimously labeled #S17 as part of the tech sector. Examples #S8, #S10, #S12, #S14, #S16, #S18 are representative for case (ii), having at least one wrong label among the manual annotations provided by the six subjects. Only one out of six annotators has correctly labeled sample #S8 as belonging to the culture topic. The other annotators were deceived by the fact that sample #S8 contains the word "politica" suggesting the politics label and the word "podium" suggesting the sports label. The annotations of sample #S10 confirm the confusion between finance and politics observed in the confusion matrix depicted in Figure 0(b). In sample #S10, we observe a reason suggesting that the label is politics, namely the presence of the noun phrase "coaliţia la guvernare". Had the annotators considered the noun "accizele" (taxes) as more relevant, they would have been able to find the correct category, i.e. finance. Sample #S12 contains very few words along with many placeholders for named entities. However, most of the annotators know that "membri observatori" is a political function inside the European Union, hence the label politics. Sample #S14 presents strong disagreements among the annotators. This is expected, since the very short sentence lacks sufficient context to label the example. Misclassified sports samples were very few in the data set, as we can also see in Figure 0(b). In sample #S16, only one annotator has not marked the text as belonging to the sports category. Leaving aside the lack of context in sample #S18, we note that the noun phrase "piaţa online" (online market) might suggest the finance and the tech topics. The labels provided by the annotators are divided between these two topics, confirming our hypothesis about "piaţa online".
So far, it remains unclear if there are any dialectal clues in the news articles from MOROCO. One hypothesis (H1) is that there are no dialectal clues, since Romanian speakers had a hard time distinguishing between the two dialects, as shown in Table 4. In this case, the good performance of the machine learning models can be explained through other factors, e.g. subjects specific to each of the two countries. The alternative hypothesis (H2) is that the samples contain dialectal clues, since the machine learning models trained on news articles are able to classify tweets collected from a different time period. In this case, the low performance of human annotators can be explained if we consider that the dialectal clues are harder to spot than expected. In order to find out which hypothesis is valid, we analyze the discriminative features learned by the character-level CNN, which is among the top three individual dialect identification systems. We opted for the character-level CNN over the better performing SVM and KRR, as it allows us to look at discriminative features using Grad-CAM, a technique that was initially used to explain decisions of convolutional neural networks applied to images Selvaraju et al. (2017). We adapted this technique for the character-level CNN, presenting the corresponding visualizations in Tables 7 and 8, respectively.
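Grad-CAM weights each feature map by the average gradient of the class score with respect to that map and keeps only the positive part of the weighted sum Selvaraju et al. (2017). A minimal one-dimensional sketch for a character-level CNN is given below, assuming that the activations and gradients of a convolutional layer have already been extracted from the network. The function name and the toy values are hypothetical, and the exact adaptation used to produce the visualizations is not specified in this section.

```python
def grad_cam_1d(activations, gradients):
    """
    activations: K feature maps, each a list of T values along the text axis.
    gradients:   gradients of the class score w.r.t. those maps, same shape.
    Returns one non-negative relevance value per position.
    """
    # Channel importance: gradient averaged over all positions of the map.
    weights = [sum(g) / len(g) for g in gradients]
    length = len(activations[0])
    cam = []
    for t in range(length):
        score = sum(w * fmap[t] for w, fmap in zip(weights, activations))
        cam.append(max(score, 0.0))  # ReLU keeps only positive evidence
    return cam

# Toy example: two feature maps over three positions.
activations = [[1.0, 0.0, 2.0], [0.0, 3.0, 1.0]]
gradients = [[0.5, 0.5, 0.5], [-1.0, -1.0, -1.0]]
print(grad_cam_1d(activations, gradients))
```

Because pooling shortens the feature maps, the resulting relevance values have to be stretched back to the input length before they can be aligned with individual characters.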
|#R1||ford a demarat producţia noului său model $ne$ la uzina din $ne$ .||"Ford has started the production of the new model $ne$ at their factory in $ne$."|
|#R2||mai pe româneşte, taie frunze la câini .||"as the romanian saying goes, he’s cutting leaves to the dogs."|
|#R3||aşa au apărut, în gări, tonomatele cu tot felul de publicaţii, contra cost, sau bibliotecile pentru corporatişti, cu livrare direct la birou .||"this is how there have appeared, in train stations, jukeboxes with all kinds of books, requiring payment, or libraries for corporate people, with delivery directly at the office."|
|#R4||şeful $ne$ $ne$ $ne$ şi adjunctul acestuia, $ne$ $ne$ s - au prezentat, joi, la $ne$ $ne$ în dosarul violenţelor de la mitingul din 10 august din $ne$ $ne$ .||"the head of $ne$ $ne$ $ne$ and his assistant were present, on Thursday, at $ne$ $ne$ in the criminal case of the violence acts that happened at the protests on August 10th from $ne$ $ne$."|
|#R5||compania în cauză nu mai vindea modele diesel pe piaţa americană din 2015 .||"the said company hasn’t sold diesel models on the American market since 2015."|
|#R6||partidele au o nouă temă de campanie - definirea familiei .||"the parties have a new campaign theme - defining family."|
|#R7||coaliţia la guvernare spune că atunci când accizele au fost reduse, la 1 ianuarie, carburanţii nu s - au ieftinit .||"governing coalition states that fuel has not become cheaper with the reduction of excise duties that happened on January 1st."|
|#R8||bancherii nu stau cu mainile in san .||"bankers do not sit with the hands on their chests."|
|#R9||hidrologii au emis, sâmbătă, mai multe avertizări cod galben de inundaţii, scurgeri de pe versanţi, torenţi şi pâraie, valabile pentru râuri din şase judeţe .||"hydrologists have emitted, on Saturday, more yellow code flood warnings, runoff from slopes, torrents and streams, valid for rivers in six counties."|
|#R10||$ne$ a precizat că este vorba despre semnarea unui acord între o firmă privată românească şi una dintre cele mai mari companii din lume - o firmă americană de armament, care produce, printre altele, şi celebrele rachete $ne$ .||"$ne$ has stated that this is about signing an agreement between a private Romanian company and one of the biggest companies world wide - an American weapons business, which manufactures, among others, the famous rockets $ne$."|
We quantized the importance of each character using shades of blue (for Romanian) or shades of red (for Moldavian), with darker shades representing more relevant features and lighter shades representing less relevant features. In order to extract the importance of each character, we used the weights learned by the last convolutional layer in the network as well as the spatial localization kept in the activation maps obtained by convolving filters of predefined size over the input fed to the model. In the remainder of this discussion, we try to explain why the features considered important by the character-level CNN also make sense from a human perspective.
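To make the character-level scoring more concrete, the following is a minimal sketch of the standard Grad-CAM computation transferred to a one-dimensional convolutional layer. It is not the implementation used for the visualizations: the function names, the array shapes, the choice of spreading each position's score over its filter window with a maximum, and the toy activations and gradients are all assumptions made for illustration.

```python
import numpy as np


def grad_cam_1d(activations, gradients):
    """Grad-CAM scores per convolution position.

    activations: (num_filters, num_positions) maps from the last conv layer.
    gradients:   same shape, d(class score) / d(activations).
    """
    # One weight per filter: the gradient averaged over all positions.
    alphas = gradients.mean(axis=1)
    # Weighted sum of the maps; ReLU keeps only evidence for the class.
    return np.maximum(alphas @ activations, 0.0)


def char_importance(cam, text_len, kernel_size):
    """Project position scores back onto characters (stride 1, no padding)."""
    scores = np.zeros(text_len)
    for t, value in enumerate(cam):
        window = slice(t, t + kernel_size)
        # A character inherits the strongest score among windows covering it.
        scores[window] = np.maximum(scores[window], value)
    peak = scores.max()
    return scores / peak if peak > 0 else scores


# Toy input (hypothetical numbers): 2 filters of width 3 over a
# 7-character word, which gives 5 convolution positions.
text = "sfîrşit"
activations = np.array([[0.0, 1.0, 2.0, 0.0, 0.0],
                        [1.0, 0.0, 0.0, 0.0, 1.0]])
gradients = np.array([[0.5] * 5,
                      [-0.5] * 5])

cam = grad_cam_1d(activations, gradients)
importance = char_importance(cam, len(text), kernel_size=3)
for ch, score in zip(text, importance):
    print(f"{ch}\t{score:.2f}")
```

In this sketch, the normalized scores in `importance` play the role of the color intensity: a value close to 1 would correspond to a dark shade and a value close to 0 to a light shade.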
We provide a set of visualizations for Romanian sentences in Table 7. In sample #R1, the model focuses on the first four words, but the one indicating the dialect is "demarat". This word, which translates to "started", is used in Romanian to indicate the start of a construction process. In Moldavian, the word "început" would have probably been used instead to express the same thing. We note that the word "început" is also commonly used in Romania, but typically in different contexts. Perhaps this is why the model also highlights the neighboring word "producţia" (production). Sample #R2 contains an entire Romanian proverb which is, as a whole, predictive for this dialect. It refers to people doing useless jobs, e.g. "cutting leaves to the dogs". In sample #R3, the CNN focuses on two separate groups of words, but we believe the dialectal clue is the "contra cost" expression, which is typically used in Romanian to express the fact that some product or service is not free, but requires some payment from the customer. Example #R4 contains a type of news that has dominated the Romanian media for months, namely the protest against the Romanian government on August 10th, 2018. Therefore, the features highlighted by the CNN have no dialectal clues, except perhaps for the word "miting", which is preferred instead of the synonym "protest", the latter one being more common in the Republic of Moldova. In samples #R5 and #R10, the CNN focuses on the nouns "compania" (singular of "company") or "companii" (plural of "company"), respectively. From our observations, in Moldavian news reports, writers use "întreprindere", while in Romanian news reports, the synonym "companie" is used instead. We note that "companie" and "întreprindere" exist in both Romanian and Moldavian, but the preference for one or the other depends on the dialect. 
Sample #R6 refers to what was a really hot and controversial topic in Romania, namely that of changing the definition of family ("definirea familiei") in the constitution of Romania. We can safely say this is not a dialectal topic. In sample #R7, the model focuses on the noun phrase "coaliţia la guvernare" (governing coalition). In the Republic of Moldova, the same concept is expressed through the noun phrase "coaliţia de guvernămînt". Sample #R8 contains a Romanian saying, namely "nu stau cu mainile in san", which is used to express that the bankers took some action instead of waiting for something to happen. Sample #R9 contains the noun "torenţi", which is never used in the Republic of Moldova with the meaning of weather torrent, only with the meaning of web torrent. The CNN also considers as relevant the word "valabile", which is rarely used in the Republic of Moldova. Hence, sample #R9 contains more than one dialectal pattern. In summary, we find that the CNN does find some interesting dialectal patterns, which we were unaware of before seeing the Grad-CAM visualizations. However, a few sentences, namely #R4 and #R6, have no dialectal patterns, but are correctly labeled by the CNN because their subjects are related to events in Romania.
|#M1||cabinetul de miniştri a aprobat, în cadrul şedinţei de astăzi, modificări şi completări la $ne$ privind procedura de repatriere a copiilor şi adulţilor victime ale traficului de fiinţe umane, traficului ilegal de migranţi, precum şi a copiilor neînsoţiţi .||"the cabinet has approved, in today’s meeting, changes and completions to $ne$ regarding the procedure for repatriation of children and adults who are victims of human trafficking, illegal immigrants trafficking, as well as the one regarding unaccompanied children."|
|#M2||$ne$ are tot ce îi trebuie pentru a reuşi, iar $ne$ $ne$ a condus ţara cu mîna fermă, cu minte limpede şi cu sufletul la oameni, făcîndu - şi datoria faţă de oameni .||"$ne$ has everything that is needed in order to win, but $ne$ $ne$ has ruled the country with a firm hand, a clear mind and with the soul close to people, while doing his/her duty to people."|
|#M3||facebook ştie multe lucruri despre tine, dintre care majoritatea sînt împărtăşite cu prietenii tăi pentru a vă ajuta pe toţi .||"facebook knows a lot of things about you, most of which are shared with your friends in order to help all of you."|
|#M4||cei mai mulţi bani, locuitorii capitalei îi cheltuie $ne$ pe mîncare .||"the inhabitants of the capital spend most of their money on food."|
|#M5||la ediţia curentă, a 43 - a, a $ne$ $ne$ de $ne$ "$ne$ participă muzicieni din 15 ţări, iar concertele se desfăşoară în diferite localităţi ale republicii .||"in the current, 43rd edition, of $ne$ $ne$ $ne$ $ne$ there are musicians from 15 countries, and the concerts are going to happen in different locations of the republic."|
|#M6||maşina zburătoare $ne$ a companiei $ne$ a fost în dezvoltări şi teste timp de mulţi ani, dar este în sfîrşit aproape gata .||"the flying car $ne$ of the $ne$ company has been under development and tests for many years, but it is, finally, almost ready."|
|#M7||una dintre cele mai mari bănci din $ne$ $ne$ a ajuns în faliment .||"one of the greatest banks in $ne$ $ne$ went bankrupt."|
|#M8||guvernul a avizat pozitiv, în şedinţa de astăzi, pachetul legislativ pentru reforma fiscală .||"the government approved, in todays’ meeting, the legislative package for the tax reform."|
|#M9||$ne$ din $ne$ nu va comunica public opţiunea partidului privind turul $ne$ din $ne$ dar speră că la 3 iunie cetăţenii se vor mobiliza şi vor participa activ la vot, a declarat la un briefing vicepreşedintele formaţiunii $ne$ $ne$ .||"$ne$ from $ne$ isn’t going to publicly announce the party’s option regarding $ne$ tour in $ne$, but they hope that, on June 3rd, the citizens are going to actively participate in the vote, the vice-president of the $ne$ $ne$ has declared."|
|#M10||$ne$ va beneficia de suportul experţilor europeni în procesul de implementare a $ne$ $ne$ şi a $ne$ $ne$ $ne$ de $ne$ $ne$ , condiţionalităţi prevăzute în capitolul privind $ne$ de $ne$ $ne$ $ne$ şi $ne$ ( $ne$ ) a $ne$ de $ne$ .||"$ne$ will benefit from the support of the European experts in the process of implementing the $ne$ $ne$ and the $ne$ $ne$ $ne$ of $ne$ $ne$, conditions specified in the chapter regarding $ne$ of $ne$ $ne$ $ne$ and $ne$ ($ne$) of $ne$ of $ne$."|
We provide a set of visualizations for Moldavian sentences in Table 8. Sample #M1 contains a highlighted noun phrase that is a clear indicator of the Moldavian dialect. Indeed, the noun phrase "cabinetul de miniştri" (the cabinet of ministers) is almost never used in Romanian, where the alternative "guvernul" (the government) is preferred. The noun "migranţi" (migrants) is also unusual in Romanian, the forms "emigranţi" or "imigranţi" being used instead, depending on the context. In samples #M2, #M3, #M4 and #M6, we can observe a few highlighted words, such as "mîna" (hand), "făcîndu" (doing), "sînt" (are), "mîncare" (food) and "sfîrşit" (end), that reveal the same pattern used only in the Moldavian dialect, namely the use of the vowel "î" inside words. We note that the vowel "î" is used in Romanian only at the beginning of the words. The same sound is spelled by the vowel "â" anywhere else in the word, and the aforementioned Moldavian words would be written as "mâna", "făcându", "mâncare" and "sfârşit", respectively. For the verb "sînt" (are), even the sound is different, the correct Romanian spelling being "sunt". In addition, sample #M3 contains the participle "împărtăşite" (shared), which would likely be replaced by "partajate" in Romanian. In sample #M4, the CNN model focuses on the phrase "cei mai mulţi bani", the distinctive pattern being the placement of this phrase at the beginning of the sentence. In Romanian, the same sentence would be written as follows: "locuitorii capitalei cheltuie cei mai mulţi bani pe mâncare". In sample #M5, we can understand why the network has highlighted the phrase "ale republicii" (of the republic) as being a strong indicator for Moldavian, namely because Moldova is considered a republic. Romania was considered a republic only during the communist regime. Hence, example #M5 does not contain any dialectal patterns. In sample #M7, the verb phrase "a ajuns în faliment" (went bankrupt) is distinctive for the Moldavian dialect. 
In the Romanian dialect, the verb "a intrat" would be used instead of "a ajuns". Another distinctive verb phrase for the Moldavian dialect is present in sample #M8, namely "a avizat pozitiv" (approved). In Romanian, this verb phrase would be replaced by the verb "a aprobat", the adverb "pozitiv" being implied by the verb. In #M9, we can observe that "briefing" is used to denote a short press conference. To express the same concept, a Romanian speaker would use "declaraţie de presă" or "conferinţă de presă". Sample #M9 contains another dialectal pattern. In Moldavian, a political party is typically referred to as "formaţiune", whereas in Romanian, it is referred to as "partid". In sample #M10, the only highlighted dialect pattern that we found interpretable from our perspective is the use of the noun "condiţionalităţi" (conditions), since we would rather use "condiţii" in the Romanian dialect. As with the Romanian sentences, we notice that the character-level CNN finds some relevant patterns of the Moldavian dialect.
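The orthographic cue observed in samples #M2, #M3, #M4 and #M6 (the vowel "î" inside a word) is simple enough to check with a regular expression. The sketch below is only a rough heuristic written for illustration, and the names `INTERNAL_I` and `internal_i_words` are hypothetical. It deliberately ignores a word-final "î", and it does not handle prefixed words, in which standard Romanian spelling also keeps "î" inside the word, so such words would be flagged incorrectly.

```python
import re

# The vowel with a letter on both sides: neither word-initial nor word-final.
INTERNAL_I = re.compile(r"\Bî\B", re.IGNORECASE)


def internal_i_words(sentence):
    """Return the words that contain a word-internal i-circumflex."""
    return [w for w in re.findall(r"\w+", sentence) if INTERNAL_I.search(w)]


# Sample #M4 and the Romanian rewording given in the text.
moldavian = "cei mai mulţi bani, locuitorii capitalei îi cheltuie $ne$ pe mîncare ."
romanian = "locuitorii capitalei cheltuie cei mai mulţi bani pe mâncare"

print(internal_i_words(moldavian))  # ['mîncare']
print(internal_i_words(romanian))   # []
```

Note that the pronoun "îi" in the Moldavian sentence is not flagged, because its "î" is word-initial, which is the spelling shared by both dialects.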
We confess that we were not aware of many of the distinctive patterns between the two dialects discovered through the Grad-CAM visualizations. The same applies to our annotators. While both dialects contain about the same words, it seems that differences regarding the preferred synonym to express a certain concept play a very important role in distinguishing between the two dialects. This also explains why people living in Romania or the Republic of Moldova have such a hard time distinguishing between the dialects. Many of the presented sentences are grammatically and syntactically correct in both dialects, but some word choices in one dialect seem rather unusual in the other dialect. We believe that untrained people can easily mistake such dialectal patterns for the style of the author. We believe that the presented examples elucidate the mystery behind the unreasonable effectiveness of machine learning in Moldavian versus Romanian dialect identification, revealing some interesting dialectal patterns, previously unknown to ourselves. In summary, we consider hypothesis H2 to be true.
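The synonym preferences discussed for the samples above can be summarized as a small lookup of word-choice cues. The sketch below is purely illustrative: the cue lists contain only expressions mentioned in the discussion (some shortened to stems so that inflected forms match), the function `cue_counts` is hypothetical, and simple substring counting is an assumption that has nothing to do with how the CNN or the other models operate.

```python
# Word-choice cues taken from the discussion above; the lists are illustrative
# and far from complete. Stems such as "compani" also match inflected forms.
ROMANIAN_CUES = ("compani", "miting", "contra cost", "coaliţia la guvernare")
MOLDAVIAN_CUES = ("întreprinder", "cabinetul de miniştri", "formaţiun",
                  "condiţionalităţi", "a avizat pozitiv", "briefing")


def cue_counts(sentence):
    """Count how many cues of each dialect occur in the sentence."""
    text = sentence.lower()
    return {
        "romanian": sum(cue in text for cue in ROMANIAN_CUES),
        "moldavian": sum(cue in text for cue in MOLDAVIAN_CUES),
    }


# Samples #R7 and #M8 from the tables.
r7 = ("coaliţia la guvernare spune că atunci când accizele au fost reduse, "
      "la 1 ianuarie, carburanţii nu s - au ieftinit .")
m8 = ("guvernul a avizat pozitiv, în şedinţa de astăzi, "
      "pachetul legislativ pentru reforma fiscală .")

print(cue_counts(r7))
print(cue_counts(m8))
```

Such cues express preferences rather than rules. For instance, the Moldavian sample #M6 contains "companiei", which the stem "compani" would count as a Romanian cue, so a lookup of this kind can only hint at the dialect.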
In this article, we studied dialect identification and related sub-tasks, e.g. cross-dialect categorization by topic, for an under-studied language, namely Romanian. We experimented with several machine learning models, including novel ensemble combinations, attaining very good performance levels, especially with the ensemble based on model stacking. Comparing the ML models with native Romanian or Moldavian speakers, we found a significant performance gap, the average performance of the human annotators being barely above the random chance baseline. In order to find out why ML models attain significantly better results compared to humans, we analyzed Grad-CAM visualizations of the character-level CNN model. The visualizations revealed some interesting dialectal clues, which were too subtle to be observed by the human annotators or by us. We therefore reached the conclusion that the effectiveness of the ML models is explainable in large part through dialectal patterns, although the models can occasionally distinguish the samples based on their subject. In this regard, we believe that the newly introduced cross-genre setting, in which the models are trained on sentences from MOROCO and tested on tweets collected from a different time span, is more representative for a fair and realistic evaluation.
While our current study is focused on written dialect identification, we aim to address spoken dialect identification in future work. Since the spoken dialect bears more distinctive clues, it will allow us to include other Romanian sub-dialects in our study, e.g. those spoken in the Ardeal or Oltenia regions.
- Abdelali et al. (2020) Abdelali, Ahmed, Hamdy Mubarak, Younes Samih, Sabit Hassan, and Kareem Darwish. 2020. Arabic Dialect Identification in the Wild. arXiv preprint arXiv:2005.06557.
- Al-Rfou et al. (2019) Al-Rfou, Rami, Dokook Choe, Noah Constant, Mandy Guo, and Llion Jones. 2019. Character-Level Language Modeling with Deeper Self-Attention. In Proceedings of AAAI, pages 3159–3166.
- Ali, Vogel, and Renals (2017) Ali, Ahmed, Stephan Vogel, and Steve Renals. 2017. Speech Recognition Challenge in the Wild: Arabic MGB-3. In Proceedings of ASRU, pages 316–322.
- Alsarsour et al. (2018) Alsarsour, Israa, Esraa Mohamed, Reem Suwaileh, and Tamer Elsayed. 2018. DART: A Large Dataset of Dialectal Arabic Tweets. In Proceedings of LREC, pages 3666–3670.
- AlYami and AlZaidy (2020) AlYami, Reem and Rabeah AlZaidy. 2020. Arabic Dialect Identification in Social Media. In Proceedings of ICCAIS, pages 1–2.
- Ballesteros, Dyer, and Smith (2015) Ballesteros, Miguel, Chris Dyer, and Noah A. Smith. 2015. Improved Transition-Based Parsing by Modeling Characters instead of Words with LSTMs. In Proceedings of EMNLP, pages 349–359.
- Belazzougui and Cunial (2017) Belazzougui, Djamal and Fabio Cunial. 2017. A Framework for Space-Efficient String Kernels. Algorithmica, 79(3):857–883.
- Bengio (2009) Bengio, Yoshua. 2009. Learning deep architectures for AI. Foundations and Trends in Machine Learning, 2(1):1–127.
- Bengio et al. (2003) Bengio, Yoshua, Réjean Ducharme, Pascal Vincent, and Christian Janvin. 2003. A Neural Probabilistic Language Model. Journal of Machine Learning Research, 3:1137–1155.
- Biadsy, Hirschberg, and Habash (2009) Biadsy, Fadi, Julia Hirschberg, and Nizar Habash. 2009. Spoken Arabic Dialect Identification Using Phonotactic Modeling. In Proceedings of CASL, pages 53–61.
- Bojanowski et al. (2017) Bojanowski, Piotr, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2017. Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5:135–146.
- Bouamor et al. (2018) Bouamor, Houda, Nizar Habash, Mohammad Salameh, Wajdi Zaghouani, Owen Rambow, Dana Abdulrahim, Ossama Obeid, Salam Khalifa, Fadhl Eryani, Alexander Erdmann, et al. 2018. The MADAR Arabic Dialect Corpus and Lexicon. In Proceedings of LREC, pages 3387–3396.
- Bouamor, Hassan, and Habash (2019) Bouamor, Houda, Sabit Hassan, and Nizar Habash. 2019. The MADAR Shared Task on Arabic Fine-Grained Dialect Identification. In Proceedings of WANLP, pages 199–207.
- Britz et al. (2017) Britz, Denny, Anna Goldie, Minh-Thang Luong, and Quoc Le. 2017. Massive Exploration of Neural Machine Translation Architectures. In Proceedings of EMNLP, pages 1442–1451.
- Brun, Saggese, and Vento (2014) Brun, L., A. Saggese, and M. Vento. 2014. Dynamic Scene Understanding for Behavior Analysis Based on String Kernels. IEEE Transactions on Circuits and Systems for Video Technology, 24(10):1669–1681.
- Butnaru and Ionescu (2018) Butnaru, Andrei M. and Radu Tudor Ionescu. 2018. UnibucKernel Reloaded: First Place in Arabic Dialect Identification for the Second Year in a Row. In Proceedings of VarDial, pages 77–87.
- Butnaru and Ionescu (2019a) Butnaru, Andrei M. and Radu Tudor Ionescu. 2019a. MOROCO: The Moldavian and Romanian Dialectal Corpus. In Proceedings of ACL, pages 688–698.
- Butnaru and Ionescu (2019b) Butnaru, Andrei M. and Radu Tudor Ionescu. 2019b. ShotgunWSD 2.0: An Improved Algorithm for Global Word Sense Disambiguation. IEEE Access, 7:120961–120975.
- Caragiu-Marioțeanu (1975) Caragiu-Marioțeanu, Matilda. 1975. Compendiu de dialectologie română:(nord şi sud-dunăreană). Editura ştiinţifică şi enciclopedică.
- Chifu (2019) Chifu, Adrian-Gabriel. 2019. The R2I_LIS Team Proposes Majority Vote for VarDial’s MRC Task. In Proceedings of VarDial, pages 138–143.
- Cho et al. (2014) Cho, Kyunghyun, Bart van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation. In Proceedings of EMNLP, pages 1724–1734.
- Chung et al. (2014) Chung, Junyoung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio. 2014. Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling. In Proceedings of NIPS Deep Learning and Representation Learning Workshop.
- Ciobanu and Dinu (2016) Ciobanu, Alina Maria and Liviu P. Dinu. 2016. A Computational Perspective on the Romanian Dialects. In Proceedings of LREC, pages 3281–3285.
- Collobert and Weston (2008) Collobert, Ronan and Jason Weston. 2008. A Unified Architecture for Natural Language Processing: Deep Neural Networks with Multitask Learning. In Proceedings of ICML, pages 160–167.
- Cortes and Vapnik (1995) Cortes, Corinna and Vladimir Vapnik. 1995. Support-vector networks. Machine Learning, 20(3):273–297.
- Coteanu, Bolocan, and Marioţeanu (1969) Coteanu, Ion, Gheorghe Bolocan, and Matilda Caragiu Marioţeanu. 1969. Istoria Limbii Române (History of the Romanian Language), volume II. Romanian Academy, Bucharest, Romania.
- Cozma, Butnaru, and Ionescu (2018) Cozma, Mădălina, Andrei Butnaru, and Radu Tudor Ionescu. 2018. Automated essay scoring with string kernels and word embeddings. In Proceedings of ACL, pages 503–509.
- Elzinga and Wang (2013) Elzinga, Cees H. and Hui Wang. 2013. Versatile string kernels. Theoretical Computer Science, 495:50–65.
- Francom, Hulden, and Ussishkin (2014) Francom, Jerid, Mans Hulden, and Adam Ussishkin. 2014. ACTIV-ES: a comparable, cross-dialect corpus of ‘everyday’ Spanish from Argentina, Mexico, and Spain. In Proceedings of LREC, pages 1733–1737.
- Fukushima (1980) Fukushima, Kunihiko. 1980. Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position. Biological Cybernetics, 36(4):193–202.
- Garg et al. (2018) Garg, Nikhil, Londa Schiebinger, Dan Jurafsky, and James Zou. 2018. Word embeddings quantify 100 years of gender and ethnic stereotypes. Proceedings of the National Academy of Sciences, 115:E3635–E3644.
- Gashler, Giraud-Carrier, and Martinez (2008) Gashler, Mike, Christophe Giraud-Carrier, and Tony Martinez. 2008. Decision tree ensemble: Small heterogeneous is better than large homogeneous. In Proceedings of ICMLA, pages 900–905.
- Gasthaus, Wood, and Teh (2010) Gasthaus, Jan, Frank Wood, and Yee Whye Teh. 2010. Lossless Compression Based on the Sequence Memoizer. In Proceedings of DCC, pages 337–345.
- Gers, Schmidhuber, and Cummins (2000) Gers, Felix A., Jürgen Schmidhuber, and Fred Cummins. 2000. Learning to forget: Continual prediction with LSTM. Neural Computation, 12(10):2451–2471.
- Giménez-Pérez, Franco-Salvador, and Rosso (2017) Giménez-Pérez, Rosa M., Marc Franco-Salvador, and Paolo Rosso. 2017. Single and Cross-domain Polarity Classification using String Kernels. In Proceedings of EACL, pages 558–563.
- Glorot, Bordes, and Bengio (2011) Glorot, Xavier, Antoine Bordes, and Yoshua Bengio. 2011. Domain Adaptation for Large-Scale Sentiment Classification: A Deep Learning Approach. In Proceedings of ICML, pages 513–520.
- Grave et al. (2018) Grave, Edouard, Piotr Bojanowski, Prakhar Gupta, Armand Joulin, and Tomas Mikolov. 2018. Learning Word Vectors for 157 Languages. In Proceedings of LREC, pages 3483–3487.
- Greff et al. (2016) Greff, Klaus, Rupesh K. Srivastava, Jan Koutník, Bas R. Steunebrink, and Jürgen Schmidhuber. 2016. LSTM: A search space odyssey. IEEE Transactions on Neural Networks and Learning Systems, 28(10):2222–2232.
- Guellil and Azouaou (2016) Guellil, Imène and Faiçal Azouaou. 2016. Arabic dialect identification with an unsupervised learning (based on a lexicon). Application case: Algerian dialect. In Proceedings of CSE, EUC and DCABES, pages 724–731.
- Hanani and Naser (2018) Hanani, Abualsoud and Rabee Naser. 2018. Spoken Arabic dialect recognition using X-vectors. Natural Language Engineering, pages 1–10.
- Hochreiter et al. (2001) Hochreiter, Sepp, Yoshua Bengio, Paolo Frasconi, Jürgen Schmidhuber, et al. 2001. Gradient flow in recurrent nets: the difficulty of learning long-term dependencies. A Field Guide to Dynamical Recurrent Neural Networks, pages 237–244.
- Hochreiter and Schmidhuber (1997) Hochreiter, Sepp and Jürgen Schmidhuber. 1997. Long Short-Term Memory. Neural Computation, 9(8):1735–1780.
- Hoerl and Kennard (1970) Hoerl, Arthur E. and Robert W. Kennard. 1970. Ridge regression: Biased estimation for nonorthogonal problems. Technometrics, 12(1):55–67.
- Huang et al. (2012) Huang, Eric, Richard Socher, Christopher Manning, and Andrew Ng. 2012. Improving Word Representations via Global Context and Multiple Word Prototypes. In Proceedings of ACL, pages 873–882.
- Huang and Hansen (2006) Huang, Rongqing and John H.L. Hansen. 2006. Gaussian Mixture Selection and Data Selection for Unsupervised Spanish Dialect Classification. In Proceedings of INTERSPEECH.
- Ionescu and Butnaru (2019) Ionescu, Radu Tudor and Andrei Butnaru. 2019. Vector of Locally-Aggregated Word Embeddings (VLAWE): A Novel Document-level Representation. In Proceedings of NAACL, pages 363–369.
- Ionescu and Butnaru (2017) Ionescu, Radu Tudor and Andrei M. Butnaru. 2017. Learning to Identify Arabic and German Dialects using Multiple Kernels. In Proceedings of VarDial, pages 200–209.
- Ionescu and Butnaru (2018) Ionescu, Radu Tudor and Andrei M. Butnaru. 2018. Improving the results of string kernels in sentiment analysis and Arabic dialect identification by adapting them to your test set. In Proceedings of EMNLP, pages 1084–1090.
- Ionescu and Popescu (2016) Ionescu, Radu Tudor and Marius Popescu. 2016. UnibucKernel: An Approach for Arabic Dialect Identification based on Multiple String Kernels. In Proceedings of VarDial, pages 135–144.
- Ionescu and Popescu (2017) Ionescu, Radu Tudor and Marius Popescu. 2017. Can string kernels pass the test of time in native language identification? In Proceedings of BEA-12, pages 224–234.
- Ionescu, Popescu, and Cahill (2014) Ionescu, Radu Tudor, Marius Popescu, and Aoife Cahill. 2014. Can characters reveal your native language? A language-independent approach to native language identification. In Proceedings of EMNLP, pages 1363–1373.
- Ionescu, Popescu, and Cahill (2016) Ionescu, Radu Tudor, Marius Popescu, and Aoife Cahill. 2016. String kernels for native language identification: Insights from behind the curtains. Computational Linguistics, 42(3):491–525.
- Joulin et al. (2017) Joulin, Armand, Édouard Grave, Piotr Bojanowski, and Tomáš Mikolov. 2017. Bag of Tricks for Efficient Text Classification. In Proceedings of EACL, pages 427–431.
- Jun (2017) Jun, Han. 2017. Chinese dialect identification based on DBF. Audio Engineering, page Z1.
- Kate and Mooney (2006) Kate, Rohit J. and Raymond J. Mooney. 2006. Using String-Kernels for Learning Semantic Parsers. In Proceedings of COLING/ACL, pages 913–920.
- Kim (2014) Kim, Yoon. 2014. Convolutional Neural Networks for Sentence Classification. In Proceedings of EMNLP, pages 1746–1751.
- Kim et al. (2016) Kim, Yoon, Yacine Jernite, David Sontag, and Alexander M. Rush. 2016. Character-Aware Neural Language Models. In Proceedings of AAAI, pages 2741–2749.
- Krizhevsky, Sutskever, and Hinton (2012) Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton. 2012. ImageNet Classification with Deep Convolutional Neural Networks. In Proceedings of NIPS, pages 1097–1105.
- Kumar et al. (2018) Kumar, Ritesh, Bornini Lahiri, Deepak Alok, Atul Kr Ojha, Mayank Jain, Abdul Basit, and Yogesh Dawer. 2018. Automatic Identification of Closely-related Indian Languages: Resources and Experiments. In Proceedings of WILDRE4.
- Kuncheva and Whitaker (2003) Kuncheva, Ludmila I. and Christopher J. Whitaker. 2003. Measures of diversity in classifier ensembles and their relationship with the ensemble accuracy. Machine Learning, 51(2):181–207.
- Lawrence et al. (1997) Lawrence, Steve, C. Lee Giles, Ah Chung Tsoi, and Andrew D. Back. 1997. Face recognition: A convolutional neural-network approach. IEEE Transactions on Neural Networks, 8(1):98–113.
- LeCun et al. (1989) LeCun, Yann, Bernhard Boser, John S. Denker, Donnie Henderson, Richard E. Howard, Wayne Hubbard, and Lawrence D. Jackel. 1989. Backpropagation Applied to Handwritten Zip Code Recognition. Neural Computation, 1(4):541–551.
- LeCun, Huang, and Bottou (2004) LeCun, Yann, Fu Jie Huang, and Leon Bottou. 2004. Learning methods for generic object recognition with invariance to pose and lighting. In Proceedings of CVPR, volume 2, pages II–104.
- Ling et al. (2015) Ling, Wang, Chris Dyer, Alan W. Black, Isabel Trancoso, Ramón Fermandez, Silvio Amir, Luís Marujo, and Tiago Luís. 2015. Finding Function in Form: Compositional Character Models for Open Vocabulary Word Representation. In Proceedings of EMNLP, pages 1520–1530.
- Lodhi et al. (2002) Lodhi, Huma, Craig Saunders, John Shawe-Taylor, Nello Cristianini, and Chris Watkins. 2002. Text Classification Using String Kernels. Journal of Machine Learning Research, 2:419–444.
- Lodhi et al. (2001) Lodhi, Huma, John Shawe-Taylor, Nello Cristianini, and Christopher J.C.H. Watkins. 2001. Text Classification Using String Kernels. In Proceedings of NIPS, pages 563–569.
- Lozovanu (2012) Lozovanu, Dorin. 2012. Romanian-Speaking Communities Outside Romania: Linguistic Identities. International Journal of Social Science and Humanity, 2(6):569.
- Ma, Zhu, and Tong (2006) Ma, Bin, Donglai Zhu, and Rong Tong. 2006. Chinese Dialect Identification Using Tone Features Based on Pitch Flux. In Proceedings of ICASSP, volume 1, pages I–I.
- Malmasi et al. (2016) Malmasi, Shervin, Marcos Zampieri, Nikola Ljubešić, Preslav Nakov, Ahmed Ali, and Jörg Tiedemann. 2016. Discriminating between Similar Languages and Arabic Dialect Identification: A Report on the Third DSL Shared Task. In Proceedings of VarDial, pages 1–14.
- Masala, Ruseti, and Rebedea (2017) Masala, Mihai, Stefan Ruseti, and Traian Rebedea. 2017. Sentence selection with neural networks using string kernels. In Proceedings of KES, pages 1774–1782.
- Mikolov et al. (2013a) Mikolov, Tomas, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013a. Efficient Estimation of Word Representations in Vector Space. In Proceedings of ICLR Workshops.
- Mikolov et al. (2013b) Mikolov, Tomas, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013b. Distributed Representations of Words and Phrases and their Compositionality. In Proceedings of NIPS, pages 3111–3119.
- Minahan (2013) Minahan, James. 2013. Miniature Empires: A Historical Dictionary of the Newly Independent States. Taylor & Francis.
- Mingliang, Yuguo, and Yiming (2008) Mingliang, Gu, Xia Yuguo, and Yang Yiming. 2008. Semi-supervised learning based Chinese dialect identification. In Proceedings of ICSP, pages 1608–1611.
- Mishra and Mujadia (2019) Mishra, Pruthwik and Vandan Mujadia. 2019. Arabic Dialect Identification for Travel and Twitter Text. In Proceedings of WANLP, pages 234–238.
- Mititelu, Tufiş, and Irimia (2018) Mititelu, Verginica Barbu, Dan Tufiş, and Elena Irimia. 2018. The Reference Corpus of the Contemporary Romanian Language (CoRoLa). In Proceedings of LREC, pages 1235–1239.
- Musto et al. (2016) Musto, Cataldo, Giovanni Semeraro, Marco Degemmis, and Pasquale Lops. 2016. Learning Word Embeddings from Wikipedia for Content-Based Recommender Systems. In Proceedings of ECIR, pages 729–734.
- Nisioi (2014) Nisioi, Sergiu. 2014. On the syllabic structures of Aromanian. In Proceedings of the LaTeCH, pages 110–118.
- Nivre et al. (2016) Nivre, Joakim, Marie-Catherine de Marneffe, Filip Ginter, Yoav Goldberg, Jan Hajic, Christopher D. Manning, Ryan McDonald, Slav Petrov, Sampo Pyysalo, Natalia Silveira, Reut Tsarfaty, and Daniel Zeman. 2016. Universal Dependencies v1: A Multilingual Treebank Collection. In Proceedings of LREC 2016, pages 1659–1666.
- Nussbaum-Thom et al. (2016) Nussbaum-Thom, Markus, Jia Cui, Bhuvana Ramabhadran, and Vaibhava Goel. 2016. Acoustic Modeling Using Bidirectional Gated Recurrent Convolutional Units. In Proceedings of INTERSPEECH, pages 390–394.
- Onose, Cercel, and Trăuşan-Matu (2019) Onose, Cristian, Dumitru-Clementin Cercel, and Ştefan Trăuşan-Matu. 2019. SC-UPB at the VarDial 2019 Evaluation Campaign: Moldavian vs. Romanian Cross-Dialect Topic Identification. In Proceedings of VarDial, pages 172–177.
- Opitz and Maclin (1999) Opitz, David and Richard Maclin. 1999. Popular ensemble methods: An empirical study. Journal of Artificial Intelligence Research, 11:169–198.
- Paiş and Tufiş (2018) Paiş, Vasile and Dan Tufiş. 2018. Computing distributed representations of words using the CoRoLa corpus. Proceedings of the Romanian Academy, 19(2):403–409.
- Pavel (2008) Pavel, Vasile. 2008. Limba română – unitate în diversitate (Romanian language – there is unity in diversity). Romanian Language Journal, XVIII(9–10).
- Pennington, Socher, and Manning (2014) Pennington, Jeffrey, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global Vectors for Word Representation. In Proceedings of EMNLP, pages 1532–1543.
- Petrovici (1970) Petrovici, Emil. 1970. Studii de dialectologie şi toponimie. Editura Academiei.
- Popescu, Grozea, and Ionescu (2017) Popescu, Marius, Cristian Grozea, and Radu Tudor Ionescu. 2017. HASKER: An efficient algorithm for string kernels. Application to polarity classification in various languages. In Proceedings of KES, pages 1755–1763.
- Popescu and Ionescu (2013) Popescu, Marius and Radu Tudor Ionescu. 2013. The Story of the Characters, the DNA and the Native Language. In Proceedings of BEA-8, pages 270–278.
- Puşcariu (1976) Puşcariu, Sextil. 1976. Limba română. Privire generală. I. Minerva.
- Rangel et al. (2017) Rangel, Francisco, Paolo Rosso, Martin Potthast, and Benno Stein. 2017. Overview of the 5th author profiling task at PAN 2017: Gender and language variety identification in Twitter. In Working Notes Papers of the CLEF.
- Ravanelli et al. (2018) Ravanelli, Mirco, Philemon Brakel, Maurizio Omologo, and Yoshua Bengio. 2018. Light Gated Recurrent Units for Speech Recognition. IEEE Transactions on Emerging Topics in Computational Intelligence, 2(2):92–102.
- Reisinger and Mooney (2010) Reisinger, Joseph and Raymond J. Mooney. 2010. Multi-Prototype Vector-Space Models of Word Meaning. In Proceedings of NAACL, pages 109–117.
- Rokach (2010) Rokach, Lior. 2010. Ensemble-based classifiers. Artificial Intelligence Review, 33(1-2):1–39.
- Salameh, Bouamor, and Habash (2018) Salameh, Mohammad, Houda Bouamor, and Nizar Habash. 2018. Fine-Grained Arabic Dialect Identification. In Proceedings of COLING, pages 1332–1344.
- Samardzic, Scherrer, and Glaser (2016) Samardzic, Tanja, Yves Scherrer, and Elvira Glaser. 2016. ArchiMob - A Corpus of Spoken Swiss German. In Proceedings of LREC, pages 4061–4066.
- Sanderson and Guenter (2006) Sanderson, Conrad and Simon Guenter. 2006. Short text authorship attribution via sequence kernels, Markov chains and author unmasking: An investigation. In Proceedings of EMNLP, pages 482–491.
- dos Santos and Gatti (2014) dos Santos, Cícero and Maíra Gatti. 2014. Deep convolutional neural networks for sentiment analysis of short texts. In Proceedings of COLING, pages 69–78.
- Saunders, Gammerman, and Vovk (1998) Saunders, Craig, Alexander Gammerman, and Volodya Vovk. 1998. Ridge Regression Learning Algorithm in Dual Variables. In Proceedings of ICML, pages 512–521.
- Saunders et al. (2004) Saunders, Craig, David R. Hardoon, John Shawe-Taylor, and Gerhard Widmer. 2004. Using String Kernels to Identify Famous Performers from Their Playing Style. In Proceedings of ECML, pages 384–395.
- Schütze (1993) Schütze, Hinrich. 1993. Word Space. In Proceedings of NIPS, pages 895–902.
- Selvaraju et al. (2017) Selvaraju, Ramprasaath R., Michael Cogswell, Abhishek Das, Ramakrishna Vedantam, Devi Parikh, and Dhruv Batra. 2017. Grad-CAM: Visual Explanations from Deep Networks via Gradient-based Localization. In Proceedings of ICCV, pages 618–626.
- Shawe-Taylor and Cristianini (2004) Shawe-Taylor, John and Nello Cristianini. 2004. Kernel Methods for Pattern Analysis. Cambridge University Press.
- Shon et al. (2020) Shon, Suwon, Ahmed Ali, Younes Samih, Hamdy Mubarak, and James Glass. 2020. ADI17: A Fine-Grained Arabic Dialect Identification Dataset. In Proceedings of ICASSP, pages 8244–8248.
- Singh et al. (2017) Singh, Ritambhara, Arshdeep Sekhon, Kamran Kowsari, Jack Lanchantin, Beilun Wang, and Yanjun Qi. 2017. GaKCo: A Fast Gapped k-mer String Kernel Using Counting. In Proceedings of ECML-PKDD, pages 356–373.
- Sollich and Krogh (1996) Sollich, Peter and Anders Krogh. 1996. Learning with ensembles: How overfitting can be useful. In Proceedings of NIPS, pages 190–196.
- Sutskever, Martens, and Hinton (2011) Sutskever, Ilya, James Martens, and Geoffrey Hinton. 2011. Generating Text with Recurrent Neural Networks. In Proceedings of ICML, pages 1017–1024.
- Tian et al. (2014) Tian, Fei, Hanjun Dai, Jiang Bian, Bin Gao, Rui Zhang, Enhong Chen, and Tie-Yan Liu. 2014. A Probabilistic Model for Learning Multi-Prototype Word Embeddings. In Proceedings of COLING, pages 151–160.
- Torres-Carrasquillo, Gleason, and Reynolds (2004) Torres-Carrasquillo, Pedro A., Terry P. Gleason, and Douglas A. Reynolds. 2004. Dialect identification using Gaussian Mixture Models. In Proceedings of ODYSSEY04.
- Tsai and Chang (2002) Tsai, Wuei-He and Wen-Whei Chang. 2002. Discriminative training of Gaussian mixture bigram models with application to Chinese dialect identification. Speech Communication, 36(3-4):317–326.
- Tudoreanu (2019) Tudoreanu, Diana. 2019. DTeam @ VarDial 2019: Ensemble based on skip-gram and triplet loss neural networks for Moldavian vs. Romanian cross-dialect topic identification. In Proceedings of VarDial, pages 202–208.
- Weiss, Goldberg, and Yahav (2018) Weiss, Gail, Yoav Goldberg, and Eran Yahav. 2018. On the Practical Computational Power of Finite Precision RNNs for Language Recognition. In Proceedings of ACL, pages 740–745.
- Werbos (1988) Werbos, Paul J. 1988. Generalization of backpropagation with application to a recurrent gas market model. Neural Networks, 1(4):339–356.
- Weston, Bengio, and Usunier (2011) Weston, Jason, Samy Bengio, and Nicolas Usunier. 2011. WSABIE: Scaling up to Large Vocabulary Image Annotation. In Proceedings of IJCAI, pages 2764–2770.
- Wolpert (1992) Wolpert, David H. 1992. Stacked generalization. Neural Networks, 5(2):241–259.
- Wood et al. (2009) Wood, Frank, Cédric Archambeau, Jan Gasthaus, Lancelot James, and Yee Whye Teh. 2009. A Stochastic Memoizer for Sequence Data. In Proceedings of ICML, pages 1129–1136.
- Wu et al. (2019) Wu, Nianheng, Eric DeMattos, Kwok Him So, Pin-zhen Chen, and Çağrı Çöltekin. 2019. Language Discrimination and Transfer Learning for Similar Languages: Experiments with Feature Combinations and Adaptation. In Proceedings of VarDial, pages 54–63.
- Xia et al. (2011) Xia, Wang, Gu Mingliang, Gao Yuan, and Ma Yong. 2011. Chinese dialect identification based on gender classification. In Proceedings of WCSP, pages 1–5.
- Yang, Macdonald, and Ounis (2018) Yang, Xiao, Craig Macdonald, and Iadh Ounis. 2018. Using Word Embeddings in Twitter Election Classification. Information Retrieval Journal, 21:183–207.
- Yang et al. (2016) Yang, Zichao, Diyi Yang, Chris Dyer, Xiaodong He, Alex Smola, and Eduard Hovy. 2016. Hierarchical attention networks for document classification. In Proceedings of NAACL, pages 1480–1489.
- Zaidan and Callison-Burch (2011) Zaidan, Omar F. and Chris Callison-Burch. 2011. The Arabic Online Commentary Dataset: An Annotated Dataset of Informal Arabic with High Dialectal Content. In Proceedings of ACL: HLT, volume 2, pages 37–41.
- Zaidan and Callison-Burch (2014) Zaidan, Omar F. and Chris Callison-Burch. 2014. Arabic Dialect Identification. Computational Linguistics, 40(1):171–202.
- Zaki, Deris, and Illias (2005) Zaki, Nazar, Safaai Deris, and Rosli Illias. 2005. Application of String Kernels in Protein Sequence Classification. Applied Bioinformatics, 4:45–52.
- Zampieri et al. (2017) Zampieri, Marcos, Shervin Malmasi, Nikola Ljubešić, Preslav Nakov, Ahmed Ali, Jörg Tiedemann, Yves Scherrer, and Noëmi Aepli. 2017. Findings of the VarDial Evaluation Campaign 2017. In Proceedings of VarDial, pages 1–15.
- Zampieri et al. (2018) Zampieri, Marcos, Shervin Malmasi, Preslav Nakov, Ahmed Ali, Suwon Shon, James Glass, Yves Scherrer, Tanja Samardžić, Nikola Ljubešić, Jörg Tiedemann, Chris van der Lee, Stefan Grondelaers, Nelleke Oostdijk, Antal van den Bosch, Ritesh Kumar, Bornini Lahiri, and Mayank Jain. 2018. Language Identification and Morphosyntactic Tagging: The Second VarDial Evaluation Campaign. In Proceedings of VarDial, pages 1–17.
- Zampieri et al. (2019) Zampieri, Marcos, Shervin Malmasi, Yves Scherrer, Tanja Samardžić, Francis Tyers, Miikka Silfverberg, Natalia Klyueva, Tung-Le Pan, Chu-Ren Huang, Radu Tudor Ionescu, Andrei M. Butnaru, and Tommi Jauhiainen. 2019. A Report on the Third VarDial Evaluation Campaign. In Proceedings of VarDial, pages 1–16.
- Zampieri et al. (2014) Zampieri, Marcos, Liling Tan, Nikola Ljubešić, and Jörg Tiedemann. 2014. A Report on the DSL Shared Task 2014. In Proceedings of VarDial, pages 58–67.
- Zampieri et al. (2015) Zampieri, Marcos, Liling Tan, Nikola Ljubešić, Jörg Tiedemann, and Preslav Nakov. 2015. Overview of the DSL Shared Task 2015. In Proceedings of LT4VarDial, pages 1–9.
- Zeman et al. (2018) Zeman, Daniel, Jan Hajic, Martin Popel, Martin Potthast, Milan Straka, Filip Ginter, Joakim Nivre, and Slav Petrov. 2018. CoNLL 2018 shared task: Multilingual parsing from raw text to universal dependencies. In Proceedings of CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, pages 1–21.
- Zhang, Zhao, and LeCun (2015) Zhang, Xiang, Junbo Zhao, and Yann LeCun. 2015. Character-level Convolutional Networks for Text Classification. In Proceedings of NIPS, pages 649–657.
- Zissman et al. (1996) Zissman, Marc A., Terry P. Gleason, D.M. Rekart, and Beth L. Losiewicz. 1996. Automatic dialect identification of extemporaneous conversational, Latin American Spanish speech. In Proceedings of ICASSP, volume 2, pages 777–780.