In recent years, methods based on string kernels have demonstrated remarkable performance in various text classification tasks, ranging from authorship identification (Popescu and Grozea, 2012) and sentiment analysis (Giménez-Pérez et al., 2017; Popescu et al., 2017) to native language identification (Popescu and Ionescu, 2013; Ionescu et al., 2014, 2016; Ionescu and Popescu, 2017), dialect identification (Ionescu and Popescu, 2016b; Ionescu and Butnaru, 2017; Butnaru and Ionescu, 2018) and automatic essay scoring (Cozma et al., 2018). As long as a labeled training set is available, string kernels can reach state-of-the-art results in various languages, including English (Ionescu et al., 2014; Giménez-Pérez et al., 2017; Cozma et al., 2018), Arabic (Ionescu, 2015; Ionescu et al., 2016; Ionescu and Butnaru, 2017; Butnaru and Ionescu, 2018), Chinese (Popescu et al., 2017) and Norwegian (Ionescu et al., 2016). Different from all these recent approaches, we use unlabeled data from the test set to significantly increase the performance of string kernels. More precisely, we propose two transductive learning approaches combined into a unified framework. We show that the proposed framework improves the results of string kernels on two different tasks (cross-domain sentiment classification and Arabic dialect identification) and two different languages (English and Arabic). To the best of our knowledge, transductive learning frameworks based on string kernels have not been studied in previous works.
2 Transductive String Kernels
String kernels. Kernel functions (Shawe-Taylor and Cristianini, 2004) capture the intuitive notion of similarity between objects in a specific domain. For example, in text mining, string kernels can be used to measure the pairwise similarity between text samples, simply based on character n-grams. Various string kernel functions have been proposed to date (Lodhi et al., 2002; Shawe-Taylor and Cristianini, 2004; Ionescu et al., 2014). One of the most recently introduced is the histogram intersection string kernel (Ionescu et al., 2014). For two strings $s$ and $t$ over an alphabet $\Sigma$, the intersection string kernel is formally defined as follows:
$$k_{\cap}(s, t) = \sum_{v \in \Sigma^n} \min\{\textnormal{num}_v(s), \textnormal{num}_v(t)\},$$
where $\textnormal{num}_v(s)$ is the number of occurrences of n-gram $v$ as a substring in $s$, and $|s|$ denotes the length of $s$. The spectrum string kernel and the presence bits string kernel can be defined in a similar fashion (Ionescu et al., 2014). The standard kernel learning pipeline is presented in Figure 1. String kernels allow the dual representation to be computed directly and efficiently (Popescu et al., 2017), thus skipping the first step of the pipeline illustrated in Figure 1.
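As an illustration, the intersection kernel for a single n-gram length can be sketched in a few lines of Python (a minimal sketch with names of our own choosing; efficient implementations, such as the one of Popescu et al. (2017), rely on smarter data structures):

```python
from collections import Counter

def ngram_counts(s, n):
    """Count every character n-gram that occurs as a substring of s."""
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))

def intersection_kernel(s, t, n):
    """Histogram intersection string kernel for one n-gram length:
    sum, over the shared n-grams, of the minimum occurrence count."""
    cs, ct = ngram_counts(s, n), ngram_counts(t, n)
    # Only n-grams present in both strings contribute a non-zero minimum.
    return sum(min(c, ct[v]) for v, c in cs.items() if v in ct)
```

For example, `intersection_kernel("banana", "bandana", 2)` counts the shared bigrams ba, an and na, each weighted by its minimum number of occurrences in the two strings. In practice, such kernel values are summed over a range of n-gram lengths.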
Transductive string kernels. We propose a simple and straightforward approach to produce a transductive similarity measure suitable for strings, as illustrated in Figure 2. We take the following steps to derive transductive string kernels. For a given kernel (similarity) function $k$, we first build the full kernel matrix $K$ by including the pairwise similarities of samples from both the training and the test sets (Figure 2). For a training set of $r$ samples and a test set of $u$ samples, such that $m = r + u$, each component of the full kernel matrix is defined as follows:
$$K_{ij} = k(x_i, x_j),$$
where $x_i$ and $x_j$ are samples from the set $X = \{x_1, x_2, \dots, x_m\}$ of all training and test samples, for all $1 \leq i, j \leq m$. We then normalize the kernel matrix by dividing each component by the square root of the product of the two corresponding diagonal components:
$$\hat{K}_{ij} = \frac{K_{ij}}{\sqrt{K_{ii} \cdot K_{jj}}}.$$
We transform the normalized kernel matrix into a radial basis function (RBF) kernel matrix as follows:
$$\tilde{K}_{ij} = \exp\left(-\frac{1 - \hat{K}_{ij}}{2 \sigma^2}\right). \quad (4)$$
As the kernel matrix is already normalized, we can choose $2\sigma^2 = 1$ for simplicity. Therefore, Equation (4) becomes:
$$\tilde{K}_{ij} = \exp\left(-\left(1 - \hat{K}_{ij}\right)\right).$$
Each row of the RBF kernel matrix $\tilde{K}$ is now interpreted as a feature vector (Figure 2). In other words, each sample $x_i$ is represented by a feature vector containing the similarity between $x_i$ and every sample in the set $X$ of training and test samples. Since $X$ includes the test samples, the feature vectors are inherently adapted to the test set. Indeed, it is easy to see that the features would be different if we applied the string kernel approach on a different set of test samples. It is important to note that, through these features, the subsequent classifier receives some information about the test samples at training time. More specifically, each feature vector conveys information about how similar every test sample is to every training sample. We next consider the linear kernel, which is given by the scalar product between the new feature vectors. To obtain the final linear kernel matrix, we simply compute the product between the RBF kernel matrix and its transpose (Figure 2):
$$\hat{K}^{+} = \tilde{K} \cdot \tilde{K}^{\top}.$$
In this way, the samples from the test set, which are included in the full kernel matrix, are used to obtain new (transductive) string kernels that are adapted to the test set at hand.
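The construction above can be sketched with NumPy (a minimal sketch under our simplifying choice of $2\sigma^2 = 1$; the function name is ours):

```python
import numpy as np

def transductive_kernel(K):
    """Transform the full (train + test) kernel matrix K: normalize it,
    map it through the RBF transform (with 2*sigma^2 = 1), then multiply
    the result by its own transpose, so that each sample ends up being
    represented by its similarities to all train and test samples."""
    d = np.sqrt(np.diag(K))
    K_norm = K / np.outer(d, d)      # K_ij / sqrt(K_ii * K_jj)
    K_rbf = np.exp(-(1.0 - K_norm))  # simplified RBF transform
    return K_rbf @ K_rbf.T           # linear kernel on the new feature vectors
```

The returned matrix is again symmetric and positive semi-definite, so it can be passed directly to any kernel classifier.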
Transductive kernel classifier. After obtaining the transductive string kernels, we use a simple transductive learning approach that falls in the category of self-training methods (McClosky et al., 2006; Chen et al., 2011). The approach consists of two learning iterations. In the first iteration, a kernel classifier is trained on the training data and applied on the test data, just as usual. Next, the test samples are sorted by the classifier's confidence score, so as to maximize the probability of correctly predicted labels at the top of the sorted list. In the second iteration, a fixed number of samples from the top of the list are added to the training set for another round of training. Even though a small percentage of the predicted labels corresponding to the newly included samples are wrong, the classifier has the chance to learn some useful patterns (from the correctly predicted labels) that are only visible in the test data. The transductive kernel classifier (TKC) is based on the intuition that the added test samples bring more useful information than noise, since the majority of them carry correct labels. Finally, we stress that the ground-truth test labels are never used in our transductive algorithm.
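The two iterations can be sketched as follows (a minimal sketch, assuming a binary task with $\pm 1$ labels, Kernel Ridge Regression as the base classifier, the absolute value of the regression output as the confidence score, and a placeholder default for the number of added samples):

```python
import numpy as np
from sklearn.kernel_ridge import KernelRidge

def transductive_classify(K_full, y_train, n_train, n_add=100):
    """Two-iteration self-training on a precomputed (train + test) kernel
    matrix K_full; n_add stands in for the fixed number of added test
    samples, which is a hyperparameter of the method."""
    tr = np.arange(n_train)
    te = np.arange(n_train, K_full.shape[0])
    clf = KernelRidge(kernel="precomputed", alpha=1.0)
    # Iteration 1: train on the labeled data and score the test samples.
    clf.fit(K_full[np.ix_(tr, tr)], y_train)
    scores = clf.predict(K_full[np.ix_(te, tr)])
    # Sort test samples by confidence (distance from the decision boundary).
    order = np.argsort(-np.abs(scores))[:n_add]
    # Iteration 2: retrain with the most confidently pseudo-labeled samples.
    tr2 = np.concatenate([tr, te[order]])
    y2 = np.concatenate([y_train, np.sign(scores[order])])
    clf.fit(K_full[np.ix_(tr2, tr2)], y2)
    return np.sign(clf.predict(K_full[np.ix_(te, tr2)]))
```

Note that only the classifier's own predictions on the test samples are used as pseudo-labels; the ground-truth test labels never enter the procedure.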
The proposed transductive learning approaches are used together in a unified framework. As with any other transductive learning method, the main disadvantage of the proposed framework is that the unlabeled test samples from the target domain need to be available at training time. Nevertheless, we present empirical results indicating that our approach obtains significantly better accuracy rates in cross-domain polarity classification and Arabic dialect identification compared to state-of-the-art methods based on string kernels (Giménez-Pérez et al., 2017; Ionescu and Butnaru, 2017). We also report better results than other domain adaptation methods (Pan et al., 2010; Bollegala et al., 2013; Franco-Salvador et al., 2015; Sun et al., 2016; Huang et al., 2017).
3 Polarity Classification
Data set. For the cross-domain polarity classification experiments, we use the second version of the Multi-Domain Sentiment Dataset (Blitzer et al., 2007). The data set contains Amazon product reviews from four different domains: Books (B), DVDs (D), Electronics (E) and Kitchen appliances (K). Reviews come with star ratings (from 1 to 5), which are converted into binary labels as follows: reviews rated with more than 3 stars are labeled as positive, and those with less than 3 stars as negative. Each domain contains 1000 positive and 1000 negative reviews.
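The label conversion can be expressed as follows (a sketch; the function name is ours, and note that 3-star reviews fall into neither class under this rule):

```python
def rating_to_label(stars):
    """Binarize an Amazon star rating (1-5): more than 3 stars is positive,
    less than 3 stars is negative; 3-star reviews get no binary label."""
    if stars > 3:
        return "positive"
    if stars < 3:
        return "negative"
    return None
```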
Baselines. We compare our approach with several methods (Pan et al., 2010; Bollegala et al., 2013; Franco-Salvador et al., 2015; Sun et al., 2016; Giménez-Pérez et al., 2017; Huang et al., 2017) in two cross-domain settings. Using string kernels, Giménez-Pérez et al. (2017) reported better performance than SST (Bollegala et al., 2013) and KE-Meta (Franco-Salvador et al., 2015) in the multi-source domain setting. In addition, we compare our approach with SFA (Pan et al., 2010), KMM (Huang et al., 2007), CORAL (Sun et al., 2016) and TR-TrAdaBoost (Huang et al., 2017) in the single-source setting.
Evaluation procedure and parameters. We follow the same evaluation methodology as Giménez-Pérez et al. (2017) to ensure a fair comparison. Furthermore, we use the same kernels, namely the presence bits string kernel and the intersection string kernel, and the same range of character n-grams (5–8). To compute the string kernels, we used the open-source code provided by Ionescu and Popescu (2016a). For the transductive kernel classifier, we select a fixed number of unlabeled test samples to be included in the training set for the second round of training. We choose Kernel Ridge Regression (Shawe-Taylor and Cristianini, 2004) as classifier and use the same regularization parameter in all our experiments. Although Giménez-Pérez et al. (2017) used a different classifier, namely Kernel Discriminant Analysis, we observed that Kernel Ridge Regression produces similar results when we employ the same string kernels. As Giménez-Pérez et al. (2017), we evaluate our approach in two cross-domain settings. In the multi-source setting, we train the models on all domains except the one used for testing. In the single-source setting, we train the models on one of the four domains and independently test them on the remaining three domains.
Results in multi-source setting. The results for the multi-source cross-domain polarity classification setting are presented in Table 1. Both the transductive presence bits string kernel and the transductive intersection kernel obtain better results than their original counterparts. Moreover, according to the McNemar's test (Dietterich, 1998), the results on the DVDs, the Electronics and the Kitchen target domains are significantly better than those of the best baseline string kernel. When we employ the transductive kernel classifier (TKC), we obtain even better results. On all domains, the accuracy rates yielded by the transductive classifier surpass the best baseline. For example, on the Books domain, the accuracy of the transductive classifier based on the presence bits kernel exceeds the best baseline, represented by the intersection string kernel. Remarkably, the improvements brought by our transductive string kernel approach are statistically significant in all domains.
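For reference, the continuity-corrected version of the McNemar test statistic described by Dietterich (1998) can be computed from the counts of test samples on which exactly one of the two compared classifiers errs (the function name and the example counts below are ours):

```python
def mcnemar_chi2(b, c):
    """McNemar's test statistic with continuity correction (Dietterich, 1998):
    b and c are the numbers of test samples misclassified by exactly one of
    the two compared classifiers. Under the null hypothesis the statistic
    approximately follows a chi-squared distribution with one degree of
    freedom, so a value above 3.841 indicates significance at the 0.05 level."""
    return (abs(b - c) - 1) ** 2 / (b + c)
```

For instance, with hypothetical disagreement counts b = 10 and c = 30, the statistic is 9.025, well above the 3.841 threshold.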
Results in single-source setting. The results for the single-source cross-domain polarity classification setting are presented in Table 2. We considered all possible combinations of source and target domains in this experiment, and we improve the results in each and every case. Without exception, the accuracy rates reached by the transductive string kernels are significantly better than those of the best baseline string kernel (Giménez-Pérez et al., 2017), according to the McNemar's test. The highest improvements are obtained when the source domain contains Books reviews and the target domain contains Kitchen reviews. As in the multi-source setting, we obtain much better results when the transductive classifier is employed for the learning task. In all cases, the accuracy rates of the transductive classifier surpass the best baseline string kernel, with particularly large improvements in four cases (E→B, E→D, B→K and D→K). The improvements brought by our transductive classifier based on string kernels are statistically significant in each and every case. In comparison with SFA (Pan et al., 2010), we obtain better results in all but one case (K→D). With respect to KMM (Huang et al., 2007), we also obtain better results in all but one case (B→E). Remarkably, we surpass the other state-of-the-art approaches (Sun et al., 2016; Huang et al., 2017) in all cases.
4 Arabic Dialect Identification
Data set. The Arabic Dialect Identification (ADI) data set (Ali et al., 2016) contains audio recordings and Automatic Speech Recognition (ASR) transcripts of Arabic speech collected from the Broadcast News domain. The classification task is to discriminate between Modern Standard Arabic and four Arabic dialects, namely Egyptian, Gulf, Levantine and Maghrebi. The training set contains 14000 samples, the development set contains 1524 samples, and the test set contains another 1492 samples. The data set was used in the ADI Shared Task of the 2017 VarDial Evaluation Campaign (Zampieri et al., 2017).
Baseline. We choose as baseline the approach of Ionescu and Butnaru (2017), which is based on string kernels and multiple kernel learning. The approach that we consider as baseline is the winner of the 2017 ADI Shared Task (Zampieri et al., 2017). In addition, we also compare with the second-best approach (Meta-classifier) of Malmasi and Zampieri (2017).
Evaluation procedure and parameters. Ionescu and Butnaru (2017) combined four kernels into a sum and used Kernel Ridge Regression for training. Three of the kernels are based on character n-grams extracted from the ASR transcripts: the presence bits string kernel, the intersection string kernel, and a kernel based on Local Rank Distance (Ionescu, 2013). The fourth kernel is an RBF kernel based on the i-vectors provided with the ADI data set (Ali et al., 2016). In our experiments, we employ the exact same kernels as Ionescu and Butnaru (2017) to ensure an unbiased comparison with their approach. As in the polarity classification experiments, we select a fixed number of unlabeled test samples to be included in the training set for the second round of training the transductive classifier, and we use Kernel Ridge Regression with the same regularization parameter in all our ADI experiments.
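The kernel combination can be sketched as follows (a minimal sketch; summing the four kernel matrices follows the baseline, while normalizing each matrix before the sum is our assumption, and the function name is ours):

```python
import numpy as np

def combine_kernels(kernels):
    """Sum several precomputed kernel matrices into a single kernel,
    normalizing each matrix first so that no single kernel dominates."""
    combined = np.zeros_like(kernels[0], dtype=float)
    for K in kernels:
        d = np.sqrt(np.diag(K))
        combined += K / np.outer(d, d)  # K_ij / sqrt(K_ii * K_jj)
    return combined
```

The combined matrix can then be passed to Kernel Ridge Regression (or to the transductive transformations described in Section 2) exactly like a single kernel.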
Results. The results for the cross-domain Arabic dialect identification experiments on both the development and the test sets are presented in Table 3. The domain-adapted sum of kernels obtains improvements over the state-of-the-art sum of kernels (Ionescu and Butnaru, 2017), and the improvement on the development set is statistically significant. Nevertheless, we obtain higher and significant improvements when we employ the transductive classifier, which attains our best accuracy rates on both the development and the test sets. The results show that our domain adaptation framework based on string kernels attains the best performance on the ADI Shared Task data set, and the improvements over the state-of-the-art are statistically significant, according to the McNemar's test.
- Ali et al. (2016) Ahmed Ali, Najim Dehak, Patrick Cardinal, Sameer Khurana, Sree Harsha Yella, James Glass, Peter Bell, and Steve Renals. 2016. Automatic dialect detection in Arabic broadcast speech. In Proceedings of INTERSPEECH, pages 2934–2938.
- Blitzer et al. (2007) John Blitzer, Mark Dredze, and Fernando Pereira. 2007. Biographies, Bollywood, boom-boxes and blenders: Domain adaptation for sentiment classification. In Proceedings of ACL, pages 187–205.
- Bollegala et al. (2013) D. Bollegala, D. Weir, and J. Carroll. 2013. Cross-Domain Sentiment Classification Using a Sentiment Sensitive Thesaurus. IEEE Transactions on Knowledge and Data Engineering, 25(8):1719–1731.
- Butnaru and Ionescu (2018) Andrei M. Butnaru and Radu Tudor Ionescu. 2018. UnibucKernel Reloaded: First Place in Arabic Dialect Identification for the Second Year in a Row. In Proceedings of VarDial Workshop of COLING, pages 77–87.
- Chen et al. (2011) Minmin Chen, Kilian Weinberger, and John Blitzer. 2011. Co-Training for Domain Adaptation. In Proceedings of NIPS, pages 2456–2464.
- Cozma et al. (2018) Mădălina Cozma, Andrei Butnaru, and Radu Tudor Ionescu. 2018. Automated essay scoring with string kernels and word embeddings. In Proceedings of ACL, pages 503–509.
- Dietterich (1998) Thomas G. Dietterich. 1998. Approximate Statistical Tests for Comparing Supervised Classification Learning Algorithms. Neural Computation, 10(7):1895–1923.
- Franco-Salvador et al. (2015) Marc Franco-Salvador, Fermin L. Cruz, Jose A. Troyano, and Paolo Rosso. 2015. Cross-domain polarity classification using a knowledge-enhanced meta-classifier. Knowledge-Based Systems, 86:46–56.
- Giménez-Pérez et al. (2017) Rosa M. Giménez-Pérez, Marc Franco-Salvador, and Paolo Rosso. 2017. Single and Cross-domain Polarity Classification using String Kernels. In Proceedings of EACL, pages 558–563.
- Huang et al. (2007) Jiayuan Huang, Arthur Gretton, Karsten Borgwardt, Bernhard Schölkopf, and Alex Smola. 2007. Correcting sample selection bias by unlabeled data. In Proceedings of NIPS, pages 601–608.
- Huang et al. (2017) Xingchang Huang, Yanghui Rao, Haoran Xie, Tak-Lam Wong, and Fu Lee Wang. 2017. Cross-Domain Sentiment Classification via Topic-Related TrAdaBoost. In Proceedings of AAAI, pages 4939–4940.
- Ionescu (2013) Radu Tudor Ionescu. 2013. Local Rank Distance. In Proceedings of SYNASC, pages 221–228.
- Ionescu (2015) Radu Tudor Ionescu. 2015. A Fast Algorithm for Local Rank Distance: Application to Arabic Native Language Identification. In Proceedings of ICONIP, volume 9490, pages 390–400.
- Ionescu and Butnaru (2017) Radu Tudor Ionescu and Andrei Butnaru. 2017. Learning to Identify Arabic and German Dialects using Multiple Kernels. In Proceedings of VarDial Workshop of EACL, pages 200–209.
- Ionescu and Popescu (2016a) Radu Tudor Ionescu and Marius Popescu. 2016a. Native Language Identification with String Kernels. In Knowledge Transfer between Computer Vision and Text Mining. Springer.
- Ionescu and Popescu (2016b) Radu Tudor Ionescu and Marius Popescu. 2016b. UnibucKernel: An Approach for Arabic Dialect Identification based on Multiple String Kernels. In Proceedings of VarDial Workshop of COLING, pages 135–144.
- Ionescu and Popescu (2017) Radu Tudor Ionescu and Marius Popescu. 2017. Can string kernels pass the test of time in native language identification? In Proceedings of the 12th Workshop on Innovative Use of NLP for Building Educational Applications, pages 224–234.
- Ionescu et al. (2014) Radu Tudor Ionescu, Marius Popescu, and Aoife Cahill. 2014. Can characters reveal your native language? a language-independent approach to native language identification. In Proceedings of EMNLP, pages 1363–1373.
- Ionescu et al. (2016) Radu Tudor Ionescu, Marius Popescu, and Aoife Cahill. 2016. String kernels for native language identification: Insights from behind the curtains. Computational Linguistics, 42(3):491–525.
- Lodhi et al. (2002) Huma Lodhi, Craig Saunders, John Shawe-Taylor, Nello Cristianini, and Christopher J. C. H. Watkins. 2002. Text classification using string kernels. Journal of Machine Learning Research, 2:419–444.
- Malmasi and Zampieri (2017) Shervin Malmasi and Marcos Zampieri. 2017. Arabic Dialect Identification Using iVectors and ASR Transcripts. In Proceedings of the VarDial Workshop of EACL, pages 178–183.
- McClosky et al. (2006) David McClosky, Eugene Charniak, and Mark Johnson. 2006. Effective Self-training for Parsing. In Proceedings of NAACL, pages 152–159.
- Pan et al. (2010) Sinno Jialin Pan, Xiaochuan Ni, Jian-Tao Sun, Qiang Yang, and Zheng Chen. 2010. Cross-domain Sentiment Classification via Spectral Feature Alignment. In Proceedings of WWW, pages 751–760.
- Popescu and Grozea (2012) Marius Popescu and Cristian Grozea. 2012. Kernel methods and string kernels for authorship analysis. In Proceedings of CLEF (Online Working Notes/Labs/Workshop).
- Popescu et al. (2017) Marius Popescu, Cristian Grozea, and Radu Tudor Ionescu. 2017. HASKER: An efficient algorithm for string kernels. Application to polarity classification in various languages. In Proceedings of KES, pages 1755–1763.
- Popescu and Ionescu (2013) Marius Popescu and Radu Tudor Ionescu. 2013. The Story of the Characters, the DNA and the Native Language. In Proceedings of the Eighth Workshop on Innovative Use of NLP for Building Educational Applications, pages 270–278.
- Shawe-Taylor and Cristianini (2004) John Shawe-Taylor and Nello Cristianini. 2004. Kernel Methods for Pattern Analysis. Cambridge University Press.
- Sun et al. (2016) Baochen Sun, Jiashi Feng, and Kate Saenko. 2016. Return of Frustratingly Easy Domain Adaptation. In Proceedings of AAAI, pages 2058–2065.
- Zampieri et al. (2017) Marcos Zampieri, Shervin Malmasi, Nikola Ljubešić, Preslav Nakov, Ahmed Ali, Jörg Tiedemann, Yves Scherrer, and Noëmi Aepli. 2017. Findings of the VarDial Evaluation Campaign 2017. In Proceedings of VarDial Workshop of EACL, pages 1–15.