Combining Deep Learning and String Kernels for the Localization of Swiss German Tweets

10/07/2020
by Mihaela Gaman, et al.

In this work, we introduce the methods proposed by the UnibucKernel team for the Social Media Variety Geolocation task featured in the 2020 VarDial Evaluation Campaign. We address only the second subtask, which targets a data set of nearly 30 thousand Swiss German Jodels. The task is to accurately predict the latitude and longitude of each test sample. We frame it as a double regression problem and employ a variety of machine learning approaches to predict both coordinates. We consider simple regression models, such as Support Vector Regression, deep neural networks, such as Long Short-Term Memory networks and character-level convolutional neural networks, and ensemble models based on meta-learners, such as XGBoost, approaching the problem from several perspectives in an attempt to minimize the prediction error. With the same goal in mind, we also consider many types of features, from high-level features, such as BERT embeddings, to low-level features, such as character n-grams, which are known to provide good results in dialect identification. Our empirical results indicate that the handcrafted model based on string kernels outperforms the deep learning approaches. Nevertheless, our best performance is obtained by the ensemble model that combines both handcrafted and deep learning models.
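
As a rough illustration of the double regression framing with a string kernel over character n-grams, the sketch below predicts latitude and longitude jointly from short texts. It is not the UnibucKernel implementation: kernel ridge regression stands in for the Support Vector Regression named in the abstract, and the n-gram range, the ridge value, all function names, and the toy texts and coordinates are assumptions made for illustration only.

```python
from collections import Counter
from math import sqrt

def char_ngrams(text, n_min=3, n_max=5):
    """Count character n-grams for n_min <= n <= n_max (range is an assumption)."""
    counts = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts

def spectrum_kernel(a, b):
    """Normalized spectrum (n-gram count) kernel between two n-gram maps."""
    dot = sum(c * b[g] for g, c in a.items() if g in b)
    norm = sqrt(sum(c * c for c in a.values()) * sum(c * c for c in b.values()))
    return dot / norm if norm else 0.0

def solve(A, B):
    """Solve A X = B by Gauss-Jordan elimination with partial pivoting."""
    n = len(A)
    M = [list(A[i]) + list(B[i]) for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        p = M[col][col]
        M[col] = [v / p for v in M[col]]
        for r in range(n):
            if r != col:
                f = M[r][col]
                M[r] = [v - f * w for v, w in zip(M[r], M[col])]
    return [row[n:] for row in M]

def fit(texts, coords, ridge=0.1):
    """Kernel ridge regression with two targets: (latitude, longitude)."""
    feats = [char_ngrams(t) for t in texts]
    means = [sum(c[d] for c in coords) / len(coords) for d in range(2)]
    centered = [[c[0] - means[0], c[1] - means[1]] for c in coords]
    K = [[spectrum_kernel(a, b) for b in feats] for a in feats]
    for i in range(len(K)):
        K[i][i] += ridge  # regularization keeps the system well conditioned
    alphas = solve(K, centered)  # one dual coefficient pair per training text
    return feats, alphas, means

def predict(model, text):
    """Return the predicted (latitude, longitude) for one text."""
    feats, alphas, means = model
    q = char_ngrams(text)
    k = [spectrum_kernel(q, f) for f in feats]
    lat = means[0] + sum(ki * a[0] for ki, a in zip(k, alphas))
    lon = means[1] + sum(ki * a[1] for ki, a in zip(k, alphas))
    return lat, lon

# Toy, made-up data: placeholder strings paired with hypothetical coordinates.
texts = ["aaaa bbbb", "aaaa bbbc", "xxxx yyyy", "xxxx yyyz"]
coords = [(47.4, 8.5), (47.4, 8.5), (46.9, 7.4), (46.9, 7.4)]
model = fit(texts, coords)
print(predict(model, "aaaa bbbd"))
```

Fitting both targets against the same kernel matrix is one simple way to realize the double regression; training two independent regressors, one per coordinate, would be an equally valid reading of the abstract.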

research
02/18/2021

UnibucKernel: Geolocating Swiss German Jodels Using Ensemble Learning

In this work, we describe our approach addressing the Social Media Varie...
research
04/21/2018

Automated essay scoring with string kernels and word embeddings

In this work, we present an approach based on combining string kernels a...
research
11/05/2021

Sexism Identification in Tweets and Gabs using Deep Neural Networks

Through anonymisation and accessibility, social media platforms have fac...
research
02/28/2021

NLP-CUET@LT-EDI-EACL2021: Multilingual Code-Mixed Hope Speech Detection using Cross-lingual Representation Learner

In recent years, several systems have been developed to regulate the spr...
research
05/20/2021

TF-IDF vs Word Embeddings for Morbidity Identification in Clinical Notes: An Initial Study

Today, we are seeing an ever-increasing number of clinical notes that co...
research
03/28/2022

UTSA NLP at SemEval-2022 Task 4: An Exploration of Simple Ensembles of Transformers, Convolutional, and Recurrent Neural Networks

The act of appearing kind or helpful via the use of but having a feeling...
research
07/05/2023

How Deep Neural Networks Learn Compositional Data: The Random Hierarchy Model

Learning generic high-dimensional tasks is notably hard, as it requires ...
