HashSet – A Dataset For Hashtag Segmentation

01/18/2022
by Prashant Kodali, et al.

Hashtag segmentation is the task of breaking a hashtag into its constituent tokens. Hashtags often encode the essence of user-generated posts, along with information such as topic and sentiment, which is useful in downstream tasks. Hashtags prioritize brevity and are written in unique ways: they transliterate and mix languages, vary spellings, and coin creative named entities. The benchmark datasets used for the hashtag segmentation task, STAN and BOUN, are small and extracted from a single set of tweets. However, datasets should reflect the variation in hashtag writing styles and account for domain and language specificity; otherwise, results will misrepresent model performance. We argue that model performance should be assessed on a wider variety of hashtags, and that datasets should be carefully curated. To this end, we propose HashSet, a dataset comprising: a) a 1.9k manually annotated dataset; b) a 3.3M loosely supervised dataset. HashSet is sampled from a different set of tweets than existing datasets and provides an alternate distribution of hashtags on which to build and validate hashtag segmentation models. We show that the performance of SOTA models for hashtag segmentation drops substantially on the proposed dataset, indicating that it provides an alternate set of hashtags with which to train and assess models.
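To make the task definition concrete, the following is a minimal sketch of dictionary-based segmentation using dynamic programming. It is not the method of the paper or of the SOTA models it evaluates; the vocabulary, the function name `segment`, and the example hashtag are all hypothetical and chosen only for illustration.

```python
from functools import lru_cache

# Hypothetical toy vocabulary (an assumption for this sketch); a real system
# would rely on a large lexicon or a learned model instead.
VOCAB = {"we", "need", "a", "national", "park", "an", "at"}


def segment(hashtag, vocab=VOCAB):
    """Split a hashtag into the fewest vocabulary words, or return None."""
    text = hashtag.lstrip("#").lower()

    @lru_cache(maxsize=None)
    def best(start):
        # Best segmentation of text[start:], as a tuple of words.
        if start == len(text):
            return ()
        candidates = []
        for end in range(start + 1, len(text) + 1):
            word = text[start:end]
            if word in vocab:
                rest = best(end)
                if rest is not None:
                    candidates.append((word,) + rest)
        # Prefer the segmentation with the fewest tokens.
        return min(candidates, key=len) if candidates else None

    result = best(0)
    return list(result) if result is not None else None


print(segment("#WeNeedANationalPark"))
```

A lookup of this kind fails as soon as a hashtag contains a token outside the vocabulary, such as a transliterated word, a spelling variant, or a novel named entity, which is the kind of variation the abstract argues benchmark datasets should cover.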


