Tweet2Vec: Character-Based Distributed Representations for Social Media

05/11/2016
by   Bhuwan Dhingra, et al.
0

Text from social media provides a set of challenges that can cause traditional NLP approaches to fail. Informal language, spelling errors, abbreviations, and special characters are all commonplace in these posts, leading to a prohibitively large vocabulary size for word-level approaches. We propose a character composition model, tweet2vec, which finds vector-space representations of whole tweets by learning complex, non-local dependencies in character sequences. The proposed model outperforms a word-level baseline at predicting user-annotated hashtags associated with the posts, doing significantly better when the input contains many out-of-vocabulary words or unusual character sequences. Our tweet2vec encoder is publicly available.

READ FULL TEXT

page 1

page 2

page 3

page 4

research
12/11/2017

Social Media Writing Style Fingerprint

We present our approach for computer-aided social media text authorship ...
research
04/12/2019

Adapting Sequence to Sequence models for Text Normalization in Social Media

Social media offer an abundant source of valuable raw data, however info...
research
08/10/2016

Hierarchical Character-Word Models for Language Identification

Social media messages' brevity and unconventional spelling pose a challe...
research
10/24/2020

Char2Subword: Extending the Subword Embedding Space from Pre-trained Models Using Robust Character Compositionality

Byte-pair encoding (BPE) is a ubiquitous algorithm in the subword tokeni...
research
01/24/2019

Semantic Classification of Tabular Datasets via Character-Level Convolutional Neural Networks

A character-level convolutional neural network (CNN) motivated by applic...
research
05/15/2021

A Deep Metric Learning Approach to Account Linking

We consider the task of linking social media accounts that belong to the...
research
12/20/2021

Between words and characters: A Brief History of Open-Vocabulary Modeling and Tokenization in NLP

What are the units of text that we want to model? From bytes to multi-wo...

Please sign up or login with your details

Forgot password? Click here to reset