Data Sets: Word Embeddings Learned from Tweets and General Data

08/14/2017
by Quanzhi Li, et al.

A word embedding is a low-dimensional, dense and real-valued vector representation of a word. Word embeddings have been used in many NLP tasks. They are usually generated from a large text corpus. The embedding of a word captures both its syntactic and semantic aspects. Tweets are short and noisy, and they have unique lexical and semantic features that differ from other types of text. Therefore, it is necessary to have word embeddings learned specifically from tweets. In this paper, we present ten word embedding data sets. In addition to the data sets learned from tweet data alone, we also built embedding sets from general data and from the combination of tweets with general data. The general data consist of news articles, Wikipedia data and other web data. These ten embedding models were learned from about 400 million tweets and 7 billion words of general text. We also present two experiments demonstrating how to use the data sets in NLP tasks such as tweet sentiment analysis and tweet topic classification.
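As a rough illustration of how pre-trained embeddings like these might feed a tweet-level task, the sketch below loads vectors from a word2vec-style text layout and averages them into one vector per tweet. The file layout, the toy vectors, and the helper names (`load_embeddings`, `tweet_vector`, `cosine`) are all assumptions made for this example. Averaging is a common baseline and is not claimed to be the method used in the paper's experiments.

```python
import io
import math

# Toy embeddings in a word2vec-style text layout (one word per line, followed
# by its vector components). Both the layout and the values are invented for
# illustration; they are not taken from the released data sets.
TOY_EMBEDDINGS = """\
good 0.9 0.1 0.0
great 0.8 0.2 0.1
bad -0.9 0.1 0.0
movie 0.0 0.9 0.3
"""


def load_embeddings(handle):
    """Parse 'word v1 v2 ...' lines into a dict of word -> list of floats."""
    table = {}
    for line in handle:
        parts = line.split()
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        table[parts[0]] = [float(x) for x in parts[1:]]
    return table


def tweet_vector(tokens, table):
    """Average the vectors of known tokens; return None if none are known."""
    known = [table[t] for t in tokens if t in table]
    if not known:
        return None
    dim = len(known[0])
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


embeddings = load_embeddings(io.StringIO(TOY_EMBEDDINGS))
positive = tweet_vector("great movie".split(), embeddings)
negative = tweet_vector("bad movie".split(), embeddings)
reference = embeddings["good"]

# With these toy values, the "great movie" vector sits closer to "good".
print(cosine(positive, reference) > cosine(negative, reference))
```

In practice the averaged vector would typically be passed to a classifier as a feature vector. Out-of-vocabulary tokens are simply skipped here, which matters for noisy tweet text.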

