A Longitudinal Assessment of the Persistence of Twitter Datasets

09/26/2017
by Arkaitz Zubiaga, et al.

Sharing social media datasets comes with the caveat that they are not always fully replicable. To adhere to the requirements of platforms like Twitter, researchers cannot release the raw data; instead they release a list of unique identifiers, which others can then use to recollect the data from the platform themselves. As a result, subsets of the data may no longer be available, because content can be deleted or user accounts deactivated. To quantify the impact of content deletion on the long-term validity of datasets, we perform a longitudinal analysis of the persistence of 30 Twitter datasets, which together include over 147 million tweets. With the original datasets collected between 2012 and 2016 and recollected later using the tweet IDs, we examine four factors that quantify the extent to which recollected datasets resemble the original ones: completeness, representativity, similarity and changingness. Even though the ratio of available tweets keeps decreasing as a dataset gets older, we find that the textual content of the recollected subset is still largely representative of the whole dataset that was originally collected. The representativity of the metadata, however, keeps decreasing over time, both because the dataset shrinks and because certain metadata, such as users' follower counts, keep changing. Our study has important implications for researchers who share and use publicly shared Twitter datasets in their research.
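
To make the idea of comparing a recollected dataset against the original more concrete, the following is a minimal sketch, not the authors' implementation. It assumes completeness is the share of original tweet IDs that can still be retrieved, which matches the "ratio of available tweets" mentioned above. The textual comparison uses cosine similarity over word counts purely as an illustrative stand-in; the abstract does not define how representativity or similarity are measured. All function names and the toy tweets are hypothetical.

```python
from collections import Counter
from math import sqrt


def completeness(original_ids, recollected_ids):
    """Fraction of the original tweet IDs still present after recollection."""
    original = set(original_ids)
    if not original:
        raise ValueError("original dataset is empty")
    return len(original & set(recollected_ids)) / len(original)


def term_frequencies(texts):
    """Count whitespace-separated, lowercased tokens across all texts."""
    counts = Counter()
    for text in texts:
        counts.update(text.lower().split())
    return counts


def cosine_similarity(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical toy data: tweet ID -> text at original collection time.
original = {
    1: "storm hits the coast",
    2: "storm warning issued",
    3: "coast road closed",
    4: "storm damage reported",
}
# Assume tweet 3 was deleted before the recollection.
recollected = {tid: text for tid, text in original.items() if tid != 3}

ratio = completeness(original.keys(), recollected.keys())
text_sim = cosine_similarity(
    term_frequencies(original.values()),
    term_frequencies(recollected.values()),
)
print(f"completeness: {ratio:.2f}")
print(f"textual similarity (illustrative proxy): {text_sim:.2f}")
```

In this toy setting the recollected set loses a quarter of its tweets while its word distribution stays close to the original, which is the kind of gap between completeness and textual representativity that the abstract describes.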


