Wikipedia Reader Navigation: When Synthetic Data Is Enough

by Akhil Arora, et al.

Every day, millions of people read Wikipedia. When navigating the vast space of available topics using hyperlinks, readers trace trajectories on the article network. Understanding these navigation patterns is crucial to better serve readers' needs and to address structural biases and knowledge gaps. However, systematic studies of navigation on Wikipedia are hindered by a lack of publicly available data, a consequence of the commitment to protect readers' privacy by not storing or sharing potentially sensitive data. In this paper, we ask: How well can Wikipedia readers' navigation be approximated using publicly available resources, most notably the Wikipedia clickstream data? We systematically quantify the differences between real navigation sequences and synthetic sequences generated from the clickstream data, in 6 analyses across 8 Wikipedia language versions. Overall, we find that the differences between real and synthetic sequences are statistically significant, but with small effect sizes, often well below 10%. This constitutes quantitative evidence for the utility of the Wikipedia clickstream data as a public resource: clickstream data can closely capture reader navigation on Wikipedia and provides a sufficient approximation for most practical downstream applications relying on reader data. More broadly, this study provides an example of how clickstream-like data can enable research on user navigation on online platforms while protecting users' privacy.
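The abstract does not say how the synthetic sequences are generated. As a rough illustration only, one plausible approach is a first-order random walk whose transition probabilities are proportional to aggregated click counts. The sketch below assumes clickstream rows reduced to (source, target, count) triples; the toy rows and the function names `build_transitions` and `sample_sequence` are hypothetical and are not taken from the paper.

```python
import random
from collections import defaultdict


def build_transitions(clickstream_rows):
    """Group (source, target, count) rows into per-source target and weight lists."""
    transitions = defaultdict(lambda: ([], []))
    for source, target, count in clickstream_rows:
        targets, weights = transitions[source]
        targets.append(target)
        weights.append(count)
    return transitions


def sample_sequence(transitions, start, max_length=10, rng=None):
    """Sample one synthetic navigation sequence as a count-weighted random walk.

    Assumption: each step depends only on the current article (first-order
    Markov chain). The walk stops at max_length or when the current article
    has no recorded outgoing clicks.
    """
    rng = rng or random.Random()
    sequence = [start]
    current = start
    while len(sequence) < max_length and current in transitions:
        targets, weights = transitions[current]
        current = rng.choices(targets, weights=weights, k=1)[0]
        sequence.append(current)
    return sequence


# Hypothetical toy counts; real clickstream dumps contain millions of pairs.
rows = [("A", "B", 30), ("A", "C", 10), ("B", "C", 5)]
transitions = build_transitions(rows)
print(sample_sequence(transitions, "A", rng=random.Random(0)))
```

Because such a walk has no memory beyond the current article, it cannot reproduce session-level behavior such as backtracking, which is one reason a comparison against real navigation sequences is informative.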



A Large-Scale Characterization of How Readers Browse Wikipedia

Despite the importance and pervasiveness of Wikipedia as one of the larg...

Inspiration, Captivation, and Misdirection: Emergent Properties in Networks of Online Navigation

The World Wide Web (WWW) has fundamentally changed the ways billions of ...

Why We Read Wikipedia

Wikipedia is one of the most popular sites on the Web, with millions of ...

Why the World Reads Wikipedia: Beyond English Speakers

As one of the Web's primary multilingual knowledge sources, Wikipedia is...

Individual differences in knowledge network navigation

As online information accumulates at an unprecedented rate, it is becomi...

Wikipedia's Network Bias on Controversial Topics

The most important feature of Wikipedia is the presence of hyperlinks in...

Going Down the Rabbit Hole: Characterizing the Long Tail of Wikipedia Reading Sessions

"Wiki rabbit holes" are informally defined as navigation paths followed ...
