Wikipedia has become one of the largest platforms for open knowledge in the world with more than 56M articles in 300+ languages and approximately 250K new articles added each month. However, the knowledge is not just the sum of the individual articles. In fact, articles in Wikipedia are connected via an extensive network of hyperlinks; e.g., English Wikipedia’s roughly 6M articles are connected by more than 400M hyperlinks (42). These hyperlinks allow readers to navigate from one page to another, find related information, or discover new topics.
But how do readers use this network to navigate the information contained in Wikipedia? Insights into patterns of navigation are extremely important for identifying content that is missing or hard-to-find in order to address knowledge gaps (77). For example, the gender gap in Wikipedia, which is commonly associated with an under-representation of biographies on women (50), is also reflected in structural properties such as lower centrality and visibility in the network (27; 37). This in turn has been shown to lead to biases in the navigation of readers (41), adding to observations on gender disparities in Wikipedia readership (34). A better understanding of navigation in this context could also help empower readers to find content more easily, or to organize lists of articles when learning about a specific topic in the form of a curriculum (54).
The phenomenon of navigating through different topics when browsing Wikipedia has been colloquially described as a “rabbit hole” (74; 68). However, in quantitative terms, our understanding of the learning pathways that readers are actually following is still extremely limited. There have been extensive studies of navigation on Wikipedia in game-like settings (e.g. Wikigame or Wikispeedia), where the aim is to find paths between two given articles (71). However, results from targeted navigation cannot be expected to apply in the same way to “navigation in the wild” (48).
One of the main limitations in addressing these questions is the lack of publicly available data. The raw webrequest logs (23), which capture all requests to Wikimedia’s servers (including IP addresses), are considered extremely sensitive. The Wikimedia Foundation’s commitment to protecting readers’ privacy (19) requires this data to remain private and, in fact, to be completely purged after 90 days. In an effort to accommodate research interests, anonymized and/or aggregated derived datasets are kept for longer time spans and shared publicly. For example, the pageview dumps (22) provide time series of the number of pageviews per article per hour. These provide a proxy for article popularity (see, e.g., the pageviews tool (20)) as well as collective attention more generally (51).
In the same vein, the clickstream dataset (15) has been released as a public resource capturing some aspects of navigation on Wikipedia. It contains the counts of source–target pairs of articles (i.e., how often a link from one page to another was clicked by readers) in a given month. It has been used not only to study the popularity of links (14) but also to generate synthetic reading pathways for inferring properties of navigation patterns (52).
Here, for the first time, we address the question of how much of the complexity of reader navigation on Wikipedia can be captured by the publicly available clickstream data. More precisely, we systematically compare ensembles of real and synthetically generated navigation sequences via two studies empirically characterizing the sequences and four downstream tasks across eight different languages. We find that differences are statistically significant but absolute effect sizes are small, establishing quantitative confidence in the utility of clickstream data to study navigation more generally.
2. Related Work
In this section, we review works studying readers on Wikipedia and their navigation, as well as navigation on the web more generally.
Wikipedia readers’ pageviews. Wikipedia’s publicly available pageview data (16) (i.e., how often individual pages are viewed) serves as a common proxy for measuring collective attention to, e.g., news or popular topics (24; 51). Other use cases involve detecting misalignment between the supply of and demand for content by comparing pageviews to quality ratings (70). Furthermore, individual studies have measured the dwell time, i.e., how much time readers actually spend on a given page, finding substantial differences across geographic regions (67). More generally, Shaw and Hargittai (59) propose a model for online participation in the form of a pipeline, where being a reader is one of many steps for a user to become an active contributor. Thus, readers have been considered a core dimension in recent efforts to systematically approach knowledge gaps in Wikipedia (50).
Navigation on Wikipedia. Only a few studies have characterized the navigation of readers on Wikipedia, since data from webrequest logs is not public. Halfaker et al. (31) analyzed the distribution of inter-activity times (i.e., the time between two pageviews by the same user) to determine a threshold when constructing reading sessions. Lehmann et al. (38) identify four different types of reading patterns by performing a k-means clustering of reading sessions obtained (with consent) from anonymized activity log data via the Yahoo toolbar. Lydon-Staley et al. (40) found different types of curiosity-seeking behavior when looking at the knowledge networks created from the browsing sessions of volunteers. Two recent studies investigated why readers visit Wikipedia (62; 39) and correlated survey responses on motivation, information need, and prior knowledge with features of reading sessions such as session length. Paranjape et al. (44) showed that navigation traces provide a strong signal to predict new useful links among Wikipedia articles. Furthermore, reading sessions can be used to construct article embeddings, where similarity captures articles that are read in close succession (76), which have been used for building alternative visual representations of the semantic space of Wikipedia (58).
Applications of clickstream data. Since data from Wikipedia’s webrequest logs is not available, several studies have used the clickstream dataset as a proxy to characterize navigation. While the clickstream contains only aggregate counts of the number of times a given link from a source to a target page was clicked, these studies already provide important insights into readers’ behavior. While generally only a small fraction of existing links is used, it was found that the popularity of links depends strongly on properties of the article (e.g., in-/out-degree, centrality) or the placement of links within the article (e.g., whether they occur in the lead section) (14; 36). Comparing the incoming and outgoing clicks of a page allows for a classification of pages based on usage, e.g., into sources/sinks or bottlenecks (26; 12). Other approaches used the clickstream data to assess whether some articles should be read before others when learning about a specific topic (54), to detect structural biases in content (41), to extract semantic relationships between articles (11), as a ground truth for other tasks (56; 9), or to study how the structure of the page influences the links clicked by readers (13). Finally, Rodi et al. (52) characterized search strategies in synthetic data generated from the clickstream, albeit under strong assumptions about the underlying process of navigation.
Targeted navigation. The most detailed studies of navigation on Wikipedia come from experiments such as Wikigame (7) or Wikispeedia (73; 72), where participants are asked to find a path between two given articles using the hyperlinks. Several studies developed models to assess the order of the Markov process (61; 47) or to test other hypotheses about navigation, such as decentralized search (32; 60). Empirical approaches aimed at characterizing different properties of the navigation traces, such as the evolution of step sizes (71) or why users stop (55). However, conclusions from studies on targeted navigation do not apply in the same way to “navigation in the wild” (48).
Navigation in general. Understanding navigation on the web and online platforms has been a long-standing problem (5). Some of the common approaches include, e.g., identification of specific patterns (66; 2), development of generative models (6; 25), or the characterization of the overall predictability (35).
3. Problem and Notation
In this section, we formalize the problem of assessing the utility of the public Wikipedia clickstream data (15) to characterize the navigation of readers in Wikipedia.
We start from the set of real navigation sequences $\mathcal{R} = \{s^{(1)}, s^{(2)}, \ldots\}$, where each $s^{(i)} = (a^{(i)}_1, \ldots, a^{(i)}_{\ell_i})$ contains the sequence of pageviews by the same user (indexed by $i$), with $\ell_i$ being the length (or the number of pageviews) of the sequence $s^{(i)}$. Next, we generate a set of synthetic navigation sequences $\mathcal{S} = \{\tilde{s}^{(1)}, \tilde{s}^{(2)}, \ldots\}$ with similar properties based on the clickstream data, where each $\tilde{s}^{(i)} = (\tilde{a}^{(i)}_1, \ldots, \tilde{a}^{(i)}_{\tilde{\ell}_i})$ consists of pageviews generated via a random walk on the network of Wikipedia articles.
Given the sets $\mathcal{R}$ and $\mathcal{S}$ of navigation sequences, we conduct multiple analyses and experiments: i) summary statistics derived from an empirical characterization of the sequences (cf. Sec. 5), and ii) performance metrics in downstream tasks (cf. Sec. 6), and we leverage these observables to quantify the difference between $\mathcal{R}$ and $\mathcal{S}$.
4. Data and Resources
In this section, we describe in detail the datasets used for the analysis of reader navigation in Wikipedia. We consider 8 different language versions of Wikipedia for which clickstream data is available, namely English (EN), Japanese (JA), German (DE), Russian (RU), French (FR), Italian (IT), Polish (PL), and Persian (FA). This choice additionally covers languages from different families. The dataset statistics are presented in Table S1. All the publicly accessible resources required to reproduce the experiments in this paper are available at https://github.com/epfl-dlab/wikinav-approx.
Next, we describe the procedure for constructing the five different types (cf. Table 1) of navigation sequences analyzed in this work, as well as all auxiliary resources.
4.1. Real navigation sequences
We construct navigation sequences for readers of Wikipedia articles using Wikipedia’s webrequest server logs from March 2021. We only consider internal navigation, i.e., navigation to/from websites external to Wikipedia is ignored.
Webrequest logs (23). The webrequest logs contain an entry for each HTTP request to Wikimedia servers, specifying information, including, but not limited to, timestamp, requested URL, referrer URL, client IP address, and user agent information.
Constructing navigation sequences. Real navigation sequences are constructed following the approach proposed by Paranjape et al. (44). As a first step, we construct navigation sessions for each Wikipedia reader. Note that a reader may visit multiple pages from a given page by opening different browser tabs, and thus navigation sessions are innately trees, not chains. We construct navigation trees by stitching together pageviews using the referrer information. From this, we obtain navigation sequences by sampling, uniformly at random, one root-to-leaf path from each navigation tree with at least 2 nodes. Note that we do not consider all root-to-leaf paths in a tree, as that would over-represent navigation sessions with more complex trees compared to those with simple trees. We observed that most sequences are short, although long sequences do exist.
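The tree-to-sequence step above can be sketched as follows. This is a minimal illustration with a toy session, not the paper’s code; the tree format (a dict mapping an article to the articles opened from it, stitched via the referrer) and the function names are our own.

```python
import random

def root_to_leaf_paths(tree, root):
    """Enumerate all root-to-leaf paths in a navigation tree.

    `tree` maps an article to the list of articles opened from it
    (children stitched together via the HTTP referrer)."""
    children = tree.get(root, [])
    if not children:                      # leaf: the path ends here
        return [[root]]
    paths = []
    for child in children:
        for sub in root_to_leaf_paths(tree, child):
            paths.append([root] + sub)
    return paths

def sample_sequence(tree, root, rng=random):
    """Sample one root-to-leaf path uniformly at random, so that sessions
    with bushier trees are not over-represented."""
    paths = root_to_leaf_paths(tree, root)
    return rng.choice(paths)

# Toy session: the reader opened two tabs from "Cat".
tree = {"Cat": ["Felidae", "Lion"], "Lion": ["Africa"]}
paths = root_to_leaf_paths(tree, "Cat")
```

Sampling one path per tree (rather than enumerating all of them) is what keeps the sequence ensemble unbiased with respect to tree complexity.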
4.2. Synthetic navigation sequences
We construct synthetic navigation sequences using the Wikipedia clickstream and the hyperlink graph from March 2021.
Wikipedia hyperlink graph. We construct the directed hyperlink graph from the links contained in the main text of Wikipedia articles, available via the XML dumps (18) released on a monthly basis. We perform link extraction based on the approach described in (8). To align the article versions with the webrequest logs and the clickstream data from March 2021, we use the dumps dated ‘2021-04-01’.
Clickstream dataset. The Wikipedia clickstream (75) contains the counts of links (i.e., pairs of source and target pages) clicked by Wikipedia readers. The data is generated from the webrequest logs and published on a monthly basis as dump files (15) for different language versions of Wikipedia. Several pre-processing steps and filters are applied; most importantly, links with k or fewer observations are removed (k-anonymity) to protect the privacy of readers.
| Dataset | Type | Description |
|---|---|---|
| Logs | Real | Human navigation on Wikipedia. |
| Clickstream-Priv | Synthetic | Markov-1, biased random walks using private clickstream. |
| Clickstream-Pub | Synthetic | Markov-1, biased random walks using public clickstream. |
| Clickstream-Pub (I) | Synthetic | Markov-1, biased random walks using public clickstream, with a different intrinsic stopping criterion (52). |
| Graph | Synthetic | Markov-1, unbiased random walks on Wikipedia hyperlink graph. |
Generating Markovian navigation sequences. We generate four different types of synthetic navigation sequences that closely mimic the overall statistics of the real sequences by ensuring: i) that each generated sequence has the same starting article and length as its corresponding real sequence (if the random walker reaches an isolated node prior to reaching the desired sequence length, we backtrack and restart the walk from the parent; this step is repeated until the desired length is reached), and ii) that the number of generated sequences equals the total number of real sequences. With these constraints in place, we generate sequences by performing random walks with the following three different transition probabilities between articles in the Wikipedia hyperlink graph:
Clickstream-Priv: weights proportional to the number of clicks between two articles obtained from the webrequest logs. This is similar to Clickstream-Pub, except that it also includes pairs of articles with k or fewer observations.
Clickstream-Pub: weights proportional to the number of clicks between two articles obtained from the publicly available clickstream data.
Graph: weights are uniform over all outgoing links from the Wikipedia hyperlink graph.
Note that Clickstream-Pub (I) uses the same transition probabilities as Clickstream-Pub. The only difference is that the former not only decides the next step but also decides whether to stop at a given node based on the pairwise transition probabilities (52) (“intrinsic stopping”). Thus, unlike the other strategies, which use extrinsic stopping, the length constraint is not enforced for navigation sequences obtained using Clickstream-Pub (I). Additionally, note that the state space of the random walker is all articles in the Wikipedia hyperlink graph, not just the ones observed in the webrequest logs or clickstream. While the unobserved articles would never be visited by the weighted random walkers, the unbiased random walker can visit them.
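The weighted random walk with length matching and backtracking can be sketched as follows. This is a toy illustration under our own assumptions: the clickstream is a nested dict of per-source click counts, and the function names are ours; swapping the counts for uniform weights would give the Graph walker.

```python
import random

def weighted_step(clicks, article, rng):
    """Choose the next article proportionally to clickstream click counts
    (Clickstream-Pub/Priv); returns None at a dead end."""
    targets = clicks.get(article)
    if not targets:
        return None                       # no outgoing clicks recorded
    pages, counts = zip(*targets.items())
    return rng.choices(pages, weights=counts, k=1)[0]

def generate_sequence(clicks, start, length, rng=random):
    """Random walk matching the start article and length of a real
    sequence; on a dead end we backtrack and retry from the parent."""
    seq = [start]
    while len(seq) < length:
        nxt = weighted_step(clicks, seq[-1], rng)
        if nxt is None:
            if len(seq) == 1:
                break                     # nowhere to backtrack to
            seq.pop()                     # backtrack to the parent
        else:
            seq.append(nxt)
    return seq

# Toy clickstream: "C" is a dead end, so walks through it backtrack.
clicks = {"A": {"B": 9, "C": 1}, "B": {"A": 5}}
seq = generate_sequence(clicks, "A", 4, random.Random(0))
```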
4.3. Auxiliary resources
For the empirical analysis and downstream tasks we use multiple additional resources, which are described as follows.
Semantic embedding. The semantic embeddings are learned from the textual content of Wikipedia articles and thus capture inter-article semantic similarity. The representation of each article in a given language is obtained by averaging the representations of the words appearing in the first paragraph of the article, using pre-trained FastText (28) word embeddings in the respective language (3). As in Sec. 4.2, we use the XML dump (18) dated ‘2021-04-01’. We use this resource for the semantic diffusion analysis in Sec. 5.2.
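The averaging step can be sketched as follows. We use a toy two-dimensional vector table in place of the pre-trained FastText vectors, and the function name is ours.

```python
def article_embedding(first_paragraph, word_vectors):
    """Average the vectors of known words in an article's first paragraph;
    returns None if no word is covered by the vector table."""
    words = [w for w in first_paragraph.lower().split() if w in word_vectors]
    if not words:
        return None
    dim = len(next(iter(word_vectors.values())))
    vec = [0.0] * dim
    for w in words:
        for i, x in enumerate(word_vectors[w]):
            vec[i] += x
    return [x / len(words) for x in vec]

# Toy vector table standing in for pre-trained FastText embeddings.
word_vectors = {"cat": [1.0, 0.0], "animal": [0.0, 1.0]}
emb = article_embedding("The cat is an animal", word_vectors)  # -> [0.5, 0.5]
```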
Navigation embedding. As the name implies, the navigation embeddings are learned from the navigation sequences described in Secs. 4.1 and 4.2 and thus capture semantic relatedness between the concepts that different articles represent. Analogous to text-based embeddings, the navigation sequences (sentences) are ordered collections of pageviews of articles (words). Following convention in the network representation learning literature (46), we train article embeddings on each of the real and synthetic navigation sequence datasets (restricted to sequences of at least 2 pageviews) using FastText with the default hyperparameter settings prescribed in (4). We refrain from hyperparameter tuning, since i) this captures the typical use case, and ii) we are not interested in the absolute performance but only in the relative difference between the embeddings generated from the respective datasets. We use this resource for the semantic similarity/relatedness (Sec. 6.3) and topic classification (Sec. 6.4) tasks.
Added-links data. This data consists of new links added to Wikipedia in April 2021. Following the procedure described in Paranjape et al. (44), we obtain the added links by computing the set difference of the links existing in Wikipedia in April and March 2021, respectively. Next, we restrict ourselves to added links for which indirect paths between source and target were observed in the real navigation sequences. We denote this set of links as positive examples. We also identify a set of negative examples, i.e., links that were not added in the corresponding period. Please see Appendix A.3 for details on this construction. This dataset is used for the link prediction task in Sec. 6.2.
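The core of this construction is a set difference between two monthly link snapshots, as in this toy sketch (the snapshot variables and the simple negative-sampling rule are ours for illustration):

```python
# Hyperlink snapshots as sets of (source, target) pairs.
links_march = {("A", "B"), ("A", "C"), ("B", "C")}
links_april = {("A", "B"), ("A", "C"), ("B", "C"), ("C", "A")}

# Positive examples: links present in April but not in March.
added = links_april - links_march

# Negative examples: ordered page pairs never linked in either month.
pages = {p for link in links_april | links_march for p in link}
negatives = {(s, t) for s in pages for t in pages
             if s != t and (s, t) not in links_april | links_march}
```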
Similarity/Relatedness data. We use the WikiSRS data (65) as a ground truth for capturing the relationships between articles. It contains similarity and relatedness judgments for 688 pairs of Wikipedia entities (people, places, and organizations), each assigned by 5 different Amazon Mechanical Turk workers. We map the entities to the corresponding language version of Wikipedia using the sitelinks of the respective Wikidata item. As a result, a pair is not available for evaluation in a given language if at least one of its articles does not exist in that language. We use this dataset for evaluating the navigation-sequence-based article representations through the semantic relatedness task in Sec. 6.3.
Topic-labels data. As a ground truth for topic assignments of Wikipedia articles, we use the dataset described by Johnson et al. (33). A Wikipedia article can belong to one or more of 64 topics (such as ‘Mathematics’, ‘Entertainment’, ‘Politics and Government’, etc.). The annotations come from editors interested in specific topics, who are organized in so-called WikiProjects and manually label relevant articles. The WikiProject labels are aggregated into the set of 64 topics using a taxonomy (30) derived from a hierarchy developed by the editor community (21). We extract WikiProject labels for articles in English Wikipedia (almost all articles have at least one label). For articles in other languages, we apply the labels of the corresponding article in English. This dataset is used in the topic classification task described in Sec. 6.4.
5. Empirical Characterization
In this section, we empirically characterize two key aspects of real navigation sequences that help us to distinguish them from synthetic navigation sequences. By construction, the synthetic sequences follow a Markov process of order 1, i.e., when generating a synthetic sequence, the probability of the next article depends only on the current article and not on any of the previously visited articles. Therefore, our aim is to quantify the degree to which previously visited articles affect the probability of the next article in real navigation sequences; this differs from model selection approaches that fit Markov models of fixed order (6; 61; 47). First, we quantify the mixing of incoming and outgoing navigation flows when passing through a given article using information-theoretic measures (Sec. 5.1). Second, we quantify how much faster synthetic navigation sequences diffuse in an abstracted semantic space using an article embedding (Sec. 5.2).
5.1. Mixing of flows
We assess the degree to which the flow of real navigation sequences mixes when passing through a specific article (53). For navigation sequences passing through a given article $a$, we quantify the relation between the incoming traffic (source $s$) and the outgoing traffic (target $t$). By construction, the synthetic navigation sequences exhibit full mixing, since their incoming and outgoing traffic are completely uncorrelated.
We calculate the adjusted mutual information (AMI) (69). For navigation sequences passing through $a$, the mutual information (MI) measures the average amount of information (in bits) about the target page $t$ gained from knowing the source page $s$. The advantage of the AMI is that it is normalized between 0 (full mixing, no information about $t$) and 1 (no mixing, full information about $t$) and corrects for non-zero values of the MI arising from finite-size effects. Please see Appendix A.4 for the mathematical formulation of the AMI.
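The underlying MI can be sketched as follows on toy source–target pairs; the AMI additionally corrects this value for chance agreement (Appendix A.4), which we omit here for brevity.

```python
import math
from collections import Counter

def mutual_information(pairs):
    """MI (in bits) between source and target over the (s, t) pairs of
    sequences passing through a fixed article."""
    n = len(pairs)
    p_st = Counter(pairs)                     # joint counts
    p_s = Counter(s for s, _ in pairs)        # source marginals
    p_t = Counter(t for _, t in pairs)        # target marginals
    mi = 0.0
    for (s, t), c in p_st.items():
        mi += (c / n) * math.log2(c * n / (p_s[s] * p_t[t]))
    return mi

# Full mixing: the target is independent of the source -> MI = 0.
mixed = [("s1", "t1"), ("s1", "t2"), ("s2", "t1"), ("s2", "t2")]
# No mixing: the source determines the target -> MI = 1 bit here.
unmixed = [("s1", "t1"), ("s1", "t1"), ("s2", "t2"), ("s2", "t2")]
```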
Fig. 1 demonstrates that most pages show a very high degree of mixing: the distribution of AMI values is peaked around zero with an exponential decay and no further mode; only a small fraction of pages exhibit large AMI values. To aid interpretation, Fig. 1 also shows two visual examples of pages with low and high AMI, respectively. Qualitatively, we see that for low AMI the flows overlap substantially, such that starting from a given source $s$ (left), one can end up in virtually any of the targets $t$ (right) with similar probability when passing through $a$ (middle). In contrast, for high AMI the flows virtually do not overlap, such that choosing the source $s$ strongly determines the target $t$. We find a similar pattern for all 8 languages. We find no statistically significant Spearman rank correlation between the number of observations (i.e., the number of sequences passing through an article) and the value of the AMI.
5.2. Diffusion in semantic space
Next, we assess whether a similar effect holds for sequences more generally, not just for groups of sequences passing through a specific article. For this, we quantify the diffusion of navigation sequences in a low-dimensional vector space obtained from the text of the articles, using the Semantic embedding described in detail in Sec. 4.3, which captures how semantically similar articles are. We then map each navigation sequence into this space and measure the cosine distance between the starting article and the article reached after $k$ steps (Fig. 2). As the step size $k$ increases, the average distance increases for the synthetic clickstream sequences but stays close to the value observed for the real navigation sequences. As stated in Sec. 1, while all the differences are statistically significant (measured using bootstrapped 95% confidence intervals), the effect sizes are small. Specifically, the difference between Logs and Clickstream-Pub remains small except at the largest step sizes, where we observe larger effect sizes; we note, however, that such long sequences are very rare. Additionally, these findings hold across different languages. Moving beyond averages, we also analyze the entire distribution of semantic distances in Fig. S1. Please see Appendix B for details.
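The diffusion measurement can be sketched as follows, using toy 2-d embeddings in place of the Semantic embedding; the helper names are ours.

```python
import math

def cosine_distance(u, v):
    """1 - cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm

def mean_distance_at_step(sequences, emb, k):
    """Average distance between the start article and the article reached
    after k steps, over all sequences long enough; None if none qualify."""
    dists = [cosine_distance(emb[s[0]], emb[s[k]])
             for s in sequences if len(s) > k]
    return sum(dists) / len(dists) if dists else None

# Toy embedding space and two navigation sequences starting at "A".
emb = {"A": [1.0, 0.0], "B": [1.0, 1.0], "C": [0.0, 1.0]}
seqs = [["A", "B", "C"], ["A", "B"]]
```

Plotting `mean_distance_at_step` against `k` for each sequence ensemble gives the diffusion curves compared in Fig. 2.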
6. Downstream Tasks
The empirical observations in the previous section suggest that synthetic navigation sequences from clickstream data are very similar to real navigation sequences. In this section, we investigate the implications of this finding for practical downstream applications. For this, we consider four tasks that are relevant in the context of reader navigation or for which navigation sequences have been shown to provide useful signals: predicting the next article in the navigation sequence (Sec. 6.1), predicting new links to be added (Sec. 6.2), inferring semantic similarity/relatedness between entities (Sec. 6.3), and classifying articles into topics (Sec. 6.4). In each case, the goal is not to propose new methods that achieve state-of-the-art performance on these tasks. Rather, we are interested in assessing the relative difference in performance when using synthetic instead of real navigation sequences.
6.1. Next-article prediction
We predict the next article in the navigation sequence following the typical setup in sequence-based recommendation tasks (49). We train a model on each of the different datasets (real and synthetic navigation sequences) and evaluate its ability to predict the next article in real navigation sequences. The main idea is that differences between training and test data should be reflected in a decrease in the prediction score.
Specifically, we use triples of consecutively read articles, $(a_1, a_2, a_3)$, sampled uniformly from the navigation sequences. For training, we select a fixed fraction of the triples and train a separate Markov chain model of order 2 on each of the 4 datasets. For the evaluation, we select disjoint sets of triples for the validation and test sets from the real navigation sequences in all four cases. For each triple in the test set, we use the trained model to compute a ranked list of the candidate targets linked from $a_2$ according to their predicted likelihoods, and then determine the rank of the true target $a_3$. We evaluate the predictions using the mean reciprocal rank (MRR); we also measured Recall@k, which yielded similar results (not shown).
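The order-2 model and the MRR evaluation can be sketched as follows. For brevity, this toy version ranks only the continuations observed during training rather than all targets linked from $a_2$; the function names are ours.

```python
from collections import Counter, defaultdict

def train_markov2(triples):
    """Count continuations a3 for each context (a1, a2)."""
    model = defaultdict(Counter)
    for a1, a2, a3 in triples:
        model[(a1, a2)][a3] += 1
    return model

def reciprocal_rank(model, a1, a2, true_target):
    """1/rank of the true target in the model's ranked candidate list."""
    ranked = [t for t, _ in model[(a1, a2)].most_common()]
    return 1.0 / (ranked.index(true_target) + 1) if true_target in ranked else 0.0

# Toy training data: after (A, B), C is twice as likely as D.
train = [("A", "B", "C"), ("A", "B", "C"), ("A", "B", "D")]
model = train_markov2(train)
mrr = (reciprocal_rank(model, "A", "B", "C")
       + reciprocal_rank(model, "A", "B", "D")) / 2  # -> 0.75
```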
In Table (a) we observe that, as expected, training the model on synthetic sequences leads to a decrease in the prediction score. While the performance using synthetic sequences from the unbiased random walk is extremely poor, synthetic sequences from the clickstream yield an MRR within a small relative difference of that obtained with the real navigation sequences. Interestingly, for languages with less data (in terms of the number of navigation sequences), the performance of the public clickstream becomes substantially worse than that of the private clickstream. This suggests that k-anonymity (only links with more than k clicks are included in Clickstream-Pub) plays a larger role than the restriction to first-order transitions. This is further corroborated by looking at a set of filtered queries, where we prune all queries that lack observations in the training set for any one of the four types of navigation sequences. In addition to yielding overall higher (almost double) values of MRR, the difference between private and public clickstream virtually disappears.
6.2. Link prediction
We predict new links in the Wikipedia graph from a source page $s$ to a target page $t$ following the approach by Paranjape et al. (44), who showed that navigation sequences contain a useful signal for the prediction of new links in Wikipedia. From the Added-links data we obtain a labeled dataset of links consisting of positive examples (links that were added) and negative examples (links that were not added). Using the real and synthetic navigation sequences, respectively, we calculate the path proportion for each link as $pp(s,t) = N_{s \to t} / N_s$, where $N_{s \to t}$ is the number of sequences from $s$ to $t$ and $N_s$ is the number of sequences starting in $s$. We rank all links by $pp(s,t)$ in descending order and calculate the precision@k, i.e., the fraction of the top-k ranked links that correspond to positive examples.
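The ranking and evaluation above reduce to a few lines; the counts below are toy values, not from the data.

```python
def path_proportion(n_st, n_s):
    """pp(s, t) = N_{s->t} / N_s."""
    return n_st / n_s

def precision_at_k(ranked_links, positives, k):
    """Fraction of the top-k ranked links that are positive examples."""
    return sum(1 for link in ranked_links[:k] if link in positives) / k

# Candidate links scored by path proportion, then ranked descending.
scores = {("A", "B"): path_proportion(8, 10),
          ("A", "C"): path_proportion(1, 10),
          ("D", "E"): path_proportion(5, 20)}
ranked = sorted(scores, key=scores.get, reverse=True)
positives = {("A", "B"), ("D", "E")}
```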
In Fig. 3 we see that precision@k decreases with increasing k in most cases. The only exception are the predictions from the synthetic sequences based on the unbiased random walk, which yield poor performance. More importantly, the performance of the synthetic sequences based on the clickstream is comparable with that of the real navigation sequences for all values of k, with a small relative difference in most cases. Interestingly, for small k, Clickstream-Pub performs slightly better than Clickstream-Priv, possibly because the k-anonymity threshold filters out negative low-probability pairs in the former.
6.3. Semantic similarity/relatedness
To assess how well navigation sequences capture semantic similarity and relatedness, we first generate an embedding of articles from the navigation sequences for the real and the synthetic datasets, respectively (Navigation embedding, Sec. 4.3). We then compare the cosine similarity between articles in this representation with a ground-truth dataset (Similarity/Relatedness data) containing human-annotated ratings of similarity and relatedness for a selected set of pairs. Specifically, we calculate Spearman’s rank correlation $\rho$ between the two ranked lists of pairs of articles.
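The evaluation boils down to a rank correlation between human ratings and model similarities. A minimal sketch (Pearson correlation of ranks, ignoring ties; real evaluations would use a tie-aware implementation such as `scipy.stats.spearmanr`):

```python
def spearman_rho(x, y):
    """Spearman's rho as the Pearson correlation of ranks.
    Assumes distinct values (no tie handling)."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0] * len(v)
        for rank, i in enumerate(order):
            r[i] = rank
        return r
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

human = [9.0, 7.5, 2.0, 4.0]   # toy annotator relatedness judgments
model = [0.9, 0.6, 0.1, 0.3]   # toy cosine similarities of the embeddings
rho = spearman_rho(human, model)  # identical rankings -> 1.0
```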
In Table (b) we see that the synthetic sequences from Clickstream-Priv yield values of $\rho$ that are comparable with the ones obtained from the real navigation sequences. In contrast, the values of $\rho$ from the synthetic sequences from Clickstream-Pub are substantially lower, though the relative difference is still small. As with next-article prediction, this suggests that k-anonymity (only links with more than k clicks are included in Clickstream-Pub) plays a larger role than the restriction to first-order transitions. This is further corroborated by the results of the synthetic navigation sequences from the unbiased random walk, which yield values of $\rho$ similar to, or sometimes better than, the real navigation sequences. We note that this finding is consistent with previous findings reported in (63; 11) and that the reported values of $\rho$ are on par with the state-of-the-art results from the study introducing the ground-truth dataset (43). One interpretation of these observations is that many of the existing links, even if rarely or never used, are crucial for capturing the complexity of the relationships between entities when learning representations.
6.4. Topic classification
We predict topics of articles in a supervised classification task using the embeddings obtained from the different navigation sequences. We follow the approach described in (1; 33), which used embeddings generated from the content of the articles (text or links) as features to predict labels from topic annotations of editors.
We use the Navigation embeddings as features, i.e., the embeddings generated from the real and synthetic navigation sequences, respectively. We perform supervised classification of the topic labels from the Topic-labels data, where an article can have one or more of the 64 different topic labels. For each topic label we train an independent binary classifier using logistic regression, for which we use the publicly available implementation in Scikit-learn (45) with the prescribed default parameters. We split the articles into training, validation, and test sets, and report micro- and macro-F1 statistics. We explored other statistics (precision and recall) as well as the effect of hyperparameters on the resulting performance, specifically the regularization parameter and the use of reciprocal class frequencies as weights in the loss computation to manage class imbalance. The results showed similar trends and are therefore omitted.
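The micro/macro-F1 statistics for this multi-label setting can be sketched as follows; in practice the predicted label sets would come from the per-topic logistic regression classifiers, and the toy labels below are ours.

```python
def f1(tp, fp, fn):
    """F1 from true-positive, false-positive, and false-negative counts."""
    return 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0

def micro_macro_f1(y_true, y_pred, labels):
    """y_true/y_pred: list of label sets, one per article."""
    per_label, totals = [], [0, 0, 0]
    for lab in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if lab in t and lab in p)
        fp = sum(1 for t, p in zip(y_true, y_pred) if lab not in t and lab in p)
        fn = sum(1 for t, p in zip(y_true, y_pred) if lab in t and lab not in p)
        per_label.append(f1(tp, fp, fn))
        for i, v in enumerate((tp, fp, fn)):
            totals[i] += v
    micro = f1(*totals)                       # pool counts over all labels
    macro = sum(per_label) / len(per_label)   # average of per-label F1
    return micro, macro

y_true = [{"Math"}, {"Politics", "Math"}, {"Politics"}]
y_pred = [{"Math"}, {"Math"}, {"Politics"}]
micro, macro = micro_macro_f1(y_true, y_pred, ["Math", "Politics"])
```

Micro-F1 weights every (article, label) decision equally, while macro-F1 weights every topic equally, which matters under class imbalance.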
In Table (c) we observe that the performance for topic classification does not vary strongly between using real and synthetic datasets as features. Using navigation embeddings generated from the public clickstream data yields the lowest scores; however, the relative difference to using real navigation sequences remains small in most cases. Surprisingly, embeddings generated from the unweighted hyperlink graph yield a performance on par with real navigation sequences. Overall, these results largely mirror the findings from the semantic similarity/relatedness task in the previous section (Sec. 6.3), which also used embeddings generated from navigation sequences. Therefore, we conclude that article embeddings generated from synthetic navigation sequences are of comparable quality to those generated from real navigation sequences.
7. Discussion and Concluding Insights
7.1. Summary of results
Characterizing the flow of navigation sequences. Across all 8 considered languages, our results show that navigation sequences passing through a given article are strongly mixing, with most pages exhibiting a mutual information close to zero, i.e., knowing the source article provides little information about the target article of the navigation sequence. However, for a small set of articles we observed much larger values of mutual information, which provides clear evidence for cases where the real navigation sequences differ substantially from the synthetic ones. We showed that, as a result, the overall diffusion in the semantic space of articles is virtually indistinguishable when comparing real navigation sequences with synthetic sequences from the public clickstream. While differences in the cosine similarity between articles separated by a given number of steps are statistically significant, the effect sizes are small (Table S2).
Performance of synthetic navigation sequences in downstream tasks. We showed that synthetic navigation sequences from the public clickstream dataset are effective for many practical applications across all 8 considered languages. We compared the performance of real and synthetic navigation sequences in four different downstream tasks involving the use of navigation sequences (link prediction (44), next-article prediction (10), generating representations (58), and topic classification (33)), revealing that using clickstream data often yields performance within a small margin of that obtained with real navigation sequences (Table S2). Specifically, article embeddings generated from synthetic navigation sequences are of comparable quality to those generated from real navigation sequences. Furthermore, in the case of next-article prediction and semantic similarity/relatedness, we found evidence that the main limitations do not originate from the restriction to first-order Markov processes when generating the synthetic data but from the additional filtering of the public data via k-anonymity, which is necessary for ensuring privacy.
7.2. Implications and Future Work
When synthetic data is enough. Our results indicate that, for many practical cases, the synthetic navigation sequences from public clickstream data can be used as a good approximation of real navigation sequences in Wikipedia. In order to generate the synthetic sequences in practice, one cannot exactly match the length and starting article of each real sequence; it is, however, possible to match these properties on average by sampling from the distribution of lengths (64) and the overall pageview statistics (16), respectively. Alternatively, it is possible to circumvent the requirement of matching on length by using intrinsic stopping, similar to Clickstream-Pub (I) style sequences. For cases in which necessary privacy filters (e.g., k-anonymity) are believed to remove too many rarely used links, one potential solution would be to augment the public clickstream data with the unweighted hyperlink graph.
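A minimal sketch of such a generation procedure, assuming a toy clickstream of made-up (source, target) transition counts with an `"<exit>"` pseudo-target for intrinsic stopping, and made-up pageview counts for sampling starting articles (all names and numbers are illustrative):

```python
import random

# Hypothetical aggregated clickstream: counts of (source -> target) transitions,
# plus an "<exit>" pseudo-target for sequences that end at the source.
clickstream = {
    "Cat":     {"Felidae": 80, "Dog": 30, "<exit>": 90},
    "Felidae": {"Lion": 40, "<exit>": 60},
    "Dog":     {"Cat": 20, "<exit>": 50},
    "Lion":    {"<exit>": 100},
}
pageviews = {"Cat": 150, "Dog": 60, "Felidae": 40, "Lion": 30}

def synthetic_sequence(rng):
    """First-order Markov walk: start ~ pageviews, step ~ clickstream counts,
    stop when the '<exit>' pseudo-target is drawn (intrinsic stopping)."""
    pages = list(pageviews)
    page = rng.choices(pages, weights=[pageviews[p] for p in pages])[0]
    seq = [page]
    while page in clickstream:
        targets = list(clickstream[page])
        page = rng.choices(targets, weights=[clickstream[page][t] for t in targets])[0]
        if page == "<exit>":
            break
        seq.append(page)
    return seq

rng = random.Random(0)
print([synthetic_sequence(rng) for _ in range(3)])
```

By construction, each step depends only on the current page, matching the memoryless nature of the synthetic sequences described above.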
There are several implications of this finding. First, previous research investigating navigation on Wikipedia using the public clickstream data (e.g., on search strategies (52)) could be generalized to describe navigation on Wikipedia more broadly; naturally, it is unlikely to capture all the nuances of how readers browse Wikipedia (48). Second, research on navigation in Wikipedia becomes accessible to a wider range of researchers, as the data underlying the synthetic sequences is publicly available for 11 languages and updated on a monthly basis. Other resources, such as article representations learned from navigation sequences, e.g., the so-called navigation vectors (76), can now be generated from publicly available datasets and thus become much more widely available and customizable. Third, our results call into question the unscrutinized approach of improving our understanding of user behavior through ever more data, which often comes at significant cost to user privacy. The example of navigation in Wikipedia provides clear evidence that for many practical downstream tasks (of which we gave four examples) the available synthetic data can be considered “good enough”. Fourth, the Wikipedia clickstream data provides an example of how to approach research on user navigation in online platforms more generally in a privacy-preserving way. It remains an open question whether our findings generalize beyond Wikipedia. However, our results highlight that publicly sharing clickstream-like data constitutes an immensely valuable data source and empowers others to better understand navigation on these platforms.
When synthetic data is not enough. The clickstream represents an excellent resource to approximate the global behavior of the readers on Wikipedia, but it cannot replace the server logs entirely. The fine-grained private activities stored on the servers, despite the careful anonymization guidelines of the Wikimedia Foundation, remain a valuable resource for studies that focus on the properties of readers. Real navigation sequences could help answer questions that rely on keeping track of the activities of the same user, such as revisitation patterns, multi-tab behavior, and how readers interact with additional content on a Wikipedia page (e.g., images or infoboxes (17)). At the same time, features removed during the aggregation of the public clickstream, such as time and geolocation, can support a more in-depth understanding of the information consumption patterns of Wikipedia readers (48). To this end, an interesting future direction would be to build generative models capable of generating realistic navigation sequences in a privacy-preserving manner.
Memory in navigation sequences.
The comparison between real and synthetic data provides insights into the degree of memory in navigation sequences; i.e., we characterize memory not at the level of an individual sequence but as a collective property emerging from averaging over hundreds of millions of sequences. By construction, the synthetic sequences are generated from a process without memory, where the next step does not depend on any of the previously visited pages, such that deviations from the real sequences indicate the importance of memory. This problem is typically approached by estimating the order of a Markov process fitted to the data, e.g., for Web (6) or targeted navigation (61). While previous works have yielded inconclusive results, our results suggest that fitting a single order constitutes an ill-posed problem. While we find that most paths do not possess memory (cf. Fig. 1, where most pages are characterized by mutual information close to zero), we cannot claim that readers follow a first-order Markovian process. In fact, we note that there could be a plethora of small but consequential subsets of navigation sequences with extremely strong memory (e.g., a reporter researching a topic for their front-page article).
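To illustrate one reason order estimation is delicate, the sketch below computes the in-sample maximum-likelihood log-likelihood of Markov models of different orders on toy sequences (page IDs are made up). Evaluated on the same positions, a higher order never fits worse in-sample, so choosing an order requires held-out testing or penalized criteria rather than raw likelihood:

```python
import math
from collections import Counter

# Toy navigation sequences (hypothetical page IDs).
seqs = [[0, 1, 2, 1, 0], [0, 1, 2, 3], [3, 2, 1, 0], [1, 2, 3, 2, 1]]

def loglik(seqs, k, skip=2):
    """In-sample log-likelihood of an order-k Markov model (MLE),
    evaluated on positions i >= skip so that different orders are
    scored on the same set of transitions."""
    ctx, full = Counter(), Counter()
    for s in seqs:
        for i in range(skip, len(s)):
            c = tuple(s[i - k:i])
            ctx[c] += 1
            full[c + (s[i],)] += 1
    return sum(n * math.log(n / ctx[c[:-1]]) for c, n in full.items())

ll1, ll2 = loglik(seqs, 1), loglik(seqs, 2)
print("order 1:", ll1, "order 2:", ll2)
# In-sample, the higher order never fits worse:
assert ll2 >= ll1
```

This monotonicity is exactly why a single fitted order can be misleading: the data may mix memoryless paths with small subsets of strongly history-dependent ones.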
Context of readers during navigation. Our findings serve as a starting point for more in-depth studies taking into account the context of the reader by providing ground-truth labels collected from, e.g., surveys. In fact, previous research on why readers visit Wikipedia (62; 39) showed that motivations are diverse (e.g., ‘intrinsic learning’, ‘bored’) and that some topics are viewed more in some regions (e.g., STEM in countries with a lower human development index). This suggests that future work should aim at better understanding the heterogeneity of readers and the context of their visits. Furthermore, one promising direction would be to study reading behavior on longer timescales, identifying learning pathways in order to, e.g., generate curricula around broader topics (54).
7.3. Methodological limitations
Pageload as a proxy for reading. The navigation sequences only capture pageloads from HTTP requests to Wikimedia’s servers. Our data does not capture actual reading behavior, such as how much time a reader spends on a specific page (67). Thus, pageloads only serve as a proxy for reading.
In this study, we addressed the question of how much of the complexity of reader navigation in Wikipedia can be captured by the publicly available clickstream data. We systematically compared ensembles of real and synthetically generated navigation sequences across 8 different languages. We found that differences are statistically significant but that absolute effect sizes are small, establishing quantitative confidence in the utility of clickstream data for studying navigation more generally. Our results empower more researchers to study navigation in Wikipedia and further strengthen the privacy of readers by limiting the need to share potentially sensitive data in most practical use cases.
Acknowledgements. We thank Leila Zia for insightful discussions. This project was partly funded by the Swiss National Science Foundation (grant 200021_185043), the European Union (TAILOR, grant 952215), and the Microsoft Swiss Joint Research Center. We also gratefully acknowledge generous gifts from Facebook and Google supporting West’s lab.
-  (2018) With few eyes, all hoaxes are deep. Proceedings of the ACM on Human-Computer Interaction 2 (CSCW), pp. 21. Cited by: §6.4.
-  (2016) Modeling user consumption sequences. In Proc. WWW, pp. 519–529. Cited by: §2.
-  Word vectors for 157 languages. Note: https://fasttext.cc/docs/en/crawl-vectors.html (accessed 13 August 2021). Cited by: §4.3.
-  (2017) Enriching word vectors with subword information. TACL 5, pp. 135–146. Cited by: §4.3.
-  (1995-04) Characterizing browsing strategies in the World-Wide Web. Computer Networks and ISDN Systems 27 (6), pp. 1065–1073. Cited by: §2.
-  (2012) Are web users really markovian?. In Proc. WWW, pp. 609–618. Cited by: §2, §5, §7.2.
-  The wiki game. Note: https://www.thewikigame.com/ (accessed 13 August 2021). Cited by: §2.
-  (2019) WikiLinkGraphs: A Complete, Longitudinal and Multi-Language Dataset of the Wikipedia Link Networks. pp. 598–607. Cited by: §4.2.
-  (2020) CycleRank, or there and back again: personalized relevance scores from cyclic paths on directed graphs. Proceedings. Mathematical, physical, and engineering sciences / the Royal Society 476 (2241). Cited by: §2.
-  (2019) Extrapolating paths with graph neural networks. arXiv preprint arXiv:1903.07518. Cited by: §7.1.
-  (2016) Extracting semantics from random walks on wikipedia: comparing learning and counting methods. In Proc. ICWSM, Cited by: §2, §6.3, §6.3.
-  (2019) Different topic, different traffic: how search and navigation interplay on wikipedia. The Journal of Web Science 1. Cited by: §2.
-  (2016) Visual positions of links and clicks on wikipedia. In Proc. WWW (Companion), pp. 27–28. Cited by: §2.
-  (2017) What makes a link successful on wikipedia?. In Proc. WWW, pp. 917–926. Cited by: §1, §2.
-  Analytics datasets: clickstream. Note: https://dumps.wikimedia.org/other/clickstream/readme.html (accessed 13 August 2021). Cited by: §1, §3, §4.2.
-  Analytics datasets: pageviews. Note: https://dumps.wikimedia.org/other/pageviews/readme.html (accessed 13 August 2021). Cited by: §2, §7.2.
-  Infobox. Note: https://en.wikipedia.org/wiki/Infobox (accessed 13 August 2021). Cited by: §7.2.
-  Wikimedia downloads. Note: https://dumps.wikimedia.org/backup-index.html (accessed 13 August 2021). Cited by: §4.2, §4.3.
-  Wikipedia pageviews analysis tool. Note: https://pageviews.toolforge.org (accessed 13 August 2021). Cited by: §1.
-  WikiProject council/directory. Note: https://en.wikipedia.org/wiki/Wikipedia:WikiProject_Council/Directory (accessed 13 August 2021). Cited by: §4.3.
-  Wikistats: pageview complete dumps. Note: https://dumps.wikimedia.org/other/pageview_complete/readme.html (accessed 13 August 2021). Cited by: §1.
-  (2015) Wikimedia Webrequest Server Logs. Note: https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Traffic/Webrequest (accessed 13 August 2021). Cited by: §1, §4.1.
-  (2017) The memory remains: understanding collective memory in the digital age. Science advances 3 (4). Cited by: §2.
-  (2015) Random surfers on a web encyclopedia. In Proc. i-KNOW, Cited by: §2.
-  (2018) Inspiration, captivation, and misdirection: emergent properties in networks of online navigation. In Complex Networks IX, pp. 271–282. Cited by: §2.
-  (2015) First women, second sex: gender bias in wikipedia. Proceedings of the 26th ACM. Cited by: §1.
-  (2018) Learning word vectors for 157 languages. arXiv preprint arXiv:1802.06893. Cited by: §4.3.
-  Text processing utilities for mediawiki. Note: https://github.com/mediawiki-utilities/python-mwtext (accessed 13 August 2021). Cited by: §A.2.
-  The wikitax taxonomy. Note: https://github.com/wikimedia/wikitax (accessed 13 August 2021). Cited by: §4.3.
-  (2015) User session identification based on strong regularities in inter-activity time. In Proc. WWW, pp. 410–418. Cited by: §2.
-  (2013) Models of human navigation in information networks based on decentralized search. In Proc. HT, Cited by: §2.
-  (2021) Language-agnostic topic classification for wikipedia. arXiv preprint arXiv:2103.00068. Cited by: §4.3, §6.4, §7.1.
-  (2021) Global gender differences in wikipedia readership. In Proceedings of the Thirteenth International AAAI Conference on Web and Social Media (ICWSM ’21), Cited by: §1.
-  (2020) Web routineness and limits of predictability: investigating demographic and behavioral differences using web tracking data. arXiv preprint arXiv:2012.15112. Cited by: §2.
-  (2017) How the structure of wikipedia articles influences user navigation. New Review of Hypermedia and Multimedia 23 (1), pp. 29–50. Cited by: §2.
-  (2020-11) The gender divide in wikipedia: a computational approach to assessing the impact of two feminist interventions. Cited by: §1.
-  (2014) Reader preferences and behavior on wikipedia. In Proc. HT, pp. 88–97. Cited by: §2.
-  (2019) Why the world reads Wikipedia: Beyond English speakers. In Proc. WSDM, Cited by: §2, §7.2.
-  (2021) Hunters, busybodies and the knowledge network building associated with deprivation curiosity. Nature Human Behaviour 5 (3), pp. 327–336. Cited by: §2.
-  (2020) Wikipedia’s network bias on controversial topics. arXiv preprint arXiv:2007.08197. Cited by: §1, §2.
-  (2020) WikiHist.html: english wikipedia’s full revision history in HTML format. Proc. ICWSM 14, pp. 878–884. Cited by: §1.
-  (2018) Jointly embedding entities and text with distant supervision. In Proc. of ACL RepL4NLP Workshop, pp. 195–206. Cited by: §6.3.
-  (2016) Improving website hyperlink structure using server logs. In Proc. WSDM, pp. 615–624. Cited by: §2, §4.1, §4.3, §6.2, §7.1.
-  (2011) Scikit-learn: machine learning in Python. JMLR 12, pp. 2825–2830. Cited by: §A.4, §6.4.
-  (2014) DeepWalk: online learning of social representations. In KDD, pp. 701–710. Cited by: §4.3.
-  (2020) Learning the markov order of paths in a network. arXiv preprint arXiv:2007.02861. Cited by: §2, §5.
-  (2021) A large-scale characterization of how readers browse wikipedia. arXiv preprint arXiv:2112.11848. Cited by: §1, §2, §7.2, §7.2.
-  (2018) Sequence-Aware recommender systems. ACM Comput. Surv. 51 (4), pp. 1–36. Cited by: §6.1.
-  (2020) A taxonomy of knowledge gaps for wikimedia projects (second draft). arXiv preprint arXiv:2008.12314. Cited by: §1, §2.
-  (2020) Sudden attention shifts on wikipedia following COVID-19 mobility restrictions. arXiv preprint arXiv:2005.08505. Cited by: §1, §2.
-  (2017) Search strategies of wikipedia readers. PLOS ONE 12 (2), pp. 1–15. Cited by: §1, §2, §4.2, Table 1, §7.2.
-  (2014) Memory in network flows and its effects on spreading dynamics and community detection. Nature communications 5. Cited by: §5.1.
-  (2019) Finding prerequisite relations using the wikipedia clickstream. In Proc. WWW (Companion), pp. 1240–1247. Cited by: §1, §2, §7.2.
-  (2014) The last click: why users give up information network navigation. In Proc. WSDM, pp. 213–222. Cited by: §2.
-  (2016) Evaluating link-based recommendations for wikipedia. In Proc. JCDL, Cited by: §2.
-  Scikit-learn: adjusted mutual information score. Note: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.adjusted_mutual_info_score.html (accessed 13 August 2021). Cited by: §A.4.
-  (2019) Toward universal spatialization through Wikipedia-Based semantic enhancement. ACM Trans. Interact. Intell. Syst 9 (2-3). Cited by: §2, §7.1.
-  (2018) The pipeline of online participation inequalities: the case of wikipedia editing. The Journal of communication 68 (1), pp. 143–168. Cited by: §2.
-  (2015) HypTrails: a bayesian approach for comparing hypotheses about human trails on the web. In Proc. WWW, pp. 1003–1013. Cited by: §2.
-  (2014) Detecting memory and structure in human navigation patterns using markov chain models of varying order. Vol. 9. Cited by: §2, §5, §7.2.
-  (2017) Why we read Wikipedia. In Proc. WWW, Cited by: §2, §7.2, §7.3.
-  (2013) Computing semantic relatedness from human navigational paths: a case study on wikipedia. Int. J. Semant. Web Inf. Syst. 9 (4), pp. 41–70. Cited by: §6.3, §6.3.
-  (2014) Understanding, leveraging and improving human navigation on the web. In Proc. WWW, pp. 27–32. Cited by: §7.2.
-  The WikiSRS dataset. Note: https://github.com/OSU-slatelab/WikiSRS (accessed 13 August 2021). Cited by: §4.3.
-  (1997) Revisitation patterns in world wide web navigation. In Proc. SIGCHI, pp. 399–406. Cited by: §2.
-  (2019) Dwelling on wikipedia: investigating time spent by global encyclopedia readers. In Proc. OpenSym, Cited by: §2, §7.3.
-  The problem with wikipedia. Note: https://xkcd.com/214/ (accessed 13 August 2021). Cited by: §1.
-  (2009) Information theoretic measures for clusterings comparison. Proc. ICML. Cited by: §5.1.
-  (2015) Misalignment between supply and demand of quality content in peer production communities. In Proc. ICWSM, Cited by: §2.
-  (2012) Human wayfinding in information networks. In Proc. WWW, pp. 619–628. Cited by: §1, §2.
-  (2009) Wikispeedia: an online game for inferring semantic distances between concepts. In Proc. IJCAI, pp. 1598–1603. Cited by: §2.
-  Wikispeedia. Note: https://dlab.epfl.ch/wikispeedia (accessed 13 August 2021). Cited by: §2.
-  Wiki rabbit hole. Note: https://en.wikipedia.org/wiki/Wiki_rabbit_hole (accessed 13 August 2021). Cited by: §1.
-  (2015) Wikipedia clickstream. Note: https://meta.wikimedia.org/wiki/Research:Wikipedia_clickstream Cited by: §4.2.
-  Wikipedia navigation vectors. Note: https://meta.wikimedia.org/wiki/Research:Wikipedia_Navigation_Vectors (accessed 13 August 2021). Cited by: §2, §7.2.
-  (2019-02) Knowledge gaps – wikimedia research 2030. External Links: Cited by: §1.
Appendix A Implementation details
A.1. Constructing navigation sequences
We mimic a reader via an MD5 digest of the concatenation of the client IP address and the user agent string, which serves as an approximate reader ID. The user agent information is also used to discard common bots and crawlers.
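A minimal sketch of this hashing step (the exact concatenation format is an implementation detail; the IP address and user-agent string below are made up):

```python
import hashlib

def approximate_reader_id(client_ip: str, user_agent: str) -> str:
    """Approximate reader ID: MD5 digest of the client IP address
    concatenated with the user-agent string."""
    return hashlib.md5((client_ip + user_agent).encode("utf-8")).hexdigest()

rid = approximate_reader_id("203.0.113.7", "Mozilla/5.0 (X11; Linux x86_64)")
print(rid)  # a 32-character hexadecimal digest
```

The same (IP, user agent) pair always maps to the same ID, allowing requests to be grouped into sequences without storing the raw identifiers.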
A.2. Semantic embedding
A.3. Added-links data
Given all $s$ and $t$ appearing in an added link $(s, t)$, we consider as negative examples all pairs $(s, t')$ and $(s', t)$ for which i) $(s, t')$ or $(s', t)$ is not already a link, ii) $t' \neq t$ and $s' \neq s$, and iii) there are at least a minimum number of navigation sequences with an indirect path between $s$ and $t'$ or $s'$ and $t$, respectively.
A.4. Mixing of flows: AMI
Specifically, we get all triples of consecutively visited pages $(s, v, t)$. For every distinct page $v$ we consider the set of all triples passing through $v$ and calculate the joint distribution $p_v(s, t)$, counting the fraction of triples starting in $s$ and ending in $t$ while passing through $v$. We calculate the AMI for each page to assess the degree of mixing of the trajectories passing through. We first calculate the MI between all sources and targets as

$$\mathrm{MI}_v = \sum_{s, t} p_v(s, t) \log \frac{p_v(s, t)}{p_v(s)\, p_v(t)},$$

from which we get the AMI as

$$\mathrm{AMI}_v = \frac{\mathrm{MI}_v - \mathbb{E}[\mathrm{MI}_v]}{\tfrac{1}{2}\left[H_v(s) + H_v(t)\right] - \mathbb{E}[\mathrm{MI}_v]},$$

where $H_v(s)$ and $H_v(t)$ are the entropies over the marginal distributions and $\mathbb{E}[\mathrm{MI}_v]$ is the expected MI from finite-size effects when randomly relating source and target. In practice, we use the implementation (57) from Scikit-learn (45).
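As a toy illustration of the per-page AMI computation with Scikit-learn's implementation (the source/target labels below are made up): a strongly mixing page, where targets are independent of sources, yields an AMI close to zero, while a page whose target is fully determined by the source yields an AMI of one.

```python
import numpy as np
from sklearn.metrics import adjusted_mutual_info_score

rng = np.random.default_rng(0)

# Hypothetical (source, target) labels of triples passing through one page v.
# Strong mixing: sources carry no information about targets -> AMI ~ 0.
sources_mixed = rng.integers(0, 5, size=2000)
targets_mixed = rng.integers(0, 5, size=2000)
print(adjusted_mutual_info_score(sources_mixed, targets_mixed))  # close to 0

# Perfect memory: the target is determined by the source -> AMI of 1.
sources_mem = rng.integers(0, 5, size=2000)
targets_mem = sources_mem.copy()
print(adjusted_mutual_info_score(sources_mem, targets_mem))
```

Unlike the raw MI, the adjusted score can be compared across pages with different numbers of triples, since the finite-size expectation is subtracted out.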
Appendix B Diffusion in semantic space
We observe in Fig. S1 that the distributions of distances for real and synthetic sequences from the clickstream are almost completely overlapping (even for larger separations), showing that differences are present but overall small. In contrast, the distribution of distances from the synthetic sequences of the unweighted random walk on the hyperlink graph is substantially shifted to larger values, indicating that these sequences are much less confined and, as a result, diffuse much faster. As a sanity check, we show that the semantic distance between two completely randomly drawn articles is much larger, corroborating that the embedding space captures the semantic similarity between articles.
Relative differences between real and synthetic navigation sequences for each downstream task; each column corresponds to one of the 8 considered languages (cf. Table S2):

| Semantic distance () | -1.49 | -0.98 | -1.28 | -1.25 | -2.4 | -2.33 | -1.18 | -0.79 |
| Semantic distance () | 11.1 | 2.32 | 6.43 | 6.21 | 8.17 | 12.96 | 4.09 | 5.44 |
| Semantic distance () | 28.93 | 5.12 | 19.11 | 14.77 | 19.3 | 36.43 | 9.24 | 14.91 |
| Semantic distance () | 68.59 | 10.11 | 47.99 | 23.72 | 29.4 | 58.57 | 24.49 | 43.6 |
| Next-article prediction (All queries) | 14.38 | 18.01 | 13.33 | 18.51 | 15.71 | 19.92 | 18.31 | 30.56 |
| Next-article prediction (Filtered queries) | 9.20 | 8.85 | 8.32 | 8.60 | 8.86 | 9.93 | 7.58 | 3.64 |
| Topic classification (Micro) | 6.67 | 7.47 | 7.43 | 7.35 | 10.08 | 9.78 | 7.18 | 6.78 |
| Topic classification (Macro) | 6.68 | 9.61 | 10.24 | 10.08 | 11.32 | 12.13 | 11.56 | 10.12 |
| Link prediction (P@10) | -25.00 | 10.00 | 0.00 | 0.00 | 0.00 | -11.11 | -11.11 | 0.00 |
| Link prediction (P@50) | -13.64 | 17.78 | 18.75 | 13.04 | 0.00 | -4.55 | 11.11 | 2.50 |
| Link prediction (P@100) | -2.38 | 22.47 | 20.45 | 7.41 | 8.43 | 4.88 | 12.50 | 10.26 |