Wikipedia graph mining: dynamic structure of collective memory

10/01/2017 ∙ by Volodymyr Miz, et al.

Wikipedia is the biggest encyclopedia ever created and the fifth most visited website in the world. Tens of millions of people surf it every day, seeking answers to various questions. Collective user activity on its pages leaves publicly available footprints of human behavior, making Wikipedia an excellent source for the analysis of collective behavior. In this work, we propose a distributed graph-based event extraction model inspired by Hebbian learning theory. The model exploits the collective effect of the dynamics to discover events. We focus on data streams with an underlying graph structure and perform several large-scale experiments on the Wikipedia visitor activity data. We show that the presented model is scalable with respect to time-series length and graph density, and we provide a distributed implementation of the proposed algorithm. We extract dynamic patterns of collective activity and demonstrate that they correspond to meaningful clusters of associated events reflected in the Wikipedia articles. We also illustrate the evolutionary dynamics of the graphs over time to highlight the changing nature of visitors' interests. Finally, we discuss clusters of events that model the collective recall process and represent collective memories, the common memories shared by a group of people.




1. Introduction

Over recent years, the Web has significantly affected the way people learn, interact in social groups, and store and share information. Apart from being an essential part of modern life, social networks, online services, and knowledge bases generate a massive amount of logs containing traces of global online activity on the Web. A large-scale example of such publicly available information is the Wikipedia knowledge base and the history of its visitor activity. This data is a great source for collective human behavior analysis at scale. For this reason, analyses of Wikipedia in this area have become popular in recent years (Tinati et al., 2016), (García-Gavilanes et al., 2017), (Kanhabua et al., 2014).

Collective memory (Halbwachs, 2013) is an interesting social phenomenon of human behavior. Studying this concept is a way to enhance our understanding of a common view of events in social groups and identify the events that influence remembering of the past. Early research on collective memory relied on interviews and self-reports that led to a limited number of subjects and biased results (Stone et al., 1999). The availability of the Web activity data opened new opportunities toward systematic studies at a much larger scale (Ferron, 2012), (Graus et al., 2017), (García-Gavilanes et al., 2017). Nonetheless, the general nature of collective memory formation and its modeling remain open questions. Can we model collective and individual memory formation similarly? Is it possible to find collective memories and behavior patterns inside a collaborative knowledge base? In this work, we adopt a data-driven approach to shed some light on these questions.

An essential part of Web visitor activity data is the underlying human-made graph structure that was initially introduced to facilitate navigation. The combination of the activity dynamics and the structure of Web graphs suggests a similarity to biological neural networks. A good example of such a network is the human brain. Numerous neurons in the brain constitute a biological dynamic neural network, where the dynamics are expressed in terms of neural spikes. This network is in charge of perception, decision making, storing memories, and learning.

During learning, neurons in our brain self-organize and form strongly connected groups called neural assemblies (Allport, 1985). These groups express similar activation patterns in response to specific stimuli. When learning is complete and a stimulus is applied once again, the reactions of the assemblies correspond to consistent dynamic activity patterns, i.e. memories. Synaptic plasticity mechanisms govern this self-organization process.

Hebbian learning theory (Hebb, 2005) proposes an explanation of this self-organization and describes the basic rules that guide the network design. The theory implies that the simultaneous activation of a pair of neurons leads to an increase in the strength of their connection. In general, Hebbian theory implies that neural activity transforms brain networks. This assumption leads to an interesting question: can temporal dynamics cause a self-organization process in the Wikipedia Web network, similar to the one driven by neural spikes in the brain?

To answer this question, we introduce a collective memory modeling approach inspired by the theory of learning in the brain, in particular by a content-addressable memory system, the Hopfield network (Hopfield, 1982). In our experiments, we use a dynamic network of Wikipedia articles, where the dynamics come from the visitor activity on each page, i.e. the number of visits per page per hour. We assume that the Wikipedia network can self-organize, similarly to a Hopfield network with a modified Hebbian learning rule, and learn collective memory patterns under the influence of visitor activity, much as neurons do in the brain. The results of our experiments demonstrate that memorized patterns correspond to groups of collective memories, containing clusters of linked pages with closely related meanings. The topic of a cluster corresponds to a real-world event that triggered the interest of Wikipedia visitors during a finite period of time. In addition, the collective memory gives access to the time span when the event occurred. We also show that our collective memory model is able to recall an event recorded in the collective memory, given only a part of the event cluster. Here the term recall means that we recover a missing fraction of the visitor activity signal in a memory cluster.

Contributions. Our contributions are as follows:

  • We propose a novel collective memory learning framework, inspired by artificial models of individual memory – the Hebbian learning theory and the Hopfield network model.

  • We formalize our findings into a content-addressable model of collective memories. So far, collective memory (Halbwachs, 2013) has been considered only as a concept lacking a general model of memory formation. In our experiments we demonstrate that, given a modified Hebbian learning rule, the Wikipedia hyperlink network can self-organize and gain properties reminiscent of an associative memory.

  • We develop a graph algorithm for collective memory extraction. The computations are local on the graph, which allows for an efficient implementation. We provide a distributed implementation of the algorithm to show that it can handle dense graphs and massive amounts of time-series data.

  • We present graph visualizations as an interactive tool for collective memory studies.

The rest of this paper is organized as follows. In Section 2, we give an overview of related work on large-scale collective memory research and analysis. Section 3 describes the dataset and data preprocessing. We then present the graph learning algorithm in Section 4 and the community detection and visualization approach in Section 5. We discuss the results of our experiments and evaluate the discovered memory properties in Section 6. Finally, information about the data, the code, and the tools used for the study, as well as online visualizations, is provided in Section 8.

2. Related work

The term collective memory first appeared in the book of Maurice Halbwachs in 1925 (Halbwachs, 2013). He proposed the concept of a social group that shares a set of collective memories that exist beyond the memory of each member and affect the understanding of the past by this social group. Halbwachs's hypothesis influenced a range of studies in sociology (Assmann and Czaplicka, 1995), (Barash, 2016), psychology (Coman et al., 2009), (Ferron and Massa, 2012), cognitive sciences (Ferron, 2012), and, only recently, computer science (Au Yeung and Jatowt, 2011), where the authors extract collective memories using LDA (Blei et al., 2003) applied to a collection of news articles.

Despite the fact that Wikipedia is the largest ever created encyclopedia of public knowledge and the fifth most visited website in the world, the studies on collective memory considered the Wikipedia visitor activity data only recently. The idea of regarding Wikipedia as a global memory space was first introduced by Pentzold in 2009 (Pentzold, 2009). Then, it was followed by a range of collective memory studies based on various Wikipedia data archives.

Analyzing Wikipedia page views, Kanhabua et al. (Kanhabua et al., 2014) proposed a collective memory model, investigating 5500 events from 11 categories. The authors proposed a remembering score based on a combination of time-series analysis and location information. The work focuses on four types of events: aviation accidents, earthquakes, hurricanes, and terrorist attacks. It presents extensive experimental results; however, it is limited to particular types of collective memories.

Traumatic events such as attacks and bombings have also been investigated in (Ferron and Massa, 2011), (Ferron, 2012), based on the Wikipedia edit activity data. The authors investigate the difference between traumatic and non-traumatic events using natural language processing techniques. The study is limited to a certain type of events.

Another case study (García-Gavilanes et al., 2017) focuses on memory-triggering patterns to understand collective memories. The collective memory model is inferred from the Wikipedia visitors activity. The work considers only a case of aircraft incidents reported in English Wikipedia. The authors try to build a general mathematical model and explain the phenomenon of collective memory, extracted from Wikipedia, based on that single topic.

Popularity and celebrities represent another focus point of public interest. The Wikipedia hourly visits on the pages of celebrities were used to investigate the fame levels of tennis players (Yucesoy and Barabási, 2016). The authors, though, did not tackle collective memories and aimed instead to quantify the relationship between the performance and popularity of the athletes.

To the best of our knowledge, we are the first to apply a content-addressable memory model to the Wikipedia viewership data for collective memory analysis. We build our collective memory model on the assumption that it is similar to the memory of an individual. For the first time, we demonstrate that the Wikipedia Web network and its dynamics can gain properties reminiscent of an associative memory. Unlike previous works, our model is not limited to particular classes of events and does not require labeled data. We do not analyze the content of the pages and rely only on the collective viewership dynamics in the network, which makes the presented model tractable.

Figure 1. a) Data structure built from Wikipedia data, with time-series associated to nodes of the Wikipedia hyperlink graph. b) Focus on the time-series signals (number of visits per hour) residing on the vertices of the graph.

3. Dataset

We use the dataset described in (Benzi, 2017). This dataset is based on two Wikipedia SQL dumps: English language articles and user visit counts per page per hour. The original datasets are publicly available on the Wikimedia website (Foundation, 2016).

The Wikipedia network of pages is first constructed using the data from the article dumps, which contain information about the references (edges) between the pages (nodes). Note that Wikipedia is continuously updated: some links that existed at the moment we made the dump may have been removed from current versions of the pages; consistency with past versions can be checked with Wikipedia's dedicated search tool. Time-series are then associated to each node (Fig. 1), corresponding to the visit history from 02:00, 23 September 2014 to 23:00, 30 April 2015.

The graph contains 116 016 Wikipedia pages (out of a total of 4 856 639 pages) with 6 573 475 hyperlinks. Most of the Wikipedia pages remain unvisited; therefore, only the pages whose number of visits exceeds 500 at least once in the recording are kept. The time-series associated to the nodes span the full seven-month recording period, with one value per hour.

Preprocessing. To identify potential core events of collective memories and further reduce the processing time, we select nodes that have bursts of visitor activity, i.e. spikes in the signal. This is done on a monthly basis for two reasons. Firstly, it reduces the processing time and memory requirements, as each month can be processed independently. Secondly, it allows for a study of the evolution of the collective memories on a monthly basis, by investigating the trained Hopfield networks (see next section).

For each month, we define the burstiness of a page from the number and length of its peaks of visits over time. We denote by x_n^m(t) the time-series giving the number of visits per hour for the node labeled n during month m. To define a burst of visits we compute the mean mu_n^m and the standard deviation sigma_n^m of x_n^m. We select the values that are above mu_n^m + alpha * sigma_n^m, where alpha is a tunable activity rate parameter. The burstiness of the page associated to node n is defined by

b_n^m = sum_{t=1}^{T_m} a_n^m(t),

where T_m is the length of the time-series for month m and the activation indicator is

a_n^m(t) = 1 if x_n^m(t) > mu_n^m + alpha * sigma_n^m, and 0 otherwise.     (1)

For each month, we discard the pages whose burstiness falls below a fixed threshold.
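As an illustration, the burstiness computation above can be sketched in a few lines of Python (the authors' actual implementation is distributed; the parameter values below are hypothetical):

```python
import numpy as np

def burstiness(x, alpha=2.0):
    """Burstiness of one page for one month: the number of hours whose
    visit count exceeds mean + alpha * std of the monthly series."""
    x = np.asarray(x, dtype=float)
    mu, sigma = x.mean(), x.std()
    active = x > mu + alpha * sigma  # Eq. (1): 1 during a burst, 0 otherwise
    return int(active.sum())

def filter_bursty(pages, alpha=2.0, min_burstiness=1):
    """Keep only the pages whose burstiness reaches the threshold.
    pages: dict mapping page name -> hourly visit counts for the month."""
    return {n: x for n, x in pages.items() if burstiness(x, alpha) >= min_burstiness}
```

For example, a flat series never exceeds its mean plus a positive multiple of its standard deviation, so a page with constant traffic gets burstiness 0 and is discarded.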

4. Collective memory learning

Our goal is to create a self-organized memory from the Wikipedia data. This memory is not a standard neural network but rather a concept network, a network made of pieces of information. Indeed, Wikipedia pages replace neurons in the network. Hence it can be seen as a memory connecting high-level concepts. The self-organized structure is shaped as a Hopfield network with additional constraints, relating it to the concept of associative memory. The learning process follows Hebbian rules, adapted to our particular dataset.

Hopfield network. Our approach is based on a Hopfield model of artificial memory (Hopfield, 1982). A Hopfield network is a recurrent neural network serving as an associative memory system. It allows for storing and recalling patterns. Let N be the number of nodes in the Hopfield network. Starting from an initial partial memory pattern X_0, the action of recalling a learned pattern is done by the following iterative computation:

X_{k+1} = f_theta(W_m X_k),     (2)

where W_m is the weight matrix of the Hopfield network. The function f_theta is a nonlinear thresholding function (a step function) that binarizes the vector entries, and theta is the threshold (the same for all entries). In our case, we build one network per month, so W_m is associated to a particular month m. For each month, X_k is a matrix of (binarized) time-series where each row is associated to a node of the network and each column corresponds to an hour of the month considered. We stop the iteration when the process has converged to a stable solution (X_{k+1} = X_k). Note that X_0 is a binary matrix, obtained from the time-series using the indicator function defined in Eq. (1).
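Under simplifying assumptions (a dense NumPy weight matrix and a {0, 1} step function; the actual network is sparse and constrained by hyperlinks), the recall iteration of Eq. (2) can be sketched as:

```python
import numpy as np

def recall(W, X0, theta=0.5, max_iter=100):
    """Iterate X_{k+1} = f_theta(W @ X_k) until a stable pattern is reached.
    W: (N, N) weight matrix; X0: (N, T) binary partial pattern."""
    X = X0.copy()
    for _ in range(max_iter):
        X_next = (W @ X > theta).astype(int)  # f_theta: entrywise step function
        if np.array_equal(X_next, X):         # converged: X_{k+1} = X_k
            break
        X = X_next
    return X
```

On a toy three-node clique with unit weights, activating a single node is enough for the iteration to recover the full pattern, which is the associative-memory behavior exploited in Section 6.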

Dimensionality reduction. The learning process has been modified in order to cope with the large amount of data. We recall that the number of pages has already been reduced by keeping only the ones with bursts of activity. However, this still leaves a large number of neurons and weights for a Hopfield network. Therefore, instead of training a fully connected network, we reduce the number of links in the following manner. We consider only the hyperlink connections between pages and learn their associated weights. In this way, nodes in the network are linked by human-made connections, and the result can be seen as a sort of pre-trained, constrained network. This is a strong assumption, as no link in the Hopfield network can be created between pages that are not related by a hyperlink.

Hebbian learning. We use a synaptic-plasticity-inspired computational model (Hebb, 2005) in the proposed graph learning algorithm to compute the weights. The main idea is that the co-activation of two neurons results in the reinforcement of the connection (synapse) between them. We do not take the causality of activations into account. For two neurons i and j, their respective activity (number of visits) over time, x_i(t) and x_j(t), is compared to determine whether they are co-active at each time step. We introduce the following similarity measure between nodes i and j at time t:

s_{ij}(t) = x_i(t) / x_j(t) if x_i(t) <= x_j(t), and x_j(t) / x_i(t) otherwise.

This function compares the ratio of the numbers of visits, putting more emphasis on the pages receiving a similar amount of visits.

For each time step t, the edge weight between i and j is updated by the following amount:

Delta w_{ij}(t) = s_{ij}(t) if s_{ij}(t) > tau, and 0 otherwise,

where tau is a threshold parameter.
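A minimal sketch of this Hebbian update for a single edge, assuming the similarity is the min/max ratio of visit counts (zero when neither page is visited) and treating the threshold value as illustrative:

```python
def similarity(xi, xj):
    """Ratio of visit counts at one time step; close to 1 when the two
    pages receive a similar number of visits (0 if neither is visited)."""
    if xi == 0 and xj == 0:
        return 0.0
    lo, hi = sorted((xi, xj))
    return lo / hi

def hebbian_weight(x_i, x_j, tau=0.5):
    """Total weight accumulated on edge (i, j) over one month:
    only similarities above the threshold tau are summed."""
    w = 0.0
    for xi, xj in zip(x_i, x_j):
        s = similarity(xi, xj)
        if s > tau:  # strong co-activation: reinforce the synapse
            w += s
    return w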

The computations are tractable because 1) they are local on the graph, i.e. weight updates depend only on a node and its first-order neighborhood, 2) weight updates are iterative, and 3) a weight update occurs only between connected nodes and not among all possible pairs of nodes. These three facts allow us to build a distributed model to speed up the computations. For this purpose, we use a graph-parallel Pregel-like abstraction, implemented in the GraphX framework (Gonzalez et al., 2014), (Xin et al., 2013).
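The vertex-centric ("think like a vertex") formulation behind this can be sketched in plain Python for one superstep; the real implementation uses GraphX's Pregel abstraction in Scala, and the threshold value here is illustrative:

```python
def superstep(adj, activity, t, tau=0.5):
    """One Pregel-like superstep at time step t: each undirected edge
    (u, v) accumulates a weight update from its endpoints' activity.
    adj: nested dict {u: {v: weight}}; activity: {u: hourly visit counts}."""
    for u, nbrs in adj.items():
        for v in nbrs:
            if u < v:  # visit each undirected edge once
                xi, xj = activity[u][t], activity[v][t]
                if xi == 0 and xj == 0:
                    continue  # no activity on either endpoint at this step
                s = min(xi, xj) / max(xi, xj)
                if s > tau:  # strong co-activation: reinforce the edge
                    adj[u][v] += s
                    adj[v][u] += s
    return adj
```

Because each update touches only one edge and its two endpoints, the supersteps parallelize naturally over edge partitions, which is exactly what makes the GraphX implementation scale.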

Figure 2. Weighted degree distribution in log-log scale. Linearity in log-log scale corresponds to power-law behavior p(k) proportional to k^(-gamma). Power-law exponents are shown for the initial graph (blue) and for the learned graph (red). The decrease of the exponent in the learned graph indicates that the weight distribution becomes smoother and the number of greedy hubs drops with the learning.

5. Graph visualization and community detection

The structure of each learned network shows patterns corresponding to communities of correlated nodes. We analyze all the monthly Hopfield networks, extract communities, and visualize them in the following sections. For community detection, we use a heuristic method based on modularity optimization, the Louvain method (Blondel et al., 2008). Colors in the graph depict and highlight the detected communities. A resolution parameter controls the size of the communities. To represent the graph in 2D space for visualization, we use a force-directed layout (Jacomy et al., 2014). The resulting spatial layout reflects the organization of the network. An interactive version of all visualizations presented in this paper is available online (Miz, 2017b).

Figure 3. Edge weight distribution in the learned graph in log scale. The learned graph has a skewed, heavy-tailed weight distribution.

6. Experiments and results

Using the approach presented in Section 4, we learn collective memories and describe their properties. After the learning, we inspect the resulting networks at different time scales. We extract strongly connected clusters of nodes from these networks and discuss their features. We also investigate the temporal behavior of the memories.

A 7-month network. We first learn the graph using the 7-month Wikipedia activity dataset. We start from the initial graph of Wikipedia pages connected by hyperlinks, described in Section 3. The Hopfield network resulting from the learning process is a network G = (V, E, W), where V is the set of Wikipedia pages, E is the set of references between the pages, and W is the set of weights reflecting the similarity of activity patterns between articles.

After the learning, the majority of the weights are 0 (6 297 977 edges, 95.8%). Only 275 498 edges (4.2%) have strictly positive weights. For a better visualization, we prune the zero-weight edges. We also delete the disconnected nodes that remain after edge pruning and remove small disconnected components of the graph. The number of remaining nodes is 35 839 (31% of the initial number). Figure 4 shows snapshots of the initial Wikipedia graph before learning (a) and the learned graph after weight updates and pruning (b).
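The pruning steps above can be sketched with a plain adjacency dictionary (standard library only; the function names are ours, and keeping only the largest component stands in for the paper's removal of small components):

```python
from collections import defaultdict, deque

def prune_zero_weights(edges):
    """Build an adjacency dict keeping only strictly positive weights;
    nodes left without any positive edge simply never appear.
    edges: iterable of (u, v, weight) triples."""
    adj = defaultdict(dict)
    for u, v, w in edges:
        if w > 0:
            adj[u][v] = w
            adj[v][u] = w
    return dict(adj)

def largest_component(adj):
    """Drop small disconnected components: return the largest one (BFS)."""
    seen, best = set(), set()
    for start in adj:
        if start in seen:
            continue
        comp, queue = set(), deque([start])
        while queue:
            n = queue.popleft()
            if n in comp:
                continue
            comp.add(n)
            queue.extend(set(adj[n]) - comp)
        seen |= comp
        if len(comp) > len(best):
            best = comp
    return best
```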

The initial and learned Wikipedia graphs have statistically heterogeneous connectivity (Fig. 2) and weight patterns (Fig. 3) that correspond to skewed heavy-tailed distributions.

The degree distribution of the initial graph has a larger power-law exponent (Fig. 2, blue) than that of the learned graph (Fig. 2, red). This shows that the initial Wikipedia graph is dominated by large hubs that attract most of the connections from numerous low-degree neighbors. These hubs correspond to general topics in the Wikipedia network. They are often broad topics such as, for instance, the "United States" page, with a large number of hyperlinks pointing to it from highly diverse subjects.

We applied the community detection algorithm described in Section 5 to both graphs. The initial Wikipedia graph (Fig. 4) is dense and cluttered with a significant number of unused references, while the learned graph reveals smaller and more separated communities. This is confirmed by the community size distributions of the graphs (Fig. 5). The number of communities and their sizes change after learning. Initially, a small number of large communities dominates the graph (blue), while after the learning (red) we see a five-fold increase in the number of communities. Moreover, as a result of the learning, the size of the communities decreases by one order of magnitude. The modularity of the learned graph is 25% higher, giving the first evidence of the creation of associative structures.

The analysis of each community of nodes in the learned graph gives a glimpse of the events that occurred during the 7-month period. Each cluster is a group of pages related to a common topic such as a championship, a tournament, an awards ceremony, a world-level contest, an attack, an incident, or a popular festive event such as Halloween or Christmas. The graph contains a summary of the events that occurred during the period of interest, hence its name of collective memory.

Before going deeper in the clusters analysis and the information they contain, we investigate the evolution of the graph structure over time.

(a) Initial (b) Learned
Figure 4. Wikipedia graph of hyperlinks (left) and learned Hopfield network (right). Colors correspond to the detected communities. The learned graph is much more modular than the initial one, with a larger number of smaller communities. The layout is force-directed.

Monthly networks. To obtain a finer-grained view of the collective memories, we focus on a smaller time scale. Indeed, events attracting the attention of Wikipedia users for a period longer than a week or two are rare. Therefore, we split our dataset into months. The monthly graphs are smaller than the 7-month graph and contain 10 000 nodes on average. However, their properties and distributions are similar to those of the 7-month graph described above.

Figure 5. Community size distribution of the initial Wikipedia graph of hyperlinks (blue) and the learned Hopfield network (red). The total number of communities: 32 for the initial graph, 172 for the learned one.

Short-term (monthly) graphs allow us to understand and visualize the dynamics of memory formation in the long-term (7-month) graph. As an example of memory evolution, we discuss the cluster of the USA National Football League championship and the collective memories that are mostly related to the previous NFL seasons (Fig. 6). The NFL is one of the most popular sports leagues in the USA, and it triggers a lot of interest on Wikipedia. Thanks to the high number of visits on this topic, we were able to spot a cluster related to the NFL in each of the monthly graphs. Figure 6 shows information about the NFL clusters. The top part of the figure contains the learned graphs, for each month, where the NFL cluster is highlighted in red. The final game of the 2014 season, Super Bowl XLIX, was played on February 1, 2015. This explains the increase in cluster size until February, where it reaches its maximum. The activity collapses after this event and the cluster disappears. For the sake of interpretability, we extracted 30 NFL team pages from the original cluster (485 pages) to show the details of the evolution in time as a table in Fig. 6. This fraction of the nodes reflects the overall dynamics of the entire cluster. Each row describes the hourly activity of a page, while the columns split the plot into months. The sum of visits for the selected pages is plotted as a red line at the bottom.

Figure 6. Evolution of the National Football League 2014-2015 championship cluster. We show 30 NFL teams from the main cluster. Top: the monthly learned graph with the NFL cluster highlighted in red. Middle table: visitors activity per hour on the NFL teams’ Wikipedia pages in greyscale (the more visits, the darker). Bottom: timeline, in red, of the overall visitor activity in the cluster.

The dynamics of the detected cluster reflects the real timeline of the NFL championship. The spiking nature of the overall activity corresponds to the weekends, when most of the games were played. Closer to the end of the championship, the peaks become stronger, following the increasing interest of the fans. We see the sharpest spike of activity on 1 February, when the final game was played.

We want to emphasize that this cluster, like all the others, was obtained in a completely unsupervised manner. Football team pages were automatically connected together in a cluster having "NFL" as a common topic. Moreover, the cluster is not formed by one Wikipedia page and its direct neighbors; it involves many pages at distances of several hops on the graph.

The NFL championship case is an example of a periodic (yearly) collective memory. The interest increases over the months until the expected final event. Accidents and incidents are different types of events as they appear suddenly, without prior activity. Our proposed learning method allows for the detection of these kinds of collective memories as well. We provide examples of three accidents to demonstrate the behavior of the collective memory in case of an unexpected event.

We pick three core events among the 172 detected and discuss them in detail to illustrate how memories are detected by our approach. Figure 7 shows the clusters extracted from the learned graph (top) and the overall timeline of the clusters' activity (bottom).

Charlie Hebdo shooting. 7 January 2015. This terrorist attack is an example of an unexpected event. The cluster emerged over a period of 72 hours, following the attack. All pages in the cluster are related to the core event. Strikingly, a look at the title of the pages is sufficient to get a precise summary of what the event is about. There is a sharp peak of activity on the first day of the attack, slowly decaying over the following week.

Germanwings Flight 9525 crash. 24 March 2015. This cluster involves not only pages describing the crash or providing more information about it, but also the pages of similar events that happened in the past. It includes, for example, a page enumerating airplane crashes and the page of a previous crash that happened in December 2014, the Indonesia AirAsia Flight 8501 crash. As a result, the memory of the event is connected to the memory of the Flight 8501 crash, which is why we can see an increase of visits in December. This is an example where our associative memory approach connects not only pages but also events.

Ferguson unrest. Second wave. November 24, 2014 – December 2, 2014. This is an example of an event that has official beginning and end dates. A sharp increase in activity at the beginning of the protests highlights the main event. This moment triggers the emergence of the core cluster. We also see that the cluster becomes active once again at the end of the unrest.

(a) Germanwings 9525 crash (b) Ferguson unrest (c) Charlie Hebdo attack
Figure 7. Collective memory clusters (top) and overall activity of visitors in the clusters (bottom).

Finally, we conclude our exploration of the clusters of the learned graphs by providing in Table 1 a list of handpicked page titles inside each cluster that refer to previous events and related subjects. The connected events occurred outside of the 7-month period we analyze. This illustrates the associative features of our method. Firstly, pages are grouped into events with the help of visitor activity. Secondly, the events themselves are connected together by this activity. Memory is made of tightly connected groups of concepts, and these groups are in turn connected together through the concepts they share.

Charlie Hebdo attack              | Germanwings 9525 crash                 | Ferguson unrest
Porte de Vincennes hostage crisis | Inex-Adria Aviopromet Flight 1308      | Shooting of Tamir Rice
Al-Qaeda                          | Pacific Southwest Airlines Flight 1771 | Shooting of Amadou Diallo
Islamic terrorism                 | SilkAir Flight 185                     | Sean Bell Shooting Incident
Hezbollah                         | Suicide by pilot                       | Shooting of Oscar Grant
2005 London bombings              | Aviation safety                        | 1992 Los Angeles riots
Anders Behring Breivik            | Air France Flight 296                  | O.J. Simpson murder case
Jihadism                          | Air France Flight 447                  | Shooting of Trayvon Martin
2015 Baga massacre                | Airbus                                 | Attack on Reginald Denny
Table 1. Collective memories triggered by core events.

Recalling memories. In this section, our goal is to test the hypothesis that the proposed method, as a memory, allows recalling events from partial information. We emulate recall processes using the Hopfield network approach described in Section 4. We show that the learned graph structure can recover a memory (a cluster of pages and its activations) from an incomplete input.

We create incomplete patterns for the Hopfield network by randomly selecting a few pages contained in a chosen cluster. We build the input matrix by setting all the time-series to 0 (the inactive state), except those of the few selected pages. We then apply Eq. (2) iteratively.

The results of the experiment are illustrated with an example in Figure 8. We selected the cluster associated to the Charlie Hebdo attack and kept a subset of its pages as input. We used the graph learned for the month of January, when the memory was detected. After the recall (Fig. 8, right), most of the cluster is recovered at the correct position in time. Note that the model forgets a part of the activity, plotted in light red. This missing part is made of pages that are active outside of the time of the event, giving evidence that they are not directly related (or only weakly related) to the event.

To evaluate the performance of the remembering, we mute different fractions of the nodes in memory clusters and compute recall errors. We define the recall accuracy as the ratio of correctly recalled activations over the number of initial activations. The error is given by 1 − accuracy. Figure 10 shows the results of the evaluation. We consider three types of errors. First, we measure the accuracy of a recalled pattern over the 7 months (green). Second, the accuracy of a recalled pattern over a period of 72 hours after the start of the event (blue). And third, a relaxed accuracy of the recall during the same period of time, where an initially active node counts as correctly recovered if it is active at least once in the recalled pattern (red). As expected, the quality of the recalled memory increases with the quality of the input pattern. The better performance of the recall in the 72-hour zone is due to the memory effect, which focuses on the most active part of the cluster and forgets the small isolated activity spikes of individual pages scattered over the 7 months.
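These accuracy measures can be sketched as follows for binary activation matrices, where rows are nodes and columns are hours (the function names are ours):

```python
import numpy as np

def recall_accuracy(original, recalled):
    """Ratio of correctly recalled activations over the initial activations."""
    original, recalled = np.asarray(original), np.asarray(recalled)
    active = original == 1
    if not active.any():
        return 1.0
    return float((recalled[active] == 1).sum() / active.sum())

def recall_error(original, recalled):
    """Error = 1 - accuracy."""
    return 1.0 - recall_accuracy(original, recalled)

def relaxed_accuracy(original, recalled):
    """Relaxed variant: an initially active node counts as recovered
    if it is active at least once in the recalled pattern."""
    original, recalled = np.asarray(original), np.asarray(recalled)
    rows = original.any(axis=1)
    if not rows.any():
        return 1.0
    return float((recalled.any(axis=1) & rows).sum() / rows.sum())
```

The relaxed measure can be 1 even when the strict accuracy is low, since it ignores where in time a node's recalled activity falls.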

Figure 8. Recall of an event from a partial pattern (Charlie Hebdo attack). The red vertical lines define the start of the event and its most active part, ending 72 hours from the start. Left: full activity over time of the pages in the cluster. Middle: pattern with 20% of it set inactive. Right: result of the recall using the Hopfield network model of associative memory. In light red are shown the difference with the original pattern (the forgotten activity).
Figure 9. Recalled activity over time (October through April) for each monthly learned graph.

In Figure 9, we show the results of the recall process for each monthly learned graph when the entire time-series are given as input. The output of each graph is summed over nodes to create a global curve of activity over the 7 months, and the initial global activity is subtracted from this output to obtain the curve plotted in the figure. Red areas correspond to low activity in the output, hence an absence of memory in that zone; green areas represent positive recalls. The memory of each monthly graph corresponds to its learning period, which demonstrates the global effectiveness of the learning and recall process.
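The curve construction described above (sum the recalled activity per hour, then subtract the initial global activity) can be sketched as follows. The (pages × hours) layout and the function name are illustrative assumptions.

```python
import numpy as np

def memory_curve(initial, recalled):
    """Per-hour difference between total recalled activity and total
    initial activity. Positive values correspond to positive recalls
    (green zones); negative values indicate an absence of memory
    (red zones)."""
    return recalled.sum(axis=0) - initial.sum(axis=0)

# Toy series: the learned graph remembers hours 0-1 but not hours 2-3.
initial  = np.array([[1, 1, 1, 1],
                     [1, 1, 0, 1]])
recalled = np.array([[1, 1, 0, 0],
                     [1, 1, 0, 0]])
curve = memory_curve(initial, recalled)
print(curve)  # hours 2-3 fall below the initial activity (red zone)
```

A monthly graph would produce a curve that is positive mainly over its own learning period, as in Figure 9.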

Figure 10. Plot of the recall error rate with three different error measures.

7. Conclusions and future work

In this paper, we propose a new method for learning and remembering collective memories in an unsupervised manner. To extract memories, the method analyzes the Wikipedia web network and the hourly viewership history of its articles. The collective memories are summaries of events that raised the interest of Wikipedia visitors. The approach performs efficient knowledge retrieval from the Wikipedia encyclopedia, highlighting the changing interests of visitors over time and shedding some light on their collective behavior.

The approach scales to large datasets. Experimentally, we observed that the method is highly robust to parameter tuning; we will investigate this further in future work.

This work opens new avenues for the analysis of dynamic graph-structured data. For example, the proposed approach could serve as the basis of a framework for automated event detection, monitoring, and filtering in network-structured visitor activity streams. Such a framework is also of interest for understanding human memory and for simulating artificial memory.

8. Tools, implementation, code and online visualizations

All learned graphs (the overall September–April graph, the monthly activity graphs, and the localized events) are available online (Miz, 2017b) to foster further exploration and analysis. For graph visualization, we used the open-source software package Gephi (Bastian et al., 2009) with the ForceAtlas2 layout (Jacomy et al., 2014). We used Apache Spark GraphX (Xin et al., 2013), (Gonzalez et al., 2014) for the graph learning implementation and graph analysis. The presented results can be reproduced using the code for Wikipedia dynamic graph learning, written in Scala (Miz, 2017a). The dataset is available on Zenodo (Kirell et al., 2017).

We would like to thank Michaël Defferrard and Andreas Loukas for fruitful discussions and useful suggestions. The research leading to these results has received funding from the European Union’s H2020 Framework Programme (H2020-MSCA-ITN-2014) under grant agreement no 642685 MacSeNet.


  • Allport (1985) DA Allport. 1985. Distributed memory, modular subsystems and dysphasia. Current perspectives in dysphasia (1985), 32–60.
  • Assmann and Czaplicka (1995) Jan Assmann and John Czaplicka. 1995. Collective memory and cultural identity. New German Critique 65 (1995), 125–133.
  • Au Yeung and Jatowt (2011) Ching-man Au Yeung and Adam Jatowt. 2011. Studying how the past is remembered: towards computational history through large scale text mining. In Proceedings of the 20th ACM international conference on Information and knowledge management. ACM, 1231–1240.
  • Barash (2016) Jeffrey Andrew Barash. 2016. Collective Memory and the Historical Past. University of Chicago Press.
  • Bastian et al. (2009) Mathieu Bastian, Sebastien Heymann, Mathieu Jacomy, et al. 2009. Gephi: an open source software for exploring and manipulating networks. ICWSM 8 (2009), 361–362.
  • Benzi (2017) Kirell Maël Benzi. 2017. From recommender systems to spatio-temporal dynamics with network science. EPFL, Chapter 5, 97–98.
  • Blei et al. (2003) David M Blei, Andrew Y Ng, and Michael I Jordan. 2003. Latent Dirichlet allocation. Journal of Machine Learning Research 3, Jan (2003), 993–1022.
  • Blondel et al. (2008) Vincent D Blondel, Jean-Loup Guillaume, Renaud Lambiotte, and Etienne Lefebvre. 2008. Fast unfolding of communities in large networks. Journal of statistical mechanics: theory and experiment 2008, 10 (2008), P10008.
  • Coman et al. (2009) Alin Coman, Adam D Brown, Jonathan Koppel, and William Hirst. 2009. Collective memory from a psychological perspective. International Journal of Politics, Culture, and Society IJPS 22, 2 (2009), 125–141.
  • Ferron (2012) Michela Ferron. 2012. Collective memories in Wikipedia. Ph.D. Dissertation. University of Trento.
  • Ferron and Massa (2011) Michela Ferron and Paolo Massa. 2011. Studying collective memories in Wikipedia. Journal of Social Theory 3, 4 (2011), 449–466.
  • Ferron and Massa (2012) Michela Ferron and Paolo Massa. 2012. Psychological processes underlying Wikipedia representations of natural and manmade disasters. In Proceedings of the Eighth Annual International Symposium on Wikis and Open Collaboration. ACM, 2.
  • Foundation (2016) Wikimedia Foundation. 2016. Wikimedia Downloads: Database tables as sql.gz and content as XML files. (May 2016). Accessed: 2016-May-16.
  • García-Gavilanes et al. (2017) Ruth García-Gavilanes, Anders Mollgaard, Milena Tsvetkova, and Taha Yasseri. 2017. The memory remains: Understanding collective memory in the digital age. Science Advances 3, 4 (2017), e1602368.
  • Gonzalez et al. (2014) Joseph E Gonzalez, Reynold S Xin, Ankur Dave, Daniel Crankshaw, Michael J Franklin, and Ion Stoica. 2014. GraphX: Graph Processing in a Distributed Dataflow Framework.. In OSDI, Vol. 14. 599–613.
  • Graus et al. (2017) David Graus, Daan Odijk, and Maarten de Rijke. 2017. The birth of collective memories: Analyzing emerging entities in text streams. arXiv preprint arXiv:1701.04039 (2017).
  • Halbwachs (2013) Maurice Halbwachs. 2013. Les cadres sociaux de la mémoire. Albin Michel.
  • Hebb (2005) Donald Olding Hebb. 2005. The organization of behavior: A neuropsychological theory. Psychology Press.
  • Hopfield (1982) John J Hopfield. 1982. Neural networks and physical systems with emergent collective computational abilities. Proceedings of the national academy of sciences 79, 8 (1982), 2554–2558.
  • Jacomy et al. (2014) Mathieu Jacomy, Tommaso Venturini, Sebastien Heymann, and Mathieu Bastian. 2014. ForceAtlas2, a continuous graph layout algorithm for handy network visualization designed for the Gephi software. PloS one 9, 6 (2014), e98679.
  • Kanhabua et al. (2014) Nattiya Kanhabua, Tu Ngoc Nguyen, and Claudia Niederée. 2014. What triggers human remembering of events? A large-scale analysis of catalysts for collective memory in Wikipedia. In Digital Libraries (JCDL), 2014 IEEE/ACM Joint Conference on. IEEE, 341–350.
  • Kirell et al. (2017) Kirell Benzi, Volodymyr Miz, Benjamin Ricaud, and Pierre Vandergheynst. 2017. Wikipedia time-series graph. (Sept. 2017).
  • Miz (2017a) V. Miz. 2017a. Wikipedia graph mining. GraphX implementation. (2017).
  • Miz (2017b) V. Miz. 2017b. Wikipedia graph mining. Visualization. (2017).
  • Pentzold (2009) Christian Pentzold. 2009. Fixing the floating gap: The online encyclopaedia Wikipedia as a global memory place. Memory Studies 2, 2 (2009), 255–272.
  • Stone et al. (1999) Arthur A Stone, Christine A Bachrach, Jared B Jobe, Howard S Kurtzman, and Virginia S Cain. 1999. The science of self-report: Implications for research and practice. Psychology Press.
  • Tinati et al. (2016) Ramine Tinati, Markus Luczak-Roesch, and Wendy Hall. 2016. Finding Structure in Wikipedia Edit Activity: An Information Cascade Approach. In Proceedings of the 25th International Conference Companion on World Wide Web. International World Wide Web Conferences Steering Committee, 1007–1012.
  • Xin et al. (2013) Reynold S Xin, Joseph E Gonzalez, Michael J Franklin, and Ion Stoica. 2013. Graphx: A resilient distributed graph system on spark. In First International Workshop on Graph Data Management Experiences and Systems. ACM, 2.
  • Yucesoy and Barabási (2016) Burcu Yucesoy and Albert-László Barabási. 2016. Untangling performance from success. EPJ Data Science 5, 1 (2016), 17.