
Searching Heterogeneous Personal Digital Traces

Digital traces of our lives are now constantly produced by various connected devices, internet services and interactions. Our actions result in a multitude of heterogeneous data objects, or traces, kept in various locations in the cloud or on local devices. Users have very few tools to organize, understand, and search the digital traces they produce. We propose a simple but flexible data model to aggregate, organize, and find personal information within a collection of a user's personal digital traces. Our model uses as basic dimensions the six questions: what, when, where, who, why, and how. These natural questions model universal aspects of a personal data collection and serve as unifying features of each personal data object, regardless of its source. We propose indexing and search techniques to aid users in searching for their past information in their unified personal digital data sets using our model. Experiments performed over real user data from a variety of data sources such as Facebook, Dropbox, and Gmail show that our approach significantly improves search accuracy when compared with traditional search tools.



1. Introduction

Digital traces of our lives are constantly being produced and saved by users, either actively in files, emails, social media interactions, multimedia objects, calendar items, contacts, etc., or passively via various applications such as GPS tracking of mobile devices, records of usage, records of financial transactions, web search records or quantified self-sensor usage. These “personal digital traces” are different from traditional personal files; they are typically (but not always) smaller, heterogeneous, and accessible through a wide variety of different portals and interfaces, such as web forms, APIs or email notifications; or directly stored in files used by apps on our devices. These traces reflect a chronicle of the user’s life, keeping record of where the user went, who the user interacted with (online or in real-life), what the user did, and when. However, the large quantity of personal data available, and the fact that data is stored in multiple decentralized systems, in heterogeneous formats, makes it challenging for users to interact with their data and perform even simple searches.

Our goal is to give back to individual users easy and flexible access to their own data. In (DataExtraction, ) we proposed an extraction tool that implements access to a variety of data sources, retrieving the decentralized data and storing it in a single database. Personal data is highly sensitive; consequently, privacy and ethical issues have to be considered when dealing with this type of information. Due to privacy concerns, the downloaded data is stored on the user’s own hard drive, and aggregate query answers that we wish to see for experimental purposes must be approved by the user. A more elaborate scheme for preserving privacy in personal information management is discussed in (Abiteboul:2015, ). The work discussed in this paper is developed as part of a series of tools that let users retrieve, store and organize their digital traces on their own devices (ExploreDBValia, ; odbase, ; DataExtraction, ), guaranteeing some clear privacy and security benefits.

Work in Cognitive Psychology (wagenaar86, ; brewer88, ; sevensinsmemory, ; PIMbook, ) has shown that contextual cues are strong triggers for autobiographical memories. Abowd et al. (Abowd:1999, ) and Dey (Dey:2001, ) define context as any information that can be used to characterize the situation of an entity (person, place, object,…). This suggests that a natural way to remember and learn from past events is to include any pertinent contextual information when organizing and searching personal data. Personal information can be modeled and indexed along six dimensions that mirror the basic interrogative words: what, who, when, where, why, and how. Each personal digital trace is a source of knowledge. For instance, a simple Facebook post may contain enough information to identify where a user went, what they did, who they interacted with, and when. Multiple traces, from the same or different data sources, are often related to each other. The correlation between data traces can be identified through common information such as time and location. Even though multiple data traces may share common information, they may have significantly different structures. This heterogeneity presents a major challenge. Thus, in this work, we propose a data model that can effectively represent this heterogeneous data in a way that helps users find pieces of information again.

Search of personal data is usually focused on retrieving information that users know exists in their own data set, even though most of the time they do not know in which source or device they have seen the desired information. Current search tools such as Spotlight and Gmail search are not adequate for this scenario: the user has to perform the same search multiple times on different services and/or devices, since each tool searches over a single service only. Besides, traditional searches are often inefficient as they typically identify too many matching documents. In addition to the unified data model, we propose scoring and searching techniques that allow personal information search over distributed data from multiple services and devices integrated in a unified data set.

In this paper, we make the following contributions:

  • A unified and intuitive multidimensional data model to link and represent heterogeneous personal digital traces. The model, called w5h, uses those six dimensions to unify features of each personal data object, regardless of its source. (Section 2).

  • A frequency-based scoring methodology for searching personal digital traces. Our scoring, named w5h-f, is based on our multidimensional data model and leverages entity interactions within and across dimensions in the data sets. (Section 3)

  • An implementation of our techniques, from data extraction, to entity recognition, classification and index structures, that will be used as the basis of our experimental evaluation. (Section 4)

  • A thorough qualitative evaluation of our proposed w5h scoring and search techniques, as well as a comparison with two popular existing search tools, Solr (ApacheSolr, ) and Spotlight (Spotlight, ), and two scoring techniques, TFIDF (Salton:88, ) and BM25 (Robertson96somesimple, ), on real data using both manually designed and synthetically generated search queries. Our results show that our scoring model results in improved search accuracy. (Section 5)

We discuss related work in Section 6 and conclude with future work directions in Section 7.

2. w5h Data Model

We propose a data model that relies on the context in which personal data traces are created, produced and gathered to integrate heterogeneous traces into a unified data model that will support accurate searches. The proposed model, called w5h, was derived from the following observations:

  1. Personal digital traces are rich in contextual information, in the form of metadata, application data, or environment knowledge.

  2. Personal digital traces can be represented following a combination of dimensions that naturally summarize various aspects of the data collection: who, when, where, what, why and how.

Our w5h model uses these six dimensions as the unifying features of each personal digital trace object, regardless of its source. Using these natural questions as the main facets of data representation will also allow the combination of our data representation with a natural and intuitive query model for searching information in digital traces. Listed below are some examples of dimensional data that can be extracted from a user’s personal digital traces:

  • what: messages, messages subjects, publications, description of events, description of users, list of interests of a user.

  • who: user names, senders, recipients, event owners, lists of friends, authors.

  • where: hometown, location, event venue, file/folder path, URL.

  • when: birthday, file/message/event created-/modified-time, event start/end time.

  • why: sequences of data/events that are causally connected.

  • how: application, device, environment.

Figure 1 presents a digital trace from a Facebook post with each piece of information identified as belonging to one of the six proposed dimensions (what, who, where, when, why and how). Even though digital traces come from different sources and have their own data schemas, they can be unified using the six dimensions of our w5h model. For instance, two separate traces that have John Smith and/or Anna Smith under the same dimension who (for example a Facebook image tagging Anna Smith, or a tweet mentioning John Smith) can be linked by our unified model. Details on the implementation of the dimension classification and entity resolution are given in Sections 4.2 and 4.3. The w5h model is used both to unify heterogeneous digital trace data from different sources, and to link digital traces using the six proposed dimensions.

Figure 1. Simplified example of a user Facebook post classified according to the w5h model.
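As a minimal sketch of how a trace could be held in the w5h model, the Python fragment below stores each dimension as a set of items and links two traces through a shared who entity. The class name `W5hObject`, the helper `linked_by_who`, and the two sample traces are hypothetical and only illustrate the idea; they are not taken from our implementation.

```python
from dataclasses import dataclass, field


@dataclass
class W5hObject:
    """One personal digital trace; each dimension holds zero or more items."""
    what: set = field(default_factory=set)
    who: set = field(default_factory=set)
    where: set = field(default_factory=set)
    when: set = field(default_factory=set)
    why: set = field(default_factory=set)
    how: set = field(default_factory=set)


def linked_by_who(a: W5hObject, b: W5hObject) -> set:
    """Return the entities that two traces share in the who dimension."""
    return a.who & b.who


# Hypothetical traces coming from two different sources.
photo = W5hObject(what={"photo"}, who={"Anna Smith", "John Smith"}, how={"Facebook"})
tweet = W5hObject(what={"march for science"}, who={"John Smith"}, how={"Twitter"})

print(linked_by_who(photo, tweet))  # {'John Smith'}
```

Using sets for every dimension keeps the representation identical across sources, so linking reduces to a set intersection once entities have been resolved to a common name.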

The why dimension is not explored in this paper. This dimension can be derived by inference and could be used to connect different fragments of data. For instance, if a value could be inferred for the why dimension of the Facebook post in Figure 1, it could be used to connect this data to a possible message thread. In (ExploreDBValia, ; odbase, ) we explored connections, in the form of plans, between events involving personal data traces; the plans, or tasks, connecting these events give a contextual link as to why the corresponding digital traces were created or produced.

In the next section we will explore indexing and searching techniques over sets of personal digital traces using the proposed w5h model.

3. w5h Scoring Model

We leverage the w5h model presented in the previous section to provide rich and accurate search capabilities over personal digital traces. Unlike Web search, where the focus is often on discovering new relevant information, search in personal data sets is typically focused on retrieving relevant information that the user knows exists in their data set. In this scenario, standard search techniques are not ideal as they do not leverage the additional knowledge the user is likely to have about the target object, or the connections between objects pertaining to a given user.

As pointed out in (wagenaar86, ), users tend to remember their actions using the six natural questions; thus, using them to guide search is a logical approach. We now evaluate the potential benefits of the w5h model for integrating and searching personal data. Specifically, we propose a search mechanism that supports queries containing conditions along each of the six interrogative dimensions. Our proposed search relies on a novel frequency-based scoring methodology over the w5h data model, called w5h-f, which is detailed in this section.

3.1. Scoring Methodology

To illustrate our query and scoring methodology, let us consider the following search scenario: the user is interested in message(s) from John Smith and/or Anna Smith about the 2017 March for Science. We consider each digital trace to be a distinct object that can be returned as the result of a query.

Definition 3.1 (Object in w5h Integrated Dataset).

An object in the data set is a structure that has fields corresponding to the 6 dimensions mentioned earlier. Each of these dimensions contains 0 or more items (corresponding to text, entities identified by entity resolution, times, locations, etc.). The fields of an object are accessed using one accessor function per dimension.

Formal queries have the same structure as objects in the unified data set. In the example above, the query has three filled dimensions: March for Science (what); John Smith, Anna Smith (who); 2017 (when).

Given objects Q and O, O is considered as an answer to object Q treated as a query if it contains at least one of the dimensions specified in Q. In looking for (partially) matching objects to a given query, each dimension will be searched separately, and the results will be combined according to a scoring function, generating a rank-ordered list of candidates. The choice of scoring function can be application dependent. We propose our frequency-based scoring function, w5h-f, below.
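The matching condition above can be sketched as follows. The fragment assumes that objects and queries are plain dictionaries from dimension name to a list of items, and that a dimension "matches" when the two share at least one identical item; both are simplifying assumptions, and the function names are hypothetical.

```python
DIMENSIONS = ("what", "who", "where", "when", "why", "how")


def matching_dimensions(query: dict, obj: dict) -> list:
    """List the dimensions on which obj shares at least one item with query."""
    matched = []
    for dim in DIMENSIONS:
        wanted = set(query.get(dim, ()))
        present = set(obj.get(dim, ()))
        if wanted & present:
            matched.append(dim)
    return matched


def is_answer(query: dict, obj: dict) -> bool:
    """An object is a candidate answer if any queried dimension matches."""
    return bool(matching_dimensions(query, obj))


query = {
    "what": ["march for science"],
    "who": ["John Smith", "Anna Smith"],
    "when": ["2017"],
}
obj = {"who": ["John Smith"], "what": ["lunch"], "when": ["2016"]}

print(matching_dimensions(query, obj))  # ['who']
```

The object above matches on who only, so it is still returned as a partial match; the scoring function then decides how far down the ranked list it falls.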

3.2. w5h-f Scoring

Because personal digital traces are byproducts of users’ actions and events, they are not independent objects. Our intuition is that the correlation between traces (objects) can be leveraged to improve the accuracy of search results. For example, if the March for Science query from Section 3.1 returns several potential matches, one from Alice Jones and one from Bob White, we may want to score the one from Alice higher if she communicates as a group with the user, Anna Smith, and John Smith more frequently than Bob White does.

Our w5h-f scoring scheme uses the correlation between users (or entities) and how they interact over time to rank an object. Because we are focusing on personal digital traces, all the data articulates around a user. By analyzing the data collected by our Extraction Tool (DataExtraction, ) (Section 4.1), we observed a strong correlation between the user (owner of the data) and multiple users (who groups), through times (who, when), location (who, where) and data sources (who, how). For instance, in one of the datasets, 94.9% of the objects have more than 2 users (who), 95.7% of objects have at least one date (when), 99.9% of objects have content (what) and only 1.5% of the objects have location (where). Our scoring exploits those interactions and correlations by way of a frequency score.[1] Frequencies can be computed for individual users or groups of users. They can be associated with multiple times, multiple data sources, and also with a set of locations. For example, from a set of emails exchanged between a group of users, we can extract the frequency (number of interactions) with which those users communicated, and in which time period those interactions occurred. In short, frequency expresses the strength of relationships, based on users, time, location and data sources (who, when, where, how).

[1] Our model is focused around personal digital traces and as such we included this specific group of correlations in our scoring. Other application scenarios could also benefit from our w5h model, with other group and pairwise correlations highlighted in a dedicated frequency-based scoring. For instance, traces from weather sensors could have strong pairwise (where, when) or (where, how) correlations.

procedure Compute-Frequency(source)
    /* object(source) retrieves all objects from a given source. */
    for each O ∈ object(source) do
        group ← O.get('who')
        times ← O.get('when')
        locations ← O.get('where')
        for each time ∈ times do
            increment the frequency of (group, time)
            for each user ∈ group do
                increment the frequency of (user, time)
            end for
        end for
        for each user ∈ group do
            increment the frequency of user
        end for
        increment the frequency of group
        for each location ∈ locations do
            increment the frequency of location
        end for
    end for
end procedure
Algorithm 1 Frequency algorithm

Algorithm 1 shows how frequencies are computed across multiple dimensions. Initially, a list of objects is retrieved for each data source. For each object, the algorithm extracts groups of users, times and locations. Then, the following frequencies are computed:

  • Frequency of each individual user: number of objects that mention a user in the who dimension.

  • Frequency of a group of users: number of objects mentioning a group of users. If {a,b,c} is the group mentioned, frequencies of subgroups of {a,b,c}, e.g. {a,b} and {b,c}, are not counted.

  • Frequency of each individual user at specific times: number of objects that mention a user at matching times. Time is normalized, so variations are also considered. For instance, a query searching for June will match objects with time June 2016 and June 2017.

  • Frequency of a group of users at specific times: number of objects mentioning the group at a specific time.

  • Frequency of a location: number of objects that mention a location.

Besides computing the frequencies per source, we also compute the total frequency of a user, group of users, times and locations by combining the individual results obtained for each data source. For simplicity, in Algorithm 1, every time a user or group of users has an interaction, the frequency is increased by one; in practice, however, the algorithm allows us to weight distinct types of interactions differently. For example, likes and comments on a Facebook post could be weighted differently, giving more relevance to interactions coming from comments than from likes. Different roles, e.g. From and To in an email, can also be weighted differently.
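The frequency computation of Algorithm 1 can be sketched in Python as below. This is an illustrative version under simplifying assumptions: objects are dictionaries with optional 'who', 'when' and 'where' lists, every interaction has weight 1, and times are compared as exact strings (no normalization). The function and key names are hypothetical.

```python
from collections import Counter


def compute_frequencies(objects):
    """Count user, group, (user, time), (group, time) and location frequencies.

    Each object is a dict with optional 'who', 'when' and 'where' lists.
    Every occurrence adds 1; subgroups of a group are deliberately not counted.
    """
    freq = {
        "user": Counter(),
        "group": Counter(),
        "user_time": Counter(),
        "group_time": Counter(),
        "location": Counter(),
    }
    for obj in objects:
        group = frozenset(obj.get("who", ()))
        times = set(obj.get("when", ()))
        locations = set(obj.get("where", ()))
        if group:
            freq["group"][group] += 1
        for user in group:
            freq["user"][user] += 1
        for time in times:
            if group:
                freq["group_time"][(group, time)] += 1
            for user in group:
                freq["user_time"][(user, time)] += 1
        for location in locations:
            freq["location"][location] += 1
    return freq


def total_frequencies(per_source):
    """Combine per-source counters into totals across all data sources."""
    total = {}
    for freq in per_source.values():
        for name, counter in freq.items():
            total.setdefault(name, Counter()).update(counter)
    return total


# Hypothetical emails from one source.
emails = [
    {"who": ["anna", "john"], "when": ["2017-04"]},
    {"who": ["anna", "john"], "when": ["2017-04"]},
    {"who": ["anna"], "when": ["2017-05"], "where": ["Washington"]},
]
freq = compute_frequencies(emails)
print(freq["user"]["anna"])                        # 3
print(freq["group"][frozenset({"anna", "john"})])  # 2
```

Representing a group as a `frozenset` makes it usable as a counter key and ensures that {a, b} and {a, b, c} are counted as distinct groups. Weighted interactions could be supported by adding a per-object weight instead of 1.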

Definition 3.2 (Similarity Score).

Given a query Q, an object O, and the frequencies above, we define the score of O with respect to Q over the following quantities:

  • the group of users in the who dimension of O, each user in that group, each time in the when dimension, the data source of O, and each location in the where dimension;

  • the frequency of the group of users appearing together in the same object;

  • for each user, the total frequency of the user across all data sources, and the frequency of the user in the data source of the object;

  • a time-match term, whose value depends on whether the date from the query matches the object;

  • for each user and time, the total frequency of the user at that time across all data sources, and the frequency of the user at that time in the data source of the object;

  • the total frequency of the group of users at each time;

  • the frequency of each location;

  • a source-match term for the score of an object for a given source, whose value depends on whether the service from the query matches the object;

  • a text-based score for the object, using any chosen scoring function (e.g., TFIDF, BM25,…).

The equation in Definition 3.2 assumes that a query has all 4 dimensions who, when, where and how; if a dimension does not exist in a query, the equation term corresponding to that dimension will be .

Let us consider the query (what: March for Science; who: John Smith, Anna Smith; when: 2017), and the object illustrated in Figure 1 (Section 2). According to the w5h-f methodology, the object will have the following score:

where Facebook

4. Search Implementation

We have presented a model to integrate personal digital traces into a unifying multi-dimensional data model in Section 2. In Section 3, we proposed a scoring methodology that leverages this data model to search heterogeneous data across all six dimensions while taking advantage of the inherent correlation between data objects in the scoring. We now discuss our search implementation in detail.

4.1. Data Retrieval

To create a data set of personal digital traces, we use the extraction tool proposed in (DataExtraction, ) to identify and retrieve data from current popular services and sources of digital traces. The data retrieved is stored in its original format to avoid mistakes that could lead to missing relevant data. All the data collected by the tool is stored in MongoDB, a NoSQL database that is already optimized for semi-structured data, with the data from each service stored in its own collection. We are constantly adding and revising sources of personal digital traces; the current implementation includes email services (Gmail), social network interactions (Facebook, LinkedIn, Twitter), location services (GPS, Foursquare), file management (Dropbox, Local Filesystem), browsing data (Firefox, Chrome), financial data (Mint, bank accounts), and calendars (Google Calendar).

In the next section we will present how the raw data retrieved can be parsed and mapped into the w5h model proposed in Section 2.

4.2. Classification

Having defined the w5h model (Section 2), it is still necessary to find an effective mechanism to translate the heterogeneous set of personal data into the six dimensions. The dynamic nature of data sources, especially the rapid rate of change in the service APIs, and the fact that new sources can be added into the extraction tool, also pose a challenge.

Digital traces have their own structures but most are retrieved in a semi-structured data format (typically JSON through APIs), or are extracted along with some metadata. We implemented parsers to represent the raw data from each source in the w5h model, thus unifying the data downloaded into a single data collection. The identification of data according to the six dimensions is done by analyzing the data available to be retrieved for each data source implemented and then building a dictionary of words/labels for each w5h dimension. Much of the classification is intuitive; for instance, the words From and To should be classified under the who dimension, while the words Subject and Body should be classified as what. Text messages are classified as what, even though some specific information derived from content could be classified differently (e.g., “I went to the market today” gives when (“today”), where (“market”) and who (“I”)). Note that the how and why dimensions are more ambiguous. For now, we consider how as the type of information recorded, e.g., a Facebook comment. The why dimension is not explored in this paper; it is derived from inference and can be used to connect events (ExploreDBValia, ; odbase, ).
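The dictionary-based mapping described above can be illustrated with a short sketch. The label dictionary below is hypothetical and deliberately tiny (only From, To, Subject and Body come from the text; the remaining labels and the function name `to_w5h` are assumptions made for the example).

```python
# Hypothetical label dictionary; real sources expose many more field names.
LABEL_TO_DIMENSION = {
    "from": "who",
    "to": "who",
    "subject": "what",
    "body": "what",
    "created_time": "when",
    "place": "where",
}


def to_w5h(raw: dict, source: str) -> dict:
    """Map a flat raw record into w5h dimensions; unknown labels are skipped."""
    obj = {dim: [] for dim in ("what", "who", "where", "when", "why", "how")}
    obj["how"].append(source)  # how = type of information recorded
    for label, value in raw.items():
        dim = LABEL_TO_DIMENSION.get(label.lower())
        if dim is None:
            continue
        values = value if isinstance(value, list) else [value]
        obj[dim].extend(values)
    return obj


email = {"From": "john@example.com", "To": ["anna@example.com"],
         "Subject": "March for Science", "X-Mailer": "ignored"}
print(to_w5h(email, "Gmail email")["who"])
# ['john@example.com', 'anna@example.com']
```

A fixed dictionary like this one has to be revised whenever a service changes its field names, which is part of the motivation for a learned classifier.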

We designed a machine learning multi-class classifier that automatically maps the raw data from each source into the w5h dimensions. The input data to the w5h classifier is a set of sentences and w5h labels. For instance, in Figure 1 (Section 2) each line corresponds to a sentence/label pair. Each sentence is then transformed into embedding vectors by a Word2vec algorithm, and labels are reshaped into one-hot encoded binary matrices. Architectures were built combining LSTM (Long Short-Term Memory) (LSTM, ) and Dense layers. Dropout (dropout, ) was used in some architectures to reduce the complexity of the model, with the goal of preventing overfitting. Parameters were evaluated using a 5-fold cross-validation process to estimate the performance of the models. We use categorical cross-entropy as the training criterion (loss function) and Adam as the optimization algorithm for our models. The evaluation was conducted using the dataset User 2 described in Table 1. We achieve accuracy over 99.9%. The confusion matrix in Figure 2 shows the accuracy of the model for dataset User 1 (Table 1), using the training data from User 2, with the true labels represented on the y-axis and predicted labels on the x-axis. All correct predictions are located on the diagonal of the table. The results indicate that a machine learning classifier can accurately translate a dynamic and heterogeneous set of personal data into the w5h model.

Our implementation uses the classifier to translate raw data into the w5h model and does not require user intervention.
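To make the label encoding and training criterion concrete, the following sketch shows one-hot encoding of the six w5h labels and the categorical cross-entropy loss for a single example. It uses only the Python standard library and is not the classifier itself; the probability vectors are made-up values chosen for illustration.

```python
import math

LABELS = ["what", "who", "where", "when", "why", "how"]


def one_hot(label: str) -> list:
    """Encode a w5h label as a binary vector with a single 1."""
    return [1.0 if label == name else 0.0 for name in LABELS]


def categorical_cross_entropy(target: list, predicted: list) -> float:
    """Loss for one example: -sum(target_i * log(predicted_i))."""
    eps = 1e-12  # avoids log(0)
    return -sum(t * math.log(p + eps) for t, p in zip(target, predicted))


target = one_hot("who")
confident = [0.01, 0.95, 0.01, 0.01, 0.01, 0.01]  # made-up model outputs
unsure = [0.2, 0.3, 0.2, 0.1, 0.1, 0.1]

print(one_hot("who"))  # [0.0, 1.0, 0.0, 0.0, 0.0, 0.0]
print(categorical_cross_entropy(target, confident)
      < categorical_cross_entropy(target, unsure))  # True
```

Because the target vector has a single 1, the loss reduces to the negative log-probability assigned to the true dimension, so confident correct predictions are penalized least.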

Figure 2. Confusion matrix with predictions for dataset User 1. The model was trained using dataset User 2.

4.3. Entity Resolution

Our scoring technique (Section 3) relies on frequency scoring of the same entity across objects. To make this possible, we need to identify separate instances of the same entity in data traces coming from the same sources, and across sources. For instance, the same person may appear in different services using variations of their names and email addresses.

The impact of entity resolution on search performance will be discussed in Section 5.

Entity Resolution for the who dimension. Almost 100% of the personal data retrieved has information associated with the who dimension. Our goal is to identify unique entities (persons) that may be referred to differently (e.g. by different email addresses). The first step to solve the ER problem for the who dimension is to process the entire user data set and extract all information classified under who, for example names and email addresses. We use the Stanford Entity Resolution Framework (SERF), a generic open-source infrastructure for Entity Resolution (ER) (SERF, ), to identify entities. SERF uses the algorithm of (Benjelloun2009, ), proved to be optimal in the number of record comparisons in worst-case scenarios. Using SERF, person entities are identified and grouped into final entities that are stored in MongoDB in a separate collection.
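The grouping step can be illustrated with a deliberately simple stand-in. The sketch below is not SERF and does not implement its algorithm; it merges records that share an identical normalized name or email address using a union-find structure, which is a much weaker matching rule. All record values are hypothetical.

```python
def resolve_entities(records):
    """Group records that share a normalized name or email address.

    records: list of dicts with optional 'name' and 'email' keys.
    Returns a list of sets of record indexes, one set per resolved entity.
    """
    parent = list(range(len(records)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    def union(i, j):
        parent[find(i)] = find(j)

    seen = {}  # (key type, normalized value) -> first record index seen
    for index, record in enumerate(records):
        for key in ("name", "email"):
            value = record.get(key)
            if not value:
                continue
            token = (key, value.strip().lower())
            if token in seen:
                union(index, seen[token])
            else:
                seen[token] = index

    groups = {}
    for index in range(len(records)):
        groups.setdefault(find(index), set()).add(index)
    return list(groups.values())


records = [
    {"name": "John Smith", "email": "jsmith@example.com"},
    {"name": "J. Smith", "email": "jsmith@example.com"},
    {"name": "john smith", "email": "john@example.org"},
    {"name": "Anna Smith", "email": "anna@example.com"},
]
print(sorted(sorted(g) for g in resolve_entities(records)))
# [[0, 1, 2], [3]]
```

Records 0 and 1 are merged through the shared email address, and record 2 joins them through the shared (case-insensitive) name, showing how matches chain transitively into one final entity.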

Entity Resolution for the where dimension. The same where location can be represented in multiple, ambiguous and error-prone ways. To disambiguate and match location data, we used Google Geocoding, Google Places API and SERF. We start by using Google Maps to disambiguate places that appear under different names and to augment the existing data. Besides dealing with multilingual places, Google Geocoding and Google Places API have the advantage of generating location-based data under the same format. For instance, Google Maps recognizes that Greece, Hellas, Ελλαδα and Grecia are the same location. However, there are a number of challenges to be faced. In most scenarios, given an ambiguous location (e.g. Student Center), the Google Maps API outputs a set of results instead of a unique address, making it difficult to identify which one of the listed addresses is the target place. To overcome this issue, we rank all addresses returned by a Google Maps search using a tf (term frequency) function computed based on the user’s data set. For example, consider a set of results returned by the API search; the set of addresses includes an address in France; if the user’s data set does not have any data related to France, the address in France will be associated with a low tf. Similarly, when Google Maps API does not return any result for a given search, we augment the location search by using information from other related digital traces. We then use SERF for deduplication and record linkage for all the locations that have the same geocoded address information or geographical coordinates (longitude, latitude).
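The tf-based ranking of candidate addresses can be sketched as follows. The fragment assumes the candidate addresses have already been obtained from a geocoding service (no API call is made here), and it scores each candidate by how often its terms occur in the user's own data. The addresses, texts and function names are hypothetical.

```python
import re
from collections import Counter


def tokenize(text: str) -> list:
    """Lowercase a string and split it into alphanumeric tokens."""
    return re.findall(r"\w+", text.lower())


def rank_addresses(candidates, user_texts):
    """Order candidate addresses by how often their terms occur in user data."""
    term_counts = Counter()
    for text in user_texts:
        term_counts.update(tokenize(text))

    def tf_score(address: str) -> int:
        # Count each distinct address term once, weighted by its frequency.
        return sum(term_counts[term] for term in set(tokenize(address)))

    return sorted(candidates, key=tf_score, reverse=True)


candidates = [
    "Student Center, Paris, France",
    "Student Center, New Brunswick, NJ, USA",
]
user_texts = [
    "Lunch in New Brunswick tomorrow?",
    "Meeting at the student center, New Brunswick",
]
print(rank_addresses(candidates, user_texts)[0])
# Student Center, New Brunswick, NJ, USA
```

The address in France scores low because none of its distinguishing terms appear in the user's data, which mirrors the example given above.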

4.4. Retrieval

When a query is submitted, each dimension is individually matched against the user’s data set using the above pre-computed indexes. Each separate search returns a list of objects that partially match the query for a given dimension, which are then scored using the w5h-f scoring function (Section 3.2). The current (unoptimized) implementation scores all matching objects and generates a ranked list of results. We are focusing our current efforts on validating the qualitative performance of our w5h-f scoring model. We plan to investigate dedicated optimized w5h index structures in the future.
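A minimal sketch of this retrieval flow is given below, assuming one in-memory inverted index per dimension and exact (case-insensitive) item matching. The scoring function is passed in as a parameter; `overlap_score` is only a placeholder that counts matched items and is not the w5h-f function. All names and sample objects are hypothetical.

```python
from collections import defaultdict


def build_indexes(objects):
    """Build one inverted index per dimension: item -> set of object ids."""
    indexes = defaultdict(lambda: defaultdict(set))
    for obj_id, obj in objects.items():
        for dim, items in obj.items():
            for item in items:
                indexes[dim][item.lower()].add(obj_id)
    return indexes


def retrieve(query, indexes, objects, score):
    """Match each dimension separately, then rank the union of candidates."""
    candidates = set()
    for dim, items in query.items():
        for item in items:
            candidates |= indexes[dim].get(item.lower(), set())
    return sorted(candidates, key=lambda i: score(query, objects[i]), reverse=True)


def overlap_score(query, obj):
    """Placeholder score: number of query items the object matches."""
    return sum(
        len({q.lower() for q in items} & {o.lower() for o in obj.get(dim, ())})
        for dim, items in query.items()
    )


objects = {
    "o1": {"who": ["John Smith"], "what": ["march for science"], "when": ["2017"]},
    "o2": {"who": ["John Smith"], "what": ["lunch"], "when": ["2016"]},
    "o3": {"who": ["Bob White"], "what": ["budget"], "when": ["2015"]},
}
query = {"what": ["march for science"], "who": ["John Smith", "Anna Smith"], "when": ["2017"]}

indexes = build_indexes(objects)
print(retrieve(query, indexes, objects, overlap_score))  # ['o1', 'o2']
```

Object o3 matches no queried dimension and is never scored, while o2 is kept as a partial match ranked below o1.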

5. Experimental Evaluation

We now evaluate the efficacy of the w5h-f search approach by comparing its performance with two popular existing search tools, Solr (ApacheSolr, ) (using different scoring methodologies: TFIDF, BM25, and field-based BM25), and Spotlight (Spotlight, ). In this section, we first describe our evaluation methodology. Then, we explore the accuracy of the search approach for a set of search scenarios manually designed to be representative of possible user queries. Finally, we explore the accuracy of the search approach using a much larger set of synthetically generated searches.

5.1. Methodology

5.1.1. Data Set.

                      User 1              User 2
Data Source        #Objs   Size        #Objs   Size
Facebook            1493   9 MB         2384   19 MB
Gmail               1136   107 MB      10926   1 GB
Dropbox                -   -             573   32 MB
Foursquare             -   -              55   59 KB
Twitter                -   -            2062   10 MB
Google Calendar        2   9 KB          209   389 KB
Google+                1   1 KB          102   343 KB
Google Contacts      157   158 KB        427   430 KB
Total               2789   116 MB      16738   1.4 GB
Table 1. Personal data sets for two users

There is a dearth of synthetic data sets and benchmarks to evaluate search over personal data. This challenge has only been exacerbated by the recent explosion in the amount of personal digital traces, as well as the varied services that create, collect, and store them. Thus, we perform our evaluation using a real data set collected by our extraction tool (DataExtraction, ) for two users.

Table 1 shows two real user data sets along with the number and size of objects retrieved from different sources over different periods of time. These two data sets will be used to evaluate the w5h scoring approach proposed in Section 3.

5.1.2. Evaluation Techniques.

Solr. Solr (ApacheSolr, ) is a popular open source full-text search platform from the Apache Lucene project. For the experiments in this section, we integrate all data retrieved by the extraction tool, from each different data source, in a unified collection. This approach allows users to search for information across the entire set of retrieved digital traces, which is already a significant step forward from the current state. We consider three different scoring methods in conjunction with Solr: TFIDF, BM25, and field-based BM25, where the fields correspond to the parsing into the w5h model.

Spotlight. We also compare our search approach to Spotlight, the desktop search platform in Apple’s OS X. Spotlight allows users to search for files based on metadata (Spotlight, ). This approach also works using the integrated raw (original) data. Each object in the evaluation data set is stored as an individual file on a machine running OS X Yosemite version 10.10.5. When possible, the following metadata is added to the files: MDAuthors (authors), MDCreationDate (creation date), MDChangeDate (content change date), MDCreator (content creator), MDFroms (path of a file). It is important to mention that Spotlight ranks only the one item that it views as most relevant to a query. All other matching items are returned without ranking, typically organized by type of document (e.g., email, pdf, etc.).

w5h-f. Our proposed approach relies on the six memory cues (what, who, when, where, why and how) to guide search. The w5h-f approach uses the data parsed according to the w5h model. The correlation between users/entities and how they interact over time through different services, including the frequency with which users communicate, is used to rank objects, as described in Section 3. w5h-f uses entity resolution, as described in Section 4.3, to disambiguate and link entities from different sources (e.g. Facebook, Gmail, Twitter…) in the data set.

Search Approach          Query Description                                                        Rank

Scenario 1 - search target: a Google+ post about SIGIR 2013 posted by Ashley in 2013
Spotlight                MDContent:SIGIR, MDAuthors:Ashley, MDCreationDate:2013                   2-14
Solr (TFIDF)             SIGIR, Ashley, 2013                                                      11
Solr (BM25)              SIGIR, Ashley, 2013                                                      12
Solr (Field-based BM25)  who:Ashley, what:SIGIR, when:2013                                        8
w5h-f                    who:Ashley, what:SIGIR, when:2013                                        5

Scenario 2 - search target: a photo of a cat posted on Facebook by Katie in March 2012
Spotlight                MDContent:photo, MDContent:cat, MDAuthors:Katie, MDCreationDate:2012-03  2-2964
Solr (TFIDF)             photo, cat, Katie, 2012-03                                               5468
Solr (BM25)              photo, cat, Katie, 2012-03                                               9106
Solr (Field-based BM25)  what:photo, what:cat, who:Katie, when:2012-03                            65
w5h-f                    what:photo, what:cat, who:Katie, when:2012-03                            13

Scenario 3 - search target: a Facebook photo of Anna taken in Campos
Spotlight                MDContent:Photo, MDContent:Anna, MDContent:Campos                        2-3169
Solr (TFIDF)             Photo, Anna, Campos                                                      17
Solr (BM25)              Photo, Anna, Campos                                                      43
Solr (Field-based BM25)  what:Photo, who:Anna, where:Campos                                       1
w5h-f                    what:Photo, who:Anna, where:Campos                                       1

Table 2. Representative search scenarios targeting information stored in a user’s personal data set.

5.2. Case Studies

We begin our evaluation by studying three manually created search scenarios designed to be representative of realistic user searches targeting different personal digital traces from the data set User 2 described in Table 1. For each scenario, we compose one query for each of Spotlight, Solr (TFIDF), Solr (BM25), Solr (Field-based BM25) and w5h-f using the same information. Query conditions are derived from information in the target objects, and all conditions are classified accurately along the dimensions within Spotlight, field-based Solr and w5h-f.

Table 2 describes the search scenarios, the corresponding queries, and the rank of the target object as returned by each search method. Note that the target objects are always found, since the queries are accurate, and all three search tools currently return all matching objects. When Spotlight does not return the target item as the 1st ranked result, we report the ranking as the range from 2 to the total number of returned items.

The results show that w5h-f achieves the best accuracy, always ranking the target object at least as high as Spotlight and Solr do. The differences can be significant (e.g., scenarios 1 and 2), demonstrating that using memory cues to guide search can lead to improved search accuracy. We next discuss each of the search scenarios in more detail to show how differentiating between the dimensions, and using frequency information, helps to improve search accuracy.

In scenario 1, the user is searching for a data item containing information about the 2013 SIGIR Conference. The information was sent or posted by Ashley. In this scenario, identifying Ashley as who and 2013 as when allows w5h-f to rank the target object higher than all instances of Solr. Solr field-based BM25 uses the same parsed data as w5h-f, but the w5h-f scoring function also takes into consideration how frequently Ashley communicated with the user on Google+ during 2013, which allows w5h-f to rank the target object higher. Spotlight was unable to leverage the same distinctions as w5h-f: it did not rank the target object first, and instead returned it as an unranked item among 13 other items.

Scenario 2 targets a photo of a cat sent or taken by Katie in March 2012. In this case, the classification of photo and cat as what and Katie as who allows w5h-f and Solr field-based BM25 to rank the target object much higher than Solr BM25, Solr TFIDF and Spotlight. Entity resolution in the who dimension and the scoring function based on frequency help w5h-f to rank the target object in the top 20.

Scenario 3 looks for a picture of Anna taken at a place called Campos. The good performance of w5h-f and Solr field-based BM25 is explained by the fact that both approaches classify Anna under the who dimension and Campos under the where dimension. Since Campos is a very common family name in the user database, the keyword search approaches return many documents that match Campos as a name as well as a location.
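The ambiguity in scenario 3 can be illustrated with a minimal sketch. It is not the Solr or w5h-f implementation: the toy objects, their values, and the helper names keyword_match and field_match are all hypothetical, and matching is reduced to exact token containment with no scoring. It only shows why scoping a term to a dimension filters out objects in which the same term plays a different role.

```python
# Toy objects parsed along w5h-style dimensions (all values are made up).
objects = [
    {"id": 1, "what": "photo beach", "who": "Anna", "where": "Campos"},
    {"id": 2, "what": "photo birthday", "who": "Anna Campos", "where": "Rio"},
    {"id": 3, "what": "email invoice", "who": "Pedro Campos", "where": ""},
]

def tokens(text):
    """Lower-cased whitespace tokens of a field value."""
    return set(text.lower().split())

def keyword_match(obj, terms):
    """Match if every term occurs anywhere in the object, in any field."""
    bag = set()
    for field in ("what", "who", "where"):
        bag |= tokens(obj[field])
    return all(t.lower() in bag for t in terms)

def field_match(obj, conditions):
    """Match only if each term occurs in the dimension it was assigned to."""
    return all(value.lower() in tokens(obj[dim])
               for dim, value in conditions.items())

# Keyword query: "Campos" matches both as a place and as a family name.
keyword_hits = [o["id"] for o in objects
                if keyword_match(o, ["photo", "Anna", "Campos"])]

# Field-scoped query: "Campos" must appear in the where dimension.
field_hits = [o["id"] for o in objects
              if field_match(o, {"what": "photo", "who": "Anna", "where": "Campos"})]

print(keyword_hits, field_hits)
```

In this toy collection the keyword query also retrieves the object whose author has the family name Campos, while the field-scoped query retrieves only the object located in Campos.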

Parameter | Group 1 | Group 2 | Group 3 | Group 4 | Group 5
number of scenarios | 250 | 250 | 250 | 250 | 250
dimensions | what | what, who | what, who, when | what, who, when, how | what, who, when, how
number of values | 1 | 1 | 1 | 1 | 2 (who, what), 1 (when, how)

Table 3. Parameters used to generate five groups of queries.

Methods | MRR | NDCG@10 | NDCG@20
Solr TF.IDF | 0.2920 | 0.3384 | 0.3673
Solr BM25 | 0.4742 | 0.5192 | 0.5352
Solr Field-based BM25 | 0.4979 | 0.5428 | 0.5619
w5h-f (no entity) | 0.5632 | 0.5993 | 0.6136
w5h-f | 0.6119 | 0.6414 | 0.6546

Table 4. MRR, NDCG@10, NDCG@20 for Group 2 queries.

Methods | MRR | NDCG@10 | NDCG@20
Solr TF.IDF | 0.1959 | 0.2304 | 0.2513
Solr BM25 | 0.2127 | 0.2481 | 0.2702
Solr Field-based BM25 | 0.2383 | 0.2712 | 0.2996
w5h-f | 0.2383 | 0.2712 | 0.2996
(a) Group 1

Methods | MRR | NDCG@10 | NDCG@20
Solr TF.IDF | 0.3580 | 0.4036 | 0.4234
Solr BM25 | 0.5267 | 0.5619 | 0.5777
Solr Field-based BM25 | 0.6117 | 0.6582 | 0.6772
w5h-f | 0.7072 | 0.7488 | 0.7628
(b) Group 3

Methods | MRR | NDCG@10 | NDCG@20
Solr TF.IDF | 0.3328 | 0.3925 | 0.4179
Solr BM25 | 0.5357 | 0.5888 | 0.6036
Solr Field-based BM25 | 0.6327 | 0.6765 | 0.6951
w5h-f | 0.7539 | 0.7931 | 0.8013
(c) Group 4

Methods | MRR | NDCG@10 | NDCG@20
Solr TF.IDF | 0.3772 | 0.4270 | 0.4569
Solr BM25 | 0.5345 | 0.5924 | 0.6152
Solr Field-based BM25 | 0.5769 | 0.6363 | 0.6510
w5h-f | 0.6514 | 0.7014 | 0.7124
(d) Group 5

Table 9. MRR, NDCG@10, NDCG@20 for Groups 1, 3, 4, and 5 (Group 2 is in Table 4). Compared against w5h-f, all the results are statistically significant (Wilcoxon signed-rank test).

5.3. Simulated Known-Item Queries

We now study a larger set of automatically generated known-item queries: search of personal data is usually focused on retrieving information that users know exists in their own data set. Because search over personal data traces is a known-item type of search, simulated queries can be generated automatically using known-item query (Elsweiler07towardstask-based, ) generation techniques such as the ones presented in (Azzopardi:2007, ) and (Kim:2009, ), as detailed below.

For this set of experiments, we built two query sets, one using data set User 1, and one using data set User 2 (Table 1). Both sets comprise 5 different groups of queries, each containing 1500 queries for 250 different scenarios. Each scenario is automatically created by randomly choosing a target object from one of the evaluation data sets. We then choose a set of dimensions, from which we randomly select values. We adapted the queries to each of our evaluation methods. Table 3 shows the parameters (dimensions and number of values per dimension) for the 5 query groups. We performed our experiments on both the User 1 and User 2 data sets and observed similar behaviors. For space reasons, we only report here on the results over the User 2 data set.
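The generation procedure can be sketched as follows. This is an illustrative simplification, not the generator used in the experiments: it assumes that query values are sampled uniformly from the target object's own dimension values, and the function names, object layout, and example data are hypothetical. The two formatting helpers mimic the keyword and field-based query styles shown in Table 2.

```python
import random

def generate_known_item_query(objects, dimensions, values_per_dim, rng):
    """Pick a random target and sample query values from its own dimensions."""
    # Only objects with enough values in every requested dimension qualify.
    candidates = [
        o for o in objects
        if all(len(o.get(d, [])) >= values_per_dim[d] for d in dimensions)
    ]
    target = rng.choice(candidates)  # raises IndexError if nothing qualifies
    conditions = {d: rng.sample(target[d], values_per_dim[d]) for d in dimensions}
    return target["id"], conditions

def to_keyword_query(conditions):
    """Flatten conditions into a plain keyword query (dimension labels dropped)."""
    return ", ".join(v for values in conditions.values() for v in values)

def to_field_query(conditions):
    """Render conditions as dimension:value pairs for field-based search."""
    return ", ".join(f"{d}:{v}" for d, values in conditions.items() for v in values)

# Hypothetical parsed objects; real traces would carry many more values.
objects = [
    {"id": "post-1", "what": ["SIGIR", "deadline"], "who": ["Ashley"], "when": ["2013"]},
    {"id": "photo-7", "what": ["photo", "cat"], "who": ["Katie"], "when": ["2012-03"]},
]

rng = random.Random(0)  # fixed seed so the sketch is repeatable
target_id, conditions = generate_known_item_query(
    objects, ["what", "who", "when"], {"what": 1, "who": 1, "when": 1}, rng
)
print(target_id, to_keyword_query(conditions), "|", to_field_query(conditions))
```

The same set of conditions is rendered once per search method, so every method receives the same information and only the use of dimension labels differs.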

Our evaluation resulted in the following observations on the impact of the multidimensional w5h data model, choice of text search function, entity resolution, and frequency scoring on the accuracy of the search results.

Including pertinent contextual information when searching personal data can significantly improve accuracy. Tables 9 and 4 show the MRR (Mean Reciprocal Rank), NDCG@10 (Normalized Discounted Cumulative Gain through position 10) and NDCG@20 (through position 20) of each approach (Solr TFIDF, Solr BM25, Solr field-based BM25, and w5h-f) for the five groups of queries. If the target object has the same ranking as other matching objects, we report the median value of the range. Observe that the search implementations that use the data parsed according to the w5h model, Solr field-based BM25 and w5h-f, outperform the keyword-based approaches, Solr TFIDF and Solr BM25. These results show how valuable it is to use context (w5h-f and Solr field-based BM25) to find matching documents.
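For a known-item query, these metrics reduce to simple functions of the target's rank. The sketch below assumes binary relevance with exactly one relevant object per query, so the ideal DCG is 1; the exact gain and discount used in the experiments are not stated here, so this is one common formulation rather than the evaluation code itself. The function names and example ranks are hypothetical.

```python
import math
from statistics import mean

def effective_rank(first, last):
    """Median of the tied rank range [first, last] (inclusive)."""
    return (first + last) / 2

def reciprocal_rank(rank):
    """Reciprocal rank of the single target object."""
    return 1.0 / rank

def ndcg_at_k(rank, k):
    """NDCG@k with one relevant item: ideal DCG is 1, so NDCG = 1/log2(rank + 1)."""
    return 1.0 / math.log2(rank + 1) if rank <= k else 0.0

# Ranks of the target for four hypothetical queries; the third target is
# tied with other objects over ranks 2 to 4, so its median rank 3 is used.
ranks = [1, 3, effective_rank(2, 4), 25]

mrr = mean(reciprocal_rank(r) for r in ranks)
ndcg10 = mean(ndcg_at_k(r, 10) for r in ranks)
print(round(mrr, 4), round(ndcg10, 4))
```

A target ranked beyond position k contributes zero to NDCG@k but still contributes a small reciprocal rank to MRR, which is why the two metrics can order methods slightly differently.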

The use of a more elaborate approach to search text data can positively impact the final results obtained by the w5h approaches. As previously mentioned, the what dimension in the w5h model is composed mainly of content information comprising most of the text. w5h-f uses Solr field-based BM25 to score the what dimension. The impact of text search using Solr field-based BM25 versus Solr TFIDF and Solr BM25 can be seen in Table 9 (a), which presents MRR, NDCG@10 and NDCG@20 for Group 1 queries (queries that have only the what dimension). We can observe that Solr field-based BM25 and w5h-f use a more effective approach to search and score text data than Solr TFIDF and Solr BM25. Note that since Group 1 has only one textual dimension in the query, w5h-f is equivalent to the underlying text-based scoring approach for the what dimension, which is field-based BM25 in our implementation. The results show that the adoption of a field-based text search for the what dimension leads to better results.

Being able to disambiguate/link people from different sources of data can significantly improve the accuracy of search. To analyze the importance of the entity resolution phase presented in Section 4.3, we created a group of queries (Group 2) composed of values from the who and what dimensions. The results for the User 2 data set are shown in Table 4: w5h-f with entity resolution outperforms an implementation of w5h-f that does not use entity resolution.

Including frequency information as part of the scoring results in significant improvements. Tables 9 and 4 show that w5h-f, which uses our proposed frequency scoring (Section 3), consistently outperforms Solr field-based BM25, which also relies on the w5h model (Section 2) but does not consider frequency. This shows that taking into consideration the correlation between dimensions while scoring an object improves the search accuracy.

Our evaluation shows that using a tailored frequency-based multidimensional scoring approach yields significant improvements in search accuracy over personal digital traces where the desired search outcome is a specific known object.

6. Related Work

The case for a unified data model for personal information was made in (haystack, ; xu03towards, ). deskWeb (deskweb, ) looks at the social network graph to expand the searched data set to include information available in the social network. Stuff I’ve Seen (stuff, ) indexes all of the information the user has seen, regardless of its location or provenance, and uses the corresponding metadata to improve search results. Seetrieve (Soules2008, ) extends this idea by only considering the parts of documents that were visible to the user to infer task-based (“why”) context for the file for later retrieval. Most notably, Personal Dataspaces (dataspace, ; iDM, ; imemex, ) propose semantic integration of data sources to provide meaningful semantic associations that can be used to navigate and query user data (implicit context). Connections (soules2005, ) uses system activity to make similar connections between files; (Soules2007, ) extends this approach to consider causality, using data flow, as contextual information. Our work is related to the wider field of Personal Information Management (PIMbook, ); in particular, search behavior over personal digital traces is likely to mimic that of searching data over personal devices. Unlike traditional information seeking, which focuses on discovering new information, the goal of search in Personal Information systems is to find information that has been created, received, or seen by the user.

Bell has pioneered the field of life-logging with the project MyLifeBits (mylifebits, ; bell-total-recall, ), for which he has digitally captured all aspects of his life. While MyLifeBits started as an experiment, there is no denying that we are moving towards a world where all of our steps, actions, words and interactions will be recorded by personal devices (e.g., Google Glass, cell phone GPS systems, FitBit and other Quantified Self sensors, …), or by public systems (e.g., traffic cameras, surveillance systems, …), and will generate a myriad of digital traces. (digime, ) is a commercial tool that aims at extending Bell’s vision to everyday users. The motivations behind it are very close to ours; however, it currently offers only keyword- or navigation-based access to the data; search results can be filtered by service, data type, and/or date.

Other file system related projects have tried to enhance the quality of search within the file system by leveraging the context in which information is accessed to find related information (chen09search, ; gyllstrom07confluence, ) or by altering the model of the file system to a more object-oriented database system (mic94file, ). YouPivot (YouPivot, ) indexes all user activities based on time and uses the time-based context to guide searches. Social context (users’ friends and communities) is leveraged in (social-context, ) for information discovery; similarly (Derczynski2013, ) uses temporal and location context to aid discovery in social media data. Our work integrates all these sources of contextual information and provides a unified complete model of context-aware personal data.

Contextual information has been considered in various computer science applications. Context-aware applications dynamically adapt to changes in the environment in which they are running: location, time, user profile, history. Bolchini et al. provide a thorough survey of context-aware models in (ContextModelsSurvey, ). Truong and Dustdar survey context-aware Web-Service systems in (Truong09asurvey, ). Context-awareness has become increasingly popular with the wide adoption of mobile devices. While the types of context these systems consider overlap with ours, the overall approach is different: for instance, a context-aware Information Retrieval system will use the current context (e.g., user location and time of day) to adjust search results (Shen2005, ). In contrast, we consider context as information that can be queried and used to guide the search.

7. Conclusions and Future Work

We proposed and implemented a multidimensional data model based on the six natural questions (what, when, where, who, why, and how) to represent and unify heterogeneous personal digital traces. Based on this model, we designed a frequency-based scoring strategy for search queries that takes into account interactions between entities across objects to assist in the ranking of query results. Experiments over personal data sets composed of data from a variety of data sources showed that our approach significantly improved search accuracy when compared with traditional search methods. In the future, we plan to investigate several extensions to our work on searching personal data traces:

  • Include topic modeling approaches over the what dimension to be able to correlate objects based on their contents.

  • Optimize indexes and search algorithms to improve search efficiency.

  • Add query relaxation rules to allow for inaccuracies in the queries and approximate query matching.

  • Design an aggregate query model where groups of objects (traces) can be returned together as a query answer (e.g., all the social media messages and pictures relating to a party). For this we plan to integrate our work on the why dimension, which connects digital traces together (ExploreDBValia, ; odbase, ) into our scoring framework.


  • [1] Spotlight.
  • [2] Stanford entity resolution framework.
  • [3] S. Abiteboul, B. André, and D. Kaplan. Managing your digital life. Commun. ACM, pages 32–35.
  • [4] G. D. Abowd, A. K. Dey, P. J. Brown, N. Davies, M. Smith, and P. Steggles. Towards a better understanding of context and context-awareness. In Proceedings of the 1st International Symposium on Handheld and Ubiquitous Computing, HUC ’99, pages 304–307, London, UK, 1999. Springer-Verlag.
  • [5] Apache solr.
  • [6] L. Azzopardi, M. de Rijke, and K. Balog. Building simulated queries for known-item topics: An analysis using six european languages. In Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’07, pages 455–462, New York, NY, USA, 2007. ACM.
  • [7] G. Bell and J. Gemmell. Total Recall: How the E-Memory Revolution Will Change Everything. Penguin, 2009.
  • [8] O. Benjelloun, H. Garcia-Molina, D. Menestrina, Q. Su, S. E. Whang, and J. Widom. Swoosh: a generic approach to entity resolution. The VLDB Journal, 18(1):255–276, 2009.
  • [9] L. Blunschi, J.-P. Dittrich, O. R. Girard, S. K. Karakashian, and M. A. V. Salles. A dataspace odyssey: The iMeMex personal dataspace management system. In CIDR, 2007.
  • [10] C. Bolchini, C. A. Curino, E. Quintarelli, F. A. Schreiber, and L. Tanca. A data-oriented survey of context models. SIGMOD Rec., 36(4):19–26, Dec. 2007.
  • [11] C. M. Bowman, C. Dharap, M. Baruah, B. Camargo, and S. Potti. A File System for Information Management. In Proceedings of the Intl. Conference on Intelligent Information Management Systems (ISMM), 1994.
  • [12] W. Brewer. Memory for randomly sampled autobiographical events, pages 21 – 90. Cambridge University Press, 1988.
  • [13] J. Chen, H. Guo, W. Wu, and C. Xie. Search Your Memory! – An Associative Memory Based Desktop Search System. In Proceedings of the 2009 ACM International Conference on Management of Data (SIGMOD’09), 2009.
  • [14] L. R. A. Derczynski, B. Yang, and C. S. Jensen. Towards context-aware search and analysis on social media data. In Proceedings of the 16th International Conference on Extending Database Technology, EDBT ’13, pages 137–142, New York, NY, USA, 2013. ACM.
  • [15] A. K. Dey. Understanding and using context. Personal Ubiquitous Comput., 5(1):4–7.
  • [16]
  • [17] J.-P. Dittrich and M. A. V. Salles. iDM: A unified and versatile data model for personal dataspace management. In Proceedings of the 32nd International Conference on Very Large Data Bases (VLDB’06), 2006.
  • [18] S. Dumais, E. Cutrell, J. J. Cadiz, G. Jancke, R. Sarin, and D. C. Robbins. Stuff ive seen: A system for personal information retrieval and re-use. In Proceedings of the 26th International ACM SIGIR Conference (SIGIR’03), 2003.
  • [19] D. Elsweiler and I. Ruthven. Towards task-based personal information management evaluations. In Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’07, pages 23–30, New York, NY, USA, 2007. ACM.
  • [20] J. Gemmell, G. Bell, and R. Lueder. Mylifebits: a personal database for everything. Communications of the ACM, 49(1):88–95, 2006.
  • [21] K. Gyllstrom and C. A. N. Soules. Seeing is retrieving: building information context from what the user sees. In IUI, pages 189–198, 2008.
  • [22] K. A. Gyllstrom, C. Soules, and A. Veitch. Confluence: Enhancing Contextual Desktop Search. In Proceedings of the 30th International ACM SIGIR Conference (SIGIR’07), 2007.
  • [23] J. Hailpern, N. Jitkoff, A. Warr, K. Karahalios, R. Sesek, and N. Shkrob. Youpivot: improving recall with contextual search. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, CHI ’11, pages 1521–1530, New York, NY, USA, 2011. ACM.
  • [24] A. Halevy, M. Franklin, and D. Maier. Principles of dataspace systems. Communications of the ACM, 2006.
  • [25] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Comput., 9(8):1735–1780.
  • [26] W. Jones and J. Teevan, editors. Personal Information Management. University of Washington Press, 2007.
  • [27] V. Kalokyri, A. Borgida, A. Marian, and D. Vianna. Integration and exploration of connected personal digital traces. In Proceedings of the ExploreDB’17, Chicago, IL, USA, May 19, 2017, pages 3:1–3:6, 2017.
  • [28] V. Kalokyri, A. Borgida, A. Marian, and D. Vianna. Semantic modeling and inference with episodic organization for managing personal digital traces. In Proceedings of the 16th International Conference on Ontologies, DataBases, and Applications of Semantics (ODBASE’17), pages 273–280. Springer, 2017.
  • [29] D. R. Karger, K. Bakshi, D. Huynh, D. Quan, and V. Sinha. Haystack: A general-purpose information management tool for end users based on semistructured data. In CIDR, pages 13–26, 2005.
  • [30] J. Kim and W. B. Croft. Retrieval experiments using pseudo-desktop collections. In Proceedings of the 18th ACM Conference on Information and Knowledge Management, CIKM ’09, pages 1297–1306, New York, NY, USA, 2009. ACM.
  • [31] H.-L. Truong and S. Dustdar. A survey on context-aware web service systems, 2009.
  • [32] S. E. Robertson and S. Walker. Some simple effective approximations to the 2-Poisson model for probabilistic weighted retrieval. In Proceedings of the 17th Annual International ACM SIGIR Conference, pages 232–241, 1996.
  • [33] G. Salton and C. Buckley. Term-weighting approaches in automatic text retrieval. Inf. Process. Manage., 24(5):513–523, Aug. 1988.
  • [34] D. Schacter. The seven sins of memory: How the mind forgets and remembers. Houghton Mifflin, 2001.
  • [35] S. Shah, C. A. N. Soules, G. R. Ganger, and B. D. Noble. Using provenance to aid in personal file search. In 2007 USENIX Annual Technical Conference on Proceedings of the USENIX Annual Technical Conference, ATC’07, pages 13:1–13:14, Berkeley, CA, USA, 2007. USENIX Association.
  • [36] X. Shen, B. Tan, and C. Zhai. Context-sensitive information retrieval using implicit feedback. In Proceedings of the 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’05, 2005.
  • [37] M. Smith, V. Barash, L. Getoor, and H. W. Lauw. Leveraging social context for searching social media. In Proceedings of the 2008 ACM workshop on Search in social media, SSM ’08, 2008.
  • [38] C. A. N. Soules and G. R. Ganger. Connections: using context to enhance file search. In Proceedings of the twentieth ACM symposium on Operating systems principles, SOSP ’05, pages 119–132, New York, NY, USA, 2005. ACM.
  • [39] N. Srivastava, G. E. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(1):1929–1958, 2014.
  • [40] D. Vianna, A.-M. Yong, C. Xia, A. Marian, and T. Nguyen. A tool for personal data extraction. In Proceedings of the 10th International Workshop on Information Integration on the Web (IIWeb), pages 80–83, 2014.
  • [41] W. A. Wagenaar. My memory: A study of autobiographical memory over six years. Cognitive Psychology, 18(2):225 – 252, 1986.
  • [42] Z. Xu, M. Karlsson, C. Tang, and C. Karamanolis. Towards a Semantic-Aware File Store. In Proceedings of the Workshop on Hot Topics in Operating Systems (HotOS’03), 2003.
  • [43] S. Zerr, E. Demidova, and S. Chernov. deskweb2.0: Combining desktop and social search. In Proc. of Desktop Search Workshop, In conjunction with the 33rd Annual International ACM SIGIR 2010, 23 July 2010, Geneva, Switzerland, 2010.