Semantic and Influence aware k-Representative Queries over Social Streams

by Yanhao Wang, et al.

Massive volumes of data continuously generated on social platforms have become an important information source for users. A primary method to obtain fresh and valuable information from social streams is social search. Although there have been extensive studies on social search, existing methods only focus on the relevance of query results but ignore the representativeness. In this paper, we propose a novel Semantic and Influence aware k-Representative (k-SIR) query for social streams based on topic modeling. Specifically, we consider that both user queries and elements are represented as vectors in the topic space. A k-SIR query retrieves a set of k elements with the maximum representativeness over the sliding window at query time w.r.t. the query vector. The representativeness of an element set comprises both semantic and influence scores computed by the topic model. Subsequently, we design two approximation algorithms, namely Multi-Topic ThresholdStream (MTTS) and Multi-Topic ThresholdDescend (MTTD), to process k-SIR queries in real-time. Both algorithms leverage the ranked lists maintained on each topic for k-SIR processing with theoretical guarantees. Extensive experiments on real-world datasets demonstrate the effectiveness of k-SIR query compared with existing methods as well as the efficiency and scalability of our proposed algorithms for k-SIR processing.




1 Introduction

Enormous amounts of data are continuously generated by web users on social platforms at an unprecedented rate. For example, around 650 million tweets are posted by 330 million users on Twitter per day. Such user-generated data can be modeled as continuous social streams, which are key sources of fresh and valuable information. Nevertheless, social streams are extremely overwhelming due to their huge volumes and high velocities, and it is impractical for users to consume social data in its raw form. Therefore, social search [8, 7, 28, 37, 17, 33, 9, 18, 39] has become the primary approach to helping users find the content they are interested in from massive social streams.

ID  Tweet                                                                          Retweets
1   @asroma win but it’s @LFC joining @realmadrid in the #UCL final                3154
2   #OnThisDay in 1993, @ManUtd were crowned the first #PL champion                1476
3   @Cavs defeats @Raptors 128-110 and leads the series 2-0 in #NBAPlayoffs        2706
4   LeBron is great! #NBAPlayoffs                                                  2
5   Congratulations to @LFC reaching #UCL Final!! #YNWA                            2167
6   LeBron is the 1st player with 40+ points 14+ assists in an #NBAPlayoffs game   3489
7   Hope this post inspires us to win #PL champions again in 2018-19               4
8   Schedule for #PL and #NBAPlayoffs tonight                                      25
Figure 1: A list of exemplar tweets

Figure 2: Topic distribution
Figure 3: References

Existing search methods for social data can be categorized into keyword-based approaches and topic-based approaches, based on how they measure the relevance between queries and elements. Keyword-based approaches [8, 7, 28, 37, 17, 33, 9] adopt textual relevance measures (e.g., TF-IDF and BM25) for evaluation. However, they merely capture the syntactic correlation but ignore the semantic correlation. Considering the tweets in Figure 1, if a query “soccer” is issued, no results will be found because none of the tweets contains the term “soccer”. Note that words like “asroma” and “LFC” are semantically relevant to “soccer”. Therefore, the tweets containing these words are relevant to the query but missing from the result. Thus, overlooking the semantic meanings of user queries may degrade the result quality, especially on social data where lexical variation is prevalent [14].

To overcome this issue, topic-based approaches [18, 39] project user queries and elements into the same latent space defined by a probabilistic topic model [5]. Consequently, queries and elements are both represented as vectors, and their relevance is computed by vector similarity measures (e.g., cosine similarity) in the topic space. Although topic-based approaches better capture the semantic correlation between queries and elements, they focus on the relevance of results but neglect the representativeness. Typically, they retrieve the top-k elements that are the most coherent with the query as the result. Such results may not be representative in the sense of information coverage and social influence. First, users are more satisfied with results that achieve an extensive coverage of information on the query topics than with ones that provide limited information. For example, a top-1 query on a topic over the tweets in Figure 1 returns only the single most relevant tweet, although another tweet on the same topic could provide richer information to complement the news it reports. Therefore, in addition to relevance, it is essential to consider information coverage to improve the result quality. Second, influence is another key characteristic for measuring the representativeness of social data. Existing methods for social search [37, 8, 18, 7] have taken the influences of elements into account for scoring and ranking. These methods simply use the influences of authors (e.g., PageRank [24] scores) or the retweet/share count to compute the influence scores. Such a naïve integration of influence is topic-unaware and may lead to undesired query results. For example, in Figure 1, a tweet that is mostly related to one topic may appear in the result for a query on another topic because of its high retweet count. In addition, they do not consider that the influences of elements evolve over time: previously trending content may become outdated while new posts continuously emerge. Hence, incorporating a topic-aware and time-critical influence metric is imperative to capture recently trending elements.

To tackle the problems of existing search methods, we define a novel Semantic and Influence aware k-Representative (k-SIR) query for social streams based on topic modeling [5]. Specifically, a k-SIR query retrieves a set of at most k elements from the active elements corresponding to the sliding window at the query time. The result set collectively achieves the maximum representativeness score w.r.t. the query vector, each dimension of which indicates the degree of interest in a topic. We define the representativeness score of an element set as a weighted sum of its semantic and influence scores on each topic. We adopt a weighted word coverage model to compute the semantic score so as to achieve the best information preservation, where the weight of a word is evaluated based on its information entropy [42, 31]. The influence score is computed by a probabilistic coverage model in which the influence probabilities are topic-aware. In addition, we restrict the influences to the sliding window so that recently trending elements can be selected.

The challenges of real-time k-SIR processing are two-fold. First, the k-SIR query is NP-hard. Second, it is highly dynamic, i.e., the results vary with query vectors and evolve quickly over time. Due to the submodularity of the scoring function, existing submodular maximization algorithms, e.g., CELF [16] and SieveStreaming [3], can provide approximate results for k-SIR queries with theoretical guarantees. However, these algorithms need to evaluate all active elements at least once for a single query and often take several seconds to process one k-SIR query, as shown in our experiments. To support real-time k-SIR processing over social streams, we maintain ranked lists that sort the active elements on each topic by topic-wise representativeness score. We first devise the Multi-Topic ThresholdStream (MTTS) algorithm for k-SIR processing. Specifically, to prune unnecessary evaluations, MTTS sequentially retrieves elements from the ranked lists in decreasing order of their scores w.r.t. the query vector and can terminate early whenever possible. Theoretically, it provides (1/2 − ε)-approximate results for k-SIR queries and evaluates each active element at most once. Furthermore, we propose the Multi-Topic ThresholdDescend (MTTD) algorithm to improve upon MTTS. MTTD maintains the elements retrieved from the ranked lists in a buffer and permits an element to be evaluated more than once to improve the result quality. Consequently, it achieves a better (1 − 1/e − ε)-approximation but has a higher worst-case time complexity than MTTS. Despite this, MTTD shows better empirical efficiency and result quality than MTTS.

Finally, we conduct extensive experiments on three real-world datasets to evaluate the effectiveness of k-SIR as well as the efficiency and scalability of MTTS and MTTD. The results of a user study and a quantitative analysis demonstrate that k-SIR achieves significant improvements over existing methods in terms of information coverage and social influence. In addition, MTTS and MTTD achieve up to 124x and 390x speedups over the baselines for k-SIR processing with only small losses in quality.

Our contributions in this work are summarized as follows.

  • We define the k-SIR query to retrieve representative elements over social streams, where both semantic and influence scores are considered. (Section 3)

  • We propose MTTS and MTTD to process k-SIR queries in real-time with theoretical guarantees. (Section 4)

  • We conduct extensive experiments to demonstrate the effectiveness of k-SIR as well as the efficiency and scalability of our proposed algorithms for k-SIR processing. (Section 5)

2 Related Work

Search Methods for Social Streams. Many methods have been proposed for searching on social streams. Here we categorize existing methods into keyword-based approaches and topic-based approaches.

Keyword-based approaches [8, 7, 28, 37, 17, 33, 9, 40] typically define top-k queries to retrieve the elements with the highest scores as the results, where the scoring functions combine the relevance to query keywords (measured by TF-IDF or BM25) with other contexts such as freshness [28, 17, 33, 37], influence [8, 37], and diversity [9]. They also design different indices to support instant updates and efficient top-k query processing. However, keyword queries are substantially different from the k-SIR query, and thus keyword-based methods cannot be trivially adapted to process k-SIR queries based on topic modeling.

As metrics for textual relevance cannot fully represent the semantic relevance between user interest and text, recent work [18, 39] introduces topic models [5] into social search, where user queries and elements are modeled as vectors in the topic space. The relevance between a query and an element is measured by cosine similarity. They define the top-k relevance query to retrieve the k elements most relevant to a query vector. However, existing methods typically consider the relevance of results but ignore the representativeness. Therefore, the algorithms in [18, 39] cannot be used to process k-SIR queries that emphasize the representativeness of results.

Social Stream Summarization. There have been extensive studies on social stream summarization [1, 27, 36, 23, 4, 26, 29, 25], the problem of extracting a set of representative elements from social streams. Shou et al. [27, 36] propose a framework for social stream summarization based on dynamic clustering. Ren et al. [25] focus on the personalized summarization problem, which takes users’ interests into account. Olariu [23] devises a graph-based approach to abstractive social summarization. Bian et al. [4] study the multimedia summarization problem on social streams. Ren et al. [26] investigate the multi-view opinion summarization of social streams. Agarwal and Ramamritham [1] propose a graph-based method for contextual summarization of social event streams. Nguyen et al. [31] consider maintaining a sketch for a social stream to best preserve the latent topic distribution.

However, the above approaches cannot be applied to ad-hoc query processing because they (1) do not provide the query interface and (2) are not efficient enough. For each query, they need to filter out irrelevant elements and invoke a new instance of the summarization algorithm to acquire the result, which often takes dozens of seconds or even minutes. Therefore, it is unrealistic to deploy a summarization method on a social platform for ad-hoc queries since thousands of users could submit different queries at the same time and each query should be processed in real-time.

Submodular Maximization.

Submodular maximization has attracted much research interest recently for its theoretical significance and wide applications. The standard approaches to submodular maximization with a cardinality constraint are the greedy heuristic [22] and its improved version CELF [16], both of which are (1 − 1/e)-approximate. Badanidiyuru and Vondrák [2] propose several approximation algorithms for submodular maximization with general constraints. Kumar et al. [15] and Badanidiyuru et al. [3] study the submodular maximization problem in the distributed and streaming settings, respectively. Epasto et al. [12] and Wang et al. [35] further investigate submodular maximization in the sliding window model. However, the above algorithms do not utilize any indices for acceleration, and thus they are much less efficient for k-SIR processing than the MTTS and MTTD algorithms proposed in this paper.

3 Problem Formulation

Elem ID   Pr(z_1)   Pr(z_2)
1         0.2       0.8
2         0.26      0.74
3         0.89      0.11
4         1         0
5         0.29      0.71
6         0.7       0.3
7         0.33      0.67
8         0.51      0.49
(a) Topic distributions of the elements extracted from the tweets in Figure 1

Word          Pr(w | z_1)   Pr(w | z_2)
asroma        0             0.03
assist        0.06          0.04
cavs          0.09          0
champion      0.1           0.09
defeat        0.05          0.04
final         0.11          0.12
lebron        0.12          0
lfc           0             0.06
(b) Topic-word distribution – I

Word          Pr(w | z_1)   Pr(w | z_2)
manutd        0             0.07
nbaplayoffs   0.11          0
pl            0             0.11
point         0.15          0.14
raptors       0.08          0
realmadrid    0             0.07
schedule      0.13          0.12
ucl           0             0.11
(c) Topic-word distribution – II
Table 1: Example of a social stream and its topic model

3.1 Data Model

Social Element. A social element e is represented as a triple (t_e, W_e, R_e), where t_e is the timestamp when e is posted, W_e is the textual content of e denoted by a bag of words drawn from a vocabulary V, and R_e is the set of elements referred to by e. Given two elements e and e' with t_e < t_{e'}, if e' refers to e, i.e., e ∈ R_{e'}, we say e influences e', denoted as e → e'. In this way, the attribute R_e captures the influence relationships between social elements [30, 34]. If e is totally original, we set R_e = ∅. For example, the tweets shown in Table 1 are typical social elements, and the propagation of hashtags can be modeled as references [30, 19]. Note that influence relationships vary for different types of elements; e.g., “cite” between academic papers and “comment” on Reddit can also be modeled as references.

Social Stream. We consider social elements arriving continuously as a data stream. A social stream S comprises a sequence of elements e_1, e_2, … ordered by timestamp; multiple elements with the same timestamp may arrive in an arbitrary order. Furthermore, social streams are time-sensitive: elements posted or referred to recently are more important and interesting to users than older ones. To capture the freshness of social streams, we adopt the well-recognized time-based sliding window model [11]. Given the window length T, the sliding window W_t at time t comprises the elements posted between time t − T and t, i.e., W_t = {e ∈ S : t − T < t_e ≤ t}. The set of active elements A_t at time t includes not only the elements in W_t but also the elements referred to by any element in W_t, i.e., A_t = W_t ∪ (⋃_{e ∈ W_t} R_e). We use n to denote the number of active elements at time t.
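As an illustration (not part of the original formulation), the data model above can be sketched in Python; the `Element` class and the helper names `sliding_window` and `active_elements` are our own:

```python
from dataclasses import dataclass, field

@dataclass
class Element:
    eid: int                                 # element identifier
    time: float                              # timestamp t_e
    words: list                              # bag of words W_e
    refs: set = field(default_factory=set)   # ids of referenced elements R_e

def sliding_window(stream, t, T):
    """W_t: elements posted in the interval (t - T, t]."""
    return [e for e in stream if t - T < e.time <= t]

def active_elements(stream, t, T):
    """A_t: W_t plus every element referred to by some element in W_t."""
    window = sliding_window(stream, t, T)
    ref_ids = set().union(*(e.refs for e in window)) if window else set()
    by_id = {e.eid: e for e in stream}
    active = {e.eid: e for e in window}
    for rid in ref_ids:
        if rid in by_id:
            active[rid] = by_id[rid]
    return list(active.values())
```

Note that an expired element stays active as long as some element inside the window still refers to it, which matches the definition of A_t.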

Topic Model. We use probabilistic topic models [5] such as LDA [6] and BTM [38] to measure the (semantic and influential) representativeness of elements and the preferences of users. A topic model Z consisting of |Z| topics is trained from the corpus and the vocabulary V. Each topic z_i ∈ Z is a multinomial distribution φ_i over the words in V, where φ_i(w) is the probability of word w on z_i and Σ_{w ∈ V} φ_i(w) = 1. The topic distribution of an element e is a multinomial distribution θ_e over the topics in Z, where θ_e(i) is the probability that e is generated from z_i and Σ_{i=1}^{|Z|} θ_e(i) = 1.

The selection of an appropriate topic model is orthogonal to our problem. In this work, any probabilistic topic model can be used as a black-box oracle to provide φ_i and θ_e. Note that the evolution of the topic distribution is typically much slower than the speed of the social stream [41, 38]. In practice, we assume that the topic distribution remains stable for a period of time and retrain the topic model from recent elements when it becomes outdated due to concept drift.

3.2 Query Definition

Query Vector. Given a topic model Z, we use a |Z|-dimensional vector x to denote a user’s preference over topics. Formally, x = (x_1, …, x_{|Z|}), where x_i indicates the user’s degree of interest in z_i. W.l.o.g., x is normalized so that Σ_i x_i = 1. Since it is impractical for users to provide query vectors directly, for their lack of knowledge about the topic model Z, we design a scheme to transform the standard query-by-keyword [17] paradigm to our setting: the keywords provided by a user are treated as a pseudo-document and the query vector x is inferred from its distribution over the topics in Z. Note that other query paradigms can also be supported, e.g., the query-by-document paradigm [39], where a document is provided as the query, and personalized search [18], where the query vector is inferred from a user’s recent posts.

Definition of Representativeness. Given a set S of elements and a query vector x, the representativeness of S w.r.t. x at time t is defined by a function f that maps any subset of A_t to a nonnegative score w.r.t. the query vector. Formally, we have

    f(S, x, t) = Σ_{i=1}^{|Z|} x_i · f_i(S, t)        (1)

where f_i(S, t) is the score of S on topic z_i. Intuitively, the overall score of S w.r.t. x is the weighted sum of its scores on each topic. The score on z_i is defined as a linear combination of its semantic and influence scores. Formally,

    f_i(S, t) = λ · sem_i(S) + (1 − λ) · γ · inf_{i,t}(S)        (2)

where sem_i(S) is the semantic score of S on z_i, inf_{i,t}(S) is the influence score of S on z_i at time t, λ ∈ [0, 1] specifies the trade-off between the semantic and influence scores, and γ adjusts the ranges of sem_i(·) and inf_{i,t}(·) to the same scale. Next, we introduce how to compute the semantic and influence scores based on the topic model.

Topic-specific Semantic Score. Given a topic z_i, we define the semantic score of a set of elements by the weighted word coverage model. We first define the weight of a word w ∈ V on z_i. According to the generative process of topic models [5], the probability that w is generated from z_i is φ_i(w). Following [31, 42], the weight of w in an element e on z_i is defined by its frequency and information entropy, i.e., wt_i(w, e) = tf(w, e) · (−φ_i(w) · log φ_i(w)), where tf(w, e) is the frequency of w in W_e. Then, the semantic score of a single element e on z_i is the sum of the weights of the distinct words in W_e. We extend the definition of the semantic score to an element set S by handling word overlaps. Given a set S and a word w, if w appears in more than one element of S, its weight is counted only once, for the element with the maximum weight. Formally, the semantic score of S on z_i is defined by

    sem_i(S) = Σ_{w ∈ W_S} max_{e ∈ S : w ∈ W_e} wt_i(w, e)        (3)

where W_S = ⋃_{e ∈ S} W_e is the set of distinct words in S. Equation 3 aims to select a set of elements that maximally cover the important words on z_i so as to best preserve the information of z_i. Additionally, it implicitly captures the diversity issue, because adding highly similar elements to S brings little increase in sem_i(S).
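A minimal sketch of the weighted word coverage computation, with our own helper names (`word_weight`, `semantic_score`) and elements represented simply as bags of words:

```python
import math

def word_weight(freq, phi):
    """Weight of a word on topic z_i: term frequency times the entropy term
    -phi * log(phi), where phi is the word's probability on z_i."""
    return 0.0 if phi == 0 else freq * (-phi * math.log(phi))

def semantic_score(S, phi_i):
    """Weighted word coverage: each distinct word in the set S is counted
    once, for the element in which it attains the maximum weight."""
    best = {}
    for words in S:                      # each element is a bag of words
        for w in set(words):
            wt = word_weight(words.count(w), phi_i.get(w, 0.0))
            best[w] = max(best.get(w, 0.0), wt)
    return sum(best.values())
```

Because overlapping words are counted only once, adding an element whose words are all covered already yields zero marginal gain, which is exactly the diversity effect noted above.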

Example 1.

Table 1 gives a social stream extracted from the tweets in Figure 1 and a topic model on the vocabulary of the elements in the stream. Consider computing the semantic score of a two-element set on a topic. Since the frequency of each word in any element is 1, the weight of a word reduces to its entropy term. A word that appears in only one element of the set contributes its weight directly, while a word that appears in both elements is counted only once, for the element in which its weight is maximal. Summing the weights of all distinct words in the set yields its semantic score. Notably, an element contributes nothing to the semantic score when all of its words are already covered by the other elements in the set.

Topic-specific Time-critical Influence Score. Given a topic z_i and two elements e, e' with e → e', the probability of influence propagation from e to e' on z_i is denoted by p_i(e, e'). Assuming the influences from different precedents of e' to be independent of each other, we adopt the probabilistic coverage model to compute the probability of influence propagation from a set of elements S to e' on z_i, i.e., P_i(S, e') = 1 − ∏_{e ∈ S : e → e'} (1 − p_i(e, e')). To select recently trending elements, we define the influence score in the sliding window model, where only the references observed within W_t are considered. Let D_t(e) be the set of elements influenced by e at time t and D_t(S) = ⋃_{e ∈ S} D_t(e) be the set of elements influenced by S at time t. The influence score of S on z_i at time t is defined by

    inf_{i,t}(S) = Σ_{e' ∈ D_t(S)} P_i(S, e')        (4)

Equation 4 tends to select a set of elements that are influential on z_i at time t. The value of inf_{i,t}(S) increases greatly only if an element e added to S is both relevant to z_i itself and referred to by many elements on z_i within W_t.
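A sketch of the probabilistic coverage computation under our own data layout: `influenced_by` maps each target element to the set of its (window-restricted) influencers, and `prob` holds the hypothetical topic-aware propagation probabilities p_i(e, e'):

```python
def influence_score(S, influenced_by, prob):
    """Probabilistic coverage: a target e' is covered by the set S with
    probability 1 - prod(1 - p_i(e, e')) over elements e in S that
    influence e' within the current window."""
    total = 0.0
    for target, influencers in influenced_by.items():
        miss = 1.0                       # probability that no e in S covers target
        for e in S:
            if e in influencers:
                miss *= 1.0 - prob.get((e, target), 0.0)
        total += 1.0 - miss
    return total
```

Restricting `influenced_by` to references observed inside the sliding window realizes the time-critical aspect: once a reference expires, it no longer contributes to the score.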

Example 2.

Consider computing the influence score of an element in Table 1 on a topic at some time t. Only the references observed within the sliding window are counted; references that arrived before time t − T have expired and are excluded. For each element influenced by the set, the probabilistic coverage model combines the propagation probabilities from its precedents. Notably, even though an element is referred to by several other elements, its influence score on a topic is low when the element itself, and the elements referring to it, are mostly distributed on another topic.

Query Definition. We formally define the Semantic and Influence aware k-Representative (k-SIR) query, which selects a set of elements with the maximum representativeness score w.r.t. a query vector from a social stream. We impose two constraints on the result S of a k-SIR query issued at time t: (1) its size is restricted, i.e., S contains at most k elements, to avoid overwhelming users with too much information; (2) the elements in S must be active at time t, i.e., S ⊆ A_t, to satisfy the freshness requirement. Finally, we define a k-SIR query as follows.

Definition 1 (k-SIR).

Given the set of active elements A_t and a query vector x, a k-SIR query returns a set S* ⊆ A_t of elements with a bounded size |S*| ≤ k such that the scoring function is maximized, i.e., S* = argmax_{S ⊆ A_t : |S| ≤ k} f(S, x, t), where S* is the optimal result for the query and OPT = f(S*, x, t) is the optimal representativeness score.

Example 3.

Consider two k-SIR queries on the social stream in Table 1 with k = 2. Given a query vector with equal weights on the two topics (the user has the same interest in both), the result comprises the elements that obtain the highest scores on the two topics respectively, since together they achieve the maximum score w.r.t. the query vector. Given a skewed query vector that prefers one topic to the other, an element that is mostly distributed on the less-preferred topic is excluded from the result.

3.3 Properties and Challenges

Properties of k-SIR Queries. We first show the monotonicity and submodularity of the scoring function for the k-SIR query by proving that both the semantic function sem_i(·) and the influence function inf_{i,t}(·) are monotone and submodular.

Definition 2 (Monotonicity & Submodularity).

A set function f on the power set of a ground set V is monotone iff f(A) ≤ f(B) for any A ⊆ B ⊆ V. The function f is submodular iff f(A ∪ {e}) − f(A) ≥ f(B ∪ {e}) − f(B) for any A ⊆ B ⊆ V and e ∈ V ∖ B.
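Definition 2 can be checked mechanically on small ground sets; the sketch below (our own helper, not part of the paper) brute-forces both properties over all pairs A ⊆ B:

```python
from itertools import combinations

def is_monotone_submodular(f, ground):
    """Brute-force check of monotonicity and submodularity of a set
    function f over all subset pairs A <= B of a small ground set."""
    subsets = [frozenset(c) for r in range(len(ground) + 1)
               for c in combinations(ground, r)]
    for A in subsets:
        for B in subsets:
            if not A <= B:
                continue
            if f(B) < f(A) - 1e-9:
                return False                  # violates monotonicity
            for e in ground:
                if e in B:
                    continue
                gain_A = f(A | {e}) - f(A)
                gain_B = f(B | {e}) - f(B)
                if gain_B > gain_A + 1e-9:
                    return False              # violates submodularity
    return True
```

For instance, a coverage function passes the check, whereas f(S) = |S|^2 is monotone but fails the diminishing-returns condition.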

Lemma 1.

sem_i(·) is monotone and submodular for any z_i ∈ Z.

Lemma 2.

inf_{i,t}(·) is monotone and submodular for any z_i ∈ Z at any time t.

The proofs are given in Appendices A.1 and A.2.

Given a query vector x, the scoring function f(·, x, t) is a nonnegative linear combination of sem_i(·) and inf_{i,t}(·) over all topics. Therefore, it is monotone and submodular.

Challenges of k-SIR Queries. In this paper, we consider that elements arrive continuously over time, and we always maintain the set of active elements A_t at any time t. The goal is to provide the result for any ad-hoc k-SIR query in real-time.

The challenges of processing k-SIR queries in such a scenario are two-fold: (1) NP-hardness and (2) dynamism. First, the following theorem shows that the k-SIR query is NP-hard.

Theorem 1.

It is NP-hard to obtain the optimal result S* for any k-SIR query.

The weighted maximum coverage problem can be reduced to the k-SIR query when λ = 1 in Equation 2, while the probabilistic coverage problem is a special case of the k-SIR query when λ = 0 in Equation 2. Because both problems are NP-hard [13], the k-SIR query is NP-hard as well.

In spite of this, existing algorithms for submodular maximization [22] can provide results with constant-factor approximations to the optimal ones for k-SIR queries, owing to the monotonicity and submodularity of the scoring function. For example, CELF [16] is (1 − 1/e)-approximate for k-SIR queries, while SieveStreaming [3] is (1/2 − ε)-approximate for any ε > 0. However, neither algorithm fulfills the requirements for real-time k-SIR processing owing to the dynamism of k-SIR queries. The results of k-SIR queries not only vary with query vectors but also evolve over time for the same query vector, due to the changes in active elements and the fluctuations of influence scores over the sliding window. To process one k-SIR query, CELF and SieveStreaming must evaluate every active element at least once. Empirically, they often take several seconds for one k-SIR query when the window length is 24 hours. To the best of our knowledge, none of the existing algorithms can efficiently process k-SIR queries. Thus, we are motivated to devise novel real-time solutions for k-SIR processing over social streams.
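For concreteness, here is a sketch of the CELF-style lazy greedy baseline for maximizing a monotone submodular set function f under a cardinality constraint k; the function and variable names are our own, and f is passed in as a black box over lists of elements:

```python
import heapq

def lazy_greedy(candidates, f, k):
    """CELF-style lazy greedy, (1 - 1/e)-approximate for monotone
    submodular f. Marginal gains are kept in a max-heap and only
    re-evaluated lazily when an element reaches the top."""
    S, fS = [], 0.0
    heap = [(-f([e]), e) for e in candidates]   # initial singleton gains
    heapq.heapify(heap)
    while heap and len(S) < k:
        neg_gain, e = heapq.heappop(heap)
        gain = f(S + [e]) - fS                  # re-evaluate against current S
        if not heap or gain >= -heap[0][0] - 1e-12:
            if gain > 0:                        # gain is still the largest
                S.append(e)
                fS += gain
        else:
            heapq.heappush(heap, (-gain, e))    # stale gain: push back
    return S
```

Submodularity guarantees that stale gains only overestimate, so the lazy re-evaluation is safe; still, every candidate must be touched at least once, which is exactly the cost the ranked-list algorithms below avoid.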

Notation              Description
S; e; e_i             S is a social stream; e is an arbitrary element in S; e_i is the i-th element in S.
T; W_t; A_t           T is the window length; W_t is the sliding window at time t; A_t is the set of active elements at time t.
Z; z_i                Z is a topic model; z_i is the i-th topic in Z.
x; x_i                x is a |Z|-dimensional query vector; x_i is the i-th entry of x.
sem_i(·); inf_{i,t}(·)   the semantic function on z_i; the influence function on z_i at time t.
f_i(·, t); f(·, x, t)    the representativeness scoring function on z_i; the scoring function w.r.t. a query vector x.
Q = (t, k, x)         a k-SIR query at time t with a bounded result size k and a query vector x.
S*; OPT               S* is the optimal result for Q; OPT is the optimal representativeness score.
f_i(e, t); f(e, x, t)    the score of e on z_i; the score of e w.r.t. x.
Δ(e | S)              the marginal score gain of adding e to S.
L_i                   the ranked list maintained for the elements on topic z_i.
Table 2: Frequently Used Notations

Before moving on to k-SIR processing, we summarize the frequently used notations in Table 2.

4 Query Processing

In this section, we introduce our methods for processing k-SIR queries over social streams. The architecture is illustrated in Figure 4. At any time t, we maintain (1) the Active Window, which buffers the set of active elements A_t; (2) the Ranked Lists, which sort the elements on each topic in descending order of topic-wise representativeness score; and (3) the Query Processor, which leverages the ranked lists to process k-SIR queries. When the topic model is given, the query and topic inferences are rather standard (e.g., Gibbs sampling [21]), so we do not discuss these procedures here due to space limitations. We consider the query vectors and the topic vectors of elements to be given in advance.

As shown in Figure 4, we process a social stream in a batch manner. The stream is partitioned into buckets of equal time length, and the window is updated at discrete times until the end of the stream. When the window slides at time t, a bucket containing the elements posted since the previous update is received. After inferring the topic vector of each element in the bucket with the topic model, we first update the active window: the new elements are inserted into the active window and the elements referred to by them are updated. Then, the elements that are no longer referred to by any element after time t − T are discarded from the active window. Subsequently, the ranked list on each topic is maintained accordingly. The detailed procedure for ranked list maintenance is presented in Section 4.1.

Figure 4: The architecture for k-SIR query processing

Next, we discuss the mechanism of k-SIR processing. One major drawback of existing submodular maximization methods, e.g., CELF [16] and SieveStreaming [3], for processing k-SIR queries is that they need to evaluate every active element at least once. However, real-world datasets often have two characteristics. (1) The scores of elements are skewed, i.e., only a few elements have high scores. For example, we compute the scores of a sample of tweets w.r.t. a k-SIR query and scale the scores linearly to the range of 0 to 1; the statistics show that only 0.4% of the elements have scores greater than 0.9 while 91% have scores less than 0.1. (2) One element can be high-ranked on only very few topics, i.e., an element is typically about only one or two topics. In practice, we observe that the average number of topics per element is less than 2. Therefore, most elements are not relevant to a specific k-SIR query, and we can greatly improve efficiency by avoiding evaluations of elements that have very little chance of being included in the query result. To prune these unnecessary evaluations, we leverage the ranked lists to sequentially evaluate the active elements in decreasing order of their scores w.r.t. the query vector. In this way, we can track whether unevaluated elements can still be added to the query result and terminate the evaluations as early as possible.

Although this method of traversing the ranked lists is similar to that for top-k queries [39], the procedures for maintaining the query results are totally different. A top-k query simply returns the k elements with the maximum scores as the result. Although the top-k result can be retrieved efficiently from the ranked lists using existing methods [39], its quality for k-SIR queries is suboptimal because word and influence overlaps are ignored. Thus, we will propose the Multi-Topic ThresholdStream (MTTS) and Multi-Topic ThresholdDescend (MTTD) algorithms for k-SIR processing in Sections 4.2 and 4.3. They return high-quality results with constant approximation guarantees for k-SIR queries while meeting the real-time requirements.

4.1 Ranked List Maintenance

In this subsection, we introduce the procedure for ranked list maintenance. Generally, a ranked list L_i keeps a tuple for each active element on topic z_i. The tuple for element e is denoted as ⟨e, s_i(e), t_r(e)⟩, where s_i(e) is the topic-wise representativeness score of e on z_i and t_r(e) is the timestamp when e was last referred to. All tuples in L_i are sorted in descending order of topic-wise score.

Input: A social stream S, the window length T, and the bucket length
1   foreach z_i ∈ Z: initialize an empty ranked list L_i;
2   while the stream has not ended do
3       receive the next bucket B of elements;
4       foreach e ∈ B do
5           foreach z_i with θ_e(i) > 0 do
6               compute s_i(e);
7               create a tuple ⟨e, s_i(e), t_e⟩ and insert it into L_i;
8           foreach e' ∈ R_e do
9               foreach z_i with θ_{e'}(i) > 0 do
10                  recompute s_i(e') and set t_r(e') to t_e;
11                  adjust the position of e' in L_i;
12      foreach e ∈ A_t : e is never referred to after t − T do
13          delete the tuples of e from L_i for every z_i with θ_e(i) > 0;
Algorithm 1 Ranked List Maintenance

The algorithmic description of ranked list maintenance over a social stream is presented in Algorithm 1. Initially, an empty ranked list is created for each topic in the topic model (Line 1). The ranked lists are then updated at discrete timestamps according to each incoming bucket of elements. For each element e in the bucket, a tuple is created and inserted into L_i for every topic z_i on which e has nonzero probability (Lines 5–7). The initial influence part of the score is zero, because the elements influenced by e have not been observed yet, and the time when e was last referred to is its own timestamp. Subsequently, the influence score of each parent of e is recomputed, the corresponding tuples are updated with the new scores and last-referred timestamps, and their positions in the ranked lists are adjusted accordingly (Lines 8–11). Finally, the tuples of expired elements are deleted from the ranked lists (Lines 12–13).

Complexity Analysis. The cost of evaluating the topic-wise scores of an element is linear in the number of topics on which it has a non-zero weight. With the ranked lists kept in sorted order, inserting a tuple costs time logarithmic in the list length, and re-evaluating and repositioning the tuple of each parent incurs the same cost. Overall, since the tuples for an element appear in one ranked list per topic on which it has a non-zero weight, the time to maintain the ranked lists for one element is that number of lists multiplied by the per-list update cost, summed over the element itself and its parents.

Operations for Ranked List Traversal. We need to access the tuples in each ranked list in decreasing order of topic-wise score for k-SIR processing. Two basic operations are defined to traverse a ranked list: (1) retrieve, which returns the element w.r.t. the first tuple (i.e., the one with the maximum topic-wise score) in the list; and (2) next, which returns the element w.r.t. the next unvisited tuple after the current one. Note that once a tuple for an element has been accessed in one ranked list, the remaining tuples for that element in the other lists are marked as "visited" so as to eliminate duplicate evaluations of the element.
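The two traversal operations and the cross-list "visited" marking can be sketched as follows; the cursor-based layout and the exact operation signatures are assumptions for illustration.

```python
class Traversal:
    """Sequential traversal of per-topic ranked lists in descending score
    order, with cross-list deduplication of visited elements (sketch)."""

    def __init__(self, ranked_lists):
        # ranked_lists: topic -> list of (score, element), sorted descending
        self.lists = ranked_lists
        self.pos = {z: 0 for z in ranked_lists}   # one cursor per topic
        self.visited = set()                      # elements seen in any list

    def retrieve(self, topic):
        """Element of the first (maximum-score) unvisited tuple."""
        self.pos[topic] = 0
        return self._advance(topic)

    def next(self, topic):
        """Element of the next unvisited tuple after the current one."""
        self.pos[topic] += 1
        return self._advance(topic)

    def _advance(self, topic):
        lst = self.lists[topic]
        # Skip tuples whose elements were already accessed in another list.
        while self.pos[topic] < len(lst) and lst[self.pos[topic]][1] in self.visited:
            self.pos[topic] += 1
        if self.pos[topic] == len(lst):
            return None                            # list exhausted
        elem = lst[self.pos[topic]][1]
        self.visited.add(elem)
        return elem
```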

4.2 Multi-Topic ThresholdStream Algorithm

In this subsection, we present the MTTS algorithm for k-SIR processing. MTTS is built on two key ideas: (1) a thresholding approach [15] to submodular maximization and (2) a ranked-list-based mechanism for early termination. First, given a k-SIR query, the thresholding approach always tracks an estimate of its optimal representativeness score OPT. It establishes a sequence of candidates with different thresholds covering the possible range of OPT. For each arriving element, each candidate determines independently whether to include the element based on the element's marginal gain and the candidate's threshold. Second, to prune unnecessary evaluations, MTTS utilizes the ranked lists to sequentially feed elements to the candidates in decreasing order of score. It continuously checks the minimum threshold for an element to be added to any candidate and the upper-bound score of unevaluated elements, and terminates once the upper-bound score drops below the minimum threshold. After termination, the candidate with the maximum score is returned as the result for the k-SIR query.

Input: The ranked list of each topic and a k-SIR query
Result: The answer to the k-SIR query
1 foreach topic do start the traversal of its ranked list from the first tuple;
2 foreach guess in the geometric progression do initialize an empty candidate;
3 initialize the maximum singleton score and the upper-bound score;
4 while the upper-bound score is at least the minimum threshold do
5       select the element with the maximum upper-bound score for evaluation;
6       compute the score of the element w.r.t. the query;
7       if the score exceeds the maximum singleton score then update it;
8       adjust the range of guesses for the optimal score;
9       delete each candidate whose guess falls out of the range;
10      foreach remaining candidate do
11            if the candidate has fewer than k elements then
12                  if the marginal gain of the element reaches the candidate's threshold then add the element to the candidate;
13      advance to the next tuple in the ranked list of the element;
14      update the upper-bound score and the minimum threshold;
15 return the candidate with the maximum score;
Algorithm 2 Multi-Topic ThresholdStream
Algorithm 2 Multi-Topic ThresholdStream

The algorithmic description of MTTS is presented in Algorithm 2. In the initialization phase, given a parameter ε, MTTS establishes a geometric progression of guesses with common ratio (1+ε) to estimate the optimal score OPT for the query. Then, it maintains a candidate, initialized to the empty set, for each guess; the threshold of a candidate is determined by its guess. The traversal of the ranked lists starts from the first tuple of each list, and the current tuple of each list identifies the element to be evaluated next from that list. MTTS keeps three variables: (1) the maximum singleton score w.r.t. the query among the evaluated elements, (2) the minimum threshold for an element to be added to any candidate, and (3) the upper-bound score for any unevaluated element w.r.t. the query. Specifically, the minimum threshold is that of the unfilled candidate (i.e., one with fewer than k elements) with the smallest guess, and is set before the evaluation begins. If the upper-bound score of an element is below the minimum threshold, the element can be safely excluded from evaluation. In addition, because the tuples in each list are sorted by topic-wise score, the scores of the current tuples bound the score of any unevaluated element from above and can thus be used as the upper-bound score of unevaluated elements w.r.t. the query.

After the initialization phase, the elements are sequentially retrieved from the ranked lists and evaluated by the candidates. At each iteration, MTTS selects the element with the maximum upper-bound score as the next element for evaluation. Subsequently, the candidate maintenance procedure is performed. It first computes the score of the element w.r.t. the query. Second, it updates the maximum singleton score. Third, the range of guesses for OPT is adjusted accordingly. Fourth, it deletes the candidates whose guesses fall out of the range. Next, each candidate determines independently whether to add the element: if the candidate already contains k elements, the element is ignored; otherwise, the marginal gain of adding the element to the candidate is evaluated, and the element is added once the marginal gain reaches the candidate's threshold. Finally, MTTS obtains the next element from the ranked list and updates the upper-bound score accordingly. The evaluation procedure terminates when the upper-bound score drops below the minimum threshold, because every unevaluated element is then guaranteed to fail all thresholds and can be safely pruned. At that point, MTTS returns the candidate with the maximum score as the result for the query.
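The candidate-maintenance idea of MTTS can be sketched with a generic monotone submodular function standing in for the representativeness score. The geometric guesses with ratio (1+ε) follow the description above, while the concrete threshold rule v/(2k) is borrowed from the sieve-streaming literature [15] and may differ from the paper's exact rule.

```python
import math

def mtts_sketch(elements, k, f, eps=0.2):
    """Thresholding candidate maintenance in the spirit of MTTS.
    `elements` arrive in decreasing order of upper-bound score; `f` is a
    monotone submodular set function (a stand-in for representativeness)."""
    delta = 0.0          # maximum singleton score observed so far
    cands = {}           # guess v -> candidate set (a list of elements)
    for e in elements:
        delta = max(delta, f([e]))
        if delta == 0:
            continue
        # Guesses (1+eps)^i covering the range [delta, 2*k*delta].
        lo = math.ceil(math.log(delta, 1 + eps))
        hi = math.floor(math.log(2 * k * delta, 1 + eps))
        live = {(1 + eps) ** i for i in range(lo, hi + 1)}
        # Delete candidates whose guess fell out of the range.
        cands = {v: S for v, S in cands.items() if v in live}
        for v in live:
            S = cands.setdefault(v, [])
            # Each candidate decides independently by marginal gain.
            if len(S) < k and f(S + [e]) - f(S) >= v / (2 * k):
                S.append(e)
    return max(cands.values(), key=f) if cands else []
```

With a coverage function `f(S) = |union of S|` and sets as elements, the sketch reproduces the "best candidate wins" behavior described above.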

Figure 5: Example for k-SIR processing using MTTS.
Example 4.

Following the example in Table 1, we show in Figure 5 how MTTS processes a k-SIR query; the parameter settings used in this example are given in the figure.

First of all, the traversal of each ranked list starts from its first tuple, and the upper-bound score and minimum threshold are initialized. The first element to evaluate is the one with the maximum upper-bound score. Once its score w.r.t. the query is computed, the range of guesses for OPT is determined and the corresponding candidates are maintained; the first element can be added to each of them. After that, the next tuple is fetched from the same list, and the upper-bound score and minimum threshold are updated. The second element to evaluate is then taken from the other list. The candidates whose thresholds exceed its marginal gain skip it directly, while the remaining candidates include it. As the traversal proceeds, the upper-bound score keeps decreasing while the minimum threshold increases; the subsequently retrieved elements are skipped by all candidates. After the last evaluation, the upper-bound score drops below the minimum threshold. Thus, no more evaluation is needed and the candidate with the maximum score is returned as the result.

The approximation ratio of MTTS is given in Theorem 2.

Theorem 2.

The result returned by MTTS is a (1/2 − ε)-approximation for any k-SIR query.

The proof is given in Appendix A.3.

Complexity Analysis. The number of candidates in MTTS is logarithmic in k with base (1+ε), i.e., O(ε⁻¹ log k), as the ratio between the lower and upper bounds for the optimal score is O(k). Retrieving the next element requires selecting the ranked list whose current tuple yields the maximum upper-bound score. Evaluating one element for a candidate costs time proportional to the number of non-zero entries in the query vector and the number of topics on which the element has a non-zero weight; multiplying this cost by the number of candidates gives the cost of MTTS to evaluate one element. Overall, the time complexity of MTTS is the per-element evaluation cost multiplied by the number of elements it evaluates before early termination.

4.3 Multi-Topic ThresholdDescend Algorithm

Although MTTS is efficient for k-SIR processing, its approximation ratio is lower than the best achievable approximation guarantee, i.e., 1 − 1/e [13], for submodular maximization with cardinality constraints. In addition, its result quality is also slightly inferior to that of CELF. In this subsection, we propose the Multi-Topic ThresholdDescend (MTTD) algorithm to improve upon MTTS. Different from MTTS, MTTD maintains only one candidate to reduce the cost of evaluation. In addition, it buffers the elements that are retrieved from the ranked lists but not yet included into the candidate, so that these elements can be evaluated more than once. This leads to better quality, as the chance of missing significant elements is smaller. Specifically, MTTD performs multiple rounds of evaluation with decreasing thresholds. In each round, every element whose score can reach the current threshold is considered and is included into the candidate once its marginal gain reaches the threshold. When the candidate contains k elements or the threshold descends to the lower bound, MTTD terminates and the candidate is returned as the result. Theoretically, the approximation ratio of MTTD is improved to 1 − 1/e − ε, but its worst-case complexity is higher than that of MTTS. Despite this, the efficiency and result quality of MTTD are both better than those of MTTS empirically.
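A minimal sketch of the descending-threshold idea behind MTTD follows, again with a generic monotone submodular function standing in for the representativeness score; the starting threshold, the geometric descent factor, and the stopping bound below are our assumptions rather than the paper's exact choices.

```python
def mttd_sketch(elements, k, f, eps=0.2):
    """Descending-threshold greedy in the spirit of MTTD. Buffered elements
    are re-evaluated in every round, so an element rejected at a high
    threshold can still enter the candidate later at a lower one."""
    if not elements:
        return []
    d = max(f([e]) for e in elements)      # maximum singleton score
    S, tau = [], d                         # single candidate, start threshold
    while len(S) < k and tau >= (eps / k) * d:
        for e in elements:                 # re-scan the buffered elements
            if len(S) == k:
                break
            if e in S:
                continue
            # Include e once its marginal gain reaches the current threshold.
            if f(S + [e]) - f(S) >= tau:
                S.append(e)
        tau *= (1 - eps)                   # descend the threshold geometrically
    return S
```

Because every round re-scans the buffer, the worst-case cost exceeds that of the single-pass MTTS sketch, mirroring the trade-off stated above.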

Input: The ranked list of each topic and a k-SIR query
Result: The answer to the k-SIR query
1 initialize the candidate to the empty set;
2 foreach topic do start the traversal of its ranked list from the first tuple;
3 initialize the threshold to its upper bound;
4 while the candidate has fewer than k elements and the threshold is above the lower bound do
5       retrieve the next element from the ranked lists and buffer it;
6       while there are buffered elements to evaluate at the current threshold do
7             compute the marginal gain of the next buffered element;
8             if the marginal gain reaches the threshold then