1 Introduction
An enormous amount of data is continuously generated by web users on social platforms at an unprecedented rate. For example, around 650 million tweets are posted by 330 million users on Twitter per day. Such user-generated data can be modeled as continuous social streams, which are key sources of fresh and valuable information. Nevertheless, social streams are extremely overwhelming due to their huge volumes and high velocities. It is impractical for users to consume social data in its raw form. Therefore, social search [8, 7, 28, 37, 17, 33, 9, 18, 39] has become the primary approach to helping users find content of interest in massive social streams.
Existing search methods for social data can be categorized into keyword-based approaches and topic-based approaches according to how they measure the relevance between queries and elements. Keyword-based approaches [8, 7, 28, 37, 17, 33, 9] adopt textual relevance (e.g., TF-IDF and BM25) for evaluation. However, they merely capture the syntactic correlation and ignore the semantic correlation. Consider the tweets in Figure 3: if a query “soccer” is issued, no results will be found because none of the tweets contains the term “soccer”. Yet words like “asroma” and “LFC” are semantically relevant to “soccer”, so such elements are relevant to the query but missing from the result. Thus, overlooking the semantic meanings of user queries may degrade the result quality, especially for social data where lexical variation is prevalent [14].
To overcome this issue, topic-based approaches [18, 39] project user queries and elements into the same latent space defined by a probabilistic topic model [5]. Consequently, queries and elements are both represented as vectors and their relevance is computed by similarity measures for vectors (e.g., cosine distance) in the topic space. Although topic-based approaches can better capture the semantic correlation between queries and elements, they focus on the relevance of results but neglect their representativeness. Typically, they retrieve the top elements that are the most coherent with the query as the result. Such results may not be representative in the sense of information coverage and social influence. First, users are more satisfied with results that achieve an extensive coverage of information on the query topics than with results that provide limited information. For example, a top-k query on a topic in Figure 3 returns only the most coherent tweets, whereas other tweets can provide richer information to complement the news they report. Therefore, in addition to relevance, it is essential to consider information coverage to improve the result quality. Second, influence is another key characteristic for measuring the representativeness of social data. Existing methods for social search [37, 8, 18, 7] have taken the influences of elements into account for scoring and ranking. These methods simply use the influences of authors (e.g., PageRank [24] scores) or the retweet/share count to compute the influence scores. Such a naïve integration of influence is topic-unaware and may lead to undesired query results. For example, in Figure 3, a tweet mostly related to one topic may appear in the result for a query on another topic because of its high retweet count. In addition, these methods do not consider that the influences of elements evolve over time, as previously trending content becomes outdated and new posts continuously emerge.
Hence, incorporating a topic-aware and time-critical influence metric is imperative to capture recently trending elements.
To tackle the problems of existing search methods, we define a novel Semantic and Influence aware Representative (SIR) query for social streams based on topic modeling [5]. Specifically, a SIR query retrieves a set of elements from the active elements corresponding to the sliding window at the query time. The result set collectively achieves the maximum representativeness score w.r.t. the query vector, each dimension of which indicates the degree of interest in a topic. We advocate that the representativeness score of an element set be a weighted sum of its semantic and influence scores on each topic. We adopt a weighted word coverage model to compute the semantic score so as to achieve the best information preservation, where the weight of a word is evaluated based on its information entropy [42, 31]. The influence score is computed by a probabilistic coverage model where the influence probabilities are topic-aware. In addition, we restrict the influences to the sliding window so that recently trending elements can be selected.

The challenges of real-time SIR processing are twofold. First, the SIR query is NP-hard. Second, it is highly dynamic, i.e., the results vary with query vectors and evolve quickly over time. Due to the submodularity of the scoring function, existing submodular maximization algorithms, e.g., CELF [16] and SieveStreaming [3], can provide approximation results for SIR queries with theoretical guarantees. However, existing algorithms need to evaluate all active elements at least once for a single query and often take several seconds to process one SIR query, as shown in our experiments. To support real-time SIR processing over social streams, we maintain ranked lists that sort the active elements on each topic by topic-wise representativeness score. We first devise the Multi-Topic Threshold-Stream (MTTS) algorithm for SIR processing. Specifically, to prune unnecessary evaluations, MTTS sequentially retrieves elements from the ranked lists in decreasing order of their scores w.r.t. the query vector and can be terminated early whenever possible. Theoretically, it provides approximation results for SIR queries and evaluates each active element at most once. Furthermore, we propose the Multi-Topic Threshold-Descend (MTTD) algorithm to improve upon MTTS. MTTD maintains the elements retrieved from the ranked lists in a buffer and permits an element to be evaluated more than once to improve the result quality. Consequently, it achieves a better approximation but has a higher worst-case time complexity than MTTS. Despite this, MTTD shows better empirical efficiency and result quality than MTTS.
Finally, we conduct extensive experiments on three real-world datasets to evaluate the effectiveness of SIR as well as the efficiency and scalability of MTTS and MTTD. The results of a user study and quantitative analysis demonstrate that SIR achieves significant improvements over existing methods in terms of information coverage and social influence. In addition, MTTS and MTTD achieve up to 124x and 390x speedups over the baselines for SIR processing with only small losses in quality.
Our contributions in this work are summarized as follows.

We define the SIR query to retrieve representative elements over social streams where both semantic and influence scores are considered. (Section 3)

We propose MTTS and MTTD to process SIR queries in real-time with theoretical guarantees. (Section 4)

We conduct extensive experiments to demonstrate the effectiveness of SIR as well as the efficiency and scalability of our proposed algorithms for SIR processing. (Section 5)
2 Related Work
Search Methods for Social Streams. Many methods have been proposed for searching over social streams. Here we categorize existing methods into keyword-based approaches and topic-based approaches.
Keyword-based approaches [8, 7, 28, 37, 17, 33, 9, 40] typically define top-k queries to retrieve the elements with the highest scores as the results, where the scoring functions combine the relevance to query keywords (measured by TF-IDF or BM25) with other contexts such as freshness [28, 17, 33, 37], influence [8, 37], and diversity [9]. They also design different indices to support instant updates and efficient top-k query processing. However, keyword queries are substantially different from the SIR query, and thus keyword-based methods cannot be trivially adapted to process SIR queries based on topic modeling.
As metrics for textual relevance cannot fully represent the semantic relevance between user interest and text, recent work [18, 39] introduces topic models [5] into social search, where user queries and elements are modeled as vectors in the topic space. The relevance between a query and an element is measured by cosine similarity, and a top-k relevance query retrieves the most relevant elements to a query vector. However, these methods consider only the relevance of results and ignore their representativeness. Therefore, the algorithms in [18, 39] cannot be used to process SIR queries that emphasize the representativeness of results.

Social Stream Summarization. There have been extensive studies on social stream summarization [1, 27, 36, 23, 4, 26, 29, 25]: the problem of extracting a set of representative elements from social streams. Shou et al. [27, 36] propose a framework for social stream summarization based on dynamic clustering. Ren et al. [25] focus on the personalized summarization problem that takes users’ interests into account. Olariu [23] devises a graph-based approach to abstractive social summarization. Bian et al. [4] study the multimedia summarization problem on social streams. Ren et al. [26] investigate the multi-view opinion summarization of social streams. Agarwal and Ramamritham [1] propose a graph-based method for contextual summarization of social event streams. Nguyen et al. [31] consider maintaining a sketch of a social stream to best preserve the latent topic distribution.
However, the above approaches cannot be applied to ad-hoc query processing because they (1) do not provide a query interface and (2) are not efficient enough. For each query, they need to filter out irrelevant elements and invoke a new instance of the summarization algorithm to acquire the result, which often takes dozens of seconds or even minutes. Therefore, it is unrealistic to deploy a summarization method on a social platform for ad-hoc queries, since thousands of users could submit different queries at the same time and each query should be processed in real-time.
Submodular Maximization.
Submodular maximization has attracted a lot of research interest recently for its theoretical significance and wide applications. The standard approaches to submodular maximization with a cardinality constraint are the greedy heuristic [22] and its improved version CELF [16], both of which are (1−1/e)-approximate. Badanidiyuru and Vondrak [2] propose several approximation algorithms for submodular maximization with general constraints. Kumar et al. [15] and Badanidiyuru et al. [3] study the submodular maximization problem in the distributed and streaming settings. Epasto et al. [12] and Wang et al. [35] further investigate submodular maximization in the sliding window model. However, the above algorithms do not utilize any indices for acceleration, and thus they are much less efficient for SIR processing than the MTTS and MTTD algorithms proposed in this paper.

3 Problem Formulation



3.1 Data Model
Social Element. A social element is represented as a triple consisting of the timestamp at which it is posted, its textual content denoted by a bag of words drawn from an indexed vocabulary, and the set of elements it refers to. Given two elements where the later one refers to the earlier one, we say the earlier element influences the later one. In this way, the reference attribute captures the influence relationships between social elements [30, 34]. If an element is totally original, its reference set is empty. For example, the tweets on Twitter shown in Table 1 are typical social elements, and the propagation of hashtags can be modeled as references [30, 19]. Note that the influence relationships vary for different types of elements, e.g., “cite” between academic papers and “comment” on Reddit can also be modeled as references.
Social Stream. We consider social elements that arrive continuously as a data stream. A social stream comprises a sequence of elements ordered by timestamp; multiple elements with the same timestamp may arrive in an arbitrary order. Furthermore, social streams are time-sensitive: elements posted or referred to recently are more important and interesting to users than older ones. To capture the freshness of social streams, we adopt the well-recognized time-based sliding window [11] model. Given the window length, the sliding window at any time comprises the elements posted within the most recent period of that length. The set of active elements at a time includes not only the elements in the window but also the elements referred to by any element in the window.
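As a concrete illustration of the sliding-window model above, the sketch below computes the active set for a given time: it keeps the elements posted within the window plus everything they refer to. The triple layout and names (`stream`, `refs`) are illustrative, not the paper's notation.

```python
def active_elements(stream, t, window):
    """Active set at time t: elements posted in (t - window, t] plus the
    elements they refer to. `stream` is a list of (timestamp, element_id,
    refs) triples sorted by timestamp; layout is illustrative."""
    in_window = [(ts, eid, refs) for ts, eid, refs in stream
                 if t - window < ts <= t]
    active = {eid for _, eid, _ in in_window}
    for _, _, refs in in_window:
        # Referred-to elements stay active even if posted before the window.
        active.update(refs)
    return active
```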
Topic Model. We use probabilistic topic models [5] such as LDA [6] and BTM [38] to measure the (semantic and influential) representativeness of elements and the preferences of users. A topic model comprising a set of topics is trained from the corpus and the vocabulary. Each topic is a multinomial distribution over the words in the vocabulary, giving the probability of each word under that topic; the probabilities sum to one. The topic distribution of an element is a multinomial distribution over the topics, giving the probability that the element is generated from each topic; these probabilities also sum to one.

The selection of an appropriate topic model is orthogonal to our problem. In this work, any probabilistic topic model can be used as a black-box oracle to provide the word and element distributions. Note that the evolution of the topic distribution is typically much slower than the pace of the social stream [41, 38]. In practice, we assume that the topic distribution remains stable for a period of time and retrain the topic model from recent elements when it becomes outdated due to concept drift.
3.2 Query Definition
Query Vector. Given a topic model, we use a vector with one dimension per topic to denote a user’s preference over topics. Each entry indicates the user’s degree of interest in the corresponding topic, and w.l.o.g. the vector is normalized to sum to one. Since it is impractical for users to provide query vectors directly, for lack of knowledge about the topic model, we design a scheme to adapt the standard query-by-keyword [17] paradigm to our case: the keywords provided by a user are treated as a pseudo-document and the query vector is inferred from its distribution over the topics. Note that other query paradigms can also be supported, e.g., the query-by-document [39] paradigm where a document is provided as the query, and personalized search [18] where the query vector is inferred from a user’s recent posts.
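A minimal sketch of the keyword-to-query-vector transformation described above, assuming the topic model is available as per-word topic probabilities (`word_topic[w][i]` standing in for the probability of word `w` under topic `i`); a real system would run full posterior inference (e.g., Gibbs sampling) on the pseudo-document instead.

```python
def infer_query_vector(keywords, word_topic, num_topics):
    """Infer a normalized query vector from user keywords.

    Simplified sketch: each keyword contributes its normalized topic
    affinity; unknown words are skipped. Names are illustrative.
    """
    scores = [0.0] * num_topics
    for w in keywords:
        probs = word_topic.get(w)
        if probs is None:
            continue
        total = sum(probs)
        for i, p in enumerate(probs):
            scores[i] += p / total  # word's normalized affinity to topic i
    norm = sum(scores)
    # Normalize so the entries sum to one, as the query definition requires.
    return [s / norm for s in scores] if norm > 0 else scores
```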
Definition of Representativeness. Given a set of elements $S$ and a query vector $q$, the representativeness of $S$ w.r.t. $q$ at time $t$ is defined by a function $f$ that maps any subset of the active elements to a non-negative score w.r.t. a query vector. Formally, we have

$f(S, q, t) = \sum_{i=1}^{K} q_i \cdot f_i(S, t)$ (1)

where $K$ is the number of topics and $f_i(S, t)$ is the score of $S$ on topic $z_i$. Intuitively, the overall score of $S$ w.r.t. $q$ is the weighted sum of its scores on each topic. The score on $z_i$ is defined as a linear combination of its semantic and influence scores. Formally,

$f_i(S, t) = \lambda \cdot g_i(S) + (1 - \lambda) \cdot \gamma \cdot h_i(S, t)$ (2)

where $g_i(S)$ is the semantic score of $S$ on $z_i$, $h_i(S, t)$ is the influence score of $S$ on $z_i$ at time $t$, $\lambda$ specifies the trade-off between semantic and influence scores, and $\gamma$ adjusts the ranges of $g_i$ and $h_i$ to the same scale. Next, we introduce how to compute the semantic and influence scores based on the topic model.
Topic-specific Semantic Score. Given a topic $z_i$, we define the semantic score of a set of elements by the weighted word coverage model. We first define the weight of a word $v$ in an element $d$ on $z_i$. According to the generative process of topic models [5], the probability that $v$ is generated from $z_i$ is denoted as $p(v \mid z_i)$. Following [31, 42], the weight of $v$ in $d$ on $z_i$ is defined by its frequency and information entropy, i.e., $w_i(v, d) = tf(v, d) \cdot (-p(v \mid z_i) \log p(v \mid z_i))$, where $tf(v, d)$ is the frequency of $v$ in $d$. Then, the semantic score of $d$ on $z_i$ is the sum of the weights of distinct words in $d$, i.e., $g_i(d) = \sum_{v \in W(d)} w_i(v, d)$, where $W(d)$ is the set of distinct words in $d$. We extend the definition of the semantic score to an element set $S$ by handling word overlaps. Given a set $S$ and a word $v$, if $v$ appears in more than one element of $S$, its weight is counted only once, for the element with the maximum $w_i(v, d)$. Formally, the semantic score of $S$ on $z_i$ is defined by

$g_i(S) = \sum_{v \in W(S)} \max_{d \in S} w_i(v, d)$ (3)

where $W(S) = \bigcup_{d \in S} W(d)$. Equation 3 aims to select a set of elements that maximally cover the important words on $z_i$ so as to best preserve the information of $z_i$. Additionally, it implicitly captures diversity because adding highly similar elements to $S$ brings little increase in $g_i(S)$.
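The weighted word coverage model can be sketched as follows. The entropy-style weight (frequency times the word's information entropy on the topic) follows the description above; the exact weighting in the cited work may differ, and all names here are illustrative.

```python
import math

def word_weight(freq, p_word_topic):
    """Weight of a word on a topic: frequency times information entropy."""
    return freq * (-p_word_topic * math.log(p_word_topic))

def semantic_score(elements, p_word_topic):
    """Weighted word coverage of a set: each distinct word is counted once,
    with the maximum weight it attains in any covering element.

    `elements` maps element id -> {word: frequency}; `p_word_topic` maps
    word -> probability of the word under the topic.
    """
    best = {}
    for words in elements.values():
        for w, freq in words.items():
            weight = word_weight(freq, p_word_topic[w])
            best[w] = max(best.get(w, 0.0), weight)
    return sum(best.values())
```

Because a word's weight is counted only once per set, adding an element whose words are already covered yields no gain, which is the diminishing-returns behavior the definition relies on.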
Example 1.
Table 1 gives a social stream extracted from the tweets in Figure 3 and a topic model on the vocabulary of elements in the stream. We demonstrate how to compute the semantic score where on . The frequency of each word in any element is . The set of words in is . The word only appears in . Its weight is . The words appear in both elements. As and , and are the weights of and for . Finally, we sum up the weights of each word in and get . In this example, has no contribution to the semantic score because all words in are covered by .
Topic-specific Time-critical Influence Score. Given a topic $z_i$ and two elements $d$ and $d'$ where $d'$ refers to $d$, the probability of influence propagation from $d$ to $d'$ on $z_i$, denoted $pr_i(d, d')$, is defined based on the topic distributions of the two elements. Furthermore, the probability of influence propagation from a set of elements $S$ to $d'$ on $z_i$ is defined by $pr_i(S, d') = 1 - \prod_{d \in S} (1 - pr_i(d, d'))$; that is, we assume the influences from different precedents to $d'$ are independent of each other and adopt the probabilistic coverage model to compute the influence probability from a set of elements to an element. To select recently trending elements, we define the influence score in the sliding window model, where only the references observed within the window are considered. Let $D_t(d)$ be the set of elements influenced by $d$ at time $t$ and $D_t(S) = \bigcup_{d \in S} D_t(d)$ be the set of elements influenced by $S$ at time $t$. The influence score of $S$ on $z_i$ at time $t$ is defined by

$h_i(S, t) = \sum_{d' \in D_t(S)} pr_i(S, d')$ (4)

Equation 4 tends to select a set of elements that are influential on $z_i$ at time $t$. The value of $h_i(S, t)$ increases greatly only if an element $d$ is added to $S$ such that $d$ is relevant to $z_i$ itself and is referred to by many elements on $z_i$ within the window.
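The probabilistic coverage computation can be sketched as below. The pairwise propagation probabilities are passed in as a black-box function, since their exact topic-aware definition depends on the topic model; the structure (noisy-OR over independent precedents, summed over influenced elements) mirrors the description above. All names are illustrative placeholders.

```python
def coverage_probability(probs):
    """Probability that at least one of several independent influence
    attempts succeeds (the probabilistic coverage model)."""
    p_none = 1.0
    for p in probs:
        p_none *= 1.0 - p
    return 1.0 - p_none

def influence_score(selected, influenced_by, pairwise_prob):
    """Influence score of a set on one topic, summed over the elements it
    influenced within the window.

    `influenced_by[d]` is the set of elements referring to d within the
    window; `pairwise_prob(d, e)` is the topic-aware propagation probability.
    """
    targets = set()
    for d in selected:
        targets.update(influenced_by.get(d, ()))
    score = 0.0
    for e in targets:
        score += coverage_probability(
            [pairwise_prob(d, e)
             for d in selected if e in influenced_by.get(d, ())])
    return score
```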
Example 2.
We compute the influence score of in Table 1 on at time . We consider the window length and . at time is and expires at time . First, . Similarly, . For , we have . Finally, we acquire . We can see, although is referred to by several elements, its influence score on is low because and the elements referring to it are mostly on .
Query Definition. We formally define the Semantic and Influence aware Representative (SIR) query, which selects a set of elements with the maximum representativeness score w.r.t. a query vector from a social stream. We place two constraints on the result of a SIR query: (1) its size is bounded, i.e., the result contains at most a given number of elements, to avoid overwhelming users with too much information; (2) the elements in the result must be active at the query time, to satisfy the freshness requirement. Finally, we define a SIR query as follows.
Definition 1 (SIR).
Given the set of active elements and a query vector, a SIR query returns a set of elements with a bounded size such that the scoring function is maximized over all feasible sets. We refer to this set as the optimal result and to its score as the optimal representativeness score.
Example 3.
We consider two SIR queries on the social stream in Table 1. We set , in Equation 2 and the window length . At time , the set of active elements contains all except . Given a SIR query where (a user has the same interest in two topics), is the query result and . We can see obtain the highest scores on respectively and they collectively achieve the maximum score w.r.t. . Given a SIR query where (the user prefers one topic to the other), the query result is and . is excluded because it is mostly distributed on .
3.3 Properties and Challenges
Properties of SIR Queries. We first show the monotonicity and submodularity of the scoring function for the SIR query by proving that both the semantic function and the influence function are monotone and submodular.
Definition 2 (Monotonicity & Submodularity).
A set function $f$ defined on the power set of a ground set $V$ is monotone iff $f(S) \leq f(T)$ for any $S \subseteq T \subseteq V$. The function $f$ is submodular iff $f(S \cup \{d\}) - f(S) \geq f(T \cup \{d\}) - f(T)$ for any $S \subseteq T \subseteq V$ and $d \in V \setminus T$.
Lemma 1.
The semantic function is monotone and submodular on every topic.
Lemma 2.
The influence function is monotone and submodular on every topic at any time.
Given a query vector, the scoring function is a non-negative linear combination of the semantic and influence functions. Therefore, the scoring function is monotone and submodular.
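Submodularity is easy to check numerically for coverage-style functions like those above. The toy below uses plain set coverage as a stand-in for the scoring function and verifies the diminishing-returns property: the marginal gain of the same addition never grows as the base set grows.

```python
def coverage(sets, chosen):
    """Set coverage: a classic monotone submodular function, standing in
    for the SIR scoring function in this illustration."""
    covered = set()
    for i in chosen:
        covered |= sets[i]
    return len(covered)

# Diminishing returns: adding element 2 to a larger base set never yields
# a larger marginal gain than adding it to a subset of that base.
sets = {0: {1, 2}, 1: {2, 3}, 2: {3, 4, 5}}
gain_small = coverage(sets, [0, 2]) - coverage(sets, [0])
gain_large = coverage(sets, [0, 1, 2]) - coverage(sets, [0, 1])
```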
Challenges of SIR Queries. In this paper, we consider elements arriving continuously over time, and we always maintain the set of active elements at any time. It is required to provide the result for any ad-hoc SIR query in real-time.
The challenges of processing SIR queries in such a scenario are twofold: (1) NP-hardness and (2) dynamism. First, the following theorem shows that the SIR query is NP-hard.
Theorem 1.
It is NP-hard to obtain the optimal result for any SIR query.
The weighted maximum coverage problem can be reduced to the SIR query when only the semantic score is considered in Equation 2. Meanwhile, the probabilistic coverage problem is a special case of the SIR query when only the influence score is considered. Because both problems are NP-hard [13], the SIR query is NP-hard as well.
In spite of this, existing algorithms for submodular maximization [22] can provide results with constant approximation ratios relative to the optimal ones for SIR queries, due to the monotonicity and submodularity of the scoring function. For example, CELF [16] is (1−1/e)-approximate for SIR queries, while SieveStreaming [3] is (1/2−ε)-approximate for any ε > 0. However, both algorithms cannot fulfill the requirements of real-time SIR processing owing to the dynamism of SIR queries. The results of SIR queries not only vary with query vectors but also evolve over time for the same query vector, due to the changes in active elements and the fluctuations of influence scores over the sliding window. To process one SIR query, CELF and SieveStreaming must evaluate every active element at least once. Empirically, they often take several seconds for one SIR query when the window length is 24 hours. To the best of our knowledge, none of the existing algorithms can efficiently process SIR queries. Thus, we are motivated to devise novel real-time solutions for SIR processing over social streams.
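For reference, the CELF-style lazy greedy mentioned above can be sketched in a few lines: stale marginal gains serve as upper bounds (valid by submodularity), so most elements are never re-evaluated. This is a generic sketch over an arbitrary set function, not the paper's indexed algorithms.

```python
import heapq

def celf(universe, score, k):
    """CELF lazy greedy for monotone submodular maximization.

    `score(S)` evaluates a candidate list; lazy upper bounds are valid
    because marginal gains can only shrink as the solution grows.
    """
    result, result_score = [], 0.0
    # Max-heap of (-stale_gain, element); initial gains are singleton scores.
    heap = [(-score([e]), e) for e in universe]
    heapq.heapify(heap)
    while heap and len(result) < k:
        _, e = heapq.heappop(heap)
        fresh = score(result + [e]) - result_score  # re-evaluate lazily
        if not heap or fresh >= -heap[0][0]:
            # Fresh gain still beats every stale upper bound: take it.
            if fresh > 0:
                result.append(e)
                result_score += fresh
        else:
            heapq.heappush(heap, (-fresh, e))  # re-insert with fresh gain
    return result
```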
Notation  Description 

is a social stream; is an arbitrary element in ; is the th element in .  
is the window length; is the sliding window at time ; is the set of active elements at time .  
is a topic model; is the th topic in .  
is a dimensional vector; is the th entry of .  
is the semantic function on ; is the influence function on at time .  
is the representativeness scoring function on ; is the scoring function w.r.t. a query vector.  
is a SIR query at time with a bounded result size and a query vector .  
is the optimal result for ; is the optimal representativeness score.  
is the score of on ; is the score of w.r.t. .  
is the marginal score gain of adding to .  
is the ranked list maintained for the elements on topic . 
Before moving on to the section for SIR processing, we summarize the frequently used notations in Table 2.
4 Query Processing
In this section, we introduce the methods to process SIR queries over social streams. The architecture is illustrated in Figure 4. At any time, we maintain (1) an Active Window to buffer the set of active elements, (2) Ranked Lists to sort the elements on each topic in descending order of topic-wise representativeness score, and (3) a Query Processor that leverages the ranked lists to process SIR queries. In addition, when the topic model is given, the query and topic inferences become rather standard (e.g., Gibbs sampling [21]), and thus we do not discuss these procedures here due to space limitations. We assume that the query vectors and the topic vectors of elements are given in advance.
As shown in Figure 4, we process a social stream in a batch manner. The stream is partitioned into buckets of equal time length, and the window is updated at discrete times until the end of the stream. When the window slides, a bucket containing the elements posted since the previous update is received. After inferring the topic vector of each new element with the topic model, we first update the active window: the elements in the bucket are inserted into the active window, and the elements referred to by them are updated. Then, the elements that are no longer active are discarded from the active window. Subsequently, the ranked list on each topic is maintained. The detailed procedure for ranked list maintenance is presented in Section 4.1.
Next, let us discuss the mechanism of SIR processing. One major drawback of existing submodular maximization methods, e.g., CELF [16] and SieveStreaming [3], for processing SIR queries is that they need to evaluate every active element at least once. However, real-world datasets often have two characteristics: (1) The scores of elements are skewed, i.e., only a few elements have high scores. For example, we compute the scores of a sample of tweets w.r.t. a SIR query and scale the scores linearly to the range of 0 to 1. The statistics demonstrate that only 0.4% of the elements have scores greater than 0.9 while 91% of the elements have scores less than 0.1. (2) One element can only be high-ranked in very few topics, i.e., one element is about only one or two topics. In practice, we observe that the average number of topics per element is less than 2. Therefore, most of the elements are not relevant to a specific SIR query. We can greatly improve the efficiency by avoiding the evaluations of elements with very low chances of being included in the query result. To prune these unnecessary evaluations, we leverage the ranked lists to sequentially evaluate the active elements in decreasing order of their scores w.r.t. the query vector. In this way, we can track whether unevaluated elements can still be added to the query result and terminate the evaluations as soon as possible.

Although this method of traversing the ranked lists is similar to the one for top-k queries [39], the procedures for maintaining the query results are totally different. A top-k query simply returns the elements with the maximum scores as the result. Although the top-k result can be retrieved efficiently from the ranked lists using existing methods [39], its quality for SIR queries is suboptimal because the word and influence overlaps are ignored. Thus, we propose the Multi-Topic Threshold-Stream (MTTS) and Multi-Topic Threshold-Descend (MTTD) algorithms for SIR processing in Sections 4.2 and 4.3. They return high-quality results with constant approximation guarantees for SIR queries while meeting the real-time requirements.
4.1 Ranked List Maintenance
In this subsection, we introduce the procedure for ranked list maintenance. Generally, a ranked list keeps a tuple for each active element on its topic. The tuple for an element stores the element's topic-wise representativeness score and the timestamp at which the element was last referred to. All tuples in a ranked list are sorted in descending order of topic-wise score.
The algorithmic description of ranked list maintenance over a social stream is presented in Algorithm 1. Initially, an empty ranked list is created for each topic in the topic model. At discrete timestamps until the end of the stream, the ranked lists are updated according to each bucket of elements. For each element in the bucket, a tuple is created and inserted into the ranked list of every topic on which the element has non-zero probability. The tuple's initial score contains no influence contribution because the elements influenced by the new element have not been observed yet, and its last-referred timestamp is the element's posting time. Subsequently, the algorithm recomputes the influence score of each parent of the new element, updates the parent's tuple with the new score and timestamp, and adjusts the tuple's position in the list accordingly. Finally, the tuples of expired elements are deleted from the ranked lists.
Complexity Analysis. The cost of evaluating for any element is where . Then, the complexity of inserting a tuple into is . For each , the complexity of reevaluating is also . Overall, the complexity of maintaining for element is where . As the tuples for may appear in ranked lists, the time complexity of ranked list maintenance for element is .
Operations for Ranked List Traversal. For SIR processing we need to access the tuples in each ranked list in decreasing order of topic-wise score. Two basic operations are defined to traverse a ranked list: (1) first, which retrieves the element of the first tuple, i.e., the one with the maximum topic-wise score; and (2) next, which acquires the element of the next unvisited tuple after the current one. Note that once a tuple for an element has been accessed in one ranked list, the remaining tuples for that element in the other lists are marked as “visited” so as to eliminate duplicate evaluations.
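A minimal sketch of this traversal interface: per-topic lists sorted by score, with a visited set shared across lists to suppress duplicate evaluations. The class and method names are illustrative, not the paper's implementation.

```python
class RankedList:
    """Per-topic (score, element) tuples traversed in descending score
    order; a `visited` set shared across lists marks elements already
    returned by any list, so each element is evaluated at most once."""

    def __init__(self, tuples, visited):
        self.tuples = sorted(tuples, key=lambda t: -t[0])
        self.visited = visited  # shared across all topic lists
        self.pos = 0

    def _advance(self):
        while self.pos < len(self.tuples):
            score, elem = self.tuples[self.pos]
            self.pos += 1
            if elem not in self.visited:
                self.visited.add(elem)
                return score, elem
        return None  # list exhausted

    def first(self):
        self.pos = 0
        return self._advance()

    def next(self):
        return self._advance()
```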
4.2 MultiTopic ThresholdStream Algorithm
In this subsection, we present the MTTS algorithm for SIR processing. MTTS is built on two key ideas: (1) a thresholding approach [15] to submodular maximization and (2) a ranked-list-based mechanism for early termination. First, given a SIR query, the thresholding approach always tracks an estimate of its optimal representativeness score. It establishes a sequence of candidates with different thresholds spanning the possible range of the optimum. For any element, each candidate determines whether to include it independently, based on the element's marginal gain and the candidate's threshold. Second, to prune unnecessary evaluations, MTTS utilizes the ranked lists to sequentially feed elements to the candidates in decreasing order of score. It continuously checks the minimum threshold for an element to be added to any candidate and the upper-bound score of unevaluated elements. MTTS terminates when the upper-bound score is lower than the minimum threshold. After termination, the candidate with the maximum score is returned as the result of the SIR query.
The algorithmic description of MTTS is presented in Algorithm 2. In the initialization phase, given an approximation parameter, MTTS establishes a geometric progression of thresholds to estimate the optimal score. It then maintains a candidate, initialized to the empty set, for each threshold. The traversal of the ranked lists starts from the first tuple of each list. MTTS keeps three variables: (1) the maximum score w.r.t. the query among the evaluated elements, (2) the minimum threshold for an element to be added to any unfilled candidate, and (3) the upper bound on the score of any unevaluated element w.r.t. the query. Because the tuples in each ranked list are sorted by topic-wise score, the scores of the current tuples yield a valid upper bound for all unevaluated elements.

After the initialization phase, the elements are sequentially retrieved from the ranked lists and evaluated by the candidates. At each iteration, MTTS selects the element with the maximum upper-bound score as the next element for evaluation. Subsequently, the candidate maintenance procedure is performed: MTTS first computes the score of the element w.r.t. the query, then updates the maximum score, adjusts the range of thresholds, and deletes the candidates whose thresholds fall out of range. Next, each candidate determines independently whether to add the element: if the candidate is already full, the element is ignored; otherwise, the marginal gain of adding the element to the candidate is evaluated, and the element is added if the gain reaches the candidate's threshold. Finally, MTTS obtains the next element from the ranked list and updates the upper bound accordingly. The evaluation procedure terminates when the upper bound falls below the minimum threshold, because then every unevaluated element can be safely pruned. Finally, MTTS returns the candidate with the maximum score as the result.
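To make the thresholding idea concrete, the sketch below implements a simplified SieveStreaming-style pass: one candidate per geometric guess of the optimum, each admitting an element whose marginal gain reaches its per-slot threshold. It omits MTTS's ranked-list ordering and early termination, so it is a baseline sketch rather than the paper's algorithm; all names are illustrative.

```python
import math

def sieve_stream(elements, score, k, epsilon=0.5):
    """Simplified sieve-style thresholding for monotone submodular
    maximization with cardinality constraint k."""
    m = 0.0           # best singleton score seen so far
    candidates = {}   # threshold guess v -> current candidate list
    for e in elements:
        m = max(m, score([e]))
        # Keep guesses v with m <= v <= 2*k*m, spaced by factor (1+epsilon).
        lo = math.ceil(math.log(m, 1 + epsilon)) if m > 0 else 0
        hi = math.floor(math.log(2 * k * m, 1 + epsilon)) if m > 0 else 0
        live = {(1 + epsilon) ** i for i in range(lo, hi + 1)}
        candidates = {v: S for v, S in candidates.items() if v in live}
        for v in live:
            S = candidates.setdefault(v, [])
            gain = score(S + [e]) - score(S)
            # Admit e if its gain fills a fair share of the remaining slots.
            if len(S) < k and gain >= (v / 2 - score(S)) / (k - len(S)):
                S.append(e)
    return max(candidates.values(), key=score, default=[])
```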
Example 4.
Following the example in Table 1, we show how MTTS processes a SIR query where in Figure 5. We set in this example.
First of all, the traversals of and start from and respectively. Initially, we have and . Then, the first element to evaluate is because . As , the range of is . We have and candidates with are maintained. can be added to each of the candidates. After that, is the next element from . and are updated to and respectively. The second element to evaluate is from . As , the candidate with directly skips for . Other candidates include as . Then, is the next element from . decreases to while increases to . Subsequently, are retrieved but skipped by all candidates. After evaluating , decreases to and is lower than . Thus, no more evaluation is needed and is returned as the result for .
The approximation ratio of MTTS is given in Theorem 2.
Theorem 2.
The result returned by MTTS achieves a constant approximation ratio for any SIR query.
The proof is given in Appendix A.3.
Complexity Analysis. The number of candidates in MTTS is as the ratio between the lower and upper bounds for is . The complexity of retrieving an element from ranked lists is . The complexity of evaluating one element for a candidate is where and is the number of nonzero entries in the query vector . Thus, the complexity of MTTS to evaluate one element is . Overall, the time complexity of MTTS is where is the number of elements evaluated by MTTS.
4.3 Multi-Topic Threshold-Descend Algorithm
Although MTTS is efficient for SIR processing, its approximation ratio is lower than the best achievable approximation guarantee, i.e., 1−1/e [13], for submodular maximization with cardinality constraints. In addition, its result quality is also slightly inferior to that of CELF. In this subsection, we propose the Multi-Topic Threshold-Descend (MTTD) algorithm to improve upon MTTS. Different from MTTS, MTTD maintains only one candidate to reduce the cost of evaluation. In addition, it buffers the elements that are retrieved from the ranked lists but not yet included in the candidate, so that these elements can be evaluated more than once. This leads to better quality, as the chances of missing significant elements are smaller. Specifically, MTTD runs multiple rounds of evaluation with decreasing thresholds. In the round with a given threshold, each buffered element whose upper-bound score reaches the threshold is considered, and it is included in the candidate once its marginal gain reaches the threshold. When the candidate is full or the threshold descends to the lower bound, MTTD terminates and the candidate is returned as the result. Theoretically, the approximation ratio of MTTD improves upon that of MTTS, but its worst-case complexity is higher. Despite this, the efficiency and result quality of MTTD are both better than those of MTTS empirically.
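The descending-threshold idea behind MTTD can be sketched as follows: sweep the buffered elements repeatedly, admitting any element whose marginal gain meets the current threshold, then geometrically lower the threshold. The buffer and ranked-list machinery of the actual algorithm are omitted, and all names are illustrative.

```python
def threshold_descend(elements, score, k, epsilon=0.2):
    """Descending-threshold greedy for monotone submodular maximization.

    Starts the threshold at the best singleton score and multiplies it by
    (1 - epsilon) after each sweep, stopping at a small floor or when the
    candidate is full. This achieves near-greedy quality because every
    admitted element's gain is within a (1 - epsilon) factor of the best
    remaining gain at that point.
    """
    result = []
    d_max = max((score([e]) for e in elements), default=0.0)
    threshold = d_max
    floor_t = epsilon * d_max / k
    while threshold >= floor_t and len(result) < k:
        for e in elements:
            if e in result or len(result) >= k:
                continue
            if score(result + [e]) - score(result) >= threshold:
                result.append(e)  # gain meets the current threshold
        threshold *= 1 - epsilon  # descend geometrically
    return result
```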