Given a query, for knowledge summarization, it is often crucial to understand its relevant entities and their inherent connections. More interestingly, the entities and connections may evolve over time or other ordinal dimensions, indicating dynamic behaviors and changing trends. For example, in biological scientific literature, the study of a disease might focus on particular genes for a period of time, and then shift to some others due to a technology breakthrough. Capturing such entity connections and their changing trends can enable various tasks including the analysis of concept evolution, forecast of future events and detection of outliers.
Related work. Traditional information retrieval (IR) aims at returning a ranked list of documents according to their relevance to the query (Baeza-Yates et al., 2011). To understand the results and distill knowledge, users then need to pick out and read some of the documents, which requires tedious information processing and often leads to inaccurate conclusions. To deal with this, recent works on entity search (Jameel et al., 2017; Shen et al., 2018b) aim to search for entities instead of documents, but they only return lists of isolated entities and are thus incapable of providing insights about entity connections. Existing works on graph keyword search (Kacholia et al., 2005; Tan et al., 2016) have used graph structures for query result or knowledge summarization, but they do not consider the evolution of entity connections.
Present work. In this work, we advocate for the novel task of entity evolutionary network construction for query-specific knowledge summarization, which aims to return a set of query-relevant entities together with their evolutionary connections modeled by a series of networks automatically constructed from large-scale text corpora. Mathematically, we model entities as variables in a complex dynamic system, and estimate the connections among them based on their discrete occurrences within the documents.
Regarding techniques, recent works on network structure estimation have studied the inference of time-varying networks (Hallac et al., 2017; Tomasi et al., 2018). However, our novel problem setting poses two unique challenges: 1) identifying the query-relevant set of entities from text corpora, and 2) constructing the evolutionary entity connections based on discrete entity observations. To address them in order, we develop SetEvolve, a unified framework based on the principled nonparanormal graphical models (Liu et al., 2009; Gan et al., 2018).
For the first challenge, we assume a satisfactory method has been given for the retrieval of a list of documents based on query relevance (e.g., BM25 (Baeza-Yates et al., 2011)) and develop a general post-hoc method based on document rank list cutoff for query-relevant entity set identification. The cutoff is efficiently computed towards entity-document support recovery for accurate network construction, with theoretical and empirical analysis on the existence of optimality.
For the second challenge, we formulate the problem as an evolutionary graphical lasso, by modeling entities as variables in an evolving Markov random field, and detecting their inherent connections by estimating the underlying inverse covariance matrix. Moreover, we leverage the robust nonparanormal transformation to deal with the ordinal discrete entity observations in documents. Through theoretical analysis, we show that our model can capture the true conditional connections among entities, while its computational efficiency remains the same as that of standard graphical lasso.
We evaluate SetEvolve on both synthetic networks and real-world datasets. On synthetic datasets with different sizes, evolution patterns and noisy observations, SetEvolve leads to significant improvements of 9%-21% on the standard F1 measure compared with the strongest state-of-the-art baseline. Furthermore, on three real-world corpora, example evolutionary networks constructed by SetEvolve provide plausible summarizations of query-relevant knowledge that are rich, clear and readily interpretable.
Given a query on a text corpus, we aim to provide a clear and interpretable summarization over the retrieved knowledge, which is extracted from the corpus based on its relevance to the query. In principle, the summarization should help users easily understand the key concepts within the retrieved knowledge, as well as their interactions and evolutions.
To achieve this goal, we are motivated by the recent success of text summarization with concept maps (Falke and Gurevych, 2017), and propose to represent the summarization as a series of concept networks, stressing the evolutionary nature of concept interactions. Each network corresponds to a window in an arbitrary ordinal dimension of interest, such as time, product price, user age, etc. Without loss of generality, we will focus on time as the evolving dimension in the following. We further decompose each network into the set of relevant concepts and the set of concept interactions, both in the corresponding time window.
In this work, we propose SetEvolve, a unified framework based on nonparanormal graphical models (Liu et al., 2009; Falke and Gurevych, 2017; Hallac et al., 2017; Yang et al., 2017; Gan et al., 2018), for the novel problem of constructing query-specific knowledge summarization from free text corpora. In particular, empowered by recently available structured data of knowledge bases and well-developed techniques of entity discovery (Hasibi et al., 2017; Shen et al., 2017), we leverage entities to represent concepts of interest, and derive a theoretically sound method to identify the stable set of query-relevant entities. Moreover, to model concept interactions in a principled way, we consider the corresponding entities as interconnected variables in a complex system and leverage graphical models to infer their connection patterns based on their occurrences in the text corpus as variable observations. Finally, to capture knowledge evolution, we consider observations associated with time (or another evolving dimension of interest), and jointly estimate a series of networks for characterizing changing trends.
2.2. Entity Set Identification
To construct the knowledge summarization, we first need to identify a set of entities to represent the key concepts that users care about. Entity set search without the consideration of network evolution is well studied (Shen et al., 2018a). In contrast, in this work, we develop a general post-hoc entity set identification method based on the theoretical guarantees of nonparanormal graphical models for document-entity support recovery, as shown in Theorem 2.2. In particular, we assume that, given a list of documents ranked by relevance, the optimal query-relevant entity set will appear consistently within the top-ranked documents. In other words, as we bring in more documents, we get more complete entity sets and more supporting information for constructing the entity network, but too many documents will bring in less relevant entities and redundant information. Therefore, we aim to find an optimal cutoff on the ranked document list to extract the entity set and further facilitate accurate evolutionary network construction.
Without loss of generality, we use the classic BM25 (Baeza-Yates et al., 2011) to retrieve the ranked documents, while various other advanced methods like (Wang et al., 2018) can be trivially plugged in to better meet users’ particular information needs regarding query relevance. To extract entities from the retrieved documents, we assume the availability of non-evolving knowledge bases and utilize an entity linking tool called CMNS (Hasibi et al., 2017) to convert the document list into a list of entity sets, one per document, each containing the entities in the corresponding document. To quantify the optimality, we propose to track the following metric sequentially as documents are taken in one-by-one and use it to determine the optimal entity set:
Here, the metric is defined in terms of the size of the accumulated entity set and the number of documents taken in from the rank list.
The optimal entity set extracted from the ranked documents is determined based on two criteria: the convergence of , which is used to measure the completeness of the entity set; and is larger than a prefixed threshold (e.g., 10), which provides the theoretical guarantees in Theorem 2.2 for document-entity support recovery.
Figure 1 shows the metric computed on the PubMed and Yelp datasets with different queries. As we can see, the derivative often converges to 0 rapidly, which corroborates our assumption of entity consistency. In particular, as we take in more documents, the total number of entities converges, so we can cut off the document list once the aforementioned two criteria are met, and use the retained documents to identify the entity set and construct the evolutionary networks over it.
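The cutoff procedure described above can be sketched as follows. This is only an illustration under stated assumptions, not the procedure used in the paper: the exact metric of Eq. (1) is not reproduced here, so the sketch uses a hypothetical stand-in criterion (the number of new entities contributed by the most recent documents), and the names `find_cutoff`, `window`, `tol` and `min_docs` are invented for this example.

```python
from typing import List, Set, Tuple


def find_cutoff(entity_sets: List[Set[str]], window: int = 3,
                tol: int = 0, min_docs: int = 10) -> Tuple[int, Set[str]]:
    """Return (number of documents kept, accumulated entity set).

    entity_sets[i] holds the entities linked in the i-th ranked document.
    The scan stops once at least `min_docs` documents have been taken in
    and the last `window` documents added at most `tol` new entities.
    Both stopping rules are assumptions standing in for the paper's metric.
    """
    seen: Set[str] = set()
    sizes: List[int] = []  # accumulated entity-set size after each document
    for n, entities in enumerate(entity_sets, start=1):
        seen |= entities
        sizes.append(len(seen))
        if n >= min_docs and n > window:
            growth = sizes[-1] - sizes[-1 - window]
            if growth <= tol:
                return n, set(seen)
    # No convergence detected: fall back to the whole list.
    return len(entity_sets), set(seen)


# Toy ranked list: early documents introduce entities, later ones repeat them.
docs = [{"a", "b"}, {"b", "c"}, {"c", "d"}, {"a", "d"}, {"b"},
        {"a", "c"}, {"d"}, {"a"}, {"b", "c"}, {"d"}, {"e"}, {"f"}]
n_star, entity_set = find_cutoff(docs, window=3, tol=0, min_docs=5)
print(n_star, sorted(entity_set))
```

In this toy list the scan stops before the late, less relevant entities `e` and `f` are reached, which mirrors the intuition that taking in too many documents brings in less relevant entities.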
2.3. Evolutionary Network Construction
After fixing the entity set, we consider the retrieval of connections among its entities. A straightforward way to construct links in an entity network is to compute and threshold the document-level entity co-occurrence as suggested in (Peng et al., 2018). However, it is not clear how to weigh the links and set the thresholds. More importantly, because the thresholding method (Peng et al., 2018) only models the pairwise marginal dependence between two entities, and does not consider their interactions with all the other entities, the links generated may be messy and less insightful, as shown in (Shen et al., 2018a).
To find the essential entity connections in a principled way, we propose to uncover the underlying connection patterns among entities as a graphical model selection problem (Friedman et al., 2008; Gan et al., 2018). Assume we have a collection of observations of a set of entities. In a standard Gaussian graphical model (GGM), we assume the observations are independent and identically distributed (iid) and follow a multivariate Gaussian distribution, and our target is to detect the support of the precision matrix. It is well known that, in a standard GGM, a zero entry in the precision matrix is equivalent to the conditional independence of the two corresponding variables given all other variables. If the Gaussianity and iid assumptions of GGM hold, insightful conditional dependence network links can be generated systematically.
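The difference between marginal dependence (what co-occurrence thresholding sees) and conditional dependence (what the precision matrix encodes) can be illustrated with a small numeric sketch. The three-variable chain and the helper `invert` below are assumptions made for illustration and are not taken from the paper.

```python
from typing import List

Matrix = List[List[float]]


def invert(mat: Matrix) -> Matrix:
    """Invert a small square matrix by Gauss-Jordan elimination with pivoting."""
    n = len(mat)
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(mat)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


# Precision matrix of a chain A - B - C: A and C are linked only through B.
precision = [[2.0, -1.0, 0.0],
             [-1.0, 2.0, -1.0],
             [0.0, -1.0, 2.0]]
covariance = invert(precision)

# Marginally, A and C are correlated (non-zero covariance) ...
print(round(covariance[0][2], 3))  # prints 0.25
# ... yet the precision entry is zero, so no direct A-C link is drawn.
print(precision[0][2])  # prints 0.0
```

A threshold on the covariance would connect A and C here, whereas the support of the precision matrix keeps only the two direct links of the chain.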
However, the GGM assumptions do not exactly fit the context we consider, i.e., entity evolutionary network construction, because 1) GGM assumes observations are Gaussian, but entity occurrences in documents are discrete; 2) The entity network is evolving over time, so the iid assumption does not hold.
Motivated by previous works in graphical models that address the non-Gaussian data (Liu et al., 2009) and time dependent data (Zhou et al., 2008; Kolar and Xing, 2011), we propose an evolutionary nonparanormal graphical model, which detects the conditional dependence structure even when data is both discrete and evolving. Without loss of generality, in the following, we use time as the evolving dimension to describe our model.
Denote the -dimensional discrete observation at time as . In the proposed model, we assume satisfies the following relationship:
where is an -dimensional Gaussian copula transformation function defined in Eq. (3) and is assumed to evolve smoothly from to . The smoothness assumption on the evolution pattern follows Assumption S in (Kolar and Xing, 2011), where the smoothness is quantified through the boundedness of the first and second derivatives of the changes in over time . No further assumption on the parametric (distributional) form of the evolution pattern is required for the model.
As shown in the following proposition, the model (2) can capture the conditional dependence links even when data is evolving and discrete. The proposition can be shown through standard matrix calculation and we refer to Lemma 2 in (Liu et al., 2009) for the proof.
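Proposition 2.1.
If the entry of the inverse covariance matrix for a pair of entities is zero at a given time, then the two entities are conditionally independent at that time given all other entities.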
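Motivated by (Liu et al., 2009), we utilize a Gaussian copula transformation function to handle the non-Gaussianity in the data, and the function is defined as follows:

$$\hat{f}_j(x) = \hat{\mu}_j + \hat{\sigma}_j \, \Phi^{-1}\big(\tilde{F}_j(x)\big). \tag{3}$$

In (3), $\hat{\mu}_j$ and $\hat{\sigma}_j$ are the empirical mean and empirical standard deviation of the $j$-th variable respectively, $\Phi^{-1}$ is the quantile function of the standard normal distribution, and $\tilde{F}_j$ is the Winsorized estimator of the empirical distribution suggested in (Liu et al., 2009) and defined as:

$$\tilde{F}_j(x) = \begin{cases} \delta, & \hat{F}_j(x) < \delta, \\ \hat{F}_j(x), & \delta \le \hat{F}_j(x) \le 1 - \delta, \\ 1 - \delta, & \hat{F}_j(x) > 1 - \delta, \end{cases}$$

where $\delta$ is the truncation parameter and $\hat{F}_j$ is the empirical cumulative distribution function.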
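As a rough illustration of the transformation described above, the following sketch maps the discrete observations of a single variable to approximately Gaussian scores. It is a minimal sketch under assumptions: the truncation level is the one proposed by Liu et al. (2009), the population standard deviation is used for the empirical standard deviation, and the function name `nonparanormal_transform` and the toy counts are hypothetical.

```python
import math
from statistics import NormalDist, mean, pstdev
from typing import List


def nonparanormal_transform(values: List[float]) -> List[float]:
    """Map one variable's observations to approximately Gaussian scores."""
    n = len(values)  # needs n >= 2 so that log(n) > 0
    # Truncation level from Liu et al. (2009); assumed here.
    delta = 1.0 / (4.0 * n ** 0.25 * math.sqrt(math.pi * math.log(n)))
    mu, sigma = mean(values), pstdev(values)
    std_normal = NormalDist()
    out = []
    for x in values:
        ecdf = sum(1 for v in values if v <= x) / n   # empirical CDF at x
        wins = min(max(ecdf, delta), 1.0 - delta)     # Winsorize into [delta, 1 - delta]
        out.append(mu + sigma * std_normal.inv_cdf(wins))
    return out


# Toy entity occurrence counts over ten documents (sorted for readability).
counts = [0, 0, 1, 1, 2, 3, 3, 5, 8, 10]
scores = nonparanormal_transform(counts)
print([round(s, 2) for s in scores])
```

The Winsorization keeps the largest observation, whose empirical CDF value is 1, away from the point where the normal quantile function diverges; tied counts receive identical scores.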
This transformation allows us to estimate the evolving pattern of the inverse covariance matrix through a kernel-based method (Zhou et al., 2008; Kolar and Xing, 2011), where our target is to minimize the following kernel-based objective function:
In the objective, the sample covariance matrix at each time is given by a kernel estimator, with weights defined as:
where the weights are determined by a kernel function such as the box kernel.
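The kernel-weighted covariance estimate can be sketched as below. This is an illustrative reading of the kernel approach of Zhou et al. (2008) and Kolar and Xing (2011), not the paper's exact equations: the box kernel form, the normalization of the weights, the assumption of one already transformed and centered observation per time point, and all function names are assumptions.

```python
from typing import List, Sequence


def box_kernel(x: float) -> float:
    """Box kernel: constant inside [-1, 1], zero outside."""
    return 0.5 if abs(x) <= 1.0 else 0.0


def kernel_weights(t: int, times: Sequence[int], h: float) -> List[float]:
    """Normalized weights of each observation time relative to target time t.

    Assumes at least one observation time lies within bandwidth h of t.
    """
    raw = [box_kernel((s - t) / h) for s in times]
    total = sum(raw)
    return [r / total for r in raw]


def weighted_covariance(obs: Sequence[Sequence[float]],
                        weights: Sequence[float]) -> List[List[float]]:
    """Weighted sum of outer products of (already transformed, centered) rows."""
    p = len(obs[0])
    cov = [[0.0] * p for _ in range(p)]
    for w, row in zip(weights, obs):
        if w == 0.0:
            continue
        for i in range(p):
            for j in range(p):
                cov[i][j] += w * row[i] * row[j]
    return cov


times = [0, 1, 2, 3, 4]
obs = [[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, -1.0], [2.0, 2.0]]
w = kernel_weights(t=2, times=times, h=1.0)
s_hat = weighted_covariance(obs, w)
print(w)
print(s_hat)
```

With bandwidth 1, only the observations at times 1, 2 and 3 contribute to the estimate at time 2. The resulting matrix would then be passed to a graphical lasso solver (for example, scikit-learn's `sklearn.covariance.graphical_lasso`) to obtain a sparse inverse covariance estimate for that time point.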
2.4. Theoretical Analysis
In the following theorem, we show the accuracy of SetEvolve in detecting the true entity links.
Theorem 2.2.
Under the same assumptions as (Kolar and Xing, 2011) on the true evolving graphs, define the maximum graph node degree as . Suppose is estimated using a kernel with bandwidth . If the number of documents satisfies for a sufficiently large constant and we choose tuning parameter , then our estimation procedure can detect the links correctly with probability converging to 1.
In terms of computational efficiency, SetEvolve first computes the kernel estimate of the covariance matrix using equation (6), and then computes the network estimate for each time point using the state-of-the-art Graphical Lasso algorithm (Friedman et al., 2008). The computational complexity is thus exactly the same as that of the state-of-the-art evolutionary network inference algorithms (Hallac et al., 2017; Tomasi et al., 2018).
3. Synthetic Experiments
Since there is no ground-truth for knowledge summarization with evolutionary networks, we follow the common practice in recent works on network inference (Hallac et al., 2017; Tomasi et al., 2018) to demonstrate the effectiveness of SetEvolve through comprehensive synthetic experiments.
Data Generation. Following Hallac et al. (2017), we generate two evolutionary networks of sizes and and then randomly generate the ground-truth covariance and sample discrete observations over 100 timestamps, where the global and local evolutions occur at time . At each , we generate 10 independent samples from the true distribution, with maximum variable value set to 10.
Compared Algorithms. We compare SetEvolve to two baselines: 1) Static (Liu et al., 2009), which applies the nonparanormal transformation for discrete observations, but aggregates observations from all timestamps to infer a single static network without evolutions, and 2) TVGL (Hallac et al., 2017), which leverages graphical lasso for evolutionary network construction but does not consider discrete observations.
Performance Evaluations. Table 1 shows the performance of the compared methods. Following (Hallac et al., 2017; Tomasi et al., 2018), we compute macro-F1 scores for link reconstruction over evolutionary networks. As we can see, all algorithms perform better on smaller networks, partly because the observations are denser and the essential correlations are easier to capture. Moreover, all algorithms perform better with local evolutions, because the overall changes are smaller, while TVGL and SetEvolve perform better on global evolutions compared with Static. Finally, SetEvolve consistently outperforms both Static and TVGL on networks of different sizes and evolution types, which indicates its effectiveness in detecting correlations among evolutionary discrete variables. The scores of TVGL are lower than those in the original work due to the discreteness of variables. Besides accuracy, SetEvolve constructs networks with the fewest links, generating clear views of networks for deriving the most essential entity connections.
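For concreteness, one way to compute such a score is sketched below. It assumes that macro-F1 means the unweighted average of per-timestamp F1 scores over undirected links; the paper does not spell out the averaging, so this reading, the helper names and the toy networks are all assumptions.

```python
from typing import FrozenSet, Iterable, List, Set, Tuple

Edge = FrozenSet[str]


def as_edges(pairs: Iterable[Tuple[str, str]]) -> Set[Edge]:
    """Treat links as undirected: (a, b) and (b, a) are the same edge."""
    return {frozenset(pair) for pair in pairs}


def f1_score(true_edges: Set[Edge], pred_edges: Set[Edge]) -> float:
    """F1 of predicted links against ground-truth links for one network."""
    hits = len(true_edges & pred_edges)
    if hits == 0:
        return 0.0
    precision = hits / len(pred_edges)
    recall = hits / len(true_edges)
    return 2 * precision * recall / (precision + recall)


def macro_f1(true_nets: List[Set[Edge]], pred_nets: List[Set[Edge]]) -> float:
    """Unweighted mean of the per-timestamp F1 scores."""
    scores = [f1_score(t, p) for t, p in zip(true_nets, pred_nets)]
    return sum(scores) / len(scores)


# Two timestamps: the first prediction gets one of two links right,
# the second prediction is exact.
truth = [as_edges([("a", "b"), ("b", "c")]), as_edges([("a", "b")])]
preds = [as_edges([("b", "a"), ("c", "d")]), as_edges([("a", "b")])]
score = macro_f1(truth, preds)
print(score)  # prints 0.75
```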
We evaluate the robustness of SetEvolve by randomly adding Poisson noise to synthetic observations and show the results in Figure 2. We also find that the runtimes of TVGL and SetEvolve are roughly the same, which indicates the good scalability of SetEvolve.
4. Case Studies
We show the practical utility of SetEvolve on three datasets: PubMed with GB texts of articles (https://www.ncbi.nlm.nih.gov/pubmed/), Yelp with GB texts of reviews (https://www.yelp.com/dataset/challenge), and Semantic with GB texts of papers (https://labs.semanticscholar.org/corpus/corpus/archive).
Figures 3, 4 and 5 show the constructed networks on each corpus. The results are insightful. Figure 3, for example, shows that the long-used OGTT test for Diabetes was replaced by a new standard HbA1 test in the 2000s, and Type 2 Diabetes (DM2) draws more research attention with newly found proteins like Osteocalcin in recent years. In Figure 4, we notice that Chinese restaurants often get poorly rated due to the quality of classic dishes (e.g., orange chicken, sour soup), while the higher ratings focus on authenticity and customer service. In Figure 5, we observe that papers about MOOC and transfer learning have accumulated fewer citations, partly because these fields are quite new, whereas papers about question answering, knowledge base and language model are usually cited more. We will release the full implementation of our framework, and interested readers can pose queries to find more insightful knowledge.
We propose SetEvolve, a unified framework for query-specific knowledge summarization from large text corpora. Built on the principles of nonparanormal graphical models, SetEvolve identifies query-relevant entity sets and constructs evolutionary entity networks with theoretical guarantees. Its effectiveness is corroborated with rich synthetic experiments and insightful case studies.
- Baeza-Yates et al. (2011) Ricardo Baeza-Yates, Berthier de Araújo Neto Ribeiro, and others. 2011. Modern information retrieval. New York: ACM Press; Harlow, England: Addison-Wesley.
- Falke and Gurevych (2017) Tobias Falke and Iryna Gurevych. 2017. Bringing Structure into Summaries: Crowdsourcing a Benchmark Corpus of Concept Maps. In ACL.
- Friedman et al. (2008) Jerome Friedman, Trevor Hastie, and Robert Tibshirani. 2008. Sparse inverse covariance estimation with the graphical lasso. In Biostatistics.
- Gan et al. (2018) Lingrui Gan, Naveen N Narisetty, and Feng Liang. 2018. Bayesian Regularization for Graphical Models with Unequal Shrinkage. In JASA.
- Hallac et al. (2017) David Hallac, Youngsuk Park, Stephen Boyd, and Jure Leskovec. 2017. Network Inference via the Time-Varying Graphical Lasso. In KDD.
- Hasibi et al. (2017) Faegheh Hasibi, Krisztian Balog, and Svein Erik Bratsberg. 2017. Entity linking in queries: Efficiency vs. effectiveness. In ECIR.
- Jameel et al. (2017) Shoaib Jameel, Zied Bouraoui, and Steven Schockaert. 2017. MEmbER: Max-Margin Based Embeddings for Entity Retrieval. In SIGIR.
- Kacholia et al. (2005) Varun Kacholia, Shashank Pandit, Soumen Chakrabarti, S Sudarshan, Rushi Desai, and Hrishikesh Karambelkar. 2005. Bidirectional expansion for keyword search on graph databases. In VLDB.
- Kolar and Xing (2011) Mladen Kolar and Eric Xing. 2011. On time varying undirected graphs. In ICAIS.
- Liu et al. (2009) Han Liu, John Lafferty, and Larry Wasserman. 2009. The nonparanormal: Semiparametric estimation of high dimensional undirected graphs. In JMLR.
- Peng et al. (2018) Hao Peng, Jianxin Li, Yu He, Yaopeng Liu, Mengjiao Bao, Lihong Wang, Yangqiu Song, and Qiang Yang. 2018. Large-Scale Hierarchical Text Classification with Recursively Regularized Deep Graph-CNN. In WWW.
- Shen et al. (2017) Jiaming Shen, Zeqiu Wu, Dongming Lei, Jingbo Shang, Xiang Ren, and Jiawei Han. 2017. SetExpan: Corpus-Based Set Expansion via Context Feature Selection and Rank Ensemble. In ECML/PKDD.
- Shen et al. (2018a) Jiaming Shen, Jinfeng Xiao, Xinwei He, Jingbo Shang, Saurabh Sinha, and Jiawei Han. 2018a. Entity set search of scientific literature. In SIGIR.
- Shen et al. (2018b) Jiaming Shen, Jinfeng Xiao, Yu Zhang, Carl Yang, Jingbo Shang, Saurabh Sinha, Peipei Ping, Zhiyong Lu, and Jiawei Han. 2018b. SetSearch+: Entity-Set-Aware Search and Mining for Scientific Literature. In KDD.
- Tan et al. (2016) Zhaowei Tan, Changfeng Liu, Yuning Mao, Yunqi Guo, Jiaming Shen, and Xinbing Wang. 2016. AceMap: A Novel Approach towards Displaying Relationship among Academic Literatures. In WWW.
- Tomasi et al. (2018) Federico Tomasi, Veronica Tozzo, Saverio Salzo, and Alessandro Verri. 2018. Latent variable time-varying network inference. In KDD.
- Wang et al. (2018) Xuanhui Wang, Cheng Li, Nadav Golbandi, Michael Bendersky, and Marc Najork. 2018. The lambdaloss framework for ranking metric optimization. In CIKM.
- Yang et al. (2017) Carl Yang, Lin Zhong, Li-Jia Li, and Luo Jie. 2017. Bi-directional joint inference for user links and attributes on large social graphs. In WWW.
- Zhou et al. (2008) Shuheng Zhou, John Lafferty, and Larry Wasserman. 2008. Time varying undirected graphs. In COLT.