
A Formal Definition of Importance for Summarization

01/26/2018
by Maxime Peyrard, et al.

Research on summarization has mainly been driven by empirical approaches, crafting systems to perform well on standard datasets, with the notion of information Importance remaining latent. We argue that establishing formal theories of Importance will advance our understanding of the task and further improve summarization systems. Therefore, we attempt a definition of several concepts, namely Redundancy, Relevance, and Informativeness, within an abstract theoretical framework. Importance arises as a single quantity naturally unifying these concepts. Finally, we provide intuitions to interpret the proposed quantities, especially Importance.



1 Introduction

Text summarization is the process of identifying the most important information from a source (or sources) and producing a comprehensive output for particular users and/or tasks Mani (1999). While producing readable outputs is a problem shared with the field of natural language generation, the core challenge of summarization is the identification and selection of important information. The task definition is rather intuitive but involves vague and undefined terms such as Importance and information.

Since the seminal work of Luhn (1958), automatic text summarization research has focused on empirical developments, crafting summarization systems to perform well on standard datasets while leaving the formal definitions of Importance and information latent Das and Martins (2010); Nenkova and McKeown (2012). This view entails collecting datasets, defining evaluation metrics and iteratively selecting the best-performing systems, either via supervised learning or via repeated comparison of unsupervised systems Yao et al. (2017).

These approaches lack guidance because they do not test hypotheses emerging from theoretical frameworks. Instead, they rely on hypotheses stemming from common sense and intuition about which aspects of Importance are relevant. While such empirical approaches have facilitated the development of practical solutions, they only identify signals correlating with the vague human intuition of Importance. For example, even today, structural features like centrality and repetition are still the most used proxies for Importance Yao et al. (2017). However, such features simply correlate with Importance in standard datasets. Unsurprisingly, simple adversarial attacks reveal their weaknesses Zopf et al. (2016).

We postulate that establishing formal theories of Importance will advance our understanding of the task and further improve summarization systems. One can draw inspiration from physics, arguably one of the most successful scientific developments, which fosters both empirical and theoretical work with strong interactions between the two: empirical studies test hypotheses designed to falsify working theories, while theories are refined to account for new empirical results Kuhn (1970). In summarization, and more generally in Natural Language Processing, the lack of effort to produce abstract theoretical frameworks might impede progress.

A theory provides a frame of reference for interpreting observations, defining new concepts, generalizing knowledge and understanding complex logical relationships between variables. It forms an interrelated, coherent set of ideas and models which is refined upon new empirical observations Kuhn (1970). Hence, it is, by design, more internally consistent than common sense and intuition.

In symbiosis with empirical works, theories are particularly useful because they provide a common language to ground research. They describe how different approaches relate to each other, pinpoint dark zones, opportunities for improvement and promising areas. Therefore, they provide motivation and direction for future research. Theoretically motivated experiments are always beneficial; even if the outcome of an experiment is unexpected, it is an opportunity to revise and improve the theory in a fundamental way Kuhn (1970).

In this work, we attempt a definition of information Importance within an abstract theoretical framework. This requires the notion of information, which has received a lot of attention since the work of Shannon and Weaver (1963) in the context of communication theory. The subsequent theory produced powerful tools successfully applied in various domains such as physics Jaynes (1957), economics Maasoumi (1993), evolutionary biology Adami (2012), and the study of consciousness Tononi et al. (2016). Information theory provides the means to rigorously discuss the abstract concept of information, which makes it particularly suited as an entry point for a theory of summarization.

However, information theory focuses on the uncertainty (entropy) about which message was chosen among a set of possible messages, ignoring the semantics of messages Shannon and Weaver (1963). The ideal communication channel model offers tools for lossless compression of messages at the symbol level while ignoring common knowledge, whereas summarization is a lossy compression at the semantic level that depends on background knowledge.

In order to apply information theory to summarization, we postulate the existence of a semantic representation of a text as a set of semantic units. When applied to semantic symbols, the tools of information theory indirectly operate at the semantic level. The semantic unit decomposition is supported by recent work attempting a semantic theory of information Zhong (2017).

Within this framework, we define several concepts intuitively connected to summarization: Redundancy, Relevance and Informativeness. Importance arises as a single quantity naturally unifying these concepts. In this view, Importance is not an intrinsic property of a semantic unit; it depends on which units are present within some contextual boundaries: Redundancy operates in the context of the summary only, Relevance in the context of the source document(s), and Informativeness in the context of the background knowledge and preconceptions of the user. Finally, Importance encompasses all three levels.

The rest of the paper is organized as follows: section 2 briefly presents the relevant background from information theory, and section 3 develops the framework. Section 4 illustrates the workings of the formulas on real datasets; to do so, we make simplistic assumptions about semantic units. Related work is discussed in section 5, and section 6 discusses semantic units and other practical considerations.

Contribution

We propose a simple theoretical framework for content selection in summarization governed by the abstract notion of Importance. Within the framework, we provide definitions of quantities like Redundancy, Relevance and Informativeness, together with intuitions to interpret them. This theoretical development is a humble starting point, to be empirically tested and refined upon new observations.

2 Background

In this section, we briefly describe standard notions from information theory required for the rest of the paper.

2.1 Entropy

Entropy was introduced by Shannon and Weaver (1963) as the central quantity of information theory. Let P = (p_1, \dots, p_n) be a probability distribution defined over n symbols \omega_1, \dots, \omega_n such that p_i is the probability assigned to \omega_i. Then the entropy of P is defined by:

H(P) = -\sum_{i=1}^{n} p_i \log p_i    (1)

Note that -\log p_i is the surprise of observing \omega_i, and the entropy of P is the expected surprise. Entropy is a measure of uncertainty and is therefore maximal for the uniform distribution (\forall i, p_i = 1/n).

Entropy is also a measure of the average amount of information: intuitively, the more uncertainty (or average surprise) there is in P, the more information we need in order to communicate the outcome of an experiment governed by P. Information theory is not concerned with what form information takes; it only aims at quantifying information via entropy.
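As a concrete illustration (our own minimal sketch, not part of the original paper), entropy can be computed directly from a distribution represented as a Python dict mapping symbols to probabilities:

    import math

    def entropy(p, base=2.0):
        """H(P) = -sum_i p_i log p_i; in bits for base 2 (Eq. 1)."""
        return -sum(pi * math.log(pi, base) for pi in p.values() if pi > 0)

    # Uncertainty is maximal for the uniform distribution: H = log2(4) = 2 bits.
    uniform = {"w1": 0.25, "w2": 0.25, "w3": 0.25, "w4": 0.25}
    spiky = {"w1": 0.85, "w2": 0.05, "w3": 0.05, "w4": 0.05}
    print(entropy(uniform))  # 2.0
    print(entropy(spiky))    # ~0.85: less uncertainty, less average information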

2.2 KL divergence

Let P and Q be two probability distributions defined over the same set of symbols. How can we quantify the difference between these two distributions? The Kullback-Leibler (KL) divergence, or relative entropy, is one such measure and is defined by:

KL(P \| Q) = \sum_{i=1}^{n} p_i \log \frac{p_i}{q_i}    (2)

It is not a proper distance because it is not symmetric (KL(P‖Q) ≠ KL(Q‖P)) and does not satisfy the triangle inequality. Rather than being interpreted as a distance between distributions, KL(P‖Q) should be understood as a measure of the entropy increase due to using Q as an approximation of P. It measures the loss of information incurred by this approximation.
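A matching sketch for the KL divergence (again our own illustration; the infinite case below is precisely the smoothing issue discussed in section 6.3):

    import math

    def kl_divergence(p, q, base=2.0):
        """KL(P||Q) = sum_i p_i log(p_i / q_i) (Eq. 2)."""
        total = 0.0
        for w, pw in p.items():
            if pw == 0:
                continue  # the term 0 * log 0 is taken as 0
            qw = q.get(w, 0.0)
            if qw == 0:
                return float("inf")  # unbounded surprise; smoothing prevents this
            total += pw * math.log(pw / qw, base)
        return total

    p = {"w1": 0.5, "w2": 0.5}
    q = {"w1": 0.9, "w2": 0.1}
    print(kl_divergence(p, q))  # ~0.74
    print(kl_divergence(q, p))  # ~0.53: KL is not symmetric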

3 Framework

3.1 Terminology and Assumptions

We call semantic unit an atomic piece of information \omega_i which is independent of every other semantic unit. Atomic and independent mean that knowing or observing one semantic unit \omega_i gives no information about the existence or the content of a different unit \omega_j. We denote by \Omega the set of all possible semantic units.

Consequently, we write \omega(T) \subseteq \Omega for the set of semantic units conveyed by a given text T. As language is redundant, a text might contain several expressions of the same semantic unit. A text T is then represented by a frequency distribution P_T over its support \omega(T). For example, a text T = (\omega_1, \omega_1, \omega_2) has the support {\omega_1, \omega_2} and the frequency distribution P_T(\omega_1) = 2/3, P_T(\omega_2) = 1/3.

In particular, we represent the source text(s) D and the candidate summary S as frequency distributions P_D and P_S.
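Assuming the semantic units of a text have already been identified (a strong assumption, see section 6), building the frequency distribution is straightforward; a minimal sketch:

    from collections import Counter

    def freq_distribution(units):
        """Represent a text, given as a sequence of semantic units, by P_T."""
        counts = Counter(units)
        n = len(units)
        return {w: c / n for w, c in counts.items()}

    # A text expressing unit w1 twice and w2 once:
    print(freq_distribution(["w1", "w1", "w2"]))  # {'w1': 0.666..., 'w2': 0.333...}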

In the following paragraphs, we propose several quantities to model the task of summarization, relying on the well-established field of information theory to provide sound theoretical motivation for them. Since the symbols are semantic units, the information-theoretic tools indirectly operate at the semantic level.

3.2 Redundancy

Intuitively, a summary should contain the maximum amount of information. In information-theoretic terms, the amount of information is measured by Shannon's entropy. For a summary S represented by P_S, the entropy is given by:

H(S) = -\sum_{\omega \in \omega(S)} P_S(\omega) \log P_S(\omega)    (3)

H(S) is maximized for a uniform probability distribution, i.e., when every semantic unit appears exactly once in S: \forall \omega, P_S(\omega) = 1 / |\omega(S)|. Therefore, we define Redundancy, our first quantity relevant to summarization, via entropy:

Red(S) = H_{max} - H(S)    (4)

Because H_{max} is a constant independent of the actual summary S, we can simplify the expression: Red(S) = -H(S).

Intuitively, a summary maximizes its information content if it displays many semantic units, each exactly once. Indeed, S should not be redundant but should also contain as many semantic units as possible. For example, the summaries S_1 = (\omega_1, \omega_2) and S_2 = (\omega_1, \omega_2, \omega_3, \omega_4) are both non-redundant, but S_2 is intuitively better because it contains more information. This is captured by entropy: H(S_2) > H(S_1).

Furthermore, we observe that entropy encompasses the notion of maximum coverage. Indeed, maximum coverage and minimum redundancy are equivalent in the context of summarization because one should fit as much information as possible in a constrained space McDonald (2007).

Now, suppose the source documents display the semantic units {\omega_1, \omega_2, \omega_3, \omega_4}. Then the summaries (\omega_1, \omega_2), (\omega_1, \omega_3), (\omega_2, \omega_4), and so on, all maximize H, minimize Redundancy, and are thus indistinguishable without further insights.
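Under the same toy representation, Redundancy follows directly from entropy (a sketch reusing the entropy and freq_distribution helpers above):

    def redundancy(p_s):
        """Red(S) = -H(S), dropping the constant H_max (Eq. 4)."""
        return -entropy(p_s)

    # Both summaries below are non-redundant, but the second covers more
    # units, so its entropy is higher and its Redundancy lower.
    s1 = freq_distribution(["w1", "w2"])
    s2 = freq_distribution(["w1", "w2", "w3", "w4"])
    print(redundancy(s1), redundancy(s2))  # -1.0  -2.0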

3.3 Relevance

A summary also depends on the source text(s) it originates from: it aims to approximate the original source(s), and this approximation should incur a minimal loss of information. We call this property Relevance.

With information-theoretic tools, estimating Relevance boils down to comparing the probability distributions P_S and P_D with the KL divergence:

Rel(S, D) = -KL(P_S \| P_D)    (5)

This is interpreted as the information loss incurred by using S as an approximation of D. A summary approximating the source(s) exhibits a frequency distribution of semantic units similar to the distribution of semantic units of the source documents: P_S \approx P_D. The negative sign is there because the KL divergence measures dissimilarity while Relevance should measure similarity.

We observe that divergence and entropy are connected by:

KL(P_S \| P_D) = CE(S, D) - H(S)    (6)

where CE(S, D) = -\sum_{\omega} P_S(\omega) \log P_D(\omega) is the cross-entropy, interpreted as the average surprise of observing S while expecting D. Based on these insights, Relevance decomposes into entropy and cross-entropy:

Rel(S, D) = -Red(S) - CE(S, D)    (7)

CE(S, D), the average surprise of observing S while expecting D, should be low. Therefore, maximizing Relevance is equivalent to minimizing both the Redundancy of S and the surprise of observing S given that we expect D. We discuss the connections between these notions in more detail in section 3.5.

Relevance considers only the source document(s) and ignores any other potential source of information, such as previous knowledge about the world or preconceptions of the user or about the task. For example, if P_D is uniform over the semantic units {\omega_1, \omega_2, \omega_3, \omega_4}, Relevance still cannot distinguish between the two summaries (\omega_1, \omega_2) and (\omega_3, \omega_4).
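Relevance is then a thin wrapper around the KL divergence (reusing kl_divergence from section 2.2); the example reproduces the limitation just mentioned:

    def relevance(p_s, p_d):
        """Rel(S, D) = -KL(P_S || P_D) (Eq. 5)."""
        return -kl_divergence(p_s, p_d)

    # With a uniform source distribution, two disjoint summaries are
    # indistinguishable by Relevance alone:
    p_d = {"w1": 0.25, "w2": 0.25, "w3": 0.25, "w4": 0.25}
    print(relevance(freq_distribution(["w1", "w2"]), p_d))  # -1.0
    print(relevance(freq_distribution(["w3", "w4"]), p_d))  # -1.0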

3.4 Informativeness

In Shannon's theory, information is the surprise of observing an outcome. The surprise depends on our expectations and preconceptions about the outcome, which are encoded in a probability distribution over symbols.

When extending information to the notion of semantic information, the reasoning is analogous. Semantic information is the surprise of observing a semantic signal Carnap and Bar-Hillel (1953); Zhong (2017). This surprise depends on our expectations and preconceptions about which semantic signal will happen. Therefore, we introduce K, which represents background knowledge and preconceptions about the summarization task. In the classical view, K would be a probability distribution over symbols; analogously, K is here a probability distribution P_K over the semantic units \Omega.

Intuitively, a summary is informative if it gives a lot of new information with respect to what is already known. An informative summary is a summary that induces, for a user, a great change in his/her knowledge about the world.

Formally, the semantic information contained in a summary is the average amount of surprise. The surprise of seeing the unit \omega is -\log P_K(\omega), and therefore the average surprise of observing S while knowing K is given by the cross-entropy:

Inf(S, K) = CE(S, K) = -\sum_{\omega} P_S(\omega) \log P_K(\omega)    (8)

As we saw previously, cross-entropy and KL divergence are connected via entropy. Thus, Informativeness can be rewritten as:

Inf(S, K) = -Red(S) + KL(P_S \| P_K)    (9)

Note that KL(P_S‖P_K) can be interpreted as the Bayesian surprise Louis (2014), i.e., the amount of information gained by observing S if we already know K. Maximizing Informativeness is thus equivalent to minimizing Redundancy while maximizing Bayesian surprise.
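Informativeness is the cross-entropy between the summary and the background; a sketch in the same style, where the assert checks the decomposition of Eq. 9 numerically:

    import math

    def cross_entropy(p, q, base=2.0):
        """CE(P, Q) = -sum_i p_i log q_i."""
        total = 0.0
        for w, pw in p.items():
            if pw == 0:
                continue
            qw = q.get(w, 0.0)
            if qw == 0:
                return float("inf")
            total -= pw * math.log(qw, base)
        return total

    def informativeness(p_s, p_k):
        """Inf(S, K) = CE(S, K) (Eq. 8)."""
        return cross_entropy(p_s, p_k)

    p_s = freq_distribution(["w1", "w1", "w2"])
    p_k = {"w1": 0.7, "w2": 0.2, "w3": 0.1}
    # Eq. 9: Inf(S, K) = -Red(S) + KL(P_S || P_K)
    assert abs(informativeness(p_s, p_k)
               - (-redundancy(p_s) + kl_divergence(p_s, p_k))) < 1e-9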

Remarks

It is important to notice that the choice of K controls many variants of the summarization task.

For example, K can be used to encode the background knowledge of a user by assigning high probability to known information. These probabilities correspond to the strength of the units in the user's memory. A simple model could be the (smoothed) uniform distribution over known information, where P_K(\omega) is constant if the user knows \omega and near zero otherwise.

Furthermore, a user may indicate his/her preferences and interests by setting low probabilities in P_K for the semantic units of interest. An informative summary is one that fills these gaps in K.

Similarly, a query Q can be encoded by setting low probabilities for the semantic units related to Q.

It is also a natural formulation of update summarization. Let A and B be two sets of documents. Update summarization consists in summarizing B given that the user has already seen A. This is modeled simply by setting P_K = P_A, considering A as previous knowledge.
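These variants only differ in how P_K is constructed. A sketch of two such constructions (the helper names and the simple user model are our own illustration):

    def update_background(p_a):
        """Update summarization: the already-seen documents A play the role of K."""
        return dict(p_a)

    def user_background(known_units, all_units, eps=1e-6):
        """Crude user model: uniform over known units, near-zero elsewhere."""
        p = {w: eps for w in all_units}
        for w in known_units:
            p[w] = 1.0
        z = sum(p.values())
        return {w: v / z for w, v in p.items()}

    # A user who already knows w1 and w2 will find w3 highly surprising:
    p_k = user_background({"w1", "w2"}, ["w1", "w2", "w3"])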

3.5 Importance

Importance is hard to define because of its inherent vagueness and subjectivity. Summarization is a lossy semantic compression and whenever one compresses with loss of information one must make choices about what to discard. Informally, Importance is the measure that guides these choices.

As a definition of Importance, we propose the following formula:

\theta_I(S) = -KL(P_S \| P_{D/K})    (10)

This introduces a new distribution P_{D/K} defined by:

P_{D/K}(\omega) = \frac{1}{C} \cdot \frac{P_D(\omega)}{P_K(\omega) + \delta}    (11)

where C is the normalizing constant and \delta is a smoothing factor preventing undefined divisions. We say that P_{D/K} is the importance-encoding distribution because P_{D/K}(\omega) measures the importance of \omega with respect to the background knowledge K and the source document(s) D. It is the distribution of semantic units that a summary should approximate.
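A sketch of the importance-encoding distribution and of \theta_I built on the previous helpers (the value of \delta and the exact smoothing scheme are implementation choices on our side):

    def importance_distribution(p_d, p_k, delta=1e-3):
        """Target distribution P_{D/K}(w), proportional to P_D(w) / (P_K(w) + delta) (Eq. 11)."""
        scores = {w: pd / (p_k.get(w, 0.0) + delta) for w, pd in p_d.items()}
        c = sum(scores.values())  # normalizing constant C
        return {w: s / c for w, s in scores.items()}

    def theta_i(p_s, p_d, p_k, delta=1e-3):
        """Importance theta_I(S) = -KL(P_S || P_{D/K}) (Eq. 10)."""
        return -kl_divergence(p_s, importance_distribution(p_d, p_k, delta))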

The previous quantities gave us measures of semantic information from different perspectives: Redundancy at the summary level, Relevance at the level of the source(s), and Informativeness at the level of background knowledge and preconceptions. To better understand how \theta_I arises from these quantities and how to interpret it, we remark that it can be rewritten as:

\theta_I(S) = Rel(S, D) + Inf(S, K) - A    (12)

where A = \log C is a constant arising from the expansion of P_{D/K}. From now on, we omit this term as it does not depend on S. This gives three equivalent formulations of \theta_I:

\theta_I(S) \equiv Rel(S, D) + Inf(S, K)
            \equiv -2 \cdot Red(S) - CE(S, D) + KL(P_S \| P_K)
            \equiv -Red(S) + Rel(S, D) + KL(P_S \| P_K)    (13)

The first says that maximizing \theta_I is equivalent to maximizing both Relevance and Informativeness. The second says that it is equivalent to minimizing Redundancy and the average surprise of observing S while expecting D, while maximizing the Bayesian surprise of observing S knowing K. The third is the minimization of Redundancy while maximizing both Relevance and Bayesian surprise. Finally, we can say that Red, Rel and Inf are the three components of Importance.

It is worth noting that each previously defined quantity (Red, Rel, Inf and \theta_I) is measured in bits (or nats, depending on the base of the logarithm). Shannon initially axiomatized that information quantities should be additive Shannon and Weaver (1963); defining Importance as a sum of information quantities is therefore natural.
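The decomposition can be checked numerically. With \delta = 0 and full-support distributions, Eq. 12 holds exactly (a small self-check reusing the helpers sketched above):

    import math

    p_d = {"w1": 0.5, "w2": 0.3, "w3": 0.2}
    p_k = {"w1": 0.6, "w2": 0.3, "w3": 0.1}
    p_s = {"w1": 0.4, "w2": 0.4, "w3": 0.2}

    target = importance_distribution(p_d, p_k, delta=0.0)
    c = sum(pd / p_k[w] for w, pd in p_d.items())  # normalizing constant C
    lhs = -kl_divergence(p_s, target)
    rhs = relevance(p_s, p_d) + informativeness(p_s, p_k) - math.log(c, 2)
    assert abs(lhs - rhs) < 1e-9  # theta_I = Rel + Inf - A, with A = log C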

Interpreting P_{D/K}

The distribution P_{D/K} is central because it encodes the relative importance of semantic units and gives an overall target for the summary: equation (10) states that a summary should approximate this distribution.

For example, suppose that a semantic unit \omega is prominent in D (P_D(\omega) is high) and not known in K (P_K(\omega) is low). Then P_{D/K}(\omega) is very high, meaning \omega is highly desired in the summary. This makes sense because choosing this unit fills a gap in the knowledge while matching the source(s).

Figure 1 illustrates how this distribution behaves with respect to P_D and P_K. Depending on their prominence in the sources and in the background knowledge, P_{D/K} gives the relative importance of semantic units. It not only encourages the selection of high-scoring semantic units first; it also indicates which choices and trade-offs are most beneficial. Moreover, \theta_I scores the selection of several semantic units jointly; it is not restricted to choosing individual units independently.

Summarization uncertainty

This target distribution may exhibit different properties: it might be very clear which semantic units should be extracted (a spiky probability distribution), or it might be unclear (many units with roughly the same importance score). This can be quantified by the entropy of the target distribution:

H(P_{D/K}) = -\sum_{\omega} P_{D/K}(\omega) \log P_{D/K}(\omega)    (14)

Intuitively, this measures the number of possibly good summaries. If H(P_{D/K}) is low, then P_{D/K} is spiky and there is little uncertainty about which semantic units to extract (few good summaries are possible). Conversely, if the entropy is high, many equally good summaries are possible.

3.6 Potential Information

We defined one notion relating S and D (Relevance) and another relating S and K (Informativeness), but we can also ask what connects D and K. Intuitively, if D and K are very similar, there is little information to report. However, if D is surprising compared to our expectations K, there is a lot of information to report.

With the same argument we laid out for Informativeness, we define the amount of potential information as the average surprise of observing D while already knowing K. Again, this is given by the cross-entropy:

PI_K(D) = CE(D, K) = -\sum_{\omega} P_D(\omega) \log P_K(\omega)    (15)

In semantic information theory Carnap and Bar-Hillel (1953), this quantity can be interpreted as the entropy of a semantic source: the direct equivalent of entropy in symbolic source coding. It measures the semantic information content of the source and sets the limit of lossless semantic compression: any summary representing all the information of D must contain at least PI_K(D) bits of information.

Alternatively, if S can only rely on information extracted from D, then S cannot extract more than PI_K(D) bits from D. PI_K(D) can thus be understood as the potential Informativeness: the maximum amount of Informativeness a summary can extract from D.

Finally, PI_K(D) decomposes as H(D) + KL(P_D \| P_K), where KL(P_D‖P_K) is again the Bayesian surprise: the information we gain by reading the source(s).
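Potential Information reuses the cross-entropy helper from section 3.4 (a sketch; as noted above, the value is minimal when the background already equals the source):

    def potential_information(p_d, p_k):
        """PI_K(D) = CE(D, K) (Eq. 15): how much D can surprise a reader with K."""
        return cross_entropy(p_d, p_k)

    p_d = {"w1": 0.5, "w2": 0.3, "w3": 0.2}
    # When K = D, PI reaches its minimum CE(D, D) = H(D): nothing new to learn.
    assert abs(potential_information(p_d, p_d) - entropy(p_d)) < 1e-9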

Figure 1: (a) an example distribution of the source(s) P_D; (b) an example distribution of background knowledge P_K; (c) the resulting target distribution P_{D/K} that summaries should approximate.

4 Experiments

In this section, we provide an illustration of the workings of the formulas on real data.

Data

We experiment with standard datasets for two different summarization tasks: generic and update multi-document summarization.

We use two datasets from the Text Analysis Conference (TAC) shared tasks: TAC-2008 and TAC-2009 (http://tac.nist.gov/2008/, http://tac.nist.gov/2009/Summarization/). TAC-2008 and TAC-2009 contain 48 and 44 topics, respectively. Each topic consists of 10 news articles to be summarized in at most 100 words. We use only the so-called initial documents (the A documents) for the generic part. In the update part, 10 new documents (the B documents) are to be summarized assuming that the first 10 documents (the A documents) have already been seen.

For each topic, there are 4 human reference summaries along with a manually created Pyramid set Nenkova et al. (2007). In both editions, all system summaries and the 4 reference summaries were manually evaluated by NIST assessors for readability, content selection (with Pyramid) and overall responsiveness. At the time of the shared tasks, 57 systems were submitted to TAC-2008 and 55 to TAC-2009.

Setup and Assumptions

To keep the experiments simple and focused on the workings of the formulas, we make several simplistic and limiting assumptions. Mainly, we assume that semantic units are n-grams, and texts are therefore represented as frequency distributions over their n-grams. In section 6, we discuss why this is limiting and why n-grams do not qualify as semantic units. However, it remains a simple and convenient approximation that lets us observe the quantities in action.

K is the free parameter of the theory and its choice is subject to investigation. Here, we fix it to simple choices: for update summarization, P_K is the frequency distribution over the n-grams of the background documents (A). For generic summarization, P_K is the uniform probability distribution over all n-grams of the source documents.
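Under these assumptions, the whole pipeline reduces to n-gram counting. A sketch of how the distributions could be built (the tokenization and the choice of n are simplifications on our side):

    def ngram_distribution(tokens, n=2):
        """Approximate semantic units by n-grams over a token sequence."""
        grams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
        return freq_distribution(grams)

    def uniform_background(p_d):
        """Generic task: K uniform over all n-grams of the source documents."""
        return {w: 1.0 / len(p_d) for w in p_d}

    # Update task: K is simply the n-gram distribution of the A documents, e.g.
    # p_k = ngram_distribution(tokens_of_A_documents)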

4.1 Correlation with humans

First, we measure how well the different quantities correlate with human judgments. We compute the score of each system summary according to each quantity defined in the previous section: Red, Rel, Inf and \theta_I. We then compute the correlations between these scores and the two kinds of human judgments available: Pyramid and responsiveness.

We measure the correlation with NDCG, a metric that compares ranked lists and puts more emphasis on the top elements via a logarithmic decay of weights. A summarizer aims at extracting the highest-scoring summaries according to a given metric; therefore, useful metrics should exhibit high NDCG scores.
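The paper does not spell out its exact NDCG variant; a standard formulation, with the human judgments as gains and a log2 position discount, looks as follows:

    import math

    def ndcg(metric_scores, human_scores):
        """Rank summaries by the metric; the gain at each rank is its human score."""
        order = sorted(range(len(metric_scores)),
                       key=lambda i: metric_scores[i], reverse=True)
        dcg = sum(human_scores[i] / math.log2(rank + 2)
                  for rank, i in enumerate(order))
        ideal = sorted(human_scores, reverse=True)
        idcg = sum(g / math.log2(rank + 2) for rank, g in enumerate(ideal))
        return dcg / idcg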

Analysis

We report results for both generic and update summarization over the two datasets TAC-2008 and TAC-2009 in Table 1. It is worth mentioning that with n-gram distributions as semantic units, our definition of Relevance matches the objective function used in previous summarization systems Haghighi and Vanderwende (2009); Peyrard and Eckle-Kohler (2016).

Each quantity individually achieves a decent correlation, indicating that each is indeed relevant for summarization. Interestingly, \theta_I has a much better correlation than the individual quantities, supporting the hypothesis that \theta_I is a more general and robust notion of Importance.

We must be careful when interpreting these results because we made several strong assumptions: we chose n-grams as semantic units and chose K rather arbitrarily. Moreover, the data might be biased because summarizers have difficulty addressing redundancy during the selection procedure; this may explain why Red performs so well on these datasets. Redundancy may be the main dimension of variation distinguishing the summaries submitted to the shared tasks.

Nevertheless, these are encouraging and promising results. Should we spend more effort crafting better text representations and coming up with a more suitable K, we should expect even better correlations. We already observe a better correlation for \theta_I in the update summarization scenario, which comes from a more natural choice of K: taking K to be the previous document set follows naturally from the task description.

In the generic case, the uninformative uniform distribution is a weaker approximation of background knowledge. In fact, Inf becomes similar to Red because the cross-entropy between the summary and the uniform distribution is tightly connected to the entropy of the summary. With this uninformative prior (the generic case), Inf and Red capture similar aspects, while they are complementary in the update case.

           Generic            Update
           resp.   Pyramid    resp.   Pyramid
Red        .617    .554       .581    .502
Rel        .510    .414       .504    .377
Inf        .686    .615       .554    .472
theta_I    .772    .718       .889    .873

Table 1: Correlation of the information-theoretic quantities with human judgments (responsiveness and Pyramid), measured by NDCG, on generic and update summarization.

4.2 Comparison with Reference Summaries

In this experiment, we verify that reference summaries score high according to \theta_I. Intuitively, the target distribution P_{D/K} should be similar to the probability distribution of the reference summaries.

Analysis

In Table 2, we report the average \theta_I of reference summaries in comparison with the average \theta_I of system summaries. Remember that \theta_I is defined with a minus sign: a summary should minimize its dissimilarity to the target distribution P_{D/K}.

For example, in the generic case, system summaries differ from the target distribution by 2.11 bits on average, while reference summaries differ by only 0.79 bits on average. This confirms that reference summaries are indeed quite close to the target distribution P_{D/K}.

                      Generic   Update
System summaries      -2.11     -2.42
Reference summaries   -0.79     -0.93

Table 2: Average theta_I scores of reference summaries compared to those of system summaries on generic and update summarization.

4.3 Potential Information

In this section, we are interested in the Potential Information PI_K(D) of the document sets in our datasets.

We compute PI_K(D) for each topic to measure the amount of interesting, useful and new information. To put the numbers in perspective, we also report PI_K(D) where the background K is generated randomly. Additionally, note that PI_K(D) is minimal when K = D, because there is nothing new to learn from D in this case.

Analysis

We report results for both generic and update summarization over the two datasets TAC-2008 and TAC-2009 in Table 3.

When a random background is used, the Potential Information is similar in both tasks. This hints that, on average, document sets from the generic and update tasks contain the same amount of information. This is expected because all documents are news articles of roughly the same length and style.

However, we observe a major difference with the more reasonable choices of K: the uniform distribution in the generic case and the previous document set (A) in the update case.

In the generic case, no semantic unit is more expected than another, producing rather low Potential Information. In the update case, the original document sets (A) and the new ones to be summarized (B) are quite different, as there is a large amount of Potential Information when summarizing (B) after observing (A). The difference is not as large as between two randomly selected document sets because (A) and (B) are topically related. This provides another indication that the simplistic choice of K we made for update summarization is, in fact, rather appropriate.

              Generic   Update
Random K      1.15      1.14
Chosen K      .150      .886

Table 3: Potential Information measured with different choices of the background K on generic and update summarization.

5 Related Work

Information theoretic tools have been previously employed in summarization. It usually entails representing texts as a probability distribution over n-grams before using information theoretic measures like KL or Jensen-Shannon (JS) divergence.

Divergences between n-gram distributions have served as proxies for the similarity between summaries and sources Haghighi and Vanderwende (2009); Peyrard and Eckle-Kohler (2016). Louis (2014) investigated background knowledge for update summarization with Bayesian surprise. Several techniques have replaced distributions over n-grams with distributions over topics Celikyilmaz and Hakkani-Tür (2011); Delort and Alfonseca (2012); Hennig (2009).

Divergences have also been used to devise evaluation metrics: Lin et al. (2006) proposed to measure the similarity between system summaries and reference summaries with JS divergence. Louis and Nenkova (2013) argued that the JS divergence between a candidate summary and its source documents is a solid indicator of summary quality.

Finally, it is worth noting that, even though they did not provide a theoretical treatment, Daumé III and Marcu (2002) modeled document compression with a noisy channel, viewing summarization as a communication problem. This models the Relevance aspect of Importance.

These works envisioned and exemplified that information-theoretic tools can be applied to summarization, but they did not develop a formal theory of Importance with them.

6 Discussion

6.1 Semantic Units

In this section, we discuss the implications of our initial assumption that text can be represented by semantic units.

What is not a semantic unit

Characters, character n-grams, morphemes, words, n-grams, phrases and sentences do not qualify as semantic units. Even though previous works relying on information-theoretic motivations Lin et al. (2006); Haghighi and Vanderwende (2009); Louis and Nenkova (2013); Peyrard and Eckle-Kohler (2016) used some of them as support for probability distributions, they are neither atomic nor independent. This is mainly because they are surface forms, whereas semantic units are abstract and operate at the semantic level.

Vector Space dimensions

Modern techniques represent the semantics of texts with vector spaces Mikolov et al. (2013b); Turian et al. (2010). They intend to identify the latent independent semantic components of texts Gábor et al. (2017); Mikolov et al. (2013a). The resulting dimensions of such vector spaces seem to be potential candidates for semantic units (after translation and normalization to transform vectors into proper probability distributions).
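One simple translation-and-normalization scheme (ours, purely for illustration; many others are possible) shifts a vector to be non-negative and rescales it to sum to one:

    def vector_to_distribution(vec):
        """Turn a real-valued text vector into a probability distribution
        whose dimensions can be read as candidate semantic units."""
        low = min(vec)
        shifted = [x - low for x in vec]
        total = sum(shifted)
        if total == 0:  # degenerate constant vector: fall back to uniform
            return [1.0 / len(vec)] * len(vec)
        return [x / total for x in shifted]

    print(vector_to_distribution([0.2, -0.1, 0.5]))  # [0.333..., 0.0, 0.666...]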

These are indeed promising schemes strengthening the hypothesis that semantic units are an applicable concept. In fact, the quest to uncover the independent latent components of meaning aligns with the requirements of semantic units.

However, there are still issues with employing vector space dimensions as semantic units. Different techniques or initializations yield different dimensions, and the number of dimensions remains an undecided hyper-parameter. A few hundred dimensions also seems too few to pair with semantic units: intuitively, a text should have a sparse representation over semantic units, but embedding techniques project texts into dense, low-dimensional spaces.

Also, these techniques rely on the distributional hypothesis; is it valid for uncovering semantic units? It may only reveal the few semantic aspects that correlate with syntactic patterns. Sahlgren (2008) even stated: "The distributional hypothesis […] is a strong methodological claim with a weak semantic foundation."

6.2 Future directions

As asserted previously, theoretical frameworks provide a common language to ground research. In particular, they can motivate future work.

Two obvious lines of research emerge from the framework, and both may extend beyond summarization: (1) finding appropriate text representations (semantic units), and (2) automatically discovering K with data-driven approaches.

Appropriate text representation

Discovering appropriate, meaningful and useful representations for texts is one of the main challenges of NLP.

Integrating the requirements of semantic units (atomic, independent semantic symbols) may result in better text representations. Furthermore, it would allow the use of information-theoretic tools at the semantic level for a wide range of NLP problems.

Automatic discovery of K

One can use supervised techniques to automatically search for K. For example, with proper regularization, aggregating over many users and many topics can yield a generic K: it means finding what, on average, people know and believe. This might be useful in other applications such as dialogue systems.

By aggregating over different persons within one domain, one can uncover a domain-specific K. Similarly, by aggregating over many topics for one person, one can find a personalized K.

6.3 Practical Considerations

Conceptually, it is straightforward to build a system out of \theta_I once a semantic unit representation and a background K have been chosen.

A summarizer intends to extract or generate a summary maximizing \theta_I. In extractive summarization, this can naturally be cast as a discrete optimization problem: the source is considered as a set of sentences, and the summary is created by selecting an optimal subset of sentences under a length constraint McDonald (2007). In abstractive summarization, a language-aware decoder needs to be employed to guarantee linguistic quality alongside the information selection process Nenkova and McKeown (2012).
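A naive greedy instantiation of this discrete optimization (our own baseline sketch, approximating semantic units by words and reusing theta_i and freq_distribution from above; not the paper's system):

    def greedy_summarizer(sentences, p_d, p_k, budget=100):
        """Greedily add the sentence that most improves theta_I of the summary,
        under a total word budget."""
        summary, length = [], 0
        while True:
            best, best_score = None, None
            for sent in sentences:
                words = sent.split()
                if sent in summary or length + len(words) > budget:
                    continue
                units = [w for s in summary + [sent] for w in s.split()]
                score = theta_i(freq_distribution(units), p_d, p_k)
                if best_score is None or score > best_score:
                    best, best_score = sent, score
            if best is None:
                return summary
            summary.append(best)
            length += len(best.split())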

Another practical point comes from the fact that the KL divergence can diverge to infinity when some semantic unit is believed impossible (its probability is 0) but is actually observed. This is intuitive, since the surprise of seeing something deemed impossible should be infinite. To avoid this issue, one has to smooth the distributions by assigning non-zero probabilities to all possible semantic units.

7 Conclusion

In this work, we argued for the development of formal theories of Importance in order to advance our understanding of summarization. We introduced formal definitions of several concepts, namely Redundancy, Relevance and Informativeness, within an abstract theoretical framework rooted in information theory. Importance arises as a single quantity naturally unifying these concepts. Finally, we provided intuitions to interpret the proposed quantities and discussed the relevance of the framework. This development is a humble starting point, to be empirically tested and refined upon new empirical observations.

Acknowledgments

This work has been supported by the German Research Foundation (DFG) as part of the Research Training Group “Adaptive Preparation of Information from Heterogeneous Sources” (AIPHES) under grant No. GRK 1994/1.

References