
IndoSum: A New Benchmark Dataset for Indonesian Text Summarization

Automatic text summarization is generally considered a challenging task in the NLP community. One of the challenges is that large, publicly available datasets are relatively rare and difficult to construct. The problem is even worse for low-resource languages such as Indonesian. In this paper, we present IndoSum, a new benchmark dataset for Indonesian text summarization. The dataset consists of news articles and manually constructed summaries. Notably, the dataset is almost 200x larger than the previous Indonesian summarization dataset of the same domain. We evaluated various extractive summarization approaches and obtained encouraging results, which demonstrate the usefulness of the dataset and provide baselines for future research. The code and the dataset are available online under permissive licenses.





I Introduction

The goal of the text summarization task is to produce a summary from a set of documents. The summary should retain important information and be reasonably shorter than the original documents [1]. When the set of documents contains only a single document, the task is usually referred to as single-document summarization. There are two kinds of summarization, characterized by how the summary is produced: extractive and abstractive. Extractive summarization attempts to extract a few important sentences verbatim from the original document. In contrast, abstractive summarization tries to produce an abstract which may contain sentences that do not exist in, or are paraphrased from, the original document.

Despite a fair amount of research on Indonesian text summarization, none of it was trained or evaluated on a large, publicly available dataset. Also, although ROUGE [2] is the standard intrinsic evaluation metric for English text summarization, this does not seem to be the case for Indonesian. Previous works rarely state explicitly that their evaluation was performed with ROUGE. The lack of a benchmark dataset and the use of different evaluation metrics make it difficult to compare Indonesian text summarization research.

In this work, we introduce IndoSum, a new benchmark dataset for Indonesian text summarization, and evaluate several well-known extractive single-document summarization methods on the dataset. The dataset consists of online news articles and has almost 200 times more documents than the next largest one of the same domain [3]. To encourage further research in this area, we make our dataset publicly available. In short, the contribution of this work is two-fold:

  1. IndoSum, a large dataset for text summarization in Indonesian that is compiled from online news articles and publicly available.

  2. Evaluation of state-of-the-art extractive summarization methods on the dataset using ROUGE as the standard metric for text summarization.

The state-of-the-art result on the dataset, although impressive, is still significantly lower than the maximum possible ROUGE score. This result suggests that the dataset is sufficiently challenging to be used as an evaluation benchmark for future research on Indonesian text summarization.

II Related work

Fachrurrozi et al. [4] proposed some scoring methods and used them with TF-IDF to rank and summarize news articles. Another work [5] used latent Dirichlet allocation coupled with a genetic algorithm to produce summaries for online news articles. Simple methods like naive Bayes have also been used for Indonesian news summarization [3], although for English, naive Bayes had been used almost two decades earlier [6]. A more recent work [7] employed a summarization algorithm called TextTeaser with some predefined features, also for news articles. Slamet et al. [8] used TF-IDF to convert sentences into vectors and then computed their similarities against another vector obtained from some keywords. They used these similarity scores to extract important sentences as the summary. Unfortunately, none of these works seems to have been evaluated using ROUGE, despite it being the standard metric for text summarization research.

An example of Indonesian text summarization research that used ROUGE is [9]. They employed the best method from the TAC 2011 competition on a news dataset and achieved ROUGE-2 scores that are close to those of humans. However, their dataset consists of only 56 articles, which is very small, and the dataset is not publicly available.

An attempt to make a public summarization dataset has been done in [10]. They compiled a chat dataset along with its summary, which has both the extractive and abstractive versions. This work is a good step toward standardizing summarization research for Indonesian. However, to the best of our knowledge, for news dataset, there has not been a publicly available dataset, let alone a standard.

Article (Indonesian):
 - Cerita sekuel terbaru James Bond bocor
 Menurut sumber yang terlibat dalam produksi film ini, agen rahasia 007 berhenti menjadi mata-mata Inggris demi menikah dengan perempuan yang dicintainya.
 “Bond berhenti menjadi agen rahasia karena jatuh cinta dan menikah dengan perempuan yang dicintai,” tutur seorang sumber yang dekat dengan produksi seperti dikutip laman
 Dalam film tersebut, Bond diduga menikahi Madeleine Swann yang diperankan oleh Lea Seydoux.
 Lea diketahui bermain sebagai gadis Bond di sekuel Spectre pada 2015 silam.
 Jika benar, ini merupakan satu-satunya sekuel yang bercerita pernikahan James Bond sejak 1969.
 Sebelumnya, di sekuel On Her Majesty, James Bond menikahi Tracy Draco yang diperankan Diana Rigg.
 Namun, di film itu Draco terbunuh.
 Plot sekuel film James Bond ke-25 bocor tak lama setelah Daniel Craig mengumumkan bakal kembali memerankan tokoh agen 007.

Abstractive summary (Indonesian):
 Cerita sekuel terbaru James Bond bocor.
 Menurut sumber yang terlibat dalam produksi film ini, agen rahasia 007 berhenti menjadi mata-mata Inggris demi menikah dengan perempuan yang dicintainya.
 Jika benar, ini merupakan satu-satunya sekuel yang bercerita pernikahan James Bond sejak 1969.
 Sebelumnya, di sekuel On Her Majesty, James Bond menikahi Tracy Draco.
 Namun, di film itu Draco terbunuh.

Article (English translation):
 - Newest James Bond sequel’s story was leaked
 According to a source involved in the movie production, the secret agent 007 stopped being an English spy to marry a woman whom he loved.
 “Bond stopped being a spy because he fell in love and married a woman that he loved,” said a source who is close to the production as reported by
 In the movie, Bond was suspected to marry Madeleine Swann who is played by Lea Seydoux.
 Lea is known to play as a Bond girl in the sequel Spectre in 2015.
 If true, this would be the only sequel that tells about James Bond’s marriage since 1969.
 Previously, in the sequel On Her Majesty, James Bond married Tracy Draco who was played by Diana Rigg.
 However, in the movie Draco was killed.
 The plot of the 25th James Bond sequel movie was leaked not long after Daniel Craig announced that he would play agent 007 character again.

Abstractive summary (English translation):
 Newest James Bond sequel’s story was leaked.
 According to a source involved in the movie production, the secret agent 007 stopped being an English spy to marry a woman whom he loved.
 If true, this would be the only sequel that tells about James Bond’s marriage since 1969.
 Previously, in the sequel On Her Majesty, James Bond marries Tracy Draco.
 However, in the movie Draco was killed.

Fig. 1: A sample article, its abstractive summary, and their English translations. Underlined sentences are the extractive summary obtained by following the greedy algorithm in [11].

III Methodology

III-A IndoSum: a new benchmark dataset

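| Statistic | Fold 1 train | Fold 1 dev | Fold 1 test | Fold 2 train | Fold 2 dev | Fold 2 test | Fold 3 train | Fold 3 dev | Fold 3 test | Fold 4 train | Fold 4 dev | Fold 4 test | Fold 5 train | Fold 5 dev | Fold 5 test |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| # of articles | 14262 | 750 | 3762 | 14263 | 749 | 3762 | 14290 | 747 | 3737 | 14272 | 750 | 3752 | 14266 | 747 | 3761 |
| avg # of paras / article | 10.54 | 10.42 | 10.39 | 10.49 | 10.83 | 10.47 | 10.47 | 10.57 | 10.61 | 10.52 | 10.37 | 10.49 | 10.49 | 10.23 | 10.54 |
| avg # of sents / para | 1.75 | 1.74 | 1.75 | 1.75 | 1.75 | 1.75 | 1.75 | 1.74 | 1.73 | 1.74 | 1.73 | 1.77 | 1.75 | 1.79 | 1.74 |
| avg # of words / sent | 18.86 | 19.26 | 18.91 | 18.87 | 18.71 | 19.00 | 18.89 | 18.95 | 18.90 | 18.88 | 19.27 | 18.82 | 18.92 | 18.81 | 18.82 |
| avg # of sents / summ | 3.48 | 3.42 | 3.47 | 3.47 | 3.50 | 3.47 | 3.48 | 3.44 | 3.46 | 3.48 | 3.40 | 3.48 | 3.47 | 3.54 | 3.48 |
| avg # of words / summ sent | 19.58 | 19.91 | 19.59 | 19.60 | 19.54 | 19.58 | 19.57 | 19.77 | 19.65 | 19.58 | 19.92 | 19.60 | 19.63 | 19.05 | 19.57 |

TABLE I: Corpus statistics.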

We used a dataset provided by Shortir, an Indonesian news aggregator and summarizer company. The dataset contains roughly 20K news articles. Each article has the title, category, source (e.g., CNN Indonesia, Kumparan), URL to the original article, and an abstractive summary which was created manually by a total of 2 native speakers of Indonesian. There are 6 categories in total: Entertainment, Inspiration, Sport, Showbiz, Headline, and Tech. A sample article-summary pair is shown in Fig. 1.

Note that 20K articles is actually quite small compared to the English CNN/DailyMail dataset used in [12], which has 200K articles. Therefore, we used 5-fold cross-validation to split the dataset into 5 folds of training, development, and testing sets. We preprocessed the dataset by tokenizing, lowercasing, removing punctuation, and replacing digits with zeros. We used NLTK [13] and spaCy for sentence and word tokenization, respectively.
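The normalization steps above can be sketched as follows. This is a minimal illustration using only the Python standard library, not the authors' pipeline: the function name `preprocess` is hypothetical, and a plain whitespace split stands in for the NLTK and spaCy tokenizers, so tokenization happens last here rather than first.

```python
import re
import string

# Translation table that deletes ASCII punctuation characters.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def preprocess(sentence):
    """Lowercase, strip punctuation, replace digits with zeros, then tokenize."""
    text = sentence.lower()
    text = text.translate(_PUNCT_TABLE)
    text = re.sub(r"\d", "0", text)
    # Assumption: whitespace tokenization stands in for a real word tokenizer.
    return text.split()


print(preprocess("Lea bermain di sekuel Spectre pada 2015 silam."))
```

Replacing every digit with a zero keeps the shape of a number (here a four-digit year) while collapsing the many distinct numeric tokens into a few.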

In our exploratory analysis, we discovered that some articles have very long text and some summaries have too many sentences. Articles with long text are mostly articles containing a list, e.g., a list of songs played in a concert, a list of award nominations, and so on. Since such a list is never included in the summary, we truncated such articles so that the number of paragraphs is at most two standard deviations away from the mean (we assume the number of paragraphs follows a Gaussian distribution). For each fold, the mean and standard deviation were estimated from the training set. We discarded articles whose summary is too long since we do not want lengthy summaries anyway. The cutoff length is defined by the upper limit of Tukey’s boxplot, where for each fold, the quartiles were estimated from the training set. After removing such articles, we ended up with roughly 19K articles in total. The complete statistics of the corpus are shown in Table I.
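The two cutoffs can be sketched as below. This is an illustration under stated assumptions rather than the authors' code: the function names are hypothetical, the sample standard deviation and the "inclusive" quantile method are arbitrary choices the text does not specify, and truncation is assumed to keep the leading paragraphs. The upper limit of Tukey's boxplot is taken as the third quartile plus 1.5 times the interquartile range.

```python
import statistics


def paragraph_cutoff(train_para_counts):
    """Upper limit on paragraphs: mean plus two standard deviations."""
    mean = statistics.mean(train_para_counts)
    std = statistics.stdev(train_para_counts)  # sample std; an assumption
    return mean + 2 * std


def summary_length_cutoff(train_summary_lengths):
    """Upper fence of Tukey's boxplot: Q3 + 1.5 * IQR."""
    q1, _, q3 = statistics.quantiles(
        train_summary_lengths, n=4, method="inclusive"
    )
    return q3 + 1.5 * (q3 - q1)


def truncate_article(paragraphs, cutoff):
    """Keep only the leading paragraphs up to the cutoff (an assumption)."""
    return paragraphs[: int(cutoff)]
```

Both statistics are computed from the training split only, so the development and test splits are filtered with limits they did not influence.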


Since the gold summaries provided by Shortir are abstractive, we needed to label the sentences in the article for training the supervised extractive summarizers. We followed Nallapati et al. [11] to make these labeled sentences (called oracles hereinafter) using their greedy algorithm. The idea is to maximize the ROUGE score between the labeled sentences and the abstractive gold summary. Although the provided gold summaries are abstractive, in this work we focused on extractive summarization because we think research in this area is more mature, especially for Indonesian, and thus starting with extractive summarization is a logical first step toward standardizing Indonesian text summarization research.
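The greedy labeling idea can be sketched as follows: sentences are added one at a time, each time choosing the sentence that most improves the score against the gold summary, and the procedure stops when no remaining sentence improves it. This is a simplified sketch, not the implementation of [11]: a unigram-overlap F1 stands in for ROUGE, and the names `unigram_f1` and `greedy_oracle` are hypothetical.

```python
from collections import Counter


def unigram_f1(candidate_tokens, reference_tokens):
    """Unigram-overlap F1, used here as a stand-in for a ROUGE score."""
    overlap = sum((Counter(candidate_tokens) & Counter(reference_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(candidate_tokens)
    recall = overlap / len(reference_tokens)
    return 2 * precision * recall / (precision + recall)


def greedy_oracle(sentences, summary_tokens, max_sents=None):
    """Return sorted indices of the greedily selected sentences.

    sentences is a list of token lists; summary_tokens is the gold summary.
    """
    selected = []
    best_score = 0.0
    while max_sents is None or len(selected) < max_sents:
        best_idx = None
        for i in range(len(sentences)):
            if i in selected:
                continue
            # Score the summary formed by the current selection plus sentence i.
            candidate = [tok for j in sorted(selected + [i]) for tok in sentences[j]]
            score = unigram_f1(candidate, summary_tokens)
            if score > best_score:
                best_score, best_idx = score, i
        if best_idx is None:  # no remaining sentence improves the score
            break
        selected.append(best_idx)
    return sorted(selected)


article = [
    ["bond", "marries"],
    ["the", "weather", "is", "nice"],
    ["draco", "was", "killed"],
]
gold = ["bond", "marries", "draco", "killed"]
print(greedy_oracle(article, gold))
```

The selected indices become the positive labels for training; all other sentences are labeled negative.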

Since there can be many valid summaries for a given article, having only a single abstractive summary for an article is a limitation of our dataset which we acknowledge. Nevertheless, we feel that the existence of such dataset is a crucial step toward a fair benchmark for Indonesian text summarization research. Therefore, we make the dataset publicly available for others to use.

III-B Evaluation

For evaluation, we used ROUGE [2], a standard metric for text summarization. We used the implementation provided by pythonrouge. Following [12], we report the F1 score of R-1, R-2, and R-L. Intuitively, R-1 and R-2 measure informativeness and R-L measures fluency [12]. We report the F1 score instead of just the recall score because, although we extract a fixed number of sentences as the summary, the number of words is not limited. Reporting only recall would therefore benefit models that extract long sentences.
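To illustrate why recall alone favors long extracts, the sketch below computes n-gram overlap precision, recall, and their harmonic mean (F1). It is a simplified stand-in written for this example, not the pythonrouge implementation, and it omits options such as stemming or stopword removal; the function names are hypothetical.

```python
from collections import Counter


def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(candidate, reference, n=1):
    """Return (precision, recall, f1) of clipped n-gram overlap."""
    cand, ref = ngrams(candidate, n), ngrams(reference, n)
    overlap = sum((cand & ref).values())
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return precision, recall, f1


reference = "bond marries tracy draco".split()
short = "bond marries tracy draco".split()
long_ = "bond marries tracy draco in the movie on her majesty".split()

print(rouge_n(short, reference))
print(rouge_n(long_, reference))
```

Both candidates reach a recall of 1.0, but the longer one has a precision of only 0.4, so its F1 is lower. Ranking by recall alone would not distinguish the two.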

III-C Compared methods

We compared several summarization methods which can be categorized into three groups: unsupervised, non-neural supervised, and neural supervised methods. For the unsupervised methods, we tested:

  1. SumBasic, which uses word frequency to rank sentences and selects top sentences as the summary [14, 15].

  2. Lsa, which uses latent semantic analysis (LSA) to decompose the term-by-sentence matrix of a document and extracts sentences based on the result. We experimented with the two approaches proposed in [16] and [17] respectively.

  3. LexRank, which constructs a graph representation of a document, where nodes are sentences and edges represent similarity between two sentences, runs the PageRank algorithm on that graph, and extracts sentences based on the resulting PageRank values [18]. In the original implementation, sentences shorter than a certain threshold are removed. Our implementation does not do this removal, in order to reduce the number of tunable hyperparameters. Also, the original uses cross-sentence informational subsumption (CSIS) during the sentence selection stage, but the paper does not explain it well. Instead, we used an approximation to CSIS called cross-sentence word overlap, described in [19] by the same authors.

  4. TextRank, which is very similar to LexRank but computes sentence similarity based on the number of common tokens [20].
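The graph-based ranking shared by LexRank and TextRank can be sketched as below, using the TextRank-style similarity (number of common tokens normalized by the log lengths of the two sentences). This is a minimal sketch and not the implementation evaluated here: the function names are hypothetical, the damping factor of 0.85 and the fixed number of iterations are assumptions, and the score is the normalized PageRank variant computed by power iteration.

```python
import math


def overlap_similarity(s1, s2):
    """Common-token count normalized by the log lengths of both sentences."""
    if not s1 or not s2:
        return 0.0
    denom = math.log(len(s1)) + math.log(len(s2))
    if denom <= 0:  # both sentences have a single token
        return 0.0
    return len(set(s1) & set(s2)) / denom


def rank_sentences(sentences, damping=0.85, iterations=50):
    """Weighted PageRank over the sentence similarity graph."""
    n = len(sentences)
    if n == 0:
        return []
    weights = [
        [overlap_similarity(sentences[i], sentences[j]) if i != j else 0.0
         for j in range(n)]
        for i in range(n)
    ]
    out_sums = [sum(row) for row in weights]
    scores = [1.0 / n] * n
    for _ in range(iterations):
        new_scores = []
        for i in range(n):
            # Each neighbour j passes on a share of its score proportional
            # to the weight of the edge between j and i.
            incoming = sum(
                weights[j][i] / out_sums[j] * scores[j]
                for j in range(n) if out_sums[j] > 0
            )
            new_scores.append((1 - damping) / n + damping * incoming)
        scores = new_scores
    return scores


def summarize(sentences, k=3):
    """Indices of the k highest-scoring sentences, in document order."""
    scores = rank_sentences(sentences)
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:k]
    return sorted(top)
```

Swapping `overlap_similarity` for a TF-IDF cosine similarity would move the sketch closer to LexRank, which is the main difference between the two methods as described above.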

For the non-neural supervised methods, we compared:

  1. Bayes, which represents each sentence as a feature vector and uses naive Bayes to classify them [6]. Four features are used: whether the sentence has fewer than 5 words, whether the sentence contains signature words, its position in the document, and its position in the paragraph. To obtain the signature words, TF-IDF is used. The original paper computes TF-IDF scores on multi-word tokens that are identified automatically using mutual information. We did not do this identification, so our TF-IDF computation operates on word tokens.

  2. Hmm, which uses a hidden Markov model where states correspond to whether the sentence should be extracted [21]. A Gaussian distribution is used as the emission probability distribution, where each sentence is represented as a feature vector. Four features are used: its position in the paragraph, number of terms, sum of probability of terms in the document, and sum of probability of terms in a baseline document. We used a precomputed TF table for the last feature. The original work uses QR decomposition for sentence selection but our implementation does not. We simply ranked the sentences by their scores and picked the top 3 as the summary.

  3. MaxEnt, which represents each sentence as a feature vector and leverages a maximum entropy model to compute the probability that a sentence should be extracted [22]. Several features are used: word pairs, sentence length, previous sentence length, sentence position, and whether the sentence is at the start of a paragraph. The original approach puts a prior distribution over the labels but we put the prior on the weights instead. Our implementation still agrees with the original because we employed a bias feature which should be able to learn the prior label distribution.

As for the neural supervised method, we evaluated NeuralSum [12] using the original implementation by the authors. We modified their implementation slightly to allow for evaluating the model with ROUGE. Note that all the methods are extractive. Our implementation code for all the methods above is available online.

As a baseline, we used Lead-N which selects leading sentences as the summary. For all methods, we extracted 3 sentences as the summary since it is the median number of sentences in the gold summaries that we found in our exploratory analysis.
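The Lead-N baseline is simple enough to state in a few lines. The sketch below is illustrative only: the function name `lead_n` is hypothetical, and the list of summary lengths is made-up data used to show how a median would be taken, not figures from the corpus.

```python
import statistics


def lead_n(sentences, n=3):
    """Return the first n sentences of an article as its summary."""
    return sentences[:n]


# Hypothetical summary lengths (in sentences), for illustration only.
example_summary_lengths = [3, 4, 3, 2, 5]
n = int(statistics.median(example_summary_lengths))

article = ["s1", "s2", "s3", "s4", "s5"]
print(lead_n(article, n))
```

Articles shorter than n sentences are returned whole, since slicing past the end of a list is not an error in Python.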

III-D Experiment setup

Some of these approaches optionally require a precomputed term frequency (TF) or inverse document frequency (IDF) table and a stopword list. We precomputed the TF and IDF tables from an Indonesian Wikipedia dump and used the stopword list provided in [23]. Hyperparameters were tuned on the development set of each fold, optimizing for R-1 as it correlates best with human judgment [24]. For NeuralSum, we tried several scenarios:

  1. tuning the dropout rate while keeping other hyperparameters fixed,

  2. increasing the word embedding size from the default 50 to 300,

  3. initializing the word embedding with FastText pre-trained embedding [25].

Scenario 2 is necessary to determine whether any improvement in scenario 3 is due to the larger embedding size or the pre-trained embedding. In scenarios 2 and 3, we used the default hyperparameter settings from the authors’ implementation. In addition, for every scenario, we picked the model saved at the epoch that yields the best R-1 score on the development set.

IV Results and discussion

| Group | Method | R-1 | R-2 | R-L |
|---|---|---|---|---|
| | Oracle | 79.27 (0.25) | 72.52 (0.35) | 78.82 (0.28) |
| | Lead-3 | 62.86 (0.34) | 54.50 (0.41) | 62.10 (0.37) |
| Unsupervised | SumBasic [14, 15] | 35.96 (0.18) | 20.19 (0.31) | 33.77 (0.18) |
| Unsupervised | Lsa [16, 17] | 41.37 (0.19) | 28.43 (0.25) | 39.64 (0.19) |
| Unsupervised | LexRank [18] | 62.86 (0.35) | 54.44 (0.44) | 62.10 (0.37) |
| Unsupervised | TextRank [20] | 42.87 (0.29) | 29.02 (0.35) | 41.01 (0.31) |
| Non-neural supervised | Bayes [6] | 62.70 (0.39) | 54.32 (0.46) | 61.93 (0.41) |
| Non-neural supervised | Hmm [21] | 17.62 (0.11) | 4.70 (0.11) | 15.89 (0.11) |
| Non-neural supervised | MaxEnt [22] | 50.94 (0.42) | 44.33 (0.50) | 50.26 (0.44) |
| Neural supervised | NeuralSum [12] | 67.60 (1.25) | 61.16 (1.53) | 66.86 (1.30) |
| Neural supervised | NeuralSum 300 emb. size | 67.96 (0.46) | 61.65 (0.48) | 67.24 (0.47) |
| Neural supervised | NeuralSum + FastText | 67.78 (0.69) | 61.37 (0.93) | 67.05 (0.72) |

TABLE II: Test score of ROUGE-1, ROUGE-2, and ROUGE-L, averaged over 5 folds.
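
| Method | Source dom. | Target: Entertainment | Target: Inspiration | Target: Sport | Target: Showbiz | Target: Headline | Target: Tech |
|---|---|---|---|---|---|---|---|
| Oracle | | 75.59 | 81.19 | 77.65 | 78.33 | 80.52 | 80.09 |
| Lead-3 | | 51.27 | 52.12 | 67.56 | 65.05 | 65.21 | 50.01 |
| LexRank | | 51.41 | 50.78 | 67.52 | 65.01 | 65.19 | 50.01 |
| NeuralSum | Entertainment | 52.51 | 53.15 | 72.51 | 67.01 | 67.63 | 51.81 |
| NeuralSum | Inspiration | 52.51 | 52.71 | 72.51 | 67.01 | 68.02 | 51.67 |
| NeuralSum | Sport | 52.41 | 53.85 | 72.51 | 66.62 | 68.48 | 50.89 |
| NeuralSum | Showbiz | 53.65 | 49.86 | 72.51 | 67.81 | 70.88 | 51.22 |
| NeuralSum | Headline | 52.80 | 55.07 | 72.53 | 67.17 | 71.59 | 50.92 |
| NeuralSum | Tech | 50.39 | 47.93 | 62.43 | 56.93 | 63.44 | 48.00 |

TABLE III: Test score of ROUGE-1 for the out-of-domain experiment.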

IV-A Overall results

Table II shows the test scores of ROUGE-1, ROUGE-2, and ROUGE-L for all the tested models described previously. The mean and standard deviation (in parentheses) of the scores are computed over the 5 folds. We report the score obtained by an oracle summarizer as Oracle. Its summaries are obtained by using the true labels. This oracle summarizer acts as the upper bound of an extractive summarizer on our dataset. As we can see, every scenario of NeuralSum consistently and substantially outperforms the other models. The best scenario is NeuralSum with a word embedding size of 300, although its ROUGE scores are still within one standard deviation of NeuralSum with the default word embedding size. The Lead-3 baseline performs very well and outperforms almost all the other models, which is consistent with other work showing that, for news summarization, the Lead-N baseline is hard to beat. LexRank and Bayes score at or slightly below Lead-3, but their scores are within one standard deviation of each other, so their performance is on par. This result suggests that a non-neural supervised summarizer is not better than an unsupervised one, and thus if labeled data are available, it might be best to opt for a neural summarizer right away. We also want to note that, despite their high ROUGE, the scores of every NeuralSum scenario are still considerably lower than Oracle, hinting that the model can be improved further. Moreover, initializing with FastText pre-trained embedding slightly lowers the scores, although they are still within one standard deviation. This finding suggests that the effect of FastText pre-trained embedding is unclear in our case.
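The "within one standard deviation" comparisons can be checked directly against the R-1 column of Table II. The sketch below uses one possible reading of that phrase, namely that the difference in means is no larger than the smaller of the two standard deviations; this criterion and the function name are assumptions, not something the text defines.

```python
def within_one_std(mean_a, std_a, mean_b, std_b):
    """True if the means differ by at most the smaller standard deviation."""
    return abs(mean_a - mean_b) <= min(std_a, std_b)


# R-1 mean and standard deviation over 5 folds, taken from Table II.
r1 = {
    "Lead-3": (62.86, 0.34),
    "LexRank": (62.86, 0.35),
    "Bayes": (62.70, 0.39),
    "NeuralSum": (67.60, 1.25),
    "NeuralSum 300 emb. size": (67.96, 0.46),
    "NeuralSum + FastText": (67.78, 0.69),
}

print(within_one_std(*r1["NeuralSum 300 emb. size"], *r1["NeuralSum"]))
print(within_one_std(*r1["Bayes"], *r1["Lead-3"]))
print(within_one_std(*r1["NeuralSum"], *r1["Lead-3"]))
```

Under this reading, the three NeuralSum scenarios are on par with one another, as are Lead-3, LexRank, and Bayes, while the gap between the two groups is several standard deviations wide.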

IV-B Out-of-domain results

Since Indonesian is a low-resource language, collecting an in-domain dataset for any task (including summarization) can be difficult. Therefore, we experimented with an out-of-domain scenario to see if NeuralSum can be used easily for a new use case for which the dataset is scarce or non-existent. Concretely, we trained the best NeuralSum (with a word embedding size of 300) on articles belonging to one category (the source domain) and evaluated its performance on articles belonging to another category (the target domain), for all pairs of source and target categories. As we have a total of 6 categories, we have 36 domain pairs to experiment on. To reduce computational cost, we used only the articles from the first fold and did not tune any hyperparameters. We note that this decision might undermine the generalizability of conclusions drawn from these out-of-domain experiments. Nonetheless, we feel that the results can still be useful guidance for future work. For comparison, we also evaluated Lead-3, Oracle, and the best unsupervised method, LexRank. For LexRank, we used the best hyperparameter that we found in the previous experiment for the first fold. We only report the ROUGE-1 scores. Table III shows the result of this experiment.
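The grid of 36 domain pairs can be enumerated as follows. This is only a sketch of one way to organize such an experiment: `train_fn` and `eval_fn` are hypothetical callables supplied by the caller and do not correspond to any function in the NeuralSum code.

```python
from itertools import product

CATEGORIES = ["Entertainment", "Inspiration", "Sport", "Showbiz", "Headline", "Tech"]


def domain_pairs(categories=CATEGORIES):
    """All (source, target) pairs, including the in-domain pairs."""
    return list(product(categories, repeat=2))


def run_grid(train_fn, eval_fn, categories=CATEGORIES):
    """Train once per source domain, then score on every target domain.

    train_fn(source) returns a model; eval_fn(model, target) returns a score.
    Both are hypothetical placeholders for the actual training and scoring.
    """
    scores = {}
    for source in categories:
        model = train_fn(source)
        for target in categories:
            scores[(source, target)] = eval_fn(model, target)
    return scores


print(len(domain_pairs()))
```

Organized this way, the 36 evaluations need only 6 training runs, one per source domain, and 6 of the pairs are the in-domain cases where source and target coincide.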

We see that almost all the results outperform the Lead-3 baseline, which means that for out-of-domain cases, NeuralSum does more than simply select some leading sentences from the original text. Almost all NeuralSum results also outperform LexRank, suggesting that when there is no in-domain training data, training NeuralSum on out-of-domain data may yield better performance than using an unsupervised model like LexRank. Looking at the best results, we observe that they all are the out-of-domain cases. In other words, training on out-of-domain data is surprisingly better than on in-domain data. For example, for Sport as the target domain, the best model is trained on Headline as the source domain. In fact, using Headline as the source domain yields the best result in 3 out of 6 target domains. We suspect that this phenomenon is due to the similarity between the corpora of the two domains. Specifically, training on Headline yields the best result most of the time because news from any domain can be headlines. Further investigation of this issue might leverage the domain similarity metrics proposed in [26]. Next, comparing the best NeuralSum performance on each target domain to Oracle, we still see quite a large gap. This gap hints that NeuralSum can still be improved further, probably by lifting the limitations of our experiment setup (e.g., tuning the hyperparameters for each domain pair).

V Conclusion and future work

We present IndoSum, a new benchmark dataset for Indonesian text summarization, and evaluate state-of-the-art extractive summarization methods on the dataset. We tested unsupervised, non-neural supervised, and neural supervised summarization methods. We used ROUGE as the evaluation metric because it is the standard intrinsic evaluation metric for text summarization. Our results show that neural models outperform non-neural ones and that, in the absence of an in-domain corpus, training on an out-of-domain one seems to yield better performance than using an unsupervised summarizer. Also, we found that the best performing model achieves ROUGE scores that are still significantly lower than the maximum possible scores, which suggests that the dataset is sufficiently challenging for future work. The dataset, which consists of 19K article-summary pairs, is publicly available. We hope that the dataset and the evaluation results can serve as a benchmark for future research on Indonesian text summarization.

Future work in this area may focus on improving the summarizer performance by employing newer neural models such as SummaRuNNer [11] or incorporating side information [27]. Since the gold summaries are abstractive, abstractive summarization techniques such as attention-based neural models [28], seq2seq models [29], pointer networks [30], or reinforcement learning-based approaches [31] are also interesting directions for future work. Other tasks such as further investigation of the out-of-domain issue, human evaluation, or even extending the corpus to include more than one summary per article are worth exploring as well.


We thank anonymous reviewers for their helpful feedback. We acknowledge the support from Shortir and Tempo. Lastly, we also thank Muhammad Pratikto and Ahmad Rizqi Meydiarso for their relentless support.


  • [1] D. Das and A. F. Martins, “A survey on automatic text summarization,” Literature Survey for the Language and Statistics II course at CMU, vol. 4, pp. 192–195, 2007.
  • [2] C.-Y. Lin, “Rouge: A package for automatic evaluation of summaries,” in Text Summarization Branches out: Proceedings of the ACL-04 Workshop, vol. 8.   Barcelona, Spain, 2004.
  • [3] A. Najibullah, “Indonesian Text Summarization based on Naïve Bayes Method,” in Proceeding of the International Seminar and Conference 2015: The Golden Triangle (Indonesia-India-Tiongkok) Interrelations in Religion, Science, Culture, and Economic, Semarang, Indonesia, 2015, p. 12.
  • [4] M. Fachrurrozi, N. Yusliani, and R. U. Yoanita, “Frequent Term based Text Summarization for Bahasa Indonesia,” in Proceedings of the International Conference on Innovations in Engineering and Technology, Bangkok, Thailand, 2013, p. 3.
  • [5] Silvia, P. Rukmana, V. Aprilia, D. Suhartono, R. Wongso, and Meiliana, “Summarizing Text for Indonesian Language by Using Latent Dirichlet Allocation and Genetic Algorithm,” in Proceeding of International Conference on Electrical Engineering, Computer Science and Informatics (EECSI 2014), Yogyakarta, Indonesia, 2014, p. 6.
  • [6] C. Aone, M. E. Okurowski, and J. Gorlinsky, “Trainable, scalable summarization using robust NLP and machine learning,” in Proceedings of the 17th International Conference on Computational Linguistics-Volume 1.   Association for Computational Linguistics, 1998, pp. 62–66.
  • [7] D. Gunawan, A. Pasaribu, R. F. Rahmat, and R. Budiarto, “Automatic Text Summarization for Indonesian Language Using TextTeaser,” IOP Conference Series: Materials Science and Engineering, vol. 190, no. 1, p. 012048, 2017.
  • [8] C. Slamet, A. R. Atmadja, D. S. Maylawati, R. S. Lestari, W. Darmalaksana, and M. A. Ramdhani, “Automated Text Summarization for Indonesian Article Using Vector Space Model,” IOP Conference Series: Materials Science and Engineering, vol. 288, p. 012037, Jan. 2018.
  • [9] D. T. Massandy and M. L. Khodra, “Guided summarization for Indonesian news articles,” in 2014 International Conference of Advanced Informatics: Concept, Theory and Application (ICAICTA), Aug. 2014, pp. 140–145.
  • [10] F. Koto, “A Publicly Available Indonesian Corpora for Automatic Abstractive and Extractive Chat Summarization,” in Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016).   Portorož, Slovenia: European Language Resources Association (ELRA), 2016, p. 5.
  • [11] R. Nallapati, F. Zhai, and B. Zhou, “SummaRuNNer: A Recurrent Neural Network Based Sequence Model for Extractive Summarization of Documents,” in Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, February 4-9, 2017, San Francisco, California, USA, 2017, pp. 3075–3081.
  • [12] J. Cheng and M. Lapata, “Neural summarization by extracting sentences and words,” in Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics.   Berlin, Germany: Association for Computational Linguistics, Aug. 2016, pp. 484–494.
  • [13] S. Bird, E. Loper, and E. Klein, Natural Language Processing with Python.   O’Reilly Media Inc., 2009.
  • [14] A. Nenkova and L. Vanderwende, “The impact of frequency on summarization,” Microsoft Research, Redmond, Washington, Tech. Rep. MSR-TR-2005, vol. 101, 2005.
  • [15] L. Vanderwende, H. Suzuki, C. Brockett, and A. Nenkova, “Beyond SumBasic: Task-focused summarization with sentence simplification and lexical expansion,” Information Processing & Management, vol. 43, no. 6, pp. 1606–1618, 2007.
  • [16] Y. Gong and X. Liu, “Generic text summarization using relevance measure and latent semantic analysis,” in Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval.   ACM, 2001, pp. 19–25.
  • [17] J. Steinberger and K. Jezek, “Using latent semantic analysis in text summarization and summary evaluation,” in Proc. ISIM’04, 2004, pp. 93–100.
  • [18] G. Erkan and D. R. Radev, “Lexrank: Graph-based lexical centrality as salience in text summarization,” Journal of Artificial Intelligence Research, vol. 22, pp. 457–479, 2004.
  • [19] D. R. Radev, H. Jing, and M. Budzikowska, “Centroid-based summarization of multiple documents: Sentence extraction, utility-based evaluation, and user studies,” in Proceedings of the 2000 NAACL-ANLP Workshop on Automatic Summarization.   Association for Computational Linguistics, 2000, pp. 21–30.
  • [20] R. Mihalcea and P. Tarau, “Textrank: Bringing order into text,” in Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing, 2004.
  • [21] J. Conroy and D. O’Leary, “Text summarization via hidden Markov model and pivoted QR matrix decomposition,” 2001.
  • [22] M. Osborne, “Using maximum entropy for sentence extraction,” in Proceedings of the Workshop on Automatic Summarization (Including DUC 2002).   Philadelphia: Association for Computational Linguistics, Jul. 2002.
  • [23] F. Tala, J. Kamps, K. E. Müller, and R. de M, “The impact of stemming on information retrieval in Bahasa Indonesia,” Studia Logica - An International Journal for Symbolic Logic - SLOGICA, Jan. 2003.
  • [24] C.-Y. Lin and E. Hovy, “Automatic evaluation of summaries using n-gram co-occurrence statistics,” in Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology-Volume 1.   Association for Computational Linguistics, 2003, pp. 71–78.
  • [25] P. Bojanowski, E. Grave, A. Joulin, and T. Mikolov, “Enriching word vectors with subword information,” arXiv preprint arXiv:1607.04606, 2016.
  • [26] S. Ruder and B. Plank, “Learning to select data for transfer learning with Bayesian Optimization,” in Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing.   Copenhagen, Denmark: Association for Computational Linguistics, Jul. 2017, pp. 372–382.
  • [27] S. Narayan, N. Papasarantopoulos, M. Lapata, and S. B. Cohen, “Neural Extractive Summarization with Side Information,” CoRR, vol. abs/1704.04530, 2017.
  • [28] A. M. Rush, S. Chopra, and J. Weston, “A neural attention model for abstractive sentence summarization,” in Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing.   Lisbon, Portugal: Association for Computational Linguistics, 2015, pp. 379–389.
  • [29] R. Nallapati, B. Zhou, C. dos Santos, C. Gulcehre, and B. Xiang, “Abstractive Text Summarization using Sequence-to-sequence RNNs and Beyond,” in Proceedings of The 20th SIGNLL Conference on Computational Natural Language Learning.   Berlin, Germany: SIGNLL, 2016.
  • [30] A. See, P. J. Liu, and C. D. Manning, “Get to the point: Summarization with pointer-generator networks,” in Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers).   Vancouver, Canada: Association for Computational Linguistics, 2017, pp. 1073–1083.
  • [31] R. Paulus, C. Xiong, and R. Socher, “A Deep Reinforced Model for Abstractive Summarization,” arXiv:1705.04304 [cs], May 2017.