
SportsSum2.0: Generating High-Quality Sports News from Live Text Commentary

by Jiaan Wang, et al.
Soochow University
Fudan University

Sports game summarization aims to generate news articles from live text commentaries. A recent state-of-the-art work, SportsSum, not only constructs a large benchmark dataset, but also proposes a two-step framework. Despite its great contributions, the work has three main drawbacks: 1) the noise in the SportsSum dataset degrades summarization performance; 2) neglecting the lexical overlap between news and commentaries results in a low-quality pseudo-labeling algorithm; 3) directly concatenating rewritten sentences to form news limits practicality. In this paper, we publish a new benchmark dataset, SportsSum2.0, together with a modified summarization framework. In particular, to obtain a clean dataset, we employ crowd workers to manually clean the original dataset. Moreover, the degree of lexical overlap is incorporated into the generation of pseudo labels. Further, we introduce a reranker-enhanced summarizer that takes into account the fluency and expressiveness of the summarized news. Extensive experiments show that our model outperforms the state-of-the-art baseline.



1. Introduction

Text summarization aims at compressing an original document into a shorter text while preserving its main ideas (Rush et al., 2015; Chopra et al., 2016; Nallapati et al., 2016; See et al., 2017; Chen and Bansal, 2018). A special text summarization task in the sports domain is sports game summarization, as shown in the example in Fig. 1, which focuses on generating news articles from live commentaries (Zhang et al., 2016). This task is more challenging than conventional text summarization for two reasons: first, the length of a live commentary document often reaches thousands of tokens, far beyond the capacity of mainstream PLMs (e.g., BERT, RoBERTa); second, commentaries and news have different text styles. Specifically, commentaries are more colloquial than news.

Figure 1. An example of Sports Game Summarization.

Sports game summarization has gradually attracted attention from researchers due to its practical significance. Zhang et al. (Zhang et al., 2016) pioneer this task and construct the first dataset with only 150 samples. Another dataset, created for the shared task at NLPCC 2016, contains 900 samples (Wan et al., 2016). Although these datasets promote the research to some extent, they cannot support more sophisticated models due to their limited scale. Additionally, early literature (Zhang et al., 2016; Zhu et al., 2016; Yao et al., 2017; Liu et al., 2016) on these datasets mainly explores different strategies to select important commentary sentences to form news directly, but ignores the different text styles between commentaries and news. Recently, Huang et al. (Huang et al., 2020) presented SportsSum, the first large-scale sports game summarization dataset with 5,428 samples. This work also proposes a two-step model which first selects important commentary sentences, and then rewrites each selected sentence into a news sentence so as to form the news. In order to provide training data for both the selector and the rewriter of the two-step model, a pseudo-labeling algorithm is introduced to find, for each news sentence, a corresponding commentary sentence according to their timeline information as well as semantic similarities.

Given all the existing efforts, this task is still not fully exploited in the following aspects: (1) The existing datasets are limited in either scale or quality. According to our observations on SportsSum, more than 15% of samples have noisy sentences in the news articles due to its simple rule-based data cleaning process; (2) The pseudo-labeling algorithm used in existing two-step models only considers the semantic similarities between news sentences and commentary sentences but neglects the lexical overlap between them, which is actually a useful clue for generating the pseudo label; (3) The existing approaches rely on directly stitching the (rewritten) selected sentences together to constitute the news, resulting in low fluency and high redundancy because each (rewritten) selected sentence is generated without awareness of the other sentences.

Therefore, in this paper, we first denoise the SportsSum dataset to obtain a higher-quality SportsSum2.0 dataset. Secondly, lexical overlaps between news sentences and commentary sentences are taken into account by our advanced pseudo-labeling algorithm. Lastly, we extend the two-step framework (Huang et al., 2020) to a novel reranker-enhanced summarizer, where the last step reranks the rewritten sentences w.r.t. importance, fluency and redundancy. We evaluate our model on the SportsSum2.0 and SportsSum (Huang et al., 2020) datasets. The experimental results show that our model achieves state-of-the-art performance in terms of ROUGE scores.

Figure 2. Noise in the SportsSum dataset. The first example contains descriptions of other games, the second has history-related descriptions, and the last includes an advertisement and irrelevant hyperlink text.

2. Data Construction

In this section, we first analyze the existing noises in SportsSum, and then introduce the details of manual cleaning process. Lastly, we show the statistics of SportsSum2.0.

Noise Analysis. SportsSum (Huang et al., 2020) is the only public large-scale sports game summarization dataset, with 5,428 samples crawled from a Chinese sports website. During our inspection of SportsSum, we find that more than 15% of news articles have noisy sentences. Specifically, we divide these noises into three classes and show an example of each class in Fig. 2:

  • Descriptions of other games: a news webpage may contain news articles about multiple games, which is neglected by SportsSum and results in 2.2% (119/5,428) of news articles including descriptions of other games.

  • Descriptions of history: many sports news articles describe the match history at the beginning, which cannot be inferred from the commentaries. To alleviate this issue, SportsSum adopts a rule-based method that identifies starting keywords (e.g., “at the beginning of the game”) and removes the descriptions before the keywords. However, about 4.6% (252/5,428) of news articles are not correctly handled by this method.

  • Advertisements and irrelevant hyperlink text: we find that about 9.8% (531/5,428) of news articles in SportsSum contain such noise.
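The keyword-based history removal described above can be sketched as follows. This is a minimal illustration, not SportsSum's actual code: the keyword list and the English phrasing are stand-ins for the Chinese phrases the original rules match.

```python
# Hypothetical starting keywords (translated); the real SportsSum rules
# operate on Chinese phrases such as "at the beginning of the game".
START_KEYWORDS = ["at the beginning of the game", "the game kicked off"]

def strip_history(news: str) -> str:
    """Drop everything before the first starting keyword, keeping the keyword.

    Returns the article unchanged when no keyword is found, which is
    exactly the failure mode that leaves history descriptions in
    roughly 4.6% of SportsSum articles.
    """
    lowered = news.lower()
    positions = [lowered.find(k) for k in START_KEYWORDS]
    positions = [p for p in positions if p != -1]
    if not positions:
        return news  # the rule fails: history noise survives
    return news[min(positions):]
```

This illustrates why a purely rule-based cleanup is brittle: any article whose body does not contain one of the fixed keywords passes through untouched, motivating the manual cleaning process below.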

Manual Cleaning Process. In order to reduce the noise of news articles, we design a manual cleaning process. For a given news article, we first remove the descriptions related to other games, and then delete advertisements and irrelevant hyperlink text. Finally, we discard the descriptions of history.

Annotation Process. We recruit 7 master's students to perform the manual cleaning process. In order to help the annotators fully understand our manual cleaning process, every annotator is trained in a pre-annotation process. All the cleaning results are checked by another two data experts. The manual cleaning process costs about 200 human hours in total. After manual cleaning, we discard 26 bad cases which do not contain descriptions of the current game, finally obtaining 5,402 human-cleaned samples.

Statistics. Table 1 shows the statistics of SportsSum2.0 and SportsSum. The average length of commentary documents in SportsSum2.0 differs slightly from its counterpart in SportsSum due to the 26 removed bad cases.

                   SportsSum2.0            SportsSum
               Commentary    News     Commentary    News
Avg. #chars      3464.31   771.93       3459.97   801.11
Avg. #words      1828.56   406.81       1825.63   427.98
Avg. #sent.       194.10    22.05        193.77    23.80
Table 1. The statistics of SportsSum2.0 and SportsSum.

3. Methodology

As shown in Fig. 1, the goal of sports game summarization is to generate sports news R = (r_1, r_2, ..., r_m) from a given live commentary document C = (c_1, c_2, ..., c_n). Here r_i represents the i-th news sentence and c_j = (t_j, s_j, x_j) is the j-th commentary, where t_j is the timeline information, s_j denotes the current scores and x_j is the commentary sentence.

Fig. 3 shows the overview of our reranker-enhanced summarizer which first learns a selector to extract important commentary sentences, then uses a rewriter to convert each selected sentence to a news sentence. Finally, a reranker is introduced to generate a news article based on the rewritten sentences.

3.1. Pseudo-Labeling Algorithm

To train the selector and rewriter, we need labels indicating the importance of commentary sentences and their corresponding news sentences. Following SportsSum (Huang et al., 2020), we obtain the labels by mapping each news sentence to a commentary sentence through a pseudo-labeling algorithm that considers both timeline information and similarity metrics. Although news sentences carry no explicit timeline information, most of them start with phrases such as “in the n-th minute”, which indicate their timeline information.

For each news sentence r_i, we first extract its time information if possible, and then construct a candidate commentary set C_i, where the timeline information of each candidate commentary falls within a window around the extracted time. Lastly, we select the commentary sentence with the highest similarity to r_i from C_i, forming a pair of a mapped commentary sentence and a news sentence.
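A minimal sketch of this mapping step, under loud assumptions: the minute-extraction regex, the window size, and the `(minute, score, text)` triple layout are illustrative stand-ins, not the paper's exact implementation, and the similarity function is pluggable.

```python
import re

def extract_minute(news_sent):
    """Pull 'in the n-th minute' style timestamps; returns None when absent.
    The English pattern is a stand-in for the expressions matched in the
    original Chinese news sentences."""
    m = re.search(r"(\d+)(?:st|nd|rd|th)? minute", news_sent)
    return int(m.group(1)) if m else None

def map_news_to_commentary(news_sent, commentaries, sim, window=3):
    """commentaries: list of (minute, score, text) triples.
    Builds the candidate set around the news sentence's minute, then picks
    the candidate maximizing sim(news_sent, commentary_text)."""
    minute = extract_minute(news_sent)
    if minute is None:
        candidates = commentaries  # no timestamp: fall back to the full document
    else:
        candidates = [c for c in commentaries if abs(c[0] - minute) <= window]
    if not candidates:
        return None
    return max(candidates, key=lambda c: sim(news_sent, c[2]))
```

Each returned triple, paired with its source news sentence, yields one pseudo-labeled training pair for the selector and rewriter.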

Figure 3. The overview of the reranker-enhanced summarizer.

The similarity function used by Huang et al. (Huang et al., 2020) is BERTScore (Zhang et al., 2020), which only considers the semantic similarity between news and commentaries but neglects the lexical overlap between them. However, lexical overlap is actually a useful clue: for example, the matching probability of a news sentence and a commentary sentence is greatly improved if the same entity mentions appear in both sentences. Therefore, we consider both semantics and lexical overlap in the similarity function:

Sim(r_i, x_j) = \alpha \cdot \mathrm{BERTScore}(r_i, x_j) + (1 - \alpha) \cdot \mathrm{ROUGE}(r_i, x_j)

That is, the similarity function Sim is calculated by linearly combining the BERTScore function and the ROUGE score function, where the coefficient \alpha is a hyperparameter.
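The linear combination can be sketched as below. Two simplifying assumptions: the LCS-based ROUGE-L F1 is a lightweight stand-in for the ROUGE term, and `bert_score` is any callable returning a semantic similarity in [0, 1] (in practice, the bert-score library).

```python
def rouge_l_f1(a_tokens, b_tokens):
    """Plain LCS-based ROUGE-L F1, computed by dynamic programming."""
    m, n = len(a_tokens), len(b_tokens)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            dp[i + 1][j + 1] = (dp[i][j] + 1 if a_tokens[i] == b_tokens[j]
                                else max(dp[i][j + 1], dp[i + 1][j]))
    lcs = dp[m][n]
    if lcs == 0:
        return 0.0
    p, r = lcs / n, lcs / m
    return 2 * p * r / (p + r)

def combined_sim(news, comm, bert_score, alpha=0.7):
    """Sim = alpha * BERTScore + (1 - alpha) * ROUGE.
    alpha=0.7 follows the coefficient reported in the implementation details;
    tokenization by whitespace split is an illustrative simplification."""
    lexical = rouge_l_f1(news.split(), comm.split())
    return alpha * bert_score(news, comm) + (1 - alpha) * lexical
```

The lexical term rewards shared surface tokens (entity mentions, scores) even when the semantic model scores two paraphrases similarly, which is exactly the clue the original BERTScore-only matching misses.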

With the above process, we finally obtain a large number of pairs of mapped commentary sentences and news sentences, which can be used for training the selector and rewriter.

Method                            #   Model        SportsSum2.0            SportsSum
                                                   R-1    R-2    R-L      R-1    R-2    R-L
Extractive Models                 1   TextRank     20.53   6.14  19.64    18.37   5.69  17.23
                                  2   PacSum       23.13   7.18  22.04    21.84   6.56  20.19
Abstractive Models                3   Abs-LSTM     31.14  11.22  30.36    29.22  10.94  28.09
                                  4   Abs-PGNet    35.98  13.07  35.09    33.21  11.76  32.37
Two-Step Framework                5   SportsSUM    44.73  18.90  44.03    43.17  18.66  42.27
(Selector + Rewriter)             6   PGNet        45.13  19.13  44.12    44.12  18.89  43.23
                                  7   mBART        47.23  19.27  46.54    46.43  19.54  46.21
                                  8   Bert2bert    46.85  19.24  46.12    46.54  19.32  45.93
                                  9   Bert2bert†   47.54  19.87  46.99    47.08  19.63  46.87
Reranker-Enhanced Summarizer      10  PGNet        46.23  19.41  45.37    45.54  19.02  45.31
(Selector + Rewriter + Reranker)  11  mBART        47.62  19.73  47.19    46.89  19.72  46.53
                                  12  Bert2bert    47.32  19.33  47.01    46.14  19.32  45.53
                                  13  Bert2bert†   48.13  20.09  47.78    47.61  19.65  47.49
Table 2. Experimental results (ROUGE-1/2/L) on SportsSum2.0 and SportsSum. Models marked with † use our advanced pseudo-labeling algorithm, while the corresponding unmarked models utilize the original pseudo-labeling algorithm (Huang et al., 2020).

3.2. Reranker-Enhanced Summarizer

Our reranker-enhanced summarizer extends the two-step model (Huang et al., 2020) to consider the importance, fluency and redundancy of the rewritten news sentences, as shown in Fig. 3. In particular, we first select important commentary sentences, and then rewrite each selected sentence into a news sentence with a seq2seq model. However, the readability and fluency of these news sentences vary considerably: some can be used directly in the final news, while others are inappropriate due to low fluency, a common problem in natural language generation that manifests as repetition (See et al., 2017) and incoherence (Bellec et al., 2018). We therefore use the reranker to filter out sentences of low fluency while retaining highly informative sentences and controlling the redundancy among the rewritten sentences.

Selector. Different from the existing two-step model (Huang et al., 2020), which purely uses TextCNN (Kim, 2014) as the selector and ignores the context of a commentary sentence, we design a context-aware selector that captures the semantics well with a sliding window. In detail, we train a binary classifier to choose important commentary sentences. During training, each commentary c_j in C is assigned a positive label if it can be mapped to a news sentence by the pseudo-labeling algorithm, and a negative label otherwise. RoBERTa (Liu et al., 2019) is employed to extract the contextual representations of commentary sentences. We first tokenize the commentary sentences, and then concatenate the target commentary sentence with its partial context using special tokens, as [CLS] commentary1 [SEP] commentary2 [SEP] ... [SEP] commentaryN [SEP], keeping the target commentary sentence in the middle of the whole sequence and limiting the sequence to 512 tokens. For prediction, the sentence embedding is obtained by averaging the token embeddings of the target commentary sentence, and is then followed by a sigmoid classifier. The cross-entropy loss is used as the training objective for the selector.
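The sliding-window input construction for the context-aware selector might look like the following sketch. Whitespace tokenization and symmetric window growth are assumptions for illustration; real code would count subword tokens with the RoBERTa tokenizer.

```python
def build_selector_input(sentences, target_idx, max_tokens=512):
    """Center the target commentary sentence in a
    [CLS] s [SEP] s [SEP] ... window, growing the window symmetrically
    until the token budget is hit. Tokens are counted by whitespace
    splitting here as an illustrative stand-in for subword tokenization.

    Returns (input_string, offset of the target sentence in the window)."""
    def cost(parts):
        # [CLS] plus each sentence followed by a [SEP]
        return 1 + sum(len(s.split()) + 1 for s in parts)

    left, right = target_idx, target_idx
    window = [sentences[target_idx]]
    while True:
        grown = False
        if left > 0 and cost([sentences[left - 1]] + window) <= max_tokens:
            left -= 1
            window.insert(0, sentences[left])
            grown = True
        if right < len(sentences) - 1 and cost(window + [sentences[right + 1]]) <= max_tokens:
            right += 1
            window.append(sentences[right])
            grown = True
        if not grown:
            break
    return "[CLS] " + " [SEP] ".join(window) + " [SEP]", target_idx - left
```

Growing one sentence on each side per iteration keeps the target near the middle of the sequence, matching the description above; the returned offset lets the model average only the target sentence's token embeddings before the sigmoid classifier.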

Rewriter. Each selected commentary sentence is first concatenated with its timeline information, and then rewritten into a news sentence by a seq2seq model. To be more specific, we choose the following three seq2seq models:

  • Pointer-Generator Network (See et al., 2017) is a popular abstractive text summarization model with copy mechanism and coverage loss.

  • Bert2bert is a seq2seq model in which both the encoder and the decoder are initialized with BERT (Devlin et al., 2019).

  • BART (Lewis et al., 2020) is a denoising autoencoder seq2seq model which achieves state-of-the-art results on many language generation tasks. Since there is no Chinese version of BART for public use, we choose its multilingual version, i.e., mBART (Liu et al., 2020).

Reranker. Although the rewritten sentences can convey the semantics to some extent, some of them still have low fluency, which leads to poor readability. In view of the ability of Maximal Marginal Relevance (MMR) (Carbonell and Goldstein, 1998) to extract highly informative, low-redundancy sentences, we adopt this approach as the reranker. Unfortunately, vanilla MMR cannot take fluency into account. So, we propose a variant MMR algorithm that incorporates the fluency of each rewritten sentence:

\mathrm{MMR} = \arg\max_{r_i \in R \setminus S} \left[ \lambda_1 \mathrm{Imp}(r_i) - \lambda_2 \max_{r_j \in S} \mathrm{Sim}(r_i, r_j) + \lambda_3 \mathrm{Flu}(r_i) \right]

where R represents the whole news sentence set and S denotes the set of already selected news sentences. Imp(·) calculates the importance of a news sentence; since each commentary sentence has a corresponding importance predicted by the selector, we directly use the importance of the corresponding commentary sentence as Imp(·). Flu(·) indicates the fluency of a news sentence, which we compute from the perplexity of the sentence under GPT-2 (Radford et al., 2019): the lower the perplexity, the higher the fluency. Sim(·,·) is the BERTScore function. We greedily select the news sentence with the highest MMR score until the total length exceeds a pre-defined budget (we set the average length of the news articles as the budget).
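The greedy selection with the variant MMR score can be sketched as follows. The arrangement and signs of the three terms are a plausible reading of the equation, not the paper's exact formulation; coefficients follow the implementation details (0.6 / 0.2 / 0.2), and importance, fluency, and similarity are supplied externally (from the selector, GPT-2 perplexity, and BERTScore in the real system).

```python
def rerank_mmr(sentences, importance, fluency, sim,
               l1=0.6, l2=0.2, l3=0.2, budget=100):
    """Greedy variant-MMR selection.

    At each step, pick the unselected sentence maximizing
        l1 * importance  -  l2 * max similarity to selected  +  l3 * fluency,
    stopping once the running character length exceeds `budget`.
    Returns the chosen sentences in their original order."""
    selected = []
    remaining = list(range(len(sentences)))
    total_len = 0
    while remaining and total_len < budget:
        def score(i):
            # redundancy penalty: highest similarity to any selected sentence
            redundancy = max((sim(sentences[i], sentences[j]) for j in selected),
                             default=0.0)
            return l1 * importance[i] - l2 * redundancy + l3 * fluency[i]
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
        total_len += len(sentences[best])
    return [sentences[i] for i in sorted(selected)]
```

Restoring the original order at the end keeps the final news chronologically coherent even though selection order is score-driven.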

4. Experiments

4.1. Implementation Details

We split SportsSum2.0 (which is publicly available) into three sets: training (4,803 games), validation (300 games), and testing (299 games); we keep the original split except for the 26 removed bad cases. We train all models on one 32GB Tesla V100 GPU. Our reranker-enhanced summarizer is implemented based on RoBERTa-large (24 layers with 1024 hidden size), mBART (12 layers with 1024 hidden size) and GPT-2 from huggingface Transformers (Wolf et al., 2019) with default settings. The learning rates of the selector, PGNet, the Bert2bert rewriter and the mBART rewriter are 3e-5, 0.15, 2e-5 and 2e-5, respectively. For all models, we set the batch size to 32 and use the AdamW optimizer. The coefficient α in our pseudo-labeling algorithm is 0.70. We set λ1, λ2 and λ3 to 0.6, 0.2 and 0.2 in our variant MMR.

Figure 4. Results of the human evaluation on (a) informativeness, (b) redundancy, (c) fluency and (d) overall quality.

4.2. Experimental Results

We compare our three-step models with several conventional models, including extractive summarization models, abstractive summarization models and two-step models in terms of ROUGE scores.

As shown in Table 2, our reranker-enhanced models achieve significant improvements on both datasets. TextRank (Mihalcea and Tarau, 2004) and PacSum (Zheng and Lapata, 2019) are extractive summarization models, which are limited by the different text styles between commentaries and news. Abs-PGNet (See et al., 2017) and Abs-LSTM (abstractive summarization models) achieve better performance since they can alleviate the text-style issue, but still struggle with long texts. SportsSUM (Huang et al., 2020) is the state-of-the-art two-step model, which performs better than the above models because it addresses both the text-style and long-text issues. The other two-step models enhance SportsSUM through an improved pseudo-labeling algorithm, selector and rewriter, but still suffer from the low-fluency problem. Our best reranker-enhanced model outperforms SportsSUM by more than 2.8 and 3.5 points in the average of ROUGE scores on SportsSum2.0 and SportsSum, respectively. Additionally, the effectiveness of our advanced pseudo-labeling algorithm is demonstrated by comparing row 8 to row 9, or row 12 to row 13.

4.3. Necessity of Manual Cleaning

To further demonstrate the necessity of manual cleaning, we conduct a human evaluation on sports news generated by two reranker-enhanced summarizers trained on SportsSum2.0 and SportsSum, respectively. We denote these two models as SUM-Clean and SUM-Noisy. Five postgraduate students are recruited, and each one evaluates 100 samples per summarizer. Each evaluator scores the generated sports news in terms of informativeness, redundancy, fluency and overall quality on a 3-point scale.

Fig. 4 shows the human evaluation results. SUM-Clean performs better than SUM-Noisy on all four aspects, especially fluency. This indicates that training on noisy data degrades the performance of game summarization, and that it is necessary to remove the noise from the dataset.

5. Conclusion

In this paper, we study sports game summarization on the basis of SportsSum. A high-quality dataset, SportsSum2.0, is constructed by removing or rectifying noise. We also propose a novel pseudo-labeling algorithm based on both semantics and lexical overlap. Furthermore, an improved framework is designed to improve the fluency of the rewritten news. Experimental results show the effectiveness of our model on SportsSum2.0.

Acknowledgments

This research is supported by the National Key R&D Program of China (No. 2018-AAA0101900), the Priority Academic Program Development of Jiangsu Higher Education Institutions, the National Natural Science Foundation of China (Grant No. 62072323, 61632016), the Natural Science Foundation of Jiangsu Province (No. BK20191420), the Suda-Toycloud Data Intelligence Joint Laboratory, and the Collaborative Innovation Center of Novel Software Technology and Industrialization.


References

  • G. Bellec, D. Kappel, W. Maass, and R. A. Legenstein (2018) Deep rewiring: training very sparse deep networks. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings, External Links: Link Cited by: §3.2.
  • J. G. Carbonell and J. Goldstein (1998) The use of MMR, diversity-based reranking for reordering documents and producing summaries. In Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. Cited by: §3.2.
  • Y. Chen and M. Bansal (2018) Fast abstractive summarization with reinforce-selected sentence rewriting. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Melbourne, Australia, pp. 675–686. External Links: Link, Document Cited by: §1.
  • S. Chopra, M. Auli, and A. M. Rush (2016) Abstractive sentence summarization with attentive recurrent neural networks. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, San Diego, California, pp. 93–98. External Links: Link, Document Cited by: §1.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, Minnesota, pp. 4171–4186. External Links: Link, Document Cited by: 2nd item.
  • K. Huang, C. Li, and K. Chang (2020) Generating sports news from live commentary: a Chinese dataset for sports game summarization. In Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing, Suzhou, China, pp. 609–615. External Links: Link Cited by: §1, §1, §2, §3.1, §3.1, §3.2, §3.2, Table 2, §4.2.
  • Y. Kim (2014) Convolutional neural networks for sentence classification. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar, pp. 1746–1751. External Links: Link, Document Cited by: §3.2.
  • M. Lewis, Y. Liu, N. Goyal, M. Ghazvininejad, A. Mohamed, O. Levy, V. Stoyanov, and L. Zettlemoyer (2020) BART: denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, pp. 7871–7880. External Links: Link, Document Cited by: 3rd item.
  • M. Liu, Q. Qi, H. Hu, and H. Ren (2016) Sports news generation from live webcast scripts based on rules and templates. In Natural Language Understanding and Intelligent Applications, pp. 876–884. Cited by: §1.
  • Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov (2019) RoBERTa: a robustly optimized bert pretraining approach. ArXiv abs/1907.11692. Cited by: §3.2.
  • Y. Liu, J. Gu, N. Goyal, X. Li, S. Edunov, M. Ghazvininejad, M. Lewis, and L. Zettlemoyer (2020) Multilingual denoising pre-training for neural machine translation. Transactions of the Association for Computational Linguistics 8, pp. 726–742. Cited by: 3rd item.
  • R. Mihalcea and P. Tarau (2004) TextRank: bringing order into text. In Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing, Barcelona, Spain, pp. 404–411. External Links: Link Cited by: §4.2.
  • R. Nallapati, B. Zhou, C. dos Santos, Ç. Gulçehre, and B. Xiang (2016) Abstractive text summarization using sequence-to-sequence RNNs and beyond. In Proceedings of The 20th SIGNLL Conference on Computational Natural Language Learning, Berlin, Germany, pp. 280–290. External Links: Link, Document Cited by: §1.
  • A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever (2019) Language models are unsupervised multitask learners. Cited by: §3.2.
  • A. M. Rush, S. Chopra, and J. Weston (2015) A neural attention model for abstractive sentence summarization. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, Lisbon, Portugal, pp. 379–389. External Links: Link, Document Cited by: §1.
  • A. See, P. J. Liu, and C. D. Manning (2017) Get to the point: summarization with pointer-generator networks. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vancouver, Canada, pp. 1073–1083. External Links: Link, Document Cited by: §1, 1st item, §3.2, §4.2.
  • X. Wan, J. Zhang, J. Yao, and T. Wang (2016) Overview of the nlpcc-iccpol 2016 shared task: sports news generation from live webcast scripts. In Natural Language Understanding and Intelligent Applications, pp. 870–875. Cited by: §1.
  • T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz, and J. Brew (2019) HuggingFace’s transformers: state-of-the-art natural language processing. ArXiv abs/1910.03771. Cited by: §4.1.
  • J. Yao, J. Zhang, X. Wan, and J. Xiao (2017) Content selection for real-time sports news construction from commentary texts. In Proceedings of the 10th International Conference on Natural Language Generation, Santiago de Compostela, Spain, pp. 31–40. External Links: Link, Document Cited by: §1.
  • J. Zhang, J. Yao, and X. Wan (2016) Towards constructing sports news from live text commentary. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Berlin, Germany, pp. 1361–1371. External Links: Link, Document Cited by: §1, §1.
  • T. Zhang, V. Kishore, F. Wu, K. Q. Weinberger, and Y. Artzi (2020) BERTScore: evaluating text generation with BERT. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020, External Links: Link Cited by: §3.1.
  • H. Zheng and M. Lapata (2019) Sentence centrality revisited for unsupervised summarization. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, pp. 6236–6247. External Links: Link, Document Cited by: §4.2.
  • L. Zhu, W. Wang, Y. Chen, X. Lv, and J. Zhou (2016) Research on summary sentences extraction oriented to live sports text. In Natural Language Understanding and Intelligent Applications, pp. 798–807. Cited by: §1.