As Internet technology continues to thrive, a large number of documents are continuously generated and published online. Online newspapers, for instance, publish articles describing important facts or life events of well-known personalities. However, these documents are highly unstructured and noisy: they contain meaningful biographical facts alongside unrelated information about the person, such as opinions and discussions. Extracting meaningful biographical sentences from a large pool of unstructured text is therefore a challenging problem. Although humans manage to filter out the desired information, manual inspection does not scale to very large document collections.
1.1. Textual Biographies
A textual biography can be represented as a series of facts and events that make up a person’s life. Different types of biographical facts include aspects related to personal characteristics such as date and place of birth and death, education, career, occupation, affiliation, and relationships. Overall, a general biography generation process can be described in three major steps: (i) identifying biographical sentences, (ii) classifying biographical sentences into different life-events, and (iii) relevancy-based ranking of sentences in each life-event class. Along with the biographical information, a Wikipedia profile of a person also contains a consistently-formatted table at the top right-hand side of the page. This table or box is known as an infobox, and it contains some important facts related to the person.
1.2. Machine Learning in Biography Generation
The majority of past literature focuses on machine learning techniques for information extraction. Zhou et al. (Zhou et al., 2005) trained biographical and non-biographical sentence classifiers to categorize sentences. They also employed a Naive Bayes model with n-gram features to classify sentences into ten classes such as bio, fame factor, personality, etc. This work is similar to ours, but it requires a substantial amount of human effort. Biadsy et al. (Biadsy et al., 2008) proposed summarization techniques to extract important information from multiple sentences. Liu et al. (Liu et al., 2018) also use multi-document summarization: to identify salient information, paragraphs are ranked and ordered using various extractive summarization techniques. However, neither of these systems ((Biadsy et al., 2008) and (Liu et al., 2018)) focuses on sectionizing the biography. The works by Filatova et al. (Filatova and Prager, 2005) and Barzilay et al. (Barzilay et al., 2001) focus on specific tasks: identifying occupation-related important events and sentence ordering, respectively. One of the more recent works on generating sentences is by Lebret et al. (Lebret et al., 2016), who use a concept-to-text generation approach to generate only the first sentence of a biography from the fact tables present in Wikipedia.
1.3. Our Contribution
In this paper, we address the task of automatically extracting biographical facts from textual documents published on the Web. We pose this problem in the extractive summarization framework and propose a two-stage extractive strategy. In the first stage, sentences are classified as biographical or non-biographical. In the second stage, we classify the biographical sentences into several life-event categories. Along with the biography generation task, we also propose a method to generate an infobox, a consistently-formatted table listing some important facts and events related to a person. We experimented with several machine learning models and achieved significantly high F-scores.
Outline: Section 2 describes the datasets that are used to train the models. Section 3 describes the components of our system in more detail. Section 4 describes our experiments and results. Section 5 draws the final conclusion of our work.
2. Datasets
The current work requires large textual biography datasets. In addition, in order to discriminate between biographical and non-biographical sentences, we leverage a non-biographical news dataset. The datasets used are described below.
TREC-RCV1 (Amini et al., 2009): This Reuters news corpus consists of 8.5 million news titles, links, and timestamps collected between January 2007 and August 2016. The dataset was used for training the 2-class classifier in the first step of the biography generation process (see Section 3.1). All the sentences in this dataset are labeled as non-biographical.
WikiBio (Lebret et al., 2016): This dataset consists of 730K biographical pages from English Wikipedia. For each article, the dataset contains only the first paragraph. This dataset was also used for training the 2-class classifier mentioned in Section 3.1. All its sentences are labeled as biographical.
BigWikiBio: We curated this dataset by crawling English Wikipedia articles. It consists of 6M Wikipedia biographies. This dataset was used to train the 6-class classifier (see Section 3.2).
3. System Description
The biography generation process involves multi-stage extractive subtasks. In this section, we describe these stages in detail. Along with the biographical information, a Wikipedia page also contains an infobox; for completeness, we also describe an automatic approach to generate infoboxes similar to those present on Wikipedia pages.
3.1. Identifying Biographical Sentences
A textual document that describes an event or news related to a person contains many more non-biographical sentences than biographical ones. In the first stage, sentences are categorized into these two categories.
3.1.1. Data Pre-processing
Given a text document, we partition it into a set of sentences. Next, we clean the extracted sentences by performing standard NLP pre-processing tasks such as special character removal and spell checking.
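The pre-processing step above can be sketched as follows. The splitting rule and the character whitelist are illustrative simplifications; a production system would use a trained sentence tokenizer and a proper spell checker.

```python
import re

def preprocess(document: str) -> list[str]:
    # Naive sentence splitter: break on ., ! or ? followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", document.strip())
    cleaned = []
    for s in sentences:
        # Strip special characters, keeping words and basic punctuation
        # (an illustrative whitelist, not the system's actual rule).
        s = re.sub(r"[^\w\s.,'()!?-]", "", s)
        s = re.sub(r"\s+", " ", s).strip()
        if s:
            cleaned.append(s)
    return cleaned
```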
3.1.2. Sentence representation and classification
Each sentence is converted into a fixed-length TF-IDF vector representation. We consider sentences from the TREC dataset as non-biographical and sentences from the WikiBio dataset as biographical. We experiment with several machine learning models, such as Logistic Regression, Decision Trees, and Naive Bayes, to perform binary classification. Since the Logistic Regression model performed best (evaluation scores are described in Section 4), we use its classification results in the subsequent stages.
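A minimal sketch of this stage using scikit-learn, with a toy corpus standing in for the TREC and WikiBio sentences (the real system trains on millions of labeled sentences):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy stand-ins for WikiBio (biographical, label 1) and
# TREC (non-biographical, label 0) sentences.
bio = [
    "He was born in Allahabad in 1942.",
    "She graduated from Delhi University in 1988.",
    "He won the National Film Award in 1990.",
]
non_bio = [
    "Markets fell sharply on Monday.",
    "The committee will meet again next week.",
    "Heavy rain is expected across the region.",
]
sentences = bio + non_bio
labels = [1] * len(bio) + [0] * len(non_bio)

# Fixed-length TF-IDF representation feeding a Logistic Regression classifier.
clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
clf.fit(sentences, labels)
```

At inference time, `clf.predict([sentence])` assigns each new sentence to one of the two classes.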
3.1.3. Filtering False Positives
Our classifier produced some false positives: sentences that are non-biographical but were classified as biographical. We further filter out these cases by employing a standard Named Entity Recognition technique. We retain only those sentences that contain at least one of three named entity types: (i) PERSON, (ii) PLACE, or (iii) ORGANIZATION. In the next stage, we classify the biographical sentences into several life-event categories.
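The filtering rule can be sketched as below. The `toy_ner` tagger is a hypothetical gazetteer stand-in used only for illustration; the system itself relies on a trained NER model.

```python
RELEVANT_TYPES = {"PERSON", "PLACE", "ORGANIZATION"}

def toy_ner(sentence: str) -> set:
    # Hypothetical gazetteer-based tagger for illustration only.
    gazetteer = {
        "Bachchan": "PERSON",
        "Allahabad": "PLACE",
        "Delhi University": "ORGANIZATION",
    }
    return {label for name, label in gazetteer.items() if name in sentence}

def filter_false_positives(sentences, ner=toy_ner):
    # Keep only sentences mentioning at least one relevant entity type.
    return [s for s in sentences if ner(s) & RELEVANT_TYPES]
```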
3.2. Classifying Biographical Sentences
The categorized biographical sentences are further classified into six life-event classes, namely Education, Career, Life, Awards, Special Notes, and Death. We leverage the section information in the BigWikiBio dataset (described in Section 2) to construct a mapping between sentences and life-events. We label each sentence on a Wikipedia page with its corresponding section heading and then map it to a broad life-event class. For example, sentences under section headings such as College, High School, Early life and education, and Education are labeled as the Education class; Politics, Music career, Career, Works, Publications, and Research as the Career class; Honors, Awards, Recognition, Championships, Achievements, and Accomplishments as the Awards class; Notes, Legacy, Personal, Gallery, Influences, Other, and Controversies as the Special Notes class; and Death, Death and Legacy, and Later life, and Death as the Death class.
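The heading-to-class mapping can be sketched as a simple lookup table. The default bucket for unmapped headings is an assumption, since the text does not state how the Life class is populated.

```python
# Condensed lookup table for the heading-to-class mapping described above.
HEADING_TO_CLASS = {
    "Education": {"College", "High School", "Early life and education",
                  "Education"},
    "Career": {"Politics", "Music career", "Career", "Works",
               "Publications", "Research"},
    "Awards": {"Honors", "Awards", "Recognition", "Championships",
               "Achievements", "Accomplishments"},
    "Special Notes": {"Notes", "Legacy", "Personal", "Gallery",
                      "Influences", "Other", "Controversies"},
    "Death": {"Death", "Death and Legacy", "Later life, and Death"},
}

def label_for_heading(heading: str) -> str:
    for cls, headings in HEADING_TO_CLASS.items():
        if heading in headings:
            return cls
    return "Life"  # assumed default bucket for unmapped headings
```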
Next, we leverage a Logistic Regression model to perform this multi-class classification task, using the same fixed-length TF-IDF vector representation described in Section 3.1. The classification results in clusters of similar sentences, each representing a single life-event of a person.
3.3. Ranking Biographical Sentences
A single life-event cluster might contain hundreds of biographical sentences. We therefore rank sentences by importance using a graph-based ranking algorithm, TextRank (Mihalcea and Tarau, 2004); we use the Gensim implementation (Řehůřek and Sojka, 2010). For a given person, we apply TextRank to each of the six obtained clusters. The ranking imparts flexibility in experimenting with multiple length values for the generated biography.
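A self-contained sketch of TextRank's sentence ranking. The system uses the Gensim implementation; this minimal version combines the word-overlap sentence similarity of Mihalcea and Tarau (2004) with plain power iteration, and is illustrative only.

```python
import math

def overlap_similarity(a: str, b: str) -> float:
    # Word-overlap similarity, normalised by log sentence lengths
    # (the +1 avoids a zero denominator for one-word sentences).
    wa, wb = set(a.lower().split()), set(b.lower().split())
    denom = math.log(len(wa) + 1) + math.log(len(wb) + 1)
    return len(wa & wb) / denom if denom else 0.0

def textrank(sentences, damping=0.85, iterations=50):
    n = len(sentences)
    # Edge weights between every pair of distinct sentences.
    w = [[overlap_similarity(a, b) if i != j else 0.0
          for j, b in enumerate(sentences)]
         for i, a in enumerate(sentences)]
    scores = [1.0] * n
    for _ in range(iterations):  # plain power iteration
        scores = [(1 - damping) + damping * sum(
                      w[j][i] / (sum(w[j]) or 1.0) * scores[j]
                      for j in range(n) if j != i)
                  for i in range(n)]
    # Return sentences ordered by score, most central first.
    return [s for _, s in sorted(zip(scores, sentences), reverse=True)]
```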
3.4. Generating Infobox
The infobox is a well-formatted table that gives a short and concise description of important facts related to the person. We use the following set of fields in our proposed infobox for a queried person.
Name: Name of the queried person.
Date of Birth & Date of Death: We use regular expressions to extract dates, relying on context phrases such as ‘born on’, ‘birth’, etc.
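A sketch of the regular-expression extraction; the two date patterns below are illustrative, not the system's actual rule set.

```python
import re

# Match "born" optionally followed by "on", then a date in either
# "October 11, 1942" or "11 October 1942" form (illustrative patterns).
BORN_DATE = re.compile(
    r"born\s+(?:on\s+)?"
    r"([A-Z][a-z]+ \d{1,2}, \d{4}"   # e.g. "October 11, 1942"
    r"|\d{1,2} [A-Z][a-z]+ \d{4})"   # e.g. "11 October 1942"
)

def extract_birth_date(sentence: str):
    match = BORN_DATE.search(sentence)
    return match.group(1) if match else None
```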
Place of Birth: We use a similar methodology as above, leveraging part-of-speech (POS) tagging and Named Entity Recognition (we use the Stanford NER library (Bird et al., 2009)) to identify the place of birth.
Awards: We extract award information by leveraging the list of all awards available on the Wikipedia List of Awards page (https://en.wikipedia.org/wiki/List_of_awards). We then use standard string matching to identify award names in the biographical sentences.
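The string-matching step might look like the following, with a small hypothetical award list standing in for the full Wikipedia list:

```python
# Hypothetical award list; in the system this is scraped from the
# Wikipedia "List of awards" page.
AWARDS = ["Padma Shri", "Padma Bhushan", "Legion of Honour",
          "National Film Award"]

def find_awards(sentences):
    # Simple substring matching, preserving first-mention order.
    found = []
    for s in sentences:
        for award in AWARDS:
            if award in s and award not in found:
                found.append(award)
    return found
```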
Education & Career: We leverage education information (degrees, courses, etc.) present on official government sites such as data.gov.in and usa.gov. Similarly, career-related information is obtained using the Wikipedia Lists of Occupations page (https://en.wikipedia.org/wiki/Lists_of_occupations).
As an additional feature, we also present a profile image of the person, obtained by performing an image-based Google search. This enriches the textual biography with a visual aspect similar to a Wikipedia profile.
4. Experiments and Results
In this section, we present evaluation results of two sub-tasks: (i) Biography Generation and (ii) Infobox Generation.
4.1. Tasks and Evaluation Measures
The evaluation metrics for the above tasks are as follows:
4.1.1. Biography Generation
We compare biographies generated by BioGen against the corresponding Wikipedia pages. Note that the biography generation task is similar to document summarization; we therefore use the ROUGE score to evaluate our generated biographies. In the current paper, we use three variations of ROUGE: ROUGE-1, ROUGE-2, and ROUGE-L.
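For reference, ROUGE-1 reduces to unigram overlap with clipped counts; a minimal sketch:

```python
from collections import Counter

def rouge_1(candidate: str, reference: str):
    # ROUGE-1: unigram overlap with counts clipped by the reference.
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum(min(cand[w], ref[w]) for w in cand)
    precision = overlap / sum(cand.values()) if cand else 0.0
    recall = overlap / sum(ref.values()) if ref else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```

ROUGE-2 and ROUGE-L generalize the same idea to bigrams and the longest common subsequence, respectively.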
4.1.2. Infobox Generation
To evaluate the quality of the infobox generated by BioGen, we define a score for each field in the infobox. Let B and W be the infobox generated by BioGen and the infobox present on the Wikipedia page, respectively. For a field f, let B_f be the set of characteristics recovered by BioGen and W_f the corresponding field set in the Wikipedia infobox. The score for field f is then defined as score(f) = |B_f ∩ W_f| / |W_f|, and the total score for the generated infobox is given by the average of score(f) over the fields present. As the infobox is generated with only a specific set of fields (Name, Date of Birth, Place of Birth, Date of Death, Awards, Education, and Career), the score is calculated only for those fields.
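A worked example of this scoring scheme. Note the per-field denominator (the Wikipedia field set) is an assumed reading, since the original formula was lost in extraction.

```python
def field_score(recovered: set, wikipedia: set) -> float:
    # Fraction of the Wikipedia field values that BioGen recovered
    # (assumed denominator; see the note above).
    return len(recovered & wikipedia) / len(wikipedia) if wikipedia else 0.0

def infobox_score(generated: dict, wikipedia: dict) -> float:
    # Average the per-field scores over the fields present in both infoboxes.
    fields = [f for f in wikipedia if f in generated]
    if not fields:
        return 0.0
    return sum(field_score(generated[f], wikipedia[f])
               for f in fields) / len(fields)
```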
4.2. Results
4.2.1. Biography Generation Accuracy
Extracting information from an arbitrary webpage is a challenging task. We therefore leverage three web resources to construct a source document set for a given person query: Ducksters (https://www.ducksters.com/), IMDB (https://www.imdb.com/), and Zoomboola (https://zoomboola.com/). Ducksters is an educational site covering subjects such as history, science, geography, math, and biographies. IMDB is a popular and authoritative source for movie, TV, and celebrity content. Zoomboola is a news website. We experiment with 150 randomly selected biographies belonging to various domains such as Academics, Politics, Literature, Sports, and the Film Industry.
Table 1 shows the ROUGE scores as the number of sources used to generate the biography increases. We observe that the ROUGE score increases with the number of sources; that is, the similarity between the biography generated by BioGen and the corresponding Wikipedia page increases as more sources are used. This is consistent with the fact that a Wikipedia page is composed from multiple references.
Table 1. ROUGE scores for biographies generated using one, two, and three sources.
Table 2 shows the ROUGE scores when the summarization step of the biography generation process in BioGen is skipped. Recall is better in this case because the summarization step filters some text out of the biography, which reduces the overlap with the Wikipedia page.
Table 3 shows, as an example, the biography of Amitabh Bachchan (a Bollywood film actor) generated using BioGen. Comparing the generated biography with Amitabh Bachchan’s Wikipedia page (https://en.wikipedia.org/wiki/Amitabh_Bachchan), the sentences highlighted in italics are closely related to the class into which they have been placed, which shows that BioGen does a fairly good job not only of generating relevant sentences but also of placing them in the proper sections. Also, the row corresponding to the field Death is empty: BioGen did not extract any information related to Amitabh Bachchan’s death, which is correct, as his Wikipedia page contains no such information. It is also important to note that BioGen did not add any arbitrary information to that field.
Career: Bachchan’s career moved into fifth gear after Ramesh Sippy’s Sholay (1975). The movies he made with Manmohan Desai (Amar Akbar Anthony, Naseeb, Mard) were immensely successful, but towards the latter half of the 1980s his career went into a downspin. However, the importance of being Amitabh Bachchan is not limited to his career, although he reinvented himself and experimented with his roles and acted in many successful films.
Life: Amitabh Bachchan was born on October 11, 1942 in Allahabad. He is the son of the late poet Harivansh Rai Bachchan and Teji Bachchan. Son of well known poet Harivansh Rai Bachchan and Teji Bachchan. He has a brother named Ajitabh. He got his break in Bollywood after a letter of introduction from the then Prime Minister Mrs. Indira Gandhi, as he was a friend of her son, Rajiv Gandhi. He married Jaya Bhaduri, an accomplished actress in her own right, and they had two children, Shweta and Abhishek. His son, Abhishek, is also an actor in his own right. On November 16, 2011, he became a Dada (paternal grandfather) when Aishwarya gave birth to a daughter in a Mumbai hospital. He is already a Nana (maternal grandfather) to Navya Naveli and Agastye, Shweta’s children. After completing his education at Sherwood College, Nainital, and Kirori Mal College, Delhi University, he moved to Calcutta to work for the shipping firm Shaw and Wallace.
Awards: In 1984, he was honored by the Indian government with the Padma Shri Award for his outstanding contribution to the Hindi film industry. France’s highest civilian honour, the Knight of the Legion of Honour, was conferred upon him by the French Government in 2007, for his “exceptional career in the world of cinema and beyond”.
Rejected: Amitabh was in Goa during the last weekend to be one of the speakers at the THINK festival, where he was honoured as he is in any and every place he makes his presence. At the very outset, Bachchan was humble enough to let all those in the audience know that he held De Niro as one of his major sources of inspiration, and was once forced to clear immigration in his hotel in Cairo because his Egyptian fans became overly enthusiastic at the airport.
We also experimented with one more parameter: the length of the generated biography. Figure 1 shows how the ROUGE score changes with the length of the generated biographies. Again, as the length increases, recall increases but precision decreases. This is expected: as we add more content to the biography, we cover more of the information present on Wikipedia, increasing recall; however, as the length grows, the amount of new information gained keeps shrinking, which appears as a decrease in precision.
Table 3 shows a sample biography generated for the famous Bollywood actor Amitabh Bachchan. It can be seen that our model does fairly well at extracting important sentences and classifying them into the six life-event classes. Also, as the actor is still alive, there should not be any death-related event, which our model handles correctly.
4.2.2. Infobox Generation Accuracy
Table 4 shows an example of an infobox (for Amitabh Bachchan) generated using BioGen. This infobox recovers almost all the information present in the original Wikipedia page (https://en.wikipedia.org/wiki/Amitabh_Bachchan), and achieves a score of .
Career: Actor, Artist, Assistant, Producer
Awards: Padma Vibhushan, Padma Bhushan, Padma Shri, Government of India
5. Conclusion and Future Work
In this work, we proposed a system that generates a biography of a person, given the name and reference documents as input. In the future, we aim to build a system that takes only the name as input and generates a biography by extracting information from the web. We would also like to enhance our system by incorporating coreference resolution, so that it identifies only the sentences related to the entity of interest. Currently, the system performs well at extractive summarization, but we believe that adding an abstractive component would help create better biographies. Another enhancement worth exploring is neural network based models for the classification and sentence generation tasks.
- Amini et al. (2009) Massih R. Amini, Nicolas Usunier, and Cyril Goutte. 2009. Learning from Multiple Partially Observed Views - an Application to Multilingual Text Categorization. In Proceedings of the 22nd International Conference on Neural Information Processing Systems (NIPS’09). Curran Associates Inc., USA, 28–36. http://dl.acm.org/citation.cfm?id=2984093.2984097
- Barzilay et al. (2001) Regina Barzilay, Noemie Elhadad, and Kathleen R. McKeown. 2001. Sentence Ordering in Multidocument Summarization. In Proceedings of the First International Conference on Human Language Technology Research (HLT ’01). Association for Computational Linguistics, Stroudsburg, PA, USA, 1–7. DOI:http://dx.doi.org/10.3115/1072133.1072217
- Biadsy et al. (2008) Fadi Biadsy, Julia Hirschberg, and Elena Filatova. 2008. An unsupervised approach to biography production using wikipedia. Proceedings of ACL-08: HLT (2008), 807–815.
- Bird et al. (2009) Steven Bird, Ewan Klein, and Edward Loper. 2009. Natural Language Processing with Python (1st ed.). O’Reilly Media, Inc.
- Filatova and Prager (2005) Elena Filatova and John Prager. 2005. Tell Me What You Do and I’Ll Tell You What You Are: Learning Occupation-related Activities for Biographies. In Proceedings of the Conference on Human Language Technology and Empirical Methods in Natural Language Processing (HLT ’05). Association for Computational Linguistics, Stroudsburg, PA, USA, 113–120. DOI:http://dx.doi.org/10.3115/1220575.1220590
- Lebret et al. (2016) Rémi Lebret, David Grangier, and Michael Auli. 2016. Neural text generation from structured data with application to the biography domain. arXiv preprint arXiv:1603.07771 (2016).
- Liu et al. (2018) Peter J Liu, Mohammad Saleh, Etienne Pot, Ben Goodrich, Ryan Sepassi, Lukasz Kaiser, and Noam Shazeer. 2018. Generating wikipedia by summarizing long sequences. arXiv preprint arXiv:1801.10198 (2018).
- Mihalcea and Tarau (2004) Rada Mihalcea and Paul Tarau. 2004. Textrank: Bringing order into text. In Proceedings of the 2004 conference on empirical methods in natural language processing.
- Řehůřek and Sojka (2010) Radim Řehůřek and Petr Sojka. 2010. Software Framework for Topic Modelling with Large Corpora. In Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks. ELRA, Valletta, Malta, 45–50. http://is.muni.cz/publication/884893/en.
- Zhou et al. (2005) Liang Zhou, Miruna Ticrea, and Eduard Hovy. 2005. Multi-document biography summarization. arXiv preprint cs/0501078 (2005).