With the ever-growing volume of scientific literature, scholarly search and recommendation engines are gradually adopting artificial intelligence frameworks for better document retrieval. A well-known annotation task is to identify topical keyphrases that facilitate topical search. The identified keyphrases can further be classified into several semantic categories to facilitate knowledge graph construction. However, given the large volume of research, current manual annotation schemes are financially infeasible because they require continuous human effort and domain expertise.
In this paper, we introduce SEAL, which aims to automate keyphrase extraction and the subsequent classification of keyphrases into three semantic categories: (i) tasks, (ii) processes, or (iii) materials. Tasks represent research problems such as extraction, processing, and parsing. Processes represent solutions to problems, including physical equipment, algorithms, methods/techniques, and tools. Materials include physical materials, such as chemical compounds, and datasets. We showcase that SEAL outperforms several state-of-the-art tools on a recently published dataset of 500 scientific publications in the fields of Computer Science, Material Sciences, and Physics (Augenstein et al., 2017).
2. SEAL Architecture
SEAL comprises two distinct neural modules, one for keyphrase extraction and the other for keyphrase classification. We use the standard ‘Beginning, Inside and Last tokens of multi-token chunks, Unit-length chunks and Outside’ (BILOU) labeling scheme (Ammar et al., 2017) in both modules. Figure 1(a) presents the SEAL architecture.
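To make the BILOU scheme concrete, the following minimal sketch converts keyphrase spans into per-token labels. The function name `bilou_tags`, the inclusive `(start, end)` span format, and the example sentence are illustrative assumptions, not part of the SEAL code base.

```python
from typing import List, Tuple


def bilou_tags(num_tokens: int, spans: List[Tuple[int, int]]) -> List[str]:
    """Label tokens with BILOU tags.

    `spans` holds (start, end) token indices of keyphrases, end inclusive.
    Spans are assumed not to overlap (an assumption of this sketch).
    """
    tags = ["O"] * num_tokens          # Outside by default
    for start, end in spans:
        if start == end:
            tags[start] = "U"          # Unit-length chunk
        else:
            tags[start] = "B"          # Beginning of a multi-token chunk
            for i in range(start + 1, end):
                tags[i] = "I"          # Inside
            tags[end] = "L"            # Last token of the chunk
    return tags


tokens = ["We", "study", "keyphrase", "extraction", "with", "BiLSTM", "."]
tags = bilou_tags(len(tokens), [(2, 3), (5, 5)])
print(list(zip(tokens, tags)))
```

Here "keyphrase extraction" receives the labels B and L, while the single-token keyphrase "BiLSTM" receives U.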
2.1. Keyphrase Extraction Module
This module leverages pre-trained token-level 9216-dimensional SciBERT embeddings (Beltagy et al., 2019) to train three layers of Bidirectional Long Short-Term Memory (BiLSTM) cells (96, 48, and 24 hidden units, respectively) stacked on top of each other. The output, a 24-dimensional vector, is then reduced to a five-dimensional vector by a linear layer and fed to a Conditional Random Field (CRF) layer, which predicts the label of the token. The results are further refined through a post-processing step that handles single-token keyphrases. SciBERT yields significantly better scores than several other embeddings, such as Levy and Goldberg dependency-based embeddings (Levy and Goldberg, 2014) and GloVe (Pennington et al., 2014).
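The decoding step of a CRF layer can be illustrated with a small Viterbi sketch over the five BILOU labels. This is a simplified stand-in, not the SEAL implementation: the per-token scores are made-up numbers, and the transition rules are hard-coded BILOU constraints (an assumption), whereas a trained CRF learns transition scores from data.

```python
from typing import Dict, List

LABELS = ["B", "I", "L", "O", "U"]

# Assumption: transitions that violate BILOU are forbidden outright.
ALLOWED = {
    "B": {"I", "L"},
    "I": {"I", "L"},
    "L": {"B", "O", "U"},
    "O": {"B", "O", "U"},
    "U": {"B", "O", "U"},
}
START = ("B", "O", "U")   # labels that may open a sequence
END = ("L", "O", "U")     # labels that may close a sequence
NEG_INF = float("-inf")


def viterbi(emissions: List[Dict[str, float]]) -> List[str]:
    """Return the highest-scoring valid BILOU sequence."""
    if not emissions:
        return []
    # score[l]: best score of any valid path ending in label l so far
    score = {l: (emissions[0][l] if l in START else NEG_INF) for l in LABELS}
    backpointers = []
    for em in emissions[1:]:
        new_score, ptr = {}, {}
        for cur in LABELS:
            prevs = [p for p in LABELS if cur in ALLOWED[p]]
            best_prev = max(prevs, key=lambda p: score[p])
            new_score[cur] = score[best_prev] + em[cur]
            ptr[cur] = best_prev
        score = new_score
        backpointers.append(ptr)
    # Follow back-pointers from the best valid final label
    path = [max(END, key=lambda l: score[l])]
    for ptr in reversed(backpointers):
        path.append(ptr[path[-1]])
    return path[::-1]


# Hypothetical five-dimensional scores for three tokens (order: B, I, L, O, U)
rows = [[1.0, 0.0, 0.0, 1.2, 0.5],
        [0.0, 2.0, 0.1, 0.3, 0.0],
        [0.0, 0.2, 2.0, 0.5, 0.1]]
emissions = [dict(zip(LABELS, row)) for row in rows]
best_path = viterbi(emissions)
print(best_path)
```

Picking the top label for each token independently would give O, I, L, which is not a valid BILOU sequence; decoding over the whole sequence returns B, I, L instead.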
2.2. Keyphrase Classification Module
This module uses pre-trained token-level Levy and Goldberg dependency-based embeddings (Levy and Goldberg, 2014), which result in significantly better scores than several other embeddings. For each token, we also consider the immediate neighboring tokens as context. We pass the concatenation of the embeddings of the current token, the previous token, and the next token to a standard Random Forest (RF) classifier. Candidate tokens without a next or previous token are padded with the embedding corresponding to the <UNKNOWN> tag.
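The context-window features described above can be sketched as follows. The toy two-dimensional embeddings, the function name `context_features`, and the fallback to `<UNKNOWN>` for out-of-vocabulary tokens are assumptions made for illustration; the paper only states that missing neighbors are padded with the `<UNKNOWN>` embedding.

```python
from typing import Dict, List

Vector = List[float]


def context_features(tokens: List[str],
                     embeddings: Dict[str, Vector],
                     unk: str = "<UNKNOWN>") -> List[Vector]:
    """Concatenate [current, previous, next] embeddings for every token."""

    def lookup(i: int) -> Vector:
        # Positions outside the sequence are padded with the <UNKNOWN> vector.
        if 0 <= i < len(tokens):
            # Assumption: out-of-vocabulary tokens also map to <UNKNOWN>.
            return embeddings.get(tokens[i], embeddings[unk])
        return embeddings[unk]

    return [lookup(i) + lookup(i - 1) + lookup(i + 1)
            for i in range(len(tokens))]


# Toy 2-dimensional embeddings (hypothetical values)
toy = {
    "sodium": [1.0, 0.0],
    "chloride": [0.0, 1.0],
    "<UNKNOWN>": [0.5, 0.5],
}
feats = context_features(["sodium", "chloride"], toy)
print(feats[0])  # current=sodium, previous=padding, next=chloride
```

Each resulting vector is three times the embedding size and could then be passed to a Random Forest implementation such as scikit-learn's `RandomForestClassifier`; the paper does not state which implementation SEAL uses.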
We post-process abbreviations and chemical formulae separately because the classifier handles them poorly. For abbreviations, we match each abbreviation with its full form at the first occurrence and assign it the class of the full form. For chemical formulae (such as NaCl, Mg), we match the corresponding formula name token using a regular expression.
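A rough sketch of this kind of post-processing is given below. The paper does not specify its regular expression or matching heuristic, so everything here is an assumption: the `full form (ABBR)` pattern, the initials check, and the small illustrative subset of element symbols.

```python
import re
from typing import Dict

# Illustrative subset of element symbols (assumption; not exhaustive).
ELEMENTS = {"H", "C", "N", "O", "Na", "Mg", "Al", "Si", "Cl", "K", "Ca", "Fe"}
FORMULA_RE = re.compile(r"(?:[A-Z][a-z]?\d*)+")
SYMBOL_RE = re.compile(r"([A-Z][a-z]?)(\d*)")


def is_chemical_formula(token: str) -> bool:
    """True if the token is a sequence of known element symbols and counts."""
    if not FORMULA_RE.fullmatch(token):
        return False
    return all(symbol in ELEMENTS for symbol, _ in SYMBOL_RE.findall(token))


def abbreviation_classes(text: str,
                         phrase_classes: Dict[str, str]) -> Dict[str, str]:
    """Map each abbreviation to the class of its full form.

    The full form is taken to be the words right before the first
    "(ABBR)" whose initials spell the abbreviation.
    """
    result: Dict[str, str] = {}
    for match in re.finditer(r"\(([A-Z]{2,})\)", text):
        abbr = match.group(1)
        if abbr in result:
            continue  # only the first occurrence defines the abbreviation
        words = text[:match.start()].split()
        candidate = words[-len(abbr):]
        initials = "".join(word[0].upper() for word in candidate)
        full_form = " ".join(candidate)
        if initials == abbr and full_form in phrase_classes:
            result[abbr] = phrase_classes[full_form]
    return result


text = "We train a conditional random field (CRF) on NaCl data. The CRF is fast."
classes = abbreviation_classes(text, {"conditional random field": "Process"})
print(classes)
print([t for t in ["NaCl", "Mg", "CRF", "dataset"] if is_chemical_formula(t)])
```

Checking symbols against an element list keeps all-caps abbreviations such as CRF from being mistaken for formulae, which a purely shape-based pattern would allow.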
3. Experimental Results
We experiment on the ScienceIE dataset (Augenstein et al., 2017; https://scienceie.github.io/), which contains 500 scientific abstracts curated from Science Direct open access publications. Each abstract is manually labeled with keyphrase boundaries and their respective classes. For experimentation, we follow the guidelines specified in the ScienceIE competition verbatim. The dataset is partitioned into train, development, and test sets containing 350, 50, and 100 abstracts, respectively. We next showcase that SEAL outperforms the top-ranked implementations on the ScienceIE leaderboard on the standard F1-score metric.
Keyphrase Extraction: As described in the previous section, we experiment with several embedding schemes. Table 1 shows that SciBERT outperforms the other embedding schemes on extraction. Table 1 also compares the F1-scores of SEAL against the top-ranked systems on the official ScienceIE leaderboard (Augenstein et al., 2017). Furthermore, unlike TIAL_UW (rank 1 on the leaderboard) and s2_end2end (Ammar et al., 2017) (rank 2 on the leaderboard), SEAL does not use external data sources.
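For reference, the F1-score over keyphrase spans can be computed as in the sketch below. It assumes exact-match scoring of `(start, end)` spans, which may differ in detail from the official ScienceIE scorer; the example spans are invented.

```python
from typing import Set, Tuple

Span = Tuple[int, int]


def precision_recall_f1(gold: Set[Span],
                        predicted: Set[Span]) -> Tuple[float, float, float]:
    """Exact-match precision, recall, and F1 over keyphrase spans."""
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Invented example: two of four predictions match two of three gold spans
gold = {(0, 1), (4, 4), (7, 9)}
predicted = {(0, 1), (4, 5), (7, 9), (11, 12)}
p, r, f1 = precision_recall_f1(gold, predicted)
print(f"precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
```

With precision 1/2 and recall 2/3, the harmonic mean is 4/7, or roughly 0.571.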
Table 1. Keyphrase extraction F1-scores.
|Embedding scheme / system|F1-score|
|GloVe word embeddings (Pennington et al., 2014)|0.440|
|Levy and Goldberg embeddings (Levy and Goldberg, 2014)|0.470|
|SciBERT embeddings (Beltagy et al., 2019)|0.564|
|s2_end2end (Ammar et al., 2017)|0.550|
4. System Description
The web application is developed using the Flask framework. The current implementation uses the PyTorch framework for the extraction and classification modules. The trained models and the demo are hosted on our research group server (http://lingo.iitgn.ac.in:5000/). Figure 1(b) presents a snapshot of the demo. On receiving a POST request, the framework first executes the extraction module, followed by the classification module, and displays the result. The code, processed dataset, and system implementation details are available at https://github.com/Sammed98/Keyphrase-Extraction-Demo.
5. Conclusion and Future Proposals
In this paper, we propose SEAL, a toolkit for scientific keyphrase extraction and classification. We showcase that SEAL performs comparably to state-of-the-art extraction systems that leverage large volumes of external knowledge. In the future, we plan to experiment with domain-specific embeddings and semi-supervised bootstrapping techniques.
- Ammar et al. (2017) Waleed Ammar, Matthew Peters, Chandra Bhagavatula, and Russell Power. 2017. The ai2 system at semeval-2017 task 10 (scienceie): semi-supervised end-to-end entity and relation extraction. In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017). 592–596.
- Augenstein et al. (2017) Isabelle Augenstein, Mrinal Das, Sebastian Riedel, Lakshmi Vikraman, and Andrew McCallum. 2017. Semeval 2017 task 10: Scienceie-extracting keyphrases and relations from scientific publications. arXiv preprint arXiv:1704.02853 (2017).
- Beltagy et al. (2019) Iz Beltagy, Arman Cohan, and Kyle Lo. 2019. SciBERT: Pretrained Contextualized Embeddings for Scientific Text. CoRR abs/1903.10676 (2019). arXiv:1903.10676 http://arxiv.org/abs/1903.10676
- Eger et al. (2017) Steffen Eger, Erik-Lân Do Dinh, Ilia Kuznetsov, Masoud Kiaeeha, and Iryna Gurevych. 2017. Eelection at semeval-2017 task 10: Ensemble of neural learners for keyphrase classification. arXiv preprint arXiv:1704.02215 (2017).
- Levy and Goldberg (2014) Omer Levy and Yoav Goldberg. 2014. Dependency-based word embeddings. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). 302–308.
- Liu et al. (2017) Sijia Liu, Feichen Shen, Vipin Chaudhary, and Hongfang Liu. 2017. Mayonlp at semeval 2017 task 10: Word embedding distance pattern for keyphrase classification in scientific publications. In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017). 956–960.
- Pennington et al. (2014) Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. Glove: Global Vectors for Word Representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics, Doha, Qatar, 1532–1543. https://doi.org/10.3115/v1/D14-1162