Analyzing the impact of climate change on critical infrastructure from the scientific literature: A weakly supervised NLP approach

by Tanwi Mallick, et al.

Natural language processing (NLP) is a promising approach for analyzing large volumes of climate-change and infrastructure-related scientific literature. However, best-in-practice NLP techniques require a large collection of relevant documents (a corpus). Furthermore, NLP techniques based on machine learning and deep learning require labels that group the articles by user-defined criteria for a significant subset of the corpus in order to train the supervised model. Even labeling a few hundred documents with human subject-matter experts is a time-consuming process. To expedite this process, we developed a weak supervision-based NLP approach that leverages semantic similarity between categories and documents to (i) establish a topic-specific corpus by subsetting a large-scale open-access corpus and (ii) generate category labels for the topic-specific corpus. In comparison with a months-long process of subject-matter expert labeling, we assign category labels to the whole corpus using weak supervision and supervised learning in about 13 hours. The labeled climate and National Critical Function (NCF) corpus enables targeted, efficient identification of documents discussing a topic (or combination of topics) of interest and identification of various effects of climate change on critical infrastructure, improving the usability of scientific literature and ultimately supporting enhanced policy and decision making. To demonstrate this capability, we conduct topic modeling on pairs of climate hazards and NCFs to discover trending topics at the intersection of these categories. This method helps analysts and decision-makers quickly grasp the relevant topics and the most important documents linked to each topic.
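The idea of using category-to-document similarity both to subset a corpus and to assign labels can be illustrated with a minimal sketch. The abstract does not say how semantic similarity is computed, so the example below substitutes plain bag-of-words cosine similarity as a stand-in; the category names, category descriptions, sample documents, and the threshold value are all hypothetical and chosen only for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def weak_labels(documents, categories, threshold=0.2):
    """Assign each document its most similar category, or None.

    A None label means no category cleared the (assumed) threshold,
    so the document would be left out of the topic-specific corpus.
    """
    cat_vecs = {name: Counter(tokenize(desc)) for name, desc in categories.items()}
    labels = []
    for doc in documents:
        doc_vec = Counter(tokenize(doc))
        scores = {name: cosine(doc_vec, vec) for name, vec in cat_vecs.items()}
        best = max(scores, key=scores.get)
        labels.append(best if scores[best] >= threshold else None)
    return labels


# Hypothetical categories: one climate hazard and one infrastructure function.
categories = {
    "flood": "flood flooding inundation storm surge",
    "energy": "electricity power grid energy generation",
}
# Hypothetical document titles standing in for a large open-access corpus.
documents = [
    "Coastal flooding and storm surge risk under sea level rise",
    "Resilience of the power grid and electricity generation",
    "A survey of protein folding methods",
]

labels = weak_labels(documents, categories)
for doc, label in zip(documents, labels):
    print(label, "|", doc)
```

In this sketch, documents labeled `None` are the ones a subsetting step would discard, and the remaining weak labels could then serve as training data for a supervised classifier, which is the general pattern the abstract describes. A real system would likely replace the bag-of-words vectors with learned text embeddings.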




