Pretrained Domain-Specific Language Model for General Information Retrieval Tasks in the AEC Domain

03/09/2022
by Zhe Zheng, et al.

As an essential task for the architecture, engineering, and construction (AEC) industry, information retrieval (IR) from unstructured textual data based on natural language processing (NLP) is gaining increasing attention. Although various deep learning (DL) models for IR tasks have been investigated in the AEC domain, it remains unclear how domain corpora and domain-specific pretrained DL models improve performance across different IR tasks. To this end, this work systematically explores the impact of domain corpora and various transfer learning techniques on the performance of DL models for IR tasks and proposes a pretrained domain-specific language model for the AEC domain. First, both in-domain and close-domain corpora are developed. Then, two types of pretrained models, traditional word embedding models and BERT-based models, are pretrained with various domain corpora and transfer learning strategies. Finally, several widely used DL models for IR tasks are further trained and tested under various configurations and pretrained models. The results show that domain corpora have opposite effects on traditional word embedding models for the text classification and named entity recognition tasks, but further improve the performance of BERT-based models in all tasks. Moreover, BERT-based models dramatically outperform traditional methods in all IR tasks, with a maximum improvement of 5.4 points in F1 score. This research contributes to the body of knowledge in two ways: 1) demonstrating the advantages of domain corpora and pretrained DL models, and 2) releasing, to the best of our knowledge, the first open domain-specific dataset and pretrained language model for the AEC domain. Thus, this work sheds light on the adoption and application of pretrained models in the AEC domain.
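
To make the pretraining step concrete, the sketch below continues masked language model (MLM) pretraining of a general BERT checkpoint on an in-domain AEC corpus with the Hugging Face Transformers library. This is a minimal illustration, not the authors' actual training setup: the corpus file aec_corpus.txt, the bert-base-uncased starting checkpoint, and all hyperparameters are placeholder assumptions.

```python
# Minimal sketch: domain-adaptive (continued) pretraining of BERT with the
# masked language modeling objective on an AEC text corpus.
# File names and hyperparameters are illustrative placeholders.
from datasets import load_dataset
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

# aec_corpus.txt: one in-domain sentence or paragraph per line (hypothetical file)
corpus = load_dataset("text", data_files={"train": "aec_corpus.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=128)

tokenized = corpus["train"].map(tokenize, batched=True, remove_columns=["text"])

# Randomly mask 15% of tokens, the standard BERT MLM objective
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="bert-aec",
    per_device_train_batch_size=32,
    num_train_epochs=3,
    learning_rate=5e-5,
)

Trainer(model=model, args=args, train_dataset=tokenized,
        data_collator=collator).train()

# Save the domain-adapted checkpoint for downstream IR tasks
model.save_pretrained("bert-aec")
tokenizer.save_pretrained("bert-aec")
```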
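
A domain-adapted checkpoint like the one above would then be fine-tuned separately for each IR task. The sketch below shows this for text classification, again under stated assumptions: the bert-aec checkpoint, the CSV files, and the five-label setup are hypothetical, and named entity recognition would instead use AutoModelForTokenClassification with token-level labels.

```python
# Minimal sketch: fine-tuning the domain-pretrained checkpoint on a
# text classification IR task. Data files and label count are placeholders.
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("bert-aec")
model = AutoModelForSequenceClassification.from_pretrained("bert-aec", num_labels=5)

# CSV files with "text" and "label" columns (hypothetical task data)
data = load_dataset("csv", data_files={"train": "train.csv", "test": "test.csv"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True,
                     padding="max_length", max_length=128)

encoded = data.map(tokenize, batched=True)

args = TrainingArguments(output_dir="bert-aec-cls",
                         per_device_train_batch_size=16,
                         num_train_epochs=3)

trainer = Trainer(model=model, args=args,
                  train_dataset=encoded["train"],
                  eval_dataset=encoded["test"])
trainer.train()
print(trainer.evaluate())  # reports eval loss; pass compute_metrics for F1
```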


Related research

TourBERT: A pretrained language model for the tourism industry (01/19/2022)
The Bidirectional Encoder Representations from Transformers (BERT) is cu...

Efficient Extraction of Pathologies from C-Spine Radiology Reports using Multi-Task Learning (04/09/2022)
Pretrained Transformer based models finetuned on domain specific corpora...

Multi-Perspective Semantic Information Retrieval in the Biomedical Domain (07/17/2020)
Information Retrieval (IR) is the task of obtaining pieces of data (such...

Publicly Available Clinical BERT Embeddings (04/06/2019)
Contextual word embedding models such as ELMo (Peters et al., 2018) and ...

CFO: A Framework for Building Production NLP Systems (08/16/2019)
This paper introduces a novel orchestration framework, called CFO (COMPU...

Are ChatGPT and GPT-4 General-Purpose Solvers for Financial Text Analytics? An Examination on Several Typical Tasks (05/10/2023)
The most recent large language models such as ChatGPT and GPT-4 have gar...

Embedding Recycling for Language Models (07/11/2022)
Training and inference with large neural models is expensive. However, f...
