CSDR-BERT: a pre-trained scientific dataset match model for Chinese Scientific Dataset Retrieval

01/30/2023
by Xintao Chu et al.

As the number of openly shared scientific datasets on the Internet grows under the open science movement, retrieving these datasets efficiently has become an important task in information retrieval (IR) research. In recent years, large pre-trained models, and in particular the paradigm of pre-training on large corpora and then fine-tuning on downstream tasks, have provided new solutions for IR matching tasks. In this study, we use the original BERT tokenization in the embedding layer, improve on the Sentence-BERT architecture in the model layer by introducing SimCSE and a K-Nearest Neighbors method, and optimize the target output with the CoSENT loss function in the optimization phase. Comparative experiments and ablation studies show that our model outperforms competing models on both public and self-built datasets. This study explores and validates the feasibility and efficiency of pre-training techniques for semantic retrieval of Chinese scientific datasets.
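The CoSENT loss mentioned in the abstract is a ranking-style objective over cosine similarities: for every pair of sentence pairs where one is labeled more similar than the other, it penalizes the model when the less-similar pair scores higher. The sketch below is a minimal, illustrative implementation in pure Python; the function names, the `scale` parameter, and the toy embeddings are assumptions for illustration, not the paper's actual code.

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def cosent_loss(pairs, labels, scale=20.0):
    """CoSENT-style loss (illustrative):
    log(1 + sum over (i, j) with label_i > label_j of
        exp(scale * (cos_j - cos_i)))
    pairs:  list of (u, v) sentence-embedding pairs
    labels: 1 for a similar pair, 0 for a dissimilar pair
    """
    cos = [cosine(u, v) for u, v in pairs]
    terms = []
    for i, li in enumerate(labels):
        for j, lj in enumerate(labels):
            if li > lj:  # pair i should be scored above pair j
                terms.append(math.exp(scale * (cos[j] - cos[i])))
    return math.log1p(sum(terms))

# Toy check: a well-ordered model (similar pair has cosine 1.0,
# dissimilar pair has cosine 0.0) yields a near-zero loss.
pairs = [([1.0, 0.0], [1.0, 0.0]),   # labeled similar
         ([1.0, 0.0], [0.0, 1.0])]   # labeled dissimilar
good = cosent_loss(pairs, [1, 0])    # small
bad = cosent_loss(pairs, [0, 1])     # large (ordering violated)
```

In practice the embeddings would come from the fine-tuned Sentence-BERT encoder rather than hand-written vectors, and the loss would be minimized with gradient descent.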

