GLS-CSC: A Simple but Effective Strategy to Mitigate Chinese STM Models' Over-Reliance on Superficial Clue

09/08/2023
by Yanrui Du, et al.

Pre-trained models have achieved success in Chinese Short Text Matching (STM) tasks, but they often rely on superficial clues, which makes their predictions less robust. To address this issue, it is crucial to analyze and mitigate the influence of superficial clues on STM models. Our study investigates their over-reliance on the edit distance feature, which is commonly used to measure the semantic similarity of Chinese text pairs and can be considered a superficial clue. To mitigate this over-reliance, we propose a novel resampling training strategy called Gradually Learn Samples Containing Superficial Clue (GLS-CSC). Through comprehensive evaluations on In-Domain (I.D.), Robustness (Rob.), and Out-Of-Domain (O.O.D.) test sets, we demonstrate that GLS-CSC outperforms existing methods in enhancing the robustness and generalization of Chinese STM models. Moreover, we conduct a detailed analysis of existing methods and reveal their commonality.
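
To make the edit distance feature concrete, the following is a minimal sketch of character-level Levenshtein distance in Python. It is an illustration only, not the authors' implementation: the function name `edit_distance` and the example sentence pair are assumptions chosen for this sketch and do not come from the paper's data.

```python
def edit_distance(a: str, b: str) -> int:
    """Character-level Levenshtein distance between two strings.

    Uses the standard dynamic-programming recurrence, keeping only
    the previous row of the table to save memory.
    """
    # prev[j] is the distance between the empty prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        # curr[0] is the distance between a[:i] and the empty string.
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute (or match)
            ))
        prev = curr
    return prev[-1]


# Hypothetical pair (not from the paper): a single inserted character
# negates the sentence, yet the edit distance stays small.
pair = ("我喜欢你", "我不喜欢你")
distance = edit_distance(*pair)
```

A pair like the one above shows why the feature can act as a superficial clue: a small edit distance suggests a match, although the two sentences have opposite meanings.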

