Information retrieval for label noise document ranking by bag sampling and group-wise loss

03/12/2022
by Chunyu Li, et al.

Long document retrieval (DR) has always been a major challenge for reading comprehension and information retrieval. In recent years, pre-trained models have achieved good results in both the retrieval and ranking stages for long documents. However, crucial problems remain in long document ranking, such as label noise in the data, long document representation, and unbalanced negative sampling. To reduce the noise in labeled data and to sample negatives for long documents in a reasonable way, we propose a bag sampling method and a group-wise Localized Contrastive Estimation (LCE) method. We encode each long document using its head, middle, and tail passages, and at the retrieval stage we use dense retrieval to generate the candidate data. At the ranking stage, the retrieved data is divided into multiple bags, and negative samples are selected within each bag. After sampling, two losses are combined. The first loss is LCE. To fit bag sampling well, after the query and document are encoded, the global features of each group are extracted by a convolutional layer and max-pooling to improve the model's robustness to labeling noise; finally, the group-wise LCE loss is computed. Notably, our model shows excellent performance on the MS MARCO long document ranking leaderboard.
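To make the two core ideas concrete, here is a minimal NumPy sketch of bag sampling and an LCE-style loss. The function names, bag size, and scores are illustrative assumptions, not the paper's actual implementation: the real model scores query-document pairs with a neural encoder, whereas this sketch only shows how candidates are split into bags, one negative is drawn per bag, and a contrastive loss is computed over the positive and its sampled negatives.

```python
import numpy as np

def lce_loss(scores, pos_idx=0):
    """Localized Contrastive Estimation-style loss: cross-entropy of the
    positive document against its sampled negatives (softmax over scores)."""
    scores = np.asarray(scores, dtype=float)
    shifted = scores - scores.max()                       # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum())   # log-softmax
    return -log_probs[pos_idx]

def bag_sample(candidates, bag_size, rng):
    """Split a ranked candidate list into consecutive bags and draw one
    negative per bag, so negatives span all difficulty levels rather than
    clustering at the top of the ranking."""
    bags = [candidates[i:i + bag_size]
            for i in range(0, len(candidates), bag_size)]
    return [bag[rng.integers(len(bag))] for bag in bags if bag]

rng = np.random.default_rng(0)
# Hypothetical ranked candidate doc ids 1..12, split into bags of 4.
negatives = bag_sample(list(range(1, 13)), bag_size=4, rng=rng)
# Index 0 holds the positive's score; the rest are the sampled negatives'.
scores = np.array([2.0] + [0.5] * len(negatives))
loss = lce_loss(scores, pos_idx=0)
```

In the paper's group-wise variant, the per-document scores would themselves come from convolution and max-pooling over each group's encoded features before the LCE loss is applied.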

