Towards General Text Embeddings with Multi-stage Contrastive Learning

08/07/2023
by Zehan Li, et al.

We present GTE, a general-purpose text embedding model trained with multi-stage contrastive learning. In line with recent advancements in unifying various NLP tasks into a single format, we train a unified text embedding model by employing contrastive learning over a diverse mixture of datasets from multiple sources. By significantly increasing the amount of training data during both the unsupervised pre-training and supervised fine-tuning stages, we achieve substantial performance gains over existing embedding models. Notably, even with a relatively modest parameter count of 110M, GTE_base outperforms the black-box embedding API provided by OpenAI and even surpasses 10x larger text embedding models on the Massive Text Embedding Benchmark (MTEB). Furthermore, without additional fine-tuning on each programming language individually, our model outperforms previous best code retrievers of similar size by treating code as text. In summary, our model achieves impressive results by effectively harnessing multi-stage contrastive learning, offering a powerful and efficient text embedding model with broad applicability across various NLP and code-related tasks.
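To illustrate the contrastive objective described in the abstract, the following is a minimal PyTorch sketch of an InfoNCE-style loss with in-batch negatives, a standard formulation for training text embedding models. The function name, temperature value, and embedding dimension are illustrative assumptions and do not reproduce the paper's exact training setup or data mixture.

import torch
import torch.nn.functional as F

def in_batch_contrastive_loss(query_emb, doc_emb, temperature=0.05):
    """InfoNCE-style loss with in-batch negatives.

    query_emb, doc_emb: (batch_size, dim) tensors. The i-th query is paired
    with the i-th document; all other documents in the batch act as negatives.
    Hyperparameters here are illustrative, not the paper's settings.
    """
    # Cosine similarity via dot products of L2-normalized embeddings.
    q = F.normalize(query_emb, dim=-1)
    d = F.normalize(doc_emb, dim=-1)
    logits = q @ d.t() / temperature              # (batch, batch) similarity matrix
    targets = torch.arange(q.size(0), device=q.device)  # positives lie on the diagonal
    return F.cross_entropy(logits, targets)

# Toy usage with random tensors standing in for encoder outputs.
queries = torch.randn(8, 768)
documents = torch.randn(8, 768)
loss = in_batch_contrastive_loss(queries, documents)

In practice, each row of query_emb and doc_emb would come from the same text encoder applied to a query and its positive passage (or a code snippet treated as plain text), with the off-diagonal entries of the similarity matrix serving as negatives.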


