Text and Code Embeddings by Contrastive Pre-Training

01/24/2022
by   Arvind Neelakantan, et al.

Text embeddings are useful features in many applications, such as semantic search and computing text similarity. Previous work typically trains models customized for different use cases, varying in dataset choice, training objective, and model architecture. In this work, we show that contrastive pre-training on unsupervised data at scale leads to high-quality vector representations of text and code. The same unsupervised text embeddings that achieve new state-of-the-art results in linear-probe classification also display impressive semantic search capabilities and sometimes even perform competitively with fine-tuned models. On linear-probe classification accuracy averaged over 7 tasks, our best unsupervised model achieves a relative improvement of 4% and 1.8% over the previous best unsupervised and supervised text embedding models, respectively. The same text embeddings, when evaluated on large-scale semantic search, attain a relative improvement of 23.4%, 14.7%, and 10.6% over the previous best unsupervised methods on the MSMARCO, Natural Questions, and TriviaQA benchmarks, respectively. As with text embeddings, we train code embedding models on (text, code) pairs, obtaining a 20.8% relative improvement over the prior best work on code search.
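To make the idea of contrastive training on paired data concrete, here is a minimal sketch in plain Python. The abstract does not spell out the objective, so the following are assumptions for illustration only: a symmetric cross-entropy loss over cosine similarities with in-batch negatives (an InfoNCE-style formulation), a temperature of 0.07, and the function names `in_batch_contrastive_loss` and `cross_entropy_diag`. Row i of `xs` and row i of `ys` are treated as a positive pair, and every other row in the batch serves as a negative.

```python
import math
import random


def normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(a * a for a in v)) or 1.0
    return [a / norm for a in v]


def cross_entropy_diag(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for a numerically stable log-sum-exp
        log_z = m + math.log(sum(math.exp(v - m) for v in row))
        total += log_z - row[i]
    return total / len(logits)


def in_batch_contrastive_loss(xs, ys, temperature=0.07):
    """Symmetric contrastive loss: (xs[i], ys[i]) are positives, the rest negatives.

    The temperature value is an illustrative assumption, not taken from the source.
    """
    xs = [normalize(x) for x in xs]
    ys = [normalize(y) for y in ys]
    # logits[i][j] = cosine similarity of xs[i] and ys[j], sharpened by temperature
    logits = [[sum(a * b for a, b in zip(x, y)) / temperature for y in ys] for x in xs]
    transposed = [list(col) for col in zip(*logits)]
    # Average the x-to-y and y-to-x directions
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(transposed))


# Toy check with random vectors standing in for encoder outputs
random.seed(0)
anchors = [[random.gauss(0, 1) for _ in range(16)] for _ in range(8)]
near = [[a + 0.01 * random.gauss(0, 1) for a in row] for row in anchors]
unrelated = [[random.gauss(0, 1) for _ in range(16)] for _ in range(8)]

aligned_loss = in_batch_contrastive_loss(anchors, near)
random_loss = in_batch_contrastive_loss(anchors, unrelated)
print(f"aligned pairs: {aligned_loss:.4f}  unrelated pairs: {random_loss:.4f}")
```

The loss is low when each input is most similar to its own pair and high when the pairing carries no signal, which is the pressure that pulls paired text (or text and code) together in embedding space. In a real system the vectors would come from a trained encoder rather than a random generator.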


research
01/26/2022

CodeRetriever: Unimodal and Bimodal Contrastive Learning

In this paper, we propose the CodeRetriever model, which combines the un...
research
08/07/2023

Towards General Text Embeddings with Multi-stage Contrastive Learning

We present GTE, a general-purpose text embedding model trained with mult...
research
05/30/2021

CLEVE: Contrastive Pre-training for Event Extraction

Event extraction (EE) has considerably benefited from pre-trained langua...
research
09/19/2020

Prior Art Search and Reranking for Generated Patent Text

Generative models, such as GPT-2, have demonstrated impressive results r...
research
10/18/2022

Soft-Labeled Contrastive Pre-training for Function-level Code Representation

Code contrastive pre-training has recently achieved significant progress...
research
10/13/2022

MTEB: Massive Text Embedding Benchmark

Text embeddings are commonly evaluated on a small set of datasets from a...
research
09/10/2022

Code Compliance Assessment as a Learning Problem

Manual code reviews and static code analyzers are the traditional mechan...
