C-Pack: Packaged Resources To Advance General Chinese Embedding

09/14/2023
by Shitao Xiao, et al.

We introduce C-Pack, a package of resources that significantly advances the field of general Chinese embeddings. C-Pack includes three critical resources. 1) C-MTEB is a comprehensive benchmark for Chinese text embeddings covering 6 tasks and 35 datasets. 2) C-MTP is a massive text embedding dataset curated from labeled and unlabeled Chinese corpora for training embedding models. 3) C-TEM is a family of embedding models covering multiple sizes. Our models outperform all prior Chinese text embeddings on C-MTEB by up to +10% at the time of release. We also integrate and optimize the entire suite of training methods for C-TEM. Along with our resources on general Chinese embedding, we release our data and models for English text embeddings. The English models achieve state-of-the-art performance on the MTEB benchmark; meanwhile, our released English data is 2 times larger than the Chinese data. All these resources are made publicly available at https://github.com/FlagOpen/FlagEmbedding.
