Vec2Vec: A Compact Neural Network Approach for Transforming Text Embeddings with High Fidelity

06/22/2023
by Andrew Kean Gao, et al.

Vector embeddings have become ubiquitous tools for many language-related tasks. A leading embedding model is OpenAI's text-ada-002, which can embed approximately 6,000 words into a 1,536-dimensional vector. While powerful, text-ada-002 is not open source and is only available via API. We trained a simple neural network to convert open-source 768-dimensional MPNet embeddings into text-ada-002 embeddings. We compiled a subset of 50,000 online food reviews, calculated MPNet and text-ada-002 embeddings for each review, and trained a simple neural network for 75 epochs to predict the corresponding text-ada-002 embedding from a given MPNet embedding. Our model achieved an average cosine similarity of 0.932 on 10,000 unseen reviews in our held-out test dataset. We manually assessed the quality of our predicted embeddings for vector search over text-ada-002-embedded reviews. While not as good as real text-ada-002 embeddings, predicted embeddings were able to retrieve highly relevant reviews. Our final model, Vec2Vec, is lightweight (<80 MB) and fast. Future steps include training a neural network with a more sophisticated architecture and a larger dataset of paired embeddings to achieve greater performance. The ability to convert between and align embedding spaces may be helpful for interoperability, limiting dependence on proprietary models, protecting data privacy, reducing costs, and enabling offline operation.
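As a rough illustration of the approach described in the abstract, the sketch below maps 768-dimensional MPNet embeddings to 1,536-dimensional text-ada-002 embeddings with a small feed-forward network, trains for the stated 75 epochs, and scores predictions by average cosine similarity. The hidden-layer size, activation, optimizer, and loss function are assumptions for illustration only; the abstract does not specify the actual architecture or training objective.

```python
# Minimal sketch of an MPNet -> text-ada-002 embedding translator.
# Only the 768 -> 1536 dimensions, 75 epochs, and cosine-similarity metric
# come from the abstract; everything else (hidden size, ReLU, Adam,
# cosine-based loss) is an assumed, illustrative choice.
import torch
import torch.nn as nn


class Vec2Vec(nn.Module):
    def __init__(self, in_dim=768, out_dim=1536, hidden_dim=1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, out_dim),
        )

    def forward(self, x):
        return self.net(x)


def avg_cosine_similarity(pred, target):
    # Average cosine similarity between predicted and true ada-002 embeddings,
    # the evaluation metric reported in the abstract (0.932 on held-out reviews).
    return nn.functional.cosine_similarity(pred, target, dim=-1).mean()


def train(model, loader, epochs=75, lr=1e-3):
    # loader yields (mpnet_embedding, ada_embedding) tensor pairs.
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        for mpnet_emb, ada_emb in loader:
            opt.zero_grad()
            # Maximize cosine similarity by minimizing (1 - similarity).
            loss = 1.0 - avg_cosine_similarity(model(mpnet_emb), ada_emb)
            loss.backward()
            opt.step()
```

At inference time, a predicted embedding from this mapper can be used as a drop-in query vector against an index of real text-ada-002 embeddings, which is how the abstract describes evaluating retrieval quality.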

Related research

08/17/2022
Neural Embeddings for Text
We propose a new kind of embedding for natural language text that deeply...

06/24/2021
Evaluation of Representation Models for Text Classification with AutoML Tools
Automated Machine Learning (AutoML) has gained increasing success on tab...

10/13/2022
MTEB: Massive Text Embedding Benchmark
Text embeddings are commonly evaluated on a small set of datasets from a...

06/10/2020
Training with Multi-Layer Embeddings for Model Reduction
Modern recommendation systems rely on real-valued embeddings of categori...

09/04/2020
Going Beyond T-SNE: Exposing whatlies in Text Embeddings
We introduce whatlies, an open source toolkit for visually inspecting wo...

11/07/2018
Adapting End-to-End Neural Speaker Verification to New Languages and Recording Conditions with Adversarial Training
In this article we propose a novel approach for adapting speaker embeddi...
