A Bilingual Generative Transformer for Semantic Sentence Embedding

11/10/2019
by John Wieting, et al.

Semantic sentence embedding models encode natural language sentences into vectors such that closeness in embedding space indicates semantic closeness between the sentences. Bilingual data offers a useful signal for learning such embeddings: properties shared by both sentences in a translation pair are likely semantic, while divergent properties are likely stylistic or language-specific. We propose a deep latent variable model that attempts to perform source separation on parallel sentences, isolating what they have in common in a latent semantic vector and explaining what is left over with language-specific latent vectors. Our proposed approach differs from past work on semantic sentence encoding in two ways. First, by using a variational probabilistic framework, we introduce priors that encourage source separation, and we can use our model's posterior to predict sentence embeddings for monolingual data at test time. Second, we use high-capacity transformers as both data-generating distributions and inference networks, contrasting with most past work on sentence embeddings. In experiments, our approach substantially outperforms the state of the art on a standard suite of unsupervised semantic similarity evaluations. Further, we demonstrate that our approach yields the largest gains on the more difficult subsets of these evaluations, where simple word overlap is not a good indicator of similarity.
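The source-separation idea and the variational prior term can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: the shared/residual split below (shared vector as the average of the two sentence encodings) and the diagonal-Gaussian KL term are simplified, illustrative assumptions standing in for the learned inference networks and priors.

```python
import numpy as np

def kl_diag_gaussian(mu, logvar):
    """KL( N(mu, diag(exp(logvar))) || N(0, I) ).
    In a variational framework, a term like this regularizes each latent
    (semantic or language-specific) toward a standard-normal prior."""
    return 0.5 * np.sum(np.exp(logvar) + mu ** 2 - 1.0 - logvar)

def toy_source_separation(x_src, x_tgt):
    """Hypothetical source separation for a translation pair:
    the shared semantic vector is the average of the two sentence
    encodings, and each language-specific vector is the residual."""
    semantic = 0.5 * (x_src + x_tgt)   # what the pair has in common
    resid_src = x_src - semantic       # source-language leftover
    resid_tgt = x_tgt - semantic       # target-language leftover
    return semantic, resid_src, resid_tgt
```

By construction, the semantic vector plus a language-specific residual reconstructs the original encoding, mirroring the model's goal of explaining each sentence as shared semantics plus language-specific variation.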


