InvBERT: Text Reconstruction from Contextualized Embeddings used for Derived Text Formats of Literary Works

09/21/2021
by Johannes Höhmann, et al.

Digital Humanities and Computational Literary Studies apply text mining methods to investigate literature. Such automated approaches enable quantitative studies on large corpora that would not be feasible by manual inspection alone. However, copyright restrictions limit the availability of relevant digitized literary works. Derived Text Formats (DTFs) have been proposed as a solution: textual materials are transformed so that copyright-critical features are removed while certain analytical methods remain applicable. Contextualized word embeddings produced by transformer encoders (like BERT) are promising candidates for DTFs because they allow for state-of-the-art performance on various analytical tasks and, at first sight, do not disclose the original text. In this paper, however, we demonstrate that under certain conditions reconstructing the original copyrighted text becomes feasible, so publishing it in the form of contextualized word representations is not safe. Our attempts to invert BERT suggest that publishing parts of the encoder together with the contextualized embeddings is critical, since it allows an attacker to generate data for training a decoder whose reconstruction accuracy is sufficient to violate copyright laws.
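The attack idea in the last sentence can be illustrated with a deliberately tiny sketch. It is not the paper's method and does not use BERT. The encoder is a toy stand-in (a random static vector per token mixed with its neighbours' vectors), the decoder is a nearest-centroid classifier, and all names (`make_encoder`, `train_decoder`, `decode`), the vocabulary, and the mixing weight are hypothetical. It only shows the data flow: whoever can run the published encoder on text of their own obtains (embedding, token) pairs and can fit a decoder on them.

```python
import random

DIM = 16  # toy embedding size (assumption, unrelated to BERT's hidden size)


def make_encoder(vocab, seed=0):
    """Toy 'contextualized' encoder: token vector plus a share of its neighbours."""
    rng = random.Random(seed)
    table = {w: [rng.gauss(0, 1) for _ in range(DIM)] for w in vocab}
    zero = [0.0] * DIM

    def encode(tokens):
        out = []
        for i, tok in enumerate(tokens):
            left = table[tokens[i - 1]] if i > 0 else zero
            right = table[tokens[i + 1]] if i + 1 < len(tokens) else zero
            out.append([c + 0.25 * (l + r)
                        for c, l, r in zip(table[tok], left, right)])
        return out

    return encode


def train_decoder(encode, sentences):
    """Fit one centroid per token from (embedding, token) pairs the attacker generates."""
    sums, counts = {}, {}
    for tokens in sentences:
        for tok, vec in zip(tokens, encode(tokens)):
            acc = sums.setdefault(tok, [0.0] * DIM)
            for j, v in enumerate(vec):
                acc[j] += v
            counts[tok] = counts.get(tok, 0) + 1
    return {tok: [v / counts[tok] for v in acc] for tok, acc in sums.items()}


def decode(centroids, vectors):
    """Map each embedding to the token with the nearest centroid."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return [min(centroids, key=lambda t: dist(centroids[t], vec))
            for vec in vectors]


vocab = "the a cat dog sat ran on under mat tree".split()
encode = make_encoder(vocab)  # stands in for the published encoder

# The attacker encodes text of their own to obtain training pairs.
rng = random.Random(1)
attacker_corpus = [[rng.choice(vocab) for _ in range(8)] for _ in range(200)]
centroids = train_decoder(encode, attacker_corpus)

# Only the embeddings of the protected text are published.
protected = "the cat sat on the mat".split()
published = encode(protected)
reconstructed = decode(centroids, published)
accuracy = sum(a == b for a, b in zip(protected, reconstructed)) / len(protected)
print(reconstructed, accuracy)
```

A real attack on a transformer encoder would need a far more capable decoder and much more training data. The sketch only makes the dependency explicit: without access to `encode`, the attacker has no way to produce the training pairs.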

research
08/19/2021

Sentence-T5: Scalable Sentence Encoders from Pre-trained Text-to-Text Models

We provide the first exploration of text-to-text transformers (T5) sente...
research
04/07/2021

Combining Pre-trained Word Embeddings and Linguistic Features for Sequential Metaphor Identification

We tackle the problem of identifying metaphors in text, treated as a seq...
research
06/05/2019

Training Temporal Word Embeddings with a Compass

Temporal word embeddings have been proposed to support the analysis of w...
research
10/04/2019

Investigating the Effectiveness of Representations Based on Word-Embeddings in Active Learning for Labelling Text Datasets

Manually labelling large collections of text data is a time-consuming, e...
research
03/18/2015

Text Segmentation based on Semantic Word Embeddings

We explore the use of semantic word embeddings in text segmentation algo...
research
12/22/2020

Improved Biomedical Word Embeddings in the Transformer Era

Biomedical word embeddings are usually pre-trained on free text corpora ...