Putting Self-Supervised Token Embedding on the Tables

by Marc Szafraniec et al.
École Polytechnique

Distributing information by electronic message is a privileged transmission channel for many businesses and individuals, often in the form of plain-text tables. As their number grows, it becomes necessary to extract the text and numbers algorithmically rather than by hand. Usual methods rely on regular expressions or on a strict structure in the data, but they are not efficient when the data exhibits many variations, fuzzy structure, or implicit labels. In this paper we introduce SC2T, a totally self-supervised model that addresses these issues by constructing vector representations of tokens in semi-structured messages, using both the character and context levels. These representations can then be used for unsupervised token labeling, or as the basis for a semi-supervised information extraction system.
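The abstract does not detail the SC2T architecture, but the core idea of combining a character-level view of each token with a view of its surrounding context can be sketched in a few lines. The sketch below is a toy illustration, not the paper's method: it stands in hashed character n-grams for a learned character encoder, and a mean of neighbor vectors for a learned context encoder; the dimension, window size, and example tokens are all arbitrary assumptions.

```python
import zlib

DIM = 64  # arbitrary embedding dimension for this sketch

def char_vector(token, n=3):
    """Character-level view: hash each character n-gram of the token into a
    fixed-size bag-of-ngrams vector (a crude stand-in for a learned
    character encoder)."""
    padded = f"<{token}>"  # boundary markers so prefixes/suffixes are distinct
    vec = [0.0] * DIM
    for i in range(max(1, len(padded) - n + 1)):
        gram = padded[i:i + n]
        vec[zlib.crc32(gram.encode()) % DIM] += 1.0
    return vec

def token_vectors(tokens, window=1):
    """Combine each token's character vector with the mean of its
    neighbours' character vectors (the 'context level')."""
    chars = [char_vector(t) for t in tokens]
    out = []
    for i in range(len(tokens)):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        ctx = [chars[j] for j in range(lo, hi) if j != i]
        mean_ctx = ([sum(col) / len(ctx) for col in zip(*ctx)]
                    if ctx else [0.0] * DIM)
        # character view + context view, concatenation replaced by a sum
        # here purely to keep the sketch short
        out.append([c + m for c, m in zip(chars[i], mean_ctx)])
    return out

# A plain-text table row, as might appear in an electronic message
row = ["Invoice", "2023-04-01", "Total", "1,250.00"]
vecs = token_vectors(row)
print(len(vecs), len(vecs[0]))  # one 64-dimensional vector per token
```

In a self-supervised setup like the one the abstract describes, such vectors would be trained rather than hand-crafted, so that tokens with a similar character shape (dates, amounts) and a similar context (same column, same neighbors) land close together, which is what makes unsupervised labeling possible.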




