R-grams: Unsupervised Learning of Semantic Units in Natural Language

08/14/2018
by Ariel Ekgren, et al.

This paper introduces a novel type of data-driven segmentation unit that we call r-grams. We illustrate one algorithm for calculating r-grams and discuss its properties and its impact on the frequency distribution of text representations. The proposed approach is evaluated by demonstrating its viability in embedding techniques, in both monolingual and multilingual test settings. We also provide a number of qualitative examples of the proposed methodology, demonstrating its potential as a language-invariant segmentation procedure.
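The abstract does not spell out the r-gram algorithm itself. As a rough illustration only, the sketch below implements a byte-pair-encoding-style procedure that repeatedly merges the most frequent adjacent pair of units in raw text (whitespace included), which is one plausible way to obtain variable-length, data-driven segmentation units of the kind described. The function names (learn_rgrams, segment) and the num_merges parameter are illustrative assumptions, not taken from the paper.

```python
from collections import Counter

def learn_rgrams(corpus, num_merges=1000):
    """Greedily merge the most frequent adjacent pair of units.

    A BPE-style sketch of how variable-length, data-driven units
    (r-gram-like) might be learned from raw text, whitespace included.
    """
    # Start from individual characters, treating whitespace as ordinary symbols.
    sequences = [list(line) for line in corpus]
    merges = []
    for _ in range(num_merges):
        # Count all adjacent pairs across the corpus.
        pairs = Counter()
        for seq in sequences:
            for a, b in zip(seq, seq[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        (a, b), freq = pairs.most_common(1)[0]
        if freq < 2:  # nothing left worth merging
            break
        merges.append((a, b))
        merged = a + b
        # Apply the chosen merge everywhere it occurs.
        new_sequences = []
        for seq in sequences:
            out, i = [], 0
            while i < len(seq):
                if i + 1 < len(seq) and seq[i] == a and seq[i + 1] == b:
                    out.append(merged)
                    i += 2
                else:
                    out.append(seq[i])
                    i += 1
            new_sequences.append(out)
        sequences = new_sequences
    return merges

def segment(text, merges):
    """Segment new text by replaying the learned merges in order."""
    seq = list(text)
    for a, b in merges:
        out, i = [], 0
        while i < len(seq):
            if i + 1 < len(seq) and seq[i] == a and seq[i + 1] == b:
                out.append(a + b)
                i += 2
            else:
                out.append(seq[i])
                i += 1
        seq = out
    return seq

# Toy usage: learn units from a tiny corpus and segment a new string.
corpus = ["the cat sat on the mat", "the dog sat on the log"]
merges = learn_rgrams(corpus, num_merges=20)
print(segment("the cat sat", merges))
```

Because whitespace is treated like any other symbol, the learned units can span word boundaries, which is consistent with the language-invariant, data-driven character of the segmentation described in the abstract.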

