R-grams: Unsupervised Learning of Semantic Units in Natural Language

08/14/2018
by Ariel Ekgren, et al.

This paper introduces a novel type of data-driven segmented unit that we call r-grams. We illustrate one algorithm for calculating r-grams, and discuss its properties and its impact on the frequency distribution of text representations. We evaluate the proposed approach by demonstrating its viability in embedding techniques, in both monolingual and multilingual test settings. We also provide a number of qualitative examples that show the methodology working as a language-invariant segmentation procedure.
