Text Ranking and Classification using Data Compression

09/23/2021
by Nitya Kasturi, et al.

A well-known but rarely used approach to text categorization uses conditional entropy estimates computed using data compression tools. Text affinity scores derived from compressed sizes can be used for classification and ranking tasks, but their success depends on the compression tools used. We use the Zstandard compressor and strengthen these ideas in several ways, calling the resulting language-agnostic technique Zest. In applications, this approach simplifies configuration, avoiding careful feature extraction and large ML models. Our ablation studies confirm the value of individual enhancements we introduce. We show that Zest complements and can compete with language-specific multidimensional content embeddings in production, but cannot outperform other counting methods on public datasets.
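
The core idea of scoring text affinity from compressed sizes can be illustrated with a minimal sketch. It assumes Python's standard-library `zlib` as a stand-in for Zstandard, and it is not the Zest implementation: none of the enhancements mentioned above are included. The function names and the toy samples are hypothetical. The cost of a document given a class sample is approximated as the compressed size of the sample plus the document, minus the compressed size of the sample alone, which serves as a rough conditional entropy estimate.

```python
import zlib
from typing import Dict, List


def compressed_size(data: bytes) -> int:
    """Size in bytes of `data` after DEFLATE compression at the highest level."""
    return len(zlib.compress(data, 9))


def affinity_cost(class_text: str, doc: str) -> int:
    """Approximate extra bytes needed to encode `doc` once `class_text` is known.

    A lower cost suggests the document resembles the class sample more closely.
    """
    base = class_text.encode("utf-8")
    joint = base + b"\n" + doc.encode("utf-8")
    return compressed_size(joint) - compressed_size(base)


def rank_labels(doc: str, class_samples: Dict[str, str]) -> List[str]:
    """Order labels from lowest to highest cost for `doc`."""
    return sorted(class_samples, key=lambda label: affinity_cost(class_samples[label], doc))


def classify(doc: str, class_samples: Dict[str, str]) -> str:
    """Return the label whose sample gives the smallest extra cost."""
    return rank_labels(doc, class_samples)[0]


# Toy, made-up class samples used only for illustration (an assumption,
# not data from the paper).
samples = {
    "sports": (
        "the team won the match after scoring a late goal in the second half "
        "of the game. the coach praised the players and the fans cheered in "
        "the stadium."
    ),
    "cooking": (
        "chop the onions and simmer the sauce with garlic and basil for ten "
        "minutes. season the soup with salt and pepper, then serve it warm "
        "with fresh bread."
    ),
}

if __name__ == "__main__":
    query = "the coach praised the players and the fans cheered in the stadium"
    print(classify(query, samples))
```

The same cost function supports both tasks named in the abstract: classification picks the lowest-cost label, while ranking sorts candidates by cost. Results from a sketch like this depend heavily on the compressor and on how much sample text each class provides.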


