byteSteady: Fast Classification Using Byte-Level n-Gram Embeddings

06/24/2021
by   Xiang Zhang, et al.

This article introduces byteSteady, a fast model for classification using byte-level n-gram embeddings. byteSteady assumes that each input comes as a sequence of bytes. A representation vector is produced by averaging the embedding vectors of byte-level n-grams, for a pre-defined set of values of n. The hashing trick is used to reduce the number of embedding vectors. This representation vector is then fed into a linear classifier. A straightforward application of byteSteady is text classification. We also apply byteSteady to one type of non-language data: DNA sequences for gene classification. On both problems we achieve competitive classification results against strong baselines, suggesting that byteSteady can be applied to both language and non-language data. Furthermore, we find that simple compression using Huffman coding does not significantly impact the results, which offers an accuracy-speed trade-off previously unexplored in machine learning.
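
The pipeline described in the abstract (byte-level n-grams, hashed embedding lookup, averaging, linear classifier) can be illustrated with a minimal Python sketch. This is not the authors' implementation. The names `ngram_buckets` and `ByteNgramClassifier`, the n-gram sizes `(1, 2, 4)`, the bucket count, the embedding dimension, and the use of CRC32 as the hash function are all assumptions chosen for illustration. The parameters are randomly initialized and no training is shown, so the predictions are meaningless; the sketch only shows how an input byte sequence would flow through such a model.

```python
import random
import zlib


def ngram_buckets(data: bytes, ns=(1, 2, 4), num_buckets=4096):
    """Yield one hashed bucket index per byte-level n-gram in `data`.

    The hashing trick: instead of one embedding per distinct n-gram,
    each n-gram is mapped to one of `num_buckets` shared embeddings.
    CRC32 is an assumed, illustrative hash function.
    """
    for n in ns:
        for i in range(len(data) - n + 1):
            yield zlib.crc32(data[i:i + n]) % num_buckets


class ByteNgramClassifier:
    """Hypothetical, untrained model: averaged n-gram embeddings + linear layer."""

    def __init__(self, num_classes, dim=16, num_buckets=4096, ns=(1, 2, 4), seed=0):
        rng = random.Random(seed)
        self.ns = ns
        self.dim = dim
        self.num_buckets = num_buckets
        # Embedding table: one vector per hash bucket (random, not trained).
        self.embeddings = [
            [rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(num_buckets)
        ]
        # Linear classifier parameters (random weights, zero bias).
        self.weights = [
            [rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(num_classes)
        ]
        self.bias = [0.0] * num_classes

    def represent(self, data: bytes):
        """Average the embeddings of all hashed n-grams in `data`."""
        total = [0.0] * self.dim
        count = 0
        for bucket in ngram_buckets(data, self.ns, self.num_buckets):
            row = self.embeddings[bucket]
            for j in range(self.dim):
                total[j] += row[j]
            count += 1
        if count == 0:  # empty input: return the zero vector
            return total
        return [t / count for t in total]

    def scores(self, data: bytes):
        """Apply the linear classifier to the averaged representation."""
        x = self.represent(data)
        return [
            sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(self.weights, self.bias)
        ]

    def predict(self, data: bytes) -> int:
        """Return the index of the highest-scoring class."""
        s = self.scores(data)
        return max(range(len(s)), key=s.__getitem__)


model = ByteNgramClassifier(num_classes=3)
text_label = model.predict("a short review".encode("utf-8"))  # text as bytes
dna_label = model.predict(b"ACGTTGCA")                         # DNA as bytes
```

Because the model only sees bytes, the same code path handles UTF-8 text and DNA strings, which is the property the abstract highlights.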

research
04/01/2020

An Improved Classification Model for Igbo Text Using N-Gram And K-Nearest Neighbour Approaches

This paper presents an improved classification model for Igbo text using...
research
05/29/2017

Character-Based Text Classification using Top Down Semantic Model for Sentence Representation

Despite the success of deep learning on many fronts especially image and...
research
03/10/2022

TextConvoNet: A Convolutional Neural Network based Architecture for Text Classification

In recent years, deep learning-based models have significantly improved ...
research
04/27/2019

Sentiment Classification using N-gram IDF and Automated Machine Learning

We propose a sentiment classification method with a general machine lear...
research
06/21/2016

An empirical study on large scale text classification with skip-gram embeddings

We investigate the integration of word embeddings as classification feat...
research
07/22/2018

On Tree-structured Multi-stage Principal Component Analysis (TMPCA) for Text Classification

A novel sequence-to-vector (seq2vec) embedding method, called the tree-s...
