byteSteady: Fast Classification Using Byte-Level n-Gram Embeddings

06/24/2021
by   Xiang Zhang, et al.

This article introduces byteSteady, a fast model for classification using byte-level n-gram embeddings. byteSteady assumes that each input comes as a sequence of bytes. A representation vector is produced by averaging the embedding vectors of the input's byte-level n-grams, for a pre-defined set of n-gram lengths n. The hashing trick is used to reduce the number of embedding vectors. This representation vector is then fed into a linear classifier. A straightforward application of byteSteady is text classification. We also apply byteSteady to one type of non-language data: DNA sequences for gene classification. On both problems we achieve competitive classification results against strong baselines, suggesting that byteSteady can be applied to both language and non-language data. Furthermore, we find that simple compression using Huffman coding does not significantly impact the results, which offers an accuracy-speed trade-off previously unexplored in machine learning.
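
The pipeline described in the abstract (hashed byte-level n-grams, averaged embeddings, linear classifier) can be sketched roughly as follows. This is an illustrative sketch only, not the authors' implementation: the function names, the choice of CRC32 as the hash, the n-gram lengths (1, 2, 4), the bucket count, and the embedding dimension are all assumptions, and the parameters are random rather than trained.

```python
import random
import zlib


def ngram_buckets(data: bytes, ns=(1, 2, 4), num_buckets=1024):
    """Map every byte-level n-gram of `data` to a hashed bucket index.

    CRC32 is an assumed stand-in for whatever hash the paper uses.
    """
    buckets = []
    for n in ns:
        # Inputs shorter than n simply contribute no n-grams of that length.
        for i in range(len(data) - n + 1):
            gram = data[i:i + n]
            buckets.append(zlib.crc32(gram) % num_buckets)
    return buckets


def represent(data: bytes, table, ns=(1, 2, 4)):
    """Average the embedding vectors of all hashed n-grams of `data`."""
    dim = len(table[0])
    idxs = ngram_buckets(data, ns, len(table))
    if not idxs:
        return [0.0] * dim
    total = [0.0] * dim
    for idx in idxs:
        for j, value in enumerate(table[idx]):
            total[j] += value
    return [t / len(idxs) for t in total]


def classify(data: bytes, table, weights, bias, ns=(1, 2, 4)):
    """Linear classifier: index of the largest score w_c . x + b_c."""
    x = represent(data, table, ns)
    scores = [
        sum(w * xi for w, xi in zip(row, x)) + b
        for row, b in zip(weights, bias)
    ]
    return max(range(len(scores)), key=scores.__getitem__)


# Hypothetical sizes and untrained random parameters, for illustration only.
rng = random.Random(0)
NUM_BUCKETS, DIM, NUM_CLASSES = 1024, 16, 2
table = [[rng.uniform(-0.1, 0.1) for _ in range(DIM)] for _ in range(NUM_BUCKETS)]
weights = [[rng.uniform(-0.1, 0.1) for _ in range(DIM)] for _ in range(NUM_CLASSES)]
bias = [0.0] * NUM_CLASSES

# Any input works once it is a byte sequence, e.g. text or a DNA string.
label = classify("ACGTACGT".encode("utf-8"), table, weights, bias)
```

Because the table has a fixed number of buckets, distinct n-grams may collide and share an embedding vector; that collision is the price the hashing trick pays for bounding the number of embeddings regardless of how many distinct n-grams occur.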

04/01/2020

An Improved Classification Model for Igbo Text Using N-Gram And K-Nearest Neighbour Approaches

This paper presents an improved classification model for Igbo text using...
05/29/2017

Character-Based Text Classification using Top Down Semantic Model for Sentence Representation

Despite the success of deep learning on many fronts especially image and...
03/10/2022

TextConvoNet: A Convolutional Neural Network based Architecture for Text Classification

In recent years, deep learning-based models have significantly improved ...
04/27/2019

Sentiment Classification using N-gram IDF and Automated Machine Learning

We propose a sentiment classification method with a general machine lear...
06/21/2016

An empirical study on large scale text classification with skip-gram embeddings

We investigate the integration of word embeddings as classification feat...
02/28/2022

Numeric Lyndon-based feature embedding of sequencing reads for machine learning approaches

Feature embedding methods have been proposed in literature to represent ...
09/25/2019

Beyond image classification: zooplankton identification with deep vector space embeddings

Zooplankton images, like many other real world data types, have intrinsi...