An Analysis of Hierarchical Text Classification Using Word Embeddings

09/06/2018
by Roger A. Stein, et al.

Efficient distributed numerical word representation models (word embeddings), combined with modern machine learning algorithms, have recently yielded considerable improvements on automatic document classification tasks. However, the effectiveness of such techniques has not yet been assessed for hierarchical text classification (HTC). This study investigates the application of those models and algorithms to this specific problem by means of experimentation and analysis. We trained classification models with prominent machine learning algorithm implementations (fastText, XGBoost, SVM, and Keras' CNN) and notable word embedding generation methods (GloVe, word2vec, and fastText) on publicly available data, and evaluated them with measures specifically appropriate for the hierarchical context. FastText achieved an LCA-F1 of 0.893 on a single-labeled version of the RCV1 dataset. Our analysis indicates that using word embeddings and their variants is a very promising approach for HTC.
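To illustrate what a measure "appropriate for the hierarchical context" can look like, the following is a minimal sketch of a lowest-common-ancestor (LCA) style F1 for the single-label case. It is not the authors' evaluation code: it assumes the class hierarchy is a single-rooted tree, that each document has exactly one true and one predicted class, and that scores are micro-averaged over documents. The function names, the `parent` mapping, and the toy class names are hypothetical.

```python
def path_to_root(node, parent):
    """Return [node, parent(node), ..., root] for a tree given as child -> parent."""
    path = [node]
    while node in parent:
        node = parent[node]
        path.append(node)
    return path


def lca_augmented(true, pred, parent):
    """Extend each label with its ancestors up to (and including) their LCA.

    Assumes a single-rooted tree, so a common ancestor always exists.
    """
    true_path = path_to_root(true, parent)
    pred_path = path_to_root(pred, parent)
    pred_ancestors = set(pred_path)
    # First node on the true label's upward path that is also above the prediction.
    lca = next(n for n in true_path if n in pred_ancestors)
    true_aug = set(true_path[: true_path.index(lca) + 1])
    pred_aug = set(pred_path[: pred_path.index(lca) + 1])
    return true_aug, pred_aug


def lca_f1(pairs, parent):
    """Micro-averaged F1 over LCA-augmented label sets (simplified sketch)."""
    overlap = n_true = n_pred = 0
    for true, pred in pairs:
        true_aug, pred_aug = lca_augmented(true, pred, parent)
        overlap += len(true_aug & pred_aug)
        n_true += len(true_aug)
        n_pred += len(pred_aug)
    precision = overlap / n_pred
    recall = overlap / n_true
    return 2 * precision * recall / (precision + recall)


# Toy hierarchy (hypothetical): root -> A, B; A -> A1, A2; B -> B1
parent = {"A": "root", "B": "root", "A1": "A", "A2": "A", "B1": "B"}

sibling_error = lca_f1([("A1", "A2")], parent)  # wrong leaf, same parent
distant_error = lca_f1([("A1", "B1")], parent)  # wrong leaf, different branch
print(sibling_error, distant_error)
```

Under these assumptions, confusing two sibling classes is penalized less than predicting a class in a different branch, which is the behavior a flat F1 cannot express.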

research
11/26/2019

Word-Class Embeddings for Multiclass Text Classification

Pre-trained word embeddings encode general word semantics and lexical re...
research
05/04/2022

Word Tour: One-dimensional Word Embeddings via the Traveling Salesman Problem

Word embeddings are one of the most fundamental technologies used in nat...
research
06/14/2016

Active Discriminative Text Representation Learning

We propose a new active learning (AL) method for text classification wit...
research
04/05/2018

Few-Shot Text Classification with Pre-Trained Word Embeddings and a Human in the Loop

Most of the literature around text classification treats it as a supervi...
research
04/04/2019

Text Classification Components for Detecting Descriptions and Names of CAD models

We apply text analysis approaches for a specialized search engine for 3D...
research
03/03/2022

Automated Single-Label Patent Classification using Ensemble Classifiers

Many thousands of patent applications arrive at patent offices around th...
research
09/03/2015

Encoding Prior Knowledge with Eigenword Embeddings

Canonical correlation analysis (CCA) is a method for reducing the dimens...
