MatchXML: An Efficient Text-label Matching Framework for Extreme Multi-label Text Classification

08/25/2023
by Hui Ye, et al.

eXtreme Multi-label text Classification (XMC) refers to training a classifier that assigns relevant labels to a text sample from an extremely large label set (e.g., millions of labels). We propose MatchXML, an efficient text-label matching framework for XMC. We observe that label embeddings generated from sparse Term Frequency-Inverse Document Frequency (TF-IDF) features have several limitations. We therefore propose label2vec, which trains semantic dense label embeddings with the Skip-gram model. The dense label embeddings are then used to build a Hierarchical Label Tree by clustering. When fine-tuning the pre-trained Transformer encoder, we formulate multi-label text classification as a text-label matching problem in a bipartite graph. We then extract dense text representations from the fine-tuned Transformer. In addition to these fine-tuned dense text embeddings, we extract static dense sentence embeddings from a pre-trained Sentence Transformer. Finally, a linear ranker is trained on the sparse TF-IDF features, the fine-tuned dense text representations, and the static dense sentence features. Experimental results demonstrate that MatchXML achieves state-of-the-art accuracy on five out of six datasets. In terms of speed, MatchXML outperforms the competing methods on all six datasets. Our source code is publicly available at https://github.com/huiyegit/MatchXML.
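
As a rough illustration of the label2vec idea, the sketch below builds the (target, context) pairs that a Skip-gram model would be trained on. It is not the authors' implementation: the function name `label_skipgram_pairs` and the toy label sets are hypothetical, and it assumes that the labels of one training sample form an unordered "sentence" in which every other label of that sample counts as context.

```python
from collections import Counter
from itertools import permutations


def label_skipgram_pairs(label_sets):
    """Yield (target, context) label pairs for Skip-gram training.

    Assumption: the labels of one sample have no order, so every other
    label of the same sample is treated as context for the target.
    """
    for labels in label_sets:
        unique = sorted(set(labels))  # drop duplicates, fix iteration order
        # All ordered pairs of distinct labels within the same sample.
        yield from permutations(unique, 2)


# Hypothetical toy data: the label set of each of three training samples.
train_label_sets = [
    ["python", "nlp"],
    ["nlp", "transformers", "python"],
    ["clustering"],  # a single label yields no pairs
]

pairs = list(label_skipgram_pairs(train_label_sets))
pair_counts = Counter(pairs)  # co-occurrence counts per ordered pair
```

Labels that co-occur often contribute more training pairs, which is what pulls their dense embeddings together. A sample with a single label contributes nothing, so under this sketch such a label would need co-occurrences in other samples to receive a meaningful embedding.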

research
01/10/2022

GUDN: A novel guide network for extreme multi-label text classification

The problem of extreme multi-label text classification (XMTC) is to reca...
research
10/01/2021

Fast Multi-Resolution Transformer Fine-tuning for Extreme Multi-label Text Classification

Extreme multi-label text classification (XMC) seeks to find relevant lab...
research
08/15/2020

Label-Wise Document Pre-Training for Multi-Label Text Classification

A major challenge of multi-label text classification (MLTC) is to stimul...
research
06/12/2023

Imbalanced Multi-label Classification for Business-related Text with Moderately Large Label Spaces

In this study, we compared the performance of four different methods for...
research
10/29/2022

CascadeXML: Rethinking Transformers for End-to-end Multi-resolution Training in Extreme Multi-label Classification

Extreme Multi-label Text Classification (XMC) involves learning a classi...
research
07/05/2020

Pretrained Generalized Autoregressive Model with Adaptive Probabilistic Label Clusters for Extreme Multi-label Text Classification

Extreme multi-label text classification (XMTC) is a task for tagging a g...
research
06/21/2016

An empirical study on large scale text classification with skip-gram embeddings

We investigate the integration of word embeddings as classification feat...
