HinFlair: pre-trained contextual string embeddings for pos tagging and text classification in the Hindi language

01/18/2021
by   Harsh Patel, et al.

Recent advancements in language models based on recurrent neural networks and the Transformer architecture have achieved state-of-the-art results on a wide range of natural language processing tasks such as POS tagging, named entity recognition, and text classification. However, most of these language models are pre-trained on high-resource languages such as English, German, and Spanish. Multilingual language models include Indian languages such as Hindi, Telugu, and Bengali in their training corpora, but they often fail to represent the linguistic features of these languages, since they are not the primary languages of study. We introduce HinFlair, a language representation model (contextual string embeddings) pre-trained on a large monolingual Hindi corpus. Experiments were conducted on six text classification datasets and a Hindi dependency treebank to analyze the performance of these contextual string embeddings for the Hindi language. Results show that HinFlair outperforms previously published, publicly available pre-trained embeddings on downstream tasks such as text classification and POS tagging. Moreover, when combined with FastText embeddings, HinFlair outperforms many Transformer-based language models trained specifically for the Hindi language.
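The "HinFlair combined with FastText embeddings" setup described above corresponds to stacking embeddings: each token's contextual string embedding is concatenated with its static FastText vector before being fed to the downstream tagger or classifier (in the flair library, the `StackedEmbeddings` class performs exactly this concatenation). A minimal conceptual sketch, with purely illustrative vector dimensions that are assumptions, not the paper's exact sizes:

```python
import numpy as np

def stack_embeddings(contextual_vec, static_vec):
    """Stack two per-token embeddings by concatenation, as done when
    combining contextual string embeddings with static FastText vectors."""
    return np.concatenate([contextual_vec, static_vec])

# Illustrative (assumed) dimensions: a 2048-d contextual string embedding
# and a 300-d FastText vector for the same token.
contextual_vec = np.random.rand(2048)
fasttext_vec = np.random.rand(300)

stacked = stack_embeddings(contextual_vec, fasttext_vec)
print(stacked.shape)  # (2348,)
```

The downstream model then consumes the concatenated vector, so it sees both the context-sensitive character-level signal and the subword-aware static signal for every token.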


