Semantic Sensitive TF-IDF to Determine Word Relevance in Documents

01/06/2020
by Amir Jalilifard, et al.

Keyword extraction has received increasing attention as a research topic that can lead to advances in diverse applications such as document context categorization, text indexing, and document classification. In this paper we propose STF-IDF, a novel semantic method based on TF-IDF, for scoring word importance in the informal documents of a corpus. A set of nearly four million documents from health-care social media was collected and used to train a semantic model and obtain word embeddings. The features of the semantic space were then used to rearrange the original TF-IDF scores through an iterative solution, so as to improve the moderate performance of this algorithm on informal texts. When tested on 200 randomly chosen documents, our method decreased the TF-IDF mean error rate by 50%, compared with the 27.2% mean error of the original TF-IDF.
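
For readers unfamiliar with the baseline that the abstract refers to, the following is a minimal sketch of plain TF-IDF scoring in Python. It is not the STF-IDF method, whose semantic re-ranking step is not described in the abstract. The function name `tf_idf`, the toy documents, and the particular weighting (relative term frequency times the natural log of inverse document frequency) are assumptions made for illustration; other TF-IDF variants exist.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: score} dict per tokenized document.

    Assumed variant: tf = count / document length,
    idf = ln(number of documents / documents containing the term).
    """
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        scores.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return scores


# Hypothetical informal, health-related posts (invented for this sketch).
docs = [
    "my doctor changed my insulin dose".split(),
    "the doctor said the dose was fine".split(),
    "insulin pump alarms all night".split(),
]
scores = tf_idf(docs)
top_term = max(scores[0], key=scores[0].get)
```

In this toy corpus the highest-scoring term of the first document is "my", a function word that is frequent in that document and absent from the others. This is the kind of weakness on informal text that a semantic correction of the scores might address.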


07/25/2017

From Image to Text Classification: A Novel Approach based on Clustering Word Embeddings

In this paper, we propose a novel approach for text classification based...
01/10/2020

Inductive Document Network Embedding with Topic-Word Attention

Document network embedding aims at learning representations for a struct...
11/24/2018

Novelty and Coverage in context-based information filtering

We present a collection of algorithms to filter a stream of documents in...
07/27/2018

Clustering Prominent People and Organizations in Topic-Specific Text Corpora

Named entities in text documents are the names of people, organization, ...
09/09/2019

Follow the Leader: Documents on the Leading Edge of Semantic Change Get More Citations

Diachronic word embeddings offer remarkable insights into the evolution ...
10/16/2016

Term-Class-Max-Support (TCMS): A Simple Text Document Categorization Approach Using Term-Class Relevance Measure

In this paper, a simple text categorization method using term-class rele...
12/28/2017

Corpus specificity in LSA and Word2vec: the role of out-of-domain documents

Latent Semantic Analysis (LSA) and Word2vec are some of the most widely ...