
A Framework for Neural Topic Modeling of Text Corpora

08/19/2021
by Shayan Fazeli, et al.

Topic modeling refers to the problem of discovering the main topics that occur in corpora of textual data, with solutions finding crucial applications in numerous fields. In this work, inspired by recent advancements in the natural language processing domain, we introduce FAME, an open-source framework that provides an efficient mechanism for extracting textual features and utilizing them to discover topics and cluster semantically similar text documents in a corpus. These features range from traditional approaches (e.g., frequency-based) to the most recent auto-encoding embeddings from transformer-based language models such as the BERT family. To demonstrate the effectiveness of this library, we conducted experiments on the well-known 20 Newsgroups dataset. The library is available online.



I Introduction

When it comes to analyzing a large number of documents, making sense of them manually is most often a cumbersome task, if not an utterly impossible one [boyd2017applications, jelodar2019latent]. For many applications, users have to perform an in-depth analysis of a large amount of text data and try to discover which principal points of focus best describe the general underlying structure of the data. This motivates automating such tasks, for example through similarity-based clustering of document representations, which brings together documents that are close in terms of their semantic content. One application-domain instance is going over customer review documents and discovering which topics have contributed most to the semantic content of each document [titov2008modeling, calheiros2017sentiment, sutherland2020determinants, kirilenko2021automated]. Moreover, topic modeling and document clustering have applications in various fields, ranging from medical applications to geography and political science [zhang2017idoctor, greene2015unveiling, jiang2012using, paul2011you, wu2012ranking, linstead2007mining, gethers2010using, asuncion2010software, thomas2011mining, cristani2008geo, eisenstein2010latent, tang2012multiscale, yin2011geographical, sizov2010geofolk, chen2010opinion, cohen2013classifying].

Latent Dirichlet Allocation (LDA) provides the foundation for the bulk of traditional approaches to topic modeling and remains an effective solution to this problem [blei2003latent, jelodar2019latent]. LDA-based pipelines enable generative modeling of documents and topic discovery. In addition to LDA-based designs, matrix factorization has also been utilized for topic modeling and document clustering [kuang2015nonnegative].

With the advancements in employing neural network-based architectures to improve performance in computer vision and natural language processing, text embeddings received increasing attention. Starting from the more traditional approaches such as GloVe [pennington2014glove], the effort focused on capturing the semantic context of each word in a real-valued vector representation such that vector arithmetic closely mirrors the semantic similarity of the data points. The more recent approaches based on auto-regressive (e.g., ELMo) and auto-encoding (e.g., BERT) models improved the state-of-the-art performance on almost every natural language processing task by embedding each word together with the context in which it appears [peters2018deep, devlin2018bert]. These models, having been trained on large-scale text corpora, can effectively encode text semantics in concise mathematical vectors. It has been shown that utilizing these representations to obtain document representations can be helpful [shao_2020, green_mitchell_yu_2020].

In this work, we propose FAME, an open-source library for designing and training neural network-based topic modeling pipelines. Following the aforementioned literature, this library allows experimenting with traditional approaches to feature extraction such as LDA and term frequency–inverse document frequency (tf-idf). It also provides the ability to utilize transformer-based document representations. Finally, it allows mixing the two domains by training auto-encoders that minimize a reconstruction loss, so that the encoded embeddings form a more concise representation of the different features extracted from the documents.

II Methods

In what follows, the key components of the FAME framework are discussed. These components enable an efficient and easy process for designing, implementing, and evaluating topic modeling schemes on large-scale text corpora.

II-A Preprocessing

When it comes to working with large text corpora, especially for traditional approaches to topic modeling, various forms of preprocessing are needed. In addition to standard preprocessing routines (e.g., modifying or removing punctuation), it is often necessary to stem or lemmatize words or to perform spell-checking to fix typos in the text. All in all, preprocessing is a critical component when dealing with textual data.
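As a concrete illustration, the snippet below is a minimal Python sketch of such a preprocessing pipeline (lowercasing, punctuation removal, stop-word filtering, and lemmatization with NLTK). It is an illustrative assumption, not FAME's own preprocessing routines.

    import re
    import string

    from nltk.corpus import stopwords
    from nltk.stem import WordNetLemmatizer

    # Illustrative sketch, not FAME's API. Requires one-time downloads:
    # nltk.download("stopwords"); nltk.download("wordnet")
    _lemmatizer = WordNetLemmatizer()
    _stop_words = set(stopwords.words("english"))

    def preprocess(document):
        # Lowercase and strip punctuation characters.
        document = document.lower().translate(str.maketrans("", "", string.punctuation))
        # Collapse whitespace and split into tokens.
        tokens = re.sub(r"\s+", " ", document).strip().split(" ")
        # Drop stop words and lemmatize the remaining tokens.
        return [_lemmatizer.lemmatize(t) for t in tokens if t and t not in _stop_words]

    print(preprocess("The topics discovered in these documents were surprisingly coherent."))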

II-B Document Representation

II-B1 Term-based

Utilizing term frequencies is a well-known traditional approach to representing a document as a real-valued vector. These values can either correspond directly to the term frequencies or be obtained through factorization techniques such as non-negative matrix factorization. In summary, tf-idf features can play an important role, especially if the dataset does not include a large number of documents.
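The sketch below shows one common way of building such term-based representations with scikit-learn: tf-idf vectors, optionally compressed with non-negative matrix factorization. The library calls and parameters are illustrative assumptions, not FAME's interface.

    from sklearn.decomposition import NMF
    from sklearn.feature_extraction.text import TfidfVectorizer

    documents = [
        "the spacecraft entered orbit around the moon",
        "the team won the hockey game in overtime",
        "encryption keys protect private messages",
    ]

    # Sparse (n_docs, n_terms) tf-idf matrix.
    tfidf = TfidfVectorizer(stop_words="english").fit_transform(documents)

    # Optional: compress into a low-rank document representation with NMF.
    nmf = NMF(n_components=2, init="nndsvda", random_state=0)
    doc_factors = nmf.fit_transform(tfidf)
    print(doc_factors.shape)  # (3, 2)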

II-B2 LDA-based

LDA can also be used to represent documents. The approach is to first fit an LDA model with a chosen number of topics to the document corpus and then use it to associate each document with a probability distribution over the discovered principal topics.
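A hedged sketch of this LDA-based representation using scikit-learn (not necessarily FAME's own interface) is shown below: the per-document topic distribution returned by the fitted model serves as the document vector.

    from sklearn.decomposition import LatentDirichletAllocation
    from sklearn.feature_extraction.text import CountVectorizer

    documents = [
        "the spacecraft entered orbit around the moon",
        "the team won the hockey game in overtime",
        "encryption keys protect private messages",
    ]

    # LDA operates on raw term counts.
    counts = CountVectorizer(stop_words="english").fit_transform(documents)

    # Fit an LDA model with a chosen number of topics (3 here, purely illustrative).
    lda = LatentDirichletAllocation(n_components=3, random_state=0)
    doc_topic = lda.fit_transform(counts)  # each row is a distribution over topics
    print(doc_topic.shape)  # (3, 3)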

II-B3 Transformer-based

Transformer-based approaches, particularly the BERT family, have been utilized successfully in representing documents. Therefore, another effective way of mapping documents to a latent semantic space is to employ these models. FAME employs the sentence-transformers library to tackle the problem of obtaining semantic word-sequence representations by applying pre-trained Transformers [reimers-2020-multilingual-sentence-bert, reimers-2019-sentence-bert].
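The following is a minimal example of obtaining such embeddings with the sentence-transformers library; the specific pre-trained model name is an illustrative choice and is not prescribed by FAME.

    from sentence_transformers import SentenceTransformer

    # Any pre-trained sentence-embedding model can be used; this one is an example.
    model = SentenceTransformer("all-MiniLM-L6-v2")

    documents = [
        "NASA announced a new mission to study the lunar surface.",
        "The goalie made a spectacular save in the final period.",
    ]
    embeddings = model.encode(documents)  # numpy array of shape (n_docs, embedding_dim)
    print(embeddings.shape)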

II-C Representation Fusion

To utilize more than one of the aforementioned feature domains, we followed the literature and provided the option to design and use an additional auto-encoder for obtaining a concise representation that reflects the information in its inputs. The auto-encoder is trained with a Mean Squared Error (MSE) reconstruction loss.
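A minimal PyTorch sketch of this fusion step is given below, assuming the per-document features from the different domains have already been concatenated into one vector; the layer sizes and training loop are illustrative assumptions, not FAME's architecture.

    import torch
    from torch import nn

    class FusionAutoEncoder(nn.Module):
        """Toy auto-encoder that compresses concatenated document features."""

        def __init__(self, input_dim, latent_dim=64):
            super().__init__()
            self.encoder = nn.Sequential(
                nn.Linear(input_dim, 256), nn.ReLU(), nn.Linear(256, latent_dim))
            self.decoder = nn.Sequential(
                nn.Linear(latent_dim, 256), nn.ReLU(), nn.Linear(256, input_dim))

        def forward(self, x):
            z = self.encoder(x)
            return self.decoder(z), z

    # Stand-in for concatenated tf-idf / LDA / transformer features of 32 documents.
    features = torch.randn(32, 900)

    model = FusionAutoEncoder(input_dim=features.shape[1])
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    criterion = nn.MSELoss()

    for _ in range(100):  # toy training loop
        reconstruction, latent = model(features)
        loss = criterion(reconstruction, features)  # MSE reconstruction loss
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    # `latent` now holds the concise fused representations used for clustering.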

II-D Clustering

After mapping the text documents to the latent space, the similarity of the resulting representations can be effectively utilized to cluster the documents into semantically similar groups. This operation makes it easier to determine the main topics on which semantically similar documents focus.
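For instance, the representations can be grouped with K-Means from scikit-learn, as in the sketch below; the number of clusters is a free choice (20 here, matching the newsgroup count in the experiments), and the random embeddings only stand in for the real document representations.

    import numpy as np
    from sklearn.cluster import KMeans

    # Stand-in for the (n_docs, dim) document representations produced above.
    embeddings = np.random.rand(200, 64)

    kmeans = KMeans(n_clusters=20, n_init=10, random_state=0)
    cluster_ids = kmeans.fit_predict(embeddings)  # one cluster id per document
    print(cluster_ids[:10])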

III Experiments

We conducted experiments on the 20 Newsgroups dataset to better demonstrate the results of applying the aforementioned methodologies to text corpora [20newsgroup]. This dataset includes news documents (a portion of which is used as the held-out validation set), each of which is labeled with one of the following group names (a dataset-loading sketch is shown after the list):

  1. alt.atheism

  2. comp.graphics

  3. comp.os.ms-windows.misc

  4. comp.sys.ibm.pc.hardware

  5. comp.sys.mac.hardware

  6. comp.windows.x

  7. misc.forsale

  8. rec.autos

  9. rec.motorcycles

  10. rec.sport.baseball

  11. rec.sport.hockey

  12. sci.crypt

  13. sci.electronics

  14. sci.med

  15. sci.space

  16. soc.religion.christian

  17. talk.politics.guns

  18. talk.politics.mideast

  19. talk.politics.misc

  20. talk.religion.misc
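As a reference point, the dataset can be loaded with scikit-learn as sketched below; this is a common convenience loader and an assumption for illustration, since FAME may ship its own data-handling utilities.

    from sklearn.datasets import fetch_20newsgroups

    # Downloads the corpus on first use; headers/footers/quotes are stripped so the
    # models see only the message bodies.
    newsgroups = fetch_20newsgroups(subset="all", remove=("headers", "footers", "quotes"))
    print(len(newsgroups.data), "documents")
    print(newsgroups.target_names[:5])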

Fig. 1: The 2-D t-SNE projection of the document representations, with colors indicating cluster assignments from a 20-cluster K-Means.

Fig. 2: An example set of documents that have been clustered together, shown with their actual group labels (four labeled sci.crypt and one labeled comp.sys.mac.hardware).

IV Conclusion

Topic modeling, reinforced by its integration with document representation, is a critical problem with numerous applications. Both traditional and recent approaches to extracting semantic features from text documents have been shown to be helpful in obtaining representative clusters of semantically similar documents. In this work, we proposed FAME, an open-source software library that makes it easy to conduct experiments on topic modeling and document clustering.

References