
A Framework for Neural Topic Modeling of Text Corpora

by Shayan Fazeli et al.

Topic modeling refers to the problem of discovering the main topics present in a corpus of textual data, with solutions finding crucial applications in numerous fields. In this work, inspired by recent advancements in the natural language processing domain, we introduce FAME, an open-source framework that provides an efficient mechanism for extracting textual features and utilizing them to discover topics and to cluster semantically similar documents in a corpus. These features range from traditional approaches (e.g., frequency-based) to the most recent auto-encoding embeddings from transformer-based language models such as the BERT family. To demonstrate the effectiveness of this library, we conducted experiments on the well-known 20 Newsgroups dataset. The library is available online.



I Introduction

When it comes to analyzing a large number of documents, making sense of them manually is most often a cumbersome task, if not utterly impossible [boyd2017applications, jelodar2019latent]. For many applications, users have to perform an in-depth analysis of large amounts of text data to discover which principal points of focus best describe the general underlying structure of the data. This motivates automating such tasks, for example through similarity-based clustering of document representations, which brings together documents that are close in terms of their semantic content. An application-domain instance is going over customer review documents and discovering which topics have contributed most to the semantic content of each document [titov2008modeling, calheiros2017sentiment, sutherland2020determinants, kirilenko2021automated]. Moreover, topic modeling and document clustering have applications in various fields, ranging from medicine to geography and political science [zhang2017idoctor, greene2015unveiling, jiang2012using, paul2011you, wu2012ranking, linstead2007mining, gethers2010using, asuncion2010software, thomas2011mining, cristani2008geo, eisenstein2010latent, tang2012multiscale, yin2011geographical, sizov2010geofolk, chen2010opinion, cohen2013classifying].

Latent Dirichlet Allocation (LDA) provides the foundation for the bulk of traditional approaches to topic modeling and remains an effective solution to this problem [blei2003latent, jelodar2019latent]. LDA-based pipelines enable generative modeling of documents as well as topic discovery. In addition to LDA-based designs, matrix factorization has also been utilized in topic modeling and document clustering [kuang2015nonnegative].

With the advancements in employing neural network-based architectures to improve performance in computer vision and natural language processing, text embeddings have seen increasing attention. Starting with the more traditional approaches such as GloVe, the effort focused on capturing the semantic context of each word in a real-valued vector representation, such that vector arithmetic closely mirrors the semantic similarity of the data points. The more recent approaches, based on auto-regressive (e.g., ELMo) and auto-encoding (e.g., BERT) models, improved state-of-the-art performance on almost every natural language processing task by embedding each word together with the context in which it appears [peters2018deep, devlin2018bert]. These models, having been trained on large-scale text corpora, can effectively encode text semantics in concise mathematical vectors. It has been shown that utilizing these representations to obtain document representations can be helpful [shao_2020, green_mitchell_yu_2020].

In this work, we propose FAME, an open-source library for designing and training neural network-based topic modeling pipelines. Following the aforementioned literature, this library supports experimenting with traditional approaches to feature extraction, such as LDA and term frequency–inverse document frequency (tf-idf), and also provides the ability to utilize transformer-based document representations. The two domains can be mixed by training auto-encoders that minimize a reconstruction loss, so that the encoded embeddings form a more concise representation of the different features extracted from the documents.

II Methods

In what follows, we discuss the key components of the FAME framework. These components enable an efficient and straightforward process for designing, implementing, and evaluating topic modeling schemes on large-scale text corpora.

II-A Preprocessing

When working with large text corpora, especially with traditional approaches to topic modeling, various forms of preprocessing are needed. In addition to standard routines (e.g., modifying or removing punctuation), it is often necessary to stem or lemmatize words and to perform spell-checking to fix typos in the text. All in all, preprocessing is a critical component when dealing with textual data.
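As an illustration of such routines, the following is a minimal pure-Python sketch (not FAME's actual implementation) of lowercasing, punctuation removal, stop-word filtering, and a crude suffix-stripping "stemmer"; a real pipeline would rely on a proper stemmer or lemmatizer.

```python
import re

# Illustrative stop-word list; real pipelines use much larger lists.
STOP_WORDS = {"the", "a", "an", "is", "are", "of", "to", "and", "in"}

def crude_stem(token: str) -> str:
    """Strip a few common English suffixes (illustrative only)."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def preprocess(text: str) -> list:
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)          # drop punctuation and digits
    tokens = [t for t in text.split() if t not in STOP_WORDS]
    return [crude_stem(t) for t in tokens]

print(preprocess("The cats are chasing mice!"))    # → ['cat', 'chas', 'mice']
```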

II-B Document Representation

II-B1 Term-based

Utilizing term frequencies is a well-known traditional approach for representing a document as a real-valued vector. These values can either correspond directly to the term frequencies or be derived via factorization techniques such as non-negative matrix factorization. In particular, tf-idf features can play an important role when the dataset does not include a large number of documents.
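To make the term-based representation concrete, here is a from-scratch tf-idf sketch in pure Python (using a smoothed idf variant; an illustration only, not FAME's code):

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """docs: list of token lists. Returns (sorted vocab, per-doc tf-idf dicts)."""
    n = len(docs)
    df = Counter(t for doc in docs for t in set(doc))            # document frequency
    idf = {t: math.log((1 + n) / (1 + df[t])) + 1 for t in df}   # smoothed idf
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: (tf[t] / len(doc)) * idf[t] for t in tf})
    return sorted(df), vectors

docs = [["clipper", "chip", "key"],
        ["chip", "drive", "mac"],
        ["key", "escrow", "clipper"]]
vocab, vecs = tfidf_vectors(docs)
```

Note how a term that occurs in only one document ("escrow") receives a higher weight than an equally frequent term shared across documents ("chip").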

II-B2 LDA-based

LDA can also be used to represent documents. The approach is to first fit an LDA model with a chosen number of topics to the document corpus and then use it to associate each document with a probability distribution over the discovered principal topics.
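The following toy collapsed Gibbs sampler, written from scratch, illustrates how a fitted LDA model yields a per-document topic distribution; the hyperparameters and corpus here are illustrative, not taken from the paper.

```python
import random
from collections import defaultdict

def lda_gibbs(docs, K, iters=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA; returns per-document topic mixtures."""
    rng = random.Random(seed)
    V = len({w for d in docs for w in d})
    ndk = [[0] * K for _ in docs]                  # document-topic counts
    nkw = [defaultdict(int) for _ in range(K)]     # topic-word counts
    nk = [0] * K                                   # tokens per topic
    z = []                                         # topic assignment per token
    for d, doc in enumerate(docs):
        zd = []
        for w in doc:                              # random initialization
            t = rng.randrange(K)
            zd.append(t); ndk[d][t] += 1; nkw[t][w] += 1; nk[t] += 1
        z.append(zd)
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                t = z[d][i]                        # remove current assignment
                ndk[d][t] -= 1; nkw[t][w] -= 1; nk[t] -= 1
                weights = [(ndk[d][s] + alpha) * (nkw[s][w] + beta) / (nk[s] + V * beta)
                           for s in range(K)]      # collapsed conditional
                t = rng.choices(range(K), weights=weights)[0]
                z[d][i] = t
                ndk[d][t] += 1; nkw[t][w] += 1; nk[t] += 1
    # Per-document topic distributions: the representation used downstream.
    return [[(ndk[d][s] + alpha) / (len(doc) + K * alpha) for s in range(K)]
            for d, doc in enumerate(docs)]

docs = [["crypto", "key", "chip"] * 4,
        ["disk", "drive", "mac"] * 4,
        ["key", "crypto", "escrow"] * 4]
theta = lda_gibbs(docs, K=2)   # one topic-probability row per document
```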

II-B3 Transformer-based

Transformer-based approaches, chiefly the BERT family, have been utilized successfully for representing documents. Employing these models is therefore another effective way of mapping documents to a latent semantic space. FAME uses the sentence-transformers library to obtain semantic word-sequence representations from pre-trained transformers [reimers-2020-multilingual-sentence-bert, reimers-2019-sentence-bert].

II-C Representation Fusion

To utilize more than one of the aforementioned feature domains, we followed the literature and provided the option to design an additional auto-encoder that produces a concise representation reflecting the information in its inputs. The auto-encoder can be trained with a Mean Squared Error (MSE) reconstruction loss.
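As a sketch of this fusion step, the minimal linear auto-encoder below compresses concatenated feature vectors via SGD on the MSE reconstruction loss; all dimensions, learning rates, and data are toy values, not the paper's configuration.

```python
import random

def train_autoencoder(X, code_dim, epochs=300, lr=0.05, seed=0):
    """Train a linear auto-encoder with SGD on the MSE reconstruction loss.
    Returns the final mean reconstruction error over X."""
    rng = random.Random(seed)
    d = len(X[0])
    W = [[rng.uniform(-0.5, 0.5) for _ in range(code_dim)] for _ in range(d)]  # encoder
    U = [[rng.uniform(-0.5, 0.5) for _ in range(d)] for _ in range(code_dim)]  # decoder

    def forward(x):
        h = [sum(x[i] * W[i][j] for i in range(d)) for j in range(code_dim)]   # encode
        r = [sum(h[j] * U[j][i] for j in range(code_dim)) for i in range(d)]   # decode
        return h, r

    for _ in range(epochs):
        for x in X:
            h, r = forward(x)
            err = [2 * (r[i] - x[i]) / d for i in range(d)]          # d(MSE)/dr
            gU = [[h[j] * err[i] for i in range(d)] for j in range(code_dim)]
            gh = [sum(err[i] * U[j][i] for i in range(d)) for j in range(code_dim)]
            gW = [[x[i] * gh[j] for j in range(code_dim)] for i in range(d)]
            for j in range(code_dim):                                 # SGD updates
                for i in range(d):
                    U[j][i] -= lr * gU[j][i]
            for i in range(d):
                for j in range(code_dim):
                    W[i][j] -= lr * gW[i][j]

    return sum(sum((r_i - x_i) ** 2 for r_i, x_i in zip(forward(x)[1], x)) / d
               for x in X) / len(X)

# Toy "fused" vectors (e.g. tf-idf dims and embedding dims concatenated).
X = [[1.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, 1.0], [1.0, 0.0, 0.9, 0.1]]
before = train_autoencoder(X, code_dim=2, epochs=0)
after = train_autoencoder(X, code_dim=2, epochs=300)
```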

II-D Clustering

After the text documents are mapped to a latent space, the similarity of the resulting representations can be used to cluster the documents into semantically similar groups. This makes it easier to determine the main topics that semantically similar documents focus on.
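A compact K-Means sketch in pure Python illustrates this step; the 2-D toy points stand in for latent document vectors.

```python
import math
import random

def kmeans(points, k, iters=50, seed=0):
    """Lloyd's algorithm: assign each point to its nearest center, then
    recompute each center as the mean of its assigned points."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: math.dist(p, centers[c]))
            clusters[nearest].append(p)
        centers = [tuple(sum(coord) / len(cl) for coord in zip(*cl)) if cl else centers[j]
                   for j, cl in enumerate(clusters)]
    return [min(range(k), key=lambda c: math.dist(p, centers[c])) for p in points]

# Two well-separated toy groups standing in for document representations.
points = [(0.0, 0.1), (0.1, 0.0), (0.0, 0.0), (5.0, 5.1), (5.1, 5.0), (5.0, 5.0)]
labels = kmeans(points, k=2)
```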

III Experiments

We conducted experiments on the 20 Newsgroups dataset to demonstrate the results of applying the aforementioned methodologies to text corpora [20newsgroup]. This dataset consists of news documents (a portion of which is held out as a validation set), each labeled with one of the following group names:

  1. alt.atheism

  2. comp.graphics

  3. comp.os.ms-windows.misc

  4. comp.sys.ibm.pc.hardware

  5. comp.sys.mac.hardware

  6. comp.windows.x

  7. misc.forsale

  8. rec.autos

  9. rec.motorcycles

  10. rec.sport.baseball

  11. rec.sport.hockey

  12. sci.crypt

  13. sci.electronics

  14. sci.med

  15. sci.space

  16. soc.religion.christian

  17. talk.politics.guns

  18. talk.politics.mideast

  19. talk.politics.misc

  20. talk.religion.misc

Fig. 1: The 2-D t-SNE projection of the document representations, with colors indicating cluster membership in a 20-cluster K-Means

from mike avon demon us subject re distribution world organization none mike avon demon us simple news lines in article access die net access die com writes the only theory that makes any sense is that and are either the same for all chips or vary among very few possibilities so that anyone trying to break the encryption by brute force need only low through the possible serial numbers multiplied by the number of different combinations if the phones transmit their serial nos as part of the message then what is to say that each phone can take that serial number and use it to generate the required key - target: sci.crypt
from tunisia signal eye carson eds see subject re organized lobbying for cryptography tunisia signal eye carson eds see organization sun ecosystems lines signal eye carson eds in article has transfer stratus com me ellison stratus com writes to paraphrase i may not agree with what you encrypting but i defend your right to encrypt it great slogan i ready to sign up with a effort though i would want to do it through an era offshoot shall we also push for the car cryptographic rights amendment dwight tunisia best tunisia sandman eye carson eds tolerable twisted craft camp carson eds homo sapiens planetary cancer news at six - target: sci.crypt
from spectra reed eds subject re why the drive speeds differ reed organization reed college portland oregon lines in article content alana org a a content alana org a writes the most likely explanation may have something to do with the fact that a greater density of information exists on the larger capacity disk drive than the smaller one if your running the drive on a mac i would recommend a shareware utility called timeline which tests seek sci throughput and rotational speed this utility should let you know what the differences are between the drives the views expressed in this posting those of the individual author only bus number malcontent is victorious first iconic bus larger drives tend to have multiple platters which can allow adjacent bits to be read in parallel resulting in higher throughput they also have higher spindle speeds which leads to both increased throughput and reduced seek times specimen - target: comp.sys.mac.hardware
from straight sitcom com subject back doors in clipper organization lines i think it very unlikely there are back doors in clipper for two reasons the government does need them if it can get the key and yes i assume that the official government obeys court orders that the design of the chip and its approval were official it would defeat the whole purpose of providing secure crypt for american business that could be read by our economic adversaries if this were not a legitimate and genuine purpose and as many think the asa can read de why bother otherwise rational responses preferred to conspiracy theories thanks david david starlight great care has been taken to ensure the accuracy of our information errors and omissions excepted - target: sci.crypt
from access die com subject re dorothy denying opposes clipper capstone wiretap chips organization express access online communications greenbelt my us lines access die net i believe there is no technical means of ensuring key escrow without the government maintaining a secret of some kind not necessarily for instance in the system outlined in the may issue of byte the process of getting one public key listed for general use involves giving pieces of your private key to escrow agencies which do calculations on those pieces and forward the result to the publishers of the public key directory which combines these results into your listed public key if you try to give the escrow agencies pieces which yield your private key when they are all put together the result is that the public key listed for you is wrong and you a read messages encrypted to you - target: sci.crypt
Fig. 2: An example set of documents that have been clustered together, shown with their actual group labels

IV Conclusion

Topic modeling, reinforced by its integration with document representation learning, is a critical problem with numerous applications. Both traditional and recent approaches to extracting semantic features from text documents have been shown to help obtain representative clusters of semantically similar documents. In this work, we proposed FAME, an open-source software library that makes it easy to conduct experiments in topic modeling and document clustering.