Sifaka: Text Mining Above a Search API

10/05/2018
by Cameron VandenBerg, et al.

Text mining and analytics software has become popular, but little attention has been paid to the software architectures of such systems. Often they are built from scratch using special-purpose software and data structures, which increases their cost and complexity. This demo paper describes Sifaka, a new open-source text mining application constructed above a standard search engine index using existing application programming interface (API) calls. Indexing integrates popular annotation software libraries to augment the full-text index with noun phrases and named entities; n-grams are also provided. Sifaka enables a person to quickly explore and analyze large text collections using search, frequency analysis, and co-occurrence analysis. It also lets a person import existing document labels or interactively construct document sets that are positive or negative examples of new concepts, perform feature selection, and export feature vectors compatible with popular machine learning software. Sifaka demonstrates that search engines are good platforms for text mining applications while also making common IR text mining capabilities accessible to researchers in disciplines where programming skills are less common.
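
To make the idea of running frequency and co-occurrence analysis above a search index concrete, here is a minimal sketch in Python. It is not Sifaka's implementation and does not use any real search engine API: the toy in-memory inverted index and the helper names `build_index`, `doc_frequency`, and `cooccurrence` are assumptions introduced purely for illustration. The point is only that both analyses reduce to operations on posting lists that an index already stores.

```python
from collections import defaultdict

# Hypothetical toy index: a stand-in for the posting lists a real
# search engine index would expose through its API.
def build_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def doc_frequency(index, term):
    """Frequency analysis: number of documents containing the term."""
    return len(index.get(term, set()))

def cooccurrence(index, term_a, term_b):
    """Co-occurrence analysis: number of documents containing both terms,
    computed by intersecting the two posting lists."""
    return len(index.get(term_a, set()) & index.get(term_b, set()))

docs = {
    1: "search engines support text mining",
    2: "text mining needs feature selection",
    3: "search engines index documents",
}
index = build_index(docs)
print(doc_frequency(index, "text"))
print(cooccurrence(index, "search", "mining"))
```

In this sketch no special-purpose data structure is built for mining; the same term-to-documents mapping that answers search queries also answers the frequency and co-occurrence questions.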


