Real-time Text Analytics Pipeline Using Open-source Big Data Tools

12/12/2017
by Hassan Nazeer, et al.

Real-time text processing systems are required in many domains to quickly identify patterns, trends, sentiments, and insights. Today, social networks, e-commerce stores, blogs, scientific experiments, and server logs are the main sources of large volumes of text data. Processing such data in real time requires a data processing pipeline, and the main challenge in building one is minimizing latency while handling high-throughput data. In this paper, we explain and evaluate our proposed real-time text processing pipeline, built from open-source big data tools chosen to minimize the latency of processing data streams. The pipeline uses Apache Kafka for data ingestion, Apache Spark for in-memory data processing, Apache Cassandra for storing processed results, and the D3 JavaScript library for visualization. We evaluate the effectiveness of the proposed pipeline under varying deployment scenarios by performing sentiment analysis on a Twitter dataset. In our experimental evaluation, the pipeline processed 466,700 Tweets in 10.7 minutes with less than a minute of latency when three virtual machines were allocated to it.
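The stages named in the abstract (ingest, process, store) can be illustrated with a minimal, standard-library-only Python sketch. This is not the authors' implementation: an in-memory queue stands in for Kafka, a toy word-list classifier stands in for the Spark sentiment job, and a `Counter` stands in for Cassandra. The lexicon and the names `score_sentiment`, `run_pipeline`, and `throughput_per_second` are all hypothetical and introduced only for illustration.

```python
import queue
from collections import Counter

# Hypothetical toy lexicon; the paper's actual sentiment method is not described here.
POSITIVE = {"good", "great", "love", "happy"}
NEGATIVE = {"bad", "awful", "hate", "sad"}


def score_sentiment(text):
    """Label a text by counting lexicon hits (stand-in for the processing stage)."""
    words = [w.strip(".,!?#@").lower() for w in text.split()]
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


def run_pipeline(tweets):
    """Push tweets through ingest -> process -> store and return label counts."""
    ingest = queue.Queue()  # stand-in for the ingestion layer
    for tweet in tweets:
        ingest.put(tweet)

    store = Counter()  # stand-in for the results store
    while not ingest.empty():
        store[score_sentiment(ingest.get())] += 1
    return dict(store)


def throughput_per_second(n_items, minutes):
    """Average throughput implied by an item count and a wall-clock duration."""
    return n_items / (minutes * 60.0)


if __name__ == "__main__":
    print(run_pipeline(["I love this", "This is awful!", "Just a tweet"]))
    # Figures reported in the abstract: 466,700 Tweets in 10.7 minutes.
    print(round(throughput_per_second(466_700, 10.7)))
```

Applied to the figures reported above, `throughput_per_second(466_700, 10.7)` gives an average of roughly 727 Tweets per second. This is a derived average over the whole run, not a number stated by the authors.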

research
11/23/2021

Real-time intelligent big data processing: technology, platform, and applications

Human beings keep exploring the physical space using information means. ...
research
05/09/2023

High-throughput Cotton Phenotyping Big Data Pipeline Lambda Architecture Computer Vision Deep Neural Networks

In this study, we propose a big data pipeline for cotton bloom detection...
research
09/14/2022

PAPyA: Performance Analysis of Large RDF Graphs Processing Made Easy

Prescriptive Performance Analysis (PPA) has shown to be more useful than...
research
09/09/2016

Nanosurveyor: a framework for real-time data processing

Scientists are drawn to synchrotrons and accelerator based light sources...
research
03/06/2022

An Adapter Architecture for Heterogeneous Data Processing in Bioinformatics Pipelines

Bioinformatics is a growing field focused on both the domains of compute...
research
01/11/2018

Polypus: a Big Data Self-Deployable Architecture for Microblogging Text Extraction and Real-Time Sentiment Analysis

In this paper we propose a new parallel architecture based on Big Data t...
research
11/27/2018

Cloud based Real-Time and Low Latency Scientific Event Analysis

Astronomy is well recognized as big data driven science. As the novel ob...
