MapReduce for Counting Word Frequencies with MPI and GPUs

05/21/2022
by Nithin Kavi, et al.

In this project, we use the Julia programming language and parallelization to write a fast MapReduce algorithm that counts word frequencies across large numbers of documents. We first implement the word frequency counter on a CPU using two processes with MPI. We then create a second implementation on a GPU using the Julia CUDA library, without relying on the built-in MapReduce algorithm in FoldsCUDA.jl. We apply our CPU and GPU algorithms to count the frequencies of words in speeches given by Presidents George W. Bush, Barack H. Obama, Donald J. Trump, and Joseph R. Biden, with the aim of finding patterns in word choice that could be used to uniquely identify each President. We find that each President does use certain words distinctly more often than his fellow Presidents, and that these words are not surprising given the political climate at the time. More broadly, the goal of this project was to create MapReduce algorithms in Julia on the CPU and GPU that are faster than those already written. We present some simple cases of mapping functions where our GPU algorithm outperforms the FoldsCUDA.jl implementation. We also discuss ideas for further optimizations, both for counting word frequencies in documents and for these specific mapping functions.
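
To make the map and reduce steps concrete, the following is a minimal sketch of a word frequency count, written in plain Python rather than Julia and run sequentially rather than across MPI processes or a GPU. It is not the authors' implementation: the function names (`map_chunk`, `merge_counts`, `word_frequencies`), the round-robin split, and the tokenization rule are all assumptions chosen for illustration. The default of two workers only mirrors the two-process CPU setup mentioned above.

```python
from collections import Counter
from functools import reduce
import re


def map_chunk(documents):
    """Map step: tally the words in one worker's share of the documents."""
    counts = Counter()
    for text in documents:
        # Assumed tokenization: lowercase runs of letters and apostrophes.
        counts.update(re.findall(r"[a-z']+", text.lower()))
    return counts


def merge_counts(left, right):
    """Reduce step: combine two partial tallies by adding counts per word."""
    return left + right


def word_frequencies(documents, num_workers=2):
    """Split the documents, map each share, then reduce the partial tallies."""
    # Round-robin split; each share stands in for one process's input.
    chunks = [documents[i::num_workers] for i in range(num_workers)]
    # In a parallel version each map_chunk call would run on its own worker.
    partials = [map_chunk(chunk) for chunk in chunks]
    return reduce(merge_counts, partials, Counter())


docs = [
    "the economy and the people",
    "the people of America",
    "jobs and the economy",
]
totals = word_frequencies(docs)
print(totals.most_common(3))
```

Because adding counts is associative and commutative, the partial tallies can be merged in any order, which is what allows the map step to be distributed across workers without changing the result.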

