Analyzing Large-Scale, Distributed and Uncertain Data
The exponential growth of data in recent years and the demand to extract information and knowledge from it present new challenges for database researchers. Existing database systems and algorithms are no longer capable of effectively handling such large data sets. MapReduce is a novel programming paradigm for processing distributable problems over large-scale data using a computer cluster. In this work we explore the MapReduce paradigm from three different angles. We begin by examining a well-known problem in the field of data mining: mining closed frequent itemsets over a large dataset. By harnessing the power of MapReduce, we present a novel algorithm for mining closed frequent itemsets that outperforms existing algorithms. Next, we explore one of the fundamental implications of "Big Data": the data is not known with complete certainty. A probabilistic database is a relational database in which each tuple is annotated with the probability of its existence. A natural development built on MapReduce is a distributed relational database management system, in which relational operations are expressed as combinations of map and reduce functions. We take this development a step further by proposing a query optimizer over a distributed, probabilistic database. Finally, we analyze Hadoop, the best-known implementation of MapReduce, aiming to overcome one of its major drawbacks: it does not directly support the explicit specification of data that is repeatedly processed across different jobs. Many data-mining algorithms, such as clustering and association-rule mining, require iterative computation: the same data are processed again and again until the computation converges or a stopping condition is satisfied. We propose a modification to Hadoop that supports efficient access to the same data across different jobs.
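The map and reduce functions at the heart of this paradigm can be illustrated with a toy support-counting task, which is also the building block of frequent-itemset mining. The following is a minimal, single-machine sketch in plain Python; the dataset, the function names, and the in-memory runner standing in for Hadoop's shuffle phase are illustrative assumptions, not the closed-frequent-itemset algorithm proposed in this work.

from collections import defaultdict
from itertools import combinations

def map_phase(transaction):
    """Emit (itemset, 1) pairs for every 1- and 2-itemset in one transaction."""
    items = sorted(transaction)
    for k in (1, 2):
        for itemset in combinations(items, k):
            yield itemset, 1

def reduce_phase(itemset, counts):
    """Sum the partial counts for one itemset to obtain its support."""
    return itemset, sum(counts)

def run(transactions, min_support):
    """Single-machine stand-in for a MapReduce job: map, group by key, reduce."""
    grouped = defaultdict(list)
    for t in transactions:
        for itemset, one in map_phase(t):
            grouped[itemset].append(one)
    results = (reduce_phase(i, c) for i, c in grouped.items())
    return {i: s for i, s in results if s >= min_support}

if __name__ == "__main__":
    data = [{"bread", "milk"}, {"bread", "beer"}, {"bread", "milk", "beer"}]
    print(run(data, min_support=2))

Because the mapper is applied independently to each transaction and the reducer only aggregates counts per key, the same computation can be partitioned across the machines of a cluster, which is what makes the paradigm attractive for the large-scale mining problems discussed above.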