Rapidgzip: Parallel Decompression and Seeking in Gzip Files Using Cache Prefetching

08/17/2023
by   Maximilian Knespel, et al.

Gzip is a ubiquitous file compression format. Although a multitude of gzip implementations exist, only pugz can fully utilize current multi-core processor architectures for decompression. Yet pugz cannot decompress arbitrary gzip files: it requires the decompressed stream to contain only byte values 9-126. In this work, we present a generalization of the parallelization scheme used by pugz that can be reliably applied to arbitrary gzip-compressed data without compromising performance. We show that the requirements pugz places on the file contents can be dropped by implementing an architecture based on a cache and a parallelized prefetcher. This architecture safely handles the faulty decompression results that can appear when threads use trial and error to start decompressing in the middle of a gzip file. Using 128 cores, our implementation reaches a decompression bandwidth of 8.7 GB/s for gzip-compressed base64-encoded data, a speedup of 55 over single-threaded GNU gzip, and 5.6 GB/s for the Silesia corpus, a speedup of 33 over GNU gzip.
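The cache-plus-prefetcher idea can be illustrated with a minimal Python sketch. It is not the rapidgzip implementation: the class name `PrefetchingChunkCache` and its parameters are hypothetical, and the hard part of the paper (finding valid starting points in the middle of an arbitrary gzip stream by trial and error) is deliberately sidestepped by assuming the input is already split into independently decodable chunks, here concatenated gzip members. The sketch only shows how a request for one chunk can trigger background decompression of the following chunks, with results kept in a small least-recently-used cache.

```python
import gzip
from collections import OrderedDict
from concurrent.futures import ThreadPoolExecutor


class PrefetchingChunkCache:
    """Hypothetical LRU cache of decompressed chunks filled by a thread pool."""

    def __init__(self, chunks, decode, workers=4, prefetch=2, capacity=8):
        # Assumption: every entry of `chunks` can be decoded on its own.
        # `capacity` should exceed `prefetch` so fresh results are not evicted.
        self._chunks = chunks
        self._decode = decode
        self._pool = ThreadPoolExecutor(max_workers=workers)
        self._prefetch = prefetch
        self._capacity = capacity
        self._futures = OrderedDict()  # chunk index -> Future, oldest first

    def _submit(self, index):
        if index in self._futures:
            # Already cached or in flight: only refresh its LRU position.
            self._futures.move_to_end(index)
        else:
            self._futures[index] = self._pool.submit(
                self._decode, self._chunks[index]
            )

    def get(self, index):
        """Return decompressed chunk `index` and prefetch the chunks after it."""
        last = min(index + self._prefetch, len(self._chunks) - 1)
        for i in range(index, last + 1):
            self._submit(i)
        future = self._futures[index]
        # Evict least-recently-used entries beyond the configured capacity.
        while len(self._futures) > self._capacity:
            self._futures.popitem(last=False)
        return future.result()

    def close(self):
        self._pool.shutdown(wait=True)


# Demo: six independent gzip members stand in for chunks of one large file.
parts = [bytes([i]) * 1000 for i in range(6)]
members = [gzip.compress(p) for p in parts]

cache = PrefetchingChunkCache(members, gzip.decompress)
output = b"".join(cache.get(i) for i in range(len(members)))
cache.close()
```

Sequential reads benefit because the next chunks are usually already decoded when they are requested, and a seek only costs the decompression of the chunk that contains the target offset.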


