On Efficiently Partitioning a Topic in Apache Kafka

05/19/2022
by Theofanis P. Raptis, et al.

Apache Kafka addresses the general problem of delivering extremely high-volume event data to diverse consumers via a publish-subscribe messaging system. It uses partitions to scale a topic across many brokers, so that producers can write data in parallel and consumers can read it in parallel. Even though Apache Kafka provides some out-of-the-box optimizations, it does not strictly define how each topic should be efficiently distributed into partitions. The well-formulated fine-tuning needed to improve the performance of an Apache Kafka cluster is still an open research problem. In this paper, we first model the Apache Kafka topic partitioning process for a given topic. Then, given the set of brokers, constraints and application requirements on throughput, OS load, replication latency and unavailability, we formulate the optimization problem of finding how many partitions are needed and show that it is computationally intractable, being an integer program. Furthermore, we propose two simple, yet efficient heuristics to solve the problem: the first tries to minimize and the second to maximize the number of brokers used in the cluster. Finally, we evaluate their performance via large-scale simulations, considering as benchmarks some Apache Kafka cluster configuration recommendations provided by Microsoft and Confluent. We demonstrate that, unlike the recommendations, the proposed heuristics respect the hard constraints on replication latency and perform better w.r.t. unavailability time and OS load, using the system resources in a more prudent way.
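To make the contrast between the two broker-usage goals concrete, the following is a minimal sketch of placing partition replicas either on as few brokers as possible or spread over all of them. It is not the paper's heuristics, which the abstract does not specify. The function names, the `capacity` parameter (a hypothetical per-broker replica limit standing in for the resource constraints) and the greedy rule are all assumptions made for illustration.

```python
from typing import Dict, List


def assign_replicas(num_partitions: int, replication_factor: int,
                    num_brokers: int, capacity: int,
                    spread: bool) -> Dict[int, List[int]]:
    """Greedily place each partition's replicas on distinct brokers.

    capacity: hypothetical maximum number of replicas per broker.
    spread=False packs replicas onto already used brokers first;
    spread=True always picks the least-loaded brokers.
    """
    if replication_factor > num_brokers:
        raise ValueError("replication factor exceeds broker count")
    load = [0] * num_brokers
    placement: Dict[int, List[int]] = {}
    for p in range(num_partitions):
        # Only brokers with spare capacity are eligible.
        candidates = [b for b in range(num_brokers) if load[b] < capacity]
        if len(candidates) < replication_factor:
            # Greedy choice; a feasible placement may still exist.
            raise ValueError("greedy placement ran out of broker capacity")
        if spread:
            candidates.sort(key=lambda b: (load[b], b))   # least loaded first
        else:
            candidates.sort(key=lambda b: (-load[b], b))  # most loaded first
        chosen = candidates[:replication_factor]
        for b in chosen:
            load[b] += 1
        placement[p] = chosen
    return placement


def brokers_used(placement: Dict[int, List[int]]) -> int:
    """Count the distinct brokers that host at least one replica."""
    return len({b for replicas in placement.values() for b in replicas})


packed = assign_replicas(4, 2, 6, capacity=4, spread=False)
spread_out = assign_replicas(4, 2, 6, capacity=4, spread=True)
print(brokers_used(packed), brokers_used(spread_out))
```

With four partitions, replication factor 2, six brokers and a capacity of four replicas per broker, the packing variant keeps everything on two brokers while the spreading variant touches all six. A real formulation would replace the single `capacity` limit with the throughput, OS load, replication latency and unavailability constraints named in the abstract.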
