Scalable Data Pipeline Design using Apache Kafka and Hadoop Ecosystem
Keywords:
Scalable data pipelines, Apache Kafka, Hadoop ecosystem, Spark Structured Streaming, HDFS, data ingestion, stream processing, exactly-once semantics, system scalability, fault tolerance.

Abstract
Scalable data pipelines are foundational to modern real-time analytics, ingestion, and processing architectures. This study explores the design, implementation, and evaluation of a robust, scalable pipeline built on Apache Kafka for stream ingestion and processing and the Hadoop ecosystem (HDFS, Hive, Spark) for storage and batch/interactive analytics. We propose a modular architecture comprising Kafka producers, distributed Kafka clusters, Kafka connectors, Spark Structured Streaming consumers, HDFS storage layers, and Hive metastore integration. Our contributions include optimized partition/replication strategies for high-throughput publishers; automated failover and monitoring via Kafka’s MirrorMaker and Prometheus; and dynamic Spark Structured Streaming jobs that scale with workload while maintaining end-to-end exactly-once semantics. Benchmarking over terabytes of synthetic and real event data shows end-to-end latency below 5 seconds at input rates of 1M events/sec, with horizontal scale-out sustaining throughput at less than 15% CPU/memory overhead. Storage optimization via HDFS compression and Hive partitioning reduces the data footprint by roughly 60% and improves query performance by 4×. A cost analysis in cloud deployment scenarios demonstrates a 35% cost advantage over traditional message-queue and ETL-to-database solutions. The platform’s generic design makes it applicable to diverse contexts, including fraud detection, IoT stream analytics, and log/event warehousing. We provide insights on deployment best practices, operational monitoring, and tuning strategies, guiding practitioners toward scalable, fault-tolerant, and cost-effective real-time pipelines.
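To make the Kafka-to-HDFS path described above concrete, the following is a minimal PySpark sketch of a Spark Structured Streaming job that consumes a Kafka topic and writes Parquet files to HDFS with checkpointing, the mechanism that underpins fault tolerance and exactly-once output for the file sink. The topic name, broker addresses, HDFS paths, and trigger interval are illustrative placeholders, not values taken from the study.

```python
# Sketch: Kafka -> Spark Structured Streaming -> HDFS (Parquet).
# Requires the Kafka source package, e.g. submitted with
#   --packages org.apache.spark:spark-sql-kafka-0-10_2.12:<spark-version>
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = (
    SparkSession.builder
    .appName("kafka-to-hdfs-pipeline")  # hypothetical application name
    .getOrCreate()
)

# Subscribe to a Kafka topic; "events" and the broker list are placeholders.
events = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker1:9092,broker2:9092")
    .option("subscribe", "events")
    .option("startingOffsets", "latest")
    .load()
)

# Kafka delivers key/value as binary; cast the payload to strings for downstream parsing.
parsed = events.select(
    col("key").cast("string").alias("key"),
    col("value").cast("string").alias("value"),
    col("timestamp"),
)

# Write to HDFS as Parquet. The checkpoint directory lets Spark track Kafka
# offsets and sink state, giving exactly-once output for the file sink.
query = (
    parsed.writeStream
    .format("parquet")
    .option("path", "hdfs:///data/events")               # hypothetical HDFS path
    .option("checkpointLocation", "hdfs:///chk/events")   # hypothetical checkpoint path
    .trigger(processingTime="10 seconds")
    .start()
)

query.awaitTermination()
```

In a deployment like the one evaluated here, the output directory would typically be registered as a partitioned external Hive table so that batch and interactive queries can run over the same data the streaming job lands.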
