How Apache Pulsar Solves Kafka’s Scalability Issues
Apache Kafka and Apache Pulsar compete as the messaging foundation for the data center architecture behind the world’s most popular web and mobile applications, and scalability is the main basis of that competition. Websites with hundreds of millions, or more than one billion, registered users operate at hyperscale, which places a unique set of demands on cloud hardware.
The Apache HTTP Server traditionally struggled at around 10,000 simultaneous connections (the "C10k problem"), which led to the development of NGINX as an alternative and, later, to elastic web server clusters as a cloud data center solution (for example, AWS EC2 or Kubernetes).
With cluster servers, web and mobile software developers can deploy their code to multiple data centers around the world and use geo-location with load balancing to achieve better stream processing speeds, high availability, continuity of service, and legal compliance.
This article discusses the scalability differences between Apache Pulsar and Kafka with reference to the web and mobile traffic requirements of enterprise DevOps teams.
Kafka vs. Pulsar: Enterprise Hyperscale Software Support
With elastic cluster servers running at hyperscale on Apache Kafka or Pulsar, software development teams build on streaming event and message architecture to operate the world’s most popular social networking and ecommerce websites. DevOps has become the heart of cloud data center support, with the unique challenge of keeping websites like Facebook, Twitter, Amazon, Pinterest, and LinkedIn online while delivering personalized content to hundreds of millions of active users.
- Websites operating at hyperscale generate billions of events per hour through activity such as posting content to profiles, making purchases online, uploading images and videos, commenting, or registering new accounts.
- Automated elastic cloud clusters scale containers of pre-configured web servers up and down on demand to match web traffic, using various methods to keep sharded databases synchronized.
In DevOps, these services require advanced load balancing of I/O requests, routing each message to the data center hardware that can process it and respond to the user most quickly. The routing decision may be based on geo-location, web server availability, congestion, and similar factors.
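As a rough illustration of that routing decision, the sketch below picks a data center by combining measured latency, current load, and availability. The data center names, the numbers, and the scoring formula (latency divided by spare capacity) are all assumptions made for this example; production load balancers use far more elaborate policies.

```python
from dataclasses import dataclass


@dataclass
class DataCenter:
    name: str
    latency_ms: float       # hypothetical round-trip time to the user
    load: float             # fraction of capacity in use, 0.0 to 1.0
    available: bool = True  # False if health checks are failing


def pick_data_center(centers, max_load=0.9):
    """Return the healthy, non-congested data center with the best score."""
    candidates = [c for c in centers if c.available and c.load < max_load]
    if not candidates:
        raise RuntimeError("no data center can accept the request")
    # Assumed scoring rule: latency grows in cost as spare capacity shrinks.
    return min(candidates, key=lambda c: c.latency_ms / (1.0 - c.load))


centers = [
    DataCenter("us-east", latency_ms=20.0, load=0.95),   # closest but congested
    DataCenter("eu-west", latency_ms=80.0, load=0.30),
    DataCenter("ap-south", latency_ms=150.0, load=0.10, available=False),
    DataCenter("us-west", latency_ms=60.0, load=0.70),
]

chosen = pick_data_center(centers)
print(chosen.name)
```

With these made-up figures the nearest data center is skipped because it is congested, and a farther but lightly loaded one is chosen instead.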
Hyperscale social networking and ecommerce software also requires customized namespace allocation for software-defined networking (SDN) across the distributed architecture. This can be further optimized by integrating serverless and AI/ML/DL platforms.
Kafka vs. Pulsar: Scaling Cloud Software for Enterprise
Some of the best information on the differences between Apache Kafka and Pulsar can be found in the reviews of data center engineers and software developers working on the world’s most popular websites and mobile applications. Many engineers prefer Apache Pulsar over Kafka because the platform uses Apache BookKeeper as its native storage layer.
“BookKeeper is a service that provides persistent storage of streams of log entries—aka records—in sequences called ledgers. BookKeeper replicates stored entries across multiple servers. In BookKeeper each unit of a log is an entry (aka record); streams of log entries are called ledgers; and individual servers storing ledgers of entries are called bookies. BookKeeper is designed to be reliable and resilient to a wide variety of failures.”
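The vocabulary in that description (entries, ledgers, bookies) can be made concrete with a toy in-memory model. This is only a sketch: the class names, the round-robin striping rule, and the `failed` parameter are simplifications chosen for illustration and are not the BookKeeper client API. A real BookKeeper cluster also handles acknowledgement quorums, fencing, and recovery, none of which is modeled here.

```python
from dataclasses import dataclass, field


@dataclass
class Bookie:
    """A storage server. Maps (ledger_id, entry_id) to the entry payload."""
    name: str
    entries: dict = field(default_factory=dict)


@dataclass
class Ledger:
    """An append-only sequence of entries replicated across bookies."""
    ledger_id: int
    ensemble: list      # the bookies this ledger may write to
    write_quorum: int   # number of copies stored for each entry
    next_entry_id: int = 0

    def __post_init__(self):
        if not 1 <= self.write_quorum <= len(self.ensemble):
            raise ValueError("write_quorum must be between 1 and ensemble size")

    def add_entry(self, payload: bytes) -> int:
        entry_id = self.next_entry_id
        size = len(self.ensemble)
        # Stripe copies over consecutive bookies, starting at entry_id mod size.
        for offset in range(self.write_quorum):
            bookie = self.ensemble[(entry_id + offset) % size]
            bookie.entries[(self.ledger_id, entry_id)] = payload
        self.next_entry_id += 1
        return entry_id

    def read_entry(self, entry_id: int, failed=()) -> bytes:
        # Any surviving bookie that holds a copy can serve the read.
        for bookie in self.ensemble:
            if bookie.name in failed:
                continue
            key = (self.ledger_id, entry_id)
            if key in bookie.entries:
                return bookie.entries[key]
        raise KeyError(f"entry {entry_id} is unavailable")


bookies = [Bookie("bookie-0"), Bookie("bookie-1"), Bookie("bookie-2")]
ledger = Ledger(ledger_id=1, ensemble=bookies, write_quorum=2)
for i in range(3):
    ledger.add_entry(f"event-{i}".encode())

# Entry 0 was written to bookie-0 and bookie-1, so it survives losing bookie-0.
print(ledger.read_entry(0, failed={"bookie-0"}))
```

Because every entry is stored on more than one bookie, losing a single server in this model does not make any entry unreadable.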
Consider the difficulty of storing topic and keyword data so that it can be retrieved quickly enough to serve hundreds of millions of queries from hundreds of millions of users every hour!
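- Kafka stores each partition as a directory of log segment files with accompanying index files on the local disks of the brokers that host it, so redistributing load means copying partition data between brokers, which can cause performance issues at scale.
- Pulsar utilizes Apache BookKeeper as a separate storage layer, so its brokers hold no durable message data of their own and can serve any topic whose segments are stored in BookKeeper.
Because a Kafka partition’s log is tied to the specific brokers that hold its replicas, the stored data has to be copied to new brokers whenever a cluster is expanded or rebalanced, causing potential performance lags at the scale of enterprise traffic over time in production.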
- Apache Pulsar adopts segment-centric storage and layered architecture with Apache BookKeeper for “instant scaling without data rebalancing”.
- Apache Kafka is a partition-centric pub/sub system, while Apache Pulsar is a segment-centric pub/sub system.
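The difference between the two storage models can be sketched with a toy simulation of what happens when one node is added to a three-node cluster. Everything here is an assumption made for illustration: the modulo placement rule, the partition sizes, and the node names. Real Kafka reassignments are planned by operators or tooling, and real BookKeeper placement is policy-driven; the point is only that one model moves existing data and the other does not.

```python
def bytes_moved_partition_centric(partition_sizes, old_brokers, new_brokers):
    """Bytes copied if whole partitions are re-placed by (partition id mod broker count)."""
    moved = 0
    for partition_id, size in enumerate(partition_sizes):
        if partition_id % old_brokers != partition_id % new_brokers:
            moved += size  # the entire partition log is copied to its new broker
    return moved


class SegmentStore:
    """Toy segment-centric store: each new segment goes to the next node in turn."""

    def __init__(self, nodes):
        self.nodes = list(nodes)
        self.placement = []  # node chosen for each segment, in write order

    def add_node(self, node):
        self.nodes.append(node)  # existing segments are not touched

    def write_segment(self):
        node = self.nodes[len(self.placement) % len(self.nodes)]
        self.placement.append(node)
        return node


# Partition-centric: six partitions of 100 units each, growing from 3 to 4 brokers.
moved = bytes_moved_partition_centric([100] * 6, old_brokers=3, new_brokers=4)
print("partition-centric units copied:", moved)

# Segment-centric: three segments already written, then a fourth node joins.
store = SegmentStore(["bookie-1", "bookie-2", "bookie-3"])
for _ in range(3):
    store.write_segment()
before = list(store.placement)
store.add_node("bookie-4")
newest = store.write_segment()
print("old segments unchanged:", store.placement[:3] == before)
print("new segment written to:", newest)
```

In the segment-centric model the new node starts taking writes immediately, while the segments written earlier stay where they are.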
With Apache Pulsar and BookKeeper integration, recovery from node failure also tends to be faster, because a topic partition is stored as a series of segments (ledgers) spread across many bookies rather than as a single log bound to one broker. The entries held by a failed bookie can be re-replicated from the remaining copies while new writes continue on other bookies. This distributed segment storage can lead to faster processing of data streams in event and message queues for popular cloud applications. Apache ZooKeeper is used for cluster metadata and coordination.
Kafka vs. Pulsar: Brokers, Topics, Logs, and Namespaces
Most companies cannot invest as much money in talent and resources for cloud computing research and development as Google, Microsoft, Apple, IBM, and Amazon do, so they seek to emulate those companies’ best practices. Apache Kafka evolved out of the requirements of LinkedIn, which included “over 800 billion messages per day which amounts to over 175 terabytes of data. Over 650 terabytes of messages (were) then consumed daily.” Apache Pulsar grew out of Yahoo! Labs, in support of the company’s search portal, email, advertising, gaming, and news services.
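To put the LinkedIn figures in perspective, a quick back-of-the-envelope calculation converts them into per-second and per-message terms. It assumes decimal terabytes (10^12 bytes) and treats the quoted "over" values as exact, so the results are rough lower bounds rather than measurements.

```python
SECONDS_PER_DAY = 24 * 60 * 60

messages_per_day = 800e9          # "over 800 billion messages per day"
bytes_produced_per_day = 175e12   # "over 175 terabytes", assuming 1 TB = 10**12 bytes
bytes_consumed_per_day = 650e12   # "over 650 terabytes ... consumed daily"

messages_per_second = messages_per_day / SECONDS_PER_DAY
average_message_bytes = bytes_produced_per_day / messages_per_day
read_fan_out = bytes_consumed_per_day / bytes_produced_per_day

print(f"{messages_per_second:,.0f} messages per second on average")
print(f"{average_message_bytes:.2f} bytes per message on average")
print(f"each byte written is read about {read_fan_out:.1f} times")
```

Under these assumptions the workload averages more than nine million small messages per second, and each message is read several times by different consumers.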
The problems that enterprise social networking sites have with scaling Kafka can be best illustrated by the example of Pinterest. The company uses Apache Kafka to support over 250 million active users, but has reported: “With thousands of brokers running in the cloud, we have broker failures almost every day.” Other issues reported by enterprise users were related to the use of topics without sufficient “globally unique namespacing.” Integration with Apache BookKeeper is the main difference between Kafka and Pulsar aimed at solving the storage-related problems, while Pulsar’s tenant and namespace hierarchy for topic names addresses the namespacing issue.
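On the namespacing point, Pulsar topic names follow the form persistent://tenant/namespace/topic (or non-persistent://...), which keeps a topic name unique across teams sharing a cluster. The small parser below is a sketch of that naming scheme using only the standard library; it is not part of any Pulsar client, and the tenant and namespace names in the usage lines are hypothetical.

```python
from typing import NamedTuple


class TopicName(NamedTuple):
    domain: str      # "persistent" or "non-persistent"
    tenant: str
    namespace: str
    topic: str


def parse_topic(name: str) -> TopicName:
    """Split a fully qualified Pulsar-style topic name into its parts."""
    domain, separator, rest = name.partition("://")
    if not separator or domain not in ("persistent", "non-persistent"):
        raise ValueError(f"unsupported topic name: {name!r}")
    parts = rest.split("/", 2)
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected tenant/namespace/topic in {name!r}")
    return TopicName(domain, *parts)


# Two teams can both own a topic called "orders" without colliding.
retail = parse_topic("persistent://retail/checkout/orders")
analytics = parse_topic("persistent://analytics/reports/orders")

print(retail)
print(retail.topic == analytics.topic, retail == analytics)
```

The local topic name is the same in both cases, but the fully qualified names differ, so producers and consumers for one team never attach to the other team’s topic by accident.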
Many enterprise data center operations in the midst of a cloud transition encounter sprawl across resources, which results in too many messaging systems being used to support software and databases in production. This increases the administrative burden and requires larger teams of trained DevOps engineers in addition to the programming teams. Reducing the amount of data center support required for backend operations is critical to managing the cost of running software at hyperscale, as is increasing hardware efficiency.
Apache Pulsar: Optimize Cloud Software for Hyperscale
Just as there has been significant debate over the use of relational vs. non-relational databases (SQL vs. NoSQL) in hyperscale enterprise applications, or over the adoption of different programming languages, web server platforms, and operating systems for application development, IT pros now discuss the relative benefits of adopting Apache Kafka vs. Pulsar.
Because the messaging platform is a fundamental component of cloud data center architecture and influences how enterprise software applications are programmed, this debate needs to take place at an early stage of a transformation. A number of large Kafka users are now moving to Apache Pulsar to solve the issues related to scale in enterprise software and data center management.