Pandio’s Top 10 Rules for Managing Apache Pulsar

Apache Pulsar is a next-generation messaging and streaming platform built to handle the demands of AI and ML workloads. Its architecture gives it several advantages over older technologies like Apache Kafka, including native multi-tenancy, geo-replication, and tiered storage. It is also a rich and complex system to configure and optimize, and realizing its benefits fully takes experience and expertise.

Below, we’ll outline Pandio’s top ten rules for managing Apache Pulsar that will help you understand what Apache Pulsar is and how you can make the most of it.

1. Remember: Apache Pulsar is Open-Source 

One thing that makes Apache Pulsar stand out among other streaming platforms is its architecture. Apache Kafka follows a more traditional design built for the data centers of the past, in which each broker also stores the data it serves. Apache Pulsar separates the two: stateless brokers serve messages while Apache BookKeeper stores them, which lets each layer scale independently and helps guard against data loss.

The Apache Pulsar system is wholly open source, meaning it can be configured to fit your company’s specific needs. 

Think of the basic version of Apache Pulsar as a blank slate on which you build your system: you must configure every part of it to fit your specific needs, or you are likely to run into errors. The first rule of Apache Pulsar is to make it your own, test it as much as possible, and rework it if necessary before you integrate it into your existing infrastructure. Configuring Apache Pulsar is technically demanding, which explains why many enterprises choose a managed service to support the platform.

Apache Pulsar originated at Yahoo and was designed to be cloud native, making it an obvious choice for next-generation workloads built to run in the cloud.

2. Understand the Functionality of Multi-Tenancy and its Policies

Multi-tenancy is one of the most important capabilities of Apache Pulsar, allowing companies to streamline their operations while cutting costs. With this architectural model, a single Apache Pulsar instance serves many tenants, which significantly reduces administrative overhead and cost compared with running a separate cluster per team.

Unlike Apache Kafka, which was designed around single-tenant clusters, Apache Pulsar’s multi-tenancy lets many teams share one system while still delivering data in real time, with latency that in most cases stays below 5 ms.

Tenants are the top-level administrative unit in Apache Pulsar, sitting above namespaces. You can spread tenants across clusters, and each tenant can have its own authentication and authorization scheme. You manage tenants by setting policies; have your administrator apply them in tiers (tenant first, then namespace) to keep a multi-tenant system orderly.

Multi-tenancy in Pulsar also isolates tenants from one another through quotas, which is a common SLA requirement.
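
The tenant and namespace hierarchy shows up directly in topic names such as persistent://public/default/new-input-topic, where "public" is the tenant and "default" is the namespace. As a minimal sketch, the helper below splits such a name into its levels. The function and class names are hypothetical and this is not part of any Pulsar client library.

```python
from typing import NamedTuple


class TopicName(NamedTuple):
    domain: str      # "persistent" or "non-persistent"
    tenant: str      # top-level administrative unit
    namespace: str   # grouping of topics inside a tenant
    topic: str       # local topic name


def parse_topic_name(full_name: str) -> TopicName:
    """Split a fully qualified topic name into its hierarchy levels."""
    domain, sep, rest = full_name.partition("://")
    if not sep or domain not in ("persistent", "non-persistent"):
        raise ValueError(f"unexpected topic domain in {full_name!r}")
    parts = rest.split("/", 2)
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected tenant/namespace/topic in {full_name!r}")
    return TopicName(domain, parts[0], parts[1], parts[2])
```

Policies set on the tenant or namespace level apply to every topic whose name falls under that prefix, which is why a consistent naming scheme matters.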

3. Geo-replication is Native – Synchronous and Asynchronous 

The way Apache Pulsar handles storage also extends across data centers. Geo-replication stores message data across multiple clusters of the same Pulsar instance, so a message published in one region can be consumed in another.

There are two types of geo-replication in Apache Pulsar: asynchronous and synchronous. The built-in default is asynchronous. A message is persisted in the cluster where it was published and then forwarded to the other clusters in the background. Because producers don’t wait for remote clusters, this is the top option for minimal publish latency.

With synchronous geo-replication, a message is acknowledged only after it has been written in more than one data center. This is a fantastic option if you’re looking to maximize data availability and consistency between sites, but it comes at the cost of higher latency and requires a storage layer that spans those data centers.

While this may not sound overly impressive, the costs of moving and accessing data across regions add up, especially when we’re talking about petabytes of data. Built-in geo-replication can therefore save companies a substantial amount of money in the long run.

4. Tiered-Storage means Lower Cost and Better Cluster Performance

Tiered storage is a much-discussed feature for messaging systems, and Apache Pulsar implements it natively. Data in a topic is stored as a sequence of segments, and it is these segments that are moved between storage tiers.

Where data lives depends on how it is used. Older data that is rarely accessed and mainly kept for analysis later down the line is offloaded to cheaper cold storage. Recent data that is accessed more frequently stays in the fast tier, allowing for near-instantaneous access.

This segmentation and categorization allow for far more efficient storage management, saving money and dramatically improving performance.

An essential thing to note is that tiered storage moves data segment by segment out of the existing storage log, and only segments that are no longer being written to can be moved. If you want old data to leave the fast tier promptly, you’ll have to offload frequently, and that means configuring the offload driver (the managedLedgerOffloadDriver setting) in broker.conf. The supported offload drivers include AWS S3 and Google Cloud Storage.
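
To make the offload decision concrete, here is a simplified model rather than Pulsar’s actual implementation. It assumes a size threshold for how much data may stay on fast storage, keeps the newest segments up to that threshold, and marks everything older for offload. The function name and the example sizes are hypothetical.

```python
from typing import List, Tuple


def select_segments_to_offload(
    segment_sizes: List[int], threshold_bytes: int
) -> Tuple[List[int], List[int]]:
    """Return (offload_indexes, keep_indexes) for segments ordered oldest first.

    The last segment is treated as still open for writes, so it always stays.
    """
    keep: List[int] = []
    offload: List[int] = []
    kept_bytes = 0
    exceeded = False
    last = len(segment_sizes) - 1
    # Walk from newest to oldest so the most recent data stays on fast storage.
    for index in range(last, -1, -1):
        size = segment_sizes[index]
        fits = kept_bytes + size <= threshold_bytes
        if index == last or (not exceeded and fits):
            keep.append(index)
            kept_bytes += size
        else:
            # Once the threshold is crossed, every older segment is offloaded.
            exceeded = True
            offload.append(index)
    return sorted(offload), sorted(keep)
```

With four segments of 100, 100, 100, and 50 bytes and a 200-byte threshold, the two oldest segments are selected for offload and the two newest stay local. A lower threshold moves data to cold storage sooner, which is the trade-off described above.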

Features such as advanced data encryption protect the data constantly flowing through Apache Pulsar, while durable storage and tiered storage reduce the risk of data loss.

5. Deploying Functions in Modes

When deploying functions in Apache Pulsar, you’ll have to choose between two modes: local run and cluster.

In local run mode, the function runs on the computer you are currently working on rather than inside the Pulsar cluster. That keeps the deployment local and out of other users’ way, which is why local run mode is mostly used for testing.

If you deploy a function in local run mode, it runs on the machine you launch it from. If you want it to connect to a broker on another machine, you’ll have to specify the broker URL manually, which you can do with the following command:

$ bin/pulsar-admin functions localrun \
  --broker-service-url pulsar://my-cluster-host:(PORT) \
  # other function parameters

The second mode, cluster mode, is the primary deployment model for Apache Pulsar. The function runs inside your Apache Pulsar cluster, on the machines that host your Apache Pulsar brokers, which makes the deployment visible and accessible to every one of your brokers. Cluster mode is the one to use for production deployment, and it lets you run an Apache Pulsar function on anything from Kubernetes to AWS and so on.

Unlike local run mode, cluster mode lets you update functions with relative ease using the following command:

$ bin/pulsar-admin functions update \
  --py myfunc.py \
  --classname myfunc.SomeFunction \
  --inputs persistent://public/default/new-input-topic \
  --output persistent://public/default/new-output-topic
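
The update command above points at a file named myfunc.py and a class named myfunc.SomeFunction. The article does not show that file, so the following is only a sketch of what it might contain. It assumes the Pulsar Functions Python SDK exposes a Function base class with a process(input, context) method, and the exclamation-mark logic is purely illustrative.

```python
try:
    # Inside a Pulsar Functions runtime the SDK base class is importable.
    from pulsar import Function
except ImportError:
    # Stand-in so the sketch can be exercised without Pulsar installed.
    class Function:
        pass


class SomeFunction(Function):
    """Appends an exclamation mark to every message it receives."""

    def process(self, input, context):
        # `context` carries runtime metadata; it is unused in this sketch.
        # The return value is what gets published to the output topic.
        return f"{input}!"
```

Because the processing logic is an ordinary method, you can call it directly in a unit test before deploying the function in either mode.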

Regardless of which deployment you go with, you’ll have to manage your deployments. You can do so through your Apache Pulsar admin interface, which allows for the creation, deletion, and management of all of your deployments.

In cluster mode the function runs inside your Pulsar cluster rather than on one developer’s machine, which makes it available to everyone who uses the cluster and lets Pulsar run multiple instances of the same function efficiently.

Furthermore, if you want to run several instances of the same function at once, you can set its parallelism when you create it, where X is the number of instances:

$ bin/pulsar-admin functions create \
  --parallelism X \
  # other function parameters

6. Apache Pulsar Offers Three Subscription Models

Apache Pulsar delivers messages through subscriptions. A subscription is a named configuration rule that determines how, and to which consumers, messages are delivered. There are three subscription models: Exclusive, Failover, and Shared. They behave very differently and serve different purposes.

Exclusive

The exclusive subscription model is one-to-one: when it is in effect, only a single consumer can attach to the subscription and receive its messages. If another consumer tries to attach to the same subscription, it receives an error. Exclusive is the default Apache Pulsar subscription model.

In the exclusive model, producers publish to the Pulsar topic, the topic feeds the subscription, and the subscription delivers to a single consumer, Listener 1.

Failover

In the failover subscription model, multiple consumers can attach to the same subscription without receiving an error, but only one of them receives messages at a time. The attached consumers are ordered by name, and the first one on the list is the master consumer, which receives all the messages.

The setup is the same as in the exclusive model, with standby consumers added for failover. If Listener 1, the master consumer, disconnects or fails, the system automatically delivers the unacknowledged and subsequent messages to Listener 2.

Shared

The shared subscription model is a little more complicated. It also allows multiple consumers to attach to the same subscription, but each message goes to only one of them. Messages are distributed across the attached consumers in round-robin fashion, so no single consumer sees the whole stream and ordering across consumers is not guaranteed.

This makes shared the most involved of the three models to reason about. Every listener in the pool is linked to the subscription, and the broker decides which listener receives each message. When a listener disconnects, the messages it had received but not yet acknowledged are redelivered to the remaining listeners.
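
The difference between the three models is easiest to see side by side. The toy model below is not Pulsar client code; it only simulates which consumer would receive each message under each model, with hypothetical consumer names and no acknowledgements or redelivery.

```python
from collections import defaultdict
from typing import Dict, List


def dispatch(messages: List[str], consumers: List[str], mode: str) -> Dict[str, List[str]]:
    """Return the messages each consumer would receive under a subscription model."""
    if not consumers:
        raise ValueError("at least one consumer is required")
    delivered: Dict[str, List[str]] = defaultdict(list)
    if mode == "exclusive":
        # Only one consumer may attach; a second one is rejected.
        if len(consumers) > 1:
            raise RuntimeError("exclusive subscription already has a consumer")
        delivered[consumers[0]].extend(messages)
    elif mode == "failover":
        # Consumers are ordered by name; the first is the master consumer.
        master = sorted(consumers)[0]
        delivered[master].extend(messages)
    elif mode == "shared":
        # Messages are spread across consumers in round-robin order.
        for i, message in enumerate(messages):
            delivered[consumers[i % len(consumers)]].append(message)
    else:
        raise ValueError(f"unknown subscription mode: {mode}")
    return dict(delivered)
```

Calling dispatch with the same four messages and two consumers gives one consumer everything under failover and splits the messages evenly under shared, while exclusive refuses the second consumer outright.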

7. Consider Pulsar Clients & Configurations

One of the many features Apache Pulsar has under its belt is that it works across many programming languages. Apache Pulsar provides client libraries for Java, Go, Python, and C++, all of which hide the details of the client-broker communication protocol behind a consistent API.

The beauty of setting up clients with Apache Pulsar lies in its simple API, which allows you to integrate it with almost anything you’d like and serve clients from there.

Furthermore, the client libraries provide transparent reconnection and connection failover between brokers, minimizing potential downtime. Data loss between clients and the system is minimized as well, because messages are queued until the broker acknowledges them. If the connection is interrupted, the client retries it with backoff.
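
Retrying with backoff means waiting a little longer after each failed attempt instead of hammering the broker. The sketch below illustrates the general technique only; the starting delay, cap, and function names are assumptions and do not reflect the values or code used by the Pulsar clients.

```python
import time
from typing import Callable, Iterator, TypeVar

T = TypeVar("T")


def backoff_delays(initial: float = 0.1, maximum: float = 30.0,
                   factor: float = 2.0) -> Iterator[float]:
    """Yield exponentially growing delays in seconds, capped at `maximum`."""
    delay = initial
    while True:
        yield min(delay, maximum)
        delay = min(delay * factor, maximum)


def connect_with_retry(connect: Callable[[], T], attempts: int = 5,
                       sleep: Callable[[float], None] = time.sleep) -> T:
    """Call `connect` until it succeeds or `attempts` tries are used up."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    delays = backoff_delays()
    for attempt in range(1, attempts + 1):
        try:
            return connect()
        except ConnectionError:
            if attempt == attempts:
                raise  # out of retries: surface the last error
            sleep(next(delays))
    raise AssertionError("unreachable")
```

The sleep function is injectable so the retry loop can be tested without actually waiting. In the real clients all of this happens for you; the sketch is only meant to show what "retries with backoff" refers to.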

Before you create a producer and consumer relationship, you’ll have to gather your broker resources. Start by listing all the available brokers in a cluster (here, a cluster named use) with the following command:

$ pulsar-admin brokers list use

After that, you’ll want to find the leader broker’s information, which you can do with this command:

$ pulsar-admin brokers leader-broker

Lastly, you’ll want to list all of the namespaces that a given broker owns, which you can do with:

$ pulsar-admin brokers namespaces use \
  --url broker1.use.org.com:8080

These three simple steps will give you all of the broker resources you need for further static or dynamic configuration or creating a producer and consumer relationship. 

Creating a producer and consumer relationship is simple in Apache Pulsar. All you need to do is access the Apache Pulsar client library and complete a straightforward two-stage process.

The first stage is determining the owner of a topic. The client does so by sending an HTTP lookup request to a broker, which usually reaches one of the presently active brokers. That broker checks the ZooKeeper metadata to find out which broker serves the topic, so the client ends up with a broker address for further setup.

In the second stage, the client opens a TCP connection to that broker and authenticates it. Over this connection, client and broker exchange binary commands and data packets defined by Pulsar’s own protocol.

All that’s left after that is for the client to send the command that creates the producer or consumer. The broker complies once it has validated the authorization policy for security reasons. Most of these two stages happen automatically in the background; all you need to do is put things into motion.

8. Use Partitions & Routing According to your Use Case

When a producer publishes to a partitioned topic, a routing mode decides which partition each message lands on. Apache Pulsar offers three routing modes: round-robin partition, single partition, and custom partition.

The first is the round-robin partition mode. As long as no key is provided, the producer publishes messages across all partitions in turn, achieving the widest possible spread of messages while still allowing them to be batched efficiently. If a message does have a key, the key is hashed to select the partition.

However, round-robin gives up ordering across the topic, so many people opt for the single partition mode instead. It is largely self-explanatory: when no key is provided, the producer picks one partition at random and publishes all of its messages there.

Custom partition mode is best suited to enterprises and large-scale operations, as it lets you control partitioning from the development side and decide exactly which messages are delivered to which partition.

If you don’t set a routing mode when creating a new producer, it defaults to round-robin. To set one, go to the ProducerConfiguration section of the config and select one of the three modes.

The single-partition and round-robin modes are straightforward, but setting up custom partitioning can be challenging. You’ll need to implement the MessageRouter interface, which has only one method, choosePartition:

public interface MessageRouter extends Serializable {
    int choosePartition(Message msg);
}

Once you’ve implemented this method, you can pass your custom router to the producer so that it sends each message to the partition your logic selects.
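
The three routing modes can be compared with a small simulation. This is a simplified model, not the Java client’s implementation: CRC32 stands in for whatever hash the client really uses, round-robin here advances on every message rather than on batch boundaries, and all class names are hypothetical.

```python
import itertools
import random
import zlib
from typing import Optional


def partition_for_key(key: str, num_partitions: int) -> int:
    """Map a key to a partition; CRC32 is a stand-in for the client's real hash."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


class RoundRobinRouter:
    """Keyless messages rotate through every partition in turn."""

    def __init__(self, num_partitions: int) -> None:
        self._num_partitions = num_partitions
        self._counter = itertools.count()

    def choose_partition(self, key: Optional[str] = None) -> int:
        if key is not None:
            return partition_for_key(key, self._num_partitions)
        return next(self._counter) % self._num_partitions


class SinglePartitionRouter:
    """Keyless messages all go to one partition picked at random up front."""

    def __init__(self, num_partitions: int) -> None:
        self._num_partitions = num_partitions
        self._partition = random.randrange(num_partitions)

    def choose_partition(self, key: Optional[str] = None) -> int:
        if key is not None:
            return partition_for_key(key, self._num_partitions)
        return self._partition


class FixedPartitionRouter:
    """Custom routing: the developer decides the partition for every message."""

    def __init__(self, partition: int) -> None:
        self._partition = partition

    def choose_partition(self, key: Optional[str] = None) -> int:
        return self._partition
```

In all three sketches a message with the same key always lands on the same partition, which is what preserves per-key ordering. The custom router plays the role that a MessageRouter implementation plays in the Java client.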

9. How to Configure Brokers & Listeners

Understanding Apache Pulsar requires understanding the concepts of brokers and topics. A broker is a stateless component that runs two things: an HTTP server and a dispatcher. The HTTP server exposes the API used for administration and topic lookup, and the dispatcher handles the transfer of messages, so consumers can connect and get the messages they need.

To configure a broker, you’ll work through the broker configuration and the administrator API. Many of the configuration options are left at their defaults, which might not be ideal for your specific needs. One of the most commonly changed broker settings is the listeners.

When configuring Apache Pulsar brokers, it’s important to specify the listeners each broker advertises in the advertisedListeners setting. Each listener entry takes the form:

<listener_name>:pulsar://<host>:<port>

If the broker needs multiple listeners, use the same format and separate the entries with commas in the configuration.
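
A quick way to sanity-check a listener string before putting it into the configuration is to parse it. The helper below is a hypothetical sketch, not a Pulsar utility, and the listener names and hosts in the usage note are made up for illustration.

```python
from typing import Dict, List
from urllib.parse import urlsplit


def parse_listeners(value: str) -> Dict[str, List[str]]:
    """Parse 'name:scheme://host:port,...' into {listener_name: [urls]}."""
    listeners: Dict[str, List[str]] = {}
    for entry in value.split(","):
        entry = entry.strip()
        if not entry:
            continue
        # The listener name is everything before the first colon.
        name, sep, url = entry.partition(":")
        parts = urlsplit(url)
        if not (sep and name and parts.scheme and parts.hostname) or parts.port is None:
            raise ValueError(f"malformed listener entry: {entry!r}")
        listeners.setdefault(name, []).append(url)
    return listeners
```

For example, parsing "internal:pulsar://10.0.0.1:6650,external:pulsar://broker.example.com:6650" yields one URL under "internal" and one under "external", while an entry without a port raises an error.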

10. Understanding Encryption and Authentication inside Pulsar

All the messages and data that flow through Apache Pulsar are subject to authentication and authorization before reaching the targeted consumer, cluster, or broker. The way the data is stored makes this process far less tedious and demanding, as data can be accessed quickly and sent out.

If you want end-to-end (E2E) encryption, in which messages are encrypted with AES and the AES key is protected by a public and private key pair, you’ll first have to create that key pair. You can do so with these commands:

openssl ecparam -name secp521r1 -genkey -param_enc explicit -out test_ecdsa_privkey.pem

openssl ec -in test_ecdsa_privkey.pem -pubout -outform pem -out test_ecdsa_pubkey.pem

After you’ve created the keys, add them to your key management system and configure retrieval for both the public and the private key.

Retrieval is handled by implementing the CryptoKeyReader interface. This method returns the public key:

CryptoKeyReader.getPublicKey()

Retrieving the private key is equally easy; implement this method:

CryptoKeyReader.getPrivateKey()

Lastly, add the encryption key name to the producer:

PulsarClient.newProducer().addEncryptionKey("myapp.key")

In Conclusion

Apache Pulsar is a streaming platform and messaging system built for the future, and it is already widely used and available. Furthermore, due to its open-source nature, it can be modified to fit the specific requirements of any company that uses it.

Apache Pulsar is an intricate piece of software to operate, but using and interacting with it is quite easy from the clients’ and consumers’ perspectives. That’s why many companies turn to managed services for their Apache Pulsar needs.

About Pandio

Pandio is one of the premier managed Apache Pulsar service providers currently on the market. With extensive data science experience, the experts at Pandio are well equipped to help enterprises that are looking to scale their distributed messaging environments and accelerate their path to AI.
