Brief Intro
One of the most exciting features of Apache Pulsar is Pulsar Functions.
The general premise is, if you have a series of messages/events, you can apply arbitrary logic against each message in a serverless and stateful way.
This feels similar to how AWS Lambda feels, where you just write your code knowing that a payload is going to be passed to your code and anything can be done to it at that point.
Pulsar Functions is also serverless like AWS Lambda. It gets deployed inside of Pulsar with the ability to configure the concurrency easily. This removes the operational burden. Additionally, distributed state storage is made available that opens the door to huge possibilities. Sliding windows for analytics? Check. Metrics? Check. You’re only limited by your imagination. And since the storage is backed by Apache Bookkeeper, additional storage mojo is available.
The Apache Pulsar website has a lot of great information about Pulsar Functions.
Alternatives
There are many amazing alternatives that this feature of Apache Pulsar takes after, namely Apache Storm, Apache Heron, Apache Flink, and Apache Spark. Pulsar functions was created to be lightweight versions of these. Sometimes you need to do lightweight processing and do not want to increase operational overhead with an additional technology added to your stack.
In general, the vast majority of features found in Apache Pulsar are always available in Java. They aren’t always available in Python at the same time. With the goal of helping people utilize Python in Pulsar Functions, I created a repository that attempts to demonstrate the many things you can do with Python in Pulsar Functions.
Examples
Pulsar Python Functions
As of writing this article, there are currently 4 examples, with at least one new example being added each week. Contributions are welcome!
The idea with these examples is to highlight broad use cases that make a lot of sense for something like Pulsar Functions.
Format Phone Number
This function demonstrates utilizing a 3rd party library to modify the payload similar to Extract Transform Load methodology. The extract and load are handled for you by Pulsar, and your function code is doing the transformation.
Dynamic Routing
The ability to route messages based on the content of the message and/or the context is incredibly powerful.
All sorts of workflows can be created for organizational purposes or efficient use of resources. As an example, if you had a topic that included all click events, you could run a niche fraud algorithm against all events, or you could route those specific clicks for the niche model to a new topic, and place the model there. This way the model is only running against events relevant to how the model was trained.
Python Schema
A powerful function of Apache Pulsar is the ability to enforce a schema on a topic. When the message is passed into the function, it is in a format more efficient for transport.
This example converts the message into the intended object from Avro.
Naive Bayes AGRAWAL
One of the most exciting aspects of Pulsar Functions is the ability to do machine learning and artificial intelligence.
This example demonstrates the usage of scikit-multiflow leveraging the AGRAWAL data generator and a Naive Bayes algorithm to train and predict at the same time.
This concept of online machine learning is an exciting new space with great promise.
Conclusion
You’ve seen a few ways to utilize Pulsar Functions to perform logic against streams. There are infinite more things you could do.
Part of the examples that was open sourced above is instructions on how to write and build pulsar functions locally with ease. How to format them, package them, and deploy them for iterating quickly.
If you’ve got a great example of your own, please feel free to contribute it to this repository!