Distributed Messaging, Apache Pulsar, and Accelerating ML Initiatives
A discussion with Pandio’s CTO Josh Odmark
The following transcript was recorded from a conversation between Matt Rall and Josh Odmark on the Pandio Podcast. If you are interested in listening to the full episode, it can be found here.
Matt: On our podcast today I would like to welcome Josh Odmark. Josh is the CTO and co-founder of Pandio and has been instrumental in the growth and acceleration of the Pandio business. Josh is going to be able to talk a little bit about distributed messaging, and also relate that to why it’s important in the construct of AI and ML. Thank you again for joining the call. Before we dive in, I think it’d be helpful if we could just give the listeners a little overview of your background.
Josh: Thanks, Matt. I appreciate you putting this together. I come from the background of a traditional engineer. I was self-taught and got the engineering bug when I started my first company in high school, and ever since then my role has really been to come in and help companies scale.
I’ve been doing that for just over a decade now, helping companies hit that hockey stick growth. Most recently, I was working with some startups that worked in the insurance space and I saw a lot of very interesting things related to machine learning and artificial intelligence.
So, a lot of those interesting things were issues with accomplishing those initiatives, so I co-founded Pandio with my good friend Gideon to help solve some of those issues and help companies achieve AI in their own organizations.
Matt: Okay, appreciate that context. I think that’s helpful for framing the rest of this discussion. So, in today’s podcast, we’re really going to talk about three different areas.
Number one is: “What is a distributed messaging system?”
The second point is around Apache Pulsar, and why its emergence is so important for large enterprises today.
And the third piece that Josh is going to really talk about is the future of distributed messaging and why it’s so critical for the adoption of AI and ML.
So, let’s dive in, Josh.
It’s interesting. I think when most people hear distributed messaging, they think about, you know, chatting or a Slack channel. Is that what we’re talking about here?
Josh: So, of course, my take on it is very technical in nature, but from my perspective, you know, something like a distributed messaging system is pretty straightforward really – it’s sort of the decoupling of the job that needs to be done and distributing that to workers. So, kind of classically, one way to approach it is the thing that needs work done can do the work itself, but that becomes very difficult to scale.
So, the concept of distributed messaging was created to help address things like that by decoupling the producer from the consumer, and that spawned a whole industry of very fascinating things – from traditional queuing systems to publish and subscribe, all the way to event streaming, which is the most recent incarnation of a distributed messaging system.
Matt: Okay, so that’s really interesting. You talk about queuing, publish and subscribe, and streaming. Can you tell us a bit more about what those three different things are?
Josh: Sure. So, queuing is a traditional sort of approach, where you’ve got units of work and you put them into a queue, like a line that forms at a grocery store for a cashier. You put those things in a spot and then things that actually do the work take them one by one off of the queue.
It’s a really great, durable system where you put those things into a queue for other things to work on them. In some of the older ways of doing queues, the thing producing the work would talk directly to the worker, and it would tell the worker to do something. But if for some reason there was no worker available, either because they crashed or they were all busy, that created a problem of what to do at that point. Do you retry later?
So the queue is put in there as a middleman. It’s the thing that takes an unlimited number of jobs, prepares them, and allows workers to pull them off one by one as they’re ready. So, it lets the workers work at their own pace. Publish and subscribe is really a very similar concept, so you can think of those as similar. They are different messaging patterns, but for this purpose, they’re pretty similar.
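To make that concrete, here is a minimal sketch of the queue pattern using the Apache Pulsar Python client (pip install pulsar-client). The broker address, topic, and subscription names below are placeholders for illustration, not anything from the conversation:

```python
import pulsar

# A broker running locally; in production this would be your cluster URL.
client = pulsar.Client('pulsar://localhost:6650')

# Worker side: a Shared subscription makes the topic behave like a queue,
# with each message delivered to exactly one of the attached workers.
# (Subscribe first so the job sent below is delivered to this subscription.)
worker = client.subscribe('jobs', 'workers',
                          consumer_type=pulsar.ConsumerType.Shared)

# Producer side: drop a unit of work into the queue.
producer = client.create_producer('jobs')
producer.send(b'resize-image-42')

msg = worker.receive()
print('working on', msg.data())
worker.acknowledge(msg)  # done: remove it from the queue

client.close()
```

Running more copies of the worker against the same `workers` subscription gives the competing-consumers behavior Josh describes: each job is handed to exactly one worker, at that worker's own pace.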
Streaming is where it gets vastly different. Streaming is where you have a sort of real-time feed of data and you want to do something with it. You can have something like an IoT device sending off temperature sensor readouts, and then you can have something tying into that real-time feed of data to do things with it. You can have multiple consumers, and you might have one that is aggregating that data up.
For example, you might have a sensor readout every 60 seconds, or every five seconds, and you only want a one-hour average, so you can have something listening to the stream that does that aggregation. Then, you can have another thing listening to the stream that is looking for when the temperature is too high. If it’s over a hundred degrees Fahrenheit, you may want to send an alert or turn on a fan or something. It allows you to attach a lot of different logic to the stream, or distribute it to many different consumers for various purposes.
That’s inherently a lot more difficult due to its real-time nature, but there are some pretty cool technologies out there that bridge that gap – they take the traditional things you expect from a queue, such as durability, and bring them to event streaming.
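As an illustration of the fan-out Josh describes, here is a sketch of one of those stream consumers, again using the Pulsar Python client. The topic name and the JSON payload shape are assumptions made for the example:

```python
import json
import pulsar

client = pulsar.Client('pulsar://localhost:6650')
topic = 'temperature'

# Subscription 1: watch for readings that are too hot.
alerts = client.subscribe(topic, 'overheat-alerts')

# Subscription 2 would run in another process and compute the hourly
# average, e.g. client.subscribe(topic, 'hourly-average') -- each
# subscription name receives its own full copy of the stream.

while True:
    msg = alerts.receive()
    reading = json.loads(msg.data())       # e.g. {"fahrenheit": 102.5}
    if reading['fahrenheit'] > 100:
        print('too hot: turn on the fan')  # or publish an alert event
    alerts.acknowledge(msg)
```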
Matt: So, this whole idea of distributed messaging, would you consider it a recent phenomenon or maybe give a little bit of context in terms of how technology infrastructure has evolved over the past 20 or 30 years?
Josh: Yeah, I would say queuing, and messaging in general, is a pretty old concept. The interesting thing is that the transition from applications that are monolithic in nature to something that’s more microservice-based, chunked up into smaller bits instead of one large application, brought the need for a robust messaging system to the forefront. We’re seeing a lot of that in traditional applications, but much more so in artificial intelligence and machine learning, which naturally have those borders between each of the services.
So in other words, it looks more like microservices in the wild versus monolithic in nature, and that’s really what is driving this. A lot of these concepts aren’t really new – they’re just being applied in new ways, with the slight exception of event streaming. Events and messages have always existed in some form or another, but thinking about them as a stream, and attaching processing to the consumption of that stream in the sense of an application or machine learning or AI, is why this is a hot topic these days.
Matt: Okay, okay. So, given that the majority of large enterprises are likely moving towards more of a microservices architecture and API framework, is it safe to say that they are in need of more robust distributed messaging systems as you look towards the future?
Josh: Yeah. It’s certainly a hotly contested topic, whether that’s a positive thing or a negative thing. It becomes difficult because with one application you can much more easily reason about what’s happening, where data flows, and where your business logic is being applied. When you move to microservices, that becomes a bit more difficult because you’re moving things from a central location to a more distributed one.
So, if at one time you had one monolithic application, and you change that to be maybe 50 microservices, each unit of work the application did now has to move through 50 different things. If you were to think of each microservice as a step in the application, then step back and ask yourself what happened at, you know, step 37, that becomes much more difficult to answer.
So there’s tooling out there created to make that a little easier – sidecar proxies and Kubernetes have some things to help with this. But it’s certainly hotly contested among enterprises. I would say it is kind of a hot thing to do now, and there’s a lot of really cool technology out there. We see a lot of companies doing it with success, but success isn’t guaranteed, and it’s not guaranteed to be the best way to do things.
It’s one of those case-by-case things, but I would say when it comes to machine learning and AI, just due to the way people are building things and the sheer amount of data involved, there’s more community support and understanding around a microservice approach. That makes it a natural fit for microservices, but distributed messaging can also be applied to monolithic applications – it isn’t only for microservices.
It’s just that distributed messaging is absolutely needed for microservices, whereas with a monolithic application it’s optional but still very helpful. If you’ve got a monolithic application in one department at a large enterprise, and another one in another department, and you want to communicate between the two, it’s a really good fit for that scenario too because, at the end of the day, it’s just a methodology for communicating between two programmatic systems, so it can be applied in a lot of different ways.
Matt: Given that and that kind of background, I’d like to talk a little bit now about Apache Pulsar. Maybe you can first give a little bit of a description of Pulsar, and then maybe a little bit of context or comparison to other, you know, competitors in the market today.
Josh: Yeah, so I’ll start with the context because it traces my path to Pulsar. I originally started with AWS SQS quite a long time ago. The great thing about SQS – which is just a traditional queuing service from Amazon – was that it was simple, and at the time, I didn’t believe I could tip it over. I couldn’t send it too many jobs and I couldn’t request too many jobs from it, so it felt very reliable and it worked great.
I’ve used it relatively recently; as a matter of fact, some of our internal systems still use it in a certain fashion. But I slowly progressed to other things like Kinesis and RabbitMQ for various reasons, and then eventually to Kafka, which I’ve used in the past as well. What I found quickly is that each one is a difficult component of your technical stack to run. While they work really great out of the gate, when you get to major volume – which is inherent with machine learning and AI – they can become very difficult.
I had issues with RabbitMQ and Kafka. When you want to scale them really large, you get what a lot of people refer to as gremlins – the gremlins in the system come out, and you spend a whole lot of your time tracking down why something isn’t performing the way it was, or why it crashed, or why you’re losing data, things of that nature.
When it came to ML and AI, I didn’t really have great success with any existing technology until I came across Pulsar. The great thing about Pulsar is that it unifies the messaging models. When we were using Kafka, we also had to use RabbitMQ to get some of the queuing functionality we needed for our full distributed messaging needs, but Pulsar has both – it can do event streaming and queuing. So, I’ve got the event stream, the Pub/Sub, and the queue components in one system, and that’s great.
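Here is a rough sketch of what that unification looks like in the Pulsar Python client: the same topic can be consumed with queue semantics or with stream/pub-sub semantics just by choosing a subscription type. Broker address, topic, and subscription names are placeholders:

```python
import pulsar

client = pulsar.Client('pulsar://localhost:6650')

# Queue semantics: workers share one subscription, and each message is
# delivered to exactly one of them.
queue_worker = client.subscribe(
    'events', 'work-queue',
    consumer_type=pulsar.ConsumerType.Shared)

# Stream/pub-sub semantics: a second subscription gets its own full copy
# of the same topic, consumed in order by a single exclusive consumer.
stream_consumer = client.subscribe(
    'events', 'analytics-stream',
    consumer_type=pulsar.ConsumerType.Exclusive)

client.close()
```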
The other piece of it is that Pulsar separates compute and storage, which allows the two to scale horizontally independent of each other and makes scaling out a lot easier. You find that in the ML and AI space your usage graphs look very much like a roller coaster – there are a lot of twists and turns in your usage.
It’s really high one minute, really low the next, and in many cases, if the workload is predictive in nature, you can have spiky patterns that are hard to plan for. Having something that scales more easily than anything else out there was highly beneficial, and Pulsar has a laundry list of other things going for it: it’s been vetted for a long time, it came out of Yahoo, and it’s a top-level project at Apache, so it’s got a strong community that grows by the day and is used by a lot of large enterprises.
So, there’s the confidence that it’s bulletproof, and that’s certainly been the case for us. It’s one of those things where once you get it going and working for you, it’s just been absolutely flawless. I don’t have the operational pain with Pulsar that I’ve had with all the other solutions, so it’s fantastic in so many ways, especially from an enterprise perspective.
Matt: There are a lot of Fortune 500 companies that are running Apache Kafka today, right? I think it’s like 50 or 60 percent of the Fortune 500. For them to make a transition to Apache Pulsar to address some of these concerns that you’ve raised, is that a huge lift? Is that a rip and replace? Can you talk a little about what that looks like?
Josh: Yeah, so this typically has to be addressed case-by-case, because it really depends on how you’re using certain things – whether you’ve modified it yourself, how you’ve built things on top of it – but for the most part, if it’s used in the traditional ways, it’s relatively straightforward to replace. Pulsar is very interesting in that it supports the binary protocol of Kafka.
So, when it comes to receiving and doing things with a message that was originally meant for Kafka, Pulsar can drop in and replace those scenarios, and it also comes with a lot of the ancillary benefits of something like Kafka – it has the function capability, where it acts like a stream processing engine, and abilities like running SQL against your messages. It’s got a lot of that built-in and it runs under the same conceptual model, so it’s a relatively low lift to switch. And again, you don’t have to do it 100% – you can keep some components of your old system running while you move new things over. It doesn’t have to be a light switch where you’re on one system one day and the other the next; you can slowly move towards it. A big reason why you’d want to do that is the operational cost of running some of these things.
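For a flavor of that function capability, here is a minimal sketch of a Pulsar Function written against the Python Functions SDK. The bid-filtering logic and the topic wiring are invented for illustration; a function like this gets deployed with pulsar-admin and attached to input and output topics:

```python
from pulsar import Function

class FilterHighBids(Function):
    def process(self, input, context):
        # Called once per message on the input topic. Whatever is
        # returned is published to the function's output topic.
        bid = float(input)
        if bid > 10.0:
            return input  # forward interesting bids downstream
        return None       # returning None drops the message
```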
You know, in the past I ran something like this – Kafka and RabbitMQ in a particular implementation for a customer. There were about 15 people who were experts in Kafka and RabbitMQ, and when we replaced it with Pulsar we could reallocate most of that operational labor. That’s one of the things Pulsar does very well – it just runs flawlessly and it’s a lot cheaper to operate, and then you’ve got the pure performance gains on top of that.
It’s also significantly faster – you need fewer resources to do the same amount of work and the latency is lower. When you add all that up, you can see sometimes as high as 40% cost savings on it, and depending on use cases, it can be even higher. You can see even higher throughput gains, say if you have larger messages flowing through your system. You can see even better latency gains depending on variables such as how your geo-replication is set up and things of that nature.
Pretty much across the board there are huge efficiencies, and at the end of the day, you also see productivity gains, so it’s a very popular move right now. I know of a couple of very large enterprises that are in the middle of switching from Kafka to Pulsar as we speak. They’re seeing the same things that we see – Splunk, most recently, showed some pretty insane gains when they switched from Kafka to Pulsar. In some cases, I saw 50x improvements, which is just absolutely insane.
So, it’s an exciting space to be in right now.
Matt: Okay. So, I think that’s great. I appreciate, first of all, the definition of what distributed messaging is. I think you’ve given the listeners a pretty good overview of Apache Pulsar. I’d like to now, kind of tie this up by looking a little bit into the future. One of the things that I’ve heard is that something like Apache Pulsar is going to be an enabler for large enterprises to really integrate and operationalize AI and ML, so is that a true statement? And if so, can you expand on that a little bit?
Josh: Yeah, so one of the reasons why I’m knee-deep in Pulsar now is because there’s just a huge need to support messaging patterns at scales that have never really been seen before. It’s no wonder that a lot of applications aren’t really fit for that – they were never designed for it – whereas the architecture of Pulsar was.
It works really great. In my experience over the last decade, I saw a lot of ML and AI initiatives succeed and fail, and the common denominator in a lot of them was having a very strong and robust distributed messaging system. It’s the fabric, the foundation, of all these systems. It’s the thing the application communicates over and that the machine learning model does its inference and training on. It was always the piece that was critical to the success of those objectives and, likewise, when things failed, a lot of times they failed for these kinds of reasons.
There are typically always multiple reasons – that’s why I say it’s a common denominator – because distributed messaging is very difficult to do at the scale that’s necessary for machine learning and AI. So, that’s why I think this is a very hot space, and a space that needs to succeed for the adoption of artificial intelligence to increase: right now everybody wants AI, but very few people have it, and that’s a problem. Nailing distributed messaging can help solve that.
Matt: Can you give the listeners a concrete example of an industry and a type of machine learning or AI model that you’ve seen successfully implemented – where you have seen this actually work in real life?
Josh: Sure. Media is one of the best examples just because they have a huge amount of data. IoT is another great example, but that’s relatively in its infancy, even though it’s going to explode; media has been around for a long time. Within the last two years, we worked with a media company that had a really hard time ingesting and moving data.
For them, they’re out there placing advertisements all over the internet on Google and Facebook, all the social media, all across the web, paid search, the content network – pretty much, you name it, they’re there. And because of that, they had a huge amount of click and impression data, and it was one of those things where the sort of spurts of web traffic are very difficult to predict, so it became very costly for them just to ingest that data, let alone do something with it.
Just ingesting it was very difficult. Likewise, because they use so many places for ads, all that data was coming in in vastly different formats, and unifying it into a singular data model was also very difficult. Something like Pulsar makes that a lot easier because it’s built to handle ingestion at that scale. It also has a lightweight compute framework on top of it, so you can do ETL-like things to the data and then put it in a place that makes sense for you – a data warehouse, a data mart, a data lake, really anything you want – and even put it in two places.
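As a sketch of what that normalization step might look like, here is a small Python transform that unifies two invented ad-network formats into one data model. All the field names are made up for the example; real networks have their own schemas:

```python
import json

def normalize(raw: bytes, source: str) -> dict:
    # Each ad network delivers clicks in its own shape; map them all
    # onto one unified record before they land anywhere downstream.
    event = json.loads(raw)
    if source == 'search':
        return {'ts': event['click_time'], 'cost': event['cpc'],
                'campaign': event['campaign_id']}
    if source == 'social':
        return {'ts': event['timestamp'], 'cost': event['spend'],
                'campaign': event['ad_set']}
    raise ValueError(f'unknown source: {source}')

# In practice this would run as a Pulsar Function or a consumer/producer
# pair reading each network's ingest topic and writing one unified topic.
print(normalize(b'{"click_time": 1700000000, "cpc": 0.42, "campaign_id": "c1"}',
                'search'))
```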
Pulsar really helps with the literal logistics of things like that. It also can help you make sure the costs don’t get out of hand. Duplicate data, when you’re talking about that kind of scale, can be really painful, and it’s completely unnecessary.
If you’re storing a petabyte of data twice, for no real reason other than it was easier operationally or from a productivity standpoint, you’ll eventually get to a point where you have to get rid of those inefficiencies because your data costs will outpace your revenue. Something like Pulsar just helps you build systems like that and execute much more efficiently.
For that same use case, they also needed to take all that data and understand the best price, or the best bid, for advertisements, as close to real-time as possible, and that is also very difficult to do.
However, because Pulsar has a lightweight compute framework, they were able to take a model that learns from the past 24 hours – what worked, what didn’t, and the best price for them to pay based on what a customer is paying. They had a whole lot of data that was fed into that model in the form of features, so they put the model inside of Pulsar, fed it the streaming data as end-of-day values and features, and then pulled in other information from their internal systems.
Pulsar can both apply functionality against messages in an event stream and pull in outside information that might make up additional features in your model, so it can make an accurate prediction. Before, that was a system built from multiple different technologies just to get to that point, and you can do all of it inside of Pulsar – a single technology. That is way more efficient operationally, much easier to reason about (how things work, data provenance, where issues are happening), and it lets you leverage some really interesting functionality of Pulsar, like the fact that it can infinitely retain every event you send to it.
So, if you wanted to test a new model – maybe to see if you could improve your predictions by adding or removing features, or trying different weights – you can replay messages that happened in the past. That one capability by itself opens up a huge amount of possibility.
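Here is a sketch of that replay capability using a Pulsar Reader, assuming the topic’s retention policy keeps the old events around. The topic name and the “candidate model” step are placeholders:

```python
import pulsar

client = pulsar.Client('pulsar://localhost:6650')

# A Reader can start from any point in the retained stream; starting at
# `earliest` feeds every retained event back through, in order.
reader = client.create_reader('bids', pulsar.MessageId.earliest)

while reader.has_message_available():
    msg = reader.read_next()
    # Feed msg.data() to the candidate model instead of the live one,
    # e.g. candidate_model.predict(parse(msg.data()))
    print('replaying', msg.data())

client.close()
```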
So, you know, the future is bright with something like Pulsar – it’s exciting.
Matt: Got it, got it. So, Josh, this has been really interesting. I wanted to give you just a chance to give a little elevator pitch on Pandio.
Josh: Oh, so you know, Pandio is a hosted solution.
Matt: Can you just talk a little bit about why you think that it’s such a powerful potential piece of an enterprise tech stack?
Josh: Yeah, so Pandio really is the best distributed messaging platform out there. It’s faster, has more throughput, and is cheaper than anything that exists today, and it’s really geared towards allowing companies to achieve machine learning and artificial intelligence. It can be used in almost any fashion – event streaming, Pub/Sub, queuing – but we really focused on solving the use cases for machine learning and artificial intelligence. One of the things that makes Pandio quite amazing is that we dogfood our own system. We built a neural network into Apache Pulsar, so we can leverage all the greatness of Apache Pulsar, and then we added our own sugar on top of it.
That neural network literally controls all of Pulsar, so it will scale it up or scale it down. It will reconfigure flush and cache intervals and a whole bunch of other things so that we can run distributed messaging at absolute maximum performance and maximum efficiency. It’s quite amazing to see it actually work, but that’s how we’re able to run huge messaging clusters for very large companies at almost unbelievably low operating costs – because we’ve got that neural network.
That’s our operator – it handles everything for us, and we really only have to check in to make sure it’s operating efficiently. We also use a very interesting way of doing that. We took a page out of Google’s playbook – they’ve been doing some very interesting things with federated learning. So do Android phones, and iOS for that matter: on an Android phone, when you type on the keyboard, it starts to learn your own personal lingo.
Out of the gate, when you get a phone, it’s really good at generic stuff, but you have your own nicknames and acronyms and things like that, and it starts to learn those. Pandio uses the exact same concept. When we launch a cluster for a customer, it gets a baseline neural network, and then, as they use their individual cluster, it starts to learn their use cases – whether they traditionally send small messages or large messages, how they use topics, the patterns of messaging they use, and things of that nature – and it really starts to optimize for that.
That’s the reason we have performance and efficiency unmatched by anything in the market.
Matt: And Josh, if you’re a developer, a software architect, or an engineer listening to this call, is there a way for those folks to actually play around a little bit and see what this is all about?
Josh: Yeah, of course – you can go sign up with Pandio; that would be the best way, because then you don’t have to think about where you’re going to run it. But Apache Pulsar itself, on their website, has the ability to download Pulsar – ours is very similar to that. They have a standalone JAR file, as well as Kubernetes and Docker options, so you can pick your flavor, get started locally, and just see what it’s about.
They have very good documentation about how you actually use Pulsar and things of that nature, and then when you’re ready to deploy it in a production atmosphere, that’s where Pandio can really help you out with some of our sugar added on top.
Matt: Okay, and Josh, for folks who would want to get a hold of you or your engineering team, what’s the best way for them to do that?
Josh: So yeah, I mean, they can email me directly at josh@pandio.com, or they can submit a question through our website as well. Those are probably the two best ways, whether it’s general or for support specifically around how Pulsar runs.
There’s also a very active Slack channel that I would definitely invite people to check out. It’s a growing community, and a very strong one. Those are the best ways to get in contact with us right now.
Matt: Okay, fantastic. Listen, Josh, I appreciate the time. I think we’ve learned a lot in the last half an hour and look forward to seeing this community grow and the momentum that we’ve picked up on here.
Josh: Awesome. Yeah, thanks man. I appreciate you putting this together.