Data in Motion: Reality vs Fantasy
We are here to distinguish fact from fiction in the recent marketing hype about querying live data streams, popularly called “Data in Motion.” Examples include live event streaming from many types of network-connected devices, such as IoT sensors, webcams, financial monitoring instruments, and more. Providing intelligence from these sensors and actuators is a hot topic today because the devices feed machine learning and AI algorithms, which immediately enhance business intelligence (BI) and user experience. While there are many exciting advances in this field, there is also a lot of misunderstanding and blog bunk, the result of inflating accurate technical information with marketing hyperbole.
For example, in its recent IPO filing, one streaming data company implies that its “Data in Motion” service can query telemetry data directly from IoT devices. This is not correct; the IPO reads like data science fiction. Although technically conceivable, the claim is not realistic, because it would require device-specific coding on the Pub/Sub platform – in this case Kafka – used to stream the data from the devices. Another problem with the IPO’s notion of querying a live data stream directly lies in the nature of telemetry: it is the accumulated time series data which makes a query meaningful. Querying the most recent single datum from telemetry is actually nonsense, because there is only the data from one sensor at one moment; to query anything more means first storing the data – which this IPO claims is now obsolete! We will clarify the method of ingesting and querying telemetry, log, transaction, and other types of streams a bit further on, in order to develop a clear knowledge base for true “Data in Motion.” We will examine the technical errors in the IPO filing, because doing so will put the fantasy to rest and reveal that the reality is actually quite awesome.
Marketing vs. Engineering
This IPO describes “Data in Motion” as a “missing feature in databases” in general and claims to have rethought the computer science on the subject. But the reality is that even “live streaming” data is stored in databases – and so it must be, as we will see shortly. And there is no disadvantage in this. Kafka is the streaming platform used by the company that filed this IPO, and Kafka most certainly stores data from streaming devices. Let’s have a look at a section of the actual IPO:
Data engineers may be rightly baffled by the phrase “data at rest” in the above, because all streamed data must be stored before it can be queried. In fact, Kafka, the Pub/Sub platform this company uses for streaming, writes every message to a partition on disk – messages with the same key land in the same partition – before consumers can read it. As we will see later on, every streaming/messaging platform has a strategy for storing streamed data first. Indeed, the zero-data-loss guarantee of such platforms holds only after message data has been persisted to disk. The IPO language above is most likely the dramatic fiction of marketing staff.
While APIs provide access to log, transaction, and site-visitor data at the source, the data must still be stored in a form that can be queried. And while on-device processing could conceivably let IoT devices answer queries, that is a wave of the imaginary future. In practice today, a typical IoT device cannot be queried as the IPO implies. True Data in Motion handles queries on live data by setting up rules that trigger SQL queries automatically as data arrives. Systems built this way are fast and more than sufficient to feed live model updates in ML.
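As a concrete sketch of that pattern (all names here are hypothetical, not any vendor’s API): telemetry is stored first, and each ingested reading triggers a SQL query over the accumulated data – never over a single “live” datum in isolation.

```python
import sqlite3

# In-memory store standing in for the streaming platform's storage tier.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE telemetry (sensor_id TEXT, temperature REAL, ts INTEGER)")

def on_ingest(sensor_id, temperature, ts):
    """Rule hook: store the reading, then trigger a query over the accumulated data."""
    db.execute("INSERT INTO telemetry VALUES (?, ?, ?)", (sensor_id, temperature, ts))
    # The query is meaningful only against the stored time series,
    # not against the single most recent datum.
    (avg_temp,) = db.execute(
        "SELECT AVG(temperature) FROM telemetry WHERE sensor_id = ?", (sensor_id,)
    ).fetchone()
    return avg_temp

on_ingest("s1", 40.0, 1)
on_ingest("s1", 60.0, 2)
print(on_ingest("s1", 50.0, 3))  # 50.0 -- the average over the stored window
```

The point of the sketch is the ordering: storage happens before the query, yet the query still runs the instant new data lands.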
Although the data is stored in a database there is no reason to re-imagine this as “Data at Rest.” Nor is it accurate to say about databases in general that “…the whole paradigm crumbles” in the streaming context. Nothing of the sort is true. Today’s methods guarantee a high level of data integrity and confidentiality, qualities which would be diminished if devices could be queried directly. Moreover, it is the accumulation of data which makes time series analysis possible; this requires storing data in databases. At least one rational data scientist concludes that the IPO statement about computing an “…answer that is immediately out of date,” was the dramatic product of a marketing imagination to woo investors and customers!
On the other hand, an engineer will likely realize what the statement “…stream data in motion through the query” actually refers to: not the implied ability to bypass databases altogether, but instead the Pub/Sub’s topic rules which trigger SQL queries automatically to run against the most recently ingested data from IoT and other streaming source devices. To conclude the scathing criticism of this IPO statement, the obvious solution is to have engineering read marketing language before publishing!
A Dose of Streaming Reality
As mentioned earlier, there are many exciting new things under the event-streaming sun. For example, suppose we’re using Apache Pulsar to stream live data (a natural choice, because it suits this use case far better than Kafka). Pulsar provides functions and rules that automatically trigger SQL queries when new data is ingested. Those queries update training models for machine learning and AI apps, which in turn immediately improve the customer experience in the UI and refresh app data for BI charts and monitoring. “The whole paradigm,” as it actually stands, feels very much alive.
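To make the ingest-then-update loop concrete, here is a minimal, vendor-neutral sketch in pure Python (the class name and fields are hypothetical): each ingested event triggers an incremental update to a simple online model, the kind of live refresh a Pulsar function could drive.

```python
class OnlineMean:
    """Toy 'model': an incrementally updated mean, refreshed per ingested event."""
    def __init__(self):
        self.n = 0
        self.mean = 0.0

    def update(self, x):
        # Welford-style incremental update: no re-training over the full history.
        self.n += 1
        self.mean += (x - self.mean) / self.n
        return self.mean

model = OnlineMean()
for reading in [10.0, 20.0, 30.0]:  # stand-in for a live stream of events
    latest = model.update(reading)
print(latest)  # 20.0 -- the model reflects the newest data the moment it arrives
```

A real deployment would swap the toy mean for an ML model update, but the control flow – event in, model refreshed, downstream BI immediately current – is the same.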
Now, to understand how that is possible, we need to seriously look at the mechanics of:
- Commonly streamed data types, such as logs, transactions, and telemetry and metadata from networked streaming devices across Edge, Cloud, and on-premise IoT
- Messaging Pub/Sub and event streaming platforms like Pulsar
- Database query engines with universal connectors like Presto
Usually a product engineer tasked with research and innovation will start with a familiar component – an existing legacy system for instance – so that the new architecture is not wholly new and unfamiliar. Let’s say the enterprise is already invested in Azure but the engineer needs to reduce costs. This can be done by lopping off one of the expensive chunks of the MS monster and replacing it with Pulsar or PandioML, plus other open stack components like Presto.
Suppose now that our engineer Googles the question, “Can PrestoDB query Azure device twins?” Now, that’s a real question, and we’re going to answer it here, along with many others. Contrary to Confluent’s IPO filing, all query engines require mapping IoT streaming data to structured storage; Presto is no exception. Azure components may use JSON docs like Device Twins, but the overhead may be high. In this scope, the IPO’s view of data as “static” is misleading, because the most recent data is also arriving in the same database, triggering queries, and updating ML models. Here, we are revealing the true nature of Data in Motion.
Let’s take a bird’s-eye view of the top five Data in Motion platforms to see how the same reality gets relabeled by marketing to give the impression that a brilliant new widget was invented. We want to define the components common to all five of the big providers mentioned here. Careful study reveals that some of these can be surgically interchanged with Pulsar for streaming/messaging, PandioML for AI and ML applications, and Presto for querying. We are talking first about the expensive platforms:
… and finally we will be talking about a lean and aggressive new performer in this field: Pandio.
End-to-End Solutions are Expensive!
IBM, Amazon, Google, and Microsoft all provide complete solutions for capturing data streams, storing targeted data in structured form, and querying the data to update models. These solutions are designed for the largest enterprise organizations and carry the expected huge costs. There is, however, a more robust and far more affordable approach that incorporates cutting-edge open-source technology. In the case of Pandio, you pay only for the expert partnership, because most of the stack is open source. The benefit is that Pandio engineers are experienced experts at deploying these solutions.
Overview of Data in Motion
As mentioned, streaming data arises from fluctuating financial-instrument values, scientific instruments, factory sensors, security cameras, and much more. For the moment we will cast our net over one especially vast source: IoT devices, among the most prolific new generators of Big Data. They connect to a network to provide information from their environment via sensors, but also to enable a variety of systems to interact with the world through actuators. A common example is a factory sensor that reports the temperature of a production machine, whose data will be used to predict maintenance needs and failure probabilities.
These devices stream app data, called telemetry data, as well as metadata, which is information about the telemetry and the device state.
In addition to sensor data, the device streams metadata, which usually includes:
- Identifier (ID)
- Class of device
- Manufacture date
- Hardware serial number
- Device State information
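Putting telemetry and metadata together, a device message might look like the following (field names and values are illustrative, not any vendor’s schema):

```python
import json

# Illustrative device message: telemetry plus the metadata fields listed above.
message = {
    "telemetry": {"temperature": 72.4, "ts": 1718000000},
    "metadata": {
        "id": "sensor-0042",
        "device_class": "factory-temperature-sensor",
        "manufacture_date": "2020-06-01",
        "serial_number": "SN-998877",
        "state": "online",
    },
}
print(json.dumps(message, indent=2))
```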
Logs, Telemetry, Transactions
Where does streaming data originate? The answer further confounds the IPO above, because most of these sources cannot be queried directly – Edge apps are the exception – and their data is usually streamed to the Cloud for processing. Log files from customers using mobile web apps, ecommerce purchases, account transactions, video-game player activity, social-network user data, financial trades, GPS – all generate data to be streamed. In other words, data in motion. Typical streamed-data processing workflows include:
- Warehousing data
Pulsar Pub/Sub Ingestion
As the front line of “Data in Motion,” Apache Pulsar Pub/Sub provides a global and durable message-ingestion platform. Data is streamed as packets within messages which are managed by Pulsar. Within Pulsar, “topics” are created for streams (or channels) flowing from the sources mentioned above. Various app components then subscribe to the topics to make use of the “data in motion.” An advantage of pub/sub is that modules can subscribe to specific streams of data without building subscriber-specific channels for each device. And, as an integration tool of choice in the ESB paradigm, Pulsar connects to most Cloud services to maintain data pipelines and storage. It’s not surprising that this description sounds very much like Google’s streaming platform; as with all the platforms we’re discussing, they evolved as competitors facing similar challenges, and much of the code used by the big four is also borrowed from open source. Pulsar automatically scales to handle big-data spikes from source surges, and it features compute/storage separation, multi-tenancy, and geo-replication by default. As we survey the bulky, expensive providers, we will see that open-source components have significant advantages.
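The subscription model described above can be sketched in a few lines of plain Python. This illustrates the pub/sub pattern itself, not Pulsar’s actual client API, and all names are hypothetical:

```python
class Topic:
    """Minimal pub/sub: subscribers attach to a topic, not to individual devices."""
    def __init__(self):
        self.subscribers = []

    def subscribe(self, callback):
        self.subscribers.append(callback)

    def publish(self, message):
        # Every subscriber receives the message; no device-specific channel needed.
        for cb in self.subscribers:
            cb(message)

received = []
telemetry_topic = Topic()
telemetry_topic.subscribe(received.append)   # e.g. a warehousing module
telemetry_topic.subscribe(lambda m: None)    # e.g. an ML update module
telemetry_topic.publish({"sensor_id": "s1", "temp": 71.0})
print(received)  # [{'sensor_id': 's1', 'temp': 71.0}]
```

Real platforms add durability, partitioning, and acknowledgment on top of this core idea, which is exactly where the “store first” guarantee comes from.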
Data in Motion with IBM Watson
IBM’s Watson streaming platform collects and retains data for access, processing, and historical storage, for use in time series, analytics, and other services. Like Google’s platform, Watson contains its own message broker and real-time handler of streaming data. Device data, for example, is immediately written to a Databases for PostgreSQL table. Here is where Watson gets costly: to store data from connected sources, your environment must meet the requirements of the Watson platform, its messaging layer, and Databases for PostgreSQL compliance. Interpreting the fine print, this means paying IBM engineers and consultants. As for the Presto vs. PostgreSQL debate, we will treat that in another chapter.
Data in motion reality is a technical storm with IBM. Device payloads must be well-formed JSON. For IoT sources, device-type logical interfaces must exist for all connected devices, and a logical interface is needed to create the Databases for PostgreSQL tables. As you can see, partnering with a giant means paying a ton of money.
AWS IoT SQL
To once again allay the “data at rest” myth, let’s look at AWS. In AWS, rules are defined with SQL-like syntax, and the SQL statements are composed of the familiar clauses:
- SELECT – extracts information from the payload of messages and transforms it; supports data types, operators, cases, functions, literals, JSON, substitutions, nested queries, and binaries.
- FROM – applies an MQTT topic filter that catches the target messages to extract data from; the rule’s actions are triggered for each message sent to a matching topic.
- WHERE – conditional logic which extends the rule.
As we can see, all streaming or data in motion platforms map incoming data to data structures which can be queried. An example SQL statement looks like:
SELECT color1 AS rgb FROM 'colortopic/subtopic' WHERE temperature1 > 50
An example message (payload), with illustrative values matching the statement above, looks like:
{ "color1": "#FF0000", "temperature1": 70 }
If this message is published on the ‘colortopic/subtopic’ topic, the associated rule will be triggered and the SQL statement will be evaluated. The same design is implemented between Pulsar and Presto.
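The evaluation path can be sketched in plain Python. This mimics, rather than reproduces, the AWS rule engine, and the helper name is hypothetical: match the FROM topic filter, check the WHERE condition, then project the SELECT with its AS alias.

```python
def evaluate_rule(topic, payload):
    """Toy rule engine: FROM topic filter, WHERE condition, SELECT ... AS projection."""
    if topic != "colortopic/subtopic":             # FROM clause: topic filter
        return None
    if not payload.get("temperature1", 0) > 50:    # WHERE clause: conditional logic
        return None
    return {"rgb": payload["color1"]}              # SELECT color1 AS rgb

result = evaluate_rule("colortopic/subtopic", {"color1": "#FF0000", "temperature1": 70})
print(result)  # {'rgb': '#FF0000'}
```

Note that the rule runs against a message the platform has already received and parsed; nothing here reaches back into the device.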
Azure & IoT Hub Query Language: Data in Motion Reality
MS Azure deals with data in motion in a slightly different way, which includes a virtual hub to ingest streaming data sources. You first deploy SQL Server to store data on devices running Azure IoT Edge with Linux containers. As apparent in the name, this design is intended for the Edge compute paradigm, in which processing to update ML models is done at or near the data source.
Azure IoT Edge and SQL store and query data near the virtual edge of the network for fast processing. As with Apache Pulsar, Azure IoT Edge has storage to cache and preserve messages when a device goes offline to guarantee zero data loss.
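The zero-data-loss behavior described here is classic store-and-forward. A minimal sketch of the pattern in plain Python (hypothetical names, not Azure’s or Pulsar’s API):

```python
class EdgeBuffer:
    """Store-and-forward: cache messages locally while the upstream link is down."""
    def __init__(self):
        self.cache = []      # local persistence stand-in
        self.delivered = []  # upstream store stand-in
        self.online = True

    def send(self, message):
        if self.online:
            self.delivered.append(message)
        else:
            self.cache.append(message)   # cached, not lost

    def reconnect(self):
        self.online = True
        while self.cache:                # flush cached messages in order
            self.delivered.append(self.cache.pop(0))

buf = EdgeBuffer()
buf.send("m1")
buf.online = False
buf.send("m2")        # link down: message goes to the cache
buf.reconnect()
print(buf.delivered)  # ['m1', 'm2'] -- nothing was dropped
```

The guarantee, here as in the real platforms, depends on writing the message somewhere durable first, which is the whole argument against the “no storage” fantasy.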
The Azure IoT Hub query language supports querying device and module twins, which is misleading in a sense similar to the IPO document’s suggestion that it queries devices directly; it does not! In fact, Azure queries “device twins,” documents created when Azure stores new data streamed from devices and other sources. The device twins can then be queried by the hub query language. Here again we see products labeled in a misleading way. To boot, the SQL statement “SELECT * FROM devices…” misleads us to believe that Azure is directly querying devices. It does no such thing.
From MS: “Device twins are JSON documents that store device state information including metadata, configurations, and conditions. Azure IoT Hub maintains a device twin for each device that you connect to IoT Hub.”
var query = registryManager.CreateQuery("SELECT * FROM devices", 100);
var page = await query.GetNextAsTwinAsync();
foreach (var twin in page)
{
    Console.WriteLine(twin.DeviceId); // a stored twin document, not a live device
}
Device-twin JSON docs are labeled “devices” in the SQL query above. Data streamed from sources such as IoT devices is saved into these JSON documents, and it is this stored data that the IoT Hub query language actually queries. That is as close as you can get to querying “live data” as of this writing.
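To underline the point, here is a minimal sketch of what a query like “SELECT * FROM devices” actually touches: stored twin documents, never live hardware. This is pure Python with hypothetical names, not the Azure SDK:

```python
# The twin registry is a store of JSON-like documents, updated as device data arrives.
twin_registry = [
    {"deviceId": "sensor-1", "properties": {"reported": {"temperature": 71}}},
    {"deviceId": "sensor-2", "properties": {"reported": {"temperature": 66}}},
]

def query_twins(registry):
    """'SELECT * FROM devices' equivalent: reads stored documents, not the devices."""
    return list(registry)

for twin in query_twins(twin_registry):
    print(twin["deviceId"])
```

If a device is offline, the query still succeeds; it simply returns the last state the platform stored, which is exactly why the “directly querying devices” framing is misleading.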
From Fantasy to Fact
Abundant fantasies cloud the air when we’re looking for clarity. One such fantasy is that paying the exorbitant costs of the big four providers will somehow result in a better outcome. In fact, leaner providers like Pandio will demonstrate much the same actual outcome at a small fraction of the cost. Pandio features comparative advantages:
- Ease of use
- Managed service
- Massive scalability
- No lock-in
- Built on open source
With large enterprises increasingly confronting vast data volume and velocity, intelligence from data is the asset to be extracted. For a global organization, data warehouses built from streams must be secure. And data must be manageable and easily administered. The data warehouse must be rapidly updated to enable ML workloads such as live model updating. Only Pandio has a complete solution to these problems with a brave new open source combination:
- Pulsar event streaming
- IoT ingestion
- Sophisticated in-house AI/ML libraries
- Separation of storage and compute
- No data loss
- Tiered Storage
The giants are slower and less flexible to update their systems. Pandio is always ready to upgrade to the best new proven technology and keep its partners at the competitive front.