Trino! The Universal Analytics Query Engine
With Trino’s ability to query object storage and block storage simultaneously, data scientists enjoy unprecedented freedom to engineer AI-based analytics that harvest rich insights and intelligence from data lakes. Additionally, now that streaming platforms like Apache Pulsar support querying live event-streamed data, access to all data everywhere is within reach. Universal query engines like Trino broaden the scope of data access accordingly. The differences among block, object, and file storage systems once dictated the software required to search and explore them, but Trino is built in the spirit of universal integration. Analytics can now proceed afresh and harvest intelligence from data unfettered by the type of data source and format. Furthermore, Trino can query variegated file types on multiple machines, across multiple server clusters, all in the same query!
Our purpose here is to illustrate how this spirit of integration abstracts away the old obstacles of connectivity between apps and data sources, empowering even non-technical users to run analytics against all data everywhere! Among the outcomes is the ability to remove technical obstacles to deploying real-time, machine-learning-based analytics. In other words, querying live event-streamed data makes possible AI systems, including deep learning networks, which update their models during transactions in order to provide real-time learning in applications.
The journey begins with a broad use case: analytics on a data lake of mixed sources, combining cloud and on-premise data with live event-streaming data from IoT devices recording, for example, bank transactions and portfolio trades. What technologies are best for a data science application at such a massive scale? What is the best implementation strategy? We will begin with the query engine itself. Let’s get started.
Universal DB Connector API
Trino’s native Connector API provides high-performance queries against essentially all common data sources, including:
- Hadoop HDFS
- Relational databases such as MySQL and PostgreSQL
- SQL and NoSQL stores
- Structured and unstructured data
- Stream-processing systems such as Apache Pulsar and Apache Kafka.
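Each source is enabled through a small catalog properties file on the Trino cluster. As a minimal sketch, assuming a reachable PostgreSQL instance (the hostname, database, and credentials below are placeholders), a catalog file might look like:

```properties
# etc/catalog/postgresql.properties — registers a PostgreSQL catalog.
# connector.name selects the connector; all connection details are placeholders.
connector.name=postgresql
connection-url=jdbc:postgresql://db.example.com:5432/sales
connection-user=trino_reader
connection-password=changeme
```

Restarting the cluster with this file in place makes the source queryable as the `postgresql` catalog; analogous files enable Hive, Kafka, MongoDB, and the other connectors listed above.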
Although data lakes and data warehouses are both widely used in Big Data strategies, they are distinctly different concepts, so let’s clarify them now. A data lake is a vast collection of raw data intended to eventually serve numerous purposes, many of which have not yet been imagined or defined. The basic idea is to make all germane data available to apps which may use it, and to extract insights from it as the project evolves. A data warehouse, on the other hand, is a well-defined, engineered repository for structured data whose purpose is already defined and already in iterative use within enterprise operations and analytics. To summarize: the data lake is raw, unmined potential; the warehouse is already in production. The noteworthy feature of Trino in this context is that it has native connectors to data sources across both data lakes and warehouses.
Trino can query multiple databases with a single query statement, even when the queried data is stored in segments scattered across files and servers. This simplifies the work of engineers and developers because there is no need to aggregate or consolidate data sources before querying them.
Importantly, Trino can query live streaming data from messaging systems like Pulsar and Kafka topic streams while joining data from PostgreSQL, MongoDB, Redis, and ORC files, all in one query.
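Such a federated query can be sketched as a single SQL statement; the catalog, schema, table, and column names below are hypothetical, assuming a Kafka catalog and a PostgreSQL catalog have been configured:

```sql
-- Join a live Kafka topic with a PostgreSQL table in one statement.
-- All names below are illustrative placeholders.
SELECT c.customer_name,
       t.amount,
       t.trade_time
FROM kafka.market.trades AS t
JOIN postgresql.public.customers AS c
  ON t.customer_id = c.customer_id
WHERE t.trade_time > current_timestamp - INTERVAL '1' HOUR;
```

Trino addresses every table with a three-part `catalog.schema.table` name, so the join crosses system boundaries without any prior data movement.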
It is also important to note that while Trino is an accelerated query engine, benchmark outcomes will partly depend on the performance of the engines Trino is interfaced with. Benchmarks of Trino querying ORC files on HDFS, for example, have outperformed the same workloads run directly against MySQL. In other words, the best-performing systems integrate all data sources through Trino rather than chaining several engines together.
With regard to the emerging concept of “data in motion,” we want to understand how Trino handles event streams from systems such as Pulsar and Kafka. Generally, a schema registry is used so that streaming data persisted in Pulsar or Kafka can be interpreted by Trino queries. Some Kafka providers now claim that their data sources are not static, implying that they query data streams directly. It is important to understand that both Pulsar and Kafka persist stream data in configured formats, and it is this persisted data that is queryable. Along this line, Pulsar provides a native Trino connector so that workers within a Trino cluster can query Pulsar topic data.
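As a sketch of how this looks with the Pulsar connector (property names can vary between connector versions, so treat these as illustrative and check the Pulsar SQL documentation; the broker URL is a placeholder):

```properties
# etc/catalog/pulsar.properties — points Trino workers at a Pulsar cluster.
connector.name=pulsar
pulsar.web-service-url=http://pulsar-broker.example.com:8080
```

Pulsar exposes each namespace as a schema, so a topic can then be queried like a table, for example `SELECT * FROM pulsar."public/default".transactions LIMIT 10;`, with the schema registry supplying the column definitions.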
Trino Use Cases
A typical emerging use case today involves a data-team project manager tasked with combining and integrating multiple data sources toward a single analytics objective. Combining data from a Customer Relationship Management (CRM) system with that of an Enterprise Resource Planning (ERP) application, to find correlations between campaign demand and production outcomes, will serve as our example of a project drawing on data from hundreds of devices. While in previous years the temptation was to haul all the data into a central warehouse, Trino now offers a much lighter, leaner, and faster solution. Traditional engineers were conditioned to think of consolidating and integrating data sources as a first step. Trino bypasses this step by querying all data sources in place, including live streams (stream data persisted according to a schema registry).
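The CRM/ERP scenario above might reduce to a single federated query. As a sketch, assuming the CRM lives in PostgreSQL and the ERP in SQL Server, with every table and column name hypothetical:

```sql
-- Correlate campaign reach (CRM) with production output (ERP)
-- without moving either dataset. All names are hypothetical.
SELECT cam.campaign_id,
       cam.leads_generated,
       ord.units_produced
FROM postgresql.crm.campaigns AS cam
JOIN sqlserver.erp.production_orders AS ord
  ON cam.product_id = ord.product_id;
```

No extract-transform-load pipeline precedes the query; the consolidation step that traditional engineering treated as mandatory simply disappears.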
A Lot of the Work is Already Done!
Looking out across an expanse of multiple data sources, our project manager sees some unstructured data already in cloud object storage like S3, perhaps a rogue MongoDB, a few hundred static MySQL tables from customer servers, and a Pulsar stream from factory IoT devices. She sees a bewildering variety of data formats scattered across data warehouses, an open source database or two, proprietary databases thrown into the lot, even a terabyte of SMS sentiment data from a customer survey swimming in a data lake!
Being human, our traditional project manager finds this task daunting. But now we fast-forward to the modern integrated query engine: how can she unify all these sources within the scope of her analytics project? Fortunately, there is now a solution so well suited to this task that we don’t even need to talk about integration and consolidation. The Trino distributed query engine can already do the tasks described above, without custom integration code. In some cases, it may benefit an enterprise to partner with a Trino hosting expert in order to reach its intended market as early as possible.
Trino’s Gifted Pedigree
Developed at Facebook and soon after released as open source, Trino is today used by many noteworthy enterprises whose daily analytics require querying complex big data sources in diverse locations. Here again we emphasize that, rather than first integrating the Big Data into a warehouse, Trino solves the enterprise analytics puzzle by querying all those sources as they are, wherever they are!
Twitter and Uber are two progressive companies already optimizing their insights with Trino. Airbnb, Netflix, and LinkedIn, likewise develop with open source analytics stacks based on Trino. With Trino, the data stays where it is. The advantages of using open standard formats instead of costly proprietary formats are abundant. Here are some features to appeal to developers:
- Easily pluggable connectors provide metadata for queries.
- Simple but extensible architecture.
- Pipeline configurable for iteration.
- User-defined functions.
- Vectorized column data processing.
In other words, Trino can query any data source at any location and combine multiple sources from various hardware and infrastructure all in a single query.
Trino’s architecture is ideal for containerized cloud deployments, which demand scalability and elasticity. New enterprises seeking to avoid on-premise infrastructure costs, as well as existing enterprises phasing out on-premise hardware, will benefit from Trino’s intelligent scope and reach. Data scientists can run interactive SQL queries across SQL and NoSQL sources and multiple data warehouse files, often faster than with Spark, while evolving microservices and reusable code.
Query The Data Where it Persists!
The challenges analytics teams face today are readily met by Trino’s ability to navigate various data repositories and quickly extract insights. Yet many enterprises will still find added benefit in partnering with a hosted Trino solution. Why so?
The open analytics stack is a brilliant combination of tools that provides cost and efficiency benefits any engineer can make full use of. However, some enterprises will want to reach the market with their product as quickly as possible. This is where the engineers at a premium hosted service like Pandio accelerate time to market.
Another reason to consider a hosted Trino solution is the number-one concern of all enterprises using cloud tech today: data security. The security intricacies of deploying Trino across multiple data sources are best managed in partnership with a team of proven experts. In a competitive marketplace, the value of having seasoned Trino experts on hand is considerable. Consulting partners share the burden of liability, answer urgent questions, and resolve midnight challenges with alacrity. All of the above shorten both time to resolution and time to insight!