Why Data Engineers Prefer Pandio's Managed Presto Service

Posted on July 7, 2021

Presto is a distributed SQL query engine that is driving productivity and efficiency for data science and engineering teams. It allows data engineers to query data where it lives, without physically moving it. This article reviews why Pandio’s managed Presto service is compelling and is gaining momentum in the marketplace.

As data pipelines multiply and data volumes grow, big data becomes increasingly challenging to manage and leverage. More specifically, the issues occur in data pipelining, performance optimization, and data analysis. Data engineers are tasked with administering and understanding this increasingly complex ecosystem.

Managed Presto is scalable, simple, and cost-efficient

Presto has been developed from the ground up to enable the interactive querying of petabytes of data. It works with numerous data sources ranging from traditional relational databases, to flat files, to data marts and data lakes. Pandio offers a managed cloud or on-prem service that enables enterprises (and their engineering teams) to take advantage of Presto without worrying about administration or support.

With extensive data operations, it is inevitable that important data will reside scattered across the enterprise. Historically this has been a tricky problem for data engineers. First, they had to secure access to the data. Then they had to understand what was actually in the repository – is it redundant? Is it accurate? Is it in a format that can be queried? Assuming the data was relevant, it then needed to be extracted and moved to a consolidated area where the data could then be joined, cleaned, and queried. This process was expensive, time-intensive, and often ineffective. Presto offers an alternative approach that drives efficiency, dramatically shortens the time to insight, and does so without the actual movement of any data.

Dramatically reduce time to insight

The beauty of using Presto is that there is no costly up-front infrastructure investment. You don’t have to move the data at all to query it. Instead, Presto brings queries to the data. Thus, it solves one of the major challenges when working with big data – the inability to separate storage and computing power.

Presto comes with data source connectors enabling data architects to manage even the most complex systems easily. In addition, Presto workers are lightweight, which allows engineers to bring the query processing as close to the data as possible.

Pandio’s managed Presto service is built upon the open-source Presto code base. It enables data engineers to separate computing and storage easily, using the numerous connectors available. Whether the source is a traditional relational, non-relational, or columnar database, Presto connectors enable data engineers to bring queries to the data efficiently.
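To make "bringing queries to the data" more concrete, here is a minimal sketch of a federated query. Presto addresses tables as catalog.schema.table, so a single SQL statement can join tables that live in different systems. The catalog, schema, table, and column names below are hypothetical and do not come from any real deployment; the snippet only builds and prints the SQL text.

```python
def qualified(catalog: str, schema: str, table: str) -> str:
    """Return a fully qualified Presto table name: catalog.schema.table."""
    return f"{catalog}.{schema}.{table}"


def federated_join_sql() -> str:
    """Build a query that joins a MySQL table with a Hive table in place."""
    # Hypothetical names: "mysql" and "hive" are catalog names an
    # administrator would have configured; the tables are invented.
    orders = qualified("mysql", "shop", "orders")
    events = qualified("hive", "web", "click_events")
    return (
        "SELECT o.customer_id, count(e.event_id) AS clicks\n"
        f"FROM {orders} AS o\n"
        f"JOIN {events} AS e ON o.customer_id = e.customer_id\n"
        "GROUP BY o.customer_id"
    )


print(federated_join_sql())
```

Neither table is copied into a staging area first. Each connector reads from its own source, and Presto performs the join.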

Attaching a data source to Presto requires no custom code. Data engineers can attach a data source from any cluster in a matter of seconds. The same goes for connecting various databases and database services. Presto works with the Hadoop Distributed File System (HDFS), Apache Cassandra, MongoDB Atlas, Elasticsearch, Amazon Redshift, RDS/PostgreSQL, and RDS/MySQL. This is only part of the full list of connectors.
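As an illustration of how little is involved, in open-source Presto a data source is typically described by a small catalog properties file rather than by application code. The sketch below renders such a file for the MySQL connector. The host name, user, and password are placeholders, and the article does not describe how Pandio's managed service exposes this step, so treat it as an assumption about the underlying open-source mechanism.

```python
from typing import Mapping


def catalog_properties(connector: str, settings: Mapping[str, str]) -> str:
    """Render the text of a Presto catalog properties file."""
    lines = [f"connector.name={connector}"]
    lines += [f"{key}={value}" for key, value in settings.items()]
    return "\n".join(lines) + "\n"


# Placeholder connection details; replace with real values in practice.
mysql_catalog = catalog_properties(
    "mysql",
    {
        "connection-url": "jdbc:mysql://mysql.example.internal:3306",
        "connection-user": "presto_reader",
        "connection-password": "change-me",
    },
)
print(mysql_catalog)
```

In a self-managed cluster, a file like this would usually be saved as etc/catalog/mysql.properties, which makes the source queryable under the catalog name mysql.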

You can install and run it wherever you want

At the start, it may seem convenient to choose a cloud provider and use it for big data projects. Cloud providers offer tools and one uniform ecosystem that data engineers can learn and get comfortable in. However, locking into a cloud provider can have drawbacks, including a maximum concurrency limit that varies from region to region, a maximum number of databases per account, and a limit on partitions per table.

There is also a performance issue. When you choose a cloud provider, you have limited guarantees on how many servers it allocates to your project. These are often shared services, and the system will use some sort of algorithm to distribute resources among its users. Inconsistent performance is not acceptable for an enterprise that is relying on its SQL query engine to provide timely and relevant data to its business analysts and data scientists.

Data engineers pick Pandio’s managed Presto service because it enables them to install and run Presto wherever they want. First and foremost, this gives them the flexibility to run the service on infrastructure of their choice.

Data engineers who opt to use Pandio’s managed Presto service also have complete control of the number of Presto nodes in their deployment. They can also choose the node instance type, which affects the price/performance ratio discussed in the affordability section. With complete control of the deployment, enterprises receive a consistent level of performance and service from their SQL query infrastructure.

It’s cloud-native and installed in Kubernetes

Serverless solutions may seem attractive to data engineers at first because they are easy to use. However, they offer little, and often nothing, in terms of control. Serverless solutions make it hard, if not impossible, to add resources or sessions, or to debug.

Cloud-native solutions, especially those installed in Kubernetes, are the exact opposite of serverless solutions. They provide data engineers with more options and control.

Data engineers pick Pandio’s solution over others because it’s cloud-native and installed in Kubernetes. But, more importantly, Pandio’s team consists of experienced professionals who are well-versed in the installation, deployment, and optimization of applications within Kubernetes deployments.

Affordability

Most alternative solutions to managed Presto are expensive and inflexible, especially at scale. Price is a significant factor because data engineers also have to pitch their proposed solutions to CTOs. Unfortunately, many solutions on the market have pricing models that lead to very high costs.

For instance, some vendors bill per query and base prices on the volume of data scanned. In the big data industry, the volume of data scanned and the number of queries scale up with project demands, which can make the service quite expensive. Without the ability to predict the number of queries and the total cost, the CTO will likely choose not to pursue the solution further.

This is where Pandio’s managed Presto service excels. As we previously stated, Pandio’s cloud-native managed Presto service runs on Kubernetes. This architecture enables data engineers to automate cluster management and significantly reduce operational costs. In addition, Pandio’s pricing model is pay-as-you-go. What does that mean?

It simply means that you pay only for what you use. You pay only for the node clusters deployed (the compute used) and can run as many queries as you want. This is precisely why Pandio’s managed Presto service delivers the best price/performance ratio, as mentioned earlier. Data engineers get the same performance at half the cost of any other vendor on the market today.
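The difference between the two billing models can be sketched with a little arithmetic. All prices and workload figures below are made up for illustration and do not reflect Pandio's or any other vendor's actual pricing. The point is only the shape of the curves: scan-based billing grows with query volume, while node-based billing stays flat as long as the cluster size is fixed.

```python
def scan_based_cost(queries: int, tb_per_query: float, usd_per_tb: float) -> float:
    """Monthly cost when billing follows the volume of data scanned."""
    return queries * tb_per_query * usd_per_tb


def node_based_cost(nodes: int, hours: float, usd_per_node_hour: float) -> float:
    """Monthly cost when billing follows the compute deployed."""
    return nodes * hours * usd_per_node_hour


# Hypothetical prices, chosen only to make the arithmetic easy to follow.
USD_PER_TB = 5.0
USD_PER_NODE_HOUR = 0.5

# A fixed 4-node cluster running for a 720-hour month.
fixed = node_based_cost(nodes=4, hours=720, usd_per_node_hour=USD_PER_NODE_HOUR)

for queries in (1_000, 10_000):
    scanned = scan_based_cost(queries, tb_per_query=0.5, usd_per_tb=USD_PER_TB)
    print(f"{queries} queries: scan-based ${scanned:,.0f}, node-based ${fixed:,.0f}")
```

In practice a heavier workload may also need more nodes, so the node-based figure is not constant forever. It is, however, set by a deployment decision rather than by every query that gets run, which makes the total easier to predict.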

For data engineers to take a leading role in a company’s digital transformation, they need a solution that enables them to work with all the data they can access. Feel free to check out Pandio’s managed Presto service to learn how it can help enterprises access scattered data and turn it into insights for the business.