Challenges in Building Big Data Pipelines for Next-Generation Workloads
In the past decade, the volume, velocity, and variety of raw data have grown massively. With more data, companies stand to turn raw data into gold. Quickly feeding data scientists clean, manageable datasets helps companies remain agile within their respective industries, and data pipelines can reduce operating costs and boost productivity. Unsurprisingly, issues arise at scale, and data pipelines are by no means immune.
For context, data pipelines combine software services to automate the unification, management, and visualization of raw data, handling extract, transform, load (ETL) in a predefined, iterative process. At the beginning of the pipeline, raw, unstructured data sits in applications, databases, files, or the cloud. As the data moves through the pipeline, it is cleaned and transformed until it can be effectively presented in a given business intelligence tool. While every data pipeline serves the specific needs of its company, they all implement the following steps:
- Extraction: Here, the relevant fields of the dataset are pulled out and can be feature engineered. For example, given a field of full addresses, this step can pick out zip codes, states, etc.
- Joining: Since the pipeline will take data from multiple sources, this step consolidates the data. Here, the data can be paired with other relevant features to create a single dataset.
- Standardization: With a single dataset, this step converts the data into a common format that adheres to the analytical needs defined by the engineers. Features like UNIX timestamps can be converted into readable dates rather than the number of seconds since January 1, 1970 (the UNIX epoch). Abbreviated state names and inconsistent phone number formats are corrected to follow a standardized form.
- Data Loading: At this step the data is ready for analysis, but it still needs to be stored. The pipeline loads the dataset into a target system, whether that be cloud storage, a data warehouse, or a relational database.
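The four steps above can be sketched on a tiny in-memory record. This is a minimal illustration, not a production pipeline: the field names, sources, and sample values are invented for the example, and a plain list stands in for the target warehouse.

```python
import re
from datetime import datetime, timezone

def extract(record):
    """Extraction: pick out relevant fields, e.g. the zip code from a full address."""
    zip_match = re.search(r"\b(\d{5})\b", record["address"])
    return {"user_id": record["user_id"],
            "zip": zip_match.group(1) if zip_match else None,
            "ts": record["ts"]}

def join(billing_rows, activity_rows):
    """Joining: consolidate rows from two hypothetical sources, keyed by user_id."""
    activity = {r["user_id"]: r for r in activity_rows}
    return [{**b, **activity.get(b["user_id"], {})} for b in billing_rows]

def standardize(row):
    """Standardization: turn a UNIX timestamp into a readable date and
    normalize the state abbreviation."""
    row["date"] = datetime.fromtimestamp(row.pop("ts"), tz=timezone.utc).strftime("%Y-%m-%d")
    row["state"] = row["state"].strip().upper()
    return row

def load(rows, target):
    """Data loading: append the cleaned rows to the target store."""
    target.extend(rows)

warehouse = []
raw = [{"user_id": 1, "address": "500 Main St, Springfield, IL 62701", "ts": 1700000000}]
activity = [{"user_id": 1, "state": " il "}]
load([standardize(r) for r in join([extract(r) for r in raw], activity)], warehouse)
# warehouse[0] → {'user_id': 1, 'zip': '62701', 'date': '2023-11-14', 'state': 'IL'}
```

In a real pipeline each function would be backed by its own tooling (a connector, a stream processor, a warehouse loader), but the shape of the flow is the same.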
Data pipelines position companies to generate insights from raw data with far more agility. By leveraging a group of software technologies, a pipeline can integrate and manage unstructured data to simplify reporting and business intelligence.
Orchestrating numerous software services to get data in motion requires constant monitoring. A team of data engineers must validate that each pipeline is outputting consistent, reliable data. Typically, a data pipeline integrates a number of tools to handle every iteration of the ETL process. That process can grow increasingly complex, but it is not beyond the scope of most data engineering teams. However, most BI applications require more than a single dataset to support well-informed decisions, and with that comes many data pipelines. Even with a team of diligent engineers and capable software services, data pipelines at scale can quickly become a chaotic mess. With more pipelines in play, engineers lose visibility into what exactly is happening in each one. Monitoring and maintenance become increasingly complex, and developers require a more robust platform to handle the compute, storage, and maintenance needs of a multitude of data pipelines.
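To make the visibility problem concrete, here is a toy sketch of one monitoring check a team might run: flagging any pipeline whose last successful run is older than a freshness threshold. The pipeline names and the one-hour threshold are invented for illustration; real platforms track far more (throughput, lag, schema drift, error rates) across every pipeline at once.

```python
from datetime import datetime, timedelta, timezone

def stale_pipelines(last_success, now, max_age=timedelta(hours=1)):
    """Return the names of pipelines whose most recent successful run is too old."""
    return sorted(name for name, ts in last_success.items() if now - ts > max_age)

now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
runs = {
    "billing_etl": now - timedelta(minutes=10),  # fresh
    "clickstream": now - timedelta(hours=3),     # stale
    "crm_sync":    now - timedelta(hours=26),    # stale
}
print(stale_pipelines(runs, now))  # → ['clickstream', 'crm_sync']
```

A check like this is trivial for three pipelines; the point is that with dozens or hundreds of them, every such check, alert, and remediation path must be multiplied accordingly.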
Computation power and storage resources are the most abundant they have ever been. However, these resources are not infinite, and with each additional data pipeline an organization can quickly find its compute and storage ceilings. Each data pipeline requires compute power to perform the necessary ETL and a final destination to store the cleaned data. Fortunately, with so many companies implementing data pipelines, the software backing them is mature and well documented. In particular, Apache Kafka has become more or less the standard in many real-time data pipelines. Kafka can remedy the complexity of implementing a single pipeline, but it struggles when maintaining a large cluster of them. This is due in part to Kafka brokers coupling computation and storage on the same nodes. As a company adds more pipelines, Kafka's resources can quickly run into deficit and the system becomes increasingly plagued with operational issues. With the scale of data growing exponentially, a Kafka-based pipeline can rupture, and the maintenance required undermines its autonomous functionality. Apache Pulsar, on the other hand, treats compute and storage as two separate layers, so the demands on each can grow dynamically as needed. Separating the two allows for a more robust pipeline that can efficiently meet the ETL demands of a company while leaving ample headroom for both compute and storage. For large storage needs or very complicated ETL processes, Pulsar can ease the developmental pains at the same level as Apache Kafka while raising the resource ceiling. With the growing velocity of data, most companies are going to hit this hard resource ceiling, and they will need robustness at scale.
Stay tuned for upcoming articles, where we will dive deeper into the complexities of implementing a data pipeline and survey how Kafka and Pulsar stack up against each other, both from a development perspective and in how their resource allocation performs in a given data pipeline application.