Data Engineering: What are Data Pipelines and why your business needs them?

Throughout history, a reliable way to add value to two things has been to connect them. Connect several things into a sequence and you get a pipeline. A pipeline is a process: raw materials go in at one end, value emerges at the other, often with many transformations adding to it along the way.

Pipelines – of people, resources, and knowledge – are central to the development of civilisation. They’re how cities along the Silk Roads prospered, connected by desert tracks and donkeys. They’re how islands in the ancient Greek archipelago formed a nation, connected by wooden boats and mariners’ wisdom. They’re how diverse kingdoms south of the Himalayas became today’s India, connected by rail and road, and how oil and gas carpeted the Arab world in petrodollars. Pipelines – whether passenger transport, trade routes, or knowledge transfer – became the driver of countless empires: the Romans, Renaissance Venice, the Dutch and British East India Companies.

Today, the world’s most important pipelines carry the data flows that make up our computing infrastructure – connected by thousands of kilometers of optical fiber, millions of mobile masts, and billions of networked devices. The discipline that plans how data travels and is transformed across these connections is called data engineering. But the principle is the same as it’s been for thousands of years: when pipelines work, they lead to fresh opportunities, larger markets, and a greater diversity of knowledge.

Bob Metcalfe, the inventor of Ethernet, even put a number on it. Metcalfe’s Law states that the value of a network increases with the square of its number of users. In other words, the value rise isn’t linear: it’s quadratic, growing faster and faster as the network does.

That’s why pipelines are at the core of every business today. Scratch the surface of any network and you’ll find it’s far from random – it’s made up of established processes, each a planned sequence of interactions between applications, data, devices, and people with the goal of adding value. A data pipeline. They may evolve over time – but that’s the point. A smoothly functioning network depends on the effectiveness of the data pipelines that take in, transform, and pass on business data to those who need it. Without these pipelines, a network is nothing.

So: to optimize your business, optimize your data pipelines. Let’s look at how Strypes approaches this aspect of data engineering.

How computing changed: the dominance of data

Computing used to be all about the applications: big pieces of software that sat on your servers, supported by a team of technicians 24/7. Data was almost a footnote, a trickle of numbers exchanged between different departments – often with a lot of manual rekeying to get it into a useful form.

No more. A data pipeline today is a firehose of information – gigabytes daily even at SME-sized companies and smaller – with data volumes that dwarf the size of the applications interacting with them. That’s why today’s data needs corralling and controlling as efficiently as possible: efficient data pipelines save time, trouble, and, most importantly, money.

How? There’s a variety of methods: streamlining data in bulk so it can be processed more efficiently; separating data that needs real-time updates from data that doesn’t; automating people’s work and supporting management decisions. And these goals need proper planning, optimization, and execution. They need engineering.

Why are data pipelines useful in data engineering?

Data engineers use a dozen or so models for building and optimizing data pipelines, but most of them answer one basic question: is the data worked on now (in real time) or later (for economies of scale)? Let’s look at the main ones.

Extract, Transform, Load and batch processing pipelines

The batch processing method’s been around since computing’s earliest days – because of a simple principle: gathering a bunch of data, then performing the same operation on it in one go, is more efficient than doing it repeatedly as each data chunk comes in. Even in today’s real-time, up-to-the-minute, 24/7 world, batch processing is still useful for many data types.

The traditional approach for processing large volumes at once is called ETL (Extract, Transform, Load). It does what it says: extract raw data from whatever dataset it has to work with, transform it with a useful operation, then load the value-adding result into a data warehouse where it’s available for use. Many data engineers use tools like the open-source framework Apache Hadoop for it.
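
To make the idea concrete, here’s a minimal ETL sketch in Python. The sales export file, column names, and the local SQLite file standing in for the data warehouse are all assumptions for illustration; in production the source and destination would be your own systems.

```python
import sqlite3
import pandas as pd

# Extract: pull the raw data out of the source system (hypothetical export file).
raw = pd.read_csv("exports/sales_raw.csv")

# Transform: clean it up and add value – here, daily revenue per product.
raw["order_date"] = pd.to_datetime(raw["order_date"]).dt.date
daily = raw.groupby(["order_date", "product_id"], as_index=False)["amount"].sum()

# Load: write the result to the "warehouse", ready for reports and dashboards.
warehouse = sqlite3.connect("warehouse.db")
daily.to_sql("daily_revenue", warehouse, if_exists="replace", index=False)
warehouse.close()
```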

Note, though, that batch processing isn’t real time. That’s the point: you’re waiting until you have a decent-sized dataset to work on! So it’s great for regular but non-instant needs like overnight updates and collating month-end results – but this architecture won’t perform when the outputs are needed now.

There’s also an architecture called micro-batching. It’s still a batch model, but the batches are much smaller, so they can be processed sooner and the results warehoused faster. Some micro-batching architectures deliver results in near real time; Apache Spark is a popular technology for building this type of data pipeline.  
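
As a rough illustration, here’s what a micro-batch pipeline might look like with Spark Structured Streaming. The input path, schema, and one-minute trigger interval are assumptions for the sketch, not a prescription.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, window

spark = SparkSession.builder.appName("micro-batch-demo").getOrCreate()

# Pick up new order files as they land in a (hypothetical) landing folder.
orders = (
    spark.readStream
    .schema("order_id STRING, amount DOUBLE, ts TIMESTAMP")
    .json("/data/incoming/orders/")
)

# Revenue per five-minute window, recomputed micro-batch by micro-batch.
revenue = orders.groupBy(window(col("ts"), "5 minutes")).sum("amount")

# Each one-minute trigger processes one small batch; the console sink is just for the demo.
query = (
    revenue.writeStream
    .outputMode("complete")
    .format("console")
    .trigger(processingTime="60 seconds")
    .start()
)
query.awaitTermination()
```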

Stream processing pipeline architectures

Batch processing’s energetic brother is the stream: data is worked on as it arrives, in real time so the results are available instantly. (The results can be as simple as the output of a web form, or the update to your bank account after you tap to pay.)

In the tech stack, it’s common to see Apache Kafka handling stream processing, thanks to its ability to process huge volumes of data across its highly scalable “clusters”. A stream of data may vary from a trickle to a torrent, but the right pipeline architecture can handle it.
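
Here’s a minimal sketch of that pattern using the kafka-python client. The broker address and topic names ("payments.raw", "payments.enriched") are placeholders, and the enrichment step is deliberately trivial.

```python
import json
from kafka import KafkaConsumer, KafkaProducer

consumer = KafkaConsumer(
    "payments.raw",
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
)
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda d: json.dumps(d).encode("utf-8"),
)

# Work on each event the moment it arrives, then pass the result downstream.
for message in consumer:
    event = message.value
    event["amount_eur"] = round(event["amount"] * event.get("fx_rate", 1.0), 2)
    producer.send("payments.enriched", event)
```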

Mixing and matching: lambda and kappa

As you’d expect, there are hybrid approaches between the extremes of batch-only and stream-only – and many businesses run a mix of both.

The lambda architecture divides data handling into two “layers”, called batch and speed. The batch layer focuses on accuracy and completeness, warehousing data safely and securely for later use, whereas the speed layer handles real-time data streaming in, for when users need answers immediately. User interfaces such as financial and operational dashboards often combine the outputs of both: it gives people the Big Picture of their Big Data.
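
A toy sketch of the serving side of this pattern, with hard-coded stand-ins for the two layers: a batch view computed by last night’s warehouse job and a speed view kept current by a stream consumer. The customer IDs and figures are invented for illustration.

```python
# Batch view: accurate totals up to the last completed batch run (hypothetical data).
batch_view = {"cust-42": 1250.00}

# Speed view: events that have streamed in since that run.
speed_view = {"cust-42": 37.50}

def current_total(customer_id: str) -> float:
    # A dashboard merges both layers to show a figure that is complete *and* fresh.
    return batch_view.get(customer_id, 0.0) + speed_view.get(customer_id, 0.0)

print(current_total("cust-42"))  # 1287.5
```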

By contrast, the kappa approach has only a single layer (speed) – but it works on both real-time data streaming in and historical data warehoused elsewhere, treating data that would normally be batched with the same urgency as a live stream. Lambda aims for maximum efficiency in its data handling; kappa aims for faster availability. Both approaches have their applications.
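
In kappa, reprocessing history typically means replaying the event log through the same code path that handles live data. Here’s a sketch with kafka-python, again with placeholder topic and broker names:

```python
import json
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "orders",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",  # replay the full log, then keep consuming live events
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
)

# One code path for both historical and real-time data – no separate batch pipeline to maintain.
running_totals = {}
for message in consumer:
    order = message.value
    running_totals[order["customer_id"]] = (
        running_totals.get(order["customer_id"], 0.0) + order["amount"]
    )
```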

Data lake architectures: all your data in one place

In contrast to a data warehouse – designed to store data that’s already been transformed in some way, ready for use – a data lake brings together the whole zoo: structured, unstructured, and semi-structured data all share the same space, in their native formats. The point is to maintain a “single point of reference” for different types of data, making sure the complete dataset – however ordered or disordered its data – is available to those who need it.

A data lake architecture is highly flexible, and works at enterprise scale: it’s not so much a data pipeline itself, more like a standard part that “plugs into” many other pipelines of different types. Imagine a company that needs to stream video, populate customer dashboards, and provide real-time updates on stock prices: it probably floats on a data lake architecture.
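
As a small-scale sketch, here’s what “landing” raw data in a lake can look like, with a local ./lake/ folder standing in for object storage such as S3 and hypothetical export file names.

```python
from pathlib import Path
import shutil
import pandas as pd

lake = Path("lake/raw")

# Structured data: a tabular export from an operational database, stored in columnar form.
(lake / "orders").mkdir(parents=True, exist_ok=True)
orders = pd.read_csv("exports/orders_2024-05-01.csv")
orders.to_parquet(lake / "orders" / "2024-05-01.parquet")

# Semi-structured data: clickstream events arrive as JSON and are kept in their native format.
(lake / "clicks").mkdir(parents=True, exist_ok=True)
shutil.copy("exports/clicks_2024-05-01.json", lake / "clicks" / "2024-05-01.json")

# Batch jobs, streaming jobs, and ML pipelines can all read from the same lake paths.
```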

Advantages of data-engineering your data pipelines

That’s quite a choice. But if you get the mix right, the savings – in money, time, and resources – can be terrific. Here’s a quick list of what you want to see in an effective data pipeline.

A smooth flow of data being processed. If the data’s being worked on at the speed you want, without long latencies or bottlenecks at busy times, your computing resources are being used efficiently – and with compute prices rising, you don’t want to pay more for your cloud than you need to.

Higher scalability and flexibility. You don’t know what your needs will be next year – or in some cases even next month. The right data pipeline architecture lets you grow as your business does … or shrink when the load’s less frantic.

Automation for greater efficiency. Wherever a human needs to interact with your data pipeline in the same way regularly, there’s an opportunity for automation. Automating boring, repetitive, yet critical tasks that can be done more accurately by machine is good for your business – and equally good for your people, who can turn their minds to more creative and innovative work.

Smarter monitoring and maintenance. With data engineers responsible for optimizing your data pipeline, every waypoint can be measured and managed for efficiency – because that’s how you optimize: seeing results and thinking about how to improve them. A professionally engineered data pipeline also contains data about the data: metadata you can use to track and tune the pipeline itself.

A goal of integration. Data lakes and warehouses bring together information from different sources in one place, giving you a more complete picture of all the data in your organization. Integrating different data sources in a single data pipeline often reduces tech debt and load, too.

Any data network can benefit from data engineering

That’s why data pipelines are such a big part of data engineering: one of our core skills here at Strypes. They connect resources, end to end. They optimize efficiency, at every interaction. And they add to the value of all the outputs that emerge, so you can make better decisions and stay competitive.

And they’re growing in importance to today’s enterprise. Why not reach out to us and discuss what a focus on data pipelines could do for you?

Get in touch