Data Lineage Primer

An introduction to Data Lineage

Written by Jim Down
Updated over a week ago

Introduction

In today’s world, data is everywhere. We are surrounded by it in the form of spreadsheets, sales figures, energy consumption, personnel data, budgets, and more. Market data drives company strategy, production data informs manufacturing processes and costs, and sales figures inform profitability. Data is the lifeblood of business.

Understanding data, categorizing it, using it, analyzing it, tracking it, storing it, and deriving conclusions from it are therefore critical to the success of any organization.

This is where data lineage becomes important.

What Is Data Lineage

Data lineage is, essentially, a map of the journey data takes through a company. It starts at the data's origin, records each process and transformation along the way, including how and why the data has moved over time, and ends with the data's delivery to an endpoint or end user.
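The "map of the journey" idea can be sketched as a small directed graph. The example below is a minimal illustration, not a description of how any particular tool stores lineage: the dataset names, transformation names, and the helper functions `build_upstream_index` and `trace_origins` are all hypothetical.

```python
from collections import defaultdict

# Hypothetical lineage edges: (source, transformation, target).
# All names are invented for illustration.
EDGES = [
    ("crm_contacts", "deduplicate", "clean_contacts"),
    ("web_orders", "currency_normalize", "clean_orders"),
    ("clean_contacts", "join_on_customer_id", "sales_mart"),
    ("clean_orders", "join_on_customer_id", "sales_mart"),
    ("sales_mart", "aggregate_monthly", "revenue_dashboard"),
]


def build_upstream_index(edges):
    """Map each target to the (source, transformation) pairs feeding it."""
    index = defaultdict(list)
    for source, transformation, target in edges:
        index[target].append((source, transformation))
    return index


def trace_origins(node, upstream):
    """Walk the graph backwards and return the nodes that have no upstream."""
    seen, stack, origins = set(), [node], set()
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        parents = upstream.get(current, [])
        if not parents:
            origins.add(current)  # nothing feeds this node, so it is an origin
        for source, _transformation in parents:
            stack.append(source)
    return origins


upstream = build_upstream_index(EDGES)
print(sorted(trace_origins("revenue_dashboard", upstream)))
# ['crm_contacts', 'web_orders']
```

Starting from the endpoint and walking backwards answers the question "where did this data come from?"; the same edges read forwards show every step the data passed through on its way to the end user.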

Typically, data lineage can be split into two distinct types:

  1. Solution / Data warehouse data lineage, comprising:

    1. specific data origin

    2. transformational steps/processing

    3. the logic that decides data transformation steps

  2. Enterprise data lineage, comprising:

    1. where data resides

    2. which apps read or write data

    3. where in the world the data is physically located

    4. how the data is used within the business

Solution data lineage is, therefore, more transactional and process-related, whereas Enterprise data lineage focuses on the high-level, bigger picture. This primer focuses on Enterprise data lineage.

An important part of data lineage is understanding how the data exists within your technical and business landscape. There are three ways, or “master data management patterns”, in which data is used and moved around within an organization:

  1. Single source

  2. Master/Slave

  3. Multi-master

These are explained in more detail in our Data Lineage Metamodel document.

In some instances, data lineage can be documented visually to show the source of the data, each process or change it encounters on its journey within the organization, and its destination.

Why Is Data Lineage Important

As the volume of data streams grows, especially via the cloud, it is becoming ever more critical for organizations to fully understand the data lifecycle: mapping how and where data is sourced, its transformation, and where it is stored. This knowledge is becoming increasingly important from a compliance and governance perspective.

Data governance requires that data be stored and processed in line with organizational policies and, importantly, regulatory standards.

Data lineage can be used to support compliance auditing, improve risk management, and ensure data is stored and processed as required. One example is GDPR compliance, particularly its requirements on how data is processed and stored.

Benefits of Data Lineage

Understanding everything about your data, including its provenance, what information it actually provides, how your organization processes it, and where it finally resides, is critical for a number of reasons.

Specifically, data lineage will help your organization with the following:

  • Providing the necessary core data, including its source, its flow, its storage, and its use within the business, to ensure compliance with existing and future regulatory and legal directives

  • Enabling improved risk and compliance management

  • Making transparent where in the world data is physically stored, for example, whether personally identifiable information (PII) is stored within the European Union (EU)

  • Gaining visibility into how data is actually processed on its journey through the organization, compared to how the business needs it to be processed

  • Ensuring that critical data comes from a reliable source
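As a rough illustration of the physical-storage point above, the sketch below records a few Enterprise data lineage attributes per dataset (where it is stored, which applications read or write it, whether it holds PII) and flags PII held outside the EU. Everything here is an assumption made for the example: the `DatasetRecord` class, the `pii_outside_eu` helper, the dataset and application names, and the deliberately incomplete `EU_COUNTRIES` set.

```python
from dataclasses import dataclass, field

# Hypothetical, deliberately incomplete set of EU country codes.
EU_COUNTRIES = {"DE", "FR", "IE", "NL", "SE"}


@dataclass
class DatasetRecord:
    """Illustrative Enterprise lineage record for one dataset."""
    name: str
    country: str                 # where the data is physically stored
    contains_pii: bool
    read_by: list = field(default_factory=list)     # apps that read the data
    written_by: list = field(default_factory=list)  # apps that write the data


def pii_outside_eu(records):
    """Return names of datasets that hold PII and are stored outside the EU."""
    return [
        r.name for r in records
        if r.contains_pii and r.country not in EU_COUNTRIES
    ]


records = [
    DatasetRecord("customer_profiles", "IE", True, ["crm"], ["crm"]),
    DatasetRecord("support_tickets", "US", True, ["helpdesk"], ["helpdesk"]),
    DatasetRecord("product_catalog", "US", False, ["webshop"], ["pim"]),
]

print(pii_outside_eu(records))
# ['support_tickets']
```

A real inventory would draw these attributes from documented lineage rather than a hand-written list, but the check itself stays this simple once the storage location and PII flag are known for each dataset.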

Summary

As the volume of data grows, along with societal attention to ethical data usage, it is essential that businesses know where their critical data comes from, how it is processed, and where it is stored.

Data lineage provides this information through two streams. Enterprise data lineage provides a high-level view of the data, how it flows between applications (the patterns), and where it is stored. Solution data lineage provides a lower-level, transactional view of the data, where it comes from, and how it is processed.

As compliance and governance directives become stricter, knowledge of data flows is essential.

For more detailed information about how Ardoq deals with data lineage, please read our article on the data lineage metamodel.
