Introduction
In today’s world, we are surrounded by data: energy consumption, personnel records, budgets, invoices, and more. Data is everywhere.
Understanding data, categorizing it, using it, analyzing it, tracking it, and deriving conclusions from it are therefore critical for the success of any organization. Using data lineage, you can make the logical and physical location of your business data transparent, enabling you to make effective business decisions based on trustworthy data.
This guide shows you how to get started with data lineage in Ardoq using the preconfigured bundle and the necessary prerequisites.
What is Data Lineage
Data lineage is a map of the journey data takes through your company. Typically, it covers the data’s origin, how the data is processed and transformed within the business (including how and why it has moved over time), and its delivery and storage at an endpoint or end user.
Typically data lineage can be split into two distinct types:
Solution / Data Warehouse data lineage, comprising:
specific data origin
transformational steps/processing
the logic that decides data transformation steps
Enterprise data lineage, comprising:
where data resides
which apps read or write data
where in the world the data is physically located
how the information is used within the business
Solution data lineage is, therefore, more transactional/process related whereas Enterprise data lineage focuses on the high-level bigger picture. This Use Case Guide will focus on Enterprise data lineage.
Why is Data Lineage Important for Your Organization
Knowing your critical business data’s process point and location is paramount for two main areas:
Compliance management
Ensuring compliance with existing and future regulatory and legal directives
Enabling improved risk and compliance management
Providing the required data storage and processing
Showing the physical storage location of data, worldwide
Risk management
Ensuring that critical data comes from a reliable source
Data lineage aids both short and long-term management, showing businesses where data is logically and physically processed, where the business data is used, and who is responsible for stewarding the data entity.
How to Get Started With a Data Lineage Initiative
1. Defining Your Problem and Success Criteria
Start your data lineage initiative by defining the business problems that you want to solve.
Common problems include:
Which applications write information to the entity, and which read and use the information?
Who is responsible for the data entity?
What is the confidentiality of the data entity?
Which infrastructure hosts the data entity?
What is the location where your data entity is physically stored?
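The questions above can be read as lookups over a small graph of components and references. The following is a minimal in-memory sketch in Python, not Ardoq's API: the record layout, the `mode` attribute on an access, and all sample names are assumptions made for this example.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Access:
    """One application accessing one data entity (hypothetical record layout)."""
    application: str
    data_entity: str
    mode: str  # "read" or "write"; an assumed attribute for this example


# Hypothetical sample data, invented for illustration only.
ACCESSES = [
    Access("CRM", "Customer", "write"),
    Access("Billing", "Customer", "read"),
    Access("Billing", "Invoice", "write"),
]
OWNERS = {"Customer": "Alice", "Invoice": "Bob"}


def applications_by_mode(entity: str, mode: str) -> list[str]:
    """Return the sorted applications that access `entity` in the given mode."""
    return sorted(
        {a.application for a in ACCESSES
         if a.data_entity == entity and a.mode == mode}
    )


def owner_of(entity: str) -> str | None:
    """Return the person recorded as owning `entity`, or None if unknown."""
    return OWNERS.get(entity)
```

Calling `applications_by_mode("Customer", "write")` answers the first question for the sample data, and `owner_of("Customer")` answers the second. The remaining questions (confidentiality, infrastructure, physical location) would follow the same lookup pattern.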
Ardoq comes with a built-in data lineage metamodel based on short time-to-value results from various customers and industries. However, this metamodel can be adapted to your organization’s unique goals and objectives, using Ardoq’s metamodel as a template. Simply modify it to fit your company, adding any other specific problems you’re going to solve and removing those that don’t fit your industry or situation.
Once you know what your problems are, you need to determine how and when you can solve them. We suggest breaking your data landscape into pieces and tackling the most important parts first rather than trying to solve all problems at once.
Defining your success criteria for these problems is just as important as defining the problems themselves.
At Ardoq, we consider three key areas when identifying success metrics:
Data Quality & Completeness - have more accurate and complete data which is necessary for business decision making
Business Value - measure realization of the market and operational performance objectives
Productivity - see improvements from the Enterprise Architecture operating more efficiently
Initially, metrics do not have to be comprehensive, or exact, but they must be monitored and refined over time.
2. Data Governance
As outlined above, one of the key issues addressed by data lineage is supporting corporate and legal compliance.
To support any compliance requirements, it is essential that data is always up to date. Ardoq can automate the enforcement of many governance rules, for example:
All Data Entity associations from applications should be updated annually to ensure correct application risk assessment
All data entities should have up-to-date metadata to minimize the risk of documenting incorrect confidentiality
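As an illustration of the annual update rule, the check below flags application-to-data-entity associations that have not been updated within a year. It is a sketch in plain Python under stated assumptions, not how Ardoq enforces governance rules: the input shape, the 365-day approximation of "annually", and the sample dates are all hypothetical.

```python
from __future__ import annotations

from datetime import date, timedelta

# "Annually" approximated as 365 days; an assumption for this sketch.
REVIEW_INTERVAL = timedelta(days=365)


def overdue_references(
    last_updated: dict[tuple[str, str], date], today: date
) -> list[tuple[str, str]]:
    """Return (application, data entity) pairs whose last update is older
    than the review interval, sorted for stable output."""
    return sorted(
        pair for pair, updated in last_updated.items()
        if today - updated > REVIEW_INTERVAL
    )


# Hypothetical sample data, invented for illustration only.
last_updated = {
    ("CRM", "Customer"): date(2023, 1, 10),
    ("Billing", "Invoice"): date(2024, 3, 1),
}
overdue = overdue_references(last_updated, today=date(2024, 6, 1))
```

With the sample data, only the CRM to Customer association is overdue, because its last update is more than a year before the reference date.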
3. Get Your Data into Ardoq
Having defined your problems, success criteria, and metrics, you can begin entering data into Ardoq.
Using our predefined metamodel the data needed for data lineage includes component types (Data Entity, Confidentiality, Application, Business Capability, and People) and reference types (Accesses and Owns).
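Before importing, it can help to check that your source data only uses the types named above. The sketch below does this in plain Python. The type names come from this guide, while the input shapes (a name-to-type mapping and source/type/target triples), the `validate` function, and the sample names are assumptions for illustration.

```python
from __future__ import annotations

# Component and reference types listed in this guide.
COMPONENT_TYPES = {
    "Data Entity", "Confidentiality", "Application",
    "Business Capability", "People",
}
REFERENCE_TYPES = {"Accesses", "Owns"}


def validate(
    components: dict[str, str],
    references: list[tuple[str, str, str]],
) -> list[str]:
    """Return human-readable problems found in the data set.

    `components` maps a component name to its type; `references` holds
    (source, reference type, target) triples. Both shapes are hypothetical.
    """
    problems = []
    for name, ctype in components.items():
        if ctype not in COMPONENT_TYPES:
            problems.append(f"{name}: unknown component type '{ctype}'")
    for source, rtype, target in references:
        if rtype not in REFERENCE_TYPES:
            problems.append(
                f"{source} -> {target}: unknown reference type '{rtype}'"
            )
        for endpoint in (source, target):
            if endpoint not in components:
                problems.append(
                    f"{source} -> {target}: '{endpoint}' is not a known component"
                )
    return problems
```

An empty result means every component and reference fits the types above. Any message returned points to a row to fix before import.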
This data can be imported in several ways including:
Excel importer - use to import/update large data sets easily
XML - Data entities and models are usually documented with modeling tools. The tools’ export files are typically unstructured, but the XML importer helps you import them.
Azure AD - use Ardoq’s Azure AD integration to sync Active Directory with Ardoq
Custom Integrations - use the REST API, or its wrappers for selected programming languages, to automate documentation or create custom tools that use data from Ardoq.
Take advantage of our Excel import templates to streamline data entry and speed up the process.
4. Visualizing Data
Having input your data, you need to determine the best way to visualize the data lineage information. Several different formats are available, so use those which are most appropriate for your business. The formats include:
Pages View
See all of the Components or References information as text, as you would in document editing software or a wiki. In addition, it shows the fields and values you’ve defined for those components or references.
Table View
The data catalog is presented in a tabular format that includes both metadata and references.
Block Diagram View
The block view format allows you to quickly see and understand the adjacent and extended context for any component.
Dependency Map View
This visual allows you to easily see relationships or hierarchy in a compact nested visual. This is ideal for showing reference to data entities in a structured and clear way.
5. Assessing Data Completeness
Once you have collected and visualized your data, it is essential to determine how complete the data set is, because high data quality is critical for delivering insights.
Visualizing your data can highlight missing, inconsistent, or incorrect information. However, a more robust way to check for data completeness is to create queries that aggregate and summarize the quality of data sets. You can plug the results into a dashboard or use the prebuilt Ardoq Data Lineage Data Quality Dashboard.
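To make the idea of an aggregating completeness query concrete, the sketch below computes, for each required field, the percentage of data entities that have a value. It is plain Python over exported records, not Ardoq's query language. The field names and sample records are hypothetical.

```python
from __future__ import annotations


def completeness(records: list[dict], required_fields: list[str]) -> dict[str, float]:
    """Return, per field, the percentage of records with a non-empty value.

    Missing keys, None, and empty strings all count as incomplete.
    """
    total = len(records)
    if total == 0:
        return {field: 0.0 for field in required_fields}
    return {
        field: round(100 * sum(1 for r in records if r.get(field)) / total, 1)
        for field in required_fields
    }


# Hypothetical sample data, invented for illustration only.
data_entities = [
    {"name": "Customer", "owner": "Alice", "confidentiality": "High"},
    {"name": "Invoice", "owner": "Bob", "confidentiality": ""},
    {"name": "Order", "owner": "Carol"},
    {"name": "Product", "owner": None, "confidentiality": "Low"},
]
scores = completeness(data_entities, ["owner", "confidentiality"])
```

For the sample data, three of four entities have an owner and two of four have a confidentiality value. The low-scoring fields indicate where to focus data collection.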
To improve the quality of the collected data you should collaborate with colleagues.
Collaboration can be undertaken manually, in an ad-hoc manner, or through the use of Surveys to collect missing information from data architects, data entity owners, or application owners. Once identified, updates and changes should be made to the data either directly or via the Excel Importer.
6. Analyzing Results
Once you have data of sufficient quality, you can analyze it using custom or pre-built views and presentations. These will surface insights that can be shared with your stakeholders.
For data lineage, use the preconfigured Discover viewpoints for stakeholders to visualize a data entity, the applications using the data, on which infrastructure it exists, and its physical storage location. Additionally, use the preconfigured presentation to gain a complete overview of where business data is used.
7. Continuous Update
Maintaining the status and quality of data fields is essential for legal and corporate compliance, as well as for minimizing business risk during decision-making. Stakeholders receive continued value when high data quality is maintained.
To ensure data integrity, broadcasts can be used to trigger workflows that automatically send out alerts and surveys. This helps establish confidence in the architecture team and in the insights it delivers.
Ardoq includes pre-configured broadcasts that are set to collect this data at a 12-month interval. However, if your organization is fast-paced and changes frequently, you might want to set the interval to 6 months. Ultimately, you will need to experiment with the frequency of alerts and surveys to find the right rhythm for your organization.
Summary
Organizations today are faced with a wealth of data, which they must be able to understand, categorize, utilize, analyze, track, and draw critical conclusions from.
Regulatory, legal, and corporate requirements mean that all organizations must comply with procedures and processes for the control, management, and validation of data.
Data lineage and associated techniques are used to track the data journey through your organization, from its origin and entry into the business, how and where it is processed, transformed, and moved over time, through to its delivery and storage at an endpoint or end-user.
This Getting Started guide provides an overview of the steps required to support this process: defining the company's problems, success criteria, and appropriate metrics; collecting and entering data; visualizing it; assessing its completeness; improving its quality; and analyzing the results and the insights they generate.
Finally, we emphasized the importance of keeping the data accurate and up to date through continuous updating and the use of broadcasts.
We hope that this guide has given an insight into data lineage, its importance to the organization for compliance and risk management, and how to go about starting your data lineage journey.
For a more detailed understanding of data lineage within the Ardoq metamodel see our Data Lineage Metamodel guide.