Data Lineage Metamodel: Purpose, Scope, & Rationale

Learn about when you should do data lineage and how to get started.

Written by Jim Down
Updated over a week ago

Introduction

The scope of Data or Information in Enterprise Architecture (EA) is large. This is explained in the Data Lineage Primer. Including data in your EA scope is beneficial and foundational for risk and compliance work. You can use Ardoq’s Data Lineage Use Case Guide to improve transparency around:

  • Your data entities

  • Which applications they exist in and in what form

  • Which part of the business they are used in

Read this article to learn about the scope of the use case guide and for detailed information on implementation and common usage patterns.

Scope

The Data Lineage Use Case Guide explains how to document a high-level view of data entities and their use in the business. To achieve this, the guide answers the following questions:

  • Which applications write information to the data entity, and which only read and use it?

  • Who is responsible for the data entity?

  • What is the confidentiality of my data entities?

  • On which infrastructure is my data entity stored?

  • Where in the world is the data entity physically stored?

A data entity's path through the application landscape is often part of data lineage. This is only partially covered in the current scope of the use case guide. The Application Integration Management Use Case also partly covers this feature. It will be fully supported in a subsequent release of the Data Lineage Use Case Guide.

Rationale

The subject of data is huge, as the Data Lineage Primer clarifies. We have made the following decisions to narrow the scope of our Data Lineage Use Case while delivering clear value for Ardoq users.

Data Lineage in an Enterprise Context

This use case helps you create and maintain an enterprise view of your organization’s data at a conceptual level of abstraction. It highlights data usage and ownership, and it helps you keep the business informed about the essential aspects of data lineage.

Data modeling can be done on several abstraction levels. These are the 3 most common ones:

Conceptual

The most abstract modeling level. It typically contains the name of the data entity (as detailed in the section Data Entity below) and the relationships that connect related data entities.

Logical

Describes how a specific data entity is implemented without taking technology into account. In other words, attribute types and cardinality are added to the model.

Physical

Technology-specific. It describes how the types and relationships are implemented in a particular database technology.

For more information on data modeling abstraction levels, refer to What is Data Modelling?

Data entities for data lineage should be modeled conceptually so that you don't get bogged down with unnecessary detail. However, for the purposes of data lineage, we don't need to model the relationships. This will come in a future Use Case Guide that focuses specifically on Information Architecture.

Data, as we usually experience it in applications, is modeled in great detail to help specialists do their job effectively. However, the data entities documented for enterprise data lineage are intended for people in different professional areas, so rather than documenting data entities in detail, we recommend approaching it on a high level. Customer, Product, Identity, and Employee are examples of data entities documented at the recommended level.

The best place to start is the outcome your organization wants to achieve: begin by documenting the data related to that outcome. For example, if your primary goal with data lineage is to address privacy compliance policies, start by detailing the Customer entity. Let the priority of your remaining outcomes drive each subsequent iteration. Work on one to three data entities at a time, and make sure their data is of high quality before moving on to the next set.

Read our Data Lineage Use Case Guide for more insight into our approach, including how to proceed if you already have a data set and how to measure the quality of your data.

Metamodel

If you have completed the Application Lifecycle Management, Application Integration Management, and Business Capability Management/Realization (BCM/BCR) Use Cases, you’ll already have all the relevant component types in your Ardoq instance. In this case, the Data Lineage Use Case will mainly help you specify the data entities, augment these with a confidentiality classification, and create references from applications, business capabilities, and people. These references are necessary for answering the questions identified and listed earlier.

Reference 1: The metamodel for data lineage

Component Types

Data Entity

Data Entity is the primary component type of data lineage. It represents a real-world object or concept such as a product, customer, order, or contact details. Data entities are commonly used to identify master business data sources across applications and data stores. The data entity should not represent the detailed data model from the integration message. Instead, it should represent only the major logical elements that are useful to track across the application landscape.

Moreover, the data entity is the focal point when showing which data each business area uses and where in the world that data is physically stored.

The Data Entity component type consists of the following fields:

  • Name: A concise description of the data entity. Covers the organization-wide definition of a business object.

  • Description: A concise description of the nature of the data entity and which real-world object it represents.

  • Confidentiality:

    - Public: This information is public and can be shared openly.

    - Internal: Internal information that should be protected with limited controls. Internal information may include the employee handbook, policies, and memos. If shared, internal information has minimal impact on the business.

    - Confidential: Confidential information that should be held within the business. This information may include customers, pricing, or intellectual property. If disclosed, confidential information could negatively affect your business.

    - Restricted: Restricted information is highly sensitive and should be limited to a need-to-know basis.

  • Number of Reports / Data Assets (calculated field): Counts how many Accesses references there are to one specific data entity.
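
To make the field list above more concrete, here is a minimal Python sketch of a data entity modeled outside Ardoq. The names `DataEntity`, `Confidentiality`, and `count_accesses`, and the tuple layout used for references, are assumptions made for this illustration and are not part of Ardoq's API. The calculated field is approximated by counting incoming Accesses references.

```python
from dataclasses import dataclass
from enum import Enum


class Confidentiality(Enum):
    """The four classification levels described for the Confidentiality field."""
    PUBLIC = "Public"
    INTERNAL = "Internal"
    CONFIDENTIAL = "Confidential"
    RESTRICTED = "Restricted"


@dataclass(frozen=True)
class DataEntity:
    """Hypothetical in-memory stand-in for a Data Entity component."""
    name: str
    description: str
    confidentiality: Confidentiality


def count_accesses(entity_name, references):
    """Count Accesses references that point at the named data entity.

    `references` is a list of (source, reference_type, target) tuples,
    an assumed export format used only for this example.
    """
    return sum(
        1
        for _source, ref_type, target in references
        if ref_type == "Accesses" and target == entity_name
    )


customer = DataEntity(
    name="Customer",
    description="A person or organization that buys products.",
    confidentiality=Confidentiality.CONFIDENTIAL,
)

references = [
    ("CRM", "Accesses", "Customer"),
    ("Billing", "Accesses", "Customer"),
    ("Jane Doe", "Owns", "Customer"),      # Owns is not counted
    ("CRM", "Accesses", "Product"),        # different data entity
]

access_count = count_accesses(customer.name, references)
```

Keeping the confidentiality levels in an enumeration rather than free text makes it harder to introduce misspelled classifications, which matters once reports filter on those values.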

Application

As covered in greater detail in the Application Lifecycle Management Metamodel article, an application is the configuration of lower-level software or technology to provide a specific business capability or technology capability, perform a defined task, or to analyze specific information.

Applications are central to understanding data lineage, as data is usually created and processed in applications. Also, applications are the components connecting data entities to infrastructure and physical locations, thus enabling both logical and physical data lineage.

Interface

An interface is a dedicated point of interaction between two or more applications or other technology components. It implements functionality to enable interoperability and exchange of information, including agreed behavior, common semantics, and defined security and availability.

The Interface component supplies the data entity to other applications and thereby allows for data transfer between applications. The Interface is a child component of an application.

For more information about the interface component type, please see the article Application Integration Management Metamodel.

Business Capabilities

A business capability describes what a business does without explaining how. More than one department can be involved in delivering a business capability. Because the internal organizational structure changes more often than what the business does, the best practice is not to replicate the organizational structure in your business capabilities.

In data lineage, business capabilities describe how the business uses data and heighten insight into the usage of restricted and confidential data in various business areas. This knowledge can be used in business risk assessments.

For more information on the Business Capability component type, please see Business Capability Metamodel.

Reference Types

This use case introduces two reference types that relate data entities to applications, business capabilities, and people.

  • Accesses: Represents that a component reads or modifies the state of another component in situ. As an example, Application Accesses Data Entity. Accesses connects applications and business capabilities to data entities, and through those, connects data to the rest of the architecture.

  • Owns: Represents that a component has overall responsibility for another component. As an example, Person Owns Data Entity. Owns connects people to data entities to assign responsibility and ease data quality maintenance.

Fields on the Accesses Reference

  • Request Type: Some applications use the information in a data entity and write back to it - for example, a CRM system that uses the Customer data entity in a business process and updates the Customer accordingly.

    - If your application both reads and writes, choose "Write."

    - If your application merely reads the data entity, choose "Read."
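
The Request Type rule above can be sketched in a few lines of Python. This is an illustration only: the function names and the tuple format are hypothetical, and treating a write-only application as "Write" is an assumption, since the rule only describes the read-and-write and read-only cases.

```python
from collections import defaultdict


def request_type(reads: bool, writes: bool) -> str:
    """Pick the Request Type for an Accesses reference.

    Any application that writes is recorded as "Write" (including the
    write-only case, which is an assumption); read-only is "Read".
    """
    if writes:
        return "Write"
    if reads:
        return "Read"
    raise ValueError("an Accesses reference needs read or write access")


def writers_and_readers(accesses):
    """Group applications per data entity by Request Type.

    `accesses` is a list of (application, data_entity, request_type)
    tuples, an assumed format used only for this example.
    """
    grouped = defaultdict(lambda: {"Write": [], "Read": []})
    for application, entity, req_type in accesses:
        grouped[entity][req_type].append(application)
    return dict(grouped)


accesses = [
    ("CRM", "Customer", request_type(reads=True, writes=True)),
    ("Data Warehouse", "Customer", request_type(reads=True, writes=False)),
    ("Webshop", "Customer", request_type(reads=True, writes=False)),
]

summary = writers_and_readers(accesses)
```

With the sample data, `summary["Customer"]` lists the CRM as the only writer and the other two applications as readers, which answers the first scope question for that data entity.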

Common Patterns of Data Lineage

An important part of data lineage is understanding how the data flows into and within your business. There are three ways or “patterns” in which data is used and moved around within an organization. These are sometimes referred to as master data management patterns.

Pattern 1 - Single Source

Single source, also referred to as centralized data management, is a pattern where a data entity is maintained in a single authoritative source. Other applications and services can read and update information at that source. Suppose you have a Customer data entity maintained in your CRM solution. In that case, all other applications that need this data entity must integrate with your CRM solution to retrieve the data entity.

Reference 2: Single source data pattern

Pattern 2 - Master/Slave

Master/Slave is a pattern where information is updated at a single source, but the data can be replicated to other applications to provide read-only versions of that information. These applications can in turn become replicating sources for tertiary applications, creating a chain of causality for this specific data entity.

In the CRM example, an application retrieves the data entity from the CRM solution, combines it with data from an HR system, and then presents the output to a second application. The two figures below illustrate the same flow but display different information. The Customer data entity is written by Salesforce CRM, transferred to Salesforce CPQ, and transferred from Salesforce CPQ to the Customer Payment Portal application.

Reference 3: Master/slave data pattern - the left example shows Data Entity on Connects To reference, and the right example shows Request Type on Connects To, indicating the direction of the data flow.

Pattern 3 - Multi-master

Multi-master, also referred to as decentralized data management, is a pattern where a data entity is maintained in multiple authoritative sources. For example, the Customer data entity could be created and updated in both an ERP and a CRM solution. When other applications need that data entity, the source can be either the ERP or the CRM, depending on the context.

Domain-Driven Design is an architecture technique that recommends this pattern. Data entities like Customer can have different attributes in specific Bounded Contexts. It is best to represent that information separately in each context.

Reference 4: Multi-master data pattern

Understanding and building these data patterns within the organization is essential for complying with required governance and legislation. It also provides a baseline for the business that can be used during mergers and acquisitions, takeovers, etc. Knowing the provenance and confidentiality of the source data is essential.

The practice of this use case is agnostic to these patterns - you can construct the data architecture in whichever pattern benefits your business.
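
If you want a first guess at which of the three patterns a data entity follows, one possible heuristic is sketched below in Python. It assumes that only authoritative applications are recorded with the "Write" request type and that replication between applications is known as a list of pairs. Both the function `classify_pattern` and these input formats are hypothetical, and the result should be treated as a starting point for discussion rather than a definitive classification.

```python
def classify_pattern(accesses, replications=()):
    """Guess the data pattern for ONE data entity.

    accesses:     list of (application, request_type) tuples.
    replications: list of (source_app, target_app) pairs describing
                  where the entity is copied between applications.
    Assumption: only authoritative sources carry the "Write" type.
    """
    writers = {app for app, req_type in accesses if req_type == "Write"}
    if not writers:
        return "No authoritative source"
    if len(writers) > 1:
        return "Multi-master"
    if replications:
        return "Master/Slave"
    return "Single source"


# Single source: everything integrates directly with the CRM.
single = classify_pattern([("CRM", "Write"), ("Webshop", "Read")])

# Master/Slave: the chain described in the Salesforce example.
replicated = classify_pattern(
    [
        ("Salesforce CRM", "Write"),
        ("Salesforce CPQ", "Read"),
        ("Customer Payment Portal", "Read"),
    ],
    replications=[
        ("Salesforce CRM", "Salesforce CPQ"),
        ("Salesforce CPQ", "Customer Payment Portal"),
    ],
)

# Multi-master: both ERP and CRM create and update the entity.
multi = classify_pattern([("ERP", "Write"), ("CRM", "Write")])
```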

Understanding Physical Data Location and Impact on Data Risk Assessment

We know that applications create and process data, but where is the data physically stored?

Reference 5 illustrates an example with the data entity Customer. It shows that the data is logically stored in applications and physically stored in EMEA and North America, either through the infrastructure that supports those applications or, for SaaS applications, by direct association with a location.

Reference 5: Overview of data entity Customer

Risk Associated With Data

Where in the world data is physically stored is itself a risk category. Perhaps the data entity contains Personally Identifiable Information (PII) and must not exist outside the European Union. Maybe it is vital for your business that the data exists in more than one physical location. Knowing where your important data is stored can be pertinent to assessing risk.
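
Both location risks mentioned above can be expressed as simple checks once you know where each data entity is stored. The Python sketch below is illustrative only: the dictionary layout, the `contains_pii` flag, and the short list of allowed locations are assumptions made for this example and are not a complete list of EU countries.

```python
def residency_violations(entities, allowed_locations):
    """Return {entity: [disallowed locations]} for entities holding PII."""
    violations = {}
    for name, info in entities.items():
        if not info["contains_pii"]:
            continue
        outside = sorted(set(info["locations"]) - set(allowed_locations))
        if outside:
            violations[name] = outside
    return violations


def single_location_entities(entities, critical):
    """Return critical entities stored in fewer than two locations."""
    return sorted(
        name for name in critical
        if len(set(entities[name]["locations"])) < 2
    )


# Hypothetical sample data.
entities = {
    "Customer": {"contains_pii": True, "locations": ["Germany", "United States"]},
    "Product": {"contains_pii": False, "locations": ["United States"]},
    "Employee": {"contains_pii": True, "locations": ["Ireland"]},
}
allowed = {"Germany", "Ireland", "France"}  # illustrative subset only

violations = residency_violations(entities, allowed)
at_risk = single_location_entities(entities, ["Customer", "Employee"])
```

In this sample, Customer is flagged because a copy exists outside the allowed locations, while Employee is flagged by the second check because it exists in only one place.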

Another factor to consider when assessing risk is the technical fitness of the application that processes the data. Reference 6 shows all the applications where the Customer data entity is processed. If you’ve followed our Application Rationalization Use Case Guide, your applications should have a 1-5 technical fit score. It is important to consider the potential risk of having essential data processed on technically inferior applications.

Reference 6: Logical flow of data entity Customer

Risk management is a major topic that we will explore in greater detail in future use cases.
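
The technical fitness consideration described above can be sketched as a small filter. The Python example below assumes a 1-5 technical fit score per application and flags sensitive data entities accessed by applications scoring below 2. The function name, the input dictionaries, and the default threshold are assumptions chosen for illustration.

```python
SENSITIVE = {"Confidential", "Restricted"}


def sensitive_on_weak_apps(entities, applications, accesses, min_fit=2):
    """Map sensitive data entities to low-scoring applications.

    entities:     {entity_name: confidentiality}
    applications: {application_name: technical_fit (1-5)}
    accesses:     list of (application, entity) pairs
    An application is "weak" when its technical fit is below `min_fit`.
    """
    result = {}
    for app, entity in accesses:
        if entities[entity] in SENSITIVE and applications[app] < min_fit:
            result.setdefault(entity, []).append(app)
    return result


# Hypothetical sample data.
entities = {"Customer": "Restricted", "Product": "Public", "Order": "Confidential"}
applications = {"Legacy Billing": 1, "CRM": 4, "Old Catalog": 1}
accesses = [
    ("Legacy Billing", "Customer"),
    ("CRM", "Customer"),
    ("Old Catalog", "Product"),
    ("CRM", "Order"),
]

findings = sensitive_on_weak_apps(entities, applications, accesses)
```

Here only Customer is reported: Product is accessed by a weak application but is public, and Order is confidential but only accessed by an application with a good score.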

Practical Implementation of the Metamodel

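For the foundational implementation of Data Lineage, which covers your data entities and their connections to applications and the business, you should first complete the three prerequisite use cases named in the Metamodel section: Application Lifecycle Management, Application Integration Management, and Business Capability Management/Realization (BCM/BCR).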

To connect Data Lineage with infrastructure and location, as well as technical fitness of the applications hosting data entities, you should also complete these use cases:

We at Ardoq always recommend the simplest model to get the job done. Reference 7 illustrates how we practically implement workspaces and references in Ardoq, including the foundational implementation of Data Lineage.

Reference 7: An overview of workspaces and references on the Ardoq app

Extending Your Metamodel

Metamodeling is a highly specialized task, so before you start extending the metamodel, read our article Seven Principles for Creating a Great Enterprise Architecture Metamodel.

Data lineage is foundational for several architectural topics. One such topic is GDPR. A complete description of how to implement GDPR in Ardoq is out of scope for this article, but simple extensions to the metamodel can get you started. Locating where PII or sensitive data about individuals is processed gives you a good starting point for GDPR compliance. Data lineage also provides a useful supplemental data set for enriching use cases like Application Rationalization and Application Integration Management, where, for example, the confidentiality or PII content of data entities can provide additional insight into application and integration risk.

One way of approaching a solution is to simply create a multi-select field called “PII type” on the data entity component type containing typical PII types such as personal ID, health information, sexual orientation, affiliation with unions, etc. You can locate specific PII types by utilizing the references between applications and data entities.
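
As a rough illustration of how such a field could be used, the Python sketch below looks up which applications process a given PII type by following the references between applications and data entities. The function name and the dictionary and tuple formats are hypothetical stand-ins for data you would export from your own model.

```python
def applications_processing(pii_type, pii_types_by_entity, accesses):
    """List applications that access any data entity tagged with `pii_type`.

    pii_types_by_entity: {entity_name: set of PII types} (the assumed
                         content of a multi-select "PII type" field)
    accesses:            list of (application, entity) pairs
    """
    tagged = {
        entity
        for entity, pii_types in pii_types_by_entity.items()
        if pii_type in pii_types
    }
    return sorted({app for app, entity in accesses if entity in tagged})


# Hypothetical sample data.
pii_types_by_entity = {
    "Customer": {"personal ID"},
    "Employee": {"personal ID", "health information"},
    "Product": set(),
}
accesses = [
    ("CRM", "Customer"),
    ("HR System", "Employee"),
    ("Webshop", "Product"),
    ("Payroll", "Employee"),
]

health_apps = applications_processing(
    "health information", pii_types_by_entity, accesses
)
id_apps = applications_processing("personal ID", pii_types_by_entity, accesses)
```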

Gremlin Code

This section contains the Gremlin code used to implement the calculated field, Number of Reports/Data Assets, the graph filter for focusing on one data entity, and some of the reports used in this use case.

Number of Reports / Data Assets (Calculated field):

g.V().hasLabel('Data Entity').
  project('id', 'name', 'value').
    by(id).
    by('name').
    by(inE('Accesses').count())

Graph filter for focusing on one specific data entity

g.V(selectedId).union(
  __.inE().otherV().hasLabel('Business Capability', 'Person', 'Interface', 'Application'),
  __.inE().otherV().
    hasLabel('Application').
    outE().inV().
    hasLabel('Server').
    outE().inV().
    hasLabel('Location')).path()

Applications that have the most restricted or confidential data entities (Report)

// counter: anonymous traversal used as the sort key. It counts the
// confidential or restricted data entities each application accesses.
counter = out('Accesses').hasLabel('Data Entity').
  or(
    has('confidentiality', 'Confidential'),
    has('confidentiality', 'Restricted')
  ).count();

g.V().hasLabel('Application').
  filter(
    __.out('Accesses').hasLabel('Data Entity').
      or(
        has('confidentiality', 'Confidential'),
        has('confidentiality', 'Restricted')
      )
  ).order().by(counter,decr).
  project('name', 'Number of Data Entities', 'Data Entities').
    by('name').
    by(
      __.out('Accesses').hasLabel('Data Entity').
        or(
          has('confidentiality', 'Confidential'),
          has('confidentiality', 'Restricted')
        ).values('name').count()
    ).
    by(
      __.out('Accesses').hasLabel('Data Entity').
        or(
          has('confidentiality', 'Confidential'),
          has('confidentiality', 'Restricted')
        ).values('name').fold().map({it.get().join(', ')})
    )

Data entities that are physically located in multiple locations (Report)

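// From each data entity, go to the applications that reference it, then to
// their supporting infrastructure and its location, and finally follow the
// ardoq_parent reference three times from that location.
g.V().hasLabel('Data Entity').
  filter(__.in().hasLabel('Application').out('Is Supported By').out('Is Located At').repeat(out('ardoq_parent')).times(3)).
  project('name', 'Locations').
    by('name').
    by(__.in().hasLabel('Application').out('Is Supported By').out('Is Located At').
      repeat(out('ardoq_parent')).times(3).values('name').
      dedup().fold().map{it.get().join(', ')})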
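
For readers who want to reason about this report outside Ardoq, here is a simplified Python analogue that works on in-memory dictionaries. It is a sketch under stated assumptions: the input formats and function names are hypothetical, the walk up the location hierarchy is omitted, and, unlike the Gremlin filter above (which as written appears to keep any data entity with at least one resolvable location), it keeps only data entities found in more than one distinct location.

```python
def locations_per_entity(accesses, supported_by, located_at):
    """Collect the distinct locations where each data entity ends up.

    accesses:     list of (application, entity) pairs
    supported_by: {application: [server, ...]}
    located_at:   {server: location}
    """
    result = {}
    for app, entity in accesses:
        for server in supported_by.get(app, []):
            location = located_at.get(server)
            if location is not None:
                result.setdefault(entity, set()).add(location)
    return result


def multi_location_entities(accesses, supported_by, located_at):
    """Keep only entities stored in more than one distinct location."""
    per_entity = locations_per_entity(accesses, supported_by, located_at)
    return {
        entity: sorted(locations)
        for entity, locations in per_entity.items()
        if len(locations) > 1
    }


# Hypothetical sample data.
accesses = [("CRM", "Customer"), ("Billing", "Customer"), ("Catalog", "Product")]
supported_by = {
    "CRM": ["crm-db-01"],
    "Billing": ["bill-db-01"],
    "Catalog": ["cat-db-01"],
}
located_at = {
    "crm-db-01": "EMEA",
    "bill-db-01": "North America",
    "cat-db-01": "EMEA",
}

multi = multi_location_entities(accesses, supported_by, located_at)
```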

Data entities that reside on technically inferior applications (Report)

// Restricted or confidential data entities referenced by applications
// whose technical fit score is below 2.
g.V().hasLabel('Data Entity').
  where(has('confidentiality', 'Restricted').or().has('confidentiality', 'Confidential')).
  filter(__.in().hasLabel('Application').has('technical_fit', lt(2))).
  project('name', 'Application').
    by('name').
    by(__.in().hasLabel('Application').has('technical_fit', lt(2)).values('name').fold().map({it.get().join(', ') }))

Conclusion

Has this article piqued your interest in data lineage? Wait no longer - pop into our Best Practice Module in Ardoq and get your copy of the Data Lineage bundle.

Document versions

Date | Responsible | Rationale

Sept 6, 2022 | Rasmus Valther Jensen | Published
