Azure Data Factory - Automatic Data Lineage

Samuel Chmiel - Dataedo Team Samuel Chmiel 16th December, 2024

What to Expect

Dataedo imports automatically:

  1. Pipelines
  2. Activities (as data lineage processes in Dataedo)
  3. Datasets
  4. Sources of copy activity
  5. Sinks of copy activity as Destinations in Dataedo
  6. Column-level lineage for copy activity

Due to the specifics of ADF operation, lineage does not pass through datasets; instead, we generate Sources and Destinations based on datasets and their parameters, as a single dataset can be used multiple times when parameterized.

Parametrized linked services, datasets and pipelines

Lineage for parameterized linked sources, datasets, and pipelines can generated based on Pipeline Runs and Activity Runs logs. To enable it, you need to check Analyze Pipeline Runs checkbox and specify how many recent days and how many last pipeline executions should be analyzed.

Image title

If in parameters there are Azure Expressions interpreted values are not stored in logs. We are recreating entire flow and parameter values assignements.

While interpreting Azure Expressions Dataedo supports functions:

  • concat
  • toLower
  • utcnow

Examples of supported expressions:

  • @concat('SELECT column_names FROM dbo.salesforce_column_names WHERE name = ''', pipeline().parameters.ObjectName, '''')
  • @concat('Copy_', item().TargetTable)
  • @concat('sales/', toLower(item().name), '/', utcNow('yyyy-MM-dd'), '/')
  • SELECT * FROM [@{item().TABLE_SCHEMA}].[@{item().TABLE_NAME}]

Supported linked services

Lineage of data source types below are linked automatically:

  • Microsoft SQL Server
  • Azure SQL Server
  • Azure Synapse Analytics
  • Azure Blob Storage
  • Azure Data Lake Storage
  • Amazon S3,
  • Postgres
  • MariaDB
  • MySQL
  • MongoDB
  • Postgres
  • Snowflake (linked source i Dataedo needs to be linked to source database in repository manually and next import changes will detect column level lineage to tables)
  • Salesforce (linked source i Dataedo needs to be linked to source database in repository manually and next import changes will detect column level lineage to objects)
  • REST (object level lineage)

Copy Activity

Dataedo supports lineage for Table, Query and Stored Procedure source type.

Image title

Image title

Image title

Pipeline triggers an Execute Pipeline activity, which runs a Copy Activity within the ForEach activity loop scenario

In example below we have two pipelines. On the first screen pipeline uses Lookup activity to get table names and schemas from database table and runs Execute Pipeline and passes Table Schema and Table Name parameters array to invoked pipeline. On the second screen we see invoked pipeline which gets Table Schema and Table Name parameters array and runs Copy Activity with dynamic parameters inside ForEach loop activity.

Image title

Image title

When import is ran with Analyze Pipeline Runs parameter checked and there are successfull runs of those pipelines in Pipeline and Activities runs log inside the second pipeline which actually copies data in Dataedo there will be end-to-end column-level lineage:

Image title

We can also view the lineage for a specific iteration:

Image title

Known Limitations

  • When a ForEach loop is used in a pipeline, we only support lineage for loops that run sequentially. Due to inconsistencies in Azure Data Factory Pipeline and Activity Runs logs, lineage for parallel runs is not supported. Image title
  • In Azure Expressions only expressions with return type String are supported.
  • In Linked Services, Datasets, Pipelines parameters only the String type is supported.
  • In Azure Expressions only concat, utcNow and toLower functions are supported.

Troubleshooting

Lineage is created based on a matched Linked Source, check the Dataedo desktop to make sure it is matched properly.

Image title