Databricks Unity Catalog - Automatic data lineage

Michal Adamczyk - Dataedo Team, 25th June 2024

What to expect

Within Databricks

Dataedo uses the built-in Databricks data lineage module, which stores historical lineage information. Column-level lineage is created for the various kinds of tables and views that exist in Databricks Unity Catalog. If multiple catalogs are documented in Dataedo, lineage between them is imported as well. How the lineage is displayed depends on whether an object is created by a Delta Live Tables (DLT) pipeline or not:

Delta Live Tables lineage

On the lineage graph you can see the DLT pipeline object responsible for creating the objects and loading the data. It is also possible to hide pipelines and view lineage directly between the objects.
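For context, here is a minimal sketch of what such a pipeline source could look like, written in classic DLT SQL syntax (the dataset names and storage path are illustrative, not part of any real setup):

```sql
-- Both datasets are created and loaded by the DLT pipeline, so the pipeline
-- appears as a node between them (and their sources) on the lineage graph.
CREATE OR REFRESH STREAMING LIVE TABLE bronze_orders
AS SELECT * FROM cloud_files('s3://acme-data-lake/orders/', 'json');

CREATE OR REFRESH LIVE TABLE silver_orders
AS SELECT order_id, customer_id, CAST(amount AS DECIMAL(10, 2)) AS amount
FROM LIVE.bronze_orders;
```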

Other objects lineage

If an object is created without a DLT pipeline, for example by running a notebook with Python code or by executing CREATE .. AS SELECT .. or CREATE VIEW .. SQL statements, lineage is always shown directly between the objects.

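For illustration, both of the following statements produce such direct object-to-object lineage (the catalog, schema, and object names are made up):

```sql
-- CTAS: table-to-table lineage from orders_raw to orders_clean.
CREATE TABLE main.sales.orders_clean AS
SELECT order_id, customer_id, amount
FROM main.sales.orders_raw
WHERE amount IS NOT NULL;

-- View definition: table-to-view lineage from orders_clean to the view.
CREATE VIEW main.sales.orders_by_customer AS
SELECT customer_id, SUM(amount) AS total_amount
FROM main.sales.orders_clean
GROUP BY customer_id;
```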

Between Databricks and external sources

External Locations

Dataedo creates lineage based on the location URI when the table's data source is located in one of the following cloud storages:

    • Google Cloud
    • Amazon S3
    • Azure
    • Cloudflare

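As an illustrative sketch (the storage account, container, and table names are made up), the location URI that Dataedo uses comes from the table definition:

```sql
-- External table over files in cloud storage; Dataedo reads the LOCATION URI
-- (for example s3://, abfss://, gs://) and links the table to the matching
-- storage object on the lineage diagram.
CREATE TABLE main.analytics.raw_clicks
USING PARQUET
LOCATION 'abfss://landing@acmestorage.dfs.core.windows.net/clicks/';
```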

Fivetran Data Sources

If you connect to a data source (for example Salesforce) through a Fivetran connection, you can use the Dataedo Fivetran connector to create column-level lineage between that source and Databricks. On the diagram you can see the Fivetran ETL program that processes the data; just like a DLT pipeline, it can be hidden to show lineage directly between the objects.


Known Limitations

  • Column-level lineage for external tables is created only if the data source schema (for example, that of a JSON file) is automatically discovered by Databricks and the column names are not changed (see the sketch after this list).
  • For Delta Live Tables (DLT), Dataedo uses data lineage only from the latest pipeline update and ignores older updates.
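A sketch of the first limitation, assuming a JSON source in cloud storage (all names and the path are illustrative):

```sql
-- Schema (and column names) inferred from the JSON files: column-level
-- lineage to the external source can be created.
CREATE TABLE main.sales.raw_events
USING JSON
LOCATION 's3://acme-data-lake/events/';

-- Explicit schema with renamed columns: per the limitation above, only
-- table-level lineage to the source can be expected.
CREATE TABLE main.sales.raw_events_renamed (evt_id STRING, evt_amount DOUBLE)
USING JSON
LOCATION 's3://acme-data-lake/events/';
```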

Because Dataedo uses the Databricks API, unless otherwise specified in this documentation, the following limitations regarding data lineage also apply: Databricks lineage limitations

Troubleshooting

I don't see data lineage

  1. Check whether the object that is missing lineage has it in Databricks - for example in Catalog Explorer, or by querying the lineage system tables as sketched after this list. If not, lineage won't be created in Dataedo.
  2. Rerun the import of the source - the schema may have been imported with an older version or with an incorrect configuration.
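For step 1, one way to check directly in Databricks is to query the lineage system tables, assuming they are enabled in your workspace (the table name in the filter is illustrative):

```sql
-- Lists lineage edges recorded by Databricks for a given table.
-- No rows means Databricks has no lineage for it, so Dataedo
-- cannot import any either.
SELECT source_table_full_name,
       target_table_full_name,
       entity_type,
       event_time
FROM system.access.table_lineage
WHERE target_table_full_name = 'main.sales.orders_clean'
   OR source_table_full_name = 'main.sales.orders_clean'
ORDER BY event_time DESC;
```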

I don't see data lineage from External location storage

  1. Check whether the object that is missing lineage has it in Databricks. If not, lineage won't be created in Dataedo.
  2. Make sure the object has a Linked Source set, with the correct storage assigned.
  3. If you don't see a Linked Source, check whether you have permissions to the appropriate Databricks External Location (see the sketch after this list).
  4. Rerun the import of the source - the schema may have been imported with an older version or with an incorrect configuration.
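For step 3, a quick way to inspect the external location and the privileges granted on it (the location name is illustrative; running these requires sufficient privileges yourself):

```sql
-- Shows the storage URL and owner of the external location.
DESCRIBE EXTERNAL LOCATION acme_data_lake;

-- Lists privileges granted on the external location; the account used by
-- Dataedo needs some privilege on it (or ownership) to see the Linked Source.
SHOW GRANTS ON EXTERNAL LOCATION acme_data_lake;
```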

I don't see data lineage between objects from different catalogs

  1. Check whether the object that is missing lineage has it in Databricks. If not, lineage won't be created in Dataedo (a query sketch for this check follows this list).
  2. Make sure that both catalogs are imported.
  3. Rerun the import of the catalog containing the outflow objects (those to which the data should flow).
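For step 1, a query against the lineage system tables (assuming they are enabled in your workspace; the catalog name is illustrative) that lists edges flowing into the documented catalog from other catalogs:

```sql
-- Lineage edges where data flows into the 'analytics' catalog from another
-- catalog; if nothing is returned, Databricks has not recorded cross-catalog
-- lineage and Dataedo cannot show it.
SELECT source_table_full_name, target_table_full_name, event_time
FROM system.access.table_lineage
WHERE target_table_catalog = 'analytics'
  AND source_table_catalog IS NOT NULL
  AND source_table_catalog <> 'analytics'
ORDER BY event_time DESC;
```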

Data lineage is inaccurate

  1. Make sure that recent updates are available for the pipelines - Dataedo uses lineage from the latest DLT update only.