Databricks Unity Catalog - Automatic data lineage

What to expect

Within Databricks

Dataedo uses the built-in Databricks data lineage module, which stores historical lineage information. Column-level lineage is created for the various kinds of tables and views that exist in Databricks Unity Catalog. If multiple catalogs are documented in Dataedo, lineage between them is also imported. How lineage is displayed depends on whether the object was created by a Delta Live Tables (DLT) pipeline:

Delta Live Tables lineage

On the lineage graph, the user can see the DLT pipeline object responsible for creating objects and loading data.


It is also possible to hide pipelines and view lineage directly between objects:

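As an illustration, a table defined inside a DLT pipeline references other pipeline tables through the LIVE schema (table names below are hypothetical), which is why the pipeline object, rather than a direct SQL statement, appears on the lineage graph between source and target:

```sql
-- Hypothetical DLT pipeline table definition (Delta Live Tables SQL syntax).
-- Reading from LIVE.orders_raw makes the pipeline responsible for
-- creating orders_clean and loading its data.
CREATE OR REFRESH LIVE TABLE orders_clean
AS SELECT order_id, order_total
   FROM LIVE.orders_raw
   WHERE order_total > 0;
```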

Other objects lineage

If an object is created without a DLT pipeline, for example by running a notebook with Python code or by executing a CREATE TABLE ... AS SELECT or CREATE VIEW SQL statement, lineage is always shown directly between objects:

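For reference, statements like the following (table and column names are made up) produce lineage that Databricks records directly between the source and target objects, with no pipeline node in between:

```sql
-- Hypothetical objects; lineage: main.sales.orders -> main.analytics.daily_orders
CREATE TABLE main.analytics.daily_orders AS
SELECT order_date, COUNT(*) AS order_count
FROM main.sales.orders
GROUP BY order_date;

-- Lineage: main.sales.orders -> main.analytics.orders_view
CREATE VIEW main.analytics.orders_view AS
SELECT order_id, order_total FROM main.sales.orders;
```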

Between Databricks and external sources

External Locations

Dataedo creates lineage based on the location URI when the table's data source is located in one of the following cloud storage services:

  • Google Cloud
  • Amazon S3
  • Azure
  • Cloudflare
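For example, an external table created with an explicit LOCATION URI (the bucket and object names below are hypothetical) gives Dataedo the URI it matches against the documented storage:

```sql
-- Hypothetical external table; Dataedo matches the LOCATION URI
-- against the cloud storage documented in Dataedo
CREATE TABLE main.raw.events (
  event_id STRING,
  event_time TIMESTAMP
)
USING DELTA
LOCATION 's3://my-bucket/landing/events/';
```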

Fivetran Data Sources

If you connect a data source (for example, Salesforce) to Databricks using Fivetran, you can use the Dataedo Fivetran connector to create column-level lineage between that source and Databricks. On the diagram, you can see the Fivetran ETL process that moves the data; as with DLT pipelines, it can be hidden to show lineage directly between objects:


Known Limitations

  • Column-level lineage for external tables is created only if the data source's schema (for example, of a JSON file) is automatically discovered by Databricks and the column names are not changed.
  • For Delta Live Tables (DLT), Dataedo will use data lineage only for the latest update and ignore any older updates.

Because Dataedo uses the Databricks API, unless stated otherwise in this documentation, the limitations described in the Databricks documentation also apply to imported data lineage: Databricks lineage limitations

Troubleshooting

I don't see data lineage

  1. Check if the object that does not have lineage has it in Databricks. If not, lineage won't be created in Dataedo.
  2. Rerun the import of the source; the schema may have been imported with an older Dataedo version or with an incorrect configuration.
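One way to check step 1 is to query the Unity Catalog lineage system tables directly (this assumes system tables are enabled on your workspace; the table name in the filter is made up):

```sql
-- If this returns no rows, Databricks has no lineage for the object,
-- so Dataedo cannot import any
SELECT source_table_full_name, target_table_full_name, event_time
FROM system.access.table_lineage
WHERE target_table_full_name = 'main.sales.orders'
ORDER BY event_time DESC
LIMIT 10;
```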

I don't see data lineage from External location storage

  1. Check if the object that does not have lineage has it in Databricks. If not, lineage won't be created in Dataedo.
  2. Make sure the object has a Linked Source set, with the correct storage assigned.
  3. If you don't see a Linked Source, check if you have permissions to the appropriate Databricks External Location.
  4. Rerun the import of the source; the schema may have been imported with an older Dataedo version or with an incorrect configuration.

I don't see data lineage between objects from different catalogs

  1. Check if the object that does not have lineage has it in Databricks. If not, lineage won't be created in Dataedo.
  2. Make sure that you have both catalogs imported.
  3. Rerun the import of the catalog containing the target objects (those into which the data flows).
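As a sketch (catalog, schema, and table names are hypothetical), a statement like the following produces cross-catalog lineage in Dataedo only if both sales_catalog and analytics_catalog are imported:

```sql
-- Cross-catalog lineage: sales_catalog.sales.orders -> analytics_catalog.reporting.orders_copy
CREATE TABLE analytics_catalog.reporting.orders_copy AS
SELECT * FROM sales_catalog.sales.orders;
```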

Data lineage is inaccurate

  1. Make sure the relevant pipelines have run recently; Dataedo uses lineage only from a pipeline's latest update.