Snowflake - Automatic Data Lineage

Hubert Książek - Dataedo Team, 3rd June 2024

What to Expect

Entire column-level data lineage within the Snowflake instance

Users with Snowflake Enterprise Edition (or higher) can automatically generate column-level data lineage based on all executed scripts that moved data between Snowflake objects up to 1 year back. Read more about this feature in the Snowflake Enterprise Lineage connector documentation.

You can view column-level data lineage, along with the scripts that created this lineage, within the Snowflake instance.
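As an illustration, the write history behind this kind of lineage can be inspected in Snowflake itself. The sketch below queries the ACCOUNT_USAGE views (which require elevated permissions and retain roughly one year of history with a few hours of latency); the exact metadata the connector reads is described in its documentation.

```sql
-- Hedged sketch: list recent queries that modified Snowflake objects,
-- joined to their query text (requires access to SNOWFLAKE.ACCOUNT_USAGE).
SELECT ah.query_id,
       qh.query_text,
       ah.objects_modified
FROM SNOWFLAKE.ACCOUNT_USAGE.ACCESS_HISTORY ah
JOIN SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY qh
  ON qh.query_id = ah.query_id
WHERE ah.query_start_time > DATEADD(month, -12, CURRENT_TIMESTAMP())
  AND ah.objects_modified IS NOT NULL;
```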

How it looks in Dataedo Portal

Views/Materialized views

Data lineage for views/materialized views is created using our SQL Analyzer (Parser). Our analyzer creates column-level data lineage based on the view script. The data flow is created between Snowflake objects (tables, views, etc.) and the view.

Data lineage for views
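For example, given an illustrative view like the one below (table and column names are made up), the parser maps each output column back to its source columns from the script alone, e.g. ORDERS.AMOUNT → V_ORDER_SUMMARY.TOTAL_AMOUNT.

```sql
-- Illustrative view definition; all object names are hypothetical.
CREATE OR REPLACE VIEW V_ORDER_SUMMARY AS
SELECT c.CUSTOMER_ID,
       c.CUSTOMER_NAME,
       SUM(o.AMOUNT) AS TOTAL_AMOUNT   -- column-level lineage: ORDERS.AMOUNT -> TOTAL_AMOUNT
FROM CUSTOMERS c
JOIN ORDERS o
  ON o.CUSTOMER_ID = c.CUSTOMER_ID
GROUP BY c.CUSTOMER_ID, c.CUSTOMER_NAME;
```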

Dynamic tables

Data lineage for dynamic tables is created using our SQL Analyzer (Parser) and Snowflake metadata (INFORMATION_SCHEMA.DYNAMIC_TABLE_GRAPH_HISTORY()).

  1. Our analyzer creates column-level data lineage based on the dynamic table script.
  2. Snowflake metadata provides only object-level data lineage.

The data flow is created between Snowflake objects (tables, views, etc.) and the dynamic table.

Data lineage for dynamic tables

Because dynamic tables behave more like views than tables, their main type is View (with the subtype Dynamic table), not Table. That is why the screenshots show them as views.
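Both sources can be seen side by side in a sketch like the one below (object names are hypothetical): the SELECT in the dynamic table definition drives the column-level lineage, while the table function confirms the object-level dependencies.

```sql
-- Illustrative dynamic table; the parser derives column-level lineage
-- from the SELECT (e.g. SALES.AMOUNT -> DT_DAILY_SALES.TOTAL_AMOUNT).
CREATE OR REPLACE DYNAMIC TABLE DT_DAILY_SALES
  TARGET_LAG = '1 hour'
  WAREHOUSE = COMPUTE_WH
AS
SELECT SALE_DATE,
       SUM(AMOUNT) AS TOTAL_AMOUNT
FROM SALES
GROUP BY SALE_DATE;

-- Object-level dependencies reported by Snowflake metadata:
SELECT name, inputs
FROM TABLE(INFORMATION_SCHEMA.DYNAMIC_TABLE_GRAPH_HISTORY());
```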

Stages

Data lineage for stages is created using Snowflake metadata (INFORMATION_SCHEMA.STAGES). The data flow is created at the object level between external storage objects (CSV, JSON, Parquet files, folders, etc. in S3, Azure Blob Storage, and similar) and the Snowflake stage.

Data lineage for stages
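An illustrative external stage (bucket name and URL are made up) and the metadata view the lineage is read from:

```sql
-- Hypothetical external stage pointing at an S3 location; lineage is drawn
-- from the files/folders under this URL to the stage (object level only).
CREATE OR REPLACE STAGE MY_S3_STAGE
  URL = 's3://my-bucket/exports/'
  FILE_FORMAT = (TYPE = CSV);

-- Metadata used for the lineage:
SELECT stage_name, stage_url, stage_type
FROM INFORMATION_SCHEMA.STAGES;
```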

External tables

Data lineage for external tables is created using Snowflake metadata (INFORMATION_SCHEMA.EXTERNAL_TABLES). The data flow is created at the object level between the Snowflake stage and the external table.

Data lineage for external tables
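An illustrative external table built on a hypothetical stage, together with the metadata view the object-level lineage (stage → external table) is read from:

```sql
-- Hypothetical external table over files in a stage location.
CREATE OR REPLACE EXTERNAL TABLE EXT_SALES
  LOCATION = @MY_S3_STAGE/sales/
  FILE_FORMAT = (TYPE = PARQUET);

-- Metadata used for the lineage:
SELECT table_name, location
FROM INFORMATION_SCHEMA.EXTERNAL_TABLES;
```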

Snowpipes (Snowflake Data Pipelines)

Data lineage for Snowpipes is created using simple regex pattern matching on the Snowpipe script. The data flow is created at the object level, first between the Snowflake stage and the Snowpipe, then between the Snowpipe and the table: stage → snowpipe → table.

Data lineage for snowpipes
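A minimal Snowpipe of the kind the regex matching handles, with no transformations (no SELECT in the COPY statement); all object names are hypothetical:

```sql
-- Hypothetical Snowpipe: stage -> pipe -> table lineage is read
-- directly from the COPY INTO ... FROM @stage pattern.
CREATE OR REPLACE PIPE SALES_PIPE
  AUTO_INGEST = TRUE
AS
COPY INTO SALES
FROM @MY_S3_STAGE/sales/
FILE_FORMAT = (TYPE = CSV);
```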

Direct loading

Dataedo supports data lineage for data loaded directly into tables from stages using the COPY INTO command. The data flow is created based on Snowflake metadata (ACCOUNT_USAGE.COPY_HISTORY, which requires elevated permissions), at the object level between the Snowflake stage and the table, with SQL Script as the processor.

Data lineage for manual COPY INTO
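An illustrative direct load and the metadata it leaves behind (object names are hypothetical; the ACCOUNT_USAGE view requires elevated permissions):

```sql
-- Hypothetical manual load from a stage into a table.
COPY INTO SALES
FROM @MY_S3_STAGE/sales/
FILE_FORMAT = (TYPE = CSV);

-- Metadata used for the stage -> table lineage:
SELECT table_name, stage_location, file_name
FROM SNOWFLAKE.ACCOUNT_USAGE.COPY_HISTORY;
```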

Streams

Dataedo supports column-level data lineage for streams, where the data flow is created between table/view and stream. The lineage is established based on column names. Note that Dataedo does not support lineage for streams created on directory tables (stages).

Stream lineage
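A minimal example of a stream that qualifies for this lineage (a stream on a table, not on a directory table; names are hypothetical):

```sql
-- Hypothetical stream on a table; lineage is drawn column by column,
-- matched by name, from SALES to SALES_STREAM.
CREATE OR REPLACE STREAM SALES_STREAM ON TABLE SALES;
```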

Known limitations

  1. Check the limitations for views lineage from SQL parser.
  2. With more complex Snowpipe COPY statements, the data lineage may not be created correctly.
  3. Regex patterns (e.g. the PATTERN option) in Snowpipe and stage definitions are ignored when lineage is created.
  4. Lineage from snowpipes with transformations is not supported.
  5. Snowflake Enterprise Lineage connector only creates data lineage for data moved between Snowflake objects. It does not support lineage for data moved between Snowflake and external storage.

Troubleshooting

Snowflake Enterprise Lineage connector

  1. Make sure that your Snowflake instance is running on Enterprise Edition or higher.
  2. Make sure that the role used to connect to Snowflake has the necessary permissions to access the ACCOUNT_USAGE schema.
  3. Check if objects used in the scripts are still available in Snowflake and are imported to Dataedo.
  4. Check if the scripts are supported by the connector - connector documentation.
  5. Check the date of script execution - the connector can only retrieve scripts executed up to 1 year back, and the data has up to 3 hours of latency.

Views/Materialized views

  1. Make sure the view script was imported correctly and is valid (e.g. referenced objects exist).
  2. Check if the view script contains any unsupported SQL syntax - parser documentation.
  3. Make sure that dialect is set correctly and "Automatic lineage" and "Parse SQL" checkboxes are checked in the Snowflake import window.
  4. Try to reparse the view script. You can do it by clicking the "Parse Script to Lineage" button or by rerunning the import process.

Dynamic tables

  1. Make sure the dynamic table script was imported correctly and is valid (e.g. referenced objects exist).
  2. Check if the dynamic table script contains any unsupported SQL syntax - parser documentation.
  3. Make sure that dialect is set correctly and "Automatic lineage" and "Parse SQL" checkboxes are checked in the Snowflake import window.
  4. Verify in Snowflake if SELECT * FROM TABLE(INFORMATION_SCHEMA.DYNAMIC_TABLE_GRAPH_HISTORY()) returns any results using the same user, role, and database as provided in Dataedo.
  5. Try to reparse the dynamic table script. You can do it by clicking the "Parse Script to Lineage" button or by rerunning the import process.

Stages

  1. Make sure that referenced storage objects are imported to Dataedo and contain valid paths.
  2. Make sure the "Automatic lineage" checkbox is checked in the Snowflake import window.
  3. Verify in Snowflake if SELECT * FROM INFORMATION_SCHEMA.STAGES returns any results using the same user, role, and database as provided in Dataedo.
  4. Try rerunning the import process (especially if referenced storage objects were added after the last import).

External tables

  1. Make sure the "Automatic lineage" checkbox is checked in the Snowflake import window.
  2. Verify in Snowflake if SELECT * FROM INFORMATION_SCHEMA.EXTERNAL_TABLES returns any results using the same user, role, and database as provided in Dataedo.
  3. Try rerunning the import process.

Snowpipes (Snowflake Data Pipelines)

  1. Make sure the Snowpipe script was imported correctly and is valid (e.g. referenced objects exist).
  2. Make sure the "Automatic lineage" checkbox is checked in the Snowflake import window.
  3. Verify if the Snowpipe script contains transformations (it should not contain any SELECT).
  4. Try rerunning the import process.

Direct loading

  1. Make sure the "Automatic lineage" and "Copy history" checkboxes are checked in the Snowflake import window.
  2. Verify in Snowflake if SELECT * FROM SNOWFLAKE.ACCOUNT_USAGE.COPY_HISTORY returns any results using the same user, role, and database as provided in Dataedo. SNOWFLAKE.ACCOUNT_USAGE requires elevated permissions.
  3. Try rerunning the import process.

Streams

  1. Make sure the "Automatic lineage" checkbox is checked in the Snowflake import window.
  2. Verify if the object used in the stream is imported to Dataedo.
  3. Verify in Snowflake if SELECT * FROM SNOWFLAKE.ACCOUNT_USAGE.COLUMNS returns any results using the same user, role, and database as provided in Dataedo. SNOWFLAKE.ACCOUNT_USAGE requires elevated permissions.
  4. Check if the stream is created on a table or view. If it's created on a directory table (stage), the lineage will not be created.
  5. Try rerunning the import process.