OpenLineage

Samuel Chmiel - Dataedo Team, 13th March, 2025

OpenLineage is an open standard for tracking data lineage across processing systems. It standardizes the collection of metadata about data pipelines, enabling better visibility, debugging, and governance of data workflows.

Dataedo provides a public API with a dedicated lineage endpoint. When tools like Apache Airflow and Apache Spark are configured to emit OpenLineage events, those events are captured by Dataedo and stored in the open_lineage_events table.

The collected events can then be imported and analyzed using Dataedo's OpenLineage connector, offering powerful lineage visualization and insights into your data pipelines.
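
Any OpenLineage-compliant producer can send events to this endpoint. As a hedged illustration, the sketch below posts a minimal COMPLETE run event with Python's requests library; the job and dataset names are hypothetical, and it assumes the standard OpenLineage api_key scheme, which sends the key as a Bearer token.

import uuid
from datetime import datetime, timezone

import requests

# Placeholders: your portal's Public API base URL and an API key generated in Dataedo Portal.
BASE_URL = "{YOUR_DATAEDO_PORTAL_PUBLIC_API_URL}"
API_KEY = "{API_KEY_GENERATED_IN_DATAEDO_PORTAL}"

# A minimal COMPLETE run event per the OpenLineage spec; job and dataset names are hypothetical.
event = {
    "eventType": "COMPLETE",
    "eventTime": datetime.now(timezone.utc).isoformat(),
    "run": {"runId": str(uuid.uuid4())},
    "job": {"namespace": "my_pipeline", "name": "daily_sales_load"},
    "inputs": [{"namespace": "my_pipeline", "name": "raw.sales"}],
    "outputs": [{"namespace": "my_pipeline", "name": "dwh.fact_sales"}],
    "producer": "https://example.com/my-producer",
    "schemaURL": "https://openlineage.io/spec/1-0-5/OpenLineage.json#/definitions/RunEvent",
}

# OpenLineage's api_key auth sends the key as a Bearer token.
response = requests.post(
    BASE_URL.rstrip("/") + "/public/v1/lineage",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json=event,
)
response.raise_for_status()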

Catalog and documentation

Dataedo imports jobs, input datasets, and output datasets from OpenLineage events that have a status of COMPLETE. Events successfully sent to the Dataedo Public API are saved in the open_lineage_events table in the Dataedo repository.

Jobs

Overview

On the overview page, you will see basic information about the job, such as the Job Namespace (shown in the Schema field) and the Job Name.

Script

If the job has a script, it is visible in the Script tab.

Data Lineage

If the run event contains lineage information, it is visible in the Data Lineage tab.

Input Datasets

Overview

On the overview page, you will see basic information about the dataset, such as the Dataset Namespace (shown in the Schema field) and the Dataset Name.

Fields

If the dataset has fields, they are visible in the Columns tab.

Data Lineage

If the run event contains lineage information, it is visible in the Data Lineage tab.

Output Datasets

Overview

On the overview page, you will see basic information about the dataset, such as the Dataset Namespace (shown in the Schema field) and the Dataset Name.

Fields

If the dataset has fields, they are visible in the Columns tab.

Data Lineage

If the run event contains lineage information, it is visible in the Data Lineage tab.

Specification

Imported metadata

Dataedo reads the following metadata from OpenLineage events:

RunEvent
  Inputs
    Fields
  Outputs
    Fields
      Input Fields
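
To see where these objects live in a raw event: dataset fields come from the dataset's schema facet, and input fields (column-level lineage) come from the columnLineage facet on an output dataset. An abridged, hypothetical fragment (required facet metadata such as _producer is omitted for brevity):

{
  "outputs": [{
    "namespace": "my_pipeline",
    "name": "dwh.fact_sales",
    "facets": {
      "schema": {
        "fields": [{"name": "region", "type": "string"}]
      },
      "columnLineage": {
        "fields": {
          "region": {
            "inputFields": [
              {"namespace": "my_pipeline", "name": "raw.sales", "field": "region"}
            ]
          }
        }
      }
    }
  }]
}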

Configuration and import

To enable gathering of OpenLineage events, you first need to enable the Dataedo Public API and obtain an API token. Next, configure an OpenLineage event emitter in your tooling. Emitted events are stored in the open_lineage_events table in the Dataedo repository. Finally, to process the stored events, run an import with the OpenLineage connector.
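
Before running the import, you can confirm that events are actually arriving by querying the repository table directly. A minimal sketch, assuming a SQL Server repository reached through pyodbc (server, database, and credentials are placeholders):

import pyodbc

# Placeholder connection details for the Dataedo repository database.
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=your_server;DATABASE=your_dataedo_repository;"
    "UID=your_user;PWD=your_password"
)

# Count the raw OpenLineage events captured by the Public API.
count = conn.execute("SELECT COUNT(*) FROM open_lineage_events").fetchone()[0]
print(f"Captured OpenLineage events: {count}")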

Configuration of Dataedo Public API

To enable the Dataedo Public API, follow the steps in the article: Dataedo Public API Authorization

Apache Airflow configuration

To enable emitting OpenLineage events, follow the official documentation: Apache Airflow OpenLineage provider configuration

Example configuration file for the Airflow OpenLineage provider:

transport:
  type: http
  url: {YOUR_DATAEDO_PORTAL_PUBLIC_API_URL}
  endpoint: public/v1/lineage
  auth:
    type: api_key
    apiKey: {API_KEY_GENERATED_IN_DATAEDO_PORTAL}
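
Depending on your Airflow version, the same transport can also be supplied through Airflow configuration rather than a YAML file, e.g. as a JSON string in the transport option of the [openlineage] section or the corresponding environment variable (a sketch; verify against the provider documentation for your version):

AIRFLOW__OPENLINEAGE__TRANSPORT='{"type": "http", "url": "{YOUR_DATAEDO_PORTAL_PUBLIC_API_URL}", "endpoint": "public/v1/lineage", "auth": {"type": "api_key", "apiKey": "{API_KEY_GENERATED_IN_DATAEDO_PORTAL}"}}'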

Apache Spark configuration

To enable emitting OpenLineage events, follow the official documentation: Quickstart with Jupyter

Example configuration of a Spark session:

from pyspark.sql import SparkSession

spark = (SparkSession.builder.master('local')
         .appName('sample_spark')
         # Register the OpenLineage listener and pull in the integration package
         .config('spark.extraListeners', 'io.openlineage.spark.agent.OpenLineageSparkListener')
         .config('spark.jars.packages', 'io.openlineage:openlineage-spark:1.28.0')
         # Send events over HTTP to the Dataedo Public API lineage endpoint
         .config('spark.openlineage.transport.type', 'http')
         .config('spark.openlineage.transport.url', '{YOUR_DATAEDO_PORTAL_PUBLIC_API_URL}')
         .config('spark.openlineage.transport.endpoint', '/public/v1/lineage')
         .config('spark.openlineage.transport.auth.type', 'api_key')
         .config('spark.openlineage.transport.auth.apiKey', '{API_KEY_GENERATED_IN_DATAEDO_PORTAL}')
         # Emit dataset-level dependencies within the column lineage facet
         .config('spark.openlineage.columnLineage.datasetLineageEnabled', 'true')
         .getOrCreate())
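
With the listener registered, any regular read or write through Spark's datasource API emits run events. A minimal hypothetical job (paths are placeholders):

# Hypothetical paths; reads and writes through the datasource API are what trigger run events.
df = spark.read.option("header", "true").csv("/data/in/sales.csv")
(df.groupBy("region").count()
   .write.mode("overwrite")
   .parquet("/data/out/sales_by_region"))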

Apache Spark on Databricks

To enable emitting OpenLineage events, follow the official documentation: Quickstart with Databricks

Example configuration of the Spark session:

spark.conf.set("spark.openlineage.columnLineage.datasetLineageEnabled", "true")
spark.conf.set("spark.openlineage.transport.url", "{YOUR_DATAEDO_PORTAL_PUBLIC_API_URL}")
spark.conf.set("spark.openlineage.transport.endpoint", "/public/v1/lineage")
spark.conf.set("spark.openlineage.transport.auth.type", "api_key")
spark.conf.set("spark.openlineage.transport.auth.apiKey","{API_KEY_GENERATED_IN_DATAEDO_PORTAL}")
spark.conf.set("spark.openlineage.transport.type", "http")

Apache Spark on AWS Glue

To enable emitting OpenLineage events, follow the official documentation: Quickstart with AWS Glue

Example configuration of a Glue job:

--conf spark.extraListeners=io.openlineage.spark.agent.OpenLineageSparkListener
--conf spark.openlineage.transport.type=http
--conf spark.openlineage.transport.url={YOUR_DATAEDO_PORTAL_PUBLIC_API_URL}
--conf spark.openlineage.transport.endpoint=/public/v1/lineage
--conf spark.openlineage.transport.auth.type=api_key
--conf spark.openlineage.transport.auth.apiKey={API_KEY_GENERATED_IN_DATAEDO_PORTAL}
--conf spark.openlineage.columnLineage.datasetLineageEnabled=true

Processing OpenLineage events with Dataedo OpenLineage connector

To process OpenLineage events stored in the Dataedo repository, select Add source -> New connection. In the connectors list, select OpenLineage.

Select the number of past days to analyze. Click Connect and go through the import process.

If you have several OpenLineage producers with different namespaces, you can import them as separate data sources by filtering on namespace in the Dataedo Schema field.
