Databricks is a cloud-based data processing platform that simplifies collaboration among data analysts, data engineers, and data scientists. It is available on Microsoft Azure, Amazon Web Services, and Google Cloud Platform.
Dataedo connects to a single Unity Catalog catalog via the API and documents objects and data lineage within the connected catalog.
Instructions on how to connect to Databricks using Dataedo can be found at: Connecting to Databricks Unity Catalog
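For context on what a single-catalog API connection involves, here is a minimal sketch that lists the tables and columns of one catalog through the Unity Catalog REST API. It is an illustration only: the workspace URL, personal access token, and catalog name are placeholders, and Dataedo's own implementation may differ.

```python
import os
import requests

# Assumptions: DATABRICKS_HOST (e.g. https://adb-123.azuredatabricks.net) and
# DATABRICKS_TOKEN environment variables are set, and the catalog is named "main".
HOST = os.environ["DATABRICKS_HOST"].rstrip("/")
HEADERS = {"Authorization": f"Bearer {os.environ['DATABRICKS_TOKEN']}"}
CATALOG = "main"

def uc_get(path, **params):
    """Call a Unity Catalog REST endpoint and return the parsed JSON."""
    resp = requests.get(f"{HOST}/api/2.1/unity-catalog/{path}", headers=HEADERS, params=params)
    resp.raise_for_status()
    return resp.json()

# Walk one catalog: schemas -> tables -> columns.
for schema in uc_get("schemas", catalog_name=CATALOG).get("schemas", []):
    tables = uc_get("tables", catalog_name=CATALOG, schema_name=schema["name"]).get("tables", [])
    for table in tables:
        columns = ", ".join(col["name"] for col in table.get("columns", []))
        print(f'{CATALOG}.{schema["name"]}.{table["name"]} ({table.get("table_type", "?")}): {columns}')
```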
Connector features
Data Source | Support | Schema | Lineage | Profiling | Classification | Export comments | FK tester | DDL import |
---|---|---|---|---|---|---|---|---|
Databricks Unity Catalog | Native | ✅ | Column Level | ❌ | ✅ | ✅ | NA | NA |
Data Catalog
Dataedo documents the following objects and their respective properties from Databricks:
Object Name | Metadata | Lineage |
---|---|---|
Delta Live Tables | ✅ | ✅ |
Pipelines | Limited | ✅ |
Tables | ✅ | ✅ |
Views | ✅ | ✅ |
Columns | ✅ | ✅ |
External locations | ✅ | ✅ |
External Tables | ✅ | ✅ |
Primary keys | ✅ | |
Foreign keys | ✅ | |
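The column-level lineage referenced above is exposed by Databricks through its lineage tracking REST API. The sketch below uses a hypothetical table `main.analytics.orders` with a column `amount`; the endpoint path and request fields follow the Databricks lineage tracking API as documented at the time of writing, and the exact request shape and response keys should be checked against the current API reference.

```python
import os
import requests

# Assumptions: DATABRICKS_HOST and DATABRICKS_TOKEN environment variables are set;
# main.analytics.orders and the column amount are placeholders.
HOST = os.environ["DATABRICKS_HOST"].rstrip("/")
HEADERS = {"Authorization": f"Bearer {os.environ['DATABRICKS_TOKEN']}"}

resp = requests.get(
    f"{HOST}/api/2.0/lineage-tracking/column-lineage",
    headers=HEADERS,
    json={"table_name": "main.analytics.orders", "column_name": "amount"},
)
resp.raise_for_status()
lineage = resp.json()

# The response lists the columns feeding this column and the columns fed by it.
print("upstream:", lineage.get("upstream_cols", []))
print("downstream:", lineage.get("downstream_cols", []))
```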
Objects Properties Configuration & Support
Documentation is created for one selected catalog from Databricks Unity Catalog.
Known Limitations
Documentation Functionality
- Data Profiling is not available for Databricks; we are working on this feature for future releases.
- Connecting to multiple catalogs at once or to a regional metastore is not yet supported (it is on the roadmap).
- For pipelines, Dataedo discovers only the pipeline name, not the script (see the sketch after this list).
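To illustrate the name-only scope for pipelines: Delta Live Tables pipelines can be listed by name through the Pipelines REST API, as sketched below. The workspace URL and token are the same placeholders as in the earlier example; notebook or script contents are not part of this listing.

```python
import os
import requests

# Assumptions: DATABRICKS_HOST and DATABRICKS_TOKEN environment variables are set.
HOST = os.environ["DATABRICKS_HOST"].rstrip("/")
HEADERS = {"Authorization": f"Bearer {os.environ['DATABRICKS_TOKEN']}"}

resp = requests.get(f"{HOST}/api/2.0/pipelines", headers=HEADERS)
resp.raise_for_status()

# Each entry carries pipeline metadata such as its name, but not the pipeline's code.
for status in resp.json().get("statuses", []):
    print(status["pipeline_id"], status["name"])
```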
Lineage Functionality
- Column-level lineage for external tables is created only if the schema of the data source (for example, a JSON file) is automatically discovered by Databricks and the column names are not changed (see the sketch below).
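As an example of the case where column-level lineage can be built, the sketch below (assuming a Databricks notebook where `spark` is provided by the runtime, and placeholder storage path and table names) defines an external table over JSON files without an explicit column list, so Databricks discovers the schema and the source column names are carried through unchanged.

```python
# Minimal sketch: run in a Databricks notebook where `spark` is available.
# The storage location and table name are placeholders.

# No explicit column list: Databricks infers the schema from the JSON files and keeps
# the discovered column names, which is the precondition for column-level lineage.
spark.sql("""
    CREATE TABLE IF NOT EXISTS main.analytics.events_ext
    USING JSON
    LOCATION 'abfss://raw@mystorage.dfs.core.windows.net/events/'
""")

# Downstream objects that keep these column names (for example, a view selecting the
# columns as-is) can get column-level lineage; renaming the columns breaks that link.
```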