Data Profiling in Dataedo

Mac Lewandowski - Dataedo Team Mac Lewandowski 4th November, 2021
Applies to: Dataedo 10.x versions, Article available also for: 24.x (current), 23.x
You are looking at documentation for an older release.
Switch to the documentation for Dataedo 24.x (current).

Dataedo 10 allows you to discover data stored in the database and review its contents and quality. Data Profiling module is a combination of useful metrics with friendly User Interface. On top of it, Profiling in Dataedo allows you to peak into most common, or random data from your tables.

What is Data Profiling?

In general Data Profiling means the process of inspecting the data and presenting statistics and metrics about it.

This is usually done in order to:

  • find out what's the quality of data and can it be reused,
  • better understand data structure,
  • discover potential data challenges and improvements,
  • review data before building software based on it,

Data Profiling in Dataedo

Table row count

Dataedo scans table and counts number of rows in that table.

Each time you save profiling the values are updated, and later presented in both Dataedo Desktop and Dataedo Web Catalog.

Column distribution

Column distribution scans different types of values in the column in terms of nullability and uniqueness:

  • Distinct values - Rows with values that are unique in the column (think of ID or Order number)
  • Non distinct values - Rows that are non-unique and non-empty (think of First name)
  • Empty - Rows non-null values but with empty strings (for instance '' or ' ')
  • NULL - Rows with null values

Image title

Column values profile

Dataedo performs basic profiling of numeric values in the column. Results depends on a data type. Learn more about Supported data types.

Numerical String Date
Min minimum value first alphabetically sorted string earliest found date
Max maximum value last alphabetically sorted string latest found date
Avg average value - -
Variance variance counted for values - -
Standard deviation standard deviation for values - -
Span difference between Max and Min values - difference between Min and Max dates (formatted, ie. 2 months, 2.5 years)
Distinct number of distinct values number of distinct strings number of distinct dates

Image title

String length profile

Dataedo performs basic profiling of column string length:

  • Min – Minimum length of non null string in the column,
  • Max – Maximum length of string in the column,
  • Avg – average stringh length,
  • Variance – counted for string length,
  • Standard deviation – counted for string length,

Image title

Column top/all values/random values

Dataedo can scan columns for top or random values. For each value it calculates how many rows have that value:

  • Top 10/100/1000 values – By default Dataedo scans top 10 or 100 most popular values from the column.
  • All values – if the number of distinct values is less than 1000 you can ask Dataedo to fetch all the values from the column.
  • Random values – you can ask Dataedo to sample random 10 values from entire table. This can be useful for unique values such as order_number.

Sample data

Dataedo fetches 10 random rows from the table and presents it in the tabluar form. This data cannot be saved to the repository.

Image title

How it works

Running Data Profiling

On user request, Dataedo scans tables and columns and gathers statistics and top data. Worth mentioning is that preparing statistics are calculated on a database level, so we are not downloading more data from the database than necessary.

Prepared statistics are presented to a user in Dataedo Desktop.

Image title

Saving

Saving is optional. Moreover saving can be disabled by configuration. Read more about configuration in Data Profiling configuration article. By default saving data is disabled.

Where data is saved?

Profiling data can be saved in the repository right next to the data model metadata (tables, columns).