Dataedo 10 allows you to discover data stored in the database and review its contents and quality. Data Profiling module is a combination of useful metrics with friendly User Interface. On top of it, Profiling in Dataedo allows you to peak into most common, or random data from your tables or views.
What is Data Profiling?
In general Data Profiling means the process of inspecting the data and presenting statistics and metrics about it.
This is usually done in order to:
- find out what's the quality of data and can it be reused,
- better understand data structure,
- discover potential data challenges and improvements,
- review data before building software based on it,
Data Profiling in Dataedo
Table row count
Dataedo scans table and counts number of rows in that table.
Each time you save profiling the values are updated, and later presented in both Dataedo Desktop and Dataedo Web Catalog.
Column distribution
Column distribution scans different types of values in the column in terms of nullability and uniqueness:
- Distinct values - Rows with values that are unique in the column (think of ID or Order number)
- Non distinct values - Rows that are non-unique and non-empty (think of First name)
- Empty - Rows non-null values but with empty strings (for instance '' or ' ')
- NULL - Rows with null values
Column values profile
Dataedo performs basic profiling of numeric values in the column. Results depends on a data type. Learn more about Supported data types.
Numerical | String | Date | |
---|---|---|---|
Min | minimum value | first alphabetically sorted string | earliest found date |
Max | maximum value | last alphabetically sorted string | latest found date |
Avg | average value | average string length | - |
Variance | variance counted for values | variance counted for string length | - |
Standard deviation | standard deviation for values | standard deviation counted for string length | - |
Span | difference between Max and Min values | - | difference between Min and Max dates (formatted, ie. 2 months, 2.5 years) |
Distinct values | number of distinct values | number of distinct strings | number of distinct dates |
String length profile
Dataedo performs basic profiling of column string length:
- Min – Minimum length of non null string in the column,
- Max – Maximum length of string in the column,
- Avg – average string length,
- Variance – counted for string length,
- Standard deviation – counted for string length,
Column top/random values
Dataedo can scan columns for top or random values. For each value it calculates how many rows have that value:
- Top 10/100/1000 values – By default Dataedo scans top 10 or 100 most popular values from the column.
- Random values – you can ask Dataedo to sample random 10 values from entire table. This can be useful for unique values such as
order_number
.
Sample data
Dataedo fetches 10 random rows from the table and presents it in the tabluar form. This data cannot be saved to the repository.
How it works
On user request, Dataedo scans tables and columns and gathers statistics and top data. Worth mentioning is that preparing statistics are calculated on a database level, so we are not downloading more data from the database than necessary.
Prepared statistics are presented to a user in Dataedo Desktop.
Saving
Saving is optional. Moreover saving can be disabled by configuration. Read more about configuration in Data Profiling configuration article. By default saving data is disabled.
Where data is saved?
Profiling data can be saved in the repository right next to the data model metadata (tables, columns).