Table of Contents:
I would like to present you with the role metadata plays in databases and data in general.
Level 1: Mysterious Data Set
Let's consider the following example - we were asked to list all current employees with all the available details.
Let's assume we found this raw table of data:
It has no context and we have no information about what data we are actually looking at. This looks like a table of peoples' names, but are those employees or maybe customers or just random people? If we know some of the employees names we might figure out what the data set represents. We figured out that the table actually holds employees, or at least includes employees. But what do the columns A and D-I hold? Is D a birth date? But that could also be a start of employment (and they are all long term employees). What about column H, is it the date of the last record update, last position change, last raise or last medical examination? I seems to be some kind of flag. There's virtually no way of telling what F and G columns represent.
It would take a while for you to figure out the meaning of each data and that is assuming you have any reference information (a way to test this data against other sources of information - your knowledge of the company, access to data UI or reports, paper records or asking other people). If you don't, then some of this data is really useless to you.
What we are missing is some information about our dataset - a metadata.
Level 2: Basic Metadata
Let's add some basic metadata - table and column names:
Now, that is much better. Now we have much better understanding of what the data set is and what specific columns represent. Or at least most of them. We confirmed that the table holds employees. We figured out that column A holds a unique employee number, D holds a date of birth, E is a card number, F seems to be an education level and G is an employee department.
We don't know what eval column represents - is it employee performance or medical evaluation and what does the date actually represent - the first evaluation, the last evaluation or planned next evaluation?
It is also not clear what specific values in education and department columns mean - dept_id suggests there's a lookup table. Is it the same with the edu column?
This metadata - tables and column names - are a standard element of all modern database systems. But those names might not tell us what the table holds for various reasons:
- Names are vague or confusing
- Column might have changed its purpose without changing its name
- A short name might not be enough to explain complex logic
We need much more metadata - we need a Data Dictionary.
Level 3: Data Dictionary
Let's say we are lucky and we found buried down in folders of our corporate storage a document that had "Data Dictionary" in the title and a following table inside:
That explains a lot: we now know how to get a specific department - we need to use the departments lookup table, we know how to decipher education - there seems to be no lookup table and the logic is a bit confusing and we know that the eval column holds the date of the last employee performance review date.
This is what we needed in order to prepare our report. Now our work is straightforward. Before that it was tedious analysis and guesswork.
So how do we get a Data Dictionary? Well, we need to make one. It takes someone to describe each table, column and relation. This is a task to be taken seriously. It should be created at the data modeling and database design phase and maintained during development.
Actually, it's never too late to start gathering metadata to start building a data dictionary. As the task above shows, as we learn more about the data we should gather all the information in a data dictionary, share it across the organization and make sure everyone can add to it if they learn something new.
Support for Metadata
Today's relational databases support this kind of metadata, but a lot depends on people. Designers are obligated to provide a name and data type for each column but providing meaningful descriptions is up to them their practices and good will. Practice shows it is often omitted.
This was a simple example of a small single table. Real databases hold many, many more tables that have many, many more columns. Level of difficulty is much higher. If you want to be able to use your data effectively you need to take care of meaningful metadata.
Without Metadata you can't use your data!