Whenever you have data, you should have some kind of description of what it represents. Some may think that table and column names are enough. This is far from truth and in my opinion in most cases you should have accompanying descriptions (a documentation). You will need it for reporting, maintenance, further development.
Understanding and deciding you want to describe your data might not be enough. You need to know how to do it right, so that it serves a purpose (other than satisfying management or auditors to have any documentation).
Let's have a look at an example.
Captains' Meaningless Descriptions
We asked certain captain to describe our invoices table and here is what we got:
Invoices table (before)
Column | Description |
---|---|
number | Invoice number |
date | Invoice date |
status | Invoice status |
amount | Invoice amount |
customer_no | Customer number |
Well done, Captain Obvious! We didn't learn much from those descriptions, did we? They even look like auto-generated and hold zero information. And yet, there is so much we don't know about the data but should. You probably have seen something similar at least once.
Adding the Non-obvious
Captain's work is done (he finished early today). Now it's our turn - let's see whether we could add something actually useful.
Meaning
Table or column name often isn't enough to understand the purpose of the data. It can often be even misleading. Let's take the amount column. Is it net amount or including tax? If this is order, is it amount including discount? Is this amount including refunds and cancellations or is it the amount as on order when it was placed?
As you can see, the exact meaning of the column isn't always obvious and this is where a description is necessary and can save your organization a lot of issues with incorrect reports and analytics.
Source, Calculations, and Defaults
Where does the data come from? Is it entered by user or calculated? What is the logic of those calculations? Is there a default value?
In our case amount is always provided by user, status is working copy and date is null by default. If we had tax column it might have been automatically calculated by some trigger or stored procedure.
Formats and Validation
Are there restrictions on values or can they accept all values of their data type? Those restrictions can be defined anywhere - as database constraints, in application validation or even business rules. In most real life cases there are.
Date should be no older than a month back from creation and not more in the future than a week, amount should be positive and status column accepts only predefined values.
List of Values, Lookups
Some columns are enums and are expected to hold a limited set of values, even though they are mostly one of primitive types without any constraints - varchars or integers. In such cases application layer defines somewhere list of valid values (labels) and their meaning.
To make a convenient reference list those values and their meaning. In our case for status column it would be the following: 'W' - working copy, 'A' - approved invoice, 'C' - cancelled.
Foreign Keys
Some tools can indicate whether column is a foreign key and to which table (Dataedo will from 6.0) but to play it safe let's always provide foreign key in the description. In our case customer_no is a FK to customers table.
Life Cycle
When does a certain column is expected to get populated? At insert or at a later stage of row life cycle? Does the data get ever updated? Who/what updates it? What is the logic? Obviously all data in databases can be updated at any time, but I mean here a business as usual workflow.
In our case invoice number and date are generated when invoice gets approved.
Use
Tables often have many columns, and not all of them have as obvious purpose as those presented in our example (basically fields visible on document everyone more or less knows). There are often data fields used by some external programs, interfaces or reports or are part of standard packaged application and aren't used at all.
Let's say customer table has a score field but no-one of developers knows what that is and how it is calculated. It turns out it is maintained by external program and is used by marketing campaign planner. Mentioning that in the documentation would let admins know who needs this program and even might prevent some crazy developer from dropping that field!
Those were just a few of the aspects you could consider when documenting data.
Meaningful Descriptions
Let's apply what we just learned and overhaul our descriptions. We could get something like this. What do you say, Captain?
Invoices table (after)
Column | Description |
---|---|
number | Invoice autogenerated number, starting from 1 each year. Number is generated when invoice gets approved. |
date | Invoice issued date. Null for working copy invoices. Automatically set to today’s date on invoice approval. |
status | Invoice status. 'W' - working copy, 'A' - approved invoice, 'C' - cancelled |
amount | Invoice net amount in USD |
customer_no | Number of customer invoice was issued to. Ref: customers. |
Start Documenting Your Tables
If you want to have a meaningful description of tables and columns you first need the right tool. Download Dataedo - a lightweight Windows tool that enables you to describe each data element of your SQL Server, Oracle and MySQL databases in a separate repository. You can use rich text, images, long descriptions, add ER diagrams and share it in a nicely formatted HTML or PDF.