Captain Obvious' Guide to Column Descriptions - Data Dictionary Best Practices

Whenever you have data, you should have some kind of description of what it represents. Some may think that table and column names are enough. This is far from truth and in my opinion in most cases you should have accompanying descriptions (a documentation). You will need it for reporting, maintenance, further development.

Understanding and deciding you want to describe your data might not be enough. You need to know how to do it right, so that it serves a purpose (other than satisfying management or auditors to have any documentation).

Let's have a look at an example.

Captains' Meaningless Descriptions

We asked certain captain to describe our invoices table and here is what we got:

Invoices table (before)

Column Description
number Invoice number
date Invoice date
status Invoice status
amount Invoice amount
customer_no Customer number

Well done, Captain Obvious! We didn't learn much from those descriptions, did we? They even look like auto-generated and hold zero information. And yet, there is so much we don't know about the data but should. You probably have seen something similar at least once.

Adding the Non-obvious

Captain's work is done (he finished early today). Now it's our turn - let's see whether we could add something actually useful.

Meaning

Table or column name often isn't enough to understand the purpose of the data. It can often be even misleading. Let's take the amount column. Is it net amount or including tax? If this is order, is it amount including discount? Is this amount including refunds and cancellations or is it the amount as on order when it was placed?

As you can see, the exact meaning of the column isn't always obvious and this is where a description is necessary and can save your organization a lot of issues with incorrect reports and analytics.

Source, Calculations, and Defaults

Where does the data come from? Is it entered by user or calculated? What is the logic of those calculations? Is there a default value?

In our case amount is always provided by user, status is working copy and date is null by default. If we had tax column it might have been automatically calculated by some trigger or stored procedure.

Formats and Validation

Are there restrictions on values or can they accept all values of their data type? Those restrictions can be defined anywhere - as database constraints, in application validation or even business rules. In most real life cases there are.

Date should be no older than a month back from creation and not more in the future than a week, amount should be positive and status column accepts only predefined values.

List of Values, Lookups

Some columns are enums and are expected to hold a limited set of values, even though they are mostly one of primitive types without any constraints - varchars or integers. In such cases application layer defines somewhere list of valid values (labels) and their meaning.

To make a convenient reference list those values and their meaning. In our case for status column it would be the following: 'W' - working copy, 'A' - approved invoice, 'C' - cancelled.

Foreign Keys

Some tools can indicate whether column is a foreign key and to which table (Dataedo will from 6.0) but to play it safe let's always provide foreign key in the description. In our case customer_no is a FK to customers table.

Life Cycle

When does a certain column is expected to get populated? At insert or at a later stage of row life cycle? Does the data get ever updated? Who/what updates it? What is the logic? Obviously all data in databases can be updated at any time, but I mean here a business as usual workflow.

In our case invoice number and date are generated when invoice gets approved.

Use

Tables often have many columns, and not all of them have as obvious purpose as those presented in our example (basically fields visible on document everyone more or less knows). There are often data fields used by some external programs, interfaces or reports or are part of standard packaged application and aren't used at all.

Let's say customer table has a score field but no-one of developers knows what that is and how it is calculated. It turns out it is maintained by external program and is used by marketing campaign planner. Mentioning that in the documentation would let admins know who needs this program and even might prevent some crazy developer from dropping that field!

Those were just a few of the aspects you could consider when documenting data.

Meaningful Descriptions

Let's apply what we just learned and overhaul our descriptions. We could get something like this. What do you say, Captain?

Invoices table (after)

Column Description
number Invoice autogenerated number, starting from 1 each year. Number is generated when invoice gets approved.
date Invoice issued date. Null for working copy invoices. Automatically set to today’s date on invoice approval.
status Invoice status. 'W' - working copy, 'A' - approved invoice, 'C' - cancelled
amount Invoice net amount in USD
customer_no Number of customer invoice was issued to. Ref: customers.

Start Documenting Your Tables

If you want to have a meaningful description of tables and columns you first need the right tool. Download Dataedo - a lightweight Windows tool that enables you to describe each data element of your SQL Server, Oracle and MySQL databases in a separate repository. You can use rich text, images, long descriptions, add ER diagrams and share it in a nicely formatted HTML or PDF.

Comments (0)