Originally published 7 January 2009
In many entities, but particularly in master data entities, it is not uncommon to see attributes that are generally known as “indicators” or “flags.” These become columns in physical databases, where they are also known as “indicators” or “flags.” They seem to be often passed over as rather simple, but they are worth thinking about. Unfortunately, not a lot is written about indicator columns, so what follows is my personal view of them based on my particular experience, and I freely admit there may be other perspectives that are worth considering.
In terms of database theory, indicator columns are supposed to be a Boolean datatype. This datatype has two possible values: true and false. Boolean datatypes actually exist for a number of database platforms, but it is rare to see the words “true” or “false” presented to a user on a screen or report. It is much more common to see “yes” and “no.” For this reason, and perhaps others, it is not that common to see Boolean datatypes used to implement indicator columns. Very often, they are single character fields in which “Y” represents true and “N” represents false. Business users can easily understand that “Y” stands for “yes” and “N” for “no.” Perhaps the next most common representation is “0” for false and “1” for true, in either a single character or integer datatype. Of course, when one uses something other than a Boolean datatype, it is possible to have more than two values, which has consequences that will be discussed later in this article.
Although a Boolean datatype can be true or false, traditional logic also includes a third possibility called indeterminate, meaning we do not know if the value should be true or false. This seems to be what a null value is used for.
As noted previously, there is not much written on the theory of indicator columns. However, “true” and “false” relate to propositional logic. A proposition is a statement of which it can be said that it is true or false. What this seems to mean for database design is that if we have an indicator column, then we should be able to state the meaning of the column as a proposition. For instance someone buying a book online may be asked if they want to receive the monthly email newsletter from the company selling the book. The definition of this column could be put in the form of a proposition such as:
The customer wishes to receive the company’s monthly email newsletter
And the data will tell us for each record whether the proposition is true or not. Now, other columns can also be reformatted as propositions, but they have to take into account the value of the data. For instance, a bank account may have a balance of $1,000, which is held in the Account Balance column of an Account table. The only way we are going to be able to state this as a proposition (so we can say it is true or false) is something like:
This bank account has a balance of $1,000
But this is really not a definition of the column in the same way that “The customer wishes to receive the company’s monthly email newsletter” is. Thus, we can see that for indicator columns, the data value is simply an affirmation or denial of the definition of the column, whereas for other types of columns, it is more like a measurement of the attribute that is being represented. This is an important distinction.
It is common to think of indicators being used to represent a binary state. However, not all binary states are equal.
The example of the customer wishing to receive the company’s monthly email can only be put as two propositions which are mutually exclusive in a way that the logicians term contradictory opposition. This means that one of the following propositions must always be true.
The customer wishes to receive the company’s monthly email newsletter
The customer does not wish to receive the company’s monthly email newsletter
Now let us consider a slightly different case. Suppose a bank account can be “Active” or “Suspended.” A data modeler may choose to represent this by an indicator attribute, perhaps called Account Suspended Indicator. In the corresponding physical database this might be implemented as a Char(1) column where “Y” represents “yes” (i.e., that the account is suspended), and “N” represents “no” (i.e., that the account is active).
Just what is going on here? First, the data modeler has understood that there is a binary state for the attribute that can be represented as the following two propositions:
This account is suspended
This account is active
The two states are mutually exclusive, but they are mutually exclusive in a way that the logicians term contrary opposition. With contrary opposition, the possibility exists that both might be false. In other words, the possibility exists that we are really dealing with something that should be called Account Status, and that while we currently know it has two states, one day it might grow to three or more.
Let us act upon this worry. Suppose that the data modeler responsible for modeling bank account suddenly finds out that a bank account can be “Active,” “Suspended,” or “Frozen.” “Suspended” is where the bank has its own reasons for stopping debits or credits to the account, and “Frozen” is where a government regulator instructs the bank to stop debits and credits against a particular account.
We now have a situation where the modeler has to change the production database. Suppose the Account table originally had a column called AcctSuspended with a value of “Y” for a suspended account and a value of “N” for an active account. There are three main options confronting the modeler.
Option 1: The modeler can instruct the programmers to put a value of “F” in AcctSuspended for a frozen account. Program logic, screen layouts, and report formats can all be changed to support this change. The change is only possible because the modeler had the “foresight” not to use a Boolean datatype which can only represent true or false, but instead chose a Char(1) datatype which can accommodate a many distinct values. Of course, these values are now codes representing states, rather that affirmation or denial of a single proposition.
Option 2: The modeler can create two indicator columns AcctActive, and AcctFrozen in addition to AcctSuspended. These will be set to “Y” depending on the status of the account (i.e., whether it is active, frozen, or suspended respectively). Probably very few modelers would choose to do this. The main reason is that the states represented by these three indicators are mutually exclusive. Yet by having three indicators in the table, the risk arises that on a given record, two (or even three) of them could end up with “Y’s”, creating a data quality problem.
Option 3: The modeler could create a code table called Account Status with three records, one for each of the states. The Account table could be modified to have a new column, Account Status Code, which would be the migrated primary key of the Account Status table. This is the best way to solve the problem.
From these three options, we can see that there are reasons to be suspicious of indicators when we look at database tables. Does a column that we are told is an indicator have more than two distinct values, but no parent code table as in option 1? Are there related indicator columns that really are mutually exclusive as in option 2?
There is quite a bit more to say about indicators, especially how they can represent metadata, or derived information – but that is for a future column.
SOURCE: A Brief Review of Indicator and Flag Columns
Recent articles by Malcolm Chisholm
Comments
Want to post a comment? Login or become a member today!
Be the first to comment!