DataFlux - The Leader in Data Quality and Data Integration

Thoughts About Completeness – Part 2

David Loshin

October 27, 2009

In my last post about completeness, I mentioned the question of using values that represent nulls. A good example is the use of the character string “N/A” when the data requested is not available. Of course, if you have “N/A,” you will probably also have a bunch of variants, such as “n/a,” “NA,” “na,” and “not available,” among others. There are two questions you might ask:

  • If the value is not available, why not just leave it null?
  • Why are there so many variations on the “not available” value?

Good questions both, and rather than suggesting that some deviant data quality mole is inserting bad data into the works, they more likely point to two different kinds of data quality problems.

The first question is really about the nullness of the data. Older database systems may rely on files for storage, with no system nulls allowed. In this case, the absent value is represented by some actual value, even if it is just a string of blanks. And when you see many blank strings in the data element, that is probably what has happened. But when you don’t see the blanks and do see the variants for the missing values, the culprit is probably not the data, but the application’s enforcement of some value constraint at the point of entry.

And we have actually seen this – some application refuses to let you get to the next screen until all the fields are filled in. Yet if I don’t have the value, I won’t be able to get the process finished. Therefore, the data entrant will stick anything in there to get to the next form, which leads to the creation of “false nulls” in the data. And the variations? When there is no standard for the missing value, the data entry person ends up making the decision, and that leads to all sorts of creativity.
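
One way to see how much these “false nulls” distort a completeness measure is to fold the variants together before counting. The following is only a minimal sketch in Python, not a DataFlux feature: the function names `is_false_null` and `completeness` are made up for illustration, and the token list covers just the variants mentioned above plus blank strings. In practice you would build the list by profiling your own data, and you would need to watch for columns where a value like “NA” is legitimate.

```python
import re

# Assumption: placeholder spellings reduced to lowercase letters only.
# Covers "N/A", "n/a", "NA", "na", and "not available" from the post.
FALSE_NULL_KEYS = {"na", "notavailable"}


def is_false_null(value):
    """Return True if a non-null string is blank or a known null placeholder."""
    if value is None:
        return False  # a true (system) null, not a false one
    if value.strip() == "":
        return True  # a string of blanks standing in for a missing value
    # Drop case, punctuation, digits, and spaces so the variants collapse.
    key = re.sub(r"[^a-z]", "", value.lower())
    return key in FALSE_NULL_KEYS


def completeness(values):
    """Fraction of values that are neither true nulls nor false nulls."""
    if not values:
        return 0.0
    populated = sum(
        1 for v in values if v is not None and not is_false_null(v)
    )
    return populated / len(values)


if __name__ == "__main__":
    column = ["Smith", "N/A", "n/a", "NA", "na", "not available", "   ", None, "Jones"]
    naive = sum(1 for v in column if v is not None) / len(column)
    print(f"naive completeness:    {naive:.2f}")
    print(f"adjusted completeness: {completeness(column):.2f}")
```

A naive count that only looks for system nulls treats every placeholder as a populated value, so the gap between the two figures gives a rough sense of how much the entry screen's "required field" rule is hiding.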


  1. #1 by Julian Schwarzenbach at November 17th, 2009

    I deal with utilities and transport sector clients, in particular their asset information. One aspect relating to completeness, which is perhaps unique to this sector, means that it can be valid not to have a completeness target of 100%.

    I’ll explain: many assets have been in existence for many years (over 100 for some bridges and sewers), and when these assets were constructed, asset data quality had not been thought about. Although records will almost certainly have been kept initially (possibly hand drawn on parchment drawings), over the years some of this information has been lost. In other cases, information such as the date of construction may not have been recorded originally.

    In such cases it can be valid to have a completeness target of less than 100% – why beat yourself up over lost data that cannot be recreated?

    Are there other similar examples that you are aware of?
