We are continually working with clients to refine the list of attributes that can be used to uniquely identify (and therefore distinguish as well as match) entities within a data set, such as customers or products. This task is critical to data quality management, master data management, and other value-added business services such as customer relationship management. One recent challenge was observing that one client was using birth date as an identifying attribute, even though it was null a significant portion of the time, defaulted to 01/01/YYYY some high percentage of the time, and generally was not trustworthy. It started me thinking about the characteristics of attributes that make them good candidates as identifying attributes. Here are some that I am incorporating into a larger paper on identity resolution:
- Inherence – the degree to which the attribute is intrinsic to the entity. Examples include engineering specifications of a product, such as the “head diameter,” “shank diameter,” or “threading type” of a screw.
- Structure stability – the degree to which the attribute’s structure is subject to variance. Attributes relying on a well-defined value domain have a high degree of structural stability; attributes like dates, telephone numbers, and individual names can appear in a variety of patterns or formats, and have a medium degree of structural stability; free-formed text values have a low degree of structural stability.
- Value stability – the degree to which the attribute’s value ever changes, and if so, how frequently. An example of an attribute with a stable value is an individual’s eye color.
- Domain cardinality – the size of the domain from which the attribute’s values are drawn. Attributes whose domain has many values are more useful for differentiation than those whose domain has only a few. For example, a birth date recorded as month and day only is limited to a set of 366 values.
- Completeness – attributes that are missing data are less likely to contribute significantly to differentiation.
- Accuracy – attributes with a high degree of trust may be more reliable for similarity comparisons.
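Some of these characteristics can be measured directly by profiling the data. The following is a minimal sketch, not a definitive implementation, of how completeness, the number of distinct values, and the share of suspicious default values might be computed for one candidate attribute. The function name `profile_attribute`, the `is_suspect` predicate, the ISO date format, and the sample records are all assumptions made for illustration; inherence, stability, and accuracy generally need business knowledge and cannot be read off the data this way.

```python
from collections import Counter


def profile_attribute(records, attribute, is_suspect=None):
    """Return simple profiling measures for one candidate identifying attribute.

    records    -- list of dicts, one per entity (hypothetical layout)
    attribute  -- name of the attribute to profile
    is_suspect -- optional predicate flagging values that look like defaults
    """
    values = [record.get(attribute) for record in records]
    total = len(values)
    # Treat None and empty strings as missing.
    present = [v for v in values if v not in (None, "")]
    counts = Counter(present)
    suspect = sum(1 for v in present if is_suspect(v)) if is_suspect else 0
    return {
        # Share of records that carry any value at all.
        "completeness": len(present) / total if total else 0.0,
        # Observed cardinality: how many different values actually occur.
        "distinct_values": len(counts),
        # Share of all records whose value looks like a default.
        "suspect_default_rate": suspect / total if total else 0.0,
    }


# Hypothetical sample data: dates as ISO strings, with January 1st
# treated as a likely default, as in the birth date case above.
customers = [
    {"name": "Ann Lee", "birth_date": "1980-01-01"},
    {"name": "Bob Ray", "birth_date": None},
    {"name": "Cy Dole", "birth_date": "1975-06-30"},
    {"name": "Di Fox", "birth_date": "1980-01-01"},
]

profile = profile_attribute(
    customers,
    "birth_date",
    is_suspect=lambda value: value.endswith("-01-01"),
)
print(profile)
```

An attribute with low completeness or a high suspect default rate, like the birth date in this sample, would score poorly as an identifying attribute no matter how large its domain is in theory.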
I think that if you can come up with a good set of evaluation criteria for candidate identifying attributes, you can simplify the process of selecting data elements for your similarity functions. Thoughts?



#1 by Henrik Liliendahl Sørensen at February 16th, 2010
I would say that unless you can maintain a natural key that you are perfectly sure is always available, it is a better choice to select a surrogate key with no binding to the real world and, besides that, to maintain one or (more likely) several other identifying attributes for use in matching. Mature matching tools are able to combine several attributes in a single match.
Recently I wrote a blog post on key selection and real world alignment called Create Table Homo_Sapiens.
#2 by Ramon de Noronha at February 16th, 2010
I agree with Henrik: using several attributes will improve the matching results. This sounds obvious, but when you dive a little deeper into the identifying elements, it is possible to create combinations of attributes that have a high probability of identifying persons or products uniquely. First name and last name won’t suffice (there are many John Johnsons), but last name, birth year and house number might be enough. See also my blog post about Categorizing Identifying Attributes.
#3 by Walter Howard at February 17th, 2010
Interesting. I disagree with Henrik’s statement. A surrogate key provides no value in an MDM/CDI solution. I need names, addresses, DOBs, telephone numbers, business account identifiers, EINs, etc., that exhibit the properties illustrated by Mr. Loshin.
And when did it become an acceptable practice to link to your blog in the comments section of another blogger?
#4 by Henrik Liliendahl Sørensen at February 18th, 2010
David, Walter. It may be that we actually agree, but are putting it in different ways. Of course you need name, address, date of birth, national ID number and telephone number in order to perform identity resolution. On the question of whether you should exclude them based on dimensions such as inherence, completeness and accuracy, I would say that it’s rather a matter of how you weight those in your decision matrix when matching. My point on the key is that I have seen all too often that a poor natural key is used to identify party master objects, which is discussed in the blog post, with comments, that I have linked.
On the commenting etiquette I will refer to Jim Harris (another author on this blog) and his posts on Social Karma.
#5 by Trudy Curtis at February 18th, 2010
This is a pretty good set of criteria as a starting point. I have another.
We tend, as professionals, to believe that we know what the object we are identifying is. In some cases, this may actually be the case. However, sometimes an industry will use a term loosely and inconsistently. It’s difficult to uniquely identify an object when we don’t have consensus about what that object is! We also need to know “when” it is. When should a person be assigned a unique identifier? At birth? Conception? When they pay taxes?
Definitions and semantics are very difficult. We tend to avoid them because they pose thorny issues that can delay projects, raise more questions than answers, frustrate developers and increase project costs. I sympathise, but speaking from many years of experience, in the long run this can cost a lot of money and create tremendous problems.
Typically, these issues leap to the forefront when a company decides to integrate their data stores into something like a Master Data Repository. Imagine what happens when you try to integrate systems that identify objects with the same common name, but with different definitions.
We need clear terms and even clearer definitions so that we can clearly and unambiguously identify what the object we are identifying is! Without that clarity, it’s impossible to avoid problems (although, tragically, you may have created a very shiny illusion that suggests you have!).
For example, in the Oil and Gas industry, we talk a lot about “wells”. They represent a key business object to us, and there is a HUGE amount of technical and business data associated with them.
We have a lot of “wells” in the world today – millions of them, in fact. It’s incredibly important to be able to uniquely identify them, so we can manage them properly. Unfortunately, the term “well” means different things to different people. Around the globe, the identifiers that are assigned to wells are actually being assigned to different objects – even though they are all called “wells”.
This is a complex problem (if it was easy, we would not have a problem!). To understand, you need to visit http://www.WhatIsAWell.org.
As an industry, we are working towards developing a common set of terminology, definitions, illustrations and teaching material that we hope will help industry to converge globally.