The drive for data unity creates many opportunities for data consolidation. Customer data integration projects, master product catalogs, security master projects and enterprise master patient indexes are all examples of technology-driven projects intended to resolve multiple data sets that contain similar information into a single view – all with the hope that a unified data asset will lead to improved business processes. The success of this data integration process hinges on the ability to determine when different data instances in the same (or across multiple) data sets refer to the same real-world entity. Searching through data sets for matching records that represent the same party or product is the key to the data consolidation process.
The two most interesting challenges for customer data integration are two sides of the same coin: determining when two records refer to the same real-world object, and knowing when they do not. Without the ability to make that connection or distinction reliably, it would be difficult, if not impossible, to identify potential duplicate records within and across data sets.
The method used to find these connections is typically referred to as identity resolution. From a technology perspective, identity resolution is a collection of algorithms used to parse, standardize, normalize and compare data values in order to establish that two records refer to the same entity or to determine that they do not. By feeding a set of records into the identity resolution process, we can determine which of those records refer to the same unique entity. Beyond that, we can use data culled from all of the matched records to materialize a high-quality representation of each entity. In this way, different representations are resolved to the single real-world entity they describe.
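To make the parse, standardize and compare steps concrete, here is a minimal sketch in Python. It is an illustration only, not the method of any particular product: the record layout (`name` and `address` fields), the tiny lookup tables and all function names are assumptions invented for this example.

```python
import re

# Hypothetical lookup tables; a real deployment would use far larger,
# locale-specific reference data.
ABBREVIATIONS = {"st": "street", "ave": "avenue", "rd": "road"}
NICKNAMES = {"bob": "robert", "bill": "william", "liz": "elizabeth"}


def parse(record):
    """Split the raw name and address strings into lowercase tokens."""
    name_tokens = re.findall(r"[a-z0-9]+", record["name"].lower())
    address_tokens = re.findall(r"[a-z0-9]+", record["address"].lower())
    return name_tokens, address_tokens


def standardize(name_tokens, address_tokens):
    """Map known variants (nicknames, abbreviations) to one canonical form."""
    name = [NICKNAMES.get(token, token) for token in name_tokens]
    address = [ABBREVIATIONS.get(token, token) for token in address_tokens]
    return " ".join(name), " ".join(address)


def match_key(record):
    """Build a comparable key from the parsed and standardized values."""
    return standardize(*parse(record))


def same_entity(record_a, record_b):
    """Treat two records as the same entity when their keys are identical."""
    return match_key(record_a) == match_key(record_b)


a = {"name": "Bob Smith", "address": "12 Main St."}
b = {"name": "Robert Smith", "address": "12 main street"}
c = {"name": "Robert Smyth", "address": "12 Main Street"}

print(same_entity(a, b))  # variants collapse to the same key
print(same_entity(a, c))  # a one-letter spelling difference defeats exact comparison
```

The second comparison shows why exact comparison of standardized values is not enough on its own: a misspelling that no lookup table anticipates still separates records that may describe the same person.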
The techniques used in this process are critical for any business applications that rely on customer or product data integration as part of a master data management (MDM) or data quality initiative. In this paper we explore the root cause of the “dual challenge” of identity resolution, examine how parsing and standardization contribute to the process, then review different ways that similarity scoring and approximate matching algorithms can help determine and resolve identical entities despite variant representations.
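As a rough illustration of similarity scoring and approximate matching, the sketch below scores two already-standardized strings and compares the score against a threshold. It uses `difflib.SequenceMatcher` from the Python standard library as a stand-in for whatever scoring algorithm a real system would use. The threshold value and the function names are assumptions chosen for the example, not recommendations.

```python
from difflib import SequenceMatcher

# Assumed cut-off for this example; in practice the threshold is tuned
# against known matches and known non-matches.
MATCH_THRESHOLD = 0.85


def similarity(value_a, value_b):
    """Return a score between 0.0 and 1.0; higher means more alike."""
    return SequenceMatcher(None, value_a.lower(), value_b.lower()).ratio()


def is_probable_match(value_a, value_b, threshold=MATCH_THRESHOLD):
    """Declare a match when the similarity score reaches the threshold."""
    return similarity(value_a, value_b) >= threshold


print(is_probable_match("robert smith", "robert smyth"))   # near-identical spellings
print(is_probable_match("robert smith", "roberta jones"))  # different people
```

Raising the threshold reduces false matches at the cost of missing true ones, and lowering it does the reverse. That trade-off is the dual challenge described above: recognizing when records refer to the same entity and when they do not.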