Data Virtualization and Data Quality – An Overview

Originally published 8 April 2010

Data virtualization delivers many benefits to the enterprise information environment, including real-time data access and data pipelining. As data virtualization is increasingly used in enterprise data integration architectures, particular attention must be paid to data quality, which historically has been the Achilles' heel of data integration. Opinions vary on whether data quality processes can be fully integrated into the data virtualization layer, primarily because of the complexity of the data quality components themselves.

Typical Data Quality Processes

Before we continue our discussion of how data quality controls can be integrated into the data virtualization layer, let’s first explore the data quality processes typically used in the enterprise to correct common data problems. The typical data quality workflow consists of data validation, standardization and normalization, cleansing (including matching, de-duping and merging), and enrichment (see Figure 1).


Figure 1: Data Quality Processes

Data validation – Data validation is typically implemented within the application layer to allow for immediate correction of mistyped entries. The simplest example is an online ordering system where a buyer enters his or her credit card information. As part of the order processing, the application seeks bank and/or credit card processor validation of the entered information. In the case of invalid information, the application alerts the user to correct the error – no order can be placed until the user enters valid information. Data validation is the most effective form of data quality because it guarantees that validated data is accurate. However, this process requires user interaction with the application, and is therefore not suitable for integration into the data virtualization layer.
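As a minimal illustration of application-layer validation, the Python sketch below applies the standard Luhn checksum to a card number before an order is accepted. The function and the test numbers are illustrative only; real validation would also involve authorization by the bank or card processor, as described above.

def luhn_valid(card_number: str) -> bool:
    """Return True if the digits pass the Luhn checksum.

    This is only a basic sanity check; it does not replace
    authorization by the bank or card processor.
    """
    digits = [int(ch) for ch in card_number if ch.isdigit()]
    if len(digits) < 13:
        return False
    checksum = 0
    # Double every second digit from the right; subtract 9 if the result exceeds 9.
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d = d * 2
            if d > 9:
                d -= 9
        checksum += d
    return checksum % 10 == 0

# The application rejects an invalid entry immediately, so bad data never reaches storage.
print(luhn_valid("4111 1111 1111 1111"))  # True  (well-known test number)
print(luhn_valid("4111 1111 1111 1112"))  # False (a single-digit typo)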

Data standardization and normalization – These operations predominantly deal with changing the format of the input data based on a predetermined set of rules. Examples include converting a numeric string into a floating-point number or parsing an address field into subcomponents such as street name, city, state and ZIP code. Data standardization can be applied during data input (e.g., at the step immediately following successful data validation) or during post-processing. Since data standardization and normalization rely on heuristic rules that can be pre-programmed, they can be implemented as part of the data virtualization pipeline.
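As a rough sketch of such rule-based formatting, the snippet below converts a numeric string to a floating-point number and splits a one-line U.S. address into its subcomponents. The regular expression and field names are simplified assumptions; production standardization engines handle far more variation.

import re

def to_float(value: str) -> float:
    """Normalize a numeric string such as '$1,234.50' to a floating-point number."""
    return float(re.sub(r"[^\d.\-]", "", value))

# A simple rule for one-line U.S. addresses of the form "<street>, <city>, <ST> <ZIP>".
ADDRESS_RE = re.compile(
    r"^(?P<street>[^,]+),\s*(?P<city>[^,]+),\s*"
    r"(?P<state>[A-Z]{2})\s+(?P<zip>\d{5}(?:-\d{4})?)$"
)

def parse_address(line: str) -> dict:
    """Split an address string into street, city, state and ZIP components."""
    match = ADDRESS_RE.match(line.strip())
    if not match:
        raise ValueError(f"Unrecognized address format: {line!r}")
    return match.groupdict()

print(to_float("$1,234.50"))
print(parse_address("123 Main St, Springfield, IL 62701"))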

Data cleansing – The most complex data quality process, cleansing consists of a number of sub-processes such as record merging, de-duplication, correction and more. Most, if not all, data cleansing operations require data scans or profiling, which help identify duplicate records. These scans may be performed continuously or periodically, typically when new data is entered or when changes are made to existing data. Because data cleansing corrects entry errors such as misspellings, it often relies on third-party data sets used as master reference data for comparison. For example, the U.S. Postal Service address database may serve as an effective master data set for validating customer contact information entries.
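The sketch below shows one common de-duplication pattern in miniature: records are grouped by a blocking key (last name plus ZIP) and then compared with a fuzzy string match from Python's standard library. The field names, threshold and matching rule are illustrative assumptions, not a description of any particular cleansing product.

from difflib import SequenceMatcher

def blocking_key(record: dict) -> tuple:
    """Group records so that only plausibly related ones are compared."""
    return (record["last_name"].strip().lower(), record["zip"][:5])

def similar(a: str, b: str, threshold: float = 0.85) -> bool:
    """Fuzzy string comparison; a real matcher would use more robust scoring."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold

def find_duplicates(records: list) -> list:
    """Return index pairs of records that share a key and have similar first names."""
    by_key = {}
    for i, rec in enumerate(records):
        by_key.setdefault(blocking_key(rec), []).append(i)
    pairs = []
    for indices in by_key.values():
        for a in range(len(indices)):
            for b in range(a + 1, len(indices)):
                i, j = indices[a], indices[b]
                if similar(records[i]["first_name"], records[j]["first_name"]):
                    pairs.append((i, j))
    return pairs

customers = [
    {"first_name": "John", "last_name": "Smith", "zip": "94043"},
    {"first_name": "Jon",  "last_name": "Smith", "zip": "94043"},
    {"first_name": "Mary", "last_name": "Smith", "zip": "94043"},
]
print(find_duplicates(customers))  # [(0, 1)] -- candidates for merging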

Data enrichment – This process is often used in combination with data cleansing to enhance the overall completeness of the data. For example, a process can expand a customer-submitted five-digit ZIP code to the ZIP+4 format by looking up the address in the U.S. Postal Service database. This step, while not as complex as data cleansing, still requires record comparison and matching against a trusted third-party master data set.
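A minimal sketch of this kind of lookup-based enrichment follows; the in-memory dictionary stands in for the U.S. Postal Service reference data, and the sample values are made up for illustration.

# Stand-in for the trusted master data set; a real implementation would query
# the U.S. Postal Service address database or a licensed verification service.
ZIP_PLUS4_REFERENCE = {
    ("123 Main St", "62701"): "62701-1234",  # illustrative value only
}

def enrich_zip(record: dict) -> dict:
    """Expand a five-digit ZIP to ZIP+4 when the address matches the reference data."""
    key = (record["street"], record["zip"][:5])
    full_zip = ZIP_PLUS4_REFERENCE.get(key)
    return {**record, "zip": full_zip} if full_zip else record

print(enrich_zip({"street": "123 Main St", "zip": "62701"}))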

Real-time Versus Batch

With the exception of the data validation step, as described above, all other data quality operations are performed on data stored somewhere, for instance in a relational data source. As such, these processes can theoretically be integrated into the data virtualization layer. Considering that data virtualization operations such as federated multi-source queries are often time sensitive – client applications access the data through the virtualization layer in real time – the nature of the data quality operation will determine its suitability for real-time access. In simple terms, any additional operation, such as data quality processing, will add some latency to these queries, and therefore requires careful case-by-case evaluation to ensure that the complete process still meets the SLA requirements put forth by the business stakeholders.

Some of the more traditional implementations store the output of data quality processes in temporary or even permanent stores, and then point the data virtualization layer to the store instead of the originating data sources (see Figure 2). The benefit is obvious – this configuration ensures that data quality operations are not rerun on data sets that have not changed since the last access. Unfortunately, such an approach also prevents data quality from being applied within the data virtualization layer, because all data quality processes have to be executed first to produce the data, which is then consumed by the data virtualization layer.


Figure 2: Traditional Data Integration Processes

Using Caching

One possible alternative is to configure the data virtualization layer to cache the results of virtualized operations. In this scenario, the caching system becomes the temporary store for the data output by the data quality controls. In essence, the data virtualization layer automatically ensures that the client accesses only the cached, quality-controlled data. The IT operator can also configure data virtualization caching to monitor for changes in the underlying data and refresh its cache as needed, re-executing the appropriate data quality operations when necessary.

One drawback to this approach is that first-time data access requires running the data quality processes, thus adding latency when compared to subsequent runs. Another area of potential concern is the need to continuously monitor the primary data source for changes and the need to re-apply data quality processes on the changed or newly added data before caching it in the data virtualization layer. Here, a solution based on database change capture and incremental caching may help alleviate bottlenecks and unnecessary loads on the systems.
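The following sketch illustrates the incremental idea under some simplifying assumptions: each source row carries a primary key and a modification timestamp, change detection is a simple high-water mark, and apply_dq stands in for whatever cleansing or standardization chain is configured. It is not a description of any specific virtualization product's caching feature.

class IncrementalDQCache:
    """Cache cleansed rows; refresh only rows changed since the previous load."""

    def __init__(self, fetch_changed_rows, apply_dq):
        self._fetch_changed_rows = fetch_changed_rows  # callable(since) -> rows modified after 'since'
        self._apply_dq = apply_dq                      # callable(row) -> cleansed row
        self._cache = {}
        self._high_water_mark = 0.0

    def refresh(self) -> int:
        """Re-run data quality only on rows modified since the last refresh."""
        changed = list(self._fetch_changed_rows(self._high_water_mark))
        for row in changed:
            self._cache[row["id"]] = self._apply_dq(row)
            self._high_water_mark = max(self._high_water_mark, row["modified_at"])
        return len(changed)

    def query(self):
        """Clients read cleansed data from the cache, not the originating sources."""
        return list(self._cache.values())

rows = [{"id": 1, "name": " alice ", "modified_at": 100.0}]
cache = IncrementalDQCache(
    fetch_changed_rows=lambda since: [r for r in rows if r["modified_at"] > since],
    apply_dq=lambda r: {**r, "name": r["name"].strip().title()},
)
print(cache.refresh(), cache.query())  # first call cleanses the row and caches it
print(cache.refresh(), cache.query())  # nothing changed, so no data quality rerun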

Data Quality in the Data Virtualization Layer

It’s important to point out that a number of data quality processes, such as data validation and data profiling, cannot be implemented as part of the data virtualization layer due to how they operate on the data. But does this mean they should always be performed offline or outside the data virtualization layer? Not necessarily. Certain data quality processes, particularly the standardization and normalization operations broadly called data transformations, may be inserted into the real-time data virtualization pipeline with little impact on overall data virtualization performance. Why? These operations are typically pre-programmed to follow well-defined rules. Each transformation operation is fairly self-contained and does not rely on external information or require lookups. Hence such operations are not CPU- or disk-intensive, making them ideal candidates for complete integration into the data virtualization layer.
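As a rough sketch of what such an inline transformation chain might look like, the snippet below composes a few pure, rule-based steps and applies them to each row as it streams through. The rules and field names are illustrative assumptions rather than any product's API.

def trim_whitespace(row: dict) -> dict:
    return {k: v.strip() if isinstance(v, str) else v for k, v in row.items()}

def uppercase_state(row: dict) -> dict:
    return {**row, "state": row["state"].upper()} if "state" in row else row

def normalize_phone(row: dict) -> dict:
    if "phone" not in row:
        return row
    return {**row, "phone": "".join(ch for ch in row["phone"] if ch.isdigit())}

# Each rule is self-contained and needs no external lookup, so the chain can run
# inline on every row of a federated result set with minimal added latency.
PIPELINE = (trim_whitespace, uppercase_state, normalize_phone)

def transform_stream(rows):
    for row in rows:
        for step in PIPELINE:
            row = step(row)
        yield row

sample = [{"name": " Ada Lovelace ", "state": "ca", "phone": "(415) 555-0100"}]
print(list(transform_stream(sample)))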

On the other hand, typical data cleansing and enrichment operations are significantly more complex than data transformation operations. They are also more time-consuming, requiring extensive computations, multiple data passes, external references and lookups. For example, looking up an external billing address to expand a ZIP code during the enrichment stage requires accessing external data sources, record matching and merging.

Consider Service Levels

To determine the suitability of a particular data quality operation for real-time federated data access, the logical starting point is to evaluate the SLA requirements for federated queries, such as transactions per second. Start by ensuring that data access alone meets those requirements. After that, measure the impact of the various data quality operations on individual federated query transactions, or average the impact across particular data sets. With the addition of new data quality operations, the resulting latency increases are rarely linear. Therefore, it makes sense to re-measure the data federation layer’s performance after each new data quality process is added. Time-consuming operations like cleansing, enrichment and profiling should be performed outside the data virtualization layer, unless there is tolerance for a significant latency increase or the option of implementing some form of caching, as described above. Naturally, one should be careful even with relatively limited operations such as transformations, particularly when multiple seemingly simple operations are stacked into a fairly complex chain – a dozen such “simple” operations can eat up a significant chunk of available CPU time, degrading the performance of data virtualization access (see Figure 3).
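One way to approach that measurement is sketched below: time a representative query with and without each data quality step and compare the averages against the SLA target. The query function, the transformation rule and the 250 ms target are placeholders; in practice the query would be executed through the data virtualization layer itself.

import time

def measure(label, query_fn, dq_steps, runs=20):
    """Average the latency of a query followed by a chain of data quality steps."""
    start = time.perf_counter()
    for _ in range(runs):
        result = query_fn()
        for step in dq_steps:
            result = [step(row) for row in result]
    avg_ms = (time.perf_counter() - start) / runs * 1000
    print(f"{label}: {avg_ms:.1f} ms per query")
    return avg_ms

# Placeholder data source standing in for a federated query result.
rows = [{"name": f" customer {i} "} for i in range(10_000)]

def query():
    return list(rows)

baseline = measure("no data quality", query, [])
with_trim = measure("+ trim rule", query, [lambda r: {**r, "name": r["name"].strip()}])

SLA_MS = 250  # example service-level target for a federated query
print("within SLA" if with_trim <= SLA_MS else "exceeds SLA -- consider caching or offline processing")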


Figure 3: Re-Measure Performance Impact

Conclusion

As IT teams increasingly implement data virtualization as part of their data integration initiatives, they can take several practical approaches to achieving seamless data quality throughout the data virtualization layer. By examining specific SLA requirements, the potential impact on query latency, and other critical factors, enterprise data architects can determine the processes necessary for achieving their data quality objectives and implement them in a way that preserves the real-time data access benefits afforded by the data virtualization layer.


  • Avtandil Garakanidze
    Avtandil Garakanidze, Senior Vice President of Product Development at Composite Software, Inc., has 15+ years of experience garnered from executive and senior product and engineering management roles at several Silicon Valley technology companies. At VERITAS, he managed the VERITAS Volume Replicator, an enterprise disaster recovery product. At Starfish Software, he was responsible for the TrueSync OTA – now the backbone service of top wireless carriers in the U.S., Europe and Asia. Garakanidze holds an MBA from the Massachusetts Institute of Technology (MIT) and an MSCE from the Georgian Technical University. He may be reached at avtandil@compositesw.com.