Test Data Management Risks

Providing meaningful data for software testing is the main challenge of test data management. This issue is even more important in domains where sensitive data is used, such as healthcare or financial systems. In this article, Thirunavukarasu Papanasam discusses some best practices to take the risk out of your test data management activity.

Author: Thirunavukarasu Papanasam, Maveric Systems, http://maveric-systems.com/

Providing quality test data is a challenge across the software testing life cycle. A significant amount of resources is spent in creating, maintaining, and archiving test data using manual and semi-automated processes. Using direct copies of production data without de-risking it may result in the exposure of sensitive customer data and financial data, thereby violating regulatory and compliance directives.

Teams trying to acquire de-risked, high-quality test data face the following challenges:
1. Distributed Environment – In the current outsourcing model of development, different stages of the software development life cycle are executed across multiple locations, yet the data has to be seen and worked on in its entirety. Banks have to deal with their data leaving their premises, and resolve the resulting confidentiality and security problems. Also, if you operate with an Agile approach where time to market is critical, there is very little additional time available for data generation.
2. Data Complexity – Often, testing teams have to work with different types of data stored in multiple environments in the legacy back-end. For assurance purposes, this data needs to be unified in a central repository. This is a big challenge, often complicated by the fact that there is little documentation on the relationships between databases and how to connect them.
3. Differences in Types of Testing – Different types of testing, such as user acceptance testing (UAT), system integration testing (SIT), performance testing, etc., require different types of data. It is therefore imperative to minimize the effort testers spend preparing test data, while still ensuring that the data provides maximum coverage and volume in the correct format wherever needed.
4. Data Security and Confidentiality – Moving confidential data to the test environment is a risky proposition. With regulators imposing strict compliance norms on banks, the need for effective data masking becomes all the more important.

A solution that addresses these challenges needs to do the following: unify data from multiple sources, provide copies of production data, generate data for code coverage, de-risk production data based on regulatory and compliance requirements, de-duplicate and reuse data across multiple test environments, and effectively categorize and establish relationships between databases. More importantly, all this has to be done without any loss in data quality or integrity. The solution must also allow data to be provisioned to multiple locations across the globe, if needed.

Data Profiling & Categorization

Data profiling involves identifying relationships between different types of data at the source level. As a best practice, the production database itself should not be used for the data discovery process; a disaster recovery database or a data backup can serve as the source for data profiling and categorization instead. This process can be done manually as well as through tools such as IBM Discovery, Oracle EM and CA Test Data Manager. In addition to data profiling, it is ideal to categorize the data based on its business nature and usage. This will help to classify the data properly as transaction data, financial data, master data, etc., and prepare it for the next steps in test data management.
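
As an illustration of what relationship discovery looks for, here is a minimal Python sketch. It is not how any of the tools named above work internally. It assumes the tables have already been loaded from a backup into plain dictionaries, and it flags a column as a candidate reference to another table when all of its values appear in a column of that table whose values are unique. The function name `find_candidate_relationships` and the sample tables are hypothetical.

```python
def find_candidate_relationships(tables):
    """Suggest (child_column, parent_column) pairs that look like foreign keys.

    `tables` maps table name -> {column name -> list of values}.
    A pair is suggested when the parent column holds unique values and
    every non-null child value is found in the parent column.
    """
    columns = [
        (table, column, values)
        for table, cols in tables.items()
        for column, values in cols.items()
    ]
    candidates = []
    for child_table, child_column, child_values in columns:
        child_set = {v for v in child_values if v is not None}
        if not child_set:
            continue  # an empty column tells us nothing
        for parent_table, parent_column, parent_values in columns:
            if parent_table == child_table:
                continue  # only look for relationships across tables
            parent_set = set(parent_values)
            if len(parent_set) != len(parent_values):
                continue  # parent side must be unique, like a key
            if child_set <= parent_set:
                candidates.append(
                    (f"{child_table}.{child_column}",
                     f"{parent_table}.{parent_column}")
                )
    return candidates


# Hypothetical sample data standing in for a restored backup.
sample_tables = {
    "customers": {
        "customer_id": [1, 2, 3],
        "name": ["Asha", "Ben", "Chen"],
    },
    "accounts": {
        "account_id": [10, 11, 12, 13],
        "customer_id": [1, 1, 2, 3],
    },
}
relationships = find_candidate_relationships(sample_tables)
```

The result is only a list of candidates: value overlap can be coincidental, so each suggested relationship still needs to be confirmed by someone who knows the data.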

Data Masking

It is best to first consider which type of data masking suits the current goals. There are different techniques for effective data masking, such as substitution, shuffling, user-defined functions, etc. For example, name columns can be masked using the substitution technique, which replaces each value with another meaningful name. Functional data such as credit card numbers is different: simple substitution or scrambling will produce values that fail in functional tests, so it is better to use an algorithm such as Luhn’s to generate functionally valid credit card numbers.
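
The two techniques from this example can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a production masking routine: the list `SUBSTITUTE_NAMES`, the function names, and the choice to keep the first six digits of the card number are all assumptions made for the example.

```python
import hashlib
import random

# Hypothetical replacement values; any list of realistic names would do.
SUBSTITUTE_NAMES = ["Alex Morgan", "Priya Nair", "Sam Okafor", "Maria Silva"]


def mask_name(real_name: str) -> str:
    """Substitution: map a real name to a fake one, the same one every time."""
    digest = hashlib.sha256(real_name.encode("utf-8")).digest()
    return SUBSTITUTE_NAMES[digest[0] % len(SUBSTITUTE_NAMES)]


def luhn_check_digit(partial: str) -> str:
    """Return the digit that makes `partial + digit` pass the Luhn check."""
    total = 0
    # Walk right to left; the rightmost digit of `partial` is doubled first.
    for position, char in enumerate(reversed(partial)):
        digit = int(char)
        if position % 2 == 0:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return str((10 - total % 10) % 10)


def is_luhn_valid(number: str) -> bool:
    """Check a full card number against its own last digit."""
    return luhn_check_digit(number[:-1]) == number[-1]


def mask_card_number(real_number: str, rng: random.Random) -> str:
    """Keep the first six digits, randomize the rest, recompute the check digit.

    Keeping the leading digits is an assumption of this sketch, made so that
    tests depending on the card's prefix still behave as before.
    """
    prefix = real_number[:6]
    middle_length = len(real_number) - len(prefix) - 1
    middle = "".join(str(rng.randint(0, 9)) for _ in range(middle_length))
    partial = prefix + middle
    return partial + luhn_check_digit(partial)
```

Calling `mask_card_number("4111111111111111", random.Random(42))` returns a different 16-digit number that still passes `is_luhn_valid`, so validation logic in the application under test accepts it. Note that `mask_name` is deterministic, which keeps a masked name consistent across tables but also means the name list and hashing scheme should themselves be kept out of the test environment.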

Data Generation

There are scenarios – such as performance testing, testing an enhancement that is absent in production, testing negative cases, etc. – where data pulled from production databases may not be sufficient for testing. In such cases, data generation tools such as CA Test Data Manager can be used to synthetically create data that satisfies the requirements (in terms of volumes/business rules) and can substitute production data. Also, since it is synthetically generated, the data can be used in different environments without violating regulatory or compliance guidelines.
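
To make the idea concrete, the following Python sketch generates a requested volume of transaction records, with a configurable share of deliberately invalid rows for negative tests. It uses only the standard library rather than a commercial tool, and the record layout, the business rule (amounts must be positive) and the function name `generate_transactions` are assumptions chosen for illustration.

```python
import random
from datetime import date, timedelta


def generate_transactions(count, seed=0, negative_ratio=0.1):
    """Create `count` synthetic transaction rows.

    Assumed business rule: a valid transaction has a positive amount.
    Roughly `negative_ratio` of the rows break that rule on purpose and are
    flagged with `expect_rejection` so that tests know the expected outcome.
    """
    rng = random.Random(seed)  # fixed seed gives reproducible data sets
    first_day = date(2020, 1, 1)
    rows = []
    for index in range(count):
        is_negative_case = rng.random() < negative_ratio
        amount = round(rng.uniform(0.01, 5000.00), 2)
        booked_on = first_day + timedelta(days=rng.randint(0, 364))
        rows.append({
            "txn_id": f"T{index:08d}",
            "account_id": f"ACC{rng.randint(1, 500):05d}",
            "amount": -amount if is_negative_case else amount,
            "currency": rng.choice(["USD", "EUR", "INR"]),
            "booked_on": booked_on.isoformat(),
            "expect_rejection": is_negative_case,
        })
    return rows
```

Because no value is derived from a real customer record, a call such as `generate_transactions(100_000)` can feed a performance test environment without any masking step. Seeding the generator means the same data set can be recreated on demand instead of being archived.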

Copy Data Virtualization

Data virtualization is extremely useful when multiple test environments need the same set of data: it lets us provision that data to each environment as a virtual copy rather than a full physical copy. With tools like Actifio and Delphix, virtual data can be provisioned so that any change a user makes is recorded only in that user’s local copy, which stores the changes alone, and does not affect the main copy.
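
The principle behind this behavior, often called copy-on-write, can be illustrated with Python's standard `collections.ChainMap`. This is only a conceptual sketch with in-memory dictionaries; it makes no claim about how Actifio or Delphix are implemented, and the account data and the helper `provision_virtual_copy` are hypothetical.

```python
from collections import ChainMap


def provision_virtual_copy(golden_copy):
    """Return a view of `golden_copy` whose writes go to a private layer.

    Reads fall through to the shared golden copy; writes are stored in the
    empty dict placed in front of it, so the golden copy is never modified.
    """
    return ChainMap({}, golden_copy)


# Hypothetical main copy shared by all test environments.
golden_copy = {"ACC00001": 250.00, "ACC00002": 980.50}

tester_a = provision_virtual_copy(golden_copy)
tester_b = provision_virtual_copy(golden_copy)

# Tester A changes one balance and adds an account.
tester_a["ACC00001"] = 0.00
tester_a["ACC00003"] = 75.25

# The private layer of tester A holds the changes alone.
changes_made_by_a = tester_a.maps[0]
```

After these writes, `tester_b` and `golden_copy` still show the original balance for `ACC00001`, while `changes_made_by_a` contains just the two modified entries. The storage cost of each additional environment is therefore proportional to what it changes, not to the size of the data set.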

In addition to the methods and tools that address the challenges of test data management, there are processes that help implement a functional and effective test data management solution. Defining the test data request workflow, the documents to be used during requests, the SLAs and metrics, etc. will facilitate a seamless test data service.

About the author

Thirunavukarasu Papanasam is a data management professional with over 15 years of experience in the manufacturing and financial services domains and extensive experience in developing ‘data quality’ solutions. He has worked in the areas of data modeling, data analysis and database programming, and has also explored test data management in depth.