Big Data: How to Test the Elephant?

Big Data is a big topic in software development today. In practice, however, software testers may not yet fully understand what exactly Big Data is. A tester knows that testing it requires a plan. Since most Big Data lacks a traditional structure, what does Big Data quality look like? And what are the most appropriate software testing tools? This article tries to answer these questions.

Author: Alexander Panchenko, A1QA

As more and more companies adopt Big Data as a solution for data analysis, the question arises: how can you determine a proper testing strategy for controlling this heavyweight “elephant”? The problem for software testing is magnified by a lack of clear understanding of what to test and how deep a tester should go.

Big Data Software Testing

As a software tester, you need a clear definition of Big Data. Many of us mistakenly believe that Big Data is just a large amount of information. This is incorrect. For example, a 2-petabyte Oracle database is not Big Data; it is simply a high-load database. More precisely, Big Data is a set of approaches, tools and methods for processing high volumes of structured and, most importantly, unstructured data. The key difference between Big Data and “ordinary” high-load systems is the ability to create flexible queries.

Big Data can be described by three “V”s: Volume, Variety, and Velocity. In other words, you have to process an enormous amount of data in various formats at high speed.
The processing of Big Data, and therefore its testing, can be split into three basic components. The process is illustrated below by an example based on the open source Apache Hadoop software framework:
1. Loading the initial data into the HDFS (Hadoop Distributed File System)
2. Execution of Map-Reduce operations
3. Rolling out the output results from the HDFS

Loading the initial data into HDFS

In this first step, the data is retrieved from various sources (social media, web logs, social networks, etc.) and uploaded into the HDFS, where it is split into multiple files. Testing in this step includes:
* Verifying that the required data was extracted from the original system without corruption;
* Validating that the data files were loaded into the HDFS correctly;
* Checking that the files are partitioned and copied to different data units;
* Determining the most complete set of data that needs to be checked.

For step-by-step validation, you can use tools such as Datameer, Talend or Informatica.
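The load checks above can be approximated with a simple comparison of checksums and record counts. The following is a minimal sketch, not a definitive implementation: it assumes that the source extracts and the copies pulled back from the cluster are both available as local directories, that files keep their names, and that records are newline-delimited. The function names `file_digest`, `count_records` and `compare_load` are hypothetical.

```python
import hashlib
from pathlib import Path


def file_digest(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    sha = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            sha.update(chunk)
    return sha.hexdigest()


def count_records(path: Path) -> int:
    """Count newline-delimited records in a file."""
    with path.open("rb") as handle:
        return sum(1 for _ in handle)


def compare_load(source_dir: Path, loaded_dir: Path) -> dict:
    """Compare every source file with its loaded copy.

    Assumption: both directories are local stand-ins for the original
    system and for the files fetched back from the HDFS.
    """
    report = {"missing": [], "corrupted": [],
              "source_records": 0, "loaded_records": 0}
    for src in sorted(source_dir.iterdir()):
        if not src.is_file():
            continue
        report["source_records"] += count_records(src)
        dst = loaded_dir / src.name
        if not dst.is_file():
            # The file never arrived in the target.
            report["missing"].append(src.name)
            continue
        report["loaded_records"] += count_records(dst)
        if file_digest(src) != file_digest(dst):
            # Same name, different content: the data was altered on the way.
            report["corrupted"].append(src.name)
    return report
```

A load passes this check when `missing` and `corrupted` are both empty and the two record counts match. On real data volumes the same idea is usually applied to a sample of files rather than to the full set.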

Execution of Map-Reduce operations

In this step you process the initial data using a Map-Reduce operation to obtain the desired result. Map-Reduce is a data processing concept for condensing large volumes of data into useful aggregated results. Testing in this step includes:
* Checking the required business logic on a standalone unit and then on a set of units;
* Validating the Map-Reduce process to ensure that the “key-value” pairs are generated correctly;
* Checking the aggregation and consolidation of data after the “reduce” operation;
* Comparing the output data with the initial files to make sure that the output file was generated and that its format meets all the requirements.
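To make the key-value and aggregation checks concrete, here is a small sketch that runs a map step and a reduce step in memory, so each can be tested on its own before going to a cluster. It is an illustration under assumptions, not Hadoop code: the input format (`user page` log lines) and the names `map_page_hits`, `reduce_sum` and `run_map_reduce` are hypothetical.

```python
from collections import defaultdict
from typing import Callable, Iterable, Iterator


def map_page_hits(line: str) -> Iterator[tuple[str, int]]:
    """Map step: emit (page, 1) for each well-formed 'user page' line."""
    parts = line.split()
    if len(parts) == 2:
        yield parts[1], 1
    # Malformed lines emit nothing, so they never reach the reducer.


def reduce_sum(key: str, values: list[int]) -> tuple[str, int]:
    """Reduce step: aggregate all values collected for one key."""
    return key, sum(values)


def run_map_reduce(
    records: Iterable[str],
    mapper: Callable[[str], Iterable[tuple[str, int]]],
    reducer: Callable[[str, list[int]], tuple[str, int]],
) -> dict[str, int]:
    """Run map, group by key, then reduce, all in a single process."""
    grouped: dict[str, list[int]] = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            grouped[key].append(value)
    return dict(reducer(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    lines = ["u1 /home", "u2 /home", "u3 /cart", "broken"]
    # Key-value check: one pair per valid line, none for a malformed one.
    assert list(map_page_hits("u1 /home")) == [("/home", 1)]
    assert list(map_page_hits("broken")) == []
    # Aggregation check: totals per key, and overall total equals valid lines.
    result = run_map_reduce(lines, map_page_hits, reduce_sum)
    assert result == {"/home": 2, "/cart": 1}
    assert sum(result.values()) == 3
```

Testing the mapper and reducer as plain functions corresponds to checking the business logic on a standalone unit; the same assertions can then be repeated against the output of the real job.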

Hive is well suited to verifying the data. Testers write queries in the SQL-style Hive Query Language (HQL) and run them against HBase to verify that the output complies with the requirements. HBase is a NoSQL database that can serve as the input and output for MapReduce jobs.
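The idea of query-based verification can be sketched as a set of SQL checks that must each return zero rows. The example below is only an illustration: it uses Python's built-in `sqlite3` module as a local stand-in for Hive, and the table `page_hits`, its columns and the two checks are hypothetical. The queries use basic SQL constructs, but they may still need adjusting before they run as HQL.

```python
import sqlite3

# Each check is written so that a non-empty result means a failure.
CHECKS = {
    # Every key should appear exactly once in the aggregated output.
    "duplicate_keys":
        "SELECT page FROM page_hits GROUP BY page HAVING COUNT(*) > 1",
    # Aggregated counts should be present and positive.
    "invalid_counts":
        "SELECT page FROM page_hits WHERE hits IS NULL OR hits <= 0",
}


def load_output(rows):
    """Load job output rows into an in-memory table (stand-in for Hive)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE page_hits (page TEXT, hits INTEGER)")
    conn.executemany("INSERT INTO page_hits VALUES (?, ?)", rows)
    return conn


def run_checks(conn):
    """Run every check and return the offending keys per check name."""
    return {
        name: [row[0] for row in conn.execute(sql)]
        for name, sql in CHECKS.items()
    }


if __name__ == "__main__":
    conn = load_output([("/home", 2), ("/cart", 1), ("/cart", 4), ("/help", 0)])
    for name, offenders in run_checks(conn).items():
        print(name, offenders)
```

Keeping the checks in a dictionary makes it easy to add a new requirement as one more query without touching the code that runs them.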

You can also use other Big Data processing frameworks as an alternative to Map-Reduce. Spark and Storm are good examples of substitutes for this programming model, as they provide similar functionality and are compatible with the Hadoop ecosystem.

Rolling out the output results from HDFS

This final step includes unloading the data generated in the second step and loading it into the downstream system, which may be a data repository used to generate reports or a transactional analysis system for further processing. Testing in this step includes:
* Inspecting the data aggregation to make sure that the data has been loaded into the required system without distortion;
* Validating that the reports include all the required data, that all indicators refer to concrete measures and are displayed correctly, and that the reports operate on the latest data.
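One way to check that nothing was lost or distorted on the way to the downstream system is to reconcile the two sides key by key. The sketch below assumes that both the job output and the downstream data can be read into dictionaries of aggregated values; the function name `reconcile` and the sample keys are hypothetical.

```python
def reconcile(expected: dict, loaded: dict) -> dict:
    """Compare aggregated output with what the downstream system holds.

    Returns the keys that never arrived, the keys that appeared from
    nowhere, and the keys whose values differ between the two sides.
    """
    shared = expected.keys() & loaded.keys()
    return {
        "missing": sorted(expected.keys() - loaded.keys()),
        "unexpected": sorted(loaded.keys() - expected.keys()),
        "distorted": sorted(k for k in shared if expected[k] != loaded[k]),
    }


if __name__ == "__main__":
    job_output = {"/home": 120, "/cart": 45, "/help": 7}
    downstream = {"/home": 120, "/cart": 54, "/promo": 3}
    print(reconcile(job_output, downstream))
```

The load is clean when all three lists are empty. A report check can reuse the same function by comparing the figures shown in the report with the downstream data.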

Test data for a Big Data project can be obtained in two ways: by copying actual production data or by creating data exclusively for testing purposes. Software testers should clearly prefer the first option. In that case, the conditions are as realistic as possible, and it becomes easier to come up with a larger number of test scenarios. However, not all companies are willing to provide real data, because they prefer to keep some information confidential. You then have to create test data yourself or request artificial data. The main drawback of this approach is that artificial business scenarios built on limited data inevitably restrict testing. The defects that such scenarios miss can be detected only by real users.
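When production data is not available, artificial data can at least be made reproducible and deliberately imperfect. The following sketch generates log lines from a fixed seed and injects a configurable share of malformed records. The record format, the `PAGES` list and the function `generate_log_lines` are assumptions made for illustration only.

```python
import random

# Hypothetical page names; real test data should mirror the actual domain.
PAGES = ["/home", "/cart", "/checkout", "/help"]


def generate_log_lines(count: int, seed: int = 42,
                       malformed_ratio: float = 0.05) -> list[str]:
    """Generate reproducible 'user page' log lines for testing.

    A fixed seed makes every run produce the same data, so a failing
    test can be replayed. A share of records is deliberately malformed
    to exercise the error handling of the pipeline.
    """
    rng = random.Random(seed)
    lines = []
    for _ in range(count):
        if rng.random() < malformed_ratio:
            lines.append("corrupted-record")
        else:
            lines.append(f"user{rng.randint(1, 1000)} {rng.choice(PAGES)}")
    return lines


if __name__ == "__main__":
    sample = generate_log_lines(10)
    print("\n".join(sample))
```

Such data covers only the scenarios the tester thought of, which is exactly the limitation described above.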

As speed is one of the main characteristics of Big Data, performance testing is mandatory. It usually requires a huge volume of data and an infrastructure similar to the production infrastructure. Where this is acceptable, the data is copied directly from production.

To determine the performance metrics and detect errors, you can use, for instance, a Hadoop performance monitoring tool. Performance testing tracks fixed indicators such as operating time and capacity, as well as system-level metrics such as memory usage.
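The indicators mentioned above can be illustrated on a small scale with Python's standard library. This is a single-process sketch, not a replacement for cluster monitoring: `measure` and `count_valid` are hypothetical names, and `tracemalloc` reports only memory allocated by Python itself.

```python
import time
import tracemalloc


def measure(job, records):
    """Run job(records) and collect operating time, throughput and peak memory."""
    tracemalloc.start()
    started = time.perf_counter()
    result = job(records)
    elapsed = time.perf_counter() - started
    _, peak_bytes = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    metrics = {
        "seconds": elapsed,
        # Guard against a zero reading from the timer on very fast jobs.
        "records_per_second": len(records) / elapsed if elapsed > 0 else float("inf"),
        "peak_bytes": peak_bytes,
    }
    return result, metrics


def count_valid(records):
    """Hypothetical job under test: count well-formed 'user page' lines."""
    return sum(1 for line in records if len(line.split()) == 2)


if __name__ == "__main__":
    data = ["u1 /home"] * 100_000 + ["broken"] * 10
    result, metrics = measure(count_valid, data)
    print(result, metrics)
```

Recording these figures for each build makes it possible to compare runs and to spot a regression in time or memory before it reaches production-sized data.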

If you apply the right test strategies and follow best practices, you will improve the quality of Big Data testing, which helps to identify defects at early stages and reduce overall cost.

About the author

Alexander Panchenko works as Head of the Complex Web QA Department at A1QA, the largest independent software testing company in Eastern and Central Europe. During his long career at A1QA, Alexander has gained extensive experience in QA and quality control on various projects, from standalone backup and recovery applications to medical social networks. He has also participated in large projects with complex business logic, such as SharePoint-based corporate portals, banking systems and government portals. He now leads several teams of 7+ people and manages a division of 30+ engineers.