Open Source Test Data Generators

In most software testing activities, you need data. Sometimes a small sample is enough, but if you want to perform load testing or test a feature that must produce a multipage invoice, you need more than two or three records. Test data generators are tools that help with this task by automatically generating hundreds or thousands of customers, products or account items with different attributes for their id, email, name, etc.

Test data generators can work in different modes, from a purely random approach to a more focused or intelligent one. Their goal is to use a predefined data structure to produce the data needed for testing in a specific format that can range from a spreadsheet file to SQL insert statements. This article presents some open source test data generators. Do not hesitate to contact us to include any tool that is not yet listed in this article. The products currently included in this article are: Benerator, DataFactory, Data Factory, DataGenerator, dbldatagen, Faker, generatedata, Gofakeit, jFairy, Mimesis, MockNeat, MySQL Random Data Generator, pydbgen, Spawner, SQLfuzz, Synth, test-data-generator.
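
As a rough illustration of the idea described above (a predefined structure in, formatted test data out), here is a minimal Python sketch that uses only the standard library. It is not taken from any of the tools listed in this article: the name lists, the customers table and the function names are all assumptions made for the example.

```python
import csv
import io
import random

# Hypothetical sample values; a real generator ships much larger data sets.
FIRST_NAMES = ["Ana", "Ben", "Chloe", "Dev", "Emma"]
LAST_NAMES = ["Garcia", "Ito", "Khan", "Muller", "Smith"]


def generate_customers(count, seed=0):
    """Return `count` customer dicts with id, name and email attributes."""
    rng = random.Random(seed)  # seeded so that a test run is reproducible
    customers = []
    for customer_id in range(1, count + 1):
        first = rng.choice(FIRST_NAMES)
        last = rng.choice(LAST_NAMES)
        customers.append({
            "id": customer_id,
            "name": f"{first} {last}",
            # The id keeps emails unique even when names repeat.
            "email": f"{first}.{last}{customer_id}@example.com".lower(),
        })
    return customers


def to_csv(rows):
    """Serialize rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["id", "name", "email"])
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def to_sql_inserts(rows, table="customers"):
    """Serialize rows as SQL INSERT statements (single quotes are doubled)."""
    statements = []
    for row in rows:
        name = row["name"].replace("'", "''")
        email = row["email"].replace("'", "''")
        statements.append(
            f"INSERT INTO {table} (id, name, email) "
            f"VALUES ({row['id']}, '{name}', '{email}');"
        )
    return statements


if __name__ == "__main__":
    customers = generate_customers(1000)
    print(to_csv(customers[:3]))
    print("\n".join(to_sql_inserts(customers[:3])))
```

Seeding the random generator is a deliberate choice here: it lets a failing test be replayed with exactly the same data.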

Updates

August 28 2023:
added Databricks Labs Data Generator (dbldatagen), Faker, Gofakeit, jFairy, Mimesis
July 21 2022:
renamed Databene Benerator to Benerator
added MockNeat, MySQL Random Data Generator, pydbgen
October 19 2021:
added generatedata, SQLfuzz, Synth

Benerator

Benerator is a framework released under both open source and commercial licenses that can be used to generate high-volume test data. This test data generation tool works on Windows and Unix systems. It supports many database systems (Oracle, IBM DB2, MS SQL Server, MySQL, PostgreSQL, …), XML, XML Schema, CSV, flat files and Excel. Benerator also has a plugin system that allows you, for instance, to use it with Eclipse or Maven.

Figure: How Databene Benerator works.

Websites: https://github.com/rapiddweller/rapiddweller-benerator-ce and https://sourceforge.net/projects/benerator/

DataFactory

DataFactory is an open source test data generator that allows you to easily generate test data. It was primarily written for populating databases for development or test environments by providing values for names, addresses, email addresses, phone numbers, text, and dates. DataFactory can be integrated with Maven.

Website: https://github.com/andygibson/datafactory

Data Factory

Data Factory is an open source Java API that can be used to generate random data. It is useful when developing applications that require a lot of sample data.

Website: https://sourceforge.net/projects/data-factory/

DataGenerator

DataGenerator is an open source Java library that can produce large volumes of data to meet the challenges of the Big Data domain. DataGenerator frames data production as a modeling problem: the user provides a model of dependencies among variables, and the library traverses the model to produce relevant data sets. DataGenerator can be used with IDEs such as Eclipse, IntelliJ IDEA or NetBeans.
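
The model-traversal idea can be sketched in a few lines of Python. This is only an illustration of the general technique and not DataGenerator's API or model format: the variables, the dependency rules and the traverse function are hypothetical.

```python
def account_types(row):
    """The first variable has no dependencies."""
    return ["retail", "institutional"]


def order_types(row):
    """Hypothetical rule: only institutional accounts may place block orders."""
    if row["account_type"] == "institutional":
        return ["market", "limit", "block"]
    return ["market", "limit"]


def limit_prices(row):
    """A limit price only makes sense for limit orders."""
    return [99.5, 100.5] if row["order_type"] == "limit" else [None]


# The model: each variable is paired with a function that returns its
# allowed values given the variables that were assigned before it.
MODEL = [
    ("account_type", account_types),
    ("order_type", order_types),
    ("limit_price", limit_prices),
]


def traverse(model, row=None):
    """Depth-first walk of the model, yielding every consistent record."""
    row = row or {}
    if len(row) == len(model):
        yield dict(row)
        return
    name, allowed_values = model[len(row)]
    for value in allowed_values(row):
        yield from traverse(model, {**row, name: value})


if __name__ == "__main__":
    for record in traverse(MODEL):
        print(record)
```

Because the traversal is a generator, records are produced one at a time, which matters when the model expands to a very large number of combinations.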

Website: http://finraos.github.io/DataGenerator/

Databricks Labs Data Generator (dbldatagen)

Databricks Labs Data Generator (dbldatagen) is an open source Python library for generating synthetic data within the Databricks environment using Spark. The generated data may be used for testing, benchmarking, demos, and many other uses. It operates by defining a data generation specification in code that controls how the synthetic data is generated. The specification may incorporate the use of existing schemas or create data in an ad-hoc fashion. Dbldatagen has no dependencies on any libraries that are not already installed in the Databricks runtime, and you can use it from Scala, R or other languages by defining a view over the generated data.

Website: https://github.com/databrickslabs/dbldatagen

Faker

Faker is a pure open source Elixir library for generating fake data.

Website: https://github.com/elixirs/faker

generatedata

generatedata is an open source script that is essentially an engine to generate any sort of random data in any format. It currently comes with 30 or so Data Types (types of data it generates), 8 Export Types (formats for the data), plus around 30 data sets for specific countries (city names, regions, etc.). More importantly, it can be extended in any way you want. If you need to generate random data programmatically rather than manually via the UI, you can use the REST API.

Website: https://github.com/benkeen/generatedata

Gofakeit

Gofakeit is an open source random fake data generator written in Go. Gofakeit offers extensive features, including random data generation across various types. It also provides customizable options for adherence to specific formats, support for localization, and realistic time and date generation. Gofakeit can generate random data for struct fields. For the most part, it covers all the basic types as well as some non-basic ones such as time.Time. Struct fields can also use tags to generate data more specifically for that field type.
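
The tag-driven approach can be illustrated with a small Python analogue that uses dataclass field metadata in place of Go struct tags. This is a sketch of the concept only, not Gofakeit code: the "fake" metadata key, the tag names and the fake function are assumptions made for the example.

```python
import random
import string
from dataclasses import dataclass, field, fields

# Hypothetical tag names mapped to generator functions.
TAG_GENERATORS = {
    "first_name": lambda rng: rng.choice(["Ana", "Ben", "Chloe", "Dev"]),
    "city": lambda rng: rng.choice(["Lima", "Oslo", "Pune", "Turin"]),
}

# Fallback generators, keyed by type name, for fields without a tag.
TYPE_GENERATORS = {
    "str": lambda rng: "".join(rng.choices(string.ascii_lowercase, k=8)),
    "int": lambda rng: rng.randint(0, 100),
    "bool": lambda rng: rng.random() < 0.5,
}


@dataclass
class Customer:
    name: str = field(metadata={"fake": "first_name"})
    city: str = field(metadata={"fake": "city"})
    nickname: str  # untagged: filled according to its type
    age: int
    newsletter: bool


def fake(cls, rng):
    """Build an instance of dataclass `cls` with generated field values."""
    values = {}
    for f in fields(cls):
        tag = f.metadata.get("fake")
        if tag:
            generator = TAG_GENERATORS[tag]
        else:
            # f.type is a class, or a string when annotations are postponed.
            type_name = f.type if isinstance(f.type, str) else f.type.__name__
            generator = TYPE_GENERATORS[type_name]
        values[f.name] = generator(rng)
    return cls(**values)


if __name__ == "__main__":
    print(fake(Customer, random.Random(7)))
```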

Website: https://github.com/brianvoe/gofakeit

jFairy

jFairy is an open source Java fake data generator. jFairy allows you to build data sets containing diverse types of data including names, addresses, telephone numbers, dates, large integers, usernames, email addresses, and more. You can try it online at https://devskiller.com/datafairy/

Website: https://github.com/Devskiller/jfairy

Mimesis

Mimesis is a powerful open source data generator for Python that can produce a wide range of fake data in multiple languages. This tool is useful for populating testing databases, creating fake API endpoints, generating custom structures in JSON and XML files, and anonymizing production data, among other things. With Mimesis, developers can obtain realistic, randomized data easily to facilitate development and testing.

Websites: https://github.com/lk-geimfari/mimesis, https://mimesis.name/

MockNeat

MockNeat is an open source library written in Java for generating arbitrary data. It provides a simple but powerful fluent API that enables developers to create JSON, XML, CSV and SQL data programmatically. It can also act as a powerful substitute for Random or as a mocking library.

Figure: MockNeat random data generation example.

Websites: https://github.com/nomemory/mockneat, https://www.mockneat.com/

MySQL Random Data Generator

MySQL Random Data Generator is a simple tool for generating random test data in MySQL. Load its procedure and execute it to automatically detect the column types and fill the table with data.

Website: https://github.com/kedarvj/mysql-random-data-generator

pydbgen

pydbgen is an open source Python package that generates random dataframes and database tables. It produces a random database table (or a Pandas dataframe, or an Excel file) based on the user’s choice of data types (database fields). The user can specify the number of samples needed and can also designate a “PRIMARY KEY” for the database table. Finally, the table is inserted into a new or existing database file of the user’s choice.

Website: https://github.com/tirthajyoti/pydbgen

Spawner

Spawner is a generator of sample and test data for databases. It can be configured to output delimited text or SQL insert statements. It can also insert directly into a MySQL database. It includes many field types, most of which are configurable. Spawner works on Linux and Windows systems.

Figure. Spawner generation screen. Source: http://spawner.sourceforge.net/

Website: http://spawner.sourceforge.net/

SQLfuzz

SQLfuzz is an open source software testing tool that loads random data into SQL tables. The tool can read the layout of a SQL table and fill it with random data.
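
The "read the layout, then fill the table" approach can be sketched with Python's built-in sqlite3 module. This is not SQLfuzz code: SQLite stands in for the target database, and the products table and the function names are hypothetical.

```python
import random
import sqlite3
import string


def random_value(declared_type, rng):
    """Pick a random value that suits the column's declared SQL type."""
    declared_type = declared_type.upper()
    if "INT" in declared_type:
        return rng.randint(0, 1_000_000)
    if any(token in declared_type for token in ("REAL", "FLOA", "DOUB")):
        return round(rng.uniform(0, 1000), 2)
    return "".join(rng.choices(string.ascii_lowercase, k=12))


def fill_table(conn, table, rows, seed=0):
    """Inspect `table` and insert `rows` rows of random data into it."""
    rng = random.Random(seed)
    # PRAGMA table_info returns one row per column:
    # (cid, name, type, notnull, default_value, pk).
    columns = conn.execute(f'PRAGMA table_info("{table}")').fetchall()
    # Skip an INTEGER PRIMARY KEY column so that SQLite assigns the ids itself.
    targets = [
        (col[1], col[2])
        for col in columns
        if not (col[5] and col[2].upper() == "INTEGER")
    ]
    names = ", ".join(f'"{name}"' for name, _ in targets)
    marks = ", ".join("?" for _ in targets)
    data = [
        tuple(random_value(col_type, rng) for _, col_type in targets)
        for _ in range(rows)
    ]
    conn.executemany(f'INSERT INTO "{table}" ({names}) VALUES ({marks})', data)
    conn.commit()


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE products "
        "(id INTEGER PRIMARY KEY, name TEXT, price REAL, stock INTEGER)"
    )
    fill_table(conn, "products", rows=500)
    print(conn.execute("SELECT * FROM products LIMIT 3").fetchall())
```

The sketch ignores constraints such as foreign keys and unique indexes, which a real tool has to take into account.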

Website: https://github.com/PumpkinSeed/sqlfuzz

Synth

Synth is an open source tool for generating realistic data using a declarative data model. Synth is database agnostic and can scale to millions of rows of data. It provides a robust, flexible framework for specifying constraint-based data generation, and the data model can be version controlled in git, peer reviewed, and automated.
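
To give a feel for what "declarative" means here, the following Python sketch describes the data as a plain dictionary and lets a small interpreter generate the rows. It does not use Synth's actual schema format: the field types, their parameters and the generate function are assumptions made for the example.

```python
import random

# Hypothetical declarative model: what the data looks like, not how to build it.
SCHEMA = {
    "id": {"type": "sequence", "start": 1},
    "age": {"type": "integer", "min": 18, "max": 90},
    "plan": {"type": "choice", "values": ["free", "pro", "enterprise"]},
    "active": {"type": "boolean", "probability": 0.8},
}


def generate(schema, count, seed=0):
    """Interpret `schema` and return `count` rows that satisfy its constraints."""
    rng = random.Random(seed)
    rows = []
    for index in range(count):
        row = {}
        for name, spec in schema.items():
            kind = spec["type"]
            if kind == "sequence":
                row[name] = spec["start"] + index
            elif kind == "integer":
                row[name] = rng.randint(spec["min"], spec["max"])
            elif kind == "choice":
                row[name] = rng.choice(spec["values"])
            elif kind == "boolean":
                row[name] = rng.random() < spec["probability"]
            else:
                raise ValueError(f"unknown field type: {kind}")
        rows.append(row)
    return rows


if __name__ == "__main__":
    for row in generate(SCHEMA, 5):
        print(row)
```

Since the model is plain data, it can be stored in a file, reviewed in a pull request and reused across test runs without touching the generator code.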

Website: https://github.com/getsynth/synth

test-data-generator

test-data-generator is a simple open source Java tool for generating data that can be used with Maven. It supports many kinds of values such as emails, countries or names. You can produce output in different formats like CSV, TSV or SQL. You can also inject the generated test data directly into a database using a JDBC connection.

Website: https://github.com/presidentio/test-data-generator
