Fake Data – The Secret of Great Testing

by Robert Hall

Issue: Vol 2, Issue 1 - All Stuff, No Fluff

published in June 2010

I live in Charlotte, North Carolina, with my lovely wife, 2 kids, 2 dogs, 3 cats and 2 fish. My programming career started in 1989, when I was 19. Through the years I have used a wide variety of languages and stacks. I've done application architecture, system and infrastructure design. Now I'm working on data warehouse and business intelligence projects.

I believe that software development, configuration and deployments don't have to be nearly as hard as IT makes them. This has led me to create my own SDLC based on a movie production schedule rather than an engineering practice.

I discovered Ruby on Rails in 2006 and immediately saw its potential to change the entire web application landscape. Most of my RoR projects in some way refine the art of convention over configuration and move RAD toward being better, faster and less expensive, with fewer defects.

You can reach me at golsombe /at/ gmail /dot/ com or follow me on Twitter as golsombe.

Introduction

Whether your application begins with TDD (Test Driven Design) or BDD (Behavior Driven Design), or you choose to go old-school with tried and true unit testing, and regardless of your testing framework of choice, such as shoulda or rspec, the secret of great testing is great test data. QA (Quality Assurance) and users also benefit greatly from both the quality and the quantity of test records. Yet for all of the testing frameworks available, most application teams create only a small number of complete test records. The reason is obvious: creating fake test data is labor-intensive, error-prone, generally sucky and unappreciated work.

Existing issues

There are a number of good field-level data faking GEMs available to the Rails community, such as Benjamin Curtis's Faker, Mike Subelsky's Random_Data and Sevenwire's Forgery. While these tools solve their original domain problem, two main issues arise when trying to create large sets of complete test cases. The first is that each application, model and association requires a hand-rolled solution. The second is that, on their own, faked fields are unaware of other dependent fields, such as date ranges or composite email addresses. And of course, none of these solutions addresses associated models without custom methods.
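To make the second issue concrete, here is a minimal plain-Ruby sketch of a hand-rolled record whose fields depend on each other. It is not taken from any of the gems above; the name lists, the fake_subscription method and the example.com domain are all assumptions made for illustration.

```ruby
require 'date'

# Hypothetical sample values, used only for this illustration.
FIRST_NAMES = %w[Ada Grace Linus Margaret].freeze
LAST_NAMES  = %w[Lovelace Hopper Torvalds Hamilton].freeze

# Build one record whose fields are aware of each other.
def fake_subscription(rng = Random.new)
  first = FIRST_NAMES.sample(random: rng)
  last  = LAST_NAMES.sample(random: rng)
  starts_on = Date.new(2010, 1, 1) + rng.rand(365)
  {
    name:      "#{first} #{last}",
    # The email is composed from the name instead of being faked on its own.
    email:     "#{first}.#{last}@example.com".downcase,
    starts_on: starts_on,
    # The end date always falls 1 to 90 days after the start date.
    ends_on:   starts_on + 1 + rng.rand(90)
  }
end

puts fake_subscription.inspect
```

Every model with this kind of dependency needs its own version of such a method, which is exactly the hand-rolled work described above.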

Enter Imposter

Imposter is a new concept in data faking. Imposter addresses the entire schema as a 3rd normal form entity. A generator writes YAML DSL files in which each field value is randomly approximated from the field's data type. By default every field in a model is covered. Developers can modify this to generate fake data only for the fields required in test cases. Imposter is similar in concept to Rails migrations and fixtures, as the custom rake task executes each imposter in sequential order to build .csv (comma separated value) files. CSV files are more efficient, and they are useful not only for Rails implementations but also for other DBMSs (database management systems) requiring loadable datasets, such as ETL (Extract, Transform and Load) tasks or alternate data stores.

Enough theory, let's look at some real-world examples.

Nuts and bolts

Imposter was tested on Ubuntu 9.10 with Rails 2.3.5. It is hosted at Gemcutter, so if you have rubygems > 1.3.4 your gem sources will automatically find the Imposter gem. Otherwise you'll need to get the gemcutter gem and tumble the data source:

> gem install gemcutter

> gem tumble

First we need to install imposter.

Imposter will automatically install Faker, FasterCSV and SQLite3 gems. SQLite3 & libsqlite3-dev packages are required.

user@xbuntu-laptop:~$ sudo gem install imposter

Building native extensions. This could take a while...

Successfully installed sqlite3-ruby-1.2.5

Successfully installed faker-0.3.1

Successfully installed fastercsv-1.5.0

Successfully installed imposter-0.1.4

4 gems installed

Next we'll create a new Rails application and add some scaffolds.

rails -d mysql order-tracking

cd order-tracking

Modify db connection as necessary in config/database.yml

rake db:create #creates the development database

ruby script/generate scaffold customer name:string \
  address1:string address2:string city:string \
  state:string postal:string primary_phone:string \
  secondary_phone:string email_address:string \
  website:string

rake db:migrate #creates the customer table in the development database

ruby script/generate imposter

# creates a test/imposter/000_customer.yml file

Let's take a look at the default structure of the Customer Imposter file.

---
customer:
  quantity: 10
  fields:
    id: i.to_s
    name: Imposter::Animal.one
    address1: Imposter::Noun.multiple
    address2: Imposter::Animal.one
    city: Imposter::Noun.multiple
    state: Imposter::Vegtable.multiple
    postal: Imposter::Noun.multiple
    contact: Imposter::Animal.one
    website: Imposter::Animal.one
    email_address: Imposter::Noun.multiple
    primary_phone: Imposter::Noun.multiple
    secondary_phone: Imposter::Animal.one

Each model is defined by its real name, and you can specify the quantity in each imposter. The default type assignments will work, but they are not very exciting. Each time you generate it, the values will be different. So let's modify the default to build some real fake data.

customer:
  quantity: 76
  fields:
    id: i.to_s
    name: (Imposter::Noun.one + ['_'] + Imposter::Verb.one).to_s.titleize
    address1: Imposter::Street.full
    address2: Imposter::Street.full
    city: Imposter::CSZ.get_rand['city']
    state: Imposter::CSZ.state
    postal: Imposter::CSZ.zip5
    contact: (Imposter::Animal.one + ['_'] + Imposter::Noun.one).to_s.titleize
    website: ('http://www.'.to_a + Faker::Internet.domain_name.to_a + '.com'.to_a).to_s.downcase
    email_address: Faker::Internet.email.to_a
    primary_phone: Imposter::Phone.number("(###)\s###-####")
    secondary_phone: Imposter::Phone.number

rake imposter:load

# will generate the .csv files

# based on the parameters in each imposter yaml file

rake db:fixtures:load

# will load data into the individual tables
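The first rake task turns each imposter file into rows of CSV. As a rough sketch of that idea only, and not Imposter's actual implementation, the following plain Ruby evaluates the field expressions of a YAML spec once per row. The spec contents, the rows_to_csv name and the use of eval are all assumptions made for illustration.

```ruby
require 'yaml'
require 'csv'

# Hypothetical spec in the same shape as an imposter file.
SPEC = <<~YAML
  customer:
    quantity: 3
    fields:
      id: i.to_s
      name: "%w[Acme Globex Initech].sample + ' Corp'"
      state: "%w[NC TX CO].sample"
YAML

# Evaluate each field expression once per row and return the CSV text.
# eval is acceptable here only because the spec is a trusted project file.
def rows_to_csv(spec_yaml)
  _model, spec = YAML.safe_load(spec_yaml).first
  fields = spec.fetch('fields')
  CSV.generate do |csv|
    csv << fields.keys                      # header row of column names
    (1..spec.fetch('quantity')).each do |i|
      row_binding = binding                 # exposes the row counter i to the expressions
      csv << fields.values.map { |expr| eval(expr, row_binding).to_s }
    end
  end
end

puts rows_to_csv(SPEC)
```

Because each expression is ordinary Ruby, a field can call any faking class that is loaded, which is what makes the YAML file work as a small DSL.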

Tools of the trade

Imposter has several specialized data faking classes. One of the most useful is Imposter::CSZ. Other data fakers can make random cities, states and zip codes, but the values are neither real nor associated with one another. Imposter's data was taken from USPS sources and is associated, with one record for every zip code in the US. In the example above, Imposter::CSZ.get_rand['city'] selects a random zip code from somewhere in the US and returns the city for that zip. The selected record is now sticky: Imposter::CSZ.state returns the state associated with the previously selected random record. This address data is suitable for applications that use GEO or mapping APIs.
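The sticky behavior can be pictured with a small plain-Ruby sketch. This is a guess at the idea rather than Imposter's source: the StickyCSZ class and its three-row table are hypothetical, and only the method names get_rand, state and zip5 mirror the calls used in the imposter file above.

```ruby
# Tiny illustrative table; the real data is described as covering every US zip code.
ZIP_TABLE = [
  { 'zip' => '28202', 'city' => 'Charlotte', 'state' => 'NC' },
  { 'zip' => '78701', 'city' => 'Austin',    'state' => 'TX' },
  { 'zip' => '80202', 'city' => 'Denver',    'state' => 'CO' }
].freeze

# Hypothetical stand-in for a sticky lookup: get_rand picks a row and
# remembers it, so later calls describe the same place.
class StickyCSZ
  def initialize(table)
    @table = table
    @current = nil
  end

  # Pick a new random row and make it the current one.
  def get_rand
    @current = @table.sample
  end

  def state
    current['state']
  end

  def zip5
    current['zip']
  end

  private

  # Fall back to a random row if nothing has been selected yet.
  def current
    @current ||= @table.sample
  end
end

csz = StickyCSZ.new(ZIP_TABLE)
city = csz.get_rand['city']
puts "#{city}, #{csz.state} #{csz.zip5}"
```

The order of the fields matters in this design: the city field selects the record, and the state and postal fields that follow only read from it.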

Some common fake data constructs are:

- Random Inplace list:

%w[est cst mst pst].shuffle[0,1].to_a

- Date in the future:

(Date.today+3).to_s

- Arbitrary Dimension W by H:

Imposter.numerify("##").to_a +

"x".to_a + Imposter.numerify("##cm").to_a

- Random number plus string:

((1+rand(6)).to_s + " PM EST").to_s
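The constructs above lean on String#to_a, which exists in Ruby 1.8 but was removed in Ruby 1.9. For readers on a newer Ruby, here is a hedged sketch of roughly equivalent expressions in plain Ruby. The numerify helper is a hypothetical stand-in written for this example, not Imposter.numerify itself, and the sample_constructs method name is likewise an assumption.

```ruby
require 'date'

# Hypothetical helper: replaces every '#' in the pattern with a random digit.
def numerify(pattern)
  pattern.gsub('#') { rand(10).to_s }
end

# Return one freshly generated value for each construct listed above.
def sample_constructs
  {
    time_zone:  %w[est cst mst pst].sample,                # random pick from an in-place list
    due_date:   (Date.today + 3).to_s,                     # date three days in the future
    dimension:  "#{numerify('##')}x#{numerify('##cm')}",   # arbitrary W by H
    start_time: "#{1 + rand(6)} PM EST"                    # random number plus string
  }
end

sample_constructs.each { |field, value| puts "#{field}: #{value}" }
```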

See Imposter's documentation for a complete class and method list and be sure to look at each imposter.yml for more examples.

Conclusion

Testing methods keep getting better, entire development architectures are devoted to testing as an integrated step in the software development process, and users are becoming more and more involved in the success of custom application development. Together these create an ever-growing need for test records of both sufficient quality and sufficient quantity, so that all projects are well tested and integrated. I encourage you to download the Imposter GEM and try creating fake data for your schema.

Resources

Homepage
GEM
Source
Sample project code