I live in Charlotte, North Carolina with my lovely wife, 2 kids, 2 dogs, 3 cats and 2 fish. My programming career started in 1989 when I was 19. Through the years I have used a wide variety of languages and stacks. I've done application architecture, system and infrastructure design. Now I'm working on data warehouse and business intelligence projects. I believe that software development, configuration and deployments don't have to be nearly as hard as IT makes them. This has led me to create my own SDLC based on a movie production schedule rather than an engineering practice. I discovered Ruby on Rails in 2006 and immediately saw its potential to change the entire web application landscape. Most of my RoR projects in some way refine the art of convention over configuration and move RAD to be better, faster and less expensive, with fewer defects. You can reach me at golsombe /at/ gmail /dot/ com or follow me on Twitter as golsombe.
Introduction
Whether your application begins with TDD (Test Driven Design) or BDD (Behavior Driven Design), or you go old-school with tried and true unit testing, and regardless of your testing framework of choice, such as shoulda or rspec, the secret of great testing is great test data. QA (Quality Assurance) and users also benefit greatly from both the quality and the quantity of test records. Yet for all of the testing frameworks available, most application teams create only a small number of complete test records. The reason is obvious: creating fake test data is labor-intensive, error-prone, generally sucky and unappreciated work.
Existing issues
There are a number of good field-level data faking GEMs available to the Rails community, like Benjamin Curtis's FAKER, Mike Subelsky's Random_Data and Sevenwire's Forgery. While these tools solve their original domain problem, two main issues arise when you try to create large sets of complete test cases. The first is that each application, model and association requires a hand-rolled solution. The second is that, on their own, faked fields are unaware of other dependent fields, such as date ranges or composite email addresses. And none of these solutions addresses associated models without custom methods.
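To make the second issue concrete, here is a minimal sketch of a record whose fields depend on each other. The fake_subscription helper, its name lists and its field names are hypothetical and belong to none of the gems mentioned; it uses only core Ruby.

```ruby
require 'date'

# Hypothetical helper for illustration; not part of Faker, Random_Data,
# Forgery or Imposter.
FIRST_NAMES = %w[Ada Grace Linus].freeze
LAST_NAMES  = %w[Lovelace Hopper Torvalds].freeze

def fake_subscription(rng = Random.new)
  first = FIRST_NAMES.sample(random: rng)
  last  = LAST_NAMES.sample(random: rng)

  # The email is composed from the name fields instead of being faked alone.
  email = "#{first}.#{last}@example.com".downcase

  # The end date is derived from the start date, so the range is always valid.
  starts_on = Date.new(2010, 1, 1) + rng.rand(365)
  ends_on   = starts_on + 1 + rng.rand(90)

  { first_name: first, last_name: last, email: email,
    starts_on: starts_on, ends_on: ends_on }
end

puts fake_subscription.inspect
```

Faking each of these fields independently would produce email addresses that do not match the name and date ranges that end before they start, which is exactly the hand-rolled glue every application ends up writing.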
Enter Imposter
Imposter is a new concept in data faking. It addresses the entire schema as a third normal form entity. A generator writes YAML DSL files in which each field value is randomly approximated based on its data type. By default every field in a model is covered; developers can trim this to generate fake data only for the fields required in test cases. Imposter is similar in concept to Rails migrations and fixtures: a custom rake task executes each imposter in sequential order to build .csv (comma-separated values) files. CSV files are efficient and are useful not only for Rails but also for anything else that needs loadable datasets, such as other DBMSs (database management systems), ETL (Extract, Transform and Load) tasks or alternate data stores.
Enough theory, let's look at some real-world examples.
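The idea of turning a YAML file of field expressions into CSV rows can be sketched in a few lines of plain Ruby. This is not Imposter's implementation: the build_csv method, the SPEC constant and the sample fields are assumptions made for illustration, and the expressions use only core Ruby so the sketch runs without any gems beyond the standard yaml and csv libraries.

```ruby
require 'yaml'
require 'csv'

# Assumption: a spec shaped like an imposter file, where every field value is
# a Ruby expression that may refer to the row counter `i`.
SPEC = YAML.load(<<~YML)
  customer:
    quantity: 3
    fields:
      id: i.to_s
      name: "'Customer ' + i.to_s"
      code: "format('C%03d', i)"
YML

# Evaluate each field expression once per row and return CSV text with a
# header line. eval is acceptable only because the spec is a trusted file.
def build_csv(model_spec)
  fields = model_spec['fields']
  CSV.generate do |csv|
    csv << fields.keys
    (1..model_spec['quantity']).each do |i|
      csv << fields.values.map { |expr| eval(expr, binding) }
    end
  end
end

puts build_csv(SPEC['customer'])
```

Keeping the expressions in data rather than in code is what lets one generic task serve every model in the schema.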
Nuts and bolts
Imposter was tested on Ubuntu 9.10 with Rails 2.3.5. It is hosted at Gemcutter, so if you have rubygems > 1.3.4 your gem sources will automatically find the Imposter gem. Otherwise you'll need to get the gemcutter gem and tumble the gem source:
> gem install gemcutter
> gem tumble
First we need to install imposter.
Imposter will automatically install Faker, FasterCSV and SQLite3 gems. SQLite3 & libsqlite3-dev packages are required.
user@xbuntu-laptop:~$ sudo gem install imposter
Building native extensions. This could take a while...
Successfully installed sqlite3-ruby-1.2.5
Successfully installed faker-0.3.1
Successfully installed fastercsv-1.5.0
Successfully installed imposter-0.1.4
4 gems installed
Next we'll create a new Rails application and add some scaffolds.
rails -d mysql order-tracking
cd order-tracking
Modify db connection as necessary in config/database.yml
rake db:create #creates the development database
ruby script/generate scaffold customer name:string address1:string address2:string city:string \
  state:string postal:string primary_phone:string \
  secondary_phone:string email_address:string \
  website:string
rake db:migrate
# creates the customer table in
# the development database
ruby script/generate imposter
# creates a test/imposter/000_customer.yml file
Let's take a look at the default structure of the Customer Imposter file.
---
customer:
  quantity: 10
  fields:
    id: i.to_s
    name: Imposter::Animal.one
    address1: Imposter::Noun.multiple
    address2: Imposter::Animal.one
    city: Imposter::Noun.multiple
    state: Imposter::Vegtable.multiple
    postal: Imposter::Noun.multiple
    contact: Imposter::Animal.one
    website: Imposter::Animal.one
    email_address: Imposter::Noun.multiple
    primary_phone: Imposter::Noun.multiple
    secondary_phone: Imposter::Animal.one
Each model is defined by its real name, and you can specify the quantity in each imposter. The default type assignments will work, but they are not very exciting, and they will be different each time you run the generator. So let's modify the defaults to build some real fake data.
customer:
  quantity: 76
  fields:
    id: i.to_s
    name: (Imposter::Noun.one + ['_'] + Imposter::Verb.one).to_s.titleize
    address1: Imposter::Street.full
    address2: Imposter::Street.full
    city: Imposter::CSZ.get_rand['city']
    state: Imposter::CSZ.state
    postal: Imposter::CSZ.zip5
    contact: (Imposter::Animal.one + ['_'] + Imposter::Noun.one).to_s.titleize
    website: ('http://www.'.to_a + Faker::Internet.domain_name.to_a + '.com'.to_a).to_s.downcase
    email_address: Faker::Internet.email.to_a
    primary_phone: Imposter::Phone.number("(###)\s###-####")
    secondary_phone: Imposter::Phone.number
rake imposter:load
# will generate the .csv files
# based on the parameters in each imposter yaml file
rake db:fixtures:load
# will load data into the individual tables
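A note on the field expressions above: they lean on two Ruby 1.8 behaviours, String#to_a wrapping a string in an array and Array#to_s concatenating the elements. Both changed in Ruby 1.9. The sketch below shows what the name and website expressions compute, written for current Ruby. The titleize method is a simplified stand-in for the ActiveSupport one, the sample words and domain are made up, and the assumption that the fakers return one-element arrays is inferred from the `+ ['_']` in the imposter file.

```ruby
# Simplified stand-in for ActiveSupport's String#titleize.
def titleize(text)
  text.tr('_', ' ').split.map(&:capitalize).join(' ')
end

# Assumption: the fakers return one-element arrays, as `+ ['_']` suggests.
noun = ['anchor']
verb = ['measure']

# Ruby 1.8 joined arrays in Array#to_s; current Ruby needs an explicit join.
name = titleize((noun + ['_'] + verb).join)

# The chain of String#to_a calls becomes plain string interpolation.
domain  = 'example.org'
website = "http://www.#{domain}".downcase

puts name
puts website
```

The underscore exists only to give titleize a word boundary, so each generated name comes out as two capitalized words.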
Tools of the trade
Imposter has several specialized data faking classes. One of the most useful is Imposter::CSZ. Other data fakers can make random cities, states and zip codes, but the values are neither real nor associated with each other. Imposter's data was taken from USPS sources and the fields are associated, with one record for every zip code in the US. In the above example, Imposter::CSZ.get_rand['city'] selects a random zip code from somewhere in the US and returns the city for that zip. From then on the record is sticky: Imposter::CSZ.state returns the state associated with the previously selected random record. This address data is suitable for applications that use geo or mapping APIs.
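The sticky behaviour can be pictured with a small sketch. StickyCSZ is a hypothetical class written for this article's example, not Imposter::CSZ itself, whose internals may differ, and the three rows are a tiny hand-picked sample rather than the full dataset.

```ruby
# Hypothetical class for illustration; Imposter::CSZ's internals may differ.
class StickyCSZ
  # A tiny sample; the real data has one row per US zip code.
  ROWS = [
    { 'city' => 'Charlotte', 'state' => 'NC', 'zip5' => '28202' },
    { 'city' => 'Austin',    'state' => 'TX', 'zip5' => '78701' },
    { 'city' => 'Denver',    'state' => 'CO', 'zip5' => '80202' }
  ].freeze

  class << self
    # Pick a random row and remember it for the follow-up calls.
    def get_rand
      @current = ROWS.sample
    end

    # Both readers reuse the remembered row, so the values stay consistent.
    def state
      current['state']
    end

    def zip5
      current['zip5']
    end

    private

    # Fall back to a fresh random row if nothing has been selected yet.
    def current
      @current || get_rand
    end
  end
end

city  = StickyCSZ.get_rand['city']
state = StickyCSZ.state
zip   = StickyCSZ.zip5
puts "#{city}, #{state} #{zip}"
```

Because state and zip5 read from the remembered row, the order of the fields in the imposter file matters: the get_rand call has to come first.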
Some common fake data constructs are:
- Random pick from an in-place list:
%w[est cst mst pst].shuffle[0,1].to_a
- Date in the future:
(Date.today+3).to_s
- Arbitrary Dimension W by H:
Imposter.numerify("##").to_a +
"x".to_a + Imposter.numerify("##cm").to_a
- Random number plus string:
((1+rand(6)).to_s + " PM EST").to_s
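For readers on a current Ruby, the same four constructs can be sketched without the Ruby 1.8 to_a idiom. The numerify method below is a hypothetical stand-in that assumes Imposter.numerify, like Faker's helper of the same name, replaces each # with a random digit.

```ruby
require 'date'

# Hypothetical stand-in for Imposter.numerify: each '#' becomes a random digit.
def numerify(pattern)
  pattern.gsub('#') { rand(10).to_s }
end

time_zone = %w[est cst mst pst].sample                # one random list entry
due_date  = (Date.today + 3).to_s                     # three days from today
dimension = "#{numerify('##')}x#{numerify('##cm')}"   # width x height in cm
show_time = "#{1 + rand(6)} PM EST"                   # hour between 1 and 6

puts time_zone, due_date, dimension, show_time
```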
See Imposter's documentation for a complete class and method list and be sure to look at each imposter.yml for more examples.
Conclusion
Testing methods keep improving, entire development architectures are devoted to testing as an integrated step in the software development process, and users are more and more involved in the success of custom application development. As a result, there is an ever-growing need to produce test records of both sufficient quality and sufficient quantity, so that all projects are well tested and integrated. I encourage you to download the Imposter GEM and try creating fake data for your schema.