Saturday, April 25, 2009

Creating a MySQL reference database in 3 easy steps

What is a reference database?

A reference database is nothing more than a set of data that developers and test systems use to mirror what production data looks like without actually exposing sensitive data from the production system. This is critical for systems that handle HIPAA and PCI DSS data, because developers need to know how production looks and performs, but they aren't allowed to see protected data like credit card numbers or medical records belonging to specific individuals.

There are a number of different goals a reference database typically covers:
  • Provide a snapshot of what developers need in order to be effective when maintaining most of the production code (i.e. the DDL plus realistic data).
  • Provide a way to test edge cases without affecting production.
  • Provide a data set that realistically models production for performance testing of queries.
How do I set up a basic reference database?

So, the process of creating the reference database is pretty simple using replication. I'll assume you're already using replication to provide a backup of your master data set (at minimum). I'm also going to assume you're on MySQL 5 or newer.
  1. Create a replication slave that receives all data from your production master.
  2. Create triggers and corresponding stored routines on this slave that obfuscate sensitive data (a sketch follows this list).
  3. Use a cron or similar automated job to do a mysqldump of the resulting data set from time to time so that developers and testers can download it to their environments (also sketched below).
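For step 2, here's a minimal sketch of what the obfuscation might look like. The customers table, its card_number column, and the mask_card() helper are hypothetical stand-ins for your own schema. Also note that slave-side triggers fire for replicated statements under statement-based replication; under row-based replication they don't, so in that case you'd obfuscate in a post-processing pass instead.

    DELIMITER //

    -- Hypothetical helper: returns a random but format-preserving
    -- 16-digit dummy card number.
    CREATE FUNCTION mask_card() RETURNS CHAR(16) NOT DETERMINISTIC NO SQL
    BEGIN
      RETURN CONCAT('4000', LPAD(FLOOR(RAND() * 1000000000000), 12, '0'));
    END//

    -- Scrub card numbers as rows arrive from the master. A matching
    -- BEFORE UPDATE trigger is needed to catch changes to existing rows.
    CREATE TRIGGER obfuscate_card BEFORE INSERT ON customers
    FOR EACH ROW
    BEGIN
      SET NEW.card_number = mask_card();
    END//

    DELIMITER ;

For step 3, the automated dump can be as simple as a cron entry like the following; the schedule, credentials, and paths are all assumptions to adapt to your environment.

    # Dump the obfuscated slave nightly so developers can download it.
    0 3 * * * mysqldump -u dump_user -pSECRET --single-transaction refdb > /var/dumps/refdb.sql
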
This accomplishes the goal of providing a production database snapshot in isolation without exposing sensitive data to those who aren't authorized to see it. It also provides a basis for testing the performance of the database, as well as a method for verifying edge cases.

How do I cover edge cases?

The problem is, this method probably won't cover edge cases by itself, but as I said above, it does provide the basis for them. To cover edge cases that the production data doesn't, you'll need to write a script that updates a deployed copy of the reference database. What happens if certain data is missing? What if it's duplicated inappropriately? How does the application react when a transaction table is full? Does it automatically add another transaction table to the MERGE? You know your application and the edge cases it needs. Creating scripts to simulate these conditions should be relatively straightforward, as in the sketch below.
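
As a hypothetical example, a script covering the missing-data and duplicated-data cases against an orders table might be as small as this; the table and column names are stand-ins for your own schema.

    -- Simulate missing data: wipe out one customer's order history.
    DELETE FROM orders WHERE customer_id = 42;

    -- Simulate inappropriate duplication: re-insert one day's orders,
    -- letting the auto-increment column assign fresh ids.
    INSERT INTO orders (customer_id, amount, created_at)
    SELECT customer_id, amount, created_at
      FROM orders
     WHERE created_at >= '2009-04-01' AND created_at < '2009-04-02';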

Once you have a reference database and scripts to test performance and edge cases, it becomes terribly simple to add your database to your continuous integration system. How often to run those tests largely depends on what you can afford with the resources available to you. Typically, I want to run a minimum set of tests for every related code commit, then run the full set on a timed schedule so I don't overwhelm my build system.
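
As a sketch, the scheduled full run can be another cron entry; the script name, database, and timing here are assumptions, not a prescription.

    # Reload the latest reference dump into the CI database, then run
    # the full suite overnight while the build system is otherwise idle.
    0 2 * * * mysql -u ci_user -pSECRET ci_refdb < /var/dumps/refdb.sql && /opt/ci/run_full_db_tests.sh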

Has this helped you? Do you have suggestions or comments that might help someone else? Please feel free to add your comments here. :-)
