Test Data

At XDC 2019 my session was titled Xojo Design Mistakes (the alternate, but way longer, title was ‘Thankfully time travel doesn’t exist or my future self might travel back and murder my younger self for the stupid coding mistakes I’ve made’). These are things I’ve discovered over the years, both in my own projects and in other people’s projects, that are just plain wrong or less than ideal. This will be an ongoing series, since I had well over 50 slides and left 150 out. So when I get bored I’ll bring these topics up.

Nearly all of our consulting projects are database-driven applications. It’s why we’ve created tools to help with these projects, like ARGen, which simplifies our interactions with the database, and BKS Shorts, which is our own reporting tool. These tools are invaluable in getting our work done in a timely manner.

In a database application it’s typical to have a List of something. A common example is a Customers list. In that list the client typically wants the ability to Create, Read, Update, and Delete (CRUD) a customer, with varying rules behind it (like whether the user has permission to add or delete a customer).

During development we get the List form going and add the controls for creating a new record. Then we create the Add/Edit form that lets us test those capabilities. We create a few customers, update a few, delete a few, and move on. Maybe the client wants search capabilities, so we add that to the List window, and once we’ve tested it against our half dozen or so records we move on to the next task.

There is nothing wrong with this process. It works and it’s fairly efficient as far as it goes. However, there’s one thing we’ve skipped that’s really important but also difficult to achieve.

So far we’ve tested with *maybe* a dozen records. What happens when the client adds 10,000 or 100,000 Customer records? Does the List form take a long time to load? Does the search function take a long time? What about the Customer popup menus you’ve scattered throughout the project – are those now slow, unwieldy, and unusable?

Unfortunately, the way we implemented the project, we don’t know how any of this performs, since we only have a dozen records. That’s why it’s really important to have an adequate amount of test data. Creating 10,000 new customers by hand through your new interface would take a long time. So what can you do?

There are tools out there that will help generate data sets. These tools allow you to create thousands, even millions, of rows of realistic data. Randomized first names paired with randomized last names are a great way to generate customer names. Many tools let you add random dates in a range, random IP addresses, random values from a list you provide, and so on. The sky is the limit when it comes to the sort of data developers need.
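If you’d rather not buy anything, a throwaway loop in code can get you surprisingly far. Here’s a minimal sketch in Xojo (API 2.0); the Customers table, its columns, and the already-connected SQLiteDatabase named db are all assumptions for illustration:

```xojo
// A quick-and-dirty generator, not production code. Assumes db is an open
// SQLiteDatabase and a Customers table with FirstName/LastName columns exists.
Var firstNames() As String = Array("Alice", "Bob", "Carmen", "Dmitri", "Elena", "Fatima")
Var lastNames() As String = Array("Smith", "Nguyen", "Garcia", "Keeney", "Okafor", "Larsen")
Var rng As New Random

db.BeginTransaction // one big transaction keeps 10,000 inserts fast
For i As Integer = 1 To 10000
  Var f As String = firstNames(rng.InRange(0, firstNames.LastIndex))
  Var l As String = lastNames(rng.InRange(0, lastNames.LastIndex))
  db.ExecuteSQL("INSERT INTO Customers (FirstName, LastName) VALUES (?, ?)", f, l)
Next
db.CommitTransaction
```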

Now, when you do your testing, you see how your application reacts with a lot of data. I can almost guarantee that it will act differently. Do you need to switch to a data-on-demand listbox? Do you need to put an index on a commonly searched field to speed up queries? Do you need to implement Full Text Search in your database? Having a huge amount of data will answer these questions for you.
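As one example, if searching turns out to be the slow part, SQLite’s FTS5 module is worth experimenting with. This is only a sketch under assumptions: the table and column names are invented, db is an open SQLiteDatabase, and the bundled SQLite must include FTS5.

```xojo
// Sketch: an FTS5 virtual table for fast text search (hypothetical names;
// assumes db is an open SQLiteDatabase whose SQLite build includes FTS5).
db.ExecuteSQL("CREATE VIRTUAL TABLE CustomerSearch USING fts5(FirstName, LastName, City)")
db.ExecuteSQL("INSERT INTO CustomerSearch SELECT FirstName, LastName, City FROM Customers")

// MATCH queries stay fast even at hundreds of thousands of rows
Var rows As RowSet = db.SelectSQL("SELECT * FROM CustomerSearch WHERE CustomerSearch MATCH ?", "smith")
```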

I once worked on an accounting application in VB6 where the original database designer used an Access database and calculated account balances on the fly, iterating through bills, checks, journal entries, etc. With a few thousand rows of data in each table this process took a second or two for all balances on a local machine. When the database was accessed over the network it took 5 to 7 seconds. When we converted our first client database it took 30 to 40 seconds for EACH account! Obviously this was not acceptable performance for an accounting application meant to be used daily by general contractors with hundreds of employees and tens of thousands of customers. The solution was to store a current balance value and update it whenever a transaction occurred. We could have saved ourselves hundreds of hours of rushed development time (and much stress and heartache) if we had tested with large amounts of data much earlier in the process.
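For illustration, the stored-balance approach can be as small as this sketch (hypothetical schema; db is an open database connection): post the ledger row and bump the balance inside one transaction, so reading a balance later is a single-row lookup instead of a full scan.

```xojo
// Sketch of a maintained balance (hypothetical schema). Posting a
// transaction updates the stored balance atomically.
Var accountID As Integer = 42 // example values
Var amount As Double = 125.0

db.BeginTransaction
db.ExecuteSQL("INSERT INTO LedgerEntries (AccountID, Amount) VALUES (?, ?)", accountID, amount)
db.ExecuteSQL("UPDATE Accounts SET CurrentBalance = CurrentBalance + ? WHERE ID = ?", amount, accountID)
db.CommitTransaction

// Reading a balance is now one indexed row:
// SELECT CurrentBalance FROM Accounts WHERE ID = ?
```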

I mentioned adding an index to a field earlier. One word of caution on this: it’s tempting to add an index to every field you’re searching on. Don’t do this! Only add indexes to the most important fields in a table. For a customer, maybe the two most important fields are phone number and name, even though you also search on City and things like that. Indexes are extra work for the database to maintain, so insert and update performance can take a significant hit for every field you index.
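When you do decide a field earns an index, creating it is a one-liner; the index and table names below are just placeholders:

```xojo
// Index only the fields users actually search most (hypothetical names).
db.ExecuteSQL("CREATE INDEX idx_customers_phone ON Customers (Phone)")
db.ExecuteSQL("CREATE INDEX idx_customers_lastname ON Customers (LastName)")
```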

Since the tool I’ve been using to create test data is no longer being sold, I’m curious what you’d recommend. Do you have a favorite tool? Or is this a tool that would be of use to the community?

Happy Coding!

7 thoughts on “Test Data”

  1. the TL;DR answer is: it depends….

    basically I have found several (web) sites that have bulk data generators and I have taken their output to use. sometimes I have had to merge data from set A and set B to get what I need, using some one-off perl scripts to do the merging. Some of the data sets I have kept around so they can be used again. Sometimes it is all new data sets.

    the big question I have for you Bob & Company is: how do you do the ETL loads? ActiveRecord doesn’t have an ETL function. do you use a third party database management tool? or some home grown tool? or a method in the application you are writing?

    • I’m not sure what ETL means. Definition?

      I always use a 3rd party db management tool. Navicat is pretty good for doing large data sets. Not free but it’s consistent across multiple databases.

      The tool I have generates the SQL statements and then I just run it in the tool. Been thinking about creating my own for years. You know how it goes, if I do it myself it does exactly what I want it to do. The few that I’ve used have been okay but not great.

      The one thing that I’ve not found a good tool for is parent-child records. Example: invoices. I’d like to generate 10,000 invoices but also have 4 or 5 child InvoiceLineItems per invoice. The tool needs to be smart enough to get the Invoice ID and then use that as part of the Line Items insert. And then have all that in a transaction so it’s not slow as dirt.
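      Roughly what I have in mind, sketched in Xojo against SQLite (every name here is made up):

      ```xojo
      // Generate parent invoices plus 4 or 5 child line items each, all in
      // one transaction. Assumes db is an open SQLiteDatabase; names are made up.
      Var rng As New Random
      db.BeginTransaction
      For i As Integer = 1 To 10000
        Var custID As Integer = rng.InRange(1, 500) // random existing customer
        db.ExecuteSQL("INSERT INTO Invoices (CustomerID, InvoiceDate) VALUES (?, date('now'))", custID)
        Var invoiceID As Integer = db.LastRowID // SQLiteDatabase-specific
        For j As Integer = 1 To rng.InRange(4, 5)
          db.ExecuteSQL("INSERT INTO InvoiceLineItems (InvoiceID, Description, Amount) VALUES (?, ?, ?)", _
            invoiceID, "Line item " + j.ToString, 19.95)
        Next
      Next
      db.CommitTransaction
      ```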

      • I know that there is some Oracle & DB2 ETL software that can get the “ID” of an inserted record and use it in other inserts (the children, as you called them). They are commercial and expensive, targeting big enterprise shops.

  2. A SQL Query walks into a bar. In one corner of the bar are two tables. The Query walks up to the tables and asks:
    – Mind if I join you?

  3. ETL might be extract, transfer, load?
    If so, data migration is an entirely different subject and really depends on where it’s coming from (a well-normalized db, or something else like flat files, or even an older fixed-record database) and where it’s going (non-SQL db, properly normalized SQL db, etc.)
    The only time I’ve had to do this in a big way was moving an old accounting app to a new version, and it moved from a “cobol like” record-oriented flat file system to a relational database.
    At the time we wrote custom software to move the data.

    • ETL = Extract/Transform/Load.

      ETL software is used to do bulk loads of data into databases. Data Warehouse DBs have a daily(?) ETL load of the “new” data. OLTP DBs have an ETL load of baseline data. Test/Dev/QA DBs generally have an ETL load of data (of some sort).

      Things like NaviCat can do ETL via SQL commands, but it isn’t optimized for it. I have used NaviCat for loading data sets.
