Whether you’re a front-end developer displaying information on a webpage or a back-end developer doing heavy lifting with a database, good software testing is impossible without good testing data.
Testing data has become even more important in recent years as companies adopt test automation to keep up with the speed of software development, said Sanjay Sharma, CTO of financial services company SEI.
“Development cycles are no longer six months or nine months,” Sharma said. “Today, we are releasing code almost every day. In that kind of scenario, test data plays a significant role … Unless you’re 100 percent sure that your software is perfect, you’ll be impacting a very wide user community.”
What Is Testing Data?
Testing data is used not only to check code for bugs but also to evaluate other important aspects of an application, such as performance, scaling and user experience.
But what exactly characterizes “good” testing data? Developers say that, ideally, it should behave like production data without sharing the problems of production data, such as containing sensitive or personally identifiable information. Most companies rely on two main ways of getting testing data: creating fake data from scratch, or taking production data and stripping out all sensitive information.
While there are advantages and disadvantages to each approach, which method developers should use to get testing data also depends on their circumstances.
Personally Identifiable Information Can Be Sensitive
In many ways, deriving testing data from production data can be an intuitive and effective approach. After all, companies already have access to it, and nothing mimics the types of data that applications see in production better than existing production data.
But the approach comes with significant challenges. One of the most serious is the risk of exposing private user information to internal employees and hackers.
“[Using production data] is a last resort when you’re talking about data that identifies individuals and is therefore sensitive,” said Michael Rochlin, software developer at healthtech startup Chapter. “That’s particularly important in healthcare situations.”
Protecting personally identifiable information is important to the finance industry as well, Sharma said. Europe’s General Data Protection Regulation governs how companies, including those in finance, handle users’ data, much as the Health Insurance Portability and Accountability Act, or HIPAA, sets guidelines for how healthcare companies must protect users’ data and privacy. These considerations matter legally and ethically, and they also matter for customer relationships, because handing private information to a company is an act of trust.
But when production data doesn’t contain sensitive information, it’s often a good idea to use it directly for testing. At Intelligent Medical Objects, a company that streamlines data entry workflows for physicians, some teams don’t interact with personally identifiable information at all, said staff software engineer Rami Alshareef.
That’s because some Intelligent Medical Objects database tables only contain lists of medical terminology. Since those tables contain no private or patient information, that production data can be used directly to test the code and get accurate results on how the application will perform in production.
Cleaning Production Data Is a Multi-Step Process
Even when there are no security concerns, turning production data into usable testing data isn’t easy. Adam Kamor is head of engineering at Tonic, a company that helps developers with the process of turning production data into safe and effective testing data.
The first step is locating all the sensitive data in the production databases — a step that can be quite tricky and often gets overlooked.
“Knowing where your sensitive information is [is] actually a problem all by itself,” Kamor said.
Typically, companies may have hundreds or even thousands of database tables — the most Kamor has ever encountered at Tonic was a company that had 40,000 tables. Going through all that data and identifying the sensitive information is not necessarily a trivial task.
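One way to picture that first step is a script that samples each column and flags the ones whose values look sensitive. The sketch below is a minimal illustration and does not represent how Tonic or any other tool works. The two patterns, the `find_sensitive_columns` helper and the sample tables are all hypothetical, and a real scan would need far more detectors plus a review by a person.

```python
import re

# Hypothetical detectors; a real scan would need many more of these.
PATTERNS = {
    "ssn": re.compile(r"^\d{3}-\d{2}-\d{4}$"),
    "email": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
}


def find_sensitive_columns(tables, threshold=0.5):
    """Return (table, column, kind) for columns whose values mostly match a pattern."""
    findings = []
    for table_name, rows in tables.items():
        if not rows:
            continue
        # Use the first row's keys as the column list for this table.
        for column in rows[0]:
            values = [str(row[column]) for row in rows if row.get(column) is not None]
            if not values:
                continue
            for kind, pattern in PATTERNS.items():
                hits = sum(1 for value in values if pattern.match(value))
                if hits / len(values) >= threshold:
                    findings.append((table_name, column, kind))
    return findings


# Made-up rows standing in for samples pulled from a database.
sample_tables = {
    "customers": [
        {"id": 1, "tax_id": "123-45-6789", "contact": "a@example.com"},
        {"id": 2, "tax_id": "987-65-4321", "contact": "b@example.com"},
    ],
    "terms": [{"id": 1, "label": "hypertension"}],
}

print(find_sensitive_columns(sample_tables))
```

A scan like this only catches values that match a known shape, which is part of why the step is easy to underestimate. Free-text fields and unusual formats would slip through.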
It’s also difficult to keep track of data relationships within a database. Data often has links to other data across tables. When developers start anonymizing sensitive information, they have to make decisions about whether those relationships should be kept or broken. If relationships need to be preserved, it could make the resulting testing data less anonymous.
There’s always that tension between the anonymity of the testing data and its usefulness for testing, according to Kamor. For example, replacing all sensitive information with just the letter “X” is one surefire way to make production data completely anonymous, but that risks losing valuable testing scenarios.
“That new data set is going to be incredibly private — no identifiable information,” Kamor said. “But is it useful for development testing? No, it’s not useful at all.”
A common use case is anonymizing Social Security numbers. Tonic gives companies two options tailored to different needs: one scrambles the values by replacing them with random numbers, while the other preserves existing data relationships by always replacing a given real Social Security number with the same fake one.
“Do you want that [original social security number] mapped to a different random number every time? Or do you want that SSN to always be mapped to the same fake SSN so that there’s some level of consistency maintained?” Kamor said.
There is always this trade-off between consistency and anonymity. Preserving data relationships risks making the testing data less anonymous. Even when all the Social Security numbers in the testing database are fake, the pattern of behavior for each user across the database is preserved, and if those patterns are specific enough, it may be possible to learn a lot about an individual.
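The two choices Kamor describes can be sketched in a few lines of Python. This is an illustration of the general idea, not Tonic’s implementation. The function names are hypothetical, the secret key is a placeholder, and the generated values are not checked against real Social Security number rules.

```python
import hashlib
import hmac
import secrets


def _format_ssn(number):
    """Format an integer as a nine-digit, SSN-shaped string."""
    digits = f"{number % 10**9:09d}"
    return f"{digits[:3]}-{digits[3:5]}-{digits[5:]}"


def random_fake_ssn(_real_ssn):
    """Option 1: a fresh random value every time, so cross-table links are broken."""
    return _format_ssn(secrets.randbelow(10**9))


def consistent_fake_ssn(real_ssn, key):
    """Option 2: the same real value always maps to the same fake value.

    A keyed hash (HMAC) makes the mapping repeatable for anyone holding the
    key while keeping it hard to reverse without it.
    """
    digest = hmac.new(key, real_ssn.encode("utf-8"), hashlib.sha256).digest()
    return _format_ssn(int.from_bytes(digest[:8], "big"))


# Placeholder key for illustration only; a real key would be stored securely.
demo_key = b"replace-with-a-real-secret"

first = consistent_fake_ssn("123-45-6789", demo_key)
second = consistent_fake_ssn("123-45-6789", demo_key)
print(first == second)  # the mapping is repeatable
```

With the consistent option, a join on the Social Security column still works in the testing database, which is exactly why per-user patterns survive.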
There Are Advantages to Generating Fake Data
Although some developers at Intelligent Medical Objects work with non-sensitive data, they don’t always choose to use production data for testing.
“Production data can be very, very specific to certain things and at times very confusing compared to fake data,” said Eric Grandt, senior software engineer at Intelligent Medical Objects.
While production data can be especially handy for preventing repeat errors when tweaked from bug reports, most of the time, Grandt prefers fake data. Testing is most efficient when it’s done on small slices of functionality at a time, something that production data struggles with because it can get so large and contain so many different properties.
And sometimes developers simply don’t have access to production data yet because the application is still being built. For any of these situations, developers can create fake data instead. For smaller-scale applications — or those that don’t use a lot of data — it’s possible to write scripts to generate fake data or even to do it by hand.
“The problem is it’s time intensive and things change all the time,” Rochlin said. “And it’s also not fun.”
Luckily, developers can use tools and libraries to help with generating fake data. Python provides libraries that can help with creating thousands of fake customer names, for example. Third-party APIs also usually provide testing data so developers can test the connections before moving to those APIs in the production environment.
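As a rough illustration of script-generated data, the sketch below builds fake customers using only Python’s standard library. Third-party libraries such as Faker offer much richer generators. The name lists, the field names and the `make_fake_customers` helper are made up for this example.

```python
import random

# Small, made-up name pools; a dedicated library would provide far more variety.
FIRST_NAMES = ["Ava", "Noah", "Mia", "Liam", "Zoe", "Omar", "Ines", "Kenji"]
LAST_NAMES = ["Lopez", "Nguyen", "Okafor", "Schmidt", "Patel", "Rossi", "Kim"]


def make_fake_customers(count, seed=0):
    """Generate `count` fake customer records, repeatable for a given seed."""
    rng = random.Random(seed)  # a seeded generator keeps test runs reproducible
    customers = []
    for customer_id in range(1, count + 1):
        first = rng.choice(FIRST_NAMES)
        last = rng.choice(LAST_NAMES)
        customers.append(
            {
                "id": customer_id,
                "name": f"{first} {last}",
                # The id keeps addresses unique; example.com is reserved for examples.
                "email": f"{first}.{last}.{customer_id}@example.com".lower(),
            }
        )
    return customers


customers = make_fake_customers(1000)
print(len(customers))
print(customers[0]["name"])
```

Seeding the generator is a deliberate choice here. A failing test can be rerun against exactly the same data.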
Developers at Chapter benefit from the fake testing data provided by Medicare to build tools for comparing and recommending Medicare plans to users.
“Medicare publishes data sets that are sanitized of people’s data, which is pretty cool,” Rochlin said. “They have testing data that’s built for any integrations that we’re going to do.”
It’s helpful when third-party companies provide data for testing integrations, because creating it can be a lot of work for developers to do on their own. At SEI, developers regularly test whether stock orders are executed correctly, which involves tracing orders as they are processed by brokers and markets. Many of those steps involve simulating interactions with third-party services that can’t take part in the testing process.
“You can’t simulate all those conditions just through your system unless everybody’s participating, which is not practical,” Sharma said. “The DTCC [or Depository Trust and Clearing Corporation] is not going to participate in our day-to-day test cycle — or [National Securities Clearing Corporation] or stock exchanges. We have to simulate that situation on our own, which means creating that kind of test data.”
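A simulation of that kind often takes the form of a small stand-in object that answers the way the outside party would. The sketch below is a generic illustration, not SEI’s system. `FakeClearingService`, `execute_orders`, the order fields and the statuses are all hypothetical.

```python
class FakeClearingService:
    """Hypothetical stand-in for an external clearing service used only in tests."""

    def __init__(self, reject_symbols=()):
        self.reject_symbols = set(reject_symbols)
        self.received = []  # record every order so tests can inspect the traffic

    def submit(self, order):
        self.received.append(order)
        if order["symbol"] in self.reject_symbols:
            return {"order_id": order["order_id"], "status": "rejected"}
        return {"order_id": order["order_id"], "status": "cleared"}


def execute_orders(orders, clearing_service):
    """Send each order to the clearing service and collect its final status."""
    return {
        order["order_id"]: clearing_service.submit(order)["status"]
        for order in orders
    }


# Made-up test data covering one success path and one failure path.
test_orders = [
    {"order_id": "A1", "symbol": "ABC", "quantity": 100},
    {"order_id": "A2", "symbol": "XYZ", "quantity": 50},
]

fake = FakeClearingService(reject_symbols=["XYZ"])
results = execute_orders(test_orders, fake)
print(results)
```

Because the stand-in is configurable, the test data can force conditions, such as a rejected order, that would be hard to trigger on demand with a real counterparty.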
Writing More Tests Can Guide Your Testing Data Strategy
Regardless of how testing data is initially generated, developers will have to put in work to keep that data up to date with the changing codebase and databases. Every time developers make changes in the code to alter the structure of database tables, like adding or removing columns, the testing data will need to be updated as well.
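A small check can catch testing data that has drifted away from the schema. The sketch below uses an in-memory SQLite database purely for illustration. The table, the fixture rows and the helper names are assumptions rather than part of any tool mentioned here.

```python
import sqlite3


def table_columns(conn, table):
    """Return the set of column names SQLite reports for a table.

    The table name is interpolated directly, so only pass trusted names.
    """
    return {row[1] for row in conn.execute(f"PRAGMA table_info({table})")}


def stale_fixture_rows(conn, table, fixture_rows):
    """List fixture rows whose keys no longer match the table's columns."""
    expected = table_columns(conn, table)
    problems = []
    for index, row in enumerate(fixture_rows):
        keys = set(row)
        missing = expected - keys
        extra = keys - expected
        if missing or extra:
            problems.append((index, sorted(missing), sorted(extra)))
    return problems


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, name TEXT, email TEXT)")

# Fixture written against the original schema.
fixtures = [{"id": 1, "name": "Ava Lopez", "email": "ava@example.com"}]
before = stale_fixture_rows(conn, "customers", fixtures)

# A later change adds a column, and the fixture is now out of date.
conn.execute("ALTER TABLE customers ADD COLUMN phone TEXT")
after = stale_fixture_rows(conn, "customers", fixtures)
print(before, after)
```

Running a check like this as part of the test suite would surface stale data as soon as a column is added or removed.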
As a result, deciding how to incorporate testing data into development is an ongoing process. It can be a lot of work, but a good way to figure out whether applications need additional data to cover more edge cases is to write more tests. The process of writing tests will naturally expose any gaps in the data because additional test cases will require different kinds of data that touch different parts of the application.
“For the most part, depending on the type of tests, they run very fast so it’s not a huge bottleneck at all,” Grandt said.
Developers can combine the insights they get from writing test cases with considerations around the difficulty of anonymizing production data to decide how to generate testing data. If production data contains a lot of sensitive information — especially information that has intricate relationships across the database — deciding how to anonymize the data can get quite complex. Synthesizing fake data is a great alternative, but that method can be a lot of work as well, especially if those intricate data relationships are important to testing and need to be replicated.
Ultimately, decisions around testing data are made based on the application and developers’ use cases. Both synthetic data and anonymized production data can work, but each method should be evaluated in relation to a project’s specific circumstances.