Understanding and Embracing Synthetic Data in AI

Synthetic data has the potential to speed up and improve the building of AI models. Here’s why it may overshadow real data in AI models by 2030.

Written by Erik Brown
Published on Dec. 19, 2024

Data is at the core of harnessing the potential of artificial intelligence (AI). However, as AI models become more sophisticated, it is increasingly tough and expensive to capture, maintain and secure the data needed for training these models and testing systems across various use cases. This is especially true in highly regulated industries and for companies lacking a modern data infrastructure. That’s where synthetic data comes into play. 

Synthetic data is artificially generated data that has the same statistical properties as real data but is not derived from real-world sources such as system events, user actions or observations. It can also be given deliberately different statistical qualities to serve specific goals, such as reducing bias or enabling simulation scenarios that real data can’t cover. And it isn’t limited to tabular rows; synthetic data can also take forms such as images.
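
For a concrete sense of what “the same statistical properties” can mean, here is a minimal sketch that fits the mean and covariance of a small numeric table and samples brand-new rows from that fitted distribution. The column names and values are hypothetical, and real generators also handle categorical fields, constraints and privacy guarantees that this illustration ignores.

```python
import numpy as np
import pandas as pd

# Hypothetical "real" numeric data: loan amount and applicant income.
real = pd.DataFrame({
    "loan_amount": [240_000, 180_000, 320_000, 150_000, 275_000],
    "income":      [ 95_000,  70_000, 120_000,  60_000, 105_000],
})

# Fit simple summary statistics of the real data.
mean = real.mean().to_numpy()
cov = real.cov().to_numpy()

# Sample brand-new rows that share those statistics but copy no real record.
rng = np.random.default_rng(seed=42)
synthetic = pd.DataFrame(
    rng.multivariate_normal(mean, cov, size=1000),
    columns=real.columns,
)

print(synthetic.describe())  # means and spreads track the real table
```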

3 Benefits of Synthetic Data in Building AI Models

  1. Synthetic data can mitigate AI bias.
  2. It adheres to legal and regulatory requirements, making it easier to build AI models in highly regulated industries.
  3. It expands data access for AI teams, closing the gap on real data. 

By creating and using synthetic data, data scientists and engineers can train even more powerful models that support business analysis while addressing some of the traditional challenges associated with real-world data. Synthetic data reduces privacy risks by mocking or redacting highly sensitive information, such as personal health data, in non-production scenarios. It can also accelerate development cycles and help organizations manage the costs of data acquisition and maintenance more effectively. 

Synthetic data has applications in every industry, for example:

  • Biotech modeling for research and development.
  • Financial services modeling to combat credit card fraud.
  • Utility modeling for grid modernization and optimization.
  • Healthcare modeling to enable better patient care, population health and payment integrity.
  • Manufacturing modeling of operational technology, such as sensor readings for predictive maintenance or simulated operating conditions, to improve warehouse operations and automated inventory management.

Gartner anticipates the use of synthetic data will overshadow the use of real data in AI models by 2030. This trend makes it increasingly likely you will encounter synthetic data, if you haven’t already. Understanding how and where to use synthetic data, as well as how to create it effectively, is crucial to building better AI models for your business or clients.

 

3 Scenarios Driving the Use of Synthetic Data

1. Mitigating AI Bias

AI models often face issues with bias due to the data and knowledge used for their development and training. Over one-third of the facts used by AI models may be biased, which can lead to misleading AI hallucinations, according to a USC study. A notable example is home mortgage underwriting, where the algorithms used have exhibited racial bias. In one study, real data from the Home Mortgage Disclosure Act was used to create experimental loan applications, revealing consistent discrimination against Black applicants, even when their financial profiles were identical to those of white applicants.

As you develop your AI models and conduct analytics, it’s crucial to be aware of these issues and take steps to eliminate potential biases. Synthetic data can help. Continuing with the mortgage example, you might create synthetic data points to fill gaps for underrepresented groups in existing data sets. Generating new data that equally represents different groups, scenarios or conditions lets your models learn from a more inclusive data set and reduces bias. 
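
One simple way to do that, sketched below, is to oversample rows from the underrepresented group and add small random noise to the numeric fields so the new rows are plausible rather than exact copies. The column and group names are hypothetical; production approaches (such as SMOTE or learned generative models) are considerably more sophisticated than this illustration.

```python
import numpy as np
import pandas as pd

def balance_groups(df: pd.DataFrame, group_col: str, numeric_cols: list[str],
                   noise_scale: float = 0.02, seed: int = 0) -> pd.DataFrame:
    """Add jittered synthetic rows until every group is equally represented."""
    rng = np.random.default_rng(seed)
    target = df[group_col].value_counts().max()
    pieces = [df]
    for group, count in df[group_col].value_counts().items():
        shortfall = target - count
        if shortfall == 0:
            continue
        # Resample rows from the underrepresented group...
        extra = df[df[group_col] == group].sample(shortfall, replace=True,
                                                  random_state=seed).copy()
        # ...and perturb numeric fields slightly so they aren't exact duplicates.
        for col in numeric_cols:
            extra[col] = extra[col] * (1 + rng.normal(0, noise_scale, len(extra)))
        pieces.append(extra)
    return pd.concat(pieces, ignore_index=True)

# Hypothetical loan-application table with an underrepresented group B.
apps = pd.DataFrame({
    "group":       ["A"] * 8 + ["B"] * 2,
    "income":      np.linspace(50_000, 140_000, 10),
    "loan_amount": np.linspace(150_000, 400_000, 10),
})
balanced = balance_groups(apps, "group", ["income", "loan_amount"])
print(balanced["group"].value_counts())  # A and B now appear equally
```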

It is crucial to approach the testing and validation of synthetic datasets thoughtfully. Human insights and expertise are essential to quality and proper data governance. To overcome bias effectively, seek to understand where it exists in real data, investigating its origins and how the data has been transformed. This proactive approach will not only enhance the performance of your models but also contribute to fairer outcomes in your applications.


2. Adhering to Legal and Regulatory Requirements

If you work in or with a highly regulated industry with strict data and privacy protection requirements, synthetic data can be a game changer. In fields where the stakes are high, managing sensitive data comes with significant costs and risks.

Healthcare is a great example of an industry that faces major challenges in building and developing AI models. Sharing personal healthcare data can lead to serious liability issues, yet you need this data to create effective models for patient care scenarios. Complex modeling often requires collaboration and data-sharing, whether that’s moving data between your onshore and offshore teams or sharing information with external parties. A recent example is the public health response to COVID-19, where data had to be shared among healthcare providers, insurers, public health agencies and pharmaceutical companies. 

Synthetic data can mitigate these risks. It allows you to test model viability using general patient attributes while protecting identifying details such as personal health information (PHI). In some cases, though, real data remains essential for effective modeling. For instance, an AI system trained to detect brain cancer needs actual, high-quality brain images depicting various forms of cancer; fully synthetic images risk introducing inaccurate data that doesn’t truly reflect the patterns of the disease. In those cases, it’s crucial to strip away any identifying details from the real data before using it.
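
As a rough illustration of that de-identification step, the sketch below drops direct identifiers from a record set and replaces the patient ID with a salted hash so records stay linkable without revealing who they belong to. The field names are hypothetical, and real de-identification (for example, HIPAA Safe Harbor or expert determination) involves far more than removing a few columns.

```python
import hashlib
import pandas as pd

# Direct identifiers to remove entirely (hypothetical field names).
DIRECT_IDENTIFIERS = ["name", "ssn", "address", "phone"]

def deidentify(records: pd.DataFrame, salt: str) -> pd.DataFrame:
    """Strip direct identifiers and pseudonymize the patient ID."""
    out = records.drop(columns=DIRECT_IDENTIFIERS, errors="ignore").copy()
    # Replace the patient ID with a salted hash so rows stay linkable
    # across datasets without revealing the original identifier.
    out["patient_id"] = out["patient_id"].map(
        lambda pid: hashlib.sha256(f"{salt}{pid}".encode()).hexdigest()[:16]
    )
    return out

patients = pd.DataFrame({
    "patient_id": [101, 102],
    "name": ["Jane Doe", "John Roe"],
    "ssn": ["123-45-6789", "987-65-4321"],
    "address": ["1 Main St", "2 Oak Ave"],
    "phone": ["555-0100", "555-0101"],
    "scan_label": ["glioma", "no_finding"],
})
print(deidentify(patients, salt="rotate-this-secret"))
```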

Additionally, there are considerations beyond data privacy, such as the potential for bias. As you address privacy concerns, it’s important to be careful about how you fill in any data gaps, as this can lead to the introduction of new biases. For example, you could inadvertently overfit models with synthetic patterns that are more inclusive in certain ways but don’t accurately reflect reality in others. This calls for thoughtful consideration of the trade-offs involved.

3. Closing Gaps in Real-World Data

Have you ever found yourself in a situation where you simply don’t have the necessary data at your fingertips? You’re not alone. As modeling becomes increasingly complex, the challenge of accessing real data grows, leaving many of us in a bind. In one recent survey of the AI/ML community, 28 percent of data scientists attributed their failed AI/ML deployments to a lack of data access. This isn’t just an AI issue: good data is also essential for testing software systems, where using production data in non-production environments carries its own exposure risks. Synthetic data can help mitigate that risk as well. 

Imagine you’re working with a utility company that is automating grid maintenance and needs to analyze the features and aging of transformers. While current computer vision models can identify basic items such as cars, buses, stop signs and buildings, they struggle to recognize specific attributes of transformers in an energy grid. 

You might need hundreds of photos of transformers to train your AI model effectively, but that level of data isn’t readily available. A synthetic data creator can generate and label photos for training. You could start with a base set of transformer images that include essential attributes, like the number of high-voltage and low-voltage bushings or signs of damage, such as dents. From this initial set, you could then produce additional images with a broader range of attributes, thereby increasing the data available for training or simulating attributes that don’t exist in the current image set. 
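
A common way to stretch a small base set like this is classic image augmentation: flips, slight rotations and brightness changes that preserve the label. The sketch below uses the Pillow library for that purpose; the file paths and label handling are hypothetical, and simulating genuinely new attributes that don’t exist in the base set would call for a generative model rather than these simple transforms.

```python
import random
from pathlib import Path
from PIL import Image, ImageEnhance, ImageOps

def augment(image: Image.Image, rng: random.Random) -> Image.Image:
    """Apply simple label-preserving transforms to one transformer photo."""
    out = image.rotate(rng.uniform(-10, 10), expand=True)   # slight rotation
    if rng.random() < 0.5:
        out = ImageOps.mirror(out)                           # horizontal flip
    return ImageEnhance.Brightness(out).enhance(rng.uniform(0.8, 1.2))

rng = random.Random(7)
base_dir = Path("transformers/base")        # hypothetical labeled base images
out_dir = Path("transformers/augmented")
out_dir.mkdir(parents=True, exist_ok=True)

for path in base_dir.glob("*.jpg"):
    with Image.open(path) as img:
        for i in range(5):                  # five variants per base photo
            variant = augment(img, rng)
            # Labels (bushing counts, damage flags) are assumed to live in a
            # sidecar file keyed by filename; copy or extend them alongside.
            variant.save(out_dir / f"{path.stem}_aug{i}.jpg")
```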


 

Key Considerations for Creating and Using Synthetic Data

It’s easier than ever to create synthetic data from existing data sets, particularly with the help of generative AI. You can generate synthetic data with general-purpose tools such as Excel or Python, or with specialized platforms like Tonic.ai, Mostly AI or Gretel. These purpose-built platforms are often the easiest to use, but they may require more investment and training to get the most out of them.
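
To give a sense of the do-it-yourself end of that spectrum, the sketch below uses the open-source Faker library, one option among many and not one of the platforms named above, to generate realistic but entirely fabricated customer records in a few lines of Python. Purpose-built platforms add schema learning, referential integrity and privacy controls that a quick script like this does not.

```python
from faker import Faker
import pandas as pd

fake = Faker()
Faker.seed(11)  # reproducible fake values

# Generate a small table of realistic but entirely fabricated customer records.
rows = [
    {
        "customer_id": fake.uuid4(),
        "name": fake.name(),
        "email": fake.email(),
        "signup_date": fake.date_between(start_date="-2y", end_date="today"),
        "city": fake.city(),
    }
    for _ in range(1_000)
]
customers = pd.DataFrame(rows)
print(customers.head())
```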

In situations where building your own synthetic data is too arduous or time-consuming, you may also consider purchasing it from vendors. Through experimentation, you will begin to understand which approach or vendor solution best meets your team’s data requirements and workflow preferences.

Finally, it’s beneficial to integrate synthetic data generation and use into the entire development workflow. This empowers your team to experiment and iterate on your models freely without the usual constraints of data privacy concerns or data availability. 
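
One lightweight way to make that routine is to expose a synthetic data generator directly in your test suite, so every run builds its own safe data set instead of touching production records. The sketch below assumes pytest and a hypothetical transaction schema purely for illustration.

```python
# test_reporting.py -- hypothetical example of synthetic data in a test suite
import numpy as np
import pandas as pd
import pytest

@pytest.fixture
def synthetic_transactions() -> pd.DataFrame:
    """Build a fresh, fully synthetic transaction table for each test."""
    rng = np.random.default_rng(0)
    return pd.DataFrame({
        "account_id": rng.integers(1_000, 2_000, size=500),
        "amount": rng.gamma(shape=2.0, scale=50.0, size=500).round(2),
        "is_fraud": rng.random(500) < 0.02,
    })

def test_fraud_rate_is_small(synthetic_transactions):
    # Runs against synthetic data, so no production records are exposed.
    assert synthetic_transactions["is_fraud"].mean() < 0.10
```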

Keep in mind that humans must remain in the loop for the foreseeable future. AI will hallucinate, so it’s important to continue observing and validating its output. But with continued effort and experimentation, synthetic data can add efficiency to day-to-day work. 
