Synthetic data use cases for a safer pathway to business AI

As its name suggests, synthetic data is artificial data. So why would that be interesting? We already have so much more data than most companies could ever work with, right?

Because real data is a minefield, waiting for your business to explode all over headlines with privacy leaks.

Synthetic data is, at its core, private by design. Unlike the lesser alternatives of data redaction and data anonymisation, well-generated synthetic data contains no real individuals’ records, which sharply reduces the risk of re-identification. And it retains the utility and value of the information it is mimicking.

Synthetic data, sometimes called smart synthetic data, is generated by machine learning algorithms. A synthetic data generator learns from the patterns it detects in existing large-scale datasets. These can be generalizations about demographics, geographical locations and behavioral patterns, without any personally identifiable information (PII).

Then it creates new sets of data based on the same patterns. When the generator is a generative adversarial network (GAN), it continues to retrain itself, comparing the new data with the original data. Once the two reach the desired level of similarity, the synthetic data generator is trained and becomes a useful tool that can create safe data that doesn’t put anyone’s identity at risk.
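To make the learn-then-generate loop concrete, here is a minimal sketch in Python. It is not a GAN: it uses a deliberately simple statistical stand-in (an average, a spread and category frequencies) in place of a trained neural generator. The toy records, the column names and the functions fit, sample and mean_age_gap are all invented for illustration.

```python
import random
import statistics
from collections import Counter

# Toy "real" records, invented for illustration: (age, region).
REAL_ROWS = [
    (34, "north"), (45, "south"), (29, "north"), (52, "east"),
    (41, "south"), (38, "north"), (60, "east"), (27, "north"),
]


def fit(rows):
    """Learn aggregate patterns only: age mean/spread and region frequencies."""
    ages = [age for age, _ in rows]
    counts = Counter(region for _, region in rows)
    regions = sorted(counts)
    return {
        "age_mean": statistics.mean(ages),
        "age_std": statistics.stdev(ages),
        "regions": regions,
        "weights": [counts[r] for r in regions],
    }


def sample(model, n, seed=0):
    """Create n new rows that follow the learned patterns."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        age = max(0, round(rng.gauss(model["age_mean"], model["age_std"])))
        region = rng.choices(model["regions"], weights=model["weights"])[0]
        rows.append((age, region))
    return rows


def mean_age_gap(real, synthetic):
    """One crude similarity check: how far apart are the average ages?"""
    real_mean = statistics.mean([age for age, _ in real])
    synthetic_mean = statistics.mean([age for age, _ in synthetic])
    return abs(real_mean - synthetic_mean)


model = fit(REAL_ROWS)
synthetic_rows = sample(model, 1000)
print(f"mean age gap: {mean_age_gap(REAL_ROWS, synthetic_rows):.2f}")
```

A real generator would learn far richer patterns, including relationships between columns, and would be checked against the original on many more measures than a single average. The shape of the loop is the same though: learn patterns, generate, compare.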

The business potential of synthetic data use cases

As with any cutting-edge technology, new use cases of synthetic data are emerging daily. The best known of those use cases is of course deepfakes, the most developed form of synthetic data, because it’s relatively easy for deep learning algorithms to learn from the very predictable patterns of human faces. Deepfakes are artificial videos and images, of both real people and completely synthetic ones, that look and sound real. This is highly controversial because they could be used for nefarious purposes like falsely implicating people. On the other hand, if an organization such as a beauty company wants to virtually test products on a more diverse group of individuals, deepfakes could be used to generate a more balanced, inclusive dataset.

It’s a highly visual way to show off the capacity of this artificial intelligence, but deepfakes are certainly not the most interesting synthetic data use case. In fact, they are one of the less challenging kinds.

Synthetic data has an interesting use case in app development and testing. It can be used as a placeholder for real data before that data even exists. You can train a synthetic data generator on existing demographics but then use its output in situations that don’t exist yet. You can see how stable your app is with a plethora of users and behaviors.
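As a rough illustration of that testing idea, the sketch below pushes a batch of synthetic user sessions through a stand-in for an app and counts how many are handled without error. Everything here is hypothetical: the action names, the session fields, and the functions synthetic_sessions, handle_session and stress_test. The sessions are drawn uniformly at random for simplicity, where in practice they would come from a generator trained on real demographics and behavior.

```python
import random

# Hypothetical actions a user could take in the app under test.
ACTIONS = ["login", "search", "add_to_cart", "checkout", "logout"]


def synthetic_sessions(n_users, seed=0):
    """Yield made-up user sessions; no real person stands behind any of them."""
    rng = random.Random(seed)
    for user_id in range(n_users):
        length = rng.randint(1, 10)
        yield {
            "user_id": f"synthetic-{user_id}",
            "age": rng.randint(18, 90),
            "actions": [rng.choice(ACTIONS) for _ in range(length)],
        }


def handle_session(session):
    """Stand-in for the app under test: count checkouts in a session."""
    return session["actions"].count("checkout")


def stress_test(n_users):
    """Run every synthetic session through the app and tally the outcome."""
    processed = 0
    failures = 0
    for session in synthetic_sessions(n_users):
        try:
            handle_session(session)
            processed += 1
        except Exception:
            failures += 1
    return processed, failures


processed, failures = stress_test(500)
print(f"processed={processed} failures={failures}")
```

Swapping handle_session for calls into a real app, and raising n_users, turns the same loop into a simple stability check that never touches real customer records.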

Synthetic data is a favourite solution for organisations that are struggling to innovate under heavy governance and regulation. It usually fulfills the needs of internal and external partners without the six-month to year-long rigmarole that is data procurement.

With this in mind, an interesting use case for synthetic data is when you want to have a realistic outcome but you don’t want to risk data leaks. This can be when you are determining if you want to work with an integration partner. You want to give them recent data but, since you are only vetting them, you don’t want to risk sharing PII. 

Similarly, machine learning engineers want to play around with highly relevant, current, and realistic data, but by the time they get through procurement processes, they end up with stale information. They are happy to use synthetic data, both because they understand it (after all, data scientists are still the ones creating it) and because they can quickly get access to pertinent, although artificial, information.

Finally, for large corporations, the most interesting use case for synthetic data is data portability. Imagine you are a multinational bank that spans many borders, divisions and sectors. Now imagine the potential of being able to share all the data across those many regulations and borders. Safely. You could dramatically improve fraud detection and the spotting of money laundering. Or imagine you’re a non-governmental healthcare provider and could share scans from a million or even a billion patients: you’d have the potential to cure cancer.

None of this will ever be allowed with real data, but with high-quality synthetic data, the potential of data portability can quickly become a collaborative, innovative, cross-organisational, trans-global reality.

The use cases of synthetic data are seemingly limitless, and as the technology finally catches up with that potential, we can look forward to a promising, more private future.