Let’s say you had some time on your hands and for some reason decided to create a machine learning algorithm that could count the oranges on an orange tree. Far-fetched? Not according to data scientists who published “Fruit Counting Based on Deep Simulated Learning.”
Welcome to the world of synthetic data, which is in the news because of its role in training artificial intelligence (AI) models, themselves trending thanks in part to ChatGPT and other generative AI tools.
In the orange tree example, the synthetic, or simulated, data are the thousands of computer-generated images of orange trees fed into the machine learning algorithm to train the model. One alternative – feeding the model thousands of images of actual orange trees – would be far more costly and time-consuming, requiring a lot of legwork and permission from the grove owners. Another alternative would be even less appealing: manually counting the oranges on each tree.
ChatGPT is somewhat similar in that it produces content that could be classified as synthetic: the text it creates is generated, according to the inputs it is given, by a model trained on existing text scraped off the internet. The output is synthetic in that it has been artificially generated; the model is trained not to copy text, but to collect bits and pieces, mimicking style and tone. Like the made-up orange tree images, the text is a distorted amalgamation.
Synthetic Data in Marketing
So while synthetic data is familiarizing a wider audience of laypersons with machine learning, marketers have long been aware of its more practical use cases in creating datasets for training or testing purposes. In an era of heightened data privacy regulations, synthetic data is also gaining traction as a privacy-compliant way to train a machine learning model without exposing any personally identifiable information (PII) or otherwise running afoul of data security regulations.
A main reason for using synthetic data to train machine learning models is that, as in the orange tree example, it saves time and resources. For a midsize company – really any company without the data of an Amazon, Google or Walmart – that wants to learn more about its customers or determine which prospects to go after, it is faster and less expensive to train a model with synthetic data than to wait until it has enough customers with which to train one.
The other option, of course, is to forget about the modeling and make marketing decisions based on intuition, or to test a campaign on a small percentage of actual customers and hope it accurately represents the target audience. But because customers now expect hyper-personalized, omnichannel experiences, more and more brands are data-driven and base decisions only on the most accurate, up-to-date information about an actual customer. Enter synthetic data.
Accelerate Training of Machine Learning Models
Using simulated data to discern something about an actual customer might seem counterintuitive, but an example makes it clearer. First, it’s important to know that machine learning models are only as good as the data they’re trained on. If a marketer’s goal is to use the best predictive model, it would be highly counterproductive to feed a model biased data: for example, the self-fulfilling prophecy of building a model to predict pet owners and training it only on pet owners, or building a model that assumes everyone in the world is already a pet owner. Likewise, for a fraud detection model to be effective, it is necessary to feed it thousands of examples of (synthetic) fraud, lest the model see only a few cases of genuine fraud and thus be biased against detecting it.
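The fraud case is a class-imbalance problem, and one simple way to picture the remedy is to pad the rare class with synthetic variants of the few real examples. The following is a minimal sketch rather than a production technique: the feature names, the sample values and the `synthesize_minority` helper are all hypothetical, and jittering real rows is just one assumed way to generate synthetic ones.

```python
import random

def synthesize_minority(rows, n_new, noise=0.05, seed=0):
    """Return n_new synthetic rows, each a jittered copy of a random real row."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(rows)
        # Scale every feature by a random factor within +/- noise.
        synthetic.append([v * (1 + rng.uniform(-noise, noise)) for v in base])
    return synthetic

# Hypothetical features: [purchase_amount, transactions_per_hour]
legit = [[20.0 + i % 50, 1.0] for i in range(1000)]
fraud = [[900.0, 12.0], [1200.0, 9.0], [750.0, 15.0]]  # only three real cases

# Generate enough synthetic fraud rows to match the number of legitimate rows.
synthetic_fraud = synthesize_minority(fraud, n_new=len(legit) - len(fraud))
balanced_fraud = fraud + synthetic_fraud

print(len(legit), len(balanced_fraud))  # 1000 1000
```

With the two classes the same size, a model trained on `legit` and `balanced_fraud` no longer sees fraud as a vanishingly rare event. Whether jittered copies are realistic enough is a separate question that depends on the data.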
With that in mind, consider a fictitious travel company that wants to market a certain vacation package – say a budget weekend getaway to Atlantic City. To market to the right audience, it needs to know which customers would be most likely to purchase the package. If it’s a newly created offer, the company can’t go to the well with existing customers. Again, it might test the offer on a small percentage of existing customers – taking the chance of creating friction for those customers who find the offer irrelevant.
Instead, it purchases an anonymized dataset that includes thousands of customers who have visited the same or a similar destination: a dataset of synthetic customers created either by anonymizing real customers or by algorithms that generate variants of real customers. It trains its model on this anonymized dataset to determine which customers will buy the vacation package. Providing realistic synthetic data, particularly in industries with a need for security and privacy, accelerates the training of models so they can be applied more quickly to real-life scenarios.
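To make "variants of real customers" concrete, here is a rough sketch of one very simple generator: it drops the PII fields and then samples each remaining field independently from the values seen in the real records. The field names, the records and the `synthesize_customers` helper are invented for illustration, and this approach is an assumption, not a description of how any vendor builds such datasets.

```python
import random

PII_FIELDS = {"name", "email"}  # hypothetical field names

def synthesize_customers(real, n, seed=0):
    """Build n synthetic customers by sampling each non-PII field independently."""
    rng = random.Random(seed)
    fields = [f for f in real[0] if f not in PII_FIELDS]
    # Pool the observed values of each field across all real customers.
    pools = {f: [r[f] for r in real] for f in fields}
    return [{f: rng.choice(pools[f]) for f in fields} for _ in range(n)]

# Made-up records for illustration only.
real_customers = [
    {"name": "Customer A", "email": "a@example.com",
     "age": 34, "trips_per_year": 2, "avg_spend": 450},
    {"name": "Customer B", "email": "b@example.com",
     "age": 51, "trips_per_year": 1, "avg_spend": 300},
    {"name": "Customer C", "email": "c@example.com",
     "age": 27, "trips_per_year": 4, "avg_spend": 620},
]

synthetic_customers = synthesize_customers(real_customers, n=1000)
print(sorted(synthetic_customers[0]))  # ['age', 'avg_spend', 'trips_per_year']
```

Sampling fields independently throws away the relationships between them (for instance, between age and spend), so more careful generators try to preserve those correlations. Stripping names and emails is also not, by itself, a guarantee of anonymity.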
The caveat for the travel company is that it must obviously trust that the anonymized data is representative of the audience it is trying to influence. If your campaign is for the budget weekend in Atlantic City, you wouldn’t feed the model with anonymized data of people who summer on a yacht in the Mediterranean.
There is a ChatGPT liar’s paradox making the rounds that shows the potential distortion when a model has been improperly trained. Trained that “my wife is always right,” the model eventually agrees that “2+5=8” because someone’s wife says it is, even though it has also been trained that “2+5=7.” For the travel company, using a large enough dataset mitigates such a possibility because the data are grounded in reality and stand in for, or mirror, the type of behavior the company is trying to predict.
With a large enough dataset that approximates the audience it wants to reach, the travel company can then train its own models on that data, apply them to its own customers and enhance them over time.
“Synthetic” as a Synonym for Metadata
Another way marketers tend to think of synthetic data is as first-party data that has been augmented in some way using machine learning, such as by creating a model score. Whether it is a propensity model score or a clustering index, it is a value that has been generated by something else and appended to the customer record.
In this instance, the use of the word “synthetic” is more analogous to metadata than to simulated data, which underlies the marketing use cases we’ve explored above. The similarity is that both creating a model score and training a model with an anonymized dataset involve data that have been generated by a machine.
One key difference is that in something like a model score, the synthetic data is telling you something about your own customers. If you ask a model to classify a group of customers into five buckets according to the completeness of a unified profile, for example, that 1-5 grouping is synthetic and whichever group a customer belongs to might determine how they’re marketed to. But it’s synthetic only in that it didn’t exist prior to you creating it; unlike synthetic data that trains a model, it does not simulate actual customers – it is data about actual customers.
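The five-bucket example can be sketched in a few lines. This is only an illustration under stated assumptions: the list of profile fields, the customer records and the `completeness_bucket` helper are hypothetical, and a simple rule stands in for whatever model would actually do the classifying.

```python
def completeness_bucket(profile, fields, n_buckets=5):
    """Map the share of filled-in fields to a bucket from 1 to n_buckets."""
    filled = sum(1 for f in fields if profile.get(f) not in (None, ""))
    # Integer ceiling of (filled / len(fields)) * n_buckets, floored at bucket 1.
    return max(1, (filled * n_buckets + len(fields) - 1) // len(fields))

# Hypothetical fields that make up a "unified profile".
PROFILE_FIELDS = ["email", "phone", "address", "birthdate", "loyalty_id"]

customers = [
    {"id": 1, "email": "a@example.com", "phone": "555-0100",
     "address": "1 Main St", "birthdate": "1990-01-01", "loyalty_id": "L1"},
    {"id": 2, "email": "b@example.com", "phone": "555-0101",
     "address": "2 Main St", "birthdate": None},
    {"id": 3},
]

# Append the generated value to each customer record.
for c in customers:
    c["completeness_bucket"] = completeness_bucket(c, PROFILE_FIELDS)

print([c["completeness_bucket"] for c in customers])  # [5, 3, 1]
```

The `completeness_bucket` value did not exist until the code created it, yet it describes real customers, which is the sense of "synthetic" discussed in this section.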