Leveraging ChatGPT and SDV for Synthetic Data Generation: A Game Changer for Small Businesses

How Synthetic Data Generated by ChatGPT and SDV can be used by small businesses and in the blockchain world

Leveraging ChatGPT and SDV for Synthetic Data Generation: A Game Changer for Small Businesses
Photo by Pietro Jeng / Unsplash

In the world of artificial intelligence and machine learning, data is king. But what if the data you need doesn't exist or is hard to come by? Enter synthetic data, a game-changing solution that's opening up new possibilities for businesses of all sizes and in various industries.

Understanding Synthetic Data

Synthetic data is artificially generated data that mimics real data in terms of essential characteristics but does not directly correspond to real-world events. It's like a data doppelgänger, providing a valuable stand-in for scenarios where real data is scarce, sensitive, or expensive to collect.

In our first case study, we are looking at how synthetic data can be a game-changer for small businesses. It allows them to train robust machine learning models without needing access to massive, real-world datasets. This levels the playing field, enabling small businesses to compete with larger corporations that have access to vast amounts of data.

Our second case study looks at how synthetic data can be used in the blockchain world for a transaction analysis tool.

Case Study: Enhancing Customer Journey with Synthetic Data

Consider a small e-commerce business specializing in outdoor gear, looking to improve its customer journey from search to purchase. With limited customer data, creating a robust recommendation system and personalized user experience can be challenging. However, with synthetic data, the business can generate a large dataset that mimics customer behavior and use this to train their systems.

Here's how they could use ChatGPT and SDV:

Use ChatGPT to generate text data based on specific prompts or themes.

  • The business could prompt ChatGPT with a series of actions that customers might take. For example, searching for "best lightweight camping tent", reading reviews, comparing different tents, and eventually making a purchase. The prompt could be something like: "Generate a sequence of 100 different actions a customer might take when looking for and purchasing outdoor gear."
  • To do this, they could use the OpenAI API to send a series of prompts to ChatGPT. Each prompt would be a potential customer action, and ChatGPT would generate a variety of responses, effectively creating a diverse set of synthetic customer journeys.

Feed this data into SDV to model the dataset.

  • Once they have a collection of generated customer actions, they can use SDV to model this data. This involves defining the structure of their data (e.g., fields like 'Action', 'Product Category', 'Action Sequence') and fitting the model to the generated actions.
  • The reason for using SDV here is that while ChatGPT is excellent at generating text, SDV is designed specifically for modeling and generating structured data. It can capture the relationships between different fields in the data and generate new data that maintains these relationships, which is crucial for creating a realistic synthetic dataset.
  • For instance, they might define 'Action' as a categorical field with options like 'Search', 'Read Review', 'Compare', 'Purchase', etc., 'Product Category' as another categorical field with options like 'Tents', 'Boots', 'Jackets', etc., and 'Action Sequence' as a numerical field representing the order of actions.
  • They would then use SDV's CopulaGAN or CTGAN model to fit this data, effectively learning the joint distribution of these fields.

Use SDV to generate new synthetic data based on the modeled dataset.

  • After modeling the dataset, the business can use SDV to generate new synthetic data. This data will have the same structure as their original data but will contain new, synthetic entries. For example, they might get new customer action sequences like "Search for waterproof jackets for spring", "Read reviews", "Compare different jackets", "Make a purchase".
  • This synthetic data mimics the real world because it's based on the patterns and relationships in the original data generated by ChatGPT, which was prompted with real-world scenarios. It gives good insight because it maintains the complexity and diversity of real-world data, allowing the business to test and train their systems in a realistic environment without needing access to large amounts of real customer data.

The result? A more personalized and effective customer journey that can rival those of larger competitors, providing customers with highly relevant product suggestions, reviews, and comparisons, and improving the overall shopping experience.

Case Study: Synthetic Data for Blockchain Transaction Analysis

Consider a startup developing a new tool for analyzing blockchain transactions. Their goal is to detect fraudulent transactions and understand transaction patterns. However, they face a challenge: real-world blockchain transactions are complex and diverse, and the startup doesn't have access to a large enough dataset to effectively test and train their tool.

This is where synthetic data comes in. Using ChatGPT and SDV, the startup can generate a large dataset of synthetic blockchain transactions that mimic the complexity and diversity of real-world transactions. The process involves using ChatGPT to generate transaction descriptions, modeling this data with SDV, and then using SDV to generate new synthetic data.

Even though blockchain transactions are publicly accessible, synthetic data offers several advantages:

  1. Privacy and Anonymity: Synthetic data doesn't correspond to real-world individuals or entities, ensuring privacy.
  2. Complexity and Diversity: Synthetic data can cover a wide range of scenarios, including edge cases not present in the real-world data.
  3. Controlled Environment: The startup has full control over the synthetic data's characteristics, allowing them to create specific data scenarios.
  4. Cost and Efficiency: Synthetic data can be generated more quickly and efficiently than accessing and processing large amounts of real-world blockchain data.

With this synthetic dataset, the startup can effectively test and train their blockchain transaction analysis tool, improving its accuracy and reliability while respecting privacy and enabling scalability.


Synthetic data, generated with tools like ChatGPT and SDV, is a powerful resource that's reshaping the landscape of data availability and usage. It's a game-changer for businesses of all sizes, enabling them to overcome data limitations, improve their systems, and compete effectively in their respective markets. Whether it's enhancing the customer journey in e-commerce or improving transaction analysis in the blockchain world, synthetic data opens up a world of possibilities. As we continue to explore and innovate in this space, we can expect to see even more exciting applications and advancements. The future of data is here, and it's synthetic.

Subscribe to Endeavours Way

Don’t miss out on the latest issues. Sign up now to get access to the library of members-only issues.