Seonkyu Kim

Data Scientist | Purdue MS BAIM'24

Enhancing Patient Privacy with Synthetic Data Generation (Future Edelman Impact Competition Finalist)

The healthcare industry faces two linked challenges: maintaining patient privacy and managing high data protection costs. Our synthetic data approach strengthens privacy, reduces operational expenses, and provides useful data for research, supporting cost-effective advances in healthcare.

We generated synthetic datasets with five methods: Conditional Tabular Generative Adversarial Network (CTGAN), Data Synthesizer, Gaussian Copula, CopulaGAN, and Variational Autoencoder (VAE). Under our evaluation criteria and experiments, CTGAN emerged as the best synthetic data generation method. It captured the distribution of the real data well and achieved the best predictive power (0.528) at a reasonable privacy level (0.7). CTGAN also captured the correlations between variables, which are crucial to retaining the predictive power of the real dataset.
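One way to check whether a synthetic table keeps the correlations of the real one is to compare pairwise Pearson correlations column by column. The sketch below is only an illustration of that idea, not the evaluation used in the project: the function names (`pearson`, `correlation_gap`, `make_toy_table`), the `age` and `cost` columns, and the toy data are all hypothetical.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def correlation_gap(real, synthetic):
    """Mean absolute difference between the pairwise correlations of two tables.

    Each table is a dict mapping a column name to a list of numeric values.
    A gap of 0.0 means every pairwise correlation is reproduced exactly.
    """
    columns = sorted(real)
    gaps = []
    for i, a in enumerate(columns):
        for b in columns[i + 1:]:
            real_corr = pearson(real[a], real[b])
            synth_corr = pearson(synthetic[a], synthetic[b])
            gaps.append(abs(real_corr - synth_corr))
    return sum(gaps) / len(gaps)


def make_toy_table(n, slope, seed):
    """Hypothetical two-column table in which cost depends linearly on age."""
    rng = random.Random(seed)
    age = [rng.uniform(20, 80) for _ in range(n)]
    cost = [slope * a + rng.gauss(0, 10) for a in age]
    return {"age": age, "cost": cost}


# Toy stand-ins: "faithful" keeps the age-cost relationship, while
# "shuffled" keeps each column's values but breaks the relationship.
real = make_toy_table(500, 2.0, seed=0)
faithful = make_toy_table(500, 2.0, seed=1)
shuffled = {
    "age": real["age"],
    "cost": random.Random(2).sample(real["cost"], len(real["cost"])),
}

print("gap (faithful):", correlation_gap(real, faithful))
print("gap (shuffled):", correlation_gap(real, shuffled))
```

In this toy setup the shuffled table matches each column's distribution exactly but still shows a large gap, which is why a correlation check is a useful complement to per-column distribution checks.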

The poster, presentation slides, and code are available here.

Video Presentation

Poster