DiffusionClaims – PHI-Safe Synthetic Claims for Robust Anomaly Detection

Authors

  • Jimmy Joseph, Solutions Engineer Advisor Sr., United States (Author)

DOI:

https://doi.org/10.15680/IJCTECE.2023.0603003

Keywords:

Synthetic healthcare data, Diffusion models, HIPAA compliance, Insurance claims, Fraud detection, Anomaly detection, Privacy-preserving machine learning

Abstract

Healthcare claims data is rich enough to reveal real patterns and anomalies, but privacy legislation such as HIPAA prevents access to actual patient records. In this paper, we introduce DiffusionClaims, a new approach that uses diffusion models to create realistic synthetic healthcare claims that preserve the statistical patterns of existing data while exposing no protected health information (PHI). Diffusion models, best known for image generation, train more stably and achieve better mode coverage than generative adversarial networks (GANs). We adapt these models to tabular claims data by first encoding mixed categorical and numeric features into a continuous latent space for diffusion-based synthesis. The generated claims are then used to build a robust anomaly detection pipeline for fraud.
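The encode-then-diffuse step described above can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: the function names, the one-hot/z-score encoding, and the linear noise schedule are all assumptions standing in for whatever the authors actually used.

```python
import numpy as np

def encode_claims(categorical, numeric, n_categories):
    """Map mixed claim features to one continuous vector:
    one-hot vectors for categorical codes, z-scores for numeric amounts.
    (Illustrative encoding; the paper's exact scheme may differ.)"""
    onehot = np.zeros((len(categorical), n_categories))
    onehot[np.arange(len(categorical)), categorical] = 1.0
    z = (numeric - numeric.mean(axis=0)) / (numeric.std(axis=0) + 1e-8)
    return np.hstack([onehot, z])

def forward_diffuse(x0, t, T=1000, beta_min=1e-4, beta_max=0.02, seed=0):
    """Standard DDPM forward process on the encoded rows:
    x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps, eps ~ N(0, I)."""
    rng = np.random.default_rng(seed)
    betas = np.linspace(beta_min, beta_max, T)   # assumed linear schedule
    abar = np.cumprod(1.0 - betas)               # cumulative alpha-bar
    eps = rng.standard_normal(x0.shape)
    xt = np.sqrt(abar[t]) * x0 + np.sqrt(1.0 - abar[t]) * eps
    return xt, eps

# Three toy claims: a procedure code (3 categories) plus two numeric fields.
codes = np.array([0, 2, 1])
amounts = np.array([[100.0, 1.0], [250.0, 3.0], [90.0, 2.0]])
x0 = encode_claims(codes, amounts, n_categories=3)
xt, eps = forward_diffuse(x0, t=999)  # near the end, xt is almost pure noise
```

A denoising network would then be trained to predict `eps` from `(xt, t)`; sampling runs the learned reverse process and decodes the continuous vectors back to categorical codes and amounts.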

We evaluate DiffusionClaims against competitive GAN-based models and an existing rule-based simulator, showing that the diffusion-generated claims not only match real data in feature distributions and correlations but are also useful for downstream fraud detection tasks. We additionally assess privacy risk using membership inference and distance-to-closest-record metrics, concluding that DiffusionClaims generates synthetic data with low re-identification risk, sufficient to support HIPAA compliance. Experimental evaluation on a public insurance claims dataset and a universal gas fraud dataset confirms that models trained on synthetic data with injected anomalies identify anomalies effectively, performing almost as well as models trained on real data.
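The distance-to-closest-record (DCR) check mentioned above can be sketched as a simple nearest-neighbor distance in the encoded feature space. This is a generic illustration of the metric, not the paper's evaluation code; the function name and the use of Euclidean distance are assumptions.

```python
import numpy as np

def distance_to_closest_record(synthetic, real):
    """For each synthetic row, the Euclidean distance to its nearest real row.
    Very small distances flag near-copies of training records, i.e.
    potential re-identification risk."""
    diffs = synthetic[:, None, :] - real[None, :, :]   # (n_synth, n_real, d)
    return np.linalg.norm(diffs, axis=-1).min(axis=1)  # (n_synth,)

real = np.array([[0.0, 0.0], [1.0, 1.0]])
synth = np.array([[0.0, 0.0], [3.0, 4.0]])
dcr = distance_to_closest_record(synth, real)  # first row is an exact copy
```

In practice the synthetic-to-train DCR distribution is compared against a holdout-to-train baseline: if synthetic records sit no closer to the training set than genuinely unseen records do, memorization is unlikely.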

We also report industry-standard quality metrics for synthetic data (fidelity, utility, and privacy) and demonstrate that DiffusionClaims strikes a favorable balance in the fidelity–utility–privacy trade-off while safeguarding patient privacy. DiffusionClaims allows realistic claims data to be shared and analyzed without revealing private records, opening new possibilities for cooperative fraud detection and rare-event modeling in healthcare.
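One common fidelity metric of the kind referred to above is the total variation distance between the real and synthetic marginal distributions of each feature. The sketch below is a generic per-feature check, assumed for illustration; the paper's exact metric suite may differ.

```python
import numpy as np

def tv_distance(real_col, synth_col, bins=20):
    """Fidelity check for a single feature: total variation distance
    between histograms of real and synthetic values over a shared range.
    0 means identical marginals; 1 means completely disjoint."""
    lo = min(real_col.min(), synth_col.min())
    hi = max(real_col.max(), synth_col.max())
    p, _ = np.histogram(real_col, bins=bins, range=(lo, hi))
    q, _ = np.histogram(synth_col, bins=bins, range=(lo, hi))
    p = p / p.sum()
    q = q / q.sum()
    return 0.5 * np.abs(p - q).sum()
```

Averaging this score across columns gives a single fidelity number, which is then weighed against downstream utility (e.g., fraud-classifier AUC on synthetic data) and privacy (e.g., DCR) to locate a model on the trade-off.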

References

1. Tiya Vaj. “Building a Synthetic Healthcare Insurance Claims Dataset for Fraud Detection.” Medium, Sep 2025. (Generated 5,000 synthetic claims with 3% injected fraud for model training.)

2. Ahmed A. Naseer, et al. “ScoEHR: Synthetic Electronic Health Records Generation using Continuous-time Diffusion Models.” Proceedings of Machine Learning Research, 219: 1–22, 2023. (Introduced a diffusion model for EHR data, outperforming GAN baselines and showing low privacy risk.)

3. Auxiliobits Blog. “Synthetic Data Generation for Healthcare AI Training: Techniques and Privacy Considerations.” May 2025. (Overview of synthetic data types, including GANs, VAEs, and diffusion models, with emphasis on privacy and compliance.)

4. Anli du Preez, et al. “Fraud detection in healthcare claims using machine learning: A systematic review.” Artificial Intelligence in Medicine, 160: 103061, 2025. (Survey of ML techniques for healthcare fraud detection, highlighting the rarity of fraud cases and the variety of approaches.)

5. Akim Kotelnikov, et al. “TabDDPM: Modelling Tabular Data with Diffusion Models.” ICML 2023, PMLR 202: 10937–10954, 2023. (Demonstrated that diffusion models on tabular data outperform GANs/VAEs; introduced privacy metrics such as Distance to Closest Record.)

6. Faris Haddad (AWS). “How to evaluate the quality of synthetic data – measuring fidelity, utility, and privacy.” AWS Machine Learning Blog, Dec 2022. (Proposed an evaluation framework with fidelity, utility, and privacy metrics; discussed trade-offs and best practices.)

7. Gatha Varma (OpenMined). “Of Legal Tangles and Synthetic Datasets Part 4: HIPAA and Synthesis.” OpenMined Blog, 2022. (Legal analysis of how HIPAA views synthetic data, concluding that synthetic data can satisfy HIPAA and is not regulated as PHI if properly de-identified.)

8. Mauro Giuffrè and Dennis L. Shung. “Harnessing the power of synthetic data in healthcare: innovation, application, and privacy.” npj Digital Medicine 6, 186 (2023). (Perspective on uses of synthetic data in healthcare, covering definitions, applications, data quality issues, and regulatory considerations such as differential privacy.)

9. Edward Choi, et al. “Generating multi-label discrete patient records using generative adversarial networks.” In: Machine Learning for Healthcare Conference, 286–305, 2017. (The medGAN paper, one of the first GANs for EHR data generation.)

10. Health Insurance Portability and Accountability Act (HIPAA) Privacy Rule, 45 CFR §164. (Regulation governing use and disclosure of PHI. Allows creation of de-identified data, under which properly synthesized data falls.)

11. NIST. “De-identification of Personal Information.” NISTIR 8053, 2015. (General reference on de-identification techniques; provides context on the safe harbor and expert determination methods under HIPAA.)

Published

2023-05-01

How to Cite

Joseph, J. (2023). DiffusionClaims – PHI-Safe Synthetic Claims for Robust Anomaly Detection. International Journal of Computer Technology and Electronics Communication, 6(3), 6958–6973. https://doi.org/10.15680/IJCTECE.2023.0603003