Enterprise-Scale Privacy Engineering: A Unified Data-Centric Architecture for Masking, Synthetic Data, and Governance across Pre-Production Environments
DOI: https://doi.org/10.15680/IJCTECE.2025.0801012

Keywords: Data-centric architecture, data masking, synthetic data, privacy governance, pre-production environments, enterprise privacy engineering

Abstract
Protecting sensitive data in pre-production systems (development, testing, staging, and quality assurance) is an increasingly pressing challenge for enterprises as regulations tighten, AI workloads proliferate, and data volumes grow. Traditional approaches treat data masking, synthetic data generation, and governance as separate concerns, producing fragmented processes, uneven privacy assurance, elevated re-identification risk, and slow environment provisioning. This paper presents a unified, data-centric architecture that integrates all three elements: high-fidelity masking of structured data through techniques such as substitution, tokenization, and consistent hashing; AI-driven synthetic data generation using models such as GANs, VAEs, and diffusion models to add volume, variety, and diversity; and centralized governance that enforces uniform policy, lineage tracing, access control, privacy budgets, and automated audit logging, with real-time propagation of rules. The hybrid architecture substantially reduces privacy exposure, accelerates environment provisioning by an order of magnitude, lowers storage costs by serving on-demand virtual views instead of full persistent copies, and provides a stable foundation for secure testing, analytics, and large-scale AI development in highly regulated industries. The combined model offers a scalable, efficient approach to managing large volumes of sensitive data while meeting stringent privacy, compliance, and innovation demands.
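To make the masking component of the abstract concrete, the sketch below illustrates consistent hashing as a masking technique: a keyed HMAC maps each sensitive value to a stable pseudonym, so the same value masks identically across tables and environments (preserving referential integrity for testing) while remaining irreversible without the key. This is a minimal illustration under assumed names; the key name, `namespace` parameter, and token length are hypothetical choices, not part of the paper's architecture.

```python
import hmac
import hashlib

# Hypothetical masking key; in practice this would come from a key
# management service, never from source control.
SECRET_KEY = b"rotate-me-via-kms"

def consistent_token(value: str, namespace: str, length: int = 12) -> str:
    """Deterministically map a sensitive value to a stable pseudonym.

    The same (namespace, value) pair always yields the same token, so
    joins across masked tables still line up, while the keyed HMAC
    prevents dictionary or rainbow-table reversal of the mapping.
    """
    digest = hmac.new(SECRET_KEY, f"{namespace}:{value}".encode(), hashlib.sha256)
    return digest.hexdigest()[:length]

# Consistency: the same SSN masks to the same token in every table.
assert consistent_token("123-45-6789", "ssn") == consistent_token("123-45-6789", "ssn")
# Distinct inputs yield distinct tokens (with overwhelming probability).
assert consistent_token("123-45-6789", "ssn") != consistent_token("987-65-4321", "ssn")
```

Note that this sketch is not format-preserving: the output is a hex string, not a value shaped like an SSN. Production masking tools that must keep column formats intact typically use format-preserving encryption or lookup-based substitution instead.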