Enterprise-Scale Privacy Engineering: A Unified Data-Centric Architecture for Masking, Synthetic Data, and Governance across Pre-Production Environments
DOI: https://doi.org/10.15680/IJCTECE.2025.0801012

Keywords: Data-centric architecture, data masking, synthetic data, privacy governance, pre-production environments, enterprise privacy engineering

Abstract
Protecting sensitive data in pre-production systems (development, testing, staging, and quality assurance) is an increasingly pressing challenge for enterprises as regulations tighten, AI workloads proliferate, and data volumes grow. Traditional approaches treat data masking, synthetic data generation, and governance as separate concerns, producing fragmented processes, uneven privacy assurance, elevated re-identification risk, and slow environment provisioning. This paper presents a unified, data-centric architecture that integrates all three elements: high-fidelity masking of structured data through techniques such as substitution, tokenization, and consistent hashing; AI-driven synthetic data generation using models such as GANs, VAEs, and diffusion models to add volume, variety, and diversity; and centralized governance that enforces uniform policy, lineage tracing, access control, privacy budgets, and automated audit logging, with real-time propagation of rules. The hybrid architecture substantially reduces privacy exposure, accelerates environment provisioning by an order of magnitude, lowers storage costs by serving on-demand virtual views instead of full persistent copies, and provides a stable foundation for secure testing, analytics, and large-scale AI development in highly regulated industries. The combined model offers a scalable, efficient approach to managing large volumes of sensitive data while meeting stringent privacy, compliance, and innovation demands.
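To make the masking component of the abstract concrete, the sketch below illustrates consistent hashing as a masking technique: a keyed HMAC maps each sensitive value to a stable pseudonym, so the same value masks identically across tables and environments (preserving referential integrity for testing) while remaining irreversible without the key. This is a minimal illustration under assumed names; the key name, `namespace` parameter, and token length are hypothetical choices, not part of the paper's architecture.

```python
import hmac
import hashlib

# Hypothetical masking key; in practice this would come from a key
# management service, never from source control.
SECRET_KEY = b"rotate-me-via-kms"

def consistent_token(value: str, namespace: str, length: int = 12) -> str:
    """Deterministically map a sensitive value to a stable pseudonym.

    The same (namespace, value) pair always yields the same token, so
    joins across masked tables still line up, while the keyed HMAC
    prevents dictionary or rainbow-table reversal of the mapping.
    """
    digest = hmac.new(SECRET_KEY, f"{namespace}:{value}".encode(), hashlib.sha256)
    return digest.hexdigest()[:length]

# Consistency: the same SSN masks to the same token in every table.
assert consistent_token("123-45-6789", "ssn") == consistent_token("123-45-6789", "ssn")
# Distinct inputs yield distinct tokens (with overwhelming probability).
assert consistent_token("123-45-6789", "ssn") != consistent_token("987-65-4321", "ssn")
```

Note that this sketch is not format-preserving: the output is a hex string, not a value shaped like an SSN. Production masking tools that must keep column formats intact typically use format-preserving encryption or lookup-based substitution instead.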