End-to-End Architecture and Implementation of a Unified Lakehouse Platform for Multi-ERP Data Integration using Azure Data Lake and the Databricks Lakehouse Governance Framework
DOI:
https://doi.org/10.15680/IJCTECE.2024.0704005
Keywords:
Data Lakehouse, Multi-ERP Integration, Enterprise Data Architecture, Metadata-Driven Architecture, Databricks Unity Catalog, Azure Data Lake, Cloud Cost Optimization, Data Governance
Abstract
The evolution of enterprise data architecture has historically oscillated between the rigid schema enforcement of the traditional data warehouse and the unmanaged scalability of the data lake, culminating recently in the theoretical convergence of the "Lakehouse." However, existing literature frequently glosses over the engineering realities of implementing this paradigm within a fragmented "Best-of-Breed" landscape, where heterogeneous ERP systems create tenacious data silos and escalating integration costs. This study details the end-to-end implementation of a metadata-driven Lakehouse architecture on Azure and Databricks, utilizing genericized Change Data Capture (CDC) pipelines and the Unity Catalog to enforce governance across a multi-ERP environment. Unlike standard implementations that prioritize theoretical purity, this architecture treats cloud cost optimization as a structural design constraint, employing aggressive auto-scaling policies to decouple expense from data volume. Empirical results demonstrate a reduction in data latency and a significant decrease in compute spend, addressing the common challenges of conventional data platforms, specifically complexity and maintenance overhead. Ultimately, this research argues that the viability of the Unified Lakehouse relies not merely on novel storage formats but on the rigorous application of metadata abstraction to tame the entropy of modern enterprise data.

