Leveraging Machine Learning for Accurate Data Lineage Tracking
DOI: https://doi.org/10.15680/IJCTECE.2020.0405002
Keywords: Data Lineage, Machine Learning, Automation, Data Governance, Transparency, Reproducibility, Compliance, Metadata Management, Data Provenance, AI Ethics
Abstract
Accurate data lineage tracking is essential for ensuring transparency, reproducibility, and accountability in data-driven systems. Traditional manual methods of documenting data transformations are often error-prone and unsustainable at scale. This paper explores the application of machine learning (ML) techniques to automate and enhance data lineage tracking. By leveraging ML models, organizations can achieve more precise and scalable lineage documentation, facilitating better governance and compliance. We discuss various ML approaches, their integration into data pipelines, and the benefits and challenges associated with their implementation.