Leveraging Machine Learning for Accurate Data Lineage Tracking

Authors

  • Neil Karan Saraf, Department of IT, GGDSD College, Chandigarh, Punjab, India

DOI:

https://doi.org/10.15680/IJCTECE.2020.0405002

Keywords:

Data Lineage, Machine Learning, Automation, Data Governance, Transparency, Reproducibility, Compliance, Metadata Management, Data Provenance, AI Ethics

Abstract

Accurate data lineage tracking is essential for ensuring transparency, reproducibility, and accountability in data-driven systems. Traditional manual methods of documenting data transformations are often error-prone and unsustainable at scale. This paper explores the application of machine learning (ML) techniques to automate and enhance data lineage tracking. By leveraging ML models, organizations can achieve more precise and scalable lineage documentation, facilitating better governance and compliance. We discuss various ML approaches, their integration into data pipelines, and the benefits and challenges associated with their implementation.

Published

2021-09-01

How to Cite

Saraf, N. K. (2021). Leveraging Machine Learning for Accurate Data Lineage Tracking. International Journal of Computer Technology and Electronics Communication, 4(5), 4004-4008. https://doi.org/10.15680/IJCTECE.2020.0405002