Leveraging Machine Learning for Accurate Data Lineage Tracking

Neil Karan Saraf

doi:10.15680/IJCTECE.2020.0405002

Authors

Neil Karan Saraf, Department of IT, GGDSD College, Chandigarh, Punjab, India

Keywords:

Data Lineage, Machine Learning, Automation, Data Governance, Transparency, Reproducibility, Compliance, Metadata Management, Data Provenance, AI Ethics

Abstract

Accurate data lineage tracking is essential for ensuring transparency, reproducibility, and accountability in data-driven systems. Traditional manual methods of documenting data transformations are often error-prone and unsustainable at scale. This paper explores the application of machine learning (ML) techniques to automate and enhance data lineage tracking. By leveraging ML models, organizations can achieve more precise and scalable lineage documentation, facilitating better governance and compliance. We discuss various ML approaches, their integration into data pipelines, and the benefits and challenges associated with their implementation.
