Multi-Dimensional Data Quality Scoring for Reliable Machine Learning Training in Enterprise Environments
DOI:
https://doi.org/10.15680/IJCTECE.2024.0705009
Keywords:
Multi-Dimensional Data Quality, Enterprise Data Management, Training Data Selection, Data Quality Scoring, Machine Learning Reliability
Abstract
Enterprise machine learning systems frequently fail not because of model design but because the training data is of low quality. Conventional data validation relies on basic checks and leaves the most problematic areas, such as semantic accuracy, completeness, and traceability, unaddressed. This paper presents a Multi-Dimensional Data Quality Scoring Framework that evaluates data along five dimensions (correctness, completeness, consistency, semantics, and novelty) and integrates them into a single quality score. The framework is compatible with ML pipelines, allows high-quality records to be selected automatically, and makes the training process more efficient. Experiments on real enterprise datasets showed that applying the framework raised average data quality scores by 19-44 percent across the dimensions, improved model accuracy from 78.4% to 89.6%, and reduced the number of training epochs from 42 to 27. These results indicate that systematic data quality scoring improves reliability, reduces processing cost, and supports trustworthy AI in business institutions.
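The abstract does not state how the five dimension scores are combined, so the following is only a minimal sketch under stated assumptions: each dimension is scored in [0, 1], the overall score is a weighted mean, and records are kept when that score meets a cutoff. The function names, the equal default weights, and the 0.8 threshold are hypothetical and are not taken from the paper.

```python
# Dimension names follow the abstract; everything else here is an assumption.
DIMENSIONS = ("correctness", "completeness", "consistency", "semantics", "novelty")


def quality_score(scores, weights=None):
    """Combine per-dimension scores in [0, 1] into one weighted mean.

    The weighted mean is an assumed aggregation rule, not the paper's.
    """
    if weights is None:
        # Hypothetical default: all five dimensions count equally.
        weights = {d: 1.0 for d in DIMENSIONS}
    total_weight = sum(weights[d] for d in DIMENSIONS)
    return sum(weights[d] * scores[d] for d in DIMENSIONS) / total_weight


def select_records(records, threshold=0.8, weights=None):
    """Keep records whose overall score meets the (hypothetical) threshold."""
    return [r for r in records if quality_score(r["scores"], weights) >= threshold]


# Illustrative records with made-up dimension scores.
records = [
    {"id": "a", "scores": {"correctness": 1.0, "completeness": 1.0,
                           "consistency": 1.0, "semantics": 1.0, "novelty": 0.5}},
    {"id": "b", "scores": {"correctness": 0.5, "completeness": 0.5,
                           "consistency": 0.5, "semantics": 1.0, "novelty": 1.0}},
]

selected = select_records(records)
print([r["id"] for r in selected])
```

With these illustrative inputs, record "a" averages 0.9 and is kept, while record "b" averages 0.7 and is filtered out before training. Adjusting the weights would let a pipeline emphasise, for example, correctness over novelty.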

