AI-Driven Data Cleaning: Intelligent Detection and Correction of Data Errors

Anjali Kapoor

doi:10.15680/IJCTECE.2025.0801004

Authors

Anjali Kapoor, Researcher, OSU, Oregon, USA

DOI:

https://doi.org/10.15680/IJCTECE.2025.0801004

Keywords:

AI-driven data cleaning, Data error detection, Automated correction, Machine learning, Data quality, Anomaly detection, Natural language processing, Data preprocessing, Data consistency, Intelligent systems

Abstract

Data cleaning is a critical step in the data lifecycle that ensures accuracy, consistency, and reliability of datasets used for analytics and decision-making. Traditional data cleaning approaches often rely on static rules and manual intervention, which are time-consuming and insufficient for handling the increasing volume and complexity of modern datasets. This paper presents an AI-driven framework for intelligent detection and correction of data errors, leveraging machine learning and natural language processing to automate and improve data quality processes. The proposed system integrates anomaly detection models, pattern recognition algorithms, and context-aware correction mechanisms to identify and resolve diverse data issues such as missing values, duplicates, inconsistencies, and erroneous entries. Using a combination of supervised and unsupervised learning techniques, the framework adapts dynamically to different data domains and error types, reducing dependence on domain-specific rules. We validate the framework on heterogeneous datasets including financial records, healthcare data, and customer information systems, demonstrating significant improvements in data quality metrics. The AI-driven cleaning approach achieved up to a 30% reduction in error rates compared to baseline rule-based systems while also decreasing manual correction efforts by 50%. Additionally, the system effectively prioritized errors for human review, optimizing resource allocation. This research highlights the advantages of integrating AI into data cleaning workflows, emphasizing scalability, adaptability, and improved accuracy. By automating error detection and suggesting corrections, the framework accelerates data preparation, enabling faster and more reliable analytics. The findings underscore the potential of AI-powered data cleaning as an essential component of modern data management, paving the way for future developments in autonomous data quality assurance.
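The abstract does not give implementation details, so the following is only a minimal sketch of the kind of detection and prioritization step it describes, not the author's framework. It flags missing values, exact duplicates, and numeric outliers, then ranks the findings for human review. Note the substitution: a simple modified z-score (median and median absolute deviation) stands in for the learned anomaly detection models mentioned above. The function name `find_issues`, the field names, the fixed score of 1.0 for missing and duplicate entries, and the 3.5 threshold are all assumptions made for illustration.

```python
from statistics import median


def find_issues(rows, numeric_field, threshold=3.5):
    """Return (row_index, issue, score) tuples, highest score first.

    Hypothetical helper: a statistical stand-in for a learned detector.
    """
    issues = []
    seen = {}
    for i, row in enumerate(rows):
        # Missing values: None or empty string in any field.
        for key, value in row.items():
            if value is None or value == "":
                issues.append((i, f"missing:{key}", 1.0))
        # Exact duplicates: same field/value pairs as an earlier row.
        fingerprint = tuple(sorted((k, str(v)) for k, v in row.items()))
        if fingerprint in seen:
            issues.append((i, f"duplicate_of:{seen[fingerprint]}", 1.0))
        else:
            seen[fingerprint] = i

    # Outliers: modified z-score from the median and the median
    # absolute deviation (MAD), which is robust to extreme values.
    values = [
        (i, row[numeric_field])
        for i, row in enumerate(rows)
        if isinstance(row.get(numeric_field), (int, float))
    ]
    if values:
        med = median(v for _, v in values)
        mad = median(abs(v - med) for _, v in values)
        if mad > 0:
            for i, v in values:
                score = 0.6745 * abs(v - med) / mad
                if score > threshold:
                    issues.append((i, f"outlier:{numeric_field}", score))

    # Highest-scoring findings come first so reviewers see them first.
    return sorted(issues, key=lambda item: item[2], reverse=True)


# Illustrative records (made up for this sketch, not from the paper).
rows = [
    {"id": 1, "amount": 100.0},
    {"id": 2, "amount": 102.0},
    {"id": 3, "amount": 98.0},
    {"id": 4, "amount": 101.0},
    {"id": 5, "amount": 5000.0},  # suspiciously large
    {"id": 6, "amount": None},    # missing value
    {"id": 2, "amount": 102.0},   # repeat of the second record
]

for index, issue, score in find_issues(rows, "amount"):
    print(index, issue, round(score, 2))
```

In a fuller system, the scoring step is where a trained model (for example, an isolation forest or an autoencoder's reconstruction error) would plausibly replace the modified z-score, while the ranking step would stay the same.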

