Improving Data Quality and Deduplication Using Similarity Scoring and Confidence Models

Authors

  • Sravan Kumar Kunadi, Independent Researcher, USA

DOI:

https://doi.org/10.15680/IJCTECE.2024.0704014

Keywords:

similarity scoring, deduplication, duplicate detection, confidence models, entity resolution, record linkage, data cleansing

Abstract

In data-intensive businesses, decision making, analytics, customer management, and operational efficiency all depend on data quality. Large datasets, however, typically contain substantial redundancy, inconsistency, stale data, and incorrectly linked records, which undermine trust in the data and cause serious problems downstream. This paper proposes a practical framework for improving data quality through intelligent duplicate detection and confidence-based record verification. The framework combines similarity scoring techniques, namely string matching, attribute comparison, and weighted field-level analysis, with confidence models that estimate the likelihood that two records represent the same real-world entity. To reduce both false positives and false negatives, it applies deterministic and probabilistic matching together to refine the deduplication rules. It is designed for heterogeneous datasets in which spelling variations, formatting differences, abbreviations, and missing values are common. The proposed solution also includes a confidence threshold mechanism that supports automated, semi-automated, and manual review workflows, which adds scalability and assurance to the cleansing process. The findings indicate that similarity-based confidence modelling substantially improves entity resolution, consistency, and the overall reliability of enterprise data assets. The study also contributes to data management and data governance by offering a focused yet generalizable approach for organizations that need high-quality, complete, and usable data in a complex digital landscape.
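The combination of weighted field-level similarity and threshold-based routing described in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation: the field names, weights, and both thresholds are hypothetical values chosen for illustration, the standard-library `difflib.SequenceMatcher` stands in for whichever string-matching measures the paper uses, and the weighted average is a simple placeholder for the paper's confidence model.

```python
from difflib import SequenceMatcher

# Hypothetical field weights and thresholds, chosen only for illustration.
FIELD_WEIGHTS = {"name": 0.5, "email": 0.3, "city": 0.2}
AUTO_MERGE_THRESHOLD = 0.90
REVIEW_THRESHOLD = 0.70


def field_similarity(a, b):
    """Return a string similarity in [0, 1], or None if either value is missing."""
    if not a or not b:
        return None
    # Normalise case and surrounding whitespace before comparing.
    return SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()


def match_confidence(rec_a, rec_b, weights=FIELD_WEIGHTS):
    """Weighted average of field similarities over fields present in both records."""
    total = 0.0
    weight_sum = 0.0
    for field, weight in weights.items():
        sim = field_similarity(rec_a.get(field), rec_b.get(field))
        if sim is None:
            continue  # skip missing values instead of penalising them
        total += weight * sim
        weight_sum += weight
    # With no comparable fields there is no evidence of a match.
    return total / weight_sum if weight_sum else 0.0


def route(confidence):
    """Map a confidence score to a handling path using the two thresholds."""
    if confidence >= AUTO_MERGE_THRESHOLD:
        return "auto_merge"   # automated handling
    if confidence >= REVIEW_THRESHOLD:
        return "review"       # sent to a person for inspection
    return "no_match"         # treated as distinct records


if __name__ == "__main__":
    a = {"name": "Jon Smith", "email": "jon.smith@example.com", "city": "Austin"}
    b = {"name": "John Smith", "email": "jon.smith@example.com", "city": None}
    score = match_confidence(a, b)
    print(round(score, 3), route(score))
```

Skipping fields with missing values and renormalising the weights is one possible design choice; a production system might instead treat missing values as weak evidence in either direction.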

Published

2024-08-14

How to Cite

Improving Data Quality and Deduplication Using Similarity Scoring and Confidence Models. (2024). International Journal of Computer Technology and Electronics Communication, 7(4), 9200-9211. https://doi.org/10.15680/IJCTECE.2024.0704014
