Detection of Similar Duplicate Records of Big Data Based on K-means
ZHANG Ping1, CHENG Xinlian2
(1. School of Information Engineering, Anhui Vocational and Technical College, Hefei 230011, China; 2. Jiashan D-max Electronics Co., Ltd., Jiaxing 314100, China)
Abstract: Large enterprises now store massive amounts of data, but the quality of that data is a concern: similar, duplicate, and redundant records are widespread, and merging multiple data sources makes the redundancy worse. Detecting similar records in big data is therefore an important direction in data cleaning research. To address the detection of similar duplicate records in big data, this paper proposes a detection algorithm based on K-means grouping and clustering. Experimental analysis shows that the method improves detection efficiency while keeping accuracy unchanged.
Keywords: similar duplicate record; K-means; SNM
DOI: 10.19850/j.cnki.2096-4706.2022.08.025
Fund project: 2021 University-level Quality Engineering Project (2021xjtz107)
CLC number: TP18  Document code: A  Article ID: 2096-4706(2022)08-0089-03
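The abstract names only the ingredients of the approach: K-means grouping, with SNM (the sorted neighborhood method) listed as a keyword. The following is a minimal sketch of how the two could be combined, not the authors' implementation. It assumes records are plain strings; the feature map `featurize`, the similarity measure (`difflib.SequenceMatcher`), and the parameters `k`, `window`, and `threshold` are all hypothetical choices made for illustration.

```python
from difflib import SequenceMatcher

DIMS = 8  # hypothetical feature dimension, chosen only for illustration


def featurize(record):
    """Map a string to a small character-count vector (an assumed feature map)."""
    vec = [0.0] * DIMS
    for ch in record:
        vec[ord(ch) % DIMS] += 1.0
    return tuple(vec)


def sq_dist(p, q):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, iterations=20):
    """Plain Lloyd iterations with deterministic, evenly spaced seeding."""
    centroids = [points[i * len(points) // k] for i in range(k)]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        labels = [min(range(k), key=lambda c: sq_dist(p, centroids[c])) for p in points]
        # Update step: recompute each centroid as the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels


def snm_pairs(records, indices, window, threshold):
    """Sorted neighborhood pass: sort one group, compare records inside a sliding window."""
    ordered = sorted(indices, key=lambda i: records[i])
    pairs = set()
    for pos, i in enumerate(ordered):
        for j in ordered[pos + 1: pos + window]:
            if SequenceMatcher(None, records[i], records[j]).ratio() >= threshold:
                pairs.add((min(i, j), max(i, j)))
    return pairs


def detect_duplicates(records, k=2, window=3, threshold=0.85):
    """Group records with K-means, then run the windowed comparison inside each group."""
    if not records:
        return set()
    labels = kmeans([featurize(r) for r in records], k)
    pairs = set()
    for c in range(k):
        group = [i for i, lab in enumerate(labels) if lab == c]
        pairs |= snm_pairs(records, group, window, threshold)
    return pairs
```

In this sketch, grouping first means the windowed comparison runs over several short sorted lists instead of one long one. The trade-off is that a pair of similar records assigned to different clusters is never compared, so the choice of features and of `k` affects recall.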
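About the author: ZHANG Ping (b. 1981), male, Han ethnicity, from Tongling, Anhui; lecturer, master's degree; research interests: data cleaning and data mining.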