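Abstract: The K-means algorithm is easily affected by initial values and outliers, its number of clusters is chosen by human experience, and its initial cluster centers are selected at random. To address these shortcomings, a K-means clustering algorithm based on an improved Canopy algorithm is proposed. The initial data set is first preprocessed and classified; special thresholds are then selected and the improved Canopy algorithm is used to obtain the number of clusters and the initial cluster centers; the K-means algorithm is then run to produce the final clustering. Tests show that the improved algorithm reduces the dependence on manual selection and markedly improves clustering accuracy. Finally, the improved algorithm is applied to a customer segmentation example and achieves good classification results, which demonstrates the practicality of the optimized algorithm.
Keywords: Canopy algorithm; principal component analysis; local density; customer segmentation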
DOI:10.19850/j.cnki.2096-4706.2023.06.028
CLC number: TP301.6 Document code: A Article ID: 2096-4706(2023)06-0111-05
Optimization and Application of K-means Algorithm
FANG Shiqiao, HU Peiling, HUANG Yingying, ZHANG Xin
(College of Mathematics and Informatics, South China Agricultural University, Guangzhou 510642, China)
Abstract: The K-means algorithm is sensitive to initial values and outliers, its number of clusters is usually chosen from experience, and its initial cluster centers are selected at random. To address these shortcomings, a K-means clustering algorithm based on an improved Canopy algorithm is proposed. First, the initial data set is preprocessed and classified. Next, special thresholds are selected and the improved Canopy algorithm is used to obtain the number of clusters and the initial cluster centers. The K-means algorithm is then run to produce the final clustering. Tests show that the improved algorithm reduces the dependence on manual selection and significantly improves clustering accuracy. Finally, the improved algorithm is applied to a customer segmentation example and yields good classification results, which demonstrates the practicability of the optimized algorithm.
Keywords: Canopy algorithm; principal component analysis; local density; customer segmentation
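The pipeline summarized in the abstract (a Canopy pass that supplies the number of clusters and the initial centers, followed by K-means) can be illustrated with a minimal sketch. It uses the classic two-threshold Canopy procedure, not the improved variant proposed in this paper: the preprocessing step and the paper's threshold selection are not reproduced. The function names, the thresholds `t1` and `t2`, the deterministic seed choice, and the toy data are all assumptions made for illustration.

```python
import math


def mean(points):
    """Coordinate-wise mean of a non-empty list of equal-length tuples."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def canopy_centers(points, t1, t2):
    """Classic Canopy pass with loose threshold t1 and tight threshold t2.

    Returns one center per canopy; the number of centers serves as k.
    """
    if t1 <= t2:
        raise ValueError("t1 must be larger than t2")
    remaining = list(points)
    centers = []
    while remaining:
        seed = remaining[0]  # assumption: deterministic seed (often random)
        # Points within t1 of the seed form the canopy.
        members = [p for p in remaining if math.dist(seed, p) < t1]
        centers.append(mean(members))
        # Points within t2 of the seed can no longer start a new canopy.
        remaining = [p for p in remaining if math.dist(seed, p) >= t2]
    return centers


def kmeans(points, centers, max_iter=100):
    """Plain Lloyd iterations starting from the supplied centers."""
    centers = [tuple(c) for c in centers]
    labels = []
    for _ in range(max_iter):
        # Assignment step: index of the nearest center for each point.
        labels = [min(range(len(centers)), key=lambda i: math.dist(p, centers[i]))
                  for p in points]
        # Update step: recompute each center; keep the old one if a cluster is empty.
        new_centers = []
        for i, old in enumerate(centers):
            cluster = [p for p, lab in zip(points, labels) if lab == i]
            new_centers.append(mean(cluster) if cluster else old)
        if new_centers == centers:
            break
        centers = new_centers
    return centers, labels


# Toy data (hypothetical): two well-separated groups of four points each.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11)]
initial = canopy_centers(points, t1=3.0, t2=1.5)  # two canopies, so k = 2
centers, labels = kmeans(points, initial)
```

In this sketch the Canopy pass replaces both manual choices that the abstract criticizes: `len(initial)` plays the role of the cluster count, and `initial` replaces random initialization. The result still depends on how `t1` and `t2` are chosen, which is the part the paper's improved algorithm addresses.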
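About the authors: FANG Shiqiao (born 2000), female, Han nationality, from Shenzhen, Guangdong, undergraduate student, research interest: mathematics and applied mathematics; HU Peiling (born 2001), female, Han nationality, from Guangzhou, Guangdong, undergraduate student, research interest: mathematics and applied mathematics; HUANG Yingying (born 2001), female, Han nationality, from Heyuan, Guangdong, undergraduate student, research interest: information and computing science.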