An Improved Undersampling Algorithm Based on Density Peak Clustering
LI Xin
(Capital University of Economics and Business, Beijing 100026, China)
Abstract: Imbalanced data appear in a growing number of fields, yet traditional machine learning classification algorithms often neglect classification accuracy on minority-class samples. To address this problem, an improved undersampling algorithm based on density peak clustering is proposed. The algorithm uses information entropy to optimize the density peak clustering algorithm and obtain the optimal cutoff distance. It then selects points with large density and distance values as cluster centers and uses all of the cluster centers to represent the entire majority-class dataset. Comparative experiments against several undersampling algorithms show that the method effectively improves prediction accuracy for the minority class in imbalanced datasets.
Keywords: data mining; imbalanced data; undersampling; density peak clustering
DOI: 10.19850/j.cnki.2096-4706.2022.18.019
CLC number: TP18  Document code: A  Article ID: 2096-4706(2022)18-0081-04
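The procedure outlined in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation, and several details are assumptions because the abstract does not specify them: local density uses a Gaussian kernel; the cutoff distance is taken from a small candidate list by minimising the Shannon entropy of the normalised densities; cluster centers are ranked by the product of density and distance; and the number of retained points (`n_keep`) is supplied by the caller. All function names are hypothetical.

```python
import numpy as np


def pairwise_distances(X):
    """Euclidean distance matrix for the rows of X."""
    diff = X[:, None, :] - X[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


def local_density(D, d_c):
    """Gaussian-kernel local density (assumed kernel).

    The self term exp(0) = 1 is kept, so every density is at least 1.
    """
    return np.exp(-(D / d_c) ** 2).sum(axis=1)


def density_entropy(rho):
    """Shannon entropy of the densities normalised to sum to 1."""
    p = rho / rho.sum()
    return float(-(p * np.log(p)).sum())


def choose_cutoff(D, candidates):
    """Assumed criterion: the candidate d_c with the lowest entropy."""
    entropies = [density_entropy(local_density(D, d_c)) for d_c in candidates]
    return float(candidates[int(np.argmin(entropies))])


def delta_distance(D, rho):
    """Distance from each point to its nearest point of higher density."""
    order = np.argsort(-rho, kind="stable")  # densest point first
    delta = np.empty(len(rho))
    # The densest point has no denser neighbour; use its largest distance.
    delta[order[0]] = D[order[0]].max()
    for rank in range(1, len(order)):
        i = order[rank]
        delta[i] = D[i, order[:rank]].min()
    return delta


def dpc_undersample(X_major, n_keep, candidates=None):
    """Return indices of majority-class points kept as cluster centers."""
    D = pairwise_distances(X_major)
    if candidates is None:
        # Assumed candidate grid: low percentiles of the nonzero distances.
        candidates = np.percentile(D[D > 0], [1, 2, 5, 10, 20])
    d_c = choose_cutoff(D, np.asarray(candidates, dtype=float))
    rho = local_density(D, d_c)
    delta = delta_distance(D, rho)
    gamma = rho * delta  # large for dense points far from any denser point
    keep = np.argsort(-gamma, kind="stable")[:n_keep]
    return np.sort(keep), d_c
```

In this sketch the balanced training set would be formed by concatenating `X_major[keep]` with the untouched minority-class samples, for example with `n_keep` set to the minority-class size. The pairwise distance matrix needs O(n²) memory, so the sketch suits only moderately sized majority classes.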
About the author: LI Xin (born September 1998), female, Han ethnicity, from Xiangyang, Hubei; master's student; research interest: data mining.