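Abstract: To address the problem of incomplete breast cancer data, this study proposes a kmeans-KNN method for handling missing values. The training set is first clustered and its missing values are filled in with KNN; a linear regression model trained on the completed training set then fills in the missing values of the test set. Next, the machine learning algorithms XGBoost, RF, KNN, and SVM are trained on the completed training set, and the resulting models are evaluated on the completed test set. The results show that kmeans-KNN outperforms common imputation methods such as EM and MICE in missing-value preprocessing, and that kmeans-KNN+SVM achieves the best accuracy and AUC.
Keywords: incomplete data; breast cancer; diagnosis prediction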
DOI: 10.19850/j.cnki.2096-4706.2021.07.013
CLC number: R737.9    Document code: A    Article ID: 2096-4706(2021)07-0050-04
Model Prediction Research Based on Incomplete Breast Cancer Data
DENG Yufang
(School of Computer, Electronics and Information, Guangxi University, Nanning 530004, China)
Abstract: To address the problem of incomplete breast cancer data, this study proposes the kmeans-KNN method for handling missing values. First, the training set is clustered and KNN is used to fill in its missing values; a linear regression model trained on the completed training set then fills in the missing values in the test set. Next, the machine learning algorithms XGBoost, RF, KNN, and SVM are trained on the completed training set, and the resulting models are evaluated on the completed test set. The results show that kmeans-KNN performs better than EM, MICE, and other common missing-value imputation methods in missing-value preprocessing, and that kmeans-KNN+SVM achieves the best accuracy and AUC.
Keywords: incomplete data; breast cancer; diagnosis prediction
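The training-set imputation step summarized in the abstract (cluster first, then fill missing values with KNN) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the abstract does not specify how clustering handles rows that still contain missing values, so the sketch assumes a temporary column-mean fill before k-means, restricts neighbors to the same cluster, and averages the neighbors' observed values. All function names and parameters (`kmeans_knn_impute`, `n_clusters`, `n_neighbors`, `seed`) are hypothetical. The linear-regression fill of the test set and the downstream classifiers are omitted.

```python
import math
import random


def mean_fill(rows):
    """Replace None with the column mean (assumed pre-step so k-means can run)."""
    means = []
    for col in zip(*rows):
        seen = [v for v in col if v is not None]
        means.append(sum(seen) / len(seen) if seen else 0.0)
    return [[means[j] if v is None else v for j, v in enumerate(r)] for r in rows]


def dist(a, b, cols=None):
    """Euclidean distance over the given column indices (all columns by default)."""
    cols = range(len(a)) if cols is None else cols
    return math.sqrt(sum((a[j] - b[j]) ** 2 for j in cols))


def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's k-means; returns one cluster label per point."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: dist(p, centers[c])) for p in points]
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old center if a cluster becomes empty
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def kmeans_knn_impute(rows, n_clusters=2, n_neighbors=3, seed=0):
    """Fill None entries using the nearest neighbors inside the same cluster."""
    filled = mean_fill(rows)
    labels = kmeans(filled, n_clusters, seed=seed)
    result = [list(r) for r in filled]
    for i, row in enumerate(rows):
        observed = [j for j, v in enumerate(row) if v is not None]
        for j, value in enumerate(row):
            if value is not None:
                continue
            # Candidates: other rows in the same cluster that observed feature j.
            cands = [m for m in range(len(rows))
                     if m != i and labels[m] == labels[i] and rows[m][j] is not None]
            if not cands:
                continue  # no usable neighbor: the column mean stays in place
            # Rank candidates by distance over the features row i actually observed.
            cands.sort(key=lambda m: dist(filled[i], filled[m], observed))
            nearest = cands[:n_neighbors]
            result[i][j] = sum(rows[m][j] for m in nearest) / len(nearest)
    return result


# Toy data: two well-separated groups, each with one missing value.
toy = [[1.0, 1.1], [0.9, 1.0], [1.1, None],
       [10.0, 10.2], [9.8, 10.0], [10.1, None]]
completed = kmeans_knn_impute(toy, n_clusters=2, n_neighbors=3)
```

Restricting the neighbor search to a single cluster keeps a missing value from being averaged across dissimilar groups, which is the presumed motivation for clustering before KNN.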
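About the author: DENG Yufang (born October 1996), female, Han ethnicity, from Nanning, Guangxi; master's student; research interests: machine learning and data mining.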