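Abstract: K-means is the most commonly used batch clustering method, but it requires multiple iterations and therefore cannot be applied directly to data stream clustering. Based on adaptive resonance theory (ART), this paper proposes an adaptive two-phase clustering algorithm (ATPC) for data stream clustering. The algorithm consists of two phases, online adaptive micro-clustering and offline global batch clustering; it generates micro-clusters adaptively and has linear computational complexity. Experimental results on real and simulated data streams on the MOA platform confirm the efficiency of the ATPC method.
Keywords: data stream; clustering; two-phase; adaptive resonance theory; micro-cluster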
DOI:10.19850/j.cnki.2096-4706.2021.14.032
Funding: General Program of the Natural Science Foundation of Hunan Province (2019JJ40111)
CLC number: TP311.13  Document code: A  Article ID: 2096-4706(2021)14-0124-03
An Adaptive Two-Stage Clustering Algorithm for Data Stream
LI Zhijie, LIAO Xuhong, LIU Jiwang, JIANG Hua
(School of Information Science and Engineering, Hunan Institute of Science and Technology, Yueyang 414006, China)
Abstract: K-means is the most commonly used batch clustering method; however, it requires multiple iterations and therefore cannot be applied directly to data stream clustering. Based on adaptive resonance theory (ART), this paper proposes an adaptive two-phase clustering algorithm (ATPC) for data streams. ATPC consists of two phases, online adaptive micro-clustering and offline global batch clustering; it generates micro-clusters adaptively and has linear computational complexity. Experimental results on real and simulated data streams validate the efficiency of the ATPC method.
Keywords: data stream; clustering; two-phase; adaptive resonance theory; micro-cluster
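The two-phase structure described in the abstract can be illustrated with a minimal sketch. It is not the authors' ATPC implementation: the abstract does not give the ART match function, the micro-cluster representation, or the offline algorithm, so everything below is an assumption. The sketch replaces the ART vigilance test with a plain Euclidean distance threshold (the hypothetical parameter `vigilance`), stores each micro-cluster as a count plus a linear sum, and uses weighted K-means over the micro-cluster centroids for the offline phase. The names `MicroCluster`, `online_phase`, and `offline_phase` are illustrative only.

```python
import math
import random


class MicroCluster:
    """Summary of absorbed points: a count and a per-dimension linear sum."""

    def __init__(self, point):
        self.n = 1
        self.linear_sum = list(point)

    def centroid(self):
        return [s / self.n for s in self.linear_sum]

    def absorb(self, point):
        self.n += 1
        for i, x in enumerate(point):
            self.linear_sum[i] += x


def online_phase(stream, vigilance):
    """Single pass over the stream (assumed simplification of an ART-style test).

    A point joins its nearest micro-cluster if that centroid lies within
    `vigilance`; otherwise it starts a new micro-cluster.
    """
    micro = []
    for p in stream:
        best, best_d = None, math.inf
        for mc in micro:
            d = math.dist(p, mc.centroid())
            if d < best_d:
                best, best_d = mc, d
        if best is not None and best_d <= vigilance:
            best.absorb(p)
        else:
            micro.append(MicroCluster(p))
    return micro


def offline_phase(micro, k, iters=20, seed=0):
    """Weighted K-means over micro-cluster centroids (assumed offline step)."""
    rng = random.Random(seed)
    pts = [mc.centroid() for mc in micro]
    wts = [mc.n for mc in micro]
    centers = rng.sample(pts, k)  # requires k <= number of micro-clusters
    for _ in range(iters):
        sums = [[0.0] * len(pts[0]) for _ in range(k)]
        mass = [0] * k
        for p, w in zip(pts, wts):
            j = min(range(k), key=lambda c: math.dist(p, centers[c]))
            mass[j] += w
            for i, x in enumerate(p):
                sums[j][i] += w * x
        # Keep the previous center when no micro-cluster was assigned to it.
        centers = [
            [s / mass[j] for s in sums[j]] if mass[j] else centers[j]
            for j in range(k)
        ]
    return centers


# Toy stream with two well-separated groups.
stream = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0), (0.0, 0.1)]
micro = online_phase(stream, vigilance=1.0)
centers = offline_phase(micro, k=2)
```

In this sketch each arriving point is compared only with the current micro-cluster centroids and is seen once, so the online cost grows linearly with the stream length as long as the number of micro-clusters stays bounded. The iterative K-means step runs only on the much smaller set of micro-cluster summaries.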
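About the authors: LI Zhijie (b. 1964), male, Han ethnicity, from Yongxing, Hunan; Ph.D., associate professor; research interest: online learning for big data. LIAO Xuhong (b. 1997), female, Han ethnicity, from Liling, Hunan; master's student; research interest: data stream clustering. LIU Jiwang (b. 1997), male, Tujia ethnicity, from Yongshun, Hunan; master's student; research interest: data stream classification. JIANG Hua (b. 1997), male, Han ethnicity, from Hefei, Anhui; master's student; research interest: data stream classification.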