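Abstract: Document clustering is the process of automatically grouping a document collection into several categories, and it is an effective way to classify textual information. To address the high dimensionality that arises when semi-structured text data is converted into structured data, this paper proposes CASC, a document clustering model based on a convolutional autoencoder. The model draws on the feature-extraction capabilities of convolutional neural networks and autoencoders to embed the data into a low-dimensional latent space while preserving the internal structure of the original data as far as possible, and then clusters the result with a spectral clustering algorithm. Experiments show that the CASC model reduces running time without lowering clustering accuracy, and also reduces the time complexity of the algorithm.
Keywords: clustering; convolutional neural network; autoencoder; unsupervised model
CLC number: TP391; TN911.2  Document code: A  Article ID: 2096-4706(2018)02-0012-04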
A Document Clustering Model Based on Convolutional Autoencoder
FENG Yongqiang¹, LI Yajun²
(1. Tianjin Haihe Dairy Company, Tianjin 300410, China; 2. College of Computer Science and Information Engineering, Tianjin University of Science and Technology, Tianjin 300457, China)
Abstract: Document clustering is the process of automatically grouping a set of documents into several categories, and it is an effective means of organizing textual information. To address the high dimensionality that arises when semi-structured text data is converted into structured data, this paper proposes CASC, a document clustering model based on a convolutional autoencoder. Using the feature-extraction capabilities of convolutional neural networks and autoencoders, the model embeds the data into a low-dimensional latent space while retaining the internal structure of the original data as far as possible, and then applies a spectral clustering algorithm to the embedded data. Experiments show that CASC reduces running time without reducing clustering accuracy, and also lowers the time complexity of the algorithm.
Keywords: clustering; convolutional neural network; autoencoder; unsupervised model
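The second stage named in the abstract, spectral clustering on a low-dimensional embedding, can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the six 2-D vectors are made-up stand-ins for encoder outputs (the convolutional autoencoder itself is not shown), and the function name `two_way_spectral_split`, the Gaussian bandwidth `sigma`, and the restriction to two clusters are all introduced here for the example.

```python
import math


def two_way_spectral_split(points, sigma=1.0, iters=200):
    """Split points into two groups by the sign of the second eigenvector
    of the normalized affinity matrix (two-cluster spectral clustering).
    Assumes every point has at least one neighbor with nonzero affinity."""
    n = len(points)
    # Gaussian affinity between every pair of embeddings, zero on the diagonal.
    w = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                d2 = sum((a - b) ** 2 for a, b in zip(points[i], points[j]))
                w[i][j] = math.exp(-d2 / (2.0 * sigma ** 2))
    deg = [sum(row) for row in w]
    # Symmetrically normalized affinity S = D^{-1/2} W D^{-1/2}.
    s = [[w[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
         for i in range(n)]
    # The leading eigenvector of S is proportional to sqrt(deg) and carries
    # no cluster information, so it is projected out at every step.
    norm = math.sqrt(sum(deg))
    u1 = [math.sqrt(d) / norm for d in deg]
    # Power iteration on (S + I) / 2 converges to the second eigenvector of S.
    v = [float(i + 1) for i in range(n)]
    for _ in range(iters):
        dot = sum(a * b for a, b in zip(v, u1))
        v = [a - dot * b for a, b in zip(v, u1)]
        v = [0.5 * (sum(s[i][j] * v[j] for j in range(n)) + v[i])
             for i in range(n)]
        length = math.sqrt(sum(a * a for a in v))
        v = [a / length for a in v]
    # The sign pattern of the second eigenvector gives the two-way partition.
    return [1 if a > 0 else 0 for a in v]


# Hypothetical low-dimensional embeddings, e.g. as an encoder might output.
embeddings = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.3),
              (5.0, 5.0), (5.2, 4.9), (4.9, 5.1)]
labels = two_way_spectral_split(embeddings)
print(labels)
```

Only the grouping is meaningful; which group receives label 0 or 1 depends on the sign of the eigenvector. For more than two clusters, the usual approach is to take several leading eigenvectors and run k-means on their rows, which this sketch omits.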
Funding: Tianjin Science and Technology Plan Project (17KPXMSF00140, 17ZLZXZF00470); Tianjin Science and Technology Project (KJCX-KFQ-CXY-2016-003).
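About the authors:
FENG Yongqiang (b. 1963), male, Han ethnicity, from Tianjin; manager of the R&D department at Tianjin Haihe Dairy Company; senior engineer.
LI Yajun (b. 1993), Han ethnicity, from Xinxiang, Henan; master's student in computer application technology.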