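Abstract: Natural language processing has advanced rapidly in recent years, and text classification, one of its basic tasks, has seen major breakthroughs, but these advances have not yet reached practical public security work. Text classification currently relies mainly on statistical and probabilistic models, whose classification performance falls short of that of pre-trained models trained on large corpora. This paper uses pre-trained ERNIE as the feature extraction model, combines SA-Net with the attention mechanism in ERNIE, and uses DPCNN as the deep learning network, forming the ERNIE-SA-DPCNN algorithm. Experiments show that ERNIE-SA-DPCNN outperforms other models on the task of classifying case-description texts from new network-related crime cases.
Keywords: new network-related crime; text classification; ERNIE; SA-Net; DPCNN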
DOI:10.19850/j.cnki.2096-4706.2022.06.017
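Funding: National College Students' Innovation and Entrepreneurship Training Program (202011483011); Zhejiang Provincial Public Welfare Technology Research Program (LGF19G010001); Ministry of Public Security Basic Program for Strengthening Police through Science and Technology (2020GABJC35)
CLC number: TP391  Document code: A  Article ID: 2096-4706(2022)06-0069-06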
Research on Text Classification Based on ERNIE-SA-DPCNN
—Take the Text of New Network Related Crime Cases as an Example
QIU Kaikai 1,3, DING Weijie 2,3, ZHONG Nanjiang 1,3
(1. Department of Computer and Information Security, Zhejiang Police College, Hangzhou 310053, China; 2. Research Institute of Big Data and Network Security, Zhejiang Police College, Hangzhou 310053, China; 3. Key Laboratory of the Ministry of Public Security for Public Security Informatization Application Based on Big Data Architecture, Hangzhou 310053, China)
Abstract: In recent years, the field of natural language processing has developed rapidly. Text classification, one of its basic tasks, has seen major breakthroughs, but these have not yet carried over into the practice of public security work. At present, text classification mainly relies on models based on statistics and probability, whose classification performance is weaker than that of pre-trained models trained on large corpora. In this paper, pre-trained ERNIE is used as the feature extraction model, SA-Net is combined with the attention mechanism in the ERNIE model, and DPCNN is used as the deep learning network, forming the ERNIE-SA-DPCNN algorithm. Experiments show that ERNIE-SA-DPCNN performs better than other models on the task of classifying the case texts of new network-related crime cases.
Keywords: new network-related crime; text classification; ERNIE; SA-Net; DPCNN
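The pipeline named in the abstract (encoder features, then an attention stage with channel shuffling, then a DPCNN-style classifier) can be illustrated with a minimal NumPy sketch. Everything here is an assumption made for illustration: the random matrix `features` is a placeholder for ERNIE token features, `gate_and_shuffle` is a much-simplified stand-in for shuffle attention rather than the SA-Net module itself, the two convolution weights are shared across all pyramid levels for brevity, and all names, sizes, and weights are hypothetical. It does not reproduce how the paper couples SA-Net to ERNIE, which the abstract does not specify.

```python
import numpy as np

rng = np.random.default_rng(0)

def channel_shuffle(x, groups):
    """Interleave channels across groups. x has shape (channels, length)."""
    c, n = x.shape
    return x.reshape(groups, c // groups, n).transpose(1, 0, 2).reshape(c, n)

def gate_and_shuffle(x, groups):
    """Simplified attention stand-in (assumption): scale each channel by the
    sigmoid of its mean over positions, then shuffle channels across groups."""
    gate = 1.0 / (1.0 + np.exp(-x.mean(axis=1, keepdims=True)))
    return channel_shuffle(x * gate, groups)

def conv1d_same(x, w):
    """Zero-padded 1D convolution that preserves length.
    x: (c_in, n); w: (c_out, c_in, k) with odd k."""
    c_out, c_in, k = w.shape
    pad = k // 2
    xp = np.pad(x, ((0, 0), (pad, pad)))
    out = np.zeros((c_out, x.shape[1]))
    for i in range(x.shape[1]):
        out[:, i] = np.tensordot(w, xp[:, i:i + k], axes=([1, 2], [0, 1]))
    return out

def max_pool_stride2(x, size=3):
    """Max pooling with window `size` and stride 2; length n becomes ceil(n / 2)."""
    n = x.shape[1]
    xp = np.pad(x, ((0, 0), (0, size - 1)), constant_values=-np.inf)
    return np.stack([xp[:, i:i + size].max(axis=1) for i in range(0, n, 2)], axis=1)

def dpcnn_head(x, w_a, w_b, w_out):
    """DPCNN-style pyramid: downsample, apply two pre-activation convolutions
    with a shortcut connection, and repeat until one position remains."""
    while x.shape[1] > 1:
        pooled = max_pool_stride2(x)
        y = conv1d_same(np.maximum(pooled, 0.0), w_a)
        y = conv1d_same(np.maximum(y, 0.0), w_b)
        x = y + pooled  # shortcut connection
    logits = w_out @ x[:, 0]
    e = np.exp(logits - logits.max())
    return e / e.sum()  # softmax over classes

# Hypothetical sizes; random values stand in for trained weights and ERNIE output.
channels, length, num_classes = 8, 16, 4
features = rng.normal(size=(channels, length))
w_a = rng.normal(scale=0.1, size=(channels, channels, 3))
w_b = rng.normal(scale=0.1, size=(channels, channels, 3))
w_out = rng.normal(scale=0.1, size=(num_classes, channels))

probs = dpcnn_head(gate_and_shuffle(features, groups=2), w_a, w_b, w_out)
```

Each pyramid level halves the sequence length while keeping the channel count fixed, so the cost of the deeper levels shrinks geometrically. That is the property that makes a deep convolutional head practical on top of a large encoder.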
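About the authors: QIU Kaikai (b. 1999), male, Han ethnicity, from Ningbo, Zhejiang; undergraduate student; main research interest: text mining of network-related crime. Corresponding author: DING Weijie (b. 1980), male, Han ethnicity, from Xiping, Henan; associate professor, master's supervisor, doctoral candidate; main research interests: police big data analysis and governance of network-related crime. ZHONG Nanjiang (b. 1991), male, Han ethnicity, from Qiyang, Hunan; teaching assistant, master's degree; main research interests: rumor detection, fraud detection, and cyberspace security.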