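Abstract: The author identification task aims to find the author of an anonymous text. In an era of rapid Internet growth, accurately identifying the authors of anonymous texts helps keep the online environment secure. In this task, quantifying text content is critical and directly affects identification accuracy. Based on the term frequency-inverse document frequency (TF-IDF) algorithm, this paper proposes a text quantification method that converts text into vectors. To evaluate the joint effect of Minkowski distance and cosine similarity on author identification, a hybrid distance is proposed for computing the distance between two texts. Experimental results show that, on both a Chinese and an English dataset, quantifying texts with the proposed method effectively improves the accuracy of support vector machine, K-nearest neighbor, and Minkowski distance (p=1 and p=2) approaches in identifying text authors.
Keywords: author identification; text quantification; TF-IDF algorithm; text distance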
DOI:10.19850/j.cnki.2096-4706.2022.19.001
CLC number: TP391.1  Document code: A  Article ID: 2096-4706(2022)19-0001-07
A Text Quantification Method Based on TF-IDF Algorithm and Its Application in Author Identification
LI Chu
(Northeastern University at Qinhuangdao, Qinhuangdao 066099, China)
Abstract: The author identification task aims to find the real author of an anonymous text. In today's booming Internet era, accurately identifying the authors of anonymous texts helps improve the security of the online environment. In this task, quantifying text content is a critical step that directly affects identification accuracy. Based on the term frequency-inverse document frequency (TF-IDF) algorithm, this paper proposes a text quantification method that transforms text into vectors. In addition, to evaluate the joint role of Minkowski distance and cosine similarity in author identification, a hybrid distance is presented for computing the distance between two texts. Experimental results show that, on both an English and a Chinese dataset, quantifying texts with the proposed method effectively improves the accuracy of author identification with Support Vector Machine (SVM), K-Nearest Neighbor (KNN), and Minkowski distance (p=1 and p=2).
Keywords: author identification; text quantification; TF-IDF algorithm; text distance
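As a rough illustration of the ingredients named in the abstract, the sketch below builds plain TF-IDF vectors and computes Minkowski distance and cosine similarity between them. It is not the quantification method proposed in the paper: the IDF variant `log(N / df)` is just one common choice, and because the abstract does not define the hybrid distance, the function `hybrid_distance` and its weight `alpha` are hypothetical, shown only to suggest how the two measures could be blended.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Turn tokenized documents into TF-IDF vectors over a shared vocabulary."""
    vocab = sorted({tok for doc in docs for tok in doc})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(tok for doc in docs for tok in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc) or 1  # guard against empty documents
        vectors.append(
            [(counts[t] / total) * math.log(n_docs / df[t]) for t in vocab]
        )
    return vocab, vectors


def minkowski(u, v, p):
    """Minkowski distance; p=1 is Manhattan, p=2 is Euclidean."""
    return sum(abs(a - b) ** p for a, b in zip(u, v)) ** (1.0 / p)


def cosine_similarity(u, v):
    """Cosine of the angle between u and v (0.0 if either vector is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def hybrid_distance(u, v, p=2, alpha=0.5):
    """Hypothetical blend (an assumption, not the paper's definition):
    alpha * Minkowski distance + (1 - alpha) * (1 - cosine similarity)."""
    return alpha * minkowski(u, v, p) + (1 - alpha) * (1 - cosine_similarity(u, v))
```

For example, `tfidf_vectors([["a", "b"], ["a", "c"]])` gives the shared term `a` a weight of zero in both documents, since a term that occurs in every document carries no discriminating information under this IDF variant.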
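About the author: LI Chu (born 1998), female, Han ethnicity, from Xianning, Hubei; master's student; research interests: text classification, machine learning, and natural language processing.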