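Abstract: Offensive language appears frequently on social media, so studying efficient and accurate methods for detecting it matters for building friendly online communities. The article first sets out the definition of offensive language, then analyzes the characteristics of the various detection approaches and the potential and advantages of pre-training-based deep learning detection methods. It next describes the preprocessing methods in common use today, together with the strengths, weaknesses, and current status of several typical deep learning models. Finally, it summarizes the challenges facing the field of offensive language detection and the expectations for it.
Keywords: deep learning; offensive language; text classification; data preprocessing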
DOI:10.19850/j.cnki.2096-4706.2022.05.002
Funding: National Natural Science Foundation of China (62172144)
CLC number: TP391.1 Document code: A Article ID: 2096-4706(2022)05-0005-06
A Review of Offensive Language Detection Methods Based on Deep Learning
GUO Bolu, XIONG Xuhui
(College of Computer and Information Engineering, Hubei Normal University, Huangshi 435002, China)
Abstract: Offensive language appears frequently on social media. To build friendly online communities, it is important to study efficient and accurate methods for detecting it. This paper first defines offensive language, then analyzes the characteristics of each detection approach and the potential and advantages of deep learning detection methods based on pre-training. It then reviews the preprocessing methods in common use at present, along with the advantages, disadvantages, and current status of several typical deep learning models. Finally, it summarizes the challenges and expectations facing the field of offensive language detection.
Keywords: deep learning; offensive language; text classification; data preprocessing
GUO Bolu, et al.: A Review of Offensive Language Detection Methods Based on Deep Learning. Modern Information Technology (现代信息科技), No. 5, 2022.03
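About the authors: GUO Bolu (born 1999), female, Han ethnicity, from Jingzhou, Hubei, master's student; main research interest: natural language processing. Corresponding author: XIONG Xuhui (born 1971), male, Han ethnicity, from Huangshi, Hubei, associate professor, master's supervisor, Ph.D. in engineering; main research interests: computer architecture and natural language processing.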