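Abstract: Based on the characteristics of web forum text data and the structure of web forums, this paper proposes a method for acquiring and storing web forum text data. First, the framework of the web forum data system is built on a Browser/Server (B/S) cloud architecture. Next, web crawler technology is used to collect forum data. A topic-relevance text filtering system is then built on a Bi-LSTM network. Finally, MySQL and MongoDB databases are used to construct the data storage scheme. The system design shows that the proposed method is feasible and provides a basis for the study and guidance of public opinion on web forums.
Keywords: web forum; text data; data acquisition; data storage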
CLC number: TP391.1 Document code: A Article ID: 2096-4706(2021)01-0007-06
Research on Text Data Acquisition and Storage Method for Network Forum
CAO Huiru¹, CHENG Haixiu², LIAN Songyao³, WANG Yi¹
(1. Guangzhou Institute of Technology, Guangzhou 510075, China; 2. School of Computer Science and Engineering, South China University of Technology, Guangzhou 510640, China; 3. College of Nanfang, Sun Yat-Sen University, Guangzhou 510970, China)
Abstract: Based on the characteristics of web forum text data and the structure of web forums, a method for acquiring and storing such data is proposed. First, the data system framework is constructed on a Browser/Server (B/S) cloud architecture; web crawler technology is then used to collect the forum data, and a topic-relevance text filtering system is built on a Bi-LSTM network. Finally, a data storage scheme is constructed using MySQL and MongoDB databases. The system design shows that the proposed method is feasible and provides a basis for the research and guidance of public opinion on web forums.
Keywords: web forum; text data; data acquisition; data storage
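Funding: General Project of Humanities and Social Sciences Research, Ministry of Education (20YJCZH004); Characteristic Innovation Project for Regular Universities of Guangdong Province (2019GKTSCX075); 2020 Guangdong Provincial Special Fund for Science and Technology Innovation Strategy ("Climbing Program" special fund) project (pdjh2020b1137)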
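About the author: CAO Huiru (b. 1981), female, Han ethnicity, from Weinan, Shaanxi; associate professor, holds a master's degree; main research interests: big data and wireless networks.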