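Abstract: Extracting symptom entities and their attributes from traditional Chinese medicine (TCM) clinical records faces two problems: short medical texts carry little semantic information, and the commonly used pipeline approach tends to accumulate errors across tasks. To address them, this paper proposes a deep-learning-based method for symptom entity and attribute extraction. First, a BLSTM-CRF sequence labeling model recognizes "entity/modifier attribute" mentions. Second, "entity-attribute value" candidate pairs with high coverage and low redundancy are generated according to a nearest-match principle with an extended step size. Finally, relation classification is performed with ERNIE-BGRU-MP: ERNIE enriches the contextual information of the text, BGRU extracts global text features, and max pooling filters out redundant and noisy information, which improves the generalization and robustness of the model.
Keywords: entity and attribute extraction; ERNIE; BGRU; max pooling; Chinese medicine informatics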
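The abstract does not spell out how the extended-step nearest-match rule works, so the following is only a minimal sketch under stated assumptions: each recognized attribute is paired with its `step` nearest entities, measured by character gap, so that `step=1` is plain nearest matching and larger values trade redundancy for coverage. The `Span` type, the `gap` and `candidate_pairs` helpers, and the example sentence are all hypothetical and are not taken from the paper.

```python
from typing import List, NamedTuple, Tuple


class Span(NamedTuple):
    """A recognized mention (hypothetical structure, not from the paper)."""
    text: str
    start: int  # offset of the first character
    end: int    # offset one past the last character


def gap(a: Span, b: Span) -> int:
    """Number of characters between two spans (0 if they touch or overlap)."""
    if a.end <= b.start:
        return b.start - a.end
    if b.end <= a.start:
        return a.start - b.end
    return 0


def candidate_pairs(entities: List[Span], attributes: List[Span],
                    step: int = 1) -> List[Tuple[Span, Span]]:
    """Pair each attribute with its `step` nearest entities.

    Assumption: "extended step" means widening plain nearest matching
    (step=1) to the `step` closest entities.
    """
    pairs = []
    for attr in attributes:
        # Sort by distance; break ties by position so the result is deterministic.
        nearest = sorted(entities, key=lambda e: (gap(e, attr), e.start))[:step]
        pairs.extend((entity, attr) for entity in nearest)
    return pairs


# Hypothetical sentence: "cough for three days, severe headache"
entities = [Span("cough", 0, 5), Span("headache", 29, 37)]
attributes = [Span("three days", 10, 20), Span("severe", 22, 28)]

for entity, attr in candidate_pairs(entities, attributes, step=1):
    print(entity.text, "<-", attr.text)
```

With `step=1` each attribute gets exactly one candidate entity; raising `step` adds further candidates for the relation classifier to accept or reject.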
DOI:10.19850/j.cnki.2096-4706.2022.03.019
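Funding: National Key Research and Development Program of China (2019YFC12301); National Natural Science Foundation of China (6214120, 82160955); Natural Science Foundation of Jiangxi Province (20202BAB202019); Science and Technology Project of the Jiangxi Provincial Department of Education (GJJ190863); Jiangxi Province First-Class Discipline Construction Research Start-up Fund, Special Project (SYLXK-ZHYI060)
CLC number: TP391 Document code: A Article ID: 2096-4706(2022)03-0070-06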
Symptom Entity and Attribute Extraction for TCM Electronic Medical Record
HU Dingxing1, DU Jianqiang1, SHI Qiang2, LUO Jigen1, LIU Yong1
(1.School of Computer, Jiangxi University of Chinese Medicine, Nanchang 330004, China; 2.Qihung Medical College, Jiangxi University of Chinese Medicine, Nanchang 330004, China)
Abstract: Extracting entities and attributes of TCM clinical symptoms suffers from the limited semantic information in short medical texts and from the accumulation of errors across tasks caused by common pipeline methods. To address these problems, a symptom entity and attribute extraction method based on deep learning is proposed. First, "entity/modifier attribute" mentions are recognized by a sequence labeling model based on BLSTM-CRF. Second, "entity-attribute value" candidate pairs with high coverage and low redundancy are generated according to a nearest-match principle with an extended step size. Finally, relation classification is completed with ERNIE-BGRU-MP: ERNIE is used to enrich the contextual information of the text, BGRU extracts global text features, and max pooling filters out redundant and noisy information, so as to improve the generalization and robustness of the model.
Keywords: entity and attribute extraction; ERNIE; BGRU; max pooling; Chinese medicine informatics
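To make the max-pooling step in ERNIE-BGRU-MP concrete, here is a minimal pure-Python sketch of max pooling over time. It assumes the BGRU output is a sequence of equal-length hidden-state vectors, one per token. The function name `max_pool_over_time` and the toy numbers are hypothetical, and the paper's actual implementation may differ.

```python
from typing import List, Sequence


def max_pool_over_time(hidden_states: Sequence[Sequence[float]]) -> List[float]:
    """Collapse a sequence of per-token vectors into one fixed-size vector.

    For every feature dimension, only the largest value across all time
    steps is kept, so weaker (possibly redundant or noisy) activations
    are discarded.
    """
    if not hidden_states:
        raise ValueError("hidden_states must contain at least one time step")
    width = len(hidden_states[0])
    if any(len(state) != width for state in hidden_states):
        raise ValueError("all hidden states must have the same dimension")
    # zip(*...) regroups the values by feature dimension instead of by time step.
    return [max(column) for column in zip(*hidden_states)]


# Toy stand-in for BGRU outputs: 3 time steps, 3 features each.
toy_states = [
    [0.1, -0.5, 0.3],
    [0.7, 0.2, -0.1],
    [0.0, 0.4, 0.2],
]
pooled = max_pool_over_time(toy_states)
print(pooled)
```

The pooled vector has the same length whatever the sentence length, which is what lets a fixed-size classification layer follow it.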
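About the authors: HU Dingxing (b. 1996), male, Han ethnicity, from Huanggang, Hubei, master's student; research interests: natural language processing, knowledge graph construction. DU Jianqiang (b. 1968), male, Han ethnicity, from Nanchang, Jiangxi, professor, Ph.D.; research interests: Chinese medicine informatics, data mining. SHI Qiang (b. 1976), male, Han ethnicity, from Nanchang, Jiangxi, associate professor, doctorate in medicine; research interest: patterns of TCM syndrome differentiation. LUO Jigen (b. 1991), male, Han ethnicity, from Pingxiang, Jiangxi, lecturer, master's degree; research interest: natural language processing. LIU Yong (b. 1997), male, Han ethnicity, from Fuzhou, Jiangxi, master's student; research interest: natural language processing.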