
Information Technology, 2019, Issue 14

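Research on Online Comparability Calculation Integrating a Topic Model
ZHAO Yongbiao, ZHANG Qilin, GU Qiong
(School of Computer Engineering, Hubei University of Arts and Science, Xiangyang, Hubei 441053, China)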

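Abstract: Mining comparable texts online is one feasible way to build large-scale comparable corpora, and online comparability calculation is a key step in the mining process. This paper proposes an online comparability measure that combines lexical overlap with a topic model. The topic model is Online LDA, which supports online learning; topic mapping is performed with the word alignment tool GIZA++; and the two measures are fused by weighted summation. Tests on a downloaded Chinese-English news corpus show that the fused measure is more accurate than either measure alone.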


Keywords: comparable corpus; comparability; topic model; topic mapping



CLC number: TP391.1         Document code: A         Article ID: 2096-4706(2019)14-0001-04


Online Comparability Measurement Integrating Topic Model

ZHAO Yongbiao, ZHANG Qilin, GU Qiong

(Computer School of Hubei University of Arts and Science, Xiangyang 441053, China)

Abstract: Online mining of bilingual comparable text pairs is one practical approach to building large-scale comparable corpora, and online comparability calculation is a key part of the mining process. We propose an online comparability measurement that integrates two measurements, one based on word overlap and one based on a topic model. For the topic model, we choose Online LDA, which can be trained online. For topic mapping, we use the word alignment package GIZA++. For integration, we adopt weighted summation. Test results on a downloaded Chinese-English news collection show that the combination of the two measurements is more accurate than either of them alone.

Keywords: comparable corpora; comparability; topic model; topic mapping
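The weighted-summation fusion described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the dictionary-based overlap score, the cosine similarity over topic distributions, the hand-written topic mapping (standing in for a mapping derived from GIZA++ alignments), the default weight of 0.5, and all function names are assumptions made for illustration only.

```python
import math


def word_overlap(src_tokens, tgt_tokens, dictionary):
    """Fraction of source tokens that have at least one translation in the target text.

    `dictionary` is a hypothetical bilingual lexicon: source word -> set of target words.
    """
    if not src_tokens:
        return 0.0
    tgt_vocab = set(tgt_tokens)
    hits = sum(1 for w in src_tokens if tgt_vocab & dictionary.get(w, set()))
    return hits / len(src_tokens)


def map_topics(src_dist, mapping, n_tgt_topics):
    """Project a source-language topic distribution into the target topic space.

    `mapping` (source topic index -> target topic index) is assumed to be given;
    here it is written by hand rather than produced by a word aligner.
    """
    projected = [0.0] * n_tgt_topics
    for src_topic, prob in enumerate(src_dist):
        tgt_topic = mapping.get(src_topic)
        if tgt_topic is not None:
            projected[tgt_topic] += prob
    return projected


def cosine(p, q):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(p, q))
    norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q))
    return dot / norm if norm else 0.0


def fused_comparability(overlap_score, topic_score, weight=0.5):
    """Weighted sum of the two scores; `weight` is an illustrative value, not a tuned one."""
    return weight * overlap_score + (1.0 - weight) * topic_score


# Toy example with made-up data.
lexicon = {"经济": {"economy", "economic"}, "增长": {"growth"}, "市场": {"market"}}
overlap = word_overlap(["经济", "增长", "政策"], ["economy", "growth", "slows"], lexicon)
topic_sim = cosine(map_topics([0.7, 0.3], {0: 1, 1: 0}, 2), [0.3, 0.7])
score = fused_comparability(overlap, topic_sim)
```

In this sketch both component scores lie in [0, 1], so the fused score does as well; how the weight is chosen in the paper is not stated in the abstract.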


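Funding: Scientific Research Program of the State Language Commission for the 13th Five-Year Plan: Research on Online Mining of Web Comparable Corpora Based on Topic Models (Grant No. YB135-22); Scientific Research Program of the State Language Commission for the 13th Five-Year Plan: Creation, Promotion and Application of a Calligraphy Font Library for the Northern Song Calligrapher Mi Fu (Grant No. YB135-33).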




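About the author: ZHAO Yongbiao (1980-), male, Han ethnicity, born in Honghu, Hubei; lecturer, master's degree. Research interests: teaching and research in natural language processing.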