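Research on Error Correction Methods of Third-generation Sequencing Data Based on Crowdsourcing Annotation - Modern Information Technology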


Information Technology, 2022, Issue 11

Research on Error Correction Methods of Third-generation Sequencing Data Based on Crowdsourcing Annotation

DAI Daocheng

(School of Finance, Xi'an Eurasia University, Xi'an 710065, China)

Abstract: Third-generation sequencing technology has become the main development direction of gene sequencing thanks to its long average read length and short sequencing cycle. However, its errors are randomly distributed and its sequencing error rate is 15%, which inevitably affects downstream analysis of gene sequences. To address the shortcomings of existing error correction methods, this paper proposes CTDC, an error correction method for third-generation sequencing data based on crowdsourcing annotation. CTDC treats the more reliable second-generation sequencing data as annotation workers and corrects the third-generation sequencing data sites to be annotated by combining the workers' consistency, ability, and accuracy. Experimental results show that CTDC achieves better accuracy and performance than existing error correction methods for third-generation sequencing data.

Keywords: third-generation sequencing technology; sequencing error rate; hybrid error correction; crowdsourcing annotation

DOI: 10.19850/j.cnki.2096-4706.2022.011.002

CLC number: TP391  Document code: A  Article ID: 2096-4706(2022)11-0006-05
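The abstract describes the correction step only at a high level, and the paper's actual formulas are not given here. As a rough illustration of the general idea, the following minimal sketch aggregates short-read "worker" votes at each long-read site by weighted majority. It rests on several assumptions that are not taken from the paper: each worker's consistency, ability, and accuracy are already available as scores in [0, 1], the three scores are combined by simple multiplication, and only substitution errors are handled (no insertions or deletions). All names (`ShortReadVote`, `vote_weight`, `correct_site`, `correct_read`) are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class ShortReadVote:
    """One aligned short-read base acting as a 'worker' label for a long-read site."""
    base: str           # base proposed by the short read at this site
    consistency: float  # hypothetical score in [0, 1]
    ability: float      # hypothetical score in [0, 1]
    accuracy: float     # hypothetical score in [0, 1]


def vote_weight(vote: ShortReadVote) -> float:
    # Assumption: the three scores are combined by multiplication.
    # The paper may combine them differently.
    return vote.consistency * vote.ability * vote.accuracy


def correct_site(original_base: str, votes: list[ShortReadVote]) -> str:
    """Return the base with the largest total vote weight, or the original base
    when no short read covers the site or every vote has zero weight."""
    if not votes:
        return original_base
    totals: dict[str, float] = defaultdict(float)
    for vote in votes:
        totals[vote.base] += vote_weight(vote)
    best = max(totals, key=totals.get)
    return best if totals[best] > 0.0 else original_base


def correct_read(long_read: str, votes_by_pos: dict[int, list[ShortReadVote]]) -> str:
    """Apply site-wise weighted voting across a long read (substitutions only)."""
    return "".join(
        correct_site(base, votes_by_pos.get(pos, []))
        for pos, base in enumerate(long_read)
    )


if __name__ == "__main__":
    votes = {
        1: [
            ShortReadVote("C", 0.9, 0.9, 0.9),
            ShortReadVote("C", 0.8, 0.8, 0.8),
            ShortReadVote("G", 0.5, 0.5, 0.5),
        ]
    }
    print(correct_read("AGGT", votes))
```

In this toy input, the two higher-weight votes for `C` outweigh the single low-weight vote for `G` at position 1, so the site is corrected; positions with no short-read coverage keep their original base.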


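About the author: DAI Daocheng (born 1995), male, Han ethnicity, from Xi'an, Shaanxi; teacher, master's degree; research interest: bioinformatics.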
