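Abstract: Third-generation sequencing has become the main direction of development in gene sequencing thanks to its long average read length and short sequencing cycle. However, the randomness of its errors and its 15% sequencing error rate inevitably affect downstream analysis of gene sequences. To address the shortcomings of existing error correction methods, this paper proposes CTDC, a third-generation sequencing data error correction method based on crowdsourcing annotation. CTDC treats the more reliable second-generation sequencing data as annotation workers and corrects the third-generation sequencing sites to be annotated by combining the workers' consistency, ability and accuracy. Experimental results show that CTDC achieves better accuracy and performance than existing third-generation sequencing data error correction methods.
Keywords: third-generation sequencing technology; sequencing error rate; hybrid error correction; crowdsourcing annotation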
DOI:10.19850/j.cnki.2096-4706.2022.011.002
CLC number: TP391  Document code: A  Article ID: 2096-4706(2022)11-0006-05
Research on Error Correction Methods of Third-generation Sequencing Data Based on Crowdsourcing Annotation
DAI Daocheng
(School of Finance, Xi'an Eurasia University, Xi'an 710065, China)
Abstract: Third-generation sequencing technology has become the main development direction of gene sequencing thanks to its long average read length and short sequencing cycle. However, the randomness of its errors and its 15% sequencing error rate inevitably affect the further analysis of gene sequences. To address the shortcomings of existing error correction methods, this paper proposes CTDC, a third-generation sequencing data error correction method based on crowdsourcing annotation. CTDC treats the more reliable second-generation sequencing data as annotation workers and corrects the third-generation sequencing data sites to be annotated by combining the workers' consistency, ability and accuracy. The experimental results show that CTDC achieves better accuracy and performance than existing third-generation sequencing data error correction methods.
Keywords: third-generation sequencing technology; sequencing error rate; hybrid error correction; crowdsourcing annotation
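The abstract describes CTDC only at a high level, so the following is a minimal sketch of the general idea rather than the paper's algorithm: second-generation reads covering one long-read site act as annotators, and each vote is weighted by that read's consistency, ability and accuracy. The class `ShortReadVote`, the function `correct_site`, the product-based weight and the `min_support` threshold are all hypothetical names and choices introduced for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import List


@dataclass
class ShortReadVote:
    """One aligned second-generation read acting as an annotator for a site."""
    base: str           # base the short read reports at this site
    consistency: float  # assumed score in [0, 1]: agreement with other short reads
    ability: float      # assumed score in [0, 1]: e.g. quality of the alignment
    accuracy: float     # assumed score in [0, 1]: e.g. derived from base quality

    def weight(self) -> float:
        # Assumption: the three scores are combined as a plain product.
        # The paper may combine them differently.
        return self.consistency * self.ability * self.accuracy


def correct_site(long_read_base: str,
                 votes: List[ShortReadVote],
                 min_support: float = 0.5) -> str:
    """Return the weighted-majority base, or keep the long-read base
    when no short read covers the site or the support is too weak."""
    if not votes:
        return long_read_base

    # Accumulate the total weight behind each candidate base.
    totals = defaultdict(float)
    for vote in votes:
        totals[vote.base] += vote.weight()

    total_weight = sum(totals.values())
    if total_weight == 0:
        return long_read_base

    best_base, best_weight = max(totals.items(), key=lambda kv: kv[1])
    # Only overwrite the long-read base if the winner has enough support.
    if best_weight / total_weight >= min_support:
        return best_base
    return long_read_base
```

In this sketch a site covered by two high-weight reads reporting `A` and one low-weight read reporting `C` would be corrected to `A`, while a site with no covering short reads keeps its original base.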
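About the author: DAI Daocheng (born 1995), male, Han ethnicity, from Xi'an, Shaanxi; teacher, master's degree; research interest: bioinformatics.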