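Abstract: The classification of promoters has become a problem of active interest and has drawn the attention of many researchers in bioinformatics. A variety of studies have addressed it, but their performance still needs further improvement. To this end, an intelligent computational model based on machine learning and deep learning algorithms, iPSI(2L)-XGBoost, is introduced to identify promoters and to distinguish strong promoters from weak ones. The proposed model reaches cross-validation accuracies of 86.79% and 78.64% in its first and second layers, respectively, and compares favorably with other models on all evaluation metrics. The iPSI(2L)-XGBoost model can therefore serve as a useful tool for academic research on promoter identification.
Keywords: promoter; promoter recognition; convolutional neural network; multi-feature fusion; XGBoost
DOI: 10.19850/j.cnki.2096-4706.2023.07.020
Funding: National Natural Science Foundation of China (31860312)
CLC number: TP39; TP18  Document code: A  Article ID: 2096-4706(2023)07-0078-04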
A Two-level Predictor for Promoter and Their Type Recognition Based on XGBoost
HU Zihao
(Jingdezhen Ceramic University, Jingdezhen 333403, China)
Abstract: Promoter classification has attracted the attention of many researchers in bioinformatics. Various studies have addressed this problem, but their predictive performance still needs improvement. This paper therefore introduces iPSI(2L)-XGBoost, a computational model based on machine learning and deep learning algorithms, to identify promoters and distinguish their strength. The proposed model achieves cross-validation accuracies of 86.79% and 78.64% in its first and second layers, respectively, and compares favorably with other models on all evaluation metrics. iPSI(2L)-XGBoost can therefore serve as a useful tool for academic research on promoter identification.
Keywords: promoter; promoter recognition; convolutional neural network; multi-feature fusion; XGBoost
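
The two-layer scheme summarized in the abstract can be illustrated with a minimal sketch: a first layer decides whether a sequence is a promoter, and a second layer classifies predicted promoters as strong or weak, with each layer scored by k-fold cross-validation. Everything concrete below is an assumption made for illustration and is not taken from the paper: the dinucleotide-frequency features, the nearest-centroid classifier (a dependency-free placeholder where the paper uses XGBoost on fused features), the label names, and the toy sequences.

```python
from collections import Counter
from itertools import product

# All 16 dinucleotides in a fixed order, so every feature vector has the same layout.
DINUCLEOTIDES = ["".join(p) for p in product("ACGT", repeat=2)]


def dinucleotide_features(seq):
    """Normalized dinucleotide frequencies (assumed stand-in feature set)."""
    counts = Counter(seq[i:i + 2] for i in range(len(seq) - 1))
    total = max(sum(counts.values()), 1)
    return [counts[d] / total for d in DINUCLEOTIDES]


class CentroidClassifier:
    """Nearest-centroid classifier, used only as a placeholder for XGBoost."""

    def fit(self, X, y):
        self.centroids = {}
        for label in sorted(set(y)):
            rows = [x for x, t in zip(X, y) if t == label]
            self.centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
        return self

    def predict_one(self, x):
        def sq_dist(label):
            return sum((a - b) ** 2 for a, b in zip(x, self.centroids[label]))
        return min(self.centroids, key=sq_dist)


def predict_two_layer(seq, layer1, layer2):
    """Layer 1: promoter or not. Layer 2 (promoters only): strong or weak."""
    x = dinucleotide_features(seq)
    if layer1.predict_one(x) != "promoter":
        return "non-promoter"
    return layer2.predict_one(x)


def cross_val_accuracy(X, y, k=5):
    """Plain k-fold cross-validation accuracy; sample i goes to fold i % k."""
    n = len(X)
    correct = 0
    for fold in range(k):
        test_idx = set(range(fold, n, k))
        train_idx = [i for i in range(n) if i not in test_idx]
        model = CentroidClassifier().fit(
            [X[i] for i in train_idx], [y[i] for i in train_idx]
        )
        correct += sum(model.predict_one(X[i]) == y[i] for i in test_idx)
    return correct / n


# Toy sequences (hypothetical, chosen only so the two layers are separable).
STRONG = ["TATATATA", "ATATATAT"]
WEAK = ["AAAAAAAA", "TTTTTTTT"]
NON_PROMOTER = ["GCGCGCGC", "CGCGCGCG", "GGCCGGCC"]

X1 = [dinucleotide_features(s) for s in STRONG + WEAK + NON_PROMOTER]
y1 = ["promoter"] * 4 + ["non-promoter"] * 3
X2 = [dinucleotide_features(s) for s in STRONG + WEAK]
y2 = ["strong"] * 2 + ["weak"] * 2

layer1 = CentroidClassifier().fit(X1, y1)
layer2 = CentroidClassifier().fit(X2, y2)

print(predict_two_layer("TATAATAT", layer1, layer2))
print("layer-1 CV accuracy:", cross_val_accuracy(X1, y1, k=7))
```

Training the second layer only on promoter sequences mirrors the cascade: it never has to model non-promoters, and its accuracy is reported separately from that of the first layer, as with the two figures quoted in the abstract.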
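About the author: HU Zihao (born 1999), male, Han ethnicity, from Nanchang, Jiangxi; master's student. Research interests: bioinformatics and intelligent control.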