Computer Technology, 2020, Issue 5

An End-to-End Mandarin Speech Recognition Method Based on CNN/CTC

PAN Yuecheng¹, LIU Zhuo¹, PAN Wenhao¹, CAI Dianlun², WEI Zhengsong¹

(1. School of Automation Science and Engineering, South China University of Technology, Guangzhou 510641, China; 2. School of Mechanical and Automotive Engineering, South China University of Technology, Guangzhou 510641, China)

Abstract: To achieve offline Mandarin speech recognition with relatively high accuracy, we propose an acoustic model for a speech recognition system based on a deep fully convolutional neural network (CNN). The model takes spectrograms of the acoustic signal as input, and its structure follows the VGG model. At the output end, the model can be combined directly with connectionist temporal classification (CTC), which enables end-to-end training of the entire model and converts the acoustic signal into a Mandarin Pinyin sequence. The language model uses a maximum entropy Markov model to convert Pinyin sequences into Chinese text. Experiments show that the algorithm achieves 80.82% accuracy on the test set.

Keywords: convolutional neural network; Chinese speech recognition; connectionist temporal classification; end-to-end system

CLC number: TN912.34; TP399    Document code: A    Article ID: 2096-4706(2020)05-0065-04
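The abstract does not state how the CTC output is decoded into Pinyin. As an illustration only, the following minimal sketch shows best-path (greedy) CTC decoding, one common choice: take the highest-scoring label in each frame, merge consecutive repeats, then drop the blank symbol. The function name `ctc_greedy_decode`, the three-entry vocabulary, the blank index, and the frame scores are all hypothetical and are not taken from the paper.

```python
from itertools import groupby

BLANK = 0  # index reserved for the CTC blank symbol (an assumption of this sketch)


def ctc_greedy_decode(frame_scores, labels, blank=BLANK):
    """Best-path CTC decoding: top label per frame, merge repeats, drop blanks."""
    # Pick the index of the highest score in every frame.
    best_path = [
        max(range(len(frame)), key=frame.__getitem__) for frame in frame_scores
    ]
    # Collapse runs of the same index into a single occurrence.
    collapsed = [idx for idx, _ in groupby(best_path)]
    # Remove blanks; what remains is the output label sequence.
    return [labels[idx] for idx in collapsed if idx != blank]


# Hypothetical vocabulary: blank plus two Pinyin syllables with tone numbers.
LABELS = ["<blank>", "ni3", "hao3"]

# Hypothetical per-frame scores (rows: frames, columns: labels).
SCORES = [
    [0.10, 0.80, 0.10],  # ni3
    [0.10, 0.70, 0.20],  # ni3 again, merged with the previous frame
    [0.90, 0.05, 0.05],  # blank
    [0.20, 0.10, 0.70],  # hao3
    [0.60, 0.10, 0.30],  # blank separates two distinct hao3 outputs
    [0.10, 0.20, 0.70],  # hao3
]

print(ctc_greedy_decode(SCORES, LABELS))  # ['ni3', 'hao3', 'hao3']
```

The blank between the last two `hao3` frames is what lets the same syllable appear twice in a row in the output; without it the two frames would be merged into one.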


Funding: National Undergraduate Training Program for Innovation and Entrepreneurship (201910561167)



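About the author: PAN Yuecheng (b. 1998), male, Han ethnicity, from Rongshui, Guangxi; undergraduate student majoring in automation; research interest: automatic speech recognition.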