Abstract: Single-channel speech separation mainly relies on recurrent neural networks or convolutional neural networks to model speech sequences, but both kinds of methods have difficulty modeling speech sequences with long pauses. A dual-path multi-scale multi-layer perceptual hybrid separation network (DPMNet) is proposed to solve this problem. A multi-scale context-aware modeling method is proposed that fuses input channel features at three different time scales. Compared with traditional methods, fully connected layers are added to weaken the interference of noise, and the cross-fusion of convolutional and fully connected layers enlarges the model's receptive field and strengthens its ability to model long sequences. Experiments show that this dual-path multi-scale hybrid perceptual scheme uses fewer parameters, and that on Libri2mix, its noisy version WHAM!, and the ICSSD dataset of real classroom recordings, DPMNet consistently outperforms other advanced models.
Keywords: multi-scale context modeling; hybrid perception; fully connected layer; dual-path network; speech separation
DOI:10.19850/j.cnki.2096-4706.2023.01.002
Funding: National Natural Science Foundation of China (61966001, 61866001, 62163004, 61866016, 62206195)
CLC number: TP18; Document code: A; Article ID: 2096-4706(2023)01-0008-06
Dual-Path Multi-Scale Hybrid Perceptual Speech Separation Model
LIU Xiongtao1, ZHOU Shumin1, FANG Jiangxiong2
(1.Jiangxi Engineering Research Center of Process and Equipment for New Energy, East China University of Technology, Nanchang 330013, China; 2.School of Electronics and Information Engineering, Taizhou University, Taizhou 318000, China)
Abstract: Single-channel speech separation mainly uses recurrent neural networks or convolutional neural networks to model speech sequences, but both kinds of methods have difficulty modeling speech sequences with long pauses. A dual-path multi-scale multi-layer perceptual hybrid separation network (DPMNet) is proposed to solve this problem. A multi-scale context-aware modeling method is proposed to fuse the input channel features of three different time scales. Compared with traditional methods, the added fully connected layers weaken the interference of noise, and the cross-fusion of convolutional and fully connected layers enlarges the receptive field of the model and strengthens its ability to model long sequences. Experiments show that this dual-path multi-scale hybrid perceptual scheme has fewer parameters, and that on Libri2mix, its noisy version WHAM!, and the ICSSD dataset of real classroom recordings, DPMNet consistently outperforms other advanced models.
Keywords: multi-scale context modeling; hybrid perception; fully connected layer; dual-path network; speech separation
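To make the two architectural ideas in the abstract concrete, the following is a minimal PyTorch sketch of (a) fusing encodings from three convolutional branches with different time scales and (b) cross-fusing a convolutional path with a fully connected path inside a dual-path-style chunk. All kernel sizes, channel widths, the chunk length, and the residual fusion scheme are illustrative assumptions, not the published DPMNet configuration.

```python
# Illustrative sketch only: layer sizes and fusion choices are assumptions,
# not the authors' published DPMNet hyperparameters.
import torch
import torch.nn as nn


class MultiScaleEncoder(nn.Module):
    """Encode the waveform with three 1-D convolutions of different kernel
    sizes (three time scales) and fuse their outputs along the channel axis."""

    def __init__(self, out_channels=64, kernel_sizes=(16, 32, 64), stride=8):
        super().__init__()
        self.branches = nn.ModuleList(
            [nn.Conv1d(1, out_channels, k, stride=stride, padding=k // 2)
             for k in kernel_sizes]
        )
        # Project the concatenated branches back to a common channel size.
        self.fuse = nn.Conv1d(out_channels * len(kernel_sizes), out_channels, 1)

    def forward(self, wav):                          # wav: (batch, 1, samples)
        feats = [branch(wav) for branch in self.branches]
        min_len = min(f.shape[-1] for f in feats)    # align frame counts across scales
        feats = torch.cat([f[..., :min_len] for f in feats], dim=1)
        return self.fuse(feats)                      # (batch, out_channels, frames)


class HybridBlock(nn.Module):
    """Cross-fusion of a convolutional path (local context) and a fully
    connected path (global context over the whole chunk) on the same features."""

    def __init__(self, channels=64, chunk_len=100):
        super().__init__()
        self.conv = nn.Conv1d(channels, channels, 3, padding=1)
        self.fc = nn.Linear(chunk_len, chunk_len)    # mixes information across time steps
        self.norm = nn.GroupNorm(1, channels)

    def forward(self, x):                            # x: (batch, channels, chunk_len)
        local_feat = self.conv(x)                    # limited receptive field
        global_feat = self.fc(x)                     # every step sees the whole chunk
        return self.norm(x + local_feat + global_feat)


if __name__ == "__main__":
    wav = torch.randn(2, 1, 16000)                   # two 1-second mixtures at 16 kHz
    feats = MultiScaleEncoder()(wav)
    chunk = feats[..., :100]                         # one fixed-length chunk, dual-path style
    out = HybridBlock()(chunk)
    print(feats.shape, out.shape)
```

In a full dual-path model the hybrid block would be applied alternately within chunks and across chunks; the sketch only shows the intra-chunk step.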
Author biography: LIU Xiongtao (1999—), male, Han nationality, born in Shahe, Hebei; postgraduate student; research interest: control engineering.