
Communication Engineering, 2019, No. 19


CLC number: TN914; TP18         Document code: A         Article ID: 2096-4706(2019)19-0064-05


Alternatives to the Self-Attention Mechanism in the Encoder

ZHOU Xiangsheng, LIN Zhenya, GUO Bin

(Nanjing Zhongxing New Software Co., Ltd., Nanjing 210012, China)

Abstract: In this paper, we modify the encoder of the Transformer, replacing its self-attention mechanism with several alternative structures, including RNN (recurrent neural network), CNN (convolutional neural network) and dynamic routing, and compare their feature extraction capabilities and their impact on the decoder. Experiments show that introducing RNN or IndRNN structures into the encoder can, to a certain extent, strengthen the encoder's ability to extract features from the source language, while replacing self-attention in the encoder with CNN significantly reduces the number of parameters and improves model performance without noticeably affecting translation results. Taking parameter count and execution time into account, dynamic routing performs poorly on this task, which also indicates that although dynamic routing is a strong feature extractor, it is not well suited to stacking.

Keywords: self-attention; CNN; RNN; dynamic routing; encoder
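
To make the kind of substitution discussed in the abstract concrete, the following is a minimal PyTorch-style sketch, not the authors' implementation: a Transformer-style encoder layer in which the multi-head self-attention sublayer is replaced by a 1-D convolution over the time dimension, followed by the usual position-wise feed-forward sublayer with residual connections and layer normalization. The class name ConvEncoderLayer and the hyperparameters (d_model=512, d_ff=2048, kernel_size=3) are illustrative assumptions, not values reported in the paper.

import torch
import torch.nn as nn

class ConvEncoderLayer(nn.Module):
    """Encoder layer where a 1-D convolution stands in for self-attention (sketch)."""
    def __init__(self, d_model=512, d_ff=2048, kernel_size=3, dropout=0.1):
        super().__init__()
        # Convolution over the sequence dimension replaces multi-head self-attention;
        # "same" padding keeps the sequence length unchanged.
        self.conv = nn.Conv1d(d_model, d_model, kernel_size,
                              padding=kernel_size // 2)
        # Standard position-wise feed-forward sublayer.
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):                        # x: (batch, seq_len, d_model)
        # Conv1d expects (batch, channels, seq_len), so transpose around the convolution.
        y = self.conv(x.transpose(1, 2)).transpose(1, 2)
        x = self.norm1(x + self.dropout(y))      # residual + layer norm
        x = self.norm2(x + self.dropout(self.ffn(x)))
        return x

if __name__ == "__main__":
    layer = ConvEncoderLayer()
    out = layer(torch.randn(8, 20, 512))         # toy batch of source embeddings
    print(out.shape)                             # torch.Size([8, 20, 512])

A convolutional sublayer of this form has a fixed receptive field per layer and no content-dependent attention weights, which is consistent with the abstract's observation that it trades a small amount of modeling flexibility for a noticeably smaller parameter count and faster execution.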




About the author: ZHOU Xiangsheng (1980-), male, Han nationality, born in Lianshui, Jiangsu, China; senior R&D manager; master's degree; research interest: natural language processing.