DOI:10.19850/j.cnki.2096-4706.2022.09.026
CLC number: TP391.4    Document code: A    Article ID: 2096-4706(2022)09-0103-04
Image Description Method Based on Improved Visual Attention Mechanism
WANG Yaoge, WEN Ruisen, PANG Guijie
(School of Information Engineering, Guangdong University of Technology, Guangzhou 510006, China)
Abstract: To address the problem of inaccurate generated image descriptions, a three-layer LSTM image captioning model based on an attention mechanism and reinforcement learning is proposed. First, a ResNet-101 network extracts the image's feature information; an improved three-layer LSTM network then generates the description sentence. In addition, to counter the exposure-bias problem that arises when the model is trained with a cross-entropy loss, the CIDEr evaluation metric is optimized directly using reinforcement learning. Comparative results on the MS COCO dataset show that the proposed model generates descriptions that better match the image content.
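The abstract's decoding step, i.e., attending over ResNet-101 region features at each LSTM time step, can be illustrated with a minimal, framework-free sketch. This is not the paper's exact architecture: it assumes a simple additive-attention scorer, and all names (`soft_attention`, `features`, `hidden`, `w`) are illustrative.

```python
import math

def soft_attention(features, hidden, w):
    """Toy additive attention over image regions.

    features: list of region feature vectors (e.g., one per spatial cell of
              ResNet-101's final conv map); hidden: decoder LSTM hidden state;
    w: learned scoring vector. score_i = w . tanh(feature_i + hidden).
    Returns the attention weights and the attended context vector.
    """
    scores = [sum(wj * math.tanh(fj + hj) for wj, fj, hj in zip(w, f, hidden))
              for f in features]
    m = max(scores)                           # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    alphas = [e / z for e in exps]            # softmax: weights sum to 1
    # Context = attention-weighted average of region features.
    context = [sum(a * f[d] for a, f in zip(alphas, features))
               for d in range(len(features[0]))]
    return alphas, context
```

In a full model the context vector would be concatenated with the word embedding and fed to the next LSTM layer; here only the weighting itself is shown.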
Keywords: image description; attention mechanism; long short-term memory network; reinforcement learning
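The reinforcement-learning step described in the abstract, i.e., directly optimizing CIDEr rather than cross-entropy, is commonly realized as self-critical sequence training: the reward of a sampled caption is baselined against the reward of the greedy caption. The sketch below shows only that policy-gradient loss under this assumption; the function name and arguments are illustrative, and the CIDEr scores are taken as given inputs rather than computed.

```python
def scst_loss(logprobs, sample_reward, greedy_reward):
    """Self-critical policy-gradient loss for one sampled caption.

    logprobs: log p(w_t) for each sampled word; sample_reward: CIDEr score of
    the sampled caption; greedy_reward: CIDEr score of the greedy (baseline)
    caption. Minimizing this loss raises the probability of captions that beat
    the greedy baseline and lowers it otherwise.
    """
    advantage = sample_reward - greedy_reward
    return -advantage * sum(logprobs)
```

Because log-probabilities are negative, a sampled caption that scores above the baseline yields a positive loss whose gradient increases that caption's probability.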
About the authors: WANG Yaoge (1995.12—), male, Han nationality, born in Maoming, Guangdong; master's student; research interests: image captioning, deep learning. WEN Ruisen (1998.11—), male, Han nationality, born in Huizhou, Guangdong; master's student; research interests: deep learning, image processing. PANG Guijie (1997.09—), male, Han nationality, born in Nanchong, Sichuan; master's student; research interests: deep learning, video coding.