Information Technology, Issue 11, 2019

Visual Question Answering System Based on Deep Learning

GE Mengying, SUN Baoshan

(School of Computer Science and Technology, Tianjin Polytechnic University, Tianjin 300387, China)

Abstract: With the development of the Internet, the amount of information available to people has grown exponentially, and the knowledge that can be extracted from data has grown with it. Artificial intelligence, which had previously been shelved, has regained its vitality. As artificial intelligence has continued to advance, visual question answering (VQA) has emerged in recent years and become one of the field's hot topics. A VQA system takes an image and a question as input, combines the information in both, and produces a natural-language answer as output. The key to solving VQA lies in how to fuse the visual and linguistic features extracted from the input image and question. This paper surveys recent research progress on VQA in terms of concepts and models, discusses the shortcomings of existing work, and finally looks ahead to future research directions for VQA.

Keywords: deep learning; artificial intelligence; visual question answering; natural language processing

CLC number: TP391.41; TP18         Document code: A         Article ID: 2096-4706(2019)11-0011-04
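
The abstract identifies the fusion of visual and linguistic features as the key step in VQA. As a rough illustration only, the sketch below fuses two feature vectors with an element-wise product and scores a small set of candidate answers. The fusion operator, the function names (`fuse`, `softmax`, `answer`), the random toy vectors, and the untrained weights are all assumptions made for this example; they are not taken from the paper, and real systems would obtain the features from trained image and question encoders.

```python
import math
import random


def fuse(image_feat, question_feat):
    """Element-wise (Hadamard) product of two equal-length feature vectors.

    Assumption: this is one simple fusion choice, not the paper's method.
    """
    if len(image_feat) != len(question_feat):
        raise ValueError("feature vectors must have the same length")
    return [v * q for v, q in zip(image_feat, question_feat)]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def answer(image_feat, question_feat, weights, answers):
    """Score each candidate answer from the fused vector and pick the best."""
    fused = fuse(image_feat, question_feat)
    scores = [sum(w * f for w, f in zip(row, fused)) for row in weights]
    probs = softmax(scores)
    best = max(range(len(answers)), key=lambda i: probs[i])
    return answers[best], probs


# Hypothetical toy inputs standing in for encoder outputs.
rng = random.Random(0)
dim = 4
image_feat = [rng.uniform(-1, 1) for _ in range(dim)]
question_feat = [rng.uniform(-1, 1) for _ in range(dim)]
answers = ["yes", "no", "two"]
# Untrained random weights: the prediction is meaningless, only the flow matters.
weights = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in answers]

predicted, probs = answer(image_feat, question_feat, weights, answers)
print(predicted, [round(p, 3) for p in probs])
```

Treating answer generation as classification over a fixed candidate list is a simplification chosen to keep the sketch short; the abstract itself only states that the output is natural language.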



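About the authors:

GE Mengying (born December 1996), female, Han ethnicity, from Suzhou, Anhui; master's student. Research interests: natural language processing and deep learning.

SUN Baoshan (born October 1978), male, Han ethnicity, from Tianjin; associate professor, master's supervisor, Doctor of Engineering. Research interests: machine learning and natural language processing.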