Transformer with Multi-level Grid Features and Depth Pooling for Image Captioning

Doanh C. Bui, Tam V. Nguyen, and Khang Nguyen

Abstract: Image captioning is an exciting yet challenging problem at the intersection of computer vision and natural language processing. In recent years, it has been addressed by Transformer-based models optimized with cross-entropy loss, with performance further boosted via Self-Critical Sequence Training. Two types of representations are commonly embedded into captioning models, grid features and region features, and there have been attempts to incorporate 2D geometry information into the self-attention computation. However, the 3D order in which objects appear is not considered, which can confuse the model in complex scenes with overlapping objects. In addition, recent studies that use only the feature maps from the last layer or block of a pretrained CNN may lack spatial information. In this paper, we present a Transformer-based captioning model dubbed TMDNet. Our model includes a multi-level grid feature aggregation (MGFA) module, which enriches the model's representation ability using prior knowledge, and a depth-grid aggregation (DGA) module, which embeds the image's depth information into the model space for better performance. The proposed model demonstrates its effectiveness on the MS-COCO "Karpathy" test split across five standard metrics.
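
For concreteness, the sketch below illustrates one plausible reading of the two modules: projecting feature maps from several backbone levels to a shared grid of tokens (MGFA), and average-pooling a monocular depth map over the same grid so that a depth embedding can be added to each grid token (DGA). This is a minimal PyTorch sketch, not the authors' implementation; the module names, dimensions, backbone levels, and the fusion-by-summation choice are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiLevelGridAggregation(nn.Module):
    """Project feature maps from several backbone levels to a common
    dimension, resize them to one grid size, and sum them (an illustrative
    reading of MGFA, not the paper's exact formulation)."""

    def __init__(self, in_channels=(512, 1024, 2048), d_model=512, grid=7):
        super().__init__()
        self.grid = grid
        self.proj = nn.ModuleList(nn.Conv2d(c, d_model, kernel_size=1)
                                  for c in in_channels)

    def forward(self, feature_maps):
        # feature_maps: list of (B, C_i, H_i, W_i) tensors from different levels
        fused = 0
        for fmap, proj in zip(feature_maps, self.proj):
            x = proj(fmap)                                   # (B, d_model, H_i, W_i)
            x = F.adaptive_avg_pool2d(x, self.grid)          # (B, d_model, grid, grid)
            fused = fused + x
        return fused.flatten(2).transpose(1, 2)              # (B, grid*grid, d_model)


class DepthGridAggregation(nn.Module):
    """Average-pool a precomputed monocular depth map over each grid cell and
    embed the pooled value so it can be added to the matching grid token
    (an illustrative reading of DGA)."""

    def __init__(self, d_model=512, grid=7):
        super().__init__()
        self.grid = grid
        self.embed = nn.Linear(1, d_model)

    def forward(self, depth_map):
        # depth_map: (B, 1, H, W) depth estimate for the input image
        cells = F.adaptive_avg_pool2d(depth_map, self.grid)  # (B, 1, grid, grid)
        cells = cells.flatten(2).transpose(1, 2)             # (B, grid*grid, 1)
        return self.embed(cells)                             # (B, grid*grid, d_model)


if __name__ == "__main__":
    # Toy example: fuse both representations into encoder input tokens.
    B = 2
    feats = [torch.randn(B, c, s, s) for c, s in ((512, 28), (1024, 14), (2048, 7))]
    depth = torch.rand(B, 1, 224, 224)
    tokens = MultiLevelGridAggregation()(feats) + DepthGridAggregation()(depth)
    print(tokens.shape)  # torch.Size([2, 49, 512])
```

In this reading, the pooled depth embedding acts as an ordering signal along the camera axis, complementing the 2D geometry cues that prior work injects into self-attention.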

This work was conducted under the supervision of Prof. Tam V. Nguyen (University of Dayton) and Prof. Khang Nguyen (University of Information Technology, VNU-HCM).