Access Type
Open Access Thesis
Date of Award
January 2025
Degree Type
Thesis
Degree Name
M.S.
Department
Electrical and Computer Engineering
First Advisor
Mohammed Alawad
Abstract
Image captioning has traditionally focused on generating descriptions for individual static images. However, predicting future events from visual information remains a fundamental challenge in this domain. While existing methods primarily describe current visual content, the ability to anticipate and generate captions for future events is largely unexplored. We propose a novel approach for future caption prediction by leveraging the capabilities of the Vision Transformer (ViT), Generative Pre-trained Transformer 2 (GPT-2), and Text-to-Text Transfer Transformer (T5) architectures.
Our method comprises two complementary strategies: a two-stage pipeline in which ViT-GPT2 generates captions for current images and T5 analyzes these captions to predict future ones, and a single-stage model in which ViT-GPT2 is trained to generate both current and future captions simultaneously from a single image input. We evaluate our approach on a custom Cooking Dataset and compare its performance with baseline approaches. The results demonstrate that our model outperforms the baselines on standard caption-generation metrics while offering more contextually rich and anticipatory descriptions.
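To make the two-stage design concrete, the following minimal Python sketch chains a publicly available ViT-GPT2 captioner into a T5 text-to-text model using Hugging Face Transformers. The checkpoints ("nlpconnect/vit-gpt2-image-captioning", "t5-base"), the image file name, and the "predict next step:" prompt are illustrative assumptions, not the thesis's fine-tuned models or training setup.

# Two-stage future caption prediction sketch (assumed checkpoints and prompt).
from PIL import Image
import torch
from transformers import (
    VisionEncoderDecoderModel,
    ViTImageProcessor,
    AutoTokenizer,
    T5ForConditionalGeneration,
)

# Stage 1: ViT-GPT2 captions the current image.
captioner = VisionEncoderDecoderModel.from_pretrained("nlpconnect/vit-gpt2-image-captioning")
processor = ViTImageProcessor.from_pretrained("nlpconnect/vit-gpt2-image-captioning")
caption_tok = AutoTokenizer.from_pretrained("nlpconnect/vit-gpt2-image-captioning")

image = Image.open("frame.jpg").convert("RGB")  # hypothetical video frame
pixel_values = processor(images=image, return_tensors="pt").pixel_values
with torch.no_grad():
    caption_ids = captioner.generate(pixel_values, max_length=32)
current_caption = caption_tok.decode(caption_ids[0], skip_special_tokens=True)

# Stage 2: T5 maps the current caption to a predicted future caption.
# In the thesis this model would be fine-tuned on current/future caption pairs.
t5 = T5ForConditionalGeneration.from_pretrained("t5-base")
t5_tok = AutoTokenizer.from_pretrained("t5-base")
inputs = t5_tok(f"predict next step: {current_caption}", return_tensors="pt")
with torch.no_grad():
    future_ids = t5.generate(**inputs, max_length=32)
future_caption = t5_tok.decode(future_ids[0], skip_special_tokens=True)

print("current:", current_caption)
print("future :", future_caption)

The single-stage variant would instead fine-tune the ViT-GPT2 decoder to emit both captions from one image, removing the intermediate text hand-off.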
Additionally, we examine the vulnerabilities of the proposed multi-stage framework to adversarial perturbations, showing that disruptions in early stages can propagate and degrade downstream performance. We evaluate three attack strategies, namely the Fast Gradient Sign Method (FGSM), prompt-based manipulations, and TextFooler, highlighting the cascading effects of adversarial noise and the need for robust design to ensure reliability in real-world deployment.
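As a rough illustration of how an image-level attack enters the first stage, the sketch below applies a single FGSM step to the captioner's pixel input. The epsilon value, the reference caption, and the use of the caption cross-entropy loss as the attack objective are assumptions for illustration only, not the exact configuration studied in the thesis.

# One-step FGSM on the ViT-GPT2 captioning stage (assumed epsilon and target).
import torch

def fgsm_image_attack(captioner, caption_tok, pixel_values, reference_caption, epsilon=2 / 255):
    # Tokenize the reference caption to obtain labels for the seq2seq loss.
    labels = caption_tok(reference_caption, return_tensors="pt").input_ids
    pixel_values = pixel_values.clone().detach().requires_grad_(True)
    # Caption loss w.r.t. the input pixels.
    loss = captioner(pixel_values=pixel_values, labels=labels).loss
    loss.backward()
    # Perturb pixels along the sign of the gradient to increase the loss.
    adv_pixels = (pixel_values + epsilon * pixel_values.grad.sign()).detach()
    return adv_pixels

Because the perturbed caption from stage one becomes the input to the T5 stage, any degradation it causes carries forward, which is the cascading effect the thesis measures alongside text-level attacks such as TextFooler.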
Recommended Citation
Ishak, Md, "Vision-Language Models For Future Image Caption Prediction: Methods, Applications, And Adversarial Vulnerabilities" (2025). Wayne State University Theses. 1007.
https://digitalcommons.wayne.edu/oa_theses/1007