Access Type

Open Access Thesis

Date of Award

January 2025

Degree Type

Thesis

Degree Name

M.S.

Department

Electrical and Computer Engineering

First Advisor

Mohammed Alawad

Abstract

Image captioning has traditionally focused on generating descriptions for individual static images. However, predicting future events from visual information is a fundamental challenge in this domain. While existing methods primarily describe current visual content, the ability to anticipate and generate captions for future events remains largely unexplored. We propose a novel approach for future caption prediction that leverages the Vision Transformer (ViT), Generative Pre-trained Transformer 2 (GPT-2), and Text-to-Text Transfer Transformer (T5) architectures.

Our method includes two complementary strategies: a two-stage pipeline where ViT-GPT2 generates captions for current images and T5 analyzes these captions to predict future captions, and a single-stage model in which ViT-GPT2 is trained to generate both current and future captions simultaneously from a single image input. We evaluate our approach on a custom Cooking Dataset and compare its performance with baseline approaches. The results demonstrate that our model outperforms baselines in standard caption-generation metrics while offering more contextually rich and anticipatory insights.
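The thesis record does not include code, but the two-stage strategy can be illustrated with a minimal sketch built on the Hugging Face Transformers library: ViT-GPT2 captions the current image, and T5 maps that caption to a predicted future caption. The checkpoint names, prompt wording, and generation settings below are illustrative assumptions, not the thesis's actual training configuration.

```python
# Minimal sketch of the two-stage pipeline (ViT-GPT2 -> T5).
# Checkpoints, prompt text, and decoding settings are assumptions for illustration.
from PIL import Image
from transformers import (
    VisionEncoderDecoderModel,
    ViTImageProcessor,
    AutoTokenizer,
    T5ForConditionalGeneration,
    T5Tokenizer,
)

# Stage 1: ViT-GPT2 captions the current frame.
captioner = VisionEncoderDecoderModel.from_pretrained("nlpconnect/vit-gpt2-image-captioning")
processor = ViTImageProcessor.from_pretrained("nlpconnect/vit-gpt2-image-captioning")
gpt2_tokenizer = AutoTokenizer.from_pretrained("nlpconnect/vit-gpt2-image-captioning")

# Stage 2: T5 maps the current caption to a predicted future caption.
# (In the thesis this would be fine-tuned on current->future caption pairs.)
t5 = T5ForConditionalGeneration.from_pretrained("t5-base")
t5_tokenizer = T5Tokenizer.from_pretrained("t5-base")


def predict_future_caption(image_path: str) -> tuple[str, str]:
    image = Image.open(image_path).convert("RGB")

    # Stage 1: generate a caption for the current image.
    pixel_values = processor(images=image, return_tensors="pt").pixel_values
    caption_ids = captioner.generate(pixel_values, max_length=32, num_beams=4)
    current_caption = gpt2_tokenizer.decode(caption_ids[0], skip_special_tokens=True)

    # Stage 2: condition T5 on the current caption to anticipate the next step.
    prompt = f"predict next step: {current_caption}"
    input_ids = t5_tokenizer(prompt, return_tensors="pt").input_ids
    future_ids = t5.generate(input_ids, max_length=32, num_beams=4)
    future_caption = t5_tokenizer.decode(future_ids[0], skip_special_tokens=True)

    return current_caption, future_caption
```

The single-stage variant would instead fine-tune the ViT-GPT2 decoder to emit both the current and the future caption from one image input, removing the intermediate text hand-off.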

Additionally, we examine the vulnerabilities of the proposed multi-stage framework to adversarial perturbations, showing that disruptions in early stages can propagate and degrade downstream performance. We evaluate three attack strategies: the fast gradient sign method (FGSM), prompt-based manipulations, and TextFooler. These experiments highlight the cascading effects of adversarial noise and the need for robust design to ensure reliability in real-world deployment.
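For reference, the image-level attack follows the standard FGSM formulation (a single step of size epsilon in the direction of the loss gradient's sign). The sketch below applies it to the Stage-1 image input; the captioner object, epsilon value, and the use of the reference caption's token ids as labels are assumptions for illustration rather than the thesis's exact setup.

```python
# FGSM perturbation of the Stage-1 image input: x_adv = x + eps * sign(grad_x loss).
# `captioner` is the ViT-GPT2 model from the sketch above; epsilon is illustrative.
import torch


def fgsm_perturb(captioner, pixel_values: torch.Tensor,
                 label_ids: torch.Tensor, epsilon: float = 0.03) -> torch.Tensor:
    """Return adversarially perturbed pixel values for the captioning model."""
    pixel_values = pixel_values.clone().detach().requires_grad_(True)

    # Teacher-forced captioning loss with respect to the clean reference caption.
    outputs = captioner(pixel_values=pixel_values, labels=label_ids)
    outputs.loss.backward()

    # Single-step perturbation in the direction that increases the loss.
    adversarial = pixel_values + epsilon * pixel_values.grad.sign()
    return adversarial.detach()
```

Because the perturbed image corrupts the Stage-1 caption, the error then propagates into the T5 stage, which is the cascading effect the abstract describes.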
