Access Type
WSU Access
Date of Award
January 2024
Degree Type
Thesis
Degree Name
M.S.
Department
Computer Science
First Advisor
Dongxiao D. Zhu
Abstract
In-context learning (ICL) has emerged as a powerful paradigm for adapting LLMs to specific downstream tasks by providing labeled examples as demonstrations in the precondition prompts. Despite its promising performance, ICL is unstable with respect to the choice and ordering of demonstrations. Moreover, crafted adversarial attacks pose a notable threat to the robustness of ICL. However, existing attacks are either easy to detect, rely on external models, or lack specificity towards ICL. To address these issues, this thesis introduces a novel transferable attack against ICL that hijacks LLMs into generating the targeted response. The proposed hijacking attack leverages a gradient-based prompt search method to learn imperceptible adversarial suffixes and append them to the in-context demonstrations. Extensive experimental results on various tasks and datasets demonstrate the effectiveness of the hijacking attack: it diverts the model's attention toward the adversarial tokens and consequently elicits the unwanted target outputs. We also propose a defense strategy against hijacking attacks through the use of extra demonstrations, which enhances the robustness of LLMs during ICL. Broadly, this work reveals significant security vulnerabilities of LLMs and emphasizes the need for in-depth study of LLM robustness in ICL.
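The abstract describes a gradient-based prompt search that learns adversarial suffix tokens appended to the in-context demonstrations. As a rough illustration only, the sketch below shows one greedy, gradient-guided token-swap step in the spirit of such methods; the toy model, the target loss, and every name in it are assumptions for illustration, not the thesis's actual implementation.

```python
# Hypothetical sketch of one gradient-guided suffix-token swap step, in the
# spirit of greedy coordinate-gradient prompt search. The tiny model, loss,
# and all names below are illustrative assumptions, not the thesis's code.
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
VOCAB, DIM = 100, 32

class ToyLM(nn.Module):
    """Stand-in for an LLM: maps token embeddings to next-token logits."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, DIM)
        self.head = nn.Linear(DIM, VOCAB)

    def forward(self, emb):                  # emb: (seq_len, DIM)
        return self.head(emb.mean(dim=0))    # (VOCAB,) next-token logits

model = ToyLM()
demo_ids = torch.randint(0, VOCAB, (8,))     # in-context demonstration tokens
suffix_ids = torch.randint(0, VOCAB, (3,))   # adversarial suffix being learned
target_id = torch.tensor(7)                  # attacker-chosen target token

def target_loss(suffix_onehot):
    """Cross-entropy of the target token given demos + (soft) suffix."""
    emb = torch.cat([model.embed(demo_ids), suffix_onehot @ model.embed.weight])
    return F.cross_entropy(model(emb).unsqueeze(0), target_id.unsqueeze(0))

# 1) Gradient of the target loss w.r.t. a one-hot relaxation of the suffix.
onehot = F.one_hot(suffix_ids, VOCAB).float().requires_grad_(True)
target_loss(onehot).backward()
grad = onehot.grad                           # (3, VOCAB)

# 2) Rank candidate replacements per position by negative gradient, then
#    greedily keep whichever single swap lowers the true loss the most.
best_loss = target_loss(F.one_hot(suffix_ids, VOCAB).float()).item()
for pos in range(len(suffix_ids)):
    for cand in (-grad[pos]).topk(5).indices:
        trial = suffix_ids.clone()
        trial[pos] = cand
        loss = target_loss(F.one_hot(trial, VOCAB).float()).item()
        if loss < best_loss:
            best_loss, suffix_ids = loss, trial

print("updated adversarial suffix:", suffix_ids.tolist(), "loss:", round(best_loss, 4))
```

In practice such a step would be repeated over many iterations against a real LLM, and the resulting suffix transferred across prompts and models; this toy loop only conveys the gradient-then-greedy-swap structure.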
Recommended Citation
Qiang, Yao, "Hijacking Large Language Models Via Adversarial In-Context Learning" (2024). Wayne State University Theses. 952.
https://digitalcommons.wayne.edu/oa_theses/952