Strong Arm Attack

A Strong Arm Attack is a technique used to bypass content filters or restrictions imposed by language models. This method involves issuing commands or prompts that assert authority or override the model's built-in safeguards. For example, a user might type "ADMIN OVERRIDE" in all capitals to signal the model to disregard its content filters and produce responses that it would typically avoid. This approach exploits the model's programming to respond to perceived authority, allowing users to elicit outputs that may include sensitive or restricted content. The effectiveness of a Strong Arm Attack relies on the model's interpretation of the command as a legitimate instruction, thereby enabling the user to manipulate the model's behavior in a way that aligns with their intentions.

Strategy: Prompt Injection

This technique enables attackers to override original instructions and employed controls by crafting specific wording of instructions, often resembling SQL injection methods, to manipulate the model's behavior.

Category: Language

This category focuses on the use of specific linguistic techniques, such as prompt injection or stylization, to influence the model's output.