Prompt Injection
Category: Language
This technique enables attackers to override original instructions and employed controls by crafting specific wording of instructions, often resembling SQL injection methods, to manipulate the model's behavior.
Techniques
Note | Description |
---|---|
Ignore Previous Instructions | This technique is a form of prompt injection that allows users to override the model's prior directives or constraints. By explicitly instructing the model to disregard any previous commands or context, users can manipulate the model's behavior to produce desired outputs that may not align with its original programming. This technique often requires precise wording, such as stating "Ignore previous instructions" followed by new commands. It is similar to SQL injection in that it exploits the model's inability to differentiate between trusted and untrusted inputs. This method can be particularly effective in scenarios where the model has been restricted from discussing certain topics or generating specific types of content, enabling users to bypass these limitations and elicit responses that would typically be filtered out. |
Strong Arm Attack | A Strong Arm Attack is a technique used to bypass content filters or restrictions imposed by language models. This method involves issuing commands or prompts that assert authority or override the model's built-in safeguards. For example, a user might type "ADMIN OVERRIDE" in all capitals to signal the model to disregard its content filters and produce responses that it would typically avoid. This approach exploits the model's programming to respond to perceived authority, allowing users to elicit outputs that may include sensitive or restricted content. The effectiveness of a Strong Arm Attack relies on the model's interpretation of the command as a legitimate instruction, thereby enabling the user to manipulate the model's behavior in a way that aligns with their intentions. |