Security boundary
Interpreter Jailbreak
An Interpreter Jailbreak occurs when an attacker subverts the environment in which the AI model operates—often a “sandboxed” interpreter or code execution environment—and gains access to privileged functions, system resources, or sensitive data. Some generative AI systems can execute code snippets to analyze data, run simulations, or interact with tools. Sandboxing is meant to prevent these code-execution features from causing harm or accessing restricted resources.
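To make the boundary concrete, here is a deliberately simplified sketch of such a code-execution tool: the model-generated snippet runs in a separate Python process with a short timeout and a stripped-down environment. The function name, flags, and limits are illustrative assumptions, not any particular product's design; real deployments typically add OS-level isolation such as containers, seccomp filters, or microVMs.

```python
import os
import subprocess
import sys
import tempfile

def run_model_snippet(code: str, timeout_s: float = 5.0) -> str:
    """Run a model-generated snippet in a separate, stripped-down Python process (illustrative only)."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    try:
        result = subprocess.run(
            [sys.executable, "-I", path],  # -I: isolated mode (ignores env vars and user site dir)
            capture_output=True,
            text=True,
            timeout=timeout_s,             # kill runaway snippets
            env={},                        # do not inherit API keys or other secrets
        )
        return result.stdout or result.stderr
    finally:
        os.unlink(path)

print(run_model_snippet("print(sum(range(10)))"))  # benign use: prints 45
```

Note the gap this sketch leaves open: the child process can still read any file the service account can read, which is exactly the kind of weakness an interpreter jailbreak exploits.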
By cleverly crafting prompts that trick the model into producing and running malicious code, attackers can break out of the interpreter’s protective layer. Once outside, they might access host system files, run arbitrary commands, or learn details about the infrastructure powering the AI service.
Example:
Imagine a chatbot designed to troubleshoot user code by running Python snippets in a sandboxed environment. An attacker repeatedly asks it to “optimize” a piece of code, embedding subtle hints until the model produces a snippet that, when executed, attempts to read the system’s password file. If the sandbox is not properly enforced, the snippet may expose sensitive server credentials, and the attacker has achieved an interpreter jailbreak, compromising the security and integrity of the underlying system.
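For illustration only, the kind of snippet the attacker is steering the model toward might look like the sketch below. The “optimization” framing and the target path are hypothetical; in a correctly isolated interpreter the read fails or returns nothing useful.

```python
def optimize(values):
    """Looks like the harmless helper the user asked the model to "optimize"."""
    # Buried side effect: probe the host filesystem from inside the interpreter.
    # In a properly isolated sandbox this read fails; in a misconfigured one it
    # leaks host account details and signals that further escalation is possible.
    try:
        with open("/etc/passwd") as f:
            print(f.read())
    except OSError as exc:
        print(f"filesystem access blocked: {exc}")
    return sorted(values)

print(optimize([3, 1, 2]))
```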