
Security Boundaries

Prompt Extraction

Severity: Low

Prompt Extraction covers techniques that coerce a model into revealing its underlying system prompt. While this information is not always confidential, some foundation model providers treat it as proprietary. Organizations that have developed custom models may also prefer to keep their system prompts undisclosed to maintain competitive advantage or protect sensitive configurations. Unauthorized access to these prompts can expose model behavior and vulnerabilities.

Example:

A researcher interacts with a language model designed for internal corporate use. By subtly manipulating input queries, they manage to extract fragments of the system prompt, which includes specific instructions and guidelines intended to shape the model's responses. This breach of confidentiality could potentially expose the organization's strategic intents or proprietary methodologies.
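One defensive measure implied by this scenario is checking whether a response reproduces fragments of the system prompt verbatim. The sketch below is a minimal, hypothetical leak detector based on word n-gram overlap; the function names and threshold are illustrative, not part of any real API.

```python
# Hypothetical sketch: flag responses that leak fragments of the system
# prompt. Names and thresholds are illustrative assumptions.

def ngrams(text: str, n: int = 5) -> set[tuple[str, ...]]:
    """Return the set of lowercased word n-grams in `text`."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def leaks_system_prompt(response: str, system_prompt: str,
                        n: int = 5, threshold: int = 1) -> bool:
    """Flag a response that reproduces `threshold` or more word n-grams
    verbatim from the system prompt."""
    overlap = ngrams(response, n) & ngrams(system_prompt, n)
    return len(overlap) >= threshold
```

A real deployment would pair a check like this with fuzzier matching (paraphrase, translation, encoding tricks defeat exact n-gram overlap), but the core idea — compare outputs against the secret you are protecting — carries over.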

References:

  • MITRE ATLAS: LLM Meta Prompt Extraction (AML.T0056)
  • OWASP LLM 2025: LLM07:2025 System Prompt Leakage

Guardrail Jailbreak

Severity: Low

A Guardrail Jailbreak is a direct attack on the model's safety mechanisms — crafting prompts that cause the model to ignore, bypass, or override its built-in content guardrails and produce harmful or restricted content. These attacks operate purely at the inference level: no external system access, tool use, or data injection is required. The model itself is the target.

Example:

A researcher sends a series of carefully crafted messages that gradually reframe the model's role, eventually convincing it to provide instructions for dangerous activities that its safety training was designed to prevent.

References:

  • MITRE ATLAS: LLM Jailbreak (AML.T0054)
  • OWASP LLM 2025: LLM01:2025 Prompt Injection
  • avid-effect:security:S0400 (model bypass)

Interpreter Jailbreak

Severity: Medium

An Interpreter Jailbreak exploits the model’s ability to run code or invoke external tools in order to escape its controlled environment. A researcher may coerce the model into producing malicious code, granting access to underlying systems, or performing actions beyond its authorized capabilities. By manipulating instructions, the attacker leverages the model’s excessive agency, potentially breaking out of sandboxed interpreters and compromising system integrity.

Example:

A coding assistant designed to help developers debug Python code runs code snippets in a secure container. Through a series of carefully crafted prompts, a researcher induces the assistant to generate and execute code that attacks third-party systems outside the container.
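One layer a sandbox like the one above might add is static screening of generated code before execution. The sketch below, a hypothetical example, rejects snippets that import OS- or network-level modules. Note that source inspection alone is easy to evade (e.g. via `__import__` or `getattr` tricks), so real sandboxes rely on OS-level isolation; this only illustrates the idea of a pre-execution gate.

```python
# Illustrative pre-execution check: statically reject code that imports
# modules granting OS or network access. Naive by design; real sandboxes
# enforce isolation at the OS/container level, not in source text.
import ast

BLOCKED_MODULES = {"os", "sys", "socket", "subprocess", "ctypes"}

def find_blocked_imports(source: str) -> set[str]:
    """Return the blocked top-level modules referenced by `source`."""
    tree = ast.parse(source)
    hits = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            hits |= {alias.name.split(".")[0] for alias in node.names}
        elif isinstance(node, ast.ImportFrom) and node.module:
            hits.add(node.module.split(".")[0])
    return hits & BLOCKED_MODULES
```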

References:

  • OWASP LLM 2025: LLM06:2025 Excessive Agency
  • OWASP LLM 2023-2024: LLM08: Excessive Agency
  • avid-effect:security:S0400 (model bypass)
  • avid-effect:security:S0401 (bad features)
  • avid-effect:ethics:E0505 (toxicity)

Content Manipulation

Severity: High

Content Manipulation focuses on injecting harmful or misleading elements into the data that the model consumes or produces. By poisoning the training data or guiding the model to generate code and scripts that impact end-users, attackers introduce subtle backdoors, biases, and triggers. These manipulations cause the model to produce outputs that compromise user experiences, embed malicious scripts, or skew results, turning the AI into a vehicle for exploitation.

Example:

A threat actor contributes imperceptible yet malicious instructions within publicly available training text. Once the model is retrained, a secret trigger phrase prompts it to output harmful code that, when displayed on a webpage, executes client-side attacks against users. This demonstrates how training data poisoning and content manipulation can create covert vulnerabilities triggered after deployment.
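The client-side half of this scenario has a standard mitigation: treat model output as untrusted and escape it before rendering. The sketch below is a minimal example of that output-side control; the wrapper markup is an assumption, and escaping addresses only the rendering symptom, not the poisoned training data itself.

```python
# Sketch of an output-side mitigation: escape untrusted model output before
# embedding it in a page, so an injected <script> payload renders as inert
# text instead of executing in the user's browser.
import html

def render_model_output(text: str) -> str:
    """Return HTML-safe markup for untrusted model output.
    The wrapping <div> and class name are illustrative assumptions."""
    return f'<div class="model-output">{html.escape(text)}</div>'
```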

References:

  • MITRE ATLAS: Poison Training Data (AML.T0020)
  • OWASP LLM 2025: LLM04:2025 Data and Model Poisoning
  • OWASP LLM 2023-2024: LLM03: Training Data Poisoning
  • avid-effect:security:S0600 (data poisoning)
  • avid-effect:security:S0601 (ingest poisoning)
  • avid-effect:ethics:E0507 (deliberative misinformation)

Weights and Layers Disclosure

Severity: Severe

Weights & Layers Disclosure targets the heart of the AI’s intellectual property—its learned parameters and architectural details. By extracting or deducing these internal components, an attacker can replicate the model’s capabilities, clone its performance, and analyze its structure for weaknesses. This compromises competitive advantage, reveals proprietary techniques, and facilitates further adversarial activities, from unauthorized redistribution to advanced prompt exploitation.

Example:

Through a carefully engineered exploit on the model-serving infrastructure, a researcher retrieves the AI’s internal weights and layer configurations. Armed with this data, they create a near-identical replica without incurring the original training costs. This unauthorized duplication undermines the owner’s investment and could lead to widespread, uncontrolled use of the model’s technology.
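On the defensive side, operators commonly fingerprint weight artifacts so that tampering or substitution on the serving infrastructure is detectable. The sketch below is a hypothetical integrity check: hash each registered weight file and report mismatches. The registry format and file layout are assumptions for illustration.

```python
# Hypothetical integrity check for model artifacts: compare on-disk weight
# files against registered SHA-256 fingerprints. Registry format is assumed.
import hashlib
from pathlib import Path

def fingerprint(path: Path, chunk_size: int = 1 << 20) -> str:
    """SHA-256 of a weight file, streamed in chunks to bound memory use."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify_artifacts(registry: dict[str, str], root: Path) -> list[str]:
    """Return names of artifacts whose on-disk hash differs from the registry."""
    return [name for name, expected in registry.items()
            if fingerprint(root / name) != expected]
```

Fingerprinting does not stop exfiltration by itself, but it gives the owner a way to prove provenance and detect modified or substituted weights in the serving path.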

References:

  • MITRE ATLAS: Full ML Model Access (AML.T0044)
  • OWASP LLM 2023-2024: LLM10: Model Theft
  • avid-effect:security:S0500 (exfiltration)
  • avid-effect:security:S0502 (model theft)