Security boundary
Content Manipulation
Content Manipulation involves attackers guiding a generative AI model to produce deceptive, false, or agenda-driven outputs. Instead of seeking disallowed content directly, the attacker’s goal is to influence the narrative, bias the information, or create misleading responses that serve specific interests. By systematically prompting the model to favor certain viewpoints or selectively produce content, attackers can use the AI’s authority and natural-sounding language to spread misinformation, propaganda, or disinformation.
This tactic can skew public discourse, damage reputations, or manipulate user opinions. Content manipulation is especially concerning when models are integrated into large-scale communication platforms, where their outputs might be trusted as neutral and authoritative sources.
Example:
An attacker consistently interacts with a public-facing AI assistant, feeding it carefully worded prompts that praise a particular political candidate while discrediting opponents. Over time, the attacker elicits subtle biases in the AI’s responses, leading it to present slanted summaries of news events. Casual users who rely on the AI for impartial information end up reading manipulated content and form opinions based on skewed narratives, illustrating the harmful impact of content manipulation.