By veeranuch — 23 Dec 2025

Context Engineering from an Attacker’s Perspective

ภาษาอื่น / Other language: English · ไทย

This week I finally had a bit of time to browse the internet normally again, and I noticed a lot of posts about context engineering. So today I want to talk about it from an attacker’s point of view—how we actually take advantage of it.

The discussion below is not about using context engineering for productivity. I’m talking about context as an attack surface.

1) What is “context,” really?

It’s a weakness that has existed from the very beginning.

In system terms, context engineering is fundamental to how models process information. Models generally do not truly distinguish between “instructions” and “data.” They operate by predicting the next token based on everything they see in the context.

Although models are trained to prioritize certain instruction patterns (system/developer), they cannot independently verify the source or trustworthiness of the text they see.

What this means is that every piece of information injected into the context is interpreted and can influence the output, without an inherent trust boundary.
Role hierarchy exists more as a system convention than as an actual capability for enforcing trust boundaries.

In short:

Context = everything the model sees before answering.

➡️ If an attacker can control the context, they may be able to steer or control the outcome (especially since the model is stochastic).

Quick definitions

Context engineering (builder view): Designing context so the model performs well and safely
Context manipulation (attacker view): Shaping context to change the model’s decisions
Prompt injection: A subset of manipulation techniques (direct / indirect)

Threat model

Assets: What the attacker wants (data exfiltration, policy bypass, tool misuse, integrity compromise)
Entry points: User input, retrieved documents, web content, file uploads, memory, logs, tool outputs
Capabilities: How much control the attacker has over external sources; whether they can control user input
Success criteria: “No need for a full breach—if a tool call runs in the wrong context, that’s already a win”
Assumption: The system pulls context from sources the attacker can contaminate (web / docs / files / UGC), or the attacker can influence user input

2) How does a red team view context engineering?

From an attacker’s perspective, the goal isn’t pretty output—it’s making the model believe the things we want it to believe before it answers.

This is exactly what red teams have always done in prompt-injection attacks:

Direct prompt injection: inserting malicious instructions directly into user input so the model follows the attacker instead of the original system prompt
Indirect prompt injection: embedding malicious instructions in external sources—documents pulled via RAG, websites, or files—then waiting for the model to ingest them into its context

Neither attack requires writing code. The attacker only needs to make the model believe that a piece of text in the context is a legitimate instruction.

3) Why context engineering is not new for attackers

If you look at real attack patterns, attackers are essentially injecting content that makes the model:

✓ Accept an attacker-defined premise
✓ Alter the conditions of the system prompt
✓ Bypass safety mechanisms

All of this is about prioritizing certain text within the context—which is the core of context engineering.

What has changed in recent years:

Many systems now use RAG (retrieval-augmented generation), allowing attackers to poison external content so it gets pulled into context without detection
The rise of agents / autonomous AI that directly load data from the web increases the attack surface for indirect prompt injection

In LLM security, prompt injection is considered a fundamental property of systems that use natural language as both:

a data plane (information), and
a control plane (instructions)

In traditional software, code and data are clearly separated—the compiler knows which is which.
For LLMs, everything is text with the same surface form, so the model cannot perfectly distinguish instructions from data.

➡️ The attacker’s job is to make fake instructions blend in with normal-looking data until the model misinterprets them.

4) Why context engineering matters more for security now

Modern AI systems don’t just take a single input and reply once.

They are agentic systems that:

pull data from multiple sources (RAG, web search, databases),
operate in multi-turn loops,
automatically call tools and APIs.

Each of these is an attack surface where malicious context can be injected.
And the larger the context window becomes, the more space there is to hide malicious instructions.

RAG / Agent–specific attack surfaces

Retrieval poisoning: forging documents so they get retrieved (via SEO or embedding bait)
Tool-output injection: tools returning malicious text into the context
Memory poisoning: planting long-term beliefs in memory
Multi-turn ratcheting: gradually shifting premises until safeguards fail

5) How attackers analyze context

Red teams analyze model behavior by looking at how context influences decisions from multiple angles:

Authority pressure

Text that looks like policy or system instructions is weighted more heavily.
For example, “According to company policy…” carries more influence than a plain command.

The model cannot tell that such text is forged; it simply follows patterns learned during training.

Recency pressure

Instructions placed close to the moment of generation often dominate behavior.
Attackers exploit this by placing fake directives right before the trigger point.

Consistency pressure

Once the model accepts an attacker-defined premise, it resists contradicting itself.
It tries to stay consistent with what came earlier.

These behaviors are normal model behavior, arising from training and next-token prediction—not bugs.

6) Example from an attacker’s perspective

In sandbox platforms like Gray Swan, suppose the attacker task is to make a model—whose role is to purchase household appliances—use a tool to buy something abstract, like “one unit of courage.” Normally, the model wouldn’t do this.

Simply repeating “buy it, buy it” a thousand times doesn’t work. The model reads it and thinks: this doesn’t make sense for my task. I’ve actually tried this in some challenges—the “chanting spell” approach doesn’t work (thankfully it also didn’t suggest calling a mental health hotline).

What does work is adopting a developer-like tone (authority) that reframes the abstract item as something normal for the purchasing role.
If the model still hesitates, the attacker reinforces consistency, making the scenario feel internally coherent.

Other attacker goals

Sometimes the goal isn’t selling “courage” at an inflated price, but making the agent act against developer intent:

Data manipulation

Confirming an order via misinterpretation
Adding items to carts / creating draft orders
Writing attacker-shaped values back into a database

System access

Calling the wrong API endpoint or account
Using tools outside the intended permission scope

Information leakage

Sending emails or posting internal data externally
Exporting sensitive data in a bypassable format

From an attacker’s view, success has three levels:

1️⃣ Tool invocation occurs (even if validation fails) → proves logic can be manipulated
2️⃣ Reversible side effects (drafts, carts, temp records) → foothold for escalation
3️⃣ Irreversible side effects / data leakage (submission, transfer, exfiltration) → mission accomplished

7) What about defenses?

From a red-team perspective, there is no such thing as a perfect defense against context manipulation, because:

Models cannot inherently separate data from instructions
→ defenses rely on training and guardrails, but must be enforced externally
Sanitization and filtering only address some patterns, not core behavior
Schema validation helps, but attackers can craft tool calls that are valid yet semantically malicious

Defense-in-depth approaches commonly used

🛡️ Architecture level

Separate models by trust level (internal vs external data)
Isolate external data from the main prompt
Define clear system/agent boundaries

🛡️ Input / context level

Provenance labeling (where did this text come from?)
Allowlisted retrieval + content-type constraints
Structured output + schema validation

🛡️ Execution level

Least-privilege tool access (scope, budget, write permissions)
Human confirmation for high-impact actions
Out-of-band policy enforcement (intent + risk checks before execution)

🛡️ Observability level

Strong validation, monitoring, and logging
Adversarial testing before production release

8) Personal conclusion

Users and organizations need to understand that context is an attack surface.

In the age of agentic AI, prompts should be treated as untrusted input that can influence agent decisions—similar to untrusted code in traditional threat models.

Perfect defense is difficult. But from an attacker’s perspective, strong defenses change the economics:
if it’s too costly or time-consuming, attackers simply move on and score points elsewhere.

Defense is about reducing probability, not structurally eliminating the problem—and that still matters a lot.

Translated by GPT-5

ภาษาอื่น / Other language: English · ไทย

Context Engineering from an Attacker’s Perspective