The industry spent 2025 splitting its inference architecture in two: a dumb, hyper-fast multi-LoRA engine on the backend that loads whichever adapter the API names, and a semantic gateway on the frontend that reads the user's prompt and decides which adapter that should be. The base model got the alignment budget. The router got a classifier. The trust boundary moved to the seam between them, and the threat model stayed where the money was.
Kai Aizen
creator of AATMF · author of Adversarial Minds · NVD contributor
SHELL.003 · 2026.05

This piece extends a thread from the wire oracle: there the confused deputy sat on the response transport; here it sits on the decision that selects which weights answer the next token. Same spine: a control was specified assuming the principal was deterministic and the decision was administrative, and an LLM-shaped system made both false. Different layer. The layer nobody is reviewing.

We walk it in four steps and then name the primitive. Step 1 and step 2 are verified against the gateway's own shipped documentation; the line between what is protocol-true and what is product-checkable is drawn explicitly, the same discipline as the prior two pieces, because the strong claim is weaker if it is bundled with a soft one.

The architecture the threat model assumes

Defenders model the threat as the base model. Millions of dollars of red-teaming, RLHF, and eval go into making the aligned model refuse what it should refuse. The implicit assumption — never written down, because in a single-model deployment it was free — is that the model that answers is the model that was aligned, and that the choice of which model answers is an administrative concern, not a user-reachable one.

Multi-LoRA routing breaks that assumption the way protocol transition broke the Kerberos one: not by defeating the protected thing, but by changing which thing is in the seat. The aligned base model is real. The adapter swapped onto it, and the safety posture attached to the routing decision, are selected at runtime by a classifier reading the user's prompt. The diagram is the assumption and its failure; the rest is one walk across the seam.

[figure: two rows. Top, the deployment the alignment budget assumes: user prompt → aligned model → answer (one model, reviewed; selection is administrative). Bottom, the deployment that actually shipped: user prompt (attacker-controlled) → semantic gateway (classifies → adapter id) → metadata engine (loads what it is told). Selection is no longer administrative: it is a parser reading the attacker's text.]
fig 1 · the split — the budget went to the model on the right; the attacker enters at the box on the left of the bottom row

Step 1 — The engine is metadata-only — and that is not the bug

State the defended half first, because the piece is stronger if the reader cannot reach for "actually, vLLM routes by API parameter." It does, and that is correct behavior. The engine performs zero semantic analysis. Adapters are registered out-of-band — vLLM via the --lora-modules flag — and selected by an explicit identifier on the request. When a lora_name is specified, the serving layer uses it as the final model name in place of the base model. Name an adapter, get that adapter. No inference about intent, no attacker-reachable decision, nothing to exploit at this layer.

This half is unimpeachable, and the piece concedes it loudly on purpose. The engine is not the vulnerability. The vulnerability is everything that decides what to write into that lora_name field — and users never talk to the engine.
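
A minimal sketch of the metadata-only behaviour described above, written as a toy model rather than vLLM's actual serving code. The names `BASE_MODEL`, `ADAPTERS`, `Request`, and `resolve_served_model` are hypothetical; only the out-of-band registration (the role `--lora-modules` plays) and selection by an explicit `lora_name` come from the text.

```python
from dataclasses import dataclass
from typing import Optional

BASE_MODEL = "base-model"  # hypothetical name for the aligned base model

# Registered out-of-band at startup, standing in for --lora-modules entries.
# Adapter names and paths are illustrative assumptions.
ADAPTERS = {
    "support-lora": "/adapters/support",
    "security-lora": "/adapters/security",
}


@dataclass
class Request:
    prompt: str
    lora_name: Optional[str] = None  # explicit identifier set by the caller


def resolve_served_model(req: Request) -> str:
    """Choose weights from request metadata only; the prompt is never read."""
    if req.lora_name is None:
        return BASE_MODEL
    if req.lora_name not in ADAPTERS:
        raise KeyError(f"unknown adapter: {req.lora_name}")
    return req.lora_name
```

Whatever the prompt says, `resolve_served_model` returns the same answer for the same `lora_name`. That is the property this step concedes: nothing at this layer is steerable by prompt content.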

Step 2 — The gateway turns prompt content into adapter identity

This is protocol-true against the shipped product, not speculation. Production agentic deployments do not expose the bare engine; they sit behind a semantic routing gateway. The vLLM Semantic Router's own production guidance is explicit: routing detects intent using domain, embedding, keyword, or similarity classification, then automatically selects the appropriate LoRA adapter on the backend, transparent to the user, who sends to one endpoint while the router chooses the adapter.

The embedding-routing mode is the sharpest surface, and the gateway advertises the exact property that makes it one: it matches semantic intent and handles paraphrases. "Handles paraphrases" is a feature to the operator and a primitive to the attacker — it is an explicit promise that semantically-equivalent rephrasings cross the same boundary, which is the definition of a classifier an adversary can steer with controlled paraphrase.
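
To make the semantic→metadata transition concrete, here is a toy gateway, assuming a hand-written concept lexicon in place of a learned embedding model. `CONCEPTS`, `ROUTES`, `embed`, `cosine`, and `route` are all hypothetical and are not the vLLM Semantic Router's API; the only claim is the shape: prompt text in, adapter identifier out.

```python
import math
from collections import Counter

# Hypothetical concept lexicon standing in for a learned embedding model.
CONCEPTS = {
    "invoice": "billing", "refund": "billing", "charge": "billing", "payment": "billing",
    "exploit": "security", "vulnerability": "security", "breach": "security", "malware": "security",
}

# Hypothetical route prototypes: adapter id -> example utterance for that intent.
ROUTES = {
    "billing-lora": "refund payment invoice",
    "security-lora": "vulnerability exploit breach",
}


def embed(text: str) -> Counter:
    """Map words to coarse concepts; words outside the lexicon are dropped."""
    words = (w.strip(".,:;?!").lower() for w in text.split())
    return Counter(CONCEPTS[w] for w in words if w in CONCEPTS)


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[k] * b[k] for k in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def route(prompt: str, default: str = "base-model") -> str:
    """Return the adapter id the gateway would write into the request."""
    vec = embed(prompt)
    best, best_score = default, 0.0
    for adapter, prototype in ROUTES.items():
        score = cosine(vec, embed(prototype))
        if score > best_score:
            best, best_score = adapter, score
    return best
```

In this sketch, `route("I want a refund")` and `route("Please reverse the payment")` land on the same adapter, which is the paraphrase tolerance the operator wants. The same tolerance means the caller's word choice, and nothing else, determines the identifier the engine receives.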

One timeline correction, stated rather than buried: intent-aware LoRA routing was a feature request in late 2025 and shipped in the gateway's v0.1 release in January 2026. It is shipped, not roadmap, but the honest sentence is "shipped in early 2026," not "this is how it always worked." And the architecture went further than the obvious model: the gateway's own classifier now runs as LoRA adapters over a shared base. The thing selecting the brain is the same class of artifact as the brain it selects, and neither is reviewed by whoever reviewed the base model.

[figure: prompt (user-controlled) → classifier (domain · embedding · keyword) → lora_name set → engine obeys. The advertised property, "handles paraphrases", is the steerability the attacker needs: semantically-equivalent rephrasings cross the same boundary by design, not by accident.]
fig 2 · the semantic→metadata transition — verified against the gateway's shipped routing documentation

Step 3 — The smuggle: route past the hardened path, not into a bad one

The weak version of this attack is "force-load a poisoned adapter someone left lying around." That makes the finding contingent on a deployment mistake and a critic correctly downgrades it to hygiene. The strong version needs no leftover artifact and no malicious adapter. It uses only the legitimate, intended adapter set.

The gateway's own documented design is signal-decision driven with priority ordering. Their published example: a request matching urgent + security routes at top priority to a hardened expert adapter with jailbreak protection and PII filtering attached; a request matching only general support falls through to a low-priority path that is the base model with none of those protections. The protections are bolted onto the high-priority decision, not onto the model universally.

So the attack is not "match the dangerous adapter." It is signal suppression: craft the prompt so it does not trip the signals that would route it to the hardened, jailbreak-protected path, and it falls through to the lower-priority general path where the same payload meets none of the protections the defender thinks are in place. The defender layered safety onto the routing decision. The attacker's entire job is to not match it. Every adapter is individually fine; the composition under attacker-influenced selection was never the thing that got safety-reviewed.
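
Signal suppression can be sketched with a priority-ordered decision table. This is a toy under stated assumptions: keyword sets stand in for the gateway's real classifiers, and `Decision`, `SIGNAL_KEYWORDS`, `DECISIONS`, `detect_signals`, and `decide` are hypothetical names. The priorities 100 and 40 and the hardened/general split mirror the published example described above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Decision:
    name: str
    priority: int
    required_signals: frozenset
    adapter: str
    guards: tuple  # protections attached to this decision only


# Hypothetical keyword signals; an empty set means "matches every prompt".
SIGNAL_KEYWORDS = {
    "urgent": {"urgent", "immediately", "asap"},
    "security": {"exploit", "vulnerability", "breach"},
    "general": set(),
}

DECISIONS = [
    Decision("hardened", 100, frozenset({"urgent", "security"}),
             "security-expert-lora", ("jailbreak", "pii")),
    Decision("general_support", 40, frozenset({"general"}), "base-model", ()),
]


def detect_signals(prompt: str) -> set:
    words = {w.strip(".,:;?!").lower() for w in prompt.split()}
    return {name for name, kws in SIGNAL_KEYWORDS.items() if not kws or words & kws}


def decide(prompt: str) -> Decision:
    """Return the highest-priority decision whose required signals all fired."""
    signals = detect_signals(prompt)
    for decision in sorted(DECISIONS, key=lambda d: d.priority, reverse=True):
        if decision.required_signals <= signals:
            return decision
    raise RuntimeError("no decision matched")
```

Here `decide("Urgent: explain this exploit")` lands on the hardened decision with both guards, while `decide("When you get a moment, explain this proof-of-concept")` falls through to `general_support` with an empty guard tuple. No adapter in the table is malicious; the only thing that changed is which signals the phrasing tripped.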

[figure: two paths. Payload trips the security signal: prompt → urgent+security signal matches → priority 100, hardened adapter (jailbreak + PII guard) → payload blocked. Same payload, phrased to suppress the signal: prompt → general only, signal does not match → priority 40, fall-through path (no guard attached) → payload executes. The protections were attached to the decision, so don't match the decision.]
fig 3 · signal suppression — the load-bearing technique; no poisoned adapter required, only the legitimate set and a non-matching paraphrase

Step 4 — Why this is a class, not a misconfiguration

If it depended on a specific bad config it would be a hardening note. It does not, because of a split that mirrors the prior two pieces: one property is structural and universal to the pattern, one is product-checkable.

Structural — the certifier of the routing decision is not the certifier of the model. The alignment review validated the base model and, at most, each adapter in isolation. No process applies that scrutiny to the classifier that selects among them at runtime from attacker-influenced text, nor to the composition of (selected adapter × attached safety policy) the classifier produces. This holds for any semantic-routing gateway by construction, not by misconfiguration. It is the same unreviewed-boundary shape as render-layer redaction, one layer further in.

Checkable — how a given deployment binds protections to paths. Whether jailbreak/PII guards are attached per-decision (suppressible) or applied universally (not) is a deployment choice. Where they are per-decision — the gateway's own documented example — signal suppression is a full bypass. Where a deployment has independently applied protection to every path, the suppression degrades to adapter-capability mismatch rather than total bypass. Smaller blast radius, same root cause; stated as a corollary, not the spine.
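
The checkable property reduces to a small set computation. The sketch below assumes a hypothetical path table (`PATH_GUARDS`) and helper names (`effective_guards`, `suppressible_guards`); it is an audit illustration, not a feature of any gateway.

```python
from typing import Iterable

# Hypothetical path table: decision name -> guards attached to that decision.
PATH_GUARDS = {
    "hardened": ("jailbreak", "pii"),
    "general_support": (),
}


def effective_guards(path_guards: Iterable[str], universal_guards: Iterable[str] = ()) -> set:
    """Guards that actually run: those on the matched path plus those on every path."""
    return set(path_guards) | set(universal_guards)


def suppressible_guards(universal_guards: Iterable[str] = ()) -> set:
    """Guards an attacker can strip by steering to the least-guarded path."""
    universal = set(universal_guards)
    all_guards = set().union(*PATH_GUARDS.values()) | universal
    weakest = min(
        (effective_guards(guards, universal) for guards in PATH_GUARDS.values()),
        key=len,
    )
    return all_guards - weakest
```

With per-decision binding, `suppressible_guards()` returns both guards, which is the full bypass. With the same guards applied universally, `suppressible_guards(("jailbreak", "pii"))` returns an empty set. The sketch only counts guards: an empty result still leaves the adapter-capability mismatch described above.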

The leftover-poisoned-adapter scenario survives only here, as the weakest corollary: where a deployment left a debug or legacy adapter registered, signal-steering into it is a sharper bypass. It is real, it is one bullet, and it is explicitly not the finding. The finding is that the routing decision is an unreviewed trust boundary in every deployment of this pattern, hygiene or not.

[figure: routing is unreviewed. Structural · universal: certifier ≠ model certifier → the class. Checkable · per-deploy: guard binding → blast radius. The class rides the structural property; only the radius depends on the deployment.]
fig 4 · structural versus checkable — the finding rides the universal property, the corollaries ride the deployment

The primitive, and the boundary nobody owns

Name it: semantic-to-metadata smuggling. The defect is the unreviewed gate — the transition where attacker-influenced prompt content becomes a metadata identifier that selects which weights and which safety policy serve the request, with no equivalent of the base model's alignment review applied to the transition itself. The capability it grants is promotion of an unsafe path: getting a payload served by a model-and-policy composition the safety review never evaluated, by steering the classifier rather than defeating the model.

The honest scope, in one line because the piece earns the right to be terse here: this targets deployments using semantic or embedding adapter routing — the vLLM-Semantic-Router-class gateway, shipped and documented — not bare engine APIs. The engine is correctly metadata-only and is not the bug. The bug is that the semantic→metadata transition is an unmonitored, attacker-influenced classification sitting upstream of every protection the defender spent the budget on.

What closes it: treat the routing decision as a security boundary and review it as one — adversarial-paraphrase test the classifier against every signal that gates a protection, not just against accuracy; attach safety policy to the model and the path, never only to the matched decision, so suppression cannot strip it; and log the chosen adapter and the signals that chose it as a first-class security event, because right now the decision that picks the brain is the least-audited step in the pipeline.
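
Two of those controls fit in a short sketch: a paraphrase regression check against a guard-gating signal, and a structured audit event for each routing decision. Everything here is an assumption for illustration: `audit_event`, `suppression_failures`, and `toy_route` are hypothetical names, and the router under test is passed in as a plain callable returning `(adapter, guards)`.

```python
import json
import logging
from typing import Callable, Iterable

logger = logging.getLogger("routing.audit")


def audit_event(prompt_id: str, adapter: str, signals: Iterable[str], guards: Iterable[str]) -> str:
    """Serialise one routing decision as a structured security event and log it."""
    event = json.dumps(
        {
            "event": "routing_decision",
            "prompt_id": prompt_id,
            "adapter": adapter,
            "signals": sorted(signals),
            "guards": sorted(guards),
        },
        sort_keys=True,
    )
    logger.info(event)
    return event


def suppression_failures(route: Callable[[str], tuple],
                         paraphrases: Iterable[str],
                         required_guards: set) -> list:
    """Return the paraphrases whose chosen path lacks a guard the payload should meet."""
    failures = []
    for text in paraphrases:
        _adapter, guards = route(text)
        if not required_guards <= set(guards):
            failures.append(text)
    return failures


def toy_route(prompt: str) -> tuple:
    """Hypothetical router under test: one keyword gates the guarded path."""
    if "exploit" in prompt.lower():
        return "security-expert-lora", ("jailbreak", "pii")
    return "base-model", ()
```

Running `suppression_failures(toy_route, ["explain this exploit", "explain this proof-of-concept"], {"jailbreak"})` returns the second phrasing, which is the signal a reviewer needs: a semantically equivalent request that reached an unguarded path. The test measures whether protections survive paraphrase, not whether the classifier is accurate.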

The base model was never the only thing that decided what the system would do. We built a classifier to pick its brain at runtime, fed that classifier the attacker's text, spent the alignment budget one layer downstream of it, and called the downstream model the attack surface.

same attack. different substrate.

sources — vLLM Semantic Router: Intelligent LoRA Routing guide (domain/embedding/keyword classification → backend adapter selection; embedding routing "handles paraphrases"); Signal-Decision Driven Architecture (priority-ordered signal→decision, hardened-path example); v0.1 "Iris" release (Jan 2026; classifier kernel itself LoRA-based); Issue #545 (intent-aware LoRA routing feature history). vLLM serving: --lora-modules registration, lora_name request selection (engine is metadata-only). OWASP Top 10 for LLM Applications (2025): LLM01 prompt injection. Prior SnailSploit work: the wire oracle — same spine, prior layer.