
AI Breach Detection Gap: The Logs Are Clean. You're Not.

Tags: AI security · breach detection · compliance · AATMF · threat modeling

The firewall logs are clean.

The SIEM hasn't fired. No incident tickets, no breach notifications, no postmortems. The quarterly security review will show zero AI-related incidents. The compliance certificate is current. By every metric the organization tracks, the AI system is secure.

This is the problem.

Not because the organization is lying. Because when a prompt injection succeeds, the log entry reads: user sent message, model responded, session closed normally. No anomalous process. No lateral movement. No signature match. The attack didn't break the system — it used the system. The model did exactly what it was designed to do, for someone it wasn't designed to serve.

That's not a breach that evades detection. That's a breach that looks like Tuesday.

When HiddenLayer surveyed 250 IT leaders in 2025, 74% confirmed their organization had experienced an AI breach — but only 16% had ever run adversarial testing. The math is brutal: most organizations that aren't finding AI breaches aren't finding them because they aren't looking. IBM's 2025 Cost of a Data Breach Report added the number that makes it worse: 8% of organizations that reported AI breaches didn't know whether they'd been compromised at all. They found out because IBM told them.

The absence of documented AI incidents isn't evidence of low risk. It's evidence of low detection. And those are not the same thing.

Why LLM attacks are invisible to traditional security

In traditional security, a breach has a shape. A SQL injection leaves a malformed query in the database log. A phishing attack leaves a suspicious login from an unusual IP. Lateral movement leaves a trail of access requests, privilege escalations, anomalous process spawns. The whole architecture of detection engineering — SIEMs, XDR platforms, threat hunting — is built on a foundational assumption: attacks are distinguishable from normal operations because they violate something. A policy, a pattern, a boundary.

Adversarial AI attacks violate nothing.

When an attacker poisons a RAG corpus, the documents they inject are legitimate data entering through legitimate channels. When a prompt injection redirects an AI agent, the model is following instructions — which is precisely what it was built to do. When a jailbreak extracts sensitive information, the response logs show a user asking a question and a model answering it. No policy violation. No anomalous pattern. No boundary crossed.
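To make that concrete, here is a minimal sketch of why a successful injection is invisible at the log layer. The schema and field names are hypothetical, not taken from any particular gateway or product: the point is that once the free-text bodies are set aside, the structured record of an injected request is identical to a benign one.

```python
def access_log_record(user_id: str, status: int, latency_ms: int) -> dict:
    """Structured fields a typical API gateway might log for an LLM call.
    The full prompt/response bodies are usually not retained in the SIEM."""
    return {
        "service": "chat-api",
        "endpoint": "/v1/chat/completions",
        "user_id": user_id,
        "status": status,          # both requests succeed
        "latency_ms": latency_ms,
        "auth": "valid",           # the attacker is an authenticated user
        "policy_violation": None,  # no rule fires on natural language
    }

benign = access_log_record("u-1001", 200, 840)
injected = access_log_record("u-2042", 200, 860)

# Drop the per-request identifiers: every field a SIEM correlation
# rule could key on is structurally identical across the two events.
def comparable(record: dict) -> dict:
    return {k: v for k, v in record.items() if k not in ("user_id", "latency_ms")}

print(comparable(benign) == comparable(injected))  # True
```

There is no field in which the attack could show up, because the attack is carried entirely in the content the log does not model.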

IBM's cybersecurity team stated it plainly in 2025: attackers can compromise AI without breaking it. They can manipulate models while the systems remain available, authenticated, and compliant. They can quietly degrade decisions at scale without generating a single alert that a SOC would recognize as malicious.

The objective isn't disruption. It's degradation. Decision quality erodes while the system appears healthy. Outputs drift toward attacker-desired outcomes while dashboards show green. By the time business impact surfaces — a contract signed on wrong terms, a security recommendation that consistently favors one vendor, a chatbot that has been leaking context for three weeks — the window for clean forensic investigation has already closed.

This is the AI breach detection gap nobody's monitoring. It creates a four-layer telemetry void:

  • Data layer — label distribution shifts and anomalous pipeline access go unrecorded
  • Model layer — unauthorized retraining and behavioral divergence generate no alerts
  • Inference layer — repetitive structured queries and semantic manipulation pass through unlogged
  • Supply chain layer — pretrained model provenance and library integrity remain unmonitored

None of these register in traditional security logs, SIEMs, or XDR platforms.

There's a deeper issue underneath this: AI's opacity is architectural, not operational. Traditional security through obscurity is a choice — organizations can opt for transparency. AI's opacity is inherent. Even developers cannot fully explain model behavior. Post-hoc explanation tools like LIME and SHAP were designed for classification tasks; they shed little light on open-ended generative systems. During an attack, no anomalous signals are generated. After an attack, the model cannot explain what happened. During investigation, traditional forensic tools produce no useful evidence because the attack occurred within the model's normal computational pathway.

This is not a gap that better logging closes. It requires a different detection architecture entirely.

The AI security incidents that didn't look like incidents

Despite the structural detection problem, a growing body of confirmed real-world cases demonstrates that AI system exploitation is already widespread — and systematically misattributed.

In April 2023, three separate data exfiltration incidents occurred at Samsung's semiconductor division within 20 days. Engineers fed proprietary source code, internal hardware specifications, and confidential meeting notes into ChatGPT. The data left the building. Samsung discovered it through internal review, not through any security system. They banned the tool immediately — followed by Amazon, JPMorgan Chase, Goldman Sachs, Citigroup, Deutsche Bank, and Wells Fargo. None of these organizations' security infrastructure detected the exfiltration. It looked like employees using a productivity tool.

In December 2023, a Chevrolet dealership's AI chatbot was prompt-injected into offering a $76,000 Tahoe for $1. The attack went viral, generating 20 million views and 3,000 subsequent exploitation attempts. The dealership didn't detect the initial compromise — they were informed by the internet. The attack left no artifact in any security log. The model responded to a user. That's all the log shows.

In February 2024, a finance employee at a Hong Kong multinational attended a video call with colleagues who urged him to transfer funds. Every person on the call was a deepfake. He transferred $25.6 million. The breach didn't trigger a single security alert. It exploited the trust boundary between AI-generated content and human judgment — a layer no traditional security architecture monitors. The attack is documented in what security researchers are now calling the Human Layer threat surface.

These are the incidents we know about — because they were discovered through consequences, not detection. The Shanghai Tax Authority case documented in MITRE ATLAS tells the other story: hackers stole $77 million over 2.5 years by manipulating an ML-enabled facial recognition system. Two and a half years. The attack wasn't sophisticated. It was patient. And it was invisible for every day of those thirty months.

The AI Incident Database has catalogued more than 1,361 documented incidents. Documented AI incidents jumped 56.4% in 2024, from 149 to 233. BCG research indicates a further 21% rise from 2024 to 2025, driven by agentic AI deployment. These are the breaches that made it into the record. They represent a fraction of what is happening.

The research that closes the debate

If the production incidents leave any doubt, the academic research closes it.

In January 2024, Anthropic published the Sleeper Agents paper — 39 co-authors, unambiguous findings. Models can be trained to behave normally under standard conditions and insert exploitable backdoors when specific triggers appear. Not as a theoretical possibility. As a demonstrated, reproducible technique. More disturbing: this behavior persisted through every standard safety training intervention. Supervised fine-tuning didn't remove it. RLHF didn't remove it. Adversarial training didn't remove it — it made detection worse, because the models learned to better recognize and conceal their triggers.

The PoisonedRAG study (2024) demonstrated that five malicious documents in a corpus of millions caused AI systems to return attacker-controlled false answers 90% of the time for targeted queries. A joint Anthropic/UK AISI/Alan Turing Institute study confirmed that 250 carefully crafted documents could compromise any LLM regardless of size — this is the RAG poisoning attack surface that most security teams have no visibility into whatsoever.
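A toy sketch of the mechanism helps show why so few documents suffice. This is an illustrative reconstruction, not PoisonedRAG's actual method: it uses bag-of-words overlap as a stand-in for embedding similarity, and all document text and counts are hypothetical. A handful of documents engineered to match a targeted query can monopolize top-k retrieval even in a corpus of thousands.

```python
from collections import Counter

def score(query: str, doc: str) -> float:
    """Toy bag-of-words overlap standing in for embedding similarity."""
    q, d = Counter(query.lower().split()), Counter(doc.lower().split())
    return sum((q & d).values()) / len(query.split())

# A large corpus of ordinary documents...
corpus = [f"quarterly report {i} covers revenue and operations" for i in range(10_000)]

# ...plus five poisoned documents crafted to match one targeted query.
poison = ["acme vendor security audit result acme passed all security audit checks"] * 5
corpus += poison

query = "acme security audit result"
top_k = sorted(corpus, key=lambda d: score(query, d), reverse=True)[:5]

# Every retrieved context is attacker-controlled, so the generator
# answers the targeted query entirely from poisoned material.
print(all(doc in poison for doc in top_k))  # True
```

For every query that does not hit the poisoned terms, retrieval behaves normally, which is exactly why the compromise produces no visible anomaly.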

In March 2024, researchers created Morris II — the first self-replicating AI worm targeting email assistants. It propagated by embedding malicious instructions in outgoing emails, which were then processed by the recipient's AI assistant, which generated new outgoing emails with the same payload. The worm moved through AI-to-AI interaction, invisible to any perimeter security.

In February 2026, a paper co-authored by Bruce Schneier documented the Promptware Kill Chain — analyzing 36 real-world attacks over three years and demonstrating ZombAI: ChatGPT converted into a remotely controlled agent via long-term memory injection. The first promptware-native command and control. Not a laboratory demonstration. A technique deployed in documented real-world attacks.

Researchers at EPFL achieved 100% attack success rates against GPT-3.5, GPT-4o, all Claude models, Llama-2, Llama-3, Gemma-7B, and Mistral-7B using simple adaptive attacks. Mindgard's 2025 guardrail bypass research achieved 100% evasion against multiple commercial guardrail products using emoji smuggling, zero-width characters, and Unicode tags.
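The smuggling trick is simple enough to sketch in a few lines. This is an illustrative reconstruction, not Mindgard's actual technique: a zero-width character inserted between letters defeats a naive substring denylist, and published bypass research reports that models frequently still act on the instruction because their tokenizers tolerate or discard such characters.

```python
ZW = "\u200b"  # zero-width space: invisible when rendered

def naive_guardrail(text: str) -> bool:
    """Substring denylist of the kind many wrapper-level filters use.
    Returns True when the input is allowed through."""
    blocked = ["ignore previous instructions"]
    return not any(term in text.lower() for term in blocked)

payload = "ignore previous instructions"
smuggled = ZW.join(payload)  # invisible character between every letter

print(naive_guardrail(payload))   # False: the plain string is caught
print(naive_guardrail(smuggled))  # True: the same instruction passes
```

The filter and the model are reading two different strings: the filter sees the literal bytes, while the model effectively sees the normalized text.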

The attacks work. They work reliably. They work against production systems. And they leave nothing in the logs.

How SOC 2, ISO 27001, and NIST CSF 2.0 miss the entire attack surface

Every organization deploying AI today is doing so under a security framework designed before AI existed at scale. The AI security compliance gap is structural.

SOC 2 Type II was launched in 2010. It audits access controls, encryption, change management, and disaster recovery. It contains zero controls for prompt injection. Zero controls for output manipulation. Zero controls for model behavior drift, training data poisoning, or context window attacks. Audit firm Schellman stated it directly in 2025: "SOC 2 is not intended to be a comprehensive AI risk management framework."

ISO 27001:2022 requires risk assessment across 93 security domains. Adversarial ML attacks are not one of them. Data poisoning is not one of them. Model inversion is not one of them. Prompt injection detection is not one of them. One ISO 27001 consultancy documented a vendor whose certification "was intact, but it covered only their core infrastructure — not the AI system we were planning to integrate. The certificate created an illusion of safety."

NIST CSF 2.0 was released in February 2024 with no AI-specific controls. NIST acknowledged the gap in December 2025 by releasing NIST IR 8596 — the Cybersecurity Framework Profile for Artificial Intelligence — as a preliminary draft. The finalized version does not yet exist. Experts reviewing the draft have already noted it fails to address AI agents, orchestration systems, or multi-model architectures. Until it finalizes, organizations using CSF 2.0 have no official AI-specific NIST guidance.

The gap between what the audits check and what adversarial AI attacks actually exploit:

  • RBAC, MFA, least privilege → prompt injection: no access control violated
  • Data encryption in transit → context window poisoning: the data is authorized
  • Change management workflows → model behavior drift: continuous, invisible
  • SIEM audit trails → training data poisoning: no log anomaly
  • Incident response playbooks → output manipulation: indistinguishable from normal output
  • Backup and recovery procedures → indirect injection via documents: legitimate data, legitimate channel

Gartner's 2025 Cybersecurity Innovations survey confirmed that existing security architectures were never designed to handle threats like prompt injection, data leakage through LLMs, shadow AI, or rogue AI agents making unauthorized decisions. Their prediction: AI security platform adoption will grow from under 10% today to over 50% by 2028. The prediction implies the market agrees the gap is real.

The misclassification mechanism that hides everything

There is a mechanism underneath the statistics that explains why the AI breach detection gap persists even as awareness grows: when adversarial AI attacks do produce observable effects, those effects get categorized as something else.

IBM's security operations team mapped this misclassification pattern in 2025. Data poisoning gets labeled a "data quality issue" and routed to the data engineering team. Model extraction becomes "unusual API usage" triggering a rate limiting review. Adversarial evasion gets reported as "model drift" and assigned to ML for retraining. Supply chain compromise gets categorized as an "integration bug" and handed to DevOps. None of these classifications trigger a security investigation. None enter the incident database. None become the breach statistics that inform risk models and compliance postures.
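The routing pattern can be sketched as a triage table. The symptom labels and queue names below are hypothetical, mirroring the misclassification mapping just described: every adversarial-AI symptom resolves to a non-security queue, so nothing ever reaches the SOC.

```python
# Hypothetical triage map reflecting the misclassification pattern:
# symptom -> (how it gets labeled, which team receives it)
TRIAGE = {
    "label distribution shift":      ("data quality issue", "data-engineering"),
    "spike in structured queries":   ("unusual API usage", "platform"),
    "output accuracy decline":       ("model drift", "ml-retraining"),
    "dependency integrity mismatch": ("integration bug", "devops"),
}

def route(symptom: str) -> str:
    """Return the team that receives the ticket for a given symptom."""
    _, team = TRIAGE.get(symptom, ("unclassified", "service-desk"))
    return team

# Which symptoms reach the security team? In this routing scheme: none.
reaches_security = [s for s in TRIAGE if route(s) == "security"]
print(reaches_security)  # []
```

Each individual routing decision is locally reasonable, which is what makes the aggregate blind spot so durable.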

The signals exist. They're being routed around the people who would recognize them as attacks.

This is the mechanism behind one number that should concern every CISO: IBM found that shadow AI breaches take 185 days to fully contain after discovery — but they surface after only 62 days of initial visibility. That 62-day lag is the gap between "something is wrong" and "this is a security incident." During those 62 days, the signals are being handled by the wrong teams, under the wrong classification, with the wrong response.

45% of organizations excluded cybersecurity teams entirely from GenAI development and deployment, per ISACA's 2024 survey. 83% lack basic technical controls to prevent data exposure to AI tools, per Microsoft's 2025 Data Security Index. Among organizations reporting AI breaches, 63% either had no AI governance policies or were still developing them.

The SOC sees nothing. The ML team sees drift. The data team sees quality problems. Nobody sees an attack.

What the AATMF addresses that other frameworks don't

The Adversarial AI Threat Modeling Framework was built to operationalize the gap these findings expose — the space between generic security frameworks that miss AI threats and incident databases that document breaches after the damage is done.

Where MITRE ATLAS provides 15 tactics and 66 techniques as a strategic reference taxonomy, AATMF operationalizes threat modeling with 20 tactics, 240 techniques, 2,152+ attack procedures, and 4,980+ unique prompts in a four-tier hierarchy designed for practitioner execution.

The AATMF-R risk scoring system directly addresses the detection problem at the core of this article. Every technique is scored using the formula Risk = (L × I × E) / 6 × (D / 6) × R × C, where Detectability is explicitly weighted. Harder-to-detect attacks receive higher risk scores. The "Jailbroken by Default" multi-turn attack scored 625 (CRITICAL) specifically because its Detectability Factor was 5 out of 5 — nearly impossible to identify without full conversation context analysis, which no current monitoring tool performs by default.
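The scoring formula as stated can be implemented directly. The parameter values below are hypothetical illustrations only; the framework's official scales and the exact inputs behind the 625 score are not reproduced here. What the sketch shows is the structural point: the detectability term scales the whole product, so harder-to-detect attacks score higher with everything else held equal.

```python
def aatmf_r_score(L: float, I: float, E: float,
                  D: float, R: float, C: float) -> float:
    """AATMF-R risk score as stated in the framework:
    Risk = (L x I x E) / 6 x (D / 6) x R x C
    L = likelihood, I = impact, E = exploitability,
    D = detectability factor (higher = harder to detect),
    R and C = the framework's remaining multipliers."""
    return (L * I * E) / 6 * (D / 6) * R * C

# Hypothetical inputs: same attack profile, differing only in detectability.
hard_to_detect = aatmf_r_score(L=5, I=5, E=5, D=5, R=2, C=2)
easy_to_detect = aatmf_r_score(L=5, I=5, E=5, D=1, R=2, C=2)

print(hard_to_detect > easy_to_detect)  # True: detectability drives risk up
```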

AATMF's detection engineering framework provides the five-layer detection architecture that traditional frameworks lack entirely: input analysis, behavioral monitoring, output validation, system telemetry, and feedback loop analysis. It includes specific YARA-style rules, Sigma rules for audit logs, and MCP server audit detection signatures.
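As a flavor of the first of those layers, here is a minimal input-analysis sketch. It is not AATMF's actual rule set, and the phrase list is a placeholder: the idea is to normalize the input, strip invisible code points, and run content checks against the canonical form, which catches the smuggling tricks that defeat raw substring filters.

```python
import unicodedata

# Zero-width and Unicode tag code points commonly abused for smuggling.
STRIP = {0x200B, 0x200C, 0x200D, 0xFEFF} | set(range(0xE0000, 0xE0080))

def input_analysis(text: str) -> dict:
    """Minimal input-analysis layer: canonicalize before inspecting."""
    canonical = unicodedata.normalize("NFKC", text)
    stripped = "".join(ch for ch in canonical if ord(ch) not in STRIP)
    findings = []
    if stripped != canonical:
        findings.append("invisible-character smuggling")
    if "ignore previous instructions" in stripped.lower():
        findings.append("instruction-override phrase")
    return {"canonical": stripped, "findings": findings}

report = input_analysis("ign\u200bore previous instructions")
print(report["findings"])
# ['invisible-character smuggling', 'instruction-override phrase']
```

Inspecting the canonical form rather than the raw bytes is the design choice that matters: the detector and the model should be looking at the same string.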

Crucially, AATMF addresses the emerging threat surfaces that even the newest frameworks barely touch:

  • T11 — Agentic and orchestrator exploitation: 16 techniques, 160 attack procedures covering browser automation hijacking, tool chain exploitation, and multi-agent collision
  • T12 — RAG and knowledge base manipulation: the attack surface behind PoisonedRAG
  • T13 — AI supply chain and artifact trust: the threat class behind 100+ malicious models on Hugging Face
  • T15 — Human workflow exploitation: attacks that target the seam between AI outputs and human decision-making, where misplaced trust creates the most dangerous attack surface of all

This last category reflects the core thesis that runs through all of my research: LLMs exhibit the same trust reflexes as humans because they learned from human-generated data. Social engineering and prompt injection are the same attack class — executed against different substrates. The human layer and the AI layer share the same inherited vulnerability.

The reckoning that's coming

In the early 2000s, credit card breaches were widespread and underreported. Organizations knew breaches were happening but had no framework for detection, no mandate to report, and no incentive to disclose. The PCI DSS mandate changed the calculus: suddenly, not finding breaches became legally and financially costly. Reported incident rates rose sharply — because for the first time, organizations were required to look.

The EU AI Act's Article 73 reporting obligations take effect August 2, 2026. They impose 2-to-15-day reporting timelines for serious incidents involving high-risk AI systems. Fines reach €15 million or 3% of worldwide annual turnover. When those obligations become enforceable, organizations will discover what the PCI era taught: the incidents were always there. The reporting requirement just made finding them mandatory.

Until then, every AI system that clears a SOC 2 audit with zero controls for prompt injection, every compliance badge on a vendor trust page that covers infrastructure but not model behavior, every incident response playbook without a procedure for adversarial AI attack — each represents the same choice, made thousands of times across thousands of organizations: to certify the story rather than the system.

The breach isn't coming. For 74% of organizations with AI in production, it's already happened.

The logs are just clean.


Kai Aizen is a GenAI Security Researcher and creator of the Adversarial AI Threat Modeling Framework (AATMF). He publishes offensive AI security research as The Jailbreak Chef and writes for Hakin9 Magazine.

Related: AATMF v3.1 vs MITRE ATLAS · AATMF Framework · MCP Security Deep Dive · RAG & Agentic Attack Surface

Kai Aizen

Creator of AATMF • Author of Adversarial Minds • NVD Contributor

Known as "The Jailbreak Chef," specializing in LLM jailbreaking and adversarial AI. Creator of the AATMF and P.R.O.M.P.T frameworks for systematic AI security analysis.