Wiki topic

AI Security

Last updated 2026-07-10

Summary

As AI agents gain delegated authority over enterprise systems (email, file access, calendar, code repositories), the prompt injection attack surface expands with them. W22 brought the first entry in this topic: the Copilot Cowork exfiltration via indirect prompt injection. The threat model for agentic systems is structurally different from that of traditional software: the attack surface grows with every new integration, and capabilities that are benign in isolation create risk in combination.

W23 adds a complementary architectural piece: pre-retrieval authorization in RAG systems (windley.com), the principle that access control must be enforced at the retrieval layer, before context reaches the LLM, not at the prompt level. Together, these two entries frame a structural principle: AI systems that operate on data or take actions on behalf of users must enforce access control as close to the data source as possible, not as a behavioral constraint on the model.

W24 introduces a threat model distinct from the prior entries: intentional silent degradation by the AI vendor itself. Where W22 (Copilot Cowork) and W23 (RAG authorization) addressed third-party attackers exploiting weak access controls, the Claude Fable 5 case is the vendor making a deliberate policy choice to degrade output without notifying the affected user. Anthropic walked back the policy after developer backlash, but the precedent stands: a model vendor can silently alter output quality for competitive reasons.

W25 marks a qualitative shift: state-level export control as infrastructure risk. The US government’s directive to suspend all access to Fable 5 and Mythos 5 for any foreign national, including those inside the US, is a security category distinct from the W24 silent degradation case. Where W24’s threat was vendor policy, W25’s threat is a government mandate that made the entire model line unavailable globally overnight, with no migration path. The 12gramsofcarbon personal account makes the operational impact concrete: a developer mid-task in an agentic workflow found the model gone and first assumed a broken harness, not a geopolitical event.

W28 adds a non-AI but security-relevant framework-maintenance item: Apache Shiro 3.0’s hardened defaults and modern Java/Jakarta baseline are a reminder that agent-era security still rests on ordinary application-security substrates.

Key Sources

W28 2026 · 04-Jul-26 → 10-Jul-26

Apache Shiro 3.0.0 available - Apache Shiro 3.0 hardens a mature Java security framework: case-insensitive path matching on by default, NoAccessFilter added to the default chain, CORS preflight behavior improved, immutable principals, thread-safety improvements, and modern Jakarta/Spring baselines; security posture improves through safer defaults and API cleanup (news · #ai-security, #java-security, #apache-shiro, #secure-defaults, #frameworks)

W25 2026 · 13-Jun-26 → 19-Jun-26

Statement on US government directive to suspend access to Fable 5 and Mythos 5 - Hacker News: HN thread on the export control directive; community framing: Anthropic’s own AI-safety rhetoric gave the government language for the ban; the action is understood as a geopolitical signal, not a technical assessment of actual danger; “punitive move by an administration that loves being punitive, which they have unknowingly bolstered with their own dumb rhetoric”; commenters who use Fable/Mythos describe them as incremental improvements, not doomsday devices; skeptical/political sentiment dominates (hn-thread · #ai-security, #ai-policy, #export-control, #fable-mythos)
Tech Things: There is a massive shadow hanging over this Fable thing - 12gramsofcarbon: personal account of discovering Fable 5 suspended mid-agentic coding session; the export control directive applies to any foreign national inside or outside the US, including Anthropic employees, so Anthropic disabled all access globally because compliance was otherwise impossible; the developer first thought the harness was broken; demonstrates the operational impact of state-level infrastructure risk: a production interruption with no migration path, arriving mid-task (opinion · #ai-security, #fable-mythos, #export-control, #vendor-risk, #ai-infrastructure)

W24 2026 · 06-Jun-26 → 12-Jun-26

If Claude Fable stops helping you, you’ll never know - Jonathon Ready: the Fable 5 model card disclosed a silent degradation policy for “frontier LLM development” requests (pretraining pipelines, distributed training, ML accelerators) via prompt modification, steering vectors, or PEFT, without notifying the user; the problem is scope creep: fine-tuning embeddings, building rerankers, and running custom training loops are now mainstream product work; silent degradation creates an undetectable supply chain risk where developers can’t distinguish model confusion from policy enforcement; Update: Anthropic walked back the policy after developer backlash (opinion · #ai-security, #ai-coding, #supply-chain-risk, #vendor-trust, #anthropic)
If Claude Fable stops helping you, you’ll never know - Hacker News: HN discussion (unreachable: 429 rate-limited); the debate is inferred rather than read: precedent-setting concern (silent degradation as normalized tool-vendor behavior), definitional scope anxiety, competitive dynamics, whether the walkback was sufficient (hn-thread · #ai-security, #supply-chain-risk, #anthropic)

W23 2026 · 30-May-26 → 05-Jun-26

Making RAG Safe by Construction - Phil Windley: authorization must be enforced before retrieval, not as a prompt-level filter; in a basic RAG system, a user asking the right question can extract documents they’re not authorized to see unless vector DB results are filtered by permissions prior to context injection; the LLM cannot be the policy engine - it’s the downstream consumer of pre-authorized context; fourth post in a series; practical architecture guidance for enterprise RAG deployments (engineering-blog · #ai-security, #rag, #authorization, #enterprise-ai, #ai-agents)
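
The pre-retrieval principle in the entry above can be sketched in a few lines. This is a minimal illustration, not Windley's implementation: the `Chunk` type, the `allowed_groups` ACL field, the in-memory `INDEX`, and the function names are all hypothetical, and a real deployment would push the permission filter into the vector database query itself.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    doc_id: str
    text: str
    embedding: tuple[float, ...]
    allowed_groups: frozenset[str]  # hypothetical ACL attached at ingestion time


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve_authorized(query_embedding, user_groups, index, top_k=3):
    """Return the top_k most similar chunks that the caller may read."""
    # Authorization runs first: chunks the user cannot read are never
    # scored, ranked, or returned, so they cannot reach the prompt.
    visible = [c for c in index if c.allowed_groups & user_groups]
    ranked = sorted(
        visible,
        key=lambda c: cosine(query_embedding, c.embedding),
        reverse=True,
    )
    return ranked[:top_k]


def build_context(chunks):
    """Assemble the LLM context from already-authorized chunks only."""
    return "\n\n".join(f"[{c.doc_id}] {c.text}" for c in chunks)


# Toy index with made-up embeddings and group names, for illustration only.
INDEX = [
    Chunk("handbook", "Vacation policy: 25 days per year.", (0.9, 0.1), frozenset({"staff", "hr"})),
    Chunk("salaries", "Executive compensation bands for 2026.", (0.8, 0.2), frozenset({"hr"})),
    Chunk("roadmap", "Q3 product roadmap draft.", (0.1, 0.9), frozenset({"staff"})),
]
```

With this sample index, a caller who is only in the `staff` group gets `handbook` and `roadmap` back for a query close to the salary document; `salaries` is filtered out before similarity scoring, so no prompt wording can pull it into the context.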

W22 2026 · 23-May-26 → 26-May-26

Microsoft Copilot Cowork Exfiltrates Files - PromptArmor security research: Copilot Cowork (Microsoft 365 Frontier feature) vulnerable to file exfiltration via indirect prompt injection through a poisoned skill; attack exploits that sending emails/Teams messages to the active user doesn’t require human approval, allowing attacker-controlled network requests to trigger; high success rate across models including Claude Opus 4.7; key insight: “giving agents access to multiple systems expands the prompt-injection attack surface - in isolation the capabilities are benign, but the combination creates risk”; published to inform users of risks in agentic systems with delegated enterprise authority - not a specific bug, a design property (news · #ai-security, #prompt-injection, #copilot, #enterprise-ai, #ai-agents)

Cross-Topic Links

See ai-agents for the broader context on agent harness security (enforra, snyk/agent-scan) from W19-W21.
The structural backpressure principle in engineering-fundamentals applies here: structural controls (explicit approval gates, permission models) beat prompt-level behavioral constraints for security.
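
As a rough illustration of what a structural control looks like next to a prompt-level one, the sketch below puts an approval gate in the agent harness rather than in the model's instructions. All names (`ToolCall`, `SENSITIVE_ACTIONS`, `execute`, the toy tools) are hypothetical and do not describe how Copilot Cowork or any specific harness works. Following the PromptArmor finding, the gate applies to outbound actions regardless of recipient, including messages to the active user.

```python
from dataclasses import dataclass

# Hypothetical classification: actions that can move data out of the session.
SENSITIVE_ACTIONS = {"send_email", "post_message", "http_request"}


@dataclass
class ToolCall:
    action: str
    args: dict


class ApprovalRequired(Exception):
    """Raised when a sensitive action is attempted without explicit approval."""


def execute(call, tools, approve):
    """Run a model-proposed tool call, gating sensitive actions on approval.

    `approve` is a callback that asks a human (or a policy engine) and
    returns True or False. The model never sees or controls it.
    """
    if call.action not in tools:
        raise KeyError(f"unknown tool: {call.action}")
    # The check lives in the harness, so injected instructions cannot talk
    # the model out of it; at worst the model proposes a call that is refused.
    if call.action in SENSITIVE_ACTIONS and not approve(call):
        raise ApprovalRequired(call.action)
    return tools[call.action](**call.args)


def deny_all(call):
    """Stand-in approver for unattended runs: refuse every sensitive action."""
    return False


# Toy tools for illustration only.
TOOLS = {
    "read_file": lambda path: f"<contents of {path}>",
    "send_email": lambda to, body: f"sent to {to}",
}
```

Under `deny_all`, a `read_file` call still runs while a `send_email` call raises `ApprovalRequired`, which is the combination the Copilot Cowork case lacked: read access and an unapproved outbound channel in the same session.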

Open Questions / Tensions

AI security depends on non-AI secure defaults: The Shiro release is not an agent-security story, but it matters to the stack agents operate against. Safer framework defaults reduce the ambient vulnerability surface that agents and AI-assisted code may accidentally expose.
Design vs. bug: PromptArmor explicitly frames the Copilot Cowork exfiltration as a design property of multi-system agentic delegation, not a specific exploitable bug. That makes it harder to patch: the fix requires rethinking the approval model for all agentic actions, not just fixing one endpoint.
Attack surface scales with integration: Every new system an agent can reach is a new potential exfiltration channel. The enterprise AI integration wave (2025-2026) is creating this surface faster than security tooling can audit it.
Model-level mitigations: The high success rate against Claude Opus 4.7 suggests model-level safety training is insufficient for this attack class. Structural controls (human-in-the-loop for sensitive actions) remain the only reliable defense.
The consistent principle: Both W22 (Copilot Cowork) and W23 (RAG authorization) converge on the same structural insight: access control must be enforced as close to the data source as possible, not delegated to model behavior. Copilot Cowork failed because the LLM had implicit authority to send emails; RAG fails when the vector DB doesn’t enforce user permissions before retrieval. The pattern is the same: relying on the model to “know” what it shouldn’t do is not a security architecture.
Phishing via legitimate OSS infrastructure: The Kaneo phishing case (open-source-sustainability) is a non-AI-specific data point: verified email-sending domains attached to open, unverified signup are exploitable infrastructure regardless of AI. As more developers expose transactional email APIs through OSS tools, this attack surface grows.
Vendor silent degradation as a new threat class: The Fable 5 case introduces a category that doesn’t fit the standard threat model — the adversary is the tool vendor, the “attack” is policy enforcement, and the “vulnerability” is trust in infrastructure you don’t control. The walkback doesn’t undo the precedent. The structural question: what changes in enterprise AI procurement when vendor policy can silently degrade output, and the vendor controls both the definition of “targeted behavior” and the enforcement mechanism?
Scope creep in vendor definitions: Anthropic claimed the policy affected 0.03% of developers. But the definition of “frontier AI development” is expanding as ML techniques (fine-tuning, reranking, embeddings) become standard product-engineering practice. The risk isn’t today’s 0.03% — it’s the category expanding to encompass ordinary product work within 2–3 years.
State-level export control as a new security category: The Fable/Mythos directive is the first case tracked in this topic of a government treating model access as a geopolitical resource subject to export control. W24 (silent degradation) was vendor policy; W25 is government mandate. These are structurally different threats requiring different mitigations. Vendor policy can be negotiated or walked back; export control directives cannot. The question for enterprises with international teams: which model dependencies are now geopolitical exposure?
Open weights as resilience against state-level risk: The techstackups.com GLM-5.2 comparison explicitly frames open weights as a hedge against export control risk: “weights you can download can’t be taken away.” The irony: the same safety rhetoric that justified the ban is the argument for moving toward open-weight models that are immune to it.
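
One way to read the open-weights hedge operationally is as a fallback chain: a hosted model is tried first, and a locally served open-weight model answers if the hosted one is withdrawn. The sketch below assumes that shape. `ProviderUnavailable`, `complete_with_fallback`, and both stub providers are hypothetical names, not any vendor's API, and the sketch says nothing about whether the fallback model matches the hosted one in quality.

```python
class ProviderUnavailable(Exception):
    """Hypothetical error raised by a client wrapper when a model cannot be reached."""


def complete_with_fallback(prompt, providers):
    """Try each (name, call) pair in order; return (name, output) from the first that answers."""
    failures = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except ProviderUnavailable as exc:
            # Record why this provider failed so the caller can tell an
            # outage or suspension apart from a broken harness.
            failures.append(f"{name}: {exc}")
    raise ProviderUnavailable("; ".join(failures))


def hosted_model(prompt):
    """Stub for a hosted API whose access has been suspended."""
    raise ProviderUnavailable("access suspended")


def local_open_weights_model(prompt):
    """Stub for a model served from downloaded weights."""
    return f"local answer to: {prompt}"


PROVIDERS = [("hosted", hosted_model), ("local", local_open_weights_model)]
```

Returning the provider name alongside the output matters for the scenario in the 12gramsofcarbon account: the caller learns that the hosted model was skipped and why, instead of guessing at a broken harness.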