Wiki topic

AI Agents

Last updated 2026-05-29

Summary

Mr. Nayak has been tracking the rapidly evolving AI agents space, from platform infrastructure and harness engineering to version control, security, governance, and the unintended side effects of autonomous agents in the wild. The consistent theme: the model is only one component; what wraps it (the harness) is where the engineering leverage lives. W21 added two threads. The first is structural correctness as a design goal: using gates baked into the codebase (types, compilers, tests) rather than prompt instructions. The second is the ecosystem response to AI bot spam: engineers are now building git-level filters (the --author flag technique) to block automated AI commits, much as email spam filters did decades earlier. W22 closes on two more threads. One is the LLM reliability problem at the knowledge-arbiter layer: lenz.io data shows frontier models disagree on 67% of real fact-check claims, a calibration and grounding problem rather than just a knowledge cutoff. The other is the interceptor vs. live-participant framework for how AI agents join human conversations, where attribution is the hinge: interceptors are invisible and unauditable; live participants are contestable and correctable. The SIA paper also closes a research gap by combining the harness-update and test-time weight-update approaches in one self-improving loop, and the combination outperforms either approach alone across three benchmarks.

Key Sources

W22 2026 · 23-May-26 → 29-May-26

  • Self Improving AI with Harness & Weight Updates — arXiv (cs.AI): SIA proposes a self-improving loop where a Feedback-Agent simultaneously rewrites both the harness (tools, prompts, retry logic, search procedure) and the model weights of a task-specific agent; combining both levers outperforms harness-only or weight-only on all three test domains: +25.1% on LawBench (legal), +12.4% faster GPU kernels, +20.4% on single-cell RNA denoising; “harness updates make the model agentic; weight updates build domain intuition that no prompt or scaffold can instil” (paper · #self-improvement, #harness-engineering, #rl, #fine-tuning)
  • Interceptors and Demons — two structural patterns for AI agents in human communication: (A) Interceptor — invisible enrichment pipeline, messages sent under human name, unauditable and publicly incontestable; (B) Live Participant/Demon — named agent in the thread, attributable and overridable; the cultural delta is larger than the technical one; most orgs will start with interceptor and migrate to live-participant once attribution norms develop (thought-leadership · #human-ai-collaboration, #multi-agent, #attribution, #trust)
  • Agentic AI: How to Save on Tokens — four design principles for agentic cost control: reuse tokens via KV/prefix caching and semantic caching; minimize stable overhead by lazy-loading tools/MCPs; route to smaller models and escalate; keep context clean via compaction and subagent delegation; an unoptimized 100-msg/day agent at 166K input tokens costs ~$1K/month — these principles can bring it to ~$50 (engineering-blog · #token-optimization, #prompt-caching, #cost-engineering, #agentic-ai)
  • Beyond Benchmarks: Disagreement Among Frontier LLMs on Real-World Fact-Checks — 5 frontier LLMs rated 1,000 real user fact-check claims; disagreement on 67% of claims (95% CI 64–70%); 34% involve substantive 2+ bucket gaps; Krippendorff’s α = 0.639; panel fractures most at “Misleading” and “Mostly True” — the middle of the rubric; challenges assumptions that LLMs can reliably arbitrate factual questions without grounding or retrieval (thought-leadership · #llm-reliability, #benchmarks, #fact-checking, #ai-evaluation)
  • Five frontier LLMs disagree on 67% of 1k real-world fact-check claims — Hacker News — HN discussion; methodological skepticism dominates: label rubric is vague (“misleading” and “mostly true” overlap, no rubric given to models); the Ukraine drone example (where “I can’t verify” was not an option) is cited as the clearest genuine limitation; mixed sentiment — methodology critiqued but core finding accepted (hn-thread · #llm-reliability, #fact-checking, #ai-evaluation)
  • OpenClaw vs. Hermes Agent: The race to build AI assistants that never forget — The New Stack comparison of two persistent agent platforms competing on durable memory; signals that persistent agents are becoming a recognized infrastructure category, not just a research curiosity (thought-leadership · #ai-agents, #persistent-agents, #agent-infrastructure)
  • Research-Driven Agents: What Happens When Your Agent Reads Before It Codes — SkyPilot: coding agents that read papers and study competing projects before writing code find optimizations code-only agents miss; adding a literature search phase to the autoresearch loop produced +15% faster flash attention on x86; studying forks was more productive than arxiv searches; total cost ~$29 over ~3 hours (engineering-blog · #ai-agents, #code-optimization, #research-driven, #autoresearch)
  • Building Pi With Pi — Armin Ronacher on using the Pi coding agent to work on Pi’s own issue tracker; the key finding: AI-generated issue reports are a worse failure mode than vague ones — they contain wrong diagnoses stated with high confidence, which mislead the agent when that issue is used as a prompt input; links to the emerging quality problem in AI-assisted open-source maintenance (opinion · #ai-agents, #open-source, #slop, #issue-quality)
  • Building Pi With Pi — Hacker News — HN discussion on Ronacher’s post; community adds: the same slop-diagnosis problem is appearing across OSS maintainer workflows, not just Pi; issue quality is becoming the real bottleneck in AI-assisted development (hn-thread · #ai-agents, #open-source, #slop)
  • Hybrid AI: Combining Deterministic Analytics with LLM Reasoning — practical case study: pure LLM analytics fails (fabricated outputs that look plausible) when applied to manufacturing data; the fix is a hybrid architecture where deterministic code does the math and the LLM handles interpretation and interaction; directly extends the structural backpressure principle from W21 (engineering-blog · #ai-agents, #hybrid-ai, #deterministic-ai, #llm-architecture)
  • AWS Architecture Diagram Skill — agent skill file for generating AWS architecture diagrams; practical example of skill-based agent specialization (repository · #ai-agents, #aws, #agent-skills)
  • Microsoft Copilot Cowork Exfiltrates Files — PromptArmor research: Copilot Cowork vulnerable to file exfiltration via indirect prompt injection in a poisoned skill; exploits insecure automatic action approval for sending emails/Teams messages; high success rate including against Claude Opus 4.7; key insight: delegated authority across multiple enterprise systems expands attack surface beyond any single system’s threat model (news · #ai-security, #ai-agents, #prompt-injection, #enterprise-ai)
  • A Circuit Prompt Programming Language (CPPL) — arXiv paper (cs.AR): CPPL turns LLM-assisted hardware design into a statically checkable frontend problem; it pairs a Python DSL with a JSON circuit IR that LLMs can emit; the compiler infers widths, validates hierarchy, and lowers to CIRCT; improves functional correctness over direct Verilog generation; the structural backpressure principle applied to hardware (paper · #ai-agents, #hardware-design, #formal-methods, #llm-coding)
  • CPPL: A Circuit Prompt Programming Language — Hacker News — HN discussion on the CPPL paper; community interest in compiler-mediated LLM interfaces beyond text (hn-thread · #ai-agents, #hardware-design, #formal-methods)
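
The cost figures in the token-optimization entry above can be sanity-checked with a little arithmetic. A minimal sketch, assuming the 166K input tokens are per message and a 30-day month (neither is stated in the entry); the per-million price is derived from the entry's own numbers, not taken from any provider's price list:

```python
# Figures quoted in the entry.
MESSAGES_PER_DAY = 100
UNOPTIMIZED_MONTHLY_USD = 1_000   # "~$1K/month"
OPTIMIZED_MONTHLY_USD = 50        # "~$50"

# Assumptions made for this sketch (not stated in the entry).
INPUT_TOKENS_PER_MESSAGE = 166_000  # assumes 166K is per message
DAYS_PER_MONTH = 30


def monthly_input_tokens(messages_per_day: int, tokens_per_message: int, days: int) -> int:
    """Total input tokens billed in a month if nothing is cached or trimmed."""
    return messages_per_day * tokens_per_message * days


tokens = monthly_input_tokens(MESSAGES_PER_DAY, INPUT_TOKENS_PER_MESSAGE, DAYS_PER_MONTH)

# Price per million input tokens implied by the ~$1K/month figure.
implied_usd_per_million = UNOPTIMIZED_MONTHLY_USD / (tokens / 1_000_000)

# Fraction of billed cost that must disappear to reach ~$50/month.
required_reduction = 1 - OPTIMIZED_MONTHLY_USD / UNOPTIMIZED_MONTHLY_USD

print(f"monthly input tokens: {tokens:,}")
print(f"implied price: ${implied_usd_per_million:.2f} per million input tokens")
print(f"required cost reduction: {required_reduction:.0%}")
```

Under these assumptions the ~$1K figure implies roughly $2 per million input tokens, and reaching ~$50 means cutting billed input cost by about 95%.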

W21 2026 · 16-May-26 → 22-May-26

  • We stopped AI bot spam using Git’s --author flag (unreachable: 429 rate-limited) — HN thread on a technique to block AI bot commits in GitHub repos by filtering on the git --author flag; ecosystem-level response to the bot spam problem documented by Archestra AI
  • Welcome to Learn Harness Engineering — structured course synthesizing OpenAI + Anthropic harness engineering best practices; teaches closed-loop system design, context management, verification loops, and observability
  • Learn Harness Engineering | Hacker News — 155-point HN discussion; community debates multi-model verification loops, setup/token cost tradeoffs, and the core principle that good harnesses constrain action surface while making outputs easier to review
  • Let’s talk about AI slop — Archestra AI’s first-hand account: AI bots flooded their GitHub repo with 253 comments and 27 untested PRs on a single bounty issue; real contributors were buried; shows autonomous agents causing active harm in open-source workflows
  • Structural Backpressure Beats Smarter Agents — key thesis: structural gates (compilers, type checkers, tests, proof checkers) beat behavioral gates (prompt instructions) for ensuring AI-generated code correctness; proposes “backpressure loop” where code bounces off structural checks until invariants are satisfied
  • shenli/distributed-system-testing — SKILL.md agent skills for claim-driven distributed/stateful system testing; produces structured Markdown test plan + findings report with 9-state verdicts and SUT/harness/checker blame classification
  • enforra/enforra — local action governance SDK; policy enforcement (allow/block/require_approval/log_only) before agent tool callbacks execute; key insight: “system prompts are not a security boundary”; no network calls, no remote execution
  • Agent Harness Engineering — “A decent model with a great harness beats a great model with a bad harness”; defines harness as everything wrapped around the model: prompts, tools, context policies, hooks, sandboxes, subagents, recovery paths
  • MinishLab/semble — fast semantic code search for agents using ~98% fewer tokens than grep+read; MCP server support
  • microsoft/AI-Engineering-Coach — Microsoft’s toolkit for better agentic engineering practices
  • adamjgmiller/adamsreview — multi-lens code review pipeline built for Claude Code
  • Show HN: adamsreview — HN discussion on multi-agent PR reviews; community interest in structured multi-perspective review workflows
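
The backpressure loop from the Structural Backpressure entry above can be sketched as follows. This is an illustrative sketch, not the article's implementation: `backpressure_loop`, `syntax_gate`, `unit_test_gate`, and `make_scripted_agent` are hypothetical names, and the scripted agent is a stand-in for a real model call.

```python
from typing import Callable, List, Optional, Sequence, Tuple

# A gate inspects candidate source and returns an error message, or None if it passes.
Gate = Callable[[str], Optional[str]]
Agent = Callable[[List[str]], str]


def syntax_gate(source: str) -> Optional[str]:
    """Structural check: the candidate must compile."""
    try:
        compile(source, "<candidate>", "exec")
    except SyntaxError as exc:
        return f"syntax error: {exc.msg}"
    return None


def unit_test_gate(source: str) -> Optional[str]:
    """Structural check: the candidate must define add() and pass one test."""
    namespace: dict = {}
    try:
        exec(source, namespace)
        if namespace["add"](2, 3) != 5:
            return "add(2, 3) did not return 5"
    except Exception as exc:
        return f"test could not run: {exc!r}"
    return None


def backpressure_loop(agent: Agent, gates: Sequence[Gate], max_attempts: int = 5) -> Tuple[str, int]:
    """Bounce candidates off the gates until every gate passes or attempts run out."""
    feedback: List[str] = []
    for attempt in range(1, max_attempts + 1):
        candidate = agent(feedback)  # the agent sees the previous round's errors
        feedback = []
        for gate in gates:
            error = gate(candidate)
            if error is not None:
                feedback.append(error)
        if not feedback:
            return candidate, attempt
    raise RuntimeError(f"no candidate passed all gates in {max_attempts} attempts")


def make_scripted_agent(candidates: Sequence[str]) -> Agent:
    """Hypothetical stand-in for a model: replays fixed candidates and ignores feedback."""
    remaining = iter(candidates)

    def agent(feedback: List[str]) -> str:
        return next(remaining)

    return agent


CANDIDATES = [
    "def add(a, b) return a + b",          # fails the syntax gate
    "def add(a, b):\n    return a - b",    # compiles, fails the unit test
    "def add(a, b):\n    return a + b",    # passes both gates
]
GATES = [syntax_gate, unit_test_gate]

accepted, attempts = backpressure_loop(make_scripted_agent(CANDIDATES), GATES)
print(f"accepted on attempt {attempts}")
```

The point of the sketch is that acceptance is decided by the gates, not by anything the agent was told in a prompt; a real harness would feed `feedback` back into the model call instead of ignoring it.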

W20 2026 · 09-May-26 → 15-May-26

  • Show HN: Git for AI Agents — HN thread on re_gent (regent-vcs); community discussion on versioning AI agent actions and tracking which prompt wrote each line
  • regent-vcs/re_gent — version control for AI agents (Go); tracks agent actions, prompt attribution, step inspection; Claude Code compatible
  • A Mental Model for Agentic Work — five-component model: LLM Models/API → Agents Host → Tools → Context → Memory; argues every agentic system follows this architecture regardless of domain

W19 2026 · 02-May-26 → 08-May-26

Open Questions / Tensions

  • Structural vs behavioral gates: Brooks’ backpressure piece makes a strong case that prompt-level constraints are inherently unreliable. But structural gates require upfront investment in types, tests, and formal specs. Who pays this cost, and when is it worth it?
  • Harness vs model: There’s strong consensus that harness engineering is the real differentiator, but standards for “good harness design” are still being codified — now with a dedicated course (walkinglabs) to fill that gap.
  • Autonomous agents in the wild: The Archestra AI bot spam story is an early signal that AI agents operating without governance (no enforra-style enforcement, no reputation gating) will degrade shared ecosystems. The control layer problem isn’t just internal.
  • Agent versioning: re_gent and the “git for AI agents” concept are nascent: genuine need, no clear winner yet.
  • Security surface: enforra + snyk/agent-scan both entering the scene signals that the agent skill/MCP ecosystem has a meaningful attack surface. The governance layer is being actively built now.
  • LLM reliability as a systemic risk: The lenz.io data shows frontier models disagree on 67% of real fact-check claims — not benchmarks, but claims real users submitted. If agentic systems route decisions through LLM judgment without retrieval or human verification, this disagreement rate is a direct measure of the failure surface. The problem is calibration and grounding, not just knowledge cutoff.
  • Agent accountability in conversations: The interceptor/demon piece raises a structural accountability question — when AI writes under a human’s name, errors can’t be publicly corrected without exposing the ghost-writing. Live-participant adoption requires organizations to develop new norms for when agents speak; no such norms exist yet.
  • Self-improving loops and safety: SIA’s combined harness+weight approach is a milestone, but the paper focuses on benchmarks. If agents can rewrite their own scaffolds and update their own weights, the governance question (who monitors the Feedback-Agent? what are the stopping criteria?) becomes the critical open problem.
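
For the LLM reliability question above, the two headline numbers (share of claims with any disagreement, share with a 2+ bucket gap) can be computed from a panel of ratings along these lines. A rough sketch under stated assumptions: the five-label scale is hypothetical (the notes only name “Misleading” and “Mostly True” as middle buckets), the gap is taken as the spread between the highest and lowest rating on a claim, and the toy panel is invented for illustration, not lenz.io data.

```python
from typing import Dict, List

# Hypothetical ordinal rubric; only the two middle labels appear in the notes.
SCALE = ["False", "Mostly False", "Misleading", "Mostly True", "True"]
RANK = {label: position for position, label in enumerate(SCALE)}


def bucket_gap(labels: List[str]) -> int:
    """Spread between the highest and lowest rating given to one claim."""
    ranks = [RANK[label] for label in labels]
    return max(ranks) - min(ranks)


def panel_stats(panel: List[List[str]]) -> Dict[str, float]:
    """Share of claims with any disagreement and with a gap of two or more buckets."""
    gaps = [bucket_gap(labels) for labels in panel]
    total = len(gaps)
    return {
        "any_disagreement": sum(gap >= 1 for gap in gaps) / total,
        "gap_2_plus": sum(gap >= 2 for gap in gaps) / total,
    }


# Invented toy panel: four claims, five raters each.
TOY_PANEL = [
    ["True", "True", "True", "True", "True"],                              # gap 0
    ["Mostly True", "True", "True", "Mostly True", "True"],                # gap 1
    ["Misleading", "Mostly True", "True", "Mostly True", "Misleading"],    # gap 2
    ["False", "Misleading", "Mostly False", "False", "Mostly True"],       # gap 3
]

stats = panel_stats(TOY_PANEL)
print(stats)
```

An agentic system that routes decisions through a single model's judgment could log the same two rates against a second model or a human reviewer to see how large its own failure surface is.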