Wiki on Citta

AI Agents

Mon, 01 Jan 0001 00:00:00 +0000

Summary

Mr. Nayak has been tracking the rapidly evolving AI agents space — from platform infrastructure and harness engineering to version control, security, governance, and the unintended side-effects of autonomous agents in the wild. The consistent theme: the model is only one component; what wraps it (the harness) is where the engineering leverage lives. W21 added two threads: structural correctness as a design goal — using gates baked into the codebase (types, compilers, tests) rather than prompt instructions — and the ecosystem response to AI bot spam: engineers are now building git-level filters (the --author flag technique) to block automated AI commits, mirroring what email spam filters did decades earlier. W22 closes on two more threads: the LLM reliability problem at the knowledge-arbiter layer (lenz.io data shows frontier models disagree on 67% of real fact-check claims — a calibration/grounding problem, not just a knowledge cutoff), and the interceptor vs. live-participant framework for how AI agents join human conversations (attribution is the hinge: interceptors are invisible and unauditable; live participants are contestable and correctable). The SIA paper also closes a research gap by combining the harness-update and test-time weight-update approaches in one self-improving loop — simultaneously outperforming either alone across three benchmarks. W23 adds three new threads: skill distillation as a new technique (frontier models author SKILL.md procedure files that smaller local models execute — institutional knowledge encoded in markdown, model-agnostic and hot-swappable); multi-specialist orchestration at production scale (Cloudflare’s CI-native system launches up to 7 specialized reviewer agents per merge request); and a growing agent skill ecosystem (skills-for-humanity packages historical reasoning methodologies as Claude Code skills; cartographer-skill adds spatial reasoning). The RAG authorization piece (W23) also closes a design gap: pre-retrieval authorization is the correct architecture for safe RAG, not prompt-level filtering. W24 adds a production landmark and a measurement: OpenAI’s harness engineering piece is the first credible production-scale case study of a team using Codex with zero manually-written code — making the harness philosophy concrete (3.5 PRs/engineer/day, 1M LOC in 5 months; the job became environment design and AGENTS.md authoring). flightdeck fills the fleet observability gap: a self-hosted control plane streaming every LLM call and tool call with token budgets and MCP policy enforcement. And the botsitters data (6.4 hrs/week supporting AI, only 13% org-level improvement) frames a new measurement challenge: if human labor is reorganized around supporting AI rather than replaced by it, the net productivity case requires scrutiny. W25 adds the meta-harness layer as the next frontier above individual agent harnesses: Databricks’ Omnigent wraps multiple agents (Claude Code, Codex, Pi) in a unified API with policy controls and live team collaboration, making agents composable at the organization level. Bastion Computer formalizes the environment-definition layer with JSON-declared agent workspaces, auth, and lifecycle scripts. W26 brings a rich cluster: agent-native architecture principles (Dan Shipper’s guide for designing apps where agents are first-class citizens, not API bolt-ons), a production agentic RAG case study (Bayer PRINCE — multi-step context and harness engineering before those terms existed), a hands-on eval harness primer (full agent eval loop in 150 lines of Bun — no framework needed), Self-Harness (arXiv 2606.09498: agents iteratively identify their own failure patterns and propose minimal harness modifications, validated by regression testing — +21pp improvement on Terminal-Bench-2.0 without human engineers), and Haystack (deepset’s production-ready Python framework for RAG and agentic pipelines with branching/looping control).

AI Coding & Developer Workflows

Mon, 01 Jan 0001 00:00:00 +0000

Summary

A major thread in Mr. Nayak’s catalog: how to integrate AI into software development without eroding skill or judgment. The sources span tooling (Claude Code ecosystem, Codex), workflow advice (context management, durable threads, voice input), and a growing counterpoint — don’t let the speed of AI output replace your own comprehension. This week added notable new voices: the “vibe coding” debate brought a useful skeptic perspective, while Willison’s PyCon summary provided much-needed historical context for where coding agents actually stand. The November 2025 inflection point (coding agents got genuinely good via RLVR) helps explain why the discourse has intensified. W23 adds two new angles: domain expertise as the oracle (Horsting argues the scarce resource is no longer “can you build it” but “can you tell whether it’s right” — and only someone with domain knowledge can answer that); and AI-enabled long-tail software (Slop Forks argues the more interesting AI frontier isn’t taking over existing software — it’s enabling niche tools that were never worth building at prior costs). Cloudflare’s production multi-specialist code review system (W23) also demonstrates that the organizational adoption of AI code review at scale requires orchestration infrastructure, not just better models. W24 brings several sharp new threads: zero-manual-code at production scale (OpenAI’s harness engineering piece documents 1M LOC shipped by 3 engineers via Codex in 5 months — the harness is the product now); a first-person domain erosion account (10-year fintech engineer documents how employer-mandated AI adoption rendered their accumulated expertise replicable on demand); and the rsync controversy (Andrew Tridgell, rsync’s original author, defends legitimate expert AI use while simultaneously lamenting the flood of AI-generated security reports — two distinct problems that community discourse collapsed into one). W26 adds a tighter cluster: Charity Majors frames the post-threshold imperative precisely — AI crossed the quality bar (tipping point: Opus 4.5, but harnesses enabled it from mid-2025), and the correct engineering response is more discipline, not less, because review volume is now the bottleneck. Vini Brasil provides the practitioner-level rejection heuristic: five criteria for rejecting working AI code (can’t explain the approach; diff bigger than the problem; premature abstraction; harder to reason about; trusting output over understanding). Reflex’s ast.walk 220x speedup is a second-order consequence of AI code generation — the tooling that processes AI output must scale to match output throughput. TesterArmy brings AI browser agents to QA, making plain-English test authoring viable.

AI Security

Mon, 01 Jan 0001 00:00:00 +0000

Summary

As AI agents gain delegated authority over enterprise systems - email, file access, calendar, code repositories - the prompt injection attack surface expands dramatically. W22 brought the first entry in this topic: the Copilot Cowork exfiltration via indirect prompt injection. The threat model for agentic systems is structurally different from traditional software: the attack surface grows with every new integration, and defenses that work in isolation fail in combination. W23 adds a complementary architectural piece: pre-retrieval authorization in RAG systems (windley.com) - the principle that access control must be enforced at the retrieval layer, before context reaches the LLM, not at the prompt level. Together, these two entries frame a structural principle: AI systems that operate on data or take actions on behalf of users must enforce access control as close to the data source as possible, not as a behavioral constraint on the model. W24 introduces a new threat model distinct from all prior entries: intentional silent degradation by the AI vendor itself. Where W22 (Copilot Cowork) and W23 (RAG authorization) addressed third-party attackers exploiting weak access controls, the Claude Fable 5 case is the vendor making a deliberate policy choice to degrade output without disclosure. Anthropic walked back the policy after developer backlash, but the precedent — that a model vendor can silently alter quality for competitive reasons — has been set and normalized by the disclosure itself. W25 marks a qualitative shift: state-level export control as infrastructure risk. The US government’s directive to suspend all access to Fable 5 and Mythos 5 for any foreign national — including those inside the US — represents a new security category distinct from the W24 silent degradation case. Where W24’s threat was vendor policy, W25’s threat is government mandate that rendered the entire model line unavailable globally overnight with no migration path. The 12gramsofcarbon personal account makes the operational impact concrete: a developer mid-task in an agentic workflow found the model gone — first assumption was a broken harness, not a geopolitical event.

Engineering Fundamentals

Mon, 01 Jan 0001 00:00:00 +0000

Summary

A cluster of links on the bedrock of software engineering - the problems that don’t go away regardless of AI tooling. Idempotency, build caching, code reading, formal modeling, and now: structural correctness as a design philosophy for AI-generated code. W21 added a significant new thread via the backpressure piece - the argument that invariants belong in the substrate (types, compilers, tests, proof checkers), not in prompts. This directly connects to the formal methods work (TLA+) and distributed systems testing methodology. The theme emerging: the engineering fundamentals that make human-written systems reliable are the same ones that keep AI-generated systems correct. W22 adds two more pieces: platform capture of communication channels (push notifications are no longer developer-controlled delivery - Apple and Google have inserted AI intermediaries that summarize, reorder, and rewrite before delivery) and database-native durable workflows (DBOS argues that external orchestrators like Temporal are a needless abstraction - Postgres itself can coordinate workflows via standard locking and integrity constraints, making scalability and availability DB engineering problems with decades of prior art). W23 extends the durable workflow thread further: SQLite as the lightest-weight tier for bursty, agent-centric workloads (Litestream replication to S3, no separate DB service needed), a hands-on build-it-from-scratch workflow engine guide (Go + Postgres, Hatchet style), and a historical counterpoint in CASTOR - CERN’s decades-old hierarchical storage system for physics data, proof that purpose-built distributed storage at exabyte scale is a solved engineering problem when the domain is well-understood. W24 adds two new threads: Linear’s local-first architecture as a masterclass in inverting the traditional CRUD model (browser-as-database, async server sync, no spinners); and North Star values (Loris Cro) as a reminder that engineering fundamentals are always in service of a goal — user utility — not ends in themselves. W26 brings a cluster of fundamentals spanning multiple registers: ast.walk optimization (Reflex’s 220x speedup through generator-to-list replacement and C extension — a case study in hot-path profiling where the answer was simpler than the symptom suggested); the Bayer PRINCE architecture retrospective (harness and context engineering patterns for production agentic RAG, framed now that the vocabulary exists); a visual SSH tunnels reference (local and remote port forwarding — foundational networking primitives that outdate Cloud Native tools by decades); the Aalto sync taxonomy (systematic trade-off analysis of distributed sync architectures — theoretical complement to the practical durable workflow entries from W22/W23); a design document writing guide (the doc’s primary value is the thinking it forces, not the artifact it produces); and Data Compression Explained (Mahoney’s canonical reference from entropy coding through neural methods).

Human-AI Collaboration

Mon, 01 Jan 0001 00:00:00 +0000

Summary

A growing thread in Mr. Nayak’s catalog: the conceptual and practical frameworks for how humans and AI systems work together - and what goes wrong when the relationship is inverted. The centaur metaphor (human+machine exceeding either alone) is the aspirational frame; the reverse centaur (machine directing human labor) is the cautionary one. W22 brought this thread into focus with the Doctorow piece on reverse centaurs and the Centaur Chess origin story, plus a practical engineering case study on hybrid AI architecture. The “Interceptors and Demons” piece (also W22) adds a more granular structural analysis: even within a single conversation, AI agency exists on a spectrum from invisible ghost-writer (interceptor) to named collaborator (live participant/demon) - and the choice has deep implications for accountability, contestability, and how organizations learn from AI errors. W23 adds two new threads: the Dead Economy Theory - the AI industry’s financial model requires labor replacement at scale (the global labor market is the only market large enough to justify $800B+ AI valuations; “gentler language is marketing”); and domain expertise as non-fungible human value (Horsting: the scarce resource is the ability to evaluate AI output for correctness in a specific domain, not to produce code). W24 adds a neuroscience data point and two first-person accounts of the collaboration cost. Gloria Mark’s decades of attention span research (47-second average as of 2020, still falling) frames AI chatbots as the latest stage in a longer cognitive offloading arc. The LLMs-eroding-career piece adds first-person texture to what Horsting and Doctorow argued structurally: the human experience of AI collaboration can feel like having your accumulated knowledge claims invalidated, not augmented. And the botsitting data makes the reverse centaur concrete: 6.4 hours/week of human labor supporting AI, with productivity gains visible at the individual level but not yet at the organizational level. W26 adds two more angles: AI’s Brokenomics (Ed Zitron) extends the economic critique with the Fable/Mythos export ban as a concrete case study in infrastructure dependency — the entire model line was gone overnight with no recourse, making vendor-fragility a new dimension of the AI collaboration calculus. When I reject AI code (Vini Brasil) is a practitioner-level account of the human judgment bottleneck in AI-assisted work: the problem is no longer producing code but understanding and owning what was produced. W27 adds a philosophical reset: AI: The Falsity of Comparison (syntheticauth.ai) argues that the entire frame of “what can humans do that AI can’t” is wrong because it reduces being human to a capability checklist — a rare piece that questions the premise rather than taking a side within it.

Local Models

Mon, 01 Jan 0001 00:00:00 +0000

Summary

Mr. Nayak is following the gap between “local models are possible” and “local models feel finished.” The frustration isn’t with model quality — it’s with the fragmented tooling, missing polish, and configuration overhead that makes local inference still worse than it should be for practical coding workflows. Two pieces directly address this UX gap; a third offers the hardware angle. This week, Qwen 3.7 landed as a release event — the model release cadence continues to accelerate, adding capable options to the local model ecosystem. W23 adds a conceptually important new angle: skill distillation (Tom Tunguz) uses frontier models to write SKILL.md procedure files that smaller local models execute — a technique that makes local models viable for agentic workflows without requiring them to match frontier capability, because the procedure is externalized into the skill file rather than encoded in weights. W25 adds a manifesto-level argument: Ahmad Osman’s Opensource AI Must Win frames locally-deployable AI as civilizational infrastructure — operational freedom requires the ability to run, inspect, and modify intelligence systems without permission; the Fable/Mythos export ban (cataloged the same day) provides the precise case study. W26 adds a concrete benchmark: GLM-5.2 vs Claude Opus (3D WebGL game build) — Opus was faster and produced cleaner code; GLM-5.2 cost 4x less; the decisive argument for open weights in the arsenal isn’t capability parity, it’s that weights you download can’t be revoked. W27 provides the most practical “it works” signal yet: Migdal (Quesma) declares Qwen 3.6 27B the first local model that “makes sense as a general intelligence” — achieves 32 tok/s on an M5 Max with llama.cpp + MTP, integrates cleanly with OpenCode via a localhost OpenAI-compatible API, and outperforms the faster 35B A3B MoE variant on a coding task by following an instruction the MoE variant ignored; the llama.cpp + OpenCode stack is now a documented, complete local development workflow.

Open Source Sustainability

Mon, 01 Jan 0001 00:00:00 +0000

Summary

A thread emerging in W22: what happens to open-source projects as AI-generated contributions flood issue trackers, pull requests, and forums? And what happens to the humans who maintain them? The Ronacher “Building Pi With Pi” piece is the most detailed account so far of AI slop degrading OSS maintainer workflows - not through bad code but through confidently-wrong AI-generated issue reports that mislead both humans and agents. The AWS departure piece adds the perspective of someone who spent four years trying to build a human face for a corporation that views contributors as fungible. W23 adds two new angles: infrastructure abuse through open OSS tooling (the Kaneo phishing case: a verified email-sending domain attached to open, unverified signup is exploitable regardless of intent) and AI-enabled long-tail forks (Slop Forks: AI makes viable niche re-implementations that no one would have bothered building - a shift in what counts as worth doing in OSS). W24 adds the rsync story: Andrew Tridgell simultaneously documents legitimate AI-augmented maintenance (test-suite rewrite under expert design direction) and the flood of AI-generated security reports requiring new defensive infrastructure. The community discourse conflated both phenomena — the backlash was against the former, when the real complaint belongs to the latter. W25 adds Ahmad Osman’s Opensource AI Must Win — a manifesto-level argument that open-source AI is civilizational infrastructure: “a subscription economy for cognition” is what closed frontier labs produce; the alternative requires the ability to run, inspect, modify, and preserve intelligence systems without permission. Written the same day the Fable/Mythos export control directive was issued, it arrives with an immediate case study: the government can revoke access to closed models overnight; open weights cannot be revoked.

Software Engineering Career

Mon, 01 Jan 0001 00:00:00 +0000

Summary

A recurring undercurrent in Mr. Nayak’s catalog: what happens to software engineering as a profession when AI can generate most of the code? Sources range from philosophical (is it still a lifetime career?) to personal (laid off by Atlassian) to practical (reading code, interview prep tools, the vibe coding debate). The picture is nuanced - AI doesn’t eliminate the engineer, but it reshapes what “being a good engineer” means. W23 adds three new angles: the Dead Economy Theory frames AI investment as a structural labor displacement play; domain expertise as the non-fungible moat (Horsting: the scarce resource is whether you can verify AI output, and that requires owning a domain model); and the “do the hardest thing” principle as a career strategy response - choosing ambitious problems with few competitors as a hedge against commoditization. W24 adds two contrasting responses to the AI pressure. North Star (Loris Cro) is a values clarification: when AI commoditizes code production, the durable north star is maximizing user utility — not technical mastery, not DX, not abstraction beauty. The LLMs-eroding-career piece is the most emotionally direct account in the catalog: not structural analysis but a first-person record of watching accumulated expertise become replicable on demand, in the exact domain you chose to build depth in. W25 adds Paul Graham’s How to Earn a Billion Dollars (Oxford Union, June 2026): an articulation of the startup-path wealth creation mechanism (user value → exponential growth → compounding valuation) as a direct rebuttal to claims that billionaire wealth is structurally exploitative. W26 adds Charity Majors’ career signal: engineers who raise their standards in response to AI capability gains — treating it as a quality accelerator, not just a velocity accelerator — will be structurally more valuable than those who lower them.

Software Tools & Open Source Ecosystem

Mon, 01 Jan 0001 00:00:00 +0000

Summary

A cluster of links tracking the evolution of developer tools and the open-source ecosystem that sustains them. The recurring theme: tools built with community values can shift when commercial pressure enters. Bitwarden’s quiet leadership change is the clearest signal so far - a formerly community-first security tool is showing signs of PE-aligned monetization. Fish 4.0’s Rust rewrite offers a contrasting case: a community-led architectural overhaul that preserves user trust and backwards compatibility throughout. W22 adds two new developments: the push notification piece documents how Apple and Google have become active AI intermediaries in developer-to-user communication (a form of platform capture applicable beyond push), and Multiplayer represents the next generation of AI-connected debugging tools - feeding correlated production observability directly to coding agents to close the reproduce-in-dev loop. W23 adds two lighter entries: Tesseract Analytics’ Open Terminal (financial data democratization for individual investors) and Slop Forks (an argument that AI-generated niche tools represent a new software category - tools for audiences of one that previously couldn’t justify the build cost). W24 adds two engineering-adjacent entries: Linear’s performance architecture as a case study in local-first design done right, and flightdeck as the missing observability/governance layer for AI agent fleets. W25 adds three distinct tools: Relax-and-Recover (ReaR), a GPL-licensed Linux bare metal disaster recovery tool whose setup-and-forget model deserves more adoption; Bastion Computer, a JSON-configuration harness runner that brings IaC paradigm to AI agent workspace definitions; and Gemini-SQL2, Google’s text-to-SQL capability achieving 80.04% on BIRD (the hardest SQL benchmark, 37 domains, dirty values, external knowledge grounding). W26 adds TesterArmy (AI browser agents executing plain-English QA tests with automatic PR checks) and Haystack (deepset’s flexible Python framework for production RAG and agentic pipelines).