<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <channel>
    <title>AI Coding on Code is cheap, let&#39;s talk</title>
    <link>https://blog.ferstar.org/en/series/ai-coding/</link>
    <description>Code is cheap, let&#39;s talk</description>
    <generator>Hugo -- gohugo.io</generator>
    <language>en</language>
    <copyright>Copyright 2026 ferstar</copyright>
    <lastBuildDate>Mon, 22 Jun 2026 12:00:00 +0800</lastBuildDate>
    <ttl>60</ttl><atom:link href="https://blog.ferstar.org/en/series/ai-coding/index.xml" rel="self" type="application/rss+xml" /><image>
      <url>https://blog.ferstar.org/site-logo.png</url>
      <title>Code is cheap, let&#39;s talk</title>
      <link>https://blog.ferstar.org/</link>
    </image>
    
    <item>
      <title>Growing a Codex Workflow as a Living System: From Session Logs to Skills</title>
      <link>https://blog.ferstar.org/en/posts/codex-workflow-skills-feedback-loop/</link>
      <pubDate>Mon, 22 Jun 2026 12:00:00 +0800</pubDate>
      
      <guid isPermaLink="true">https://blog.ferstar.org/en/posts/codex-workflow-skills-feedback-loop/</guid>
      <description>Long-term Codex usage accumulates repeated debugging and delivery chores. Scan local sessions, identify recurring friction, and turn it into skills, scripts, and cross-machine sync so the workflow gets lighter over time.</description><content:encoded><![CDATA[<blockquote><p>I am not a native English speaker; this article was translated by AI.</p>
</blockquote><p>A few days ago I did something boring but useful: scanned the Codex sessions in <code>~/.codex</code>.</p>
<p>Not for nostalgia, not for a dashboard. I just wanted to see where I was wasting time on repeat.</p>
<p>No surprise. The expensive part was rarely one hard code change. It was the small glue work I had to do every single day:</p>
<ul>
<li><code>git status</code>, <code>git diff</code>, <code>glab api</code>, <code>glab mr</code></li>
<li>finding the first failed CI job</li>
<li>checking remote SSH, PATH, Tailscale, permissions</li>
<li>deciding which tests to run for a given change</li>
<li>confirming SHA, workflow, artifact before release or deploy</li>
<li>resuming a session and re-sorting the issue, branch, MR</li>
</ul>
<p>These chores are small, small enough that fixing them never feels worth the effort. That is exactly why they keep getting ignored, and why you end up spending the day swimming in manual glue.</p>
<p>The original prompt was short:</p>
<div class="highlight-wrapper"><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-text" data-lang="text"><span class="line"><span class="cl">Based on my recent Codex projects and threads, suggest ways to simplify project workflows and improve efficiency. Use subagents to analyze in parallel.</span></span></code></pre></div></div>
<p>The first result was still too biased toward a few recent projects, so I added one more instruction:</p>
<div class="highlight-wrapper"><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-text" data-lang="text"><span class="line"><span class="cl">Not just these projects. Scan all possible sessions under ~/.codex, dispatch multiple subagents to analyze them separately, then summarize.</span></span></code></pre></div></div>
<p>The point was not “ask the model for optimization ideas.” It was changing the source of truth from my memory of recent work to the repeated actions inside real session logs.</p>
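<p>As a rough illustration of what "repeated actions" means in practice, here is a minimal sketch that counts command prefixes. It assumes the shell commands have already been pulled out of the session files; the <code>sample</code> list and both function names are hypothetical, and the actual on-disk format under <code>~/.codex</code> is not covered here.</p>

```python
from collections import Counter


def command_prefix(command: str, depth: int = 2) -> str:
    """Reduce a shell command to its first `depth` words, e.g. 'git status'."""
    return " ".join(command.split()[:depth])


def top_repeats(commands: list[str], limit: int = 5) -> list[tuple[str, int]]:
    """Return the most frequent command prefixes, most common first."""
    counts = Counter(command_prefix(c) for c in commands if c.strip())
    return counts.most_common(limit)


# Hypothetical sample; real input would come from parsing the session logs.
sample = [
    "git status --short",
    "glab api projects/1/merge_requests",
    "git status",
    "git diff HEAD~1",
    "glab api projects/1/pipelines",
    "git status",
]

top = top_repeats(sample, limit=2)
print(top)
```

<p>Grouping by the first two words is a deliberate simplification: it is enough to tell <code>git status</code> from <code>git diff</code> without trying to understand arguments.</p>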

<h3 class="relative group">Do not write tools too early
    <div id="do-not-write-tools-too-early" class="anchor"></div>
    
    <span
        class="absolute top-0 w-6 transition-opacity opacity-0 -start-6 not-prose group-hover:opacity-100 select-none">
        <a class="text-primary-300 dark:text-neutral-700 !no-underline" href="#do-not-write-tools-too-early" aria-label="Anchor">#</a>
    </span>
    
</h3>
<p>I used to do this too — see repetition, reach for a script. Later I realized that is often too early.</p>
<p>Most repetition is not about repeated commands. It is about a repeated decision process. When CI fails, the reusable part is not some random <code>gh run view</code>. It is:</p>
<ol>
<li>confirm the run and head SHA</li>
<li>find the first failed job</li>
<li>extract the useful error</li>
<li>then figure out: workflow, dependency, test, or code</li>
</ol>
<p>Turning this straight into a big tool welds your wrong assumptions in place. The lighter move is to write a skill first: when to use it, minimum steps, what not to do, what to output.</p>
<p>A skill is not an encyclopedia. It is a sticky note — so the agent skips one dead end.</p>
<pre class="not-prose mermaid">
flowchart LR
  A[Session history] --> B[Repeated friction]
  B --> C[Small skill]
  C --> D[Run on real tasks]
  D --> E[Script only when repeated]
  E --> C
</pre>

<p>I ended up keeping only these:</p>
<ul>
<li><code>agent-preflight</code>: read the real repo state before starting, no assumptions</li>
<li><code>gitlab-mr-context</code>: use <code>glab api</code> for issues, MRs, pipelines, notes — much more reliable</li>
<li><code>ci-first-failure</code>: find the first real failure before touching code</li>
<li><code>path-verify</code>: pick the smallest check from the changed files</li>
<li><code>release-deploy-preflight</code>: confirm full SHA, workflow, artifact, health check before deploy</li>
<li><code>remote-health</code>: check SSH, PATH, services, locks, and Tailscale on remote hosts first</li>
</ul>
<p>The names are not cool. That is exactly why you know when to use them.</p>

<h3 class="relative group">Skills first, scripts later
    <div id="skills-first-scripts-later" class="anchor"></div>
    
    <span
        class="absolute top-0 w-6 transition-opacity opacity-0 -start-6 not-prose group-hover:opacity-100 select-none">
        <a class="text-primary-300 dark:text-neutral-700 !no-underline" href="#skills-first-scripts-later" aria-label="Anchor">#</a>
    </span>
    
</h3>
<p>Another lesson: do not give every skill a scripts directory on day one.</p>
<p>Most workflows only need a <code>SKILL.md</code>. <code>path-verify</code> is not there to run your tests. It reminds the agent to pick the smallest check based on what files changed. Let it run with the agent on a few real tasks first. Automate later, once the pattern is confirmed.</p>
<p>Scripts are for one kind of thing: stuff that is definitely repeated, mechanical, and low risk.</p>
<p>This time I only added one — linking repo skills into the user skill directory:</p>
<div class="highlight-wrapper"><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl">scripts/link-user-skills.sh</span></span></code></pre></div></div>
<p>And a PowerShell version for Windows:</p>
<div class="highlight-wrapper"><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-powershell" data-lang="powershell"><span class="line"><span class="cl"><span class="p">.\</span><span class="n">scripts</span><span class="p">\</span><span class="nb">link-user</span><span class="n">-skills</span><span class="p">.</span><span class="n">ps1</span></span></span></code></pre></div></div>
<p>I tripped on one thing: symlink direction.</p>
<p>The correct direction is: real files in the repo, links in the user directory.</p>
<div class="highlight-wrapper"><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-text" data-lang="text"><span class="line"><span class="cl">~/.agents/skills/glab -> /path/to/repo/skills/glab</span></span></code></pre></div></div>
<p>That way the repo has real content, and local Codex can use it. Get it backwards and it is a mess — the repo only has a link to <code>~/.agents</code>, GitHub gets nothing, and Git thinks the original files were deleted.</p>
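<p>The direction rule can be sketched in a few lines of Python. This is not the content of <code>link-user-skills.sh</code>, only an illustration under assumed paths; <code>link_skills</code> and the temporary directory layout are made up for the demo.</p>

```python
import tempfile
from pathlib import Path


def link_skills(repo_skills: Path, user_skills: Path) -> list[Path]:
    """Create user_skills/<name> -> repo_skills/<name> for every skill directory."""
    user_skills.mkdir(parents=True, exist_ok=True)
    created = []
    for skill in sorted(p for p in repo_skills.iterdir() if p.is_dir()):
        link = user_skills / skill.name
        if link.is_symlink():
            link.unlink()  # refresh a stale link
        elif link.exists():
            continue  # never overwrite a real file or directory
        # The link lives in the user directory; the target is the real repo folder.
        link.symlink_to(skill.resolve(), target_is_directory=True)
        created.append(link)
    return created


with tempfile.TemporaryDirectory() as tmp:
    repo = Path(tmp) / "repo" / "skills"
    (repo / "glab").mkdir(parents=True)
    (repo / "glab" / "SKILL.md").write_text("# glab\n")
    user = Path(tmp) / "home" / ".agents" / "skills"

    linked_names = [p.name for p in link_skills(repo, user)]
    user_side_is_link = (user / "glab").is_symlink()
    repo_side_is_real = not (repo / "glab").is_symlink()
    readable_through_link = (user / "glab" / "SKILL.md").read_text()

print(linked_names, user_side_is_link, repo_side_is_real)
```

<p>The two booleans at the end are the whole point: the user directory holds the link, the repo holds the real directory, and Git never sees a symlink where content should be.</p>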

<h3 class="relative group">Make it work across machines
    <div id="make-it-work-across-machines" class="anchor"></div>
    
    <span
        class="absolute top-0 w-6 transition-opacity opacity-0 -start-6 not-prose group-hover:opacity-100 select-none">
        <a class="text-primary-300 dark:text-neutral-700 !no-underline" href="#make-it-work-across-machines" aria-label="Anchor">#</a>
    </span>
    
</h3>
<p>I switch between macOS, Windows, and remote hosts constantly. If a skill only works on one machine, its value gets cut in half.</p>
<p>So after the local setup, I synced the repo to <code>my-win</code> and ran the same maintenance flow on Windows. The PowerShell script uses directory junctions, not symlinks — creating symlinks on Windows often fights with permissions, and junctions are enough for directories.</p>
<p>Tedious step. But without it, workflow refinement turns back into a one-machine trick.</p>

<h3 class="relative group">How I think about it now
    <div id="how-i-think-about-it-now" class="anchor"></div>
    
    <span
        class="absolute top-0 w-6 transition-opacity opacity-0 -start-6 not-prose group-hover:opacity-100 select-none">
        <a class="text-primary-300 dark:text-neutral-700 !no-underline" href="#how-i-think-about-it-now" aria-label="Anchor">#</a>
    </span>
    
</h3>
<p>After this round, a few thoughts hardened.</p>
<p>Find repetition in sessions, not in your imagination. If <code>git status</code>, <code>glab api</code>, <code>ssh</code>, and <code>pnpm test</code> are actually frequent, start there. Do not invent a workflow governance framework nobody asked for.</p>
<p>Keep skills short. Each one plugs one gap. Its only job is to make the agent ask less, search less, guess less. Do not turn it into an encyclopedia.</p>
<p>Scripts do mechanical work — linking skills, collecting CI logs, checking remote health. Product judgment, risk boundaries, deployment decisions still need human confirmation, or at least explicit preflight.</p>
<p>Mistakes need to feed back. I got the symlink direction wrong at first. After fixing it, the lesson cannot just stay in the chat window. It goes into the script and the README. Otherwise I will step on the same rake again.</p>

<h3 class="relative group">What remained
    <div id="what-remained" class="anchor"></div>
    
    <span
        class="absolute top-0 w-6 transition-opacity opacity-0 -start-6 not-prose group-hover:opacity-100 select-none">
        <a class="text-primary-300 dark:text-neutral-700 !no-underline" href="#what-remained" aria-label="Anchor">#</a>
    </span>
    
</h3>
<p>Not much:</p>
<ul>
<li>a few short skills</li>
<li>one Bash linking script</li>
<li>one PowerShell linking script</li>
<li>one Windows sync check</li>
<li>one rule: real skills in the repo, links in the user directory</li>
</ul>
<p>Good enough.</p>
<p>The more I use AI coding, the more I think the workflow is not about building a big platform. It is about removing the most annoying five minutes, over and over. Each pass makes the system a little lighter. When enough of these small rules pile up, the agent starts working in a real engineering environment — not opening a new trail from scratch every time.</p>
]]></content:encoded>
      
    </item>
    
    <item>
      <title>Stopping Codex SQLite Log Growth with a Trigger</title>
      <link>https://blog.ferstar.org/en/posts/codex-sqlite-log-trigger/</link>
      <pubDate>Sat, 20 Jun 2026 16:55:00 +0800</pubDate>
      
      <guid isPermaLink="true">https://blog.ferstar.org/en/posts/codex-sqlite-log-trigger/</guid>
      <description>Codex&#39;s local SQLite log database can grow too fast; a small BEFORE INSERT trigger blocks new log rows, keeps a restore path, and quickly reduces disk IO and WAL growth.</description><content:encoded><![CDATA[<blockquote><p>I am not a native English speaker; this article was translated by AI.</p>
</blockquote><p>Codex recently started storing local logs in <code>~/.codex/logs_2.sqlite</code>. On my machine that database had already grown past 1GB. The real problem was not the WAL file itself, but the log table: <code>TRACE</code>, <code>DEBUG</code>, and <code>INFO</code> rows kept going into SQLite, creating unnecessary disk usage and IO.</p>
<p>The public configuration surface is limited here. <code>RUST_LOG</code> can reduce verbosity, <code>log_dir</code> only controls the plaintext TUI log, and <code>history.max_bytes</code> only applies to <code>history.jsonl</code>. I could not find a public retention, max-size, or journal-mode option for <code>logs_2.sqlite</code>.</p>
<p>So I used SQLite itself as the stopgap.</p>

<h3 class="relative group">Block new log rows with one trigger
    <div id="block-new-log-rows-with-one-trigger" class="anchor"></div>
    
    <span
        class="absolute top-0 w-6 transition-opacity opacity-0 -start-6 not-prose group-hover:opacity-100 select-none">
        <a class="text-primary-300 dark:text-neutral-700 !no-underline" href="#block-new-log-rows-with-one-trigger" aria-label="Anchor">#</a>
    </span>
    
</h3>
<div class="highlight-wrapper"><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl">sqlite3 ~/.codex/logs_2.sqlite <span class="s2">"CREATE TRIGGER IF NOT EXISTS block_log_inserts BEFORE INSERT ON logs BEGIN SELECT RAISE(IGNORE); END;"</span></span></span></code></pre></div></div>
<p>The trigger is intentionally blunt: whenever something tries to insert into the <code>logs</code> table, SQLite ignores that insert.</p>
<p>Verification is also simple:</p>
<div class="highlight-wrapper"><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl">sqlite3 ~/.codex/logs_2.sqlite <span class="s2">"
</span></span></span><span class="line"><span class="cl"><span class="s2">SELECT count(*) FROM logs;
</span></span></span><span class="line"><span class="cl"><span class="s2">INSERT INTO logs(ts, ts_nanos, level, target, feedback_log_body, estimated_bytes)
</span></span></span><span class="line"><span class="cl"><span class="s2">VALUES(strftime('%s','now'), 0, 'INFO', 'trigger_test', 'should_not_exist', 1);
</span></span></span><span class="line"><span class="cl"><span class="s2">SELECT count(*) FROM logs;
</span></span></span><span class="line"><span class="cl"><span class="s2">SELECT count(*) FROM logs WHERE target='trigger_test';
</span></span></span><span class="line"><span class="cl"><span class="s2">"</span></span></span></code></pre></div></div>
<p>If the row count stays the same and <code>trigger_test</code> is <code>0</code>, the trigger is working.</p>
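<p>To try the trigger without touching the real database, the same behavior can be reproduced in memory with Python's <code>sqlite3</code> module. The table schema below is an assumption that only includes the columns named in the test insert above; the real <code>logs</code> table may have more.</p>

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumed schema: only the columns that appear in the verification INSERT.
conn.execute(
    "CREATE TABLE logs (id INTEGER PRIMARY KEY, ts INTEGER, ts_nanos INTEGER, "
    "level TEXT, target TEXT, feedback_log_body TEXT, estimated_bytes INTEGER)"
)
insert = (
    "INSERT INTO logs(ts, ts_nanos, level, target, feedback_log_body, estimated_bytes) "
    "VALUES (strftime('%s','now'), 0, 'INFO', ?, 'body', 1)"
)


def count() -> int:
    return conn.execute("SELECT count(*) FROM logs").fetchone()[0]


conn.execute(insert, ("before_trigger",))
before = count()

conn.execute(
    "CREATE TRIGGER IF NOT EXISTS block_log_inserts "
    "BEFORE INSERT ON logs BEGIN SELECT RAISE(IGNORE); END;"
)
conn.execute(insert, ("trigger_test",))  # silently skipped, no exception raised
blocked = count()

conn.execute("DROP TRIGGER IF EXISTS block_log_inserts")
conn.execute(insert, ("after_drop",))
restored = count()

print(before, blocked, restored)
```

<p>The insert does not fail while the trigger exists; it simply has no effect, which is why the writer never notices. Dropping the trigger brings normal inserts back.</p>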

<h3 class="relative group">Windows PowerShell version
    <div id="windows-powershell-version" class="anchor"></div>
    
    <span
        class="absolute top-0 w-6 transition-opacity opacity-0 -start-6 not-prose group-hover:opacity-100 select-none">
        <a class="text-primary-300 dark:text-neutral-700 !no-underline" href="#windows-powershell-version" aria-label="Anchor">#</a>
    </span>
    
</h3>
<p>On Windows, the path is usually:</p>
<div class="highlight-wrapper"><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-powershell" data-lang="powershell"><span class="line"><span class="cl"><span class="nv">$db</span> <span class="p">=</span> <span class="nb">Join-Path</span> <span class="nv">$env:USERPROFILE</span> <span class="s2">".codex\logs_2.sqlite"</span>
</span></span><span class="line"><span class="cl"><span class="n">sqlite3</span> <span class="nv">$db</span> <span class="s2">"CREATE TRIGGER IF NOT EXISTS block_log_inserts BEFORE INSERT ON logs BEGIN SELECT RAISE(IGNORE); END;"</span></span></span></code></pre></div></div>
<p>I verified this on a remote Windows machine. The test insert did not change the row count:</p>
<div class="highlight-wrapper"><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-text" data-lang="text"><span class="line"><span class="cl">trigger: block_log_inserts
</span></span><span class="line"><span class="cl">before: 76387
</span></span><span class="line"><span class="cl">after: 76387
</span></span><span class="line"><span class="cl">trigger_test_rows: 0</span></span></code></pre></div></div>

<h3 class="relative group">Restore log writes
    <div id="restore-log-writes" class="anchor"></div>
    
    <span
        class="absolute top-0 w-6 transition-opacity opacity-0 -start-6 not-prose group-hover:opacity-100 select-none">
        <a class="text-primary-300 dark:text-neutral-700 !no-underline" href="#restore-log-writes" aria-label="Anchor">#</a>
    </span>
    
</h3>
<div class="highlight-wrapper"><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl">sqlite3 ~/.codex/logs_2.sqlite <span class="s2">"DROP TRIGGER IF EXISTS block_log_inserts;"</span></span></span></code></pre></div></div>
<p>PowerShell:</p>
<div class="highlight-wrapper"><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-powershell" data-lang="powershell"><span class="line"><span class="cl"><span class="nv">$db</span> <span class="p">=</span> <span class="nb">Join-Path</span> <span class="nv">$env:USERPROFILE</span> <span class="s2">".codex\logs_2.sqlite"</span>
</span></span><span class="line"><span class="cl"><span class="n">sqlite3</span> <span class="nv">$db</span> <span class="s2">"DROP TRIGGER IF EXISTS block_log_inserts;"</span></span></span></code></pre></div></div>

<h3 class="relative group">Compact the old logs too
    <div id="compact-the-old-logs-too" class="anchor"></div>
    
    <span
        class="absolute top-0 w-6 transition-opacity opacity-0 -start-6 not-prose group-hover:opacity-100 select-none">
        <a class="text-primary-300 dark:text-neutral-700 !no-underline" href="#compact-the-old-logs-too" aria-label="Anchor">#</a>
    </span>
    
</h3>
<p>The trigger only blocks new rows. It does not shrink a database that has already grown. After quitting Codex, checkpoint the WAL, delete the rows you no longer need, and run <code>VACUUM</code>:</p>
<div class="highlight-wrapper"><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl">sqlite3 ~/.codex/logs_2.sqlite <span class="s2">"
</span></span></span><span class="line"><span class="cl"><span class="s2">PRAGMA wal_checkpoint(TRUNCATE);
</span></span></span><span class="line"><span class="cl"><span class="s2">DELETE FROM logs WHERE level IN ('TRACE','DEBUG');
</span></span></span><span class="line"><span class="cl"><span class="s2">DELETE FROM logs WHERE level = 'INFO' AND ts < strftime('%s','now','-3 days');
</span></span></span><span class="line"><span class="cl"><span class="s2">VACUUM;
</span></span></span><span class="line"><span class="cl"><span class="s2">"</span></span></span></code></pre></div></div>
<p>If Codex is still running, SQLite may report <code>database is locked</code>. That is expected. Quit Codex and run it again.</p>
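<p>Why the <code>VACUUM</code> step matters can be seen on a throwaway database. The sketch below uses a simplified, assumed three-column table and synthetic rows; it only demonstrates that deleting rows frees pages inside the file, while <code>VACUUM</code> is what actually shrinks the file.</p>

```python
import os
import sqlite3
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "logs_demo.sqlite")
    conn = sqlite3.connect(path)
    conn.execute("CREATE TABLE logs (ts INTEGER, level TEXT, feedback_log_body TEXT)")

    # Synthetic data: half DEBUG, half WARN, 200 bytes of body each.
    rows = [(i, "DEBUG" if i % 2 else "WARN", "x" * 200) for i in range(5000)]
    conn.executemany("INSERT INTO logs VALUES (?, ?, ?)", rows)
    conn.commit()
    size_full = os.path.getsize(path)

    conn.execute("DELETE FROM logs WHERE level IN ('TRACE','DEBUG')")
    conn.commit()
    size_after_delete = os.path.getsize(path)  # freed pages stay inside the file

    conn.execute("VACUUM")  # must run outside an open transaction
    size_after_vacuum = os.path.getsize(path)
    remaining = conn.execute("SELECT count(*) FROM logs").fetchone()[0]
    conn.close()

print(size_full, size_after_delete, size_after_vacuum, remaining)
```

<p>Exact byte counts depend on the SQLite build and page size, so the sketch prints them rather than promising specific numbers.</p>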

<h3 class="relative group">The limit of this trick
    <div id="the-limit-of-this-trick" class="anchor"></div>
    
    <span
        class="absolute top-0 w-6 transition-opacity opacity-0 -start-6 not-prose group-hover:opacity-100 select-none">
        <a class="text-primary-300 dark:text-neutral-700 !no-underline" href="#the-limit-of-this-trick" aria-label="Anchor">#</a>
    </span>
    
</h3>
<p>This does not fix Codex’s logging system. It is just local damage control.</p>
<p>The upside is that it needs no Codex patch, no release wait, and no background cleanup daemon. The downside is that <code>logs_2.sqlite</code> will no longer contain new local logs, which makes local debugging weaker. When you need logs, drop the trigger, reproduce the issue, then add the trigger again.</p>
<p>Long term, Codex should expose log database retention or max-size settings. Until then, one SQLite trigger is enough.</p>
]]></content:encoded>
      
    </item>
    
    <item>
      <title>ACE Broke, So I Rewrote It: From ace-wrapper to fast-context</title>
      <link>https://blog.ferstar.org/en/posts/ace-wrapper-to-fast-context/</link>
      <pubDate>Mon, 01 Jun 2026 10:35:09 +0800</pubDate>
      
      <guid isPermaLink="true">https://blog.ferstar.org/en/posts/ace-wrapper-to-fast-context/</guid>
      <description>After ace-wrapper was written, the ACE free tier became unreliable and relay services died one by one, so I reverse-engineered Windsurf&#39;s SWE-grep protocol, added a local Semble cache as fallback, and built a hybrid retrieval tool called fast-context.</description><content:encoded><![CDATA[<blockquote><p>I am not a native English speaker; this article was translated by AI.</p>
</blockquote><p>In the <a href="/en/posts/ace-wrapper-semantic-search-ai-coding-harness/" >last post</a> I wrote about <code>ace-wrapper</code>: a shell command around ACE (Augment Context Engine) filesystem context search, so the agent could start with semantic retrieval when keywords were fuzzy, then decide which files to read.</p>
<p>Then ACE got flaky.</p>
<p>API keys expired in new ways. Free-tier quota became harder to rely on. And one by one, the relay services died.</p>
<p>I do not really blame anyone. It was a preview feature to begin with. The problem was that semantic search had already become part of my daily workflow: dozens of <code>ace</code> calls in one session. Take it away and the agent is back to guessing keywords.</p>
<p>So I switched approaches:</p>
<p><a href="https://github.com/ferstar/fast-context"  target="_blank" rel="noreferrer">ferstar/fast-context</a></p>
<p>This time I skipped third-party APIs and talked directly to Windsurf’s SWE-grep protocol — the same semantic search backend used by Codex CLI and Windsurf IDE — while layering a local Semble cache on top as fallback.</p>

<h3 class="relative group">The structural difference from ace-wrapper
    <div id="the-structural-difference-from-ace-wrapper" class="anchor"></div>
    
    <span
        class="absolute top-0 w-6 transition-opacity opacity-0 -start-6 not-prose group-hover:opacity-100 select-none">
        <a class="text-primary-300 dark:text-neutral-700 !no-underline" href="#the-structural-difference-from-ace-wrapper" aria-label="Anchor">#</a>
    </span>
    
</h3>
<p>ace-wrapper was purely remote: the local side only sent parameters, and everything else depended on the ACE service.</p>
<p>fast-context is local and remote working together.</p>
<pre class="not-prose mermaid">
flowchart TB
  subgraph Input
    Q[User query]
  end

  subgraph Local
    S[Semble local prefetch<br/>cached index + chunk search]
    A[Lexical anchors<br/>filename / path / literal hits]
    R["Repo map<br/>(auto-shrink when too large)"]
  end

  subgraph Remote
    WS[Windsurf SWE-grep<br/>agentic verify + expand]
  end

  subgraph Output
    O["Candidate files<br/>line ranges<br/>follow-up terms<br/>(or local chunks when remote fails)"]
  end

  Q --> S
  Q --> A
  Q --> R
  S --> WS
  A --> WS
  R --> WS
  WS -- success --> O
  WS -- auth / rate-limit / timeout --> O
  S -- fallback path --> O
</pre>

<p>The flow:</p>
<ol>
<li><strong>Run Semble locally first</strong> — cached index + chunk search, sub-second</li>
<li><strong>Collect local lexical anchors</strong> — exact filename, path segment, and literal content matches from the query</li>
<li><strong>Generate a repo map</strong> — directory tree, auto-shrunk when it gets too large</li>
<li><strong>Feed all three to Windsurf</strong> — Semble chunks as hints, lexical anchors as pinpoints, repo map as path context</li>
<li><strong>Windsurf verifies and expands</strong> using rg/readfile/tree/ls/glob — agentic tool-call loop</li>
<li><strong>When the remote path fails, return local Semble results</strong> — no empty hands, no blocked workflow</li>
</ol>
<p>That “no empty hands” property matters more than it sounds. With ace-wrapper and ACE, once the service went down, that search was simply gone. Now, when the remote path fails, local cache still returns chunk-level candidates. Lower quality, yes, but the workflow does not just die there.</p>
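<p>The fallback behavior boils down to a small control-flow pattern. The sketch below is not fast-context's code; <code>SearchResult</code>, <code>RemoteUnavailable</code>, and the two callables are hypothetical stand-ins for the local Semble search and the remote Windsurf step.</p>

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class SearchResult:
    source: str            # "hybrid" when the remote step succeeded, else "local"
    candidates: list[str]  # file paths or chunk identifiers


class RemoteUnavailable(Exception):
    """Hypothetical error covering auth failures, rate limits, and timeouts."""


def search(
    query: str,
    local_search: Callable[[str], list[str]],
    remote_verify: Callable[[str, list[str]], list[str]],
) -> SearchResult:
    hints = local_search(query)  # always runs first, so there is always something to return
    try:
        verified = remote_verify(query, hints)
    except RemoteUnavailable:
        return SearchResult("local", hints)  # degraded, but not empty-handed
    return SearchResult("hybrid", verified)


def fake_local(query: str) -> list[str]:
    return ["src/core.py", "src/extract_key.py"]


def fake_remote_ok(query: str, hints: list[str]) -> list[str]:
    return [hints[0] + ":64-90"]


def fake_remote_down(query: str, hints: list[str]) -> list[str]:
    raise RemoteUnavailable("rate limited")


ok = search("protobuf encoder", fake_local, fake_remote_ok)
degraded = search("protobuf encoder", fake_local, fake_remote_down)
print(ok.source, degraded.source, degraded.candidates)
```

<p>Because the local search runs before the remote call rather than after a failure, the fallback costs nothing extra when the remote path breaks.</p>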

<h3 class="relative group">Reverse-engineering SWE-grep
    <div id="reverse-engineering-swe-grep" class="anchor"></div>
    
    <span
        class="absolute top-0 w-6 transition-opacity opacity-0 -start-6 not-prose group-hover:opacity-100 select-none">
        <a class="text-primary-300 dark:text-neutral-700 !no-underline" href="#reverse-engineering-swe-grep" aria-label="Anchor">#</a>
    </span>
    
</h3>
<p>Windsurf’s SWE-grep uses Connect-RPC + Protobuf, which is nothing like a normal REST API.</p>
<p>The trickiest part was the Connect framing. Every RPC frame has a 5-byte header (1 flag byte + 4 big-endian length bytes). On top of that, the protocol requires a Connect-Connect frame before the actual payload.</p>
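<p>The 5-byte header is easy to get wrong by one byte, so a small sketch may help. It only covers the envelope described above (1 flag byte plus a 4-byte big-endian length); the function names are illustrative, and the <code>END_STREAM</code> value follows the public Connect protocol's end-of-stream flag rather than anything specific to Windsurf.</p>

```python
import struct

END_STREAM = 0x02  # Connect protocol flag bit for the end-of-stream envelope


def frame(payload: bytes, flags: int = 0) -> bytes:
    """Prefix a payload with 1 flag byte and a 4-byte big-endian length."""
    return struct.pack(">BI", flags, len(payload)) + payload


def split_frames(data: bytes) -> list[tuple[int, bytes]]:
    """Split a byte stream into (flags, payload) pairs."""
    frames = []
    offset = 0
    while offset < len(data):
        if len(data) - offset < 5:
            raise ValueError("truncated frame header")
        flags, length = struct.unpack_from(">BI", data, offset)
        offset += 5
        payload = data[offset:offset + length]
        if len(payload) != length:
            raise ValueError("truncated frame payload")
        frames.append((flags, payload))
        offset += length
    return frames


stream = frame(b"hi") + frame(b"{}", flags=END_STREAM)
decoded = split_frames(stream)
print(stream[:7].hex(), decoded)
```

<p>Reading the length as big-endian is the detail that matters most: a little-endian read of the same four bytes yields a huge length and the parser runs off the end of the buffer.</p>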
<p>The Protobuf side was worse. Windsurf uses a custom proto schema with no public definition. The field numbers in core data structures had to be inferred from packet captures and known Wireshark decryption configs — call chains look like <code>{1: name, 2: args, 3: id}</code>, variable definitions like <code>{1: name, 2: type, 3: value}</code>. Guess wrong and the whole request fails, with no useful error message.</p>
<p>The encoder looks like this (<a href="https://github.com/ferstar/fast-context/blob/main/src/core.py#L64"  target="_blank" rel="noreferrer">ProtobufEncoder</a>):</p>
<div class="highlight-wrapper"><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python"><span class="line"><span class="cl"><span class="k">class</span> <span class="nc">ProtobufEncoder</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">    <span class="s2">"""Manual protobuf encoder, matching the Windsurf wire format exactly."""</span>
</span></span><span class="line"><span class="cl">    <span class="k">def</span> <span class="fm">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">)</span> <span class="o">-></span> <span class="kc">None</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">        <span class="bp">self</span><span class="o">.</span><span class="n">buf</span> <span class="o">=</span> <span class="nb">bytearray</span><span class="p">()</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="k">def</span> <span class="nf">_varint</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">value</span><span class="p">:</span> <span class="nb">int</span><span class="p">)</span> <span class="o">-></span> <span class="nb">bytes</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">        <span class="n">parts</span><span class="p">:</span> <span class="nb">list</span><span class="p">[</span><span class="nb">int</span><span class="p">]</span> <span class="o">=</span> <span class="p">[]</span>
</span></span><span class="line"><span class="cl">        <span class="k">while</span> <span class="n">value</span> <span class="o">></span> <span class="mh">0x7F</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">            <span class="n">parts</span><span class="o">.</span><span class="n">append</span><span class="p">((</span><span class="n">value</span> <span class="o">&</span> <span class="mh">0x7F</span><span class="p">)</span> <span class="o">|</span> <span class="mh">0x80</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">            <span class="n">value</span> <span class="o">>>=</span> <span class="mi">7</span>
</span></span><span class="line"><span class="cl">        <span class="n">parts</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">value</span> <span class="o">&</span> <span class="mh">0x7F</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">        <span class="k">return</span> <span class="nb">bytes</span><span class="p">(</span><span class="n">parts</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="k">def</span> <span class="nf">_tag</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">field</span><span class="p">:</span> <span class="nb">int</span><span class="p">,</span> <span class="n">wire</span><span class="p">:</span> <span class="nb">int</span><span class="p">)</span> <span class="o">-></span> <span class="nb">bytes</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">        <span class="k">return</span> <span class="bp">self</span><span class="o">.</span><span class="n">_varint</span><span class="p">((</span><span class="n">field</span> <span class="o"><<</span> <span class="mi">3</span><span class="p">)</span> <span class="o">|</span> <span class="n">wire</span><span class="p">)</span></span></span></code></pre></div></div>
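<p>To see what these two helpers produce on the wire, here is a standalone sketch that reuses the same varint and tag logic as plain functions. The field numbers and values are purely illustrative and are not taken from the Windsurf schema; <code>string_field</code> and <code>varint_field</code> are helper names invented for this example.</p>

```python
def varint(value: int) -> bytes:
    """Encode a non-negative integer as a protobuf base-128 varint."""
    parts = []
    while value > 0x7F:
        parts.append((value & 0x7F) | 0x80)  # low 7 bits, continuation bit set
        value >>= 7
    parts.append(value & 0x7F)
    return bytes(parts)


def tag(field: int, wire: int) -> bytes:
    """Field key: field number shifted left by 3, OR-ed with the wire type."""
    return varint((field << 3) | wire)


def string_field(field: int, text: str) -> bytes:
    """Wire type 2 (length-delimited): tag, byte length, then UTF-8 bytes."""
    data = text.encode("utf-8")
    return tag(field, 2) + varint(len(data)) + data


def varint_field(field: int, value: int) -> bytes:
    """Wire type 0 (varint): tag followed by the encoded integer."""
    return tag(field, 0) + varint(value)


# Illustrative message: field 1 is a string, field 3 is an integer.
message = string_field(1, "rg") + varint_field(3, 300)
print(message.hex())  # 0a02726718ac02
```

<p>Reading the hex left to right: <code>0a</code> is field 1 with wire type 2, <code>02</code> is the length, <code>7267</code> is "rg", <code>18</code> is field 3 with wire type 0, and <code>ac02</code> is 300 as a varint.</p>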
<p>Decoding Windsurf’s streaming response is the same story — split frames, read payloads, find the stream-end marker, reconstruct the result. It is much more work than calling a REST API, but the upside is clear: no intermediary dependency, just a direct path to Windsurf’s backend.</p>

<h3 class="relative group">Why local Semble cache works
    <div id="why-local-semble-cache-works" class="anchor"></div>
    
    <span
        class="absolute top-0 w-6 transition-opacity opacity-0 -start-6 not-prose group-hover:opacity-100 select-none">
        <a class="text-primary-300 dark:text-neutral-700 !no-underline" href="#why-local-semble-cache-works" aria-label="Anchor">#</a>
    </span>
    
</h3>
<p>Before adding Semble I did wonder: does local indexing really help?</p>
<p>Once I ran the benchmark, there was no suspense left.</p>
<p>I ran 40 labeled queries across two repos (fastapi and axios):</p>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Backend</th>
          <th style="text-align: right">NDCG@10</th>
          <th style="text-align: right">Recall@10</th>
          <th style="text-align: right">Top-1</th>
          <th style="text-align: right">Batch p50</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left">local (Semble only)</td>
          <td style="text-align: right">0.854</td>
          <td style="text-align: right">0.946</td>
          <td style="text-align: right">0.775</td>
          <td style="text-align: right">30 ms</td>
      </tr>
      <tr>
          <td style="text-align: left">remote (Windsurf only)</td>
          <td style="text-align: right">0.453</td>
          <td style="text-align: right">0.467</td>
          <td style="text-align: right">0.450</td>
          <td style="text-align: right">24.4 s</td>
      </tr>
      <tr>
          <td style="text-align: left">hybrid (Semble + Windsurf)</td>
          <td style="text-align: right">0.890</td>
          <td style="text-align: right">0.979</td>
          <td style="text-align: right">0.825</td>
          <td style="text-align: right">28.3 s</td>
      </tr>
  </tbody>
</table>
<p>Local Semble alone hit 94.6% recall with 30 ms p50 latency. Windsurf alone underperformed — only a 52.5% success rate, with the rest lost to throttling or <code>resource_exhausted</code>.</p>
<p>Hybrid mode puts Windsurf after Semble for verification and expansion. Compared with local-only, NDCG@10 rose from 0.854 to 0.890 and recall from 94.6% to 97.9%, at the cost of a 28.3 s batch p50.</p>
<p>Two things became clear:</p>
<ul>
<li><strong>Local cache is not a backup; it is the first line of defense.</strong> It handles most common searches in 30 ms. When the remote is down, it is a degradation path, not a dead end.</li>
<li><strong>Windsurf’s value is in verification, not first-pass search.</strong> Asking it to search from scratch risks timeout and throttling. Give it Semble chunk candidates and exact keyword anchors, and it only has to confirm — which succeeds far more often.</li>
</ul>
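<p>For readers who want to sanity-check the columns in the table above, here is a minimal sketch of how these three quality metrics are conventionally computed for a single query. It assumes binary relevance labels (a file is either relevant or not) and no duplicate entries in a ranking. The function names are illustrative, and the actual runner in <code>benchmarks/</code> may grade results differently.</p>

```python
import math


def recall_at_k(ranked, relevant, k=10):
    """Fraction of the relevant items that appear in the top k results."""
    if not relevant:
        return 0.0
    hits = sum(1 for item in ranked[:k] if item in relevant)
    return hits / len(relevant)


def ndcg_at_k(ranked, relevant, k=10):
    """Binary-relevance NDCG: DCG of this ranking divided by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(rank + 2)  # rank is 0-based, so position 1 gets log2(2)
        for rank, item in enumerate(ranked[:k])
        if item in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg else 0.0


def top1(ranked, relevant):
    """1.0 if the first result is relevant, else 0.0."""
    return 1.0 if ranked and ranked[0] in relevant else 0.0
```

<p>Averaging per-query scores like these over a labeled query set yields table-style numbers such as the ones above.</p>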

<h3 class="relative group">Credential handling got more involved
    <div id="credential-handling-got-more-involved" class="anchor"></div>
    
    <span
        class="absolute top-0 w-6 transition-opacity opacity-0 -start-6 not-prose group-hover:opacity-100 select-none">
        <a class="text-primary-300 dark:text-neutral-700 !no-underline" href="#credential-handling-got-more-involved" aria-label="Anchor">#</a>
    </span>
    
</h3>
<p>ace-wrapper just needed an API key. fast-context uses Windsurf’s session token, stored in <code>state.vscdb</code> (a SQLite database).</p>
<p>The extraction logic lives in <a href="https://github.com/ferstar/fast-context/blob/main/src/extract_key.py"  target="_blank" rel="noreferrer">extract_key.py</a>:</p>
<div class="highlight-wrapper"><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-text" data-lang="text"><span class="line"><span class="cl">Query ItemTable for key='windsurf.api_key'
</span></span><span class="line"><span class="cl">→ if found, return it
</span></span><span class="line"><span class="cl">→ if not, search for rows whose key contains 'devin-session-token'
</span></span><span class="line"><span class="cl">→ either format works
</span></span><span class="line"><span class="cl">→ can also override with WINDSURF_API_KEY env var</span></span></code></pre></div></div>
<p>Why support two formats? Because Windsurf keeps changing. Earlier versions used standard API keys; newer ones moved to session-style credentials like <code>devin-session-token$...</code>. If the tool does not adapt, it breaks as soon as the user upgrades their IDE.</p>
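<p>The lookup order above can be sketched in a few lines of Python. This is not the code in <code>extract_key.py</code>, only an illustration under stated assumptions: <code>ItemTable</code> has <code>key</code> and <code>value</code> columns, the environment variable takes precedence over the database, and the session-token row is found with a substring match on the key.</p>

```python
import os
import sqlite3


def extract_key(db_path):
    """Return a Windsurf credential, or None if nothing usable is found."""
    # Assumption: the env var wins over anything stored on disk.
    override = os.environ.get("WINDSURF_API_KEY")
    if override:
        return override

    conn = sqlite3.connect(db_path)
    try:
        # Older format: a plain API key under a fixed key name.
        row = conn.execute(
            "SELECT value FROM ItemTable WHERE key = ?", ("windsurf.api_key",)
        ).fetchone()
        if row is None:
            # Newer format: a session-style credential stored under a key
            # that contains 'devin-session-token'.
            row = conn.execute(
                "SELECT value FROM ItemTable WHERE key LIKE ?",
                ("%devin-session-token%",),
            ).fetchone()
    finally:
        conn.close()

    if row is None:
        return None
    value = row[0]
    # Values may come back as bytes depending on how they were written.
    return value.decode("utf-8") if isinstance(value, bytes) else value
```

<p>The second pattern is just one more query behind an <code>if</code>, which is why supporting a new credential format does not have to touch the main flow.</p>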

<h3 class="relative group">The current workflow
    <div id="the-current-workflow" class="anchor"></div>
    
    <span
        class="absolute top-0 w-6 transition-opacity opacity-0 -start-6 not-prose group-hover:opacity-100 select-none">
        <a class="text-primary-300 dark:text-neutral-700 !no-underline" href="#the-current-workflow" aria-label="Anchor">#</a>
    </span>
    
</h3>
<p>During the ace-wrapper era, my AGENTS.md looked like this:</p>
<div class="highlight-wrapper"><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-text" data-lang="text"><span class="line"><span class="cl">Use ace for semantic retrieval → read files → confirm evidence with rg</span></span></code></pre></div></div>
<p>Now it reads:</p>
<div class="highlight-wrapper"><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-text" data-lang="text"><span class="line"><span class="cl">Use fast-context search (default hybrid) for candidates + line ranges
</span></span><span class="line"><span class="cl">If hybrid times out or returns nothing, try fast-context local-search
</span></span><span class="line"><span class="cl">If a chunk candidate looks promising, use fast-context find-related
</span></span><span class="line"><span class="cl">After reading files, confirm exact evidence with rg/ast-grep</span></span></code></pre></div></div>
<p>There are more branches now, but each one has a clear fallback.</p>
<p>On the remote side, there is a model fallback chain too:</p>
<ol>
<li>Default: <code>MODEL_SWE_1_6_FAST</code></li>
<li>On <code>resource_exhausted</code> or rate-limit: auto-degrade to <code>MODEL_SWE_1_5</code></li>
<li>Custom fallback order via <code>WS_FALLBACK_MODELS</code></li>
</ol>
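<p>A model fallback chain of this shape might look roughly like the sketch below. Several things here are assumptions made for illustration: <code>ResourceExhausted</code> is a stand-in exception, <code>call_remote</code> is a hypothetical callable supplied by the caller, and <code>WS_FALLBACK_MODELS</code> is treated as a comma-separated list that replaces the fallbacks after the default model. The real parsing rules may differ.</p>

```python
import os


class ResourceExhausted(Exception):
    """Stand-in for a resource_exhausted or rate-limit error from the remote."""


DEFAULT_CHAIN = ["MODEL_SWE_1_6_FAST", "MODEL_SWE_1_5"]


def model_chain():
    """Build the fallback order, honoring WS_FALLBACK_MODELS if it is set."""
    # Assumption: comma-separated model names, whitespace ignored.
    raw = os.environ.get("WS_FALLBACK_MODELS", "")
    custom = [name.strip() for name in raw.split(",") if name.strip()]
    return [DEFAULT_CHAIN[0], *custom] if custom else list(DEFAULT_CHAIN)


def search_with_fallback(query, call_remote):
    """Try each model in order, degrading only on ResourceExhausted."""
    last_error = None
    for model in model_chain():
        try:
            return model, call_remote(model, query)
        except ResourceExhausted as err:
            last_error = err  # throttled: move on to the next model
    # Every model in the chain was throttled.
    raise last_error
```

<p>Only throttling-style errors trigger a downgrade here; any other exception propagates immediately, so a real bug does not get hidden behind a retry on a weaker model.</p>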

<h3 class="relative group">Benchmark results
    <div id="benchmark-results" class="anchor"></div>
    
    <span
        class="absolute top-0 w-6 transition-opacity opacity-0 -start-6 not-prose group-hover:opacity-100 select-none">
        <a class="text-primary-300 dark:text-neutral-700 !no-underline" href="#benchmark-results" aria-label="Anchor">#</a>
    </span>
    
</h3>
<p>With the fair runner (completion-based cooldown, 40 queries):</p>
<ul>
<li><strong>Hybrid non-empty output rate: 100%</strong> — all 40 queries returned useful results</li>
<li><strong>Remote non-empty output rate: only 50%</strong> — the other half timed out or got throttled</li>
<li><strong>Local non-empty output rate: 100%</strong>, with zero failures and a p50 latency of 30 ms</li>
</ul>
<p>If the workflow depended purely on remote semantic search, half the queries could get no answer during peak hours. With local Semble backing, the worst case is degraded local chunks, not an empty result.</p>
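<p>As a rough illustration of how a non-empty output rate and a p50 latency can be derived from per-query records, consider the sketch below. The record shape (<code>results</code> and <code>latency_ms</code>) is an assumption for this example and is not taken from the actual benchmark runner.</p>

```python
from statistics import median


def summarize(runs):
    """Summarize one backend's runs.

    Each run is a dict with 'results' (a list, empty on timeout or
    throttling) and 'latency_ms' (wall-clock time for that query).
    """
    non_empty = [run for run in runs if run["results"]]
    return {
        "queries": len(runs),
        "non_empty_rate": len(non_empty) / len(runs) if runs else 0.0,
        "p50_latency_ms": median(run["latency_ms"] for run in runs) if runs else None,
    }
```

<p>Counting an empty result as a failure, rather than only counting exceptions, is what makes the remote backend's throttling visible in the summary.</p>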

<h3 class="relative group">What I do not want to tear down and rebuild again
    <div id="what-i-do-not-want-to-tear-down-and-rebuild-again" class="anchor"></div>
    
    <span
        class="absolute top-0 w-6 transition-opacity opacity-0 -start-6 not-prose group-hover:opacity-100 select-none">
        <a class="text-primary-300 dark:text-neutral-700 !no-underline" href="#what-i-do-not-want-to-tear-down-and-rebuild-again" aria-label="Anchor">#</a>
    </span>
    
</h3>
<p>A few design choices that held up well:</p>
<ol>
<li><strong>Always keep a degradation path.</strong> Every remote dependency needs a local fallback. I already paid for ignoring that once.</li>
<li><strong>Pure Python is easier to maintain.</strong> ace-wrapper was Python too, but this project grew from a few hundred lines to more than two thousand — protobuf encoder, Connect framing, Semble adapter, benchmark runner. Clear structure matters more than language choice. Python just happens to be what I work fastest in.</li>
<li><strong>Benchmarks should live with the code.</strong> The <a href="https://github.com/ferstar/fast-context/tree/main/benchmarks/"  target="_blank" rel="noreferrer">benchmarks/</a> directory with 40 labeled queries and a runner shows the real gap between backends on every run. Optimization without data is mostly guessing.</li>
<li><strong>Credential extraction should auto-adapt.</strong> The <code>devin-session-token</code> format was unexpected, but the code structure made it easy — try another pattern when the first key is not found, without touching the main flow.</li>
</ol>

<h3 class="relative group">Wrapping up
    <div id="wrapping-up" class="anchor"></div>
    
    <span
        class="absolute top-0 w-6 transition-opacity opacity-0 -start-6 not-prose group-hover:opacity-100 select-none">
        <a class="text-primary-300 dark:text-neutral-700 !no-underline" href="#wrapping-up" aria-label="Anchor">#</a>
    </span>
    
</h3>
<p>I still use ace-wrapper sometimes — ACE does come back to life once in a while. But I no longer want my workflow tied to it.</p>
<p>The core idea behind fast-context is simple: let the local cache carry the baseline, and use the remote path for verification and expansion. Once the upstream gets shaky, a purely remote solution is a single rope with no backup line.</p>
<p>If you have hit the same wall, the code is here: <a href="https://github.com/ferstar/fast-context"  target="_blank" rel="noreferrer">ferstar/fast-context</a></p>
]]></content:encoded>
      
    </item>
    
    <item>
      <title>Putting Semantic Search into an AI Coding Harness: Notes on Open-Sourcing ace-wrapper</title>
      <link>https://blog.ferstar.org/en/posts/ace-wrapper-semantic-search-ai-coding-harness/</link>
      <pubDate>Sat, 09 May 2026 14:38:00 +0800</pubDate>
      
      <guid isPermaLink="true">https://blog.ferstar.org/en/posts/ace-wrapper-semantic-search-ai-coding-harness/</guid>
      <description>Long AI coding tasks often fail because the agent reads the wrong files; use ace-wrapper to put semantic retrieval into Read -&gt; Search -&gt; Change -&gt; Verify; let agents find candidate files first, then verify evidence to reduce blind edits and wasted context.</description><content:encoded><![CDATA[<blockquote><p>I am not a native English speaker; this article was translated by AI.</p>
</blockquote><p>In the <a href="/en/posts/ai-coding-harness-engineering-workflow/" >previous post</a> about Harness Engineering, I compressed my default AI coding workflow into a few steps:</p>
<ol>
<li>Read</li>
<li>Search</li>
<li>Change</li>
<li>Verify</li>
<li>Record</li>
</ol>
<p>The easiest one to underestimate is <code>Search</code>.</p>
<p>Many agents fail not because they cannot edit code, but because they read the wrong place first. The user describes a behavior, a bug, or a cross-layer workflow, while the code may not contain a function with the same name. Running <code>rg login</code>, <code>rg upload</code>, or <code>rg session</code> is fast, but it only works when the keyword is already known. If the keyword is unknown, speed just helps the agent drift faster.</p>
<p>So I open-sourced a small layer I have been using recently:</p>
<p><a href="https://github.com/ferstar/ace-wrapper"  target="_blank" rel="noreferrer">ferstar/ace-wrapper</a></p>
<p>It does one narrow thing: wrap Augment Context Engine’s filesystem context search as an <code>ace</code> command, so coding agents can run semantic retrieval from the shell before editing.</p>

<h3 class="relative group">Why this layer exists
    <div id="why-this-layer-exists" class="anchor"></div>
    
    <span
        class="absolute top-0 w-6 transition-opacity opacity-0 -start-6 not-prose group-hover:opacity-100 select-none">
        <a class="text-primary-300 dark:text-neutral-700 !no-underline" href="#why-this-layer-exists" aria-label="Anchor">#</a>
    </span>
    
</h3>
<p>The target is concrete: make the search action part of the harness.</p>
<p>I used to see this path often:</p>
<pre class="not-prose mermaid">
flowchart LR
  A[User describes behavior] --> B[Agent guesses keywords]
  B --> C[Reads nearby files]
  C --> D[Edits plausible code]
  D --> E[Verification fails]
  E --> B
</pre>

<p>The problem with this loop is that, after failure, the agent often keeps circling around the same wrong files. It can edit code; what it needs is a better entry point into candidate files. Put less politely, it is working hard after entering the wrong door.</p>
<p><code>ace-wrapper</code> is meant to patch this part:</p>
<pre class="not-prose mermaid">
flowchart LR
  A[User describes behavior] --> B[ace semantic retrieval]
  B --> C[Candidate files]
  C --> D[Read returned files]
  D --> E[rg / tests confirm evidence]
  E --> F[Small patch]
  F --> G[Verify]
</pre>

<p>The important part is the order: <code>ace</code> only finds candidate files. Conclusions still require reading files, exact search, and tests. It is not an answer generator; it just helps the agent waste fewer steps.</p>

<h3 class="relative group">Usage is short
    <div id="usage-is-short" class="anchor"></div>
    
    <span
        class="absolute top-0 w-6 transition-opacity opacity-0 -start-6 not-prose group-hover:opacity-100 select-none">
        <a class="text-primary-300 dark:text-neutral-700 !no-underline" href="#usage-is-short" aria-label="Anchor">#</a>
    </span>
    
</h3>
<p>Install it:</p>
<div class="highlight-wrapper"><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl">uv tool install ace-wrapper</span></span></code></pre></div></div>
<p>Install a local development checkout:</p>
<div class="highlight-wrapper"><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl">uv tool install /path/to/ace-wrapper</span></span></code></pre></div></div>
<p>Search for a workflow when the exact keyword is unknown:</p>
<div class="highlight-wrapper"><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl">timeout 60s ace <span class="s2">"user uploads an unsupported file and should see skipped-file feedback"</span> -w /repo
</span></span><span class="line"><span class="cl">rg -n <span class="s2">"unsupported|skipped|upload|file"</span> /repo</span></span></code></pre></div></div>
<p>The first command answers “which files may be relevant.” The second command confirms “which identifiers, events, copy, or tests actually exist in the code.”</p>
<p>I usually put this rule into a project’s <code>AGENTS.md</code>:</p>
<div class="highlight-wrapper"><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-text" data-lang="text"><span class="line"><span class="cl">Use `timeout 60s ace "&lt;query&gt;" -w &lt;repo-root&gt;` for semantic codebase discovery.
</span></span><span class="line"><span class="cl">Treat `ace` results as candidate files.
</span></span><span class="line"><span class="cl">After it returns results, read the relevant files and use exact search before using them as evidence.</span></span></code></pre></div></div>
<p>These lines work better than “read more context,” because they give the agent a concrete action and a boundary against false conclusions.</p>

<h3 class="relative group">How it works with rg
    <div id="how-it-works-with-rg" class="anchor"></div>
    
    <span
        class="absolute top-0 w-6 transition-opacity opacity-0 -start-6 not-prose group-hover:opacity-100 select-none">
        <a class="text-primary-300 dark:text-neutral-700 !no-underline" href="#how-it-works-with-rg" aria-label="Anchor">#</a>
    </span>
    
</h3>
<p><code>ace</code> and <code>rg</code> work better as consecutive steps.</p>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Scenario</th>
          <th style="text-align: left">Use first</th>
          <th style="text-align: left">Why</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left">You know the behavior but not the implementation location</td>
          <td style="text-align: left"><code>ace</code></td>
          <td style="text-align: left">Behavior descriptions can find candidate entry points across files and naming styles</td>
      </tr>
      <tr>
          <td style="text-align: left">You know the function name, event name, or error text</td>
          <td style="text-align: left"><code>rg</code></td>
          <td style="text-align: left">It is exact, complete, and enumerable</td>
      </tr>
      <tr>
          <td style="text-align: left">You need a structural refactor</td>
          <td style="text-align: left"><code>ast-grep</code></td>
          <td style="text-align: left">AST-level matching is needed; textual proximity falls short</td>
      </tr>
      <tr>
          <td style="text-align: left">You need to confirm whether a feature exists</td>
          <td style="text-align: left"><code>ace</code> + read files + <code>rg</code></td>
          <td style="text-align: left">A semantic hit cannot prove the feature exists</td>
      </tr>
  </tbody>
</table>
<p>I intentionally wrote this boundary into the README: ACE returns candidate files, while evidence still has to come from code and tests. That boundary matters.</p>
<p>Semantic retrieval returns “nearby” things. If you ask about a feature that does not exist, it may still find files that look related. If an agent treats “there are results” as “the feature exists,” it starts inventing a story. A conclusion is only defensible after reading an implementation, test, route, config, or call site.</p>
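<p>The candidate-versus-evidence split can be made mechanical. The sketch below is a plain-Python stand-in for the <code>rg</code> step: it takes candidate paths (for example, ones returned by a semantic search) and keeps only the files that contain an exact match. The function name and return shape are hypothetical, chosen for illustration.</p>

```python
import re
from pathlib import Path


def confirm_candidates(candidates, pattern):
    """Keep only candidate files that contain an exact match for `pattern`.

    Returns {path: [(line_number, line_text), ...]} for files with hits.
    """
    regex = re.compile(pattern)
    evidence = {}
    for path in candidates:
        file = Path(path)
        if not file.is_file():
            continue  # semantic search can return stale or missing paths
        text = file.read_text(encoding="utf-8", errors="replace")
        hits = [
            (number, line)
            for number, line in enumerate(text.splitlines(), start=1)
            if regex.search(line)
        ]
        if hits:
            evidence[str(file)] = hits
    return evidence
```

<p>An empty dictionary is itself a useful answer: the search returned neighbors, but nothing in them supports the claim that the feature exists.</p>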

<h3 class="relative group">Where it fits in Harness Engineering
    <div id="where-it-fits-in-harness-engineering" class="anchor"></div>
    
    <span
        class="absolute top-0 w-6 transition-opacity opacity-0 -start-6 not-prose group-hover:opacity-100 select-none">
        <a class="text-primary-300 dark:text-neutral-700 !no-underline" href="#where-it-fits-in-harness-engineering" aria-label="Anchor">#</a>
    </span>
    
</h3>
<p><code>ace-wrapper</code> is small, and I want it to stay that way. It is closer to a small gear in the harness: it turns open-ended code discovery into a repeatable, constrained command.</p>
<p>I now prefer this project rule:</p>
<div class="highlight-wrapper"><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-text" data-lang="text"><span class="line"><span class="cl">Read -> Search -> Change -> Verify</span></span></code></pre></div></div>
<p>Here, <code>Search</code> means choosing the tool by problem type:</p>
<ul>
<li>Open-ended behavior and cross-layer workflows: use <code>ace</code> first</li>
<li>Exact identifiers, errors, routes, and config keys: use <code>rg</code></li>
<li>Structural replacements: use <code>ast-grep</code></li>
<li>External strategy and industry practice: use web research</li>
<li>Old decisions and repeated lessons: use memory</li>
</ul>
<p>The useful part of this split is reduced agent randomness. The agent first uses semantic retrieval to narrow the reading surface, then uses deterministic tools to confirm facts, and only then changes code. The order is a little more verbose, but it is much cheaper than confidently editing the wrong file.</p>

<h3 class="relative group">The prompt matters most
    <div id="the-prompt-matters-most" class="anchor"></div>
    
    <span
        class="absolute top-0 w-6 transition-opacity opacity-0 -start-6 not-prose group-hover:opacity-100 select-none">
        <a class="text-primary-300 dark:text-neutral-700 !no-underline" href="#the-prompt-matters-most" aria-label="Anchor">#</a>
    </span>
    
</h3>
<p>A good <code>ace</code> query describes behavior and avoids keyword piles:</p>
<div class="highlight-wrapper"><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl">timeout 60s ace <span class="s2">"frontend sends requestId to backend and starts a processing job"</span> -w /repo
</span></span><span class="line"><span class="cl">timeout 60s ace <span class="s2">"用户拖入不支持的文件后应该显示跳过文件提示"</span> -w /repo
</span></span><span class="line"><span class="cl">timeout 60s ace <span class="s2">"how provider config is persisted and restored after app restart"</span> -w /repo</span></span></code></pre></div></div>
<p>I try to include four kinds of information:</p>
<ul>
<li>User action: click, drag, upload, stop generation</li>
<li>Runtime boundary: frontend to backend, CLI handler to core service</li>
<li>Expected effect: persist config, abort loop, show skipped-file feedback</li>
<li>Known fields: <code>sessionId</code>, <code>requestId</code>, <code>files</code>, <code>workspace</code></li>
</ul>
<p>This is much more stable than only searching <code>upload</code> or <code>provider</code>. It lets the retrieval system look for behavior and data flow, and it reminds the agent that this step is still semantic retrieval, not evidence by itself.</p>
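<p>If queries are assembled by a script or skill rather than typed by hand, the four kinds of information can be kept as separate fields and joined at the end. The helper below is purely hypothetical (neither <code>AceQuery</code> nor <code>render</code> exists in ace-wrapper); it only shows one way to keep a query behavior-shaped instead of a pile of keywords.</p>

```python
from dataclasses import dataclass, field


@dataclass
class AceQuery:
    action: str                                      # what the user does
    boundary: str = ""                               # which layers the flow crosses
    effect: str = ""                                 # what should happen as a result
    fields: list[str] = field(default_factory=list)  # identifiers already known

    def render(self) -> str:
        """Join the non-empty parts into one behavior-style sentence."""
        parts = [self.action]
        if self.boundary:
            parts.append(f"across {self.boundary}")
        if self.effect:
            parts.append(f"and should {self.effect}")
        text = " ".join(parts)
        if self.fields:
            text += f" (fields: {', '.join(self.fields)})"
        return text
```

<p>The rendered string would then be passed as the quoted query argument to <code>ace</code>, with <code>-w</code> pointing at the repository root as in the commands above.</p>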

<h3 class="relative group">Why I open-sourced it
    <div id="why-i-open-sourced-it" class="anchor"></div>
    
    <span
        class="absolute top-0 w-6 transition-opacity opacity-0 -start-6 not-prose group-hover:opacity-100 select-none">
        <a class="text-primary-300 dark:text-neutral-700 !no-underline" href="#why-i-open-sourced-it" aria-label="Anchor">#</a>
    </span>
    
</h3>
<p><code>ace-wrapper</code> has very little code. The core is just <code>FileSystemContext.create(str(workspace))</code> plus <code>context.search(args.query)</code>. I wanted to preserve the workflow constraints around those few lines:</p>
<ol>
<li>If the keyword is unknown, start with semantic retrieval</li>
<li>Ask one workflow per query</li>
<li>Treat results as candidate files</li>
<li>Read the files, then use <code>rg</code> to confirm exact evidence</li>
<li>Do not conclude without evidence</li>
</ol>
<p>Once these rules live in the tool README, skill, and agent prompt, they become much more likely to stick. Otherwise every session depends on a human reminding the agent again, which gets old fast.</p>
<p>The previous post said Harness Engineering means putting an engineering track around AI. <code>ace-wrapper</code> is one small piece of that track: it does not make the agent better at writing code; it just helps the agent read the right place first.</p>
]]></content:encoded>
      
    </item>
    
    <item>
      <title>From Vibe Coding to Harness Engineering: How My AI Coding Workflow Changed</title>
      <link>https://blog.ferstar.org/en/posts/ai-coding-harness-engineering-workflow/</link>
      <pubDate>Sat, 09 May 2026 14:19:00 +0800</pubDate>
      
      <guid isPermaLink="true">https://blog.ferstar.org/en/posts/ai-coding-harness-engineering-workflow/</guid>
      <description>AI coding can generate code but long-running delivery drifts easily; use Harness Engineering to control tasks, context, verification, and recovery; turn AI output into an executable, verifiable, reviewable engineering workflow.</description><content:encoded><![CDATA[<blockquote><p>I am not a native English speaker; this article was translated by AI.</p>
</blockquote><p>This is the written version of an internal team sharing session. The slides are here:</p>
<p><a href="/slides/harness-engineering-ai-coding/" >From Vibe Coding to Harness Engineering</a></p>
<div style="position:relative;width:100%;aspect-ratio:16/9;margin:1.5rem 0 2rem;border:1px solid rgba(127,127,127,.25);overflow:hidden;">
  <iframe src="/slides/harness-engineering-ai-coding/" title="From Vibe Coding to Harness Engineering" style="position:absolute;inset:0;width:100%;height:100%;border:0;" loading="lazy" allowfullscreen></iframe>
</div>
<p>For a while I kept looking at one question: can AI really take over most of the coding work?</p>
<p>The answer is mostly settled now. When the project context, quality gates, and verification flow are in place, AI-generated code can enter the engineering workflow reliably. Human time moves from “typing the code” to “holding the line”: breaking down requirements, judging architecture, arranging context, checking boundaries, and handling failures.</p>
<p>Recent practice pushed this one step further. The question is no longer how to make the prompt prettier. It is whether the whole workflow can survive long-running tasks. I have been burned by this a few times, especially when I open the laptop in the morning, see that the agent ran all night, and still cannot tell which diff should be kept.</p>

<h3 class="relative group">What changed
    <div id="what-changed" class="anchor"></div>
    
    <span
        class="absolute top-0 w-6 transition-opacity opacity-0 -start-6 not-prose group-hover:opacity-100 select-none">
        <a class="text-primary-300 dark:text-neutral-700 !no-underline" href="#what-changed" aria-label="Anchor">#</a>
    </span>
    
</h3>
<p>Early Vibe Coding solved the entry problem: describe the requirement clearly, put project rules into <code>AGENTS.md</code> / <code>CLAUDE.md</code>, and let tests, lint, and review catch the model output.</p>
<p>That setup is still useful, but it is closer to single-task engineering. Once a task gets longer, a few problems start showing up:</p>
<ul>
<li>Context keeps growing until the model loses the important part</li>
<li>Repeated retries can push the fix further away from the real issue</li>
<li>Without external references, strategy becomes guesswork</li>
<li>After many rounds, it is hard to tell which changes should be kept</li>
<li>User rejection, permission blocks, and empty output need explicit stop semantics</li>
</ul>
<p>So I now prefer calling this layer Harness Engineering: put an engineering track around AI so tasks are executable, results are verifiable, and failures are recoverable. The name sounds a bit grand. In practice, it just means trusting “it will figure it out” a little less and adding a few guardrails.</p>
<pre class="not-prose mermaid">
flowchart LR
  A[Task scope] --> B[Context route]
  B --> C[Agent loop]
  C --> D[Verification gate]
  D --> E[Recovery / memory]
  D -->|failed| F[Patch harness]
  F --> C
</pre>


<h3 class="relative group">The four things I manage first
    <div id="the-four-things-i-manage-first" class="anchor"></div>
    
    <span
        class="absolute top-0 w-6 transition-opacity opacity-0 -start-6 not-prose group-hover:opacity-100 select-none">
        <a class="text-primary-300 dark:text-neutral-700 !no-underline" href="#the-four-things-i-manage-first" aria-label="Anchor">#</a>
    </span>
    
</h3>
<p>The first thing is task boundaries.</p>
<p>Before a medium-sized task starts, I want at least <code>done when</code>, <code>out of scope</code>, the change surface, and the verification command. This does not need to be a long document. Five lines are often enough. The point is to let the executor know when to stop, instead of drifting into “while I am here” changes.</p>
<p>The second thing is context routing.</p>
<p><code>AGENTS.md</code> should not become an encyclopedia. It works better as an index: project rules, entry points, verification commands, things that must not be touched, and where to read the next layer of docs. Long context should be opened on demand, not dumped into the session. When the context gets too full, the model behaves a bit like me with too many browser tabs open: it looks busy, but the focus is gone.</p>
<p>The third thing is the verification loop.</p>
<p>My default order is now:</p>
<ol>
<li>Read: read README, AGENTS, older notes, and key implementation files</li>
<li>Search: use <code>ace</code>, <code>rg</code>, <code>ast-grep</code>, <code>nmem</code>, and Exa to find evidence</li>
<li>Change: apply a small patch and avoid drive-by refactors</li>
<li>Verify: run narrow checks first, then expand by risk</li>
<li>Record: write repeated lessons back into rules, tests, or memory</li>
</ol>
<p>This order is boring in a good way. Reading and searching first reduce model guesswork. Narrow verification avoids one giant change where nobody knows which step broke.</p>
<p>The fourth thing is failure handling.</p>
<p>After a failure, I classify it first: stop, retry, patch the harness, or record it.</p>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Type</th>
          <th style="text-align: left">When to use it</th>
          <th style="text-align: left">Handling</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left">Stop</td>
          <td style="text-align: left">User rejection, permission block, side effect risk, repeated spinning</td>
          <td style="text-align: left">Break the loop and return control</td>
      </tr>
      <tr>
          <td style="text-align: left">Retry</td>
          <td style="text-align: left">Network jitter, fixable parameter, read failure without side effects</td>
          <td style="text-align: left">Retry in small steps and keep logs</td>
      </tr>
      <tr>
          <td style="text-align: left">Patch</td>
          <td style="text-align: left">Same class of error appears twice</td>
          <td style="text-align: left">Add tests, rules, scripts, or logs</td>
      </tr>
      <tr>
          <td style="text-align: left">Record</td>
          <td style="text-align: left">The case will likely happen again</td>
          <td style="text-align: left">Save trigger conditions, verification commands, and evidence entry points</td>
      </tr>
  </tbody>
</table>
<p>I used to treat many failures as “try again.” Now I am more careful: only retry failures that are actually retryable, and stop when the situation says stop. Letting an agent push forward from a wrong premise usually just creates more diff for a human to clean up.</p>
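<p>The table above can be turned into a small decision function. The sketch below is one possible encoding, not a prescribed one: the reason strings are made-up labels for the table's conditions, and the priority order (stop first, then patch, retry, record) is an assumption about how overlapping cases should resolve.</p>

```python
from enum import Enum


class Action(Enum):
    STOP = "stop"
    RETRY = "retry"
    PATCH = "patch"
    RECORD = "record"


# Hypothetical labels for the conditions listed in the table.
STOP_REASONS = {"user_rejection", "permission_block", "side_effect_risk", "repeated_spinning"}
RETRY_REASONS = {"network_jitter", "fixable_parameter", "read_failure"}


def classify(reason, seen_before=False, likely_to_recur=False):
    """Map a failure to one handling action, checking stop conditions first."""
    if reason in STOP_REASONS:
        return Action.STOP
    if seen_before:
        return Action.PATCH   # same class of error appeared twice
    if reason in RETRY_REASONS:
        return Action.RETRY
    if likely_to_recur:
        return Action.RECORD
    return Action.STOP        # unknown failures hand control back
```

<p>Defaulting unknown failures to <code>STOP</code> keeps the loop conservative: anything the harness cannot classify goes back to a human instead of being retried from a possibly wrong premise.</p>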

<h3 class="relative group">Where external research fits
    <div id="where-external-research-fits" class="anchor"></div>
    
    <span
        class="absolute top-0 w-6 transition-opacity opacity-0 -start-6 not-prose group-hover:opacity-100 select-none">
        <a class="text-primary-300 dark:text-neutral-700 !no-underline" href="#where-external-research-fits" aria-label="Anchor">#</a>
    </span>
    
</h3>
<p>In this workflow, Exa or similar web search tools also have a clearer place.</p>
<p>I usually do not search for broad trends. I search for concrete engineering questions:</p>
<ul>
<li>What timeout should be used?</li>
<li>Should this failure be retried?</li>
<li>How should the default strategy be split?</li>
<li>What boundaries do mainstream tools provide?</li>
<li>What failure samples show up in real issues?</li>
</ul>
<p>I still do not copy external answers directly. External material gives me a reference frame, and the final decision has to fit the current repo. Useful conclusions should land in specs, project rules, tests, or scripts. Otherwise I will search for the same thing again next time, which is a very small but reliable way to waste time.</p>

<h3 class="relative group">Autoresearch and Ralph Loop
    <div id="autoresearch-and-ralph-loop" class="anchor"></div>
    
    <span
        class="absolute top-0 w-6 transition-opacity opacity-0 -start-6 not-prose group-hover:opacity-100 select-none">
        <a class="text-primary-300 dark:text-neutral-700 !no-underline" href="#autoresearch-and-ralph-loop" aria-label="Anchor">#</a>
    </span>
    
</h3>
<p>Autoresearch works best for long loops with a clear metric. Give the agent a goal, a guard, and a verification command first. Each round should allow only one rollback-friendly change. If it drifts, the damage is still contained.</p>
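<p>In code, the loop shape is roughly the following. This is a generic keep-or-rollback sketch, not the implementation of any particular Autoresearch tool: <code>propose</code>, <code>apply</code>, <code>metric</code>, and <code>guard</code> are hypothetical callables, and <code>metric</code> stands in for whatever the verification command reports.</p>

```python
def autoresearch(state, propose, apply, metric, guard, rounds=10):
    """Run a keep-or-rollback loop: one change per round, judged by `metric`."""
    best = metric(state)
    for _ in range(rounds):
        change = propose(state)
        if change is None:
            break  # nothing left to try
        candidate = apply(state, change)  # must return a new state, not mutate
        score = metric(candidate)
        if guard(candidate) and score > best:
            state, best = candidate, score  # keep the change
        # Otherwise the candidate is dropped and `state` stays as it was,
        # which is the rollback.
    return state, best
```

<p>Because <code>apply</code> returns a new state instead of mutating the old one, a rejected round leaves nothing to clean up. With a real repository, the equivalent is one commit per round that can be reverted on its own.</p>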
<p>I currently treat Ralph Loop as persistent single-owner execution. The same owner keeps driving the work. PRD and test spec come first, then the agent runs the long task. It cares more about preserving context, judgment, and verification clues than about adding more agents early. Fewer people in the loop can sometimes make ownership much clearer.</p>
<p>Both patterns share the same idea: define the track before letting the agent run. The track needs metrics, boundaries, verification, and rules for what to keep or discard.</p>

<h3 class="relative group">Three steps worth copying first
    <div id="three-steps-worth-copying-first" class="anchor"></div>
    
    <span
        class="absolute top-0 w-6 transition-opacity opacity-0 -start-6 not-prose group-hover:opacity-100 select-none">
        <a class="text-primary-300 dark:text-neutral-700 !no-underline" href="#three-steps-worth-copying-first" aria-label="Anchor">#</a>
    </span>
    
</h3>
<p>If this needs to move into a team workflow, I would not start with platform work. Three steps are enough to copy tomorrow:</p>
<ol>
<li>Write <code>done when</code> and <code>out of scope</code> for every medium-sized task</li>
<li>Ask the agent to list files, evidence, and the change surface before allowing edits</li>
<li>After one failure, patch tests, rules, or scripts before letting the agent continue</li>
</ol>
<p>Once these three steps are in place, AI coding moves a bit from “it can produce output” toward “it can be shipped.” Autoresearch, Ralph Loop, team workers, and memory become easier to reason about after that.</p>
]]></content:encoded>
      
    </item>
    
  </channel>
</rss>
