AI Coding on Code is cheap, let's talk

Growing a Codex Workflow as a Living System: From Session Logs to Skills

Mon, 22 Jun 2026 12:00:00 +0800

I am not a native English speaker; this article was translated by AI.

A few days ago I did something boring but useful: scanned the Codex sessions in ~/.codex.

Not for nostalgia, not for a dashboard. I just wanted to see where I was wasting time on repeat.

No surprise. The expensive part was rarely one hard code change. It was the small glue work I had to do every single day:

git status, git diff, glab api, glab mr
finding the first failed CI job
checking remote SSH, PATH, Tailscale, permissions
deciding which tests to run for a given change
confirming SHA, workflow, artifact before release or deploy
resuming a session and re-sorting the issue, branch, MR

Each of these is too small to bother fixing, and that is exactly why they keep getting ignored. In the end, you spend your day swimming in manual glue.

The original prompt was short:

Based on my recent Codex projects and threads, suggest ways to simplify project workflows and improve efficiency. Use subagents to analyze in parallel.

The first result was still too biased toward a few recent projects, so I added one more instruction:

Not just these projects. Scan all possible sessions under ~/.codex, dispatch multiple subagents to analyze them separately, then summarize.

The point was not “ask the model for optimization ideas.” It was changing the source of truth from my memory of recent work to the repeated actions inside real session logs.

Do not write tools too early
#

I used to do this too — see repetition, reach for a script. Later I realized that is often too early.

Most repetition is not about repeated commands. It is about a repeated decision process. When CI fails, the reusable part is not some random gh run view. It is:

confirm the run and head SHA
find the first failed job
extract the useful error
then figure out: workflow, dependency, test, or code
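The decision order above can be written down in a few lines. This is a minimal sketch, not a tool from the post: the `Job` record, the keyword buckets in `CATEGORY_HINTS`, and both function names are hypothetical, and real CI logs would need their own hints.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    status: str       # e.g. "success", "failed", "skipped"
    started_at: int   # epoch seconds, used only for ordering
    log_tail: str = ""


# Hypothetical keyword buckets; tune them against real logs.
CATEGORY_HINTS = {
    "workflow": ("invalid workflow", "yaml", "unknown job"),
    "dependency": ("could not resolve", "lockfile", "404 not found"),
    "test": ("assertionerror", "test failed", "expected"),
}


def first_failed_job(jobs):
    """Return the earliest failed job, or None if nothing failed."""
    failed = [job for job in jobs if job.status == "failed"]
    return min(failed, key=lambda job: job.started_at) if failed else None


def classify(log_tail):
    """Bucket an error as workflow, dependency, test, or (by default) code."""
    text = log_tail.lower()
    for category, hints in CATEGORY_HINTS.items():
        if any(hint in text for hint in hints):
            return category
    return "code"
```

Later failures are often fallout from the first one, which is why the sketch sorts by start time instead of taking whatever job the UI lists first.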

Turning this straight into a big tool welds your wrong assumptions in place. The lighter move is to write a skill first: when to use it, minimum steps, what not to do, what to output.

A skill is not an encyclopedia. It is a sticky note — so the agent skips one dead end.

flowchart LR
  A[Session history] --> B[Repeated friction]
  B --> C[Small skill]
  C --> D[Run on real tasks]
  D --> E[Script only when repeated]
  E --> C

I ended up keeping only these:

agent-preflight: read the real repo state before starting, no assumptions
gitlab-mr-context: use glab api for issues, MRs, pipelines, notes — much more reliable
ci-first-failure: find the first real failure before touching code
path-verify: pick the smallest check from the changed files
release-deploy-preflight: confirm full SHA, workflow, artifact, health check before deploy
remote-health: check SSH, PATH, services, locks, and Tailscale on remote hosts first

The names are not cool. That is exactly why you know when to use them.

Skills first, scripts later
#

Another lesson: do not give every skill a scripts directory on day one.

Most workflows only need a SKILL.md. path-verify is not there to run your tests. It reminds the agent to pick the smallest check based on what files changed. Let it run with the agent on a few real tasks first. Automate later, once the pattern is confirmed.
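If path-verify ever does graduate to a script, the core could look roughly like this. It is a sketch under assumptions: the rule table, the `web/` directory, and the fallback label are hypothetical, and the commands are placeholders for whatever the repo actually uses.

```python
from pathlib import PurePosixPath

# Hypothetical rules, checked in order: (predicate, command or None).
RULES = [
    (lambda p: p.suffix == ".md", None),               # docs only: nothing to run
    (lambda p: p.parts[:1] == ("web",), "pnpm test"),  # assumed frontend directory
    (lambda p: p.suffix == ".py", "pytest -q"),
]
FALLBACK = "full test suite"


def smallest_checks(changed_files):
    """Map changed paths to the narrowest set of checks that covers them."""
    checks = []
    for raw in changed_files:
        path = PurePosixPath(raw)
        for matches, command in RULES:
            if matches(path):
                if command and command not in checks:
                    checks.append(command)
                break
        else:
            # No rule matched this file: do not guess, widen the check.
            return [FALLBACK]
    return checks
```

The fallback is the important design choice: an unknown file type widens verification instead of silently skipping it.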

Scripts are for one kind of thing: stuff that is definitely repeated, mechanical, and low risk.

This time I only added one — linking repo skills into the user skill directory:

scripts/link-user-skills.sh

And a PowerShell version for Windows:

.\scripts\link-user-skills.ps1

I tripped on one thing: symlink direction.

The correct direction is: real files in the repo, links in the user directory.

~/.agents/skills/glab -> /path/to/repo/skills/glab

That way the repo has real content, and local Codex can use it. Get it backwards and it is a mess — the repo only has a link to ~/.agents, GitHub gets nothing, and Git thinks the original files were deleted.
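The direction rule can be made explicit in code. The sketch below is not the linking script from the repo; `link_skills` and its arguments are hypothetical names, and it only shows the invariant: real directories stay in the repo, and the user directory receives links that point at them.

```python
import os
from pathlib import Path


def link_skills(repo_skills: Path, user_skills: Path) -> list[Path]:
    """Create user_skills/<name> -> repo_skills/<name> links, never the reverse."""
    user_skills.mkdir(parents=True, exist_ok=True)
    created = []
    for skill_dir in sorted(p for p in repo_skills.iterdir() if p.is_dir()):
        link = user_skills / skill_dir.name
        if link.is_symlink():
            link.unlink()      # refresh a stale link
        elif link.exists():
            continue           # a real directory lives here: leave it alone
        os.symlink(skill_dir.resolve(), link, target_is_directory=True)
        created.append(link)
    return created
```

Because the function only ever writes inside `user_skills`, it cannot turn the repo copy into a link by accident.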

Make it work across machines
#

I switch between macOS, Windows, and remote hosts constantly. If a skill only works on one machine, its value gets cut in half.

So after the local setup, I synced the repo to my-win and ran the same maintenance flow on Windows. The PowerShell script uses directory junctions, not symlinks — creating symlinks on Windows often fights with permissions, and junctions are enough for directories.

Tedious step. But without it, workflow refinement turns back into a one-machine trick.

How I think about it now
#

After this round, a few thoughts hardened.

Find repetition in sessions, not in your imagination. If git status, glab api, ssh, and pnpm test are actually frequent, start there. Do not invent a workflow governance framework nobody asked for.

Keep skills short. Each one plugs one gap. Its only job is to make the agent ask less, search less, guess less. Do not turn it into an encyclopedia.

Scripts do mechanical work — linking skills, collecting CI logs, checking remote health. Product judgment, risk boundaries, deployment decisions still need human confirmation, or at least explicit preflight.

Mistakes need to feed back. I got the symlink direction wrong at first. After fixing it, the lesson cannot just stay in the chat window. It goes into the script and the README. Otherwise I will step on the same rake again.

What remained
#

Not much:

a few short skills
one Bash linking script
one PowerShell linking script
one Windows sync check
one rule: real skills in the repo, links in the user directory

Good enough.

The more I use AI coding, the more I think the workflow is not about building a big platform. It is about removing the most annoying five minutes, over and over. Each pass makes the system a little lighter. When enough of these small rules pile up, the agent starts working in a real engineering environment instead of blazing a new trail from scratch every time.

Stopping Codex SQLite Log Growth with a Trigger

Sat, 20 Jun 2026 16:55:00 +0800

I am not a native English speaker; this article was translated by AI.

Codex recently started storing local logs in ~/.codex/logs_2.sqlite. On my machine that database had already grown past 1GB. The real problem was not the WAL file itself, but the log table: TRACE, DEBUG, and INFO rows kept going into SQLite, creating unnecessary disk usage and IO.

The public configuration surface is limited here. RUST_LOG can reduce verbosity, log_dir only controls the plaintext TUI log, and history.max_bytes only applies to history.jsonl. I could not find a public retention, max-size, or journal-mode option for logs_2.sqlite.

So I used SQLite itself as the stopgap.

Block new log rows with one trigger
#

sqlite3 ~/.codex/logs_2.sqlite "CREATE TRIGGER IF NOT EXISTS block_log_inserts BEFORE INSERT ON logs BEGIN SELECT RAISE(IGNORE); END;"

The trigger is intentionally blunt: whenever something tries to insert into the logs table, SQLite ignores that insert.

Verification is also simple:

sqlite3 ~/.codex/logs_2.sqlite "
SELECT count(*) FROM logs;
INSERT INTO logs(ts, ts_nanos, level, target, feedback_log_body, estimated_bytes)
VALUES(strftime('%s','now'), 0, 'INFO', 'trigger_test', 'should_not_exist', 1);
SELECT count(*) FROM logs;
SELECT count(*) FROM logs WHERE target='trigger_test';
"

If the row count stays the same and trigger_test is 0, the trigger is working.
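The same check can be rehearsed without touching the real database. This sketch uses Python's built-in sqlite3 module on an in-memory database; the three-column `logs` table is a simplified stand-in for illustration, not the actual Codex schema.

```python
import sqlite3


def trigger_blocks_inserts(conn):
    """Try one test insert and report whether the row count stayed the same."""
    before = conn.execute("SELECT count(*) FROM logs").fetchone()[0]
    conn.execute(
        "INSERT INTO logs(ts, level, target) "
        "VALUES (strftime('%s','now'), 'INFO', 'trigger_test')"
    )
    after = conn.execute("SELECT count(*) FROM logs").fetchone()[0]
    return before == after


# Toy database: one existing row, then the blocking trigger.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE logs (ts INTEGER, level TEXT, target TEXT)")
conn.execute("INSERT INTO logs VALUES (0, 'INFO', 'seed')")
conn.execute(
    "CREATE TRIGGER IF NOT EXISTS block_log_inserts "
    "BEFORE INSERT ON logs BEGIN SELECT RAISE(IGNORE); END;"
)
```

With the trigger in place `trigger_blocks_inserts(conn)` returns `True`; after dropping the trigger the same call returns `False`, because the test row is then stored.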

Windows PowerShell version
#

On Windows, the path is usually:

$db = Join-Path $env:USERPROFILE ".codex\logs_2.sqlite"
sqlite3 $db "CREATE TRIGGER IF NOT EXISTS block_log_inserts BEFORE INSERT ON logs BEGIN SELECT RAISE(IGNORE); END;"

I verified this on a remote Windows machine. The test insert did not change the row count:

trigger: block_log_inserts
before: 76387
after: 76387
trigger_test_rows: 0

Restore log writes
#

sqlite3 ~/.codex/logs_2.sqlite "DROP TRIGGER IF EXISTS block_log_inserts;"

PowerShell:

$db = Join-Path $env:USERPROFILE ".codex\logs_2.sqlite"
sqlite3 $db "DROP TRIGGER IF EXISTS block_log_inserts;"

Compact the old logs too
#

The trigger only blocks new rows. It does not shrink a database that has already grown. After quitting Codex, checkpoint the WAL, delete the old low-value rows, and run VACUUM:

sqlite3 ~/.codex/logs_2.sqlite "
PRAGMA wal_checkpoint(TRUNCATE);
DELETE FROM logs WHERE level IN ('TRACE','DEBUG');
DELETE FROM logs WHERE level = 'INFO' AND ts < strftime('%s','now','-3 days');
VACUUM;
"

If Codex is still running, SQLite may report database is locked. That is expected. Quit Codex and run it again.
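For anyone who prefers a script over a one-off shell command, here is a rough Python equivalent of the cleanup. `compact_logs` and `keep_info_days` are hypothetical names, and the sketch assumes `ts` holds epoch seconds, as the `strftime('%s', ...)` comparison in the SQL implies.

```python
import sqlite3
import time


def compact_logs(db_path, keep_info_days=3):
    """Drop TRACE/DEBUG rows and old INFO rows, then rebuild the file."""
    # isolation_level=None means autocommit; VACUUM refuses to run
    # while a transaction is open on the connection.
    conn = sqlite3.connect(db_path, isolation_level=None)
    try:
        cutoff = int(time.time()) - keep_info_days * 86400
        # fetchall() finishes the PRAGMA statement before VACUUM runs.
        conn.execute("PRAGMA wal_checkpoint(TRUNCATE)").fetchall()
        conn.execute("DELETE FROM logs WHERE level IN ('TRACE', 'DEBUG')")
        conn.execute(
            "DELETE FROM logs WHERE level = 'INFO' AND ts < ?", (cutoff,)
        )
        conn.execute("VACUUM")
        return conn.execute("SELECT count(*) FROM logs").fetchone()[0]
    finally:
        conn.close()
```

The function returns the number of rows that survived, which makes it easy to log a before and after count.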

The limit of this trick
#

This does not fix Codex’s logging system. It is just local damage control.

The upside is that it needs no Codex patch, no release wait, and no background cleanup daemon. The downside is that logs_2.sqlite will no longer contain new local logs, which makes local debugging weaker. When you need logs, drop the trigger, reproduce the issue, then add the trigger again.

Long term, Codex should expose log database retention or max-size settings. Until then, one SQLite trigger is enough.

ACE Broke, So I Rewrote It: From ace-wrapper to fast-context

Mon, 01 Jun 2026 10:35:09 +0800

I am not a native English speaker; this article was translated by AI.

In the last post I wrote about ace-wrapper: a shell command around ACE (Augment Context Engine) filesystem context search, so the agent could start with semantic retrieval when keywords were fuzzy, then decide which files to read.

Then ACE got flaky.

API keys expired in new ways. Free-tier quota became harder to rely on. And one by one, the relay services died.

I do not really blame anyone. It was a preview feature to begin with. The problem was that semantic search had already become part of my daily workflow: dozens of ace calls in one session. Take it away and the agent is back to guessing keywords.

So I switched approaches:

ferstar/fast-context

This time I skipped third-party APIs and talked directly to Windsurf’s SWE-grep protocol — the same semantic search backend used by Codex CLI and Windsurf IDE — while layering a local Semble cache on top as fallback.

The structural difference from ace-wrapper
#

ace-wrapper was pure remote: local only sent parameters, everything depended on the ACE service.

fast-context is local and remote working together.

flowchart TB
  subgraph Input
    Q[User query]
  end

  subgraph Local
    S[Semble local prefetch<br/>cached index + chunk search]
    A[Lexical anchors<br/>filename / path / literal hits]
    R["Repo map<br/>(auto-shrink when too large)"]
  end

  subgraph Remote
    WS[Windsurf SWE-grep<br/>agentic verify + expand]
  end

  subgraph Output
    O["Candidate files<br/>line ranges<br/>follow-up terms<br/>(or local chunks when remote fails)"]
  end
  end

  Q --> S
  Q --> A
  Q --> R
  S --> WS
  A --> WS
  R --> WS
  WS -- success --> O
  WS -- auth / rate-limit / timeout --> O
  S -- fallback path --> O

The flow:

Run Semble locally first — cached index + chunk search, sub-second
Collect local lexical anchors — exact filename, path segment, and literal content matches from the query
Generate a repo map — directory tree, auto-shrunk when it gets too large
Feed all three to Windsurf — Semble chunks as hints, lexical anchors as pinpoints, repo map as path context
Windsurf verifies and expands using rg/readfile/tree/ls/glob — agentic tool-call loop
When the remote path fails, return local Semble results — no empty hands, no blocked workflow

That “no empty hands” property matters more than it sounds. With ace-wrapper and ACE, once the service went down, that search was simply gone. Now, when the remote path fails, local cache still returns chunk-level candidates. Lower quality, yes, but the workflow does not just die there.
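Stripped of protocol details, the degradation path is a small piece of control flow. The sketch below is illustrative only: `hybrid_search`, `RemoteUnavailable`, and the two callables are hypothetical stand-ins, not names from fast-context.

```python
class RemoteUnavailable(Exception):
    """Stand-in for auth, rate-limit, or transport failures on the remote path."""


def hybrid_search(query, local_search, remote_verify):
    """Run local search first, let the remote refine it, fall back to local."""
    local_hits = local_search(query)
    try:
        refined = remote_verify(query, local_hits)
    except (RemoteUnavailable, TimeoutError):
        refined = None
    if refined:
        return {"source": "hybrid", "results": refined}
    # Remote failed or returned nothing: degrade, but never return empty-handed.
    return {"source": "local", "results": local_hits}
```

Tagging the result with its source lets the caller know when it is looking at degraded, local-only candidates.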

Reverse-engineering SWE-grep
#

Windsurf’s SWE-grep uses Connect-RPC + Protobuf, which is nothing like a normal REST API.

The trickiest part was the Connect framing. Every RPC frame has a 5-byte header (1 flag byte + 4 big-endian length bytes). On top of that, the protocol requires a Connect-Connect frame before the actual payload.
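The 5-byte header is easy to get wrong by one byte, so a small sketch helps. It covers only the envelope described here (1 flag byte plus a 4-byte big-endian length) and nothing Windsurf-specific. The `0x02` end-of-stream flag is taken from the public Connect protocol description; whether Windsurf uses it the same way is an assumption.

```python
import struct

END_STREAM_FLAG = 0x02  # from the public Connect spec; assumed to apply here


def encode_frame(payload: bytes, flags: int = 0) -> bytes:
    """Prefix a payload with 1 flag byte and a 4-byte big-endian length."""
    return struct.pack(">BI", flags, len(payload)) + payload


def decode_frames(data: bytes):
    """Split a buffer into (flags, payload) frames plus any unread remainder."""
    frames = []
    offset = 0
    while offset + 5 <= len(data):
        flags, length = struct.unpack_from(">BI", data, offset)
        start = offset + 5
        end = start + length
        if end > len(data):
            break  # partial frame: wait for more bytes
        frames.append((flags, data[start:end]))
        offset = end
    return frames, data[offset:]
```

Returning the remainder matters for streaming responses, where a network read can end in the middle of a frame.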

The Protobuf side was worse. Windsurf uses a custom proto schema with no public definition. The field numbers in core data structures had to be inferred from packet captures and known Wireshark decryption configs — call chains look like {1: name, 2: args, 3: id}, variable definitions like {1: name, 2: type, 3: value}. Guess wrong and the whole request fails, with no useful error message.

The encoder looks like this (ProtobufEncoder):

class ProtobufEncoder:
    """Manual protobuf encoder, matching the Windsurf wire format exactly."""
    def __init__(self) -> None:
        self.buf = bytearray()

    def _varint(self, value: int) -> bytes:
        parts: list[int] = []
        while value > 0x7F:
            parts.append((value & 0x7F) | 0x80)
            value >>= 7
        parts.append(value & 0x7F)
        return bytes(parts)

    def _tag(self, field: int, wire: int) -> bytes:
        return self._varint((field << 3) | wire)
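To see what `_varint` and `_tag` produce on the wire, here is a standalone worked example using the standard protobuf encoding rules. The helper names are hypothetical and the field numbers are arbitrary; they are not Windsurf's schema.

```python
def varint(value: int) -> bytes:
    """Encode a non-negative integer as a protobuf varint."""
    out = bytearray()
    while value > 0x7F:
        out.append((value & 0x7F) | 0x80)  # low 7 bits, continuation bit set
        value >>= 7
    out.append(value)
    return bytes(out)


def tag(field: int, wire: int) -> bytes:
    """A tag is the field number shifted left by 3, OR-ed with the wire type."""
    return varint((field << 3) | wire)


def string_field(field: int, text: str) -> bytes:
    """Wire type 2: tag, then byte length, then the UTF-8 bytes."""
    data = text.encode("utf-8")
    return tag(field, 2) + varint(len(data)) + data


def varint_field(field: int, value: int) -> bytes:
    """Wire type 0: tag, then the value as a varint."""
    return tag(field, 0) + varint(value)
```

For example, `string_field(1, "rg")` yields the bytes `0a 02 72 67`: tag `0x0a` is field 1 with wire type 2, followed by length 2 and the two characters.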

Decoding Windsurf’s streaming response is the same story — split frames, read payloads, find the stream-end marker, reconstruct the result. It is much more work than calling a REST API, but the upside is clear: no intermediary dependency, just a direct path to Windsurf’s backend.

Why local Semble cache works
#

Before adding Semble I did wonder: does local indexing really help?

Once I ran the benchmark, there was no suspense left.

I ran 40 labeled queries across two repos (fastapi and axios):

Backend	NDCG@10	Recall@10	Top-1	Batch p50
local (Semble only)	0.854	0.946	0.775	30 ms
remote (Windsurf only)	0.453	0.467	0.450	24.4 s
hybrid (Semble + Windsurf)	0.890	0.979	0.825	28.3 s

Local Semble alone hit 94.6% recall with 30 ms p50 latency. Windsurf alone underperformed — only a 52.5% success rate, with the rest lost to throttling or resource_exhausted.

Hybrid mode runs Windsurf after Semble for verification and expansion. NDCG@10 rose from 0.854 to 0.890 and Recall@10 from 94.6% to 97.9%, at the cost of a 28.3 s batch p50.
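For readers unfamiliar with the two headline metrics, this is one common way to compute them. It is a sketch assuming binary relevance labels and no duplicate entries in the ranked list; the benchmark runner's exact definitions are not shown in this post and may differ.

```python
import math


def recall_at_k(ranked, relevant, k=10):
    """Fraction of the labeled relevant items that appear in the top k."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & relevant) / len(relevant)


def ndcg_at_k(ranked, relevant, k=10):
    """Binary-relevance NDCG: hits are discounted by log2 of their position."""
    relevant = set(relevant)
    dcg = sum(
        1.0 / math.log2(position + 2)
        for position, item in enumerate(ranked[:k])
        if item in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(position + 2) for position in range(ideal_hits))
    return dcg / idcg if idcg else 0.0
```

Recall only asks whether the right files showed up; NDCG also rewards putting them near the top, which is why a backend can score well on one and less well on the other.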

Two things became clear:

Local cache is not a backup; it is the first line of defense. It handles most common searches in 30 ms. When the remote is down, it is a degradation path, not a dead end.
Windsurf’s value is in verification, not first-pass search. Asking it to search from scratch risks timeout and throttling. Give it Semble chunk candidates and exact keyword anchors, and it only has to confirm — which succeeds far more often.

Credential handling got more involved
#

ace-wrapper just needed an API key. fast-context uses Windsurf’s session token, stored in state.vscdb (a SQLite database).

The extraction logic lives in extract_key.py:

Query ItemTable for key='windsurf.api_key'
→ if found, return it
→ if not, search for rows whose key contains 'devin-session-token'
→ either format works
→ can also override with WINDSURF_API_KEY env var
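The lookup order above translates to a short function. This is a sketch, not the contents of extract_key.py: it assumes `ItemTable` has `key` and `value` columns, checks the environment override first (an ordering assumption), and returns the stored value as-is without any further parsing.

```python
import os
import sqlite3


def extract_key(db_path, env=None):
    """Return a credential from the env override or from state.vscdb, else None."""
    env = os.environ if env is None else env
    override = env.get("WINDSURF_API_KEY")
    if override:
        return override

    conn = sqlite3.connect(db_path)
    try:
        row = conn.execute(
            "SELECT value FROM ItemTable WHERE key = ?", ("windsurf.api_key",)
        ).fetchone()
        if row is None:
            # Second pattern: newer session-style credentials.
            row = conn.execute(
                "SELECT value FROM ItemTable WHERE key LIKE ? LIMIT 1",
                ("%devin-session-token%",),
            ).fetchone()
    finally:
        conn.close()

    if row is None:
        return None
    value = row[0]
    return value.decode("utf-8") if isinstance(value, bytes) else value
```

Adding a third credential format later means adding one more query in the fallback position, without touching the callers.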

Why support two formats? Because Windsurf keeps changing. Earlier versions used standard API keys; newer ones moved to session-style credentials like devin-session-token$.... If the tool does not adapt, it breaks as soon as the user upgrades their IDE.

The current workflow
#

During the ace-wrapper era, my AGENTS.md looked like this:

Use ace for semantic retrieval → read files → confirm evidence with rg

Now it reads:

Use fast-context search (default hybrid) for candidates + line ranges
If hybrid times out or returns nothing, try fast-context local-search
If a chunk candidate looks promising, use fast-context find-related
After reading files, confirm exact evidence with rg/ast-grep

There are more branches now, but each one has a clear fallback.

On the remote side, there is a model fallback chain too:

Default: MODEL_SWE_1_6_FAST
On resource_exhausted or rate-limit: auto-degrade to MODEL_SWE_1_5
Custom fallback order via WS_FALLBACK_MODELS
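A fallback chain like this is a loop over model names. In the sketch below, the two model identifiers and the `WS_FALLBACK_MODELS` variable come from the list above; everything else is assumed, including the comma-separated format, the idea that the variable replaces the whole chain, and the `ResourceExhausted` exception standing in for the real error.

```python
import os

DEFAULT_CHAIN = ["MODEL_SWE_1_6_FAST", "MODEL_SWE_1_5"]


class ResourceExhausted(Exception):
    """Stand-in for a resource_exhausted or rate-limit response."""


def model_chain(env=None):
    """Read a comma-separated override, else use the default order."""
    env = os.environ if env is None else env
    raw = env.get("WS_FALLBACK_MODELS", "")
    custom = [name.strip() for name in raw.split(",") if name.strip()]
    return custom or list(DEFAULT_CHAIN)


def search_with_fallback(query, call_model, env=None):
    """Try each model in order; move to the next one only when throttled."""
    last_error = None
    for model in model_chain(env):
        try:
            return model, call_model(model, query)
        except ResourceExhausted as exc:
            last_error = exc
    raise last_error
```

Only throttling triggers the next model; any other exception propagates immediately, so a real bug is not hidden behind a retry.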

Benchmark results
#

With the fair runner (completion-based cooldown, 40 queries):

Hybrid non-empty output rate: 100% — all 40 queries returned useful results
Remote non-empty output rate: only 50% — the other half timed out or got throttled
Local: zero failures, 100% non-empty output, p50 latency 30 ms

If the workflow depended purely on remote semantic search, half the queries could get no answer during peak hours. With local Semble backing, the worst case is degraded local chunks, not an empty result.

What I do not want to tear down and rebuild again
#

A few design choices that held up well:

Always keep a degradation path. Every remote dependency needs a local fallback. I already paid for ignoring that once.
Pure Python is easier to maintain. ace-wrapper was Python too, but this project grew from a few hundred lines to more than two thousand — protobuf encoder, Connect framing, Semble adapter, benchmark runner. Clear structure matters more than language choice. Python just happens to be what I work fastest in.
Benchmarks should live with the code. The benchmarks/ directory with 40 labeled queries and a runner shows the real gap between backends on every run. Optimization without data is mostly guessing.
Credential extraction should auto-adapt. The devin-session-token format was unexpected, but the code structure made it easy — try another pattern when the first key is not found, without touching the main flow.

Wrapping up
#

I still use ace-wrapper sometimes — ACE does come back to life once in a while. But I no longer want my workflow tied to it.

The core idea behind fast-context is simple: let local cache carry the baseline, and use the remote path for verification and expansion. Once the upstream gets shaky, a purely remote solution turns into a rope with no backup.

If you have hit the same wall, the code is here: ferstar/fast-context

Putting Semantic Search into an AI Coding Harness: Notes on Open-Sourcing ace-wrapper

Sat, 09 May 2026 14:38:00 +0800

I am not a native English speaker; this article was translated by AI.

In the previous post about Harness Engineering, I compressed my default AI coding workflow into a few steps:

Read
Search
Change
Verify
Record

The easiest one to underestimate is Search.

Many agents fail not because they cannot edit code, but because they read the wrong place first. The user describes a behavior, a bug, or a cross-layer workflow, while the code may not contain a function with the same name. Running rg login, rg upload, or rg session is fast, but it only works when the keyword is already known. If the keyword is unknown, speed just helps the agent drift faster.

So I open-sourced a small layer I have been using recently:

ferstar/ace-wrapper

It does one narrow thing: wrap Augment Context Engine’s filesystem context search as an ace command, so coding agents can run semantic retrieval from the shell before editing.

Why this layer exists
#

The target is concrete: make the search action part of the harness.

I used to see this path often:

flowchart LR
  A[User describes behavior] --> B[Agent guesses keywords]
  B --> C[Reads nearby files]
  C --> D[Edits plausible code]
  D --> E[Verification fails]
  E --> B

The problem with this loop is that, after failure, the agent often keeps circling around the same wrong files. It can edit code; what it needs is a better entry point into candidate files. Put less politely, it is working hard after entering the wrong door.

ace-wrapper is meant to patch this part:

flowchart LR
  A[User describes behavior] --> B[ace semantic retrieval]
  B --> C[Candidate files]
  C --> D[Read returned files]
  D --> E[rg / tests confirm evidence]
  E --> F[Small patch]
  F --> G[Verify]

The important part is the order: ace only finds candidate files. Conclusions still require reading files, exact search, and tests. It is not an answer generator; it just helps the agent waste fewer steps.

Usage is short
#

Install it:

uv tool install ace-wrapper

Install a local development checkout:

uv tool install /path/to/ace-wrapper

Search for a workflow when the exact keyword is unknown:

timeout 60s ace "user uploads an unsupported file and should see skipped-file feedback" -w /repo
rg -n "unsupported|skipped|upload|file" /repo

The first command answers “which files may be relevant.” The second command confirms “which identifiers, events, copy, or tests actually exist in the code.”

I usually put this rule into a project’s AGENTS.md:

Use `timeout 60s ace "<query>" -w <workspace>` for semantic codebase discovery.
Treat `ace` results as candidate files.
After it returns results, read the relevant files and use exact search before using them as evidence.

These lines work better than “read more context,” because they give the agent a concrete action and a boundary against false conclusions.

How it works with rg
#

ace and rg work better as consecutive steps.

Scenario	Use first	Why
You know the behavior but not the implementation location	`ace`	Behavior descriptions can find candidate entry points across files and naming styles
You know the function name, event name, or error text	`rg`	It is exact, complete, and enumerable
You need a structural refactor	`ast-grep`	AST-level matching is needed; textual proximity falls short
You need to confirm whether a feature exists	`ace` + read files + `rg`	A semantic hit cannot prove the feature exists

I intentionally wrote this boundary into the README: ACE returns candidate files, while evidence still has to come from code and tests. That boundary matters.

Semantic retrieval returns “nearby” things. If you ask about a feature that does not exist, it may still find files that look related. If an agent treats “there are results” as “the feature exists,” it starts inventing a story. A conclusion is only defensible after reading an implementation, test, route, config, or call site.
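The "candidates are not evidence" rule can be enforced mechanically. The sketch below is a pure-Python stand-in for the `rg` step, with a hypothetical function name: it takes candidate paths from a semantic search and keeps only the files where an exact pattern actually occurs.

```python
import re
from pathlib import Path


def confirm_candidates(candidates, pattern):
    """Return {path: [(line_number, line), ...]} for files with exact matches."""
    regex = re.compile(pattern)
    evidence = {}
    for path in map(Path, candidates):
        if not path.is_file():
            continue  # a candidate that does not exist proves nothing
        text = path.read_text(encoding="utf-8", errors="replace")
        hits = [
            (number, line.rstrip())
            for number, line in enumerate(text.splitlines(), start=1)
            if regex.search(line)
        ]
        if hits:
            evidence[str(path)] = hits
    return evidence
```

An empty result is a useful signal in itself: the semantic search found something nearby, but nothing in those files confirms the feature exists.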

Where it fits in Harness Engineering
#

ace-wrapper is small, and I want it to stay that way. It is closer to a small gear in the harness: it turns open-ended code discovery into a repeatable, constrained command.

I now prefer this project rule:

Read -> Search -> Change -> Verify

Here, Search means choosing the tool by problem type:

Open-ended behavior and cross-layer workflows: use ace first
Exact identifiers, errors, routes, and config keys: use rg
Structural replacements: use ast-grep
External strategy and industry practice: use web research
Old decisions and repeated lessons: use memory

The useful part of this split is reduced agent randomness. The agent first uses semantic retrieval to narrow the reading surface, then uses deterministic tools to confirm facts, and only then changes code. The order is a little more verbose, but it is much cheaper than confidently editing the wrong file.

The prompt matters most
#

A good ace query describes behavior and avoids keyword piles:

timeout 60s ace "frontend sends requestId to backend and starts a processing job" -w /repo
timeout 60s ace "after the user drags in an unsupported file, a skipped-file notice should be shown" -w /repo
timeout 60s ace "how provider config is persisted and restored after app restart" -w /repo

I try to include four kinds of information:

User action: click, drag, upload, stop generation
Runtime boundary: frontend to backend, CLI handler to core service
Expected effect: persist config, abort loop, show skipped-file feedback
Known fields: sessionId, requestId, files, workspace

This is much more stable than only searching upload or provider. It lets the retrieval system look for behavior and data flow, and it reminds the agent that this step is still semantic retrieval, not evidence by itself.

Why I open-sourced it
#

ace-wrapper has very little code. The core is just FileSystemContext.create(str(workspace)) plus context.search(args.query). I wanted to preserve the workflow constraints around those few lines:

If the keyword is unknown, start with semantic retrieval
Ask one workflow per query
Treat results as candidate files
Read the files, then use rg to confirm exact evidence
Do not conclude without evidence

Once these rules live in the tool README, skill, and agent prompt, they become much more likely to stick. Otherwise every session depends on a human reminding the agent again, which gets old fast.

The previous post said Harness Engineering means putting an engineering track around AI. ace-wrapper is one small piece of that track: it does not make the agent better at writing code; it just helps the agent read the right place first.

From Vibe Coding to Harness Engineering: How My AI Coding Workflow Changed

Sat, 09 May 2026 14:19:00 +0800

I am not a native English speaker; this article was translated by AI.

This is the written version of an internal team sharing session. The slides are here:

From Vibe Coding to Harness Engineering

For a while I kept looking at one question: can AI really take over most of the coding work?

The answer is mostly settled now. When the project context, quality gates, and verification flow are in place, AI-generated code can enter the engineering workflow reliably. Human time moves from “typing the code” to “holding the line”: breaking down requirements, judging architecture, arranging context, checking boundaries, and handling failures.

Recent practice pushed this one step further. The question is no longer how to make the prompt prettier. It is whether the whole workflow can survive long-running tasks. I have stepped on this rake a few times, especially when I open the laptop in the morning, see that the agent ran all night, and still cannot tell which diff should be kept.

What changed
#

Early Vibe Coding solved the entry problem: describe the requirement clearly, put project rules into AGENTS.md / CLAUDE.md, and let tests, lint, and review catch the model output.

That setup is still useful, but it is closer to single-task engineering. Once a task gets longer, a few problems start showing up:

Context keeps growing until the model loses the important part
Repeated retries can push the fix further away from the real issue
Without external references, strategy becomes guesswork
After many rounds, it is hard to tell which changes should be kept
User rejection, permission blocks, and empty output need explicit stop semantics

So I now prefer calling this layer Harness Engineering: put an engineering track around AI so tasks are executable, results are verifiable, and failures are recoverable. The name sounds a bit grand. In practice, it just means trusting “it will figure it out” a little less and adding a few guardrails.

flowchart LR
  A[Task scope] --> B[Context route]
  B --> C[Agent loop]
  C --> D[Verification gate]
  D --> E[Recovery / memory]
  D -->|failed| F[Patch harness]
  F --> C

The four things I manage first
#

The first thing is task boundaries.

Before a medium-sized task starts, I want at least four things written down: `done when`, `out of scope`, the change surface, and the verification command. This does not need to be a long document. Five lines are often enough. The point is to let the executor know when to stop, instead of drifting into “while I am here” changes.

The second thing is context routing.

AGENTS.md should not become an encyclopedia. It works better as an index: project rules, entry points, verification commands, things that must not be touched, and where to read the next layer of docs. Long context should be opened on demand, not dumped into the session. When the context gets too full, the model behaves a bit like me with too many browser tabs open: it looks busy, but the focus is gone.

The third thing is the verification loop.

My default order is now:

Read: read README, AGENTS, older notes, and key implementation files
Search: use ace, rg, ast-grep, nmem, and Exa to find evidence
Change: apply a small patch and avoid drive-by refactors
Verify: run narrow checks first, then expand by risk
Record: write repeated lessons back into rules, tests, or memory

This order is boring in a good way. Reading and searching first reduce model guesswork. Narrow verification avoids one giant change where nobody knows which step broke.

The fourth thing is failure handling.

After a failure, I classify it first: stop, retry, patch the harness, or record it.

Type	When to use it	Handling
Stop	User rejection, permission block, side effect risk, repeated spinning	Break the loop and return control
Retry	Network jitter, fixable parameter, read failure without side effects	Retry in small steps and keep logs
Patch	Same class of error appears twice	Add tests, rules, scripts, or logs
Record	The case will likely happen again	Save trigger conditions, verification commands, and evidence entry points

I used to treat many failures as “try again.” Now I am more careful: only retry failures that are actually retryable, and stop when the situation says stop. Letting an agent push forward from a wrong premise usually just creates more diff for a human to clean up.
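The table reads naturally as a small decision function. This is one possible encoding, not a rule set from the talk: the `Failure` fields, the kind names, and the precedence (stop conditions first, then repeats, then retryable kinds) are all assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    STOP = "stop"
    RETRY = "retry"
    PATCH = "patch"
    RECORD = "record"


@dataclass
class Failure:
    kind: str
    has_side_effects: bool = False
    seen_before: int = 0          # earlier failures of the same class
    likely_to_recur: bool = False


STOP_KINDS = {"user_rejection", "permission_block"}
RETRY_KINDS = {"network_jitter", "bad_parameter", "read_failure"}


def classify_failure(failure: Failure) -> Action:
    """Pick a handling action; stop conditions always win."""
    if failure.kind in STOP_KINDS or failure.has_side_effects:
        return Action.STOP
    if failure.seen_before >= 1:
        return Action.PATCH       # same class of error has now appeared twice
    if failure.kind in RETRY_KINDS:
        return Action.RETRY
    # Unknown failures are not retried blindly.
    return Action.RECORD if failure.likely_to_recur else Action.STOP
```

Putting the stop check first encodes the main lesson: a retryable-looking error with side-effect risk still hands control back to the human.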

Where external research fits
#

In this workflow, Exa or similar web search tools also have a clearer place.

I usually do not search for broad trends. I search for concrete engineering questions:

What timeout should be used?
Should this failure be retried?
How should the default strategy be split?
What boundaries do mainstream tools provide?
What failure samples show up in real issues?

I still do not copy external answers directly. External material gives me a reference frame, and the final decision has to fit the current repo. Useful conclusions should land in specs, project rules, tests, or scripts. Otherwise I will search for the same thing again next time, which is a very small but reliable way to waste time.

Autoresearch and Ralph Loop
#

Autoresearch works best for long loops with a clear metric. Give the agent a goal, a guard, and a verification command first. Each round should allow only one rollback-friendly change. If it drifts, the damage is still contained.

I currently treat Ralph Loop as persistent single-owner execution. The same owner keeps driving the work. PRD and test spec come first, then the agent runs the long task. It cares more about preserving context, judgment, and verification clues than about adding more agents early. Fewer people in the loop can sometimes make ownership much clearer.

Both patterns share the same idea: define the track before letting the agent run. The track needs metrics, boundaries, verification, and rules for what to keep or discard.

Three steps worth copying first
#

If this needs to move into a team workflow, I would not start with platform work. Three steps are enough to copy tomorrow:

Write `done when` and `out of scope` for every medium-sized task
Ask the agent to list files, evidence, and the change surface before allowing edits
After one failure, patch tests, rules, or scripts before letting the agent continue

Once these three steps are in place, AI coding moves a bit from “it can produce output” toward “it can be shipped.” Autoresearch, Ralph Loop, team workers, and memory become easier to reason about after that.
