We cut VoiceFox's end-to-end dictation time from about 2.4 seconds down to 1.0 to 1.5 seconds. Here is what changed. VoiceFox is simple when it works well. Hold a hotkey, say a sentence, let go, and the words appear wherever your cursor is. No cloud round trip. No meeting transcript service. Just local dictation on an M-series Mac.

When it feels slow, though, the magic breaks. A few months ago VoiceFox was closer to 2.4 seconds end to end. That does not sound catastrophic until you try to dictate into a text box and wait. The pause is just long enough for your brain to ask, "Did it catch that?"

The fix was not one clever optimization. It was a bunch of small measurements, a few autoresearch runs, and one important lesson: the benchmark has to protect the thing users actually care about.

A quick word on autoresearch

Autoresearch is what happens when you point an AI coding agent at a piece of code, hand it a benchmark that prints a number, and let it run a tight loop. Read the code. Try a small patch. Run the benchmark. Keep the patch if the number got better, revert it if it did not. The loop runs inside a fence that says exactly which files the agent may touch.

Andrej Karpathy created and named this pattern in his autoresearch repo, which builds on his nanochat. In Karpathy's setup the target is a single-GPU language model: the agent edits a training script that contains the GPT model, the Muon plus AdamW optimizer, and the training loop, then trains for five minutes and is judged on validation bits per byte. The whole point is autonomous ML research, an agent searching the design space of a real, if small, language model. Our use of autoresearch borrows the loop pattern but points it at a different target: instead of bringing a tiny GPT's training loss down, the agent shaves milliseconds off a local dictation app and has to keep transcription accuracy intact.

The agent is interchangeable. This round was Codex, because that was what we had wired up. The same loop works with Claude Code. It is the read, patch, evaluate, keep-or-revert pattern, plus the gate, that does the actual work.
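The whole pattern fits in a few lines. Here is a minimal sketch of the keep-or-revert loop, with the benchmark and the gate passed in as plain functions. The names and the config shape are illustrative, not VoiceFox's actual code:

```python
def autoresearch_loop(patches, evaluate, gate_ok, baseline):
    """Read-patch-evaluate-keep/revert loop, sketched as pure functions.

    patches:  iterable of callables, each producing a new config from the best one
    evaluate: config -> score in ms (lower is better)
    gate_ok:  config -> bool; a config that fails the gate can never win
    """
    best_config, best_score = baseline, evaluate(baseline)
    kept = []
    for patch in patches:
        config = patch(best_config)      # try a small change on top of the current best
        if not gate_ok(config):          # gate first: a broken result is discarded outright
            continue
        score = evaluate(config)
        if score < best_score:           # keep only strict improvements
            best_config, best_score = config, score
            kept.append((config, score))
    return best_config, best_score, kept
```

The gate check runs before the score comparison on purpose: a patch that fails quality is reverted even if it would have produced the fastest number of the run.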

What VoiceFox is

VoiceFox is push-to-talk dictation for Apple Silicon Macs. The model is Parakeet, NVIDIA's open-source speech recognition model, running locally through an MLX port. MLX is Apple's open-source machine learning framework built for Apple Silicon, so the model runs on the Mac's GPU and unified memory rather than on the CPU alone or through a cloud round trip. Around that model is a practical little workflow: capture audio, transcribe it, clean up custom vocabulary, put the text on the clipboard, paste it into the active app, and then get out of the way.

VoiceFox is also intentionally a small script. That made it a good place to try autoresearch because the loop had one main file to work on, voicefox.py, and we could measure one layer at a time.

VoiceFox workflow

Six steps from speech to text

The app feels simple because the working path is short. Every latency fix had to respect this path.

  • Hotkey: start and stop capture
  • Audio: record local input
  • Model: run Parakeet through MLX
  • Decode: turn model output into words
  • Cleanup: apply vocabulary rules
  • Paste: place text in the active app
The workflow gave us a map: fix one layer, then prove the rest still works.

First, measure the waiting

The first pass was not fancy. We added profiling and asked a boring question: where does the time go?
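A per-stage timer is enough to answer that question. A minimal sketch of the kind of profiling involved; the stage names are illustrative, not VoiceFox's actual ones:

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(label: str, sink: dict):
    """Record wall-clock milliseconds for one stage of the dictation path."""
    t0 = time.perf_counter()
    try:
        yield
    finally:
        sink[label] = (time.perf_counter() - t0) * 1000.0

# Wrap each stage of the pipeline, then print the slowest first.
timings = {}
with timed("capture", timings):
    time.sleep(0.01)   # stand-in for the real capture work
with timed("decode", timings):
    time.sleep(0.02)   # stand-in for model inference and decoding
for stage, ms in sorted(timings.items(), key=lambda kv: -kv[1]):
    print(f"{stage}: {ms:.1f} ms")
```

The output is a ranked list of where the milliseconds go, which is exactly the map an optimization loop needs before touching anything.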

That audit found the kind of delays that hide in small apps. VoiceFox was writing captured audio to a temporary WAV file and reading it back before inference. It had fixed sleeps in the paste path. It had capture settings that were fine for correctness but not tuned for speed. It had decode settings that were conservative because nobody had proved a lower cap was safe.

  • Direct audio processing, about 330 ms saved: VoiceFox no longer writes every capture to a temporary WAV file before feeding MLX.
  • Low-latency capture, about 50 to 100 ms saved: recording overhead dropped without changing the interaction model.
  • Paste timing, about 160 ms saved: fixed sleeps and clipboard sequencing were trimmed after profiling showed the floor.
  • Decode controls, about 50 to 100 ms saved: Parakeet decode caps were tuned only after a real-voice accuracy gate existed.
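The direct-audio change has a simple shape: keep the captured PCM in memory and convert it straight to the float samples the model wants, instead of writing a WAV file and reading it back. A stdlib-only sketch of that handoff; VoiceFox's real path feeds MLX arrays, so the details differ:

```python
import struct

def frames_to_float_samples(frames: list) -> list:
    """Convert raw 16-bit mono PCM capture frames straight to floats in
    [-1.0, 1.0), skipping the temporary-WAV round trip entirely."""
    raw = b"".join(frames)
    count = len(raw) // 2                    # two bytes per int16 sample
    ints = struct.unpack(f"<{count}h", raw)  # little-endian signed 16-bit
    return [s / 32768.0 for s in ints]
```

The saved 330 ms is not the conversion itself, which is cheap either way; it is the file-system round trip that disappears.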

Those changes account for roughly 540 to 590 ms of the improvement. Good start. But this is also where you can fool yourself. A latency benchmark will happily reward you for deleting safety. VoiceFox needed to get faster, but it still had to paste reliably and transcribe accurately.

Beginner view

Autoresearch is a loop, not a magic button

TaskFox, our agent skills factory, packages the boundary: what Codex or Claude Code may change, what test to run, and when to keep or revert the patch.

[Autoresearch loop diagram: a four-step numbered loop. 01 Read the code. 02 Try a small patch. 03 Evaluate, with a gate on speed plus quality. 04 Keep or revert based on the gate.]

The important part is the boundary. The loop can move quickly because the gate decides what counts as a win.

Then we let the loop try small changes

The first autoresearch loop went after post-submit latency: the time after VoiceFox has text and before that text lands in the active app. This was a nice starter problem. It was narrow, fast to measure, and mostly isolated from speech recognition quality.

Actual loop data

Autoresearch progress: 8 experiments, 8 kept improvements

Charted in the spirit of Karpathy's autoresearch progress plot. Each green dot is a kept row. The step line is the running best. Lower is better.

[Post-submit latency progress chart: step line of post-submit latency in milliseconds across eight kept experiments, falling from 202.036 ms at experiment one to 66.670 ms at experiment eight. Post-submit latency, autoresearch run, April 28, 2026. Lower is better.]

  • 202.0 ms · trim_clipboard_settle_delay
  • 183.4 ms · trim_post_paste_settle_delay
  • 158.5 ms · trim_hotkey_settle_delay
  • 155.1 ms · trim_post_paste_settle_delay
  • 145.3 ms · remove_clipboard_settle_sleep
  • 102.5 ms · trim_hotkey_settle_delay_to_30ms
  • 77.7 ms · trim_hotkey_settle_delay_to_10ms
  • 66.7 ms · remove_hotkey_settle_sleep
Source: workspace/autoresearch-voicefox-post-submit-latency/results.tsv in taskfox-factory. Every row in this run advanced, so there are no discarded points to plot. The accuracy-gated STT loop later in the article had the inverse shape: 50 experiments, only two kept.

That run used a Codex backend with gpt-5.4-mini at low effort. The agent kept making tiny changes and measuring them. The baseline row was 202.0 ms. The best row was 66.7 ms after eight rows, a 67.0 percent reduction on that specific metric.

That was exciting, but it also exposed the first trap: a benchmark can be too narrow. The post-submit loop could find real paste-path wins. It could not tell us whether a faster decoder was cutting off words. For that, we needed to move closer to the model.

The next benchmark was too easy to trust

The next loop targeted the real speech-to-text path. We started with one real audio clip because it was fast and convenient. Codex gpt-5.4-mini at high effort did the first big search. It reached the requested 50 experiment rows and found a promising stack: faster audio handoff, a reused decoding config, a lower symbol cap, and a few small cleanup wins around the model call.

Then we handed the same lane to stronger model runs. gpt-5.5 at medium effort extended the loop past row 50. It did not beat the current best. A final gpt-5.5 xhigh pass tried deeper MLX ideas. None advanced. One crashed and was reverted.

Single-fixture loop: baseline median 233.434 ms, best loop median 178.041 ms, 65 experiment rows plus baseline, 6 advances, 57 discards, 2 crashes.

That looked like a clean win. Then confirmation made it less tidy, which is exactly what confirmation is for. Pre-review reruns showed a 216.994 ms median, about 7.0 percent better than the 233.434 ms baseline. Post-review reruns were noisier at 225.608 ms, while a longer 21-iteration check landed at 190.755 ms. Still useful, but not as simple as "the loop found 23.7 percent and we shipped it."

Confirmation check

The single-clip win got smaller when we reran it

The loop found a tempting best row. Confirmation made the result more honest.

Speech-to-text median latency

  • Baseline: 233.434 ms
  • Loop best: 178.041 ms
  • Pre-review check: 216.994 ms
  • Post-review check: 225.608 ms
  • Longer check (21 iterations): 190.755 ms

From the VoiceFox run notes. The lesson was not that the loop was wrong; it was that one clip was too small to trust by itself.

This was the second trap: a single fixture is a leaderboard, not a product test. It can tell you a candidate is interesting. It cannot tell you the app still works for the variety of things a person actually says.

So we recorded a real voice corpus

We had the agent build a small recording program: scripts/record_voicefox_corpus.py. The job of that program was deliberately plain. Show Casey a short transcript, wait for him to read it aloud, record mono 16 kHz PCM WAV, then write a manifest that pairs the audio file with the expected transcript.

The corpus came from a very normal place: Casey's car, in a parking lot. The laptop was tethered to the internet through his phone. So the benchmark was not born in a studio or a lab. It came from the same kind of imperfect setup people actually use: laptop, mic, parking lot, phone hotspot, and an autoresearch loop working through bounded changes in the background.

The prompts were not random sentences. They were small, practical phrases that resemble actual dictation work:

  • short commands such as "Open the build log" and "Save this note"
  • normal dictation such as "Please summarize the pull request and mention any risky assumptions"
  • fast speech, punctuation-like speech, long numbers, proper nouns, dense run-on phrases, quiet voice, and one longer dictation sample

The finished corpus had 16 fixtures. The audio stays private because it is real voice, but the shape of the corpus is visible in the repo. The recorder writes audio/*.wav and a manifest.json with id, path, category, and expected fields. That was enough structure for the benchmark to load the same clips every time and compare output against the expected transcript.

The gate made latency honest

The eval surface was bench_real_stt_corpus.py. It loads the local Parakeet model once, runs every WAV fixture, records latency samples, and then applies accuracy gates. The primary loop score is intentionally easy to understand:

gated_score_ms = median_ms + 0.25 * p95_ms

Median latency captures the normal case. p95 adds a penalty for spiky behavior. Then comes the big rule: if any fixture fails accuracy, the score becomes 999999.0. A bad transcript cannot hide behind a faster median.
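In code, the score is a few lines. A stdlib sketch; the real benchmark's percentile method may differ slightly:

```python
import statistics

ACCURACY_FAIL_SCORE = 999999.0  # sentinel: any gate failure forfeits the run

def gated_score_ms(latencies_ms: list, all_gates_passed: bool) -> float:
    """median + 0.25 * p95, unless any fixture failed an accuracy gate."""
    if not all_gates_passed:
        return ACCURACY_FAIL_SCORE
    median = statistics.median(latencies_ms)
    p95 = statistics.quantiles(latencies_ms, n=20)[-1]  # 95th percentile estimate
    return median + 0.25 * p95
```

The sentinel is the whole trick: the loop minimizes one number, so accuracy failures have to be expressed as a latency so bad no patch can survive it.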

  • Empty output: any blank transcription fails.
  • Character similarity: below max(0.90, baseline fixture similarity - 0.01) fails.
  • Word error rate: above min(0.25, baseline fixture WER + 0.03) fails.
  • Output length: below 85 percent of the baseline output length fails.

Character similarity is a rough "does this look like the expected sentence?" score. Word error rate is exactly what it sounds like: how many words are wrong, missing, or inserted. Output length catches the especially sneaky failure where the model returns a short, plausible fragment and calls it done.
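Both checks are easy to implement. The article does not show the benchmark's exact metrics, so treat this as one plausible version: difflib's ratio for character similarity, and the classic dynamic-programming word error rate:

```python
import difflib

def char_similarity(expected: str, actual: str) -> float:
    """Rough 'does this look like the expected sentence?' score in [0, 1]."""
    return difflib.SequenceMatcher(None, expected.lower(), actual.lower()).ratio()

def word_error_rate(expected: str, actual: str) -> float:
    """Word-level edit distance (substitutions + insertions + deletions),
    divided by the number of expected words."""
    ref, hyp = expected.lower().split(), actual.lower().split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
    return d[len(ref)][len(hyp)] / max(len(ref), 1)
```

For example, transcribing "open the build log" as "open the log" drops one of four expected words, a WER of 0.25, which is exactly at the gate's outer limit.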

This gate turned the search from "make the number smaller" into "make the number smaller without breaking dictation." The loop could change decode and inference settings in voicefox.py. It could not edit the recordings, the benchmark, paste behavior, model files, dependencies, or vocabulary.

The run we trusted

The corpus-gated run produced 50 experiment rows. Only two rows advanced. Forty-eight were discarded. The best loop row was retry_tiny_short_cap_transcripts, which used a lower decode cap for short audio and retried once with a safer cap if the first transcript looked suspiciously short.

The loop report looked stronger than the final retained result: 394.7 ms down to 307.6 ms, a provisional 22.1 percent improvement. This is the point where it would have been easy to stop too early. Instead, we reran the retained candidate against the original baseline and two close challengers, then compared median-of-three confirmation scores. The guarded retry stayed ahead, but the confirmed gap was smaller. That is exactly why the confirmation step mattered.

The accepted change is also easy to state. For short audio, VoiceFox tries a faster decode setting first. If the transcript comes back suspiciously short, it retries once with the safer general setting. In code, that means a short-audio cap of 4, a general cap of 5, a short-audio threshold of 7 seconds, and a retry floor of 10 characters.
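Stated as code, with the constants from the run. The `transcribe` callable stands in for the real Parakeet call, so this is the shape of the change, not VoiceFox's literal source:

```python
SHORT_AUDIO_SECONDS = 7.0  # clips at or under this try the faster cap first
SHORT_AUDIO_CAP = 4        # faster decode cap for short audio
GENERAL_CAP = 5            # safer cap for long audio, and for the retry
RETRY_MIN_CHARS = 10       # transcripts shorter than this trigger one retry

def transcribe_with_guarded_retry(audio, duration_s, transcribe):
    """Try the fast decode cap on short clips; fall back once to the safer
    general cap if the transcript comes back suspiciously short."""
    if duration_s > SHORT_AUDIO_SECONDS:
        return transcribe(audio, GENERAL_CAP)
    text = transcribe(audio, SHORT_AUDIO_CAP)
    if len(text.strip()) < RETRY_MIN_CHARS:
        return transcribe(audio, GENERAL_CAP)  # one retry, never more
    return text
```

The retry is bounded at one attempt, so the worst case for a short clip is two decodes, and the common case is a single faster one.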

The rejected rows mattered too

The rejected rows matter as much as the accepted row. The loop tried lower caps, alternative retry floors, and nearby decode configurations. Some were faster in isolation. They did not beat the gated score after the full corpus ran, or they looked less convincing once compared against the baseline and close challengers.

This is where the VoiceFox run became more than an "AI optimized the code" story. The autoresearch loop was not allowed to keep a change just because one number got smaller. The change had to survive the corpus. Then it had to survive repeated confirmation against baseline. Then the tests had to cover the new short-audio path so the behavior could not disappear later.

The reusable pattern

The reusable pattern is small. If you have the following, you can run a useful autoresearch loop:

  • A piece of code or a setting you are willing to let an agent change
  • A repeatable test that prints a number
  • A rule that says when a result is not acceptable

The gate is the part that gets skipped most often. Without a gate, the loop becomes a hill climb that optimizes the visible metric and silently regresses a property the metric did not capture. With a gate, the loop can still move quickly, but it cannot win by cheating.

The 16-fixture VoiceFox corpus is intentionally small. A bigger corpus would catch more failure modes, but it would slow the loop and reduce the number of ideas the agent could test. For this problem, the better trade was a small corpus with sharp gates: enough clips to catch short commands, fast speech, numbers, proper nouns, quiet speech, and dense dictation, but fast enough to run repeatedly.

The pattern also makes incidents auditable. When the symbol-cap value gets bumped a year from now, the loop will run, and the gates will refuse the bump if the short-clip case still degrades. The decision lives in the eval, not in someone's memory.

What we learned

The 540 to 590 ms improvement is real, but the better lesson is how it was earned. The post-submit loop produced a satisfying improvement curve. The single-fixture STT loop produced a tempting leaderboard. The corpus loop produced the result we could actually trust: a smaller confirmed latency win, with zero accuracy failures and unchanged accuracy metrics across the confirmation runs.

If you build AI tools that run locally, the most useful thing in this article is probably not the loop itself. It is the discipline around the loop. Pick the metric. Build the fixture set. Name the gates. Show the result rows. Then let the autoresearcher move quickly inside that boundary.

VoiceFox is in the Lab. The latency audit and the autoresearch loop spec live in the project repo; the autoresearcher pattern itself is generic enough that it has its own article waiting to be written about non-VoiceFox uses.