Testing Methodology

Every accuracy score on this site comes from a published, repeatable methodology. This page explains exactly how we measure, what fixtures we use, and what each metric means.

Audio fixtures

All apps are tested against identical audio samples synthesised with ElevenLabs Text-to-Speech (English voice, neutral accent, consistent tempo) plus real microphone recordings for noise conditions. Using synthesised speech ensures the ground-truth transcript is exact — no ambiguity about what was actually said.

Current fixture set (7 samples; en-02 is tested in two conditions):

ID	Type	Duration	Condition
en-01	Coding / technical	~90s	Clean
en-02	Casual voice memo	~75s	Clean + café noise (SNR 5 dB)
en-03	Conference / meeting	~80s	Clean
en-04	Long-form narration	~160s	Clean
en-05	Mixed ITN / numbers	~60s	Clean
ru-test	Russian / EN mixed	~90s	Clean

Metrics

Word Error Rate (WER)

The standard ASR accuracy metric. It counts word-level substitutions, deletions, and insertions, then divides the total by the number of words in the reference transcript. Lower is better. A WER of 5% means 5 errors per 100 reference words. Because insertions count as errors, WER can exceed 100%.

WER = (S + D + I) / N

where S = substitutions, D = deletions, I = insertions, N = reference word count.
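
The formula above can be sketched in a few lines of Python. This is a minimal illustration, not our production scorer: the function names `wer_counts` and `wer` are hypothetical, and the sketch assumes plain whitespace tokenisation with no case or punctuation normalisation.

```python
def wer_counts(reference: str, hypothesis: str) -> tuple[int, int, int, int]:
    """Return (S, D, I, N) from a minimum-edit word alignment."""
    ref, hyp = reference.split(), hypothesis.split()
    # cell[i][j] = (total edits, S, D, I) for aligning ref[:i] with hyp[:j]
    cell = [[(0, 0, 0, 0)] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(1, len(ref) + 1):
        cell[i][0] = (i, 0, i, 0)  # empty hypothesis: only deletions
    for j in range(1, len(hyp) + 1):
        cell[0][j] = (j, 0, 0, j)  # empty reference: only insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            if ref[i - 1] == hyp[j - 1]:
                cell[i][j] = cell[i - 1][j - 1]  # words match, no cost
                continue
            t, s, d, ins = cell[i - 1][j - 1]
            substitution = (t + 1, s + 1, d, ins)
            t, s, d, ins = cell[i - 1][j]
            deletion = (t + 1, s, d + 1, ins)
            t, s, d, ins = cell[i][j - 1]
            insertion = (t + 1, s, d, ins + 1)
            # Tuples compare on total edits first, so min() keeps the cheapest path.
            cell[i][j] = min(substitution, deletion, insertion)
    _, s, d, ins = cell[len(ref)][len(hyp)]
    return s, d, ins, len(ref)


def wer(reference: str, hypothesis: str) -> float:
    """WER = (S + D + I) / N."""
    s, d, ins, n = wer_counts(reference, hypothesis)
    if n == 0:
        raise ValueError("reference transcript is empty")
    return (s + d + ins) / n


reference = "the quick brown fox jumps"
hypothesis = "the quick brown box jumps high"
print(wer_counts(reference, hypothesis))  # (1, 0, 1, 5)
print(wer(reference, hypothesis))         # 0.4
```

Here "fox" → "box" is one substitution and "high" is one insertion, so WER = (1 + 0 + 1) / 5 = 40%. When several alignments have the same total cost, the split between S, D, and I can differ, but the WER itself does not.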

Character Error Rate (CER)

Same formula applied at the character level. CER is more sensitive to small errors (missing letters, wrong capitalisation) and better captures the "how readable is this" quality of the output.
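
As a rough illustration of the character-level version, the sketch below computes CER with a single-row Levenshtein distance. The names `edit_distance` and `cer` are hypothetical, and counting spaces as characters is an assumption of this sketch.

```python
def edit_distance(a, b) -> int:
    """Levenshtein distance between two sequences, one DP row at a time."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,              # deletion
                curr[j - 1] + 1,          # insertion
                prev[j - 1] + (x != y),   # substitution, or free match
            ))
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character errors divided by the reference character count."""
    if not reference:
        raise ValueError("reference transcript is empty")
    return edit_distance(reference, hypothesis) / len(reference)


reference = "please receive"
typo = "please recieve"    # two swapped letters
wrong_word = "please deny"  # a different word entirely
print(cer(reference, typo) < cer(reference, wrong_word))  # True
```

At the word level both hypotheses contain exactly one substitution, so their WER is identical. At the character level the typo costs 2 edits out of 14 reference characters, while the wrong word costs more, which is the distinction CER is meant to surface.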

Punctuation Error Rate (PER)

Our custom metric: computed separately on punctuation tokens only (commas, periods, question marks, etc.). Punctuation dramatically affects readability and copy-paste usability. A transcript with perfect words but no punctuation is not useful for professional writing.

PER = punctuation errors / reference punctuation count

Multi-reference ground truth

Real dictation has valid variation: "two hundred" and "200" are both correct transcriptions of the number. Disfluency words like "um" and "uh" may or may not appear in the output. Compound words can be written solid or hyphenated.

We encode these alternatives using {option1|option2|option3} syntax in the ground-truth file. The scorer picks the alternative that gives the lowest edit distance for each position — so a model is never penalised for a valid variation.
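
One way such a scorer could work is sketched below. It is an illustration under stated assumptions, not our actual implementation: the function names are hypothetical, alternatives may span several words or be empty (for example `{um|}`), braces are not nested, and the sketch returns only the minimum edit count because the choice of denominator is not covered here.

```python
import re


def parse_reference(text: str) -> list[list[list[str]]]:
    """Split a reference into segments; each segment is a list of
    alternatives, and each alternative is a list of words."""
    segments = []
    # re.split with a capturing group alternates: literal, group, literal, ...
    for idx, part in enumerate(re.split(r"\{([^{}]*)\}", text)):
        if idx % 2 == 1:  # text that was inside {...}
            segments.append([alt.split() for alt in part.split("|")])
        else:             # plain words: one alternative each
            segments.extend([[word]] for word in part.split())
    return segments


def advance(row: list[int], word: str, hyp: list[str]) -> list[int]:
    """One Levenshtein row update: consume one reference word."""
    new = [row[0] + 1]
    for j in range(1, len(hyp) + 1):
        new.append(min(
            row[j] + 1,                         # delete the reference word
            new[j - 1] + 1,                     # insert hyp[j - 1]
            row[j - 1] + (word != hyp[j - 1]),  # substitute, or free match
        ))
    return new


def min_edits(reference: str, hypothesis: str) -> int:
    """Lowest word edit count over all valid readings of the reference."""
    hyp = hypothesis.split()
    row = list(range(len(hyp) + 1))
    for alternatives in parse_reference(reference):
        candidates = []
        for words in alternatives:
            r = row
            for word in words:
                r = advance(r, word, hyp)
            candidates.append(r)
        # Keep, for every hypothesis prefix, the cheapest alternative.
        row = [min(column) for column in zip(*candidates)]
    return row[-1]


reference = "transfer {two hundred|200} dollars {um|} today"
print(min_edits(reference, "transfer 200 dollars today"))             # 0
print(min_edits(reference, "transfer two hundred dollars um today"))  # 0
print(min_edits(reference, "transfer 300 dollars today"))             # 1
```

Taking the element-wise minimum after each segment keeps the cost polynomial, whereas expanding every combination of alternatives would grow exponentially with the number of `{...}` groups.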

HTML error diffs

For each tested sample we publish an interactive HTML diff that shows every word and punctuation token in colour:

Green — exact match
Yellow — accepted alternative form
Red text — substitution (wrong word)
Blue — insertion (extra word)
Red box — deletion (missing word)
Small red circle — missing punctuation

Accuracy score

Raw WER is hard to read at a glance. We convert it to a 1–10 score using the table below. The score reflects how usable the output is for professional dictation — not just statistical accuracy. A score of 8+ means you can use the output with minimal editing. Below 5, the transcript requires significant correction.

WER range	Score	Verdict
≤ 1%	10 / 10	Exceptional — near-perfect
1–2%	9 / 10	Excellent — 1 error per 50 words
2–3%	8 / 10	Very good — occasional corrections needed
3–5%	7 / 10	Good — light editing required
5–8%	6 / 10	Acceptable — noticeable errors
8–12%	5 / 10	Fair — frequent errors, usable with effort
12–18%	4 / 10	Poor — heavy editing required
18–25%	3 / 10	Bad — nearly 1 in 4 words wrong
25–35%	2 / 10	Very bad — barely usable
> 35%	1 / 10	Unusable

The average accuracy score shown on each review page is the mean of individual model scores. Apps with multiple models (local + cloud) get separate per-model scores; the average covers all tested models equally.
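
The table lookup and the averaging rule can be sketched as follows. The names are hypothetical, and the sketch assumes each band includes its upper bound (so exactly 2% scores 9), which is how the "≤ 1%" and "> 35%" rows read.

```python
import bisect
import statistics

# Upper bound of each WER band in percent, best band first.
UPPER_BOUNDS = [1, 2, 3, 5, 8, 12, 18, 25, 35]
# One more score than bounds: the last entry covers WER > 35%.
SCORES = [10, 9, 8, 7, 6, 5, 4, 3, 2, 1]


def accuracy_score(wer_percent: float) -> int:
    """Map a WER percentage to the 1-10 score from the table."""
    if wer_percent < 0:
        raise ValueError("WER cannot be negative")
    # bisect_left finds the first bound >= wer_percent, so a value that
    # sits exactly on a boundary lands in the better band.
    return SCORES[bisect.bisect_left(UPPER_BOUNDS, wer_percent)]


def average_score(model_wers: list[float]) -> float:
    """Unweighted mean of per-model scores, every model counting equally."""
    return statistics.mean(accuracy_score(w) for w in model_wers)


print(accuracy_score(1.5))         # 9
print(accuracy_score(4.0))         # 7
print(average_score([1.5, 4.0]))   # 8
```

Note that the mean is taken over the per-model scores, not over the raw WER values, so two models at 1.5% and 4.0% average to 8 rather than to the score for 2.75%.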

Ranking criteria

Our best-of rankings sort apps on one dimension at a time. Every ranking is derived from the same measured data — we never hand-place an app. Apps with no data for a dimension (for example a cloud-only app under local accuracy) drop to the bottom as an honest “not applicable”, never an invented score.

Accuracy. The out-of-box word accuracy of each app’s default model on identical audio (the score table above). We rank what a new user actually gets, not the best a hidden model could do.
Speed. End-to-end latency — the time from finishing speech to text appearing on screen — measured on the same machine. Faster ranks higher.
Privacy. A derived score that penalises audio upload, data sent beyond audio, streaming before you press Stop, and missing training / tracking / history opt-outs. Local-only behaviour scores highest.
Best free option. How good the app is without paying. We separate a real free tier — one you can keep using indefinitely — from a trial wall, which only delays the paywall. An app ranks here only if it has a genuine free tier; trial-only and paid-only apps are listed as having no free tier. Among apps that qualify, we weigh:
- the practical limit (words or minutes per day / week / month, or a one-off cap);
- whether an account or a credit card is required before you can use it;
- whether the free model is the same one paid users get, or a downgraded version;
- privacy on the free tier — many free tiers train on your audio by default.
A generous, no-card weekly allowance on the same model as paid ranks above a tiny one-off word cap that forces a sign-up.
Best local accuracy. The accuracy of the most accurate model that runs fully offline on your machine. Cloud-only apps have no local mode and are listed as not applicable.
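
The "one dimension at a time, missing data drops to the bottom" rule can be sketched as a sort key. Everything in this example is hypothetical: the `App` fields, the app names, and the numbers are made up for illustration and are not measured results.

```python
from typing import NamedTuple, Optional


class App(NamedTuple):
    name: str
    local_accuracy: Optional[float]  # 1-10 score; None = no local mode
    latency_ms: Optional[float]      # end-to-end latency; None = not measured


def rank(apps: list[App], dimension: str, higher_is_better: bool = True) -> list[App]:
    """Sort apps on one dimension; apps with no data go last."""
    def sort_key(app: App):
        value = getattr(app, dimension)
        if value is None:
            return (1, 0.0)  # "not applicable" group, after all scored apps
        return (0, -value if higher_is_better else value)
    # sorted() is stable, so ties keep their input order.
    return sorted(apps, key=sort_key)


apps = [
    App("alpha", local_accuracy=8.0, latency_ms=900.0),
    App("beta", local_accuracy=None, latency_ms=400.0),   # cloud-only
    App("gamma", local_accuracy=9.0, latency_ms=1500.0),
]

print([a.name for a in rank(apps, "local_accuracy")])
# ['gamma', 'alpha', 'beta']
print([a.name for a in rank(apps, "latency_ms", higher_is_better=False)])
# ['beta', 'alpha', 'gamma']
```

Keeping `None` as an explicit "not applicable" value, rather than substituting 0, means a cloud-only app is never shown with an invented local accuracy score.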

What we do not measure

Real-time latency for cloud models, because network conditions vary. We do measure end-to-end latency (press hotkey → text appears), but only as a separate indicative metric; it is not part of the accuracy score.
Domain-specific vocabulary accuracy. Our fixtures are general-purpose; medical or legal vocabulary is outside this methodology's scope.
Speaker accent variation. All EN fixtures use a neutral North American accent.

Methodology version

Current version: 1.2 (2026-05). Changes are logged in the methodology changelog linked below.