Testing Methodology
Every accuracy score on this site comes from a published, repeatable methodology. This page explains exactly how we measure, what fixtures we use, and what each metric means.
Audio fixtures
All apps are tested against identical audio samples synthesised with ElevenLabs Text-to-Speech (English voice, neutral accent, consistent tempo) plus real microphone recordings for noise conditions. Using synthesised speech ensures the ground-truth transcript is exact — no ambiguity about what was actually said.
Current fixture set (7 samples across 6 fixture IDs; en-02 is tested in two conditions):
| ID | Type | Duration | Condition |
|---|---|---|---|
| en-01 | Coding / technical | ~90s | Clean |
| en-02 | Casual voice memo | ~75s | Clean + café SNR5 |
| en-03 | Conference / meeting | ~80s | Clean |
| en-04 | Long-form narration | ~160s | Clean |
| en-05 | Mixed ITN / numbers | ~60s | Clean |
| ru-test | Russian / EN mixed | ~90s | Clean |
Metrics
Word Error Rate (WER)
The standard ASR accuracy metric. Counts substitutions, deletions, and insertions at the word level, divided by the number of words in the reference transcript. Lower is better. A WER of 5% means 5 errors per 100 words.
WER = (S + D + I) / N
where S = substitutions, D = deletions, I = insertions, N = reference word count.
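As a minimal sketch of the formula above, WER can be computed with a word-level Levenshtein distance, whose minimum edit count equals S + D + I. The function names `edit_distance` and `wer` are illustrative, and the normalisation used here (lowercasing and splitting on whitespace) is an assumption, not a description of the site's actual scorer.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences.

    Substitutions, deletions and insertions each cost 1, so the result
    equals S + D + I for the best alignment.
    """
    prev = list(range(len(hyp) + 1))  # distance from empty ref to hyp[:j]
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to empty hyp
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion (reference word missing)
                curr[j - 1] + 1,     # insertion (extra hypothesis word)
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate as a fraction (0.05 means 5%)."""
    # Assumed normalisation: lowercase, whitespace tokenisation.
    ref_words = reference.lower().split()
    hyp_words = hypothesis.lower().split()
    if not ref_words:
        raise ValueError("reference transcript is empty")
    return edit_distance(ref_words, hyp_words) / len(ref_words)


print(wer("the quick brown fox", "the quick brown box"))  # 0.25
```

Because insertions count as errors but do not increase N, WER computed this way can exceed 100% when the output contains many extra words.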
Character Error Rate (CER)
The same formula applied at the character level. Because each small error (a missing letter, wrong capitalisation) is counted in proportion to its size rather than as a whole wrong word, CER better captures how readable the output is.
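A small worked example may help show how the two metrics differ. Take the reference "send the report" and the output "send the repot", which is missing one letter. The example sentence is made up for illustration, and it assumes spaces count as characters, giving N = 15 at the character level and N = 3 at the word level.

```latex
\begin{aligned}
\text{CER} &= \frac{S + D + I}{N} = \frac{0 + 1 + 0}{15} \approx 6.7\% \\
\text{WER} &= \frac{S + D + I}{N} = \frac{1 + 0 + 0}{3} \approx 33.3\%
\end{aligned}
```

The same one-letter slip costs a full word under WER but only a single character under CER.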
Punctuation Error Rate (PER)
Our custom metric: computed separately on punctuation tokens only (commas, periods, question marks, etc.). Punctuation dramatically affects readability and copy-paste usability. A transcript with perfect words but no punctuation is not useful for professional writing. PER = punctuation errors / reference punctuation count.
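One possible way to compute PER is sketched below. It rests on assumptions that the text above does not confirm: punctuation marks are extracted in order with a regular expression, and "punctuation errors" are counted as the edit distance between the two punctuation sequences. The names `PUNCT`, `levenshtein` and `per` are hypothetical, and the real scorer may align punctuation to word positions instead.

```python
import re

# Assumed punctuation inventory; the actual token set may be larger.
PUNCT = re.compile(r"[.,;:!?]")


def levenshtein(a, b):
    """Minimum number of substitutions, deletions and insertions."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(prev[j] + 1,               # deletion
                            curr[-1] + 1,              # insertion
                            prev[j - 1] + (x != y)))   # match / substitution
        prev = curr
    return prev[-1]


def per(reference: str, hypothesis: str) -> float:
    """Punctuation errors divided by the reference punctuation count."""
    ref_marks = PUNCT.findall(reference)
    hyp_marks = PUNCT.findall(hypothesis)
    if not ref_marks:
        raise ValueError("reference contains no punctuation")
    return levenshtein(ref_marks, hyp_marks) / len(ref_marks)


# Missing comma plus "?" replaced by "." gives 2 errors out of 3 marks.
score = per("Hello, world. How are you?", "Hello world. How are you.")
```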
Multi-reference ground truth
Real dictation has valid variation: "two hundred" and "200" are both correct transcriptions of the number. Disfluency words like "um" and "uh" may or may not appear in the output. Compound words can be written solid or hyphenated.
We encode these alternatives using {option1|option2|option3} syntax in the ground-truth file. The scorer picks the alternative that gives the lowest edit distance for each position, so a model is never penalised for a valid variation.
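The sketch below illustrates one way such a ground-truth line could be scored. It is a simplification under stated assumptions: every combination of alternatives is expanded and the variant with the lowest edit distance wins, which is exponential in the number of groups but easy to verify. Treating an empty option such as `{um|}` as "this word is optional" is also an assumption about the syntax. The names `expand_reference` and `best_wer` are hypothetical.

```python
import itertools
import re

GROUP = re.compile(r"\{([^{}]*)\}")  # captures the body of one {a|b|c} group


def levenshtein(a, b):
    """Minimum number of substitutions, deletions and insertions."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[-1] + 1, prev[j - 1] + (x != y)))
        prev = curr
    return prev[-1]


def expand_reference(template: str) -> list[list[str]]:
    """Return every word sequence the template allows."""
    # re.split with one capturing group alternates literal text (even
    # indices) and group bodies (odd indices).
    parts = GROUP.split(template)
    choices = []
    for idx, part in enumerate(parts):
        if idx % 2:
            choices.append([opt.split() for opt in part.split("|")])
        else:
            choices.append([part.split()])
    return [sum(combo, []) for combo in itertools.product(*choices)]


def best_wer(template: str, hypothesis: str) -> float:
    """WER against the reference variant with the fewest edits."""
    hyp = hypothesis.lower().split()
    scored = [(levenshtein(ref, hyp), len(ref))
              for ref in expand_reference(template.lower()) if ref]
    errors, ref_len = min(scored, key=lambda pair: pair[0])
    return errors / ref_len


template = "I paid {two hundred|200} dollars {um|} today"
print(best_wer(template, "I paid 200 dollars today"))  # 0.0
```

A production scorer would more likely resolve alternatives inside the alignment itself to avoid the combinatorial expansion; the exhaustive version is used here only because its correctness is easy to check.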
HTML error diffs
For each tested sample we publish an interactive HTML diff that shows every word and punctuation token in colour:
- Green — exact match
- Yellow — accepted alternative form
- Red text — substitution (wrong word)
- Blue — insertion (extra word)
- Red box — deletion (missing word)
- Small red circle — missing punctuation
Accuracy score
Raw WER is hard to read at a glance. We convert it to a 1–10 score using the table below. The score reflects how usable the output is for professional dictation — not just statistical accuracy. A score of 8+ means you can use the output with minimal editing. Below 5, the transcript requires significant correction.
| WER range | Score | Verdict |
|---|---|---|
| ≤ 1% | 10 / 10 | Exceptional — near-perfect |
| 1–2% | 9 / 10 | Excellent — 1 error per 50 words |
| 2–3% | 8 / 10 | Very good — occasional corrections needed |
| 3–5% | 7 / 10 | Good — light editing required |
| 5–8% | 6 / 10 | Acceptable — noticeable errors |
| 8–12% | 5 / 10 | Fair — frequent errors, usable with effort |
| 12–18% | 4 / 10 | Poor — heavy editing required |
| 18–25% | 3 / 10 | Bad — nearly 1 in 4 words wrong |
| 25–35% | 2 / 10 | Very bad — barely usable |
| > 35% | 1 / 10 | Unusable |
The average accuracy score shown on each review page is the mean of individual model scores. Apps with multiple models (local + cloud) get separate per-model scores; the average covers all tested models equally.
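The conversion and averaging described above can be sketched as a simple lookup. The band limits come from the table; treating each upper bound as inclusive (so exactly 2% scores 9, not 8) is an assumption inferred from the "≤ 1%" and "> 35%" rows. The names `SCORE_BANDS`, `wer_to_score` and `app_score` are illustrative.

```python
from statistics import mean

# (upper WER bound in percent, score), copied from the table above.
SCORE_BANDS = [
    (1, 10), (2, 9), (3, 8), (5, 7), (8, 6),
    (12, 5), (18, 4), (25, 3), (35, 2),
]


def wer_to_score(wer_percent: float) -> int:
    """Map a WER percentage to the 1-10 accuracy score."""
    if wer_percent < 0:
        raise ValueError("WER cannot be negative")
    for upper, score in SCORE_BANDS:
        if wer_percent <= upper:  # assumed: upper bounds are inclusive
            return score
    return 1  # anything above 35%


def app_score(model_wers) -> float:
    """Unweighted mean of per-model scores, one WER value per model."""
    return mean(wer_to_score(w) for w in model_wers)


print(wer_to_score(4.2))        # 7
print(app_score([1.5, 6.0]))    # 7.5
```

Note that the average is taken over the scores, not over the raw WER values, so an app with models at 1.5% and 6.0% WER averages 7.5 rather than the score for a 3.75% WER.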
What we do not measure
- Real-time latency for cloud models, because network conditions vary. We report end-to-end latency (press hotkey → text appears) as a separate, indicative metric; it is not part of the accuracy score.
- Domain-specific vocabulary accuracy. Our fixtures are general-purpose; medical or legal vocabulary is outside this methodology's scope.
- Speaker accent variation. All EN fixtures use a neutral North American accent.
Methodology version
Current version: 1.2 (2026-05). Changes are logged in the methodology changelog linked below.