EFFICACY · METHODOLOGY · PROOF
How accurate is the Britwise examiner?
Most AI English-grading vendors quote a single accuracy number and refuse to publish their methodology. We publish the rubric, the models, the dataset, the gaps, and the human-examiner panel that grades against us. If the panel disagrees with us, we publish that too.
Cambridge-aligned rubric
Human examiner reference panel
External audit pending
OUR STACK
The models that grade your test
No black boxes. Every component, every model, every latency budget.
| Task | Model | Latency | Notes |
|---|---|---|---|
| Speech-to-text | Deepgram Nova-2 | ~600 ms | EU Frankfurt endpoint · zero-retention · filler-word capture |
| Pronunciation prosody | Hume AI | ~400 ms | Confidence + emotion + cadence; feeds into Pronunciation band |
| Speaking grading | OpenAI GPT-5.2 | ~2.2 s | Cambridge 4-criteria rubric · temp 0.2 · structured JSON |
| Full mock exam grading | OpenAI GPT-5.5 | ~3.0 s | Reasoning-heavy · used for end-of-week mock tests |
| Writing Task 1 / 2 | OpenAI GPT-5.2 | ~2.5 s | Cambridge 4-criteria · 1500-token feedback + B-level rewrite |
| Reading / Listening MCQ | Deterministic key | <50 ms | Cambridge mock-test answer keys · band-scaled per official conversion |
| Coach voice (Angie) | ElevenLabs Flash v2.5 | ~75 ms first byte | British RP · stream-first |
All LLM calls route through the Britwise abstraction layer — we can swap providers without breaking the rubric. Audio is processed in the EEA where possible; no audio is retained by Deepgram or OpenAI (zero-retention contracts on file).
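One way to picture the provider-swap claim is a thin routing layer in front of interchangeable providers. The sketch below is illustrative only: `GradeRequest`, `GradingProvider`, `StubProvider`, and `GradingRouter` are hypothetical names, not the actual Britwise abstraction layer, and the stub returns a fixed band instead of calling any real model.

```python
from dataclasses import dataclass
from typing import Dict, Protocol


@dataclass(frozen=True)
class GradeRequest:
    """Provider-independent grading input (hypothetical shape)."""
    transcript: str
    rubric_prompt: str
    temperature: float = 0.2  # the temperature quoted for Speaking grading


class GradingProvider(Protocol):
    """Anything that can turn a request into a structured result."""

    def grade(self, request: GradeRequest) -> dict: ...


class StubProvider:
    """Stand-in provider for illustration; returns a fixed band."""

    def __init__(self, name: str, band: float) -> None:
        self.name = name
        self.band = band

    def grade(self, request: GradeRequest) -> dict:
        return {"provider": self.name, "band": self.band}


class GradingRouter:
    """Routes every request to the active provider; callers never see which one."""

    def __init__(self, providers: Dict[str, GradingProvider], active: str) -> None:
        if active not in providers:
            raise KeyError(active)
        self._providers = providers
        self._active = active

    def swap(self, name: str) -> None:
        """Switch the active provider without touching the request shape."""
        if name not in self._providers:
            raise KeyError(name)
        self._active = name

    def grade(self, request: GradeRequest) -> dict:
        return self._providers[self._active].grade(request)
```

Because the rubric prompt travels inside the request rather than inside a provider, swapping the active provider in this sketch leaves the rubric untouched.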
ACCURACY
How close are we to a human Cambridge examiner?
The honest table. Where we’re using extrapolated numbers (because the 100-sample study is still in flight) we say so.
| Metric | Human examiner | Britwise (GPT-5.2) | Typical AI competitor | Status |
|---|---|---|---|---|
| Within ±0.5 of human (target) | 70% | ≈ 70% (extrapolated) | ≈ 60% | study in progress |
| Within ±1.0 of human | 95% | ≈ 92% (extrapolated) | ≈ 88% | study in progress |
| Off by ≥ 2.0 bands (catastrophic) | <1% | ≈ 1.6% (extrapolated) | ≈ 5% | study in progress |
| Inter-attempt consistency (same sample, 5 runs) | n/a | 0.32 σ | 0.61 σ | internal test, June 2026 |
Reference numbers for “human examiner” come from Cambridge ESOL inter-rater reliability studies (Taylor & Galaczi, 2011; Cambridge Research Notes vol. 65). Competitor estimates are extrapolated from published GPT-4 IELTS papers (2024) since most vendors do not publish their own numbers.
METHODOLOGY
Five steps. No theatre.
The exact procedure the in-progress 100-sample study follows. Its results will replace the extrapolated figures above.
01
Sample
100 anonymised candidate audio submissions from the past 60 days — stratified across Bands 4 → 8 (20 per band). Personal identifiers stripped; consent recorded under the Britwise Privacy Notice §4.
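The stratified draw described in this step can be sketched as follows. This is a minimal illustration, not the production sampling code: the function name `stratified_sample`, the `(submission_id, band)` input shape, and the use of a provisional whole-band label for bucketing are all assumptions, since the step does not say how submissions are assigned to a band before the panel grades them.

```python
import random
from collections import defaultdict


def stratified_sample(submissions, bands=(4, 5, 6, 7, 8), per_band=20, seed=0):
    """Draw `per_band` submission ids from each band bucket.

    `submissions` is an iterable of (submission_id, band) pairs, where
    `band` is a provisional whole-band label (an assumption of this sketch).
    """
    by_band = defaultdict(list)
    for submission_id, band in submissions:
        by_band[band].append(submission_id)

    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    chosen = {}
    for band in bands:
        pool = by_band[band]
        if len(pool) < per_band:
            raise ValueError(f"band {band}: only {len(pool)} submissions available")
        chosen[band] = rng.sample(pool, per_band)  # sampling without replacement
    return chosen
```

With the defaults, five bands at 20 submissions each give the 100 samples the study calls for.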
02
Reference panel
Each sample is independently graded by THREE Cambridge-certified examiners (recruited via the British Council network) using the public IELTS Speaking Band Descriptors. The reference band is the median of the three.
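As a small illustration of the reference band, the median of three examiner grades can be computed like this. The function name `reference_band` is hypothetical; only the median-of-three rule comes from the step above.

```python
from statistics import median


def reference_band(examiner_bands):
    """Return the median of exactly three independent examiner grades."""
    if len(examiner_bands) != 3:
        raise ValueError("expected exactly three examiner grades")
    return median(examiner_bands)
```

For example, grades of 6.0, 7.0, and 6.5 give a reference band of 6.5. With three graders, the median always equals one of the grades actually awarded, and a single outlying examiner cannot move it.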
03
System grade
Britwise grades each sample 5 times using GPT-5.2 with temperature 0.2 and our production rubric prompt. The Britwise band is the mean of the 5 runs (rounded to the nearest 0.5).
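The aggregation in this step can be sketched as below. The name `system_band` is hypothetical, and the tie rule (exact halves round up) is an assumption, since the step only says "rounded to the nearest 0.5". Python's built-in `round` is avoided here because it rounds exact ties to the nearest even value.

```python
import math
from statistics import fmean


def system_band(run_bands):
    """Mean of the five run outputs, rounded to the nearest half band."""
    if len(run_bands) != 5:
        raise ValueError("expected exactly five runs")
    mean = fmean(run_bands)
    # Scale to half-band units, round half up (assumed tie rule), scale back.
    return math.floor(mean * 2 + 0.5) / 2
```

For example, runs of 6.0, 6.5, 6.5, 6.0, and 6.5 average 6.3, which rounds to 6.5. When each of the five runs is itself a half-band value, the mean is a multiple of 0.1 and can never land exactly on a x.25 or x.75 tie, so the tie rule only matters if run outputs are finer-grained.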
04
Metrics
We report four metrics: % within ±0.5 of reference · % within ±1.0 · catastrophic disagreement rate (off by ≥ 2.0 bands) · run-to-run standard deviation. All raw data will be published on this page as a JSON download once the study completes.
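A minimal sketch of how these four metrics could be computed from per-sample results. The `GradedSample` structure and the `agreement_metrics` name are hypothetical, and two details are assumptions because the page does not specify them: the sample standard deviation (n − 1 denominator) is used for run-to-run spread, and that spread is averaged across samples.

```python
from dataclasses import dataclass
from statistics import fmean, stdev
from typing import Dict, List


@dataclass
class GradedSample:
    reference_band: float    # median of the three examiner grades
    system_band: float       # rounded mean of the five Britwise runs
    run_bands: List[float]   # the five individual run outputs


def agreement_metrics(samples: List[GradedSample]) -> Dict[str, float]:
    """Agreement rates against the reference band, as fractions of samples."""
    n = len(samples)
    if n == 0:
        raise ValueError("need at least one sample")
    diffs = [abs(s.system_band - s.reference_band) for s in samples]
    return {
        "within_0_5": sum(d <= 0.5 for d in diffs) / n,
        "within_1_0": sum(d <= 1.0 for d in diffs) / n,
        "catastrophic": sum(d >= 2.0 for d in diffs) / n,
        # Assumed definition: mean of per-sample standard deviations.
        "run_to_run_sd": fmean([stdev(s.run_bands) for s in samples]),
    }
```

Half-band values are exactly representable as floats, so the `<=` and `>=` comparisons at 0.5, 1.0, and 2.0 behave as expected without a tolerance.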
05
Audit
An external party (target: a UK NCFE-registered awarding body) audits the dataset and methodology. The audit report is published in full once received.
KNOWN GAPS
Where we’re honest about the limits
If a vendor claims 99% accuracy without publishing their dataset, run.
Empirical benchmark in progress: the Britwise accuracy figures marked “extrapolated” above come from published GPT-5 family papers, not from our own dataset yet.
Pronunciation grading currently leans on Hume AI prosody + LLM judgement, not yet phoneme-level scoring (planned: integrate Microsoft Azure Speech Pronunciation Assessment).
Inter-rater reliability between human Cambridge examiners is itself only ~70% within ±0.5, so agreement above that level cannot be meaningfully demonstrated against a human reference.
Speaking-Part-2 long-turn (the 2-minute monologue) is the hardest task; our error is concentrated here. We expect to publish Part-1/2/3 split numbers in the next study.
FOR PROCUREMENT TEAMS
Need the dataset, the rubric, or to speak with an examiner on the panel?
Email efficacy@britwise.school for the full methodology PDF, signed examiner CVs, and access to the redacted candidate audio used in the study. Available under NDA for active enterprise procurement.
Britwise School LTD
Company No. 17253094 · ICO ZC174279 · Registered in England & Wales · 71–75 Shelton Street, Covent Garden, London, WC2H 9JQ, United Kingdom
© 2026 Britwise School LTD · Registered 2024, Companies House 17253094 · British English speaking coach · Cambridge YLE/KET/PET/FCE/IELTS aligned · Worldwide
🇬🇧 British English · Cambridge-grade scoring · GDPR compliant