How three frontier models render the same canvas.
Three models, three price points, all near-ceiling. The Haiku gap is recipe-choice variance, driven almost entirely by a single prompt: reasonable model behavior, not a framework failure.
Every prompt, every dimension.
| Prompt | Opus | Sonnet | Haiku |
|---|---|---|---|
| I'm bored. | 100.0 | 100.0 | 100.0 |
| Help me focus for the next hour. | 100.0 | 100.0 | 100.0 |
| Set a 5-minute timer. | 100.0 | 100.0 | 100.0 |
| Let's play tic-tac-toe, you go first. | 100.0 | 100.0 | 100.0 |
| Show me my Q3 dashboard. | 100.0 | 100.0 | 100.0 |
| Help me plan tomorrow with three priorities and a date picker. | 100.0 | 100.0 | 100.0 |
| Show my expenses last month with category filters. | 100.0 | 100.0 | 97.4 |
| Explain set theory with examples. | 100.0 | 100.0 | 81.8 |
| Teach me Python loops with an interactive example. | 100.0 | 100.0 | 100.0 |
| I have an hour. What should I do? | 100.0 | 100.0 | 100.0 |
A deterministic rubric.
We score each rendered tree against five dimensions, then take the weighted mean. There is no LLM judge; every score is reproducible by re-running the scorer on the same recording.
The corpus is 12 prompts spanning recipes, widgets, and freeform layouts. A model is eligible when its overall score is at least 95%.
- **Wired.** Every interactive node carries an action; orphan taps are impossible.
- **Coherent.** The layout regularizer didn't have to fix anything.
- **On-style.** Heuristic match against the prompt's StyleBrief prefer/avoid items.
- **On-shape.** The top-level node matches the expected contract; the minimum interactive count is met.
- **Interactivity.** The scripted user interaction succeeds against the rendered tree.
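To make the arithmetic concrete, here is a minimal sketch of the weighted mean and the 95% gate. Everything named below (the `DimensionScores` type, `overall`, `passes`) is an illustrative assumption, not the framework's actual API, and the weights are invented for the example.

```swift
// A sketch only: these weights are assumed for illustration and sum
// to 1.0; the framework's real weighting lives in its scorer.
struct DimensionScores {
    var wired: Double          // 0–100
    var coherent: Double       // 0–100
    var onStyle: Double        // 0–100
    var onShape: Double        // 0–100
    var interactivity: Double  // 0–100
}

func overall(_ s: DimensionScores) -> Double {
    // Weighted mean across the five dimensions (hypothetical weights).
    0.25 * s.wired
        + 0.20 * s.coherent
        + 0.15 * s.onStyle
        + 0.20 * s.onShape
        + 0.20 * s.interactivity
}

func passes(_ s: DimensionScores) -> Bool {
    overall(s) >= 95.0  // the eligibility threshold from the rubric
}

// A hypothetical mixed run: one low-weight dimension dips to 80.
let mixed = DimensionScores(wired: 100, coherent: 100, onStyle: 80,
                            onShape: 100, interactivity: 100)
print(overall(mixed), passes(mixed))  // 97.0 true
```

With these invented weights, a dip to 80 on a low-weight dimension still clears the gate, which is how a sub-100 prompt score can coexist with overall eligibility.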
Two commands. Same results.
```bash
swift run ainativeui eval-record \
  --provider anthropic \
  --model claude-opus-4-7 \
  --corpus Tests/Eval/Fidelity/corpus.json \
  --output recordings/opus-$(date +%Y-%m-%d).json
```

```bash
swift run ainativeui eval \
  --corpus Tests/Eval/Fidelity/corpus.json \
  --responses recordings/opus-2026-05-07.json
```

Recording costs (May 2026): Opus ~$3–7 · Sonnet ~$1–3 · Haiku ~$0.30–1 · On-device $0
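The split is the point of the workflow: `eval-record` pays the provider once and freezes the responses to disk, while `eval` re-scores that file offline. Here is a minimal sketch of the replay side, assuming a hypothetical recording schema; `RecordedResponse` and its fields are invented here, not the framework's actual JSON layout.

```swift
import Foundation

// Hypothetical recording schema, for illustration only.
struct RecordedResponse: Codable {
    let prompt: String
    let renderedTree: String  // serialized component tree from the model
}

// Load a frozen recording; no provider call happens from here on.
let url = URL(fileURLWithPath: "recordings/opus-2026-05-07.json")
let responses = try! JSONDecoder().decode(
    [RecordedResponse].self,
    from: Data(contentsOf: url)
)

// Scoring is a pure function of the recording, so re-running this
// program over the same file always prints the same numbers.
for r in responses {
    let score = r.renderedTree.isEmpty ? 0.0 : 100.0  // stand-in for the rubric
    print("\(r.prompt): \(score)")
}
```

Because scoring never touches the network, the eval step is identical across providers; only the record step changes when you swap `--provider` or `--model`.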