How three frontier models render the same canvas.
Three models, three price points, all near-ceiling. The Haiku gap is recipe-choice variance, driven almost entirely by a single prompt: reasonable model behavior, not a framework failure.
Every prompt, every dimension.
| Prompt | Opus | Sonnet | Haiku |
|---|---|---|---|
| I'm bored. | 100.0 | 100.0 | 100.0 |
| Help me focus for the next hour. | 100.0 | 100.0 | 100.0 |
| Set a 5-minute timer. | 100.0 | 100.0 | 100.0 |
| Let's play tic-tac-toe, you go first. | 100.0 | 100.0 | 100.0 |
| Show me my Q3 dashboard. | 100.0 | 100.0 | 100.0 |
| Help me plan tomorrow with three priorities and a date picker. | 100.0 | 100.0 | 100.0 |
| Show my expenses last month with category filters. | 100.0 | 100.0 | 97.4 |
| Explain set theory with examples. | 100.0 | 100.0 | 81.8 |
| Teach me Python loops with an interactive example. | 100.0 | 100.0 | 100.0 |
| I have an hour. What should I do? | 100.0 | 100.0 | 100.0 |
A deterministic rubric.
We score each rendered tree against five dimensions, then take the weighted mean. There is no LLM judge; every score is reproducible by re-running the scorer on the same recording.
The corpus is 12 prompts spanning recipes, widgets, and freeform layouts. A model is eligible when its overall score is at least 95%.
- **Wired.** Every interactive node carries an action; orphan taps are impossible.
- **Coherent.** The layout regularizer didn't have to fix anything.
- **On-style.** Heuristic match against the prompt's StyleBrief prefer/avoid items.
- **On-shape.** The top-level node matches the expected contract; the minimum interactive count is met.
- **Interactivity.** The scripted user interaction succeeds against the rendered tree.
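To make the arithmetic concrete, here is a minimal sketch of the weighted mean and the 95% gate. Everything named below (the `DimensionScores` type, `overall`, `passes`) is an illustrative assumption, not the framework's actual API, and the weights are invented for the example.

```swift
// A sketch only: these weights are assumed for illustration and sum
// to 1.0; the framework's real weighting lives in its scorer.
struct DimensionScores {
    var wired: Double          // 0–100
    var coherent: Double       // 0–100
    var onStyle: Double        // 0–100
    var onShape: Double        // 0–100
    var interactivity: Double  // 0–100
}

func overall(_ s: DimensionScores) -> Double {
    // Weighted mean across the five dimensions (hypothetical weights).
    0.25 * s.wired
        + 0.20 * s.coherent
        + 0.15 * s.onStyle
        + 0.20 * s.onShape
        + 0.20 * s.interactivity
}

func passes(_ s: DimensionScores) -> Bool {
    overall(s) >= 95.0  // the eligibility threshold from the rubric
}

// A hypothetical mixed run: one low-weight dimension dips to 80.
let mixed = DimensionScores(wired: 100, coherent: 100, onStyle: 80,
                            onShape: 100, interactivity: 100)
print(overall(mixed), passes(mixed))  // 97.0 true
```

With these invented weights, a dip to 80 on a low-weight dimension still clears the gate, which is how a sub-100 prompt score can coexist with overall eligibility.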
Two commands. Same results.
```bash
swift run ainativeui eval-record \
  --provider anthropic \
  --model claude-opus-4-7 \
  --corpus Tests/Eval/Fidelity/corpus.json \
  --output recordings/opus-$(date +%Y-%m-%d).json
```

```bash
swift run ainativeui eval \
  --corpus Tests/Eval/Fidelity/corpus.json \
  --responses recordings/opus-2026-05-07.json
```

Recording costs (May 2026): Opus ~$3–7 · Sonnet ~$1–3 · Haiku ~$0.30–1 · On-device $0
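The split is the point of the workflow: `eval-record` pays the provider once and freezes the responses to disk, while `eval` re-scores that file offline. Here is a minimal sketch of the replay side, assuming a hypothetical recording schema; `RecordedResponse` and its fields are invented here, not the framework's actual JSON layout.

```swift
import Foundation

// Hypothetical recording schema, for illustration only.
struct RecordedResponse: Codable {
    let prompt: String
    let renderedTree: String  // serialized component tree from the model
}

// Load a frozen recording; no provider call happens from here on.
let url = URL(fileURLWithPath: "recordings/opus-2026-05-07.json")
let responses = try! JSONDecoder().decode(
    [RecordedResponse].self,
    from: Data(contentsOf: url)
)

// Scoring is a pure function of the recording, so re-running this
// program over the same file always prints the same numbers.
for r in responses {
    let score = r.renderedTree.isEmpty ? 0.0 : 100.0  // stand-in for the rubric
    print("\(r.prompt): \(score)")
}
```

Because scoring never touches the network, the eval step is identical across providers; only the record step changes when you swap `--provider` or `--model`.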