Same briefings on a different model. This run confirms that the edge comes from the briefing content, not the specific LLM.
| Arm | Return |
|---|---|
| Treatment (Briefings) | -2.03% |
| Control (Price Only) | -6.49% |
| Placebo (Stale Briefings) | -6.79% |
Setup: the model was switched from Opus 4.6 to Sonnet 4.5. Apart from a test window that ends 8 days earlier, everything else was identical to Run 1.
BTC fell from $108K to roughly $85K between September and late November 2025, with the steepest drop in the final week. This window ends 8 days earlier than Run 1's, cutting off just before the November selloff bottomed. The macro backdrop featured rising Treasury yields and broad risk aversion.
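For scale, the start and end prices quoted above imply an approximate buy-and-hold decline for the window, which puts the arm returns in the table in context. This is a minimal sketch that assumes the rounded figures from the text ($108K and $85K) stand in for the actual window open and close; the exact endpoints used in the run may differ.

```python
# Rounded prices taken from the text ("$108K to roughly $85K").
# Assumption: these approximate the window's actual open and close.
START_PRICE = 108_000
END_PRICE = 85_000

def pct_change(start: float, end: float) -> float:
    """Simple percentage return from start to end."""
    return (end - start) / start * 100

btc_return = pct_change(START_PRICE, END_PRICE)
print(f"Approx. BTC buy-and-hold return: {btc_return:.1f}%")  # -21.3%
```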
Switching from Opus 4.6 to Sonnet 4.5 was the only change to the agent setup. The treatment arm still outperformed control by +4.46pp (-2.03% vs -6.49%), confirming that the edge comes from the briefing content rather than model-specific reasoning. Sonnet 4.5 is a smaller, faster model, yet it extracted nearly the same value from the same briefings.
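The +4.46pp figure is simply the difference between two arm returns. A quick sketch of that arithmetic, using the three returns from the table at the top of this section (the dictionary and helper names are illustrative, not part of the test harness):

```python
# Per-arm returns (%) for this run, copied from the results table.
returns = {
    "treatment": -2.03,
    "control": -6.49,
    "placebo": -6.79,
}

def spread_pp(a: float, b: float) -> float:
    """Difference between two percentage returns, in percentage points."""
    return round(a - b, 2)

edge_vs_control = spread_pp(returns["treatment"], returns["control"])
edge_vs_placebo = spread_pp(returns["treatment"], returns["placebo"])

print(f"Treatment vs control: {edge_vs_control:+.2f}pp")  # +4.46pp
print(f"Treatment vs placebo: {edge_vs_placebo:+.2f}pp")  # +4.76pp
```

The gap is reported in percentage points rather than percent because it subtracts one percentage return from another; it is not a relative change between the two returns.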
The control arm on Sonnet 4.5 behaved similarly to Run 1's control: it held long through most of the window, occasionally trimming on large red candles but never going short. Its -6.49% return was slightly better than Run 1's control (-8.44%), likely because this window ends earlier and misses the deepest part of the November selloff.
The placebo arm performed worse here (-6.79%) than in Run 1 (-5.19%). With stale briefings, Sonnet 4.5 appeared more confident in its (wrong) directional calls. It went long aggressively in late October based on outdated regime signals, then held through the drawdown. This hints that smaller models may be more susceptible to stale data than larger ones.