Full transparency on how we test whether PreReason briefings improve AI trading agent decisions. Our methodology uses a randomized controlled trial design with 4 arms and 8 controlled variables.
The 4 Arms
Treatment (Briefings) -- Receives fresh PreReason briefings (btc.energy, cross.regime, cross.breadth, btc.momentum) with a structured decision framework. This is what a paying customer would receive.
Control (Price Only) -- Receives only the current BTC price and a generic trading prompt. No market context, no signals, no regime classification. The baseline.
Placebo (Stale Briefings) -- Receives the same briefing format as Treatment, but with data frozen at August 2025 (within the model's training window, so the model may already know it). Tests whether the value comes from the format or from fresh data.
Control-WS (Web Search) -- Receives live web search results about Bitcoin markets instead of briefings. Tests whether the value comes from having any live information at all, or specifically from PreReason's structured analysis.
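The four arms differ only in what context the agent receives. A minimal sketch of how the arms could be encoded as configuration (the class and field names here are hypothetical, not our actual code):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Arm:
    name: str
    context_source: Optional[str]  # where market context comes from, if any
    data_vintage: Optional[str]    # "live" or a frozen snapshot date

# Hypothetical encoding of the four experimental arms described above.
ARMS = [
    Arm("treatment",  context_source="prereason_briefings", data_vintage="live"),
    Arm("control",    context_source=None,                  data_vintage=None),
    Arm("placebo",    context_source="prereason_briefings", data_vintage="2025-08"),
    Arm("control_ws", context_source="web_search",          data_vintage="live"),
]
```

Holding everything else fixed, any performance gap between Treatment and Placebo isolates the value of fresh data, while the gap between Treatment and Control-WS isolates the value of structured analysis over raw information.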
8 Controlled Variables
Same LLM -- All arms use the same model (Opus 4.6 or Sonnet 4.5) within each run
Same Trading Rules -- Identical position limits, allowed actions, and portfolio constraints
Same Fee Structure -- 0.045% transaction fees, 0.01% funding per 8 hours for all arms
Same Tick Frequency -- All arms process ticks at the same intervals
Fresh Context Per Tick -- Each tick runs in a fresh Claude Code agent with no memory of prior ticks
Shared Portfolio State -- Portfolio carryover via deterministic disk replay, identical across arms
Full Audit Trail -- Every decision, reasoning trace, and portfolio state is recorded
No Human Intervention -- Agent decisions are never overridden or altered
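The fee structure applied identically across arms can be sketched as follows (function names are illustrative; the rates are the ones stated above):

```python
TX_FEE_RATE = 0.00045      # 0.045% per transaction, applied to notional traded
FUNDING_RATE_8H = 0.0001   # 0.01% funding per 8-hour period on open positions

def transaction_fee(notional: float) -> float:
    """Fee charged on each trade, proportional to notional traded."""
    return notional * TX_FEE_RATE

def funding_cost(position_notional: float, hours_held: float) -> float:
    """Funding accrues pro-rata per 8-hour window on the open position."""
    return position_notional * FUNDING_RATE_8H * (hours_held / 8.0)
```

For example, a $10,000 trade pays $4.50 in transaction fees, and a $10,000 position held for 24 hours accrues $3.00 in funding, regardless of which arm placed the trade.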
Training Data Cutoff
Opus 4.6 was trained through August 2025. Every tick in our backtests runs from September 2025 through March 2026. The model has never seen this data during training. This is genuine out-of-distribution testing.