teranode Benchmark: public dataset

Show your work.

Every model-pin choice on the Council is backed by a head-to-head comparison persisted to a Postgres trend table. This page surfaces that table — the actual data, queried at request time, no curated summaries. If the methodology says “decisive eval win, ship,” this is where you check the receipts.

What this is, and what it isn't. This is a benchmark of how the Council selects among AI models for each reasoner role. It is not investment, tax, legal, or financial advice; it does not measure investment outcomes; it does not imply that any single output should be acted upon without review by a qualified professional. Teranode is software infrastructure. Decision records produced by the Council are tools used by registered advisors at their sole discretion.

01

The aggregate.

48 comparisons run
2 judgments on file
14 distinct models tested
100% tied comparisons

Live from production eval_runs. Counts exclude deleted rows. Date range: up to Sun, 24 May 2026 09:02:19 GMT.

02

Comparisons by varied role.

Each row is a Council role. Total = how many head-to-heads varied this role's model pin. Judged = how many of those received a founder judgment. Tied = how many came back with no decisive winner.

Role        Total  Judged  Tied  Decisive judgment rate
chairman    32     1       1     0%
contrarian  4      0       0
builder     4      0       0
critic      4      0       0
strategist  4      0       0
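The per-role columns above can be reproduced from a flat list of comparisons. The following is a minimal sketch, not the Council's actual query: the `Comparison` type, the judgment encoding (`"a"`, `"b"`, `"tie"`, or `None`), and the function name are all hypothetical. It also assumes that "decisive judgment rate" means the share of judged comparisons that named a winner, which is consistent with the chairman row but is not stated on this page.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass
class Comparison:
    role: str                 # the Council role whose model pin was varied
    judgment: Optional[str]   # hypothetical encoding: "a", "b", "tie", or None if unjudged


def summarize_by_role(comparisons):
    """Return {role: (total, judged, tied, decisive_rate)}."""
    counts = defaultdict(lambda: [0, 0, 0])  # total, judged, tied
    for c in comparisons:
        row = counts[c.role]
        row[0] += 1
        if c.judgment is not None:
            row[1] += 1
            if c.judgment == "tie":
                row[2] += 1
    summary = {}
    for role, (total, judged, tied) in counts.items():
        # Assumed definition: share of judged comparisons that named a winner.
        # Roles with no judgments get None rather than a misleading 0%.
        rate = (judged - tied) / judged if judged else None
        summary[role] = (total, judged, tied, rate)
    return summary


# Rows shaped like the table above: 32 chairman comparisons with one tied
# judgment, and 4 unjudged contrarian comparisons.
sample = [Comparison("chairman", "tie")] + [Comparison("chairman", None)] * 31
sample += [Comparison("contrarian", None)] * 4
summary = summarize_by_role(sample)
```

Returning `None` for roles with no judgments mirrors the blank rate cells in the table, so an unjudged role is not confused with one whose judgments were all ties.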
03

Per-model win-loss record.

Every model that has appeared in a head-to-head comparison. Sorted by decisive win rate: wins ÷ (wins + losses). Ties and pending judgments are excluded from the rate.

Model                   Vendor     Comparisons  Wins  Losses  Ties  Pending  Win rate
claude-opus-4.7         Anthropic  30           0     0       0     30
grok-build-0.1          xAI        4            0     0       0     4
gemini-3.5-flash        Google     1            0     0       0     1
claude-opus-4.7-fast    Anthropic  7            0     0       0     7
gemini-3.1-flash-lite   Google     5            0     0       0     5
gpt-chat-latest         OpenAI     2            0     0       0     2
grok-4.3                xAI        3            0     0       0     3
grok-4.20               xAI        4            0     0       0     4
deepseek-r1             DeepSeek   4            0     0       0     4
gpt-5.4-pro             OpenAI     4            0     0       0     4
o3-pro                  OpenAI     20           0     0       2     18
o3                      OpenAI     4            0     0       0     4
gemini-2.5-pro          Google     4            0     0       0     4
gemini-3.1-pro-preview  Google     4            0     0       0     4
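The win-rate rule for this table can be sketched in a few lines. This is an illustration under assumptions, not the page's actual implementation: the function names are hypothetical, and the choice to sort models without any decisive result last is a guess at sensible behavior. The `model-x` and `model-y` records are invented purely to show the ordering.

```python
from typing import Optional


def decisive_win_rate(wins: int, losses: int) -> Optional[float]:
    """wins / (wins + losses); None when there are no decisive results."""
    decisive = wins + losses
    return wins / decisive if decisive else None


def sort_by_win_rate(records):
    """Sort (model, wins, losses) tuples by win rate, unrated models last."""
    def key(record):
        _, wins, losses = record
        rate = decisive_win_rate(wins, losses)
        # False sorts before True, so rated models come first;
        # the negated rate puts the highest win rate at the top.
        return (rate is None, -(rate or 0.0))
    return sorted(records, key=key)


records = [
    ("o3-pro", 0, 0),     # as in the table: no decisive results yet
    ("model-x", 3, 1),    # hypothetical record, for illustration only
    ("model-y", 1, 1),    # hypothetical record, for illustration only
]
ranked = [model for model, _, _ in sort_by_win_rate(records)]
```

Because ties and pending judgments never enter the denominator, every model in the table above currently has an undefined rate, which is why the Win rate column is empty.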
04

Raw data.

The full comparison list — id, varied role, model_a, model_b, judgment, dates — is available as JSON. Scenario prompts and model outputs are NOT included in the public export to preserve scenario-author privacy. Per-comparison detail (including outputs) is available under NDA for design partners and qualified investors.

Download benchmark.json
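As a rough sketch of how the export might be consumed, the snippet below tallies a per-model record from comparison rows. The JSON shape is an assumption: the page names the fields (id, varied role, model_a, model_b, judgment, dates) but not their exact keys or encodings, so `varied_role` and the judgment values `"a"`, `"b"`, `"tie"`, and `null` are hypothetical, as are the placeholder model names.

```python
import json
from collections import Counter, defaultdict

# Hypothetical export shape with placeholder models; the real
# benchmark.json may use different keys and judgment values.
SAMPLE = """
[
  {"id": 1, "varied_role": "chairman", "model_a": "model-x", "model_b": "model-y", "judgment": null},
  {"id": 2, "varied_role": "chairman", "model_a": "model-x", "model_b": "model-z", "judgment": "tie"},
  {"id": 3, "varied_role": "critic", "model_a": "model-z", "model_b": "model-y", "judgment": "a"}
]
"""


def tally(rows):
    """Build {model: Counter(comparisons, wins, losses, ties, pending)}."""
    record = defaultdict(Counter)
    for row in rows:
        a, b, judgment = row["model_a"], row["model_b"], row["judgment"]
        for model in (a, b):
            record[model]["comparisons"] += 1
        if judgment is None:
            # No judgment on file yet: pending for both sides.
            record[a]["pending"] += 1
            record[b]["pending"] += 1
        elif judgment == "tie":
            record[a]["ties"] += 1
            record[b]["ties"] += 1
        else:
            winner, loser = (a, b) if judgment == "a" else (b, a)
            record[winner]["wins"] += 1
            record[loser]["losses"] += 1
    return record


record = tally(json.loads(SAMPLE))
```

Each comparison counts once for both of its models, so per-model comparison counts sum to twice the number of comparisons in the file.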

Methodology: /methodology · reliability graph (live): /reliability-graph · sample decision record: /design/record

05

Boundaries.

  • This benchmark measures model selection, not investment outcomes. Win rate above means “judged to produce a higher-quality reasoner output for that role.” It does NOT mean any output predicts market behavior, returns, or client outcomes.
  • Judgments are founder judgments today. The eval scaffold is built to scale to design-partner judgments once contracted, and to outcome-signal judgments once those exist. Until then, the data is single-rater. Inter-rater audit available on request.
  • Teranode is software, not a registered investment adviser. Decision records produced by the Council are inputs to a registered advisor's process. They are not advice and should not be relied upon as such.
  • Models change. Pins change. The model-policy.json file is versioned and the live reliability graph at /reliability-graph reflects today's state. Past comparisons remain on the record and downloadable.