Teranode Benchmark: public dataset

Show your work.

Every model-pin choice on the Council is backed by a head-to-head comparison persisted to a Postgres trend table. This page surfaces that table — the actual data, queried at request time, no curated summaries. If the methodology says “decisive eval win, ship,” this is where you check the receipts.

What this is, and what it isn't. This is a benchmark of how the Council selects among AI models for each reasoner role. It is not investment, tax, legal, or financial advice; it does not measure investment outcomes; it does not imply that any single output should be acted upon without review by a qualified professional. Teranode is software infrastructure. Decision records produced by the Council are tools used by registered advisors at their sole discretion.

The aggregate.

48 comparisons run

2 judgments on file

14 distinct models tested

100% tied comparisons

Live from production eval_runs. Counts exclude deleted rows. Date range: up to Sun, 24 May 2026 09:02:19 GMT.

Comparisons by varied role.

Each row is a Council role. Total = how many head-to-heads varied this role's model pin. Judged = how many of those received a founder judgment. Tied = how many came back with no decisive winner.

Role	Total	Judged	Tied	Decisive judgment rate
chairman	32	1	1	0%
contrarian	4	0	0	—
builder	4	0	0	—
critic	4	0	0	—
strategist	4	0	0	—
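The last column appears to be the share of judged comparisons that did not come back tied. A minimal Python sketch under that assumption follows. The formula is inferred from the chairman row rather than stated on this page, and the names `decisive_judgment_rate` and `format_rate` are hypothetical.

```python
from typing import Optional


def decisive_judgment_rate(judged: int, tied: int) -> Optional[float]:
    """Fraction of judged comparisons that produced a decisive winner.

    Assumed formula: (judged - tied) / judged. Returns None when nothing
    has been judged yet, which the table renders as a dash.
    """
    if judged < 0 or tied < 0 or tied > judged:
        raise ValueError("expected 0 <= tied <= judged")
    if judged == 0:
        return None
    return (judged - tied) / judged


def format_rate(rate: Optional[float]) -> str:
    """Render a rate as a whole percentage, or a dash placeholder."""
    return "-" if rate is None else f"{rate:.0%}"


# (judged, tied) pairs copied from the table above.
roles = {
    "chairman": (1, 1),
    "contrarian": (0, 0),
    "builder": (0, 0),
    "critic": (0, 0),
    "strategist": (0, 0),
}

for role, (judged, tied) in roles.items():
    print(role, format_rate(decisive_judgment_rate(judged, tied)))
```

Returning `None` for unjudged roles keeps "no data" distinct from a genuine 0%, which matches how the table separates the chairman row from the others.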

Per-model win-loss record.

Every model that has appeared in a head-to-head comparison. Sorted by decisive win rate (wins ÷ (wins + losses)). Ties and pending judgments are excluded from the rate.

Model	Vendor	Comparisons	Ties	Pending	Win rate
`claude-opus-4.7`	Anthropic	30	0	30	—
`grok-build-0.1`	xAI	4	0	4	—
`gemini-3.5-flash`	Google	1	0	1	—
`claude-opus-4.7-fast`	Anthropic	7	0	7	—
`gemini-3.1-flash-lite`	Google	5	0	5	—
`gpt-chat-latest`	OpenAI	2	0	2	—
`grok-4.3`	xAI	3	0	3	—
`grok-4.20`	xAI	4	0	4	—
`deepseek-r1`	DeepSeek	4	0	4	—
`gpt-5.4-pro`	OpenAI	4	0	4	—
`o3-pro`	OpenAI	20	2	18	—
`o3`	OpenAI	4	0	4	—
`gemini-2.5-pro`	Google	4	0	4	—
`gemini-3.1-pro-preview`	Google	4	0	4	—
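The win-rate rule described for this table can be sketched in a few lines of Python. The function name `win_rate` is hypothetical. The sketch also assumes that every comparison is exactly one of win, loss, tie, or pending, which this page does not state outright.

```python
from typing import Optional


def win_rate(wins: int, losses: int) -> Optional[float]:
    """Decisive win rate: wins / (wins + losses).

    Ties and pending judgments are never passed in, so they cannot affect
    the rate. Returns None when there are no decisive results at all.
    """
    if wins < 0 or losses < 0:
        raise ValueError("wins and losses must be non-negative")
    decisive = wins + losses
    if decisive == 0:
        return None
    return wins / decisive


# o3-pro row from the table above: 20 comparisons, 2 ties, 18 pending.
comparisons, ties, pending = 20, 2, 18
decisive = comparisons - ties - pending  # assumed: what remains is wins + losses
print(decisive)            # 0 decisive results so far
print(win_rate(0, 0))      # None, which the table shows as a dash

# Illustrative numbers only, not taken from the dataset.
print(win_rate(3, 1))      # 0.75
```

Because every row currently has zero decisive results, every win rate is undefined, which is why the whole column shows a dash.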

Raw data.

The full comparison list — id, varied role, model_a, model_b, judgment, dates — is available as JSON. Scenario prompts and model outputs are NOT included in the public export to preserve scenario-author privacy. Per-comparison detail (including outputs) is available under NDA for design partners and qualified investors.

Download benchmark.json
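For readers who want to rebuild the per-model table from the export, here is a rough Python sketch. The JSON shape is an assumption. The key names `varied_role`, `model_a`, `model_b`, and `judgment`, and the judgment values `"a"`, `"b"`, `"tie"`, and `null`, are guesses based on the field list above, and the sample rows use placeholder models rather than real data. Check the actual file before relying on any of these names.

```python
import json
from collections import defaultdict

# Hypothetical export shape. Key names and judgment values are assumptions,
# and the rows are placeholders, not records from the real benchmark.json.
SAMPLE_EXPORT = """
[
  {"id": 1, "varied_role": "chairman", "model_a": "model-x", "model_b": "model-y", "judgment": "a"},
  {"id": 2, "varied_role": "chairman", "model_a": "model-x", "model_b": "model-y", "judgment": "tie"},
  {"id": 3, "varied_role": "critic", "model_a": "model-y", "model_b": "model-z", "judgment": null}
]
"""


def tally(comparisons):
    """Build a per-model record of wins, losses, ties, and pending judgments."""
    records = defaultdict(lambda: {"wins": 0, "losses": 0, "ties": 0, "pending": 0})
    for row in comparisons:
        a, b, judgment = row["model_a"], row["model_b"], row.get("judgment")
        if judgment is None:
            # Not judged yet: counts as pending for both sides.
            records[a]["pending"] += 1
            records[b]["pending"] += 1
        elif judgment == "tie":
            records[a]["ties"] += 1
            records[b]["ties"] += 1
        elif judgment in ("a", "b"):
            winner, loser = (a, b) if judgment == "a" else (b, a)
            records[winner]["wins"] += 1
            records[loser]["losses"] += 1
        else:
            raise ValueError(f"unknown judgment: {judgment!r}")
    return dict(records)


records = tally(json.loads(SAMPLE_EXPORT))
for model, rec in sorted(records.items()):
    print(model, rec)
```

To run it against the real download, replace `json.loads(SAMPLE_EXPORT)` with a `json.load` call on the opened file and adjust the key names to match what the export contains.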

Methodology: /methodology · reliability graph (live): /reliability-graph · sample decision record: /design/record

Boundaries.

This benchmark measures model selection, not investment outcomes. Win rate above means “judged to produce a higher-quality reasoner output for that role”; it does NOT mean any output predicts market behavior, returns, or client outcomes.
Judgments are founder judgments today. The eval scaffold is built to scale to design-partner judgments once contracted, and to outcome-signal judgments once those exist. Until then, the data is single-rater. Inter-rater audit available on request.
Teranode is software, not a registered investment adviser. Decision records produced by the Council are inputs to a registered advisor's process. They are not advice and should not be relied upon as such.
Models change. Pins change. The model-policy.json file is versioned and the live reliability graph at /reliability-graph reflects today's state. Past comparisons remain on the record and downloadable.