Rationale infrastructure — how the record gets built.
Teranode produces one thing: a signed, timestamped, role-attributed rationale record for AI-assisted advisory recommendations. Reg BI (17 CFR 240.15l-1) requires a documented basis. IAA §206 and the Marketing Rule (206(4)-1) require substantiation. Books-and-records rules (17a-4 / 204-2) require retention. The artifact that satisfies those requirements is the product. This page explains how it gets built — which models we route to each role, which prompts each role uses, which evals back each pin, and how we've made the routing improve over time.
The SEC's 2026 Examination Priorities (released 2025-11-17) named adviser use of AI and the documentation of recommendations as focus areas. The Council's output format is designed to be the rationale a Reg BI examiner, an IAA §206 books-and-records audit, or an internal compliance review can read, retain, and verify.
What the eval scaffold does.
We run an in-product eval scaffold at /admin/eval that supports head-to-head A/B comparison. Two modes:
- Compare models. Vary one role's model; hold the other four roles and the chairman's prompt constant. Run the same scenario through both variants. Reasoner outputs run independently per side, but at the council level both A and B see the same set of role policy models, so the only systematic difference is the varied role's model.
- Compare chairman prompt versions. For chairman prompt changes specifically, we run the four reasoners ONCE per scenario and feed identical anonymized outputs to two prompt versions of the chairman in parallel. True prompt-isolation A/B: eliminates reasoner noise as a confound. See runMeetingCouncilSharedReasoners in the source.
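The shared-reasoner mode can be sketched roughly as follows. This is a minimal illustration, not the code behind runMeetingCouncilSharedReasoners: the function names, types, and callback signatures here are all hypothetical, and the reasoner and chairman calls are passed in as stubs.

```typescript
// Hypothetical types; the real orchestrator's shapes are not shown on this page.
type AnonymizedOutput = { label: string; text: string };
type ChairmanCall = (promptVersion: string, inputs: AnonymizedOutput[]) => Promise<string>;

// Replace role identities with neutral labels (reasoner_A, reasoner_B, ...).
function anonymizeOutputs(rawOutputs: string[]): AnonymizedOutput[] {
  return rawOutputs.map((text, i) => ({
    label: `reasoner_${String.fromCharCode(65 + i)}`,
    text,
  }));
}

async function compareChairmanPrompts(
  runReasoners: () => Promise<string[]>,
  chairman: ChairmanCall,
  versionA: string,
  versionB: string,
): Promise<{ a: string; b: string; inputs: AnonymizedOutput[] }> {
  // Run the reasoners once per scenario, not once per side.
  const inputs = anonymizeOutputs(await runReasoners());
  // Both chairman prompt versions receive the identical inputs, in parallel.
  const [a, b] = await Promise.all([
    chairman(versionA, inputs),
    chairman(versionB, inputs),
  ]);
  return { a, b, inputs };
}
```

Because both sides read the same `inputs` array, any difference between `a` and `b` can only come from the chairman prompt version (plus the chairman model's own sampling noise).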
Every comparison persists to a Postgres trend table (eval_runs). The system supports founder judgment today, design-partner judgment once contracted, and outcome-signal judgment once those exist.
What we've run so far.
Live count from production eval_runs. Updated on every page render.
Current model-pin matrix — five of five eval-backed.
Every Council role's model pin is backed by a documented head-to-head comparison. Two pins were upgraded based on decisive eval wins. Three were held: two on tied results plus secondary criteria (vendor diversification), and one by architectural override despite a candidate win (architectural distinctness).
- claude-opus-4.7 (Anthropic): tie 2-2 vs openai/o3 · vendor diversification breaks tie
- gpt-5.4 (OpenAI): candidate o3-pro won 3-1 on quality but already pinned for Chairman; architectural override preserves the five-distinct-reasoners premise
- gemini-3.1-pro-preview (Google): won 3-1 vs gemini-2.5-pro · sharper framing, more concrete recommendations
- deepseek-r1 (DeepSeek): won 4-0 vs grok-4.20 · genuinely contrarian framing where prior pin produced soft consensus
- claude-opus-4.7 (Anthropic): emergency pin to claude-opus-4.7 (2026-04-27) · eval-tied with o3-pro; reverting to o3-pro pending latency normalization
Four distinct vendors across five roles: Anthropic, OpenAI, Google, DeepSeek. The cleanest possible expression of "we route to the best frontier model per role, regardless of provider."
Decision rules — when do we ship a pin change?
- Decisive eval win. If the candidate wins 3 of 4 or 4 of 4 representative scenarios on quality, we ship the upgrade. The strategist gain (gemini-3.1-pro-preview, 3-1) and the contrarian gain (deepseek-r1, 4-0) both met this bar.
- Tied eval result. Defend the current pin by secondary criteria — typically vendor diversification. The chairman and critic both came back tied; current pins maintained because switching would concentrate roles on a single vendor.
- Architectural override. Even when a candidate wins on quality, if shipping it would duplicate an existing pin (same model on two roles), we hold. The builder eval showed 3-1 for o3-pro, but o3-pro was already in the chairman candidate pool at the time, and duplicating it would defeat the "five distinct reasoners" premise. Builder pin stays at gpt-5.4-pro pending a non-duplicating candidate.
- Operational override. When a pin's latency or availability degrades a live demo, we ship an emergency pin shift to a vendor with eval-tied quality and normalize later. The chairman pin shifted from o3-pro to claude-opus-4.7 on 2026-04-27 after o3-pro's p99 latency exceeded the function timeout. Eval result said tied; operational reality picked the winner. Pin is reversible once o3-pro latency normalizes.
- Methodology integrity overrides results. We discovered mid-eval that prompt-comparison runs with independent reasoners produced false-divergence signals (chairman v1 vs v2 appeared to diverge on one scenario). We fixed the methodology (shared-reasoner orchestrator) before drawing conclusions. Both versions were equivalent under clean isolation.
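The first three rules above can be read as a small decision function. The sketch below assumes a four-scenario eval and uses hypothetical names (`EvalResult`, `decidePin`); it leaves out the operational and methodology overrides, which depend on signals outside the eval result.

```typescript
// Hypothetical shapes for illustration; not the production types.
interface EvalResult {
  candidate: string;      // model id of the challenger
  candidateWins: number;  // scenarios won by the candidate (out of 4)
  currentWins: number;    // scenarios won by the current pin
}

type PinDecision =
  | "ship_upgrade"
  | "hold_architectural_override"
  | "defend_current_pin";

function decidePin(
  result: EvalResult,
  modelsPinnedOnOtherRoles: readonly string[],
): PinDecision {
  // Decisive win: 3 of 4 or 4 of 4 scenarios.
  const decisive = result.candidateWins >= 3;
  const wouldDuplicate = modelsPinnedOnOtherRoles.indexOf(result.candidate) !== -1;

  if (decisive && wouldDuplicate) return "hold_architectural_override";
  if (decisive) return "ship_upgrade";
  // Ties (and losses) keep the current pin; secondary criteria such as
  // vendor diversification are applied by a human reviewer, not here.
  return "defend_current_pin";
}
```

For example, `decidePin({ candidate: "o3-pro", candidateWins: 3, currentWins: 1 }, ["o3-pro"])` holds the pin, matching the builder case described above.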
The engine path.
The Council's baseline architecture is Mixture-of-Agents: five role prompts run in parallel against the same input, and a Chairman synthesizes their outputs into a structured decision record. That's the foundation. Three additional architectural moves are available, each user-selectable and each independently verifiable on the artifact.
Phase 0 — Dynamic dissent attribution
When the Chairman synthesizes, it identifies the strongest counter-argument from across all four role outputs — judging on substance, not on which role produced it. The role tied to that counter-argument becomes the dissent attribution on the public artifact ("Preserved dissent · attributed to Risk role" / "Strategist role" / "Contrarian role").
Earlier versions of the Council always attributed dissent to the Contrarian slot. That conflated structural position with substantive divergence. The current chairman prompt (v3) returns an explicit strongest_objection_source field; the orchestrator resolves the anonymized reasoner label back to the real role server-side. The Chairman never sees real role identities — its judgment runs against anonymized inputs to prevent label-bias.
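A minimal sketch of the server-side resolution step, assuming the chairman returns the anonymized label in strongest_objection_source and the orchestrator keeps a label-to-role map it built during anonymization. The interface and function names are hypothetical.

```typescript
// Hypothetical shape of the chairman's structured output (v3 prompt).
interface ChairmanOutput {
  decision: string;
  strongest_objection_source: string; // anonymized label, e.g. "reasoner_B"
}

// The map is built server-side when outputs are anonymized; the chairman never sees it.
function resolveDissentRole(
  output: ChairmanOutput,
  labelToRole: Readonly<Record<string, string>>,
): string {
  const label = output.strongest_objection_source;
  if (!Object.prototype.hasOwnProperty.call(labelToRole, label)) {
    throw new Error(`Unknown reasoner label: ${label}`);
  }
  return labelToRole[label];
}
```

Failing loudly on an unknown label is a design choice in this sketch: a silent fallback to a fixed slot would reintroduce the structural attribution the current prompt was meant to remove.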
Phase 1 — Targeted cross-examination
On user-selected battle-tested runs, the Chairman's first synthesis is followed by a targeted second round: the dissenting role is given the Chairman's decision and reasoning, then explicitly asked to stand by, modify, or withdraw the objection. The Chairman re-synthesizes with that response added to the inputs (under the same anonymized label, so role identity stays hidden). The artifact records both the original objection and the cross-examination response, plus a decision_revised flag indicating whether the Chairman changed the conclusion after engaging the dissent.
This is the architectural delta from Mixture-of-Agents to genuine multi-agent reasoning: one targeted exchange where the disagreement actually matters, decided by the Chairman's substance-based judgment of which role produced the strongest objection.
Phase 3 — Adversarial regulator review
On battle-tested runs, the Chairman's final synthesis (post-cross-examination, when applicable) is reviewed by a sixth model running in adversarial-FINRA-examiner mode. It reviews specifically for the failure modes a real examiner would surface in an enforcement letter: Marketing Rule 206(4)-1 issues, Investment Advisers Act §206 conflicts, FINRA Rule 2111 / Reg BI suitability gaps, books-and-records (17a-4 / 204-2) implications, and the AI-specific exam priorities the SEC published 2025-11-17.
On a "pass" verdict, the artifact carries a verification stamp. On a "gap" verdict, the Chairman re-synthesizes once with the examiner's correction note appended; both the original gap (with rule citation) and the applied correction are documented on the artifact. The verification discipline is the feature — the artifact reads "FINRA-examiner-mode reviewed · no gap identified" or "gap → corrected" with the specific rule and the substance of the correction.
The examiner is intentionally conservative: a borderline case is a pass. False positives degrade artifact quality more than false negatives, so the prompt instructs the examiner to flag only what a real examiner would call out.
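The pass/gap handling described above might look like the following. This is a sketch under stated assumptions: the verdict shape, the `resynthesize` callback, and the artifact fields are hypothetical, and the calls are shown synchronously for brevity.

```typescript
// Hypothetical verdict shape returned by the examiner model.
type ExaminerVerdict =
  | { verdict: "pass" }
  | { verdict: "gap"; rule: string; correction: string };

interface ReviewedArtifact {
  synthesis: string;
  stamp: string;
  gap?: { rule: string; correction: string };
}

function applyExaminerReview(
  synthesis: string,
  verdict: ExaminerVerdict,
  resynthesize: (synthesis: string, correctionNote: string) => string,
): ReviewedArtifact {
  if (verdict.verdict === "pass") {
    return { synthesis, stamp: "FINRA-examiner-mode reviewed · no gap identified" };
  }
  // Gap: the chairman re-synthesizes exactly once with the correction note appended.
  // Both the original gap (with rule citation) and the correction stay on the artifact.
  return {
    synthesis: resynthesize(synthesis, verdict.correction),
    stamp: "gap → corrected",
    gap: { rule: verdict.rule, correction: verdict.correction },
  };
}
```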
Tier selection
Three user-selectable tiers map to depth/latency tradeoffs:
- Quick (~15s). Fast-tier models, single chairman pass. Sanity-check only — not a record-grade artifact.
- Standard (~45s). Deep-tier pinned models, single chairman pass. Standard advisor recommendation flow; covers most decisions.
- FINRA Battle-Tested (up to 3 min). Deep-tier + cross-examination + regulator review. Compliance-grade artifact for high-stakes records that need to survive an exam letter.
The eval scaffold (above) explicitly skips Phase 1 and Phase 3 — those are non-deterministic add-on rounds that would confound chairman A/B comparison. Eval signal stays clean.
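One way to express the tier-to-phase mapping, including the eval scaffold's skip of Phase 1 and Phase 3, is a small plan function. The tier identifiers and the `RunPlan` fields below are hypothetical names chosen for this sketch.

```typescript
type Tier = "quick" | "standard" | "battle_tested";

interface RunPlan {
  modelTier: "fast" | "deep";
  crossExamination: boolean; // Phase 1
  regulatorReview: boolean;  // Phase 3
}

function planRun(tier: Tier, evalMode: boolean = false): RunPlan {
  const battleTested = tier === "battle_tested";
  return {
    modelTier: tier === "quick" ? "fast" : "deep",
    // The eval scaffold skips both add-on rounds to keep chairman A/B comparisons clean.
    crossExamination: battleTested && !evalMode,
    regulatorReview: battleTested && !evalMode,
  };
}
```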
Privacy posture — what the Council never sees.
The Council never receives client identifiers, contact information, account numbers, or financial-institution credentials. The product is architecturally narrow on purpose: an advisor-typed scenario plus, optionally, a structured-input schema of suitability-relevant numbers. Nothing else. This is a different privacy shape than horizontal AI agents that process customer conversations or production codebases — those products touch client data by design. We do not.
The four-layer posture, verifiable in source code:
- Schema-level. The structured input schema (lib/council/structured-input.ts) collects no PII fields by design. Permitted fields are bracket numbers and suitability-relevant context (ages, income/asset brackets, time horizon, tax bracket, risk tolerance, state of residence for tax-jurisdiction reasoning, free-text goals and constraints). Forbidden fields, by absence: client name, address, SSN, account number, phone, email, full date of birth, beneficiary identifiers.
- Telemetry-level. The council_telemetry table records only structural metadata: tier, mode, domain, duration, role-failure count, cross-exam fired, examiner verdict, chairman prompt version, model policy version. No scenario text, no role outputs, no chairman synthesis, no IP, no session ID, no client identifiers. Same shape as a page-view counter.
- Storage-level. Every shared decision record is base64url-encoded into the URL itself per app/lib/share.ts. The record IS the URL: no server-side persistence of inputs or outputs after the run completes. Sharing a record means sharing a link; there is no Teranode database row to subpoena, leak, or retain past the user's control.
- Synthesis-level. The Chairman that writes the final decision record never sees the real role identities: reasoner outputs are anonymized to neutral labels (reasoner_A, reasoner_B, etc.) before synthesis, so the Chairman's judgment about which dissent is strongest is grounded in substance, not in which slot produced it. This is label-bias prevention, complementary to PII protection.
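To illustrate the schema-level layer, here is a sketch of an allowlist check that rejects any field outside the permitted set. The field names are hypothetical stand-ins for the categories listed above; the actual contents of lib/council/structured-input.ts are not reproduced here, and the real schema may enforce this differently (for example, purely through its type definition).

```typescript
// Hypothetical field names mirroring the permitted categories listed above.
const PERMITTED_FIELDS: readonly string[] = [
  "ages",
  "incomeBracket",
  "assetBracket",
  "timeHorizonYears",
  "taxBracket",
  "riskTolerance",
  "stateOfResidence",
  "goals",
  "constraints",
];

// Returns the keys that are not on the allowlist. PII fields are forbidden
// by absence: anything not explicitly permitted is reported.
function findUnpermittedFields(input: Record<string, unknown>): string[] {
  return Object.keys(input).filter((key) => PERMITTED_FIELDS.indexOf(key) === -1);
}
```

An allowlist fails closed: a new PII-shaped field has to be deliberately added to the permitted set before it can pass, rather than deliberately blocked after the fact. Free-text fields such as goals still rely on the advisor not typing identifiers into them.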
The architectural posture has direct implications for regulatory cooperation: a Reg BI examiner, an IAA §206 books-and-records audit, or an internal compliance review can read every decision record produced by the firm without Teranode appearing in any vendor-data chain that needed to be evaluated for PII handling. There is no Teranode-side data for compliance to map; the firm holds whatever records it chooses to retain, in whatever format the firm controls.
Reproducibility.
Methodology and per-scenario comparison records are documented in the repository under eval-results/. Available under NDA before contracting. The eval scaffold itself is at /admin/eval (admin-gated; access on request for evaluation partners).
Anyone with admin access can re-run any comparison in a few minutes. The four representative scenarios used in the matrix evals are available as anonymized cases at /examples (password-gated for investor review).
Public benchmark dataset: /benchmark — every head-to-head comparison row queried live from the production eval table, with anonymized JSON download. The data behind every claim on this page.
The reliability graph (V1) — what gets recorded, when each metric activates.
The eval matrix above answers which model belongs in each role. The reliability graph answers a different question: how often does the system's output get kept, edited, or rejected — and what does that say about each role's accuracy on each domain over time? That signal is computed from a structured Decision/Outcome ledger every connected vertical writes into.
API surface, live since 2026-05-05: POST /api/v1/councils/convene, POST /api/v1/decisions, POST /api/v1/decisions/:id/outcomes. Bearer-token auth with per-vertical isolation; defense-in-depth from middleware to store layer. See /reliability-graph for the live dashboard.
What every Decision captures
One row per Council deliberation, recorded immediately on completion. Schema is intentionally generic across domains (travel itineraries today; advisory drafts and tax filings queued):
- Canonical decision_id. Server-issued or client-suggested + idempotent.
- Vertical. The domain that produced the decision. Anonymized to a Greek-letter slot (α, β, γ, δ, ε) at the public boundary so the graph reads as substrate, not as a list of products. Five reserved slots in the CHECK constraint; one currently active.
- Council snapshot. Full role definitions (id, name, system prompt, weight) at the moment the decision was made; every output is replayable end-to-end even if the live model policy shifts.
- Role contributions. What each role wrote, keyed by role id. Stored as JSONB so per-role queries ("which role gets edited out most often on tasks tagged X?") are answerable without joins on the hot read path.
- Synthesis. The final answer the user saw. Stored verbatim alongside its tokens-used and latency-ms metadata.
- Source. teranode when our own convene endpoint produced it; anthropic when the caller's fallback path did, then logged forward for audit. Both contribute to the graph.
- Tags + optional subject_id. Free-form labels for slicing (itinerary-design, activity:moderate, etc.) and an opaque subject identifier for per-subject calibration over time. Subject_id is opaque to Teranode by contract.
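The "server-issued or client-suggested + idempotent" behaviour of decision_id can be sketched with an in-memory store. Everything here is hypothetical: the reduced `Decision` shape, the `dec_` id format, and the class itself stand in for the real Postgres-backed store, which is not shown on this page.

```typescript
// Reduced, hypothetical Decision shape; the real row carries more fields
// (council snapshot, role contributions, token and latency metadata).
interface Decision {
  decision_id: string;
  vertical: string;
  synthesis: string;
  source: "teranode" | "anthropic";
  tags: string[];
  subject_id?: string;
}

type NewDecision = Omit<Decision, "decision_id"> & { decision_id?: string };

class InMemoryDecisionStore {
  private rows = new Map<string, Decision>();
  private nextId = 1;

  record(input: NewDecision): Decision {
    // Use the client-suggested id when present, otherwise issue one server-side.
    const id = input.decision_id ?? `dec_${this.nextId++}`;
    const existing = this.rows.get(id);
    // Idempotent replay: a repeated id returns the original row unchanged.
    if (existing !== undefined) return existing;
    const row: Decision = { ...input, decision_id: id };
    this.rows.set(id, row);
    return row;
  }

  count(): number {
    return this.rows.size;
  }
}
```

With this shape a client can safely retry a POST after a timeout: the second call with the same decision_id does not create a second row or overwrite the first.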
What every Outcome captures
One row per real-world result that becomes known after the fact: minutes, weeks, or months later. Discriminated by kind; payload is kind-specific JSONB. Seven kinds, each minimal:
- human_edited: original/final/editor-notes
- subject_accepted: accepted-at timestamp
- subject_rejected: reason (free-form)
- post_hoc_review: score 0..1, reviewer notes
- role_correctness: role id, correct, why
- vendor_reliability: vendor id, reliable, detail
- repeat_engagement: days-later signal
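A kind-discriminated payload maps naturally onto a TypeScript discriminated union. The kind names below come from the list above; the payload property names (`editorNotes`, `acceptedAt`, and so on) are assumptions made for this sketch, not the stored JSONB keys.

```typescript
type Outcome =
  | { kind: "human_edited"; original: string; final: string; editorNotes?: string }
  | { kind: "subject_accepted"; acceptedAt: string }
  | { kind: "subject_rejected"; reason: string }
  | { kind: "post_hoc_review"; score: number; reviewerNotes?: string }
  | { kind: "role_correctness"; roleId: string; correct: boolean; why: string }
  | { kind: "vendor_reliability"; vendorId: string; reliable: boolean; detail: string }
  | { kind: "repeat_engagement"; daysLater: number };

function summarizeOutcome(o: Outcome): string {
  switch (o.kind) {
    case "human_edited":
      return o.original === o.final ? "kept verbatim" : "rewritten";
    case "subject_accepted":
      return `accepted at ${o.acceptedAt}`;
    case "subject_rejected":
      return `rejected: ${o.reason}`;
    case "post_hoc_review":
      if (o.score < 0 || o.score > 1) throw new RangeError("score must be within 0..1");
      return `review score ${o.score}`;
    case "role_correctness":
      return `${o.roleId} ${o.correct ? "correct" : "incorrect"}`;
    case "vendor_reliability":
      return `${o.vendorId} ${o.reliable ? "reliable" : "unreliable"}`;
    case "repeat_engagement":
      return `re-engaged after ${o.daysLater} days`;
    default: {
      // Compile-time exhaustiveness check: adding an eighth kind fails to type-check here.
      const unreachable: never = o;
      throw new Error(`Unknown outcome: ${JSON.stringify(unreachable)}`);
    }
  }
}
```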
When each reliability metric activates
Reliability is not declared. It's computed from history. Each metric has a minimum sample size before it activates — small-N percentages are noise, and showing them on a regulated-finance surface would be a Marketing Rule 206(4)-1 risk:
- Per-role accuracy: n ≥ 30 human_edited outcomes per role per domain. Computes the share of outputs the human kept verbatim vs. rewrote. Surfaces which roles produce signal vs. which get edited out.
- Per-domain calibration: n ≥ 50 subject_accepted/subject_rejected outcomes per domain. Computes whether the council's confidence aligns with subject acceptance.
- Cross-domain transfer: 2+ domains each with ≥ 100 outcomes. Computes whether a role's reliability pattern on one domain generalizes to another. The flywheel that gets stronger as more verticals compound.
- Vendor reliability: ≥ 2 distinct external vendors referenced across decisions, with follow-up vendor_reliability Outcomes. Tracks which third parties (advisors, products, partners) the system flagged that turned out reliable vs. not.
- Per-tile small-N suppression: accept-rate percentages on the public dashboard suppress until n ≥ 10 per vertical, replaced with explicit raw counts and the disclosure "n=K · accept-rate suppressed (need ≥ 10)." The disclosure itself is the discipline.
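The activation thresholds reduce to simple gates on sample size. The sketch below covers two of them, using the thresholds stated above (n ≥ 30 for per-role accuracy, n ≥ 10 for dashboard accept-rate tiles); the function names and the non-suppressed tile wording are hypothetical.

```typescript
const MIN_HUMAN_EDITED_FOR_ROLE_ACCURACY = 30;
const MIN_OUTCOMES_FOR_ACCEPT_RATE_TILE = 10;

// Share of outputs kept verbatim, or null while the metric is not yet active.
function perRoleAccuracy(keptVerbatim: number, humanEditedTotal: number): number | null {
  if (humanEditedTotal < MIN_HUMAN_EDITED_FOR_ROLE_ACCURACY) return null;
  return keptVerbatim / humanEditedTotal;
}

// Dashboard tile text: raw count plus disclosure until n reaches the threshold.
function acceptRateTile(accepted: number, total: number): string {
  if (total < MIN_OUTCOMES_FOR_ACCEPT_RATE_TILE) {
    return `n=${total} · accept-rate suppressed (need ≥ ${MIN_OUTCOMES_FOR_ACCEPT_RATE_TILE})`;
  }
  const percent = Math.round((accepted * 100) / total);
  return `${percent}% accept-rate (n=${total})`;
}
```

Returning `null` rather than a low-confidence number keeps the "not yet active" state explicit, so a caller cannot accidentally render a percentage computed from a handful of rows.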
The anonymization contract
Each connected vertical is mapped to a Greek-letter slot (vertical-α, vertical-β, vertical-γ, …) at the server boundary. The internal name of any connected vertical never crosses the wire to the client bundle, never lands on a public page, and never appears in this document, including in illustrative examples like the one you are reading. The store module that owns the mapping carries import "server-only" so Next's bundler refuses any client import at build time.
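A minimal sketch of the slot mapping, assuming slots are assigned by position in a server-held list of internal names. The function name and the position-based assignment are assumptions for illustration; in the real module the mapping would sit behind the server-only import, which is omitted here so the sketch runs standalone.

```typescript
// Five reserved slots, matching the Greek-letter set named in the Decision schema.
const GREEK_SLOTS = ["α", "β", "γ", "δ", "ε"] as const;

// `internalNames` lives only on the server; only the returned slot crosses the wire.
function publicSlotFor(internalNames: readonly string[], internalName: string): string {
  const index = internalNames.indexOf(internalName);
  if (index < 0 || index >= GREEK_SLOTS.length) {
    throw new Error("Vertical is not mapped to a reserved slot");
  }
  return `vertical-${GREEK_SLOTS[index]}`;
}
```

The error message deliberately omits the internal name, so a failed lookup cannot leak it into logs or responses that reach the client.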
This is a client-confidentiality posture, not obfuscation: connected domains pre-design-partner go-live should not have their product surface exposed by association on a public reliability page. Once a design partner is signed and elects to be named, the mapping flips on a per-vertical basis; the schema does not change.
What this is part of.
The eval matrix (sections 01–07) gates the model layer. The reliability graph above gates the system-output layer. Together they form the substrate that lets a regulated advisor sign a Council decision record with epistemic integrity — model choice is eval-defended, output quality is reliability-graph-attested, and the artifact (signed, timestamped, role-attributed) is what gets filed under Reg BI / IAA §206 / 17a-4 / 204-2.
See also: /trust section 07 for the short summary in due-diligence context. /reliability-graph for the live cross-vertical dashboard. /methodology/log for the auto-published weekly transparency pulse.