Teranode Methodology

AI decision infrastructure: how the record gets built.

Teranode produces one thing: a timestamped, role-attributed rationale record, built to be signed, for AI-assisted advisory recommendations. IAA §206 (the fiduciary duties of loyalty and care) calls for a reasoned, documented basis for every recommendation, and firms retain records of that basis under books-and-records rules (Rule 204-2). The SEC Marketing Rule (206(4)-1) governs substantiation of performance and marketing claims. The record is designed for that documentation workflow; whether and how it fits a firm's obligations is a determination for the firm and its counsel. This page explains how it gets built: which models we route to each role, which prompts each role uses, which evals back each pin, and how we've made the routing improve over time.

The SEC's 2026 Examination Priorities (released 2025-11-17) named adviser use of AI as a focus area. Separately, the adviser's duty of care under §206 and the books-and-records rule (204-2) call for a documented basis for a recommendation. The Council's output format is designed to be the rationale an SEC examiner, a books-and-records review, or an internal compliance review can read and retain.

What Teranode does not do. It does not make recommendations. It does not supervise advisors. It does not replace the CCO, and it does not replace firm policies or written supervisory procedures. It does not determine what the firm must file or retain. It does not guarantee compliance and it does not guarantee accuracy. It is designed to work without client identifiers, and it does not train models on client records. It does not act as an investment adviser, a broker-dealer, or a law firm. The advisor decides; the record documents.

The premise

One model gives an answer. The Council gives the answer formatted as a record.

The Council's value is structural, not a claim that the Council is superior to any model or that five models are smarter than one. Below is the same advisory question put to one capable, general-purpose model and to the Council. Each produces a usable answer. Only one is formatted as a record your reviewer can open: the decision, the steps a firm may require before approval, the named risks, and the attributed dissent, in one fileable artifact. Both are useful; what a record does in any review is a determination for your firm and its counsel.

Illustrative and structural only. Not a claim that the Council outperforms any model, and not a claim that a single-model answer is deficient. The same question, shown as prose on the left and as a formatted record on the right.

One leading model, asked once

Given the facts as presented, the honest recommendation is to leave the 401(k) where it is, at least for now. The 97 basis point cost difference between the institutional funds and your advisory fee is substantial and compounds significantly over a retirement horizon that could span 30 or more years. The stable-value fund at 3.5% is a genuinely valuable asset that has no direct equivalent in an IRA, where money market funds currently yield less and carry more volatility risk. ERISA creditor protection is also meaningfully stronger than IRA protection in most states, which matters for a client approaching the years when healthcare or liability risks can escalate. The “more investment options” argument rarely overcomes a near-1% annual cost drag unless the client has a specific, demonstrable need for an asset class or strategy unavailable in the plan. The fact that the rollover generates compensation for you is a material conflict of interest that regulators, particularly under the DOL fiduciary framework, require you to document and justify clearly. If consolidated planning genuinely adds value, consider managing the IRA assets she accumulates going forward while leaving this plan intact, or negotiating a flat or lower fee structure that makes the math work in her favor.

A clear answer, in prose. As shown here it is one opinion in narrative form, not a structured record: no separated steps to consider before approval, no attributed dissent, no signature block.

Five-model Council review

Decision

Do not initiate rollover paperwork on the current record. The proposed move replaces a 0.03% institutional fee structure, a 3.5% stable-value fund with no IRA-side equivalent, and ERISA creditor protection with a 1% advisory arrangement that compensates the recommending advisor.

Flagged to consider before approval

Document a signed side-by-side 401(k)-vs-IRA best-interest analysis covering the 97 bps differential and stable-value loss.
Confirm written conflict disclosure that the rollover compensates the advisor at 1%.
Schedule a client meeting to present the in-plan retention alternative before any rollover form is circulated.

On the record: material risks

Approximately 97 bps annual fee drag (1% IRA vs 0.03% institutional) compounding against a 58-year-old's retirement horizon.
Loss of the 3.5% stable-value fund, which has no equivalent available inside an IRA wrapper.
Forfeiture of ERISA creditor protection in exchange for IRA-level protection that varies by state.

Preserved dissent · Builder

The 401(k) advantages should not be treated as categorically dispositive; the file must still test whether the client has specific IRA-only needs, such as retirement-income planning, tax coordination, estate or beneficiary planning, Roth conversion management, or unavailable investment access, and quantify whether those outweigh the ~97 bps drag and lost protections.

Advisor countersignature, timestamp, and SHA-256 content fingerprint; for content matching, not a tamper-proof seal.

The decision, the steps, the risks, and the dissent. Signed, timestamped record.

Illustrative, and a single example rather than a controlled or representative test. The single-model answer shown is a real answer from one leading model, unnamed by design, captured once for this scenario. Each was asked the same question; the Council is built to return this record format by default, and a single model can be prompted toward parts of it. What differs here is the default structure of the output, not the model, and this is not a claim that any model performs better or worse than another. Teranode does not supervise, make recommendations, or satisfy any firm's best-interest or recordkeeping obligation; what to file is a determination for the firm and its counsel. Not advice and not a recommendation to any person.

The map

Where the record meets the rules.

Each row is a pressure an examiner can put on a firm that uses AI in a recommendation workflow. The middle column is what the decision record gives the firm to answer with. The right column is the boundary: what Teranode does not claim to do.

| Regulatory pressure | What the record captures | What Teranode does not claim |
| --- | --- | --- |
| Fiduciary duty of care (IAA §206) | The decision, the risks surfaced, the alternatives weighed, and the preserved dissent | It does not decide for the advisor |
| Books and records (Rule 204-2) | A timestamped rationale record, built to be signed and retained | It does not determine the firm's legal retention obligations |
| AI supervision (SEC 2026 exam priorities) | Model pins per role and process metadata on every run; the eval history behind each pin is published on this page | It does not certify the firm's compliance |
| Marketing and substantiation (Rule 206(4)-1) | The scenario as submitted and the recommendation's stated basis, preserved as written | It does not validate performance claims |
| Reg S-P and privacy | Records designed to carry no client identifiers | It does not replace the firm's privacy program |

The diligence materials behind this table (sample record, privacy posture, security questionnaire) are assembled at /diligence. A record does not by itself satisfy any obligation in this table; that determination belongs to the firm and its counsel.

What the eval scaffold does.

We run an in-product eval scaffold at /admin/eval that supports head-to-head A/B comparison. Two modes:

Compare models. Vary one role's model; hold the other four roles and the chairman's prompt constant. Run the same scenario through both variants. Reasoner outputs run independently per side, but at the council level both A and B see the same set of role policy models, so the only systematic difference is the varied role's model.
Compare chairman prompt versions. For chairman prompt changes specifically, we run the four reasoners ONCE per scenario and feed identical anonymized outputs to two prompt versions of the chairman in parallel. True prompt-isolation A/B: eliminates reasoner noise as a confound. See runMeetingCouncilSharedReasoners in the source.
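The shared-reasoner mode described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the source of runMeetingCouncilSharedReasoners: the type names, the compareChairmanPrompts function, and the callback signatures are all hypothetical. It shows the one property that matters for prompt isolation: the reasoners run once, and both chairman prompt versions receive the identical anonymized inputs.

```typescript
// Hypothetical types and names for illustration only.
type ReasonerOutput = { label: string; text: string };
type Reasoner = (scenario: string) => Promise<string>;
type Chairman = (promptVersion: string, inputs: ReasonerOutput[]) => Promise<string>;

async function compareChairmanPrompts(
  scenario: string,
  reasoners: Reasoner[],
  chairman: Chairman,
  versionA: string,
  versionB: string,
): Promise<{ a: string; b: string; sharedInputs: ReasonerOutput[] }> {
  // Run each reasoner exactly once for the scenario.
  const texts = await Promise.all(reasoners.map((run) => run(scenario)));

  // Replace role identity with neutral labels (reasoner_A, reasoner_B, ...).
  const sharedInputs = texts.map((text, i) => ({
    label: `reasoner_${String.fromCharCode(65 + i)}`,
    text,
  }));

  // Feed the same inputs to both prompt versions in parallel,
  // so reasoner variation cannot explain a difference between a and b.
  const [a, b] = await Promise.all([
    chairman(versionA, sharedInputs),
    chairman(versionB, sharedInputs),
  ]);
  return { a, b, sharedInputs };
}
```

Because both sides read the same sharedInputs array, any divergence between a and b is attributable to the chairman prompt (or to chairman sampling noise), not to the reasoners.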

Every comparison persists to a Postgres trend table (eval_runs). The system supports founder judgment today, design-partner judgment once contracted, and outcome-signal judgment once those exist.

Current model-pin matrix: five of five eval-backed.

Every Council role's model pin is backed by a documented head-to-head comparison, and newer releases are tested by name. Two pins hold upgrade wins: the contrarian pin (grok-4.3, 4-2 on 2026-06-26) and the builder pin (gpt-5.5, 3-1 in June 2026). The strategist pin reverted to the proven gemini-2.5-pro after a preview upgrade proved unreliable in production. Two more are defended against newer same-family releases: claude-opus-4.7 held 11 of 12 blind-judged comparisons against claude-opus-4.8 and the new claude-fable-5 class, run within days of their release. When a newer model loses, we say so and keep the pin; when it wins, we promote it. Both outcomes are recorded below.

Risk · ✓ defended

risks · failure modes

claude-opus-4.7 · Anthropic

held vs claude-opus-4.8 AND claude-fable-5 on 2026-06-10 — incumbent preferred 11/12 blind-judged comparisons days after their release · earlier: tie 2-2 vs o3, vendor diversification

Builder · ↑ upgraded

structured plans

gpt-5.5 · OpenAI

upgraded to gpt-5.5 on 2026-06-10 · won 3-1 vs gpt-5.4 (two judge calls unparseable, scored as ties) under a neutral cross-vendor judge; prior pin gpt-5.4 retained as first fallback

Strategist · ✓ defended

second-order · long-horizon

gemini-2.5-pro · Google

held at gemini-2.5-pro: a preview upgrade won a 3-1 eval but proved unreliable in production, so the pin was reverted to the proven gemini-2.5-pro

Contrarian · 🛡 sovereignty pin

pressure-test · counter-argument

grok-4.3 · xAI

non-PRC seat. live pin grok-4.3 (xAI), the evidenced winner: beat the grok-4.20 floor 4-2 on the 2026-06-26 A/B under a blind gpt-5.4-pro judge. deepseek-r1 won the original eval but is a PRC-jurisdiction model, so it is excluded from the live client-data path and kept as an offline benchmark only

Chairman · ✓ defended

synthesis · final recommendation

claude-opus-4.7 · Anthropic

held vs claude-opus-4.8 AND claude-fable-5 on 2026-06-10 (11/12 blind-judged; newer candidates lost mostly on overstatement) · pin began as a 2026-04-27 operational override vs o3-pro latency

Four distinct vendors across five roles: Anthropic, OpenAI, Google, xAI. This is the most direct expression of the principle that we route to the best frontier model per role, regardless of provider.

Decision rules: when do we ship a pin change?

Decisive eval win. If the candidate wins 3 of 4 or 4 of 4 representative scenarios on quality, we ship the upgrade. A strategist upgrade cleared this bar (3-1), but we later reverted to the stable gemini-2.5-pro pin when the winning preview model proved unreliable in production. Eval wins ship; reliability can override.
Sovereignty can override an eval win. The contrarian benchmark deepseek-r1 won its eval 4-0, but it is a PRC-jurisdiction model, so it is excluded from the live client-data path and kept only as an offline benchmark. The live Contrarian pin is a non-PRC model. Data sovereignty outranks raw eval quality in the client-data path.
Tied eval result. Defend the current pin by secondary criteria, typically vendor diversification. The Chairman and Risk (critic) evals both came back tied; the current pins were maintained because switching would concentrate roles on a single vendor.
Architectural override. Even when a candidate wins on quality, if shipping it would duplicate an existing pin (same model on two roles), we hold. The builder eval showed 3-1 for o3-pro, but o3-pro was already in the chairman candidate pool at the time. Duplicating would defeat the "five distinct reasoners" premise. The builder pin held until a non-duplicating candidate arrived: gpt-5.5, promoted on its own 3-1 win in June 2026.
Operational override. When a pin's latency or availability degrades a live demo, we ship an emergency pin shift to a vendor with eval-tied quality and normalize later. The chairman pin shifted from o3-pro to claude-opus-4.7 on 2026-04-27 after o3-pro's p99 latency exceeded the function timeout. Eval result said tied; operational reality picked the winner. The pin has since been defended on quality in its own right (11 of 12 against claude-opus-4.8 and claude-fable-5 in June 2026) so it no longer rests on the operational story alone. That shift also placed the same model on the Risk and Chairman roles: the four reasoner seats remain four distinct vendors, and the shared Chairman pin is a known trade-off, kept because it keeps winning its evals.
Methodology integrity overrides results. We discovered mid-eval that prompt-comparison runs with independent reasoners produced false-divergence signals (chairman v1 vs v2 appeared to diverge on one scenario). We fixed the methodology (shared-reasoner orchestrator) before drawing conclusions. Both versions were equivalent under clean isolation.
Newer is not assumed better. Frontier releases are evaluated against the pinned Council as they ship, and the result is recorded either way. The June 2026 round ran within days of each release: Anthropic shipped claude-opus-4.8 (May 27) and the new claude-fable-5 class (June 9), and each ran head-to-head against the pinned claude-opus-4.7, filling the critic, chairman, and examiner roles together, across 12 head-to-head comparisons (two candidates, six contested advisory scenarios), scored by a blind non-Anthropic judge. The incumbent was preferred in 11 of 12. In the comparisons the incumbent won, the judge most often cited overstated claims in the candidate outputs on these scenarios, the failure mode the Council's regulatory constraints are designed to select against. All Anthropic-family pins held; both candidates remain staged for future rounds. A second June round then ran the builder slot (gpt-5.4 vs gpt-5.5, blind cross-vendor judge) and promoted gpt-5.5 on a 3-1 result: the same discipline, cutting the other way. Internal evaluation, small samples, single blind judge per round; not third-party validated; not predictive of future rounds. Method and limits in section 09.
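Several of the rules above can be read as a small decision function. The sketch below is illustrative only: decidePin, its types, and the prcJurisdiction flag are hypothetical names, the order in which the overrides are checked is an assumption, and the 75% threshold is an assumed generalization of the "3 of 4 or 4 of 4" rule to rounds with unparseable or tied judge calls. Operational, reliability, and methodology overrides are judgment calls and are deliberately left out.

```typescript
// Hypothetical names; a sketch of the stated rules, not the production logic.
type EvalTally = { candidateWins: number; incumbentWins: number };
type Candidate = { model: string; prcJurisdiction: boolean };
type PinDecision = { action: "promote" | "hold"; reason: string };

function decidePin(
  tally: EvalTally,
  candidate: Candidate,
  modelsOnOtherRoles: readonly string[],
): PinDecision {
  // Sovereignty can override an eval win in the client-data path.
  if (candidate.prcJurisdiction) {
    return { action: "hold", reason: "sovereignty override" };
  }
  // Architectural override: never place the same model on two roles.
  if (modelsOnOtherRoles.includes(candidate.model)) {
    return { action: "hold", reason: "architectural override: duplicate pin" };
  }
  const decided = tally.candidateWins + tally.incumbentWins;
  // Tied result: defend the current pin on secondary criteria.
  if (decided === 0 || tally.candidateWins === tally.incumbentWins) {
    return { action: "hold", reason: "tied: defend on secondary criteria" };
  }
  // Assumed generalization of "3 of 4 or 4 of 4": at least 75% of decided comparisons.
  if (tally.candidateWins / decided >= 0.75) {
    return { action: "promote", reason: "decisive eval win" };
  }
  return { action: "hold", reason: "not decisive" };
}
```

Under these assumptions, a 3-1 result for a non-duplicating, non-PRC candidate promotes, while a 4-0 result for a PRC-jurisdiction model still holds.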

The engine path.

The Council's baseline architecture is Mixture-of-Agents: four reasoner role prompts run in parallel against the same input, and a Chairman synthesizes their outputs into a structured decision record. That's the foundation. Three additional architectural moves are available, each user-selectable and each independently verifiable on the artifact.

Phase 0: Dynamic dissent attribution

When the Chairman synthesizes, it identifies the strongest counter-argument from across all four role outputs, judging on substance, not on which role produced it. The role tied to that counter-argument becomes the dissent attribution on the public artifact ("Preserved dissent · attributed to Risk role" / "Strategist role" / "Contrarian role").

Earlier versions of the Council always attributed dissent to the Contrarian slot. That conflated structural position with substantive divergence. The current chairman prompt returns an explicit strongest_objection_source field; the orchestrator resolves the anonymized reasoner label back to the real role server-side. The Chairman never sees real role identities. Its judgment runs against anonymized inputs to prevent label-bias.
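A minimal sketch of that label round-trip, assuming a fixed label order and hypothetical function names (anonymize, resolveDissent). Only the strongest_objection_source field name and the reasoner_A style labels come from this page; a real orchestrator might shuffle the label assignment per run.

```typescript
// Hypothetical helper names; role ids follow the four reasoner seats on this page.
const ROLES = ["risk", "builder", "strategist", "contrarian"] as const;
type Role = (typeof ROLES)[number];

function anonymize(outputs: Record<Role, string>) {
  const labelToRole = new Map<string, Role>();
  const inputs = ROLES.map((role, i) => {
    const label = `reasoner_${String.fromCharCode(65 + i)}`;
    labelToRole.set(label, role); // mapping stays server-side
    return { label, text: outputs[role] };
  });
  // Only `inputs` is sent to the Chairman; `labelToRole` never is.
  return { inputs, labelToRole };
}

function resolveDissent(
  chairmanResult: { strongest_objection_source: string },
  labelToRole: Map<string, Role>,
): Role | null {
  // Unknown labels resolve to null rather than guessing a role.
  return labelToRole.get(chairmanResult.strongest_objection_source) ?? null;
}
```

The Chairman only ever handles the neutral labels; the dissent attribution shown on the artifact is produced by the lookup after synthesis.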

Phase 1: Targeted cross-examination

On user-selected Exam-Tested runs, the Chairman's first synthesis is followed by a targeted second round: the dissenting role is given the Chairman's decision and reasoning, then explicitly asked to stand by, modify, or withdraw the objection. The Chairman re-synthesizes with that response added to the inputs (under the same anonymized label, so role identity stays hidden). The artifact records both the original objection and the cross-examination response, plus a decision_revised flag indicating whether the Chairman changed the conclusion after engaging the dissent.

This is the architectural delta from Mixture-of-Agents to genuine multi-agent reasoning: one targeted exchange where the disagreement actually matters, decided by the Chairman's substance-based judgment of which role produced the strongest objection.

Phase 3: Adversarial regulator review

On Exam-Tested runs, the Chairman's final synthesis (post-cross-examination, when applicable) is reviewed by a sixth model running in adversarial SEC-RIA-examiner mode. It reviews specifically for the failure modes a real examiner would surface in a deficiency letter: Investment Advisers Act §206 conflicts and duty-of-care gaps, Marketing Rule 206(4)-1 issues, DOL PTE 2020-02 rollover gaps, books-and-records (Rule 204-2) implications, and the AI-specific exam priorities the SEC published 2025-11-17, with Reg BI / FINRA Rule 2111 applied to any broker-side recommendation.

On a "pass" verdict, the artifact carries a review stamp. On a "gap" verdict, the Chairman re-synthesizes once with the examiner's correction note appended; both the original gap (with rule citation) and the applied correction are documented on the artifact. The review discipline is the feature. The artifact reads "SEC-RIA-examiner-mode reviewed · no gap identified" or "gap → corrected" with the specific rule and the substance of the correction.

The examiner is intentionally conservative: a borderline case is a pass. False positives degrade artifact quality more than false negatives, so the prompt instructs the examiner to flag only what a real examiner would call out.
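The pass/gap handling described in this phase can be sketched as a single function. The names (ExaminerVerdict, applyExaminerReview, resynthesize) are hypothetical, and the exact wording of the gap stamp is an assumption built from the two phrases quoted above. The sketch shows the one-shot nature of the correction: a gap triggers exactly one re-synthesis, and the gap itself stays on the artifact.

```typescript
// Hypothetical types; a sketch of the described flow, not the production code.
type ExaminerVerdict =
  | { verdict: "pass" }
  | { verdict: "gap"; rule: string; correction: string };

interface ReviewedArtifact {
  synthesis: string;
  stamp: string;
  gap?: { rule: string; correction: string };
}

function applyExaminerReview(
  synthesis: string,
  verdict: ExaminerVerdict,
  resynthesize: (previous: string, correctionNote: string) => string,
): ReviewedArtifact {
  if (verdict.verdict === "pass") {
    return { synthesis, stamp: "SEC-RIA-examiner-mode reviewed · no gap identified" };
  }
  return {
    // Exactly one re-synthesis with the examiner's note appended; no loop.
    synthesis: resynthesize(synthesis, verdict.correction),
    // Assumed stamp wording, combining the two phrases quoted on this page.
    stamp: "SEC-RIA-examiner-mode reviewed · gap → corrected",
    // Both the cited rule and the applied correction are kept on the artifact.
    gap: { rule: verdict.rule, correction: verdict.correction },
  };
}
```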

Tier selection

Three user-selectable tiers map to depth/latency tradeoffs:

Quick (~15s). Fast-tier models, single chairman pass. Sanity-check only, not a record-grade artifact.
Standard (~45s). Deep-tier pinned models, single chairman pass. Standard advisor recommendation flow; covers most decisions.
Exam-Tested (up to 3 min). Deep-tier + cross-examination + regulator review. Designed for high-stakes records that may have to stand up to an exam letter.
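Expressed as configuration, the three tiers might look like the sketch below. The TierConfig shape and key names are hypothetical, and marking Standard and Exam-Tested as record-grade is an inference from the Quick tier's description rather than something stated outright.

```typescript
// Hypothetical config shape for illustration.
type Tier = "quick" | "standard" | "exam-tested";

interface TierConfig {
  modelTier: "fast" | "deep";
  crossExamination: boolean; // Phase 1
  regulatorReview: boolean; // Phase 3
  approxLatencySeconds: number; // upper bound for exam-tested
  recordGrade: boolean; // assumption for standard and exam-tested
}

const TIERS: Record<Tier, TierConfig> = {
  quick: {
    modelTier: "fast",
    crossExamination: false,
    regulatorReview: false,
    approxLatencySeconds: 15,
    recordGrade: false,
  },
  standard: {
    modelTier: "deep",
    crossExamination: false,
    regulatorReview: false,
    approxLatencySeconds: 45,
    recordGrade: true,
  },
  "exam-tested": {
    modelTier: "deep",
    crossExamination: true,
    regulatorReview: true,
    approxLatencySeconds: 180,
    recordGrade: true,
  },
};
```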

The eval scaffold (above) explicitly skips Phase 1 and Phase 3. Those are non-deterministic add-on rounds that would confound chairman A/B comparison. Eval signal stays clean.

Privacy posture: what the Council never sees.

The Council never receives client identifiers, contact information, account numbers, or financial-institution credentials. The product is architecturally narrow on purpose: an advisor-typed scenario plus, optionally, a structured-input schema of suitability-relevant numbers. Nothing else. This is a different privacy shape than horizontal AI agents that process customer conversations or production codebases. Those products touch client data by design. We retain none of it.

The four-layer posture, verifiable in source code:

Schema-level. The structured input schema (lib/council/structured-input.ts) collects no PII fields by design. Permitted fields are bracket numbers and suitability-relevant context (ages, income/asset brackets, time horizon, tax bracket, risk tolerance, state of residence for tax-jurisdiction reasoning, free-text goals and constraints). Forbidden fields, by absence: client name, address, SSN, account number, phone, email, full date of birth, beneficiary identifiers.
Telemetry-level. The council_telemetry table records only structural metadata: tier, mode, domain, duration, role-failure count, cross-exam fired, examiner verdict, chairman prompt version, model policy version. No scenario text, no role outputs, no chairman synthesis, no IP, no session ID, no client identifiers. Same shape as a page-view counter.
Storage-level. By default, Teranode persists no client decision-record content server-side. A decision record's data travels base64url-encoded in the URL itself, so by default there is no Teranode database row to subpoena, leak, or retain past the user's control. Two deliberate opt-ins are the exceptions: a record a firm files to its own archive, and a scenario donated via Help improve the Council, both described on /privacy. The durable, fileable artifact is the one-page decision record the advisor saves or prints from the run, retained in the firm's own books and records. The link is simply how the data moves without us storing it.
Synthesis-level. The Chairman that writes the final decision record never sees the real role identities. Reasoner outputs are anonymized to neutral labels (reasoner_A, reasoner_B, etc.) before synthesis, so the Chairman's judgment about which dissent is strongest is grounded in substance, not in which slot produced it. This is label-bias prevention, complementary to PII protection.
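The schema-level idea of "forbidden by absence" can be illustrated with an allow-list. The sketch below is not the contents of lib/council/structured-input.ts: the field names and the pickPermitted helper are hypothetical stand-ins for the permitted categories listed above. The point is that a PII key is never rejected explicitly; it simply has nowhere to land.

```typescript
// Hypothetical field names standing in for the permitted categories.
const PERMITTED_FIELDS = [
  "age",
  "incomeBracket",
  "assetBracket",
  "timeHorizonYears",
  "taxBracket",
  "riskTolerance",
  "stateOfResidence",
  "goals",
  "constraints",
] as const;

type PermittedField = (typeof PERMITTED_FIELDS)[number];
type StructuredInput = Partial<Record<PermittedField, string | number>>;

function pickPermitted(raw: Record<string, unknown>): StructuredInput {
  const out: StructuredInput = {};
  // Copy only allow-listed keys; anything else (name, SSN, email) is dropped.
  for (const field of PERMITTED_FIELDS) {
    const value = raw[field];
    if (typeof value === "string" || typeof value === "number") {
      out[field] = value;
    }
  }
  return out;
}
```

An allow-list does not stop an advisor from typing an identifier into a free-text field such as goals; it only guarantees that no dedicated PII field exists.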

The architectural posture has direct implications for regulatory cooperation: an SEC §206 examiner, a books-and-records audit, or an internal compliance review can read every decision record produced by the firm without Teranode appearing in any vendor-data chain that needed to be evaluated for PII handling. By default there is no Teranode-side data for compliance to map (the opt-in archive and donation stores are the exceptions); the firm holds whatever records it chooses to retain, in whatever format the firm controls.
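The storage-level point, that a record travels base64url-encoded in the URL rather than in a database row, can be sketched with Node's Buffer. The function names, the /record path, and the r query parameter are hypothetical; this page does not specify where in the URL the payload sits.

```typescript
import { Buffer } from "node:buffer";

// Serialize a record to URL-safe base64 (no "+", "/", or "=" padding).
function encodeRecord(record: unknown): string {
  return Buffer.from(JSON.stringify(record), "utf8").toString("base64url");
}

// Reverse of encodeRecord; the caller asserts the expected shape.
function decodeRecord<T>(encoded: string): T {
  return JSON.parse(Buffer.from(encoded, "base64url").toString("utf8")) as T;
}

// Hypothetical path and parameter name.
function buildShareUrl(origin: string, record: unknown): string {
  return `${origin}/record?r=${encodeRecord(record)}`;
}
```

Because the payload lives in the link, whoever holds the link holds the data. Base64url is an encoding, not encryption, so the link should be handled like the record itself.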

Reproducibility.

Methodology and per-scenario comparison records are documented in the repository under eval-results/. Available under NDA before contracting. The eval scaffold itself is at /admin/eval (admin-gated; access on request for evaluation partners).

Anyone with admin access can re-run any comparison in a few minutes. The four representative scenarios used in the matrix evals are available as anonymized cases at /examples (password-gated for investor review).

The reliability graph (V1): what gets recorded, when each metric activates.

The eval matrix above answers which model belongs in each role. The reliability graph answers a different question: how often does the system's output get kept, edited, or rejected, and what does that say about each role's accuracy on each domain over time? That signal is computed from a structured Decision/Outcome ledger every connected vertical writes into.

API surface, live since 2026-05-05: POST /api/v1/councils/convene, POST /api/v1/decisions, POST /api/v1/decisions/:id/outcomes. Bearer-token auth with per-vertical isolation; defense-in-depth from middleware to store layer. See /reliability-graph for the live dashboard.

What every Decision captures

One row per Council deliberation, recorded immediately on completion. Schema is intentionally generic across domains (travel itineraries today; advisory drafts and tax filings queued):

Canonical decision_id. Server-issued or client-suggested + idempotent.
Vertical. The domain that produced the decision. Anonymized to a Greek-letter slot (α, β, γ, δ, ε) at the public boundary so the graph reads as substrate, not as a list of products. Five reserved slots in the CHECK constraint; one currently active.
Council snapshot. Full role definitions (id, name, system prompt, weight) at the moment the decision was made. Every output is replayable end-to-end even if the live model policy shifts.
Role contributions. What each role wrote, keyed by role id. Stored as JSONB so per-role queries ("which role gets edited out most often on tasks tagged X?") are answerable without joins on the hot read path.
Synthesis. The final answer the user saw. Stored verbatim alongside its tokens-used and latency-ms metadata.
Source. teranode when our own convene endpoint produced it; anthropic when the caller's fallback path did, then logged forward for audit. Both contribute to the graph.
Tags + optional subject_id. Free-form labels for slicing (itinerary-design, activity:moderate, etc.) and an opaque subject identifier for per-subject outcome alignment over time. Subject_id is opaque to Teranode by contract.

What every Outcome captures

One row per real-world result that becomes known after the fact: minutes, weeks, or months later. Discriminated by kind; payload is kind-specific JSONB. Seven kinds, each minimal:

human_edited: original/final/editor-notes
subject_accepted: accepted-at timestamp
subject_rejected: reason (free-form)
post_hoc_review: score 0..1, reviewer notes
role_correctness: role id, correct, why
vendor_reliability: vendor id, reliable, detail
repeat_engagement: days-later signal
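The seven kinds above map naturally onto a TypeScript discriminated union. The kind values come from the list; the payload property names and the describeOutcome helper are hypothetical, chosen to mirror the listed payload contents rather than the actual JSONB keys.

```typescript
// Payload property names are assumptions; only the `kind` values are from this page.
type Outcome =
  | { kind: "human_edited"; original: string; final: string; editorNotes?: string }
  | { kind: "subject_accepted"; acceptedAt: string }
  | { kind: "subject_rejected"; reason: string }
  | { kind: "post_hoc_review"; score: number; reviewerNotes?: string }
  | { kind: "role_correctness"; roleId: string; correct: boolean; why: string }
  | { kind: "vendor_reliability"; vendorId: string; reliable: boolean; detail: string }
  | { kind: "repeat_engagement"; daysLater: number };

function describeOutcome(o: Outcome): string {
  switch (o.kind) {
    case "human_edited":
      return o.original === o.final ? "kept verbatim" : "rewritten";
    case "subject_accepted":
      return `accepted at ${o.acceptedAt}`;
    case "subject_rejected":
      return `rejected: ${o.reason}`;
    case "post_hoc_review":
      return `review score ${o.score}`;
    case "role_correctness":
      return `${o.roleId} ${o.correct ? "correct" : "incorrect"}`;
    case "vendor_reliability":
      return `${o.vendorId} ${o.reliable ? "reliable" : "unreliable"}`;
    case "repeat_engagement":
      return `returned after ${o.daysLater} days`;
    default: {
      // Compile-time exhaustiveness check: adding an eighth kind fails here.
      const unreachable: never = o;
      return unreachable;
    }
  }
}
```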

When each reliability metric activates

Reliability is not declared. It's computed from history. Each metric activates only after a minimum sample size is reached, because small-N percentages are noise and showing them on a regulated-finance surface would be a Marketing Rule 206(4)-1 risk:

Per-role accuracy: n ≥ 30 human_edited outcomes per role per domain. Computes the share of outputs the human kept verbatim vs. rewrote. Surfaces which roles produce signal vs. which get edited out.
Per-domain outcome alignment: n ≥ 50 subject_accepted / subject_rejected outcomes per domain. Tests whether runs with higher process scores are also the runs subjects accept; the process score itself stays a deterministic formula, not a probability.
Cross-domain transfer: 2+ domains each with ≥ 100 outcomes. Computes whether a role's reliability pattern on one domain generalizes to another. The flywheel that gets stronger as more verticals compound.
Vendor reliability: ≥ 2 distinct external vendors referenced across decisions, with follow-up vendor_reliability Outcomes. Tracks which third parties (advisors, products, partners) the system flagged that turned out reliable vs. not.
Per-tile small-N suppression: accept-rate percentages on the public dashboard suppress until n ≥ 10 per vertical, replaced with explicit raw counts and the disclosure "n=K · accept-rate suppressed (need ≥ 10)." The disclosure itself is the discipline.
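Two of the gates above, per-role accuracy and per-tile suppression, can be sketched as follows. The thresholds and the disclosure wording come from the list; the function names and the format of the unsuppressed tile are hypothetical. Treating an unchanged string as "kept verbatim" is also an assumption about how human_edited outcomes are compared.

```typescript
// Thresholds as stated on this page; helper names are hypothetical.
const MIN_SAMPLES = {
  perRoleAccuracy: 30,
  perDomainAlignment: 50,
  crossDomainPerDomain: 100,
  publicTile: 10,
} as const;

// Share of outputs kept verbatim, or null while the metric is inactive.
function keptVerbatimShare(edits: { original: string; final: string }[]): number | null {
  if (edits.length < MIN_SAMPLES.perRoleAccuracy) return null;
  const kept = edits.filter((e) => e.original === e.final).length;
  return kept / edits.length;
}

// Public tile text: raw count plus disclosure below the threshold.
function acceptRateTile(accepted: number, rejected: number): string {
  const n = accepted + rejected;
  if (n < MIN_SAMPLES.publicTile) {
    return `n=${n} · accept-rate suppressed (need ≥ ${MIN_SAMPLES.publicTile})`;
  }
  return `n=${n} · accept-rate ${Math.round((accepted / n) * 100)}%`;
}
```

Returning null rather than 0 for an inactive metric keeps "not enough data" distinguishable from "no outputs were kept".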

The anonymization contract

Each connected vertical is mapped to a Greek-letter slot (vertical-α, vertical-β, vertical-γ, …) at the server boundary. The internal name of any connected vertical never crosses the wire to the client bundle, never lands on a public page, and never appears in this document, including in illustrative examples like the one you are reading. The store module that owns the mapping carries import "server-only" so Next's bundler refuses any client import at build time.

This is a client-confidentiality posture, not obfuscation: connected domains pre-design-partner go-live should not have their product surface exposed by association on a public reliability page. Once a design partner is signed and elects to be named, the mapping flips on a per-vertical basis; the schema does not change.

What this is part of.

The eval matrix (sections 01–07) gates the model layer. The reliability graph above gates the system-output layer. Together they form the substrate that lets a regulated advisor sign a Council decision record with epistemic integrity: model choice is eval-defended, output quality is reliability-graph-attested, and the artifact (timestamped, role-attributed, and signed by the advisor) is what the firm can choose to retain under its own procedures for IAA §206, the Marketing Rule, and Rule 204-2.

See also: /trust section 07 for the short summary in due-diligence context. /reliability-graph for the live cross-vertical dashboard. /methodology/log for the auto-published weekly transparency pulse.

Internal evaluation method and limits.

We reviewed about 20 anonymized advisory scenarios to understand what the Council structure surfaces. For one internal comparison, we also reviewed a single-model baseline (GPT-5.4-pro, single-shot) on the same scenario text and response schema. The observation was that the Council structure often made material risks explicit that a single pass left implicit: a liquidity, tax, suitability, or conflicts consideration that got named and recorded rather than left to inference.

The process score, defined. The score on every decision record is a deterministic process score, not a calibrated probability and not a measure of agreement between the roles. (Records issued before July 2026 label it “confidence”; the record JSON keeps confidence as the field name so existing fingerprints and share links stay valid.) The exact formula: start at 0.90 for deep-tier runs (0.75 for quick-tier); subtract 0.15 for each reasoner role that failed to return; subtract 0.05 when the preserved objection exceeds 300 characters; clamp between 0.40 and 0.95. Two structural facts on the record carry the real signal: how many of the four reasoners completed and were synthesized, and whether a substantive dissent was preserved and attributed. We publish the formula so the number can be audited rather than trusted.
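The formula as stated can be written out directly. In this sketch the function name and parameter names are hypothetical, the arithmetic is done in integer hundredths to avoid floating-point drift, and "characters" is assumed to mean JavaScript string length (UTF-16 code units).

```typescript
// A sketch of the published formula; names are hypothetical.
type ScoreTier = "deep" | "quick";

function processScore(
  tier: ScoreTier,
  failedReasoners: number,
  preservedObjection: string,
): number {
  // Work in hundredths: 0.90 for deep-tier, 0.75 for quick-tier.
  let hundredths = tier === "deep" ? 90 : 75;
  // Subtract 0.15 for each reasoner role that failed to return.
  hundredths -= 15 * failedReasoners;
  // Subtract 0.05 when the preserved objection exceeds 300 characters.
  if (preservedObjection.length > 300) hundredths -= 5;
  // Clamp between 0.40 and 0.95.
  return Math.min(95, Math.max(40, hundredths)) / 100;
}
```

For example, a deep-tier run with one failed reasoner and a 301-character objection scores 0.90 − 0.15 − 0.05 = 0.70, and a quick-tier run with three failed reasoners (0.75 − 0.45 = 0.30) clamps to 0.40.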

We use this work to improve the process, not to claim a result. The honest framing matters more than the finding, so we state the limits plainly:

Founder-designed sample. The scenarios were chosen by the founder, not drawn from a representative population of advisory situations.
Single rater. Outputs were assessed by one rater (the founder), not by an independent panel.
n = 20. A small sample, sufficient to study the process, not to support a statistical claim.
Internal only. This is internal work; it is not third-party validated.
Not predictive. It does not predict the result on any other scenario, advisor, or client.
Not representative. It is not a representative sample of all advisory situations.
Not a performance claim. It is not a ranking or a contest, and it is not a claim that the Council is superior to any model. The comparison was a method check on what the structure surfaces.

Raw responses available under NDA. We treat this as method and observation that informs how we improve the process, and the firm remains responsible for its own compliance.