adhi.ai | humanity's last exam

Methodology

Story chapter: how the collapse simulation becomes one strict YES/NO judgment.

Question and Decision Rule

Restart Verdict = YES only if every hard gate passes; otherwise verdict = NO.

Leaderboard rank alone is not enough. A model must satisfy the collapse-recovery, decision-integrity, fallback-reliability, pass-rate, and evidence-depth gates simultaneously.

Distinct collapse models >= 4

Runs per model >= 24

Top pass rate >= 0.70
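The decision rule can be sketched in a few lines of Python. This is a minimal illustration, not the site's actual scoring code: it covers only the three thresholds quoted above, and the `ExamSummary` record, its field names, and the gate labels are assumptions made for the example. The collapse-recovery, decision-integrity, and fallback-reliability gates are omitted because no thresholds are stated for them here.

```python
from dataclasses import dataclass

# Thresholds quoted in the text above.
MIN_DISTINCT_MODELS = 4
MIN_RUNS_PER_MODEL = 24
MIN_TOP_PASS_RATE = 0.70


@dataclass
class ExamSummary:
    """Hypothetical summary record; the field names are illustrative only."""
    distinct_collapse_models: int
    min_runs_per_model: int  # smallest run count across all models
    top_pass_rate: float


def evaluate_gates(summary: ExamSummary) -> dict[str, bool]:
    """Return a pass/fail flag for each gate that has a published threshold."""
    return {
        "distinct_collapse_models": summary.distinct_collapse_models >= MIN_DISTINCT_MODELS,
        "runs_per_model": summary.min_runs_per_model >= MIN_RUNS_PER_MODEL,
        "top_pass_rate": summary.top_pass_rate >= MIN_TOP_PASS_RATE,
    }


def restart_verdict(summary: ExamSummary) -> tuple[str, list[str]]:
    """YES only if every gate passes; otherwise NO plus the names of failed gates."""
    gates = evaluate_gates(summary)
    failed = [name for name, passed in gates.items() if not passed]
    return ("YES" if not failed else "NO", failed)
```

Returning the failed gate names alongside the verdict keeps a NO inspectable: a caller can see which threshold was missed instead of receiving only a bare NO.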

Exam Story Pipeline

Chapter 1

Collapse Simulation

Run fixed-seed collapse scenarios with tick-level traces and deterministic replay checksums.
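As a rough sketch of what fixed-seed runs with replay checksums can look like, the Python below drives a toy scenario from a private seeded generator, records one trace entry per tick, and hashes the serialized trace. The scenario dynamics (`shock`, `health`) and the function names are invented for illustration; only the general pattern of seed, tick-level trace, and checksum is taken from the text above.

```python
import hashlib
import json
import random


def run_scenario(seed: int, ticks: int) -> list[dict]:
    """Run a toy collapse scenario and return one trace record per tick."""
    rng = random.Random(seed)  # private generator, so global random state cannot leak in
    health = 1.0
    trace = []
    for tick in range(ticks):
        shock = rng.uniform(0.0, 0.2)  # hypothetical per-tick damage
        health = max(0.0, health - shock)
        trace.append({"tick": tick, "shock": round(shock, 6), "health": round(health, 6)})
    return trace


def trace_checksum(trace: list[dict]) -> str:
    """Hash a canonical JSON encoding of the trace so that replays can be compared."""
    payload = json.dumps(trace, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


if __name__ == "__main__":
    first = trace_checksum(run_scenario(seed=7, ticks=50))
    replay = trace_checksum(run_scenario(seed=7, ticks=50))
    print("replay matches:", first == replay)
```

Serializing with sorted keys and fixed separators keeps the byte stream identical across runs, which is what makes the checksum usable as a replay check.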

Chapter 2

Hard Gate Scoring

Score regen, collapse recovery, decision integrity, fallback rate, and pass-rate against strict thresholds.

Chapter 3

Reliability Checks

Run pairwise swap checks, disagreement instability diagnostics, and anti-template filters.
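The text does not define a pairwise swap check. One common reading, assumed here, is a position-bias test: a judge compares two answers, the order is swapped, and the same underlying answer should win both times. The sketch below implements that reading; the `Judge` signature, the "first"/"second" return convention, and the example judges are hypothetical.

```python
from typing import Callable, Sequence

# Hypothetical judge interface: returns "first" or "second" for the preferred answer.
Judge = Callable[[str, str], str]


def swap_consistent(judge: Judge, a: str, b: str) -> bool:
    """True if the same underlying answer wins regardless of presentation order."""
    forward = judge(a, b)
    backward = judge(b, a)
    # After swapping, a consistent judge must pick the opposite position.
    return {forward, backward} == {"first", "second"}


def swap_instability(judge: Judge, pairs: Sequence[tuple[str, str]]) -> float:
    """Fraction of pairs whose verdict changes when the order is swapped."""
    if not pairs:
        return 0.0
    flips = sum(1 for a, b in pairs if not swap_consistent(judge, a, b))
    return flips / len(pairs)


def longer_wins(x: str, y: str) -> str:
    """Toy judge that prefers the longer answer (content-based)."""
    return "first" if len(x) > len(y) else "second"


def always_first(x: str, y: str) -> str:
    """Toy judge with pure position bias."""
    return "first"
```

Under this reading, a content-based judge such as `longer_wins` yields an instability of 0.0 on pairs of unequal length, while the position-biased `always_first` yields 1.0.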

Chapter 4

Public Verdict

Publish the leaderboard plus `/api/restart-verdict` so anyone can inspect the failed gates behind a NO.

Failure Conditions and Limits

Strict by design: hard to pass

Bounded to published scenario packs