
Advanced Evidence

Current live lanes

The primary /run page shows one current live lane per benchmark.

Artifacts

Raw logs, manifests, and advanced identity fields remain available for audit.

Historical exceptions

Historical material is kept here only as an audit reference, not as the default public story.

Historical board context

Same model. Different operating layer. Ben-confirmed selective routing board, 2026-04-25.

Artifact-backed historical evidence. These values are context, not guaranteed outputs for every live run.

Selective benchmark uplift observed. Run one live proof sample. Bring your own API key to reproduce more. Bring us the AI cases that break.

Strong areas: AIME, BBEH, HorizonMath, SimpleQA Verified / VSF, HLE, GPQA, HealthBench hard

Smaller but useful areas: Facts Grounding, MMLU-Pro, MuSR, HealthBench consensus

Benchmark | Family | n | Relative uplift
AIME 120 | olympiad math | 120 | +100.0%
BBEH | anti-family reasoning benchmark | 100 | +183.3%
BBH | Big-Bench Hard reasoning | 50 | +14.3%
Facts Grounding | factual grounding | 100 | +2.1%
GPQA | graduate-level science QA | 50 | +13.2%
HealthBench main | medical reliability | 50 | +6.2%
HealthBench hard | medical reliability (hard) | stored | +11.9%
HealthBench consensus | medical consensus special-type | stored | +3.1%
HLE / HLE-Verified | broad hard reasoning | 50 | +21.1%
HorizonMath | research math (auto-checkable numeric/constants subset) | 50 | +45.5%
MMLU-Pro | breadth-heavy expert MCQ | 50 | +3.4%
MuSR | reasoning holdout | stored | +3.3%
SimpleQA | factual QA | 100 | +16.0%
SimpleQA Verified / VSF | verified short factual QA | 100 (clean harness) | +120.4%

Lane taxonomy

Historical Artifact Reproduction uses archived evidence. Live Reproduction Attempt runs selected samples through the current harness. Diagnostic Live Replay is for debugging current behavior. Board Evidence Context is reference evidence only.

Direct comparability requires matching samples, model, grader, prompt wrapper, and routing harness.
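That comparability contract can be sketched as a field-by-field check. This is a minimal illustration, not the site's actual manifest schema; the field names are assumptions.

```python
# Hypothetical comparability check: two runs are directly comparable only
# when samples, model, grader, prompt wrapper, and routing harness all match.
COMPARABILITY_FIELDS = ("sample_ids", "model", "grader",
                        "prompt_wrapper", "routing_harness")

def comparable(run_a: dict, run_b: dict) -> bool:
    """True only when every comparability-critical field matches exactly."""
    return all(run_a.get(f) == run_b.get(f) for f in COMPARABILITY_FIELDS)

run_a = {"sample_ids": ["q1", "q2"], "model": "gpt-5.2",
         "grader": "exact_match", "prompt_wrapper": "rcc",
         "routing_harness": "routed-v1"}
run_b = dict(run_a, grader="llm_judge")  # identical except for the grader
```

A run pair that differs in any one of these fields (as run_b does in its grader) would fail the check, which is the point: a single mismatched axis voids direct comparison.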

Board Reproduction

Historical Benchmark Evidence is read-only. Live Replay Demo is not historical board reproduction. Board Reproduction is enabled only when exact historical assets are wired by benchmark family.

Board families: 14. Site-enabled reproduction: SimpleQA Verified / VSF. Assets found but not wired: BBEH, BBH, Facts Grounding, HealthBench main, HealthBench hard, HealthBench consensus, HLE / HLE-Verified, SimpleQA.

Blocked families: AIME 120, BBEH, BBH, Facts Grounding, GPQA, HealthBench main, HealthBench hard, HealthBench consensus, HLE / HLE-Verified, HorizonMath, MMLU-Pro, MuSR, SimpleQA.

No blocked family exposes a run button. BYOK will apply only after a family-specific board runner is enabled.
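A minimal sketch of that gating, inferred from the blocked rows in this section: each "no" asset gate contributes a missing-asset code, and a family is runnable only when a board runner is wired. The gate-to-code mapping and the runner_wired flag are assumptions, not the site's implementation (the enabled VSF row shows the real logic has exceptions).

```python
# Hypothetical reconstruction of the asset-gate logic: each "no" gate maps
# to a missing-asset code, and only wired families expose a run button.
GATE_TO_MISSING = {
    "sample_ids": "DATASET_MISSING",
    "prompts": "PROMPTS_MISSING",
    "gold": "GOLD_MISSING",
    "grader": "GRADER_MISSING",
    "model_config": "MODEL_CONFIG_MISSING",
    "raw_outputs": "RAW_LOGS_MISSING",
}

def site_status(gates: dict, runner_wired: bool) -> tuple:
    """Return (status, missing-asset codes) for one benchmark family."""
    missing = [code for gate, code in GATE_TO_MISSING.items()
               if gates.get(gate) == "no"]
    status = "enabled" if runner_wired and not missing else "blocked"
    return status, missing

# The AIME 120 gates from this page: every asset missing, no wired runner.
aime_gates = {"sample_ids": "no", "prompts": "no", "gold": "no",
              "grader": "no", "model_config": "no", "raw_outputs": "no"}
```

Under this sketch, AIME 120 resolves to blocked with all six missing codes, matching the entry below.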

Each family entry below lists: benchmark | n | board uplift | model / temperature | asset gates (required assets) | site status | exact blocker | source refs.
AIME 120 | n: 120 | board uplift: +100.0% | model / temperature: not_found_in_exact_aime_board_artifact / not_explicitly_recorded
Asset gates
{
  "sample_ids": "no",
  "prompts": "no",
  "gold": "no",
  "grader": "no",
  "model_config": "no",
  "darwin": "yes",
  "hinton": "yes",
  "lua": "yes",
  "rcc_wrapper": "yes",
  "raw_outputs": "no"
}
Site status: blocked. Exact blocker: Latest board exposes AIME summary/prior only. Exact AIME 120 raw run, sample IDs, prompts, grader, and script were not found in the 2026-04-25 proof bundle. Older AIME archive candidates exist under rcc_publication_package but are RCC v4/v16/gpt-4o-era assets, not this Ben-confirmed gpt-5.2 board.
Missing: DATASET_MISSING, PROMPTS_MISSING, GOLD_MISSING, GRADER_MISSING, MODEL_CONFIG_MISSING, RAW_LOGS_MISSING
Source refs
[
  "AIME_120__darwin_benchmark_priors.json",
  "aime_hard.jsonl",
  "aime"
]
BBEH | n: 100 | board uplift: +183.3% | model / temperature: gpt-5.2 / not_explicitly_recorded
Asset gates
{
  "sample_ids": "yes",
  "prompts": "no",
  "gold": "yes",
  "grader": "yes",
  "model_config": "no",
  "darwin": "yes",
  "hinton": "yes",
  "lua": "yes",
  "rcc_wrapper": "yes",
  "raw_outputs": "no"
}
Site status: blocked. Exact blocker: Script and scored result are present, but the local proof artifact stores task/gold/answer correctness, not full prompt/input text or raw model outputs. The public site runner is not wired to the historical script.
Missing: PROMPTS_MISSING, MODEL_CONFIG_MISSING, RAW_LOGS_MISSING
Source refs
[
  "BBEH__bbeh_routed_bench_results.json",
  "BBEH__bbeh_routed_bench_results_2.json"
]
BBH | n: 50 | board uplift: +14.3% | model / temperature: gpt-5.2 / not_explicitly_recorded
Asset gates
{
  "sample_ids": "yes",
  "prompts": "yes",
  "gold": "yes",
  "grader": "yes",
  "model_config": "no",
  "darwin": "yes",
  "hinton": "yes",
  "lua": "yes",
  "rcc_wrapper": "yes",
  "raw_outputs": "no"
}
Site status: blocked. Exact blocker: Frozen IDs/contract/script are present, but the public site runner is not yet wired to execute the historical BBH frozen harness.
Missing: MODEL_CONFIG_MISSING, RAW_LOGS_MISSING
Source refs
[
  "bbh_frozen_ids.json",
  "bbh_frozen_contract.json",
  "BBH__bbh_official_current_results.json"
]
Facts Grounding | n: 100 | board uplift: +2.1% | model / temperature: gpt-5.2 / not_explicitly_recorded
Asset gates
{
  "sample_ids": "yes",
  "prompts": "yes",
  "gold": "yes",
  "grader": "yes",
  "model_config": "no",
  "darwin": "yes",
  "hinton": "yes",
  "lua": "yes",
  "rcc_wrapper": "yes",
  "raw_outputs": "yes"
}
Site status: blocked. Exact blocker: Exact result/script assets are present, but the public site does not yet execute the historical Facts Grounding harness.
Missing: MODEL_CONFIG_MISSING
Source refs
[
  "Facts_Grounding__deepmind_facts_grounding_bench_results.json"
]
GPQA | n: 50 | board uplift: +13.2% | model / temperature: gpt-5.2 / not_explicitly_recorded
Asset gates
{
  "sample_ids": "no",
  "prompts": "no",
  "gold": "yes",
  "grader": "yes",
  "model_config": "no",
  "darwin": "yes",
  "hinton": "yes",
  "lua": "yes",
  "rcc_wrapper": "yes",
  "raw_outputs": "no"
}
Site status: blocked. Exact blocker: The stored GPQA board artifact is aggregate-only: scores and metrics exist, but the original per-sample prompts, IDs, and raw outputs are not stored in the proof JSON.
Missing: DATASET_MISSING, PROMPTS_MISSING, MODEL_CONFIG_MISSING, RAW_LOGS_MISSING
Source refs
[
  "GPQA__gpqa_routed_bench_results.json"
]
HealthBench main | n: 50 | board uplift: +6.2% | model / temperature: gpt-5.2 / not_explicitly_recorded
Asset gates
{
  "sample_ids": "yes",
  "prompts": "yes",
  "gold": "yes",
  "grader": "yes",
  "model_config": "no",
  "darwin": "yes",
  "hinton": "yes",
  "lua": "yes",
  "rcc_wrapper": "yes",
  "raw_outputs": "no"
}
Site status: blocked. Exact blocker: Local HealthBench data/script/result are present, but the stored proof result is aggregate metric output, and the public site runner is not wired to historical HealthBench execution.
Missing: MODEL_CONFIG_MISSING, RAW_LOGS_MISSING
Source refs
[
  "healthbench.jsonl",
  "HealthBench_main__healthbench_routing_bench_results.json"
]
HealthBench hard | n: stored | board uplift: +11.9% | model / temperature: gpt-5.2 / not_explicitly_recorded
Asset gates
{
  "sample_ids": "yes",
  "prompts": "yes",
  "gold": "yes",
  "grader": "yes",
  "model_config": "no",
  "darwin": "yes",
  "hinton": "yes",
  "lua": "yes",
  "rcc_wrapper": "yes",
  "raw_outputs": "no"
}
Site status: blocked. Exact blocker: HealthBench hard local data exists, but the latest proof bundle stores only Darwin benchmark priors for this board row, not the raw hard-subset execution result.
Missing: MODEL_CONFIG_MISSING, RAW_LOGS_MISSING
Source refs
[
  "healthbench_hard.jsonl",
  "HealthBench_hard__darwin_benchmark_priors.json"
]
HealthBench consensus | n: stored | board uplift: +3.1% | model / temperature: gpt-5.2 / not_explicitly_recorded
Asset gates
{
  "sample_ids": "yes",
  "prompts": "yes",
  "gold": "yes",
  "grader": "yes",
  "model_config": "no",
  "darwin": "yes",
  "hinton": "yes",
  "lua": "yes",
  "rcc_wrapper": "yes",
  "raw_outputs": "no"
}
Site status: blocked. Exact blocker: HealthBench consensus local data exists, but the latest proof bundle stores only Darwin benchmark priors for this board row, not the raw consensus-subset execution result.
Missing: MODEL_CONFIG_MISSING, RAW_LOGS_MISSING
Source refs
[
  "healthbench_consensus.jsonl",
  "HealthBench_consensus__darwin_benchmark_priors.json"
]
HLE / HLE-Verified | n: 50 | board uplift: +21.1% | model / temperature: gpt-5.2 / not_explicitly_recorded
Asset gates
{
  "sample_ids": "yes",
  "prompts": "yes",
  "gold": "yes",
  "grader": "yes",
  "model_config": "no",
  "darwin": "yes",
  "hinton": "yes",
  "lua": "yes",
  "rcc_wrapper": "yes",
  "raw_outputs": "yes"
}
Site status: blocked. Exact blocker: Exact HLE result/script assets are present, but the public site does not yet execute the historical HLE harness.
Missing: MODEL_CONFIG_MISSING
Source refs
[
  "HLE___HLE-Verified__hle_verified_routed_bench_results.json",
  "HLE___HLE-Verified__hle_verified_routed_bench_results_2.json"
]
HorizonMath | n: 50 | board uplift: +45.5% | model / temperature: gpt-5.2 / not_explicitly_recorded
Asset gates
{
  "sample_ids": "yes",
  "prompts": "no",
  "gold": "yes",
  "grader": "yes",
  "model_config": "no",
  "darwin": "yes",
  "hinton": "yes",
  "lua": "yes",
  "rcc_wrapper": "yes",
  "raw_outputs": "yes"
}
Site status: blocked. Exact blocker: HorizonMath IDs/targets/raw outputs exist, but the local result cache does not include full prompt/problem statements, and the public site runner is not wired to the historical script.
Missing: PROMPTS_MISSING, MODEL_CONFIG_MISSING
Source refs
[
  "HorizonMath__horizonmath_routed_bench_results.json"
]
MMLU-Pro | n: 50 | board uplift: +3.4% | model / temperature: gpt-5.2 / not_explicitly_recorded
Asset gates
{
  "sample_ids": "yes",
  "prompts": "no",
  "gold": "yes",
  "grader": "yes",
  "model_config": "no",
  "darwin": "yes",
  "hinton": "yes",
  "lua": "yes",
  "rcc_wrapper": "yes",
  "raw_outputs": "no"
}
Site status: blocked. Exact blocker: The stored MMLU-Pro summary says n=50, but the result JSON stores only partial sample rows and no full prompt text; the public site runner is not wired to the historical MMLU-Pro script.
Missing: PROMPTS_MISSING, MODEL_CONFIG_MISSING, RAW_LOGS_MISSING
Source refs
[
  "MMLU-Pro__mmlu_pro_routed_bench_results.json"
]
MuSR | n: stored | board uplift: +3.3% | model / temperature: not_found_in_exact_musr_board_artifact / not_explicitly_recorded
Asset gates
{
  "sample_ids": "no",
  "prompts": "no",
  "gold": "no",
  "grader": "no",
  "model_config": "no",
  "darwin": "yes",
  "hinton": "yes",
  "lua": "yes",
  "rcc_wrapper": "yes",
  "raw_outputs": "no"
}
Site status: blocked. Exact blocker: Latest board exposes MuSR summary/prior only. Exact MuSR raw run, sample IDs, prompts, grader, and script were not found in the 2026-04-25 proof bundle.
Missing: DATASET_MISSING, PROMPTS_MISSING, GOLD_MISSING, GRADER_MISSING, MODEL_CONFIG_MISSING, RAW_LOGS_MISSING
Source refs
[
  "MuSR__darwin_benchmark_priors.json"
]
SimpleQA | n: 100 | board uplift: +16.0% | model / temperature: gpt-5.2 / not_explicitly_recorded
Asset gates
{
  "sample_ids": "yes",
  "prompts": "yes",
  "gold": "yes",
  "grader": "yes",
  "model_config": "no",
  "darwin": "yes",
  "hinton": "yes",
  "lua": "yes",
  "rcc_wrapper": "yes",
  "raw_outputs": "yes"
}
Site status: blocked. Exact blocker: Exact SimpleQA result/script assets are present, but the public site does not yet execute the historical SimpleQA harness.
Missing: MODEL_CONFIG_MISSING
Source refs
[
  "SimpleQA__official_simpleqa_routing_bench_results.json"
]
SimpleQA Verified / VSF | n: 100 (clean harness) | board uplift: +120.4% | model / temperature: gpt-5.2 / not_explicitly_recorded
Asset gates
{
  "sample_ids": "yes",
  "prompts": "yes",
  "gold": "yes",
  "grader": "yes",
  "model_config": "no",
  "darwin": "yes",
  "hinton": "yes",
  "lua": "yes",
  "rcc_wrapper": "yes",
  "raw_outputs": "yes"
}
Site status: enabled. Historical artifact replay is wired from proof_manifests/vsf_historical_reproduction_results.json. Live replay remains separate.
Missing: none
Source refs
[
  "SimpleQA_Verified___VSF__deepmind_simpleqa_verified_routing_bench_results.json",
  "SimpleQA_Verified___VSF__deepmind_simpleqa_verified_routing_bench_results_2.json"
]
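As a closing sketch, the wired VSF replay path can be exercised as below. Only the proof_manifests/vsf_historical_reproduction_results.json path comes from this page; the toy manifest contents and the loader shape are invented for illustration, since the real schema is not documented here.

```python
import json
import tempfile
from pathlib import Path

# Build a throwaway directory that mimics the wired layout, then read the
# manifest back the way a replay loader plausibly would. The fields written
# here are made up; only the relative path is taken from this page.
root = Path(tempfile.mkdtemp())
manifest_path = root / "proof_manifests" / "vsf_historical_reproduction_results.json"
manifest_path.parent.mkdir(parents=True)
manifest_path.write_text(json.dumps(
    {"family": "SimpleQA Verified / VSF", "n": 100, "uplift": "+120.4%"}))

replay = json.loads(manifest_path.read_text())
```

Reading the manifest through a fixed relative path, rather than hardcoding an absolute one, is what lets the same loader serve both the archived bundle and a locally checked-out copy.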