Advanced Evidence
Current live lanes
The primary /run page shows one current live lane per benchmark.
Artifacts
Raw logs, manifests, and advanced identity fields remain available for audit.
Historical exceptions
Historical material is kept here only as an audit reference, not as the default public story.
Historical board context
Same model. Different operating layer. Ben-confirmed selective routing board, 2026-04-25.
Artifact-backed historical evidence. These values are context, not guaranteed outputs for every live run.
Selective benchmark uplift observed. Run one live proof sample; bring your own API key to reproduce more; bring us the AI cases that break it.
- Omar/RCC does not replace frontier models.
- Omar/RCC does not universally improve every benchmark.
- Routing is essential.
- Relative uplift must be read alongside absolute gain (see the sketch after this list).
- Mismatch and neutral families are part of the evidence.
- Core claim: same model, different operating layer.
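To make the relative-vs-absolute point concrete, here is a minimal TypeScript sketch of the uplift arithmetic, assuming the standard definitions (relative uplift as a fraction of the baseline score, absolute gain in percentage points). The scores used are illustrative placeholders, not board values.

```ts
// Minimal sketch, assuming: relative uplift = (routed - baseline) / baseline,
// absolute gain = routed - baseline. Inputs are accuracy percentages.
interface UpliftReport {
  absoluteGain: number;   // percentage points gained
  relativeUplift: number; // fraction of the baseline score
}

function uplift(baselinePct: number, routedPct: number): UpliftReport {
  if (baselinePct <= 0) throw new Error("baseline must be positive for relative uplift");
  return {
    absoluteGain: routedPct - baselinePct,
    relativeUplift: (routedPct - baselinePct) / baselinePct,
  };
}

// A +100% relative uplift can mean 10% -> 20% or 40% -> 80%;
// the absolute gain (10 vs. 40 points) tells those cases apart.
console.log(uplift(10, 20)); // { absoluteGain: 10, relativeUplift: 1 }
console.log(uplift(40, 80)); // { absoluteGain: 40, relativeUplift: 1 }
```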
Strong areas: AIME, BBEH, HorizonMath, SimpleQA Verified / VSF, HLE, GPQA, HealthBench hard
Smaller but useful areas: Facts Grounding, MMLU-Pro, MuSR, HealthBench consensus
| Benchmark | Family | n | Relative uplift |
|---|---|---|---|
| AIME 120 | olympiad math | 120 | +100.0% |
| BBEH | anti-family reasoning benchmark | 100 | +183.3% |
| BBH | Big-Bench Hard reasoning | 50 | +14.3% |
| Facts Grounding | factual grounding | 100 | +2.1% |
| GPQA | graduate-level science QA | 50 | +13.2% |
| HealthBench main | medical reliability | 50 | +6.2% |
| HealthBench hard | medical reliability hard | stored | +11.9% |
| HealthBench consensus | medical consensus special-type | stored | +3.1% |
| HLE / HLE-Verified | broad hard reasoning | 50 | +21.1% |
| HorizonMath | research math auto-checkable numeric/constants subset | 50 | +45.5% |
| MMLU-Pro | breadth-heavy expert MCQ | 50 | +3.4% |
| MuSR | reasoning holdout | stored | +3.3% |
| SimpleQA | factual QA | 100 | +16.0% |
| SimpleQA Verified / VSF | verified short factual QA | 100 (clean harness) | +120.4% |
Lane taxonomy
- Historical Artifact Reproduction: uses archived evidence.
- Live Reproduction Attempt: runs selected samples through the current harness.
- Diagnostic Live Replay: debugging current behavior only.
- Board Evidence Context: reference evidence only.
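Read as a data shape, the taxonomy is a closed set of lane kinds, each tied to one evidence source. The type below is a hypothetical sketch for illustration, not the site's actual schema.

```ts
// Hypothetical sketch of the four lane kinds; names are illustrative.
type Lane =
  | { kind: "historical_artifact_reproduction"; source: "archived_evidence" }
  | { kind: "live_reproduction_attempt"; source: "current_harness" }  // selected samples only
  | { kind: "diagnostic_live_replay"; source: "current_harness" }     // debugging, not evidence
  | { kind: "board_evidence_context"; source: "archived_evidence" };  // reference only
```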
Direct comparability requires matching samples, model, grader, prompt wrapper, and routing harness.
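That matching requirement can be expressed as a strict identity check. A minimal sketch follows; the field names are assumptions for illustration, not the harness's actual manifest keys.

```ts
// Comparability check over the five identity fields named above.
// Field names are illustrative, not the real manifest schema.
interface RunIdentity {
  sampleIds: string[];    // frozen sample set, in order
  model: string;          // e.g. "gpt-5.2"
  grader: string;         // grader script/version
  promptWrapper: string;  // prompt wrapper identifier
  routingHarness: string; // routing harness identifier
}

function directlyComparable(a: RunIdentity, b: RunIdentity): boolean {
  return (
    a.sampleIds.length === b.sampleIds.length &&
    a.sampleIds.every((id, i) => id === b.sampleIds[i]) &&
    a.model === b.model &&
    a.grader === b.grader &&
    a.promptWrapper === b.promptWrapper &&
    a.routingHarness === b.routingHarness
  );
}
```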
Board Reproduction
Historical Benchmark Evidence is read-only. Live Replay Demo is not historical board reproduction. Board Reproduction is enabled only when exact historical assets are wired by benchmark family.
Board families: 14. Site-enabled reproduction: SimpleQA Verified / VSF only. Assets found but not yet wired: BBEH, BBH, Facts Grounding, HealthBench main, HealthBench hard, HealthBench consensus, HLE / HLE-Verified, SimpleQA.
Blocked families (every family except SimpleQA Verified / VSF): AIME 120, BBEH, BBH, Facts Grounding, GPQA, HealthBench main, HealthBench hard, HealthBench consensus, HLE / HLE-Verified, HorizonMath, MMLU-Pro, MuSR, SimpleQA.
No blocked family exposes a run button. BYOK (bring your own key) applies only after a family-specific board runner is enabled.
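A hypothetical sketch of that gating rule, with illustrative names rather than the site's real API:

```ts
// Hypothetical gating sketch; names are illustrative, not the site's real API.
interface FamilyState {
  family: string;
  runnerWired: boolean; // family-specific historical board runner enabled?
}

function showRunButton(state: FamilyState): boolean {
  return state.runnerWired; // blocked families never expose a run button
}

function acceptByokKey(state: FamilyState, apiKey: string | null): boolean {
  // BYOK applies only once the family's board runner is enabled.
  return state.runnerWired && apiKey !== null && apiKey.length > 0;
}

console.log(showRunButton({ family: "SimpleQA Verified / VSF", runnerWired: true })); // true
console.log(showRunButton({ family: "AIME 120", runnerWired: false }));               // false
```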
Asset gates checked per family: sample_ids, prompts, gold, grader, model_config, darwin, hinton, lua, rcc_wrapper, raw_outputs. The Missing gates column lists the gates recorded as "no"; every other gate in that row is recorded as "yes".

| Benchmark | n | Board uplift | Model / temperature | Missing gates | Site status | Exact blocker | Source refs |
|---|---|---|---|---|---|---|---|
| AIME 120 | 120 | +100.0% | not_found_in_exact_aime_board_artifact / not_explicitly_recorded | sample_ids, prompts, gold, grader, model_config, raw_outputs | blocked | Latest board exposes AIME summary/prior only. The exact AIME 120 raw run, sample IDs, prompts, grader, and script were not found in the 2026-04-25 proof bundle. Older AIME archive candidates exist under rcc_publication_package, but those are RCC v4/v16/gpt-4o-era assets, not from this Ben-confirmed gpt-5.2 board. Missing: DATASET_MISSING, PROMPTS_MISSING, GOLD_MISSING, GRADER_MISSING, MODEL_CONFIG_MISSING, RAW_LOGS_MISSING | AIME_120__darwin_benchmark_priors.json, aime_hard.jsonl, aime |
| BBEH | 100 | +183.3% | gpt-5.2 / not_explicitly_recorded | prompts, model_config, raw_outputs | blocked | Script and scored result are present, but the local proof artifact stores task/gold/answer correctness, not full prompt/input text or raw model outputs. The public site runner is not wired to the historical script. Missing: PROMPTS_MISSING, MODEL_CONFIG_MISSING, RAW_LOGS_MISSING | BBEH__bbeh_routed_bench_results.json, BBEH__bbeh_routed_bench_results_2.json |
| BBH | 50 | +14.3% | gpt-5.2 / not_explicitly_recorded | model_config, raw_outputs | blocked | Frozen IDs/contract/script are present, but the public site runner is not yet wired to execute the historical BBH frozen harness. Missing: MODEL_CONFIG_MISSING, RAW_LOGS_MISSING | bbh_frozen_ids.json, bbh_frozen_contract.json, BBH__bbh_official_current_results.json |
| Facts Grounding | 100 | +2.1% | gpt-5.2 / not_explicitly_recorded | model_config | blocked | Exact result/script assets are present, but the public site does not yet execute the historical Facts Grounding harness. Missing: MODEL_CONFIG_MISSING | Facts_Grounding__deepmind_facts_grounding_bench_results.json |
| GPQA | 50 | +13.2% | gpt-5.2 / not_explicitly_recorded | sample_ids, prompts, model_config, raw_outputs | blocked | The stored GPQA board artifact is aggregate-only: scores and metrics exist, but the original per-sample prompts, IDs, and raw outputs are not stored in the proof JSON. Missing: DATASET_MISSING, PROMPTS_MISSING, MODEL_CONFIG_MISSING, RAW_LOGS_MISSING | GPQA__gpqa_routed_bench_results.json |
| HealthBench main | 50 | +6.2% | gpt-5.2 / not_explicitly_recorded | model_config, raw_outputs | blocked | Local HealthBench data/script/result are present, but the stored proof result is aggregate metric output, and the public site runner is not wired to historical HealthBench execution. Missing: MODEL_CONFIG_MISSING, RAW_LOGS_MISSING | healthbench.jsonl, HealthBench_main__healthbench_routing_bench_results.json |
| HealthBench hard | stored | +11.9% | gpt-5.2 / not_explicitly_recorded | model_config, raw_outputs | blocked | HealthBench hard local data exists, but the latest proof bundle stores only Darwin benchmark priors for this board row, not the raw hard-subset execution result. Missing: MODEL_CONFIG_MISSING, RAW_LOGS_MISSING | healthbench_hard.jsonl, HealthBench_hard__darwin_benchmark_priors.json |
| HealthBench consensus | stored | +3.1% | gpt-5.2 / not_explicitly_recorded | model_config, raw_outputs | blocked | HealthBench consensus local data exists, but the latest proof bundle stores only Darwin benchmark priors for this board row, not the raw consensus-subset execution result. Missing: MODEL_CONFIG_MISSING, RAW_LOGS_MISSING | healthbench_consensus.jsonl, HealthBench_consensus__darwin_benchmark_priors.json |
| HLE / HLE-Verified | 50 | +21.1% | gpt-5.2 / not_explicitly_recorded | model_config | blocked | Exact HLE result/script assets are present, but the public site does not yet execute the historical HLE harness. Missing: MODEL_CONFIG_MISSING | HLE___HLE-Verified__hle_verified_routed_bench_results.json, HLE___HLE-Verified__hle_verified_routed_bench_results_2.json |
| HorizonMath | 50 | +45.5% | gpt-5.2 / not_explicitly_recorded | prompts, model_config | blocked | HorizonMath IDs/targets/raw outputs exist, but the local result cache does not include full prompt/problem statements, and the public site runner is not wired to the historical script. Missing: PROMPTS_MISSING, MODEL_CONFIG_MISSING | HorizonMath__horizonmath_routed_bench_results.json |
| MMLU-Pro | 50 | +3.4% | gpt-5.2 / not_explicitly_recorded | prompts, model_config, raw_outputs | blocked | The stored MMLU-Pro summary says n=50, but the result JSON stores only partial sample rows and no full prompt text; the public site runner is not wired to the historical MMLU-Pro script. Missing: PROMPTS_MISSING, MODEL_CONFIG_MISSING, RAW_LOGS_MISSING | MMLU-Pro__mmlu_pro_routed_bench_results.json |
| MuSR | stored | +3.3% | not_found_in_exact_musr_board_artifact / not_explicitly_recorded | sample_ids, prompts, gold, grader, model_config, raw_outputs | blocked | Latest board exposes MuSR summary/prior only. The exact MuSR raw run, sample IDs, prompts, grader, and script were not found in the 2026-04-25 proof bundle. Missing: DATASET_MISSING, PROMPTS_MISSING, GOLD_MISSING, GRADER_MISSING, MODEL_CONFIG_MISSING, RAW_LOGS_MISSING | MuSR__darwin_benchmark_priors.json |
| SimpleQA | 100 | +16.0% | gpt-5.2 / not_explicitly_recorded | model_config | blocked | Exact SimpleQA result/script assets are present, but the public site does not yet execute the historical SimpleQA harness. Missing: MODEL_CONFIG_MISSING | SimpleQA__official_simpleqa_routing_bench_results.json |
| SimpleQA Verified / VSF | 100 (clean harness) | +120.4% | gpt-5.2 / not_explicitly_recorded | model_config | enabled | Historical artifact replay is wired from proof_manifests/vsf_historical_reproduction_results.json. Live replay remains separate. Missing: none | SimpleQA_Verified___VSF__deepmind_simpleqa_verified_routing_bench_results.json, SimpleQA_Verified___VSF__deepmind_simpleqa_verified_routing_bench_results_2.json |
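The pattern in the rows above is mechanical enough to sketch. The code below assumes the gate names as recorded in the table, and assumes that "enabled" requires a wired family runner; as the VSF row shows, a wired replay can be enabled even while a gate such as model_config is still "no", so status is not derived from the gates alone.

```ts
// Sketch of how a board row's status and missing-gate list appear to be
// derived. Assumption: status follows runner wiring; gates are reported.
type Gate =
  | "sample_ids" | "prompts" | "gold" | "grader" | "model_config"
  | "darwin" | "hinton" | "lua" | "rcc_wrapper" | "raw_outputs";

interface BoardRow {
  gates: Record<Gate, "yes" | "no">;
  runnerWired: boolean; // family-specific board runner enabled?
}

function siteStatus(row: BoardRow): "enabled" | "blocked" {
  return row.runnerWired ? "enabled" : "blocked";
}

function missingGates(row: BoardRow): Gate[] {
  // Everything recorded as "no" surfaces in the row's blocker text.
  return (Object.keys(row.gates) as Gate[]).filter((g) => row.gates[g] === "no");
}
```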
