
Advanced Evidence

Current live lanes

The primary /run page shows one current live lane per benchmark.

Artifacts

Raw logs, manifests, and advanced identity fields remain available for audit.

Historical exceptions

Historical material is kept here only as an audit reference, not as the default public story.

Historical board context

Same model. Different operating layer. Ben-confirmed selective routing board, 2026-04-25.

Artifact-backed historical evidence. These values are context, not guaranteed outputs for every live run.

Selective benchmark uplift observed. Run one live proof sample. Bring your own API key to reproduce more. Bring us the AI cases that break.

Strong areas: AIME, BBEH, HorizonMath, SimpleQA Verified / VSF, HLE, GPQA, HealthBench hard

Smaller but useful areas: Facts Grounding, MMLU-Pro, MuSR, HealthBench consensus

Benchmark | Family | n | Relative uplift
AIME 120 | olympiad math | 120 | +100.0%
BBEH | anti-family reasoning benchmark | 100 | +183.3%
BBH | Big-Bench Hard reasoning | 50 | +14.3%
Facts Grounding | factual grounding | 100 | +2.1%
GPQA | graduate-level science QA | 50 | +13.2%
HealthBench main | medical reliability | 50 | +6.2%
HealthBench hard | medical reliability (hard) | stored | +11.9%
HealthBench consensus | medical consensus special-type | stored | +3.1%
HLE / HLE-Verified | broad hard reasoning | 50 | +21.1%
HorizonMath | research math (auto-checkable numeric/constants subset) | 50 | +45.5%
MMLU-Pro | breadth-heavy expert MCQ | 50 | +3.4%
MuSR | reasoning holdout | stored | +3.3%
SimpleQA | factual QA | 100 | +16.0%
SimpleQA Verified / VSF | verified short factual QA | 100 (clean harness) | +120.4%

Lane taxonomy

Historical Artifact Reproduction uses archived evidence. Live Reproduction Attempt runs selected samples through the current harness. Diagnostic Live Replay is for debugging current behavior. Board Evidence Context is reference evidence only.

Direct comparability requires matching samples, model, grader, prompt wrapper, and routing harness.
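That comparability contract can be sketched as a field-by-field check. This is a minimal illustration, not the site's actual manifest schema; the field names are assumptions.

```python
# Hypothetical comparability check: two runs are directly comparable only
# when samples, model, grader, prompt wrapper, and routing harness all match.
COMPARABILITY_FIELDS = ("sample_ids", "model", "grader",
                        "prompt_wrapper", "routing_harness")

def comparable(run_a: dict, run_b: dict) -> bool:
    """True only when every comparability-critical field matches exactly."""
    return all(run_a.get(f) == run_b.get(f) for f in COMPARABILITY_FIELDS)

run_a = {"sample_ids": ["q1", "q2"], "model": "gpt-5.2",
         "grader": "exact_match", "prompt_wrapper": "rcc",
         "routing_harness": "routed-v1"}
run_b = dict(run_a, grader="llm_judge")  # identical except for the grader
```

A run pair that differs in any one of these fields (as run_b does in its grader) would fail the check, which is the point: a single mismatched axis voids direct comparison.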

Board Reproduction

Historical Benchmark Evidence is read-only. Live Replay Demo is not historical board reproduction. Board Reproduction is enabled only when exact historical assets are wired by benchmark family.

Board families: 14. Site-enabled reproduction: SimpleQA Verified / VSF. Assets found but not wired: BBEH, BBH, Facts Grounding, HealthBench main, HealthBench hard, HealthBench consensus, HLE / HLE-Verified, SimpleQA.

Blocked families: AIME 120, BBEH, BBH, Facts Grounding, GPQA, HealthBench main, HealthBench hard, HealthBench consensus, HLE / HLE-Verified, HorizonMath, MMLU-Pro, MuSR, SimpleQA.

No blocked family exposes a run button. BYOK will apply only after a family-specific board runner is enabled.
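A minimal sketch of that gating, inferred from the blocked rows in this section: each "no" asset gate contributes a missing-asset code, and a family is runnable only when a board runner is wired. The gate-to-code mapping and the runner_wired flag are assumptions, not the site's implementation (the enabled VSF row shows the real logic has exceptions).

```python
# Hypothetical reconstruction of the asset-gate logic: each "no" gate maps
# to a missing-asset code, and only wired families expose a run button.
GATE_TO_MISSING = {
    "sample_ids": "DATASET_MISSING",
    "prompts": "PROMPTS_MISSING",
    "gold": "GOLD_MISSING",
    "grader": "GRADER_MISSING",
    "model_config": "MODEL_CONFIG_MISSING",
    "raw_outputs": "RAW_LOGS_MISSING",
}

def site_status(gates: dict, runner_wired: bool) -> tuple:
    """Return (status, missing-asset codes) for one benchmark family."""
    missing = [code for gate, code in GATE_TO_MISSING.items()
               if gates.get(gate) == "no"]
    status = "enabled" if runner_wired and not missing else "blocked"
    return status, missing

# The AIME 120 gates from this page: every asset missing, no wired runner.
aime_gates = {"sample_ids": "no", "prompts": "no", "gold": "no",
              "grader": "no", "model_config": "no", "raw_outputs": "no"}
```

Under this sketch, AIME 120 resolves to blocked with all six missing codes, matching the entry below.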

Each family entry below lists: benchmark | n | board uplift | model / temperature | asset gates (required assets) | site status | exact blocker | source refs.
AIME 120 | n: 120 | board uplift: +100.0% | model / temperature: not_found_in_exact_aime_board_artifact / not_explicitly_recorded
Asset gates
{
  "sample_ids": "no",
  "prompts": "no",
  "gold": "no",
  "grader": "no",
  "model_config": "no",
  "darwin": "yes",
  "hinton": "yes",
  "lua": "yes",
  "rcc_wrapper": "yes",
  "raw_outputs": "no"
}
Site status: blocked. Exact blocker: Latest board exposes AIME summary/prior only. Exact AIME 120 raw run, sample IDs, prompts, grader, and script were not found in the 2026-04-25 proof bundle. Older AIME archive candidates exist under rcc_publication_package but are RCC v4/v16/gpt-4o-era assets, not this Ben-confirmed gpt-5.2 board.
Missing: DATASET_MISSING, PROMPTS_MISSING, GOLD_MISSING, GRADER_MISSING, MODEL_CONFIG_MISSING, RAW_LOGS_MISSING
Source refs
[
  "AIME_120__darwin_benchmark_priors.json",
  "aime_hard.jsonl",
  "aime"
]
BBEH | n: 100 | board uplift: +183.3% | model / temperature: gpt-5.2 / not_explicitly_recorded
Asset gates
{
  "sample_ids": "yes",
  "prompts": "no",
  "gold": "yes",
  "grader": "yes",
  "model_config": "no",
  "darwin": "yes",
  "hinton": "yes",
  "lua": "yes",
  "rcc_wrapper": "yes",
  "raw_outputs": "no"
}
Site status: blocked. Exact blocker: Script and scored result are present, but the local proof artifact stores task/gold/answer correctness, not full prompt/input text or raw model outputs. The public site runner is not wired to the historical script.
Missing: PROMPTS_MISSING, MODEL_CONFIG_MISSING, RAW_LOGS_MISSING
Source refs
[
  "BBEH__bbeh_routed_bench_results.json",
  "BBEH__bbeh_routed_bench_results_2.json"
]
BBH | n: 50 | board uplift: +14.3% | model / temperature: gpt-5.2 / not_explicitly_recorded
Asset gates
{
  "sample_ids": "yes",
  "prompts": "yes",
  "gold": "yes",
  "grader": "yes",
  "model_config": "no",
  "darwin": "yes",
  "hinton": "yes",
  "lua": "yes",
  "rcc_wrapper": "yes",
  "raw_outputs": "no"
}
Site status: blocked. Exact blocker: Frozen IDs/contract/script are present, but the public site runner is not yet wired to execute the historical BBH frozen harness.
Missing: MODEL_CONFIG_MISSING, RAW_LOGS_MISSING
Source refs
[
  "bbh_frozen_ids.json",
  "bbh_frozen_contract.json",
  "BBH__bbh_official_current_results.json"
]
Facts Grounding | n: 100 | board uplift: +2.1% | model / temperature: gpt-5.2 / not_explicitly_recorded
Asset gates
{
  "sample_ids": "yes",
  "prompts": "yes",
  "gold": "yes",
  "grader": "yes",
  "model_config": "no",
  "darwin": "yes",
  "hinton": "yes",
  "lua": "yes",
  "rcc_wrapper": "yes",
  "raw_outputs": "yes"
}
Site status: blocked. Exact blocker: Exact result/script assets are present, but the public site does not yet execute the historical Facts Grounding harness.
Missing: MODEL_CONFIG_MISSING
Source refs
[
  "Facts_Grounding__deepmind_facts_grounding_bench_results.json"
]
GPQA | n: 50 | board uplift: +13.2% | model / temperature: gpt-5.2 / not_explicitly_recorded
Asset gates
{
  "sample_ids": "no",
  "prompts": "no",
  "gold": "yes",
  "grader": "yes",
  "model_config": "no",
  "darwin": "yes",
  "hinton": "yes",
  "lua": "yes",
  "rcc_wrapper": "yes",
  "raw_outputs": "no"
}
Site status: blocked. Exact blocker: The stored GPQA board artifact is aggregate-only: scores and metrics exist, but the original per-sample prompts, IDs, and raw outputs are not stored in the proof JSON.
Missing: DATASET_MISSING, PROMPTS_MISSING, MODEL_CONFIG_MISSING, RAW_LOGS_MISSING
Source refs
[
  "GPQA__gpqa_routed_bench_results.json"
]
HealthBench main | n: 50 | board uplift: +6.2% | model / temperature: gpt-5.2 / not_explicitly_recorded
Asset gates
{
  "sample_ids": "yes",
  "prompts": "yes",
  "gold": "yes",
  "grader": "yes",
  "model_config": "no",
  "darwin": "yes",
  "hinton": "yes",
  "lua": "yes",
  "rcc_wrapper": "yes",
  "raw_outputs": "no"
}
Site status: blocked. Exact blocker: Local HealthBench data/script/result are present, but the stored proof result is aggregate metric output, and the public site runner is not wired to historical HealthBench execution.
Missing: MODEL_CONFIG_MISSING, RAW_LOGS_MISSING
Source refs
[
  "healthbench.jsonl",
  "HealthBench_main__healthbench_routing_bench_results.json"
]
HealthBench hard | n: stored | board uplift: +11.9% | model / temperature: gpt-5.2 / not_explicitly_recorded
Asset gates
{
  "sample_ids": "yes",
  "prompts": "yes",
  "gold": "yes",
  "grader": "yes",
  "model_config": "no",
  "darwin": "yes",
  "hinton": "yes",
  "lua": "yes",
  "rcc_wrapper": "yes",
  "raw_outputs": "no"
}
Site status: blocked. Exact blocker: HealthBench hard local data exists, but the latest proof bundle stores only Darwin benchmark priors for this board row, not the raw hard-subset execution result.
Missing: MODEL_CONFIG_MISSING, RAW_LOGS_MISSING
Source refs
[
  "healthbench_hard.jsonl",
  "HealthBench_hard__darwin_benchmark_priors.json"
]
HealthBench consensus | n: stored | board uplift: +3.1% | model / temperature: gpt-5.2 / not_explicitly_recorded
Asset gates
{
  "sample_ids": "yes",
  "prompts": "yes",
  "gold": "yes",
  "grader": "yes",
  "model_config": "no",
  "darwin": "yes",
  "hinton": "yes",
  "lua": "yes",
  "rcc_wrapper": "yes",
  "raw_outputs": "no"
}
Site status: blocked. Exact blocker: HealthBench consensus local data exists, but the latest proof bundle stores only Darwin benchmark priors for this board row, not the raw consensus-subset execution result.
Missing: MODEL_CONFIG_MISSING, RAW_LOGS_MISSING
Source refs
[
  "healthbench_consensus.jsonl",
  "HealthBench_consensus__darwin_benchmark_priors.json"
]
HLE / HLE-Verified | n: 50 | board uplift: +21.1% | model / temperature: gpt-5.2 / not_explicitly_recorded
Asset gates
{
  "sample_ids": "yes",
  "prompts": "yes",
  "gold": "yes",
  "grader": "yes",
  "model_config": "no",
  "darwin": "yes",
  "hinton": "yes",
  "lua": "yes",
  "rcc_wrapper": "yes",
  "raw_outputs": "yes"
}
Site status: blocked. Exact blocker: Exact HLE result/script assets are present, but the public site does not yet execute the historical HLE harness.
Missing: MODEL_CONFIG_MISSING
Source refs
[
  "HLE___HLE-Verified__hle_verified_routed_bench_results.json",
  "HLE___HLE-Verified__hle_verified_routed_bench_results_2.json"
]
HorizonMath | n: 50 | board uplift: +45.5% | model / temperature: gpt-5.2 / not_explicitly_recorded
Asset gates
{
  "sample_ids": "yes",
  "prompts": "no",
  "gold": "yes",
  "grader": "yes",
  "model_config": "no",
  "darwin": "yes",
  "hinton": "yes",
  "lua": "yes",
  "rcc_wrapper": "yes",
  "raw_outputs": "yes"
}
Site status: blocked. Exact blocker: HorizonMath IDs/targets/raw outputs exist, but the local result cache does not include full prompt/problem statements, and the public site runner is not wired to the historical script.
Missing: PROMPTS_MISSING, MODEL_CONFIG_MISSING
Source refs
[
  "HorizonMath__horizonmath_routed_bench_results.json"
]
MMLU-Pro | n: 50 | board uplift: +3.4% | model / temperature: gpt-5.2 / not_explicitly_recorded
Asset gates
{
  "sample_ids": "yes",
  "prompts": "no",
  "gold": "yes",
  "grader": "yes",
  "model_config": "no",
  "darwin": "yes",
  "hinton": "yes",
  "lua": "yes",
  "rcc_wrapper": "yes",
  "raw_outputs": "no"
}
Site status: blocked. Exact blocker: The stored MMLU-Pro summary says n=50, but the result JSON stores only partial sample rows and no full prompt text; the public site runner is not wired to the historical MMLU-Pro script.
Missing: PROMPTS_MISSING, MODEL_CONFIG_MISSING, RAW_LOGS_MISSING
Source refs
[
  "MMLU-Pro__mmlu_pro_routed_bench_results.json"
]
MuSR | n: stored | board uplift: +3.3% | model / temperature: not_found_in_exact_musr_board_artifact / not_explicitly_recorded
Asset gates
{
  "sample_ids": "no",
  "prompts": "no",
  "gold": "no",
  "grader": "no",
  "model_config": "no",
  "darwin": "yes",
  "hinton": "yes",
  "lua": "yes",
  "rcc_wrapper": "yes",
  "raw_outputs": "no"
}
Site status: blocked. Exact blocker: Latest board exposes MuSR summary/prior only. Exact MuSR raw run, sample IDs, prompts, grader, and script were not found in the 2026-04-25 proof bundle.
Missing: DATASET_MISSING, PROMPTS_MISSING, GOLD_MISSING, GRADER_MISSING, MODEL_CONFIG_MISSING, RAW_LOGS_MISSING
Source refs
[
  "MuSR__darwin_benchmark_priors.json"
]
SimpleQA | n: 100 | board uplift: +16.0% | model / temperature: gpt-5.2 / not_explicitly_recorded
Asset gates
{
  "sample_ids": "yes",
  "prompts": "yes",
  "gold": "yes",
  "grader": "yes",
  "model_config": "no",
  "darwin": "yes",
  "hinton": "yes",
  "lua": "yes",
  "rcc_wrapper": "yes",
  "raw_outputs": "yes"
}
Site status: blocked. Exact blocker: Exact SimpleQA result/script assets are present, but the public site does not yet execute the historical SimpleQA harness.
Missing: MODEL_CONFIG_MISSING
Source refs
[
  "SimpleQA__official_simpleqa_routing_bench_results.json"
]
SimpleQA Verified / VSF | n: 100 (clean harness) | board uplift: +120.4% | model / temperature: gpt-5.2 / not_explicitly_recorded
Asset gates
{
  "sample_ids": "yes",
  "prompts": "yes",
  "gold": "yes",
  "grader": "yes",
  "model_config": "no",
  "darwin": "yes",
  "hinton": "yes",
  "lua": "yes",
  "rcc_wrapper": "yes",
  "raw_outputs": "yes"
}
Site status: enabled. Historical artifact replay is wired from proof_manifests/vsf_historical_reproduction_results.json. Live replay remains separate.
Missing: none
Source refs
[
  "SimpleQA_Verified___VSF__deepmind_simpleqa_verified_routing_bench_results.json",
  "SimpleQA_Verified___VSF__deepmind_simpleqa_verified_routing_bench_results_2.json"
]
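As a closing sketch, the wired VSF replay path can be exercised as below. Only the proof_manifests/vsf_historical_reproduction_results.json path comes from this page; the toy manifest contents and the loader shape are invented for illustration, since the real schema is not documented here.

```python
import json
import tempfile
from pathlib import Path

# Build a throwaway directory that mimics the wired layout, then read the
# manifest back the way a replay loader plausibly would. The fields written
# here are made up; only the relative path is taken from this page.
root = Path(tempfile.mkdtemp())
manifest_path = root / "proof_manifests" / "vsf_historical_reproduction_results.json"
manifest_path.parent.mkdir(parents=True)
manifest_path.write_text(json.dumps(
    {"family": "SimpleQA Verified / VSF", "n": 100, "uplift": "+120.4%"}))

replay = json.loads(manifest_path.read_text())
```

Reading the manifest through a fixed relative path, rather than hardcoding an absolute one, is what lets the same loader serve both the archived bundle and a locally checked-out copy.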