Live benchmark replay

Omar Benchmark Replay

Run live baseline vs. Omar/RCC benchmark checks from the public harness. Start with the free BBEH smoke lane, or use your own API key for deeper runs.

Recent public live checks

Live runs from the public server. Small-n runs are smoke signals, not board-depth proof.
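To illustrate why a small-n percentage is only a smoke signal, here is a minimal sketch of the arithmetic. It assumes the reported figures are relative changes in correct-answer count over the baseline, which this page does not state. The function name `relative_lift` and all counts are hypothetical and are not taken from any actual run.

```python
def relative_lift(baseline_correct: int, treated_correct: int) -> float:
    """Percent change of the treated score relative to the baseline score.

    Assumption: scores are counts of correct answers on the same n samples.
    """
    if baseline_correct <= 0:
        raise ValueError("relative lift is undefined for a zero baseline")
    return 100.0 * (treated_correct - baseline_correct) / baseline_correct


# Hypothetical counts out of n=20, chosen only to show the sensitivity.
print(f"{relative_lift(1, 5):+.2f}%")  # baseline 1/20, treated 5/20 -> +400.00%
print(f"{relative_lift(1, 4):+.2f}%")  # one fewer treated answer correct -> +300.00%
print(f"{relative_lift(2, 5):+.2f}%")  # one more baseline answer correct -> +150.00%
```

Under these assumptions, a single flipped answer on either side moves the headline figure by 100 percentage points or more when the baseline count is low, so a large relative change at n=20 says little on its own.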

BBEH free smoke, n=20: +400.00% (latest). Smoke / live path check. Smoke result: small samples can swing.

HLE, n=20: +100.00%. Smoke / reconstructed.

HealthBench hard, n=20: +44.44%. Smoke / reconstructed.

HorizonMath, n=50: +36.36%. Directional / gate missing.

HealthBench main, n=20: +15.38%. Smoke / reconstructed.

MuSR, n=20: +15.38%. Smoke / reconstructed.

BBH, n=20: +14.29%. Smoke / reconstructed.

MMLU-Pro, n=20: +7.14%. Smoke / reconstructed.

VSF, n=20: +5.00%. Drift-watch / weak positive.

Artifact-backed board context

Historical/core benchmark results used as context. Public live runs may differ because live model outputs, reconstructed lanes, and gate availability vary by benchmark.

Every entry below is historical context only, not current public proof.

AIME 120, n=120: +100.0%

BBEH, n=100: +183.3%

BBH, n=50: +14.3%

Facts Grounding, n=100: +2.1%

GPQA, n=50: +13.2%

HealthBench main, n=50: +6.2%

HealthBench hard, n=stored: +11.9%

HealthBench consensus, n=stored: +3.1%

HLE / HLE-Verified, n=50: +21.1%

HorizonMath, n=50: +45.5%

MMLU-Pro, n=50: +3.4%

MuSR, n=stored: +3.3%

SimpleQA, n=100: +16.0%

SimpleQA Verified / VSF, n=100 (clean harness): +120.4%

Free BBEH smoke

Run the free BBEH 20-sample live smoke without an API key.

Small sample, live output, not board-depth proof.
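One way to see how little a 20-sample run pins down is to put a confidence interval around each accuracy. The sketch below uses the standard Wilson score interval for a binomial proportion. It is an illustration only: the public harness is not stated to compute intervals, the helper name `wilson_interval` is hypothetical, and the counts are made up.

```python
import math


def wilson_interval(correct: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%)."""
    if n <= 0 or not 0 <= correct <= n:
        raise ValueError("need n > 0 and 0 <= correct <= n")
    p = correct / n
    denom = 1.0 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, center - half), min(1.0, center + half)


# Hypothetical counts for a 20-sample smoke run, not real results.
for label, correct in [("baseline", 1), ("treated", 5)]:
    low, high = wilson_interval(correct, 20)
    print(f"{label}: {correct}/20 correct, ~95% interval {low:.1%} to {high:.1%}")
```

With these hypothetical counts the two intervals overlap, which is the statistical sense in which a small-sample smoke run is a signal rather than proof.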

BYOK live runs

Run current public benchmark lanes with your own API key.

Your key is used only for the selected run and is not stored.

Private Technical Pilot

For technical teams building agents, evals, and reasoning-heavy workflows. Omar/RCC is tested against your workflow, and the results are returned as a technical report with raw outputs, route logs, and reproducibility artifacts.

Technical teams only.