Omar OS-1

Baseline accuracy: 0.0600 Omar/RCC accuracy: 0.1200

Current HLE / HLE-Verified BYOK package: n=100, baseline 0.06, Omar/RCC 0.12, +6.00 points / +100.00% relative. Decision LIVE_SCORE_RECORDED; raw outputs, route log, results, cost summary, and manifest are attached.
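The point and relative figures quoted in these rows follow directly from the two accuracies. A minimal sketch of that arithmetic, assuming accuracies are stored as fractions in [0, 1]; the function name `uplift` is illustrative and not part of any Omar/RCC tooling:

```python
def uplift(baseline: float, treated: float) -> tuple[float, float]:
    """Return (absolute gain in percentage points, relative gain in percent)."""
    if baseline <= 0:
        raise ValueError("relative uplift is undefined for a zero baseline")
    points = (treated - baseline) * 100
    relative = (treated - baseline) / baseline * 100
    # Round to two decimals, matching the precision used in the rows.
    return round(points, 2), round(relative, 2)


# HLE / HLE-Verified row: baseline 0.06, Omar/RCC 0.12
hle_points, hle_relative = uplift(0.06, 0.12)
print(f"+{hle_points:.2f} points / +{hle_relative:.2f}% relative")
```

Relative uplift grows quickly when the baseline is small, which is why a 6-point gain on a 0.06 baseline reads as +100% while a 10-point gain on a 0.67 baseline reads as roughly +15%.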

BOARD-DEPTH REPLAY 100 samples / gpt-4o Run bbh-100-byok-live-2026-05-28 BBH

Baseline accuracy: 0.6700 Omar/RCC accuracy: 0.7700

Current BYOK live BBH n=100: baseline 0.67, Omar/RCC 0.77, +10.00 points / +14.93% relative. Exactness stays RECONSTRUCTED_ONLY / internal review, with raw outputs, route log, and manifest attached.

BOARD-DEPTH REPLAY 100 samples / gpt-5.2 Run musr-100-byok-live-2026-05-31 MuSR

Baseline accuracy: 0.7200 Omar/RCC accuracy: 0.7500

Current MuSR BYOK live package: n=100, baseline 0.72, Omar/RCC 0.75, +3.00 points / +4.17% relative. Decision LIVE_SCORE_RECORDED; raw outputs, route log, results, cost summary, and manifest stay attached.

CONTROL-LAYER SMOKE 100 / hallucination / factuality HaluEval

Control-layer smoke results

Prompt-injection and hallucination checks now sit beside reliability evidence

Charts are normalized to higher-is-better. Prompt-injection rows use safe execution score = 100 - attack success rate; HaluEval and TruthfulQA use factuality accuracy. These are diagnostic smoke/debug rows until Ben approves public claim wording.
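The higher-is-better normalization for prompt-injection rows can be sketched as below. This is an illustrative helper, assuming the attack success rate is expressed as a percentage; the function name is hypothetical, and the example values are the AgentDojo ASR figures reported in this section.

```python
def safe_execution_score(attack_success_rate_pct: float) -> float:
    """Convert an attack success rate (lower is better) into a
    higher-is-better score on the same 0-100 axis."""
    if not 0.0 <= attack_success_rate_pct <= 100.0:
        raise ValueError("attack success rate must be a percentage in [0, 100]")
    return 100.0 - attack_success_rate_pct


# AgentDojo held-out row: ASR 4.0% at baseline, 0.0% with the Omar hint.
baseline_safe = safe_execution_score(4.0)
hint_safe = safe_execution_score(0.0)
print(f"Baseline safe: {baseline_safe:.1f}%  Omar hint safe: {hint_safe:.1f}%")
```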

+23.46% +19.0 pts

Baseline accuracy: 81.0% Omar hint v8 accuracy: 100.0%

Missed hallucinations fell from 18/50 to 0/50 and false alarms from 1/50 to 0/50 on a 100-case BYOK factuality check. Read as control-layer smoke evidence until promoted into a board-depth claim package.
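The accuracy figures for this row are consistent with the error counts. A hedged sketch of that check follows. It assumes the 100 cases split into 50 hallucinated and 50 non-hallucinated items (inferred from the /50 denominators, not stated explicitly) and that accuracy counts every missed hallucination and every false alarm as one error; the function name is hypothetical.

```python
def detection_accuracy(missed: int, false_alarms: int,
                       n_positive: int, n_negative: int) -> float:
    """Accuracy in percent when the only errors are missed positives
    and false alarms on negatives."""
    total = n_positive + n_negative
    errors = missed + false_alarms
    return 100 * (total - errors) / total


# Assumed split: 50 hallucinated cases, 50 clean cases.
baseline_acc = detection_accuracy(missed=18, false_alarms=1, n_positive=50, n_negative=50)
hint_acc = detection_accuracy(missed=0, false_alarms=0, n_positive=50, n_negative=50)
print(f"Baseline accuracy: {baseline_acc:.1f}%  Omar hint accuracy: {hint_acc:.1f}%")
```

Under these assumptions the 19 baseline errors reproduce the 81.0% baseline, and zero errors reproduce the 100.0% figure.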

CONTROL-LAYER SMOKE 100 / hallucination / misleading QA TruthfulQA

+17.65% +15.0 pts

Baseline accuracy: 85.0% Omar hint v8 accuracy: 100.0%

False-choice rate fell from 15.0% to 0.0%; invalid/refusal count: 0. 100-case BYOK factuality check. Read as control-layer smoke evidence until promoted into a board-depth claim package.

CONTROL-LAYER SMOKE 100 held-out / prompt-injection resistance AgentDojo

+4.17% +4.0 pts

Baseline safe: 96.0% Omar hint v2 safe: 100.0%

Attack success rate (ASR) fell from 4.0% to 0.0% and utility rose from 56.0% to 79.0%. The companion 100-case ASR also fell, from 6.0% to 0.0%. This is a positive held-out AgentDojo smoke signal on the tested subset, not evidence of universal prompt-injection protection.

Artifact-backed board context

Board-depth rows, high-value exceptions, and stored context

Rows with 100+ samples provide stronger early technical evidence than live smoke tests, but should still be read alongside their sampling method, reproducibility status, raw outputs, and route logs. Omar/RCC shows where structured reasoning control improves performance, where it stays neutral, and where mismatch or drift should be flagged instead of hidden.
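One way to read a 100-sample accuracy with its sampling uncertainty is a Wilson score interval. The sketch below is a standard textbook calculation, not a method the source states it uses; it assumes independent, equally weighted items, and the function name is illustrative.

```python
import math


def wilson_interval(accuracy: float, n: int, z: float = 1.96) -> tuple[float, float]:
    """Approximate 95% Wilson score interval for a binomial proportion."""
    if n <= 0:
        raise ValueError("n must be positive")
    denom = 1 + z * z / n
    center = (accuracy + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(accuracy * (1 - accuracy) / n + z * z / (4 * n * n)) / denom
    return center - half_width, center + half_width


# BBH row: Omar/RCC accuracy 0.77 on n=100.
low, high = wilson_interval(0.77, 100)
print(f"0.77 at n=100 -> approx. 95% interval [{low:.3f}, {high:.3f}]")
```

At n=100 the interval spans roughly eight points on either side of the observed accuracy, which is one reason to read these rows together with the attached raw outputs and route logs.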

Proof family: Reasoning / hard reasoning

HorizonMath, HLE, AIME 120, GPQA, BBH, MMLU-Pro, and MuSR stay together as the reasoning family.

HIGH-VALUE EXCEPTION n=50 / research math auto-checkable numeric/constants subset HorizonMath

+45.5% +10.0 pts on 0-100 axis

Baseline accuracy: 22.0% Omar/RCC accuracy: 32.0%

High-value exception: n=50 auto-checkable research-math subset, not labeled as 100+ evidence. Score context: 22.0% baseline to 32.0% Omar/RCC (+10.0 pts); stored uplift +45.5%. Source: recovered HorizonMath n=50 artifact.

BOARD-DEPTH REPLAY 100 samples / gpt-4o Run run_1779989838150_31a83b3945e2 HLE / HLE-Verified

Baseline accuracy: 0.0600 Omar/RCC accuracy: 0.1200

INTERNAL REVIEW 120 samples / gpt-4o Run aime-120-byok-live-2026-05-28 AIME 120

+18.03% +9.17 pts

Baseline accuracy: 0.5083 Final adopted Omar/RCC accuracy: 0.6000

Current BYOK live AIME 120 n=120: baseline 0.5083, final adopted Omar/RCC 0.6000 (+9.17 points / +18.03% relative). Exactness stays RECONSTRUCTED_ONLY / NEEDS_INVESTIGATION.
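The four-decimal accuracies in this row can be cross-checked against whole-item counts. The sketch below assumes accuracy is an unweighted fraction of correct items out of n=120; the implied counts are an inference from the reported accuracies, not figures stated in the package, and the function name is hypothetical.

```python
def implied_correct(accuracy: float, n: int) -> int:
    """Recover the whole number of correct items implied by a rounded accuracy."""
    correct = round(accuracy * n)
    # Reject accuracies that no whole-item count reproduces at 4 decimals.
    if round(correct / n, 4) != round(accuracy, 4):
        raise ValueError(f"{accuracy} is not consistent with any count out of {n}")
    return correct


n = 120
baseline_correct = implied_correct(0.5083, n)   # implied baseline count
adopted_correct = implied_correct(0.6000, n)    # implied final adopted count
gained = adopted_correct - baseline_correct

points = round(gained / n * 100, 2)
relative = round(gained / baseline_correct * 100, 2)
print(f"{baseline_correct}/{n} -> {adopted_correct}/{n}: +{points} points / +{relative}% relative")
```

Under that assumption the row corresponds to 61 and 72 correct items, and the 11-item difference reproduces both the +9.17 point and +18.03% figures.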

BOARD-DEPTH REPLAY 100 samples / gpt-4o Run gpqa-100-byok-live-2026-05-28 GPQA

+4.00% +3.00 pts

Baseline accuracy: 0.7500 Omar/RCC accuracy: 0.7800

Current GPQA BYOK reconstructed live n=100 package: baseline 0.75, Omar/RCC 0.78, +3.00 points / +4.00% relative uplift. Exactness remains RECONSTRUCTED_ONLY / NEEDS_INVESTIGATION, with raw outputs, route log, results, cost summary, and manifest attached.

BOARD-DEPTH REPLAY 100 samples / gpt-4o Run bbh-100-byok-live-2026-05-28 BBH

Baseline accuracy: 0.6700 Omar/RCC accuracy: 0.7700

Current BYOK live BBH n=100: baseline 0.67, Omar/RCC 0.77, +10.00 points / +14.93% relative. Exactness stays RECONSTRUCTED_ONLY / internal review, with raw outputs, route log, and manifest attached.

BOARD-DEPTH REPLAY n=100 BYOK live / breadth-heavy expert MCQ MMLU-Pro

+5.41% +4.0 pts on 0-100 axis

Baseline accuracy: 74.0% Omar/RCC accuracy: 78.0%

Board-depth replay row; read with sampling method, raw outputs, and route logs. Score context: 74.0% baseline to 78.0% Omar/RCC (+4.0 pts); stored uplift +5.41%. MMLU-Pro n=100 BYOK live with Darwin/Hinton v2 thin blank-emission recovery.

BOARD-DEPTH REPLAY 100 samples / gpt-5.2 Run musr-100-byok-live-2026-05-31 MuSR

Baseline accuracy: 0.7200 Omar/RCC accuracy: 0.7500