Results are benchmark-scoped; dialect validation is in progress (IBM Enterprise COBOL, GnuCOBOL, Micro Focus).
DARPA-10 is an internally curated validation suite aligned with CLARA research objectives. Scores reflect a pinned harness and fixed decoding parameters; they are not a production warranty.
Benchmark Results

Benchmark Summary

Full Leaderboard

Evaluation Framework

This benchmark evaluates AI-assisted COBOL modernization under structured, reproducible conditions. The framework emphasizes principles that support reproducibility and formal validation in legacy system transformation research.

Disclosures

Dialect Coverage

Scores are provisional with respect to COBOL dialect coverage. Current runs target a defined benchmark suite. Validation across IBM Enterprise COBOL, GnuCOBOL, and Micro Focus dialect variants is being expanded.

Benchmark Scope

Results reflect model outputs on a defined benchmark suite at a point in time. Model behavior may vary across tasks, parameter configurations, and COBOL program characteristics not represented in the current suite.

Reproducibility

Each benchmark task is versioned and deterministic. The assessment pipeline emits independently verifiable intermediate outputs at each phase.

Task definition: meta.json + prompt.md + source files + gold.json oracle
Run logging: model ID, decoding params, timestamps, per-task outputs
Auditability: suite hash + prompt hash recorded per run
Verification: Merkle-anchored audit trail; independent replay possible
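The hashing scheme above can be sketched in Python. This is an illustrative sketch, not the benchmark's actual implementation: the record fields, the leaf ordering, and the odd-leaf promotion convention are all assumptions. The idea it demonstrates is real, though: hash each per-task output into a leaf, fold the leaves into a single Merkle root, and record that root per run so an independent replay can recompute and compare it.

```python
import hashlib
import json


def sha256_hex(data: bytes) -> str:
    """Hex-encoded SHA-256 digest."""
    return hashlib.sha256(data).hexdigest()


def merkle_root(leaves: list[str]) -> str:
    """Fold a list of hex leaf hashes into a single root.

    Pairs are concatenated and re-hashed level by level; an odd
    leaf is promoted unchanged (one common convention -- the real
    suite may differ).
    """
    if not leaves:
        return sha256_hex(b"")
    level = leaves[:]
    while len(level) > 1:
        nxt = [
            sha256_hex((level[i] + level[i + 1]).encode())
            for i in range(0, len(level) - 1, 2)
        ]
        if len(level) % 2 == 1:
            nxt.append(level[-1])  # promote the unpaired leaf
        level = nxt
    return level[0]


# Hypothetical per-task records, as a run log might store them.
task_outputs = [
    {"task_id": "t001", "output": "IDENTIFICATION DIVISION. ..."},
    {"task_id": "t002", "output": "WORKING-STORAGE SECTION. ..."},
]

# Canonical JSON (sorted keys) makes the leaf hashes deterministic.
leaves = [
    sha256_hex(json.dumps(rec, sort_keys=True).encode())
    for rec in task_outputs
]
root = merkle_root(leaves)
print(root)  # anchor this per run; replay recomputes and compares
```

Anchoring only the root keeps the audit record small while still committing to every per-task output: changing any record changes its leaf hash and therefore the root.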

Full methodology documentation, including the rating scale, assessment dimensions, and scoring process, is available on the methodology page.