Grades reflect a pinned harness and fixed decoding parameters on the USGOV-C10 suite; they are not a production warranty.
Benchmark Results

Benchmark Summary


Full Leaderboard

Evaluation Framework

This benchmark evaluates AI-assisted COBOL modernization under structured, reproducible conditions. The framework emphasizes:

These principles support reproducibility and formal validation in legacy system transformation research.

Disclosures

Dialect Coverage

Grades for COBOL modernization outputs are scoped to the benchmark suite and are provisional with respect to dialect coverage. Validation currently targets the defined benchmark suite only; coverage of IBM Enterprise COBOL, GnuCOBOL, and Micro Focus dialect variants is being expanded. A grade therefore measures the quality of modernization output on the benchmark suite, not comprehensive dialect compliance.

Reproducibility

Each benchmark task is versioned and deterministic. The assessment pipeline produces independently verifiable intermediate outputs at each phase.

Task definition: meta.json + prompt.md + source files + gold.json oracle
Run logging: model ID, decoding params, timestamps, per-task outputs
Auditability: suite hash + prompt hash recorded per run
Verification: Merkle-anchored audit trail; independent replay possible
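
The logging and verification items above can be illustrated with a minimal sketch. It is not the benchmark's actual pipeline: the function names, the record fields (`suite_hash`, `prompt_hash`, `audit_root`), the use of SHA-256, and the Merkle construction (domain-separated leaves, last node duplicated on odd levels) are all assumptions chosen for illustration. Timestamps are left out of the hashed record so that a replay with the same inputs reproduces the same values.

```python
import hashlib
import json


def sha256_hex(data: bytes) -> str:
    """Hex SHA-256 digest of raw bytes."""
    return hashlib.sha256(data).hexdigest()


def canonical_json(obj) -> bytes:
    """Serialize with sorted keys and fixed separators so equal objects hash equally."""
    return json.dumps(obj, sort_keys=True, separators=(",", ":")).encode("utf-8")


def merkle_root(leaves: list) -> str:
    """Hex Merkle root over a list of leaf payloads (bytes).

    Assumed construction: leaves are hashed with a 0x00 prefix, interior
    nodes with a 0x01 prefix, and the last node is duplicated when a level
    has an odd number of nodes.
    """
    if not leaves:
        raise ValueError("at least one leaf is required")
    level = [hashlib.sha256(b"\x00" + leaf).digest() for leaf in leaves]
    while len(level) > 1:
        if len(level) % 2 == 1:
            level.append(level[-1])
        level = [
            hashlib.sha256(b"\x01" + level[i] + level[i + 1]).digest()
            for i in range(0, len(level), 2)
        ]
    return level[0].hex()


def build_run_record(model_id, decoding_params, prompt_text, suite_files, task_outputs):
    """Build a hypothetical per-run audit record.

    suite_files:  dict mapping file path -> file content (bytes)
    task_outputs: dict mapping task ID -> model output (str)
    """
    # Suite hash: hash of the path -> content-digest manifest, order-independent.
    manifest = {path: sha256_hex(content) for path, content in suite_files.items()}
    suite_hash = sha256_hex(canonical_json(manifest))

    # Prompt hash: digest of the exact prompt text sent to the model.
    prompt_hash = sha256_hex(prompt_text.encode("utf-8"))

    # One Merkle leaf per task, in sorted task-ID order for determinism.
    leaves = [
        canonical_json({"task": task_id, "output": output})
        for task_id, output in sorted(task_outputs.items())
    ]

    return {
        "model_id": model_id,
        "decoding_params": decoding_params,
        "suite_hash": suite_hash,
        "prompt_hash": prompt_hash,
        "audit_root": merkle_root(leaves),
    }
```

Under these assumptions, an independent replay would rebuild the record from the same suite files, prompt, and per-task outputs and compare `suite_hash`, `prompt_hash`, and `audit_root` with the logged values; a change to any single task output changes `audit_root` while leaving the other two hashes untouched.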

Full methodology documentation, including the rating scale, assessment dimensions, and scoring process, is available on the methodology page.