Evaluation Framework
This benchmark evaluates AI-assisted COBOL modernization under structured, reproducible conditions. The framework emphasizes:
- Deterministic execution
- Machine-checkable artifacts
- Independent replay
- Version pinning
- Dialect transparency
These principles support reproducibility and formal validation in legacy system transformation research.
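As an illustration of how deterministic execution and version pinning can be made machine-checkable, the sketch below builds a run manifest and hashes its canonical JSON form. The field names and values are hypothetical, not the framework's actual schema:

```python
import hashlib
import json

def run_manifest(model_id, decoding, suite_version, dialect):
    """Build a machine-checkable record of one benchmark run.

    Field names here are illustrative, not the framework's schema.
    """
    manifest = {
        "model_id": model_id,            # pinned model version
        "decoding": decoding,            # e.g. temperature=0, fixed seed
        "suite_version": suite_version,  # pinned task-suite version
        "dialect": dialect,              # COBOL dialect targeted
    }
    # Canonical JSON (sorted keys) makes the hash deterministic,
    # so an independent replay can recompute and compare it.
    blob = json.dumps(manifest, sort_keys=True).encode()
    manifest["manifest_sha256"] = hashlib.sha256(blob).hexdigest()
    return manifest

m = run_manifest(
    model_id="example-model-2024-01",
    decoding={"temperature": 0.0, "seed": 7},
    suite_version="v1.2.0",
    dialect="GnuCOBOL 3.2",
)
print(m["manifest_sha256"])
```

Because the hash is taken over a canonical serialization, any change to the model, decoding parameters, or suite version yields a different digest, which is what makes the record verifiable rather than merely logged.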
Disclosures
Scores are provisional with respect to COBOL dialect coverage: current runs target a single defined benchmark suite, and validation across IBM Enterprise COBOL, GnuCOBOL, and Micro Focus dialect variants is still being expanded.
Results reflect model outputs on a defined benchmark suite at a point in time. Model behavior may vary across tasks, parameter configurations, and COBOL program characteristics not represented in the current suite.
Reproducibility
Each benchmark task is versioned and deterministic. The assessment pipeline produces independently verifiable intermediate outputs at each phase.
- Run logging: model ID, decoding parameters, timestamps, per-task outputs
- Auditability: suite hash and prompt hash recorded per run
- Verification: Merkle-anchored audit trail; independent replay possible
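A Merkle-anchored audit trail can be sketched as follows (a minimal illustration, not the framework's actual implementation): each per-task output is hashed, and hashes are pairwise combined up to a single root. Publishing only the root lets a third party detect tampering with any replayed task output:

```python
import hashlib

def sha256(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def merkle_root(leaves: list[bytes]) -> bytes:
    """Compute a Merkle root over per-task outputs.

    Minimal sketch: when a level has an odd number of nodes,
    the last node is duplicated before pairing.
    """
    assert leaves, "need at least one task output"
    level = [sha256(leaf) for leaf in leaves]
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])
        level = [sha256(level[i] + level[i + 1])
                 for i in range(0, len(level), 2)]
    return level[0]

# Hypothetical per-task outputs from one benchmark run.
outputs = [b"task-001 output", b"task-002 output", b"task-003 output"]
root = merkle_root(outputs)
print(root.hex())
```

Changing any single task output changes the root, so an auditor who trusts the recorded root can verify a full independent replay without trusting the run logs themselves.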
Full methodology documentation, including the rating scale, assessment dimensions, and scoring process, is available on the methodology page.