Glossary — Open Ratings

This glossary defines the technical terms that recur across the Open Ratings methodology, the COBOL modernization benchmark, and the underlying trust infrastructure. Where a term is also a credit-market or cryptographic primitive, the definition emphasizes how Open Ratings uses it — not the general-purpose meaning.

Terms

Proof of Intent (PoI): A signed, Sigstore-verified envelope that records the human-stated intent for a piece of work (the why and the constraints) before any agent output is generated. PoI binds the rating to a verifiable specification, not to a free-running agent. Without a PoI, an output can still be rated, but it cannot achieve the highest grades, because there is nothing against which to verify intent.
Proof of Work (Merkle-anchored): A Merkle-tree-anchored attestation that the assessment pipeline actually performed the analytic work it claims, with difficulty proportional to the achieved grade. Higher grades require more work; the resulting Merkle root is sealed inside the on-chain Verification Event Envelope.
Round-Trip Correctness (RTF): An evaluation methodology in which a model is asked to regenerate code from an intermediate representation, and the regenerated code is graded by whether it compiles and behaves equivalently to the original program. The approach follows Allamanis et al. (ICML 2024). RTF replaces brittle string-match metrics with a behavioral definition of correctness.
Adversarial Loop Extraction-Learning (ALEL): An iterative training loop that extracts logic from existing code, asks a panel of teacher models to regenerate it, mines disagreements as hard negatives, and feeds the failures back into the next training round. Each ALEL round measurably reduces the rate at which the student model agrees with the teacher panel by accident rather than by understanding.
USGOV-C10: The US-government-derived COBOL-10 held-out benchmark, together with the SHA-256 contamination firewall that prevents any of its programs from leaking into training corpora. Every candidate program is hashed and rejected if its SHA-256 digest matches the blocklist (scripts/firewall_usgov.py). The USGOV-C10 score is the public result Open Ratings reports against frontier models.
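The hash-and-reject check in the USGOV-C10 entry can be sketched as follows. This is an illustration only, not the contents of scripts/firewall_usgov.py: the function names, the in-memory blocklist, and the choice to hash raw bytes without any normalization are all assumptions.

```python
import hashlib


def sha256_hex(program_bytes: bytes) -> str:
    """Return the hex SHA-256 digest of a program's raw bytes."""
    return hashlib.sha256(program_bytes).hexdigest()


def filter_candidates(candidates, blocklist):
    """Keep only candidates whose digest is NOT on the blocklist.

    `blocklist` is a set of hex digests of held-out benchmark programs
    (hypothetical structure, assumed for this sketch).
    """
    return [c for c in candidates if sha256_hex(c) not in blocklist]


# Hypothetical data: one held-out program and one unrelated program.
held_out = b"       IDENTIFICATION DIVISION.\n       PROGRAM-ID. HELDOUT.\n"
fresh = b"       IDENTIFICATION DIVISION.\n       PROGRAM-ID. FRESH.\n"

blocklist = {sha256_hex(held_out)}
kept = filter_candidates([held_out, fresh], blocklist)
```

Exact-digest matching only catches byte-identical copies; a program that differs by a single space produces a different digest, so a real firewall would need to decide whether to normalize input before hashing.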
AGENTIC card: The grading dimension that scores how well an agent's output behaves under autonomous operation — tool-use safety, deviation from stated intent, hallucinated tool calls, and recovery from adversarial prompts. The AGENTIC card is the dimension most sensitive to agent identity: different agents have very different AGENTIC profiles even when their CODE scores are similar.
SECURITY card: The grading dimension covering vulnerability exposure, secret handling, input validation, dependency CVE propagation, and detection of hallucinated packages or fabricated API endpoints. SECURITY is one of two primary-weight dimensions in the composite.
CODE card: The grading dimension that measures architectural coherence, idiomatic style, abstraction level, and integration quality with the surrounding codebase. CODE rewards output that fits the project it lands in, not output that is technically correct in isolation.
RELIABILITY card: The grading dimension scoring logic correctness — semantic accuracy versus stated intent, edge-case handling, error propagation, and divergence between specification and behavior. RELIABILITY is the second primary-weight dimension; it is also the dimension most directly anchored by the PoI envelope.
PERFORMANCE card: The grading dimension measuring runtime efficiency, algorithmic complexity, resource use, and scaling behavior under representative workloads. PERFORMANCE is graded against the workload class declared in the PoI envelope, not against an absolute benchmark.
TECH-DEBT card: The grading dimension scoring maintainability: cyclomatic and cognitive complexity, nesting depth, function length, naming clarity, and long-term modifiability. High technical debt (a weak TECH-DEBT score) does not necessarily fail the output; it bounds how high the composite grade can climb.
DOCS card: The grading dimension assessing documentation coverage, accuracy of inline comments and module docstrings, completeness of public-API references, and how well the docs track the actual implementation. DOCS is a supporting-weight dimension but is required at non-trivial levels for any AA or AAA grade.
ALL composite: The final composite rating produced by weighting and aggregating the seven individual card scores (AGENTIC, SECURITY, CODE, RELIABILITY, PERFORMANCE, TECH-DEBT, and DOCS) into a single letter grade (D through AAA). The ALL composite is what gets sealed into the on-chain VEE. Specific dimension weights are protected as trade secrets; qualitative tier labels (Primary, High, Moderate, Supporting) are published on the methodology page.
Agreement filter: A consensus gate that accepts a generated training pair only if a minimum number of teacher models independently produce outputs that pass the compile gate. Default threshold is two of five teachers. The agreement filter discards examples where the teacher panel disagrees with itself.
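A minimal sketch of the agreement filter as described in the entry above, assuming each teacher's output already carries the result of the compile gate. The `TeacherOutput` type, the function name, and the teacher labels are hypothetical; only the two-of-five default comes from the definition.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TeacherOutput:
    teacher: str    # identifier of the teacher model (hypothetical labels)
    code: str       # generated candidate code
    compiles: bool  # whether this output passed the compile gate


def passes_agreement_filter(outputs, min_passing=2):
    """Accept a pair only if at least `min_passing` distinct teachers
    produced output that passed the compile gate."""
    passing_teachers = {o.teacher for o in outputs if o.compiles}
    return len(passing_teachers) >= min_passing


# Five teachers, two of which produced compiling output.
panel = [
    TeacherOutput("t1", "...", True),
    TeacherOutput("t2", "...", False),
    TeacherOutput("t3", "...", True),
    TeacherOutput("t4", "...", False),
    TeacherOutput("t5", "...", False),
]
accepted = passes_agreement_filter(panel)
```

Counting distinct teacher identifiers rather than raw outputs keeps one teacher that was sampled twice from satisfying the threshold on its own.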
Compile gate: A hard pass/fail check that runs candidate code through a real compiler — for COBOL, GnuCOBOL with --std=ibm. Output that does not compile is rejected before any further evaluation runs. The compile gate is also one of the inputs to the RTF score.
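The compile gate can be approximated with a small wrapper around the compiler's exit code. This sketch assumes GnuCOBOL's `cobc` is on the PATH and that a syntax-only check (`-fsyntax-only`) is an acceptable stand-in for a full build; the function name, the temporary-file handling, and the `.cob` suffix are illustrative choices, not the pipeline's actual implementation.

```python
import shutil
import subprocess
import tempfile
from pathlib import Path


def compile_gate(source: str,
                 compiler_cmd=("cobc", "-fsyntax-only", "-std=ibm"),
                 suffix=".cob") -> bool:
    """Return True if the compiler exits with status 0 for `source`.

    `compiler_cmd` is the compiler invocation without the file name;
    the default assumes GnuCOBOL with the IBM dialect.
    """
    if shutil.which(compiler_cmd[0]) is None:
        raise RuntimeError(f"compiler not found: {compiler_cmd[0]}")
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / f"candidate{suffix}"
        path.write_text(source, encoding="utf-8")
        result = subprocess.run(
            [*compiler_cmd, str(path)],
            capture_output=True,
            text=True,
        )
    # Hard pass/fail: any non-zero exit status rejects the candidate.
    return result.returncode == 0
```

Making the compiler command a parameter keeps the gate reusable for other languages and lets the wrapper be exercised on machines where `cobc` is not installed.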
Distillation (teacher to student): The practice of training a smaller, deployable student model on outputs and reasoning traces produced by a panel of larger frontier teacher models. Open Ratings uses agreement-filtered teacher consensus rather than any single teacher, so the resulting student inherits the intersection of teacher capability rather than any one model's bias.
LogicIR: A language-neutral intermediate representation that captures a program's logic — control flow, data movement, business rules, and side effects — independent of the source dialect. LogicIR is used as the instruction side of every Round-Trip Correctness pair: extract LogicIR from program P, regenerate P' from LogicIR, then check whether P' compiles and behaves like P.
Notch-down rule (verified intent): If the rated artifact's verified intent (PoI) does not match the artifact's observed behavior, the composite grade is notched down by one or more steps regardless of how well the individual cards score. Behavior that contradicts intent cannot be investment grade, even if every other dimension is strong.
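One possible reading of the notch-down rule, sketched under stated assumptions. The glossary names only some grades (D, BB, BBB, AA, AAA); the intermediate rungs of the ladder below are assumed for illustration. Treating "cannot be investment grade" as a cap at BB is likewise an interpretation, and the function names are hypothetical.

```python
# Assumed ladder, lowest to highest. Only D, BB, BBB, AA and AAA are
# named in the glossary; the remaining rungs are illustrative.
GRADE_LADDER = ["D", "C", "CC", "CCC", "B", "BB", "BBB", "A", "AA", "AAA"]


def notch_down(grade: str, steps: int = 1) -> str:
    """Lower `grade` by `steps` rungs, never going below the lowest grade."""
    if steps < 0:
        raise ValueError("steps must be non-negative")
    idx = GRADE_LADDER.index(grade)
    return GRADE_LADDER[max(0, idx - steps)]


def apply_notch_down_rule(grade: str, intent_matches: bool, steps: int = 1) -> str:
    """Notch the composite down when observed behavior contradicts the PoI,
    then cap the result at BB so it cannot remain investment grade."""
    if intent_matches:
        return grade
    notched = notch_down(grade, steps)
    cap_idx = GRADE_LADDER.index("BB")
    return GRADE_LADDER[min(GRADE_LADDER.index(notched), cap_idx)]


unchanged = apply_notch_down_rule("AAA", intent_matches=True)
capped = apply_notch_down_rule("AAA", intent_matches=False)
lowered = apply_notch_down_rule("B", intent_matches=False)
```

Under this reading, a single notch from AAA would land on AA, which is still investment grade, so the cap is what enforces the last sentence of the definition.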
Sigstore / Rekor: Open-source signing infrastructure used to bind a Proof-of-Intent envelope to an identity and a tamper-evident transparency log. Sigstore handles short-lived-certificate signing; Rekor is the append-only log that records the signature so anyone can later verify that the PoI existed at the claimed time.
Merkle root anchor: The single hash at the top of a Merkle tree built from every Verification Event Envelope produced during an assessment. Anchoring the root on-chain lets any third party verify that a given envelope belongs to the tree without seeing the rest of the underlying data: the verifier needs only the leaf they care about plus the sibling hashes on the path to the root.
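The inclusion check described in the Merkle root anchor entry can be sketched as below. This is a generic Merkle construction, not Open Ratings' actual tree layout: the use of SHA-256, the 0x00/0x01 domain-separation prefixes for leaves and interior nodes, and the duplication of the last node on odd-sized levels are all assumptions.

```python
import hashlib


def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def merkle_root_and_proof(leaves, index):
    """Build a tree over `leaves` and return (root, proof) for leaves[index].

    The proof is a list of (sibling_hash, sibling_is_left) pairs,
    ordered from the leaf level up to the root.
    """
    if not leaves:
        raise ValueError("at least one leaf is required")
    level = [h(b"\x00" + leaf) for leaf in leaves]
    proof = []
    while len(level) > 1:
        if len(level) % 2 == 1:
            level.append(level[-1])  # simplification: duplicate last node
        sibling = index ^ 1          # neighbor within the same pair
        proof.append((level[sibling], sibling < index))
        level = [h(b"\x01" + level[i] + level[i + 1])
                 for i in range(0, len(level), 2)]
        index //= 2
    return level[0], proof


def verify_inclusion(leaf, proof, root):
    """Recompute the path from `leaf` to the root using only sibling hashes."""
    node = h(b"\x00" + leaf)
    for sibling, sibling_is_left in proof:
        if sibling_is_left:
            node = h(b"\x01" + sibling + node)
        else:
            node = h(b"\x01" + node + sibling)
    return node == root


# Hypothetical serialized envelopes standing in for real VEEs.
envelopes = [b"vee-0", b"vee-1", b"vee-2"]
root, proof = merkle_root_and_proof(envelopes, 2)
ok = verify_inclusion(b"vee-2", proof, root)
```

The verifier never sees `vee-0` or `vee-1`, only hashes derived from them. Duplicating the last node is a common simplification with known weaknesses, so a production tree would likely handle odd levels differently.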
Verification Event Envelope (VEE): A signed, schema-typed record that captures one assessment event: its inputs, outputs, hash chain, and grade. VEE event types include CODE_QUALITY, AGENT_GOVERNANCE, COMPLIANCE_CHECK, and MODEL_ATTESTATION. The VEE is the unit of cryptographic anchoring; ratings are not signed individually, but envelopes are.
Shadow rating: An unsolicited Open Rating issued without the involvement of the rated agent's operator. Modeled on the long-standing credit-rating-agency practice of unsolicited ratings of public issuers, shadow ratings are issued at our discretion for outputs of public interest.
Investment grade: Any composite rating of BBB or higher. Open Ratings uses the same investment-grade boundary as the corporate-bond market: BBB and above is suitable for production use without additional human verification; BB and below carries elevated counterparty risk and is labeled speculative grade.

A term is missing or unclear? Open an issue at github.com/lmsanch/toryx-openratings.