An AI agent that wrote a Treasury settlement contract, a model that produced a thousand lines of medical-billing code, an autonomous research workflow that filed a regulatory comment — each of these is a discrete agentic output with material risk attached. Someone, somewhere, has to decide how much to trust it. Open Ratings is our attempt to make that decision priceable.
The shape of the problem
For thirty years software-quality assessment has been a developer-tools problem: linters, test coverage, vulnerability scanners. The artefact under review was code, the audience was developers, and the question was “is this clean enough to merge.” That framing breaks the moment a counterparty — a customer, a regulator, an insurer — needs to assess an agentic output without owning the development process.
The right analogy is not code review. It is credit risk. A credit rating answers a counterparty question: given what I can observe about this issuer at this point in time, what is the probability and severity of failure if I take exposure? The agentic-economy version of that question is identical — we are just substituting a code repository, a model output, or a transaction for the bond.
Why a discrete-event framing
One of the design choices people push back on is that we rate events, not agents. A model that produced a beautiful proof yesterday can produce a hallucinated citation today. Treating each output as its own observation lets the system learn an agent's trust history the way a credit bureau learns a borrower's repayment history — without conflating today's signal with yesterday's averages.
Concretely: every Open Ratings grade is a snapshot anchored to a specific input, a specific harness configuration, and a specific moment in time. The rating includes a cryptographic commitment to the inputs and the result, so the grade itself is independently verifiable years later, even if the model behind it has been deprecated.
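The post does not specify the commitment construction, so here is a minimal sketch of one plausible scheme: hash the rated artefact, then hash a canonical serialisation of everything the grade depends on. All function and field names (`commit`, `input_sha256`, `harness`, and so on) are illustrative assumptions, not the actual Open Ratings format.

```python
import hashlib
import json

def commit(input_artifact: bytes, harness_config: dict,
           timestamp: str, result: dict) -> str:
    """Bind a grade to its inputs: anyone holding the same artefact,
    harness config, timestamp, and result can recompute this digest
    and check it against the published commitment."""
    record = {
        "input_sha256": hashlib.sha256(input_artifact).hexdigest(),
        "harness": harness_config,   # assumed to be JSON-serialisable
        "timestamp": timestamp,      # e.g. an ISO-8601 string
        "result": result,
    }
    # Canonical JSON (sorted keys, no whitespace) so that verification
    # is deterministic across implementations.
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode()).hexdigest()
```

Because the digest depends only on the committed inputs, verification stays possible after the model behind the rating is deprecated: recompute and compare.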
What grades look like
The output is intentionally familiar. Eight grades from D to AAA, with an investment-grade boundary at BBB. The mapping to a 0–10 internal score is published; the per-dimension weighting is calibrated against a reference distribution that mirrors the long-run shape of corporate credit ratings.
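A mapping of that shape might look like the sketch below. Only the endpoints (D, AAA), the eight-grade count, and the BBB investment-grade boundary come from the post; the intermediate grade names and every threshold are hypothetical placeholders, not the published mapping.

```python
# Hypothetical score floors for each grade on the 0-10 internal scale.
# Ordered from best to worst so the first matching band wins.
GRADE_BANDS = [
    (9.0, "AAA"), (8.0, "AA"), (7.0, "A"), (6.0, "BBB"),
    (5.0, "BB"), (4.0, "B"), (2.0, "C"), (0.0, "D"),
]

def grade(score: float) -> str:
    """Map a 0-10 internal score to one of eight letter grades."""
    if not 0.0 <= score <= 10.0:
        raise ValueError("score must be in [0, 10]")
    for floor, letter in GRADE_BANDS:
        if score >= floor:
            return letter
    return "D"

def investment_grade(score: float) -> bool:
    """BBB and above count as investment grade, as in the post."""
    return grade(score) in {"AAA", "AA", "A", "BBB"}
```

Under these assumed thresholds a 6.0 lands exactly on the investment-grade boundary, while a 5.9 falls just below it.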
The grade is a compressed summary. The artefacts behind it — six dimension scores, the input fingerprint, the assessment trace — are what serious counterparties actually consume.
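The bundle a serious counterparty consumes could be modelled as a small record type. The field names below are illustrative assumptions; the post specifies only the contents (six dimension scores, an input fingerprint, an assessment trace), not a schema.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class RatingArtefacts:
    """Everything behind the letter grade. Field names are
    hypothetical; only the contents are stated in the post."""
    grade: str               # compressed summary, e.g. "BBB"
    dimension_scores: tuple  # six per-dimension scores on the internal scale
    input_fingerprint: str   # hash identifying the rated artefact
    assessment_trace: tuple  # ordered record of assessment steps

    def __post_init__(self):
        # The post says there are exactly six dimensions.
        if len(self.dimension_scores) != 6:
            raise ValueError("expected exactly six dimension scores")
```

Freezing the record mirrors the snapshot framing: the artefacts behind a grade are fixed at assessment time, not updated afterwards.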
What we are not doing
We are not building another LLM evaluator. We are not benchmarking a leaderboard of frontier models. We are not in the business of declaring one agent “better” than another in the abstract. The grade is always conditional on the artefact, the cohort, and the harness. That is a feature, not a hedge.
Future posts will go deeper on each dimension, on the cryptographic anchoring scheme, and on the federal benchmark cohort we are running first. Subscribe via RSS at /blog/feed.xml.
— Toryx Team