Turning an opinion into a rating: the calibration study, and why we won't publish a number yet

A grade becomes a rating only when it carries information about real outcomes. Here is how we are testing that, and why this site will carry no accuracy statistic, by design, until a control sample exists.

Brierly Research Team · Jun 14, 2026 · 5 minute read · sourced & dated (verification-log discipline)

RuleScore is a deterministic, public opinion about contract language. For it to count as a rating, the grades have to predict something real: disputes, resolution delays, rule changes, refunds. We are building that evidence three ways, and we are being explicit about the limits of each.

First, a retrospective calibration: we score each coded dispute case on the rules text as it was listed, blind to the outcome where possible, and read dispute and delay rates by grade band. Second, and this is the part that disciplines the claim, a control sample of non-disputed high-volume markets: the denominator without which a 'disputes-only' table proves nothing. We will publish no 'N times more likely' figure until that control set exists. Third, a forward log: live grades are hash-stamped and timestamped on every scan, so in a few months there will be a clean, pre-registered out-of-sample test that needs no historical reconstruction. That log is already accruing.
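To make the denominator point concrete, here is a minimal sketch of a rate-by-band table. The function name, the band labels, and the counts are all invented for illustration and are not study results. The sketch also assumes the two lists together cover every market in scope; with a sampled control set, the raw rates would need reweighting before they could be read as population rates.

```python
from collections import Counter


def rates_by_band(disputed_bands, control_bands):
    """Return {band: (disputed, total, rate)} for every band in either list.

    Each list holds one grade band per market. The control list supplies the
    non-disputed markets, i.e. the denominator a disputes-only table lacks.
    """
    disputed = Counter(disputed_bands)
    control = Counter(control_bands)
    table = {}
    for band in sorted(set(disputed) | set(control)):
        # A band is only iterated if it appears somewhere, so total >= 1.
        total = disputed[band] + control[band]
        table[band] = (disputed[band], total, disputed[band] / total)
    return table


# Synthetic counts, invented for illustration only; not study results.
toy_disputed = ["C"] * 2 + ["D"] * 3
toy_control = ["A"] * 10 + ["B"] * 10 + ["C"] * 8 + ["D"] * 2

for band, (k, n, rate) in rates_by_band(toy_disputed, toy_control).items():
    print(f"{band}: {k}/{n} disputed ({rate:.0%})")
```

Given only `toy_disputed`, all one can say is that most disputes fall in band D. That is a share of disputes, not a dispute rate, and it would look the same whether band D held five markets or five thousand. The rate only exists once the control counts are in the table.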

The honest constraint is that a clean historical backtest needs point-in-time rules text that we largely do not have, because the point-in-time recorder was scoped down for data-rights reasons. We say so rather than imply a backtest we can't run. Results will appear on the Proof page only if the relationship holds, with exact N, dates, method, and caveats. If it doesn't hold, that is a model-improvement signal, not a publishing event.
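The forward log avoids this reconstruction problem because each entry can commit to the rules text as it stood at scan time. The sketch below shows one way such an entry could be built and checked. It is an assumption about the general technique, not a description of the actual log format: the field names, `make_log_entry`, and `verify_log_entry` are hypothetical.

```python
import hashlib
import json
from datetime import datetime, timezone


def _digest(record):
    """SHA-256 over a canonical JSON encoding (sorted keys, fixed separators)."""
    payload = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def make_log_entry(market_id, rules_text, grade, scanned_at=None):
    """Build one log entry; all field names here are hypothetical."""
    scanned_at = scanned_at or datetime.now(timezone.utc)
    record = {
        "market_id": market_id,
        "grade": grade,
        # Commit to the exact rules text without storing the text itself.
        "rules_sha256": hashlib.sha256(rules_text.encode("utf-8")).hexdigest(),
        "scanned_at": scanned_at.isoformat(),
    }
    record["entry_sha256"] = _digest(record)
    return record


def verify_log_entry(entry):
    """Recompute the entry hash and compare it with the stored one."""
    body = {k: v for k, v in entry.items() if k != "entry_sha256"}
    return _digest(body) == entry["entry_sha256"]


entry = make_log_entry(
    "demo-market",  # hypothetical identifier
    "Resolves YES if the event occurs before the deadline.",
    "B",
)
print(verify_log_entry(entry))
```

A hash of this kind shows that an entry has not been altered since it was written. It does not by itself prove when the entry was written; that would need the hashes to be published or anchored somewhere outside the author's control.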

Until results appear: the methodology is fully public, the dispute evidence is sourced and dated, and the forward log runs daily. Follow the Proof page for the read-out.

