Almost every test readout, at almost every company, is the same thing: a screenshot of a dashboard in a slide deck. A screenshot proves the numbers existed. It proves nothing about how they were produced. Was the metric chosen before or after the results? Did the test run its full duration? Was the platform ever checked? The screenshot can't say, and six months later neither can anyone in the room.
Science solved this with the methods section: no journal accepts a result without a description of the process precise enough for a stranger to repeat it. Accounting solved it with the audit trail. Every field that depends on trusting numbers has learned the same lesson: the result is only as credible as the record of how it was made. Ecommerce experimentation, which routinely moves six-figure decisions, mostly has screenshots.
The receipt is that record. It documents what was registered and when, how the platform was calibrated, which integrity attestations hold, and the honest effect estimate, then stamps it with a SHA-256 fingerprint so a changed number is a changed stamp. Attach it to the readout and the "did we test this properly" argument is over before it starts. Results are easy to show. Process is what makes them believable.
"In God we trust. All others must bring data."
ATTRIBUTED TO W. EDWARDS DEMING — statistician, father of quality management
The receipt is stamped with a SHA-256 fingerprint of its contents — any change to the numbers produces a different stamp.
Each attestation carries a weight reflecting how badly its absence damages the conclusion: full duration 3, metric unchanged 3, no interim decisions 2, correct denominator 2, segments pre-declared 2, SRM (sample ratio mismatch) checked 2, no concurrent test 1. Platform validation adds up to 2 more (2 if passed, 1 if issues were found and documented), so the attestations total 15 points and the maximum with a passed validation is 17.
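The weighting can be sketched as a plain lookup and sum. This is a minimal illustration, not the tool's own code: the key names, the "not_run" platform state (assumed to add 0), and the function name are all hypothetical, and only the weights themselves come from the text.

```python
# Weights as listed in the text; the key names are illustrative assumptions.
ATTESTATION_WEIGHTS = {
    "full_duration": 3,
    "metric_unchanged": 3,
    "no_interim_decisions": 2,
    "correct_denominator": 2,
    "segments_predeclared": 2,
    "srm_checked": 2,
    "no_concurrent_test": 1,
}

# Platform validation: 2 if passed, 1 if issues were found and documented.
# "not_run" (0 points) is an assumed third state.
PLATFORM_POINTS = {"passed": 2, "issues_documented": 1, "not_run": 0}


def integrity_score(attested: set, platform: str = "not_run") -> int:
    """Sum the weights of the attestations that hold, plus platform points."""
    unknown = attested - set(ATTESTATION_WEIGHTS)
    if unknown:
        raise ValueError(f"unknown attestations: {sorted(unknown)}")
    points = sum(ATTESTATION_WEIGHTS[name] for name in attested)
    return points + PLATFORM_POINTS[platform]


# Highest score reachable under these weights.
MAX_SCORE = sum(ATTESTATION_WEIGHTS.values()) + max(PLATFORM_POINTS.values())
```

How the score maps to a letter grade is not specified here, so the sketch stops at the point total.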
Two failures cap the grade regardless of everything else: not running the full pre-committed duration, and changing the primary metric. If either attestation is missing, the best achievable grade is C. These are the two sins that invalidate a test outright rather than merely weakening it: stopping early after peeking at the results inflates the false-positive rate above the stated significance level, and switching the metric after the fact means the test no longer answers the hypothesis it was registered to test.
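One way to express the cap, as a sketch under stated assumptions: the A-to-F scale and its ordering, the attestation key names, and the function name are hypothetical. Only the rule itself (either failure limits the grade to C) is taken from the text.

```python
# Assumed letter scale, best to worst. The text confirms only that C exists
# and that grades better than C are possible.
GRADE_ORDER = ["A", "B", "C", "D", "F"]

# The two attestations whose absence caps the grade (illustrative key names).
CAPPING_ATTESTATIONS = {"full_duration", "metric_unchanged"}


def apply_cap(grade: str, attested: set) -> str:
    """Return the grade after applying the 'best achievable is C' cap."""
    if CAPPING_ATTESTATIONS <= attested:
        return grade  # both hold, so no cap applies
    # At least one capping attestation is missing: keep whichever of the
    # uncapped grade and C sits later (is worse) on the scale.
    return max(grade, "C", key=GRADE_ORDER.index)
```

A grade that is already worse than C passes through unchanged, since the cap only limits how good the grade can be.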
Two checks are computed automatically from the dates rather than attested: the registration date must be on or before the start date (otherwise flagged "registered after start"), and the analysis date must be on or after the pre-committed end date (otherwise flagged "analysed before committed end").
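Both date checks are simple comparisons. A minimal sketch follows, in which the function and parameter names are hypothetical and the flag strings are the ones quoted above:

```python
from datetime import date


def date_flags(registered: date, started: date,
               committed_end: date, analysed: date) -> list:
    """Return the flags raised by the two automatic date checks."""
    flags = []
    # Registration must be on or before the start date.
    if registered > started:
        flags.append("registered after start")
    # Analysis must be on or after the pre-committed end date.
    if analysed < committed_end:
        flags.append("analysed before committed end")
    return flags
```

A test registered on its start date and analysed on its committed end date raises no flags, because both comparisons are inclusive.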
The statistics use the same engine as Reality Check: a two-proportion z-test for the p-value, and a normal-normal Bayesian posterior for the honest (shrunk) estimate, P(B beats A), and the 95% credible interval. See the math section on the Reality Check page for the full derivation.
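For orientation, the two calculations can be sketched as follows. This is not the Reality Check implementation: it assumes a pooled two-sided z-test, an effect measured as the absolute difference in conversion rates, and a normal prior whose mean and standard deviation (`prior_mean`, `prior_sd`) are placeholder values chosen only for illustration. It also assumes non-degenerate inputs (rates strictly between 0 and 1).

```python
import math


def normal_cdf(x: float) -> float:
    """Standard normal cumulative distribution function."""
    return 0.5 * math.erfc(-x / math.sqrt(2))


def two_proportion_z_test(conv_a: int, n_a: int, conv_b: int, n_b: int):
    """Pooled two-proportion z-test; returns (z, two-sided p-value)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # equals 2 * (1 - Phi(|z|))
    return z, p_value


def shrunk_estimate(conv_a: int, n_a: int, conv_b: int, n_b: int,
                    prior_mean: float = 0.0, prior_sd: float = 0.01) -> dict:
    """Normal-normal posterior for the difference in rates (B minus A).

    prior_mean and prior_sd are illustrative placeholders, not the
    values used by the tool.
    """
    p_a, p_b = conv_a / n_a, conv_b / n_b
    observed = p_b - p_a
    se = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    prior_precision = 1 / prior_sd ** 2
    data_precision = 1 / se ** 2
    post_var = 1 / (prior_precision + data_precision)
    # Precision-weighted average of the prior mean and the observed effect.
    post_mean = post_var * (prior_precision * prior_mean
                            + data_precision * observed)
    post_sd = math.sqrt(post_var)
    return {
        "observed": observed,
        "shrunk": post_mean,
        "p_b_beats_a": normal_cdf(post_mean / post_sd),
        "ci95": (post_mean - 1.96 * post_sd, post_mean + 1.96 * post_sd),
    }
```

With a prior centred on zero, the shrunk estimate always sits between zero and the observed difference, and the noisier the data relative to the prior, the harder it is pulled toward zero.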
The stamp is computed in your browser with the Web Crypto API. The hash is deterministic: the same inputs always produce the same 64-character hexadecimal stamp, and changing a single conversion count produces a completely different one. Anyone can re-enter the claimed inputs and verify that the stamp matches. It proves consistency, not truth: a matching stamp shows that nobody edited the numbers after the receipt was issued, not that the numbers were right in the first place.
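The same idea can be reproduced outside the browser. The sketch below uses Python's `hashlib` rather than the Web Crypto API, and it assumes the inputs are serialized as canonical JSON (sorted keys, no extra whitespace). The tool's actual serialization is not documented here, so stamps from this sketch should not be expected to match the tool's; the function name and field names are hypothetical.

```python
import hashlib
import json


def receipt_stamp(fields: dict) -> str:
    """SHA-256 hex digest of a canonical JSON encoding of the inputs."""
    # Sorting keys and fixing separators makes the encoding, and therefore
    # the hash, independent of the order in which fields were entered.
    canonical = json.dumps(fields, sort_keys=True,
                           separators=(",", ":"), ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


receipt = {"visitors_a": 10000, "conversions_a": 500,
           "visitors_b": 10000, "conversions_b": 560}
stamp = receipt_stamp(receipt)

# Changing a single conversion count yields a different stamp.
tampered = dict(receipt, conversions_b=561)
tampered_stamp = receipt_stamp(tampered)
```

Verification then amounts to recomputing the stamp from the claimed inputs and comparing it with the one on the receipt.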