Test Receipt

// why write a receipt?

Nearly every test readout looks the same: a screenshot of a dashboard in a slide deck. A screenshot proves the numbers existed. It proves nothing about how they were produced. Was the metric chosen before or after the results? Did the test run its full duration? Was the platform ever validated? The screenshot can't say, and six months later neither can anyone in the room.

Science solved this with the methods section: no journal accepts a result without a description of the process precise enough for a stranger to repeat it. Accounting solved it with the audit trail. Every field that depends on trusting numbers has learned the same lesson: the result is only as credible as the record of how it was made. Ecommerce experimentation, which routinely moves six-figure decisions, mostly has screenshots.

The receipt is that record. It documents what was registered and when, how the platform was calibrated, which integrity attestations hold, and the honest effect estimate, then stamps it with a SHA-256 fingerprint so a changed number is a changed stamp. Attach it to the readout and the "did we test this properly" argument is over before it starts. Results are easy to show. Process is what makes them believable.

"In God we trust. All others must bring data."

ATTRIBUTED TO W. EDWARDS DEMING — statistician, father of quality management

01 — Experiment

Experiment name

Primary metric (as pre-registered)

Hypothesis (as pre-registered — verbatim)

Registered on

Test start

Test end (pre-committed)

Results analyzed

02 — Platform Calibration

Testing platform

Validated with Platform Validator?

Trust score (if validated)

03 — Integrity Attestations

Check only what is true. Unchecked items appear on the receipt.

04 — Result

Control visitors

Control conversions

Variant visitors

Variant conversions

Prior scepticism (for the honest estimate — same as Reality Check)

The receipt is stamped with a SHA-256 fingerprint of its contents — any change to the numbers produces a different stamp.

// the math — how the grade and the stamp are computed

Integrity grade — weighted attestations

Each attestation carries a weight reflecting how badly its absence damages the conclusion: full duration 3, metric unchanged 3, no interim decisions 2, correct denominator 2, segments pre-declared 2, SRM checked 2, no concurrent test 1. These seven weights sum to 15. Platform validation adds up to 2 more (2 if passed, 1 if issues were found and documented), for a maximum of 17.

score = earned weight / maximum weight (17)

A ≥ 0.90 · B ≥ 0.72 · C ≥ 0.50 · F < 0.50
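The scoring rule above can be sketched in a few lines of Python. This is an illustration, not the tool's actual source: the weights and thresholds come from the text, but the key names, the `platform` status labels, and the function names are assumptions made for the example. Hard caps are not applied here.

```python
# Weights as listed in the text; the key names are hypothetical labels.
ATTESTATION_WEIGHTS = {
    "full_duration": 3,
    "metric_unchanged": 3,
    "no_interim_decisions": 2,
    "correct_denominator": 2,
    "segments_predeclared": 2,
    "srm_checked": 2,
    "no_concurrent_test": 1,
}

# Platform validation points; the status labels are hypothetical.
PLATFORM_POINTS = {"passed": 2, "issues_documented": 1, "not_validated": 0}

MAX_WEIGHT = sum(ATTESTATION_WEIGHTS.values()) + 2  # 15 + 2 = 17


def integrity_score(checked, platform="not_validated"):
    """Return earned weight divided by the maximum weight (17)."""
    earned = sum(w for name, w in ATTESTATION_WEIGHTS.items() if name in checked)
    earned += PLATFORM_POINTS[platform]
    return earned / MAX_WEIGHT


def letter_grade(score):
    """Map a score in [0, 1] to a letter, before any hard caps."""
    if score >= 0.90:
        return "A"
    if score >= 0.72:
        return "B"
    if score >= 0.50:
        return "C"
    return "F"


everything = set(ATTESTATION_WEIGHTS)
print(letter_grade(integrity_score(everything, "passed")))         # 17/17
print(letter_grade(integrity_score(everything, "not_validated")))  # 15/17
```

With every attestation checked but no platform validation, the score is 15/17, about 0.88, which falls just short of an A.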

Hard caps

Two failures cap the grade regardless of everything else: not running the full pre-committed duration, and changing the primary metric. If either attestation is unchecked, the best achievable grade is C. These are the two sins that invalidate a test outright rather than merely weakening it: stopping early after peeking inflates the false-positive rate beyond the stated significance level, and a switched metric means the result no longer tests the registered hypothesis.
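A minimal sketch of the cap, assuming the letter grade has already been computed from the weighted score. The function name and arguments are hypothetical. The only rule taken from the text is that a missing full-duration or metric-unchanged attestation limits the grade to C at best.

```python
GRADE_ORDER = ["A", "B", "C", "F"]  # best to worst


def apply_hard_caps(grade, full_duration, metric_unchanged):
    """Limit the grade to C at best if either hard-cap attestation is unchecked."""
    if full_duration and metric_unchanged:
        return grade
    # Take whichever is worse: the scored grade or the cap.
    worse = max(GRADE_ORDER.index(grade), GRADE_ORDER.index("C"))
    return GRADE_ORDER[worse]


# Every attestation except full duration, platform passed: 14/17 scores a B,
# but the cap pulls it down to C.
print(apply_hard_caps("B", full_duration=False, metric_unchanged=True))
```

The cap only ever lowers a grade: a receipt that already scores F stays at F.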

Timeline cross-checks

Computed automatically, not attested: registration date must be on or before the start date (otherwise flagged "registered after start"), and the analysis date must be on or after the pre-committed end date (otherwise flagged "analysed before committed end").
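The two cross-checks reduce to date comparisons. The sketch below assumes the four dates from the Experiment section are available as `datetime.date` values. The function and parameter names are illustrative; the flag strings are the ones quoted above.

```python
from datetime import date


def timeline_flags(registered_on, test_start, committed_end, analysed_on):
    """Return the timeline flags raised by the four receipt dates."""
    flags = []
    if registered_on > test_start:
        flags.append("registered after start")
    if analysed_on < committed_end:
        flags.append("analysed before committed end")
    return flags


# Registered the day before launch, analysed two days before the committed end.
print(timeline_flags(
    registered_on=date(2024, 3, 1),
    test_start=date(2024, 3, 2),
    committed_end=date(2024, 3, 16),
    analysed_on=date(2024, 3, 14),
))
```

Same-day registration and same-day analysis both pass, since the rule is "on or before" and "on or after".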

Statistical results on the receipt

The receipt uses the same engine as Reality Check: a two-proportion z-test for the p-value, and a normal-normal Bayesian posterior for the honest (shrunk) estimate, P(B beats A), and the 95% credible interval. See the math section on the Reality Check page for the full derivation.
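For orientation, here is one common way to compute these quantities. It is a sketch under stated assumptions, not the Reality Check derivation itself. It assumes a pooled standard error for the z-test, a zero-centred normal prior on the absolute difference in conversion rates, and a `prior_sd` parameter standing in for the "prior scepticism" setting. The real page may parameterise the prior differently, for example on relative lift.

```python
import math


def normal_cdf(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def two_proportion_z_test(c_vis, c_conv, v_vis, v_conv):
    """Two-sided z-test for a difference in conversion rates (pooled SE)."""
    p_c, p_v = c_conv / c_vis, v_conv / v_vis
    pooled = (c_conv + v_conv) / (c_vis + v_vis)
    se = math.sqrt(pooled * (1 - pooled) * (1 / c_vis + 1 / v_vis))
    z = (p_v - p_c) / se
    return z, 2 * (1 - normal_cdf(abs(z)))


def shrunk_estimate(c_vis, c_conv, v_vis, v_conv, prior_sd):
    """Normal-normal posterior for the rate difference, prior N(0, prior_sd^2).

    Assumes both arms have some conversions, so the sampling variance is > 0.
    """
    p_c, p_v = c_conv / c_vis, v_conv / v_vis
    diff = p_v - p_c
    se2 = p_c * (1 - p_c) / c_vis + p_v * (1 - p_v) / v_vis  # sampling variance
    shrink = prior_sd ** 2 / (prior_sd ** 2 + se2)           # weight on the data
    post_mean = shrink * diff
    post_sd = math.sqrt(shrink * se2)                        # posterior std dev
    return {
        "observed": diff,
        "shrunk": post_mean,
        "p_b_beats_a": normal_cdf(post_mean / post_sd),
        "ci95": (post_mean - 1.96 * post_sd, post_mean + 1.96 * post_sd),
    }


z, p_value = two_proportion_z_test(10_000, 500, 10_000, 560)
print(round(z, 2), round(p_value, 3))
print(shrunk_estimate(10_000, 500, 10_000, 560, prior_sd=0.005))
```

A smaller `prior_sd` means more scepticism: the shrink factor moves toward zero and the honest estimate is pulled further from the observed difference toward no effect.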

The tamper-evident stamp

stamp = SHA-256( canonical JSON of all inputs + attestations )

Computed in your browser with the Web Crypto API. The hash is deterministic: the same inputs always produce the same 64-character stamp, and changing a single conversion count produces a completely different one. Anyone can re-enter the claimed inputs and verify the stamp matches. It proves consistency, not truth: the stamp certifies that the numbers shown are the ones it was computed from, so any edit made after the receipt was issued shows up as a mismatch against the originally recorded stamp.
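The idea can be reproduced outside the browser. The sketch below uses Python's `hashlib` in place of the Web Crypto API, and it assumes that "canonical JSON" means sorted keys with no extra whitespace. The tool's exact canonical form and field names are not specified here, so stamps from this sketch will not match stamps issued by the tool.

```python
import hashlib
import json


def receipt_stamp(inputs, attestations):
    """SHA-256 hex digest of a canonical JSON encoding of the receipt data."""
    payload = {"inputs": inputs, "attestations": attestations}
    # Sorted keys and fixed separators make the encoding independent of
    # dictionary ordering and whitespace (an assumed canonical form).
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"),
                           ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Field names below are hypothetical.
inputs = {"control_visitors": 10000, "control_conversions": 500,
          "variant_visitors": 10000, "variant_conversions": 560}
attestations = {"full_duration": True, "metric_unchanged": True}

original = receipt_stamp(inputs, attestations)
edited = receipt_stamp({**inputs, "variant_conversions": 561}, attestations)

print(len(original))       # 64 hex characters
print(original == edited)  # False: one changed count, different stamp
```

Because anyone can recompute a plain hash, the stamp only exposes tampering when the original stamp is kept somewhere the editor cannot also change, such as the readout it was attached to.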