See our work
What is hard — and what we sell — is the rigor that makes a number trustworthy: tested out-of-sample, checked against chance, pre-registered, forward-tested, and reported honestly even when it kills the idea. Below: how we do that, then a worked example or two for each service.
We have no public client case studies yet, so nothing here is a dressed-up client result. Each example is tagged either Reproducible (a real, public result you can re-run, linked to its source) or Illustrative (a representative walk-through of the method, no invented figures).
How we make a result trustworthy
We never grade a method on the data it was built on. We hold data back — different periods, different cases — and only believe a result if it survives there.
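A minimal sketch of one way to hold data back by period, assuming each row carries a sortable timestamp. The function name `out_of_time_split` and the 20% holdout are illustrative assumptions, not a description of our production tooling.

```python
from typing import List, Sequence, Tuple


def out_of_time_split(
    timestamps: Sequence[float], holdout_fraction: float = 0.2
) -> Tuple[List[int], List[int]]:
    """Return (train_idx, test_idx); no test row is earlier than any train row."""
    if not 0.0 < holdout_fraction < 1.0:
        raise ValueError("holdout_fraction must be strictly between 0 and 1")
    # Sort row indices by time so the holdout is the latest stretch of data.
    order = sorted(range(len(timestamps)), key=lambda i: timestamps[i])
    cut = int(round(len(order) * (1.0 - holdout_fraction)))
    if cut == 0 or cut == len(order):
        raise ValueError("not enough rows for both a training and a holdout set")
    return order[:cut], order[cut:]


# Hypothetical timestamps, deliberately out of order.
train_idx, test_idx = out_of_time_split([5, 1, 4, 2, 3, 9, 8, 7, 6, 10])
```

A random split would instead let later rows leak into training, which is why its score tends to flatter the method.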
Could a coin-flip have done this? We shuffle the labels and re-run (a permutation test). If a result does not clearly beat the shuffle, it does not ship.
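The shuffle check can be sketched as follows. This is an illustration under stated assumptions: `permutation_p_value` and `r_squared` are hypothetical names, the score is a simple squared correlation standing in for a real model, and the `(count + 1) / (n + 1)` p-value is one common convention rather than the only one.

```python
import numpy as np


def r_squared(x: np.ndarray, y: np.ndarray) -> float:
    """Stand-in for 'fit and score': squared Pearson correlation of x and y."""
    return float(np.corrcoef(x, y)[0, 1] ** 2)


def permutation_p_value(fit_and_score, x, y, n_shuffles: int = 1000, seed: int = 0) -> float:
    """Share of shuffled-label runs that score at least as well as the real labels."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    observed = fit_and_score(x, y)
    at_least_as_good = 0
    for _ in range(n_shuffles):
        # Shuffling y breaks any real link between x and y, then we re-run.
        if fit_and_score(x, rng.permutation(y)) >= observed:
            at_least_as_good += 1
    # Add-one convention: the p-value can never be reported as exactly zero.
    return (at_least_as_good + 1) / (n_shuffles + 1)
```

Under this convention, 1,000 shuffles with no run beating the real result gives 1/1001, roughly 0.001, which is the smallest p-value that many shuffles can report.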
Try enough ideas and one looks great by luck. We deflate every result for the number of attempts behind it, so we are not rewarding a lucky search.
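To make the deflation concrete, here is a sketch using the Šidák correction, a standard textbook adjustment that we use here purely as an illustration. It assumes the attempts are independent, and the function name `sidak_adjusted_p` is hypothetical; other corrections (Bonferroni, false-discovery-rate methods) follow the same idea.

```python
def sidak_adjusted_p(p_value: float, n_attempts: int) -> float:
    """Chance that at least one of n independent attempts looks this good by luck."""
    if not 0.0 <= p_value <= 1.0:
        raise ValueError("p_value must be between 0 and 1")
    if n_attempts < 1:
        raise ValueError("n_attempts must be at least 1")
    return 1.0 - (1.0 - p_value) ** n_attempts


# A result with p = 0.01 that was the best of 20 tries (illustrative numbers).
adjusted = sidak_adjusted_p(0.01, 20)  # about 0.18
```

A result that looked like a 1-in-100 fluke on its own is closer to a 1-in-5 event once the 20 attempts behind it are counted.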
Before we see an outcome, we write down the decision, the hypotheses, and the exact test — then freeze it with a dated SHA-256 hash. The answer cannot be quietly moved after the fact.
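A minimal sketch of a dated SHA-256 freeze, using only the Python standard library. The field names, the example plan, and the date are hypothetical; the assumption is that the plan can be written as JSON, which is serialized with sorted keys so the same plan always hashes the same way.

```python
import hashlib
import json


def _digest(record: dict) -> str:
    # Canonical form: sorted keys and fixed separators, so key order never matters.
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def freeze_plan(plan: dict, frozen_on: str) -> dict:
    """Bundle the plan with its date and the SHA-256 hash of both."""
    record = {"frozen_on": frozen_on, "plan": plan}
    return {"record": record, "sha256": _digest(record)}


def is_unchanged(frozen: dict) -> bool:
    """True only if the record still hashes to the value stored at freeze time."""
    return _digest(frozen["record"]) == frozen["sha256"]


# Hypothetical plan and date, for illustration only.
frozen = freeze_plan(
    {"decision": "ship or drop", "hypothesis": "H1", "test": "permutation"},
    frozen_on="2024-01-01",
)
```

Publishing or timestamping the hash before the outcome is known is what gives it force: any later edit to the plan, however small, produces a different hash.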
Some claims only the calendar can settle. We pre-register the prediction and let it read out on a future date — not in hindsight.
Our own files are full of “we tried this, it did not work.” That is the point — a firm that only ever shows wins is hiding its losses.
Point us at the data-heavy step that is slowing you down; we rebuild it and prove the gain.
We took a publicly visible third-party screening benchmark and matched 95.8% of its top 100 from public data alone, then published the full reproducibility package, every factor's correlation included. The point is not the number; it is that you can re-run it yourself.
See the method + reproducibility →
On a public dataset of 17,379 hourly demand records, we forecast a later stretch of hours the model never saw. Scored the easy, flattering way (a random split), it looks like R² 0.67, where 1.0 is perfect. Scored the honest way, out-of-time, it is 0.62, and it cuts a naive hour-of-day baseline's error by about 6%. The proof it is signal, not luck: across 1,000 shuffled-label runs, not one did better (p ≈ 0.001). Same public data, one script, the same numbers every time.
The public dataset (UCI) →
More worked examples are being added for this service.
Design the dataset you should be building now for the decision you are about to make.
Before any data is touched, we freeze the decision, the hypotheses, the experiment and causal design, and the success test as a signed, dated, SHA-256 record. When the result lands there is no room to quietly re-cut it to taste — the freeze is the accountability.
See Decision-Grade Data →
Representative: a team about to run an experiment. We map the decision to its hypotheses, the causal design, the power needed to detect a real effect, and the pre-registration, so the data they collect is decision-grade, not vanity. Illustrative of the design process.
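As an illustration of the power step, the sketch below estimates how many observations each group needs in a simple two-group comparison. It assumes a two-sided test, equal group sizes, a standardized effect size (Cohen's d), and the normal approximation; the function name `sample_size_per_group` is hypothetical, and a real design may call for a different calculation.

```python
import math
from statistics import NormalDist


def sample_size_per_group(effect_size: float, alpha: float = 0.05, power: float = 0.8) -> int:
    """Rows needed per group to detect `effect_size` (Cohen's d), normal approximation."""
    if effect_size <= 0:
        raise ValueError("effect_size must be positive")
    z = NormalDist()
    z_alpha = z.inv_cdf(1.0 - alpha / 2.0)  # two-sided significance threshold
    z_power = z.inv_cdf(power)              # quantile for the target power
    return math.ceil(2.0 * ((z_alpha + z_power) / effect_size) ** 2)


# A medium effect (d = 0.5) at 5% significance and 80% power.
n = sample_size_per_group(0.5)  # 63 per group under these assumptions
```

Halving the effect you want to detect roughly quadruples the data you need, which is why this is settled before collection starts rather than after.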
More worked examples are being added for this service.
Bring us the data-heavy step that is slowing you down, or the decision you need better data to make. We will scope it — and you will get the result with its working, not just the headline.