Series: Sandbox curriculum · 05 / 09

Recipes Are Not Cooking: Structured Experiments Inside the Sandbox

By Josh Shepherd · 6 min read

Two teams, six weeks

Team A ran eight experiments. The team was enthusiastic. The weekly meetings were energetic. At the end of six weeks, the write-up the executive received was a page and a half of impressions: the assistant worked well on summaries, less well on drafting, surprising on synthesis, uneven on edits. When the executive asked which use cases were worth graduating to Skills or Solutions, the team could not produce a list. They had tried a lot of things. They could not tell the executive which were real.

Team B ran four experiments. The team was quieter. The weekly meetings were short and structured. Two of the four experiments ended with written verdicts the senior leader could hold up to the board. Two ended with clear rerouting notes. The page-and-a-half equivalent was forty lines of concrete findings, any of which the executive could defend under a hard question.

The difference between the two teams was not tool choice, not team composition, not effort. Team B had five specific fields filled in for every experiment. Team A did not. The difference between an anecdote and a finding is rarely about the quality of the experimenter. It is about the structure around the attempt.

What an experiment actually is

A sandbox experiment is a specific bet, written down in advance, that can be resolved. Not "we tried the assistant on our newsletter." Something more like: this kind of input, producing this kind of output, in this amount of time, reviewed by this person, against this quality standard, with this risk surface noted. Written before the attempt, not after.

The reason this matters is simple. After an attempt, human memory reshapes what the attempt was to match what happened. A team that went in fuzzy and came out with a surprising result will reconstruct the attempt around the surprise. Experiments written in advance resist this reshaping. They are the organization's immune system against retroactive storytelling.

Most sandbox failures are not tool failures. They are experiment-design failures. The team did not ask a resolvable question, so the run produced no resolution. The team did not name a reviewer, so nobody was accountable for the quality verdict. The team did not specify the input carefully, so the comparison to the human-only version was unfair in ways that were invisible until the debrief.

The five required fields

An experiment, to be resolvable, needs five fields. Each is non-optional. A sandbox that runs experiments with three of the five produces findings that will not survive a serious review.

Input. The specific material the assistant receives, sourced in a way that respects the data tiers Safety defined. Not "some of our recent emails." The exact artifacts, named, with any redactions or synthetic substitutions documented. Without specific input, comparisons to the human-only baseline are not comparisons; they are stories with numbers attached.

Output. The specific artifact expected, in a form a human can evaluate. Not "a good version." The shape, the approximate length, the elements it must contain, the register it must hit. If the team cannot describe the expected output concretely enough that a reviewer could grade it, the team is not running an experiment. They are running a mood test.

Time comparison. How long the human-only version would have taken, and how long the assisted version actually took, both measured honestly. Honesty is the hard word here. The human-only baseline must not be the worst case; it must be the realistic one. The assisted time must include the review-and-correction time, not just the generation time. Most teams overclaim time savings by an order of magnitude because they compare human wrestling to assistant sprint without including human cleanup.

Quality judgment. A named person, using named criteria, producing a written verdict. The named person is not the experiment owner. The criteria are agreed on before the output is produced, not selected afterward to match what got made. The written verdict is a paragraph, not a score. Scores abstract the thing you need to preserve; paragraphs preserve it.

Risk flag. What, if anything, surfaced during the run that the organization's Safety layer should hear about. A near-miss on confidential data. A moment where the output was quietly wrong about a factual matter. A passage where the tone drifted toward the sector median in ways that should be noticed. Risk flags are how experiments feed the incident register instead of quietly becoming incidents themselves.
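For teams that prefer a template to prose, the five fields can also be written down as a record with nothing optional. The sketch below is illustrative rather than part of the curriculum; the ExperimentRecord name, the field names, and the minute-level time fields are assumptions about how one team might capture this, not a prescribed format.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ExperimentRecord:
    """One sandbox experiment, written down before the attempt."""
    # Input: the exact artifacts the assistant receives, redactions documented.
    input_spec: str
    # Output: the expected artifact, concrete enough for a reviewer to grade.
    output_spec: str
    # Time comparison: realistic human-only baseline vs. assisted time,
    # where assisted time includes review and correction, not just generation.
    baseline_minutes: float
    assisted_generation_minutes: float
    assisted_review_minutes: float
    # Quality judgment: a named reviewer (not the owner), criteria agreed in
    # advance, and a written paragraph verdict rather than a score.
    reviewer: str
    criteria: List[str]
    verdict: str = ""  # filled in after the run, by the reviewer
    # Risk flag: anything the Safety layer should hear about.
    risk_flags: List[str] = field(default_factory=list)

    def assisted_minutes(self) -> float:
        """Honest assisted time: generation plus human cleanup."""
        return self.assisted_generation_minutes + self.assisted_review_minutes

    def time_saved_fraction(self) -> float:
        """Fraction of the realistic human-only baseline actually saved."""
        return 1 - self.assisted_minutes() / self.baseline_minutes

    def is_resolvable(self) -> bool:
        """An experiment missing any of the five fields cannot be resolved."""
        return all([self.input_spec, self.output_spec,
                    self.baseline_minutes > 0, self.reviewer, self.criteria])
```

Filled in honestly, the time fields make the overclaiming failure visible. With hypothetical numbers, twenty minutes of generation plus forty minutes of review against a ninety-minute baseline is a one-third saving, not the nearly four-fifths saving the generation time alone would suggest.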

Why each field is non-optional

Skipping input definition produces unfair comparisons. The assistant looked faster because the team fed it clean, pre-processed material; the human baseline was an hour of staring at raw notes first. Or the assistant looked worse because the team fed it raw notes; the human baseline was already fifteen minutes into the work before the clock started. The input specification is what makes the comparison a comparison.

Skipping output definition produces "it was good" as a finding. A finding of "it was good" is indistinguishable from a finding of "nobody looked carefully." The output specification is what gives the reviewer something to grade against.

Skipping time comparison produces false productivity claims. Teams that skip this field almost always overstate time savings, because generating something takes less time than wrestling it into shape, and the second part of the work is usually the part that got measured in the human-only baseline but not in the assisted one.

Skipping quality judgment produces verdicts the senior leader cannot use. A peer who did not run the experiment, using written criteria, producing a written verdict, is what turns the sandbox's output from vibes into evidence. Any weaker version means the portfolio downstream is built on impressions.

Skipping risk flags produces incidents disguised as successes. The experiment went well on the metrics and nobody noticed that the output drifted toward the sector median, or that the assistant quietly fabricated a statistic, or that a piece of sensitive material ended up in a prompt it should not have. Without an explicit risk-flag field, these go unremarked. With one, they go into the incident register and feed the organization's learning.

Recipes are not cooking

A well-run experiment produces a recipe: input of this kind produces output of this kind, in roughly this time, to roughly this quality, with these risks. The recipe is durable. Another staff member can follow it. Another team can reproduce it. The organization has turned a moment into a capacity.

Recipes are useful. They are not the whole curriculum.

The cooking is what the senior team does with the recipes: which to use, when, for which stakes, with which overrides. That capacity, the formed judgment of when a recipe is the right recipe for this situation, is not what experiments produce. It is what the third layer of sandbox work, judgment, produces over time. Confusing recipes with cooking is the single most common way sandbox programs graduate people too early. The team has good recipes and is treated as if they had formed judgment. Six months later, the recipes are being applied to situations they were not tested for, and the organization discovers that recipes without cooking are surprisingly fragile.

This piece is about the recipes. The later pieces in the series, beginning with The Ethical & Relational Flag, are about the cooking. Both are real work. Neither substitutes for the other.

The shared artifact rule

Experiments live in a shared document the senior team can read. Not in a staff member's notes. Not in a group chat. Not in an email thread that will be impossible to find in six months. A shared page, updated as experiments complete, readable by anyone who will have to make decisions based on the findings.

The rule matters for two reasons. First, it keeps the work real. A finding that nobody writes down in a form somebody else can read is not a finding; it is a recollection. Second, it creates the conditions for cross-team learning. One team's recipe becomes another team's starting point. The sandbox stops being a series of isolated trials and starts being a body of organizational knowledge, which is what the Sandbox stage is ultimately for.

Private experiments are not sandbox experiments. They are personal workflow. Individual staff will continue to use assistants in their own work, and that is fine. Those uses do not feed the sandbox's output. The portfolio is built only from experiments that were written down in a form the senior team could evaluate.

What comes next

Experiments produce claims. Claims must be scored. Scoring Value Honestly is the next piece, and it picks up where this one leaves off: once the experiment has run, how to tell which claims are real, which are inflated, and which need rerouting before they graduate.

A well-run experiment is a prerequisite for honest scoring. A badly run experiment cannot be rescued by careful scoring. The five fields are the prerequisites. The scoring is what you do with them.




When you are ready to run a season, not just read about one

The articles describe the argument. The Sandbox Season is the fixed-scope engagement where a cohort does the work with facilitation, scoring discipline, and a Week 12 handoff.


More in this series

  • The Three Layers of Sandbox Work
  • Discovering Value Under Constraint
  • The Three Kinds of Value AI Legitimately Produces