Sandbox curriculum · 06/09

Scoring Value Honestly

By Josh Shepherd · 6 min read

Twelve greens, then the incident

The team had run twelve experiments over the quarter. The scoring sheet was mostly green. Eleven of the twelve had green checks in every column; the twelfth had one yellow, and the team was proud of having caught it. Leadership signed off on the portfolio. Two of the green-scored use cases were promoted into early Solutions work.

A month later, both produced the same kind of incident, separately: factual drift in donor-facing copy that passed internal review because the reviewers were pattern-matching on format rather than checking the claim. Neither incident was dramatic. Both were the exact failure mode the sandbox was supposed to prevent.

When the senior leader went back through the scoring sheets afterward, a pattern was obvious in hindsight. The greens were not wrong. They were not scrutinized. The reviewers had filled in the columns the way people fill in forms, not the way auditors check work. The scoring was a ritual. The scrutiny was not.

This is the problem scoring discipline exists to solve. Experiments produce results. Results produce enthusiasm. Enthusiasm produces bad scoring. The organizations that avoid this are the ones that treat scoring as senior work, built to resist its own momentum.

The four dimensions

Every completed experiment gets scored on four dimensions. None is optional. None is sufficient alone.

Time saved. Real, measured, honest time, net of all the time spent reviewing and correcting. Not "felt faster." Actual minutes, against a fair comparison baseline. If the team cannot produce a specific number, the dimension scores neutral by default, because unverifiable time savings are the most common place scoring inflation starts.

Quality improved. Named reviewer, named criteria, written verdict. The reviewer is a peer who did not run the experiment. The criteria are agreed in advance. The written verdict is a paragraph. The question is not "is this output acceptable?" but "does this output improve meaningfully on the human-only baseline, and do we have the evidence to say so?"

Risk introduced. Categorical and honest. Confidentiality risk. Accuracy risk. Authorship risk. Voice risk. Relational risk. For each category, the scorer answers one question: did anything surface during the experiment that should go into the incident register? The categorical list matters because "no risks observed" is usually wrong and almost never caught; asking about five specific categories makes the review more honest than asking about risk in general.

Repeatability. Could a different staff member, following the recipe, reproduce the result? If not, the result is artisanal. Artisanal results are not nothing; they show what is possible with a specific person's skill. They are not the basis for an organizational decision, because the organization does not have that person's skill in replicable form. A low repeatability score is a signal that the use case needs more time in the sandbox before it can be used as evidence for anything.

Each dimension gets scored independently. A green on time saved does not imply anything about quality. A green on quality does not imply anything about risk. The four must be held separately, because most scoring failures happen when one green is allowed to pull the others along with it.
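The independence requirement is concrete enough to sketch. Here is a minimal Python sketch of a score record, assuming a green/yellow/red scale and rendering the "neutral by default" rule for unverifiable time as yellow; the names (`Score`, `ExperimentScore`, `score_time_saved`) are illustrative, not the article's terms:

```python
# A minimal sketch, not a reference implementation.
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Score(Enum):
    GREEN = "green"
    YELLOW = "yellow"
    RED = "red"


# The five risk categories named above, checked one by one.
RISK_CATEGORIES = ("confidentiality", "accuracy", "authorship", "voice", "relational")


@dataclass
class ExperimentScore:
    experiment_id: str
    time_saved: Score        # actual net minutes against a fair baseline
    quality_improved: Score  # named reviewer, precommitted criteria, written verdict
    risk_introduced: Score   # worst flag across RISK_CATEGORIES
    repeatability: Score     # could another staff member reproduce it?


def score_time_saved(net_minutes: Optional[float]) -> Score:
    """No specific number, no green: unverifiable savings score neutral."""
    if net_minutes is None:
        return Score.YELLOW  # assumption: "neutral" rendered as yellow
    return Score.GREEN if net_minutes > 0 else Score.RED
```

Nothing in the record derives one dimension from another; that separation is the point.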

Why self-scoring fails

People who ran the experiment cannot score it. They see the time savings and miss the quality drift, because they are too close to the output. They see the quality they intended and miss the subtle shift in voice, because they produced the output with that shift in mind. They see the absence of obvious risk and miss the category they were not looking for.

This is not a criticism of experiment owners. It is a structural observation about how perception works. Scoring has to involve someone who did not touch the experiment. In small organizations this is awkward. In larger ones it is usually possible. Either way, the scoring discipline requires a second set of eyes; without one, the scoring is not scoring.

The peer reviewer is not a neutral observer. They are someone close enough to the work to know what good looks like, and far enough from the experiment to have no investment in the verdict. A senior colleague. Someone from an adjacent team who knows the organization's voice. Someone whose quality standard is known and respected by the sandbox team.

The honesty problem

Scoring inflation is the single most common sandbox failure mode. Teams want to justify the time they invested. Leaders want to justify the program. Boards want to believe the AI initiative is working. Every incentive in the room points toward generous scores.

The scoring discipline must assume this and counteract it. Three moves help.

The first is the precommitted standard. Quality criteria are written before the experiment runs, not selected afterward. This is a small act of honesty that removes a surprising amount of retrospective rationalization. A team that scores a green on a criterion they wrote last Tuesday has a different relationship to the score than a team that scores a green on a criterion they chose this morning to match what happened.

The second is the default-to-yellow rule. When a dimension is genuinely uncertain, the score is yellow. Not a hopeful green. Not a cautious "let's call it green for now." Yellow. The yellow is not a failure; it is the honest reading of an uncertain result, and yellows in the scoring sheet are valuable information downstream. A portfolio of honest yellows forces a more careful review than a portfolio of optimistic greens ever could.

The third is the out-loud-for-one question. For every green on the sheet, somebody in the room asks aloud, one experiment at a time: what would have to be true for this to be wrong? If nobody in the room can answer, the score is too confident. This is a thirty-second discipline. It catches a disproportionate share of scoring inflation.
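The three moves are procedural enough to express as a gate on a proposed score. A sketch, reusing the green/yellow/red scale from the earlier sketch; `honest_score`, `Criterion`, and `falsifier` are hypothetical names, not the article's:

```python
from dataclasses import dataclass
from datetime import date
from enum import Enum


class Score(Enum):  # as in the earlier sketch
    GREEN = "green"
    YELLOW = "yellow"
    RED = "red"


@dataclass
class Criterion:
    text: str
    written_on: date  # move 1: must predate the experiment run


def honest_score(
    proposed: Score,
    criteria: list[Criterion],
    run_date: date,
    uncertain: bool,
    falsifier: str | None,  # the answer to "what would have to be true for this to be wrong?"
) -> Score:
    """Gate a proposed score through the three honesty moves."""
    # Move 1: precommitted standard. Criteria written on or after the run
    # date cannot support a green; the score falls back to yellow.
    if proposed is Score.GREEN and any(c.written_on >= run_date for c in criteria):
        return Score.YELLOW
    # Move 2: default-to-yellow. A genuinely uncertain dimension scores
    # yellow, not a hopeful green.
    if proposed is Score.GREEN and uncertain:
        return Score.YELLOW
    # Move 3: out-loud-for-one. A green nobody can imagine being wrong
    # is too confident to record.
    if proposed is Score.GREEN and falsifier is None:
        return Score.YELLOW
    return proposed
```

Note that the gate only ever demotes: nothing in it can turn a yellow into a green, which is the asymmetry the honesty problem calls for.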

What a low score means

A low score is not a failure of the tool or the team. It is the sandbox doing its job. The goal of the stage is not a wall of green scores. The goal is a small number of high-confidence greens, a useful set of yellows that point to more work, and a clear set of reds that closed the door on wrong directions before they became incidents.

A team that finishes a sandbox season with twelve greens and no yellows is not running a sandbox. They are running a validation exercise in a sandbox costume. Real exploration produces a mixed ledger. Reds are not setbacks; they are the specific reason the sandbox exists.

This is one of the harder cultural moves in the sandbox stage. Teams and leaders who have been trained to equate good work with positive results have to learn a different correspondence: good work with honest results. The sandbox that produces five reds, seven yellows, and three high-confidence greens is doing better work than the sandbox that produces twelve greens. The second sandbox will find this out in the quarter after the portfolio graduates. The first sandbox is spared the discovery.

How scores aggregate

Scores do three things once they are recorded.

They feed the portfolio. The green-green-green use cases are candidates for Solutions. The green-on-some, yellow-on-others cases return to the sandbox for another cycle. The reds are not deleted; they are kept with a note, so the next team that proposes something similar can see what happened the last time.

They feed Safety's incident register. Risk flags observed in the sandbox are early warnings. A risk category that shows up repeatedly across experiments is a signal about the organization's broader exposure, not just the specific use case that surfaced it. Safety reads the sandbox's risk column regularly, and the sandbox expects them to.

They feed the Skills stage. The greens become the workflows Skills training is organized around. Teams learn on the use cases the organization already validated, which means training lands on real work with known boundaries. This is the handoff that turns the sandbox's output into organizational capacity rather than a one-time product.
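The portfolio feed, at least, is mechanical, and a small routing sketch makes the decision rule explicit. The destination strings here are illustrative, and the Safety and Skills feeds happen regardless of where a use case routes:

```python
from enum import Enum


class Score(Enum):  # as in the earlier sketches
    GREEN = "green"
    YELLOW = "yellow"
    RED = "red"


def route(dimensions: list[Score]) -> str:
    """Map one experiment's four dimension scores to a portfolio destination."""
    if any(d is Score.RED for d in dimensions):
        return "kept-with-note"       # not deleted; visible to the next similar proposal
    if all(d is Score.GREEN for d in dimensions):
        return "solutions-candidate"  # green on every dimension
    return "another-sandbox-cycle"    # any yellow means another cycle first
```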

The sandbox is not a ledger. It is a sensor. The scores are the organization's reading of what the instrument picked up. Treating the scores as a report card misunderstands the work; treating them as signals to be interpreted is how the stage earns its name.

What this piece has not yet done

The four-dimension scoring is necessary. It is not sufficient. Every one of the dimensions can score green while a fifth dimension, still unnamed here, scores red. That dimension, the human cost of a use case that works structurally but erodes something the organization should not let be eroded, is the subject of the next piece.

Scoring catches structural failure. The ethical and relational flag catches the failure structural scoring will always miss.



