The donor's note
The use case had passed everything. Green on time saved: the staff saved roughly thirty hours a quarter. Green on quality: readability up, tone more consistent. Green on risk: no confidential data in the prompts, no obvious factual drift. Green on repeatability: three staff members ran the recipe and got substantially similar results.
Six weeks after the use case graduated to Solutions, a donor who had given to the organization for eleven years sent a short note back:
I noticed our notes from you got warmer and less specific at the same time. I can't put my finger on it. What changed?
Nothing the scoring dimensions were looking for had changed. Everything the scoring dimensions were not looking for had changed. The donor, who had a decade of pattern on what the organization sounded like when it was talking to her, had picked up the shift in about two months. The organization, running four-dimension scoring on every experiment, had not picked it up at all.
This is why the flag exists.
The question
For every use case that scores well on the four dimensions, one more question must be asked, in writing, by a named person who was not in the experiment:
What about this might cost us something humanly?
Not a risk in the structural sense. Risks are already covered in the scoring. Something else. Something the dashboard cannot price and the scoring rubric was not built to catch. The flag is not the same category as the scoring. It is a separate pass, applied after the scoring, by someone senior enough that their flag cannot be waved off.
The question sounds vague. It is not. It has four specific exposures underneath it, and the work of the flag is to run a candidate use case through all four.
The four exposures
Authorship. Whose name is on this, and could they honestly defend it as theirs? The thank-you note bears the executive director's name. If a donor asked her, did you write this, could she answer without flinching? The answer can be legitimately complex. I shaped what it says and the assistant drafted the words is a real answer. But the answer has to exist, and the person whose name is on it has to be willing to give it. A use case that depends on authorship obfuscation is a use case that will eventually embarrass its author in front of the person it was written for.
Voice. Does this sound like us, or does it sound like every other organization doing the same work? Voice is the texture that makes a piece of writing recognizable as yours. It is not style. It is the specific set of decisions (what you foreground, how you make a point, which rhythms you trust, which words you refuse) that an experienced reader of your work will notice even when they cannot quite name it. A generator produces outputs that sound like the median of the training data. Use cases applied thoughtlessly will drift toward the median, and the drift is cumulative across a quarter. The flag asks whether the specific output, in this specific use case, carries voice or dilutes it.
Trust. If the recipient knew how this was made, would the relationship hold or fracture? The test is specific and unforgiving. It is not would a reasonable recipient theoretically understand. It is would this specific donor, this specific congregant, this specific member, this specific reader, continue to trust us if they saw the production process. Trust that depends on the recipient not knowing how the work was made is trust that is already compromised; the recipient just has not found out yet.
Formation. What are we becoming, as an organization, by doing this at scale? Every workflow is also a formational practice. The team that writes thank-you notes by hand is being shaped, at a small scale, into a team that pays specific attention to specific donors. The team that drafts thank-you notes with an assistant is being shaped, at a small scale, into a team that delegates that kind of attention. Either shape can be defensible. Neither is neutral. The flag asks what kind of organization this use case, compounded over years, will turn us into.
Why the flag is separate from scoring
Structural dimensions and human dimensions are different categories. Merging them lets one outvote the other, and the human dimensions usually lose that vote.
A scoring sheet that tried to include voice, authorship, trust, and formation as additional columns would blunt them. The columns would get numerical or categorical treatment. Yellow scores would accumulate. Greens on time, quality, risk, and repeatability would swamp yellows on the human dimensions. The math would produce a portfolio decision that the sum of the scores said was sound.
The flag is deliberately not that. The flag is a separate pass, after the scoring, that can reroute a use case even if every scoring dimension came back green. It does not compete with the scoring in the aggregate; it sits on top of it as a check.
This architecture matters. Mission-driven organizations are most exposed to failure modes the scoring cannot see. Revenue, productivity, and quality metrics are all recoverable. Trust is not. Voice is not. Authorship is not. Formation is not. A compromise on any of these compounds in the direction the organization cannot afford and on a time horizon the scoring cannot reach.
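The separation can be made concrete. The sketch below is a hypothetical illustration, not part of the framework's own materials: every name (`UseCase`, `ready_for_solutions`, the string values) is an assumption. The point it encodes is the one above: the flag is a gate layered on top of the scoring, never a fifth column averaged into it, and a blank flag field is a skip, not a pass.

```python
from dataclasses import dataclass, field
from typing import Optional

GREEN, RED = "green", "red"

@dataclass
class UseCase:
    name: str
    # The four structural scoring dimensions (time, quality, risk, repeatability).
    scores: dict = field(default_factory=dict)
    # The flag is a separate field, written after scoring by a named reviewer
    # who was not in the experiment. None means the flag pass has not run.
    flag: Optional[str] = None          # "clear" or "red" once written
    flag_author: Optional[str] = None
    flag_reason: Optional[str] = None   # the short written paragraph

def ready_for_solutions(uc: UseCase) -> bool:
    """The flag gates the decision; it is never averaged into the scores."""
    all_green = all(v == GREEN for v in uc.scores.values())
    flag_written = uc.flag is not None  # a blank field is a skip, not a pass
    return all_green and flag_written and uc.flag != RED
```

Note that a use case with four greens and an empty flag field still does not pass: the skip itself blocks graduation, which is exactly the failure mode the flag exists to prevent.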
Who writes the flag
Not the experiment owner. Not the peer reviewer who handled the quality judgment. Someone senior. In most organizations, this is the role of the senior leader who owns Safety, the person whose accountability already includes the theological, relational, and governance integrity of the organization's work. In smaller organizations, it is the executive. In larger ones, it is the senior leader for the domain the use case touches.
The flag writer needs three things the experiment team often does not have. Distance from the enthusiasm. Authority to reroute without negotiation. A long enough memory to recognize the kind of slow-compounding harm the scoring misses. A flag written by someone too close to the project or too junior to push back is not a flag. It is a signature on a form.
The flag is a written paragraph. Short. Specific. Sitting in the shared document alongside the scoring. Either no flag raised, for these reasons, or flagged, for these reasons. A blank field is not an answer; it is a skip, and skipped flags are the failure mode the flag exists to prevent.
What a red flag produces
A red flag does not automatically kill the use case. It reroutes it.
Some flagged use cases come back with a narrower scope. The thank-you note use case might return with an explicit rule that the first four sentences must be written by a human and only the last paragraph drafted by the assistant. The boundary preserves the specific-attention signal the original use case was eroding.
Some flagged use cases come back with a slower rollout. The use case stays in the sandbox for another cycle, under stricter conditions, with an explicit review after each batch. Time, in this case, is the mitigation. The organization buys the chance to notice drift before it reaches donors.
Some flagged use cases come back redesigned. The pattern was right; the instance was wrong. A translation-category use case that flattened voice returns as a translation use case that requires a named human editor to restore voice before the output ships. The pattern stays in the portfolio; the specific recipe does not.
A small number of flagged use cases are parked. Not rejected. Parked. The organization is not ready for this yet. The use case goes into the parked list with a clear note of what would have to change for it to come back.
What a red flag does not produce is an argument. The flag writer does not have to win a debate with the experiment team. If the flag is red, the use case reroutes. This is load-bearing. The moment a red flag becomes negotiable, the flag becomes theater, and the whole discipline collapses.
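The four outcomes above can be sketched as a routing table. This is a hypothetical illustration (the keys and wording are assumptions, not the framework's own vocabulary); what it makes explicit is that the table deliberately contains no "debate" or "override" route.

```python
# Hypothetical reroute outcomes a red flag can produce. A red flag never
# simply kills the use case, and it is never negotiable.
REROUTES = {
    "narrower_scope": "back to the sandbox with an explicit human/assistant boundary",
    "slower_rollout": "another sandbox cycle, stricter conditions, review per batch",
    "redesign": "keep the pattern, rebuild the recipe with a named human editor",
    "park": "parked list, with a note on what would have to change to return",
}

def apply_red_flag(outcome: str) -> str:
    # There is deliberately no 'debate' or 'override' entry: the moment a
    # red flag becomes negotiable, the flag becomes theater.
    if outcome not in REROUTES:
        raise ValueError(f"not a valid reroute: {outcome}")
    return REROUTES[outcome]
```

Asking this routing step for an outcome outside the four raises an error rather than opening a negotiation, which is the load-bearing property of the discipline.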
Why this is the hardest discipline
The flag produces short-term friction and long-term coherence. Organizations under quarterly pressure will consistently under-flag. The senior leader who writes the flag is trading the team's disappointment this month for the organization's integrity over years. Most organizations, most of the time, will not make that trade unless the discipline is structural.
This is also why the flag exists as a written, named, sequenced practice. Integrity that depends on a senior leader's mood in any given week is not integrity; it is luck. The flag is the architectural way the sandbox holds itself honest when the incentive in the room points elsewhere.
Done well, the flag does not slow the organization down overall. It slows specific use cases down while they are being redesigned and lets the rest move faster, because the rest has been checked. An organization that flags honestly in quarter one has a portfolio in quarter two that nobody has to re-litigate. An organization that skips flagging in quarter one has a portfolio in quarter three that is quietly producing the kind of donor note this piece opened with.
Handoff
Flagged and scored use cases now go into the portfolio. The green-scored, flag-clear use cases are candidates for Solutions. The rerouted use cases return to the sandbox. The parked use cases wait. Building the Use Case Portfolio is how the portfolio gets assembled and how it becomes the artifact the rest of SSSS inherits.
Sandbox
When you are ready to run a season, not just read about one
The articles describe the argument. The Sandbox Season is the fixed-scope engagement where a cohort does the work with facilitation, scoring discipline, and a Week 12 handoff.

