
The Ethical & Relational Flag

By Josh Shepherd · 7 min read

The donor's note

The use case had passed everything. Green on time saved: the staff saved roughly thirty hours a quarter. Green on quality: readability up, tone more consistent. Green on risk: no confidential data in the prompts, no obvious factual drift. Green on repeatability: three staff members ran the recipe and got substantially similar results.

Six weeks after the use case graduated to Solutions, a donor who had given to the organization for eleven years sent a short note back:

I noticed our notes from you got warmer and less specific at the same time. I can't put my finger on it. What changed?

Nothing the scoring dimensions were looking for had changed. Everything the scoring dimensions were not looking for had changed. The donor, who had a decade of pattern on what the organization sounded like when it was talking to her, had picked up the shift in about two months. The organization, running four-dimension scoring on every experiment, had not picked it up at all.

This is why the flag exists.

The question

For every use case that scores well on the four dimensions, one more question must be asked, in writing, by a named person who was not in the experiment:

What about this might cost us something humanly?

Not a risk in the structural sense. Risks are already covered in the scoring. Something else. Something the dashboard cannot price and the scoring rubric was not built to catch. The flag is not the same category as the scoring. It is a separate pass, applied after the scoring, by someone senior enough that their flag cannot be waved off.

The question sounds vague. It is not. It has four specific exposures underneath it, and the work of the flag is to run a candidate use case through all four.

The four exposures

Authorship. Whose name is on this, and could they honestly defend it as theirs? The thank-you note bears the executive director's name. If a donor asked her, did you write this, could she answer without flinching? The answer can be legitimately complex. I shaped what it says and the assistant drafted the words is a real answer. But the answer has to exist, and the person whose name is on it has to be willing to give it. A use case that depends on authorship obfuscation is a use case that will eventually embarrass its author in front of the person it was written for.

Voice. Does this sound like us, or does it sound like every other organization doing the same work? Voice is the texture that makes a piece of writing recognizable as yours. It is not style. It is the specific set of decisions (what you foreground, how you make a point, which rhythms you trust, which words you refuse) that an experienced reader of your work will notice even when they cannot quite name it. A generator produces outputs that sound like the median of the training data. Use cases applied thoughtlessly will drift toward the median, and the drift is cumulative across a quarter. The flag asks whether the specific output, in this specific use case, carries voice or dilutes it.

Trust. If the recipient knew how this was made, would the relationship hold or fracture? The test is specific and unforgiving. It is not would a reasonable recipient theoretically understand. It is would this specific donor, this specific congregant, this specific member, this specific reader, continue to trust us if they saw the production process. Trust that depends on the recipient not knowing how the work was made is trust that is already compromised; the recipient just has not found out yet.

Formation. What are we becoming, as an organization, by doing this at scale? Every workflow is also a formational practice. The team that writes thank-you notes by hand is being shaped, at a small scale, into a team that pays specific attention to specific donors. The team that drafts thank-you notes with an assistant is being shaped, at a small scale, into a team that delegates that kind of attention. Either shape can be defensible. Neither is neutral. The flag asks what kind of organization this use case, compounded over years, will turn us into.

Why the flag is separate from scoring

Structural dimensions and human dimensions are different categories. Merging them lets one outvote the other, and the human dimensions usually lose that vote.

A scoring sheet that tried to include voice, authorship, trust, and formation as additional columns would blunt them. The columns would get numerical or categorical treatment. Yellow scores would accumulate. Greens on time, quality, risk, and repeatability would swamp yellows on the human dimensions. The math would produce a portfolio decision that the sum of the scores certified as sound.

The flag is deliberately not that. The flag is a separate pass, after the scoring, that can reroute a use case even if every scoring dimension came back green. It does not compete with the scoring in the aggregate; it sits on top of it as a check.

This architecture matters. Mission-driven organizations are most exposed to failure modes the scoring cannot see. Revenue, productivity, and quality metrics are all recoverable. Trust is not. Voice is not. Authorship is not. Formation is not. A compromise on any of these compounds in the direction the organization cannot afford and on a time horizon the scoring cannot reach.

Who writes the flag

Not the experiment owner. Not the peer reviewer who handled the quality judgment. Someone senior. In most organizations, this is the role of the senior leader who owns Safety, the person whose accountability already includes the theological, relational, and governance integrity of the organization's work. In smaller organizations, it is the executive. In larger ones, it is the senior leader for the domain the use case touches.

The flag writer needs three things the experiment team often does not have. Distance from the enthusiasm. Authority to reroute without negotiation. A long enough memory to recognize the kind of slow-compounding harm the scoring misses. A flag written by someone too close to the project or too junior to push back is not a flag. It is a signature on a form.

The flag is a written paragraph. Short. Specific. Sitting in the shared document alongside the scoring. Either "no flag raised, for these reasons" or "flagged, for these reasons." A blank field is not an answer; it is a skip, and skipped flags are the failure mode the flag exists to prevent.

What a red flag produces

A red flag does not automatically kill the use case. It reroutes it.

Some flagged use cases come back with a narrower scope. The thank-you note use case might return with an explicit rule that the first four sentences must be written by a human and only the last paragraph drafted by the assistant. The boundary preserves the specific-attention signal the original use case was eroding.

Some flagged use cases come back with a slower rollout. The use case stays in the sandbox for another cycle, under stricter conditions, with an explicit review after each batch. Time, in this case, is the mitigation. The organization buys the chance to notice drift before it reaches donors.

Some flagged use cases come back redesigned. The pattern was right; the instance was wrong. A translation-category use case that flattened voice returns as a translation use case that requires a named human editor to restore voice before the output ships. The pattern stays in the portfolio; the specific recipe does not.

A small number of flagged use cases are parked. Not rejected. Parked. The organization is not ready for this yet. The use case goes into the parked list with a clear note of what would have to change for it to come back.

What a red flag does not produce is an argument. The flag writer does not have to win a debate with the experiment team. If the flag is red, the use case reroutes. This is load-bearing. The moment a red flag becomes negotiable, the flag becomes theater, and the whole discipline collapses.

Why this is the hardest discipline

The flag produces short-term friction and long-term coherence. Organizations under quarterly pressure will consistently under-flag. The senior leader who writes the flag is trading the team's disappointment this month for the organization's integrity over years. Most organizations, most of the time, will not make that trade unless the discipline is structural.

This is also why the flag exists as a written, named, sequenced practice. Integrity that depends on a senior leader's mood in any given week is not integrity; it is luck. The flag is the architectural way the sandbox holds itself honest when the incentive in the room points elsewhere.

Done well, the flag does not slow the organization down overall. It slows specific use cases down while they are being redesigned and lets the rest move faster, because the rest has been checked. An organization that flags honestly in quarter one has a portfolio in quarter two that nobody has to re-litigate. An organization that skips flagging in quarter one has a portfolio in quarter three that is quietly producing the kind of donor note this piece opened with.

Handoff

Flagged and scored use cases now go into the portfolio. The green-scored, flag-clear use cases are candidates for Solutions. The rerouted use cases return to the sandbox. The parked use cases wait. Building the Use Case Portfolio is how the portfolio gets assembled and how it becomes the artifact the rest of SSSS inherits.

