Sandbox · Series: Sandbox curriculum, 03 / 09

The Three Kinds of Value AI Legitimately Produces

By Josh Shepherd · 7 min read

Three wins, one funeral

A mid-sized nonprofit I watched from a reasonable distance ran an AI program last year that hit every visible metric. Newsletter output multiplied by ten. List growth of forty percent in two quarters. Readability scores up across every piece they published. The internal narrative was triumphant, and the metrics backed it. Time saved. Revenue earned. Quality improved. Three for three.

Eighteen months later, sixty percent of their largest recurring donors had quietly drifted off the list. Not dramatic resignations; the slower kind, the kind that does not show up on a dashboard until it is already a trend. The organization still produced tenfold more content than it had before. Fewer of the people the content was built for were reading it.

Three wins and a funeral. The category error is not at the metric level. It is at the adjective level.

The three categories, stated cleanly

AI, used well, produces three kinds of value that are worth taking. Time saved wisely. Revenue earned legitimately. Quality improved meaningfully. The nouns are the categories. The adverbs are the argument.

The market speaks about the nouns and drops the adverbs. AI saves time. AI grows revenue. AI improves quality. Each of those statements is true in isolation and deeply misleading as an instruction. Every one of them has a failure mode that looks, on a dashboard, exactly like a success. The case above is what it feels like from the inside when all three failure modes run simultaneously.

This piece is about the adverbs. Once a leadership team holds them steady, the Sandbox's scoring discipline has somewhere to land. Without them, the portfolio fills up with use cases that score well and do damage.

Save time — wisely

Time saved is the easiest kind of AI value to claim and the easiest to misprice. The trap is that the saving is measured in hours while the cost is measured in something else.

Time saved on work that should not have been done at all is not saved time. It is wasted effort, now produced faster. A weekly report nobody reads becomes a weekly report nobody reads, produced in twenty minutes instead of four hours. The organization did not gain three hours and forty minutes; it gained permission to keep doing a pointless thing. Automating the wrong task is the most common way organizations lock themselves into work they should have stopped.

Time saved at the cost of a relational signal is also not saved time. A handwritten thank-you note takes fifteen minutes and carries a specific kind of weight for the recipient. A generated thank-you note takes thirty seconds and carries a different kind of weight, one harder to name until you notice the recipient has stopped answering. Fourteen minutes and thirty seconds were not saved. They were traded for something the dashboard does not price.

Time saved that gets absorbed into more of the same kind of work is also not saved. The team used to produce one newsletter a week in four hours. The assistant lets them produce four newsletters a week in four hours. Time saved per newsletter, yes. Time saved for the organization, no. The organization is now doing four times the work it did before, in the same number of hours, and the staff are as tired as they were last year.

Wise time saving has one feature. The saved time gets redeployed into higher-judgment work that would not otherwise have happened. The weekly report got replaced with a quarterly strategic read the writer's attention could carry. The thank-you note stayed handwritten; the generated time went into a deeper stewardship call. The newsletter team kept producing one newsletter a week and used the saved time to write the book the founder has been trying to finish for three years. Wisely is not a virtue word. It is the name of a specific design choice about where the saved hours go.

Earn revenue — legitimately

Revenue earned with AI is the category most likely to compound into a problem that becomes visible years after the revenue was booked.

The cleanest failure mode is personalization past the point of trust. A donor who used to receive a handful of specific, clearly human notes a year begins receiving more personalized, more frequent, more tonally on-target communications. The first three land well. The seventh lands oddly. The tenth lands, and the donor, reasonably, wonders how much of what they are reading reflects someone actually paying attention to them. The revenue from the more-personalized cadence can be measured. The erosion of trust cannot be measured until the donor is already gone.

A second failure mode is distinctiveness collapse. The organization's voice flattens across campaigns until the copy is indistinguishable from every other organization using the same tools on similar audiences. Short-term conversion may even improve. Long-term brand equity — the specific reason a donor chose this organization over any of a dozen plausible alternatives — thins. Revenue up, moat down. This one is visible only when you zoom out far enough to see the graph of who remembers why they first gave.

A third failure mode is accelerating the wrong acquisition engine. AI makes it easier to reach more people with less effort. The reach is real. The question the revenue category puts to the organization is who is being reached, and whether those people are the people the organization exists for. An organization that optimizes for revenue-per-touch will pick up donors who look like revenue. An organization that optimizes for legitimate revenue will ask, quarterly, whether the new cohort actually resembles the people the mission is for. The first optimization is faster. The second compounds.

Legitimate revenue has one test. It would survive being described honestly to the people paying for it. The donor who gave because of a personalized note would still give if they understood, precisely, how the note was produced. If the answer is probably not, the revenue is not legitimate. It is short-term cash earned against a reputation the organization has not yet priced into the books.

Improve quality — meaningfully

The quality category is the trickiest of the three, because quality is a word that can slide sideways under almost any definition.

The first failure mode is producer-visible quality that nobody else sees. The team scores their AI-assisted output higher than their previous work on internal rubrics: readability, clarity, tone consistency. The scores go up. The people the work is for cannot tell the difference. Internal satisfaction improved. External impact did not. Quality in this sense is real enough to the producer and irrelevant to anyone else. It is worth knowing about. It is not what the Sandbox is for.

The second failure mode is competitive-parity quality. Every organization using the same assistants on similar work ends up producing similar outputs. The quality bar rises across the whole field. Your organization's work now looks as good as three peer organizations' work, where previously it had looked somewhat better. You have caught up to the median, which means you are now invisible at the top. Quality went up. Distinctiveness went down. The net effect on the readers who actually matter is negative.

The third failure mode is smoothed-out quality that erases the specific edges that were carrying your signal. The slightly awkward sentence that was load-bearing. The argument that did not fit the normal rhythm because the idea was actually new. The paragraph your audience knew immediately was yours. Generic quality-improvement flattens these, consistently and invisibly. The writing gets better on every surface metric and weaker on the metric that mattered.

Meaningful quality improvement has one feature. The improvement is visible to the people the work is for, not just to the producer, and it increases the organization's distinctive signal rather than converging on the sector median. An article that reads as more specifically you after the assistant touched it passes the test. An article that reads like any number of competently written pieces does not.

This connects directly to canon #8, The Collapse of Signal in the AI Age. Surface-level quality is now free. The organizations that mistake surface-level quality for the real thing are enrolling themselves in a race where the prize is indistinguishability from everyone else in the race.

Using the three together

Every candidate use case the Sandbox considers gets scored against these three categories. A use case that fits one of them cleanly is worth testing. A use case that fits two is worth prioritizing. A use case that fits none of them is not a use case. It is activity in costume.
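The triage rule above can be sketched in a few lines of Python. This is a minimal illustration, not Sandbox tooling: the `UseCase` structure, the category names, and the verdict strings are assumptions introduced here to make the rule concrete.

```python
# A minimal sketch of the Sandbox triage rule: count which of the
# three value categories a candidate use case cleanly fits, then
# map that count to a verdict. All names here are illustrative.
from dataclasses import dataclass, field

CATEGORIES = {
    "time_saved_wisely",
    "revenue_earned_legitimately",
    "quality_improved_meaningfully",
}

@dataclass
class UseCase:
    name: str
    fits: set = field(default_factory=set)  # subset of CATEGORIES

    def verdict(self) -> str:
        n = len(self.fits & CATEGORIES)
        if n == 0:
            return "activity in costume"   # not a use case at all
        if n == 1:
            return "worth testing"
        return "worth prioritizing"        # fits two or three

# A report nobody reads fits no category, however fast it is produced.
report = UseCase("weekly report automation")
stewardship = UseCase(
    "redeploy drafting time into stewardship calls",
    fits={"time_saved_wisely", "revenue_earned_legitimately"},
)
print(report.verdict())       # activity in costume
print(stewardship.verdict())  # worth prioritizing
```

The sketch deliberately leaves the hard part, deciding whether a use case "cleanly fits" a category, to human judgment, which is the article's point: the counting is mechanical, the fitting is not.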

The scoring is not mechanical. It is senior work. A leader who writes "saves time" next to a use case without asking whether the saving is wise has not scored; they have filed paperwork. The three adverbs — wisely, legitimately, meaningfully — are where the scoring becomes leadership. Drop them, and the Sandbox becomes a machine for enrolling the organization into the failure modes the curriculum exists to prevent.

The pattern is also diagnostic in reverse. If your organization has been running AI work for a year and cannot point to specific use cases that clearly saved time wisely, earned revenue legitimately, or improved quality meaningfully, the issue is not that you need more tools. The issue is that the frame has been missing, which means the scoring has been missing, which means nobody has had a way to tell the three kinds of value apart from their three failure modes.

The Eight Patterns is where we go next. Now that we have named what legitimate value looks like, we can look, with a disciplined eye, at where inside your actual work this value tends to hide.


