Series: Sandbox curriculum · 09 / 09

Discovery Engine, Not Generator

By Josh Shepherd · 8 min read

Two organizations, same tool

Two organizations deploy the same AI use case discovery platform. One uses it to accelerate a disciplined sandbox, with the team running the eight-pattern scan, writing the five-field experiments, doing the four-dimension scoring, and writing the flag paragraphs by hand. The tool helps them surface candidates faster, track experiments, and compare scores across teams.

The other uses it to skip the sandbox. The platform generates candidate use cases from a scan of the organization's workflows. It runs simulated experiments. It scores them. It flags them. The staff review the outputs. The senior team approves the portfolio. The process takes a week instead of a season.

After six months, both organizations have portfolios. One has a team that could rebuild the portfolio from scratch if the tool went dark tomorrow. The other has a slide deck that looks, on the surface, identical.

This piece is for the leader being asked the question that now reliably surfaces at the end of a sandbox season: can we build or buy a tool that does this for us?

The temptation

The temptation is real and it is not stupid.

If the sandbox is structured, the structure can be tooled. The eight-pattern scan can be software. The experiment templates can be forms. The scoring rubric can be a calculator. The flag questions can be checklists. The portfolio can be a dashboard. Each of these things is true.

If it can be tooled, the tooling can scale. A single team's sandbox season becomes a ten-team sandbox season without ten times the effort. Cross-team patterns become visible. Comparisons become easy. The senior leader gets a consolidated view they could not possibly assemble by hand.

If it scales, the organization saves months. Or quarters. Or, in the largest organizations, years of sandbox work that can be compressed into a fraction of the time by running the discipline in software.

Every one of those claims is plausible. The question is not whether the compression is possible. The question is what gets compressed out when the work becomes a tool.

The trap

Automating discovery removes the formation that discovery produces.

The point of the sandbox is not just to end up with a portfolio. It is to end up with people whose pattern recognition, experiment discipline, and judgment have been trained by the work of building the portfolio. A tool that delivers the portfolio without the training delivers a slide deck, not a capability. The organization that deploys such a tool looks, from the outside, exactly like an organization that did the sandbox work. Inside, it is different. The people have not been formed. The judgment has not been built. The next set of use cases, six months later, cannot be generated by the same tool against new work, because nobody in the organization knows how to evaluate what the tool is producing.

This is not a theoretical risk. It is the same pattern that made Skills as Formation, Not Training necessary as a distinct canon piece. Training transfers skills. Formation reshapes the person doing the work. A tool that runs the sandbox as a skill, with inputs and outputs and the human as operator, bypasses the formation entirely. The organization ends up with tool literacy and no capability.

The trap is visible in hindsight and invisible in the moment. In the moment, the automated sandbox looks like a productivity win. The portfolio is produced. The board is informed. The Solutions stage can proceed. What cannot yet be seen is what happens when the tool's assumptions no longer match the organization's reality, and the people who should be able to notice the mismatch have never built the eye for it.

The correct role of tooling

A tool can legitimately play a role in the sandbox. The role is not use case generation. It is something closer to a discovery engine: an instrument that accelerates the discipline without replacing it.

Such a tool does five things well.

It guides pattern recognition without substituting for it. When a team runs the eight-pattern scan, the tool might surface candidates the team did not name, drawn from data the tool has access to that the team does not. The team still does the scanning; the tool extends the reach. What the tool does not do is hand the team a finished list of candidates and ask for approval.

It helps score and compare. Once experiments have run, a tool can hold the scoring data across use cases, across teams, and across time, in ways that make patterns visible no single team could see alone. The scoring is still done by the named peer reviewer; the tool holds the aggregate.

It tracks experiments. The five fields, the recipes, the quality verdicts, the risk flags. All of this lives better in a tool than in a shared document, at scale. The tool is infrastructure for the discipline, not a replacement for it.

It surfaces patterns across teams. Which repetition use cases worked well in communications and failed in development. Which translation use cases consistently eroded voice regardless of who ran them. Which decision-support experiments revealed the same formation gap across three departments. These cross-team signals are hard to see manually and valuable to see early.

It hands artifacts back to senior leaders in forms they can reason about. Not black-box scores. Not aggregate confidence ratings. The actual experiment records, the actual flag paragraphs, the actual portfolio entries, organized for a human to read and question.

The tool assists the discipline. The tool does not replace the discipline. The difference is the difference between a discovery engine and a generator.
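
To make "infrastructure for the discipline" concrete, here is a minimal sketch of the kind of record such a tool might hold. It is an illustration, not a product design: the field, dimension, and pattern names are placeholders for the canon's actual five fields, four scoring dimensions, and eight patterns, which are defined earlier in the series. The point of the shape is that everything stored is something a named person wrote and a senior leader can reread.

```python
# A minimal, hypothetical sketch of a discovery-engine record. Every value is
# human-authored and stored verbatim; the tool holds and organizes, it does
# not judge. Field and dimension names are placeholders, not the canon's.
from dataclasses import dataclass, field


@dataclass
class ExperimentRecord:
    team: str
    use_case: str
    pattern: str                 # one of the eight patterns, named by the team
    fields: dict[str, str]       # the five experiment fields, written by hand
    recipe: str                  # how the experiment was actually run
    quality_verdict: str         # the reviewer's verdict, in their own words
    flag_paragraph: str          # the hand-written flag paragraph, verbatim
    reviewer: str                # a named person, never "the tool"
    scores: dict[str, int] = field(default_factory=dict)   # the four dimensions, scored by the named peer reviewer
    risk_flags: list[str] = field(default_factory=list)


@dataclass
class Portfolio:
    season: str
    records: list[ExperimentRecord] = field(default_factory=list)

    def entry(self, use_case: str) -> ExperimentRecord:
        """Open one item and see the specific experiment, verdict, flag
        paragraph, and reasoning, rather than an aggregate score."""
        return next(r for r in self.records if r.use_case == use_case)
```

Nothing in the sketch computes a verdict or a recommendation. The tool stores what the people produced, which is the posture the rest of this piece argues for.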

Three design constraints

If the organization decides to build or buy such a tool, three constraints determine whether it will scaffold or replace the work.

The tool must require human pattern-recognition input, not substitute for it. The eight-pattern scan must begin with people reading their own work. The tool can extend the scan; it cannot initiate it. A tool that asks the team to confirm an automatically generated list of candidates has already compressed out the formational part of the work. A tool that accepts the team's initial list and adds candidates the team can accept, reject, or refine has not.
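
As a sketch of that difference, assuming a simple, hypothetical candidate-list interface rather than any real product's API: the scan cannot be created without the team's own list, and anything the tool adds stays a suggestion until a named person accepts, rejects, or refines it.

```python
# Hypothetical sketch of the first constraint. The scan begins with the team's
# own candidates; tool-surfaced additions remain pending until a named person
# decides on each one. Names and structure are illustrative only.
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    source: str            # "team" or "tool"
    decision: str = ""     # "accepted", "rejected", or "refined" once reviewed
    decided_by: str = ""   # always a named person


class PatternScan:
    def __init__(self, team_candidates: list[str]):
        if not team_candidates:
            # The tool cannot initiate the scan; people reading their own work come first.
            raise ValueError("a scan must begin with the team's own candidates")
        self.candidates = [
            Candidate(name, source="team", decision="accepted", decided_by="the team")
            for name in team_candidates
        ]

    def suggest(self, name: str) -> None:
        """A tool-surfaced candidate: extends the team's reach, pending human review."""
        self.candidates.append(Candidate(name, source="tool"))

    def decide(self, name: str, decision: str, decided_by: str) -> None:
        """Every tool suggestion is accepted, rejected, or refined by a person."""
        for c in self.candidates:
            if c.name == name and c.source == "tool":
                c.decision, c.decided_by = decision, decided_by
```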

The tool must produce artifacts in forms senior leadership can read and question. Not black-box scores. Not aggregate confidence ratings. Not "the AI recommends graduating these three." A senior leader should be able to open any item in the portfolio and see the specific experiment, the specific quality verdict, the specific flag paragraph, the specific reasoning. If the tool abstracts the reasoning away, the senior leader cannot own the portfolio, which means the portfolio is not a leadership artifact anymore.

The tool must track cross-team patterns without collapsing them into a single organizational average. The temptation in any cross-team rollup is to show the average: the average time saved, the average quality score, the average risk rate. Averages are where distinctiveness goes to die. The organization's voice, its formation, and its trust are specifically not at the median of the sector. A tool that reports averages across teams is training the organization to converge on those averages. A tool that reports specifics (which team did what, with what result, under what conditions) protects the specificity that makes the work worth doing.
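
A small sketch of that reporting difference, reusing the hypothetical ExperimentRecord from the earlier sketch: the first view keeps which team did what, with what result, under what conditions; the second is the single-number view this constraint rules out.

```python
# Sketch of the third constraint, reusing the hypothetical ExperimentRecord.
# The specifics view groups results by pattern and keeps team, verdict, and
# flags attached; the averaging view compresses all of that into one number.
from statistics import mean


def report_specifics(records):
    """Which team did what, with what result, under what conditions."""
    by_pattern = {}
    for r in records:
        by_pattern.setdefault(r.pattern, []).append({
            "team": r.team,
            "use_case": r.use_case,
            "verdict": r.quality_verdict,
            "flags": r.risk_flags,
        })
    return by_pattern


def report_average(records, dimension):
    """The view this constraint rules out: one organizational number."""
    return mean(r.scores.get(dimension, 0) for r in records)
```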

A tool that fails any of these three constraints is the wrong tool, regardless of how impressive the product demo looks. A tool that meets all three is the kind of infrastructure a mature sandbox actually benefits from.

Build, buy, or neither

Build only if the organization's work is distinctive enough that a generic discovery engine would flatten it, and the organization has the engineering capacity to build something that respects the three constraints above. Most organizations do not meet both conditions. The ones that do are the ones for whom off-the-shelf tooling is consistently wrong in ways they cannot configure around.

Buy only if the candidate tool passes the three constraints in practice, not in marketing. This requires more diligence than most procurement processes apply. A tool that claims to respect human pattern recognition but defaults to automated candidate lists is not respecting it. A tool that claims to preserve specificity but surfaces averages in every executive view is not preserving it. The due diligence question is about use: does the tool, in the hands of the team, actually scaffold the discipline, or does it, in practice, short-circuit it? The only way to know is to pilot it on a real sandbox season, comparing the team that used it with a team that did not.

Neither, and do the work by hand for at least one full season, if the organization has not yet built the judgment to evaluate a tool. Most organizations are in this category and do not realize it. They are being offered tools they cannot yet evaluate, because the evaluation requires the judgment the tool claims to produce. Running the sandbox manually for a season produces the judgment that makes a subsequent tool evaluation honest.

This last category is where the canon voice is firmest. An organization that buys a discovery tool before doing a manual sandbox season is buying a product it cannot yet evaluate, from a category whose failure modes it cannot yet see. The tool may be excellent and the organization may be fine. Often, the tool is mediocre and the organization discovers a year later that the portfolio it shipped was not what it thought.

The deeper question

The original question that sits underneath all of this: am I robbing people of the learning process?

If the tool fully automates discovery, yes. The organization receives the output and forgoes the formation. Six months later, faced with a new category of work, the team cannot produce the next portfolio, because the eye that would see the new candidates was never trained.

If the tool scaffolds the discipline, no. The organization receives the output and the formation, at accelerated pace. The tool amplifies the discipline without substituting for it. The team becomes more capable, faster, than they would have become by hand, which is what good infrastructure has always done.

The difference is not in the category of tool. It is in the organization's posture toward it. The same platform, used by one team as a scaffold and by another as a generator, produces two different kinds of organization. The product choice matters much less than the posture choice.

Close

The sandbox curriculum is eight pieces, or nine if this optional piece has done its work, because the discipline is that many things. A tool that ships all of them as outputs and none of them as practice is selling a destination without a road. The organizations that will be strongest at AI adoption five years from now are not the ones with the most advanced tooling. They are the ones whose people built, by hand at least once, the capability the tooling now accelerates.

That is why Sandbox exists as a stage, and why Skills comes next rather than Solutions. The next canon piece, Skills as Formation, Not Training, picks up exactly here: what the Sandbox formed, and how it deepens into the judgment that the rest of SSSS depends on.



