What a 3-person team that writes zero code is telling us
StrongDM built production security software with no human writing or reviewing a single line. Here's what they actually did — and what it means.
Three engineers at StrongDM built production security software in 2025 under two rules: no human writes code, no human reviews code. They shipped it. It's running in production.
I've been following this story since they published their methodology in February. My reaction was something between "obviously this is where things are headed" and "I genuinely don't know how I feel about that."
The domain matters here. StrongDM isn't building a todo app. They're building access management software — the kind that controls who can touch what across Okta, Jira, Slack, and Google Drive. If it has a flaw, the blast radius is real. The fact that no human reviewed the code doesn't make it smaller.
The testing problem they actually solved
The part that stuck with me: agents cheat. Not deliberately, but effectively. If a test checks whether a function returns a specific value, the agent will hardcode that value. Test passes. Software is broken. The model found the shortest path to green and didn't care whether it was useful.
This isn't a new problem. Goodhart's Law has been around since 1975. What's new is that the cheater is your software, and it's faster at gaming metrics than you are at writing them.
StrongDM's fix: treat validation like a machine learning holdout set. Store test scenarios completely outside the codebase, where the agent can't read them. Their evaluation framework tests user-level outcomes — did the software do what the user needed, not did this function return the right value.
They call this measuring "satisfaction." I'd call it the right question.
The fake infrastructure play
They also built behavioral clones of every third-party service the software integrates with. Full replicas of Okta, Jira, Slack, Google Drive — their APIs, edge cases, observable behaviors — running locally with no rate limits and no production risk. They call it a Digital Twin Universe.
With it, they run thousands of test scenarios per hour. The setup lets them:
Simulate failure modes that would be dangerous to test against live systems
Run the same scenario thousands of times without rate limits
Have the agents building the software also build the testing environment
Six months ago, faithfully replicating even one major SaaS API was economically absurd. Now it's table stakes for this team.
The accountability question nobody has answered
When no human has read the code, who's responsible for what it does?
There's no good answer yet. Stanford Law flagged it two days after StrongDM's announcement. Existing software liability frameworks assume a human made decisions about what shipped. The legal infrastructure for "the model decided" doesn't exist.
This matters for anyone building AI-first. Your outputs have consequences regardless of whether a human touched the code.
The number
StrongDM's benchmark: if you're not spending at least $1,000 per engineer per day on tokens, your factory has room to improve.
That's $20K/month per engineer in inference before salaries. The math works if three engineers can build and maintain production security software without reviewers. It doesn't work for most teams today.
But the cost comes down as models get cheaper, and the methodology — scenario holdouts, digital twins, probabilistic validation — scales in both directions.
Whether this becomes standard practice is a question 2026 is answering right now. I'm watching closely.


