AI products will live or die based on this rule

A hit rate for an AI product is the percentage of its outputs that are good enough to use as-is. If you’re building something, what hit rate does your product need before people actually trust it? Two apps that I’m watching this month have wildly different answers, even though they’re running on the same underlying LLMs.

The first is ReproKit, an AI “bug-catcher” that takes reports from users of your app, along with their console logs, to find real production issues. It would be a great product even at a 50% hit rate, because a senior developer can quickly eyeball the top 10 bugs and decide which are real issues we need to work on.

The second is Tivi, a bookkeeping tool that categorizes your company’s transactions and creates monthly overview accounts. It would be hard to fully trust it at 75%, let alone 50%.

The same AI models, two completely different trust thresholds required. What determines the bar is the work that I have to do as a user to verify whether the AI got things right.

What 50/50 actually means

Let’s take the bug-catcher. It ingests a few hundred thousand console logs a week, groups them into “issues,” and surfaces the ten most worrying clusters on a dashboard. Only half of those clusters might be real bugs. The other half might be noise: a ResizeObserver warning the browser threw during a scroll, a deprecated API call nobody cares about, an error from a third-party widget that’s been broken for a year.

A developer can scan the list quickly, dismiss the obvious noise, open the few that look real, and get back to work. The cost of filtering one row is two clicks and ten seconds of scanning. Ten rows later, you’ve spent under two minutes and saved a week’s worth of effort chasing down issues from users. The product is already worth paying for at this level. To be honest, I’d even pay for it at 30/70. The AI does the aggregation, we do the triage quickly, and the unit economics hold because we save so much time.

Now take the bookkeeping tool. It pulls data, categorizes VAT, reconciles receipts, and generates an overview. If half of those line items are wrong, I have to read every line to find the bad ones, and by the time I’ve fixed the mistakes I may as well have paid a human bookkeeper. At 50/50, the user has done the work twice: once to verify, once to fix. The rational move is to drop the AI.
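The two cases can be put side by side with a back-of-the-envelope model. This is a minimal sketch under stated assumptions: `net_seconds_saved` is a hypothetical helper, and apart from the ten-second scan per row mentioned above, every number in the usage lines (the manual cost per item, the bookkeeping item count, verify time, and fix time) is made up for illustration, not measured.

```python
def net_seconds_saved(n_outputs, hit_rate, verify_s, fix_s, manual_s):
    """Seconds saved by using the AI instead of doing every item by hand.

    verify_s: seconds to check one AI output
    fix_s:    seconds to repair one bad output (0 if bad ones are just dismissed)
    manual_s: seconds to produce one output with no AI at all
    """
    bad_outputs = n_outputs * (1 - hit_rate)
    with_ai = n_outputs * verify_s + bad_outputs * fix_s
    without_ai = n_outputs * manual_s
    return without_ai - with_ai


# Bug triage: ten rows, a ten-second scan each, noise is dismissed (fix_s=0).
# manual_s=600 is an assumed ten minutes to surface one cluster by hand.
bug_catcher = net_seconds_saved(10, 0.5, verify_s=10, fix_s=0, manual_s=600)

# Bookkeeping: assumed 200 line items, 30 s to verify each,
# and 60 s either to fix a wrong one or to categorize one by hand.
bookkeeping = net_seconds_saved(200, 0.5, verify_s=30, fix_s=60, manual_s=60)

print(bug_catcher, bookkeeping)  # 5900.0 0.0
```

With these assumed numbers the bug-catcher comes out far ahead at 50/50, while the bookkeeping tool only breaks even: checking every line plus fixing half of them costs as much as doing the books by hand. The exact figures will differ per product; the shape of the comparison is the point.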

The visible spread in shipped AI products

Looking at the numbers from companies that ship AI at scale, the gap is enormous.

GitHub Copilot’s average suggestion acceptance rate sits between 30% and 65% across different studies, and it’s a great product. Out of every 10 suggestions, roughly 4 to 7 miss; developers ignore those and accept the 3 to 6 that fit. The filtering cost is one keystroke. Cursor’s custom autocomplete model claims a 71% accept rate on roughly the same task surface: also a great product, also operating under a “one keystroke to filter” economy. Both are good, and sure, the gap makes a difference, but neither breaks.

Decagon, the AI customer-support company valued at $4.5 billion as of March 2026, prices on outcomes: they only get paid when an interaction resolves without a human. Reported resolution rates fall in the 70–90% range across travel, fintech, and DTC retail. The number has to be that high because an end customer with a billing question has zero patience for a wrong answer. A bad response gets escalated to a human to take over, and Decagon has to eat the cost. The model could be the same one any other vendor is using; the pricing structure forces the hit-rate problem onto the company building the product.

Then there’s Hippocratic AI, which builds AI agents that make pre-op and post-discharge nursing calls. Their published constellation of clinical benchmarks puts their patient-safety failure rate below the rate of human nurses on the same calls. The bar is brutal. A clinical AI that gives unsafe medication advice 5% of the time doesn’t ship at all, because the regulatory consequences and the patient consequences are both catastrophic. Hippocratic’s real competition is the safety record of the median human clinician, and the product has to be at least that good before the conversation about adoption can start.

Why this stays the bottleneck

The standard counter is that models will improve and the floor will lift. That’s true, and on a long enough horizon it does close the gap. If the underlying models get to genuine human-level reliability across the board, the bookkeeping tool will ship at 99% and everyone will trust it. At that point the filtering cost framing stops mattering. The problem is that “long enough horizon” doesn’t help the people building today, and if you sit around waiting for that then you’ve already lost. Model improvements will raise everyone’s hit rate roughly together: if Copilot moves from 30% to 50%, the bookkeeping tool might move from 70% to 85%, still not quite the 95% it needs to be truly useful. The gap closes eventually. But it closes more slowly than the models improve, because in high-trust workflows, “better” and “trusted” are not the same thing.

The other main counter is “put a human in the loop.” Fine, sometimes. The honest objection is that human-in-the-loop is what you do when the model isn’t good enough on its own; if you have to do it for every output, what you’ve built is a workflow tool with AI in it, not an AI product. The interesting version is when the product itself decides which outputs need a human look. You route the user’s attention specifically to the marginal cases where the model’s confidence is low, or where the consequence of being wrong is high. That’s a product and UX effort more than a model decision, and it’s where a lot of the leverage currently is.
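That routing idea can be sketched in a few lines. The code below is an illustration under assumptions, not a description of any product named here: the `Output` fields, the `needs_review` and `route` helpers, and both thresholds are hypothetical. It also presumes the model exposes a confidence score that is reasonably calibrated, which is not a given.

```python
from dataclasses import dataclass


@dataclass
class Output:
    label: str          # what the model produced, e.g. a transaction category
    confidence: float   # assumed model score in [0, 1]
    stakes: float       # assumed cost of being wrong, e.g. transaction amount


def needs_review(out, min_confidence=0.9, max_stakes=1000.0):
    """Flag an output when the model is unsure or a mistake would be costly."""
    return out.confidence < min_confidence or out.stakes > max_stakes


def route(outputs, **thresholds):
    """Split outputs into (auto_accepted, review_queue)."""
    auto, review = [], []
    for out in outputs:
        (review if needs_review(out, **thresholds) else auto).append(out)
    return auto, review


batch = [
    Output("office supplies", confidence=0.98, stakes=40.0),
    Output("travel", confidence=0.62, stakes=120.0),       # low confidence
    Output("equipment", confidence=0.97, stakes=8500.0),   # high stakes
]
auto, review = route(batch)
print([o.label for o in auto])    # ['office supplies']
print([o.label for o in review])  # ['travel', 'equipment']
```

The user looks at two rows instead of three. On a real ledger the saving depends on where the thresholds sit, and choosing them against evals is the product work, not the model work.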

How to check which bucket you’re in

A useful test, in order:

  • What does it cost your user, in seconds, to verify or fix one output?
  • What happens to their trust in the product when they hit a bad one? Do they shrug, or do they stop using it?
  • Is your distribution of bad outputs boring (a few wrong rows on a list) or catastrophic (wrong advice to a patient, or mistakes on a VAT return)?
  • Could you charge on outcomes, like Decagon, or only on usage, like Copilot? Outcome pricing forces you to solve the hit-rate problem. Usage pricing lets you survive a lower hit rate.

If filtering is cheap and bad outputs are acceptable, you can ship at 50/50 and your product gets better as the floor rises. If filtering is expensive or bad outputs are catastrophic, your product needs to be somewhere between 90% and 99%, and the work between now and then is mostly evals, scaffolding, and routing.

My own version of this

I’ve been generating articles with an AI content pipeline this month and half of them come out genuinely good, with only a few edits needed. The other half are bad in ways that aren’t even interesting. Wrong angle, boring opening, off-voice, often all three. As the person building the pipeline, I’m willing to read every output and fully rewrite (or discard) the bad ones; that’s my job. As a user of someone else’s tool, I’d quit at the second bad article and never come back, because fully rewriting a bad piece defeats the purpose.

My teammate building the bug-catcher can hopefully ship something useful at a 50% hit rate by the end of this month. The teammate building the bookkeeping tool faces a higher trust bar, because of the nature of his product. His user pays a higher time cost for filtering mistakes. So the product cannot rely on gradual accuracy gains alone. It also has to reliably route attention to the edge cases, with evals on categorization, scaffolding around likely errors, and a UI that brings low-confidence rows to the surface.

So what does your user pay, in time and effort, to look at one output and decide whether to keep it? If the cost is a glance, you can ship early and improve as the model floor rises. If they have to read the whole thing, aim for 90%+ to make it feel useful. The work is figuring out where that bar sits, then designing around it. Good UX involves sparing the user from having to check everything. The AI builders who understand this in 2026 will look ahead of the curve in 2028. The ones who don’t will still be waiting for the model to be ready.