Erfan Soliman

Recipes don't work for AI agents as well as you think

Fri, 22 May 2026 00:00:00 GMT

The most reliable line in my AI agent’s prompt isn’t an instruction. It’s a cultural value:

Our product always adheres to “the dividends principle”: we reward user investment immediately with delight or magic.

It sits at the top of the onboarding prompt, above 200 lines of explicit branching logic, and it handles more situations correctly than any of the rules underneath it. I didn’t expect that when I wrote it.

Almost every prompt I’ve seen for an AI agent looks like an instruction manual. The 200-line prompt files produce agents that follow the script decently when the script fits, but produce useless outcomes when it doesn’t. The one-line value, weirdly, generalizes when it’s done well and produces more reliable outcomes.

Why recipes don’t survive contact with reality

The instinct to write a step-by-step recipe feels intuitive. The more I spell out, the more reliable the agent will be, right? That’s true some of the time, but anyone who has tried knows that agents don’t work as reliably as you would like. The pure recipe approach is wrong for the same reason that giving a new hire a 200-page manual is the wrong way to onboard them.

A new employee with a manual can probably handle the situations and examples covered in the document, but they most likely won’t improvise when the situation is off-pattern, can’t tell you why the policy exists, and will fall back to “I’ll ask my manager” the moment something is unfamiliar. Help them internalize your culture alongside the manual instead: what the company values, what good judgment looks like here, what you’d do in their seat if you had to choose. Then they’ll handle the situations the manual covers, plus a bunch of adjacent ones that you never anticipated, all roughly the way you would have wanted them to.

The same thing happens with AI agents. The 200-line prompt gives the agent a guide for 80-90% of onboarding scenarios. The one-line value gives it a tiebreaker it can apply across every adjacent scenario, including the ones you didn’t anticipate.

This isn’t just one founder’s opinion

The strongest evidence in favor of the principles-vs-recipe point is that strong teams, across time and discipline, keep arriving at it from different angles.

Anthropic’s published guidance for CLAUDE.md files (the project-context file every Claude Code session reads) is unambiguous: as few instructions as possible, ideally only ones which are universally applicable to your task. If removing a line doesn’t result in an obvious mistake, remove it. HumanLayer’s published version recommends keeping the file under 60 lines ideally, and no more than 300. That’s the team that ships agents giving direct advice to the operators of agents.

Randall Bennett of Bolt Foundry, building agent-reliability tooling for several years, frames the same idea from the other side: “agents are about culture; computers are about [instructions].” The system prompt is for values; the rest is what computers can already do.

This isn’t limited to modern-day employees and AI agents. People arrived at the same conclusion two centuries ago, on the battlefield. The Prussian army got beaten badly enough at Jena and Auerstedt in 1806 that they had to rethink how orders should work. The doctrine they landed on, eventually called mission command, has the same shape: subordinates are told what effect to achieve and why, then decide locally how. It’s a response to the same problem AI operators have today, where the principal can’t anticipate everything that happens in the field. Centralized intent, decentralized execution.

Where the recipe still has a place

I’m not saying you should write zero instructions. There’s still a floor of explicit constraints worth writing down, especially the ones that would be expensive to discover by trial and error. “Don’t refund without checking the user’s payment method first.” “All dates are in the format YYYY-MM-DD.” “We use Jira instead of Linear.” The reason they sit in the prompt instead of being inferred from a principle is that the cost of inferring wrong is high.

Beyond those specific instructions, the principles do most of the adjacent work. The explicit floor exists only to cover the cases where guessing wrong is too costly.
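A minimal sketch of that split, assuming the prompt is assembled in code. The function name `build_system_prompt` and the layout are hypothetical; the principle and the three constraints are simply the examples quoted in this post, collected in one place for illustration.

```python
# Hypothetical layout: one value at the top, a short floor of hard constraints below.
PRINCIPLE = (
    "Our product always adheres to \"the dividends principle\": "
    "we reward user investment immediately with delight or magic."
)

# Only the constraints that would be expensive to discover by trial and error.
HARD_CONSTRAINTS = [
    "Don't refund without checking the user's payment method first.",
    "All dates are in the format YYYY-MM-DD.",
    "We use Jira instead of Linear.",
]


def build_system_prompt(principle: str, constraints: list[str]) -> str:
    """Put the value first, then the explicit floor as a short bulleted list."""
    lines = [principle, "", "Hard constraints:"]
    lines += [f"- {c}" for c in constraints]
    return "\n".join(lines)


prompt = build_system_prompt(PRINCIPLE, HARD_CONSTRAINTS)
print(prompt)
```

The point of keeping the constraint list as data is that it stays visibly short: if it starts growing every week, that is the signal that the recipe is creeping back in.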

The other thing the principles-only framing misses is operator responsibility. This is the part I find easy to dismiss and hard to actually do. Refining the principles file is real work. You don’t write a values document once and walk away. You watch what the agent does, you notice the choices that surprise you, and you go back and update the intent. If the agent makes a choice you didn’t expect, your intent wasn’t good enough yet. I’d argue this is closer to managing a team than writing a spec.

The dividends principle

In my own agent-coded products, the example I mentioned above has been working so effectively that it’s the whole reason I wanted to write this post. The dividends principle: good products reward user investment immediately, not after some milestone. Whether it’s time investment that a user makes, or the willingness to give us information about themselves during onboarding, or an integration or configuration step, I want my product to reward the user with a moment of magic or delight. I added this to my onboarding agent’s system prompt as a value rather than a rule.

What surprised me was how durable it turned out to be. Different agents, different scenarios, but they kept making decisions that respected the principle. They also came up with some great new ideas I would have never thought of, even when nothing in the prompt explicitly told them to apply the principle to that case. The agent uses the cultural value for ideation as well as in tiebreaker scenarios when explicit instructions run out, which is most of the time in practice.

I’m still figuring out a few things. How many principles is the right number? Mine have been creeping; if I add one a week I’ll be back to a procedure manual by August. And how does this scale across teams? Can a principle that works for one operator be reused by another, or is it culture-bound the same way company values are?

As models continue to improve, I’d bet most of the prompts being written right now are going to get thrown out and rewritten as values-based documents within the next 18 months. The ones that don’t will only apply to cases where the total range of possible outcomes is small enough that exhaustive rules can cover them all. That’s a smaller category than most teams think it is.

The agents of the future will look less like compliance contractors and more like new hires who internalized the culture. The work shifts from writing rules to articulating intent, and from reviewing outputs to refining the intent when the output surprises you. It’s slower than it sounds, in the way that managing a team is slower than writing a spec, but the compounding will definitely show up later.

Your app's onboarding is an editing problem

Wed, 20 May 2026 00:00:00 GMT

You signed up for a SaaS tool last week. You don’t remember which one. There was a welcome modal, then a question about your role, then a question about your team size, then a four-step “let’s customize your workspace” carousel, then a tooltip explaining the sidebar, then a sample project pre-loaded with fake data, then a banner asking if you wanted to invite teammates. You closed the tab somewhere around step five. You meant to come back at some point, but of course you didn’t.

The team that built that flow spent two months on it. Every step was someone’s idea, defended in a meeting, refined in design review, even A/B-tested. They’re proud of it. It’s the reason their activation rate is 11%.

“Onboarding” is one of those words that lost its meaning by becoming too standard. Founders use it to mean the multi-step flow between signup and the product. But the flow is what they built when they confused onboarding with education or personalization. The actual onboarding (the thing the user is doing) is forming an opinion about whether the product is worth coming back to. That happens whether you ship a wizard or not, and the wizard doesn’t always help.

There’s a way to frame this: Onboarding is an editorial discipline. The job is to cut things until what’s left is the smallest experience the user can have and still feel something. Most founders, including myself, treat it as a building discipline instead, a place to add features, options, explanations, education. Often, we end up reducing activation when the goal was the opposite.

The only known exceptions are consumer behavior-change subscription products like Noom or Calm or Duolingo. For everything else (e.g. tool-shaped products, B2B SaaS, dev tools, productivity apps, and most consumer apps), the user is signing up for something they want to use, and the rule holds.

The instinct is wrong on purpose

The reason founders overbuild onboarding is that the people building the product are, by definition, the people who already understand it. When you understand a product, every feature looks like an asset. You added it for a reason, so surfacing it during onboarding feels like generosity. “Look at all the things you can do here. Welcome.”

This happens to every product team eventually. Improvements are always seen as additions, not subtractions. And additions to onboarding flows almost always cost more than they earn back. This is because the user doesn’t see assets. They see a tax. Every screen between signup and the first useful action is a place they can quit. We’ve noticed this in our own product more than once: the team’s instinct is to build a “good first experience,” and every time we ship that good first experience the activation rate either stays the same or drops. Our instincts keep pulling us in the same direction.

The fix is to intentionally and painfully cut down to a narrower experience. Sam Altman’s startup playbook makes the point bluntly: simplicity beats feature-richness at the early stage, and friction kills growth more than missing features do. Most founders flip the priority and lose months of compounding to it.

What “editing” looks like in practice

The clearest tell that a team is doing it wrong is the configuration step. Some configuration is genuinely required: connecting a calendar, picking a primary email, or naming a workspace. Most of it isn’t. Most of what teams put in front of new users is configuration that could be deferred until after the user has done something useful, and it is sitting in the onboarding flow only because putting it there felt thorough.

Configuration during onboarding is a tax on the roughly nine users in ten who will never come back, paid in exchange for slightly less friction for the one user who actually uses it. The tradeoff math stays bad even when the configuration is genuinely valuable. Please defer it.

The harder edit is the educational one. Founders confuse onboarding with education. Education is what you do after the user is hooked. Onboarding is what you do to hook them, and it should be ruthless about cutting things that aren’t part of the hook. The tooltip explaining how filters work belongs after the user has typed something into the app, the product tour belongs after the user has saved their first thing, and the “learn more about our integrations” panel doesn’t belong in onboarding at all.

This is one of those lessons we all need to re-learn multiple times. For every screen, tooltip, or extra option, ask: if I remove this, does the user still get to a moment where the product does something for them? If yes, remove it. The screens you keep are the ones where removal breaks a path.

Why this generalizes beyond developer tools

Back in 2011, Stripe’s pitch was that you could accept payments with 7-9 lines of code. Nine lines. At a time when every competitor was selling a multi-step integration that took a week. The “Collison Installation,” where Patrick and John flew to early customers to install Stripe on their laptops, has become famous in YC circles.

Linear loads your default workspace in under a minute. No 10-step wizard, no customization carousel, no quiz about your team’s working style. You’re inside the app, and the next thing you do is type.

Notion’s signup hands you a blank page after three simple questions. Templates exist but you don’t have to use them. The default state is clean, and the first thing the product asks you to do is anything you want.

The pattern is easy to spot, but it’s a lot harder to act on, because every instinct is pulling you the other way.

A reasonable objection is that Stripe is a developer tool and developers are unusually tolerant of bare-bones experiences. The nine-lines pitch works for them, but it wouldn’t work for a typical SaaS buyer who expects a polished welcome flow. But Notion isn’t a developer tool. Linear has plenty of non-developer users on the design and PM side. The pattern crosses categories because the underlying mechanism is about attention. New users have roughly sixty seconds of patience before they form a judgment about whether your product is worth coming back to, and that’s mostly based on whether they got to a useful moment. The welcome flow’s polish has minimal impact.

The second reasonable objection is that a too-simple onboarding will generate customer support load. In practice, support load comes from confused expectations. A clear, narrow onboarding produces a clear, narrow expectation: “I signed up and I’m in the product.” A multi-step wizard produces a much fuzzier one: “I signed up and I’m being prepared for something.” When the product doesn’t match the wizard’s implied scope, that ironically increases support requests.

There’s also the consumer-app exception. Long onboarding quizzes really do increase activation and retention for some apps in that space. The clearest case is Lose It!’s own PM saying publicly that they just kept extending onboarding and trial starts kept going up, and that they don’t particularly care what the answers are. Perceived personalization works even when the personalization isn’t real, and a meaningful slice of the consumer subscription economy is built on this.

The carveout is narrower than it sounds. It applies to behavior-change products sold on a subscription funnel, where the user arrives with a specific outcome in mind and the personalized program is the product. Health and fitness, meditation, language learning, astrology, focus-and-study apps. The quiz works because of a stack of four things: the calibration is the actual value, each question is a micro-commitment that lifts completion, the time invested produces sunk-cost attachment before the paywall, and a long quiz reads as authority. Remove any of those four and the funnel weakens. Run the same playbook on Linear and you get a worse outcome.

The B2B side runs in the opposite direction, without exception. Top-decile PLG products hit time-to-value in 5-15 minutes, and trials taking longer than seven days see up to 40% fewer conversions. The long onboarding stories in enterprise (Salesforce, Toast, Veeva) are sales-assisted human implementations of integrations that genuinely have to happen for the product to work.

So the rule survives. If you’re building Noom, ignore the rest of this post. If you’re building anything else, editing discipline is crucial.

How to tell if you’re overbuilding it

Some questions, mostly stolen from conversations with founder friends who’ve gone through this:

What’s the median time from signup to a moment the user would describe as useful? If it’s longer than a minute, the fix is almost certainly cutting things rather than building new ones.
How many of your onboarding steps are configuration the user could do later? Count them. It’s probably more than half.
If you cut the entire welcome tour, what breaks? If the answer is “nothing breaks; some users might be slightly confused for five seconds,” you should try surfacing it later.
Are you confusing onboarding with education? Education materials are good. They belong somewhere else.
Which steps in your flow were added because a single user complained about not having them?
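The first two questions can be answered from data you probably already have. A rough sketch, assuming you can export per-user seconds from signup to the first useful action and tag each onboarding step as deferrable or not. The step names and sample numbers below are made up for illustration, not real data.

```python
from statistics import median

# Hypothetical export: seconds from signup to the first useful action, per user.
seconds_to_value = [35, 48, 52, 95, 140, 210, 400]

# Hypothetical onboarding steps, tagged by whether they could happen later.
steps = [
    {"name": "welcome modal", "deferrable": True},
    {"name": "role question", "deferrable": True},
    {"name": "team size question", "deferrable": True},
    {"name": "name workspace", "deferrable": False},
    {"name": "sidebar tooltip", "deferrable": True},
]


def audit(times: list[float], flow: list[dict]) -> dict:
    """Summarize the two numbers the first audit questions ask for."""
    median_ttv = median(times)
    deferrable = [s["name"] for s in flow if s["deferrable"]]
    return {
        "median_seconds_to_value": median_ttv,
        "over_a_minute": median_ttv > 60,
        "deferrable_share": len(deferrable) / len(flow),
        "deferrable_steps": deferrable,
    }


report = audit(seconds_to_value, steps)
print(report)
```

The median is used rather than the mean so that a handful of users who wandered off and came back an hour later don't hide what the typical first session looks like.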

This audit is uncomfortable because the steps you’d cut are usually the ones the team is proudest of. The carefully designed personalization quiz and the configuration screen with the sensible defaults, not to mention the tasteful product tour. They look like craftsmanship. But they’re sitting between your user and their first productive moment.

The onboarding flows that win their categories tend to look slightly embarrassing to the teams that built them. There aren’t enough screens to point at when someone asks what the design team has been working on. Even the features that you could have sworn were previously A/B-tested might get cut, the activation rate goes up, and nobody is quite sure how to explain it.

The hard part is what you do on Monday, when the team you’ve worked with for three years pulls up the activation dashboard, looks at the flow they built together, and starts naming the screens they might have to cut. A good onboarding is a hallway, and you walk through it without noticing. A bad onboarding is a hotel lobby with someone in a vest insisting you sign the guest book before they’ll let you upstairs.

The tedious middle is where the next AI wave lives

Tue, 19 May 2026 00:00:00 GMT

If US bookkeeping services were a country, its GDP would be #88 in the world, ahead of Uganda, behind Azerbaijan.

The job falls in a category called the tedious middle: “too high-stakes to hand to an amateur and too tedious for a specialist to enjoy.” These are the most underloved jobs in the economy. They require a real qualification to enter, but then they bury you in repetitive work. A nurse doing eight hours of discharge calls has to remember which medications can be mixed and which can’t. A junior associate redlining a 200-page contract has to spot the adverse change clause hiding in section 9. A bookkeeper has to know which sales tax line is which, or someone will have to refile in March.

The first wave of venture-scale AI targeted the obvious bottom of the labor pyramid: customer service, copywriting, personal assistants. The second wave is harder to see because it’s going inside jobs that you need a credential to start.

What the middle looks like for AI

The category has three properties.

It requires real expertise. You can’t hand bookkeeping to a random person off the street and trust the output. Same goes for nursing, medical coding, radiology triage, or legal due diligence. There’s training behind each of them, usually a credential, and real risk when the work is wrong.

It is, for the practitioner, not mentally stimulating. The associate scouring through contracts for one clause is not doing what law school promised them, and the bookkeeper categorizing the 500th Stripe payout is not having a good time. On a typical Tuesday, neither of them is using the skills that got them qualified for their job.

And the work has to be correct, because so much downstream depends on it. A medical coding mistake propagates into a denied claim and a 90-day rework. Too high-stakes to hand to an amateur and too tedious for a specialist to enjoy.

These examples are less about jobs being lost to AI and more about where the next venture-scale opportunities live in 2026. It’s different from the bottom of the skill ladder, which has been getting absorbed for a couple of years already. Customer service is the cleanest case, with the likes of Sierra and Intercom Fin building real businesses there. Autonomous driving is on its own slower clock for different reasons. Waymo is real, but it’s also nearly a decade in and still city-by-city. In contrast, the specialist top, like brain surgery and the genuinely engaging end of expert work, sits on the opposite side: the role is challenging and the specialist actually wants the work. The middle is where the next wave of accuracy-grade but tedious workflows is opening up, and it’s still under-built relative to the unit economics.

Four proofs at scale

Pilot is the cleanest proof for bookkeeping. Founded 2017, last priced at roughly $1.2 billion in 2021 in an extension led by Whale Rock and Bezos Expeditions on top of Sequoia’s earlier lead, and meaningfully more AI-shaped today than at that round. Bench Accounting shutting down at the end of 2024 is proof from the other direction: the human version of SMB bookkeeping had broken unit economics before AI was even meaningfully pitted against it. The demand was real, but the cost structure didn’t really work. Other increasingly commoditized services have this in common. Now Intuit Assist is rolling out autonomous bookkeeping agents inside QuickBooks, and Intuit itself reported 68% of US small businesses were using AI in 2025.

Harvey is the proof for legal. $11 billion valuation in March 2026 on a $200M raise co-led by GIC and Sequoia. ARR went from about $50M at the end of 2024 to $100M in August 2025 to $190M in January 2026. Used by 140K+ lawyers across 1,500 organizations. Legal due diligence is a sibling tedious-middle workflow: expertise-gated and accuracy-grade, but mind-numbing for the associate doing it.

I’ve written about Hippocratic AI recently as a nursing-adjacent example. A $3.5 billion valuation after a $126M Series C in November 2025. Their product is a voice agent that does the bounded inpatient workflows that nurses describe as the most draining part of the job: pre-op prep calls, discharge instructions, chronic-disease follow-ups. Note that they’re not selling a robotic nurse. The bedside is untouched. What they focus on is the part that wears the practitioner down and demands accuracy without demanding presence. (Moxi, Diligent Robotics’ physical hospital robot, got pulled from MultiCare Health System in 2025 because nurses found it annoying and unhelpful.)

Aidoc is the proof for radiology. 31 FDA-cleared authorizations by early 2026, including a recent foundation-model clearance covering 14 acute indications on abdominal CT triage. The FDA cleared 1,104 radiology AI devices through the end of 2025, roughly 76% of all AI-enabled medical authorizations. Radiologists are still the ones making the diagnoses; the triage aspect is what Aidoc takes off their plate.

“This has been about to happen for 30 years”

QuickBooks shipped in 1992. The “AI will automate bookkeeping” conversation is older than my career. The reasonable skeptic asks why now is different from then. The answer is that LLM agents can finally handle the long tail of edge cases that rule-based automation couldn’t get quite right, like the weird vendor names and the ambiguous spend amounts. Intuit’s own 2025 data shows 68% of US small businesses using AI. Intuit shipped autonomous bookkeeping agents inside QuickBooks during the past 12 months. 70% of US health systems plan to expand AI medical-coding automation in 2026. Harvey went from $50M to $190M ARR in 13 months. The proof is in the pudding.

The other reasonable objection is that the tightest accuracy-gated work is the most regulated. You can’t ship a clinical workflow the same way you would with a Chrome extension. That’s true. But it’s also the moat. The 1,104 radiology AI clearances are proof that the regulatory path exists and that companies who figure it out first get to compound while competitors are still drafting their pre-submission.

What this means for the founder picking a target

If you are picking what to build in 2026 and your ambition is venture-scale, run your candidate workflow through the three conditions before anything else: does it require real expertise, is it unrewarding for the practitioner, and is the accuracy floor high enough that mistakes cascade downstream. If all three hold, it’s a painkiller idea with a reasonably strong moat. If any one is missing, it might still work, but the customer’s willingness to pay or your protection from competitors may not be as robust.

In all these successful examples, the company picked a single workflow inside a credentialed role and replaced just that one component. I think there are plenty of similar opportunities still out there. Compliance reconciliation, audit prep, supply-chain documentation, prior authorization, fraud-investigation casework. Each of them is somebody’s idea of the worst part of their job. If you hear someone say “I wish a computer would just do this part for me,” you’re most likely listening to a TAM that hasn’t been priced yet.

AI products will live or die based on this rule

Mon, 18 May 2026 00:00:00 GMT

A hit rate for an AI product is the percentage of its outputs that are good enough to use as-is. If you’re building something, what hit rate does your product need before people actually trust it? Two apps that I’m watching this month have wildly different answers, even though they’re running on the same underlying LLMs.

The first is ReproKit, an AI “bug-catcher” that takes reports from users of your app, along with their console logs, to find real production issues. It would be a great product even at a 50% hit rate, because a senior developer can quickly eyeball the top 10 bugs and decide which are real issues we need to work on.

The second is Tivi, a bookkeeping tool that categorizes your company’s transactions and creates monthly overview accounts. It would be hard to fully trust it at 75%, let alone 50%.

The same AI models, two completely different trust thresholds required. What determines the bar is the work that I have to do as a user to verify whether the AI got things right.

What 50/50 actually means

Let’s take the bug-catcher. It ingests a few hundred thousand console logs a week, groups them into “issues,” and surfaces the ten most worrying clusters on a dashboard. Only half of those clusters might be real bugs. The other half might be noise: a ResizeObserver warning the browser threw during a scroll, a deprecated API call nobody cares about, an error from a third-party widget that’s been broken for a year.

A developer can scan the list quickly, dismiss the obvious noise, open the few that look real, and get back to work. The cost of filtering one row is two clicks and ten seconds of scanning. Twenty rows later, you’ve spent a little over three minutes and eliminated a week’s worth of effort chasing down issues from users. The product is already worth paying for at this level. To be honest I’d even pay for it at 30/70. AI does the aggregation, we can do the triage quickly, and the unit economics hold because we save lots of time.

Now take the bookkeeping tool. It pulls data, categorizes VAT, reconciles receipts, and generates an overview. If half of those line items are wrong, I have to read every line to find the bad ones, and by the time I’ve fixed the mistakes I may as well have paid a human bookkeeper. At 50/50, the user has done the work twice, once to verify, once to fix. The rational move is to drop the AI.
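The arithmetic behind the two cases can be sketched in a few lines. The ten seconds per row and the twenty rows come from the bug-catcher example above; every bookkeeping number (line count, seconds to check a line, seconds to fix one, seconds to do a line by hand) is an assumption picked for illustration.

```python
def review_minutes(items: int, check_s: float, fix_s: float, hit_rate: float) -> float:
    """Minutes a user spends checking every output and fixing the bad ones."""
    bad = items * (1 - hit_rate)
    return (items * check_s + bad * fix_s) / 60


# Bug-catcher: 20 rows at ~10 seconds each; a bad row is dismissed, not fixed.
bug_catcher = review_minutes(items=20, check_s=10, fix_s=0, hit_rate=0.5)

# Bookkeeping (assumed numbers): 300 line items, 20 s to verify, 60 s to fix.
bookkeeping = review_minutes(items=300, check_s=20, fix_s=60, hit_rate=0.5)

# Assumed cost of categorizing the same 300 lines by hand at 45 s each.
manual = 300 * 45 / 60

print(f"bug-catcher review: {bug_catcher:.1f} min")
print(f"bookkeeping review: {bookkeeping:.1f} min vs {manual:.1f} min by hand")
```

With these assumed numbers, reviewing the bookkeeping output at a 50% hit rate takes longer than doing the books by hand, while the bug-catcher review stays at a few minutes. The same hit rate lands on opposite sides of the line purely because of the per-item cost of checking and fixing.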

The visible spread in shipped AI products

Looking at the numbers from companies that ship AI at scale, the gap is enormous.

GitHub Copilot’s average suggestion acceptance rate sits between 30% and 65% across different studies, and it’s a great product. Roughly 4-7 suggestions out of 10 aren’t what the developer wanted; they ignore those, and they hit Tab on the 3-6 that fit. The filtering cost is one keystroke. Cursor’s custom autocomplete model claims a 71% accept rate on roughly the same task surface, also a great product, also operating under a “one keystroke to filter” economy. Both are good, and sure the gap makes a difference, but neither breaks.

Decagon, the AI customer-support company valued at $4.5 billion as of March 2026, prices on outcomes: they only get paid when an interaction resolves without a human. Reported resolution rates fall in the 70–90% range across travel, fintech, and DTC retail. The number has to be that high because an end customer with a billing question has zero patience for a wrong answer. A bad response gets escalated to a human to take over, and Decagon has to eat the cost. The model could be the same one any other vendor is using; the pricing structure forces the hit-rate problem onto the company building the product.

Then there’s Hippocratic AI, which builds AI agents that make pre-op and post-discharge nursing calls. Their published constellation of clinical benchmarks puts their patient-safety failure rate below the rate of human nurses on the same calls. The bar is brutal. A clinical AI that gives a 5% rate of unsafe medication advice doesn’t ship at all, because the regulatory consequences and the patient consequences are both catastrophic. Hippocratic’s real competition is the safety record of the median human clinician, and the product has to be at least that good before the conversation about adoption can start.

Why this stays the bottleneck

The standard counter is that models will improve and the floor will lift. That’s true, and on a long enough horizon it does close the gap. If the underlying models get to genuine human-level reliability across the board, the bookkeeping tool will ship at 99% and everyone will trust it. At that point the filtering cost framing stops mattering. The problem is that “long enough horizon” doesn’t help the people building today, and if you sit around waiting for that then you’ve already lost. Model improvements will raise everyone’s hit rate roughly together: if Copilot moves from 30% to 50%, the bookkeeping tool might move from 70% to 85%, still not quite the 95% it needs to be truly useful. The gap closes eventually. But it closes more slowly than the models improve, because in high-trust workflows, “better” and “trusted” are not the same thing.

The other main counter is “put a human in the loop.” Fine, sometimes. The honest objection is that human-in-the-loop is what you do when the model isn’t good enough on its own; if you have to do it for every output, what you’ve built is a workflow tool with AI in it, not an AI product. The interesting version is when the product itself decides which outputs need a human look. You route the user’s attention specifically to the margin cases where the model’s confidence is low, or where the consequence of being wrong is high. That’s a product and UX effort more than a model decision, and it’s where a lot of the leverage currently is.
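A minimal sketch of that routing idea, assuming the model (or a separate scorer) exposes a confidence value per output and that you can attach a rough stakes figure to each item. The field names and both thresholds are hypothetical and would need tuning against your own evals.

```python
from dataclasses import dataclass


@dataclass
class Output:
    label: str
    confidence: float   # assumed 0..1 score from the model or a separate scorer
    amount: float       # stand-in for "consequence of being wrong"


def needs_human(o: Output, min_confidence: float = 0.9, high_stakes: float = 1000.0) -> bool:
    """Route to a person when confidence is low or the stakes are high."""
    return o.confidence < min_confidence or o.amount >= high_stakes


outputs = [
    Output("Office supplies", 0.98, 42.0),
    Output("Software subscription", 0.71, 120.0),   # low confidence
    Output("Contractor payment", 0.95, 8500.0),     # high stakes
]

review_queue = [o.label for o in outputs if needs_human(o)]
auto_accepted = [o.label for o in outputs if not needs_human(o)]
print("review:", review_queue)
print("auto:", auto_accepted)
```

The user only sees the review queue. Everything else is accepted silently, which is what keeps the per-output verification cost from applying to every row.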

How to check which bucket you’re in

A useful test, in order:

What does it cost your user, in seconds, to verify or fix one output?
What happens to their trust in the product when they hit a bad one? Do they shrug, or do they stop using it?
Is your distribution of bad outputs boring (a few wrong rows on a list) or catastrophic (wrong advice to a patient, or mistakes on a VAT return)?
Could you charge on outcomes, like Decagon, or only on usage, like Copilot? Outcome pricing forces you to solve the hit-rate problem. Usage pricing lets you survive a lower hit rate.

If filtering is cheap and bad outputs are acceptable, you can ship at 50/50 and your product gets better as the floor rises. If filtering is expensive or bad outputs are catastrophic, your product needs to be somewhere between 90% and 99%, and the work between now and then is mostly evals, scaffolding, and routing.
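The checklist collapses into a rough decision rule. This is a sketch under stated assumptions: the 15-second cutoff for "cheap" filtering is invented for illustration, and the target ranges are the ones named above.

```python
def required_hit_rate(verify_seconds: float, catastrophic: bool, cheap_cutoff: float = 15.0) -> str:
    """Map filtering cost and failure severity to a rough target hit rate."""
    if catastrophic or verify_seconds > cheap_cutoff:
        return "90-99%: invest in evals, scaffolding, and routing first"
    return "~50%: ship now and improve as the model floor rises"


# A glance-to-filter product with boring failures (bug-catcher-like).
print(required_hit_rate(verify_seconds=10, catastrophic=False))

# A read-every-line product where a bad output is costly (bookkeeping-like).
print(required_hit_rate(verify_seconds=20, catastrophic=True))
```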

My own version of this

I’ve been generating articles with an AI content pipeline this month and half of them come out genuinely good, with only a few edits needed. The other half are bad in ways that aren’t even interesting. Wrong angle, boring opening, off-voice, often all three. As the person building the pipeline, I’m willing to read every output and fully rewrite (or discard) the bad ones; that’s my job. As a user of someone else’s tool, I’d quit at the second bad article and never come back, because fully rewriting a bad piece defeats the purpose.

My teammate building the bug-catcher can hopefully ship something useful at a 50% hit rate by the end of this month. The teammate building the bookkeeping tool faces a higher trust bar, because of the nature of his product. His user pays a higher time cost for filtering mistakes. So the product cannot rely on gradual accuracy gains alone. It also has to reliably route attention to the edge cases, with evals on categorization, scaffolding around likely errors, and a UI that brings low-confidence rows to the surface.

So what does your user pay, in time and effort, to look at one output and decide whether to keep it? If the cost is a glance, you can ship early and improve as the model floor rises. If they have to read the whole thing, aim for 90%+ to make it feel useful. The work is figuring out where that bar sits, then designing around it. Good UX involves sparing the user from having to check everything. The AI builders who understand this in 2026 will look ahead of the curve in 2028. The ones who don’t will still be waiting for the model to be ready.

Conviction over consensus

Thu, 07 May 2026 00:00:00 GMT

Two startup ideas walk into a room of 50 founders.

The first one helps travelers share curated Google Maps lists across cities. The room shrugs. It feels like a vitamin, not a painkiller. Existing apps already do most of this. No obvious way to monetize.

The second one lets strangers sleep in each other’s homes for money. The room laughs. Nobody will trust it. Safety will be impossible and regulation will crush it. The platform doesn’t exist.

Both pitches get fifty rejections. One is Airbnb. The other is an app nobody needs. The hard question is what separates them, and how you can tell which one you’ve got before you spend two years working on it.

Most founders interpret consensus rejection as a no. Friends say no, the early VCs say no, the Reddit community says no, and at some point the founder concludes the idea must be bad and shelves it. After five years of running a startup, I think the truth is more nuanced. When consensus rejects an idea for a specific reason, and that reason turns out to be wrong, the rejection is only a weak negative signal. Sometimes it’s a positive signal. Consensus ideas are usually crowded. Misunderstood ideas are where opportunities are hiding.

What Peter Thiel was actually asking

The canonical version of this is from Zero to One:

What important truth do very few people agree with you on?

It’s a question founders quote constantly and answer rarely. Thiel’s framing is precise. A secret is something that’s both important and unknown, rather than just edgy for the sake of it. The structure of a real answer is “most people believe X, but the truth is the opposite of X.” Getting knee-jerk disagreement from people doesn’t quite clear the bar; “everyone disagrees with me” alone isn’t enough. The idea is “everyone disagrees with me, and I have a specific articulable reason to think they’re missing something.”

Skeptics and cynics say no, and they’re right most of the time, but the non-consensus founders who succeed manage to clear both halves of the statement: they’re holding an unpopular position and they’ve identified the specific assumption everyone else is unconsciously making.

The Airbnb case

The most-cited example is Airbnb’s seed round. In June 2008, Brian Chesky’s team got introduced to seven prominent VCs. Five rejected them. Two didn’t reply. They were trying to raise $150,000 at a $1.5 million valuation.

Chesky later published the rejection emails. He doesn’t list the VCs’ reasons, but the objections a 2008 VC would raise are easy to reconstruct: market size concerns, trust dynamics, timing relative to the mobile platform. These weren’t dumb reasons. They were the reasons a smart, careful investor would give in 2008.

Chesky was betting on a different theory of trust. As Airbnb iterated over the next two years, it built the signals (review systems, professional photographs, social proof) that would ultimately change how strangers interpreted each other. I don’t need to say that today Airbnb is publicly traded with a market cap of roughly $82 billion in May 2026, around $12 billion in annual revenue, and 8 million listings.

The VCs understood the 2008 trust dynamics correctly. They just didn’t predict that those dynamics could be re-engineered so effectively.

Why this is hard to act on

Most founders don’t act on the inversion despite hearing the advice constantly. Thiel’s question is a startup cliché at this point. The hard part is that holding an unpopular position has a real psychological cost when you turn out to be wrong, and most people don’t want to pay it.

Consensus is comfortable. If you follow it and fail, you get to share blame with the consensus. If you follow it and succeed, you get the reward without ever looking dumb. The only painful outcome is “I was the one who said the consensus was wrong, and the consensus was right.” You were not merely wrong. You were arrogant. Most founders, imagining that judgment in advance, subconsciously choose to avoid it.

So the advice spreads widely and is followed narrowly. Which is itself a form of non-consensus opportunity.

The team-veto problem

Where this becomes operational is in how teams filter ideas. Inside a company, the same consensus-as-filter dynamic plays out, often in a brutal form.

The founder pitches an idea in standup. Two engineers raise concerns. The PM nods along. The idea gets categorized as “interesting but not now,” which is code for “death by consensus.” But no one actually voted against the idea, and no one tested the hypothesis.

The team is usually right, though, and on most days they are saving you from yourself. The median pitched idea isn’t very good.

But the unspoken veto is dangerous. It kills the good non-consensus ideas at the same rate it kills the bad ones, and it does this silently, before the idea ever gets tested against reality. Whatever survives the veto is by definition the consensus-acceptable subset. And that’s a crowded subset.

The fix isn’t to suppress disagreement. Suppressing disagreement makes the team worse, not better. We just need to decouple disagreement from authority. Anyone can say an idea is bad; nobody has to listen.

What we’re trying

We’re running a month-long hackathon at our company, where everyone gets to work on any idea they have strong conviction in. When deciding whether feedback from other team members should be considered, we landed on a framework of “conviction over consensus.” The instruction is: build the thing your teammates rolled their eyes at, as long as you feel strongly about it. The feedback rule took a bit of discussion though: it’s not that you can’t say someone’s idea sucks. You should absolutely still say it. They just don’t have to listen to you.

Most teams let consensus filter ideas through veto, mostly tacit. Most companies do the same at every level. Removing the veto without removing the disagreement is the move I want to see if we can pull off.

I don’t know yet which of our ideas will pass the second test, non-consensus AND right. I’d guess most won’t. But if any of them do, my bet’s on an idea that a normal team standup would have killed on day one.

I’ll share what we find.

The billion-dollar ideas in your browser tabs

Mon, 04 May 2026 00:00:00 GMT

There’s a website called Linktree. It lets you put more than one link in your Instagram bio. That’s the entire product, worth $1.3 billion.

Grammarly checks your spelling. Valued at a measly $13 billion with over $700 million in annual revenue.

Iubenda is a tool that generates privacy policies for websites. Just a privacy policy. $24 million in revenue, 160 employees, acquired by team.blue in 2022.

The “soundwave on top of album art” format you might have seen a thousand times on X/Twitter? That’s a tool called Wavve. A solo founder built it, ran it for a few years, and sold it for an indie millionaire exit while it was throwing off $100K a month in profit.

Meanwhile, you and I are spending our time thinking about whether to fine-tune a model, build an agentic workflow, or move to SF for the AI summer.

There’s a name for this: schlep blindness.

Paul Graham wrote this 14 years ago

The term comes from a Paul Graham essay published in January 2012. “Schlep” is Yiddish for “a tedious, unpleasant task.” PG’s claim is painful:

Your unconscious won’t even let you see ideas that involve painful schleps. That’s schlep blindness.

The mechanism is not deliberate, and you’re not consciously rejecting hard ideas. Your brain is filtering them out before you even consider them, the same way it filters out the hum of an air conditioner you stopped noticing an hour ago.

PG’s main example was Stripe. Online payments were a $20 trillion problem in 2010. The technical solution wasn’t novel. The reason nobody had built it was that it required dealing with banks, fraud, regulations, chargebacks, and the headaches of integrating with COBOL-era banking infrastructure. So nobody did. Until two brothers from Ireland decided the schlep was worth it. Now Stripe is worth more than $50 billion.

The essay is 14 years old, and it is still read by every founder who takes Y Combinator’s office hours seriously.

A $29 billion company built on this one essay

In 2016, Alexandr Wang dropped out of MIT to start Scale AI with a simple pitch. AI models need huge volumes of high-quality labeled data, and labeling data is mind-numbing manual work that no AI researcher wants to do. Scale AI would do it.

Wang has been explicit about where the idea came from:

One of the secrets to Scale AI — and I think this applies to almost every industry — was that the problem we were solving of building really high quality data sets was something that most machine learning teams knew was important but wasn’t necessarily the sexiest problem that every AI scientist wanted to work on.

The company is now valued at $29 billion, powering training data for OpenAI, Meta, and the U.S. military.

PG wrote about it in 2012. Wang acted on it for AI in 2016. By 2026, schlep blindness should be a solved problem. Founders should have read the essay, internalized the lesson, and now be elbows-deep in unsexy work.

But we aren’t. The 2025–2026 cohort of founders is overwhelmingly chasing AI moonshots: personalized medicine, electronics in space, inference chips for AI workflows, AGI bets. All of those are worthwhile efforts. Meanwhile, every existing schlep just got cheaper to solve, because the tedious parts (data entry, boilerplate, manual review) are exactly what AI is good at.

What 2026 schlep blindness looks like

The frontier of schlep blindness in 2026 is old schleps that AI just made tractable.

Backlinks for SEO are a schlep. One well-known tactic involves finding broken links across the web, identifying their original content, recreating that content on a client’s blog, and writing personalized outreach to dozens of webmasters per backlink. Done manually, it’s $2-5k a month for a handful of links. AI can do most of the steps now, but nobody’s built the AI-native version yet.

Privacy policy compliance is a schlep. Updating one when laws change across 30 jurisdictions is a schlep on top of a schlep. AI can handle most of it. There’s room for an Iubenda 2.0 that’s 5x cheaper and also automatically updates for you when you add a new tool, because AI does the work.

Cold email is a schlep. The new 2026 tactic is flipping the script: do the hard part first and build a personalized deliverable for the prospect (a draft blog, a custom report), then send a single email pointing to it. You still need lots of volume, but conversions are higher for high-quality deliverables that are already made. The schlep is producing the deliverables at scale. AI can eat that schlep for breakfast before you drink your oat latte.

The cocktail-party version of “AI is changing everything” usually points at the moonshots. But what the average founder should be excited about is that thousands of boring problems just became economically viable to solve.

How to spot your own schlep blindness

If schlep blindness is unconscious, you can’t introspect your way out of it. But you can use other people’s annoyance as a proxy. Some questions:

What’s the most-skipped task in your workflow? Look beyond the things you’d describe as “I hate this” if asked. What do you do regularly that you avoid thinking about at all?
What problem do you assume someone else is solving, but can’t name a product for? Obvious problem, no obvious product. Schlep city.
What’s been broken for 10+ years that nobody seems to fix? If something’s been broken that long, you know there’s something there.
What’s gated by “this is annoying” versus “this is technically hard”? The annoyance bucket is hugely under-supplied.
Which of your daily complaints could a 50-person company solve, if they were willing to do the unglamorous work? That’s a schlep waiting for the right person.

The honest test, though, is the gut one. Look at the products on this list (Linktree, Wavve, Iubenda, Grammarly, Scale AI, Stripe) and ask: would you have built any of these? In your top-of-mind list of startup ideas right now, are any of them close in shape to “more than one link in my Instagram bio”?

If the answer is no, it’s probably because your unconscious is filtering them out.

The bet

To be fair, several opportunities in 2026 really are downstream of big AI breakthroughs and moonshots. But many more are in the schleps you’ve been walking past for years, with new economics because AI took the worst part of the work off the table.

PG saw the pattern in 2012, Wang acted on it in 2016, and many more have since. The list of companies in this post is what happens when people take it seriously. Plenty of schleps are still out there.
