Issue #1: The Model Is Not the Product
Why the builders obsessing over benchmarks are building the wrong thing — and what the ones quietly winning have in common.
The model is not the product
There's a pattern I keep seeing in data and AI teams: someone reads a benchmark, gets excited, swaps the underlying model, and calls it a product decision. It isn't. It's a procurement decision dressed up as strategy.
The model is a component. A good one, increasingly a commodity. The product is everything else.
What "everything else" actually means
The builders who are quietly winning in this space aren't the ones who got to GPT-4 first. They're the ones who figured out:
- Where the loop breaks. Every AI product has a failure mode that doesn't show up in demos. The winners instrument it early and fix it obsessively.
- What the user actually does next. "The model gave a good answer" is not a product outcome. What did the user do with it? Did the workflow improve? Did they come back?
- Which data is defensible. Fine-tuning on your own data, building feedback loops, accumulating signal that competitors can't easily replicate — that's the moat. The model weights are not.
"The infrastructure layer of AI is compressing fast. The application layer is just starting to get interesting."
The benchmark trap
Benchmarks measure models on standardized tests. Your users are not standardized. They have specific domains, specific failure tolerances, specific workflows. A model that scores 3 points higher on MMLU might be strictly worse for your use case.
The teams falling into the benchmark trap share a tell: they optimize for what's easy to measure instead of what matters. Don't.
What to watch this week
Structured outputs are getting serious. JSON mode was a hack; constrained decoding done right changes what's possible for reliable pipelines. If you're building anything with document extraction or API orchestration, this is worth your attention.
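If "constrained decoding" sounds abstract, here is a toy sketch of the idea in Python. Everything in it is made up for illustration: the nine-token vocabulary, the `fake_logits` function standing in for a model, and the tiny state machine that only accepts `{"ok":true}` or `{"ok":false}`. Production systems apply the same masking step against a full schema or grammar, but the principle is the one shown here: the model never gets to pick a token the grammar forbids.

```python
import json

# Hypothetical toy vocabulary; a real tokenizer has tens of thousands of tokens.
VOCAB = ['{', '}', '"ok"', ':', 'true', 'false', 'Sure!', ' here', ' is']

# Hypothetical grammar as a state machine: state -> {allowed token: next state}.
# It accepts exactly {"ok":true} or {"ok":false}.
GRAMMAR = {
    "start": {'{': "key"},
    "key": {'"ok"': "colon"},
    "colon": {':': "value"},
    "value": {'true': "close", 'false': "close"},
    "close": {'}': "done"},
}

def fake_logits(prefix):
    """Stand-in for a model: it prefers a chatty token over valid JSON."""
    scores = {tok: 0.0 for tok in VOCAB}
    scores['Sure!'] = 5.0  # what an unconstrained decoder would pick first
    scores['true'] = 2.0
    scores['false'] = 1.0
    return scores

def constrained_decode():
    state, out = "start", []
    while state != "done":
        allowed = GRAMMAR[state]
        scores = fake_logits(out)
        # The mask: only tokens the grammar allows in this state are candidates.
        token = max(allowed, key=lambda t: scores[t])
        out.append(token)
        state = allowed[token]
    return "".join(out)

text = constrained_decode()
parsed = json.loads(text)  # every path through GRAMMAR is valid JSON
```

The point of the contrast with JSON mode: validity here is a property of the decoding loop, not something you check afterwards and retry on.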
The context window wars are a distraction. 1M tokens sounds impressive. Most production systems use fewer than 10K tokens per request. The real question is what you do with the context you have: retrieval quality, chunking strategy, reranking. That's where performance lives.
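For a feel of what "chunking strategy" and "reranking" mean in practice, here is a minimal sketch using only the standard library. The function names, the window sizes, and the sample text are all assumptions for illustration, and the scorer is a deliberately crude word-overlap count. A real system would more likely use embeddings or a trained reranker, but the pipeline shape is the same: split, score against the query, keep only the best few chunks.

```python
import re

def chunk(text, size=40, overlap=10):
    """Split text into windows of `size` words, each overlapping the last by `overlap`."""
    words = text.split()
    step = size - overlap
    stop = max(len(words) - overlap, 1)
    return [" ".join(words[i:i + size]) for i in range(0, stop, step)]

def tokens(s):
    """Lowercased set of alphanumeric terms."""
    return set(re.findall(r"[a-z0-9]+", s.lower()))

def rerank(query, chunks, top_k=2):
    """Order chunks by the fraction of query terms they contain (crude lexical scorer)."""
    q = tokens(query)
    denom = len(q) or 1  # avoid dividing by zero on an empty query
    scored = [(len(q & tokens(c)) / denom, c) for c in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)  # stable: ties keep document order
    return [c for _, c in scored[:top_k]]

# Invented sample document, repeated to give the chunker something to split.
doc = ("Refunds are processed within five business days. " * 3
       + "Shipping to Canada takes two weeks. " * 3
       + "Warranty claims require the original receipt. " * 3)

chunks = chunk(doc, size=12, overlap=4)
context = rerank("how long do refunds take", chunks, top_k=1)
```

Only `context` goes into the prompt. Changing `size`, `overlap`, or the scorer is usually a cheaper experiment than changing the model.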
Evals are the new tests. If your team doesn't have a rigorous eval harness, you're shipping on vibes. Not a judgment — most teams don't. But the ones that do are compounding faster than you'd expect.
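A harness does not have to be elaborate to beat vibes. The sketch below is one minimal shape for it, not a reference implementation: `Case`, `run_evals`, and `stub_model` are hypothetical names, the prompts are invented, and the stub stands in for whatever model call you actually make. Each case pairs a prompt with a check that encodes what "acceptable" means for your domain.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Case:
    prompt: str
    check: Callable[[str], bool]  # returns True if the output is acceptable

def run_evals(model: Callable[[str], str], cases: list) -> dict:
    """Run every case through the model and collect the ones that fail."""
    if not cases:
        raise ValueError("need at least one eval case")
    failures = []
    for case in cases:
        output = model(case.prompt)
        if not case.check(output):
            failures.append((case.prompt, output))
    passed = len(cases) - len(failures)
    return {"pass_rate": passed / len(cases), "failures": failures}

def stub_model(prompt: str) -> str:
    """Hypothetical stand-in for a real model call."""
    return "Total due: $42.00" if "invoice" in prompt else "I am not sure."

CASES = [
    Case("Extract the total from this invoice: widgets, $42.00",
         lambda out: "$42.00" in out),
    Case("Summarize this invoice in one line: widgets, $42.00",
         lambda out: len(out) < 80),
    Case("What is the refund policy?",
         lambda out: "refund" in out.lower()),
]

report = run_evals(stub_model, CASES)
```

Run the same `CASES` against every candidate model and every prompt change. The list of failures is the useful part; the pass rate is just the summary.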
One thing worth doing differently
Before your next model swap, write down the metric that would tell you the swap was worth it. A user-facing metric. Not perplexity, not ROUGE score — something that connects to whether the product is better.
If you can't write that metric down, you're not ready to make the swap.
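One way to hold yourself to that is to write the criterion as code before you look at any results. This is a sketch under stated assumptions: the metric name, the thresholds, and the session counts are invented for illustration, and the comparison is a plain difference in rates with no significance test, which a real rollout would probably want.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SwapCriterion:
    metric: str        # user-facing, e.g. share of tasks completed without an edit
    min_lift: float    # absolute improvement that would justify the swap
    min_sessions: int  # do not decide on less data than this per model

def decide(criterion, current, candidate):
    """`current` and `candidate` are (successes, sessions) tuples, one per model."""
    (cur_ok, cur_n), (new_ok, new_n) = current, candidate
    if min(cur_n, new_n) < criterion.min_sessions:
        return "keep collecting data"
    lift = new_ok / new_n - cur_ok / cur_n
    return "swap" if lift >= criterion.min_lift else "do not swap"

# Written down BEFORE the experiment runs; all numbers here are made up.
criterion = SwapCriterion("tasks_completed_without_edit",
                          min_lift=0.03, min_sessions=500)

decision = decide(criterion, current=(610, 1000), candidate=(655, 1000))
too_early = decide(criterion, current=(61, 100), candidate=(70, 100))
```

The frozen dataclass is the design choice that matters: once the threshold is committed, nobody gets to move it after seeing the numbers.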
Folio ships weekly. Forward this to someone building something.