The AI industry runs on benchmarks. MMLU, HumanEval, GPQA — each promises to measure something real about model capability. Engineering teams use these numbers to decide which model to deploy. Product managers use them to set expectations. Investors use them to compare startups.
The problem: benchmark performance does not reliably predict production performance.
What Benchmarks Actually Measure
Benchmarks test a model's ability on curated datasets under specific conditions. HumanEval measures code completion on LeetCode-style problems. MMLU tests knowledge retrieval across 57 subjects. Each benchmark defines narrow success criteria and holds the test conditions constant.
Production environments do not hold anything constant. Users submit malformed inputs. Edge cases arrive in unpredictable sequences. The same question gets asked thirty different ways. A model that scores 90% on a benchmark might drop to 60% when the input distribution shifts even slightly.
The Benchmark Gaming Problem
When incentives are misaligned, benchmarks get gamed. Labs optimize specifically for benchmark datasets. This works — until the benchmark leakage becomes obvious and the scores lose credibility. We have seen this play out repeatedly: models that ranked high on coding benchmarks produced unusable code in production.
The deeper issue is that benchmarks only capture what is easy to measure. Creativity, edge case handling, and real-world judgment do not translate cleanly into standardized tests.
What Production Teams Actually Need
Teams deploying AI in production care about three things: latency, accuracy, and failure behavior. Latency affects user experience directly. Accuracy determines whether the output gets used. Failure behavior decides how the system degrades under stress.
Benchmarks rarely address all three simultaneously. A model that is fast might sacrifice accuracy. A model that is accurate might fail in ways that are hard to detect. The trade-off space is complex, and single-number benchmarks cannot capture it.
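Those three properties can be measured together in one pass. Here is a minimal sketch, assuming a hypothetical `model_call` client and a labelled `cases` list standing in for your own test set:

```python
import statistics
import time

def evaluate(model_call, cases):
    """Run a model over labelled (prompt, expected) cases and report
    latency, accuracy, and failure behavior in a single pass.
    `model_call` and `cases` are placeholders for your own client
    and evaluation set."""
    latencies, correct, failures = [], 0, 0
    for prompt, expected in cases:
        start = time.perf_counter()
        try:
            answer = model_call(prompt)
        except Exception:
            failures += 1  # failure behavior: count degraded requests
            continue
        latencies.append(time.perf_counter() - start)
        correct += (answer == expected)

    latencies.sort()
    return {
        "p50_latency_s": statistics.median(latencies) if latencies else None,
        "p95_latency_s": latencies[int(0.95 * (len(latencies) - 1))] if latencies else None,
        "accuracy": correct / len(cases),
        "failure_rate": failures / len(cases),
    }
```

A single dictionary of numbers like this is already more informative than a benchmark score, because the latency, accuracy, and failure figures come from the same run over the same inputs.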
Building Better Evaluation Locally
The practical alternative: evaluate on your own data, under your own conditions. Sample real queries from production. Test against the specific task you need the model to perform. Measure latency, error rates, and user satisfaction.
This approach requires more effort than citing a benchmark. It also produces more useful results. Teams that do this consistently make better deployment decisions than teams that rely on published benchmarks alone.
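The first step above, sampling real queries, can be as simple as a reproducible random draw from your production logs. A sketch, assuming a hypothetical log file with one JSON record per line and a `"query"` field (adjust to your own schema):

```python
import json
import random

def sample_eval_set(log_path, n=200, seed=7):
    """Draw a reproducible random sample of real production queries
    to use as a local evaluation set. Assumes one JSON record per
    line with a "query" field -- both the path and the field name
    are placeholders for your own logging setup."""
    with open(log_path) as f:
        records = [json.loads(line) for line in f if line.strip()]
    random.seed(seed)  # fixed seed so the eval set is stable across runs
    sample = random.sample(records, min(n, len(records)))
    return [r["query"] for r in sample]
```

Freezing the seed matters: a stable evaluation set lets you compare models against each other over time, rather than against a moving target.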
The Honest Framework
If you are evaluating AI systems for production use, treat benchmark scores as one data point among many. Run your own evaluation. Test for your specific use case. Measure what actually matters to your users.
The question is not whether a model is good — it is whether it solves your problem at acceptable cost and risk. Benchmarks cannot answer that. Only your own evaluation can.
How are you evaluating AI systems for your specific use case? Are benchmarks giving you false confidence?