The Token Accounting Problem: Why AI Projects Return Less Than Expected

Most teams running AI projects today can tell you one number: their monthly token spend. Few can tell you the actual return on that spend. The gap between those two numbers explains why so many AI initiatives look promising in demos and collapse in production.

The Surface-Level Math

A PM evaluating an AI feature runs a familiar calculation. The task costs $X in human labor. AI inference costs $Y. If Y is meaningfully less than X, the ROI case closes. Teams ship the feature. What they don’t account for is that $Y is rarely the full cost of AI doing the work.
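
For concreteness, here is that surface-level calculation as a short Python sketch. Both dollar figures are invented placeholders, not measurements from any real deployment:

```python
# The back-of-envelope ROI math as most teams run it. Both figures
# below are assumed placeholders, not measured values.
human_cost_per_task = 4.00   # $X: loaded human labor cost per task
ai_cost_per_task = 0.05      # $Y: inference cost per task, API bill only

savings = human_cost_per_task - ai_cost_per_task
print(f"naive savings per task: ${savings:.2f}")  # looks like an easy win
```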

Real inference costs include the task prompt, the system prompt, the few-shot examples, and the multi-step reasoning chains that chain-of-thought demands. One user-facing request to an AI agent can generate $0.05 in token costs where a simple classifier would have cost $0.001. Scale that across thousands of daily users, and the economics shift fast.
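
A rough sketch of where a figure like that $0.05 comes from, assuming illustrative prices of $3 per million input tokens and $15 per million output tokens (placeholder rates, not any specific provider's pricing):

```python
# A sketch of per-request cost for an agent call. Token counts and
# per-token prices are assumptions chosen to land near the $0.05 figure.
INPUT_PRICE = 3.00 / 1_000_000    # assumed $/input token
OUTPUT_PRICE = 15.00 / 1_000_000  # assumed $/output token

def request_cost(system_tokens, few_shot_tokens, task_tokens, output_tokens):
    """Everything in the context window is billed, not just the task itself."""
    input_tokens = system_tokens + few_shot_tokens + task_tokens
    return input_tokens * INPUT_PRICE + output_tokens * OUTPUT_PRICE

# Hypothetical agent request: heavy prompt scaffolding around a small task,
# plus a long chain-of-thought response.
agent = request_cost(system_tokens=2_000, few_shot_tokens=4_000,
                     task_tokens=500, output_tokens=2_000)
print(f"agent request: ${agent:.4f}")  # ~$0.0495, vs ~$0.001 for a classifier
```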

The Human Review Layer

AI output does not arrive pre-validated. When AI generates a product description, writes a first-draft PRD, or triages an inbox, a human still reviews it. That human time is not free. It is also not always counted in the AI ROI model, because it does not show up on the API bill.

The total cost of an AI-assisted task looks like this: inference cost plus human review time, plus rework when AI gets it wrong, plus the occasional incident when nobody caught the error in time. Teams that model only the inference side consistently misjudge their unit economics by 40-60%.
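
One way to make that model concrete is a sketch like the following, where the review time, rework time, error rate, and incident figures are all invented for illustration:

```python
# A sketch of the full unit-cost model: inference plus review plus
# expected rework plus amortized incidents. Every number here is an
# assumption, not a measured value.
def cost_per_task(inference_cost, review_minutes, rework_minutes,
                  error_rate, incident_cost, incident_rate,
                  hourly_rate=60.0):
    per_minute = hourly_rate / 60
    review = review_minutes * per_minute               # every output is reviewed
    rework = error_rate * rework_minutes * per_minute  # expected fix-up labor
    incidents = incident_rate * incident_cost          # rare, expensive misses
    return inference_cost + review + rework + incidents

full = cost_per_task(inference_cost=0.05, review_minutes=2,
                     rework_minutes=6, error_rate=0.15,
                     incident_cost=500.0, incident_rate=0.001)
print(f"inference only: $0.05, full loop: ${full:.2f}")  # $0.05 vs $3.45
```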

The Hidden Assumption

The ROI case for most AI features rests on an assumption that AI output is nearly free once the model works. That assumption holds only if AI gets things right the first time with sufficient reliability that human review becomes optional. In practice, that threshold is rarely met.

When AI reliability is 85%, human review is still a required step in the workflow. When it drops to 70%, the rework burden starts eating into the labor savings. Many AI features that looked positive on a slide turn net negative once you measure the full loop: AI does the work, a human validates it, and a human fixes it when AI is wrong.
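
To see the reliability effect in numbers, here is a small sketch that sweeps reliability while holding the other assumed costs fixed, under the simplifying assumption that a human catches and fixes every error:

```python
# Cost per correct output as a function of AI reliability. Assumes a
# human reviews every output and fixes every error, so all outputs end
# up correct; times and rates are illustrative.
def cost_per_correct(reliability, inference=0.05, review_min=2.0,
                     fix_min=6.0, hourly_rate=60.0):
    per_min = hourly_rate / 60
    return (inference
            + review_min * per_min                    # validation on every task
            + (1 - reliability) * fix_min * per_min)  # expected repair labor

for r in (0.95, 0.85, 0.70):
    print(f"reliability {r:.0%}: ${cost_per_correct(r):.2f} per correct output")
# 95% -> $2.35, 85% -> $2.95, 70% -> $3.85: savings erode as reliability drops
```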

The Right Unit to Measure

The correct question is not whether AI can do the work. It usually can. The correct question is what it costs, per unit of correct output, to have AI do the work versus alternatives. That number changes with model choice, prompt design, context length, and the error rate your use case can tolerate.

Teams that have cracked this problem measure cost-per-valid-output, not token spend. They benchmark multiple models for their specific use case, optimize prompt length ruthlessly, and treat inference cost as a variable to optimize rather than a fixed overhead. That discipline separates AI investments that compound from those that quietly erode margin while looking like innovation.
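
In practice, that benchmarking can start as simply as the comparison below. The model names, prices, and accuracy figures are fabricated for illustration; a real version would use measurements from your own eval set:

```python
# Compare candidate models on cost per valid output, the metric that
# matters, rather than raw per-request spend. All entries are made up.
candidates = {
    # name: (cost per request in $, accuracy on your own eval set)
    "large-model": (0.050, 0.95),
    "mid-model":   (0.012, 0.85),
    "small-model": (0.003, 0.20),
}

for name, (cost, accuracy) in candidates.items():
    # If only `accuracy` of requests yield a valid output, each valid
    # output effectively costs cost / accuracy (ignoring review labor).
    print(f"{name:12s} ${cost / accuracy:.4f} per valid output")
```

With these invented numbers, the cheapest model per request is not the cheapest per valid output, which is exactly the ranking flip this metric exists to surface.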

The next time you review an AI project budget, ask the team for one number: cost per correct output. If they cannot produce it, the ROI case is not closed — it is just unfinished.
