
No More Cheap Claude: Four First Principles of Token Economics in 2026

April 19, 2026

TL;DR: Token Economics in the Era of Scarcity

Your Claude Pro subscription hits limits faster than it did in January, as Anthropic quietly re-priced the ceiling, and every AI provider is rationing compute. If you keep working with Claude the way you did six months ago, you are in for a rude awakening. This article gives you four principles that explain how Token Economics actually works, so you can stop accepting the black box and start using your budget deliberately.

No More Cheap Claude: Four First Principles of Token Economics in 2026, Separating Professionals from Amateurs - by PST Stefan Wolpers of Berlin-Product-People.com.

🗞 Shall I notify you about articles like this one? Awesome! You can sign up here for the ‘Food for Agile Thought’ newsletter and join 35,000-plus subscribers.

The End of Cheap Claude, The Rise of Token Economics

It is April 2026, and the subsidies are gone.

If you hold a Claude Pro subscription and have started hitting your limit before lunch, you are not imagining it. Anthropic confirmed on March 26 that session limits now deplete faster during weekday peak hours (5am-11am PT / 1pm-7pm GMT) by design, not by bug. Around 7% of users, "particularly in Pro tiers," will hit limits they would not have in January, according to PCWorld, with The Register adding additional context on the throttling mechanics. Nine days later, on April 4, Anthropic cut off third-party tools like Cline, Cursor, and Windsurf from using subscription authentication, forcing automation workloads onto metered API billing. Neither event was announced on the Anthropic homepage. Both appeared on Reddit and X first.

Anthropic is not being stingy; it is reacting to market forces: GPU rental prices for Nvidia's Blackwell chips rose 48% in two months, reaching $4.08 per hour as of early April 2026. CoreWeave extended minimum contract terms from one year to three. Those numbers will keep moving, and they will keep moving in one direction until energy infrastructure and data center buildouts catch up. OpenAI's CFO says publicly that her company is "making some very tough trades at the moment on things we're not pursuing because we don't have enough compute." Infrastructure is the bottleneck. Every large provider is rationing access through some combination of price, throttling, or gated availability. Anthropic has to do all three; its conservative approach to capacity buildout comes with a trade-off.

For your work with Claude, this means one thing: the flat-rate experience you enjoyed in January 2026 has been silently re-priced. It will keep being re-priced. Efficiency is no longer a nice-to-have; applying Token Economics has become a must.

There is a temptation to respond to this with a list of tricks: Turn off extended thinking, edit messages instead of sending follow-ups, or use Haiku for simple tasks. These tactics work, and I will come back to them. But tactics alone are fragile: Anthropic ships a product update, one trick breaks, and the practitioner is back to guessing. What lasts is understanding the mechanism underneath, the four principles that describe that mechanism. Every tactic you have read in a Substack post this month reduces to one of these four.

Token Economics Principle 1: Every Turn Re-Consumes Everything Before It

Claude does not remember your conversation the way a human colleague does. Every time you send a message, Claude reads the entire conversation again from the top: your first question, Claude's first answer, your second question, and so on. Message 30 pays to re-read messages 1 through 29 before it even starts working on your new question.
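The compounding is easy to underestimate, so here is a back-of-envelope sketch. It assumes a flat 500 tokens per message, which is purely illustrative; real messages vary widely.

```python
def cumulative_input_tokens(turns: int, tokens_per_message: int = 500) -> int:
    """Total input tokens paid across a conversation of `turns` user turns,
    given that every turn re-sends the entire history."""
    total = 0
    history = 0  # tokens already sitting in the conversation
    for _ in range(turns):
        history += tokens_per_message   # your new message joins the context
        total += history                # Claude re-reads everything so far
        history += tokens_per_message   # Claude's reply joins the history
    return total

print(cumulative_input_tokens(1))    # 500
print(cumulative_input_tokens(30))   # 450000: 30x the turns, 900x the cost
```

The growth is quadratic, not linear, which is exactly why long conversations feel disproportionately expensive toward the end.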

A Concordia University research team measured this directly in a multi-agent coding system running on GPT-5 Reasoning, finding that input tokens made up 53.9% of total token consumption across 30 software development tasks. More than half of the budget went to re-consuming context rather than generating new output. The exact ratio will vary across Claude products and use cases, but the mechanism remains the same.

This effect is why "start a new chat when the topic changes" is the single most repeated piece of advice in every article on this subject. The advice is not about organization, but about economics.

Token Economics Principle 2: The Context Window Is a Shared Container with Inputs You Cannot See

You think of your prompt as what you type. Claude sees something much larger.

Every file Claude reads during a session stays in context for the rest of that session. Every tool output, every connector response, every Search result, every artifact you generated three turns ago, the system prompt you never wrote, the CLAUDE.md or Project instructions you uploaded once and forgot about, the Memory feature's silent additions, and the entire message history. All of it shares the same finite window. Most of it is invisible to you in the interface.

Jenny Ouyang, who writes about Claude Code after receiving a $1,600 API bill in two months, ranks tool call outputs as the single largest drain on token budgets, above even conversation length. A 10,000-line log file Claude reads early in a session stays in context for every subsequent message. On Claude.ai, the equivalent is a large PDF you uploaded to a chat. Anthropic's own token-counting documentation shows a 51-page PDF (a Tesla quarterly SEC filing used as an example) counted at roughly 119,000 tokens, or about 2,300 tokens per page. A typical JPEG photograph runs around 1,550 tokens. Upload the same 15-page PDF into four different chats because you forgot you already did, and you have paid for it four times.
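To make the duplication concrete, here is a hypothetical estimator built from the per-page and per-image figures quoted above. Treat the constants as ballpark values taken from Anthropic's example, not guarantees, and note this counts a single read per chat; the per-turn re-reading from Principle 1 multiplies it further.

```python
TOKENS_PER_PDF_PAGE = 2_300   # ballpark, from the Tesla-filing example
TOKENS_PER_IMAGE = 1_550      # ballpark, typical photograph

def attachment_tokens(pdf_pages: int = 0, images: int = 0, chats: int = 1) -> int:
    """Input tokens attachments add to context, multiplied across
    duplicate chats that each received the same upload."""
    per_chat = pdf_pages * TOKENS_PER_PDF_PAGE + images * TOKENS_PER_IMAGE
    return per_chat * chats

# The forgotten 15-page PDF, uploaded into four separate chats:
print(attachment_tokens(pdf_pages=15, chats=4))  # 138000
```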

Paweł Huryn, who built an open-source dashboard that reads Claude Code transcripts locally, writes that /usage does not break down tokens by model, project, or session. You hit a limit and have no direct way to see what caused it. Huryn's dashboard showed a single-day spike of 700 million cached tokens on his account, which turned out to be an Anthropic bug, not his usage. Without the dashboard, he would not have noticed.

What Claude writes also counts. Verbose responses, extended thinking output, generated artifacts, and the results of Research sessions all consume the budget on the way out, and then become part of the conversation history that gets re-read on the next turn. Output tokens are billed at five times the rate of input tokens on Anthropic's current API across Opus, Sonnet, and Haiku; on a subscription, that cost hides inside the usage meter, but the mechanism is the same: you pay for a 2,000-word response you did not actually need several times over, once when it is written and again on every subsequent turn that re-reads it.
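The double payment can be put into numbers. This sketch assumes a hypothetical $3 per million input tokens plus the 5x output multiplier described above; actual prices vary by model and change over time.

```python
INPUT_PER_M = 3.00               # assumed base price, $ per million input tokens
OUTPUT_PER_M = 5 * INPUT_PER_M   # the 5x output multiplier

def response_lifetime_cost(output_tokens: int, later_turns: int) -> float:
    """Lifetime cost of one response: paid at output rates once when generated,
    then re-read as input on every subsequent turn of the conversation."""
    generated = output_tokens / 1e6 * OUTPUT_PER_M
    rereads = later_turns * output_tokens / 1e6 * INPUT_PER_M
    return generated + rereads

# A ~2,700-token answer (about 2,000 words), alone vs. carried through 20 turns:
print(round(response_lifetime_cost(2700, 0), 4))   # 0.0405
print(round(response_lifetime_cost(2700, 20), 4))  # 0.2025
```

Twenty later turns quintuple the cost of that one verbose answer, which is the whole case for asking for short output.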

That is the environment you are working in. The container is shared, most of its contents are invisible, and the tools to inspect it exist only for API users who are willing to build them. Pro and Max subscribers fly blind.

Token Economics Principle 3: Stable Context Is Cheap; Changing Context Is Expensive

Anthropic's caching system gives a large discount to context that stays identical across requests. Cache reads cost roughly 10 percent of the base input token price. Cache writes cost 25 percent more than base input, paid once and amortized across every subsequent hit. The default cache lifetime is five minutes, extensible to one hour at additional cost.
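Those multipliers translate into a steep discount for anything that stays put. A sketch with an assumed base price (the 10% read and 125% write multipliers come from the paragraph above; the dollar figure is hypothetical):

```python
BASE = 3.00                  # assumed $ per million input tokens
CACHE_WRITE = 1.25 * BASE    # 25% surcharge, paid once
CACHE_READ = 0.10 * BASE     # roughly 10% of base, paid on every hit

def prefix_cost(prefix_tokens: int, requests: int, cached: bool) -> float:
    """Cost of a stable context prefix across `requests` requests, assuming
    every cached request after the first one is a cache hit."""
    m = prefix_tokens / 1e6
    if not cached:
        return requests * m * BASE
    return m * CACHE_WRITE + (requests - 1) * m * CACHE_READ

# A stable 50,000-token prefix (Project files, system prompt) over 20 requests:
print(round(prefix_cost(50_000, 20, cached=False), 2))  # 3.0
print(round(prefix_cost(50_000, 20, cached=True), 2))   # 0.47
```

The assumption of all-hits is the optimistic case; every cache miss in the middle pushes the cached number back toward the uncached one.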

The caching hierarchy processes requests in a fixed order: tools, then system prompt, then message history. A change early in the order invalidates everything after it. Rearrange your system prompt, add a new MCP server, upload a new file to your Project, and the cached prefix breaks. The next request rebuilds the cache from the first changed byte onward, at full cost.

This mechanism explains why the same task can cost different amounts on two different days. You stepped away for 30 minutes for a coffee. The cache expired. Your next message rebuilt the entire context at write cost instead of read cost. Piunikaweb reports that Anthropic's Thariq Shihipar attributed some of the extreme session-drain cases users reported in late March to "expensive prompt cache misses" when resuming long conversations with large context windows.

On Claude.ai specifically, you cannot place cache breakpoints yourself. What you can do is behave in ways that make caching work:

  • Keep your persistent context (Project instructions, about-me files, CLAUDE.md) short and stable.
  • Do not reorder files you upload.
  • Do not take long breaks in the middle of heavy work.
  • Finish a session before it drifts off-task.

Claude Projects deserve a separate note because most articles get them wrong. On paid plans, Projects use retrieval-augmented generation (RAG), but only when your uploaded knowledge "approaches or exceeds" the context window limit, which sits around 200,000 tokens. Anthropic does not publish the exact trigger point, and it may shift. Below that threshold, every file in the Project loads into context on every single prompt. Above it, Claude retrieves only the relevant chunks, and a visual indicator appears in the interface. The practical consequence: if you sit below the threshold, fewer and shorter Project files are strictly better, because you are paying for all of them on every turn. If you sit above it, you can add more material without linear cost. The advice you see that treats Projects as automatic efficiency magic is wrong for most Pro users, whose Projects contain a few style guides and reference docs and live well under the threshold.

The worst place to sit is just below the threshold. A Project near the 200,000-token line pays the full cost of every file on every prompt, without the retrieval efficiency that kicks in when RAG activates. Call this the Valley of Death. If you find yourself there, you have three reasonable moves:

  • Trim the Project aggressively, down to a quarter of the threshold, so the per-prompt cost is contained. Trim is right when most of your work uses a small, stable set of references.
  • Pad the Project with genuinely useful reference material to cross the threshold and trigger RAG mode. Pad is right when you have a genuinely large knowledge base Claude needs to draw from across sessions.
  • Often the best move: partition. Split one bloated Project into several task-specific ones. A marketing Project carrying 180,000 tokens of brand voice, social copy guidelines, and competitor research is really three Projects pretending to be one. Break them apart, and you stay far below the threshold in each, and Claude stops re-reading competitor research every time you draft a tweet. Partition is right when the content in the Project serves different tasks that rarely need each other.

What is not defensible is leaving the Project to drift at the threshold line, paying the maximum cost for the minimum efficiency.
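The three moves can be compared with rough numbers. Assume the ~200,000-token threshold from above, a Project queried 50 times, and a hypothetical 8,000-token retrieved chunk once RAG is active; every figure here is an assumption for illustration, not a measurement.

```python
THRESHOLD = 200_000          # approximate RAG trigger discussed above
RETRIEVED_CHUNK = 8_000      # hypothetical tokens retrieved per prompt in RAG mode

def project_tokens_paid(project_tokens: int, prompts: int) -> int:
    """Input tokens spent on Project knowledge across `prompts` prompts."""
    if project_tokens < THRESHOLD:
        return prompts * project_tokens    # below threshold: everything, every time
    return prompts * RETRIEVED_CHUNK       # above threshold: only relevant chunks

drift = project_tokens_paid(180_000, 50)    # Valley of Death: just below the line
trim = project_tokens_paid(50_000, 50)      # trimmed to a quarter of the threshold
# Partitioned: three 60k-token Projects splitting the same 50 prompts
partition = sum(project_tokens_paid(60_000, p) for p in (20, 20, 10))

print(drift, trim, partition)  # 9000000 2500000 3000000
```

Under these assumptions, drifting at the line costs three to four times more than either disciplined alternative, which is the Valley of Death in one line of arithmetic.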

Token Economics Principle 4: Scarcity Is Structural, Not Cyclical

Flat-rate generosity was the marketing of a land-grab phase. It was never the steady state.

Tomasz Tunguz, a venture capitalist writing about AI infrastructure, calls what is happening now "the beginning of scarcity in AI." He names five hallmarks:

  • Relationship-based selling (SOTA models gated to privileged customers),
  • AI to the highest bidder,
  • Available-but-slow access,
  • Inflationary pricing, and
  • Forced diversification toward smaller or self-hosted models.

"The age of abundant AI is over," Tunguz writes, "and it will remain so for years."

The PYMNTS coverage of the same period describes it as "AI rationing" and notes that Google, Anthropic, and others are simultaneously publishing explicit daily prompt caps where vague access language used to stand. Anthropic's April 4 lockout of third-party subscription routing fits the pattern: subscription access is being actively defended as a retail product, and arbitrage through automation tooling is being closed off.

Your Pro subscription in April 2026 is not the same product as your Pro subscription in January 2026. The marketing copy is the same, but the economic reality underneath it has shifted. If your work with Claude was built on the January assumption, it is now running on borrowed time. That reality changes the question you should be asking.

The old question was "how do I save tokens?" That question treats tokens as a cost to minimize. The more useful question is "what is the return on intelligence per token?" Every token you spend should buy intelligence worth paying for. Fifty thousand tokens to draft a routine email that a template would have produced is economically illiterate, regardless of whether you hit your limit. Five thousand tokens to decode a difficult incentive structure before a hard conversation is a high return. The discipline is not "use fewer tokens." The discipline is knowing what you are buying.

In a scarcity regime, that judgment is what separates a professional user from a consumer.

The Visibility Problem

Scarcity plus opacity produces anxiety, and that is the condition you are working in. Pro and Max subscribers have no per-prompt token breakdown. No real-time usage indicator. No way to know whether a given message hit the cache or missed it. The only signal is the usage meter, which moves in discrete steps and resets on a rolling schedule that varies by time of day. You cannot measure what you cannot see.

Classical optimization advice assumes instrumentation: measure first, then optimize the highest-impact areas. That advice is sound for API users with dashboards and for production LLM applications where a team can A/B test prompt variants. It does not apply to a product manager using Cowork on a Pro plan. That user cannot measure. What remains, for them, is informed default behavior. Habits of mind that keep the usage meter below the waterline without requiring instruments.

Which brings me back to my favorite claim: Token Economics, practiced without instruments, is not engineering optimization; it is human judgment.

The Counter-Argument and Why It Is Partly Right

There is a real counter-argument to everything in this article on Token Economics. It goes roughly like this: optimizing for tokens is premature optimization. The real constraint on your work with Claude is the quality of your thinking, not the quantity of your tokens. Compress your prompts and you will confuse the model, get worse answers, and spend more tokens on retries. Quality-first, cost-second.

The counter-argument is partly right. Terse or unclear prompts lead to worse work. A vague 15-word prompt that forces Claude to ask three clarifying questions costs more in total than a precise 60-word prompt that works on the first try. Premature optimization is real, and optimizing against a metric you cannot see is a recipe for false economy. Do not strip a prompt of context in the name of token hygiene if the context was load-bearing.

The counter-argument is wrong about one thing: in a scarcity regime, clear thinking and disciplined token use are the same skill, not competing ones. A well-framed problem consumes fewer tokens because clarity is itself compressive. A developer who uses Claude efficiently is not cutting corners. They are demonstrating the exact judgment senior engineers have always demonstrated: understand the problem before articulating it, decompose cleanly, provide the right context and no more, and evaluate output critically. The token count is a side effect of clear thinking. Or, put differently: the thinking is the point of the exercise.

This matters because it reframes the question. Token discipline is not a penny-pinching habit imposed by scarcity. It is an observable signal of professional competence with AI, in the same way that scope discipline is a signal of professional competence in Sprint Planning.

Judgment as the Professional Response

You have seen this pattern before in other domains: flat-rate became metered, generous became gated. The internet went from unlimited to capped. Enterprise software went from site licensing to seat pricing. The pattern always arrived with the same signal: the previous model was subsidizing growth, growth slowed, and the economics had to surface.

AI is now there. The response that works is the response that worked in the previous cycles: develop judgment about the resource before you are forced to; learn about Token Economics.

Four practices, grouped by principle, deserve to become habits:

On Principle 1 (every turn re-consumes everything before it): One topic per chat. Start a new conversation when the subject changes. At the end of a substantive session, ask Claude to write a short notes file covering decisions and next steps. Start the next session by loading that file. You carry forward exactly what matters and leave the rest behind. (Of course, you can write a skill for that job like I did.)

On Principle 2 (hidden inputs share the container): Do not load context Claude does not need. Select only the Project files relevant to the current task. Turn off Search, connectors, and extended thinking when you do not need them. Convert PDFs and screenshots to plain text before uploading, where possible. If you are reading your own files for your own eyes, use a script or the file system directly; Claude does not need to be in the loop for reading.

For long outputs, use Skeleton-of-Thought: ask Claude for the outline and key data points first, review it, then expand only the sections you actually need, ideally in a new clean chat. This approach treats the token budget as a surgical tool rather than a firehose, and it costs far less than asking for a 2,000-word report that you then have to read, correct, and discard in part. When you do need short output, constrain it explicitly: "top three bullet points, no commentary," "the table only, no preamble." Claude defaults to thorough; thoroughness has a price, and you pay it twice: once when the response is generated and again every time it is re-read on subsequent turns.
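The Skeleton-of-Thought saving is easy to estimate. A hypothetical comparison in which all token counts are assumptions chosen for illustration:

```python
def full_report_tokens(report: int = 3_000, revision_passes: int = 2) -> int:
    """Draft the whole report, then regenerate it per revision pass,
    re-reading the previous draft as input each time."""
    generated = report * (1 + revision_passes)
    rereads = report * revision_passes
    return generated + rereads

def skeleton_tokens(outline: int = 300, sections: int = 2, per_section: int = 800) -> int:
    """Outline first, review it yourself, then expand only the sections
    you need, re-reading the short outline once per expansion."""
    return outline + sections * (per_section + outline)

print(full_report_tokens(), skeleton_tokens())  # 15000 2500
```

The exact ratio depends entirely on how many sections you actually need; the point is that the outline, not the full draft, is what gets carried and re-read.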

On Principle 3 (stable context is cheap): Keep your Project instructions and persistent context short. If your about-me file or Project instructions have grown into thousands of words over time, trim them aggressively. The cost of carrying that weight is paid on every single prompt. Do not reorder or re-upload files mid-session. Finish heavy work in one sitting rather than across a two-hour gap that kills the cache.

On Principle 4 (scarcity is structural): Default to Haiku, escalate on demand. Run logic and structure checks through Haiku first, where speed and quota concerns barely register. Once the approach is sound, move the refined prompt to Sonnet for most daily work, and reserve Opus for the cases where Sonnet has visibly failed or the reasoning is genuinely hard. Starting every session with Opus is a 2024 luxury that does not survive the 2026 peak-hour regime. Plan in Chat and execute in Cowork or artifacts, because the expensive surfaces should do only the work that needs them. If you run heavy automation on a schedule, move it to off-peak hours. The premium you pay for a Max subscription can easily be repaid by a single week of not hitting limits on Pro.
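The escalation ladder can be written down as a policy. This is a sketch with placeholder model names and stand-in callables, not a production router; wire in your own client calls and your own quality check.

```python
LADDER = ["haiku", "sonnet", "opus"]   # cheapest and fastest first

def run_with_escalation(task, attempt, good_enough):
    """Try each model in order; return the first acceptable answer.
    `attempt(model, task)` is your own API call, `good_enough(answer)`
    your own acceptance check -- both are placeholders here."""
    answer = None
    for model in LADDER:
        answer = attempt(model, task)
        if good_enough(answer):
            return model, answer
    return LADDER[-1], answer          # fall back to the top model's best effort

# Toy demo: pretend this task needs at least Sonnet.
model, answer = run_with_escalation(
    "hard reasoning question",
    attempt=lambda model, task: f"{model}: draft answer",
    good_enough=lambda a: not a.startswith("haiku"),
)
print(model)  # sonnet
```

The design choice worth noting: the quality check runs on every rung, so the cheap model gets a real chance before the budget escalates, which is the opposite of the start-with-Opus habit.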

Notice what these suggestions are not: they are not hacks. They are not a checklist to cross off before Monday. They are the professional defaults of someone who has internalized that the machine they are working with has a finite, invisible, and actively shrinking budget, and who has decided to work within that reality instead of against it.

Judgment is a human thing; the tool is neutral, but your competence with the tool is not.

Token Economics — Conclusion

Pick one Token Economics practice from this article and install it this week. My suggestion: the end-of-session notes file. At the end of your next real working session with Cowork or Chat, ask Claude to summarize what you decided, what is unresolved, and what the next step is. Save the output. Start your next session with it. The practice takes ninety seconds per session and breaks the single most expensive habit in Principle 1: carrying an entire sprawling conversation into tomorrow because it is there.

Do that for a month. Then come back and tell me whether the meter behaves differently.

By the way, this practice is perfectly suited for creating a skill. I did so a month ago.

