How we saved 80% on LLM inference costs by pruning “junk tokens”

And why this pattern works almost anywhere.

1. The silent line-item on your AI invoice

How did we save 80% on LLM inference costs? It starts with how the bill is calculated: large-language-model APIs don’t charge for compute time; they charge for tokens. Every HTML tag, tracking pixel, boilerplate legal clause or stale Slack signature you send to a model is billable. With GPT-4o, for example, a million input tokens cost ≈ $5 and the same amount of generated output costs ≈ $15. At scale those pennies compound fast.
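
To make the arithmetic concrete, here is a back-of-the-envelope estimator built on the GPT-4o list prices above. The daily volume and output-token budget in the example calls are illustrative assumptions; the 11 k and 1.8 k input-token figures anticipate the case study below.

```python
# Rough cost model using the GPT-4o list prices quoted above
# (~$5 per 1M input tokens, ~$15 per 1M output tokens).
INPUT_USD_PER_M = 5.00
OUTPUT_USD_PER_M = 15.00

def monthly_cost(docs_per_day: int, in_tok: int, out_tok: int, days: int = 30) -> float:
    """Estimated monthly spend for a fixed per-document token budget."""
    total_in = docs_per_day * in_tok * days
    total_out = docs_per_day * out_tok * days
    return total_in / 1e6 * INPUT_USD_PER_M + total_out / 1e6 * OUTPUT_USD_PER_M

# Hypothetical volume: 50,000 documents a day, ~300 output tokens each.
print(f"raw HTML:    ${monthly_cost(50_000, 11_000, 300):,.0f} / month")
print(f"pruned text: ${monthly_cost(50_000, 1_800, 300):,.0f} / month")
```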

Yet most raw data streams are dominated by text that the model never needed to see in the first place.

2. Case study: job-post extraction at scale

A client aggregates thousands of job adverts per hour. The raw scraper delivered full HTML documents: styles, scripts, hidden DIVs, social buttons, cookie banners – about 10× more text than the human-visible ad.

My brief was simple:

“Cut our OpenAI bill without sacrificing extraction accuracy.”

What we did

  1. DOM hygiene. Remove tags with no semantic value (<script>, <style>, tracking spans, empty wrappers).
  2. Recursive pruning. Any node whose children collapse to whitespace gets dropped.
  3. Minimal plain-text render. Serialize the cleaned DOM; keep only visible sentences (steps 1–3 are sketched after this list).
  4. Lightweight lint. Deduplicate whitespace, normalise Unicode, strip boilerplate lines (“Apply on LinkedIn”, etc.).
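
In practice the first three steps fit in a few lines. Here is a minimal sketch assuming BeautifulSoup; the tag list is an illustrative assumption rather than the exact production rule set.

```python
# Minimal sketch of steps 1-3; NOISE_TAGS is illustrative, not exhaustive.
from bs4 import BeautifulSoup

NOISE_TAGS = {"script", "style", "noscript", "iframe", "svg", "nav", "footer"}

def clean_html(raw_html: str) -> str:
    soup = BeautifulSoup(raw_html, "html.parser")

    # Steps 1-2: walk the tree leaves-first, dropping noise tags and any
    # element that collapses to whitespace once its children are gone.
    for tag in reversed(soup.find_all(True)):
        if tag.name in NOISE_TAGS or not tag.get_text(strip=True):
            tag.decompose()

    # Step 3: minimal plain-text render of whatever survived.
    lines = (line.strip() for line in soup.get_text("\n").splitlines())
    return "\n".join(line for line in lines if line)
```

Step 4 stays as a separate pass of small regexes, which keeps each rule easy to test and to retire when a template changes.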

The outcome

Metric                     | Before  | After cleaning
Avg. characters / advert   | 47 k    | 6 k
Tokens sent to LLM         | ≈ 11 k  | ≈ 1.8 k
OpenAI cost per 1,000 ads  | $7.50   | $1.10

Inference spend dropped by roughly 85% and quality went up, because the model no longer hallucinated on noisy context.

3. Prune first, reason second

The same “prune first, reason second” mindset applies anywhere the payload is noisy or repetitive:

Domain                         | Typical noise                                | Quick wins
Customer-support email threads | Quoted history, signatures, tracking pixels  | Strip previous replies; keep the topmost message only
E-commerce product pages       | Carousels, ads, hidden SEO copy              | Extract the canonical description and schema.org fields
Legal PDFs                     | Headers, footers, Bates numbers              | Page-header detection; merge body text
Log files                      | Timestamp boilerplate, debug stack traces    | Regex sponge before summarisation
Meeting transcripts            | Fillers, cross-talk, greetings               | Keep utterances with ASR confidence > 0.9; drop those under 200 ms
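
Taking the first row as an example, “keep the topmost message only” can be approximated with a handful of reply-marker patterns. The markers below are illustrative assumptions; real templates vary by mail client and locale.

```python
# Hedged sketch: cut a support email at the first common reply marker.
import re

REPLY_MARKERS = re.compile(
    r"^(On .+ wrote:|From: .+|-{2,}\s*Original Message\s*-{2,}|>+ )",
    re.MULTILINE,
)

def topmost_message(email_body: str) -> str:
    """Return everything above the first quoted-history marker."""
    match = REPLY_MARKERS.search(email_body)
    return email_body[: match.start()].rstrip() if match else email_body
```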

Academic work backs this up: JPMorgan Chase’s TRIM pipeline shows that dropping inferable words saves ~20% of tokens with negligible semantic loss (arXiv), before any domain-specific cleaning.

4. Building a repeatable pipeline

  1. Deterministic filters first. Regular expressions, DOM rules or AST visitors are cheap and fast.
  2. Lightweight gatekeeper model. A small local model (e.g. TinyLlama) can classify chunks as “signal” or “noise” for under $0.01 per thousand tokens.
  3. Chunk intelligently. Aim for coherent 1–2 k-token windows so the LLM’s attention isn’t diluted.
  4. Trace every byte. Log token counts per stage; you can’t optimise what you don’t measure (a sketch combining steps 1 and 4 follows this list).
  5. Re-evaluate monthly. New ad formats, email templates or vendor widgets sneak back in.
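
Here is a rough sketch of how steps 1 and 4 fit together: cheap regex filters, with token counts logged per stage via tiktoken so regressions show up immediately. The patterns and stage names are assumptions, not our production configuration.

```python
# Deterministic filter + per-stage token tracing; patterns are illustrative.
import re
import tiktoken

ENC = tiktoken.get_encoding("cl100k_base")

BOILERPLATE = [
    re.compile(r"Apply on LinkedIn.*", re.IGNORECASE),
    re.compile(r"\bCookie settings\b.*", re.IGNORECASE),
]

def tokens(text: str) -> int:
    return len(ENC.encode(text))

def deterministic_filter(text: str) -> str:
    for pattern in BOILERPLATE:
        text = pattern.sub("", text)
    return re.sub(r"\n{3,}", "\n\n", text).strip()  # collapse blank-line runs

def run_pipeline(raw: str) -> str:
    stages = [("raw", raw), ("deterministic", deterministic_filter(raw))]
    # ...the gatekeeper model and chunking stages would slot in here...
    for name, text in stages:
        print(f"{name:>13}: {tokens(text):>7} tokens")  # trace every byte
    return stages[-1][1]
```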

We package these steps as CI checks so the cost-regression suite runs automatically on every pull request; it is one of the small rituals that keeps bills honest at AI Flow.

5. Questions to ask your own team

  • Which 10 % of your input produces 90 % of model value?
  • Do you track token usage per data source in your observability stack?
  • What would break if you removed every HTML tag that has no attributes?
  • Could a $0.001 local model pre-filter data before the $0.03 flagship call?

If you can’t answer these yet, you’re probably paying the hype tax.

6. Closing thoughts

Tokens are the new cloud instances: invisible, elastic, and costly when left unchecked. Pruning them is rarely glamorous work, but it is foundational. The earlier you embed a “clean-before-call” mindset, the more headroom you keep for actual innovation.

I’ve been advocating this approach since the feature-selection days at Google and continue to apply it across sectors, from energy to generative video. The tools change; the principle stays: rigor before scale.

For a deeper technical walkthrough, you can book a call with Mihai Anton at aiflow.ltd/meet.
