GPT-5.5 is more capable than GPT-5.4, but the release also makes one thing obvious: model routing is now part of the product. OpenAI lists higher standard pricing for GPT-5.5 than for GPT-5.4, and describes a Fast mode in Codex that trades faster generation for higher plan usage.
That is normal frontier-model economics. The mistake is treating the newest model as the default for every step.
## The routing rule
Use the expensive model where failure is expensive.
| Workflow step | Good default | Why |
|---|---|---|
| Intake classification | small or mid model | The task is repetitive and easy to validate |
| Document extraction | mid model plus schema checks | Accuracy matters, but the output shape constrains the work |
| Ambiguous planning | frontier model | The model has to choose the path |
| Tool-heavy execution | frontier model with budget | Bad tool calls create real cost |
| Final review | frontier or separate reviewer model | Independent checks catch drift |
| Receipt writing | small model or deterministic template | The facts should already be logged |
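The defaults above can be sketched as a static routing map. The step names and model-class labels here are placeholders for illustration, not real API identifiers:

```python
# Hypothetical routing map: workflow step -> default model class.
# All names are placeholders; swap in your provider's actual model IDs.
ROUTES = {
    "intake_classification": "small",
    "document_extraction": "mid",        # pair with schema checks downstream
    "ambiguous_planning": "frontier",
    "tool_heavy_execution": "frontier",  # run with an explicit budget
    "final_review": "reviewer",          # frontier or a separate reviewer model
    "receipt_writing": "small",          # or a deterministic template
}

def pick_model(step: str) -> str:
    """Return the default model class for a workflow step; fail loudly on unknown steps."""
    if step not in ROUTES:
        raise KeyError(f"no route for step: {step}")
    return ROUTES[step]
```

Failing loudly on unknown steps matters: a silent fallback to the frontier model is exactly the cost leak this table is meant to prevent.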
## What token efficiency really means
OpenAI says GPT-5.5 can deliver better results with fewer tokens than GPT-5.4 for most Codex users. That can be true and still cost more on the wrong workload.
The question is not only "how many tokens did it use?" The better questions are:
- how many retries disappeared
- how much human correction disappeared
- how many tool calls were avoided
- how often the first artifact passed review
- how often the model stopped early
If GPT-5.5 turns two failed attempts into one successful run, the higher token price may be cheap. If it writes nicer summaries for a task a smaller model already handles, it is waste.
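That retry math is worth making explicit. A minimal sketch of cost per accepted artifact, with made-up prices and token counts (real numbers vary by provider and workload):

```python
def cost_per_success(price_per_mtok: float, tokens_per_attempt: int,
                     attempts_per_success: float) -> float:
    """Effective cost of one accepted artifact, counting retries.

    Illustrative only: uses a single blended per-million-token price
    rather than separate input/output rates.
    """
    return price_per_mtok * (tokens_per_attempt / 1_000_000) * attempts_per_success

# A pricier model that succeeds on the first try can beat a cheap model that retries.
cheap = cost_per_success(price_per_mtok=2.0, tokens_per_attempt=60_000,
                         attempts_per_success=4.0)   # 0.48 per success
frontier = cost_per_success(price_per_mtok=10.0, tokens_per_attempt=40_000,
                            attempts_per_success=1.0)  # 0.40 per success
```

The comparison flips depending on `attempts_per_success`, which is why retries belong in the receipt, not just token counts.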
## A sane small-business route
Start with three lanes.
| Lane | Model class | Use it for |
|---|---|---|
| Cheap lane | fast small model | tagging, spam filtering, simple extraction |
| Work lane | capable general model | drafting, summarizing, structured responses |
| Hard lane | GPT-5.5 class | long context, tool use, multi-step ambiguity |
The harness decides when to escalate. The client should never need to know the model name unless the model choice affects cost, latency, or trust.
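A harness-side escalation rule can be as simple as a few coarse signals. The thresholds and task fields below are hypothetical, a sketch of the three-lane idea rather than a tuned policy:

```python
def choose_lane(task: dict) -> str:
    """Pick a lane from coarse task signals. Fields and thresholds are illustrative."""
    # Hard lane: ambiguity, tool use, or long context.
    if (task.get("ambiguous")
            or task.get("needs_tools")
            or task.get("context_tokens", 0) > 100_000):
        return "hard"
    # Cheap lane: repetitive, easy-to-validate work.
    if task.get("kind") in {"tagging", "spam_filter", "simple_extraction"}:
        return "cheap"
    # Everything else: the general work lane.
    return "work"
```

The client sees a lane, not a model name, which keeps the routing decision inside the harness where it can change without breaking anything downstream.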
## What to measure
For every run, log:
- model and effort setting
- input and output token count
- tool calls
- retries
- validation failures
- human edits
- final artifact status
That receipt is how you keep the model upgrade honest.
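A minimal receipt record, assuming a JSON-lines log; the field names mirror the checklist above but are otherwise placeholders:

```python
from dataclasses import dataclass, asdict
import json

@dataclass
class RunReceipt:
    """One logged run. Field names are illustrative, matching the checklist."""
    model: str
    effort: str
    input_tokens: int
    output_tokens: int
    tool_calls: int
    retries: int
    validation_failures: int
    human_edits: int
    final_status: str  # e.g. "accepted", "rejected", "escalated"

def log_receipt(receipt: RunReceipt) -> str:
    """Serialize one receipt as a JSON line for an append-only log."""
    return json.dumps(asdict(receipt))
```

One line per run is enough to answer the retry and human-edit questions from the previous section when the next model upgrade arrives.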