Why Input Tokens and Output Tokens Affect Cost Differently

Most commercial LLM APIs price input and output tokens separately, and output is almost always the more expensive of the two, typically 3–5x the input price. This asymmetry isn't arbitrary. It reflects real differences in computation, and it shapes how you design workflows and choose models for different tasks.

Why output costs more to compute

When a model processes input tokens, it runs a forward pass through the neural network to build a representation of the context. All input tokens are known up front, so they can be processed in parallel. This is computationally significant but happens once for the entire input.

When a model generates output tokens, it runs a forward pass for each token generated — because generating token N requires knowing what tokens 1 through N-1 were. This is autoregressive generation: each output token depends on all previous output tokens. Generating a 500-token response means 500 sequential forward passes, compared to one pass for a 1,000-token input.

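Generation also requires maintaining state: a cache of attention keys and values for every earlier token, which must be read again at each step. Because the steps cannot be parallelised across the response, decoding tends to be limited by memory bandwidth and uses the hardware less efficiently per token than the parallel input pass. Sampling itself (temperature, top-p) adds comparatively little on top.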
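The contrast between one pass over the input and one pass per output token can be made concrete with a toy decoding loop. This is only a sketch: `toy_step` is a hypothetical stand-in for a model's forward pass, not a real model, and the loop ignores the caching that production inference servers use.

```python
from typing import Callable, List, Tuple


def generate(prompt: List[int],
             step: Callable[[List[int]], int],
             max_new_tokens: int) -> Tuple[List[int], int]:
    """Autoregressive loop: returns (generated tokens, number of model calls)."""
    context = list(prompt)
    generated: List[int] = []
    calls = 0
    for _ in range(max_new_tokens):
        next_token = step(context)   # one model call per generated token
        calls += 1
        context.append(next_token)   # token N becomes input for token N+1
        generated.append(next_token)
    return generated, calls


def toy_step(context: List[int]) -> int:
    # Hypothetical stand-in for a forward pass; not a real model.
    return sum(context) % 50


prompt = list(range(1000))           # a 1,000-token input
generated, calls = generate(prompt, toy_step, max_new_tokens=500)
print(f"{len(prompt)} input tokens, {len(generated)} output tokens, {calls} sequential calls")
```

Each call depends on the token produced by the previous one, so the 500 calls cannot be collapsed into a single batch the way the 1,000 input tokens can.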

What this means for different task types

Classification and routing tasks have long inputs (the content being classified) but tiny outputs (a category label, "yes/no", a confidence score). Output cost is negligible. Input cost dominates — and these tasks often work well with budget models even at high volume.

Summarisation tasks have variable-length inputs and short-to-medium outputs. A 5,000-token document summarised to 300 words costs mostly on the input side — and cheaper models with large context windows (Gemini Flash, Claude Haiku) are well-suited.

Code generation, analysis, and long-form writing can have short prompts but very long outputs. A request to "write a complete REST API for user management" might have a 200-token prompt and a 3,000-token response. At $3/MTok input and $15/MTok output, that's $0.045 in output cost alone, compared to $0.0006 in input cost.

# Example: output cost dominates for long-response tasks
Code generation request:
  Input:  200 tokens × $3/MTok  = $0.0006
  Output: 3000 tokens × $15/MTok = $0.045
  Ratio: output costs 75× as much as input for this call
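The same arithmetic can be wrapped in a small helper and applied across the task types described earlier. This is a minimal sketch: the prices are the illustrative $3 and $15 per million tokens used above, and the token counts for classification and summarisation are assumed values chosen for illustration, not measurements.

```python
from typing import Dict, Tuple

INPUT_PRICE_PER_MTOK = 3.00    # USD, illustrative price from the example
OUTPUT_PRICE_PER_MTOK = 15.00  # USD, illustrative price from the example


def call_cost(input_tokens: int, output_tokens: int,
              input_price: float = INPUT_PRICE_PER_MTOK,
              output_price: float = OUTPUT_PRICE_PER_MTOK) -> Tuple[float, float]:
    """Return (input_cost, output_cost) in USD for one call."""
    return (input_tokens * input_price / 1_000_000,
            output_tokens * output_price / 1_000_000)


# (input_tokens, output_tokens); only the code generation row comes from the text.
profiles: Dict[str, Tuple[int, int]] = {
    "classification": (2000, 5),     # assumed sizes
    "summarisation": (5000, 400),    # assumed: roughly 300 words of output
    "code generation": (200, 3000),
}

for name, (n_in, n_out) in profiles.items():
    in_cost, out_cost = call_cost(n_in, n_out)
    share = out_cost / (in_cost + out_cost)
    print(f"{name:16s} input ${in_cost:.6f}  output ${out_cost:.6f}  output share {share:.0%}")
```

Swapping in your own provider's prices and your own measured token counts shows quickly which side of the bill a given workload sits on.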

Controlling output cost

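Set max_tokens. Capping output length directly caps maximum cost. If a task genuinely needs 500 tokens of output, max_tokens=600 prevents runaway 3,000-token responses from verbose models. The exact parameter name varies by provider. Note that the cap truncates rather than shortens: a response that hits the limit is simply cut off, so leave some headroom and pair the cap with a prompt that asks for the length you want.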
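Because the cap bounds the output side of the bill, the worst-case cost of a call can be computed before it is sent. The sketch below assumes the illustrative $3 and $15 per million token prices used earlier; `worst_case_cost` and `cap_with_headroom` are hypothetical helper names, not part of any provider SDK.

```python
def worst_case_cost(input_tokens: int, max_tokens: int,
                    input_price: float, output_price: float) -> float:
    """Upper bound in USD for one call: the model can emit at most max_tokens."""
    return (input_tokens * input_price + max_tokens * output_price) / 1_000_000


def cap_with_headroom(expected_output_tokens: int, headroom_percent: int = 20) -> int:
    """Pick a max_tokens value slightly above the expected output length."""
    return expected_output_tokens + expected_output_tokens * headroom_percent // 100


cap = cap_with_headroom(500)                        # 500 tokens plus 20% headroom
capped = worst_case_cost(200, cap, 3.0, 15.0)       # bounded by the cap
runaway = worst_case_cost(200, 3000, 3.0, 15.0)     # what a verbose reply would cost
print(f"max_tokens={cap}: at most ${capped:.4f} per call")
print(f"3,000-token response: ${runaway:.4f} per call")
```

The 20% headroom is an arbitrary starting point; a cap that is too tight trades cost for truncated answers.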

Ask for concise output explicitly. "Return only the JSON, no explanation" or "Respond in under 100 words" reduces output length for tasks where verbosity isn't valuable.

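Use batch processing. Several major providers, including OpenAI and Anthropic, offer asynchronous batch APIs at around a 50% discount, applied to input as well as output tokens. The trade-off is latency: results come back within a processing window (commonly up to 24 hours, often sooner) rather than in seconds. For workflows that can wait, such as data processing and content generation queues, batch pricing roughly halves the bill.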
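To estimate what a batch discount is worth for a given job, a rough calculation along these lines may help. The discount rate and which tokens it covers differ between providers, so both are left as parameters here; the 50% figures and the job size are assumptions for illustration, and `job_cost` is a hypothetical helper.

```python
def job_cost(calls: int, input_tokens: int, output_tokens: int,
             input_price: float, output_price: float,
             input_discount: float = 0.0, output_discount: float = 0.0) -> float:
    """Total USD cost of `calls` identical requests, with optional discounts (0.0 to 1.0)."""
    per_call = (input_tokens * input_price * (1 - input_discount)
                + output_tokens * output_price * (1 - output_discount)) / 1_000_000
    return calls * per_call


# Assumed job: 10,000 code generation calls at the example prices.
sync = job_cost(10_000, 200, 3000, 3.0, 15.0)
batch = job_cost(10_000, 200, 3000, 3.0, 15.0,
                 input_discount=0.5, output_discount=0.5)
print(f"synchronous: ${sync:.2f}")
print(f"batch:       ${batch:.2f}")
print(f"saved:       ${sync - batch:.2f}")
```

For output-heavy jobs like this one, nearly all of the saving comes from the output side, so the result changes little if a provider discounts output tokens only.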

Streaming doesn't save money. Streaming tokens as they're generated feels faster for users, but you're charged for the same number of output tokens either way. Streaming is a latency UX improvement, not a cost reduction.