How to Track Token Cost Across LLM Workflows

For teams shipping AI products, token usage is a core unit-cost signal behind summarization, support answers, agent steps, code suggestions, and retrieval workflows. This guide shows how to track token usage and cost so teams can attribute spend, catch inefficiencies, and make better AI investment decisions.

By: Dan Juengst

| June 26, 2026

AI & LLMs

How to Track Token Cost Across LLM Workflows

AI features rarely fail because of model quality alone. They often struggle when teams can’t predict the cost of running them in production. “Token cost” is what you pay an LLM provider for the input, output, and sometimes cached tokens processed during a request. Unlike crypto tokens, AI tokens are units of model usage, often chunks of text, that providers use to meter and price workloads.

For teams shipping AI products, token usage is a core unit-cost signal behind summarization, support answers, agent steps, code suggestions, and retrieval workflows. Without request-level tracking, you may know the monthly bill, but not which feature, customer, model, prompt, or retry loop caused it.

This guide shows how to track token usage and cost so teams can attribute spend, catch inefficiencies, and make better AI investment decisions.

How to track token cost step by step

The process for tracking token cost is straightforward:

Capture token usage from every model response.
Attach business metadata, such as feature, environment, team, user, or workspace.
Compute request cost using the provider’s pricing table at log time.
Store the result as request-level telemetry.
Build dashboards for cost trends, outliers, and cost per outcome.
Alert when usage drifts from expected baselines.

The goal is to move beyond monthly invoices and create visibility into where costs originate. Use this table as a guide to the kinds of fields you should capture and where they exist across your solution:

What data do you need to measure token cost accurately?

To track token costs accurately, you will need to log costs with the provider, model, and pricing version. To give finance and engineering the same receipt, be sure to compute the cost when the request is logged, not weeks later in a reporting job. Pricing tables, model names, cache rules, and tool charges can change quickly and unpredictably. Here’s what to know about measuring tokens.

Input tokens, output tokens, and cached tokens

Most AI token cost calculations need at least three usage values:

Input tokens represent the prompt, retrieved context, tool instructions, and conversation history sent to the model.
Output tokens represent the model’s generated response.
Cached tokens represent repeated prompt material that providers may bill differently from fresh input.

This distinction between the types of token matters because provider pricing is not always a single rate. Most providers charge separately for different types of token usage. For example, OpenAI’s pricing documentation says tokens are billed at the chosen model’s input and output rates, and its pricing page lists cached input pricing where applicable. Anthropic’s pricing documentation separates base input tokens, cache writes, cache hits and refreshes, and output tokens.

Cost calculation from provider pricing tables

The basic formula for token costs should be:

Request cost = (input tokens x input rate) + (output tokens x output rate)

If rates are listed per million tokens, divide each token count by 1,000,000 before multiplying.

For some use cases, you can greatly reduce costs by utilizing prompt caching. Input tokens from cache read are usually heavily discounted (~90%). In this case, you will need to separate out cached and uncached input. You will also need to consider that some providers charge a premium to write to the cache, with the premium increasing the longer the input is cached. So use caching only where prompts are likely to be reused often enough to justify the write premium.

What you can safely leave out of cost logs

Many teams assume they need to store prompts to measure token cost. However, for most attribution and observability use cases, you only need the following:

token counts
model ID
request ID
feature names
environment tags
user or workspace identifiers
and success or error status

This is an important privacy boundary. Teams can measure cost per request, cost per feature, and token cost by user without storing raw prompts or sensitive customer content in their cost telemetry.

For deeper prompt behavior analysis, pair cost fields with an LLM token counter guide, but keep cost logs focused on the metadata needed to explain spend.

How should you instrument token cost by request, feature, and environment?

Monthly invoices tell you how much you spent, and request-level telemetry tells you why. Here’s how to instrument your token costs.

Capture usage at the model-call boundary

To make token cost actionable, capture usage data at the same place your application calls the LLM provider: the wrapper, gateway, or service function responsible for the model request. This is where you will have the provider response, application context, and trace context.

After the model call returns, read the provider’s usage data from the response and normalize it into fields your team controls. Do not build dashboards around each provider’s raw field names. Instead, map them into a consistent schema, such as:

llm.provider
llm.model
llm.input_tokens
llm.cached_input_tokens
llm.output_tokens
llm.total_tokens
llm.request_cost_usd
llm.pricing_version

This is the foundation of LLM cost monitoring: tying every model call to the request, feature, user, and outcome that created it.

Tagging requests with feature and environment metadata

Provider invoices rarely know what your product was trying to do. Your application does. Enrich the event before you leave the service, and you can use a schema like this:

app.feature
app.environment
app.team
app.workspace_id
app.user_tier
app.experiment

These tags make it possible to visualize LLM cost per user, workspace, feature, or environment.

A minimum useful telemetry event might look like this:

{
 "name": "llm.request",
 "request_id": "req_123",
 "trace_id": "trace_abc",
 "llm.provider": "openai",
 "llm.model": "example-model",
 "llm.input_tokens": 4210,
 "llm.cached_input_tokens": 1800,
 "llm.output_tokens": 640,
 "llm.request_cost_usd": 0.0187,
 "llm.pricing_version": "2026-06-01",
 "app.feature": "contract_summary",
 "app.environment": "production",
 "app.team": "documents",
 "app.workspace_id": "workspace_456",
 "duration_ms": 2430,
 "finish_reason": "stop",
 "status": "success"
}

Emit this event once per LLM request, ideally as part of the same trace that captures retrieval calls, tool calls, retries, and downstream work. This gives teams enough context to debug cost alongside latency, errors, model changes, and product behavior.

Storing request-level cost telemetry for later analysis

Invoice totals are useful for accounting, but they are too late and too coarse for engineering action. Storing token cost as OpenTelemetry-style, request-level telemetry rather than as a monthly spreadsheet lets you ask questions like:

Which model change increased output tokens?
Which agent step is retrying?
Which customer segment has the highest cost per outcome?
Which feature has the best margin?

An AI observability platform should let you answer those questions without pre-aggregating away the context.

How do you turn token cost into a useful operating dashboard?

Signals like “monthly provider spend went up" are not as helpful to understanding feature ROI as signals that are more closely tied to outcomes. An example of a more useful signal would be “cost per successful support answer increased 38% after the new retrieval prompt shipped.” Here’s how to create a useful operating dashboard.

Cost trends and outliers

Start with five dashboard views:

Spend over time
Cost by feature
Cost by model
Cost by environment
Highest-cost/outlier requests

A dashboard panel of outlier requests can often be the most useful view. A single runaway request, huge context window, or retry loop can distort daily spend. Seeing the most expensive traces helps engineers find the actual cause, not just the symptom.

Cost per feature and cost per outcome

One of the least useful metrics in AI operations is total monthly spend. A much better metric is cost per successful outcome.

Examples include:

Cost per generated summary
Cost per document processed
Cost per completed workflow
Cost per active user session

These metrics connect AI spending directly to business value for data-driven decision-making. For launches, experiments, and model migrations, the same dashboards help teams track API token usage and costs and compare cost per outcome before and after each change.

How can you keep token cost from scaling faster than feature value?

Instead of optimizing for the cheapest model or the fewest tokens, the team’s objective should be to optimize for the lowest cost per successful outcome. There are several optimization techniques that consistently deliver the biggest savings.

Cache repeated prompts and trim context windows

The highest-impact savings usually come from reducing repeated input. Cache stable prompt prefixes, system instructions, examples, policy text, and reusable documents when your provider supports it. Then trim the context window to include only what the model needs to complete the task.

Context window creep is one of the easiest ways for LLM token usage to grow. Retrieval results get longer, conversation history expands, tool outputs accumulate, and suddenly every request carries thousands of unnecessary input tokens.

Route to the right model and cap output length

Not every request requires the most capable model. Simple classification, extraction, and routing tasks can often use smaller, less expensive models when quality is sufficient. Reserve larger or more expensive models for complex reasoning, high-value customers, or escalation paths.

Set output token caps, too. A single chatty response should not be able to blow up daily spend. Max output limits, stop conditions, and response-shape constraints keep costs predictable.

Alert on retry loops, runaway agents, and context creep

The biggest cost spikes often come from systems doing the same thing repeatedly: retry loops, agents calling tools in circles, workflows that reprocess the same document, or sessions that keep appending context.

Set anomaly alerts on tokens per request, tokens per session, cost per feature, and cost per workspace. This is where observability cost management becomes operational: teams see drift while they can still fix it.

A production-ready checklist for operationalizing token cost tracking

Use this checklist to establish a production-ready token cost tracking program:

Assign ownership to verify and capture provider pricing (this can be versioned and maintained in code).
Require model, feature, environment, team, and user or workspace tags on every request.
Log input tokens, cached input tokens, output tokens, request cost, latency, and finish reason.
Create daily budget alerts and anomaly detection for tokens per session and cost per feature.
Reconcile telemetry totals against provider invoices monthly.

How Honeycomb helps teams track token cost

Tracking token cost only matters if teams can understand the behavior behind the numbers. Honeycomb gives engineering teams request-level visibility across prompts, models, users, retrieval systems, tool calls, and agent workflows so they can investigate individual requests instead of relying on aggregate billing reports.

That context helps teams identify costly patterns, understand which features are efficient, and connect AI spend to business outcomes. The result is greater predictability, stronger accountability, and faster investigation when costs unexpectedly increase.

Learn more about Honeycomb for LLM observability cost management.