Streaming
Also known as: Token Streaming, Incremental Output, Server-Sent Deltas
Sends partial output frame by frame as it is generated, rather than waiting for the model to finish.
Claude Code
- Claude Code streams tool output to the terminal by default; no configuration is needed for the core streaming UX.
- Use hooks to intercept PostToolUse events and act on completed tool results before the next model call.
- In custom agent code, use streamText from the AI SDK and inspect mid-stream error frames; an HTTP 200 does not guarantee a clean stream.
- Pin maxRetries on your streaming client; a buffering reverse proxy holds the response until the stream closes, defeating incremental delivery.
Cursor
- Cursor streams tokens in the chat panel by default; the incremental UX is built in for all Agent and Ask mode responses.
- Reference large files via @file rather than pasting content; Cursor fetches and injects context on demand without pre-loading everything.
- Use Ask mode for read-only queries where you want to inspect the partial output before any edits are applied.
- Use cloud agents for long-running tasks; they run in isolated cloud environments and surface progress through the Cursor web interface.
Decision
| Use when ✓ | Avoid when ✗ |
|---|---|
| Use this when the end-to-end latency of a full response would feel broken to a human reader: chat assistants, code completion, anything a person watches arrive in real time. | When the consumer is another program that needs the full response to act (a batch evaluator, a webhook handler, a structured-data extractor), streaming buys nothing and complicates retry semantics. |
| Justified where the consumer can act on partial output before the model finishes: a UI that renders tokens as they arrive, a downstream pipeline that processes structured fields the moment each one closes. | Without an HTTP client and proxy chain that supports long-lived connections, streaming silently degrades: a buffering reverse proxy will hold the response until the stream closes and the user sees the same blocking behaviour you tried to avoid. |
| A good fit when you need to surface mid-generation telemetry (token counts, finish reason, tool calls under construction) that is impossible to inspect once the response is a single returned blob. | When the output is short enough that the network round trip dominates generation time, the framing overhead of SSE is wasted and a single response is simpler. |
| Best for tool-call workflows where the client wants to validate, render, or short-circuit a tool invocation while the arguments are still being generated. | |
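The buffering failure mode in the table is silent, so it can help to measure it. The sketch below is a rough heuristic, not an established diagnostic: it assumes the client records the arrival time of each chunk (milliseconds since the request was sent) and flags a response whose chunks all landed in one burst at the end. The function name looksBuffered and the burstRatio threshold are hypothetical.

```typescript
// Heuristic check for a buffering proxy (illustrative; names are hypothetical).
// arrivalsMs: arrival time of each chunk, in ms since the request was sent.
// A healthy stream spreads chunks across the response; a buffered one
// delivers them all together just before the connection closes.
function looksBuffered(arrivalsMs: number[], burstRatio = 0.1): boolean {
  if (arrivalsMs.length < 2) return false; // one chunk is inconclusive

  const first = Math.min(...arrivalsMs);
  const last = Math.max(...arrivalsMs);
  if (first <= 0) return false; // data arrived immediately; nothing to flag

  // Fraction of the total wait during which chunks were actually arriving.
  const spread = (last - first) / last;
  return spread < burstRatio;
}

// Chunks arriving throughout the response: not buffered.
console.log(looksBuffered([120, 400, 900, 2900]));
// Every chunk arriving within a few ms after a long wait: likely buffered.
console.log(looksBuffered([2950, 2951, 2952, 2953]));
```

A short output that the model generates in a single burst would also trip this check, so treat a positive result as a prompt to inspect the proxy chain rather than as proof.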
In the wild
| Source | Claim |
|---|---|
| docs.anthropic.com | Anthropic's Messages API documents the canonical event flow (message_start; then, for each content block, content_block_start, a series of content_block_delta frames, and content_block_stop; then message_delta and message_stop) that every Claude client (Console, Claude Code, third-party SDKs) consumes to render incremental output. |
| ai-sdk.dev | The Vercel AI SDK ships streamText and streamObject as first-class primitives; streamObject parses partial JSON against a Zod schema and yields a typed partial value at every delta, which is what powers form-fill and structured-output UIs in the Vercel templates. |
| github.com | OpenAI's cookbook publishes a runnable notebook that opens a chat completion with stream=True, iterates the SSE response, and prints each text delta as it arrives. It is the reference shape every OpenAI-compatible client implements. |
Reader gotcha
A streaming request returns HTTP 200 as soon as the response headers are sent, typically before the model has generated a single token, so transport-layer error handling that branches only on status code will mark a half-completed or overload-aborted response as success. Anthropic documents that mid-stream errors such as overloaded_error arrive as error event frames inside the already-200 response; a network drop, by contrast, simply ends the stream without a terminal event. The client must therefore inspect each frame, treat error events as failures, and treat a stream that closes before its terminal event as incomplete, rather than relying on the HTTP status alone.
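The gotcha above can be sketched as a small function that decides success from the frames rather than from the status code. This is a minimal sketch, assuming events have already been parsed into objects shaped like Anthropic's documented stream events (content_block_delta, message_stop, error); the StreamEvent and StreamOutcome types, the settleStream name, and the stream_closed_early label are hypothetical.

```typescript
// Loose shape for already-parsed stream events (illustrative, not an SDK type).
interface StreamEvent {
  type: string;
  delta?: { type: string; text?: string };
  error?: { type: string; message: string };
}

type StreamOutcome =
  | { ok: true; text: string }
  | { ok: false; partialText: string; reason: string };

// Decide success from the frames themselves, never from the HTTP status.
function settleStream(events: Iterable<StreamEvent>): StreamOutcome {
  let text = "";
  let sawStop = false;

  for (const ev of events) {
    if (ev.type === "error") {
      // An error frame inside a 200 response is still a failure.
      return { ok: false, partialText: text, reason: ev.error?.type ?? "unknown_error" };
    }
    if (ev.type === "content_block_delta" && ev.delta?.type === "text_delta") {
      text += ev.delta.text ?? "";
    }
    if (ev.type === "message_stop") sawStop = true;
  }

  // No terminal event means the stream was cut off (for example a dropped connection).
  return sawStop
    ? { ok: true, text }
    : { ok: false, partialText: text, reason: "stream_closed_early" };
}
```

Returning the partial text alongside the failure lets the caller choose between discarding it, showing it with a warning, or retrying.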
Implementation sketch
import { streamText } from 'ai'
import { openai } from '@ai-sdk/openai'

// The field names used below (textDelta, usage.totalTokens) match AI SDK 4.x;
// verify them against the version you have installed.
const result = streamText({
  model: openai('gpt-4o'),
  prompt: 'Summarise the streaming pattern in three sentences.',
  maxRetries: 2,
})

// Consume the union stream so text deltas, finish reason, and mid-stream
// errors all surface from a single for-await loop.
for await (const part of result.fullStream) {
  switch (part.type) {
    case 'text-delta':
      process.stdout.write(part.textDelta)
      break
    case 'finish':
      console.log('\nfinish:', part.finishReason, 'tokens:', part.usage.totalTokens)
      break
    case 'error':
      // The HTTP status was already 200 by this point; this frame is the failure signal.
      console.error('mid-stream error:', part.error)
      break
  }
}

export {}
- Vercel AI SDK
- LangChain
- LangGraph
- OpenAI Agents
- Mastra
References
- DOCS: Streaming output
Streaming changes when the model's output reaches the consumer: bytes leave the inference server as they are generated, rather than accumulating until the full response is ready. The transport is almost always Server-Sent Events: a long-lived HTTP response whose body is a sequence of text frames, each terminated by a blank line, that the client parses as it reads. Each frame carries a small typed delta: a chunk of generated text, a partial JSON fragment of a tool call, an updated finish reason, or a terminal event that closes the stream. The consumer reconstructs the final state by accumulating deltas in order; nothing is broadcast and nothing is replayed.
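To make the wire format concrete, here is a minimal sketch of an incremental SSE parser. It assumes the response body has already been decoded to text chunks, which may split a frame at any point. The SseParser class and parseBlock helper are illustrative names, and the sketch handles only the event and data fields (plus comment lines), not id, retry, or bare carriage-return line endings.

```typescript
interface SseEvent {
  event: string; // "message" when the frame has no event: field
  data: string;
}

// Parse one frame (the text between blank lines) into an event, if it carries data.
function parseBlock(block: string): SseEvent | null {
  let event = "message";
  const dataLines: string[] = [];
  for (const line of block.split("\n")) {
    if (line === "" || line.startsWith(":")) continue; // comment or keep-alive
    const colon = line.indexOf(":");
    const field = colon === -1 ? line : line.slice(0, colon);
    let value = colon === -1 ? "" : line.slice(colon + 1);
    if (value.startsWith(" ")) value = value.slice(1); // strip one leading space
    if (field === "event") event = value;
    else if (field === "data") dataLines.push(value);
  }
  return dataLines.length > 0 ? { event, data: dataLines.join("\n") } : null;
}

class SseParser {
  private buffer = "";

  // Append a chunk and return every event that is now complete.
  push(chunk: string): SseEvent[] {
    // Normalise on the whole buffer so a CRLF split across chunks is still caught.
    this.buffer = (this.buffer + chunk).replace(/\r\n/g, "\n");
    const events: SseEvent[] = [];
    let boundary = this.buffer.indexOf("\n\n");
    while (boundary !== -1) {
      const parsed = parseBlock(this.buffer.slice(0, boundary));
      this.buffer = this.buffer.slice(boundary + 2);
      if (parsed) events.push(parsed);
      boundary = this.buffer.indexOf("\n\n");
    }
    return events;
  }
}
```

Keeping the unfinished tail in a buffer is the essential step: a frame is only handed to the consumer once its terminating blank line has arrived, so a delta is never parsed half-received.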
Background · context and trade-offs
Three sub-variants share the same wire shape but differ in what is being incrementally revealed. Token streaming emits text deltas one fragment at a time and is what chat UIs render as the reply appears piece by piece. Structured streaming emits a partial JSON object whose shape is fixed by a schema: the Vercel AI SDK's streamObject exposes the partial as a typed value at every step, so a form fills in field-by-field instead of appearing all at once. Tool-call streaming emits the arguments of a function call as they are generated; a long argument list becomes visible before the model has finished writing it, and a UI can begin rendering the call before dispatch.
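One way to picture how these variants share a consumer is a single reducer that folds deltas into state in arrival order. This is a sketch under simplifying assumptions: the Delta union, StreamState, applyDelta, and tryParseArgs are all hypothetical names, not any SDK's types, and tool-call arguments are kept as raw JSON text until they parse.

```typescript
// Hypothetical delta shapes covering text and tool-call streaming.
type Delta =
  | { kind: "text"; text: string }
  | { kind: "tool-args"; callId: string; fragment: string }
  | { kind: "finish"; reason: string };

interface StreamState {
  text: string;
  toolArgs: Record<string, string>; // raw JSON text accumulated per tool call
  finishReason: string | null;
}

const emptyState: StreamState = { text: "", toolArgs: {}, finishReason: null };

// Fold one delta into the state; order matters, and nothing is replayed.
function applyDelta(state: StreamState, delta: Delta): StreamState {
  switch (delta.kind) {
    case "text":
      return { ...state, text: state.text + delta.text };
    case "tool-args": {
      const soFar = state.toolArgs[delta.callId] ?? "";
      return {
        ...state,
        toolArgs: { ...state.toolArgs, [delta.callId]: soFar + delta.fragment },
      };
    }
    case "finish":
      return { ...state, finishReason: delta.reason };
  }
}

// Returns the parsed arguments once the accumulated JSON is complete, else undefined.
function tryParseArgs(state: StreamState, callId: string): unknown {
  const raw = state.toolArgs[callId];
  if (raw === undefined) return undefined;
  try {
    return JSON.parse(raw);
  } catch {
    return undefined; // still mid-generation
  }
}
```

A UI can render state.toolArgs as raw text while the call is still being written and switch to the parsed value once tryParseArgs succeeds. Schema-aware partial parsing, as streamObject does, goes further than this sketch by exposing individual fields before the object closes.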
The pattern is a UX contract change as much as a transport detail. A thirty-second response that streams feels usable because the user sees evidence of work within the first hundred milliseconds; the same thirty seconds behind a single blocking request reads as a hung connection. Streaming is distinct from polling, where the client repeatedly asks whether the result is ready, and from progress events, which are out-of-band metadata about an otherwise-blocking operation. With streaming, the partial output is the operation.