Streaming
Also known as: Token Streaming, Incremental Output, Server-Sent Deltas
Deliver partial output as it is generated, not after the model has finished.
Decision
| Use when ✓ | Avoid when ✗ |
|---|---|
| Apply when the end-to-end latency of a full response would feel broken to a human reader: chat assistants, code completion, anything a person watches arrive in real time. | When the consumer is another program that needs the full response to act (a batch evaluator, a webhook handler, a structured-data extractor), streaming buys nothing and complicates retry semantics. |
| Use where the consumer can act on partial output before the model finishes: a UI that renders tokens as they arrive, or a downstream pipeline that processes structured fields the moment each one closes. | Without an HTTP client and proxy chain that support long-lived connections, streaming silently degrades: a buffering reverse proxy holds the response until the stream closes, and the user sees the same blocking behaviour you tried to avoid. |
| Reach for it when you need to surface mid-generation telemetry (token counts, finish reason, tool calls under construction) that a single returned blob exposes only after generation has finished. | When the output is short enough that the network round trip dominates generation time, the framing overhead of SSE is wasted and a single response is simpler. |
| Prefer it for tool-call workflows where the client wants to validate, render, or short-circuit a tool invocation while the arguments are still being generated. | |
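The buffering-proxy failure mode in the table can be partly addressed on the server side. The following is a minimal sketch, not a complete server: `sseHeaders` and `formatSseFrame` are hypothetical helper names, and the header set is an assumption about a typical deployment. `X-Accel-Buffering: no` is honoured by nginx; other proxies need their own configuration.

```typescript
// Response headers commonly sent by an SSE endpoint (an illustrative set, not exhaustive).
function sseHeaders(): Record<string, string> {
  return {
    "Content-Type": "text/event-stream", // tells the client to parse the body as SSE
    "Cache-Control": "no-cache", // intermediaries should not cache the stream
    "X-Accel-Buffering": "no", // asks nginx not to buffer this response
  };
}

// Serialise one frame: optional event name, a single JSON data line, blank-line terminator.
// JSON.stringify never emits raw newlines, so one data line is always enough here.
function formatSseFrame(data: unknown, event?: string): string {
  const head = event ? `event: ${event}\n` : "";
  return `${head}data: ${JSON.stringify(data)}\n\n`;
}

console.log(sseHeaders());
console.log(JSON.stringify(formatSseFrame({ text: "Hel" }, "delta")));
```

Headers alone do not guarantee unbuffered delivery; every hop between the inference server and the reader has to pass chunks through as they arrive.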
In the wild
| Source | Claim |
|---|---|
| docs.anthropic.com | Anthropic's Messages API documents the canonical event flow (message_start; then content_block_start, a series of content_block_delta frames, and content_block_stop for each content block; then message_delta and message_stop) that every Claude client (Console, Claude Code, third-party SDKs) consumes to render incremental output. |
| ai-sdk.dev | The Vercel AI SDK ships streamText and streamObject as first-class primitives; streamObject parses partial JSON against a Zod schema and yields a typed partial value at every delta, which is what powers form-fill and structured-output UIs in the Vercel templates. |
| github.com | OpenAI's cookbook publishes a runnable notebook that opens a chat completion with stream=True, iterates the SSE response, and prints each text delta as it arrives; this is the reference shape every OpenAI-compatible client implements. |
Reader gotcha
A streaming request returns HTTP 200 as soon as the response headers are sent, before the model has generated a single token, so transport-layer error handling that branches only on status code will mark a half-completed or overload-aborted response as success. Anthropic documents that mid-stream errors such as overloaded_error arrive as event frames inside the already-200 response. The client must inspect each frame and treat error events as failures, and must also treat a connection that drops before the terminal event as a failure, rather than trusting the HTTP status alone.
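As an illustration of the gotcha, the sketch below folds already-parsed frames into a success or failure outcome. The `Frame` shape, the `collect` helper, and the `truncated_stream` label are assumptions made for this example, loosely modelled on the event names quoted in this entry; they are not types from any SDK.

```typescript
// Hypothetical parsed frame: `event` is the SSE event name, `data` its decoded JSON payload.
interface Frame {
  event: string;
  data: {
    error?: { type: string; message: string };
    delta?: { text?: string };
  };
}

type StreamOutcome =
  | { ok: true; text: string }
  | { ok: false; errorType: string; message: string; partialText: string };

// An error frame, or a stream that ends without message_stop, is a failure
// even though the HTTP status was already 200.
function collect(frames: readonly Frame[]): StreamOutcome {
  let text = "";
  for (const frame of frames) {
    switch (frame.event) {
      case "content_block_delta":
        text += frame.data.delta?.text ?? "";
        break;
      case "error":
        return {
          ok: false,
          errorType: frame.data.error?.type ?? "unknown_error",
          message: frame.data.error?.message ?? "stream reported an error",
          partialText: text,
        };
      case "message_stop":
        return { ok: true, text };
    }
  }
  // "truncated_stream" is a label invented for this sketch.
  return {
    ok: false,
    errorType: "truncated_stream",
    message: "stream ended before message_stop",
    partialText: text,
  };
}

const overloaded: Frame[] = [
  { event: "content_block_delta", data: { delta: { text: "Partial ans" } } },
  { event: "error", data: { error: { type: "overloaded_error", message: "Overloaded" } } },
];

const outcome = collect(overloaded);
if (!outcome.ok) {
  console.error(`failed with ${outcome.errorType}; partial text kept for diagnostics only`);
}
```

Returning the partial text alongside the failure lets the caller decide whether to discard it, show it with a warning, or retry.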
Implementation sketch
```typescript
import { streamText } from 'ai'
import { openai } from '@ai-sdk/openai'

// Property names on stream parts (textDelta, usage.totalTokens) vary between
// AI SDK major versions; check them against the version you have installed.
const result = streamText({
  model: openai('gpt-4o'),
  prompt: 'Summarise the streaming pattern in three sentences.',
  maxRetries: 2,
})

// Consume the union stream so text deltas, finish reason, and mid-stream
// errors all surface from a single for-await loop.
for await (const part of result.fullStream) {
  switch (part.type) {
    case 'text-delta':
      // Write each fragment immediately instead of buffering the full text.
      process.stdout.write(part.textDelta)
      break
    case 'finish':
      console.log('\nfinish:', part.finishReason, 'tokens:', part.usage.totalTokens)
      break
    case 'error':
      // The HTTP status was already 200, so this is the only failure signal.
      console.error('mid-stream error:', part.error)
      break
  }
}

export {}
```
References
- Anthropic · 2025
canonical SSE event flow: message_start, content_block_delta, message_delta, message_stop, plus tool-call input_json_delta
- Streaming output · Anthropic · 2025
mid-stream error events on an already-200 response; basis for the reader gotcha
- Vercel · 2025
first-party TypeScript primitives streamText and streamObject; partial-output stream typed against a Zod schema
- LangChain · 2025
cross-vendor framework view: stream, astream, and astream_events as the three consumer surfaces
- OpenAI · 2024
reference notebook for OpenAI-compatible SSE delta consumption with stream=True
- WHATWG · 2025
underlying wire format every LLM streaming API rides on; defines the EventSource interface and reconnection semantics
Overview · 1-paragraph mechanism
Streaming changes when the model's output reaches the consumer: bytes leave the inference server as they are generated, rather than accumulating until the full response is ready. The transport is almost always Server-Sent Events (SSE), a long-lived HTTP response whose body is a sequence of text frames, each made of data: lines and terminated by a blank line, which the client parses as it reads. Each frame carries a small typed delta: a chunk of generated text, a partial JSON fragment of a tool call, an updated finish reason, or a terminal event that closes the stream. The consumer reconstructs the final state by accumulating deltas in order; nothing is broadcast and nothing is replayed.
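The mechanism can be sketched in a few lines. This is a simplified illustration, not a production parser: it assumes the whole body is already available as one string, whereas a real client has to buffer across network chunk boundaries. The `Delta` shape and the function names are hypothetical.

```typescript
// Hypothetical delta shapes for illustration; real providers define their own event types.
type Delta =
  | { type: "text_delta"; text: string }
  | { type: "stop"; finishReason: string };

// Split SSE text into the payloads of its data: lines.
// Frames end at a blank line; several data lines in one frame are joined with "\n".
function parseSseFrames(raw: string): string[] {
  return raw
    .split(/\r?\n\r?\n/)
    .map((frame) =>
      frame
        .split(/\r?\n/)
        .filter((line) => line.startsWith("data:"))
        .map((line) => line.slice(5).replace(/^ /, ""))
        .join("\n"),
    )
    .filter((payload) => payload.length > 0);
}

// Rebuild the final state by applying deltas in arrival order.
function accumulate(raw: string): { text: string; finishReason: string | null } {
  let text = "";
  let finishReason: string | null = null;
  for (const payload of parseSseFrames(raw)) {
    const delta = JSON.parse(payload) as Delta;
    if (delta.type === "text_delta") text += delta.text;
    else finishReason = delta.finishReason;
  }
  return { text, finishReason };
}

const wire =
  'data: {"type":"text_delta","text":"Hel"}\n\n' +
  'data: {"type":"text_delta","text":"lo"}\n\n' +
  'data: {"type":"stop","finishReason":"end_turn"}\n\n';

const finalState = accumulate(wire);
console.log(finalState.text, finalState.finishReason); // prints "Hello end_turn"
```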
Background · context and trade-offs
Three sub-variants share the same wire shape but differ in what is being incrementally revealed. Token streaming emits text deltas one fragment at a time and is what chat UIs render as the reply appears piece by piece. Structured streaming emits a partial JSON object whose shape is fixed by a schema; the Vercel AI SDK's streamObject exposes the partial as a typed value at every step, so a form fills in field by field instead of appearing all at once. Tool-call streaming emits the arguments of a function call as they are generated; a long argument list becomes visible before the model has finished writing it, and a UI can begin rendering the call before dispatch.
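Tool-call streaming is the least obvious of the three, so here is a minimal sketch of the accumulation it requires. The event names (`tool_start`, `args_delta`, `tool_stop`) and the `assembleToolCalls` helper are hypothetical and do not match any provider's wire format exactly; the point is that argument fragments are concatenated per block and parsed only once the block closes.

```typescript
// Hypothetical event union; real APIs use their own names and fields.
type ToolEvent =
  | { type: "tool_start"; index: number; name: string }
  | { type: "args_delta"; index: number; partialJson: string }
  | { type: "tool_stop"; index: number };

interface ToolCall {
  name: string;
  rawArgs: string; // concatenated JSON fragments, possibly incomplete
  args?: unknown; // set only after the block has closed
}

function assembleToolCalls(
  events: readonly ToolEvent[],
  onPartial?: (call: ToolCall) => void,
): ToolCall[] {
  const calls = new Map<number, ToolCall>();
  for (const ev of events) {
    if (ev.type === "tool_start") {
      calls.set(ev.index, { name: ev.name, rawArgs: "" });
      continue;
    }
    const call = calls.get(ev.index);
    if (!call) throw new Error(`event for unopened block ${ev.index}`);
    if (ev.type === "args_delta") {
      call.rawArgs += ev.partialJson;
      onPartial?.(call); // a UI could render the raw fragment here
    } else {
      // Only at this point is the accumulated text expected to be valid JSON.
      call.args = JSON.parse(call.rawArgs || "{}");
    }
  }
  return Array.from(calls.values());
}

const toolEvents: ToolEvent[] = [
  { type: "tool_start", index: 0, name: "get_weather" },
  { type: "args_delta", index: 0, partialJson: '{"city": "Par' },
  { type: "args_delta", index: 0, partialJson: 'is", "unit": "c"}' },
  { type: "tool_stop", index: 0 },
];

const partials: string[] = [];
const finished = assembleToolCalls(toolEvents, (call) => partials.push(call.rawArgs));
console.log(partials.length, finished[0].args);
```

Parsing at every delta would throw on most fragments, which is why structured-streaming libraries use a tolerant partial-JSON parser rather than JSON.parse.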
The pattern is a UX contract change as much as a transport detail. A thirty-second response that streams feels usable because the user sees evidence of work within the first hundred milliseconds; the same thirty seconds behind a single blocking request reads as a hung connection. Streaming is distinct from polling, where the client repeatedly asks whether the result is ready, and from progress events, which are out-of-band metadata about an otherwise-blocking operation. With streaming, the partial output is the operation.