Layer 4: Interfaces & Transport

Streaming

Also known as: Token Streaming, Incremental Output, Server-Sent Deltas

Sends partial output frame by frame as it is generated, not after the model finishes.

Token-by-token, field-by-field, or argument-by-argument: streaming sends typed SSE deltas as the model generates them; the client accumulates and reconstructs rather than waiting for a complete response.

Claude Code

  • Claude Code streams tool output to the terminal by default — no configuration needed for the core streaming UX.
  • Use hooks to intercept PostToolUse events and act on completed tool results before the next model call.
  • In custom agent code, use streamText from the AI SDK and inspect mid-stream error frames — an HTTP 200 does not guarantee a clean stream.
  • Set maxRetries explicitly on your streaming client, and check the proxy chain separately: a buffering reverse proxy holds the response until the stream closes, which defeats incremental delivery and is not something retries can fix.

Primitives

  • PostToolUse hooks (post-tool event handling)
  • Terminal streaming (built-in)
  • Vercel AI SDK streamText

Cursor

  • Cursor streams tokens in the chat panel by default — the incremental UX is built-in for all Agent and Ask mode responses.
  • Reference large files via @file rather than pasting content; Cursor fetches and injects context on demand without pre-loading everything.
  • Use Ask mode for read-only queries where you want to inspect the partial output before any edits are applied.
  • Use cloud agents for long-running tasks; they run in isolated cloud environments and surface progress through the Cursor web interface.

Primitives

  • Built-in chat streaming
  • Ask mode (read-only stream)
  • Cloud agents (async long-running tasks)
  • @file references

Decision

Use when ✓

  • Use this when the end-to-end latency of a full response would feel broken to a human reader: chat assistants, code completion, anything a person watches arrive in real time.
  • Justified where the consumer can act on partial output before the model finishes: a UI that renders tokens as they arrive, a downstream pipeline that processes structured fields the moment each one closes.
  • A good fit when you need to surface mid-generation telemetry (token counts, finish reason, tool calls under construction) that is impossible to inspect once the response is a single returned blob.
  • Best for tool-call workflows where the client wants to validate, render, or short-circuit a tool invocation while the arguments are still being generated.

Avoid when ✗

  • When the consumer is another program that needs the full response to act (a batch evaluator, a webhook handler, a structured-data extractor), streaming buys nothing and complicates retry semantics.
  • Without an HTTP client and proxy chain that supports long-lived connections, streaming silently degrades: a buffering reverse proxy will hold the response until the stream closes and the user sees the same blocking behaviour you tried to avoid.
  • When the output is short enough that the network round trip dominates generation time, the framing overhead of SSE is wasted and a single response is simpler.
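
The proxy-buffering caveat is usually tackled at the origin by sending response headers that ask intermediaries not to hold the body. The sketch below is one way to do that, under stated assumptions: `sseHeaders` and `formatFrame` are hypothetical helper names, and `X-Accel-Buffering: no` is an nginx-specific hint, so other proxies and CDNs may need their own configuration.

```typescript
// Hypothetical helpers for an SSE origin; the names are illustrative only.

/** Response headers commonly sent with an SSE stream. */
function sseHeaders(): Record<string, string> {
  return {
    'Content-Type': 'text/event-stream',
    // Ask caches and intermediaries not to store or transform the body.
    'Cache-Control': 'no-cache, no-transform',
    // nginx-specific hint that disables response buffering for this response.
    'X-Accel-Buffering': 'no',
  }
}

/** Serialises one SSE frame; the trailing blank line terminates the frame. */
function formatFrame(event: string, data: unknown): string {
  return `event: ${event}\ndata: ${JSON.stringify(data)}\n\n`
}
```

A handler would write the headers once, then call `formatFrame` for each delta and flush after every write. Headers alone are not a guarantee: it is worth confirming with a real client that frames arrive before the stream closes.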

In the wild

Source: claim

  • docs.anthropic.com: Anthropic's Messages API documents the canonical event flow (message_start, a series of content_block_delta frames per content block, message_delta, message_stop) that every Claude client (Console, Claude Code, third-party SDKs) consumes to render incremental output.
  • ai-sdk.dev: The Vercel AI SDK ships streamText and streamObject as first-class primitives; streamObject parses partial JSON against a Zod schema and yields a typed partial value at every delta, which is what powers form-fill and structured-output UIs in the Vercel templates.
  • github.com: OpenAI's cookbook publishes a runnable notebook that opens a chat completion with stream=True, iterates the SSE response, and prints each text delta as it arrives. It is the reference shape every OpenAI-compatible client implements.
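
To make the Anthropic event flow concrete, here is a minimal sketch of a client-side reducer that folds those events into a final message. The event types are trimmed to the fields the sketch reads, and `MessageState`, `initialState`, and `applyEvent` are hypothetical names, not part of any SDK.

```typescript
// Event shapes trimmed to the fields this sketch uses (text deltas only).
type StreamEvent =
  | { type: 'message_start' }
  | { type: 'content_block_start'; index: number }
  | { type: 'content_block_delta'; index: number; delta: { type: 'text_delta'; text: string } }
  | { type: 'content_block_stop'; index: number }
  | { type: 'message_delta'; delta: { stop_reason: string | null } }
  | { type: 'message_stop' }

interface MessageState {
  blocks: string[] // accumulated text, one entry per content block index
  stopReason: string | null
  done: boolean
}

function initialState(): MessageState {
  return { blocks: [], stopReason: null, done: false }
}

// Folds one event into the state; events must be applied in arrival order.
function applyEvent(state: MessageState, event: StreamEvent): MessageState {
  switch (event.type) {
    case 'content_block_start':
      state.blocks[event.index] = ''
      return state
    case 'content_block_delta':
      state.blocks[event.index] = (state.blocks[event.index] ?? '') + event.delta.text
      return state
    case 'message_delta':
      state.stopReason = event.delta.stop_reason
      return state
    case 'message_stop':
      state.done = true
      return state
    default:
      // message_start and content_block_stop carry nothing this sketch needs.
      return state
  }
}
```

A consumer could call `applyEvent` inside its read loop and re-render after each call, or run `events.reduce(applyEvent, initialState())` over a recorded stream in a test.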

Reader gotcha

A streaming request returns HTTP 200 the moment the connection opens, before the model has generated a single token, so transport-layer error handling that branches on status code will mark a half-completed or overload-aborted response as success. Anthropic documents that mid-stream errors (overloaded_error, network drops) arrive as event frames inside the already-200 response; the client must inspect each frame and treat error events as failures, rather than only checking the HTTP status when the connection closes. (See reference 2.)
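
One way to apply the gotcha is to wrap the frame stream in a guard that turns failure conditions into exceptions. This is a sketch under assumptions: `Frame` is a hypothetical parsed-payload shape loosely modelled on Anthropic's error event, and `StreamError` and `failOnStreamErrors` are illustrative names.

```typescript
// Hypothetical shape: the parsed JSON payload of one SSE data line.
interface Frame {
  type: string
  error?: { type: string; message: string }
}

class StreamError extends Error {
  kind: string
  constructor(kind: string, message: string) {
    super(message)
    this.name = 'StreamError'
    this.kind = kind
  }
}

// Passes frames through unchanged, but throws on an explicit error frame
// or when the stream ends without its terminal message_stop frame.
async function* failOnStreamErrors(frames: AsyncIterable<Frame>): AsyncGenerator<Frame> {
  let sawStop = false
  for await (const frame of frames) {
    if (frame.type === 'error') {
      throw new StreamError(
        frame.error?.type ?? 'unknown_error',
        frame.error?.message ?? 'stream error',
      )
    }
    if (frame.type === 'message_stop') sawStop = true
    yield frame
  }
  if (!sawStop) {
    throw new StreamError('truncated_stream', 'stream closed before message_stop')
  }
}
```

With this guard in place, a caller's existing try/catch around the for-await loop sees a mid-stream overload the same way it would see a non-200 status, so retry and alerting logic stays in one place.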

Implementation sketch

import { streamText } from 'ai'
import { openai } from '@ai-sdk/openai'

const result = streamText({
  model: openai('gpt-4o'),
  prompt: 'Summarise the streaming pattern in three sentences.',
  maxRetries: 2,
})

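// Consume the union stream so text deltas, finish reason, and mid-stream
// errors all surface from a single for-await loop. The part fields read below
// (textDelta, usage.totalTokens) follow AI SDK 4.x; other major versions may
// name them differently.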
for await (const part of result.fullStream) {
  switch (part.type) {
    case 'text-delta':
      process.stdout.write(part.textDelta)
      break
    case 'finish':
      console.log('\nfinish:', part.finishReason, 'tokens:', part.usage.totalTokens)
      break
    case 'error':
      // An error part means the stream failed even though the HTTP status was 200.
      console.error('mid-stream error:', part.error)
      process.exitCode = 1
      break
  }
}

export {}

References

  1. Anthropic, 2025. Canonical SSE event flow: message_start, content_block_delta, message_delta, message_stop, plus tool-call input_json_delta.

  2. Anthropic, 2025. Mid-stream error events on an already-200 response; basis for the reader gotcha.

  3. Vercel, 2025. First-party TypeScript primitives streamText and streamObject; partial-output stream typed against a Zod schema.

  4. LangChain, 2025. Cross-vendor framework view: stream, astream, and astream_events as the three consumer surfaces.

  5. OpenAI, 2024. Reference notebook for OpenAI-compatible SSE delta consumption with stream=True.

  6. WHATWG, 2025. Underlying wire format most LLM streaming APIs ride on; defines the EventSource interface and reconnection semantics.

Streaming changes when the model's output reaches the consumer: bytes leave the inference server the moment they are generated, rather than accumulating until the full response is ready. The transport is almost always Server-Sent Events: a long-lived HTTP response whose body is a sequence of text frames, each made of field lines such as event: and data: and terminated by a blank line, which the client parses as it reads. Each frame carries a small typed delta: a chunk of generated text, a partial JSON fragment of a tool call, an updated finish reason, or a terminal event that closes the stream. The consumer reconstructs the final state by accumulating deltas in order; nothing is broadcast and nothing is replayed.
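
As an illustration of the wire format described above, the following sketch parses SSE text incrementally. It is deliberately partial: `createSseParser` is a hypothetical name, only the event: and data: fields are handled, and bare CR line endings plus the id: and retry: fields from the WHATWG specification are ignored.

```typescript
interface SseEvent {
  event: string // "message" when the frame has no event: field
  data: string // data: lines joined with "\n"
}

// Incremental parser: feed it decoded text chunks, get back completed events.
// An incomplete trailing frame stays buffered until a later chunk finishes it.
function createSseParser(): (chunk: string) => SseEvent[] {
  let buffer = ''
  return function push(chunk: string): SseEvent[] {
    // Normalise on the whole buffer so a CRLF split across chunks is still caught.
    buffer = (buffer + chunk).replace(/\r\n/g, '\n')
    const events: SseEvent[] = []
    let boundary = buffer.indexOf('\n\n')
    while (boundary !== -1) {
      const rawFrame = buffer.slice(0, boundary)
      buffer = buffer.slice(boundary + 2)
      let event = 'message'
      const dataLines: string[] = []
      for (const line of rawFrame.split('\n')) {
        if (line.startsWith(':')) continue // comment or keep-alive line
        const colon = line.indexOf(':')
        const field = colon === -1 ? line : line.slice(0, colon)
        let value = colon === -1 ? '' : line.slice(colon + 1)
        if (value.startsWith(' ')) value = value.slice(1) // one optional leading space
        if (field === 'event') event = value
        else if (field === 'data') dataLines.push(value)
      }
      if (dataLines.length > 0) events.push({ event, data: dataLines.join('\n') })
      boundary = buffer.indexOf('\n\n')
    }
    return events
  }
}
```

In practice the chunks would come from a `TextDecoder` in streaming mode reading the response body, and each returned `data` string would be passed to `JSON.parse` before being handed to whatever accumulates the deltas.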

Background · context and trade-offs

Three sub-variants share the same wire shape but differ in what is being incrementally revealed. Token streaming emits text deltas one fragment at a time and is what chat UIs render as text that appears to type itself. Structured streaming emits a partial JSON object whose shape is fixed by a schema: the Vercel AI SDK's streamObject exposes the partial as a typed value at every step, so a form fills in field-by-field instead of appearing all at once. Tool-call streaming emits the arguments of a function call as they are generated; a long argument list becomes visible before the model has finished writing it, and a UI can begin rendering the call before dispatch.
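
For the tool-call variant, a client typically keeps the raw argument text for display and only treats the arguments as usable once they parse. A minimal sketch, assuming the fragments are the string pieces of one JSON object (for example the partial_json carried by an input_json_delta); `createToolCallAccumulator` and `ToolCallSnapshot` are hypothetical names.

```typescript
// Hypothetical accumulator for a single streamed tool call; not an SDK API.
interface ToolCallSnapshot {
  name: string
  rawArguments: string // everything received so far, safe to render as-is
  parsed?: unknown // set only when the accumulated text is valid JSON
}

function createToolCallAccumulator(name: string) {
  let raw = ''
  return {
    // Appends one argument fragment and returns the current view of the call.
    push(fragment: string): ToolCallSnapshot {
      raw += fragment
      const snapshot: ToolCallSnapshot = { name, rawArguments: raw }
      try {
        snapshot.parsed = JSON.parse(raw)
      } catch {
        // Still mid-object: leave parsed undefined and keep accumulating.
      }
      return snapshot
    },
  }
}
```

A UI can render `rawArguments` on every push to show the call taking shape. Dispatch is better gated on the stream's own end-of-block signal than on `parsed` alone, since a successful parse only shows that the text so far happens to be valid JSON.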

The pattern is a UX contract change as much as a transport detail. A thirty-second response that streams feels usable because the user sees evidence of work almost as soon as the first token is generated; the same thirty seconds behind a single blocking request reads as a hung connection. Streaming is distinct from polling, where the client repeatedly asks whether the result is ready, and from progress events, which are out-of-band metadata about an otherwise-blocking operation. With streaming, the partial output is the operation.