Runs & streaming
A run is one turn: input messages in, events while it works, an output record at the end. Runs are where every other feature meets: tools pause them, approvals gate them, usage meters them, traces time them.
Anatomy of a run
POST /v1/smiths/{sid}/runs
→ queued → running → completed
↘ paused_for_tool (an external-execution tool needs you)
↘ paused_for_approval (a human must approve)
↘ failed / cancelled
A paused run resumes when you submit the missing piece via /submit,
the same endpoint for tool results, approval decisions, and cancellation.
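
As a rough sketch of how a client might react to each status in the lifecycle above, the helper below maps a status string to a next step. The name `next_action` and the three groupings are illustrative assumptions; only the status strings themselves come from this page.

```python
# Status strings are taken from the lifecycle above; the groupings are a
# client-side convention assumed for this sketch, not something the API defines.
IN_FLIGHT = frozenset({"queued", "running"})
PAUSED = frozenset({"paused_for_tool", "paused_for_approval"})
TERMINAL = frozenset({"completed", "failed", "cancelled"})


def next_action(status: str) -> str:
    """Return 'wait', 'submit', or 'done' for a run status."""
    if status in IN_FLIGHT:
        return "wait"    # keep streaming or polling
    if status in PAUSED:
        return "submit"  # the run needs a /submit call to resume
    if status in TERMINAL:
        return "done"    # nothing more will happen on this run
    raise ValueError(f"unknown run status: {status!r}")
```

A polling loop could call `next_action(run["status"])` after each read and stop as soon as it returns anything other than `"wait"`.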
Threads carry conversation history: pass a stable thread_id per
conversation (any string of yours, or omit it and one is minted with a
thr_ prefix and returned on the run) and the smith sees the recent turns of
that thread.
input is a list of messages: { "role": "user" | "assistant", "content": "<text>" }. content is a plain string for a text turn, or an array of typed
parts ({ "type": "text" | "image" | "file", … }) for a multimodal turn — when
you read a run back, a turn that carried an image or file comes back in the
parts form. Multiple messages are allowed (e.g. to replay context); system
behaviour comes from the smith's resolved instructions, not a system
message. For long turns prefer "stream": true. A synchronous call holds the
connection open for the whole turn.
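
The request body described above can be assembled with a small helper. This is a minimal sketch: `build_run_request` is a hypothetical name, and the client-side role check simply mirrors the two roles listed here (the API's own validation may be stricter or looser).

```python
from typing import Any, Iterable, Optional, Tuple, Union

ALLOWED_ROLES = {"user", "assistant"}  # the two roles named on this page

Content = Union[str, list]  # plain string, or a list of typed parts


def build_run_request(
    messages: Iterable[Tuple[str, Content]],
    thread_id: Optional[str] = None,
    stream: bool = False,
) -> dict[str, Any]:
    """Build the JSON body for POST /v1/smiths/{sid}/runs from (role, content) pairs."""
    input_messages = []
    for role, content in messages:
        if role not in ALLOWED_ROLES:
            # System behaviour comes from the smith's instructions, not a message.
            raise ValueError(f"unsupported role: {role!r}")
        if not isinstance(content, (str, list)):
            raise TypeError("content must be a string or a list of typed parts")
        input_messages.append({"role": role, "content": content})
    if not input_messages:
        raise ValueError("input needs at least one message")

    body: dict[str, Any] = {"input": input_messages}
    if thread_id is not None:
        body["thread_id"] = thread_id  # omit it and the API mints one
    if stream:
        body["stream"] = True
    return body
```

For example, `build_run_request([("user", "Summarize what we discussed yesterday.")], thread_id="chat_42")` yields the same body as the synchronous curl call in the next section.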
Synchronous runs
# Authorization: tenant-admin token (server-side only), or a smith token
curl https://api.cloud.ingram.tech/v1/smiths/smt_…/runs \
-H "Authorization: Bearer $IC_TOKEN" \
-H "IC-Api-Version: 2026-05-01" \
-H "Content-Type: application/json" \
-d '{ "input": [{ "role": "user", "content": "Summarize what we discussed yesterday." }],
"thread_id": "chat_42" }'
The response is the run record:
{ "id": "run_…", "smith_id": "smt_…", "thread_id": "chat_42",
"status": "queued | running | completed | paused_for_tool | paused_for_approval | failed | cancelled",
"output": { "content": "…", "tool_calls": [ { "id": "call_1", "type": "function",
"function": { "name": "get_weather", "arguments": "{…}" } } ] },
"stop_reason": "end_turn",
"usage": { "input_tokens": 0, "output_tokens": 0, "total_tokens": 0,
"cost": 0.0241 } }
output.tool_calls follows the OpenAI tool_calls shape ({ id, type, function })
for a model-driven call; an approval pause instead surfaces a
pending-call object naming the tool and its arguments. A structured run (one that
passed response_format) sets output.content_type (e.g. application/json) so a
reader knows how to parse output.content. When the agent offered
quick-reply chips on a chat channel,
output.suggested_replies lists the labels it presented. usage reports this run's own token
counts and, when the model has a price-book entry, its priced cost in your
account currency: the run's own line item, with no separate query needed. (A turn on an
unpriced model carries tokens but omits cost.) The usage API
aggregates these figures across runs into your billing summary.
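
Reading a run record back might look like the sketch below. The helper name `read_output` and the shape of its return value are assumptions made for illustration; the field names (`output.content`, `output.content_type`, `output.tool_calls`, `usage.cost`) are the ones described above.

```python
import json
from typing import Any


def read_output(run: dict[str, Any]) -> dict[str, Any]:
    """Pull the commonly needed fields out of a run record."""
    output = run.get("output") or {}
    content = output.get("content")

    # A structured run marks its content type; the content itself is a JSON string.
    if output.get("content_type") == "application/json" and isinstance(content, str):
        content = json.loads(content)

    usage = run.get("usage") or {}
    return {
        "content": content,
        # OpenAI-style tool_calls: { id, type, function: { name, arguments } }
        "tool_names": [c["function"]["name"] for c in output.get("tool_calls") or []],
        "total_tokens": usage.get("total_tokens"),
        # cost is omitted for an unpriced model, so None here means "not priced"
        "cost": usage.get("cost"),
    }
```

Treating a missing `cost` as `None` rather than `0` keeps "unpriced" distinguishable from "free" in downstream reporting.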
Reads: GET /v1/smiths/{sid}/runs/{rid} (one),
GET /v1/smiths/{sid}/runs (per smith),
GET /v1/runs?status=&smith_id=&agent_id= (project-wide feed; agent_id
pulls every run across one agent's smiths),
GET /v1/smiths/{sid}/runs/{rid}/events (SSE replay of the recorded log).
Which tools the run resolved
Every run records the tool set it resolved under metadata.tools, so "did my
MCP servers actually reach the model?" is a one-read fact instead
of something you infer from the model behaving as if it had none:
"metadata": { "tools": {
"total": 4,
"mcp": [{ "server": "librarian", "tools": 3 }],
"hosted": ["web_search"], "deployment": 0, "memory": 0,
"errors": [{ "server": "acme", "error": "runtime load failed: stored secret could not be decoded" }]
} }
mcp lists each registered MCP server with the number of its tools that passed
the allow-list and reached the model — an empty list or a 0
count is the signal that a server contributed nothing to this run. errors names
any registered server the run skipped and why (e.g. a stored secret that can
no longer be decoded), so a missing tool reads as a recorded reason rather than a
guess — the same failure also flips that server to degraded on
GET /v1/tenant/mcp.
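
A quick check over metadata.tools could be sketched as follows. `tool_resolution_warnings` is a hypothetical helper, and the warning strings are made up for this example; the fields it reads (`mcp`, `tools`, `errors`, `server`, `error`) are the ones shown above.

```python
from typing import Any, Optional


def tool_resolution_warnings(metadata: Optional[dict[str, Any]]) -> list[str]:
    """List human-readable reasons why MCP tools may be missing from a run."""
    tools = (metadata or {}).get("tools") or {}
    warnings: list[str] = []

    mcp = tools.get("mcp") or []
    if not mcp:
        warnings.append("no MCP server contributed tools to this run")
    for entry in mcp:
        # A 0 count means nothing from this server passed the allow-list.
        if entry.get("tools", 0) == 0:
            warnings.append(f"MCP server {entry.get('server')!r} contributed 0 tools")

    for err in tools.get("errors") or []:
        warnings.append(f"MCP server {err.get('server')!r} skipped: {err.get('error')}")
    return warnings
```

Run against the sample above, this reports only the skipped `acme` server, since `librarian` contributed three tools.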
Streaming with SSE
"stream": true returns text/event-stream. Every event is one envelope:
event: carries the type, data: a JSON object that always includes
{ "v": 1, "run_id": "…" }. Prefer off-the-shelf SDKs for an in-app chat tab?
The OpenAI-compatible API projects this same loop onto
the OpenAI Chat Completions wire format, and the Vercel AI SDK
adapter wraps that in a streamText/useChat integration. Reach
for this native envelope when you want live tool.executing frames, structured
output, or server-to-server calls.
| Event | Payload | What to do |
|---|---|---|
| run.started | { smith_id, thread_id } | capture run_id |
| message.delta | { delta } | append the text chunk |
| tool.executing | { tool } | informational: a tool is running |
| tool.completed | { tool } | informational: a tool finished |
| approval.required | { approval_id, tool, args, tool_call_id } | a human decides; submit via /submit |
| run.paused | { reason, tool_calls } | the run is waiting on you |
| run.completed | { stop_reason } | done |
| run.failed | { error } | give up or retry |
| run.duplicate | { run_id, reason, run } | a retried Idempotency-Key matched an existing run; reconnect/poll it instead (see below) |
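
To make the envelope concrete, here is a minimal sketch of turning a text/event-stream body into (event, data) pairs. It assumes each frame carries a single JSON object in its data: field, as described above, and handles only the event: and data: fields of the SSE format; `parse_sse` is a hypothetical name and the input is any iterable of text lines.

```python
import json
from typing import Any, Iterable, Iterator, Optional, Tuple


def parse_sse(lines: Iterable[str]) -> Iterator[Tuple[str, dict[str, Any]]]:
    """Yield (event_type, payload) for each frame in an SSE line stream."""
    event_type: Optional[str] = None
    data_lines: list[str] = []

    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line ends the current frame.
            if data_lines:
                yield event_type or "message", json.loads("\n".join(data_lines))
            event_type, data_lines = None, []
            continue
        if line.startswith(":"):
            continue  # SSE comment, often used as a keep-alive
        field, _, value = line.partition(":")
        if value.startswith(" "):
            value = value[1:]  # the format allows one optional space after the colon
        if field == "event":
            event_type = value
        elif field == "data":
            data_lines.append(value)

    # Flush a final frame if the stream ended without a trailing blank line.
    if data_lines:
        yield event_type or "message", json.loads("\n".join(data_lines))
```

The lines can come from whatever HTTP client you use, as long as it exposes the response body line by line as decoded text.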
Both hosted tools (e.g. web_search) and your MCP tools execute
server-side. Ingram Cloud calls them and the stream just keeps going, surfacing
the informational tool.executing / tool.completed frames as they happen. Those
frames are also mirrored to the run timeline and the event feed,
so a run's tool activity stays auditable after the fact. You only act on an
approval.required pause.
When a run pauses you first get an approval.required event per pending call
and then a single run.paused as the terminal marker. Act on the per-call
events; treat run.paused as the state change. The streamed run.completed
carries stop_reason but not always usage. Read the run record for
authoritative token counts and cost.
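
The handling rules above can be sketched as a single loop over already-parsed (event, data) pairs. `consume_events` and the summary dictionary it returns are illustrative assumptions; the event names and payload fields are the ones in the table.

```python
from typing import Any, Iterable, Tuple


def consume_events(events: Iterable[Tuple[str, dict[str, Any]]]) -> dict[str, Any]:
    """Fold a run's event stream into text, pending approvals, and a final state."""
    summary: dict[str, Any] = {
        "run_id": None, "text": "", "approvals": [],
        "status": "running", "stop_reason": None, "error": None,
    }
    for event, data in events:
        if data.get("run_id"):
            summary["run_id"] = data["run_id"]  # every envelope carries run_id

        if event == "message.delta":
            summary["text"] += data["delta"]
        elif event == "approval.required":
            summary["approvals"].append(data)  # act on these, one per pending call
        elif event == "run.paused":
            summary["status"] = "paused"       # state change only; stop reading
            break
        elif event == "run.completed":
            summary["status"] = "completed"
            summary["stop_reason"] = data.get("stop_reason")
            break
        elif event == "run.failed":
            summary["status"] = "failed"
            summary["error"] = data.get("error")
            break
        elif event == "run.duplicate":
            summary["status"] = "duplicate"    # reconnect to or poll the existing run
            break
        # run.started, tool.executing, tool.completed: informational, nothing to do
    return summary
```

Because the streamed run.completed may not include usage, a caller would still read the run record afterwards for token counts and cost.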
Idempotent run creation
Creating a run is not free. A retry after a network timeout would otherwise
start (and bill) a second run. Send an Idempotency-Key header on
POST /v1/smiths/{sid}/runs and a retry with the same key returns the original
run instead of starting a new one:
# Authorization: tenant-admin token (server-side only), or a smith token
curl https://api.cloud.ingram.tech/v1/smiths/smt_…/runs \
-H "Authorization: Bearer $IC_TOKEN" \
-H "IC-Api-Version: 2026-05-01" \
-H "Idempotency-Key: 8f3c…" \
-H "Content-Type: application/json" \
-d '{ "input": [{ "role": "user", "content": "…" }], "stream": true }'
This covers streaming runs too, exactly the case most likely to time out
mid-turn. The original token stream can't be replayed, so a retried key returns
a single run.duplicate event carrying the existing run's id and current state;
reconnect to that run's recorded event log
(GET /v1/smiths/{sid}/runs/{rid}/events) or poll GET …/runs/{rid}. A
synchronous retry simply returns the original run record. Keys are scoped to your
tenant and honoured for 24 hours; use a fresh key per distinct run.
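
One way to apply this on the client is to mint the key once and reuse it across retries. The sketch below assumes a caller-supplied `send(headers, body)` function that performs the HTTP call and raises `TimeoutError` on a network timeout; both that function and `create_run_with_retry` are hypothetical, and a UUID is just one reasonable choice of key.

```python
import uuid
from typing import Any, Callable, Optional


def create_run_with_retry(
    send: Callable[[dict[str, str], dict[str, Any]], dict[str, Any]],
    body: dict[str, Any],
    attempts: int = 3,
) -> dict[str, Any]:
    """Create a run, retrying timeouts with the SAME Idempotency-Key."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")

    # One key per distinct run: generated once, outside the retry loop.
    headers = {"Idempotency-Key": str(uuid.uuid4())}

    last_error: Optional[TimeoutError] = None
    for _ in range(attempts):
        try:
            return send(headers, body)
        except TimeoutError as exc:
            last_error = exc  # safe to retry: the key prevents a second run
    assert last_error is not None
    raise last_error
```

Generating the key inside the loop would defeat the purpose, since each attempt would then look like a new run.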
Replay a run
Re-run a recorded run's input through the smith as it stands now:
# Authorization: tenant-admin token (server-side only), or a smith token
curl https://api.cloud.ingram.tech/v1/smiths/smt_…/runs/run_…/replay \
-H "Authorization: Bearer $IC_TOKEN" \
-H "IC-Api-Version: 2026-05-01" \
-H "Content-Type: application/json" \
-d '{ "stream": false }'
The reply is a fresh run record, carrying the same input, on its own new
thread (the original conversation is untouched), with metadata.replay_of set
to the source run id. "stream": true streams the replay like any create.
Replay is a re-execution, not a deterministic snapshot: it runs against the
smith's current config and memory, which have moved on since the original —
so the output can differ. A run whose input carried a file attachment can't be
replayed yet (the offloaded bytes aren't rehydrated); it returns 422.
Pause and resume: the universal /submit
One endpoint resumes everything, discriminated by kind:
# Authorization: tenant-admin token (server-side only), or a smith token
curl https://api.cloud.ingram.tech/v1/smiths/smt_…/runs/run_…/submit \
-H "Authorization: Bearer $IC_TOKEN" \
-H "IC-Api-Version: 2026-05-01" \
-H "Content-Type: application/json" \
-d '{ "kind": "tool_result", "tool_call_id": "tc_…",
"result": { "events": ["standup at 10:00"] }, "stream": true }'
- kind: "approval_decision": { approval_id, decision: "approve" | "reject", actor }. The common resume: on approve, Ingram Cloud calls your MCP tool and continues; a rejection completes the run with stop_reason: "approval_rejected". Pass "stream": true to pump the continuation back in the same envelope.
- kind: "tool_result": { tool_call_id, result }, only for an external-execution tool that paused the run (MCP tools run in-process and never need this).
- kind: "cancel": { reason }; the run ends as cancelled.

The whole loop with approvals: stream the run → on approval.required, get a
human decision → submit approval_decision with stream: true → keep pumping
→ run.completed.
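
The three /submit bodies can be built with small helpers like the ones below. The function names are hypothetical, and `actor` is passed through as an opaque value because this page does not specify its shape; the field names match the kinds described above.

```python
from typing import Any


def approval_decision(approval_id: str, decision: str, actor: Any,
                      stream: bool = False) -> dict[str, Any]:
    """Body for resuming a run paused_for_approval."""
    if decision not in ("approve", "reject"):
        raise ValueError('decision must be "approve" or "reject"')
    body: dict[str, Any] = {
        "kind": "approval_decision",
        "approval_id": approval_id,
        "decision": decision,
        "actor": actor,  # shape not specified on this page; passed through as-is
    }
    if stream:
        body["stream"] = True  # stream the continuation in the same envelope
    return body


def tool_result(tool_call_id: str, result: Any, stream: bool = False) -> dict[str, Any]:
    """Body for resuming a run paused_for_tool (external-execution tools only)."""
    body: dict[str, Any] = {
        "kind": "tool_result",
        "tool_call_id": tool_call_id,
        "result": result,
    }
    if stream:
        body["stream"] = True
    return body


def cancel(reason: str) -> dict[str, Any]:
    """Body for cancelling a run; it ends as cancelled."""
    return {"kind": "cancel", "reason": reason}
```

In the approval loop, each approval.required event supplies the `approval_id` to pass to `approval_decision`.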
Structured output
For server-side calls that must return schema-valid JSON (classify, extract,
route), pass response_format. The run becomes a one-shot model call (no
tools, no memory, no streaming) using the smith's configured model:
# Authorization: tenant-admin token (server-side only)
curl https://api.cloud.ingram.tech/v1/smiths/smt_…/runs \
-H "Authorization: Bearer $IC_TOKEN" \
-H "IC-Api-Version: 2026-05-01" \
-H "Content-Type: application/json" \
-d '{ "input": [{ "role": "user", "content": "<source text>" }],
"response_format": { "type": "json_schema", "name": "Ticket",
"strict": true, "schema": { "type": "object", "additionalProperties": false,
"properties": { "priority": { "enum": ["low", "high"] } },
"required": ["priority"] } } }'
# → output.content is a JSON *string* that parses and validates
Schema rules. Every object node must set "additionalProperties": false and
list all of its keys in "required" — the strict json_schema contract the model
provider enforces (strict: true or false makes no difference here). A schema
that breaks this is rejected immediately with 422 schema_error; the run record is marked
failed with stop_reason: schema_error, and error.detail names the offending
rule. (Generating the schema with z.toJSONSchema satisfies this automatically; the
trap is hand-written schemas.) If the model can't produce valid output against a
valid schema, the API retries, then returns 500 structured_output_failed.
Use a dedicated smith with auto_memory: false for these utility calls.
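
For hand-written schemas, a local pre-flight check of the two rules above can catch mistakes before the API returns 422. This is a sketch under assumptions: `strict_schema_violations` is a hypothetical helper, it walks only the common nesting keywords (properties, items, anyOf/oneOf/allOf, $defs/definitions), and the server's own validation may check more than this.

```python
from typing import Any


def strict_schema_violations(schema: Any, path: str = "$") -> list[str]:
    """Return a list of strict-contract problems found in a JSON Schema."""
    problems: list[str] = []
    if not isinstance(schema, dict):
        return problems

    props = schema.get("properties")
    node_type = schema.get("type")
    is_object = (
        node_type == "object"
        or (isinstance(node_type, list) and "object" in node_type)
        or isinstance(props, dict)
    )
    if is_object:
        props = props if isinstance(props, dict) else {}
        # Rule 1: every object node sets additionalProperties to false.
        if schema.get("additionalProperties") is not False:
            problems.append(f'{path}: "additionalProperties" must be false')
        # Rule 2: every key is listed in "required".
        missing = sorted(set(props) - set(schema.get("required") or []))
        if missing:
            problems.append(f'{path}: keys missing from "required": {", ".join(missing)}')
        for key, child in props.items():
            problems += strict_schema_violations(child, f"{path}.{key}")

    if isinstance(schema.get("items"), dict):
        problems += strict_schema_violations(schema["items"], f"{path}[]")
    for keyword in ("anyOf", "oneOf", "allOf"):
        for index, child in enumerate(schema.get(keyword) or []):
            problems += strict_schema_violations(child, f"{path}.{keyword}[{index}]")
    for keyword in ("$defs", "definitions"):
        for name, child in (schema.get(keyword) or {}).items():
            problems += strict_schema_violations(child, f"{path}.{keyword}.{name}")
    return problems
```

An empty list means the schema passes these two rules; the Ticket schema in the example above does.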
Inspecting runs in the console
Observe → Runs lists every turn with status, tokens, and dollar cost. A run's detail page shows duration, tokens, and cost up top, then two tabs:

- Timeline: the timed span waterfall (model calls, tool calls, per-span cost) plus the lifecycle event log.
- Transcript: the conversation with tool calls.

Replay (top right) re-runs the same input as a fresh run and opens it. Runs fired from the Playground land here too.