Runs & streaming

A run is one turn: input messages in, events while it works, an output record at the end. Runs are where every other feature meets: tools pause them, approvals gate them, usage meters them, traces time them.

Anatomy of a run

POST /v1/smiths/{sid}/runs
  → queued → running → completed
                     ↘ paused_for_tool      (an external-execution tool needs you)
                     ↘ paused_for_approval  (a human must approve)
                     ↘ failed / cancelled

A paused run resumes when you submit the missing piece via /submit, the same endpoint for tool results, approval decisions, and cancellation.
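The lifecycle above can be read as a small decision table for a polling client. The sketch below is one hedged way to express it; the status names come from the diagram, while the function name and the returned action labels are this example's own invention rather than part of the API.

```python
# Status names are taken from the lifecycle diagram; grouping them into
# sets and the action labels returned below are assumptions of this sketch.
TERMINAL = {"completed", "failed", "cancelled"}
IN_FLIGHT = {"queued", "running"}


def next_action(status: str) -> str:
    """Return what a polling client should do for a given run status."""
    if status in TERMINAL:
        return "done"
    if status == "paused_for_tool":
        # An external-execution tool is waiting for its result.
        return "submit tool_result"
    if status == "paused_for_approval":
        # A human has to approve or reject the pending call.
        return "submit approval_decision"
    if status in IN_FLIGHT:
        return "wait"
    raise ValueError(f"unknown run status: {status!r}")
```

Raising on an unknown status is a deliberate choice here: a client that silently waits on a status it does not recognise tends to hang.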

Threads carry conversation history. Pass a stable thread_id per conversation and the smith sees the recent turns of that thread. The id can be any string of yours; if you omit it, one is minted with a thr_ prefix and returned on the run.

input is a list of messages: { "role": "user" | "assistant", "content": "<text>" }. content is a plain string for a text turn, or an array of typed parts ({ "type": "text" | "image" | "file", … }) for a multimodal turn. When you read a run back, a turn that carried an image or file comes back in the parts form. Multiple messages are allowed (e.g. to replay context). System behaviour comes from the smith's resolved instructions, not a system message.

For long turns prefer "stream": true, because a synchronous call holds the connection open for the whole turn.
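As a rough illustration of the request body described above, the following sketch builds text-only messages and a run body. The helper names are hypothetical; only the field names (input, role, content, thread_id, stream) come from this page, and multimodal parts are left out because their full shape is not spelled out here.

```python
from typing import Any, Dict, List, Optional

ALLOWED_ROLES = {"user", "assistant"}  # no "system" role, per the text above


def text_message(role: str, content: str) -> Dict[str, str]:
    """Build one plain-text message, rejecting roles the API does not list."""
    if role not in ALLOWED_ROLES:
        raise ValueError(f"role must be one of {sorted(ALLOWED_ROLES)}, got {role!r}")
    return {"role": role, "content": content}


def run_body(messages: List[Dict[str, Any]],
             thread_id: Optional[str] = None,
             stream: bool = False) -> Dict[str, Any]:
    """Assemble the JSON body for POST /v1/smiths/{sid}/runs."""
    body: Dict[str, Any] = {"input": list(messages)}
    if thread_id is not None:
        body["thread_id"] = thread_id  # omit it to let the server mint one
    if stream:
        body["stream"] = True
    return body
```

For example, `run_body([text_message("user", "Hi")], thread_id="chat_42")` yields a body equivalent to the one in the synchronous curl call in the next section.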

Synchronous runs

# Authorization: tenant-admin token (server-side only), or a smith token
curl https://api.cloud.ingram.tech/v1/smiths/smt_…/runs \
  -H "Authorization: Bearer $IC_TOKEN" \
  -H "IC-Api-Version: 2026-05-01" \
  -H "Content-Type: application/json" \
  -d '{ "input": [{ "role": "user", "content": "Summarize what we discussed yesterday." }],
        "thread_id": "chat_42" }'

The response is the run record:

{ "id": "run_…", "smith_id": "smt_…", "thread_id": "chat_42",
  "status": "queued | running | completed | paused_for_tool | paused_for_approval | failed | cancelled",
  "output": { "content": "…", "tool_calls": [ { "id": "call_1", "type": "function",
               "function": { "name": "get_weather", "arguments": "{…}" } } ] },
  "stop_reason": "end_turn",
  "usage": { "input_tokens": 0, "output_tokens": 0, "total_tokens": 0,
             "cost": 0.0241 } }

output.tool_calls follows the OpenAI tool_calls shape ({ id, type, function }) for a model-driven call; an approval pause instead surfaces a pending-call object naming the tool and its arguments. A structured run (one that passed response_format) sets output.content_type (e.g. application/json) so a reader knows how to parse output.content. When the agent offered quick-reply chips on a chat channel, output.suggested_replies lists the labels it presented.

usage reports this run's own token counts and, when the model has a price-book entry, its priced cost in your account currency. This is the run's own line item, so no separate query is needed. (A turn on an unpriced model carries tokens but omits cost.) The usage API aggregates these figures across runs into your billing summary.
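A reader of the run record has three small decisions to make: whether output.content needs JSON parsing, which tools were called, and whether a cost is present at all. The sketch below shows one way to handle them; the function names are hypothetical and the record is assumed to be the already-decoded JSON dict.

```python
import json
from typing import Any, Dict, List, Optional


def read_output(run: Dict[str, Any]) -> Any:
    """Return output.content, parsed when content_type says it is JSON."""
    output = run.get("output") or {}
    content = output.get("content")
    if output.get("content_type") == "application/json" and isinstance(content, str):
        return json.loads(content)
    return content


def tool_call_names(run: Dict[str, Any]) -> List[str]:
    """List function names from OpenAI-shaped output.tool_calls entries."""
    calls = (run.get("output") or {}).get("tool_calls") or []
    return [call["function"]["name"] for call in calls]


def run_cost(run: Dict[str, Any]) -> Optional[float]:
    """Return usage.cost, or None when the model is unpriced and cost is omitted."""
    return (run.get("usage") or {}).get("cost")
```

Returning None for a missing cost, rather than 0, keeps "unpriced" distinguishable from "free" in downstream reporting.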

Reads: GET /v1/smiths/{sid}/runs/{rid} (one), GET /v1/smiths/{sid}/runs (per smith), GET /v1/runs?status=&smith_id=&agent_id= (project-wide feed; agent_id pulls every run across one agent's smiths), GET /v1/smiths/{sid}/runs/{rid}/events (SSE replay of the recorded log).

Which tools the run resolved

Every run records the tool set it resolved under metadata.tools, so "did my MCP servers actually reach the model?" is a one-read fact instead of something you infer from the model behaving as if it had none:

"metadata": { "tools": {
  "total": 4,
  "mcp": [{ "server": "librarian", "tools": 3 }],
  "hosted": ["web_search"], "deployment": 0, "memory": 0,
  "errors": [{ "server": "acme", "error": "runtime load failed: stored secret could not be decoded" }]
} }

mcp lists each registered MCP server with the number of its tools that passed the allow-list and reached the model — an empty list or a 0 count is the signal that a server contributed nothing to this run. errors names any registered server the run skipped and why (e.g. a stored secret that can no longer be decoded), so a missing tool reads as a recorded reason rather than a guess — the same failure also flips that server to degraded on GET /v1/tenant/mcp.
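The check described above can be automated after each run. This is a minimal sketch that assumes the metadata.tools shape shown in the sample; the function name and the keys of the returned dict are this example's own.

```python
from typing import Any, Dict


def tool_diagnostics(run: Dict[str, Any]) -> Dict[str, Any]:
    """Summarise which MCP servers contributed nothing to a run, and why."""
    tools = (run.get("metadata") or {}).get("tools") or {}
    mcp = tools.get("mcp") or []
    errors = tools.get("errors") or []
    return {
        # No MCP server reached the model at all.
        "no_mcp": len(mcp) == 0,
        # Servers that were registered but passed zero tools through the allow-list.
        "silent": [entry["server"] for entry in mcp if entry.get("tools", 0) == 0],
        # Servers the run skipped, mapped to the recorded reason.
        "skipped": {entry["server"]: entry["error"] for entry in errors},
    }
```

Run against the sample record above, this reports no silent servers and one skipped server, acme, with its recorded decode error.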

Streaming with SSE

"stream": true returns text/event-stream. Every event is one envelope: event: carries the type, data: a JSON object that always includes { "v": 1, "run_id": "…" }. Prefer off-the-shelf SDKs for an in-app chat tab? The OpenAI-compatible API projects this same loop onto the OpenAI Chat Completions wire format, and the Vercel AI SDK adapter wraps that in a streamText/useChat integration. Reach for this native envelope when you want live tool.executing frames, structured output, or server-to-server calls.

Event	Payload	What to do
`run.started`	`{ smith_id, thread_id }`	capture `run_id`
`message.delta`	`{ delta }`	append the text chunk
`tool.executing`	`{ tool }`	informational: a tool is running
`tool.completed`	`{ tool }`	informational: a tool finished
`approval.required`	`{ approval_id, tool, args, tool_call_id }`	a human decides; submit via `/submit`
`run.paused`	`{ reason, tool_calls }`	the run is waiting on you
`run.completed`	`{ stop_reason }`	done
`run.failed`	`{ error }`	give up or retry
`run.duplicate`	`{ run_id, reason, run }`	a retried `Idempotency-Key` matched an existing run; reconnect/poll it instead (see below)

Both hosted tools (e.g. web_search) and your MCP tools execute server-side. Ingram Cloud calls them and the stream just keeps going, surfacing the informational tool.executing / tool.completed frames as they happen. Those frames are also mirrored to the run timeline and the event feed, so a run's tool activity stays auditable after the fact. You only act on an approval.required pause.

When a run pauses, you first get one approval.required event per pending call and then a single run.paused as the terminal marker. Act on the per-call events; treat run.paused as the state change. The streamed run.completed carries stop_reason but not always usage, so read the run record for authoritative token counts and cost.
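To make the envelope concrete, here is a hedged sketch of a consumer that parses event:/data: frames from an iterable of text lines and folds them into a small state dict. It assumes the payload fields in the table sit at the top level of the data object next to v and run_id, which this page implies but does not state outright; the function names and the state keys are illustrative.

```python
import json
from typing import Any, Dict, Iterable, Iterator, List, Tuple


def iter_events(lines: Iterable[str]) -> Iterator[Tuple[str, Dict[str, Any]]]:
    """Yield (event_type, data) pairs from SSE lines; a blank line ends a frame."""
    event, data = "message", []  # type: Tuple[str, List[str]]
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line:
            if data:
                yield event, json.loads("\n".join(data))
            event, data = "message", []
            continue
        if line.startswith(":"):
            continue  # SSE comment / keep-alive
        field, _, value = line.partition(":")
        if value.startswith(" "):
            value = value[1:]
        if field == "event":
            event = value
        elif field == "data":
            data.append(value)


def consume(lines: Iterable[str]) -> Dict[str, Any]:
    """Fold a stream into text, pending approvals, and a coarse status."""
    state: Dict[str, Any] = {"run_id": None, "text": "", "approvals": [], "status": "running"}
    for event, payload in iter_events(lines):
        state["run_id"] = payload.get("run_id", state["run_id"])
        if event == "message.delta":
            state["text"] += payload["delta"]
        elif event == "approval.required":
            state["approvals"].append(payload["approval_id"])
        elif event == "run.paused":
            state["status"] = "paused"
        elif event == "run.completed":
            state["status"] = "completed"
        elif event == "run.failed":
            state["status"] = "failed"
        # tool.executing / tool.completed are informational and ignored here.
    return state
```

In practice the lines would come from an HTTP client's streaming line iterator; the parser itself only needs strings, which keeps it testable without a network.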

Idempotent run creation

Creating a run is not free: without protection, a retry after a network timeout starts (and bills) a second run. Send an Idempotency-Key header on POST /v1/smiths/{sid}/runs and a replay with the same key returns the original run instead of starting a new one:

# Authorization: tenant-admin token (server-side only), or a smith token
curl https://api.cloud.ingram.tech/v1/smiths/smt_…/runs \
  -H "Authorization: Bearer $IC_TOKEN" \
  -H "IC-Api-Version: 2026-05-01" \
  -H "Idempotency-Key: 8f3c…" \
  -H "Content-Type: application/json" \
  -d '{ "input": [{ "role": "user", "content": "…" }], "stream": true }'

This covers streaming runs too, exactly the case most likely to time out mid-turn. The original token stream can't be replayed, so a retried key returns a single run.duplicate event carrying the existing run's id and current state; reconnect to that run's recorded event log (GET /v1/smiths/{sid}/runs/{rid}/events) or poll GET …/runs/{rid}. A synchronous retry simply returns the original run record. Keys are scoped to your tenant and honoured for 24 hours; use a fresh key per distinct run.
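The point of the header is that every retry of one logical run reuses one key. The sketch below shows that pattern in isolation. `send` is a hypothetical callable standing in for your HTTP client (it takes the body and extra headers and either returns the decoded response or raises TimeoutError), and using a UUID as the key is an assumption; the page only requires a fresh key per distinct run.

```python
import uuid
from typing import Any, Callable, Dict


def create_with_retry(send: Callable[[Dict[str, Any], Dict[str, str]], Any],
                      body: Dict[str, Any],
                      attempts: int = 3) -> Any:
    """Create a run, retrying timeouts with the SAME Idempotency-Key each time."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    # One key per logical run: generated once, outside the retry loop.
    headers = {"Idempotency-Key": str(uuid.uuid4())}
    last_error = None
    for _ in range(attempts):
        try:
            return send(body, headers)
        except TimeoutError as exc:
            last_error = exc  # safe to retry: the server dedupes on the key
    raise last_error
```

Generating the key inside the loop would defeat the mechanism, since each attempt would then look like a distinct run.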

Replay a run

Re-run a recorded run's input through the smith as it stands now:

# Authorization: tenant-admin token (server-side only), or a smith token
curl https://api.cloud.ingram.tech/v1/smiths/smt_…/runs/run_…/replay \
  -H "Authorization: Bearer $IC_TOKEN" \
  -H "IC-Api-Version: 2026-05-01" \
  -H "Content-Type: application/json" \
  -d '{ "stream": false }'

The reply is a fresh run record, carrying the same input, on its own new thread (the original conversation is untouched), with metadata.replay_of set to the source run id. "stream": true streams the replay like any create.

Replay is a re-execution, not a deterministic snapshot: it runs against the smith's current config and memory, which may have moved on since the original, so the output can differ. A run whose input carried a file attachment can't be replayed yet (the offloaded bytes aren't rehydrated); it returns 422.
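A client can avoid a predictable 422 by checking the recorded input before calling replay. This sketch assumes the parts form described earlier on this page and treats only parts with "type": "file" as attachments; whether image parts are also affected is not stated here, so the check is deliberately narrow. The function name is hypothetical.

```python
from typing import Any, Dict, List


def has_file_part(input_messages: List[Dict[str, Any]]) -> bool:
    """Return True if any message carries a typed part with type == "file"."""
    for message in input_messages:
        content = message.get("content")
        # Plain-string content is a text turn and can never hold a file.
        if isinstance(content, list):
            if any(isinstance(part, dict) and part.get("type") == "file" for part in content):
                return True
    return False
```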

Pause and resume: the universal `/submit`

One endpoint resumes everything, discriminated by kind:

# Authorization: tenant-admin token (server-side only), or a smith token
curl https://api.cloud.ingram.tech/v1/smiths/smt_…/runs/run_…/submit \
  -H "Authorization: Bearer $IC_TOKEN" \
  -H "IC-Api-Version: 2026-05-01" \
  -H "Content-Type: application/json" \
  -d '{ "kind": "tool_result", "tool_call_id": "tc_…",
        "result": { "events": ["standup at 10:00"] }, "stream": true }'

kind: "approval_decision": { approval_id, decision: "approve" | "reject", actor }. This is the common resume: on approve, Ingram Cloud calls your MCP tool and continues; on reject, the run completes with stop_reason: "approval_rejected". Pass "stream": true to pump the continuation back in the same envelope.
kind: "tool_result": { tool_call_id, result }, only for an external-execution tool that paused the run (MCP tools run in-process and never need this).
kind: "cancel": { reason }; the run ends as cancelled.

The whole loop with approvals: stream the run → on approval.required, get a human decision → submit approval_decision with stream: true → keep pumping → run.completed.
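The three kinds above can be wrapped in small builders so a caller cannot mix fields between them. This is a sketch under two assumptions: the approval_decision body is flat like the tool_result body in the curl example, and actor is passed through as an opaque value because its shape is not given on this page. The function names are illustrative.

```python
from typing import Any, Dict


def approval_decision(approval_id: str, decision: str, actor: Any,
                      stream: bool = False) -> Dict[str, Any]:
    """Body for resolving an approval.required pause."""
    if decision not in ("approve", "reject"):
        raise ValueError(f"decision must be 'approve' or 'reject', got {decision!r}")
    body: Dict[str, Any] = {"kind": "approval_decision", "approval_id": approval_id,
                            "decision": decision, "actor": actor}
    if stream:
        body["stream"] = True  # stream the continuation in the same envelope
    return body


def tool_result(tool_call_id: str, result: Any, stream: bool = False) -> Dict[str, Any]:
    """Body for answering an external-execution tool pause."""
    body: Dict[str, Any] = {"kind": "tool_result", "tool_call_id": tool_call_id,
                            "result": result}
    if stream:
        body["stream"] = True
    return body


def cancel(reason: str) -> Dict[str, Any]:
    """Body for cancelling a run."""
    return {"kind": "cancel", "reason": reason}
```

Each builder returns the JSON body to POST to the run's /submit endpoint; the HTTP call itself is left to whichever client you already use.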

Structured output

For server-side calls that must return schema-valid JSON (classify, extract, route), pass response_format. The run becomes a one-shot model call (no tools, no memory, no streaming) using the smith's configured model:

# Authorization: tenant-admin token (server-side only)
curl https://api.cloud.ingram.tech/v1/smiths/smt_…/runs \
  -H "Authorization: Bearer $IC_TOKEN" \
  -H "IC-Api-Version: 2026-05-01" \
  -H "Content-Type: application/json" \
  -d '{ "input": [{ "role": "user", "content": "<source text>" }],
        "response_format": { "type": "json_schema", "name": "Ticket",
          "strict": true, "schema": { "type": "object", "additionalProperties": false,
            "properties": { "priority": { "enum": ["low", "high"] } },
            "required": ["priority"] } } }'
# → output.content is a JSON *string* that parses and validates

Schema rules. Every object node must set "additionalProperties": false and list all of its keys in "required" — the strict json_schema contract the model provider enforces (strict: true or false makes no difference here). A schema that breaks this is rejected fast with 422 schema_error; the run record is marked failed with stop_reason: schema_error and the error.detail names the offending rule. (Generating the schema with z.toJSONSchema satisfies this automatically; the trap is hand-written schemas.) If the model can't produce valid output against a valid schema, the API retries, then returns 500 structured_output_failed.
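Hand-written schemas can be checked locally against the two rules above before they reach the API. The sketch below is a simplified pre-flight check, not the server's validator: it walks only properties and items, ignores constructs such as anyOf or $defs, and its error strings are this example's own wording.

```python
from typing import Any, Dict, List


def strict_schema_errors(schema: Any, path: str = "$") -> List[str]:
    """Return rule violations: additionalProperties false, all keys required."""
    errors: List[str] = []
    if not isinstance(schema, dict):
        return errors
    props = schema.get("properties")
    props = props if isinstance(props, dict) else {}
    # Treat a node as an object if it says so or declares properties.
    if schema.get("type") == "object" or props:
        if schema.get("additionalProperties") is not False:
            errors.append(f"{path}: additionalProperties must be false")
        required = schema.get("required")
        required = required if isinstance(required, list) else []
        missing = sorted(set(props) - set(required))
        if missing:
            errors.append(f"{path}: required is missing {missing}")
    # Recurse into nested object properties and array items.
    for name, sub in props.items():
        errors.extend(strict_schema_errors(sub, f"{path}.properties.{name}"))
    if isinstance(schema.get("items"), dict):
        errors.extend(strict_schema_errors(schema["items"], f"{path}.items"))
    return errors
```

An empty list means the schema passes these two rules; it does not guarantee the API will accept it, only that the most common hand-written mistake is absent.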

Use a dedicated smith with auto_memory: false for these utility calls.

Inspecting runs in the console

Observe → Runs lists every turn with status, tokens, and dollar cost. A run's detail page shows duration, tokens, and cost up top, then two tabs. Timeline shows the timed span waterfall (model calls, tool calls, per-span cost) plus the lifecycle event log. Transcript shows the conversation with tool calls. Replay (top right) re-runs the same input as a fresh run and opens it. Runs fired from the Playground land here too.
