OpenAI-compatible API

POST /v1/chat/completions drives a smith behind the OpenAI Chat Completions wire format. Point the openai SDK or the Vercel AI SDK's @ai-sdk/openai-compatible provider at Ingram Cloud and a logged-in web session streams straight from the smith: no custom transport, no bespoke SSE parser.

It is the same engine as Runs & streaming: the same smith, the same MCP tools, the same memory, the same approval gates and usage metering. Only the bytes on the wire change. Use this surface for in-app web chat; use /v1/smiths/{sid}/runs when you want the native envelope, structured output, or server-to-server calls.

Building an in-app chat tab with the Vercel AI SDK? The AI SDK adapter (@ingram-tech/ai-sdk-adapter) wraps this surface with the identity, memory, and approval helpers wired in — same standard wire format, less boilerplate. This page is the wire reference beneath it.

Identity: the token is the smith

There is no smith id in the path. A smith token (sub = "<tenant>:<smith>") already names exactly one smith, so the call runs as that smith, the same trust model as an MCP tools/call. The agent is the one that smith runs — resolved from the smith, never from a request field.

The model field keeps its literal OpenAI meaning: it names the upstream inference LLM. Omit it (or send "") to use the agent's configured model; send a model id such as gpt-5.5 to override the LLM for that one call (instructions, tools, and memory still come from the smith's agent). It never selects the agent, which is always the one the smith runs.

A model the upstream provider rejects (an unknown id, a typo, an agent slug) is a real error, not an empty answer. A non-streaming call returns a non-2xx with the standard { "error": { … } } body; a streaming call emits a terminal data: { "error": { … } } frame before data: [DONE] (the same convention the openai SDK and @ai-sdk/openai-compatible raise as an exception). You will never get a 200 with finish_reason: "stop" and empty content for a failed turn.

// apiKey: a per-smith token (never a tenant-admin token from the browser)
import { createOpenAICompatible } from "@ai-sdk/openai-compatible";
import { streamText } from "ai";

// Assumes the smith token is supplied through the IC_SMITH_TOKEN environment variable.
const SMITH_TOKEN = process.env.IC_SMITH_TOKEN ?? "";
const prompt = "What did I spend on travel in May?";

const ingram = createOpenAICompatible({
  name: "ingram",
  baseURL: "https://api.cloud.ingram.tech/v1",
  apiKey: SMITH_TOKEN,
});

// "" → the smith's agent's configured model. Use "gpt-5.5" etc. to override the LLM.
const result = streamText({ model: ingram(""), prompt });
// Print each text delta as it arrives.
for await (const delta of result.textStream) process.stdout.write(delta);

Calling server-side with a tenant-admin token? Name the smith one of two ways: the OpenAI-standard user field set to the smith's external_id (your own user id), or an IC-Smith-Id: smt_… header carrying the smt_ id. The user field is the zero-custom-header path — a stock OpenAI client just works. Without a resolvable smith the call returns 400 smith_unresolved.
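As a rough sketch of the two ways to name a smith on a tenant-admin call, the helper below builds the request either with the user field or with the IC-Smith-Id header. The names prepareAdminRequest, SmithRef, and PreparedRequest are hypothetical, and the token and ids are placeholders; only the user field, the header name, and the endpoint come from this page.

```typescript
// Hypothetical helper, not part of any SDK: it only assembles the request.
type SmithRef =
  | { kind: "external_id"; externalId: string } // sent as the OpenAI `user` field
  | { kind: "smith_id"; smithId: string }; // sent as the IC-Smith-Id header

interface ChatMessage {
  role: "system" | "user" | "assistant" | "tool";
  content: string;
}

interface PreparedRequest {
  url: string;
  headers: Record<string, string>;
  body: { model: string; messages: ChatMessage[]; user?: string };
}

function prepareAdminRequest(
  adminToken: string,
  smith: SmithRef,
  messages: ChatMessage[],
): PreparedRequest {
  const headers: Record<string, string> = {
    Authorization: `Bearer ${adminToken}`,
    "Content-Type": "application/json",
  };
  const body: PreparedRequest["body"] = { model: "", messages };
  if (smith.kind === "external_id") {
    // Zero-custom-header path: a stock OpenAI client can set `user`.
    body.user = smith.externalId;
  } else {
    headers["IC-Smith-Id"] = smith.smithId;
  }
  return { url: "https://api.cloud.ingram.tech/v1/chat/completions", headers, body };
}
```

Pass the result to any HTTP client, serializing body as JSON. If neither path resolves a smith, expect the 400 smith_unresolved described above.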

# Authorization: smith token (browser-safe, scoped to one smith)
curl https://api.cloud.ingram.tech/v1/chat/completions \
  -H "Authorization: Bearer $IC_SMITH_TOKEN" \
  -H "IC-Api-Version: 2026-05-01" \
  -H "Content-Type: application/json" \
  -d '{ "model": "", "stream": true,
        "messages": [{ "role": "user", "content": "What did I spend on travel in May?" }] }'

The stream is standard Chat Completions framing: data: {chunk}\n\n chunks whose text rides on choices[0].delta.content, terminated by data: [DONE]. The chunk id is the Ingram Cloud run_… id, so a stream and a run.completed webhook for the same turn are correlatable, and a dropped stream can be reconciled against the run record.
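If you do read the frames yourself rather than through an SDK, the framing above can be handled roughly as follows. This is a minimal sketch: parseChatStream and StreamResult are hypothetical names, the whole payload is taken as one string for brevity (a real client would buffer network chunks), and reading error.message assumes the error object carries a message field as OpenAI's does.

```typescript
interface StreamResult {
  runId: string | null; // the chunk id, which this page says is the run_… id
  text: string; // concatenated choices[0].delta.content
}

function parseChatStream(raw: string): StreamResult {
  let runId: string | null = null;
  let text = "";
  // Frames are separated by a blank line.
  for (const frame of raw.split("\n\n")) {
    const line = frame.trim();
    if (!line.startsWith("data:")) continue;
    const payload = line.slice("data:".length).trim();
    if (payload === "[DONE]") break;
    const chunk = JSON.parse(payload);
    if (chunk.error) {
      // Terminal error frame: surface it instead of returning partial text.
      // Assumption: the error object has a `message` field.
      throw new Error(chunk.error.message ?? "stream failed");
    }
    if (typeof chunk.id === "string") runId = chunk.id;
    const delta = chunk.choices?.[0]?.delta?.content;
    if (typeof delta === "string") text += delta;
  }
  return { runId, text };
}
```

Keeping the run id next to the text is what lets you reconcile a dropped stream against the run record later.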

Memory: stateless by default, stateful on request

Plain Chat Completions is stateless, so by default the messages you send are the whole context for that turn and a fresh thread is used. To use Ingram Cloud's server-side memory instead, send an IC-Thread-Id: <your-id> header: Ingram Cloud then holds the thread and you send only the new user turn (the same thread model as a native run's thread_id).
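The difference between the two modes shows up only in what you send. A small sketch, assuming a hypothetical buildTurn helper and a caller-chosen thread id: without a thread id the full history goes in messages; with one, only the new user turn does and the header carries the id.

```typescript
interface Turn {
  role: "user" | "assistant";
  content: string;
}

// Hypothetical helper: picks stateless or stateful framing for one turn.
function buildTurn(history: Turn[], newUserText: string, threadId?: string) {
  const headers: Record<string, string> = { "Content-Type": "application/json" };
  const userTurn: Turn = { role: "user", content: newUserText };
  if (threadId !== undefined) {
    // Stateful: the server holds the thread, so send only the new turn.
    headers["IC-Thread-Id"] = threadId;
    return { headers, body: { model: "", messages: [userTurn] } };
  }
  // Stateless: the messages array is the whole context for this turn.
  return { headers, body: { model: "", messages: [...history, userTurn] } };
}
```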

Instructions: per-request `system` is honored

The smith's agent provides the base instructions. A system message (Chat) or the instructions field (Responses) is appended to them for that one turn — so the agent's persona and guardrails stay authoritative and you add live, per-request context (the page the user is on, a read/write toggle, anything recomputed each call). Omit it and behaviour is unchanged. This is the standard channel for dynamic context — use it instead of rewriting agent config per turn.
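For example, per-request context can be computed on each call and sent as a system message ahead of the user turn. The sketch below is illustrative only: withRequestContext and PageContext are hypothetical, and the wording of the system text is made up for the example.

```typescript
interface PageContext {
  path: string; // the page the user is currently on
  readOnly: boolean; // a read/write toggle recomputed each call
}

// Hypothetical helper: prepends live context as a per-request system message.
function withRequestContext(ctx: PageContext, userText: string) {
  const access = ctx.readOnly ? "disabled" : "enabled";
  return [
    {
      role: "system" as const,
      content: `The user is viewing ${ctx.path}. Write access is ${access}.`,
    },
    { role: "user" as const, content: userText },
  ];
}
```

The agent's own instructions are not resent here; per the text above, the system message is appended to them server-side for that one turn.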

Images and files

Image and file content parts are passed through to the smith's model. Send the standard OpenAI shape — a content array mixing text with image_url (Chat) / input_image (Responses) for images, or file (Chat) / input_file (Responses) for documents like PDFs. An image may be a hosted https URL or an inline base64 data URL; a file is inlined as base64:

{ "messages": [{ "role": "user", "content": [
  { "type": "text", "text": "What's wrong with my invoice?" },
  { "type": "image_url", "image_url": { "url": "data:image/png;base64,iVBORw0KGgo…" } },
  { "type": "file", "file": {
      "filename": "invoice.pdf",
      "file_data": "data:application/pdf;base64,JVBERi0…" } }
] }] }

The smith's model must be vision/document-capable (most current models are). The same content shape works whichever provider backs the model — each receives the image or document in its native form.

Inline files are stored for auditability. Bytes you inline (an image data URL or a file's file_data) are saved to file storage rather than kept in the run record, which holds a lightweight reference instead; the run's input then carries a file_id. Fetch a stored file's metadata at GET /v1/files/{file_id} and its bytes at GET /v1/files/{file_id}/content (token needs the files:read scope). A hosted https image URL is passed by reference and not stored. These inline files are reachable by id or through the run that referenced them; they are not enumerated by a list endpoint.

Inline bytes are bounded by a 32 MB request limit (matching the tightest common provider); a larger body is rejected with 413 payload_too_large. For bigger files, host them and pass an https image URL, or wait for the Files API.
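Because base64 inflates a file by about a third, it can help to check the size before inlining. The sketch below is arithmetic only and rests on one assumption: that "32 MB" means 32 × 1024 × 1024 bytes (the page does not say whether it is MiB or decimal MB). The names base64Length and fitsInline are hypothetical.

```typescript
// Assumption: "32 MB" is read as 32 MiB. Adjust if the limit is decimal.
const REQUEST_LIMIT_BYTES = 32 * 1024 * 1024;

// Base64 encodes every 3 raw bytes as 4 characters, padding the last group.
function base64Length(rawBytes: number): number {
  return Math.ceil(rawBytes / 3) * 4;
}

// True when a file of `rawBytes`, inlined as a data URL, plus the rest of the
// JSON body (`otherBodyBytes`), stays within the request limit.
function fitsInline(rawBytes: number, mimeType: string, otherBodyBytes = 0): boolean {
  const prefixLength = `data:${mimeType};base64,`.length;
  return prefixLength + base64Length(rawBytes) + otherBodyBytes <= REQUEST_LIMIT_BYTES;
}
```

Under that assumption a 20 MiB PDF fits (about 26.7 MiB encoded) while a 30 MiB one does not (40 MiB encoded), so the practical ceiling for a single inline file is roughly 24 MiB of raw bytes.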

Tools

Two models, both standard; pick one per use case:

Client-side tools — you define functions and run them yourself (the standard OpenAI function-call loop). Send tools on the request.
Server-side tools — Ingram Cloud calls your MCP server and runs them for you, with approval gating. No tools on the request; register the server once.

Client-side tools (you execute)

Send tools exactly as you would to OpenAI. The model's calls come back as tool_calls for you to execute; send the results back as tool messages and the model continues. Ingram Cloud runs nothing and gates nothing — your loop owns both — and it's stateless, so re-send the conversation each turn (same as OpenAI):

// 1) you send tools + the turn; the model asks to call one
{ "model": "", "tools": [
    { "type": "function", "function": { "name": "get_weather",
      "parameters": { "type": "object", "additionalProperties": false,
        "properties": { "city": { "type": "string" } }, "required": ["city"] } } } ],
  "messages": [ { "role": "user", "content": "weather in Paris?" } ] }
// → finish_reason "tool_calls", message.tool_calls = [{ id, function:{ name, arguments } }]

// 2) you run get_weather, then re-send the whole conversation with the result
{ "model": "", "tools": [ /* same tools */ ],
  "messages": [
    { "role": "user", "content": "weather in Paris?" },
    { "role": "assistant", "tool_calls": [ { "id": "call_1", "type": "function",
        "function": { "name": "get_weather", "arguments": "{\"city\":\"Paris\"}" } } ] },
    { "role": "tool", "tool_call_id": "call_1", "content": "{\"tempC\":21}" } ] }
// → the model answers with finish_reason "stop"

A turn that sends tools uses only those client tools — the smith's agent still provides the system instructions, but its server-side MCP tools and memory are not in play for that turn (you brought your own loop). To use MCP tools, omit tools.
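Step 2 of the loop above, turning the model's tool_calls into tool messages, might look like the sketch below. It assumes a hypothetical runToolCalls helper and a map of local handlers keyed by tool name; reporting failures back to the model as a JSON error object is one possible convention, not something this API prescribes.

```typescript
interface ToolCall {
  id: string;
  type: "function";
  function: { name: string; arguments: string }; // arguments is a JSON string
}

interface ToolMessage {
  role: "tool";
  tool_call_id: string;
  content: string;
}

type Handler = (args: Record<string, unknown>) => unknown;

// Hypothetical helper: runs each requested tool locally and returns the
// tool messages to append before re-sending the conversation.
function runToolCalls(calls: ToolCall[], handlers: Record<string, Handler>): ToolMessage[] {
  return calls.map((call): ToolMessage => {
    const handler = handlers[call.function.name];
    let content: string;
    if (!handler) {
      content = JSON.stringify({ error: `unknown tool ${call.function.name}` });
    } else {
      try {
        const args = JSON.parse(call.function.arguments) as Record<string, unknown>;
        content = JSON.stringify(handler(args));
      } catch (err) {
        // Bad JSON arguments or a throwing handler: tell the model what failed.
        content = JSON.stringify({ error: String(err) });
      }
    }
    return { role: "tool", tool_call_id: call.id, content };
  });
}
```

Append the assistant message that carried the tool_calls, then these tool messages, and send the whole conversation again with the same tools.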

Forcing a tool with `tool_choice`

Send tool_choice alongside tools to control whether the model may call them, exactly as you would to OpenAI:

"auto" (the default) — the model decides.
"none" — the model must answer in text and call nothing.
"required" — the model must call one of the tools.
{ "type": "function", "function": { "name": "get_weather" } } — the model must call that specific tool. On the Responses API the shape is the flatter { "type": "function", "name": "get_weather" }.

tool_choice only governs the tools you sent on the request, so it is an error to send it without a non-empty tools array, or to name a tool that isn't in it — both return 400 invalid_tool_choice (param: "tool_choice") rather than being silently ignored.
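The two rejection rules are easy to mirror client-side so a bad request fails before it leaves your app. A minimal sketch, assuming a hypothetical checkToolChoice helper that returns a reason string or null; the server remains the authority and may apply checks beyond these two.

```typescript
type ToolChoice =
  | "auto"
  | "none"
  | "required"
  | { type: "function"; function: { name: string } };

interface FunctionTool {
  type: "function";
  function: { name: string };
}

// Hypothetical pre-flight check mirroring the two documented 400 cases.
function checkToolChoice(
  tools: FunctionTool[] | undefined,
  choice: ToolChoice | undefined,
): string | null {
  if (choice === undefined) return null; // nothing to validate
  if (!tools || tools.length === 0) {
    return "tool_choice sent without a non-empty tools array";
  }
  if (typeof choice === "object") {
    const wanted = choice.function.name;
    if (!tools.some((tool) => tool.function.name === wanted)) {
      return `tool_choice names "${wanted}", which is not in tools`;
    }
  }
  return null;
}
```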

// Authorization: per-smith token
{ "model": "", "tools": [ /* get_weather as above */ ],
  "tool_choice": { "type": "function", "function": { "name": "get_weather" } },
  "messages": [ { "role": "user", "content": "weather in Paris?" } ] }
// → the model is forced to call get_weather; finish_reason "tool_calls"

Server-side tools (MCP) and approvals

Read/automatic tools run server-side and never appear in this (Chat Completions) stream: the run loop calls them for you, and the person sees only assistant text. (If you want a live "tool is running" indicator, the Responses API surfaces each server-executed call as an mcp_call item with in_progress/completed lifecycle events.)

A tool marked destructiveHint pauses the run for approval. In this surface the pause is projected into the tool-call channel: you receive a choices[0].delta.tool_calls entry naming the real tool and its arguments, and the turn ends with finish_reason: "tool_calls". The call id is "<run_id>::<tool_call_id>" so you know which run to resume.

You resume by sending the decision back as the next turn's tool message: the tool_call_id echoes the id you received, and the content is approve or reject:

// next request body, staying inside the standard tool-call channel
{ "model": "", "stream": true, "messages": [
  { "role": "user", "content": "Delete the May draft." },
  { "role": "assistant", "tool_calls": [
    { "id": "run_abc::tc_1", "type": "function",
      "function": { "name": "delete_page", "arguments": "{\"id\":\"p1\"}" } } ] },
  { "role": "tool", "tool_call_id": "run_abc::tc_1", "content": "approve" }
] }

On approve, Ingram Cloud executes the tool itself, calling your MCP server from the host loop; you never run it. It then streams the continuation. On reject, the run completes with stop_reason: "approval_rejected" and nothing is executed. This maps onto the same approval that a deployment confirmation or /submit approval_decision drives: they are one mechanism, three front doors.
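Handling the pause in client code comes down to two small steps: splitting the call id to learn which run is waiting, and echoing the id back with the decision. The helpers below are a sketch with hypothetical names (parseApprovalId, decisionMessage); the id format and the approve/reject strings are the ones described above.

```typescript
interface ApprovalRef {
  runId: string;
  toolCallId: string;
}

// Splits "<run_id>::<tool_call_id>". Returns null for ids without that shape,
// for example a plain client-side tool call id.
function parseApprovalId(callId: string): ApprovalRef | null {
  const sep = callId.indexOf("::");
  if (sep <= 0 || sep + 2 >= callId.length) return null;
  return { runId: callId.slice(0, sep), toolCallId: callId.slice(sep + 2) };
}

// Builds the tool message that carries the decision on the next turn.
function decisionMessage(callId: string, approved: boolean) {
  return {
    role: "tool" as const,
    tool_call_id: callId, // must echo the id exactly as received
    content: approved ? "approve" : "reject",
  };
}
```

A UI can use parseApprovalId to decide whether a tool_calls finish is an approval prompt to show the person or a client-side call to execute.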

Usage and cancellation

Set stream_options: { include_usage: true } (the Vercel AI SDK does this for you) and the final chunk carries usage in the standard place (prompt_tokens / completion_tokens / total_tokens). Closing the HTTP connection cancels the run: the turn ends as cancelled and the smith stops calling tools and the model.
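When you parse the stream yourself, the usage figures can be picked off the collected chunks as sketched below. This assumes the OpenAI convention that chunks other than the last carry no usage (or a null one); finalUsage is a hypothetical name.

```typescript
interface Usage {
  prompt_tokens: number;
  completion_tokens: number;
  total_tokens: number;
}

interface UsageChunk {
  usage?: Usage | null;
}

// Walks backwards so the final chunk is checked first.
function finalUsage(chunks: UsageChunk[]): Usage | null {
  for (let i = chunks.length - 1; i >= 0; i--) {
    const usage = chunks[i]?.usage;
    if (usage) return usage;
  }
  return null;
}
```

A null result typically means include_usage was not set, or the connection closed before the final chunk arrived.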

Non-streaming

Omit stream (or set it false) and you get a standard chat.completion object with choices[0].message, finish_reason, and usage. An approval pause comes back as finish_reason: "tool_calls" with the same tool_calls shape; resume it exactly as above.

Chat Completions vs Responses

This page is the Chat Completions projection: the widest-supported format, and what @ai-sdk/openai-compatible speaks. Ingram Cloud also speaks the newer Responses API at POST /v1/responses, with the same smith, memory, tools, and approvals, and three advantages for approval-heavy agents:

Approvals are first-class. A destructiveHint pause surfaces as an mcp_approval_request output item (its id is "<run_id>::<tool_call_id>"); you resume by sending an mcp_approval_response input item ({"approve": true}), instead of the Chat Completions tool-call convention.
Stateful by design. Pass previous_response_id (a prior run_… id) to continue the conversation in that run's thread — you send only the new input and memory carries the history. IC-Thread-Id works too.
Server-tool activity is visible. A tool the run loop executes for you (your MCP server) streams as an mcp_call output item: response.output_item.added with status: "in_progress", then response.mcp_call.in_progress, response.mcp_call.completed, and response.output_item.done with status: "completed". That drives a live "tool is running" indicator in an in-app chat — the thing Chat Completions deliberately hides. (Approval-gated tools still ride the mcp_approval_request path above, not mcp_call.)
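The first two points can be sketched as request-building helpers. Both names (continueBody, approvalItem) are hypothetical. The approval_request_id field follows OpenAI's Responses shape for mcp_approval_response; that Ingram Cloud reads the same field name is an assumption here, since this page shows only {"approve": true}.

```typescript
// Continue a conversation in a prior run's thread: send only the new input.
function continueBody(previousResponseId: string, input: string) {
  return { model: "", previous_response_id: previousResponseId, input };
}

// Answer an mcp_approval_request. `approvalRequestId` is the id of that
// output item ("<run_id>::<tool_call_id>").
// Assumption: the field is named approval_request_id, as in OpenAI's API.
function approvalItem(approvalRequestId: string, approve: boolean) {
  return {
    type: "mcp_approval_response" as const,
    approval_request_id: approvalRequestId,
    approve,
  };
}
```

Each helper returns a plain object: the first is a full POST /v1/responses body, the second is one element of an input array.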

Client-side tools work the same on both surfaces; on Responses the model's calls come back as function_call output items and you send results back as function_call_output input items (the standard Responses contract), instead of the Chat Completions tool_calls/tool-message convention.

# Authorization: smith token (browser-safe, scoped to one smith)
curl https://api.cloud.ingram.tech/v1/responses \
  -H "Authorization: Bearer $IC_SMITH_TOKEN" \
  -H "IC-Api-Version: 2026-05-01" \
  -H "Content-Type: application/json" \
  -d '{ "model": "", "input": "What did I spend on travel in May?" }'

Identity is resolved exactly as above (smith token, the user field, or IC-Smith-Id), and the response id is the IC run_… id, so it correlates with a run.completed webhook just like Chat Completions.
