OpenAI-compatible API
POST /v1/chat/completions drives a smith behind the OpenAI Chat
Completions wire format. Point the openai SDK or the Vercel AI SDK's
@ai-sdk/openai-compatible provider at Ingram Cloud and a logged-in web session
streams straight from the smith: no custom transport, no bespoke SSE parser.
It is the same engine as Runs & streaming: the same smith, the
same MCP tools, the same memory, the same
approval gates and usage metering. Only the bytes on the wire change. Use this
surface for in-app web chat; use /v1/smiths/{sid}/runs when you want the
native envelope, structured output, or server-to-server calls.
Building an in-app chat tab with the Vercel AI SDK? The
AI SDK adapter (@ingram-tech/ai-sdk-adapter) wraps this surface
with the identity, memory, and approval helpers wired in — same
standard wire format, less boilerplate. This page is the wire reference beneath it.
Identity: the token is the smith
There is no smith id in the path. A smith token (sub = "<tenant>:<smith>")
already names exactly one smith, so the call runs as that smith, the same trust
model as an MCP tools/call. The agent is the one that smith runs — resolved
from the smith, never from a request field.
The model field keeps its literal OpenAI meaning: the upstream inference LLM.
Omit it (or send "") to use the agent's configured model; send a model id like
gpt-5.5 to override the LLM for that one call (instructions, tools, and memory
still come from the smith's agent). It does not select the agent — the agent
is the one the smith runs.
A model the upstream provider rejects (an unknown id, a typo, an agent slug)
is a real error, not an empty answer. A non-streaming call returns a non-2xx with
the standard { "error": { … } } body; a streaming call emits a terminal
data: { "error": { … } } frame before data: [DONE] (the same convention the
openai SDK and @ai-sdk/openai-compatible raise as an exception). You will
never get a 200 with finish_reason: "stop" and empty content for a failed
turn.
// apiKey: a per-smith token (never a tenant-admin token from the browser)
import { createOpenAICompatible } from "@ai-sdk/openai-compatible";
import { streamText } from "ai";
const ingram = createOpenAICompatible({
  name: "ingram",
  baseURL: "https://api.cloud.ingram.tech/v1",
  apiKey: SMITH_TOKEN,
});
// "" → the smith's agent's configured model. Use "gpt-5.5" etc. to override the LLM.
const result = streamText({ model: ingram(""), prompt });
for await (const delta of result.textStream) process.stdout.write(delta);
Calling server-side with a tenant-admin token? Name the smith one of two
ways: the OpenAI-standard user field set to the smith's external_id (your own
user id), or an IC-Smith-Id: smt_… header carrying the smt_ id. The user
field is the zero-custom-header path — a stock OpenAI client just works. Without a
resolvable smith the call returns 400 smith_unresolved.
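The two server-side options can be sketched as one request builder. This is a minimal illustration, not an official client: buildAdminRequest and the SmithRef type are hypothetical names, and only the user field, the IC-Smith-Id header, and the endpoint URL come from the text above.

```typescript
// Hypothetical helper: names the target smith for a call made with a tenant-admin token.
type SmithRef =
  | { kind: "external_id"; value: string } // your own user id, sent as the `user` field
  | { kind: "smith_id"; value: string };   // the smt_… id, sent as the IC-Smith-Id header

interface ChatMessage { role: "system" | "user" | "assistant"; content: string }

interface AdminRequest {
  url: string;
  headers: Record<string, string>;
  body: { model: string; messages: ChatMessage[]; user?: string };
}

function buildAdminRequest(adminToken: string, smith: SmithRef, messages: ChatMessage[]): AdminRequest {
  const headers: Record<string, string> = {
    Authorization: `Bearer ${adminToken}`,
    "Content-Type": "application/json",
  };
  const body: AdminRequest["body"] = { model: "", messages };
  if (smith.kind === "external_id") {
    body.user = smith.value; // zero-custom-header path: a stock OpenAI client can set this
  } else {
    headers["IC-Smith-Id"] = smith.value;
  }
  return { url: "https://api.cloud.ingram.tech/v1/chat/completions", headers, body };
}
```

If neither path names a resolvable smith, expect the 400 smith_unresolved described above.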
# Authorization: smith token (browser-safe, scoped to one smith)
curl https://api.cloud.ingram.tech/v1/chat/completions \
  -H "Authorization: Bearer $IC_SMITH_TOKEN" \
  -H "IC-Api-Version: 2026-05-01" \
  -H "Content-Type: application/json" \
  -d '{ "model": "", "stream": true,
        "messages": [{ "role": "user", "content": "What did I spend on travel in May?" }] }'
The stream is standard Chat Completions framing: data: {chunk}\n\n chunks whose
text rides on choices[0].delta.content, terminated by data: [DONE]. The
chunk id is the Ingram Cloud run_… id, so a stream and a
run.completed webhook for the same turn are correlatable, and a
dropped stream can be reconciled against the run record.
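The SDKs parse this framing for you. If you read the wire directly, the framing above can be sketched as follows. The sketch assumes the whole stream body is already buffered in one string (a live reader would also have to handle frames split across network reads), and summarizeChatStream and StreamSummary are illustrative names, not part of any SDK.

```typescript
interface StreamSummary {
  runId: string | null;        // chunk id, which is the run_… id per the text above
  text: string;                // concatenated choices[0].delta.content
  finishReason: string | null; // last finish_reason seen, if any
  error: unknown;              // payload of a terminal { "error": … } frame, if any
  done: boolean;               // true once data: [DONE] was seen
}

function summarizeChatStream(raw: string): StreamSummary {
  const summary: StreamSummary = { runId: null, text: "", finishReason: null, error: null, done: false };
  // Frames are separated by a blank line; each frame is a single "data: …" line.
  for (const frame of raw.split("\n\n")) {
    const line = frame.trim();
    if (!line.startsWith("data:")) continue;
    const payload = line.slice("data:".length).trim();
    if (payload === "[DONE]") {
      summary.done = true;
      break;
    }
    const chunk = JSON.parse(payload);
    if (chunk.error !== undefined) {
      summary.error = chunk.error; // failed turn: surface it instead of returning empty text
      continue;
    }
    if (typeof chunk.id === "string") summary.runId = chunk.id;
    const choice = chunk.choices?.[0];
    if (typeof choice?.delta?.content === "string") summary.text += choice.delta.content;
    if (typeof choice?.finish_reason === "string") summary.finishReason = choice.finish_reason;
  }
  return summary;
}
```

Keeping runId even when the stream drops before [DONE] is what lets you reconcile against the run record later.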
Memory: stateless by default, stateful on request
Plain Chat Completions is stateless, so by default the messages you send are
the whole context for that turn and a fresh thread is used. To use Ingram Cloud's
server-side memory instead, send an IC-Thread-Id: <your-id>
header: Ingram Cloud then holds the thread and you send only the new user turn
(the same thread model as a native run's thread_id).
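A small sketch of the difference between the two modes, assuming a hypothetical buildTurn helper: without a thread id the full history goes on the wire; with one, only the new turn does and the IC-Thread-Id header carries the rest.

```typescript
interface Msg { role: "user" | "assistant"; content: string }

interface Turn {
  headers: Record<string, string>;
  body: { model: string; messages: Msg[] };
}

// Hypothetical helper; only the IC-Thread-Id header name comes from the text above.
function buildTurn(history: Msg[], next: Msg, threadId?: string): Turn {
  if (threadId !== undefined) {
    // Stateful: Ingram Cloud holds the thread, so send only the new user turn.
    return { headers: { "IC-Thread-Id": threadId }, body: { model: "", messages: [next] } };
  }
  // Stateless (the default): the messages you send are the whole context.
  return { headers: {}, body: { model: "", messages: [...history, next] } };
}
```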
Instructions: per-request system is honored
The smith's agent provides the base instructions. A system message (Chat) or the
instructions field (Responses) is appended to them for that one turn — so the
agent's persona and guardrails stay authoritative and you add live, per-request
context (the page the user is on, a read/write toggle, anything recomputed each
call). Omit it and behaviour is unchanged. This is the standard channel for dynamic
context — use it instead of rewriting agent config per turn.
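For example, per-request context might be attached like this. RequestContext and withRequestContext are hypothetical, and the wording of the system message is only an illustration of the kind of live context described above.

```typescript
interface Msg { role: "system" | "user" | "assistant"; content: string }

// Hypothetical shape for context an app recomputes on every call.
interface RequestContext { currentPage: string; readOnly: boolean }

function withRequestContext(turns: Msg[], ctx: RequestContext): Msg[] {
  const system: Msg = {
    role: "system",
    content: `The user is viewing ${ctx.currentPage}. Write access: ${ctx.readOnly ? "off" : "on"}.`,
  };
  // Appended server-side to the agent's base instructions for this one turn.
  return [system, ...turns];
}
```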
Images and files
Image and file content parts are passed through to the smith's model. Send the
standard OpenAI shape — a content array mixing text with image_url (Chat) /
input_image (Responses) for images, or file (Chat) / input_file (Responses)
for documents like PDFs. An image may be a hosted https URL or an inline base64
data URL; a file is inlined as base64:
{ "messages": [{ "role": "user", "content": [
{ "type": "text", "text": "What's wrong with my invoice?" },
{ "type": "image_url", "image_url": { "url": "data:image/png;base64,iVBORw0KGgo…" } },
{ "type": "file", "file": {
"filename": "invoice.pdf",
"file_data": "data:application/pdf;base64,JVBERi0…" } }
] }] }
The smith's model must be vision/document-capable (most current models are). The same content shape works whichever provider backs the model — each receives the image or document in its native form.
Inline files are stored for auditability. Bytes you inline (an image data URL
or a file's file_data) are saved to file storage rather than kept in the run
record, which holds a lightweight reference instead; the run's input then carries
a file_id. Fetch a stored file's metadata at GET /v1/files/{file_id} and its
bytes at GET /v1/files/{file_id}/content (token needs the files:read scope). A
hosted https image URL is passed by reference and not stored. These inline files
are reachable by id or through the run that referenced them; they are not enumerated
by a list endpoint.
Inline bytes are bounded by a 32 MB request limit (matching the tightest common
provider); a larger body is rejected with 413 payload_too_large. For bigger files,
host them and pass an https image URL, or wait for the Files API.
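A client-side guard can catch an oversized body before it reaches the 413. This is a sketch under assumptions: "32 MB" is read here as 32 × 1024 × 1024 bytes, which the text does not specify, and filePart and serializeWithinLimit are hypothetical helpers.

```typescript
// Assumption: the 32 MB limit is interpreted as 32 MiB of serialized request body.
const MAX_REQUEST_BYTES = 32 * 1024 * 1024;

type ContentPart =
  | { type: "text"; text: string }
  | { type: "image_url"; image_url: { url: string } }
  | { type: "file"; file: { filename: string; file_data: string } };

// Wraps already-base64-encoded bytes in the Chat Completions file part shape.
function filePart(filename: string, mimeType: string, base64: string): ContentPart {
  return { type: "file", file: { filename, file_data: `data:${mimeType};base64,${base64}` } };
}

// Serializes the body and throws if its UTF-8 size exceeds the limit.
function serializeWithinLimit(body: unknown, limit: number = MAX_REQUEST_BYTES): string {
  const json = JSON.stringify(body);
  const bytes = new TextEncoder().encode(json).length;
  if (bytes > limit) {
    throw new RangeError(`request body is ${bytes} bytes; limit is ${limit}`);
  }
  return json;
}
```

Base64 inflates bytes by roughly a third, so the check runs on the serialized body rather than on the original file size.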
Tools
There are two tool models, both standard; pick one per use case:
- Client-side tools: you define functions and run them yourself (the standard OpenAI function-call loop). Send tools on the request.
- Server-side tools: Ingram Cloud calls your MCP server and runs them for you, with approval gating. Send no tools on the request; register the server once.
Client-side tools (you execute)
Send tools exactly as you would to OpenAI. The model's calls come back as
tool_calls for you to execute; send the results back as tool messages and
the model continues. Ingram Cloud runs nothing and gates nothing — your loop owns
both — and it's stateless, so re-send the conversation each turn (same as OpenAI):
// 1) you send tools + the turn; the model asks to call one
{ "model": "", "tools": [
{ "type": "function", "function": { "name": "get_weather",
"parameters": { "type": "object", "additionalProperties": false,
"properties": { "city": { "type": "string" } }, "required": ["city"] } } } ],
"messages": [ { "role": "user", "content": "weather in Paris?" } ] }
// → finish_reason "tool_calls", message.tool_calls = [{ id, function:{ name, arguments } }]
// 2) you run get_weather, then re-send the whole conversation with the result
{ "model": "", "tools": [ /* same tools */ ],
"messages": [
{ "role": "user", "content": "weather in Paris?" },
{ "role": "assistant", "tool_calls": [ { "id": "call_1", "type": "function",
"function": { "name": "get_weather", "arguments": "{\"city\":\"Paris\"}" } } ] },
{ "role": "tool", "tool_call_id": "call_1", "content": "{\"tempC\":21}" } ] }
// → the model answers with finish_reason "stop"
A turn that sends tools uses only those client tools — the smith's agent still
provides the system instructions, but its server-side MCP tools and memory are not in
play for that turn (you brought your own loop). To use MCP tools, omit tools.
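Step 2 of the loop above, turning the model's tool_calls into tool messages, might look like this. runToolCalls and the handler map are illustrative names; the message shapes follow the JSON shown above.

```typescript
interface ToolCall {
  id: string;
  type: "function";
  function: { name: string; arguments: string }; // arguments is a JSON-encoded string
}

interface ToolMessage { role: "tool"; tool_call_id: string; content: string }

type ToolHandler = (args: Record<string, unknown>) => unknown;

// Runs each requested tool locally and returns the tool messages to append
// to the conversation before re-sending it.
function runToolCalls(toolCalls: ToolCall[], handlers: Record<string, ToolHandler>): ToolMessage[] {
  return toolCalls.map((call): ToolMessage => {
    const handler = handlers[call.function.name];
    const result =
      handler === undefined
        ? { error: `unknown tool: ${call.function.name}` }
        : handler(JSON.parse(call.function.arguments));
    return { role: "tool", tool_call_id: call.id, content: JSON.stringify(result) };
  });
}
```

Append the assistant message and these tool messages to the history, then re-send the whole conversation with the same tools.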
Forcing a tool with tool_choice
Send tool_choice alongside tools to control whether the model may call them,
exactly as you would to OpenAI:
- "auto" (the default): the model decides.
- "none": the model must answer in text and call nothing.
- "required": the model must call one of the tools.
- { "type": "function", "function": { "name": "get_weather" } }: the model must call that specific tool. On the Responses API the shape is the flatter { "type": "function", "name": "get_weather" }.
tool_choice only governs the tools you sent on the request, so it is an error to
send it without a non-empty tools array, or to name a tool that isn't in it — both
return 400 invalid_tool_choice (param: "tool_choice") rather than being silently
ignored.
// Authorization: per-smith token
{ "model": "", "tools": [ /* get_weather as above */ ],
"tool_choice": { "type": "function", "function": { "name": "get_weather" } },
"messages": [ { "role": "user", "content": "weather in Paris?" } ] }
// → the model is forced to call get_weather; finish_reason "tool_calls"
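The two invalid_tool_choice conditions can be checked before sending. This is a client-side mirror of the rule as stated above, not the server's implementation; checkToolChoice is a hypothetical name and covers only the Chat Completions shape.

```typescript
interface FunctionTool { type: "function"; function: { name: string } }

type ToolChoice =
  | "auto"
  | "none"
  | "required"
  | { type: "function"; function: { name: string } };

// Returns a description of the problem, or null if the pair looks valid.
function checkToolChoice(tools: FunctionTool[] | undefined, choice: ToolChoice | undefined): string | null {
  if (choice === undefined) return null; // nothing to validate
  if (tools === undefined || tools.length === 0) {
    return "tool_choice sent without a non-empty tools array";
  }
  if (typeof choice === "object") {
    const wanted = choice.function.name;
    if (!tools.some((t) => t.function.name === wanted)) {
      return `tool_choice names "${wanted}", which is not in tools`;
    }
  }
  return null;
}
```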
Server-side tools (MCP) and approvals
Read/automatic tools run server-side and never appear in this (Chat
Completions) stream: the run loop calls them for you, and the person sees only
assistant text. (If you want a live "tool is running" indicator, the
Responses API surfaces each server-executed call
as an mcp_call item with in_progress/completed lifecycle events.)
A tool marked destructiveHint pauses the run for approval. In this surface the
pause is projected into the tool-call channel: you receive a
choices[0].delta.tool_calls entry naming the real tool and its arguments, and
the turn ends with finish_reason: "tool_calls". The call id is
"<run_id>::<tool_call_id>" so you know which run to resume.
You resume by sending the decision back as the next turn's tool message: the
tool_call_id echoes the id you received, and the content is approve or
reject:
// next request body, staying inside the standard tool-call channel
{ "model": "", "stream": true, "messages": [
{ "role": "user", "content": "Delete the May draft." },
{ "role": "assistant", "tool_calls": [
{ "id": "run_abc::tc_1", "type": "function",
"function": { "name": "delete_page", "arguments": "{\"id\":\"p1\"}" } } ] },
{ "role": "tool", "tool_call_id": "run_abc::tc_1", "content": "approve" }
] }
On approve, Ingram Cloud executes the tool itself against your MCP server, inside
the host loop; you never run it. It then streams the continuation. On reject, the run
completes with stop_reason: "approval_rejected" and nothing is executed. This
maps onto the same approval that a deployment confirmation or /submit approval_decision drives: they are one mechanism, three front doors.
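The id convention and the decision message can be sketched as two small helpers. Both names are hypothetical; the "<run_id>::<tool_call_id>" format and the approve / reject content values are taken from the text above.

```typescript
interface ApprovalRef { runId: string; toolCallId: string }

// Splits "<run_id>::<tool_call_id>"; returns null for ids that do not follow the convention.
function parseApprovalCallId(callId: string): ApprovalRef | null {
  const sep = callId.indexOf("::");
  if (sep <= 0 || sep + 2 >= callId.length) return null;
  return { runId: callId.slice(0, sep), toolCallId: callId.slice(sep + 2) };
}

interface ApprovalDecision {
  role: "tool";
  tool_call_id: string;
  content: "approve" | "reject";
}

// Builds the tool message that resumes a paused run; the id is echoed unchanged.
function approvalDecision(callId: string, approve: boolean): ApprovalDecision {
  return { role: "tool", tool_call_id: callId, content: approve ? "approve" : "reject" };
}
```

The parsed runId is useful for logging or for matching the pause against a run.completed webhook; the resume message itself always carries the full, unsplit id.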
Usage and cancellation
Set stream_options: { include_usage: true } (the Vercel AI SDK does this for
you) and the final chunk carries usage in the standard place
(prompt_tokens / completion_tokens / total_tokens). Closing the HTTP
connection cancels the run: the turn ends as cancelled and the smith
stops calling tools and the model.
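Outside the Vercel AI SDK, both behaviours can be wired by hand. A sketch, assuming a fetch-compatible function is injected (startTurn and FetchLike are hypothetical names): the body opts into usage reporting, and cancel() aborts the request, which closes the connection.

```typescript
// Minimal subset of fetch's signature, so the sketch runs with the real fetch or a stub.
type FetchLike = (
  url: string,
  init: { method: string; headers: Record<string, string>; body: string; signal: AbortSignal },
) => Promise<unknown>;

function startTurn(fetchImpl: FetchLike, token: string, messages: unknown[]) {
  const controller = new AbortController();
  const response = fetchImpl("https://api.cloud.ingram.tech/v1/chat/completions", {
    method: "POST",
    headers: { Authorization: `Bearer ${token}`, "Content-Type": "application/json" },
    body: JSON.stringify({
      model: "",
      stream: true,
      stream_options: { include_usage: true }, // final chunk then carries usage
      messages,
    }),
    signal: controller.signal,
  });
  // Aborting tears down the HTTP request; per the text above, that cancels the run.
  return { response, cancel: () => controller.abort() };
}
```

In a browser or Node 18+, pass the global fetch as fetchImpl and call cancel() from a "Stop" button handler.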
Non-streaming
Omit stream (or set it false) and you get a standard chat.completion object
with choices[0].message, finish_reason, and usage. An approval pause comes
back as finish_reason: "tool_calls" with the same tool_calls shape; resume it
exactly as above.
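Because a turn that sends tools uses only client tools, whether the request carried tools is enough to tell an approval pause from a client tool call. A sketch of that branching, with classifyCompletion and the Outcome type as hypothetical names:

```typescript
interface ToolCall {
  id: string;
  type: "function";
  function: { name: string; arguments: string };
}

interface ChatCompletion {
  id: string; // the run_… id
  choices: { finish_reason: string; message: { content: string | null; tool_calls?: ToolCall[] } }[];
}

type Outcome =
  | { kind: "answer"; text: string }
  | { kind: "client_tool_calls"; calls: ToolCall[] } // you execute these
  | { kind: "approval_pending"; calls: ToolCall[] }; // reply with approve / reject

function classifyCompletion(completion: ChatCompletion, requestSentTools: boolean): Outcome {
  const choice = completion.choices[0];
  if (choice === undefined) throw new Error("completion has no choices");
  const calls = choice.message.tool_calls ?? [];
  if (choice.finish_reason === "tool_calls" && calls.length > 0) {
    return requestSentTools
      ? { kind: "client_tool_calls", calls }
      : { kind: "approval_pending", calls };
  }
  return { kind: "answer", text: choice.message.content ?? "" };
}
```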
Chat Completions vs Responses
This page is the Chat Completions projection: the widest-supported format,
and what @ai-sdk/openai-compatible speaks. Ingram Cloud also speaks the newer
Responses API at POST /v1/responses, with the same smith, memory, tools, and
approvals, and three advantages for approval-heavy agents:
- Approvals are first-class. A destructiveHint pause surfaces as an mcp_approval_request output item (its id is "<run_id>::<tool_call_id>"); you resume by sending an mcp_approval_response input item ({"approve": true}), instead of the Chat Completions tool-call convention.
- Stateful by design. Pass previous_response_id (a prior run_… id) to continue the conversation in that run's thread: you send only the new input and memory carries the history. IC-Thread-Id works too.
- Server-tool activity is visible. A tool the run loop executes for you (your MCP server) streams as an mcp_call output item: response.output_item.added with status: "in_progress", then response.mcp_call.in_progress, response.mcp_call.completed, and response.output_item.done with status: "completed". That drives a live "tool is running" indicator in an in-app chat, the thing Chat Completions deliberately hides. (Approval-gated tools still ride the mcp_approval_request path above, not mcp_call.)
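The first two items can be sketched as request bodies. Treat this as an assumption-laden illustration: the approval_request_id field follows the OpenAI Responses convention for mcp_approval_response items and is assumed, not confirmed here, to carry the "<run_id>::<tool_call_id>" id; approvalResume and continueConversation are hypothetical helpers.

```typescript
interface ApprovalResponseItem {
  type: "mcp_approval_response";
  approval_request_id: string; // assumed: the id of the mcp_approval_request item
  approve: boolean;
}

interface ResponsesBody {
  model: string;
  input: string | ApprovalResponseItem[];
  previous_response_id?: string;
}

// Resumes a paused run by answering its approval request.
function approvalResume(approvalRequestId: string, approve: boolean): ResponsesBody {
  return {
    model: "",
    input: [{ type: "mcp_approval_response", approval_request_id: approvalRequestId, approve }],
  };
}

// Continues a conversation in a prior run's thread: only the new input is sent.
function continueConversation(previousRunId: string, input: string): ResponsesBody {
  return { model: "", input, previous_response_id: previousRunId };
}
```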
Client-side tools work the same on both surfaces; on Responses the model's calls
come back as function_call output items and you send results back as
function_call_output input items (the standard Responses contract), instead of the
Chat Completions tool_calls/tool-message convention.
# Authorization: smith token (browser-safe, scoped to one smith)
curl https://api.cloud.ingram.tech/v1/responses \
  -H "Authorization: Bearer $IC_SMITH_TOKEN" \
  -H "IC-Api-Version: 2026-05-01" \
  -H "Content-Type: application/json" \
  -d '{ "model": "", "input": "What did I spend on travel in May?" }'
Identity is resolved exactly as above (smith token, the user field, or
IC-Smith-Id), and the response id is the Ingram Cloud run_… id, so it correlates
with a run.completed webhook just like Chat Completions.