Prompt API

Building block for the Web’s Built-in Prompt API (LanguageModel). One-shot ask() for embeds and widgets, plus a createSession() primitive for chat-shaped apps that need independent per-conversation sessions, streaming with deltas, and a multi-turn history.

The package’s README covers install and the API table. This page is the conceptual overview of the core (vanilla) surface. The React adapter lives at @web-ai-sdk/prompt/react; see usePrompt and useSession.

One-shot: `ask()`

import { ask } from "@web-ai-sdk/prompt";

const result = await ask({
  input: "Summarize this in one sentence: WebMCP lets pages expose tools to agents.",
  systemPrompt: "You are concise. Reply with a single sentence.",
  samplingMode: "predictable",
  onUpdate: (text) => render(text),
});

console.log(result.output, result.cached);

Pass onUpdate to render partial text as it streams. result.output is the final cleaned text; result.cached tells you whether the response came back without invoking the model.

onUpdate receives the cumulative buffer, not deltas. For delta-shaped streaming use createSession().sendStreaming().

Options

interface AskOptions {
  input: string;
  systemPrompt?: string;
  samplingMode?: "most-predictable" | "predictable" | "balanced" | "creative" | "most-creative";
  temperature?: number;
  topK?: number;
  language?: string;
  supportedLanguages?: readonly string[];
  expectedInputs?: LanguageModelExpectedInput[];
  expectedOutputs?: LanguageModelExpectedOutput[];
  tools?: LanguageModelTool[];
  monitor?: (m: CreateMonitor) => void;
  responseConstraint?: object;
  cache?: "session" | "local" | { get, set };
  cacheKey?: string;
  onUpdate?: (text: string) => void;
  signal?: AbortSignal;
}

| Name | Type | Description |
| --- | --- | --- |
| `input` (required) | `string` | The user-facing prompt / question. Empty / whitespace resolves to `{ output: null }`. |
| `systemPrompt` | `string` | Optional system prompt (folded into `initialPrompts` as a `system` role). |
| `samplingMode` | `"most-predictable" \| "predictable" \| "balanced" \| "creative" \| "most-creative"` | Semantic sampling preset. Mutually exclusive with `temperature` / `topK`. |
| `temperature` | `number` | Legacy raw sampling temperature. Web page contexts are moving to `samplingMode`. |
| `topK` | `number` | Legacy raw sampling top-k. Web page contexts are moving to `samplingMode`. |
| `language` | `string` | BCP-47 language hint. Folded into `expectedInputs` / `expectedOutputs` when supported. |
| `supportedLanguages` | `readonly string[]` | Languages the model supports for the language hint. Defaults to `["en"]`. |
| `expectedInputs` | `LanguageModelExpectedInput[]` | Advanced: full passthrough. Overrides the `language` hint. |
| `expectedOutputs` | `LanguageModelExpectedOutput[]` | Advanced: full passthrough. Overrides the `language` hint. |
| `tools` | `LanguageModelTool[]` | Experimental. Native function-calling passthrough (see Native tool calling). |
| `monitor` | `(m) => void` | Observe the first-call model download. |
| `responseConstraint` | `object` | JSON Schema for structured output. |
| `cache` | `"session" \| "local" \| { get, set }` | Opt-in result cache. |
| `cacheKey` | `string` | Override the default cache key (JSON string of the prompt and output-shaping options). |
| `onUpdate` | `(text: string) => void` | Streaming update callback. Cumulative buffer, not deltas. |
| `signal` | `AbortSignal` | Abort signal. |
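The `cache` option accepts any `{ get, set }`-shaped object. As a rough sketch of what a custom adapter could look like, here is an in-memory cache with a time-to-live. This page does not spell out the exact signatures, so the `ResultCacheLike` shape (synchronous, string values, `undefined` on a miss) is an assumption, and `createTtlCache` is a hypothetical helper, not part of the package.

```typescript
// Assumed adapter shape: this page only says "{ get, set }".
// Sync vs async and the miss value are assumptions; check the README.
interface ResultCacheLike {
  get(key: string): string | undefined;
  set(key: string, value: string): void;
}

// In-memory cache whose entries expire after ttlMs milliseconds.
// The clock is injectable so expiry can be exercised without waiting.
function createTtlCache(
  ttlMs: number,
  now: () => number = () => Date.now(),
): ResultCacheLike {
  const entries = new Map<string, { value: string; expiresAt: number }>();
  return {
    get(key) {
      const hit = entries.get(key);
      if (hit === undefined) return undefined;
      if (now() >= hit.expiresAt) {
        entries.delete(key); // expired: drop it and report a miss
        return undefined;
      }
      return hit.value;
    },
    set(key, value) {
      entries.set(key, { value, expiresAt: now() + ttlMs });
    },
  };
}
```

If the adapter shape matches, it would be passed as `ask({ input, cache: createTtlCache(60_000) })`.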

Returns

interface AskResult {
  output: string | null;
  cached: boolean;
}

Chat: `createSession()`

ask() is isolated per call: it may keep a warm base LanguageModel for same-shape calls, but each prompt runs on a fresh clone when the browser supports clone(), or on a fresh one-shot instance otherwise. That’s a good fit for embeds and ask-and-display flows. createSession() gives every conversation its own underlying instance for multi-turn chat:

import { createSession } from "@web-ai-sdk/prompt";

const session = createSession({
  systemPrompt: "You are a helpful assistant.",
  samplingMode: "balanced",
});

// Streaming yields DELTA chunks (not cumulative buffers):
for await (const delta of session.sendStreaming("Tell me about WebMCP.")) {
  render(delta); // e.g. append the chunk to the DOM
}

// Or one-shot per turn:
const text = await session.send("And what about the Prompt API?");

// Tear down explicitly when the conversation ends.
session.destroy();

The wrapper is intentionally thin. It handles cross-browser smoothing every consumer would otherwise reimplement (delta-vs-cumulative chunk detection, output sanitization, abort wiring, typed unavailability) and forwards everything else to the native instance. It does not track conversation history or queue concurrent sends; those are your data model and UI concerns. It does surface clone(), since forking a warm base session is the spec’s recommended way to start a fresh task (see Session resilience below).

createSession() never touches the ask() session cache. Two sessions with identical options get two independent instances with isolated history, system prompt, sampling, and lifecycle — abort() / destroy() on one session never touch another. Concurrent send / sendStreaming calls on the same session are NOT queued — the underlying LanguageModel is sequential per instance and will reject the overlapping call with InvalidStateError. Either await the previous send or call session.abort() before issuing a new turn.
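Because overlapping sends are rejected rather than queued, an app that can fire turns concurrently needs its own ordering. The sketch below chains turns on a promise so each one starts only after the previous one settles. It assumes nothing beyond a `send(input)` method; `SendOnly` and `serializeSends` are hypothetical names, not package exports.

```typescript
// Minimal structural type: only the method this sketch needs.
interface SendOnly {
  send(input: string): Promise<string>;
}

// Wraps a session so overlapping send() calls run one after another
// instead of hitting the underlying instance at the same time.
function serializeSends(session: SendOnly): SendOnly {
  let tail: Promise<unknown> = Promise.resolve();
  return {
    send(input) {
      // Start this turn only after the previous one has settled.
      const turn = tail.then(() => session.send(input));
      // A failed turn must not block the turns queued behind it.
      tail = turn.catch(() => undefined);
      return turn;
    },
  };
}
```

Each caller still receives its own result or rejection; only the start of the turn is delayed.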

Concurrency note. Each session is an independent LanguageModel instance, but the underlying on-device model is single-instance. Chrome 148 / Edge 138 currently schedule sendStreaming calls across sessions FIFO: overlapping sends do not interleave token-by-token — the second send waits for the first to drain. This is a constraint of the runtime, not of the API; code written against createSession() becomes faster automatically if a future release exposes parallel inference.

Session resilience: base + per-task `clone()`

For agents and multi-task flows there’s a lose-lose with naive session handling: reuse one long-lived session and history accumulates (later runs start echoing earlier ones, and you eventually hit QuotaExceededError); recreate a session per task and you pay the cold start and can trip Chrome’s single-instance degradation. Chrome’s session management guidance recommends a third way: keep one warm base session (system prompt only) and clone() it per task. The clone inherits the system prompt and history without re-parsing instructions or paying another create(), then gets its own independent history and lifecycle.

const base = createSession({ systemPrompt }); // create once; keep warm
// per task / run:
const turn = await base.clone();              // fresh history, no re-parse
try {
  for await (const delta of turn.sendStreaming(input)) render(delta);
} finally {
  turn.destroy();                             // free the clone, keep the base warm
}

clone() throws SessionDestroyedError if the base has been destroyed, and PromptUnavailableError if the underlying browser instance doesn’t support cloning. The clone is fully independent: destroying it never affects the base, and vice versa. The React useSession hook returns the Session directly, so session.clone() is available there too.
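The clone-then-destroy pattern is easy to get wrong when a task throws, so it can help to wrap it once. A minimal sketch, assuming only the `clone()`, `send()`, and `destroy()` methods described above; `withClone` and the two interfaces are hypothetical helpers for illustration.

```typescript
// Structural types covering only what this sketch touches.
interface TaskSession {
  send(input: string): Promise<string>;
  destroy(): void;
}
interface ClonableSession {
  clone(): Promise<TaskSession>;
}

// Runs one task on a fresh clone and always frees the clone afterwards,
// whether the task resolved or threw. The base session stays warm.
async function withClone<T>(
  base: ClonableSession,
  task: (turn: TaskSession) => Promise<T>,
): Promise<T> {
  const turn = await base.clone();
  try {
    return await task(turn);
  } finally {
    turn.destroy();
  }
}
```

A call then reads `await withClone(base, (turn) => turn.send(input))`.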

Injecting context without a turn — `Session.append()`

Agent loops often need to push tool results or other context into history without triggering a model turn. Faking this with an extra send() wastes tokens and latency on an empty intermediate response. session.append() forwards to the native LanguageModel.append(): the messages land in history, and the next send / sendStreaming sees them as prior turns.

const session = createSession({ systemPrompt });
await session.send("What's the weather in Tokyo?");
// The model asked to call a tool; run it yourself, then inject the result:
await session.append([
  { role: "assistant", content: "I'll check the weather." },
  { role: "user", content: "tool result: 24°C, clear" },
]);
// The next turn sees the tool result as history — no wasted intermediate turn.
const plan = await session.send("Based on that, suggest an outfit.");

append() throws SessionDestroyedError if the session is destroyed and PromptUnavailableError if the browser instance doesn’t support append(). Aborts reject with PromptAbortError.

Prefill and message arrays

Session.send / sendStreaming accept either a single string turn or a full LanguageModelMessage[]. Passing an array lets you supply multi-message context, control roles per turn, and, most usefully, prefill the assistant’s reply: set prefix: true on the trailing assistant message and the model treats its content as the start of its own answer rather than a turn to respond to.

const session = createSession({ systemPrompt });

// Multi-message turn: full conversation context, roles per message.
const reply = await session.send([
  { role: "user", content: "What is RAG?" },
  { role: "assistant", content: "Retrieval-Augmented Generation." },
  { role: "user", content: "Give me the three-step recipe." },
]);

// Prefill: bias the model toward JSON without a full schema.
const json = await session.send([
  { role: "user", content: "Describe a cat in one word of JSON." },
  { role: "assistant", content: '{"thought":"', prefix: true },
]);
// model completes: feline"}  ->  you parse {"thought":"feline"}

Prefill vs responseConstraint: both shape output, different trade-offs:

Prefill (prefix: true): cheaper per turn (no schema inlined into context), weaker guarantee; the model may drift off the prefixed format. Good for cheap nudges and structured-output hints that you parse defensively.
responseConstraint: enforced JSON Schema (the runtime validates against it), higher per-turn token cost when the schema is large. Use omitResponseConstraintInput: true to drop the inlined schema and keep only the enforced constraint.

They compose: prefill the opening brace, set responseConstraint for the full shape.
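Since prefill is only a nudge, the parsing side should tolerate drift. A minimal sketch of defensive parsing for the `{"thought":"` example above, assuming the reply contains only the continuation and not the prefix itself; `parsePrefilledJson` and `readThought` are hypothetical helpers.

```typescript
// Joins the prefilled prefix with the model's continuation and parses it.
// Returns null instead of throwing when the model drifted off the format.
function parsePrefilledJson(prefix: string, completion: string): unknown {
  try {
    return JSON.parse(prefix + completion);
  } catch {
    return null;
  }
}

// Narrows the parsed value to the one field the prefill asked for.
function readThought(prefix: string, completion: string): string | null {
  const value = parsePrefilledJson(prefix, completion);
  if (typeof value !== "object" || value === null) return null;
  const thought = (value as { thought?: unknown }).thought;
  return typeof thought === "string" ? thought : null;
}
```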

Spec rule: prefix: true is only valid on the trailing assistant message. Anywhere else (a non-final message, a non-assistant role) the browser throws a "SyntaxError" DOMException. The SDK does not catch this, so it propagates to your send / sendStreaming caller.
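Because the SDK lets that error propagate, a cheap pre-flight check can turn it into a clearer message on your side. A minimal sketch of the rule as stated above; `ChatMessage` is a local stand-in for the message shape, and `firstInvalidPrefixIndex` is a hypothetical helper.

```typescript
// Local stand-in for the message shape used in the examples above.
interface ChatMessage {
  role: "system" | "user" | "assistant";
  content: string;
  prefix?: boolean;
}

// Returns the index of the first message that breaks the prefix rule,
// or -1 when prefix: true appears only on a trailing assistant message.
function firstInvalidPrefixIndex(messages: readonly ChatMessage[]): number {
  const lastIndex = messages.length - 1;
  return messages.findIndex(
    (message, index) =>
      message.prefix === true &&
      (message.role !== "assistant" || index !== lastIndex),
  );
}
```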

Note on content: LanguageModelMessage.content is currently string only. Multimodal ContentPart[] content (images, audio) is tracked as a future enhancement; no timeline is promised.

Context-window introspection

Session surfaces the live token budget the native instance reports, so consumers can size work to the real context window instead of hardcoding a character cap or reaching past the SDK to globalThis.LanguageModel to read it:

session.contextWindow — max input tokens for the session (the context window).
session.contextUsage — input tokens used so far. On a fresh base-clone this reflects the inherited history (≈ the system prompt), the right baseline to budget a turn against.

These mirror the Prompt API’s contextWindow / contextUsage (the renamed successors of inputQuota / inputUsage); the wrapper reads the new names and falls back to the deprecated ones on older Chrome builds. Both are undefined until the underlying instance exists. The instance is created lazily on the first send / sendStreaming, so read them after a send or — cleaner — on a session from clone(), whose instance is live the moment clone() resolves.

const ANSWER_RESERVE_TOKENS = 512;             // illustrative value: tokens held back for the reply
const base = createSession({ systemPrompt }); // keep warm
const turn = await base.clone();               // instance is live here
const quota = turn.contextWindow;              // e.g. 4096 / 6144 tokens
const used = turn.contextUsage ?? 0;           // ≈ system prompt
if (quota) {
  const available = quota - used - ANSWER_RESERVE_TOKENS;
  const budgetChars = Math.max(0, available) * 4; // ~4 chars/token
  // truncate fetched content to budgetChars so it fits in one turn
}
// Fall back to a fixed char cap when contextWindow is undefined
// (older browsers / pre-creation).
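The same arithmetic can be pulled into a pure function that also covers the fallback path. This is a sketch under the snippet's own assumption of roughly 4 characters per token; `charBudget` and its parameter names are hypothetical.

```typescript
const CHARS_PER_TOKEN = 4; // rough average, not an exact tokenizer ratio

// Converts the session's token budget into a character budget for one turn.
// Falls back to a fixed cap when the context window is not known yet.
function charBudget(
  contextWindow: number | undefined,
  contextUsage: number | undefined,
  reserveTokens: number,
  fallbackChars: number,
): number {
  if (!contextWindow) return fallbackChars;
  const availableTokens = contextWindow - (contextUsage ?? 0) - reserveTokens;
  return Math.max(0, availableTokens) * CHARS_PER_TOKEN;
}

console.log(charBudget(4096, 96, 1000, 8000)); // 12000
console.log(charBudget(undefined, undefined, 1000, 8000)); // 8000
```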

session.onContextOverflow(listener) subscribes to the native contextoverflow event, which fires when a turn pushes usage past the window and the oldest history is dropped. Use it to compact or fork a fresh clone() before hitting QuotaExceededError. It returns an idempotent cleanup function, and is a no-op (returns a no-op cleanup) when the underlying instance doesn’t expose the event.

const stop = session.onContextOverflow(() => {
  // compact, summarize, or start a fresh clone before QuotaExceededError
});
// later
stop();
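One way to act on the event is to record that an overflow happened and check the flag before the next task. A minimal sketch, assuming only the `onContextOverflow(listener)` signature described above; `watchOverflow` is a hypothetical helper.

```typescript
// Structural type: only the subscription method this sketch needs.
interface OverflowSource {
  onContextOverflow(listener: () => void): () => void;
}

// Remembers whether the session has overflowed since the last reset.
function watchOverflow(session: OverflowSource) {
  let overflowed = false;
  const unsubscribe = session.onContextOverflow(() => {
    overflowed = true;
  });
  return {
    // True once history has been dropped: a cue to compact or re-clone.
    needsFreshClone: () => overflowed,
    reset: () => {
      overflowed = false;
    },
    dispose: unsubscribe,
  };
}
```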

How it works

Chrome’s LanguageModel exposes LanguageModel.create({...}) to spin up a session and session.prompt(input) / session.promptStreaming(input) to run it. The wrapper does the following on top:

Feature detection. isAvailable() / checkAvailability() return false / null on browsers without the API. The vanilla ask() throws PromptUnavailableError; the React hook surfaces status: "unavailable".
Warm base reuse for ask() (ergonomic default, scoped to ask). A bounded LRU caches base LanguageModel.create() calls by stringified create options. Each prompt still runs on an isolated clone, or on a fresh one-shot instance when clone() is unavailable. createSession() never touches this cache.
Optional result cache for ask() (off by default). Every call hits the model unless you opt in. Pass cache: "session" for sessionStorage, cache: "local" for localStorage, or any { get, set }-shaped object to memoize responses by (input, systemPrompt, samplingMode / temperature / topK).
Delta-vs-cumulative chunk detection. Chrome ships delta chunks; some Edge backends ship cumulative. The wrapper normalizes per-chunk so onUpdate (cumulative) and sendStreaming() (deltas) work the same across browsers.
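To make the last item concrete, here is one possible per-chunk heuristic: a chunk that starts with everything seen so far is treated as cumulative, anything else as a delta. This is an illustration only and not necessarily how the wrapper detects it. It can misjudge a delta that happens to repeat the whole buffer. `createDeltaNormalizer` is a hypothetical name.

```typescript
// Returns a function that converts each incoming chunk, whether it is a
// delta or a cumulative buffer, into the newly added text only.
function createDeltaNormalizer(): (chunk: string) => string {
  let buffer = "";
  return (chunk) => {
    // Cumulative chunks repeat the whole buffer and then extend it.
    if (buffer.length > 0 && chunk.startsWith(buffer)) {
      const delta = chunk.slice(buffer.length);
      buffer = chunk;
      return delta;
    }
    // Otherwise treat the chunk as a plain delta.
    buffer += chunk;
    return chunk;
  };
}

const fromCumulative = ["He", "Hello", "Hello!"].map(createDeltaNormalizer());
const fromDeltas = ["He", "llo", "!"].map(createDeltaNormalizer());
console.log(fromCumulative.join("|")); // He|llo|!
console.log(fromDeltas.join("|")); // He|llo|!
```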

The two cache-shaped items above are conveniences scoped to ask() (warm base reuse is on by default; the result cache is opt-in), not opinions about how to use a language model. Multi-package compositions (agent loops, conversation history, tool dispatch) are not part of this package; see Architecture for the split.
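For readers unfamiliar with the bounded LRU mentioned for warm base reuse, the idea can be sketched with a `Map`, which iterates in insertion order. This is a generic illustration, not the package's internal code; `BoundedLru` and `lruKey` are hypothetical names.

```typescript
// Bounded least-recently-used map. Map preserves insertion order, so the
// first key is always the least recently used one.
class BoundedLru<V> {
  private readonly entries = new Map<string, V>();
  private readonly maxSize: number;

  constructor(maxSize: number) {
    this.maxSize = maxSize;
  }

  get(key: string): V | undefined {
    const value = this.entries.get(key);
    if (value === undefined) return undefined;
    // Re-insert to mark the entry as most recently used.
    this.entries.delete(key);
    this.entries.set(key, value);
    return value;
  }

  // Stores the value and returns the evicted one, if any, so the caller
  // can release it (for a warm base session that would mean destroy()).
  set(key: string, value: V): V | undefined {
    this.entries.delete(key);
    this.entries.set(key, value);
    if (this.entries.size <= this.maxSize) return undefined;
    const oldestKey = this.entries.keys().next().value as string;
    const evicted = this.entries.get(oldestKey);
    this.entries.delete(oldestKey);
    return evicted;
  }
}

// Key derivation as described above: the stringified create options.
const lruKey = (options: object): string => JSON.stringify(options);
```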

System prompt + sampling

systemPrompt is folded into the session’s initialPrompts as a system role. Prefer samplingMode for output variety: the browser maps the semantic mode to model-appropriate sampling parameters. Legacy temperature and topK are still passed through where browsers expose them, but they are mutually exclusive with samplingMode.

ask({
  input: "Tell me a joke about web standards.",
  systemPrompt: "You are a stand-up comedian. Be punchy.",
  samplingMode: "creative",
});

Language hints

Chrome’s Prompt API accepts expectedInputs / expectedOutputs with optional language arrays. The wrapper sets these automatically when you pass language and the language is in supportedLanguages (default: ["en"]):

ask({
  input: "Explain CORS in two sentences.",
  language: "en-US",
});

For unsupported languages the hints are silently omitted; the model still runs, just without the hint.
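The folding step can be pictured as a small pure function. The sketch below is an approximation: the `{ type: "text", languages }` entry follows the Prompt API's expected-input shape, but the matching rule (a regional tag such as "en-US" counts when its primary subtag "en" is listed) is an assumption, and `buildLanguageHints` is a hypothetical name rather than the wrapper's code.

```typescript
// One expectedInputs / expectedOutputs entry for text in given languages.
interface TextLanguageHint {
  type: "text";
  languages: string[];
}

interface LanguageHintOptions {
  expectedInputs?: TextLanguageHint[];
  expectedOutputs?: TextLanguageHint[];
}

function buildLanguageHints(
  language: string | undefined,
  supportedLanguages: readonly string[] = ["en"],
): LanguageHintOptions {
  if (!language) return {};
  const tag = language.toLowerCase();
  // Assumption: "en-US" is supported when its primary subtag "en" is listed.
  const primary = tag.split("-")[0] ?? tag;
  const supported = supportedLanguages.some((candidate) => {
    const normalized = candidate.toLowerCase();
    return normalized === tag || normalized === primary;
  });
  // Unsupported: omit the hints entirely; the model still runs.
  if (!supported) return {};
  const hint: TextLanguageHint = { type: "text", languages: [language] };
  return { expectedInputs: [hint], expectedOutputs: [hint] };
}
```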

Structured output

Pass a JSON Schema via responseConstraint to constrain the model’s output:

const result = await ask({
  input: "Extract the city and country from: 'I live in Belo Horizonte, Brazil.'",
  responseConstraint: {
    type: "object",
    properties: {
      city: { type: "string" },
      country: { type: "string" },
    },
    required: ["city", "country"],
  },
});

const parsed = JSON.parse(result.output ?? "{}");

The wrapper passes responseConstraint straight through to session.prompt() / session.promptStreaming(). Support depends on your Chrome version.

By default the schema is inlined into the prompt context, which costs tokens. Pass omitResponseConstraintInput: true (on ask() or SessionSendOptions) to drop it; the constraint still shapes the output, but you should then include format guidance in the prompt text itself. The flag is only forwarded when responseConstraint is also set: the native API throws a TypeError otherwise, so the wrapper drops the flag when it is passed without a constraint.
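The forwarding rule amounts to a small guard. A minimal sketch of that guard, not the wrapper's actual code; `ConstraintOptions` and `toPromptOptions` are hypothetical names.

```typescript
interface ConstraintOptions {
  responseConstraint?: object;
  omitResponseConstraintInput?: boolean;
}

// Builds the per-prompt options, forwarding omitResponseConstraintInput
// only when a responseConstraint accompanies it.
function toPromptOptions(options: ConstraintOptions): ConstraintOptions {
  if (options.responseConstraint === undefined) return {};
  const promptOptions: ConstraintOptions = {
    responseConstraint: options.responseConstraint,
  };
  if (options.omitResponseConstraintInput === true) {
    promptOptions.omitResponseConstraintInput = true;
  }
  return promptOptions;
}
```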

Native tool calling (experimental)

The Prompt API spec defines native function calling: register tools on the session, and the runtime invokes each tool’s execute on the model’s behalf, then feeds the result back into the conversation. ask() and createSession() forward a tools array straight through to LanguageModel.create():

import { createSession, type LanguageModelTool } from "@web-ai-sdk/prompt";

const tools: LanguageModelTool[] = [
  {
    name: "fetch_url",
    description: "Fetch a URL and return its text.",
    inputSchema: {
      type: "object",
      properties: { url: { type: "string" } },
      required: ["url"],
    },
    async execute(args) {
      const { url } = args as { url: string };
      return await (await fetch(url)).text();
    },
  },
];

const session = createSession({ systemPrompt, tools });

This is pass-through only — the SDK forwards tools and never calls execute itself. Whether the model actually invokes a tool depends on the browser. Native execution is not wired on current stable Chrome: the option is accepted but is a silent no-op, and the model may surface its tool call as plain text (a tool_code block) that your own code must parse. The passthrough starts working automatically on browsers that ship native execution; until then, responseConstraint remains the robust default.

The heuristic tool_code parser and the tool-execution loop are deliberately left to the consumer layer — parsing is model-dependent (it drifts on names and formats for less-trained tools), so it doesn’t belong in a lifecycle wrapper. To declare the native tool modalities, pass { type: "tool-response" } / { type: "tool-call" } through the advanced expectedInputs / expectedOutputs fields.

tools works on ask() too (ask({ input, tools })). One caveat: ask() may keep warm base sessions through an LRU keyed by JSON.stringify(createOptions), and JSON.stringify drops functions — so a tool’s execute doesn’t contribute to the key, only its name / description / inputSchema do. Each ask() prompt still runs on a clone or fresh one-shot instance. That’s harmless today (the SDK never runs execute), but it matters once native execution lands — so prefer createSession() for tool-bearing sessions. It bypasses the cache entirely and matches the base-session + per-run-clone() pattern.
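The key collision is easy to reproduce with plain `JSON.stringify`, independent of the SDK. In the sketch below, two tools that differ only in `execute` serialize to the same string; the tool objects are made up for illustration.

```typescript
const sharedToolFields = {
  name: "fetch_url",
  description: "Fetch a URL and return its text.",
  inputSchema: {
    type: "object",
    properties: { url: { type: "string" } },
    required: ["url"],
  },
};

// Same declaration, different behavior.
const toolA = { ...sharedToolFields, execute: async () => "implementation A" };
const toolB = { ...sharedToolFields, execute: async () => "implementation B" };

// JSON.stringify omits function-valued properties, so execute vanishes.
const keyA = JSON.stringify({ tools: [toolA] });
const keyB = JSON.stringify({ tools: [toolB] });

console.log(keyA === keyB); // true
console.log(keyA.includes("execute")); // false
```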

Aborting

AbortSignal is supported on every surface. Aborting mid-stream stops generation cleanly and rejects the pending call; an opt-in result cache is not written for aborted runs:

const controller = new AbortController();
const promise = ask({ input: "Long story…", signal: controller.signal });
setTimeout(() => controller.abort(), 1000);

For sessions, call session.abort() to stop the most recent in-flight send / sendStreaming, or session.destroy() to tear down the underlying instance.

Aborted runs reject with PromptAbortError (exported from the package). Both ask() and sessions throw the same class, so err instanceof PromptAbortError works; its name is "AbortError" for compatibility with standard abort handling.

import { ask, PromptAbortError } from "@web-ai-sdk/prompt";

async function run(signal: AbortSignal) {
  try {
    await ask({ input: "…", signal });
  } catch (err) {
    if (err instanceof PromptAbortError) return; // user cancelled
    throw err;
  }
}
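Since the error's name is "AbortError", code that cannot import the class (or that also wraps other abortable work such as fetch) can branch on the name instead. A minimal sketch; `isAbortError` is a hypothetical helper, not a package export.

```typescript
// True for any thrown value whose name is "AbortError", which covers
// PromptAbortError as described above as well as native abort errors.
function isAbortError(err: unknown): boolean {
  return (
    typeof err === "object" &&
    err !== null &&
    (err as { name?: unknown }).name === "AbortError"
  );
}
```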

Errors and unavailability

The vanilla ask() throws PromptUnavailableError when the API is missing or reports availability: "unavailable". Callers branch explicitly:

import { ask, PromptUnavailableError } from "@web-ai-sdk/prompt";

async function greet(): Promise<string | null> {
  try {
    const result = await ask({ input: "hi" });
    return result.output;
  } catch (err) {
    if (err instanceof PromptUnavailableError) {
      // No Chrome flag, no model, or download blocked; fall back.
      return null;
    }
    throw err;
  }
}

createSession() returns a Session synchronously even when creation fails; the error surfaces on the first send / sendStreaming.