Openai/694057b6-101c-8007-9a65-c40578c7252d

This conversation was exported from ChatGPT[1]

gpt-5-2-thinking[edit | edit source]

User: Does Python tooling exist for interacting with llama.cpp, and what is it / what features does it provide?[edit | edit source]

Does Python tooling exist for interacting with llama.cpp, and what is it / what features does it provide?

Assistant: Yes — the main Python tooling is llama-cpp-python, which provides Python bindings to llama.cpp plus a couple of “batter…[edit | edit source]

Yes — the main Python tooling is llama-cpp-python, which provides Python bindings to llama.cpp plus a couple of “batteries included” ways to serve and call models. GitHub[2]

What you get:[edit | edit source]

  • High-level Llama class for local inference (text + chat-style prompting). GitHub[3]
  • OpenAI-like API surface (so it feels like chat.completions, etc.). PyPI[4]
  • Embeddings support. PyPI[5]
  • Low-level access to llama.cpp’s C API via ctypes if you want finer control. PyPI[6]
  • Integrations: works with LangChain and LlamaIndex. PyPI[7]
  • Also advertised in the ecosystem/docs: function calling support and vision API support (when model/build supports it). PyPI[8]

Tiny example (in-process):

from llama_cpp import Llama

llm = Llama(model_path="model.gguf", n_ctx=4096)
out = llm("Write a haiku about GPUs.")
print(out["choices"][0]["text"])

(Exact knobs vary by your build/model.)

If you’d rather keep models out-of-process, it can run an OpenAI API–compatible web server so you can point any OpenAI client library at it. Llama CPP Python[9][edit | edit source]

Run it:

pip install 'llama-cpp-python[server]'
python -m llama_cpp.server --model model.gguf

Llama CPP Python[10]

Then in Python you can use openai-style clients by setting the base URL to your local server (exact code depends on which OpenAI SDK version you use).
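
For example, with the v1-style openai Python package (a sketch; the port, the model name, and whether the server checks the API key all depend on how you launched it):

from openai import OpenAI

# The local server does not require a real API key by default.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

resp = client.chat.completions.create(
    model="model.gguf",  # use whatever name the server reports at /v1/models
    messages=[{"role": "user", "content": "Write a haiku about GPUs."}],
)
print(resp.choices[0].message.content)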

Separately from Python bindings, llama.cpp itself ships llama-server, a lightweight HTTP server exposing OpenAI-compatible endpoints like /v1/chat/completions, and it supports concurrent requests/parallel decoding. GitHub[11][edit | edit source]

Run:

llama-server -m model.gguf --port 8080

GitHub[12]

Then you can call it from Python with requests (or an OpenAI-compatible client pointed at http://localhost:8080/v1).
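
A minimal requests sketch against that endpoint (the model field is largely informational when llama-server is serving a single model):

import requests

resp = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={
        "model": "model.gguf",
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Summarize llama.cpp in one sentence."},
        ],
    },
    timeout=600,  # generation can be slow; tune to your hardware
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])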

If you tell me whether you prefer in-process (bindings) or server-based (HTTP), and what GPUs/build you’re using (CUDA/Vulkan/CPU), I can suggest the cleanest setup + example code for your exact workflow.

User: This will be part of a system managing several different llama instances using different models on different gpus[edit | edit source]

This will be part of a system managing several different llama instances using different models on different gpus

Assistant: Nice — in that setup, you’ll usually want to treat llama.cpp as a fleet of small, single-purpose services and put a thi…[edit | edit source]

Nice — in that setup, you’ll usually want to treat llama.cpp as a fleet of small, single-purpose services and put a thin Python control plane in front of them.

Here’s a pattern that works well.

Allocation units[edit | edit source]

An allocation unit is typically:

  • one model on one GPU (most common), or
  • one model sharded across multiple GPUs (for big models), still as one server process.

This keeps memory/KV-cache ownership simple and avoids cross-talk between workloads.

You can run either:

  • llama.cpp llama-server processes, or
  • llama-cpp-python server processes

Both expose OpenAI-ish endpoints; the key is: separate processes per model/GPU.

Use environment + flags so each instance only sees the GPU(s) it should use.[edit | edit source]

Common approaches:

  • CUDA_VISIBLE_DEVICES=… per process (simplest, very reliable)
  • llama.cpp GPU selection/sharding flags (exact names vary by build/version)

Example idea:

  • model A on GPU 0 → port 8001
  • model B on GPU 1 → port 8002
  • model C sharded across GPUs 2,3 → port 8003 (with an appropriate split flag, if you’re doing tensor split)
Make one FastAPI (or similar) service that:[edit | edit source]
  • Accepts OpenAI-compatible requests (/v1/chat/completions, /v1/embeddings, /v1/models)
  • Routes by model name (and optionally by purpose or max_tokens, ctx, etc.)
  • Proxies streaming responses (SSE) back to the client
  • Handles retries, timeouts, and health checks

Clients talk to one base URL, your router decides which llama instance handles it.

Supervisor responsibilities[edit | edit source]

  • Start/stop/restart instances (via subprocess, systemd, Docker, or Nomad/K8s)
  • Assign ports, log paths, and GPU pinning
  • Enforce per-instance limits (max concurrent requests, queue depth)
  • Periodic health checks (e.g., call /v1/models or a tiny completion)
  • Track “busy” signals: - in-flight requests - recent latency - (optional) parse server logs for queue/backpressure messages
  • Simple scheduler policies: - static: model→instance mapping - pooled: multiple replicas per model, pick least-busy
  • Central structured logs
  • Metrics endpoint (Prometheus style if you want)
  • Request tracing (request id propagated to backends)

Example config sketch[edit | edit source]

instances:
  - name: mistral_7b_gpu0
    model: mistral-7b-instruct
    cmd: ["./llama-server", "-m", "/models/mistral.gguf", "--port", "8001"]
    env: { CUDA_VISIBLE_DEVICES: "0" }
  - name: qwen_14b_gpu1
    model: qwen-14b
    cmd: ["./llama-server", "-m", "/models/qwen14b.gguf", "--port", "8002"]
    env: { CUDA_VISIBLE_DEVICES: "1" }

routing:
  mistral-7b-instruct: ["mistral_7b_gpu0"]
  qwen-14b: ["qwen_14b_gpu1"]
POST /v1/chat/completions handling:[edit | edit source]

  • read body["model"]
  • pick a backend from routing[model] (optionally least-busy)
  • proxy the request to http://backend_host:port/v1/chat/completions
  • if stream=true, proxy the SSE bytes directly

Use httpx.AsyncClient for proxying; FastAPI makes streaming pass-through pretty clean.
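
A rough sketch of that proxy route using FastAPI + httpx; ROUTING and the backend-selection logic here are placeholders for your own routing table and policy:

import httpx
from fastapi import FastAPI, Request
from fastapi.responses import JSONResponse, StreamingResponse
from starlette.background import BackgroundTask

app = FastAPI()
client = httpx.AsyncClient(timeout=None)

# Hypothetical routing table: model name -> backend base URL
ROUTING = {"mistral-7b-instruct": "http://127.0.0.1:8001"}

@app.post("/v1/chat/completions")
async def chat_completions(request: Request):
    body = await request.json()
    backend = ROUTING[body["model"]]  # add least-busy selection / 404 handling as needed
    url = f"{backend}/v1/chat/completions"

    if not body.get("stream"):
        r = await client.post(url, json=body)
        return JSONResponse(r.json(), status_code=r.status_code)

    # Streaming: forward SSE bytes as they arrive, closing the upstream
    # response once the proxied response is finished.
    req = client.build_request("POST", url, json=body)
    upstream = await client.send(req, stream=True)
    return StreamingResponse(
        upstream.aiter_raw(),
        status_code=upstream.status_code,
        media_type="text/event-stream",
        background=BackgroundTask(upstream.aclose),
    )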

  • Prefer one model per process. Multi-model in one process is almost never worth the complexity unless you’re doing something very custom.

  • Be explicit about context size per instance; KV cache is the silent VRAM killer. If you’re mixing workloads, separate “long ctx” and “short ctx” instances even for the same model.
  • If you use MIG / MPS / time-slicing, keep the policy outside llama.cpp (router-level) and don’t expect llama.cpp to “fair schedule” across tenants.
  • Add a cheap warmup (1-token completion) after start so the first real request doesn’t eat the initialization penalty.
  • If you need strict isolation, use containers; if you want max performance and simplest GPU pinning, systemd services with env vars is hard to beat.

If you tell me:

  • whether you’re using llama-server or llama-cpp-python[server]
  • how many GPUs and whether any models are multi-GPU sharded
  • whether you need streaming + embeddings + tool/function calling …I can sketch a router skeleton (FastAPI + httpx) that’s already OpenAI-compatible and drop-in for your clients.

User: My original plan was to have a different systemd service starting up ollama servers, with the higher level software hav…[edit | edit source]

My original plan was to have a different systemd service starting up ollama servers, with the higher level software having modules to talk to all the servers. Now that I'm switching to llama.cpp I want to have the main program start llama subprocesses and talk to them.

Assistant: Got it.[edit | edit source]

Got it. If your main program is going to spawn llama.cpp workers and then talk to them, the cleanest way is:

  • spawn llama-server (or llama-cli --server if you’re on an older build) as a subprocess per model/GPU allocation
  • talk to each worker over HTTP using the OpenAI-compatible endpoints (/v1/chat/completions, /v1/embeddings, etc.)
  • keep a small supervisor layer in Python: ports, env (GPU pinning), health checks, restarts, and graceful shutdown

That gives you the same “many Ollama daemons” topology, just under your program’s control instead of systemd.

For a multi-model, multi-GPU manager, separate processes are usually better because:[edit | edit source]

  • KV cache + model weights stay isolated per worker (no accidental contention)
  • one hung model doesn’t take out your whole control plane
  • you can hard-pin GPUs with CUDA_VISIBLE_DEVICES
  • restarting a worker frees VRAM reliably

Worker spec[edit | edit source]

from dataclasses import dataclass
from typing import Optional, Dict, List

@dataclass
class LlamaWorkerSpec:
    name: str
    model_path: str
    port: int
    cuda_visible_devices: str            # e.g. "0" or "2,3"
    host: str = "127.0.0.1"
    extra_args: Optional[List[str]] = None
    env_extra: Optional[Dict[str, str]] = None
Use asyncio so you can manage many workers cleanly and stream logs.[edit | edit source]

import asyncio, os, signal
from typing import Optional

class LlamaWorker:
    def __init__(self, spec: LlamaWorkerSpec, llama_server_path: str = "llama-server"):
        self.spec = spec
        self.llama_server_path = llama_server_path
        self.proc: Optional[asyncio.subprocess.Process] = None

    async def start(self) -> None:
        if self.proc and self.proc.returncode is None:
            return

        env = os.environ.copy()
        env["CUDA_VISIBLE_DEVICES"] = self.spec.cuda_visible_devices
        if self.spec.env_extra:
            env.update(self.spec.env_extra)

        args = [
            self.llama_server_path,
            "-m", self.spec.model_path,
            "--host", self.spec.host,
            "--port", str(self.spec.port),
        ]
        if self.spec.extra_args:
            args += self.spec.extra_args

        self.proc = await asyncio.create_subprocess_exec(
            *args,
            stdout=asyncio.subprocess.PIPE,
            stderr=asyncio.subprocess.STDOUT,
            env=env,
        )

        # optional: start a log pump task
        asyncio.create_task(self._pump_logs())

    async def _pump_logs(self) -> None:
        assert self.proc and self.proc.stdout
        while True:
            line = await self.proc.stdout.readline()
            if not line:
                break
            # replace with structured logging
            print(f"[{self.spec.name}] {line.decode(errors='replace').rstrip()}")

    async def stop(self, timeout_s: float = 10.0) -> None:
        if not self.proc or self.proc.returncode is not None:
            return

        # Try graceful first
        self.proc.send_signal(signal.SIGTERM)
        try:
            await asyncio.wait_for(self.proc.wait(), timeout=timeout_s)
        except asyncio.TimeoutError:
            self.proc.kill()
            await self.proc.wait()

    async def wait(self) -> int:
        if not self.proc:
            return 0
        return await self.proc.wait()
You don’t want to route requests until the worker is actually serving.[edit | edit source]

import asyncio
import httpx

async def wait_ready(spec: LlamaWorkerSpec, timeout_s: float = 60.0) -> None:
    url = f"http://{spec.host}:{spec.port}/v1/models"
    async with httpx.AsyncClient(timeout=2.0) as client:
        deadline = asyncio.get_event_loop().time() + timeout_s
        while True:
            try:
                r = await client.get(url)
                if r.status_code == 200:
                    return
            except Exception:
                pass
            if asyncio.get_event_loop().time() > deadline:
                raise RuntimeError(f"Worker {spec.name} not ready after {timeout_s}s")
            await asyncio.sleep(0.25)

Calling a worker (non-streaming and streaming)[edit | edit source]

async def chat(spec: LlamaWorkerSpec, model_name: str, messages) -> dict:
    # Non-streaming call to the worker's OpenAI-compatible endpoint.
    url = f"http://{spec.host}:{spec.port}/v1/chat/completions"
    payload = {"model": model_name, "messages": messages, "stream": False}
    async with httpx.AsyncClient(timeout=None) as client:
        r = await client.post(url, json=payload)
        r.raise_for_status()
        return r.json()

async def chat_stream(spec: LlamaWorkerSpec, model_name: str, messages):
    # Streaming passthrough (SSE-ish). Kept separate from chat() because an
    # async generator cannot also `return` a value for the non-streaming case.
    url = f"http://{spec.host}:{spec.port}/v1/chat/completions"
    payload = {"model": model_name, "messages": messages, "stream": True}
    async with httpx.AsyncClient(timeout=None) as client:
        async with client.stream("POST", url, json=payload) as resp:
            resp.raise_for_status()
            async for chunk in resp.aiter_bytes():
                yield chunk

Fleet supervision:[edit | edit source]

  • Start N workers
  • wait until ready
  • monitor processes; if one exits, restart it

class LlamaFleet:
    def __init__(self, specs, llama_server_path="llama-server"):
        self.workers = [LlamaWorker(s, llama_server_path) for s in specs]

    async def start_all(self):
        for w in self.workers:
            await w.start()
        for w in self.workers:
            await wait_ready(w.spec)

    async def stop_all(self):
        await asyncio.gather(*(w.stop() for w in self.workers), return_exceptions=True)

    async def supervise_forever(self):
        while True:
            for w in self.workers:
                if w.proc and w.proc.returncode is not None:
                    print(f"[fleet] {w.spec.name} exited ({w.proc.returncode}); restarting")
                    await w.start()
                    await wait_ready(w.spec)
            await asyncio.sleep(1.0)

  • Port allocation: deterministic (e.g., 8100+index) so restarts don’t reshuffle.

  • GPU pinning: always via CUDA_VISIBLE_DEVICES per worker; it’s the least surprising.
  • Separate “profiles”: run two workers for the same model if you need different --ctx-size / batch settings.
  • Backpressure: track in-flight requests per worker and refuse/queue when overloaded (otherwise latency explodes).
  • Graceful shutdown: SIGTERM + timeout + SIGKILL to reliably free VRAM.

If you tell me whether you’re using CUDA builds and whether any of your models are multi-GPU sharded, I can show how to build the extra_args for each worker (ctx size, n-gpu-layers, tensor split, parallel slots) and a small router function that picks the best worker per request.

User: Ok, let's lock down the requirements on this module, work on code later.[edit | edit source]

Ok, let's lock down the requirements on this module, work on code later.

The software using this will be using multiple instances of the class. It will at times want to shut down a llama and create a new one (the instance doesn't need to have change ability; it will be dropped and a new one created, it just needs start / stop).

The software will init with the various settings which are desired for that llama, including the ability to set a maximum number >1 of simultaneous requests. While running, the class needs to maintain a status readout of some sort: whether it's active, queries processing, etc. The usage pattern will be for the caller to call a method with a system + user prompt, which will immediately return. Then, when the request is finished, it can get the output (maybe callbacks, but polling is acceptable to start with).

The class will handle all of the back and forth from there. It should detect stalls / disconnection, and attempt reconnection and eventual restart of the server if needed.

There will also need to be a tool system. This can be handled by a separate class that is passed to the module, but it should transparently handle registering tools, receiving a call, handling, and resume with tool data.

Is there anything else I've missed in this?

Assistant: You’ve already got the core nailed.[edit | edit source]

You’ve already got the core nailed. The main “gotchas” that tend to bite later are around request lifecycle, capacity/backpressure, observability, and tool-call state. Here’s a checklist of requirements I’d add/lock down now so you don’t end up redesigning mid-flight.

Lifecycle[edit | edit source]

  • start() / stop() only (no “reconfigure in place”).
  • Owns a single llama server subprocess + its port + its GPU pinning env.
  • Clean shutdown: try graceful, then hard kill; always free VRAM.

Requests

  • submit(system_prompt, user_prompt, …) -> request_id returns immediately.
  • Completion retrieved by polling: get_result(request_id) / get_status(request_id).
  • Supports max_concurrent_requests > 1 (internal queue when saturated).

Health & resilience

  • Detect stalls/disconnects.
  • Attempt reconnect; if unhealthy past threshold, restart the subprocess.
  • Resume operation after restart (at minimum, fail in-flight requests cleanly with a clear reason).

Tools

  • Tool registry passed in (separate class OK).
  • Handles tool-call detection → executes tool → resumes generation with tool result.

That’s a solid v1.

Request state machine[edit | edit source]

You want an explicit state machine, e.g.:

  • QUEUED → DISPATCHED → RUNNING → TOOL_WAIT → COMPLETED
  • terminal: FAILED | CANCELED | TIMEOUT

Add:

  • cancel(request_id) (even if it’s best-effort). Without cancellation, queue buildup becomes painful under load.
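
As a sketch, those states map onto a small enum (names taken from the list above):

from enum import Enum, auto

class RequestState(Enum):
    QUEUED = auto()
    DISPATCHED = auto()
    RUNNING = auto()
    TOOL_WAIT = auto()
    COMPLETED = auto()
    FAILED = auto()
    CANCELED = auto()
    TIMEOUT = auto()

TERMINAL_STATES = {RequestState.COMPLETED, RequestState.FAILED,
                   RequestState.CANCELED, RequestState.TIMEOUT}
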
Decide now what happens when saturated:[edit | edit source]
  • Queue up to max_queue_depth, then reject with QueueFullError
  • or drop oldest, or priority lanes (later)

Also decide whether max_concurrent_requests is:

  • per worker instance (likely), and
  • enforced with a semaphore + queue.
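
A minimal sketch of enforcing that with an asyncio semaphore plus a bounded wait count (QueueFullError as mentioned above; the exact shape is up to you):

import asyncio

class QueueFullError(Exception):
    pass

class SlotLimiter:
    # Enforces max_concurrent_requests per worker; callers that cannot get a
    # slot immediately (and the optional bounded queue is full) are rejected.
    def __init__(self, max_concurrent: int, max_queue_depth: int = 0):
        self._slots = asyncio.Semaphore(max_concurrent)
        self._waiting = 0
        self._max_queue_depth = max_queue_depth

    async def acquire(self) -> None:
        if self._slots.locked() and self._waiting >= self._max_queue_depth:
            raise QueueFullError("no slot available")
        self._waiting += 1
        try:
            await self._slots.acquire()  # waits only if a queue place was granted
        finally:
            self._waiting -= 1

    def release(self) -> None:
        self._slots.release()

The caller wraps each dispatched request in acquire()/release() (or an async context manager built on them).
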
Timeouts: you need more than one. People often only add one and then it’s messy.[edit | edit source]
  • Connect timeout: can I reach the server?
  • First-byte timeout: server accepted request but never started responding (common stall symptom).
  • Total timeout: wall-clock cap per request.

Also add a restart backoff (avoid crash loops).

Lock this down explicitly:[edit | edit source]
  • If the server restarts, do you: - mark all in-flight as FAILED(restarted) immediately, or - try to replay (dangerous unless your prompts/tools are deterministic and idempotent)

For v1: I strongly recommend fail in-flight with a clear error.

Even if you only expose polling now, build the internal plumbing so you can add:[edit | edit source]
  • partial tokens
  • “tail” polling (get latest chunk since offset)
  • or callback hooks later

If you don’t, tool-calls + long responses can feel opaque.

Minimum that saves you hours:[edit | edit source]
  • last N log lines from subprocess accessible via get_debug_info()
  • per-request timing: queued_ms, run_ms, tool_ms, retries
  • health metrics: restart_count, last_healthy_time, consecutive_failures
Expose:[edit | edit source]
  • model path/name
  • context length, gpu-layers, batch settings, etc.
  • a config hash so the higher-level system can understand “what worker is this?”

This matters when you have many instances and need to correlate behavior.

If multiple instances are created/destroyed dynamically:[edit | edit source]
  • module should either accept a port (caller-managed) or
  • include a reliable port allocator + bind-check.

Also: decide whether you bind to 127.0.0.1 only (recommended).

Tool support is where hidden requirements show up. Define:[edit | edit source]
  • tool schema format (OpenAI tool schema? custom?)
  • max tool recursion / max tool calls per request
  • tool timeout + retry policy
  • tool result size limits (avoid stuffing huge payloads back into context)
  • idempotency expectations (important if you ever retry a tool call)

Also: persist enough per-request state that tool resumes are correct even if you poll later.

If you want max_concurrent_requests > 1, decide if that means:[edit | edit source]
  • multiple simultaneous HTTP requests to one server process (depends on server’s parallelism support), or
  • you’ll actually spawn multiple worker processes for true parallel throughput.

Even if you choose the former, keep a switch so you can move to the latter without API changes.

It can help to say “not in scope for v1”:[edit | edit source]

  • automatic model swapping inside an instance
  • replaying in-flight requests across restarts
  • global scheduling across workers (belongs in a higher-level manager)
  • distributed tool execution (keep it in-process initially)

Suggested public API:[edit | edit source]

  • start(), stop(), restart()
  • submit(system, user, tools_enabled=True, request_params={...}) -> id
  • get_status(id) -> {state, progress?, error?, timestamps...}
  • get_result(id) -> {text, tool_trace?, usage?}
  • cancel(id)
  • get_worker_status() -> {active, healthy, in_flight, queued, restarts, last_error...}
  • get_debug_info() -> logs, config_hash, last_health_checks...
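
As an illustrative interface sketch (signatures only, mirroring the list above; types are placeholders):

from typing import Any, Dict, Optional

class LlamaWorkerInterface:
    """Illustrative signatures only; matches the surface listed above."""

    def start(self) -> None: ...
    def stop(self) -> None: ...
    def restart(self) -> None: ...

    def submit(
        self,
        system: str,
        user: str,
        tools_enabled: bool = True,
        request_params: Optional[Dict[str, Any]] = None,
    ) -> str:
        """Returns a request id immediately (or raises if no slot is available)."""
        ...

    def get_status(self, request_id: str) -> Dict[str, Any]: ...
    def get_result(self, request_id: str) -> Dict[str, Any]: ...
    def cancel(self, request_id: str) -> None: ...
    def get_worker_status(self) -> Dict[str, Any]: ...
    def get_debug_info(self) -> Dict[str, Any]: ...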

If you lock those items down now, you’ll avoid the common rewrites later. When you’re ready to move to code, we can turn this into a concrete state machine + class layout (worker + request objects + tool runner) with clear boundaries.

User: As a general rule for this nuke & repave is the preferred strategy rather than trying delicate replay and similar trick…[edit | edit source]

As a general rule for this, nuke & repave is the preferred strategy rather than trying delicate replay and similar tricks.

Multiple requests should be a designated number of "slots" which can be used. If all slots are full there is no reason to take a request it can't handle yet; that can wait or be directed to another instance.

For timeouts and other in flight behavior: this will among other things be used with large slow models on slow hardware and relatively large contexts. A practical concern is that time to first token can be significant for some requests.

Gathering streaming partial tokens is probably desirable even if not exposed to the outside, if only as part of the monitoring and recovery system.

Tools should probably be OpenAI format: something which models will easily recognize and be able to handle.

The internal tools here will be limited to lightweight calculator or lookup types. Larger calls would be routed to a higher level model.

Assistant: Cool — with those constraints, you can simplify a lot.[edit | edit source]

Cool — with those constraints, you can simplify a lot. Here’s what I’d “lock in” as the module requirements/spec, reflecting nuke & repave, slot-based concurrency, slow-first-token reality, internal streaming, and OpenAI-style tools.

Nuke & repave[edit | edit source]

  • No delicate replay. If the server is unhealthy, restart the subprocess.
  • Any in-flight requests at restart time are marked FAILED(reason="worker_restarted") (or CANCELED if caller canceled).
Worker has N slots (N ≥ 1). A slot = one in-flight generation request owned by this worker.[edit | edit source]
  • If slots_full, submit() returns immediately with NO_SLOT_AVAILABLE (caller can retry later or route to another worker).
  • Optionally later: allow a tiny bounded queue, but default is “no queue”.
Even if the external API is polling-only, the worker internally uses streaming (or reads incremental output) to update “last progress” timestamps, capture partial tokens for debugging/monitoring, and improve stall detection.[edit | edit source]

Request lifecycle[edit | edit source]

Each request has:

  • id, timestamps, config snapshot (model/ctx/tools enabled), slot id
  • states: ALLOCATED_SLOT → DISPATCHED → RUNNING → TOOL_RUNNING (optional) → COMPLETED
  • terminal: FAILED | CANCELED | TIMEOUT | NO_SLOT (NO_SLOT is a submit-time result)

Required methods:

  • submit(...) -> RequestHandle | error(NO_SLOT_AVAILABLE)
  • poll(id) -> status (includes partial text length / last_progress_time)
  • result(id) -> final (or still running)
  • cancel(id) best-effort: cancels HTTP stream and frees slot
You want multiple independent timers, and at least one that is progress-based rather than “first token in X seconds”.[edit | edit source]

Recommended set:

A. Connect timeout (short)

  • e.g. 1–3s to open TCP connection to the server.

B. “Request accepted” timeout (medium)

  • time until you get HTTP status/headers back (server actually accepted). If you can’t even get headers, that’s a strong sign of dead server.

C. Progress timeout (primary stall detector)

  • a “deadman” timer that resets whenever any bytes arrive on the stream (or whenever token count increases).
  • This is the one that works with slow hardware + huge contexts.
  • Example: if no bytes received for 120s (configurable), treat as stall.

D. Absolute wall-clock timeout (optional / large)

  • Set very high or disable by default for long prompts. If enabled, should be per-request override.

Key point: avoid a strict “time to first token” timeout unless it’s huge or derived from request size.

Also specify:

  • after a stall: attempt one reconnect/health check, then restart if still bad
  • restart backoff + crash-loop breaker (e.g., if 5 restarts in 2 minutes, mark worker unhealthy and stop)
Worker-level status should include:[edit | edit source]
  • active (started), healthy, restarting
  • slots_total, slots_in_use, list of active request ids
  • last_healthy_time, restart_count, last_error
  • optional: rolling latency stats (ttft, tokens/sec, total time)

Request-level status should include:

  • state, created_at, started_at, last_progress_at
  • partial_output_chars (or tokens) + optionally last N chars of partial for debugging
  • tool_calls_count and tool trace entries
Lock down:[edit | edit source]
  • Tools are provided as OpenAI-style tools=[{type:"function", function:{name, description, parameters}}]
  • Model-facing messages follow OpenAI tool calling conventions: - assistant returns tool call(s) - worker executes them via the provided ToolRunner - worker appends tool results and continues until final answer

Constraints (important for safety + stability):

  • max tool call depth / iterations (e.g. 8)
  • per-tool timeout (e.g. 2–10s)
  • max tool output size (truncate / summarize)
  • tool calls should be deterministic / idempotent where possible (since restarts nuke in-flight)
Explicitly require:[edit | edit source]
  • server stdout/stderr captured (ring buffer for debug)
  • SIGTERM then SIGKILL on stop
  • port binding strategy (caller provides port OR worker allocates)
  • GPU pinning via env (e.g., CUDA_VISIBLE_DEVICES) per worker process
If submit() succeeds, the worker guarantees that the request either completes with a result or ends in a terminal error with a reason.[edit | edit source]
  • Terminal error reasons should be machine-usable: - server_restart, stall_timeout, connect_failed, http_error, tool_error, canceled, etc.

What is “a slot” mapped to under the hood?[edit | edit source]

  • If llama-server truly handles N parallel requests well on your build/hardware, a slot can be “one concurrent HTTP streaming request.”
  • If not, you still keep the same slot API, but your higher-level orchestrator can spin up more worker processes instead of increasing slots.

Either way, the module contract stays the same.

Two small but high-impact requirements:[edit | edit source]

  1. Warmup on start After launching the server, run a tiny “1 token” request so the worker becomes “healthy” only after it’s genuinely responsive.
  2. Structured event log Emit events like WORKER_STARTED, REQUEST_ASSIGNED_SLOT, TOOL_EXECUTED, STALL_DETECTED, WORKER_RESTARTED. This makes fleet debugging way easier than scraping freeform logs.

If you want, next we can translate this into a clean “interface spec” (classes + method signatures + event types + error enums) before writing any subprocess/HTTP code.

User: There should be a secondary low level system prompt mechanism handled by the class, almost a "BIOS prompt".[edit | edit source]

There should be a secondary low level system prompt mechanism handled by the class, almost a "BIOS prompt". It would feed in data like how many tool call iterations are remaining, current date & time, or other low level stuff the model should be aware of.

Assistant: Yep — that’s a good idea, and it’s worth making it a first-class requirement because it affects how you build the promp…[edit | edit source]

Yep — that’s a good idea, and it’s worth making it a first-class requirement because it affects how you build the prompt stack, tool-loop, and monitoring.

What the BIOS prompt is[edit | edit source]

A secondary, low-level system prompt owned by the worker class (not the caller) that injects runtime/control-plane facts the model should always see, such as:

  • tool-call budget remaining (and max)
  • current date/time (+ timezone)
  • worker identity / model identity (optional)
  • response constraints that are “platform rules” (optional)
  • tool usage rules (e.g., “only call tools from the provided list”)
Prompt ordering: lock this down because ordering matters.[edit | edit source]

Recommended invariant order:

  1. BIOS system prompt (worker-generated, dynamic per turn)
  2. Caller’s system prompt (the user-facing “persona/task” system message, static per request)
  3. Conversation messages (user/assistant/tool messages)

This gives your platform-level constraints the highest precedence, while still letting the caller define task behavior.

For maximum compatibility (since different servers vary in how they treat multiple system messages), the worker should support both:[edit | edit source]
  • Two system messages (BIOS + caller system), if the backend supports it cleanly
  • Single combined system message, where BIOS is prepended with a clear delimiter (safe fallback)

Example delimiter idea (spec-level, not exact text):

[BIOS]
...dynamic control-plane info...
[/BIOS]

...caller system prompt...
The BIOS prompt should be regenerated:[edit | edit source]
  • at request start
  • before each tool-loop continuation (so “tool iterations remaining” stays accurate)
  • optionally on long streams (if you ever add “keepalive” logic)
No secrets in BIOS (API keys, hostnames if sensitive, etc.). Treat it as model-visible.[edit | edit source]
  • Hard size cap on BIOS text (it’s easy to accidentally bloat it with debug info).
  • Version it: include a short BIOS schema/version string so you can change formatting later without confusion.
  • Deterministic format: stable keys so logs/monitoring can parse it.
Because you’re using internal streaming, the BIOS layer should also update worker/request status with:[edit | edit source]
  • tool budget remaining
  • last BIOS timestamp injected (useful when debugging tool-loops)
  • BIOS version/hash (handy for correlation)

Add to your module contract:[edit | edit source]

  • Worker config includes a bios_provider (callable/class) that returns a string (or structured dict → rendered to string).
  • The ToolRunner updates a tool_iterations_remaining counter.
  • The request dispatcher rebuilds the prompt stack with the latest BIOS before sending/continuing.
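
A sketch of what a bios_provider callable could return, using the [BIOS] fallback format from earlier (field names are illustrative):

from datetime import datetime, timezone

def bios_provider(tool_iterations_remaining: int, tool_iterations_max: int,
                  worker_name: str) -> str:
    # Regenerated at request start and before each tool-loop continuation.
    now = datetime.now(timezone.utc).astimezone()
    lines = [
        "[BIOS]",
        "bios_version: 1",
        f"datetime: {now.isoformat()}",
        f"worker: {worker_name}",
        f"tool_calls_remaining: {tool_iterations_remaining}/{tool_iterations_max}",
        "rules: only call tools from the provided tools list",
        "[/BIOS]",
    ]
    return "\n".join(lines)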

Include a “capabilities line” in BIOS that describes what the worker supports right now (tools enabled? max tool calls? streaming?); it helps models behave more consistently than relying on implicit behavior.[edit | edit source]

If you want, next step is to write the spec for the BIOS provider (inputs/outputs, required fields, formatting rules) so it stays clean and doesn’t leak complexity into the rest of the worker.

User: On timeouts, some of the test requests I've done with the hardware this will be managing can take multiple minutes, or …[edit | edit source]

On timeouts, some of the test requests I've done with the hardware this will be managing can take multiple minutes, or in some worst cases 10s of minutes for initial processing before tokens begin.

Assistant: That’s totally compatible with the “nuke & repave” philosophy — you just have to make sure your stall detection doesn’t…[edit | edit source]

That’s totally compatible with the “nuke & repave” philosophy — you just have to make sure your stall detection doesn’t confuse “long prefill” with “hung”.

Here’s how I’d lock the timeout requirements so they work with multi-minute / tens-of-minutes TTFT.

Hard failure timeouts[edit | edit source]

These detect dead server / dead socket, not slow inference:

  • Connect timeout (TCP connect): ~1–3s
  • Header/accept timeout (time to receive HTTP status/headers): ~5–30s (configurable)

If you can’t connect or can’t even get headers back, that’s not “slow model”, that’s “not serving”.

With your workloads, a “time to first token” timeout is basically unusable unless it’s huge.[edit | edit source]

Requirement:

  • ttft_timeout = None by default (disabled)
  • If enabled, it’s per-worker profile and set very high.
Instead of “no tokens yet”, stall should mean “nothing indicates progress”.[edit | edit source]

Define a Progress/Liveness watchdog that is satisfied by any of these signals:

A. Stream progress Once streaming begins, any bytes/tokens reset the watchdog.

B. Process liveness progress (prefill-safe) Before tokens arrive, use one or more low-level signals to decide “the model is still working”:

  • server subprocess is alive and
  • its CPU time is increasing (via /proc/<pid>/stat), or
  • optional: GPU activity associated with that PID (via nvidia-smi pmon / utilization sampling), if you’re on NVIDIA

Requirement-wise: make these checks pluggable so you can support “just /proc” everywhere and add GPU checks where available.
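
A minimal Linux-only sketch of the /proc-based CPU-time probe (field positions follow proc(5); treat this as one pluggable liveness source):

def proc_cpu_ticks(pid: int) -> int:
    # Cumulative user+system CPU ticks for the process; raises if the process
    # is gone. Sample this periodically: an increasing value counts as progress.
    with open(f"/proc/{pid}/stat") as f:
        after_comm = f.read().rsplit(")", 1)[1].split()  # skip past "pid (comm)"
    utime, stime = int(after_comm[11]), int(after_comm[12])  # fields 14 and 15
    return utime + stime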

Then you can define:

  • stall_timeout = “no stream bytes AND no liveness progress” for X seconds → treat as hung

Given your TTFT reality, X should be measured in tens of minutes by default, or even disabled unless you have liveness checks.

For long contexts, absolute timeouts should be either:[edit | edit source]
  • very large (hours), or
  • per-request override set by the caller (e.g., “this request can run 45 minutes”).

Default requirement:

  • absolute_timeout = None or extremely high.

Even if no tokens for 20 minutes, the worker should still show:[edit | edit source]

  • request state = RUNNING
  • last_liveness_at updating from /proc / GPU checks
  • optionally: “prefill_estimated” flag (if you infer it from prompt size/ctx)

This prevents your supervisor from “helpfully” restarting healthy-but-slow jobs.

  • If connect/header failures happen: restart quickly.

  • If stall_timeout triggers (no evidence of life): restart + fail in-flight requests with FAILED(reason="stall_timeout").
  • If the process exits: restart + fail in-flight with FAILED(reason="worker_died").

Workers should support profiles, because timeout expectations will differ wildly:[edit | edit source]

  • “fast 7B on P40” vs “big model on M10 shards” etc.

So bake into config:

  • timeouts = {connect, headers, stall, ttft(optional), absolute(optional)}
  • liveness_checks = {proc_cpu_time: on, gpu_util: optional}

If you’re good with that spec, next step (when you’re ready for code) is implementing a watchdog that updates last_progress_at from either stream bytes or liveness probes, so “slow prefill” never trips the restart logic.

User: Yes, timeout profiles are a good idea, or at least a way to give larger or smaller failure limits per-process.[edit | edit source]

Yes, timeout profiles are a good idea, or at least a way to give larger or smaller failure limits per-process.

Assistant: Great — then I’d bake “timeout profiles” in as a first-class config object, but keep it simple: per-worker defaults + p…[edit | edit source]

Great — then I’d bake “timeout profiles” in as a first-class config object, but keep it simple: per-worker defaults + per-request overrides, with a clear precedence order.

Timeout policy object[edit | edit source]

Each worker instance is created with a TimeoutPolicy, e.g.

  • connect_timeout_s (TCP connect)
  • headers_timeout_s (time to receive HTTP headers / accept the request)
  • idle_stream_timeout_s (once streaming, max time with no bytes)
  • prefill_liveness_timeout_s (before first byte/token, max time with no liveness evidence)
  • absolute_timeout_s (optional, usually None)
  • restart_backoff_s and max_restarts_per_window (crash-loop protection)

Default stance for your environment

  • connect_timeout_s: small (1–3s)
  • headers_timeout_s: moderate (10–30s)
  • prefill_liveness_timeout_s: large (e.g., 20–60+ minutes) or None if you want “never kill slow prefill unless dead”
  • idle_stream_timeout_s: moderate/large (e.g., 60–300s) since once it’s streaming, silence usually means trouble
  • absolute_timeout_s: None by default
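
As a config object, that could look roughly like this (defaults mirror the stance above; None means disabled):

from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class TimeoutPolicy:
    connect_timeout_s: float = 3.0
    headers_timeout_s: float = 30.0
    idle_stream_timeout_s: Optional[float] = 300.0
    prefill_liveness_timeout_s: Optional[float] = None  # None = never kill slow prefill unless dead
    absolute_timeout_s: Optional[float] = None
    restart_backoff_s: float = 10.0
    max_restarts_per_window: int = 5
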
Timeouts that apply before first token depend on having something to watch. Require:[edit | edit source]
  • liveness_probe_interval_s
  • liveness_sources list (enabled/disabled): - subprocess alive (always) - /proc/<pid> CPU time delta (portable on Linux) - optional GPU activity sampling (NVIDIA-only; nice-to-have)

Define a single internal value per request:

  • last_liveness_at
  • last_stream_byte_at
  • and a derived last_progress_at = max(last_liveness_at, last_stream_byte_at)

All “stall” logic keys off last_progress_at.

Callers can override upwards or downwards within guardrails:[edit | edit source]
  • e.g., submit(..., timeouts_override={ "absolute_timeout_s": 3600 })

Precedence:

  1. request override (if provided)
  2. worker policy
  3. module hard minimums/maximums (safety rails)
If a timeout triggers, the request ends with:[edit | edit source]
  • FAILED(reason="headers_timeout")
  • FAILED(reason="prefill_liveness_timeout")
  • FAILED(reason="idle_stream_timeout")
  • FAILED(reason="absolute_timeout")

and the worker may also:

  • restart the subprocess for stall-type reasons (prefill/idle), per policy.
You can optionally define a few named presets (just for convenience in your higher-level config):[edit | edit source]
  • fast: aggressive timeouts (small models)
  • normal: reasonable defaults
  • slow_prefill: huge/disabled TTFT + large prefill liveness
  • debug: disables restarts, logs everything

But the key requirement is the policy object, not the naming.

Add a per-worker boolean:[edit | edit source]

  • restart_on_timeout: default True for stall-type timeouts, False for absolute timeout (absolute timeout can just fail the request without restarting the worker).

If you’re happy with this, the next “requirements” step is to decide the minimum liveness probe you want to rely on (I’d suggest /proc/<pid>/stat CPU time delta as the baseline), and whether you want prefill liveness timeout to be “very large” or “disabled unless process dies.”

User: Another failure mode I've bumped into was a model getting stuck in a loop, repeating the same line endlessly.[edit | edit source]

Another failure mode I've bumped into was a model getting stuck in a loop, repeating the same line endlessly.

A very long absolute timeout would be the simplest way to stop that, but if a simple loop detector can be made to cut things short that might be useful.

Though it occurs to me this can also be handled with a max-token setting on the model.

Assistant: Yep — “degenerate loops” are a real failure mode, and you can address them cleanly with two layers:[edit | edit source]

Yep — “degenerate loops” are a real failure mode, and you can address them cleanly with two layers:

Make max_tokens (or “max new tokens”) a first-class per-request setting with a sane per-worker default.[edit | edit source]

  • If the model loops, it will eventually hit the cap and terminate.
  • This is simple, predictable, and doesn’t require heuristics.
  • Return a clear terminal reason like COMPLETED(reason="max_tokens_reached") (or FAILED if you want it treated as abnormal).

This should be your always-on safety net.

Since you’re already planning internal streaming, you can add a loop detector that triggers before max_tokens when the output is clearly stuck.[edit | edit source]

  • Runs only while streaming output is coming in.
  • If it triggers, it cancels that request and frees the slot.
  • It should not automatically restart the worker process (a looping model is usually not a dead server).
  • Records a short “why” snippet in request debug info.

Terminal reason example: FAILED(reason="loop_detected").

Use multiple weak signals together (to avoid false positives):[edit | edit source]
  1. Repeated line detector
  • Keep the last K lines (e.g., 20–50).
  • If the same line appears N times in a row (e.g., 8+) or dominates the window (e.g., >70%), flag.
  2. Repeated suffix detector
  • Keep last M characters/tokens (e.g., 2–8k chars).
  • If the newest chunk makes the output end with a suffix that has appeared repeatedly (same ~200–500 char tail repeated 3+ times), flag.
  3. Low novelty / low entropy
  • Track unique-token or unique-3gram rate over a sliding window.
  • If novelty drops below a threshold for long enough and output length keeps growing, flag.
This matters because some legitimate outputs look repetitive:[edit | edit source]
  • tables, logs, code templates, poetry refrains, etc.

So add to worker config / per-request overrides:

  • loop_detection_enabled (default on, but tunable)
  • thresholds (line repeats, suffix length, window size)
  • min_generated_tokens_before_check (don’t trigger too early)

Even with max_tokens + detector, it’s worth exposing the common anti-loop knobs per request:[edit | edit source]

  • repetition penalty / repeat window
  • temperature/top_p adjustments
  • stop sequences (when you know a delimiter)

Exact names vary depending on whether you’re calling llama.cpp directly or via an OpenAI-compatible endpoint, but conceptually these belong in your request params.

  • Always support max_tokens (default per worker, override per request).

  • Optional loop detector that triggers request cancel (not worker restart) and yields loop_detected.
  • Both are reported in the request’s status/debug fields.

If you want, next we can define the exact loop-detector thresholds you’d start with (conservative defaults that almost never false-trigger) and where they live in your TimeoutPolicy vs a separate SafetyPolicy.

User: I think a repeated line detector would be enough beyond the token limit.[edit | edit source]

I think a repeated line detector would be enough beyond the token limit. It can be very simple, unlikely to have false positives, and can early-kill the worst derangements.

Another idea I'm considering is having "tools" for the ending: not meant for the normal tool round trip, but as indicators or requests for higher level decision models so it can easily and clearly indicate certain kinds of issues without needing the other model to read and understand all of the output.

Assistant: Both ideas fit your “simple, robust, fleet-friendly” approach.[edit | edit source]

Both ideas fit your “simple, robust, fleet-friendly” approach.

I’d spec it like this:[edit | edit source]

  • Maintain last_line and repeat_count.
  • As streamed text arrives, split into completed lines (buffer partial line until newline).
  • Normalize a line before comparison: - strip trailing whitespace - optionally collapse runs of spaces/tabs - ignore empty/very-short lines (e.g. < 8 chars)
  • If line == last_line: repeat_count += 1, else reset to 1 and update last_line.
  • If repeat_count >= N and line length >= min_len, trigger: - cancel request (free slot) - mark terminal: FAILED(reason="repeated_line_loop") - attach loop_line (truncated) and repeat_count to debug/status

Conservative defaults that won’t trip on normal output:

  • N = 8–12
  • min_len = 32–64
  • ignore lines that are just punctuation/whitespace
  • optional: don’t start checking until min_generated_chars (e.g. 512) so you don’t kill harmless prompt echoes

This plus max_tokens is a very solid “early kill the worst derangements” combo.
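
A sketch of that detector with the conservative defaults above (feed it streamed text; the exact thresholds are yours to tune):

class RepeatedLineDetector:
    # Flags a stream as looping when the same sufficiently-long line repeats
    # N times in a row. Feed it text chunks as they arrive.
    def __init__(self, n: int = 10, min_len: int = 32, min_generated_chars: int = 512):
        self.n = n
        self.min_len = min_len
        self.min_generated_chars = min_generated_chars
        self._buf = ""
        self._total = 0
        self._last_line = None
        self._repeats = 0

    def feed(self, chunk: str) -> bool:
        """Returns True once a loop has been detected."""
        self._total += len(chunk)
        self._buf += chunk
        while "\n" in self._buf:
            line, self._buf = self._buf.split("\n", 1)
            line = line.rstrip()
            if len(line) < self.min_len:
                continue  # ignore empty / very short lines
            if line == self._last_line:
                self._repeats += 1
            else:
                self._last_line, self._repeats = line, 1
            if (self._repeats >= self.n
                    and self._total >= self.min_generated_chars):
                return True
        return False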

This is a good pattern. I’d frame it as a control-signaling channel: the model can emit machine-readable “flags” without you having to run another model to interpret raw text.[edit | edit source]

You’ve got two clean options:[edit | edit source]

A) Control tools (OpenAI function-calling schema, but intercepted)

  • You include a small set of special functions in the tools list, e.g.: - signal_issue(code, detail, severity) - request_escalation(reason, summary) - declare_outcome(outcome_code)
  • The worker intercepts these calls and does not send tool results back (or sends an empty ack and immediately ends, depending on your preference).
  • The request returns with signals=[...] alongside text.

This keeps “format familiarity” high for the model (since it already understands tool calling).

B) Structured footer block (no tools)

  • BIOS instructs the model to optionally end with a block like: - ===CONTROL=== {json...} ===/CONTROL===
  • Worker parses and strips it, and stores the JSON as signals.

This is simpler if your backend’s tool-calling behavior is inconsistent.

  • “I’m uncertain / likely wrong” → route to a verifier model
  • “Need external data” → route to a browsing-capable model
  • “Output too long / summarization requested” → route to summarizer
  • “Potential policy risk / sensitive content” → route to a safety review model
  • “Tooling limitation encountered” → route to an alternate worker
Treat signals as hints, not authority:[edit | edit source]
  • models will sometimes over/under-trigger
  • your orchestrator should combine signals with objective checks (timeouts, loop detector, max tokens reached, etc.)
  • Signals are optional and bounded (max count per request, max size).
  • Each signal has stable fields: - type (issue/escalation/outcome) - code (enum) - severity (low/med/high) - summary (short string) - meta (small dict, optional)
  • Worker returns them in request result/status, even if the text is empty.

If you like, the next step in requirements is to define:

  1. the exact repeated-line thresholds you want as defaults, and
  2. the initial code enum for control signals (keep it tiny at first—like 8–12 codes—so it stays reliable).

User: Yes those are the types of signals this would be intended for.[edit | edit source]

Yes those are the types of signals this would be intended for.

Or things like requesting a high level management decision where there are different ways to do something, with that decision being fed into a new request at a later time.

Assistant: Perfect — then I’d explicitly broaden “signals” from “error/health flags” into a structured control channel that can al…[edit | edit source]

Perfect — then I’d explicitly broaden “signals” from “error/health flags” into a structured control channel that can also ask for decision input from your higher-level manager (human or model), without forcing it to read/interpret the whole raw output.

What a signal is[edit | edit source]

A signal is a small, machine-readable record emitted by the generation worker and returned alongside the text result (or even with an empty text result).

Minimum fields:

  • type: "issue" | "escalation" | "decision_request" | "outcome"
  • code: stable short string enum
  • severity: "low" | "med" | "high"
  • summary: short, human-readable
  • meta: small JSON dict (bounded size)
  • request_id: for correlation (worker fills this)
  • Signals are optional.
  • Signals are bounded (e.g., max 3 per request; max 4KB total).
  • Signals are hints, not authoritative—your orchestrator decides what to do.

Control tools[edit | edit source]

Expose a tiny tool set that the worker intercepts. Example tools:

  1. signal_issue(code, severity, summary, meta={})
  2. request_escalation(reason, summary, meta={})
  3. request_decision(question, options, default_option=None, context=None, meta={})
  4. declare_outcome(code, summary=None, meta={})

Worker behavior:

  • Capture the tool call → add to signals[]
  • Optionally end early after certain tools (e.g., request_decision)
  • No normal “tool round trip” needed unless you want it (keep it one-way for v1)

This keeps it natural for models that already understand tool calling.
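
For example, the request_decision control tool in OpenAI function-calling format could look like this (the field set is illustrative):

REQUEST_DECISION_TOOL = {
    "type": "function",
    "function": {
        "name": "request_decision",
        "description": "Ask the higher-level manager to choose between options. "
                       "Call this and stop instead of guessing.",
        "parameters": {
            "type": "object",
            "properties": {
                "question": {"type": "string"},
                "options": {
                    "type": "array",
                    "items": {
                        "type": "object",
                        "properties": {
                            "id": {"type": "string"},
                            "label": {"type": "string"},
                        },
                        "required": ["id", "label"],
                    },
                },
                "default_option": {"type": "string"},
                "context": {"type": "string"},
            },
            "required": ["question", "options"],
        },
    },
}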

Your BIOS prompt should explicitly tell the model something like:[edit | edit source]
  • “If you need a higher-level decision, call request_decision and stop. Don’t keep guessing.”

request_decision fields[edit | edit source]

  • question: concise decision prompt
  • options: list of 2–6 options, each with: - id (short stable key) - label (short) - pros / cons (optional, very short)
  • default_option (optional)
  • context: short context summary (not full transcript)
  • meta: can include “what happens next” hints, e.g. {"next_step": "rerun_with_choice"}
When the orchestrator sees a decision_request signal, it will:[edit | edit source]
  1. store it as a pending decision (keyed by request_id or a decision_id)
  2. obtain a decision (human or a higher-level model)
  3. start a new request later with: - the original task restated briefly - the chosen option injected (and optionally the rationale) - and a note like “Decision chosen: X; proceed.”

This matches your “nuke & repave” style: no resuming the same stream required.

Keep the enum small and stable:[edit | edit source]

  • NEEDS_MANAGER_DECISION
  • NEEDS_EXTERNAL_INFO
  • NEEDS_HIGHER_REASONER
  • TOOL_LIMIT_REACHED
  • LOW_CONFIDENCE
  • LOOP_DETECTED (your repeated-line detector can emit this too)
  • MAX_TOKENS_REACHED
  • POLICY_RISK (if you want a safety escalation lane)

Add a worker option: stop_on_decision_request = True (default).[edit | edit source]

  • If the model requests a decision, the worker ends generation immediately and returns the signal.
  • Prevents the model from “making a choice anyway” and wasting tokens.

If you want, we can next define the exact schemas for these control tools (OpenAI JSON schema for parameters) and the BIOS wording that reliably causes models to use them instead of rambling.

User: Ok, assemble the full list of design decisions so far.[edit | edit source]

Ok, assemble the full list of design decisions so far.

Assistant: Here’s the consolidated list of design decisions / requirements we’ve locked in so far for the llama.cpp worker module.[edit | edit source]

Here’s the consolidated list of design decisions / requirements we’ve locked in so far for the llama.cpp worker module.

Each class instance owns one llama.cpp server subprocess (one model per process; may use 1+ GPUs if sharded).[edit | edit source]

  • Instance is immutable once created; reconfiguration is drop & recreate.
  • Public lifecycle is start() / stop() (optionally restart() internally).
  • Preferred recovery strategy is nuke & repave: restart the server rather than attempting delicate replay.

Concurrency is slot-based: worker has a configured integer slots (≥1).[edit | edit source]

  • A slot represents one in-flight generation request handled by this worker.
  • If all slots are full, the worker does not accept new work (returns an immediate “no slot available” result). No internal queue by default.

Caller submits work via a method like submit(system_prompt, user_prompt, …) that returns immediately with a request id/handle.[edit | edit source]

  • Caller retrieves progress/output via polling initially (callbacks optional later).
  • Worker tracks per-request state and exposes: - running/queued (if you ever add queue), completed, failed, canceled, timed out - in-flight count / slot utilization

Worker uses streaming internally even if the external API is polling-only.[edit | edit source]

  • Internal streaming is used for: - monitoring / progress timestamps - partial output capture (debug/ops) - stall detection

Timeouts must handle very long prefill (multi-minute to tens-of-minutes before first token).[edit | edit source]

  • Timeout configuration is per-worker (“profile” / policy object) with optional per-request overrides.
  • Timeout types: - connect timeout (short) - headers/accept timeout (moderate) - TTFT timeout is disabled by default (or set very high) - primary stall detection is progress/liveness based, not “no tokens yet” - optional absolute timeout (default None or very large)
  • Liveness/progress signals (pluggable): - stream bytes/tokens (when streaming) - subprocess liveness + /proc CPU time delta (baseline) - optional GPU activity probes (nice-to-have)
  • On stall/health failure: reconnect attempt(s) then restart, per policy.
  • Crash-loop protection via restart backoff / restart rate limits.

On worker restart, in-flight requests are failed with a clear reason (no replay).[edit | edit source]

  • Looping output is treated as request-level failure (not necessarily worker restart).

Always support max_tokens / max-new-tokens as a hard stop (per-worker default, per-request override).[edit | edit source]

  • Add an optional repeated-line detector as an early-kill mechanism: - detects same sufficiently-long line repeated N times consecutively - cancels the request, frees slot, returns terminal reason repeated_line_loop - designed to be simple and low false-positive

Tools follow OpenAI tool/function-calling format.[edit | edit source]

  • Tool runner is a separate component passed in, but the worker handles: - registering/exposing tools to the model - receiving tool calls - executing via tool runner - resuming the model with tool results
  • Tools are intended to be lightweight (calculator/lookup). Large calls are escalated to higher-level models.
  • Tool loop constraints: - max tool iterations - per-tool timeout - tool output size caps
  • Tool iteration budget should be visible to the model via BIOS prompt.

Worker injects a low-level BIOS system prompt (secondary system prompt) per request/turn.[edit | edit source]

  • BIOS contains runtime/control-plane facts such as: - tool iteration budget remaining - current date/time + timezone - (optional) capability hints / platform rules
  • Prompt ordering: 1. BIOS system prompt (highest priority) 2. caller’s system prompt 3. conversation messages (user/assistant/tool)

Worker maintains a status readout with at least:[edit | edit source]

  • active/healthy/restarting
  • slots total / slots used + active request ids
  • restart count, last error, last healthy time

  • Per-request status includes: - state, timestamps (created, started, last_progress) - partial output length (and optionally small tail snippet for debugging) - tool call count / trace metadata
  • Subprocess stdout/stderr captured (ring buffer) for diagnostics.

In addition to normal tools, the worker supports one-way control signals emitted by the model to help orchestration without needing another model to read all output.[edit | edit source]

  • Signals cover: - uncertainty / low confidence - needs external info / browsing - needs higher reasoning model - tooling limitations - policy/safety escalation lane (optional) - decision requests (management-level choice among options)
  • Implemented as “control tools” in OpenAI function-call format (intercepted by worker), or equivalent structured footer if needed.
  • Signals are: - bounded (count + size) - returned alongside output - treated as hints; orchestrator decides actions
  • For decision_request specifically: - worker can be configured to stop immediately once signaled (stop_on_decision_request=True) - orchestrator will later create a new request with the chosen decision injected (no resume/replay required)

Non-goals[edit | edit source]

  • No in-place model swapping; recreate worker instead.

  • No replay/resume of in-flight requests after restart.
  • No global scheduler inside the worker (handled by higher-level manager).

If you want, the next step is to turn this into a crisp “module spec” with: class responsibilities, method signatures, request/worker state enums, timeout policy schema, and the exact control-tool JSON schemas.

User: Timeouts won't need per request overrides.[edit | edit source]

Timeouts won't need per request overrides. Models will be setup based on the kind of jobs they are expected to handle.

The tools system is focused on lightweight tools for lower level and smaller models; larger and smarter models will be given access to heavier and more consequential tooling using the same systems. Though this shouldn't change the way it is implemented beyond what helper class is plugged in, assumptions need to not be baked too deeply into the call mechanism.

I think the BIOS should also contain basic universally applicable system prompt data, such as telling the model that it is operating as a hivemind of cooperating models, with the passed in system prompt being more flexible and changing according to the job type.

The non-goals in section 12 are all correct.

Next is coding style. Simplicity, Clarity, robustness, and a solid test suite are to be prioritized over clever tricks. As with the nuke and repave plan: the model time lost in a couple minutes of dumping a malfunctioning server and reloading is less of a problem than a system which accretes cruft.

A small caveat is that this will be working in a somewhat CPU- and RAM-constrained environment. It should not waste compute in its functioning either.

Assistant: Got it.[edit | edit source]

Got it. Here’s the updated, consolidated spec with your refinements folded in (and a couple implementation-facing decisions that follow directly from them).

###[edit | edit source]

  • One class instance owns one llama.cpp server subprocess (one model per process; may span multiple GPUs if sharded).
  • Instance config is immutable; reconfiguration is drop & recreate.
  • Public lifecycle: start() / stop() (internal restart() allowed).
  • Failure handling is nuke & repave: - detect unhealthy/stalled server → restart subprocess - no replay of in-flight requests; they terminate with a clear failure reason.
- Worker has slots (≥1). A slot = one in-flight generation.[edit | edit source]
  • If all slots are occupied, submit() immediately returns NO_SLOT_AVAILABLE.
  • No internal queueing unless explicitly added later as a bounded optional feature.
- submit(system_prompt, user_prompt, …) -> request_id/handle returns immediately.[edit | edit source]
  • Caller polls: - get_status(request_id) → state + progress metadata - get_result(request_id) → final output or “not ready”
  • Minimal state machine (exact naming flexible): RUNNING | TOOL_RUNNING | COMPLETED | FAILED | CANCELED
  • cancel(request_id) is best-effort and frees the slot.
- Worker uses streaming internally even if the external API is polling.[edit | edit source]
  • Streaming drives: - progress timestamps (last_stream_byte_at) - partial output capture (bounded) - loop detection - liveness/stall detection
- No per-request timeout overrides. Workers are configured based on expected workload.[edit | edit source]
  • Timeout profile (per worker) includes: - connect timeout (short) - headers/accept timeout (moderate) - TTFT timeout disabled by default (or effectively huge) - stall detection is liveness/progress based to support long-prefill (tens of minutes) - optional absolute timeout (usually None or very large)
  • Liveness sources (pluggable): - subprocess alive - /proc/<pid> CPU time delta (baseline) - optional GPU activity probes later
  • Restart policy with backoff + crash-loop protection.
- Always support max new tokens (max_tokens) per worker default (override per request for token limit may exist, but it’s orthogonal to timeouts).[edit | edit source]
  • Add a simple repeated-line detector (low false positives): - triggers when the same sufficiently-long line repeats N times consecutively - cancels the request (does not restart worker) - records repeated_line_loop reason + the repeated line snippet.
- Tools use OpenAI function-calling schema.[edit | edit source]
  • Tool execution is via a plug-in ToolRunner interface.
  • The worker must not bake in assumptions like “tools are always lightweight” into the protocol: - the worker handles registration, call parsing, execution, and resume uniformly - the plugged-in runner determines whether a tool is lightweight/heavy and how it is executed.
  • Tool loop controls: - max tool iterations (budget) - per-tool timeout - tool output size caps
- Worker injects a BIOS system prompt layer that includes: - universally applicable platform guidance (e.g., “you are one agent in a cooperating hivemind of models”) - current date/time + timezone - tool budget remaining and tool usage rules - optionally capability hints (“tools available”, “decision signals available”)[edit | edit source]
  • Ordering remains: 1. BIOS system prompt (highest priority) 2. caller system prompt (job-specific) 3. conversation messages
- Support one-way control signals emitted by the model for orchestration: - low confidence, needs external info, needs higher reasoner, tool limits, etc. - decision requests for management-level branching choices[edit | edit source]
  • Implemented as a small set of control tools (OpenAI function format) that the worker intercepts and returns as structured signals[].
  • stop_on_decision_request=True default: the worker ends generation immediately when such a signal is emitted.
  • Orchestrator handles the decision and later starts a new request with the chosen option injected.
- Worker status includes: - active/healthy/restarting - slots total/used + active request IDs - restart count, last error, last healthy time[edit | edit source]
  • Request status includes: - state + timestamps - last progress time (from stream/liveness) - partial output length + bounded tail snippet (for ops) - tool trace metadata and captured control signals
  • Subprocess logs captured in a bounded ring buffer.
- No in-place model swapping[edit | edit source]
  • No replay/resume after restart
  • No global scheduler inside the worker

###[edit | edit source]

  • Simplicity and clarity > cleverness.
  • Robustness over attempting to salvage broken states (aligned with nuke & repave).
  • Small, explicit state machines; avoid hidden implicit behavior.
  • Strong invariants (e.g., “slot ownership implies exactly one active request”).
- Unit tests for: - state transitions - slot accounting under concurrency - loop detector - tool-call parsing + resume - BIOS assembly ordering[edit | edit source]
  • Integration tests using a fake/stub HTTP server to simulate: - slow headers, slow streams, stalls, disconnects - malformed tool calls - restart conditions and crash loops
- Use bounded buffers everywhere (partial output, logs, tool traces).[edit | edit source]
  • Liveness probing interval should be configurable and not too frequent (avoid tight polling).
  • Avoid heavy dependencies; prefer stdlib + one HTTP client.
  • Avoid per-token processing that scales badly: - loop detector should operate on completed lines and minimal state - avoid storing whole outputs unless explicitly requested.

If you want next, we can turn this into a concrete module outline (files/classes/interfaces + the exact public method signatures + enums + config dataclasses), then code against that with tests driving the implementation.

User: keeping whole outputs until complete and gathered by the caller isn't a problem[edit | edit source]

keeping whole outputs until complete and gathered by the caller isn't a problem

Assistant: Great — then we can simplify the request store a bit.[edit | edit source]

Great — then we can simplify the request store a bit.

- The worker will accumulate the full generated output for each request in memory until it reaches a terminal state (COMPLETED/FAILED/CANCELED) and the caller retrieves it.[edit | edit source]

  • Polling status can still expose “tail”/metrics, but internally we keep the whole thing.

- This reduces complexity (no chunk files, no reassembly, no partial paging).[edit | edit source]

  • It does mean you’ll want clear lifecycle cleanup: - once the caller calls get_result(id) (or an explicit release(id)), the worker can free stored output and metadata. - optionally a TTL GC for abandoned requests.

Even if “whole outputs” are fine, I’d add an optional per-worker max_output_chars (or None for unlimited). Not because you expect it normally, but because a runaway loop + high max_tokens could otherwise chew RAM.[edit | edit source]

Everything else stays the same: internal streaming feeds the accumulator, loop detector watches completed lines, and slot/backpressure is unchanged.

User: Even the largest hypothetical outputs will be tiny relative to the available ram.[edit | edit source]

Even the largest hypothetical outputs will be tiny relative to the available ram.

Assistant: Perfect — then we can drop output-size caps as a requirement and keep the implementation straightforward:[edit | edit source]

Perfect — then we can drop output-size caps as a requirement and keep the implementation straightforward:

- The worker stores the full output text (and tool/control traces) in memory for each request until the caller retrieves it.[edit | edit source]

  • No max_output_chars cap needed.
  • Add only a simple cleanup rule to avoid unbounded accumulation from forgotten requests: - explicit release(request_id) or “auto-release after successful get_result()” - optional periodic GC of requests older than a configured TTL (primarily for robustness, not RAM pressure)

Everything else (slot limits, internal streaming, long-prefill-friendly liveness timeouts, repeated-line loop detector, OpenAI-format tools + control signals, BIOS hivemind prompt) stays as previously specified.

User: Ok, write up the design doc, including a description of what the module is and what it's for.[edit | edit source]

Ok, write up the design doc, including a description of what the module is and what it's for.

Assistant:[edit | edit source]

##[edit | edit source]

This module provides a robust, simple, supervisor-style wrapper around a single llama.cpp server subprocess (one model per process). It is intended for use inside a larger “hivemind” system that runs multiple model instances across multiple GPUs, routes work between them, and occasionally tears down and recreates instances based on workload.

The worker module’s responsibilities are:

  • Start/stop a llama.cpp server subprocess configured for a specific model + GPU allocation.
  • Accept generation requests up to a configured number of concurrency slots.
  • Execute requests asynchronously (submit returns immediately; caller polls for status/result).
  • Provide monitoring status (health, slot usage, in-flight request details).
  • Handle tool calling (OpenAI function-calling schema) via a pluggable tool runner.
  • Inject a low-level BIOS system prompt with platform-wide guidance and runtime metadata.
  • Detect and recover from failures using a nuke & repave strategy (restart the server rather than attempt delicate replay).
  • Detect obvious “runaway loops” via a simple repeated-line detector, in addition to max token limits.

This module is designed to prioritize simplicity, clarity, robustness, and testability over cleverness. It is also designed not to waste compute in a CPU/RAM constrained environment: bounded polling, lightweight liveness checks, and minimal background work.

1. One-process-per-model worker: each instance manages exactly one llama.cpp server process and its configuration.[edit | edit source]

  1. Slot-based concurrency: support slots > 1 with immediate refusal if full (no queue by default).
  2. Async request lifecycle: - submit() returns immediately with a request handle/id. - caller uses polling APIs to read status and final output.
  3. Nuke & repave reliability: - detect dead/stalled/unreachable server; - restart subprocess; - fail in-flight requests with explicit error reasons.
  4. Long-prefill-friendly timeouts: - accommodate multi-minute / tens-of-minutes time-to-first-token; - stall detection is liveness/progress-based, not TTFT-based.
  5. Tool calling: - OpenAI function-calling format; - tool runner is pluggable; protocol should not assume “lightweight tools” even if typical for smaller models.
  6. BIOS system prompt: - inject platform/hivemind guidance + runtime data (date/time, tool budget remaining, etc.).
  7. Control signals channel: - allow model to emit structured “signals” (e.g., low confidence, escalation, decision request) without requiring downstream models to parse long text.
  8. Simple loop early-kill: - repeated-line detector that cancels a request when it clearly loops.

- In-place reconfiguration or model swapping within a worker instance.[edit | edit source]

  • Replay/resume of in-flight requests after restart.
  • Global scheduling across workers (handled by the higher-level orchestrator).
  • Complex token-level analytics or heavy output post-processing.

###[edit | edit source]

  • Worker: owns the llama.cpp server subprocess; exposes request APIs; supervises health; injects BIOS prompt; runs tool loop; emits status.
  • Request: encapsulates one generation job (messages, tool state, output accumulation, status).
  • ToolRunner (plugin): executes tools invoked by the model; worker handles tool-call parsing and continuation.
  • Control Signals: structured records emitted by the model via “control tools” (one-way).
- The worker runs a small internal event loop / async runtime (or thread-based equivalent) to: - start subprocess, - dispatch HTTP requests, - stream responses, - perform liveness checks at a configurable interval, - update request state and worker status.[edit | edit source]

Exact naming is flexible; the intent is to keep it small and explicit.[edit | edit source]

- start() -> None[edit | edit source]
  • stop() -> None
  • (internal) restart(reason: str) -> None
- submit(system_prompt: str, user_prompt: str, *, params: GenerationParams | None = None) -> SubmitResult - returns immediately: - success: request_id - failure: NO_SLOT_AVAILABLE (no queue by default) or WORKER_NOT_READY[edit | edit source]
  • get_status(request_id) -> RequestStatus
  • get_result(request_id) -> RequestResult | NOT_READY
  • cancel(request_id) -> bool (best-effort)
  • release(request_id) -> None (optional; can also auto-release after successful get_result())
- get_worker_status() -> WorkerStatus[edit | edit source]
  • get_debug_info() -> WorkerDebugInfo (recent subprocess logs, last errors, restart count, etc.)

###[edit | edit source]

  • name: identifier
  • model_path
  • host, port
  • gpu_env: typically CUDA_VISIBLE_DEVICES="0" (or similar)
  • llama.cpp server args: ctx, gpu layers, batch, tensor split, etc. (opaque list/dict)
  • slots: max concurrent requests
  • timeout_profile: per-worker only (no per-request overrides)
  • tool_policy: max tool iterations, per-tool timeout, etc.
  • bios_provider: callable that renders BIOS prompt text
  • control_tools_enabled: bool
  • stop_on_decision_request: bool (default true)
  • loop_detector: enable + thresholds
Designed for long TTFT workloads.[edit | edit source]
  • connect_timeout_s (short)
  • headers_timeout_s (moderate)
  • ttft_timeout_s: typically disabled / None
  • prefill_liveness_timeout_s: large or None (prefill-safe)
  • idle_stream_timeout_s: max time without bytes once streaming starts
  • absolute_timeout_s: optional/None
  • liveness_probe_interval_s
  • restart backoff + crash-loop limits: - restart_backoff_s - max_restarts_per_window - restart_window_s

States (minimum viable):[edit | edit source]

  • RUNNING (includes prefill + generation)
  • TOOL_RUNNING (worker executing a tool; generation paused)
  • Terminal: - COMPLETED - FAILED (with reason) - CANCELED

Key timestamps:

  • created_at
  • dispatched_at
  • last_stream_byte_at
  • last_liveness_at
  • last_progress_at = max(last_stream_byte_at, last_liveness_at)
  • completed_at

###[edit | edit source]

Worker-owned, regenerated:

  • at request start
  • before each tool-loop continuation

BIOS includes:

  • platform/hivemind statement (universal)
  • current date/time + timezone
  • tool iteration budget remaining and constraints
  • optionally: how to emit control signals / decision requests
1. BIOS system prompt (highest priority)[edit | edit source]
  1. Caller system prompt (job-specific)
  2. User/assistant/tool messages
The worker should support either:[edit | edit source]
  • multiple system messages, or
  • a single combined system message with clear delimiters (fallback).

###[edit | edit source]

OpenAI function-calling schema (tools=[{type:"function", function:{...}}]).

- Worker: registers tools into request payload; detects tool calls; manages continuation; appends tool result messages.[edit | edit source]
  • ToolRunner (plugin): executes tool by name with args and returns result.
- max_tool_iterations budget (exposed in BIOS)[edit | edit source]
  • per-tool timeout
  • tool output size handling (may be truncated/summarized by ToolRunner if needed)
Even if most tools for small models are lightweight, the worker must not embed that assumption into the protocol. The same mechanism must work for heavier tools when plugged into a different ToolRunner.[edit | edit source]

###[edit | edit source]

Let the model emit structured “flags” and “requests” that the orchestrator can act on without parsing large outputs.

Expose a minimal set of control tools in OpenAI function format. Worker intercepts and records them into signals[] returned with the result/status.[edit | edit source]

Example signal types:

  • issue (low confidence, tool limit reached, etc.)
  • escalation (needs higher reasoning model / external info)
  • decision_request (requires management branch decision)
  • outcome (summarizable terminal condition)

Behavior:

  • bounded count/size
  • stop_on_decision_request=True default: end generation immediately when a decision is requested
  • orchestrator later creates a new request with the decision injected (no resume).

###[edit | edit source]

Nuke & repave:

  • if server unhealthy/stalled/disconnected: restart subprocess
  • mark all in-flight requests as FAILED(reason="worker_restarted") (no replay)
1. Connect/header failures - cannot connect or cannot receive headers in time → likely dead server → restart.[edit | edit source]
  1. Stalls - use last_progress_at driven by: - stream bytes when streaming exists - liveness probes during prefill - if no progress for configured window → restart (policy-controlled)
  2. Process death - subprocess exits → restart; fail in-flight requests.
- restart backoff[edit | edit source]
  • cap restarts per time window; mark worker unhealthy if exceeded.

###[edit | edit source]

A simple early-kill mechanism in addition to max_tokens.

  • Parse streamed output into completed lines.
  • Compare with previous completed line after normalization (trim whitespace).
  • Ignore empty/very short lines.
  • If same line repeats N times consecutively, cancel the request: - FAILED(reason="repeated_line_loop") - include the repeated line snippet + repeat count in debug/status.

This should be conservative (low false positives) and cheap (minimal CPU).

- Worker accumulates the full output text for each request in memory until terminal state.[edit | edit source]

  • Caller retrieves via get_result().
  • Cleanup policy: - auto-release after get_result() or explicit release(request_id) - optional TTL-based GC for abandoned requests (robustness, not RAM pressure)

###[edit | edit source]

  • active, healthy, restarting
  • slot counts: slots_total, slots_used
  • active_request_ids
  • restart_count, last_error, last_healthy_at
- state, timestamps[edit | edit source]
  • last_progress_at
  • output length so far
  • tool iteration remaining
  • recorded signals[]
- ring buffer of subprocess logs (recent N lines)[edit | edit source]
  • last health check results / recent restart reasons

- No heavy periodic work; liveness probing is interval-based and configurable.[edit | edit source]

  • Default to lightweight Linux /proc CPU time checks for prefill liveness.
  • Avoid expensive per-token processing; repeated-line detection only processes completed lines.
  • Keep dependencies minimal and stable.

###[edit | edit source]

  • Request state transitions and slot accounting (including cancellation)
  • BIOS prompt assembly and ordering
  • Tool call parsing → ToolRunner invocation → continuation assembly
  • Control tool interception and “stop on decision request”
  • Repeated-line detector behavior (including conservative non-triggers)
  • Timeout logic with synthetic progress/liveness timestamps
Using a stub HTTP server that can simulate:[edit | edit source]
  • slow headers
  • long silence then streaming
  • streaming stalls
  • disconnect mid-stream
  • malformed tool calls
  • server crash / restart conditions and crash-loop limits
- repeated start/stop cycles[edit | edit source]
  • concurrent submits up to slot limit
  • ensure no slot leaks on failures/cancellations

This module is a single-worker supervisor for llama.cpp servers designed for a multi-instance, multi-GPU orchestrated environment. It exposes a simple async request API with slot-based concurrency, robust failure recovery via nuke & repave, long-prefill-friendly health detection, OpenAI-format tool calling, BIOS prompt injection for platform context, and a structured control-signals channel for escalation and decision-making.[edit | edit source]

If you want, next step is to translate this doc into:

  • concrete class/method signatures + dataclasses/enums, and
  • a test-first skeleton (request store, slot semaphore, stub server harness) before hooking up llama-server subprocess + HTTP streaming.

User: Add that it is written in python3, using the typed python extensions.[edit | edit source]

Add that it is written in python3, using the typed python extensions.

As a final pass can you find anything I may have forgotten or under specified.

Assistant:[edit | edit source]

##[edit | edit source]

This module is a Python 3 component (using PEP 484 typing plus typing_extensions constructs such as Protocol, TypedDict, Self, etc.) that provides a robust, simple supervisor-style wrapper around a single llama.cpp server subprocess (one model per process).

It is intended for a larger “hivemind” system that runs multiple model instances across multiple GPUs, routes work between them, and occasionally tears down and recreates instances based on workload.

The worker module’s responsibilities are:

  • Start/stop a llama.cpp server subprocess configured for a specific model + GPU allocation.
  • Accept generation requests up to a configured number of concurrency slots.
  • Execute requests asynchronously (submit returns immediately; caller polls for status/result).
  • Provide monitoring status (health, slot usage, in-flight request details).
  • Handle tool calling (OpenAI function-calling schema) via a pluggable tool runner.
  • Inject a low-level BIOS system prompt with platform-wide guidance and runtime metadata.
  • Detect and recover from failures using a nuke & repave strategy (restart the server rather than attempt delicate replay).
  • Detect obvious “runaway loops” via a simple repeated-line detector, in addition to max token limits.

This module prioritizes simplicity, clarity, robustness, and testability over cleverness. It also aims not to waste compute in CPU/RAM constrained environments: bounded polling, lightweight liveness checks, and minimal background work.

1. One-process-per-model worker: each instance manages exactly one llama.cpp server process and its configuration.[edit | edit source]

  1. Slot-based concurrency: support slots > 1 with immediate refusal if full (no queue by default).
  2. Async request lifecycle: - submit() returns immediately with a request handle/id. - caller uses polling APIs to read status and final output.
  3. Nuke & repave reliability: - detect dead/stalled/unreachable server; - restart subprocess; - fail in-flight requests with explicit error reasons.
  4. Long-prefill-friendly timeouts: - accommodate multi-minute / tens-of-minutes time-to-first-token; - stall detection is liveness/progress-based, not TTFT-based.
  5. Tool calling: - OpenAI function-calling format; - tool runner is pluggable; protocol should not assume “lightweight tools” even if typical for smaller models.
  6. BIOS system prompt: - inject platform/hivemind guidance + runtime data (date/time, tool budget remaining, etc.).
  7. Control signals channel: - allow model to emit structured “signals” (e.g., low confidence, escalation, decision request) without requiring downstream models to parse long text.
  8. Simple loop early-kill: - repeated-line detector that cancels a request when it clearly loops.

- In-place reconfiguration or model swapping within a worker instance.[edit | edit source]

  • Replay/resume of in-flight requests after restart.
  • Global scheduling across workers (handled by the higher-level orchestrator).
  • Heavy output post-processing or expensive token analytics.

###[edit | edit source]

  • Worker: owns the server subprocess; exposes request APIs; supervises health; injects BIOS prompt; runs tool loop; emits status.
  • Request: encapsulates one generation job (messages, tool state, output accumulation, status).
  • ToolRunner (plugin): executes tools invoked by the model; worker handles tool-call parsing and continuation.
  • Control Signals: structured records emitted by the model via “control tools” (one-way).

###[edit | edit source]

  • start() -> None
  • stop() -> None
  • (internal) restart(reason: str) -> None
- submit(system_prompt: str, user_prompt: str, *, params: GenerationParams | None = None) -> SubmitResult - immediate return: - success: request_id - failure: NO_SLOT_AVAILABLE or WORKER_NOT_READY[edit | edit source]
  • get_status(request_id) -> RequestStatus
  • get_result(request_id) -> RequestResult | NOT_READY
  • cancel(request_id) -> bool
  • release(request_id) -> None (or auto-release after get_result())
- get_worker_status() -> WorkerStatus[edit | edit source]
  • get_debug_info() -> WorkerDebugInfo

###[edit | edit source]

  • name
  • model_path
  • host, port
  • gpu_env (e.g., CUDA_VISIBLE_DEVICES)
  • server command template: executable path + args (opaque list/dict)
  • slots
  • timeout_profile (per worker only)
  • tool_policy (max iterations, per-tool timeout, etc.)
  • bios_provider
  • control_tools_enabled
  • stop_on_decision_request (default true)
  • loop_detector enable + thresholds
- connect_timeout_s[edit | edit source]
  • headers_timeout_s
  • ttft_timeout_s (typically None/disabled)
  • prefill_liveness_timeout_s (large/None)
  • idle_stream_timeout_s
  • absolute_timeout_s (optional/None)
  • liveness_probe_interval_s
  • restart backoff + crash-loop limits

###[edit | edit source]

Worker-owned, regenerated:

  • at request start
  • before each tool-loop continuation

BIOS includes:

  • universal platform/hivemind guidance
  • current date/time + timezone
  • tool iteration budget remaining and constraints
  • instructions for emitting control signals/decision requests
1. BIOS system prompt[edit | edit source]
  1. caller system prompt
  2. user/assistant/tool messages

- Tools use OpenAI function-calling schema.[edit | edit source]

  • Worker manages: exposing tools, detecting calls, invoking ToolRunner, appending tool results, continuing generation.
  • Budgeted by max iterations; timeout per tool; ToolRunner may bound output sizes.

- One-way “control tools” intercepted by worker and returned as structured signals[].[edit | edit source]

  • Supports issue/escalation/outcome/decision_request.
  • Default stop_on_decision_request=True; orchestrator makes a later new request with chosen decision.

- Nuke & repave: restart server on unhealthy/stalled/disconnected.[edit | edit source]

  • In-flight requests are failed with a clear reason on restart.
  • Crash-loop protection via backoff and restart rate limits.

- Repeated-line detector on streamed output: - detect same sufficiently-long line repeated N consecutive times; - cancel request; mark FAILED(reason="repeated_line_loop"); - attach repeated line snippet + count to debug/status.[edit | edit source]

  • Also always support max_tokens as hard stop.

- Accumulate full output in memory until caller retrieves it.[edit | edit source]

  • Auto-release after result retrieval or explicit release().
  • Optional TTL GC for abandoned requests (robustness).

- WorkerStatus: active/healthy/restarting, slot usage, active request IDs, restart count, last error, last healthy time.[edit | edit source]

  • RequestStatus: state, timestamps, last progress, output length, tool iteration remaining, captured signals.
  • Debug: ring buffer of subprocess logs.

- Python 3.[edit | edit source]

  • Heavily typed using built-in typing + typing_extensions (Protocol, TypedDict, etc.).
  • Design emphasizes: - small explicit state machines - straightforward control flow - minimal dependencies - test suite depth over clever optimizations
  • Resource-aware: - liveness probing interval is configurable and not aggressive - avoid heavy per-token work; repeated-line detector operates on completed lines

Below are the main “spec gaps” I see, plus a concrete default choice for each so you can lock them down without adding complexity.[edit | edit source]

Under-specified: Are submit/get_status/get_result safe to call from multiple threads? Is the module asyncio-native?[edit | edit source]

Recommendation (simple + robust):

  • Make the worker asyncio-native internally, but expose a thread-safe sync facade if needed.
  • At minimum: specify “public methods are thread-safe” or “caller must stay on one event loop”.

Under-specified: Are you targeting llama-server (llama.cpp) or llama-cpp-python[server]? Their OpenAI-compat can differ across versions.[edit | edit source]

Recommendation:

  • Treat the server as an OpenAI-compatible HTTP endpoint and keep the server command fully configurable.
  • Define required endpoints you rely on: - POST /v1/chat/completions (streaming supported) - GET /v1/models (health/ready check)
  • If /v1/models doesn’t exist, allow a configurable health URL or a tiny completion warmup.

Under-specified: SSE vs raw chunks, JSON framing differences.[edit | edit source]

Recommendation:

  • Specify: worker supports SSE “data:” JSON lines and tolerates keepalive/comment lines.
  • Define what “progress” means: “any received bytes after headers count as stream progress”.

Under-specified: If the backend doesn’t emit tool_calls formally, how do you detect tool calls?[edit | edit source]

Recommendation (robust without replay):

  • Prefer native tool_calls when present.
  • Otherwise fall back to a structured prompting convention + strict JSON parse (still OpenAI-shaped), controlled by BIOS instructions.
  • If tool-call parsing fails, treat as FAILED(reason="tool_parse_error") (or emit a control signal requesting escalation).

Under-specified: ID format, how long results persist, GC behavior.[edit | edit source]

Recommendation:

  • IDs: monotonic integer or UUID (integer is cheaper; UUID is easier for distributed logging).
  • Results persist until get_result() or release().
  • Optional TTL GC disabled by default; enable only if you see abandoned requests.

Under-specified: Does the worker allocate ports or are they assigned externally?[edit | edit source]

Recommendation:

  • Port is provided by caller (simplest in multi-worker orchestrators).
  • Worker validates port availability at start; if bind fails, mark WORKER_NOT_READY with explicit error.

Under-specified: Ensuring subprocess and any children die.[edit | edit source]

Recommendation:

  • Start subprocess in its own process group/session.
  • On stop: SIGTERM → wait → SIGKILL process group.
  • This avoids orphan workers consuming VRAM.

Under-specified: Does a slot map to “one concurrent HTTP request” or to “server’s internal parallel slots”?[edit | edit source]

Recommendation:

  • Define slots as “max concurrent requests the worker will dispatch”.
  • Don’t assume the server truly parallelizes well; slots is an admission control knob, not a throughput guarantee.

Under-specified: Besides max_tokens, what parameters are supported (temperature, top_p, stop, seed, etc.)?[edit | edit source]

Recommendation:

  • Define a typed GenerationParams that includes common OpenAI-like fields (optional): - max_tokens, temperature, top_p, stop, seed (if supported), etc.
  • Worker passes through fields it knows, ignores/records unsupported fields (for forwards compatibility).

Under-specified: Exact JSON schema and stable enum codes.[edit | edit source]

Recommendation:

  • Lock a tiny stable enum set now (you can expand later without breaking): - LOW_CONFIDENCE, NEEDS_EXTERNAL_INFO, NEEDS_HIGHER_REASONER, NEEDS_MANAGER_DECISION, TOOL_LIMIT_REACHED, MAX_TOKENS_REACHED, LOOP_DETECTED, POLICY_RISK
  • Specify maximum signals per request and “stop on decision request” behavior (already decided).

If you want, I can produce the “next artifact” as a concrete typed interface spec (Protocols + TypedDicts/Enums + dataclasses) that directly matches this doc, ready to implement and test against.

User: 1.[edit | edit source]

  1. your async solution is correct, make that explicit
  2. llama-server
  3. yes, data flowing is enough to count as progress
  4. that tool parsing method is acceptable
  5. keep the IDs simple: incrementing integer, plus a job-name fed in as part of the request
  6. assign externally
  7. yes, the system needs automatic takedown to ensure there aren't any orphaned processes
  8. "slots" are whatever maps best to having multiple queries in flight at once
  9. keep the parameters feed open to whatever are needed so the module doesn't have to be modified every time a new feature is needed
  10. have the exit-tools specified as an argument on init. It doesn't need the plugin of the tool system since there is no round trip, it just needs to feed the information to the model of what it's options are and allow parsing afterwards

Assistant:[edit | edit source]

##[edit | edit source]

llama_worker is a Python 3 module (typed with PEP 484 + typing_extensions) that supervises a single llama-server subprocess from llama.cpp, and provides an async, slot-limited, resilient interface for running chat-style inference requests against it.

It’s designed to be used by a higher-level “hivemind” orchestrator that:

  • runs many workers (different models / GPUs / profiles),
  • routes jobs between them,
  • and prefers nuke & repave (restart a broken worker) over fragile replay.

1. One worker = one llama-server process (one model per process; may span multiple GPUs if configured).[edit | edit source]

  1. Async-first API (explicitly asyncio-native): requests are submitted and completed asynchronously.
  2. Slot-based concurrency: accept up to slots concurrent in-flight requests; otherwise reject immediately.
  3. Robustness via nuke & repave: - detect unhealthy/stalled server, - restart subprocess, - fail in-flight requests with explicit reasons (no replay).
  4. Long-prefill-friendly: supports workloads where time-to-first-token is minutes to tens of minutes.
  5. OpenAI-format tool calling with a pluggable ToolRunner; also supports a fallback tool-call parsing method.
  6. BIOS prompt layer: inject stable platform-wide instructions (hivemind context) + runtime metadata (date/time, budgets, etc.).
  7. One-way “exit tools” (control signals): model can emit structured signals (issue/escalation/decision request) without a tool round-trip.
  8. Simple early loop kill: repeated-line detector as a supplement to max token limits.
  9. Forward-compatible parameters: allow passing arbitrary generation parameters without changing module code.

- In-place model swapping or reconfiguration of a running worker instance.[edit | edit source]

  • Replay/resume of in-flight requests after restart.
  • A global scheduler across workers (belongs in the orchestrator).
  • Fancy token analytics or heavyweight monitoring agents.

- The module is asyncio-native.[edit | edit source]

  • Public methods are async def and expected to be called from an asyncio event loop.
  • Thread-safety is not a primary requirement; the orchestrator should call worker methods from a consistent async context. (A sync wrapper may exist later, but is not required for v1.)

###[edit | edit source]

  • LlamaWorker - owns the llama-server subprocess and HTTP client - manages slots, request table, and health state - assembles prompts (BIOS + caller system + conversation) - streams response internally and accumulates full output - runs tool-call loop and parses/records exit-tools
  • ToolRunner (plugin) - pluggable execution for normal tool calls (OpenAI function schema) - may be lightweight or heavy; worker must not assume
  • Request records - store full output until retrieved - expose status + debug metadata

- Worker launches llama-server as a subprocess in its own process group/session.[edit | edit source]

  • stop() must ensure no orphaned processes: - send SIGTERM to the process group → wait → SIGKILL process group if needed.
  • Worker captures stdout/stderr into a bounded ring buffer for debugging.
  • Port is assigned externally and provided via config; worker validates it can connect / server binds.

- Worker has slots: int.[edit | edit source]

  • A slot is “whatever maps best to having multiple queries in flight at once”: - Implementation will treat a slot as permission to dispatch one concurrent HTTP streaming request. - Slots are admission control, not a guarantee of linear throughput.
  • If all slots are in use, submit() returns immediately with NO_SLOT_AVAILABLE.
  • No internal queue by default.

- Worker assigns request IDs as an incrementing integer (1, 2, 3, …).[edit | edit source]

  • Each submission also includes a caller-provided job_name string for correlation and orchestration.
  • The (job_name, request_id) pair is the primary handle externally (request_id is unique within a worker lifetime).

###[edit | edit source]

  • async start() -> None
  • async stop() -> None
  • (internal) async restart(reason: str) -> None
- async submit(job_name: str, system_prompt: str, user_prompt: str, *, params: Mapping[str, Any] | None = None) -> SubmitResult - returns: - success: { ok: True, request_id: int } - failure: { ok: False, error: "NO_SLOT_AVAILABLE" | "WORKER_NOT_READY" | ... }[edit | edit source]
  • async get_status(request_id: int) -> RequestStatus
  • async get_result(request_id: int) -> RequestResult | NotReady
  • async cancel(request_id: int) -> bool
  • async release(request_id: int) -> None (or auto-release after successful get_result(); implementation choice)
- async get_worker_status() -> WorkerStatus[edit | edit source]
  • async get_debug_info() -> WorkerDebugInfo (recent logs, last errors, restart count)

All structures are typed (TypedDict/dataclasses) with stable machine-readable fields.

- params is an open mapping passed through to llama-server’s OpenAI-compatible request body.[edit | edit source]

  • The worker: - merges required fields it controls (messages/tools/stream flags), - passes unknown keys through untouched, - may optionally record “unsupported parameter” warnings if the backend rejects them.
  • This prevents needing module changes whenever llama.cpp adds a new knob.

###[edit | edit source]

BIOS is a worker-generated system prompt that includes:

  • stable platform-wide instruction: the model is one cooperating agent in a multi-model hivemind
  • current date/time + timezone
  • tool budget remaining / tool usage rules
  • instructions on how to emit exit-tools/control signals (if enabled)

BIOS is regenerated:

  • at request start
  • before each continuation after tool execution
1. BIOS system prompt (highest priority)[edit | edit source]
  1. caller’s system prompt
  2. conversation messages (user/assistant/tool)

Multiple system messages are allowed; if backend behavior requires it, the worker may concatenate with clear delimiters (fallback).

- Tools follow OpenAI schema: tools=[{type:"function", function:{name, description, parameters}}].[edit | edit source]

  • Workflow: 1. Send request with tools available. 2. If assistant returns a tool call: - execute via ToolRunner, - append tool result to conversation, - decrement tool iteration budget, - continue generation.
  • Tool loop controls: - max_tool_iterations (per worker) - per-tool timeout (per worker)
  • Fallback parsing: If backend doesn’t emit structured tool_calls reliably, worker can use BIOS-enforced structured JSON conventions and strict parsing; failure is a request-level terminal error (or can emit an exit-tool escalation signal).

###[edit | edit source]

  • Exit tools are specified at worker init as a list of OpenAI-format tool definitions (same schema).
  • They do not use the ToolRunner and do not involve a tool result round trip.
  • The worker exposes these to the model as “available control actions” and then parses and records any such tool calls emitted.
- LOW_CONFIDENCE[edit | edit source]
  • NEEDS_EXTERNAL_INFO
  • NEEDS_HIGHER_REASONER
  • NEEDS_MANAGER_DECISION (with options)
  • POLICY_RISK (optional lane)
  • etc.
- Exit tool calls become structured signals[] in the request record.[edit | edit source]
  • Default: stop_on_decision_request=True (worker terminates generation early on decision requests so orchestrator can branch via a new request).

- Worker always uses streaming internally (even if caller only polls).[edit | edit source]

  • Progress definition (explicit): any data flowing after headers counts as progress.
  • Timestamps tracked per request: - last_stream_byte_at (any bytes received) - last_liveness_at (prefill liveness probes) - last_progress_at = max(last_stream_byte_at, last_liveness_at)
- Before tokens/bytes arrive, worker uses lightweight probes: - process alive - /proc/<pid> CPU time delta (baseline on Linux) - optional GPU probes later[edit | edit source]

No per-request overrides.[edit | edit source]

A worker’s timeout profile includes:

  • connect_timeout_s (short)
  • headers_timeout_s (moderate)
  • ttft_timeout_s (typically disabled/None)
  • prefill_liveness_timeout_s (large/None)
  • idle_stream_timeout_s (time without any bytes once streaming)
  • absolute_timeout_s (optional/None)
  • liveness_probe_interval_s
  • restart controls: backoff and crash-loop limits
- On connect/header failures: restart quickly.[edit | edit source]
  • On stall (no progress beyond thresholds): restart and fail in-flight requests.
  • On process death: restart and fail in-flight requests.
  • No replay; in-flight requests fail with reasons like: - worker_restarted - stall_timeout - connect_failed - headers_timeout - server_died

- Primary: max_tokens / max-new-tokens in params (per worker default; may be overridden by caller if desired).[edit | edit source]

  • Secondary: repeated-line detector - detects the same sufficiently-long line repeated N times consecutively - cancels the request and records FAILED(reason="repeated_line_loop") - cheap and conservative (line-based, not token-heavy)

- Worker stores full output text in memory for each request until: - caller retrieves it and the worker auto-releases, or - caller calls release(request_id).[edit | edit source]

  • Given your RAM assumptions, no output cap is required for v1.

###[edit | edit source]

  • active, healthy, restarting
  • slots_total, slots_used
  • list of active request IDs (and job_name mapping)
  • restart_count, last_error, last_healthy_at
- request_id, job_name[edit | edit source]
  • state + timestamps
  • output length so far
  • last_progress_at
  • tool iteration remaining
  • captured signals[]
- last N lines of subprocess output[edit | edit source]
  • last restart reasons and times

Priority is a solid suite over cleverness.[edit | edit source]

- slot accounting and state transitions[edit | edit source]
  • BIOS assembly + ordering
  • tool loop logic + fallback parsing
  • exit tools parsing and early-stop behavior
  • repeated-line detector behavior
  • timeout bookkeeping with synthetic time/progress
Using a stub server (or controlled llama-server in CI when feasible) to simulate:[edit | edit source]
  • slow headers
  • long silence then streaming
  • streaming stalls
  • disconnect mid-stream
  • malformed tool calls and exit tool calls
  • subprocess crash and restart behavior
  • process-group teardown (no orphans)

These aren’t blockers, but locking them now will reduce rework:[edit | edit source]

  1. Exactly which endpoint is used for health/ready - Recommend: GET /v1/models if supported; otherwise a tiny warmup completion endpoint.
  2. Result retention policy - Auto-release on get_result() vs requiring explicit release(). (Either is fine; pick one and test it.)
  3. Exit tool semantics - Which exit tools should always terminate generation immediately (decision requests, policy risk), vs which just annotate and continue.
  4. Cancellation behavior - Confirm that cancel is “best effort” and may return partial output + terminal reason canceled.

If you want, I can apply your choices for (2)–(4) and produce a final “interfaces + types” section (Protocols/TypedDicts/enums/dataclasses) that’s ready to implement directly.

User: 2.[edit | edit source]

  1. getting the result should function as the completion for that operation, releasing everything
  2. exit tools should never terminate output themselves beyond what the model does. They are meant to convey information upwards

Assistant:[edit | edit source]

##[edit | edit source]

llama_worker is a Python 3 module (typed with PEP 484 + typing_extensions) that supervises a single llama-server subprocess from llama.cpp, and provides an async, slot-limited, resilient interface for running chat-style inference requests against it.

It’s meant to be used by a higher-level “hivemind” orchestrator that:

  • runs many workers (different models / GPUs / profiles),
  • routes jobs between them,
  • and prefers nuke & repave (restart a broken worker) over fragile replay.

1. One worker = one llama-server process (one model per process; may span multiple GPUs if configured).[edit | edit source]

  1. Async-first API: explicitly asyncio-native.
  2. Slot-based concurrency: accept up to slots concurrent in-flight requests; otherwise reject immediately.
  3. Nuke & repave reliability: - detect dead/stalled/unreachable server, - restart subprocess, - fail in-flight requests with explicit reasons (no replay).
  4. Long-prefill-friendly: supports workloads where time-to-first-token can be minutes to tens of minutes.
  5. OpenAI-format tool calling with a pluggable ToolRunner; plus a fallback tool-call parsing method.
  6. BIOS prompt layer: inject stable platform-wide instructions (hivemind context) + runtime metadata (date/time, budgets, etc.).
  7. One-way exit-tools (control signals): model can emit structured signals upward; worker records them but does not alter control flow.
  8. Simple early loop kill: repeated-line detector as a supplement to max token limits.
  9. Forward-compatible parameters: allow passing arbitrary generation parameters without module changes.

- In-place model swapping or reconfiguration of a running worker.[edit | edit source]

  • Replay/resume of in-flight requests after restart.
  • Global scheduling across workers (belongs in orchestrator).
  • Heavy output post-processing.

- The module is asyncio-native.[edit | edit source]

  • Public methods are async def and expected to be called from an asyncio event loop.
  • Thread-safety is not a v1 requirement; keep calls within a consistent async context.

- Launch llama-server in its own process group/session.[edit | edit source]

  • stop() must ensure no orphaned processes: - SIGTERM to process group → wait → SIGKILL process group if needed.
  • Capture stdout/stderr into a bounded ring buffer for debugging.
  • Port is assigned externally and passed in config.

- Worker has slots: int.[edit | edit source]

  • A slot is permission to have one request “in flight” (best mapping to concurrent HTTP streaming requests).
  • If all slots are in use, submit() returns immediately with NO_SLOT_AVAILABLE.
  • No internal queue by default.

- Request IDs are incrementing integers (1, 2, 3… per worker lifetime).[edit | edit source]

  • Each request includes a caller-provided job_name for correlation.

###[edit | edit source]

  • async start() -> None
  • async stop() -> None
  • (internal) async restart(reason: str) -> None
- async submit(job_name: str, system_prompt: str, user_prompt: str, *, params: Mapping[str, Any] | None = None) -> SubmitResult[edit | edit source]
  • async get_status(request_id: int) -> RequestStatus
  • async get_result(request_id: int) -> RequestResult | NotReady
  • async cancel(request_id: int) -> bool

Result retrieval releases resources (explicit decision):

  • Calling get_result(request_id) when the request is terminal returns the result and releases all stored state/output for that request.
  • After successful get_result(), subsequent get_status/get_result for that request_id should return a stable “unknown/released” response (e.g., NOT_FOUND / RELEASED).
- async get_worker_status() -> WorkerStatus[edit | edit source]
  • async get_debug_info() -> WorkerDebugInfo

- params is an open mapping passed through to llama-server’s OpenAI-compatible request body.[edit | edit source]

  • Worker merges in fields it controls (messages/tools/stream) and passes unknown keys through untouched.

###[edit | edit source]

Worker-owned, regenerated at request start and before each post-tool continuation. Includes:

  • universal platform/hivemind guidance
  • current date/time + timezone
  • tool iteration budget remaining and constraints
  • instructions for using tools and exit-tools
1. BIOS system prompt[edit | edit source]
  1. caller system prompt
  2. conversation messages

- OpenAI function-calling schema.[edit | edit source]

  • Worker: - exposes tools, - detects tool calls (structured, or fallback), - executes via ToolRunner, - appends tool result messages, - continues generation until completion or tool-iteration budget exhausted.

Fallback tool parsing is allowed via BIOS-enforced structured JSON conventions if native tool_calls is unreliable.

Explicit decision: Exit tools never terminate output or alter control flow beyond what the model itself does.[edit | edit source]

  • Exit tools are provided at worker init as OpenAI-format function tool definitions.
  • Worker includes them in the tool list (or a dedicated list, depending on server compatibility) so the model knows its signaling options.
  • When the model emits an exit-tool call, the worker records it into signals[] (structured, typed).
  • The worker does not automatically: - stop the request, - cancel it, - restart the server, - or change sampling/params.
  • The orchestrator can react to signals[] (including choosing to call cancel() externally).

This keeps exit-tools purely informational and “upward-facing”.

- Worker always uses streaming internally.[edit | edit source]

  • Progress definition: any response data flowing after headers counts as progress.
  • Track: - last_stream_byte_at - last_liveness_at (prefill probes) - last_progress_at = max(last_stream_byte_at, last_liveness_at)

Prefill liveness baseline:

  • subprocess alive
  • /proc/<pid> CPU time delta

- No per-request overrides.[edit | edit source]

  • Profile includes: - connect timeout - headers timeout - TTFT timeout typically disabled/None - prefill liveness timeout (large/None) - idle stream timeout (no bytes) - optional absolute timeout - probe interval - restart backoff + crash-loop limits

Recovery is nuke & repave; in-flight requests fail with explicit reasons (no replay).

- Primary: max_tokens (passed via params, default set per worker profile).[edit | edit source]

  • Secondary: repeated-line detector cancels request on clear degenerate repetition.

- Full output is accumulated while running.[edit | edit source]

  • On successful get_result(), the request’s stored output, tool trace, and signals are released immediately.

- WorkerStatus: active/healthy/restarting, slot usage, active request ids, restarts, last error.[edit | edit source]

  • RequestStatus: state, timestamps, output length so far, last_progress_at, tool iterations remaining, recorded signals.
  • DebugInfo: bounded subprocess logs + recent restart reasons.

- Strong unit + integration tests prioritized over cleverness.[edit | edit source]

  • Integration tests should specifically validate: - process-group teardown leaves no orphans - long prefill does not trigger false timeouts - disconnect/stall triggers restart and fails in-flight requests - tool-call parsing (native + fallback) - exit-tools are recorded but do not affect control flow - get_result() releases request state

These are the only places where you may want to pick a default before coding:[edit | edit source]

  1. Health endpoint / readiness check - Prefer GET /v1/models; fallback to a tiny completion as warmup if needed.
  2. What get_status() returns after get_result() - Recommend: NOT_FOUND or RELEASED with a stable error code.
  3. Exit-tools parsing precedence - If a response includes both normal tool_calls and an exit-tool call, do you process normal tools first, or just record exit-tools and continue? (Default recommendation: process normal tool loop as usual; record exit-tools whenever they occur.)

If you want, I can append a compact “Types & Protocols” section (TypedDicts/Enums/Protocols) that matches this design doc 1:1, ready to implement.

User: Another point of design guidance: in places like the BIOS prompt, have the generation be a distinct method for clarity,…[edit | edit source]

Another point of design guidance: in places like the BIOS prompt, have the generation be a distinct method for clarity, ease of modification, and separation of concerns.

Assistant: Absolutely—good call.[edit | edit source]

Absolutely—good call. Here’s how that design guidance fits into (and slightly strengthens) the spec.

###[edit | edit source]

  • The worker must implement BIOS prompt generation as a distinct method/component, not inline inside request dispatch.
  • BIOS generation should be pure-ish: given a small structured input (worker config + request state + budgets + time), it returns a string (or list of messages) with no side effects.
- Clarity: prompt logic is visible and testable in isolation.[edit | edit source]
  • Ease of modification: changing platform-level behavior doesn’t risk breaking transport/tooling logic.
  • Separation of concerns: request execution vs prompt assembly vs tool handling stays clean.

Add a “Prompt Assembly” subsection that defines:[edit | edit source]

  • build_bios_prompt(ctx: BiosContext) -> str | list[Message]
  • build_message_stack(bios: ..., caller_system: ..., conversation: ...) -> list[Message]

Where BiosContext includes:

  • current datetime + timezone
  • tool iteration budget remaining
  • worker identity/model identity (optional)
  • enabled exit-tools list (names/codes)
  • any other platform constants (“hivemind” statement)

- BIOS generation must have direct unit tests verifying: - required fields appear (date/time, budgets, hivemind statement) - formatting stability (version tag / delimiters if used) - correct updates across tool iterations[edit | edit source]

If you want, I can patch this into the design doc verbatim as a new subsection (“Separation of concerns: prompt generation”) and add a short “Module internal structure” section listing the main internal methods/components (process supervisor, HTTP transport, prompt builder, tool loop, parsers).

User: yes, add that to the doc as both a specific rule and a guide of how to approach problems[edit | edit source]

yes, add that to the doc as both a specific rule and a guide of how to approach problems

Assistant:[edit | edit source]

##[edit | edit source]

llama_worker is a Python 3 module (typed with PEP 484 + typing_extensions) that supervises a single llama-server subprocess from llama.cpp, and provides an async, slot-limited, resilient interface for running chat-style inference requests against it.

It’s meant to be used by a higher-level “hivemind” orchestrator that:

  • runs many workers (different models / GPUs / profiles),
  • routes jobs between them,
  • and prefers nuke & repave (restart a broken worker) over fragile replay.

1. One worker = one llama-server process (one model per process; may span multiple GPUs if configured).[edit | edit source]

  1. Async-first API: explicitly asyncio-native.
  2. Slot-based concurrency: accept up to slots concurrent in-flight requests; otherwise reject immediately.
  3. Nuke & repave reliability: - detect dead/stalled/unreachable server, - restart subprocess, - fail in-flight requests with explicit reasons (no replay).
  4. Long-prefill-friendly: supports workloads where time-to-first-token can be minutes to tens of minutes.
  5. OpenAI-format tool calling with a pluggable ToolRunner; plus a fallback tool-call parsing method.
  6. BIOS prompt layer: inject stable platform-wide instructions (hivemind context) + runtime metadata (date/time, budgets, etc.).
  7. One-way exit-tools (control signals): model can emit structured signals upward; worker records them but does not alter control flow.
  8. Simple early loop kill: repeated-line detector as a supplement to max token limits.
  9. Forward-compatible parameters: allow passing arbitrary generation parameters without module changes.
  10. Separation of concerns rule: prompt generation (especially BIOS) must be isolated, testable, and independently modifiable.

- In-place model swapping or reconfiguration of a running worker.[edit | edit source]

  • Replay/resume of in-flight requests after restart.
  • Global scheduling across workers (belongs in orchestrator).
  • Heavy output post-processing.

- The module is asyncio-native.[edit | edit source]

  • Public methods are async def and expected to be called from an asyncio event loop.
  • Thread-safety is not a v1 requirement; keep calls within a consistent async context.

- Launch llama-server in its own process group/session.[edit | edit source]

  • stop() must ensure no orphaned processes: - SIGTERM to process group → wait → SIGKILL process group if needed.
  • Capture stdout/stderr into a bounded ring buffer for debugging.
  • Port is assigned externally and passed in config.

- Worker has slots: int.[edit | edit source]

  • A slot is permission to have one request “in flight” (best mapping to concurrent HTTP streaming requests).
  • If all slots are in use, submit() returns immediately with NO_SLOT_AVAILABLE.
  • No internal queue by default.

- Request IDs are incrementing integers (1, 2, 3… per worker lifetime).[edit | edit source]

  • Each request includes a caller-provided job_name for correlation.

###[edit | edit source]

  • async start() -> None
  • async stop() -> None
  • (internal) async restart(reason: str) -> None
- async submit(job_name: str, system_prompt: str, user_prompt: str, *, params: Mapping[str, Any] | None = None) -> SubmitResult[edit | edit source]
  • async get_status(request_id: int) -> RequestStatus
  • async get_result(request_id: int) -> RequestResult | NotReady
  • async cancel(request_id: int) -> bool

Result retrieval releases resources (explicit decision):

  • Calling get_result(request_id) when the request is terminal returns the result and releases all stored state/output for that request.
  • After successful get_result(), subsequent get_status/get_result for that request_id should return a stable “unknown/released” response.
- async get_worker_status() -> WorkerStatus[edit | edit source]
  • async get_debug_info() -> WorkerDebugInfo

- params is an open mapping passed through to llama-server’s OpenAI-compatible request body.[edit | edit source]

  • Worker merges in fields it controls (messages/tools/stream) and passes unknown keys through untouched.

###[edit | edit source]

Worker-owned, regenerated at request start and before each post-tool continuation. Includes:

  • universal platform/hivemind guidance
  • current date/time + timezone
  • tool iteration budget remaining and constraints
  • instructions for using tools and exit-tools
1. BIOS system prompt[edit | edit source]
  1. caller system prompt
  2. conversation messages

- OpenAI function-calling schema.[edit | edit source]

  • Worker: - exposes tools, - detects tool calls (structured, or fallback), - executes via ToolRunner, - appends tool result messages, - continues generation until completion or tool-iteration budget exhausted.

Fallback tool parsing is allowed via BIOS-enforced structured JSON conventions if native tool_calls is unreliable.

Explicit decision: Exit tools never terminate output or alter control flow beyond what the model itself does.[edit | edit source]

  • Exit tools are provided at worker init as OpenAI-format function tool definitions.
  • Worker includes them so the model knows its signaling options.
  • When the model emits an exit-tool call, the worker records it into signals[].
  • The orchestrator may react (including choosing to cancel externally), but the worker itself does not change behavior.

- Worker always uses streaming internally.[edit | edit source]

  • Progress definition: any response data flowing after headers counts as progress.
  • Track: - last_stream_byte_at - last_liveness_at (prefill probes) - last_progress_at = max(last_stream_byte_at, last_liveness_at)

Prefill liveness baseline:

  • subprocess alive
  • /proc/<pid> CPU time delta

- No per-request overrides.[edit | edit source]

  • Profile includes: - connect timeout - headers timeout - TTFT timeout typically disabled/None - prefill liveness timeout (large/None) - idle stream timeout (no bytes) - optional absolute timeout - probe interval - restart backoff + crash-loop limits

Recovery is nuke & repave; in-flight requests fail with explicit reasons (no replay).

- Primary: max_tokens (passed via params, default set per worker profile).[edit | edit source]

  • Secondary: repeated-line detector cancels request on clear degenerate repetition.

- Full output is accumulated while running.[edit | edit source]

  • On successful get_result(), the request’s stored output, tool trace, and signals are released immediately.

- WorkerStatus: active/healthy/restarting, slot usage, active request ids, restarts, last error.[edit | edit source]

  • RequestStatus: state, timestamps, output length so far, last_progress_at, tool iterations remaining, recorded signals.
  • DebugInfo: bounded subprocess logs + recent restart reasons.

###[edit | edit source]

In places likely to evolve—especially BIOS prompt generation and prompt assembly—implementation must be separated into distinct methods/components, not embedded inline inside request execution or transport code.

At minimum:

  • BIOS prompt creation is a distinct method/component.
  • Message-stack assembly (BIOS + caller system + conversation) is a distinct method/component.
When adding features or handling edge cases:[edit | edit source]
  1. Identify which concern it belongs to, and keep changes inside that layer: - Process supervision (start/stop/restart, orphan prevention) - Transport (HTTP request/stream parsing, retries) - Prompting (BIOS generation, message ordering, tool/exit-tool exposure) - Tool loop (detect tool call, execute via ToolRunner, resume) - Parsing (tool-call parsing, exit-tool parsing, repeated-line detection) - State & accounting (slots, request lifecycle, status reporting)
  2. Prefer small, testable functions over clever shared logic.
  3. Keep policy separate from mechanics: - Timeout policy and restart policy should be data/config driven. - Prompt content should be generated from structured inputs.
BIOS generation and message assembly must be unit-testable in isolation:[edit | edit source]
  • required fields present (hivemind guidance, date/time, budgets)
  • formatting stability (e.g., version tag or delimiters if used)
  • correct updates as tool budget decreases

- Strong unit + integration tests prioritized over cleverness.[edit | edit source]

  • Integration tests should validate: - process-group teardown leaves no orphans - long prefill does not trigger false timeouts - disconnect/stall triggers restart and fails in-flight requests - tool-call parsing (native + fallback) - exit-tools are recorded but do not affect control flow - get_result() releases request state

1. Health endpoint / readiness check - Recommend: GET /v1/models; fallback to a tiny warmup completion if needed.[edit | edit source]

  1. Post-release status behavior - Recommend: get_status() after get_result() returns a stable NOT_FOUND / RELEASED code.
  2. Exit-tools parsing precedence - If both a normal tool call and an exit-tool appear, recommended default is: - process normal tool loop as usual - record exit-tool signals whenever they occur

If you want, next step is producing a compact “Types & Protocols” appendix (Protocols/TypedDicts/dataclasses/enums) that exactly matches this doc, so implementation can proceed test-first without ambiguity.

User: 1.[edit | edit source]

  1. I'm unclear on what is being asked here?
  2. minimum status info is just that it is running / failed / ready. Tokens received and tokens per second are nice to haves.
  3. that is an acceptable exit tool priority. In practice models will be encouraged to not use them until they are ready to finish.

Assistant:[edit | edit source]

When the worker starts llama-server, it needs a cheap, reliable way to decide “the server is ready to accept requests” (and later, “it’s still alive”).[edit | edit source]

So the missing decision is: what exact HTTP request should the worker use as its readiness/health probe?

Recommended default for llama-server:

  • Ready check: GET /v1/models - If it returns HTTP 200 + parses as JSON, consider the worker READY.
  • If that endpoint ever proves unreliable in your build: fall back to a tiny POST /v1/chat/completions with max_tokens=1 (warmup-style probe), but only if needed.

This probe is separate from “slow prefill” timeouts—health probes should answer “is the server reachable and responding at all?” not “is this particular request producing tokens yet?”

I’ll update the doc wording to make that explicit.

Agreed. We’ll define minimum worker status as:[edit | edit source]
  • READY (server is up and probe succeeds; can accept work)
  • RUNNING (process is started but not yet READY, or currently serving—implementation can keep READY/RUNNING separate or collapse to READY as “running + ready”)
  • FAILED (process exited or repeated probe failures / crash-loop lockout)

Per-request output length, tokens received, tokens/sec become optional nice-to-haves, not required.

Great. We’ll keep the behavior:[edit | edit source]
  • If both normal tools and exit-tools appear, the worker processes normal tool calls as usual and records exit-tool signals whenever they appear.
  • BIOS guidance will encourage models to use exit-tools near completion, but the worker won’t rely on that for correctness.

If you want, I can now produce the final “Types & Protocols” appendix consistent with these clarifications (minimal WorkerState enum, probe config, and typed status/result records).

User: that is an acceptable endpoint to use[edit | edit source]

that is an acceptable endpoint to use

Assistant: Great — we’ll use that as the default readiness/health probe.[edit | edit source]

Great — we’ll use that as the default readiness/health probe.

###[edit | edit source]

  • Primary readiness check: GET /v1/models - If it returns HTTP 200 and parses as JSON, worker enters READY.
  • This probe is used for: - “server is up and responding” - post-restart validation
  • It is not used to judge per-request progress (slow prefill is handled by the request liveness/progress logic).
Minimum status surface is:[edit | edit source]
  • READY — server responds to /v1/models
  • RUNNING — process started but not yet READY (optional; you can collapse this into “not ready yet” if you prefer)
  • FAILED — process exited or worker locked out due to crash-loop/backoff

Tokens received / tokens/sec remain optional nice-to-haves.

If you want, next step is the typed “Types & Protocols” appendix so the code and tests have a strict contract to implement against.

User: Ok, let's move to the next section.[edit | edit source]

Ok, let's move to the next section.

Assistant:[edit | edit source]

This section defines the concrete typed interfaces the implementation and test suite will code against. It’s intentionally “boring” and explicit.[edit | edit source]

from __future__ import annotations

from dataclasses import dataclass
from datetime import datetime
from enum import Enum
from typing import Any, Mapping, MutableMapping, Sequence

from typing_extensions import (
    Literal,
    NotRequired,
    Protocol,
    TypedDict,
    runtime_checkable,
)

=====

class WorkerState(str, Enum): =====
    RUNNING = "running"   # process started (may or may not be ready yet)
    READY = "ready"       # GET /v1/models succeeds
    FAILED = "failed"     # crashed, or locked out by crash-loop protection
    STOPPED = "stopped"   # not running

class RequestState(str, Enum):
    RUNNING = "running"
    TOOL_RUNNING = "tool_running"
    COMPLETED = "completed"
    FAILED = "failed"
    CANCELED = "canceled"
Keep these as stable strings so the orchestrator can route without parsing text.[edit | edit source]
RequestFailReason = Literal[
    "worker_restarted",
    "server_died",
    "connect_failed",
    "headers_timeout",
    "stall_timeout",
    "tool_parse_error",
    "tool_execution_error",
    "repeated_line_loop",
    "canceled",
    "unknown_error",
]

RequestFinishReason = Literal[
    "stop",              # normal stop sequence / end-of-generation
    "max_tokens",        # hit max new tokens
    "canceled",
    "failed",
]
We keep this permissive so the module doesn’t need edits whenever llama.cpp adds fields.[edit | edit source]
class ChatMessage(TypedDict, total=False):
    role: Literal["system", "user", "assistant", "tool"]
    content: str
    name: str
    tool_call_id: str
    # For tool calling, the server may include more fields:
    tool_calls: Any

class ToolFunctionDef(TypedDict, total=False):
    name: str
    description: str
    parameters: dict[str, Any]  # JSON schema

class ToolDef(TypedDict, total=False):
    type: Literal["function"]
    function: ToolFunctionDef

class ToolCall(TypedDict, total=False):
    id: str
    type: Literal["function"]
    function: dict[str, Any]  # expects {"name": str, "arguments": str|dict}
Exit tools are “one-way”: they are recorded, not executed, and do not change control flow.[edit | edit source]
class ExitSignal(TypedDict, total=False):
    tool_name: str
    arguments: dict[str, Any]
    # helpful metadata for debugging / correlation
    emitted_at: float
IDs are incrementing integers. job_name is caller-provided.[edit | edit source]
class SubmitOk(TypedDict):
    ok: Literal[True]
    request_id: int

class SubmitErr(TypedDict):
    ok: Literal[False]
    error: Literal["NO_SLOT_AVAILABLE", "WORKER_NOT_READY", "WORKER_FAILED"]

SubmitResult = SubmitOk | SubmitErr

class RequestStatus(TypedDict, total=False):
    request_id: int
    job_name: str
    state: RequestState

    created_at: float
    dispatched_at: NotRequired[float]
    completed_at: NotRequired[float]

    last_progress_at: NotRequired[float]
    output_chars: NotRequired[int]

    # optional nice-to-haves
    tokens_received: NotRequired[int]
    tokens_per_second: NotRequired[float]

    # tool loop info
    tool_iters_remaining: NotRequired[int]

    # exit-tool info
    signals: NotRequired[list[ExitSignal]]

    # error info if terminal failed/canceled
    fail_reason: NotRequired[RequestFailReason]
    fail_detail: NotRequired[str]

class RequestResult(TypedDict, total=False):
    request_id: int
    job_name: str

    state: Literal["completed", "failed", "canceled"]
    finish_reason: RequestFinishReason

    text: str
    signals: NotRequired[list[ExitSignal]]

    # terminal error info (if failed/canceled)
    fail_reason: NotRequired[RequestFailReason]
    fail_detail: NotRequired[str]

Release semantics (locked in):

  • get_result(request_id) returns RequestResult only once; it also releases stored state/output.
  • After release, later lookups return a stable “not found”.
class NotFound(TypedDict):
    ok: Literal[False]
    error: Literal["NOT_FOUND"]

GetResultResponse = RequestResult | NotFound
GetStatusResponse = RequestStatus | NotFound
Minimum: running/ready/failed. Extra fields are optional.[edit | edit source]
class WorkerStatus(TypedDict, total=False):
    state: WorkerState
    slots_total: int
    slots_used: int
    active_request_ids: list[int]

    restart_count: int
    last_error: NotRequired[str]
    last_ready_at: NotRequired[float]

class WorkerDebugInfo(TypedDict, total=False):
    # bounded ring buffer content (most recent N lines)
    recent_logs: list[str]
    recent_restart_reasons: list[str]
No per-request overrides.[edit | edit source]
@dataclass(frozen=True, slots=True)
class TimeoutProfile:
    connect_timeout_s: float
    headers_timeout_s: float

    # time-to-first-token is usually disabled / huge in your environment:
    ttft_timeout_s: float | None

    # prefill-safe: based on liveness probes before any bytes arrive
    prefill_liveness_timeout_s: float | None

    # once streaming starts: max allowed time with no bytes
    idle_stream_timeout_s: float | None

    absolute_timeout_s: float | None

    liveness_probe_interval_s: float

    # restart control
    restart_backoff_s: float
    restart_window_s: float
    max_restarts_per_window: int
BIOS generation is its own component. It should be easy to unit test.[edit | edit source]
@dataclass(frozen=True, slots=True)
class BiosContext:
    now: datetime
    timezone_name: str

    worker_name: str

    tool_iters_remaining: int
    normal_tools: Sequence[ToolDef]
    exit_tools: Sequence[ToolDef]

    # optional: stable version tag for formatting evolution
    bios_version: str = "bios-v1"

@runtime_checkable
class BiosProvider(Protocol):
    def __call__(self, ctx: BiosContext) -> str: ...

A separate method/component assembles the message list:

def build_message_stack(
    *,
    bios_text: str,
    caller_system_prompt: str,
    conversation: Sequence[ChatMessage],
) -> list[ChatMessage]:
    """Pure function: returns full message list in required order."""
    ...
Normal tools use a plugin that can be swapped for lightweight vs heavyweight implementations.[edit | edit source]
@runtime_checkable
class ToolRunner(Protocol):
    async def run_tool(
        self,
        *,
        name: str,
        arguments: dict[str, Any],
        request_id: int,
        job_name: str,
    ) -> Any:
        """
        Return any JSON-serializable result (dict/list/str/number/bool/null).
        Worker will serialize it into a tool message payload.
        """
        ...

=====

GenerationParams = Mapping[str, Any] =====

@dataclass(frozen=True, slots=True)
class WorkerConfig:
    name: str

    host: str
    port: int

    # full command including executable: ["./llama-server", "-m", "...", "--port", "...", ...]
    server_cmd: Sequence[str]

    # env overrides, e.g. {"CUDA_VISIBLE_DEVICES": "0"}
    env: Mapping[str, str]

    slots: int
    timeouts: TimeoutProfile

    # Tools
    normal_tools: Sequence[ToolDef]
    tool_runner: ToolRunner | None

    # Exit tools (one-way)
    exit_tools: Sequence[ToolDef]

    # BIOS
    bios_provider: BiosProvider
    timezone_name: str
Locked-in default:[edit | edit source]
@dataclass(frozen=True, slots=True)
class ReadinessProbe:
    method: Literal["GET"] = "GET"
    path: str = "/v1/models"
    # Used with timeouts.headers_timeout_s and connect_timeout_s

- Exit tools never alter control flow. They’re recorded into signals[] and returned to the orchestrator.[edit | edit source]

  • Normal tool calls run through ToolRunner and continue generation (tool loop budgeted).
  • Progress is “any bytes after headers count as progress,” plus prefill liveness via /proc/<pid> CPU time deltas.
  • get_result() releases everything for that request_id immediately.

If you want the next section after this, I’d suggest Appendix B: Internal module layout (classes/files and which component owns what), so the implementation stays clean and testable from day one.

User: this looks good.[edit | edit source]

this looks good. Write up the full doc as it currently is

Assistant:[edit | edit source]

##[edit | edit source]

llama_worker is a Python 3 module (fully typed with PEP 484 + typing_extensions) that supervises a single llama-server subprocess from llama.cpp and provides an asyncio-native, slot-limited, resilient interface for running chat-style inference requests against it.

It is intended to be used by a higher-level “hivemind” orchestrator that runs multiple workers (different models / GPUs / profiles), routes work between them, and prefers nuke & repave (restart broken workers) over fragile replay.

This module provides:[edit | edit source]

  • Subprocess supervision for llama-server: - start/stop/restart - hard guarantees against orphan processes
  • Async request execution: - submit() returns immediately (with a request id) - caller polls for status/result
  • Admission control via concurrency slots: - no internal queue by default
  • Robust failure handling: - detect dead/stalled/unreachable server - restart the subprocess - fail in-flight requests with explicit reasons
  • Prompting with a worker-owned BIOS system prompt layer: - hivemind/cooperation guidance - runtime metadata (date/time, tool budgets) - stable, testable, separately generated
  • Tools: - normal OpenAI-format tool calling with a pluggable ToolRunner (round-trip) - fallback tool-call parsing method when needed
  • Exit tools (one-way control signals): - provided at init - recorded when emitted by the model - never alter control flow (beyond whatever the model does)
  • Internal streaming: - used for monitoring, progress detection, loop detection - progress = any data flowing after headers
  • Loop mitigation: - rely on max_tokens - plus a simple repeated-line early-kill detector

1. One worker = one llama-server process (one model per process; may span multiple GPUs if configured).[edit | edit source]

  1. Async-first API: explicitly asyncio-native.
  2. Slot-based concurrency: accept up to slots concurrent in-flight requests; otherwise reject immediately.
  3. Nuke & repave reliability: - detect dead/stalled/unreachable server - restart subprocess - fail in-flight requests with explicit reasons (no replay)
  4. Long-prefill-friendly operation: - time-to-first-token can be minutes to tens of minutes - avoid false restarts by using liveness/progress-based timeouts
  5. OpenAI-format tool calling: - ToolRunner is pluggable - tool mechanism must not hardcode assumptions about tool “weight”
  6. BIOS prompt layer: - inject stable platform/hivemind guidance + runtime metadata - generated via separate, testable method/component
  7. Exit tools / control signals: - model emits structured signals upward - worker records them but does not change behavior
  8. Simple early loop kill: - repeated-line detector to stop the worst degenerate loops early
  9. Forward-compatible parameters:
  • accept an open params mapping and pass through unknown keys unchanged
  1. Engineering style:
  • simplicity, clarity, robustness, strong tests > clever tricks
  • avoid wasteful compute in CPU/RAM constrained environments

- In-place model swapping or reconfiguration within a running worker.[edit | edit source]

  • Replay/resume of in-flight requests after restart.
  • Global scheduling across workers (belongs in orchestrator).
  • Complex token analytics or heavyweight monitoring agents.

- The module is asyncio-native.[edit | edit source]

  • Public APIs are async def, intended to run in an asyncio event loop.
  • Thread-safety is not a v1 requirement; call methods from a consistent async context.

###[edit | edit source]

  • LlamaWorker - owns the subprocess and HTTP transport - manages slots and request lifecycle - assembles prompts (BIOS + caller system + conversation) - streams responses internally and accumulates full output - runs tool loop and parses/records exit-tools - supervises health and restarts
  • ToolRunner (plugin) - executes normal round-trip tools - may be lightweight or heavy; worker must not assume
  • Request records - store full output until retrieved - store minimal status and optional nice-to-haves
BIOS generation and prompt/message assembly must be separate methods/components, not inline inside transport or request execution.[edit | edit source]

This is both a rule and a guiding approach:

  • When changing behavior, identify the layer (supervision/transport/prompting/tool loop/parsing/state).
  • Prefer small, pure-ish, unit-testable functions over clever shared logic.
  • Keep policy (timeouts, budgets, prompts) data/config-driven, separate from mechanics.

- llama-server runs in its own process group/session.[edit | edit source]

  • stop() must ensure no orphaned processes: - SIGTERM process group → wait → SIGKILL process group if needed
  • Capture stdout/stderr into a bounded ring buffer for debug info.
  • Port is assigned externally and passed in configuration.

- Worker has slots: int.[edit | edit source]

  • Slots represent “whatever maps best to having multiple queries in flight at once.” - Implementation will treat a slot as permission to dispatch one concurrent HTTP streaming request. - Slots are admission control, not a guarantee of linear throughput.
  • If slots are full, submit() returns immediately with NO_SLOT_AVAILABLE.
  • No internal queue by default.

- Request IDs are incrementing integers per worker lifetime (1, 2, 3…).[edit | edit source]

  • Caller supplies a job_name string per request for correlation/logging.

###[edit | edit source]

  • async start() -> None
  • async stop() -> None
  • (internal) async restart(reason: str) -> None
- async submit(job_name: str, system_prompt: str, user_prompt: str, *, params: Mapping[str, Any] | None = None) -> SubmitResult - returns immediately: - success: request_id - failure: NO_SLOT_AVAILABLE / WORKER_NOT_READY / WORKER_FAILED[edit | edit source]
  • async get_status(request_id: int) -> RequestStatus | NOT_FOUND
  • async get_result(request_id: int) -> RequestResult | NOT_FOUND
  • async cancel(request_id: int) -> bool (best effort)

Result retrieval releases resources (locked in):

  • get_result(request_id) on a terminal request returns the result and releases all stored state/output for that request id.
  • After release, get_status/get_result returns NOT_FOUND (or equivalent stable “released” code).
- async get_worker_status() -> WorkerStatus[edit | edit source]
  • async get_debug_info() -> WorkerDebugInfo

- params is an open mapping passed into the OpenAI-compatible llama-server request payload.[edit | edit source]

  • The worker: - merges required fields it controls (messages/tools/stream flags), - passes unknown keys through unchanged, - does not require module edits when llama.cpp adds new parameters.

###[edit | edit source]

The BIOS prompt includes:

  • stable platform/hivemind guidance (cooperating agent context)
  • current date/time + timezone
  • tool iteration budgets and constraints
  • instructions for normal tools and exit tools

BIOS is regenerated:

  • at request start
  • before each post-tool continuation
1. BIOS system prompt[edit | edit source]
  1. caller’s system prompt
  2. conversation messages (user/assistant/tool)

Implementation may emit multiple system messages or a combined message with delimiters if needed for backend compatibility.

- Tools follow OpenAI function-calling schema.[edit | edit source]

  • Worker responsibilities: - expose tool definitions to model - detect tool calls - execute via ToolRunner - append tool result message(s) - continue generation until completion or tool-iteration budget is exhausted

Fallback tool parsing (acceptable):

  • If native structured tool calls are unreliable, worker may use a BIOS-enforced structured JSON convention and strict parsing.
  • Tool parsing failure is a request-level terminal error (or can be signaled via exit-tools if configured).

Exit tools never terminate output or change control flow beyond what the model does.[edit | edit source]

They exist to convey structured information upward.

  • Exit tools are provided at worker init as OpenAI-format tool definitions.
  • Worker includes them so the model knows its signaling options.
  • If the model emits an exit-tool call, the worker records a structured signal entry.
  • Orchestrator may react (route, cancel, escalate), but worker does not auto-stop/restart based on signals.

Priority behavior (locked in):

  • Normal tool calls are processed normally.
  • Exit tool calls are recorded whenever they appear.
  • Models will be encouraged via BIOS to emit exit-tools near completion, but correctness does not depend on that.

- Worker uses streaming internally for all requests.[edit | edit source]

  • Progress definition (locked in): any response data flowing after headers counts as progress.
  • Track timestamps: - last_stream_byte_at - last_liveness_at (prefill probes) - last_progress_at = max(last_stream_byte_at, last_liveness_at)

Prefill liveness baseline:

  • process alive
  • /proc/<pid> CPU time delta (Linux)

- No per-request timeout overrides.[edit | edit source]

  • Timeouts must tolerate long prefills (minutes to tens of minutes).
  • TimeoutProfile includes: - connect timeout (short) - headers timeout (moderate) - TTFT timeout usually disabled/None - prefill liveness timeout (large/None) - idle stream timeout (no bytes once streaming) - optional absolute timeout (often None) - liveness probe interval - restart backoff + crash-loop limits

Recovery is nuke & repave:

  • restart subprocess on unhealthy/stalled/disconnected
  • fail in-flight requests (no replay)

- Primary: max_tokens (max new tokens) via params (default per worker).[edit | edit source]

  • Secondary: repeated-line detector cancels the request when the same sufficiently-long line repeats N times consecutively.

- Full output is accumulated in memory while running.[edit | edit source]

  • On successful get_result(), all stored output and state for that request is released.

###[edit | edit source]

Minimum is:

  • READY
  • RUNNING
  • FAILED

Optional nice-to-haves:

  • tokens received
  • tokens/sec
  • partial output size
- bounded ring buffer of recent subprocess logs[edit | edit source]
  • recent restart reasons/timestamps

Default readiness/health probe:[edit | edit source]

  • GET /v1/models must return HTTP 200 and parse as JSON → worker is READY.

Prioritize:[edit | edit source]

  • simplicity and correctness
  • clear invariants
  • strong tests over cleverness
- slot accounting and request state transitions[edit | edit source]
  • BIOS prompt generation and message ordering (isolated, pure-ish)
  • normal tool loop (native + fallback parsing)
  • exit tools recording (ensures non-terminating behavior)
  • repeated-line detector behavior
  • timeout bookkeeping with simulated progress/liveness
- llama-server subprocess process-group teardown (no orphans)[edit | edit source]
  • readiness probe behavior
  • long prefill does not false-timeout
  • disconnect/stall triggers restart and fails in-flight requests
  • malformed tool calls and parsing failures

===

from __future__ import annotations ===

from dataclasses import dataclass
from datetime import datetime
from enum import Enum
from typing import Any, Mapping, Sequence

from typing_extensions import (
    Literal,
    NotRequired,
    Protocol,
    TypedDict,
    runtime_checkable,
)

class WorkerState(str, Enum):
    RUNNING = "running"
    READY = "ready"
    FAILED = "failed"
    STOPPED = "stopped"

class RequestState(str, Enum):
    RUNNING = "running"
    TOOL_RUNNING = "tool_running"
    COMPLETED = "completed"
    FAILED = "failed"
    CANCELED = "canceled"

RequestFailReason = Literal[
    "worker_restarted",
    "server_died",
    "connect_failed",
    "headers_timeout",
    "stall_timeout",
    "tool_parse_error",
    "tool_execution_error",
    "repeated_line_loop",
    "canceled",
    "unknown_error",
]

RequestFinishReason = Literal[
    "stop",
    "max_tokens",
    "canceled",
    "failed",
]

class ChatMessage(TypedDict, total=False):
    role: Literal["system", "user", "assistant", "tool"]
    content: str
    name: str
    tool_call_id: str
    tool_calls: Any

class ToolFunctionDef(TypedDict, total=False):
    name: str
    description: str
    parameters: dict[str, Any]

class ToolDef(TypedDict, total=False):
    type: Literal["function"]
    function: ToolFunctionDef

class ToolCall(TypedDict, total=False):
    id: str
    type: Literal["function"]
    function: dict[str, Any]  # expects {"name": str, "arguments": str|dict}

class ExitSignal(TypedDict, total=False):
    tool_name: str
    arguments: dict[str, Any]
    emitted_at: float

class SubmitOk(TypedDict):
    ok: Literal[True]
    request_id: int

class SubmitErr(TypedDict):
    ok: Literal[False]
    error: Literal["NO_SLOT_AVAILABLE", "WORKER_NOT_READY", "WORKER_FAILED"]

SubmitResult = SubmitOk | SubmitErr

class RequestStatus(TypedDict, total=False):
    request_id: int
    job_name: str
    state: RequestState

    created_at: float
    dispatched_at: NotRequired[float]
    completed_at: NotRequired[float]

    last_progress_at: NotRequired[float]
    output_chars: NotRequired[int]

    # optional nice-to-haves
    tokens_received: NotRequired[int]
    tokens_per_second: NotRequired[float]

    tool_iters_remaining: NotRequired[int]
    signals: NotRequired[list[ExitSignal]]

    fail_reason: NotRequired[RequestFailReason]
    fail_detail: NotRequired[str]

class RequestResult(TypedDict, total=False):
    request_id: int
    job_name: str

    state: Literal["completed", "failed", "canceled"]
    finish_reason: RequestFinishReason

    text: str
    signals: NotRequired[list[ExitSignal]]

    fail_reason: NotRequired[RequestFailReason]
    fail_detail: NotRequired[str]

class NotFound(TypedDict):
    ok: Literal[False]
    error: Literal["NOT_FOUND"]

GetResultResponse = RequestResult | NotFound
GetStatusResponse = RequestStatus | NotFound

class WorkerStatus(TypedDict, total=False):
    state: WorkerState
    slots_total: int
    slots_used: int
    active_request_ids: list[int]

    restart_count: int
    last_error: NotRequired[str]
    last_ready_at: NotRequired[float]

class WorkerDebugInfo(TypedDict, total=False):
    recent_logs: list[str]
    recent_restart_reasons: list[str]

@dataclass(frozen=True, slots=True)
class TimeoutProfile:
    connect_timeout_s: float
    headers_timeout_s: float

    ttft_timeout_s: float | None
    prefill_liveness_timeout_s: float | None
    idle_stream_timeout_s: float | None
    absolute_timeout_s: float | None

    liveness_probe_interval_s: float

    restart_backoff_s: float
    restart_window_s: float
    max_restarts_per_window: int

@dataclass(frozen=True, slots=True)
class BiosContext:
    now: datetime
    timezone_name: str
    worker_name: str

    tool_iters_remaining: int
    normal_tools: Sequence[ToolDef]
    exit_tools: Sequence[ToolDef]

    bios_version: str = "bios-v1"

@runtime_checkable
class BiosProvider(Protocol):
    def __call__(self, ctx: BiosContext) -> str: ...

def build_message_stack(
    *,
    bios_text: str,
    caller_system_prompt: str,
    conversation: Sequence[ChatMessage],
) -> list[ChatMessage]:
    """Pure function: returns full message list in required order."""
    raise NotImplementedError

@runtime_checkable
class ToolRunner(Protocol):
    async def run_tool(
        self,
        *,
        name: str,
        arguments: dict[str, Any],
        request_id: int,
        job_name: str,
    ) -> Any: ...

GenerationParams = Mapping[str, Any]

@dataclass(frozen=True, slots=True)
class WorkerConfig:
    name: str

    host: str
    port: int

    server_cmd: Sequence[str]
    env: Mapping[str, str]

    slots: int
    timeouts: TimeoutProfile

    normal_tools: Sequence[ToolDef]
    tool_runner: ToolRunner | None

    exit_tools: Sequence[ToolDef]

    bios_provider: BiosProvider
    timezone_name: str

@dataclass(frozen=True, slots=True)
class ReadinessProbe:
    method: Literal["GET"] = "GET"
    path: str = "/v1/models"

If you want to proceed, the natural next section is Appendix B: internal module layout (classes/files and responsibilities) so the code stays clean and test-driven from day one.

User: Let's move to the next section.[edit | edit source]

Let's move to the next section.

Assistant:[edit | edit source]

This section describes a clean internal structure for implementing the design without accreting cruft. The guiding rule is: keep supervision, transport, prompting, parsing, and state accounting in separate, testable components.[edit | edit source]

A small package (directory) with a few focused modules:[edit | edit source]
  • llama_worker/types.py - All public TypedDict/dataclasses/enums/protocols from Appendix A. - No logic.
  • llama_worker/worker.py - LlamaWorker public API implementation. - Wires together the other components. - Owns the request table + slot semaphore. - No low-level subprocess details, no prompt formatting details.
  • llama_worker/process.py - Subprocess lifecycle: - spawn llama-server (process group/session) - stop/kill group reliably - capture logs into ring buffer - Exposes minimal interface: start(), stop(), pid, is_alive(), recent_logs().
  • llama_worker/transport.py - HTTP client logic to talk to llama-server. - Implements: - readiness probe: GET /v1/models - POST /v1/chat/completions streaming request/response handling - Does not know about slots, tools, BIOS, or orchestration.
  • llama_worker/prompting.py - BIOS prompt generation adapter + message-stack assembly. - Contains build_message_stack() implementation. - May contain helpers for combining multiple system prompts if needed.
  • llama_worker/tooling.py - Normal tool loop machinery: - detect tool call vs final output - invoke ToolRunner - append tool-result messages - decrement tool budget - Also includes fallback tool parsing strategy.
  • llama_worker/exit_signals.py - Parsing and recording of exit-tool calls into signals[]. - Must be explicitly non-terminating (records only).
  • llama_worker/liveness.py - Prefill-safe liveness probes: - /proc/<pid> CPU-time delta - process-alive checks - Returns timestamps or “evidence of life” booleans.
  • llama_worker/timeouts.py - Implements timeout bookkeeping based on: - connect/header timeouts (transport-level) - progress timestamps (last_stream_byte_at, last_liveness_at) - Returns “should_restart” decisions (policy evaluation), but does not restart itself.
  • llama_worker/loopdetect.py - Repeated-line detector. - Pure incremental API: feed text chunks, ask “triggered?”.
  • llama_worker/util.py - Small shared utilities (ring buffer, monotonic time helpers, etc.). - Keep tiny.

This structure ensures each concern can be unit-tested in isolation.

LlamaWorker (in worker.py) should be mostly orchestration glue:[edit | edit source]
  • Holds: - ProcessSupervisor (from process.py) - TransportClient (from transport.py) - BiosProvider and prompt assembly (from prompting.py) - ToolLoopRunner (from tooling.py) - ExitSignalParser (from exit_signals.py) - LivenessProbe (from liveness.py) - TimeoutEvaluator (from timeouts.py) - RepeatedLineDetector per request (from loopdetect.py) - request table: dict[int, RequestRecord] - slot semaphore: asyncio.Semaphore(slots)
  • Owns the only “long-lived background tasks”: - optional supervisor task that watches subprocess death/restart policy - optional periodic readiness check (low frequency, only when needed)

Everything else should be invoked only when work arrives (to avoid wasting CPU).

A request is run by a single asyncio Task created on submit():[edit | edit source]
  1. Admission control
  • Try acquire a slot semaphore immediately.
  • If not available: return NO_SLOT_AVAILABLE.
  1. Assemble prompt
  • Call BIOS generator (bios_provider(ctx)) via prompting.py.
  • Build message stack via build_message_stack().
  1. Dispatch streaming completion
  • Use TransportClient.stream_chat_completions(...).
  • Update last_stream_byte_at on any bytes after headers.
  • Accumulate full output text.
  • Feed chunks into RepeatedLineDetector.
  1. Parse tool calls
  • If the server yields a tool call: - Pass it to ToolLoopRunner: - execute tool via ToolRunner - append tool result message - regenerate BIOS (updated tool budget) - continue generation
  • Exit-tools: - If an exit-tool call is detected at any point: - record into signals[] - continue normally (non-terminating)
  1. Finish
  • Terminal states: completed/failed/canceled.
  • get_result() returns and releases stored request record.
- Slots - Implemented as asyncio.Semaphore(slots). - Always release the semaphore in a finally: block. - Slot count is the primary invariant; tests should ensure no leaks.[edit | edit source]
  • Cancellation - cancel(request_id) cancels the asyncio task for that request. - Transport streaming must be cancellation-friendly (close stream promptly). - The request transitions to CANCELED with fail_reason="canceled".
  • Result retrieval = release - get_result() pops the request record from the table (or marks as released). - Subsequent lookups return NOT_FOUND.
Nuke & repave is implemented at the worker level:[edit | edit source]
  • Restart triggers: - subprocess exits - readiness probe fails repeatedly - timeout evaluator says “stalled” (no progress/liveness for too long)
  • On restart: - fail all in-flight requests with fail_reason="worker_restarted" (no replay) - stop process group, start new subprocess, wait for readiness (GET /v1/models) - transition worker state accordingly

Crash-loop protection lives in timeouts.py (policy) plus worker’s restart gatekeeping (mechanics).

transport.py owns these details:[edit | edit source]
  • Readiness probe: - GET /v1/models → parse JSON → ok/not ok
  • Streaming: - Accepts request payload and yields decoded events: - raw bytes, or structured “delta text”, or tool_call payloads (depending on how you decide to parse) - The worker layer treats any bytes after headers as progress and does not need to know SSE details. - Keep transport tolerant: - allow keepalive lines - handle partial JSON frames - raise clear exceptions for irrecoverable protocol errors

Keep “how to parse llama-server’s stream” in one place so it’s easy to adjust.

prompting.py must remain “pure-ish”:[edit | edit source]
  • BiosProvider(ctx) -> str is injected and unit-testable.
  • build_message_stack(...) -> list[ChatMessage] is a pure function with unit tests.
  • BIOS generation should not touch: - subprocess state - HTTP logic - request table mutation

This makes it safe to evolve the BIOS rules without breaking transport.

When adding something new, put it in the smallest responsible layer:[edit | edit source]
  • New timeout rule → timeouts.py + tests
  • New stream parsing quirk → transport.py + tests
  • New BIOS fields → prompting.py + tests
  • New tool behavior → tooling.py + tests
  • New signal schema → exit_signals.py + tests

Avoid “just add an if in worker.py”. That’s how the module becomes unmaintainable.

- tests/test_prompting.py - BIOS content, ordering, formatting stability[edit | edit source]
  • tests/test_loopdetect.py - repeated-line detector cases (positive + non-trigger)
  • tests/test_timeouts.py - stall decisions using synthetic timestamps - crash-loop/backoff behavior
  • tests/test_tooling.py - tool loop state machine - fallback parsing behavior
  • tests/test_exit_signals.py - signals are recorded and non-terminating
  • tests/test_process.py - process-group teardown (use a dummy child process tree)
  • tests/test_worker_integration.py - stub HTTP server simulating llama-server behaviors: - slow headers - long silence then streaming - stall - disconnect - tool call emission - exit-tool emission - verifies slot behavior, restart behavior, and get_result() release semantics

This provides high confidence without needing heavyweight infrastructure.

If you’re ready, the next natural section is Appendix C: State machines and invariants (explicit invariants for slots, request lifecycle, restart behavior) so the implementation has a tight correctness target.

User: That looks good, let's go to the next section[edit | edit source]

That looks good, let's go to the next section

Assistant:[edit | edit source]

This section makes the “boring but critical” rules explicit so the implementation and tests have a tight correctness target. The module should enforce these invariants rather than relying on convention.[edit | edit source]

####[edit | edit source]
  • STOPPED: subprocess not running; cannot accept requests
  • RUNNING: subprocess started but not yet confirmed ready (or starting/restarting)
  • READY: readiness probe succeeds (GET /v1/models → 200 + JSON)
  • FAILED: subprocess dead or crash-loop lockout; cannot accept requests
- STOPPED → RUNNING (start)[edit | edit source]
  • RUNNING → READY (readiness probe succeeds)
  • RUNNING → FAILED (subprocess exits before ready, or repeated probe failure beyond policy)
  • READY → RUNNING (restart initiated)
  • READY → FAILED (subprocess exits, crash-loop lockout triggers)
  • FAILED → RUNNING (explicit restart/start allowed by policy)
  • RUNNING/READY → STOPPED (stop)
1. No request dispatch unless state is READY.[edit | edit source]
  1. State READY implies: - subprocess is alive and - last readiness probe succeeded within a reasonable recent window (configurable, but “probe success happened” is the minimum).
  2. FAILED implies: - submit() returns WORKER_FAILED immediately (no slot consumption).
  3. STOPPED implies: - submit() returns WORKER_NOT_READY immediately.
####[edit | edit source]
  • RUNNING: request task is executing (includes prefill and token streaming)
  • TOOL_RUNNING: normal tool execution in progress (awaiting ToolRunner)
  • COMPLETED: finished successfully
  • FAILED: finished with error
  • CANCELED: canceled by caller
- COMPLETED, FAILED, CANCELED[edit | edit source]
- (created) → RUNNING[edit | edit source]
  • RUNNING → TOOL_RUNNING (model emitted normal tool call)
  • TOOL_RUNNING → RUNNING (tool result appended; generation continues)
  • RUNNING → COMPLETED
  • RUNNING → FAILED
  • TOOL_RUNNING → FAILED (tool error / parse error)
  • RUNNING/TOOL_RUNNING → CANCELED (cancel)
  • RUNNING/TOOL_RUNNING → FAILED (worker restart / server death / timeout)
1. One request occupies exactly one slot for its entire lifetime from acceptance to terminal state.[edit | edit source]
  1. A request ID is unique within the lifetime of a worker instance.
  2. Output accumulation is monotonic (only grows) until terminal.
  3. Exit tools never cause a transition by themselves: - they may be recorded at any time, but do not change request state.
  4. Tool iteration budget is monotonic decreasing and never negative.
  5. After get_result() succeeds, the request record is released: - subsequent get_status/get_result returns NOT_FOUND.
Slots are the core stability mechanism and must be leak-proof.[edit | edit source]
  1. No slot is consumed on failed submit. - If submit() returns NO_SLOT_AVAILABLE, nothing was allocated.
  2. If submit() succeeds, a slot is consumed immediately and will be released exactly once.
  3. Slot release must occur in a finally: block in the request task, ensuring: - tool errors - transport exceptions - cancellations - restart-triggered failures do not leak slots.
  4. slots_used = number of non-terminal request records currently held (or equivalently, the number of acquired semaphore permits).
When a restart happens, correctness matters more than salvage.[edit | edit source]
  1. No replay of in-flight requests.
  2. On restart initiation, all in-flight requests must transition to FAILED with: - fail_reason="worker_restarted" (or more specific, e.g. "server_died"), - and must free their slots promptly.
  3. Restart is performed as: - stop process group (SIGTERM → SIGKILL) - start new subprocess - wait for readiness probe success - move worker state to READY or FAILED
  4. Crash-loop protection invariant: - If restarts exceed max_restarts_per_window, worker becomes FAILED and will not auto-restart until explicitly restarted (or until policy allows).
Given long prefill, timeouts must be careful.[edit | edit source]
  1. Progress is defined as any bytes received after headers, plus any liveness evidence from prefill probes.
  2. A request updates: - last_stream_byte_at on any received stream data - last_liveness_at on each successful liveness probe - last_progress_at = max(last_stream_byte_at, last_liveness_at)
  3. “Stall” decisions are based on no progress for configured windows, not lack of tokens.
  4. Connect and header timeouts remain short/moderate and are treated as “server unreachable” rather than “model slow”.
Normal tools:[edit | edit source]
  1. Tool calls must be matched to a known tool name from normal_tools, otherwise it’s a tool parse/validation failure (FAILED).
  2. Each tool iteration decrements the remaining tool budget.
  3. ToolRunner output must be JSON-serializable (or worker serializes safely), and appended as a tool message.

Exit tools:

  1. Exit tool calls are recorded into signals[] even if they are malformed (best effort), but malformed signals should not crash the worker.
  2. Exit tools do not decrement normal tool budgets unless you explicitly decide to share budgets (default: separate; but not required).
Worker status:[edit | edit source]
  • READY/RUNNING/FAILED/STOPPED must be accurate with respect to readiness probe and subprocess state.

Request status:

  • state must reflect the real request lifecycle
  • output length must be consistent with accumulated text
  • after release, request is NOT_FOUND
In debug builds or behind a flag, periodically assert:[edit | edit source]
  • 0 <= slots_used <= slots_total
  • semaphore permits + active requests are consistent
  • no request is both terminal and still has an active task
  • tool_iters_remaining never negative

These checks are low-cost and prevent silent corruption.

If you want to continue, the next section would be Appendix D: Concrete defaults (recommended default values for timeouts, tool budgets, loop detector thresholds, restart backoff) to help implementation start with sensible behavior without requiring more decisions later.

User: Looks good, let's move to the next one[edit | edit source]

Looks good, let's move to the next one

Assistant:[edit | edit source]

These defaults are meant to be safe, conservative, and boring—especially given your long-prefill workloads and “nuke & repave” philosophy. They’re starting points, not hard commitments; each worker profile can tune them.[edit | edit source]

Readiness probe[edit | edit source]
  • GET /v1/models
  • Consider READY on: HTTP 200 + JSON parse success.

Startup sequence

  • Start subprocess → probe every 0.5s initially (short burst) until ready, then back off.
  • Startup max wait: 120s before declaring startup failed and moving to FAILED (this is about “server didn’t come up”, not model prefill).

Rationale: server should bind/respond quickly even if model load is heavy; if it truly needs longer, increase this per worker.

A good baseline profile for “slow hardware / big contexts”:[edit | edit source]
  • connect_timeout_s = 3.0
  • headers_timeout_s = 30.0
  • ttft_timeout_s = None (disabled)
  • prefill_liveness_timeout_s = None (disabled) or 3600.0 (1 hour) if you want eventual kill even with liveness
  • idle_stream_timeout_s = 300.0 (5 minutes) (Once streaming, 5 minutes with zero bytes is suspicious; tune higher if needed.)
  • absolute_timeout_s = None (disabled by default)
  • liveness_probe_interval_s = 5.0 (lightweight)
  • Restart controls: - restart_backoff_s = 5.0 (first delay) - restart_window_s = 120.0 (2 minutes) - max_restarts_per_window = 5

Notes:

  • If you disable prefill_liveness_timeout_s, then a “hard hang during prefill” is handled only by detecting process death or external cancellation. That may be acceptable in your environment.
  • If you enable it, keep it large.
- max_tool_iterations = 8[edit | edit source]
  • per-tool timeout: - lightweight tool runners: 5–10s - heavier tool runners: configured in that ToolRunner, not here
  • tool output size: - not capped by worker; ToolRunner is responsible if needed

Fallback parsing:

  • Enabled if native tool_calls aren’t present; strict JSON parsing; failure → tool_parse_error.
- Exit tools list is provided by the orchestrator at worker init.[edit | edit source]
  • Worker records signals; does not change control flow.
  • Suggested “starter” exit tools (names are up to you): - signal_issue(code, severity, summary, meta={}) - request_escalation(reason, summary, meta={}) - request_decision(question, options, default=None, context=None, meta={}) - declare_outcome(code, summary=None, meta={})

(These are recommendations only; worker treats them as opaque schemas.)

Conservative thresholds to minimize false positives:[edit | edit source]
  • Ignore lines with length < 64 characters (after stripping).
  • Ignore empty/whitespace-only lines.
  • Trigger if the same normalized line repeats consecutively: - repeat_threshold = 10
  • Start checking only after some output exists: - min_output_chars_before_check = 512

On trigger:

  • cancel request → FAILED(reason="repeated_line_loop")
  • record: repeated line snippet (truncated) + count
- Default slots = 1 for very slow or memory-tight models.[edit | edit source]
  • Use slots > 1 only when the server and hardware demonstrably handle concurrency without pathological latency.
- Track tokens/sec only if the backend provides usage/token counts without expensive parsing.[edit | edit source]
  • Otherwise, record: - output chars - elapsed time - last_progress_at

If you want to keep going, the next appendix would be Appendix E: Example BIOS prompt template (including a stable formatting/versioning approach and how it references tool budgets and exit-tools) since that’s a high-churn area and benefits from being specified early.

User: On the loop detector, the recent incident of a looping model I had to kill was emitting 39 character lines.[edit | edit source]

On the loop detector, the recent incident of a looping model I had to kill was emitting 39 character lines. 64 char is too long of a default.

Assistant: Good catch.[edit | edit source]

Good catch. We can make the default sensitive enough for ~40-char loops without turning it into a false-positive machine by making the threshold length-dependent.

###[edit | edit source]

  • Split on completed lines (\n).
  • Normalize by: - strip leading/trailing whitespace - collapse internal whitespace runs (optional but recommended)
  • Ignore empty/whitespace-only lines.
Use a minimum length of 32 chars, and vary the repeat threshold:[edit | edit source]
  • If len(line) >= 64: trigger at repeat_threshold = 8
  • If 32 <= len(line) < 64: trigger at repeat_threshold = 12
  • If len(line) < 32: ignore (default)

This catches your 39-char case quickly (12 repeats) but remains conservative.

- min_output_chars_before_check = 256 (down from 512, since shorter lines can loop early)[edit | edit source]
  • Additionally, don’t start counting repeats until you’ve observed at least 2 completed non-empty lines (cheap sanity check).
- Cancel request → FAILED(reason="repeated_line_loop")[edit | edit source]
  • Record repeated line snippet (truncated) + repeat count in status/debug.

If you want it even more conservative, we can add a tiny heuristic like “only count repeats for lines containing at least N alphanumeric characters” (e.g., ≥ 12), but the length-tiered repeat threshold above is usually enough.

User: Also for any kind of failure, whatever output did happen should still be retrievable for logging and debugging purposes.[edit | edit source]

Also for any kind of failure, whatever output did happen should still be retrievable for logging and debugging purposes.

Assistant: Good point — let’s lock that in as an explicit contract and update the doc accordingly.[edit | edit source]

Good point — let’s lock that in as an explicit contract and update the doc accordingly.

###[edit | edit source]

  • For any terminal outcome (COMPLETED, FAILED, CANCELED), get_result(request_id) must return whatever output was accumulated so far as text.
  • This includes failures caused by: - worker restart (worker_restarted) - server death (server_died) - timeouts/stalls - tool parse/execution errors - loop detector cancel - caller cancel
  • If no output was produced (e.g., failed before headers/stream), text is an empty string.
- get_result() remains the one-time completion+cleanup call: - It returns the final result including partial output if any. - It then releases the request record (subsequent lookups return NOT_FOUND).[edit | edit source]

###[edit | edit source]

Add/modify invariant:

  • On restart initiation, in-flight requests transition to FAILED, but their accumulated output is preserved until the caller retrieves it via get_result().
Make text effectively always present in RequestResult (even for failure/cancel). Easiest is to make it required in the TypedDict:[edit | edit source]
class RequestResult(TypedDict):
    request_id: int
    job_name: str
    state: Literal["completed", "failed", "canceled"]
    finish_reason: RequestFinishReason
    text: str
    signals: NotRequired[list[ExitSignal]]
    fail_reason: NotRequired[RequestFailReason]
    fail_detail: NotRequired[str]

And keep RequestStatus.output_chars meaningful even for failed/canceled requests.

- When a request is terminated due to restart/timeout/cancel, the worker should: - stop the stream / cancel the task - set terminal state + fail reason - not discard the output buffer - release the slot - retain the request record until get_result() is called[edit | edit source]

If you want, I can fold this into the full doc text (sections 9.2, 17, Appendix C, Appendix A) as a clean patch.

User: Yes, fold that in and let's move to the next section.[edit | edit source]

Yes, fold that in and let's move to the next section.

Assistant: Done — I’ve folded the “partial output must be retrievable on failures” rule into the design including types + invarian…[edit | edit source]

Done — I’ve folded the “partial output must be retrievable on failures” rule into the design (including types + invariants + retention semantics), updated the loop detector defaults, and then added the next section.

###[edit | edit source]

  • get_result(request_id) is the one-time completion + cleanup call, and it returns a result for any terminal state (COMPLETED, FAILED, CANCELED).
  • For any failure or cancel, get_result() must still return whatever output accumulated so far in text (possibly empty if nothing was produced).
  • After get_result() returns, the request record is released, and later lookups return NOT_FOUND.
RequestResult is now defined with required text, even for failures/cancel:[edit | edit source]
class RequestResult(TypedDict):
    request_id: int
    job_name: str

    state: Literal["completed", "failed", "canceled"]
    finish_reason: RequestFinishReason

    # Always included, even on failure/cancel (may be empty string)
    text: str

    signals: NotRequired[list[ExitSignal]]
    fail_reason: NotRequired[RequestFailReason]
    fail_detail: NotRequired[str]
- On restart initiation, all in-flight requests transition to FAILED(...), release their slots, but retain their accumulated output until get_result() is called.[edit | edit source]
Replace the old D.5 with these defaults:[edit | edit source]
  • Ignore empty/whitespace-only lines.
  • Normalize: strip + optional whitespace collapsing.
  • Length-dependent thresholds: - len >= 64 → trigger at 8 repeats - 32 <= len < 64 → trigger at 12 repeats - len < 32 → ignore
  • Warmup gate: - min_output_chars_before_check = 256 - don’t start counting repeats until you’ve seen at least 2 completed non-empty lines
  • On trigger: cancel request → FAILED(reason="repeated_line_loop"), preserve partial output for retrieval.

This appendix specifies a stable, testable BIOS prompt format. It’s an example template intended for the injected BiosProvider(ctx) -> str. The key requirement is format stability and clear separation from caller prompts.[edit | edit source]

- No secrets: BIOS is model-visible.[edit | edit source]

  • Stable structure: include a version tag.
  • Small and universally applicable: job-specific rules belong in caller system prompt.
  • Machine-parsable cues only where needed (e.g., for fallback tool-call parsing).
  • Explicit tool/exit-tool guidance: - normal tools are for work steps (round-trip) - exit tools are for signaling upward (non-terminating)

Below is a concrete example. Your BiosProvider can render this with real values from BiosContext.[edit | edit source]

[BIOS v=bios-v1]
You are one agent in a cooperative hivemind of models working together. 
Your job is to produce the best possible output for the current request, and to communicate clearly when escalation or managerial decisions are needed.

Time: {NOW_ISO8601}
Timezone: {TIMEZONE_NAME}

Worker: {WORKER_NAME}

Tool budget:
* Normal tool iterations remaining: {TOOL_ITERS_REMAINING}

Tool usage rules:
* If you call a NORMAL tool, the system will execute it and provide the result. Then you should continue.
* Only call tools that are provided in the tool list.
* Avoid unnecessary tool calls; use tools when they materially improve correctness.

Exit tools (control signals):
* Exit tools are for sending structured signals upward to the orchestrator (e.g., low confidence, need external info, need a management decision).
* Exit tools DO NOT automatically stop execution; only use them when you intend the orchestrator to notice and potentially take action.
* Prefer emitting exit tools near the end of your response unless the situation is urgent.

Fallback tool-call formatting (only if needed):
* When calling any tool, ensure arguments are valid JSON objects (no trailing commas).
* If you cannot form valid tool arguments, do not guess; explain the limitation in text or emit an exit signal.

[/BIOS]
- The BIOS prompt intentionally does not restate the caller’s task or persona. That belongs in the caller-provided system prompt.[edit | edit source]
  • If you want even more stability, you can include a short “Capabilities” stanza, but keep it small.

If you find models behave better with a tiny reminder of available tool names (without dumping full schemas), you can add:[edit | edit source]

Normal tools available: {NORMAL_TOOL_NAMES_COMMA_SEPARATED}
Exit tools available: {EXIT_TOOL_NAMES_COMMA_SEPARATED}

This should be derived from the ToolDef list and kept short.

- BIOS includes: - version tag - hivemind statement - time + timezone - tool iterations remaining - exit tools guidance (non-terminating)[edit | edit source]

  • BIOS changes when tool iterations decrement (regeneration before post-tool continuation).
  • BIOS stays independent from caller system prompt content.

If you want to continue, the next section would naturally be Appendix F: Transport/Streaming expectations for llama-server (SSE framing assumptions, what constitutes “tool call vs text delta”, and how the fallback parsing is triggered).

User: let's go to the next section[edit | edit source]

let's go to the next section

Assistant:[edit | edit source]

This appendix defines what the worker’s transport layer (transport.py) should assume about llama-server’s OpenAI-compatible HTTP API, and how it should behave when the server is imperfect or evolving.[edit | edit source]

Readiness probe (locked in):[edit | edit source]
  • GET /v1/models (no trailing slash). GET /v1/models/ may return 404 on some versions, so the worker must probe the exact path without a trailing slash. GitHub[13]

Chat inference (primary endpoint):

  • POST /v1/chat/completions GitHub[14]

Notes:

  • llama-server is described as an OpenAI API compatible HTTP server and documents the chat completion endpoint at /v1/chat/completions. GitHub[15]
  • Other endpoints exist (e.g., embeddings/reranking in certain modes), but are out of scope for this module unless you add explicit support later. GitHub[16]
The worker should construct an OpenAI-style chat completion payload with (at minimum):[edit | edit source]
  • messages: ordered list of {role, content, ...} including: - BIOS system message - caller system message - user / assistant / tool messages from the ongoing request context
  • stream: true (the module always streams internally)
  • tools: a list of OpenAI function tool definitions, including: - normal tools (round-trip) - exit tools (one-way signals)

Forward compatibility requirement: the worker must pass through any caller-provided params fields unchanged unless they collide with fields the worker owns (e.g., messages, tools, stream). This keeps the module from needing edits when llama.cpp adds new knobs.

When stream=true, the transport should expect Server-Sent Events (SSE) framing, where:[edit | edit source]
  • The stream consists of event records separated by blank lines.
  • The primary payload is carried in data: lines.
  • The stream typically terminates with data: [DONE]. GitHub[17]

Transport parsing rules (robust and tolerant):

  1. Treat any bytes received after headers as “progress” (worker-level definition).
  2. Parse SSE incrementally: - tolerate partial lines and partial JSON frames - ignore keepalive/comment/empty lines that carry no data
  3. For each data: payload: - if it is [DONE], finish the stream cleanly - else parse JSON and yield structured events upward (e.g., “text delta”, “tool call”, “usage update”, etc.)
  4. If the stream ends unexpectedly (socket close), treat it as an error unless a terminal condition was already observed.
Some llama-server versions have emitted streaming error records using an SSE field name like error: instead of data:, which can be ignored by strict SSE decoders (including OpenAI client implementations). GitHub[18][edit | edit source]

Transport requirement:

  • Treat both data: and error: as possible carriers of JSON.
  • If an error: record is seen: - parse its JSON (best effort), - surface it as a terminal transport error to the worker, - preserve any partial output already accumulated for later get_result() retrieval.

This keeps the worker resilient across server versions.

llama-server supports OpenAI-style function/tool calling via its chat handling, including:[edit | edit source]
  • native tool-call formats for many model families
  • a generic tool-call handler when a template isn’t recognized
  • optional parallel tool calling via payload "parallel_tool_calls": true (supported but disabled by default). GitHub[19]

Worker behavior requirements (transport-facing):

  • The transport should not “decide” tool semantics; it should simply surface parsed JSON events to the worker/tool loop.
  • Tool calls may appear: - in a final message object, or - in streaming deltas (depending on server/model/template behavior).
  • The tool loop layer (tooling.py) must support: - structured tool_calls when provided, and - the BIOS-driven fallback parsing strategy when they’re not.

Note: the function-calling doc indicates llama-server tool calling is used when started with --jinja and that generic/native handlers exist. GitHub[20] (Your worker config keeps the server command fully configurable, so enabling --jinja or templates is an orchestrator concern.)

The transport layer should distinguish:[edit | edit source]

A) Transport/protocol errors

  • connection refused / connect timeout
  • header timeout
  • malformed SSE frames / unparseable JSON (beyond best-effort tolerance)
  • premature disconnect before completion

These are candidates for worker-level restart decisions (nuke & repave), depending on policy and frequency.

B) Application errors from server

  • returned as JSON error bodies in non-streaming responses, or
  • emitted into the streaming channel as a JSON error (sometimes via the error: field described above). GitHub[21]

These should fail the request (preserving partial output), and may or may not trigger a restart depending on how often they occur (policy-driven).

To preserve separation of concerns, transport.py should expose a small surface, e.g.:[edit | edit source]
  • async probe_ready() -> bool (calls GET /v1/models)
  • async stream_chat(payload: dict[str, Any]) -> AsyncIterator[TransportEvent]

Where TransportEvent is a small internal union such as:

  • BytesProgress() (optional)
  • TextDelta(str)
  • ToolCallEvent(tool_call_payload)
  • UsageEvent(usage_payload)
  • ServerErrorEvent(error_payload)
  • StreamDone()

The worker should treat any received bytes as progress; everything else is higher-level semantics.

If you want to continue, the next section I’d propose is Appendix G: Tool-call loop algorithm (step-by-step pseudo-code for the tool loop + fallback parsing + how exit-tools are recorded alongside), so implementation and tests line up exactly.

User: What appendices are left which would be useful?[edit | edit source]

What appendices are left which would be useful?

Assistant: Useful appendices still left in roughly the order I’d write them:[edit | edit source]

Useful appendices still left (in roughly the order I’d write them):

Step-by-step pseudo-code for:[edit | edit source]
  • detecting native tool_calls
  • invoking ToolRunner
  • appending tool results
  • regenerating BIOS (budget decrement)
  • continuing generation
  • fallback parsing path
  • how exit-tools are recorded alongside without affecting control flow

Why useful: this is the trickiest “logic core” and easiest place for subtle bugs.

Not required for correctness (since they’re provided at init), but useful to standardize:[edit | edit source]
  • suggested minimal fields (code/severity/summary/meta)
  • decision-request structure (options list)
  • how orchestrator should interpret/route signals
  • example tool definitions

Why useful: keeps model behavior consistent across many workers/models.

Explicit algorithm for:[edit | edit source]
  • when to restart (process death vs probe failures vs stall)
  • backoff schedule
  • “lockout” criteria
  • how to fail in-flight requests while preserving output
  • what transitions worker state takes

Why useful: prevents “accidental thrash” and makes tests straightforward.

Concrete method for /proc/<pid>/stat CPU time delta:[edit | edit source]
  • what fields to read
  • how to compute deltas using monotonic time
  • probe interval and overhead
  • what counts as “evidence of life”

Why useful: your long-prefill workloads depend on getting this right.

Define:[edit | edit source]
  • ring buffer size default
  • what gets logged (start/stop/restart reasons, request lifecycle events)
  • correlation keys (worker_name, job_name, request_id)
  • minimal structured event list

Why useful: you’ll debug fleets; this saves time.

A concrete test matrix and the stub “fake llama-server” behaviors:[edit | edit source]
  • slow headers
  • long silence then stream
  • mid-stream disconnect
  • tool call emission
  • exit-tool emission
  • server crash / restart

Why useful: ensures the “robustness first” goal is actually enforced.

A few example worker configs:[edit | edit source]
  • small fast model worker
  • big slow model worker
  • tool-enabled worker
  • multi-slot worker

Why useful: helps the orchestrator author configure workers consistently.

If you want to keep it lean, I’d do G, I, J, L as the “must-haves”, and treat the rest as nice-to-have documentation.

User: H and K are outside of scope for the module beyond normal log system breadcrumbs.[edit | edit source]

H and K are outside of scope for the module beyond normal log system breadcrumbs. Storage of prompts and outputs will be handled by the caller.

M might be a nice to have later, but needs information which will require experimentation to derive.

Continue to the next section.

Assistant:[edit | edit source]

This appendix defines the exact logic the worker uses to run normal tool calls (round-trip via ToolRunner), while also recording exit-tools (one-way signals) without altering control flow.[edit | edit source]

The goal is to make the implementation deterministic and testable.

- Normal tools: config.normal_tools (round-trip).[edit | edit source]
  • Exit tools: config.exit_tools (one-way signals).
  • Tool budget: tool_iters_remaining (applies to normal tools only).
  • Conversation: list of ChatMessage built via build_message_stack(): - BIOS system message - caller system message - ongoing messages (user/assistant/tool)

Key rule (locked in): exit tools never change worker behavior; they are recorded only. (If the model chooses to stop output when emitting an exit-tool, that is the model’s behavior.)

transport.stream_chat(payload) yields events. The worker updates progress timestamps on any bytes after headers regardless of event type.[edit | edit source]

The tool loop is written in terms of these semantic events:

  • TextDelta(text_fragment: str) — content tokens/deltas
  • AssistantMessageFinal(message_obj: dict) — final assembled assistant message (may include tool_calls)
  • StreamDone() — end of stream
  • ServerError(err_obj: dict | str) — error surfaced from stream
  • (optional) UsageUpdate(usage_obj: dict) — tokens, timing if available

The exact parsing is transport’s job; the tool loop consumes these events.

Each request task runs this loop until it reaches a terminal state:[edit | edit source]
  1. Build initial conversation (BIOS + caller system + user prompt).
  2. Dispatch a streaming request.
  3. Accumulate text output and update repeated-line detector.
  4. If a normal tool call is emitted: - execute it (via ToolRunner), - append tool result message, - decrement tool budget, - regenerate BIOS, - continue generation (next iteration).
  5. If an exit tool is emitted: - record it as a signal, - do not execute it, - do not decrement tool budget, - do not change control flow.
Tool calls may appear:[edit | edit source]
  • in structured fields (preferred), or
  • in assistant content in fallback mode.

The worker must support both.

When an assistant message contains tool_calls, parse them into a list of tool call objects, then partition:[edit | edit source]
  • normal_calls = [c for c in tool_calls if c.name in normal_tool_names]
  • exit_calls = [c for c in tool_calls if c.name in exit_tool_names]
  • unknown_calls = everything else

Rules:

  • If unknown_calls is non-empty → treat as FAILED(reason="tool_parse_error") (preserve partial output).
  • Record each exit_call into signals[] immediately (best effort parsing).
  • Proceed with normal_calls via the normal tool loop.

Important detail (prevents “unanswered tool calls”):

  • When continuing generation after normal tool execution, the worker should append an assistant message containing only the normal tool calls, not the exit tool calls.
  • Exit tool calls are treated as out-of-band signals and are not part of the conversation that requires tool responses.

This keeps exit-tools “no round trip” while avoiding confusing the model with unresponded calls.

If no structured tool_calls are present but the model is expected to call tools, the worker uses a BIOS-enforced convention, for example:[edit | edit source]
  • A tool call is represented as a single JSON object (or JSON line) that includes: - {"tool": "<name>", "arguments": { ... }} - optionally with a stable prefix/suffix marker if you choose to enforce one.

Fallback parsing rules:

  • Attempt parsing only when the accumulated assistant output contains a clearly delimited tool-call candidate.
  • If parsed tool name is in normal tools: - treat it as a normal tool call - strip the tool-call directive from the user-visible output (so the final result isn’t polluted)
  • If tool name is in exit tools: - record signal - optionally strip directive from output (recommended)
  • If parsing fails or tool name unknown: - FAILED(reason="tool_parse_error") (preserve output as-is for debugging)

This fallback logic should be isolated in tooling.py and unit-tested heavily.

For each normal tool call, in order:[edit | edit source]

Preconditions

  • tool_iters_remaining > 0, else fail request with FAILED(reason="tool_execution_error", detail="tool budget exhausted") (preserve output).

Execution

  1. Parse tool call name + arguments: - arguments must be JSON object; if not → tool_parse_error.
  2. Invoke ToolRunner: - await tool_runner.run_tool(name=..., arguments=..., request_id=..., job_name=...) - Enforce per-tool timeout (via asyncio.wait_for).
  3. Serialize tool result into a tool message: - {"role": "tool", "tool_call_id": <id>, "content": <json-serialized result>}
  4. Update conversation: - Append assistant tool-call message (containing the normal tool call(s) only). - Append tool result message(s).
  5. Decrement budget: - tool_iters_remaining -= 1 (or decrement by number of executed tool calls if you allow multiple per iteration).
  6. Regenerate BIOS and rebuild message stack for continuation: - new BIOS includes updated tool_iters_remaining.

Failure handling

  • Tool runner timeout → FAILED(reason="tool_execution_error")
  • Tool runner raises exception → FAILED(reason="tool_execution_error")
  • In all cases: - preserve accumulated output for retrieval via get_result() - release slot promptly
To keep v1 simple and robust:[edit | edit source]
  • The worker may support multiple normal tool calls emitted together by executing them sequentially in the order given.
  • Each executed tool call decrements the normal tool iteration budget by 1 (simple, predictable).
  • If the model emits N tool calls but budget has fewer than N remaining: - execute up to the remaining budget - then fail with tool_execution_error (“budget exhausted”) while preserving output and any recorded signals.

This behavior is deterministic and easy to test.

When an exit tool call is detected (structured or fallback):[edit | edit source]
  • Record ExitSignal: - tool_name - arguments (best effort JSON) - emitted_at (monotonic timestamp)
  • Do not execute it.
  • Do not decrement normal tool budget.
  • Do not append it to the conversation used for continuation (if continuation occurs due to normal tools).

If the model emits only exit tool calls and then stops:

  • request completes (likely with empty/partial text), and signals are returned upward.
- The request accumulates output text whenever TextDelta events arrive.[edit | edit source]
  • On any terminal state (COMPLETED, FAILED, CANCELED), get_result() returns: - text: the full accumulated output so far (possibly empty) - signals: any recorded exit signals - failure details if applicable

Even tool failures and restarts must preserve accumulated output until get_result() is called.

=====

async def run_request(req: RequestRecord) -> None: =====
    try:
        # Build initial conversation with BIOS + caller system + user
        req.conversation = build_initial_conversation(req)

        while True:
            payload = build_chat_payload(req.conversation, req.params, tools=req.all_tools)
            stream = transport.stream_chat(payload)

            # Per-stream scratch
            assistant_text = ""
            structured_tool_calls = None

            async for event in stream:
                req.last_stream_byte_at = now_monotonic()  # any bytes after headers count as progress

                if event.type == "TextDelta":
                    assistant_text += event.text
                    req.output += event.text
                    loop_detector.feed(event.text)
                    if loop_detector.triggered():
                        req.fail("repeated_line_loop")
                        return

                elif event.type == "AssistantMessageFinal":
                    structured_tool_calls = extract_tool_calls(event.message_obj)
                    # Some servers finalize here; continue to StreamDone

                elif event.type == "ServerError":
                    req.fail("unknown_error", detail=str(event.error))
                    return

                elif event.type == "StreamDone":
                    break

            # Tool detection: structured first, then fallback
            normal_calls, exit_calls, err = partition_structured_calls(structured_tool_calls)
            if err:
                req.fail("tool_parse_error", detail=err)
                return

            record_exit_calls(req, exit_calls)

            if normal_calls:
                if req.tool_iters_remaining <= 0:
                    req.fail("tool_execution_error", detail="tool budget exhausted")
                    return

                # Append assistant tool-call message (NORMAL ONLY), execute tools, append tool results
                ok, err = await execute_normal_tools(req, normal_calls)
                if not ok:
                    req.fail("tool_execution_error", detail=err)
                    return

                req.tool_iters_remaining -= len(normal_calls)
                req.conversation = rebuild_conversation_with_new_bios(req)
                continue  # loop back to generate more

            # No structured normal calls; try fallback parsing on the recent output
            fb = try_fallback_parse_tool_call(assistant_text)
            if fb:
                if fb.name in exit_tool_names:
                    record_exit_signal(req, fb)
                    # optional: strip directive from output
                    req.output = strip_fallback_directive(req.output, fb)
                    # no continuation mandated; allow model to finish naturally
                    # (if you want continuation behavior, keep it explicit and tested)
                elif fb.name in normal_tool_names:
                    # strip directive from output, run normal tool loop, then continue
                    req.output = strip_fallback_directive(req.output, fb)
                    ok, err = await execute_normal_tools(req, [fb])
                    if not ok:
                        req.fail("tool_execution_error", detail=err)
                        return
                    req.tool_iters_remaining -= 1
                    req.conversation = rebuild_conversation_with_new_bios(req)
                    continue
                else:
                    req.fail("tool_parse_error", detail="unknown fallback tool")
                    return

            # Otherwise: no tools detected -> completion
            req.complete(finish_reason=deduce_finish_reason(req.params, req.output))
            return

    except asyncio.CancelledError:
        req.cancel()
        raise
    except Exception as e:
        req.fail("unknown_error", detail=str(e))
    finally:
        release_slot(req)

If you want to proceed, the next appendices that are most useful for implementation and testing are:

  • Appendix I: Restart policy and crash-loop behavior (explicit algorithm)
  • Appendix J: Liveness probing details (/proc/<pid>/stat CPU time delta)
  • Appendix L: Test plan and stub-server fixtures (test matrix)

User: The tool budget needs to be iterations, not individual calls.[edit | edit source]

The tool budget needs to be iterations, not individual calls. A model should have the option to call several tools at once.

Add that and let's go through the remaining sections.

Assistant: Absolutely — tool budget will be iterations, not “number of tool calls”.[edit | edit source]

Absolutely — tool budget will be iterations, not “number of tool calls”. A single iteration may contain multiple tool calls, and that entire batch consumes exactly 1 iteration.

Below is the updated Appendix G plus the remaining appendices (I, J, L) so you’ve got the full remaining set.

###[edit | edit source]

- Tool budget = iterations.[edit | edit source]
  • An iteration is: one assistant tool-call emission event (i.e., one model “turn” where it asks for tools).
  • In that one iteration, the model may request multiple normal tools at once.
  • The worker executes all normal tool calls in that iteration (sequentially by default), then appends tool results, regenerates BIOS, and continues generation.
  • After successfully processing the iteration, decrement: - tool_iters_remaining -= 1 (always one, regardless of number of calls)
To keep v1 simple and robust:[edit | edit source]
  • If the assistant emits N normal tool calls in one message: - execute them sequentially in the listed order - append N tool result messages (and an assistant tool-call message containing those normal calls) - decrement iteration budget by 1 (not N)
  • If any tool call in the batch fails (timeout/exception/parse error): - fail the request with tool_execution_error - preserve partial output for retrieval

Budget exhaustion rule (iteration-based):

  • If tool_iters_remaining == 0 and the assistant emits any normal tool calls: - fail immediately with tool_execution_error (“tool iteration budget exhausted”) - preserve output
Replace:[edit | edit source]
  • req.tool_iters_remaining -= len(normal_calls) with:
  • req.tool_iters_remaining -= 1

And treat the “batch of tool calls” as a single iteration.

This appendix specifies when and how the worker restarts llama-server, and how it avoids thrashing.[edit | edit source]

A restart may be initiated when any of the following occurs:[edit | edit source]
  1. Process death
  • llama-server subprocess exits unexpectedly.
  1. Readiness probe failure (when expecting READY)
  • While in READY, repeated failures of GET /v1/models beyond policy thresholds.
  1. Stall timeout
  • A request has no progress (no stream bytes and no liveness evidence) beyond stall_timeout/policy thresholds.
  • “Progress” is defined elsewhere: any bytes after headers, or /proc liveness during prefill.
Restart is “nuke & repave”:[edit | edit source]
  1. Mark worker state = RUNNING (restarting)
  2. Fail all in-flight requests
  • Transition each in-flight request → FAILED
  • Set fail_reason="worker_restarted" (or "server_died" if it actually died)
  • Do not discard partial output; it must remain retrievable via get_result()
  • Ensure each request releases its slot promptly (task cancel + cleanup)
  1. Stop the subprocess process group
  • SIGTERM → wait (short) → SIGKILL group if needed
  1. Start subprocess
  • spawn with configured server_cmd and env (port externally assigned)
  1. Wait for readiness
  • Poll GET /v1/models until success or startup deadline
  • On success: worker state = READY
  • On failure: worker state = FAILED (or remain RUNNING briefly if retrying per backoff policy)
To prevent thrash:[edit | edit source]
  • Track restart timestamps in a rolling window.
  • If restarts exceed max_restarts_per_window inside restart_window_s: - worker state becomes FAILED - submit() returns WORKER_FAILED immediately (no slot consumption) - orchestrator can decide whether/when to call start() again
When a restart is triggered repeatedly:[edit | edit source]
  • Apply restart_backoff_s between restart attempts (can be constant or modest exponential, but keep it simple and testable).
  • Backoff should never block stop() from completing promptly.
- submit() while restarting: - should return WORKER_NOT_READY (or equivalent) without consuming a slot.[edit | edit source]
  • In-flight requests failed due to restart remain retrievable via get_result() until fetched (and then released).

This appendix defines a concrete, lightweight Linux liveness probe suitable for very long prefill.[edit | edit source]

A probe yields “evidence of life” if:[edit | edit source]
  • the subprocess PID exists and is alive, and
  • the process CPU time has increased since the last probe

CPU time increase indicates the process is doing work even if no tokens/bytes have been emitted yet.

Linux /proc/<pid>/stat contains:[edit | edit source]
  • utime (user mode CPU ticks)
  • stime (kernel mode CPU ticks)

A simple probe can parse the file and extract utime and stime as integers, then compute:

  • cpu_ticks = utime + stime

Implementation notes

  • Field parsing: /proc/<pid>/stat has a tricky second field (comm) which may contain spaces; parsing must account for the closing ) before splitting remaining fields.
  • Don’t over-probe: interval default around 5s is fine.
Maintain per-worker probe state:[edit | edit source]
  • last observed cpu_ticks (int)
  • last probe time (monotonic timestamp)

On each probe:

  1. If PID no longer exists: - report “no liveness” (and process death likely triggers restart separately)
  2. Read current cpu_ticks
  3. If cpu_ticks > last_cpu_ticks: - update last_cpu_ticks - set last_liveness_at = now - return “alive with progress”
  4. Else: - return “alive but no cpu progress”
- During prefill (no stream bytes yet): - update last_liveness_at on positive CPU tick deltas - stall decision uses last_progress_at = max(last_liveness_at, last_stream_byte_at)[edit | edit source]
- Reading one small /proc file every few seconds is very cheap.[edit | edit source]
  • If /proc read fails transiently: - treat as “no liveness update this time”, not an immediate failure - let stall policy decide over a longer window

This appendix defines what to test to enforce the “robustness first” goal.[edit | edit source]

Slots & lifecycle[edit | edit source]
  • submit() returns NO_SLOT_AVAILABLE when slots exhausted (no queue)
  • slot release happens on: - completion - tool errors - cancellation - restart-induced failure
  • no slot leaks across many iterations

BIOS separation

  • BIOS is generated by a distinct method (mock provider)
  • message ordering is BIOS → caller system → conversation
  • BIOS updates when tool iteration budget changes

Tool loop (iteration budget)

  • A single assistant tool-call message containing multiple tool calls consumes 1 iteration
  • Budget decrements once per iteration (not per call)
  • Budget exhausted: - any normal tool call triggers tool_execution_error
  • Exit tools: - recorded at any time - do not decrement tool budget - do not alter state transitions

Fallback parsing

  • valid JSON tool call parses and executes
  • malformed JSON → tool_parse_error
  • unknown tool → tool_parse_error
  • exit-tool fallback parse records signal and continues

Loop detector

  • triggers on 39-char repeated lines using new length-tier thresholds
  • does not trigger on common benign repetition patterns (a few repeats)

Result release semantics

  • get_result() returns terminal result and releases record
  • subsequent calls return NOT_FOUND
  • ensure partial output is returned on failure/cancel
Implement a small local asyncio HTTP server that mimics the parts you need:[edit | edit source]

Endpoints

  • GET /v1/models: - returns 200 OK when “ready” - can be configured to fail intermittently
  • POST /v1/chat/completions with stream=true: - emits SSE-like frames: - text deltas - optional tool calls - [DONE] - can simulate stalls/disconnects/errors

Scenarios

  1. Slow headers
  • delay headers beyond headers_timeout_s to validate failure path and restart logic
  1. Long prefill / no tokens
  • hold stream silent but keep subprocess “alive” (unit tests cover /proc; integration can just simulate no bytes + no done)
  • confirm worker doesn’t false-timeout if prefill liveness is enabled in that profile (or confirm behavior matches your profile)
  1. Mid-stream stall
  • send some bytes then pause beyond idle_stream_timeout_s → triggers stall → restart
  1. Disconnect mid-stream
  • abruptly close socket; ensure request fails and partial output returned
  1. Normal tool iteration (multi-call batch)
  • emit one assistant message with 2–3 normal tool calls at once
  • validate worker executes all, appends tool results, decrements iteration budget by 1, continues
  1. Exit tool emission
  • emit exit tool call; ensure signal recorded and generation continues normally
  1. Crash-loop protection
  • repeatedly make /v1/models fail or simulate subprocess death quickly
  • confirm worker transitions to FAILED after max restarts per window
Use a dummy subprocess that spawns a child process and verify:[edit | edit source]
  • stop() kills the entire process group
  • no child remains after stop/restart

If you want to keep going: the only “remaining” section that’s still useful (given your scope cuts) would be a short Appendix N: Implementation checklist (a punch list for coding order + minimal test harness order).

User: As a rule for tool design: they should only be considered failures which kill the request when there is an actual probl…[edit | edit source]

As a rule for tool design: they should only be considered failures which kill the request when there is an actual problem. A search tool for example should not classify nothing found as a failure.

Let's look at the next section

Assistant: Locked in.[edit | edit source]

Locked in. I’ve added this as a tooling rule (applies to both normal tools and any future heavier tools) and then here’s the next section.

###[edit | edit source]

  • The worker must treat a tool call as failed (request-killing) only when there is an actual execution problem, such as: - tool runner timeout - tool runner raises an exception - tool runner returns a non-serializable result (implementation bug) - malformed tool arguments (parse/validation failure)
  • “No results found” (e.g., search returns empty) is not a failure. It must be represented as a normal successful tool result (e.g., [], { "results": [] }, or { "found": false }) and the request continues.
- The worker does not interpret tool output semantics (empty vs non-empty). It only enforces transport/format/execution correctness.[edit | edit source]
  • The ToolRunner is responsible for returning a meaningful “no results” value rather than raising.
- Appendix G (Tool loop): failure conditions for tools are strictly execution/format problems, not domain outcomes.[edit | edit source]
  • Appendix A (ToolRunner Protocol): “no results” must be represented in the returned JSON-serializable value, not as an exception.

A practical coding order that keeps the system simple, test-driven, and avoids early integration pain.[edit | edit source]

1. Create package layout (types.py, worker.py, process.py, transport.py, prompting.py, tooling.py, exit_signals.py, liveness.py, timeouts.py, loopdetect.py).[edit | edit source]

  1. Copy Appendix A types/protocols into types.py.
  2. Add minimal internal dataclasses not exposed publicly: - RequestRecord (id, job_name, task, state, output buffer, signals, timestamps, tool_iters_remaining) - RestartTracker (timestamps)

1. Implement build_message_stack() in prompting.py as a pure function.[edit | edit source]

  1. Implement the repeated-line detector in loopdetect.py with the updated length-tier thresholds.
  2. Implement a small ring buffer utility for subprocess logs.

Tests to write now

  • test_prompting_ordering.py
  • test_loopdetect.py

1. Implement ProcessSupervisor: - start in new session/process group - stdout/stderr capture to ring buffer - stop with SIGTERM→SIGKILL process group[edit | edit source]

  1. Write the “no orphans” test using a dummy process tree.

Tests

  • test_process_group_teardown.py

1. Implement readiness probe: GET /v1/models.[edit | edit source]

  1. Implement streaming client for POST /v1/chat/completions:
  • SSE incremental parser (tolerant)
  • yield events (text deltas, final message object, done, error)
  • expose “any bytes received” progress hook to worker

Tests

  • test_transport_probe.py
  • test_transport_stream_parser.py (using a stub SSE server)

1. Implement /proc/<pid>/stat CPU tick probe in liveness.py.[edit | edit source]

  1. Implement TimeoutEvaluator in timeouts.py:
  • evaluates “should restart?” based on timestamps + profile + crash-loop counters
  • no side effects

Tests

  • test_liveness_proc_stat.py
  • test_timeouts_stall_and_crashloop.py

1. Implement exit_signals.py to detect/record exit-tool calls (structured + fallback).[edit | edit source]

  1. Implement tooling.py:
  • iteration-based tool budget (one assistant tool-call emission == one iteration)
  • execute batch of normal tools sequentially
  • strict failure only on execution/format issues (not “no results”)
  • fallback parsing hooks

Tests

  • test_exit_signals_non_terminating.py
  • test_tooling_iteration_budget.py
  • test_tooling_no_results_is_success.py
  • test_tooling_fallback_parse.py

1. Implement LlamaWorker in worker.py:[edit | edit source]

  • state machine (STOPPED/RUNNING/READY/FAILED)
  • slot semaphore admission
  • request table and lifecycle
  • per-request task execution: - build BIOS via provider - build message stack - stream via transport - loop detect - normal tool loop + exit signal recording
  • restart handling: - fail in-flight (preserve partial output) - restart subprocess - re-probe readiness
  1. Implement get_result() semantics:
  • always returns accumulated output on terminal states (including failures/cancel)
  • releases record after returning once

Tests

  • test_worker_slots.py
  • test_worker_get_result_releases.py
  • test_worker_partial_output_on_failure.py
  • test_worker_restart_fails_inflight_preserve_output.py

1. Create a stub server fixture that can simulate:[edit | edit source]

  • /v1/models ready/not-ready
  • streaming text deltas
  • tool call emission (batch)
  • exit-tool emission
  • mid-stream disconnect
  • stalls
  1. Run full-worker integration tests against the stub.

Tests

  • test_worker_integration_streaming.py
  • test_worker_integration_tools_and_signals.py
  • test_worker_integration_restart_paths.py

If you want, next we can convert Appendix N into a concrete “first milestone” target (minimum subset to get one worker running with streaming + slotting + get_result semantics, before tools).

User: Redo this as a per-file / submodule plan.[edit | edit source]

Redo this as a per-file / submodule plan. Constructing each submodule and its test harnesses in turn.

Assistant: Below is the same Appendix N plan, redone as a per-file / submodule build plan, where each submodule is implemented wit…[edit | edit source]

Below is the same Appendix N plan, redone as a per-file / submodule build plan, where each submodule is implemented with its own focused tests before moving on. The intent is “finish one layer, lock it down with tests, then proceed”.

###[edit | edit source]

Purpose

  • Central home for all public types, enums, protocols, TypedDicts, and dataclasses (Appendix A).

Contents

  • WorkerState, RequestState
  • RequestFailReason, RequestFinishReason
  • ChatMessage, ToolDef, ToolCall, ExitSignal
  • SubmitResult, RequestStatus, RequestResult (with required text)
  • WorkerStatus, WorkerDebugInfo
  • TimeoutProfile, BiosContext, WorkerConfig, ReadinessProbe
  • BiosProvider Protocol, ToolRunner Protocol
  • GenerationParams = Mapping[str, Any]

Tests

  • Minimal “type smoke” test only (optional): import succeeds. - tests/test_types_import.py
Purpose[edit | edit source]
  • Tiny shared utilities only; no business logic.

Contents

  • RingBuffer[str] (bounded append, snapshot)
  • now_monotonic() helper
  • utc_now() or now_tz(timezone_name) (if you want centralized time for BIOS contexts)

Tests

  • tests/test_ring_buffer.py - append past capacity, ordering, snapshot behavior
Purpose[edit | edit source]
  • Pure-ish prompt construction helpers. Must be easy to modify and unit test.

Contents

  • build_message_stack(bios_text, caller_system_prompt, conversation) -> list[ChatMessage] - enforces ordering: BIOS system → caller system → rest
  • Optional helper: - make_bios_message(bios_text) -> ChatMessage - make_system_message(text) -> ChatMessage

Tests

  • tests/test_prompting.py - ordering invariant - correct roles - no mutation of input list
Purpose[edit | edit source]
  • Repeated-line detector (cheap, incremental).

Contents

  • RepeatedLineDetector with: - feed(text_chunk: str) -> None - triggered: bool - reason/detail (line snippet, count)
  • Defaults (locked): - ignore <32 chars - 32–63: 12 repeats - 64+: 8 repeats - start after 256 output chars and 2 non-empty lines

Tests

  • tests/test_loopdetect.py - triggers on ~39-char repeated line - doesn’t trigger on small repeats - whitespace normalization behavior
Purpose[edit | edit source]
  • Own subprocess lifecycle and guarantee no orphans.

Contents

  • ProcessSupervisor (or similar) with: - async start() - async stop() - pid, is_alive() - log capture: recent_logs()
  • Must start process in new process group/session
  • Stop sequence: SIGTERM → wait → SIGKILL process group
  • Capture stdout/stderr to ring buffer (from util)

Tests

  • tests/test_process_teardown.py - spawn a dummy parent that spawns a child; ensure stop kills both
  • tests/test_process_logs.py - ensure stdout/stderr captured boundedly
Purpose[edit | edit source]
  • HTTP client and SSE streaming parser for llama-server.

Contents

  • TransportClient with: - async probe_ready() -> bool using GET /v1/models - async stream_chat(payload: dict[str, Any]) -> AsyncIterator[TransportEvent]
  • SSE parsing tolerant: - parse data: lines, ignore keepalives - handle [DONE] - also accept error: payloads if present
  • Provide a small internal TransportEvent union (private to transport module).

Tests

  • tests/fixtures/fake_llama_server.py (shared fixture starts here) - minimal aiohttp/asyncio server with: - /v1/models - /v1/chat/completions streaming SSE
  • tests/test_transport_probe.py - ready vs not ready responses
  • tests/test_transport_stream_parser.py - yields text deltas - handles [DONE] - handles partial frames and keepalive lines - handles error: events
Purpose[edit | edit source]
  • Prefill-safe liveness evidence via /proc/<pid>/stat.

Contents

  • ProcCpuLivenessProbe: - probe(pid: int) -> bool or returns (alive: bool, progressed: bool) - handles parsing (comm) properly - caches last cpu_ticks to detect delta
  • Helper to read cpu_ticks robustly.

Tests

  • tests/test_liveness_proc_stat.py - integration-style: spawn a CPU-busy subprocess and verify ticks advance - verify parsing works even when comm contains spaces (simulate via parsing test string)
Purpose[edit | edit source]
  • Pure policy evaluation: decide if worker should restart or request should fail due to timeouts/stall.

Contents

  • RestartTracker: - record restart timestamps - locked_out(now) -> bool
  • TimeoutEvaluator: - should_restart_worker(worker_state, probe_ok, restart_tracker, ...) - request_stalled(last_progress_at, now, profile) -> bool - no side effects

Tests

  • tests/test_timeouts.py - stall decisions based on synthetic timestamps - crash-loop lockout logic - backoff behavior is deterministic
Purpose[edit | edit source]
  • Record exit-tool calls without affecting control flow.

Contents

  • ExitToolRegistry (constructed from config.exit_tools) - knows tool names and (optionally) basic arg parsing expectations
  • record_exit_signals(...): - takes structured tool call objects (or fallback parse results) - appends ExitSignal to request record - best-effort parsing; malformed signals shouldn’t crash worker

Tests

  • tests/test_exit_signals.py - records signals - ignores unknown tool names - malformed args handled gracefully - non-terminating: no state changes (this is enforced at worker level, but you can test helper purity)
Purpose[edit | edit source]
  • Normal tool loop (round-trip) + fallback parsing logic.
  • Budget is iterations, not individual calls.
  • Tool “no results” is success (only execution/format problems kill request).

Contents

  • ToolLoopRunner with: - async handle_tool_iteration(...) -> ToolIterationResult - input: batch of normal tool calls from one assistant message - executes sequentially by default - appends assistant tool-call message (NORMAL ONLY) + tool result messages - decrements tool_iters_remaining by 1 on success
  • ToolCallExtractor: - extract structured tool calls if present - fallback parser if not present (BIOS-defined JSON convention)
  • Failure classification rules: - failures: timeout/exception/parse/unknown tool/non-serializable result - NOT failures: empty results from search etc.

Tests

  • tests/test_tooling_iteration_budget.py - batch of multiple calls consumes 1 iteration
  • tests/test_tooling_failure_classification.py - empty results are success - timeout/exception is failure
  • tests/test_tooling_fallback.py - parses valid fallback JSON - unknown tool → parse error - malformed JSON → parse error
Purpose[edit | edit source]
  • The orchestration glue that composes all components.

Contents

  • LlamaWorker implementing: - start/stop/restart - submit, get_status, get_result, cancel - internal request task creation - slot semaphore admission control - readiness gating (GET /v1/models via transport) - tool loop integration - exit signal recording integration - repeated-line detector integration - restart handling with nuke & repave and crash-loop lockout
  • Ensure invariants: - no dispatch unless READY - slots never leak - partial output always preserved and returned in get_result() even on failure/cancel/restart - get_result() releases request record

Tests

  • Unit-ish worker tests (with fake transport/process injected or monkeypatched): - tests/test_worker_slots.py - tests/test_worker_get_result_releases.py - tests/test_worker_partial_output_on_failure.py - tests/test_worker_exit_signals_recorded.py - tests/test_worker_tool_iterations.py
Purpose[edit | edit source]
  • Shared stub server to run integration tests without needing real llama-server.

Capabilities

  • Configure per-test behaviors: - /v1/models ready/not ready/flaky - streaming response: - text deltas - tool calls (structured) - exit tool calls (structured) - delays/stalls/disconnects - error: events

Integration tests

  • tests/test_integration_streaming.py
  • tests/test_integration_tools.py
  • tests/test_integration_restart.py
  • tests/test_integration_no_orphans.py (process module may already cover this; choose one place)

1. types.py[edit | edit source]

  1. util.py
  2. prompting.py
  3. loopdetect.py
  4. process.py
  5. transport.py (+ fake server fixture)
  6. liveness.py
  7. timeouts.py
  8. exit_signals.py
  9. tooling.py
  10. worker.py
  11. Full integration tests

This order ensures each layer has a test harness before it becomes a dependency for the next layer.

If you want, I can take the first actionable chunk (types/util/prompting/loopdetect) and turn it into a concrete file skeleton + corresponding pytest files, ready to start implementing.

User: Take everything we have and turn it into a design document which will be part of the repo and fed into future code gene…[edit | edit source]

Take everything we have and turn it into a design document which will be part of the repo and fed into future code generation.

Assistant: llamaworker Design Document[edit | edit source]

llama_worker Design Document[edit | edit source]

Status: Draft (repo design doc; intended to be fed into future code generation) Language: Python 3, fully typed (PEP 484) with typing_extensions (Protocol, TypedDict, etc.) Backend: llama-server from llama.cpp (OpenAI-compatible HTTP server)

0. Overview[edit | edit source]

llama_worker is an asyncio-native supervisor for a single llama-server subprocess (one model per process). It is designed to be used by a higher-level “hivemind” orchestrator that runs multiple workers (different models / GPUs / profiles), routes jobs between them, and prefers nuke & repave (restart broken workers) over fragile replay.

The worker provides:

  • Subprocess supervision (start/stop/restart) with no-orphan guarantees (process group teardown).
  • Async request execution: submit() returns immediately with a request id; caller polls status/result.
  • Slot-based concurrency (no queue by default).
  • Robust failure handling and restart policy (nuke & repave).
  • BIOS prompt injection (hivemind + runtime data), generated as a distinct method/component.
  • OpenAI-format normal tool calling (round-trip via a pluggable ToolRunner) with a fallback parsing option.
  • Exit tools (one-way “control signals”) provided at init and recorded (never executed, never alter control flow).
  • Internal streaming for monitoring and progress detection.
  • Loop mitigation: max_tokens plus a conservative repeated-line detector.
  • Partial output is always retrievable via get_result() even for failures/cancels/restarts.

This module optimizes for simplicity, clarity, robustness, and testability. It must also avoid wasting compute in CPU/RAM constrained environments (lightweight polling/probes, minimal background work).

1. Goals[edit | edit source]

  1. One worker = one llama-server process (one model per process; may span multiple GPUs if configured).
  2. Async-first API (explicitly asyncio-native).
  3. Slot-based concurrency: accept up to slots concurrent in-flight requests; otherwise reject immediately.
  4. Nuke & repave reliability: - detect dead/stalled/unreachable server, - restart subprocess, - fail in-flight requests with explicit reasons, - no replay/resume.
  5. Long-prefill-friendly: - time-to-first-token can be minutes to tens of minutes, - stall detection is based on progress/liveness, not TTFT.
  6. Tools: - OpenAI function/tool calling for normal tools (round-trip), - ToolRunner is pluggable; the mechanism must not assume “lightweight tools.”
  7. BIOS prompt: - universal hivemind context + runtime metadata, - generated via a distinct, testable method/component.
  8. Exit tools (signals): - provided at init, - recorded as structured signals, - never change worker control flow.
  9. Loop mitigation: - rely on max_tokens as baseline, - plus repeated-line early kill.
  10. Forward-compatible params:
  • accept a pass-through mapping of generation params without modifying the module for new server features.
  1. Engineering style:
  • simple state machines, clear invariants, strong tests,
  • avoid compute waste (no aggressive polling, no expensive per-token work).

2. Non-goals[edit | edit source]

  • In-place model swapping or reconfiguration of a running worker.
  • Replay/resume of in-flight requests after restart.
  • Global scheduling across workers (belongs in orchestrator).
  • Heavy output post-processing or complex token analytics.
  • Persistent storage of prompts/outputs (handled by caller/orchestrator).

3. Terminology[edit | edit source]

  • Worker: one running llama-server subprocess plus its management logic.
  • Slot: admission-control unit representing one in-flight request.
  • Iteration (tools): one assistant “tool call turn” (a single assistant emission that may include multiple tool calls).
  • Normal tools: round-trip tools executed via ToolRunner.
  • Exit tools: one-way control-signal “tools” recorded only.
  • Progress: any bytes received after HTTP headers, plus liveness evidence during prefill.

4. Async model (explicit)[edit | edit source]

  • The module is asyncio-native.
  • Public APIs are async def and intended to be called from an asyncio event loop.
  • Thread-safety is not a v1 requirement; keep calls within a consistent async context.

5. Subprocess supervision requirements[edit | edit source]

  • Worker launches llama-server in its own process group/session.
  • stop() must ensure no orphaned processes: - SIGTERM process group → short wait → SIGKILL process group if needed.
  • Capture stdout/stderr into a bounded ring buffer for debug breadcrumbs.
  • Port assignment is external: worker config includes host/port; worker does not auto-assign ports.

6. Concurrency model: slots[edit | edit source]

  • Worker has slots: int.
  • A slot is permission to have one request “in flight” (best mapping: concurrent streaming HTTP request).
  • If slots are full, submit() returns immediately with NO_SLOT_AVAILABLE.
  • No internal queue by default.

7. Request identity[edit | edit source]

  • Request IDs are incrementing integers per worker lifetime (1, 2, 3, …).
  • Caller provides a job_name string per request for correlation.

8. Public interface[edit | edit source]

Lifecycle[edit | edit source]
  • async start() -> None
  • async stop() -> None
  • (internal) async restart(reason: str) -> None
Requests[edit | edit source]

async submit(job_name: str, system_prompt: str, user_prompt: str, , params: Mapping[str, Any] | None = None) -> SubmitResult

  • async get_status(request_id: int) -> RequestStatus | NOT_FOUND
  • async get_result(request_id: int) -> RequestResult | NOT_FOUND
  • async cancel(request_id: int) -> bool (best-effort)
Resource release semantics (locked)[edit | edit source]
  • get_result(request_id) is the one-time completion call: - returns terminal result (completed/failed/canceled), - returns partial output for any failure/cancel (possibly empty), - releases all stored state/output for that request id.
  • After a successful get_result(), later lookups return NOT_FOUND.

9. Generation params: forward-compatible pass-through[edit | edit source]

  • params is an open mapping passed through to the OpenAI-compatible request payload.
  • Worker overwrites only fields it owns (e.g., messages, tools, stream).
  • Unknown keys are preserved so adding new features does not require modifying the worker module.

10. Prompt assembly[edit | edit source]

BIOS prompt (worker-owned)[edit | edit source]

The BIOS prompt is a universal system-level prompt layer that includes:

  • hivemind/cooperating-agent context,
  • current date/time and timezone,
  • tool iteration budget remaining,
  • instructions for normal tools and exit tools,
  • fallback tool-call formatting rules (if fallback parsing is enabled).

BIOS generation must be a distinct method/component and unit-testable.

Ordering (invariant)[edit | edit source]
  1. BIOS system prompt
  2. caller system prompt
  3. conversation messages (user/assistant/tool)

11. Transport contract (llama-server)[edit | edit source]

  • Readiness probe: GET /v1/models → HTTP 200 + JSON parse success means READY.
  • Chat inference: POST /v1/chat/completions with stream=true.
  • Streaming parser should be tolerant of SSE framing and partial JSON.
  • Progress definition (locked): any bytes after headers count as progress.

Transport is a distinct module; the worker consumes semantic events (text delta, final message, error, done).

12. Timeouts, progress, and liveness (per worker profile only)[edit | edit source]

  • No per-request timeout overrides. Workers are configured based on job type/hardware.
  • Stall detection must tolerate multi-minute / tens-of-minutes prefill: - use last_progress_at, not time-to-first-token.
Liveness evidence (prefill-safe)[edit | edit source]
  • During prefill (no stream bytes yet), worker uses lightweight probes: - process alive, - /proc/<pid>/stat CPU tick deltas (Linux baseline).
Restart policy[edit | edit source]
  • Nuke & repave: - restart subprocess on death/unreachable/stall per policy, - fail in-flight requests with explicit reasons, - preserve partial output for retrieval.

13. Tools and exit tools[edit | edit source]

Normal tools (round-trip)[edit | edit source]
  • OpenAI function/tool calling schema.
  • Executed via pluggable ToolRunner.
  • Budget is iterations, not calls: - one assistant tool-emission that contains any normal tool calls consumes exactly 1 iteration, even if it requests multiple tools at once.
  • Fallback parsing is allowed when structured tool_calls are absent/unreliable (BIOS-guided strict JSON).
Tool failure classification (locked)[edit | edit source]

Tools should only kill the request when there is an actual execution/format problem, such as:

  • tool runner timeout,
  • tool runner exception,
  • malformed tool arguments (parse/validation),
  • non-serializable result (implementation error),
  • unknown tool name.

Domain outcomes are not failures:

  • e.g. “search found nothing” must be a successful tool result (empty list / {results: []} / {found: false}), and the request continues.
Exit tools (one-way signals; non-terminating)[edit | edit source]
  • Exit tools are provided at worker init as tool definitions.
  • Worker exposes them to the model and records any exit-tool calls as structured signals[].
  • Exit tools are never executed (no ToolRunner) and never alter control flow.
  • Priority: process normal tool loop as usual; record exit signals whenever they occur.
  • Models will be encouraged (via BIOS) to emit exit tools near completion, but correctness does not depend on it.

14. Loop mitigation[edit | edit source]

  • Baseline: pass max_tokens via params (default set per worker profile/orchestrator).
  • Repeated-line detector: - detects degenerate loops where the model repeats the same line. - On trigger: cancel request → FAILED(reason="repeated_line_loop"). - Partial output remains retrievable via get_result().

15. Output retention[edit | edit source]

  • The worker accumulates full output in memory while running.
  • For any terminal state (completed/failed/canceled), get_result() returns whatever output was accumulated.
  • Output and state are released only after get_result() succeeds.

16. Observability (minimum)[edit | edit source]

Minimum worker status surface:

  • READY
  • RUNNING
  • FAILED
  • (internally STOPPED is useful)

Nice-to-haves (optional):

  • output length,
  • tokens received,
  • tokens/sec.

Logging is breadcrumbs only; persistent storage of prompts/outputs is handled by the caller.

17. Engineering rule: separation of concerns (explicit)[edit | edit source]

Specific rule[edit | edit source]

Prompt generation (especially BIOS) and message-stack construction must be implemented as distinct methods/components, not embedded inline inside request execution or transport code.

Guide for approaching problems[edit | edit source]

When adding features or fixing issues, place changes in the smallest responsible layer:

  • process supervision,
  • transport,
  • prompting,
  • tool loop,
  • exit signal parsing,
  • liveness probes,
  • timeout policy evaluation,
  • loop detection,
  • state accounting.

Prefer small, testable functions over clever shared logic; keep policy data-driven.

Appendix A: Types & Protocols (public contract)[edit | edit source]

from __future__ import annotations

from dataclasses import dataclass
from datetime import datetime
from enum import Enum
from typing import Any, Mapping, Sequence

from typing_extensions import (
    Literal,
    NotRequired,
    Protocol,
    TypedDict,
    runtime_checkable,
)

class WorkerState(str, Enum):
    RUNNING = "running"
    READY = "ready"
    FAILED = "failed"
    STOPPED = "stopped"

class RequestState(str, Enum):
    RUNNING = "running"
    TOOL_RUNNING = "tool_running"
    COMPLETED = "completed"
    FAILED = "failed"
    CANCELED = "canceled"

RequestFailReason = Literal[
    "worker_restarted",
    "server_died",
    "connect_failed",
    "headers_timeout",
    "stall_timeout",
    "tool_parse_error",
    "tool_execution_error",
    "repeated_line_loop",
    "canceled",
    "unknown_error",
]

RequestFinishReason = Literal[
    "stop",
    "max_tokens",
    "canceled",
    "failed",
]

class ChatMessage(TypedDict, total=False):
    role: Literal["system", "user", "assistant", "tool"]
    content: str
    name: str
    tool_call_id: str
    # Backends may include extra fields; keep permissive.
    tool_calls: Any

class ToolFunctionDef(TypedDict, total=False):
    name: str
    description: str
    parameters: dict[str, Any]

class ToolDef(TypedDict, total=False):
    type: Literal["function"]
    function: ToolFunctionDef

class ToolCall(TypedDict, total=False):
    id: str
    type: Literal["function"]
    function: dict[str, Any]  # expects {"name": str, "arguments": str|dict}

class ExitSignal(TypedDict, total=False):
    tool_name: str
    arguments: dict[str, Any]
    emitted_at: float

class SubmitOk(TypedDict):
    ok: Literal[True]
    request_id: int

class SubmitErr(TypedDict):
    ok: Literal[False]
    error: Literal["NO_SLOT_AVAILABLE", "WORKER_NOT_READY", "WORKER_FAILED"]

SubmitResult = SubmitOk | SubmitErr

class RequestStatus(TypedDict, total=False):
    request_id: int
    job_name: str
    state: RequestState

    created_at: float
    dispatched_at: NotRequired[float]
    completed_at: NotRequired[float]

    last_progress_at: NotRequired[float]
    output_chars: NotRequired[int]

    # optional nice-to-haves
    tokens_received: NotRequired[int]
    tokens_per_second: NotRequired[float]

    # normal tool iteration budget remaining
    tool_iters_remaining: NotRequired[int]

    # exit-tool signals recorded
    signals: NotRequired[list[ExitSignal]]

    # terminal details
    fail_reason: NotRequired[RequestFailReason]
    fail_detail: NotRequired[str]

class RequestResult(TypedDict):
    request_id: int
    job_name: str

    state: Literal["completed", "failed", "canceled"]
    finish_reason: RequestFinishReason

    # Always returned (even on failure/cancel). May be "".
    text: str

    signals: NotRequired[list[ExitSignal]]
    fail_reason: NotRequired[RequestFailReason]
    fail_detail: NotRequired[str]

class NotFound(TypedDict):
    ok: Literal[False]
    error: Literal["NOT_FOUND"]

GetResultResponse = RequestResult | NotFound
GetStatusResponse = RequestStatus | NotFound

class WorkerStatus(TypedDict, total=False):
    state: WorkerState
    slots_total: int
    slots_used: int
    active_request_ids: list[int]

    restart_count: int
    last_error: NotRequired[str]
    last_ready_at: NotRequired[float]

class WorkerDebugInfo(TypedDict, total=False):
    recent_logs: list[str]
    recent_restart_reasons: list[str]

@dataclass(frozen=True, slots=True)
class TimeoutProfile:
    connect_timeout_s: float
    headers_timeout_s: float

    ttft_timeout_s: float | None
    prefill_liveness_timeout_s: float | None
    idle_stream_timeout_s: float | None
    absolute_timeout_s: float | None

    liveness_probe_interval_s: float

    restart_backoff_s: float
    restart_window_s: float
    max_restarts_per_window: int

@dataclass(frozen=True, slots=True)
class BiosContext:
    now: datetime
    timezone_name: str
    worker_name: str

    tool_iters_remaining: int
    normal_tools: Sequence[ToolDef]
    exit_tools: Sequence[ToolDef]

    bios_version: str = "bios-v1"

@runtime_checkable
class BiosProvider(Protocol):
    def __call__(self, ctx: BiosContext) -> str: ...

def build_message_stack(
    *,
    bios_text: str,
    caller_system_prompt: str,
    conversation: Sequence[ChatMessage],
) -> list[ChatMessage]:
    """Pure function: returns full message list in required order."""
    raise NotImplementedError

@runtime_checkable
class ToolRunner(Protocol):
    async def run_tool(
        self,
        *,
        name: str,
        arguments: dict[str, Any],
        request_id: int,
        job_name: str,
    ) -> Any:
        """
        Return any JSON-serializable result. "No results" is not an error;
        represent it as an empty structure ([], {"results": []}, etc.).
        """
        ...

GenerationParams = Mapping[str, Any]

@dataclass(frozen=True, slots=True)
class WorkerConfig:
    name: str

    host: str
    port: int

    # full command including executable
    server_cmd: Sequence[str]
    env: Mapping[str, str]

    slots: int
    timeouts: TimeoutProfile

    # Normal tools (round-trip)
    normal_tools: Sequence[ToolDef]
    tool_runner: ToolRunner | None

    # Exit tools (one-way)
    exit_tools: Sequence[ToolDef]

    # BIOS
    bios_provider: BiosProvider
    timezone_name: str

@dataclass(frozen=True, slots=True)
class ReadinessProbe:
    method: Literal["GET"] = "GET"
    path: str = "/v1/models"

Appendix B: Internal module layout and responsibilities[edit | edit source]

Suggested package layout (each module testable in isolation):

  • llama_worker/types.py — public types/protocols only.
  • llama_worker/util.py — tiny utilities (ring buffer, monotonic time).
  • llama_worker/prompting.py — BIOS + message assembly helpers (pure-ish).
  • llama_worker/loopdetect.py — repeated-line detector.
  • llama_worker/process.py — subprocess start/stop (process group), log capture.
  • llama_worker/transport.py — readiness probe + streaming SSE parsing.
  • llama_worker/liveness.py — /proc/<pid>/stat CPU tick liveness probes.
  • llama_worker/timeouts.py — timeout policy evaluation (pure).
  • llama_worker/exit_signals.py — exit-tool parsing/recording (non-terminating).
  • llama_worker/tooling.py — normal tool loop (iteration budget) + fallback parsing.
  • llama_worker/worker.py — orchestration glue; owns request table, slot semaphore, restart mechanics.

Guiding rule: avoid “just add an if in worker.py” when the change belongs in a narrower layer.

Appendix C: State machines and invariants[edit | edit source]

C.1 Worker state machine[edit | edit source]

States:

  • STOPPED, RUNNING, READY, FAILED

Transitions:

  • STOPPED → RUNNING → READY
  • READY → RUNNING (restart) → READY or FAILED
  • Any → STOPPED (stop)
  • RUNNING/READY → FAILED on crash-loop lockout

Invariants:

  1. No request dispatch unless state is READY.
  2. FAILED → submit() returns WORKER_FAILED (no slot use).
  3. STOPPED/RUNNING (not ready) → submit() returns WORKER_NOT_READY (no slot use).

C.2 Request state machine[edit | edit source]

States:

  • RUNNING, TOOL_RUNNING, COMPLETED, FAILED, CANCELED

Invariants:

  1. One request occupies exactly one slot from acceptance until terminal.
  2. Output accumulation is monotonic until terminal.
  3. Exit tools do not cause state transitions.
  4. After get_result() succeeds, request record is released; subsequent lookups return NOT_FOUND.
  5. On any terminal outcome, get_result() returns partial output accumulated so far (may be empty).

C.3 Slot invariants[edit | edit source]

  1. Failed submit consumes no slot.
  2. Successful submit consumes a slot immediately.
  3. Slot release happens exactly once in a finally: path (no leaks).

C.4 Restart invariants (nuke & repave)[edit | edit source]

  1. No replay of in-flight requests.
  2. On restart initiation: - in-flight requests fail (with reason), - slots released promptly, - partial output preserved until get_result() is called.

Appendix D: Concrete defaults (recommended starting values)[edit | edit source]

These are starting points, tuned for long prefill and conservative behavior.

D.1 Readiness[edit | edit source]

  • Probe: GET /v1/models
  • Startup probe interval: 0.5s initially (short burst), then backoff
  • Startup max wait: 120s (server bind/readiness, not model TTFT)

D.2 TimeoutProfile baseline (slow hardware / big contexts)[edit | edit source]

  • connect_timeout_s = 3.0
  • headers_timeout_s = 30.0
  • ttft_timeout_s = None
  • prefill_liveness_timeout_s = None (disabled) or very large (e.g., 3600s)
  • idle_stream_timeout_s = 300.0
  • absolute_timeout_s = None
  • liveness_probe_interval_s = 5.0
  • restart controls: - restart_backoff_s = 5.0 - restart_window_s = 120.0 - max_restarts_per_window = 5

D.3 Normal tools[edit | edit source]

  • max_tool_iterations = 8 (iterations, not calls)
  • per-tool timeout: set by worker/tooling policy (e.g., 10s default), but heavy tools can be handled by ToolRunner configs.

D.4 Repeated-line detector (updated for ~39-char loops)[edit | edit source]

  • Ignore empty/whitespace-only lines.
  • Normalize by stripping and (optionally) collapsing internal whitespace.
  • Thresholds: - len(line) >= 64 → trigger at 8 consecutive repeats - 32 <= len(line) < 64 → trigger at 12 consecutive repeats - len(line) < 32 → ignore
  • Warmup: - start checking after min_output_chars_before_check = 256 - require at least 2 completed non-empty lines observed
  • On trigger: - cancel request → FAILED(reason="repeated_line_loop") - preserve output for retrieval

Appendix E: Example BIOS prompt template[edit | edit source]

Example BIOS text produced by BiosProvider(ctx):

[BIOS v=bios-v1]
You are one agent in a cooperative hivemind of models working together.

Time: {NOW_ISO8601}
Timezone: {TIMEZONE_NAME}
Worker: {WORKER_NAME}

Normal tool budget:
* iterations remaining: {TOOL_ITERS_REMAINING}

Tool rules:
* NORMAL tools are executed by the system and results will be returned.
* Exit tools are control signals to the orchestrator; they do not automatically change execution.

Exit tools guidance:
* Use exit tools to report: low confidence, need external info, need a management decision, etc.
* Prefer emitting exit tools near completion unless urgent.

Fallback tool-call formatting (only if needed):
* Tool arguments must be valid JSON objects.
[/BIOS]

BIOS must be unit-testable and regenerated before each post-tool continuation (budget changes).

Appendix F: Transport and streaming expectations[edit | edit source]

  • Readiness probe: GET /v1/models → 200 + JSON => READY.
  • Streaming chat: POST /v1/chat/completions with stream=true.
  • Parser expectations: - tolerate SSE framing and keepalives, - tolerate partial JSON chunks, - treat data: [DONE] as end-of-stream if present, - treat unexpected disconnect as an error unless already terminal.

Progress rule (locked): any bytes received after headers count as progress.

Appendix G: Tool-call loop algorithm (iteration budget)[edit | edit source]

G.1 Iteration semantics (locked)[edit | edit source]

  • Tool budget is iterations, not calls.
  • One assistant tool-emission “turn” containing any normal tool calls consumes 1 iteration, even if it includes multiple tool calls.

G.2 Execution rules[edit | edit source]

When a tool call emission is detected:

  • Partition calls into: - normal tool calls (round-trip), - exit tool calls (record only), - unknown tool calls (error).
  • Record exit calls into signals[] immediately (best effort).
  • If unknown calls exist → fail request with tool_parse_error (preserve output).
Normal tool iteration processing[edit | edit source]

If at least one normal tool call exists:

  • If tool_iters_remaining == 0 → fail with tool_execution_error (“iteration budget exhausted”).
  • Otherwise: - execute all normal tool calls sequentially (order given), - append assistant tool-call message (normal calls only), - append tool result messages, - decrement tool_iters_remaining -= 1, - regenerate BIOS and continue generation.
Failure classification (locked)[edit | edit source]

A tool iteration fails (kills the request) only for actual problems:

  • parse/validation failure,
  • ToolRunner timeout/exception,
  • non-serializable result,
  • unknown tool name.

“No results found” is not failure; ToolRunner returns an empty or “not found” result.

Exit tool behavior (locked)[edit | edit source]
  • Exit tools are recorded and never executed.
  • Exit tools do not decrement normal tool iteration budget.
  • Exit tools do not alter control flow.

Appendix I: Restart policy and crash-loop behavior[edit | edit source]

I.1 Restart triggers[edit | edit source]

  • subprocess death,
  • repeated readiness probe failure while expecting READY,
  • stall timeout (no progress/liveness beyond thresholds).

I.2 Restart sequence (nuke & repave)[edit | edit source]

  1. Mark worker state RUNNING (restarting).
  2. Fail all in-flight requests: - set terminal FAILED reason, - release slots promptly, - preserve partial output until get_result().
  3. Stop process group (SIGTERM → wait → SIGKILL).
  4. Start subprocess.
  5. Wait for readiness probe success: - READY on success, - FAILED on repeated startup failure/lockout.

I.3 Crash-loop protection[edit | edit source]

  • Track restart timestamps within a rolling window.
  • If restarts exceed max_restarts_per_window within restart_window_s: - worker becomes FAILED (lockout), - submit() returns WORKER_FAILED without consuming slots.

Appendix J: Liveness probing (/proc/<pid>/stat CPU tick delta)[edit | edit source]

  • Evidence of life is: process is alive and CPU ticks increased since last probe.
  • Read /proc/<pid>/stat, parse safely around (comm) field, extract utime and stime.
  • cpu_ticks = utime + stime; delta indicates activity.
  • Probe interval default ~5 seconds.
  • Liveness updates last_liveness_at; stall decisions use last_progress_at.

Transient /proc read failures should not immediately kill the worker; they simply don’t advance liveness for that probe.

Appendix L: Test plan and stub fixtures[edit | edit source]

L.1 Unit test matrix[edit | edit source]

  • Prompting ordering and BIOS regeneration triggers.
  • Repeated-line detector triggers on ~39-char lines under the new default thresholds.
  • Slot accounting: no leaks under completion/failure/cancel/restart.
  • Tool loop: - multiple tool calls in one iteration consume 1 budget, - exit tools recorded and non-terminating, - “no results” tool outputs are success, - fallback parsing success/failure paths.
  • Result semantics: - get_result() returns partial output on failure/cancel/restart, - get_result() releases records; subsequent lookup is NOT_FOUND.

L.2 Integration tests with stub “fake llama-server”[edit | edit source]

Stub server supports:

  • /v1/models ready/not-ready/flaky.
  • /v1/chat/completions streaming: - text deltas, - structured tool calls (batch), - exit tool calls, - stalls, disconnects, errors.

Scenarios:

  • slow headers → headers timeout paths,
  • long silence with liveness updates (where applicable),
  • mid-stream stall → idle stream timeout,
  • disconnect mid-stream → request fails with partial output,
  • tool iteration with multiple calls,
  • exit tool emission recorded,
  • crash-loop lockout behavior.

L.3 No-orphans test[edit | edit source]

Spawn a dummy process tree; assert stop/restart kills the entire process group.

Appendix N: Per-file / submodule implementation plan (build + tests in turn)[edit | edit source]

Implement each submodule with its own harness before proceeding:

  1. types.py — public contract.
  2. util.py — ring buffer, time helpers.
  3. prompting.py — build_message_stack() (pure-ish) + tests.
  4. loopdetect.py — repeated-line detector + tests.
  5. process.py — process group start/stop + log capture + no-orphans tests.
  6. transport.py — readiness probe + SSE streaming parser + stub server fixture + tests.
  7. liveness.py — /proc/<pid>/stat CPU ticks + tests.
  8. timeouts.py — pure timeout/crash-loop evaluation + tests.
  9. exit_signals.py — record exit tool calls + tests.
  10. tooling.py — tool-loop (iteration budget) + fallback parsing + failure classification + tests.
  11. worker.py — orchestration glue + unit-ish tests (mock components).
  12. Full integration tests against the stub server.

End of document[edit | edit source]

This design is intentionally strict about invariants (slots, restart semantics, output retention, tool iterations) to keep implementation straightforward and robust in a multi-worker environment.

References[edit | edit source]

  1. "Python tooling for llama.cpp". ChatGPT. Retrieved 2025-12-16.
  2. "GitHub". github.com. Retrieved 2025-12-16.
  3. "GitHub". github.com. Retrieved 2025-12-16.
  4. "PyPI". pypi.org. Retrieved 2025-12-16.
  5. "PyPI". pypi.org. Retrieved 2025-12-16.
  6. "PyPI". pypi.org. Retrieved 2025-12-16.
  7. "PyPI". pypi.org. Retrieved 2025-12-16.
  8. "PyPI". pypi.org. Retrieved 2025-12-16.
  9. "Llama CPP Python". Llama CPP Python. Retrieved 2025-12-16.
  10. "Llama CPP Python". Llama CPP Python. Retrieved 2025-12-16.
  11. "GitHub". github.com. Retrieved 2025-12-16.
  12. "GitHub". github.com. Retrieved 2025-12-16.
  13. "GitHub". github.com. Retrieved 2025-12-16.
  14. "GitHub". github.com. Retrieved 2025-12-16.
  15. "GitHub". github.com. Retrieved 2025-12-16.
  16. "GitHub". github.com. Retrieved 2025-12-16.
  17. "GitHub". github.com. Retrieved 2025-12-16.
  18. "GitHub". github.com. Retrieved 2025-12-16.
  19. "GitHub". raw.githubusercontent.com. Retrieved 2025-12-16.
  20. "GitHub". raw.githubusercontent.com. Retrieved 2025-12-16.
  21. "GitHub". github.com. Retrieved 2025-12-16.