=== Assistant: Absolutely, tool budget will be iterations, not "number of tool calls". ===

Absolutely: the tool budget will be counted in iterations, not in "number of tool calls". A single iteration may contain multiple tool calls, and that entire batch consumes exactly one iteration. Below is the updated Appendix G plus the remaining appendices (I, J, L), so you have the full remaining set.

==== Appendix G: Tool iteration budget (updated) ====

===== Definition =====

* Tool budget = iterations.
* An iteration is one assistant tool-call emission event, i.e., one model "turn" in which it asks for tools.
* In that one iteration, the model may request multiple normal tools at once.
* The worker executes all normal tool calls in that iteration (sequentially by default), then appends the tool results, regenerates the BIOS, and continues generation.
* After successfully processing the iteration, decrement once:
** tool_iters_remaining -= 1 (always one, regardless of the number of calls)

===== Batch semantics =====

To keep v1 simple and robust:

* If the assistant emits N normal tool calls in one message:
** execute them sequentially in the listed order
** append N tool result messages (and an assistant tool-call message containing those normal calls)
** decrement the iteration budget by 1 (not N)
* If any tool call in the batch fails (timeout/exception/parse error):
** fail the request with tool_execution_error
** preserve partial output for retrieval

Budget exhaustion rule (iteration-based):

* If tool_iters_remaining == 0 and the assistant emits any normal tool calls:
** fail immediately with tool_execution_error ("tool iteration budget exhausted")
** preserve output

===== Code change =====

Replace:

* req.tool_iters_remaining -= len(normal_calls)

with:

* req.tool_iters_remaining -= 1

and treat the batch of tool calls as a single iteration.
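For concreteness, here is a minimal sketch of that rule in Python. ToolLoopState, its fields, and the execute callback are illustrative stand-ins for the worker's real request record and tool dispatcher, not a prescribed API.

<syntaxhighlight lang="python">
from dataclasses import dataclass, field
from typing import Any, Callable

@dataclass
class ToolLoopState:
    # Illustrative stand-in for the worker's per-request record.
    tool_iters_remaining: int
    messages: list = field(default_factory=list)
    fail_reason: str | None = None

def run_iteration(state: ToolLoopState,
                  normal_calls: list[dict],
                  execute: Callable[[dict], Any]) -> None:
    """Process one assistant tool-call message as a single iteration."""
    if state.tool_iters_remaining == 0:
        # Budget exhausted: any normal tool call fails the request.
        state.fail_reason = "tool_execution_error"
        return
    for call in normal_calls:              # sequential, in the listed order
        try:
            result = execute(call)
        except Exception:
            # One failure fails the whole batch; partial output stays on
            # the record for later retrieval via get_result().
            state.fail_reason = "tool_execution_error"
            return
        state.messages.append({"role": "tool", "content": result})
    state.tool_iters_remaining -= 1        # once per batch, not per call
</syntaxhighlight>

The loop driver would call run_iteration once per assistant tool-call message, so the budget counts emission events rather than individual calls.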
==== Appendix I: Restart policy ====

This appendix specifies when and how the worker restarts llama-server, and how it avoids thrashing.

===== Restart triggers =====

A restart may be initiated when any of the following occurs:

# Process death
#* The llama-server subprocess exits unexpectedly.
# Readiness probe failure (when expecting READY)
#* While in READY, repeated failures of GET /v1/models beyond policy thresholds.
# Stall timeout
#* A request has made no progress (no stream bytes and no liveness evidence) beyond stall_timeout/policy thresholds.
#* "Progress" is defined elsewhere: any bytes after headers, or /proc liveness during prefill.

===== Restart procedure =====

Restart is "nuke & repave":

# Mark worker state = RUNNING (restarting)
# Fail all in-flight requests
#* Transition each in-flight request → FAILED
#* Set fail_reason="worker_restarted" (or "server_died" if it actually died)
#* Do not discard partial output; it must remain retrievable via get_result()
#* Ensure each request releases its slot promptly (task cancel + cleanup)
# Stop the subprocess process group
#* SIGTERM → wait (short) → SIGKILL the group if needed
# Start the subprocess
#* Spawn with the configured server_cmd and env (port externally assigned)
# Wait for readiness
#* Poll GET /v1/models until success or the startup deadline
#* On success: worker state = READY
#* On failure: worker state = FAILED (or remain RUNNING briefly if retrying per backoff policy)

===== Restart rate limiting =====

To prevent thrash:

* Track restart timestamps in a rolling window.
* If restarts exceed max_restarts_per_window inside restart_window_s:
** worker state becomes FAILED
** submit() returns WORKER_FAILED immediately (no slot consumption)
** the orchestrator can decide whether/when to call start() again

===== Backoff =====

When a restart is triggered repeatedly:

* Apply restart_backoff_s between restart attempts (constant or modest exponential, but keep it simple and testable).
* Backoff should never block stop() from completing promptly.

===== Interaction with the public API =====

* submit() while restarting should return WORKER_NOT_READY (or equivalent) without consuming a slot.
* In-flight requests failed due to restart remain retrievable via get_result() until fetched (and then released).

==== Appendix J: Prefill liveness probe ====

This appendix defines a concrete, lightweight Linux liveness probe suitable for very long prefill.

===== Liveness criterion =====

A probe yields "evidence of life" if:

* the subprocess PID exists and is alive, and
* the process CPU time has increased since the last probe.

A CPU time increase indicates the process is doing work even if no tokens/bytes have been emitted yet.

===== Reading CPU time =====

Linux /proc/<pid>/stat contains:

* utime (user mode CPU ticks)
* stime (kernel mode CPU ticks)

A simple probe can parse the file, extract utime and stime as integers, and compute:

* cpu_ticks = utime + stime

Implementation notes:

* Field parsing: /proc/<pid>/stat has a tricky second field (comm) which may contain spaces; parsing must account for the closing ) before splitting the remaining fields.
* Don't over-probe: a default interval of around 5 s is fine.

===== Probe state and algorithm =====

Maintain per-worker probe state:

* last observed cpu_ticks (int)
* last probe time (monotonic timestamp)

On each probe:

# If the PID no longer exists:
#* report "no liveness" (process death likely triggers a restart separately)
# Read the current cpu_ticks
# If cpu_ticks > last_cpu_ticks:
#* update last_cpu_ticks
#* set last_liveness_at = now
#* return "alive with progress"
# Else:
#* return "alive but no cpu progress"

===== Integration with stall detection =====

During prefill (no stream bytes yet):

* update last_liveness_at on positive CPU tick deltas
* the stall decision uses last_progress_at = max(last_liveness_at, last_stream_byte_at)

===== Cost and failure handling =====

* Reading one small /proc file every few seconds is very cheap.
* If a /proc read fails transiently:
** treat it as "no liveness update this time", not an immediate failure
** let the stall policy decide over a longer window
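A minimal sketch of such a probe, using only the standard library; the LivenessProbe class and its return strings are illustrative, and the field indices follow the proc(5) layout described above.

<syntaxhighlight lang="python">
import time

def read_cpu_ticks(pid: int) -> int | None:
    """Return utime + stime from /proc/<pid>/stat, or None if unreadable."""
    try:
        with open(f"/proc/{pid}/stat", "rb") as f:
            raw = f.read().decode("ascii", "replace")
    except OSError:
        return None  # PID gone, or a transient read failure (see notes above)
    # comm (field 2) may contain spaces, so split on the *last* ')' first.
    rest = raw.rsplit(")", 1)[1].split()
    # rest[0] is field 3 (state); utime/stime are fields 14/15 -> rest[11]/rest[12].
    return int(rest[11]) + int(rest[12])

class LivenessProbe:
    """Per-worker probe state: last CPU ticks and last liveness timestamp."""

    def __init__(self, pid: int):
        self.pid = pid
        self.last_cpu_ticks: int | None = None
        self.last_liveness_at: float | None = None

    def poll(self) -> str:
        ticks = read_cpu_ticks(self.pid)
        if ticks is None:
            return "no_liveness"            # process death is handled elsewhere
        if self.last_cpu_ticks is None or ticks > self.last_cpu_ticks:
            self.last_cpu_ticks = ticks     # first sample sets the baseline
            self.last_liveness_at = time.monotonic()
            return "alive_with_progress"
        return "alive_no_cpu_progress"
</syntaxhighlight>

During prefill the worker would call poll() every ~5 s and fold last_liveness_at into last_progress_at as described above.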
==== Appendix L: Test plan ====

This appendix defines what to test to enforce the "robustness first" goal.

===== Slots & lifecycle =====

* submit() returns NO_SLOT_AVAILABLE when slots are exhausted (no queue)
* slot release happens on:
** completion
** tool errors
** cancellation
** restart-induced failure
* no slot leaks across many iterations

===== BIOS separation =====

* BIOS is generated by a distinct method (mock provider)
* message ordering is BIOS → caller system → conversation
* BIOS updates when the tool iteration budget changes

===== Tool loop (iteration budget) =====

* a single assistant tool-call message containing multiple tool calls consumes 1 iteration
* the budget decrements once per iteration (not per call)
* budget exhausted: any normal tool call triggers tool_execution_error
* exit tools:
** recorded at any time
** do not decrement the tool budget
** do not alter state transitions

===== Fallback parsing =====

* a valid JSON tool call parses and executes
* malformed JSON → tool_parse_error
* unknown tool → tool_parse_error
* an exit-tool fallback parse records the signal and continues

===== Loop detector =====

* triggers on 39-char repeated lines using the new length-tier thresholds
* does not trigger on common benign repetition patterns (a few repeats)

===== Result release semantics =====

* get_result() returns the terminal result and releases the record
* subsequent calls return NOT_FOUND
* partial output is returned on failure/cancel

===== Mock llama-server harness =====

Implement a small local asyncio HTTP server that mimics the parts you need (a minimal sketch appears at the end of this reply).

Endpoints:

* GET /v1/models:
** returns 200 OK when "ready"
** can be configured to fail intermittently
* POST /v1/chat/completions with stream=true:
** emits SSE-like frames: text deltas, optional tool calls, [DONE]
** can simulate stalls/disconnects/errors

Scenarios:

# Slow headers
#* delay headers beyond headers_timeout_s to validate the failure path and restart logic
# Long prefill / no tokens
#* hold the stream silent but keep the subprocess "alive" (unit tests cover /proc; integration can just simulate no bytes + no done)
#* confirm the worker doesn't false-timeout if prefill liveness is enabled in that profile (or confirm behavior matches your profile)
# Mid-stream stall
#* send some bytes, then pause beyond idle_stream_timeout_s → triggers stall → restart
# Disconnect mid-stream
#* abruptly close the socket; ensure the request fails and partial output is returned
# Normal tool iteration (multi-call batch)
#* emit one assistant message with 2–3 normal tool calls at once
#* validate the worker executes all of them, appends tool results, decrements the iteration budget by 1, and continues
# Exit tool emission
#* emit an exit tool call; ensure the signal is recorded and generation continues normally
# Crash-loop protection
#* repeatedly make /v1/models fail, or simulate rapid subprocess death
#* confirm the worker transitions to FAILED after max restarts per window

===== Process-group kill test =====

Use a dummy subprocess that spawns a child process and verify:

* stop() kills the entire process group
* no child remains after stop/restart

If you want to keep going: the only "remaining" section that's still useful (given your scope cuts) would be a short Appendix N: Implementation checklist (a punch list for coding order + minimal test harness order).
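As a concrete starting point for the harness in Appendix L, here is a minimal mock server sketch, assuming aiohttp as the asyncio HTTP server (any equivalent works); the FAIL_MODELS and STALL_AFTER knobs are hypothetical illustrations of the configurable failure modes listed above.

<syntaxhighlight lang="python">
import asyncio
import json
from aiohttp import web

FAIL_MODELS = False   # flip to simulate readiness-probe failures
STALL_AFTER = None    # e.g. 2 -> stall after 2 deltas (mid-stream stall)

async def models(request: web.Request) -> web.Response:
    if FAIL_MODELS:
        raise web.HTTPServiceUnavailable()
    return web.json_response({"data": [{"id": "mock-model"}]})

async def chat(request: web.Request) -> web.StreamResponse:
    resp = web.StreamResponse(headers={"Content-Type": "text/event-stream"})
    await resp.prepare(request)
    deltas = ["Hel", "lo ", "world"]
    for i, text in enumerate(deltas):
        if STALL_AFTER is not None and i == STALL_AFTER:
            await asyncio.sleep(3600)   # simulate a mid-stream stall
        frame = {"choices": [{"delta": {"content": text}}]}
        await resp.write(f"data: {json.dumps(frame)}\n\n".encode())
        await asyncio.sleep(0.05)
    await resp.write(b"data: [DONE]\n\n")
    return resp

app = web.Application()
app.add_routes([web.get("/v1/models", models),
                web.post("/v1/chat/completions", chat)])

if __name__ == "__main__":
    web.run_app(app, port=8080)
</syntaxhighlight>

Pointing the worker's server_cmd at a script like this lets the integration scenarios above run without a real model.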