=== Assistant: Absolutely, tool budget will be iterations, not "number of tool calls". ===

Absolutely: the tool budget will be counted in iterations, not in "number of tool calls". A single iteration may contain multiple tool calls, and that entire batch consumes exactly one iteration. Below is the updated Appendix G plus the remaining appendices (I, J, L), so you have the full remaining set.

==== Appendix G: Tool iteration budget (updated) ====

===== Definition =====

* Tool budget = iterations.
* An iteration is one assistant tool-call emission event, i.e., one model "turn" in which it asks for tools.
* In that one iteration, the model may request multiple normal tools at once.
* The worker executes all normal tool calls in that iteration (sequentially by default), then appends the tool results, regenerates the BIOS, and continues generation.
* After successfully processing the iteration, decrement once:
** tool_iters_remaining -= 1 (always one, regardless of the number of calls)

===== Batch semantics =====

To keep v1 simple and robust:

* If the assistant emits N normal tool calls in one message:
** execute them sequentially in the listed order
** append N tool result messages (and an assistant tool-call message containing those normal calls)
** decrement the iteration budget by 1 (not N)
* If any tool call in the batch fails (timeout/exception/parse error):
** fail the request with tool_execution_error
** preserve partial output for retrieval

Budget exhaustion rule (iteration-based):

* If tool_iters_remaining == 0 and the assistant emits any normal tool calls:
** fail immediately with tool_execution_error ("tool iteration budget exhausted")
** preserve output

===== Code change =====

Replace:

* req.tool_iters_remaining -= len(normal_calls)

with:

* req.tool_iters_remaining -= 1

and treat the batch of tool calls as a single iteration.
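For concreteness, here is a minimal sketch of that rule in Python. ToolLoopState, its fields, and the execute callback are illustrative stand-ins for the worker's real request record and tool dispatcher, not a prescribed API.

<syntaxhighlight lang="python">
from dataclasses import dataclass, field
from typing import Any, Callable

@dataclass
class ToolLoopState:
    # Illustrative stand-in for the worker's per-request record.
    tool_iters_remaining: int
    messages: list = field(default_factory=list)
    fail_reason: str | None = None

def run_iteration(state: ToolLoopState,
                  normal_calls: list[dict],
                  execute: Callable[[dict], Any]) -> None:
    """Process one assistant tool-call message as a single iteration."""
    if state.tool_iters_remaining == 0:
        # Budget exhausted: any normal tool call fails the request.
        state.fail_reason = "tool_execution_error"
        return
    for call in normal_calls:              # sequential, in the listed order
        try:
            result = execute(call)
        except Exception:
            # One failure fails the whole batch; partial output stays on
            # the record for later retrieval via get_result().
            state.fail_reason = "tool_execution_error"
            return
        state.messages.append({"role": "tool", "content": result})
    state.tool_iters_remaining -= 1        # once per batch, not per call
</syntaxhighlight>

The loop driver would call run_iteration once per assistant tool-call message, so the budget counts emission events rather than individual calls.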
==== Appendix I: Restart policy ====

This appendix specifies when and how the worker restarts llama-server, and how it avoids thrashing.

===== Restart triggers =====

A restart may be initiated when any of the following occurs:

# Process death
#* The llama-server subprocess exits unexpectedly.
# Readiness probe failure (when expecting READY)
#* While in READY, repeated failures of GET /v1/models beyond policy thresholds.
# Stall timeout
#* A request has made no progress (no stream bytes and no liveness evidence) beyond stall_timeout/policy thresholds.
#* "Progress" is defined elsewhere: any bytes after headers, or /proc liveness during prefill.

===== Restart procedure =====

Restart is "nuke & repave":

# Mark worker state = RUNNING (restarting)
# Fail all in-flight requests
#* Transition each in-flight request → FAILED
#* Set fail_reason="worker_restarted" (or "server_died" if it actually died)
#* Do not discard partial output; it must remain retrievable via get_result()
#* Ensure each request releases its slot promptly (task cancel + cleanup)
# Stop the subprocess process group
#* SIGTERM → wait (short) → SIGKILL the group if needed
# Start the subprocess
#* Spawn with the configured server_cmd and env (port externally assigned)
# Wait for readiness
#* Poll GET /v1/models until success or the startup deadline
#* On success: worker state = READY
#* On failure: worker state = FAILED (or remain RUNNING briefly if retrying per backoff policy)

===== Restart rate limiting =====

To prevent thrash:

* Track restart timestamps in a rolling window.
* If restarts exceed max_restarts_per_window inside restart_window_s:
** worker state becomes FAILED
** submit() returns WORKER_FAILED immediately (no slot consumption)
** the orchestrator can decide whether/when to call start() again

===== Backoff =====

When a restart is triggered repeatedly:

* Apply restart_backoff_s between restart attempts (constant or modest exponential, but keep it simple and testable).
* Backoff should never block stop() from completing promptly.

===== Interaction with the public API =====

* submit() while restarting should return WORKER_NOT_READY (or equivalent) without consuming a slot.
* In-flight requests failed due to restart remain retrievable via get_result() until fetched (and then released).

==== Appendix J: Prefill liveness probe ====

This appendix defines a concrete, lightweight Linux liveness probe suitable for very long prefill.

===== Liveness criterion =====

A probe yields "evidence of life" if:

* the subprocess PID exists and is alive, and
* the process CPU time has increased since the last probe.

A CPU time increase indicates the process is doing work even if no tokens/bytes have been emitted yet.

===== Reading CPU time =====

Linux /proc/<pid>/stat contains:

* utime (user mode CPU ticks)
* stime (kernel mode CPU ticks)

A simple probe can parse the file, extract utime and stime as integers, and compute:

* cpu_ticks = utime + stime

Implementation notes:

* Field parsing: /proc/<pid>/stat has a tricky second field (comm) which may contain spaces; parsing must account for the closing ) before splitting the remaining fields.
* Don't over-probe: a default interval of around 5 s is fine.

===== Probe state and algorithm =====

Maintain per-worker probe state:

* last observed cpu_ticks (int)
* last probe time (monotonic timestamp)

On each probe:

# If the PID no longer exists:
#* report "no liveness" (process death likely triggers a restart separately)
# Read the current cpu_ticks
# If cpu_ticks > last_cpu_ticks:
#* update last_cpu_ticks
#* set last_liveness_at = now
#* return "alive with progress"
# Else:
#* return "alive but no cpu progress"

===== Integration with stall detection =====

During prefill (no stream bytes yet):

* update last_liveness_at on positive CPU tick deltas
* the stall decision uses last_progress_at = max(last_liveness_at, last_stream_byte_at)

===== Cost and failure handling =====

* Reading one small /proc file every few seconds is very cheap.
* If a /proc read fails transiently:
** treat it as "no liveness update this time", not an immediate failure
** let the stall policy decide over a longer window
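A minimal sketch of such a probe, using only the standard library; the LivenessProbe class and its return strings are illustrative, and the field indices follow the proc(5) layout described above.

<syntaxhighlight lang="python">
import time

def read_cpu_ticks(pid: int) -> int | None:
    """Return utime + stime from /proc/<pid>/stat, or None if unreadable."""
    try:
        with open(f"/proc/{pid}/stat", "rb") as f:
            raw = f.read().decode("ascii", "replace")
    except OSError:
        return None  # PID gone, or a transient read failure (see notes above)
    # comm (field 2) may contain spaces, so split on the *last* ')' first.
    rest = raw.rsplit(")", 1)[1].split()
    # rest[0] is field 3 (state); utime/stime are fields 14/15 -> rest[11]/rest[12].
    return int(rest[11]) + int(rest[12])

class LivenessProbe:
    """Per-worker probe state: last CPU ticks and last liveness timestamp."""

    def __init__(self, pid: int):
        self.pid = pid
        self.last_cpu_ticks: int | None = None
        self.last_liveness_at: float | None = None

    def poll(self) -> str:
        ticks = read_cpu_ticks(self.pid)
        if ticks is None:
            return "no_liveness"            # process death is handled elsewhere
        if self.last_cpu_ticks is None or ticks > self.last_cpu_ticks:
            self.last_cpu_ticks = ticks     # first sample sets the baseline
            self.last_liveness_at = time.monotonic()
            return "alive_with_progress"
        return "alive_no_cpu_progress"
</syntaxhighlight>

During prefill the worker would call poll() every ~5 s and fold last_liveness_at into last_progress_at as described above.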
==== Appendix L: Test plan ====

This appendix defines what to test to enforce the "robustness first" goal.

===== Slots & lifecycle =====

* submit() returns NO_SLOT_AVAILABLE when slots are exhausted (no queue)
* slot release happens on:
** completion
** tool errors
** cancellation
** restart-induced failure
* no slot leaks across many iterations

===== BIOS separation =====

* BIOS is generated by a distinct method (mock provider)
* message ordering is BIOS → caller system → conversation
* BIOS updates when the tool iteration budget changes

===== Tool loop (iteration budget) =====

* a single assistant tool-call message containing multiple tool calls consumes 1 iteration
* the budget decrements once per iteration (not per call)
* budget exhausted: any normal tool call triggers tool_execution_error
* exit tools:
** recorded at any time
** do not decrement the tool budget
** do not alter state transitions

===== Fallback parsing =====

* a valid JSON tool call parses and executes
* malformed JSON → tool_parse_error
* unknown tool → tool_parse_error
* an exit-tool fallback parse records the signal and continues

===== Loop detector =====

* triggers on 39-char repeated lines using the new length-tier thresholds
* does not trigger on common benign repetition patterns (a few repeats)

===== Result release semantics =====

* get_result() returns the terminal result and releases the record
* subsequent calls return NOT_FOUND
* partial output is returned on failure/cancel

===== Mock llama-server harness =====

Implement a small local asyncio HTTP server that mimics the parts you need (a minimal sketch appears at the end of this reply).

Endpoints:

* GET /v1/models:
** returns 200 OK when "ready"
** can be configured to fail intermittently
* POST /v1/chat/completions with stream=true:
** emits SSE-like frames: text deltas, optional tool calls, [DONE]
** can simulate stalls/disconnects/errors

Scenarios:

# Slow headers
#* delay headers beyond headers_timeout_s to validate the failure path and restart logic
# Long prefill / no tokens
#* hold the stream silent but keep the subprocess "alive" (unit tests cover /proc; integration can just simulate no bytes + no done)
#* confirm the worker doesn't false-timeout if prefill liveness is enabled in that profile (or confirm behavior matches your profile)
# Mid-stream stall
#* send some bytes, then pause beyond idle_stream_timeout_s → triggers stall → restart
# Disconnect mid-stream
#* abruptly close the socket; ensure the request fails and partial output is returned
# Normal tool iteration (multi-call batch)
#* emit one assistant message with 2–3 normal tool calls at once
#* validate the worker executes all of them, appends tool results, decrements the iteration budget by 1, and continues
# Exit tool emission
#* emit an exit tool call; ensure the signal is recorded and generation continues normally
# Crash-loop protection
#* repeatedly make /v1/models fail, or simulate rapid subprocess death
#* confirm the worker transitions to FAILED after max restarts per window

===== Process-group kill test =====

Use a dummy subprocess that spawns a child process and verify:

* stop() kills the entire process group
* no child remains after stop/restart

If you want to keep going: the only "remaining" section that's still useful (given your scope cuts) would be a short Appendix N: Implementation checklist (a punch list for coding order + minimal test harness order).
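As a concrete starting point for the harness in Appendix L, here is a minimal mock server sketch, assuming aiohttp as the asyncio HTTP server (any equivalent works); the FAIL_MODELS and STALL_AFTER knobs are hypothetical illustrations of the configurable failure modes listed above.

<syntaxhighlight lang="python">
import asyncio
import json
from aiohttp import web

FAIL_MODELS = False   # flip to simulate readiness-probe failures
STALL_AFTER = None    # e.g. 2 -> stall after 2 deltas (mid-stream stall)

async def models(request: web.Request) -> web.Response:
    if FAIL_MODELS:
        raise web.HTTPServiceUnavailable()
    return web.json_response({"data": [{"id": "mock-model"}]})

async def chat(request: web.Request) -> web.StreamResponse:
    resp = web.StreamResponse(headers={"Content-Type": "text/event-stream"})
    await resp.prepare(request)
    deltas = ["Hel", "lo ", "world"]
    for i, text in enumerate(deltas):
        if STALL_AFTER is not None and i == STALL_AFTER:
            await asyncio.sleep(3600)   # simulate a mid-stream stall
        frame = {"choices": [{"delta": {"content": text}}]}
        await resp.write(f"data: {json.dumps(frame)}\n\n".encode())
        await asyncio.sleep(0.05)
    await resp.write(b"data: [DONE]\n\n")
    return resp

app = web.Application()
app.add_routes([web.get("/v1/models", models),
                web.post("/v1/chat/completions", chat)])

if __name__ == "__main__":
    web.run_app(app, port=8080)
</syntaxhighlight>

Pointing the worker's server_cmd at a script like this lets the integration scenarios above run without a real model.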