Heal tool-call XML in streaming pass into real tool executions #4433
danielhanchen wants to merge 1 commit into main
Conversation
Force-pushed bc37fa8 to b5f63c5
Force-pushed 3f52fe2 to ab2cb04
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 83ce0de95e
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
```python
cumulative = _strip_tool_markup(cumulative, final=True)
_last_emitted = cumulative
if cumulative:
    yield {"type": "content", "text": cumulative}
```
Preserve append-only content when healing streamed tool XML
This branch can emit a shorter cumulative content string than was already streamed. gguf_tool_stream() in studio/backend/routes/inference.py:1110-1115 assumes tool-mode text is append-only and computes deltas as cumulative[len(prev_text):]; once raw <tool_call> tokens have already reached prev_text, the corrective shorter value yielded here is discarded, and later chunks are sliced against the longer raw-XML string. In practice the UI keeps the leaked markup and appends the post-tool answer after it instead of replacing it.
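The append-only contract the reviewer describes can be sketched with a hypothetical helper (names are illustrative, not the repo's actual code): when healing shrinks the cumulative text, the consumer needs an explicit replace signal instead of a blind slice.

```python
def emit_delta(prev_text: str, cumulative: str) -> tuple[str, bool]:
    """Compute the next UI update from a cumulative content string.

    Returns (text, replace). replace=False means text is an append-only
    delta; replace=True means healing shrank the cumulative string, so the
    client must swap the whole bubble rather than append.
    """
    if cumulative.startswith(prev_text):
        # Normal case: new text is a strict extension of what was emitted.
        return cumulative[len(prev_text):], False
    # Healed text rewrote history (e.g. raw <tool_call> markup was stripped
    # retroactively): slicing against the longer raw string would be wrong.
    return cumulative, True
```

With this shape, the `cumulative[len(prev_text):]` slice in `gguf_tool_stream()` only runs on the append path, and the leaked-markup case falls through to a full replace.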
```python
_effective_timeout = (
    None if tool_call_timeout >= 9999 else tool_call_timeout
)
result = execute_tool(
    tool_name,
```
Reject healed tool names that were not enabled for this request
The non-streaming path only exposes the caller-selected tools (studio/backend/routes/inference.py:1031-1038), but this healed-stream branch executes whatever tool_name was parsed from raw_content. Because the streaming payload built above still omits tools, a model can emit <function=python> or <function=terminal> even when the request enabled only web_search, and this code will run it. That bypasses the request's tool allow-list and can execute a materially more powerful tool than the caller permitted.
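One way to enforce the allow-list the reviewer asks for is to filter healed calls against the request's enabled tools before execution. This is a minimal sketch with a hypothetical helper name, not the repo's implementation:

```python
def filter_healed_tool_calls(parsed_calls, enabled_tools):
    """Keep only healed tool calls whose name the caller actually enabled.

    parsed_calls: list of dicts from the text parser, each with a "name" key.
    enabled_tools: iterable of tool names from the request, or None.
    """
    allowed = set(enabled_tools or ())
    # Anything parsed from raw model text that is not in the request's
    # allow-list (e.g. <function=terminal> when only web_search was enabled)
    # is dropped instead of executed.
    return [call for call in parsed_calls if call.get("name") in allowed]
```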
```python
# stop the stream, parse & execute the tool calls, append results
# to the conversation, and re-stream -- up to remaining
# iterations from the main loop above.
remaining_iters = max(1, max_tool_iterations - iteration)
```
Leave one stream iteration for the post-tool assistant reply
This budget calculation does not reserve a final synthesis pass after a healed stream-side tool call. If the non-streaming loop breaks on iteration == max_tool_iterations - 1, remaining_iters becomes 1; a healed tool call in that only stream iteration hits the continue at 2069-2071, exhausts the loop, and the generator returns without ever asking the model for the assistant message that uses the last tool result. Conversations that reach the limit this way end on tool output only.
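The reservation the reviewer suggests can be sketched as splitting the remaining budget so the last pass never executes a tool. Hypothetical helper, assuming the same `max_tool_iterations`/`iteration` counters as the quoted code:

```python
def stream_budget(max_tool_iterations: int, iteration: int) -> tuple[int, int]:
    """Split the remaining iteration budget for the streaming pass.

    Returns (tool_iters, synthesis_iters): healed tool calls may consume at
    most tool_iters passes, and one pass is always reserved so the model can
    produce the assistant message that uses the last tool result.
    """
    remaining = max(1, max_tool_iterations - iteration)
    # When only one iteration is left, tool execution is disallowed and the
    # pass becomes synthesis-only instead of ending on tool output.
    return remaining - 1, 1
```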
Force-pushed 83ce0de to faa760b
💡 Codex Review
Reviewed commit: b305b81e7b
```python
# stop the stream, parse & execute the tool calls, append results
# to the conversation, and re-stream -- up to remaining
# iterations from the main loop above.
remaining_iters = max(1, max_tool_iterations - iteration)
```
Handle the documented max_tool_calls_per_message=0 case
max_tool_calls_per_message is explicitly documented as 0 = disabled in studio/backend/models/inference.py:325-328, and the route forwards that value unchanged at studio/backend/routes/inference.py:1056-1058. With 0, for iteration in range(max_tool_iterations) never runs, so iteration is never bound and this expression raises UnboundLocalError before any response is streamed. Requests that use the documented disable setting now 500 instead of falling back to a normal completion.
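The UnboundLocalError can be avoided by pre-binding the loop variable before the `range()` loop, so an empty range still leaves a usable value. A minimal sketch with the loop body elided:

```python
def run_tool_loop(max_tool_iterations: int) -> int:
    """Compute remaining iterations even when the tool loop never runs.

    max_tool_iterations == 0 is documented as "disabled"; with an empty
    range the for-loop body never executes, so the loop variable must be
    bound beforehand or the expression below raises UnboundLocalError.
    """
    iteration = 0  # safe default for the empty-range case
    for iteration in range(max_tool_iterations):
        pass  # non-streaming tool loop body elided
    return max(1, max_tool_iterations - iteration)
```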
```python
cumulative = ""
_last_emitted = ""
```
Reset streamed text before starting the next assistant turn
cumulative and _last_emitted are initialized outside the _stream_iter loop, but a healed tool call appends an assistant/tool turn to conversation at 1985-1990 and then starts a fresh completion on the next iteration. Because these accumulators are preserved, any pre-tool prose already streamed in the first pass is concatenated onto the next assistant reply, so a response like "Let me check that" + tool call + final answer is rendered as one merged bubble instead of only the post-tool assistant message. The non-streaming tool path does not expose content_text from the tool-call turn, so this changes visible behavior for any model that emits text before the healed XML.
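The reset the reviewer wants amounts to scoping the accumulators to each assistant turn rather than the whole generator. A simplified, hypothetical sketch of the streaming shape:

```python
def stream_assistant_turns(turns):
    """Yield one rendered bubble per assistant turn.

    turns: an iterable of chunk lists, one list per completion pass.
    Resetting the accumulators inside the loop keeps pre-tool prose from an
    earlier pass ("Let me check that") out of the post-tool reply.
    """
    for chunks in turns:
        cumulative = ""      # per-turn, not once for the whole generator
        _last_emitted = ""
        for piece in chunks:
            cumulative += piece
        if cumulative != _last_emitted:
            yield cumulative
```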
Force-pushed b305b81 to 37d96c5
💡 Codex Review
Reviewed commit: 0583472360
```python
if auto_heal_tool_calls and _TOOL_OPEN_RE.search(raw_content):
    tool_calls = self._parse_tool_calls_from_text(raw_content)
```
Don't auto-execute literal <function=...> text in the final reply
This new raw_content check treats any streamed mention of <tool_call> or <function= as a real tool invocation. Because _parse_tool_calls_from_text() accepts bare <function=name> blocks with no closing tags or parameters (studio/backend/core/inference/llama_cpp.py:1226-1291), a final answer that merely shows the XML syntax after an earlier tool round—e.g. documentation or a code example containing <function=web_search>—will be parsed and executed here instead of being displayed. The old final streaming pass only stripped markup, so this introduces unintended searches/command runs and can send the model back into another tool iteration on otherwise normal explanatory answers.
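One way to tighten the trigger, per the reviewer's point, is to heal only well-formed, closed blocks so a bare `<function=name>` mention in explanatory text stays visible. This is a hedged sketch with an assumed regex, not the repo's `_TOOL_OPEN_RE`:

```python
import re

# Require a complete, closed block before healing. A bare "<function=name>"
# with no closing tag (e.g. syntax shown in documentation) does not match.
_CLOSED_TOOL_CALL_RE = re.compile(
    r"<tool_call>.*?</tool_call>|<function=[\w.-]+>.*?</function>",
    re.S,
)

def should_heal(raw_content: str) -> bool:
    """Return True only when raw_content contains a closed tool-call block."""
    return bool(_CLOSED_TOOL_CALL_RE.search(raw_content))
```

This still catches genuine healed calls (which the model closes before the stop sequence fires) while letting a final answer that merely quotes the XML syntax render as text.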
```python
if tool_name == "web_search":
    status_text = f"Searching: {arguments.get('query', '')}"
elif tool_name == "python":
    preview = (
        (arguments.get("code") or "")
```
Normalize healed tool arguments before using .get()
The streaming-heal path assumes parsed arguments are always mappings, but _parse_tool_calls_from_text() preserves non-object JSON arguments values from <tool_call>{...}</tool_call> payloads (for example an array or number at studio/backend/core/inference/llama_cpp.py:1214-1221). In this branch those values fall through arguments = raw_args, and the first status formatter immediately does arguments.get(...); a malformed streamed tool call therefore raises AttributeError and aborts the SSE response instead of failing gracefully. This regression is specific to the new healed-stream execution path, since the old final streaming pass never dereferenced parsed arguments at all.
Force-pushed 0583472 to 8a6d4c4
💡 Codex Review
Reviewed commit: 8a6d4c46a9
```python
stream_payload["stop"] = stop
_stop = list(stop) if stop else []
if auto_heal_tool_calls:
    _stop += ["<tool_call>", "<function="]
```
Avoid stopping on literal <function= content
Adding "<tool_call>" and "<function=" as unconditional stop strings for the final synthesis stream can truncate legitimate answers that merely show these tokens as text (for example, when the user asks for an XML/tool-call example after a tool run). In that case no tool invocation is intended, but generation halts at the marker and returns a partial/empty reply even though the prior non-streaming pass had already produced a normal assistant message.
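One possible shape for the conditional the reviewer implies: only add the markers as stop strings when the current pass can actually heal them into executions. Hypothetical helper and flag names:

```python
def build_stop_list(stop, auto_heal_tool_calls, pass_executes_tools):
    """Assemble stop strings for one completion pass.

    pass_executes_tools distinguishes a tool-capable pass from a pure
    synthesis pass; the latter must be able to render <tool_call> or
    <function= as literal text (e.g. when the user asks for an example).
    """
    _stop = list(stop) if stop else []
    if auto_heal_tool_calls and pass_executes_tools:
        _stop += ["<tool_call>", "<function="]
    return _stop
```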
Two fixes for raw <tool_call> XML appearing in chat bubbles:

1. Non-streaming tool loop: llama-server can return BOTH structured tool_calls AND raw <tool_call> XML in the content field at the same time. Previously the XML stripping only ran in the fallback path (when no structured tool_calls were found). Now it always strips tool-call XML from content_text when any tool calls are present, regardless of whether they came from the structured field or the XML fallback parser.

2. Final streaming pass: add "<tool_call>" and "<function=" as stop sequences so the model cannot emit tool-call XML. Also use open-ended strip patterns during streaming (not just on final flush) as a safety net.
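The open-ended strip pattern mentioned above can be sketched roughly as follows; this is a simplified stand-in for the PR's actual _strip_tool_markup, with an assumed regex:

```python
import re

# The "$" alternative makes the pattern open-ended: a block whose closing
# tag never arrived (truncated or malformed stream) is still removed.
_TOOL_XML_RE = re.compile(r"<tool_call>.*?(?:</tool_call>|$)", re.S)

def strip_tool_markup(text: str) -> str:
    """Drop tool-call XML from assistant content, closed or not.

    Runs even when structured tool_calls were returned alongside the text,
    since llama-server can emit both representations at once.
    """
    return _TOOL_XML_RE.sub("", text).strip()
```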
Force-pushed 8a6d4c4 to 4463fa5
Problem
When tool calling is enabled, the model outputs raw <tool_call><function=terminal><parameter=command>... XML directly into the chat bubble instead of the tool being executed. This happens with both quantized and BF16 GGUF models (Qwen3.5-4B, Qwen3.5-9B tested).

The non-streaming tool loop works correctly. But after it finishes, the final streaming pass does not include tools in the payload, so the model falls back to generating raw <tool_call> XML from memory of the conversation. The existing _strip_tool_markup regex tries to clean this up mid-stream but fails on malformed/incomplete closing tags, so raw XML leaks into the UI.

Fix

Instead of stripping or stopping at the XML, heal it into real tool executions:
- Detect <tool_call> or <function= patterns mid-stream
- Parse them with the existing _parse_tool_calls_from_text parser
- Execute the calls and emit tool_start/tool_end events (so the frontend shows them correctly in the tool output panel)

This means tool calls work correctly regardless of whether llama-server returns them as structured tool_calls (non-streaming loop) or as raw XML in the text output (streaming pass).

Testing