
Heal tool-call XML in streaming pass into real tool executions#4433

Open
danielhanchen wants to merge 1 commit into main from fix/stop-tool-xml-in-final-pass

Conversation

@danielhanchen
Contributor

@danielhanchen commented Mar 18, 2026

Problem

When tool calling is enabled, the model outputs raw <tool_call><function=terminal><parameter=command>... XML directly into the chat bubble instead of the tool being executed. This happens with both quantized and BF16 GGUF models (Qwen3.5-4B, Qwen3.5-9B tested).

The non-streaming tool loop works correctly. But after it finishes, the final streaming pass does not include tools in the payload, so the model falls back to generating raw <tool_call> XML from memory of the conversation. The existing _strip_tool_markup regex tries to clean this up mid-stream but fails on malformed/incomplete closing tags, so raw XML leaks into the UI.

Fix

Instead of stripping or stopping at the XML, heal it into real tool executions:

  1. After each streaming pass completes, check the raw accumulated content for <tool_call> or <function= patterns
  2. If found, parse them with the existing _parse_tool_calls_from_text parser
  3. Execute the tools, yielding proper tool_start/tool_end events (so the frontend shows them correctly in the tool output panel)
  4. Append the assistant message + tool results to the conversation
  5. Re-stream so the model can synthesize a response incorporating the new tool results
  6. Repeat up to the remaining iteration budget from the main non-streaming loop

This means tool calls work correctly regardless of whether llama-server returns them as structured tool_calls (non-streaming loop) or as raw XML in the text output (streaming pass).
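The six steps above can be sketched as a small driver loop. This is a minimal illustration, not the PR's actual code: `_parse_tool_calls_from_text` and the `tool_start`/`tool_end` event names come from the PR description, but the loop shape, the message dicts, and the injected `parse_calls`/`execute_tool`/`restream` callbacks are hypothetical.

```python
import re

_TOOL_OPEN = re.compile(r"<tool_call>|<function=")

def heal_stream(raw_content, conversation, remaining_iters,
                parse_calls, execute_tool, restream):
    """Turn leaked tool-call XML into real tool executions, then re-stream."""
    for _ in range(remaining_iters):
        if not _TOOL_OPEN.search(raw_content):
            break  # no leaked XML left: this reply is final
        tool_calls = parse_calls(raw_content)
        if not tool_calls:
            break  # pattern matched but nothing parseable: stop healing
        conversation.append({"role": "assistant", "content": raw_content})
        for call in tool_calls:
            yield {"type": "tool_start", "name": call["name"]}
            result = execute_tool(call["name"], call["arguments"])
            yield {"type": "tool_end", "name": call["name"], "result": result}
            conversation.append({"role": "tool", "content": result})
        # re-stream so the model can use the new tool results; restream()
        # yields content events and returns the newly accumulated raw text
        raw_content = yield from restream(conversation)
```

Each pass either finds no XML and ends, or executes the parsed calls and re-streams, bounded by the iteration budget carried over from the non-streaming loop.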

Testing

  • Tested with Qwen3.5-9B-UD-Q4_K_XL and Qwen3.5-4B BF16
  • Tool calls that previously showed as raw XML are now parsed, executed, and displayed in the tool output panel
  • Multi-step tool chains work (model calls tool, sees result, calls another tool, etc.)

@gemini-code-assist
Contributor

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

@danielhanchen force-pushed the fix/stop-tool-xml-in-final-pass branch from bc37fa8 to b5f63c5 on March 18, 2026 at 16:35
@danielhanchen changed the title from "Stop tool-call XML from leaking into chat UI" to "Heal tool-call XML in streaming pass into real tool executions" on Mar 18, 2026
@danielhanchen force-pushed the fix/stop-tool-xml-in-final-pass branch from 3f52fe2 to ab2cb04 on March 18, 2026 at 16:38

@chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 83ce0de95e

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment on lines +1979 to +1982
```python
cumulative = _strip_tool_markup(cumulative, final=True)
_last_emitted = cumulative
if cumulative:
    yield {"type": "content", "text": cumulative}
```


P1: Preserve append-only content when healing streamed tool XML

This branch can emit a shorter cumulative content string than was already streamed. gguf_tool_stream() in studio/backend/routes/inference.py:1110-1115 assumes tool-mode text is append-only and computes deltas as cumulative[len(prev_text):]; once raw <tool_call> tokens have already reached prev_text, the corrective shorter value yielded here is discarded, and later chunks are sliced against the longer raw-XML string. In practice the UI keeps the leaked markup and appends the post-tool answer after it instead of replacing it.
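One way to address this, sketched under the assumption that the frontend could accept a replace-style event: the `content_replace` type here is hypothetical; only the append-only delta convention (`cumulative[len(prev_text):]`) is from the review above.

```python
def emit_delta(prev_text, cumulative):
    """Compute a streaming update that stays correct even when the healed
    cumulative text is shorter than what was already sent to the client."""
    if cumulative.startswith(prev_text):
        # normal append-only case: send just the new suffix
        return {"type": "content", "text": cumulative[len(prev_text):]}
    # healed/shrunk case: tell the client to replace the whole bubble
    return {"type": "content_replace", "text": cumulative}
```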


Comment on lines +2041 to +2045
```python
_effective_timeout = (
    None if tool_call_timeout >= 9999 else tool_call_timeout
)
result = execute_tool(
    tool_name,
```


P1: Reject healed tool names that were not enabled for this request

The non-streaming path only exposes the caller-selected tools (studio/backend/routes/inference.py:1031-1038), but this healed-stream branch executes whatever tool_name was parsed from raw_content. Because the streaming payload built above still omits tools, a model can emit <function=python> or <function=terminal> even when the request enabled only web_search, and this code will run it. That bypasses the request's tool allow-list and can execute a materially more powerful tool than the caller permitted.
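A minimal allow-list check along these lines would close the gap before `execute_tool` runs; the helper name and call shape are illustrative, not the repo's code:

```python
def filter_healed_calls(tool_calls, enabled_tools):
    """Split healed tool calls into those the request actually enabled and
    those that must be rejected instead of executed."""
    allowed, rejected = [], []
    for call in tool_calls:
        (allowed if call["name"] in enabled_tools else rejected).append(call)
    return allowed, rejected
```

Rejected calls could then be surfaced as an error event rather than silently executed, preserving the caller's tool allow-list.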


```python
# stop the stream, parse & execute the tool calls, append results
# to the conversation, and re-stream -- up to remaining
# iterations from the main loop above.
remaining_iters = max(1, max_tool_iterations - iteration)
```


P2: Leave one stream iteration for the post-tool assistant reply

This budget calculation does not reserve a final synthesis pass after a healed stream-side tool call. If the non-streaming loop breaks on iteration == max_tool_iterations - 1, remaining_iters becomes 1; a healed tool call in that only stream iteration hits the continue at 2069-2071, exhausts the loop, and the generator returns without ever asking the model for the assistant message that uses the last tool result. Conversations that reach the limit this way end on tool output only.
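One way to reserve the synthesis pass is to mark only the non-final stream iterations as eligible for healing, so the last pass always produces an assistant reply; a sketch with hypothetical names:

```python
def stream_passes(remaining_iters):
    """Yield (index, may_heal) pairs; the final pass is never allowed to
    spend its budget on another tool round."""
    for i in range(remaining_iters):
        yield i, i < remaining_iters - 1
```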


@danielhanchen force-pushed the fix/stop-tool-xml-in-final-pass branch from 83ce0de to faa760b on March 18, 2026 at 16:49

@chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: b305b81e7b


```python
# stop the stream, parse & execute the tool calls, append results
# to the conversation, and re-stream -- up to remaining
# iterations from the main loop above.
remaining_iters = max(1, max_tool_iterations - iteration)
```


P1: Handle the documented max_tool_calls_per_message=0 case

max_tool_calls_per_message is explicitly documented as 0 = disabled in studio/backend/models/inference.py:325-328, and the route forwards that value unchanged at studio/backend/routes/inference.py:1056-1058. With 0, for iteration in range(max_tool_iterations) never runs, so iteration is never bound and this expression raises UnboundLocalError before any response is streamed. Requests that use the documented disable setting now 500 instead of falling back to a normal completion.
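A defensive sketch of the budget calculation that tolerates the documented `0 = disabled` value and an unbound loop variable (function and parameter names are illustrative):

```python
def remaining_stream_iters(max_tool_iterations, iteration=None):
    """Compute the stream-side budget without assuming the tool loop ran.
    With max_tool_iterations == 0 the loop body never binds `iteration`,
    so fall back to a single plain streaming pass with no healing."""
    if max_tool_iterations <= 0:
        return 1  # tools disabled: one normal completion, nothing to heal
    completed = iteration if iteration is not None else 0
    return max(1, max_tool_iterations - completed)
```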


Comment on lines 1849 to 1850
```python
cumulative = ""
_last_emitted = ""
```


P2: Reset streamed text before starting the next assistant turn

cumulative and _last_emitted are initialized outside the _stream_iter loop, but a healed tool call appends an assistant/tool turn to conversation at 1985-1990 and then starts a fresh completion on the next iteration. Because these accumulators are preserved, any pre-tool prose already streamed in the first pass is concatenated onto the next assistant reply, so a response like "Let me check that" + tool call + final answer is rendered as one merged bubble instead of only the post-tool assistant message. The non-streaming tool path does not expose content_text from the tool-call turn, so this changes visible behavior for any model that emits text before the healed XML.


@danielhanchen force-pushed the fix/stop-tool-xml-in-final-pass branch from b305b81 to 37d96c5 on March 18, 2026 at 17:01

@chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 0583472360


Comment on lines +1970 to +1971
```python
if auto_heal_tool_calls and _TOOL_OPEN_RE.search(raw_content):
    tool_calls = self._parse_tool_calls_from_text(raw_content)
```


P1: Don't auto-execute literal &lt;function=...&gt; text in the final reply

This new raw_content check treats any streamed mention of <tool_call> or <function= as a real tool invocation. Because _parse_tool_calls_from_text() accepts bare <function=name> blocks with no closing tags or parameters (studio/backend/core/inference/llama_cpp.py:1226-1291), a final answer that merely shows the XML syntax after an earlier tool round—e.g. documentation or a code example containing <function=web_search>—will be parsed and executed here instead of being displayed. The old final streaming pass only stripped markup, so this introduces unintended searches/command runs and can send the model back into another tool iteration on otherwise normal explanatory answers.


Comment on lines +2013 to +2017
```python
if tool_name == "web_search":
    status_text = f"Searching: {arguments.get('query', '')}"
elif tool_name == "python":
    preview = (
        (arguments.get("code") or "")
```


P2: Normalize healed tool arguments before using .get()

The streaming-heal path assumes parsed arguments are always mappings, but _parse_tool_calls_from_text() preserves non-object JSON arguments values from <tool_call>{...}</tool_call> payloads (for example an array or number at studio/backend/core/inference/llama_cpp.py:1214-1221). In this branch those values fall through arguments = raw_args, and the first status formatter immediately does arguments.get(...); a malformed streamed tool call therefore raises AttributeError and aborts the SSE response instead of failing gracefully. This regression is specific to the new healed-stream execution path, since the old final streaming pass never dereferenced parsed arguments at all.
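A small normalizer applied before any `.get()` call would make this fail-safe; this is an illustrative sketch, not the repo's code:

```python
import json

def normalize_tool_arguments(raw_args):
    """Coerce parsed tool-call arguments to a dict so status formatters can
    safely call .get(); non-object JSON (arrays, numbers, strings) is wrapped
    instead of raising AttributeError mid-stream."""
    if isinstance(raw_args, dict):
        return raw_args
    if isinstance(raw_args, str):
        try:
            parsed = json.loads(raw_args)
            if isinstance(parsed, dict):
                return parsed
            raw_args = parsed
        except ValueError:
            pass  # not JSON: wrap the raw string below
    return {"value": raw_args}
```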


@danielhanchen force-pushed the fix/stop-tool-xml-in-final-pass branch from 0583472 to 8a6d4c4 on March 18, 2026 at 17:07

@chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 8a6d4c46a9


```python
stream_payload["stop"] = stop
_stop = list(stop) if stop else []
if auto_heal_tool_calls:
    _stop += ["<tool_call>", "<function="]
```


P2: Avoid stopping on literal &lt;function= content

Adding "<tool_call>" and "<function=" as unconditional stop strings for the final synthesis stream can truncate legitimate answers that merely show these tokens as text (for example, when the user asks for an XML/tool-call example after a tool run). In that case no tool invocation is intended, but generation halts at the marker and returns a partial/empty reply even though the prior non-streaming pass had already produced a normal assistant message.


Two fixes for raw <tool_call> XML appearing in chat bubbles:

1. Non-streaming tool loop: llama-server can return BOTH structured
   tool_calls AND raw <tool_call> XML in the content field at the same
   time. Previously the XML stripping only ran in the fallback path
   (when no structured tool_calls were found). Now it always strips
   tool-call XML from content_text when any tool calls are present,
   regardless of whether they came from the structured field or the
   XML fallback parser.

2. Final streaming pass: add "<tool_call>" and "<function=" as stop
   sequences so the model cannot emit tool-call XML. Also use
   open-ended strip patterns during streaming (not just on final
   flush) as a safety net.
@danielhanchen force-pushed the fix/stop-tool-xml-in-final-pass branch from 8a6d4c4 to 4463fa5 on March 18, 2026 at 17:25
