Heal tool-call XML in streaming pass into real tool executions #4433
danielhanchen wants to merge 1 commit into main
Conversation
Force-pushed bc37fa8 to b5f63c5
Force-pushed 3f52fe2 to ab2cb04
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 83ce0de95e
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
```python
cumulative = _strip_tool_markup(cumulative, final=True)
_last_emitted = cumulative
if cumulative:
    yield {"type": "content", "text": cumulative}
```
Preserve append-only content when healing streamed tool XML
This branch can emit a shorter cumulative content string than was already streamed. gguf_tool_stream() in studio/backend/routes/inference.py:1110-1115 assumes tool-mode text is append-only and computes deltas as cumulative[len(prev_text):]; once raw <tool_call> tokens have already reached prev_text, the corrective shorter value yielded here is discarded, and later chunks are sliced against the longer raw-XML string. In practice the UI keeps the leaked markup and appends the post-tool answer after it instead of replacing it.
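The append-only contract the reviewer describes can be sketched with a hypothetical helper (names are illustrative, not the repo's actual code): when healing shrinks the cumulative text, the consumer needs an explicit replace signal instead of a blind slice.

```python
def emit_delta(prev_text: str, cumulative: str) -> tuple[str, bool]:
    """Compute the next UI update from a cumulative content string.

    Returns (text, replace). replace=False means text is an append-only
    delta; replace=True means healing shrank the cumulative string, so the
    client must swap the whole bubble rather than append.
    """
    if cumulative.startswith(prev_text):
        # Normal case: new text is a strict extension of what was emitted.
        return cumulative[len(prev_text):], False
    # Healed text rewrote history (e.g. raw <tool_call> markup was stripped
    # retroactively): slicing against the longer raw string would be wrong.
    return cumulative, True
```

With this shape, the `cumulative[len(prev_text):]` slice in `gguf_tool_stream()` only runs on the append path, and the leaked-markup case falls through to a full replace.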
```python
_effective_timeout = (
    None if tool_call_timeout >= 9999 else tool_call_timeout
)
result = execute_tool(
    tool_name,
```
Reject healed tool names that were not enabled for this request
The non-streaming path only exposes the caller-selected tools (studio/backend/routes/inference.py:1031-1038), but this healed-stream branch executes whatever tool_name was parsed from raw_content. Because the streaming payload built above still omits tools, a model can emit <function=python> or <function=terminal> even when the request enabled only web_search, and this code will run it. That bypasses the request's tool allow-list and can execute a materially more powerful tool than the caller permitted.
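One way to enforce the allow-list the reviewer asks for is to filter healed calls against the request's enabled tools before execution. This is a minimal sketch with a hypothetical helper name, not the repo's implementation:

```python
def filter_healed_tool_calls(parsed_calls, enabled_tools):
    """Keep only healed tool calls whose name the caller actually enabled.

    parsed_calls: list of dicts from the text parser, each with a "name" key.
    enabled_tools: iterable of tool names from the request, or None.
    """
    allowed = set(enabled_tools or ())
    # Anything parsed from raw model text that is not in the request's
    # allow-list (e.g. <function=terminal> when only web_search was enabled)
    # is dropped instead of executed.
    return [call for call in parsed_calls if call.get("name") in allowed]
```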
```python
# stop the stream, parse & execute the tool calls, append results
# to the conversation, and re-stream -- up to remaining
# iterations from the main loop above.
remaining_iters = max(1, max_tool_iterations - iteration)
```
Leave one stream iteration for the post-tool assistant reply
This budget calculation does not reserve a final synthesis pass after a healed stream-side tool call. If the non-streaming loop breaks on iteration == max_tool_iterations - 1, remaining_iters becomes 1; a healed tool call in that only stream iteration hits the continue at 2069-2071, exhausts the loop, and the generator returns without ever asking the model for the assistant message that uses the last tool result. Conversations that reach the limit this way end on tool output only.
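The reservation the reviewer suggests can be sketched as splitting the remaining budget so the last pass never executes a tool. Hypothetical helper, assuming the same `max_tool_iterations`/`iteration` counters as the quoted code:

```python
def stream_budget(max_tool_iterations: int, iteration: int) -> tuple[int, int]:
    """Split the remaining iteration budget for the streaming pass.

    Returns (tool_iters, synthesis_iters): healed tool calls may consume at
    most tool_iters passes, and one pass is always reserved so the model can
    produce the assistant message that uses the last tool result.
    """
    remaining = max(1, max_tool_iterations - iteration)
    # When only one iteration is left, tool execution is disallowed and the
    # pass becomes synthesis-only instead of ending on tool output.
    return remaining - 1, 1
```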
Force-pushed 83ce0de to faa760b
💡 Codex Review
Reviewed commit: b305b81e7b
```python
# stop the stream, parse & execute the tool calls, append results
# to the conversation, and re-stream -- up to remaining
# iterations from the main loop above.
remaining_iters = max(1, max_tool_iterations - iteration)
```
Handle the documented max_tool_calls_per_message=0 case
max_tool_calls_per_message is explicitly documented as 0 = disabled in studio/backend/models/inference.py:325-328, and the route forwards that value unchanged at studio/backend/routes/inference.py:1056-1058. With 0, for iteration in range(max_tool_iterations) never runs, so iteration is never bound and this expression raises UnboundLocalError before any response is streamed. Requests that use the documented disable setting now 500 instead of falling back to a normal completion.
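The UnboundLocalError can be avoided by pre-binding the loop variable before the `range()` loop, so an empty range still leaves a usable value. A minimal sketch with the loop body elided:

```python
def run_tool_loop(max_tool_iterations: int) -> int:
    """Compute remaining iterations even when the tool loop never runs.

    max_tool_iterations == 0 is documented as "disabled"; with an empty
    range the for-loop body never executes, so the loop variable must be
    bound beforehand or the expression below raises UnboundLocalError.
    """
    iteration = 0  # safe default for the empty-range case
    for iteration in range(max_tool_iterations):
        pass  # non-streaming tool loop body elided
    return max(1, max_tool_iterations - iteration)
```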
```python
cumulative = ""
_last_emitted = ""
```
Reset streamed text before starting the next assistant turn
cumulative and _last_emitted are initialized outside the _stream_iter loop, but a healed tool call appends an assistant/tool turn to conversation at 1985-1990 and then starts a fresh completion on the next iteration. Because these accumulators are preserved, any pre-tool prose already streamed in the first pass is concatenated onto the next assistant reply, so a response like "Let me check that" + tool call + final answer is rendered as one merged bubble instead of only the post-tool assistant message. The non-streaming tool path does not expose content_text from the tool-call turn, so this changes visible behavior for any model that emits text before the healed XML.
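The reset the reviewer wants amounts to scoping the accumulators to each assistant turn rather than the whole generator. A simplified, hypothetical sketch of the streaming shape:

```python
def stream_assistant_turns(turns):
    """Yield one rendered bubble per assistant turn.

    turns: an iterable of chunk lists, one list per completion pass.
    Resetting the accumulators inside the loop keeps pre-tool prose from an
    earlier pass ("Let me check that") out of the post-tool reply.
    """
    for chunks in turns:
        cumulative = ""      # per-turn, not once for the whole generator
        _last_emitted = ""
        for piece in chunks:
            cumulative += piece
        if cumulative != _last_emitted:
            yield cumulative
```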
Force-pushed b305b81 to 37d96c5
💡 Codex Review
Reviewed commit: 0583472360
```python
if auto_heal_tool_calls and _TOOL_OPEN_RE.search(raw_content):
    tool_calls = self._parse_tool_calls_from_text(raw_content)
```
Don't auto-execute literal <function=...> text in the final reply
This new raw_content check treats any streamed mention of <tool_call> or <function= as a real tool invocation. Because _parse_tool_calls_from_text() accepts bare <function=name> blocks with no closing tags or parameters (studio/backend/core/inference/llama_cpp.py:1226-1291), a final answer that merely shows the XML syntax after an earlier tool round—e.g. documentation or a code example containing <function=web_search>—will be parsed and executed here instead of being displayed. The old final streaming pass only stripped markup, so this introduces unintended searches/command runs and can send the model back into another tool iteration on otherwise normal explanatory answers.
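One way to tighten the trigger, per the reviewer's point, is to heal only well-formed, closed blocks so a bare `<function=name>` mention in explanatory text stays visible. This is a hedged sketch with an assumed regex, not the repo's `_TOOL_OPEN_RE`:

```python
import re

# Require a complete, closed block before healing. A bare "<function=name>"
# with no closing tag (e.g. syntax shown in documentation) does not match.
_CLOSED_TOOL_CALL_RE = re.compile(
    r"<tool_call>.*?</tool_call>|<function=[\w.-]+>.*?</function>",
    re.S,
)

def should_heal(raw_content: str) -> bool:
    """Return True only when raw_content contains a closed tool-call block."""
    return bool(_CLOSED_TOOL_CALL_RE.search(raw_content))
```

This still catches genuine healed calls (which the model closes before the stop sequence fires) while letting a final answer that merely quotes the XML syntax render as text.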
```python
if tool_name == "web_search":
    status_text = f"Searching: {arguments.get('query', '')}"
elif tool_name == "python":
    preview = (
        (arguments.get("code") or "")
```
Normalize healed tool arguments before using .get()
The streaming-heal path assumes parsed arguments are always mappings, but _parse_tool_calls_from_text() preserves non-object JSON arguments values from <tool_call>{...}</tool_call> payloads (for example an array or number at studio/backend/core/inference/llama_cpp.py:1214-1221). In this branch those values fall through arguments = raw_args, and the first status formatter immediately does arguments.get(...); a malformed streamed tool call therefore raises AttributeError and aborts the SSE response instead of failing gracefully. This regression is specific to the new healed-stream execution path, since the old final streaming pass never dereferenced parsed arguments at all.
Force-pushed 0583472 to 8a6d4c4
💡 Codex Review
Reviewed commit: 8a6d4c46a9
```python
stream_payload["stop"] = stop
_stop = list(stop) if stop else []
if auto_heal_tool_calls:
    _stop += ["<tool_call>", "<function="]
```
Avoid stopping on literal <function= content
Adding "<tool_call>" and "<function=" as unconditional stop strings for the final synthesis stream can truncate legitimate answers that merely show these tokens as text (for example, when the user asks for an XML/tool-call example after a tool run). In that case no tool invocation is intended, but generation halts at the marker and returns a partial/empty reply even though the prior non-streaming pass had already produced a normal assistant message.
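One possible shape for the conditional the reviewer implies: only add the markers as stop strings when the current pass can actually heal them into executions. Hypothetical helper and flag names:

```python
def build_stop_list(stop, auto_heal_tool_calls, pass_executes_tools):
    """Assemble stop strings for one completion pass.

    pass_executes_tools distinguishes a tool-capable pass from a pure
    synthesis pass; the latter must be able to render <tool_call> or
    <function= as literal text (e.g. when the user asks for an example).
    """
    _stop = list(stop) if stop else []
    if auto_heal_tool_calls and pass_executes_tools:
        _stop += ["<tool_call>", "<function="]
    return _stop
```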
Two fixes for raw <tool_call> XML appearing in chat bubbles:

1. Non-streaming tool loop: llama-server can return BOTH structured tool_calls AND raw <tool_call> XML in the content field at the same time. Previously the XML stripping only ran in the fallback path (when no structured tool_calls were found). Now it always strips tool-call XML from content_text when any tool calls are present, regardless of whether they came from the structured field or the XML fallback parser.

2. Final streaming pass: add "<tool_call>" and "<function=" as stop sequences so the model cannot emit tool-call XML. Also use open-ended strip patterns during streaming (not just on final flush) as a safety net.
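The open-ended strip pattern mentioned above can be sketched roughly as follows; this is a simplified stand-in for the PR's actual _strip_tool_markup, with an assumed regex:

```python
import re

# The "$" alternative makes the pattern open-ended: a block whose closing
# tag never arrived (truncated or malformed stream) is still removed.
_TOOL_XML_RE = re.compile(r"<tool_call>.*?(?:</tool_call>|$)", re.S)

def strip_tool_markup(text: str) -> str:
    """Drop tool-call XML from assistant content, closed or not.

    Runs even when structured tool_calls were returned alongside the text,
    since llama-server can emit both representations at once.
    """
    return _TOOL_XML_RE.sub("", text).strip()
```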
Force-pushed 8a6d4c4 to 4463fa5
Problem
When tool calling is enabled, the model outputs raw <tool_call><function=terminal><parameter=command>... XML directly into the chat bubble instead of the tool being executed. This happens with both quantized and BF16 GGUF models (Qwen3.5-4B, Qwen3.5-9B tested).

The non-streaming tool loop works correctly. But after it finishes, the final streaming pass does not include tools in the payload, so the model falls back to generating raw <tool_call> XML from memory of the conversation. The existing _strip_tool_markup regex tries to clean this up mid-stream but fails on malformed/incomplete closing tags, so raw XML leaks into the UI.

Fix

Instead of stripping or stopping at the XML, heal it into real tool executions:
- Detect <tool_call> or <function= patterns mid-stream
- Parse them with the existing _parse_tool_calls_from_text parser
- Execute the calls and emit tool_start/tool_end events (so the frontend shows them correctly in the tool output panel)

This means tool calls work correctly regardless of whether llama-server returns them as structured tool_calls (non-streaming loop) or as raw XML in the text output (streaming pass).

Testing