[Bug] Hermes Lora chat test bug: does not have a {% if add_generation_prompt %} for generation purposes #4150

@mykeehu

Description

  1. Did you update? (pip install --upgrade unsloth unsloth_zoo) -> Yes
  2. Colab, Kaggle, or local/cloud? -> Local
  3. Number of GPUs used (check nvidia-smi) -> 1
  4. Which notebook? Please link! -> Llama Factory
  5. Which Unsloth, TRL, Transformers, and PyTorch versions? -> Unsloth 2026.2.1 (Fast Llama patching), Transformers 4.57.6, Torch 2.10.0+cu130, CUDA 8.6, CUDA Toolkit 13.0, Triton 3.6.0
  6. Which trainer? SFTTrainer, GRPOTrainer, etc. -> SFT (Llama Factory)
[INFO|configuration_utils.py:765] 2026-03-03 22:08:38,762 >> loading configuration file config.json from cache at C:\Users\Mykee\.cache\huggingface\hub\models--unsloth--Hermes-3-Llama-3.1-8B\snapshots\d90c07b930d73927ba6798f76bf611d857234229\config.json
[INFO|configuration_utils.py:839] 2026-03-03 22:08:38,762 >> Model config LlamaConfig {
  "architectures": [
    "LlamaForCausalLM"
  ],
  "attention_bias": false,
  "attention_dropout": 0.0,
  "bos_token_id": 128000,
  "dtype": "bfloat16",
  "eos_token_id": 128040,
  "head_dim": 128,
  "hidden_act": "silu",
  "hidden_size": 4096,
  "initializer_range": 0.02,
  "intermediate_size": 14336,
  "max_position_embeddings": 131072,
  "mlp_bias": false,
  "model_type": "llama",
  "num_attention_heads": 32,
  "num_hidden_layers": 32,
  "num_key_value_heads": 8,
  "pretraining_tp": 1,
  "rms_norm_eps": 1e-05,
  "rope_scaling": {
    "factor": 8.0,
    "high_freq_factor": 4.0,
    "low_freq_factor": 1.0,
    "original_max_position_embeddings": 8192,
    "rope_type": "llama3"
  },
  "rope_theta": 500000.0,
  "tie_word_embeddings": false,
  "transformers_version": "4.57.6",
  "use_cache": true,
  "vocab_size": 128256
}

[INFO|2026-03-03 22:08:39] llamafactory.data.template:144 >> Add <|im_start|> to stop words.
[WARNING|2026-03-03 22:08:39] llamafactory.data.template:149 >> New tokens have been added, make sure `resize_vocab` is True.
[INFO|2026-03-03 22:08:39] llamafactory.model.model_utils.kv_cache:144 >> KV cache is enabled for faster generation.
Unsloth: WARNING `trust_remote_code` is True.
Are you certain you want to do remote code execution?
==((====))==  Unsloth 2026.2.1: Fast Llama patching. Transformers: 4.57.6.
   \\   /|    NVIDIA GeForce RTX 3090. Num GPUs = 1. Max memory: 24.0 GB. Platform: Windows.
O^O/ \_/ \    Torch: 2.10.0+cu130. CUDA: 8.6. CUDA Toolkit: 13.0. Triton: 3.6.0
\        /    Bfloat16 = TRUE. FA [Xformers = None. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
[INFO|modeling_utils.py:1172] 2026-03-03 22:08:45,501 >> loading weights file model.safetensors from cache at C:\Users\Mykee\.cache\huggingface\hub\models--unsloth--Hermes-3-Llama-3.1-8B\snapshots\d90c07b930d73927ba6798f76bf611d857234229\model.safetensors.index.json
[INFO|modeling_utils.py:2341] 2026-03-03 22:08:45,501 >> Instantiating LlamaForCausalLM model under default dtype torch.bfloat16.
[INFO|configuration_utils.py:986] 2026-03-03 22:08:45,501 >> Generate config GenerationConfig {
  "bos_token_id": 128000,
  "eos_token_id": 128040
}

Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████| 4/4 [00:06<00:00,  1.75s/it]
[INFO|configuration_utils.py:941] 2026-03-03 22:08:53,074 >> loading configuration file generation_config.json from cache at C:\Users\Mykee\.cache\huggingface\hub\models--unsloth--Hermes-3-Llama-3.1-8B\snapshots\d90c07b930d73927ba6798f76bf611d857234229\generation_config.json
[INFO|configuration_utils.py:986] 2026-03-03 22:08:53,074 >> Generate config GenerationConfig {
  "bos_token_id": 128000,
  "do_sample": true,
  "eos_token_id": 128040,
  "temperature": 0.6,
  "top_p": 0.9
}

[INFO|dynamic_module_utils.py:423] 2026-03-03 22:08:53,224 >> Could not locate the custom_generate/generate.py inside unsloth/Hermes-3-Llama-3.1-8B.
Traceback (most recent call last):
  File "E:\LlamaFactory\venv\Lib\site-packages\gradio\queueing.py", line 849, in process_events
    response = await route_utils.call_process_api(
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "E:\LlamaFactory\venv\Lib\site-packages\gradio\route_utils.py", line 354, in call_process_api
    output = await app.get_blocks().process_api(
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "E:\LlamaFactory\venv\Lib\site-packages\gradio\blocks.py", line 2191, in process_api
    result = await self.call_function(
             ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "E:\LlamaFactory\venv\Lib\site-packages\gradio\blocks.py", line 1710, in call_function
    prediction = await utils.async_iteration(iterator)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "E:\LlamaFactory\venv\Lib\site-packages\gradio\utils.py", line 760, in async_iteration
    return await anext(iterator)
           ^^^^^^^^^^^^^^^^^^^^^
  File "E:\LlamaFactory\venv\Lib\site-packages\gradio\utils.py", line 751, in __anext__
    return await anyio.to_thread.run_sync(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "E:\LlamaFactory\venv\Lib\site-packages\anyio\to_thread.py", line 63, in run_sync
    return await get_async_backend().run_sync_in_worker_thread(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "E:\LlamaFactory\venv\Lib\site-packages\anyio\_backends\_asyncio.py", line 2502, in run_sync_in_worker_thread
    return await future
           ^^^^^^^^^^^^
  File "E:\LlamaFactory\venv\Lib\site-packages\anyio\_backends\_asyncio.py", line 986, in run
    result = context.run(func, *args)
             ^^^^^^^^^^^^^^^^^^^^^^^^
  File "E:\LlamaFactory\venv\Lib\site-packages\gradio\utils.py", line 734, in run_sync_iterator_async
    return next(iterator)
           ^^^^^^^^^^^^^^
  File "E:\LlamaFactory\venv\Lib\site-packages\gradio\utils.py", line 898, in gen_wrapper
    response = next(iterator)
               ^^^^^^^^^^^^^^
  File "E:\LlamaFactory\src\llamafactory\webui\chatter.py", line 158, in load_model
    super().__init__(args)
  File "E:\LlamaFactory\src\llamafactory\chat\chat_model.py", line 53, in __init__
    self.engine: BaseEngine = HuggingfaceEngine(model_args, data_args, finetuning_args, generating_args)
                              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "E:\LlamaFactory\src\llamafactory\chat\hf_engine.py", line 59, in __init__
    self.model = load_model(
                 ^^^^^^^^^^^
  File "E:\LlamaFactory\src\llamafactory\model\loader.py", line 189, in load_model
    model = init_adapter(config, model, model_args, finetuning_args, is_trainable)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "E:\LlamaFactory\src\llamafactory\model\adapter.py", line 360, in init_adapter
    model = _setup_lora_tuning(
            ^^^^^^^^^^^^^^^^^^^
  File "E:\LlamaFactory\src\llamafactory\model\adapter.py", line 208, in _setup_lora_tuning
    model = load_unsloth_peft_model(config, model_args, finetuning_args, is_trainable=is_trainable)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "E:\LlamaFactory\src\llamafactory\model\model_utils\unsloth.py", line 96, in load_unsloth_peft_model
    model, _ = FastLanguageModel.from_pretrained(**unsloth_kwargs)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "E:\LlamaFactory\venv\Lib\site-packages\unsloth\models\loader.py", line 602, in from_pretrained
    model, tokenizer = dispatch_model.from_pretrained(
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "E:\LlamaFactory\venv\Lib\site-packages\unsloth\models\llama.py", line 2490, in from_pretrained
    tokenizer = load_correct_tokenizer(
                ^^^^^^^^^^^^^^^^^^^^^^^
  File "E:\LlamaFactory\venv\Lib\site-packages\unsloth\tokenizer_utils.py", line 622, in load_correct_tokenizer
    chat_template = fix_chat_template(tokenizer)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "E:\LlamaFactory\venv\Lib\site-packages\unsloth\tokenizer_utils.py", line 734, in fix_chat_template
    raise RuntimeError(
RuntimeError: Unsloth: The tokenizer `saves\Llama-3.1-8B-Instruct\lora\train_2026-03-03-20-56-39-Hermes-v1`
does not have a {% if add_generation_prompt %} for generation purposes.
Please file a bug report to the maintainers of `saves\Llama-3.1-8B-Instruct\lora\train_2026-03-03-20-56-39-Hermes-v1` - thanks!

I can't load the LoRA adapter into the test chat with the Unsloth optimizer for the Unsloth Hermes 3 8B model. I trained it with a ChatML template.
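As a possible workaround (a sketch, not an official Unsloth fix): the error comes from Unsloth's fix_chat_template, which refuses to load a tokenizer whose chat_template lacks an {% if add_generation_prompt %} branch. Appending a ChatML-style generation branch to the chat_template stored in the adapter folder's tokenizer_config.json may get past the check. The helper and suffix below are illustrative, not from Unsloth:

```python
# Sketch: add a ChatML generation-prompt branch to a chat template that
# lacks one. The suffix emits the assistant header only when the caller
# requests a generation prompt, which is what Unsloth's check looks for.
CHATML_GENERATION_SUFFIX = (
    "{% if add_generation_prompt %}"
    "{{ '<|im_start|>assistant\n' }}"
    "{% endif %}"
)

def ensure_generation_prompt(template: str) -> str:
    """Return the template unchanged if it already handles
    add_generation_prompt; otherwise append a ChatML-style branch."""
    if "add_generation_prompt" in template:
        return template  # already passes Unsloth's check
    return template + CHATML_GENERATION_SUFFIX
```

To apply it, one could load the adapter's tokenizer_config.json (e.g. under saves\Llama-3.1-8B-Instruct\lora\...), run its "chat_template" value through ensure_generation_prompt, and write the file back before loading the model in the chat tab.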
