llama.cpp Error: Unknown (built-in) filter 'items' for type String
srv log_server_r: done request: POST /v1/chat/completions 127.0.0.1 500
Template supports tool calls but does not natively describe tools. The fallback behaviour used may produce bad results, inspect prompt w/ --verbose & consider overriding the template.
srv operator(): got exception: {"error":{"code":500,"message":"\n------------\nWhile executing FilterExpression at line 120, column 73 in source:\n..._name, args_value in tool_call.arguments|items %}β΅ {{- '<...\n ^\nError: Unknown (built-in) filter 'items' for type String","type":"server_error"}}
srv log_server_r: done request: POST /v1/chat/completions 127.0.0.1 500
I am getting this error, presumably from the prompt template in this repo.
I had the exact same issue.
It was solved by updating my llama.cpp image.
Hi there, please re-download the quants and update your llama.cpp image! @fullstack
This should fix it: https://github.com/ggml-org/llama.cpp/pull/19870
llama_model_load: error loading model: error loading model architecture: unknown model architecture: 'qwen35moe'
llama_model_load_from_file_impl: failed to load model
common_init_from_params: failed to load model '/LLM/Qwen3.5-35B-A3B-UD-Q4_K_XL.gguf'
srv load_model: failed to load model, '/LLM/Qwen3.5-35B-A3B-UD-Q4_K_XL.gguf'
I am using llama.cpp b8145 with the Vulkan backend.
.\llama-server.exe --port 9999 --device CUDA0 -ngl 99 --temp 0.6 --min-p 0.0 --top-k 20 --top-p 0.95 --jinja -ub 2048 -b 2048 -fa on -m D:\Qwen3.5-35B-A3B-UD-Q3_K_XL.gguf -c 65536 --alias local -ctk q8_0 -ctv q8_0 -t 12 --n-cpu-moe 30 -fit off
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 4070, compute capability 8.9, VMM: yes
main: n_parallel is set to auto, using n_parallel = 4 and kv_unified = true
build: 8148 (244641955) with MSVC 19.38.33135.0 for x64
system info: n_threads = 12, n_threads_batch = 12, total_threads = 16
system_info: n_threads = 12 (n_threads_batch = 12) / 16 | CUDA : ARCHS = 890 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | FA_ALL_QUANTS = 1 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | AVX512 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
init: using 15 threads for HTTP server
start: binding port with default address family
main: loading model
srv load_model: loading model 'D:\Qwen3.5-35B-A3B-UD-Q3_K_XL.gguf'
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 4070) (0000:01:00.0) - 11090 MiB free
gguf_init_from_file_impl: failed to read magic
llama_model_load: error loading model: llama_model_loader: failed to load model from D:\Qwen3.5-35B-A3B-UD-Q3_K_XL.gguf
llama_model_load_from_file_impl: failed to load model
common_init_from_params: failed to load model 'D:\Qwen3.5-35B-A3B-UD-Q3_K_XL.gguf'
srv load_model: failed to load model, 'D:\Qwen3.5-35B-A3B-UD-Q3_K_XL.gguf'
srv operator (): operator (): cleaning up before exit...
main: exiting due to model loading error
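For what it's worth, the `gguf_init_from_file_impl: failed to read magic` line means the file doesn't start with the 4-byte `GGUF` signature, which usually points to a truncated or corrupted download rather than a llama.cpp bug. A quick sanity check, sketched for a Unix-like shell (the helper name is mine):

```shell
# Sanity-check a GGUF download: every valid file begins with the
# 4-byte ASCII magic "GGUF". A truncated or corrupted download won't.
check_gguf() {
    if [ "$(head -c 4 "$1")" = "GGUF" ]; then
        echo "magic OK"
    else
        echo "bad magic: likely a truncated or corrupted download"
    fi
}

# Example: check_gguf /LLM/Qwen3.5-35B-A3B-UD-Q4_K_XL.gguf
```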
latest llama.cpp build
Hello, could the backslashes in the path be involved? Try -m D:/Qwen3.5-35B-A3B-UD-Q3_K_XL.gguf instead, maybe.
Also, to make sure the other parameters don't interfere, you could use the new --fit on (which is on by default, I think):
.\llama-server.exe --port 9999 --device CUDA0 --fit on --temp 0.6 --min-p 0.0 --top-k 20 --top-p 0.95 --jinja -ub 2048 -b 2048 -fa on -m D:/Qwen3.5-35B-A3B-UD-Q3_K_XL.gguf
And it's probably a good idea to verify the downloaded model against the published SHA checksum 🙂
Good luck
++
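To expand on the checksum suggestion above, here is a minimal sketch for a Unix-like shell (the helper name is mine; substitute the SHA-256 digest published on the model page):

```shell
# Compare a file's SHA-256 digest against the value published on the
# model page. Prints "checksum OK" or "checksum MISMATCH".
verify_sha256() {  # $1 = file, $2 = expected hex digest
    actual=$(sha256sum "$1" | cut -d' ' -f1)
    if [ "$actual" = "$2" ]; then
        echo "checksum OK"
    else
        echo "checksum MISMATCH"
    fi
}

# Example (digest is a placeholder -- use the real one from the repo):
# verify_sha256 Qwen3.5-35B-A3B-UD-Q3_K_XL.gguf <published-sha256>
```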
@CHNtentes : see https://github.com/ggml-org/llama.cpp/issues/19868
Looks like your situation could be related.
Thanks for your help :)
https://github.com/ggml-org/llama.cpp/pull/19870 as well.
It looks like it may be addressed as of release tag b8149: https://github.com/ggml-org/llama.cpp/releases/tag/b8149
It looks like you're running llama.cpp build 8148, so you should be OK if you try b8149 or later.
It's working normally with the latest version. Performance with Q3_K_XL on a 4070 12 GB + 32 GB DDR5:
short prompt:
prompt eval time = 464.82 ms / 13 tokens ( 35.76 ms per token, 27.97 tokens per second)
eval time = 5883.79 ms / 367 tokens ( 16.03 ms per token, 62.37 tokens per second)
long prompt:
prompt eval time = 12036.66 ms / 20649 tokens ( 0.58 ms per token, 1715.51 tokens per second)
eval time = 40254.51 ms / 2203 tokens ( 18.27 ms per token, 54.73 tokens per second)