llama.cpp Error: Unknown (built-in) filter 'items' for type String
srv log_server_r: done request: POST /v1/chat/completions 127.0.0.1 500
Template supports tool calls but does not natively describe tools. The fallback behaviour used may produce bad results, inspect prompt w/ --verbose & consider overriding the template.
srv operator(): got exception: {"error":{"code":500,"message":"\n------------\nWhile executing FilterExpression at line 120, column 73 in source:\n..._name, args_value in tool_call.arguments|items %}β΅ {{- '<...\n ^\nError: Unknown (built-in) filter 'items' for type String","type":"server_error"}}
srv log_server_r: done request: POST /v1/chat/completions 127.0.0.1 500
I am getting this error, presumably from the prompt template in this repo.
I had the exact same issue.
It was solved by updating my llama.cpp image.
Hi there, please re-download the quants and update your llama.cpp image! @fullstack
This should fix it: https://github.com/ggml-org/llama.cpp/pull/19870
llama_model_load: error loading model: error loading model architecture: unknown model architecture: 'qwen35moe'
llama_model_load_from_file_impl: failed to load model
common_init_from_params: failed to load model '/LLM/Qwen3.5-35B-A3B-UD-Q4_K_XL.gguf'
srv load_model: failed to load model, '/LLM/Qwen3.5-35B-A3B-UD-Q4_K_XL.gguf'
I am using llama.cpp b8145 with the Vulkan backend.
.\llama-server.exe --port 9999 --device CUDA0 -ngl 99 --temp 0.6 --min-p 0.0 --top-k 20 --top-p 0.95 --jinja -ub 2048 -b 2048 -fa on -m D:\Qwen3.5-35B-A3B-UD-Q3_K_XL.gguf -c 65536 --alias local -ctk q8_0 -ctv q8_0 -t 12 --n-cpu-moe 30 -fit off
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 4070, compute capability 8.9, VMM: yes
main: n_parallel is set to auto, using n_parallel = 4 and kv_unified = true
build: 8148 (244641955) with MSVC 19.38.33135.0 for x64
system info: n_threads = 12, n_threads_batch = 12, total_threads = 16
system_info: n_threads = 12 (n_threads_batch = 12) / 16 | CUDA : ARCHS = 890 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | FA_ALL_QUANTS = 1 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | AVX512 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
init: using 15 threads for HTTP server
start: binding port with default address family
main: loading model
srv load_model: loading model 'D:\Qwen3.5-35B-A3B-UD-Q3_K_XL.gguf'
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 4070) (0000:01:00.0) - 11090 MiB free
gguf_init_from_file_impl: failed to read magic
llama_model_load: error loading model: llama_model_loader: failed to load model from D:\Qwen3.5-35B-A3B-UD-Q3_K_XL.gguf
llama_model_load_from_file_impl: failed to load model
common_init_from_params: failed to load model 'D:\Qwen3.5-35B-A3B-UD-Q3_K_XL.gguf'
srv load_model: failed to load model, 'D:\Qwen3.5-35B-A3B-UD-Q3_K_XL.gguf'
srv operator (): operator (): cleaning up before exit...
main: exiting due to model loading error
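For what it's worth, the `gguf_init_from_file_impl: failed to read magic` line means the file doesn't start with the 4-byte `GGUF` signature, which usually points to a truncated or corrupted download rather than a llama.cpp bug. A quick sanity check, sketched for a Unix-like shell (the helper name is mine):

```shell
# Sanity-check a GGUF download: every valid file begins with the
# 4-byte ASCII magic "GGUF". A truncated or corrupted download won't.
check_gguf() {
    if [ "$(head -c 4 "$1")" = "GGUF" ]; then
        echo "magic OK"
    else
        echo "bad magic: likely a truncated or corrupted download"
    fi
}

# Example: check_gguf /LLM/Qwen3.5-35B-A3B-UD-Q4_K_XL.gguf
```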
latest llama.cpp build
Hello, could the backslashes in the path be involved? Try -m D:/Qwen3.5-35B-A3B-UD-Q3_K_XL.gguf instead, maybe.
Also, to make sure the other parameters don't interfere, you could use the new --fit on (which is on by default, I think):
.\llama-server.exe --port 9999 --device CUDA0 --fit on --temp 0.6 --min-p 0.0 --top-k 20 --top-p 0.95 --jinja -ub 2048 -b 2048 -fa on -m D:/Qwen3.5-35B-A3B-UD-Q3_K_XL.gguf
And it's probably a good idea to verify the downloaded model against the published SHA checksum 🙂
Good luck
++
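To expand on the checksum suggestion above, here is a minimal sketch for a Unix-like shell (the helper name is mine; substitute the SHA-256 digest published on the model page):

```shell
# Compare a file's SHA-256 digest against the value published on the
# model page. Prints "checksum OK" or "checksum MISMATCH".
verify_sha256() {  # $1 = file, $2 = expected hex digest
    actual=$(sha256sum "$1" | cut -d' ' -f1)
    if [ "$actual" = "$2" ]; then
        echo "checksum OK"
    else
        echo "checksum MISMATCH"
    fi
}

# Example (digest is a placeholder -- use the real one from the repo):
# verify_sha256 Qwen3.5-35B-A3B-UD-Q3_K_XL.gguf <published-sha256>
```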
@CHNtentes : see https://github.com/ggml-org/llama.cpp/issues/19868
Looks like your situation could be related.
Thanks for your help :)
https://github.com/ggml-org/llama.cpp/pull/19870 as well.
It looks like it may be addressed as of release tag b8149: https://github.com/ggml-org/llama.cpp/releases/tag/b8149
It looks like you're running llama.cpp build 8148, so you should be OK if you try b8149 or later.
It's working normally with the latest version. Performance with Q3_K_XL on a 4070 12 GB + 32 GB DDR5:
short prompt:
prompt eval time = 464.82 ms / 13 tokens ( 35.76 ms per token, 27.97 tokens per second)
eval time = 5883.79 ms / 367 tokens ( 16.03 ms per token, 62.37 tokens per second)
long prompt:
prompt eval time = 12036.66 ms / 20649 tokens ( 0.58 ms per token, 1715.51 tokens per second)
eval time = 40254.51 ms / 2203 tokens ( 18.27 ms per token, 54.73 tokens per second)