Running the quantized model
by bver (opened)
Hi, I have a stupid question:
What is the best way to run this model?
I tried:
- llama_cpp (`Llama` class) -> `llama_model_load: error loading model: error loading model architecture: unknown model architecture: 'llama-embed'`
- transformers' `AutoModel.from_pretrained("mradermacher/llama-nemotron-embed-1b-v2-GGUF", gguf_file="llama-nemotron-embed-1b-v2.Q4_K_M.gguf")` -> `ValueError: GGUF model with architecture llama-embed is not supported yet.`
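For reference, here is a minimal sketch of the two attempts above. The repo and file names are taken from the errors quoted; the `embedding=True` flag and the local model path are assumptions on my part, and both calls are expected to fail with the quoted errors as long as neither library recognizes the `llama-embed` architecture.

```python
# Sketch of the two loading attempts described above.
# Both are expected to raise, since the 'llama-embed' architecture
# is not supported by llama.cpp or by transformers' GGUF loader.

REPO_ID = "mradermacher/llama-nemotron-embed-1b-v2-GGUF"
GGUF_FILE = "llama-nemotron-embed-1b-v2.Q4_K_M.gguf"


def try_llama_cpp(model_path: str):
    # Attempt 1: llama-cpp-python, loading a local GGUF file.
    # Fails with: unknown model architecture: 'llama-embed'
    from llama_cpp import Llama
    return Llama(model_path=model_path, embedding=True)  # embedding=True is an assumption


def try_transformers():
    # Attempt 2: transformers' GGUF support, loading straight from the Hub.
    # Fails with: ValueError: GGUF model with architecture llama-embed
    # is not supported yet.
    from transformers import AutoModel
    return AutoModel.from_pretrained(REPO_ID, gguf_file=GGUF_FILE)
```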
My setup:
PyTorch version: 2.9.1+cu130
Transformers version: 4.57.3
llama_cpp_python version: 0.3.16
Thank you for your help in advance.
Pavel