Running the quantized model

#1
by bver - opened

Hi, I have a stupid question:

What is the best way to run this model?
I tried:

  • llama_cpp (Llama class) ->
    llama_model_load: error loading model: error loading model architecture: unknown model architecture: 'llama-embed'
  • AutoModel.from_pretrained("mradermacher/llama-nemotron-embed-1b-v2-GGUF", gguf_file="llama-nemotron-embed-1b-v2.Q4_K_M.gguf") ->
    ValueError: GGUF model with architecture llama-embed is not supported yet.
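For reference, both errors point at the `general.architecture` string stored in the file's GGUF header. A minimal, stdlib-only sketch (based on the public GGUF spec; not an official tool) can read that key back out of the file, so you can confirm what architecture the quantized file actually declares:

```python
import struct

# Byte sizes of GGUF scalar value types (type id -> size in bytes).
_SIMPLE_SIZES = {0: 1, 1: 1, 2: 2, 3: 2, 4: 4, 5: 4, 6: 4, 7: 1, 10: 8, 11: 8, 12: 8}

def _read_string(f):
    # GGUF string: uint64 length followed by UTF-8 bytes.
    (n,) = struct.unpack("<Q", f.read(8))
    return f.read(n).decode("utf-8")

def _skip_value(f, vtype):
    # Skip over a metadata value we don't care about.
    if vtype == 8:                                   # string
        _read_string(f)
    elif vtype == 9:                                 # array: item type, count, items
        (itype,) = struct.unpack("<I", f.read(4))
        (count,) = struct.unpack("<Q", f.read(8))
        for _ in range(count):
            _skip_value(f, itype)
    else:
        f.read(_SIMPLE_SIZES[vtype])

def gguf_architecture(path):
    """Return the value of the 'general.architecture' metadata key, or None."""
    with open(path, "rb") as f:
        assert f.read(4) == b"GGUF", "not a GGUF file"
        version, n_tensors, n_kv = struct.unpack("<IQQ", f.read(20))
        for _ in range(n_kv):
            key = _read_string(f)
            (vtype,) = struct.unpack("<I", f.read(4))
            if key == "general.architecture" and vtype == 8:
                return _read_string(f)
            _skip_value(f, vtype)
    return None
```

Running `gguf_architecture("llama-nemotron-embed-1b-v2.Q4_K_M.gguf")` should return `'llama-embed'`, which matches the error messages above: both llama.cpp 0.3.16 and Transformers 4.57.3 simply don't know that architecture id yet.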

My setup:
PyTorch version: 2.9.1+cu130
Transformers version: 4.57.3
llama_cpp_python version: 0.3.16

Thank you for your help in advance.
Pavel

