Running the quantized model

#1
by bver - opened

Hi, I have a stupid question:

What is the best way to run this model?
I tried:

  • llama_cpp (Llama class) ->
    llama_model_load: error loading model: error loading model architecture: unknown model architecture: 'llama-embed'
  • AutoModel.from_pretrained("mradermacher/llama-nemotron-embed-1b-v2-GGUF", gguf_file="llama-nemotron-embed-1b-v2.Q4_K_M.gguf") ->
    ValueError: GGUF model with architecture llama-embed is not supported yet.
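For reference, both errors point at the `general.architecture` string stored in the file's GGUF header. A minimal, stdlib-only sketch (based on the public GGUF spec; not an official tool) can read that key back out of the file, so you can confirm what architecture the quantized file actually declares:

```python
import struct

# Byte sizes of GGUF scalar value types (type id -> size in bytes).
_SIMPLE_SIZES = {0: 1, 1: 1, 2: 2, 3: 2, 4: 4, 5: 4, 6: 4, 7: 1, 10: 8, 11: 8, 12: 8}

def _read_string(f):
    # GGUF string: uint64 length followed by UTF-8 bytes.
    (n,) = struct.unpack("<Q", f.read(8))
    return f.read(n).decode("utf-8")

def _skip_value(f, vtype):
    # Skip over a metadata value we don't care about.
    if vtype == 8:                                   # string
        _read_string(f)
    elif vtype == 9:                                 # array: item type, count, items
        (itype,) = struct.unpack("<I", f.read(4))
        (count,) = struct.unpack("<Q", f.read(8))
        for _ in range(count):
            _skip_value(f, itype)
    else:
        f.read(_SIMPLE_SIZES[vtype])

def gguf_architecture(path):
    """Return the value of the 'general.architecture' metadata key, or None."""
    with open(path, "rb") as f:
        assert f.read(4) == b"GGUF", "not a GGUF file"
        version, n_tensors, n_kv = struct.unpack("<IQQ", f.read(20))
        for _ in range(n_kv):
            key = _read_string(f)
            (vtype,) = struct.unpack("<I", f.read(4))
            if key == "general.architecture" and vtype == 8:
                return _read_string(f)
            _skip_value(f, vtype)
    return None
```

Running `gguf_architecture("llama-nemotron-embed-1b-v2.Q4_K_M.gguf")` should return `'llama-embed'`, which matches the error messages above: both llama.cpp 0.3.16 and Transformers 4.57.3 simply don't know that architecture id yet.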

My setup:
PyTorch version: 2.9.1+cu130
Transformers version: 4.57.3
llama_cpp_python version: 0.3.16

Thank you for your help in advance.
Pavel

