Such a great model. (RTX 4090 Performance data inside)
I've been using this 27b dense model since it came out, and it never ceases to amaze me with its quality for its size. In non-thinking mode it exceeds any MoE of similar size, and for some tasks I'd say it's an even better choice. The non-thinking output of qwen3.5 a3b doesn't compare, and 122b a10b is probably a little lower quality, in my honest opinion.
Although I have 4 GPUs at my disposal and 512GB of system RAM, I'm able to fit this quant on ONE 4090, along with about 20k context (easily expandable to 60k+ if I need it). I'm getting a solid 48 t/s generation and inSANE pp speeds (see screenshot stats for more info).
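For anyone curious how 20k context fits alongside the weights on one card, here's a rough back-of-the-envelope KV-cache estimate. To be clear, the architecture numbers below are made-up placeholders for illustration, not this model's actual specs:

```python
# Rough KV-cache size estimate for a dense transformer.
# All architecture numbers used below are illustrative assumptions,
# NOT the actual specs of the 27b model.

def kv_cache_bytes(ctx_len, n_layers, n_kv_heads, head_dim, bytes_per_elem):
    # 2x for the separate K and V tensors; one entry per layer,
    # per KV head, per head dimension, per context position.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * ctx_len

# Hypothetical example: 48 layers, 8 KV heads (GQA), head_dim 128,
# f16 cache (2 bytes per element), 20k context.
gib = kv_cache_bytes(20_000, 48, 8, 128, 2) / 1024**3
print(f"~{gib:.1f} GiB of KV cache at 20k context")  # ~3.7 GiB
```

With GQA and/or a quantized KV cache the number shrinks fast, which is how a big context can still squeeze in next to the weights.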
The quality of the text this model outputs is superb. I'm not using it for coding; I'm using it as an assistant to explain medical and biochemical terms as I study for my premed courses. It's nailing concepts with damn near 100% accuracy, and it explains them in a way that's easy to comprehend and sticks. This is obviously what most people would describe as the earliest and possibly simplest phase of medical school, but these concepts are foreign to most people, and the fact that the model is basically an interactive textbook at this point is crazy to me.
I was using vLLM earlier across all my GPUs and got over 100 t/s with qwen 3.5 27b, but I always keep coming back to ik_llama... perhaps it's for the best.
Thanks again for the quants. I'm finding these modern dense models are really getting good, and I'm excited to see where the next batch of models takes us -
Looking forward to deepseek v4 later this month? Fingers crossed -
Wow thanks for the glowing review! I'm glad this quant is able to retain enough knowledge for usable accuracy for your studies!
Yeah, the new models are definitely becoming "actually useful for some things" without requiring excessive hardware! I was using my Qwen3.5-122B-A10B IQ4_KSS across two GPUs with -sm graph in ik_llama.cpp just as a fast "internet search" reference of sorts, e.g. helping me figure out how to manage my gmail inbox with secret filter tags, because I turned off google's gemini and now they won't filter my e-mail anymore, so I got swamped by spam lol...
Amusingly, I asked the model if the poor google mail filter experience is deliberate enshittification, and the LLM totally agreed sycophantically with me lmao...
Agreed, we're all wondering if DSV4 is going to take us another step forward for local models. I don't think there's -sm graph support for MLA models yet, but I'm sure we'll see many more improvements in the coming months between new models, ik features, mainline always growing, and the huge growth on the client side too.
I forget, what is your current preferred client? I'm using opencode with the following config file, but I want to test out some more lightweight ones like some vibe-coded pi variants haha.
{
  "$schema": "https://opencode.ai/config.json",
  "share": "disabled",
  "autoupdate": false,
  "experimental": {
    "openTelemetry": false
  },
  "tools": {
    "websearch": true,
    "todoread": false,
    "todowrite": false
  },
  "disabled_providers": ["exa"],
  "provider": {
    "LMstudio": {
      "npm": "@ai-sdk/openai-compatible",
      "name": "ik_llama.cpp (local)",
      "options": {
        "baseURL": "http://localhost:8080/v1",
        "timeout": 99999999999
      },
      "models": {
        "Kimi-K2.5-Q4_X": {
          "name": "Kimi-K2.5-Q4_X",
          "limit": { "context": 1000000, "output": 32000 },
          "cost": { "input": 5.0, "output": 25.0 },
          "temperature": true,
          "reasoning": true,
          "tool_call": true
        }
      }
    }
  }
}
Kimi is just a name placeholder; it works with whatever model I load up on the backend.
Hey, do you have any tips for prompt caching with qwen35 dense or qwen35 moe? My understanding is that because they use gated delta net recurrent attention, they need some special cache flags? I'm looking at some discussion on PR1310, but maybe you have a command example already? There's some more possible chatter in a recent Issue 1383...
Also do you use any of the self speculative decoding stuff e.g. stuff like this:
--spec-type ngram-simple
--spec-ngram-size-n 8 --spec-ngram-size-m 16
--draft-min 1 --draft-max 4
--draft-p-min 0.7
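For reference, here's my mental model of what an ngram-style self-draft does; just a toy sketch of the general technique, not ik_llama.cpp's actual implementation, and the function name is made up:

```python
def ngram_draft(tokens, n=8, draft_max=4):
    """Toy n-gram self-speculation: find the most recent earlier
    occurrence of the last n tokens, and propose the tokens that
    followed that occurrence as a cheap draft for the big model
    to verify. Returns [] when there's no earlier match."""
    if len(tokens) <= n:
        return []
    suffix = tokens[-n:]
    # Scan right-to-left so the most recent match wins;
    # the range excludes the trailing suffix itself.
    for i in range(len(tokens) - n - 1, -1, -1):
        if tokens[i:i + n] == suffix:
            return tokens[i + n:i + n + draft_max]
    return []

# Repetitive text drafts well: "quick brown" already appeared earlier.
toks = "the quick brown fox jumps over the quick brown".split()
print(ngram_draft(toks, n=2, draft_max=3))  # ['fox', 'jumps', 'over']
```

No draft model needed, which is why it's basically free speed on repetitive or structured output.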
I'm just workshopping my own commands to eke out a little more perf haha... Thanks!
As of right now, prompt caching isn't something I'm concerned about, as I'm really just using the bot as a study assistant, so each message thread has like 8-10 messages; then when I change subjects, I start a fresh conversation (same system prompt ofc).
I'm definitely going to have to give opencode a try one day; the webui looks nice. Honestly, I just keep the default ik_llama.cpp webui in a pinned tab and inference with the model that way, as it works for my needs and is pretty lightweight... I do host my own vLLM chat client with persistent conversation memory across devices that I vibe coded one night, and it's pretty slick. Again, I'm not really doing agentic stuff at the moment, just study assist.
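The persistence part is honestly the simple bit; something along these lines (a minimal sketch, not my actual client — table, column, and function names here are all made up):

```python
import json
import sqlite3

# Minimal sketch of cross-device conversation persistence: every
# message is appended to a sqlite DB keyed by conversation id, so any
# client pointed at the same DB (or a server wrapping it) can resume
# the thread. Schema and names below are illustrative, not real.

def connect(path=":memory:"):
    db = sqlite3.connect(path)
    db.execute("""CREATE TABLE IF NOT EXISTS messages (
        convo_id TEXT, seq INTEGER, role TEXT, content TEXT,
        PRIMARY KEY (convo_id, seq))""")
    return db

def append(db, convo_id, role, content):
    (last,) = db.execute(
        "SELECT COALESCE(MAX(seq), -1) FROM messages WHERE convo_id = ?",
        (convo_id,)).fetchone()
    db.execute("INSERT INTO messages VALUES (?, ?, ?, ?)",
               (convo_id, last + 1, role, content))
    db.commit()

def history(db, convo_id):
    # Rebuild the messages list in OpenAI chat format for the next request.
    rows = db.execute(
        "SELECT role, content FROM messages WHERE convo_id = ? ORDER BY seq",
        (convo_id,))
    return [{"role": r, "content": c} for r, c in rows]

db = connect()
append(db, "biochem-01", "system", "You are a study assistant.")
append(db, "biochem-01", "user", "Explain glycolysis briefly.")
print(json.dumps(history(db, "biochem-01"), indent=2))
```

Point the DB at a file on a shared server instead of `:memory:` and every device sees the same threads.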
I enabled speculative decoding on my vLLM instance of Qwen3.5 35b a3b, but I don't have much to say about it, as I switched to this quant shortly after. I like the speed of vLLM, but I don't like the power draw. It really saps every ounce of each GPU and makes my UPS flicker at full tilt lmao, I don't want to burn my house down!
As for local agents, I did set up nanobot (a lightweight openclaw clone) and gave it a VM on my central server to run about and do things! That was fun to play with for a bit, and it was able to do some basic things I wanted, like searching the web, making web scraper scripts, playwright automation, etc. I used your minimax quant for that, but it refused to do anything with torrents because of their legality, so I switched to qwen coder next and that worked great. Haven't tried agentic stuff with this quant yet, but I'm sure it runs solid.
