Flipradio Qwen 3.5 9B · DeepSeek V4 Flash Distillation
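This model is a fine-tune built from Li Houchen's FlipRadio (翻转电台) program. It is based on the multimodal Qwen3.5-9B model: checkpoint-1200 was obtained through DeepSeek V4 + Flash Attention distillation and then quantized with llama.cpp to Q8_0 GGUF format, so it runs directly in LM Studio / Ollama / llama.cpp. Dataset source: flipradio.archive
Li Houchen's related shows:
YouTube: FearNation 世界苦茶 三個水槍手
Podcasts: 翻转电台FlipRadio
Website: Flipradio.club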
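Model overview
| Item | Details |
|---|---|
| Base model | unsloth/Qwen3.5-9B |
| Training method | DeepSeek V4 distillation · Flash Attention |
| Checkpoint | checkpoint-1200 |
| Quantization | Q8_0 (8-bit, nearly lossless) |
| Format | GGUF (llama.cpp compatible) |
| Multimodal | Image input supported (mmproj included) |
| Context length | 32K (default) |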
File list
| File | Size | Description |
|---|---|---|
| `Qwen3.5-9B.Q8_0.gguf` | 9.53 GB | Main model weights (Q8_0 quantized) |
| `Qwen3.5-9B.BF16-mmproj.gguf` | 922 MB | Multimodal vision projector (BF16) |
| `export_metadata.json` | - | Export metadata |
Tip: for text-only chat, download just `Qwen3.5-9B.Q8_0.gguf`; add `Qwen3.5-9B.BF16-mmproj.gguf` only when you need image understanding.
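The file choice described in the tip can be sketched in Python. This is a minimal sketch, assuming the `huggingface_hub` package is installed; the helper names `files_to_download` and `download` and the `./models` directory are illustrative, not part of this repository.

```python
from pathlib import Path

REPO_ID = "Pixelber/Flipradio_qwen_3.5_9B_Deepseek_V4_flash_Distillation"
MODEL_FILE = "Qwen3.5-9B.Q8_0.gguf"
MMPROJ_FILE = "Qwen3.5-9B.BF16-mmproj.gguf"


def files_to_download(need_vision: bool) -> list[str]:
    """Return the file names needed for text-only or image + text use."""
    files = [MODEL_FILE]
    if need_vision:
        # The vision projector is only required for image input.
        files.append(MMPROJ_FILE)
    return files


def download(need_vision: bool, local_dir: str = "./models") -> list[Path]:
    """Download the selected files (hypothetical helper, needs network access)."""
    # Imported lazily so the selection logic works without the package installed.
    from huggingface_hub import hf_hub_download

    return [
        Path(hf_hub_download(repo_id=REPO_ID, filename=name, local_dir=local_dir))
        for name in files_to_download(need_vision)
    ]
```

Calling `download(need_vision=False)` would fetch only the main weights, while `download(need_vision=True)` would also fetch the mmproj file.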
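Quick start
Option 1: LM Studio (recommended for beginners)
- Open LM Studio and search for `Pixelber/Flipradio_qwen_3.5_9B_Deepseek_V4_flash_Distillation`
- Download `Qwen3.5-9B.Q8_0.gguf` (also download the mmproj file if you need image understanding)
- Load the model in the Chat view to start chatting
- Multimodal mode: when loading, attach the mmproj file to the "Vision Adapter" slot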
Option 2: llama.cpp (command line)
# 1. Download the model files
huggingface-cli download Pixelber/Flipradio_qwen_3.5_9B_Deepseek_V4_flash_Distillation \
  Qwen3.5-9B.Q8_0.gguf Qwen3.5-9B.BF16-mmproj.gguf \
  --local-dir ./models
# 2. Text-only chat
./llama-cli -m ./models/Qwen3.5-9B.Q8_0.gguf \
  -p "Hello, please introduce yourself" \
  -c 8192 -ngl 99 --temp 0.7
# 3. Multimodal (image + text)
./llama-mtmd-cli -m ./models/Qwen3.5-9B.Q8_0.gguf \
  --mmproj ./models/Qwen3.5-9B.BF16-mmproj.gguf \
  --image ./test.jpg \
  -p "Describe this image"
# 4. Start an OpenAI-compatible API server
./llama-server -m ./models/Qwen3.5-9B.Q8_0.gguf \
  --mmproj ./models/Qwen3.5-9B.BF16-mmproj.gguf \
  --host 0.0.0.0 --port 8080 \
  -c 32768 -ngl 99
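Once the server from step 4 is running, it can be queried from Python with only the standard library. The sketch below assumes the server is reachable at `http://localhost:8080`; the `"model"` value is a placeholder label chosen for this example, and the helper names are hypothetical.

```python
import json
import urllib.request

# Assumption: llama-server was started locally and listens on port 8080.
BASE_URL = "http://localhost:8080/v1"


def build_chat_payload(question: str, temperature: float = 0.7) -> dict:
    """Build an OpenAI-style chat completion request body."""
    return {
        "model": "flipradio-qwen",  # placeholder label, an assumption of this sketch
        "messages": [{"role": "user", "content": question}],
        "temperature": temperature,
        "max_tokens": 512,
    }


def chat(question: str) -> str:
    """POST the request to the server and return the assistant's reply text."""
    request = urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(build_chat_payload(question)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

With the server up, `print(chat("Explain knowledge distillation in one sentence."))` should print the model's answer.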
Option 3: Ollama
# Create the Modelfile
cat > Modelfile <<'EOF'
FROM ./Qwen3.5-9B.Q8_0.gguf
PARAMETER temperature 0.7
PARAMETER top_p 0.9
PARAMETER num_ctx 8192
TEMPLATE """{{ if .System }}<|im_start|>system
{{ .System }}<|im_end|>
{{ end }}{{ if .Prompt }}<|im_start|>user
{{ .Prompt }}<|im_end|>
{{ end }}<|im_start|>assistant
{{ .Response }}<|im_end|>
"""
EOF
# Import and run
ollama create flipradio-qwen -f Modelfile
ollama run flipradio-qwen
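After `ollama create`, the model can also be called through Ollama's local REST API instead of the interactive prompt. This is a rough sketch, assuming Ollama is serving on its default address `http://localhost:11434`; the helper names are illustrative.

```python
import json
import urllib.request

# Assumption: Ollama is running locally on its default port.
OLLAMA_URL = "http://localhost:11434/api/chat"


def build_ollama_payload(question: str) -> dict:
    """Build a non-streaming chat request for the model created above."""
    return {
        "model": "flipradio-qwen",  # the name passed to `ollama create`
        "messages": [{"role": "user", "content": question}],
        "stream": False,  # ask for one JSON object instead of a stream
    }


def ask(question: str) -> str:
    """Send the request and return the reply text."""
    request = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(build_ollama_payload(question)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["message"]["content"]
```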
Option 4: Python (llama-cpp-python)
from llama_cpp import Llama

llm = Llama(
    model_path="./Qwen3.5-9B.Q8_0.gguf",
    n_ctx=8192,
    n_gpu_layers=-1,  # -1 = offload all layers to the GPU
    flash_attn=True,
)
output = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "You are a helpful Chinese-language AI assistant."},
        {"role": "user", "content": "Explain knowledge distillation in one sentence."},
    ],
    temperature=0.7,
    max_tokens=512,
)
print(output["choices"][0]["message"]["content"])
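Recommended inference parameters
| Parameter | Value | Notes |
|---|---|---|
| `temperature` | 0.6 ~ 0.8 | 0.8 for creative writing, 0.6 for Q&A |
| `top_p` | 0.9 | Nucleus sampling |
| `top_k` | 40 | - |
| `repeat_penalty` | 1.05 | Discourages repetition |
| `n_ctx` | 8192 ~ 32768 | Depends on available VRAM |
| `n_gpu_layers` | 99 or -1 | Offload all layers to the GPU |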
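The sampling recommendations can be bundled into a small preset helper. This is a sketch only: the task names `"qa"` and `"creative"` and the function `sampling_params` are assumptions made for illustration, while the values come from the table.

```python
# Values shared by every task, taken from the recommendations above.
BASE_SAMPLING = {"top_p": 0.9, "top_k": 40, "repeat_penalty": 1.05}

# Temperature depends on the kind of task (hypothetical task labels).
TEMPERATURE_BY_TASK = {"qa": 0.6, "creative": 0.8}


def sampling_params(task: str) -> dict:
    """Return keyword arguments for a chat completion call."""
    if task not in TEMPERATURE_BY_TASK:
        raise ValueError(f"unknown task: {task!r}")
    return {**BASE_SAMPLING, "temperature": TEMPERATURE_BY_TASK[task]}
```

With llama-cpp-python, the result can be unpacked into the call, for example `llm.create_chat_completion(messages=messages, **sampling_params("qa"))`.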
Hardware requirements
| Mode | Minimum VRAM | Recommended VRAM |
|---|---|---|
| CPU only | - (16 GB RAM) | 32 GB RAM |
| Partial GPU offload | 8 GB | 12 GB |
| Full GPU offload | 12 GB | 16 GB+ |
| Multimodal, full offload | 14 GB | 20 GB+ |
On an RTX 3090 / 4090, inference reaches 50+ tokens/s.
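As a rough illustration, the minimum-VRAM column can be turned into a lookup that suggests a run mode. The function `suggest_mode` is hypothetical and the thresholds are simply the table's minimums, so treat the result as a starting point rather than a guarantee.

```python
def suggest_mode(vram_gb: float, need_vision: bool = False) -> str:
    """Pick a run mode from available VRAM using the table's minimum values."""
    if need_vision and vram_gb >= 14:
        return "multimodal, full GPU offload"
    if not need_vision and vram_gb >= 12:
        return "full GPU offload"
    if vram_gb >= 8:
        # Not enough for a full offload: keep some layers on the CPU.
        return "partial GPU offload"
    return "CPU only"
```

For example, a 12 GB card gets a full offload for text-only use but only a partial offload when the vision projector is also loaded.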
Prompt template
This model uses the standard ChatML format:
<|im_start|>system
You are a helpful Chinese-language AI assistant.<|im_end|>
<|im_start|>user
Your question<|im_end|>
<|im_start|>assistant
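For raw completion calls, where no chat template is applied automatically, the ChatML layout above can be assembled by hand. A minimal sketch; the function name `to_chatml` is an assumption of this example.

```python
def to_chatml(messages: list[dict[str, str]]) -> str:
    """Render role/content messages as a ChatML prompt ending with an open assistant turn."""
    parts = [
        f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"
        for message in messages
    ]
    # Leave the assistant turn open so the model generates the reply.
    parts.append("<|im_start|>assistant\n")
    return "".join(parts)
```

Chat-style endpoints such as `create_chat_completion` typically apply the template themselves, so manual formatting is only needed when sending a plain prompt string.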
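License
This model is released under the Apache 2.0 license and may be used commercially; please retain attribution to the original author.
The base model Qwen3.5-9B is also licensed under Apache 2.0.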
Citation
If this model helps your work, feel free to cite it:
@misc{flipradio-qwen-3.5-9b-2026,
  author = {Pixelber},
  title = {Flipradio Qwen 3.5 9B - DeepSeek V4 Flash Distillation},
  year = {2026},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/Pixelber/Flipradio_qwen_3.5_9B_Deepseek_V4_flash_Distillation}},
}
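Feedback and discussion
- Feedback: start a discussion in this repository's Community tab
- If you find a bug or have a suggestion for improvement, feel free to open an issue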
Happy Hacking!