Ternary-Bonsai-8B bonsai_tq_f32 for MLC/WebLLM
This repository contains an experimental MLC/WebLLM conversion of
prism-ml/Ternary-Bonsai-8B-unpacked.
It is a browser-runtime artifact, not a new model, fine-tune, GGUF, MLX, or ONNX
mirror.
The source checkpoint is Prism ML's unpacked FP16 Ternary Bonsai model. This
conversion uses a local MLC bonsai_tq_f32 profile: symmetric 2-bit group
quantization with uint32 storage, group size 128, and FP32 scales. Each encoded
value represents one of three ternary levels: -scale, 0, or +scale.
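To make the storage layout concrete, here is a minimal sketch of how 2-bit ternary codes could be packed into uint32 words with one FP32 scale per group of 128 weights. The code mapping (0, 1, 2 for -scale, 0, +scale), the low-bits-first packing order, and the max-abs choice of scale are assumptions made for illustration; they are not read from the `bonsai_tq_f32` implementation, and the function names are hypothetical.

```javascript
// Sketch only: mapping, bit order, and scale choice are assumptions.
const GROUP_SIZE = 128;     // weights sharing one FP32 scale (from the model card)
const BITS = 2;             // bits per stored code
const PER_WORD = 32 / BITS; // 16 codes fit in one uint32

// Quantize a flat array of floats into packed uint32 words plus group scales.
function quantizeTernary(weights) {
  if (weights.length % GROUP_SIZE !== 0) {
    throw new Error("length must be a multiple of the group size");
  }
  const packed = new Uint32Array(weights.length / PER_WORD);
  const scales = new Float32Array(weights.length / GROUP_SIZE);
  for (let g = 0; g < scales.length; g++) {
    const start = g * GROUP_SIZE;
    let maxAbs = 0;
    for (let i = 0; i < GROUP_SIZE; i++) {
      maxAbs = Math.max(maxAbs, Math.abs(weights[start + i]));
    }
    scales[g] = maxAbs;
    const scale = scales[g]; // read back the float32-rounded value
    for (let i = 0; i < GROUP_SIZE; i++) {
      const idx = start + i;
      const level = scale === 0 ? 0 : Math.round(weights[idx] / scale);
      const code = Math.max(-1, Math.min(1, level)) + 1; // -1,0,+1 -> 0,1,2
      packed[idx >>> 4] |= code << ((idx & 15) * BITS);
    }
  }
  return { packed, scales };
}

// Expand packed codes back to floats: (code - 1) * scale of the owning group.
function dequantizeTernary(packed, scales) {
  const out = new Float32Array(packed.length * PER_WORD);
  for (let idx = 0; idx < out.length; idx++) {
    const code = (packed[idx >>> 4] >>> ((idx & 15) * BITS)) & 3;
    out[idx] = (code - 1) * scales[Math.floor(idx / GROUP_SIZE)];
  }
  return out;
}
```

With this layout, one group of 128 weights occupies eight uint32 words plus a single FP32 scale.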
Artifact Summary
| Field | Value |
|---|---|
| Source checkpoint | prism-ml/Ternary-Bonsai-8B-unpacked |
| Architecture | Qwen3-shaped decoder |
| MLC model type | qwen3 |
| Quantization | bonsai_tq_f32 |
| Quantized storage | 2-bit symmetric group quantization in uint32 |
| Conversation template | qwen3_nothink |
| Context window in config | 32768 |
| Prefill chunk in config | 2048 |
| Total parameters | 8,188,548,096 |
| Quantized parameter size | 2.146 GB |
| Bits per parameter | 2.251 |
| Parameter shards | 69 |
| Artifact size | about 2.1 GB |
| WebGPU library | libs/ternary-bonsai-8b-bonsai_tq_f32-webgpu.wasm |
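
The size figures in the table can be cross-checked with a little arithmetic. The sketch below assumes that each quantized weight costs 2 bits plus its share of one 32-bit scale per 128 weights, and that the table's "GB" is a binary gigabyte (2^30 bytes); both are assumptions, and the helper names are hypothetical.

```javascript
// Figures copied from the Artifact Summary table.
const TOTAL_PARAMS = 8188548096;
const REPORTED_BITS_PER_PARAM = 2.251;

// Ideal cost of one weight: its code bits plus an equal share of the group scale.
function idealBitsPerWeight(codeBits, scaleBits, groupSize) {
  return codeBits + scaleBits / groupSize;
}

// Convert a parameter count and bits-per-parameter into binary gigabytes.
function sizeInGiB(params, bitsPerParam) {
  return (params * bitsPerParam) / 8 / 2 ** 30;
}

console.log(idealBitsPerWeight(2, 32, 128)); // 2.25
console.log(sizeInGiB(TOTAL_PARAMS, REPORTED_BITS_PER_PARAM).toFixed(3)); // "2.146"
```

The small gap between the ideal 2.25 and the reported 2.251 would be consistent with a few tensors being stored unquantized, though this card does not state that.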
Runtime Requirement
This artifact requires an MLC/WebLLM runtime with the local bonsai_tq_f32
quantization profile registered. It is not expected to load in an unmodified
upstream WebLLM build until this profile is upstreamed or otherwise carried in
the runtime.
This first ternary path uses MLC's group-quantized graph path. It is a compact WebGPU artifact and a correctness/release milestone, but it is not yet a custom fused ternary matmul kernel. Benchmark it before making speed claims.
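
A rough way to benchmark decode speed is to time a streamed completion. The sketch below assumes an already-loaded WebLLM engine and treats each streamed text chunk as roughly one token, which is an approximation; `measureDecode` and `tokensPerSecond` are hypothetical helper names, not part of WebLLM.

```javascript
// Convert a count and an elapsed time in milliseconds into a per-second rate.
function tokensPerSecond(count, elapsedMs) {
  return elapsedMs > 0 ? (count * 1000) / elapsedMs : 0;
}

// Time a streamed completion on an already-loaded engine.
// Assumption: one streamed text chunk is roughly one generated token.
async function measureDecode(engine, prompt, maxTokens = 64) {
  const start = performance.now();
  const chunks = await engine.chat.completions.create({
    messages: [{ role: "user", content: prompt }],
    max_tokens: maxTokens,
    stream: true,
  });
  let pieces = 0;
  let firstAt = null;
  for await (const chunk of chunks) {
    const text = chunk.choices[0]?.delta?.content;
    if (!text) continue;
    if (firstAt === null) firstAt = performance.now();
    pieces += 1;
  }
  const end = performance.now();
  return {
    // Includes prefill time for the prompt.
    timeToFirstMs: firstAt === null ? null : firstAt - start,
    // Rate over the chunks that followed the first one.
    decodeRate: firstAt === null ? 0 : tokensPerSecond(pieces - 1, end - firstAt),
  };
}
```

Run it several times on the target device and discard the first run, since shader compilation and cache warm-up can dominate a cold start.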
WebLLM Configuration
    const appConfig = {
      model_list: [
        {
          model: "https://huggingface.co/welcoma/Ternary-Bonsai-8B-bonsai_tq_f32-MLC/resolve/main/",
          model_id: "Ternary-Bonsai-8B-tq-MLC",
          model_lib:
            "https://huggingface.co/welcoma/Ternary-Bonsai-8B-bonsai_tq_f32-MLC/resolve/main/libs/ternary-bonsai-8b-bonsai_tq_f32-webgpu.wasm",
          overrides: {
            context_window_size: 4096,
            prefill_chunk_size: 512,
          },
        },
      ],
    };
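
The configuration can be passed to the engine roughly as follows. This is a sketch, not a tested recipe: it assumes `webllm` is a module imported from a WebLLM build that registers `bonsai_tq_f32` (an unmodified upstream build is not expected to load the model), and `buildAppConfig` and `runSmokeTest` are hypothetical helper names.

```javascript
const MODEL_ID = "Ternary-Bonsai-8B-tq-MLC";
const REPO =
  "https://huggingface.co/welcoma/Ternary-Bonsai-8B-bonsai_tq_f32-MLC/resolve/main/";

// Build the appConfig shown above, with adjustable overrides.
function buildAppConfig(
  overrides = { context_window_size: 4096, prefill_chunk_size: 512 }
) {
  return {
    model_list: [
      {
        model: REPO,
        model_id: MODEL_ID,
        model_lib: REPO + "libs/ternary-bonsai-8b-bonsai_tq_f32-webgpu.wasm",
        overrides,
      },
    ],
  };
}

// `webllm` is the imported runtime module, for example:
//   import * as webllm from "@mlc-ai/web-llm";
// It must be a patched build that knows the bonsai_tq_f32 profile.
async function runSmokeTest(webllm, prompt = "Say hello in one sentence.") {
  const engine = await webllm.CreateMLCEngine(MODEL_ID, {
    appConfig: buildAppConfig(),
    initProgressCallback: (report) => console.log(report.text),
  });
  const reply = await engine.chat.completions.create({
    messages: [{ role: "user", content: prompt }],
  });
  return reply.choices[0].message.content;
}
```

Passing the module in as an argument keeps the helper independent of how the patched runtime is bundled.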
The smaller override values above are intended for local browser smoke tests. Increase them only after measuring browser memory and cache behavior on the target device. The 8B artifact is materially larger than the 1.7B and 4B artifacts, so browser cache quota and GPU memory should be checked before using larger context settings.
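
One way to check cache quota and GPU limits before loading is sketched below, using the browser's `navigator.storage.estimate()` and WebGPU adapter limits. The 2.4e9 byte figure is a deliberately generous assumption for the download size, and the helper names are hypothetical. WebGPU does not report total GPU memory, so the adapter limits are only a partial signal.

```javascript
// Generous assumed download size in bytes for the ~2.1 GB artifact.
const ARTIFACT_BYTES = 2.4e9;

// Pure check: does the origin's remaining storage quota cover the download?
function hasCacheRoom(estimate, requiredBytes) {
  const free = (estimate.quota ?? 0) - (estimate.usage ?? 0);
  return free >= requiredBytes;
}

// Browser-only probe; fields are null where an API is unavailable.
async function probeDevice(requiredBytes = ARTIFACT_BYTES) {
  const storage = globalThis.navigator?.storage;
  const gpu = globalThis.navigator?.gpu;
  const estimate = storage?.estimate ? await storage.estimate() : null;
  const adapter = gpu ? await gpu.requestAdapter() : null;
  return {
    cacheOk: estimate ? hasCacheRoom(estimate, requiredBytes) : null,
    maxBufferSize: adapter ? adapter.limits.maxBufferSize : null,
    maxStorageBufferBindingSize: adapter
      ? adapter.limits.maxStorageBufferBindingSize
      : null,
  };
}
```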
Validation
The artifact was converted and WebGPU-compiled on the GCP MLC/WebLLM builder VM, not on a local laptop.
- Source: prism-ml/Ternary-Bonsai-8B-unpacked
- Quantization: bonsai_tq_f32
- Quantization profile: `int2` values, `uint32` packed storage, FP32 scales
- Conversion peak RAM: 9.188 GB on CPU
- WebGPU compile completed successfully
- Compile estimate without KV cache: 3830.35 MB
- Compile estimate with 4K KV cache: 4982.35 MB
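
The two compile estimates give a rough per-token KV cache cost, which can be used to guess memory at other context sizes. The sketch below assumes the KV cache grows linearly with `context_window_size` and that nothing else changes. That is an assumption, not a measured result, and `estimateMemoryMb` is a hypothetical helper.

```javascript
const BASE_MB = 3830.35;    // compile estimate without KV cache
const WITH_4K_MB = 4982.35; // compile estimate with 4K KV cache
const KV_TOKENS = 4096;

// KV cache cost per token implied by the two estimates (about 0.28 MB).
const kvMbPerToken = (WITH_4K_MB - BASE_MB) / KV_TOKENS;

// Linear extrapolation to another context window size.
function estimateMemoryMb(contextWindow) {
  return BASE_MB + kvMbPerToken * contextWindow;
}

console.log(estimateMemoryMb(8192).toFixed(2));
console.log(estimateMemoryMb(32768).toFixed(2));
```

Under this assumption the full 32768-token window from the config would need roughly 13 GB, which is why the smaller overrides are the safer starting point.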
Limitations
- This is an experimental runtime artifact, not a general `transformers` model checkpoint.
- This repo does not claim the same runtime performance as Prism ML's native MLX 2-bit release.
- Quality evaluation is limited to conversion and WebGPU compile checks; no benchmark score is claimed by this repository.
- Browser success depends on WebGPU support, available GPU memory, cache quota, and a compatible patched WebLLM runtime.
Provenance
Original model by Prism ML: prism-ml/Ternary-Bonsai-8B-unpacked.
MLC/WebLLM conversion by welcoma.