Ternary-Bonsai-8B `bonsai_tq_f32` for MLC/WebLLM

This repository contains an experimental MLC/WebLLM conversion of `prism-ml/Ternary-Bonsai-8B-unpacked`. It is a browser-runtime artifact: not a new model or fine-tune, and not a GGUF, MLX, or ONNX mirror.

The source checkpoint is Prism ML's unpacked FP16 Ternary Bonsai model. This conversion uses a local MLC `bonsai_tq_f32` profile: symmetric 2-bit group quantization with `uint32` storage, a group size of 128, and FP32 scales. Each encoded value represents one of three ternary levels: -scale, 0, or +scale.
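
The storage layout described above can be sketched as a small decoder. This is an illustration, not the profile's actual kernel: the lane order inside each `uint32` (lane `i` in bits `2i..2i+1`) and the code-to-level mapping (`0`, `1`, `2` mapping to -scale, 0, +scale) are assumptions, and `dequantizeGroup` is a hypothetical helper name.

```javascript
// Sketch only: lane order and code-to-level mapping are ASSUMPTIONS,
// not taken from the bonsai_tq_f32 profile source.
const LANES_PER_WORD = 16; // 32 bits per word / 2 bits per value
const GROUP_SIZE = 128;    // values that share one FP32 scale

// words: Uint32Array holding one packed group (GROUP_SIZE / LANES_PER_WORD words)
// scale: the FP32 scale for that group
function dequantizeGroup(words, scale) {
  const out = new Float32Array(words.length * LANES_PER_WORD);
  for (let w = 0; w < words.length; w++) {
    for (let lane = 0; lane < LANES_PER_WORD; lane++) {
      // Extract one 2-bit code; >>> keeps the shift unsigned.
      const code = (words[w] >>> (2 * lane)) & 0x3;
      // Assumed mapping: 0 -> -scale, 1 -> 0, 2 -> +scale.
      out[w * LANES_PER_WORD + lane] = (code - 1) * scale;
    }
  }
  return out;
}
```

Under these assumptions, one group of 128 weights occupies eight `uint32` words plus a single FP32 scale.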

Artifact Summary

| Field | Value |
| --- | --- |
| Source checkpoint | `prism-ml/Ternary-Bonsai-8B-unpacked` |
| Architecture | Qwen3-shaped decoder |
| MLC model type | `qwen3` |
| Quantization | `bonsai_tq_f32` |
| Quantized storage | 2-bit symmetric group quantization in `uint32` |
| Conversation template | `qwen3_nothink` |
| Context window in config | `32768` |
| Prefill chunk in config | `2048` |
| Total parameters | 8,188,548,096 |
| Quantized parameter size | 2.146 GB |
| Bits per parameter | 2.251 |
| Parameter shards | 69 |
| Artifact size | about 2.1 GB |
| WebGPU library | `libs/ternary-bonsai-8b-bonsai_tq_f32-webgpu.wasm` |
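
As a sanity check on the size figures, the arithmetic can be reproduced from the numbers in the table. The sketch below assumes that each group of 128 two-bit values carries exactly one 32-bit scale and nothing else; reading the reported size as binary gigabytes (GiB) is an inference from the result, not something the table states.

```javascript
// Ideal cost per weight for the quantized tensors (assumes one FP32 scale per group).
const BITS_PER_VALUE = 2;
const SCALE_BITS = 32;
const GROUP_SIZE = 128;
const idealBitsPerParam = BITS_PER_VALUE + SCALE_BITS / GROUP_SIZE; // 2.25

// Figures reported in the table above.
const TOTAL_PARAMS = 8188548096;
const REPORTED_BITS_PER_PARAM = 2.251;

// Convert bits to bytes, then to binary gigabytes (GiB).
const totalBytes = (TOTAL_PARAMS * REPORTED_BITS_PER_PARAM) / 8;
const totalGiB = totalBytes / 2 ** 30;

console.log(idealBitsPerParam, totalGiB.toFixed(3));
```

The result lands on the reported 2.146 when expressed in GiB. The small gap between the ideal 2.25 and the reported 2.251 bits per parameter is presumably due to tensors stored outside the 2-bit path, though that is not confirmed here.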

Runtime Requirement

This artifact requires an MLC/WebLLM runtime with the local bonsai_tq_f32 quantization profile registered. It is not expected to load in an unmodified upstream WebLLM build until this profile is upstreamed or otherwise carried in the runtime.

This first ternary implementation runs through MLC's existing group-quantized graph path. It is a compact WebGPU artifact and a correctness/release milestone, but it does not yet use a custom fused ternary matmul kernel. Benchmark it before making speed claims.

WebLLM Configuration

const appConfig = {
  model_list: [
    {
      model: "https://huggingface.co/welcoma/Ternary-Bonsai-8B-bonsai_tq_f32-MLC/resolve/main/",
      model_id: "Ternary-Bonsai-8B-tq-MLC",
      model_lib:
        "https://huggingface.co/welcoma/Ternary-Bonsai-8B-bonsai_tq_f32-MLC/resolve/main/libs/ternary-bonsai-8b-bonsai_tq_f32-webgpu.wasm",
      overrides: {
        context_window_size: 4096,
        prefill_chunk_size: 512,
      },
    },
  ],
};

The smaller override values above are intended for local browser smoke tests. Increase them only after measuring browser memory and cache behavior on the target device. The 8B artifact is materially larger than the 1.7B and 4B artifacts, so browser cache quota and GPU memory should be checked before using larger context settings.
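
A minimal smoke-test sketch for the configuration above. It assumes the patched runtime keeps upstream WebLLM's `CreateMLCEngine` entry point and its OpenAI-style `engine.chat.completions.create` call; `smokeTest` and the prompt text are hypothetical. The engine factory is passed in as an argument so the helper does not depend on a particular package path.

```javascript
// Sketch only: `smokeTest` is a hypothetical helper, not part of WebLLM.
// `createEngine` is expected to behave like WebLLM's CreateMLCEngine.
async function smokeTest(createEngine, appConfig, modelId) {
  const engine = await createEngine(modelId, {
    appConfig,
    // Upstream progress reports carry a human-readable `text` field.
    initProgressCallback: (report) => console.log(report.text),
  });
  // Keep the request tiny: the goal is to confirm load and decode, not quality.
  const reply = await engine.chat.completions.create({
    messages: [{ role: "user", content: "Reply with one short sentence." }],
    max_tokens: 32,
  });
  return reply.choices[0].message.content;
}

// Browser usage (requires a runtime with the bonsai_tq_f32 profile registered;
// the package path below is the upstream one and may differ for a patched build):
//   const { CreateMLCEngine } = await import("@mlc-ai/web-llm");
//   console.log(await smokeTest(CreateMLCEngine, appConfig, "Ternary-Bonsai-8B-tq-MLC"));
```

The `modelId` argument must match the `model_id` entry in `appConfig.model_list`.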

Validation

The artifact was converted and WebGPU-compiled on the GCP MLC/WebLLM builder VM, not on a local laptop.

Source: prism-ml/Ternary-Bonsai-8B-unpacked
Quantization: bonsai_tq_f32
Quantization profile: int2 values, uint32 packed storage, FP32 scales
Conversion peak RAM: 9.188 GB on CPU
WebGPU compile completed successfully
Compile estimate without KV cache: 3830.35 MB
Compile estimate with 4K KV cache: 4982.35 MB
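
The two compile estimates above imply a per-token KV cache cost, which gives a rough way to budget memory for other context sizes. The sketch assumes the KV cache grows linearly with the context window and that everything else in the estimate stays constant; `estimateMB` is a hypothetical helper, and its output is a projection, not a measurement.

```javascript
// Figures from the compile estimates above.
const BASE_MB = 3830.35;    // without KV cache
const WITH_4K_MB = 4982.35; // with a 4096-token KV cache

// Implied KV cache cost per token of context (about 0.28 MB).
const KV_MB_PER_TOKEN = (WITH_4K_MB - BASE_MB) / 4096;

// ASSUMPTION: linear scaling in context length, constant non-KV memory.
function estimateMB(contextWindow) {
  return BASE_MB + KV_MB_PER_TOKEN * contextWindow;
}

console.log(estimateMB(4096).toFixed(2), estimateMB(32768).toFixed(2));
```

Under that assumption, the full 32768-token window from the config would project to roughly 13 GB, which is why larger context settings should be measured on the target device first.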

Limitations

This is an experimental runtime artifact, not a general transformers model checkpoint.
This repo does not claim the same runtime performance as Prism ML's native MLX 2-bit release.
Quality evaluation is limited to conversion and WebGPU compile checks; no benchmark score is claimed by this repository.
Browser success depends on WebGPU support, available GPU memory, cache quota, and a compatible patched WebLLM runtime.
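
Some of the browser-side conditions listed above can be checked before starting a download. The sketch below uses the standard `navigator.gpu.requestAdapter()` and `navigator.storage.estimate()` browser APIs; `preflight` is a hypothetical helper, and the byte threshold is left to the caller. It cannot detect whether the runtime carries the `bonsai_tq_f32` profile, and it does not measure free GPU memory.

```javascript
// Sketch only: `preflight` is a hypothetical helper. `nav` defaults to the
// browser's navigator and can be replaced with a stub for testing.
async function preflight(requiredBytes, nav = globalThis.navigator) {
  // WebGPU is exposed as navigator.gpu when the browser supports it.
  const hasWebGPU = Boolean(nav && nav.gpu);
  let adapterOk = false;
  if (hasWebGPU) {
    // requestAdapter() resolves to null when no suitable adapter exists.
    const adapter = await nav.gpu.requestAdapter();
    adapterOk = adapter != null;
  }
  // storage.estimate() reports approximate origin quota and usage in bytes.
  let freeBytes = null;
  if (nav && nav.storage && typeof nav.storage.estimate === "function") {
    const { quota, usage } = await nav.storage.estimate();
    freeBytes = quota - usage;
  }
  const quotaOk = freeBytes === null ? null : freeBytes >= requiredBytes;
  return { hasWebGPU, adapterOk, freeBytes, quotaOk };
}

// Browser usage, with a rough threshold slightly above the ~2.1 GB artifact size
// (the exact value is an assumption):
//   console.log(await preflight(2.4e9));
```

A `quotaOk` of `null` means the storage estimate was unavailable, not that the check failed.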

Provenance

Original model by Prism ML (source checkpoint: `prism-ml/Ternary-Bonsai-8B-unpacked`).

MLC/WebLLM conversion by welcoma.
