Ternary-Bonsai-8B bonsai_tq_f32 for MLC/WebLLM

This repository contains an experimental MLC/WebLLM conversion of prism-ml/Ternary-Bonsai-8B-unpacked. It is a browser-runtime artifact. It is not a new model or fine-tune, and it is not a GGUF, MLX, or ONNX mirror.

The source checkpoint is Prism ML's unpacked FP16 Ternary Bonsai model. This conversion uses a local MLC bonsai_tq_f32 profile: symmetric 2-bit group quantization with uint32 storage, group size 128, and FP32 scales. Each encoded value decodes to one of three levels: -scale, 0, or +scale.
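To make the storage layout concrete, here is a minimal sketch of how one packed uint32 word could be decoded back to ternary weights. It assumes sixteen 2-bit codes per word stored from the least-significant bits upward, and an offset mapping of code 0 to -scale, 1 to 0, and 2 to +scale. Both are assumptions for illustration, not details confirmed for the bonsai_tq_f32 profile, and the function name `unpackTernaryWord` is hypothetical.

```javascript
// Sketch only: decode one uint32 word holding sixteen 2-bit codes.
// Assumed (not confirmed by the profile): codes are packed from the
// least-significant bits upward, and code c decodes to (c - 1) * scale.
// Under that mapping, code 3 would decode to +2*scale and is assumed unused.
function unpackTernaryWord(word, scale) {
  const out = new Float32Array(16);
  for (let i = 0; i < 16; i++) {
    const code = (word >>> (2 * i)) & 0b11; // extract the i-th 2-bit code
    out[i] = (code - 1) * scale;            // 0 -> -scale, 1 -> 0, 2 -> +scale
  }
  return out;
}

// Example: codes [2, 0, 1] in the lowest three slots, code 1 (zero) elsewhere.
let word = 0;
const codes = [2, 0, 1];
for (let i = 0; i < 16; i++) {
  const code = i < codes.length ? codes[i] : 1;
  word = (word | (code << (2 * i))) >>> 0; // keep the packed value unsigned
}
const weights = unpackTernaryWord(word, 0.5);
console.log(Array.from(weights.slice(0, 3)));
```

With a group size of 128 and sixteen codes per word, one FP32 scale would be shared by eight consecutive words under this layout.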

Artifact Summary

| Field | Value |
| --- | --- |
| Source checkpoint | prism-ml/Ternary-Bonsai-8B-unpacked |
| Architecture | Qwen3-shaped decoder |
| MLC model type | qwen3 |
| Quantization | bonsai_tq_f32 |
| Quantized storage | 2-bit symmetric group quantization in uint32 |
| Conversation template | qwen3_nothink |
| Context window in config | 32768 |
| Prefill chunk in config | 2048 |
| Total parameters | 8,188,548,096 |
| Quantized parameter size | 2.146 GB |
| Bits per parameter | 2.251 |
| Parameter shards | 69 |
| Artifact size | about 2.1 GB |
| WebGPU library | libs/ternary-bonsai-8b-bonsai_tq_f32-webgpu.wasm |
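
The bits-per-parameter figure can be sanity-checked from the profile description. The sketch below assumes every quantized weight costs 2 bits plus an equal share of one 32-bit scale per 128-weight group. The small gap to the reported 2.251 would then come from tensors stored outside this scheme, which is an inference rather than something this card states.

```javascript
// Values copied from the summary above.
const totalParams = 8188548096;
const reportedBitsPerParam = 2.251;

// Profile parameters from the description: 2-bit codes, FP32 scales, group size 128.
const weightBits = 2;
const scaleBits = 32;
const groupSize = 128;

// Ideal cost per weight if every tensor used the group-quantized layout.
const idealBitsPerParam = weightBits + scaleBits / groupSize;

// Size implied by the reported bits per parameter.
const bytes = (totalParams * reportedBitsPerParam) / 8;
const sizeGB = bytes / 1e9;      // decimal gigabytes (10^9 bytes)
const sizeGiB = bytes / 2 ** 30; // binary gigabytes (2^30 bytes)

console.log(idealBitsPerParam, sizeGB.toFixed(3), sizeGiB.toFixed(3));
```

Under these assumptions the ideal cost is 2.25 bits per parameter, and the implied size is about 2.304 in decimal units and about 2.146 in binary units, so the listed 2.146 GB figure appears to use 1024-based gigabytes.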

Runtime Requirement

This artifact requires an MLC/WebLLM runtime with the local bonsai_tq_f32 quantization profile registered. It is not expected to load in an unmodified upstream WebLLM build until this profile is upstreamed or otherwise carried in the runtime.

This first ternary path runs through MLC's group-quantized graph path rather than a custom fused ternary matmul kernel. Treat it as a compact WebGPU artifact and a correctness/release milestone, and benchmark it before making speed claims.

WebLLM Configuration

const appConfig = {
  model_list: [
    {
      model: "https://huggingface.co/welcoma/Ternary-Bonsai-8B-bonsai_tq_f32-MLC/resolve/main/",
      model_id: "Ternary-Bonsai-8B-tq-MLC",
      model_lib:
        "https://huggingface.co/welcoma/Ternary-Bonsai-8B-bonsai_tq_f32-MLC/resolve/main/libs/ternary-bonsai-8b-bonsai_tq_f32-webgpu.wasm",
      overrides: {
        context_window_size: 4096,
        prefill_chunk_size: 512,
      },
    },
  ],
};

The smaller override values above are intended for local browser smoke tests. Increase them only after measuring browser memory and cache behavior on the target device. The 8B artifact is materially larger than the 1.7B and 4B artifacts, so browser cache quota and GPU memory should be checked before using larger context settings.
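
A minimal sketch of how a configuration like the one above might be handed to WebLLM follows. It assumes a patched build of the `@mlc-ai/web-llm` package that registers bonsai_tq_f32 and exposes the usual `CreateMLCEngine` entry point. The helper names `loadBonsai` and `smokeTest` and the prompt are illustrative. The runtime module is passed in as an argument so the helpers do not depend on how the patched runtime is bundled.

```javascript
// Sketch only. `webllm` is the imported runtime module, for example:
//   import * as webllm from "@mlc-ai/web-llm";  // assumed to be a patched build
async function loadBonsai(webllm, appConfig, onProgress) {
  // The model_id must match an entry in appConfig.model_list.
  const modelId = appConfig.model_list[0].model_id;
  const engine = await webllm.CreateMLCEngine(modelId, {
    appConfig,
    initProgressCallback: onProgress, // receives download and load progress reports
  });
  return engine;
}

// Run one short chat completion to confirm the engine produces text.
async function smokeTest(engine) {
  const reply = await engine.chat.completions.create({
    messages: [{ role: "user", content: "Reply with one short sentence." }],
    max_tokens: 32,
  });
  return reply.choices[0].message.content;
}
```

In a page, this would be called as `const engine = await loadBonsai(webllm, appConfig, (report) => console.log(report.text));` followed by `await smokeTest(engine)`.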

Validation

The artifact was converted and WebGPU-compiled on the GCP MLC/WebLLM builder VM, not on a local laptop.

  • Source: prism-ml/Ternary-Bonsai-8B-unpacked
  • Quantization: bonsai_tq_f32
  • Quantization profile: int2 values, uint32 packed storage, FP32 scales
  • Conversion peak RAM: 9.188 GB on CPU
  • WebGPU compile completed successfully
  • Compile estimate without KV cache: 3830.35 MB
  • Compile estimate with 4K KV cache: 4982.35 MB
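
The two compile estimates give a rough way to project memory for other context sizes. The sketch below assumes the difference between them is entirely KV cache for 4096 tokens and that KV cache grows linearly with context length. Real usage also depends on the browser, driver, and prefill chunk size, so treat the output as a planning number to confirm by measurement. `estimateMemoryMB` is a hypothetical helper.

```javascript
const baseMB = 3830.35;   // compile estimate without KV cache
const with4kMB = 4982.35; // compile estimate with 4K KV cache
const kvTokens = 4096;    // assumed meaning of "4K"

// Per-token KV cost implied by the two estimates.
const kvPerTokenMB = (with4kMB - baseMB) / kvTokens;

// Linear projection: fixed base plus per-token KV cost.
function estimateMemoryMB(contextWindow) {
  return baseMB + kvPerTokenMB * contextWindow;
}

for (const ctx of [4096, 8192, 32768]) {
  console.log(ctx, estimateMemoryMB(ctx).toFixed(2), "MB");
}
```

Under these assumptions, the 32768-token window listed in the config would project to roughly 13,000 MB, well above the 4K figure.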

Limitations

  • This is an experimental runtime artifact, not a general transformers model checkpoint.
  • This repo does not claim the same runtime performance as Prism ML's native MLX 2-bit release.
  • Quality evaluation is limited to conversion and WebGPU compile checks; no benchmark score is claimed by this repository.
  • Browser success depends on WebGPU support, available GPU memory, cache quota, and a compatible patched WebLLM runtime.
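
A preflight check along these lines can surface the most common failure modes before any download starts. It uses the standard `navigator.gpu.requestAdapter()` and `navigator.storage.estimate()` browser APIs. The helper name `preflight`, the `requiredBytes` parameter, and the returned object shape are illustrative. It cannot detect whether the runtime carries the bonsai_tq_f32 profile.

```javascript
// Sketch only: `nav` defaults to the browser's navigator object.
async function preflight(requiredBytes, nav = globalThis.navigator) {
  const problems = [];

  // WebGPU: navigator.gpu is undefined where WebGPU is unsupported, and
  // requestAdapter() resolves to null when no suitable adapter exists.
  const adapter = nav && nav.gpu ? await nav.gpu.requestAdapter() : null;
  if (!adapter) problems.push("WebGPU adapter not available");

  // Cache quota: estimate() reports approximate usage and quota in bytes.
  if (nav && nav.storage && nav.storage.estimate) {
    const { usage = 0, quota = 0 } = await nav.storage.estimate();
    if (quota - usage < requiredBytes) {
      problems.push("Insufficient storage quota for model shards");
    }
  } else {
    problems.push("Storage estimate API not available");
  }

  // Largest single GPU buffer the adapter allows, useful when debugging.
  const maxBufferSize = adapter ? adapter.limits.maxBufferSize : 0;
  return { ok: problems.length === 0, problems, maxBufferSize };
}
```

For this artifact, `requiredBytes` would need to cover at least the parameter shards plus the WebGPU library, with headroom for anything else the page caches.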

Provenance

Original model by Prism ML.

MLC/WebLLM conversion by welcoma.
