GLM-OCR-oQ8-fp16

This model was quantized using oQ mixed-precision quantization.

float16 gives ~20% faster prefill on M1/M2 Apple Silicon, which handles fp16 natively. bfloat16 is the safer choice on M3/M4 and whenever numerical stability matters more than prefill speed.
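As a rough illustration of that rule of thumb, the sketch below maps a chip name to a suggested dtype. The helper name `pick_dtype` and the chip-name string format are assumptions made for this example only; they are not part of oQ or MLX.

```python
import re


def pick_dtype(chip_name: str) -> str:
    """Suggest a dtype from a chip name (hypothetical helper, not an MLX API).

    M1/M2-family chips get 'float16'; everything else, including
    unrecognized names, falls back to 'bfloat16'.
    """
    match = re.search(r"\bM(\d+)\b", chip_name)
    if match and int(match.group(1)) in (1, 2):
        return "float16"
    return "bfloat16"


print(pick_dtype("Apple M1 Max"))  # float16
print(pick_dtype("Apple M4 Pro"))  # bfloat16
```

On macOS the chip name can usually be read with `sysctl -n machdep.cpu.brand_string`. Treat the result as a starting point and benchmark on your own hardware.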

Benchmark (on M1 Max)

Model variant	Prompt processing, PP (tokens/s)	Token generation, TG (tokens/s)
Original (bf16)	4,684	104.8
oQ8-fp16	3,806	99.0
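To put the two rows in proportion, the short calculation below computes the relative change in PP and TG throughput from the original model to this variant. The numbers are copied from the table; the variable and function names are only illustrative. Note that the table compares a quantized fp16 model against the unquantized bf16 original, so it does not isolate the effect of dtype alone.

```python
# Throughput figures (tokens per second) copied from the benchmark table.
original = {"pp": 4684.0, "tg": 104.8}
quantized = {"pp": 3806.0, "tg": 99.0}


def relative_change(new: float, old: float) -> float:
    """Percentage change from old to new (negative means slower)."""
    return (new - old) / old * 100.0


pp_change = relative_change(quantized["pp"], original["pp"])
tg_change = relative_change(quantized["tg"], original["tg"])

print(f"PP change: {pp_change:.1f}%")  # PP change: -18.7%
print(f"TG change: {tg_change:.1f}%")  # TG change: -5.5%
```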

Format: Safetensors (MLX)
Model size: 0.6B params
Tensor types: F16, U32
Quantization: 8-bit
