Massive clipping damage? Why does Q8KXL have F16 tensors/layers when it's a native BF16 model?

#3
by rkh661 - opened

Hey Dan, I was looking in the Hugging Face file viewer, and I once again saw that you were converting tensors and layers to F16, even though it's a native BF16 model, as most are nowadays. Why do you do that? I've noticed it with many of your other conversions. That's why I've seen other people say that a plain Q8 can be less damaged than your clipped F16 quants, because it does less damage. Maybe it's true, but can you address this? Thanks

Unsloth AI org
β€’
edited 4 days ago

Just to confirm, there is no accuracy degradation when using f16 over bf16. Ollama and LM Studio both use f16 instead of bf16, just like us.

It's because people keep saying that if you use bf16, it's much, much slower than f16 on a lot of devices; e.g., see this recent comment complaining about the speed issue.

Native BF16 is unsupported on most (older) hardware; I believe llama.cpp emulates bf16 on such hardware by converting to FP32 on-the-fly, which causes a massive performance hit. F16 is better for most people.

@danielhanchen If possible, can you consider offering two XL variants: a 'Performance' version optimized for maximum inference speed and accuracy at specific bit-depths (e.g., 4-bit), and an 'Efficiency' version optimized for minimal VRAM usage and high accuracy at lower quantizations (e.g., 2-bit or 3-bit)?

Unsloth AI org


Yes, you are correct!! We will see what we can do, thank you.

Closing this thread for now as it is not an issue. You can, however, still write new comments. Thank you!

danielhanchen changed discussion status to closed

bf16 has a higher dynamic range, so converting bf16 weights to fp16 will lose some information?


However, since the weights usually have quite small values, maybe fp16 is good to go?
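For what it's worth, the trade-off is easy to see in a quick sketch. The `to_bf16` helper below is just an illustration (emulating bfloat16 in plain Python by truncating a float32's low 16 bits), not llama.cpp's actual code:

```python
import struct

import numpy as np

def to_bf16(x: float) -> float:
    """Emulate bfloat16 by truncating the low 16 bits of the float32
    encoding. bf16 keeps float32's 8-bit exponent (same dynamic range)
    but only 7 mantissa bits (less fidelity)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]

big = 150000.0
print(np.float16(big))  # inf: fp16 tops out at 65504
print(to_bf16(big))     # 149504.0: coarser, but still finite

small = 0.12345678
print(np.float16(small))  # fp16's 10 mantissa bits keep more digits
print(to_bf16(small))     # bf16's 7 mantissa bits keep fewer
```

So fp16 is more precise for small in-range values, while bf16 trades precision for range, which is exactly the question here.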

Q8_0 is already extremely close in quality/precision to F16/BF16; it's hard to imagine F16 and BF16 being much different.


Yes, but it is not a precision issue. When you quantize, you don't just chop off the top of the number: the quant uses a scaling factor, a multiplier.
The quantized values handle the relative shape, and the multiplier handles the size. It loses some detail, but the magnitude of an outlier survives.
FP16 has no scaling factor. Its hard mathematical limit is 65,504. If an activation works out to 150,000, FP16 simply throws its hands up and logs Infinity.
The issue is that standard FP16 math turns clipped numbers into Infinity and NaN, which breaks the model outright. BF16 is a clever format designed to have the same minimum and maximum range as FP32, but with less fidelity. So casting BF16 weights to F16 loses the outliers, with no scaling factor to rescue them.
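To make the scaling-factor point concrete, here is a rough sketch of Q8_0-style block quantization. The layout here (32 weights per block, one shared scale, signed 8-bit values) is assumed for illustration; it is not llama.cpp's actual implementation:

```python
import numpy as np

def q8_0_roundtrip(block: np.ndarray) -> np.ndarray:
    """Quantize a block to int8 with one shared scale, then dequantize.
    The int8 values carry the relative shape; the scale carries the size."""
    scale = np.abs(block).max() / 127.0
    q = np.clip(np.round(block / scale), -127, 127).astype(np.int8)
    return q.astype(np.float32) * scale

block = np.zeros(32, dtype=np.float32)
block[0] = 150000.0                # outlier far beyond fp16's 65504 limit
block[1:] = np.linspace(-1.0, 1.0, 31)

restored = q8_0_roundtrip(block)
print(restored[0])                 # ~150000: the outlier's magnitude survives
print(restored[1:4])               # the tiny weights round to zero
print(np.float16(150000.0))        # inf: fp16 has no scale to save it
```

Note how the large scale preserves the outlier but wipes out the small weights in the same block; that's the "loses some detail" part. FP16, with no scale, simply clips.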

Hey Dan, that is false.

https://huggingface.co/bartowski/gemma-2-27b-it-GGUF/discussions/7

Here's an example. That's Gemma 2 27B, and that's Bartowski looking into this very thing a year and a half ago. And BF16 is supported on everything 30-series and newer, I believe, and it's been five years since that came out. Maybe there are AMD or Intel issues, but again, we're looking at 90-plus percent of the market here, and out of the share that's deep into this space, I would bet there is only a tiny minority without BF16 support. Meanwhile, not converting these models to BF16 in the intermediate step before quanting does serious damage; it's clipping them at a massive scale. You did not address the issue, and for you to close it a few hours later shows you don't want people to know. I will reopen the issue, and I will make a new one. Bartowski converts to BF16 before he quants, as does Mraderacher (I probably just butchered his name, but you'll know who I mean). I asked him last night, and Bartowski had already confirmed it a while ago. So again, why are you doing this?
