How to Fine-Tune Vision Layers with LoRA?
I'm trying to fine-tune only the vision layers of the model using LoRA, but I'm running into an issue where the model doesn't learn (the evaluation loss stays constant). Has anyone successfully implemented this?
**What I've Tried:**
- LoRA configuration targeting vision projection layers (the `_proj` layers in the vision encoder)
- Various learning rates (from 1e-5 to 5e-3)
- Verified the vision layers are trainable (`requires_grad=True`)
- Different batch sizes and gradient accumulation steps
**Specific Issues:**
- The loss doesn't decrease when only the vision layers are tuned
- The language layers fine-tune normally when targeted
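A quick way to narrow this down is to check, after a single backward pass, which trainable parameters actually received gradients. This is a generic PyTorch diagnostic (the toy model here is illustrative, not the actual architecture):

```python
import torch
import torch.nn as nn

# Generic diagnostic: run one backward pass, then list trainable parameters
# whose .grad is still None. A constant eval loss combined with grad-less
# vision parameters points to a severed autograd graph, not a bad LR.
model = nn.Sequential(nn.Linear(4, 4), nn.ReLU(), nn.Linear(4, 1))
loss = model(torch.randn(2, 4)).sum()
loss.backward()

no_grad = [n for n, p in model.named_parameters()
           if p.requires_grad and p.grad is None]
print(no_grad)  # empty here; on the real model, vision params would show up
```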
Any advice or working examples would be greatly appreciated!
Granite apparently detaches the vision features inside its `forward`, so even if I enable gradients in `get_image_features`, or even pass precomputed features back in, the graph is still cut.
With this implementation, training the vision tower directly is impossible (short of rewriting its `forward` source code).
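The failure mode described above can be reproduced in miniature. This toy model is an assumption about the shape of the problem, not Granite's actual source; it just shows why a `detach()` in `forward` makes the vision tower untrainable while the language side still learns:

```python
import torch
import torch.nn as nn

class DetachingVLM(nn.Module):
    """Toy model whose forward() detaches vision features, as described above."""
    def __init__(self):
        super().__init__()
        self.vision = nn.Linear(8, 8)  # stand-in vision tower
        self.lm = nn.Linear(8, 1)      # stand-in language head

    def forward(self, x):
        feats = self.vision(x).detach()  # the problematic detach: cuts the graph
        return self.lm(feats)

model = DetachingVLM()
loss = model(torch.randn(2, 8)).sum()
loss.backward()

print(model.vision.weight.grad)           # None: no gradient reaches the vision tower
print(model.lm.weight.grad is not None)   # True: the language head still trains
```

This matches the symptoms in the question: language-layer LoRA works, vision-layer LoRA shows a flat loss, and `requires_grad=True` on the vision parameters changes nothing.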