Add SDPA attention support to models
#3
by mrdbourke - opened
Models can now be loaded without flash_attention_2 installed.
For example:
import torch
from transformers import AutoModelForSequenceClassification, AutoProcessor
from transformers.image_utils import load_image
modality = "image"
# Load model
model_path = "nvidia/llama-nemotron-rerank-vl-1b-v2"
device = "cuda" if torch.cuda.is_available() else "cpu"
model = AutoModelForSequenceClassification.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    attn_implementation="sdpa",  # use SDPA attention instead of flash_attention_2
    device_map="auto",
).eval()
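For context, `"sdpa"` and `"flash_attention_2"` are different kernel implementations of the same underlying computation, scaled dot-product attention. A minimal NumPy sketch of that math (illustrative only, not the actual fused kernel either backend uses):

```python
import numpy as np

def scaled_dot_product_attention(q, k, v):
    # softmax(Q K^T / sqrt(d)) V -- the computation that both the "sdpa"
    # and "flash_attention_2" backends implement with different kernels.
    d = q.shape[-1]
    scores = q @ k.swapaxes(-1, -2) / np.sqrt(d)
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v
```

The practical difference between the backends is speed and memory use, not the result: FlashAttention avoids materializing the full attention-weight matrix, while SDPA uses PyTorch's built-in fused kernels and requires no extra package.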
This adds code changes as discussed here: https://huggingface.co/nvidia/llama-nemotron-rerank-vl-1b-v2/discussions/2
Thanks for putting this together @mrdbourke. Incorporated this feature in #4 as part of a broader change to get this model working correctly in the latest versions of transformers, as well as enabling configuration of different attention implementations when loading the model.
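One way to make the attention implementation configurable is to pick the backend at load time based on whether the `flash-attn` package is installed. A minimal sketch (the helper name `pick_attn_implementation` is hypothetical, not part of the PR):

```python
import importlib.util

def pick_attn_implementation() -> str:
    # Hypothetical helper: prefer flash_attention_2 when the flash-attn
    # package is importable, otherwise fall back to PyTorch's SDPA kernels.
    if importlib.util.find_spec("flash_attn") is not None:
        return "flash_attention_2"
    return "sdpa"
```

The returned string can then be passed straight to `from_pretrained(..., attn_implementation=pick_attn_implementation())`.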
nvidia-oliver-holworthy changed pull request status to closed