Add SDPA attention support to models

#3

Models can now be loaded without the flash-attn package installed, by using PyTorch's built-in SDPA attention instead.
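For context, "sdpa" refers to PyTorch's native torch.nn.functional.scaled_dot_product_attention, which ships with PyTorch itself and needs no extra package. A minimal sketch of what that kernel computes (shapes chosen arbitrarily for illustration):

```python
import torch
import torch.nn.functional as F

# SDPA is PyTorch's built-in fused attention kernel; no flash-attn install required.
# Tensors are (batch, num_heads, seq_len, head_dim).
q = torch.randn(1, 4, 8, 16)
k = torch.randn(1, 4, 8, 16)
v = torch.randn(1, 4, 8, 16)

out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(out.shape)  # torch.Size([1, 4, 8, 16])
```

On CUDA devices, PyTorch dispatches this call to an efficient fused backend when one is available, which is why it serves as a drop-in fallback for flash_attention_2.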

For example:

import torch
from transformers import AutoModelForSequenceClassification, AutoProcessor
from transformers.image_utils import load_image

modality = "image"

# Load model
model_path = "nvidia/llama-nemotron-rerank-vl-1b-v2"
device = "cuda" if torch.cuda.is_available() else "cpu"

model = AutoModelForSequenceClassification.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    attn_implementation="sdpa", # use sdpa attention instead of flash_attention_2
    device_map="auto"
).eval()

This adds code changes as discussed here: https://huggingface.co/nvidia/llama-nemotron-rerank-vl-1b-v2/discussions/2
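Building on the example above, one way to choose the implementation at load time is to fall back to "sdpa" only when flash-attn is unavailable. This is a hedged sketch, not part of this PR; the helper name pick_attn_implementation is hypothetical:

```python
import importlib.util

import torch


def pick_attn_implementation() -> str:
    # Hypothetical helper: prefer flash_attention_2 when the flash-attn
    # package is importable and a CUDA device is present; otherwise fall
    # back to PyTorch's built-in SDPA attention.
    if torch.cuda.is_available() and importlib.util.find_spec("flash_attn") is not None:
        return "flash_attention_2"
    return "sdpa"


print(pick_attn_implementation())
```

The returned string can then be passed as attn_implementation to from_pretrained, so the same loading code works on machines with and without flash-attn installed.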

Thanks for putting this together @mrdbourke. This feature was incorporated in #4 as part of a broader change to get this model working correctly in the latest versions of transformers, as well as enabling configuration of different attention implementations when loading the model.

nvidia-oliver-holworthy changed pull request status to closed
