Add SDPA attention support to models

#3

Models can now be loaded without the flash-attn package installed, by using PyTorch's built-in SDPA attention instead.
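For context, "sdpa" refers to PyTorch's native torch.nn.functional.scaled_dot_product_attention, which ships with PyTorch itself and needs no extra package. A minimal sketch of what that kernel computes (shapes chosen arbitrarily for illustration):

```python
import torch
import torch.nn.functional as F

# SDPA is PyTorch's built-in fused attention kernel; no flash-attn install required.
# Tensors are (batch, num_heads, seq_len, head_dim).
q = torch.randn(1, 4, 8, 16)
k = torch.randn(1, 4, 8, 16)
v = torch.randn(1, 4, 8, 16)

out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(out.shape)  # torch.Size([1, 4, 8, 16])
```

On CUDA devices, PyTorch dispatches this call to an efficient fused backend when one is available, which is why it serves as a drop-in fallback for flash_attention_2.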

For example:

import torch
from transformers import AutoModelForSequenceClassification, AutoProcessor
from transformers.image_utils import load_image

modality = "image"

# Load model
model_path = "nvidia/llama-nemotron-rerank-vl-1b-v2"
device = "cuda" if torch.cuda.is_available() else "cpu"

model = AutoModelForSequenceClassification.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    attn_implementation="sdpa", # use sdpa attention instead of flash_attention_2
    device_map="auto"
).eval()

This adds code changes as discussed here: https://huggingface.co/nvidia/llama-nemotron-rerank-vl-1b-v2/discussions/2
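Building on the example above, one way to choose the implementation at load time is to fall back to "sdpa" only when flash-attn is unavailable. This is a hedged sketch, not part of this PR; the helper name pick_attn_implementation is hypothetical:

```python
import importlib.util

import torch


def pick_attn_implementation() -> str:
    # Hypothetical helper: prefer flash_attention_2 when the flash-attn
    # package is importable and a CUDA device is present; otherwise fall
    # back to PyTorch's built-in SDPA attention.
    if torch.cuda.is_available() and importlib.util.find_spec("flash_attn") is not None:
        return "flash_attention_2"
    return "sdpa"


print(pick_attn_implementation())
```

The returned string can then be passed as attn_implementation to from_pretrained, so the same loading code works on machines with and without flash-attn installed.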

Thanks for putting this together @mrdbourke. This feature was incorporated in #4 as part of a broader change to get this model working correctly in the latest versions of transformers, as well as enabling configuration of different attention implementations when loading the model.

nvidia-oliver-holworthy changed pull request status to closed
