Extend support for transformers versions from 4.45.0 to 5.2.0

#1

This PR updates the model implementation to support a wider range of transformers versions instead of only 4.49.0.
The model's custom code relied on internal transformers APIs that changed across versions, causing failures on any version other than 4.49.0.

  1. _update_causal_mask removed in transformers 5.0 — The LlamaBidirectionalModel overrode _update_causal_mask(), a private method on LlamaModel that was removed in transformers 5.0 in favor of create_bidirectional_mask from
    transformers.masking_utils.

  2. hidden_states[-1] returns None on transformers >= 4.57 — _extract_embeddings accessed outputs.hidden_states[-1], but the custom LlamaBidirectionalModel.forward() never populated the hidden_states tuple, only last_hidden_state.
    This worked incidentally on some versions but broke on 4.57+, where the base class internals changed.

  3. additional_special_tokens unavailable on transformers 5.0 — The processor filtered tokenizer.additional_special_tokens, but in transformers 5.0 the tokenizer backend changed to TokenizersBackend which doesn't expose this attribute.

  4. Missing self.post_init() call — LlamaNemotronVLModel.__init__ didn't call self.post_init(), the standard transformers finalization step. On transformers 5.0+ this caused AttributeError: 'LlamaNemotronVLModel' object has no attribute
    'all_tied_weights_keys'.
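The masking_utils change in item 1 can be handled by feature detection rather than version-string parsing. A minimal sketch, assuming only that create_bidirectional_mask lives at the import path named above:

```python
# Minimal sketch: feature-detect the new masking helper instead of parsing
# version strings. On older transformers (or when the library is absent
# entirely) the import fails and the legacy code path is used instead.
try:
    from transformers.masking_utils import create_bidirectional_mask
    HAS_CREATE_BIDIRECTIONAL_MASK = True
except ImportError:
    create_bidirectional_mask = None
    HAS_CREATE_BIDIRECTIONAL_MASK = False
```

Import-based detection has the advantage of also working on patched or forked builds whose version strings don't match upstream.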

Changes

modeling_llama_nemotron_vl.py

  • Replaced _update_causal_mask override with explicit forward() and _create_bidirectional_mask — Instead of hooking into a private method, LlamaBidirectionalModel now has its own forward() that constructs the bidirectional mask
    directly, dispatching to transformers.masking_utils.create_bidirectional_mask on 5.0+ and falling back to _prepare_4d_attention_mask on older versions.

  • Version-portable decoder layer calls — Uses runtime introspection to detect API differences across transformers versions: past_key_value vs past_key_values parameter naming (changed in 4.56), DynamicCache constructor signature, and tuple
    vs tensor return from decoder layers.

  • Changed _extract_embeddings to use outputs.last_hidden_state — Replaced self(**batch, output_hidden_states=True).hidden_states[-1] with self(**batch).last_hidden_state, which is always reliably populated by the custom forward().

  • Changed forward() return type to BaseModelOutputWithPast — The parent returned CausalLMOutputWithPast (which has logits but no last_hidden_state). Since this is an embedding model that doesn't compute logits, BaseModelOutputWithPast
    is the correct output type and propagates last_hidden_state from the language model.

  • Added self.post_init() — Standard transformers pattern that initializes internal bookkeeping (all_tied_weights_keys, weight tying, etc.), required by transformers 5.0+.

processing_llama_nemotron_vl.py

  • Removed additional_special_tokens filtering — The processor filtered out <box>, </box>, <ref>, </ref> tokens from additional_special_tokens, which broke on transformers 5.0 where the attribute doesn't exist. Testing confirmed this
    filtering has no effect on embedding output (zero diff with and without it), so it was removed entirely.

Test results

All supported versions (4.45.0 through 5.2.0) produce zero diff against the reference embeddings (generated with transformers 4.49.0):

Version   Result
4.44.2    FAIL — tokenizers crate can't parse tokenizer.json
4.45.0    PASS (zero diff)
4.46.1    PASS (zero diff)
4.47.0    PASS (zero diff)
4.48.0    PASS (zero diff)
4.49.0    PASS (reference)
4.57.6    PASS (zero diff)
5.0.0     PASS (zero diff)
5.1.0     PASS (zero diff)
5.2.0     PASS (zero diff)
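The per-version check behind this table amounts to an elementwise comparison against the reference embeddings. A sketch of that check, where max_abs_diff and the sample values are hypothetical:

```python
# Hypothetical sketch of the zero-diff check: compare embeddings produced on
# each transformers version against the 4.49.0 reference, elementwise.
def max_abs_diff(reference, candidate):
    assert len(reference) == len(candidate)
    return max(abs(r - c) for r, c in zip(reference, candidate))

ref = [0.12, -0.34, 0.56]   # reference embedding (illustrative values)
out = [0.12, -0.34, 0.56]   # candidate produced on another version
diff = max_abs_diff(ref, out)  # 0.0 corresponds to "PASS (zero diff)"
```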