Gemma 4 E4B Legal GRPO — LoRA Adapter

A LoRA adapter that fine-tunes Google's Gemma 4 E4B for legal analysis using GRPO (Group Relative Policy Optimization) reinforcement learning.

Model Details

| Property | Value |
|---|---|
| Base model | google/gemma-4-E4B-it (4B effective params) |
| Quantized base | unsloth/gemma-4-E4B-it-unsloth-bnb-4bit |
| Method | GRPO (RL-based alignment) |
| LoRA rank | 16 (alpha=16, RSLoRA) |
| Trainable params | 42.4M (0.53% of base) |
| Adapter tensors | 588 (language_model only) |
| Adapter size | 140 MB |
| Training hardware | NVIDIA RTX PRO 6000 Blackwell (95 GB, Colab G4) |
| Training time | ~1.3 hours (250 steps, 1000 prompts) |
| License | Apache 2.0 |

Model Description

Legal-domain LoRA adapter trained with GRPO reinforcement learning on 45K+ legal/reasoning examples. Optimized for legal document analysis, statute interpretation, evidence classification, and case law reasoning with proper Bluebook citations.

  • Developed by: Semaj90
  • Model type: PEFT LoRA adapter (text generation)
  • Language: English
  • License: Apache 2.0
  • Fine-tuned from: google/gemma-4-E4B-it

How to Get Started

With Unsloth (recommended)

from unsloth import FastVisionModel
from peft import PeftModel

model, tokenizer = FastVisionModel.from_pretrained(
    model_name="unsloth/gemma-4-E4B-it-unsloth-bnb-4bit",
    max_seq_length=4096,
    load_in_4bit=True,
)
model = PeftModel.from_pretrained(model, "Semaj90/gemma4-e4b-legal-grpo")
FastVisionModel.for_inference(model)

messages = [{"role": "user", "content": [{"type": "text", "text": "Analyze 42 U.S.C. Section 1983"}]}]
inputs = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True, return_tensors="pt").to("cuda")
output = model.generate(input_ids=inputs, max_new_tokens=512, do_sample=True, temperature=0.7)
print(tokenizer.decode(output[0], skip_special_tokens=True))

With Ollama (GGUF)

# See: Semaj90/gemma4-e4b-legal-grpo-GGUF
ollama create gemma4-legal:latest -f Modelfile
ollama run gemma4-legal:latest "What are the elements of negligence?"
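For reference, a minimal Modelfile of the kind used above might look like the following sketch. The GGUF filename, parameter values, and system prompt are illustrative assumptions, not the actual contents of the GGUF repo:

```
FROM ./gemma4-e4b-legal-grpo.Q4_K_M.gguf
PARAMETER temperature 0.7
PARAMETER num_ctx 4096
SYSTEM You are a legal analysis assistant. Cite statutes and case law in Bluebook format.
```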

Uses

Direct Use

  • Legal document analysis and case law research
  • Evidence classification and chain of custody assessment
  • Statute interpretation (U.S.C., CFR, state codes)
  • Case law reasoning and precedent analysis
  • Contract review and liability assessment
  • Legal AI chatbot / RAG pipeline augmentation

Out-of-Scope Use

  • Not a substitute for professional legal advice
  • Not designed for jurisdictions outside U.S. law
  • Should not be used for automated legal decision-making without human review

Training Details

Training Data

| Dataset | Samples | Purpose |
|---|---|---|
| FineTome-100k | 10,000 | General instruction following |
| GSM8K | 5,000 | Mathematical reasoning |
| Pile of Law | 20,000 | Legal text corpus |
| LexGLUE CaseHold | 5,000 | Legal reasoning / holdings |
| LexGLUE SCOTUS | 5,000 | Supreme Court opinions |
| Custom codebase patterns | 250 | Legal AI system patterns |

45K+ examples distilled into 1,000 GRPO prompts (Phase 1 pilot).

Training Procedure

Method: GRPO (Group Relative Policy Optimization) — generates multiple completions per prompt, scores them with reward functions, and updates the policy to favor higher-reward outputs.
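The core of the update can be sketched as computing group-relative advantages: each completion's reward is normalized against the mean and standard deviation of its own group, so no learned value baseline is needed. This is a simplified illustration, not the exact trainer implementation:

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each completion's reward against its group's statistics.

    Completions scoring above the group mean get positive advantage
    (reinforced); those below get negative advantage (penalized).
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# With 2 generations per prompt (as in this run), a group where one
# completion cites correctly and the other does not:
advantages = group_relative_advantages([1.0, 0.0])  # ≈ [1.0, -1.0]
```

Because advantages are relative within each group, the absolute scale of the reward functions matters less than their ranking of completions.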

Training Hyperparameters

  • Training regime: BFloat16
  • Optimizer: AdamW 8-bit
  • Learning rate: 5e-6 (cosine schedule)
  • Batch size: 4 (gradient accumulation: 2, effective: 8)
  • Generations per prompt: 2
  • Max completion length: 256 tokens
  • Warmup steps: 30
  • Weight decay: 0.01
  • Max grad norm: 0.5
  • Epochs: 1

Reward Functions

A single combined reward built from five weighted signals, collapsed into one function to keep training throughput high:

| Signal | Weight | Description |
|---|---|---|
| Citation accuracy | 0.25 | Valid Bluebook, U.S.C., CFR citation formats |
| Reasoning logic | 0.25 | Logical connectors (therefore, because, pursuant to) |
| Legal formatting | 0.20 | Numbered lists, paragraph structure |
| Anti-hallucination | 0.15 | Penalize fabricated case names |
| Length efficiency | 0.15 | Target 100-250 words |
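A minimal sketch of how five weighted signals combine into the single reward. The weights come from the table above; the individual scorers here are toy stand-ins for the actual heuristics:

```python
# Weights from the reward table; each scorer returns a value in [0, 1].
WEIGHTS = {
    "citation": 0.25,
    "reasoning": 0.25,
    "formatting": 0.20,
    "anti_hallucination": 0.15,
    "length": 0.15,
}

def length_score(text, lo=100, hi=250):
    """1.0 inside the 100-250 word target band, 0.0 outside."""
    n = len(text.split())
    return 1.0 if lo <= n <= hi else 0.0

def reasoning_score(text):
    """Fraction of expected logical connectors present in the text."""
    connectors = ["therefore", "because", "pursuant to"]
    hits = sum(1 for c in connectors if c in text.lower())
    return hits / len(connectors)

def combined_reward(scores):
    """Weighted sum of per-signal scores, yielding a reward in [0, 1]."""
    return sum(WEIGHTS[k] * scores[k] for k in WEIGHTS)
```

During training, each generated completion is scored once with the combined function and the resulting scalar feeds the GRPO update.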

Speeds, Sizes, Times

  • Total training time: ~1.3 hours (4,744 seconds)
  • Steps: 250
  • Throughput: 0.21 samples/sec
  • Hardware: NVIDIA RTX PRO 6000 Blackwell Edition (95 GB VRAM)
  • Cloud: Google Colab G4 GPU

Bias, Risks, and Limitations

  • May hallucinate case citations despite anti-hallucination reward signal
  • Trained primarily on U.S. federal law — limited state and international coverage
  • Text-only adapter — vision/audio capabilities of base Gemma 4 pass through unmodified
  • Phase 1 pilot training (1K prompts, 1 epoch) — production training with full dataset recommended

Adapter Surgery Note

The original training produced 884 tensors (588 language + 224 vision + 72 audio). Despite finetune_vision_layers=False, the generic target_modules list (q_proj, k_proj, etc.) matched the corresponding projections in the vision and audio sub-models as well. Those vision/audio tensors use Gemma4ClippableLinear, which PEFT cannot merge (upstream issue).

This adapter has been surgically cleaned to contain only the 588 language_model tensors (140 MB). Vision and audio capabilities pass through from the base model unmodified.
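The cleaning amounts to filtering the adapter's state dict down to language-model keys. A sketch of the idea, where the key prefixes are assumptions based on typical PEFT naming and the safetensors load/save calls are shown only in comments:

```python
def keep_language_tensors(state_dict):
    """Drop vision/audio LoRA tensors, keeping only language_model ones.

    PEFT names adapter tensors after the module path they attach to,
    so the multimodal sub-models can be separated by key substring.
    """
    return {
        name: tensor
        for name, tensor in state_dict.items()
        if ".language_model." in name
    }

# Applied to real adapter weights (safetensors API, for context):
#   from safetensors.torch import load_file, save_file
#   tensors = load_file("adapter_model.safetensors")
#   save_file(keep_language_tensors(tensors), "adapter_model.safetensors")

# Toy example with hypothetical key names standing in for real tensors:
example = {
    "base_model.model.language_model.layers.0.self_attn.q_proj.lora_A.weight": "L",
    "base_model.model.vision_tower.blocks.0.attn.q_proj.lora_A.weight": "V",
    "base_model.model.audio_tower.layers.0.attn.q_proj.lora_A.weight": "A",
}
kept = keep_language_tensors(example)
```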

Environmental Impact

  • Hardware Type: NVIDIA RTX PRO 6000 Blackwell
  • Hours used: ~1.3
  • Cloud Provider: Google Colab
  • Compute Region: US

Technical Specifications

Compute Infrastructure

  • Hardware: NVIDIA RTX PRO 6000 Blackwell Edition (95 GB, Compute 12.0)
  • Software: Unsloth 2026.4.2, TRL 0.24.0, PEFT 0.18.1, Transformers 5.5.0, PyTorch 2.10.0+cu128

Citation

@misc{gemma4-legal-grpo-2026,
  title={Gemma 4 E4B Legal GRPO Adapter},
  author={Semaj90},
  year={2026},
  url={https://huggingface.co/Semaj90/gemma4-e4b-legal-grpo}
}

Framework Versions

  • PEFT 0.18.1
  • Unsloth 2026.4.2
  • TRL 0.24.0
  • Transformers 5.5.0
  • PyTorch 2.10.0+cu128