# Gemma 4 E4B Legal GRPO — LoRA Adapter

Fine-tuned from Google Gemma 4 E4B for legal analysis using GRPO (Group Relative Policy Optimization) reinforcement learning.
## Model Details
| Property | Value |
|---|---|
| Base model | google/gemma-4-E4B-it (4B effective params) |
| Quantized base | unsloth/gemma-4-E4B-it-unsloth-bnb-4bit |
| Method | GRPO (RL-based alignment) |
| LoRA rank | 16 (alpha=16, RSLoRA) |
| Trainable params | 42.4M (0.53% of base) |
| Adapter tensors | 588 (language_model only) |
| Adapter size | 140 MB |
| Training hardware | NVIDIA RTX PRO 6000 Blackwell (95 GB, Colab G4) |
| Training time | ~1.3 hours (250 steps, 1000 prompts) |
| License | Apache 2.0 |
## Model Description
Legal-domain LoRA adapter trained with GRPO reinforcement learning on 45K+ legal/reasoning examples. Optimized for legal document analysis, statute interpretation, evidence classification, and case law reasoning with proper Bluebook citations.
- Developed by: Semaj90
- Model type: PEFT LoRA adapter (text generation)
- Language: English
- License: Apache 2.0
- Fine-tuned from: google/gemma-4-E4B-it
## Model Sources
- Repository: Semaj90/gemma4-e4b-legal-grpo
- GRPO Paper: DeepSeekMath (arXiv:2402.03300)
## How to Get Started

### With Unsloth (recommended)
```python
from unsloth import FastVisionModel
from peft import PeftModel

# Load the 4-bit quantized base model
model, tokenizer = FastVisionModel.from_pretrained(
    model_name="unsloth/gemma-4-E4B-it-unsloth-bnb-4bit",
    max_seq_length=4096,
    load_in_4bit=True,
)

# Attach the legal GRPO LoRA adapter and switch to inference mode
model = PeftModel.from_pretrained(model, "Semaj90/gemma4-e4b-legal-grpo")
FastVisionModel.for_inference(model)

messages = [{"role": "user", "content": [{"type": "text", "text": "Analyze 42 U.S.C. Section 1983"}]}]
inputs = tokenizer.apply_chat_template(
    messages, tokenize=True, add_generation_prompt=True, return_tensors="pt"
).to("cuda")
output = model.generate(input_ids=inputs, max_new_tokens=512, temperature=0.7)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```
### With Ollama (GGUF)

A quantized GGUF build is published separately; see Semaj90/gemma4-e4b-legal-grpo-GGUF.

```shell
ollama create gemma4-legal:latest -f Modelfile
ollama run gemma4-legal:latest "What are the elements of negligence?"
```
## Uses

### Direct Use
- Legal document analysis and case law research
- Evidence classification and chain of custody assessment
- Statute interpretation (U.S.C., CFR, state codes)
- Case law reasoning and precedent analysis
- Contract review and liability assessment
- Legal AI chatbot / RAG pipeline augmentation
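For the RAG-augmentation use case above, retrieved statute or case text can be injected into the model's chat-format prompt. A minimal sketch using plain string formatting — `build_rag_prompt` is a hypothetical helper, not part of this repository:

```python
def build_rag_prompt(question: str, passages: list[str]) -> list[dict]:
    """Hypothetical helper: wrap retrieved legal passages and a question
    into the chat-message format expected by the tokenizer's template."""
    context = "\n\n".join(f"[Source {i + 1}] {p}" for i, p in enumerate(passages))
    text = (
        "Use only the sources below to answer. Cite sources as [Source N].\n\n"
        f"{context}\n\nQuestion: {question}"
    )
    return [{"role": "user", "content": [{"type": "text", "text": text}]}]

messages = build_rag_prompt(
    "What conduct does this statute address?",
    ["42 U.S.C. § 1983 provides a civil cause of action for deprivation "
     "of rights under color of law."],
)
```

The resulting `messages` list can be passed directly to `tokenizer.apply_chat_template` as in the Unsloth example above.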
### Out-of-Scope Use
- Not a substitute for professional legal advice
- Not designed for jurisdictions outside U.S. law
- Should not be used for automated legal decision-making without human review
## Training Details

### Training Data
| Dataset | Samples | Purpose |
|---|---|---|
| FineTome-100k | 10,000 | General instruction following |
| GSM8K | 5,000 | Mathematical reasoning |
| Pile of Law | 20,000 | Legal text corpus |
| LexGLUE CaseHold | 5,000 | Legal reasoning / holdings |
| LexGLUE SCOTUS | 5,000 | Supreme Court opinions |
| Custom codebase patterns | 250 | Legal AI system patterns |
45K+ examples distilled into 1,000 GRPO prompts (Phase 1 pilot).
### Training Procedure

**Method:** GRPO (Group Relative Policy Optimization) — generates multiple completions per prompt, scores them with reward functions, and updates the policy to favor higher-reward outputs.
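The core of GRPO is that each completion's advantage is computed relative to the other completions in its group, rather than from a learned value model. A self-contained sketch of that normalization (illustrative numbers, not the training code):

```python
from statistics import mean, stdev

def group_relative_advantages(rewards: list[float], eps: float = 1e-4) -> list[float]:
    """Normalize each completion's reward against its group:
    advantage = (reward - group mean) / (group std + eps)."""
    mu = mean(rewards)
    sigma = stdev(rewards) if len(rewards) > 1 else 0.0
    return [(r - mu) / (sigma + eps) for r in rewards]

# Two completions per prompt, as in this training run:
advs = group_relative_advantages([0.8, 0.4])
# The higher-reward completion gets a positive advantage,
# the lower-reward one a negative advantage; they sum to zero.
```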
### Training Hyperparameters
- Training regime: BFloat16
- Optimizer: AdamW 8-bit
- Learning rate: 5e-6 (cosine schedule)
- Batch size: 4 (gradient accumulation: 2, effective: 8)
- Generations per prompt: 2
- Max completion length: 256 tokens
- Warmup steps: 30
- Weight decay: 0.01
- Max grad norm: 0.5
- Epochs: 1
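These settings map roughly onto TRL's `GRPOConfig`. A hedged sketch of that mapping — field names follow TRL's `TrainingArguments`/`GRPOConfig` conventions and may differ across versions; this is not the exact training script:

```python
from trl import GRPOConfig

# Sketch only: approximate mapping of the hyperparameters above onto
# TRL's GRPOConfig (field names may vary by TRL version).
config = GRPOConfig(
    learning_rate=5e-6,
    lr_scheduler_type="cosine",
    optim="adamw_8bit",
    bf16=True,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=2,   # effective batch size 8
    num_generations=2,               # completions scored per prompt
    max_completion_length=256,
    warmup_steps=30,
    weight_decay=0.01,
    max_grad_norm=0.5,
    num_train_epochs=1,
)
```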
### Reward Functions

A single combined reward aggregates five signals, weighted as below and computed in one pass for training throughput:
| Signal | Weight | Description |
|---|---|---|
| Citation accuracy | 0.25 | Valid Bluebook, U.S.C., CFR citation formats |
| Reasoning logic | 0.25 | Logical connectors (therefore, because, pursuant to) |
| Legal formatting | 0.20 | Numbered lists, paragraph structure |
| Anti-hallucination | 0.15 | Penalize fabricated case names |
| Length efficiency | 0.15 | Target 100-250 words |
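The combination itself is a weighted sum of the five signal scores. An illustrative sketch using the weights from the table — the individual scoring functions here are simplified stand-ins, not the actual training code:

```python
WEIGHTS = {
    "citation": 0.25,
    "reasoning": 0.25,
    "formatting": 0.20,
    "anti_hallucination": 0.15,
    "length": 0.15,
}

def length_score(text: str, lo: int = 100, hi: int = 250) -> float:
    """Stand-in for the length-efficiency signal: 1.0 inside the
    100-250 word target window, 0.0 outside it."""
    n = len(text.split())
    return 1.0 if lo <= n <= hi else 0.0

def combined_reward(signals: dict) -> float:
    """Weighted sum over all five signals (weights sum to 1.0)."""
    return sum(WEIGHTS[k] * signals[k] for k in WEIGHTS)

r = combined_reward({
    "citation": 1.0, "reasoning": 0.5, "formatting": 1.0,
    "anti_hallucination": 1.0, "length": 0.0,
})
# 0.25*1 + 0.25*0.5 + 0.20*1 + 0.15*1 + 0.15*0 = 0.725
```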
### Speeds, Sizes, Times
- Total training time: ~1.3 hours (4,744 seconds)
- Steps: 250
- Throughput: 0.21 samples/sec
- Hardware: NVIDIA RTX PRO 6000 Blackwell Edition (95 GB VRAM)
- Cloud: Google Colab G4 GPU
## Bias, Risks, and Limitations
- May hallucinate case citations despite anti-hallucination reward signal
- Trained primarily on U.S. federal law — limited state and international coverage
- Text-only adapter — vision/audio capabilities of base Gemma 4 pass through unmodified
- Phase 1 pilot training (1K prompts, 1 epoch) — production training with full dataset recommended
## Adapter Surgery Note
The original training produced 884 tensors (588 language + 224 vision + 72 audio). Despite `finetune_vision_layers=False`, generic `target_modules` (`q_proj`, `k_proj`, etc.) matched projections across all sub-models. The vision/audio tensors use `Gemma4ClippableLinear`, which PEFT cannot merge (upstream issue).
This adapter has been surgically cleaned to contain only the 588 language_model tensors (140 MB). Vision and audio capabilities pass through from the base model unmodified.
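The "surgery" amounts to a key-prefix filter over the adapter's state dict. A minimal illustration on a plain dict — the tensor key names are hypothetical examples, and the real adapter file would be loaded and re-saved with safetensors:

```python
def keep_language_only(state_dict: dict) -> dict:
    """Keep only adapter tensors under the language_model subtree,
    dropping the vision/audio projections PEFT cannot merge."""
    return {k: v for k, v in state_dict.items() if "language_model" in k}

# Hypothetical key names for illustration:
tensors = {
    "base_model.model.language_model.layers.0.self_attn.q_proj.lora_A.weight": 1,
    "base_model.model.vision_tower.blocks.0.attn.q_proj.lora_A.weight": 2,
    "base_model.model.audio_tower.layers.0.attn.q_proj.lora_A.weight": 3,
}
cleaned = keep_language_only(tensors)
# Only the language_model tensor survives the filter.
```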
## Environmental Impact
- Hardware Type: NVIDIA RTX PRO 6000 Blackwell
- Hours used: ~1.3
- Cloud Provider: Google Colab
- Compute Region: US
## Technical Specifications

### Compute Infrastructure
- Hardware: NVIDIA RTX PRO 6000 Blackwell Edition (95 GB, Compute 12.0)
- Software: Unsloth 2026.4.2, TRL 0.24.0, PEFT 0.18.1, Transformers 5.5.0, PyTorch 2.10.0+cu128
## Citation

```bibtex
@misc{gemma4-legal-grpo-2026,
  title={Gemma 4 E4B Legal GRPO Adapter},
  author={Semaj90},
  year={2026},
  url={https://huggingface.co/Semaj90/gemma4-e4b-legal-grpo}
}
```
## Framework Versions
- PEFT 0.18.1
- Unsloth 2026.4.2
- TRL 0.24.0
- Transformers 5.5.0
- PyTorch 2.10.0+cu128