# ak3ra/gemma-4-e2b-mixed-sft-fft
Mixed-modality SFT of Gemma 4 E2B on Luganda speech (ASR) and English↔Luganda/Acholi translation: a single fully fine-tuned model that transcribes speech and translates text with one set of weights.
## Recipe

- Base: `jq/e2b-pretrain-eosmask` (Gemma 4 E2B continued-pretraining checkpoint)
- Method: full fine-tuning, 1 epoch
- LR: 5e-5, effective batch size 16, `max_length` 8192
- Optimizer: `adamw_torch`
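Collected in one place, the hyperparameters above look like the following (a sketch only: key names follow the Hugging Face `TrainingArguments` convention, and the actual trainer invocation is not shown in this card):

```python
# Hyperparameters from the recipe above, TrainingArguments-style.
# The per-device batch size / gradient-accumulation split behind the
# effective batch size of 16 is not stated in this card.
train_config = {
    "learning_rate": 5e-5,       # LR
    "num_train_epochs": 1,       # single epoch of full fine-tuning
    "optim": "adamw_torch",      # optimizer
    "max_length": 8192,          # max sequence length
    "effective_batch_size": 16,  # not a real TrainingArguments key; shown for completeness
}

print(train_config)
```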
Training data (deduped, 16,704 mixed rows):

- 6 Luganda speech subsets: `Sunbird/speech` `lug_{commonvoice,fleurs,makbenchmark,makerereradio,salt,waxal}` (audio clips >30 s filtered out; 4-step audio augmentation)
- `Sunbird/sunflower-posttrain-data` `sft_translations` (eng/lug/ach)
- `Sunbird/sunflower-posttrain-data` `sft_instructions` (eng/lug/ach)
System message (used at training; must be matched at inference):

```
You are an assistant that transcribes speech and translates Ugandan languages.
```
## Eval results
Selected: last checkpoint (step 1043, epoch 1.0). Loss-based selection
("best by eval_text_loss") was unreliable for this run -- text loss was
flat across the last few evals while the model continued resolving
repetition collapses on hard audio clips that loss couldn't see.
### Translation (Sunbird/sunflower-translation-eval test, 99 ex × 2 langs)
| metric | value |
|---|---|
| avg chrF | 0.4561 |
| avg BLEU | 17.46 |
| lug chrF | 0.5104 |
| ach chrF | 0.4018 |
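The "avg chrF" row is the unweighted mean of the two per-language scores, which is a quick consistency check on the table (not taken from the card's eval script):

```python
# Per-language chrF scores from the table above.
lug_chrf = 0.5104  # Luganda
ach_chrf = 0.4018  # Acholi

# Unweighted mean over the two target languages.
avg_chrf = (lug_chrf + ach_chrf) / 2
print(round(avg_chrf, 4))  # 0.4561, matching the "avg chrF" row
```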
### ASR (6 Luganda speech sets, dev[:10] each, 55 examples)
| metric | value |
|---|---|
| WER | 0.4516 |
| CER | 0.1129 |
WER/CER are per-example averaged (matches scripts/eval_asr.py after PR #20).
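Per-example averaging means each clip's error rate is computed on its own and the rates are then averaged, rather than pooling edit counts over the whole corpus; short clips therefore weigh as much as long ones. A minimal from-scratch sketch of that convention (an assumption about what scripts/eval_asr.py does, not code taken from that script):

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution
        prev = cur
    return prev[-1]

def per_example_error_rate(refs, hyps, unit="word"):
    """Mean of per-clip error rates (WER for unit='word', CER for unit='char')."""
    rates = []
    for ref, hyp in zip(refs, hyps):
        r = ref.split() if unit == "word" else list(ref)
        h = hyp.split() if unit == "word" else list(hyp)
        rates.append(edit_distance(r, h) / max(len(r), 1))
    return sum(rates) / len(rates)

# Toy Luganda example: first clip exact, second clip drops one of two words.
refs = ["webale nyo", "ogenda wa"]
hyps = ["webale nyo", "ogenda"]
print(per_example_error_rate(refs, hyps, "word"))  # (0.0 + 0.5) / 2 = 0.25
```

Corpus-pooled WER would instead divide total edits by total reference words, giving a different number on the same predictions.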
## Quick start (Colab-ready)
Hardware: ~10 GB of bf16 weights. Comfortable on an A100 (40 GB) or L4 (24 GB). On a free Colab T4 (16 GB), pass `load_in_4bit=True` to `FastModel.from_pretrained`.
Install:

```shell
pip install -U "transformers>=5.7" "unsloth>=2026.4" librosa soundfile datasets
```
### Speech recognition (Luganda audio → transcript)
```python
import os
os.environ.setdefault("UNSLOTH_COMPILE", "0")
os.environ.setdefault("TORCHDYNAMO_DISABLE", "1")

import torch
from datasets import load_dataset, Audio
from transformers import AutoProcessor
from unsloth import FastModel

REPO = "ak3ra/gemma-4-e2b-mixed-sft-fft"
SYSTEM = "You are an assistant that transcribes speech and translates Ugandan languages."

model, _ = FastModel.from_pretrained(
    model_name=REPO,
    max_seq_length=8192,
    load_in_4bit=False,  # set True on a free Colab T4
    fast_inference=False,
)

# Unsloth's loader can return a tokenizer-only object for VLMs; reload the
# multimodal processor explicitly so audio inputs are encoded.
processor = AutoProcessor.from_pretrained(REPO)
model.eval()

# Pull a Luganda sample from SALT dev (any 16 kHz mono audio array works).
ds = load_dataset("Sunbird/speech", name="lug_salt", split="dev[:1]")
ds = ds.cast_column("audio", Audio(sampling_rate=16000))
audio = ds[0]["audio"]["array"]
print("Reference :", ds[0]["text"])

messages = [
    {"role": "system", "content": [{"type": "text", "text": SYSTEM}]},
    {"role": "user", "content": [
        {"type": "audio", "audio": audio},
        {"type": "text", "text": "Please transcribe this Luganda audio."},
    ]},
]

inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

with torch.inference_mode():
    out = model.generate(**inputs, max_new_tokens=200, do_sample=False)

gen = out[0, inputs["input_ids"].shape[-1]:]
print("Prediction:", processor.tokenizer.decode(gen, skip_special_tokens=True).strip())
```
### Translation (English → Luganda or Acholi)
```python
# Reuses `model`, `processor`, `SYSTEM` from above.
text = "How is the weather today?"

messages = [
    {"role": "system", "content": [{"type": "text", "text": SYSTEM}]},
    {"role": "user", "content": [{"type": "text", "text": f"Translate to Luganda: {text}"}]},
]

inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

with torch.inference_mode():
    out = model.generate(**inputs, max_new_tokens=200, do_sample=False)

gen = out[0, inputs["input_ids"].shape[-1]:]
print(processor.tokenizer.decode(gen, skip_special_tokens=True).strip())
```
For Acholi, swap `"Translate to Luganda"` → `"Translate to Acholi"`.
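Since the two prompts differ only in the target-language string, a small helper keeps them consistent with the training-time system message (`build_translation_messages` is a hypothetical convenience, not part of this repo):

```python
SYSTEM = "You are an assistant that transcribes speech and translates Ugandan languages."

def build_translation_messages(text, target_lang):
    """Chat messages for translation, matching the format used at training.

    `target_lang` should be "Luganda" or "Acholi"."""
    return [
        {"role": "system", "content": [{"type": "text", "text": SYSTEM}]},
        {"role": "user", "content": [
            {"type": "text", "text": f"Translate to {target_lang}: {text}"},
        ]},
    ]

msgs = build_translation_messages("How is the weather today?", "Acholi")
print(msgs[1]["content"][0]["text"])  # Translate to Acholi: How is the weather today?
```

The returned list can be passed straight to `processor.apply_chat_template(...)` as in the snippets above.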
### Plain transformers fallback
Works, but slightly different attention / KV-cache numerics can flip borderline audio clips into repetition loops, so per-example averaged WER on small (n=10) eval subsets may differ by ~0.05–0.10 from the Unsloth path above. Use Unsloth for benchmarks; this path is fine for casual use.
```python
from transformers import AutoModelForCausalLM, AutoProcessor

REPO = "ak3ra/gemma-4-e2b-mixed-sft-fft"
model = AutoModelForCausalLM.from_pretrained(REPO, dtype="bfloat16", device_map="auto")
processor = AutoProcessor.from_pretrained(REPO)
model.eval()

# ... build `messages` and `inputs` exactly as above, then model.generate(...)
```
## Repository

Trained from https://github.com/SunbirdAI/sunflower, branch `sft-hp-sweeps`.