# ak3ra/gemma-4-e2b-mixed-sft-fft
Mixed-modality SFT of Gemma 4 E2B on Luganda speech (ASR) and English↔Luganda/Acholi translation: a single fully fine-tuned model that transcribes speech and translates text with one set of weights.
## Recipe

- Base: `jq/e2b-pretrain-eosmask` (Gemma 4 E2B continued-pretraining checkpoint)
- Method: full fine-tuning, 1 epoch
- LR: 5e-5, effective batch size 16, `max_length` 8192
- Optimizer: `adamw_torch`
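Collected in one place, the hyperparameters above look like the following (a sketch only: key names follow the Hugging Face `TrainingArguments` convention, and the actual trainer invocation is not shown in this card):

```python
# Hyperparameters from the recipe above, TrainingArguments-style.
# The per-device batch size / gradient-accumulation split behind the
# effective batch size of 16 is not stated in this card.
train_config = {
    "learning_rate": 5e-5,       # LR
    "num_train_epochs": 1,       # single epoch of full fine-tuning
    "optim": "adamw_torch",      # optimizer
    "max_length": 8192,          # max sequence length
    "effective_batch_size": 16,  # not a real TrainingArguments key; shown for completeness
}

print(train_config)
```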
Training data (deduped, 16,704 mixed rows):

- 6 Luganda speech subsets: `Sunbird/speech` `lug_{commonvoice,fleurs,makbenchmark,makerereradio,salt,waxal}` (audio clips >30 s filtered out; 4-step audio augmentation)
- `Sunbird/sunflower-posttrain-data` `sft_translations` (eng/lug/ach)
- `Sunbird/sunflower-posttrain-data` `sft_instructions` (eng/lug/ach)
System message (used at training; must be matched at inference):

```
You are an assistant that transcribes speech and translates Ugandan languages.
```
## Eval results
Selected: last checkpoint (step 1043, epoch 1.0). Loss-based selection
("best by eval_text_loss") was unreliable for this run -- text loss was
flat across the last few evals while the model continued resolving
repetition collapses on hard audio clips that loss couldn't see.
### Translation (Sunbird/sunflower-translation-eval test, 99 ex × 2 langs)
| metric | value |
|---|---|
| avg chrF | 0.4561 |
| avg BLEU | 17.46 |
| lug chrF | 0.5104 |
| ach chrF | 0.4018 |
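The "avg chrF" row is the unweighted mean of the two per-language scores, which is a quick consistency check on the table (not taken from the card's eval script):

```python
# Per-language chrF scores from the table above.
lug_chrf = 0.5104  # Luganda
ach_chrf = 0.4018  # Acholi

# Unweighted mean over the two target languages.
avg_chrf = (lug_chrf + ach_chrf) / 2
print(round(avg_chrf, 4))  # 0.4561, matching the "avg chrF" row
```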
### ASR (6 Luganda speech sets, dev[:10] each, 55 examples)
| metric | value |
|---|---|
| WER | 0.4516 |
| CER | 0.1129 |
WER/CER are per-example averaged (matches scripts/eval_asr.py after PR #20).
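Per-example averaging means each clip's error rate is computed on its own and the rates are then averaged, rather than pooling edit counts over the whole corpus; short clips therefore weigh as much as long ones. A minimal from-scratch sketch of that convention (an assumption about what scripts/eval_asr.py does, not code taken from that script):

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution
        prev = cur
    return prev[-1]

def per_example_error_rate(refs, hyps, unit="word"):
    """Mean of per-clip error rates (WER for unit='word', CER for unit='char')."""
    rates = []
    for ref, hyp in zip(refs, hyps):
        r = ref.split() if unit == "word" else list(ref)
        h = hyp.split() if unit == "word" else list(hyp)
        rates.append(edit_distance(r, h) / max(len(r), 1))
    return sum(rates) / len(rates)

# Toy Luganda example: first clip exact, second clip drops one of two words.
refs = ["webale nyo", "ogenda wa"]
hyps = ["webale nyo", "ogenda"]
print(per_example_error_rate(refs, hyps, "word"))  # (0.0 + 0.5) / 2 = 0.25
```

Corpus-pooled WER would instead divide total edits by total reference words, giving a different number on the same predictions.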
## Quick start (Colab-ready)
Hardware: ~10 GB of bf16 weights. Comfortable on an A100 (40 GB) or L4 (24 GB). On a free Colab T4 (16 GB), pass `load_in_4bit=True` to `FastModel.from_pretrained`.
Install:

```shell
pip install -U "transformers>=5.7" "unsloth>=2026.4" librosa soundfile datasets
```
### Speech recognition (Luganda audio → transcript)
```python
import os
os.environ.setdefault("UNSLOTH_COMPILE", "0")
os.environ.setdefault("TORCHDYNAMO_DISABLE", "1")

import torch
from datasets import load_dataset, Audio
from transformers import AutoProcessor
from unsloth import FastModel

REPO = "ak3ra/gemma-4-e2b-mixed-sft-fft"
SYSTEM = "You are an assistant that transcribes speech and translates Ugandan languages."

model, _ = FastModel.from_pretrained(
    model_name=REPO,
    max_seq_length=8192,
    load_in_4bit=False,  # set True on a free Colab T4
    fast_inference=False,
)

# Unsloth's loader can return a tokenizer-only object for VLMs; reload the
# multimodal processor explicitly so audio inputs are encoded.
processor = AutoProcessor.from_pretrained(REPO)
model.eval()

# Pull a Luganda sample from SALT dev (any 16 kHz mono audio array works).
ds = load_dataset("Sunbird/speech", name="lug_salt", split="dev[:1]")
ds = ds.cast_column("audio", Audio(sampling_rate=16000))
audio = ds[0]["audio"]["array"]
print("Reference :", ds[0]["text"])

messages = [
    {"role": "system", "content": [{"type": "text", "text": SYSTEM}]},
    {"role": "user", "content": [
        {"type": "audio", "audio": audio},
        {"type": "text", "text": "Please transcribe this Luganda audio."},
    ]},
]

inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

with torch.inference_mode():
    out = model.generate(**inputs, max_new_tokens=200, do_sample=False)

gen = out[0, inputs["input_ids"].shape[-1]:]
print("Prediction:", processor.tokenizer.decode(gen, skip_special_tokens=True).strip())
```
### Translation (English → Luganda or Acholi)
```python
# Reuses `model`, `processor`, `SYSTEM` from above.
text = "How is the weather today?"

messages = [
    {"role": "system", "content": [{"type": "text", "text": SYSTEM}]},
    {"role": "user", "content": [{"type": "text", "text": f"Translate to Luganda: {text}"}]},
]

inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

with torch.inference_mode():
    out = model.generate(**inputs, max_new_tokens=200, do_sample=False)

gen = out[0, inputs["input_ids"].shape[-1]:]
print(processor.tokenizer.decode(gen, skip_special_tokens=True).strip())
```
For Acholi, swap `"Translate to Luganda"` → `"Translate to Acholi"`.
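Since the two prompts differ only in the target-language string, a small helper keeps them consistent with the training-time system message (`build_translation_messages` is a hypothetical convenience, not part of this repo):

```python
SYSTEM = "You are an assistant that transcribes speech and translates Ugandan languages."

def build_translation_messages(text, target_lang):
    """Chat messages for translation, matching the format used at training.

    `target_lang` should be "Luganda" or "Acholi"."""
    return [
        {"role": "system", "content": [{"type": "text", "text": SYSTEM}]},
        {"role": "user", "content": [
            {"type": "text", "text": f"Translate to {target_lang}: {text}"},
        ]},
    ]

msgs = build_translation_messages("How is the weather today?", "Acholi")
print(msgs[1]["content"][0]["text"])  # Translate to Acholi: How is the weather today?
```

The returned list can be passed straight to `processor.apply_chat_template(...)` as in the snippets above.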
### Plain transformers fallback
Works, but slightly different attention / KV-cache numerics can flip borderline audio clips into repetition loops, so per-example averaged WER on small (n=10) eval subsets may differ by ~0.05–0.10 from the Unsloth path above. Use Unsloth for benchmarks; this path is fine for casual use.
```python
from transformers import AutoModelForCausalLM, AutoProcessor

REPO = "ak3ra/gemma-4-e2b-mixed-sft-fft"
model = AutoModelForCausalLM.from_pretrained(REPO, dtype="bfloat16", device_map="auto")
processor = AutoProcessor.from_pretrained(REPO)
model.eval()

# ... build `messages` and `inputs` exactly as above, then model.generate(...)
```
## Repository

Trained from https://github.com/SunbirdAI/sunflower, branch `sft-hp-sweeps`.