HindiSTT
A fine-tuned Whisper model for Hindi speech-to-text transcription, outputting Hinglish (Hindi written in Roman script).
Model Description
This model transcribes Hindi audio into romanized text (Hinglish), making it easier to read and process Hindi speech without requiring Devanagari script support.
Example Output:
- Audio: [Hindi speech saying "नमस्ते, आप कैसे हैं?" ("Hello, how are you?")]
- Output:
namaste, aap kaise hain?
Key Features
- Hinglish Output: Transcribes spoken Hindi directly into romanized (Hinglish) text
- Whisper Architecture: Based on Whisper Large V3, compatible with transformers
- Noise Resistant: Handles noisy audio environments well
- Low Hallucination: Minimizes transcription hallucinations
Usage
```python
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline

# Run on GPU in half precision when available, otherwise CPU in float32.
device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_id = "Svetozar1993/HindiSTT"

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id,
    torch_dtype=torch_dtype,
    low_cpu_mem_usage=True,
    use_safetensors=True,
)
model.to(device)

processor = AutoProcessor.from_pretrained(model_id)

pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    torch_dtype=torch_dtype,
    device=device,
    # "en" matches the model's Roman-script (Hinglish) output.
    generate_kwargs={"task": "transcribe", "language": "en"},
)

result = pipe("audio.wav")
print(result["text"])
```
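Besides a file path, the pipeline also accepts a raw NumPy array, provided it is mono `float32` at Whisper's expected 16 kHz sampling rate. A minimal preprocessing sketch (the function name and the naive linear-interpolation resampler are illustrative, not part of this card; a proper resampler such as the one in `librosa` or `torchaudio` is preferable for quality):

```python
import numpy as np

def to_whisper_input(samples: np.ndarray, sample_rate: int) -> np.ndarray:
    """Convert audio to the mono float32 16 kHz array Whisper expects."""
    audio = samples.astype(np.float32)
    # Scale int16 PCM into [-1, 1].
    if samples.dtype == np.int16:
        audio = audio / 32768.0
    # Average channels to mono (shape: [num_samples, channels]).
    if audio.ndim == 2:
        audio = audio.mean(axis=1)
    # Naive linear-interpolation resample to 16 kHz (illustrative only).
    target_rate = 16_000
    if sample_rate != target_rate:
        duration = audio.shape[0] / sample_rate
        n_out = int(duration * target_rate)
        old_t = np.linspace(0.0, duration, num=audio.shape[0], endpoint=False)
        new_t = np.linspace(0.0, duration, num=n_out, endpoint=False)
        audio = np.interp(new_t, old_t, audio).astype(np.float32)
    return audio
```

The resulting array can then be passed to the pipeline in place of the file path, e.g. `pipe(to_whisper_input(samples, 44100))`.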
Flash Attention 2
For faster inference with Flash Attention:
```shell
pip install flash-attn --no-build-isolation
```
```python
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id,
    torch_dtype=torch_dtype,
    low_cpu_mem_usage=True,
    attn_implementation="flash_attention_2",
)
```
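Loading with `attn_implementation="flash_attention_2"` raises an error if the `flash-attn` package is not installed. One way to degrade gracefully (a sketch, not part of the original card) is to detect the package and fall back to PyTorch's built-in scaled-dot-product attention:

```python
import importlib.util

# Use Flash Attention 2 when the flash-attn package is importable,
# otherwise fall back to PyTorch's SDPA implementation.
attn_impl = "flash_attention_2" if importlib.util.find_spec("flash_attn") else "sdpa"

# Then pass attn_implementation=attn_impl to from_pretrained(...).
```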
Model Details
- Base Model: Whisper Large V3
- Language: Hindi (Romanized/Hinglish output)
- Parameters: 1.5B
- License: Apache 2.0
Author
Svetozar1993