Model Card for Cardiology-TTS

Model Details

This model is a fine-tuned version of the Conversational Speech Model (CSM-1B), adapted with LoRA for parameter-efficient fine-tuning. It was trained on a 1,530-sample dataset of cardiology texts to generate high-quality speech from cardiology-related text. It builds on the text-to-speech capabilities of the original CSM-1B model and extends them with domain-specific cardiology terminology. It is intended for English speech generation, especially in clinical and educational contexts.
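
For readers unfamiliar with LoRA, the sketch below illustrates the general idea on a single linear layer: the base weight stays frozen and only a small low-rank update is trained. The layer shape, rank, and alpha are hypothetical values chosen for illustration; they are not the hyperparameters used to train this model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical values for illustration only (not this model's actual settings).
d_out, d_in = 1024, 1024
rank, alpha = 8, 16

W = rng.standard_normal((d_out, d_in))        # frozen base weight
A = rng.standard_normal((rank, d_in)) * 0.01  # trainable, small random init
B = np.zeros((d_out, rank))                   # trainable, zero init


def lora_forward(x):
    """Base projection plus the scaled low-rank update (alpha / rank) * B @ A."""
    return x @ W.T + (alpha / rank) * (x @ A.T) @ B.T


full_params = W.size
lora_params = A.size + B.size
trainable_fraction = lora_params / full_params
print(f"trainable parameters: {lora_params} of {full_params} "
      f"({trainable_fraction:.2%})")
```

Because B starts at zero, the adapted layer initially behaves exactly like the base layer, and training only has to update the two small matrices.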

Uses

Direct Use

Text-to-Speech (TTS) generation for cardiology educational content, medical reports, or clinical explanations.
Integrating spoken content in healthcare apps, e-learning platforms, or patient-facing tools for cardiology topics.
Research and prototyping domain-specific TTS applications using small medical datasets.

Bias, Risks, and Limitations

Small training dataset (1,530 samples) → The model may not generalize well to rare medical terms, long passages, or medical domains outside cardiology.
English-only support → Model is not trained for other languages.
TTS artifacts → Some generated audio may contain unnatural pauses, mispronunciations, or clipping in challenging sentences.
Not for diagnostic purposes → Model outputs speech for educational/illustrative purposes and should not be used for medical diagnosis or patient instructions.
Model size and resources → CSM-1B is a large model; it requires a GPU for real-time inference and significant VRAM for batch synthesis.
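
The artifacts listed above (clipping and unnatural pauses) can be screened for automatically before audio is published. The following is a minimal sketch, assuming the generated audio is a mono float waveform in the range [-1, 1]; the function name `audio_quality_report` and both thresholds are hypothetical and would need tuning on real outputs.

```python
import numpy as np


def audio_quality_report(audio, sample_rate=24000, clip_level=0.99,
                         silence_level=1e-3):
    """Return simple statistics that hint at clipping or long pauses."""
    audio = np.asarray(audio, dtype=np.float32).reshape(-1)
    if audio.size == 0:
        return {"duration_s": 0.0, "peak": 0.0,
                "clipped_fraction": 0.0, "longest_silence_s": 0.0}

    abs_audio = np.abs(audio)

    # Fraction of samples at or above the clipping threshold.
    clipped_fraction = float(np.mean(abs_audio >= clip_level))

    # Longest run of consecutive near-silent samples.
    silent = abs_audio < silence_level
    padded = np.concatenate(([False], silent, [False]))
    edges = np.flatnonzero(padded[1:] != padded[:-1])
    run_lengths = edges[1::2] - edges[::2]
    longest = int(run_lengths.max()) if run_lengths.size else 0

    return {
        "duration_s": audio.size / sample_rate,
        "peak": float(abs_audio.max()),
        "clipped_fraction": clipped_fraction,
        "longest_silence_s": longest / sample_rate,
    }
```

A non-zero `clipped_fraction` or a `longest_silence_s` well above a natural pause is a signal to listen to the clip manually or regenerate it with different sampling settings.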

How to Get Started with the Model

Use the code below to get started with the model.

import torch
from transformers import CsmForConditionalGeneration, AutoProcessor
import soundfile as sf
from peft import PeftModel


model_id = "unsloth/csm-1b"
device = "cuda" if torch.cuda.is_available() else "cpu"


# Load the processor and the base CSM-1B model
processor = AutoProcessor.from_pretrained(model_id)
base_model = CsmForConditionalGeneration.from_pretrained(model_id, device_map=device)

# Attach the cardiology LoRA adapter to the base model
model = PeftModel.from_pretrained(base_model, "khazarai/Cardiology-TTS")

text = "The coronary arteries are patent with no significant stenosis."

speaker_id = 0

conversation = [
    {"role": str(speaker_id), "content": [{"type": "text", "text": text}]},
]
# Tokenize the conversation and move the inputs to the same device as the model
inputs = processor.apply_chat_template(
    conversation,
    tokenize=True,
    return_dict=True,
).to(device)

audio_values = model.generate(
    **inputs,
    max_new_tokens=200,
    # play with these parameters to tweak results
    # depth_decoder_top_k=0,
    # depth_decoder_top_p=0.9,
    # depth_decoder_do_sample=True,
    # depth_decoder_temperature=0.9,
    # top_k=0,
    # top_p=1.0,
    # temperature=0.9,
    # do_sample=True,
    output_audio=True,
)
audio = audio_values[0].to(torch.float32).cpu().numpy()
sf.write("example.wav", audio, 24000)
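
Since the model may not handle long passages well, one possible workaround is to split longer text into short chunks and synthesize each chunk separately. The helper below is a minimal sketch of that idea: the name `split_into_chunks` and the `max_chars` limit are hypothetical, and the naive regular expression also splits after abbreviations such as "Dr.", so medical text may need a more careful sentence splitter.

```python
import re


def split_into_chunks(text, max_chars=200):
    """Group whole sentences into chunks of at most roughly max_chars characters."""
    # Naive sentence split: break after ., ! or ? followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            # Adding this sentence would exceed the limit, so start a new chunk.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


report_text = (
    "The left ventricle is mildly dilated. "
    "Ejection fraction is 45 percent. "
    "No pericardial effusion is seen."
)
for chunk in split_into_chunks(report_text, max_chars=80):
    print(chunk)
```

Each chunk can then be used as `text` in the generation code above, and the resulting waveforms concatenated before writing the final file. A single sentence longer than `max_chars` is kept whole rather than cut mid-sentence.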