Stable-Audio-3-DiT-Small-Music-MLX-4bit

speech-swift — Apple SDK
soniqo.audio — website
blog — blog

MLX port of Stability AI's Stable Audio 3 (optimized), a latent-diffusion text-to-audio model with mask-based inpainting and continuation. The DiT denoiser is quantized to INT4 for Apple Silicon.

What's in this bundle

| Component | Format | Notes |
|---|---|---|
| DiT (Small-Music, 50M) | INT4 | Diffusion Transformer denoiser, group-size 64 |
| SAME-S encoder | FP32 | Audio → latents (codec is precision-sensitive: differential attention cancels in FP16) |
| SAME-S decoder | FP32 | Latents → 44.1 kHz stereo waveform |
| T5Gemma text encoder | FP16 | Prompt conditioning |

The codec stays in FP32 because the SAME differential attention suffers catastrophic cancellation in FP16 (per Stability's own MLX runtime). T5Gemma stays in FP16: it is small relative to the DiT, and quantization gives no speed-up on the short prompt-encode pass.
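The cancellation argument can be illustrated with plain arithmetic. The sketch below is not the SAME attention code; it only rounds two made-up, nearly equal weights to FP16 and FP32 with the standard-library `struct` module and subtracts them, on the assumption that differential attention ends up subtracting two similar attention maps. The helper names and the values 0.5001 and 0.5000 are chosen purely for illustration.

```python
import struct

def round_to(fmt: str, x: float) -> float:
    """Round x to the nearest value representable in fmt.

    'e' is IEEE half precision (FP16), 'f' is IEEE single precision (FP32).
    """
    return struct.unpack(fmt, struct.pack(fmt, x))[0]

def diff_in(fmt: str, a: float, b: float) -> float:
    """Compute a - b with both inputs and the result rounded to fmt."""
    return round_to(fmt, round_to(fmt, a) - round_to(fmt, b))

# Hypothetical attention weights that differ by 1e-4 (illustrative values only).
a, b = 0.5001, 0.5000

d16 = diff_in("e", a, b)  # difference as FP16 would see it
d32 = diff_in("f", a, b)  # difference as FP32 would see it
print("FP16 difference:", d16)
print("FP32 difference:", d32)
```

Near 0.5 the FP16 grid spacing is 2^-11 (about 0.00049), so both inputs round to the same value and the difference vanishes, while FP32 keeps it close to 1e-4.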

Files

| File | Size | Format |
|---|---|---|
| `dit_sm_music/model.safetensors` | 256 MB | int4 |
| `same_s_encoder/model.safetensors` | 205 MB | fp32 |
| `same_s_decoder/model.safetensors` | 208 MB | fp32 |
| `t5gemma/model.safetensors` | 541 MB | fp16 |
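After downloading, it can be useful to confirm that all four weight files are present before loading anything. The following is a minimal sketch using only `pathlib`; the relative paths come from the table above, while the helper names `missing_files` and `sizes_mb` are hypothetical and not part of any runtime.

```python
from pathlib import Path

# Relative paths taken from the file table above.
EXPECTED_FILES = (
    "dit_sm_music/model.safetensors",
    "same_s_encoder/model.safetensors",
    "same_s_decoder/model.safetensors",
    "t5gemma/model.safetensors",
)

def missing_files(bundle_dir):
    """Return the expected weight files that are absent under bundle_dir."""
    root = Path(bundle_dir)
    return [rel for rel in EXPECTED_FILES if not (root / rel).is_file()]

def sizes_mb(bundle_dir):
    """Map each present weight file to its on-disk size in MB (10**6 bytes)."""
    root = Path(bundle_dir)
    return {
        rel: (root / rel).stat().st_size / 1e6
        for rel in EXPECTED_FILES
        if (root / rel).is_file()
    }
```

Calling `missing_files(bundle)` on the directory returned by `snapshot_download` should give an empty list for a complete download, and `sizes_mb(bundle)` can be compared against the sizes listed in the table.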

Capabilities

Text-to-audio generation (music + SFX depending on DiT specialisation)
Inpainting / region editing via masked latent diffusion
Audio continuation from a short prompt clip
Variable-length generation up to several minutes
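To make the masking idea concrete, here is a minimal sketch of one common formulation of masked latent editing, in which latent frames are blended per position: kept frames come from the source audio, masked frames come from the model. This is an assumption for illustration, not a description of how Stable Audio 3 implements inpainting internally; `blend_latents` and `continuation_mask` are hypothetical names, and latents are reduced to flat lists of floats.

```python
def blend_latents(known, generated, mask):
    """Keep `known` where mask is 0.0 and take `generated` where mask is 1.0."""
    if not (len(known) == len(generated) == len(mask)):
        raise ValueError("known, generated, and mask must have the same length")
    return [m * g + (1.0 - m) * k for k, g, m in zip(known, generated, mask)]

def continuation_mask(total_frames, prompt_frames):
    """Mask for continuation: keep the first prompt_frames, generate the rest."""
    if not 0 <= prompt_frames <= total_frames:
        raise ValueError("prompt_frames must be between 0 and total_frames")
    return [0.0] * prompt_frames + [1.0] * (total_frames - prompt_frames)

# Toy example: 4 latent frames, the first 2 come from the prompt clip.
known = [1.0, 2.0, 3.0, 4.0]
generated = [9.0, 9.0, 9.0, 9.0]
mask = continuation_mask(total_frames=4, prompt_frames=2)
blended = blend_latents(known, generated, mask)
print(blended)
```

Under this view, continuation is the special case of inpainting where the mask covers everything after the prompt clip.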

The DiT-Small-Music-* variant is music-specialised; DiT-Small-SFX-* is sound-effects specialised; DiT-Medium-* is the higher-quality general model.

Usage

This bundle contains the quantized weights only; inference uses Stability AI's official pure-MLX runtime at stable-audio-3/optimized/mlx. At load time, each (base.weight, base.scales, base.biases) triplet is dequantized back to FP16 with mlx.core.dequantize; the codec and T5Gemma weights load as-is.

from huggingface_hub import snapshot_download
import mlx.core as mx

bundle = snapshot_download("aufklarer/Stable-Audio-3-DiT-Small-Music-MLX-4bit")

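def load_component(comp_dir):
    """Load one component, dequantizing any INT4 layers back to dense weights."""
    w = dict(mx.load(f"{comp_dir}/model.safetensors"))
    # A layer counts as quantized only when weight, scales, and biases all exist.
    # ".scales" is 7 characters long, so k[:-7] strips the suffix.
    bases = {k[:-7] for k in w if k.endswith(".scales")
             and f"{k[:-7]}.weight" in w and f"{k[:-7]}.biases" in w}
    out = {}
    for k, v in w.items():
        # Scales and biases are consumed by the dequantize call below.
        if k.endswith((".scales", ".biases")) and k.rsplit(".", 1)[0] in bases:
            continue
        if k.endswith(".weight") and k[:-7] in bases:
            base = k[:-7]
            # Rebuild the dense weight from packed 4-bit codes (groups of 64).
            out[k] = mx.dequantize(w[f"{base}.weight"], w[f"{base}.scales"],
                                   w[f"{base}.biases"], group_size=64, bits=4)
        else:
            # Already dense; pass through unchanged.
            out[k] = v
    return out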

Plug the rehydrated dict into the matching model class from stable-audio-3/optimized/mlx/models/defs/.
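For intuition about what the `mx.dequantize` call reconstructs, the sketch below reproduces the per-group affine arithmetic in pure Python, assuming the affine scheme described in the MLX documentation (each group stores a scale and a bias, and a weight is recovered as `scale * code + bias`). It is a reference for the arithmetic only: the function names are hypothetical, and MLX's packing of 4-bit codes into 32-bit words is not modelled.

```python
def quantize_group(values, bits=4):
    """Affine-quantize one group of floats; returns (codes, scale, bias)."""
    lo, hi = min(values), max(values)
    levels = (1 << bits) - 1              # 15 distinct steps for 4-bit codes
    scale = (hi - lo) / levels
    if scale == 0.0:
        scale = 1.0                       # constant group: every code is 0
    codes = [round((v - lo) / scale) for v in values]
    return codes, scale, lo

def dequantize_group(codes, scale, bias):
    """Reconstruct approximate floats from integer codes."""
    return [scale * q + bias for q in codes]

# One group of 64 weights, matching the bundle's group size of 64.
group = [i / 63 - 0.5 for i in range(64)]
codes, scale, bias = quantize_group(group, bits=4)
restored = dequantize_group(codes, scale, bias)
max_err = max(abs(a - b) for a, b in zip(group, restored))
print("scale:", scale, "bias:", bias, "max abs error:", max_err)
```

The reconstruction error per weight is bounded by half a quantization step (`scale / 2`), which is why each group of 64 weights carries its own scale and bias.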

Source

Upstream weights: stabilityai/stable-audio-3-optimized (Stability AI Community License)
Text encoder lineage: google/t5gemma-b-b-ul2 (Gemma Terms of Use)
Paper: Stable Audio 3 — Stability AI

License

Stability AI Community License: free for non-commercial research and for commercial use up to the revenue threshold defined by Stability AI; see the license text. The T5Gemma component additionally inherits the Gemma Terms of Use.