Stable-Audio-3-DiT-Small-Music-MLX-4bit

MLX port of Stability AI Stable Audio 3 (optimized). Latent-diffusion text-to-audio with mask-based inpainting and continuation, with the DiT denoiser quantized to INT4 for Apple Silicon.

What's in this bundle

| Component | Format | Notes |
| --- | --- | --- |
| DiT (Small-Music, 50M) | INT4 | Diffusion Transformer denoiser, group-size 64 |
| SAME-S encoder | FP32 | Audio → latents (codec is precision-sensitive: differential attention cancels in FP16) |
| SAME-S decoder | FP32 | Latents → 44.1 kHz stereo waveform |
| T5Gemma text encoder | FP16 | Prompt conditioning |

Codec stays FP32 because the SAME differential attention catastrophically cancels in FP16 (per Stability's own MLX runtime). T5Gemma stays FP16 — it's small relative to the DiT and quantization gives no speed-up on the short prompt encode pass.
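
The cancellation effect can be reproduced without the model. The sketch below is not the SAME attention code; it only assumes that differential attention ends up subtracting two nearly equal attention weights, and it uses Python's `struct` module to round made-up values to FP16 and FP32. The helper name `round_to` and the values `a` and `b` are illustrative.

```python
import struct

def round_to(fmt, x):
    """Round a Python float to the nearest value representable in `fmt`
    ("<e" = IEEE half precision, "<f" = IEEE single precision)."""
    return struct.unpack(fmt, struct.pack(fmt, x))[0]

# Two nearly equal attention weights (illustrative values, not from the model).
a, b = 0.50012, 0.50000

diff_fp16 = round_to("<e", round_to("<e", a) - round_to("<e", b))
diff_fp32 = round_to("<f", round_to("<f", a) - round_to("<f", b))

print(diff_fp16)  # 0.0: FP16 spacing near 0.5 is about 0.00049, so a rounds to b
print(diff_fp32)  # close to 0.00012
```

In FP16 the difference vanishes entirely, while FP32 keeps it to several significant digits. That matches the reason given above for leaving the codec unquantized.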

Files

| File | Size | Format |
| --- | --- | --- |
| dit_sm_music/model.safetensors | 256 MB | INT4 |
| same_s_encoder/model.safetensors | 205 MB | FP32 |
| same_s_decoder/model.safetensors | 208 MB | FP32 |
| t5gemma/model.safetensors | 541 MB | FP16 |
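
A quick way to confirm that a download is complete is to check for the four files listed above. This is a minimal sketch using only the standard library; the helper name `missing_files` is hypothetical and not part of any Stability AI or Hugging Face API.

```python
from pathlib import Path

# Relative paths taken from the Files table of this card.
EXPECTED_FILES = [
    "dit_sm_music/model.safetensors",
    "same_s_encoder/model.safetensors",
    "same_s_decoder/model.safetensors",
    "t5gemma/model.safetensors",
]

def missing_files(bundle_dir):
    """Return the expected weight files that are absent under bundle_dir."""
    root = Path(bundle_dir)
    return [rel for rel in EXPECTED_FILES if not (root / rel).is_file()]
```

Calling `missing_files(bundle)` on the directory returned by `snapshot_download` should give an empty list when the bundle is intact.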

Capabilities

  • Text-to-audio generation (music + SFX depending on DiT specialisation)
  • Inpainting / region editing via masked latent diffusion
  • Audio continuation from a short prompt clip
  • Variable-length generation up to several minutes

The DiT-Small-Music-* variant is music-specialised; DiT-Small-SFX-* is sound-effects specialised; DiT-Medium-* is the higher-quality general model.
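
Inpainting and continuation can both be read as the same masking operation over latent frames. The sketch below shows that general idea only: the mask convention (1 = regenerate, 0 = keep), the helper names `region_mask` and `blend`, and the scalar stand-ins for latent frames are all assumptions, and the official runtime may apply its mask differently.

```python
def region_mask(num_frames, start, end):
    """1.0 for latent frames in [start, end) that should be regenerated, else 0.0."""
    return [1.0 if start <= i < end else 0.0 for i in range(num_frames)]

def blend(known, generated, mask):
    """Keep known frames where mask is 0 and take generated frames where mask is 1."""
    return [m * g + (1.0 - m) * k for k, g, m in zip(known, generated, mask)]

# Scalars stand in for latent frames; real latents would be vectors per frame.
known = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
generated = [9.0] * 6

# Inpainting: regenerate frames 2 and 3, keep everything else.
inpainted = blend(known, generated, region_mask(6, 2, 4))

# Continuation: keep the first 3 frames (the prompt clip), generate the rest.
continued = blend(known, generated, region_mask(6, 3, 6))
```

Under this reading, continuation is inpainting whose masked region runs from the end of the prompt clip to the end of the track.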

Usage

This bundle is the quantized weights only — inference uses Stability AI's official pure-MLX runtime at stable-audio-3/optimized/mlx. At load time, each (base.weight, base.scales, base.biases) triplet is dequantized via mlx.core.dequantize back to FP16; codec and T5Gemma load as-is.

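```python
from huggingface_hub import snapshot_download
import mlx.core as mx

# Download the quantized bundle, or reuse the locally cached copy.
bundle = snapshot_download("aufklarer/Stable-Audio-3-DiT-Small-Music-MLX-4bit")


def load_component(comp_dir):
    """Load one component, dequantizing every (weight, scales, biases) triplet."""
    w = dict(mx.load(f"{comp_dir}/model.safetensors"))
    # A layer counts as quantized only if all three tensors of the triplet exist.
    # ".scales", ".weight" and ".biases" are each 7 characters, hence k[:-7].
    bases = {k[:-7] for k in w if k.endswith(".scales")
             if f"{k[:-7]}.weight" in w and f"{k[:-7]}.biases" in w}
    out = {}
    for k, v in w.items():
        base = k.rsplit(".", 1)[0]
        if k.endswith((".scales", ".biases")) and base in bases:
            continue  # consumed by the dequantize call for the matching weight
        if k.endswith(".weight") and base in bases:
            out[k] = mx.dequantize(v, w[f"{base}.scales"], w[f"{base}.biases"],
                                   group_size=64, bits=4)
        else:
            out[k] = v  # tensor stored unquantized; pass through unchanged
    return out


# Example: rehydrate the DiT weights (directory name from the Files table).
# dit_weights = load_component(f"{bundle}/dit_sm_music")
```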

Plug the rehydrated dict into the matching model class from stable-audio-3/optimized/mlx/models/defs/.
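
For intuition about what each (weight, scales, biases) triplet stores, here is a pure-Python sketch of group-wise affine quantization, where each group is reconstructed as `scale * code + bias`. It follows the general affine scheme and does not reproduce MLX's packed bit layout. The function names are hypothetical, and the storage estimate assumes scales and biases are kept at 16 bits each.

```python
def quantize_group(ws, bits=4):
    """Affine-quantize one group of floats to integer codes plus (scale, bias)."""
    lo, hi = min(ws), max(ws)
    levels = (1 << bits) - 1                # 15 steps for 4-bit codes
    scale = (hi - lo) / levels or 1.0       # fall back to 1.0 for a constant group
    codes = [round((w - lo) / scale) for w in ws]
    return codes, scale, lo                 # bias is the group minimum

def dequantize_group(codes, scale, bias):
    """Reconstruct approximate weights from integer codes."""
    return [scale * q + bias for q in codes]

def bits_per_weight(bits=4, group_size=64, meta_bits=16):
    """Effective storage per weight: the code plus a share of one scale and one bias."""
    return bits + 2 * meta_bits / group_size

group = [i / 63 for i in range(64)]         # one illustrative group of 64 weights
codes, scale, bias = quantize_group(group)
restored = dequantize_group(codes, scale, bias)
max_err = max(abs(a - b) for a, b in zip(group, restored))
```

With `group_size=64` and `bits=4`, the values passed to `mx.dequantize` above, this estimate gives 4.5 bits per quantized weight, and the reconstruction error within a group stays within half a quantization step.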

Source

License

Stability AI Community License — free for non-commercial research and for commercial use up to the revenue threshold defined by Stability AI; see the license text. T5Gemma component additionally inherits the Gemma Terms of Use.
