Stable-Audio-3-DiT-Small-Music-MLX-4bit
- speech-swift — Apple SDK
- soniqo.audio — website
- blog — blog
MLX port of Stability AI Stable Audio 3 (optimized). Latent-diffusion text-to-audio with mask-based inpainting and continuation, with the DiT denoiser quantized to INT4 for Apple Silicon.
What's in this bundle
| Component | Format | Notes |
|---|---|---|
| DiT (Small-Music, 50M) | INT4 | Diffusion Transformer denoiser, group-size 64 |
| SAME-S encoder | FP32 | Audio → latents (codec is precision-sensitive — differential attention cancels in FP16) |
| SAME-S decoder | FP32 | Latents → 44.1 kHz stereo waveform |
| T5Gemma text encoder | FP16 | Prompt conditioning |
Codec stays FP32 because the SAME differential attention catastrophically cancels in FP16 (per Stability's own MLX runtime). T5Gemma stays FP16 — it's small relative to the DiT and quantization gives no speed-up on the short prompt encode pass.
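The cancellation argument above can be illustrated with plain floating-point arithmetic. The sketch below is not the SAME attention code: it only assumes that differential attention subtracts two nearly equal weights, and the values 0.5003 and 0.5001 are made up for the demonstration. It uses the standard library's `struct` formats `e` (IEEE half precision) and `f` (single precision) to round numbers the way FP16 and FP32 storage would.

```python
import struct

def round_to(fmt: str, x: float) -> float:
    """Round a Python float to a struct format's precision ('e' = FP16, 'f' = FP32)."""
    return struct.unpack(fmt, struct.pack(fmt, x))[0]

def difference_at(fmt: str, a: float, b: float) -> float:
    """Compute a - b with both operands and the result stored at reduced precision."""
    return round_to(fmt, round_to(fmt, a) - round_to(fmt, b))

def rel_error(approx: float, exact: float) -> float:
    """Relative error of approx with respect to exact."""
    return abs(approx - exact) / abs(exact)

# Two nearly equal weights (illustrative values, not taken from the model).
a, b = 0.5003, 0.5001
exact = a - b
fp16_diff = difference_at("e", a, b)
fp32_diff = difference_at("f", a, b)

print(f"exact={exact:.6g}  fp16={fp16_diff:.6g}  fp32={fp32_diff:.6g}")
print(f"fp16 relative error: {rel_error(fp16_diff, exact):.2f}")
print(f"fp32 relative error: {rel_error(fp32_diff, exact):.2e}")
```

Near 0.5 the spacing between adjacent FP16 values is larger than the difference being computed, so the FP16 result is dominated by rounding error, while FP32 keeps it small.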
Files
| File | Size | Format |
|---|---|---|
| dit_sm_music/model.safetensors | 256 MB | int4 |
| same_s_encoder/model.safetensors | 205 MB | fp32 |
| same_s_decoder/model.safetensors | 208 MB | fp32 |
| t5gemma/model.safetensors | 541 MB | fp16 |
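After downloading, it can be useful to confirm that all four weight files from the table are present. The helper below is a minimal sketch. `missing_files` and `bundle_dir` are hypothetical names introduced here, and the check only looks for the files' existence, not their size or integrity.

```python
from pathlib import Path

# Relative paths taken from the file table above.
EXPECTED_FILES = [
    "dit_sm_music/model.safetensors",
    "same_s_encoder/model.safetensors",
    "same_s_decoder/model.safetensors",
    "t5gemma/model.safetensors",
]

def missing_files(bundle_dir: str) -> list[str]:
    """Return the expected weight files that are absent from bundle_dir."""
    root = Path(bundle_dir)
    return [rel for rel in EXPECTED_FILES if not (root / rel).is_file()]
```

Calling `missing_files(path)` on the local snapshot directory should return an empty list when the download is complete.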
Capabilities
- Text-to-audio generation (music + SFX depending on DiT specialisation)
- Inpainting / region editing via masked latent diffusion
- Audio continuation from a short prompt clip
- Variable-length generation up to several minutes
The DiT-Small-Music-* variant is music-specialised; DiT-Small-SFX-* is sound-effects specialised; DiT-Medium-* is the higher-quality general model.
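Mask-based inpainting and continuation can both be thought of as blending known latents with generated ones under a per-frame mask. The sketch below shows that blending idea on plain Python lists. It is an assumption about the general technique, not a description of how the Stable Audio 3 runtime implements it, and `frame_mask` and `blend` are hypothetical helper names.

```python
def frame_mask(num_frames: int, start: int, end: int) -> list[float]:
    """1.0 for latent frames to regenerate in [start, end), 0.0 for frames to keep."""
    return [1.0 if start <= i < end else 0.0 for i in range(num_frames)]

def blend(known: list[float], generated: list[float], mask: list[float]) -> list[float]:
    """Keep `known` where the mask is 0 and take `generated` where it is 1."""
    return [m * g + (1.0 - m) * k for k, g, m in zip(known, generated, mask)]

# Toy one-dimensional "latents" (illustrative values only).
known = [0.1, 0.2, 0.3, 0.4, 0.5]
generated = [9.0, 9.0, 9.0, 9.0, 9.0]

# Inpainting: regenerate frames 1 and 2, keep the rest.
inpainted = blend(known, generated, frame_mask(len(known), start=1, end=3))

# Continuation: keep the first two frames as the prompt clip, generate the tail.
continued = blend(known, generated, frame_mask(len(known), start=2, end=5))

print(inpainted)
print(continued)
```

In this framing, continuation is the special case where the mask covers everything after the prompt clip.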
Usage
This bundle is the quantized weights only — inference uses Stability AI's official pure-MLX runtime at stable-audio-3/optimized/mlx. At load time, each (base.weight, base.scales, base.biases) triplet is dequantized via mlx.core.dequantize back to FP16; codec and T5Gemma load as-is.
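```python
from huggingface_hub import snapshot_download
import mlx.core as mx

# Download (or reuse the cached copy of) the full bundle.
bundle = snapshot_download("aufklarer/Stable-Audio-3-DiT-Small-Music-MLX-4bit")

def load_component(comp_dir):
    """Load one component directory, dequantizing any INT4 layers to floating point."""
    w = dict(mx.load(f"{comp_dir}/model.safetensors"))
    # A layer is quantized if it has a complete (weight, scales, biases) triplet.
    bases = {k[:-7] for k in w if k.endswith(".scales")
             if f"{k[:-7]}.weight" in w and f"{k[:-7]}.biases" in w}
    out = {}
    for k, v in w.items():
        # Scales and biases of quantized layers are folded into the weight below.
        if k.endswith((".scales", ".biases")) and k.rsplit(".", 1)[0] in bases:
            continue
        if k.endswith(".weight") and k[:-7] in bases:
            base = k[:-7]
            out[k] = mx.dequantize(w[f"{base}.weight"], w[f"{base}.scales"],
                                   w[f"{base}.biases"], group_size=64, bits=4)
        else:
            # Unquantized tensors (codec, T5Gemma, norms, ...) pass through as-is.
            out[k] = v
    return out
```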
Plug the rehydrated dict into the matching model class from stable-audio-3/optimized/mlx/models/defs/.
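To make the dequantization step more concrete, the following pure-Python sketch shows affine quantization of a single group of 64 values to 4-bit codes and the matching reconstruction `scale * code + bias`. It is a simplified illustration under stated assumptions: MLX additionally packs the codes into wider integers and may differ in rounding details, and `quantize_group` and `dequantize_group` are hypothetical names, not MLX functions.

```python
import math

def quantize_group(values: list[float], bits: int = 4) -> tuple[list[int], float, float]:
    """Affine-quantize one group; returns (codes, scale, bias) with value ~ scale * code + bias."""
    levels = (1 << bits) - 1                 # 15 steps for 4-bit codes 0..15
    lo, hi = min(values), max(values)
    scale = (hi - lo) / levels or 1.0        # fall back to 1.0 for a constant group
    codes = [min(levels, max(0, round((v - lo) / scale))) for v in values]
    return codes, scale, lo

def dequantize_group(codes: list[int], scale: float, bias: float) -> list[float]:
    """Reconstruct approximate values from codes, one scale and one bias per group."""
    return [scale * c + bias for c in codes]

# One group of 64 synthetic weights (illustrative data, not model weights).
group = [0.1 * math.sin(i) for i in range(64)]
codes, scale, bias = quantize_group(group)
restored = dequantize_group(codes, scale, bias)
max_err = max(abs(a - b) for a, b in zip(group, restored))

print(f"scale={scale:.5f} bias={bias:.5f} max_err={max_err:.5f}")
```

Each group stores one scale and one bias alongside its codes, which is why the bundle carries a `.scales` and `.biases` tensor next to every quantized `.weight`. The reconstruction error per value is bounded by about half a quantization step.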
Source
- Upstream weights: stabilityai/stable-audio-3-optimized (Stability AI Community License)
- Text encoder lineage: google/t5gemma-b-b-ul2 (Gemma Terms of Use)
- Paper: Stable Audio 3 — Stability AI
License
Stability AI Community License — free for non-commercial research and for commercial use up to the revenue threshold defined by Stability AI; see the license text. T5Gemma component additionally inherits the Gemma Terms of Use.