{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Dasheng-AudioGen-Multilingual \u2014 Notebook Demo\n", "\n", "This notebook walks through the audio-generation usage shown in the [README](./README.md) for the **multilingual** variant of Dasheng-AudioGen. A CUDA-capable GPU is required.\n", "\n", "Each example takes a text description and produces an audio waveform that is saved to disk and played back inline." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Installation" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%pip install torch torchaudio \"transformers<5\" einops" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Basic Usage\n", "\n", "Load the multilingual model and generate audio from a single text prompt." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import torchaudio\n", "from transformers import AutoModel\n", "from IPython.display import Audio\n", "\n", "model = AutoModel.from_pretrained(\"mispeech/Dasheng-AudioGen-Multilingual\", trust_remote_code=True).cuda()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "audio = model.generate(\"A dog barking in a park\")\n", "torchaudio.save(\"output.wav\", audio.cpu(), 16000)\n", "Audio(\"output.wav\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Aspect-wise Prompt\n", "\n", "Use `compose_prompt` to describe different audio aspects separately.\n", "\n", "> **Multilingual prompt convention:** All descriptive tags (`caption`, `speech`, `sfx`, `music`, `env`) should be written in **English**. Only the `<|asr|>` field (the actual spoken content to be synthesized) should use the target language." 
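, "\n",
"\n",
"To make the tag layout concrete, here is a minimal pure-Python sketch. `format_tagged_prompt` and `ASPECT_TAGS` are hypothetical names used only for illustration and are not part of the model's API. The sketch assumes a tagged prompt is a series of `<|tag|> text` segments joined by single spaces, in the tag order listed above; the real `compose_prompt` may order or format fields differently.\n",
"\n",
"```python\n",
"# Hypothetical helper, not part of the model API: it only illustrates the assumed tag layout.\n",
"ASPECT_TAGS = ('caption', 'speech', 'sfx', 'music', 'env', 'asr')\n",
"\n",
"def format_tagged_prompt(**aspects):\n",
"    # Reject anything that is not one of the documented aspect names.\n",
"    unknown = set(aspects) - set(ASPECT_TAGS)\n",
"    if unknown:\n",
"        raise ValueError(f'unknown aspect(s): {sorted(unknown)}')\n",
"    # Emit one '<|tag|> text' segment per non-empty aspect, in a fixed (assumed) order.\n",
"    parts = [f'<|{tag}|> {aspects[tag].strip()}' for tag in ASPECT_TAGS if aspects.get(tag)]\n",
"    return ' '.join(parts)\n",
"\n",
"# English descriptive tags; only the asr text is in the target language.\n",
"print(format_tagged_prompt(\n",
"    caption='A conversation scene on a busy city street.',\n",
"    speech='A young woman speaking softly in Spanish.',\n",
"    asr='Creo que deber\u00edamos irnos ya.',\n",
"))\n",
"```\n",
"\n",
"Under these assumptions, calling the helper with only `caption`, `sfx`, and `env` yields a string of the same shape as the pre-formatted helicopter prompt used later in this notebook. Prefer `model.compose_prompt` for real generation."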
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Spanish example" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "prompt = model.compose_prompt(\n", " caption=\"A conversation scene on a busy city street.\",\n", " speech=\"A young woman speaking softly in Spanish.\",\n", " env=\"Rain and distant traffic noise.\",\n", " asr=\"Creo que deber\u00edamos irnos ya.\",\n", ")\n", "audio = model.generate(prompt)\n", "torchaudio.save(\"output_spanish.wav\", audio.cpu(), 16000)\n", "Audio(\"output_spanish.wav\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### German example" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "prompt = model.compose_prompt(\n", " caption=\"A calm voice giving directions in a quiet office.\",\n", " speech=\"A middle-aged man speaking calmly in German.\",\n", " env=\"Quiet office ambience with faint keyboard typing.\",\n", " asr=\"Bitte biegen Sie an der n\u00e4chsten Kreuzung links ab.\",\n", ")\n", "audio = model.generate(prompt)\n", "torchaudio.save(\"output_german.wav\", audio.cpu(), 16000)\n", "Audio(\"output_german.wav\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can also pass a pre-formatted, tagged string directly to `generate`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "audio = model.generate(\n", " \"<|caption|> A helicopter passing overhead. <|sfx|> Rhythmic helicopter blade sounds. <|env|> Open sky ambience.\"\n", ")\n", "torchaudio.save(\"output_helicopter.wav\", audio.cpu(), 16000)\n", "Audio(\"output_helicopter.wav\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Batch Inference\n", "\n", "Pass a list of prompts to generate multiple audio clips in a single call."
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "prompts = [\n", " model.compose_prompt(caption=\"A cat meowing softly.\", sfx=\"Soft cat meow.\"),\n", " model.compose_prompt(caption=\"Thunder rolling in the distance.\", env=\"Stormy night ambience.\"),\n", " model.compose_prompt(caption=\"A piano playing a gentle melody.\", music=\"Soft piano ballad.\"),\n", "]\n", "audios = model.generate(prompts)\n", "\n", "for i, audio in enumerate(audios):\n", " torchaudio.save(f\"output_{i}.wav\", audio.unsqueeze(0).cpu(), 16000)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "Audio(\"output_0.wav\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "Audio(\"output_1.wav\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "Audio(\"output_2.wav\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Generation Parameters\n", "\n", "Tune the denoising steps, classifier-free guidance scale, and sway sampling coefficient." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "audio = model.generate(\n", " prompts=\"A dog barking in a park\",\n", " num_steps=25, # number of denoising steps (default: 25)\n", " guidance_scale=5.0, # classifier-free guidance scale (default: 5.0)\n", " sway_sampling_coef=-1.0, # sway sampling coefficient (default: -1.0, 0 for linear)\n", ")\n", "torchaudio.save(\"output_tuned.wav\", audio.cpu(), 16000)\n", "Audio(\"output_tuned.wav\")" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "name": "python", "version": "3.10" } }, "nbformat": 4, "nbformat_minor": 5 }