{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Dasheng-AudioGen-Multilingual \u2014 Notebook Demo\n",
    "\n",
    "This notebook walks through the audio-generation examples from the [README](./README.md) for the **multilingual** variant of Dasheng-AudioGen. A CUDA-capable GPU is required because the model is moved to the GPU with `.cuda()`.\n",
    "\n",
    "Each example takes a text prompt, generates an audio waveform, saves it to disk as a WAV file, and plays it back inline."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Installation"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%pip install torch torchaudio \"transformers<5\" einops"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Basic Usage\n",
    "\n",
    "Load the multilingual model and generate audio from a single text prompt."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import torchaudio\n",
    "from transformers import AutoModel\n",
    "from IPython.display import Audio\n",
    "\n",
    "# trust_remote_code=True lets transformers run the model code shipped with the checkpoint;\n",
    "# .cuda() then moves the model to the GPU.\n",
    "model = AutoModel.from_pretrained(\"mispeech/Dasheng-AudioGen-Multilingual\", trust_remote_code=True).cuda()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "audio = model.generate(\"A dog barking in a park\")\n",
    "torchaudio.save(\"output.wav\", audio.cpu(), 16000)\n",
    "Audio(\"output.wav\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Aspect-wise Prompt\n",
    "\n",
    "Use `compose_prompt` to describe each aspect of the audio in its own field instead of in one free-form sentence.\n",
    "\n",
    "> **Multilingual prompt convention:** Write all descriptive fields (`caption`, `speech`, `sfx`, `music`, `env`) in **English**. Only the `asr` field (the `<|asr|>` tag, which holds the actual spoken content to be synthesized) should use the target language."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Spanish example"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "prompt = model.compose_prompt(\n",
    "    caption=\"A conversation scene on a busy city street.\",\n",
    "    speech=\"A young woman speaking softly in Spanish.\",\n",
    "    env=\"Rain and distant traffic noise.\",\n",
    "    asr=\"Creo que deber\u00edamos irnos ya.\",\n",
    ")\n",
    "audio = model.generate(prompt)\n",
    "torchaudio.save(\"output_spanish.wav\", audio.cpu(), 16000)\n",
    "Audio(\"output_spanish.wav\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### German example"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "prompt = model.compose_prompt(\n",
    "    caption=\"A calm voice giving directions in a quiet office.\",\n",
    "    speech=\"A middle-aged man speaking calmly in German.\",\n",
    "    env=\"Quiet office ambience with faint keyboard typing.\",\n",
    "    asr=\"Bitte biegen Sie an der n\u00e4chsten Kreuzung links ab.\",\n",
    ")\n",
    "audio = model.generate(prompt)\n",
    "torchaudio.save(\"output_german.wav\", audio.cpu(), 16000)\n",
    "Audio(\"output_german.wav\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You can also skip `compose_prompt` and pass a pre-formatted string with tags directly to `generate`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "audio = model.generate(\n",
    "    \"<|caption|> A helicopter passing overhead. <|sfx|> Rhythmic helicopter blade sounds. <|env|> Open sky ambience.\"\n",
    ")\n",
    "torchaudio.save(\"output_helicopter.wav\", audio.cpu(), 16000)\n",
    "Audio(\"output_helicopter.wav\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Batch Inference\n",
    "\n",
    "Pass a list of prompts to generate multiple audio clips in a single call."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "prompts = [\n",
    "    model.compose_prompt(caption=\"A cat meowing softly.\", sfx=\"Soft cat meow.\"),\n",
    "    model.compose_prompt(caption=\"Thunder rolling in the distance.\", env=\"Stormy night ambience.\"),\n",
    "    model.compose_prompt(caption=\"A piano playing a gentle melody.\", music=\"Soft piano ballad.\"),\n",
    "]\n",
    "audios = model.generate(prompts)\n",
    "\n",
    "# torchaudio.save expects a 2-D (channels, time) tensor, so add a channel dimension to each clip.\n",
    "for i, audio in enumerate(audios):\n",
    "    torchaudio.save(f\"output_{i}.wav\", audio.unsqueeze(0).cpu(), 16000)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "Audio(\"output_0.wav\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "Audio(\"output_1.wav\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "Audio(\"output_2.wav\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Generation Parameters\n",
    "\n",
    "Tune the number of denoising steps, the classifier-free guidance scale, and the sway sampling coefficient."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "audio = model.generate(\n",
    "    prompts=\"A dog barking in a park\",\n",
    "    num_steps=25,              # number of denoising steps (default: 25)\n",
    "    guidance_scale=5.0,        # classifier-free guidance scale (default: 5.0)\n",
    "    sway_sampling_coef=-1.0,   # sway sampling coefficient (default: -1.0, 0 for linear)\n",
    ")\n",
    "torchaudio.save(\"output_tuned.wav\", audio.cpu(), 16000)\n",
    "Audio(\"output_tuned.wav\")"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "name": "python",
   "version": "3.10"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}