# PDF2Audio: Technical Documentation

## Table of Contents

1. [Introduction](#introduction)
2. [Application Overview](#application-overview)
3. [File Structure](#file-structure)
4. [Core Components](#core-components)
   - [Data Models](#data-models)
   - [PDF Processing](#pdf-processing)
   - [Text Generation](#text-generation)
   - [Audio Generation](#audio-generation)
   - [Instruction Templates](#instruction-templates)
5. [User Interface](#user-interface)
   - [Main Layout](#main-layout)
   - [Input Controls](#input-controls)
   - [Output Display](#output-display)
   - [Editing Features](#editing-features)
6. [Workflow](#workflow)
7. [Key Functions](#key-functions)
8. [Integration Points](#integration-points)
9. [Customization Options](#customization-options)
10. [Conclusion](#conclusion)

## Introduction

PDF2Audio is a Gradio-based web application that converts PDF documents, markdown files, and text files into audio content using OpenAI's GPT models for text generation and text-to-speech (TTS) services. The application allows users to upload documents, select from various instruction templates (podcast, lecture, summary, etc.), and customize the output with different voices and models.

This technical documentation provides a detailed explanation of the `app.py` file, which contains all the functionality of the PDF2Audio application. It is designed to help developers and designers understand the codebase so they can use it as a foundation for similar applications.

## Application Overview

PDF2Audio follows a straightforward workflow:

1. User uploads one or more PDF, markdown, or text files
2. User selects an instruction template and customizes settings
3. The application extracts text from the uploaded files
4. An LLM (Large Language Model) processes the text according to the selected template
5. The generated dialogue is converted to audio using OpenAI's TTS service
6. The user can listen to the audio, view the transcript, edit it, and regenerate if needed

The application is built using the Gradio framework, which provides an easy-to-use interface for creating web applications with Python. The backend leverages OpenAI's API for both text generation and text-to-speech conversion.

## File Structure

The entire application is contained within a single `app.py` file, which includes:

- Import statements for required libraries
- Data model definitions
- Instruction templates for different output formats
- Core functionality for text extraction, dialogue generation, and audio synthesis
- Gradio UI components and layout
- Event handlers for user interactions

## Core Components

### Data Models

The application uses Pydantic models to structure the dialogue data:

```python
class DialogueItem(BaseModel):
    text: str
    speaker: Literal["speaker-1", "speaker-2"]

class Dialogue(BaseModel):
    scratchpad: str
    dialogue: List[DialogueItem]
```

These models ensure type safety and provide a structured way to handle the dialogue data throughout the application.
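For illustration, here is how these models behave at runtime. This is a minimal sketch assuming Pydantic v2 (which provides `model_validate`); `Dialogue` and `DialogueItem` are the classes defined above, and the sample data is invented:

```python
from pydantic import ValidationError

# Raw dict, e.g. parsed from the LLM's structured output (sample data is invented)
raw = {
    "scratchpad": "Plan: two hosts walk through the paper's key findings.",
    "dialogue": [
        {"speaker": "speaker-1", "text": "Welcome to the show!"},
        {"speaker": "speaker-2", "text": "Glad to be here."},
    ],
}

dlg = Dialogue.model_validate(raw)  # Pydantic v2 API; validates types on construction
print(dlg.dialogue[0].speaker)      # -> "speaker-1"

# The Literal type rejects any speaker label outside the two allowed values
try:
    DialogueItem(speaker="narrator", text="hi")
except ValidationError as e:
    print("rejected:", e.error_count(), "error(s)")
```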
### PDF Processing

PDF processing is handled using the `pypdf` library. The application extracts text from uploaded PDF files:

```python
if suffix == ".pdf":
    with file_path.open("rb") as f:
        reader = PdfReader(f)
        text = "\n\n".join(
            page.extract_text()
            for page in reader.pages
            if page.extract_text()
        )
        combined_text += text + "\n\n"
```

The application also supports markdown and plain text files:

```python
elif suffix in [".txt", ".md", ".mmd"]:
    with file_path.open("r", encoding="utf-8") as f:
        text = f.read()
        combined_text += text + "\n\n"
```

### Text Generation

Text generation is performed using OpenAI's GPT models through the `promptic` library's `llm` decorator. The application uses a custom `conditional_llm` wrapper to configure the LLM dynamically based on user selections:

```python
@conditional_llm(
    model=text_model,
    api_base=api_base,
    api_key=openai_api_key,
    reasoning_effort=reasoning_effort,
    do_web_search=do_web_search,
)
def generate_dialogue(
    text: str,
    intro_instructions: str,
    text_instructions: str,
    scratch_pad_instructions: str,
    prelude_dialog: str,
    podcast_dialog_instructions: str,
    edited_transcript: str = None,
    user_feedback: str = None,
) -> Dialogue:
    # Function body contains the prompt template
    ...
```

The `generate_dialogue` function is also decorated with `@retry` (from the Tenacity library) so that validation errors trigger a retry of the API call when necessary.
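The documentation above shows the wrapper's call signature but not its body. Below is a minimal sketch of what `conditional_llm` might look like, assuming promptic's `llm` decorator accepts and forwards the optional `api_base`/`api_key` settings; the real implementation also handles `reasoning_effort` and `do_web_search`, which are omitted here:

```python
from promptic import llm

def conditional_llm(model, api_base=None, api_key=None, **_ignored):
    """Build a promptic `llm` decorator from the user's UI selections.

    Optional settings are only passed through when the user actually
    supplied them, so library defaults apply otherwise. (Sketch only;
    reasoning_effort / do_web_search handling is omitted.)
    """
    def decorator(func):
        kwargs = {"model": model}
        if api_base:
            kwargs["api_base"] = api_base  # custom endpoint, e.g. a proxy
        if api_key:
            kwargs["api_key"] = api_key
        return llm(**kwargs)(func)
    return decorator
```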
### Audio Generation

Audio generation is handled by OpenAI's TTS API through the `get_mp3` function:

```python
def get_mp3(
    text: str,
    voice: str,
    audio_model: str,
    api_key: str = None,
    speaker_instructions: str = 'Speak in an emotive and friendly tone.',
) -> bytes:
    client = OpenAI(
        api_key=api_key or os.getenv("OPENAI_API_KEY"),
    )

    with client.audio.speech.with_streaming_response.create(
        model=audio_model,
        voice=voice,
        input=text,
        instructions=speaker_instructions,
    ) as response:
        with io.BytesIO() as file:
            for chunk in response.iter_bytes():
                file.write(chunk)
            return file.getvalue()
```

The application uses `concurrent.futures.ThreadPoolExecutor` to parallelize audio generation for each dialogue line, improving performance:

```python
with cf.ThreadPoolExecutor() as executor:
    futures = []
    for line in llm_output.dialogue:
        transcript_line = f"{line.speaker}: {line.text}"
        voice = speaker_1_voice if line.speaker == "speaker-1" else speaker_2_voice
        speaker_instructions = (
            speaker_1_instructions if line.speaker == "speaker-1" else speaker_2_instructions
        )
        future = executor.submit(
            get_mp3, line.text, voice, audio_model, openai_api_key, speaker_instructions
        )
        futures.append((future, transcript_line))
        characters += len(line.text)

    for future, transcript_line in futures:
        audio_chunk = future.result()
        audio += audio_chunk
        transcript += transcript_line + "\n\n"
```

### Instruction Templates

The application includes a comprehensive set of instruction templates for different output formats:

```python
INSTRUCTION_TEMPLATES = {
    "podcast": { ... },
    "deep research analysis": { ... },
    "clean rendering": { ... },
    "SciAgents material discovery summary": { ... },
    "lecture": { ... },
    "summary": { ... },
    "short summary": { ... },
    "podcast (French)": { ... },
    "podcast (German)": { ... },
    "podcast (Spanish)": { ... },
    "podcast (Portuguese)": { ... },
    "podcast (Hindi)": { ... },
    "podcast (Chinese)": { ... },
}
```

Each template contains five key components:

1. `intro`: High-level task description
2. `text_instructions`: How to process the input text
3. `scratch_pad`: Hidden brainstorming area for the model
4. `prelude`: Introduction to the main output
5. `dialog`: Main output instructions

These templates guide the LLM in generating appropriate content based on the selected format.
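The listing above elides the template bodies. Purely for illustration, a hypothetical entry with the five-field shape might look like the following; the wording is invented, not the app's actual prompt text:

```python
# Illustrative only -- not the application's real prompt text
ILLUSTRATIVE_TEMPLATE = {
    "intro": "Your task is to turn the provided text into a short spoken summary.",
    "text_instructions": "Read the source material carefully and identify its key points.",
    "scratch_pad": "Brainstorm here: list the main findings before writing the output.",
    "prelude": "Below is the summary you prepared, ready to be read aloud.",
    "dialog": "Write clear, natural dialogue for two speakers that covers every key point.",
}
```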
## User Interface

### Main Layout

The Gradio UI is structured with a clean, responsive layout:

```python
with gr.Blocks(title="PDF to Audio", css="""
#header {
    display: flex;
    align-items: center;
    justify-content: space-between;
    padding: 20px;
    background-color: transparent;
    border-bottom: 1px solid #ddd;
}
/* Additional CSS styles */
""") as demo:
    cached_dialogue = gr.State()

    with gr.Row(elem_id="header"):
        ...  # Header content

    with gr.Row(elem_id="main_container"):
        with gr.Column(scale=2):
            ...  # Input controls
        with gr.Column(scale=3):
            ...  # Template selection and customization
            ...  # Output components
            ...  # Editing features
```

### Input Controls

The left column contains input controls for file uploading and model selection:

```python
files = gr.Files(
    label="PDFs (.pdf), markdown (.md, .mmd), or text files (.txt)",
    file_types=[".pdf", ".PDF", ".md", ".mmd", ".txt"],
)
openai_api_key = gr.Textbox(
    label="OpenAI API Key",
    visible=True,
    placeholder="Enter your OpenAI API Key here...",
    type="password",
)
text_model = gr.Dropdown(
    label="Text Generation Model",
    choices=STANDARD_TEXT_MODELS,
    value="o3-mini",
    info="Select the model to generate the dialogue text.",
)
# Additional input controls for audio model, voices, etc.
```

### Output Display

The application provides several output components:

```python
audio_output = gr.Audio(label="Audio", format="mp3", interactive=False, autoplay=False)
transcript_output = gr.Textbox(label="Transcript", lines=25, show_copy_button=True)
original_text_output = gr.Textbox(label="Original Text", lines=10, visible=False)
error_output = gr.Textbox(visible=False)  # Hidden textbox to store error messages
```

### Editing Features

The application includes several features for editing and regenerating content:

1. **Transcript Editing**:

   ```python
   use_edited_transcript = gr.Checkbox(label="Use Edited Transcript", value=False)
   edited_transcript = gr.Textbox(
       label="Edit Transcript Here",
       lines=20,
       visible=False,
       show_copy_button=True,
       interactive=False,
   )
   ```

2. **Line-by-Line Editing**:

   ```python
   with gr.Accordion("Edit dialogue line‑by‑line", open=False) as editor_box:
       df_editor = gr.Dataframe(
           headers=["Speaker", "Line"],
           datatype=["str", "str"],
           wrap=True,
           interactive=True,
           row_count=(1, "dynamic"),
           col_count=(2, "fixed"),
       )
   ```

3. **User Feedback**:

   ```python
   user_feedback = gr.Textbox(label="Provide Feedback or Notes", lines=10)
   ```

## Workflow

The application workflow is managed through event handlers that connect UI components to backend functions:

1. **Template Selection**:

   ```python
   template_dropdown.change(
       fn=update_instructions,
       inputs=[template_dropdown],
       outputs=[
           intro_instructions,
           text_instructions,
           scratch_pad_instructions,
           prelude_dialog,
           podcast_dialog_instructions,
       ],
   )
   ```

2. **Generate Audio**:

   ```python
   submit_btn.click(
       fn=validate_and_generate_audio,
       inputs=[
           files, openai_api_key, text_model, reasoning_effort, do_web_search,
           audio_model, speaker_1_voice, speaker_2_voice,
           speaker_1_instructions, speaker_2_instructions, api_base,
           intro_instructions, text_instructions, scratch_pad_instructions,
           prelude_dialog, podcast_dialog_instructions,
           edited_transcript, user_feedback,
       ],
       outputs=[audio_output, transcript_output, original_text_output, error_output, cached_dialogue],
   )
   ```

3. **Regenerate with Edits**:

   ```python
   regenerate_btn.click(
       fn=lambda use_edit, edit, *args: validate_and_generate_audio(
           *args[:12],                # All inputs up to podcast_dialog_instructions
           edit if use_edit else "",  # Use edited transcript if checkbox is checked
           *args[12:],                # user_feedback and original_text_output
       ),
       inputs=[
           use_edited_transcript,
           edited_transcript,
           # Additional inputs
       ],
       outputs=[audio_output, transcript_output, original_text_output, error_output, cached_dialogue],
   )
   ```

4. **Re-render Audio**:

   ```python
   rerender_btn.click(
       fn=render_audio_from_dialogue,
       inputs=[
           cached_dialogue,
           openai_api_key,
           audio_model,
           speaker_1_voice,
           speaker_2_voice,
           speaker_1_instructions,
           speaker_2_instructions,
       ],
       outputs=[audio_output, transcript_output],
   )
   ```

## Key Functions

### `validate_and_generate_audio`

This function serves as the entry point for audio generation, validating inputs and handling errors. Every return path yields five values to match the five output components wired up in the click handler:

```python
def validate_and_generate_audio(*args):
    files = args[0]
    if not files:
        # Five return values to match the five outputs; the fifth is the cached dialogue
        return None, None, None, "Please upload at least one PDF (or MD/MMD/TXT) file before generating audio.", None
    try:
        audio_file, transcript, original_text, dialogue = generate_audio(*args)
        return audio_file, transcript, original_text, None, dialogue
    except Exception as e:
        return None, None, None, str(e), None
```

### `generate_audio`

This is the core function that orchestrates the entire process:

1. Validates the API key
2. Extracts text from uploaded files
3. Configures and calls the LLM to generate dialogue
4. Processes any user edits or feedback
5. Generates audio for each dialogue line
6. Returns the audio file, transcript, and original text

### `render_audio_from_dialogue`

This function re-renders audio from an existing dialogue without regenerating the text:

```python
def render_audio_from_dialogue(
    cached_dialogue,
    openai_api_key: str,
    audio_model: str,
    speaker_1_voice: str,
    speaker_2_voice: str,
    speaker_1_instructions: str,
    speaker_2_instructions: str,
) -> tuple[str, str]:
    # Function implementation
    ...
```

### `save_dialogue_edits`

This function saves edits made in the dataframe editor:

```python
def save_dialogue_edits(df, cached_dialogue):
    if cached_dialogue is None:
        raise gr.Error("Nothing to edit yet – run Generate Audio first.")

    import pandas as pd
    new_dlg = df_to_dialogue(pd.DataFrame(df, columns=["Speaker", "Line"]))

    # Regenerate the plain transcript so the user sees the change immediately
    transcript_str = "\n".join(f"{d.speaker}: {d.text}" for d in new_dlg.dialogue)

    # Return updated state and transcript
    return new_dlg, gr.update(value=transcript_str), "Edits saved. Press *Re‑render* to hear them."
```
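`save_dialogue_edits` depends on a `df_to_dialogue` helper that is not shown in the excerpts above. A plausible sketch, assuming it simply maps the two editor columns back onto the Pydantic models (the real implementation may differ):

```python
import pandas as pd

def df_to_dialogue(df: pd.DataFrame) -> Dialogue:
    """Rebuild a Dialogue model from the two-column line editor. (Sketch.)"""
    items = [
        DialogueItem(speaker=str(row["Speaker"]).strip(), text=str(row["Line"]))
        for _, row in df.iterrows()
        if str(row["Line"]).strip()  # drop empty rows left by the editor
    ]
    # Pydantic raises a ValidationError here if a Speaker cell is not
    # "speaker-1" or "speaker-2", surfacing bad edits immediately.
    return Dialogue(scratchpad="", dialogue=items)
```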
## Integration Points

The application integrates with several external services and libraries:

1. **OpenAI API**: Used for both text generation (GPT models) and text-to-speech conversion
2. **Promptic**: A library for working with LLM prompts
3. **pypdf**: Used for extracting text from PDF files
4. **Gradio**: The web UI framework
5. **Pydantic**: Used for data validation and modeling
6. **Tenacity**: Used for implementing retry logic

## Customization Options

The application offers several customization options:

1. **Instruction Templates**: Multiple pre-defined templates for different output formats
2. **Model Selection**: Support for various OpenAI models for both text and audio generation
3. **Voice Selection**: Multiple voice options for the speakers
4. **Voice Instructions**: Custom instructions for each speaker's voice
5. **API Base**: Option to use a custom API endpoint for text generation
6. **Web Search**: Option to enable web search during text generation
7. **Reasoning Effort**: Control over the reasoning effort for compatible models

## Conclusion

PDF2Audio is a well-structured application that demonstrates effective use of modern AI APIs for content transformation. Its modular design and comprehensive feature set make it an excellent foundation for similar applications.

Key strengths of the codebase include:

1. **Modularity**: Clear separation of concerns between text extraction, dialogue generation, and audio synthesis
2. **Extensibility**: Easy to add new instruction templates or customize existing ones
3. **Error Handling**: Robust error handling with informative user feedback
4. **Performance Optimization**: Parallel processing for audio generation
5. **User Experience**: Rich UI with multiple editing and customization options

Developers looking to build similar applications can leverage this codebase as a starting point, focusing on extending functionality or improving specific aspects rather than building from scratch.