# PDF2Audio: Technical Documentation

## Table of Contents

1. [Introduction](#introduction)
2. [Application Overview](#application-overview)
3. [File Structure](#file-structure)
4. [Core Components](#core-components)
   - [Data Models](#data-models)
   - [PDF Processing](#pdf-processing)
   - [Text Generation](#text-generation)
   - [Audio Generation](#audio-generation)
   - [Instruction Templates](#instruction-templates)
5. [User Interface](#user-interface)
   - [Main Layout](#main-layout)
   - [Input Controls](#input-controls)
   - [Output Display](#output-display)
   - [Editing Features](#editing-features)
6. [Workflow](#workflow)
7. [Key Functions](#key-functions)
8. [Integration Points](#integration-points)
9. [Customization Options](#customization-options)
10. [Conclusion](#conclusion)

## Introduction

PDF2Audio is a Gradio-based web application that converts PDF documents, markdown files, and text files into audio content using OpenAI's GPT models for text generation and text-to-speech (TTS) services. The application allows users to upload documents, select from various instruction templates (podcast, lecture, summary, etc.), and customize the output with different voices and models.

This technical documentation provides a detailed explanation of the `app.py` file, which contains all the functionality of the PDF2Audio application. It is designed to help developers and designers understand the codebase so they can use it as a foundation for similar applications.

## Application Overview

PDF2Audio follows a straightforward workflow:

1. User uploads one or more PDF, markdown, or text files
2. User selects an instruction template and customizes settings
3. The application extracts text from the uploaded files
4. An LLM (Large Language Model) processes the text according to the selected template
5. The generated dialogue is converted to audio using OpenAI's TTS service
6. The user can listen to the audio, view the transcript, edit it, and regenerate if needed

The application is built using the Gradio framework, which provides an easy-to-use interface for creating web applications with Python. The backend leverages OpenAI's API for both text generation and text-to-speech conversion.

## File Structure

The entire application is contained within a single `app.py` file, which includes:

- Import statements for required libraries
- Data model definitions
- Instruction templates for different output formats
- Core functionality for text extraction, dialogue generation, and audio synthesis
- Gradio UI components and layout
- Event handlers for user interactions

## Core Components

### Data Models

The application uses Pydantic models to structure the dialogue data:

```python
class DialogueItem(BaseModel):
    text: str
    speaker: Literal["speaker-1", "speaker-2"]

class Dialogue(BaseModel):
    scratchpad: str
    dialogue: List[DialogueItem]
```

These models ensure type safety and provide a structured way to handle the dialogue data throughout the application.
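For illustration, here is how these models behave at runtime. This is a minimal sketch assuming Pydantic v2 (which provides `model_validate`); `Dialogue` and `DialogueItem` are the classes defined above, and the sample data is invented:

```python
from pydantic import ValidationError

# Raw dict, e.g. parsed from the LLM's structured output (sample data is invented)
raw = {
    "scratchpad": "Plan: two hosts walk through the paper's key findings.",
    "dialogue": [
        {"speaker": "speaker-1", "text": "Welcome to the show!"},
        {"speaker": "speaker-2", "text": "Glad to be here."},
    ],
}

dlg = Dialogue.model_validate(raw)  # Pydantic v2 API; validates types on construction
print(dlg.dialogue[0].speaker)      # -> "speaker-1"

# The Literal type rejects any speaker label outside the two allowed values
try:
    DialogueItem(speaker="narrator", text="hi")
except ValidationError as e:
    print("rejected:", e.error_count(), "error(s)")
```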
### PDF Processing

PDF processing is handled using the `pypdf` library. The application extracts text from uploaded PDF files:

```python
if suffix == ".pdf":
    with file_path.open("rb") as f:
        reader = PdfReader(f)
        text = "\n\n".join(
            page.extract_text()
            for page in reader.pages
            if page.extract_text()
        )
        combined_text += text + "\n\n"
```

The application also supports markdown and plain text files:

```python
elif suffix in [".txt", ".md", ".mmd"]:
    with file_path.open("r", encoding="utf-8") as f:
        text = f.read()
        combined_text += text + "\n\n"
```

### Text Generation

Text generation is performed using OpenAI's GPT models through the `promptic` library's `llm` decorator. The application uses a custom `conditional_llm` wrapper to configure the LLM dynamically based on user selections:

```python
@conditional_llm(
    model=text_model,
    api_base=api_base,
    api_key=openai_api_key,
    reasoning_effort=reasoning_effort,
    do_web_search=do_web_search,
)
def generate_dialogue(
    text: str,
    intro_instructions: str,
    text_instructions: str,
    scratch_pad_instructions: str,
    prelude_dialog: str,
    podcast_dialog_instructions: str,
    edited_transcript: str = None,
    user_feedback: str = None,
) -> Dialogue:
    # Function body contains the prompt template
    ...
```

The `generate_dialogue` function is also decorated with `@retry` (from the Tenacity library) so that validation errors trigger a retry of the API call when necessary.
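The documentation above shows the wrapper's call signature but not its body. Below is a minimal sketch of what `conditional_llm` might look like, assuming promptic's `llm` decorator accepts and forwards the optional `api_base`/`api_key` settings; the real implementation also handles `reasoning_effort` and `do_web_search`, which are omitted here:

```python
from promptic import llm

def conditional_llm(model, api_base=None, api_key=None, **_ignored):
    """Build a promptic `llm` decorator from the user's UI selections.

    Optional settings are only passed through when the user actually
    supplied them, so library defaults apply otherwise. (Sketch only;
    reasoning_effort / do_web_search handling is omitted.)
    """
    def decorator(func):
        kwargs = {"model": model}
        if api_base:
            kwargs["api_base"] = api_base  # custom endpoint, e.g. a proxy
        if api_key:
            kwargs["api_key"] = api_key
        return llm(**kwargs)(func)
    return decorator
```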
### Audio Generation

Audio generation is handled by OpenAI's TTS API through the `get_mp3` function:

```python
def get_mp3(
    text: str,
    voice: str,
    audio_model: str,
    api_key: str = None,
    speaker_instructions: str = 'Speak in an emotive and friendly tone.',
) -> bytes:
    client = OpenAI(
        api_key=api_key or os.getenv("OPENAI_API_KEY"),
    )

    with client.audio.speech.with_streaming_response.create(
        model=audio_model,
        voice=voice,
        input=text,
        instructions=speaker_instructions,
    ) as response:
        with io.BytesIO() as file:
            for chunk in response.iter_bytes():
                file.write(chunk)
            return file.getvalue()
```

The application uses `concurrent.futures.ThreadPoolExecutor` to parallelize audio generation for each dialogue line, improving performance:

```python
with cf.ThreadPoolExecutor() as executor:
    futures = []
    for line in llm_output.dialogue:
        transcript_line = f"{line.speaker}: {line.text}"
        voice = speaker_1_voice if line.speaker == "speaker-1" else speaker_2_voice
        speaker_instructions = (
            speaker_1_instructions if line.speaker == "speaker-1" else speaker_2_instructions
        )
        future = executor.submit(
            get_mp3, line.text, voice, audio_model, openai_api_key, speaker_instructions
        )
        futures.append((future, transcript_line))
        characters += len(line.text)

    for future, transcript_line in futures:
        audio_chunk = future.result()
        audio += audio_chunk
        transcript += transcript_line + "\n\n"
```

### Instruction Templates

The application includes a comprehensive set of instruction templates for different output formats:

```python
INSTRUCTION_TEMPLATES = {
    "podcast": { ... },
    "deep research analysis": { ... },
    "clean rendering": { ... },
    "SciAgents material discovery summary": { ... },
    "lecture": { ... },
    "summary": { ... },
    "short summary": { ... },
    "podcast (French)": { ... },
    "podcast (German)": { ... },
    "podcast (Spanish)": { ... },
    "podcast (Portuguese)": { ... },
    "podcast (Hindi)": { ... },
    "podcast (Chinese)": { ... },
}
```

Each template contains five key components:

1. `intro`: High-level task description
2. `text_instructions`: How to process the input text
3. `scratch_pad`: Hidden brainstorming area for the model
4. `prelude`: Introduction to the main output
5. `dialog`: Main output instructions

These templates guide the LLM in generating appropriate content based on the selected format.
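The listing above elides the template bodies. Purely for illustration, a hypothetical entry with the five-field shape might look like the following; the wording is invented, not the app's actual prompt text:

```python
# Illustrative only -- not the application's real prompt text
ILLUSTRATIVE_TEMPLATE = {
    "intro": "Your task is to turn the provided text into a short spoken summary.",
    "text_instructions": "Read the source material carefully and identify its key points.",
    "scratch_pad": "Brainstorm here: list the main findings before writing the output.",
    "prelude": "Below is the summary you prepared, ready to be read aloud.",
    "dialog": "Write clear, natural dialogue for two speakers that covers every key point.",
}
```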
## User Interface

### Main Layout

The Gradio UI is structured with a clean, responsive layout:

```python
with gr.Blocks(title="PDF to Audio", css="""
#header {
    display: flex;
    align-items: center;
    justify-content: space-between;
    padding: 20px;
    background-color: transparent;
    border-bottom: 1px solid #ddd;
}
/* Additional CSS styles */
""") as demo:
    cached_dialogue = gr.State()

    with gr.Row(elem_id="header"):
        ...  # Header content

    with gr.Row(elem_id="main_container"):
        with gr.Column(scale=2):
            ...  # Input controls
        with gr.Column(scale=3):
            ...  # Template selection and customization
            ...  # Output components
            ...  # Editing features
```

### Input Controls

The left column contains input controls for file uploading and model selection:

```python
files = gr.Files(
    label="PDFs (.pdf), markdown (.md, .mmd), or text files (.txt)",
    file_types=[".pdf", ".PDF", ".md", ".mmd", ".txt"],
)
openai_api_key = gr.Textbox(
    label="OpenAI API Key",
    visible=True,
    placeholder="Enter your OpenAI API Key here...",
    type="password",
)
text_model = gr.Dropdown(
    label="Text Generation Model",
    choices=STANDARD_TEXT_MODELS,
    value="o3-mini",
    info="Select the model to generate the dialogue text.",
)
# Additional input controls for audio model, voices, etc.
```

### Output Display

The application provides several output components:

```python
audio_output = gr.Audio(label="Audio", format="mp3", interactive=False, autoplay=False)
transcript_output = gr.Textbox(label="Transcript", lines=25, show_copy_button=True)
original_text_output = gr.Textbox(label="Original Text", lines=10, visible=False)
error_output = gr.Textbox(visible=False)  # Hidden textbox to store error messages
```

### Editing Features

The application includes several features for editing and regenerating content:

1. **Transcript Editing**:

   ```python
   use_edited_transcript = gr.Checkbox(label="Use Edited Transcript", value=False)
   edited_transcript = gr.Textbox(
       label="Edit Transcript Here",
       lines=20,
       visible=False,
       show_copy_button=True,
       interactive=False,
   )
   ```

2. **Line-by-Line Editing**:

   ```python
   with gr.Accordion("Edit dialogue line‑by‑line", open=False) as editor_box:
       df_editor = gr.Dataframe(
           headers=["Speaker", "Line"],
           datatype=["str", "str"],
           wrap=True,
           interactive=True,
           row_count=(1, "dynamic"),
           col_count=(2, "fixed"),
       )
   ```

3. **User Feedback**:

   ```python
   user_feedback = gr.Textbox(label="Provide Feedback or Notes", lines=10)
   ```

## Workflow

The application workflow is managed through event handlers that connect UI components to backend functions:

1. **Template Selection**:

   ```python
   template_dropdown.change(
       fn=update_instructions,
       inputs=[template_dropdown],
       outputs=[
           intro_instructions,
           text_instructions,
           scratch_pad_instructions,
           prelude_dialog,
           podcast_dialog_instructions,
       ],
   )
   ```

2. **Generate Audio**:

   ```python
   submit_btn.click(
       fn=validate_and_generate_audio,
       inputs=[
           files, openai_api_key, text_model, reasoning_effort, do_web_search,
           audio_model, speaker_1_voice, speaker_2_voice,
           speaker_1_instructions, speaker_2_instructions, api_base,
           intro_instructions, text_instructions, scratch_pad_instructions,
           prelude_dialog, podcast_dialog_instructions,
           edited_transcript, user_feedback,
       ],
       outputs=[audio_output, transcript_output, original_text_output, error_output, cached_dialogue],
   )
   ```

3. **Regenerate with Edits**:

   ```python
   regenerate_btn.click(
       fn=lambda use_edit, edit, *args: validate_and_generate_audio(
           *args[:12],                # All inputs up to podcast_dialog_instructions
           edit if use_edit else "",  # Use edited transcript if checkbox is checked
           *args[12:],                # user_feedback and original_text_output
       ),
       inputs=[
           use_edited_transcript,
           edited_transcript,
           # Additional inputs
       ],
       outputs=[audio_output, transcript_output, original_text_output, error_output, cached_dialogue],
   )
   ```

4. **Re-render Audio**:

   ```python
   rerender_btn.click(
       fn=render_audio_from_dialogue,
       inputs=[
           cached_dialogue,
           openai_api_key,
           audio_model,
           speaker_1_voice,
           speaker_2_voice,
           speaker_1_instructions,
           speaker_2_instructions,
       ],
       outputs=[audio_output, transcript_output],
   )
   ```

## Key Functions

### `validate_and_generate_audio`

This function serves as the entry point for audio generation, validating inputs and handling errors. Every return path yields five values to match the five output components wired up in the click handler:

```python
def validate_and_generate_audio(*args):
    files = args[0]
    if not files:
        # Five return values to match the five outputs; the fifth is the cached dialogue
        return None, None, None, "Please upload at least one PDF (or MD/MMD/TXT) file before generating audio.", None
    try:
        audio_file, transcript, original_text, dialogue = generate_audio(*args)
        return audio_file, transcript, original_text, None, dialogue
    except Exception as e:
        return None, None, None, str(e), None
```

### `generate_audio`

This is the core function that orchestrates the entire process:

1. Validates the API key
2. Extracts text from uploaded files
3. Configures and calls the LLM to generate dialogue
4. Processes any user edits or feedback
5. Generates audio for each dialogue line
6. Returns the audio file, transcript, and original text

### `render_audio_from_dialogue`

This function re-renders audio from an existing dialogue without regenerating the text:

```python
def render_audio_from_dialogue(
    cached_dialogue,
    openai_api_key: str,
    audio_model: str,
    speaker_1_voice: str,
    speaker_2_voice: str,
    speaker_1_instructions: str,
    speaker_2_instructions: str,
) -> tuple[str, str]:
    # Function implementation
    ...
```

### `save_dialogue_edits`

This function saves edits made in the dataframe editor:

```python
def save_dialogue_edits(df, cached_dialogue):
    if cached_dialogue is None:
        raise gr.Error("Nothing to edit yet – run Generate Audio first.")

    import pandas as pd
    new_dlg = df_to_dialogue(pd.DataFrame(df, columns=["Speaker", "Line"]))

    # Regenerate the plain transcript so the user sees the change immediately
    transcript_str = "\n".join(f"{d.speaker}: {d.text}" for d in new_dlg.dialogue)

    # Return updated state and transcript
    return new_dlg, gr.update(value=transcript_str), "Edits saved. Press *Re‑render* to hear them."
```
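`save_dialogue_edits` depends on a `df_to_dialogue` helper that is not shown in the excerpts above. A plausible sketch, assuming it simply maps the two editor columns back onto the Pydantic models (the real implementation may differ):

```python
import pandas as pd

def df_to_dialogue(df: pd.DataFrame) -> Dialogue:
    """Rebuild a Dialogue model from the two-column line editor. (Sketch.)"""
    items = [
        DialogueItem(speaker=str(row["Speaker"]).strip(), text=str(row["Line"]))
        for _, row in df.iterrows()
        if str(row["Line"]).strip()  # drop empty rows left by the editor
    ]
    # Pydantic raises a ValidationError here if a Speaker cell is not
    # "speaker-1" or "speaker-2", surfacing bad edits immediately.
    return Dialogue(scratchpad="", dialogue=items)
```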
## Integration Points

The application integrates with several external services and libraries:

1. **OpenAI API**: Used for both text generation (GPT models) and text-to-speech conversion
2. **Promptic**: A library for working with LLM prompts
3. **pypdf**: Used for extracting text from PDF files
4. **Gradio**: The web UI framework
5. **Pydantic**: Used for data validation and modeling
6. **Tenacity**: Used for implementing retry logic

## Customization Options

The application offers several customization options:

1. **Instruction Templates**: Multiple pre-defined templates for different output formats
2. **Model Selection**: Support for various OpenAI models for both text and audio generation
3. **Voice Selection**: Multiple voice options for the speakers
4. **Voice Instructions**: Custom instructions for each speaker's voice
5. **API Base**: Option to use a custom API endpoint for text generation
6. **Web Search**: Option to enable web search during text generation
7. **Reasoning Effort**: Control over the reasoning effort for compatible models

## Conclusion

PDF2Audio is a well-structured application that demonstrates effective use of modern AI APIs for content transformation. Its modular design and comprehensive feature set make it an excellent foundation for similar applications.

Key strengths of the codebase include:

1. **Modularity**: Clear separation of concerns between text extraction, dialogue generation, and audio synthesis
2. **Extensibility**: Easy to add new instruction templates or customize existing ones
3. **Error Handling**: Robust error handling with informative user feedback
4. **Performance Optimization**: Parallel processing for audio generation
5. **User Experience**: Rich UI with multiple editing and customization options

Developers looking to build similar applications can leverage this codebase as a starting point, focusing on extending functionality or improving specific aspects rather than building from scratch.