SAC Agent for LunarLander-v3

This is a trained Soft Actor-Critic (SAC) agent for the LunarLander-v3 environment from Gymnasium.

Model Description

This model was trained using the SAC algorithm with continuous action space on the LunarLander-v3 environment. The agent learns to safely land a lunar module on a landing pad using continuous thrust controls.

Algorithm: Soft Actor-Critic (SAC)

SAC is an off-policy actor-critic algorithm based on the maximum entropy reinforcement learning framework. Key features:

  • Entropy Regularization: Encourages exploration by maximizing both expected reward and entropy
  • Twin Q-Networks: Uses two critic networks to reduce overestimation bias
  • Soft Policy Updates: Smooth target network updates using Polyak averaging
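The soft target update in the last bullet is a one-line Polyak average per parameter; a minimal PyTorch sketch (function and variable names here are illustrative, not taken from the training code):

```python
import torch

def soft_update(target: torch.nn.Module, source: torch.nn.Module, tau: float = 0.005):
    """Polyak-average source parameters into the target network:
    target <- (1 - tau) * target + tau * source."""
    with torch.no_grad():
        for t_param, s_param in zip(target.parameters(), source.parameters()):
            t_param.mul_(1.0 - tau).add_(tau * s_param)

# Tiny demonstration with two linear layers
source_net = torch.nn.Linear(4, 4)
target_net = torch.nn.Linear(4, 4)
soft_update(target_net, source_net, tau=0.005)
```

With τ = 0.005 (the value in the table below), the target network tracks the critic slowly, which stabilizes the bootstrapped Q-targets.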

Training Hyperparameters

| Parameter | Value |
|---|---|
| Total Timesteps | 750,000 |
| Start Timesteps (Random) | 10,000 |
| Batch Size | 256 |
| Discount Factor (γ) | 0.99 |
| Soft Update Rate (τ) | 0.005 |
| Entropy Coefficient (α) | 0.2 |
| Learning Rate | 3e-4 |
| Replay Buffer Capacity | 500,000 |
| Network Architecture | MLP (256, 256) |
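For reference, the same hyperparameters as a plain Python dict (key names are illustrative; the training code may use different names):

```python
# Hyperparameters from the table above
SAC_CONFIG = {
    "total_timesteps": 750_000,
    "start_timesteps": 10_000,       # steps of uniform-random actions before learning
    "batch_size": 256,
    "gamma": 0.99,                   # discount factor
    "tau": 0.005,                    # Polyak soft-update rate
    "alpha": 0.2,                    # entropy coefficient
    "learning_rate": 3e-4,
    "replay_buffer_capacity": 500_000,
    "hidden_sizes": (256, 256),      # MLP hidden layers for actor and critics
}
```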

Environment Details

  • Environment: LunarLander-v3
  • Action Space: Continuous (Box)
    • Main engine: -1.0 to 1.0
    • Side engines: -1.0 to 1.0
  • Observation Space: 8-dimensional vector
    • Position (x, y)
    • Velocity (vx, vy)
    • Angle and angular velocity
    • Leg contact indicators
  • Reward: +100 to +140 for landing on the pad, -100 for crashing, plus a penalty for fuel usage

Model Architecture

Actor Network (Policy)

Input (8) → Linear(256) → ReLU → Linear(256) → ReLU → [Mean(2), Log_Std(2)]
  • Outputs mean and log standard deviation for Gaussian policy
  • Actions squashed through tanh and scaled by max_action
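During training, actions are drawn from the tanh-squashed Gaussian via the reparameterization trick; a minimal sketch of the sampling step, including the standard change-of-variables log-probability correction for the tanh squash (the helper name is illustrative):

```python
import torch

def sample_action(mean, log_std, max_action=1.0):
    """Sample a tanh-squashed Gaussian action and its log-probability."""
    std = log_std.exp()
    normal = torch.distributions.Normal(mean, std)
    z = normal.rsample()                 # reparameterized sample: mean + std * noise
    action = torch.tanh(z) * max_action
    # Correction for the tanh change of variables (small epsilon for stability)
    log_prob = normal.log_prob(z) - torch.log(1 - torch.tanh(z).pow(2) + 1e-6)
    return action, log_prob.sum(dim=-1)

mean = torch.zeros(1, 2)       # batch of one state, 2-D action
log_std = torch.zeros(1, 2)
action, log_prob = sample_action(mean, log_std)
```

At evaluation time, the deterministic action `tanh(mean) * max_action` is used instead, as in the usage snippet below.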

Critic Network (Twin Q-Networks)

Input (8 + 2) → Linear(256) → ReLU → Linear(256) → ReLU → Q-value(1)
  • Two separate Q-networks (Q1, Q2) for double Q-learning
  • Takes state-action pairs as input
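The twin critic can be sketched as two identical MLPs over the concatenated state-action vector; note the attribute names below are assumptions and may not match the key names in the saved checkpoint:

```python
import torch

class TwinCritic(torch.nn.Module):
    """Two independent Q-networks over (state, action); layer sizes follow the card.
    Attribute names (q1, q2) are illustrative, not taken from the checkpoint."""
    def __init__(self, state_dim=8, action_dim=2):
        super().__init__()
        def make_q():
            return torch.nn.Sequential(
                torch.nn.Linear(state_dim + action_dim, 256), torch.nn.ReLU(),
                torch.nn.Linear(256, 256), torch.nn.ReLU(),
                torch.nn.Linear(256, 1),
            )
        self.q1 = make_q()
        self.q2 = make_q()

    def forward(self, state, action):
        sa = torch.cat([state, action], dim=-1)
        return self.q1(sa), self.q2(sa)

critic = TwinCritic()
q1, q2 = critic(torch.zeros(4, 8), torch.zeros(4, 2))  # batch of 4 state-action pairs
```

Taking `min(Q1, Q2)` when forming targets is what reduces the overestimation bias mentioned above.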

Files Included

| File | Description |
|---|---|
| sac_model_actor.pth | Actor (policy) network weights |
| sac_model_critic.pth | Twin critic network weights |
| sac_model_critic_target.pth | Target critic network weights |

Usage

Installation

pip install "gymnasium[box2d]" torch numpy

Loading and Using the Model

import torch
import numpy as np
import gymnasium as gym

# Define the Actor Network
class ActorNetwork(torch.nn.Module):
    def __init__(self, state_dim, action_dim, max_action):
        super(ActorNetwork, self).__init__()
        self.l1 = torch.nn.Linear(state_dim, 256)
        self.l2 = torch.nn.Linear(256, 256)
        self.mean = torch.nn.Linear(256, action_dim)
        self.log_std_layer = torch.nn.Linear(256, action_dim)
        self.max_action = float(max_action)

    def forward(self, state):
        x = torch.nn.functional.relu(self.l1(state))
        x = torch.nn.functional.relu(self.l2(x))
        mean = self.mean(x)
        log_std = self.log_std_layer(x)
        log_std = torch.clamp(log_std, -20.0, 2.0)
        return mean, log_std

# Download and load the model
from huggingface_hub import hf_hub_download

# Download model file
actor_path = hf_hub_download(
    repo_id="MohamedMaher003/LunarLander-v3-SAC",
    filename="sac_model_actor.pth"
)

# Initialize environment and model
env = gym.make("LunarLander-v3", continuous=True, render_mode="human")
state_dim = 8
action_dim = 2
max_action = 1.0

# Load actor
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
actor = ActorNetwork(state_dim, action_dim, max_action).to(device)
actor.load_state_dict(torch.load(actor_path, map_location=device))
actor.eval()

# Run evaluation
def select_action(state, actor, device):
    with torch.no_grad():
        state_tensor = torch.FloatTensor(state).unsqueeze(0).to(device)
        mean, _ = actor(state_tensor)
        action = torch.tanh(mean) * actor.max_action
        return action.cpu().numpy().flatten()

state, _ = env.reset()
done = False
total_reward = 0

while not done:
    action = select_action(state, actor, device)
    state, reward, terminated, truncated, _ = env.step(action)
    total_reward += reward
    done = terminated or truncated

print(f"Episode Reward: {total_reward:.2f}")
env.close()

Training Code

The full training implementation is available at: GitHub Repository

Training was done using:

  • Framework: PyTorch
  • Logging: Weights & Biases (wandb)

Evaluation Results

The trained agent achieves:

  • Average Reward: ~250+ over 10 evaluation episodes
  • Success Rate: High landing success rate on the pad
  • Stability: Consistent performance across episodes
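The averaged result above can be reproduced with a small evaluation loop (using `select_action` from the usage snippet); a generic sketch that works with any Gymnasium-style environment:

```python
def evaluate(env, policy_fn, episodes=10):
    """Average undiscounted episode return of a deterministic policy."""
    returns = []
    for _ in range(episodes):
        state, _ = env.reset()
        done, total = False, 0.0
        while not done:
            state, reward, terminated, truncated, _ = env.step(policy_fn(state))
            total += reward
            done = terminated or truncated
        returns.append(total)
    return sum(returns) / len(returns)

# Example call (assumes env and actor from the usage section are in scope):
# avg = evaluate(env, lambda s: select_action(s, actor, device), episodes=10)
```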

Citation

If you use this model, please cite:

@misc{maher2024sac-lunarlander,
  author = {Mohamed Maher},
  title = {SAC Agent for LunarLander-v3},
  year = {2024},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/MohamedMaher003/LunarLander-v3-SAC}}
}

Author

Mohamed Maher - Hugging Face Profile


This model was trained as part of a Reinforcement Learning course assignment on State-of-the-Art Model-Free Algorithms at Cairo University, Faculty of Engineering.
