SAC Agent for LunarLander-v3
This is a trained Soft Actor-Critic (SAC) agent for the LunarLander-v3 environment from Gymnasium.
Model Description
This model was trained using the SAC algorithm with continuous action space on the LunarLander-v3 environment. The agent learns to safely land a lunar module on a landing pad using continuous thrust controls.
Algorithm: Soft Actor-Critic (SAC)
SAC is an off-policy actor-critic algorithm based on the maximum entropy reinforcement learning framework. Key features:
- Entropy Regularization: Encourages exploration by maximizing both expected reward and entropy
- Twin Q-Networks: Uses two critic networks to reduce overestimation bias
- Soft Policy Updates: Smooth target network updates using Polyak averaging
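The soft target update in the last bullet can be sketched in plain Python. The dictionary of scalar "parameters" here is a toy stand-in for the PyTorch state dicts used in the actual training code; `TAU` matches the value in the hyperparameter table below.

```python
TAU = 0.005  # soft update rate (tau) from the hyperparameter table

def polyak_update(target_params, online_params, tau=TAU):
    """Move each target parameter a small step toward the online parameter."""
    return {name: (1 - tau) * target_params[name] + tau * online_params[name]
            for name in target_params}

# Toy example with a single scalar "parameter"
target = {"w": 1.0}
online = {"w": 2.0}
target = polyak_update(target, online)  # w moves to roughly 1.005
```

Because `tau` is small, the target networks trail the online networks slowly, which stabilizes the bootstrapped critic targets.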
Training Hyperparameters
| Parameter | Value |
|---|---|
| Total Timesteps | 750,000 |
| Start Timesteps (Random) | 10,000 |
| Batch Size | 256 |
| Discount Factor (γ) | 0.99 |
| Soft Update Rate (τ) | 0.005 |
| Entropy Coefficient (α) | 0.2 |
| Learning Rate | 3e-4 |
| Replay Buffer Capacity | 500,000 |
| Network Architecture | MLP (256, 256) |
Environment Details
- Environment: LunarLander-v3
- Action Space: Continuous (Box)
  - Main engine: -1.0 to 1.0
  - Side engines: -1.0 to 1.0
- Observation Space: 8-dimensional vector
  - Position (x, y)
  - Velocity (vx, vy)
  - Angle and angular velocity
  - Leg contact indicators
- Reward: +100 to +140 points for landing on the pad, -100 for crashing, with an additional penalty for fuel usage
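As a quick reference for the 8-dimensional observation above, the layout can be sketched as follows (field names follow the Gymnasium LunarLander documentation; the `obs_to_dict` helper is illustrative and not part of the released code):

```python
# Index layout of the 8-dimensional LunarLander observation vector
OBS_FIELDS = ["x", "y", "vx", "vy", "angle", "angular_velocity",
              "left_leg_contact", "right_leg_contact"]

def obs_to_dict(obs):
    """Map a raw observation array to named fields for readability."""
    assert len(obs) == len(OBS_FIELDS)
    return dict(zip(OBS_FIELDS, obs))

# Example: lander above the pad, descending slowly, no leg contact yet
sample = [0.0, 1.4, 0.0, -0.5, 0.0, 0.0, 0.0, 0.0]
print(obs_to_dict(sample))
```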
Model Architecture
Actor Network (Policy)
Input (8) → Linear(256) → ReLU → Linear(256) → ReLU → [Mean(2), Log_Std(2)]
- Outputs mean and log standard deviation for Gaussian policy
- Actions squashed through tanh and scaled by max_action
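The squashed-Gaussian sampling described above can be sketched in plain Python rather than PyTorch. The clamp bounds match the `log_std` clamp in the usage code below; `sample_squashed_action` is an illustrative helper, not part of the released code.

```python
import math
import random

MAX_ACTION = 1.0
LOG_STD_MIN, LOG_STD_MAX = -20.0, 2.0  # same clamp as in the actor's forward pass

def sample_squashed_action(mean, log_std, rng=random):
    """Reparameterized sample: a = tanh(mean + std * eps) * max_action."""
    log_std = max(LOG_STD_MIN, min(LOG_STD_MAX, log_std))
    std = math.exp(log_std)
    eps = rng.gauss(0.0, 1.0)          # noise sampled outside the policy parameters
    pre_tanh = mean + std * eps
    return math.tanh(pre_tanh) * MAX_ACTION

a = sample_squashed_action(0.3, -1.0)
assert -MAX_ACTION <= a <= MAX_ACTION  # tanh keeps actions in the valid range
```

At evaluation time the stochastic sample is typically replaced by the deterministic mean, which is exactly what `select_action` in the Usage section does.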
Critic Network (Twin Q-Networks)
Input (8 + 2) → Linear(256) → ReLU → Linear(256) → ReLU → Q-value(1)
- Two separate Q-networks (Q1, Q2) for double Q-learning
- Takes state-action pairs as input
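The role of the twin critics shows up in the soft TD target, where taking the minimum of the two Q-values counteracts overestimation. A plain-Python sketch (`soft_td_target` is an illustrative helper; γ and α are taken from the hyperparameter table):

```python
GAMMA, ALPHA = 0.99, 0.2  # discount factor and entropy coefficient from the table

def soft_td_target(reward, done, q1_next, q2_next, log_prob_next,
                   gamma=GAMMA, alpha=ALPHA):
    """SAC critic target: r + gamma * (min(Q1, Q2) - alpha * log pi(a'|s'))."""
    soft_value = min(q1_next, q2_next) - alpha * log_prob_next
    return reward + gamma * (1.0 - done) * soft_value

# min(5.0, 4.0) - 0.2 * (-1.0) = 4.2, so the target is 1.0 + 0.99 * 4.2 = 5.158
print(soft_td_target(1.0, 0.0, 5.0, 4.0, -1.0))
```

The `- alpha * log_prob` term is the entropy bonus: lower-probability actions raise the target value, which is what pushes the policy toward exploration.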
Files Included
| File | Description |
|---|---|
| sac_model_actor.pth | Actor (policy) network weights |
| sac_model_critic.pth | Twin critic network weights |
| sac_model_critic_target.pth | Target critic network weights |
Usage
Installation
```bash
pip install gymnasium[box2d] torch numpy
```
Loading and Using the Model
```python
import torch
import numpy as np
import gymnasium as gym
from huggingface_hub import hf_hub_download

# Define the Actor Network
class ActorNetwork(torch.nn.Module):
    def __init__(self, state_dim, action_dim, max_action):
        super().__init__()
        self.l1 = torch.nn.Linear(state_dim, 256)
        self.l2 = torch.nn.Linear(256, 256)
        self.mean = torch.nn.Linear(256, action_dim)
        self.log_std_layer = torch.nn.Linear(256, action_dim)
        self.max_action = float(max_action)

    def forward(self, state):
        x = torch.nn.functional.relu(self.l1(state))
        x = torch.nn.functional.relu(self.l2(x))
        mean = self.mean(x)
        log_std = self.log_std_layer(x)
        log_std = torch.clamp(log_std, -20.0, 2.0)
        return mean, log_std

# Download the actor weights from the Hub
actor_path = hf_hub_download(
    repo_id="MohamedMaher003/LunarLander-v3-SAC",
    filename="sac_model_actor.pth",
)

# Initialize environment and model
env = gym.make("LunarLander-v3", continuous=True, render_mode="human")
state_dim = 8
action_dim = 2
max_action = 1.0

# Load actor
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
actor = ActorNetwork(state_dim, action_dim, max_action).to(device)
actor.load_state_dict(torch.load(actor_path, map_location=device))
actor.eval()

# Deterministic evaluation: act on the mean of the Gaussian policy
def select_action(state, actor, device):
    with torch.no_grad():
        state_tensor = torch.FloatTensor(state).unsqueeze(0).to(device)
        mean, _ = actor(state_tensor)
        action = torch.tanh(mean) * actor.max_action
    return action.cpu().numpy().flatten()

# Run one evaluation episode
state, _ = env.reset()
done = False
total_reward = 0.0
while not done:
    action = select_action(state, actor, device)
    state, reward, terminated, truncated, _ = env.step(action)
    total_reward += reward
    done = terminated or truncated
print(f"Episode Reward: {total_reward:.2f}")
env.close()
```
Training Code
The full training implementation is available at: GitHub Repository
Training was done using:
- Framework: PyTorch
- Logging: Weights & Biases (wandb)
Evaluation Results
The trained agent achieves:
- Average Reward: approximately 250 or higher over 10 evaluation episodes
- Success Rate: high rate of successful landings on the pad
- Stability: Consistent performance across episodes
Citation
If you use this model, please cite:
```bibtex
@misc{maher2024sac-lunarlander,
  author       = {Mohamed Maher},
  title        = {SAC Agent for LunarLander-v3},
  year         = {2024},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/MohamedMaher003/LunarLander-v3-SAC}}
}
```
References
- Soft Actor-Critic: Off-Policy Maximum Entropy Deep RL - Haarnoja et al., 2018
- Gymnasium Documentation
- LunarLander Environment
Author
Mohamed Maher - Hugging Face Profile
This model was trained as part of a Reinforcement Learning course assignment on state-of-the-art model-free algorithms at the Faculty of Engineering, Cairo University.