SAC Agent for LunarLander-v3
This is a trained Soft Actor-Critic (SAC) agent for the LunarLander-v3 environment from Gymnasium.
Model Description
This model was trained using the SAC algorithm with continuous action space on the LunarLander-v3 environment. The agent learns to safely land a lunar module on a landing pad using continuous thrust controls.
Algorithm: Soft Actor-Critic (SAC)
SAC is an off-policy actor-critic algorithm based on the maximum entropy reinforcement learning framework. Key features:
- Entropy Regularization: Encourages exploration by maximizing both expected reward and entropy
- Twin Q-Networks: Uses two critic networks to reduce overestimation bias
- Soft Policy Updates: Smooth target network updates using Polyak averaging
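The soft target update in the last bullet can be sketched in plain Python. The dictionary of scalar "parameters" here is a toy stand-in for the PyTorch state dicts used in the actual training code; `TAU` matches the value in the hyperparameter table below.

```python
TAU = 0.005  # soft update rate (tau) from the hyperparameter table

def polyak_update(target_params, online_params, tau=TAU):
    """Move each target parameter a small step toward the online parameter."""
    return {name: (1 - tau) * target_params[name] + tau * online_params[name]
            for name in target_params}

# Toy example with a single scalar "parameter"
target = {"w": 1.0}
online = {"w": 2.0}
target = polyak_update(target, online)  # w moves to roughly 1.005
```

Because `tau` is small, the target networks trail the online networks slowly, which stabilizes the bootstrapped critic targets.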
Training Hyperparameters
| Parameter | Value |
|---|---|
| Total Timesteps | 750,000 |
| Start Timesteps (Random) | 10,000 |
| Batch Size | 256 |
| Discount Factor (γ) | 0.99 |
| Soft Update Rate (τ) | 0.005 |
| Entropy Coefficient (α) | 0.2 |
| Learning Rate | 3e-4 |
| Replay Buffer Capacity | 500,000 |
| Network Architecture | MLP (256, 256) |
Environment Details
- Environment: LunarLander-v3
- Action Space: Continuous (Box)
  - Main engine: -1.0 to 1.0
  - Side engines: -1.0 to 1.0
- Observation Space: 8-dimensional vector
  - Position (x, y)
  - Velocity (vx, vy)
  - Angle and angular velocity
  - Leg contact indicators
- Reward: +100 to +140 points for landing on the pad, -100 for crashing, with an additional penalty for fuel usage
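As a quick reference for the 8-dimensional observation above, the layout can be sketched as follows (field names follow the Gymnasium LunarLander documentation; the `obs_to_dict` helper is illustrative and not part of the released code):

```python
# Index layout of the 8-dimensional LunarLander observation vector
OBS_FIELDS = ["x", "y", "vx", "vy", "angle", "angular_velocity",
              "left_leg_contact", "right_leg_contact"]

def obs_to_dict(obs):
    """Map a raw observation array to named fields for readability."""
    assert len(obs) == len(OBS_FIELDS)
    return dict(zip(OBS_FIELDS, obs))

# Example: lander above the pad, descending slowly, no leg contact yet
sample = [0.0, 1.4, 0.0, -0.5, 0.0, 0.0, 0.0, 0.0]
print(obs_to_dict(sample))
```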
Model Architecture
Actor Network (Policy)
Input (8) → Linear(256) → ReLU → Linear(256) → ReLU → [Mean(2), Log_Std(2)]
- Outputs mean and log standard deviation for Gaussian policy
- Actions squashed through tanh and scaled by max_action
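The squashed-Gaussian sampling described above can be sketched in plain Python rather than PyTorch. The clamp bounds match the `log_std` clamp in the usage code below; `sample_squashed_action` is an illustrative helper, not part of the released code.

```python
import math
import random

MAX_ACTION = 1.0
LOG_STD_MIN, LOG_STD_MAX = -20.0, 2.0  # same clamp as in the actor's forward pass

def sample_squashed_action(mean, log_std, rng=random):
    """Reparameterized sample: a = tanh(mean + std * eps) * max_action."""
    log_std = max(LOG_STD_MIN, min(LOG_STD_MAX, log_std))
    std = math.exp(log_std)
    eps = rng.gauss(0.0, 1.0)          # noise sampled outside the policy parameters
    pre_tanh = mean + std * eps
    return math.tanh(pre_tanh) * MAX_ACTION

a = sample_squashed_action(0.3, -1.0)
assert -MAX_ACTION <= a <= MAX_ACTION  # tanh keeps actions in the valid range
```

At evaluation time the stochastic sample is typically replaced by the deterministic mean, which is exactly what `select_action` in the Usage section does.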
Critic Network (Twin Q-Networks)
Input (8 + 2) → Linear(256) → ReLU → Linear(256) → ReLU → Q-value(1)
- Two separate Q-networks (Q1, Q2) for double Q-learning
- Takes state-action pairs as input
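The role of the twin critics shows up in the soft TD target, where taking the minimum of the two Q-values counteracts overestimation. A plain-Python sketch (`soft_td_target` is an illustrative helper; γ and α are taken from the hyperparameter table):

```python
GAMMA, ALPHA = 0.99, 0.2  # discount factor and entropy coefficient from the table

def soft_td_target(reward, done, q1_next, q2_next, log_prob_next,
                   gamma=GAMMA, alpha=ALPHA):
    """SAC critic target: r + gamma * (min(Q1, Q2) - alpha * log pi(a'|s'))."""
    soft_value = min(q1_next, q2_next) - alpha * log_prob_next
    return reward + gamma * (1.0 - done) * soft_value

# min(5.0, 4.0) - 0.2 * (-1.0) = 4.2, so the target is 1.0 + 0.99 * 4.2 = 5.158
print(soft_td_target(1.0, 0.0, 5.0, 4.0, -1.0))
```

The `- alpha * log_prob` term is the entropy bonus: lower-probability actions raise the target value, which is what pushes the policy toward exploration.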
Files Included
| File | Description |
|---|---|
| sac_model_actor.pth | Actor (policy) network weights |
| sac_model_critic.pth | Twin critic network weights |
| sac_model_critic_target.pth | Target critic network weights |
Usage
Installation
```bash
pip install gymnasium[box2d] torch numpy
```
Loading and Using the Model
```python
import torch
import numpy as np
import gymnasium as gym
from huggingface_hub import hf_hub_download

# Define the Actor Network
class ActorNetwork(torch.nn.Module):
    def __init__(self, state_dim, action_dim, max_action):
        super().__init__()
        self.l1 = torch.nn.Linear(state_dim, 256)
        self.l2 = torch.nn.Linear(256, 256)
        self.mean = torch.nn.Linear(256, action_dim)
        self.log_std_layer = torch.nn.Linear(256, action_dim)
        self.max_action = float(max_action)

    def forward(self, state):
        x = torch.nn.functional.relu(self.l1(state))
        x = torch.nn.functional.relu(self.l2(x))
        mean = self.mean(x)
        log_std = self.log_std_layer(x)
        log_std = torch.clamp(log_std, -20.0, 2.0)
        return mean, log_std

# Download the actor weights from the Hub
actor_path = hf_hub_download(
    repo_id="MohamedMaher003/LunarLander-v3-SAC",
    filename="sac_model_actor.pth",
)

# Initialize environment and model
env = gym.make("LunarLander-v3", continuous=True, render_mode="human")
state_dim = 8
action_dim = 2
max_action = 1.0

# Load actor
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
actor = ActorNetwork(state_dim, action_dim, max_action).to(device)
actor.load_state_dict(torch.load(actor_path, map_location=device))
actor.eval()

# Deterministic evaluation: act on the mean of the Gaussian policy
def select_action(state, actor, device):
    with torch.no_grad():
        state_tensor = torch.FloatTensor(state).unsqueeze(0).to(device)
        mean, _ = actor(state_tensor)
        action = torch.tanh(mean) * actor.max_action
    return action.cpu().numpy().flatten()

# Run one evaluation episode
state, _ = env.reset()
done = False
total_reward = 0.0
while not done:
    action = select_action(state, actor, device)
    state, reward, terminated, truncated, _ = env.step(action)
    total_reward += reward
    done = terminated or truncated
print(f"Episode Reward: {total_reward:.2f}")
env.close()
```
Training Code
The full training implementation is available at: GitHub Repository
Training was done using:
- Framework: PyTorch
- Logging: Weights & Biases (wandb)
Evaluation Results
The trained agent achieves:
- Average Reward: approximately 250 or higher over 10 evaluation episodes
- Success Rate: high rate of successful landings on the pad
- Stability: Consistent performance across episodes
Citation
If you use this model, please cite:
```bibtex
@misc{maher2024sac-lunarlander,
  author       = {Mohamed Maher},
  title        = {SAC Agent for LunarLander-v3},
  year         = {2024},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/MohamedMaher003/LunarLander-v3-SAC}}
}
```
References
- Soft Actor-Critic: Off-Policy Maximum Entropy Deep RL - Haarnoja et al., 2018
- Gymnasium Documentation
- LunarLander Environment
Author
Mohamed Maher - Hugging Face Profile
This model was trained as part of a Reinforcement Learning course assignment on state-of-the-art model-free algorithms at the Faculty of Engineering, Cairo University.