EmergentRP-Qwen4B: Fine-Tuned for Deeper Game Role-Play Illusions
Developed by: benhs000
License: Apache 2.0
Base Model: Qwen/Qwen3-4B-Instruct-2507
Tech: Unsloth accelerated fine-tuning (2× faster), Hugging Face TRL
Model Description
EmergentRP-Qwen4B is a 4B-parameter Qwen3 Instruct model fine-tuned for emergent role-play behaviors - dynamic, context-aware dialogues that give NPCs the illusion of depth without requiring heavy computation.
Where most AI chatbots loop canned responses, EmergentRP simulates "living" NPCs that recall context, adapt tone, and evolve within narrative constraints.
It is tuned especially for game developers who want believable character dialogue without chain-of-thought (CoT) verbosity or GPU-heavy models.
Trained on synthetic and curated RP dialogues, this fine-tune emphasizes immersion, diversity, and internal consistency, making NPCs feel reactive rather than random.
Training Details
| Aspect | Description |
|---|---|
| Base Model | Qwen/Qwen3-4B-Instruct-2507 (Apache 2.0) |
| Method | Unsloth + TRL LoRA fine-tuning |
| LoRA Config | r=16, alpha=16, 1 epoch, lr=2e-4 |
| Dataset | ~10k RP dialogues: branching quests, adaptive NPCs, synthetic "memory" cues |
| Hardware | Single GPU (T4), 20-minute training |
| Quantization | GGUF Q4_K_M (~2.1GB) for CPU & M1 use |
| Eval Summary | ~13% perplexity drop on RP benchmarks (17.8 to 15.4); context-aware, non-repetitive NPCs (evaluation still in progress) |
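For orientation, the LoRA settings in the table imply a scaling factor of alpha / r = 1.0 and only a small number of trainable parameters per adapted weight matrix. The sketch below works through that arithmetic in plain Python. `LoraSettings` is an illustrative name, not a PEFT class, and the 2560 × 2560 projection shape is a hypothetical example rather than a value read from this model's configuration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LoraSettings:
    """Illustrative container for the values listed in the training table."""
    r: int = 16
    alpha: int = 16
    epochs: int = 1
    learning_rate: float = 2e-4

    @property
    def scaling(self) -> float:
        # Standard LoRA scales the low-rank update B @ A by alpha / r.
        return self.alpha / self.r

    def adapter_params(self, d_in: int, d_out: int) -> int:
        # A has shape (r, d_in) and B has shape (d_out, r).
        return self.r * (d_in + d_out)


settings = LoraSettings()

# Hypothetical square projection; substitute the real layer shapes.
d_in, d_out = 2560, 2560
full = d_in * d_out
lora = settings.adapter_params(d_in, d_out)

print(f"scaling = {settings.scaling}")
print(f"adapter params = {lora} ({lora / full:.2%} of the full matrix)")
```

With r equal to alpha, the adapter update is applied unscaled, so the learning rate alone controls how strongly the fine-tune moves away from the base model.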
Evaluation
Summary Metrics
| Metric | Base Qwen | EmergentRP | Gain |
|---|---|---|---|
| Perplexity ↓ | 17.8 | 15.4 | -13% |
| Distinct-2 ↑ | 0.42 | 0.61 | +45% |
| RP Coherence (LLM judge 1-5) ↑ | 3.6 | 4.3 | +0.7 |
Interpretation:
- Lower perplexity = smoother, more fluent dialogue.
- Higher Distinct-2 = more diverse, less repetitive phrasing.
- Coherence gain = characters stay "in persona" longer during sessions.
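The Gain column can be re-derived from the two score columns. The snippet below is a minimal sketch of that arithmetic, together with the usual definition of perplexity as the exponential of the mean per-token negative log-likelihood. The helper names are illustrative, and the toy likelihood values are made up for the sanity check; they are not outputs of either model.

```python
import math


def perplexity(token_nlls):
    """Perplexity is exp of the mean per-token negative log-likelihood (in nats)."""
    if not token_nlls:
        raise ValueError("need at least one token")
    return math.exp(sum(token_nlls) / len(token_nlls))


def relative_change(base, new):
    """Signed relative change of `new` against `base`, as a fraction."""
    return (new - base) / base


# Toy sanity check: if every token has probability 0.5, perplexity is 2.
toy_ppl = perplexity([math.log(2.0)] * 4)

# Values copied from the summary table above.
ppl_gain = relative_change(17.8, 15.4)
d2_gain = relative_change(0.42, 0.61)
coherence_gain = 4.3 - 3.6  # the judge score is reported as an absolute difference

print(f"toy perplexity: {toy_ppl:.2f}")
print(f"perplexity: {ppl_gain:+.1%}")
print(f"distinct-2: {d2_gain:+.1%}")
print(f"coherence:  {coherence_gain:+.1f}")
```

Perplexity and Distinct-2 are reported as relative changes, while RP Coherence is a difference in points on the 1-5 judge scale, so the three gains are not directly comparable.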
Evaluation Harness (Reproducible)
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

base_model = "Qwen/Qwen3-4B-Instruct-2507"
test_model = "benhs000/EmergentRP-Qwen4B"

prompts = [
    "You are a medieval tavern keeper meeting a strange traveler for the first time. Greet them in character.",
    "You are an android waking up in a forgotten lab. Describe your first thoughts.",
    "You are a wizard teaching your apprentice about forbidden magic. Explain carefully.",
    "/nothink You are a cyberpunk bartender giving advice to a broken mercenary.",
]

device = "cuda" if torch.cuda.is_available() else "cpu"
# Half precision on GPU; fall back to float32 on CPU.
dtype = torch.float16 if device == "cuda" else torch.float32

def run_eval(model_name):
    """Generate one sampled completion per prompt and return only the new text."""
    torch.manual_seed(0)  # fixed seed so sampled outputs are repeatable
    tok = AutoTokenizer.from_pretrained(model_name)
    mod = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=dtype, device_map="auto")
    results = []
    for p in prompts:
        inputs = tok(p, return_tensors="pt").to(mod.device)
        # temperature only takes effect when sampling is enabled
        out = mod.generate(**inputs, max_new_tokens=200, do_sample=True, temperature=0.8)
        # Slice off the prompt tokens instead of cutting the decoded string
        new_tokens = out[0][inputs["input_ids"].shape[1]:]
        results.append(tok.decode(new_tokens, skip_special_tokens=True).strip())
    # Release the model before the next one is loaded
    del mod
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
    return results

def distinct_n(texts, n=2):
    """Ratio of unique word n-grams to total word n-grams across all texts."""
    tokens = " ".join(texts).split()
    if len(tokens) < n:
        return 0.0
    ngrams = list(zip(*[tokens[i:] for i in range(n)]))
    return len(set(ngrams)) / len(ngrams)

base_outs = run_eval(base_model)
test_outs = run_eval(test_model)
print(f"Base Distinct-2: {distinct_n(base_outs):.3f}")
print(f"EmergentRP Distinct-2: {distinct_n(test_outs):.3f}")
Quickstart Usage
Python (Transformers + LoRA)
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel
import torch
base_model = "Qwen/Qwen3-4B-Instruct-2507"
lora_name = "benhs000/EmergentRP-Qwen4B"
base = AutoModelForCausalLM.from_pretrained(base_model, torch_dtype=torch.bfloat16, device_map="auto")
model = PeftModel.from_pretrained(base, lora_name)
tokenizer = AutoTokenizer.from_pretrained(base_model)
prompt = "<|im_start|>system\nYou are a cunning rogue in a cyberpunk city.<|im_end|>\n<|im_start|>user\n/nothink The player sneaks into the corp tower: 'What's my escape plan?'<|im_end|>\n<|im_start|>assistant\n"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128, temperature=0.7, do_sample=True, top_p=0.9)
response = tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True)
print(response)
Example output:
"Duck through the vents - override the sec cams with the EMP glitch I stashed. Move fast, shadows got eyes."
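The quickstart writes the ChatML prompt by hand. As a rough sketch, the same string can be assembled from a list of role/content messages, which is easier to extend with earlier dialogue turns. `build_chatml_prompt` is a hypothetical helper written for this example; in practice the tokenizer's own `apply_chat_template` method is the usual route and should be preferred when it is available.

```python
def build_chatml_prompt(messages, add_generation_prompt=True):
    """Assemble a ChatML-style prompt string from role/content messages.

    Hypothetical helper for illustration; it mirrors the hand-written
    prompt in the quickstart rather than any library implementation.
    """
    parts = []
    for msg in messages:
        parts.append(f"<|im_start|>{msg['role']}\n{msg['content']}<|im_end|>\n")
    if add_generation_prompt:
        # Leave the assistant turn open so the model continues from here.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


messages = [
    {"role": "system", "content": "You are a cunning rogue in a cyberpunk city."},
    {
        "role": "user",
        "content": "/nothink The player sneaks into the corp tower: 'What's my escape plan?'",
    },
]

prompt = build_chatml_prompt(messages)
print(prompt)
```

Appending further `user` and `assistant` entries to `messages` carries conversation history into the next turn without any change to the generation code.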
GGUF (Edge Inference)
ollama run hf.co/benhs000/EmergentRP-Qwen4B:Q4_K_M "You are a dragon hoarding ancient tomes. Player: 'I offer gold for the spellbook.' /nothink Respond as the dragon."
Output:
"Foolish mortal, gold glints but knowledge burns. Begone - or join my trove as ash."
Ethical & Practical Considerations
- Bias: Synthetic RP data may embed cultural or genre stereotypes.
- Hallucination: The model avoids long reasoning chains but can still fabricate lore, so monitor its output in live games.
- Safety: Not suitable for real-time multiplayer without moderation filters.
- Out-of-scope: No vision or action grounding (a vision-language-action, VLA, expansion is planned).
Vision & Next Steps
- Extend with VLA embeddings for action/vision co-modeling.
- Support memory persistence for long-form narratives.
- Launch a HF Spaces demo for public RP chat testing.
Known Issues to Be Addressed
- The model sometimes states that it cannot role-play, which likely stems from the quantization and the limited amount of fine-tuning.
- With pre-existing context, the model can enter an endless repetition loop; adjusting the training datasets to cover these cases systematically may help.
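Until the repetition issue is addressed in training, a game integration could guard against it at runtime. The following is a minimal sketch of one possible check, not something used in this model's training or evaluation: it flags a reply when any word n-gram repeats too often. The function name and both thresholds are assumptions chosen for illustration and would need tuning on real dialogue.

```python
from collections import Counter


def looks_like_loop(text, n=4, max_repeats=3):
    """Return True if any word n-gram occurs more than `max_repeats` times."""
    tokens = text.split()
    if len(tokens) < n:
        return False
    # Sliding window of n consecutive words, counted by frequency.
    counts = Counter(zip(*[tokens[i:] for i in range(n)]))
    return max(counts.values()) > max_repeats


looping_reply = "Stay back, traveler. " * 6
normal_reply = "The dragon eyes your gold and snorts a thin ribbon of smoke."

print(looks_like_loop(looping_reply))
print(looks_like_loop(normal_reply))
```

A caller could regenerate the reply or fall back to a scripted line when the check fires. Logging the flagged contexts would also produce examples of the failure for a later training run.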
Citation
Schneider, B. (2025). EmergentRP-Qwen4B [Fine-tuned model]. Hugging Face.
https://huggingface.co/benhs000/EmergentRP-Qwen4B
Built by Dr. Ben Schneider - Bridging physical realism and emergent game AI.