Instructions to use benhs000/EmergentRP-Qwen4B with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use benhs000/EmergentRP-Qwen4B with Transformers:

# Load model directly
from transformers import AutoModel
model = AutoModel.from_pretrained("benhs000/EmergentRP-Qwen4B", dtype="auto")

llama-cpp-python

How to use benhs000/EmergentRP-Qwen4B with llama-cpp-python:

# !pip install llama-cpp-python

from llama_cpp import Llama

llm = Llama.from_pretrained(
	repo_id="benhs000/EmergentRP-Qwen4B",
	filename="unsloth.Q4_K_M.gguf",
)

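llm.create_chat_completion(
	# messages must be a list of role/content dicts, not a plain string
	messages = [
		{"role": "system", "content": "You are a medieval tavern keeper."},
		{"role": "user", "content": "Greet the strange traveler who just walked in."},
	]
)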

Local Apps

llama.cpp

How to use benhs000/EmergentRP-Qwen4B with llama.cpp:

Install from brew

brew install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf benhs000/EmergentRP-Qwen4B:Q4_K_M
# Run inference directly in the terminal:
llama-cli -hf benhs000/EmergentRP-Qwen4B:Q4_K_M

Install from WinGet (Windows)

winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf benhs000/EmergentRP-Qwen4B:Q4_K_M
# Run inference directly in the terminal:
llama-cli -hf benhs000/EmergentRP-Qwen4B:Q4_K_M

Use pre-built binary

# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases
# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf benhs000/EmergentRP-Qwen4B:Q4_K_M
# Run inference directly in the terminal:
./llama-cli -hf benhs000/EmergentRP-Qwen4B:Q4_K_M

Build from source code

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf benhs000/EmergentRP-Qwen4B:Q4_K_M
# Run inference directly in the terminal:
./build/bin/llama-cli -hf benhs000/EmergentRP-Qwen4B:Q4_K_M

Use Docker

docker model run hf.co/benhs000/EmergentRP-Qwen4B:Q4_K_M
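
Once `llama-server` is running through any of the options above, it can be queried from Python over its OpenAI-compatible HTTP API. The following is a minimal sketch rather than part of the original instructions: it assumes the server listens on the default `http://localhost:8080/v1`, and `build_chat_request` and `ask_npc` are hypothetical helper names chosen for illustration.

```python
import json
import urllib.request

# Assumption: llama-server is listening on its default port 8080.
BASE_URL = "http://localhost:8080/v1"
MODEL_ID = "benhs000/EmergentRP-Qwen4B:Q4_K_M"


def build_chat_request(persona, player_line, max_tokens=128):
    """Return the JSON body for an OpenAI-style chat completion call."""
    return {
        "model": MODEL_ID,
        "messages": [
            {"role": "system", "content": persona},
            {"role": "user", "content": player_line},
        ],
        "max_tokens": max_tokens,
        "temperature": 0.7,
    }


def ask_npc(persona, player_line):
    """POST the request to the local server and return the reply text."""
    body = json.dumps(build_chat_request(persona, player_line)).encode("utf-8")
    req = urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=60) as resp:
        data = json.load(resp)
    return data["choices"][0]["message"]["content"]
```

With the server up, `ask_npc("You are a medieval tavern keeper.", "Any rooms free tonight?")` should return the NPC's reply as a string. Only the standard library is used, so no extra client package is required.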

Ollama
How to use benhs000/EmergentRP-Qwen4B with Ollama:
```
ollama run hf.co/benhs000/EmergentRP-Qwen4B:Q4_K_M
```

Unsloth Studio

How to use benhs000/EmergentRP-Qwen4B with Unsloth Studio:

Install Unsloth Studio (macOS, Linux, WSL)

curl -fsSL https://unsloth.ai/install.sh | sh
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for benhs000/EmergentRP-Qwen4B to start chatting

Install Unsloth Studio (Windows)

irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for benhs000/EmergentRP-Qwen4B to start chatting

Using HuggingFace Spaces for Unsloth

# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for benhs000/EmergentRP-Qwen4B to start chatting

Pi

How to use benhs000/EmergentRP-Qwen4B with Pi:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf benhs000/EmergentRP-Qwen4B:Q4_K_M

Configure the model in Pi

# Install Pi:
npm install -g @mariozechner/pi-coding-agent
# Add to ~/.pi/agent/models.json:
{
  "providers": {
    "llama-cpp": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        {
          "id": "benhs000/EmergentRP-Qwen4B:Q4_K_M"
        }
      ]
    }
  }
}

Run Pi

# Start Pi in your project directory:
pi

Hermes Agent

How to use benhs000/EmergentRP-Qwen4B with Hermes Agent:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf benhs000/EmergentRP-Qwen4B:Q4_K_M

Configure Hermes

# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup
# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default benhs000/EmergentRP-Qwen4B:Q4_K_M

Run Hermes

hermes

Docker Model Runner
How to use benhs000/EmergentRP-Qwen4B with Docker Model Runner:
```
docker model run hf.co/benhs000/EmergentRP-Qwen4B:Q4_K_M
```

Lemonade

How to use benhs000/EmergentRP-Qwen4B with Lemonade:

Pull the model

# Download Lemonade from https://lemonade-server.ai/
lemonade pull benhs000/EmergentRP-Qwen4B:Q4_K_M

Run and chat with the model

lemonade run user.EmergentRP-Qwen4B-Q4_K_M

List all available models

lemonade list

EmergentRP-Qwen4B: Fine-Tuned for Deeper Game Role-Play Illusions

Developed by: benhs000
License: Apache 2.0
Base Model: Qwen/Qwen3-4B-Instruct-2507
Tech: Unsloth accelerated fine-tuning (2× faster), Hugging Face TRL

🎮 Model Description

EmergentRP-Qwen4B is a 4B-parameter Qwen3 Instruct model fine-tuned for emergent role-play behaviors: dynamic, context-aware dialogues that give NPCs the illusion of depth without requiring heavy computation.

Where many AI chatbots loop through canned responses, EmergentRP simulates "living" NPCs that recall context, adapt tone, and evolve within narrative constraints.
It is tuned especially for game developers who want believable character dialogue without chain-of-thought (CoT) verbosity or GPU-heavy models.

Trained on synthetic and curated RP dialogues, this fine-tune emphasizes immersion, diversity, and internal consistency, making NPCs feel reactive rather than random.

⚙️ Training Details

Aspect	Description
Base Model	Qwen/Qwen3-4B-Instruct-2507 (Apache 2.0)
Method	Unsloth + TRL LoRA fine-tuning
LoRA Config	r=16, alpha=16, 1 epoch, lr=2e-4
Dataset	~10k RP dialogues: branching quests, adaptive NPCs, synthetic "memory" cues
Hardware	Single GPU (T4), 20-minute training
Quantization	GGUF Q4_K_M (~2.1GB) for CPU & M1 use
Eval Summary	~13% perplexity drop on RP benchmarks; more context-aware, less repetitive NPC dialogue (evaluation still in progress)
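
To make the LoRA row concrete, the sketch below works through what `r=16, alpha=16` implies under the standard LoRA formulation, where the low-rank update is scaled by `alpha / r`. The layer width `HIDDEN = 2560` is an illustrative assumption, not a value stated in this card, and the helper names are hypothetical.

```python
def lora_scaling(r, alpha):
    """Scaling factor applied to the low-rank update in standard LoRA."""
    return alpha / r


def lora_added_params(d_in, d_out, r):
    """Extra trainable weights for one adapted matrix: A (r x d_in) plus B (d_out x r)."""
    return r * (d_in + d_out)


R, ALPHA = 16, 16   # values from the table above
HIDDEN = 2560       # assumed square projection size, for illustration only

print(lora_scaling(R, ALPHA))                # 1.0
print(lora_added_params(HIDDEN, HIDDEN, R))  # 81920
```

With `alpha` equal to `r`, the adapter update is applied at unit scale, and each adapted square projection of the assumed size adds about 82k trainable weights.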

🧪 Evaluation

Summary Metrics

Metric	Base Qwen	EmergentRP	Gain
Perplexity ↓	17.8	15.4	-13%
Distinct-2 ↑	0.42	0.61	+45%
RP Coherence (LLM judge 1-5) ↑	3.6	4.3	+0.7
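
The Gain column can be rechecked from the two score columns. The snippet below is a small sketch of that arithmetic; `relative_change` is a hypothetical helper name.

```python
def relative_change(base, new):
    """Percentage change from base to new."""
    return (new - base) / base * 100.0


ppl_gain = relative_change(17.8, 15.4)   # perplexity, lower is better
d2_gain = relative_change(0.42, 0.61)    # Distinct-2, higher is better
coherence_gain = 4.3 - 3.6               # absolute points on the 1-5 scale

print(f"Perplexity: {ppl_gain:+.1f}%")     # Perplexity: -13.5%
print(f"Distinct-2: {d2_gain:+.1f}%")      # Distinct-2: +45.2%
print(f"Coherence:  {coherence_gain:+.1f}")  # Coherence:  +0.7
```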

Interpretation:

Lower perplexity = smoother, more fluent dialogue.
Higher Distinct-2 = more diverse, less repetitive phrasing.
Coherence gain = characters stay "in persona" longer during sessions.
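
For reference, perplexity is the exponential of the mean negative log-likelihood per token. The sketch below shows that definition on a list of per-token natural-log probabilities; it is an illustration of the metric, not the exact script used to produce the numbers above, and `perplexity` is a hypothetical helper name.

```python
import math


def perplexity(token_logprobs):
    """exp(mean negative log-likelihood) over natural-log token probabilities."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(mean_nll)


# A model that gives every token probability 1/4 is as uncertain as a
# uniform choice among 4 options, so its perplexity is about 4.
uniform = [math.log(0.25)] * 5
print(round(perplexity(uniform), 6))
```

In practice the log-probabilities would come from the model's output on held-out RP dialogues.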

Evaluation Harness (Reproducible)

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

base_model = "Qwen/Qwen3-4B-Instruct-2507"
test_model = "benhs000/EmergentRP-Qwen4B"

prompts = [
    "You are a medieval tavern keeper meeting a strange traveler for the first time. Greet them in character.",
    "You are an android waking up in a forgotten lab. Describe your first thoughts.",
    "You are a wizard teaching your apprentice about forbidden magic. Explain carefully.",
    "/nothink You are a cyberpunk bartender giving advice to a broken mercenary.",
]

# Fix the seed so sampled outputs are repeatable on the same hardware and library versions
torch.manual_seed(0)

def run_eval(model_name):
    tok = AutoTokenizer.from_pretrained(model_name)
    mod = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16, device_map="auto")
    results = []
    for p in prompts:
        inputs = tok(p, return_tensors="pt").to(mod.device)
        # do_sample=True is required for temperature to take effect
        out = mod.generate(**inputs, max_new_tokens=200, do_sample=True, temperature=0.8)
        # Decode only the newly generated tokens, not the echoed prompt
        new_tokens = out[0][inputs["input_ids"].shape[1]:]
        results.append(tok.decode(new_tokens, skip_special_tokens=True).strip())
    # Release the model before the next one is loaded
    del mod
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
    return results

def distinct_n(texts, n=2):
    # Share of unique n-grams among all n-grams in the pooled outputs
    tokens = " ".join(texts).split()
    if len(tokens) < n: return 0
    ngrams = list(zip(*[tokens[i:] for i in range(n)]))
    return len(set(ngrams)) / len(ngrams)

base_outs = run_eval(base_model)
test_outs = run_eval(test_model)

# This script reports Distinct-2 only; perplexity and coherence are not computed here
print(f"Base Distinct-2: {distinct_n(base_outs):.3f}")
print(f"EmergentRP Distinct-2: {distinct_n(test_outs):.3f}")
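
To see what `distinct_n` measures without loading a model, the helper can be run on toy strings. The sketch below repeats the function so that it is self-contained; the two input strings are made up for illustration.

```python
def distinct_n(texts, n=2):
    """Share of unique n-grams among all n-grams in the pooled texts."""
    tokens = " ".join(texts).split()
    if len(tokens) < n:
        return 0
    ngrams = list(zip(*[tokens[i:] for i in range(n)]))
    return len(set(ngrams)) / len(ngrams)


toy_outputs = ["the cat sat", "the cat ran"]
# Pooled tokens: the cat sat the cat ran
# Bigrams: (the,cat) (cat,sat) (sat,the) (the,cat) (cat,ran) -> 4 unique of 5
print(distinct_n(toy_outputs))  # 0.8
```

Because the outputs are joined before counting, one bigram per boundary spans two different responses (here `(sat, the)`). With only a handful of prompts this slightly shifts the score, so larger prompt sets give a more stable estimate.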

💬 Quickstart Usage

Python (Transformers + LoRA)

from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel
import torch

base_model = "Qwen/Qwen3-4B-Instruct-2507"
lora_name = "benhs000/EmergentRP-Qwen4B"

base = AutoModelForCausalLM.from_pretrained(base_model, torch_dtype=torch.bfloat16, device_map="auto")
model = PeftModel.from_pretrained(base, lora_name)
tokenizer = AutoTokenizer.from_pretrained(base_model)

prompt = "<|im_start|>system\nYou are a cunning rogue in a cyberpunk city.<|im_end|>\n<|im_start|>user\n/nothink The player sneaks into the corp tower: 'What's my escape plan?'<|im_end|>\n<|im_start|>assistant\n"

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128, temperature=0.7, do_sample=True, top_p=0.9)
response = tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True)
print(response)

Example output:

"Duck through the vents - override the sec cams with the EMP glitch I stashed. Move fast, shadows got eyes."

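The prompt string in the snippet above is written out by hand in ChatML form. A small helper can build the same string from a persona and a player line; this is a sketch, and `build_chatml` is a hypothetical name. In Transformers, `tokenizer.apply_chat_template` is the usual way to get this formatting from a list of messages.

```python
def build_chatml(system, user):
    """Format one system and one user turn as a ChatML prompt that ends at the assistant turn."""
    return (
        f"<|im_start|>system\n{system}<|im_end|>\n"
        f"<|im_start|>user\n{user}<|im_end|>\n"
        "<|im_start|>assistant\n"
    )


prompt = build_chatml(
    "You are a cunning rogue in a cyberpunk city.",
    "/nothink The player sneaks into the corp tower: 'What's my escape plan?'",
)
print(prompt)
```

Keeping persona and player input as separate arguments makes it easier to swap NPCs at runtime without touching the template.
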
GGUF (Edge Inference)

ollama run hf.co/benhs000/EmergentRP-Qwen4B:Q4_K_M "You are a dragon hoarding ancient tomes. Player: 'I offer gold for the spellbook.' /nothink Respond as the dragon."

Output:

"Foolish mortal, gold glints but knowledge burns. Begone - or join my trove as ash."

⚖️ Ethical & Practical Considerations

Bias: Synthetic RP data may embed cultural or genre stereotypes.
Hallucination: Avoids long-chain logic but can fabricate lore - monitor in live games.
Safety: Not suitable for real-time multiplayer without moderation filters.
Out-of-scope: No vision or action grounding (VLA expansion planned).

🌍 Vision & Next Steps

Extend with VLA embeddings for action/vision co-modeling.
Support memory persistence for long-form narratives.
Launch a HF Spaces demo for public RP chat testing.

🚧 Known Issues

The model sometimes states that it cannot role-play, which likely stems from the quantization and the limited amount of fine-tuning.
With pre-existing context, the model can enter an endless repetition loop. Adjusting the training datasets to capture these cases systematically may help.
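
Until the repetition issue is addressed in training, a game client could guard against it at runtime. The sketch below is one possible check and not something shipped with the model: it flags a response when any n-gram of whitespace tokens occurs more than a chosen number of times. The function name and thresholds are assumptions to be tuned per game.

```python
def has_repetition_loop(text, n=4, max_repeats=3):
    """Return True if any n-gram of whitespace tokens occurs more than max_repeats times."""
    tokens = text.split()
    counts = {}
    for i in range(len(tokens) - n + 1):
        gram = tuple(tokens[i:i + n])
        counts[gram] = counts.get(gram, 0) + 1
        if counts[gram] > max_repeats:
            return True
    return False


looping = "I will not yield. " * 6
normal = "Duck through the vents and move fast before the cameras reset."
print(has_repetition_loop(looping))  # True
print(has_repetition_loop(normal))   # False
```

A flagged response can be regenerated or truncated. On the generation side, the `repetition_penalty` and `no_repeat_ngram_size` arguments of Transformers' `generate` are also worth trying.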

📚 Citation

Schneider, B. (2025). EmergentRP-Qwen4B [Fine-tuned model]. Hugging Face.
https://huggingface.co/benhs000/EmergentRP-Qwen4B

Built by Dr. Ben Schneider - Bridging physical realism and emergent game AI.
