Upload folder using huggingface_hub

Files changed (6)

README.md +69 -11
assets/benchmark_hero.svg +30 -0
assets/benchmark_table.svg +40 -0
assets/supplementary_stage_progress.svg +26 -0
assets/supplementary_stage_table.svg +14 -0
benchmarks.json +27 -0

README.md CHANGED

@@ -1,32 +1,90 @@
 ---
 base_model:
 - Qwen/Qwen3.5-9B
 pipeline_tag: text-generation
 tags:
 - hermes-agent
 - merged
 - standalone
-- browser
 - terminal
 - reasoning
 license: apache-2.0
 ---
 # Carnice-v1-9b Hermes-Agent Stage 2 Merged
-Standalone merged checkpoint for Hermes Agent, produced by merging the Stage 2 Carnice adapter into `Qwen/Qwen3.5-9B`.
-This is the merged model form of `kai-os/qwen35-hermes-stage2-adapter-v1`, not a separate full-parameter training run.
-## Source
-- Adapter: `kai-os/qwen35-hermes-stage2-adapter-v1`
 - Base model: `Qwen/Qwen3.5-9B`
-- Training data included:
-  - Stage A repair: `bespokelabs/Bespoke-Stratos-17k`, `AI-MO/NuminaMath-CoT`
-  - Stage B refresh: `kai-os/carnice-glm5-hermes-traces`, `open-thoughts/OpenThoughts-Agent-v1-SFT`
-## Notes
-- This repo exists so the model can be loaded directly without a separate PEFT adapter step.
-- The benchmark visuals and detailed benchmark card live on the adapter repo for now.

 ---
 base_model:
 - Qwen/Qwen3.5-9B
+library_name: transformers
 pipeline_tag: text-generation
 tags:
 - hermes-agent
 - merged
 - standalone
+- qwen3.5
 - terminal
+- tool-use
 - reasoning
 license: apache-2.0
 ---
+![banner](./banner.png)
 # Carnice-v1-9b Hermes-Agent Stage 2 Merged
+Standalone merged release of the Carnice Stage 2 Hermes model on top of `Qwen/Qwen3.5-9B`.
+This repo is the direct-load merged checkpoint form of [kai-os/qwen35-hermes-stage2-adapter-v1](https://huggingface.co/kai-os/qwen35-hermes-stage2-adapter-v1). It loads as its own model without a separate PEFT adapter step.
+Note: this checkpoint was produced by merging the Stage 2 adapter into the base weights; it is not a separate full-parameter training run.
+## What Changed
+- Stage A reasoning repair on `bespokelabs/Bespoke-Stratos-17k` and `AI-MO/NuminaMath-CoT`
+- Stage B Hermes refresh on [kai-os/carnice-glm5-hermes-traces](https://huggingface.co/datasets/kai-os/carnice-glm5-hermes-traces) and `open-thoughts/OpenThoughts-Agent-v1-SFT`
 - Base model: `Qwen/Qwen3.5-9B`
+- Adapter source: [kai-os/qwen35-hermes-stage2-adapter-v1](https://huggingface.co/kai-os/qwen35-hermes-stage2-adapter-v1)
+## Benchmark Snapshot
+![hero](./assets/benchmark_hero.svg)
+## Benchmark Table
+| Metric | Base Qwen3.5-9B | Carnice Stage 2 Merged | Delta | Notes |
+|---|---:|---:|---:|---|
+| YC-Bench one-shot composite | 0.551 | 0.551 | +0.000 | Official Hermes benchmark on GH200 ARM |
+| YC-Bench one-shot survival | 1.000 | 1.000 | +0.000 | Both survived all 3 runs |
+| YC-Bench one-shot eval time | 78.6s | 23.1s | -55.5s (70.6% lower) | 3.40x wall-clock speedup |
+![table](./assets/benchmark_table.svg)
+## Key Takeaways
+- Official Hermes benchmark parity: the merged model matched stock `Qwen/Qwen3.5-9B` on YC-Bench one-shot composite and survival.
+- Runtime win: the merged model completed the same official benchmark about `3.40x` faster on this GH200 ARM machine.
+- On this benchmark, harness-native training preserved the measured agent behavior (identical composite and survival scores) while reducing wall-clock time by 70.6% on the local OpenAI-compatible stack.
+## Benchmark Notes
+- The primary benchmark shown here is the official `YC-Bench` path that runs cleanly inside Hermes on GH200 ARM.
+- `TBLite` and `TerminalBench2` were not used for this release card on this machine because their task-image path is effectively x86-only here.
+- A longer multi-turn `YC-Bench fast_test` run is being executed separately. This card currently reflects the finished official one-shot comparison.
+## Usage
+```python
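+from transformers import AutoModelForCausalLM, AutoTokenizer
+import torch
+model_id = "kai-os/carnice-v1-9b-hermes-agent-stage2-merged"
+# The merged checkpoint loads directly; no PEFT adapter step is needed.
+tokenizer = AutoTokenizer.from_pretrained(model_id)
+model = AutoModelForCausalLM.from_pretrained(
+    model_id,
+    torch_dtype=torch.bfloat16,  # load weights in bf16
+    device_map="auto",  # requires accelerate; places layers on available devices
+)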
+```
+## Supplementary Diagnostics
+These are secondary diagnostics from the Stage A -> Stage B training progression and earlier internal release assets. They are useful context, but they are not the main benchmark story for this merged release.
+![supplementary hero](./assets/supplementary_stage_progress.svg)
+![supplementary table](./assets/supplementary_stage_table.svg)
+## Files
+- `benchmarks.json`: raw benchmark summary for this release page
+- `banner.png`: release banner
+- `assets/benchmark_hero.svg`: primary benchmark visual
+- `assets/benchmark_table.svg`: primary metric table visual
+- `assets/supplementary_stage_progress.svg`: older stage-progression visual
+- `assets/supplementary_stage_table.svg`: older supplementary metric visual

assets/benchmark_hero.svg ADDED

assets/benchmark_table.svg ADDED

assets/supplementary_stage_progress.svg ADDED

assets/supplementary_stage_table.svg ADDED

benchmarks.json ADDED

@@ -0,0 +1,27 @@
+{
+  "yc_bench_oneshot": {
+    "base": {
+      "avg_composite_score": 0.5509199735446368,
+      "survival_rate": 1.0,
+      "evaluation_time_seconds": 78.61562085151672
+    },
+    "merged": {
+      "avg_composite_score": 0.5509199735446368,
+      "survival_rate": 1.0,
+      "evaluation_time_seconds": 23.148362159729004
+    },
+    "delta": {
+      "avg_composite_score": 0.0,
+      "survival_rate": 0.0,
+      "evaluation_time_seconds": -55.46725869178772,
+      "speedup_x": 3.396,
+      "time_reduction_pct": 70.6
+    }
+  },
+  "training_progression": {
+    "stage_a_eval_loss": 0.4059831202030182,
+    "stage_b_eval_loss": 0.3007583022117615,
+    "stage_a_eval_perplexity": 1.5007772194294333,
+    "stage_b_eval_perplexity": 1.3508827966928918
+  }
+}
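
The derived fields in `benchmarks.json` can be cross-checked against the raw measurements. The sketch below is not part of the commit; it assumes that `speedup_x` is base time divided by merged time, that `time_reduction_pct` is the relative drop in evaluation time, and that the reported perplexities are `exp(eval_loss)`. The helper names `derive_delta` and `perplexity` are hypothetical, chosen only for illustration.

```python
import math

# Raw measurements copied from the "base" and "merged" blocks of benchmarks.json.
base = {
    "avg_composite_score": 0.5509199735446368,
    "survival_rate": 1.0,
    "evaluation_time_seconds": 78.61562085151672,
}
merged = {
    "avg_composite_score": 0.5509199735446368,
    "survival_rate": 1.0,
    "evaluation_time_seconds": 23.148362159729004,
}


def derive_delta(base, merged):
    """Recompute the "delta" block (hypothetical helper, assumed formulas)."""
    # Plain differences: merged minus base for every shared metric.
    delta = {key: merged[key] - base[key] for key in base}
    base_t = base["evaluation_time_seconds"]
    merged_t = merged["evaluation_time_seconds"]
    # Assumption: speedup is the ratio of wall-clock times, rounded to 3 places.
    delta["speedup_x"] = round(base_t / merged_t, 3)
    # Assumption: reduction is the relative time saved, as a percentage.
    delta["time_reduction_pct"] = round(100.0 * (base_t - merged_t) / base_t, 1)
    return delta


def perplexity(eval_loss):
    """Assumption: perplexity is the exponential of the mean eval loss."""
    return math.exp(eval_loss)


delta = derive_delta(base, merged)
print(delta)
print(perplexity(0.4059831202030182), perplexity(0.3007583022117615))
```

Under these assumptions the recomputed values agree with the `delta` and `training_progression` entries in the file, which suggests the README table figures (`3.40x`, `70.6% lower`) are rounded from the same raw timings.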