kai-os committed · Commit 95e60b4 · verified · 1 Parent(s): 1a81bcf

Upload folder using huggingface_hub

README.md CHANGED
@@ -1,32 +1,90 @@
  ---
  base_model:
  - Qwen/Qwen3.5-9B
  pipeline_tag: text-generation
  tags:
  - hermes-agent
  - merged
  - standalone
- - browser
  - terminal
  - reasoning
  license: apache-2.0
  ---

  # Carnice-v1-9b Hermes-Agent Stage 2 Merged

- Standalone merged checkpoint for Hermes Agent, produced by merging the Stage 2 Carnice adapter into `Qwen/Qwen3.5-9B`.

- This is the merged model form of `kai-os/qwen35-hermes-stage2-adapter-v1`, not a separate full-parameter training run.

- ## Source

- - Adapter: `kai-os/qwen35-hermes-stage2-adapter-v1`
  - Base model: `Qwen/Qwen3.5-9B`
- - Training data included:
- - Stage A repair: `bespokelabs/Bespoke-Stratos-17k`, `AI-MO/NuminaMath-CoT`
- - Stage B refresh: `kai-os/carnice-glm5-hermes-traces`, `open-thoughts/OpenThoughts-Agent-v1-SFT`

- ## Notes

- - This repo exists so the model can be loaded directly without a separate PEFT adapter step.
- - The benchmark visuals and detailed benchmark card live on the adapter repo for now.
  ---
  base_model:
  - Qwen/Qwen3.5-9B
+ library_name: transformers
  pipeline_tag: text-generation
  tags:
  - hermes-agent
  - merged
  - standalone
+ - qwen3.5
  - terminal
+ - tool-use
  - reasoning
  license: apache-2.0
  ---

+ ![banner](./banner.png)
+
  # Carnice-v1-9b Hermes-Agent Stage 2 Merged

+ Standalone merged release of the Carnice Stage 2 Hermes model on top of `Qwen/Qwen3.5-9B`.
+
+ This repo is the direct-load merged checkpoint form of [kai-os/qwen35-hermes-stage2-adapter-v1](https://huggingface.co/kai-os/qwen35-hermes-stage2-adapter-v1). It loads as its own model without a separate PEFT adapter step.

+ Important detail: this is a merged standalone checkpoint, not a separate full-parameter training run from scratch.
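The merged-vs-adapter distinction is just linear algebra: a LoRA-style low-rank update can be folded into the base weight once, after which no adapter is needed at load time. A toy sketch with random matrices (illustration only, using assumed toy shapes, not the actual merge script):

```python
import numpy as np

# Toy illustration of what "merged" means for a LoRA-style adapter:
# the low-rank update B @ A is folded into the base weight once, so the
# released checkpoint loads without a separate PEFT adapter step.
rng = np.random.default_rng(0)
d, r = 8, 2                      # hidden size and adapter rank (toy values)
W = rng.standard_normal((d, d))  # base weight
A = rng.standard_normal((r, d))  # adapter down-projection
B = rng.standard_normal((d, r))  # adapter up-projection
x = rng.standard_normal(d)

merged_out = (W + B @ A) @ x        # merged weight: one matmul at inference
adapter_out = W @ x + B @ (A @ x)   # base plus adapter applied at runtime

assert np.allclose(merged_out, adapter_out)
print("merged and runtime-adapter outputs match")
```

The two paths are mathematically identical; merging simply pays the addition once at export time instead of on every forward pass.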

+ ## What Changed

+ - Stage A reasoning repair on `bespokelabs/Bespoke-Stratos-17k` and `AI-MO/NuminaMath-CoT`
+ - Stage B Hermes refresh on [kai-os/carnice-glm5-hermes-traces](https://huggingface.co/datasets/kai-os/carnice-glm5-hermes-traces) and `open-thoughts/OpenThoughts-Agent-v1-SFT`
  - Base model: `Qwen/Qwen3.5-9B`
+ - Adapter source: [kai-os/qwen35-hermes-stage2-adapter-v1](https://huggingface.co/kai-os/qwen35-hermes-stage2-adapter-v1)
+
+ ## Benchmark Snapshot
+
+ ![hero](./assets/benchmark_hero.svg)
+
+ ## Benchmark Table
+
+ | Metric | Base Qwen3.5-9B | Carnice Stage 2 Merged | Delta | Notes |
+ |---|---:|---:|---:|---|
+ | YC-Bench one-shot composite | 0.551 | 0.551 | +0.000 | Official Hermes benchmark on GH200 ARM |
+ | YC-Bench one-shot survival | 1.000 | 1.000 | +0.000 | Both survived all 3 runs |
+ | YC-Bench one-shot eval time | 78.6s | 23.1s | 70.6% lower | 3.40x faster wall-clock |
+
+ ![table](./assets/benchmark_table.svg)
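The timing delta in the table follows directly from the raw eval times; a quick arithmetic check, with the values copied from `benchmarks.json` in this repo:

```python
# Recompute the Delta column's speed figures from the raw eval times
# reported in benchmarks.json.
base_s = 78.61562085151672    # base Qwen3.5-9B eval time
merged_s = 23.148362159729004 # Carnice Stage 2 merged eval time

speedup = base_s / merged_s
reduction_pct = (1 - merged_s / base_s) * 100

print(f"{speedup:.2f}x faster")   # prints "3.40x faster"
print(f"{reduction_pct:.1f}% lower wall-clock")  # prints "70.6% lower wall-clock"
```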
+
+ ## Key Takeaways
+
+ - Official Hermes benchmark parity: the merged model matched stock `Qwen/Qwen3.5-9B` on YC-Bench one-shot composite and survival.
+ - Runtime win: the merged model completed the same official benchmark about `3.40x` faster on this GH200 ARM machine.
+ - Harness-native training preserved agent behavior while materially reducing benchmark wall-clock time on the local OpenAI-compatible stack.
+
+ ## Benchmark Notes
+
+ - The primary benchmark shown here is the official `YC-Bench` path that runs cleanly inside Hermes on GH200 ARM.
+ - `TBLite` and `TerminalBench2` were not used for this release card on this machine because their task-image path is effectively x86-only here.
+ - A longer multi-turn `YC-Bench fast_test` run is being executed separately. This card currently reflects the finished official one-shot comparison.
+
+ ## Usage
+
+ ```python
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+ import torch
+
+ model_id = "kai-os/carnice-v1-9b-hermes-agent-stage2-merged"
+ tokenizer = AutoTokenizer.from_pretrained(model_id)
+ model = AutoModelForCausalLM.from_pretrained(
+     model_id,
+     torch_dtype=torch.bfloat16,
+     device_map="auto",
+ )
+ ```
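The benchmark notes above mention a local OpenAI-compatible stack; once the merged model is served that way, it can be queried like any chat model. A minimal sketch of building such a request — the base URL, port, and system prompt are placeholder assumptions, not part of this repo:

```python
import json
from urllib.request import Request

# Hypothetical local OpenAI-compatible endpoint; adjust host/port to
# wherever your own server exposes the model.
BASE_URL = "http://localhost:8000/v1"

payload = {
    "model": "kai-os/carnice-v1-9b-hermes-agent-stage2-merged",
    "messages": [
        {"role": "system", "content": "You are a helpful terminal agent."},
        {"role": "user", "content": "List the files in the current directory."},
    ],
    "temperature": 0.7,
}

req = Request(
    f"{BASE_URL}/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
# urllib.request.urlopen(req) would send the request; not executed here.
print(req.full_url)
```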
+
+ ## Supplementary Diagnostics
+
+ These are secondary diagnostics from the Stage A -> Stage B training progression and earlier internal release assets. They are useful context, but they are not the main benchmark story for this merged release.
+
+ ![supplementary hero](./assets/supplementary_stage_progress.svg)
+
+ ![supplementary table](./assets/supplementary_stage_table.svg)

+ ## Files

+ - `benchmarks.json`: raw benchmark summary for this release page
+ - `banner.png`: release banner
+ - `assets/benchmark_hero.svg`: primary benchmark visual
+ - `assets/benchmark_table.svg`: primary metric table visual
+ - `assets/supplementary_stage_progress.svg`: older stage-progression visual
+ - `assets/supplementary_stage_table.svg`: older supplementary metric visual
assets/benchmark_hero.svg ADDED
assets/benchmark_table.svg ADDED
assets/supplementary_stage_progress.svg ADDED
assets/supplementary_stage_table.svg ADDED
benchmarks.json ADDED
@@ -0,0 +1,27 @@
+ {
+   "yc_bench_oneshot": {
+     "base": {
+       "avg_composite_score": 0.5509199735446368,
+       "survival_rate": 1.0,
+       "evaluation_time_seconds": 78.61562085151672
+     },
+     "merged": {
+       "avg_composite_score": 0.5509199735446368,
+       "survival_rate": 1.0,
+       "evaluation_time_seconds": 23.148362159729004
+     },
+     "delta": {
+       "avg_composite_score": 0.0,
+       "survival_rate": 0.0,
+       "evaluation_time_seconds": -55.46725869178772,
+       "speedup_x": 3.396,
+       "time_reduction_pct": 70.6
+     }
+   },
+   "training_progression": {
+     "stage_a_eval_loss": 0.4059831202030182,
+     "stage_b_eval_loss": 0.3007583022117615,
+     "stage_a_eval_perplexity": 1.5007772194294333,
+     "stage_b_eval_perplexity": 1.3508827966928918
+   }
+ }
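The derived fields in this JSON can be cross-checked against the raw values: each perplexity is `exp(eval_loss)`, and `speedup_x` is the ratio of the two eval times. A small consistency check over the same numbers:

```python
import json
import math

# Sanity-check benchmarks.json: derived fields should follow from raw ones.
raw = json.loads("""
{
  "yc_bench_oneshot": {
    "base": {"evaluation_time_seconds": 78.61562085151672},
    "merged": {"evaluation_time_seconds": 23.148362159729004},
    "delta": {"speedup_x": 3.396, "time_reduction_pct": 70.6}
  },
  "training_progression": {
    "stage_a_eval_loss": 0.4059831202030182,
    "stage_b_eval_loss": 0.3007583022117615,
    "stage_a_eval_perplexity": 1.5007772194294333,
    "stage_b_eval_perplexity": 1.3508827966928918
  }
}
""")

tp = raw["training_progression"]
# Perplexity is exp(eval loss) for both stages.
assert math.isclose(math.exp(tp["stage_a_eval_loss"]), tp["stage_a_eval_perplexity"], rel_tol=1e-6)
assert math.isclose(math.exp(tp["stage_b_eval_loss"]), tp["stage_b_eval_perplexity"], rel_tol=1e-6)

yc = raw["yc_bench_oneshot"]
speedup = yc["base"]["evaluation_time_seconds"] / yc["merged"]["evaluation_time_seconds"]
assert round(speedup, 3) == yc["delta"]["speedup_x"]
print("derived fields are consistent with the raw values")
```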