Cognica-BP-v1.0-1.3B-base

Paper: Product of Experts as Scalable Local Learning: Modular Construction at 1.3B Parameters (Jeong, 2026)

A 1.384 B-parameter causal language model pretrained from scratch with standard end-to-end backpropagation (no Product-of-Experts, or PoE, local learning). This is the control arm of the d24 r20 Chinchilla 3-way experiment that the paper reports, released so downstream researchers and practitioners can reproduce the exact bits-per-byte (BPB) gap between standard backprop and PoE-based local learning at this scale.

This repo is a companion to cognica/Cognica-PoE-v1.0-1.3B-base, which is the PoE-trained version of the same architecture on the same data. They differ only in loss construction.

TL;DR

  • 1.384 B params, d24, 1536-dim, 12 heads, 4 clustered stages × 6 layers (same architecture as the PoE release; only the training loss differs).
  • 27.7 B tokens from ClimbMix (Chinchilla-20 ratio, matching the PoE run exactly).
  • Final val bpb: 0.6768, the best of the three arms in the 3-way experiment (PoE α=0.0 finished at 0.7209, a 6.52 % BPB gap).
  • No PoE inference modes. Because there are no per-stage losses, the intermediate layers do not produce valid predictors. Stage prefix pruning, WAND adaptive depth, speculative drafting from stage 0, and post-hoc specialist attach all require the PoE-trained base — they do not apply here.
  • Released as a reference baseline for scaling-law / local-learning research. Not intended as a production instruction-tuned model.

When to use which base

| Use case | Pick |
| --- | --- |
| You want the best BPB-per-FLOP at this scale and don't care about staged inference | This repo (BP) |
| You want early-exit, WAND, speculative drafting, or the ability to attach SFT specialists | Cognica-PoE-v1.0-1.3B-base |
| You want to reproduce the paper's BP-vs-PoE comparison | Load both and run side by side |

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model = AutoModelForCausalLM.from_pretrained(
    "cognica/Cognica-BP-v1.0-1.3B-base",
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
).eval().cuda()

tok = AutoTokenizer.from_pretrained(
    "cognica/Cognica-BP-v1.0-1.3B-base",
    trust_remote_code=True,
)

# IMPORTANT: base models in this family expect <|bos|> as the first token.
# Without it, generation quality degrades sharply (see the companion PoE base's
# README for the historical anomaly / verification that led to this rule).
prompt = "<|bos|>The capital of France is"
ids = tok.encode(prompt, return_tensors="pt").cuda()

out = model.generate(ids, max_new_tokens=40, do_sample=False,
                     repetition_penalty=1.15, pad_token_id=tok.eos_token_id)
print(tok.decode(out[0]))

Greedy decoding with the BP baseline produces coherent continuations on scientific and geographic probes (Paris / planets of the solar system / photosynthesis). The BP baseline model does not add any specialist head; its lm_head is a single projection from the final-layer hidden state.
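
Since the `<|bos|>` prefix is easy to forget when prompts come from user input, a small guard can enforce it before tokenization. This is a minimal sketch: the helper name `with_bos` is hypothetical and not part of the released code; only the `<|bos|>` string comes from the snippet above.

```python
BOS = "<|bos|>"  # BOS marker string, as used in the usage snippet above


def with_bos(prompt: str, bos: str = BOS) -> str:
    """Return the prompt with the BOS marker as its first token.

    Hypothetical helper: prepends the marker only when it is missing,
    so calling it twice does not duplicate the prefix.
    """
    return prompt if prompt.startswith(bos) else bos + prompt


if __name__ == "__main__":
    print(with_bos("The capital of France is"))
```

The result can be passed to `tok.encode(...)` exactly as in the usage example.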

Training

| Setting | Value |
| --- | --- |
| Architecture | d24, hidden 1536, 12 heads (MHA), SSSL window |
| Activation | relu² (squared ReLU) |
| Vocab | 32768 (byte-level BPE, same as the PoE base) |
| Optimizer | MuonAdamW hybrid (Muon for matrices, AdamW for embeddings / scalars) |
| Matrix LR | 0.02 |
| Embedding LR | 0.30 |
| Unembedding LR | 0.008 |
| Batch | 1 048 576 tokens / step, seq len 2048 |
| Steps | 26 430 (27.7 B tokens, Chinchilla-20) |
| Warmup | 40 steps |
| Warmdown | 65 % of run |
| Final LR frac | 0.05 |
| Dataset | ClimbMix 400 B mirror |
| Hardware | 4 × A100 80 GB (Google Cloud, us-central1) |
| Wall time | ~67 h |
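
The token budget and schedule settings above can be sanity-checked with a few lines of arithmetic. The sketch below uses only the numbers from the table. The shape of the schedule (linear warmup, constant plateau, linear warmdown to the final LR fraction) is an assumption, since the card states only the warmup length, the warmdown fraction, and the final LR fraction. All function and constant names are hypothetical.

```python
PARAMS = 1.384e9            # parameter count stated in this card
TOKENS_PER_STEP = 1_048_576
TOTAL_STEPS = 26_430
WARMUP_STEPS = 40
WARMDOWN_FRAC = 0.65        # fraction of the run spent in warmdown
FINAL_LR_FRAC = 0.05        # LR multiplier reached at the last step


def total_tokens(steps: int = TOTAL_STEPS, per_step: int = TOKENS_PER_STEP) -> int:
    """Tokens seen over the whole run."""
    return steps * per_step


def lr_multiplier(step: int, total: int = TOTAL_STEPS) -> float:
    """Multiplier applied to each base LR at a given step.

    ASSUMED shape (not confirmed by the card): linear warmup,
    constant plateau, then linear decay to FINAL_LR_FRAC.
    """
    warmdown_steps = int(WARMDOWN_FRAC * total)
    if step < WARMUP_STEPS:
        return (step + 1) / WARMUP_STEPS
    if step <= total - warmdown_steps:
        return 1.0
    # remaining goes from 1.0 at the start of warmdown to 0.0 at the end
    remaining = (total - step) / warmdown_steps
    return remaining + (1.0 - remaining) * FINAL_LR_FRAC


if __name__ == "__main__":
    tokens = total_tokens()
    print(f"tokens: {tokens:,}")
    print(f"tokens per parameter: {tokens / PARAMS:.2f}")
    print(f"LR multiplier at final step: {lr_multiplier(TOTAL_STEPS):.2f}")
```

Under these numbers the run sees roughly 20 tokens per parameter, which is what the "Chinchilla-20" label refers to.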

Validation bpb trajectory (evaluated every 1000 steps after warmup; a subset of checkpoints is listed)

Final: 0.6768 @ step 26 430.

| Step | Val bpb |
| --- | --- |
| 5000 | 0.8823 |
| 10000 | 0.7795 |
| 15000 | 0.7358 |
| 20000 | 0.7025 |
| 25000 | 0.6827 |
| 26430 | 0.6768 |
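
For readers unfamiliar with the metric, bits per byte normalizes the summed cross-entropy by the number of UTF-8 bytes rather than the number of tokens, which makes runs with different tokenizers comparable. The following is a minimal sketch of the usual definition, assuming per-token losses in nats and per-token byte lengths are already available. The function name is hypothetical, and the evaluation script used for the numbers above may differ in details such as how special tokens are handled.

```python
import math
from typing import Sequence


def bits_per_byte(token_nll_nats: Sequence[float],
                  token_num_bytes: Sequence[int]) -> float:
    """Convert per-token negative log-likelihoods (nats) to bits per byte.

    bpb = sum(nll_nats) / (ln(2) * sum(bytes))
    """
    if len(token_nll_nats) != len(token_num_bytes):
        raise ValueError("need one byte length per token loss")
    total_bytes = sum(token_num_bytes)
    if total_bytes == 0:
        raise ValueError("no bytes to normalize by")
    return sum(token_nll_nats) / (math.log(2) * total_bytes)


if __name__ == "__main__":
    # Toy example: four tokens, each costing exactly 1 bit and
    # spanning 2 bytes, so the text costs 0.5 bits per byte.
    losses = [math.log(2)] * 4
    lengths = [2, 2, 2, 2]
    print(bits_per_byte(losses, lengths))
```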

Comparison to PoE α=0.0 (same architecture, same data)

| Run | Loss | Final val bpb | Δ vs baseline |
| --- | --- | --- | --- |
| This repo (BP, α=∅) | Standard final-layer CE | 0.6768 | (baseline) |
| Cognica-PoE-v1.0-1.3B-base | PoE flat α=0.0 (per-stage detached CE through shared head) | 0.7209 | +0.0441 (+6.52 %) |

The 6.52 % gap is the cost of the PoE local-learning loss at this exact scale and data budget: the architecture, optimizer, LR schedule, data, and seed are all matched, so the loss construction is the only difference. The PoE base is nonetheless the more capable artifact in practice because it also supports 1.82× WAND inference speedups, 1.87× speculative decoding, and post-hoc specialist attach (see the PoE base repo). The BP baseline is released so the comparison can be reproduced exactly.
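
The reported gap follows directly from the two final validation numbers. A short worked check (the function name is illustrative only):

```python
def relative_gap(baseline: float, other: float) -> tuple[float, float]:
    """Return (absolute gap, gap relative to the baseline)."""
    abs_gap = other - baseline
    return abs_gap, abs_gap / baseline


if __name__ == "__main__":
    bp_bpb, poe_bpb = 0.6768, 0.7209  # final val bpb values from this card
    abs_gap, rel_gap = relative_gap(bp_bpb, poe_bpb)
    print(f"{abs_gap:+.4f} ({rel_gap:+.2%})")  # prints "+0.0441 (+6.52%)"
```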

A third arm (PoE α=0.5, sqrt(n) scaling) was launched in the same experiment but stopped at step 14 000 (53 % of the run) because it was tracking slightly worse than α=0.0 at the same step. On this evidence, the sqrt(n) hypothesis (P0 in the paper) is rejected. Checkpoints for that run are not released.

Files

  • model.safetensors — 2.6 GB bf16 weights (175 tensors).
  • config.json — HF config; poe_mode="none" (vs "flat" on the PoE repo).
  • configuration_cognica_poe.py, modeling_cognica_poe.py, tokenization_cognica_poe.py — identical to the PoE base repo (same architecture class); the forward pass simply doesn't compute the PoE aggregate when poe_mode="none".
  • tokenizer.pkl, tokenizer_config.json, special_tokens_map.json, token_bytes.pt — tokenizer (identical to base).
  • generation_config.json — default generation params.

Limitations

  1. Not instruction-tuned. Pure causal-LM continuation only. To get chat / SFT behavior, attach a specialist stage (which requires the PoE base, not this one) or do your own full-model SFT.
  2. No early-exit capability. Stage 0 / stage 1 / stage 2 outputs from this model are not valid predictors because they were never supervised — they only appear in the frozen residual chain. Do not use head_mode=base / stage pruning patterns from the PoE repo on this one.
  3. Predominantly English. ClimbMix is mostly English text, so non-English generation may be poor.
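
Because this repo and the PoE base share the same model class, code written for one can be pointed at the other by accident. A defensive check on the `poe_mode` config field can catch that before any staged-inference path runs. This is a sketch under assumptions: the helper names are hypothetical, the config is treated as a plain dict loaded from `config.json`, and only the values `"none"` and `"flat"` are taken from this card.

```python
import json
from pathlib import Path


def supports_staged_inference(config: dict) -> bool:
    """True only when the checkpoint was trained with per-stage PoE losses.

    A missing "poe_mode" key is treated as "none" (conservative assumption).
    """
    return config.get("poe_mode", "none") != "none"


def require_staged_inference(config_path: str) -> None:
    """Raise if the checkpoint at config_path cannot do early exit."""
    config = json.loads(Path(config_path).read_text())
    if not supports_staged_inference(config):
        raise ValueError(
            "poe_mode is 'none': intermediate stages were never supervised, "
            "so stage pruning / early exit would give invalid predictions"
        )


if __name__ == "__main__":
    print(supports_staged_inference({"poe_mode": "none"}))  # BP baseline
    print(supports_staged_inference({"poe_mode": "flat"}))  # PoE base
```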

Citation

@article{jeong2026poe,
  title  = {Product of Experts as Scalable Local Learning: Modular Construction at 1.3B Parameters},
  author = {Jeong, Jaepil},
  year   = {2026},
  institution = {Cognica, Inc.},
  doi    = {10.5281/zenodo.19547653},
  url    = {https://doi.org/10.5281/zenodo.19547653}
}

@misc{cognica-baseline-2026,
  title  = {Cognica-BP-v1.0-1.3B-base: Standard backprop reference for the PoE 3-way experiment},
  author = {{Cognica, Inc.}},
  year   = {2026},
  howpublished = {\url{https://huggingface.co/cognica/Cognica-BP-v1.0-1.3B-base}}
}

License

Apache 2.0 — see LICENSE and NOTICE. Same terms as the PoE base. Training data (ClimbMix) carries its own license (see ClimbMix dataset card).
