sakasegawa/gpt2-jp-small
GPT-2 small (the nominal 117M-parameter architecture; 110M parameters in practice with a 32k vocabulary), trained from scratch on Japanese text. Companion model for the blog post ゼロから作る日本語 LLM ("Building a Japanese LLM from Scratch", Study LLM Ep00).
Source code: nyosegawa/gpt2-jp-from-scratch
Training details
- Architecture: GPT-2 small (12 layers, 12 heads, 768 dim, 1024 context)
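- Parameters: 109.53M non-embedding / 110.32M total (the 0.79M difference matches the 1024 × 768 position-embedding table, so the token embeddings are counted in both figures)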
- Tokenizer: SentencePiece unigram, vocab 32,000 (trained on Wikipedia JA + Aozora Bunko)
- Training data: hotchpotch/fineweb-2-edu-japanese (sample_10BT), the first 2.3B tokens (Chinchilla-optimal budget for a 117M model)
- Optimizer: AdamW (β=(0.9, 0.95), wd=0.1), cosine LR schedule 6e-4 → 6e-5 with 500-iter warmup, bf16 autocast
- Compute: 1× A100 80 GB, 4.17 h wall clock, 156K tokens/sec
- Final val loss: 3.0386 at iter 4,000 (perplexity ≈ 20.9)
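The figures in the list above can be cross-checked with a few lines of arithmetic. The following is a minimal sketch, not taken from the training code: it assumes bias-free linear and LayerNorm layers and a token embedding shared with the output head (the nanoGPT convention), which is one set of assumptions that reproduces the reported parameter counts.

```python
import math

# Hyperparameters from the list above.
n_layer, n_embd, block_size, vocab_size = 12, 768, 1024, 32_000

# Assumption: no bias terms, and input/output embeddings are tied.
attn = 4 * n_embd * n_embd        # fused QKV (3*d*d) + output projection (d*d)
mlp = 8 * n_embd * n_embd         # d -> 4d and 4d -> d
layer_norms = 2 * n_embd          # two LayerNorm weight vectors per block
per_block = attn + mlp + layer_norms

token_emb = vocab_size * n_embd   # shared with the output head
pos_emb = block_size * n_embd
final_ln = n_embd

total = n_layer * per_block + final_ln + token_emb + pos_emb
without_pos_emb = total - pos_emb

# Token count implied by the reported throughput and wall-clock time.
tokens_seen = 156_000 * 4.17 * 3600

# Perplexity is exp(cross-entropy loss in nats).
perplexity = math.exp(3.0386)

print(f"total params:      {total / 1e6:.2f}M")            # 110.32M
print(f"minus pos. embed.: {without_pos_emb / 1e6:.2f}M")  # 109.53M
print(f"tokens processed:  {tokens_seen / 1e9:.2f}B")      # 2.34B
print(f"perplexity:        {perplexity:.1f}")              # 20.9
```

Under these assumptions the throughput and wall-clock time also agree with the 2.3B-token budget.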
Training log and sample generations at each 500-iter checkpoint are included
as training_log.jsonl and samples.jsonl in this repository.
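To relate a logged iteration to the learning rate in effect at that point, the schedule from the training details (500-iter warmup, cosine decay from 6e-4 to 6e-5) can be sketched as below. This is an illustration, not the training script: the exact warmup formula and the iteration at which the decay ends (`DECAY_ITERS`, assumed here to be the final reported iteration) are assumptions.

```python
import math

MAX_LR, MIN_LR = 6e-4, 6e-5   # from the training details
WARMUP_ITERS = 500            # from the training details
DECAY_ITERS = 4000            # assumption: decay ends at the final reported iteration


def lr_at(it: int) -> float:
    """Learning rate at iteration `it`: linear warmup, then cosine decay to MIN_LR."""
    if it < WARMUP_ITERS:
        # Linear ramp from 0 up to MAX_LR.
        return MAX_LR * it / WARMUP_ITERS
    if it >= DECAY_ITERS:
        return MIN_LR
    progress = (it - WARMUP_ITERS) / (DECAY_ITERS - WARMUP_ITERS)
    coeff = 0.5 * (1.0 + math.cos(math.pi * progress))  # falls from 1 to 0
    return MIN_LR + coeff * (MAX_LR - MIN_LR)


# Learning rate at each 500-iter checkpoint.
for it in range(0, DECAY_ITERS + 1, 500):
    print(f"iter {it:>4}: lr = {lr_at(it):.2e}")
```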
Usage
import torch
from transformers import GPT2LMHeadModel, AutoTokenizer

# Load the model and tokenizer from the Hugging Face Hub.
model = GPT2LMHeadModel.from_pretrained("sakasegawa/gpt2-jp-small")
tok = AutoTokenizer.from_pretrained("sakasegawa/gpt2-jp-small")

# Prompt: "The capital of Japan is"
enc = tok("日本の首都は", return_tensors="pt")

# Sample up to 60 new tokens with top-k sampling.
y = model.generate(enc.input_ids, attention_mask=enc.attention_mask,
                   max_new_tokens=60, do_sample=True, top_k=40,
                   temperature=0.8, pad_token_id=tok.eos_token_id)
print(tok.decode(y[0], skip_special_tokens=True))
The tokenizer is a SentencePiece unigram model wrapped as a
LlamaTokenizerFast so that AutoTokenizer.from_pretrained works directly.
Special tokens reuse the IDs that already exist in the SentencePiece vocab
(<unk>=0, <|endoftext|>=1, <|pad|>=2). The script that converts the
nanoGPT checkpoint to this HF-compatible layout and pushes it to the Hub
is at scripts/04_push_to_hub.py.
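One step that a nanoGPT-to-HF conversion typically needs is a weight transpose: nanoGPT stores the attention and MLP projections as `nn.Linear`, while the Hugging Face GPT-2 implementation uses `Conv1D`, whose weight matrix has its axes swapped. The sketch below illustrates only that step and is not the contents of scripts/04_push_to_hub.py; the function name `to_hf_state_dict` and the toy data are hypothetical.

```python
from typing import Any, Callable, Dict

# Weight families that are nn.Linear in nanoGPT but Conv1D in HF GPT-2,
# so their 2-D weight matrices have swapped axes.
TRANSPOSED_SUFFIXES = (
    "attn.c_attn.weight",
    "attn.c_proj.weight",
    "mlp.c_fc.weight",
    "mlp.c_proj.weight",
)
COMPILE_PREFIX = "_orig_mod."  # prepended to key names by torch.compile


def to_hf_state_dict(state_dict: Dict[str, Any],
                     transpose: Callable[[Any], Any]) -> Dict[str, Any]:
    """Return a copy of a nanoGPT-style state dict using the HF weight layout."""
    converted = {}
    for key, value in state_dict.items():
        if key.startswith(COMPILE_PREFIX):
            key = key[len(COMPILE_PREFIX):]
        if key.endswith(TRANSPOSED_SUFFIXES):
            value = transpose(value)
        converted[key] = value
    return converted


# Toy demonstration with nested lists standing in for tensors.
def transpose_lists(matrix):
    return [list(row) for row in zip(*matrix)]


toy = {
    "_orig_mod.transformer.h.0.attn.c_attn.weight": [[1, 2, 3], [4, 5, 6]],
    "transformer.wte.weight": [[7, 8], [9, 10]],
}
converted = to_hf_state_dict(toy, transpose_lists)
print(converted["transformer.h.0.attn.c_attn.weight"])  # [[1, 4], [2, 5], [3, 6]]
```

With real PyTorch tensors, `transpose` could be `lambda w: w.t()`. A complete conversion has to handle details this sketch ignores, for example supplying zero bias tensors if the checkpoint was trained without biases, since the HF GPT-2 layers expect them.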
Expected behavior
This is a 2.3B-token, Chinchilla-optimal run: a faithful reproduction of GPT-2 scale on Japanese, not a modern LLM. The model produces grammatical Japanese and has some world knowledge (it correctly completes 「日本の首都は」, "The capital of Japan is", with 東京, Tokyo), but it is prone to hallucination and to repetition loops beyond ~50 tokens. See the blog post for a fuller analysis.
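Because outputs tend to loop past roughly 50 tokens, a caller may want to detect where a generation starts repeating and truncate it there. The helper below is a minimal sketch of that idea on a list of token IDs; `first_repeated_ngram` and the default n-gram length of 4 are illustrative choices, not part of this repository.

```python
from typing import Sequence


def first_repeated_ngram(ids: Sequence[int], n: int = 4) -> int:
    """Return the start index of the first n-gram that already occurred, or -1."""
    seen = set()
    for i in range(len(ids) - n + 1):
        gram = tuple(ids[i:i + n])
        if gram in seen:
            return i
        seen.add(gram)
    return -1


# Toy example: the 4-gram (1, 2, 3, 4) recurs starting at index 5.
ids = [1, 2, 3, 4, 5, 1, 2, 3, 4, 9]
cut = first_repeated_ngram(ids, n=4)
print(cut)        # 5
print(ids[:cut])  # [1, 2, 3, 4, 5]
```

Alternatively, `generate` in transformers accepts `no_repeat_ngram_size` and `repetition_penalty`, which discourage loops during decoding rather than after it. Suitable values for this model are not given here and would need to be tuned.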
License
MIT License. Model implementation is adapted from nanoGPT (MIT).