sakasegawa/gpt2-jp-small

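GPT-2 small trained from scratch on Japanese text. The architecture is the nominal 117M-parameter GPT-2 small configuration; with a 32k vocabulary the actual size is about 110M parameters. Companion model for the blog post ゼロから作る日本語 LLM ("Building a Japanese LLM from Scratch", Study LLM Ep00).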

Source code: nyosegawa/gpt2-jp-from-scratch

Training details

Architecture: GPT-2 small (12 layers, 12 heads, 768 dim, 1024 context)
Parameters: 109.53M non-embedding / 110.32M total
Tokenizer: SentencePiece unigram, vocab 32,000 (trained on Wikipedia JA + Aozora Bunko)
Training data: hotchpotch/fineweb-2-edu-japanese sample_10BT, the first 2.3B tokens (roughly 20 tokens per parameter, the Chinchilla-optimal budget for a 117M-parameter model)
Optimizer: AdamW (β1=0.9, β2=0.95, weight decay 0.1), cosine LR schedule decaying from 6e-4 to 6e-5 with a 500-iteration warmup, bf16 autocast
Compute: 1× A100 80 GB, 4.17 h wall clock, 156K tokens/sec
Final val loss: 3.0386 at iter 4,000 (perplexity ≈ 20.9)
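
The parameter counts and the perplexity listed above can be sanity-checked with a few lines of arithmetic. The sketch below is not taken from the training code: the function name `gpt2_param_count` is hypothetical, and it assumes nanoGPT-style conventions (no bias terms, the LM head sharing weights with the token embedding, and "non-embedding" meaning that only the position embeddings are excluded). Under those assumptions the result matches the figures in the card.

```python
import math


def gpt2_param_count(n_layer=12, n_embd=768, vocab_size=32_000, block_size=1024):
    """Count parameters of a GPT-2 style model.

    Assumptions (not stated in the model card): no bias terms anywhere,
    and the LM head is tied to the token embedding matrix.
    """
    wte = vocab_size * n_embd                        # token embeddings (shared with LM head)
    wpe = block_size * n_embd                        # learned position embeddings
    attn = n_embd * 3 * n_embd + n_embd * n_embd     # QKV projection + output projection
    mlp = n_embd * 4 * n_embd + 4 * n_embd * n_embd  # up and down projections
    norms = 2 * n_embd                               # two LayerNorm weight vectors per block
    total = wte + wpe + n_layer * (attn + mlp + norms) + n_embd  # + final LayerNorm
    return total, total - wpe


total, without_pos = gpt2_param_count()
print(f"total: {total / 1e6:.2f}M")
print(f"excluding position embeddings: {without_pos / 1e6:.2f}M")

# Perplexity is the exponential of the cross-entropy loss (in nats).
print(f"perplexity: {math.exp(3.0386):.1f}")
```

If the assumptions hold, the first two lines reproduce the 110.32M and 109.53M figures, and the last line reproduces the perplexity of about 20.9.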

Training log and sample generations at each 500-iter checkpoint are included as training_log.jsonl and samples.jsonl in this repository.
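
For reference, the learning-rate schedule listed under Training details (warmup to 6e-4, then cosine decay to 6e-5) can be sketched as below. This is an illustration, not the author's training code: the function name `lr_at` is hypothetical, the warmup is assumed to be linear, and the decay horizon `DECAY_ITERS = 4_000` is an assumption based on the iteration at which the final validation loss is reported.

```python
import math

MAX_LR = 6e-4         # peak learning rate (from the model card)
MIN_LR = 6e-5         # final learning rate (from the model card)
WARMUP_ITERS = 500    # warmup length (from the model card)
DECAY_ITERS = 4_000   # ASSUMPTION: the cosine decay ends at iter 4,000


def lr_at(it):
    """Learning rate at iteration `it`: linear warmup, then cosine decay."""
    if it < WARMUP_ITERS:
        # ASSUMPTION: warmup is linear from 0 to MAX_LR.
        return MAX_LR * it / WARMUP_ITERS
    if it >= DECAY_ITERS:
        return MIN_LR
    # Fraction of the decay phase completed, in [0, 1).
    progress = (it - WARMUP_ITERS) / (DECAY_ITERS - WARMUP_ITERS)
    coeff = 0.5 * (1.0 + math.cos(math.pi * progress))  # goes from 1 down to 0
    return MIN_LR + coeff * (MAX_LR - MIN_LR)


for it in (0, 250, 500, 2250, 4000):
    print(it, f"{lr_at(it):.2e}")
```

Comparing these values against the learning rates recorded during training, if the log includes them, would show whether the assumed decay horizon is right.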

Usage

```python
import torch
from transformers import AutoTokenizer, GPT2LMHeadModel

# Load the model weights and the SentencePiece-based tokenizer from the Hub.
model = GPT2LMHeadModel.from_pretrained("sakasegawa/gpt2-jp-small")
tok = AutoTokenizer.from_pretrained("sakasegawa/gpt2-jp-small")

# Prompt: "The capital of Japan is"
enc = tok("日本の首都は", return_tensors="pt")

# Sample up to 60 new tokens with top-k sampling; no gradients are needed.
with torch.no_grad():
    y = model.generate(enc.input_ids, attention_mask=enc.attention_mask,
                       max_new_tokens=60, do_sample=True, top_k=40,
                       temperature=0.8, pad_token_id=tok.eos_token_id)
print(tok.decode(y[0], skip_special_tokens=True))
```

The tokenizer is a SentencePiece unigram model wrapped as a LlamaTokenizerFast so that AutoTokenizer.from_pretrained works directly. The special tokens reuse IDs that already exist in the SentencePiece vocabulary (<unk>=0, <|endoftext|>=1, <|pad|>=2). The script that converts the nanoGPT checkpoint to this HF-compatible layout and pushes it to the Hub is scripts/04_push_to_hub.py.
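
The special-token mapping described above can be verified after loading the tokenizer. The helper below is a minimal sketch, not part of the repository: the name `mismatched_special_ids` is hypothetical, and it relies only on the `convert_tokens_to_ids` method that Hugging Face tokenizers provide.

```python
# Expected IDs, as stated in the model card.
EXPECTED_SPECIAL_IDS = {"<unk>": 0, "<|endoftext|>": 1, "<|pad|>": 2}


def mismatched_special_ids(tokenizer, expected=EXPECTED_SPECIAL_IDS):
    """Return {token: (expected_id, actual_id)} for each special token whose ID differs."""
    mismatches = {}
    for token, expected_id in expected.items():
        actual_id = tokenizer.convert_tokens_to_ids(token)
        if actual_id != expected_id:
            mismatches[token] = (expected_id, actual_id)
    return mismatches
```

Calling `mismatched_special_ids(tok)` on the tokenizer loaded in the Usage snippet should return an empty dict if the wrapper preserved the SentencePiece IDs.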

Expected behavior

This is a 2.3B-token, Chinchilla-optimal run: a faithful reproduction of GPT-2 scale on Japanese, not a modern LLM. The model produces grammatical Japanese and has some world knowledge (it correctly completes 「日本の首都は」, "The capital of Japan is", with 東京, "Tokyo"), but it is prone to hallucination and to repetition loops beyond roughly 50 tokens. See the blog post for a fuller analysis.
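
One rough way to quantify the repetition loops mentioned above is to measure how many n-grams in a generated token sequence occur more than once. The sketch below is an illustration only: the function name `repeated_ngram_fraction` and the default `n=4` are arbitrary choices, not something used in the blog post or the training code.

```python
from collections import Counter


def repeated_ngram_fraction(token_ids, n=4):
    """Fraction of n-grams in `token_ids` that belong to an n-gram seen more than once.

    Returns 0.0 for sequences shorter than n. Values near 1.0 suggest the
    output has collapsed into a loop.
    """
    ngrams = [tuple(token_ids[i:i + n]) for i in range(len(token_ids) - n + 1)]
    if not ngrams:
        return 0.0
    counts = Counter(ngrams)
    repeated = sum(c for c in counts.values() if c > 1)
    return repeated / len(ngrams)
```

It can be applied to `y[0].tolist()` from the Usage snippet. If looping is a problem in practice, the `no_repeat_ngram_size` and `repetition_penalty` arguments of `generate` are standard options to try, though the card does not say whether they were evaluated with this model.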

License

MIT License. Model implementation is adapted from nanoGPT (MIT).
