Libraries

How to use ku-nlp/gpt2-medium-japanese-char with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="ku-nlp/gpt2-medium-japanese-char")

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("ku-nlp/gpt2-medium-japanese-char")
model = AutoModelForCausalLM.from_pretrained("ku-nlp/gpt2-medium-japanese-char")

Local Apps

vLLM

How to use ku-nlp/gpt2-medium-japanese-char with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "ku-nlp/gpt2-medium-japanese-char"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "ku-nlp/gpt2-medium-japanese-char",
		"prompt": "<s>昨日私は京都で",
		"max_tokens": 512,
		"temperature": 0.5
	}'

Use Docker

docker model run hf.co/ku-nlp/gpt2-medium-japanese-char

SGLang

How to use ku-nlp/gpt2-medium-japanese-char with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "ku-nlp/gpt2-medium-japanese-char" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "ku-nlp/gpt2-medium-japanese-char",
		"prompt": "<s>昨日私は京都で",
		"max_tokens": 512,
		"temperature": 0.5
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "ku-nlp/gpt2-medium-japanese-char" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "ku-nlp/gpt2-medium-japanese-char",
		"prompt": "<s>昨日私は京都で",
		"max_tokens": 512,
		"temperature": 0.5
	}'

Model Card for Japanese character-level GPT-2 Medium

Model description

This is a Japanese character-level GPT-2 Medium (310M parameters) language model pre-trained on Japanese Wikipedia, the Japanese portion of CC-100, and the Japanese portion of OSCAR.

How to use

You can use this model directly with a pipeline for text generation.

>>> from transformers import pipeline, set_seed
>>> generator = pipeline('text-generation', model='ku-nlp/gpt2-medium-japanese-char')
>>> set_seed(5)
>>> generator("<s>昨日私は京都で", max_length=30, do_sample=True, num_return_sequences=5)

[{'generated_text': '<s>昨日私は京都で仕事だったのです。そのときに訪れた京都の街の'},
 {'generated_text': '<s>昨日私は京都で開かれた、「みんなで絵本の読み聞かせ会」に参'},
 {'generated_text': '<s>昨日私は京都で行われましたコンペティションに参加してきまし'},
 {'generated_text': '<s>昨日私は京都では雪が解けるの日経平均株価が下がるのみで今は'},
 {'generated_text': '<s>昨日私は京都でこみっくトレジャー２を開催して見ましたが、そ'}]

You can also use this model to get the features of a given text.
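The card gives no code for feature extraction, so the following is only a minimal sketch. It assumes PyTorch and the standard Transformers `AutoModel` and `AutoTokenizer` classes; the helper name `extract_features` is hypothetical and not part of the model's repository. It returns the final-layer hidden states, one vector per token.

```python
def extract_features(text, model_name="ku-nlp/gpt2-medium-japanese-char"):
    """Return final-layer hidden states for `text` (hypothetical helper)."""
    # Imported lazily so that defining the helper does not require the libraries.
    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)
    model.eval()  # disable dropout for deterministic features

    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        # Pass only the token IDs and attention mask to the model.
        outputs = model(
            input_ids=inputs["input_ids"],
            attention_mask=inputs["attention_mask"],
        )
    # Shape: (batch size 1, number of tokens, hidden size)
    return outputs.last_hidden_state
```

Calling `extract_features("昨日私は京都で")` downloads the weights on first use. How to pool the per-token vectors into a single text vector (for example, averaging them) is left open here because the card does not specify it.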

Vocabulary

A character-level vocabulary of about 6K entries is used. More precisely, the tokenizer uses byte-level byte-pair encoding (BPE), so rare characters may be split into bytes. The BPE tokenizer was trained on a small subset of the training data. Because that data was converted into a one-character-per-line format, merge operations never cross character boundaries.
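To make the one-character-per-line idea concrete, here is a small illustration. It is not the authors' preprocessing script, and the helper names `to_char_lines` and `utf8_byte_counts` are hypothetical. It shows the line format and how many UTF-8 bytes each character occupies, which is the unit a byte-level BPE starts from.

```python
def to_char_lines(text):
    """Put each character of `text` on its own line (hypothetical helper)."""
    return "\n".join(text)


def utf8_byte_counts(text):
    """Pair each character with the number of UTF-8 bytes it occupies."""
    return [(ch, len(ch.encode("utf-8"))) for ch in text]


sample = "京都a"
char_lines = to_char_lines(sample)
byte_counts = utf8_byte_counts(sample)

print(char_lines)
print(byte_counts)
```

With one character per line, merges learned by the BPE trainer can only combine bytes that belong to the same character, so a learned token never spans two characters.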

Note that the tokenizer maps U+0020 (space) to [UNK] because preprocessing removed that character from the training data. Use U+3000 (Ideographic Space) instead.
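One way to follow this advice is to normalize prompts before tokenizing them. The sketch below uses only the Python standard library; the helper name `normalize_spaces` is hypothetical.

```python
IDEOGRAPHIC_SPACE = "\u3000"


def normalize_spaces(text):
    """Replace every U+0020 space with U+3000 (hypothetical helper)."""
    return text.replace("\u0020", IDEOGRAPHIC_SPACE)


prompt = normalize_spaces("<s>昨日 私は 京都で")
print(prompt)
```

The normalized string can then be passed to the tokenizer or pipeline in place of the raw input.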

Training data

We used the following corpora for pre-training:

Japanese Wikipedia (as of 20221020, 3.2GB, 27M sentences, 1.3M documents)
Japanese portion of CC-100 (85GB, 619M sentences, 66M documents)
Japanese portion of OSCAR (54GB, 326M sentences, 25M documents)

Note that we filtered out documents annotated with "header", "footer", or "noisy" tags in OSCAR. Also note that Japanese Wikipedia was duplicated 10 times to make the total size of the corpus comparable to that of CC-100 and OSCAR. As a result, the total size of the training data is 171GB.
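The 171GB total can be checked from the figures listed above. The short calculation below uses only those numbers; the variable names are illustrative.

```python
# Corpus sizes in GB, as listed above.
corpora_gb = {
    "Japanese Wikipedia": 3.2,
    "CC-100 (Japanese)": 85.0,
    "OSCAR (Japanese)": 54.0,
}
# Wikipedia was duplicated 10 times; the other corpora were used once.
copies = {"Japanese Wikipedia": 10}

effective_gb = {
    name: size * copies.get(name, 1) for name, size in corpora_gb.items()
}
total_gb = sum(effective_gb.values())

print(effective_gb)
print(round(total_gb))
```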

Training procedure

Training took about 3 months (with two interruptions) on a single NVIDIA A100 80GB GPU.

The following hyperparameters were used during pre-training:

learning_rate: 2e-4
per_device_train_batch_size: 14
gradient_accumulation_steps: 42
optimizer: AdamW with betas=(0.9, 0.999) and epsilon=1e-06
weight_decay: 0.01
lr_scheduler_type: linear
max_grad_norm: 1.0
max_steps: 500,000 (but terminated at 186,000 steps ~= 2.0 epochs)
warmup_steps: 10,000
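Two quantities follow from these settings: the effective batch size per optimizer step and the learning rate at a given step. The sketch below assumes that "linear" means linear warmup followed by linear decay to zero at max_steps, which is the usual behaviour of the Transformers linear scheduler; the card itself does not spell this out. The function name `linear_lr` is hypothetical.

```python
LEARNING_RATE = 2e-4
PER_DEVICE_BATCH = 14
GRAD_ACCUM = 42
NUM_GPUS = 1  # training used a single GPU
MAX_STEPS = 500_000
WARMUP_STEPS = 10_000

# Sequences consumed per optimizer step.
effective_batch = PER_DEVICE_BATCH * GRAD_ACCUM * NUM_GPUS


def linear_lr(step):
    """Learning rate at `step` under the assumed warmup-then-decay schedule."""
    if step < WARMUP_STEPS:
        return LEARNING_RATE * step / WARMUP_STEPS
    remaining = max(0, MAX_STEPS - step)
    return LEARNING_RATE * remaining / (MAX_STEPS - WARMUP_STEPS)


print(effective_batch)
print(linear_lr(186_000))
```

Under this assumption the learning rate had not yet decayed to zero when training stopped at 186,000 steps, because the schedule was laid out for 500,000 steps.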

The evaluation loss was 1.411 and the evaluation accuracy was 0.6697. The evaluation set consists of 5,000 randomly sampled documents from each of the training corpora.
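For readers who prefer perplexity, the loss can be converted as shown below. This assumes the reported loss is the mean cross-entropy per token in nats, which is how causal language model losses are usually reported in Transformers; the card does not state the unit.

```python
import math

eval_loss = 1.411  # reported evaluation loss

# Perplexity is the exponential of the mean cross-entropy (in nats).
perplexity = math.exp(eval_loss)
# The same loss expressed in bits per token.
bits_per_token = eval_loss / math.log(2)

print(f"perplexity: {perplexity:.2f}")
print(f"bits per token: {bits_per_token:.2f}")
```

Because most tokens in this vocabulary are single characters, the result is close to a per-character perplexity and is not directly comparable with subword-level models.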
