🤖 caca-1B-untrained

Modern Transformer Architecture • 1 Billion Parameters • Untrained


~1,000,000,000 parameters • 88 layers • 4,096-token context • 32,000-token vocabulary


⚠️ IMPORTANT: This Model Is Untrained

This model has not been through any training. Its weights are still at their random initialization, so any output it generates will be random and meaningless.

| ✅ Can do | ❌ Cannot do yet |
|---|---|
| Load the model architecture | Generate meaningful text |
| Test the forward pass | Answer questions |
| Measure memory & speed | Reasoning & understanding |
| Start training | Production deployment |
| Fine-tuning experiments | Real-world applications |

📋 Description

caca-1B-untrained is part of the Caca project, an open-source LLM architecture that combines several state-of-the-art techniques. The model is designed with a focus on computational efficiency, scalability, and high performance for Indonesian and English.

"Caca is an open-source Indonesian LLM experiment built from scratch by one person. It isn't competing with anyone; I just want to explore what can be done with a limited budget, unlimited passion, and a collaborative mindset." - Lyon, Creator


📊 Model Specifications

| Parameter | Value |
|---|---|
| Total Parameters | ~1,000,000,000 |
| Hidden Size | 1,024 |
| Intermediate Size | 2,688 |
| Num Layers | 88 |
| Attention Heads | 8 |
| KV Heads (GQA) | 1 |
| Head Dimension | 128 |
| Max Context Length | 4,096 tokens |
| Vocab Size | 32,000 |
| RoPE Theta | 10,000 |
| Model Size (FP32) | ~4 GB |
| Model Size (FP16/BF16) | ~2 GB |
| Model Size (INT8) | ~1 GB |
| Model Size (INT4) | ~0.5 GB |

🚀 Architecture Features

🎯 Attention

  • ⚡ Flash Attention 2: IO-aware algorithm, up to ~3x faster than standard attention
  • 🔑 Grouped Query Attention (GQA): 8 query heads sharing 1 KV head
    • Saves 87.5% of KV cache memory vs. Multi-Head Attention
    • With a single KV head this is the Multi-Query Attention (MQA) limit of GQA, so inference speed matches MQA
  • ✨ QK Normalization: RMSNorm on queries & keys for training stability
  • 🔄 RoPE: Rotary Position Embeddings (θ=10,000)
  • 🎯 xFormers Support: memory-efficient attention fallback
  • ⚙️ PyTorch SDPA: native scaled dot product attention
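
To make the RoPE entry above concrete, here is a minimal pure-Python sketch using the card's θ = 10,000. The function name `rope_rotate` and the adjacent-pair layout are assumptions for illustration; the repo's own implementation (and its pairing convention) may differ.

```python
import math

def rope_rotate(vec, position, theta=10_000.0):
    """Rotate consecutive (even, odd) pairs of `vec` by position-dependent angles."""
    dim = len(vec)
    assert dim % 2 == 0, "RoPE needs an even head dimension"
    out = []
    for i in range(0, dim, 2):
        # The pair starting at index i rotates with frequency theta^(-i/dim)
        angle = position * theta ** (-i / dim)
        c, s = math.cos(angle), math.sin(angle)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

q = [0.5, -1.0, 0.25, 2.0]
k = [1.0, 0.5, -0.75, 0.1]

# The attention score depends only on the relative offset (3 in both cases)
score_a = dot(rope_rotate(q, 5), rope_rotate(k, 2))
score_b = dot(rope_rotate(q, 13), rope_rotate(k, 10))
print(abs(score_a - score_b) < 1e-9)
```

Because each pair is only rotated, vector norms are preserved, and the query-key score is a function of the distance between the two positions rather than their absolute values.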

🏗️ Architecture

  • 📏 RMSNorm: ~50% faster than LayerNorm, no mean subtraction
  • 🔥 SwiGLU Activation: SiLU(gate projection) × up projection, followed by a down projection
  • 💧 Residual Dropout: regularization on residual connections
  • 🛡️ NaN/Inf Recovery: automatic detection of and recovery from numerical instability
  • 📊 Gradient Monitoring: per-layer gradient norm tracking & clipping
  • 🔄 KV Cache: dynamic cache for efficient autoregressive generation
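
The RMSNorm and SwiGLU entries above can be sketched in a few lines of pure Python. This is an illustrative toy, not the repo's code: the helper names, the bias-free linear layers, and the tiny weight matrices are all assumptions.

```python
import math

def rms_norm(x, weight, eps=1e-6):
    """Scale x by the reciprocal of its root-mean-square; no mean subtraction."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [w * v / rms for v, w in zip(x, weight)]

def silu(v):
    return v / (1.0 + math.exp(-v))

def matvec(matrix, x):
    return [sum(m * v for m, v in zip(row, x)) for row in matrix]

def swiglu_mlp(x, w_gate, w_up, w_down):
    """down( SiLU(gate(x)) * up(x) ), assuming bias-free linear layers."""
    gate = [silu(g) for g in matvec(w_gate, x)]
    up = matvec(w_up, x)
    hidden = [g * u for g, u in zip(gate, up)]
    return matvec(w_down, hidden)

x = [1.0, -2.0, 3.0, 0.5]
normed = rms_norm(x, weight=[1.0] * 4)

# Toy weights: hidden size 4, intermediate size 2 (the real model uses 1024 and 2688)
w_gate = [[0.1, 0.2, -0.1, 0.0], [0.3, -0.2, 0.1, 0.4]]
w_up = [[0.5, 0.0, 0.2, -0.3], [-0.1, 0.1, 0.0, 0.2]]
w_down = [[0.2, -0.4], [0.1, 0.3], [0.0, 0.5], [-0.2, 0.1]]
out = swiglu_mlp(normed, w_gate, w_up, w_down)
print(len(out))
```

After normalization with unit weights, the mean square of the vector is approximately 1, which is the property RMSNorm relies on in place of LayerNorm's mean-and-variance normalization.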

🎓 Training Features

  • 💾 Gradient Checkpointing: saves memory at the cost of extra compute
  • 🎯 Mixed Precision: supports FP16, BF16, FP32
  • 📉 Label Smoothing: configurable (default: 0.0)
  • 🔀 Token Dropout: optional token-level regularization
  • 📈 Metrics Tracking: real-time loss, perplexity, gradient norms
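
Since the weights are untrained, a useful reference point for the tracked loss and perplexity is what a uniform predictor would score. The sketch below assumes the model initially predicts roughly uniformly over the 32,000-token vocabulary; a real random initialization will usually land somewhat above this.

```python
import math

VOCAB_SIZE = 32_000  # from the model card

def perplexity(mean_cross_entropy):
    """Perplexity is the exponential of the mean per-token cross-entropy (in nats)."""
    return math.exp(mean_cross_entropy)

# Assumption: an untrained model predicts roughly uniformly over the vocabulary
uniform_loss = math.log(VOCAB_SIZE)
print(f"expected initial loss ~ {uniform_loss:.2f}")
print(f"expected initial perplexity ~ {perplexity(uniform_loss):,.0f}")
```

A training run whose loss does not drop below about ln(32,000) ≈ 10.37 has not yet learned anything beyond guessing.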

🔧 Advanced (Optional, Off by Default)

  • 🧠 Mixture of Experts (MoE): sparse expert routing
  • 🔀 Mixture of Depths (MoD): dynamic compute allocation
  • 🔗 Cross-Attention: encoder-decoder fusion
  • 👁️ Vision Encoder: ViT-based multimodal support
  • 📊 Layer Scale: training stability for deep networks
  • 🎲 Stochastic Depth: random layer dropping
  • 🔁 LoRA: low-rank adaptation via PEFT
  • 📦 Quantization: 4/8-bit via bitsandbytes
  • 🔢 μP (MuP): Maximal Update Parametrization

💾 Memory Requirements

Inference

| Precision | Model Size | KV Cache (4K ctx) | Total |
|---|---|---|---|
| FP32 | ~4.0 GB | ~0.2 GB | ~4.2 GB |
| FP16 / BF16 | ~2.0 GB | ~0.1 GB | ~2.1 GB |
| INT8 | ~1.0 GB | ~0.1 GB | ~1.1 GB |
| INT4 (NF4) | ~0.5 GB | ~0.1 GB | ~0.6 GB |

Training

| Configuration | Memory |
|---|---|
| FP32 + AdamW | ~16 GB |
| Mixed Precision (BF16) | ~8 GB |
| + Gradient Checkpointing | ~5 GB |
| + LoRA (rank=16) | ~3 GB |
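
The model-size column and the FP32 + AdamW row follow from simple bytes-per-parameter arithmetic. The sketch below is a back-of-the-envelope estimate only: it assumes decimal gigabytes and ignores activations, buffers, and framework overhead, so real usage will be higher.

```python
N_PARAMS = 1_000_000_000  # approximate parameter count from the card
GB = 1e9                  # decimal gigabytes (assumed)

BYTES_PER_PARAM = {"FP32": 4, "FP16/BF16": 2, "INT8": 1, "INT4": 0.5}

def weight_memory_gb(n_params, bytes_per_param):
    return n_params * bytes_per_param / GB

def fp32_adamw_memory_gb(n_params):
    """Weights (4 B) + gradients (4 B) + AdamW first and second moments (8 B)."""
    return n_params * (4 + 4 + 8) / GB

for name, nbytes in BYTES_PER_PARAM.items():
    print(f"{name:>10}: ~{weight_memory_gb(N_PARAMS, nbytes):.1f} GB")
print(f"FP32 + AdamW: ~{fp32_adamw_memory_gb(N_PARAMS):.0f} GB (activations excluded)")
```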

📦 Installation

# Core (required); the quotes stop the shell from treating >= as a redirect
pip install "torch>=2.0.0" "transformers>=4.35.0" accelerate safetensors

# Optional: maximum performance
pip install flash-attn --no-build-isolation  # Flash Attention 2
pip install xformers                          # xFormers attention
pip install bitsandbytes                      # 4/8-bit quantization
pip install peft                              # LoRA fine-tuning
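
To see which of the optional packages are present in the current environment, a small check like the following can help. It is a convenience sketch, not part of the repo; the helper name `installed_optional_packages` is hypothetical, and it only tests whether each module can be found, not whether it works on your GPU.

```python
import importlib.util

# pip name -> import name for the optional packages listed above
OPTIONAL_PACKAGES = {
    "flash-attn": "flash_attn",
    "xformers": "xformers",
    "bitsandbytes": "bitsandbytes",
    "peft": "peft",
}

def installed_optional_packages():
    """Return {pip name: True/False} without importing the packages."""
    return {
        pip_name: importlib.util.find_spec(module_name) is not None
        for pip_name, module_name in OPTIONAL_PACKAGES.items()
    }

for name, available in installed_optional_packages().items():
    print(f"{name:<14} {'installed' if available else 'missing'}")
```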

💻 Usage

Basic Loading

from transformers import AutoConfig, AutoModelForCausalLM
import torch

# Load config
config = AutoConfig.from_pretrained(
    "Lyon28/caca-1B-untrained",
    trust_remote_code=True
)

# Load model
model = AutoModelForCausalLM.from_pretrained(
    "Lyon28/caca-1B-untrained",
    trust_remote_code=True,
    torch_dtype=torch.float16,
    device_map="auto"
)

print(f"Parameters: {sum(p.numel() for p in model.parameters()):,}")
# ⚠️ The model is untrained, so its output is meaningless

4-bit Quantization

from transformers import AutoModelForCausalLM, BitsAndBytesConfig
import torch

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True
)

model = AutoModelForCausalLM.from_pretrained(
    "Lyon28/caca-1B-untrained",
    trust_remote_code=True,
    quantization_config=bnb_config,
    device_map="auto"
)
# Memory: ~0.5 GB

Training Setup

from transformers import TrainingArguments, Trainer

training_args = TrainingArguments(
    output_dir="./caca-1B-untrained",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,
    learning_rate=2e-4,
    max_steps=10000,
    lr_scheduler_type="cosine",
    warmup_steps=500,
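    fp16=True,  # AMP; load the model in FP32 for training (FP16 weights raise "Attempting to unscale FP16 gradients")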
    gradient_checkpointing=True,
    logging_steps=10,
    save_steps=500,
)

# `model` is the model loaded earlier; `train_dataset` is a tokenized dataset you provide
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
)

trainer.train()

LoRA Fine-tuning

# Enable LoRA via the config
import torch
from transformers import AutoConfig, AutoModelForCausalLM

config = AutoConfig.from_pretrained("Lyon28/caca-1B-untrained", trust_remote_code=True)
config.use_lora = True
config.lora_rank = 16
config.lora_alpha = 32.0
config.lora_target_modules = ["q_proj", "v_proj"]

model = AutoModelForCausalLM.from_pretrained(
    "Lyon28/caca-1B-untrained",
    config=config,
    trust_remote_code=True,
    torch_dtype=torch.float16,
)

model = model.apply_lora()
model.print_trainable_parameters()
# trainable params: ~2M || all params: ~1B || trainable%: ~0.2%
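
As a cross-check on the trainable-parameter figure, the standard LoRA count can be worked out from the layer shapes in this card. This sketch assumes textbook LoRA (one rank-16 `A` and `B` pair per targeted linear layer, no extra trainable weights); how the repo's `apply_lora()` actually counts is not confirmed here.

```python
HIDDEN = 1024
KV_DIM = 128       # 1 KV head x 128 head dim
NUM_LAYERS = 88
RANK = 16

def lora_params(in_features, out_features, rank):
    """A is (rank x in_features), B is (out_features x rank)."""
    return rank * (in_features + out_features)

per_layer = (
    lora_params(HIDDEN, HIDDEN, RANK)    # q_proj: Linear(1024, 1024)
    + lora_params(HIDDEN, KV_DIM, RANK)  # v_proj: Linear(1024, 128)
)
total = per_layer * NUM_LAYERS
print(f"LoRA params per layer: {per_layer:,}")
print(f"LoRA params total:     {total:,}")
print(f"share of ~1B base:     {total / 1_000_000_000:.2%}")
```

Under these assumptions the count comes to roughly 4.5M adapter parameters, somewhat more than the ~2M quoted in the comment above, so treat both numbers as rough guides and rely on what `print_trainable_parameters()` reports.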

Chat Format

# Built-in chat template
messages = [
    {"role": "system", "content": "You are caca, a helpful assistant."},
    {"role": "user", "content": "Explain machine learning."},
]

# Manual format
prompt = "System: You are caca, a helpful assistant.\nUser: Explain machine learning.\nAssistant:"
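
The manual format can also be produced from a `messages` list with a small helper. The function `build_prompt` and the `ROLE_LABELS` mapping are hypothetical names for this sketch; they mirror the "Role: content" layout shown above rather than any template shipped with the tokenizer.

```python
ROLE_LABELS = {"system": "System", "user": "User", "assistant": "Assistant"}

def build_prompt(messages):
    """Render messages as 'Role: content' lines and end with an open assistant turn."""
    lines = [f"{ROLE_LABELS[m['role']]}: {m['content']}" for m in messages]
    lines.append("Assistant:")
    return "\n".join(lines)

messages = [
    {"role": "system", "content": "You are caca, a helpful assistant."},
    {"role": "user", "content": "Explain machine learning."},
]
print(build_prompt(messages))
```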

🔬 Architecture Details

CacaForCausalLM (~1B params)
│
├─ Embedding: 32,000 × 1,024 = 32,768,000 params
│
├─ Transformer Layers (88×)
│  ├─ RMSNorm (input)
│  ├─ CacaAttention (GQA)
│  │  ├─ Q: 8 heads × 128 dim → Linear(1024, 1024)
│  │  ├─ K: 1 head  × 128 dim → Linear(1024, 128)
│  │  ├─ V: 1 head  × 128 dim → Linear(1024, 128)
│  │  ├─ O: Linear(1024, 1024)
│  │  ├─ QK Norm (RMSNorm per head)
│  │  └─ RoPE (θ=10,000)
│  ├─ Residual + Dropout
│  ├─ RMSNorm (post-attention)
│  ├─ CacaMLP (SwiGLU)
│  │  ├─ Gate: Linear(1024, 2688)
│  │  ├─ Up:   Linear(1024, 2688)
│  │  └─ Down: Linear(2688, 1024)
│  └─ Residual + Dropout
│
├─ Final RMSNorm
└─ LM Head: Linear(1024, 32000)

═══════════════════════════════════════
Parameter breakdown per layer:
  Attention: 1,024×(1,024 + 128 + 128 + 1,024) = 2,359,296
  FFN:       1,024×2,688×3 = 8,257,536
  Norms:     1,024×2 = 2,048
  Total/layer: ~10,618,880

88 layers × ~10.6M = ~934M
+ Embeddings: ~33M
+ LM Head: ~33M
= ~1,000M total
═══════════════════════════════════════
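
The breakdown above can be reproduced with a few lines of arithmetic. Like the breakdown itself, this sketch assumes bias-free linear layers and leaves out the final RMSNorm and the per-head QK-norm weights, which add only a few thousand parameters.

```python
VOCAB, HIDDEN, INTERMEDIATE = 32_000, 1_024, 2_688
NUM_LAYERS, NUM_HEADS, NUM_KV_HEADS, HEAD_DIM = 88, 8, 1, 128

q_dim = NUM_HEADS * HEAD_DIM        # 1,024
kv_dim = NUM_KV_HEADS * HEAD_DIM    # 128

attention = HIDDEN * (q_dim + kv_dim + kv_dim) + q_dim * HIDDEN  # Q, K, V, O
ffn = 3 * HIDDEN * INTERMEDIATE                                  # gate, up, down
norms = 2 * HIDDEN                                               # two RMSNorms per layer
per_layer = attention + ffn + norms

embeddings = VOCAB * HIDDEN
lm_head = HIDDEN * VOCAB            # separate from the embeddings (untied)
total = NUM_LAYERS * per_layer + embeddings + lm_head
print(f"{per_layer:,} per layer, {total:,} total")
```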

GQA vs MHA Comparison

Multi-Head Attention (MHA):
  Q: 8 heads, K: 8 heads, V: 8 heads
  KV cache: 8 × 128 × 2 = 2,048 values per token per layer

Grouped Query Attention (GQA), as used in caca-1B-untrained:
  Q: 8 heads, K: 1 head, V: 1 head
  KV cache: 1 × 128 × 2 = 256 values per token per layer
  Saving: 87.5% ↓ memory for KV cache
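
The comparison scales up to a full cache by multiplying by the layer count and sequence length. The sketch below assumes one sequence (batch size 1), the card's 88 layers and 4,096-token context, and a cache stored at the given bytes per value; the function names are illustrative.

```python
NUM_LAYERS, HEAD_DIM, CONTEXT = 88, 128, 4_096

def kv_values_per_token_per_layer(num_kv_heads, head_dim=HEAD_DIM):
    return num_kv_heads * head_dim * 2  # one key and one value vector per KV head

def kv_cache_bytes(num_kv_heads, bytes_per_value, seq_len=CONTEXT, batch=1):
    per_token = kv_values_per_token_per_layer(num_kv_heads) * NUM_LAYERS
    return per_token * seq_len * batch * bytes_per_value

mha = kv_values_per_token_per_layer(8)
gqa = kv_values_per_token_per_layer(1)
print(f"saving: {1 - gqa / mha:.1%}")
print(f"GQA cache at 4K context, FP16: {kv_cache_bytes(1, 2) / 1e9:.2f} GB")
print(f"MHA cache at 4K context, FP16: {kv_cache_bytes(8, 2) / 1e9:.2f} GB")
```

Under these assumptions a single 4,096-token sequence needs about 0.18 GB of FP16 cache with one KV head, versus roughly 1.5 GB with eight, which is where the 87.5% saving shows up in practice.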

⚙️ Full Configuration

{
  "model_type": "caca",
  "architectures": ["CacaForCausalLM"],
  "vocab_size": 32000,
  "hidden_size": 1024,
  "intermediate_size": 2688,
  "num_hidden_layers": 88,
  "num_attention_heads": 8,
  "num_key_value_heads": 1,
  "head_dim": 128,
  "max_position_embeddings": 4096,
  "rope_theta": 10000.0,
  "rms_norm_eps": 1e-6,
  "use_qk_norm": true,
  "use_flash_attn": true,
  "use_rotary_embeddings": true,
  "attention_dropout": 0.0,
  "hidden_dropout": 0.1,
  "residual_dropout": 0.1,
  "use_cache": true,
  "tie_word_embeddings": false
}
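
When editing this configuration (for example to try a different head count), a quick shape check can catch inconsistent values before the model is built. This is a hypothetical helper, not part of the repo; the second rule reflects this card's layout, where Q is `Linear(1024, 1024)`, and is not a requirement of GQA in general.

```python
import json

config = json.loads("""
{
  "hidden_size": 1024,
  "intermediate_size": 2688,
  "num_hidden_layers": 88,
  "num_attention_heads": 8,
  "num_key_value_heads": 1,
  "head_dim": 128
}
""")

def check_attention_shapes(cfg):
    """Return a list of problems; an empty list means the shapes are consistent."""
    problems = []
    if cfg["num_attention_heads"] % cfg["num_key_value_heads"] != 0:
        problems.append("query heads must be a multiple of KV heads for GQA")
    if cfg["num_attention_heads"] * cfg["head_dim"] != cfg["hidden_size"]:
        problems.append("num_attention_heads * head_dim differs from hidden_size")
    return problems

print(check_attention_shapes(config) or "config shapes are consistent")
```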

🛠️ Training Tips

# Recommended hyperparameters for caca-1B-untrained

learning_rate = 2e-4          # Base LR
warmup_ratio = 0.05           # 5% warmup
lr_scheduler = "cosine"
weight_decay = 0.1
max_grad_norm = 1.0
batch_size_effective = 256    # batch × accum × gpus

# GPU requirements:
# A100 40GB  → batch=2, accum=8, fp16
# RTX 3090   → batch=1, accum=16, fp16 + grad_checkpoint
# RTX 4090   → batch=1, accum=16, bf16 + grad_checkpoint
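
To see what these settings imply, the sketch below traces a linear-warmup, cosine-decay learning rate and works out how many GPUs the effective batch size requires. It assumes the 10,000-step run from the Training Setup example and decay to zero; the helper names are illustrative and the exact curve used by the trainer may differ slightly.

```python
import math

def lr_at(step, max_steps=10_000, base_lr=2e-4, warmup_ratio=0.05):
    """Linear warmup to base_lr, then cosine decay to zero."""
    warmup_steps = int(max_steps * warmup_ratio)
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, max_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

def gpus_needed(effective_batch, per_device_batch, accum_steps):
    """Effective batch = per-device batch x accumulation steps x number of GPUs."""
    return math.ceil(effective_batch / (per_device_batch * accum_steps))

for step in (0, 250, 500, 5_250, 10_000):
    print(step, f"{lr_at(step):.2e}")
print(gpus_needed(256, per_device_batch=1, accum_steps=16))
```

With batch=1 and accum=16, reaching an effective batch of 256 takes 16 GPUs; on a single GPU the same effective batch needs accum=256 instead.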

🔧 Troubleshooting

Out of Memory:

model.gradient_checkpointing_enable()       # roughly 40% less memory
# + reduce batch size
# + load_in_8bit=True or load_in_4bit=True
# + torch_dtype=torch.bfloat16

NaN Loss:

# Use BF16 (more stable than FP16)
torch_dtype = torch.bfloat16
# Or reduce the learning rate 10x

Slow Training:

# Make sure flash-attn is installed (shell command)
pip install flash-attn --no-build-isolation
# Compile the model (Python, PyTorch 2.0+)
model = torch.compile(model)

📜 License & Citation

This model is released under the Apache License 2.0: it is free to use for commercial and non-commercial purposes, with attribution.

@misc{caca1b,
  author = {Lyon},
  title = {caca-1B-untrained: Modern Transformer with Grouped Query Attention},
  year = {2026},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/Lyon28/caca-1B-untrained}},
  note = {Untrained model with ~1B parameters}
}

🙏 Acknowledgments

  • Flash Attention (Tri Dao et al.): IO-aware attention algorithm
  • GQA (Ainslie et al., Google): Grouped Query Attention
  • LLaMA (Meta AI): decoder-only architecture inspiration
  • RoPE (Su et al.): rotary position embeddings
  • SwiGLU (Shazeer): gated linear unit activation
  • 🤗 Hugging Face: Transformers library & infrastructure

Made with ❤️ by @Lyon28

"From scratch, for everyone"
