ThingsAI

Building efficient, specialist Small Language Models that run on consumer hardware. Zero telemetry. Open weights. Everything from tokenizer to training script is public.

Models

  • Dwarf-15M (training in progress): A 15.54M-parameter shell/bash specialist. 12 layers, d_model=320, GQA 5Q/1KV, SwiGLU, RMSNorm, RoPE. Custom 8202-token vocabulary via DwarfGoToken. Training on 38.85B tokens (a 2500:1 token-to-parameter ratio) across 11 datasets spanning raw shell, Python, C, instruction pairs, and English web text. Target use case: a CLI tool that translates natural language into bash commands, with user review before execution.

  • Quark-270M: Our largest model. 252M effective parameters, 32 layers, d_model=768, GQA 12Q/4KV, 65K bilingual vocabulary (Italian + English). Trained on curated multilingual data. Available as Base and Instruct variants.

  • Quark-135M: Bilingual (Italian + English) general-purpose model. 135M parameters, 30 layers, 9 attention heads (3 KV, GQA), SwiGLU, RMSNorm, RoPE θ=10k. Trained on 15B+ tokens. Published benchmarks: HellaSwag 31.37%, ARC-Easy 41.46%, PIQA 61.26%.

  • Quark-72M (archived, research artifact): A 71.7M-parameter model that taught us an expensive lesson. With vocab_size=65536 and d_model=512, the embedding matrix consumed ~33.5M of the 71.7M total parameters, so nearly half the budget went to a pure lookup table. Effective transformer capacity was ~35M parameters, which helps explain why it trailed Quark-135M on every benchmark (PIQA 54.57% vs 58.32%, ARC-Easy 32.10% vs 47.73%). In addition, zero-shot chain-of-thought prompting actively degraded performance, dropping ARC-Easy from 33% to 25.5% (random-guess level). This model remains published with its limitations honestly documented. Every architectural decision in Dwarf-15M (the compact 8K vocabulary, the syntax-aware tokenizer, the instruction data mixed into pretraining) was a direct response to what went wrong here.

  • Quark-Mod: Multi-label content moderation model. 9 categories: toxic, severe_toxic, obscene, threat, insult, identity_hate, cyberbullying, hate_speech, offensive.
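
The embedding arithmetic behind the Quark-72M lesson can be checked in a few lines. This is a minimal sketch, assuming a single vocab_size × d_model embedding matrix counted once (that is, tied input and output embeddings, which the card does not state). The function names are illustrative and are not taken from the training code, and the totals are the rounded figures quoted above.

```python
def embedding_params(vocab_size: int, d_model: int) -> int:
    """Parameters in one token-embedding matrix (assumed counted once)."""
    return vocab_size * d_model


def embedding_share(vocab_size: int, d_model: int, total_params: int) -> float:
    """Fraction of the total parameter budget spent on the embedding table."""
    return embedding_params(vocab_size, d_model) / total_params


# Totals are the rounded figures from the model descriptions.
quark_72m_share = embedding_share(65536, 512, 71_700_000)
dwarf_15m_share = embedding_share(8202, 320, 15_540_000)

print(f"Quark-72M embedding share: {quark_72m_share:.1%}")  # 46.8%
print(f"Dwarf-15M embedding share: {dwarf_15m_share:.1%}")  # 16.9%
```

Under these assumptions, shrinking the vocabulary to 8202 tokens moves the embedding table from close to half of the budget to roughly a sixth.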

Tokenizers

  • DwarfGoToken: An 8202-token BPE tokenizer built for shell/bash. Uses ByteLevel pre-tokenization with syntax-aware protected tokens for shell operators (2>&1, &&, >>, ||) that would otherwise be split by standard BPE. Built on a 51MB corpus of shell, Python, C, and English text. Two critical bugs were found and fixed during Dwarf-15M training: space loss from incorrect pre-tokenizer configuration, and short shell keywords (fi, do, if) matching as substrings inside English words.

  • GoToken: A BPE tokenizer written in Rust with Python bindings via PyO3. Published on crates.io and PyPI. Provides syntax-aware pre-tokenization for shell/bash patterns. Used as the foundation for DwarfGoToken.
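
The idea of protected tokens, and the two bug classes mentioned above, can be illustrated with a small standalone sketch. It is not the GoToken or DwarfGoToken implementation: it uses only Python's standard `re` module, and `split_protected`, `PROTECTED_OPERATORS`, and `SHELL_KEYWORDS` are hypothetical names chosen for this example. Operators are matched whole, keywords are matched only at word boundaries, and every character (including spaces) is preserved so the pieces concatenate back to the input.

```python
import re

# Hypothetical lists for illustration; the real tokenizer's lists may differ.
PROTECTED_OPERATORS = ["2>&1", "&&", "||", ">>"]
SHELL_KEYWORDS = ["fi", "do", "if"]

# Longest operators first so a long operator is never split by a shorter one.
_operators = "|".join(
    re.escape(op) for op in sorted(PROTECTED_OPERATORS, key=len, reverse=True)
)
# \b keeps "fi" from matching inside "file" and "do" from matching inside "done".
_keywords = "|".join(re.escape(kw) for kw in SHELL_KEYWORDS)
_PATTERN = re.compile(rf"(?:{_operators})|\b(?:{_keywords})\b")


def split_protected(text: str) -> list[tuple[str, bool]]:
    """Split text into (chunk, is_protected) pairs without dropping characters."""
    pieces = []
    pos = 0
    for match in _PATTERN.finditer(text):
        if match.start() > pos:
            pieces.append((text[pos:match.start()], False))
        pieces.append((match.group(0), True))
        pos = match.end()
    if pos < len(text):
        pieces.append((text[pos:], False))
    return pieces


command = "make 2>&1 && echo done"
for chunk, protected in split_protected(command):
    print(repr(chunk), "protected" if protected else "plain")
```

In a full pipeline, the plain chunks would go on to BPE while the protected chunks map directly to single vocabulary entries. The round-trip property (joining the chunks reproduces the input) is a cheap regression check against space loss.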

What We Focus On

  • Specialist over generalist: A 15M model can't do everything, but it can excel at one thing. Dwarf targets shell/bash; future models will target math/physics.
  • Honest failure documentation: When something doesn't work (72M vocabulary problem, zero-shot CoT degradation, tokenizer bugs), we publish the failure and what we learned.
  • Extreme overtraining for small models: Following the Phi/SmolLM philosophy, small models need more tokens per parameter, not fewer. Dwarf trains at 2500:1, about 125x the Chinchilla-optimal ratio of roughly 20 tokens per parameter.
  • Custom tooling from scratch: Tokenizers (gotoken, DwarfGoToken), training scripts with multi-source streaming, and inference tools — all built in-house, all open.
  • Consumer hardware: Everything runs on an RTX 3070 (8GB) or equivalent. No datacenter required.
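
The overtraining figures quoted for Dwarf-15M follow from simple arithmetic. The sketch below assumes the common rule-of-thumb reading of Chinchilla-optimal as about 20 tokens per parameter; that constant and the variable names are assumptions for illustration, not values from the training scripts.

```python
# Assumed rule of thumb: about 20 training tokens per parameter.
CHINCHILLA_TOKENS_PER_PARAM = 20

dwarf_params = 15_540_000      # 15.54M parameters
dwarf_tokens = 38_850_000_000  # 38.85B training tokens

tokens_per_param = dwarf_tokens / dwarf_params
overtraining_factor = tokens_per_param / CHINCHILLA_TOKENS_PER_PARAM

print(f"Tokens per parameter: {tokens_per_param:.0f}:1")       # 2500:1
print(f"Multiple of Chinchilla: {overtraining_factor:.0f}x")   # 125x
```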

Links