# TENNs LLM 1B

A 1-billion-parameter causal language model built on gate-mode SSM (state space model) layers from TENNs Core. It uses recurrent inference instead of attention, which makes it efficient for streaming and long-context generation.
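To see why recurrence helps here: an SSM-style layer carries a fixed-size state from token to token, so per-token compute and memory stay constant no matter how long the stream gets, whereas attention's KV cache grows with context length. A toy gated diagonal recurrence illustrating the idea (not the actual TENNs layer):

```python
import torch

# Toy diagonal linear recurrence: the state h has a fixed size, so each new
# token costs O(d) time and memory regardless of stream length. Illustrative
# only; the real TENNs SSM layers are more elaborate.
d = 8
a = torch.rand(d) * 0.9           # per-channel decay
h = torch.zeros(d)                # recurrent state
for x in torch.randn(16, d):      # a stream of 16 token embeddings
    h = a * h + (1 - a) * x       # constant-time state update per token
```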
## Architecture
| Component | Details |
|---|---|
| Layers | 24 × TENNsBlock (gate mode) |
| Hidden dim | 2048 |
| Inner dim | 4096 |
| Vocabulary | 32,000 (Mistral-7B tokenizer) |
| Parameters | ~1B |
Each TENNsBlock: RMSNorm → in_proj → causal_conv(4) → SSM(gate) → out_proj → residual
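As a rough sketch of how those pieces fit together, here is an illustrative, deliberately simplified PyTorch skeleton. The real block ships in the bundled `tenns_core/` package; the gated scan below is a stand-in for the actual TENNs SSM(gate) recurrence, not a reimplementation of it:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TENNsBlockSketch(nn.Module):
    """Illustrative skeleton only; the real TENNsBlock lives in tenns_core/."""

    def __init__(self, d_model=2048, d_inner=4096, kernel=4):
        super().__init__()
        self.norm = nn.RMSNorm(d_model)          # needs PyTorch >= 2.4
        self.in_proj = nn.Linear(d_model, d_inner)
        # Depthwise conv, left-padded below so position t never sees t+1.
        self.conv = nn.Conv1d(d_inner, d_inner, kernel, groups=d_inner)
        self.kernel = kernel
        # Stand-in for SSM(gate): per-channel decay plus a sigmoid gate.
        self.decay = nn.Parameter(torch.rand(d_inner) * 0.9)
        self.gate = nn.Linear(d_inner, d_inner)
        self.out_proj = nn.Linear(d_inner, d_model)

    def forward(self, x):                        # x: (batch, seq, d_model)
        residual = x
        h = self.in_proj(self.norm(x))
        h = F.pad(h.transpose(1, 2), (self.kernel - 1, 0))   # causal padding
        h = self.conv(h).transpose(1, 2)
        state, outs = torch.zeros_like(h[:, 0]), []
        for t in range(h.size(1)):               # toy gated linear recurrence
            state = self.decay * state + h[:, t]
            outs.append(state * torch.sigmoid(self.gate(h[:, t])))
        return residual + self.out_proj(torch.stack(outs, dim=1))
```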
## Quick Start (Google Colab / any environment)
```python
!pip install transformers torch einops opt_einsum safetensors

from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("BrainChipInc/tenns-llm-1b")
model = AutoModelForCausalLM.from_pretrained(
    "BrainChipInc/tenns-llm-1b",
    trust_remote_code=True,
)

output = model.generate_text("The history of artificial intelligence", tokenizer, max_new_tokens=100)
print(output)
```
**Do not use `pipeline()`.** This model uses a custom recurrent architecture that is not compatible with Hugging Face's standard text-generation pipeline.
## Installation

```bash
pip install transformers torch einops opt_einsum safetensors
```
## Usage
**Note:** Do not use `pipeline()`; this model requires `model.generate_text()` instead of Hugging Face's standard `generate()`. The recurrent SSM architecture is not compatible with the attention KV-cache pipeline.
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("BrainChipInc/tenns-llm-1b")
model = AutoModelForCausalLM.from_pretrained(
    "BrainChipInc/tenns-llm-1b",
    trust_remote_code=True,
)

output = model.generate_text("The history of artificial intelligence", tokenizer, max_new_tokens=100)
print(output)
```
### Generation options
```python
prompt = "The history of artificial intelligence"

# Greedy decoding (default)
output = model.generate_text(prompt, tokenizer, max_new_tokens=50)

# Top-k sampling with temperature
output = model.generate_text(prompt, tokenizer, max_new_tokens=100, temperature=0.8, top_k=50)
```
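For reference, this is roughly what top-k sampling with temperature does under the hood: a generic sketch of the standard technique, not the model's actual sampler:

```python
import torch

def sample_top_k(logits, temperature=0.8, top_k=50):
    # Generic top-k + temperature sampling over a 1-D logits vector:
    # scale the logits, keep the k largest, renormalize, draw one token id.
    values, indices = torch.topk(logits / temperature, top_k)
    probs = torch.softmax(values, dim=-1)
    return indices[torch.multinomial(probs, 1)].item()
```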
## `trust_remote_code=True`

This model uses custom modeling code bundled in the repository (`modeling_tenns_llm.py`, `configuration_tenns_llm.py`, `tenns_core/`). Loading requires `trust_remote_code=True`. The bundled `tenns_core/` package is a snapshot of the TENNs Core SSM library; no separate installation is needed.
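Since `trust_remote_code=True` executes code from the repository, one common precaution is to pin the repo to a specific revision so the code you audited is the code that runs. A minimal sketch; the commit hash shown is a placeholder, not a real revision of this repo:

```python
from transformers import AutoModelForCausalLM

# Pin the download to a fixed commit so later pushes to the repo cannot
# change the remote code that gets executed. "abc1234" is a placeholder;
# use an actual commit hash from the model's revision history on the Hub.
model = AutoModelForCausalLM.from_pretrained(
    "BrainChipInc/tenns-llm-1b",
    trust_remote_code=True,
    revision="abc1234",
)
```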
## Training
Fine-tuned from a base TENNs gate-mode model using LoRA adapters on English instruction data. LoRA adapters are merged into base weights at export time.
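For context on what "merged at export time" means: the standard LoRA merge folds the low-rank update into the frozen base weight, so inference needs no adapter machinery. A generic sketch of that arithmetic (shapes and scaling here are illustrative, not taken from the TENNs export code):

```python
import torch

# Standard LoRA merge: W' = W + (alpha / r) * B @ A. After merging, the
# adapter matrices A and B can be discarded entirely.
d_out, d_in, r, alpha = 4096, 2048, 16, 32
W = torch.randn(d_out, d_in)        # frozen base weight
A = torch.randn(r, d_in) * 0.01     # LoRA down-projection
B = torch.zeros(d_out, r)           # LoRA up-projection (zero-initialized)

W_merged = W + (alpha / r) * (B @ A)
```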
## Limitations

- English only
- No system prompt or chat template; this is a plain completion model
- Recurrent state resets between calls to `generate_text()` (see the sketch below)
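Because the recurrent state does not persist across calls, continuing a passage means re-feeding everything generated so far as the next prompt. A minimal sketch, assuming `generate_text()` returns the prompt together with its continuation (check the actual return convention in the repo's modeling code):

```python
# State resets on every call, so each round re-processes the full text so far.
text = "The history of artificial intelligence"
for _ in range(3):
    text = model.generate_text(text, tokenizer, max_new_tokens=50)
print(text)
```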