---
base_model: dnotitia/DNA-2.0-14B
library_name: transformers
pipeline_tag: text-generation
license: apache-2.0
tags:
- fp8
- quantized
- vllm
- qwen3
- text-generation
- conversational
- compressed-tensors
- llmcompressor
language:
- ko
- en
- multilingual
model_creator: dnotitia
quantized_by: dataslab
---
# DLM-2.0-14B-FP8
## Overview
This is an FP8-quantized version of dnotitia/DNA-2.0-14B, produced by DLM (Data Science Lab., Ltd.) for efficient inference.
FP8 (8-bit floating point) quantization with static per-tensor scaling reduces model size by approximately 35% while maintaining near-original accuracy. Fully compatible with vLLM for high-throughput production serving.
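To make the scheme concrete, here is an illustrative PyTorch sketch of static per-tensor FP8_E4M3 quantization. It shows the core idea (one MinMax-derived symmetric scale per tensor); it is not the production implementation.

```python
import torch

# Illustrative only: one symmetric scale per tensor, derived from the
# observed absolute maximum (MinMax), mapped onto FP8_E4M3's range.
w = torch.randn(4096, 4096, dtype=torch.bfloat16)

f8_max = torch.finfo(torch.float8_e4m3fn).max            # 448.0 for E4M3
scale = w.abs().max().float() / f8_max                   # per-tensor scale
w_fp8 = (w.float() / scale).clamp(-f8_max, f8_max).to(torch.float8_e4m3fn)

# Dequantized reference: w is approximated by w_fp8 * scale
err = (w.float() - w_fp8.float() * scale).abs().max()
print(f"scale={scale.item():.5f}  max abs error={err.item():.5f}")
```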
## Model Details
| Attribute | Value |
|---|---|
| Base Model | dnotitia/DNA-2.0-14B |
| Architecture | Qwen3ForCausalLM |
| Parameters | ~14B |
| Quantization | FP8 W8A8 (Static Per-Tensor) |
| Quantization Tool | llm-compressor |
| Calibration Data | HuggingFaceH4/ultrachat_200k (512 samples) |
| Model Size | ~19 GB (vs ~30 GB in BF16) |
| Context Length | 32K native / up to 131K with YaRN |
| Vocabulary | 151,936 tokens |
| License | Apache 2.0 |
| Quantized By | DLM (Data Science Lab., Ltd.) |
## Quantization Details

- Method: Static FP8 quantization via `llm-compressor` `oneshot` (see the sketch after this list)
- Precision: FP8_E4M3 for weights, FP8_E4M3 for input activations
- Strategy: Per-tensor symmetric scaling with a MinMax observer
- Calibration: 512 samples from `HuggingFaceH4/ultrachat_200k` (train_sft split), max sequence length 2048
- Format: compressed-tensors (safetensors)
- Preserved layers: `lm_head` kept in full precision (BF16)
- Targets: all `Linear` layers (except `lm_head`)
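The recipe below is a hedged sketch of how such a checkpoint is typically produced with llm-compressor's FP8 flow; the exact script and arguments used for this release may differ.

```python
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

# "FP8" is llm-compressor's static per-tensor W8A8 scheme
# (FP8_E4M3 weights and input activations); lm_head stays in BF16.
recipe = QuantizationModifier(targets="Linear", scheme="FP8", ignore=["lm_head"])

oneshot(
    model="dnotitia/DNA-2.0-14B",
    dataset="HuggingFaceH4/ultrachat_200k",
    splits={"calibration": "train_sft[:512]"},   # 512 calibration samples
    recipe=recipe,
    max_seq_length=2048,
    num_calibration_samples=512,
    output_dir="DLM-2.0-14B-FP8",
)
```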
## Usage

### vLLM (Recommended)

```bash
vllm serve dataslab/DLM-2.0-14B-FP8 \
  --dtype auto \
  --max-model-len 32768 \
  --enable-reasoning \
  --reasoning-parser deepseek_r1
```
Extended context (up to 131K with YaRN):

```bash
vllm serve dataslab/DLM-2.0-14B-FP8 \
  --dtype auto \
  --rope-scaling '{"rope_type":"yarn","factor":4.0,"original_max_position_embeddings":32768}' \
  --max-model-len 131072 \
  --enable-reasoning \
  --reasoning-parser deepseek_r1
```
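Once the server is up, any OpenAI-compatible client can talk to it. A minimal example, assuming the default port 8000 and no API key configured:

```python
from openai import OpenAI

# vLLM exposes an OpenAI-compatible endpoint at /v1 (default port 8000).
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="dataslab/DLM-2.0-14B-FP8",
    messages=[{"role": "user", "content": "Summarize FP8 quantization in two sentences."}],
    temperature=0.6,
    top_p=0.95,
    max_tokens=512,
)
print(response.choices[0].message.content)
```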
Python (vLLM)
from vllm import LLM, SamplingParams
llm = LLM(model="dataslab/DLM-2.0-14B-FP8")
sampling_params = SamplingParams(
temperature=0.6, top_p=0.95, top_k=20, max_tokens=4096
)
messages = [
{"role": "user", "content": "ํ๊ตญ์ ๊ฒฝ์ ๋ฐ์ ๊ณผ์ ์ ๋ํด ์ค๋ช
ํด์ฃผ์ธ์."}
]
outputs = llm.chat(messages, sampling_params=sampling_params)
print(outputs[0].outputs[0].text)
Transformers
from transformers import AutoModelForCausalLM, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("dataslab/DLM-2.0-14B-FP8")
model = AutoModelForCausalLM.from_pretrained(
"dataslab/DLM-2.0-14B-FP8",
device_map="auto",
)
messages = [
{"role": "user", "content": "๋ณต์กํ ์ค๋ฆฌ์ ๋๋ ๋ง์ ๋ํด ๋ค๊ฐ๋๋ก ๋ถ์ํด์ค."}
]
inputs = tokenizer.apply_chat_template(
messages, add_generation_prompt=True, return_dict=True, return_tensors="pt"
).to(model.device)
outputs = model.generate(
**inputs,
max_new_tokens=4096,
temperature=0.6,
top_p=0.95,
top_k=20,
do_sample=True,
)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
## Dynamic Thinking Mode

This model inherits DNA 2.0's dynamic thinking capability (see the sketch after this list):

- Thinking mode: Add `/think` to enable detailed step-by-step reasoning (temperature=0.6)
- Non-thinking mode: Add `/no_think` for concise, direct responses (temperature=0.7)
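A minimal sketch of toggling the mode, assuming the tag is appended inline to the user message and reusing the `llm` instance from the vLLM example above:

```python
from vllm import SamplingParams

# /think requests step-by-step reasoning; /no_think requests a direct answer.
thinking = [{"role": "user", "content": "Prove that the square root of 2 is irrational. /think"}]
direct = [{"role": "user", "content": "What is the capital of Korea? /no_think"}]

out = llm.chat(thinking, sampling_params=SamplingParams(temperature=0.6, top_p=0.95, max_tokens=4096))
print(out[0].outputs[0].text)
```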
## Base Model
DNA 2.0 is developed by Dnotitia Inc. and features:
- Smoothie Qwen3 foundation with balanced multilingual optimization
- Uncensored reasoning training for objective, unbiased responses
- Advanced RL post-training for enhanced mathematical reasoning and Korean language capabilities
For more details, see the [arXiv paper](https://arxiv.org/abs/2507.05686).
## License

Apache 2.0, same as the base model.

Quantized and released by DLM (Data Science Lab., Ltd.) on Hugging Face.