📂 Part of the Lance MLX collection on mlx-community.
Lance-3B-Video-bf16 (MLX, video specialist)
MLX port of ByteDance Intelligent Creation Lab's Lance — the video-specialist Lance_3B_Video checkpoint, converted to bf16 for Apple Silicon. ~6.44 B LLM parameters + 669 M Qwen2.5-VL ViT bundled, with the 126,976-entry latent_pos_embed table needed for video-scale latent grids.
Lance is ByteDance's 3B-active unified multimodal model (paper, code, HF original). This is not Lance/LanceDB, the columnar data format.
Status
🟢 t2v functional across the full scale envelope (2026-05-21). The painterly aesthetic is this checkpoint's trained style, by design. Concrete-subject prompts produce recognizable content; abstract-motion prompts are rendered in the same painterly style.
| Capability | Status | Notes |
|---|---|---|
| t2v at 256×256 × 16f | ✅ Works | ~33 s/clip on M5 Max. |
| t2v at 512×512 × 16f | ✅ Works | ~60 s/clip. |
| t2v at 768×768 × 13f (n_lat=9.2k) | ✅ Works | ~2.5 min/clip. Recognizable subjects (red panda with cap → "dog with hat"). |
| t2v at 768×768 × 17f (n_lat=11.5k) | ✅ Works | ~20 min/clip. "Five balls on a wooden table" → recognizable balls on wood texture, varied colors. |
| t2v at 768×768 × 25f (n_lat=16.1k) | 🟡 Validated | See commit notes. |
| t2v at 768×768 × 49f (n_lat=30k) | ⚠️ Functional but slow | ~2¼ h/clip on M5 Max. Memory and time become impractical for casual use. |
| x2t_video (video VQA / captioning) | ✅ Validated | Validated against the Phase 0 oracle. Cooking-video VQA produces a content-correct 256-token caption (kitchen + pan + spatula + tomato + meat + stirring all matched) in 17.5 s. |
| video_edit (instruction-based) | ✅ Functional | "Change all the balls to a deep red color." → balls recolored, composition preserved. 17 frames × 256² in 81.6 s. |
For production-quality photorealistic image tasks (t2i, image_edit, x2t_image), use the sibling repo mlx-community/Lance-3B-bf16. Lance_3B is the image specialist, with a crystal-clear photorealistic aesthetic.
Aesthetic note: painterly is intentional
This checkpoint was further fine-tuned on video data starting from the image-specialist Lance_3B. The native aesthetic is painterly, not photorealistic. Per-tensor diff confirms this is a different fine-tune (not just a positional-embedding extension): _moe_gen QK-norms differ by 0.5–0.85 in 6+ layers; lm_head and embed_tokens are byte-identical.
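The per-tensor comparison described above can be sketched as follows. This is a minimal illustration in plain Python, not the script used for this port: max_abs_diff_per_tensor is a hypothetical helper, the checkpoints are stood in for by dictionaries mapping tensor names to flat lists of floats, and the tensor names and values in the usage lines are made up. Real checkpoints would be loaded from the safetensors files instead.

```python
from typing import Dict, List


def max_abs_diff_per_tensor(
    a: Dict[str, List[float]], b: Dict[str, List[float]]
) -> Dict[str, float]:
    """Return the largest absolute element-wise difference for each shared tensor."""
    diffs = {}
    for name in sorted(a.keys() & b.keys()):
        if len(a[name]) != len(b[name]):
            raise ValueError(f"size mismatch for {name}")
        diffs[name] = max(
            (abs(x - y) for x, y in zip(a[name], b[name])), default=0.0
        )
    return diffs


# Toy stand-ins for two checkpoints (names and values are made up).
image_ckpt = {"lm_head.weight": [0.1, 0.2], "layers.0.q_norm.weight": [1.0, 1.0]}
video_ckpt = {"lm_head.weight": [0.1, 0.2], "layers.0.q_norm.weight": [1.5, 0.25]}

result = max_abs_diff_per_tensor(image_ckpt, video_ckpt)
for name, diff in result.items():
    # A zero diff is necessary but not sufficient for byte identity.
    status = "no numeric difference" if diff == 0.0 else f"max |diff| = {diff:.2f}"
    print(f"{name}: {status}")
```

A zero maximum difference only shows numeric equality; confirming byte identity would need a comparison of the raw buffers.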
If your output looks "abstract" or "distorted," that's usually the aesthetic doing its job on a motion-fluid prompt (e.g. "surfing", "flying"). Concrete-noun prompts ("balls on a table", "a cup of coffee") produce clearly recognizable rendered scenes in the painterly style.
Why a separate "Video" checkpoint?
ByteDance ships two variants of Lance that differ in fine-tuning:
- Lance_3B: image specialist. Crystal-clear photorealistic t2i.
- Lance_3B_Video: video specialist. Same architecture, further fine-tuned on video data. Bundles the Qwen2.5-VL ViT (669 M) and the larger 126,976-entry latent_pos_embed table that addresses video-resolution token grids.
Quickstart
Install from the lance-mlx source repo:
git clone https://github.com/xocialize/lance-mlx
cd lance-mlx && uv sync
Download this checkpoint:
from huggingface_hub import snapshot_download
weights = snapshot_download("mlx-community/Lance-3B-Video-bf16")
Text-to-video
from lance_mlx.pipeline.t2v import TextToVideoPipeline
pipe = TextToVideoPipeline.from_pretrained(
    lance_weights_dir=weights,
    vae_safetensors=f"{weights}/vae.safetensors",
)
frames = pipe.generate(
    "Five balls on a wooden table: two blue, three green.",
    num_frames=17, height=768, width=768,
    num_steps=30, cfg_scale=4.0, seed=42,
)
# frames is np.ndarray of shape (T_decoded, H, W, 3) uint8
Encode to MP4 with imageio:
import imageio
with imageio.get_writer("out.mp4", fps=12, codec="libx264") as writer:
    for f in frames:
        writer.append_data(f)
Video understanding
from lance_mlx.pipeline.understanding import UnderstandingPipeline
pipe = UnderstandingPipeline.from_pretrained(
    lance_weights_dir=weights,
    vit_safetensors=f"{weights}/vit.safetensors",
)
answer = pipe.generate_video(
    video="my_video.mp4",
    question="Describe what happens in this video.",
    num_sample_frames=16, target_h=224, target_w=224,
    max_new_tokens=256, prompt_style="lance",
)
print(answer)
Validated content-correct against the Phase 0 oracle's cooking VQA case (kitchen + pan + spatula + tomato + meat + stirring matched).
Video editing
from lance_mlx.pipeline.video_edit import VideoEditPipeline
pipe = VideoEditPipeline.from_pretrained(
    lance_weights_dir=weights,
    vae_safetensors=f"{weights}/vae.safetensors",
)
frames = pipe.generate(
    input_video="my_video.mp4",
    instruction="Change all the balls to a deep red color.",
    height=256, width=256, num_frames=17,
    num_steps=30, cfg_scale=4.0, seed=42,
)
Performance (M5 Max 128 GB)
| Task | Configuration | Wall-clock |
|---|---|---|
| t2v | 256² × 16f, 30 steps, CFG=4.0 | ~33 s |
| t2v | 512² × 16f, 30 steps, CFG=4.0 | ~60 s |
| t2v | 768² × 13f, 30 steps, CFG=4.0 | ~145 s |
| t2v | 768² × 17f, 30 steps, CFG=4.0 | ~20 min |
| t2v | 768² × 49f, 30 steps, CFG=4.0 | ~2¼ hours (impractical) |
CFG doubles the forward cost because the conditional and unconditional passes run sequentially. Attention scales as O(N²) in the latent-token count, so high-frame, high-resolution combinations quickly become impractical. A KV cache for the text prefix is a Phase 5 follow-up.
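To see how quickly the latent-token count grows, the sketch below estimates it from the clip shape. It assumes the 16× spatial / 4× temporal compression listed under Architecture and a causal mapping of T frames to (T − 1) / 4 + 1 latent frames. That mapping is an assumption rather than something read from the lance-mlx code, but it reproduces the n_lat values in the Status table (9,216, 11,520, 16,128 and 29,952 tokens at 768×768). The helper name latent_token_count and the floor division used for frame counts that are not of the form 4k + 1 are illustrative choices.

```python
def latent_token_count(
    height: int, width: int, num_frames: int, spatial: int = 16, temporal: int = 4
) -> int:
    """Estimate the number of VAE latent tokens for a clip.

    Assumes a causal VAE that keeps the first frame and compresses the
    remaining frames in groups of `temporal` (an assumption, see lead-in).
    """
    latent_frames = (num_frames - 1) // temporal + 1
    return (height // spatial) * (width // spatial) * latent_frames


base = latent_token_count(768, 768, 13)
for frames in (13, 17, 25, 49):
    n = latent_token_count(768, 768, frames)
    # Relative attention cost under the O(N^2) scaling described above.
    print(f"768x768 x {frames}f: n_lat={n}, attention cost ~{(n / base) ** 2:.1f}x")
```

Under these assumptions, going from 13 to 49 frames multiplies the token count by 3.25 and the quadratic attention term by roughly 10.6.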
Files in this repo
| File | Size | Purpose |
|---|---|---|
| model.safetensors | 12.87 GB | LLM weights (1021 tensors, both UND + GEN towers, with 126,976-entry latent_pos_embed) |
| vit.safetensors | 1.34 GB | Qwen2.5-VL ViT (semantic encoder for x2t_video) |
| vae.safetensors | 1.41 GB | Lance's bundled Wan2.2 VAE (also available standalone as mlx-community/Wan2.2-VAE-Lance-bf16) |
| config.json | – | Qwen2_5_VLForConditionalGeneration config |
| conversion_report.json | – | Provenance |
| tokenizer.json / vocab.json | – | Qwen2.5-VL vocabulary |
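A quick sanity check after downloading is to confirm that the files listed above are present. The sketch below uses only the standard library; missing_files is a hypothetical helper, not part of lance-mlx or huggingface_hub, and the file list is copied from the table.

```python
from pathlib import Path
from typing import List

# File names taken from the table above.
EXPECTED_FILES = [
    "model.safetensors",
    "vit.safetensors",
    "vae.safetensors",
    "config.json",
    "conversion_report.json",
    "tokenizer.json",
    "vocab.json",
]


def missing_files(weights_dir: str) -> List[str]:
    """Return the expected file names that are absent from weights_dir."""
    root = Path(weights_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]
```

Calling missing_files(weights) with the directory returned by snapshot_download should give an empty list when the download is complete.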
Provenance
Source: bytedance-research/Lance/Lance_3B_Video/model.safetensors (1411 tensors including bundled ViT; 6.437 B LLM + 0.669 B ViT params).
Converted via scripts/02_convert.py. The bundled ViT is extracted to a sibling vit.safetensors with the vit_model. prefix stripped, matching the layout convention of the image-specialist repo.
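The ViT extraction step amounts to splitting the tensor dictionary by key prefix. The sketch below illustrates the idea only and is not the contents of scripts/02_convert.py: split_by_prefix is a hypothetical helper, and the tensor names and placeholder values are made up.

```python
from typing import Dict, Tuple, TypeVar

T = TypeVar("T")


def split_by_prefix(
    tensors: Dict[str, T], prefix: str
) -> Tuple[Dict[str, T], Dict[str, T]]:
    """Split tensors into (entries matching prefix, with it stripped) and the rest."""
    matched: Dict[str, T] = {}
    rest: Dict[str, T] = {}
    for name, value in tensors.items():
        if name.startswith(prefix):
            matched[name[len(prefix):]] = value
        else:
            rest[name] = value
    return matched, rest


# Placeholder names and values; real entries would be arrays from the source checkpoint.
source = {
    "vit_model.blocks.0.attn.qkv.weight": "A",
    "language_model.embed_tokens.weight": "B",
}
vit, llm = split_by_prefix(source, "vit_model.")
print(sorted(vit), sorted(llm))
```

Each of the two resulting dictionaries would then be written to its own safetensors file.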
Tips
- Use concrete-subject prompts. "Five red apples in a bowl" works better than "the joy of friendship in motion." The model can render abstract scenes, but the painterly aesthetic on already-abstract subjects can read as overly abstract.
- Smaller scales iterate faster. 256² × 16 frames is the fastest test config (~33 s); good for prompt iteration. Scale up once you find a prompt you like.
- English + Chinese prompts work. Other languages are out of distribution (Qwen2.5-VL was trained primarily on en + zh).
Limitations
- bf16 only. 4-bit + 8-bit quantization in progress (Phase 5b). Naive INT4 has been observed to degrade the GEN expert per Reza2kn/lance-quant's findings; quantization needs per-tower calibration.
- No streaming or batched generation.
- CFG doubles forward cost. A future KV-cache for the text + clean-ref prefix would save ~30% per step.
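The CFG cost noted above follows from running two forward passes per sampling step. The sketch below shows the standard classifier-free guidance mix inside a plain Euler loop and counts the passes. It is a toy in pure Python: the function names, the 0-to-1 time direction, and the constant toy velocity are assumptions for illustration, and whether lance-mlx uses exactly this formulation is not confirmed here.

```python
from typing import Callable, List, Tuple

Vector = List[float]
VelocityFn = Callable[[Vector, float, bool], Vector]


def cfg_velocity(v_cond: Vector, v_uncond: Vector, cfg_scale: float) -> Vector:
    """Standard CFG mix: extrapolate from the unconditional output toward the conditional one."""
    return [u + cfg_scale * (c - u) for c, u in zip(v_cond, v_uncond)]


def sample(
    velocity_fn: VelocityFn, x: Vector, num_steps: int, cfg_scale: float
) -> Tuple[Vector, int]:
    """Euler integration over [0, 1], counting model forward passes."""
    forward_passes = 0
    dt = 1.0 / num_steps
    for step in range(num_steps):
        t = step * dt
        v_cond = velocity_fn(x, t, True)     # pass with the prompt
        v_uncond = velocity_fn(x, t, False)  # pass without the prompt
        forward_passes += 2
        v = cfg_velocity(v_cond, v_uncond, cfg_scale)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x, forward_passes


def toy_velocity(x: Vector, t: float, conditioned: bool) -> Vector:
    """Toy stand-in for the model: constant velocity that depends only on conditioning."""
    return [1.0 if conditioned else 0.0 for _ in x]


_, passes = sample(toy_velocity, [0.0, 0.0], num_steps=30, cfg_scale=4.0)
print(passes)
```

With num_steps=30 the loop makes 60 forward passes. A cached prefix would not remove the second pass, but it would shrink the work each pass repeats.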
Architecture (shared with the image specialist)
- Two expert towers (LLM_UND, LLM_GEN), each initialized from Qwen2.5-VL-3B-Instruct, with per-expert FFN, output projection, and QK-norm.
- Modality-deterministic routing: text + Qwen2.5-VL ViT semantic tokens → LLM_UND (autoregressive); Wan2.2 VAE latent tokens → LLM_GEN (flow-matching velocity prediction). No learned gate.
- MaPE: modality-aware RoPE with per-modality temporal anchor.
- Wan2.2 3D causal VAE (16× spatial / 4× temporal compression, 48-channel latent).
- Bidirectional attention within latent block.
- Untied LM head.
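The modality-deterministic routing in the list above can be pictured as a fixed lookup rather than a learned gate. The sketch below is a toy: the modality labels and the route function are hypothetical and do not reflect how lance-mlx tags tokens internally.

```python
# Hypothetical modality labels; the real pipeline's token typing may differ.
UND_MODALITIES = {"text", "vit_semantic"}
GEN_MODALITIES = {"vae_latent"}


def route(modality: str) -> str:
    """Pick the expert tower from the token's modality alone (no learned gate)."""
    if modality in UND_MODALITIES:
        return "LLM_UND"
    if modality in GEN_MODALITIES:
        return "LLM_GEN"
    raise ValueError(f"unknown modality: {modality}")


sequence = ["text", "text", "vit_semantic", "vae_latent", "vae_latent"]
routed = [route(m) for m in sequence]
print(routed)
```

Because the assignment depends only on the token type, the same token always reaches the same tower, with no router weights to train or load-balance.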
License
This MLX port: Apache 2.0.
Underlying weights:
- Lance: Apache 2.0 (ByteDance Intelligent Creation Lab).
- Wan2.2 VAE: Apache 2.0 (Alibaba).
- Qwen2.5-VL: Apache 2.0 (Alibaba).
See NOTICE for attribution.
Citation
@article{fu2026lance,
  title={Lance: Unified Multimodal Modeling by Multi-Task Synergy},
  author={Fu, Fengyi and Huang, Mengqi and Wu, Shaojin and others},
  journal={arXiv preprint arXiv:2605.18678},
  year={2026}
}
Links
- MLX port code + phase notes: github.com/xocialize/lance-mlx
- Original PyTorch model: bytedance-research/Lance
- Image specialist (production): mlx-community/Lance-3B-bf16
- Wan2.2 VAE (standalone): mlx-community/Wan2.2-VAE-Lance-bf16