📂 Part of the Lance MLX collection on mlx-community.
Lance-3B-Video-bf16 (MLX, video specialist)
MLX port of ByteDance Intelligent Creation Lab's Lance — the video-specialist Lance_3B_Video checkpoint, converted to bf16 for Apple Silicon. ~6.44 B LLM parameters + 669 M Qwen2.5-VL ViT bundled, with the 126,976-entry latent_pos_embed table needed for video-scale latent grids.
Lance is ByteDance's 3B-active unified multimodal model (paper, code, HF original). This is not Lance/LanceDB, the columnar data format.
Status — 🟢 t2v in production after Phase 5j position-ID fix (2026-05-21)
Phase 5j watercolor fix shipped 2026-05-21. Root cause was a port-side bug in _build_position_ids: the latent block's mrope (t, h, w) grid was anchored to base = text_len_before_latents, so with our verbose chat template the latent positions drifted with prompt length out of Qwen2.5-VL's training distribution (visual tokens train against grid-ORIGIN coords, not concatenated with text positions). The drift smeared high-frequency detail into a painterly/watercolor aesthetic. Fix: anchor the latent grid at base = 0 regardless of prompt length. Default for TextToVideoPipeline.generate is now latent_pos_base=0.
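The anchoring change can be illustrated with a small sketch. This is not the port's _build_position_ids: the helper name latent_grid_positions, the row-major ordering, and the example text length of 120 are assumptions made for illustration. It only shows how the (t, h, w) coordinates of a latent grid shift when the base follows the prompt length versus staying at the origin.

```python
from itertools import product


def latent_grid_positions(t, h, w, base=0):
    """Return (t, h, w) position triples for a latent grid, offset by `base`.

    Hypothetical helper for illustration only; the real logic lives in the
    port's _build_position_ids and may differ in layout and dtype.
    """
    return [
        (base + ti, base + hi, base + wi)
        for ti, hi, wi in product(range(t), range(h), range(w))
    ]


# Legacy behaviour: the grid is anchored after the text, so every
# coordinate drifts with prompt length (120 is an arbitrary example).
legacy = latent_grid_positions(t=5, h=16, w=16, base=120)

# Phase 5j behaviour: the grid is anchored at the origin regardless of prompt length.
fixed = latent_grid_positions(t=5, h=16, w=16, base=0)

print(legacy[0], legacy[-1])  # (120, 120, 120) (124, 135, 135)
print(fixed[0], fixed[-1])    # (0, 0, 0) (4, 15, 15)
```

With base=0 the coordinates depend only on the grid shape, so a longer or shorter prompt no longer moves the latent positions.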
Phase 5j A/B at 256²×17f red-panda-surfing oracle (seed=42, 30 steps, CFG=4.0): legacy (base=text_len) → watercolor; fix (base=0) → photoreal. Scale-confirmed at 480×704×17f: CGI-quality red panda holding a yellow surfboard horizontally, water spray + atmospheric clouds, correct composition.
This closes a seven-phase investigation (4b/4c, 5d, 5e research engagement, 5f RockTalk-weights triangulation, 5g/5h refuted candidates, 5i bisect, 5j fix) tracked in GitHub issue #2 (now closed). Full root-cause analysis: notes/phase5j_THE_FIX.md.
| Capability | Status | Notes |
|---|---|---|
| t2v at 256² × 17f | 🟢 Photoreal | At lower resolutions, subject composition may simplify (surfboard orientation can vary) |
| t2v at 480×704 × 17f (n_lat = 6,600) | 🟢 CGI-quality | Cap, surfboard horizontal, water spray, atmospheric clouds — production-ready |
| t2v at 512² × 17f | 🟢 Photoreal | Similar profile |
| t2v at 768² × 13f (n_lat = 9,216) | 🟢 Photoreal | |
| t2v at 768² × ≥17f (n_lat ≥ 11,520) | 🟡 Partial degradation | Tracked in issue #1 — separate bug class (n_lat ceiling), NOT the watercolor |
| t2v at 768² × 50f (n_lat = 29,952) | ⚠️ Pure-noise output at this scale | Same issue #1 territory; the position-ID fix doesn't address it |
| x2t_video (video VQA / captioning) | ✅ Validated | Validated against Phase 0 oracle. Unaffected by the t2v bug (ViT + UND-tower path only) |
| video_edit (instruction-based) | 🟢 | Same envelope as t2v after the fix |
Production-ready for t2v up to n_lat ≈ 9,216 (256²–768²×13f, 480×704×17f). Use the demo script at scripts/10_t2v_demo.py for a one-command path.
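To check where a given configuration sits relative to the n_lat ≈ 9,216 envelope, the latent-token count can be estimated from the clip shape. The sketch below is inferred from the table values and the Wan2.2 VAE's stated 16× spatial / 4× temporal compression, not taken from the port's code; the helper name latent_token_count and the causal-VAE frame rule (1 + (T − 1) // 4 latent frames) are assumptions.

```python
def latent_token_count(height, width, num_frames, spatial=16, temporal=4):
    """Estimate n_lat, assuming 16x spatial / 4x temporal compression and a
    causal VAE that keeps the first frame: 1 + (T - 1) // 4 latent frames."""
    latent_frames = 1 + (num_frames - 1) // temporal
    return (height // spatial) * (width // spatial) * latent_frames


for h, w, f in [(480, 704, 17), (768, 768, 13), (768, 768, 17), (768, 768, 49)]:
    print(f"{h}x{w} x {f}f -> n_lat = {latent_token_count(h, w, f):,}")
```

Under these assumptions the helper reproduces the n_lat figures quoted in the table (6,600, 9,216, 11,520, and 29,952). Requests for 49 and 50 frames both map to 13 latent frames.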
For production-quality image tasks (t2i, image_edit, x2t_image), use mlx-community/Lance-3B-bf16 (or mlx-community/Lance-3B-8bit for 16 GB Macs).
Why a separate "Video" checkpoint?
ByteDance ships two variants of Lance that differ in fine-tuning:
- Lance_3B: image specialist. Crystal-clear photorealistic t2i.
- Lance_3B_Video: video specialist. Same architecture, further fine-tuned on video data. Bundles the Qwen2.5-VL ViT (669 M) and the larger 126,976-entry latent_pos_embed table that addresses video-resolution token grids.
Quickstart
Install from the lance-mlx source repo:
git clone https://github.com/xocialize/lance-mlx
cd lance-mlx && uv sync
Download this checkpoint:
from huggingface_hub import snapshot_download
weights = snapshot_download("mlx-community/Lance-3B-Video-bf16")
Text-to-video
from lance_mlx.pipeline.t2v import TextToVideoPipeline
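pipe = TextToVideoPipeline.from_pretrained(
    lance_weights_dir=weights,
    vae_safetensors=f"{weights}/vae.safetensors",
)
# Note: per the status table, 768² × 17f (n_lat = 11,520) shows partial degradation;
# use num_frames=13 at 768², or a lower resolution, for production-quality output.
frames = pipe.generate(
    "Five balls on a wooden table: two blue, three green.",
    num_frames=17, height=768, width=768,
    num_steps=30, cfg_scale=4.0, seed=42,
)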
# frames is np.ndarray of shape (T_decoded, H, W, 3) uint8
Encode to MP4 with imageio:
import imageio
# Writing MP4 needs an FFmpeg backend for imageio (e.g. pip install imageio-ffmpeg).
with imageio.get_writer("out.mp4", fps=12, codec="libx264") as writer:
    for f in frames:
        writer.append_data(f)
Video understanding
from lance_mlx.pipeline.understanding import UnderstandingPipeline
pipe = UnderstandingPipeline.from_pretrained(
    lance_weights_dir=weights,
    vit_safetensors=f"{weights}/vit.safetensors",
)
answer = pipe.generate_video(
    video="my_video.mp4",
    question="Describe what happens in this video.",
    num_sample_frames=16, target_h=224, target_w=224,
    max_new_tokens=256, prompt_style="lance",
)
print(answer)
Validated content-correct against the Phase 0 oracle's cooking VQA case (kitchen + pan + spatula + tomato + meat + stirring matched).
Video editing
from lance_mlx.pipeline.video_edit import VideoEditPipeline
pipe = VideoEditPipeline.from_pretrained(
    lance_weights_dir=weights,
    vae_safetensors=f"{weights}/vae.safetensors",
)
frames = pipe.generate(
    input_video="my_video.mp4",
    instruction="Change all the balls to a deep red color.",
    height=256, width=256, num_frames=17,
    num_steps=30, cfg_scale=4.0, seed=42,
)
Performance (M5 Max 128 GB)
| Task | Configuration | Wall-clock |
|---|---|---|
| t2v | 256² × 16f, 30 steps, CFG=4.0 | ~33 s |
| t2v | 512² × 16f, 30 steps, CFG=4.0 | ~60 s |
| t2v | 768² × 13f, 30 steps, CFG=4.0 | ~145 s |
| t2v | 768² × 17f, 30 steps, CFG=4.0 | ~20 min |
| t2v | 768² × 49f, 30 steps, CFG=4.0 | ~2¼ hours (impractical) |
CFG doubles the forward cost because the conditional and unconditional passes run sequentially. Attention scales as O(N²) in the latent-token count, so high-frame, high-resolution combinations quickly become impractical. A KV cache for the text prefix is a Phase 5 follow-up.
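As a rough illustration of the quadratic term, the sketch below compares attention cost at two latent-token counts. It ignores text-prefix tokens and every non-attention cost, and the helper name relative_attention_cost is hypothetical; the token counts are the n_lat values quoted in the status table.

```python
def relative_attention_cost(n_lat, n_ref):
    """Ratio of O(N^2) attention cost at n_lat tokens versus n_ref tokens."""
    return (n_lat / n_ref) ** 2


ref = 9_216  # 768² × 13f
for label, n in [("768² × 17f", 11_520), ("768² × 49f", 29_952)]:
    ratio = relative_attention_cost(n, ref)
    print(f"{label}: ~{ratio:.2f}x the attention cost of 768² × 13f")
```

The wall-clock figures in the table grow faster than this attention-only ratio (for example, ~145 s to ~20 min between 13f and 17f at 768²), so the estimate is best read as a lower bound. The source does not say what accounts for the remainder.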
Files in this repo
| File | Size | Purpose |
|---|---|---|
| model.safetensors | 12.87 GB | LLM weights (1021 tensors, both UND + GEN towers, with 126,976-entry latent_pos_embed) |
| vit.safetensors | 1.34 GB | Qwen2.5-VL ViT (semantic encoder for x2t_video) |
| vae.safetensors | 1.41 GB | Lance's bundled Wan2.2 VAE (also available standalone as mlx-community/Wan2.2-VAE-Lance-bf16) |
| config.json | – | Qwen2_5_VLForConditionalGeneration config |
| conversion_report.json | – | Provenance |
| tokenizer.json / vocab.json | – | Qwen2.5-VL vocabulary |
Provenance
Source: bytedance-research/Lance/Lance_3B_Video/model.safetensors (1411 tensors including bundled ViT; 6.437 B LLM + 0.669 B ViT params).
Converted via scripts/02_convert.py. The bundled ViT is extracted to a sibling vit.safetensors with the vit_model. prefix stripped, matching the layout convention of the image-specialist repo.
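The prefix-stripping step can be pictured as splitting one name-to-tensor mapping into two. The sketch below is a minimal illustration, not the contents of scripts/02_convert.py; the helper name split_by_prefix and the example tensor names are hypothetical, and strings stand in for real arrays.

```python
def split_by_prefix(tensors, prefix="vit_model."):
    """Split a name -> tensor mapping into (rest, extracted).

    Keys starting with `prefix` move to `extracted` with the prefix removed.
    Hypothetical helper; the actual conversion script may work differently.
    """
    rest, extracted = {}, {}
    for name, value in tensors.items():
        if name.startswith(prefix):
            extracted[name[len(prefix):]] = value
        else:
            rest[name] = value
    return rest, extracted


# Toy stand-ins: made-up tensor names, strings instead of arrays.
checkpoint = {
    "vit_model.blocks.0.attn.qkv.weight": "tensor-a",
    "model.layers.0.mlp.up_proj.weight": "tensor-b",
}
llm, vit = split_by_prefix(checkpoint)
print(sorted(llm))  # ['model.layers.0.mlp.up_proj.weight']
print(sorted(vit))  # ['blocks.0.attn.qkv.weight']
```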
Tips
- Use concrete-subject prompts. "Five red apples in a bowl" works better than "the joy of friendship in motion." The model can render abstract scenes, but the painterly aesthetic on already-abstract subjects can read as overly abstract.
- Smaller scales iterate faster. 256² × 16 frames is the fastest test config (~33 s); good for prompt iteration. Scale up once you find a prompt you like.
- English + Chinese prompts work. Other languages are out of distribution (Qwen2.5-VL was trained primarily on en + zh).
Limitations
- bf16 only. 4-bit + 8-bit quantization in progress (Phase 5b). Naive INT4 has been observed to degrade the GEN expert per Reza2kn/lance-quant's findings; quantization needs per-tower calibration.
- No streaming or batched generation.
- CFG doubles forward cost. A future KV-cache for the text + clean-ref prefix would save ~30% per step.
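The CFG cost can be seen in a toy sampler: each guided step evaluates the model twice, once with and once without conditioning, then blends the two predictions. The sketch below uses the standard classifier-free guidance blend and a plain Euler update on a stand-in velocity function. It is not the Lance network or the port's sampler, and toy_velocity and guided_step are hypothetical names.

```python
calls = 0


def toy_velocity(x, conditioned):
    """Stand-in for one model forward pass; not the Lance network."""
    global calls
    calls += 1
    target = 1.0 if conditioned else 0.0
    return [target - xi for xi in x]


def guided_step(x, dt, cfg_scale):
    """One Euler step using the standard classifier-free guidance blend."""
    v_cond = toy_velocity(x, conditioned=True)     # forward pass 1
    v_uncond = toy_velocity(x, conditioned=False)  # forward pass 2
    v = [u + cfg_scale * (c - u) for c, u in zip(v_cond, v_uncond)]
    return [xi + dt * vi for xi, vi in zip(x, v)]


x = [0.0, 0.5]
num_steps = 30
for _ in range(num_steps):
    x = guided_step(x, dt=1.0 / num_steps, cfg_scale=4.0)

print(calls)  # 60: two forward passes per step
```

Caching the keys and values of any prefix that is identical across steps would remove part of that repeated work, which is the saving the KV-cache item above refers to.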
Architecture (shared with the image specialist)
- Two expert towers (LLM_UND, LLM_GEN), each initialized from Qwen2.5-VL-3B-Instruct, with per-expert FFN, output projection, and QK-norm.
- Modality-deterministic routing: text + Qwen2.5-VL ViT semantic tokens → LLM_UND (autoregressive); Wan2.2 VAE latent tokens → LLM_GEN (flow-matching velocity prediction). No learned gate.
- MaPE: modality-aware RoPE with per-modality temporal anchor.
- Wan2.2 3D causal VAE (16× spatial / 4× temporal compression, 48-channel latent).
- Bidirectional attention within latent block.
- Untied LM head.
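Modality-deterministic routing means the tower is chosen by token type alone, with no learned gate. A minimal sketch of that lookup follows; the Modality enum and route helper are hypothetical names for illustration and do not reflect the port's internal API.

```python
from enum import Enum


class Modality(Enum):
    TEXT = "text"
    VIT_SEMANTIC = "vit_semantic"
    VAE_LATENT = "vae_latent"


# Fixed mapping from token modality to expert tower; nothing here is learned.
ROUTES = {
    Modality.TEXT: "LLM_UND",
    Modality.VIT_SEMANTIC: "LLM_UND",
    Modality.VAE_LATENT: "LLM_GEN",
}


def route(modalities):
    """Return the tower name for each token, based only on its modality."""
    return [ROUTES[m] for m in modalities]


sequence = [Modality.TEXT, Modality.VIT_SEMANTIC, Modality.VAE_LATENT]
print(route(sequence))  # ['LLM_UND', 'LLM_UND', 'LLM_GEN']
```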
License
This MLX port: Apache 2.0.
Underlying weights:
- Lance: Apache 2.0 (ByteDance Intelligent Creation Lab).
- Wan2.2 VAE: Apache 2.0 (Alibaba).
- Qwen2.5-VL: Apache 2.0 (Alibaba).
See NOTICE for attribution.
Citation
@article{fu2026lance,
title={Lance: Unified Multimodal Modeling by Multi-Task Synergy},
author={Fu, Fengyi and Huang, Mengqi and Wu, Shaojin and others},
journal={arXiv preprint arXiv:2605.18678},
year={2026}
}
Links
- MLX port code + phase notes: github.com/xocialize/lance-mlx
- Original PyTorch model: bytedance-research/Lance
- Image specialist (production): mlx-community/Lance-3B-bf16
- Wan2.2 VAE (standalone): mlx-community/Wan2.2-VAE-Lance-bf16