📂 Part of the Lance MLX collection on mlx-community.

Lance-3B-Video-bf16 (MLX, video specialist)

MLX port of ByteDance Intelligent Creation Lab's Lance — the video-specialist Lance_3B_Video checkpoint, converted to bf16 for Apple Silicon. ~6.44 B LLM parameters + 669 M Qwen2.5-VL ViT bundled, with the 126,976-entry latent_pos_embed table needed for video-scale latent grids.

Lance is ByteDance's 3B-active unified multimodal model (paper, code, HF original). This is not Lance/LanceDB, the columnar data format.

Status — 🟢 t2v in production after Phase 5j position-ID fix (2026-05-21)

The Phase 5j watercolor fix shipped on 2026-05-21. The root cause was a port-side bug in _build_position_ids: the latent block's mrope (t, h, w) grid was anchored at base = text_len_before_latents. With our verbose chat template, the latent positions therefore shifted with prompt length and left Qwen2.5-VL's training distribution (visual tokens are trained against grid-origin coordinates, not coordinates offset by the preceding text positions). The drift smeared high-frequency detail into a painterly, watercolor-like look. The fix anchors the latent grid at base = 0 regardless of prompt length, and TextToVideoPipeline.generate now defaults to latent_pos_base=0.
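To make the anchoring issue concrete, here is a minimal sketch of how a (t, h, w) position grid moves when a base offset is added. It is not the port's _build_position_ids: the function name latent_position_ids, the 5 × 16 × 16 grid, the example prompt lengths, and the assumption that the base is added to all three axes are all illustrative.

```python
import numpy as np

def latent_position_ids(t, h, w, base=0):
    """Return a (3, t*h*w) array of (t, h, w) coordinates offset by `base`.

    Hypothetical helper for illustration only; assumes the same base is
    added to every axis.
    """
    tt, hh, ww = np.meshgrid(np.arange(t), np.arange(h), np.arange(w), indexing="ij")
    grid = np.stack([tt.ravel(), hh.ravel(), ww.ravel()])
    return grid + base

# The same 5 x 16 x 16 latent grid placed after a short and a long prompt.
legacy_short = latent_position_ids(5, 16, 16, base=40)   # legacy: base = text length
legacy_long = latent_position_ids(5, 16, 16, base=400)
fixed = latent_position_ids(5, 16, 16, base=0)           # fix: base = 0

# The legacy coordinates move with prompt length; the fixed ones do not.
print(legacy_short.max(), legacy_long.max(), fixed.max())
```

With base = 0 the coordinates depend only on the grid size, so two prompts of different lengths present the generator with identical latent positions.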

Phase 5j A/B at 256²×17f red-panda-surfing oracle (seed=42, 30 steps, CFG=4.0): legacy (base=text_len) → watercolor; fix (base=0) → photoreal. Scale-confirmed at 480×704×17f: CGI-quality red panda holding a yellow surfboard horizontally, water spray + atmospheric clouds, correct composition.

This closes a seven-phase investigation (4b/4c, 5d, 5e research engagement, 5f RockTalk-weights triangulation, 5g/5h refuted candidates, 5i bisect, 5j fix) tracked in GitHub issue #2 (now closed). Full root-cause analysis: notes/phase5j_THE_FIX.md.

Capability	Status	Notes
t2v at 256² × 17f	🟢 Photoreal	At lower resolutions, subject composition may simplify (surfboard orientation can vary)
t2v at 480×704 × 17f (n_lat = 6,600)	🟢 CGI-quality	Cap, surfboard horizontal, water spray, atmospheric clouds — production-ready
t2v at 512² × 17f	🟢 Photoreal	Similar profile
t2v at 768² × 13f (n_lat = 9,216)	🟢 Photoreal	Upper end of the production-ready range
t2v at 768² × ≥17f (n_lat ≥ 11,520)	🟡 Partial degradation	Tracked in issue #1 — separate bug class (n_lat ceiling), NOT the watercolor
t2v at 768² × 50f (n_lat = 29,952)	⚠️ Pure-noise output at this scale	Same issue #1 territory; the position-ID fix doesn't address it
x2t_video (video VQA / captioning)	✅ Validated	Matches the Phase 0 oracle. Unaffected by the t2v bug (ViT + UND-tower path only)
video_edit (instruction-based)	🟢 Same envelope as t2v after the fix

Production-ready for t2v up to n_lat ≈ 9,216 (256²–768²×13f, 480×704×17f). Use the demo script at scripts/10_t2v_demo.py for a one-command path.
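The n_lat values in the table can be reproduced with a short helper. This is a sketch under stated assumptions: the function name n_lat is hypothetical, the 16× spatial and 4× temporal factors are taken from the Architecture section, and the rounding rule (frames − 1) // 4 + 1 is an assumption that happens to match every n_lat figure listed above.

```python
def n_lat(height, width, num_frames, spatial=16, temporal=4):
    """Latent-token count for one clip.

    Assumes (num_frames - 1) // temporal + 1 latent frames and a
    height/spatial x width/spatial latent grid per frame.
    """
    lat_t = (num_frames - 1) // temporal + 1
    return lat_t * (height // spatial) * (width // spatial)

PRODUCTION_CEILING = 9_216  # from the status table: 768² × 13f

for h, w, f in [(256, 256, 17), (480, 704, 17), (768, 768, 13), (768, 768, 17)]:
    tokens = n_lat(h, w, f)
    verdict = "ok" if tokens <= PRODUCTION_CEILING else "above ceiling"
    print(f"{h}x{w} x {f}f -> n_lat = {tokens:,} ({verdict})")
```

Under this rounding, 49 and 50 frames both map to 13 latent frames, which gives n_lat = 29,952 at 768².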

For production-quality image tasks (t2i, image_edit, x2t_image), use mlx-community/Lance-3B-bf16 (or mlx-community/Lance-3B-8bit for 16 GB Macs).

Why a separate "Video" checkpoint?

ByteDance ships two variants of Lance that differ in fine-tuning:

Lance_3B — image specialist. Crystal-clear photorealistic t2i.
Lance_3B_Video — video specialist. Same architecture, further fine-tuned on video data. Bundles the Qwen2.5-VL ViT (669 M) and the larger 126,976-entry latent_pos_embed table that addresses video-resolution token grids.

Quickstart

Install from the lance-mlx source repo:

git clone https://github.com/xocialize/lance-mlx
cd lance-mlx && uv sync

Download this checkpoint:

from huggingface_hub import snapshot_download
weights = snapshot_download("mlx-community/Lance-3B-Video-bf16")
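Before building a pipeline it can help to confirm that the snapshot contains the files the pipelines load. The helper below is a hypothetical convenience, not part of lance-mlx; the file list is taken from the "Files in this repo" table, and the demo runs against a temporary directory rather than a real download.

```python
import tempfile
from pathlib import Path

# File names from the "Files in this repo" table.
EXPECTED_FILES = ["model.safetensors", "vit.safetensors", "vae.safetensors", "config.json"]

def missing_files(weights_dir, expected=EXPECTED_FILES):
    """Return the expected file names that are not present in weights_dir."""
    root = Path(weights_dir)
    return [name for name in expected if not (root / name).is_file()]

# Demo on a throwaway directory that only contains config.json.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "config.json").write_text("{}")
    demo_missing = missing_files(tmp)

print(demo_missing)
```

In practice you would call missing_files(weights) on the path returned by snapshot_download and stop early if the list is non-empty.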

Text-to-video

from lance_mlx.pipeline.t2v import TextToVideoPipeline

pipe = TextToVideoPipeline.from_pretrained(
    lance_weights_dir=weights,
    vae_safetensors=f"{weights}/vae.safetensors",
)
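frames = pipe.generate(
    "Five balls on a wooden table: two blue, three green.",
    # 768² × 13f stays inside the validated range (n_lat = 9,216);
    # 768² × 17f is partially degraded and much slower (issue #1).
    num_frames=13, height=768, width=768,
    num_steps=30, cfg_scale=4.0, seed=42,
)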
# frames is np.ndarray of shape (T_decoded, H, W, 3) uint8

Encode to MP4 with imageio:

import imageio
with imageio.get_writer("out.mp4", fps=12, codec="libx264") as writer:
    for f in frames:
        writer.append_data(f)

Video understanding

from lance_mlx.pipeline.understanding import UnderstandingPipeline

pipe = UnderstandingPipeline.from_pretrained(
    lance_weights_dir=weights,
    vit_safetensors=f"{weights}/vit.safetensors",
)
answer = pipe.generate_video(
    video="my_video.mp4",
    question="Describe what happens in this video.",
    num_sample_frames=16, target_h=224, target_w=224,
    max_new_tokens=256, prompt_style="lance",
)
print(answer)

Validated content-correct against the Phase 0 oracle's cooking VQA case (kitchen + pan + spatula + tomato + meat + stirring matched).
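The num_sample_frames argument implies that a subset of the clip's frames reaches the ViT. How the pipeline picks them is not documented here, so the following is only a sketch of one common choice, evenly spaced indices; the function name sample_frame_indices is hypothetical.

```python
import numpy as np

def sample_frame_indices(total_frames, num_sample_frames=16):
    """Evenly spaced frame indices covering the whole clip.

    Assumption: uniform sampling. The actual pipeline may use a
    different strategy.
    """
    if total_frames <= 0:
        raise ValueError("total_frames must be positive")
    n = min(num_sample_frames, total_frames)
    return np.linspace(0, total_frames - 1, n).round().astype(int)

idx = sample_frame_indices(300, 16)
print(idx)
```

Short clips with fewer frames than num_sample_frames simply return every frame once.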

Video editing

from lance_mlx.pipeline.video_edit import VideoEditPipeline

pipe = VideoEditPipeline.from_pretrained(
    lance_weights_dir=weights,
    vae_safetensors=f"{weights}/vae.safetensors",
)
frames = pipe.generate(
    input_video="my_video.mp4",
    instruction="Change all the balls to a deep red color.",
    height=256, width=256, num_frames=17,
    num_steps=30, cfg_scale=4.0, seed=42,
)

Performance (M5 Max 128 GB)

Task	Configuration	Wall-clock
t2v	256² × 16f, 30 steps, CFG=4.0	~33 s
t2v	512² × 16f, 30 steps, CFG=4.0	~60 s
t2v	768² × 13f, 30 steps, CFG=4.0	~145 s
t2v	768² × 17f, 30 steps, CFG=4.0	~20 min
t2v	768² × 49f, 30 steps, CFG=4.0	~2¼ hours (impractical)

CFG doubles the forward cost because the conditional and unconditional passes run sequentially. Attention scales as O(N²) in the latent-token count, so high-frame, high-resolution combinations quickly become impractical. A KV cache for the text prefix is a Phase 5 follow-up.
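The doubling is visible in a minimal sketch of one classifier-free guidance step. This uses the standard CFG combination; whether Lance's sampler uses exactly this form is an assumption, and cfg_velocity and toy_velocity are hypothetical names standing in for the real model call.

```python
import numpy as np

def cfg_velocity(velocity_fn, latents, cond, uncond, cfg_scale=4.0):
    """Combine conditional and unconditional predictions (standard CFG)."""
    v_cond = velocity_fn(latents, cond)      # forward pass 1
    v_uncond = velocity_fn(latents, uncond)  # forward pass 2
    return v_uncond + cfg_scale * (v_cond - v_uncond)

calls = []

def toy_velocity(latents, prompt_embedding):
    """Toy stand-in for the model; records each call."""
    calls.append(1)
    return latents * 0.0 + prompt_embedding

x = np.zeros((4, 48))  # 4 latent tokens, 48 channels
v = cfg_velocity(toy_velocity, x, cond=1.0, uncond=0.0, cfg_scale=4.0)
print(len(calls), v[0, 0])
```

Each sampling step issues two model calls, so a 30-step run costs 60 forward passes.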

Files in this repo

File	Size	Purpose
`model.safetensors`	12.87 GB	LLM weights (1021 tensors, both UND + GEN towers, with 126,976-entry latent_pos_embed)
`vit.safetensors`	1.34 GB	Qwen2.5-VL ViT (semantic encoder for x2t_video)
`vae.safetensors`	1.41 GB	Lance's bundled Wan2.2 VAE (also available standalone as `mlx-community/Wan2.2-VAE-Lance-bf16`)
`config.json`	–	`Qwen2_5_VLForConditionalGeneration` config
`conversion_report.json`	–	Provenance
`tokenizer.json` / `vocab.json`	–	Qwen2.5-VL vocabulary
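The weight-file sizes follow from the parameter counts given under Provenance at 2 bytes per bf16 parameter. A quick check, assuming GB means 10^9 bytes and ignoring the small safetensors header:

```python
def bf16_size_gb(num_params):
    """Approximate size in GB (10^9 bytes) at 2 bytes per bf16 parameter."""
    return num_params * 2 / 1e9

llm_gb = bf16_size_gb(6.437e9)   # LLM parameter count from Provenance
vit_gb = bf16_size_gb(0.669e9)   # ViT parameter count from Provenance
print(f"LLM ~{llm_gb:.2f} GB, ViT ~{vit_gb:.2f} GB")
```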

Provenance

Source: bytedance-research/Lance/Lance_3B_Video/model.safetensors (1411 tensors including the bundled ViT; 6.437 B LLM + 0.669 B ViT params). Converted via scripts/02_convert.py. The bundled ViT is extracted to a sibling vit.safetensors with the "vit_model." prefix stripped from each tensor name, matching the layout convention of the image-specialist repo.
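The extraction step amounts to partitioning tensor names by prefix. The sketch below is not scripts/02_convert.py; split_by_prefix is a hypothetical helper, and the tensor names in the demo are invented for illustration.

```python
def split_by_prefix(tensors, prefix="vit_model."):
    """Split a name -> tensor mapping into (prefixed, rest).

    Names in the first dict have the prefix stripped.
    """
    extracted, remaining = {}, {}
    for name, value in tensors.items():
        if name.startswith(prefix):
            extracted[name[len(prefix):]] = value
        else:
            remaining[name] = value
    return extracted, remaining

# Invented tensor names; the real checkpoint's keys may differ.
demo = {
    "vit_model.blocks.0.attn.qkv.weight": "vit-tensor",
    "language_model.layers.0.mlp.up_proj.weight": "llm-tensor",
}
vit, llm = split_by_prefix(demo)
print(sorted(vit), sorted(llm))
```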

Tips

Use concrete-subject prompts. "Five red apples in a bowl" works better than "the joy of friendship in motion." The model can render abstract scenes, but the painterly aesthetic on already-abstract subjects can read as overly abstract.
Smaller scales iterate faster. 256² × 16 frames is the fastest test config (~33 s); good for prompt iteration. Scale up once you find a prompt you like.
English + Chinese prompts work. Other languages are out of distribution (Qwen2.5-VL was trained primarily on en + zh).

Limitations

bf16 only. 4-bit and 8-bit quantization are in progress (Phase 5b). Per Reza2kn/lance-quant's findings, naive INT4 degrades the GEN expert, so quantization needs per-tower calibration.
No streaming or batched generation.
CFG doubles forward cost. A future KV-cache for the text + clean-ref prefix would save ~30% per step.

Architecture (shared with the image specialist)

Two expert towers (LLM_UND, LLM_GEN), each initialized from Qwen2.5-VL-3B-Instruct, with per-expert FFN, output projection, and QK-norm.
Modality-deterministic routing: text + Qwen2.5-VL ViT semantic tokens → LLM_UND (autoregressive); Wan2.2 VAE latent tokens → LLM_GEN (flow-matching velocity prediction). No learned gate.
MaPE — modality-aware RoPE with per-modality temporal anchor.
Wan2.2 3D causal VAE (16× spatial / 4× temporal compression, 48-channel latent).
Bidirectional attention within latent block.
Untied LM head.
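As an illustration of "bidirectional attention within latent block", here is a sketch of a mask that is causal over text tokens and fully connected inside the latent block. The layout is an assumption: it supposes the text prefix comes first, latents attend to the whole prefix, and text never attends to latents. build_attention_mask is a hypothetical name, not a lance-mlx function.

```python
import numpy as np

def build_attention_mask(num_text, num_latent):
    """Boolean mask where mask[q, k] is True if query q may attend to key k."""
    n = num_text + num_latent
    mask = np.tril(np.ones((n, n), dtype=bool))  # causal baseline
    mask[num_text:, num_text:] = True            # bidirectional inside the latent block
    return mask

mask = build_attention_mask(num_text=3, num_latent=4)
print(mask.astype(int))
```

The lower-right 4 × 4 block is all ones, while the upper-left 3 × 3 block stays lower-triangular.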

License

This MLX port: Apache 2.0.

Underlying weights:

Lance: Apache 2.0 (ByteDance Intelligent Creation Lab).
Wan2.2 VAE: Apache 2.0 (Alibaba).
Qwen2.5-VL: Apache 2.0 (Alibaba).

See NOTICE for attribution.

Citation

@article{fu2026lance,
  title={Lance: Unified Multimodal Modeling by Multi-Task Synergy},
  author={Fu, Fengyi and Huang, Mengqi and Wu, Shaojin and others},
  journal={arXiv preprint arXiv:2605.18678},
  year={2026}
}
