mixture-of-experts
Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer
• arXiv:1701.06538
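The gating function this paper introduces is compact enough to sketch directly. Below is a minimal NumPy version of its noisy top-k gating; the sizes (d_model, n_experts, k) and the weight initializations are illustrative, not the paper's settings.

```python
# A minimal NumPy sketch of noisy top-k gating (arXiv:1701.06538):
# H(x)_i = (x W_g)_i + N(0,1) * softplus((x W_noise)_i), then softmax
# over only the k largest logits. Sizes here are toy values.
import numpy as np

rng = np.random.default_rng(0)
d_model, n_experts, k = 16, 8, 2

W_gate = rng.normal(size=(d_model, n_experts))   # trainable gate weights
W_noise = rng.normal(size=(d_model, n_experts))  # trainable noise weights

def softplus(z):
    return np.log1p(np.exp(z))

def noisy_top_k_gate(x):
    """Return sparse gate weights G(x): zeros except at the top-k experts."""
    clean = x @ W_gate
    noisy = clean + rng.normal(size=clean.shape) * softplus(x @ W_noise)
    # Keep the k largest logits and set the rest to -inf before the softmax,
    # so non-selected experts get exactly zero weight (and no compute).
    topk = np.argsort(noisy, axis=-1)[..., -k:]
    masked = np.full_like(noisy, -np.inf)
    np.put_along_axis(masked, topk,
                      np.take_along_axis(noisy, topk, axis=-1), axis=-1)
    e = np.exp(masked - masked.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

x = rng.normal(size=(4, d_model))   # a batch of 4 token vectors
print(noisy_top_k_gate(x))          # each row has exactly k nonzero gates
```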
Sparse Networks from Scratch: Faster Training without Losing Performance
• arXiv:1907.04840
ZeRO: Memory Optimizations Toward Training Trillion Parameter Models
• arXiv:1910.02054
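ZeRO's headline numbers follow from a simple per-GPU memory model for mixed-precision Adam: 2 bytes/parameter for fp16 weights, 2 for fp16 gradients, and K = 12 for optimizer states (fp32 weights, momentum, variance). A small calculator, with function and argument names of my choosing:

```python
# A back-of-the-envelope sketch of the ZeRO memory model (arXiv:1910.02054).
# Stages partition progressively more state across the data-parallel group.
def zero_bytes_per_gpu(n_params, n_gpus, stage):
    p, g, K = 2.0, 2.0, 12.0  # bytes per parameter for params / grads / states
    if stage == 0:   # plain data parallelism: everything replicated
        return (p + g + K) * n_params
    if stage == 1:   # P_os: partition optimizer states
        return (p + g) * n_params + K * n_params / n_gpus
    if stage == 2:   # P_os+g: also partition gradients
        return p * n_params + (g + K) * n_params / n_gpus
    if stage == 3:   # P_os+g+p: also partition parameters
        return (p + g + K) * n_params / n_gpus
    raise ValueError("stage must be 0..3")

# The paper's running example: 7.5B parameters on 64 GPUs.
for s in range(4):
    print(f"stage {s}: {zero_bytes_per_gpu(7.5e9, 64, s) / 1e9:.1f} GB")
```

On that example the stages work out to roughly 120, 31.4, 16.6, and 1.9 GB per GPU, matching the figures the paper reports.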
A Mixture of h-1 Heads is Better than h Heads
• arXiv:2005.06537
GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding
• arXiv:2006.16668
Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity
• arXiv:2101.03961
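The Switch routing rule and its load-balancing loss are short enough to sketch: each token goes to its single highest-probability expert, and the auxiliary loss alpha * N * sum_i f_i * P_i pushes both the dispatch fractions f and the mean router probabilities P toward uniform. A minimal NumPy version with illustrative sizes and alpha:

```python
# A minimal sketch of Switch-style top-1 routing and the paper's
# load-balancing auxiliary loss (arXiv:2101.03961). Toy sizes.
import numpy as np

rng = np.random.default_rng(0)
n_tokens, n_experts, alpha = 32, 4, 0.01

logits = rng.normal(size=(n_tokens, n_experts))          # router output
probs = np.exp(logits) / np.exp(logits).sum(-1, keepdims=True)

assignment = probs.argmax(-1)                            # top-1 expert per token
f = np.bincount(assignment, minlength=n_experts) / n_tokens  # dispatch fraction
P = probs.mean(0)                                        # mean router probability
aux_loss = alpha * n_experts * (f * P).sum()
print(assignment, aux_loss)  # loss is minimized when f and P are uniform
```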
FastMoE: A Fast Mixture-of-Expert Training System
• arXiv:2103.13262
BASE Layers: Simplifying Training of Large, Sparse Models
• arXiv:2103.16716
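BASE layers drop heuristic balancing losses by posing routing as a linear assignment problem, so every expert receives exactly tokens/experts tokens. The sketch below uses SciPy's Hungarian solver in place of the auction algorithm the paper uses at scale; the scores and sizes are toy values.

```python
# A minimal sketch of balanced token-to-expert assignment in the spirit of
# BASE layers (arXiv:2103.16716), solved here with scipy's Hungarian method.
import numpy as np
from scipy.optimize import linear_sum_assignment

rng = np.random.default_rng(0)
n_tokens, n_experts = 16, 4
capacity = n_tokens // n_experts

scores = rng.normal(size=(n_tokens, n_experts))      # token-expert affinities
# Duplicate each expert column `capacity` times so the problem is square,
# then maximize total affinity with one token per expert slot.
slot_scores = np.repeat(scores, capacity, axis=1)    # (n_tokens, n_tokens)
rows, cols = linear_sum_assignment(slot_scores, maximize=True)
assignment = cols // capacity                        # expert id per token
print(np.bincount(assignment))                       # perfectly balanced load
```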
SpeechMoE: Scaling to Large Acoustic Models with Dynamic Routing Mixture of Experts
• arXiv:2105.03036
DSelect-k: Differentiable Selection in the Mixture of Experts with Applications to Multi-Task Learning
• arXiv:2106.03760
Scaling Vision with Sparse Mixture of Experts
• arXiv:2106.05974
Hash Layers For Large Sparse Models
• arXiv:2106.04426
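Hash layer routing needs no learned router at all, which makes the sketch nearly trivial; the fixed random table below mirrors the paper's random-hash-on-token-id variant, with toy sizes.

```python
# A minimal sketch of Hash Layer routing (arXiv:2106.04426): the expert is
# a fixed random function of the input token id, so nothing is learned in
# the router and balance depends only on the token distribution.
import numpy as np

rng = np.random.default_rng(0)
vocab_size, n_experts = 50_000, 8
hash_table = rng.integers(0, n_experts, size=vocab_size)  # fixed at init

def route(token_ids):
    return hash_table[token_ids]            # expert id per token

print(route(np.array([11, 42, 42, 977])))   # identical tokens share an expert
```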
DEMix Layers: Disentangling Domains for Modular Language Modeling
• arXiv:2108.05036
A Machine Learning Perspective on Predictive Coding with PAQ
• arXiv:1108.3298
Efficient Large Scale Language Modeling with Mixtures of Experts
• arXiv:2112.10684
Unified Scaling Laws for Routed Language Models
• arXiv:2202.01169
ST-MoE: Designing Stable and Transferable Sparse Expert Models
• arXiv:2202.08906
Mixture-of-Experts with Expert Choice Routing
• arXiv:2202.09368
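Expert-choice routing inverts the usual selection: experts pick tokens rather than tokens picking experts, so load balance holds by construction. A minimal NumPy sketch, with an assumed capacity factor of 2 and toy sizes:

```python
# A minimal sketch of expert-choice routing (arXiv:2202.09368): each expert
# takes its top-`capacity` tokens. A token may be picked by several experts
# or by none.
import numpy as np

rng = np.random.default_rng(0)
n_tokens, n_experts = 16, 4
capacity = 2 * n_tokens // n_experts      # capacity factor of 2

logits = rng.normal(size=(n_tokens, n_experts))
probs = np.exp(logits) / np.exp(logits).sum(-1, keepdims=True)

# Each expert (column) selects the `capacity` tokens with the highest score.
chosen = np.argsort(probs, axis=0)[-capacity:, :]   # (capacity, n_experts)
for e in range(n_experts):
    print(f"expert {e} processes tokens {sorted(chosen[:, e])}")
```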
Multimodal Contrastive Learning with LIMoE: the Language-Image Mixture of Experts
• arXiv:2206.02770
Branch-Train-Merge: Embarrassingly Parallel Training of Expert Language Models
• arXiv:2208.03306
A Review of Sparse Expert Models in Deep Learning
• arXiv:2209.01667
Sparsity-Constrained Optimal Transport
• arXiv:2209.15466
Mixture of Attention Heads: Selecting Attention Heads Per Token
• arXiv:2210.05144
MegaBlocks: Efficient Sparse Training with Mixture-of-Experts
• arXiv:2211.15841
Sparse Upcycling: Training Mixture-of-Experts from Dense Checkpoints
• arXiv:2212.05055
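The upcycling initialization itself is a one-liner per expert: copy the dense checkpoint's FFN weights into every expert and start the router fresh. A sketch with illustrative shapes and dict keys of my choosing:

```python
# A minimal sketch of sparse upcycling (arXiv:2212.05055): an MoE layer is
# initialized from a dense checkpoint by replicating the FFN into each
# expert, while the router is newly initialized.
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff, n_experts = 16, 64, 8

dense_ffn = {                              # pretrained dense checkpoint
    "w_in": rng.normal(size=(d_model, d_ff)),
    "w_out": rng.normal(size=(d_ff, d_model)),
}

moe_layer = {
    "experts": [{k: v.copy() for k, v in dense_ffn.items()}
                for _ in range(n_experts)],              # identical copies
    "router": rng.normal(size=(d_model, n_experts)) * 0.02,  # fresh, small init
}
assert all(np.array_equal(e["w_in"], dense_ffn["w_in"])
           for e in moe_layer["experts"])
```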
Mixture-of-Experts Meets Instruction Tuning: A Winning Combination for Large Language Models
• arXiv:2305.14705
From Sparse to Soft Mixtures of Experts
• arXiv:2308.00951
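Soft MoE replaces discrete routing with two softmaxes over the same logits, one over tokens (dispatch) and one over slots (combine), so the whole layer is differentiable. A minimal NumPy sketch with toy shapes and stand-in linear experts:

```python
# A minimal sketch of a Soft MoE layer (arXiv:2308.00951): each expert slot
# processes a convex combination of all tokens, and each token receives a
# convex combination of all slot outputs.
import numpy as np

rng = np.random.default_rng(0)
n_tokens, d_model, n_experts, slots_per_expert = 8, 16, 4, 1
n_slots = n_experts * slots_per_expert

X = rng.normal(size=(n_tokens, d_model))
Phi = rng.normal(size=(d_model, n_slots))        # learned slot parameters
experts = [rng.normal(size=(d_model, d_model)) for _ in range(n_experts)]

def softmax(z, axis):
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

logits = X @ Phi                                 # (n_tokens, n_slots)
D = softmax(logits, axis=0)                      # dispatch: softmax over tokens
slot_in = D.T @ X                                # (n_slots, d_model)
slot_out = np.stack([slot_in[s] @ experts[s // slots_per_expert]
                     for s in range(n_slots)])
C = softmax(logits, axis=1)                      # combine: softmax over slots
Y = C @ slot_out                                 # (n_tokens, d_model)
print(Y.shape)
```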
Approximating Two-Layer Feedforward Networks for Efficient Transformers
• arXiv:2310.10837
QMoE: Practical Sub-1-Bit Compression of Trillion-Parameter Models
• arXiv:2310.16795
SwitchHead: Accelerating Transformers with Mixture-of-Experts Attention
• arXiv:2312.07987
Mixture of Cluster-conditional LoRA Experts for Vision-language Instruction Tuning
• arXiv:2312.12379
Fast Inference of Mixture-of-Experts Language Models with Offloading
• arXiv:2312.17238
Mixtral of Experts
• arXiv:2401.04088
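Mixtral's routing is the plain top-2 variant: a softmax over the two selected logits, then a weighted sum of the two expert outputs. A toy NumPy sketch with stand-in linear experts (the real experts are SwiGLU FFNs):

```python
# A minimal sketch of Mixtral-style top-2 routing (arXiv:2401.04088):
# each token uses 2 of 8 experts, gated by a softmax over only the two
# selected logits. Sizes are toy values.
import numpy as np

rng = np.random.default_rng(0)
d_model, n_experts, k = 16, 8, 2

W_router = rng.normal(size=(d_model, n_experts))
experts = [rng.normal(size=(d_model, d_model)) for _ in range(n_experts)]

def moe_forward(x):                       # x: (d_model,)
    logits = x @ W_router
    top = np.argsort(logits)[-k:]         # indices of the 2 best experts
    w = np.exp(logits[top] - logits[top].max())
    w /= w.sum()                          # softmax over the selected logits only
    return sum(wi * (x @ experts[i]) for wi, i in zip(w, top))

print(moe_forward(rng.normal(size=d_model)).shape)
```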
MoE-Mamba: Efficient Selective State Space Models with Mixture of Experts
• arXiv:2401.04081
DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models
• arXiv:2401.06066
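DeepSeekMoE combines two ideas: many fine-grained routed experts with several activated per token, plus shared experts that every token always passes through. A toy sketch; the gating below takes a softmax over all routed experts and keeps only the top-k gates, which is my reading of the paper's formulation, with toy sizes throughout.

```python
# A minimal sketch of the two DeepSeekMoE ideas (arXiv:2401.06066):
# fine-grained routed experts plus always-active shared experts.
import numpy as np

rng = np.random.default_rng(0)
d_model, n_shared, n_routed, k = 16, 2, 16, 4   # many small routed experts

W_router = rng.normal(size=(d_model, n_routed))
shared = [rng.normal(size=(d_model, d_model)) for _ in range(n_shared)]
routed = [rng.normal(size=(d_model, d_model)) for _ in range(n_routed)]

def moe_forward(x):
    out = sum(x @ w for w in shared)       # shared experts: always active
    s = np.exp(x @ W_router)
    s /= s.sum()                           # softmax over all routed experts
    top = np.argsort(s)[-k:]               # keep the top-k gates, zero the rest
    return out + sum(s[i] * (x @ routed[i]) for i in top)

print(moe_forward(rng.normal(size=d_model)).shape)
```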