Collections including paper arxiv:2402.01771

- The Impact of Depth and Width on Transformer Language Model Generalization
  Paper • 2310.19956 • Published • 10
- Retentive Network: A Successor to Transformer for Large Language Models
  Paper • 2307.08621 • Published • 172
- RWKV: Reinventing RNNs for the Transformer Era
  Paper • 2305.13048 • Published • 20
- Attention Is All You Need
  Paper • 1706.03762 • Published • 104

- StableSSM: Alleviating the Curse of Memory in State-space Models through Stable Reparameterization
  Paper • 2311.14495 • Published • 1
- Vision Mamba: Efficient Visual Representation Learning with Bidirectional State Space Model
  Paper • 2401.09417 • Published • 62
- SegMamba: Long-range Sequential Modeling Mamba For 3D Medical Image Segmentation
  Paper • 2401.13560 • Published • 1
- Graph-Mamba: Towards Long-Range Graph Sequence Modeling with Selective State Spaces
  Paper • 2402.00789 • Published • 2

- Simple linear attention language models balance the recall-throughput tradeoff
  Paper • 2402.18668 • Published • 20
- Linear Transformers with Learnable Kernel Functions are Better In-Context Models
  Paper • 2402.10644 • Published • 81
- Repeat After Me: Transformers are Better than State Space Models at Copying
  Paper • 2402.01032 • Published • 24
- Zoology: Measuring and Improving Recall in Efficient Language Models
  Paper • 2312.04927 • Published • 2

- Vision Mamba: Efficient Visual Representation Learning with Bidirectional State Space Model
  Paper • 2401.09417 • Published • 62
- VMamba: Visual State Space Model
  Paper • 2401.10166 • Published • 39
- SegMamba: Long-range Sequential Modeling Mamba For 3D Medical Image Segmentation
  Paper • 2401.13560 • Published • 1
- Graph-Mamba: Towards Long-Range Graph Sequence Modeling with Selective State Spaces
  Paper • 2402.00789 • Published • 2

- UI-TARS: Pioneering Automated GUI Interaction with Native Agents
  Paper • 2501.12326 • Published • 65
- MoE-Mamba: Efficient Selective State Space Models with Mixture of Experts
  Paper • 2401.04081 • Published • 73
- Mixture-of-Mamba: Enhancing Multi-Modal State-Space Models with Modality-Aware Sparsity
  Paper • 2501.16295 • Published • 8
- BlackMamba: Mixture of Experts for State-Space Models
  Paper • 2402.01771 • Published • 25

- BlackMamba: Mixture of Experts for State-Space Models
  Paper • 2402.01771 • Published • 25
- OpenMoE: An Early Effort on Open Mixture-of-Experts Language Models
  Paper • 2402.01739 • Published • 28
- MoE-LLaVA: Mixture of Experts for Large Vision-Language Models
  Paper • 2401.15947 • Published • 53
- DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models
  Paper • 2401.06066 • Published • 58

- BlackMamba: Mixture of Experts for State-Space Models
  Paper • 2402.01771 • Published • 25
- OpenMoE: An Early Effort on Open Mixture-of-Experts Language Models
  Paper • 2402.01739 • Published • 28
- DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models
  Paper • 2401.06066 • Published • 58

- LongAlign: A Recipe for Long Context Alignment of Large Language Models
  Paper • 2401.18058 • Published • 22
- Efficient Tool Use with Chain-of-Abstraction Reasoning
  Paper • 2401.17464 • Published • 21
- Scavenging Hyena: Distilling Transformers into Long Convolution Models
  Paper • 2401.17574 • Published • 17
- Rethinking Interpretability in the Era of Large Language Models
  Paper • 2402.01761 • Published • 23

- DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models
  Paper • 2401.06066 • Published • 58
- GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding
  Paper • 2006.16668 • Published • 4
- OpenMoE: An Early Effort on Open Mixture-of-Experts Language Models
  Paper • 2402.01739 • Published • 28
- BlackMamba: Mixture of Experts for State-Space Models
  Paper • 2402.01771 • Published • 25