hf-mem release added a breakdown of Mixture-of-Experts (MoE) memory usage!
TL;DR: MoEs can be misleading to reason about from active parameters alone, since each token only activates a subset of experts, while the serving setup still needs to account for the full resident memory footprint.
- hf-mem now splits MoE memory into base model weights, routed experts, and KV cache
- Dense models usually load and use most weights on every forward pass, while MoEs load many experts but only route each token to a few of them
- Active parameter count isn't the same as memory footprint, especially for sparse architectures
- Runtime memory reflects what is used per request/token, while load-time memory also includes all the expert weights that need to be resident
- KV cache can still dominate depending on context length, batch size, and concurrency
- Expert Parallelism (EP) helps shard experts across accelerators when expert weights dominate
- Data Parallelism (DP) + EP is often a good fit for throughput-oriented MoE serving
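
To make the points above concrete, here is a rough back-of-the-envelope sketch in Python. It is not hf-mem's implementation or API. The `MoEConfig` class, the helper functions, and every number in `example` are hypothetical and chosen only for illustration. The sketch assumes standard multi-head or grouped-query attention for the KV cache (K and V stored per layer, per KV head), an MoE block in every layer, base weights replicated on each device under DP + EP, and experts split evenly across the EP group. Activations and framework overhead are ignored.

```python
from dataclasses import dataclass

GIB = 1024**3


@dataclass(frozen=True)
class MoEConfig:
    """Hypothetical MoE shape; all values used below are illustrative only."""

    base_params: int          # non-expert weights: attention, embeddings, routers
    params_per_expert: int    # weights in one routed expert of one layer
    num_experts: int          # routed experts per MoE layer
    experts_per_token: int    # top-k experts each token is routed to
    num_layers: int           # assumes every layer has attention and an MoE block
    num_kv_heads: int
    head_dim: int
    bytes_per_param: int = 2  # e.g. bf16 weights
    bytes_per_kv: int = 2     # e.g. bf16 KV cache


def expert_params(cfg: MoEConfig) -> int:
    """Routed-expert weights across all layers (all must be resident)."""
    return cfg.num_layers * cfg.num_experts * cfg.params_per_expert


def total_params(cfg: MoEConfig) -> int:
    """Everything that has to be loaded: base weights plus every expert."""
    return cfg.base_params + expert_params(cfg)


def active_params(cfg: MoEConfig) -> int:
    """Weights a single token touches: base weights plus its top-k experts."""
    routed = cfg.num_layers * cfg.experts_per_token * cfg.params_per_expert
    return cfg.base_params + routed


def kv_cache_bytes(cfg: MoEConfig, context_len: int, num_seqs: int) -> int:
    """KV cache size: 2 (K and V) x layers x KV heads x head dim per token."""
    per_token = 2 * cfg.num_layers * cfg.num_kv_heads * cfg.head_dim * cfg.bytes_per_kv
    return per_token * context_len * num_seqs


def per_device_bytes(
    cfg: MoEConfig, ep_size: int, context_len: int, seqs_per_device: int
) -> int:
    """Per-accelerator estimate under the simplified DP + EP layout above."""
    base = cfg.base_params * cfg.bytes_per_param                     # replicated
    experts = expert_params(cfg) * cfg.bytes_per_param // ep_size    # sharded
    kv = kv_cache_bytes(cfg, context_len, seqs_per_device)           # local requests
    return base + experts + kv


# Made-up configuration, not taken from any real model.
example = MoEConfig(
    base_params=2_000_000_000,
    params_per_expert=50_000_000,
    num_experts=64,
    experts_per_token=4,
    num_layers=32,
    num_kv_heads=8,
    head_dim=128,
)

print(f"total params:  {total_params(example) / 1e9:.1f}B")
print(f"active params: {active_params(example) / 1e9:.1f}B")
print(f"weights:       {total_params(example) * example.bytes_per_param / GIB:.1f} GiB")
print(f"KV cache:      {kv_cache_bytes(example, 32_768, 16) / GIB:.1f} GiB")
print(f"per device:    {per_device_bytes(example, 8, 32_768, 2) / GIB:.1f} GiB")
```

With these made-up numbers, the active parameter count is a small fraction of the total, so sizing hardware from active parameters alone would badly underestimate the resident weights. The KV cache term grows linearly with both context length and concurrency, which is why it can overtake the weights in long-context or high-concurrency serving.
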
Check the repository at https://github.com/alvarobartt/hf-mem