Aleph-Alpha-Research / NeurIPS-WANT-submission-efficient-parallelization-layouts
☆22 · Updated last year
Alternatives and similar repositories for NeurIPS-WANT-submission-efficient-parallelization-layouts
Users interested in NeurIPS-WANT-submission-efficient-parallelization-layouts are comparing it to the libraries listed below.
- Transformers components but in Triton ☆33 · Updated this week
- Odysseus: Playground of LLM Sequence Parallelism ☆69 · Updated 10 months ago
- Awesome Triton Resources ☆27 · Updated 2 weeks ago
- Accelerate LLM preference tuning via prefix sharing with a single line of code ☆41 · Updated 2 weeks ago
- Vocabulary Parallelism ☆19 · Updated 2 months ago
- GPU operators for sparse tensor operations ☆32 · Updated last year
- Summary of system papers/frameworks/codes/tools on training or serving large models ☆56 · Updated last year
- ☆20 · Updated 2 months ago
- [ICLR 2025] TidalDecode: Fast and Accurate LLM Decoding with Position Persistent Sparse Attention ☆36 · Updated 3 weeks ago
- Quantized Attention on GPU ☆45 · Updated 5 months ago
- ☆30 · Updated 11 months ago
- ☆19 · Updated 4 months ago
- Triton version of GQA flash attention, based on the tutorial ☆11 · Updated 9 months ago
- The official implementation of the paper "SimLayerKV: A Simple Framework for Layer-Level KV Cache Reduction" ☆45 · Updated 6 months ago
- Sequence-level 1F1B schedule for LLMs. ☆17 · Updated 11 months ago
- ☆38 · Updated last year
- A simple calculation for LLM MFU. ☆37 · Updated 2 months ago
- Linear Attention Sequence Parallelism (LASP) ☆82 · Updated 11 months ago
- ☆14 · Updated 2 years ago
- Best practices for testing advanced Mixtral, DeepSeek, and Qwen series MoE models using Megatron Core MoE. ☆10 · Updated 2 weeks ago
- Flash-Muon: An Efficient Implementation of Muon Optimizer ☆100 · Updated this week
- Squeezed Attention: Accelerating Long Prompt LLM Inference ☆46 · Updated 5 months ago
- ☆20 · Updated 3 weeks ago
- Boosting 4-bit inference kernels with 2:4 Sparsity ☆73 · Updated 8 months ago
- ☆68 · Updated this week
- [NeurIPS 2024] Fast Best-of-N Decoding via Speculative Rejection ☆44 · Updated 6 months ago
- GPTQ inference TVM kernel ☆38 · Updated last year
- An easily extensible framework for understanding and optimizing CUDA operators, intended for learning purposes only ☆15 · Updated 11 months ago
- PyTorch bindings for CUTLASS grouped GEMM. ☆89 · Updated 2 weeks ago
- LongSpec: Long-Context Speculative Decoding with Efficient Drafting and Verification ☆52 · Updated 2 months ago