Aleph-Alpha-Research / NeurIPS-WANT-submission-efficient-parallelization-layouts
☆22 · Updated last year
Alternatives and similar repositories for NeurIPS-WANT-submission-efficient-parallelization-layouts
Users who are interested in NeurIPS-WANT-submission-efficient-parallelization-layouts are comparing it to the libraries listed below.
- Odysseus: Playground of LLM Sequence Parallelism — ☆69 · Updated 11 months ago
- Transformers components but in Triton — ☆33 · Updated 3 weeks ago
- GPU operators for sparse tensor operations — ☆32 · Updated last year
- ☆38 · Updated last year
- Accelerate LLM preference tuning via prefix sharing with a single line of code — ☆41 · Updated last month
- Vocabulary Parallelism — ☆19 · Updated 2 months ago
- Best practices for testing advanced Mixtral, DeepSeek, and Qwen series MoE models using Megatron Core MoE — ☆14 · Updated last month
- Official implementation of "The Sparse Frontier: Sparse Attention Trade-offs in Transformer LLMs" — ☆32 · Updated last month
- PyTorch bindings for CUTLASS grouped GEMM — ☆93 · Updated this week
- Quantized Attention on GPU — ☆44 · Updated 6 months ago
- ☆49 · Updated 2 weeks ago
- Awesome Triton Resources — ☆28 · Updated last month
- Boosting 4-bit inference kernels with 2:4 Sparsity — ☆75 · Updated 9 months ago
- ☆25 · Updated 6 months ago
- ☆21 · Updated 2 months ago
- ☆53 · Updated this week
- APPy (Annotated Parallelism for Python) enables users to annotate loops and tensor expressions in Python with compiler directives akin to… — ☆23 · Updated 2 weeks ago
- GPTQ inference TVM kernel — ☆38 · Updated last year
- ☆31 · Updated last year
- ☆20 · Updated last month
- ☆46 · Updated last year
- [ACL 2024] RelayAttention for Efficient Large Language Model Serving with Long System Prompts — ☆39 · Updated last year
- [NeurIPS 2024] The official implementation of "Kangaroo: Lossless Self-Speculative Decoding for Accelerating LLMs via Double Early Exitin…" — ☆57 · Updated 11 months ago
- Estimate MFU for DeepSeekV3 — ☆24 · Updated 4 months ago
- ☆70 · Updated 2 weeks ago
- [NeurIPS 2024] Fast Best-of-N Decoding via Speculative Rejection — ☆45 · Updated 7 months ago
- Summary of system papers/frameworks/codes/tools on training or serving large models — ☆57 · Updated last year
- Linear Attention Sequence Parallelism (LASP) — ☆83 · Updated last year
- Repository of the paper "Accelerating Transformer Inference for Translation via Parallel Decoding" — ☆116 · Updated last year
- Flash-Muon: An Efficient Implementation of Muon Optimizer — ☆115 · Updated this week