Aleph-Alpha / NeurIPS-WANT-submission-efficient-parallelization-layouts
☆22 · Updated last year
Alternatives and similar repositories for NeurIPS-WANT-submission-efficient-parallelization-layouts:
Users who are interested in NeurIPS-WANT-submission-efficient-parallelization-layouts are comparing it to the libraries listed below.
- 🔥 A minimal training framework for scaling FLA models ☆24 · Updated this week
- Odysseus: Playground of LLM Sequence Parallelism ☆64 · Updated 7 months ago
- PyTorch bindings for CUTLASS grouped GEMM. ☆58 · Updated 2 months ago
- Transformers components but in Triton ☆29 · Updated 2 months ago
- Tritonbench is a collection of PyTorch custom operators with example inputs to measure their performance. ☆75 · Updated this week
- ☆31 · Updated 7 months ago
- Boosting 4-bit inference kernels with 2:4 Sparsity ☆64 · Updated 4 months ago
- Sequence-level 1F1B schedule for LLMs. ☆17 · Updated 7 months ago
- SQUEEZED ATTENTION: Accelerating Long Prompt LLM Inference ☆36 · Updated last month
- GPU operators for sparse tensor operations ☆30 · Updated 10 months ago
- ☆38 · Updated last year
- GPTQ inference TVM kernel ☆38 · Updated 8 months ago
- Vocabulary Parallelism ☆16 · Updated 2 months ago
- ☆57 · Updated 7 months ago
- Ouroboros: Speculative Decoding with Large Model Enhanced Drafting (EMNLP 2024 main) ☆85 · Updated 3 months ago
- Awesome Triton Resources ☆19 · Updated last month
- Continuous batching and parallel acceleration for RWKV6 ☆24 · Updated 6 months ago
- ☆55 · Updated 3 months ago
- ☆18 · Updated last year
- ☆96 · Updated 4 months ago
- [NeurIPS 2024] Fast Best-of-N Decoding via Speculative Rejection ☆38 · Updated 2 months ago
- QAQ: Quality Adaptive Quantization for LLM KV Cache ☆44 · Updated 9 months ago
- Framework to reduce autotune overhead to zero for well-known deployments. ☆57 · Updated last month
- Quantized Attention on GPU ☆34 · Updated last month
- PyTorch implementation of our paper accepted by ICML 2024: CaM: Cache Merging for Memory-efficient LLMs Inference ☆29 · Updated 6 months ago
- [ACL 2024] RelayAttention for Efficient Large Language Model Serving with Long System Prompts ☆38 · Updated 10 months ago
- ☆23 · Updated 2 months ago
- Distributed IO-aware Attention algorithm ☆18 · Updated 4 months ago
- APPy (Annotated Parallelism for Python) enables users to annotate loops and tensor expressions in Python with compiler directives akin to… ☆21 · Updated 2 weeks ago