InternLM / turbomind
☆96 · Updated 6 months ago
Alternatives and similar repositories for turbomind
Users interested in turbomind are comparing it to the libraries listed below.
- FFPA: Extend FlashAttention-2 with Split-D, ~O(1) SRAM complexity for large headdim, 1.8x~3x faster vs SDPA EA. ☆223 · Updated 2 months ago
- PyTorch bindings for CUTLASS grouped GEMM. ☆124 · Updated 4 months ago
- ☆65 · Updated 5 months ago
- Write HGEMM from scratch using Tensor Cores with the WMMA, MMA, and CuTe APIs, achieving peak performance. ☆121 · Updated 5 months ago
- ☆100 · Updated last year
- An easy-to-use package for implementing SmoothQuant for LLMs (a minimal sketch of the underlying transform follows this list). ☆107 · Updated 6 months ago
- ☆148 · Updated 7 months ago
- Performance of the C++ interface of FlashAttention and FlashAttention-2 in large language model (LLM) inference scenarios. ☆41 · Updated 7 months ago
- QQQ is a hardware-optimized W4A8 quantization solution for LLMs. ☆144 · Updated 2 months ago
- Standalone FlashAttention-2 kernel without a libtorch dependency. ☆112 · Updated last year
- DeeperGEMM: a heavily optimized version. ☆72 · Updated 5 months ago
- ☆101 · Updated 5 months ago
- ☆120 · Updated 2 months ago
- ☆43 · Updated last year
- Compare different hardware platforms via the Roofline Model for LLM inference tasks (a worked roofline estimate follows this list). ☆115 · Updated last year
- Decoding Attention is specially optimized for MHA, MQA, GQA, and MLA using CUDA cores for the decoding stage of LLM inference. ☆45 · Updated 4 months ago
- [MLSys'24] Atom: Low-bit Quantization for Efficient and Accurate LLM Serving. ☆320 · Updated last year
- Efficient GPU support for LLM inference with x-bit quantization (e.g., FP6, FP5). ☆265 · Updated 3 months ago
- A collection of memory-efficient attention operators implemented in the Triton language. ☆282 · Updated last year
- ☆78 · Updated 6 months ago
- A suite for parallel inference of Diffusion Transformers (DiTs) on multi-GPU clusters. ☆50 · Updated last year
- A standalone GEMM kernel for fp16 activations and quantized weights, extracted from FasterTransformer. ☆94 · Updated last month
- Fast and memory-efficient exact attention. ☆96 · Updated this week
- Implement FlashAttention using CuTe. ☆96 · Updated 10 months ago
- [ICLR2025] Breaking Throughput-Latency Trade-off for Long Sequences with Speculative Decoding. ☆130 · Updated 10 months ago
- ☆205 · Updated 5 months ago
- JAX backend for SGL. ☆77 · Updated last week
- ☆107 · Updated 5 months ago
- Utility scripts for PyTorch (e.g., a memory profiler that understands more low-level allocations such as NCCL). ☆59 · Updated last month
- Flash-LLM: Enabling Cost-Effective and Highly-Efficient Large Generative Model Inference with Unstructured Sparsity. ☆221 · Updated 2 years ago
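For the SmoothQuant entry above, here is a minimal NumPy sketch of the core transform from the SmoothQuant paper, not the linked package's API: per-input-channel scales migrate activation outliers into the weights so that both tensors quantize well to INT8. The function names, shapes, and the `alpha` default are illustrative assumptions.

```python
# Sketch of the SmoothQuant smoothing transform (illustrative, not the package's API).
import numpy as np

def smooth_scales(x_absmax, w_absmax, alpha=0.5):
    """Per-input-channel factors s_j = max|X_j|^alpha / max|W_j|^(1-alpha)."""
    s = x_absmax**alpha / w_absmax**(1 - alpha)
    return np.clip(s, 1e-5, None)            # guard against divide-by-zero

def apply_smoothquant(x, w, alpha=0.5):
    """x: [tokens, in_features], w: [out_features, in_features] for y = x @ w.T."""
    x_absmax = np.abs(x).max(axis=0)          # activation outlier magnitude per channel
    w_absmax = np.abs(w).max(axis=0)          # weight magnitude per input channel
    s = smooth_scales(x_absmax, w_absmax, alpha)
    x_smooth = x / s                          # in practice folded into the preceding LayerNorm
    w_smooth = w * s                          # weights are rescaled offline
    return x_smooth, w_smooth

x = np.random.randn(8, 16); x[:, 3] *= 50.0   # channel 3 carries activation outliers
w = np.random.randn(32, 16)
xs, ws = apply_smoothquant(x, w)
assert np.allclose(x @ w.T, xs @ ws.T, atol=1e-6)   # output is mathematically unchanged
```

After smoothing, per-tensor INT8 quantization of both `xs` and `ws` loses much less accuracy than quantizing the original outlier-heavy activations.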
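For the Roofline Model entry, a worked estimate under assumed hardware numbers (roughly A100-class: ~312 TFLOPS FP16 peak, ~2 TB/s HBM bandwidth); real platforms, and the inputs the linked tool expects, will differ. It shows why a batch-1 decode GEMV lands in the memory-bound region of the roofline.

```python
# Roofline estimate with assumed peak numbers; adjust for your own hardware.
PEAK_FLOPS = 312e12      # FP16 tensor-core peak, FLOP/s (assumed)
PEAK_BW    = 2.0e12      # HBM bandwidth, bytes/s (assumed)

def attainable_flops(arithmetic_intensity):
    """Roofline: performance is capped by either compute or memory traffic."""
    return min(PEAK_FLOPS, arithmetic_intensity * PEAK_BW)

# Decode-stage GEMV for one fp16 weight matrix at batch size 1:
# every weight byte is read once and used for roughly one multiply-add.
d_in, d_out = 8192, 8192
flops  = 2 * d_in * d_out                 # one token: 2*K*N FLOPs
bytes_ = 2 * d_in * d_out                 # fp16 weight traffic dominates
ai = flops / bytes_                       # ~1 FLOP/byte, far left on the roofline
print(f"AI = {ai:.1f} FLOP/B, attainable ~{attainable_flops(ai)/1e12:.1f} TFLOPS "
      f"of {PEAK_FLOPS/1e12:.0f} TFLOPS peak (memory-bound)")
```

With these numbers the GEMV can reach only about 2 TFLOPS of a 312 TFLOPS peak, which is why decode throughput tracks memory bandwidth rather than compute.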