☆155, updated Mar 4, 2025
Alternatives and similar repositories for deepseekv2-profile
Users interested in deepseekv2-profile are comparing it to the libraries listed below
- High-performance RMSNorm implementation using SM core storage, i.e. registers and shared memory (☆30, updated Jan 22, 2026)
- ☆13, updated Jan 7, 2025
- A lightweight design for computation-communication overlap (☆223, updated Jan 20, 2026)
- ☆262, updated Jul 11, 2024
- Hydragen: High-Throughput LLM Inference with Shared Prefixes (☆48, updated May 10, 2024)
- ☆21, updated Aug 14, 2024
- Compare different hardware platforms via the Roofline Model for LLM inference tasks (☆120, updated Mar 13, 2024); see the roofline sketch after this list
- Flash attention tutorial written in Python, Triton, CUDA, and CUTLASS (☆488, updated Jan 20, 2026)
- Flash Attention implemented with CuTe (☆101, updated Dec 17, 2024)
- ☆65, updated Apr 26, 2025
- Dynamic Memory Management for Serving LLMs without PagedAttention (☆464, updated May 30, 2025)
- Expert-specialization MoE solution based on CUTLASS (☆27, updated Jan 19, 2026)
- Standalone Flash Attention v2 kernel without a libtorch dependency (☆114, updated Sep 10, 2024)
- FP16xINT4 LLM inference kernel that can achieve near-ideal ~4x speedups up to medium batch sizes of 16-32 tokens (☆1,025, updated Sep 4, 2024)
- Benchmark tests supporting the TiledCUDA library (☆18, updated Nov 19, 2024)
- FP8 flash attention implemented with the CUTLASS repository on the Ada architecture (☆79, updated Aug 12, 2024)
- ☆116, updated May 16, 2025
- PyTorch implementation of the Flash Spectral Transform Unit (☆21, updated Sep 19, 2024)
- Fast inference from large language models via speculative decoding (☆894, updated Aug 22, 2024); see the speculative-decoding sketch after this list
- ☆52, updated May 19, 2025
- DeeperGEMM: crazy optimized version (☆74, updated May 5, 2025)
- A fast communication-overlapping library for tensor/expert parallelism on GPUs (☆1,261, updated Aug 28, 2025)
- DeepSeek-V3/R1 inference performance simulator (☆179, updated Mar 27, 2025)
- 🐳 Efficient Triton implementations for "Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention" (☆969, updated Feb 5, 2026)
- USP: Unified (a.k.a. Hybrid, 2D) Sequence Parallel Attention for long-context Transformer model training and inference (☆644, updated Jan 15, 2026)
- Odysseus: Playground of LLM Sequence Parallelism (☆79, updated Jun 17, 2024)
- [ACL 2024] RelayAttention for Efficient Large Language Model Serving with Long System Prompts (☆40, updated Feb 29, 2024)
- Fastest kernels written from scratch (☆550, updated Sep 18, 2025)
- Materials for learning SGLang (☆753, updated Jan 5, 2026)
- [ICLR 2025] PEARL: Parallel Speculative Decoding with Adaptive Draft Length (☆147, updated Dec 23, 2025)
- [ICLR 2025] DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads (☆524, updated Feb 10, 2025)
- [DAC'25] Official implementation of "HybriMoE: Hybrid CPU-GPU Scheduling and Cache Management for Efficient MoE Inference" (☆101, updated Dec 15, 2025)
- FlashInfer: Kernel Library for LLM Serving (☆5,057, updated this week)
- Distributed Compiler based on Triton for Parallel Systems (☆1,371, updated Feb 13, 2026)
- [MLSys'25] QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving; [MLSys'25] LServe: Efficient Long-sequence LLM Se… (☆816, updated Mar 6, 2025)
- https://bbuf.github.io/gpu-glossary-zh/ (☆26, updated Nov 7, 2025)
- ⚡️ Write HGEMM from scratch using Tensor Cores with the WMMA, MMA, and CuTe APIs, achieving peak performance ⚡️ (☆147, updated May 10, 2025)
- A Distributed Attention Towards Linear Scalability for Ultra-Long Context, Heterogeneous Data Training (☆650, updated this week)
- Awesome code, projects, books, etc. related to CUDA (☆31, updated Feb 3, 2026)
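
Two of the techniques named above are compact enough that a toy sketch helps. First, the roofline model (referenced from the hardware-comparison entry): attainable throughput is the minimum of the compute roof and memory bandwidth times arithmetic intensity. This is a minimal Python sketch; the peak-FLOP/s and bandwidth constants are assumed, A100-class placeholders, not figures from the linked repository.

```python
# Minimal roofline-model sketch.
# PEAK_FLOPS and PEAK_BW are assumed A100-class placeholders,
# not measurements from any repository listed above.
PEAK_FLOPS = 312e12  # assumed FP16 tensor-core peak, FLOP/s
PEAK_BW = 2.0e12     # assumed HBM bandwidth, bytes/s

def attainable_flops(arithmetic_intensity: float) -> float:
    """Roofline bound: min(compute roof, bandwidth * FLOP-per-byte)."""
    return min(PEAK_FLOPS, PEAK_BW * arithmetic_intensity)

# LLM decode is dominated by streaming the weights: a batched GEMV does
# 2*N*b FLOPs over 2*N bytes of FP16 weights, so intensity ~= b FLOP/byte.
for batch in (1, 16, 256):
    intensity = float(batch)
    print(f"batch={batch:>3}  intensity={intensity:6.1f} FLOP/B  "
          f"roof={attainable_flops(intensity) / 1e12:6.1f} TFLOP/s")
```

At batch 1 the bound is about 2 TFLOP/s (memory-bound); only near batch 256 does the kernel hit the compute roof, which is why small-batch decode is bandwidth-limited.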
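
Second, speculative decoding (referenced from the speculative-decoding entry): a cheap draft model proposes several tokens, and the target model verifies them in one pass, keeping the longest agreed prefix. Below is a toy greedy variant with hypothetical stand-in "models" (deterministic next-token functions over small integers); the actual repositories use sampling-based acceptance and batched verification.

```python
def speculative_decode(draft_next, target_next, prompt, k=4, max_new=16):
    """Toy greedy speculative decoding: propose k draft tokens, keep the
    longest prefix the target agrees with, then add one target token."""
    tokens = list(prompt)
    while len(tokens) - len(prompt) < max_new:
        # Draft model proposes k tokens autoregressively (cheap).
        proposal = []
        for _ in range(k):
            proposal.append(draft_next(tokens + proposal))
        # Target verifies each position (a single parallel pass in practice).
        accepted = 0
        for i in range(k):
            if target_next(tokens + proposal[:i]) == proposal[i]:
                accepted += 1
            else:
                break
        tokens += proposal[:accepted]
        # After a mismatch (or full acceptance) the target emits one token
        # itself, so each round progresses even if the draft is always wrong.
        tokens.append(target_next(tokens))
    return tokens

# Hypothetical toy "models": deterministic next-token rules over small ints.
def draft_next(seq):   # cheap draft model, right most of the time
    return (seq[-1] + 1) % 7

def target_next(seq):  # target model, treated as ground truth
    return (seq[-1] + 1) % 5

print(speculative_decode(draft_next, target_next, [0], k=4, max_new=10))
```

One property of the greedy variant worth noting: the output is identical to what target_next alone would produce; the draft only reduces how many target steps are needed.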