feifeibear / DPSKV3MFU
Estimate MFU for DeepSeekV3
☆16 · Updated last month
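MFU (Model FLOPs Utilization) is the fraction of the hardware's aggregate peak FLOPs that a run actually spends on model math. As a rough sketch of the quantity this repo estimates (illustrative only, not DPSKV3MFU's actual method; the throughput, GPU count, and peak-FLOPs figures below are assumptions), here is a minimal Python version using the common 6N-FLOPs-per-token training approximation with DeepSeek-V3's ~37B activated parameters:

```python
# Minimal MFU sketch -- illustrative assumptions, not DPSKV3MFU's method.

def estimate_mfu(tokens_per_second: float,
                 activated_params: float,
                 peak_flops_per_gpu: float,
                 num_gpus: int) -> float:
    """MFU = achieved model FLOPs/s divided by aggregate peak FLOPs/s.

    Uses the common 6*N FLOPs-per-token training approximation
    (2N forward + 4N backward); for an MoE model like DeepSeek-V3,
    N is the *activated* parameter count, not the total.
    """
    flops_per_token = 6 * activated_params          # model FLOPs per token
    achieved = tokens_per_second * flops_per_token  # achieved FLOPs/s
    peak = peak_flops_per_gpu * num_gpus            # aggregate peak FLOPs/s
    return achieved / peak

# Assumed numbers: 37e9 activated params (DeepSeek-V3), ~989 TFLOPS
# BF16 dense peak (H100/H800-class GPUs); the throughput is made up.
mfu = estimate_mfu(tokens_per_second=20_000,
                   activated_params=37e9,
                   peak_flops_per_gpu=989e12,
                   num_gpus=16)
print(f"Estimated MFU: {mfu:.1%}")  # ~28%
```

For inference-only estimates, the forward pass alone costs roughly 2N FLOPs per token, so the 6N factor drops to 2N.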
Alternatives and similar repositories for DPSKV3MFU:
Users interested in DPSKV3MFU are comparing it to the libraries listed below.
- Quantized Attention on GPU ☆34 · Updated 3 months ago
- Transformers components but in Triton ☆31 · Updated 3 months ago
- Odysseus: Playground of LLM Sequence Parallelism ☆64 · Updated 8 months ago
- PyTorch bindings for CUTLASS grouped GEMM ☆64 · Updated 3 months ago
- 16-fold memory access reduction with nearly no loss ☆77 · Updated this week
- Decoding Attention is specially optimized for multi-head attention (MHA) using CUDA cores for the decoding stage of LLM inference ☆29 · Updated 3 months ago
- 🔥 A minimal training framework for scaling FLA models ☆59 · Updated this week
- Sequence-level 1F1B schedule for LLMs ☆17 · Updated 8 months ago
- Vocabulary Parallelism ☆17 · Updated 3 months ago
- Implement Flash Attention using CuTe ☆69 · Updated 2 months ago
- A sparse attention kernel supporting mixed sparse patterns ☆133 · Updated last week
- GPTQ inference TVM kernel ☆38 · Updated 9 months ago
- Squeezed Attention: Accelerating Long Prompt LLM Inference ☆40 · Updated 3 months ago
- TileFusion is a highly efficient kernel template library designed to elevate the level of abstraction in CUDA C for processing tiles ☆56 · Updated this week
- A Suite for Parallel Inference of Diffusion Transformers (DiTs) on multi-GPU Clusters ☆39 · Updated 6 months ago
- Framework to reduce autotune overhead to zero for well-known deployments ☆61 · Updated 3 weeks ago
- [NeurIPS 2024] Efficient LLM Scheduling by Learning to Rank ☆39 · Updated 3 months ago
- [ICLR 2025] PEARL: Parallel Speculative Decoding with Adaptive Draft Length ☆39 · Updated this week
- Ouroboros: Speculative Decoding with Large Model Enhanced Drafting (EMNLP 2024 main) ☆84 · Updated 4 months ago
- [ICLR 2025] TidalDecode: Fast and Accurate LLM Decoding with Position Persistent Sparse Attention ☆27 · Updated this week
- [ICLR 2025] Breaking Throughput-Latency Trade-off for Long Sequences with Speculative Decoding ☆107 · Updated 2 months ago
- The official implementation of Ada-KV: Optimizing KV Cache Eviction by Adaptive Budget Allocation for Efficient LLM Inference ☆58 · Updated 3 weeks ago
- Source code for the paper "LongGenBench: Long-context Generation Benchmark" ☆14 · Updated 4 months ago
- The official implementation of the paper "SimLayerKV: A Simple Framework for Layer-Level KV Cache Reduction" ☆42 · Updated 4 months ago