feifeibear / DPSKV3MFULinks

Estimate MFU for DeepSeekV3

☆26

Alternatives and similar repositories for DPSKV3MFU

Users that are interested in DPSKV3MFU are comparing it to the libraries listed below

Sorting:

flashinfer-ai / cutlass-viz
☆65Updated 6 months ago
tile-ai / AttentionEngine
☆50Updated 5 months ago
tgale96 / grouped_gemm
PyTorch bindings for CUTLASS grouped GEMM.
☆125Updated 4 months ago
thunlp / TritonBench
TritonBench: Benchmarking Large Language Model Capabilities for Generating Triton Operators
☆87Updated 4 months ago
zhuzilin / flash-attention-with-sink
☆39Updated 2 months ago
ByteDance-Seed / FlexPrefill
Code for paper: [ICLR2025 Oral] FlexPrefill: A Context-Aware Sparse Attention Mechanism for Efficient Long-Sequence Inference
☆147Updated 2 weeks ago
fzyzcjy / torch_utils
Utility scripts for PyTorch (e.g. Memory profiler that understands more low-level allocations such as NCCL)
☆62Updated last month
andy-yang-1 / DoubleSparse
16-fold memory access reduction with nearly no loss
☆105Updated 7 months ago
DerrickYLJ / TidalDecode
[ICLR 2025] TidalDecode: A Fast and Accurate LLM Decoding with Position Persistent Sparse Attention
☆48Updated 2 months ago
CalvinXKY / mfu_calculation
A simple calculation for LLM MFU.
☆48Updated last month
cat538 / SKVQ
[COLM 2024] SKVQ: Sliding-window Key and Value Cache Quantization for Large Language Models
☆24Updated last year
ByteDance-Seed / cudaLLM
☆120Updated 2 months ago
li-plus / flash-preference
Accelerate LLM preference tuning via prefix sharing with a single line of code
☆46Updated 3 months ago
ademeure / DeeperGEMM
DeeperGEMM: crazy optimized version
☆72Updated 5 months ago
opengear-project / GEAR
GEAR: An Efficient KV Cache Compression Recipefor Near-Lossless Generative Inference of LLM
☆169Updated last year
microsoft / chunk-attention
☆78Updated 6 months ago
Dao-AILab / grouped-latent-attention
☆130Updated 4 months ago
d-matrix-ai / keyformer-llm
☆59Updated last year
Infini-AI-Lab / MagicDec
[ICLR2025] Breaking Throughput-Latency Trade-off for Long Sequences with Speculative Decoding
☆130Updated 10 months ago
IBM / triton-dejavu
Framework to reduce autotune overhead to zero for well known deployments.
☆84Updated last month
madsys-dev / deepseekv2-profile
☆148Updated 7 months ago
hao-ai-lab / vllm-ltr
[NeurIPS 2024] Efficient LLM Scheduling by Learning to Rank
☆60Updated 11 months ago
smart-lty / ParallelSpeculativeDecoding
[ICLR 2025] PEARL: Parallel Speculative Decoding with Adaptive Draft Length
☆120Updated 6 months ago
feifeibear / ChituAttention
Quantized Attention on GPU
☆44Updated 11 months ago
tsinghua-ideal / Twilight
[NeurIPS'25 Spotlight] Adaptive Attention Sparsity with Hierarchical Top-p Pruning
☆20Updated 8 months ago
KuangjuX / AttnLink
An experimental communicating attention kernel based on DeepEP.
☆34Updated 2 months ago
InternLM / turbomind
☆97Updated 7 months ago
Jingyu6 / speculative_prefill
☆44Updated 5 months ago
feifeibear / Odysseus-Transformer
Odysseus: Playground of LLM Sequence Parallelism
☆78Updated last year
InternLM / Awesome-LLM-Training-System
☆43Updated last year