☆155, updated Mar 4, 2025
Alternatives and similar repositories for deepseekv2-profile
Users interested in deepseekv2-profile are comparing it to the libraries listed below
- High-performance RMSNorm implementation using SM core storage, i.e. registers and shared memory (☆30, updated Jan 22, 2026)
- ☆13, updated Jan 7, 2025
- A lightweight design for computation-communication overlap (☆223, updated Jan 20, 2026)
- ☆262, updated Jul 11, 2024
- Hydragen: High-Throughput LLM Inference with Shared Prefixes (☆48, updated May 10, 2024)
- ☆21, updated Aug 14, 2024
- Compare different hardware platforms via the Roofline Model for LLM inference tasks (☆120, updated Mar 13, 2024); see the roofline sketch after this list
- Flash attention tutorial written in Python, Triton, CUDA, and CUTLASS (☆488, updated Jan 20, 2026)
- Flash Attention implemented with CuTe (☆101, updated Dec 17, 2024)
- ☆65, updated Apr 26, 2025
- Dynamic Memory Management for Serving LLMs without PagedAttention (☆464, updated May 30, 2025)
- Expert-specialization MoE solution based on CUTLASS (☆27, updated Jan 19, 2026)
- Standalone Flash Attention v2 kernel without a libtorch dependency (☆114, updated Sep 10, 2024)
- FP16xINT4 LLM inference kernel that can achieve near-ideal ~4x speedups up to medium batch sizes of 16-32 tokens (☆1,025, updated Sep 4, 2024)
- Benchmark tests supporting the TiledCUDA library (☆18, updated Nov 19, 2024)
- FP8 flash attention implemented with the CUTLASS repository on the Ada architecture (☆79, updated Aug 12, 2024)
- ☆116, updated May 16, 2025
- PyTorch implementation of the Flash Spectral Transform Unit (☆21, updated Sep 19, 2024)
- Fast inference from large language models via speculative decoding (☆894, updated Aug 22, 2024); see the speculative-decoding sketch after this list
- ☆52, updated May 19, 2025
- DeeperGEMM: crazy optimized version (☆74, updated May 5, 2025)
- A fast communication-overlapping library for tensor/expert parallelism on GPUs (☆1,261, updated Aug 28, 2025)
- DeepSeek-V3/R1 inference performance simulator (☆179, updated Mar 27, 2025)
- 🐳 Efficient Triton implementations for "Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention" (☆969, updated Feb 5, 2026)
- USP: Unified (a.k.a. Hybrid, 2D) Sequence Parallel Attention for long-context Transformer model training and inference (☆644, updated Jan 15, 2026)
- Odysseus: Playground of LLM Sequence Parallelism (☆79, updated Jun 17, 2024)
- [ACL 2024] RelayAttention for Efficient Large Language Model Serving with Long System Prompts (☆40, updated Feb 29, 2024)
- Fastest kernels written from scratch (☆550, updated Sep 18, 2025)
- Materials for learning SGLang (☆753, updated Jan 5, 2026)
- [ICLR 2025] PEARL: Parallel Speculative Decoding with Adaptive Draft Length (☆147, updated Dec 23, 2025)
- [ICLR 2025] DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads (☆524, updated Feb 10, 2025)
- [DAC'25] Official implementation of "HybriMoE: Hybrid CPU-GPU Scheduling and Cache Management for Efficient MoE Inference" (☆101, updated Dec 15, 2025)
- FlashInfer: Kernel Library for LLM Serving (☆5,057, updated this week)
- Distributed Compiler based on Triton for Parallel Systems (☆1,371, updated Feb 13, 2026)
- [MLSys'25] QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving; [MLSys'25] LServe: Efficient Long-sequence LLM Se… (☆816, updated Mar 6, 2025)
- https://bbuf.github.io/gpu-glossary-zh/ (☆26, updated Nov 7, 2025)
- ⚡️ Write HGEMM from scratch using Tensor Cores with the WMMA, MMA, and CuTe APIs, achieving peak performance ⚡️ (☆147, updated May 10, 2025)
- A Distributed Attention Towards Linear Scalability for Ultra-Long Context, Heterogeneous Data Training (☆650, updated this week)
- Awesome code, projects, books, etc. related to CUDA (☆31, updated Feb 3, 2026)
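
Two of the techniques named above are compact enough that a toy sketch helps. First, the roofline model (referenced from the hardware-comparison entry): attainable throughput is the minimum of the compute roof and memory bandwidth times arithmetic intensity. This is a minimal Python sketch; the peak-FLOP/s and bandwidth constants are assumed, A100-class placeholders, not figures from the linked repository.

```python
# Minimal roofline-model sketch.
# PEAK_FLOPS and PEAK_BW are assumed A100-class placeholders,
# not measurements from any repository listed above.
PEAK_FLOPS = 312e12  # assumed FP16 tensor-core peak, FLOP/s
PEAK_BW = 2.0e12     # assumed HBM bandwidth, bytes/s

def attainable_flops(arithmetic_intensity: float) -> float:
    """Roofline bound: min(compute roof, bandwidth * FLOP-per-byte)."""
    return min(PEAK_FLOPS, PEAK_BW * arithmetic_intensity)

# LLM decode is dominated by streaming the weights: a batched GEMV does
# 2*N*b FLOPs over 2*N bytes of FP16 weights, so intensity ~= b FLOP/byte.
for batch in (1, 16, 256):
    intensity = float(batch)
    print(f"batch={batch:>3}  intensity={intensity:6.1f} FLOP/B  "
          f"roof={attainable_flops(intensity) / 1e12:6.1f} TFLOP/s")
```

At batch 1 the bound is about 2 TFLOP/s (memory-bound); only near batch 256 does the kernel hit the compute roof, which is why small-batch decode is bandwidth-limited.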
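
Second, speculative decoding (referenced from the speculative-decoding entry): a cheap draft model proposes several tokens, and the target model verifies them in one pass, keeping the longest agreed prefix. Below is a toy greedy variant with hypothetical stand-in "models" (deterministic next-token functions over small integers); the actual repositories use sampling-based acceptance and batched verification.

```python
def speculative_decode(draft_next, target_next, prompt, k=4, max_new=16):
    """Toy greedy speculative decoding: propose k draft tokens, keep the
    longest prefix the target agrees with, then add one target token."""
    tokens = list(prompt)
    while len(tokens) - len(prompt) < max_new:
        # Draft model proposes k tokens autoregressively (cheap).
        proposal = []
        for _ in range(k):
            proposal.append(draft_next(tokens + proposal))
        # Target verifies each position (a single parallel pass in practice).
        accepted = 0
        for i in range(k):
            if target_next(tokens + proposal[:i]) == proposal[i]:
                accepted += 1
            else:
                break
        tokens += proposal[:accepted]
        # After a mismatch (or full acceptance) the target emits one token
        # itself, so each round progresses even if the draft is always wrong.
        tokens.append(target_next(tokens))
    return tokens

# Hypothetical toy "models": deterministic next-token rules over small ints.
def draft_next(seq):   # cheap draft model, right most of the time
    return (seq[-1] + 1) % 7

def target_next(seq):  # target model, treated as ground truth
    return (seq[-1] + 1) % 5

print(speculative_decode(draft_next, target_next, [0], k=4, max_new=10))
```

One property of the greedy variant worth noting: the output is identical to what target_next alone would produce; the draft only reduces how many target steps are needed.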