microsoft / DeepSpeed-Kernels

☆57

Alternatives and similar repositories for DeepSpeed-Kernels:

Users who are interested in DeepSpeed-Kernels are comparing it to the libraries listed below.

IST-DASLab / Sparse-Marlin
Boosting 4-bit inference kernels with 2:4 Sparsity
☆64 Updated 4 months ago
tgale96 / grouped_gemm
PyTorch bindings for CUTLASS grouped GEMM.
☆58 Updated 2 months ago
pytorch-labs / tritonbench
Tritonbench is a collection of PyTorch custom operators with example inputs to measure their performance.
☆75 Updated this week
stanford-futuredata / stk
☆96 Updated 4 months ago
feifeibear / Odysseus-Transformer
Odysseus: Playground of LLM Sequence Parallelism
☆64 Updated 7 months ago
hpcaitech / TensorNVMe
A Python library that transfers PyTorch tensors between CPU and NVMe
☆102 Updated last month
wangsiping97 / FastGEMV
High-speed GEMV kernels, with up to 2.7x speedup compared to the PyTorch baseline.
☆93 Updated 6 months ago
cchan / tccl
Extensible collectives library in Triton
☆76 Updated 3 months ago
yifuwang / symm-mem-recipes
☆27 Updated 3 weeks ago
INT-FlashAttention2024 / INT-FlashAttention
☆55 Updated 3 months ago
triton-lang / kernels
☆64 Updated 2 months ago
AlibabaPAI / FLASHNN
☆78 Updated 4 months ago
usyd-fsalab / fp6_llm
Efficient GPU support for LLM inference with x-bit quantization (e.g., FP6, FP5).
☆230 Updated 2 months ago
ankan-ban / llama_cu_awq
Llama INT4 CUDA inference with AWQ
☆49 Updated 6 months ago
neuralmagic / compressed-tensors
A safetensors extension to efficiently store sparse quantized tensors on disk
☆64 Updated this week
AlibabaResearch / flash-llm
Flash-LLM: Enabling Cost-Effective and Highly-Efficient Large Generative Model Inference with Unstructured Sparsity
☆195 Updated last year
exists-forall / striped_attention
☆38 Updated last year
IBM / triton-dejavu
Framework to reduce autotune overhead to zero for well-known deployments.
☆57 Updated last month
LeiWang1999 / AutoGPTQ.tvm
GPTQ inference TVM kernel
☆38 Updated 8 months ago
RulinShao / LightSeq
Official repository for LightSeq: Sequence Level Parallelism for Distributed Training of Long Context Transformers
☆204 Updated 4 months ago
fanshiqing / grouped_gemm
PyTorch bindings for CUTLASS grouped GEMM.
☆84 Updated 2 weeks ago
tlc-pack / libflash_attn
Standalone Flash Attention v2 kernel without libtorch dependency
☆99 Updated 4 months ago
mobiusml / gemlite
Fast low-bit matmul kernels in Triton
☆187 Updated last week
tlc-pack / cutlass_fpA_intB_gemm
A standalone GEMM kernel for fp16 activation and quantized weight, extracted from FasterTransformer
☆87 Updated 10 months ago
pytorch-labs / applied-ai
Applied AI experiments and examples for PyTorch
☆211 Updated this week
ModelTC / awesome-lm-system
Summary of system papers, frameworks, code, and tools for training or serving large models
☆56 Updated last year
facebookresearch / MODel_opt
Memory Optimizations for Deep Learning (ICML 2023)
☆62 Updated 10 months ago
facebookresearch / fairring
Fairring (FAIR + Herring) is a plug-in for PyTorch that provides a process group for distributed training that outperforms NCCL at large …
☆63 Updated 2 years ago
Infini-AI-Lab / MagicDec
Breaking Throughput-Latency Trade-off for Long Sequences with Speculative Decoding
☆107 Updated last month
alibaba / easydist
Automated Parallelization System and Infrastructure for Multiple Ecosystems
☆76 Updated last month