stanford-futuredata / stk

☆112

Alternatives and similar repositories for stk

Users who are interested in stk compare it to the libraries listed below.

tgale96 / grouped_gemm
PyTorch bindings for CUTLASS grouped GEMM.
☆124 Updated 4 months ago
shawntan / scattermoe
Triton-based implementation of Sparse Mixture of Experts.
☆246 Updated 3 weeks ago
RulinShao / LightSeq
Official repository for DistFlashAttn: Distributed Memory-efficient Attention for Long-context LLMs Training
☆216 Updated last year
meta-pytorch / tritonbench
Tritonbench is a collection of PyTorch custom operators with example inputs to measure their performance.
☆264 Updated this week
cchan / tccl
Extensible collectives library in Triton
☆89 Updated 6 months ago
IST-DASLab / Sparse-Marlin
Boosting 4-bit inference kernels with 2:4 Sparsity
☆83 Updated last year
triton-lang / kernels
☆92 Updated 11 months ago
jundaf2 / INT8-Flash-Attention-FMHA-Quantization
☆158 Updated 2 years ago
zinccat / Awesome-Triton-Kernels
Collection of kernels written in Triton language
☆157 Updated 6 months ago
hpcaitech / TensorNVMe
A Python library that transfers PyTorch tensors between CPU and NVMe
☆120 Updated 10 months ago
exists-forall / striped_attention
☆41 Updated last year
meta-pytorch / applied-ai
Applied AI experiments and examples for PyTorch
☆299 Updated 2 months ago
usyd-fsalab / fp6_llm
Efficient GPU support for LLM inference with x-bit quantization (e.g. FP6, FP5).
☆265 Updated 3 months ago
feifeibear / Odysseus-Transformer
Odysseus: Playground of LLM Sequence Parallelism
☆78 Updated last year
meta-pytorch / float8_experimental
This repository contains the experimental PyTorch native float8 training UX
☆223 Updated last year
AlibabaResearch / flash-llm
Flash-LLM: Enabling Cost-Effective and Highly-Efficient Large Generative Model Inference with Unstructured Sparsity
☆221 Updated 2 years ago
foundation-model-stack / foundation-model-stack
🚀 Collection of components for development, training, tuning, and inference of foundation models leveraging PyTorch native components.
☆215 Updated last week
efeslab / Atom
[MLSys'24] Atom: Low-bit Quantization for Efficient and Accurate LLM Serving
☆320 Updated last year
deepspeedai / DeepSpeed-Kernels
☆72 Updated 6 months ago
Dao-AILab / grouped-latent-attention
☆130 Updated 4 months ago
opengear-project / GEAR
GEAR: An Efficient KV Cache Compression Recipe for Near-Lossless Generative Inference of LLM
☆169 Updated last year
wangsiping97 / FastGEMV
High-speed GEMV kernels, with up to 2.7x speedup compared to the PyTorch baseline.
☆116 Updated last year
gpu-mode / ring-attention
ring-attention experiments
☆154 Updated last year
fanshiqing / grouped_gemm
PyTorch bindings for CUTLASS grouped GEMM.
☆154 Updated last week
IBM / triton-dejavu
Framework to reduce autotune overhead to zero for well known deployments.
☆84 Updated last month
gpu-mode / triton-index
Cataloging released Triton kernels.
☆263 Updated last month
Dao-AILab / fast-hadamard-transform
Fast Hadamard transform in CUDA, with a PyTorch interface
☆248 Updated this week
open-lm-engine / flash-model-architectures
A bunch of kernels that might make stuff slower 😉
☆62 Updated this week
FasterDecoding / TEAL
☆145 Updated 8 months ago
mobiusml / gemlite
Fast low-bit matmul kernels in Triton
☆381 Updated 3 weeks ago