osayamenja / Kleos

Complete GPU residency for ML.

☆43

Alternatives and similar repositories for Kleos

Users who are interested in Kleos are comparing it to the repositories listed below.

microsoft / SparTA
☆150 Updated last year
KuangjuX / NVSHMEM-Tutorial
NVSHMEM‑Tutorial: Build a DeepEP‑like GPU Buffer
☆100 Updated this week
yifuwang / symm-mem-recipes
☆116 Updated 8 months ago
infinigence / FlashOverlap
A lightweight design for computation-communication overlap.
☆167 Updated last week
facebookexperimental / triton
GitHub mirror of the triton-lang/triton repo.
☆73 Updated this week
parasailteam / coconet
☆82 Updated 2 years ago
ParCIS / Chimera
Chimera: bidirectional pipeline parallelism for efficiently training large-scale models.
☆66 Updated 6 months ago
zhuohan123 / terapipe
☆75 Updated 4 years ago
UDC-GAC / venom
A Vectorized N:M Format for Unleashing the Power of Sparse Tensor Cores
☆53 Updated last year
microsoft / nnscaler
nnScaler: Compiling DNN models for Parallel Training
☆118 Updated 2 weeks ago
humuyan / Korch
ASPLOS'24: Optimal Kernel Orchestration for Tensor Programs with Korch
☆38 Updated 5 months ago
apuaaChen / EVT_AE
Artifacts of EVT ASPLOS'24
☆26 Updated last year
thu-pacman / PET
PET: Optimizing Tensor Programs with Partially Equivalent Transformations and Automated Corrections
☆122 Updated 3 years ago
nox-410 / tvm.tl
An extension of TVMScript for writing simple, high-performance GPU kernels with Tensor Cores.
☆51 Updated last year
ColfaxResearch / cutlass-kernels
☆233 Updated last year
ColfaxResearch / cfx-article-src
☆139 Updated 4 months ago
xxyux / SpInfer
SpInfer: Leveraging Low-Level Sparsity for Efficient Large Language Model Inference on GPUs
☆57 Updated 5 months ago
uwsampl / SparseTIR
SparseTIR: Sparse Tensor Compiler for Deep Learning
☆138 Updated 2 years ago
HPMLL / NVIDIA-Hopper-Benchmark
☆57 Updated 3 months ago
AlibabaResearch / flash-llm
Flash-LLM: Enabling Cost-Effective and Highly-Efficient Large Generative Model Inference with Unstructured Sparsity
☆220 Updated last year
ademeure / DeeperGEMM
DeeperGEMM: crazy optimized version
☆70 Updated 4 months ago
DD-DuDa / BitDecoding
A GPU-optimized system for efficient long-context LLM decoding with a low-bit KV cache.
☆59 Updated 2 weeks ago
UofT-EcoSystem / DietCode
DietCode Code Release
☆65 Updated 3 years ago
tile-ai / tilescale
Tile-based language built for AI computation across all scales
☆57 Updated last week
shenh10 / DeepSeek_Simulator
☆85 Updated 5 months ago
ranggihwang / Pregated_MoE
☆52 Updated last year
LeiWang1999 / tvm_gpu_gemm
Play with GEMM in TVM
☆91 Updated 2 years ago
TiledTensor / TiledCUDA
We invite you to visit and follow our new repository at https://github.com/microsoft/TileFusion. TiledCUDA is a highly efficient kernel …
☆185 Updated 7 months ago
mutinifni / splitwise-sim
LLM serving cluster simulator
☆110 Updated last year
apache / tvm-ffi
TVM FFI
☆46 Updated this week