tonyzhang617 / nomad-distLinks

☆37

Alternatives and similar repositories for nomad-dist

Users that are interested in nomad-dist are comparing it to the libraries listed below

Sorting:

microsoft / TileFusion
TileFusion is an experimental C++ macro kernel template library that elevates the abstraction level in CUDA C for tile processing.
☆93Updated last month
xxyux / SpInfer
SpInfer: Leveraging Low-Level Sparsity for Efficient Large Language Model Inference on GPUs
☆51Updated 4 months ago
apuaaChen / EVT_AE
Artifacts of EVT ASPLOS'24
☆26Updated last year
LeiWang1999 / Stream-k.tvm
☆19Updated 10 months ago
INT-FlashAttention2024 / INT-FlashAttention
☆80Updated 6 months ago
microsoft / cusync
☆27Updated last year
tile-ai / tilescale
Tile-based language built for AI computation across all scales
☆31Updated this week
microsoft / FractalTensor
FractalTensor is a programming framework that introduces a novel approach to organizing data in deep neural networks (DNNs) as a list of …
☆28Updated 7 months ago
lenLRX / AmpereSparseMatmul
study of Ampere' Sparse Matmul
☆18Updated 4 years ago
DD-DuDa / BitDecoding
A GPU-optimized system for efficient long-context LLMs decoding with low-bit KV cache.
☆56Updated this week
uwsampl / SparseTIR
SparseTIR: Sparse Tensor Compiler for Deep Learning
☆137Updated 2 years ago
AlibabaResearch / flash-llm
Flash-LLM: Enabling Cost-Effective and Highly-Efficient Large Generative Model Inference with Unstructured Sparsity
☆216Updated last year
IBM / triton-dejavu
Framework to reduce autotune overhead to zero for well known deployments.
☆79Updated 2 weeks ago
AutonomicPerfectionist / PipeInfer
PipeInfer: Accelerating LLM Inference using Asynchronous Pipelined Speculation
☆30Updated 8 months ago
microsoft / SparTA
☆150Updated last year
flexflow / flexflow-serve
FlexFlow Serve: Low-Latency, High-Performance LLM Serving
☆50Updated last week
ankan-ban / llama_cu_awq
llama INT4 cuda inference with AWQ
☆54Updated 6 months ago
mit-han-lab / tinychat-tutorial
☆72Updated 9 months ago
chhzh123 / ptc-tutorial
PyTorch compilation tutorial covering TorchScript, torch.fx, and Slapo
☆18Updated 2 years ago
UDC-GAC / venom
A Vectorized N:M Format for Unleashing the Power of Sparse Tensor Cores
☆52Updated last year
GATECH-EIC / ShiftAddLLM
ShiftAddLLM: Accelerating Pretrained LLMs via Post-Training Multiplication-Less Reparameterization
☆109Updated 9 months ago
VITA-Group / Q-Hitter
☆14Updated last year
efeslab / fiddler
[ICLR'25] Fast Inference of MoE Models with CPU-GPU Orchestration
☆225Updated 8 months ago
wangsiping97 / FastGEMV
High-speed GEMV kernels, at most 2.7x speedup compared to pytorch baseline.
☆113Updated last year
argonne-lcf / LLM-Inference-Bench
LLM-Inference-Bench
☆48Updated 3 weeks ago
IST-DASLab / qutlass
QuTLASS: CUTLASS-Powered Quantized BLAS for Deep Learning
☆57Updated 3 weeks ago
PKU-SEC-Lab / HybriMoE
[DAC'25] Official implement of "HybriMoE: Hybrid CPU-GPU Scheduling and Cache Management for Efficient MoE Inference"
☆64Updated last month
hao-ai-lab / MuxServe
☆67Updated last year
triton-lang / kernels
☆85Updated 9 months ago
ParCIS / Magicube
Magicube is a high-performance library for quantized sparse matrix operations (SpMM and SDDMM) of deep learning on Tensor Cores.
☆89Updated 2 years ago