tonyzhang617 / nomad-dist
☆19 · Updated 8 months ago
Related projects
Alternatives and complementary repositories for nomad-dist
- ☆46 · Updated 5 months ago
- LLM Inference analyzer for different hardware platforms ☆42 · Updated this week
- PyTorch compilation tutorial covering TorchScript, torch.fx, and Slapo ☆19 · Updated last year
- MAGIS: Memory Optimization via Coordinated Graph Transformation and Scheduling for DNN (ASPLOS'24) ☆44 · Updated 5 months ago
- ArkVale: Efficient Generative LLM Inference with Recallable Key-Value Eviction (NeurIPS'24) ☆17 · Updated last week
- ☆18 · Updated last month
- Stateful LLM Serving ☆38 · Updated 3 months ago
- ☆33 · Updated 5 months ago
- ☆51 · Updated last month
- Flash-LLM: Enabling Cost-Effective and Highly-Efficient Large Generative Model Inference with Unstructured Sparsity ☆181 · Updated last year
- ☆9 · Updated 5 months ago
- ☆13 · Updated last year
- Standalone Flash Attention v2 kernel without libtorch dependency ☆98 · Updated 2 months ago
- ☆31 · Updated last year
- ☆21 · Updated last year
- Llama INT4 CUDA inference with AWQ ☆48 · Updated 4 months ago
- ThrillerFlow is a Dataflow Analysis and Codegen Framework written in Rust. ☆11 · Updated last month
- ☆45 · Updated 2 weeks ago
- TidalDecode: Fast and Accurate LLM Decoding with Position Persistent Sparse Attention ☆21 · Updated last month
- Code for Palu: Compressing KV-Cache with Low-Rank Projection ☆57 · Updated this week
- Artifacts of EVT ASPLOS'24 ☆17 · Updated 8 months ago
- A standalone GEMM kernel for fp16 activation and quantized weight, extracted from FasterTransformer ☆85 · Updated 8 months ago
- An Attention Superoptimizer ☆20 · Updated 6 months ago
- High-performance Transformer implementation in C++. ☆82 · Updated 2 months ago
- PyTorch bindings for CUTLASS grouped GEMM. ☆53 · Updated 3 weeks ago
- Magicube is a high-performance library for quantized sparse matrix operations (SpMM and SDDMM) of deep learning on Tensor Cores. ☆81 · Updated last year
- GEAR: An Efficient KV Cache Compression Recipe for Near-Lossless Generative Inference of LLM ☆147 · Updated 4 months ago
- Artifact for OSDI'23: MGG: Accelerating Graph Neural Networks with Fine-grained intra-kernel Communication-Computation Pipelining on Mult… ☆37 · Updated 8 months ago
- PyTorch library for cost-effective, fast and easy serving of MoE models. ☆103 · Updated 3 months ago
- [OSDI'24] Serving LLM-based Applications Efficiently with Semantic Variable ☆114 · Updated 2 months ago