fw-ai / llama-cuda-graph-example
Example of applying CUDA graphs to LLaMA-v2
☆12 · Updated 2 years ago
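As a rough illustration of what applying CUDA graphs to an LLM decode loop involves, here is a minimal PyTorch sketch (not the repository's actual code; the `torch.nn.Linear` stand-in and tensor shapes are placeholders): the forward pass is warmed up on a side stream, captured once into a `torch.cuda.CUDAGraph`, and then replayed each step, so per-step CPU kernel-launch overhead is paid only once.

```python
import torch

# Minimal sketch, assuming a CUDA-capable GPU and PyTorch's CUDA graph API
# (torch.cuda.CUDAGraph / torch.cuda.graph). The model below is a placeholder
# for a LLaMA-v2 decode step, not the fw-ai example code.
model = torch.nn.Linear(4096, 4096).cuda().eval()
static_input = torch.zeros(1, 4096, device="cuda")

# Warm up on a side stream so capture does not record one-time initialization.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s), torch.no_grad():
    for _ in range(3):
        model(static_input)
torch.cuda.current_stream().wait_stream(s)

# Capture: every kernel launched inside this block is recorded into the graph.
g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g), torch.no_grad():
    static_output = model(static_input)

# Replay: copy new data into the static input buffer, then relaunch the whole
# captured kernel sequence with a single call per decode step.
static_input.copy_(torch.randn(1, 4096, device="cuda"))
g.replay()
print(static_output.shape)  # results land in the static output buffer
```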
Alternatives and similar repositories for llama-cuda-graph-example
Users interested in llama-cuda-graph-example are comparing it to the libraries listed below.
- Triton-based Symmetric Memory operators and examples ☆32 · Updated this week
- Boosting 4-bit inference kernels with 2:4 Sparsity ☆83 · Updated last year
- ☆72 · Updated 6 months ago
- [ICLR 2025] Breaking Throughput-Latency Trade-off for Long Sequences with Speculative Decoding ☆130 · Updated 10 months ago
- ☆112 · Updated last year
- Extensible collectives library in Triton ☆89 · Updated 6 months ago
- Framework to reduce autotune overhead to zero for well-known deployments ☆84 · Updated 3 weeks ago
- How to ensure correctness and ship LLM-generated kernels in PyTorch ☆66 · Updated this week
- Repository for Sparse Finetuning of LLMs via a modified version of the MosaicML llmfoundry ☆42 · Updated last year
- Triton-based implementation of Sparse Mixture of Experts ☆244 · Updated last week
- ☆50 · Updated 4 months ago
- ☆129 · Updated 4 months ago
- Transformers components but in Triton ☆34 · Updated 5 months ago
- ☆65 · Updated 5 months ago
- ☆143 · Updated 7 months ago
- train with kittens! ☆62 · Updated 11 months ago
- Hydragen: High-Throughput LLM Inference with Shared Prefixes ☆42 · Updated last year
- ring-attention experiments ☆153 · Updated 11 months ago
- GEAR: An Efficient KV Cache Compression Recipe for Near-Lossless Generative Inference of LLM ☆168 · Updated last year
- A bunch of kernels that might make stuff slower 😉 ☆61 · Updated this week
- ☆27 · Updated last year
- Official implementation for Training LLMs with MXFP4 ☆96 · Updated 5 months ago
- ☆14 · Updated 3 months ago
- A safetensors extension to efficiently store sparse quantized tensors on disk ☆167 · Updated this week
- PyTorch bindings for CUTLASS grouped GEMM ☆124 · Updated 4 months ago
- Estimate MFU for DeepSeekV3 ☆25 · Updated 9 months ago
- ☆100 · Updated last month
- The source code of our work "Prepacking: A Simple Method for Fast Prefilling and Increased Throughput in Large Language Models" [AISTATS …] ☆60 · Updated last year
- KV cache compression for high-throughput LLM inference ☆141 · Updated 8 months ago
- QuIP quantization ☆59 · Updated last year